7.8.12. PID Calibration Weights#

The PID calibration weights were introduced in BELLE2-NOTE-TE-2021-27 as a novel means of improving PID performance without the use of low-level variables.

The gist of the method is to replace the straight sum of detector log-likelihoods with a weighted sum, where the weights are trained using standard machine learning techniques.

To facilitate this, we provide a script for training these weights as well as a number of methods for preparing data samples for training and for analyzing the resulting performance.

Preparing Data Samples for Training#

In order to train weights on a data sample, we first need to prepare the data by saving out only the information necessary for training. This is done with several methods that are provided in the pidDataUtils module.

  1. Read your data into a DataFrame. We provide a read_root() method to do this (it is just a wrapper around the uproot.concatenate method, which can read several files and automatically concatenate them together). You can use whatever method you like, though. All that is required is that your DataFrame contains momentum (‘p’), cosine-theta (‘cosTheta’), phi (‘phi’), and detector log-likelihood data for any particles of interest.

  2. For each particle type of interest, identify the tag that serves as a prefix in the column names, and use the make_h5() method to make a slim H5 file containing only the track and log-likelihood information for this particle type. For example, in a D* analysis, the tag for kaons is ‘DST_D0_K’, and for pions the tags are ‘DST_D0_pi’ and ‘DST_pi’. There are two nuances to be aware of, however.

    1. If your DataFrame is from simulation and contains the ‘mcPDG’ column, you can provide the argument pdg=None and the ‘pdg’ column in the HDF5 file will be filled with the ‘mcPDG’ data. However, if you’d prefer to use the kinematic tags or you don’t have ‘mcPDG’ data, provide the PDG code for the particle type you’re extracting. (e.g. pdg=321 for kaons, pdg=211 for pions, etc.)

    2. This method assumes that the detector log-likelihood columns are formatted as f"pidLogLikelyhoodOf{pdg}From{detector}". If the log-likelihood columns in your dataset are named differently, please provide a function to the column argument that yields these column names. The function should take two arguments: the particle name (‘e’, ‘mu’, ‘pi’, ‘K’, ‘p’, ‘d’) and the detector name (‘SVD’, ‘CDC’, ‘TOP’, ‘ARICH’, ‘ECL’, ‘KLM’). It should return the corresponding detector log-likelihood column name as a string; a short sketch of such a function is given after the example snippet below.

  3. Once slim H5 files are made for each particle type, merge them together using the merge_h5s() method. This method has a pdgs argument. If pdgs=None, the method will simply use the values from the ‘pdg’ columns in the given H5 files. One can also give a list of PDG values, one per filename, to be used instead. (If you set the PDG values in the last step, this shouldn’t be necessary. If you filled the columns with the mcPDG values, however, and at this stage would prefer to use kinematic tags, this is how you could achieve it.)

  4. Use the split_h5() method to finalize preparation by splitting the slim H5 file into train, validation, and test sets. (Or only train and validation sets, if test_size=0.) Note that the arguments train_size, val_size, and test_size need not sum to 1; the sizes will be renormalized if not.

Here is an example code snippet which we used to do this for a D* dataset.

import pidDataUtils as pdu

# Step 1: read the ROOT files into a single DataFrame
df = pdu.read_root(['dstar_1.root', 'dstar_2.root'])

# Step 2: write one slim H5 file per particle type
pdu.make_h5(df, 'DST_D0_K', 'slim_dstar_K.h5', pdg=321)
pdu.make_h5(df, ['DST_D0_pi', 'DST_pi'], 'slim_dstar_pi.h5', pdg=211)

# Steps 3 and 4: merge the slim files, then split into train, validation, and test sets
pdu.merge_h5s(['slim_dstar_K.h5', 'slim_dstar_pi.h5'], 'slim_dstar.h5')
pdu.split_h5('slim_dstar.h5', 'slim_dstar/')
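
If your log-likelihood columns do not follow the default naming scheme (see the second nuance above), a minimal sketch of a custom column function might look like the following; the naming pattern here is purely hypothetical.

def my_column(particle, detector):
    # particle is one of 'e', 'mu', 'pi', 'K', 'p', 'd';
    # detector is one of 'SVD', 'CDC', 'TOP', 'ARICH', 'ECL', 'KLM'.
    # Hypothetical scheme: log-likelihood columns named "{particle}_{detector}_loglik".
    return f'{particle}_{detector}_loglik'

pdu.make_h5(df, 'DST_D0_K', 'slim_dstar_K.h5', pdg=321, column=my_column)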

Training the Weights#

To train the weights, we provide a script: pidTrainWeights.py. It can be run using

python3 pidTrainWeights.py data/ models/net.pt -n 100

In this snippet, we assume that split_h5() has been run and the output files were written into the folder data/. We then specify that we want our final model to be saved to models/net.pt and that we want the model to be trained for 100 epochs. Note that the six-by-six array of weights on its own will also be written to models/net_wgt.npy.

Note

The output filename needs to end in .pt or else the script will fail when attempting to write the weights to a .npy file.

To train a model on the training dataset in data/ but only over events within a certain interval in momentum and theta, one can specify the limits at the command-line. For example, to train only on events with momentum between 0.5 and 1.5 GeV and theta between 15 and 45 degrees, one could add the arguments --p_lims 0.5 1.5 --theta_lims 15 45.
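
Put together, such an invocation might look like the following (the output filename is just illustrative):

python3 pidTrainWeights.py data/ models/net_slice.pt -n 100 --p_lims 0.5 1.5 --theta_lims 15 45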

One can also start training from an existing checkpoint with this script by using the --resume argument. There are three cases to consider with this argument.

  1. Omitting --resume means that a fresh set of weights will be trained and written to the output filepath, overwriting any existing model at that location.

  2. Including only --resume (as a flag) means that the network at the output filepath will be loaded, trained, and saved again, overwriting the model at the output location.

  3. Including --resume path/to/existing/model.pt allows one to load and train the network at the given filepath, but the final model is saved to the output location (so the existing model is not overwritten).
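
For concreteness, the three cases correspond to invocations like these:

# 1. fresh training; overwrites models/net.pt if it already exists
python3 pidTrainWeights.py data/ models/net.pt -n 100

# 2. resume training from the model stored at models/net.pt, overwriting it when done
python3 pidTrainWeights.py data/ models/net.pt -n 100 --resume

# 3. start from an existing checkpoint, but save the result to models/net.pt
python3 pidTrainWeights.py data/ models/net.pt -n 100 --resume path/to/existing/model.pt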

A fresh, new network is initialized with all weights equal to 1 by default. One can instead start the network with weights sampled from a normal distribution of mean 1 and width 0.5 if desired by using the --random flag.

Lastly, one can limit the allowed particle types using the --only flag. For example, if we are studying D* decays and only care about pions and kaons, and therefore want the non-{pi, K} particle types to have zero weight in the PID computations, we can specify --only 211 321 to zero and freeze the weights for the other hypotheses. Not specifying --only means that all particle types will be used.
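
For example, a training run restricted to pions and kaons, starting from randomly initialized weights, might look like this (the output filename is again illustrative):

python3 pidTrainWeights.py data/ models/net_piK.pt -n 100 --only 211 321 --random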

Below, you can find the full documentation for this script.

usage: pidTrainWeights.py [-h] [-n N_EPOCHS] [--p_lims P_LIMS P_LIMS]
                          [--theta_lims THETA_LIMS THETA_LIMS] [-R [RESUME]]
                          [--only [ONLY [ONLY ...]]] [--random] [--beta BETA]
                          input output

Required Arguments#

input

Path to folder with training files (in .npz format).

output

Output filename where model will be saved (should end in .pt).

Optional Arguments#

-n, --n_epochs

Number of epochs to train the network. Defaults to 500.

--p_lims

Lower and upper limits for momentum in GeV. Lower limit should be given first. Default values are -inf, +inf.

--theta_lims

Lower and upper limits for theta in degrees. Lower limit should be given first. Default values are -inf, +inf.

-R, --resume

Load a pre-existing model and resume training instead of starting a new one. If ‘--resume’ is used and no file is specified, the output filepath will be loaded, and the training will overwrite that file. Alternatively, if a filepath is given, training will begin from the state saved in that file and will be saved to the output filepath.

--only

Use only log-likelihood data from a subset of particle types specified by PDG code. If left unspecified, all particle types will be used.

--random

Initialize network weights to random values, normally distributed with mean of 1 and width of 0.5. Has no effect if ‘--resume’ is used.

--beta

Scaling factor for the pion binary cross entropy term in the loss function. Defaults to 0.1.

Applying Weights to Data for Performance Analysis#

We also provide, through pidDataUtils, methods for applying a set of trained weights to a data sample for analysis. Here’s what that workflow might look like.

  1. Read your data into a DataFrame. Much like when preparing data samples, you can use our read_root() method. We also provide read_h5() and read_npz(), which will read in the slim format H5 files or the train, validation, or test set npz files used for training.

  2. Use produce_analysis_df() to prepare the data. This method will apply weights to the detector log-likelihoods and create a number of additional, useful columns in the DataFrame. These columns include likelihood ratios, binary likelihood ratios, single-detector and ablation PID, contribution metrics, and more. It does, however, have several arguments and features to be aware of.

    1. The weights are specified with the weights keyword argument. This argument expects a six-by-six NumPy array (which could be obtained by using np.load() on the _wgt.npy file produced during training) or a dict, if there are individual weights for various momentum and theta bins. The dict should have int-tuples as keys and six-by-six NumPy arrays as values, where the int-tuple keys are (p_bin, theta_bin).

    2. Even if you are not using per-bin weights, you should still specify momentum and theta bins for the analysis. This is done by giving 1D NumPy arrays of bin edges to the p_bins and theta_bins keyword arguments. By default, the bins will be the standard Belle II systematics bins. Any events with momentum or theta outside of the bins will have their 'p_bin' or 'theta_bin' value set to -1. By default, these events are then cut from the DataFrame, but this can be disabled by setting drop_outside_bins=False.

    3. If there are certain particle types that you want to exclude entirely from PID computations, you can do so by setting a list of allowed particles with the allowed_particles keyword argument. This expects a list of particle names (not PDG codes).

    4. As with the make_h5() function, there is again a keyword argument column for the case where your DataFrame contains detector log-likelihoods that are named in a special way. For this function, it is assumed by default that the detector log-likelihoods are simply named f"{detector}_{particle}", since that is the format used when we read in the slim h5 or npz files. If you’re using a ROOT file that was read with read_root() and has columns in the format f"pidLogLikelyhoodOf{pdg}From{detector}", use column=root_column. Otherwise, write your own function and pass it here.

After using produce_analysis_df(), you should have a DataFrame with a number of additional columns, including labels, likelihood ratios, binary likelihood ratios, PID predictions, single-detector and ablation PID predictions, momentum and theta bins, and contribution metrics. Feel free to analyze these however you like; note, however, that the pidplots package expects DataFrames with exactly these columns and produces a wide range of plots from them.
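
As a rough sketch of this workflow (the filenames are hypothetical, and test.npz is assumed to be one of the files written by split_h5()):

import numpy as np
import pidDataUtils as pdu

# load one of the npz files produced by split_h5() and the trained weights
df = pdu.read_npz('slim_dstar/test.npz')
weights = np.load('models/net_wgt.npy')  # six-by-six array written by pidTrainWeights.py

# apply the weights and build the additional analysis columns
df = pdu.produce_analysis_df(df, weights=weights)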

PID calibration weights on the basf2 path#

The PID calibration weights can be registered in the conditions database so that they can be used in a basf2 path. The PIDCalibrationWeightCreator module can produce the dbobject PIDCalibrationWeight, identified by a unique name for the weight matrix. An example of the usage of the module can be found in analysis/examples/PIDCalibration/02_SamplePIDAnalysis.py.

Once the dbobject is loaded, basf2 variables such as weightedElectronID provide the weighted PID probability computed from the original likelihoods and the given weight matrix. The name of the weight matrix is specified as an argument of these variables.

import basf2 as b2
import modularAnalysis as ma

# create path
my_path = b2.create_path()

# load the local dbobject
localDB = 'localdb/database.txt'
b2.conditions.append_testing_payloads(localDB)
# or use the central global tag including the dbobject

# fill a pion candidate list
ma.fillParticleList('pi+:all', cut='', path=my_path)

# write both the standard and the weighted PID to an ntuple for comparison
matrixName = "PIDCalibrationWeight_Example"
ma.variablesToNtuple('pi+:all', ['pionID', f'weightedPionID({matrixName})'], path=my_path)
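
The weighted PID variables can be used anywhere a regular variable is accepted, for example in a selection. As a small, illustrative extension of the snippet above (the cut value is arbitrary):

# keep only candidates with a high weighted pion ID
ma.applyCuts('pi+:all', f'weightedPionID({matrixName}) > 0.9', path=my_path)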

pidDataUtils Functions#

pidDataUtils.apply_weights(df, weights, p_bins=array([0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4.5]), theta_bins=array([0.29670597, 0.48869219, 0.6981317, 1.04719755, 1.34390352, 1.67551608, 2.00712864, 2.32128791, 2.61799388]), column=<function _column>)[source]#

Applies the given weights to the log-likelihood data in the DataFrame.

Parameters
  • df (pandas.DataFrame) – DataFrame to which the weights are applied.

  • weights (dict[tuple(int), numpy.array()] or numpy.array()) – The calibration weight values. If a dict, keys should be a tuple of ints, and each value is the six-by-six array of weights for the bin. If a single np.array, should be a six-by-six array of weights to be applied globally.

  • p_bins (numpy.array()) – The edges of the momentum bins in GeV. Defaults to P_BINS, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5] GeV.

  • theta_bins (numpy.array()) – The edges of the theta bins in radians. Defaults to THETA_BINS, [17, 28, 40, 60, 77, 96, 115, 133, 150] degrees.

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
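
A minimal sketch of the two accepted weight formats (all filenames here are hypothetical):

import numpy as np
import pidDataUtils as pdu

df = pdu.read_h5('slim_dstar.h5')

# a single six-by-six matrix, applied to every event
pdu.apply_weights(df, np.load('models/net_wgt.npy'))

# or a dict of per-bin matrices, keyed by (p_bin, theta_bin) index tuples
per_bin_weights = {(p, t): np.load(f'weights/net_p{p}_t{t}_wgt.npy')
                   for p in range(7) for t in range(8)}
pdu.apply_weights(df, per_bin_weights)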

pidDataUtils.compute_abl_lrs(d, det, column=<function _column>)[source]#

Computes ablation likelihood ratios from the given DataFrame.

Parameters
  • d (pandas.DataFrame) – DataFrame containing the detector log-likelihoods.

  • det (str) – The name of the detector to be omitted for the ablation.

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.

Returns

The ablation likelihood ratios.

Return type

numpy.array()

pidDataUtils.compute_contrib(d, corr=True)[source]#

Computes the detector contributions.

Parameters
  • d (pandas.DataFrame) – DataFrame containing the likelihood ratio data.

  • corr (bool) – Whether to compute contribution to the likelihood ratio of the _correct_ hypothesis (True) or the _chosen_ hypothesis (False). Defaults to True.

Returns

The contributions of each detector.

Return type

dict[str, numpy.array()]

pidDataUtils.compute_det_lrs(d, det, column=<function _column>)[source]#

Computes single-detector likelihood ratios from the given DataFrame.

Parameters
  • d (pandas.DataFrame) – DataFrame containing the detector log-likelihoods.

  • det (str) – The name of the detector for which the single-detector likelihood ratios will be calculated.

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.

Returns

The detector likelihood ratios.

Return type

numpy.array()

pidDataUtils.cut_particles(df, allowed_particles, column=<function _column>)[source]#

Cuts the log-likelihood data associated with given particle types.

Parameters
  • df (pandas.DataFrame) – DataFrame to which the cuts will be applied.

  • allowed_particles (list(str)) – List of allowed particle types. Any particle types not present will be cut, unless the list is empty (in which case no cuts are applied).

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.

pidDataUtils.make_binary_lrs(df, column=<function _column>)[source]#

Makes binary likelihood ratio columns for each of the five non-pion particle type hypotheses in the given DataFrame.

Parameters
  • df (pandas.DataFrame) – DataFrame to which the columns will be added.

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.

pidDataUtils.make_bins(df, p_bins=array([0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4.5]), theta_bins=array([0.29670597, 0.48869219, 0.6981317, 1.04719755, 1.34390352, 1.67551608, 2.00712864, 2.32128791, 2.61799388]))[source]#

Make ‘p_bin’ and ‘theta_bin’ columns in the given DataFrame.

Parameters
  • df (pandas.DataFrame) – The DataFrame to add bins columns to.

  • p_bins (numpy.array()) – The edges of the momentum bins in GeV. Defaults to P_BINS, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5] GeV.

  • theta_bins (numpy.array()) – The edges of the theta bins in radians. Defaults to THETA_BINS, [17, 28, 40, 60, 77, 96, 115, 133, 150] degrees.

pidDataUtils.make_columns(df, p_bins=array([0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4.5]), theta_bins=array([0.29670597, 0.48869219, 0.6981317, 1.04719755, 1.34390352, 1.67551608, 2.00712864, 2.32128791, 2.61799388]), contrib_corr=True, column=<function _column>)[source]#

Makes all the additional columns for a given DataFrame.

Parameters
  • df (pandas.DataFrame) – DataFrame to which the columns will be added.

  • p_bins (numpy.array()) – The edges of the momentum bins in GeV. Defaults to P_BINS, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5] GeV.

  • theta_bins (numpy.array()) – The edges of the theta bins in radians. Defaults to THETA_BINS, [17, 28, 40, 60, 77, 96, 115, 133, 150] degrees.

  • contrib_corr (bool) – Whether to compute contribution to the likelihood ratio of the _correct_ hypothesis (True) or the _chosen_ hypothesis (False). Defaults to True.

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.

pidDataUtils.make_contrib(df, corr=True)[source]#

Makes columns for the detector contributions in the given DataFrame.

Parameters
  • df (pandas.DataFrame) – DataFrame to which the columns will be added.

  • corr (bool) – Whether to compute contribution to the likelihood ratio of the _correct_ hypothesis (True) or the _chosen_ hypothesis (False). Defaults to True.

pidDataUtils.make_h5(df, tags, out_filename, pdg=None, column=<function root_column>)[source]#

Make an HDF5 file in our ‘slim’ format from the given DataFrame.

Parameters
  • df (pandas.DataFrame) – The DataFrame containing the data.

  • tags (list(str) or str) – The particle tags used as a prefix for desired columns. e.g. for kaons in a D* decay, this is ‘DST_D0_K’. One or more can be given.

  • out_filename (str) – Output filename for the h5 file that will be written.

  • pdg (int or None) – The PDG code for the particles being extracted. If None, uses the values found in the ‘mcPDG’ column of the DataFrame. Defaults to None.

  • column – A function which, given the particle and detector names, returns the column name for the corresponding detector log-likelihood. Defaults to root_column, which assumes column names are of the format f’pidLogLikelyhoodOf{pdg}From{detector}’.

pidDataUtils.make_labels(df)[source]#

Make a ‘labels’ column for the DataFrame. The ‘labels’ column contains the particle type labels for each event: 0 for electron, 1 for muon, and so on.

Parameters

df (pandas.DataFrame) – DataFrame that the ‘labels’ column will be added to. Must not have NaN values in the ‘pdg’ column.

pidDataUtils.make_lrs(df, column=<function _column>)[source]#

Makes likelihood ratio columns for each of the six particle types in the given DataFrame.

Parameters
  • df (pandas.DataFrame) – DataFrame to which the columns will be added.

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.

pidDataUtils.make_pid(df)[source]#

Makes a ‘pid’ column in the given DataFrame. The ‘pid’ column is the predicted particle type. Requires likelihood ratio columns to exist.

Parameters

df (pandas.DataFrame) – DataFrame to which the ‘pid’ column will be added.

pidDataUtils.make_pid_abl(df, column=<function _column>)[source]#

Makes ablation PID columns for each of the detectors in the given DataFrame.

Parameters
  • df (pandas.DataFrame) – DataFrame to which the columns will be added.

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.

pidDataUtils.make_pid_det(df, column=<function _column>)[source]#

Makes single-detector PID columns for each of the detectors in the given DataFrame.

Parameters
  • df (pandas.DataFrame) – DataFrame to which the columns will be added.

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.

pidDataUtils.merge_h5s(filenames, out_filename, pdgs=None)[source]#

Merge several HDF5 files in our ‘slim’ format together.

Parameters
  • filenames (list(str)) – Filenames of HDF5 files to be merged.

  • out_filename (str) – Output filename.

  • pdgs (list(int)) – The PDG tags for the particle types, one per filename, to overwrite the ‘pdg’ columns in those files when merging. If None, simply uses the ‘pdg’ columns from the files. Defaults to None.

pidDataUtils.produce_analysis_df(df, compute_cols=True, drop_nans=True, drop_outside_bins=True, weights=None, allowed_particles=[], p_bins=array([0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4.5]), theta_bins=array([0.29670597, 0.48869219, 0.6981317, 1.04719755, 1.34390352, 1.67551608, 2.00712864, 2.32128791, 2.61799388]), column=None)[source]#

Prepares a DataFrame for PID analysis by applying weights, computing and adding additional columns, cutting NaNs, and more.

Parameters
  • df (pandas.DataFrame) – DataFrame to prepare for analysis.

  • compute_cols (bool) – Whether to compute and add additional columns. Defaults to True.

  • drop_nans (bool) – Whether to drop rows that contain NaNs. Defaults to True.

  • drop_outside_bins (bool) – Whether to drop rows for particles outside of the momentum and theta bins. Defaults to True.

  • weights (numpy.array()) – Calibration weights to be applied to the detector log-likelihoods. Defaults to None.

  • allowed_particles (list(str)) – If not empty, specifies the allowed particle types. Any not allowed particle types will be excluded from the PID calculations. If empty, all particle types are considered. Defaults to [].

  • p_bins (numpy.array()) – The edges of the momentum bins in GeV. Defaults to P_BINS, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5] GeV.

  • theta_bins (numpy.array()) – The edges of the theta bins in radians. Defaults to THETA_BINS, [17, 28, 40, 60, 77, 96, 115, 133, 150] degrees.

  • column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.

Returns

Return the prepared DataFrame. (Not all modifications in this method are in-place.)

Return type

pandas.DataFrame

pidDataUtils.read_h5(filename)[source]#

Read an HDF5 file in our ‘slim’ format into a DataFrame.

Parameters

filename (str) – Input filename.

Returns

DataFrame containing data.

Return type

pandas.DataFrame

pidDataUtils.read_npz(filename)[source]#

Read an npz file in our training format into a DataFrame.

Parameters

filename (str) – Input filename.

Returns

DataFrame containing data.

Return type

pandas.DataFrame

pidDataUtils.read_root(root_filenames)[source]#

Reads one or several ROOT datafiles into a DataFrame.

Parameters

root_filenames (list(str) or str) – If only one filename, can be given as a string. If more than one, should be given as a list or tuple.

Returns

DataFrame containing the data of the ROOT datafile(s).

Return type

pandas.DataFrame

pidDataUtils.root_column(particle, detector)[source]#

Column names for detector log-likelihoods found in our ROOT datafiles.

Parameters
  • particle (str) – particle name

  • detector (str) – detector name

Returns

Corresponding column name.

Return type

str

pidDataUtils.softmax(x)[source]#

Performs softmax calculation with corrections to help prevent overflow.

Note

This is the calculation used to convert log-likelihoods to likelihood ratios. Implementation following https://stackoverflow.com/a/67112412/18837571

Parameters

x (numpy.array()) – Data to be softmaxed. Softmax is calculated over the last dimension.

Returns

Softmaxed data.

Return type

numpy.array()
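
For instance, a single row of six (made-up) detector log-likelihood sums could be converted to likelihood ratios like this:

import numpy as np
import pidDataUtils as pdu

log_ls = np.array([[-23.1, -24.5, -18.2, -18.9, -30.0, -35.0]])  # e, mu, pi, K, p, d
ratios = pdu.softmax(log_ls)  # softmax over the last axis; each row sums to 1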

pidDataUtils.split_h5(filename, output_dir, train_size=0.8, val_size=0.1, test_size=0.1, shuffle=True, random_state=None)[source]#

Split the data in an HDF5 ‘slim’ format file into train, validation, and test sets, stored in .npz files for ease of weight training.

Parameters
  • filename (str) – Filename of HDF5 input file.

  • output_dir (str) – Name of output directory, in which the train, validation, and test sets will be written. Will be created if it does not already exist.

  • train_size (float) – Fraction of the dataset to use for training. Defaults to 0.8.

  • val_size (float) – Fraction of the dataset to use for validation. Defaults to 0.1.

  • test_size (float) – Fraction of the dataset to use for testing. Defaults to 0.1.

  • shuffle (bool) – Whether to shuffle the dataset before splitting. Defaults to True.

  • random_state (int or None) – Random state for the shuffling. Defaults to None.