PID Calibration Weights
Contents
7.8.12. PID Calibration Weights#
The PID calibration weights were introduced in BELLE2-NOTE-TE-2021-27 as a novel means for improving PID performance without low-level variables.
The gist of the method is to perform a weighted sum of detector log-likelihoods instead of a straight sum, with the weights to be trained using standard machine learning techniques.
To facilitate this, we provide a script for training these weights as well as a number of methods for preparing data samples for training and for analyzing the resulting performance.
Preparing Data Samples for Training#
In order to train weights on a data sample, we first need to prepare the data by
saving out only the information necessary for training. This is done with several
methods that are provided in the pidDataUtils
module.
Read your data into a DataFrame. We provide a
read_root()
method to do this (it is just a wrapper around the uproot.concatenate method, which can read several files and automatically concatenate them all together.) You can use whatever method you like, though. All that is required is that your DataFrame must contain momentum (‘p’), cosine-theta (‘cosTheta’), phi (‘phi’), and detector log-likelihood data for any particles of interest.For each particle type of interest, identify the tag that serves as a prefix in the column names, and use the
make_h5()
method to make a slim H5 file containing only the track and log-likelihood information for this particle type. For example, in a D* analysis, the tag for kaons is ‘DST_D0_K’, and for pions the tags are ‘DST_D0_pi’ and ‘DST_pi’. There are two nuances here to beware of, however.If your DataFrame is from simulation and contains the ‘mcPDG’ column, you can provide the argument
pdg=None
and the ‘pdg’ column in the HDF5 file will be filled with the ‘mcPDG’ data. However, if you’d prefer to use the kinematic tags or you don’t have ‘mcPDG’ data, provide the PDG code for the particle type you’re extracting. (e.g.pdg=321
for kaons,pdg=211
for pions, etc.)This method assumes that the detector log-likelihood columns are formatted as f”pidLogLikelyhoodOf{pdg}From{detector}”. If the log-likelihood columns in your dataset are named differently, please provide a function to the
column
argument to yield these column names. The function should take two arguments: the particle name (‘e’, ‘mu’, ‘pi’, ‘K’, ‘p’, ‘d’) and the detector name (‘SVD’, ‘CDC’, ‘TOP’, ‘ARICH’, ‘ECL’, ‘KLM’). It should return the corresponding detector log-likelihood column name as a string.
Once slim H5 files are made for each particle type, merge them together using the
merge_h5s()
method. This method has apdgs
argument. Ifpdgs=None
, the method will simply use the values from the ‘pdg’ columns in the given H5 files. One can also give a list of PDG values, one per filename, to be used instead. (If you set the PDG values in the last step, this shouldn’t be necessary. If you filled the columns with the mcPDG values, however, and at this stage would prefer to use kinematic tags, this is how you could achieve it.)Use the
split_h5()
method to finalize preparation by splitting the slim H5 file into train, validation, and test sets. (Or only train and validation sets, iftest_size=0
.) Note that the argumentstrain_size
,val_size
, andtest_size
need not sum to 1; the sizes will be renormalized if not.
Here is an example code snippet which we used to do this for a D* dataset.
import pidDataUtils as pdu
df = pdu.read_root(['dstar_1.root', 'dstar_2.root'])
pdu.make_h5(df, 'DST_D0_K', 'slim_dstar_K.h5', pdg=321)
pdu.make_h5(df, ['DST_D0_pi', 'DST_pi'], 'slim_dstar_pi.h5', pdg=211)
pdu.merge_h5s(['slim_dstar_K.h5', 'slim_dstar_pi.h5'], 'slim_dstar.h5')
pdu.split_h5('slim_dstar.h5', 'slim_dstar/')
Training the Weights#
To train the weights, we provide a script: pidTrainWeights.py
. It can be run
using
python3 pidTrainWeights.py data/ models/net.pt -n 100
In this snippet, we assume that split_h5()
has been run and the output files
were written into the folder data/
. We then specify that we want our final
model to be saved to models/net.pt
and that we want the model to be trained
for 100 epochs. Note that just the six-by-six array of weights will also be
written to models/net_wgt.npy
.
Note
The output filename needs to end in .pt
or else the script will fail
when attempting to write the weights to a .npy
file.
To train a model on the training dataset in data/
but only over events
within a certain interval in momentum and theta, one can specify the limits at
the command-line. For example, to train only on events with momentum between 0.5
and 1.5 GeV and theta between 15 and 45 degrees, one could add the arguments
--p_lims 0.5 1.5 --theta_lims 15 45
.
One can also start training from an existing checkpoint with this script by
using the --resume
argument. There are three cases to consider with this
argument.
Omitting
--resume
means that a fresh set of weights will be trained and written to the output filepath, overwriting any existing model at that location.Including only
--resume
(as a flag) means that the network at the output filepath will be loaded, trained, and saved again, overwriting the model at the output location.Including
--resume path/to/existing/model.pt
allows one to load and train the network at the given filepath, but the final model is saved to the output location (so the existing model is not overwritten).
A fresh, new network is initialized with all weights equal to 1 by default. One
can instead start the network with weights sampled from a normal distribution of
mean 1 and width 0.5 if desired by using the --random
flag.
Lastly, one can limit the allowed particle types using the --only
flag. For
example, if we are studying D* decays and only care about pions and kaons, and
therefore want the non-{pi, K} particle types to have zero weight in the PID
computations, we can specify --only 211 321
to zero and freeze the weights
for the other hypotheses. Not specifying --only
means that all particle
types will be used.
Below, you can find the full documentation for this script.
usage: pidTrainWeights.py [-h] [-n N_EPOCHS] [--p_lims P_LIMS P_LIMS]
[--theta_lims THETA_LIMS THETA_LIMS] [-R [RESUME]]
[--only [ONLY [ONLY ...]]] [--random] [--beta BETA]
input output
Required Arguments#
- input
Path to folder with training files (in .npz format).
- output
Output filename where model will be saved (should end in .pt).
Optional Arguments#
- -n, --n_epochs
Number of epochs to train the network. Defaults to 500.
- --p_lims
Lower and upper limits for momentum in GeV. Lower limit should be given first. Default values are -inf, +inf.
- --theta_lims
Lower and upper limits for theta in degrees. Lower limit should be given first. Default values are -inf, +inf.
- -R, --resume
Load a pre-existing model and resume training instead of starting a new one. If ‘–resume’ is used and no file is specified, the output filepath will be loaded, and the training will overwrite that file. Alternatively, if a filepath is given, training will begin from the state saved in that file and will be saved to the output filepath.
- --only
Use only log-likelihood data from a subset of particle types specified by PDG code. If left unspecified, all particle types will be used.
- --random
Initialize network weights to random values, normally distributed with mean of 1 and width of 0.5. Has no effect if ‘resume’ is used.
- --beta
Scaling factor for the pion binary cross entropy term in the loss function. Defaults to 0.1.
Applying Weights to Data for Performance Analysis#
We also, through pidDataUtils
provide methods for applying a set of
trained weights to a data sample for analysis. Here’s what that workflow might
look like.
Read your data into a DataFrame. Much like when preparing data samples, you can use our
read_root()
method. We also provideread_h5()
andread_npz()
, which will read in the slim format H5 files or the train, validation, or test setnpz
files used for training.Use
prepare_df()
to prepare the data. This method will apply weights to the detector log-likelihoods and create a number of additional, useful columns in the DataFrame. These columns include likelihood ratios, binary likelihood ratios, single-detector and ablation PID, contribution metrics, and more. It does, however, have several arguments and features to be aware of.The weights are specified with the
weights
keyword argument. This argument expects a six-by-six NumPy array (which could be obtained by usingnp.load()
on the_wgt.npy
file produced during training) or a dict, if there are individual weights for various momentum and theta bins. The dict should have int-tuples as keys and six-by-six NumPy arrays as values, where the int-tuple keys are(p_bin, theta_bin)
.Even if you are not using per-bin weights, you should still specify momentum and theta bins for the analysis. These are done by giving 1D NumPy arrays to the
p_bins
andtheta_bins
keyword arguments. By default, the bins will be the standard Belle II systematics bins. Any events with momentum or theta outside of the bins will have their'p_bin'
or'theta_bin'
value set to -1. By default, these events are then cut from the DataFrame, but this can be disabled by settingdrop_outside_bins=False
.If there are certain particle types that you want to exclude entirely from PID computations, you can do so by setting a list of allowed particles with the
allowed_particles
keyword argument. This expects a list of particle names (not PDG codes).As with the
make_h5()
function, there is again a keyword argumentcolumn
for the case where your DataFrame contains detector log-likelihoods that are named in a special way. For this function, it is assumed by default that the detector log-likelihoods are simply named f”{detector}_{particle}”, since that is the format used when we read in the slim h5 or npz files. If you’re using a ROOT file that was read withread_root()
and has columns in the format f”pidLogLikelyhoodOf{pdg}From{detector}”, usecolumn=root_column
. Otherwise, make your own function and give it here.
After using prepare_df()
, you should have a DataFrame with a number of
additional columns, including labels, likelihood ratios, binary likelihood
ratios, PID predictions, single-detector and ablation PID predictions, momentum
and theta bins, and contribution metrics. Feel free to analyze these however you
like; however, there exists a Python package that expects DataFrames with these
specific columns and produces a wide range of plots. It is pidplots
, which
can be found here.
PID calibration weights on the basf2 path#
The PID calibration weights can be registered in the database to utilize them on
the basf2 path. The module PIDCalibrationWeightCreator
can produce the
dbobject PIDCalibrationWeight with a unique name of the weight matrix.
One can find an example of the usage of the module in
analysis/examples/PIDCalibration/02_SamplePIDAnalysis.py
.
By loading the data object, the basf2 variables, such as
weightedElectronID
, provide the weighted PID probability from the
original likelihood and the given data object. One can specify the name of the
weight matrix in the argument of the variables.
import basf2 as b2
import modularAnalysis as ma
# create path
my_path = b2.create_path()
# load the local dbobject
localDB = 'localdb/database.txt'
b2.conditions.append_testing_payloads(localDB)
# or use the central global tag including the dbobject
ma.fillParticleList('pi+:all', cut='', path=my_path)
matrixName = "PIDCalibrationWeight_Example"
ma.variablesToNtuple('pi+:all', ['pionID', 'weightedPionID('+matrixName+')'], path=my_path)
pidDataUtils Functions#
- pidDataUtils.apply_weights(df, weights, p_bins=array([0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4.5]), theta_bins=array([0.29670597, 0.48869219, 0.6981317, 1.04719755, 1.34390352, 1.67551608, 2.00712864, 2.32128791, 2.61799388]), column=<function _column>)[source]#
Applies the given weights to the log-likelihood data in the DataFrame.
- Parameters
df (pandas.DataFrame) – DataFrame to which the weights are applied.
weights (dict[tuple(int),
numpy.array()
] ornumpy.array()
) – The calibration weight values. If a dict, keys should be a tuple of ints, and each value is the six-by-six array of weights for the bin. If a single np.array, should be a six-by-six array of weights to be applied globally.p_bins (
numpy.array()
) – The edges of the momentum bins in GeV. Defaults to P_BINS, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5] GeV.theta_bins (
numpy.array()
) – The edges of the theta bins in radians. Defaults to THETA_BINS, [17, 28, 40, 60, 77, 96, 115, 133, 150] degrees.column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- pidDataUtils.compute_abl_lrs(d, det, column=<function _column>)[source]#
Computes ablation likelihood ratios from the given DataFrame.
- Parameters
d (pandas.DataFrame) – DataFrame containing the detector log-likelihoods.
det (str) – The name of the detector to be omitted for the ablation.
column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- Returns
The ablation likelihood ratios.
- Return type
- pidDataUtils.compute_contrib(d, corr=True)[source]#
Computes the detector contributions.
- Parameters
d (pandas.DataFrame) – DataFrame containing the likelihood ratio data.
corr (bool) – Whether to compute contribution to the likelihood ratio of the _correct_ hypothesis (True) or the _chosen_ hypothesis (False). Defaults to True.
- Returns
The contributions of each detector.
- Return type
dict[str,
numpy.array()
]
- pidDataUtils.compute_det_lrs(d, det, column=<function _column>)[source]#
Computes single-detector likelihood ratios from the given DataFrame.
- Parameters
d (pandas.DataFrame) – DataFrame containing the detector log-likelihoods.
det (str) – The name of the detector for which the single-detector likelihood ratios will be calculated.
column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- Returns
The detector likelihood ratios.
- Return type
- pidDataUtils.cut_particles(df, allowed_particles, column=<function _column>)[source]#
Cuts the log-likelihood data associated with given particle types.
- Parameters
df (pandas.DataFrame) – DataFrame to which the cuts will be applied.
allowed_particles (list(str)) – List of allowed particle types. Any particle types not present will be cut, unless the list is empty (in which case no cuts are applied).
column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- pidDataUtils.make_binary_lrs(df, column=<function _column>)[source]#
Makes binary likelihood ratio columns for each of the five non-pion particle type hypotheses in the given DataFrame.
- Parameters
df (pandas.DataFrame) – DataFrame to which the columns will be added.
column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- pidDataUtils.make_bins(df, p_bins=array([0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4.5]), theta_bins=array([0.29670597, 0.48869219, 0.6981317, 1.04719755, 1.34390352, 1.67551608, 2.00712864, 2.32128791, 2.61799388]))[source]#
Make ‘p_bin’ and ‘theta_bin’ column in the given DataFrame.
- Parameters
df (pandas.DataFrame) – The DataFrame to add bins columns to.
p_bins (
numpy.array()
) – The edges of the momentum bins in GeV. Defaults to P_BINS, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5] GeV.theta_bins (
numpy.array()
) – The edges of the theta bins in radians. Defaults to THETA_BINS, [17, 28, 40, 60, 77, 96, 115, 133, 150] degrees.
- pidDataUtils.make_columns(df, p_bins=array([0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4.5]), theta_bins=array([0.29670597, 0.48869219, 0.6981317, 1.04719755, 1.34390352, 1.67551608, 2.00712864, 2.32128791, 2.61799388]), contrib_corr=True, column=<function _column>)[source]#
Makes all the additional columns for a given DataFrame.
- Parameters
df (pandas.DataFrame) – DataFrame to which the columns will be added.
p_bins (
numpy.array()
) – The edges of the momentum bins in GeV. Defaults to P_BINS, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5] GeV.theta_bins (
numpy.array()
) – The edges of the theta bins in radians. Defaults to THETA_BINS, [17, 28, 40, 60, 77, 96, 115, 133, 150] degrees.contrib_corr (bool) – Whether to compute contribution to the likelihood ratio of the _correct_ hypothesis (True) or the _chosen_ hypothesis (False). Defaults to True.
column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- pidDataUtils.make_contrib(df, corr=True)[source]#
Makes columns for the detector contributions in the given DataFrame.
- Parameters
df (pandas.DataFrame) – DataFrame to which the columns will be added.
corr (bool) – Whether to compute contribution to the likelihood ratio of the _correct_ hypothesis (True) or the _chosen_ hypothesis (False). Defaults to True.
- pidDataUtils.make_h5(df, tags, out_filename, pdg=None, column=<function root_column>)[source]#
Make an HDF5 file in our ‘slim’ format from the given DataFrame.
- Parameters
df (pandas.DataFrame) – The DataFrame containing the data.
tags (list(str) or str) – The particle tags used as a prefix for desired columns. e.g. for kaons in a D* decay, this is ‘DST_D0_K’. One or more can be given.
out_filename (str) – Output filename for the h5 file that will be written.
pdg (int or None) – The PDG code for the particles being extracted. If None, uses the values found in the ‘mcPDG’ column of the DataFrame. Defaults to None.
column – A function which, given the particle and detector names, returns the column name for the corresponding detector log-likelihood. Defaults to root_column, which assumes column names are of the format f’pidLogLikelyhoodOf{pdg}From{detector}’.
- pidDataUtils.make_labels(df)[source]#
Make a ‘labels’ column for the DataFrame. The ‘labels’ column contains the particle type labels for each event: 0 for electron, 1 for muon, and so on.
- Parameters
df (pandas.DataFrame) – DataFrame that ‘labels’ column will be added to. Must not have NaN values in the ‘pdg’ column.
- pidDataUtils.make_lrs(df, column=<function _column>)[source]#
Makes likelihood ratio columns for each of the six particle types in the given DataFrame.
- Parameters
df (pandas.DataFrame) – DataFrame to which the columns will be added.
column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- pidDataUtils.make_pid(df)[source]#
Makes a ‘pid’ column in the given DataFrame. The ‘pid’ column is the predicted particle type. Requires likelihood ratio columns to exist.
- Parameters
df (pandas.DataFrame) – DataFrame to which the ‘pid’ column will be added.
- pidDataUtils.make_pid_abl(df, column=<function _column>)[source]#
Makes ablation PID columns for each of the detectors in the given DataFrame.
- Parameters
df (pandas.DataFrame) – DataFrame to which the columns will be added.
column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- pidDataUtils.make_pid_det(df, column=<function _column>)[source]#
Makes single-detector PID columns for each of the detectors in the given DataFrame.
- Parameters
df (pandas.DataFrame) – DataFrame to which the columns will be added.
column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- pidDataUtils.merge_h5s(filenames, out_filename, pdgs=None)[source]#
Merge several HDF5 files in our ‘slim’ format together.
- Parameters
filenames (list(str)) – Filenames of HDF5 files to be merged.
out_filename (str) – Output filename.
pdgs (list(int)) – The PDG tags for the particle types, one per filename, to overwrite the ‘pdg’ columns in those files when merging. If None, simply uses the ‘pdg’ columns from the files. Defaults to None.
- pidDataUtils.produce_analysis_df(df, compute_cols=True, drop_nans=True, drop_outside_bins=True, weights=None, allowed_particles=[], p_bins=array([0.5, 1., 1.5, 2., 2.5, 3., 3.5, 4.5]), theta_bins=array([0.29670597, 0.48869219, 0.6981317, 1.04719755, 1.34390352, 1.67551608, 2.00712864, 2.32128791, 2.61799388]), column=None)[source]#
Prepares a DataFrame for PID analysis by applying weights, computing and adding additional columns, cutting NaNs, and more.
- Parameters
df (pandas.DataFrame) – DataFrame to prepare for analysis.
compute_cols (bool) – Whether to compute and add additional columns. Defaults to True.
drop_nans (bool) – Whether to drop rows that contain NaNs. Defaults to True.
drop_outside_bins (bool) – Whether to drop rows for particles outside of the momentum and theta bins. Defaults to True.
weights (
numpy.array()
) – Calibration weights to be applied to the detector log-likelihoods. Defaults to None.allowed_particles (list(str)) – If not empty, specifies the allowed particle types. Any not allowed particle types will be excluded from the PID calculations. If empty, all particle types are considered. Defaults to [].
p_bins (
numpy.array()
) – The edges of the momentum bins in GeV. Defaults to P_BINS, [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.5] GeV.theta_bins (
numpy.array()
) – The edges of the theta bins in radians. Defaults to THETA_BINS, [17, 28, 40, 60, 77, 96, 115, 133, 150] degrees.column – A function which given the particle and detector names returns the corresponding detector log-likelihood column name. Defaults to _column, which gives column names of the format f”{detector}_{particle}”.
- Returns
- Return the prepared DataFrame. (Not all modifications in
this method are in-place.)
- Return type
- pidDataUtils.read_h5(filename)[source]#
Read an HDF5 file in our ‘slim’ format into a DataFrame.
- Parameters
filename (str) – Input filename.
- Returns
DataFrame containing data.
- Return type
- pidDataUtils.read_npz(filename)[source]#
Read an npz file in our training format into a DataFrame.
- Parameters
filename (str) – Input filename.
- Returns
DataFrame containing data.
- Return type
- pidDataUtils.read_root(root_filenames)[source]#
Reads one or several ROOT datafiles into a DataFrame.
- pidDataUtils.root_column(particle, detector)[source]#
Column names for detector log-likelihoods found in our ROOT datafiles.
- pidDataUtils.softmax(x)[source]#
Performs softmax calculation with corrections to help prevent overflow.
Note
This is the calculation used to convert log-likelihoods to likelihood ratios. Implementation following https://stackoverflow.com/a/67112412/18837571
- Parameters
x (
numpy.array()
) – Data to be softmaxed. Softmax is calculated over the last dimension.- Returns
Softmaxed data.
- Return type
- pidDataUtils.split_h5(filename, output_dir, train_size=0.8, val_size=0.1, test_size=0.1, shuffle=True, random_state=None)[source]#
Split the data in an HDF5 ‘slim’ format file in train, validation, and test sets, stored in .npz files for ease of weight training.
- Parameters
filename (str) – Filename of HDF5 input file.
output_dir (str) – Name of output directory, in which the train, validation, and test sets will be written. Will be created if it does not already exist.
train_size (float) – Fraction of the dataset to use for training. Defaults to 0.8.
val_size (float) – Fraction of the dataset to use for validation. Defaults to 0.1.
test_size (float) – Fraction of the dataset to use for testing. Defaults to 0.1.
shuffle (bool) – Whether to shuffle the dataset before splitting. Defaults to True.
random_state (int or None) – Random state for the shuffling. Defaults to None.