SmartBKG: Selective background simulation using graph neural networks

13.3. SmartBKG: Selective background simulation using graph neural networks

SmartBKG uses a graph neural network with an attention mechanism to predict whether a generated event will pass the skim after detector simulation and reconstruction. Selection and weighting methods are then applied to choose events according to their scores while avoiding bias as far as possible. This saves the computing resources that would otherwise be spent on the steps between generation and skim for events that are later discarded.

For inference, a trained neural network is available that filters for events able to pass the FEI hadronic B0 skim; its parameters are saved in the global tag SmartBKG_GATGAP with the payload GATGAPgen.pth. The corresponding network, built with PyTorch, is stored in generators/scripts/smartBKG/models/gatgap.py, while generators/scripts/smartBKG/b2modules/NN_filter_module.py is a wrapper (basf2.module) that fits it into the basf2 framework. An example of how to generate skimmed events using SmartBKG can be found in generators/examples/SmartBKGEvtGen.py.

Note

Each event generated through SmartBKG should be reweighted with the inverse of the neural network output.
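
The effect of the reweighting can be illustrated with a minimal sketch. It assumes, which is not stated above, that an event is kept with a probability equal to its neural network score, so that a weight of 1/score compensates for the sampling. The function name and the scores are hypothetical; this is not the basf2 implementation.

```python
import random


def sample_and_weight(scores, seed=0):
    """Keep each event with probability equal to its score (assumption)
    and assign the inverse score as the event weight."""
    rng = random.Random(seed)
    kept = []
    for index, score in enumerate(scores):
        if rng.random() < score:
            kept.append((index, 1.0 / score))
    return kept


# Hypothetical NN outputs for 100000 generated events.
scores = [0.9, 0.5, 0.1, 0.8, 0.25] * 20000
kept = sample_and_weight(scores)

# The sum of weights estimates the number of generated events,
# although only a fraction of them was kept.
weighted_total = sum(weight for _, weight in kept)
print(len(scores), len(kept), round(weighted_total))
```

Under this assumption, low-score events are rarely kept but carry large weights, so weighted distributions remain unbiased on average.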

To train the neural network for a different skim, you can use the provided example as a guide. First refer to generators/examples/SmartBKGSkimFlag.py and substitute the FEI hadronic B0 skim with your custom skimming process.

# Create the mDST output file before skimming
mdst.add_mdst_output(
    path=main,
    filename=f'{out_dir}_submdst{job_id}.root',
)
# Arbitrary skimming process
your_skimming(main)
# Save the event number of each passing event as the flag for NN training
main.add_module(SaveFlag(f'{out_dir}_flag{job_id}.parquet'))

Next, execute the script generators/examples/SmartBKGDataProduction.py to prepare training data. You can choose between fast mode and advanced mode by setting the save_vars argument to False or True, respectively. In advanced mode, the expected particle lists and variables must be specified manually in the scripts generators/scripts/smartBKG/b2modules/NN_trainer_module/data_production.py and generators/scripts/smartBKG/__init__.py. The provided example demonstrates the advanced mode when save_vars is set to True (it is False by default), saving variables from the Y(4S) and B lists. In advanced mode, intermediate files are generated and removed automatically, whereas in fast mode each step is completed in the cache only, which reduces processing time and disk load. The modules are designed to handle continuum datasets as well.

Both flag generation and preprocessing can be seamlessly executed on the same node of the batch system by specifying file names and/or job IDs.

Note

A substantial number of passing events (O(10^5)) is necessary for effective training of the neural network.
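
As a rough planning aid, the number of events to generate follows from the target number of passing events and the retention rate of the skim. The sketch below uses a hypothetical retention rate chosen only for illustration; the helper name is not part of smartBKG.

```python
import math


def events_to_generate(target_passing, retention_rate):
    """Number of generated events needed to expect `target_passing`
    events after the skim, for a retention rate in (0, 1]."""
    if not 0.0 < retention_rate <= 1.0:
        raise ValueError("retention_rate must be in (0, 1]")
    return math.ceil(target_passing / retention_rate)


# Hypothetical retention rate of 5%; replace it with the rate of your skim.
n_generated = events_to_generate(target_passing=100_000, retention_rate=0.05)
print(n_generated)
```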

The neural network can be trained using generators/examples/SmartBKGTrain.py without the basf2 environment. This allows you to train or fine-tune the model locally on your own GPU, which speeds up the process.

To use your trained model locally, set the model_file parameter of the NNFilterModule. Alternatively, you can update the globaltag (refer to Section 27.2).

Modules in this project:

class smartBKG.b2modules.NN_trainer_module.SaveFlag(out_file=None)

Save event numbers to a Parquet file.

Parameters:

out_file (str) – Output file path for saving the event numbers.

Returns:

None

Note

This module should be added after the skimming process.
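
How the saved event numbers can serve as training labels may be sketched as follows. The sketch assumes that an event is labelled 1 if its event number appears in the flag file and 0 otherwise; the function name and the event numbers are hypothetical, and the actual preprocessing in smartBKG may differ.

```python
def make_labels(generated_event_numbers, passing_event_numbers):
    """Return 1 for each generated event that passed the skim, else 0."""
    passing = set(passing_event_numbers)  # set lookup is O(1) per event
    return [1 if number in passing else 0 for number in generated_event_numbers]


# Hypothetical event numbers: six generated events, two of which passed.
generated = [1, 2, 3, 4, 5, 6]
passing = [2, 5]
labels = make_labels(generated, passing)
print(labels)
```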

class smartBKG.b2modules.NN_trainer_module.TrainDataSaver(output_file, flag_file)

Save MCParticles to a pandas DataFrame.

Parameters:
  • output_file (str) – Filename for the training data. A name ending in parquet selects fast mode, which generates the final Parquet file for training. A name ending in h5 selects advanced mode, which produces a temporary h5 file for further preprocessing.

  • flag_file (str) – Filename of the flag file indicating passing events.

Returns:

None
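
The extension convention for output_file can be sketched with a small helper. The function name is hypothetical and only mirrors the rule stated above (parquet for fast mode, h5 for advanced mode); it is not part of smartBKG.

```python
from pathlib import Path


def mode_from_output_file(output_file):
    """Infer the processing mode from the file extension."""
    suffix = Path(output_file).suffix.lower()
    if suffix == ".parquet":
        return "fast"
    if suffix == ".h5":
        return "advanced"
    raise ValueError(f"unsupported extension for training data: {suffix!r}")


# Hypothetical file names for illustration.
print(mode_from_output_file("out/train_0.parquet"))
print(mode_from_output_file("out/tmp_0.h5"))
```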

class smartBKG.b2modules.NN_trainer_module.data_production(in_dir, out_dir, job_id, save_vars=None)

Process data for training and save it to a Parquet file. Two modes are provided:

  • Fast mode: save_vars is None. Produces the dataset with only the information necessary for the training.

  • Advanced mode: save_vars is a dictionary of event-level variables. Runs the hard-coded basf2 steering code in self.process_b2script to produce the required particle lists and save the required variables, which can be used for event-level cuts or for evaluating the NN performance.

Parameters:
  • in_dir (str) – Input directory.

  • out_dir (str) – Output directory.

  • job_id (int) – Job ID for batch processing.

  • save_vars (dict) – Event-level variables to save for different particles. Defaults to None, which selects fast mode. In the example script, it has the keys Y4S and B for the corresponding particle lists.

Returns:

None
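
A possible shape of the save_vars argument is sketched below. Only the keys Y4S and B come from the description above; the assumption that each key maps to a list of variable names, and the variable names themselves, are illustrative and should be checked against the example script.

```python
# Hypothetical save_vars dictionary: one list of variable names per particle list.
save_vars = {
    "Y4S": ["example_event_variable"],  # placeholder name, not a real variable
    "B": ["Mbc", "deltaE"],             # illustrative choice of B variables
}


def select_mode(save_vars):
    """Mirror the documented rule: None means fast mode, a dict advanced mode."""
    return "fast" if save_vars is None else "advanced"


print(select_mode(None))
print(select_mode(save_vars))
```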