cdc_and_svd_ckf_merger_mva_training
-----------------------------------

This python script is used for the training and validation of the classifier of
the MVA-based result filter of the CDCToSVDSeedCKF, which combines tracks that
were found by the CDC and SVD standalone tracking algorithms.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

The order of the b2luigi tasks in this script is as follows (top to bottom):

* Two tasks to create input samples for training and testing (``GenerateSimTask`` and
  ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes the total number of events to be
  generated and a number of events per task, which limits the runtime of each job. It then divides the total
  number of events by the number of events per task and creates as many ``GenerateSimTask`` as
  needed, each with a specific random seed, so that in the end the total number of
  training and testing events is simulated. The individual files are then combined
  by the ``SplitNMergeSimTask`` into one file each for training and testing.
* The ``ResultRecordingTask`` writes out the data used for training of the MVA.
* The ``CKFResultFilterTeacherTask`` trains the MVA, FastBDT by default, with a
  given set of FastBDT options.
* The ``ValidationAndOptimisationTask`` uses the trained weight files and cut values
  provided to run the tracking chain with the weight file under test, and also
  runs the tracking validation.
* Finally, the ``SummaryTask`` is the "brain" of the script. It invokes the
  ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
  and cut values on the MVA classifier output.
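
The splitting performed by the ``SplitNMergeSimTask`` can be sketched in plain Python. This is only a minimal sketch and not the task itself: the function name ``plan_generation_tasks`` is hypothetical, and covering a non-zero remainder with one additional, smaller task is an assumption about how leftover events are handled.

```python
def plan_generation_tasks(n_events, n_events_per_task, random_seed):
    """Return a list of (seed, n_events) pairs, one per simulation job.

    Hypothetical helper for illustration; the seed suffix mirrors the
    zero-padded index used by the steering file.
    """
    quotient, remainder = divmod(n_events, n_events_per_task)
    # One full-size job per complete chunk of events.
    plan = [(random_seed + "_" + str(i).zfill(3), n_events_per_task)
            for i in range(quotient)]
    # Assumption: leftover events go into one extra, smaller job.
    if remainder > 0:
        plan.append((random_seed + "_" + str(quotient).zfill(3), remainder))
    return plan


plan = plan_generation_tasks(1050, 100, "training")
```

Each job gets its own seed string, so the simulated samples are independent and can be merged afterwards.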

Because of these dependencies, the tasks are called in reverse order. The ``SummaryTask``
calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
values, and the ``ValidationAndOptimisationTask`` itself calls the required teacher,
training, and simulation tasks.
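
How the top-level task fans out into one validation request per parameter combination can be illustrated with ``itertools.product``. The option lists below are hypothetical placeholder values and the helper ``validation_requests`` is not part of the script; only the pattern of combining experiment numbers, FastBDT options and cut values is taken from the steering file.

```python
import itertools

# Hypothetical example values; the real ones come from the steering file and settings.
experiment_numbers = [0]
fast_bdt_options = [[200, 8, 3, 0.1], [100, 8, 3, 0.1]]
cut_values = [(i + 1) * 0.2 for i in range(4)]


def validation_requests(experiment_numbers, fast_bdt_options, cut_values):
    """Build one parameter dictionary per requested validation run."""
    return [
        {
            "experiment_number": experiment_number,
            "fast_bdt_option": fast_bdt_option,
            "result_filter_cut": cut_value,
        }
        for experiment_number, fast_bdt_option, cut_value in itertools.product(
            experiment_numbers, fast_bdt_options, cut_values
        )
    ]


requests = validation_requests(experiment_numbers, fast_bdt_options, cut_values)
```

Every dictionary corresponds to one validation run, which in turn pulls in the teacher and simulation tasks it needs.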

b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. To create the dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
parameters, output files and which other tasks with which
parameters it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a ROOT
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.

The final task that defines which tasks need to be done for the steering file to
finish is the ``SummaryTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in this top-level
task or replace them by lower-level tasks during debugging.
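
The way a dependency graph determines the execution order can be illustrated without b2luigi. The sketch below uses the standard-library ``graphlib`` module (Python 3.9 or newer) on a simplified graph: the task names are those of this steering file, but the graph ignores task parameters and treats the training and testing samples as a single node, which is a simplification.

```python
from graphlib import TopologicalSorter

# Maps each task to the tasks it requires (simplified, parameters ignored).
requirements = {
    "SummaryTask": {"ValidationAndOptimisationTask"},
    "ValidationAndOptimisationTask": {"CKFResultFilterTeacherTask", "SplitNMergeSimTask"},
    "CKFResultFilterTeacherTask": {"ResultRecordingTask"},
    "ResultRecordingTask": {"SplitNMergeSimTask"},
    "SplitNMergeSimTask": {"GenerateSimTask"},
    "GenerateSimTask": set(),
}

# static_order() yields every task only after all of its requirements.
execution_order = list(TopologicalSorter(requirements).static_order())
```

Requesting only the top-level task is enough: everything it needs is scheduled first, which is why the simulation runs before the training and the summary runs last.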

This steering file relies on b2luigi_ for task scheduling. It can be installed
via pip::

    python3 -m pip install [--user] b2luigi

Use the ``--user`` option if you do not have the rights to install python packages into
your externals (e.g. because you are using cvmfs) and install them in
``$HOME/.local`` instead.

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.
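
As a rough illustration of what such a settings file may contain, the sketch below parses an example JSON document with the setting names that this steering file reads. All values, including the background file path, are hypothetical placeholders, and the plain ``json`` handling only mimics the lookup with defaults; it is not how b2luigi itself loads the file.

```python
import json

# Hypothetical example content of a settings.json file.
example_settings = """
{
    "result_path": "results",
    "workers": 2,
    "n_events_training": 1000,
    "n_events_testing": 500,
    "n_events_per_task": 100,
    "basf2_processes_per_worker": 0,
    "experiment_numbers": [0],
    "bkgfiles_by_exp": {"0": "/path/to/bkg/files"}
}
"""

settings = json.loads(example_settings)

# Look up values with the same defaults that the steering file uses.
n_events_training = settings.get("n_events_training", 1000)
workers = settings.get("workers", 1)

# JSON object keys are always strings, so convert experiment numbers to int.
bkgfiles_by_exp = {int(key): val for key, val in settings["bkgfiles_by_exp"].items()}
```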

You can test the b2luigi task chain without running it via::

    python3 cdc_and_svd_ckf_merger_mva_training.py --dry-run
    python3 cdc_and_svd_ckf_merger_mva_training.py --show-output

This will show the outputs and reveal potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 cdc_and_svd_ckf_merger_mva_training.py

You can use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` if b2luigi was installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 cdc_and_svd_ckf_merger_mva_training.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web
browser, you have two options:

    1. Run both the steering file and ``luigid`` remotely and use
       ssh-port-forwarding to your local host. To do so, run on your local
       machine::

           ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

    2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
       local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path``. The
directory tree encodes the used b2luigi parameters. This ensures reproducibility
and makes parameter searches easy. Sometimes, it is hard to find the relevant
output files. You can view the whole directory structure by running ``tree
<result_path>``. Use the unix ``find`` command to find the files that interest
you, e.g.::

    find <result_path> -name "*.root" # find all ROOT files
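
The same search can be done from Python with ``pathlib``. This is a minimal sketch: the helper ``find_root_files`` is hypothetical, and the directory and file names created in the temporary folder are only illustrative of a parameter-encoded result tree.

```python
import tempfile
from pathlib import Path


def find_root_files(result_path):
    """Recursively collect all ROOT files below result_path, sorted by path."""
    return sorted(Path(result_path).rglob("*.root"))


# Build a small illustrative directory tree and search it.
with tempfile.TemporaryDirectory() as result_path:
    nested = Path(result_path) / "experiment_number=0" / "n_events=100"
    nested.mkdir(parents=True)
    (nested / "generated_mc_N100_training_000.root").touch()
    (nested / "notes.txt").touch()
    found = [path.name for path in find_root_files(result_path)]
```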
from tracking.path_utils import add_hit_preparation_modules, add_cdc_track_finding, add_svd_standalone_tracking

from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string
from tracking_mva_filter_payloads.write_tracking_mva_filter_payloads_to_db import write_tracking_mva_filter_payloads_to_db

install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))

    Simple task that defines the configuration of the LSF batch submission.

    Same as LSFTask, but for memory-intensive tasks.

    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.

    experiment_number = b2luigi.IntParameter()
    n_events = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    bkgfiles_dir = b2luigi.Parameter(

        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.

        if random_seed is None:

        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.

        Create basf2 path to process with event generation and simulation.

        path = basf2.create_path()
        path.add_module("EvtGenInput")

        Default function from base b2luigi.Task class.

        self._remove_output()

    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.

    experiment_number = b2luigi.IntParameter()
    n_events = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    bkgfiles_dir = b2luigi.Parameter(

        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.

        if random_seed is None:

        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.

        This task requires several GenerateSimTask to be finished so that the required number of events is created.

        n_events_per_task = SummaryTask.n_events_per_task
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        for i in range(quotient):
                num_processes=SummaryTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,

                num_processes=SummaryTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),

    @b2luigi.on_temporary_files
        When all GenerateSimTasks finished, merge the output.

        create_output_dirs(self)

        file_list = [f for f in self.get_all_input_file_names()]
        print("Merge the following files:")
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)

        print("Finished merging. Now remove the input files to save space.")
        file_list = [f for f in self.get_all_input_file_names()]
        for input_file in file_list:
                os.remove(input_file)
            except FileNotFoundError:

        Default function from base b2luigi.Task class.

        self._remove_output()

    Task to record data for the final result filter. This only requires found and MC-matched SVD and CDC tracks that need to be
    merged, all state filters are set to "all".

    experiment_number = b2luigi.IntParameter()
    n_events_training = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    result_filter_records_name = b2luigi.Parameter()

        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.

        This task requires that the training ``SplitNMergeSimTask`` is finished.

        Create a path for the recording of the result filter. This file is then used to train the result filter.

        :param result_filter_records_name: Name of the recording file.

        path = basf2.create_path()

        file_list = [fname for fname in self.get_all_input_file_names()
                     if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD"])

        mc_reco_tracks = "MCRecoTracks"
        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName=mc_reco_tracks)

        cdc_reco_tracks = "CDCRecoTracks"
        add_cdc_track_finding(path, output_reco_tracks=cdc_reco_tracks)
        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
                        mcRecoTracksStoreArrayName=mc_reco_tracks,
                        prRecoTracksStoreArrayName=cdc_reco_tracks)

        path.add_module("DAFRecoFitter", recoTracksStoreArrayName=cdc_reco_tracks)

        svd_reco_tracks = "SVDRecoTracks"
        add_svd_standalone_tracking(path, reco_tracks=svd_reco_tracks)
        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=False,
                        mcRecoTracksStoreArrayName=mc_reco_tracks,
                        prRecoTracksStoreArrayName=svd_reco_tracks)

        direction = "backward"
        path.add_module("CDCToSVDSeedCKF",
                        inputRecoTrackStoreArrayName=cdc_reco_tracks,

                        fromRelationStoreArrayName=cdc_reco_tracks,
                        toRelationStoreArrayName=svd_reco_tracks,

                        relatedRecoTrackStoreArrayName=svd_reco_tracks,
                        cdcTracksStoreArrayName=cdc_reco_tracks,
                        vxdTracksStoreArrayName=svd_reco_tracks,

                        relationCheckForDirection=direction,

                        firstHighFilterParameters={"direction": direction},
                        advanceHighFilterParameters={"direction": direction},

                        writeOutDirection=direction,

                        filter="recording_with_relations",
                        filterParameters={"rootFileName": result_filter_records_name})

        Create basf2 path to process with event generation and simulation.

        Default function from base b2luigi.Task class.

        self._remove_output()

    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    Since teacher tasks are needed for all quality estimators covered by this
    steering file and the only thing that changes is the required data
    collection task and some training parameters, I decided to use inheritance
    and have the basic functionality in this base class/interface and have the
    specific teacher tasks inherit from it.

    experiment_number = b2luigi.IntParameter()
    n_events_training = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    result_filter_records_name = b2luigi.Parameter()
    training_target = b2luigi.Parameter(

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[]
    )

    fast_bdt_option = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

        Name of weightfile that is created by the teacher task.

        :param fast_bdt_option: FastBDT option that is used to train this MVA

        if fast_bdt_option is None:

        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = "trk_CDCToSVDSeedResultFilter" + fast_bdt_string
        return weightfile_name

        Generate list of luigi Tasks that this Task depends on.

        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.

        Use basf2_mva teacher to create MVA weightfile from collected training
        data.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.

        my_basf2_mva_teacher(
            records_files=records_files,
            weightfile_identifier=weightfile_identifier,

        basf2_mva.download(weightfile_identifier, self.get_output_file_name(weightfile_identifier + ".root"))

        Default function from base b2luigi.Task class.

        self._remove_output()

    Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values for
    the states, the number of best candidates kept after each filter, and similar for the result filter.

    experiment_number = b2luigi.IntParameter()
    n_events_training = b2luigi.IntParameter()
    fast_bdt_option = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )
    n_events_testing = b2luigi.IntParameter()
    result_filter_cut = b2luigi.FloatParameter()

    basf2.conditions.prepend_testing_payloads("localdb/database.txt")

        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.

        yield self.add_to_output(
            f"cdc_svd_merger_ckf_validation{fbdt_string}{self.result_filter_cut}.root")

        This task requires trained result filters, and that an independent data set for validation was created using the
        ``SplitNMergeSimTask`` with the random seed ``optimisation``.

            result_filter_records_name="filter_records.root",
            random_seed='training'

            random_seed="optimisation",

        Create a path to validate the trained filters.

        path = basf2.create_path()

        file_list = [fname for fname in self.get_all_input_file_names()
                     if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD"])

        cdc_reco_tracks = "CDCRecoTracks"
        svd_reco_tracks = "SVDRecoTracks"
        reco_tracks = "RecoTracks"
        mc_reco_tracks = "MCRecoTracks"

        add_cdc_track_finding(path, output_reco_tracks=cdc_reco_tracks)

        path.add_module("DAFRecoFitter", recoTracksStoreArrayName=cdc_reco_tracks)

        add_svd_standalone_tracking(path, reco_tracks=svd_reco_tracks)

        direction = "backward"

            f"trk_CDCToSVDSeedResultFilterParameter{fbdt_string}",
            f"trk_CDCToSVDSeedResultFilter{fbdt_string}",

        basf2.conditions.prepend_testing_payloads("localdb/database.txt")
        result_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSeedResultFilterParameter{fbdt_string}"}

                        inputRecoTrackStoreArrayName=cdc_reco_tracks,
                        fromRelationStoreArrayName=cdc_reco_tracks,
                        toRelationStoreArrayName=svd_reco_tracks,
                        relatedRecoTrackStoreArrayName=svd_reco_tracks,
                        cdcTracksStoreArrayName=cdc_reco_tracks,
                        vxdTracksStoreArrayName=svd_reco_tracks,
                        relationCheckForDirection=direction,

                        firstHighFilterParameters={
                            "direction": direction},
                        advanceHighFilterParameters={
                            "direction": direction},
                        writeOutDirection=direction,

                        filter='mva_with_relations',
                        filterParameters=result_filter_parameters

        path.add_module('RelatedTracksCombiner',
                        VXDRecoTracksStoreArrayName=svd_reco_tracks,
                        CDCRecoTracksStoreArrayName=cdc_reco_tracks,
                        recoTracksStoreArrayName=reco_tracks)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName=mc_reco_tracks,

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName=mc_reco_tracks,
                        prRecoTracksStoreArrayName=reco_tracks)

                output_file_name=self.get_output_file_name(
                    f"cdc_svd_merger_ckf_validation{fbdt_string}{self.result_filter_cut}.root"),
                reco_tracks_name=reco_tracks,
                mc_reco_tracks_name=mc_reco_tracks,

        Create basf2 path to process with event generation and simulation.

        Default function from base b2luigi.Task class.

        self._remove_output()

    Task that collects and summarizes the main figures of merit from all the
    (validation and optimisation) child tasks.

    n_events_training = b2luigi.get_setting(
        "n_events_training", default=1000
    )
    n_events_testing = b2luigi.get_setting(
        "n_events_testing", default=500
    )
    n_events_per_task = b2luigi.get_setting(
        "n_events_per_task", default=100
    )
    num_processes = b2luigi.get_setting(
        "basf2_processes_per_worker", default=0
    )

    bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
    bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}

    batch_system = 'local'
    output_file_name = 'summary.json'

        Generate list of tasks that need to be done for luigi to finish running

            cut_values.append((i + 1) * 0.2)

        experiment_numbers = b2luigi.get_setting("experiment_numbers")

        for experiment_number, fast_bdt_option, cut_value in itertools.product(
            experiment_numbers, fast_bdt_options, cut_values
        ):
                experiment_number=experiment_number,
                fast_bdt_option=fast_bdt_option,
                result_filter_cut=cut_value,

            'MCSideTrackingValidationModule_overview_figures_of_merit',
            'PRSideTrackingValidationModule_overview_figures_of_merit',
            'PRSideTrackingValidationModule_subdetector_figures_of_merit'

        all_files = self.get_all_input_file_names()
        for idx, single_file in enumerate(all_files):
            with ROOT.TFile.Open(single_file, 'READ') as f:
                for ntuple_name in ntuple_names:
                    ntuple = f.Get(ntuple_name)
                    for i in range(min(1, ntuple.GetEntries())):
                        for branch in ntuple.GetListOfBranches():
                            name = branch.GetName()
                            value = getattr(ntuple, name)
                            branch_data[name] = value
                branch_data['file_path'] = single_file
                output_dict[f'{idx}'] = branch_data

            json.dump(output_dict, f, indent=4)

        Default function from base b2luigi.Task class.

        self._remove_output()


if __name__ == "__main__":

    b2luigi.set_setting("env_script", "./setup_basf2.sh")
    b2luigi.set_setting("scratch_dir", tempfile.gettempdir())
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(SummaryTask(), workers=workers, batch=True)