cdc_and_svd_ckf_merger_mva_training
-----------------------------------------

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the training and validation of the classifier of
the MVA-based result filter of the CDCToSVDSeedCKF, which combines tracks that
were found by the CDC and SVD standalone tracking algorithms.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

The order of the b2luigi tasks in this script is as follows (top to bottom):

* Two tasks to create input samples for training and testing (``GenerateSimTask`` and
  ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes the total number of events to be
  generated and a number of events per task to reduce runtime. It then divides the total
  number of events by the number of events per task and creates as many ``GenerateSimTask``
  instances as needed, each with a specific random seed, so that in the end the total number
  of training and testing events is simulated. The individual files are then combined
  by the ``SplitNMergeSimTask`` into one file each for training and testing.
* The ``ResultRecordingTask`` writes out the data used for training of the MVA.
* The ``CKFResultFilterTeacherTask`` trains the MVA, FastBDT by default, with a
  given set of FastBDT options.
* The ``ValidationAndOptimisationTask`` uses the trained weight files and the cut values
  provided to run the tracking chain with the weight file under test, and also
  runs the tracking validation.
* Finally, the ``MainTask`` is the "brain" of the script. It invokes the
  ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
  and cut values on the MVA classifier output.

Due to the dependencies, the tasks are called in the reverse order. The ``MainTask``
calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
values, and the ``ValidationAndOptimisationTask`` itself calls the required teacher,
training, and simulation tasks.

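The splitting of the requested events into several simulation tasks can be sketched in plain Python as follows. This is only an illustration of the ``divmod`` logic described above. The helper name ``split_events`` is hypothetical, and the handling of the remainder (one extra task with the leftover events) is an assumption, since that part of the listing is incomplete.

```python
def split_events(n_events, n_events_per_task, random_seed):
    """Return one (seed, n_events) pair per simulation task (hypothetical helper)."""
    quotient, remainder = divmod(n_events, n_events_per_task)
    # One full-size chunk per quotient step, each with its own zero-padded seed suffix.
    chunks = [(f"{random_seed}_{i:03d}", n_events_per_task) for i in range(quotient)]
    # Assumption: leftover events go into one additional, smaller task.
    if remainder > 0:
        chunks.append((f"{random_seed}_{quotient:03d}", remainder))
    return chunks


chunks = split_events(1000, 300, "training")
for seed, n in chunks:
    print(seed, n)
```

With 1000 events and 300 events per task, this gives three full tasks and one task with the remaining 100 events, so that the total number of simulated events matches the request.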
b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. To create the dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
its parameters, its output files and the other tasks (with their parameters)
it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a root
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.

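The effect of such dependency declarations can be illustrated with a small plain-Python sketch. It does not use the luigi or b2luigi API. The task names and the ``run_order`` helper are hypothetical and only mirror the teacher/data collection/validation example from the paragraph above.

```python
# Hypothetical task names; each entry lists the tasks it requires.
REQUIRES = {
    "validation": ["teacher", "test_data_collection"],
    "teacher": ["training_data_collection"],
    "training_data_collection": [],
    "test_data_collection": [],
}


def run_order(task, requires, done=None):
    """Return the tasks in an order in which every requirement runs first."""
    done = [] if done is None else done
    for dependency in requires[task]:
        run_order(dependency, requires, done)
    if task not in done:
        done.append(task)
    return done


order = run_order("validation", REQUIRES)
print(order)
```

Requesting only the top-level task is enough: the requirements are resolved recursively, which is why the tasks end up being executed in the reverse order of how they call each other.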
The final task that defines which tasks need to be done for the steering file to
finish is the ``MainTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the ``MainTask``
or replace them by lower-level tasks during debugging.

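How the ``MainTask`` spans its parameter grid can be sketched with ``itertools.product``. The values below are assumptions for illustration: the experiment number is hypothetical, the FastBDT option list is the default of the task parameter, and the number of cut values (``range(4)``) is not taken from the script. The ``round`` call is also an addition here, to avoid floating point artefacts in the printed cut values.

```python
import itertools

experiment_numbers = [0]               # hypothetical "experiment_numbers" setting
fast_bdt_options = [[200, 8, 3, 0.1]]  # default FastBDT option list of the tasks
# Cut values in steps of 0.2; range(4) is an assumption for this sketch.
cut_values = [round((i + 1) * 0.2, 1) for i in range(4)]

# One validation task would be requested per combination.
grid = list(itertools.product(experiment_numbers, fast_bdt_options, cut_values))
for experiment_number, fast_bdt_option, cut_value in grid:
    print(experiment_number, fast_bdt_option, cut_value)
```

Commenting out entries of one of the lists is a simple way to restrict the pipeline to a subset of the combinations during debugging.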
This steering file relies on b2luigi_ for task scheduling. It can be installed
with pip::

    python3 -m pip install [--user] b2luigi

Use the ``--user`` option if you do not have the rights to install python packages into
your externals (e.g. because you are using cvmfs) and want to install them in
``$HOME/.local`` instead.

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.

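As a rough orientation, the following sketch shows what such a settings file could contain and how its values are interpreted. The keys are the ones queried in this steering file, but all values are placeholders chosen for illustration, and the file is parsed here with the standard ``json`` module rather than through b2luigi.

```python
import json

# Placeholder content for a settings.json file; all values are assumptions.
example_settings = """
{
    "result_path": "results",
    "n_events_training": 1000,
    "n_events_testing": 500,
    "n_events_per_task": 100,
    "basf2_processes_per_worker": 0,
    "workers": 1,
    "experiment_numbers": [0],
    "bkgfiles_by_exp": {"0": "/path/to/bkgfiles"}
}
"""

settings = json.loads(example_settings)
# Fall back to a default if a key is missing, as the steering file does.
n_events_training = settings.get("n_events_training", 1000)
# JSON object keys are always strings, so convert them to experiment numbers.
bkgfiles_by_exp = {int(key): val for key, val in settings["bkgfiles_by_exp"].items()}
print(n_events_training, bkgfiles_by_exp)
```

The conversion of the ``bkgfiles_by_exp`` keys to integers is needed because the experiment numbers are handled as integers in the tasks.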
You can test the b2luigi task graph without running it via::

    python3 cdc_and_svd_ckf_merger_mva_training.py --dry-run
    python3 cdc_and_svd_ckf_merger_mva_training.py --show-output

This will show the outputs and reveal potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 cdc_and_svd_ckf_merger_mva_training.py

One can use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` if b2luigi was installed with ``--user``. Start it on
the port that is passed to the steering file, which is 8886 in the examples here.

Then, execute your steering file (e.g. in another terminal) with::

    python3 cdc_and_svd_ckf_merger_mva_training.py --scheduler-port 8886

To view the web interface, open your web browser and enter the address of the
scheduler host, together with the port chosen above, into the URL bar.

If you don't run the steering file on the same machine on which you run your web
browser, you have two options:

1. Run both the steering file and ``luigid`` remotely and use
   ssh-port-forwarding to your local host. To do so, run on your local
   machine::

       ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
   local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path``. The
directory tree encodes the used b2luigi parameters. This ensures reproducibility
and makes parameter searches easy. Sometimes, it is hard to find the relevant
output files. You can view the whole directory structure by running ``tree
<result_path>``. Use the unix ``find`` command to find the files that interest
you, e.g.::

    find <result_path> -name "*.root" # find all ROOT files

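The same search can also be done from Python, for example inside a notebook. The following is a minimal sketch using ``pathlib``. The helper ``find_root_files`` and the directory and file names created in the temporary folder are made up for the demonstration and only imitate a parameter-encoding directory tree.

```python
import tempfile
from pathlib import Path


def find_root_files(result_path):
    """Recursively collect all ROOT files below result_path (hypothetical helper)."""
    return sorted(Path(result_path).rglob("*.root"))


with tempfile.TemporaryDirectory() as tmp:
    # Build a small fake result tree for the demonstration.
    nested = Path(tmp) / "experiment_number=0" / "n_events=100"
    nested.mkdir(parents=True)
    (nested / "generated_mc_N100_training.root").touch()
    (nested / "log.txt").touch()
    found = [path.name for path in find_root_files(tmp)]

print(found)
```

In practice, ``find_root_files`` would be called with the ``result_path`` configured for the steering file instead of a temporary directory.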
import itertools
import subprocess

import basf2
from tracking.path_utils import add_hit_preparation_modules, add_cdc_track_finding, add_svd_standalone_tracking
from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string

# Help text that is printed if an optional python module is missing.
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise
Generate simulated Monte Carlo with background overlay.

Make sure to use different ``random_seed`` parameters for the training data
of the classifier trainings and for the test data of the respective
evaluation/validation tasks.

# Experiment number of the conditions database.
experiment_number = b2luigi.IntParameter()
# Number of events to generate.
n_events = b2luigi.IntParameter()
# Random basf2 seed.
random_seed = b2luigi.Parameter()
# Directory with overlay background root files.
bkgfiles_dir = b2luigi.Parameter(

Create output file name depending on number of events and production
mode that is specified in the random_seed string.

:param n_events: Number of events to simulate.
:param random_seed: Random seed to use for the simulation to create independent samples.

if random_seed is None:
return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

Create basf2 path to process with event generation and simulation.

path = basf2.create_path()
path.add_module("EvtGenInput")
outputFileName=self.get_output_file_name(self.output_file_name()),
Generate simulated Monte Carlo with background overlay.

Make sure to use different ``random_seed`` parameters for the training data
of the classifier trainings and for the test data of the respective
evaluation/validation tasks.

# Experiment number of the conditions database.
experiment_number = b2luigi.IntParameter()
# Number of events to generate.
n_events = b2luigi.IntParameter()
# Random basf2 seed.
random_seed = b2luigi.Parameter()
# Directory with overlay background root files.
bkgfiles_dir = b2luigi.Parameter(

Create output file name depending on number of events and production
mode that is specified in the random_seed string.

:param n_events: Number of events to simulate.
:param random_seed: Random seed to use for the simulation to create independent samples.

if random_seed is None:
return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

This task requires several GenerateSimTask to be finished so that the required number of events is created.

n_events_per_task = MainTask.n_events_per_task
quotient, remainder = divmod(self.n_events, n_events_per_task)
for i in range(quotient):
        num_processes=MainTask.num_processes,
        random_seed=self.random_seed + '_' + str(i).zfill(3),
        n_events=n_events_per_task,
        num_processes=MainTask.num_processes,
        random_seed=self.random_seed + '_' + str(quotient).zfill(3),

@b2luigi.on_temporary_files
When all GenerateSimTasks finished, merge the output.

create_output_dirs(self)
file_list = [item for sublist in self.get_input_file_names().values() for item in sublist]
print("Merge the following files:")
cmd = ["b2file-merge", "-f"]
args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
subprocess.check_call(args)
print("Finished merging. Now remove the input files to save space.")
for tempfile in file_list:
    args = cmd2 + [tempfile]
    subprocess.check_call(args)
Task to record data for the final result filter. This only requires found and MC-matched SVD and CDC tracks that need to be
merged, all state filters are set to "all".

# Experiment number of the conditions database.
experiment_number = b2luigi.IntParameter()
# Number of events to generate.
n_events_training = b2luigi.IntParameter()
# Random basf2 seed.
random_seed = b2luigi.Parameter()
# Name of the records file for training the final result filter.
result_filter_records_name = b2luigi.Parameter()

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

This task requires that the training SplitNMergeSimTask is finished.

Create a path for the recording of the result filter. This file is then used to train the result filter.

:param result_filter_records_name: Name of the recording file.

path = basf2.create_path()
file_list = [fname for sublist in self.get_input_file_names().values()
             for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
path.add_module("RootInput", inputFileNames=file_list)
path.add_module("Gearbox")
path.add_module("Geometry")
path.add_module("SetupGenfitExtrapolation")
add_hit_preparation_modules(path, components=["SVD"])

mc_reco_tracks = "MCRecoTracks"
path.add_module('TrackFinderMCTruthRecoTracks',
                RecoTracksStoreArrayName=mc_reco_tracks)

cdc_reco_tracks = "CDCRecoTracks"
add_cdc_track_finding(path, output_reco_tracks=cdc_reco_tracks)
path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
                mcRecoTracksStoreArrayName=mc_reco_tracks,
                prRecoTracksStoreArrayName=cdc_reco_tracks)
path.add_module("DAFRecoFitter", recoTracksStoreArrayName=cdc_reco_tracks)

svd_reco_tracks = "SVDRecoTracks"
add_svd_standalone_tracking(path, reco_tracks=svd_reco_tracks)
path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=False,
                mcRecoTracksStoreArrayName=mc_reco_tracks,
                prRecoTracksStoreArrayName=svd_reco_tracks)

direction = "backward"
path.add_module("CDCToSVDSeedCKF",
                inputRecoTrackStoreArrayName=cdc_reco_tracks,
                fromRelationStoreArrayName=cdc_reco_tracks,
                toRelationStoreArrayName=svd_reco_tracks,
                relatedRecoTrackStoreArrayName=svd_reco_tracks,
                cdcTracksStoreArrayName=cdc_reco_tracks,
                vxdTracksStoreArrayName=svd_reco_tracks,
                relationCheckForDirection=direction,
                firstHighFilterParameters={"direction": direction},
                advanceHighFilterParameters={"direction": direction},
                writeOutDirection=direction,
                filter="recording_with_relations",
                filterParameters={"rootFileName": result_filter_records_name})

Create basf2 path to record the training data for the result filter.
A teacher task runs the basf2 mva teacher on the training data provided by a
data collection task.

Since teacher tasks are needed for all quality estimators covered by this
steering file and the only thing that changes is the required data
collection task and some training parameters, I decided to use inheritance
and have the basic functionality in this base class/interface and have the
specific teacher tasks inherit from it.

# Experiment number of the conditions database.
experiment_number = b2luigi.IntParameter()
# Number of events to generate for the training data set.
n_events_training = b2luigi.IntParameter()
# Random basf2 seed.
random_seed = b2luigi.Parameter()
# Name of the input file name.
result_filter_records_name = b2luigi.Parameter()
# Feature/variable to use as truth label in the quality estimator MVA classifier.
training_target = b2luigi.Parameter(
exclude_variables = b2luigi.ListParameter(
    hashed=True, default=[]
# Hyperparameter option of the FastBDT algorithm.
fast_bdt_option = b2luigi.ListParameter(
    hashed=True, default=[200, 8, 3, 0.1]

Name of the xml weightfile that is created by the teacher task.
It is subsequently used as a local weightfile in the following validation tasks.

:param fast_bdt_option: FastBDT option that is used to train this MVA

if fast_bdt_option is None:
fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
weightfile_name = "trk_CDCToSVDSeedResultFilter" + fast_bdt_string
return weightfile_name + ".xml"

Generate list of luigi Tasks that this Task depends on.

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

Use basf2_mva teacher to create MVA weightfile from collected training

This is the main process that is dispatched by the ``run`` method that
is inherited from ``Basf2Task``.

my_basf2_mva_teacher(
    records_files=records_files,
    exclude_variables=self.exclude_variables,
Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values for
the states, the number of best candidates kept after each filter, and similar for the result filter.

# Experiment number of the conditions database.
experiment_number = b2luigi.IntParameter()
# Number of events to generate for the training data set.
n_events_training = b2luigi.IntParameter()
# FastBDT option to use to train the StateFilters.
fast_bdt_option = b2luigi.ListParameter(
    hashed=True, default=[200, 8, 3, 0.1]
# Number of events to generate for the testing, validation, and optimisation data set.
n_events_testing = b2luigi.IntParameter()
# Value of the cut on the MVA classifier output for a result candidate.
result_filter_cut = b2luigi.FloatParameter()

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

fbdt_string = create_fbdt_option_string(self.fast_bdt_option)
yield self.add_to_output(
    f"cdc_svd_merger_ckf_validation{fbdt_string}_{self.result_filter_cut}.root")

This task requires trained result filters, and that an independent data set for validation was created using the
``SplitNMergeSimTask`` with the random seed optimisation.

    result_filter_records_name="filter_records.root",
    random_seed='training'
    random_seed="optimisation",

Create a path to validate the trained filters.

path = basf2.create_path()
file_list = [fname for sublist in self.get_input_file_names().values()
             for fname in sublist if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
path.add_module("RootInput", inputFileNames=file_list)
path.add_module("Gearbox")
path.add_module("Geometry")
path.add_module("SetupGenfitExtrapolation")
add_hit_preparation_modules(path, components=["SVD"])

cdc_reco_tracks = "CDCRecoTracks"
svd_reco_tracks = "SVDRecoTracks"
reco_tracks = "RecoTracks"
mc_reco_tracks = "MCRecoTracks"

add_cdc_track_finding(path, output_reco_tracks=cdc_reco_tracks)
path.add_module("DAFRecoFitter", recoTracksStoreArrayName=cdc_reco_tracks)
add_svd_standalone_tracking(path, reco_tracks=svd_reco_tracks)

direction = "backward"
fbdt_string = create_fbdt_option_string(self.fast_bdt_option)
    inputRecoTrackStoreArrayName=cdc_reco_tracks,
    fromRelationStoreArrayName=cdc_reco_tracks,
    toRelationStoreArrayName=svd_reco_tracks,
    relatedRecoTrackStoreArrayName=svd_reco_tracks,
    cdcTracksStoreArrayName=cdc_reco_tracks,
    vxdTracksStoreArrayName=svd_reco_tracks,
    relationCheckForDirection=direction,
    firstHighFilterParameters={
        "direction": direction},
    advanceHighFilterParameters={
        "direction": direction},
    writeOutDirection=direction,
    filter='mva_with_relations',
        "identifier": self.get_input_file_names(f"trk_CDCToSVDSeedResultFilter{fbdt_string}.xml")[0],

path.add_module('RelatedTracksCombiner',
                VXDRecoTracksStoreArrayName=svd_reco_tracks,
                CDCRecoTracksStoreArrayName=cdc_reco_tracks,
                recoTracksStoreArrayName=reco_tracks)

path.add_module('TrackFinderMCTruthRecoTracks',
                RecoTracksStoreArrayName=mc_reco_tracks,

path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                mcRecoTracksStoreArrayName=mc_reco_tracks,
                prRecoTracksStoreArrayName=reco_tracks)

    output_file_name=self.get_output_file_name(
        f"cdc_svd_merger_ckf_validation{fbdt_string}_{self.result_filter_cut}.root"),
    reco_tracks_name=reco_tracks,
    mc_reco_tracks_name=mc_reco_tracks,

Create basf2 path to run the optimisation and validation of the trained filters.

return self.create_optimisation_and_validation_path()
class MainTask(b2luigi.WrapperTask):
    Wrapper task that needs to finish for b2luigi to finish running this steering file.

    It is done if the outputs of all required subtasks exist. It is thus at the
    top of the luigi task graph. Edit the ``requires`` method to steer which
    tasks and with which parameters you want to run.

    # Number of events to generate for the training data set.
    n_events_training = b2luigi.get_setting(
        "n_events_training", default=1000
    # Number of events to generate for the test data set.
    n_events_testing = b2luigi.get_setting(
        "n_events_testing", default=500
    n_events_per_task = b2luigi.get_setting(
        "n_events_per_task", default=100
    num_processes = b2luigi.get_setting(
        "basf2_processes_per_worker", default=0
    bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
    # Settings keys are strings, convert them to integer experiment numbers.
    bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}

    Generate list of tasks that need to be done for luigi to finish running
    cut_values.append((i+1) * 0.2)
    experiment_numbers = b2luigi.get_setting("experiment_numbers")
    for experiment_number, fast_bdt_option, cut_value in itertools.product(
        experiment_numbers, fast_bdt_options, cut_values
            experiment_number=experiment_number,
            fast_bdt_option=fast_bdt_option,
            result_filter_cut=cut_value,


if __name__ == "__main__":
    b2luigi.set_setting("env_script", "./setup_basf2.sh")
    b2luigi.set_setting("batch_system", "htcondor")
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(MainTask(), workers=workers, batch=True)