Belle II Software release-09-00-00
combined_cdc_to_svd_ckf_mva_training.py
"""
combined_cdc_to_svd_ckf_mva_training
-----------------------------------------

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the training and validation of the classifiers of
the three MVA-based state filters and one result filter of the CDCToSVDSpacePointCKF.
This CKF extrapolates tracks found in the CDC into the SVD and adds SVD hits using a
combinatorial tree search and a Kalman-filter-based track fit in each step.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

The order of the b2luigi tasks in this script is as follows (top to bottom):

* Two tasks create the input samples for training and testing (``GenerateSimTask`` and
  ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes the total number of events to
  be generated and a number of events per task, which limits the runtime of each job. It
  divides the total number of events by the number of events per task and creates as many
  ``GenerateSimTask`` instances as needed, each with its own random seed, so that in the
  end the requested number of training and testing events is simulated. The
  ``SplitNMergeSimTask`` then merges the individual files into one file each for training
  and testing.
* The ``StateRecordingTask`` writes out the data required for training the state
  filters.
* The ``CKFStateFilterTeacherTask`` trains the state filter MVAs, using FastBDT by
  default, with a given set of options.
* The ``ResultRecordingTask`` writes out the data used for the training of the result
  filter MVA. This task requires that the state filters have been trained before.
* The ``CKFResultFilterTeacherTask`` trains the result filter MVA, using FastBDT by
  default, with a given set of FastBDT options. This requires that the result filter
  records have been created with the ``ResultRecordingTask``.
* The ``ValidationAndOptimisationTask`` uses the trained weight files and the cut values
  provided to run the tracking chain with the weight file under test, and also
  runs the tracking validation.
* Finally, the ``MainTask`` is the "brain" of the script. It invokes the
  ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
  and cut values on the MVA classifier output.
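
The splitting described for the ``SplitNMergeSimTask`` can be sketched in plain Python
as follows. This is a minimal illustration, not the task itself: the helper name
``split_events`` and the numbers in the usage line are hypothetical, and the seed suffix
mirrors the zero-padded counter used further down in this file.

```python
def split_events(n_events, n_events_per_task, random_seed):
    # Return one (seed, n_events) pair per simulation job.
    quotient, remainder = divmod(n_events, n_events_per_task)
    jobs = [(random_seed + "_" + str(i).zfill(3), n_events_per_task) for i in range(quotient)]
    if remainder > 0:
        # A last, smaller job covers the events that do not fill a whole job.
        jobs.append((random_seed + "_" + str(quotient).zfill(3), remainder))
    return jobs


# Hypothetical numbers: 2500 events in jobs of at most 1000 events each.
jobs = split_events(n_events=2500, n_events_per_task=1000, random_seed="training")
print(jobs)
```

Each job gets its own seed string, so the individual samples are statistically
independent and their output files do not overwrite each other.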

Because of these dependencies, the tasks are called in reverse order. The ``MainTask``
calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
values, and the ``ValidationAndOptimisationTask`` in turn calls the required teacher,
training, and simulation tasks.

The result filter is trained for each combination of FastBDT options, state filter cut
values, and candidate selection. This means that the ``ResultRecordingTask`` is executed
multiple times with different combinations of FastBDT options, cut values, and
candidate selection.
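
How such a scan over combinations can be enumerated is sketched below with
``itertools.product``. The two option lists reuse the default FastBDT options found in
this file; the cut values and the numbers of kept candidates are made-up placeholders,
and the actual scan is defined in the ``MainTask``.

```python
import itertools

# Default FastBDT options from this file; all other values are placeholders.
fast_bdt_options_state_filter = [[50, 8, 3, 0.1]]
fast_bdt_options_result_filter = [[200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.05]
n_best_states_list = [3, 5]

# One tuple per combination of the four lists.
combinations = list(itertools.product(
    fast_bdt_options_state_filter,
    fast_bdt_options_result_filter,
    state_filter_cuts,
    n_best_states_list,
))

for state_opt, result_opt, cut, n_best in combinations:
    print(state_opt, result_opt, cut, n_best)
```

The number of combinations is the product of the list lengths, so every additional
option multiplies the number of validation tasks.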

b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. To create the dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
its parameters, its output files, and the other tasks (with their parameters) that
it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a root
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.
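
The idea of resolving requirements before running a task can be illustrated without
b2luigi. The sketch below is plain Python and does not use the b2luigi API: the
``REQUIRES`` table is a simplified, hand-written summary of the dependencies described
in this docstring, and ``run_order`` is a hypothetical helper.

```python
# Simplified dependency table: task name -> names of the tasks it requires.
REQUIRES = {
    "MainTask": ["ValidationAndOptimisationTask"],
    "ValidationAndOptimisationTask": [
        "CKFResultFilterTeacherTask", "CKFStateFilterTeacherTask", "SplitNMergeSimTask"],
    "CKFResultFilterTeacherTask": ["ResultRecordingTask"],
    "ResultRecordingTask": ["CKFStateFilterTeacherTask", "SplitNMergeSimTask"],
    "CKFStateFilterTeacherTask": ["StateRecordingTask"],
    "StateRecordingTask": ["SplitNMergeSimTask"],
    "SplitNMergeSimTask": ["GenerateSimTask"],
    "GenerateSimTask": [],
}


def run_order(task, done=None):
    # Depth-first traversal: every requirement is listed before the task itself.
    if done is None:
        done = []
    for requirement in REQUIRES[task]:
        run_order(requirement, done)
    if task not in done:
        done.append(task)
    return done


order = run_order("MainTask")
print(order)
```

Asking for the ``MainTask`` is enough to pull in the whole chain, which is why the
tasks are defined bottom-up but requested top-down.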

The final task that defines which tasks need to be done for the steering file to
finish is the ``MainTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the ``MainTask``
or replace them by lower-level tasks during debugging.

Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling. It can be installed
via pip::

    python3 -m pip install [--user] b2luigi

Use the ``--user`` option if you do not have the rights to install python packages into
your externals (e.g. because you are using cvmfs) and install them in
``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.
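
As a rough illustration, the sketch below writes and reads such a file with the
standard ``json`` module. Only the ``result_path`` key is taken from this docstring;
its value is a placeholder, the file is written to a temporary directory, and any
further keys the script expects are not shown here.

```python
import json
import os
import tempfile

# Placeholder value; result_path is the only key named in this docstring.
settings = {"result_path": "/path/to/results"}

# Write the file to a temporary directory so that nothing real is overwritten.
settings_file = os.path.join(tempfile.mkdtemp(), "settings.json")
with open(settings_file, "w") as f:
    json.dump(settings, f, indent=4)

# Read it back the way any JSON settings file can be read.
with open(settings_file) as f:
    loaded = json.load(f)

print(loaded["result_path"])
```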

Usage
~~~~~

You can test the b2luigi task graph without running it via::

    python3 combined_cdc_to_svd_ckf_mva_training.py --dry-run
    python3 combined_cdc_to_svd_ckf_mva_training.py --show-output

This will show the outputs and reveal potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_cdc_to_svd_ckf_mva_training.py

You can use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` if b2luigi was installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_cdc_to_svd_ckf_mva_training.py --scheduler-port 8886

To view the web interface, open your web browser and enter the following into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web browser,
you have two options:

1. Run both the steering file and ``luigid`` remotely and use
   ssh-port-forwarding to your local host. To do so, run on your local
   machine::

       ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
   local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path`` set in
``settings.json``. The directory tree encodes the used b2luigi parameters. This
ensures reproducibility and makes parameter searches easy. Sometimes, it is hard to
find the relevant output files. You can view the whole directory structure by
running ``tree <result_path>``. Use the unix ``find`` command to find the files
that interest you, e.g.::

    find <result_path> -name "*.root" # find all ROOT files
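
The same search can be done from Python, for example in a notebook. This is a small
sketch using ``pathlib``; the helper ``find_root_files`` is hypothetical, and a
temporary directory with made-up sub-directory names stands in for ``<result_path>``.

```python
import tempfile
from pathlib import Path


def find_root_files(result_path):
    # Recursively collect all ROOT files below result_path, sorted by path.
    return sorted(Path(result_path).rglob("*.root"))


# Stand-in for <result_path>; the directory and file names below are made up.
result_path = Path(tempfile.mkdtemp())
(result_path / "layer=3").mkdir()
(result_path / "layer=3" / "records1.root").touch()
(result_path / "layer=3" / "log.txt").touch()

root_files = find_root_files(result_path)
print([f.name for f in root_files])
```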
"""

import itertools
import subprocess
import os

import basf2
from tracking import add_track_finding
from tracking.path_utils import add_hit_preparation_modules
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule
import background
import simulation

from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                " python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise


class GenerateSimTask(Basf2PathTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Experiment number passed to the EventInfoSetter.
    experiment_number = b2luigi.IntParameter()
    # Random seed string; it also becomes part of the output file name.
    random_seed = b2luigi.Parameter()
    # Number of events to generate.
    n_events = b2luigi.IntParameter()
    # Directory with the background overlay files.
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    queue = 'l'

204 def output_file_name(self, n_events=None, random_seed=None):
205 """
206 Create output file name depending on number of events and production
207 mode that is specified in the random_seed string.
208
209 :param n_events: Number of events to simulate.
210 :param random_seed: Random seed to use for the simulation to create independent samples.
211 """
212 if n_events is None:
213 n_events = self.n_events
214 if random_seed is None:
215 random_seed = self.random_seed
216 return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"
217
218 def output(self):
219 """
220 Generate list of output files that the task should produce.
221 The task is considered finished if and only if the outputs all exist.
222 """
223 yield self.add_to_output(self.output_file_name())
224
225 def create_path(self):
226 """
227 Create basf2 path to process with event generation and simulation.
228 """
229 basf2.set_random_seed(self.random_seed)
230 path = basf2.create_path()
231 path.add_module(
232 "EventInfoSetter", evtNumList=[self.n_events], runList=[0], expList=[self.experiment_number]
233 )
234 path.add_module("EvtGenInput")
        bkg_files = ""
        # \cond suppress doxygen warning
        if self.experiment_number == 0:
            bkg_files = background.get_background_files()
        else:
            bkg_files = background.get_background_files(self.bkgfiles_dir)
        # \endcond
242
243 simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, usePXDDataReduction=False)
244
245 path.add_module(
246 "RootOutput",
247 outputFileName=self.get_output_file_name(self.output_file_name()),
248 )
249 return path
250
251
252# I don't use the default MergeTask or similar because they only work if every input file is called the same.
253# Additionally, I want to add more features like deleting the original input to save storage space.
class SplitNMergeSimTask(Basf2Task):
    """
    Split the simulation into several ``GenerateSimTask`` jobs and merge their
    output files into a single file.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    queue = 'sx'
278
279 def output_file_name(self, n_events=None, random_seed=None):
280 """
281 Create output file name depending on number of events and production
282 mode that is specified in the random_seed string.
283
284 :param n_events: Number of events to simulate.
285 :param random_seed: Random seed to use for the simulation to create independent samples.
286 """
287 if n_events is None:
288 n_events = self.n_events
289 if random_seed is None:
290 random_seed = self.random_seed
291 return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"
292
293 def output(self):
294 """
295 Generate list of output files that the task should produce.
296 The task is considered finished if and only if the outputs all exist.
297 """
298 yield self.add_to_output(self.output_file_name())
299
    def requires(self):
        """
        This task requires several GenerateSimTask to be finished so that the required number of events is created.
        """
304 n_events_per_task = MainTask.n_events_per_task
305 quotient, remainder = divmod(self.n_events, n_events_per_task)
306 for i in range(quotient):
307 yield GenerateSimTask(
308 bkgfiles_dir=self.bkgfiles_dir,
309 num_processes=MainTask.num_processes,
310 random_seed=self.random_seed + '_' + str(i).zfill(3),
311 n_events=n_events_per_task,
312 experiment_number=self.experiment_number,
313 )
314 if remainder > 0:
315 yield GenerateSimTask(
316 bkgfiles_dir=self.bkgfiles_dir,
317 num_processes=MainTask.num_processes,
318 random_seed=self.random_seed + '_' + str(quotient).zfill(3),
319 n_events=remainder,
320 experiment_number=self.experiment_number,
321 )
322
323 @b2luigi.on_temporary_files
324 def process(self):
325 """
326 When all GenerateSimTasks finished, merge the output.
327 """
328 create_output_dirs(self)
329
330 file_list = [item for sublist in self.get_input_file_names().values() for item in sublist]
331 print("Merge the following files:")
332 print(file_list)
333 cmd = ["b2file-merge", "-f"]
334 args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
335 subprocess.check_call(args)
336 print("Finished merging. Now remove the input files to save space.")
337 for input_file in file_list:
338 try:
339 os.remove(input_file)
340 except FileNotFoundError:
341 pass
342
343
class StateRecordingTask(Basf2PathTask):
    """
    Record the data for the three state filters for the CDCToSVDSpacePointCKF.

    This task requires that the events used for training have been simulated before, which is done using the
    ``SplitNMergeSimTask``.
    """
351
352 experiment_number = b2luigi.IntParameter()
355 random_seed = b2luigi.Parameter()
357 n_events = b2luigi.IntParameter()
359
360 layer = b2luigi.IntParameter()
362 def output(self):
363 """
364 Generate list of output files that the task should produce.
365 The task is considered finished if and only if the outputs all exist.
366 """
367 for record_fname in ["records1.root", "records2.root", "records3.root"]:
368 yield self.add_to_output(record_fname)
369
370 def requires(self):
371 """
372 This task only requires that the input files have been created.
373 """
374 yield SplitNMergeSimTask(
375 bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
376 experiment_number=self.experiment_number,
377 random_seed=self.random_seed,
378 n_events=self.n_events,
379 )
380
    def create_state_recording_path(self, layer, records1_fname, records2_fname, records3_fname):
        """
        Create a path for the recording. To record the data for the SVD state filters, CDC tracks are required, and these must
        be truth matched before. The data have to be recorded for each layer of the SVD, i.e. layers 3 to 6, but also for an
        artificial layer 7.

        :param layer: The layer for which the data are recorded.
        :param records1_fname: Name of the records1 file.
        :param records2_fname: Name of the records2 file.
        :param records3_fname: Name of the records3 file.
        """
392 path = basf2.create_path()
393
394 # get all the file names from the list of input files that are meant for training
395 file_list = [fname for sublist in self.get_input_file_names().values()
396 for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
397 path.add_module("RootInput", inputFileNames=file_list)
398
399 path.add_module("Gearbox")
400 path.add_module("Geometry")
401 path.add_module("SetupGenfitExtrapolation")
402
403 add_hit_preparation_modules(path, components=["SVD"])
404
405 add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)
406
407 path.add_module('TrackFinderMCTruthRecoTracks',
408 RecoTracksStoreArrayName="MCRecoTracks",
409 WhichParticles=[],
410 UsePXDHits=True,
411 UseSVDHits=True,
412 UseCDCHits=True)
413
414 path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
415 mcRecoTracksStoreArrayName="MCRecoTracks",
416 prRecoTracksStoreArrayName="CDCRecoTracks")
417 path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCRecoTracks")
418
419 path.add_module("CDCToSVDSpacePointCKF",
420 inputRecoTrackStoreArrayName="CDCRecoTracks",
421 outputRecoTrackStoreArrayName="VXDRecoTracks",
422 outputRelationRecoTrackStoreArrayName="CDCRecoTracks",
423
424 relationCheckForDirection="backward",
425 reverseSeed=False,
426 writeOutDirection="backward",
427
428 firstHighFilter="truth",
429 firstEqualFilter="recording",
430 firstEqualFilterParameters={"treeName": "records1", "rootFileName":
431 records1_fname, "returnWeight": 1.0},
432 firstLowFilter="none",
433 firstHighUseNStates=0,
434 firstToggleOnLayer=layer,
435
436 advanceHighFilter="advance",
437
438 secondHighFilter="truth",
439 secondEqualFilter="recording",
440 secondEqualFilterParameters={"treeName": "records2", "rootFileName":
441 records2_fname, "returnWeight": 1.0},
442 secondLowFilter="none",
443 secondHighUseNStates=0,
444 secondToggleOnLayer=layer,
445
446 updateHighFilter="fit",
447
448 thirdHighFilter="truth",
449 thirdEqualFilter="recording",
450 thirdEqualFilterParameters={"treeName": "records3", "rootFileName": records3_fname},
451 thirdLowFilter="none",
452 thirdHighUseNStates=0,
453 thirdToggleOnLayer=layer,
454
455 filter="none",
456 exportTracks=False,
457
458 enableOverlapResolving=False)
459
460 return path
461
    def create_path(self):
        """
        Create the basf2 path that records the training data for the state filters.
        """
        return self.create_state_recording_path(
            layer=self.layer,
            records1_fname=self.get_output_file_name("records1.root"),
            records2_fname=self.get_output_file_name("records2.root"),
            records3_fname=self.get_output_file_name("records3.root"),
        )
472
473
474class CKFStateFilterTeacherTask(Basf2Task):
475 """
476 A teacher task runs the basf2 mva teacher on the training data provided by a
477 data collection task.
478
479 In this task the three state filters are trained, each with the corresponding recordings from the different layers.
480 It will be executed for each FastBDT option defined in the MainTask.
481 """
482
483 experiment_number = b2luigi.IntParameter()
486 random_seed = b2luigi.Parameter()
488 n_events = b2luigi.IntParameter()
490 fast_bdt_option_state_filter = b2luigi.ListParameter(
492 hashed=True, default=[50, 8, 3, 0.1]
493
494 )
495
496 filter_number = b2luigi.IntParameter()
498 training_target = b2luigi.Parameter(
500 default="truth"
501
502 )
503
505 exclude_variables = b2luigi.ListParameter(
508 hashed=True, default=[
509 "id",
510 "last_id",
511 "number",
512 "last_layer",
513
514 "seed_cdc_hits",
515 "seed_svd_hits",
516 "seed_lowest_svd_layer",
517 "seed_lowest_cdc_layer",
518 "quality_index_triplet",
519 "quality_index_circle",
520 "quality_index_helix",
521 "cluster_1_charge",
522 "cluster_2_charge",
523 "mean_rest_cluster_charge",
524 "min_rest_cluster_charge",
525 "std_rest_cluster_charge",
526 "cluster_1_seed_charge",
527 "cluster_2_seed_charge",
528 "mean_rest_cluster_seed_charge",
529 "min_rest_cluster_seed_charge",
530 "std_rest_cluster_seed_charge",
531 "cluster_1_size",
532 "cluster_2_size",
533 "mean_rest_cluster_size",
534 "min_rest_cluster_size",
535 "std_rest_cluster_size",
536 "cluster_1_snr",
537 "cluster_2_snr",
538 "mean_rest_cluster_snr",
539 "min_rest_cluster_snr",
540 "std_rest_cluster_snr",
541 "cluster_1_charge_over_size",
542 "cluster_2_charge_over_size",
543 "mean_rest_cluster_charge_over_size",
544 "min_rest_cluster_charge_over_size",
545 "std_rest_cluster_charge_over_size",
546 ]
547
548 )
549
550 def get_weightfile_xml_identifier(self, fast_bdt_option=None, filter_number=1):
551 """
552 Name of the xml weightfile that is created by the teacher task.
553 It is subsequently used as a local weightfile in the following validation tasks.
554
555 :param fast_bdt_option: FastBDT option that is used to train this MVA
556 :param filter_number: Filter number (first=1, second=2, third=3) to be trained
557 """
558 if fast_bdt_option is None:
559 fast_bdt_option = self.fast_bdt_option_state_filter
560 fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
561 weightfile_name = f"trk_CDCToSVDSpacePointStateFilter_{filter_number}" + fast_bdt_string
562 return weightfile_name + ".xml"
563
    def requires(self):
        """
        This task requires the recordings for the state filters, one for each layer.
        """
568 for layer in [3, 4, 5, 6, 7]:
569 yield self.clone(
570 StateRecordingTask,
571 experiment_number=self.experiment_number,
572 n_events=self.n_events,
573 random_seed="training",
574 layer=layer,
575 )
576
577 def output(self):
578 """
579 Generate list of output files that the task should produce.
580 The task is considered finished if and only if the outputs all exist.
581 """
582 yield self.add_to_output(self.get_weightfile_xml_identifier(filter_number=self.filter_number))
583
584 def process(self):
585 """
586 Use basf2_mva teacher to create MVA weightfile from collected training
587 data variables.
588
589 This is the main process that is dispatched by the ``run`` method that
590 is inherited from ``Basf2Task``.
591 """
592 records_files = self.get_input_file_names(f"records{self.filter_number}.root")
593 tree_name = f"records{self.filter_number}"
594 print(f"Processed records files: {records_files=},\nfeature tree name: {tree_name=}")
595
596 my_basf2_mva_teacher(
597 records_files=records_files,
598 tree_name=tree_name,
599 weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier(filter_number=self.filter_number)),
600 target_variable=self.training_target,
601 exclude_variables=self.exclude_variables,
602 fast_bdt_option=self.fast_bdt_option_state_filter,
603 )
604
605
606class ResultRecordingTask(Basf2PathTask):
607 """
608 Task to record data for the final result filter. This requires trained state filters.
609 The cuts on the state filter classifiers are set to rather low values to ensure that all signal is contained in the
610 recorded file. Also, the values for XXXXXHighUseNStates are chosen conservatively, i.e. rather on the high side.
611 """
612
613
614 experiment_number = b2luigi.IntParameter()
617 random_seed = b2luigi.Parameter()
619 n_events = b2luigi.IntParameter()
621 fast_bdt_option_state_filter = b2luigi.ListParameter(
623 hashed=True, default=[50, 8, 3, 0.1]
624
625 )
626
627 result_filter_records_name = b2luigi.Parameter()
629 def output(self):
630 """
631 Generate list of output files that the task should produce.
632 The task is considered finished if and only if the outputs all exist.
633 """
634 yield self.add_to_output(self.result_filter_records_name)
635
    def requires(self):
        """
        This task requires that the training SplitNMergeSimTask is finished, as well as that the state filters are trained
        using the CKFStateFilterTeacherTask.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            random_seed=self.random_seed,
            n_events=self.n_events,
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events,
                random_seed=self.random_seed,
                filter_number=filter_number,
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )
657
658 def create_result_recording_path(self, result_filter_records_name):
659 """
660 Create a path for the recording of the result filter. This file is then used to train the result filter.
661
662 :param result_filter_records_name: Name of the recording file.
663 """
664
665 path = basf2.create_path()
666
667 # get all the file names from the list of input files that are meant for training
668 file_list = [fname for sublist in self.get_input_file_names().values()
669 for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
670 path.add_module("RootInput", inputFileNames=file_list)
671
672 path.add_module("Gearbox")
673 path.add_module("Geometry")
674 path.add_module("SetupGenfitExtrapolation")
675
676 add_hit_preparation_modules(path, components=["SVD"])
677
678 add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)
679
680 path.add_module('TrackFinderMCTruthRecoTracks',
681 RecoTracksStoreArrayName="MCRecoTracks",
682 WhichParticles=[],
683 UsePXDHits=True,
684 UseSVDHits=True,
685 UseCDCHits=True)
686
687 path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
688 mcRecoTracksStoreArrayName="MCRecoTracks",
689 prRecoTracksStoreArrayName="CDCRecoTracks")
690 path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCRecoTracks")
691
692 fast_bdt_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
693 path.add_module("CDCToSVDSpacePointCKF",
694 inputRecoTrackStoreArrayName="CDCRecoTracks",
695 outputRecoTrackStoreArrayName="VXDRecoTracks",
696 outputRelationRecoTrackStoreArrayName="CDCRecoTracks",
697
698 relationCheckForDirection="backward",
699 reverseSeed=False,
700 writeOutDirection="backward",
701
702 firstHighFilter="mva_with_direction_check",
703 firstHighFilterParameters={
704 "identifier": self.get_input_file_names(f"trk_CDCToSVDSpacePointStateFilter_1{fast_bdt_string}.xml")[0],
705 "cut": 0.001,
706 "direction": "backward"},
707 firstHighUseNStates=10,
708
709 advanceHighFilter="advance",
710 advanceHighFilterParameters={"direction": "backward"},
711
712 secondHighFilter="mva",
713 secondHighFilterParameters={
714 "identifier": self.get_input_file_names(f"trk_CDCToSVDSpacePointStateFilter_2{fast_bdt_string}.xml")[0],
715 "cut": 0.001},
716 secondHighUseNStates=10,
717
718 updateHighFilter="fit",
719
720 thirdHighFilter="mva",
721 thirdHighFilterParameters={
722 "identifier": self.get_input_file_names(f"trk_CDCToSVDSpacePointStateFilter_3{fast_bdt_string}.xml")[0],
723 "cut": 0.001},
724 thirdHighUseNStates=10,
725
726 filter="recording",
727 filterParameters={"rootFileName": result_filter_records_name},
728 exportTracks=False,
729
730 enableOverlapResolving=True)
731
732 return path
733
    def create_path(self):
        """
        Create the basf2 path that records the training data for the result filter.
        """
        return self.create_result_recording_path(
            result_filter_records_name=self.get_output_file_name(self.result_filter_records_name),
        )
741
742
class CKFResultFilterTeacherTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    In this task the result filter is trained, using the records written by the
    ``ResultRecordingTask``.
    """
754
755 experiment_number = b2luigi.IntParameter()
758 random_seed = b2luigi.Parameter()
760 n_events = b2luigi.IntParameter()
762 fast_bdt_option_state_filter = b2luigi.ListParameter(
764 hashed=True, default=[50, 8, 3, 0.1]
765
766 )
767
768 fast_bdt_option_result_filter = b2luigi.ListParameter(
770 hashed=True, default=[200, 8, 3, 0.1]
771
772 )
773
774 result_filter_records_name = b2luigi.Parameter()
776 training_target = b2luigi.Parameter(
778 default="truth"
779
780 )
781
783 exclude_variables = b2luigi.ListParameter(
785 hashed=True, default=[]
786
787 )
788
789 def get_weightfile_xml_identifier(self, fast_bdt_option=None):
790 """
791 Name of the xml weightfile that is created by the teacher task.
792 It is subsequently used as a local weightfile in the following validation tasks.
793
794 :param fast_bdt_option: FastBDT option that is used to train this MVA
795 """
796 if fast_bdt_option is None:
797 fast_bdt_option = self.fast_bdt_option_result_filter
798 fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
799 weightfile_name = "trk_CDCToSVDSpacePointResultFilter" + fast_bdt_string
800 return weightfile_name + ".xml"
801
    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield ResultRecordingTask(
            experiment_number=self.experiment_number,
            n_events=self.n_events,
            random_seed=self.random_seed,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            result_filter_records_name=self.result_filter_records_name,
        )
813
814 def output(self):
815 """
816 Generate list of output files that the task should produce.
817 The task is considered finished if and only if the outputs all exist.
818 """
819 yield self.add_to_output(self.get_weightfile_xml_identifier())
820
821 def process(self):
822 """
823 Use basf2_mva teacher to create MVA weightfile from collected training
824 data variables.
825
826 This is the main process that is dispatched by the ``run`` method that
827 is inherited from ``Basf2Task``.
828 """
829 records_files = self.get_input_file_names(self.result_filter_records_name)
830 tree_name = "records"
831 print(f"Processed records files for result filter training: {records_files=},\nfeature tree name: {tree_name=}")
832
833 my_basf2_mva_teacher(
834 records_files=records_files,
835 tree_name=tree_name,
836 weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier()),
837 target_variable=self.training_target,
838 exclude_variables=self.exclude_variables,
839 fast_bdt_option=self.fast_bdt_option_result_filter,
840 )
841
842
843class ValidationAndOptimisationTask(Basf2PathTask):
844 """
845 Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values
846 for the states, the number of best candidates kept after each filter, and similar for the result filter.
847 """
848
849 experiment_number = b2luigi.IntParameter()
851 n_events_training = b2luigi.IntParameter()
853 fast_bdt_option_state_filter = b2luigi.ListParameter(
854 # ## \cond
855 hashed=True, default=[50, 8, 3, 0.1]
856 # ## \endcond
857 )
858
859 fast_bdt_option_result_filter = b2luigi.ListParameter(
860 # ## \cond
861 hashed=True, default=[200, 8, 3, 0.1]
862 # ## \endcond
863 )
864
865 n_events_testing = b2luigi.IntParameter()
867 state_filter_cut = b2luigi.FloatParameter()
869 use_n_best_states = b2luigi.IntParameter()
871 result_filter_cut = b2luigi.FloatParameter()
873 use_n_best_results = b2luigi.IntParameter()
875 def output(self):
876 """
877 Generate list of output files that the task should produce.
878 The task is considered finished if and only if the outputs all exist.
879 """
880 fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
881 fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
882 yield self.add_to_output(
883 f"cdc_to_svd_spacepoint_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root")
884
    def requires(self):
        """
        This task requires trained result filters, trained state filters, and that an independent data set for validation was
        created using the SplitNMergeSimTask with the random seed ``optimisation``.
        """
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        yield CKFResultFilterTeacherTask(
            result_filter_records_name=f"filter_records{fbdt_state_filter_string}.root",
            experiment_number=self.experiment_number,
            n_events=self.n_events_training,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            fast_bdt_option_result_filter=self.fast_bdt_option_result_filter,
            random_seed='training'
        )
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events_testing,
            random_seed="optimisation",
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                random_seed="training",
                n_events=self.n_events_training,
                filter_number=filter_number,
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )
915
    def create_optimisation_and_validation_path(self):
        """
        Create a path to validate the trained filters.
        """
        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for optimisation / validation
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD"])

        # CDC-only track finding provides the seed tracks for the CKF
        add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)

        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
        # run the CKF with the trained state filters and result filter and the cut values under test
        path.add_module("CDCToSVDSpacePointCKF",

                        inputRecoTrackStoreArrayName="CDCRecoTracks",
                        outputRecoTrackStoreArrayName="VXDRecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva_with_direction_check",
                        firstHighFilterParameters={
                            "identifier": self.get_input_file_names(
                                f"trk_CDCToSVDSpacePointStateFilter_1{fbdt_state_filter_string}.xml")[0],
                            "cut": self.state_filter_cut,
                            "direction": "backward"},
                        firstHighUseNStates=self.use_n_best_states,

                        advanceHighFilter="advance",
                        advanceHighFilterParameters={"direction": "backward"},

                        secondHighFilter="mva",
                        secondHighFilterParameters={
                            "identifier": self.get_input_file_names(
                                f"trk_CDCToSVDSpacePointStateFilter_2{fbdt_state_filter_string}.xml")[0],
                            "cut": self.state_filter_cut},
                        secondHighUseNStates=self.use_n_best_states,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters={
                            "identifier": self.get_input_file_names(
                                f"trk_CDCToSVDSpacePointStateFilter_3{fbdt_state_filter_string}.xml")[0],
                            "cut": self.state_filter_cut},
                        thirdHighUseNStates=self.use_n_best_states,

                        filter="mva",
                        filterParameters={
                            "identifier": self.get_input_file_names(
                                f"trk_CDCToSVDSpacePointResultFilter{fbdt_result_filter_string}.xml")[0],
                            "cut": self.result_filter_cut},
                        useBestNInSeed=self.use_n_best_results,

                        exportTracks=True,
                        enableOverlapResolving=True)

        # combine the CDC tracks and the SVD parts attached by the CKF into one array
        path.add_module('RelatedTracksCombiner',
                        VXDRecoTracksStoreArrayName="VXDRecoTracks",
                        CDCRecoTracksStoreArrayName="CDCRecoTracks",
                        recoTracksStoreArrayName="RecoTracks")

        # MC truth tracks serve as the reference for the validation
        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        # match the reconstructed tracks to the MC truth tracks, using SVD and CDC hits only
        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="RecoTracks")

        # write the validation output for this parameter combination
        path.add_module(
            CombinedTrackingValidationModule(
                output_file_name=self.get_output_file_name(
                    f"cdc_to_svd_spacepoint_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root"),
                reco_tracks_name="RecoTracks",
                mc_reco_tracks_name="MCRecoTracks",
                name="",
                contact="",
                expert_level=200))

        return path

    def create_path(self):
        """
        Create the basf2 path to process, which is the optimisation and validation path.
        """
        return self.create_optimisation_and_validation_path()


class MainTask(b2luigi.WrapperTask):
    """
    Wrapper task that needs to finish for b2luigi to finish running this steering file.

    It is done if the outputs of all required subtasks exist. It is thus at the
    top of the luigi task graph. Edit the ``requires`` method to steer which
    tasks you want to run and with which parameters.
    """

    # Number of events to generate for the training data set.
    n_events_training = b2luigi.get_setting(
        "n_events_training", default=1000
    )

    # Number of events to generate for the test data set.
    n_events_testing = b2luigi.get_setting(
        "n_events_testing", default=500
    )

    # Number of events per simulation task.
    n_events_per_task = b2luigi.get_setting(
        "n_events_per_task", default=100
    )

    # Number of basf2 processes per worker.
    num_processes = b2luigi.get_setting(
        "basf2_processes_per_worker", default=0
    )

    # Dictionary with experiment numbers as keys and background directory paths as values.
    bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
    # Transform the dictionary keys (experiment numbers) from strings to int.
    bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}

    def requires(self):
        """
        Generate the list of tasks that need to be done for luigi to finish running
        this steering file.
        """

        fast_bdt_options = [
            [50, 8, 3, 0.1],
            [100, 8, 3, 0.1],
            [200, 8, 3, 0.1],
        ]

        experiment_numbers = b2luigi.get_setting("experiment_numbers")

        # iterate over all possible combinations of parameters from the above defined parameter lists
        for experiment_number, fast_bdt_option_state_filter, fast_bdt_option_result_filter in itertools.product(
            experiment_numbers, fast_bdt_options, fast_bdt_options
        ):

            state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
            n_best_states_list = [3, 5, 10]
            result_filter_cuts = [0.05, 0.1, 0.2]
            n_best_results_list = [3, 5, 10]
            for state_filter_cut, n_best_states, result_filter_cut, n_best_results in \
                    itertools.product(state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list):
                yield self.clone(
                    ValidationAndOptimisationTask,
                    experiment_number=experiment_number,
                    n_events_training=self.n_events_training,
                    n_events_testing=self.n_events_testing,
                    state_filter_cut=state_filter_cut,
                    use_n_best_states=n_best_states,
                    result_filter_cut=result_filter_cut,
                    use_n_best_results=n_best_results,
                    fast_bdt_option_state_filter=fast_bdt_option_state_filter,
                    fast_bdt_option_result_filter=fast_bdt_option_result_filter,
                )


if __name__ == "__main__":
    b2luigi.set_setting("env_script", "./setup_basf2.sh")
    b2luigi.set_setting("batch_system", "htcondor")
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(MainTask(), workers=workers, batch=True)
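
The nested ``itertools.product`` loops in ``MainTask.requires`` span a fairly large parameter grid, so it can help to estimate the number of ``ValidationAndOptimisationTask`` instances before submitting to the batch system. The following is a minimal standalone sketch, not part of the steering file: the parameter lists are copied from ``MainTask.requires``, while the helper ``count_validation_tasks`` and the experiment numbers passed to it are hypothetical and only serve as an illustration.

```python
import itertools

# Parameter lists copied from MainTask.requires in the listing above.
fast_bdt_options = [[50, 8, 3, 0.1], [100, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
n_best_states_list = [3, 5, 10]
result_filter_cuts = [0.05, 0.1, 0.2]
n_best_results_list = [3, 5, 10]


def count_validation_tasks(experiment_numbers):
    """Hypothetical helper: count the parameter combinations MainTask.requires iterates over."""
    # Outer loop: experiment number x state filter FastBDT option x result filter FastBDT option.
    outer = itertools.product(experiment_numbers, fast_bdt_options, fast_bdt_options)
    # Inner loop: the cut values and the numbers of best states / results to keep.
    inner = list(itertools.product(
        state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list))
    return sum(len(inner) for _ in outer)


if __name__ == "__main__":
    # The experiment number 0 is an assumption for illustration only.
    print(count_validation_tasks([0]))
```

With the lists as given, the inner grid has 6 * 3 * 3 * 3 = 162 combinations and the outer grid has 3 * 3 = 9 FastBDT option pairs per experiment, so each experiment number contributes 1458 validation tasks. The upstream simulation, recording, and teacher tasks depend on fewer parameters, and luigi schedules tasks with identical parameters only once, so the number of trainings is much smaller than the number of validation tasks.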