combined_cdc_to_svd_ckf_mva_training.py
"""
combined_cdc_to_svd_ckf_mva_training
-----------------------------------------

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the training and validation of the classifiers of
the three MVA-based state filters and one result filter of the CDCToSVDSpacePointCKF.
This CKF extrapolates tracks found in the CDC into the SVD and adds SVD hits using a
combinatorial tree search and a Kalman filter based track fit in each step.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

The order of the b2luigi tasks in this script is as follows (top to bottom):

* Two tasks to create input samples for training and testing (``GenerateSimTask`` and
  ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes a number of events to be
  generated and a number of events per task to reduce runtime. It then divides the total
  number of events by the number of events per task and creates as many ``GenerateSimTask``
  instances as needed, each with a specific random seed, so that in the end the total
  number of training and testing events is simulated. The individual files are then
  combined by the ``SplitNMergeSimTask`` into one file each for training and testing.
* The ``StateRecordingTask`` writes out the data required for training the state
  filters.
* The ``CKFStateFilterTeacherTask`` trains the state filter MVAs, using FastBDT by
  default, with a given set of options.
* The ``ResultRecordingTask`` writes out the data used for the training of the result
  filter MVA. This task requires that the state filters have been trained before.
* The ``CKFResultFilterTeacherTask`` trains the result filter MVA, using FastBDT by
  default, with a given set of FastBDT options. This requires that the result filter
  records have been created with the ``ResultRecordingTask``.
* The ``ValidationAndOptimisationTask`` uses the trained weight files and the cut values
  provided to run the tracking chain with the weight file under test, and also
  runs the tracking validation.
* Finally, the ``MainTask`` is the "brain" of the script. It invokes the
  ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
  and cut values on the MVA classifier output.

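The splitting done by ``SplitNMergeSimTask`` can be sketched in plain Python. This is
only an illustration and not the task itself: the helper name ``split_into_chunks`` and
the example numbers are hypothetical, while the ``divmod`` logic and the zero-padded
seed suffix follow the ``requires`` method of ``SplitNMergeSimTask`` in this file.

```python
def split_into_chunks(n_events, n_events_per_task, seed_prefix):
    # Return one (random_seed, n_events) pair per simulation job.
    quotient, remainder = divmod(n_events, n_events_per_task)
    # Full-size jobs, each with its own zero-padded seed suffix.
    chunks = [(f"{seed_prefix}_{i:03d}", n_events_per_task) for i in range(quotient)]
    # One extra, smaller job picks up the events that are left over.
    if remainder > 0:
        chunks.append((f"{seed_prefix}_{quotient:03d}", remainder))
    return chunks


chunks = split_into_chunks(2500, 1000, "training")
print(chunks)
# [('training_000', 1000), ('training_001', 1000), ('training_002', 500)]
```
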
Due to the dependencies, the tasks are called in reverse order. The ``MainTask``
calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
values, and the ``ValidationAndOptimisationTask`` itself calls the required teacher,
training, and simulation tasks.

The result filter is trained separately for each combination of FastBDT options, state
filter cut values, and candidate selection. The ``ResultRecordingTask`` is therefore
executed multiple times, once for each of these combinations.

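How such a grid of combinations could be enumerated is sketched below with
``itertools.product``. The option lists, cut values, and variable names are made-up
placeholders, and the actual ``MainTask`` may build its combinations differently.

```python
import itertools

# Hypothetical scan values, chosen only for illustration.
fast_bdt_options = [[50, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.05]
n_best_states = [3, 5]

# Every element is one (FastBDT option, cut value, candidate selection) combination.
combinations = list(itertools.product(fast_bdt_options, state_filter_cuts, n_best_states))

for fast_bdt_option, cut, n_best in combinations:
    print(fast_bdt_option, cut, n_best)
```

With two values per axis this yields 2 x 2 x 2 = 8 combinations, so the number of
trainings grows quickly with every additional scan value.
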
b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
parameters, output files, and which other tasks with which parameters it depends
on. For example, a teacher task, which runs ``basf2_mva_teacher.py`` to train the
classifier, depends on a data collection task which runs a reconstruction and
writes out track-wise variables into a root file for training. An
evaluation/validation task for testing the classifier requires both the teacher
task, as it needs the weightfile to be present, and also a data collection task,
because it needs a dataset for testing the classifier.

The final task that defines which tasks need to be done for the steering file to
finish is the ``MainTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the ``MainTask``
or replace them by lower-level tasks during debugging.

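The effect of such a dependency graph can be illustrated without luigi. The sketch
below is not how the luigi scheduler is implemented; it is a plain depth-first
traversal. The function name ``execution_order`` is hypothetical, and the requirement
map is transcribed by hand from the ``requires`` methods and the task list in this file.

```python
def execution_order(requirements, target):
    # List every task after all of the tasks it requires (depth-first traversal).
    order = []

    def visit(task):
        if task in order:
            return
        for required in requirements.get(task, []):
            visit(required)
        order.append(task)

    visit(target)
    return order


requirements = {
    "MainTask": ["ValidationAndOptimisationTask"],
    "ValidationAndOptimisationTask": [
        "CKFResultFilterTeacherTask", "SplitNMergeSimTask", "CKFStateFilterTeacherTask"],
    "CKFResultFilterTeacherTask": ["ResultRecordingTask"],
    "ResultRecordingTask": ["SplitNMergeSimTask", "CKFStateFilterTeacherTask"],
    "CKFStateFilterTeacherTask": ["StateRecordingTask"],
    "StateRecordingTask": ["SplitNMergeSimTask"],
    "SplitNMergeSimTask": ["GenerateSimTask"],
}

order = execution_order(requirements, "MainTask")
print(order)
```

Asking for a lower-level target, for example ``"CKFStateFilterTeacherTask"``, returns only
the part of the chain that this task needs, which mirrors the debugging advice above.
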
Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling. It can be installed
via pip::

    python3 -m pip install [--user] b2luigi

Use the ``--user`` option if you do not have the rights to install python packages into
your externals (e.g. because you are using cvmfs) and install them in
``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.

Usage
~~~~~

You can test the b2luigi task definitions without running them via::

    python3 combined_cdc_to_svd_ckf_mva_training.py --dry-run
    python3 combined_cdc_to_svd_ckf_mva_training.py --show-output

This will show the outputs and potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_cdc_to_svd_ckf_mva_training.py

One can use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` if b2luigi was installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_cdc_to_svd_ckf_mva_training.py --scheduler-port 8886

To view the web interface, open your web browser and enter the following into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web browser, you have two options:

    1. Run both the steering file and ``luigid`` remotely and use
       ssh-port-forwarding to your local host. To do so, run on your local
       machine::

           ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

    2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
       local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path`` set in
``settings.json``. The directory tree encodes the used b2luigi parameters. This
ensures reproducibility and makes parameter searches easy. Sometimes, it is hard to
find the relevant output files. You can view the whole directory structure by
running ``tree <result_path>``. Use the unix ``find`` command to find the files
that interest you, e.g.::

    find <result_path> -name "*.root" # find all ROOT files
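
The same search can be done from Python, for example in a notebook. This is a minimal
sketch that uses only the standard library; the function name ``find_root_files`` is
hypothetical and not part of this steering file.

```python
from pathlib import Path


def find_root_files(result_path):
    # Recursively collect every file below result_path whose name ends in .root.
    return sorted(p for p in Path(result_path).rglob("*.root") if p.is_file())
```

Calling ``find_root_files`` with the ``result_path`` from ``settings.json`` returns a
sorted list of ``Path`` objects.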
"""

import itertools
import subprocess
import os

import basf2_mva
import basf2
from tracking import add_track_finding
from tracking.path_utils import add_hit_preparation_modules
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule
import background
import simulation

from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string
from tracking_mva_filter_payloads.write_tracking_mva_filter_payloads_to_db import write_tracking_mva_filter_payloads_to_db

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise


class GenerateSimTask(Basf2PathTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    queue = 'l'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        basf2.set_random_seed(self.random_seed)
        path = basf2.create_path()
        path.add_module(
            "EventInfoSetter", evtNumList=[self.n_events], runList=[0], expList=[self.experiment_number]
        )
        path.add_module("EvtGenInput")
        bkg_files = ""
        # \cond suppress doxygen warning
        # NOTE: both branch bodies are missing from this listing. The two calls below are an
        # assumption based on the ``background`` import and the ``bkgfiles_dir`` parameter.
        if self.experiment_number == 0:
            bkg_files = background.get_background_files()
        else:
            bkg_files = background.get_background_files(self.bkgfiles_dir)
        # \endcond

        simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, usePXDDataReduction=False)

        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.output_file_name()),
        )
        return path


# I don't use the default MergeTask or similar because they only work if every input file is called the same.
# Additionally, I want to add more features like deleting the original input to save storage space.
class SplitNMergeSimTask(Basf2Task):
    """
    Split the requested number of events into several ``GenerateSimTask`` jobs and
    merge their output into a single file of simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    queue = 'sx'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def requires(self):
        """
        This task requires several GenerateSimTask to be finished so that the required number of events is created.
        """
        n_events_per_task = MainTask.n_events_per_task
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        for i in range(quotient):
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MainTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,
                experiment_number=self.experiment_number,
            )
        if remainder > 0:
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MainTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),
                n_events=remainder,
                experiment_number=self.experiment_number,
            )

    @b2luigi.on_temporary_files
    def process(self):
        """
        When all GenerateSimTasks finished, merge the output.
        """
        create_output_dirs(self)

        file_list = [item for sublist in self.get_input_file_names().values() for item in sublist]
        print("Merge the following files:")
        print(file_list)
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)
        print("Finished merging. Now remove the input files to save space.")
        for input_file in file_list:
            try:
                os.remove(input_file)
            except FileNotFoundError:
                pass


class StateRecordingTask(Basf2PathTask):
    """
    Record the data for the three state filters for the CDCToSVDSpacePointCKF.

    This task requires that the events used for training have been simulated before, which is done using the
    ``SplitNMergeSimTask``.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()

    layer = b2luigi.IntParameter()

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        for record_fname in ["records1.root", "records2.root", "records3.root"]:
            yield self.add_to_output(record_fname)

    def requires(self):
        """
        This task only requires that the input files have been created.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            random_seed=self.random_seed,
            n_events=self.n_events,
        )

    def create_state_recording_path(self, layer, records1_fname, records2_fname, records3_fname):
        """
        Create a path for the recording. To record the data for the SVD state filters, CDC tracks are required, and these must
        be truth matched before. The data have to be recorded for each layer of the SVD, i.e. layers 3 to 6, but also an
        artificial layer 7.

        :param layer: The layer for which the data are recorded.
        :param records1_fname: Name of the records1 file.
        :param records2_fname: Name of the records2 file.
        :param records3_fname: Name of the records3 file.
        """
        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD"])

        add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCRecoTracks")

        path.add_module("CDCToSVDSpacePointCKF",
                        inputRecoTrackStoreArrayName="CDCRecoTracks",
                        outputRecoTrackStoreArrayName="VXDRecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="truth",
                        firstEqualFilter="recording",
                        firstEqualFilterParameters={"treeName": "records1", "rootFileName":
                                                    records1_fname, "returnWeight": 1.0},
                        firstLowFilter="none",
                        firstHighUseNStates=0,
                        firstToggleOnLayer=layer,

                        advanceHighFilter="advance",

                        secondHighFilter="truth",
                        secondEqualFilter="recording",
                        secondEqualFilterParameters={"treeName": "records2", "rootFileName":
                                                     records2_fname, "returnWeight": 1.0},
                        secondLowFilter="none",
                        secondHighUseNStates=0,
                        secondToggleOnLayer=layer,

                        updateHighFilter="fit",

                        thirdHighFilter="truth",
                        thirdEqualFilter="recording",
                        thirdEqualFilterParameters={"treeName": "records3", "rootFileName": records3_fname},
                        thirdLowFilter="none",
                        thirdHighUseNStates=0,
                        thirdToggleOnLayer=layer,

                        filter="none",
                        exportTracks=False,

                        enableOverlapResolving=False)

        return path

    def create_path(self):
        """
        Create basf2 path that records the training data for the three state filters.
        """
        return self.create_state_recording_path(
            layer=self.layer,
            records1_fname=self.get_output_file_name("records1.root"),
            records2_fname=self.get_output_file_name("records2.root"),
            records3_fname=self.get_output_file_name("records3.root"),
        )


class CKFStateFilterTeacherTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    In this task the three state filters are trained, each with the corresponding recordings from the different layers.
    It will be executed for each FastBDT option defined in the MainTask.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()
    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    filter_number = b2luigi.IntParameter()
    training_target = b2luigi.Parameter(
        default="truth"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[
            "id",
            "last_id",
            "number",
            "last_layer",

            "seed_cdc_hits",
            "seed_svd_hits",
            "seed_lowest_svd_layer",
            "seed_lowest_cdc_layer",
            "quality_index_triplet",
            "quality_index_circle",
            "quality_index_helix",
            "cluster_1_charge",
            "cluster_2_charge",
            "mean_rest_cluster_charge",
            "min_rest_cluster_charge",
            "std_rest_cluster_charge",
            "cluster_1_seed_charge",
            "cluster_2_seed_charge",
            "mean_rest_cluster_seed_charge",
            "min_rest_cluster_seed_charge",
            "std_rest_cluster_seed_charge",
            "cluster_1_size",
            "cluster_2_size",
            "mean_rest_cluster_size",
            "min_rest_cluster_size",
            "std_rest_cluster_size",
            "cluster_1_snr",
            "cluster_2_snr",
            "mean_rest_cluster_snr",
            "min_rest_cluster_snr",
            "std_rest_cluster_snr",
            "cluster_1_charge_over_size",
            "cluster_2_charge_over_size",
            "mean_rest_cluster_charge_over_size",
            "min_rest_cluster_charge_over_size",
            "std_rest_cluster_charge_over_size",
        ]
    )

    def get_weightfile_identifier(self, fast_bdt_option=None, filter_number=None):
        """
        Name of weightfile that is created by the teacher task.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        :param filter_number: Filter number (first=1, second=2, third=3) to be trained
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_state_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        if filter_number is None:
            filter_number = self.filter_number
        weightfile_name = f"trk_CDCToSVDSpacePointStateFilter_{filter_number}" + fast_bdt_string
        return weightfile_name

    def requires(self):
        """
        This task requires the recordings for the state filters from all layers.
        """
        for layer in [3, 4, 5, 6, 7]:
            yield self.clone(
                StateRecordingTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events,
                random_seed="training",
                layer=layer,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_identifier() + ".root")

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(f"records{self.filter_number}.root")
        weightfile_identifier = self.get_weightfile_identifier(filter_number=self.filter_number)
        tree_name = f"records{self.filter_number}"
        print(f"Processed records files: {records_files},\nfeature tree name: {tree_name}")

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=weightfile_identifier,
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_state_filter,
        )
        basf2_mva.download(weightfile_identifier, self.get_output_file_name(weightfile_identifier + ".root"))


class ResultRecordingTask(Basf2PathTask):
    """
    Task to record data for the final result filter. This requires trained state filters.
    The cuts on the state filter classifiers are set to rather low values to ensure that all signal is contained in the
    recorded file. Also, the values for XXXXXHighUseNStates are chosen conservatively, i.e. rather on the high side.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()
    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    result_filter_records_name = b2luigi.Parameter()
    # prepend the testing payloads
    basf2.conditions.prepend_testing_payloads("localdb/database.txt")

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.result_filter_records_name)

    def requires(self):
        """
        This task requires that the training SplitNMergeSimTask is finished, as well as that the state filters are trained
        using the CKFStateFilterTeacherTask.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            random_seed=self.random_seed,
            n_events=self.n_events,
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events,
                random_seed=self.random_seed,
                filter_number=filter_number,
                # use the parameter name that CKFStateFilterTeacherTask actually defines
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )

    def create_result_recording_path(self, result_filter_records_name):
        """
        Create a path for the recording of the result filter. This file is then used to train the result filter.

        :param result_filter_records_name: Name of the recording file.
        """

        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD"])

        add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCRecoTracks")

        fast_bdt_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        # write the tracking MVA filter parameters and the cut on MVA classifier to be applied on a local db
        # NOTE: the opening line of each of the three calls below is missing from this listing; the function
        # name is taken from the import at the top of the file.
        iov = [0, 0, 0, -1]
        write_tracking_mva_filter_payloads_to_db(
            f"trk_CDCToSVDSpacePointStateFilter_1_Parameter{fast_bdt_string}",
            iov,
            f"trk_CDCToSVDSpacePointStateFilter_1{fast_bdt_string}",
            0.001)

        write_tracking_mva_filter_payloads_to_db(
            f"trk_CDCToSVDSpacePointStateFilter_2_Parameter{fast_bdt_string}",
            iov,
            f"trk_CDCToSVDSpacePointStateFilter_2{fast_bdt_string}",
            0.001)

        write_tracking_mva_filter_payloads_to_db(
            f"trk_CDCToSVDSpacePointStateFilter_3_Parameter{fast_bdt_string}",
            iov,
            f"trk_CDCToSVDSpacePointStateFilter_3{fast_bdt_string}",
            0.001)

        basf2.conditions.prepend_testing_payloads("localdb/database.txt")
        first_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_1_Parameter{fast_bdt_string}",
                                        "direction": "backward"}
        second_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_2_Parameter{fast_bdt_string}"}
        third_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_3_Parameter{fast_bdt_string}"}

        path.add_module("CDCToSVDSpacePointCKF",
                        inputRecoTrackStoreArrayName="CDCRecoTracks",
                        outputRecoTrackStoreArrayName="VXDRecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva_with_direction_check",
                        firstHighFilterParameters=first_high_filter_parameters,
                        firstHighUseNStates=10,

                        advanceHighFilter="advance",
                        advanceHighFilterParameters={"direction": "backward"},

                        secondHighFilter="mva",
                        secondHighFilterParameters=second_high_filter_parameters,
                        secondHighUseNStates=10,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters=third_high_filter_parameters,
                        thirdHighUseNStates=10,

                        filter="recording",
                        filterParameters={"rootFileName": result_filter_records_name},
                        exportTracks=False,

                        enableOverlapResolving=True)

        return path

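    def create_path(self):
        """
        Create basf2 path that records the training data for the result filter.
        """
        # NOTE: the line opening this call is missing from this listing; it is restored here
        # from the method defined above.
        return self.create_result_recording_path(
            result_filter_records_name=self.get_output_file_name(self.result_filter_records_name),
        )

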
class CKFResultFilterTeacherTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    Since teacher tasks are needed for all quality estimators covered by this
    steering file and the only thing that changes is the required data
    collection task and some training parameters, I decided to use inheritance
    and have the basic functionality in this base class/interface and have the
    specific teacher tasks inherit from it.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()
    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    fast_bdt_option_result_filter = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

    result_filter_records_name = b2luigi.Parameter()
    training_target = b2luigi.Parameter(
        default="truth"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[]
    )

    def get_weightfile_identifier(self, fast_bdt_option=None):
        """
        Name of weightfile that is created by the teacher task.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_result_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = "trk_CDCToSVDSpacePointResultFilter" + fast_bdt_string
        return weightfile_name

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        # NOTE: the line opening this call is missing from this listing. ``ResultRecordingTask`` is an
        # assumption based on the keyword arguments, which match the parameters of that task.
        yield ResultRecordingTask(
            experiment_number=self.experiment_number,
            n_events=self.n_events,
            random_seed=self.random_seed,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            result_filter_records_name=self.result_filter_records_name,
        )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_identifier() + ".root")

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(self.result_filter_records_name)
        tree_name = "records"
        print(f"Processed records files for result filter training: {records_files},\nfeature tree name: {tree_name}")
        weightfile_identifier = self.get_weightfile_identifier()
        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=weightfile_identifier,
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_result_filter,
        )

        basf2_mva.download(weightfile_identifier, self.get_output_file_name(weightfile_identifier + ".root"))


class ValidationAndOptimisationTask(Basf2PathTask):
    """
    Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values
    for the states, the number of best candidates kept after each filter, and similar for the result filter.
    """

    experiment_number = b2luigi.IntParameter()
    n_events_training = b2luigi.IntParameter()
    fast_bdt_option_state_filter = b2luigi.ListParameter(
        # ## \cond
        hashed=True, default=[50, 8, 3, 0.1]
        # ## \endcond
    )

    fast_bdt_option_result_filter = b2luigi.ListParameter(
        # ## \cond
        hashed=True, default=[200, 8, 3, 0.1]
        # ## \endcond
    )

    n_events_testing = b2luigi.IntParameter()
    state_filter_cut = b2luigi.FloatParameter()
    use_n_best_states = b2luigi.IntParameter()
    result_filter_cut = b2luigi.FloatParameter()
    use_n_best_results = b2luigi.IntParameter()
    # prepend the testing payloads
    basf2.conditions.prepend_testing_payloads("localdb/database.txt")

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
        yield self.add_to_output(
            f"cdc_to_svd_spacepoint_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root")

    def requires(self):
        """
        This task requires trained result filters, trained state filters, and that an independent data set for validation was
        created using the SplitNMergeSimTask with the random seed "optimisation".
        """
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        # NOTE: the line opening this call is missing from this listing. ``CKFResultFilterTeacherTask`` is an
        # assumption based on the keyword arguments, which match the parameters of that task.
        yield CKFResultFilterTeacherTask(
            result_filter_records_name=f"filter_records{fbdt_state_filter_string}.root",
            experiment_number=self.experiment_number,
            n_events=self.n_events_training,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            fast_bdt_option_result_filter=self.fast_bdt_option_result_filter,
            random_seed='training'
        )
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events_testing,
            random_seed="optimisation",
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                random_seed="training",
                n_events=self.n_events_training,
                filter_number=filter_number,
                # use the parameter name that CKFStateFilterTeacherTask actually defines
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )

949 """
950 Create a path to validate the trained filters.
951 """
952 path = basf2.create_path()
953
954 # get all the file names from the list of input files that are meant for optimisation / validation
955 file_list = [fname for sublist in self.get_input_file_names().values()
956 for fname in sublist if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
957 path.add_module("RootInput", inputFileNames=file_list)
958
959 path.add_module("Gearbox")
960 path.add_module("Geometry")
961 path.add_module("SetupGenfitExtrapolation")
962
963 add_hit_preparation_modules(path, components=["SVD"])
964
965 add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)
966
967 fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
968 fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
969
970 # write the tracking MVA filter parameters and the cut on the MVA classifier output to a local db
971 iov = [0, 0, 0, -1]
973 f"trk_CDCToSVDSpacePointStateFilter_1_Parameter{fbdt_state_filter_string}",
974 iov,
975 f"trk_CDCToSVDSpacePointStateFilter_1{fbdt_state_filter_string}",
976 self.state_filter_cut)
977
979 f"trk_CDCToSVDSpacePointStateFilter_2_Parameter{fbdt_state_filter_string}",
980 iov,
981 f"trk_CDCToSVDSpacePointStateFilter_2{fbdt_state_filter_string}",
982 self.state_filter_cut)
983
985 f"trk_CDCToSVDSpacePointStateFilter_3_Parameter{fbdt_state_filter_string}",
986 iov,
987 f"trk_CDCToSVDSpacePointStateFilter_3{fbdt_state_filter_string}",
988 self.state_filter_cut)
989
991 f"trk_CDCToSVDSpacePointResultFilter_Parameter{fbdt_result_filter_string}",
992 iov,
993 f"trk_CDCToSVDSpacePointResultFilter{fbdt_result_filter_string}",
995
996 basf2.conditions.prepend_testing_payloads("localdb/database.txt")
997 first_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_1_Parameter{fbdt_state_filter_string}",
998 "direction": "backward"}
999 second_high_filter_parameters = {
1000 "DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_2_Parameter{fbdt_state_filter_string}"}
1001 third_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_3_Parameter{fbdt_state_filter_string}"}
1002 filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointResultFilter_Parameter{fbdt_result_filter_string}"}
1003
1004 path.add_module("CDCToSVDSpacePointCKF",
1005
1006 inputRecoTrackStoreArrayName="CDCRecoTracks",
1007 outputRecoTrackStoreArrayName="VXDRecoTracks",
1008 outputRelationRecoTrackStoreArrayName="CDCRecoTracks",
1009
1010 relationCheckForDirection="backward",
1011 reverseSeed=False,
1012 writeOutDirection="backward",
1013
1014 firstHighFilter="mva_with_direction_check",
1015 firstHighFilterParameters=first_high_filter_parameters,
1016 firstHighUseNStates=self.use_n_best_states,
1017
1018 advanceHighFilter="advance",
1019 advanceHighFilterParameters={"direction": "backward"},
1020
1021 secondHighFilter="mva",
1022 secondHighFilterParameters=second_high_filter_parameters,
1023 secondHighUseNStates=self.use_n_best_states,
1024
1025 updateHighFilter="fit",
1026
1027 thirdHighFilter="mva",
1028 thirdHighFilterParameters=third_high_filter_parameters,
1029 thirdHighUseNStates=self.use_n_best_states,
1030
1031 filter="mva",
1032 filterParameters=filter_parameters,
1033 useBestNInSeed=self.use_n_best_results,
1034
1035 exportTracks=True,
1036 enableOverlapResolving=True)
1037
1038 path.add_module('RelatedTracksCombiner',
1039 VXDRecoTracksStoreArrayName="VXDRecoTracks",
1040 CDCRecoTracksStoreArrayName="CDCRecoTracks",
1041 recoTracksStoreArrayName="RecoTracks")
1042
1043 path.add_module('TrackFinderMCTruthRecoTracks',
1044 RecoTracksStoreArrayName="MCRecoTracks",
1045 WhichParticles=[],
1046 UsePXDHits=True,
1047 UseSVDHits=True,
1048 UseCDCHits=True)
1049
1050 path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
1051 mcRecoTracksStoreArrayName="MCRecoTracks",
1052 prRecoTracksStoreArrayName="RecoTracks")
1053
1054 path.add_module(
1056 output_file_name=self.get_output_file_name(
1057 f"cdc_to_svd_spacepoint_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root"),
1058 reco_tracks_name="RecoTracks",
1059 mc_reco_tracks_name="MCRecoTracks",
1060 name="",
1061 contact="",
1062 expert_level=200))
1063
1064 return path
1065
1066 def create_path(self):
1067 """
1068 Create the basf2 path to process for the validation of the trained filters.
1069 """
1071
1072
1073class MainTask(b2luigi.WrapperTask):
1074 """
1075 Wrapper task that needs to finish for b2luigi to finish running this steering file.
1076
1077 It is done if the outputs of all required subtasks exist. It is thus at the
1078 top of the luigi task graph. Edit the ``requires`` method to steer which
1079 tasks and with which parameters you want to run.
1080 """
1081
1082 n_events_training = b2luigi.get_setting(
1084 "n_events_training", default=1000
1085
1086 )
1087
1088 n_events_testing = b2luigi.get_setting(
1090 "n_events_testing", default=500
1091
1092 )
1093
1094 n_events_per_task = b2luigi.get_setting(
1096 "n_events_per_task", default=100
1097
1098 )
1099
1100 num_processes = b2luigi.get_setting(
1102 "basf2_processes_per_worker", default=0
1103
1104 )
1105
1106
1107 bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
1109 bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
1111 def requires(self):
1112 """
1113 Generate the list of tasks that need to be done for luigi to finish running
1114 this steering file.
1115 """
1116
1117 fast_bdt_options = [
1118 [50, 8, 3, 0.1],
1119 [100, 8, 3, 0.1],
1120 [200, 8, 3, 0.1],
1121 ]
1122
1123 experiment_numbers = b2luigi.get_setting("experiment_numbers")
1124
1125 # iterate over all possible combinations of parameters from the above defined parameter lists
1126 for experiment_number, fast_bdt_option_state_filter, fast_bdt_option_result_filter in itertools.product(
1127 experiment_numbers, fast_bdt_options, fast_bdt_options
1128 ):
1129
1130 state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
1131 n_best_states_list = [3, 5, 10]
1132 result_filter_cuts = [0.05, 0.1, 0.2]
1133 n_best_results_list = [3, 5, 10]
1134 for state_filter_cut, n_best_states, result_filter_cut, n_best_results in \
1135 itertools.product(state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list):
1136 yield self.clone(
1137 ValidationAndOptimisationTask,
1138 experiment_number=experiment_number,
1139 n_events_training=self.n_events_training,
1140 n_events_testing=self.n_events_testing,
1141 state_filter_cut=state_filter_cut,
1142 use_n_best_states=n_best_states,
1143 result_filter_cut=result_filter_cut,
1144 use_n_best_results=n_best_results,
1145 fast_bdt_option_state_filter=fast_bdt_option_state_filter,
1146 fast_bdt_option_result_filter=fast_bdt_option_result_filter,
1147 )
1148
1149
1150if __name__ == "__main__":
1151
1152 b2luigi.set_setting("env_script", "./setup_basf2.sh")
1153 b2luigi.set_setting("batch_system", b2luigi.get_setting("batch_system", default="lsf"))
1154 workers = b2luigi.get_setting("workers", default=1)
1155 b2luigi.process(MainTask(), workers=workers, batch=True)
1156
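The size of the scan defined in ``MainTask.requires`` follows directly from the two nested ``itertools.product`` calls. The sketch below recounts it with the lists copied from the listing. The names ``bdt_combinations``, ``cut_combinations`` and ``n_tasks_per_experiment`` are introduced here for illustration only and are not part of the script.

```python
import itertools

# Hyperparameter and cut lists copied from MainTask.requires in the listing.
fast_bdt_options = [[50, 8, 3, 0.1], [100, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
n_best_states_list = [3, 5, 10]
result_filter_cuts = [0.05, 0.1, 0.2]
n_best_results_list = [3, 5, 10]

# Outer loop: every state filter option paired with every result filter option.
bdt_combinations = list(itertools.product(fast_bdt_options, fast_bdt_options))

# Inner loop: every combination of cut values and n-best settings.
cut_combinations = list(itertools.product(
    state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list))

# Illustrative name: ValidationAndOptimisationTask instances requested per experiment number.
n_tasks_per_experiment = len(bdt_combinations) * len(cut_combinations)
print(len(bdt_combinations), len(cut_combinations), n_tasks_per_experiment)
```

With the lists as given, the count scales multiplicatively, so adding a single value to any of them noticeably increases the number of validation tasks that the batch system has to process.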
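The list comprehension in the validation path flattens the dictionary returned by ``self.get_input_file_names()`` and keeps only the simulated files that were produced with the ``optimisation`` seed. The sketch below applies the same condition to a hand-written dictionary. The keys and paths in ``input_file_names`` are made up for illustration; only the filter condition is taken from the listing.

```python
# Hypothetical stand-in for self.get_input_file_names():
# each key maps to a list of file paths provided by the required tasks.
input_file_names = {
    "generated_mc_N500_optimisation.root": [
        "results/random_seed=optimisation/generated_mc_N500_optimisation.root",
    ],
    "generated_mc_N1000_training.root": [
        "results/random_seed=training/generated_mc_N1000_training.root",
    ],
    "weights.root": ["results/weights.root"],
}

# Same condition as in the listing: simulated ROOT files from the optimisation seed only.
file_list = [
    fname
    for sublist in input_file_names.values()
    for fname in sublist
    if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")
]
print(file_list)
```

Because the selection relies on substrings of the file name, it only works as long as the simulation task keeps ``generated_mc_N`` and the seed name in its output file names.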