Belle II Software development
combined_to_pxd_ckf_mva_training.py
"""
combined_to_pxd_ckf_mva_training
-----------------------------------------

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the training and validation of the classifiers of
the three MVA-based state filters and one result filter of the ToPXDCKF.
This CKF extrapolates tracks found in CDC and SVD into the PXD and adds PXD hits
using a combinatorial tree search and a Kalman filter based track fit in each step.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

The order of the b2luigi tasks in this script is as follows (top to bottom):

* Two tasks to create input samples for training and testing (``GenerateSimTask`` and
  ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes a number of events to be
  generated and a number of events per task to reduce runtime. It then divides the total
  number of events by the number of events per task and creates as many
  ``GenerateSimTask`` instances as needed, each with a specific random seed, so that in
  the end the total number of training and testing events are simulated. The individual
  files are then combined by the ``SplitNMergeSimTask`` into one file each for training
  and testing.
* The ``StateRecordingTask`` writes out the data required for training the state
  filters.
* The ``CKFStateFilterTeacherTask`` trains the state filter MVAs, using FastBDT by
  default, with a given set of options.
* The ``ResultRecordingTask`` writes out the data used for the training of the result
  filter MVA. This task requires that the state filters have been trained before.
* The ``CKFResultFilterTeacherTask`` trains the result filter MVA, using FastBDT by
  default, with a given set of FastBDT options. This requires that the result filter
  records have been created with the ``ResultRecordingTask``.
* The ``ValidationAndOptimisationTask`` uses the trained weight files and cut values
  provided to run the tracking chain with the weight file under test, and also
  runs the tracking validation.
* Finally, the ``MainTask`` is the "brain" of the script. It invokes the
  ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
  and cut values on the MVA classifier output.
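
The splitting performed by the ``SplitNMergeSimTask`` can be sketched in plain
Python. This is a minimal illustration, assuming the seed suffix scheme used in the
``requires`` method of that task; the helper name ``split_events`` and the event
numbers are hypothetical and not part of this script.

```python
def split_events(n_events, n_events_per_task, random_seed):
    # Return one (seed, n_events) pair per simulation sub-task.
    # split_events is a hypothetical helper used only for illustration.
    quotient, remainder = divmod(n_events, n_events_per_task)
    chunks = [(random_seed + "_" + str(i).zfill(3), n_events_per_task)
              for i in range(quotient)]
    if remainder > 0:
        # the last sub-task simulates the events that are left over
        chunks.append((random_seed + "_" + str(quotient).zfill(3), remainder))
    return chunks


chunks = split_events(2500, 1000, "training")
for seed, n in chunks:
    print(seed, n)
```

Each pair corresponds to one ``GenerateSimTask``, and the per-task event counts add
up to the requested total.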

Due to the dependencies, the calls of the tasks are reversed. The ``MainTask``
calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
values, and the ``ValidationAndOptimisationTask`` itself calls the required teacher,
recording, and simulation tasks.

Each combination of FastBDT options, state filter cut values, and candidate selection
is used to train the result filter. This means that the ``ResultRecordingTask``
is executed multiple times, once for each combination of FastBDT options, cut value,
and candidate selection.
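
A scan over such combinations can be built with ``itertools.product``. The sketch
below only illustrates the idea; the option lists and cut values are hypothetical
placeholders and not the values used by the ``MainTask``.

```python
import itertools

# hypothetical scan values, chosen only for illustration
fast_bdt_options = [[50, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.05]
n_best_states = [3, 5]

# every combination corresponds to one validation task in the scan
combinations = list(itertools.product(fast_bdt_options, state_filter_cuts, n_best_states))

for fast_bdt_option, state_filter_cut, use_n_best_states in combinations:
    print(fast_bdt_option, state_filter_cut, use_n_best_states)
```

The number of tasks grows multiplicatively with each scanned quantity, so it can help
to keep the individual lists short.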

b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
its parameters, its output files, and the other tasks (with their parameters) it
depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a ROOT
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.
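
The way such a dependency graph determines the execution order can be illustrated
without b2luigi. The following is a plain-Python sketch and not the b2luigi or luigi
scheduler; the task names and the ``run_order`` helper are hypothetical.

```python
# hypothetical task graph: each task maps to the tasks it requires
requirements = {
    "MainTask": ["ValidationTask"],
    "ValidationTask": ["TeacherTask", "TestDataTask"],
    "TeacherTask": ["TrainingDataTask"],
    "TrainingDataTask": [],
    "TestDataTask": [],
}


def run_order(task, done=None):
    # Depth-first walk: a task is appended only after all of its requirements.
    if done is None:
        done = []
    for required in requirements[task]:
        run_order(required, done)
    if task not in done:
        done.append(task)
    return done


order = run_order("MainTask")
print(order)
```

The top-level task is requested first, but it is executed last, after everything it
depends on has produced its output.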

The final task that defines which tasks need to be done for the steering file to
finish is the ``MainTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the ``MainTask``
or replace them by lower-level tasks during debugging.

Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling. It can be installed
via pip::

    python3 -m pip install [--user] b2luigi

Use the ``--user`` option if you do not have the rights to install python packages
into your externals (e.g. because you are using cvmfs) and install them in
``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.
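
As a rough illustration of what such a file can hold, the sketch below parses a small
JSON document with the standard library. Only ``result_path`` is a setting named in
this documentation; the key ``n_events_training`` and all values are assumptions made
for the example, so check the ``settings.json`` shipped with this script for the keys
it actually uses.

```python
import json

# hypothetical content of a settings.json file, for illustration only
example_settings = '{"result_path": "/path/to/results", "n_events_training": 1000}'

settings = json.loads(example_settings)

# result_path is the directory below which all task outputs are stored
result_path = settings["result_path"]
# fall back to a default when an optional key is missing
n_events_testing = settings.get("n_events_testing", 500)

print(result_path, n_events_testing)
```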

Usage
~~~~~

You can test the b2luigi task graph without running it via::

    python3 combined_to_pxd_ckf_mva_training.py --dry-run
    python3 combined_to_pxd_ckf_mva_training.py --show-output

This will show the outputs and potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_to_pxd_ckf_mva_training.py

One can use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` if b2luigi was installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_to_pxd_ckf_mva_training.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your
web browser, you have two options:

    1. Run both the steering file and ``luigid`` remotely and use
       ssh-port-forwarding to your local host. To do so, run on your local
       machine::

           ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

    2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
       local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path`` set in
``settings.json``. The directory tree encodes the used b2luigi parameters. This
ensures reproducibility and makes parameter searches easy. Because the tree can become
deep, it is sometimes hard to find the relevant output files. You can view the whole
directory structure by running ``tree <result_path>``. Use the unix ``find`` command
to find the files that interest you, e.g.::

    find <result_path> -name "*.root" # find all ROOT files
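
The same search can be done from Python with ``pathlib``. This is a minimal sketch;
the helper name ``find_root_files`` and the directory names created in the temporary
folder are hypothetical and only stand in for a real ``result_path``.

```python
import tempfile
from pathlib import Path


def find_root_files(result_path):
    # Recursively collect all ROOT files below result_path, sorted by path.
    return sorted(Path(result_path).rglob("*.root"))


# build a small throw-away directory tree to demonstrate the search
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "layer=1" / "n_events=100"
    nested.mkdir(parents=True)
    (nested / "records1.root").touch()
    (nested / "log.txt").touch()
    found = [path.name for path in find_root_files(tmp)]

print(found)
```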
"""

import itertools
import subprocess

import basf2
import basf2_mva
from tracking import add_track_finding
from tracking.path_utils import add_hit_preparation_modules
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule
import background
import simulation

from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string
from tracking_mva_filter_payloads.write_tracking_mva_filter_payloads_to_db import write_tracking_mva_filter_payloads_to_db

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise


class GenerateSimTask(Basf2PathTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Experiment number of the conditions database
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed, also used to label the sample in the output file name
    random_seed = b2luigi.Parameter()
    # Number of events to generate
    n_events = b2luigi.IntParameter()
    # Directory with overlay background data files
    bkgfiles_dir = b2luigi.Parameter(hashed=True)

    # Name of the batch queue to submit to
    queue = 'l'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        basf2.set_random_seed(self.random_seed)
        path = basf2.create_path()
        path.add_module(
            "EventInfoSetter", evtNumList=[self.n_events], runList=[0], expList=[self.experiment_number]
        )
        path.add_module("EvtGenInput")
        bkg_files = ""
        # \cond suppress doxygen warning
        if self.experiment_number == 0:
            bkg_files = background.get_background_files()
        else:
            bkg_files = background.get_background_files(self.bkgfiles_dir)
        # \endcond

        simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, usePXDDataReduction=False)

        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.output_file_name()),
        )
        return path


# I don't use the default MergeTask or similar because they only work if every input file is called the same.
# Additionally, I want to add more features like deleting the original input to save storage space.
class SplitNMergeSimTask(Basf2Task):
    """
    Generate simulated Monte Carlo with background overlay by splitting the requested
    number of events over several ``GenerateSimTask`` instances and merging their output.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Experiment number of the conditions database
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed, also used to label the sample in the output file name
    random_seed = b2luigi.Parameter()
    # Total number of events to generate
    n_events = b2luigi.IntParameter()
    # Directory with overlay background data files
    bkgfiles_dir = b2luigi.Parameter(hashed=True)

    # Name of the batch queue to submit to
    queue = 'sx'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def requires(self):
        """
        This task requires several GenerateSimTask to be finished so that the required number of events is created.
        """
        n_events_per_task = MainTask.n_events_per_task
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        for i in range(quotient):
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MainTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,
                experiment_number=self.experiment_number,
            )
        if remainder > 0:
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MainTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),
                n_events=remainder,
                experiment_number=self.experiment_number,
            )

    @b2luigi.on_temporary_files
    def process(self):
        """
        When all GenerateSimTasks finished, merge the output.
        """
        create_output_dirs(self)

        file_list = [item for sublist in self.get_input_file_names().values() for item in sublist]
        print("Merge the following files:")
        print(file_list)
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)
        print("Finished merging. Now remove the input files to save space.")
        cmd2 = ["rm", "-f"]
        for tempfile in file_list:
            args = cmd2 + [tempfile]
            subprocess.check_call(args)


class StateRecordingTask(Basf2PathTask):
    """
    Record the data for the three state filters for the ToPXDCKF.

    This task requires that the events used for training have been simulated before, which is done using the
    ``SplitNMergeSimTask``.
    """

    # Experiment number of the conditions database
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed, also used to label the sample in the output file name
    random_seed = b2luigi.Parameter()
    # Number of events to generate
    n_events = b2luigi.IntParameter()

    # Layer on which to toggle for recording the information for training
    layer = b2luigi.IntParameter()

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        for record_fname in ["records1.root", "records2.root", "records3.root"]:
            yield self.add_to_output(record_fname)

    def requires(self):
        """
        This task only requires that the input files have been created.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events,
            random_seed=self.random_seed,
        )

    def create_state_recording_path(self, layer, records1_fname, records2_fname, records3_fname):
        """
        Create a path for the recording. To record the data for the PXD state filters, CDC+SVD tracks are required, and these
        must be truth matched before. The data have to be recorded for each layer of the PXD, i.e. layers 1 and 2, but also
        an artificial layer 3.

        :param layer: The layer for which the data are recorded.
        :param records1_fname: Name of the records1 file.
        :param records2_fname: Name of the records2 file.
        :param records3_fname: Name of the records3 file.
        """
        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD", "PXD"])

        add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCSVDRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")

        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="RecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        hitFilter="angulardistance",
                        seedFilter="angulardistance",
                        preSeedFilter='all',
                        preHitFilter='all',

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="truth",
                        firstEqualFilter="recording",
                        firstEqualFilterParameters={"treeName": "records1", "rootFileName": records1_fname, "returnWeight": 1.0},
                        firstLowFilter="none",
                        firstHighUseNStates=0,
                        firstToggleOnLayer=layer,

                        advanceHighFilter="advance",

                        secondHighFilter="truth",
                        secondEqualFilter="recording",
                        secondEqualFilterParameters={"treeName": "records2", "rootFileName": records2_fname, "returnWeight": 1.0},
                        secondLowFilter="none",
                        secondHighUseNStates=0,
                        secondToggleOnLayer=layer,

                        updateHighFilter="fit",

                        thirdHighFilter="truth",
                        thirdEqualFilter="recording",
                        thirdEqualFilterParameters={"treeName": "records3", "rootFileName": records3_fname},
                        thirdLowFilter="none",
                        thirdHighUseNStates=0,
                        thirdToggleOnLayer=layer,

                        filter="none",
                        exportTracks=False,

                        enableOverlapResolving=False)

        return path

    def create_path(self):
        """
        Create basf2 path to record the training data for the state filters.
        """
        return self.create_state_recording_path(
            layer=self.layer,
            records1_fname=self.get_output_file_name("records1.root"),
            records2_fname=self.get_output_file_name("records2.root"),
            records3_fname=self.get_output_file_name("records3.root"),
        )


class CKFStateFilterTeacherTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    In this task the three state filters are trained, each with the corresponding recordings from the different layers.
    It will be executed for each FastBDT option defined in the MainTask.
    """

    # Experiment number of the conditions database
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed, also used to label the sample
    random_seed = b2luigi.Parameter()
    # Number of events used for the training sample
    n_events = b2luigi.IntParameter()
    # FastBDT options used to train the state filters
    fast_bdt_option_state_filter = b2luigi.ListParameter(hashed=True, default=[50, 8, 3, 0.1])

    # Number of the filter to train (first=1, second=2, third=3)
    filter_number = b2luigi.IntParameter()
    # Variable used as the training target
    training_target = b2luigi.Parameter(default="truth")

    # Variables to exclude from the training
    exclude_variables = b2luigi.ListParameter(hashed=True, default=[])

    def get_weightfile_identifier(self, fast_bdt_option=None, filter_number=None):
        """
        Name of weightfile that is created by the teacher task.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        :param filter_number: Filter number (first=1, second=2, third=3) to be trained
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_state_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)

        if filter_number is None:
            filter_number = self.filter_number
        weightfile_name = f"trk_ToPXDStateFilter_{filter_number}" + fast_bdt_string
        return weightfile_name

    def requires(self):
        """
        This task requires that the recordings for the state filters have been created.
        """
        for layer in [1, 2, 3]:
            yield self.clone(
                StateRecordingTask,
                experiment_number=self.experiment_number,
                n_events_training=self.n_events,
                random_seed="training",
                layer=layer
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_identifier() + ".root")

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(f"records{self.filter_number}.root")
        weightfile_identifier = self.get_weightfile_identifier(filter_number=self.filter_number)
        tree_name = f"records{self.filter_number}"
        print(f"Processed records files: {records_files},\nfeature tree name: {tree_name}")

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=weightfile_identifier,
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_state_filter,
        )
        basf2_mva.download(weightfile_identifier, self.get_output_file_name(weightfile_identifier + '.root'))


class ResultRecordingTask(Basf2PathTask):
    """
    Task to record data for the final result filter. This requires trained state filters.
    The cuts on the state filter classifiers are set to rather low values to ensure that all signal is contained in the recorded
    file. Also, the values for XXXXXHighUseNStates are chosen conservatively, i.e. rather on the high side.
    """

    # Experiment number of the conditions database
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed, also used to label the sample
    random_seed = b2luigi.Parameter()
    # Number of events used for the training sample
    n_events_training = b2luigi.IntParameter()
    # FastBDT options used to train the state filters
    fast_bdt_option_state_filter = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])

    # Name of the records file for training the final result filter
    result_filter_records_name = b2luigi.Parameter()

    # prepend testing payloads
    basf2.conditions.prepend_testing_payloads("localdb/database.txt")

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.result_filter_records_name)

    def requires(self):
        """
        This task requires that the training SplitNMergeSimTask is finished, as well as that the state filters are trained
        using the CKFStateFilterTeacherTask.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events_training,
            random_seed=self.random_seed,
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events_training,
                random_seed=self.random_seed,
                filter_number=filter_number,
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )

    def create_result_recording_path(self, result_filter_records_name):
        """
        Create a path for the recording of the result filter. This file is then used to train the result filter.

        :param result_filter_records_name: Name of the recording file.
        """

        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD", "PXD"])

        add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCSVDRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")

        fast_bdt_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)

        # write the tracking MVA filter parameters and the cut on MVA classifier to be applied on a local db
        iov = [0, 0, 0, -1]
        write_tracking_mva_filter_payloads_to_db(
            f"trk_ToPXDStateFilter_1_Parameter{fast_bdt_string}",
            iov,
            f"trk_ToPXDStateFilter_1{fast_bdt_string}",
            0.01)

        write_tracking_mva_filter_payloads_to_db(
            f"trk_ToPXDStateFilter_2_Parameter{fast_bdt_string}",
            iov,
            f"trk_ToPXDStateFilter_2{fast_bdt_string}",
            0.01)

        write_tracking_mva_filter_payloads_to_db(
            f"trk_ToPXDStateFilter_3_Parameter{fast_bdt_string}",
            iov,
            f"trk_ToPXDStateFilter_3{fast_bdt_string}",
            0.01)

        basf2.conditions.prepend_testing_payloads("localdb/database.txt")
        first_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_1_Parameter{fast_bdt_string}",
                                        "direction": "backward"}
        second_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_2_Parameter{fast_bdt_string}"}
        third_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_3_Parameter{fast_bdt_string}"}

        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="RecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva",
                        firstHighFilterParameters=first_high_filter_parameters,
                        firstHighUseNStates=10,

                        advanceHighFilter="advance",

                        secondHighFilter="mva",
                        secondHighFilterParameters=second_high_filter_parameters,
                        secondHighUseNStates=10,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters=third_high_filter_parameters,
                        thirdHighUseNStates=10,

                        filter="recording",
                        filterParameters={"rootFileName": result_filter_records_name},
                        exportTracks=False,

                        enableOverlapResolving=True)

        return path

    def create_path(self):
        """
        Create basf2 path to record the training data for the result filter.
        """
        return self.create_result_recording_path(
            result_filter_records_name=self.get_output_file_name(self.result_filter_records_name),
        )


class CKFResultFilterTeacherTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data for the result filter.
    """

    # Experiment number of the conditions database
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed, also used to label the sample
    random_seed = b2luigi.Parameter()
    # Number of events used for the training sample
    n_events = b2luigi.IntParameter()
    # FastBDT options used to train the state filters
    fast_bdt_option_state_filter = b2luigi.ListParameter(hashed=True, default=[50, 8, 3, 0.1])

    # FastBDT options used to train the result filter
    fast_bdt_option_result_filter = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])

    # Name of the input data files
    result_filter_records_name = b2luigi.Parameter()
    # Variable used as the training target
    training_target = b2luigi.Parameter(default="truth")

    # Variables to exclude from the training
    exclude_variables = b2luigi.ListParameter(hashed=True, default=[])

    def get_weightfile_identifier(self, fast_bdt_option=None):
        """
        Name of weightfile that is created by the teacher task.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_result_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = "trk_ToPXDResultFilter" + fast_bdt_string
        return weightfile_name

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield ResultRecordingTask(
            experiment_number=self.experiment_number,
            n_events_training=self.n_events,
            random_seed=self.random_seed,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            result_filter_records_name=self.result_filter_records_name,
        )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_identifier() + ".root")

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(self.result_filter_records_name)
        tree_name = "records"
        print(f"Processed records files for result filter training: {records_files},\nfeature tree name: {tree_name}")
        weightfile_identifier = self.get_weightfile_identifier()
        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=weightfile_identifier,
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_result_filter,
        )
        basf2_mva.download(weightfile_identifier, self.get_output_file_name(weightfile_identifier + ".root"))


class ValidationAndOptimisationTask(Basf2PathTask):
    """
    Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values for
    the states, the number of best candidates kept after each filter, and similar for the result filter.
    """

    # Experiment number of the conditions database
    experiment_number = b2luigi.IntParameter()
    # Number of events used for the training sample
    n_events_training = b2luigi.IntParameter()
    # FastBDT options used to train the state filters
    fast_bdt_option_state_filter = b2luigi.ListParameter(
        # ## \cond
        hashed=True, default=[200, 8, 3, 0.1]
        # ## \endcond
    )

    # FastBDT options used to train the result filter
    fast_bdt_option_result_filter = b2luigi.ListParameter(
        # ## \cond
        hashed=True, default=[200, 8, 3, 0.1]
        # ## \endcond
    )

    # Number of events used for the testing sample
    n_events_testing = b2luigi.IntParameter()
    # Cut on the state filter classifier output
    state_filter_cut = b2luigi.FloatParameter()
    # Number of best states kept after each state filter
    use_n_best_states = b2luigi.IntParameter()
    # Cut on the result filter classifier output
    result_filter_cut = b2luigi.FloatParameter()
    # Number of best results kept after the result filter
    use_n_best_results = b2luigi.IntParameter()

    # prepend the testing payloads
    basf2.conditions.prepend_testing_payloads("localdb/database.txt")

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
        yield self.add_to_output(
            f"to_pxd_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root")

    def requires(self):
        """
        This task requires trained result filters, trained state filters, and that an independent data set for validation was
        created using the SplitNMergeSimTask with the random seed optimisation.
        """
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        yield CKFResultFilterTeacherTask(
            result_filter_records_name=f"filter_records{fbdt_state_filter_string}.root",
            experiment_number=self.experiment_number,
            n_events=self.n_events_training,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            fast_bdt_option_result_filter=self.fast_bdt_option_result_filter,
            random_seed='training'
        )
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events_testing,
            random_seed="optimisation",
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events_training,
                random_seed="training",
                filter_number=filter_number,
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )

    # [line missing in the source listing]
        """
        Create a path to validate the trained filters.
        """
        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for optimisation / validation
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD", "PXD"])

        add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)

        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")

        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)

        # write the tracking MVA filter parameters and the cuts on the MVA classifier output to a local database
        iov = [0, 0, 0, -1]
        # [line missing in the source listing]
            f"trk_ToPXDStateFilter_1_Parameter{fbdt_state_filter_string}",
            iov,
            f"trk_ToPXDStateFilter_1{fbdt_state_filter_string}",
            self.state_filter_cut)

        # [line missing in the source listing]
            f"trk_ToPXDStateFilter_2_Parameter{fbdt_state_filter_string}",
            iov,
            f"trk_ToPXDStateFilter_2{fbdt_state_filter_string}",
            self.state_filter_cut)

        # [line missing in the source listing]
            f"trk_ToPXDStateFilter_3_Parameter{fbdt_state_filter_string}",
            iov,
            f"trk_ToPXDStateFilter_3{fbdt_state_filter_string}",
            self.state_filter_cut)

        # [line missing in the source listing]
            f"trk_ToPXDResultFilter_Parameter{fbdt_result_filter_string}",
            iov,
            f"trk_ToPXDResultFilter{fbdt_result_filter_string}",
        # [line missing in the source listing]

        basf2.conditions.prepend_testing_payloads("localdb/database.txt")
        first_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_1_Parameter{fbdt_state_filter_string}",
                                        "direction": "backward"}
        second_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_2_Parameter{fbdt_state_filter_string}"}
        third_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_3_Parameter{fbdt_state_filter_string}"}
        filter_parameters = {"DBPayloadName": f"trk_ToPXDResultFilter_Parameter{fbdt_result_filter_string}"}

        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="PXDRecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva_with_direction_check",
                        firstHighFilterParameters=first_high_filter_parameters,
                        firstHighUseNStates=self.use_n_best_states,

                        advanceHighFilter="advance",
                        advanceHighFilterParameters={"direction": "backward"},

                        secondHighFilter="mva",
                        secondHighFilterParameters=second_high_filter_parameters,
                        secondHighUseNStates=self.use_n_best_states,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters=third_high_filter_parameters,
                        thirdHighUseNStates=self.use_n_best_states,

                        filter="mva",
                        filterParameters=filter_parameters,
                        useBestNInSeed=self.use_n_best_results,

                        exportTracks=True,
                        enableOverlapResolving=True)

        path.add_module('RelatedTracksCombiner',
                        VXDRecoTracksStoreArrayName="PXDRecoTracks",
                        CDCRecoTracksStoreArrayName="CDCSVDRecoTracks",
                        recoTracksStoreArrayName="RecoTracks")

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=True, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="RecoTracks")

        path.add_module(
            # [line missing in the source listing]
                output_file_name=self.get_output_file_name(
                    f"to_pxd_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root"),
                reco_tracks_name="RecoTracks",
                mc_reco_tracks_name="MCRecoTracks",
                name="",
                contact="",
                expert_level=200))

        return path

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        # [line missing in the source listing]


class MainTask(b2luigi.WrapperTask):
    """
    Wrapper task that needs to finish for b2luigi to finish running this steering file.

    It is done if the outputs of all required subtasks exist. It is thus at the
    top of the luigi task graph. Edit the ``requires`` method to steer which
    tasks and with which parameters you want to run.
    """

    n_events_training = b2luigi.get_setting(
        "n_events_training", default=1000
    )

    n_events_testing = b2luigi.get_setting(
        "n_events_testing", default=500
    )

    n_events_per_task = b2luigi.get_setting(
        "n_events_per_task", default=100
    )

    num_processes = b2luigi.get_setting(
        "basf2_processes_per_worker", default=0
    )

    bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
    bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}

    def requires(self):
        """
        Generate the list of tasks that need to be done for luigi to finish running
        this steering file.
        """

        fast_bdt_options = [
            [50, 8, 3, 0.1],
            [100, 8, 3, 0.1],
            [200, 8, 3, 0.1],
        ]

        experiment_numbers = b2luigi.get_setting("experiment_numbers")

        # iterate over all possible combinations of parameters from the above defined parameter lists
        for experiment_number, fast_bdt_option_state_filter, fast_bdt_option_result_filter in itertools.product(
                experiment_numbers, fast_bdt_options, fast_bdt_options
        ):

            state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
            n_best_states_list = [3, 5, 10]
            result_filter_cuts = [0.05, 0.1, 0.2]
            n_best_results_list = [2, 3, 5]
            for state_filter_cut, n_best_states, result_filter_cut, n_best_results in \
                    itertools.product(state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list):
                yield self.clone(
                    ValidationAndOptimisationTask,
                    experiment_number=experiment_number,
                    n_events_training=self.n_events_training,
                    n_events_testing=self.n_events_testing,
                    state_filter_cut=state_filter_cut,
                    use_n_best_states=n_best_states,
                    result_filter_cut=result_filter_cut,
                    use_n_best_results=n_best_results,
                    fast_bdt_option_state_filter=fast_bdt_option_state_filter,
                    fast_bdt_option_result_filter=fast_bdt_option_result_filter,
                )


if __name__ == "__main__":

    b2luigi.set_setting("env_script", "./setup_basf2.sh")
    b2luigi.get_setting("batch_system", "lsf")
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(MainTask(), workers=workers, batch=True)
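The two nested ``itertools.product`` loops in ``MainTask.requires`` scan a full parameter grid, so the number of ``ValidationAndOptimisationTask`` instances grows multiplicatively with every list. The following standalone sketch (it needs neither b2luigi nor basf2) counts the grid points for the lists used in this steering file. The helper name ``count_validation_tasks`` and the placeholder experiment number ``0`` are assumptions made for illustration; the real experiment numbers are read from the ``experiment_numbers`` setting.

```python
import itertools

# Parameter lists copied from MainTask.requires.
fast_bdt_options = [[50, 8, 3, 0.1], [100, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
n_best_states_list = [3, 5, 10]
result_filter_cuts = [0.05, 0.1, 0.2]
n_best_results_list = [2, 3, 5]


def count_validation_tasks(experiment_numbers):
    """Count the grid points that the nested loops in MainTask.requires would yield.

    Hypothetical helper for illustration, not part of the steering file.
    """
    n_tasks = 0
    # Outer loop: experiment number x state-filter options x result-filter options.
    for _ in itertools.product(experiment_numbers, fast_bdt_options, fast_bdt_options):
        # Inner loop: cut values and "n best" settings used during validation.
        inner = itertools.product(
            state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list
        )
        n_tasks += sum(1 for _ in inner)
    return n_tasks


# Assumption: one placeholder experiment number.
n_tasks = count_validation_tasks([0])
print(n_tasks)
```

For one experiment number this gives 3 x 3 outer combinations times 6 x 3 x 3 x 3 inner combinations. Luigi treats tasks with identical parameters as the same task, so trainings shared between grid points should only need to run once; the count above concerns the validation step.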
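The validation path selects its input by flattening the dictionary returned by ``self.get_input_file_names()`` and keeping only the simulated ROOT files produced with the ``optimisation`` seed. The sketch below applies the same filter to a hand-made dictionary. The function name ``select_optimisation_files``, the dictionary keys, and all paths are invented for illustration, and the ``{key: [paths]}`` shape is an assumption inferred from the list comprehension in the listing.

```python
def select_optimisation_files(input_file_names):
    """Flatten a {key: [paths]} mapping and keep the optimisation ROOT files.

    Hypothetical helper mirroring the list comprehension in the validation path.
    """
    return [
        fname
        for sublist in input_file_names.values()
        for fname in sublist
        if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")
    ]


# Invented example input: one optimisation sample, one weight file, one training sample.
example_inputs = {
    "generated_mc_N500_optimisation.root": ["/tmp/sim/generated_mc_N500_optimisation.root"],
    "trk_ToPXDStateFilter_1.xml": ["/tmp/train/trk_ToPXDStateFilter_1.xml"],
    "generated_mc_N1000_training.root": ["/tmp/sim/generated_mc_N1000_training.root"],
}

selected = select_optimisation_files(example_inputs)
print(selected)
```

Because all three substring conditions must hold, weight files and the training sample are dropped even though they appear among the task inputs.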