Belle II Software release-09-00-00
combined_to_pxd_ckf_mva_training.py
"""
combined_to_pxd_ckf_mva_training
-----------------------------------------

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the training and validation of the classifiers of
the three MVA-based state filters and one result filter of the ToPXDCKF.
This CKF extrapolates tracks found in CDC and SVD into the PXD and adds PXD hits
using a combinatorial tree search and a Kalman filter based track fit in each step.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

The order of the b2luigi tasks in this script is as follows (top to bottom):

* Two tasks to create input samples for training and testing (``GenerateSimTask`` and
  ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes a number of events to be
  generated and a number of events per task to reduce runtime. It then divides the total
  number of events by the number of events per task and creates as many
  ``GenerateSimTask`` instances as needed, each with a specific random seed, so that in
  the end the total number of training and testing events are simulated. The individual
  files are then combined by the ``SplitNMergeSimTask`` into one file each for training
  and testing.
* The ``StateRecordingTask`` writes out the data required for training the state
  filters.
* The ``CKFStateFilterTeacherTask`` trains the state filter MVAs, using FastBDT by
  default, with a given set of options.
* The ``ResultRecordingTask`` writes out the data used for the training of the result
  filter MVA. This task requires that the state filters have been trained before.
* The ``CKFResultFilterTeacherTask`` trains the result filter MVA, using FastBDT by
  default, with a given set of FastBDT options. This requires that the result filter
  records have been created with the ``ResultRecordingTask``.
* The ``ValidationAndOptimisationTask`` uses the trained weight files and cut values
  provided to run the tracking chain with the weight file under test, and also
  runs the tracking validation.
* Finally, the ``MainTask`` is the "brain" of the script. It invokes the
  ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
  and cut values on the MVA classifier output.

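The splitting done by the ``SplitNMergeSimTask`` can be sketched in plain Python.
This is only a minimal illustration and not part of the task chain: the helper name
``split_events`` is hypothetical, while the ``divmod`` logic and the zero-padded seed
suffix follow ``SplitNMergeSimTask.requires`` in this file.

```python
def split_events(n_events, n_events_per_task, random_seed):
    # Hypothetical helper: list the (random_seed, n_events) pairs, one per GenerateSimTask.
    quotient, remainder = divmod(n_events, n_events_per_task)
    chunks = [(random_seed + '_' + str(i).zfill(3), n_events_per_task) for i in range(quotient)]
    if remainder > 0:
        # One additional task simulates the events that are left over.
        chunks.append((random_seed + '_' + str(quotient).zfill(3), remainder))
    return chunks


# 2500 events with 1000 events per task give two full tasks and one with 500 events.
print(split_events(2500, 1000, 'training'))
```
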
Due to the dependencies, the calls of the tasks are reversed. The ``MainTask``
calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
values, and the ``ValidationAndOptimisationTask`` itself calls the required teacher,
recording, and simulation tasks.

Each combination of FastBDT options, state filter cut values, and candidate selection
is used to train the result filter. This means that the ``ResultRecordingTask``
is executed multiple times, once for each combination of FastBDT options, cut value,
and candidate selection.

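Such a parameter scan amounts to a Cartesian product of the option lists. The
following is a minimal sketch with purely illustrative values: the real scan values
are defined in the ``MainTask`` and are not reproduced here, and the variable names
are assumptions made for this example.

```python
import itertools

# Illustrative values only; the actual scan values are configured in the MainTask.
fast_bdt_options = [[50, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.05]
n_best_states = [3, 5]

# Every combination of the three lists corresponds to one validation job.
combinations = list(itertools.product(fast_bdt_options, state_filter_cuts, n_best_states))
for fbdt_option, cut, n_best in combinations:
    print(fbdt_option, cut, n_best)
```

The number of jobs grows multiplicatively with each scanned list, so it is worth
keeping the lists short when testing the task chain.
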
b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
its parameters, its output files, and which other tasks with which
parameters it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a ROOT
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.

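How such a dependency graph fixes the execution order can be illustrated without
b2luigi. The toy example below is a sketch under simplifying assumptions: the
dictionary ``requires`` and the function ``run_order`` are hypothetical and only mimic
the idea that a task runs after everything it depends on.

```python
# Hypothetical, simplified dependency table: task name -> tasks it requires.
requires = {
    'validation': ['teacher', 'data_collection_test'],
    'teacher': ['data_collection_train'],
    'data_collection_train': [],
    'data_collection_test': [],
}


def run_order(task, done=None):
    # Depth-first traversal: schedule all requirements before the task itself.
    if done is None:
        done = []
    for dependency in requires[task]:
        run_order(dependency, done)
    if task not in done:
        done.append(task)
    return done


print(run_order('validation'))
```
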
The final task that defines which tasks need to be done for the steering file to
finish is the ``MainTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the ``MainTask``
or replace them by lower-level tasks during debugging.

Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling. It can be installed
via pip::

    python3 -m pip install [--user] b2luigi

Use the ``--user`` option if you do not have the rights to install python packages
into your externals (e.g. because you are using cvmfs) and install them in
``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.

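As a rough sketch of what such a configuration looks like when read from Python,
assume a ``settings.json`` that contains only the ``result_path`` key mentioned in
this documentation. The path value is a placeholder, and the real file may contain
further keys that are not shown here.

```python
import json

# Hypothetical content of settings.json; the path is a placeholder.
settings_text = '{"result_path": "/path/to/results"}'

# Parse the JSON text and look up the directory used for all task outputs.
settings = json.loads(settings_text)
print(settings['result_path'])
```
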
Usage
~~~~~

You can test the b2luigi task definitions without running them via::

    python3 combined_to_pxd_ckf_mva_training.py --dry-run
    python3 combined_to_pxd_ckf_mva_training.py --show-output

This will show the outputs and show potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_to_pxd_ckf_mva_training.py

One can use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` in case b2luigi has been installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_to_pxd_ckf_mva_training.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web browser, you have two options:

    1. Run both the steering file and ``luigid`` remotely and use
       ssh-port-forwarding to your local host. To do so, run on your local
       machine::

           ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

    2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
       local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path`` set in
``settings.json``. The directory tree encodes the used b2luigi parameters. This
ensures reproducibility and makes parameter searches easy. Sometimes, it is hard to
find the relevant output files. You can view the whole directory structure by
running ``tree <result_path>``. Use the unix ``find`` command to find the files
that interest you, e.g.::

    find <result_path> -name "*.root" # find all ROOT files
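
The same search can be done from Python, for example in an analysis notebook. The
following is a minimal sketch using only the standard library. The helper name
``find_root_files`` and the directory and file names in the demonstration are
assumptions made for this example.

```python
import tempfile
from pathlib import Path


def find_root_files(result_path):
    # Recursively collect all files ending in .root below result_path.
    return sorted(str(path) for path in Path(result_path).rglob('*.root'))


# Small self-contained demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp_dir:
    sub_dir = Path(tmp_dir) / 'n_events=100'
    sub_dir.mkdir()
    (sub_dir / 'records1.root').touch()
    (sub_dir / 'log.txt').touch()
    found = find_root_files(tmp_dir)
    print(found)
```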
"""

import itertools
import subprocess

import basf2
from tracking import add_track_finding
from tracking.path_utils import add_hit_preparation_modules
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule
import background
import simulation

from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise


class GenerateSimTask(Basf2PathTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Experiment number written into the events by the EventInfoSetter.
    experiment_number = b2luigi.IntParameter()
    # Random seed string; it is also used as part of the output file name.
    random_seed = b2luigi.Parameter()
    # Number of events to generate.
    n_events = b2luigi.IntParameter()
    # Directory with the background files for the overlay.
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    queue = 'l'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        basf2.set_random_seed(self.random_seed)
        path = basf2.create_path()
        path.add_module(
            "EventInfoSetter", evtNumList=[self.n_events], runList=[0], expList=[self.experiment_number]
        )
        path.add_module("EvtGenInput")
        bkg_files = ""
        # \cond suppress doxygen warning
        if self.experiment_number == 0:
        else:
        # \endcond

        simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, usePXDDataReduction=False)

        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.output_file_name()),
        )
        return path


# I don't use the default MergeTask or similar because they only work if every input file is called the same.
# Additionally, I want to add more features like deleting the original input to save storage space.
class SplitNMergeSimTask(Basf2Task):
    """
    Split the simulation into several ``GenerateSimTask`` jobs and merge their
    output files into a single file.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    queue = 'sx'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def requires(self):
        """
        This task requires several GenerateSimTask to be finished so that the required number of events is created.
        """
        n_events_per_task = MainTask.n_events_per_task
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        for i in range(quotient):
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MainTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,
                experiment_number=self.experiment_number,
            )
        if remainder > 0:
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MainTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),
                n_events=remainder,
                experiment_number=self.experiment_number,
            )

    @b2luigi.on_temporary_files
    def process(self):
        """
        When all GenerateSimTasks finished, merge the output.
        """
        create_output_dirs(self)

        file_list = [item for sublist in self.get_input_file_names().values() for item in sublist]
        print("Merge the following files:")
        print(file_list)
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)
        print("Finished merging. Now remove the input files to save space.")
        cmd2 = ["rm", "-f"]
        for tempfile in file_list:
            args = cmd2 + [tempfile]
            subprocess.check_call(args)


class StateRecordingTask(Basf2PathTask):
    """
    Record the data for the three state filters for the ToPXDCKF.

    This task requires that the events used for training have been simulated before, which is done using the
    ``SplitNMergeSimTask``.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()

    layer = b2luigi.IntParameter()

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        for record_fname in ["records1.root", "records2.root", "records3.root"]:
            yield self.add_to_output(record_fname)

    def requires(self):
        """
        This task only requires that the input files have been created.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events,
            random_seed=self.random_seed,
        )

    def create_state_recording_path(self, layer, records1_fname, records2_fname, records3_fname):
        """
        Create a path for the recording. To record the data for the PXD state filters, CDC+SVD tracks are required, and these
        must be truth matched before. The data have to be recorded for each layer of the PXD, i.e. layers 1 and 2, but also an
        artificial layer 3.

        :param layer: The layer for which the data are recorded.
        :param records1_fname: Name of the records1 file.
        :param records2_fname: Name of the records2 file.
        :param records3_fname: Name of the records3 file.
        """
        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD", "PXD"])

        add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCSVDRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")

        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="RecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        hitFilter="distance",
                        seedFilter="distance",
                        preSeedFilter='all',
                        preHitFilter='all',

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="truth",
                        firstEqualFilter="recording",
                        firstEqualFilterParameters={"treeName": "records1", "rootFileName": records1_fname, "returnWeight": 1.0},
                        firstLowFilter="none",
                        firstHighUseNStates=0,
                        firstToggleOnLayer=layer,

                        advanceHighFilter="advance",

                        secondHighFilter="truth",
                        secondEqualFilter="recording",
                        secondEqualFilterParameters={"treeName": "records2", "rootFileName": records2_fname, "returnWeight": 1.0},
                        secondLowFilter="none",
                        secondHighUseNStates=0,
                        secondToggleOnLayer=layer,

                        updateHighFilter="fit",

                        thirdHighFilter="truth",
                        thirdEqualFilter="recording",
                        thirdEqualFilterParameters={"treeName": "records3", "rootFileName": records3_fname},
                        thirdLowFilter="none",
                        thirdHighUseNStates=0,
                        thirdToggleOnLayer=layer,

                        filter="none",
                        exportTracks=False,

                        enableOverlapResolving=False)

        return path

    def create_path(self):
        """
        Create basf2 path that records the training data for the state filters.
        """
        return self.create_state_recording_path(
            layer=self.layer,
            records1_fname=self.get_output_file_name("records1.root"),
            records2_fname=self.get_output_file_name("records2.root"),
            records3_fname=self.get_output_file_name("records3.root"),
        )


class CKFStateFilterTeacherTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    In this task the three state filters are trained, each with the corresponding recordings from the different layers.
    It will be executed for each FastBDT option defined in the MainTask.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()
    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    filter_number = b2luigi.IntParameter()
    training_target = b2luigi.Parameter(
        default="truth"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[]
    )

    def get_weightfile_xml_identifier(self, fast_bdt_option=None, filter_number=1):
        """
        Name of the xml weightfile that is created by the teacher task.
        It is subsequently used as a local weightfile in the following validation tasks.

        :param fast_bdt_option: FastBDT option that is used to train this MVA.
        :param filter_number: Filter number (first=1, second=2, third=3) to be trained.
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_state_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = f"trk_ToPXDStateFilter_{filter_number}" + fast_bdt_string
        return weightfile_name + ".xml"

    def requires(self):
        """
        This task requires the recordings for the state filters.
        """
        for layer in [1, 2, 3]:
            yield self.clone(
                StateRecordingTask,
                experiment_number=self.experiment_number,
                n_events_training=self.n_events,
                random_seed="training",
                layer=layer
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_xml_identifier(filter_number=self.filter_number))

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(f"records{self.filter_number}.root")
        tree_name = f"records{self.filter_number}"
        print(f"Processed records files: {records_files=},\nfeature tree name: {tree_name=}")

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier(filter_number=self.filter_number)),
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_state_filter,
        )


class ResultRecordingTask(Basf2PathTask):
    """
    Task to record data for the final result filter. This requires trained state filters.
    The cuts on the state filter classifiers are set to rather low values to ensure that all signal is contained in the recorded
    file. Also, the values for XXXXXHighUseNStates are chosen conservatively, i.e. rather on the high side.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events_training = b2luigi.IntParameter()
    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

    result_filter_records_name = b2luigi.Parameter()

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.result_filter_records_name)

    def requires(self):
        """
        This task requires that the training SplitNMergeSimTask is finished, as well as that the state filters are trained using
        the CKFStateFilterTeacherTask.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events_training,
            random_seed=self.random_seed,
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events_training,
                random_seed=self.random_seed,
                filter_number=filter_number,
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )

    def create_result_recording_path(self, result_filter_records_name):
        """
        Create a path for the recording of the result filter. This file is then used to train the result filter.

        :param result_filter_records_name: Name of the recording file.
        """

        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD", "PXD"])

        add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCSVDRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")

        fast_bdt_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="RecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva",
                        firstHighFilterParameters={
                            "identifier": self.get_input_file_names(f"trk_ToPXDStateFilter_1{fast_bdt_string}.xml")[0],
                            "cut": 0.01},
                        firstHighUseNStates=10,

                        advanceHighFilter="advance",

                        secondHighFilter="mva",
                        secondHighFilterParameters={
                            "identifier": self.get_input_file_names(f"trk_ToPXDStateFilter_2{fast_bdt_string}.xml")[0],
                            "cut": 0.01},
                        secondHighUseNStates=10,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters={
                            "identifier": self.get_input_file_names(f"trk_ToPXDStateFilter_3{fast_bdt_string}.xml")[0],
                            "cut": 0.01},
                        thirdHighUseNStates=10,

                        filter="recording",
                        filterParameters={"rootFileName": result_filter_records_name},
                        exportTracks=False,

                        enableOverlapResolving=True)

        return path

    def create_path(self):
        """
        Create basf2 path that records the training data for the result filter.
        """
        return self.create_result_recording_path(
            result_filter_records_name=self.get_output_file_name(self.result_filter_records_name),
        )


class CKFResultFilterTeacherTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data for the result filter.
    """

    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    n_events = b2luigi.IntParameter()
    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    fast_bdt_option_result_filter = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

    result_filter_records_name = b2luigi.Parameter()
    training_target = b2luigi.Parameter(
        default="truth"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[]
    )

    def get_weightfile_xml_identifier(self, fast_bdt_option=None):
        """
        Name of the xml weightfile that is created by the teacher task.
        It is subsequently used as a local weightfile in the following validation tasks.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_result_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = "trk_ToPXDResultFilter" + fast_bdt_string
        return weightfile_name + ".xml"

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
            experiment_number=self.experiment_number,
            n_events_training=self.n_events,
            random_seed=self.random_seed,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            result_filter_records_name=self.result_filter_records_name,
        )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_xml_identifier())

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(self.result_filter_records_name)
        tree_name = "records"
        print(f"Processed records files for result filter training: {records_files=},\nfeature tree name: {tree_name=}")

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier()),
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_result_filter,
        )


798class ValidationAndOptimisationTask(Basf2PathTask):
799 """
800 Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values for
801 the states, the number of best candidates kept after each filter, and similar for the result filter.
802 """
803
804 experiment_number = b2luigi.IntParameter()
806 n_events_training = b2luigi.IntParameter()
808 fast_bdt_option_state_filter = b2luigi.ListParameter(
809 # ## \cond
810 hashed=True, default=[200, 8, 3, 0.1]
811 # ## \endcond
812 )
813
814 fast_bdt_option_result_filter = b2luigi.ListParameter(
815 # ## \cond
816 hashed=True, default=[200, 8, 3, 0.1]
817 # ## \endcond
818 )
819
820 n_events_testing = b2luigi.IntParameter()
822 state_filter_cut = b2luigi.FloatParameter()
824 use_n_best_states = b2luigi.IntParameter()
826 result_filter_cut = b2luigi.FloatParameter()
828 use_n_best_results = b2luigi.IntParameter()
830 def output(self):
831 """
832 Generate list of output files that the task should produce.
833 The task is considered finished if and only if the outputs all exist.
834 """
835 fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
836 fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
837 yield self.add_to_output(
838 f"to_pxd_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root")
839
840 def requires(self):
841 """
842 This task requires trained result filters, trained state filters, and that an independent data set for validation was
843 created using the SplitMergeSimTask with the random seed optimisation.
844 """
845 fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
847 result_filter_records_name=f"filter_records{fbdt_state_filter_string}.root",
848 experiment_number=self.experiment_number,
849 n_events=self.n_events_training,
850 fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
851 fast_bdt_option_result_filter=self.fast_bdt_option_result_filter,
852 random_seed='training'
853 )
854 yield SplitNMergeSimTask(
855 bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
856 experiment_number=self.experiment_number,
857 n_events=self.n_events_testing,
858 random_seed="optimisation",
859 )
860 filter_numbers = [1, 2, 3]
861 for filter_number in filter_numbers:
862 yield self.clone(
863 CKFStateFilterTeacherTask,
864 experiment_number=self.experiment_number,
865 n_events=self.n_events_training,
866 random_seed="training",
867 filter_number=filter_number,
868 fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
869 )
870
872 """
873 Create a path to validate the trained filters.
874 """
875 path = basf2.create_path()
876
877 # get all the file names from the list of input files that are meant for optimisation / validation
878 file_list = [fname for sublist in self.get_input_file_names().values()
879 for fname in sublist if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
880 path.add_module("RootInput", inputFileNames=file_list)
881
882 path.add_module("Gearbox")
883 path.add_module("Geometry")
884 path.add_module("SetupGenfitExtrapolation")
885
886 add_hit_preparation_modules(path, components=["SVD", "PXD"])
887
888 add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)
889
890 path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")
891
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="PXDRecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva_with_direction_check",
                        firstHighFilterParameters={
                            "identifier": self.get_input_file_names(
                                f"trk_ToPXDStateFilter_1{fbdt_state_filter_string}.xml")[0],
                            "cut": self.state_filter_cut,
                            "direction": "backward"},
                        firstHighUseNStates=self.use_n_best_states,

                        advanceHighFilter="advance",
                        advanceHighFilterParameters={"direction": "backward"},

                        secondHighFilter="mva",
                        secondHighFilterParameters={
                            "identifier": self.get_input_file_names(
                                f"trk_ToPXDStateFilter_2{fbdt_state_filter_string}.xml")[0],
                            "cut": self.state_filter_cut},
                        secondHighUseNStates=self.use_n_best_states,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters={
                            "identifier": self.get_input_file_names(
                                f"trk_ToPXDStateFilter_3{fbdt_state_filter_string}.xml")[0],
                            "cut": self.state_filter_cut},
                        thirdHighUseNStates=self.use_n_best_states,

                        filter="mva",
                        filterParameters={
                            "identifier": self.get_input_file_names(
                                f"trk_ToPXDResultFilter{fbdt_result_filter_string}.xml")[0],
                            "cut": self.result_filter_cut},
                        useBestNInSeed=self.use_n_best_results,

                        exportTracks=True,
                        enableOverlapResolving=True)

        path.add_module('RelatedTracksCombiner',
                        VXDRecoTracksStoreArrayName="PXDRecoTracks",
                        CDCRecoTracksStoreArrayName="CDCSVDRecoTracks",
                        recoTracksStoreArrayName="RecoTracks")

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=True, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="RecoTracks")

        path.add_module(
                output_file_name=self.get_output_file_name(
                    f"to_pxd_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root"),
                reco_tracks_name="RecoTracks",
                mc_reco_tracks_name="MCRecoTracks",
                name="",
                contact="",
                expert_level=200))

        return path

    def create_path(self):
        """
        Create the basf2 path to process for the validation of the trained filters.
        """


class MainTask(b2luigi.WrapperTask):
    """
    Wrapper task that needs to finish for b2luigi to finish running this steering file.

    It is done if the outputs of all required subtasks exist. It is thus at the
    top of the luigi task graph. Edit the ``requires`` method to steer which
    tasks and with which parameters you want to run.
    """

    # Number of events to generate for the training data set.
    n_events_training = b2luigi.get_setting(
        "n_events_training", default=1000
    )

    # Number of events to generate for the testing, validation, and optimisation data set.
    n_events_testing = b2luigi.get_setting(
        "n_events_testing", default=500
    )

    # Number of events generated by each individual simulation task.
    n_events_per_task = b2luigi.get_setting(
        "n_events_per_task", default=100
    )

    # Value of the "basf2_processes_per_worker" setting.
    num_processes = b2luigi.get_setting(
        "basf2_processes_per_worker", default=0
    )

    # Background files per experiment; the keys are converted to integer experiment numbers.
    bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
    bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}

    def requires(self):
        """
        Generate the list of tasks that need to be done for luigi to finish running
        this steering file.
        """

        fast_bdt_options = [
            [50, 8, 3, 0.1],
            [100, 8, 3, 0.1],
            [200, 8, 3, 0.1],
        ]

        experiment_numbers = b2luigi.get_setting("experiment_numbers")

        # iterate over all possible combinations of parameters from the above defined parameter lists
        for experiment_number, fast_bdt_option_state_filter, fast_bdt_option_result_filter in itertools.product(
            experiment_numbers, fast_bdt_options, fast_bdt_options
        ):

            state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
            n_best_states_list = [3, 5, 10]
            result_filter_cuts = [0.05, 0.1, 0.2]
            n_best_results_list = [2, 3, 5]
            for state_filter_cut, n_best_states, result_filter_cut, n_best_results in \
                    itertools.product(state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list):
                yield self.clone(
                    ValidationAndOptimisationTask,
                    experiment_number=experiment_number,
                    n_events_training=self.n_events_training,
                    n_events_testing=self.n_events_testing,
                    state_filter_cut=state_filter_cut,
                    use_n_best_states=n_best_states,
                    result_filter_cut=result_filter_cut,
                    use_n_best_results=n_best_results,
                    fast_bdt_option_state_filter=fast_bdt_option_state_filter,
                    fast_bdt_option_result_filter=fast_bdt_option_result_filter,
                )


if __name__ == "__main__":
    b2luigi.set_setting("env_script", "./setup_basf2.sh")
    b2luigi.set_setting("batch_system", "htcondor")
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(MainTask(), workers=workers, batch=True)