Belle II Software development
combined_to_pxd_ckf_mva_training.py
"""
combined_to_pxd_ckf_mva_training
-----------------------------------------

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the training and validation of the classifiers of
the three MVA-based state filters and one result filter of the ToPXDCKF.
This CKF extrapolates tracks found in CDC and SVD into the PXD and adds PXD hits
using a combinatorial tree search and a Kalman filter based track fit in each step.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

The order of the b2luigi tasks in this script is as follows (top to bottom):

* Two tasks to create input samples for training and testing (``GenerateSimTask`` and
  ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes a number of events to be
  generated and a number of events per task to reduce runtime. It then divides the total
  number of events by the number of events per task and creates as many
  ``GenerateSimTask`` instances as needed, each with a specific random seed, so that in
  the end the total number of training and testing events are simulated. The individual
  files are then combined by the ``SplitNMergeSimTask`` into one file each for training
  and testing.
* The ``StateRecordingTask`` writes out the data required for training the state
  filters.
* The ``CKFStateFilterTeacherTask`` trains the state filter MVAs, using FastBDT by
  default, with a given set of options.
* The ``ResultRecordingTask`` writes out the data used for the training of the result
  filter MVA. This task requires that the state filters have been trained before.
* The ``CKFResultFilterTeacherTask`` trains the MVA, FastBDT by default, with a
  given set of FastBDT options. This requires that the result filter records have
  been created with the ``ResultRecordingTask``.
* The ``ValidationAndOptimisationTask`` uses the trained weight files and cut values
  provided to run the tracking chain with the weight file under test, and also
  runs the tracking validation.
* Finally, the ``SummaryTask`` is the "brain" of the script. It invokes the
  ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
  and cut values on the MVA classifier output.

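The event splitting performed by the ``SplitNMergeSimTask`` can be sketched in plain
Python. This is only an illustration that mirrors the ``divmod`` logic of its
``requires`` method; the helper name ``plan_simulation_chunks`` and the example numbers
are assumptions made for this sketch and are not part of the script.

```python
def plan_simulation_chunks(n_events, n_events_per_task, random_seed):
    # Hypothetical helper: one (seed, number of events) pair per GenerateSimTask.
    quotient, remainder = divmod(n_events, n_events_per_task)
    chunks = [(random_seed + '_' + str(i).zfill(3), n_events_per_task)
              for i in range(quotient)]
    if remainder > 0:
        # The last task only simulates the events that are left over.
        chunks.append((random_seed + '_' + str(quotient).zfill(3), remainder))
    return chunks


chunks = plan_simulation_chunks(2500, 1000, 'training')
for seed, n_events in chunks:
    print(seed, n_events)
```

With these example numbers, two tasks simulate 1000 events each and a third task
simulates the remaining 500 events, each with its own seed suffix.
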
Due to the dependencies, the tasks are called in reverse order. The ``SummaryTask``
calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
values, and the ``ValidationAndOptimisationTask`` itself calls the required teacher,
training, and simulation tasks.

Each combination of FastBDT options, state filter cut values, and candidate selection
is used to train the result filter. This means that the ``ResultRecordingTask``
is executed multiple times, once for each combination of FastBDT options, cut value,
and candidate selection.

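How such a set of combinations can be enumerated is sketched below with
``itertools.product``. The option lists and variable names are hypothetical
placeholders chosen for this illustration; the values that are actually scanned are
defined in the script itself.

```python
import itertools

# Hypothetical search grid, for illustration only.
fast_bdt_options = [[50, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.05]
n_best_states_list = [3, 5]

# Every element is one (FastBDT option, cut value, candidate selection) combination.
combinations = list(itertools.product(fast_bdt_options, state_filter_cuts, n_best_states_list))
for fast_bdt_option, state_filter_cut, n_best_states in combinations:
    print(fast_bdt_option, state_filter_cut, n_best_states)
```

The number of combinations is the product of the list lengths, so the number of
recording and training tasks grows quickly when further values are added.
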
b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
parameters, output files and which other tasks with which
parameters it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a root
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.

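The idea of resolving such requirements can be sketched without b2luigi. The
following plain-Python illustration is not how b2luigi is implemented; the task names
in the table and the function ``run_order`` are assumptions made for this sketch.

```python
# Hypothetical dependency table: task name -> names of the tasks it requires.
requirements = {
    'simulation': [],
    'recording': ['simulation'],
    'teacher': ['recording'],
    'validation': ['teacher', 'simulation'],
}


def run_order(task, requirements, done=None):
    # Depth-first resolution: every requirement is scheduled before the task itself.
    # No cycle detection is done in this sketch.
    if done is None:
        done = []
    for required in requirements[task]:
        run_order(required, requirements, done)
    if task not in done:
        done.append(task)
    return done


order = run_order('validation', requirements)
print(order)
```

Asking for the last task is enough: everything it depends on is scheduled first,
and a task that is required several times is only run once.
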
The final task that defines which tasks need to be done for the steering file to
finish is the ``SummaryTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the
``SummaryTask`` or replace them by lower-level tasks during debugging.

Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling. It can be installed
via pip::

    python3 -m pip install [--user] b2luigi

Use the ``--user`` option if you do not have the rights to install python packages into
your externals (e.g. because you are using cvmfs) and install them in
``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.

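Such a file is ordinary JSON and can be written and read with the standard library,
as in the following sketch. Only the key ``result_path`` is named in this
documentation; the path value and the use of a temporary directory are assumptions
made to keep the example self-contained.

```python
import json
import os
import tempfile

# Hypothetical settings; only the key result_path is taken from this documentation.
example_settings = {'result_path': '/path/to/results'}

with tempfile.TemporaryDirectory() as tmp_dir:
    settings_file = os.path.join(tmp_dir, 'settings.json')
    # Write the settings file ...
    with open(settings_file, 'w') as f:
        json.dump(example_settings, f, indent=4)
    # ... and read it back, e.g. to check it for syntax errors.
    with open(settings_file) as f:
        loaded_settings = json.load(f)

print(loaded_settings['result_path'])
```

Reading the file back with ``json.load`` is a quick way to catch syntax errors such
as a trailing comma before starting the steering file.
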
Usage
~~~~~

You can test the b2luigi steering file without running it via::

    python3 combined_to_pxd_ckf_mva_training.py --dry-run
    python3 combined_to_pxd_ckf_mva_training.py --show-output

This will show the outputs and show potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_to_pxd_ckf_mva_training.py

One can use the interactive luigi web interface via the central scheduler
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background, which is located in
``~/.local/bin/luigid`` in case b2luigi was installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_to_pxd_ckf_mva_training.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web
browser, you have two options:

  1. Run both the steering file and ``luigid`` remotely and use
     ssh-port-forwarding to your local host. To do so, run on your local
     machine::

         ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

  2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
     local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path`` set in
``settings.json``. The directory tree encodes the used b2luigi parameters. This
ensures reproducibility and makes parameter searches easy. Sometimes, it is hard to
find the relevant output files. You can view the whole directory structure by
running ``tree <result_path>``. Use the unix ``find`` command to find the files
that interest you, e.g.::

    find <result_path> -name "*.root" # find all ROOT files
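
The same search can be done from Python with ``pathlib``, which may be convenient
in a notebook. This is a minimal sketch; the helper name ``find_root_files`` and the
directory and file names of the demonstration are assumptions and do not reflect the
real output tree.

```python
import tempfile
from pathlib import Path


def find_root_files(result_path):
    # Recursively collect all files ending in .root, sorted for a stable order.
    return sorted(Path(result_path).rglob('*.root'))


# Demonstration on a temporary directory tree with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp_dir:
    sub_dir = Path(tmp_dir) / 'layer=1'
    sub_dir.mkdir()
    (sub_dir / 'records1.root').touch()
    (sub_dir / 'notes.txt').touch()
    found = [path.name for path in find_root_files(tmp_dir)]

print(found)
```

For a real search, pass the ``result_path`` from ``settings.json`` to the helper
instead of the temporary directory.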
"""

import itertools
import json
import os
import subprocess
import tempfile

import basf2
import basf2_mva
from tracking import add_track_finding
from tracking.path_utils import add_hit_preparation_modules
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule
import background
import simulation

from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string
from tracking_mva_filter_payloads.write_tracking_mva_filter_payloads_to_db import write_tracking_mva_filter_payloads_to_db

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                " python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise


class LSFTask(b2luigi.Task):
    """
    Simple task that defines the configuration of the LSF batch submission.
    """

    batch_system = 'lsf'

    queue = 's'

    def __init__(self, *args, **kwargs):
        """Constructor."""
        super().__init__(*args, **kwargs)

        self.job_name = self.task_id


    """
    Same as LSFTask, but for memory-intensive tasks.
    """

    job_slots = '4'


class GenerateSimTask(Basf2PathTask, LSFTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        basf2.set_random_seed(self.random_seed)
        path = basf2.create_path()
        path.add_module(
            "EventInfoSetter", evtNumList=[self.n_events], runList=[0], expList=[self.experiment_number]
        )
        path.add_module("EvtGenInput")
        bkg_files = ""
        # \cond suppress doxygen warning
        if self.experiment_number == 0:
        else:
        # \endcond

        simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, usePXDDataReduction=False)

        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.output_file_name()),
        )
        return path

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


# I don't use the default MergeTask or similar because they only work if every input file is called the same.
# Additionally, I want to add more features like deleting the original input to save storage space.
class SplitNMergeSimTask(Basf2Task, LSFTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def requires(self):
        """
        This task requires several GenerateSimTask to be finished so that the required number of events is created.
        """
        n_events_per_task = SummaryTask.n_events_per_task
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        for i in range(quotient):
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=SummaryTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,
                experiment_number=self.experiment_number,
            )
        if remainder > 0:
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=SummaryTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),
                n_events=remainder,
                experiment_number=self.experiment_number,
            )

    @b2luigi.on_temporary_files
    def process(self):
        """
        When all GenerateSimTasks finished, merge the output.
        """
        create_output_dirs(self)

        file_list = [f for f in self.get_all_input_file_names()]
        print("Merge the following files:")
        print(file_list)
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)

    def on_success(self):
        """
        On success method.
        """
        print("Finished merging. Now remove the input files to save space.")
        file_list = [f for f in self.get_all_input_file_names()]
        for input_file in file_list:
            try:
                os.remove(input_file)
            except FileNotFoundError:
                pass

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class StateRecordingTask(Basf2PathTask, LSFTask):
    """
    Record the data for the three state filters for the ToPXDCKF.

    This task requires that the events used for training have been simulated before, which is done using the
    ``SplitNMergeSimTask``.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    layer = b2luigi.IntParameter()

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        for record_fname in ["records1.root", "records2.root", "records3.root"]:
            yield self.add_to_output(record_fname)

    def requires(self):
        """
        This task only requires that the input files have been created.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=SummaryTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events,
            random_seed=self.random_seed,
        )

    def create_state_recording_path(self, layer, records1_fname, records2_fname, records3_fname):
        """
        Create a path for the recording. To record the data for the PXD state filters, CDC+SVD tracks are required, and these
        must be truth matched before. The data have to be recorded for each layer of the PXD, i.e. layers 1 and 2, but also an
        artificial layer 3.

        :param layer: The layer for which the data are recorded.
        :param records1_fname: Name of the records1 file.
        :param records2_fname: Name of the records2 file.
        :param records3_fname: Name of the records3 file.
        """
        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for fname in self.get_all_input_file_names()
                     if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD", "PXD"])

        add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCSVDRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")

        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="RecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        hitFilter="angulardistance",
                        seedFilter="angulardistance",
                        preSeedFilter='all',
                        preHitFilter='all',

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="truth",
                        firstEqualFilter="recording",
                        firstEqualFilterParameters={"treeName": "records1", "rootFileName": records1_fname, "returnWeight": 1.0},
                        firstLowFilter="none",
                        firstHighUseNStates=0,
                        firstToggleOnLayer=layer,

                        advanceHighFilter="advance",

                        secondHighFilter="truth",
                        secondEqualFilter="recording",
                        secondEqualFilterParameters={"treeName": "records2", "rootFileName": records2_fname, "returnWeight": 1.0},
                        secondLowFilter="none",
                        secondHighUseNStates=0,
                        secondToggleOnLayer=layer,

                        updateHighFilter="fit",

                        thirdHighFilter="truth",
                        thirdEqualFilter="recording",
                        thirdEqualFilterParameters={"treeName": "records3", "rootFileName": records3_fname},
                        thirdLowFilter="none",
                        thirdHighUseNStates=0,
                        thirdToggleOnLayer=layer,

                        filter="none",
                        exportTracks=False,

                        enableOverlapResolving=False)

        return path

    def create_path(self):
        """
        Create basf2 path to record the training data for the state filters.
        """
        return self.create_state_recording_path(
            layer=self.layer,
            records1_fname=self.get_output_file_name("records1.root"),
            records2_fname=self.get_output_file_name("records2.root"),
            records3_fname=self.get_output_file_name("records3.root"),
        )

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class CKFStateFilterTeacherTask(Basf2Task, LSFTask):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    In this task the three state filters are trained, each with the corresponding recordings from the different layers.
    It will be executed for each FastBDT option defined in the SummaryTask.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    filter_number = b2luigi.IntParameter()

    training_target = b2luigi.Parameter(
        default="truth"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[]
    )

    def get_weightfile_identifier(self, fast_bdt_option=None, filter_number=None):
        """
        Name of weightfile that is created by the teacher task.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        :param filter_number: Filter number (first=1, second=2, third=3) to be trained
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_state_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)

        if filter_number is None:
            filter_number = self.filter_number
        weightfile_name = f"trk_ToPXDStateFilter_{filter_number}" + fast_bdt_string
        return weightfile_name

    def requires(self):
        """
        This task requires the recordings for the state filters.
        """
        for layer in [1, 2, 3]:
            yield self.clone(
                StateRecordingTask,
                experiment_number=self.experiment_number,
                n_events_training=self.n_events,
                random_seed="training",
                layer=layer
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_identifier() + ".root")

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(f"records{self.filter_number}.root")
        weightfile_identifier = self.get_weightfile_identifier(filter_number=self.filter_number)
        tree_name = f"records{self.filter_number}"
        print(f"Processed records files: {records_files},\nfeature tree name: {tree_name}")

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=weightfile_identifier,
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_state_filter,
        )
        basf2_mva.download(weightfile_identifier, self.get_output_file_name(weightfile_identifier + '.root'))

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class ResultRecordingTask(Basf2PathTask, LSFTask):
    """
    Task to record data for the final result filter. This requires trained state filters.
    The cuts on the state filter classifiers are set to rather low values to ensure that all signal is contained in the recorded
    file. Also, the values for XXXXXHighUseNStates are chosen conservatively, i.e. rather on the high side.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events_training = b2luigi.IntParameter()

    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

    result_filter_records_name = b2luigi.Parameter()

    # prepend testing payloads
    basf2.conditions.prepend_testing_payloads("localdb/database.txt")

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.result_filter_records_name)

    def requires(self):
        """
        This task requires that the training SplitNMergeSimTask is finished, as well as that the state filters are trained using
        the CKFStateFilterTeacherTask.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=SummaryTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events_training,
            random_seed=self.random_seed,
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events_training,
                random_seed=self.random_seed,
                filter_number=filter_number,
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )

    def create_result_recording_path(self, result_filter_records_name):
        """
        Create a path for the recording of the result filter. This file is then used to train the result filter.

        :param result_filter_records_name: Name of the recording file.
        """

        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for fname in self.get_all_input_file_names()
                     if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD", "PXD"])

        add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCSVDRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")

        fast_bdt_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)

        # write the tracking MVA filter parameters and the cut on MVA classifier to be applied on a local db
        iov = [0, 0, 0, -1]
        write_tracking_mva_filter_payloads_to_db(
            f"trk_ToPXDStateFilter_1_Parameter{fast_bdt_string}",
            iov,
            f"trk_ToPXDStateFilter_1{fast_bdt_string}",
            0.01)

        write_tracking_mva_filter_payloads_to_db(
            f"trk_ToPXDStateFilter_2_Parameter{fast_bdt_string}",
            iov,
            f"trk_ToPXDStateFilter_2{fast_bdt_string}",
            0.01)

        write_tracking_mva_filter_payloads_to_db(
            f"trk_ToPXDStateFilter_3_Parameter{fast_bdt_string}",
            iov,
            f"trk_ToPXDStateFilter_3{fast_bdt_string}",
            0.01)

        basf2.conditions.prepend_testing_payloads("localdb/database.txt")
        first_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_1_Parameter{fast_bdt_string}",
                                        "direction": "backward"}
        second_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_2_Parameter{fast_bdt_string}"}
        third_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_3_Parameter{fast_bdt_string}"}

        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="RecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva",
                        firstHighFilterParameters=first_high_filter_parameters,
                        firstHighUseNStates=10,

                        advanceHighFilter="advance",

                        secondHighFilter="mva",
                        secondHighFilterParameters=second_high_filter_parameters,
                        secondHighUseNStates=10,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters=third_high_filter_parameters,
                        thirdHighUseNStates=10,

                        filter="recording",
                        filterParameters={"rootFileName": result_filter_records_name},
                        exportTracks=False,

                        enableOverlapResolving=True)

        return path

    def create_path(self):
        """
        Create basf2 path to record the training data for the result filter.
        """
        return self.create_result_recording_path(
            result_filter_records_name=self.get_output_file_name(self.result_filter_records_name),
        )

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class CKFResultFilterTeacherTask(Basf2Task, LSFTask):
    """
    A teacher task runs the basf2 mva teacher on the training data for the result filter.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    fast_bdt_option_result_filter = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

    result_filter_records_name = b2luigi.Parameter()

    training_target = b2luigi.Parameter(
        default="truth"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[]
    )

    def get_weightfile_identifier(self, fast_bdt_option=None):
        """
        Name of weightfile that is created by the teacher task.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_result_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = "trk_ToPXDResultFilter" + fast_bdt_string
        return weightfile_name

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield ResultRecordingTask(
            experiment_number=self.experiment_number,
            n_events_training=self.n_events,
            random_seed=self.random_seed,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            result_filter_records_name=self.result_filter_records_name,
        )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_identifier() + ".root")

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(self.result_filter_records_name)
        tree_name = "records"
        print(f"Processed records files for result filter training: {records_files},\nfeature tree name: {tree_name}")
        weightfile_identifier = self.get_weightfile_identifier()
        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=self.get_weightfile_identifier(),
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_result_filter,
        )
        basf2_mva.download(weightfile_identifier, self.get_output_file_name(weightfile_identifier + ".root"))

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class ValidationAndOptimisationTask(Basf2PathTask, LSFTask):
    """
    Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values for
    the states, the number of best candidates kept after each filter, and similar for the result filter.
    """

    experiment_number = b2luigi.IntParameter()

    n_events_training = b2luigi.IntParameter()

    fast_bdt_option_state_filter = b2luigi.ListParameter(
        # ## \cond
        hashed=True, default=[200, 8, 3, 0.1]
        # ## \endcond
    )

    fast_bdt_option_result_filter = b2luigi.ListParameter(
        # ## \cond
        hashed=True, default=[200, 8, 3, 0.1]
        # ## \endcond
    )

    n_events_testing = b2luigi.IntParameter()

    state_filter_cut = b2luigi.FloatParameter()

    use_n_best_states = b2luigi.IntParameter()

    result_filter_cut = b2luigi.FloatParameter()

    use_n_best_results = b2luigi.IntParameter()

    # prepend the testing payloads
    basf2.conditions.prepend_testing_payloads("localdb/database.txt")

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
        yield self.add_to_output(
            f"to_pxd_ckf_validation{fbdt_state_filter_string}{fbdt_result_filter_string}.root")

    def requires(self):
        """
        This task requires trained result filters, trained state filters, and that an independent data set for validation was
        created using the SplitNMergeSimTask with the random seed optimisation.
        """
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        yield CKFResultFilterTeacherTask(
            result_filter_records_name=f"filter_records{fbdt_state_filter_string}.root",
            experiment_number=self.experiment_number,
            n_events=self.n_events_training,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            fast_bdt_option_result_filter=self.fast_bdt_option_result_filter,
            random_seed='training'
        )
        yield SplitNMergeSimTask(
            bkgfiles_dir=SummaryTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events_testing,
            random_seed="optimisation",
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events_training,
                random_seed="training",
                filter_number=filter_number,
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )

974 """
975 Create a path to validate the trained filters.
976 """
977 path = basf2.create_path()
978
979 # get all the file names from the list of input files that are meant for optimisation / validation
980 file_list = [fname for fname in self.get_all_input_file_names()
981 if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
982 path.add_module("RootInput", inputFileNames=file_list)
983
984 path.add_module("Gearbox")
985 path.add_module("Geometry")
986 path.add_module("SetupGenfitExtrapolation")
987
988 add_hit_preparation_modules(path, components=["SVD", "PXD"])
989
990 add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)
991
992 path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")
993
994 fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
995 fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
996
997 # write the tracking MVA filter parameters and the cut on the MVA classifier output to a local db
998 iov = [0, 0, 0, -1]
1000 f"trk_ToPXDStateFilter_1_Parameter{fbdt_state_filter_string}",
1001 iov,
1002 f"trk_ToPXDStateFilter_1{fbdt_state_filter_string}",
1003 self.state_filter_cut)
1004
1006 f"trk_ToPXDStateFilter_2_Parameter{fbdt_state_filter_string}",
1007 iov,
1008 f"trk_ToPXDStateFilter_2{fbdt_state_filter_string}",
1009 self.state_filter_cut)
1010
1012 f"trk_ToPXDStateFilter_3_Parameter{fbdt_state_filter_string}",
1013 iov,
1014 f"trk_ToPXDStateFilter_3{fbdt_state_filter_string}",
1015 self.state_filter_cut)
1016
1018 f"trk_ToPXDResultFilter_Parameter{fbdt_result_filter_string}",
1019 iov,
1020 f"trk_ToPXDResultFilter{fbdt_result_filter_string}",
1021 self.result_filter_cut)
1022
1023 basf2.conditions.prepend_testing_payloads("localdb/database.txt")
1024 first_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_1_Parameter{fbdt_state_filter_string}",
1025 "direction": "backward"}
1026 second_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_2_Parameter{fbdt_state_filter_string}"}
1027 third_high_filter_parameters = {"DBPayloadName": f"trk_ToPXDStateFilter_3_Parameter{fbdt_state_filter_string}"}
1028 filter_parameters = {"DBPayloadName": f"trk_ToPXDResultFilter_Parameter{fbdt_result_filter_string}"}
1029
1030 path.add_module("ToPXDCKF",
1031 inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
1032 outputRecoTrackStoreArrayName="PXDRecoTracks",
1033 outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",
1034
1035 relationCheckForDirection="backward",
1036 reverseSeed=False,
1037 writeOutDirection="backward",
1038
1039 firstHighFilter="mva_with_direction_check",
1040 firstHighFilterParameters=first_high_filter_parameters,
1041 firstHighUseNStates=self.use_n_best_states,
1042
1043 advanceHighFilter="advance",
1044 advanceHighFilterParameters={"direction": "backward"},
1045
1046 secondHighFilter="mva",
1047 secondHighFilterParameters=second_high_filter_parameters,
1048 secondHighUseNStates=self.use_n_best_states,
1049
1050 updateHighFilter="fit",
1051
1052 thirdHighFilter="mva",
1053 thirdHighFilterParameters=third_high_filter_parameters,
1054 thirdHighUseNStates=self.use_n_best_states,
1055
1056 filter="mva",
1057 filterParameters=filter_parameters,
1058 useBestNInSeed=self.use_n_best_results,
1059
1060 exportTracks=True,
1061 enableOverlapResolving=True)
1062
1063 path.add_module('RelatedTracksCombiner',
1064 VXDRecoTracksStoreArrayName="PXDRecoTracks",
1065 CDCRecoTracksStoreArrayName="CDCSVDRecoTracks",
1066 recoTracksStoreArrayName="RecoTracks")
1067
1068 path.add_module('TrackFinderMCTruthRecoTracks',
1069 RecoTracksStoreArrayName="MCRecoTracks",
1070 WhichParticles=[],
1071 UsePXDHits=True,
1072 UseSVDHits=True,
1073 UseCDCHits=True)
1074
1075 path.add_module("MCRecoTracksMatcher", UsePXDHits=True, UseSVDHits=True, UseCDCHits=True,
1076 mcRecoTracksStoreArrayName="MCRecoTracks",
1077 prRecoTracksStoreArrayName="RecoTracks")
1078
1079 path.add_module(
1081 output_file_name=self.get_output_file_name(
1082 f"to_pxd_ckf_validation{fbdt_state_filter_string}{fbdt_result_filter_string}.root"),
1083 reco_tracks_name="RecoTracks",
1084 mc_reco_tracks_name="MCRecoTracks",
1085 name="",
1086 contact="",
1087 expert_level=200))
1088
1089 return path
1090
1091 def create_path(self):
1092 """
1093 Create the basf2 path used to validate the trained filters.
1094 """
1096
1097 def remove_output(self):
1098 """
1099 Default function from base b2luigi.Task class.
1100 """
1101 self._remove_output()
1102
1103
1104class SummaryTask(b2luigi.Task):
1105 """
1106 Task that collects and summarizes the main figures of merit from all the
1107 (validation and optimisation) child tasks.
1108 """
1109
1110 n_events_training = b2luigi.get_setting(
1111
1112 "n_events_training", default=1000
1113
1114 )
1115
1116 n_events_testing = b2luigi.get_setting(
1117
1118 "n_events_testing", default=500
1119
1120 )
1121
1122 n_events_per_task = b2luigi.get_setting(
1123
1124 "n_events_per_task", default=100
1125
1126 )
1127
1128 num_processes = b2luigi.get_setting(
1129
1130 "basf2_processes_per_worker", default=0
1131
1132 )
1133
1134
1135 bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
1136
1137 bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
1138
1139
1140 batch_system = 'local'
1141
1142 output_file_name = 'summary.json'
1143
1144 def output(self):
1145 """
1146 Output method.
1147 """
1148 yield self.add_to_output(self.output_file_name)
1149
1150 def requires(self):
1151 """
1152 Generate the list of tasks that need to be done for luigi to finish running
1153 this steering file.
1154 """
1155
1156 fast_bdt_options = [
1157 [50, 8, 3, 0.1],
1158 [100, 8, 3, 0.1],
1159 [200, 8, 3, 0.1],
1160 ]
1161
1162 experiment_numbers = b2luigi.get_setting("experiment_numbers")
1163
1164 # iterate over all possible combinations of parameters from the above defined parameter lists
1165 for experiment_number, fast_bdt_option_state_filter, fast_bdt_option_result_filter in itertools.product(
1166 experiment_numbers, fast_bdt_options, fast_bdt_options
1167 ):
1168
1169 state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
1170 n_best_states_list = [3, 5, 10]
1171 result_filter_cuts = [0.05, 0.1, 0.2]
1172 n_best_results_list = [2, 3, 5]
1173 for state_filter_cut, n_best_states, result_filter_cut, n_best_results in \
1174 itertools.product(state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list):
1175 yield self.clone(
1176 ValidationAndOptimisationTask,
1177 experiment_number=experiment_number,
1178 n_events_training=self.n_events_training,
1179 n_events_testing=self.n_events_testing,
1180 state_filter_cut=state_filter_cut,
1181 use_n_best_states=n_best_states,
1182 result_filter_cut=result_filter_cut,
1183 use_n_best_results=n_best_results,
1184 fast_bdt_option_state_filter=fast_bdt_option_state_filter,
1185 fast_bdt_option_result_filter=fast_bdt_option_result_filter,
1186 )
1187
1188 def run(self):
1189 """
1190 Run method.
1191 """
1192 import ROOT # noqa
1193
1194 # These are the "TNtuple" names to check for
1195 ntuple_names = (
1196 'MCSideTrackingValidationModule_overview_figures_of_merit',
1197 'PRSideTrackingValidationModule_overview_figures_of_merit',
1198 'PRSideTrackingValidationModule_subdetector_figures_of_merit'
1199 )
1200
1201 # Collect the information in a dictionary...
1202 output_dict = {}
1203 all_files = self.get_all_input_file_names()
1204 for idx, single_file in enumerate(all_files):
1205 with ROOT.TFile.Open(single_file, 'READ') as f:
1206 branch_data = {}
1207 for ntuple_name in ntuple_names:
1208 ntuple = f.Get(ntuple_name)
1209 for i in range(min(1, ntuple.GetEntries())): # Here we expect only 1 entry
1210 ntuple.GetEntry(i)
1211 for branch in ntuple.GetListOfBranches():
1212 name = branch.GetName()
1213 value = getattr(ntuple, name)
1214 branch_data[name] = value
1215 branch_data['file_path'] = single_file
1216 output_dict[f'{idx}'] = branch_data
1217
1218 # ... and store the information in a JSON file
1219 with open(self.get_output_file_name(self.output_file_name), 'w') as f:
1220 json.dump(output_dict, f, indent=4)
1221
1222 def remove_output(self):
1223 """
1224 Default function from base b2luigi.Task class.
1225 """
1226 self._remove_output()
1227
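The nested ``itertools.product`` loops in ``SummaryTask.requires`` determine how many ``ValidationAndOptimisationTask`` instances are scheduled. The following stand-alone sketch copies the parameter lists from the listing and counts the resulting combinations. The helper name ``count_validation_tasks`` and the experiment numbers passed to it are placeholders chosen for illustration; they are not part of the steering file, and the real experiment numbers come from the ``experiment_numbers`` setting.

```python
import itertools

# Parameter lists copied from SummaryTask.requires() in the listing.
fast_bdt_options = [[50, 8, 3, 0.1], [100, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
n_best_states_list = [3, 5, 10]
result_filter_cuts = [0.05, 0.1, 0.2]
n_best_results_list = [2, 3, 5]


def count_validation_tasks(experiment_numbers):
    """Hypothetical helper: count the tasks the nested loops would yield."""
    # Outer loop: experiment x state filter options x result filter options.
    outer = itertools.product(experiment_numbers, fast_bdt_options, fast_bdt_options)
    # Inner loop: every combination of cuts and "n best" settings.
    inner = list(itertools.product(
        state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list))
    return sum(len(inner) for _ in outer)


# Experiment number 0 is only a placeholder value.
n_tasks_single_experiment = count_validation_tasks([0])
print(n_tasks_single_experiment)
```

With the lists as given, each experiment number contributes 3 x 3 FastBDT option pairs times 6 x 3 x 3 x 3 cut combinations, so the grid grows quickly and the lists may need trimming for a quick test run.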
1228
1229if __name__ == "__main__":
1230
1231 b2luigi.set_setting("env_script", "./setup_basf2.sh")
1232 b2luigi.set_setting("scratch_dir", tempfile.gettempdir())
1233 workers = b2luigi.get_setting("workers", default=1)
1234 b2luigi.process(SummaryTask(), workers=workers, batch=True)
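``SummaryTask.run`` writes ``summary.json`` as a dictionary keyed by a running index, where each entry holds the branch values read from the validation ntuples plus a ``file_path`` key. A minimal sketch of how such a file could be ranked afterwards is shown below. The function ``rank_configurations``, the key ``finding_efficiency``, and all numbers in the example data are assumptions made for illustration; the actual branch names depend on the validation module output and should be checked in a real ``summary.json``.

```python
import json


def rank_configurations(summary, figure_of_merit, descending=True):
    """Hypothetical helper: sort (file_path, value) pairs by one figure of merit."""
    rows = [
        (entry["file_path"], entry[figure_of_merit])
        for entry in summary.values()
        if figure_of_merit in entry  # skip entries that lack this branch
    ]
    return sorted(rows, key=lambda row: row[1], reverse=descending)


# Synthetic stand-in for the content of summary.json (invented values).
example_summary = {
    "0": {"file_path": "a.root", "finding_efficiency": 0.91},
    "1": {"file_path": "b.root", "finding_efficiency": 0.95},
    "2": {"file_path": "c.root"},  # entry without the requested branch
}

# Round trip through JSON to mimic reading the file written by SummaryTask.
loaded = json.loads(json.dumps(example_summary))
ranking = rank_configurations(loaded, "finding_efficiency")
print(ranking[0])
```

For a real file, ``loaded`` would instead come from ``json.load`` on the opened ``summary.json``; the parameter values of each configuration can then be recovered from the directory structure encoded in ``file_path``.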