"""
combined_cdc_to_svd_ckf_mva_training
-----------------------------------------

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the training and validation of the classifiers of
the three MVA-based state filters and one result filter of the CDCToSVDSpacePointCKF.
This CKF extrapolates tracks found in the CDC into the SVD and adds SVD hits using a
combinatorial tree search and a Kalman filter based track fit in each step.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

The order of the b2luigi tasks in this script is as follows (top to bottom):

* Two tasks to create input samples for training and testing (``GenerateSimTask`` and
  ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes a number of events to be
  generated and a number of events per task to reduce runtime. It then divides the total
  number of events by the number of events per task and creates as many
  ``GenerateSimTask`` instances as needed, each with a specific random seed, so that in
  the end the total number of training and testing events is simulated. The individual
  files are then combined by the ``SplitNMergeSimTask`` into one file each for training
  and testing.
* The ``StateRecordingTask`` writes out the data required for training the state
  filters.
* The ``CKFStateFilterTeacherTask`` trains the state filter MVAs, using FastBDT by
  default, with a given set of options.
* The ``ResultRecordingTask`` writes out the data used for the training of the result
  filter MVA. This task requires that the state filters have been trained before.
* The ``CKFResultFilterTeacherTask`` trains the result filter MVA, using FastBDT by
  default, with a given set of FastBDT options. This requires that the result filter
  records have been created with the ``ResultRecordingTask``.
* The ``ValidationAndOptimisationTask`` uses the trained weight files and the cut values
  provided to run the tracking chain with the weight file under test, and also
  runs the tracking validation.
* Finally, the ``SummaryTask`` is the "brain" of the script. It invokes the
  ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
  and cut values on the MVA classifier output.
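
The splitting done by the ``SplitNMergeSimTask`` can be sketched in plain Python as
follows. This is only an illustration: the helper name ``split_events`` and the example
numbers are assumptions made for this sketch, while the seed scheme (a zero-padded index
appended to the seed string) mirrors what the task does further down in this file.

```python
def split_events(n_events, n_events_per_task, random_seed):
    # Hypothetical helper: one (seed, n_events) pair per simulation job.
    quotient, remainder = divmod(n_events, n_events_per_task)
    jobs = [(f"{random_seed}_{i:03d}", n_events_per_task) for i in range(quotient)]
    if remainder > 0:
        # The last job simulates the leftover events.
        jobs.append((f"{random_seed}_{quotient:03d}", remainder))
    return jobs


# Example numbers are placeholders, not defaults of this script.
jobs = split_events(n_events=2500, n_events_per_task=1000, random_seed="training")
print(jobs)
```

The sum of the per-job event counts always equals the requested total, so no events
are lost when the total is not a multiple of the chunk size.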

Due to the dependencies, the tasks are called in reverse order. The ``SummaryTask``
calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
values, and the ``ValidationAndOptimisationTask`` itself calls the required teacher,
recording, and simulation tasks.

The result filter is trained for each combination of FastBDT options, state filter cut
values, and candidate selection. This means that the ``ResultRecordingTask`` is
executed multiple times, once for each of these combinations.
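
A minimal sketch of how such a parameter scan can be enumerated with
``itertools.product``. The option lists and cut values below are made-up placeholders
for illustration and are not the values used by the ``SummaryTask``.

```python
import itertools

# Placeholder values for illustration only.
fast_bdt_options = [[50, 8, 3, 0.1], [100, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.05]
n_best_states = [3, 5]

# Every combination corresponds to one training and validation chain.
combinations = list(itertools.product(fast_bdt_options, state_filter_cuts, n_best_states))
for fast_bdt_option, cut, n_best in combinations:
    print(fast_bdt_option, cut, n_best)

print(len(combinations))
```

The number of chains grows multiplicatively with each scanned parameter, which is
worth keeping in mind when extending the scan.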

b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
its parameters, its output files, and which other tasks with which
parameters it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a ROOT
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.

The final task that defines which tasks need to be done for the steering file to
finish is the ``SummaryTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the
``SummaryTask`` or replace them by lower-level tasks during debugging.
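
The way requirements determine the execution order can be illustrated without luigi.
The toy graph below uses the task names and dependencies of this file, but the
``requires`` dictionary and the ``run_order`` helper are simplifications invented for
this sketch; the real scheduler additionally handles parameters, output checks, and
batch submission.

```python
# Simplified requirement graph: task name -> tasks it depends on.
requires = {
    "SummaryTask": ["ValidationAndOptimisationTask"],
    "ValidationAndOptimisationTask": ["CKFResultFilterTeacherTask",
                                      "SplitNMergeSimTask",
                                      "CKFStateFilterTeacherTask"],
    "CKFResultFilterTeacherTask": ["ResultRecordingTask"],
    "ResultRecordingTask": ["SplitNMergeSimTask", "CKFStateFilterTeacherTask"],
    "CKFStateFilterTeacherTask": ["StateRecordingTask"],
    "StateRecordingTask": ["SplitNMergeSimTask"],
    "SplitNMergeSimTask": ["GenerateSimTask"],
    "GenerateSimTask": [],
}


def run_order(task, done=None):
    # Depth-first walk: a task is scheduled only after all of its requirements.
    if done is None:
        done = []
    for dependency in requires[task]:
        run_order(dependency, done)
    if task not in done:
        done.append(task)
    return done


order = run_order("SummaryTask")
print(order)
```

Asking for the ``SummaryTask`` is enough to pull in the whole chain, with the
simulation scheduled first and the summary last.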

Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling. It can be installed
via pip::

    python3 -m pip install [--user] b2luigi

Use the ``--user`` option if you do not have the rights to install python packages
into your externals (e.g. because you are using CVMFS) and want to install them in
``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.
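
As an illustration, the snippet below writes a small ``settings.json`` and reads it
back. Only the ``result_path`` key is named in this documentation; the value shown is
a placeholder, and any further keys the script expects have to be taken from the
``settings.json`` that accompanies it.

```python
import json
import os
import tempfile

# Placeholder content; "result_path" is the only key taken from this documentation.
settings = {"result_path": "/path/to/results"}

with tempfile.TemporaryDirectory() as tmp_dir:
    settings_file = os.path.join(tmp_dir, "settings.json")
    with open(settings_file, "w") as f:
        json.dump(settings, f, indent=4)

    # Read the file back the way any JSON consumer would.
    with open(settings_file) as f:
        loaded = json.load(f)

print(loaded["result_path"])
```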

Usage
~~~~~

You can test the b2luigi task graph without running it via::

    python3 combined_cdc_to_svd_ckf_mva_training.py --dry-run
    python3 combined_cdc_to_svd_ckf_mva_training.py --show-output

This will show the outputs and reveal potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_cdc_to_svd_ckf_mva_training.py

You can use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` if b2luigi was installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_cdc_to_svd_ckf_mva_training.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web
browser, you have two options:

    1. Run both the steering file and ``luigid`` remotely and use
       ssh-port-forwarding to your local host. To do so, run on your local
       machine::

           ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

    2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
       local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path`` set in
``settings.json``. The directory tree encodes the used b2luigi parameters. This
ensures reproducibility and makes parameter searches easy, but it can also make it
hard to find the relevant output files. You can view the whole directory structure by
running ``tree <result_path>``. Use the unix ``find`` command to find the files
that interest you, e.g.::

    find <result_path> -name "*.root" # find all ROOT files
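
The same search can be done from Python, for instance in an analysis notebook. This
sketch uses only the standard library; ``find_root_files`` is a hypothetical helper
name, and the small directory tree is created on the fly purely for demonstration.

```python
import tempfile
from pathlib import Path


def find_root_files(result_path):
    # Recursively collect all ROOT files below result_path, sorted for reproducibility.
    return sorted(Path(result_path).rglob("*.root"))


with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a tiny fake result tree to demonstrate the search.
    nested = Path(tmp_dir) / "experiment_number=0" / "n_events=100"
    nested.mkdir(parents=True)
    (nested / "records1.root").touch()
    (nested / "notes.txt").touch()
    found = [path.name for path in find_root_files(tmp_dir)]

print(found)
```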
"""

import itertools
import json
import os
import subprocess
import tempfile

import basf2_mva
import basf2
from tracking import add_track_finding
from tracking.path_utils import add_hit_preparation_modules
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule
import background
import simulation

from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string
from tracking_mva_filter_payloads.write_tracking_mva_filter_payloads_to_db import write_tracking_mva_filter_payloads_to_db

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise


class LSFTask(b2luigi.Task):
    """
    Simple task that defines the configuration of the LSF batch submission.
    """

    batch_system = 'lsf'

    queue = 's'

    def __init__(self, *args, **kwargs):
        """Constructor."""
        super().__init__(*args, **kwargs)

        self.job_name = self.task_id


201 """
202 Same as LSFTask, but for memory-intensive tasks.
203 """
204
205
206 job_slots = '4'


class GenerateSimTask(Basf2PathTask, LSFTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        basf2.set_random_seed(self.random_seed)
        path = basf2.create_path()
        path.add_module(
            "EventInfoSetter", evtNumList=[self.n_events], runList=[0], expList=[self.experiment_number]
        )
        path.add_module("EvtGenInput")
        bkg_files = ""
        # \cond suppress doxygen warning
        if self.experiment_number == 0:
            bkg_files = background.get_background_files()
        else:
            bkg_files = background.get_background_files(self.bkgfiles_dir)
        # \endcond

        simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, usePXDDataReduction=False)

        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.output_file_name()),
        )
        return path

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


# I don't use the default MergeTask or similar because they only work if every input file is called the same.
# Additionally, I want to add more features like deleting the original input to save storage space.
class SplitNMergeSimTask(Basf2Task, LSFTask):
    """
    Split the requested number of events into several ``GenerateSimTask`` jobs
    and merge their output files into a single file.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def requires(self):
        """
        This task requires several GenerateSimTasks to be finished so that the required number of events is created.
        """
        n_events_per_task = SummaryTask.n_events_per_task
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        for i in range(quotient):
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=SummaryTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,
                experiment_number=self.experiment_number,
            )
        if remainder > 0:
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=SummaryTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),
                n_events=remainder,
                experiment_number=self.experiment_number,
            )

    @b2luigi.on_temporary_files
    def process(self):
        """
        When all GenerateSimTasks have finished, merge the output.
        """
        create_output_dirs(self)

        file_list = list(self.get_all_input_file_names())
        print("Merge the following files:")
        print(file_list)
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)

    def on_success(self):
        """
        After a successful merge, remove the input files to save storage space.
        """
        print("Finished merging. Now remove the input files to save space.")
        file_list = list(self.get_all_input_file_names())
        for input_file in file_list:
            try:
                os.remove(input_file)
            except FileNotFoundError:
                pass

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class StateRecordingTask(Basf2PathTask, LSFTask):
    """
    Record the data for the three state filters for the CDCToSVDSpacePointCKF.

    This task requires that the events used for training have been simulated before, which is done using the
    ``SplitNMergeSimTask``.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    layer = b2luigi.IntParameter()

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        for record_fname in ["records1.root", "records2.root", "records3.root"]:
            yield self.add_to_output(record_fname)

    def requires(self):
        """
        This task only requires that the input files have been created.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=SummaryTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            random_seed=self.random_seed,
            n_events=self.n_events,
        )

    def create_state_recording_path(self, layer, records1_fname, records2_fname, records3_fname):
        """
        Create a path for the recording. To record the data for the SVD state filters, CDC tracks are required, and these must
        be truth matched before. The data have to be recorded for each layer of the SVD, i.e. layers 3 to 6, but also for an
        artificial layer 7.

        :param layer: The layer for which the data are recorded.
        :param records1_fname: Name of the records1 file.
        :param records2_fname: Name of the records2 file.
        :param records3_fname: Name of the records3 file.
        """
        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for fname in self.get_all_input_file_names()
                     if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD"])

        add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCRecoTracks")

        path.add_module("CDCToSVDSpacePointCKF",
                        inputRecoTrackStoreArrayName="CDCRecoTracks",
                        outputRecoTrackStoreArrayName="VXDRecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="truth",
                        firstEqualFilter="recording",
                        firstEqualFilterParameters={"treeName": "records1", "rootFileName":
                                                    records1_fname, "returnWeight": 1.0},
                        firstLowFilter="none",
                        firstHighUseNStates=0,
                        firstToggleOnLayer=layer,

                        advanceHighFilter="advance",

                        secondHighFilter="truth",
                        secondEqualFilter="recording",
                        secondEqualFilterParameters={"treeName": "records2", "rootFileName":
                                                     records2_fname, "returnWeight": 1.0},
                        secondLowFilter="none",
                        secondHighUseNStates=0,
                        secondToggleOnLayer=layer,

                        updateHighFilter="fit",

                        thirdHighFilter="truth",
                        thirdEqualFilter="recording",
                        thirdEqualFilterParameters={"treeName": "records3", "rootFileName": records3_fname},
                        thirdLowFilter="none",
                        thirdHighUseNStates=0,
                        thirdToggleOnLayer=layer,

                        filter="none",
                        exportTracks=False,

                        enableOverlapResolving=False)

        return path

    def create_path(self):
        """
        Create basf2 path to record the training data for the state filters.
        """
        return self.create_state_recording_path(
            layer=self.layer,
            records1_fname=self.get_output_file_name("records1.root"),
            records2_fname=self.get_output_file_name("records2.root"),
            records3_fname=self.get_output_file_name("records3.root"),
        )

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class CKFStateFilterTeacherTask(Basf2Task, LSFTask):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    In this task the three state filters are trained, each with the corresponding recordings from the different layers.
    It will be executed for each FastBDT option defined in the SummaryTask.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    filter_number = b2luigi.IntParameter()

    training_target = b2luigi.Parameter(
        default="truth"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[
            "id",
            "last_id",
            "number",
            "last_layer",
            "seed_cdc_hits",
            "seed_svd_hits",
            "seed_lowest_svd_layer",
            "seed_lowest_cdc_layer",
            "quality_index_triplet",
            "quality_index_circle",
            "quality_index_helix",
            "cluster_1_charge",
            "cluster_2_charge",
            "mean_rest_cluster_charge",
            "min_rest_cluster_charge",
            "std_rest_cluster_charge",
            "cluster_1_seed_charge",
            "cluster_2_seed_charge",
            "mean_rest_cluster_seed_charge",
            "min_rest_cluster_seed_charge",
            "std_rest_cluster_seed_charge",
            "cluster_1_size",
            "cluster_2_size",
            "mean_rest_cluster_size",
            "min_rest_cluster_size",
            "std_rest_cluster_size",
            "cluster_1_snr",
            "cluster_2_snr",
            "mean_rest_cluster_snr",
            "min_rest_cluster_snr",
            "std_rest_cluster_snr",
            "cluster_1_charge_over_size",
            "cluster_2_charge_over_size",
            "mean_rest_cluster_charge_over_size",
            "min_rest_cluster_charge_over_size",
            "std_rest_cluster_charge_over_size",
        ]
    )

    def get_weightfile_identifier(self, fast_bdt_option=None, filter_number=None):
        """
        Name of weightfile that is created by the teacher task.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        :param filter_number: Filter number (first=1, second=2, third=3) to be trained
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_state_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        if filter_number is None:
            filter_number = self.filter_number
        weightfile_name = f"trk_CDCToSVDSpacePointStateFilter_{filter_number}" + fast_bdt_string
        return weightfile_name

    def requires(self):
        """
        This task requires the recordings for the state filters from all layers.
        """
        for layer in [3, 4, 5, 6, 7]:
            yield self.clone(
                StateRecordingTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events,
                random_seed="training",
                layer=layer,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_identifier() + ".root")

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(f"records{self.filter_number}.root")
        weightfile_identifier = self.get_weightfile_identifier(filter_number=self.filter_number)
        tree_name = f"records{self.filter_number}"
        print(f"Processed records files: {records_files},\nfeature tree name: {tree_name}")

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=weightfile_identifier,
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_state_filter,
        )
        basf2_mva.download(weightfile_identifier, self.get_output_file_name(weightfile_identifier + ".root"))

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class ResultRecordingTask(Basf2PathTask, LSFTask):
    """
    Task to record data for the final result filter. This requires trained state filters.
    The cuts on the state filter classifiers are set to rather low values to ensure that all signal is contained in the
    recorded file. Also, the values for XXXXXHighUseNStates are chosen conservatively, i.e. rather on the high side.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    result_filter_records_name = b2luigi.Parameter()

    # prepend the testing payloads
    basf2.conditions.prepend_testing_payloads("localdb/database.txt")

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.result_filter_records_name)

    def requires(self):
        """
        This task requires that the training SplitNMergeSimTask is finished, as well as that the state filters are trained
        using the CKFStateFilterTeacherTask.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=SummaryTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            random_seed=self.random_seed,
            n_events=self.n_events,
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events,
                random_seed=self.random_seed,
                filter_number=filter_number,
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )

    def create_result_recording_path(self, result_filter_records_name):
        """
        Create a path for the recording of the result filter. This file is then used to train the result filter.

        :param result_filter_records_name: Name of the recording file.
        """

        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for fname in self.get_all_input_file_names()
                     if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD"])

        add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCRecoTracks")

        fast_bdt_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        # write the tracking MVA filter parameters and the cut on MVA classifier to be applied on a local db
        iov = [0, 0, 0, -1]
        write_tracking_mva_filter_payloads_to_db(
            f"trk_CDCToSVDSpacePointStateFilter_1_Parameter{fast_bdt_string}",
            iov,
            f"trk_CDCToSVDSpacePointStateFilter_1{fast_bdt_string}",
            0.001)

        write_tracking_mva_filter_payloads_to_db(
            f"trk_CDCToSVDSpacePointStateFilter_2_Parameter{fast_bdt_string}",
            iov,
            f"trk_CDCToSVDSpacePointStateFilter_2{fast_bdt_string}",
            0.001)

        write_tracking_mva_filter_payloads_to_db(
            f"trk_CDCToSVDSpacePointStateFilter_3_Parameter{fast_bdt_string}",
            iov,
            f"trk_CDCToSVDSpacePointStateFilter_3{fast_bdt_string}",
            0.001)

        basf2.conditions.prepend_testing_payloads("localdb/database.txt")
        first_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_1_Parameter{fast_bdt_string}",
                                        "direction": "backward"}
        second_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_2_Parameter{fast_bdt_string}"}
        third_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_3_Parameter{fast_bdt_string}"}

        path.add_module("CDCToSVDSpacePointCKF",
                        inputRecoTrackStoreArrayName="CDCRecoTracks",
                        outputRecoTrackStoreArrayName="VXDRecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva_with_direction_check",
                        firstHighFilterParameters=first_high_filter_parameters,
                        firstHighUseNStates=10,

                        advanceHighFilter="advance",
                        advanceHighFilterParameters={"direction": "backward"},

                        secondHighFilter="mva",
                        secondHighFilterParameters=second_high_filter_parameters,
                        secondHighUseNStates=10,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters=third_high_filter_parameters,
                        thirdHighUseNStates=10,

                        filter="recording",
                        filterParameters={"rootFileName": result_filter_records_name},
                        exportTracks=False,

                        enableOverlapResolving=True)

        return path

    def create_path(self):
        """
        Create basf2 path to record the training data for the result filter.
        """
        return self.create_result_recording_path(
            result_filter_records_name=self.get_output_file_name(self.result_filter_records_name),
        )

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class CKFResultFilterTeacherTask(Basf2Task, LSFTask):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    Since teacher tasks are needed for all quality estimators covered by this
    steering file and the only thing that changes is the required data
    collection task and some training parameters, I decided to use inheritance
    and have the basic functionality in this base class/interface and have the
    specific teacher tasks inherit from it.
    """

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    n_events = b2luigi.IntParameter()

    fast_bdt_option_state_filter = b2luigi.ListParameter(
        hashed=True, default=[50, 8, 3, 0.1]
    )

    fast_bdt_option_result_filter = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

    result_filter_records_name = b2luigi.Parameter()

    training_target = b2luigi.Parameter(
        default="truth"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[]
    )

    def get_weightfile_identifier(self, fast_bdt_option=None):
        """
        Name of weightfile that is created by the teacher task.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_result_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = "trk_CDCToSVDSpacePointResultFilter" + fast_bdt_string
        return weightfile_name

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield ResultRecordingTask(
            experiment_number=self.experiment_number,
            n_events=self.n_events,
            random_seed=self.random_seed,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            result_filter_records_name=self.result_filter_records_name,
        )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_identifier() + ".root")

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(self.result_filter_records_name)
        tree_name = "records"
        print(f"Processed records files for result filter training: {records_files},\nfeature tree name: {tree_name}")
        weightfile_identifier = self.get_weightfile_identifier()
        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=weightfile_identifier,
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_result_filter,
        )

        basf2_mva.download(weightfile_identifier, self.get_output_file_name(weightfile_identifier + ".root"))

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class ValidationAndOptimisationTask(Basf2PathTask, LSFTask):
    """
    Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values
    for the states, the number of best candidates kept after each filter, and similar for the result filter.
    """

    experiment_number = b2luigi.IntParameter()

    n_events_training = b2luigi.IntParameter()

    fast_bdt_option_state_filter = b2luigi.ListParameter(
        # ## \cond
        hashed=True, default=[50, 8, 3, 0.1]
        # ## \endcond
    )

    fast_bdt_option_result_filter = b2luigi.ListParameter(
        # ## \cond
        hashed=True, default=[200, 8, 3, 0.1]
        # ## \endcond
    )

    n_events_testing = b2luigi.IntParameter()

    state_filter_cut = b2luigi.FloatParameter()

    use_n_best_states = b2luigi.IntParameter()

    result_filter_cut = b2luigi.FloatParameter()

    use_n_best_results = b2luigi.IntParameter()

    # prepend the testing payloads
    basf2.conditions.prepend_testing_payloads("localdb/database.txt")

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
        yield self.add_to_output(
            f"cdc_to_svd_spacepoint_ckf_validation{fbdt_state_filter_string}{fbdt_result_filter_string}.root")

    def requires(self):
        """
        This task requires trained result filters, trained state filters, and that an independent data set for validation was
        created using the ``SplitNMergeSimTask`` with the random seed ``optimisation``.
        """
        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
            result_filter_records_name=f"filter_records{fbdt_state_filter_string}.root",
            experiment_number=self.experiment_number,
            n_events=self.n_events_training,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            fast_bdt_option_result_filter=self.fast_bdt_option_result_filter,
            random_seed='training'
        )
        yield SplitNMergeSimTask(
            bkgfiles_dir=SummaryTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events_testing,
            random_seed="optimisation",
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                random_seed="training",
                n_events=self.n_events_training,
                filter_number=filter_number,
                fast_bdt_option=self.fast_bdt_option_state_filter
            )

        """
        Create a path to validate the trained filters.
        """
        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for optimisation / validation
        file_list = [fname for fname in self.get_all_input_file_names()
                     if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD"])

        add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)

        fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)

        # write the tracking MVA filter parameters and the cut on MVA classifier to be applied on a local db
        iov = [0, 0, 0, -1]
            f"trk_CDCToSVDSpacePointStateFilter_1_Parameter{fbdt_state_filter_string}",
            iov,
            f"trk_CDCToSVDSpacePointStateFilter_1{fbdt_state_filter_string}",
            self.state_filter_cut)

            f"trk_CDCToSVDSpacePointStateFilter_2_Parameter{fbdt_state_filter_string}",
            iov,
            f"trk_CDCToSVDSpacePointStateFilter_2{fbdt_state_filter_string}",
            self.state_filter_cut)

            f"trk_CDCToSVDSpacePointStateFilter_3_Parameter{fbdt_state_filter_string}",
            iov,
            f"trk_CDCToSVDSpacePointStateFilter_3{fbdt_state_filter_string}",
            self.state_filter_cut)

            f"trk_CDCToSVDSpacePointResultFilter_Parameter{fbdt_result_filter_string}",
            iov,
            f"trk_CDCToSVDSpacePointResultFilter{fbdt_result_filter_string}",
            self.result_filter_cut)

        basf2.conditions.prepend_testing_payloads("localdb/database.txt")
        first_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_1_Parameter{fbdt_state_filter_string}",
                                        "direction": "backward"}
        second_high_filter_parameters = {
            "DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_2_Parameter{fbdt_state_filter_string}"}
        third_high_filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointStateFilter_3_Parameter{fbdt_state_filter_string}"}
        filter_parameters = {"DBPayloadName": f"trk_CDCToSVDSpacePointResultFilter_Parameter{fbdt_result_filter_string}"}

        path.add_module("CDCToSVDSpacePointCKF",

                        inputRecoTrackStoreArrayName="CDCRecoTracks",
                        outputRecoTrackStoreArrayName="VXDRecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva_with_direction_check",
                        firstHighFilterParameters=first_high_filter_parameters,
                        firstHighUseNStates=self.use_n_best_states,

                        advanceHighFilter="advance",
                        advanceHighFilterParameters={"direction": "backward"},

                        secondHighFilter="mva",
                        secondHighFilterParameters=second_high_filter_parameters,
                        secondHighUseNStates=self.use_n_best_states,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters=third_high_filter_parameters,
                        thirdHighUseNStates=self.use_n_best_states,

                        filter="mva",
                        filterParameters=filter_parameters,
                        useBestNInSeed=self.use_n_best_results,

                        exportTracks=True,
                        enableOverlapResolving=True)

        path.add_module('RelatedTracksCombiner',
                        VXDRecoTracksStoreArrayName="VXDRecoTracks",
                        CDCRecoTracksStoreArrayName="CDCRecoTracks",
                        recoTracksStoreArrayName="RecoTracks")

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="RecoTracks")

        path.add_module(
                output_file_name=self.get_output_file_name(
                    f"cdc_to_svd_spacepoint_ckf_validation{fbdt_state_filter_string}{fbdt_result_filter_string}.root"),
                reco_tracks_name="RecoTracks",
                mc_reco_tracks_name="MCRecoTracks",
                name="",
                contact="",
                expert_level=200))

        return path

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


class SummaryTask(b2luigi.Task):
    """
    Task that collects and summarizes the main figures of merit from all the
    (validation and optimisation) child tasks.
    """

    n_events_training = b2luigi.get_setting(
        "n_events_training", default=1000
    )

    n_events_testing = b2luigi.get_setting(
        "n_events_testing", default=500
    )

    n_events_per_task = b2luigi.get_setting(
        "n_events_per_task", default=100
    )

    num_processes = b2luigi.get_setting(
        "basf2_processes_per_worker", default=0
    )

    bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
    bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}

    batch_system = 'local'

    output_file_name = 'summary.json'

    def output(self):
        """
        Output method.
        """
        yield self.add_to_output(self.output_file_name)

    def requires(self):
        """
        Generate list of tasks that need to be done for luigi to finish running
        this steering file.
        """

        fast_bdt_options = [
            [50, 8, 3, 0.1],
            [100, 8, 3, 0.1],
            [200, 8, 3, 0.1],
        ]

        experiment_numbers = b2luigi.get_setting("experiment_numbers")

        # iterate over all possible combinations of parameters from the above defined parameter lists
        for experiment_number, fast_bdt_option_state_filter, fast_bdt_option_result_filter in itertools.product(
                experiment_numbers, fast_bdt_options, fast_bdt_options
        ):

            state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
            n_best_states_list = [3, 5, 10]
            result_filter_cuts = [0.05, 0.1, 0.2]
            n_best_results_list = [3, 5, 10]
            for state_filter_cut, n_best_states, result_filter_cut, n_best_results in \
                    itertools.product(state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list):
                yield self.clone(
                    ValidationAndOptimisationTask,
                    experiment_number=experiment_number,
                    n_events_training=self.n_events_training,
                    n_events_testing=self.n_events_testing,
                    state_filter_cut=state_filter_cut,
                    use_n_best_states=n_best_states,
                    result_filter_cut=result_filter_cut,
                    use_n_best_results=n_best_results,
                    fast_bdt_option_state_filter=fast_bdt_option_state_filter,
                    fast_bdt_option_result_filter=fast_bdt_option_result_filter,
                )

    def run(self):
        """
        Run method.
        """
        import ROOT  # noqa

        # These are the "TNtuple" names to check for
        ntuple_names = (
            'MCSideTrackingValidationModule_overview_figures_of_merit',
            'PRSideTrackingValidationModule_overview_figures_of_merit',
            'PRSideTrackingValidationModule_subdetector_figures_of_merit'
        )

        # Collect the information in a dictionary...
        output_dict = {}
        all_files = self.get_all_input_file_names()
        for idx, single_file in enumerate(all_files):
            with ROOT.TFile.Open(single_file, 'READ') as f:
                branch_data = {}
                for ntuple_name in ntuple_names:
                    ntuple = f.Get(ntuple_name)
                    for i in range(min(1, ntuple.GetEntries())):  # Here we expect only 1 entry
                        ntuple.GetEntry(i)
                        for branch in ntuple.GetListOfBranches():
                            name = branch.GetName()
                            value = getattr(ntuple, name)
                            branch_data[name] = value
                branch_data['file_path'] = single_file
                output_dict[f'{idx}'] = branch_data

        # ... and store the information in a JSON file
        with open(self.get_output_file_name(self.output_file_name), 'w') as f:
            json.dump(output_dict, f, indent=4)

    def remove_output(self):
        """
        Default function from base b2luigi.Task class.
        """
        self._remove_output()


if __name__ == "__main__":

    b2luigi.set_setting("env_script", "./setup_basf2.sh")
    b2luigi.set_setting("scratch_dir", tempfile.gettempdir())
    workers = b2luigi.get_setting("workers", default=500)
    b2luigi.process(SummaryTask(), workers=workers, batch=True)
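
As a rough illustration of the scan that ``SummaryTask.requires`` sets up, the sketch below counts the parameter combinations using the same ``itertools.product`` nesting and the option lists from the listing. The helper name ``count_validation_tasks`` and the experiment number ``0`` are assumptions made only for this example; in the script the experiment numbers come from the ``experiment_numbers`` setting.

```python
import itertools


def count_validation_tasks(experiment_numbers):
    """Count the combinations iterated by the nested loops in SummaryTask.requires.

    Hypothetical helper for illustration only; it is not part of the script.
    """
    # Option lists copied from SummaryTask.requires
    fast_bdt_options = [
        [50, 8, 3, 0.1],
        [100, 8, 3, 0.1],
        [200, 8, 3, 0.1],
    ]
    state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
    n_best_states_list = [3, 5, 10]
    result_filter_cuts = [0.05, 0.1, 0.2]
    n_best_results_list = [3, 5, 10]

    # Outer loop: experiment x state-filter options x result-filter options
    outer = itertools.product(experiment_numbers, fast_bdt_options, fast_bdt_options)
    # Inner loop: cut values and numbers of best candidates to keep
    inner = list(itertools.product(
        state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list))
    return sum(len(inner) for _ in outer)


# Assumed single experiment number, chosen only for the example
n_tasks = count_validation_tasks([0])
print(n_tasks)
```

For a single experiment this is 3 x 3 FastBDT option pairs times 6 x 3 x 3 x 3 cut and candidate settings, so the count grows quickly when further values are added to any of the lists.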