Belle II Software  release-08-01-10
combined_cdc_to_svd_ckf_mva_training.py
1 
8 
9 """
10 combined_cdc_to_svd_ckf_mva_training
11 -----------------------------------------
12 
13 Purpose of this script
14 ~~~~~~~~~~~~~~~~~~~~~~
15 
16 This python script is used for the training and validation of the classifiers of
17 the three MVA-based state filters and one result filter of the CDCToSVDSpacePointCKF.
18 This CKF extraplates tracks found in the CDC into the SVD and adds SVD hits using a
19 combinatorial tree search and a Kalman filter based track fit in each step.
20 
21 To avoid mistakes, b2luigi is used to create a task chain for a combined training and
22 validation of all classifiers.
23 
24 The order of the b2luigi tasks in this script is as follows (top to bottom):
25 * Two tasks to create input samples for training and testing (``GenerateSimTask`` and
26 ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes a number of events to be
27 generated and a number of events per task to reduce runtime. It then divides the total
28 number of events by the number of events per task and creates as ``GenerateSimTask`` as
29 needed, each with a specific random seed, so that in the end the total number of
30 training and testing events are simulated. The individual files are then combined
31 by the SplitNMergeSimTask into one file each for training and testing.
32 * The ``StateRecordingTask`` writes out the data required for training the state
33 filters.
34 * The ``CKFStateFilterTeacherTask`` trains the state filter MVAs, using FastBDT by
35 default, with a given set of options.
36 * The ``ResultRecordingTask`` writes out the data used for the training of the result
37 filter MVA. This task requires that the state filters have been trained before.
38 * The ``CKFResultFilterTeacherTask`` trains the MVA, FastBDT per default, with a
39 given set of FastBDT options. This requires that the result filter records have
40 been created with the ``ResultRecordingTask``.
41 * The ``ValidationAndOptimisationTask`` uses the trained weight files and cut values
42 provided to run the tracking chain with the weight file under test, and also
43 runs the tracking validation.
44 * Finally, the ``MainTask`` is the "brain" of the script. It invokes the
45 ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
46 and cut values on the MVA classifier output.
47 
48 Due to the dependencies, the calls of the task are reversed. The MainTask
49 calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
50 values, and the ``ValidationAndOptimisationTask`` itself calls the required teacher,
51 training, and simulation tasks.
52 
53 Each combination of FastBDT options and state filter cut values and candidate selection
54 is used to train the result filter, which includes that the ``ResultRecordingTask``
55 is executed multiple times with different combinations of FastBDT options and cut value
56 and candidate selection.
57 
58 b2luigi: Understanding the steering file
59 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
60 
61 All trainings and validations are done in the correct order in this steering
62 file. For the purpose of creating a dependency graph, the `b2luigi
63 <https://b2luigi.readthedocs.io>`_ python package is used, which extends the
64 `luigi <https://luigi.readthedocs.io>`_ package developed by spotify.
65 
66 Each task that has to be done is represented by a special class, which defines
67 which defines parameters, output files and which other tasks with which
68 parameters it depends on. For example a teacher task, which runs
69 ``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
70 task which runs a reconstruction and writes out track-wise variables into a root
71 file for training. An evaluation/validation task for testing the classifier
72 requires both the teacher task, as it needs the weightfile to be present, and
73 also a data collection task, because it needs a dataset for testing classifier.
74 
75 The final task that defines which tasks need to be done for the steering file to
76 finish is the ``MainTask``. When you only want to run parts of the
77 training/validation pipeline, you can comment out requirements in the Master
78 task or replace them by lower-level tasks during debugging.
79 
80 Requirements
81 ~~~~~~~~~~~~
82 
83 This steering file relies on b2luigi_ for task scheduling. It can be installed
84 via pip::
85 
86  python3 -m pip install [--user] b2luigi
87 
88 Use the ``--user`` option if you have not rights to install python packages into
89 your externals (e.g. because you are using cvmfs) and install them in
90 ``$HOME/.local`` instead.
91 
92 Configuration
93 ~~~~~~~~~~~~~
94 
95 Instead of command line arguments, the b2luigi script is configured via a
96 ``settings.json`` file. Open it in your favorite text editor and modify it to
97 fit to your requirements.
98 
99 Usage
100 ~~~~~
101 
102 You can test the b2luigi without running it via::
103 
104  python3 combined_cdc_to_svd_ckf_mva_training.py --dry-run
105  python3 combined_cdc_to_svd_ckf_mva_training.py --show-output
106 
107 This will show the outputs and show potential errors in the definitions of the
108 luigi task dependencies. To run the the steering file in normal (local) mode,
109 run::
110 
111  python3 combined_cdc_to_svd_ckf_mva_training.py
112 
113 One can use the interactive luigi web interface via the central scheduler
114 which visualizes the task graph while it is running. Therefore, the scheduler
115 daemon ``luigid`` has to run in the background, which is located in
116 ``~/.local/bin/luigid`` in case b2luigi had been installed with ``--user``. For
117 example, run::
118 
119  luigid --port 8886
120 
121 Then, execute your steering (e.g. in another terminal) with::
122 
123  python3 combined_cdc_to_svd_ckf_mva_training.py --scheduler-port 8886
124 
125 To view the web interface, open your webbrowser enter into the url bar::
126 
127  localhost:8886
128 
129 If you don't run the steering file on the same machine on which you run your web
130 browser, you have two options:
131 
132  1. Run both the steering file and ``luigid`` remotely and use
133  ssh-port-forwarding to your local host. Therefore, run on your local
134  machine::
135 
136  ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>
137 
138  2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
139  local host>`` argument when calling the steering file
140 
141 Accessing the results / output files
142 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
143 
144 All output files are stored in a directory structure in the ``result_path`` set in
145 ``settings.json``. The directory tree encodes the used b2luigi parameters. This
146 ensures reproducibility and makes parameter searches easy. Sometimes, it is hard to
147 find the relevant output files. You can view the whole directory structure by
148 running ``tree <result_path>``. Ise the unix ``find`` command to find the files
149 that interest you, e.g.::
150 
151  find <result_path> -name "*.root" # find all ROOT files
152 """
153 
154 import itertools
155 import subprocess
156 import os
157 
158 import basf2
159 from tracking import add_track_finding
160 from tracking.path_utils import add_hit_preparation_modules
161 from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule
162 import background
163 import simulation
164 
165 from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string
166 
167 # wrap python modules that are used here but not in the externals into a try except block
168 install_helpstring_formatter = ("\nCould not find {module} python module.Try installing it via\n"
169  " python3 -m pip install [--user] {module}\n")
170 try:
171  import b2luigi
172  from b2luigi.core.utils import create_output_dirs
173  from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
174 except ModuleNotFoundError:
175  print(install_helpstring_formatter.format(module="b2luigi"))
176  raise
177 
178 
179 class GenerateSimTask(Basf2PathTask):
180  """
181  Generate simulated Monte Carlo with background overlay.
182 
183  Make sure to use different ``random_seed`` parameters for the training data
184  format the classifier trainings and for the test data for the respective
185  evaluation/validation tasks.
186  """
187 
188 
189  experiment_number = b2luigi.IntParameter()
190 
192  random_seed = b2luigi.Parameter()
193 
194  n_events = b2luigi.IntParameter()
195 
196  bkgfiles_dir = b2luigi.Parameter(
197 
198  hashed=True
199 
200  )
201 
202  queue = 'l'
203 
204 
205  def output_file_name(self, n_events=None, random_seed=None):
206  """
207  Create output file name depending on number of events and production
208  mode that is specified in the random_seed string.
209 
210  :param n_events: Number of events to simulate.
211  :param random_seed: Random seed to use for the simulation to create independent samples.
212  """
213  if n_events is None:
214  n_events = self.n_eventsn_events
215  if random_seed is None:
216  random_seed = self.random_seedrandom_seed
217  return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"
218 
219  def output(self):
220  """
221  Generate list of output files that the task should produce.
222  The task is considered finished if and only if the outputs all exist.
223  """
224  yield self.add_to_output(self.output_file_nameoutput_file_name())
225 
226  def create_path(self):
227  """
228  Create basf2 path to process with event generation and simulation.
229  """
230  basf2.set_random_seed(self.random_seedrandom_seed)
231  path = basf2.create_path()
232  path.add_module(
233  "EventInfoSetter", evtNumList=[self.n_eventsn_events], runList=[0], expList=[self.experiment_numberexperiment_number]
234  )
235  path.add_module("EvtGenInput")
236  bkg_files = ""
237  if self.experiment_numberexperiment_number == 0:
238  bkg_files = background.get_background_files()
239  else:
240  bkg_files = background.get_background_files(self.bkgfiles_dirbkgfiles_dir)
241 
242  simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, usePXDDataReduction=False)
243 
244  path.add_module(
245  "RootOutput",
246  outputFileName=self.get_output_file_name(self.output_file_nameoutput_file_name()),
247  )
248  return path
249 
250 
251 # I don't use the default MergeTask or similar because they only work if every input file is called the same.
252 # Additionally, I want to add more features like deleting the original input to save storage space.
253 class SplitNMergeSimTask(Basf2Task):
254  """
255  Generate simulated Monte Carlo with background overlay.
256 
257  Make sure to use different ``random_seed`` parameters for the training data
258  format the classifier trainings and for the test data for the respective
259  evaluation/validation tasks.
260  """
261 
262  experiment_number = b2luigi.IntParameter()
263 
265  random_seed = b2luigi.Parameter()
266 
267  n_events = b2luigi.IntParameter()
268 
269  bkgfiles_dir = b2luigi.Parameter(
270 
271  hashed=True
272 
273  )
274 
275  queue = 'sx'
276 
277 
278  def output_file_name(self, n_events=None, random_seed=None):
279  """
280  Create output file name depending on number of events and production
281  mode that is specified in the random_seed string.
282 
283  :param n_events: Number of events to simulate.
284  :param random_seed: Random seed to use for the simulation to create independent samples.
285  """
286  if n_events is None:
287  n_events = self.n_eventsn_events
288  if random_seed is None:
289  random_seed = self.random_seedrandom_seed
290  return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"
291 
292  def output(self):
293  """
294  Generate list of output files that the task should produce.
295  The task is considered finished if and only if the outputs all exist.
296  """
297  yield self.add_to_output(self.output_file_nameoutput_file_name())
298 
299  def requires(self):
300  """
301  This task requires several GenerateSimTask to be finished so that he required number of events is created.
302  """
303  n_events_per_task = MainTask.n_events_per_task
304  quotient, remainder = divmod(self.n_eventsn_events, n_events_per_task)
305  for i in range(quotient):
306  yield GenerateSimTask(
307  bkgfiles_dir=self.bkgfiles_dirbkgfiles_dir,
308  num_processes=MainTask.num_processes,
309  random_seed=self.random_seedrandom_seed + '_' + str(i).zfill(3),
310  n_events=n_events_per_task,
311  experiment_number=self.experiment_numberexperiment_number,
312  )
313  if remainder > 0:
314  yield GenerateSimTask(
315  bkgfiles_dir=self.bkgfiles_dirbkgfiles_dir,
316  num_processes=MainTask.num_processes,
317  random_seed=self.random_seedrandom_seed + '_' + str(quotient).zfill(3),
318  n_events=remainder,
319  experiment_number=self.experiment_numberexperiment_number,
320  )
321 
322  @b2luigi.on_temporary_files
323  def process(self):
324  """
325  When all GenerateSimTasks finished, merge the output.
326  """
327  create_output_dirs(self)
328 
329  file_list = [item for sublist in self.get_input_file_names().values() for item in sublist]
330  print("Merge the following files:")
331  print(file_list)
332  cmd = ["b2file-merge", "-f"]
333  args = cmd + [self.get_output_file_name(self.output_file_nameoutput_file_name())] + file_list
334  subprocess.check_call(args)
335  print("Finished merging. Now remove the input files to save space.")
336  for input_file in file_list:
337  try:
338  os.remove(input_file)
339  except FileNotFoundError:
340  pass
341 
342 
343 class StateRecordingTask(Basf2PathTask):
344  """
345  Record the data for the three state filters for the CDCToSVDSpacePointCKF.
346 
347  This task requires that the events used for training have been simulated before, which is done using the
348  ``SplitMergeSimTask``.
349  """
350 
351  experiment_number = b2luigi.IntParameter()
352 
354  random_seed = b2luigi.Parameter()
355 
356  n_events = b2luigi.IntParameter()
357 
358 
359  layer = b2luigi.IntParameter()
360 
361  def output(self):
362  """
363  Generate list of output files that the task should produce.
364  The task is considered finished if and only if the outputs all exist.
365  """
366  for record_fname in ["records1.root", "records2.root", "records3.root"]:
367  yield self.add_to_output(record_fname)
368 
369  def requires(self):
370  """
371  This task only requires that the input files have been created.
372  """
373  yield SplitNMergeSimTask(
374  bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_numberexperiment_number],
375  experiment_number=self.experiment_numberexperiment_number,
376  random_seed=self.random_seedrandom_seed,
377  n_events=self.n_eventsn_events,
378  )
379 
380  def create_state_recording_path(self, layer, records1_fname, records2_fname, records3_fname):
381  """
382  Create a path for the recording. To record the data for the SVD state filters, CDC tracks are required, and these must
383  be truth matched before. The data have to recorded for each layer of the SVD, i.e. layers 3 to 6, but also an artificial
384  layer 7.
385 
386  :param layer: The layer for which the data are recorded.
387  :param records1_fname: Name of the records1 file.
388  :param records2_fname: Name of the records2 file.
389  :param records3_fname: Name of the records3 file.
390  """
391  path = basf2.create_path()
392 
393  # get all the file names from the list of input files that are meant for training
394  file_list = [fname for sublist in self.get_input_file_names().values()
395  for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
396  path.add_module("RootInput", inputFileNames=file_list)
397 
398  path.add_module("Gearbox")
399  path.add_module("Geometry")
400  path.add_module("SetupGenfitExtrapolation")
401 
402  add_hit_preparation_modules(path, components=["SVD"])
403 
404  add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)
405 
406  path.add_module('TrackFinderMCTruthRecoTracks',
407  RecoTracksStoreArrayName="MCRecoTracks",
408  WhichParticles=[],
409  UsePXDHits=True,
410  UseSVDHits=True,
411  UseCDCHits=True)
412 
413  path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
414  mcRecoTracksStoreArrayName="MCRecoTracks",
415  prRecoTracksStoreArrayName="CDCRecoTracks")
416  path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCRecoTracks")
417 
418  path.add_module("CDCToSVDSpacePointCKF",
419  inputRecoTrackStoreArrayName="CDCRecoTracks",
420  outputRecoTrackStoreArrayName="VXDRecoTracks",
421  outputRelationRecoTrackStoreArrayName="CDCRecoTracks",
422 
423  relationCheckForDirection="backward",
424  reverseSeed=False,
425  writeOutDirection="backward",
426 
427  firstHighFilter="truth",
428  firstEqualFilter="recording",
429  firstEqualFilterParameters={"treeName": "records1", "rootFileName":
430  records1_fname, "returnWeight": 1.0},
431  firstLowFilter="none",
432  firstHighUseNStates=0,
433  firstToggleOnLayer=layer,
434 
435  advanceHighFilter="advance",
436 
437  secondHighFilter="truth",
438  secondEqualFilter="recording",
439  secondEqualFilterParameters={"treeName": "records2", "rootFileName":
440  records2_fname, "returnWeight": 1.0},
441  secondLowFilter="none",
442  secondHighUseNStates=0,
443  secondToggleOnLayer=layer,
444 
445  updateHighFilter="fit",
446 
447  thirdHighFilter="truth",
448  thirdEqualFilter="recording",
449  thirdEqualFilterParameters={"treeName": "records3", "rootFileName": records3_fname},
450  thirdLowFilter="none",
451  thirdHighUseNStates=0,
452  thirdToggleOnLayer=layer,
453 
454  filter="none",
455  exportTracks=False,
456 
457  enableOverlapResolving=False)
458 
459  return path
460 
461  def create_path(self):
462  """
463  Create basf2 path to process with event generation and simulation.
464  """
465  return self.create_state_recording_pathcreate_state_recording_path(
466  layer=self.layerlayer,
467  records1_fname=self.get_output_file_name("records1.root"),
468  records2_fname=self.get_output_file_name("records2.root"),
469  records3_fname=self.get_output_file_name("records3.root"),
470  )
471 
472 
473 class CKFStateFilterTeacherTask(Basf2Task):
474  """
475  A teacher task runs the basf2 mva teacher on the training data provided by a
476  data collection task.
477 
478  In this task the three state filters are trained, each with the corresponding recordings from the different layers.
479  It will be executed for each FastBDT option defined in the MainTask.
480  """
481 
482  experiment_number = b2luigi.IntParameter()
483 
485  random_seed = b2luigi.Parameter()
486 
487  n_events = b2luigi.IntParameter()
488 
489  fast_bdt_option_state_filter = b2luigi.ListParameter(
490 
491  hashed=True, default=[50, 8, 3, 0.1]
492 
493  )
494 
495  filter_number = b2luigi.IntParameter()
496 
497  training_target = b2luigi.Parameter(
498 
499  default="truth"
500 
501  )
502 
504  exclude_variables = b2luigi.ListParameter(
505 
507  hashed=True, default=[
508  "id",
509  "last_id",
510  "number",
511  "last_layer",
512 
513  "seed_cdc_hits",
514  "seed_svd_hits",
515  "seed_lowest_svd_layer",
516  "seed_lowest_cdc_layer",
517  "quality_index_triplet",
518  "quality_index_circle",
519  "quality_index_helix",
520  "cluster_1_charge",
521  "cluster_2_charge",
522  "mean_rest_cluster_charge",
523  "min_rest_cluster_charge",
524  "std_rest_cluster_charge",
525  "cluster_1_seed_charge",
526  "cluster_2_seed_charge",
527  "mean_rest_cluster_seed_charge",
528  "min_rest_cluster_seed_charge",
529  "std_rest_cluster_seed_charge",
530  "cluster_1_size",
531  "cluster_2_size",
532  "mean_rest_cluster_size",
533  "min_rest_cluster_size",
534  "std_rest_cluster_size",
535  "cluster_1_snr",
536  "cluster_2_snr",
537  "mean_rest_cluster_snr",
538  "min_rest_cluster_snr",
539  "std_rest_cluster_snr",
540  "cluster_1_charge_over_size",
541  "cluster_2_charge_over_size",
542  "mean_rest_cluster_charge_over_size",
543  "min_rest_cluster_charge_over_size",
544  "std_rest_cluster_charge_over_size",
545  ]
546 
547  )
548 
549  def get_weightfile_xml_identifier(self, fast_bdt_option=None, filter_number=1):
550  """
551  Name of the xml weightfile that is created by the teacher task.
552  It is subsequently used as a local weightfile in the following validation tasks.
553 
554  :param fast_bdt_option: FastBDT option that is used to train this MVA
555  :param filter_number: Filter number (first=1, second=2, third=3) to be trained
556  """
557  if fast_bdt_option is None:
558  fast_bdt_option = self.fast_bdt_option_state_filterfast_bdt_option_state_filter
559  fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
560  weightfile_name = f"trk_CDCToSVDSpacePointStateFilter_{filter_number}" + fast_bdt_string
561  return weightfile_name + ".xml"
562 
563  def requires(self):
564  """
565  This task requires that the recordings for the state filters.
566  """
567  for layer in [3, 4, 5, 6, 7]:
568  yield self.clone(
569  StateRecordingTask,
570  experiment_number=self.experiment_numberexperiment_number,
571  n_events=self.n_eventsn_events,
572  random_seed="training",
573  layer=layer,
574  )
575 
576  def output(self):
577  """
578  Generate list of output files that the task should produce.
579  The task is considered finished if and only if the outputs all exist.
580  """
581  yield self.add_to_output(self.get_weightfile_xml_identifierget_weightfile_xml_identifier(filter_number=self.filter_numberfilter_number))
582 
583  def process(self):
584  """
585  Use basf2_mva teacher to create MVA weightfile from collected training
586  data variables.
587 
588  This is the main process that is dispatched by the ``run`` method that
589  is inherited from ``Basf2Task``.
590  """
591  records_files = self.get_input_file_names(f"records{self.filter_number}.root")
592  tree_name = f"records{self.filter_number}"
593  print(f"Processed records files: {records_files=},\nfeature tree name: {tree_name=}")
594 
595  my_basf2_mva_teacher(
596  records_files=records_files,
597  tree_name=tree_name,
598  weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifierget_weightfile_xml_identifier(filter_number=self.filter_numberfilter_number)),
599  target_variable=self.training_targettraining_target,
600  exclude_variables=self.exclude_variables,
601  fast_bdt_option=self.fast_bdt_option_state_filterfast_bdt_option_state_filter,
602  )
603 
604 
605 class ResultRecordingTask(Basf2PathTask):
606  """
607  Task to record data for the final result filter. This requires trained state filters.
608  The cuts on the state filter classifiers are set to rather low values to ensure that all signal is contained in the
609  recorded file. Also, the values for XXXXXHighUseNStates are chosen conservatively, i.e. rather on the high side.
610  """
611 
612 
613  experiment_number = b2luigi.IntParameter()
614 
616  random_seed = b2luigi.Parameter()
617 
618  n_events = b2luigi.IntParameter()
619 
620  fast_bdt_option_state_filter = b2luigi.ListParameter(
621 
622  hashed=True, default=[50, 8, 3, 0.1]
623 
624  )
625 
626  result_filter_records_name = b2luigi.Parameter()
627 
628  def output(self):
629  """
630  Generate list of output files that the task should produce.
631  The task is considered finished if and only if the outputs all exist.
632  """
633  yield self.add_to_output(self.result_filter_records_nameresult_filter_records_name)
634 
635  def requires(self):
636  """
637  This task requires that the training SplitMergeSimTask is finished, as well as that the state filters are trained
638  using the CKFStateFilterTeacherTask..
639  """
640  yield SplitNMergeSimTask(
641  bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_numberexperiment_number],
642  experiment_number=self.experiment_numberexperiment_number,
643  random_seed=self.random_seedrandom_seed,
644  n_events=self.n_eventsn_events,
645  )
646  filter_numbers = [1, 2, 3]
647  for filter_number in filter_numbers:
648  yield self.clone(
649  CKFStateFilterTeacherTask,
650  experiment_number=self.experiment_numberexperiment_number,
651  n_events=self.n_eventsn_events,
652  random_seed=self.random_seedrandom_seed,
653  filter_number=filter_number,
654  fast_bdt_option=self.fast_bdt_option_state_filterfast_bdt_option_state_filter
655  )
656 
657  def create_result_recording_path(self, result_filter_records_name):
658  """
659  Create a path for the recording of the result filter. This file is then used to train the result filter.
660 
661  :param result_filter_records_name: Name of the recording file.
662  """
663 
664  path = basf2.create_path()
665 
666  # get all the file names from the list of input files that are meant for training
667  file_list = [fname for sublist in self.get_input_file_names().values()
668  for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
669  path.add_module("RootInput", inputFileNames=file_list)
670 
671  path.add_module("Gearbox")
672  path.add_module("Geometry")
673  path.add_module("SetupGenfitExtrapolation")
674 
675  add_hit_preparation_modules(path, components=["SVD"])
676 
677  add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)
678 
679  path.add_module('TrackFinderMCTruthRecoTracks',
680  RecoTracksStoreArrayName="MCRecoTracks",
681  WhichParticles=[],
682  UsePXDHits=True,
683  UseSVDHits=True,
684  UseCDCHits=True)
685 
686  path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=False, UseCDCHits=True,
687  mcRecoTracksStoreArrayName="MCRecoTracks",
688  prRecoTracksStoreArrayName="CDCRecoTracks")
689  path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCRecoTracks")
690 
691  fast_bdt_string = create_fbdt_option_string(self.fast_bdt_option_state_filterfast_bdt_option_state_filter)
692  path.add_module("CDCToSVDSpacePointCKF",
693  inputRecoTrackStoreArrayName="CDCRecoTracks",
694  outputRecoTrackStoreArrayName="VXDRecoTracks",
695  outputRelationRecoTrackStoreArrayName="CDCRecoTracks",
696 
697  relationCheckForDirection="backward",
698  reverseSeed=False,
699  writeOutDirection="backward",
700 
701  firstHighFilter="mva_with_direction_check",
702  firstHighFilterParameters={
703  "identifier": self.get_input_file_names(f"trk_CDCToSVDSpacePointStateFilter_1{fast_bdt_string}.xml")[0],
704  "cut": 0.001,
705  "direction": "backward"},
706  firstHighUseNStates=10,
707 
708  advanceHighFilter="advance",
709  advanceHighFilterParameters={"direction": "backward"},
710 
711  secondHighFilter="mva",
712  secondHighFilterParameters={
713  "identifier": self.get_input_file_names(f"trk_CDCToSVDSpacePointStateFilter_2{fast_bdt_string}.xml")[0],
714  "cut": 0.001},
715  secondHighUseNStates=10,
716 
717  updateHighFilter="fit",
718 
719  thirdHighFilter="mva",
720  thirdHighFilterParameters={
721  "identifier": self.get_input_file_names(f"trk_CDCToSVDSpacePointStateFilter_3{fast_bdt_string}.xml")[0],
722  "cut": 0.001},
723  thirdHighUseNStates=10,
724 
725  filter="recording",
726  filterParameters={"rootFileName": result_filter_records_name},
727  exportTracks=False,
728 
729  enableOverlapResolving=True)
730 
731  return path
732 
733  def create_path(self):
734  """
735  Create basf2 path to process with event generation and simulation.
736  """
737  return self.create_result_recording_path(
738  result_filter_records_name=self.get_output_file_name(self.result_filter_records_nameresult_filter_records_name),
739  )
740 
741 
742 class CKFResultFilterTeacherTask(Basf2Task):
743  """
744  A teacher task runs the basf2 mva teacher on the training data provided by a
745  data collection task.
746 
747  Since teacher tasks are needed for all quality estimators covered by this
748  steering file and the only thing that changes is the required data
749  collection task and some training parameters, I decided to use inheritance
750  and have the basic functionality in this base class/interface and have the
751  specific teacher tasks inherit from it.
752  """
753 
754  experiment_number = b2luigi.IntParameter()
755 
757  random_seed = b2luigi.Parameter()
758 
759  n_events = b2luigi.IntParameter()
760 
761  fast_bdt_option_state_filter = b2luigi.ListParameter(
762 
763  hashed=True, default=[50, 8, 3, 0.1]
764 
765  )
766 
767  fast_bdt_option_result_filter = b2luigi.ListParameter(
768 
769  hashed=True, default=[200, 8, 3, 0.1]
770 
771  )
772 
773  result_filter_records_name = b2luigi.Parameter()
774 
775  training_target = b2luigi.Parameter(
776 
777  default="truth"
778 
779  )
780 
782  exclude_variables = b2luigi.ListParameter(
783 
784  hashed=True, default=[]
785 
786  )
787 
788  def get_weightfile_xml_identifier(self, fast_bdt_option=None):
789  """
790  Name of the xml weightfile that is created by the teacher task.
791  It is subsequently used as a local weightfile in the following validation tasks.
792 
793  :param fast_bdt_option: FastBDT option that is used to train this MVA
794  """
795  if fast_bdt_option is None:
796  fast_bdt_option = self.fast_bdt_option_result_filterfast_bdt_option_result_filter
797  fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
798  weightfile_name = "trk_CDCToSVDSpacePointResultFilter" + fast_bdt_string
799  return weightfile_name + ".xml"
800 
801  def requires(self):
802  """
803  Generate list of luigi Tasks that this Task depends on.
804  """
805  yield ResultRecordingTask(
806  experiment_number=self.experiment_numberexperiment_number,
807  n_events=self.n_eventsn_events,
808  random_seed=self.random_seedrandom_seed,
809  fast_bdt_option_state_filter=self.fast_bdt_option_state_filterfast_bdt_option_state_filter,
810  result_filter_records_name=self.result_filter_records_nameresult_filter_records_name,
811  )
812 
813  def output(self):
814  """
815  Generate list of output files that the task should produce.
816  The task is considered finished if and only if the outputs all exist.
817  """
818  yield self.add_to_output(self.get_weightfile_xml_identifierget_weightfile_xml_identifier())
819 
820  def process(self):
821  """
822  Use basf2_mva teacher to create MVA weightfile from collected training
823  data variables.
824 
825  This is the main process that is dispatched by the ``run`` method that
826  is inherited from ``Basf2Task``.
827  """
828  records_files = self.get_input_file_names(self.result_filter_records_nameresult_filter_records_name)
829  tree_name = "records"
830  print(f"Processed records files for result filter training: {records_files=},\nfeature tree name: {tree_name=}")
831 
832  my_basf2_mva_teacher(
833  records_files=records_files,
834  tree_name=tree_name,
835  weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier()),
836  target_variable=self.training_target,
837  exclude_variables=self.exclude_variables,
838  fast_bdt_option=self.fast_bdt_option_result_filterfast_bdt_option_result_filter,
839  )
840 
841 
842 class ValidationAndOptimisationTask(Basf2PathTask):
843  """
844  Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values
845  for the states, the number of best candidates kept after each filter, and similar for the result filter.
846  """
847 
848  experiment_number = b2luigi.IntParameter()
849 
850  n_events_training = b2luigi.IntParameter()
851 
852  fast_bdt_option_state_filter = b2luigi.ListParameter(
853  # ## \cond
854  hashed=True, default=[50, 8, 3, 0.1]
855  # ## \endcond
856  )
857 
858  fast_bdt_option_result_filter = b2luigi.ListParameter(
859  # ## \cond
860  hashed=True, default=[200, 8, 3, 0.1]
861  # ## \endcond
862  )
863 
864  n_events_testing = b2luigi.IntParameter()
865 
866  state_filter_cut = b2luigi.FloatParameter()
867 
868  use_n_best_states = b2luigi.IntParameter()
869 
870  result_filter_cut = b2luigi.FloatParameter()
871 
872  use_n_best_results = b2luigi.IntParameter()
873 
874  def output(self):
875  """
876  Generate list of output files that the task should produce.
877  The task is considered finished if and only if the outputs all exist.
878  """
879  fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filterfast_bdt_option_state_filter)
880  fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filterfast_bdt_option_result_filter)
881  yield self.add_to_output(
882  f"cdc_to_svd_spacepoint_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root")
883 
884  def requires(self):
885  """
886  This task requires trained result filters, trained state filters, and that an independent data set for validation was
887  created using the SplitMergeSimTask with the random seed optimisation.
888  """
889  fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filterfast_bdt_option_state_filter)
891  result_filter_records_name=f"filter_records{fbdt_state_filter_string}.root",
892  experiment_number=self.experiment_numberexperiment_number,
893  n_events=self.n_events_trainingn_events_training,
894  fast_bdt_option_state_filter=self.fast_bdt_option_state_filterfast_bdt_option_state_filter,
895  fast_bdt_option_result_filter=self.fast_bdt_option_result_filterfast_bdt_option_result_filter,
896  random_seed='training'
897  )
898  yield SplitNMergeSimTask(
899  bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_numberexperiment_number],
900  experiment_number=self.experiment_numberexperiment_number,
901  n_events=self.n_events_testingn_events_testing,
902  random_seed="optimisation",
903  )
904  filter_numbers = [1, 2, 3]
905  for filter_number in filter_numbers:
906  yield self.clone(
907  CKFStateFilterTeacherTask,
908  experiment_number=self.experiment_numberexperiment_number,
909  random_seed="training",
910  n_events=self.n_events_trainingn_events_training,
911  filter_number=filter_number,
912  fast_bdt_option=self.fast_bdt_option_state_filterfast_bdt_option_state_filter
913  )
914 
916  """
917  Create a path to validate the trained filters.
918  """
919  path = basf2.create_path()
920 
921  # get all the file names from the list of input files that are meant for optimisation / validation
922  file_list = [fname for sublist in self.get_input_file_names().values()
923  for fname in sublist if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
924  path.add_module("RootInput", inputFileNames=file_list)
925 
926  path.add_module("Gearbox")
927  path.add_module("Geometry")
928  path.add_module("SetupGenfitExtrapolation")
929 
930  add_hit_preparation_modules(path, components=["SVD"])
931 
932  add_track_finding(path, reco_tracks="CDCRecoTracks", components=["CDC"], prune_temporary_tracks=False)
933 
934  fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filterfast_bdt_option_state_filter)
935  fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filterfast_bdt_option_result_filter)
936  path.add_module("CDCToSVDSpacePointCKF",
937 
938  inputRecoTrackStoreArrayName="CDCRecoTracks",
939  outputRecoTrackStoreArrayName="VXDRecoTracks",
940  outputRelationRecoTrackStoreArrayName="CDCRecoTracks",
941 
942  relationCheckForDirection="backward",
943  reverseSeed=False,
944  writeOutDirection="backward",
945 
946  firstHighFilter="mva_with_direction_check",
947  firstHighFilterParameters={
948  "identifier": self.get_input_file_names(
949  f"trk_CDCToSVDSpacePointStateFilter_1{fbdt_state_filter_string}.xml")[0],
950  "cut": self.state_filter_cutstate_filter_cut,
951  "direction": "backward"},
952  firstHighUseNStates=self.use_n_best_statesuse_n_best_states,
953 
954  advanceHighFilter="advance",
955  advanceHighFilterParameters={"direction": "backward"},
956 
957  secondHighFilter="mva",
958  secondHighFilterParameters={
959  "identifier": self.get_input_file_names(
960  f"trk_CDCToSVDSpacePointStateFilter_2{fbdt_state_filter_string}.xml")[0],
961  "cut": self.state_filter_cutstate_filter_cut},
962  secondHighUseNStates=self.use_n_best_statesuse_n_best_states,
963 
964  updateHighFilter="fit",
965 
966  thirdHighFilter="mva",
967  thirdHighFilterParameters={
968  "identifier": self.get_input_file_names(
969  f"trk_CDCToSVDSpacePointStateFilter_3{fbdt_state_filter_string}.xml")[0],
970  "cut": self.state_filter_cutstate_filter_cut},
971  thirdHighUseNStates=self.use_n_best_statesuse_n_best_states,
972 
973  filter="mva",
974  filterParameters={
975  "identifier": self.get_input_file_names(
976  f"trk_CDCToSVDSpacePointResultFilter{fbdt_result_filter_string}.xml")[0],
977  "cut": self.result_filter_cutresult_filter_cut},
978  useBestNInSeed=self.use_n_best_resultsuse_n_best_results,
979 
980  exportTracks=True,
981  enableOverlapResolving=True)
982 
983  path.add_module('RelatedTracksCombiner',
984  VXDRecoTracksStoreArrayName="VXDRecoTracks",
985  CDCRecoTracksStoreArrayName="CDCRecoTracks",
986  recoTracksStoreArrayName="RecoTracks")
987 
988  path.add_module('TrackFinderMCTruthRecoTracks',
989  RecoTracksStoreArrayName="MCRecoTracks",
990  WhichParticles=[],
991  UsePXDHits=True,
992  UseSVDHits=True,
993  UseCDCHits=True)
994 
995  path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
996  mcRecoTracksStoreArrayName="MCRecoTracks",
997  prRecoTracksStoreArrayName="RecoTracks")
998 
999  path.add_module(
1001  output_file_name=self.get_output_file_name(
1002  f"cdc_to_svd_spacepoint_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root"),
1003  reco_tracks_name="RecoTracks",
1004  mc_reco_tracks_name="MCRecoTracks",
1005  name="",
1006  contact="",
1007  expert_level=200))
1008 
1009  return path
1010 
1011  def create_path(self):
1012  """
1013  Create basf2 path to process with event generation and simulation.
1014  """
1015  return self.create_optimisation_and_validation_path()
1016 
1017 
1018 class MainTask(b2luigi.WrapperTask):
1019  """
1020  Wrapper task that needs to finish for b2luigi to finish running this steering file.
1021 
1022  It is done if the outputs of all required subtasks exist. It is thus at the
1023  top of the luigi task graph. Edit the ``requires`` method to steer which
1024  tasks and with which parameters you want to run.
1025  """
1026 
1027  n_events_training = b2luigi.get_setting(
1028 
1029  "n_events_training", default=1000
1031  )
1032 
1033  n_events_testing = b2luigi.get_setting(
1034 
1035  "n_events_testing", default=500
1037  )
1039  n_events_per_task = b2luigi.get_setting(
1041  "n_events_per_task", default=100
1042 
1043  )
1044 
1045  num_processes = b2luigi.get_setting(
1046 
1047  "basf2_processes_per_worker", default=0
1048 
1049  )
1050 
1051 
1052  bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
1053 
1054  bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
1055 
1056  def requires(self):
1057  """
1058  Generate list of tasks that needs to be done for luigi to finish running
1059  this steering file.
1060  """
1061 
1062  fast_bdt_options = [
1063  [50, 8, 3, 0.1],
1064  [100, 8, 3, 0.1],
1065  [200, 8, 3, 0.1],
1066  ]
1067 
1068  experiment_numbers = b2luigi.get_setting("experiment_numbers")
1069 
1070  # iterate over all possible combinations of parameters from the above defined parameter lists
1071  for experiment_number, fast_bdt_option_state_filter, fast_bdt_option_result_filter in itertools.product(
1072  experiment_numbers, fast_bdt_options, fast_bdt_options
1073  ):
1074 
1075  state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
1076  n_best_states_list = [3, 5, 10]
1077  result_filter_cuts = [0.05, 0.1, 0.2]
1078  n_best_results_list = [3, 5, 10]
1079  for state_filter_cut, n_best_states, result_filter_cut, n_best_results in \
1080  itertools.product(state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list):
1081  yield self.clone(
1082  ValidationAndOptimisationTask,
1083  experiment_number=experiment_number,
1084  n_events_training=self.n_events_trainingn_events_training,
1085  n_events_testing=self.n_events_testingn_events_testing,
1086  state_filter_cut=state_filter_cut,
1087  use_n_best_states=n_best_states,
1088  result_filter_cut=result_filter_cut,
1089  use_n_best_results=n_best_results,
1090  fast_bdt_option_state_filter=fast_bdt_option_state_filter,
1091  fast_bdt_option_result_filter=fast_bdt_option_result_filter,
1092  )
1093 
1094 
1095 if __name__ == "__main__":
1096  b2luigi.set_setting("env_script", "./setup_basf2.sh")
1097  b2luigi.set_setting("batch_system", "htcondor")
1098  workers = b2luigi.get_setting("workers", default=1)
1099  b2luigi.process(MainTask(), workers=workers, batch=True)
def get_background_files(folder=None, output_file_info=True)
Definition: background.py:17
def get_weightfile_xml_identifier(self, fast_bdt_option=None, filter_number=1)
filter_number
Number of the filter for which the records files are to be processed.
training_target
Feature/variable to use as truth label in the quality estimator MVA classifier.
experiment_number
Experiment number of the conditions database, e.g.
def output_file_name(self, n_events=None, random_seed=None)
Name of the ROOT output file with generated and simulated events.
n_events_training
Number of events to generate for the training data set.
n_events_testing
Number of events to generate for the test data set.
experiment_number
Experiment number of the conditions database, e.g.
result_filter_records_name
Name of the records file for training the final result filter.
n_events
Number of events to generate for the training data set.
fast_bdt_option_state_filter
Hyperparameter option of the FastBDT algorithm.
experiment_number
Experiment number of the conditions database, e.g.
def output_file_name(self, n_events=None, random_seed=None)
Name of the ROOT output file with generated and simulated events.
experiment_number
Experiment number of the conditions database, e.g.
def create_state_recording_path(self, layer, records1_fname, records2_fname, records3_fname)
layer
Layer on which to toggle for recording the information for training.
use_n_best_results
How many results should be kept at maximum to search for overlaps.
state_filter_cut
Value of the cut on the MVA classifier output for accepting a state during CKF tracking.
result_filter_cut
Value of the cut on the MVA classifier output for a result candidate.
use_n_best_states
How many states should be kept at maximum in the combinatorial part of the CKF tree search.
n_events_testing
Number of events to generate for the testing, validation, and optimisation data set.
def add_simulation(path, components=None, bkgfiles=None, bkgOverlay=True, forceSetPXDDataReduction=False, usePXDDataReduction=True, cleanupPXDDataReduction=True, generate_2nd_cdc_hits=False, simulateT0jitter=True, isCosmics=False, FilterEvents=False, usePXDGatedMode=False, skipExperimentCheckForBG=False, save_slow_pions_in_mc=False)
Definition: simulation.py:121