# Belle II Software release-08-01-10
# combined_to_pxd_ckf_mva_training.py

"""
combined_to_pxd_ckf_mva_training
-----------------------------------------

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the training and validation of the classifiers of
the three MVA-based state filters and one result filter of the ToPXDCKF.
This CKF extrapolates tracks found in CDC and SVD into the PXD and adds PXD hits
using a combinatorial tree search and a Kalman filter based track fit in each step.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

The order of the b2luigi tasks in this script is as follows (top to bottom):

* Two tasks to create input samples for training and testing (``GenerateSimTask`` and
  ``SplitNMergeSimTask``). The ``SplitNMergeSimTask`` takes the total number of events to
  be generated and a number of events per task, which limits the runtime of each job. It
  divides the total number of events by the number of events per task and creates as many
  ``GenerateSimTask`` instances as needed, each with its own random seed, so that in the
  end the requested numbers of training and testing events are simulated. The individual
  files are then combined by the ``SplitNMergeSimTask`` into one file each for training
  and testing.
* The ``StateRecordingTask`` writes out the data required for training the state
  filters.
* The ``CKFStateFilterTeacherTask`` trains the state filter MVAs, using FastBDT by
  default, with a given set of options.
* The ``ResultRecordingTask`` writes out the data used for the training of the result
  filter MVA. This task requires that the state filters have been trained before.
* The ``CKFResultFilterTeacherTask`` trains the result filter MVA, using FastBDT by
  default, with a given set of FastBDT options. This requires that the result filter
  records have been created with the ``ResultRecordingTask``.
* The ``ValidationAndOptimisationTask`` uses the trained weight files and the cut values
  provided to run the tracking chain with the weight file under test, and also
  runs the tracking validation.
* Finally, the ``MainTask`` is the "brain" of the script. It invokes the
  ``ValidationAndOptimisationTask`` with the different combinations of FastBDT options
  and cut values on the MVA classifier output.
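
The splitting done by ``SplitNMergeSimTask`` can be sketched in plain Python. This is
only an illustration, not the task itself: the helper name ``split_events`` is
hypothetical, while the ``divmod`` logic and the zero-padded seed suffix follow the
``requires`` method of ``SplitNMergeSimTask`` further down in this file.

```python
def split_events(n_events, n_events_per_task, random_seed):
    # Hypothetical helper: return one (seed, n_events) pair per simulation chunk.
    quotient, remainder = divmod(n_events, n_events_per_task)
    # Full-size chunks, each with its own zero-padded seed suffix.
    chunks = [(random_seed + "_" + str(i).zfill(3), n_events_per_task) for i in range(quotient)]
    # One smaller chunk for the leftover events, if there are any.
    if remainder > 0:
        chunks.append((random_seed + "_" + str(quotient).zfill(3), remainder))
    return chunks


chunks = split_events(n_events=2500, n_events_per_task=1000, random_seed="training")
print(chunks)
```

With these example numbers, two full chunks and one chunk holding the remaining 500
events are created, so the total stays at the requested 2500 events.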

Due to the dependencies, the tasks are called in reverse order. The ``MainTask``
calls the ``ValidationAndOptimisationTask`` with different FastBDT options and cut
values, and the ``ValidationAndOptimisationTask`` itself calls the required teacher,
training, and simulation tasks.

Each combination of FastBDT options, state filter cut values, and candidate selection
is used to train the result filter. This means that the ``ResultRecordingTask``
is executed multiple times, once for each combination of FastBDT options, cut value,
and candidate selection.
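
Such a scan over combinations can be pictured as a Cartesian product. The sketch below
uses ``itertools.product``, which this file imports. The option lists are illustrative
placeholders and not the values configured in ``MainTask``.

```python
import itertools

# Illustrative placeholder values; the real lists are defined in MainTask.
fast_bdt_options = [[50, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.05]
use_n_best_states = [3, 5]

# Every FastBDT option is paired with every cut value and every candidate selection.
combinations = list(itertools.product(fast_bdt_options, state_filter_cuts, use_n_best_states))

for fbdt_option, cut, n_best in combinations:
    print(fbdt_option, cut, n_best)
```

The number of task instances grows multiplicatively with each scanned list, which is
worth keeping in mind when choosing how many values to test.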

b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
its parameters, its output files, and which other tasks with which
parameters it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a root
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.

The final task that defines which tasks need to be done for the steering file to
finish is the ``MainTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the ``MainTask``
or replace them by lower-level tasks during debugging.
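
How requirements determine the execution order can be illustrated without b2luigi.
The toy resolver below is a sketch under assumptions: the ``REQUIRES`` table paraphrases
the dependencies described in this docstring, and ``resolve_order`` is a hypothetical
helper that is not part of b2luigi or luigi.

```python
# Toy dependency table paraphrasing the task chain of this script.
REQUIRES = {
    "MainTask": ["ValidationAndOptimisationTask"],
    "ValidationAndOptimisationTask": [
        "CKFResultFilterTeacherTask",
        "SplitNMergeSimTask",
        "CKFStateFilterTeacherTask",
    ],
    "CKFResultFilterTeacherTask": ["ResultRecordingTask"],
    "ResultRecordingTask": ["SplitNMergeSimTask", "CKFStateFilterTeacherTask"],
    "CKFStateFilterTeacherTask": ["StateRecordingTask"],
    "StateRecordingTask": ["SplitNMergeSimTask"],
    "SplitNMergeSimTask": ["GenerateSimTask"],
    "GenerateSimTask": [],
}


def resolve_order(task, done=None):
    # Depth-first traversal: a task is appended only after all of its requirements.
    if done is None:
        done = []
    for requirement in REQUIRES[task]:
        resolve_order(requirement, done)
    if task not in done:
        done.append(task)
    return done


order = resolve_order("MainTask")
print(order)
```

Although ``MainTask`` is requested first, it ends up last in the resolved order, and
the simulation tasks come first.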

Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling. It can be installed
via pip::

    python3 -m pip install [--user] b2luigi

Use the ``--user`` option if you do not have the rights to install python packages
into your externals (e.g. because you are using cvmfs). The packages are then
installed in ``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.
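
As a rough illustration, the snippet below writes a minimal ``settings.json`` and reads
it back. Only the ``result_path`` key is named in this docstring. The value shown is a
placeholder, and any further keys that the script expects are not covered here.

```python
import json
import tempfile
from pathlib import Path

# Placeholder value; choose a directory with enough storage for all outputs.
settings = {"result_path": "/path/to/results"}

with tempfile.TemporaryDirectory() as tmp_dir:
    settings_file = Path(tmp_dir) / "settings.json"
    settings_file.write_text(json.dumps(settings, indent=4))
    # Read the file back the way any JSON consumer would.
    loaded = json.loads(settings_file.read_text())

print(loaded["result_path"])
```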

Usage
~~~~~

You can test the b2luigi task graph without running it via::

    python3 combined_to_pxd_ckf_mva_training.py --dry-run
    python3 combined_to_pxd_ckf_mva_training.py --show-output

This will show the outputs and reveal potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_to_pxd_ckf_mva_training.py

One can use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` if b2luigi was installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_to_pxd_ckf_mva_training.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web
browser, you have two options:

    1. Run both the steering file and ``luigid`` remotely and use
       ssh-port-forwarding to your local host. To do so, run on your local
       machine::

           ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

    2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
       local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path`` set in
``settings.json``. The directory tree encodes the used b2luigi parameters. This
ensures reproducibility and makes parameter searches easy. Sometimes, it is hard to
find the relevant output files. You can view the whole directory structure by
running ``tree <result_path>``. Use the unix ``find`` command to find the files
that interest you, e.g.::

    find <result_path> -name "*.root" # find all ROOT files
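
The same search can be done from Python. This is a minimal sketch using ``pathlib``.
The function name ``find_root_files`` is hypothetical, and the demonstration builds a
throwaway directory tree instead of using a real ``result_path``.

```python
import tempfile
from pathlib import Path


def find_root_files(result_path):
    # Recursively collect all files ending in .root below result_path.
    return sorted(p for p in Path(result_path).rglob("*.root") if p.is_file())


# Demonstration on a temporary directory tree (stand-in for result_path).
with tempfile.TemporaryDirectory() as tmp_dir:
    nested = Path(tmp_dir) / "layer=1" / "n_events=1000"
    nested.mkdir(parents=True)
    (nested / "records1.root").touch()
    (nested / "notes.txt").touch()
    found = [p.name for p in find_root_files(tmp_dir)]

print(found)
```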
"""

import itertools
import subprocess

import basf2
from tracking import add_track_finding
from tracking.path_utils import add_hit_preparation_modules
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule
import background
import simulation

from ckf_training import my_basf2_mva_teacher, create_fbdt_option_string

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise


class GenerateSimTask(Basf2PathTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Experiment number, passed on to the EventInfoSetter
    experiment_number = b2luigi.IntParameter()
    # Random seed string, which is also part of the output file name
    random_seed = b2luigi.Parameter()
    # Number of events to generate
    n_events = b2luigi.IntParameter()
    # Directory with the background overlay files
    bkgfiles_dir = b2luigi.Parameter(hashed=True)
    # Batch queue name
    queue = 'l'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        basf2.set_random_seed(self.random_seed)
        path = basf2.create_path()
        path.add_module(
            "EventInfoSetter", evtNumList=[self.n_events], runList=[0], expList=[self.experiment_number]
        )
        path.add_module("EvtGenInput")
        bkg_files = ""
        if self.experiment_number == 0:
            bkg_files = background.get_background_files()
        else:
            bkg_files = background.get_background_files(self.bkgfiles_dir)

        simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, usePXDDataReduction=False)

        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.output_file_name()),
        )
        return path


# I don't use the default MergeTask or similar because they only work if every input file is called the same.
# Additionally, I want to add more features like deleting the original input to save storage space.
class SplitNMergeSimTask(Basf2Task):
    """
    Split the simulation of Monte Carlo with background overlay into several
    ``GenerateSimTask`` jobs and merge their output files into a single file.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Experiment number, passed on to the GenerateSimTask
    experiment_number = b2luigi.IntParameter()
    # Random seed string, which is also part of the output file name
    random_seed = b2luigi.Parameter()
    # Total number of events to generate
    n_events = b2luigi.IntParameter()
    # Directory with the background overlay files
    bkgfiles_dir = b2luigi.Parameter(hashed=True)
    # Batch queue name
    queue = 'sx'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.

        :param n_events: Number of events to simulate.
        :param random_seed: Random seed to use for the simulation to create independent samples.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def requires(self):
        """
        This task requires several GenerateSimTask to be finished so that the required number of events is created.
        """
        n_events_per_task = MainTask.n_events_per_task
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        for i in range(quotient):
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MainTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,
                experiment_number=self.experiment_number,
            )
        if remainder > 0:
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MainTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),
                n_events=remainder,
                experiment_number=self.experiment_number,
            )

    @b2luigi.on_temporary_files
    def process(self):
        """
        When all GenerateSimTasks have finished, merge the output.
        """
        create_output_dirs(self)

        file_list = [item for sublist in self.get_input_file_names().values() for item in sublist]
        print("Merge the following files:")
        print(file_list)
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)
        print("Finished merging. Now remove the input files to save space.")
        cmd2 = ["rm", "-f"]
        for tempfile in file_list:
            args = cmd2 + [tempfile]
            subprocess.check_call(args)


class StateRecordingTask(Basf2PathTask):
    """
    Record the data for the three state filters for the ToPXDCKF.

    This task requires that the events used for training have been simulated before, which is done using the
    ``SplitNMergeSimTask``.
    """

    # Experiment number, passed on to the SplitNMergeSimTask
    experiment_number = b2luigi.IntParameter()
    # Random seed string of the simulated sample
    random_seed = b2luigi.Parameter()
    # Number of events of the simulated sample
    n_events = b2luigi.IntParameter()
    # PXD layer for which the data are recorded
    layer = b2luigi.IntParameter()

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        for record_fname in ["records1.root", "records2.root", "records3.root"]:
            yield self.add_to_output(record_fname)

    def requires(self):
        """
        This task only requires that the input files have been created.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events,
            random_seed=self.random_seed,
        )

    def create_state_recording_path(self, layer, records1_fname, records2_fname, records3_fname):
        """
        Create a path for the recording. To record the data for the PXD state filters, CDC+SVD tracks are required, and these
        must be truth matched before. The data have to be recorded for each layer of the PXD, i.e. layers 1 and 2, but also
        an artificial layer 3.

        :param layer: The layer for which the data are recorded.
        :param records1_fname: Name of the records1 file.
        :param records2_fname: Name of the records2 file.
        :param records3_fname: Name of the records3 file.
        """
        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD", "PXD"])

        add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCSVDRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")

        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="RecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        hitFilter="distance",
                        seedFilter="distance",
                        preSeedFilter='all',
                        preHitFilter='all',

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="truth",
                        firstEqualFilter="recording",
                        firstEqualFilterParameters={"treeName": "records1", "rootFileName": records1_fname, "returnWeight": 1.0},
                        firstLowFilter="none",
                        firstHighUseNStates=0,
                        firstToggleOnLayer=layer,

                        advanceHighFilter="advance",

                        secondHighFilter="truth",
                        secondEqualFilter="recording",
                        secondEqualFilterParameters={"treeName": "records2", "rootFileName": records2_fname, "returnWeight": 1.0},
                        secondLowFilter="none",
                        secondHighUseNStates=0,
                        secondToggleOnLayer=layer,

                        updateHighFilter="fit",

                        thirdHighFilter="truth",
                        thirdEqualFilter="recording",
                        thirdEqualFilterParameters={"treeName": "records3", "rootFileName": records3_fname},
                        thirdLowFilter="none",
                        thirdHighUseNStates=0,
                        thirdToggleOnLayer=layer,

                        filter="none",
                        exportTracks=False,

                        enableOverlapResolving=False)

        return path

    def create_path(self):
        """
        Create basf2 path to record the training data for the state filters.
        """
        return self.create_state_recording_path(
            layer=self.layer,
            records1_fname=self.get_output_file_name("records1.root"),
            records2_fname=self.get_output_file_name("records2.root"),
            records3_fname=self.get_output_file_name("records3.root"),
        )


class CKFStateFilterTeacherTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    In this task the three state filters are trained, each with the corresponding recordings from the different layers.
    It will be executed for each FastBDT option defined in the MainTask.
    """

    # Experiment number, passed on to the StateRecordingTask
    experiment_number = b2luigi.IntParameter()
    # Random seed string of the simulated sample
    random_seed = b2luigi.Parameter()
    # Number of events used for the training
    n_events = b2luigi.IntParameter()
    # FastBDT options used to train the state filters
    fast_bdt_option_state_filter = b2luigi.ListParameter(hashed=True, default=[50, 8, 3, 0.1])
    # Number of the filter to be trained (first=1, second=2, third=3)
    filter_number = b2luigi.IntParameter()
    # Name of the target variable of the training
    training_target = b2luigi.Parameter(default="truth")
    # Variables to exclude from the training
    exclude_variables = b2luigi.ListParameter(hashed=True, default=[])

    def get_weightfile_xml_identifier(self, fast_bdt_option=None, filter_number=1):
        """
        Name of the xml weightfile that is created by the teacher task.
        It is subsequently used as a local weightfile in the following validation tasks.

        :param fast_bdt_option: FastBDT option that is used to train this MVA.
        :param filter_number: Filter number (first=1, second=2, third=3) to be trained.
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_state_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = f"trk_ToPXDStateFilter_{filter_number}" + fast_bdt_string
        return weightfile_name + ".xml"

    def requires(self):
        """
        This task requires the recordings for the state filters.
        """
        for layer in [1, 2, 3]:
            yield self.clone(
                StateRecordingTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events,
                random_seed="training",
                layer=layer
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_xml_identifier(filter_number=self.filter_number))

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(f"records{self.filter_number}.root")
        tree_name = f"records{self.filter_number}"
        print(f"Processed records files: {records_files=},\nfeature tree name: {tree_name=}")

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier(filter_number=self.filter_number)),
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_state_filter,
        )


class ResultRecordingTask(Basf2PathTask):
    """
    Task to record data for the final result filter. This requires trained state filters.
    The cuts on the state filter classifiers are set to rather low values to ensure that all signal is contained in the recorded
    file. Also, the values for XXXXXHighUseNStates are chosen conservatively, i.e. rather on the high side.
    """

    # Experiment number, passed on to the required tasks
    experiment_number = b2luigi.IntParameter()
    # Random seed string of the simulated sample
    random_seed = b2luigi.Parameter()
    # Number of events used for the training
    n_events_training = b2luigi.IntParameter()
    # FastBDT options used to train the state filters
    fast_bdt_option_state_filter = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])
    # Name of the file with the recorded data for the result filter
    result_filter_records_name = b2luigi.Parameter()

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.result_filter_records_name)

    def requires(self):
        """
        This task requires that the training SplitNMergeSimTask is finished, as well as that the state filters are trained
        using the CKFStateFilterTeacherTask.
        """
        yield SplitNMergeSimTask(
            bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
            experiment_number=self.experiment_number,
            n_events=self.n_events_training,
            random_seed=self.random_seed,
        )
        filter_numbers = [1, 2, 3]
        for filter_number in filter_numbers:
            yield self.clone(
                CKFStateFilterTeacherTask,
                experiment_number=self.experiment_number,
                n_events=self.n_events_training,
                random_seed=self.random_seed,
                filter_number=filter_number,
                fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
            )

    def create_result_recording_path(self, result_filter_records_name):
        """
        Create a path for the recording of the result filter. This file is then used to train the result filter.

        :param result_filter_records_name: Name of the recording file.
        """

        path = basf2.create_path()

        # get all the file names from the list of input files that are meant for training
        file_list = [fname for sublist in self.get_input_file_names().values()
                     for fname in sublist if "generated_mc_N" in fname and "training" in fname and fname.endswith(".root")]
        path.add_module("RootInput", inputFileNames=file_list)

        path.add_module("Gearbox")
        path.add_module("Geometry")
        path.add_module("SetupGenfitExtrapolation")

        add_hit_preparation_modules(path, components=["SVD", "PXD"])

        add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)

        path.add_module('TrackFinderMCTruthRecoTracks',
                        RecoTracksStoreArrayName="MCRecoTracks",
                        WhichParticles=[],
                        UsePXDHits=True,
                        UseSVDHits=True,
                        UseCDCHits=True)

        path.add_module("MCRecoTracksMatcher", UsePXDHits=False, UseSVDHits=True, UseCDCHits=True,
                        mcRecoTracksStoreArrayName="MCRecoTracks",
                        prRecoTracksStoreArrayName="CDCSVDRecoTracks")
        path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")

        fast_bdt_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
        path.add_module("ToPXDCKF",
                        inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
                        outputRecoTrackStoreArrayName="RecoTracks",
                        outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",

                        relationCheckForDirection="backward",
                        reverseSeed=False,
                        writeOutDirection="backward",

                        firstHighFilter="mva",
                        firstHighFilterParameters={
                            "identifier": self.get_input_file_names(f"trk_ToPXDStateFilter_1{fast_bdt_string}.xml")[0],
                            "cut": 0.01},
                        firstHighUseNStates=10,

                        advanceHighFilter="advance",

                        secondHighFilter="mva",
                        secondHighFilterParameters={
                            "identifier": self.get_input_file_names(f"trk_ToPXDStateFilter_2{fast_bdt_string}.xml")[0],
                            "cut": 0.01},
                        secondHighUseNStates=10,

                        updateHighFilter="fit",

                        thirdHighFilter="mva",
                        thirdHighFilterParameters={
                            "identifier": self.get_input_file_names(f"trk_ToPXDStateFilter_3{fast_bdt_string}.xml")[0],
                            "cut": 0.01},
                        thirdHighUseNStates=10,

                        filter="recording",
                        filterParameters={"rootFileName": result_filter_records_name},
                        exportTracks=False,

                        enableOverlapResolving=True)

        return path

    def create_path(self):
        """
        Create basf2 path to record the training data for the result filter.
        """
        return self.create_result_recording_path(
            result_filter_records_name=self.get_output_file_name(self.result_filter_records_name),
        )


class CKFResultFilterTeacherTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data for the result filter.
    """

    # Experiment number, passed on to the ResultRecordingTask
    experiment_number = b2luigi.IntParameter()
    # Random seed string of the simulated sample
    random_seed = b2luigi.Parameter()
    # Number of events used for the training
    n_events = b2luigi.IntParameter()
    # FastBDT options used to train the state filters
    fast_bdt_option_state_filter = b2luigi.ListParameter(hashed=True, default=[50, 8, 3, 0.1])
    # FastBDT options used to train the result filter
    fast_bdt_option_result_filter = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])
    # Name of the file with the recorded data for the result filter
    result_filter_records_name = b2luigi.Parameter()
    # Name of the target variable of the training
    training_target = b2luigi.Parameter(default="truth")
    # Variables to exclude from the training
    exclude_variables = b2luigi.ListParameter(hashed=True, default=[])

    def get_weightfile_xml_identifier(self, fast_bdt_option=None):
        """
        Name of the xml weightfile that is created by the teacher task.
        It is subsequently used as a local weightfile in the following validation tasks.

        :param fast_bdt_option: FastBDT option that is used to train this MVA
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option_result_filter
        fast_bdt_string = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = "trk_ToPXDResultFilter" + fast_bdt_string
        return weightfile_name + ".xml"

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield ResultRecordingTask(
            experiment_number=self.experiment_number,
            n_events_training=self.n_events,
            random_seed=self.random_seed,
            fast_bdt_option_state_filter=self.fast_bdt_option_state_filter,
            result_filter_records_name=self.result_filter_records_name,
        )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_xml_identifier())

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        records_files = self.get_input_file_names(self.result_filter_records_name)
        tree_name = "records"
        print(f"Processed records files for result filter training: {records_files=},\nfeature tree name: {tree_name=}")

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=tree_name,
            weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier()),
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option_result_filter,
        )
795 
796 
797 class ValidationAndOptimisationTask(Basf2PathTask):
798  """
799  Validate the performance of the trained filters by trying various combinations of FastBDT options, as well as cut values for
800  the states, the number of best candidates kept after each filter, and similar for the result filter.
801  """
802 
803  experiment_number = b2luigi.IntParameter()
804 
805  n_events_training = b2luigi.IntParameter()
806 
807  fast_bdt_option_state_filter = b2luigi.ListParameter(
808  # ## \cond
809  hashed=True, default=[200, 8, 3, 0.1]
810  # ## \endcond
811  )
812 
813  fast_bdt_option_result_filter = b2luigi.ListParameter(
814  # ## \cond
815  hashed=True, default=[200, 8, 3, 0.1]
816  # ## \endcond
817  )
818 
819  n_events_testing = b2luigi.IntParameter()
820 
821  state_filter_cut = b2luigi.FloatParameter()
822 
823  use_n_best_states = b2luigi.IntParameter()
824 
825  result_filter_cut = b2luigi.FloatParameter()
826 
827  use_n_best_results = b2luigi.IntParameter()
828 
829  def output(self):
830  """
831  Generate list of output files that the task should produce.
832  The task is considered finished if and only if the outputs all exist.
833  """
834  fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filterfast_bdt_option_state_filter)
835  fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filterfast_bdt_option_result_filter)
836  yield self.add_to_output(
837  f"to_pxd_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root")
838 
839  def requires(self):
840  """
841  This task requires trained result filters, trained state filters, and that an independent data set for validation was
842  created using the SplitMergeSimTask with the random seed optimisation.
843  """
844  fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filterfast_bdt_option_state_filter)
846  result_filter_records_name=f"filter_records{fbdt_state_filter_string}.root",
847  experiment_number=self.experiment_numberexperiment_number,
848  n_events=self.n_events_trainingn_events_training,
849  fast_bdt_option_state_filter=self.fast_bdt_option_state_filterfast_bdt_option_state_filter,
850  fast_bdt_option_result_filter=self.fast_bdt_option_result_filterfast_bdt_option_result_filter,
851  random_seed='training'
852  )
853  yield SplitNMergeSimTask(
854  bkgfiles_dir=MainTask.bkgfiles_by_exp[self.experiment_number],
855  experiment_number=self.experiment_number,
856  n_events=self.n_events_testing,
857  random_seed="optimisation",
858  )
859  filter_numbers = [1, 2, 3]
860  for filter_number in filter_numbers:
861  yield self.clone(
862  CKFStateFilterTeacherTask,
863  experiment_number=self.experiment_number,
864  n_events=self.n_events_training,
865  random_seed="training",
866  filter_number=filter_number,
867  fast_bdt_option_state_filter=self.fast_bdt_option_state_filter
868  )
869 
870  def create_optimisation_and_validation_path(self):
871  """
872  Create a path to validate the trained filters.
873  """
874  path = basf2.create_path()
875 
876  # get all the file names from the list of input files that are meant for optimisation / validation
877  file_list = [fname for sublist in self.get_input_file_names().values()
878  for fname in sublist if "generated_mc_N" in fname and "optimisation" in fname and fname.endswith(".root")]
879  path.add_module("RootInput", inputFileNames=file_list)
880 
881  path.add_module("Gearbox")
882  path.add_module("Geometry")
883  path.add_module("SetupGenfitExtrapolation")
884 
885  add_hit_preparation_modules(path, components=["SVD", "PXD"])
886 
887  add_track_finding(path, reco_tracks="CDCSVDRecoTracks", components=["CDC", "SVD"], prune_temporary_tracks=False)
888 
889  path.add_module("DAFRecoFitter", recoTracksStoreArrayName="CDCSVDRecoTracks")
890 
891  fbdt_state_filter_string = create_fbdt_option_string(self.fast_bdt_option_state_filter)
892  fbdt_result_filter_string = create_fbdt_option_string(self.fast_bdt_option_result_filter)
893  path.add_module("ToPXDCKF",
894  inputRecoTrackStoreArrayName="CDCSVDRecoTracks",
895  outputRecoTrackStoreArrayName="PXDRecoTracks",
896  outputRelationRecoTrackStoreArrayName="CDCSVDRecoTracks",
897 
898  relationCheckForDirection="backward",
899  reverseSeed=False,
900  writeOutDirection="backward",
901 
902  firstHighFilter="mva_with_direction_check",
903  firstHighFilterParameters={
904  "identifier": self.get_input_file_names(
905  f"trk_ToPXDStateFilter_1{fbdt_state_filter_string}.xml")[0],
906  "cut": self.state_filter_cut,
907  "direction": "backward"},
908  firstHighUseNStates=self.use_n_best_states,
909 
910  advanceHighFilter="advance",
911  advanceHighFilterParameters={"direction": "backward"},
912 
913  secondHighFilter="mva",
914  secondHighFilterParameters={
915  "identifier": self.get_input_file_names(
916  f"trk_ToPXDStateFilter_2{fbdt_state_filter_string}.xml")[0],
917  "cut": self.state_filter_cut},
918  secondHighUseNStates=self.use_n_best_states,
919 
920  updateHighFilter="fit",
921 
922  thirdHighFilter="mva",
923  thirdHighFilterParameters={
924  "identifier": self.get_input_file_names(
925  f"trk_ToPXDStateFilter_3{fbdt_state_filter_string}.xml")[0],
926  "cut": self.state_filter_cut},
927  thirdHighUseNStates=self.use_n_best_states,
928 
929  filter="mva",
930  filterParameters={
931  "identifier": self.get_input_file_names(
932  f"trk_ToPXDResultFilter{fbdt_result_filter_string}.xml")[0],
933  "cut": self.result_filter_cut},
934  useBestNInSeed=self.use_n_best_results,
935 
936  exportTracks=True,
937  enableOverlapResolving=True)
938 
939  path.add_module('RelatedTracksCombiner',
940  VXDRecoTracksStoreArrayName="PXDRecoTracks",
941  CDCRecoTracksStoreArrayName="CDCSVDRecoTracks",
942  recoTracksStoreArrayName="RecoTracks")
943 
944  path.add_module('TrackFinderMCTruthRecoTracks',
945  RecoTracksStoreArrayName="MCRecoTracks",
946  WhichParticles=[],
947  UsePXDHits=True,
948  UseSVDHits=True,
949  UseCDCHits=True)
950 
951  path.add_module("MCRecoTracksMatcher", UsePXDHits=True, UseSVDHits=True, UseCDCHits=True,
952  mcRecoTracksStoreArrayName="MCRecoTracks",
953  prRecoTracksStoreArrayName="RecoTracks")
954 
955  path.add_module(
956  CombinedTrackingValidationModule(
957  output_file_name=self.get_output_file_name(
958  f"to_pxd_ckf_validation{fbdt_state_filter_string}_{fbdt_result_filter_string}.root"),
959  reco_tracks_name="RecoTracks",
960  mc_reco_tracks_name="MCRecoTracks",
961  name="",
962  contact="",
963  expert_level=200))
964 
965  return path
966 
967  def create_path(self):
968  """
969  Create the basf2 path to process: reconstruction and validation with the trained filters.
970  """
971  return self.create_optimisation_and_validation_path()
972 
973 
974 class MainTask(b2luigi.WrapperTask):
975  """
976  Wrapper task that needs to finish for b2luigi to finish running this steering file.
977 
978  It is done if the outputs of all required subtasks exist. It is thus at the
979  top of the luigi task graph. Edit the ``requires`` method to steer which
980  tasks and with which parameters you want to run.
981  """
982 
983  n_events_training = b2luigi.get_setting(
984 
985  "n_events_training", default=1000
986 
987  )
988 
989  n_events_testing = b2luigi.get_setting(
990 
991  "n_events_testing", default=500
992 
993  )
994 
995  n_events_per_task = b2luigi.get_setting(
996 
997  "n_events_per_task", default=100
998 
999  )
1000 
1001  num_processes = b2luigi.get_setting(
1002 
1003  "basf2_processes_per_worker", default=0
1004 
1005  )
1006 
1007 
1008  bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
1009 
1010  bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
1011 
1012  def requires(self):
1013  """
1014  Generate the list of tasks that need to be done for luigi to finish running
1015  this steering file.
1016  """
1017 
1018  fast_bdt_options = [
1019  [50, 8, 3, 0.1],
1020  [100, 8, 3, 0.1],
1021  [200, 8, 3, 0.1],
1022  ]
1023 
1024  experiment_numbers = b2luigi.get_setting("experiment_numbers")
1025 
1026  # iterate over all possible combinations of parameters from the above defined parameter lists
1027  for experiment_number, fast_bdt_option_state_filter, fast_bdt_option_result_filter in itertools.product(
1028  experiment_numbers, fast_bdt_options, fast_bdt_options
1029  ):
1030 
1031  state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
1032  n_best_states_list = [3, 5, 10]
1033  result_filter_cuts = [0.05, 0.1, 0.2]
1034  n_best_results_list = [2, 3, 5]
1035  for state_filter_cut, n_best_states, result_filter_cut, n_best_results in \
1036  itertools.product(state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list):
1037  yield self.clone(
1038  ValidationAndOptimisationTask,
1039  experiment_number=experiment_number,
1040  n_events_training=self.n_events_training,
1041  n_events_testing=self.n_events_testing,
1042  state_filter_cut=state_filter_cut,
1043  use_n_best_states=n_best_states,
1044  result_filter_cut=result_filter_cut,
1045  use_n_best_results=n_best_results,
1046  fast_bdt_option_state_filter=fast_bdt_option_state_filter,
1047  fast_bdt_option_result_filter=fast_bdt_option_result_filter,
1048  )
1049 
1050 
1051 if __name__ == "__main__":
1052  b2luigi.set_setting("env_script", "./setup_basf2.sh")
1053  b2luigi.set_setting("batch_system", "htcondor")
1054  workers = b2luigi.get_setting("workers", default=1)
1055  b2luigi.process(MainTask(), workers=workers, batch=True)
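
The nested ``itertools.product`` loops in ``MainTask.requires`` define a full grid scan, so the number of ``ValidationAndOptimisationTask`` instances grows quickly. The following is a minimal standalone sketch, not part of the steering file, that counts the combinations. The parameter lists are copied from the listing above; the helper name ``count_validation_tasks`` and the use of ``range(n_experiments)`` as a stand-in for the ``experiment_numbers`` setting are assumptions made for illustration.

```python
import itertools

# Parameter lists copied from MainTask.requires in the listing above.
fast_bdt_options = [[50, 8, 3, 0.1], [100, 8, 3, 0.1], [200, 8, 3, 0.1]]
state_filter_cuts = [0.01, 0.02, 0.03, 0.05, 0.1, 0.2]
n_best_states_list = [3, 5, 10]
result_filter_cuts = [0.05, 0.1, 0.2]
n_best_results_list = [2, 3, 5]


def count_validation_tasks(n_experiments):
    """Count the parameter combinations yielded for n_experiments experiment numbers.

    Hypothetical helper: range(n_experiments) stands in for the
    "experiment_numbers" setting, whose real values are not known here.
    """
    # Outer loop: experiment x state-filter options x result-filter options.
    outer = itertools.product(range(n_experiments), fast_bdt_options, fast_bdt_options)
    # Inner loop: cut values and "n best" limits, identical for each outer combination.
    inner = list(itertools.product(
        state_filter_cuts, n_best_states_list, result_filter_cuts, n_best_results_list))
    return sum(len(inner) for _ in outer)


print(count_validation_tasks(1))  # 3 * 3 * (6 * 3 * 3 * 3) = 1458
```

Only the validation step is repeated for every cut combination. The trainings depend on the experiment number and the FastBDT options alone, so b2luigi should be able to reuse their outputs across the inner loop.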