Belle II Software  release-08-01-10
combined_quality_estimator_teacher.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""
combined_module_quality_estimator_teacher
-----------------------------------------

Information on the MVA Track Quality Indicator / Estimator can be found
on `Confluence
<https://confluence.desy.de/display/BI/MVA+Track+Quality+Indicator>`_.

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the combined training and validation of three
classifiers: the actual final MVA track quality estimator and the two quality
estimators for the intermediate standalone track finders that it depends on.

    - Final MVA track quality estimator:
      The final quality estimator for fully merged and fitted tracks (RecoTracks).
      Its classifier uses features from the track fitting, merger, hit pattern, ...
      But it also uses the outputs from respective intermediate quality
      estimators for the VXD and the CDC track finding as inputs. It provides
      the final quality indicator (QI) exported to the track objects.

    - VXDTF2 track quality estimator:
      MVA quality estimator for the VXD standalone track finding.

    - CDC track quality estimator:
      MVA quality estimator for the CDC standalone track finding.

Each classifier requires a different training data set, and each needs to be
validated on a separate testing data set. Further, the final quality estimator
can only be trained when the trained weights for the intermediate quality
estimators are available. If the final estimator shall be trained without one or
both previous estimators, the requirements have to be commented out in the
``__init__.py`` file of tracking.

For all estimators, a list of variables to be ignored is specified in the
MasterTask. The current choice is mainly based on poor data-MC agreement in
these quantities or on outdated implementations. It was decided to leave them in
the hardcoded "ugly" way in here to remind future generations that they exist in
principle and that they should and could be added to the estimator once their
modelling becomes better or an alternative implementation is programmed.

To avoid mistakes, b2luigi is used to create a task chain for a combined
training and validation of all classifiers.

b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
parameters, output files and which other tasks with which parameters it depends
on. For example, a teacher task, which runs ``basf2_mva_teacher.py`` to train
the classifier, depends on a data collection task which runs a reconstruction
and writes out track-wise variables into a root file for training. An
evaluation/validation task for testing the classifier requires both the teacher
task, as it needs the weightfile to be present, and also a data collection task,
because it needs a dataset for testing the classifier.

The final task that defines which tasks need to be done for the steering file to
finish is the ``MasterTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the
``MasterTask`` or replace them by lower-level tasks during debugging.
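
As a rough illustration of the dependency idea described above, the ordering
logic can be mimicked in plain Python. This is only a toy sketch and not the
b2luigi or luigi API: the classes ``ToyTask``, ``CollectData``, ``Teach`` and
``Validate`` and the helper ``run`` are hypothetical names made up for this
example.

```python
# Toy model of a task chain. All names are hypothetical; real b2luigi tasks
# additionally declare parameters and output files.
class ToyTask:
    name = "toy"

    def requires(self):
        # Tasks without dependencies require nothing.
        return []


class CollectData(ToyTask):
    def __init__(self, kind):
        self.name = f"collect_{kind}_data"


class Teach(ToyTask):
    name = "teach"

    def requires(self):
        # Training needs the collected training records.
        return [CollectData("training")]


class Validate(ToyTask):
    name = "validate"

    def requires(self):
        # Validation needs the weightfile and an independent test sample.
        return [Teach(), CollectData("testing")]


def run(task, finished=None):
    # Depth-first: run all requirements before the task itself, each only once.
    if finished is None:
        finished = []
    for dependency in task.requires():
        run(dependency, finished)
    if task.name not in finished:
        finished.append(task.name)
    return finished


order = run(Validate())
print(order)
```

Requesting only the top-level task is enough here: everything it depends on is
scheduled before it, which mirrors the role of the ``MasterTask``.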

Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling and `uncertain_panda
<https://github.com/nils-braun/uncertain_panda>`_ for uncertainty calculations.
uncertain_panda is not in the externals, and b2luigi is not in them up to
version v01-07-01. Both can be installed via pip::

    python3 -m pip install [--user] b2luigi uncertain_panda

Use the ``--user`` option if you do not have the rights to install python
packages into your externals (e.g. because you are using cvmfs) and install them
in ``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.
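
A minimal sketch of creating such a file from Python, in case you prefer to
generate it. Only ``result_path`` is a setting named in this documentation; the
key ``n_events_training`` and both values are hypothetical placeholders, so
check the ``MasterTask`` for the settings that are actually read.

```python
import json
import tempfile
from pathlib import Path

# "result_path" is named in this documentation; "n_events_training" and both
# values are hypothetical placeholders.
settings = {
    "result_path": "/path/to/results",
    "n_events_training": 1000,
}

# Written to a temporary directory for this demonstration. Where the real
# settings.json has to be placed depends on your setup.
settings_file = Path(tempfile.mkdtemp()) / "settings.json"
settings_file.write_text(json.dumps(settings, indent=4))

# Read the file back to check that it is valid JSON.
loaded = json.loads(settings_file.read_text())
print(loaded["result_path"])
```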

Usage
~~~~~

You can test the b2luigi steering file without running it via::

    python3 combined_quality_estimator_teacher.py --dry-run
    python3 combined_quality_estimator_teacher.py --show-output

This will show the outputs and show potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_quality_estimator_teacher.py

I usually use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` in case b2luigi has been installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_quality_estimator_teacher.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web
browser, you have two options:

    1. Run both the steering file and ``luigid`` remotely and use
       ssh-port-forwarding to your local host. To do so, run on your local
       machine::

           ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

    2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
       local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path``. The
directory tree encodes the used b2luigi parameters. This ensures reproducibility
and makes parameter searches easy. Sometimes, it is hard to find the relevant
output files. You can view the whole directory structure by running ``tree
<result_path>``. Use the unix ``find`` command to find the files that interest
you, e.g.::

    find <result_path> -name "*.pdf" # find all validation plot files
    find <result_path> -name "*.root" # find all ROOT files
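
The same search can be done from Python, e.g. in a notebook. The following is a
small sketch using ``pathlib``. The helper ``find_outputs`` and the directory
and file names in the demonstration are made up for this example.

```python
import tempfile
from pathlib import Path


def find_outputs(result_path, pattern):
    # Recursively collect all files below result_path matching the glob pattern.
    return sorted(p for p in Path(result_path).rglob(pattern) if p.is_file())


# Demonstration on a throw-away directory tree with made-up parameter folders.
result_path = Path(tempfile.mkdtemp())
plot_dir = result_path / "n_events=100" / "random_seed=test"
plot_dir.mkdir(parents=True)
(plot_dir / "validation.pdf").touch()
(plot_dir / "records.root").touch()

pdf_files = find_outputs(result_path, "*.pdf")
root_files = find_outputs(result_path, "*.root")
print([p.name for p in pdf_files])
```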
"""

import itertools
import os
from pathlib import Path
import shutil
import subprocess
import textwrap
from datetime import datetime
from typing import Iterable

import matplotlib.pyplot as plt
import numpy as np
import uproot
from matplotlib.backends.backend_pdf import PdfPages

import basf2
import basf2_mva
from packaging import version
import background
import simulation
import tracking
import tracking.root_utils as root_utils
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import get_serialized_parameters, get_log_file_dir, create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
    from b2luigi.core.task import Task, ExternalTask
    from b2luigi.basf2_helper.utils import get_basf2_git_hash
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise
try:
    from uncertain_panda import pandas as upd
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="uncertain_panda"))
    raise

# If b2luigi version 0.3.2 or older, it relies on $BELLE2_RELEASE being "head",
# which is not the case in the new externals. A fix has been merged into b2luigi
# via https://github.com/nils-braun/b2luigi/pull/17 and thus should be available
# in future releases.
if (
    version.parse(b2luigi.__version__) <= version.parse("0.3.2") and
    get_basf2_git_hash() is None and
    os.getenv("BELLE2_LOCAL_DIR") is not None
):
    print(f"b2luigi version could not obtain git hash because of a bug not yet fixed in version {b2luigi.__version__}\n"
          "Please install the latest version of b2luigi from github via\n\n"
          "  python3 -m pip install --upgrade [--user] git+https://github.com/nils-braun/b2luigi.git\n")
    raise ImportError

# Utility functions


def create_fbdt_option_string(fast_bdt_option):
    """
    Returns a readable string created from the fast_bdt_option array.
    """
    return "_nTrees" + str(fast_bdt_option[0]) + "_nCuts" + str(fast_bdt_option[1]) + "_nLevels" + \
        str(fast_bdt_option[2]) + "_shrin" + str(int(round(100*fast_bdt_option[3], 0)))


def createV0momenta(x, mu, beta):
    """
    Copied from Bianca's K_S0 particle gun code: Returns a realistic V0 momentum distribution
    when running over x. Mu and beta are properties of the function that define center and tails.
    The functional form is the probability density of a Gumbel distribution with location
    mu and scale beta.
    Used for the particle gun simulation code for K_S0 and Lambda_0
    """
    return (1/beta)*np.exp(-(x - mu)/beta) * np.exp(-np.exp(-(x - mu) / beta))


def my_basf2_mva_teacher(
    records_files,
    tree_name,
    weightfile_identifier,
    target_variable="truth",
    exclude_variables=None,
    fast_bdt_option=[200, 8, 3, 0.1]  # nTrees, nCuts, nLevels, shrinkage
):
    """
    My custom wrapper for basf2 mva teacher. Adapted from code in ``trackfindingcdc_teacher``.

    :param records_files: List of files with collected ("recorded") variables to use as training data for the MVA.
    :param tree_name: Name of the TTree in the ROOT file from the ``data_collection_task``
           that contains the training data for the MVA teacher.
    :param weightfile_identifier: Name of the weightfile that is created.
           Should either end in ".xml" for local weightfiles or in ".root", when
           the weightfile needs later to be uploaded as a payload to the conditions
           database.
    :param target_variable: Feature/variable to use as truth label in the quality estimator MVA classifier.
    :param exclude_variables: List of collected variables to not use in the training of the QE MVA classifier.
           In addition to variables containing the "truth" substring, which are excluded by default.
    :param fast_bdt_option: specified fast BDT options, default: [200, 8, 3, 0.1] [nTrees, nCuts, nLevels, shrinkage]
    """
    if exclude_variables is None:
        exclude_variables = []

    weightfile_extension = Path(weightfile_identifier).suffix
    if weightfile_extension not in {".xml", ".root"}:
        raise ValueError(f"Weightfile Identifier should end in .xml or .root, but ends in {weightfile_extension}")

    # extract names of all variables from one record file
    with root_utils.root_open(records_files[0]) as records_tfile:
        input_tree = records_tfile.Get(tree_name)
        feature_names = [leaf.GetName() for leaf in input_tree.GetListOfLeaves()]

    # get list of variables to use for training without MC truth
    truth_free_variable_names = [
        name
        for name in feature_names
        if (
            ("truth" not in name) and
            (name != target_variable) and
            (name not in exclude_variables)
        )
    ]
    if "weight" in truth_free_variable_names:
        truth_free_variable_names.remove("weight")
        weight_variable = "weight"
    elif "__weight__" in truth_free_variable_names:
        truth_free_variable_names.remove("__weight__")
        weight_variable = "__weight__"
    else:
        weight_variable = ""

    # Set options for MVA training
    general_options = basf2_mva.GeneralOptions()
    general_options.m_datafiles = basf2_mva.vector(*records_files)
    general_options.m_treename = tree_name
    general_options.m_weight_variable = weight_variable
    general_options.m_identifier = weightfile_identifier
    general_options.m_variables = basf2_mva.vector(*truth_free_variable_names)
    general_options.m_target_variable = target_variable
    fastbdt_options = basf2_mva.FastBDTOptions()

    fastbdt_options.m_nTrees = fast_bdt_option[0]
    fastbdt_options.m_nCuts = fast_bdt_option[1]
    fastbdt_options.m_nLevels = fast_bdt_option[2]
    fastbdt_options.m_shrinkage = fast_bdt_option[3]
    # Train a MVA method and store the weightfile (MVAFastBDT.root) locally.
    basf2_mva.teacher(general_options, fastbdt_options)


def _my_uncertain_mean(series: upd.Series):
    """
    Temporary workaround for a bug in ``uncertain_panda`` where a ``ValueError``
    is thrown for ``Series.unc.mean`` if the series is empty. Can be replaced by
    ``.unc.mean`` when the issue is fixed.
    https://github.com/nils-braun/uncertain_panda/issues/2
    """
    try:
        return series.unc.mean()
    except ValueError:
        if series.empty:
            return np.nan
        else:
            raise


def get_uncertain_means_for_qi_cuts(df: upd.DataFrame, column: str, qi_cuts: Iterable[float]):
    """
    Return a pandas series with the mean of the dataframe column and its
    uncertainty for each quality indicator cut.

    :param df: Pandas dataframe with at least ``quality_indicator``
        and another numeric ``column``.
    :param column: Column of which we want to aggregate the means
        and uncertainties for different QI cuts
    :param qi_cuts: Iterable of quality indicator minimal thresholds.
    :returns: Series of means and uncertainties with ``qi_cuts`` as index
    """

    uncertain_means = (_my_uncertain_mean(df.query(f"quality_indicator > {qi_cut}")[column])
                       for qi_cut in qi_cuts)
    uncertain_means_series = upd.Series(data=uncertain_means, index=qi_cuts)
    return uncertain_means_series


def plot_with_errobands(uncertain_series,
                        error_band_alpha=0.3,
                        plot_kwargs={},
                        fill_between_kwargs={},
                        ax=None):
    """
    Plot an uncertain series with error bands for y-errors
    """
    if ax is None:
        ax = plt.gca()
    uncertain_series = uncertain_series.dropna()
    ax.plot(uncertain_series.index.values, uncertain_series.nominal_value, **plot_kwargs)
    ax.fill_between(x=uncertain_series.index,
                    y1=uncertain_series.nominal_value - uncertain_series.std_dev,
                    y2=uncertain_series.nominal_value + uncertain_series.std_dev,
                    alpha=error_band_alpha,
                    **fill_between_kwargs)


def format_dictionary(adict, width=80, bullet="•"):
    """
    Helper function to format dictionary to string as a wrapped key-value bullet
    list. Useful to print metadata from dictionaries.

    :param adict: Dictionary to format
    :param width: Characters after which to wrap a key-value line
    :param bullet: Character to begin a key-value line with, e.g. ``-`` for a
        yaml-like string
    """
    # It might be possible to replace this function with yaml.dump, but the current
    # version in the externals does not allow to disable the sorting of the
    # dictionary yet and also I am not sure if it is wrappable
    return "\n".join(textwrap.fill(f"{bullet} {key}: {value}", width=width)
                     for (key, value) in adict.items())

# Begin definitions of b2luigi task classes


class GenerateSimTask(Basf2PathTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    n_events = b2luigi.IntParameter()
    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )
    queue = 'l'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        basf2.set_random_seed(self.random_seed)
        path = basf2.create_path()
        if self.experiment_number in [0, 1002, 1003]:
            runNo = 0
        else:
            raise ValueError(
                f"Simulating events with experiment number {self.experiment_number} is not implemented yet.")
        path.add_module(
            "EventInfoSetter", evtNumList=[self.n_events], runList=[runNo], expList=[self.experiment_number]
        )
        # Check "V0BBBAR" before "BBBAR": the former contains the latter as a
        # substring, so in the reverse order this branch could never be reached.
        if "V0BBBAR" in self.random_seed:
            path.add_module("EvtGenInput")
            path.add_module("InclusiveParticleChecker", particles=[310, 3122], includeConjugates=True)
        elif "BBBAR" in self.random_seed:
            path.add_module("EvtGenInput")
        else:
            import generators as ge
            # WARNING: There are a few differences in the production of MC13a and b like the following lines
            # as well as ActivatePXD.. and the beamparams for bhabha... I use these from MC13b, not a... :/
            # import beamparameters as bp
            # beamparameters = bp.add_beamparameters(path, "Y4S")
            # beamparameters.param("covVertex", [(14.8e-4)**2, (1.5e-4)**2, (360e-4)**2])
            if "V0STUDY" in self.random_seed:
                if "V0STUDYKS" in self.random_seed:
                    # Bianca looked at the Ks dists and extracted these values:
                    mu = 0.5
                    beta = 0.2
                    pdgs = [310]  # Ks (has no antiparticle, Klong is different)
                elif "V0STUDYL0" in self.random_seed:
                    # I just made the lambda values up, such that they peak at 0.35 and are slightly shifted to lower values
                    mu = 0.35
                    beta = 0.15  # if this is chosen higher, one needs to make sure not to get values >0 for 0
                    pdgs = [3122, -3122]  # Lambda0
                else:
                    # also these values are made up
                    mu = 0.43
                    beta = 0.18
                    pdgs = [310, 3122, -3122]  # Ks and Lambda0
                # create realistic momentum distribution
                myx = [i*0.01 for i in range(321)]
                myy = []
                for x in myx:
                    y = createV0momenta(x, mu, beta)
                    myy.append(y)
                polParams = myx + myy
                # define particles that are produced
                pdg_list = pdgs

                particlegun = basf2.register_module('ParticleGun')
                particlegun.param('pdgCodes', pdg_list)
                particlegun.param('nTracks', 8)  # number of particles (not tracks!) that is created in each event
                particlegun.param('momentumGeneration', 'polyline')
                particlegun.param('momentumParams', polParams)
                particlegun.param('thetaGeneration', 'uniformCos')
                particlegun.param('thetaParams', [17, 150])  # [0, 180]) #[17, 150]
                particlegun.param('phiGeneration', 'uniform')
                particlegun.param('phiParams', [0, 360])
                particlegun.param('vertexGeneration', 'fixed')
                particlegun.param('xVertexParams', [0])
                particlegun.param('yVertexParams', [0])
                particlegun.param('zVertexParams', [0])
                path.add_module(particlegun)
            if "BHABHA" in self.random_seed:
                ge.add_babayaganlo_generator(path=path, finalstate='ee', minenergy=0.15, minangle=10.0)
            elif "MUMU" in self.random_seed:
                ge.add_kkmc_generator(path=path, finalstate='mu+mu-')
            elif "YY" in self.random_seed:
                babayaganlo = basf2.register_module('BabayagaNLOInput')
                babayaganlo.param('FinalState', 'gg')
                babayaganlo.param('MaxAcollinearity', 180.0)
                babayaganlo.param('ScatteringAngleRange', [0., 180.])
                babayaganlo.param('FMax', 75000)
                babayaganlo.param('MinEnergy', 0.01)
                babayaganlo.param('Order', 'exp')
                babayaganlo.param('DebugEnergySpread', 0.01)
                babayaganlo.param('Epsilon', 0.00005)
                path.add_module(babayaganlo)
                generatorpreselection = basf2.register_module('GeneratorPreselection')
                generatorpreselection.param('nChargedMin', 0)
                generatorpreselection.param('nChargedMax', 999)
                generatorpreselection.param('MinChargedPt', 0.15)
                generatorpreselection.param('MinChargedTheta', 17.)
                generatorpreselection.param('MaxChargedTheta', 150.)
                generatorpreselection.param('nPhotonMin', 1)
                generatorpreselection.param('MinPhotonEnergy', 1.5)
                generatorpreselection.param('MinPhotonTheta', 15.0)
                generatorpreselection.param('MaxPhotonTheta', 165.0)
                generatorpreselection.param('applyInCMS', True)
                path.add_module(generatorpreselection)
                empty = basf2.create_path()
                generatorpreselection.if_value('!=11', empty)
            elif "EEEE" in self.random_seed:
                ge.add_aafh_generator(path=path, finalstate='e+e-e+e-', preselection=False)
            elif "EEMUMU" in self.random_seed:
                ge.add_aafh_generator(path=path, finalstate='e+e-mu+mu-', preselection=False)
            elif "TAUPAIR" in self.random_seed:
                ge.add_kkmc_generator(path, finalstate='tau+tau-')
            elif "DDBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='ddbar')
            elif "UUBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='uubar')
            elif "SSBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='ssbar')
            elif "CCBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='ccbar')
        # activate simulation of dead/masked pixel and reproduce detector gain, which will be
        # applied at reconstruction level when the data GT is present in the DB chain
        # path.add_module("ActivatePXDPixelMasker")
        # path.add_module("ActivatePXDGainCalibrator")
        bkg_files = background.get_background_files(self.bkgfiles_dir)
        if self.experiment_number == 1002:
            # remove KLM because of bug in background files with release 4
            components = ['PXD', 'SVD', 'CDC', 'ECL', 'TOP', 'ARICH', 'TRG']
        else:
            components = None
        simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, components=components)  # , usePXDDataReduction=False)

        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.output_file_name()),
        )
        return path
545 
546 
547 # I don't use the default MergeTask or similar because they only work if every input file is called the same.
548 # Additionally, I want to add more features like deleting the original input to save storage space.
549 class SplitNMergeSimTask(Basf2Task):
550  """
551  Generate simulated Monte Carlo with background overlay.
552 
553  Make sure to use different ``random_seed`` parameters for the training data
554  format the classifier trainings and for the test data for the respective
555  evaluation/validation tasks.
556  """
557 
558 
559  n_events = b2luigi.IntParameter()
560 
561  experiment_number = b2luigi.IntParameter()
562 
564  random_seed = b2luigi.Parameter()
565 
566  bkgfiles_dir = b2luigi.Parameter(
567 
568  hashed=True
569 
570  )
571 
572  queue = 'sx'
573 
574 
575  def output_file_name(self, n_events=None, random_seed=None):
576  """
577  Create output file name depending on number of events and production
578  mode that is specified in the random_seed string.
579  """
580  if n_events is None:
581  n_events = self.n_eventsn_events
582  if random_seed is None:
583  random_seed = self.random_seedrandom_seed
584  return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"
585 
586  def output(self):
587  """
588  Generate list of output files that the task should produce.
589  The task is considered finished if and only if the outputs all exist.
590  """
591  yield self.add_to_output(self.output_file_nameoutput_file_name())
592 
593  def requires(self):
594  """
595  Generate list of luigi Tasks that this Task depends on.
596  """
597  n_events_per_task = MasterTask.n_events_per_task
598  quotient, remainder = divmod(self.n_eventsn_events, n_events_per_task)
599  for i in range(quotient):
600  yield GenerateSimTask(
601  bkgfiles_dir=self.bkgfiles_dirbkgfiles_dir,
602  num_processes=MasterTask.num_processes,
603  random_seed=self.random_seedrandom_seed + '_' + str(i).zfill(3),
604  n_events=n_events_per_task,
605  experiment_number=self.experiment_numberexperiment_number,
606  )
607  if remainder > 0:
608  yield GenerateSimTask(
609  bkgfiles_dir=self.bkgfiles_dirbkgfiles_dir,
610  num_processes=MasterTask.num_processes,
611  random_seed=self.random_seedrandom_seed + '_' + str(quotient).zfill(3),
612  n_events=remainder,
613  experiment_number=self.experiment_numberexperiment_number,
614  )
615 
616  @b2luigi.on_temporary_files
617  def process(self):
618  """
619  When all GenerateSimTasks finished, merge the output.
620  """
621  create_output_dirs(self)
622 
623  file_list = []
624  for _, file_name in self.get_input_file_names().items():
625  file_list.append(*file_name)
626  print("Merge the following files:")
627  print(file_list)
628  cmd = ["b2file-merge", "-f"]
629  args = cmd + [self.get_output_file_name(self.output_file_nameoutput_file_name())] + file_list
630  subprocess.check_call(args)
631  print("Finished merging. Now remove the input files to save space.")
632  cmd2 = ["rm", "-f"]
633  for tempfile in file_list:
634  args = cmd2 + [tempfile]
635  subprocess.check_call(args)


class CheckExistingFile(ExternalTask):
    """
    Task to check if the given file really exists.
    """

    filename = b2luigi.Parameter()

    def output(self):
        """
        Specify the output to be the file that was just checked.
        """
        from luigi import LocalTarget
        return LocalTarget(self.filename)


class VXDQEDataCollectionTask(Basf2PathTask):
    """
    Collect variables/features from VXDTF2 tracking and write them to a ROOT
    file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the VXD track quality estimator
    """

    n_events = b2luigi.IntParameter()
    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    queue = 'l'

    def get_records_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if 'vxd' not in random_seed:
            random_seed += '_vxd'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_vxd.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'

    def get_input_files(self, n_events=None, random_seed=None):
        """
        Get input file names depending on the use case: If they already exist, search in
        the corresponding folders, for data check the specified list and if they are created
        in the same run, check for the task that produced them.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                    n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if "USESIM" in self.random_seed or "DATA" in self.random_seed:
            for filename in self.get_input_files():
                yield CheckExistingFile(
                    filename=filename,
                )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.random_seed,
                n_events=self.n_events,
                experiment_number=self.experiment_number,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_records_file_name())

    def create_path(self):
        """
        Create basf2 path with VXDTF2 tracking and VXD QE data collection.
        """
        path = basf2.create_path()
        inputFileNames = self.get_input_files()
        path.add_module(
            "RootInput",
            inputFileNames=inputFileNames,
        )
        path.add_module("Gearbox")
        tracking.add_geometry_modules(path)
        if 'DATA' in self.random_seed:
            from rawdata import add_unpackers
            add_unpackers(path, components=['SVD', 'PXD'])
        tracking.add_hit_preparation_modules(path)
        tracking.add_vxd_track_finding_vxdtf2(
            path, components=["SVD"], add_mva_quality_indicator=False
        )
        if 'DATA' in self.random_seed:
            path.add_module(
                "VXDQETrainingDataCollector",
                TrainingDataOutputName=self.get_output_file_name(self.get_records_file_name()),
                SpacePointTrackCandsStoreArrayName="SPTrackCands",
                EstimationMethod="tripletFit",
                UseTimingInfo=False,
                ClusterInformation="Average",
                MCStrictQualityEstimator=False,
                mva_target=False,
                MCInfo=False,
            )
        else:
            path.add_module(
                "TrackFinderMCTruthRecoTracks",
                RecoTracksStoreArrayName="MCRecoTracks",
                WhichParticles=[],
                UsePXDHits=False,
                UseSVDHits=True,
                UseCDCHits=False,
            )
            path.add_module(
                "VXDQETrainingDataCollector",
                TrainingDataOutputName=self.get_output_file_name(self.get_records_file_name()),
                SpacePointTrackCandsStoreArrayName="SPTrackCands",
                EstimationMethod="tripletFit",
                UseTimingInfo=False,
                ClusterInformation="Average",
                MCStrictQualityEstimator=True,
                mva_target=False,
            )
        return path


class CDCQEDataCollectionTask(Basf2PathTask):
    """
    Collect variables/features from CDC tracking and write them to a ROOT file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the CDC track quality estimator.
    """

    n_events = b2luigi.IntParameter()

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    queue = 'l'

    def get_records_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if 'cdc' not in random_seed:
            random_seed += '_cdc'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_cdc.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'

    def get_input_files(self, n_events=None, random_seed=None):
        """
        Get input file names depending on the use case: if the files already
        exist, search in the corresponding folders; for data, use the specified
        list; and if they are created in the same run, ask the task that
        produced them.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                    n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if "USESIM" in self.random_seed or "DATA" in self.random_seed:
            for filename in self.get_input_files():
                yield CheckExistingFile(
                    filename=filename,
                )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.random_seed,
                n_events=self.n_events,
                experiment_number=self.experiment_number,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_records_file_name())

    def create_path(self):
        """
        Create basf2 path with CDC standalone tracking and CDC QE with recording filter for MVA feature collection.
        """
        path = basf2.create_path()
        inputFileNames = self.get_input_files()
        path.add_module(
            "RootInput",
            inputFileNames=inputFileNames,
        )
        path.add_module("Gearbox")
        tracking.add_geometry_modules(path)
        if 'DATA' in self.random_seed:
            filter_choice = "recording_data"
            from rawdata import add_unpackers
            add_unpackers(path, components=['CDC'])
        else:
            filter_choice = "recording"
            # tracking.add_hit_preparation_modules(path)  # only needed for SVD and
            # PXD hit preparation. Does not change the CDC output.
        tracking.add_cdc_track_finding(path, with_ca=False, add_mva_quality_indicator=True)

        basf2.set_module_parameters(
            path,
            name="TFCDC_TrackQualityEstimator",
            filter=filter_choice,
            filterParameters={
                "rootFileName": self.get_output_file_name(self.get_records_file_name())
            },
        )
        return path
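
# Sketch (not part of the steering file): the naming convention implemented by
# get_records_file_name(), written as a free function so it can be tried without
# b2luigi. The function name and the ``tag`` argument are hypothetical; ``tag``
# stands for the per-detector suffix ('cdc' here).

```python
def records_file_name(n_events: int, random_seed: str, tag: str) -> str:
    """Build a records file name following the convention of the collection tasks."""
    # Append the detector tag once so that seeds of different tasks do not clash.
    if tag not in random_seed:
        random_seed += "_" + tag
    # Data records do not depend on the number of events or on the seed.
    if "DATA" in random_seed:
        return f"qe_records_DATA_{tag}.root"
    # USESIMBB / USESIMEE are translated to the process names used on disk.
    if "USESIMBB" in random_seed:
        random_seed = "BBBAR_" + random_seed.split("_", 1)[1]
    elif "USESIMEE" in random_seed:
        random_seed = "BHABHA_" + random_seed.split("_", 1)[1]
    return f"qe_records_N{n_events}_{random_seed}.root"
```

# Under these assumptions, records_file_name(1000, "USESIMBB_train", "cdc")
# yields 'qe_records_N1000_BBBAR_train_cdc.root'.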


class RecoTrackQEDataCollectionTask(Basf2PathTask):
    """
    Collect variables/features from the reco track reconstruction including the
    fit and write them to a ROOT file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the MVA track quality estimator. The collected
    variables include the classifier outputs from the VXD and CDC quality
    estimators, namely the CDC and VXD quality indicators, combined with fit,
    merger, timing, energy loss information etc. This task requires the
    subdetector quality estimators to be trained.
    """

    n_events = b2luigi.IntParameter()

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    cdc_training_target = b2luigi.Parameter()

    recotrack_option = b2luigi.Parameter(
        default='deleteCDCQI080'
    )

    fast_bdt_option = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

    queue = 'l'

    def get_records_file_name(self, n_events=None, random_seed=None, recotrack_option=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if recotrack_option is None:
            recotrack_option = self.recotrack_option
        if 'rec' not in random_seed:
            random_seed += '_rec'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_rec.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '_' + recotrack_option + '.root'

    def get_input_files(self, n_events=None, random_seed=None):
        """
        Get input file names depending on the use case: if the files already
        exist, search in the corresponding folders; for data, use the specified
        list; and if they are created in the same run, ask the task that
        produced them.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                    n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if "USESIM" in self.random_seed or "DATA" in self.random_seed:
            for filename in self.get_input_files():
                yield CheckExistingFile(
                    filename=filename,
                )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.random_seed,
                n_events=self.n_events,
                experiment_number=self.experiment_number,
            )
        if "DATA" not in self.random_seed:
            if 'useCDC' not in self.recotrack_option and 'noCDC' not in self.recotrack_option:
                yield CDCQETeacherTask(
                    n_events_training=MasterTask.n_events_training,
                    experiment_number=self.experiment_number,
                    training_target=self.cdc_training_target,
                    process_type=self.random_seed.split("_", 1)[0],
                    exclude_variables=MasterTask.exclude_variables_cdc,
                    fast_bdt_option=self.fast_bdt_option,
                )
            if 'useVXD' not in self.recotrack_option and 'noVXD' not in self.recotrack_option:
                yield VXDQETeacherTask(
                    n_events_training=MasterTask.n_events_training,
                    experiment_number=self.experiment_number,
                    process_type=self.random_seed.split("_", 1)[0],
                    exclude_variables=MasterTask.exclude_variables_vxd,
                    fast_bdt_option=self.fast_bdt_option,
                )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_records_file_name())

    def create_path(self):
        """
        Create basf2 reconstruction path that should mirror the default path
        from ``add_tracking_reconstruction()``, but with modules for the VXD QE
        and CDC QE application and for collection of variables for the reco
        track quality estimator.
        """
        path = basf2.create_path()
        inputFileNames = self.get_input_files()
        path.add_module(
            "RootInput",
            inputFileNames=inputFileNames,
        )
        path.add_module("Gearbox")

        # First add tracking reconstruction with default quality estimation modules
        mvaCDC = True
        mvaVXD = True
        if 'noCDC' in self.recotrack_option:
            mvaCDC = False
        if 'noVXD' in self.recotrack_option:
            mvaVXD = False
        if 'DATA' in self.random_seed:
            from rawdata import add_unpackers
            add_unpackers(path)
        tracking.add_tracking_reconstruction(path, add_cdcTrack_QI=mvaCDC, add_vxdTrack_QI=mvaVXD, add_recoTrack_QI=True)

        # If data shall be processed, check whether newly trained MVA weightfiles are available;
        # otherwise use the default ones (CDB payloads).
        # If useCDC/useVXD is specified, use the identifier lying in datafiles/. Otherwise, replace the
        # default weightfile identifiers (CDB payloads) by the new weightfiles created by this b2luigi script.
        if ('DATA' in self.random_seed or 'useCDC' in self.recotrack_option) and 'noCDC' not in self.recotrack_option:
            cdc_identifier = 'datafiles/' + \
                CDCQETeacherTask.get_weightfile_xml_identifier(CDCQETeacherTask, fast_bdt_option=self.fast_bdt_option)
            if os.path.exists(cdc_identifier):
                replace_cdc_qi = True
            elif 'useCDC' in self.recotrack_option:
                raise ValueError(f"CDC QI Identifier not found: {cdc_identifier}")
            else:
                replace_cdc_qi = False
        elif 'noCDC' in self.recotrack_option:
            replace_cdc_qi = False
        else:
            cdc_identifier = self.get_input_file_names(
                CDCQETeacherTask.get_weightfile_xml_identifier(
                    CDCQETeacherTask, fast_bdt_option=self.fast_bdt_option))[0]
            replace_cdc_qi = True
        if ('DATA' in self.random_seed or 'useVXD' in self.recotrack_option) and 'noVXD' not in self.recotrack_option:
            vxd_identifier = 'datafiles/' + \
                VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
            if os.path.exists(vxd_identifier):
                replace_vxd_qi = True
            elif 'useVXD' in self.recotrack_option:
                raise ValueError(f"VXD QI Identifier not found: {vxd_identifier}")
            else:
                replace_vxd_qi = False
        elif 'noVXD' in self.recotrack_option:
            replace_vxd_qi = False
        else:
            vxd_identifier = self.get_input_file_names(
                VXDQETeacherTask.get_weightfile_xml_identifier(
                    VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option))[0]
            replace_vxd_qi = True

        cdc_qe_mva_filter_parameters = None
        # If tracks below a certain CDC QI value shall be deleted online, the cut has to be given in the
        # filter parameters. This is also possible with the default (CDB) payloads.
        if 'deleteCDCQI' in self.recotrack_option:
            # The three digits following 'deleteCDCQI' encode the cut in percent, e.g. '080' -> 0.80
            cut_index = self.recotrack_option.find('deleteCDCQI') + len('deleteCDCQI')
            cut = int(self.recotrack_option[cut_index:cut_index+3])/100.
            if replace_cdc_qi:
                cdc_qe_mva_filter_parameters = {
                    "identifier": cdc_identifier, "cut": cut}
            else:
                cdc_qe_mva_filter_parameters = {
                    "cut": cut}
        elif replace_cdc_qi:
            cdc_qe_mva_filter_parameters = {
                "identifier": cdc_identifier}
        if cdc_qe_mva_filter_parameters is not None:
            # if no cut is specified, the default value is at zero and nothing is deleted.
            basf2.set_module_parameters(
                path,
                name="TFCDC_TrackQualityEstimator",
                filterParameters=cdc_qe_mva_filter_parameters,
                deleteTracks=True,
                resetTakenFlag=True
            )
        if replace_vxd_qi:
            basf2.set_module_parameters(
                path,
                name="VXDQualityEstimatorMVA",
                WeightFileIdentifier=vxd_identifier)

        # Replace final quality estimator module by training data collector module
        track_qe_module_name = "TrackQualityEstimatorMVA"
        module_found = False
        new_path = basf2.create_path()
        for module in path.modules():
            if module.name() != track_qe_module_name:
                # NOTE: ``module.name`` is not called here, so this comparison of a bound
                # method with a string is never equal and every module is copied over.
                if not module.name == 'TrackCreator':
                    new_path.add_module(module)
            else:
                # the TrackCreator needs to be run before the Collector such that
                # MDSTTracks are related to RecoTracks and d0 and z0 can be read out
                new_path.add_module(
                    'TrackCreator',
                    pdgCodes=[
                        211,
                        321,
                        2212],
                    recoTrackColName='RecoTracks',
                    trackColName='MDSTTracks')  # , useClosestHitToIP=True, useBFieldAtHit=True)
                new_path.add_module(
                    "TrackQETrainingDataCollector",
                    TrainingDataOutputName=self.get_output_file_name(self.get_records_file_name()),
                    collectEventFeatures=True,
                    SVDPlusCDCStandaloneRecoTracksStoreArrayName="SVDPlusCDCStandaloneRecoTracks",
                )
                module_found = True
        if not module_found:
            raise KeyError(f"No module {track_qe_module_name} found in path")
        path = new_path
        return path
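
# Sketch (not part of the steering file): how the 'deleteCDCQI' token of
# recotrack_option encodes the CDC QI cut. The function name is hypothetical;
# the slicing mirrors the three-digit convention used in create_path() above.

```python
from typing import Optional


def parse_cdc_qi_cut(recotrack_option: str) -> Optional[float]:
    """Return the CDC QI cut encoded in the option string, or None if absent."""
    key = "deleteCDCQI"
    if key not in recotrack_option:
        return None
    # The three characters after the key are read as a percentage.
    start = recotrack_option.find(key) + len(key)
    return int(recotrack_option[start:start + 3]) / 100.0
```

# With the default option 'deleteCDCQI080' this gives a cut of 0.8. An option
# without the token leaves the cut unset, so no tracks are deleted.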


class TrackQETeacherBaseTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    Since teacher tasks are needed for all quality estimators covered by this
    steering file and the only thing that changes is the required data
    collection task and some training parameters, I decided to use inheritance
    and have the basic functionality in this base class/interface and have the
    specific teacher tasks inherit from it.
    """

    n_events_training = b2luigi.IntParameter()

    experiment_number = b2luigi.IntParameter()

    process_type = b2luigi.Parameter(
        default="BBBAR"
    )

    training_target = b2luigi.Parameter(
        default="truth"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[]
    )

    fast_bdt_option = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

    @property
    def weightfile_identifier_basename(self):
        """
        Property defining the basename for the .xml and .root weightfiles that are created.
        Has to be implemented by the inheriting teacher task class.
        """
        raise NotImplementedError(
            "Teacher Task must define a static weightfile_identifier"
        )

    def get_weightfile_xml_identifier(self, fast_bdt_option=None, recotrack_option=None):
        """
        Name of the xml weightfile that is created by the teacher task.
        It is subsequently used as a local weightfile in the following validation tasks.
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option
        if recotrack_option is None and hasattr(self, 'recotrack_option'):
            recotrack_option = self.recotrack_option
        else:
            recotrack_option = ''
        weightfile_details = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = self.weightfile_identifier_basename + weightfile_details
        if recotrack_option != '':
            weightfile_name = weightfile_name + '_' + recotrack_option
        return weightfile_name + ".weights.xml"

    @property
    def tree_name(self):
        """
        Property defining the name of the tree in the ROOT file from the
        ``data_collection_task`` that contains the recorded training data. Must
        be implemented by the inheriting specific teacher task class.
        """
        raise NotImplementedError("Teacher Task must define a static tree_name")

    @property
    def random_seed(self):
        """
        Property defining random seed to be used by the ``GenerateSimTask``.
        Should differ from the random seed in the test data samples. Must
        be implemented by the inheriting specific teacher task class.
        """
        raise NotImplementedError("Teacher Task must define a static random seed")

    @property
    def data_collection_task(self) -> Basf2PathTask:
        """
        Property defining the specific ``DataCollectionTask`` to require. Must
        be implemented by the inheriting specific teacher task class.
        """
        raise NotImplementedError(
            "Teacher Task must define a data collection task to require "
        )

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if 'USEREC' in self.process_type:
            if 'USERECBB' in self.process_type:
                process = 'BBBAR'
            elif 'USERECEE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/qe_records_N' + str(self.n_events_training) + '_' + process + '_' + self.random_seed + '.root',
            )
        else:
            yield self.data_collection_task(
                num_processes=MasterTask.num_processes,
                n_events=self.n_events_training,
                experiment_number=self.experiment_number,
                random_seed=self.process_type + '_' + self.random_seed,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_xml_identifier())

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        if 'USEREC' in self.process_type:
            if 'USERECBB' in self.process_type:
                process = 'BBBAR'
            elif 'USERECEE' in self.process_type:
                process = 'BHABHA'
            records_files = ['datafiles/qe_records_N' + str(self.n_events_training) +
                             '_' + process + '_' + self.random_seed + '.root']
        else:
            if hasattr(self, 'recotrack_option'):
                records_files = self.get_input_file_names(
                    self.data_collection_task.get_records_file_name(
                        self.data_collection_task,
                        n_events=self.n_events_training,
                        random_seed=self.process_type + '_' + self.random_seed,
                        recotrack_option=self.recotrack_option))
            else:
                records_files = self.get_input_file_names(
                    self.data_collection_task.get_records_file_name(
                        self.data_collection_task,
                        n_events=self.n_events_training,
                        random_seed=self.process_type + '_' + self.random_seed))

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=self.tree_name,
            weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier()),
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option,
        )
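
# Sketch (not part of the steering file): the base class declares properties
# that raise NotImplementedError, and each concrete teacher overrides them with
# a plain class attribute. The class names below are hypothetical stand-ins
# that need neither basf2 nor b2luigi.

```python
class TeacherBase:
    """Base class with an 'abstract' property, as in TrackQETeacherBaseTask."""

    @property
    def tree_name(self):
        raise NotImplementedError("Teacher must define a static tree_name")

    def describe(self):
        # Works for any subclass that provides tree_name.
        return f"reading tree '{self.tree_name}'"


class CDCTeacher(TeacherBase):
    # A plain class attribute shadows the property inherited from the base.
    tree_name = "records"
```

# CDCTeacher().describe() returns "reading tree 'records'", while calling
# describe() on TeacherBase itself raises NotImplementedError.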


class VXDQETeacherTask(TrackQETeacherBaseTask):
    """
    Task to run basf2 mva teacher on collected data for VXDTF2 track quality estimator
    """

    weightfile_identifier_basename = "vxdtf2_mva_qe"

    tree_name = "tree"

    random_seed = "train_vxd"

    data_collection_task = VXDQEDataCollectionTask


class CDCQETeacherTask(TrackQETeacherBaseTask):
    """
    Task to run basf2 mva teacher on collected data for CDC track quality estimator
    """

    weightfile_identifier_basename = "cdc_mva_qe"

    tree_name = "records"

    random_seed = "train_cdc"

    data_collection_task = CDCQEDataCollectionTask


class RecoTrackQETeacherTask(TrackQETeacherBaseTask):
    """
    Task to run basf2 mva teacher on collected data for the final, combined
    track quality estimator
    """

    recotrack_option = b2luigi.Parameter(
        default='deleteCDCQI080'
    )

    weightfile_identifier_basename = "recotrack_mva_qe"

    tree_name = "tree"

    random_seed = "train_rec"

    data_collection_task = RecoTrackQEDataCollectionTask

    cdc_training_target = b2luigi.Parameter()

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if 'USEREC' in self.process_type:
            if 'USERECBB' in self.process_type:
                process = 'BBBAR'
            elif 'USERECEE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/qe_records_N' + str(self.n_events_training) + '_' + process + '_' + self.random_seed + '.root',
            )
        else:
            yield self.data_collection_task(
                cdc_training_target=self.cdc_training_target,
                num_processes=MasterTask.num_processes,
                n_events=self.n_events_training,
                experiment_number=self.experiment_number,
                random_seed=self.process_type + '_' + self.random_seed,
                recotrack_option=self.recotrack_option,
                fast_bdt_option=self.fast_bdt_option,
            )
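
# Sketch (not part of the steering file): several tasks call
# ``SomeTask.get_weightfile_xml_identifier(SomeTask, ...)``, i.e. they pass the
# class itself as ``self``. This works as long as the method only reads plain
# class attributes; b2luigi parameters would be Parameter objects, not values,
# when read from the class. The class, method and suffix are hypothetical.

```python
class Teacher:
    """Minimal stand-in for a teacher task with a static basename."""

    basename = "cdc_mva_qe"

    def weightfile_name(self, suffix=""):
        # Only class-level attributes are read, so no instance is required.
        return self.basename + suffix + ".weights.xml"


# Call the method unbound, passing the class in place of an instance.
name = Teacher.weightfile_name(Teacher, suffix="_nTrees200")
```

# The design avoids instantiating a task (and fixing all of its parameters)
# only to compute a file name, at the cost of an unusual calling convention.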


class HarvestingValidationBaseTask(Basf2PathTask):
    """
    Run track reconstruction with MVA quality estimator and write out
    (="harvest") a root file with variables useful for the validation.
    """

    n_events_testing = b2luigi.IntParameter()

    n_events_training = b2luigi.IntParameter()

    experiment_number = b2luigi.IntParameter()

    process_type = b2luigi.Parameter(
        default="BBBAR"
    )

    exclude_variables = b2luigi.ListParameter(
        hashed=True
    )

    fast_bdt_option = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]
    )

    validation_output_file_name = "harvesting_validation.root"

    reco_output_file_name = "reconstruction.root"

    components = None

    @property
    def teacher_task(self) -> TrackQETeacherBaseTask:
        """
        Teacher task to require to provide a quality estimator weightfile for ``add_tracking_with_quality_estimation``
        """
        raise NotImplementedError()

    def add_tracking_with_quality_estimation(self, path: basf2.Path) -> None:
        """
        Add modules for track reconstruction to basf2 path that are to be
        validated. Besides track finding it should include MC matching, fitted
        track creation and a quality estimator module.
        """
        raise NotImplementedError()

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield self.teacher_task(
            n_events_training=self.n_events_training,
            experiment_number=self.experiment_number,
            process_type=self.process_type,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option,
        )
        if 'USE' in self.process_type:  # USESIM and USEREC
            if 'BB' in self.process_type:
                process = 'BBBAR'
            elif 'EE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
            )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.process_type + '_test',
                n_events=self.n_events_testing,
                experiment_number=self.experiment_number,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.validation_output_file_name)
        yield self.add_to_output(self.reco_output_file_name)

    def create_path(self):
        """
        Create a basf2 path that uses ``add_tracking_with_quality_estimation()``
        and adds the ``CombinedTrackingValidationModule`` to write out variables
        for validation.
        """
        # prepare track finding
        path = basf2.create_path()
        if 'USE' in self.process_type:
            if 'BB' in self.process_type:
                process = 'BBBAR'
            elif 'EE' in self.process_type:
                process = 'BHABHA'
            inputFileNames = ['datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root']
        else:
            inputFileNames = self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=self.n_events_testing, random_seed=self.process_type + '_test'))
        path.add_module(
            "RootInput",
            inputFileNames=inputFileNames,
        )
        path.add_module("Gearbox")
        tracking.add_geometry_modules(path)
        tracking.add_hit_preparation_modules(path)  # only needed for simulated hits
        # add track finding module that needs to be validated
        self.add_tracking_with_quality_estimation(path)
        # add modules for validation
        path.add_module(
            CombinedTrackingValidationModule(
                name=None,
                contact=None,
                expert_level=200,
                output_file_name=self.get_output_file_name(
                    self.validation_output_file_name
                ),
            )
        )
        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.reco_output_file_name),
        )
        return path
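
# Sketch (not part of the steering file): how the harvesting tasks derive the
# name of an already existing test sample from process_type. The function name
# is hypothetical. Unlike the code above, which leaves ``process`` undefined
# when neither 'BB' nor 'EE' matches, this sketch raises a ValueError.

```python
def existing_test_sample(n_events_testing: int, process_type: str) -> str:
    """Return the expected path of a pre-generated test sample."""
    if "USE" not in process_type:
        raise ValueError("test sample is generated in this run, not read from disk")
    if "BB" in process_type:
        process = "BBBAR"
    elif "EE" in process_type:
        process = "BHABHA"
    else:
        raise ValueError(f"cannot infer process from '{process_type}'")
    return f"datafiles/generated_mc_N{n_events_testing}_{process}_test.root"
```

# For instance, existing_test_sample(2000, "USESIMBB") points to
# 'datafiles/generated_mc_N2000_BBBAR_test.root'.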


class VXDQEHarvestingValidationTask(HarvestingValidationBaseTask):
    """
    Run VXDTF2 track reconstruction and write out (="harvest") a root file with
    variables useful for validation of the VXD Quality Estimator.
    """

    validation_output_file_name = "vxd_qe_harvesting_validation.root"

    reco_output_file_name = "vxd_qe_reconstruction.root"

    teacher_task = VXDQETeacherTask

    def add_tracking_with_quality_estimation(self, path):
        """
        Add modules for VXDTF2 tracking with VXD quality estimator to basf2 path.
        """
        tracking.add_vxd_track_finding_vxdtf2(
            path,
            components=["SVD"],
            reco_tracks="RecoTracks",
            add_mva_quality_indicator=True,
        )
        # Replace the weightfiles of all quality estimator modules by those
        # produced in this training by b2luigi
        basf2.set_module_parameters(
            path,
            name="VXDQualityEstimatorMVA",
            WeightFileIdentifier=self.get_input_file_names(
                self.teacher_task.get_weightfile_xml_identifier(self.teacher_task, fast_bdt_option=self.fast_bdt_option)
            )[0],
        )
        tracking.add_mc_matcher(path, components=["SVD"])
        tracking.add_track_fit_and_track_creator(path, components=["SVD"])


class CDCQEHarvestingValidationTask(HarvestingValidationBaseTask):
    """
    Run CDC reconstruction and write out (="harvest") a root file with variables
    useful for validation of the CDC Quality Estimator.
    """

    training_target = b2luigi.Parameter()

    validation_output_file_name = "cdc_qe_harvesting_validation.root"

    reco_output_file_name = "cdc_qe_reconstruction.root"

    teacher_task = CDCQETeacherTask

    # overload needed due to specific training target
    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield self.teacher_task(
            n_events_training=self.n_events_training,
            experiment_number=self.experiment_number,
            process_type=self.process_type,
            training_target=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option,
        )
        if 'USE' in self.process_type:  # USESIM and USEREC
            if 'BB' in self.process_type:
                process = 'BBBAR'
            elif 'EE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
            )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.process_type + '_test',
                n_events=self.n_events_testing,
                experiment_number=self.experiment_number,
            )

    def add_tracking_with_quality_estimation(self, path):
        """
        Add modules for CDC standalone tracking with CDC quality estimator to basf2 path.
        """
        tracking.add_cdc_track_finding(
            path,
            output_reco_tracks="RecoTracks",
            add_mva_quality_indicator=True,
        )
        # change weightfile of quality estimator to the one produced by this training script
        cdc_qe_mva_filter_parameters = {
            "identifier": self.get_input_file_names(
                CDCQETeacherTask.get_weightfile_xml_identifier(
                    CDCQETeacherTask,
                    fast_bdt_option=self.fast_bdt_option))[0]}
        basf2.set_module_parameters(
            path,
            name="TFCDC_TrackQualityEstimator",
            filterParameters=cdc_qe_mva_filter_parameters,
        )
        tracking.add_mc_matcher(path, components=["CDC"])
        tracking.add_track_fit_and_track_creator(path, components=["CDC"])
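
# Sketch (not part of the steering file): the harvesting tasks follow a
# template-method layout. The base class fixes the order of the path in
# create_path() and each subclass only fills in the tracking step. Plain lists
# of strings stand in for basf2 paths; all names below are hypothetical.

```python
class HarvestingBase:
    """Base class that owns the overall path layout."""

    def add_tracking(self, path):
        raise NotImplementedError()

    def create_path(self):
        path = ["RootInput", "Gearbox"]
        # Hook implemented by the concrete validation task.
        self.add_tracking(path)
        path.append("Validation")
        path.append("RootOutput")
        return path


class CDCHarvesting(HarvestingBase):
    def add_tracking(self, path):
        path.append("CDCTrackFinding")
```

# CDCHarvesting().create_path() places the subclass step between the common
# input modules and the common validation/output modules.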
1646 
1648  """
1649  Run track reconstruction and write out (="harvest") a root file with variables
1650  useful for validation of the MVA track Quality Estimator.
1651  """
1652 
1653  cdc_training_target = b2luigi.Parameter()
1654 
1655  validation_output_file_name = "reco_qe_harvesting_validation.root"
1656 
1657  reco_output_file_name = "reco_qe_reconstruction.root"
1658 
1659  teacher_task = RecoTrackQETeacherTask
1660 
1661  def requires(self):
1662  """
1663  Generate list of luigi Tasks that this Task depends on.
1664  """
1665  yield CDCQETeacherTask(
1666  n_events_training=self.n_events_training,
1667  experiment_number=self.experiment_number,
1668  process_type=self.process_type,
1669  training_target=self.cdc_training_target,
1670  exclude_variables=MasterTask.exclude_variables_cdc,
1671  fast_bdt_option=self.fast_bdt_option,
1672  )
1673  yield VXDQETeacherTask(
1674  n_events_training=self.n_events_training,
1675  experiment_number=self.experiment_number,
1676  process_type=self.process_type,
1677  exclude_variables=MasterTask.exclude_variables_vxd,
1678  fast_bdt_option=self.fast_bdt_option,
1679  )
1680 
1681  yield self.teacher_task(
1682  n_events_training=self.n_events_training,
1683  experiment_number=self.experiment_number,
1684  process_type=self.process_type,
1685  exclude_variables=self.exclude_variables,
1686  cdc_training_target=self.cdc_training_target,
1687  fast_bdt_option=self.fast_bdt_option,
1688  )
1689  if 'USE' in self.process_type: # USESIM and USEREC
1690  if 'BB' in self.process_type:
1691  process = 'BBBAR'
1692  elif 'EE' in self.process_type:
1693  process = 'BHABHA'
1694  yield CheckExistingFile(
1695  filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
1696  )
1697  else:
1698  yield SplitNMergeSimTask(
1699  bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
1700  random_seed=self.process_type + '_test',
1701  n_events=self.n_events_testing,
1702  experiment_number=self.experiment_number,
1703  )
1703  )
1704 
1705  def add_tracking_with_quality_estimation(self, path):
1706  """
1707  Add modules for reco tracking with all track quality estimators to basf2 path.
1708  """
1709 
1710  # add tracking reconstruction with quality estimator modules added
1711  tracking.add_tracking_reconstruction(
1712  path,
1713  add_cdcTrack_QI=True,
1714  add_vxdTrack_QI=True,
1715  add_recoTrack_QI=True,
1716  skipGeometryAdding=True,
1717  skipHitPreparerAdding=False,
1718  )
1719 
1720  # Replace the weightfiles of all quality estimator modules by those
1721  # produced in the training by b2luigi
1722  cdc_qe_mva_filter_parameters = {
1723  "identifier": self.get_input_file_names(
1724  CDCQETeacherTask.get_weightfile_xml_identifier(
1725  CDCQETeacherTask,
1726  fast_bdt_option=self.fast_bdt_option))[0]}
1727  basf2.set_module_parameters(
1728  path,
1729  name="TFCDC_TrackQualityEstimator",
1730  filterParameters=cdc_qe_mva_filter_parameters,
1731  )
1732  basf2.set_module_parameters(
1733  path,
1734  name="VXDQualityEstimatorMVA",
1735  WeightFileIdentifier=self.get_input_file_names(
1736  VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
1737  )[0],
1738  )
1739  basf2.set_module_parameters(
1740  path,
1741  name="TrackQualityEstimatorMVA",
1742  WeightFileIdentifier=self.get_input_file_names(
1743  RecoTrackQETeacherTask.get_weightfile_xml_identifier(RecoTrackQETeacherTask, fast_bdt_option=self.fast_bdt_option)
1744  )[0],
1745  )
1746 
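# Illustrative sketch, not part of the original steering file: the ``requires``
# methods above map a ``process_type`` string such as 'USESIMBB' or 'USERECEE'
# to the process label that appears in the data file names. The helper names
# ``process_label_from_type`` and ``generated_mc_test_file`` are hypothetical.
# Unlike the inline if/elif chains, this version raises a clear error when
# neither 'BB' nor 'EE' is found instead of leaving ``process`` undefined.
def process_label_from_type(process_type: str) -> str:
    """Return 'BBBAR' or 'BHABHA' for a 'USE...' process type string."""
    if 'BB' in process_type:
        return 'BBBAR'
    if 'EE' in process_type:
        return 'BHABHA'
    raise ValueError(f"Cannot derive a process label from process type {process_type!r}")


def generated_mc_test_file(process_type: str, n_events_testing: int) -> str:
    """Build the test file name following the pattern used in ``requires``."""
    process = process_label_from_type(process_type)
    return f"datafiles/generated_mc_N{n_events_testing}_{process}_test.root"
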
1747 
1748 class TrackQEEvaluationBaseTask(Task):
1749  """
1750  Base class for evaluating a quality estimator with ``basf2_mva_evaluate.py`` on a
1751  separate test data set.
1752 
1753  Evaluation tasks for VXD, CDC and combined QE can inherit from it.
1754  """
1755 
1761  git_hash = b2luigi.Parameter(
1763  default=get_basf2_git_hash()
1764 
1765  )
1766 
1767  n_events_testing = b2luigi.IntParameter()
1768 
1769  n_events_training = b2luigi.IntParameter()
1770 
1771  experiment_number = b2luigi.IntParameter()
1772 
1775  process_type = b2luigi.Parameter(
1776 
1777  default="BBBAR"
1779  )
1780 
1781  training_target = b2luigi.Parameter(
1782 
1783  default="truth"
1785  )
1786 
1788  exclude_variables = b2luigi.ListParameter(
1789 
1790  hashed=True
1791 
1792  )
1794  fast_bdt_option = b2luigi.ListParameter(
1795 
1796  hashed=True, default=[200, 8, 3, 0.1]
1797 
1798  )
1799 
1800  @property
1801  def teacher_task(self) -> TrackQETeacherBaseTask:
1802  """
1803  Property defining specific teacher task to require.
1804  """
1805  raise NotImplementedError(
1806  "Evaluation Tasks must define a teacher task to require "
1807  )
1808 
1809  @property
1810  def data_collection_task(self) -> Basf2PathTask:
1811  """
1812  Property defining the specific ``DataCollectionTask`` to require. Must be
1813  implemented by the inheriting evaluation task class.
1814  """
1815  raise NotImplementedError(
1816  "Evaluation Tasks must define a data collection task to require "
1817  )
1818 
1819  @property
1820  def task_acronym(self):
1821  """
1822  Acronym to distinguish between cdc, vxd and rec(o) MVA
1823  """
1824  raise NotImplementedError(
1825  "Evaluation Tasks must define a task acronym."
1826  )
1827 
1828  def requires(self):
1829  """
1830  Generate list of luigi Tasks that this Task depends on.
1831  """
1832  yield self.teacher_task(
1833  n_events_training=self.n_events_training,
1834  experiment_number=self.experiment_number,
1835  process_type=self.process_type,
1836  training_target=self.training_target,
1837  exclude_variables=self.exclude_variables,
1838  fast_bdt_option=self.fast_bdt_option,
1839  )
1840  if 'USEREC' in self.process_type:
1841  if 'USERECBB' in self.process_type:
1842  process = 'BBBAR'
1843  elif 'USERECEE' in self.process_type:
1844  process = 'BHABHA'
1845  yield CheckExistingFile(
1846  filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
1847  self.task_acronym + '.root'
1848  )
1849  else:
1850  yield self.data_collection_task(
1851  num_processes=MasterTask.num_processes,
1852  n_events=self.n_events_testing,
1853  experiment_number=self.experiment_number,
1854  random_seed=self.process_type + '_test',
1856 
1857  def output(self):
1858  """
1859  Generate list of output files that the task should produce.
1860  The task is considered finished if and only if the outputs all exist.
1861  """
1862  weightfile_details = create_fbdt_option_string(self.fast_bdt_option)
1863  evaluation_pdf_output = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
1864  yield self.add_to_output(evaluation_pdf_output)
1865 
1866  @b2luigi.on_temporary_files
1867  def run(self):
1868  """
1869  Run ``basf2_mva_evaluate.py`` subprocess to evaluate QE MVA.
1870 
1871  The MVA weight file created from training on the training data set is
1872  evaluated on separate test data.
1873  """
1874  weightfile_details = create_fbdt_option_string(self.fast_bdt_option)
1875  evaluation_pdf_output_basename = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
1876 
1877  evaluation_pdf_output_path = self.get_output_file_name(evaluation_pdf_output_basename)
1878 
1879  if 'USEREC' in self.process_type:
1880  if 'USERECBB' in self.process_type:
1881  process = 'BBBAR'
1882  elif 'USERECEE' in self.process_type:
1883  process = 'BHABHA'
1884  datafiles = 'datafiles/qe_records_N' + str(self.n_events_testing) + '_' + \
1885  process + '_test_' + self.task_acronym + '.root'
1886  else:
1887  datafiles = self.get_input_file_names(
1888  self.data_collection_task.get_records_file_name(
1889  self.data_collection_task,
1890  n_events=self.n_events_testing,
1891  random_seed=self.process_type + '_test_' +
1892  self.task_acronym))[0]
1893  cmd = [
1894  "basf2_mva_evaluate.py",
1895  "--identifiers",
1896  self.get_input_file_names(
1897  self.teacher_task.get_weightfile_xml_identifier(
1898  self.teacher_task,
1899  fast_bdt_option=self.fast_bdt_option))[0],
1900  "--datafiles",
1901  datafiles,
1902  "--treename",
1903  self.teacher_task.tree_name,
1904  "--outputfile",
1905  evaluation_pdf_output_path,
1906  ]
1907 
1908  # Prepare log files
1909  log_file_dir = get_log_file_dir(self)
1910  # Create the log directory if it does not exist yet. This seems to be necessary because this task
1911  # does not inherit from a b2luigi task class that would create it automatically.
1912  try:
1913  os.makedirs(log_file_dir, exist_ok=True)
1914  # With exist_ok=True no FileExistsError should be raised for an existing directory, so the following
1915  # handler should be unnecessary; a permission error may be the more relevant case to check for.
1916  except FileExistsError:
1917  print('Directory ' + log_file_dir + ' already exists.')
1918  stderr_log_file_path = log_file_dir + "stderr"
1919  stdout_log_file_path = log_file_dir + "stdout"
1920  with open(stdout_log_file_path, "w") as stdout_file:
1921  stdout_file.write("stdout output of the command:\n{}\n\n".format(" ".join(cmd)))
1922  if os.path.exists(stderr_log_file_path):
1923  # remove stderr file if it already exists b/c in the following it will be opened in appending mode
1924  os.remove(stderr_log_file_path)
1926  # Run evaluation via subprocess and write output into logfiles
1927  with open(stdout_log_file_path, "a") as stdout_file:
1928  with open(stderr_log_file_path, "a") as stderr_file:
1929  try:
1930  subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
1931  except subprocess.CalledProcessError as err:
1932  stderr_file.write(f"Evaluation failed with error:\n{err}")
1933  raise err
1935 
1937  """
1938  Run ``basf2_mva_evaluate.py`` for the VXD quality estimator on separate test data
1939  """
1942  teacher_task = VXDQETeacherTask
1945  data_collection_task = VXDQEDataCollectionTask
1948  task_acronym = 'vxd'
1950 
1952  """
1953  Run ``basf2_mva_evaluate.py`` for the CDC quality estimator on separate test data
1954  """
1955 
1957  teacher_task = CDCQETeacherTask
1958 
1960  data_collection_task = CDCQEDataCollectionTask
1961 
1963  task_acronym = 'cdc'
1965 
1967  """
1968  Run ``basf2_mva_evaluate.py`` for the final, combined quality estimator on
1969  separate test data
1970  """
1971 
1973  teacher_task = RecoTrackQETeacherTask
1974 
1976  data_collection_task = RecoTrackQEDataCollectionTask
1977 
1979  task_acronym = 'rec'
1980 
1981  cdc_training_target = b2luigi.Parameter()
1982 
1983  def requires(self):
1984  """
1985  Generate list of luigi Tasks that this Task depends on.
1986  """
1987  yield self.teacher_task(
1988  n_events_training=self.n_events_training,
1989  experiment_number=self.experiment_number,
1990  process_type=self.process_type,
1991  training_target=self.training_target,
1992  exclude_variables=self.exclude_variables,
1993  cdc_training_target=self.cdc_training_target,
1994  fast_bdt_option=self.fast_bdt_option,
1995  )
1996  if 'USEREC' in self.process_type:
1997  if 'USERECBB' in self.process_type:
1998  process = 'BBBAR'
1999  elif 'USERECEE' in self.process_type:
2000  process = 'BHABHA'
2001  yield CheckExistingFile(
2002  filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
2003  self.task_acronym + '.root'
2004  )
2005  else:
2006  yield self.data_collection_task(
2007  num_processes=MasterTask.num_processes,
2008  n_events=self.n_events_testing,
2009  experiment_number=self.experiment_number,
2010  random_seed=self.process_type + "_test",
2011  cdc_training_target=self.cdc_training_target,
2012  )
2013 
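# Illustrative sketch, not part of the original steering file: the ``run``
# method of the evaluation base task launches ``basf2_mva_evaluate.py`` in a
# subprocess and redirects its output into log files. The pattern can be
# isolated in a small helper; the name ``run_and_log`` is hypothetical and the
# helper only assumes that ``cmd`` is a list of strings.
import subprocess
from pathlib import Path


def run_and_log(cmd, log_dir):
    """Run ``cmd`` and write its stdout and stderr into files in ``log_dir``."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    stdout_path = log_dir / "stdout"
    stderr_path = log_dir / "stderr"
    # Opening in "w" mode truncates old logs, so no manual removal is needed
    with open(stdout_path, "w") as stdout_file, open(stderr_path, "w") as stderr_file:
        stdout_file.write("stdout output of the command:\n{}\n\n".format(" ".join(cmd)))
        stdout_file.flush()  # ensure the header precedes the subprocess output
        try:
            subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
        except subprocess.CalledProcessError as err:
            stderr_file.write(f"Command failed with error:\n{err}")
            raise
    return stdout_path, stderr_path
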
2014 
2015 class PlotsFromHarvestingValidationBaseTask(Basf2Task):
2016  """
2017  Create a PDF file with validation plots for a quality estimator produced
2018  from the ROOT ntuples produced by a harvesting validation task
2019  """
2020 
2021  n_events_testing = b2luigi.IntParameter()
2022 
2023  n_events_training = b2luigi.IntParameter()
2024 
2025  experiment_number = b2luigi.IntParameter()
2026 
2029  process_type = b2luigi.Parameter(
2030 
2031  default="BBBAR"
2032 
2033  )
2036  exclude_variables = b2luigi.ListParameter(
2037 
2038  hashed=True
2039 
2040  )
2041 
2042  fast_bdt_option = b2luigi.ListParameter(
2043 
2044  hashed=True, default=[200, 8, 3, 0.1]
2045 
2046  )
2047 
2048  primaries_only = b2luigi.BoolParameter(
2050  default=True
2051 
2052  ) # normalize finding efficiencies to primary MC-tracks
2053 
2054  @property
2055  def harvesting_validation_task_instance(self) -> HarvestingValidationBaseTask:
2056  """
2057  Specifies related harvesting validation task which produces the ROOT
2058  files with the data that is plotted by this task.
2059  """
2060  raise NotImplementedError("Must define a QI harvesting validation task for which to do the plots")
2061 
2062  @property
2063  def output_pdf_file_basename(self):
2064  """
2065  Name of the output PDF file containing the validation plots
2066  """
2067  validation_harvest_basename = self.harvesting_validation_task_instance.validation_output_file_name
2068  return validation_harvest_basename.replace(".root", "_plots.pdf")
2069 
2070  def requires(self):
2071  """
2072  Generate list of luigi Tasks that this Task depends on.
2073  """
2074  yield self.harvesting_validation_task_instance
2075 
2076  def output(self):
2077  """
2078  Generate list of output files that the task should produce.
2079  The task is considered finished if and only if the outputs all exist.
2080  """
2081  yield self.add_to_output(self.output_pdf_file_basename)
2082 
2083  @b2luigi.on_temporary_files
2084  def process(self):
2085  """
2086  Create the validation plots from the harvested validation data and save
2087  them into a single PDF file.
2088 
2089  Main process that is dispatched by the ``run`` method that is inherited
2090  from ``Basf2Task``.
2091  """
2092  # get the validation "harvest", which is the ROOT file with ntuples for validation
2093  validation_harvest_basename = self.harvesting_validation_task_instance.validation_output_file_name
2094  validation_harvest_path = self.get_input_file_names(validation_harvest_basename)[0]
2095 
2096  # Load "harvested" validation data from root files into dataframes (requires enough memory to hold data)
2097  pr_columns = [ # Restrict memory usage by only reading in columns that are used in the steering file
2098  'is_fake', 'is_clone', 'is_matched', 'quality_indicator',
2099  'experiment_number', 'run_number', 'event_number', 'pr_store_array_number',
2100  'pt_estimate', 'z0_estimate', 'd0_estimate', 'tan_lambda_estimate',
2101  'phi0_estimate', 'pt_truth', 'z0_truth', 'd0_truth', 'tan_lambda_truth',
2102  'phi0_truth',
2103  ]
2104  # In ``pr_df`` each row corresponds to a track from Pattern Recognition
2105  pr_df = uproot.open(validation_harvest_path)['pr_tree/pr_tree'].arrays(pr_columns, library='pd')
2106  mc_columns = [ # restrict mc_df to these columns
2107  'experiment_number',
2108  'run_number',
2109  'event_number',
2110  'pr_store_array_number',
2111  'is_missing',
2112  'is_primary',
2113  ]
2114  # In ``mc_df`` each row corresponds to an MC track
2115  mc_df = uproot.open(validation_harvest_path)['mc_tree/mc_tree'].arrays(mc_columns, library='pd')
2116  if self.primaries_only:
2117  mc_df = mc_df[mc_df.is_primary.eq(True)]
2118 
2119  # Define QI thresholds for the FOM plots and the ROC curves
2120  qi_cuts = np.linspace(0., 1, 20, endpoint=False)
2121  # # Add more points at the very end between the previous maximum and 1
2122  # qi_cuts = np.append(qi_cuts, np.linspace(np.max(qi_cuts), 1, 20, endpoint=False))
2123 
2124  # Create plots and append them to single output pdf
2125 
2126  output_pdf_file_path = self.get_output_file_name(self.output_pdf_file_basename)
2127  with PdfPages(output_pdf_file_path, keep_empty=False) as pdf:
2128 
2129  # Add a title page to validation plot PDF with some metadata
2130  # Remember that most metadata is in the xml file of the weightfile
2131  # and in the b2luigi directory structure
2132  titlepage_fig, titlepage_ax = plt.subplots()
2133  titlepage_ax.axis("off")
2134  title = f"Quality Estimator validation plots from {self.__class__.__name__}"
2135  titlepage_ax.set_title(title)
2136  teacher_task = self.harvesting_validation_task_instance.teacher_task
2137  weightfile_identifier = teacher_task.get_weightfile_xml_identifier(teacher_task, fast_bdt_option=self.fast_bdt_option)
2138  meta_data = {
2139  "Date": datetime.today().strftime("%Y-%m-%d %H:%M"),
2140  "Created by steering file": os.path.realpath(__file__),
2141  "Created from data in": validation_harvest_path,
2142  "Background directory": MasterTask.bkgfiles_by_exp[self.experiment_number],
2143  "weight file": weightfile_identifier,
2144  }
2145  if hasattr(self, 'exclude_variables'):
2146  meta_data["Excluded variables"] = ", ".join(self.exclude_variables)
2147  meta_data_string = (format_dictionary(meta_data) +
2148  "\n\n(For all MVA training parameters look into the produced weight file)")
2149  luigi_params = get_serialized_parameters(self)
2150  luigi_param_string = (f"\n\nb2luigi parameters for {self.__class__.__name__}\n" +
2151  format_dictionary(luigi_params))
2152  title_page_text = meta_data_string + luigi_param_string
2153  titlepage_ax.text(0, 1, title_page_text, ha="left", va="top", wrap=True, fontsize=8)
2154  pdf.savefig(titlepage_fig)
2155  plt.close(titlepage_fig)
2156 
2157  fake_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_fake", qi_cuts)
2158  fake_fig, fake_ax = plt.subplots()
2159  fake_ax.set_title("Fake rate")
2160  plot_with_errobands(fake_rates, ax=fake_ax)
2161  fake_ax.set_ylabel("fake rate")
2162  fake_ax.set_xlabel("quality indicator requirement")
2163  pdf.savefig(fake_fig, bbox_inches="tight")
2164  plt.close(fake_fig)
2165 
2166  # Plot clone rates
2167  clone_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_clone", qi_cuts)
2168  clone_fig, clone_ax = plt.subplots()
2169  clone_ax.set_title("Clone rate")
2170  plot_with_errobands(clone_rates, ax=clone_ax)
2171  clone_ax.set_ylabel("clone rate")
2172  clone_ax.set_xlabel("quality indicator requirement")
2173  pdf.savefig(clone_fig, bbox_inches="tight")
2174  plt.close(clone_fig)
2175 
2176  # Plot finding efficiency
2177 
2178  # The Quality Indicator is only available in pr_tree and thus the
2179  # PR-track dataframe. To get the QI of the related PR track for an MC
2180  # track, merge the PR dataframe into the MC dataframe
2181  pr_track_identifiers = ['experiment_number', 'run_number', 'event_number', 'pr_store_array_number']
2182  mc_df = upd.merge(
2183  left=mc_df, right=pr_df[pr_track_identifiers + ['quality_indicator']],
2184  how='left',
2185  on=pr_track_identifiers
2186  )
2187 
2188  missing_fractions = (
2189  _my_uncertain_mean(mc_df[
2190  mc_df.quality_indicator.isnull() | (mc_df.quality_indicator > qi_cut)]['is_missing'])
2191  for qi_cut in qi_cuts
2192  )
2193 
2194  findeff_fig, findeff_ax = plt.subplots()
2195  findeff_ax.set_title("Finding efficiency")
2196  finding_efficiencies = 1.0 - upd.Series(data=missing_fractions, index=qi_cuts)
2197  plot_with_errobands(finding_efficiencies, ax=findeff_ax)
2198  findeff_ax.set_ylabel("finding efficiency")
2199  findeff_ax.set_xlabel("quality indicator requirement")
2200  pdf.savefig(findeff_fig, bbox_inches="tight")
2201  plt.close(findeff_fig)
2202 
2203  # Plot ROC curves
2204 
2205  # Fake rate vs. finding efficiency ROC curve
2206  fake_roc_fig, fake_roc_ax = plt.subplots()
2207  fake_roc_ax.set_title("Fake rate vs. finding efficiency ROC curve")
2208  fake_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=fake_rates.nominal_value,
2209  xerr=finding_efficiencies.std_dev, yerr=fake_rates.std_dev, elinewidth=0.8)
2210  fake_roc_ax.set_xlabel('finding efficiency')
2211  fake_roc_ax.set_ylabel('fake rate')
2212  pdf.savefig(fake_roc_fig, bbox_inches="tight")
2213  plt.close(fake_roc_fig)
2214 
2215  # Clone rate vs. finding efficiency ROC curve
2216  clone_roc_fig, clone_roc_ax = plt.subplots()
2217  clone_roc_ax.set_title("Clone rate vs. finding efficiency ROC curve")
2218  clone_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=clone_rates.nominal_value,
2219  xerr=finding_efficiencies.std_dev, yerr=clone_rates.std_dev, elinewidth=0.8)
2220  clone_roc_ax.set_xlabel('finding efficiency')
2221  clone_roc_ax.set_ylabel('clone rate')
2222  pdf.savefig(clone_roc_fig, bbox_inches="tight")
2223  plt.close(clone_roc_fig)
2224 
2225  # Plot kinematic distributions
2226 
2227  # use fewer qi cuts as each cut will be its own subplot now and not a point
2228  kinematic_qi_cuts = [0, 0.5, 0.9]
2229 
2230  # Define kinematic parameters which we want to histogram and define
2231  # dictionaries relating them to latex labels, units and binnings
2232  params = ['d0', 'z0', 'pt', 'tan_lambda', 'phi0']
2233  label_by_param = {
2234  "pt": "$p_T$",
2235  "z0": "$z_0$",
2236  "d0": "$d_0$",
2237  "tan_lambda": r"$\tan{\lambda}$",
2238  "phi0": r"$\phi_0$"
2239  }
2240  unit_by_param = {
2241  "pt": "GeV",
2242  "z0": "cm",
2243  "d0": "cm",
2244  "tan_lambda": "rad",
2245  "phi0": "rad"
2246  }
2247  n_kinematic_bins = 75 # number of bins per kinematic variable
2248  bins_by_param = {
2249  "pt": np.linspace(0, np.percentile(pr_df['pt_truth'].dropna(), 95), n_kinematic_bins),
2250  "z0": np.linspace(-0.1, 0.1, n_kinematic_bins),
2251  "d0": np.linspace(0, 0.01, n_kinematic_bins),
2252  "tan_lambda": np.linspace(-2, 3, n_kinematic_bins),
2253  "phi0": np.linspace(0, 2 * np.pi, n_kinematic_bins)
2254  }
2255 
2256  # Iterate over each parameter and for each make stacked histograms for different QI cuts
2257  kinematic_qi_cuts = [0, 0.5, 0.8]
2258  blue, yellow, green = plt.get_cmap("tab10").colors[0:3]
2259  for param in params:
2260  fig, axarr = plt.subplots(ncols=len(kinematic_qi_cuts), sharey=True, sharex=True, figsize=(14, 6))
2261  fig.suptitle(f"{label_by_param[param]} distributions")
2262  for i, qi in enumerate(kinematic_qi_cuts):
2263  ax = axarr[i]
2264  ax.set_title(f"QI > {qi}")
2265  incut = pr_df[(pr_df['quality_indicator'] > qi)]
2266  incut_matched = incut[incut.is_matched.eq(True)]
2267  incut_clones = incut[incut.is_clone.eq(True)]
2268  incut_fake = incut[incut.is_fake.eq(True)]
2269 
2270  # if any series is empty, skip this subplot and don't try to draw a stacked histogram
2271  if any(series.empty for series in (incut, incut_matched, incut_clones, incut_fake)):
2272  ax.text(0.5, 0.5, "Not enough data in bin", ha="center", va="center", transform=ax.transAxes)
2273  continue
2274 
2275  bins = bins_by_param[param]
2276  stacked_histogram_series_tuple = (
2277  incut_matched[f'{param}_estimate'],
2278  incut_clones[f'{param}_estimate'],
2279  incut_fake[f'{param}_estimate'],
2280  )
2281  histvals, _, _ = ax.hist(stacked_histogram_series_tuple,
2282  stacked=True,
2283  bins=bins, range=(bins.min(), bins.max()),
2284  color=(blue, green, yellow),
2285  label=("matched", "clones", "fakes"))
2286  ax.set_xlabel(f'{label_by_param[param]} estimate / ({unit_by_param[param]})')
2287  ax.set_ylabel('# tracks')
2288  axarr[0].legend(loc="upper center", bbox_to_anchor=(0, -0.15))
2289  pdf.savefig(fig, bbox_inches="tight")
2290  plt.close(fig)
2291 
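# Illustrative sketch, not part of the original steering file: the fake and
# clone rate plots above show, for each quality indicator (QI) requirement, the
# fraction of flagged tracks among those passing the cut. The original uses
# ``get_uncertain_means_for_qi_cuts``, which is defined elsewhere in this file.
# The plain-Python helper below, with the hypothetical name
# ``rate_for_qi_cuts``, only mimics the idea and assumes a simple binomial
# uncertainty sqrt(p * (1 - p) / n) and a strict ``QI > cut`` selection.
import math


def rate_for_qi_cuts(quality_indicators, flags, qi_cuts):
    """Return {qi_cut: (rate, uncertainty)} for tracks with QI above each cut."""
    results = {}
    for qi_cut in qi_cuts:
        selected = [flag for qi, flag in zip(quality_indicators, flags) if qi > qi_cut]
        n_selected = len(selected)
        if n_selected == 0:
            # No track passes the cut, so the rate is undefined
            results[qi_cut] = (math.nan, math.nan)
            continue
        rate = sum(selected) / n_selected
        results[qi_cut] = (rate, math.sqrt(rate * (1.0 - rate) / n_selected))
    return results


# Toy usage: two of four tracks are fakes, both with a low quality indicator
toy_rates = rate_for_qi_cuts([0.1, 0.4, 0.6, 0.9], [1, 1, 0, 0], qi_cuts=[0.0, 0.5])
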
2292 
2294  """
2295  Create a PDF file with validation plots for the VXDTF2 track quality
2296  estimator produced from the ROOT ntuples produced by a VXDTF2 track QE
2297  harvesting validation task
2298  """
2299 
2300  @property
2301  def harvesting_validation_task_instance(self):
2302  """
2303  Harvesting validation task to require, which produces the ROOT files
2304  with variables to produce the VXD QE validation plots.
2305  """
2307  n_events_testing=self.n_events_testing,
2308  n_events_training=self.n_events_training,
2309  process_type=self.process_type,
2310  experiment_number=self.experiment_number,
2311  exclude_variables=self.exclude_variables,
2312  num_processes=MasterTask.num_processes,
2313  fast_bdt_option=self.fast_bdt_option,
2314  )
2315 
2316 
2318  """
2319  Create a PDF file with validation plots for the CDC track quality estimator
2320  produced from the ROOT ntuples produced by a CDC track QE harvesting
2321  validation task
2322  """
2324  training_target = b2luigi.Parameter()
2325 
2326  @property
2327  def harvesting_validation_task_instance(self):
2328  """
2329  Harvesting validation task to require, which produces the ROOT files
2330  with variables to produce the CDC QE validation plots.
2331  """
2333  n_events_testing=self.n_events_testing,
2334  n_events_training=self.n_events_training,
2335  process_type=self.process_type,
2336  experiment_number=self.experiment_number,
2337  training_target=self.training_target,
2338  exclude_variables=self.exclude_variables,
2339  num_processes=MasterTask.num_processes,
2340  fast_bdt_option=self.fast_bdt_option,
2341  )
2342 
2343 
2345  """
2346  Create a PDF file with validation plots for the reco MVA track quality
2347  estimator produced from the ROOT ntuples produced by a reco track QE
2348  harvesting validation task
2349  """
2351  cdc_training_target = b2luigi.Parameter()
2352 
2353  @property
2354  def harvesting_validation_task_instance(self):
2355  """
2356  Harvesting validation task to require, which produces the ROOT files
2357  with variables to produce the final MVA track QE validation plots.
2358  """
2360  n_events_testing=self.n_events_testing,
2361  n_events_training=self.n_events_training,
2362  process_type=self.process_type,
2363  experiment_number=self.experiment_number,
2364  cdc_training_target=self.cdc_training_target,
2365  exclude_variables=self.exclude_variables,
2366  num_processes=MasterTask.num_processes,
2367  fast_bdt_option=self.fast_bdt_option,
2368  )
2370 
2371 class QEWeightsLocalDBCreatorTask(Basf2Task):
2372  """
2373  Collect weightfile identifiers from different teacher tasks and merge them
2374  into a local database for testing.
2375  """
2376 
2377  n_events_training = b2luigi.IntParameter()
2378 
2379  experiment_number = b2luigi.IntParameter()
2380 
2383  process_type = b2luigi.Parameter(
2384 
2385  default="BBBAR"
2386 
2387  )
2388 
2389  cdc_training_target = b2luigi.Parameter()
2390 
2391  fast_bdt_option = b2luigi.ListParameter(
2392 
2393  hashed=True, default=[200, 8, 3, 0.1]
2394 
2395  )
2396 
2397  def requires(self):
2398  """
2399  Required teacher tasks
2400  """
2401  yield VXDQETeacherTask(
2402  n_events_training=self.n_events_training,
2403  process_type=self.process_type,
2404  experiment_number=self.experiment_number,
2405  exclude_variables=MasterTask.exclude_variables_vxd,
2406  fast_bdt_option=self.fast_bdt_option,
2407  )
2408  yield CDCQETeacherTask(
2409  n_events_training=self.n_events_training,
2410  process_type=self.process_type,
2411  experiment_number=self.experiment_number,
2412  training_target=self.cdc_training_target,
2413  exclude_variables=MasterTask.exclude_variables_cdc,
2414  fast_bdt_option=self.fast_bdt_option,
2415  )
2416  yield RecoTrackQETeacherTask(
2417  n_events_training=self.n_events_training,
2418  process_type=self.process_type,
2419  experiment_number=self.experiment_number,
2420  cdc_training_target=self.cdc_training_target,
2421  exclude_variables=MasterTask.exclude_variables_rec,
2422  fast_bdt_option=self.fast_bdt_option,
2423  )
2424 
2425  def output(self):
2426  """
2427  Local database
2428  """
2429  yield self.add_to_output("localdb.tar")
2430 
2431  def process(self):
2432  """
2433  Create local database
2434  """
2435  current_path = Path.cwd()
2436  localdb_archive_path = Path(self.get_output_file_name("localdb.tar")).absolute()
2437  output_dir = localdb_archive_path.parent
2438 
2439  # remove existing local databases in output directories
2440  self._clean()
2441  # "Upload" the weightfiles of all 3 teacher tasks into the same localdb
2442  for task in (VXDQETeacherTask, CDCQETeacherTask, RecoTrackQETeacherTask):
2443  # Extract xml identifier input file name before switching working directories, as it returns relative paths
2444  weightfile_xml_identifier_path = os.path.abspath(self.get_input_file_names(
2445  task.get_weightfile_xml_identifier(task, fast_bdt_option=self.fast_bdt_option))[0])
2446  # As localdb is created in working directory, chdir into desired output path
2447  try:
2448  os.chdir(output_dir)
2449  # Same as basf2_mva_upload on the command line, creates localdb directory in current working dir
2450  basf2_mva.upload(
2451  weightfile_xml_identifier_path,
2452  task.weightfile_identifier_basename,
2453  self.experiment_number, 0,
2454  self.experiment_number, -1,
2455  )
2456  finally: # Switch back to working directory of b2luigi, even if upload failed
2457  os.chdir(current_path)
2458 
2459  # Pack localdb into a tar archive, so that we have one single output file instead of a directory
2460  shutil.make_archive(
2461  base_name=localdb_archive_path.as_posix().split('.')[0],
2462  format="tar",
2463  root_dir=output_dir,
2464  base_dir="localdb",
2465  verbose=True,
2466  )
2467 
2468  def _clean(self):
2469  """
2470  Remove local database and tar archives in output directory
2471  """
2472  localdb_archive_path = Path(self.get_output_file_name("localdb.tar"))
2473  localdb_path = localdb_archive_path.parent / "localdb"
2474 
2475  if localdb_path.exists():
2476  print(f"Deleting localdb\n{localdb_path}\nwith contents\n ",
2477  "\n ".join(f.name for f in localdb_path.iterdir()))
2478  shutil.rmtree(localdb_path, ignore_errors=False) # recursively delete localdb
2479 
2480  if localdb_archive_path.is_file():
2481  print(f"Deleting {localdb_archive_path}")
2482  os.remove(localdb_archive_path)
2483 
2484  def on_failure(self, exception):
2485  """
2486  Cleanup: Remove local database to prevent existing outputs when task did not finish successfully
2487  """
2488  self._clean()
2489  # Run existing on_failure from parent class
2490  super().on_failure(exception)
2492 
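# Illustrative sketch, not part of the original steering file: ``process`` in
# ``QEWeightsLocalDBCreatorTask`` temporarily changes the working directory
# and then packs the resulting directory into a tar archive. Both steps can be
# written as small helpers; the names ``working_directory`` and
# ``pack_directory`` are hypothetical. Using ``Path.with_suffix("")`` for the
# archive base name avoids truncating paths that contain a dot in a parent
# directory, which ``split('.')[0]`` would do.
import contextlib
import os
import shutil
from pathlib import Path


@contextlib.contextmanager
def working_directory(path):
    """Change into ``path`` and always switch back, even if the body fails."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(previous)


def pack_directory(archive_path, dir_name):
    """Pack ``dir_name``, located next to ``archive_path``, into a tar archive."""
    archive_path = Path(archive_path).absolute()
    return shutil.make_archive(
        base_name=str(archive_path.with_suffix("")),  # make_archive appends ".tar"
        format="tar",
        root_dir=archive_path.parent,
        base_dir=dir_name,
    )
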
2493 class MasterTask(b2luigi.WrapperTask):
2494  """
2495  Wrapper task that needs to finish for b2luigi to finish running this steering file.
2497  It is done if the outputs of all required subtasks exist. It is thus at the
2498  top of the luigi task graph. Edit the ``requires`` method to steer which
2499  tasks and with which parameters you want to run.
2500  """
2504  process_type = b2luigi.get_setting(
2505 
2506  "process_type", default='BBBAR'
2507 
2508  )
2509 
2510  n_events_training = b2luigi.get_setting(
2511 
2512  "n_events_training", default=20000
2513 
2514  )
2515 
2516  n_events_testing = b2luigi.get_setting(
2517 
2518  "n_events_testing", default=5000
2519 
2520  )
2521 
2522  n_events_per_task = b2luigi.get_setting(
2523 
2524  "n_events_per_task", default=100
2525 
2526  )
2527 
2528  num_processes = b2luigi.get_setting(
2529 
2530  "basf2_processes_per_worker", default=0
2531 
2532  )
2533 
2534  datafiles = b2luigi.get_setting("datafiles")
2535 
2536  bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
2537 
2538  bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
2539 
    # list of variables to exclude for the cdc mva
    exclude_variables_cdc = [
        "has_matching_segment",
        "size",
        "n_tracks",  # not written out per default anyway
        "avg_hit_dist",
        "cont_layer_mean",
        "cont_layer_variance",
        "cont_layer_max",
        "cont_layer_min",
        "cont_layer_first",
        "cont_layer_last",
        "cont_layer_max_vs_last",
        "cont_layer_first_vs_min",
        "cont_layer_count",
        "cont_layer_occupancy",
        "super_layer_mean",
        "super_layer_variance",
        "super_layer_max_vs_last",
        "super_layer_first_vs_min",
        "super_layer_occupancy",
        "drift_length_mean",
        "drift_length_variance",
        "drift_length_max",
        "drift_length_min",
        "drift_length_sum",
        "norm_drift_length_mean",
        "norm_drift_length_variance",
        "norm_drift_length_max",
        "norm_drift_length_min",
        "norm_drift_length_sum",
        "adc_mean",
        "adc_variance",
        "adc_max",
        "adc_min",
        "adc_sum",
        "tot_mean",
        "tot_variance",
        "tot_max",
        "tot_min",
        "tot_sum",
        "empty_s_mean",
        "empty_s_variance",
        "empty_s_max"]

    # list of variables to exclude for the vxd mva
    exclude_variables_vxd = [
        'energyLoss_max', 'energyLoss_min', 'energyLoss_mean', 'energyLoss_std', 'energyLoss_sum',
        'size_max', 'size_min', 'size_mean', 'size_std', 'size_sum',
        'seedCharge_max', 'seedCharge_min', 'seedCharge_mean', 'seedCharge_std', 'seedCharge_sum',
        'tripletFit_P_Mag', 'tripletFit_P_Eta', 'tripletFit_P_Phi', 'tripletFit_P_X', 'tripletFit_P_Y', 'tripletFit_P_Z']

    # list of variables to exclude for the recotrack mva
    exclude_variables_rec = [
        'background',
        'ghost',
        'fake',
        'clone',
        '__experiment__',
        '__run__',
        '__event__',
        'N_RecoTracks',
        'N_PXDRecoTracks',
        'N_SVDRecoTracks',
        'N_CDCRecoTracks',
        'N_diff_PXD_SVD_RecoTracks',
        'N_diff_SVD_CDC_RecoTracks',
        'Fit_Successful',
        'Fit_NFailedPoints',
        'Fit_Chi2',
        'N_TrackPoints_without_KalmanFitterInfo',
        'N_Hits_without_TrackPoint',
        'SVD_CDC_CDCwall_Chi2',
        'SVD_CDC_CDCwall_Pos_diff_Z',
        'SVD_CDC_CDCwall_Pos_diff_Pt',
        'SVD_CDC_CDCwall_Pos_diff_Theta',
        'SVD_CDC_CDCwall_Pos_diff_Phi',
        'SVD_CDC_CDCwall_Pos_diff_Mag',
        'SVD_CDC_CDCwall_Pos_diff_Eta',
        'SVD_CDC_CDCwall_Mom_diff_Z',
        'SVD_CDC_CDCwall_Mom_diff_Pt',
        'SVD_CDC_CDCwall_Mom_diff_Theta',
        'SVD_CDC_CDCwall_Mom_diff_Phi',
        'SVD_CDC_CDCwall_Mom_diff_Mag',
        'SVD_CDC_CDCwall_Mom_diff_Eta',
        'SVD_CDC_POCA_Pos_diff_Z',
        'SVD_CDC_POCA_Pos_diff_Pt',
        'SVD_CDC_POCA_Pos_diff_Theta',
        'SVD_CDC_POCA_Pos_diff_Phi',
        'SVD_CDC_POCA_Pos_diff_Mag',
        'SVD_CDC_POCA_Pos_diff_Eta',
        'SVD_CDC_POCA_Mom_diff_Z',
        'SVD_CDC_POCA_Mom_diff_Pt',
        'SVD_CDC_POCA_Mom_diff_Theta',
        'SVD_CDC_POCA_Mom_diff_Phi',
        'SVD_CDC_POCA_Mom_diff_Mag',
        'SVD_CDC_POCA_Mom_diff_Eta',
        'POCA_Pos_Pt',
        'POCA_Pos_Mag',
        'POCA_Pos_Phi',
        'POCA_Pos_Z',
        'POCA_Pos_Theta',
        'PXD_QI',
        'SVD_FitSuccessful',
        'CDC_FitSuccessful',
        'pdg_ID',
        'pdg_ID_Mother',
        'is_Vzero_Daughter',
        'is_Primary',
        'z0',
        'd0',
        'seed_Charge',
        'Fit_Charge',
        'weight_max',
        'weight_min',
        'weight_mean',
        'weight_std',
        'weight_median',
        'weight_n_zeros',
        'weight_firstCDCHit',
        'weight_lastSVDHit',
        'smoothedChi2_max',
        'smoothedChi2_min',
        'smoothedChi2_mean',
        'smoothedChi2_std',
        'smoothedChi2_median',
        'smoothedChi2_n_zeros',
        'smoothedChi2_firstCDCHit',
        'smoothedChi2_lastSVDHit']

    def requires(self):
        """
        Generate list of tasks that need to be done for luigi to finish running
        this steering file.
        """
        cdc_training_targets = [
            "truth",  # treats clones as background, only best matched CDC tracks are true
            # "truth_track_is_matched"  # treats clones as signal
        ]

        fast_bdt_options = []
        # possible to run over a chosen hyperparameter space if wanted
        # in principle this can be extended to specific options for the three different MVAs
        # for i in range(250, 400, 50):
        #     for j in range(6, 10, 2):
        #         for k in range(2, 6):
        #             for l in range(0, 5):
        #                 fast_bdt_options.append([100 + i, j, 3+k, 0.025+l*0.025])
        # fast_bdt_options.append([200, 8, 3, 0.1])  # default FastBDT option
        fast_bdt_options.append([350, 6, 5, 0.1])

        experiment_numbers = b2luigi.get_setting("experiment_numbers")

        # iterate over all possible combinations of parameters from the above defined parameter lists
        for experiment_number, cdc_training_target, fast_bdt_option in itertools.product(
            experiment_numbers, cdc_training_targets, fast_bdt_options
        ):
            # if test_selected_task is activated, only run the following tasks:
            if b2luigi.get_setting("test_selected_task", default=False):
                # for process_type in ['BHABHA', 'MUMU', 'TAUPAIR', 'YY', 'EEEE', 'EEMUMU', 'UUBAR', \
                #                      'DDBAR', 'CCBAR', 'SSBAR', 'BBBAR', 'V0BBBAR', 'V0STUDY']:
                for cut in ['000', '070', '090', '095']:
                        num_processes=self.num_processes,
                        n_events=self.n_events_testing,
                        experiment_number=experiment_number,
                        random_seed=self.process_type + '_test',
                        recotrack_option='useCDC_noVXD_deleteCDCQI'+cut,
                        cdc_training_target=cdc_training_target,
                        fast_bdt_option=fast_bdt_option,
                    )
                        num_processes=self.num_processes,
                        n_events=self.n_events_testing,
                        experiment_number=experiment_number,
                        random_seed=self.process_type + '_test',
                    )
                    yield CDCQETeacherTask(
                        n_events_training=self.n_events_training,
                        process_type=self.process_type,
                        experiment_number=experiment_number,
                        exclude_variables=self.exclude_variables_cdc,
                        training_target=cdc_training_target,
                        fast_bdt_option=fast_bdt_option,
                    )
            else:
                # if data shall be processed, it can neither be trained nor evaluated
                if 'DATA' in self.process_type:
                        num_processes=self.num_processes,
                        n_events=self.n_events_testing,
                        experiment_number=experiment_number,
                        random_seed=self.process_type + '_test',
                    )
                        num_processes=self.num_processes,
                        n_events=self.n_events_testing,
                        experiment_number=experiment_number,
                        random_seed=self.process_type + '_test',
                    )
                        num_processes=self.num_processes,
                        n_events=self.n_events_testing,
                        experiment_number=experiment_number,
                        random_seed=self.process_type + '_test',
                        recotrack_option='deleteCDCQI080',
                        cdc_training_target=cdc_training_target,
                        fast_bdt_option=fast_bdt_option,
                    )
                else:
                        n_events_training=self.n_events_training,
                        process_type=self.process_type,
                        experiment_number=experiment_number,
                        cdc_training_target=cdc_training_target,
                        fast_bdt_option=fast_bdt_option,
                    )

                    if b2luigi.get_setting("run_validation_tasks", default=True):
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            cdc_training_target=cdc_training_target,
                            exclude_variables=self.exclude_variables_rec,
                            fast_bdt_option=fast_bdt_option,
                        )
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            exclude_variables=self.exclude_variables_cdc,
                            training_target=cdc_training_target,
                            fast_bdt_option=fast_bdt_option,
                        )
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            exclude_variables=self.exclude_variables_vxd,
                            experiment_number=experiment_number,
                            fast_bdt_option=fast_bdt_option,
                        )

                    if b2luigi.get_setting("run_mva_evaluate", default=True):
                        # Evaluate trained weightfiles via basf2_mva_evaluate.py on separate testdatasets
                        # requires a latex installation to work
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            cdc_training_target=cdc_training_target,
                            exclude_variables=self.exclude_variables_rec,
                            fast_bdt_option=fast_bdt_option,
                        )
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            exclude_variables=self.exclude_variables_cdc,
                            fast_bdt_option=fast_bdt_option,
                            training_target=cdc_training_target,
                        )
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            exclude_variables=self.exclude_variables_vxd,
                            fast_bdt_option=fast_bdt_option,
                        )


if __name__ == "__main__":
    # if n_events_test_on_data is specified to be different from -1 in the settings,
    # then stop after N events (mainly useful to test data reconstruction):
    nEventsTestOnData = b2luigi.get_setting("n_events_test_on_data", default=-1)
    if nEventsTestOnData > 0 and 'DATA' in b2luigi.get_setting("process_type", default="BBBAR"):
        from ROOT import Belle2
        environment = Belle2.Environment.Instance()
        environment.setNumberEventsOverride(nEventsTestOnData)
    # if global tags are specified in the settings, use them:
    # e.g. for data use ["data_reprocessing_prompt", "online"]. Make sure to be up to date here
    globaltags = b2luigi.get_setting("globaltags", default=[])
    if len(globaltags) > 0:
        basf2.conditions.reset()
        for gt in globaltags:
            basf2.conditions.prepend_globaltag(gt)
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(MasterTask(), workers=workers)