combined_quality_estimator_teacher.py
#!/usr/bin/env python3


"""
combined_module_quality_estimator_teacher
-----------------------------------------

Information on the MVA Track Quality Indicator / Estimator can be found
on `XWiki
<https://xwiki.desy.de/xwiki/rest/p/0d3f4>`_.

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This python script is used for the combined training and validation of three
classifiers: the actual final MVA track quality estimator and the two quality
estimators for the intermediate standalone track finders that it depends on.

    - Final MVA track quality estimator:
      The final quality estimator for fully merged and fitted tracks (RecoTracks).
      Its classifier uses features from the track fitting, merger, hit pattern, ...
      But it also uses the outputs from respective intermediate quality
      estimators for the VXD and the CDC track finding as inputs. It provides
      the final quality indicator (QI) exported to the track objects.

    - VXDTF2 track quality estimator:
      MVA quality estimator for the VXD standalone track finding.

    - CDC track quality estimator:
      MVA quality estimator for the CDC standalone track finding.

Each classifier requires a different training data set, and each needs to be
validated on a separate testing data set. Further, the final quality estimator
can only be trained when the trained weights for the intermediate quality
estimators are available. If the final estimator shall be trained without one or
both previous estimators, the requirements have to be commented out in the
__init__.py file of tracking.
For all estimators, a list of variables to be ignored is specified in the MasterTask.
The current choice is mainly based on poor data-MC agreement in these quantities or
on outdated implementations. It was decided to leave them in the hardcoded "ugly" way
in here to remind future generations that they exist in principle and that they should
and could be added to the estimator once their modelling improves or an alternative
implementation is programmed.
To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
parameters, output files and which other tasks with which parameters it depends
on. For example, a teacher task, which runs ``basf2_mva_teacher.py`` to train
the classifier, depends on a data collection task which runs a reconstruction
and writes out track-wise variables into a ROOT file for training. An
evaluation/validation task for testing the classifier requires both the teacher
task, as it needs the weightfile to be present, and also a data collection task,
because it needs a dataset for testing the classifier.

The final task that defines which tasks need to be done for the steering file to
finish is the ``MasterTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the
``MasterTask`` or replace them by lower-level tasks during debugging.
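
As a rough illustration of the dependency idea described above, the following
plain-Python sketch mimics how a scheduler can derive a valid execution order
from the requirements of each task. It deliberately does not use b2luigi: the
``SketchTask`` class, the ``resolve_order`` helper and the task names are
hypothetical stand-ins and not the task classes defined in this file.

```python
# Hypothetical stand-in for a task class; this is NOT the b2luigi API.
class SketchTask:
    def __init__(self, name, requirements=()):
        self.name = name
        self.requirements = tuple(requirements)


def resolve_order(task, done=None):
    # Depth-first traversal: every requirement is scheduled before the task itself.
    if done is None:
        done = []
    for requirement in task.requirements:
        resolve_order(requirement, done)
    if task.name not in done:
        done.append(task.name)
    return done


# The teacher needs training data; the evaluation needs the teacher and test data.
train_data = SketchTask("collect_training_data")
test_data = SketchTask("collect_testing_data")
teacher = SketchTask("teacher", [train_data])
evaluation = SketchTask("evaluation", [teacher, test_data])

execution_order = resolve_order(evaluation)
print(execution_order)
```

Requesting only the ``teacher`` in this sketch would schedule just the training
data collection and the training, which corresponds to replacing a requirement
of the ``MasterTask`` by a lower-level task.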

Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling and `uncertain_panda
<https://github.com/nils-braun/uncertain_panda>`_ for uncertainty calculations.
uncertain_panda is not in the externals and b2luigi is not included up to
externals version v01-07-01. Both can be installed via pip::

    python3 -m pip install [--user] b2luigi uncertain_panda

Use the ``--user`` option if you do not have the rights to install python
packages into your externals (e.g. because you are using cvmfs) and install them
in ``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.
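
A minimal sketch of writing and reading such a file with the standard library.
Only the ``result_path`` key is taken from this documentation; its value is a
placeholder, and the file is written to a temporary directory so that the
example does not touch a real configuration. Which further keys the steering
file expects is not covered by this sketch.

```python
import json
import tempfile
from pathlib import Path

# Illustrative only: the directory name is a placeholder, not a recommended value.
example_settings = {"result_path": "./results"}

# Write the settings into a throwaway directory instead of the working directory.
settings_file = Path(tempfile.mkdtemp()) / "settings.json"
settings_file.write_text(json.dumps(example_settings, indent=4))

# Read the file back to check that it contains valid JSON.
loaded_settings = json.loads(settings_file.read_text())
print(loaded_settings["result_path"])
```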

Usage
~~~~~

You can test the b2luigi task graph without running it via::

    python3 combined_quality_estimator_teacher.py --dry-run
    python3 combined_quality_estimator_teacher.py --show-output

This will show the outputs and show potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_quality_estimator_teacher.py

I usually use the interactive luigi web interface via the central scheduler
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` in case b2luigi was installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_quality_estimator_teacher.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web
browser, you have two options:

    1. Run both the steering file and ``luigid`` remotely and use
       ssh-port-forwarding to your local host. To do so, run on your local
       machine::

           ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

    2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
       local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path``. The
directory tree encodes the used b2luigi parameters. This ensures reproducibility
and makes parameter searches easy. Sometimes, it is hard to find the relevant
output files. You can view the whole directory structure by running ``tree
<result_path>``. Use the Unix ``find`` command to find the files that interest
you, e.g.::

    find <result_path> -name "*.pdf" # find all validation plot files
    find <result_path> -name "*.root" # find all ROOT files
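
The same search can be sketched in Python with ``pathlib``. The directory tree
below is a throwaway stand-in for ``<result_path>``; the sub-directory and file
names are made up for the illustration and do not reflect the real output
layout.

```python
import tempfile
from pathlib import Path

# Build a small temporary directory tree that stands in for <result_path>.
result_path = Path(tempfile.mkdtemp())
(result_path / "n_events=100").mkdir()
(result_path / "n_events=100" / "validation.pdf").touch()
(result_path / "n_events=100" / "records.root").touch()

# Recursive search, analogous to the find commands with -name "*.pdf" / "*.root".
pdf_files = sorted(path.name for path in result_path.rglob("*.pdf"))
root_files = sorted(path.name for path in result_path.rglob("*.root"))
print(pdf_files, root_files)
```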
"""

import itertools
import os
from pathlib import Path
import shutil
import subprocess
import textwrap
from datetime import datetime
from typing import Iterable

import matplotlib.pyplot as plt
import numpy as np
import uproot
from matplotlib.backends.backend_pdf import PdfPages

import basf2
import basf2_mva
from packaging import version
import background
import simulation
import tracking
import tracking.root_utils as root_utils
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule

# @cond internal_test

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "    python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import get_serialized_parameters, get_log_file_dir, create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
    from b2luigi.core.task import Task, ExternalTask
    from b2luigi.basf2_helper.utils import get_basf2_git_hash
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise
try:
    from uncertain_panda import pandas as upd
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="uncertain_panda"))
    raise

# If b2luigi version 0.3.2 or older, it relies on $BELLE2_RELEASE being "head",
# which is not the case in the new externals. A fix has been merged into b2luigi
# via https://github.com/nils-braun/b2luigi/pull/17 and thus should be available
# in future releases.
if (
    version.parse(b2luigi.__version__) <= version.parse("0.3.2") and
    get_basf2_git_hash() is None and
    os.getenv("BELLE2_LOCAL_DIR") is not None
):
    print(f"b2luigi version could not obtain git hash because of a bug not yet fixed in version {b2luigi.__version__}\n"
          "Please install the latest version of b2luigi from github via\n\n"
          "    python3 -m pip install --upgrade [--user] git+https://github.com/nils-braun/b2luigi.git\n")
    raise ImportError

# Utility functions


def create_fbdt_option_string(fast_bdt_option):
    """
    Returns a readable string created from the fast_bdt_option array
    """
    return "_nTrees" + str(fast_bdt_option[0]) + "_nCuts" + str(fast_bdt_option[1]) + "_nLevels" + \
        str(fast_bdt_option[2]) + "_shrin" + str(int(round(100*fast_bdt_option[3], 0)))


def createV0momenta(x, mu, beta):
    """
    Copied from Bianca's K_S0 particle gun code: Returns a realistic V0 momentum distribution
    when running over x. Mu and Beta are properties of the function that define center and tails.
    Used for the particle gun simulation code for K_S0 and Lambda_0
    """
    return (1/beta)*np.exp(-(x - mu)/beta) * np.exp(-np.exp(-(x - mu) / beta))


def my_basf2_mva_teacher(
    records_files,
    tree_name,
    weightfile_identifier,
    target_variable="truth",
    exclude_variables=None,
    fast_bdt_option=[200, 8, 3, 0.1]  # nTrees, nCuts, nLevels, shrinkage
):
    """
    My custom wrapper for basf2 mva teacher. Adapted from code in ``trackfindingcdc_teacher``.

    :param records_files: List of files with collected ("recorded") variables to use as training data for the MVA.
    :param tree_name: Name of the TTree in the ROOT file from the ``data_collection_task``
        that contains the training data for the MVA teacher.
    :param weightfile_identifier: Name of the weightfile that is created.
        Should either end in ".xml" for local weightfiles or in ".root" when
        the weightfile later needs to be uploaded as a payload to the conditions
        database.
    :param target_variable: Feature/variable to use as truth label in the quality estimator MVA classifier.
    :param exclude_variables: List of collected variables to not use in the training of the QE MVA classifier,
        in addition to variables containing the "truth" substring, which are excluded by default.
    :param fast_bdt_option: specified fast BDT options, default: [200, 8, 3, 0.1] [nTrees, nCuts, nLevels, shrinkage]
    """
    if exclude_variables is None:
        exclude_variables = []

    weightfile_extension = Path(weightfile_identifier).suffix
    if weightfile_extension not in {".xml", ".root"}:
        raise ValueError(f"Weightfile Identifier should end in .xml or .root, but ends in {weightfile_extension}")

    # extract names of all variables from one record file
    with root_utils.root_open(records_files[0]) as records_tfile:
        input_tree = records_tfile.Get(tree_name)
        feature_names = [leaf.GetName() for leaf in input_tree.GetListOfLeaves()]

    # get list of variables to use for training without MC truth
    truth_free_variable_names = [
        name
        for name in feature_names
        if (
            ("truth" not in name) and
            (name != target_variable) and
            (name not in exclude_variables)
        )
    ]
    if "weight" in truth_free_variable_names:
        truth_free_variable_names.remove("weight")
        weight_variable = "weight"
    elif "__weight__" in truth_free_variable_names:
        truth_free_variable_names.remove("__weight__")
        weight_variable = "__weight__"
    else:
        weight_variable = ""

    # Set options for MVA training
    general_options = basf2_mva.GeneralOptions()
    general_options.m_datafiles = basf2_mva.vector(*records_files)
    general_options.m_treename = tree_name
    general_options.m_weight_variable = weight_variable
    general_options.m_identifier = weightfile_identifier
    general_options.m_variables = basf2_mva.vector(*truth_free_variable_names)
    general_options.m_target_variable = target_variable
    fastbdt_options = basf2_mva.FastBDTOptions()

    fastbdt_options.m_nTrees = fast_bdt_option[0]
    fastbdt_options.m_nCuts = fast_bdt_option[1]
    fastbdt_options.m_nLevels = fast_bdt_option[2]
    fastbdt_options.m_shrinkage = fast_bdt_option[3]
    # Train a MVA method and store the weightfile (MVAFastBDT.root) locally.
    basf2_mva.teacher(general_options, fastbdt_options)


def _my_uncertain_mean(series: upd.Series):
    """
    Temporary workaround for a bug in ``uncertain_panda`` where a ``ValueError`` is
    thrown for ``Series.unc.mean`` if the series is empty. Can be replaced by
    ``.unc.mean`` when the issue is fixed.
    https://github.com/nils-braun/uncertain_panda/issues/2
    """
    try:
        return series.unc.mean()
    except ValueError:
        if series.empty:
            return np.nan
        else:
            raise


def get_uncertain_means_for_qi_cuts(df: upd.DataFrame, column: str, qi_cuts: Iterable[float]):
    """
    Return a pandas series with the mean of the dataframe column and its
    uncertainty for each quality indicator cut.

    :param df: Pandas dataframe with at least ``quality_indicator``
        and another numeric ``column``.
    :param column: Column of which we want to aggregate the means
        and uncertainties for different QI cuts
    :param qi_cuts: Iterable of quality indicator minimal thresholds.
    :returns: Series of means and uncertainties with ``qi_cuts`` as index
    """

    uncertain_means = (_my_uncertain_mean(df.query(f"quality_indicator > {qi_cut}")[column])
                       for qi_cut in qi_cuts)
    uncertain_means_series = upd.Series(data=uncertain_means, index=qi_cuts)
    return uncertain_means_series


def plot_with_errobands(uncertain_series,
                        error_band_alpha=0.3,
                        plot_kwargs={},
                        fill_between_kwargs={},
                        ax=None):
    """
    Plot an uncertain series with error bands for y-errors
    """
    if ax is None:
        ax = plt.gca()
    uncertain_series = uncertain_series.dropna()
    ax.plot(uncertain_series.index.values, uncertain_series.nominal_value, **plot_kwargs)
    ax.fill_between(x=uncertain_series.index,
                    y1=uncertain_series.nominal_value - uncertain_series.std_dev,
                    y2=uncertain_series.nominal_value + uncertain_series.std_dev,
                    alpha=error_band_alpha,
                    **fill_between_kwargs)


def format_dictionary(adict, width=80, bullet="•"):
    """
    Helper function to format dictionary to string as a wrapped key-value bullet
    list. Useful to print metadata from dictionaries.

    :param adict: Dictionary to format
    :param width: Characters after which to wrap a key-value line
    :param bullet: Character to begin a key-value line with, e.g. ``-`` for a
        yaml-like string
    """
    # It might be possible to replace this function with yaml.dump, but the current
    # version in the externals does not allow disabling the sorting of the
    # dictionary yet, and I am also not sure if its output is wrappable
    return "\n".join(textwrap.fill(f"{bullet} {key}: {value}", width=width)
                     for (key, value) in adict.items())

# Begin definitions of b2luigi task classes


class GenerateSimTask(Basf2PathTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Number of events to generate.
    n_events = b2luigi.IntParameter()
    # Experiment number passed to the EventInfoSetter.
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed string. It also encodes the production mode (e.g. "BBBAR").
    random_seed = b2luigi.Parameter()
    # Directory with the background overlay files.
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )
    # Queue setting of this task.
    queue = 'l'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        basf2.set_random_seed(self.random_seed)
        path = basf2.create_path()
        if self.experiment_number in [0, 1002, 1003]:
            runNo = 0
        else:
            raise ValueError(
                f"Simulating events with experiment number {self.experiment_number} is not implemented yet.")
        path.add_module(
            "EventInfoSetter", evtNumList=[self.n_events], runList=[runNo], expList=[self.experiment_number]
        )
        # "V0BBBAR" contains the substring "BBBAR", so it has to be checked first,
        # otherwise the V0 branch would never be reached
        if "V0BBBAR" in self.random_seed:
            path.add_module("EvtGenInput")
            path.add_module("InclusiveParticleChecker", particles=[310, 3122], includeConjugates=True)
        elif "BBBAR" in self.random_seed:
            path.add_module("EvtGenInput")
        else:
            import generators as ge
            # WARNING: There are a few differences in the production of MC13a and b like the following lines
            # as well as ActivatePXD.. and the beamparams for bhabha... I use these from MC13b, not a... :/
            # import beamparameters as bp
            # beamparameters = bp.add_beamparameters(path, "Y4S")
            # beamparameters.param("covVertex", [(14.8e-4)**2, (1.5e-4)**2, (360e-4)**2])
            if "V0STUDY" in self.random_seed:
                if "V0STUDYKS" in self.random_seed:
                    # Bianca looked at the Ks dists and extracted these values:
                    mu = 0.5
                    beta = 0.2
                    pdgs = [310]  # Ks (has no antiparticle, Klong is different)
                # elif (not if), so that the K_S0 values are not overwritten by the else branch
                elif "V0STUDYL0" in self.random_seed:
                    # I just made the lambda values up, such that they peak at 0.35 and are slightly shifted to lower values
                    mu = 0.35
                    beta = 0.15  # if this is chosen higher, one needs to make sure not to get values >0 for 0
                    pdgs = [3122, -3122]  # Lambda0
                else:
                    # also these values are made up
                    mu = 0.43
                    beta = 0.18
                    pdgs = [310, 3122, -3122]  # Ks and Lambda0
                # create realistic momentum distribution
                myx = [i*0.01 for i in range(321)]
                myy = []
                for x in myx:
                    y = createV0momenta(x, mu, beta)
                    myy.append(y)
                polParams = myx + myy
                # define particles that are produced
                pdg_list = pdgs

                particlegun = basf2.register_module('ParticleGun')
                particlegun.param('pdgCodes', pdg_list)
                particlegun.param('nTracks', 8)  # number of particles (not tracks!) that is created in each event
                particlegun.param('momentumGeneration', 'polyline')
                particlegun.param('momentumParams', polParams)
                particlegun.param('thetaGeneration', 'uniformCos')
                particlegun.param('thetaParams', [17, 150])  # [0, 180]) #[17, 150]
                particlegun.param('phiGeneration', 'uniform')
                particlegun.param('phiParams', [0, 360])
                particlegun.param('vertexGeneration', 'fixed')
                particlegun.param('xVertexParams', [0])
                particlegun.param('yVertexParams', [0])
                particlegun.param('zVertexParams', [0])
                path.add_module(particlegun)
            if "BHABHA" in self.random_seed:
                ge.add_babayaganlo_generator(path=path, finalstate='ee', minenergy=0.15, minangle=10.0)
            elif "MUMU" in self.random_seed:
                ge.add_kkmc_generator(path=path, finalstate='mu+mu-')
            elif "YY" in self.random_seed:
                babayaganlo = basf2.register_module('BabayagaNLOInput')
                babayaganlo.param('FinalState', 'gg')
                babayaganlo.param('MaxAcollinearity', 180.0)
                babayaganlo.param('ScatteringAngleRange', [0., 180.])
                babayaganlo.param('FMax', 75000)
                babayaganlo.param('MinEnergy', 0.01)
                babayaganlo.param('Order', 'exp')
                babayaganlo.param('DebugEnergySpread', 0.01)
                babayaganlo.param('Epsilon', 0.00005)
                path.add_module(babayaganlo)
                generatorpreselection = basf2.register_module('GeneratorPreselection')
                generatorpreselection.param('nChargedMin', 0)
                generatorpreselection.param('nChargedMax', 999)
                generatorpreselection.param('MinChargedPt', 0.15)
                generatorpreselection.param('MinChargedTheta', 17.)
                generatorpreselection.param('MaxChargedTheta', 150.)
                generatorpreselection.param('nPhotonMin', 1)
                generatorpreselection.param('MinPhotonEnergy', 1.5)
                generatorpreselection.param('MinPhotonTheta', 15.0)
                generatorpreselection.param('MaxPhotonTheta', 165.0)
                generatorpreselection.param('applyInCMS', True)
                path.add_module(generatorpreselection)
                empty = basf2.create_path()
                generatorpreselection.if_value('!=11', empty)
            elif "EEEE" in self.random_seed:
                ge.add_aafh_generator(path=path, finalstate='e+e-e+e-', preselection=False)
            elif "EEMUMU" in self.random_seed:
                ge.add_aafh_generator(path=path, finalstate='e+e-mu+mu-', preselection=False)
            elif "TAUPAIR" in self.random_seed:
                ge.add_kkmc_generator(path, finalstate='tau+tau-')
            elif "DDBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='ddbar')
            elif "UUBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='uubar')
            elif "SSBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='ssbar')
            elif "CCBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='ccbar')
        # activate simulation of dead/masked pixel and reproduce detector gain, which will be
        # applied at reconstruction level when the data GT is present in the DB chain
        # path.add_module("ActivatePXDPixelMasker")
        # path.add_module("ActivatePXDGainCalibrator")
        bkg_files = background.get_background_files(self.bkgfiles_dir)
        # \cond suppress doxygen warning
        if self.experiment_number == 1002:
            # remove KLM because of bug in background files with release 4
            components = ['PXD', 'SVD', 'CDC', 'ECL', 'TOP', 'ARICH', 'TRG']
        else:
            components = None
        # \endcond
        simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, components=components)  # , usePXDDataReduction=False)

        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.output_file_name()),
        )
        return path


# I don't use the default MergeTask or similar because they only work if every input file is called the same.
# Additionally, I want to add more features like deleting the original input to save storage space.
class SplitNMergeSimTask(Basf2Task):
    """
    Split the generation of simulated Monte Carlo with background overlay into
    several ``GenerateSimTask`` jobs and merge their outputs into a single file.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Total number of events to generate.
    n_events = b2luigi.IntParameter()
    # Experiment number passed on to the generation tasks.
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed string. It also encodes the production mode (e.g. "BBBAR").
    random_seed = b2luigi.Parameter()
    # Directory with the background overlay files.
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )
    # Queue setting of this task.
    queue = 'sx'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        n_events_per_task = MasterTask.n_events_per_task
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        # one task per full batch of n_events_per_task events
        for i in range(quotient):
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MasterTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,
                experiment_number=self.experiment_number,
            )
        # one additional task for the remaining events, if any
        if remainder > 0:
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MasterTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),
                n_events=remainder,
                experiment_number=self.experiment_number,
            )

    @b2luigi.on_temporary_files
    def process(self):
        """
        When all GenerateSimTasks finished, merge the output.
        """
        create_output_dirs(self)

        file_list = []
        for _, file_name in self.get_input_file_names().items():
            # each entry is a list of file names; extend also works for more than one file per key
            file_list.extend(file_name)
        print("Merge the following files:")
        print(file_list)
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)
        print("Finished merging. Now remove the input files to save space.")
        cmd2 = ["rm", "-f"]
        for tempfile in file_list:
            args = cmd2 + [tempfile]
            subprocess.check_call(args)


class CheckExistingFile(ExternalTask):
    """
    Task to check if the given file really exists.
    """

    # Name of the file whose existence is checked.
    filename = b2luigi.Parameter()

    def output(self):
        """
        Specify the output to be the file that was just checked.
        """
        from luigi import LocalTarget
        return LocalTarget(self.filename)


class VXDQEDataCollectionTask(Basf2PathTask):
    """
    Collect variables/features from VXDTF2 tracking and write them to a ROOT
    file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the VXD track quality estimator
    """

    # Number of events to process.
    n_events = b2luigi.IntParameter()
    # Experiment number passed on to the simulation tasks.
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed string. It also encodes the production mode (e.g. "BBBAR").
    random_seed = b2luigi.Parameter()
    # Queue setting of this task.
    queue = 'l'

    def get_records_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if 'vxd' not in random_seed:
            random_seed += '_vxd'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_vxd.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'

    def get_input_files(self, n_events=None, random_seed=None):
        """
        Get input file names depending on the use case: If they already exist, search in
        the corresponding folders, for data check the specified list and if they are created
        in the same run, check for the task that produced them.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                    n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if "USESIM" in self.random_seed or "DATA" in self.random_seed:
            for filename in self.get_input_files():
                yield CheckExistingFile(
                    filename=filename,
                )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.random_seed,
                n_events=self.n_events,
                experiment_number=self.experiment_number,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_records_file_name())

    def create_path(self):
        """
        Create basf2 path with VXDTF2 tracking and VXD QE data collection.
        """
        path = basf2.create_path()
        inputFileNames = self.get_input_files()
        path.add_module(
            "RootInput",
            inputFileNames=inputFileNames,
        )
        path.add_module("Gearbox")
        tracking.add_geometry_modules(path)
        if 'DATA' in self.random_seed:
            from rawdata import add_unpackers
            add_unpackers(path, components=['SVD', 'PXD'])
        tracking.add_hit_preparation_modules(path)
        tracking.add_vxd_track_finding_vxdtf2(
            path, components=["SVD"], add_mva_quality_indicator=False
        )
        if 'DATA' in self.random_seed:
            path.add_module(
                "VXDQETrainingDataCollector",
                TrainingDataOutputName=self.get_output_file_name(self.get_records_file_name()),
                SpacePointTrackCandsStoreArrayName="SPTrackCands",
                EstimationMethod="tripletFit",
                UseTimingInfo=False,
                ClusterInformation="Average",
                MCStrictQualityEstimator=False,
                mva_target=False,
                MCInfo=False,
            )
        else:
            path.add_module(
                "TrackFinderMCTruthRecoTracks",
                RecoTracksStoreArrayName="MCRecoTracks",
                WhichParticles=[],
                UsePXDHits=False,
                UseSVDHits=True,
                UseCDCHits=False,
            )
            path.add_module(
                "VXDQETrainingDataCollector",
                TrainingDataOutputName=self.get_output_file_name(self.get_records_file_name()),
                SpacePointTrackCandsStoreArrayName="SPTrackCands",
                EstimationMethod="tripletFit",
                UseTimingInfo=False,
                ClusterInformation="Average",
                MCStrictQualityEstimator=True,
                mva_target=False,
            )
        return path


class CDCQEDataCollectionTask(Basf2PathTask):
    """
    Collect variables/features from CDC tracking and write them to a ROOT file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the CDC track quality estimator
    """

    # Number of events to process.
    n_events = b2luigi.IntParameter()
    # Experiment number passed on to the simulation tasks.
    experiment_number = b2luigi.IntParameter()
    # Random basf2 seed string. It also encodes the production mode (e.g. "BBBAR").
    random_seed = b2luigi.Parameter()
    # Queue setting of this task.
    queue = 'l'

    def get_records_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if 'cdc' not in random_seed:
            random_seed += '_cdc'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_cdc.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'

    def get_input_files(self, n_events=None, random_seed=None):
        """
        Get input file names depending on the use case: If they already exist, search in
        the corresponding folders, for data check the specified list and if they are created
        in the same run, check for the task that produced them.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                    n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if "USESIM" in self.random_seed or "DATA" in self.random_seed:
            for filename in self.get_input_files():
                yield CheckExistingFile(
                    filename=filename,
                )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.random_seed,
                n_events=self.n_events,
                experiment_number=self.experiment_number,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_records_file_name())

    def create_path(self):
        """
        Create basf2 path with CDC standalone tracking and CDC QE with recording filter for MVA feature collection.
        """
        path = basf2.create_path()
        inputFileNames = self.get_input_files()
        path.add_module(
            "RootInput",
            inputFileNames=inputFileNames,
        )
        path.add_module("Gearbox")
        tracking.add_geometry_modules(path)
        if 'DATA' in self.random_seed:
            filter_choice = "recording_data"
            from rawdata import add_unpackers
            add_unpackers(path, components=['CDC'])
        else:
            filter_choice = "recording"
            # tracking.add_hit_preparation_modules(path)  # only needed for SVD and
            # PXD hit preparation. Does not change the CDC output.
        tracking.add_cdc_track_finding(path, with_cdc_cellular_automaton=False, add_mva_quality_indicator=True)

        basf2.set_module_parameters(
            path,
            name="TFCDC_TrackQualityEstimator",
            filter=filter_choice,
            filterParameters={
                "rootFileName": self.get_output_file_name(self.get_records_file_name())
            },
        )
        return path


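# The file-name convention used by ``get_records_file_name`` above can be
# tried out in isolation. The following is only a minimal sketch:
# ``records_file_name`` and its ``suffix`` argument are hypothetical and not
# part of this steering file. It mirrors the string handling of the method and
# assumes that ``random_seed`` has the form "<process>_<seed>".

```python
def records_file_name(n_events, random_seed, suffix="cdc"):
    """Build a records file name from the event count and the seed string."""
    # Tag the seed with the sub-detector acronym unless it is already there.
    if suffix not in random_seed:
        random_seed += "_" + suffix
    # Real data always map to one fixed file name.
    if "DATA" in random_seed:
        return "qe_records_DATA_" + suffix + ".root"
    # Pre-existing simulation files are stored under the plain process name.
    if "USESIMBB" in random_seed:
        random_seed = "BBBAR_" + random_seed.split("_", 1)[1]
    elif "USESIMEE" in random_seed:
        random_seed = "BHABHA_" + random_seed.split("_", 1)[1]
    return "qe_records_N" + str(n_events) + "_" + random_seed + ".root"


if __name__ == "__main__":
    print(records_file_name(1000, "USESIMBB_test"))
    print(records_file_name(1000, "BBBAR_train_cdc"))
    print(records_file_name(1000, "DATA"))
```

# Because the suffix is appended before the "USESIM" prefix is replaced, a
# seed such as "USESIMBB_test" ends up as "BBBAR_test_cdc" in the file name.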
class RecoTrackQEDataCollectionTask(Basf2PathTask):
    """
    Collect variables/features from the reco track reconstruction including the
    fit and write them to a ROOT file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the MVA track quality estimator. The collected
    variables include the classifier outputs from the VXD and CDC quality
    estimators, namely the CDC and VXD quality indicators, combined with fit,
    merger, timing, energy loss information etc. This task requires the
    subdetector quality estimators to be trained.
    """

    n_events = b2luigi.IntParameter()

    experiment_number = b2luigi.IntParameter()

    random_seed = b2luigi.Parameter()

    cdc_training_target = b2luigi.Parameter()

    recotrack_option = b2luigi.Parameter(default='deleteCDCQI080')

    fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])

    queue = 'l'

    def get_records_file_name(self, n_events=None, random_seed=None, recotrack_option=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if recotrack_option is None:
            recotrack_option = self.recotrack_option
        if 'rec' not in random_seed:
            random_seed += '_rec'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_rec.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '_' + recotrack_option + '.root'

    def get_input_files(self, n_events=None, random_seed=None):
        """
        Get input file names depending on the use case: for already existing
        simulation files, look in the corresponding folder; for data, use the
        specified list of files; for files created in the same run, ask the
        task that produced them.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                    n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if "USESIM" in self.random_seed or "DATA" in self.random_seed:
            for filename in self.get_input_files():
                yield CheckExistingFile(
                    filename=filename,
                )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.random_seed,
                n_events=self.n_events,
                experiment_number=self.experiment_number,
            )
        if "DATA" not in self.random_seed:
            if 'useCDC' not in self.recotrack_option and 'noCDC' not in self.recotrack_option:
                yield CDCQETeacherTask(
                    n_events_training=MasterTask.n_events_training,
                    experiment_number=self.experiment_number,
                    training_target=self.cdc_training_target,
                    process_type=self.random_seed.split("_", 1)[0],
                    exclude_variables=MasterTask.exclude_variables_cdc,
                    fast_bdt_option=self.fast_bdt_option,
                )
            if 'useVXD' not in self.recotrack_option and 'noVXD' not in self.recotrack_option:
                yield VXDQETeacherTask(
                    n_events_training=MasterTask.n_events_training,
                    experiment_number=self.experiment_number,
                    process_type=self.random_seed.split("_", 1)[0],
                    exclude_variables=MasterTask.exclude_variables_vxd,
                    fast_bdt_option=self.fast_bdt_option,
                )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_records_file_name())

    def create_path(self):
        """
        Create basf2 reconstruction path that should mirror the default path
        from ``add_tracking_reconstruction()``, but with modules for the VXD QE
        and CDC QE application and for collection of variables for the reco
        track quality estimator.
        """
        path = basf2.create_path()
        inputFileNames = self.get_input_files()
        path.add_module(
            "RootInput",
            inputFileNames=inputFileNames,
        )
        path.add_module("Gearbox")

        # First add tracking reconstruction with default quality estimation modules
        mvaCDC = True
        mvaVXD = True
        if 'noCDC' in self.recotrack_option:
            mvaCDC = False
        if 'noVXD' in self.recotrack_option:
            mvaVXD = False
        if 'DATA' in self.random_seed:
            from rawdata import add_unpackers
            add_unpackers(path)
        tracking.add_tracking_reconstruction(path, add_cdcTrack_QI=mvaCDC, add_vxdTrack_QI=mvaVXD, add_recoTrack_QI=True)

        # If data shall be processed, check whether newly trained MVA weightfiles are available;
        # otherwise use the default ones (CDB payloads).
        # If useCDC/useVXD is specified, use the identifier located in datafiles/. Otherwise, replace the
        # default weightfile identifiers (CDB payloads) with the new weightfiles created by this b2luigi script.
        if ('DATA' in self.random_seed or 'useCDC' in self.recotrack_option) and 'noCDC' not in self.recotrack_option:
            cdc_identifier = 'datafiles/' + \
                CDCQETeacherTask.get_weightfile_xml_identifier(CDCQETeacherTask, fast_bdt_option=self.fast_bdt_option)
            if os.path.exists(cdc_identifier):
                replace_cdc_qi = True
            elif 'useCDC' in self.recotrack_option:
                raise ValueError(f"CDC QI Identifier not found: {cdc_identifier}")
            else:
                replace_cdc_qi = False
        elif 'noCDC' in self.recotrack_option:
            replace_cdc_qi = False
        else:
            cdc_identifier = self.get_input_file_names(
                CDCQETeacherTask.get_weightfile_xml_identifier(
                    CDCQETeacherTask, fast_bdt_option=self.fast_bdt_option))[0]
            replace_cdc_qi = True
        if ('DATA' in self.random_seed or 'useVXD' in self.recotrack_option) and 'noVXD' not in self.recotrack_option:
            vxd_identifier = 'datafiles/' + \
                VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
            if os.path.exists(vxd_identifier):
                replace_vxd_qi = True
            elif 'useVXD' in self.recotrack_option:
                raise ValueError(f"VXD QI Identifier not found: {vxd_identifier}")
            else:
                replace_vxd_qi = False
        elif 'noVXD' in self.recotrack_option:
            replace_vxd_qi = False
        else:
            vxd_identifier = self.get_input_file_names(
                VXDQETeacherTask.get_weightfile_xml_identifier(
                    VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option))[0]
            replace_vxd_qi = True

        cdc_qe_mva_filter_parameters = None
        # If tracks below a certain CDC QI value shall be deleted online, this needs to be specified in the
        # filter parameters. This is also possible when the default (CDB) payloads are used.
        if 'deleteCDCQI' in self.recotrack_option:
            cut_index = self.recotrack_option.find('deleteCDCQI') + len('deleteCDCQI')
            cut = int(self.recotrack_option[cut_index:cut_index+3])/100.
            if replace_cdc_qi:
                cdc_qe_mva_filter_parameters = {
                    "identifier": cdc_identifier, "cut": cut}
            else:
                cdc_qe_mva_filter_parameters = {
                    "cut": cut}
        elif replace_cdc_qi:
            cdc_qe_mva_filter_parameters = {
                "identifier": cdc_identifier}
        if cdc_qe_mva_filter_parameters is not None:
            # If no cut is specified, the default value is zero and nothing is deleted.
            basf2.set_module_parameters(
                path,
                name="TFCDC_TrackQualityEstimator",
                filterParameters=cdc_qe_mva_filter_parameters,
                deleteTracks=True,
                resetTakenFlag=True
            )
        if replace_vxd_qi:
            basf2.set_module_parameters(
                path,
                name="VXDQualityEstimatorMVA",
                WeightFileIdentifier=vxd_identifier)

        # Replace final quality estimator module by training data collector module
        track_qe_module_name = "TrackQualityEstimatorMVA"
        module_found = False
        new_path = basf2.create_path()
        for module in path.modules():
            if module.name() != track_qe_module_name:
                # ``module.name`` is not called here, so this comparison is
                # always False and every other module, including the default
                # TrackCreator, is copied to the new path.
                if not module.name == 'TrackCreator':
                    new_path.add_module(module)
            else:
                # The TrackCreator needs to run before the collector such that
                # MDSTTracks are related to RecoTracks and d0 and z0 can be read out
                new_path.add_module(
                    'TrackCreator',
                    pdgCodes=[
                        211,
                        321,
                        2212],
                    recoTrackColName='RecoTracks',
                    trackColName='MDSTTracks')  # , useClosestHitToIP=True, useBFieldAtHit=True)
                new_path.add_module(
                    "TrackQETrainingDataCollector",
                    TrainingDataOutputName=self.get_output_file_name(self.get_records_file_name()),
                    collectEventFeatures=True,
                    SVDPlusCDCStandaloneRecoTracksStoreArrayName="SVDPlusCDCStandaloneRecoTracks",
                )
                module_found = True
        if not module_found:
            raise KeyError(f"No module {track_qe_module_name} found in path")
        path = new_path
        return path


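# The ``recotrack_option`` string encodes the CDC QI cut as the three digits
# that follow "deleteCDCQI", in units of 1/100 (so "deleteCDCQI080" means a cut
# at 0.80). The sketch below isolates that parsing step. The helper name
# ``parse_cdc_qi_cut`` is hypothetical and not part of this steering file, and
# returning ``None`` for a missing key is an assumption made for illustration.

```python
def parse_cdc_qi_cut(recotrack_option, key="deleteCDCQI"):
    """Return the cut encoded after ``key`` as a float, or None if absent."""
    if key not in recotrack_option:
        return None
    start = recotrack_option.find(key) + len(key)
    # The three characters after the key encode the cut in units of 1/100.
    return int(recotrack_option[start:start + 3]) / 100.0


if __name__ == "__main__":
    print(parse_cdc_qi_cut("deleteCDCQI080"))
    print(parse_cdc_qi_cut("useCDC_noVXD"))
```

# Options can be combined in one string, for example "useVXD_deleteCDCQI005",
# since every option is looked up by substring.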
class TrackQETeacherBaseTask(Basf2Task):
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    Teacher tasks are needed for all quality estimators covered by this
    steering file, and only the required data collection task and some
    training parameters differ between them. The common functionality
    therefore lives in this base class/interface and the specific teacher
    tasks inherit from it.
    """

    n_events_training = b2luigi.IntParameter()

    experiment_number = b2luigi.IntParameter()

    process_type = b2luigi.Parameter(default="BBBAR")

    training_target = b2luigi.Parameter(default="truth")

    exclude_variables = b2luigi.ListParameter(hashed=True, default=[])

    fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])

    @property
    def weightfile_identifier_basename(self):
        """
        Property defining the basename for the .xml and .root weightfiles that are created.
        Has to be implemented by the inheriting teacher task class.
        """
        raise NotImplementedError(
            "Teacher Task must define a static weightfile_identifier_basename"
        )

    def get_weightfile_xml_identifier(self, fast_bdt_option=None, recotrack_option=None):
        """
        Name of the xml weightfile that is created by the teacher task.
        It is subsequently used as a local weightfile in the following validation tasks.
        """
        if fast_bdt_option is None:
            fast_bdt_option = self.fast_bdt_option
        # Fall back to the task's own option only if none was passed explicitly;
        # tasks without a ``recotrack_option`` attribute use an empty string.
        if recotrack_option is None:
            recotrack_option = getattr(self, 'recotrack_option', '')
        weightfile_details = create_fbdt_option_string(fast_bdt_option)
        weightfile_name = self.weightfile_identifier_basename + weightfile_details
        if recotrack_option != '':
            weightfile_name = weightfile_name + '_' + recotrack_option
        return weightfile_name + ".weights.xml"

    @property
    def tree_name(self):
        """
        Property defining the name of the tree in the ROOT file from the
        ``data_collection_task`` that contains the recorded training data. Must
        be implemented by the inheriting specific teacher task class.
        """
        raise NotImplementedError("Teacher Task must define a static tree_name")

    @property
    def random_seed(self):
        """
        Property defining random seed to be used by the ``GenerateSimTask``.
        Should differ from the random seed in the test data samples. Must
        be implemented by the inheriting specific teacher task class.
        """
        raise NotImplementedError("Teacher Task must define a static random seed")

    @property
    def data_collection_task(self) -> Basf2PathTask:
        """
        Property defining the specific ``DataCollectionTask`` to require. Must
        be implemented by the inheriting specific teacher task class.
        """
        raise NotImplementedError(
            "Teacher Task must define a data collection task to require "
        )

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if 'USEREC' in self.process_type:
            if 'USERECBB' in self.process_type:
                process = 'BBBAR'
            elif 'USERECEE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/qe_records_N' + str(self.n_events_training) + '_' + process + '_' + self.random_seed + '.root',
            )
        else:
            yield self.data_collection_task(
                num_processes=MasterTask.num_processes,
                n_events=self.n_events_training,
                experiment_number=self.experiment_number,
                random_seed=self.process_type + '_' + self.random_seed,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_weightfile_xml_identifier())

    def process(self):
        """
        Use basf2_mva teacher to create MVA weightfile from collected training
        data variables.

        This is the main process that is dispatched by the ``run`` method that
        is inherited from ``Basf2Task``.
        """
        if 'USEREC' in self.process_type:
            if 'USERECBB' in self.process_type:
                process = 'BBBAR'
            elif 'USERECEE' in self.process_type:
                process = 'BHABHA'
            records_files = ['datafiles/qe_records_N' + str(self.n_events_training) +
                             '_' + process + '_' + self.random_seed + '.root']
        else:
            if hasattr(self, 'recotrack_option'):
                records_files = self.get_input_file_names(
                    self.data_collection_task.get_records_file_name(
                        self.data_collection_task,
                        n_events=self.n_events_training,
                        random_seed=self.process_type + '_' + self.random_seed,
                        recotrack_option=self.recotrack_option))
            else:
                records_files = self.get_input_file_names(
                    self.data_collection_task.get_records_file_name(
                        self.data_collection_task,
                        n_events=self.n_events_training,
                        random_seed=self.process_type + '_' + self.random_seed))

        my_basf2_mva_teacher(
            records_files=records_files,
            tree_name=self.tree_name,
            weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier()),
            target_variable=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option,
        )


class VXDQETeacherTask(TrackQETeacherBaseTask):
    """
    Task to run basf2 mva teacher on collected data for VXDTF2 track quality estimator
    """

    weightfile_identifier_basename = "vxdtf2_mva_qe"

    tree_name = "tree"

    random_seed = "train_vxd"

    data_collection_task = VXDQEDataCollectionTask


class CDCQETeacherTask(TrackQETeacherBaseTask):
    """
    Task to run basf2 mva teacher on collected data for CDC track quality estimator
    """

    weightfile_identifier_basename = "cdc_mva_qe"

    tree_name = "records"

    random_seed = "train_cdc"

    data_collection_task = CDCQEDataCollectionTask


class RecoTrackQETeacherTask(TrackQETeacherBaseTask):
    """
    Task to run basf2 mva teacher on collected data for the final, combined
    track quality estimator
    """

    recotrack_option = b2luigi.Parameter(default='deleteCDCQI080')

    weightfile_identifier_basename = "recotrack_mva_qe"

    tree_name = "tree"

    random_seed = "train_rec"

    data_collection_task = RecoTrackQEDataCollectionTask

    cdc_training_target = b2luigi.Parameter()

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if 'USEREC' in self.process_type:
            if 'USERECBB' in self.process_type:
                process = 'BBBAR'
            elif 'USERECEE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/qe_records_N' + str(self.n_events_training) + '_' + process + '_' + self.random_seed + '.root',
            )
        else:
            yield self.data_collection_task(
                cdc_training_target=self.cdc_training_target,
                num_processes=MasterTask.num_processes,
                n_events=self.n_events_training,
                experiment_number=self.experiment_number,
                random_seed=self.process_type + '_' + self.random_seed,
                recotrack_option=self.recotrack_option,
                fast_bdt_option=self.fast_bdt_option,
            )


class HarvestingValidationBaseTask(Basf2PathTask):
    """
    Run track reconstruction with MVA quality estimator and write out
    (="harvest") a root file with variables useful for the validation.
    """

    n_events_testing = b2luigi.IntParameter()

    n_events_training = b2luigi.IntParameter()

    experiment_number = b2luigi.IntParameter()

    process_type = b2luigi.Parameter(default="BBBAR")

    exclude_variables = b2luigi.ListParameter(hashed=True)

    fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])

    validation_output_file_name = "harvesting_validation.root"

    reco_output_file_name = "reconstruction.root"

    components = None

    @property
    def teacher_task(self) -> TrackQETeacherBaseTask:
        """
        Teacher task to require to provide a quality estimator weightfile for ``add_tracking_with_quality_estimation``
        """
        raise NotImplementedError()

    def add_tracking_with_quality_estimation(self, path: basf2.Path) -> None:
        """
        Add modules for track reconstruction to basf2 path that are to be
        validated. Besides track finding it should include MC matching, fitted
        track creation and a quality estimator module.
        """
        raise NotImplementedError()

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield self.teacher_task(
            n_events_training=self.n_events_training,
            experiment_number=self.experiment_number,
            process_type=self.process_type,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option,
        )
        if 'USE' in self.process_type:  # USESIM and USEREC
            if 'BB' in self.process_type:
                process = 'BBBAR'
            elif 'EE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
            )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.process_type + '_test',
                n_events=self.n_events_testing,
                experiment_number=self.experiment_number,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.validation_output_file_name)
        yield self.add_to_output(self.reco_output_file_name)

    def create_path(self):
        """
        Create a basf2 path that uses ``add_tracking_with_quality_estimation()``
        and adds the ``CombinedTrackingValidationModule`` to write out variables
        for validation.
        """
        # prepare track finding
        path = basf2.create_path()
        if 'USE' in self.process_type:
            if 'BB' in self.process_type:
                process = 'BBBAR'
            elif 'EE' in self.process_type:
                process = 'BHABHA'
            inputFileNames = ['datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root']
        else:
            inputFileNames = self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=self.n_events_testing, random_seed=self.process_type + '_test'))
        path.add_module(
            "RootInput",
            inputFileNames=inputFileNames,
        )
        path.add_module("Gearbox")
        tracking.add_geometry_modules(path)
        tracking.add_hit_preparation_modules(path)  # only needed for simulated hits
        # add track finding module that needs to be validated
        self.add_tracking_with_quality_estimation(path)
        # add modules for validation
        path.add_module(
            CombinedTrackingValidationModule(
                name=None,
                contact=None,
                expert_level=200,
                output_file_name=self.get_output_file_name(
                    self.validation_output_file_name
                ),
            )
        )
        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.reco_output_file_name),
        )
        return path


class VXDQEHarvestingValidationTask(HarvestingValidationBaseTask):
    """
    Run VXDTF2 track reconstruction and write out (="harvest") a root file with
    variables useful for validation of the VXD Quality Estimator.
    """

    validation_output_file_name = "vxd_qe_harvesting_validation.root"

    reco_output_file_name = "vxd_qe_reconstruction.root"

    teacher_task = VXDQETeacherTask

    def add_tracking_with_quality_estimation(self, path):
        """
        Add modules for VXDTF2 tracking with VXD quality estimator to basf2 path.
        """
        tracking.add_vxd_track_finding_vxdtf2(
            path,
            components=["SVD"],
            reco_tracks="RecoTracks",
            add_mva_quality_indicator=True,
        )
        # Replace the weightfile of the quality estimator module by the one
        # produced in this training by b2luigi
        basf2.set_module_parameters(
            path,
            name="VXDQualityEstimatorMVA",
            WeightFileIdentifier=self.get_input_file_names(
                self.teacher_task.get_weightfile_xml_identifier(self.teacher_task, fast_bdt_option=self.fast_bdt_option)
            )[0],
        )
        tracking.add_mc_matcher(path, components=["SVD"])
        tracking.add_track_fit_and_track_creator(path, components=["SVD"])


class CDCQEHarvestingValidationTask(HarvestingValidationBaseTask):
    """
    Run CDC reconstruction and write out (="harvest") a root file with variables
    useful for validation of the CDC Quality Estimator.
    """

    training_target = b2luigi.Parameter()

    validation_output_file_name = "cdc_qe_harvesting_validation.root"

    reco_output_file_name = "cdc_qe_reconstruction.root"

    teacher_task = CDCQETeacherTask

    # Override needed because the teacher task takes a specific training target
    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield self.teacher_task(
            n_events_training=self.n_events_training,
            experiment_number=self.experiment_number,
            process_type=self.process_type,
            training_target=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option,
        )
        if 'USE' in self.process_type:  # USESIM and USEREC
            if 'BB' in self.process_type:
                process = 'BBBAR'
            elif 'EE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
            )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.process_type + '_test',
                n_events=self.n_events_testing,
                experiment_number=self.experiment_number,
            )

    def add_tracking_with_quality_estimation(self, path):
        """
        Add modules for CDC standalone tracking with CDC quality estimator to basf2 path.
        """
        tracking.add_cdc_track_finding(
            path,
            output_reco_tracks="RecoTracks",
            add_mva_quality_indicator=True,
        )
        # change weightfile of quality estimator to the one produced by this training script
        cdc_qe_mva_filter_parameters = {
            "identifier": self.get_input_file_names(
                CDCQETeacherTask.get_weightfile_xml_identifier(
                    CDCQETeacherTask,
                    fast_bdt_option=self.fast_bdt_option))[0]}
        basf2.set_module_parameters(
            path,
            name="TFCDC_TrackQualityEstimator",
            filterParameters=cdc_qe_mva_filter_parameters,
        )
        tracking.add_mc_matcher(path, components=["CDC"])
        tracking.add_track_fit_and_track_creator(path, components=["CDC"])


class RecoTrackQEHarvestingValidationTask(HarvestingValidationBaseTask):
    """
    Run track reconstruction and write out (="harvest") a root file with variables
    useful for validation of the MVA track Quality Estimator.
    """

    cdc_training_target = b2luigi.Parameter()

    validation_output_file_name = "reco_qe_harvesting_validation.root"

    reco_output_file_name = "reco_qe_reconstruction.root"

    teacher_task = RecoTrackQETeacherTask

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield CDCQETeacherTask(
            n_events_training=self.n_events_training,
            experiment_number=self.experiment_number,
            process_type=self.process_type,
            training_target=self.cdc_training_target,
            exclude_variables=MasterTask.exclude_variables_cdc,
            fast_bdt_option=self.fast_bdt_option,
        )
        yield VXDQETeacherTask(
            n_events_training=self.n_events_training,
            experiment_number=self.experiment_number,
            process_type=self.process_type,
            exclude_variables=MasterTask.exclude_variables_vxd,
            fast_bdt_option=self.fast_bdt_option,
        )

        yield self.teacher_task(
            n_events_training=self.n_events_training,
            experiment_number=self.experiment_number,
            process_type=self.process_type,
            exclude_variables=self.exclude_variables,
            cdc_training_target=self.cdc_training_target,
            fast_bdt_option=self.fast_bdt_option,
        )
        if 'USE' in self.process_type:  # USESIM and USEREC
            if 'BB' in self.process_type:
                process = 'BBBAR'
            elif 'EE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
            )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.process_type + '_test',
                n_events=self.n_events_testing,
                experiment_number=self.experiment_number,
            )

    def add_tracking_with_quality_estimation(self, path):
        """
        Add modules for reco tracking with all track quality estimators to basf2 path.
        """

        # add tracking reconstruction with quality estimator modules added
        tracking.add_tracking_reconstruction(
            path,
            add_cdcTrack_QI=True,
            add_vxdTrack_QI=True,
            add_recoTrack_QI=True,
            skipGeometryAdding=True,
            skipHitPreparerAdding=False,
        )

        # Replace the weightfiles of all quality estimator modules by those
        # produced in the training by b2luigi
        cdc_qe_mva_filter_parameters = {
            "identifier": self.get_input_file_names(
                CDCQETeacherTask.get_weightfile_xml_identifier(
                    CDCQETeacherTask,
                    fast_bdt_option=self.fast_bdt_option))[0]}
        basf2.set_module_parameters(
            path,
            name="TFCDC_TrackQualityEstimator",
            filterParameters=cdc_qe_mva_filter_parameters,
        )
        basf2.set_module_parameters(
            path,
            name="VXDQualityEstimatorMVA",
            WeightFileIdentifier=self.get_input_file_names(
                VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
            )[0],
        )
        basf2.set_module_parameters(
            path,
            name="TrackQualityEstimatorMVA",
            WeightFileIdentifier=self.get_input_file_names(
                RecoTrackQETeacherTask.get_weightfile_xml_identifier(RecoTrackQETeacherTask, fast_bdt_option=self.fast_bdt_option)
            )[0],
        )


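# The validation tasks above repeat the same mapping from ``process_type`` to
# the plain process name and the pre-existing test sample. The sketch below
# shows that mapping in one place. Both helper names are hypothetical and not
# part of this steering file, and raising a ``ValueError`` for an unknown
# "USE..." string is an assumption made here for illustration.

```python
def resolve_process(process_type):
    """Map a ``process_type`` string to the plain process name."""
    if "USE" in process_type:  # covers USESIM... and USEREC...
        if "BB" in process_type:
            return "BBBAR"
        if "EE" in process_type:
            return "BHABHA"
        raise ValueError(f"Cannot infer process from {process_type!r}")
    # Without a USE... prefix the string already is the process name.
    return process_type


def existing_test_sample(n_events_testing, process_type):
    """File name of a pre-existing test sample for a USE... process type."""
    process = resolve_process(process_type)
    return "datafiles/generated_mc_N" + str(n_events_testing) + "_" + process + "_test.root"


if __name__ == "__main__":
    print(resolve_process("USESIMBB"))
    print(existing_test_sample(1000, "USERECEE"))
```

# Note that the tasks above only look for an existing file when "USE" is part
# of ``process_type``; otherwise they require ``SplitNMergeSimTask`` instead.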
class TrackQEEvaluationBaseTask(Task):
    """
    Base class for evaluating a quality estimator with ``basf2_mva_evaluate.py``
    on a separate test data set.

    Evaluation tasks for VXD, CDC and combined QE can inherit from it.
    """

    git_hash = b2luigi.Parameter(default=get_basf2_git_hash())

    n_events_testing = b2luigi.IntParameter()

    n_events_training = b2luigi.IntParameter()

    experiment_number = b2luigi.IntParameter()

    process_type = b2luigi.Parameter(default="BBBAR")

    training_target = b2luigi.Parameter(default="truth")

    exclude_variables = b2luigi.ListParameter(hashed=True)

    fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])

    @property
    def teacher_task(self) -> TrackQETeacherBaseTask:
        """
        Property defining specific teacher task to require.
        """
        raise NotImplementedError(
            "Evaluation Tasks must define a teacher task to require "
        )

    @property
    def data_collection_task(self) -> Basf2PathTask:
        """
        Property defining the specific ``DataCollectionTask`` to require. Must
        be implemented by the inheriting specific evaluation task class.
        """
        raise NotImplementedError(
            "Evaluation Tasks must define a data collection task to require "
        )

    @property
    def task_acronym(self):
        """
        Acronym to distinguish between cdc, vxd and rec(o) MVA
        """
        raise NotImplementedError(
            "Evaluation Tasks must define a task acronym."
        )

1831 def requires(self):
1832 """
1833 Generate list of luigi Tasks that this Task depends on.
1834 """
1835 yield self.teacher_task(
1836 n_events_training=self.n_events_training,
1837 experiment_number=self.experiment_number,
1838 process_type=self.process_type,
1839 training_target=self.training_target,
1840 exclude_variables=self.exclude_variables,
1841 fast_bdt_option=self.fast_bdt_option,
1842 )
1843 if 'USEREC' in self.process_type:
1844 if 'USERECBB' in self.process_type:
1845 process = 'BBBAR'
1846 elif 'USERECEE' in self.process_type:
1847 process = 'BHABHA'
1848 yield CheckExistingFile(
1849 filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
1850 self.task_acronym + '.root'
1851 )
1852 else:
1853 yield self.data_collection_task(
1854 num_processes=MasterTask.num_processes,
1855 n_events=self.n_events_testing,
1856 experiment_number=self.experiment_number,
1857 random_seed=self.process_type + '_test',
1858 )
1859
1860 def output(self):
1861 """
1862 Generate list of output files that the task should produce.
1863 The task is considered finished if and only if the outputs all exist.
1864 """
1865 weightfile_details = create_fbdt_option_string(self.fast_bdt_option)
1866 evaluation_pdf_output = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
1867 yield self.add_to_output(evaluation_pdf_output)
1868
1869 @b2luigi.on_temporary_files
1870 def run(self):
1871 """
1872 Run ``basf2_mva_evaluate.py`` subprocess to evaluate QE MVA.
1873
1874 The MVA weight file created from training on the training data set is
1875 evaluated on separate test data.
1876 """
1877 weightfile_details = create_fbdt_option_string(self.fast_bdt_option)
1878 evaluation_pdf_output_basename = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
1879
1880 evaluation_pdf_output_path = self.get_output_file_name(evaluation_pdf_output_basename)
1881
1882 if 'USEREC' in self.process_type:
1883 if 'USERECBB' in self.process_type:
1884 process = 'BBBAR'
1885 elif 'USERECEE' in self.process_type:
1886 process = 'BHABHA'
1887 datafiles = 'datafiles/qe_records_N' + str(self.n_events_testing) + '_' + \
1888 process + '_test_' + self.task_acronym + '.root'
1889 else:
1890 datafiles = self.get_input_file_names(
1891 self.data_collection_task.get_records_file_name(
1892 self.data_collection_task,
1893 n_events=self.n_events_testing,
1894 random_seed=self.process_type + '_test_' +
1895 self.task_acronym))[0]
1896 cmd = [
1897 "basf2_mva_evaluate.py",
1898 "--identifiers",
1899 self.get_input_file_names(
1900 self.teacher_task.get_weightfile_xml_identifier(
1901 self.teacher_task,
1902 fast_bdt_option=self.fast_bdt_option))[0],
1903 "--datafiles",
1904 datafiles,
1905 "--treename",
1906 self.teacher_task.tree_name,
1907 "--outputfile",
1908 evaluation_pdf_output_path,
1909 ]
1910
1911 # Prepare log files
1912 log_file_dir = get_log_file_dir(self)
1913 # check if directory already exists, if not, create it. I think this is necessary as this task does not
1914 # inherit properly from b2luigi and thus does not do it automatically??
1915 try:
1916 os.makedirs(log_file_dir, exist_ok=True)
1917 # the following should be unnecessary as exist_ok=True should ensure that no FileExistsError is raised. I
1918 # might ask about a permission error...
1919 except FileExistsError:
1920 print('Directory ' + log_file_dir + ' already exists.')
1921 stderr_log_file_path = log_file_dir + "stderr"
1922 stdout_log_file_path = log_file_dir + "stdout"
1923 with open(stdout_log_file_path, "w") as stdout_file:
1924 stdout_file.write(f'stdout output of the command:\n{" ".join(cmd)}\n\n')
1925 if os.path.exists(stderr_log_file_path):
1926 # remove stderr file if it already exists b/c in the following it will be opened in appending mode
1927 os.remove(stderr_log_file_path)
1928
1929 # Run evaluation via subprocess and write output into logfiles
1930 with open(stdout_log_file_path, "a") as stdout_file:
1931 with open(stderr_log_file_path, "a") as stderr_file:
1932 try:
1933 subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
1934 except subprocess.CalledProcessError as err:
1935 stderr_file.write(f"Evaluation failed with error:\n{err}")
1936 raise err
1937
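# Illustrative sketch (not part of the steering file): the logging pattern used in
# ``TrackQEEvaluationBaseTask.run`` above, reduced to the standard library. The helper
# name ``run_and_log`` and its arguments are assumptions made for this example only.
import os
import subprocess
import sys


def run_and_log(cmd, log_file_dir):
    """Run ``cmd`` and write its stdout and stderr into files in ``log_file_dir``."""
    os.makedirs(log_file_dir, exist_ok=True)
    stdout_log_file_path = os.path.join(log_file_dir, "stdout")
    stderr_log_file_path = os.path.join(log_file_dir, "stderr")
    # Write a header with the command first, then reopen in append mode for the child process
    with open(stdout_log_file_path, "w") as stdout_file:
        stdout_file.write(f'stdout output of the command:\n{" ".join(cmd)}\n\n')
    if os.path.exists(stderr_log_file_path):
        # remove a stale stderr log, because the file is opened in append mode below
        os.remove(stderr_log_file_path)
    with open(stdout_log_file_path, "a") as stdout_file, open(stderr_log_file_path, "a") as stderr_file:
        try:
            # stdout and stderr of the child process go to the two log files
            subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
        except subprocess.CalledProcessError as err:
            stderr_file.write(f"Command failed with error:\n{err}")
            raise
    return stdout_log_file_path, stderr_log_file_path


# Possible usage, here with a harmless Python one-liner instead of ``basf2_mva_evaluate.py``:
# run_and_log([sys.executable, "-c", "print('hello')"], "logs/example/")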
1938
1939class VXDTrackQEEvaluationTask(TrackQEEvaluationBaseTask):
1940 """
1941 Run ``basf2_mva_evaluate.py`` for the VXD quality estimator on separate test data
1942 """
1943
1945 teacher_task = VXDQETeacherTask
1946
1948 data_collection_task = VXDQEDataCollectionTask
1949
1951 task_acronym = 'vxd'
1952
1953
1954class CDCTrackQEEvaluationTask(TrackQEEvaluationBaseTask):
1955 """
1956 Run ``basf2_mva_evaluate.py`` for the CDC quality estimator on separate test data
1957 """
1958
1960 teacher_task = CDCQETeacherTask
1961
1963 data_collection_task = CDCQEDataCollectionTask
1964
1966 task_acronym = 'cdc'
1967
1968
1969class RecoTrackQEEvaluationTask(TrackQEEvaluationBaseTask):
1970 """
1971 Run ``basf2_mva_evaluate.py`` for the final, combined quality estimator on
1972 separate test data
1973 """
1974
1976 teacher_task = RecoTrackQETeacherTask
1977
1979 data_collection_task = RecoTrackQEDataCollectionTask
1980
1982 task_acronym = 'rec'
1983
1984 cdc_training_target = b2luigi.Parameter()
1985
1986 def requires(self):
1987 """
1988 Generate list of luigi Tasks that this Task depends on.
1989 """
1990 yield self.teacher_task(
1991 n_events_training=self.n_events_training,
1992 experiment_number=self.experiment_number,
1993 process_type=self.process_type,
1994 training_target=self.training_target,
1995 exclude_variables=self.exclude_variables,
1996 cdc_training_target=self.cdc_training_target,
1997 fast_bdt_option=self.fast_bdt_option,
1998 )
1999 if 'USEREC' in self.process_type:
2000 if 'USERECBB' in self.process_type:
2001 process = 'BBBAR'
2002 elif 'USERECEE' in self.process_type:
2003 process = 'BHABHA'
2004 yield CheckExistingFile(
2005 filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
2006 self.task_acronym + '.root'
2007 )
2008 else:
2009 yield self.data_collection_task(
2010 num_processes=MasterTask.num_processes,
2011 n_events=self.n_events_testing,
2012 experiment_number=self.experiment_number,
2013 random_seed=self.process_type + "_test",
2014 cdc_training_target=self.cdc_training_target,
2015 )
2016
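# Illustrative sketch (not part of the steering file): the 'USEREC...' branches above
# repeat the same mapping from ``process_type`` to a pre-recorded test file. A helper
# such as the hypothetical ``recorded_test_file_name`` below could hold that mapping in
# one place and fail loudly for an unknown 'USEREC' variant instead of leaving
# ``process`` unset.
def recorded_test_file_name(process_type, n_events_testing, task_acronym):
    """Return the path of the pre-recorded test records file for a 'USEREC' process type."""
    if 'USERECBB' in process_type:
        process = 'BBBAR'
    elif 'USERECEE' in process_type:
        process = 'BHABHA'
    else:
        raise ValueError(f"No recorded test data known for process type {process_type!r}")
    # Same file name pattern as in the ``requires`` and ``run`` methods above
    return f'datafiles/qe_records_N{n_events_testing}_{process}_test_{task_acronym}.root'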
2017
2018class PlotsFromHarvestingValidationBaseTask(Basf2Task):
2019 """
2020 Create a PDF file with validation plots for a quality estimator produced
2021 from the ROOT ntuples produced by a harvesting validation task
2022 """
2023
2024 n_events_testing = b2luigi.IntParameter()
2025
2026 n_events_training = b2luigi.IntParameter()
2027
2028 experiment_number = b2luigi.IntParameter()
2029
2032 process_type = b2luigi.Parameter(
2033
2034 default="BBBAR"
2035
2036 )
2037
2039 exclude_variables = b2luigi.ListParameter(
2040
2041 hashed=True
2042
2043 )
2044
2045 fast_bdt_option = b2luigi.ListParameter(
2046
2047 hashed=True, default=[200, 8, 3, 0.1]
2048
2049 )
2050
2051 primaries_only = b2luigi.BoolParameter(
2052
2053 default=True
2054
2055 ) # normalize finding efficiencies to primary MC-tracks
2056
2057 @property
2058 def harvesting_validation_task_instance(self) -> HarvestingValidationBaseTask:
2059 """
2060 Specifies related harvesting validation task which produces the ROOT
2061 files with the data that is plotted by this task.
2062 """
2063 raise NotImplementedError("Must define a QI harvesting validation task for which to do the plots")
2064
2065 @property
2066 def output_pdf_file_basename(self):
2067 """
2068 Name of the output PDF file containing the validation plots
2069 """
2070 validation_harvest_basename = self.harvesting_validation_task_instance.validation_output_file_name
2071 return validation_harvest_basename.replace(".root", "_plots.pdf")
2072
2073 def requires(self):
2074 """
2075 Generate list of luigi Tasks that this Task depends on.
2076 """
2077 yield self.harvesting_validation_task_instance
2078
2079 def output(self):
2080 """
2081 Generate list of output files that the task should produce.
2082 The task is considered finished if and only if the outputs all exist.
2083 """
2084 yield self.add_to_output(self.output_pdf_file_basename)
2085
2086 @b2luigi.on_temporary_files
2087 def process(self):
2088 """
2089 Create the PDF file with validation plots from the ROOT ntuples written
2090 by the harvesting validation task.
2091
2092 Main process that is dispatched by the ``run`` method that is inherited
2093 from ``Basf2Task``.
2094 """
2095 # get the validation "harvest", which is the ROOT file with ntuples for validation
2096 validation_harvest_basename = self.harvesting_validation_task_instance.validation_output_file_name
2097 validation_harvest_path = self.get_input_file_names(validation_harvest_basename)[0]
2098
2099 # Load "harvested" validation data from root files into dataframes (requires enough memory to hold data)
2100 pr_columns = [ # Restrict memory usage by only reading in columns that are used in the steering file
2101 'is_fake', 'is_clone', 'is_matched', 'quality_indicator',
2102 'experiment_number', 'run_number', 'event_number', 'pr_store_array_number',
2103 'pt_estimate', 'z0_estimate', 'd0_estimate', 'tan_lambda_estimate',
2104 'phi0_estimate', 'pt_truth', 'z0_truth', 'd0_truth', 'tan_lambda_truth',
2105 'phi0_truth',
2106 ]
2107 # In ``pr_df`` each row corresponds to a track from Pattern Recognition
2108 pr_df = uproot.open(validation_harvest_path)['pr_tree/pr_tree'].arrays(pr_columns, library='pd')
2109 mc_columns = [ # restrict mc_df to these columns
2110 'experiment_number',
2111 'run_number',
2112 'event_number',
2113 'pr_store_array_number',
2114 'is_missing',
2115 'is_primary',
2116 ]
2117 # In ``mc_df`` each row corresponds to an MC track
2118 mc_df = uproot.open(validation_harvest_path)['mc_tree/mc_tree'].arrays(mc_columns, library='pd')
2119 if self.primaries_only:
2120 mc_df = mc_df[mc_df.is_primary.eq(True)]
2121
2122 # Define QI thresholds for the FOM plots and the ROC curves
2123 qi_cuts = np.linspace(0., 1, 20, endpoint=False)
2124 # # Add more points at the very end between the previous maximum and 1
2125 # qi_cuts = np.append(qi_cuts, np.linspace(np.max(qi_cuts), 1, 20, endpoint=False))
2126
2127 # Create plots and append them to single output pdf
2128
2129 output_pdf_file_path = self.get_output_file_name(self.output_pdf_file_basename)
2130 with PdfPages(output_pdf_file_path, keep_empty=False) as pdf:
2131
2132 # Add a title page to validation plot PDF with some metadata
2133 # Remember that most metadata is in the xml file of the weightfile
2134 # and in the b2luigi directory structure
2135 titlepage_fig, titlepage_ax = plt.subplots()
2136 titlepage_ax.axis("off")
2137 title = f"Quality Estimator validation plots from {self.__class__.__name__}"
2138 titlepage_ax.set_title(title)
2139 teacher_task = self.harvesting_validation_task_instance.teacher_task
2140 weightfile_identifier = teacher_task.get_weightfile_xml_identifier(teacher_task, fast_bdt_option=self.fast_bdt_option)
2141 meta_data = {
2142 "Date": datetime.today().strftime("%Y-%m-%d %H:%M"),
2143 "Created by steering file": os.path.realpath(__file__),
2144 "Created from data in": validation_harvest_path,
2145 "Background directory": MasterTask.bkgfiles_by_exp[self.experiment_number],
2146 "weight file": weightfile_identifier,
2147 }
2148 if hasattr(self, 'exclude_variables'):
2149 meta_data["Excluded variables"] = ", ".join(self.exclude_variables)
2150 meta_data_string = (format_dictionary(meta_data) +
2151 "\n\n(For all MVA training parameters look into the produced weight file)")
2152 luigi_params = get_serialized_parameters(self)
2153 luigi_param_string = (f"\n\nb2luigi parameters for {self.__class__.__name__}\n" +
2154 format_dictionary(luigi_params))
2155 title_page_text = meta_data_string + luigi_param_string
2156 titlepage_ax.text(0, 1, title_page_text, ha="left", va="top", wrap=True, fontsize=8)
2157 pdf.savefig(titlepage_fig)
2158 plt.close(titlepage_fig)
2159
2160 fake_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_fake", qi_cuts)
2161 fake_fig, fake_ax = plt.subplots()
2162 fake_ax.set_title("Fake rate")
2163 plot_with_errobands(fake_rates, ax=fake_ax)
2164 fake_ax.set_ylabel("fake rate")
2165 fake_ax.set_xlabel("quality indicator requirement")
2166 pdf.savefig(fake_fig, bbox_inches="tight")
2167 plt.close(fake_fig)
2168
2169 # Plot clone rates
2170 clone_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_clone", qi_cuts)
2171 clone_fig, clone_ax = plt.subplots()
2172 clone_ax.set_title("Clone rate")
2173 plot_with_errobands(clone_rates, ax=clone_ax)
2174 clone_ax.set_ylabel("clone rate")
2175 clone_ax.set_xlabel("quality indicator requirement")
2176 pdf.savefig(clone_fig, bbox_inches="tight")
2177 plt.close(clone_fig)
2178
2179 # Plot finding efficiency
2180
2181 # The Quality Indicator is only available in pr_tree and thus the
2182 # PR-track dataframe. To get the QI of the related PR track for an MC
2183 # track, merge the PR dataframe into the MC dataframe
2184 pr_track_identifiers = ['experiment_number', 'run_number', 'event_number', 'pr_store_array_number']
2185 mc_df = upd.merge(
2186 left=mc_df, right=pr_df[pr_track_identifiers + ['quality_indicator']],
2187 how='left',
2188 on=pr_track_identifiers
2189 )
2190
2191 missing_fractions = (
2192 _my_uncertain_mean(mc_df[
2193 mc_df.quality_indicator.isnull() | (mc_df.quality_indicator > qi_cut)]['is_missing'])
2194 for qi_cut in qi_cuts
2195 )
2196
2197 findeff_fig, findeff_ax = plt.subplots()
2198 findeff_ax.set_title("Finding efficiency")
2199 finding_efficiencies = 1.0 - upd.Series(data=missing_fractions, index=qi_cuts)
2200 plot_with_errobands(finding_efficiencies, ax=findeff_ax)
2201 findeff_ax.set_ylabel("finding efficiency")
2202 findeff_ax.set_xlabel("quality indicator requirement")
2203 pdf.savefig(findeff_fig, bbox_inches="tight")
2204 plt.close(findeff_fig)
2205
2206 # Plot ROC curves
2207
2208 # Fake rate vs. finding efficiency ROC curve
2209 fake_roc_fig, fake_roc_ax = plt.subplots()
2210 fake_roc_ax.set_title("Fake rate vs. finding efficiency ROC curve")
2211 fake_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=fake_rates.nominal_value,
2212 xerr=finding_efficiencies.std_dev, yerr=fake_rates.std_dev, elinewidth=0.8)
2213 fake_roc_ax.set_xlabel('finding efficiency')
2214 fake_roc_ax.set_ylabel('fake rate')
2215 pdf.savefig(fake_roc_fig, bbox_inches="tight")
2216 plt.close(fake_roc_fig)
2217
2218 # Clone rate vs. finding efficiency ROC curve
2219 clone_roc_fig, clone_roc_ax = plt.subplots()
2220 clone_roc_ax.set_title("Clone rate vs. finding efficiency ROC curve")
2221 clone_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=clone_rates.nominal_value,
2222 xerr=finding_efficiencies.std_dev, yerr=clone_rates.std_dev, elinewidth=0.8)
2223 clone_roc_ax.set_xlabel('finding efficiency')
2224 clone_roc_ax.set_ylabel('clone rate')
2225 pdf.savefig(clone_roc_fig, bbox_inches="tight")
2226 plt.close(clone_roc_fig)
2227
2228 # Plot kinematic distributions
2229
2230 # use fewer qi cuts as each cut will be its own subplot now and not a point
2231 kinematic_qi_cuts = [0, 0.5, 0.9]
2232
2233 # Define kinematic parameters which we want to histogram and define
2234 # dictionaries relating them to latex labels, units and binnings
2235 params = ['d0', 'z0', 'pt', 'tan_lambda', 'phi0']
2236 label_by_param = {
2237 "pt": "$p_T$",
2238 "z0": "$z_0$",
2239 "d0": "$d_0$",
2240 "tan_lambda": r"$\tan{\lambda}$",
2241 "phi0": r"$\phi_0$"
2242 }
2243 unit_by_param = {
2244 "pt": "GeV",
2245 "z0": "cm",
2246 "d0": "cm",
2247 "tan_lambda": "rad",
2248 "phi0": "rad"
2249 }
2250 n_kinematic_bins = 75 # number of bins per kinematic variable
2251 bins_by_param = {
2252 "pt": np.linspace(0, np.percentile(pr_df['pt_truth'].dropna(), 95), n_kinematic_bins),
2253 "z0": np.linspace(-0.1, 0.1, n_kinematic_bins),
2254 "d0": np.linspace(0, 0.01, n_kinematic_bins),
2255 "tan_lambda": np.linspace(-2, 3, n_kinematic_bins),
2256 "phi0": np.linspace(0, 2 * np.pi, n_kinematic_bins)
2257 }
2258
2259 # Iterate over each parameter and for each make stacked histograms for different QI cuts
2260 kinematic_qi_cuts = [0, 0.5, 0.8]
2261 blue, yellow, green = plt.get_cmap("tab10").colors[0:3]
2262 for param in params:
2263 fig, axarr = plt.subplots(ncols=len(kinematic_qi_cuts), sharey=True, sharex=True, figsize=(14, 6))
2264 fig.suptitle(f"{label_by_param[param]} distributions")
2265 for i, qi in enumerate(kinematic_qi_cuts):
2266 ax = axarr[i]
2267 ax.set_title(f"QI > {qi}")
2268 incut = pr_df[(pr_df['quality_indicator'] > qi)]
2269 incut_matched = incut[incut.is_matched.eq(True)]
2270 incut_clones = incut[incut.is_clone.eq(True)]
2271 incut_fake = incut[incut.is_fake.eq(True)]
2272
2273 # if any series is empty, skip this subplot and don't try to draw a stacked histogram
2274 if any(series.empty for series in (incut, incut_matched, incut_clones, incut_fake)):
2275 ax.text(0.5, 0.5, "Not enough data in bin", ha="center", va="center", transform=ax.transAxes)
2276 continue
2277
2278 bins = bins_by_param[param]
2279 stacked_histogram_series_tuple = (
2280 incut_matched[f'{param}_estimate'],
2281 incut_clones[f'{param}_estimate'],
2282 incut_fake[f'{param}_estimate'],
2283 )
2284 histvals, _, _ = ax.hist(stacked_histogram_series_tuple,
2285 stacked=True,
2286 bins=bins, range=(bins.min(), bins.max()),
2287 color=(blue, green, yellow),
2288 label=("matched", "clones", "fakes"))
2289 ax.set_xlabel(f'{label_by_param[param]} estimate / ({unit_by_param[param]})')
2290 ax.set_ylabel('# tracks')
2291 axarr[0].legend(loc="upper center", bbox_to_anchor=(0, -0.15))
2292 pdf.savefig(fig, bbox_inches="tight")
2293 plt.close(fig)
2294
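# Illustrative sketch (not part of the steering file): the finding-efficiency selection
# from ``process`` above, written with plain Python lists instead of the merged
# dataframe and without uncertainties. ``finding_efficiency_for_cuts`` and the
# (quality_indicator, is_missing) tuples are assumptions made for this example only.
def finding_efficiency_for_cuts(mc_tracks, qi_cuts):
    """
    For each QI cut, keep the MC tracks that have no related PR track (QI is ``None``)
    or whose related PR track passes the cut, and return 1 - fraction of missing tracks.
    """
    efficiencies = {}
    for qi_cut in qi_cuts:
        selected = [is_missing for (qi, is_missing) in mc_tracks if qi is None or qi > qi_cut]
        if not selected:
            continue  # no MC tracks left for this cut, so no efficiency is defined
        efficiencies[qi_cut] = 1.0 - sum(selected) / len(selected)
    return efficiencies


# Possible usage: one missing MC track (no QI) and three found ones with different QI values
# finding_efficiency_for_cuts([(None, True), (0.2, False), (0.6, False), (0.9, False)], [0.0, 0.5])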
2295
2296class VXDQEValidationPlotsTask(PlotsFromHarvestingValidationBaseTask):
2297 """
2298 Create a PDF file with validation plots for the VXDTF2 track quality
2299 estimator produced from the ROOT ntuples produced by a VXDTF2 track QE
2300 harvesting validation task
2301 """
2302
2303 @property
2304 def harvesting_validation_task_instance(self):
2305 """
2306 Harvesting validation task to require, which produces the ROOT files
2307 with variables to produce the VXD QE validation plots.
2308 """
2309 return VXDQEHarvestingValidationTask(
2310 n_events_testing=self.n_events_testing,
2311 n_events_training=self.n_events_training,
2312 process_type=self.process_type,
2313 experiment_number=self.experiment_number,
2314 exclude_variables=self.exclude_variables,
2315 num_processes=MasterTask.num_processes,
2316 fast_bdt_option=self.fast_bdt_option,
2317 )
2318
2319
2320class CDCQEValidationPlotsTask(PlotsFromHarvestingValidationBaseTask):
2321 """
2322 Create a PDF file with validation plots for the CDC track quality estimator
2323 produced from the ROOT ntuples produced by a CDC track QE harvesting
2324 validation task
2325 """
2326
2327 training_target = b2luigi.Parameter()
2328
2329 @property
2330 def harvesting_validation_task_instance(self):
2331 """
2332 Harvesting validation task to require, which produces the ROOT files
2333 with variables to produce the CDC QE validation plots.
2334 """
2335 return CDCQEHarvestingValidationTask(
2336 n_events_testing=self.n_events_testing,
2337 n_events_training=self.n_events_training,
2338 process_type=self.process_type,
2339 experiment_number=self.experiment_number,
2340 training_target=self.training_target,
2341 exclude_variables=self.exclude_variables,
2342 num_processes=MasterTask.num_processes,
2343 fast_bdt_option=self.fast_bdt_option,
2344 )
2345
2346
2347class RecoTrackQEValidationPlotsTask(PlotsFromHarvestingValidationBaseTask):
2348 """
2349 Create a PDF file with validation plots for the reco MVA track quality
2350 estimator produced from the ROOT ntuples produced by a reco track QE
2351 harvesting validation task
2352 """
2353
2354 cdc_training_target = b2luigi.Parameter()
2355
2356 @property
2357 def harvesting_validation_task_instance(self):
2358 """
2359 Harvesting validation task to require, which produces the ROOT files
2360 with variables to produce the final MVA track QE validation plots.
2361 """
2362 return RecoTrackQEHarvestingValidationTask(
2363 n_events_testing=self.n_events_testing,
2364 n_events_training=self.n_events_training,
2365 process_type=self.process_type,
2366 experiment_number=self.experiment_number,
2367 cdc_training_target=self.cdc_training_target,
2368 exclude_variables=self.exclude_variables,
2369 num_processes=MasterTask.num_processes,
2370 fast_bdt_option=self.fast_bdt_option,
2371 )
2372
2373
2374class QEWeightsLocalDBCreatorTask(Basf2Task):
2375 """
2376 Collect weightfile identifiers from different teacher tasks and merge them
2377 into a local database for testing.
2378 """
2379
2380 n_events_training = b2luigi.IntParameter()
2381
2382 experiment_number = b2luigi.IntParameter()
2383
2386 process_type = b2luigi.Parameter(
2387
2388 default="BBBAR"
2389
2390 )
2391
2392 cdc_training_target = b2luigi.Parameter()
2393
2394 fast_bdt_option = b2luigi.ListParameter(
2395
2396 hashed=True, default=[200, 8, 3, 0.1]
2397
2398 )
2399
2400 def requires(self):
2401 """
2402 Required teacher tasks
2403 """
2404 yield VXDQETeacherTask(
2405 n_events_training=self.n_events_training,
2406 process_type=self.process_type,
2407 experiment_number=self.experiment_number,
2408 exclude_variables=MasterTask.exclude_variables_vxd,
2409 fast_bdt_option=self.fast_bdt_option,
2410 )
2411 yield CDCQETeacherTask(
2412 n_events_training=self.n_events_training,
2413 process_type=self.process_type,
2414 experiment_number=self.experiment_number,
2415 training_target=self.cdc_training_target,
2416 exclude_variables=MasterTask.exclude_variables_cdc,
2417 fast_bdt_option=self.fast_bdt_option,
2418 )
2419 yield RecoTrackQETeacherTask(
2420 n_events_training=self.n_events_training,
2421 process_type=self.process_type,
2422 experiment_number=self.experiment_number,
2423 cdc_training_target=self.cdc_training_target,
2424 exclude_variables=MasterTask.exclude_variables_rec,
2425 fast_bdt_option=self.fast_bdt_option,
2426 )
2427
2428 def output(self):
2429 """
2430 Local database
2431 """
2432 yield self.add_to_output("localdb.tar")
2433
2434 def process(self):
2435 """
2436 Create local database
2437 """
2438 current_path = Path.cwd()
2439 localdb_archive_path = Path(self.get_output_file_name("localdb.tar")).absolute()
2440 output_dir = localdb_archive_path.parent
2441
2442 # remove existing local databases in output directories
2443 self._clean()
2444 # "Upload" the weightfiles of all 3 teacher tasks into the same localdb
2445 for task in (VXDQETeacherTask, CDCQETeacherTask, RecoTrackQETeacherTask):
2446 # Extract xml identifier input file name before switching working directories, as it returns relative paths
2447 weightfile_xml_identifier_path = os.path.abspath(self.get_input_file_names(
2448 task.get_weightfile_xml_identifier(task, fast_bdt_option=self.fast_bdt_option))[0])
2449 # As localdb is created in working directory, chdir into desired output path
2450 try:
2451 os.chdir(output_dir)
2452 # Same as basf2_mva_upload on the command line, creates localdb directory in current working dir
2453 basf2_mva.upload(
2454 weightfile_xml_identifier_path,
2455 task.weightfile_identifier_basename,
2456 self.experiment_number, 0,
2457 self.experiment_number, -1,
2458 )
2459 finally: # Switch back to working directory of b2luigi, even if upload failed
2460 os.chdir(current_path)
2461
2462 # Pack localdb into tar archive, so that we can have one single output file instead
2463 shutil.make_archive(
2464 base_name=localdb_archive_path.with_suffix('').as_posix(),
2465 format="tar",
2466 root_dir=output_dir,
2467 base_dir="localdb",
2468 verbose=True,
2469 )
2470
2471 def _clean(self):
2472 """
2473 Remove local database and tar archives in output directory
2474 """
2475 localdb_archive_path = Path(self.get_output_file_name("localdb.tar"))
2476 localdb_path = localdb_archive_path.parent / "localdb"
2477
2478 if localdb_path.exists():
2479 print(f"Deleting localdb\n{localdb_path}\nwith contents\n ",
2480 "\n ".join(f.name for f in localdb_path.iterdir()))
2481 shutil.rmtree(localdb_path, ignore_errors=False) # recursively delete localdb
2482
2483 if localdb_archive_path.is_file():
2484 print(f"Deleting {localdb_archive_path}")
2485 os.remove(localdb_archive_path)
2486
2487 def on_failure(self, exception):
2488 """
2489 Cleanup: Remove local database to prevent existing outputs when task did not finish successfully
2490 """
2491 self._clean()
2492 # Run existing on_failure from parent class
2493 super().on_failure(exception)
2494
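# Illustrative sketch (not part of the steering file): packing a ``localdb`` directory
# into ``localdb.tar`` as in ``QEWeightsLocalDBCreatorTask.process`` above, using only
# the standard library. The helper name ``pack_localdb`` is an assumption made for this
# example. It strips only the final ".tar" suffix, so dots elsewhere in the output
# path do not truncate the archive name.
import shutil
from pathlib import Path


def pack_localdb(localdb_archive_path):
    """Pack the ``localdb`` directory next to ``localdb_archive_path`` into that tar archive."""
    localdb_archive_path = Path(localdb_archive_path).absolute()
    return shutil.make_archive(
        # ``make_archive`` appends ".tar" itself, so pass the path without its suffix
        base_name=str(localdb_archive_path.with_suffix("")),
        format="tar",
        root_dir=str(localdb_archive_path.parent),
        base_dir="localdb",
    )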
2495
2496class MasterTask(b2luigi.WrapperTask):
2497 """
2498 Wrapper task that needs to finish for b2luigi to finish running this steering file.
2499
2500 It is done if the outputs of all required subtasks exist. It is thus at the
2501 top of the luigi task graph. Edit the ``requires`` method to steer which
2502 tasks and with which parameters you want to run.
2503 """
2504
2507 process_type = b2luigi.get_setting(
2508
2509 "process_type", default='BBBAR'
2510
2511 )
2512
2513 n_events_training = b2luigi.get_setting(
2514
2515 "n_events_training", default=20000
2516
2517 )
2518
2519 n_events_testing = b2luigi.get_setting(
2520
2521 "n_events_testing", default=5000
2522
2523 )
2524
2525 n_events_per_task = b2luigi.get_setting(
2526
2527 "n_events_per_task", default=100
2528
2529 )
2530
2531 num_processes = b2luigi.get_setting(
2532
2533 "basf2_processes_per_worker", default=0
2534
2535 )
2536
2537 datafiles = b2luigi.get_setting("datafiles")
2538
2539 bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
2540
2541 bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
2542
2543 exclude_variables_cdc = [
2544 "has_matching_segment",
2545 "size",
2546 "n_tracks", # not written out per default anyway
2547 "avg_hit_dist",
2548 "cont_layer_mean",
2549 "cont_layer_variance",
2550 "cont_layer_max",
2551 "cont_layer_min",
2552 "cont_layer_first",
2553 "cont_layer_last",
2554 "cont_layer_max_vs_last",
2555 "cont_layer_first_vs_min",
2556 "cont_layer_count",
2557 "cont_layer_occupancy",
2558 "super_layer_mean",
2559 "super_layer_variance",
2560 "super_layer_max_vs_last",
2561 "super_layer_first_vs_min",
2562 "super_layer_occupancy",
2563 "drift_length_mean",
2564 "drift_length_variance",
2565 "drift_length_max",
2566 "drift_length_min",
2567 "drift_length_sum",
2568 "norm_drift_length_mean",
2569 "norm_drift_length_variance",
2570 "norm_drift_length_max",
2571 "norm_drift_length_min",
2572 "norm_drift_length_sum",
2573 "adc_mean",
2574 "adc_variance",
2575 "adc_max",
2576 "adc_min",
2577 "adc_sum",
2578 "tot_mean",
2579 "tot_variance",
2580 "tot_max",
2581 "tot_min",
2582 "tot_sum",
2583 "empty_s_mean",
2584 "empty_s_variance",
2585 "empty_s_max"]
2586
2587 exclude_variables_vxd = [
2588 'energyLoss_max', 'energyLoss_min', 'energyLoss_mean', 'energyLoss_std', 'energyLoss_sum',
2589 'size_max', 'size_min', 'size_mean', 'size_std', 'size_sum',
2590 'seedCharge_max', 'seedCharge_min', 'seedCharge_mean', 'seedCharge_std', 'seedCharge_sum',
2591 'tripletFit_P_Mag', 'tripletFit_P_Eta', 'tripletFit_P_Phi', 'tripletFit_P_X', 'tripletFit_P_Y', 'tripletFit_P_Z']
2592
2593 exclude_variables_rec = [
2594 'background',
2595 'ghost',
2596 'fake',
2597 'clone',
2598 '__experiment__',
2599 '__run__',
2600 '__event__',
2601 'N_RecoTracks',
2602 'N_PXDRecoTracks',
2603 'N_SVDRecoTracks',
2604 'N_CDCRecoTracks',
2605 'N_diff_PXD_SVD_RecoTracks',
2606 'N_diff_SVD_CDC_RecoTracks',
2607 'Fit_Successful',
2608 'Fit_NFailedPoints',
2609 'Fit_Chi2',
2610 'N_TrackPoints_without_KalmanFitterInfo',
2611 'N_Hits_without_TrackPoint',
2612 'SVD_CDC_CDCwall_Chi2',
2613 'SVD_CDC_CDCwall_Pos_diff_Z',
2614 'SVD_CDC_CDCwall_Pos_diff_Pt',
2615 'SVD_CDC_CDCwall_Pos_diff_Theta',
2616 'SVD_CDC_CDCwall_Pos_diff_Phi',
2617 'SVD_CDC_CDCwall_Pos_diff_Mag',
2618 'SVD_CDC_CDCwall_Pos_diff_Eta',
2619 'SVD_CDC_CDCwall_Mom_diff_Z',
2620 'SVD_CDC_CDCwall_Mom_diff_Pt',
2621 'SVD_CDC_CDCwall_Mom_diff_Theta',
2622 'SVD_CDC_CDCwall_Mom_diff_Phi',
2623 'SVD_CDC_CDCwall_Mom_diff_Mag',
2624 'SVD_CDC_CDCwall_Mom_diff_Eta',
2625 'SVD_CDC_POCA_Pos_diff_Z',
2626 'SVD_CDC_POCA_Pos_diff_Pt',
2627 'SVD_CDC_POCA_Pos_diff_Theta',
2628 'SVD_CDC_POCA_Pos_diff_Phi',
2629 'SVD_CDC_POCA_Pos_diff_Mag',
2630 'SVD_CDC_POCA_Pos_diff_Eta',
2631 'SVD_CDC_POCA_Mom_diff_Z',
2632 'SVD_CDC_POCA_Mom_diff_Pt',
2633 'SVD_CDC_POCA_Mom_diff_Theta',
2634 'SVD_CDC_POCA_Mom_diff_Phi',
2635 'SVD_CDC_POCA_Mom_diff_Mag',
2636 'SVD_CDC_POCA_Mom_diff_Eta',
2637 'POCA_Pos_Pt',
2638 'POCA_Pos_Mag',
2639 'POCA_Pos_Phi',
2640 'POCA_Pos_Z',
2641 'POCA_Pos_Theta',
2642 'PXD_QI',
2643 'SVD_FitSuccessful',
2644 'CDC_FitSuccessful',
2645 'pdg_ID',
2646 'pdg_ID_Mother',
2647 'is_Vzero_Daughter',
2648 'is_Primary',
2649 'z0',
2650 'd0',
2651 'seed_Charge',
2652 'Fit_Charge',
2653 'weight_max',
2654 'weight_min',
2655 'weight_mean',
2656 'weight_std',
2657 'weight_median',
2658 'weight_n_zeros',
2659 'weight_firstCDCHit',
2660 'weight_lastSVDHit',
2661 'smoothedChi2_max',
2662 'smoothedChi2_min',
2663 'smoothedChi2_mean',
2664 'smoothedChi2_std',
2665 'smoothedChi2_median',
2666 'smoothedChi2_n_zeros',
2667 'smoothedChi2_firstCDCHit',
2668 'smoothedChi2_lastSVDHit']
2669
2670 def requires(self):
2671 """
2672 Generate list of tasks that need to be done for luigi to finish running
2673 this steering file.
2674 """
2675 cdc_training_targets = [
2676 "truth", # treats clones as background, only best matched CDC tracks are true
2677 # "truth_track_is_matched" # treats clones as signal
2678 ]
2679
2680 fast_bdt_options = []
2681 # possible to run over a chosen hyperparameter space if wanted
        # in principle this can be extended to specific options for the three different MVAs
        # for i in range(250, 400, 50):
        #     for j in range(6, 10, 2):
        #         for k in range(2, 6):
        #             for l in range(0, 5):
        #                 fast_bdt_options.append([100 + i, j, 3+k, 0.025+l*0.025])
        # fast_bdt_options.append([200, 8, 3, 0.1])  # default FastBDT option
        fast_bdt_options.append([350, 6, 5, 0.1])

        experiment_numbers = b2luigi.get_setting("experiment_numbers")

        # iterate over all possible combinations of parameters from the parameter lists defined above
        for experiment_number, cdc_training_target, fast_bdt_option in itertools.product(
                experiment_numbers, cdc_training_targets, fast_bdt_options
        ):
            # if test_selected_task is activated, only run the following tasks:
            if b2luigi.get_setting("test_selected_task", default=False):
                # for process_type in ['BHABHA', 'MUMU', 'TAUPAIR', 'YY', 'EEEE', 'EEMUMU', 'UUBAR', \
                #                      'DDBAR', 'CCBAR', 'SSBAR', 'BBBAR', 'V0BBBAR', 'V0STUDY']:
                for cut in ['000', '070', '090', '095']:
                    yield RecoTrackQEDataCollectionTask(
                        num_processes=self.num_processes,
                        n_events=self.n_events_testing,
                        experiment_number=experiment_number,
                        random_seed=self.process_type + '_test',
                        recotrack_option='useCDC_noVXD_deleteCDCQI'+cut,
                        cdc_training_target=cdc_training_target,
                        fast_bdt_option=fast_bdt_option,
                    )
                yield CDCQEDataCollectionTask(
                    num_processes=self.num_processes,
                    n_events=self.n_events_testing,
                    experiment_number=experiment_number,
                    random_seed=self.process_type + '_test',
                )
                yield CDCQETeacherTask(
                    n_events_training=self.n_events_training,
                    process_type=self.process_type,
                    experiment_number=experiment_number,
                    exclude_variables=self.exclude_variables_cdc,
                    training_target=cdc_training_target,
                    fast_bdt_option=fast_bdt_option,
                )
            else:
                # if data is processed, the classifiers can neither be trained nor evaluated on it
                if 'DATA' in self.process_type:
                    yield VXDQEDataCollectionTask(
                        num_processes=self.num_processes,
                        n_events=self.n_events_testing,
                        experiment_number=experiment_number,
                        random_seed=self.process_type + '_test',
                    )
                    yield CDCQEDataCollectionTask(
                        num_processes=self.num_processes,
                        n_events=self.n_events_testing,
                        experiment_number=experiment_number,
                        random_seed=self.process_type + '_test',
                    )
                    yield RecoTrackQEDataCollectionTask(
                        num_processes=self.num_processes,
                        n_events=self.n_events_testing,
                        experiment_number=experiment_number,
                        random_seed=self.process_type + '_test',
                        recotrack_option='deleteCDCQI080',
                        cdc_training_target=cdc_training_target,
                        fast_bdt_option=fast_bdt_option,
                    )
                else:
                    yield QEWeightsLocalDBCreatorTask(
                        n_events_training=self.n_events_training,
                        process_type=self.process_type,
                        experiment_number=experiment_number,
                        cdc_training_target=cdc_training_target,
                        fast_bdt_option=fast_bdt_option,
                    )

                    if b2luigi.get_setting("run_validation_tasks", default=True):
                        yield RecoTrackQEValidationPlotsTask(
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            cdc_training_target=cdc_training_target,
                            exclude_variables=self.exclude_variables_rec,
                            fast_bdt_option=fast_bdt_option,
                        )
                        yield CDCQEValidationPlotsTask(
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            exclude_variables=self.exclude_variables_cdc,
                            training_target=cdc_training_target,
                            fast_bdt_option=fast_bdt_option,
                        )
                        yield VXDQEValidationPlotsTask(
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            exclude_variables=self.exclude_variables_vxd,
                            experiment_number=experiment_number,
                            fast_bdt_option=fast_bdt_option,
                        )

                    if b2luigi.get_setting("run_mva_evaluate", default=True):
                        # Evaluate trained weight files via basf2_mva_evaluate.py on separate test datasets.
                        # Requires a LaTeX installation to work.
                        yield RecoTrackQEEvaluationTask(
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            cdc_training_target=cdc_training_target,
                            exclude_variables=self.exclude_variables_rec,
                            fast_bdt_option=fast_bdt_option,
                        )
                        yield CDCTrackQEEvaluationTask(
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            exclude_variables=self.exclude_variables_cdc,
                            fast_bdt_option=fast_bdt_option,
                            training_target=cdc_training_target,
                        )
                        yield VXDTrackQEEvaluationTask(
                            n_events_training=self.n_events_training,
                            n_events_testing=self.n_events_testing,
                            process_type=self.process_type,
                            experiment_number=experiment_number,
                            exclude_variables=self.exclude_variables_vxd,
                            fast_bdt_option=fast_bdt_option,
                        )


if __name__ == "__main__":
    # if n_events_test_on_data is set to a positive value in the settings (default: -1),
    # then stop after N events (mainly useful to test data reconstruction):
    nEventsTestOnData = b2luigi.get_setting("n_events_test_on_data", default=-1)
    if nEventsTestOnData > 0 and 'DATA' in b2luigi.get_setting("process_type", default="BBBAR"):
        from ROOT import Belle2
        environment = Belle2.Environment.Instance()
        environment.setNumberEventsOverride(nEventsTestOnData)
    # if global tags are specified in the settings, use them:
    # e.g. for data use ["data_reprocessing_prompt", "online"]. Make sure these tags are up to date.
    globaltags = b2luigi.get_setting("globaltags", default=[])
    if len(globaltags) > 0:
        basf2.conditions.reset()
        for gt in globaltags:
            basf2.conditions.prepend_globaltag(gt)
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(MasterTask(), workers=workers)

# @endcond
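
The commented-out nested loops in the steering code above hint at how a FastBDT hyperparameter scan could be set up. The following is a minimal standalone sketch of that idea, not part of the steering file: the helper name `build_fast_bdt_grid` is hypothetical, the four list entries are treated as opaque positions because their meaning is not spelled out in this section, and the rounding of the last entry is an addition made here to avoid floating-point noise.

```python
import itertools


def build_fast_bdt_grid():
    """Return the option lists that the commented-out nested loops would append.

    Hypothetical helper for illustration only; the ranges are copied from the
    commented-out loops in the steering file.
    """
    grid = []
    for i, j, k, n in itertools.product(
            range(250, 400, 50),  # i: 250, 300, 350
            range(6, 10, 2),      # j: 6, 8
            range(2, 6),          # k: 2, 3, 4, 5
            range(0, 5),          # n: 0, 1, 2, 3, 4 (called "l" in the original loops)
    ):
        # Rounding is an assumption added here so that values such as 0.1
        # compare equal to their literal form.
        grid.append([100 + i, j, 3 + k, round(0.025 + n * 0.025, 3)])
    return grid


grid = build_fast_bdt_grid()
print(len(grid))  # 3 * 2 * 4 * 5 = 120 combinations
```

Under these assumptions, the option that is currently active in the steering file, `[350, 6, 5, 0.1]`, is one point of this grid. Each additional entry in `fast_bdt_options` multiplies the number of task chains generated by the `itertools.product` loop, so a full scan of this size would schedule 120 times as many trainings per experiment number and CDC training target.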