#!/usr/bin/env python3

"""
combined_module_quality_estimator_teacher
-----------------------------------------

Information on the MVA Track Quality Indicator / Estimator can be found
on `XWiki
<https://xwiki.desy.de/xwiki/rest/p/0d3f4>`_.

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This Python script is used for the combined training and validation of three
classifiers: the actual final MVA track quality estimator and the two quality
estimators for the intermediate standalone track finders that it depends on.

    - Final MVA track quality estimator:
      The final quality estimator for fully merged and fitted tracks (RecoTracks).
      Its classifier uses features from the track fitting, merger, hit pattern, ...
      It also uses the outputs of the respective intermediate quality
      estimators for the VXD and the CDC track finding as inputs. It provides
      the final quality indicator (QI) exported to the track objects.

    - VXDTF2 track quality estimator:
      MVA quality estimator for the VXD standalone track finding.

    - CDC track quality estimator:
      MVA quality estimator for the CDC standalone track finding.

Each classifier requires a different training data set, and each needs to be
validated on a separate testing data set. Furthermore, the final quality
estimator can only be trained when the trained weights for the intermediate
quality estimators are available. If the final estimator is to be trained without
one or both previous estimators, the requirements have to be commented out in the
``__init__.py`` file of tracking.

For all estimators, a list of variables to be ignored is specified in the MasterTask.
The current choice is mainly based on poor data-MC agreement in these quantities or
on outdated implementations. It was decided to leave them hardcoded in the "ugly" way
in here to remind future generations that they exist in principle and that they should
and could be added to the estimator once their modelling improves or an
alternative implementation is programmed.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ Python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
its parameters, its output files and which other tasks, with which parameters,
it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a ROOT
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
a data collection task, because it needs a dataset for testing the classifier.

The final task that defines which tasks need to be done for the steering file to
finish is the ``MasterTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the
``MasterTask`` or replace them by lower-level tasks during debugging.
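
The dependency idea described above can be illustrated without b2luigi. The
following is a minimal plain-Python sketch, not the b2luigi API: the task names
and the ``run_order`` helper are hypothetical, and the requirements are assumed
to contain no cycles. It only mimics how a scheduler resolves all requirements
of a final task before running it.

```python
# Hypothetical task names; each one maps to the list of tasks it requires.
REQUIREMENTS = {
    "collect_training_data": [],
    "collect_testing_data": [],
    "teacher": ["collect_training_data"],
    "evaluation": ["teacher", "collect_testing_data"],
}


def run_order(final_task, requirements):
    # Return all tasks needed for final_task, with dependencies listed first.
    order = []

    def visit(task):
        if task in order:
            return  # already scheduled
        for required in requirements[task]:
            visit(required)
        order.append(task)

    visit(final_task)
    return order


print(run_order("evaluation", REQUIREMENTS))
```

Requesting only ``"teacher"`` instead of ``"evaluation"`` schedules only the
training data collection and the teacher, which corresponds to replacing a
requirement of the final task by a lower-level task.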

Requirements
~~~~~~~~~~~~

This steering file relies on b2luigi_ for task scheduling and `uncertain_panda
<https://github.com/nils-braun/uncertain_panda>`_ for uncertainty calculations.
uncertain_panda is not in the externals, and b2luigi is not included in the
externals up to v01-07-01. Both can be installed via pip::

    python3 -m pip install [--user] b2luigi uncertain_panda

Use the ``--user`` option if you do not have the rights to install Python packages
into your externals (e.g. because you are using cvmfs) and install them in
``$HOME/.local`` instead.

Configuration
~~~~~~~~~~~~~

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.
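
As an illustration only, the sketch below shows how such a JSON settings file
could be parsed with the Python standard library. Only ``result_path`` is
referred to in this documentation; the key ``n_events_training`` and all values
are hypothetical placeholders. b2luigi reads the settings file itself, so this
code is not needed to run the steering file.

```python
import json

# Hypothetical content of a settings.json file. "result_path" is referred to
# in this documentation; "n_events_training" is a made-up example key.
example_settings = '{"result_path": "results", "n_events_training": 1000}'

settings = json.loads(example_settings)
result_path = settings["result_path"]
# Fall back to a default if an optional key is missing.
n_events = settings.get("n_events_training", 0)
print(result_path, n_events)
```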

Usage
~~~~~

You can test the b2luigi task definitions without running them via::

    python3 combined_quality_estimator_teacher.py --dry-run
    python3 combined_quality_estimator_teacher.py --show-output

This will show the outputs and potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_quality_estimator_teacher.py

I usually use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` in case b2luigi has been installed with ``--user``. For
example, run::

    luigid --port 8886

Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_quality_estimator_teacher.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web
browser, you have two options:

    1. Run both the steering file and ``luigid`` remotely and use
       ssh-port-forwarding to your local host. To do so, run on your local
       machine::

            ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

    2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
       local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path``. The
directory tree encodes the used b2luigi parameters. This ensures reproducibility
and makes parameter searches easy. Sometimes, it is hard to find the relevant
output files. You can view the whole directory structure by running ``tree
<result_path>``. Use the unix ``find`` command to find the files that interest
you, e.g.::

    find <result_path> -name "*.pdf" # find all validation plot files
    find <result_path> -name "*.root" # find all ROOT files
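
The same search can be done from Python. This is a minimal sketch using only
the standard library; ``find_outputs`` is a hypothetical helper that is not part
of this steering file, and the temporary directory with its two files only
stands in for a real ``result_path``.

```python
import tempfile
from pathlib import Path


def find_outputs(result_path, pattern):
    # Recursively list files below result_path whose name matches pattern.
    return sorted(p for p in Path(result_path).rglob(pattern) if p.is_file())


# Build a small stand-in directory tree and search it for PDF files.
with tempfile.TemporaryDirectory() as result_path:
    plot_dir = Path(result_path, "some_parameter=1")
    plot_dir.mkdir()
    (plot_dir / "validation.pdf").touch()
    (plot_dir / "records.root").touch()
    found = [p.name for p in find_outputs(result_path, "*.pdf")]

print(found)
```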
"""

import itertools
import os
from pathlib import Path
import shutil
import subprocess
import textwrap
from datetime import datetime
from typing import Iterable

import matplotlib.pyplot as plt
import numpy as np
import uproot
from matplotlib.backends.backend_pdf import PdfPages

import basf2
import basf2_mva
from packaging import version
import background
import simulation
import tracking
import tracking.root_utils as root_utils
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import get_serialized_parameters, get_log_file_dir, create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
    from b2luigi.core.task import Task, ExternalTask
    from b2luigi.basf2_helper.utils import get_basf2_git_hash
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise
try:
    from uncertain_panda import pandas as upd
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="uncertain_panda"))
    raise

# If b2luigi version 0.3.2 or older, it relies on $BELLE2_RELEASE being "head",
# which is not the case in the new externals. A fix has been merged into b2luigi
# via https://github.com/nils-braun/b2luigi/pull/17 and thus should be available
# in future releases.
if (
    version.parse(b2luigi.__version__) <= version.parse("0.3.2") and
    get_basf2_git_hash() is None and
    os.getenv("BELLE2_LOCAL_DIR") is not None
):
    print(f"b2luigi version could not obtain git hash because of a bug not yet fixed in version {b2luigi.__version__}\n"
          "Please install the latest version of b2luigi from github via\n\n"
          "  python3 -m pip install --upgrade [--user] git+https://github.com/nils-braun/b2luigi.git\n")
    raise ImportError

# Utility functions


def create_fbdt_option_string(fast_bdt_option):
    """
    returns a readable string created by the fast_bdt_option array
    """
    return "_nTrees" + str(fast_bdt_option[0]) + "_nCuts" + str(fast_bdt_option[1]) + "_nLevels" + \
        str(fast_bdt_option[2]) + "_shrin" + str(int(round(100*fast_bdt_option[3], 0)))


def createV0momenta(x, mu, beta):
    """
    Copied from Bianca's K_S0 particle gun code: Returns a realistic V0 momentum distribution
    when running over x. Mu and Beta are properties of the function that define center and tails.
    Used for the particle gun simulation code for K_S0 and Lambda_0
    """
    return (1/beta)*np.exp(-(x - mu)/beta) * np.exp(-np.exp(-(x - mu) / beta))


def my_basf2_mva_teacher(
    records_files,
    tree_name,
    weightfile_identifier,
    target_variable="truth",
    exclude_variables=None,
    fast_bdt_option=[200, 8, 3, 0.1]  # nTrees, nCuts, nLevels, shrinkage
):
    """
    My custom wrapper for basf2 mva teacher. Adapted from code in ``trackfindingcdc_teacher``.

    :param records_files: List of files with collected ("recorded") variables to use as training data for the MVA.
    :param tree_name: Name of the TTree in the ROOT file from the ``data_collection_task``
           that contains the training data for the MVA teacher.
    :param weightfile_identifier: Name of the weightfile that is created.
           Should either end in ".xml" for local weightfiles or in ".root" when
           the weightfile later needs to be uploaded as a payload to the conditions
           database.
    :param target_variable: Feature/variable to use as truth label in the quality estimator MVA classifier.
    :param exclude_variables: List of collected variables to not use in the training of the QE MVA classifier,
           in addition to variables containing the "truth" substring, which are excluded by default.
    :param fast_bdt_option: specified fast BDT options, default: [200, 8, 3, 0.1] [nTrees, nCuts, nLevels, shrinkage]
    """
    if exclude_variables is None:
        exclude_variables = []

    weightfile_extension = Path(weightfile_identifier).suffix
    if weightfile_extension not in {".xml", ".root"}:
        raise ValueError(f"Weightfile Identifier should end in .xml or .root, but ends in {weightfile_extension}")

    # extract names of all variables from one record file
    with root_utils.root_open(records_files[0]) as records_tfile:
        input_tree = records_tfile.Get(tree_name)
        feature_names = [leaf.GetName() for leaf in input_tree.GetListOfLeaves()]

    # get list of variables to use for training without MC truth
    truth_free_variable_names = [
        name
        for name in feature_names
        if (
            ("truth" not in name) and
            (name != target_variable) and
            (name not in exclude_variables)
        )
    ]
    # a weight column, if present, is used as event weight and not as input feature
    if "weight" in truth_free_variable_names:
        truth_free_variable_names.remove("weight")
        weight_variable = "weight"
    elif "__weight__" in truth_free_variable_names:
        truth_free_variable_names.remove("__weight__")
        weight_variable = "__weight__"
    else:
        weight_variable = ""

    # Set options for MVA training
    general_options = basf2_mva.GeneralOptions()
    general_options.m_datafiles = basf2_mva.vector(*records_files)
    general_options.m_treename = tree_name
    general_options.m_weight_variable = weight_variable
    general_options.m_identifier = weightfile_identifier
    general_options.m_variables = basf2_mva.vector(*truth_free_variable_names)
    general_options.m_target_variable = target_variable
    fastbdt_options = basf2_mva.FastBDTOptions()

    fastbdt_options.m_nTrees = fast_bdt_option[0]
    fastbdt_options.m_nCuts = fast_bdt_option[1]
    fastbdt_options.m_nLevels = fast_bdt_option[2]
    fastbdt_options.m_shrinkage = fast_bdt_option[3]
    # Train an MVA method and store the weightfile under ``weightfile_identifier``.
    basf2_mva.teacher(general_options, fastbdt_options)


def _my_uncertain_mean(series: upd.Series):
    """
    Temporary workaround for a bug in ``uncertain_panda`` where a ``ValueError`` is
    thrown for ``Series.unc.mean`` if the series is empty. Can be replaced by
    .unc.mean when the issue is fixed.
    https://github.com/nils-braun/uncertain_panda/issues/2
    """
    try:
        return series.unc.mean()
    except ValueError:
        if series.empty:
            return np.nan
        else:
            raise


def get_uncertain_means_for_qi_cuts(df: upd.DataFrame, column: str, qi_cuts: Iterable[float]):
    """
    Return a pandas series with the mean of the dataframe column and its
    uncertainty for each quality indicator cut.

    :param df: Pandas dataframe with at least ``quality_indicator``
        and another numeric ``column``.
    :param column: Column of which we want to aggregate the means
        and uncertainties for different QI cuts
    :param qi_cuts: Iterable of quality indicator minimal thresholds.
    :returns: Series of means and uncertainties with ``qi_cuts`` as index
    """

    uncertain_means = (_my_uncertain_mean(df.query(f"quality_indicator > {qi_cut}")[column])
                       for qi_cut in qi_cuts)
    uncertain_means_series = upd.Series(data=uncertain_means, index=qi_cuts)
    return uncertain_means_series


def plot_with_errobands(uncertain_series,
                        error_band_alpha=0.3,
                        plot_kwargs=None,
                        fill_between_kwargs=None,
                        ax=None):
    """
    Plot an uncertain series with error bands for y-errors
    """
    # avoid mutable default arguments
    if plot_kwargs is None:
        plot_kwargs = {}
    if fill_between_kwargs is None:
        fill_between_kwargs = {}
    if ax is None:
        ax = plt.gca()
    uncertain_series = uncertain_series.dropna()
    ax.plot(uncertain_series.index.values, uncertain_series.nominal_value, **plot_kwargs)
    ax.fill_between(x=uncertain_series.index,
                    y1=uncertain_series.nominal_value - uncertain_series.std_dev,
                    y2=uncertain_series.nominal_value + uncertain_series.std_dev,
                    alpha=error_band_alpha,
                    **fill_between_kwargs)


def format_dictionary(adict, width=80, bullet="•"):
    """
    Helper function to format dictionary to string as a wrapped key-value bullet
    list. Useful to print metadata from dictionaries.

    :param adict: Dictionary to format
    :param width: Characters after which to wrap a key-value line
    :param bullet: Character to begin a key-value line with, e.g. ``-`` for a
        yaml-like string
    """
    # It might be possible to replace this function with yaml.dump, but the current
    # version in the externals does not allow disabling the sorting of the
    # dictionary yet, and I am also not sure if it is wrappable
    return "\n".join(textwrap.fill(f"{bullet} {key}: {value}", width=width)
                     for (key, value) in adict.items())

# Begin definitions of b2luigi task classes


class GenerateSimTask(Basf2PathTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Number of events to generate.
    n_events = b2luigi.IntParameter()
    # Experiment number used for the simulation.
    experiment_number = b2luigi.IntParameter()
    # Random seed string; it also encodes the production mode (e.g. "BBBAR").
    random_seed = b2luigi.Parameter()
    # Directory with the background files used for the overlay.
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    queue = 'l'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        basf2.set_random_seed(self.random_seed)
        path = basf2.create_path()
        if self.experiment_number in [0, 1002, 1003]:
            runNo = 0
        else:
            raise ValueError(
                f"Simulating events with experiment number {self.experiment_number} is not implemented yet.")
        path.add_module(
            "EventInfoSetter", evtNumList=[self.n_events], runList=[runNo], expList=[self.experiment_number]
        )
        # "V0BBBAR" contains the substring "BBBAR", so it has to be checked first;
        # otherwise the V0 branch could never be reached.
        if "V0BBBAR" in self.random_seed:
            path.add_module("EvtGenInput")
            path.add_module("InclusiveParticleChecker", particles=[310, 3122], includeConjugates=True)
        elif "BBBAR" in self.random_seed:
            path.add_module("EvtGenInput")
        else:
            import generators as ge
            # WARNING: There are a few differences in the production of MC13a and b like the following lines
            # as well as ActivatePXD.. and the beamparams for bhabha... I use these from MC13b, not a... :/
            # import beamparameters as bp
            # beamparameters = bp.add_beamparameters(path, "Y4S")
            # beamparameters.param("covVertex", [(14.8e-4)**2, (1.5e-4)**2, (360e-4)**2])
            if "V0STUDY" in self.random_seed:
                # ``elif`` instead of a second ``if``, so that the K_S0 values are
                # not overwritten by the mixed-sample values in the ``else`` branch
                if "V0STUDYKS" in self.random_seed:
                    # Bianca looked at the Ks dists and extracted these values:
                    mu = 0.5
                    beta = 0.2
                    pdgs = [310]  # Ks (has no antiparticle, Klong is different)
                elif "V0STUDYL0" in self.random_seed:
                    # I just made the lambda values up, such that they peak at 0.35 and are slightly shifted to lower values
                    mu = 0.35
                    beta = 0.15  # if this is chosen higher, one needs to make sure not to get values >0 for 0
                    pdgs = [3122, -3122]  # Lambda0
                else:
                    # also these values are made up
                    mu = 0.43
                    beta = 0.18
                    pdgs = [310, 3122, -3122]  # Ks and Lambda0
                # create realistic momentum distribution
                myx = [i*0.01 for i in range(321)]
                myy = [createV0momenta(x, mu, beta) for x in myx]
                polParams = myx + myy
                # define particles that are produced
                pdg_list = pdgs

                particlegun = basf2.register_module('ParticleGun')
                particlegun.param('pdgCodes', pdg_list)
                particlegun.param('nTracks', 8)  # number of particles (not tracks!) that is created in each event
                particlegun.param('momentumGeneration', 'polyline')
                particlegun.param('momentumParams', polParams)
                particlegun.param('thetaGeneration', 'uniformCos')
                particlegun.param('thetaParams', [17, 150])  # [0, 180]) #[17, 150]
                particlegun.param('phiGeneration', 'uniform')
                particlegun.param('phiParams', [0, 360])
                particlegun.param('vertexGeneration', 'fixed')
                particlegun.param('xVertexParams', [0])
                particlegun.param('yVertexParams', [0])
                particlegun.param('zVertexParams', [0])
                path.add_module(particlegun)
            if "BHABHA" in self.random_seed:
                ge.add_babayaganlo_generator(path=path, finalstate='ee', minenergy=0.15, minangle=10.0)
            elif "MUMU" in self.random_seed:
                ge.add_kkmc_generator(path=path, finalstate='mu+mu-')
            elif "YY" in self.random_seed:
                babayaganlo = basf2.register_module('BabayagaNLOInput')
                babayaganlo.param('FinalState', 'gg')
                babayaganlo.param('MaxAcollinearity', 180.0)
                babayaganlo.param('ScatteringAngleRange', [0., 180.])
                babayaganlo.param('FMax', 75000)
                babayaganlo.param('MinEnergy', 0.01)
                babayaganlo.param('Order', 'exp')
                babayaganlo.param('DebugEnergySpread', 0.01)
                babayaganlo.param('Epsilon', 0.00005)
                path.add_module(babayaganlo)
                generatorpreselection = basf2.register_module('GeneratorPreselection')
                generatorpreselection.param('nChargedMin', 0)
                generatorpreselection.param('nChargedMax', 999)
                generatorpreselection.param('MinChargedPt', 0.15)
                generatorpreselection.param('MinChargedTheta', 17.)
                generatorpreselection.param('MaxChargedTheta', 150.)
                generatorpreselection.param('nPhotonMin', 1)
                generatorpreselection.param('MinPhotonEnergy', 1.5)
                generatorpreselection.param('MinPhotonTheta', 15.0)
                generatorpreselection.param('MaxPhotonTheta', 165.0)
                generatorpreselection.param('applyInCMS', True)
                path.add_module(generatorpreselection)
                empty = basf2.create_path()
                generatorpreselection.if_value('!=11', empty)
            elif "EEEE" in self.random_seed:
                ge.add_aafh_generator(path=path, finalstate='e+e-e+e-', preselection=False)
            elif "EEMUMU" in self.random_seed:
                ge.add_aafh_generator(path=path, finalstate='e+e-mu+mu-', preselection=False)
            elif "TAUPAIR" in self.random_seed:
                ge.add_kkmc_generator(path, finalstate='tau+tau-')
            elif "DDBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='ddbar')
            elif "UUBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='uubar')
            elif "SSBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='ssbar')
            elif "CCBAR" in self.random_seed:
                ge.add_continuum_generator(path, finalstate='ccbar')
        # activate simulation of dead/masked pixel and reproduce detector gain, which will be
        # applied at reconstruction level when the data GT is present in the DB chain
        # path.add_module("ActivatePXDPixelMasker")
        # path.add_module("ActivatePXDGainCalibrator")
        # Assumed reconstruction: ``bkg_files`` is used below but not defined in this
        # listing. It is presumably obtained from the ``background`` module like this.
        bkg_files = background.get_background_files(self.bkgfiles_dir)
        # \cond suppress doxygen warning
        if self.experiment_number == 1002:
            # remove KLM because of bug in background files with release 4
            components = ['PXD', 'SVD', 'CDC', 'ECL', 'TOP', 'ARICH', 'TRG']
        else:
            components = None
        # \endcond
        simulation.add_simulation(path, bkgfiles=bkg_files, bkgOverlay=True, components=components)  # , usePXDDataReduction=False)

        path.add_module(
            "RootOutput",
            outputFileName=self.get_output_file_name(self.output_file_name()),
        )
        return path


# I don't use the default MergeTask or similar because they only work if every input file is called the same.
# Additionally, I want to add more features like deleting the original input to save storage space.
class SplitNMergeSimTask(Basf2Task):
    """
    Split the generation of simulated Monte Carlo events into several
    ``GenerateSimTask`` jobs and merge their output files into a single file.
    The input files are removed after merging to save storage space.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    # Total number of events to generate.
    n_events = b2luigi.IntParameter()
    # Experiment number used for the simulation.
    experiment_number = b2luigi.IntParameter()
    # Random seed string; it also encodes the production mode (e.g. "BBBAR").
    random_seed = b2luigi.Parameter()
    # Directory with the background files used for the overlay.
    bkgfiles_dir = b2luigi.Parameter(
        hashed=True
    )

    queue = 'sx'

    def output_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_file_name())

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        n_events_per_task = MasterTask.n_events_per_task
        # one task per full batch of events, plus one task for the remainder
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        for i in range(quotient):
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MasterTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,
                experiment_number=self.experiment_number,
            )
        if remainder > 0:
            yield GenerateSimTask(
                bkgfiles_dir=self.bkgfiles_dir,
                num_processes=MasterTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),
                n_events=remainder,
                experiment_number=self.experiment_number,
            )

    @b2luigi.on_temporary_files
    def process(self):
        """
        When all GenerateSimTasks finished, merge the output.
        """
        create_output_dirs(self)

        file_list = []
        for _, file_names in self.get_input_file_names().items():
            # ``extend`` also works if a key maps to more than one file
            file_list.extend(file_names)
        print("Merge the following files:")
        print(file_list)
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)
        print("Finished merging. Now remove the input files to save space.")
        cmd2 = ["rm", "-f"]
        for input_file in file_list:
            args = cmd2 + [input_file]
            subprocess.check_call(args)


class CheckExistingFile(ExternalTask):
    """
    Task to check if the given file really exists.
    """

    # Name of the file whose existence is checked.
    filename = b2luigi.Parameter()

    def output(self):
        """
        Specify the output to be the file that was just checked.
        """
        from luigi import LocalTarget
        return LocalTarget(self.filename)


class VXDQEDataCollectionTask(Basf2PathTask):
    """
    Collect variables/features from VXDTF2 tracking and write them to a ROOT
    file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the VXD track quality estimator
    """

    # Number of events to process.
    n_events = b2luigi.IntParameter()
    # Experiment number used for the simulation.
    experiment_number = b2luigi.IntParameter()
    # Random seed string; it also encodes the production mode.
    random_seed = b2luigi.Parameter()
    queue = 'l'

    def get_records_file_name(self, n_events=None, random_seed=None):
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if 'vxd' not in random_seed:
            random_seed += '_vxd'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_vxd.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'

    def get_input_files(self, n_events=None, random_seed=None):
        """
        Get input file names depending on the use case: If they already exist, search in
        the corresponding folders, for data check the specified list and if they are created
        in the same run, check for the task that produced them.
        """
        if n_events is None:
            n_events = self.n_events
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                    n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if "USESIM" in self.random_seed or "DATA" in self.random_seed:
            for filename in self.get_input_files():
                yield CheckExistingFile(
                    filename=filename,
                )
        else:
            yield SplitNMergeSimTask(
                bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
                random_seed=self.random_seed,
                n_events=self.n_events,
                experiment_number=self.experiment_number,
            )

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.get_records_file_name())

    def create_path(self):
        """
        Create basf2 path with VXDTF2 tracking and VXD QE data collection.
        """
        path = basf2.create_path()
        inputFileNames = self.get_input_files()
        path.add_module(
            "RootInput",
            inputFileNames=inputFileNames,
        )
        path.add_module("Gearbox")
        tracking.add_geometry_modules(path)
        if 'DATA' in self.random_seed:
            from rawdata import add_unpackers
            add_unpackers(path, components=['SVD', 'PXD'])
        tracking.add_hit_preparation_modules(path)
        tracking.add_vxd_track_finding_vxdtf2(
            path, components=["SVD"], add_mva_quality_indicator=False
        )
        if 'DATA' in self.random_seed:
            path.add_module(
                "VXDQETrainingDataCollector",
                TrainingDataOutputName=self.get_output_file_name(self.get_records_file_name()),
                SpacePointTrackCandsStoreArrayName="SPTrackCands",
                EstimationMethod="tripletFit",
                UseTimingInfo=False,
                ClusterInformation="Average",
                MCStrictQualityEstimator=False,
                mva_target=False,
                MCInfo=False,
            )
        else:
            path.add_module(
                "TrackFinderMCTruthRecoTracks",
                RecoTracksStoreArrayName="MCRecoTracks",
                WhichParticles=[],
                UsePXDHits=False,
                UseSVDHits=True,
                UseCDCHits=False,
            )
            path.add_module(
                "VXDQETrainingDataCollector",
                TrainingDataOutputName=self.get_output_file_name(self.get_records_file_name()),
                SpacePointTrackCandsStoreArrayName="SPTrackCands",
                EstimationMethod="tripletFit",
                UseTimingInfo=False,
                ClusterInformation="Average",
                MCStrictQualityEstimator=True,
                mva_target=False,
            )
        return path


792class CDCQEDataCollectionTask(Basf2PathTask):
793 """
794 Collect variables/features from CDC tracking and write them to a ROOT file.
795
796 These variables are to be used as labelled training data for the MVA
797 classifier which is the CDC track quality estimator
798 """
799
800 n_events = b2luigi.IntParameter()
802 experiment_number = b2luigi.IntParameter()
805 random_seed = b2luigi.Parameter()
807 queue = 'l'
809
810 def get_records_file_name(self, n_events=None, random_seed=None):
811 """
812 Create output file name depending on number of events and production
813 mode that is specified in the random_seed string.
814 """
815 if n_events is None:
816 n_events = self.n_events
817 if random_seed is None:
818 random_seed = self.random_seed
819 if 'cdc' not in random_seed:
820 random_seed += '_cdc'
821 if 'DATA' in random_seed:
822 return 'qe_records_DATA_cdc.root'
823 else:
824 if 'USESIMBB' in random_seed:
825 random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
826 elif 'USESIMEE' in random_seed:
827 random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
828 return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'
829
830 def get_input_files(self, n_events=None, random_seed=None):
831 """
832 Get input file names depending on the use case: If they already exist, search in
833 the corresponding folders, for data check the specified list and if they are created
834 in the same run, check for the task that produced them.
835 """
836 if n_events is None:
837 n_events = self.n_events
838 if random_seed is None:
839 random_seed = self.random_seed
840 if "USESIM" in random_seed:
841 if 'USESIMBB' in random_seed:
842 random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
843 elif 'USESIMEE' in random_seed:
844 random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
845 return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
846 n_events=n_events, random_seed=random_seed)]
847 elif "DATA" in random_seed:
848 return MasterTask.datafiles
849 else:
850 return self.get_input_file_names(GenerateSimTask.output_file_name(
851 GenerateSimTask, n_events=n_events, random_seed=random_seed))
852
853 def requires(self):
854 """
855 Generate list of luigi Tasks that this Task depends on.
856 """
857 if "USESIM" in self.random_seed or "DATA" in self.random_seed:
858 for filename in self.get_input_files():
859 yield CheckExistingFile(
860 filename=filename,
861 )
862 else:
863 yield SplitNMergeSimTask(
864 bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
865 random_seed=self.random_seed,
866 n_events=self.n_events,
867 experiment_number=self.experiment_number,
868 )
869
870 def output(self):
871 """
872 Generate list of output files that the task should produce.
873 The task is considered finished if and only if the outputs all exist.
874 """
875 yield self.add_to_output(self.get_records_file_name())
876
877 def create_path(self):
878 """
879 Create basf2 path with CDC standalone tracking and CDC QE with recording filter for MVA feature collection.
880 """
881 path = basf2.create_path()
882 inputFileNames = self.get_input_files()
883 path.add_module(
884 "RootInput",
885 inputFileNames=inputFileNames,
886 )
887 path.add_module("Gearbox")
888 tracking.add_geometry_modules(path)
889 if 'DATA' in self.random_seed:
890 filter_choice = "recording_data"
891 from rawdata import add_unpackers
892 add_unpackers(path, components=['CDC'])
893 else:
894 filter_choice = "recording"
895 # tracking.add_hit_preparation_modules(path) # only needed for SVD and
896 # PXD hit preparation. Does not change the CDC output.
897 tracking.add_cdc_track_finding(path, with_ca=False, add_mva_quality_indicator=True)
898
899 basf2.set_module_parameters(
900 path,
901 name="TFCDC_TrackQualityEstimator",
902 filter=filter_choice,
903 filterParameters={
904 "rootFileName": self.get_output_file_name(self.get_records_file_name())
905 },
906 )
907 return path
908
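The file-name convention used by ``get_records_file_name`` above can be tried out in isolation. The following is a minimal sketch, not part of the steering file: the function name ``records_file_name`` and its ``suffix`` argument are hypothetical, and the logic is copied from the CDC variant of the method.

```python
def records_file_name(n_events, random_seed, suffix="cdc"):
    """Hypothetical stand-alone copy of the naming logic shown above."""
    # Tag the seed with the sub-detector acronym if it is not there yet.
    if suffix not in random_seed:
        random_seed += "_" + suffix
    # Real data always maps to one fixed file name.
    if "DATA" in random_seed:
        return "qe_records_DATA_" + suffix + ".root"
    # Pre-simulated samples are renamed to their physics process.
    if "USESIMBB" in random_seed:
        random_seed = "BBBAR_" + random_seed.split("_", 1)[1]
    elif "USESIMEE" in random_seed:
        random_seed = "BHABHA_" + random_seed.split("_", 1)[1]
    return "qe_records_N" + str(n_events) + "_" + random_seed + ".root"


if __name__ == "__main__":
    for seed in ("BBBAR_train", "USESIMBB_test", "DATA"):
        print(seed, "->", records_file_name(1000, seed))
```

Note that the ``USESIM`` prefix is replaced rather than kept, so records from pre-simulated and freshly simulated samples of the same process share one name.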
909
910class RecoTrackQEDataCollectionTask(Basf2PathTask):
911 """
912 Collect variables/features from the reco track reconstruction including the
913 fit and write them to a ROOT file.
914
915 These variables are to be used as labelled training data for the MVA
916 classifier which is the MVA track quality estimator. The collected
917 variables include the classifier outputs from the VXD and CDC quality
918 estimators, namely the CDC and VXD quality indicators, combined with fit,
919 merger, timing, energy loss information etc. This task requires the
920 subdetector quality estimators to be trained.
921 """
922
923
924 n_events = b2luigi.IntParameter()
926 experiment_number = b2luigi.IntParameter()
929 random_seed = b2luigi.Parameter()
931 cdc_training_target = b2luigi.Parameter()
935 recotrack_option = b2luigi.Parameter(
937 default='deleteCDCQI080'
938
939 )
940
941 fast_bdt_option = b2luigi.ListParameter(
943 hashed=True, default=[200, 8, 3, 0.1]
944
945 )
946
947 queue = 'l'
949
950 def get_records_file_name(self, n_events=None, random_seed=None, recotrack_option=None):
951 """
952 Create output file name depending on number of events and production
953 mode that is specified in the random_seed string.
954 """
955 if n_events is None:
956 n_events = self.n_events
957 if random_seed is None:
958 random_seed = self.random_seed
959 if recotrack_option is None:
960 recotrack_option = self.recotrack_option
961 if 'rec' not in random_seed:
962 random_seed += '_rec'
963 if 'DATA' in random_seed:
964 return 'qe_records_DATA_rec.root'
965 else:
966 if 'USESIMBB' in random_seed:
967 random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
968 elif 'USESIMEE' in random_seed:
969 random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
970 return 'qe_records_N' + str(n_events) + '_' + random_seed + '_' + recotrack_option + '.root'
971
972 def get_input_files(self, n_events=None, random_seed=None):
973 """
974 Get input file names depending on the use case: If they already exist, search in
975 the corresponding folders, for data check the specified list and if they are created
976 in the same run, check for the task that produced them.
977 """
978 if n_events is None:
979 n_events = self.n_events
980 if random_seed is None:
981 random_seed = self.random_seed
982 if "USESIM" in random_seed:
983 if 'USESIMBB' in random_seed:
984 random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
985 elif 'USESIMEE' in random_seed:
986 random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
987 return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
988 n_events=n_events, random_seed=random_seed)]
989 elif "DATA" in random_seed:
990 return MasterTask.datafiles
991 else:
992 return self.get_input_file_names(GenerateSimTask.output_file_name(
993 GenerateSimTask, n_events=n_events, random_seed=random_seed))
994
995 def requires(self):
996 """
997 Generate list of luigi Tasks that this Task depends on.
998 """
999 if "USESIM" in self.random_seed or "DATA" in self.random_seed:
1000 for filename in self.get_input_files():
1001 yield CheckExistingFile(
1002 filename=filename,
1003 )
1004 else:
1005 yield SplitNMergeSimTask(
1006 bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
1007 random_seed=self.random_seed,
1008 n_events=self.n_events,
1009 experiment_number=self.experiment_number,
1010 )
1011 if "DATA" not in self.random_seed:
1012 if 'useCDC' not in self.recotrack_option and 'noCDC' not in self.recotrack_option:
1013 yield CDCQETeacherTask(
1014 n_events_training=MasterTask.n_events_training,
1015 experiment_number=self.experiment_number,
1016 training_target=self.cdc_training_target,
1017 process_type=self.random_seed.split("_", 1)[0],
1018 exclude_variables=MasterTask.exclude_variables_cdc,
1019 fast_bdt_option=self.fast_bdt_option,
1020 )
1021 if 'useVXD' not in self.recotrack_option and 'noVXD' not in self.recotrack_option:
1022 yield VXDQETeacherTask(
1023 n_events_training=MasterTask.n_events_training,
1024 experiment_number=self.experiment_number,
1025 process_type=self.random_seed.split("_", 1)[0],
1026 exclude_variables=MasterTask.exclude_variables_vxd,
1027 fast_bdt_option=self.fast_bdt_option,
1028 )
1029
1030 def output(self):
1031 """
1032 Generate list of output files that the task should produce.
1033 The task is considered finished if and only if the outputs all exist.
1034 """
1035 yield self.add_to_output(self.get_records_file_name())
1036
1037 def create_path(self):
1038 """
1039 Create basf2 reconstruction path that should mirror the default path
1040 from ``add_tracking_reconstruction()``, but with modules for the VXD QE
1041 and CDC QE application and for collection of variables for the reco
1042 track quality estimator.
1043 """
1044 path = basf2.create_path()
1045 inputFileNames = self.get_input_files()
1046 path.add_module(
1047 "RootInput",
1048 inputFileNames=inputFileNames,
1049 )
1050 path.add_module("Gearbox")
1051
1052 # First add tracking reconstruction with default quality estimation modules
1053 mvaCDC = True
1054 mvaVXD = True
1055 if 'noCDC' in self.recotrack_option:
1056 mvaCDC = False
1057 if 'noVXD' in self.recotrack_option:
1058 mvaVXD = False
1059 if 'DATA' in self.random_seed:
1060 from rawdata import add_unpackers
1061 add_unpackers(path)
1062 tracking.add_tracking_reconstruction(path, add_cdcTrack_QI=mvaCDC, add_vxdTrack_QI=mvaVXD, add_recoTrack_QI=True)
1063
1064 # if data shall be processed check if newly trained mva files are available. Otherwise use default ones (CDB payloads):
1065 # if useCDC/VXD is specified, use the identifier lying in datafiles/ Otherwise, replace weightfile identifiers from defaults
1066 # (CDB payloads) to new weightfiles created by this b2luigi script
1067 if ('DATA' in self.random_seed or 'useCDC' in self.recotrack_option) and 'noCDC' not in self.recotrack_option:
1068 cdc_identifier = 'datafiles/' + \
1069 CDCQETeacherTask.get_weightfile_xml_identifier(CDCQETeacherTask, fast_bdt_option=self.fast_bdt_option)
1070 if os.path.exists(cdc_identifier):
1071 replace_cdc_qi = True
1072 elif 'useCDC' in self.recotrack_option:
1073 raise ValueError(f"CDC QI Identifier not found: {cdc_identifier}")
1074 else:
1075 replace_cdc_qi = False
1076 elif 'noCDC' in self.recotrack_option:
1077 replace_cdc_qi = False
1078 else:
1079 cdc_identifier = self.get_input_file_names(
1080 CDCQETeacherTask.get_weightfile_xml_identifier(
1081 CDCQETeacherTask, fast_bdt_option=self.fast_bdt_option))[0]
1082 replace_cdc_qi = True
1083 if ('DATA' in self.random_seed or 'useVXD' in self.recotrack_option) and 'noVXD' not in self.recotrack_option:
1084 vxd_identifier = 'datafiles/' + \
1085 VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
1086 if os.path.exists(vxd_identifier):
1087 replace_vxd_qi = True
1088 elif 'useVXD' in self.recotrack_option:
1089 raise ValueError(f"VXD QI Identifier not found: {vxd_identifier}")
1090 else:
1091 replace_vxd_qi = False
1092 elif 'noVXD' in self.recotrack_option:
1093 replace_vxd_qi = False
1094 else:
1095 vxd_identifier = self.get_input_file_names(
1096 VXDQETeacherTask.get_weightfile_xml_identifier(
1097 VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option))[0]
1098 replace_vxd_qi = True
1099
1100 cdc_qe_mva_filter_parameters = None
        # If tracks below a certain CDC QI value shall be deleted online, this needs to be specified in the filter parameters.
        # This is also possible with the default (CDB) payloads.
1103 if 'deleteCDCQI' in self.recotrack_option:
1104 cut_index = self.recotrack_option.find('deleteCDCQI') + len('deleteCDCQI')
1105 cut = int(self.recotrack_option[cut_index:cut_index+3])/100.
1106 if replace_cdc_qi:
1107 cdc_qe_mva_filter_parameters = {
1108 "identifier": cdc_identifier, "cut": cut}
1109 else:
1110 cdc_qe_mva_filter_parameters = {
1111 "cut": cut}
1112 elif replace_cdc_qi:
1113 cdc_qe_mva_filter_parameters = {
1114 "identifier": cdc_identifier}
1115 if cdc_qe_mva_filter_parameters is not None:
1116 # if no cut is specified, the default value is at zero and nothing is deleted.
1117 basf2.set_module_parameters(
1118 path,
1119 name="TFCDC_TrackQualityEstimator",
1120 filterParameters=cdc_qe_mva_filter_parameters,
1121 deleteTracks=True,
1122 resetTakenFlag=True
1123 )
1124 if replace_vxd_qi:
1125 basf2.set_module_parameters(
1126 path,
1127 name="VXDQualityEstimatorMVA",
1128 WeightFileIdentifier=vxd_identifier)
1129
        # Replace final quality estimator module by training data collector module
        track_qe_module_name = "TrackQualityEstimatorMVA"
        module_found = False
        new_path = basf2.create_path()
        for module in path.modules():
            if module.name() != track_qe_module_name:
                # name() is a method and has to be called; comparing the bound
                # method itself to a string would never match.
                if module.name() != 'TrackCreator':
                    new_path.add_module(module)
            else:
                # The TrackCreator has to run before the collector so that
                # MDSTTracks are related to RecoTracks and d0 and z0 can be read out.
                new_path.add_module(
                    'TrackCreator',
                    pdgCodes=[211, 321, 2212],
                    recoTrackColName='RecoTracks',
                    trackColName='MDSTTracks')  # , useClosestHitToIP=True, useBFieldAtHit=True)
                new_path.add_module(
                    "TrackQETrainingDataCollector",
                    TrainingDataOutputName=self.get_output_file_name(self.get_records_file_name()),
                    collectEventFeatures=True,
                    SVDPlusCDCStandaloneRecoTracksStoreArrayName="SVDPlusCDCStandaloneRecoTracks",
                )
                module_found = True
        if not module_found:
            raise KeyError(f"No module {track_qe_module_name} found in path")
        path = new_path
        return path
1160
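The ``recotrack_option`` string doubles as a small configuration format: ``create_path`` reads the three digits after ``deleteCDCQI`` as a cut in percent. A minimal sketch of that parsing step, assuming the same three-digit convention; the helper name ``parse_cdc_qi_cut`` is hypothetical and does not exist in the steering file.

```python
def parse_cdc_qi_cut(recotrack_option, key="deleteCDCQI"):
    """Return the CDC QI cut encoded in the option string, or None."""
    if key not in recotrack_option:
        return None
    # The three characters directly after the key encode the cut in percent.
    start = recotrack_option.find(key) + len(key)
    return int(recotrack_option[start:start + 3]) / 100.0


if __name__ == "__main__":
    for option in ("deleteCDCQI080", "useCDC_deleteCDCQI095", "noCDC"):
        print(option, "->", parse_cdc_qi_cut(option))
```

An option such as ``deleteCDCQI80`` (two digits) followed by further text would make ``int()`` fail, so the three-digit, zero-padded form is the safe way to write the cut.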
1161
1162class TrackQETeacherBaseTask(Basf2Task):
1163 """
1164 A teacher task runs the basf2 mva teacher on the training data provided by a
1165 data collection task.
1166
1167 Since teacher tasks are needed for all quality estimators covered by this
1168 steering file and the only thing that changes is the required data
1169 collection task and some training parameters, I decided to use inheritance
1170 and have the basic functionality in this base class/interface and have the
1171 specific teacher tasks inherit from it.
1172 """
1173
1174 n_events_training = b2luigi.IntParameter()
1176 experiment_number = b2luigi.IntParameter()
1180 process_type = b2luigi.Parameter(
1182 default="BBBAR"
1183
1184 )
1185
1186 training_target = b2luigi.Parameter(
1188 default="truth"
1189
1190 )
1191
1193 exclude_variables = b2luigi.ListParameter(
1195 hashed=True, default=[]
1196
1197 )
1198
1199 fast_bdt_option = b2luigi.ListParameter(
1201 hashed=True, default=[200, 8, 3, 0.1]
1202
1203 )
1204
    @property
    def weightfile_identifier_basename(self):
        """
        Property defining the basename for the .xml and .root weightfiles that are created.
        Has to be implemented by the inheriting teacher task class.
        """
        raise NotImplementedError(
            "Teacher Task must define a static weightfile_identifier"
        )
1214
1215 def get_weightfile_xml_identifier(self, fast_bdt_option=None, recotrack_option=None):
1216 """
1217 Name of the xml weightfile that is created by the teacher task.
1218 It is subsequently used as a local weightfile in the following validation tasks.
1219 """
1220 if fast_bdt_option is None:
1221 fast_bdt_option = self.fast_bdt_option
1222 if recotrack_option is None and hasattr(self, 'recotrack_option'):
1223 recotrack_option = self.recotrack_option
1224 else:
1225 recotrack_option = ''
1226 weightfile_details = create_fbdt_option_string(fast_bdt_option)
1227 weightfile_name = self.weightfile_identifier_basename + weightfile_details
1228 if recotrack_option != '':
1229 weightfile_name = weightfile_name + '_' + recotrack_option
1230 return weightfile_name + ".weights.xml"
1231
    @property
    def tree_name(self):
        """
        Property defining the name of the tree in the ROOT file from the
        ``data_collection_task`` that contains the recorded training data. Must
        be implemented by the inheriting specific teacher task class.
        """
        raise NotImplementedError("Teacher Task must define a static tree_name")

    @property
    def random_seed(self):
        """
        Property defining the random seed to be used by the ``GenerateSimTask``.
        Should differ from the random seed in the test data samples. Must
        be implemented by the inheriting specific teacher task class.
        """
        raise NotImplementedError("Teacher Task must define a static random seed")

    @property
    def data_collection_task(self) -> Basf2PathTask:
        """
        Property defining the specific ``DataCollectionTask`` to require. Must
        be implemented by the inheriting specific teacher task class.
        """
        raise NotImplementedError(
            "Teacher Task must define a data collection task to require "
        )
1259
1260 def requires(self):
1261 """
1262 Generate list of luigi Tasks that this Task depends on.
1263 """
1264 if 'USEREC' in self.process_type:
1265 if 'USERECBB' in self.process_type:
1266 process = 'BBBAR'
1267 elif 'USERECEE' in self.process_type:
1268 process = 'BHABHA'
1269 yield CheckExistingFile(
1270 filename='datafiles/qe_records_N' + str(self.n_events_training) + '_' + process + '_' + self.random_seed + '.root',
1271 )
1272 else:
1273 yield self.data_collection_task(
1274 num_processes=MasterTask.num_processes,
1275 n_events=self.n_events_training,
1276 experiment_number=self.experiment_number,
1277 random_seed=self.process_type + '_' + self.random_seed,
1278 )
1279
1280 def output(self):
1281 """
1282 Generate list of output files that the task should produce.
1283 The task is considered finished if and only if the outputs all exist.
1284 """
1285 yield self.add_to_output(self.get_weightfile_xml_identifier())
1286
1287 def process(self):
1288 """
1289 Use basf2_mva teacher to create MVA weightfile from collected training
1290 data variables.
1291
1292 This is the main process that is dispatched by the ``run`` method that
1293 is inherited from ``Basf2Task``.
1294 """
1295 if 'USEREC' in self.process_type:
1296 if 'USERECBB' in self.process_type:
1297 process = 'BBBAR'
1298 elif 'USERECEE' in self.process_type:
1299 process = 'BHABHA'
1300 records_files = ['datafiles/qe_records_N' + str(self.n_events_training) +
1301 '_' + process + '_' + self.random_seed + '.root']
1302 else:
            # get_records_file_name is an instance method called on the task
            # class, so the class is passed explicitly as ``self``.
            if hasattr(self, 'recotrack_option'):
                records_files = self.get_input_file_names(
                    self.data_collection_task.get_records_file_name(
                        self.data_collection_task,
                        n_events=self.n_events_training,
                        random_seed=self.process_type + '_' + self.random_seed,
                        recotrack_option=self.recotrack_option))
            else:
                records_files = self.get_input_file_names(
                    self.data_collection_task.get_records_file_name(
                        self.data_collection_task,
                        n_events=self.n_events_training,
                        random_seed=self.process_type + '_' + self.random_seed))
1316
1317 my_basf2_mva_teacher(
1318 records_files=records_files,
1319 tree_name=self.tree_name,
1320 weightfile_identifier=self.get_output_file_name(self.get_weightfile_xml_identifier()),
1321 target_variable=self.training_target,
1322 exclude_variables=self.exclude_variables,
1323 fast_bdt_option=self.fast_bdt_option,
1324 )
1325
1326
class VXDQETeacherTask(TrackQETeacherBaseTask):
    """
    Task to run basf2 mva teacher on collected data for VXDTF2 track quality estimator
    """

    weightfile_identifier_basename = "vxdtf2_mva_qe"
    tree_name = "tree"
    random_seed = "train_vxd"
    data_collection_task = VXDQEDataCollectionTask
1342
class CDCQETeacherTask(TrackQETeacherBaseTask):
    """
    Task to run basf2 mva teacher on collected data for CDC track quality estimator
    """

    weightfile_identifier_basename = "cdc_mva_qe"
    tree_name = "records"
    random_seed = "train_cdc"
    data_collection_task = CDCQEDataCollectionTask
1358
class RecoTrackQETeacherTask(TrackQETeacherBaseTask):
    """
    Task to run basf2 mva teacher on collected data for the final, combined
    track quality estimator
    """

    recotrack_option = b2luigi.Parameter(
        default='deleteCDCQI080'
    )

    weightfile_identifier_basename = "recotrack_mva_qe"
    tree_name = "tree"
    random_seed = "train_rec"
    data_collection_task = RecoTrackQEDataCollectionTask
    cdc_training_target = b2luigi.Parameter()

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        if 'USEREC' in self.process_type:
            if 'USERECBB' in self.process_type:
                process = 'BBBAR'
            elif 'USERECEE' in self.process_type:
                process = 'BHABHA'
            yield CheckExistingFile(
                filename='datafiles/qe_records_N' + str(self.n_events_training) + '_' + process + '_' + self.random_seed + '.root',
            )
        else:
            yield self.data_collection_task(
                cdc_training_target=self.cdc_training_target,
                num_processes=MasterTask.num_processes,
                n_events=self.n_events_training,
                experiment_number=self.experiment_number,
                random_seed=self.process_type + '_' + self.random_seed,
                recotrack_option=self.recotrack_option,
                fast_bdt_option=self.fast_bdt_option,
            )
1408
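The teacher tasks above rely on a plain Python pattern: the base class declares each required setting as a property that raises ``NotImplementedError``, and a subclass satisfies it with an ordinary class attribute of the same name. A minimal sketch with hypothetical stand-in classes (``TeacherBase`` and ``CDCTeacher`` are not the real b2luigi tasks):

```python
class TeacherBase:
    """Stand-in for a base task that declares required settings."""

    @property
    def tree_name(self):
        raise NotImplementedError("Teacher Task must define a static tree_name")

    def describe(self):
        # Works for any subclass that provides ``tree_name``.
        return f"training on tree '{self.tree_name}'"


class CDCTeacher(TeacherBase):
    # A plain class attribute shadows the inherited property.
    tree_name = "records"


if __name__ == "__main__":
    print(CDCTeacher().describe())
    # The attribute is also readable on the class itself, without an instance.
    print(CDCTeacher.tree_name)
    # Passing the class in place of ``self`` therefore works as well.
    print(TeacherBase.describe(CDCTeacher))
```

Because the overriding value lives on the class, methods can be called with the class itself in place of an instance, which is how this file uses calls such as ``CDCQETeacherTask.get_weightfile_xml_identifier(CDCQETeacherTask, ...)``.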
1409
1410class HarvestingValidationBaseTask(Basf2PathTask):
1411 """
1412 Run track reconstruction with MVA quality estimator and write out
1413 (="harvest") a root file with variables useful for the validation.
1414 """
1415
1416
1417 n_events_testing = b2luigi.IntParameter()
1419 n_events_training = b2luigi.IntParameter()
1421 experiment_number = b2luigi.IntParameter()
1425 process_type = b2luigi.Parameter(
1427 default="BBBAR"
1428
1429 )
1430
1432 exclude_variables = b2luigi.ListParameter(
1434 hashed=True
1435
1436 )
1437
1438 fast_bdt_option = b2luigi.ListParameter(
1440 hashed=True, default=[200, 8, 3, 0.1]
1441
1442 )
1443
1444 validation_output_file_name = "harvesting_validation.root"
1446 reco_output_file_name = "reconstruction.root"
1448 components = None
1450 @property
1451 def teacher_task(self) -> TrackQETeacherBaseTask:
1452 """
1453 Teacher task to require to provide a quality estimator weightfile for ``add_tracking_with_quality_estimation``
1454 """
1455 raise NotImplementedError()
1456
1457 def add_tracking_with_quality_estimation(self, path: basf2.Path) -> None:
1458 """
1459 Add modules for track reconstruction to basf2 path that are to be
1460 validated. Besides track finding it should include MC matching, fitted
1461 track creation and a quality estimator module.
1462 """
1463 raise NotImplementedError()
1464
1465 def requires(self):
1466 """
1467 Generate list of luigi Tasks that this Task depends on.
1468 """
1469 yield self.teacher_task(
1470 n_events_training=self.n_events_training,
1471 experiment_number=self.experiment_number,
1472 process_type=self.process_type,
1473 exclude_variables=self.exclude_variables,
1474 fast_bdt_option=self.fast_bdt_option,
1475 )
1476 if 'USE' in self.process_type: # USESIM and USEREC
1477 if 'BB' in self.process_type:
1478 process = 'BBBAR'
1479 elif 'EE' in self.process_type:
1480 process = 'BHABHA'
1481 yield CheckExistingFile(
1482 filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
1483 )
1484 else:
1485 yield SplitNMergeSimTask(
1486 bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
1487 random_seed=self.process_type + '_test',
1488 n_events=self.n_events_testing,
1489 experiment_number=self.experiment_number,
1490 )
1491
1492 def output(self):
1493 """
1494 Generate list of output files that the task should produce.
1495 The task is considered finished if and only if the outputs all exist.
1496 """
1497 yield self.add_to_output(self.validation_output_file_name)
1498 yield self.add_to_output(self.reco_output_file_name)
1499
1500 def create_path(self):
1501 """
1502 Create a basf2 path that uses ``add_tracking_with_quality_estimation()``
1503 and adds the ``CombinedTrackingValidationModule`` to write out variables
1504 for validation.
1505 """
1506 # prepare track finding
1507 path = basf2.create_path()
1508 if 'USE' in self.process_type:
1509 if 'BB' in self.process_type:
1510 process = 'BBBAR'
1511 elif 'EE' in self.process_type:
1512 process = 'BHABHA'
1513 inputFileNames = ['datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root']
1514 else:
1515 inputFileNames = self.get_input_file_names(GenerateSimTask.output_file_name(
1516 GenerateSimTask, n_events=self.n_events_testing, random_seed=self.process_type + '_test'))
1517 path.add_module(
1518 "RootInput",
1519 inputFileNames=inputFileNames,
1520 )
1521 path.add_module("Gearbox")
1522 tracking.add_geometry_modules(path)
1523 tracking.add_hit_preparation_modules(path) # only needed for simulated hits
        # add track finding module that needs to be validated
        self.add_tracking_with_quality_estimation(path)
        # add modules for validation
        path.add_module(
            CombinedTrackingValidationModule(
                name=None,
                contact=None,
                expert_level=200,
                output_file_name=self.get_output_file_name(
                    self.validation_output_file_name
                ),
            )
        )
1537 path.add_module(
1538 "RootOutput",
1539 outputFileName=self.get_output_file_name(self.reco_output_file_name),
1540 )
1541 return path
1542
1543
class VXDQEHarvestingValidationTask(HarvestingValidationBaseTask):
    """
    Run VXDTF2 track reconstruction and write out (="harvest") a root file with
    variables useful for validation of the VXD Quality Estimator.
    """

    validation_output_file_name = "vxd_qe_harvesting_validation.root"
    reco_output_file_name = "vxd_qe_reconstruction.root"
    teacher_task = VXDQETeacherTask

    def add_tracking_with_quality_estimation(self, path):
        """
        Add modules for VXDTF2 tracking with VXD quality estimator to basf2 path.
        """
        tracking.add_vxd_track_finding_vxdtf2(
            path,
            components=["SVD"],
            reco_tracks="RecoTracks",
            add_mva_quality_indicator=True,
        )
        # Replace the weightfile of the quality estimator module by the one
        # produced in this training by b2luigi
        basf2.set_module_parameters(
            path,
            name="VXDQualityEstimatorMVA",
            WeightFileIdentifier=self.get_input_file_names(
                self.teacher_task.get_weightfile_xml_identifier(self.teacher_task, fast_bdt_option=self.fast_bdt_option)
            )[0],
        )
        tracking.add_mc_matcher(path, components=["SVD"])
        tracking.add_track_fit_and_track_creator(path, components=["SVD"])
1578
1579
class CDCQEHarvestingValidationTask(HarvestingValidationBaseTask):
    """
    Run CDC reconstruction and write out (="harvest") a root file with variables
    useful for validation of the CDC Quality Estimator.
    """

    training_target = b2luigi.Parameter()
    validation_output_file_name = "cdc_qe_harvesting_validation.root"
    reco_output_file_name = "cdc_qe_reconstruction.root"
    teacher_task = CDCQETeacherTask

    # overload needed due to specific training target
    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield self.teacher_task(
            n_events_training=self.n_events_training,
            experiment_number=self.experiment_number,
            process_type=self.process_type,
            training_target=self.training_target,
            exclude_variables=self.exclude_variables,
            fast_bdt_option=self.fast_bdt_option,
        )
1607 if 'USE' in self.process_type: # USESIM and USEREC
1608 if 'BB' in self.process_type:
1609 process = 'BBBAR'
1610 elif 'EE' in self.process_type:
1611 process = 'BHABHA'
1612 yield CheckExistingFile(
1613 filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
1614 )
1615 else:
1616 yield SplitNMergeSimTask(
1617 bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
1618 random_seed=self.process_type + '_test',
1619 n_events=self.n_events_testing,
1620 experiment_number=self.experiment_number,
1621 )
1622
    def add_tracking_with_quality_estimation(self, path):
        """
        Add modules for CDC standalone tracking with CDC quality estimator to basf2 path.
        """
        tracking.add_cdc_track_finding(
            path,
            output_reco_tracks="RecoTracks",
            add_mva_quality_indicator=True,
        )
1632 # change weightfile of quality estimator to the one produced by this training script
1633 cdc_qe_mva_filter_parameters = {
1634 "identifier": self.get_input_file_names(
1635 CDCQETeacherTask.get_weightfile_xml_identifier(
1636 CDCQETeacherTask,
1637 fast_bdt_option=self.fast_bdt_option))[0]}
1638 basf2.set_module_parameters(
1639 path,
1640 name="TFCDC_TrackQualityEstimator",
1641 filterParameters=cdc_qe_mva_filter_parameters,
1642 )
1643 tracking.add_mc_matcher(path, components=["CDC"])
1644 tracking.add_track_fit_and_track_creator(path, components=["CDC"])
1645
1646
class RecoTrackQEHarvestingValidationTask(HarvestingValidationBaseTask):
    """
    Run track reconstruction and write out (="harvest") a root file with variables
    useful for validation of the MVA track Quality Estimator.
    """

    cdc_training_target = b2luigi.Parameter()
    validation_output_file_name = "reco_qe_harvesting_validation.root"
    reco_output_file_name = "reco_qe_reconstruction.root"
    teacher_task = RecoTrackQETeacherTask

1661 def requires(self):
1662 """
1663 Generate list of luigi Tasks that this Task depends on.
1664 """
1665 yield CDCQETeacherTask(
1666 n_events_training=self.n_events_training,
1667 experiment_number=self.experiment_number,
1668 process_type=self.process_type,
1669 training_target=self.cdc_training_target,
1670 exclude_variables=MasterTask.exclude_variables_cdc,
1671 fast_bdt_option=self.fast_bdt_option,
1672 )
1673 yield VXDQETeacherTask(
1674 n_events_training=self.n_events_training,
1675 experiment_number=self.experiment_number,
1676 process_type=self.process_type,
1677 exclude_variables=MasterTask.exclude_variables_vxd,
1678 fast_bdt_option=self.fast_bdt_option,
1679 )
1680
        yield self.teacher_task(
            n_events_training=self.n_events_training,
            experiment_number=self.experiment_number,
            process_type=self.process_type,
            exclude_variables=self.exclude_variables,
            cdc_training_target=self.cdc_training_target,
            fast_bdt_option=self.fast_bdt_option,
        )
1689 if 'USE' in self.process_type: # USESIM and USEREC
1690 if 'BB' in self.process_type:
1691 process = 'BBBAR'
1692 elif 'EE' in self.process_type:
1693 process = 'BHABHA'
1694 yield CheckExistingFile(
1695 filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
1696 )
1697 else:
1698 yield SplitNMergeSimTask(
1699 bkgfiles_dir=MasterTask.bkgfiles_by_exp[self.experiment_number],
1700 random_seed=self.process_type + '_test',
1701 n_events=self.n_events_testing,
1702 experiment_number=self.experiment_number,
1703 )
1704
    def add_tracking_with_quality_estimation(self, path):
        """
        Add modules for reco tracking with all track quality estimators to basf2 path.
        """

        # add tracking reconstruction with quality estimator modules added
        tracking.add_tracking_reconstruction(
            path,
            add_cdcTrack_QI=True,
            add_vxdTrack_QI=True,
            add_recoTrack_QI=True,
            skipGeometryAdding=True,
            skipHitPreparerAdding=False,
        )
1719
1720 # Replace the weightfiles of all quality estimator modules by those
1721 # produced in the training by b2luigi
1722 cdc_qe_mva_filter_parameters = {
1723 "identifier": self.get_input_file_names(
1724 CDCQETeacherTask.get_weightfile_xml_identifier(
1725 CDCQETeacherTask,
1726 fast_bdt_option=self.fast_bdt_option))[0]}
1727 basf2.set_module_parameters(
1728 path,
1729 name="TFCDC_TrackQualityEstimator",
1730 filterParameters=cdc_qe_mva_filter_parameters,
1731 )
1732 basf2.set_module_parameters(
1733 path,
1734 name="VXDQualityEstimatorMVA",
1735 WeightFileIdentifier=self.get_input_file_names(
1736 VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
1737 )[0],
1738 )
1739 basf2.set_module_parameters(
1740 path,
1741 name="TrackQualityEstimatorMVA",
1742 WeightFileIdentifier=self.get_input_file_names(
1743 RecoTrackQETeacherTask.get_weightfile_xml_identifier(RecoTrackQETeacherTask, fast_bdt_option=self.fast_bdt_option)
1744 )[0],
1745 )
1746
1747
1748class TrackQEEvaluationBaseTask(Task):
1749 """
1750 Base class for evaluating a quality estimator ``basf2_mva_evaluate.py`` on a
1751 separate test data set.
1752
1753 Evaluation tasks for VXD, CDC and combined QE can inherit from it.
1754 """
1755
1756
1761 git_hash = b2luigi.Parameter(
1763 default=get_basf2_git_hash()
1764
1765 )
1766
1767 n_events_testing = b2luigi.IntParameter()
1769 n_events_training = b2luigi.IntParameter()
1771 experiment_number = b2luigi.IntParameter()
1775 process_type = b2luigi.Parameter(
1777 default="BBBAR"
1778
1779 )
1780
1781 training_target = b2luigi.Parameter(
1783 default="truth"
1784
1785 )
1786
1788 exclude_variables = b2luigi.ListParameter(
1790 hashed=True
1791
1792 )
1793
1794 fast_bdt_option = b2luigi.ListParameter(
1796 hashed=True, default=[200, 8, 3, 0.1]
1797
1798 )
1799
1800 @property
1801 def teacher_task(self) -> TrackQETeacherBaseTask:
1802 """
1803 Property defining specific teacher task to require.
1804 """
1805 raise NotImplementedError(
1806 "Evaluation Tasks must define a teacher task to require "
1807 )
1808
1809 @property
1810 def data_collection_task(self) -> Basf2PathTask:
1811 """
        Property defining the specific ``DataCollectionTask`` to require. Must
        be implemented by the inheriting specific evaluation task class.
1814 """
1815 raise NotImplementedError(
1816 "Evaluation Tasks must define a data collection task to require "
1817 )
1818
1819 @property
1820 def task_acronym(self):
1821 """
1822 Acronym to distinguish between cdc, vxd and rec(o) MVA
1823 """
1824 raise NotImplementedError(
1825 "Evaluation Tasks must define a task acronym."
1826 )
1827
1828 def requires(self):
1829 """
1830 Generate list of luigi Tasks that this Task depends on.
1831 """
1832 yield self.teacher_task(
1833 n_events_training=self.n_events_training,
1834 experiment_number=self.experiment_number,
1835 process_type=self.process_type,
1836 training_target=self.training_target,
1837 exclude_variables=self.exclude_variables,
1838 fast_bdt_option=self.fast_bdt_option,
1839 )
1840 if 'USEREC' in self.process_type:
1841 if 'USERECBB' in self.process_type:
1842 process = 'BBBAR'
1843 elif 'USERECEE' in self.process_type:
1844 process = 'BHABHA'
1845 yield CheckExistingFile(
1846 filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
1847 self.task_acronym + '.root'
1848 )
1849 else:
1850 yield self.data_collection_task(
1851 num_processes=MasterTask.num_processes,
1852 n_events=self.n_events_testing,
1853 experiment_number=self.experiment_number,
1854 random_seed=self.process_type + '_test',
1855 )
1856
1857 def output(self):
1858 """
1859 Generate list of output files that the task should produce.
1860 The task is considered finished if and only if the outputs all exist.
1861 """
1862 weightfile_details = create_fbdt_option_string(self.fast_bdt_option)
1863 evaluation_pdf_output = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
1864 yield self.add_to_output(evaluation_pdf_output)
1865
1866 @b2luigi.on_temporary_files
1867 def run(self):
1868 """
1869 Run ``basf2_mva_evaluate.py`` subprocess to evaluate QE MVA.
1870
1871 The MVA weight file created from training on the training data set is
1872 evaluated on separate test data.
1873 """
1874 weightfile_details = create_fbdt_option_string(self.fast_bdt_option)
1875 evaluation_pdf_output_basename = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
1876
1877 evaluation_pdf_output_path = self.get_output_file_name(evaluation_pdf_output_basename)
1878
1879 if 'USEREC' in self.process_type:
1880 if 'USERECBB' in self.process_type:
1881 process = 'BBBAR'
1882 elif 'USERECEE' in self.process_type:
1883 process = 'BHABHA'
1884 datafiles = 'datafiles/qe_records_N' + str(self.n_events_testing) + '_' + \
1885 process + '_test_' + self.task_acronym + '.root'
1886 else:
1887 datafiles = self.get_input_file_names(
1888 self.data_collection_task.get_records_file_name(
1890 n_events=self.n_events_testing,
1891 random_seed=self.process_type + '_test_' +
1892 self.task_acronym))[0]
1893 cmd = [
1894 "basf2_mva_evaluate.py",
1895 "--identifiers",
1896 self.get_input_file_names(
1897 self.teacher_task.get_weightfile_xml_identifier(
1898 self.teacher_task,
1899 fast_bdt_option=self.fast_bdt_option))[0],
1900 "--datafiles",
1901 datafiles,
1902 "--treename",
1903 self.teacher_task.tree_name,
1904 "--outputfile",
1905 evaluation_pdf_output_path,
1906 ]
1907
1908 # Prepare log files
1909 log_file_dir = get_log_file_dir(self)
1910 # Create the log directory if it does not exist yet. This seems to be necessary, as this task does not
1911 # inherit properly from b2luigi and thus the directory is not created automatically.
1912 try:
1913 os.makedirs(log_file_dir, exist_ok=True)
1914 # With exist_ok=True, a FileExistsError is only raised if the path exists but is not a directory,
1915 # so this handler should rarely trigger. A PermissionError would still propagate.
1916 except FileExistsError:
1917 print('Directory ' + log_file_dir + ' already exists.')
1918 stderr_log_file_path = os.path.join(log_file_dir, "stderr")
1919 stdout_log_file_path = os.path.join(log_file_dir, "stdout")
1920 with open(stdout_log_file_path, "w") as stdout_file:
1921 stdout_file.write(f'stdout output of the command:\n{" ".join(cmd)}\n\n')
1922 if os.path.exists(stderr_log_file_path):
1923 # remove stderr file if it already exists b/c in the following it will be opened in appending mode
1924 os.remove(stderr_log_file_path)
1925
1926 # Run evaluation via subprocess and write output into logfiles
1927 with open(stdout_log_file_path, "a") as stdout_file:
1928 with open(stderr_log_file_path, "a") as stderr_file:
1929 try:
1930 subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
1931 except subprocess.CalledProcessError as err:
1932 stderr_file.write(f"Evaluation failed with error:\n{err}")
1933 raise err
1934
1935
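The logging pattern of the ``run`` method above can be tried out in isolation. The following is only a minimal sketch under stated assumptions: the helper name `run_with_logs`, the temporary directory and the demo command are hypothetical and not part of the steering file. It shows a header being written to the stdout log before the subprocess appends its own output, with stderr going to a separate file.

```python
import os
import subprocess
import sys
import tempfile


def run_with_logs(cmd, log_dir):
    """Run ``cmd`` and write its stdout and stderr to files inside ``log_dir``.

    Hypothetical helper for illustration, not part of the steering file.
    """
    os.makedirs(log_dir, exist_ok=True)
    stdout_path = os.path.join(log_dir, "stdout")
    stderr_path = os.path.join(log_dir, "stderr")
    with open(stdout_path, "w") as stdout_file, open(stderr_path, "w") as stderr_file:
        stdout_file.write(f"stdout output of the command:\n{' '.join(cmd)}\n\n")
        # Flush the header so that the output of the child process lands after it
        stdout_file.flush()
        try:
            subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
        except subprocess.CalledProcessError as err:
            stderr_file.write(f"Command failed with error:\n{err}")
            raise
    return stdout_path, stderr_path


# Demo with a harmless command instead of basf2_mva_evaluate.py
demo_dir = tempfile.mkdtemp()
demo_stdout, demo_stderr = run_with_logs([sys.executable, "-c", "print('hello')"], demo_dir)
```

Passing the file object as ``stdout`` (and not as ``stdin``) is what makes the output of the command end up in the log file.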
1937 """
1938 Run ``basf2_mva_evaluate.py`` for the VXD quality estimator on separate test data
1939 """
1940
1942 teacher_task = VXDQETeacherTask
1945 data_collection_task = VXDQEDataCollectionTask
1948 task_acronym = 'vxd'
1950
1952 """
1953 Run ``basf2_mva_evaluate.py`` for the CDC quality estimator on separate test data
1954 """
1955
1957 teacher_task = CDCQETeacherTask
1960 data_collection_task = CDCQEDataCollectionTask
1963 task_acronym = 'cdc'
1965
1967 """
1968 Run ``basf2_mva_evaluate.py`` for the final, combined quality estimator on
1969 separate test data
1970 """
1971
1973 teacher_task = RecoTrackQETeacherTask
1976 data_collection_task = RecoTrackQEDataCollectionTask
1979 task_acronym = 'rec'
1981 cdc_training_target = b2luigi.Parameter()
1983 def requires(self):
1984 """
1985 Generate list of luigi Tasks that this Task depends on.
1986 """
1987 yield self.teacher_task(
1988 n_events_training=self.n_events_training,
1989 experiment_number=self.experiment_number,
1990 process_type=self.process_type,
1991 training_target=self.training_target,
1992 exclude_variables=self.exclude_variables,
1993 cdc_training_target=self.cdc_training_target,
1994 fast_bdt_option=self.fast_bdt_option,
1995 )
1996 if 'USEREC' in self.process_type:
1997 if 'USERECBB' in self.process_type:
1998 process = 'BBBAR'
1999 elif 'USERECEE' in self.process_type:
2000 process = 'BHABHA'
2001 yield CheckExistingFile(
2002 filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
2003 self.task_acronym + '.root'
2004 )
2005 else:
2006 yield self.data_collection_task(
2007 num_processes=MasterTask.num_processes,
2008 n_events=self.n_events_testing,
2009 experiment_number=self.experiment_number,
2010 random_seed=self.process_type + "_test",
2011 cdc_training_target=self.cdc_training_target,
2012 )
2013
2014
2016 """
2017 Create a PDF file with validation plots for a quality estimator produced
2018 from the ROOT ntuples produced by a harvesting validation task
2019 """
2020
2021 n_events_testing = b2luigi.IntParameter()
2023 n_events_training = b2luigi.IntParameter()
2025 experiment_number = b2luigi.IntParameter()
2029 process_type = b2luigi.Parameter(
2031 default="BBBAR"
2032
2033 )
2034
2036 exclude_variables = b2luigi.ListParameter(
2038 hashed=True
2039
2040 )
2041
2042 fast_bdt_option = b2luigi.ListParameter(
2044 hashed=True, default=[200, 8, 3, 0.1]
2045
2046 )
2047
2048 primaries_only = b2luigi.BoolParameter(
2050 default=True
2051
2052 ) # normalize finding efficiencies to primary MC-tracks
2053
2054 @property
2055 def harvesting_validation_task_instance(self) -> HarvestingValidationBaseTask:
2056 """
2057 Specifies related harvesting validation task which produces the ROOT
2058 files with the data that is plotted by this task.
2059 """
2060 raise NotImplementedError("Must define a QI harvesting validation task for which to do the plots")
2061
2062 @property
2063 def output_pdf_file_basename(self):
2064 """
2065 Name of the output PDF file containing the validation plots
2066 """
2067 validation_harvest_basename = self.harvesting_validation_task_instance.validation_output_file_name
2068 return validation_harvest_basename.replace(".root", "_plots.pdf")
2069
2070 def requires(self):
2071 """
2072 Generate list of luigi Tasks that this Task depends on.
2073 """
2074 yield self.harvesting_validation_task_instance
2075
2076 def output(self):
2077 """
2078 Generate list of output files that the task should produce.
2079 The task is considered finished if and only if the outputs all exist.
2080 """
2081 yield self.add_to_output(self.output_pdf_file_basename)
2082
2083 @b2luigi.on_temporary_files
2084 def process(self):
2085 """
2086 Create the validation plots from the ROOT ntuples produced by the
2087 harvesting validation task and save them into a single PDF file.
2088
2089 Main process that is dispatched by the ``run`` method that is inherited
2090 from ``Basf2Task``.
2091 """
2092 # get the validation "harvest", which is the ROOT file with ntuples for validation
2093 validation_harvest_basename = self.harvesting_validation_task_instance.validation_output_file_name
2094 validation_harvest_path = self.get_input_file_names(validation_harvest_basename)[0]
2095
2096 # Load "harvested" validation data from root files into dataframes (requires enough memory to hold data)
2097 pr_columns = [ # Restrict memory usage by only reading in columns that are used in the steering file
2098 'is_fake', 'is_clone', 'is_matched', 'quality_indicator',
2099 'experiment_number', 'run_number', 'event_number', 'pr_store_array_number',
2100 'pt_estimate', 'z0_estimate', 'd0_estimate', 'tan_lambda_estimate',
2101 'phi0_estimate', 'pt_truth', 'z0_truth', 'd0_truth', 'tan_lambda_truth',
2102 'phi0_truth',
2103 ]
2104 # In ``pr_df`` each row corresponds to a track from Pattern Recognition
2105 pr_df = uproot.open(validation_harvest_path)['pr_tree/pr_tree'].arrays(pr_columns, library='pd')
2106 mc_columns = [ # restrict mc_df to these columns
2107 'experiment_number',
2108 'run_number',
2109 'event_number',
2110 'pr_store_array_number',
2111 'is_missing',
2112 'is_primary',
2113 ]
2114 # In ``mc_df`` each row corresponds to an MC track
2115 mc_df = uproot.open(validation_harvest_path)['mc_tree/mc_tree'].arrays(mc_columns, library='pd')
2116 if self.primaries_only:
2117 mc_df = mc_df[mc_df.is_primary.eq(True)]
2118
2119 # Define QI thresholds for the FOM plots and the ROC curves
2120 qi_cuts = np.linspace(0., 1, 20, endpoint=False)
2121 # # Add more points at the very end between the previous maximum and 1
2122 # qi_cuts = np.append(qi_cuts, np.linspace(np.max(qi_cuts), 1, 20, endpoint=False))
2123
2124 # Create plots and append them to single output pdf
2125
2126 output_pdf_file_path = self.get_output_file_name(self.output_pdf_file_basename)
2127 with PdfPages(output_pdf_file_path, keep_empty=False) as pdf:
2128
2129 # Add a title page to validation plot PDF with some metadata
2130 # Remember that most metadata is in the xml file of the weightfile
2131 # and in the b2luigi directory structure
2132 titlepage_fig, titlepage_ax = plt.subplots()
2133 titlepage_ax.axis("off")
2134 title = f"Quality Estimator validation plots from {self.__class__.__name__}"
2135 titlepage_ax.set_title(title)
2136 teacher_task = self.harvesting_validation_task_instance.teacher_task
2137 weightfile_identifier = teacher_task.get_weightfile_xml_identifier(teacher_task, fast_bdt_option=self.fast_bdt_option)
2138 meta_data = {
2139 "Date": datetime.today().strftime("%Y-%m-%d %H:%M"),
2140 "Created by steering file": os.path.realpath(__file__),
2141 "Created from data in": validation_harvest_path,
2142 "Background directory": MasterTask.bkgfiles_by_exp[self.experiment_number],
2143 "weight file": weightfile_identifier,
2144 }
2145 if hasattr(self, 'exclude_variables'):
2146 meta_data["Excluded variables"] = ", ".join(self.exclude_variables)
2147 meta_data_string = (format_dictionary(meta_data) +
2148 "\n\n(For all MVA training parameters look into the produced weight file)")
2149 luigi_params = get_serialized_parameters(self)
2150 luigi_param_string = (f"\n\nb2luigi parameters for {self.__class__.__name__}\n" +
2151 format_dictionary(luigi_params))
2152 title_page_text = meta_data_string + luigi_param_string
2153 titlepage_ax.text(0, 1, title_page_text, ha="left", va="top", wrap=True, fontsize=8)
2154 pdf.savefig(titlepage_fig)
2155 plt.close(titlepage_fig)
2156
2157 fake_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_fake", qi_cuts)
2158 fake_fig, fake_ax = plt.subplots()
2159 fake_ax.set_title("Fake rate")
2160 plot_with_errobands(fake_rates, ax=fake_ax)
2161 fake_ax.set_ylabel("fake rate")
2162 fake_ax.set_xlabel("quality indicator requirement")
2163 pdf.savefig(fake_fig, bbox_inches="tight")
2164 plt.close(fake_fig)
2165
2166 # Plot clone rates
2167 clone_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_clone", qi_cuts)
2168 clone_fig, clone_ax = plt.subplots()
2169 clone_ax.set_title("Clone rate")
2170 plot_with_errobands(clone_rates, ax=clone_ax)
2171 clone_ax.set_ylabel("clone rate")
2172 clone_ax.set_xlabel("quality indicator requirement")
2173 pdf.savefig(clone_fig, bbox_inches="tight")
2174 plt.close(clone_fig)
2175
2176 # Plot finding efficiency
2177
2178 # The Quality Indicator is only available in pr_tree and thus the
2179 # PR-track dataframe. To get the QI of the related PR track for an MC
2180 # track, merge the PR dataframe into the MC dataframe
2181 pr_track_identifiers = ['experiment_number', 'run_number', 'event_number', 'pr_store_array_number']
2182 mc_df = upd.merge(
2183 left=mc_df, right=pr_df[pr_track_identifiers + ['quality_indicator']],
2184 how='left',
2185 on=pr_track_identifiers
2186 )
2187
2188 missing_fractions = (
2189 _my_uncertain_mean(mc_df[
2190 mc_df.quality_indicator.isnull() | (mc_df.quality_indicator > qi_cut)]['is_missing'])
2191 for qi_cut in qi_cuts
2192 )
2193
2194 findeff_fig, findeff_ax = plt.subplots()
2195 findeff_ax.set_title("Finding efficiency")
2196 finding_efficiencies = 1.0 - upd.Series(data=missing_fractions, index=qi_cuts)
2197 plot_with_errobands(finding_efficiencies, ax=findeff_ax)
2198 findeff_ax.set_ylabel("finding efficiency")
2199 findeff_ax.set_xlabel("quality indicator requirement")
2200 pdf.savefig(findeff_fig, bbox_inches="tight")
2201 plt.close(findeff_fig)
2202
2203 # Plot ROC curves
2204
2205 # Fake rate vs. finding efficiency ROC curve
2206 fake_roc_fig, fake_roc_ax = plt.subplots()
2207 fake_roc_ax.set_title("Fake rate vs. finding efficiency ROC curve")
2208 fake_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=fake_rates.nominal_value,
2209 xerr=finding_efficiencies.std_dev, yerr=fake_rates.std_dev, elinewidth=0.8)
2210 fake_roc_ax.set_xlabel('finding efficiency')
2211 fake_roc_ax.set_ylabel('fake rate')
2212 pdf.savefig(fake_roc_fig, bbox_inches="tight")
2213 plt.close(fake_roc_fig)
2214
2215 # Clone rate vs. finding efficiency ROC curve
2216 clone_roc_fig, clone_roc_ax = plt.subplots()
2217 clone_roc_ax.set_title("Clone rate vs. finding efficiency ROC curve")
2218 clone_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=clone_rates.nominal_value,
2219 xerr=finding_efficiencies.std_dev, yerr=clone_rates.std_dev, elinewidth=0.8)
2220 clone_roc_ax.set_xlabel('finding efficiency')
2221 clone_roc_ax.set_ylabel('clone rate')
2222 pdf.savefig(clone_roc_fig, bbox_inches="tight")
2223 plt.close(clone_roc_fig)
2224
2225 # Plot kinematic distributions
2226
2227 # Fewer QI cuts are used for the kinematic plots, as each cut will be its own subplot now and not a point.
2228 # The cuts themselves are defined right before the plotting loop below.
2229
2230 # Define kinematic parameters which we want to histogram and define
2231 # dictionaries relating them to latex labels, units and binnings
2232 params = ['d0', 'z0', 'pt', 'tan_lambda', 'phi0']
2233 label_by_param = {
2234 "pt": "$p_T$",
2235 "z0": "$z_0$",
2236 "d0": "$d_0$",
2237 "tan_lambda": r"$\tan{\lambda}$",
2238 "phi0": r"$\phi_0$"
2239 }
2240 unit_by_param = {
2241 "pt": "GeV",
2242 "z0": "cm",
2243 "d0": "cm",
2244 "tan_lambda": "rad",
2245 "phi0": "rad"
2246 }
2247 n_kinematic_bins = 75 # number of bins per kinematic variable
2248 bins_by_param = {
2249 "pt": np.linspace(0, np.percentile(pr_df['pt_truth'].dropna(), 95), n_kinematic_bins),
2250 "z0": np.linspace(-0.1, 0.1, n_kinematic_bins),
2251 "d0": np.linspace(0, 0.01, n_kinematic_bins),
2252 "tan_lambda": np.linspace(-2, 3, n_kinematic_bins),
2253 "phi0": np.linspace(0, 2 * np.pi, n_kinematic_bins)
2254 }
2255
2256 # Iterate over each parameter and for each make stacked histograms for different QI cuts
2257 kinematic_qi_cuts = [0, 0.5, 0.8]
2258 blue, yellow, green = plt.get_cmap("tab10").colors[0:3]
2259 for param in params:
2260 fig, axarr = plt.subplots(ncols=len(kinematic_qi_cuts), sharey=True, sharex=True, figsize=(14, 6))
2261 fig.suptitle(f"{label_by_param[param]} distributions")
2262 for i, qi in enumerate(kinematic_qi_cuts):
2263 ax = axarr[i]
2264 ax.set_title(f"QI > {qi}")
2265 incut = pr_df[(pr_df['quality_indicator'] > qi)]
2266 incut_matched = incut[incut.is_matched.eq(True)]
2267 incut_clones = incut[incut.is_clone.eq(True)]
2268 incut_fake = incut[incut.is_fake.eq(True)]
2269
2270 # if any series is empty, skip this subplot and don't try to draw a stacked histogram
2271 if any(series.empty for series in (incut, incut_matched, incut_clones, incut_fake)):
2272 ax.text(0.5, 0.5, "Not enough data in bin", ha="center", va="center", transform=ax.transAxes)
2273 continue
2274
2275 bins = bins_by_param[param]
2276 stacked_histogram_series_tuple = (
2277 incut_matched[f'{param}_estimate'],
2278 incut_clones[f'{param}_estimate'],
2279 incut_fake[f'{param}_estimate'],
2280 )
2281 histvals, _, _ = ax.hist(stacked_histogram_series_tuple,
2282 stacked=True,
2283 bins=bins, range=(bins.min(), bins.max()),
2284 color=(blue, green, yellow),
2285 label=("matched", "clones", "fakes"))
2286 ax.set_xlabel(f'{label_by_param[param]} estimate / ({unit_by_param[param]})')
2287 ax.set_ylabel('# tracks')
2288 axarr[0].legend(loc="upper center", bbox_to_anchor=(0, -0.15))
2289 pdf.savefig(fig, bbox_inches="tight")
2290 plt.close(fig)
2291
2292
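The finding-efficiency calculation in ``process`` above can be illustrated without ROOT files. This is a minimal sketch using plain Python dictionaries in place of the pandas left merge. The function name `finding_efficiency_per_cut`, the `pr_id` key and the toy tracks are assumptions made for illustration. As in the listing, an MC track enters the selection for a given cut if it has no related PR track (no QI) or if the QI of its PR track exceeds the cut. The efficiency is one minus the fraction of missing tracks in that selection.

```python
def finding_efficiency_per_cut(mc_tracks, qi_by_pr_track, qi_cuts):
    """Return a dict mapping each QI cut to the finding efficiency.

    Hypothetical helper: ``mc_tracks`` is a list of dicts with the keys
    ``pr_id`` (identifier of the related PR track or None) and ``is_missing``.
    """
    efficiencies = {}
    for qi_cut in qi_cuts:
        selected = []
        for track in mc_tracks:
            # Dictionary lookup plays the role of the left merge: None if no PR track
            qi = qi_by_pr_track.get(track["pr_id"])
            if qi is None or qi > qi_cut:
                selected.append(track)
        n_missing = sum(1 for track in selected if track["is_missing"])
        efficiencies[qi_cut] = 1.0 - n_missing / len(selected) if selected else float("nan")
    return efficiencies


# Toy data: three found MC tracks with different QIs and one missing MC track
toy_qi_by_pr_track = {"a": 0.95, "b": 0.4, "c": 0.7}
toy_mc_tracks = [
    {"pr_id": "a", "is_missing": False},
    {"pr_id": "b", "is_missing": False},
    {"pr_id": None, "is_missing": True},
    {"pr_id": "c", "is_missing": False},
]
toy_efficiencies = finding_efficiency_per_cut(toy_mc_tracks, toy_qi_by_pr_track, [0.0, 0.5, 0.9])
```

Note that in this definition MC tracks whose PR track fails the cut are removed from the selection and are not counted as missing.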
2294 """
2295 Create a PDF file with validation plots for the VXDTF2 track quality
2296 estimator produced from the ROOT ntuples produced by a VXDTF2 track QE
2297 harvesting validation task
2298 """
2299
2300 @property
2302 """
2303 Harvesting validation task to require, which produces the ROOT files
2304 with variables to produce the VXD QE validation plots.
2305 """
2307 n_events_testing=self.n_events_testing,
2308 n_events_training=self.n_events_training,
2309 process_type=self.process_type,
2310 experiment_number=self.experiment_number,
2311 exclude_variables=self.exclude_variables,
2312 num_processes=MasterTask.num_processes,
2313 fast_bdt_option=self.fast_bdt_option,
2314 )
2315
2316
2318 """
2319 Create a PDF file with validation plots for the CDC track quality estimator
2320 produced from the ROOT ntuples produced by a CDC track QE harvesting
2321 validation task
2322 """
2323
2324 training_target = b2luigi.Parameter()
2326 @property
2328 """
2329 Harvesting validation task to require, which produces the ROOT files
2330 with variables to produce the CDC QE validation plots.
2331 """
2333 n_events_testing=self.n_events_testing,
2334 n_events_training=self.n_events_training,
2335 process_type=self.process_type,
2336 experiment_number=self.experiment_number,
2337 training_target=self.training_target,
2338 exclude_variables=self.exclude_variables,
2339 num_processes=MasterTask.num_processes,
2340 fast_bdt_option=self.fast_bdt_option,
2341 )
2342
2343
2345 """
2346 Create a PDF file with validation plots for the reco MVA track quality
2347 estimator produced from the ROOT ntuples produced by a reco track QE
2348 harvesting validation task
2349 """
2350
2351 cdc_training_target = b2luigi.Parameter()
2353 @property
2355 """
2356 Harvesting validation task to require, which produces the ROOT files
2357 with variables to produce the final MVA track QE validation plots.
2358 """
2360 n_events_testing=self.n_events_testing,
2361 n_events_training=self.n_events_training,
2362 process_type=self.process_type,
2363 experiment_number=self.experiment_number,
2364 cdc_training_target=self.cdc_training_target,
2365 exclude_variables=self.exclude_variables,
2366 num_processes=MasterTask.num_processes,
2367 fast_bdt_option=self.fast_bdt_option,
2368 )
2369
2370
2371class QEWeightsLocalDBCreatorTask(Basf2Task):
2372 """
2373 Collect weightfile identifiers from different teacher tasks and merge them
2374 into a local database for testing.
2375 """
2376
2377 n_events_training = b2luigi.IntParameter()
2379 experiment_number = b2luigi.IntParameter()
2383 process_type = b2luigi.Parameter(
2385 default="BBBAR"
2386
2387 )
2388
2389 cdc_training_target = b2luigi.Parameter()
2391 fast_bdt_option = b2luigi.ListParameter(
2393 hashed=True, default=[200, 8, 3, 0.1]
2394
2395 )
2396
2397 def requires(self):
2398 """
2399 Required teacher tasks
2400 """
2401 yield VXDQETeacherTask(
2402 n_events_training=self.n_events_training,
2403 process_type=self.process_type,
2404 experiment_number=self.experiment_number,
2405 exclude_variables=MasterTask.exclude_variables_vxd,
2406 fast_bdt_option=self.fast_bdt_option,
2407 )
2408 yield CDCQETeacherTask(
2409 n_events_training=self.n_events_training,
2410 process_type=self.process_type,
2411 experiment_number=self.experiment_number,
2412 training_target=self.cdc_training_target,
2413 exclude_variables=MasterTask.exclude_variables_cdc,
2414 fast_bdt_option=self.fast_bdt_option,
2415 )
2416 yield RecoTrackQETeacherTask(
2417 n_events_training=self.n_events_training,
2418 process_type=self.process_type,
2419 experiment_number=self.experiment_number,
2420 cdc_training_target=self.cdc_training_target,
2421 exclude_variables=MasterTask.exclude_variables_rec,
2422 fast_bdt_option=self.fast_bdt_option,
2423 )
2424
2425 def output(self):
2426 """
2427 Local database
2428 """
2429 yield self.add_to_output("localdb.tar")
2430
2431 def process(self):
2432 """
2433 Create local database
2434 """
2435 current_path = Path.cwd()
2436 localdb_archive_path = Path(self.get_output_file_name("localdb.tar")).absolute()
2437 output_dir = localdb_archive_path.parent
2438
2439 # remove existing local databases in output directories
2440 self._clean()
2441 # "Upload" the weightfiles of all 3 teacher tasks into the same localdb
2442 for task in (VXDQETeacherTask, CDCQETeacherTask, RecoTrackQETeacherTask):
2443 # Extract xml identifier input file name before switching working directories, as it returns relative paths
2444 weightfile_xml_identifier_path = os.path.abspath(self.get_input_file_names(
2445 task.get_weightfile_xml_identifier(task, fast_bdt_option=self.fast_bdt_option))[0])
2446 # As localdb is created in working directory, chdir into desired output path
2447 try:
2448 os.chdir(output_dir)
2449 # Same as basf2_mva_upload on the command line, creates localdb directory in current working dir
2450 basf2_mva.upload(
2451 weightfile_xml_identifier_path,
2452 task.weightfile_identifier_basename,
2453 self.experiment_number, 0,
2454 self.experiment_number, -1,
2455 )
2456 finally: # Switch back to working directory of b2luigi, even if upload failed
2457 os.chdir(current_path)
2458
2459 # Pack localdb into tar archive, so that we have one single output file
2460 shutil.make_archive(
2461 base_name=localdb_archive_path.with_suffix('').as_posix(),
2462 format="tar",
2463 root_dir=output_dir,
2464 base_dir="localdb",
2465 verbose=True,
2466 )
2467
2468 def _clean(self):
2469 """
2470 Remove local database and tar archives in output directory
2471 """
2472 localdb_archive_path = Path(self.get_output_file_name("localdb.tar"))
2473 localdb_path = localdb_archive_path.parent / "localdb"
2474
2475 if localdb_path.exists():
2476 print(f"Deleting localdb\n{localdb_path}\nwith contents\n ",
2477 "\n ".join(f.name for f in localdb_path.iterdir()))
2478 shutil.rmtree(localdb_path, ignore_errors=False) # recursively delete localdb
2479
2480 if localdb_archive_path.is_file():
2481 print(f"Deleting {localdb_archive_path}")
2482 os.remove(localdb_archive_path)
2483
2484 def on_failure(self, exception):
2485 """
2486 Cleanup: Remove local database to prevent existing outputs when task did not finish successfully
2487 """
2488 self._clean()
2489 # Run existing on_failure from parent class
2490 super().on_failure(exception)
2491
2492
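The ``try``/``finally`` construct around ``os.chdir`` in ``process`` above can also be expressed as a context manager. The following is a sketch of that alternative, not what the steering file does. The name `working_directory` is hypothetical, and a temporary directory stands in for the task's output directory.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def working_directory(path):
    """Temporarily change the working directory and always switch back.

    Hypothetical helper for illustration, not part of the steering file.
    """
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Executed even if the body of the ``with`` block raises
        os.chdir(previous)


start_dir = os.getcwd()
with tempfile.TemporaryDirectory() as tmp_dir:
    with working_directory(tmp_dir):
        inside_dir = os.getcwd()  # an upload call would create ``localdb`` here
end_dir = os.getcwd()
```

The working directory is restored before the temporary directory is cleaned up, which mirrors the requirement that b2luigi continues in its original directory even if the upload fails.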
2493class MasterTask(b2luigi.WrapperTask):
2494 """
2495 Wrapper task that needs to finish for b2luigi to finish running this steering file.
2496
2497 It is done if the outputs of all required subtasks exist. It is thus at the
2498 top of the luigi task graph. Edit the ``requires`` method to steer which
2499 tasks and with which parameters you want to run.
2500 """
2501
2504 process_type = b2luigi.get_setting(
2506 "process_type", default='BBBAR'
2507
2508 )
2509
2510 n_events_training = b2luigi.get_setting(
2512 "n_events_training", default=20000
2513
2514 )
2515
2516 n_events_testing = b2luigi.get_setting(
2518 "n_events_testing", default=5000
2519
2520 )
2521
2522 n_events_per_task = b2luigi.get_setting(
2524 "n_events_per_task", default=100
2525
2526 )
2527
2528 num_processes = b2luigi.get_setting(
2530 "basf2_processes_per_worker", default=0
2531
2532 )
2533
2534 datafiles = b2luigi.get_setting("datafiles")
2536 bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
2538 bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
2540 exclude_variables_cdc = [
2541 "has_matching_segment",
2542 "size",
2543 "n_tracks", # not written out per default anyway
2544 "avg_hit_dist",
2545 "cont_layer_mean",
2546 "cont_layer_variance",
2547 "cont_layer_max",
2548 "cont_layer_min",
2549 "cont_layer_first",
2550 "cont_layer_last",
2551 "cont_layer_max_vs_last",
2552 "cont_layer_first_vs_min",
2553 "cont_layer_count",
2554 "cont_layer_occupancy",
2555 "super_layer_mean",
2556 "super_layer_variance",
2557 "super_layer_max_vs_last",
2558 "super_layer_first_vs_min",
2559 "super_layer_occupancy",
2560 "drift_length_mean",
2561 "drift_length_variance",
2562 "drift_length_max",
2563 "drift_length_min",
2564 "drift_length_sum",
2565 "norm_drift_length_mean",
2566 "norm_drift_length_variance",
2567 "norm_drift_length_max",
2568 "norm_drift_length_min",
2569 "norm_drift_length_sum",
2570 "adc_mean",
2571 "adc_variance",
2572 "adc_max",
2573 "adc_min",
2574 "adc_sum",
2575 "tot_mean",
2576 "tot_variance",
2577 "tot_max",
2578 "tot_min",
2579 "tot_sum",
2580 "empty_s_mean",
2581 "empty_s_variance",
2582 "empty_s_max"]
2583
2584 exclude_variables_vxd = [
2585 'energyLoss_max', 'energyLoss_min', 'energyLoss_mean', 'energyLoss_std', 'energyLoss_sum',
2586 'size_max', 'size_min', 'size_mean', 'size_std', 'size_sum',
2587 'seedCharge_max', 'seedCharge_min', 'seedCharge_mean', 'seedCharge_std', 'seedCharge_sum',
2588 'tripletFit_P_Mag', 'tripletFit_P_Eta', 'tripletFit_P_Phi', 'tripletFit_P_X', 'tripletFit_P_Y', 'tripletFit_P_Z']
2589
2590 exclude_variables_rec = [
2591 'background',
2592 'ghost',
2593 'fake',
2594 'clone',
2595 '__experiment__',
2596 '__run__',
2597 '__event__',
2598 'N_RecoTracks',
2599 'N_PXDRecoTracks',
2600 'N_SVDRecoTracks',
2601 'N_CDCRecoTracks',
2602 'N_diff_PXD_SVD_RecoTracks',
2603 'N_diff_SVD_CDC_RecoTracks',
2604 'Fit_Successful',
2605 'Fit_NFailedPoints',
2606 'Fit_Chi2',
2607 'N_TrackPoints_without_KalmanFitterInfo',
2608 'N_Hits_without_TrackPoint',
2609 'SVD_CDC_CDCwall_Chi2',
2610 'SVD_CDC_CDCwall_Pos_diff_Z',
2611 'SVD_CDC_CDCwall_Pos_diff_Pt',
2612 'SVD_CDC_CDCwall_Pos_diff_Theta',
2613 'SVD_CDC_CDCwall_Pos_diff_Phi',
2614 'SVD_CDC_CDCwall_Pos_diff_Mag',
2615 'SVD_CDC_CDCwall_Pos_diff_Eta',
2616 'SVD_CDC_CDCwall_Mom_diff_Z',
2617 'SVD_CDC_CDCwall_Mom_diff_Pt',
2618 'SVD_CDC_CDCwall_Mom_diff_Theta',
2619 'SVD_CDC_CDCwall_Mom_diff_Phi',
2620 'SVD_CDC_CDCwall_Mom_diff_Mag',
2621 'SVD_CDC_CDCwall_Mom_diff_Eta',
2622 'SVD_CDC_POCA_Pos_diff_Z',
2623 'SVD_CDC_POCA_Pos_diff_Pt',
2624 'SVD_CDC_POCA_Pos_diff_Theta',
2625 'SVD_CDC_POCA_Pos_diff_Phi',
2626 'SVD_CDC_POCA_Pos_diff_Mag',
2627 'SVD_CDC_POCA_Pos_diff_Eta',
2628 'SVD_CDC_POCA_Mom_diff_Z',
2629 'SVD_CDC_POCA_Mom_diff_Pt',
2630 'SVD_CDC_POCA_Mom_diff_Theta',
2631 'SVD_CDC_POCA_Mom_diff_Phi',
2632 'SVD_CDC_POCA_Mom_diff_Mag',
2633 'SVD_CDC_POCA_Mom_diff_Eta',
2634 'POCA_Pos_Pt',
2635 'POCA_Pos_Mag',
2636 'POCA_Pos_Phi',
2637 'POCA_Pos_Z',
2638 'POCA_Pos_Theta',
2639 'PXD_QI',
2640 'SVD_FitSuccessful',
2641 'CDC_FitSuccessful',
2642 'pdg_ID',
2643 'pdg_ID_Mother',
2644 'is_Vzero_Daughter',
2645 'is_Primary',
2646 'z0',
2647 'd0',
2648 'seed_Charge',
2649 'Fit_Charge',
2650 'weight_max',
2651 'weight_min',
2652 'weight_mean',
2653 'weight_std',
2654 'weight_median',
2655 'weight_n_zeros',
2656 'weight_firstCDCHit',
2657 'weight_lastSVDHit',
2658 'smoothedChi2_max',
2659 'smoothedChi2_min',
2660 'smoothedChi2_mean',
2661 'smoothedChi2_std',
2662 'smoothedChi2_median',
2663 'smoothedChi2_n_zeros',
2664 'smoothedChi2_firstCDCHit',
2665 'smoothedChi2_lastSVDHit']
2666
2667 def requires(self):
2668 """
2669 Generate list of tasks that need to be done for luigi to finish running
2670 this steering file.
2671 """
2672 cdc_training_targets = [
2673 "truth", # treats clones as background, only best matched CDC tracks are true
2674 # "truth_track_is_matched" # treats clones as signal
2675 ]
2676
2677 fast_bdt_options = []
2678 # possible to run over a chosen hyperparameter space if wanted
2679 # in principle this can be extended to specific options for the three different MVAs
2680 # for i in range(250, 400, 50):
2681 # for j in range(6, 10, 2):
2682 # for k in range(2, 6):
2683 # for l in range(0, 5):
2684 # fast_bdt_options.append([100 + i, j, 3+k, 0.025+l*0.025])
2685 # fast_bdt_options.append([200, 8, 3, 0.1]) # default FastBDT option
2686 fast_bdt_options.append([350, 6, 5, 0.1])
2687
2688 experiment_numbers = b2luigi.get_setting("experiment_numbers")
2689
2690 # iterate over all possible combinations of parameters from the above defined parameter lists
2691 for experiment_number, cdc_training_target, fast_bdt_option in itertools.product(
2692 experiment_numbers, cdc_training_targets, fast_bdt_options
2693 ):
2694 # if test_selected_task is activated, only run the following tasks:
2695 if b2luigi.get_setting("test_selected_task", default=False):
2696 # for process_type in ['BHABHA', 'MUMU', 'TAUPAIR', 'YY', 'EEEE', 'EEMUMU', 'UUBAR', \
2697 # 'DDBAR', 'CCBAR', 'SSBAR', 'BBBAR', 'V0BBBAR', 'V0STUDY']:
2698 for cut in ['000', '070', '090', '095']:
2700 num_processes=self.num_processes,
2701 n_events=self.n_events_testing,
2702 experiment_number=experiment_number,
2703 random_seed=self.process_type + '_test',
2704 recotrack_option='useCDC_noVXD_deleteCDCQI'+cut,
2705 cdc_training_target=cdc_training_target,
2706 fast_bdt_option=fast_bdt_option,
2707 )
2709 num_processes=self.num_processes,
2710 n_events=self.n_events_testing,
2711 experiment_number=experiment_number,
2712 random_seed=self.process_type + '_test',
2713 )
2714 yield CDCQETeacherTask(
2715 n_events_training=self.n_events_training,
2716 process_type=self.process_type,
2717 experiment_number=experiment_number,
2718 exclude_variables=self.exclude_variables_cdc,
2719 training_target=cdc_training_target,
2720 fast_bdt_option=fast_bdt_option,
2721 )
2722 else:
2723 # if data shall be processed, it can neither be trained nor evaluated
2724 if 'DATA' in self.process_type:
2726 num_processes=self.num_processes,
2727 n_events=self.n_events_testing,
2728 experiment_number=experiment_number,
2729 random_seed=self.process_type + '_test',
2730 )
2732 num_processes=self.num_processes,
2733 n_events=self.n_events_testing,
2734 experiment_number=experiment_number,
2735 random_seed=self.process_type + '_test',
2736 )
2738 num_processes=self.num_processes,
2739 n_events=self.n_events_testing,
2740 experiment_number=experiment_number,
2741 random_seed=self.process_type + '_test',
2742 recotrack_option='deleteCDCQI080',
2743 cdc_training_target=cdc_training_target,
2744 fast_bdt_option=fast_bdt_option,
2745 )
2746 else:
2748 n_events_training=self.n_events_training,
2749 process_type=self.process_type,
2750 experiment_number=experiment_number,
2751 cdc_training_target=cdc_training_target,
2752 fast_bdt_option=fast_bdt_option,
2753 )
2754
2755 if b2luigi.get_setting("run_validation_tasks", default=True):
2757 n_events_training=self.n_events_training,
2758 n_events_testing=self.n_events_testing,
2759 process_type=self.process_type,
2760 experiment_number=experiment_number,
2761 cdc_training_target=cdc_training_target,
2762 exclude_variables=self.exclude_variables_rec,
2763 fast_bdt_option=fast_bdt_option,
2764 )
2766 n_events_training=self.n_events_training,
2767 n_events_testing=self.n_events_testing,
2768 process_type=self.process_type,
2769 experiment_number=experiment_number,
2770 exclude_variables=self.exclude_variables_cdc,
2771 training_target=cdc_training_target,
2772 fast_bdt_option=fast_bdt_option,
2773 )
2775 n_events_training=self.n_events_training,
2776 n_events_testing=self.n_events_testing,
2777 process_type=self.process_type,
2778 exclude_variables=self.exclude_variables_vxd,
2779 experiment_number=experiment_number,
2780 fast_bdt_option=fast_bdt_option,
2781 )
2782
2783 if b2luigi.get_setting("run_mva_evaluate", default=True):
2784 # Evaluate trained weightfiles via basf2_mva_evaluate.py on separate test data sets
2785 # requires a latex installation to work
2787 n_events_training=self.n_events_training,
2788 n_events_testing=self.n_events_testing,
2789 process_type=self.process_type,
2790 experiment_number=experiment_number,
2791 cdc_training_target=cdc_training_target,
2792 exclude_variables=self.exclude_variables_rec,
2793 fast_bdt_option=fast_bdt_option,
2794 )
2796 n_events_training=self.n_events_training,
2797 n_events_testing=self.n_events_testing,
2798 process_type=self.process_type,
2799 experiment_number=experiment_number,
2800 exclude_variables=self.exclude_variables_cdc,
2801 fast_bdt_option=fast_bdt_option,
2802 training_target=cdc_training_target,
2803 )
2805 n_events_training=self.n_events_training,
2806 n_events_testing=self.n_events_testing,
2807 process_type=self.process_type,
2808 experiment_number=experiment_number,
2809 exclude_variables=self.exclude_variables_vxd,
2810 fast_bdt_option=fast_bdt_option,
2811 )
2812
2813
if __name__ == "__main__":
    # if n_events_test_on_data is specified to be different from -1 in the settings,
    # then stop after N events (mainly useful to test data reconstruction):
    nEventsTestOnData = b2luigi.get_setting("n_events_test_on_data", default=-1)
    if nEventsTestOnData > 0 and 'DATA' in b2luigi.get_setting("process_type", default="BBBAR"):
        from ROOT import Belle2
        environment = Belle2.Environment.Instance()
        environment.setNumberEventsOverride(nEventsTestOnData)
    # if global tags are specified in the settings, use them:
    # e.g. for data use ["data_reprocessing_prompt", "online"]. Make sure to be up to date here
    globaltags = b2luigi.get_setting("globaltags", default=[])
    if len(globaltags) > 0:
        basf2.conditions.reset()
        for gt in globaltags:
            basf2.conditions.prepend_globaltag(gt)
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(MasterTask(), workers=workers)