combined_module_quality_estimator_teacher
-----------------------------------------
Information on the MVA Track Quality Indicator / Estimator can be found at
https://xwiki.desy.de/xwiki/rest/p/0d3f4.
This Python script is used for the combined training and validation of three
classifiers: the actual final MVA track quality estimator and the two quality
estimators for the intermediate standalone track finders that it depends on.
  - Final MVA track quality estimator:
    The final quality estimator for fully merged and fitted tracks (RecoTracks).
    Its classifier uses features from the track fitting, merger, hit pattern, ...
    But it also uses the outputs from respective intermediate quality
    estimators for the VXD and the CDC track finding as inputs. It provides
    the final quality indicator (QI) exported to the track objects.

  - VXDTF2 track quality estimator:
    MVA quality estimator for the VXD standalone track finding.

  - CDC track quality estimator:
    MVA quality estimator for the CDC standalone track finding.
Each classifier requires a different training data set, and each needs to be
validated on a separate testing data set. Further, the final quality
estimator can only be trained when the trained weights for the intermediate
quality estimators are available. If the final estimator shall be trained without
one or both previous estimators, the requirements have to be commented out in the
__init__.py file of tracking.
For all estimators, a list of variables to be ignored is specified in the MasterTask.
The current choice is mainly based on the data-MC agreement in these quantities or
on outdated implementations. It was decided to leave them in the hardcoded "ugly" way
in here to remind future generations that they exist in principle and that they should and
could be added to the estimator once their modelling becomes better in the future or an
alternative implementation is programmed.
To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.
b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by spotify.
Each task that has to be done is represented by a special class, which defines
parameters, output files and which other tasks with which
parameters it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a root
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.

The final task that defines which tasks need to be done for the steering file to
finish is the ``MasterTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the
``MasterTask`` or replace them by lower-level tasks during debugging.
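
The dependency structure described above can be pictured with a toy model in
plain Python. This is only a sketch: it does not use the b2luigi or luigi API,
and the task names and the ``resolve_order`` helper are hypothetical. It
illustrates that the data collection has to run before the teacher, and the
teacher before the validation.

```python
# Toy model of the task graph (hypothetical names, NOT the b2luigi API).
# Each key is a task, each value the list of tasks it requires.
REQUIREMENTS = {
    "data_collection_train": [],
    "data_collection_test": [],
    "teacher": ["data_collection_train"],
    "validation": ["teacher", "data_collection_test"],
    "master": ["validation"],
}


def resolve_order(task, requirements, done=None):
    """Return the tasks in an order in which every requirement runs first."""
    if done is None:
        done = []
    for required in requirements[task]:
        resolve_order(required, requirements, done)
    if task not in done:
        done.append(task)
    return done


order = resolve_order("master", REQUIREMENTS)
print(order)
```

Commenting out a requirement of the master task corresponds to removing an
entry from the ``"master"`` list in this toy model: everything that is no
longer reachable is simply not run.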
This steering file relies on b2luigi_ for task scheduling and `uncertain_panda
<https://github.com/nils-braun/uncertain_panda>`_ for uncertainty calculations.
uncertain_panda is not in the externals, and b2luigi is not included in them up to
v01-07-01. Both can be installed via pip::

    python3 -m pip install [--user] b2luigi uncertain_panda
Use the ``--user`` option if you do not have the rights to install python packages into
your externals (e.g. because you are using cvmfs); the packages are then installed in
``$HOME/.local`` instead.
Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.
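
As a minimal sketch of what such a file looks like, the snippet below writes a
``settings.json`` and reads it back to check that it is valid JSON. Only the
``result_path`` key is named in this documentation; its value here is a
placeholder, and any further keys your setup needs are not shown.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical example settings: ``result_path`` is the only key taken from
# this documentation, the path itself is a placeholder.
example_settings = {"result_path": "/path/to/results"}

with tempfile.TemporaryDirectory() as tmp_dir:
    settings_file = Path(tmp_dir) / "settings.json"
    settings_file.write_text(json.dumps(example_settings, indent=4))
    # Read the file back; json.loads raises an error on malformed JSON
    loaded_settings = json.loads(settings_file.read_text())

print(loaded_settings["result_path"])
```

Validating the file this way before starting a long task chain catches syntax
errors such as trailing commas early.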
You can test the b2luigi task chain without running it via::

    python3 combined_quality_estimator_teacher.py --dry-run
    python3 combined_quality_estimator_teacher.py --show-output

This will show the outputs and potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_quality_estimator_teacher.py
I usually use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background. It is located in
``~/.local/bin/luigid`` if b2luigi was installed with ``--user``. For
Then, execute your steering file (e.g. in another terminal) with::

    python3 combined_quality_estimator_teacher.py --scheduler-port 8886
To view the web interface, open your web browser and enter into the URL bar::
If you don't run the steering file on the same machine on which you run your
web browser, you have two options:
  1. Run both the steering file and ``luigid`` remotely and use
     ssh-port-forwarding to your local host. To do so, run on your local
     machine::

         ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

  2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
     local host>`` argument when calling the steering file
Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
All output files are stored in a directory structure in the ``result_path``. The
directory tree encodes the used b2luigi parameters. This ensures reproducibility
and makes parameter searches easy. Sometimes, it is hard to find the relevant
output files. You can view the whole directory structure by running ``tree
<result_path>``. Use the unix ``find`` command to find the files that interest
you::

    find <result_path> -name "*.pdf"
    find <result_path> -name "*.root"
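
The same search can be done from Python, for example in a notebook. The sketch
below uses only the standard library; the helper name ``find_output_files`` and
the directory names in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path


def find_output_files(result_path, pattern):
    """Return all files below ``result_path`` matching ``pattern``, sorted by path."""
    return sorted(path for path in Path(result_path).rglob(pattern) if path.is_file())


# Demonstration on a throw-away directory tree (hypothetical directory names)
with tempfile.TemporaryDirectory() as result_path:
    plot_dir = Path(result_path) / "n_events=100" / "plots"
    plot_dir.mkdir(parents=True)
    (plot_dir / "roc.pdf").touch()
    (plot_dir / "records.root").touch()
    pdf_names = [path.name for path in find_output_files(result_path, "*.pdf")]
    root_names = [path.name for path in find_output_files(result_path, "*.root")]

print(pdf_names, root_names)
```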
import os
import subprocess
import textwrap
from pathlib import Path
from datetime import datetime
from typing import Iterable

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.backends.backend_pdf import PdfPages
from packaging import version

import basf2
import basf2_mva
import tracking
import tracking.root_utils as root_utils
from tracking.harvesting_validation.combined_module import CombinedTrackingValidationModule
# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                "  python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import get_serialized_parameters, get_log_file_dir, create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task
    from b2luigi.core.task import Task, ExternalTask
    from b2luigi.basf2_helper.utils import get_basf2_git_hash
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise

try:
    from uncertain_panda import pandas as upd
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="uncertain_panda"))
    raise
if (
    version.parse(b2luigi.__version__) <= version.parse("0.3.2") and
    get_basf2_git_hash() is None and
    os.getenv("BELLE2_LOCAL_DIR") is not None
):
    print(f"b2luigi version could not obtain git hash because of a bug not yet fixed in version {b2luigi.__version__}\n"
          "Please install the latest version of b2luigi from github via\n\n"
          "    python3 -m pip install --upgrade [--user] git+https://github.com/nils-braun/b2luigi.git\n")
def create_fbdt_option_string(fast_bdt_option):
    """
    returns a readable string created by the fast_bdt_option array
    """
    return "_nTrees" + str(fast_bdt_option[0]) + "_nCuts" + str(fast_bdt_option[1]) + "_nLevels" + \
        str(fast_bdt_option[2]) + "_shrin" + str(int(round(100*fast_bdt_option[3], 0)))
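

# Illustrative sketch, not part of the original steering file: a standalone
# copy of the string construction above, to show what the suffix looks like
# for the default FastBDT options [nTrees, nCuts, nLevels, shrinkage].
# The function name is hypothetical. The shrinkage is stored as a rounded
# percentage, so 0.1 becomes "shrin10".
def _example_fbdt_option_string(fast_bdt_option=(200, 8, 3, 0.1)):
    n_trees, n_cuts, n_levels, shrinkage = fast_bdt_option
    shrinkage_percent = int(round(100 * shrinkage, 0))
    return f"_nTrees{n_trees}_nCuts{n_cuts}_nLevels{n_levels}_shrin{shrinkage_percent}"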
def createV0momenta(x, mu, beta):
    """
    Copied from Bianca's K_S0 particle gun code: Returns a realistic V0 momentum distribution
    when running over x. Mu and Beta are properties of the function that define center and tails.
    Used for the particle gun simulation code for K_S0 and Lambda_0
    """
    return (1/beta)*np.exp(-(x - mu)/beta) * np.exp(-np.exp(-(x - mu) / beta))
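

# Illustrative sketch, not part of the original steering file: the function
# above has the form of a Gumbel probability density, which peaks at x = mu
# with the value 1 / (e * beta). The helper below samples it on the same grid
# as the particle gun code and concatenates x and y values into one parameter
# list. The function name and the values mu=0.5, beta=0.2 are hypothetical
# placeholders, not the tuned values of the steering file.
import math


def _example_v0_momentum_polyline(mu=0.5, beta=0.2):
    def density(x):
        z = (x - mu) / beta
        return (1 / beta) * math.exp(-z) * math.exp(-math.exp(-z))

    grid = [i * 0.01 for i in range(321)]  # momenta from 0 to 3.2 in steps of 0.01
    values = [density(x) for x in grid]
    return grid + values  # x values first, then the corresponding y values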
def my_basf2_mva_teacher(
    records_files,
    tree_name,
    weightfile_identifier,
    target_variable="truth",
    exclude_variables=None,
    fast_bdt_option=[200, 8, 3, 0.1]
):
    """
    My custom wrapper for basf2 mva teacher. Adapted from code in ``trackfindingcdc_teacher``.

    :param records_files: List of files with collected ("recorded") variables to use as training data for the MVA.
    :param tree_name: Name of the TTree in the ROOT file from the ``data_collection_task``
           that contains the training data for the MVA teacher.
    :param weightfile_identifier: Name of the weightfile that is created.
           Should either end in ".xml" for local weightfiles or in ".root", when
           the weightfile needs later to be uploaded as a payload to the conditions
           database.
    :param target_variable: Feature/variable to use as truth label in the quality estimator MVA classifier.
    :param exclude_variables: List of collected variables to not use in the training of the QE MVA classifier.
           In addition to variables containing the "truth" substring, which are excluded by default.
    :param fast_bdt_option: specified fast BDT options, default: [200, 8, 3, 0.1] [nTrees, nCuts, nLevels, shrinkage]
    """
    if exclude_variables is None:
        exclude_variables = []

    weightfile_extension = Path(weightfile_identifier).suffix
    if weightfile_extension not in {".xml", ".root"}:
        raise ValueError(f"Weightfile Identifier should end in .xml or .root, but ends in {weightfile_extension}")

    # read the names of all collected variables from the first records file
    with root_utils.root_open(records_files[0]) as records_tfile:
        input_tree = records_tfile.Get(tree_name)
        feature_names = [leave.GetName() for leave in input_tree.GetListOfLeaves()]

    # drop truth information, the target and explicitly excluded variables
    truth_free_variable_names = [
        name
        for name in feature_names
        if (
            ("truth" not in name) and
            (name != target_variable) and
            (name not in exclude_variables)
        )
    ]
    if "weight" in truth_free_variable_names:
        truth_free_variable_names.remove("weight")
        weight_variable = "weight"
    elif "__weight__" in truth_free_variable_names:
        truth_free_variable_names.remove("__weight__")
        weight_variable = "__weight__"
    else:
        # no weight column was collected; avoids an undefined name below
        weight_variable = ""
    general_options = basf2_mva.GeneralOptions()
    general_options.m_datafiles = basf2_mva.vector(*records_files)
    general_options.m_treename = tree_name
    general_options.m_weight_variable = weight_variable
    general_options.m_identifier = weightfile_identifier
    general_options.m_variables = basf2_mva.vector(*truth_free_variable_names)
    general_options.m_target_variable = target_variable
    fastbdt_options = basf2_mva.FastBDTOptions()

    fastbdt_options.m_nTrees = fast_bdt_option[0]
    fastbdt_options.m_nCuts = fast_bdt_option[1]
    fastbdt_options.m_nLevels = fast_bdt_option[2]
    fastbdt_options.m_shrinkage = fast_bdt_option[3]

    basf2_mva.teacher(general_options, fastbdt_options)
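

# Illustrative sketch, not part of the original steering file: the variable
# selection of the teacher wrapper, reproduced on plain Python lists so that
# it can be tried without ROOT or basf2. The function name is hypothetical,
# and the empty-string fallback for a missing weight column is an assumption.
def _example_select_training_variables(feature_names, target_variable="truth", exclude_variables=()):
    selected = [
        name for name in feature_names
        if "truth" not in name and name != target_variable and name not in exclude_variables
    ]
    weight_variable = ""
    for candidate in ("weight", "__weight__"):
        if candidate in selected:
            # the weight column is passed separately and must not be a feature
            selected.remove(candidate)
            weight_variable = candidate
            break
    return selected, weight_variable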
def _my_uncertain_mean(series: upd.Series):
    """
    Temporary workaround for a bug in ``uncertain_panda`` where a ``ValueError`` is
    thrown for ``Series.unc.mean`` if the series is empty. Can be replaced by
    .unc.mean when the issue is fixed.
    https://github.com/nils-braun/uncertain_panda/issues/2
    """
    try:
        return series.unc.mean()
    except ValueError:
        # only swallow the error for the known empty-series case
        if series.empty:
            return np.nan
        else:
            raise
def get_uncertain_means_for_qi_cuts(df: upd.DataFrame, column: str, qi_cuts: Iterable[float]):
    """
    Return a pandas series with the mean of the dataframe column and its
    uncertainty for each quality indicator cut.

    :param df: Pandas dataframe with at least ``quality_indicator``
        and another numeric ``column``.
    :param column: Column of which we want to aggregate the means
        and uncertainties for different QI cuts
    :param qi_cuts: Iterable of quality indicator minimal thresholds.
    :returns: Series of means and uncertainties with ``qi_cuts`` as index
    """

    uncertain_means = (_my_uncertain_mean(df.query(f"quality_indicator > {qi_cut}")[column])
                       for qi_cut in qi_cuts)
    uncertain_means_series = upd.Series(data=uncertain_means, index=qi_cuts)
    return uncertain_means_series
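

# Illustrative sketch, not part of the original steering file: the idea of
# "mean of a column for every minimal QI threshold" in plain Python, without
# pandas or uncertain_panda. The standard error of the mean is swapped in as
# a simple uncertainty estimate; it is not necessarily what uncertain_panda
# computes. The function name and the column name in the usage are hypothetical.
import statistics


def _example_means_for_qi_cuts(rows, column, qi_cuts):
    """rows: list of dicts with a "quality_indicator" key and ``column``."""
    result = {}
    for qi_cut in qi_cuts:
        values = [row[column] for row in rows if row["quality_indicator"] > qi_cut]
        if not values:
            result[qi_cut] = None  # empty selection, nothing to average
            continue
        mean = statistics.fmean(values)
        sem = statistics.stdev(values) / len(values) ** 0.5 if len(values) > 1 else float("nan")
        result[qi_cut] = (mean, sem)
    return result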
def plot_with_errobands(uncertain_series,
                        error_band_alpha=0.3,
                        plot_kwargs={},
                        fill_between_kwargs={},
                        ax=None):
    """
    Plot an uncertain series with error bands for y-errors
    """
    if ax is None:
        ax = plt.gca()
    uncertain_series = uncertain_series.dropna()
    ax.plot(uncertain_series.index.values, uncertain_series.nominal_value, **plot_kwargs)
    ax.fill_between(x=uncertain_series.index,
                    y1=uncertain_series.nominal_value - uncertain_series.std_dev,
                    y2=uncertain_series.nominal_value + uncertain_series.std_dev,
                    alpha=error_band_alpha,
                    **fill_between_kwargs)
def format_dictionary(adict, width=80, bullet="•"):
    """
    Helper function to format dictionary to string as a wrapped key-value bullet
    list. Useful to print metadata from dictionaries.

    :param adict: Dictionary to format
    :param width: Characters after which to wrap a key-value line
    :param bullet: Character to begin a key-value line with, e.g. ``-`` for a
    """
    return "\n".join(textwrap.fill(f"{bullet} {key}: {value}", width=width)
                     for (key, value) in adict.items())
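

# Illustrative sketch, not part of the original steering file: a standalone
# copy of the formatting logic above, to show the resulting bullet list for a
# small metadata dictionary. The function name is hypothetical.
import textwrap


def _example_format_dictionary(adict, width=80, bullet="•"):
    return "\n".join(textwrap.fill(f"{bullet} {key}: {value}", width=width)
                     for key, value in adict.items())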
class GenerateSimTask(Basf2PathTask):
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    n_events = b2luigi.IntParameter()
    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    bkgfiles_dir = b2luigi.Parameter(
        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
    def create_path(self):
        """
        Create basf2 path to process with event generation and simulation.
        """
        path = basf2.create_path()
                f"Simulating events with experiment number {self.experiment_number} is not implemented yet.")
            path.add_module("EvtGenInput")

            path.add_module("EvtGenInput")
            path.add_module("InclusiveParticleChecker", particles=[310, 3122], includeConjugates=True)

            import generators as ge
                pdgs = [310, 3122, -3122]
                myx = [i*0.01 for i in range(321)]
                    y = createV0momenta(x, mu, beta)
                polParams = myx + myy

                particlegun = basf2.register_module('ParticleGun')
                particlegun.param('pdgCodes', pdg_list)
                particlegun.param('nTracks', 8)
                particlegun.param('momentumGeneration', 'polyline')
                particlegun.param('momentumParams', polParams)
                particlegun.param('thetaGeneration', 'uniformCos')
                particlegun.param('thetaParams', [17, 150])
                particlegun.param('phiGeneration', 'uniform')
                particlegun.param('phiParams', [0, 360])
                particlegun.param('vertexGeneration', 'fixed')
                particlegun.param('xVertexParams', [0])
                particlegun.param('yVertexParams', [0])
                particlegun.param('zVertexParams', [0])
                path.add_module(particlegun)
                ge.add_babayaganlo_generator(path=path, finalstate='ee', minenergy=0.15, minangle=10.0)

                ge.add_kkmc_generator(path=path, finalstate='mu+mu-')

                babayaganlo = basf2.register_module('BabayagaNLOInput')
                babayaganlo.param('FinalState', 'gg')
                babayaganlo.param('MaxAcollinearity', 180.0)
                babayaganlo.param('ScatteringAngleRange', [0., 180.])
                babayaganlo.param('FMax', 75000)
                babayaganlo.param('MinEnergy', 0.01)
                babayaganlo.param('Order', 'exp')
                babayaganlo.param('DebugEnergySpread', 0.01)
                babayaganlo.param('Epsilon', 0.00005)
                path.add_module(babayaganlo)
                generatorpreselection = basf2.register_module('GeneratorPreselection')
                generatorpreselection.param('nChargedMin', 0)
                generatorpreselection.param('nChargedMax', 999)
                generatorpreselection.param('MinChargedPt', 0.15)
                generatorpreselection.param('MinChargedTheta', 17.)
                generatorpreselection.param('MaxChargedTheta', 150.)
                generatorpreselection.param('nPhotonMin', 1)
                generatorpreselection.param('MinPhotonEnergy', 1.5)
                generatorpreselection.param('MinPhotonTheta', 15.0)
                generatorpreselection.param('MaxPhotonTheta', 165.0)
                generatorpreselection.param('applyInCMS', True)
                path.add_module(generatorpreselection)
                empty = basf2.create_path()
                generatorpreselection.if_value('!=11', empty)

                ge.add_aafh_generator(path=path, finalstate='e+e-e+e-', preselection=False)

                ge.add_aafh_generator(path=path, finalstate='e+e-mu+mu-', preselection=False)

                ge.add_kkmc_generator(path, finalstate='tau+tau-')

                ge.add_continuum_generator(path, finalstate='ddbar')

                ge.add_continuum_generator(path, finalstate='uubar')

                ge.add_continuum_generator(path, finalstate='ssbar')

                ge.add_continuum_generator(path, finalstate='ccbar')
        components = ['PXD', 'SVD', 'CDC', 'ECL', 'TOP', 'ARICH', 'TRG']
    """
    Generate simulated Monte Carlo with background overlay.

    Make sure to use different ``random_seed`` parameters for the training data
    for the classifier trainings and for the test data for the respective
    evaluation/validation tasks.
    """

    n_events = b2luigi.IntParameter()
    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    bkgfiles_dir = b2luigi.Parameter(

        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if random_seed is None:
            random_seed = self.random_seed
        return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        n_events_per_task = MasterTask.n_events_per_task
        quotient, remainder = divmod(self.n_events, n_events_per_task)
        for i in range(quotient):
                num_processes=MasterTask.num_processes,
                random_seed=self.random_seed + '_' + str(i).zfill(3),
                n_events=n_events_per_task,

                num_processes=MasterTask.num_processes,
                random_seed=self.random_seed + '_' + str(quotient).zfill(3),
    @b2luigi.on_temporary_files
        """
        When all GenerateSimTasks finished, merge the output.
        """
        create_output_dirs(self)

        for _, file_name in self.get_input_file_names().items():
            file_list.append(*file_name)
        print("Merge the following files:")
        cmd = ["b2file-merge", "-f"]
        args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
        subprocess.check_call(args)
        print("Finished merging. Now remove the input files to save space.")
        for tempfile in file_list:
            args = cmd2 + [tempfile]
            subprocess.check_call(args)
    """
    Task to check if the given file really exists.
    """
    filename = b2luigi.Parameter()

        """
        Specify the output to be the file that was just checked.
        """
        from luigi import LocalTarget
    """
    Collect variables/features from VXDTF2 tracking and write them to a ROOT
    file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the VXD track quality estimator
    """
    n_events = b2luigi.IntParameter()
    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()

        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if random_seed is None:
            random_seed = self.random_seed
        if 'vxd' not in random_seed:
            random_seed += '_vxd'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_vxd.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'
        """
        Get input file names depending on the use case: If they already exist, search in
        the corresponding folders, for data check the specified list and if they are created
        in the same run, check for the task that produced them.
        """
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                     n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

        """
        Generate list of luigi Tasks that this Task depends on.
        """

        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
    def create_path(self):
        """
        Create basf2 path with VXDTF2 tracking and VXD QE data collection.
        """
        path = basf2.create_path()
            inputFileNames=inputFileNames,
        path.add_module("Gearbox")
        tracking.add_geometry_modules(path)
            from rawdata import add_unpackers
            add_unpackers(path, components=['SVD', 'PXD'])
        tracking.add_hit_preparation_modules(path)
        tracking.add_vxd_track_finding_vxdtf2(
            path, components=["SVD"], add_mva_quality_indicator=False
                "VXDQETrainingDataCollector",
                SpacePointTrackCandsStoreArrayName="SPTrackCands",
                EstimationMethod="tripletFit",
                ClusterInformation="Average",
                MCStrictQualityEstimator=False,
                "TrackFinderMCTruthRecoTracks",
                RecoTracksStoreArrayName="MCRecoTracks",
                "VXDQETrainingDataCollector",
                SpacePointTrackCandsStoreArrayName="SPTrackCands",
                EstimationMethod="tripletFit",
                ClusterInformation="Average",
                MCStrictQualityEstimator=True,
    """
    Collect variables/features from CDC tracking and write them to a ROOT file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the CDC track quality estimator
    """
    n_events = b2luigi.IntParameter()
    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()

        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if random_seed is None:
            random_seed = self.random_seed
        if 'cdc' not in random_seed:
            random_seed += '_cdc'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_cdc.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'
        """
        Get input file names depending on the use case: If they already exist, search in
        the corresponding folders, for data check the specified list and if they are created
        in the same run, check for the task that produced them.
        """
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                     n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

        """
        Generate list of luigi Tasks that this Task depends on.
        """

        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
    def create_path(self):
        """
        Create basf2 path with CDC standalone tracking and CDC QE with recording filter for MVA feature collection.
        """
        path = basf2.create_path()
            inputFileNames=inputFileNames,
        path.add_module("Gearbox")
        tracking.add_geometry_modules(path)
            filter_choice = "recording_data"
            from rawdata import add_unpackers
            add_unpackers(path, components=['CDC'])
            filter_choice = "recording"
        tracking.add_cdc_track_finding(path, with_ca=False, add_mva_quality_indicator=True)
        basf2.set_module_parameters(
            name="TFCDC_TrackQualityEstimator",
            filter=filter_choice,
    """
    Collect variables/features from the reco track reconstruction including the
    fit and write them to a ROOT file.

    These variables are to be used as labelled training data for the MVA
    classifier which is the MVA track quality estimator. The collected
    variables include the classifier outputs from the VXD and CDC quality
    estimators, namely the CDC and VXD quality indicators, combined with fit,
    merger, timing, energy loss information etc. This task requires the
    subdetector quality estimators to be trained.
    """

    n_events = b2luigi.IntParameter()
    experiment_number = b2luigi.IntParameter()
    random_seed = b2luigi.Parameter()
    cdc_training_target = b2luigi.Parameter()
    recotrack_option = b2luigi.Parameter(
        default='deleteCDCQI080'
    fast_bdt_option = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]

        """
        Create output file name depending on number of events and production
        mode that is specified in the random_seed string.
        """
        if random_seed is None:
            random_seed = self.random_seed
        if recotrack_option is None:
            recotrack_option = self.recotrack_option
        if 'rec' not in random_seed:
            random_seed += '_rec'
        if 'DATA' in random_seed:
            return 'qe_records_DATA_rec.root'
        else:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return 'qe_records_N' + str(n_events) + '_' + random_seed + '_' + recotrack_option + '.root'
        """
        Get input file names depending on the use case: If they already exist, search in
        the corresponding folders, for data check the specified list and if they are created
        in the same run, check for the task that produced them.
        """
        if random_seed is None:
            random_seed = self.random_seed
        if "USESIM" in random_seed:
            if 'USESIMBB' in random_seed:
                random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
            elif 'USESIMEE' in random_seed:
                random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
            return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                     n_events=n_events, random_seed=random_seed)]
        elif "DATA" in random_seed:
            return MasterTask.datafiles
        else:
            return self.get_input_file_names(GenerateSimTask.output_file_name(
                GenerateSimTask, n_events=n_events, random_seed=random_seed))

        """
        Generate list of luigi Tasks that this Task depends on.
        """
                n_events_training=MasterTask.n_events_training,
                exclude_variables=MasterTask.exclude_variables_cdc,
                n_events_training=MasterTask.n_events_training,
                exclude_variables=MasterTask.exclude_variables_vxd,

        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
    def create_path(self):
        """
        Create basf2 reconstruction path that should mirror the default path
        from ``add_tracking_reconstruction()``, but with modules for the VXD QE
        and CDC QE application and for collection of variables for the reco
        track quality estimator.
        """
        path = basf2.create_path()
            inputFileNames=inputFileNames,
        path.add_module("Gearbox")
            from rawdata import add_unpackers
        tracking.add_tracking_reconstruction(path, add_cdcTrack_QI=mvaCDC, add_vxdTrack_QI=mvaVXD, add_recoTrack_QI=True)
            cdc_identifier = 'datafiles/' + \
                CDCQETeacherTask.get_weightfile_xml_identifier(CDCQETeacherTask, fast_bdt_option=self.fast_bdt_option)
            if os.path.exists(cdc_identifier):
                replace_cdc_qi = True
                raise ValueError(f"CDC QI Identifier not found: {cdc_identifier}")
                replace_cdc_qi = False
            replace_cdc_qi = False
            cdc_identifier = self.get_input_file_names(
                CDCQETeacherTask.get_weightfile_xml_identifier(
            replace_cdc_qi = True

            vxd_identifier = 'datafiles/' + \
                VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
            if os.path.exists(vxd_identifier):
                replace_vxd_qi = True
                raise ValueError(f"VXD QI Identifier not found: {vxd_identifier}")
                replace_vxd_qi = False
            replace_vxd_qi = False
            vxd_identifier = self.get_input_file_names(
                VXDQETeacherTask.get_weightfile_xml_identifier(
            replace_vxd_qi = True
        cdc_qe_mva_filter_parameters = None
                cdc_qe_mva_filter_parameters = {
                    "identifier": cdc_identifier, "cut": cut}
                cdc_qe_mva_filter_parameters = {
        elif replace_cdc_qi:
            cdc_qe_mva_filter_parameters = {
                "identifier": cdc_identifier}
        if cdc_qe_mva_filter_parameters is not None:
            basf2.set_module_parameters(
                name="TFCDC_TrackQualityEstimator",
                filterParameters=cdc_qe_mva_filter_parameters,
            basf2.set_module_parameters(
                name="VXDQualityEstimatorMVA",
                WeightFileIdentifier=vxd_identifier)
        track_qe_module_name = "TrackQualityEstimatorMVA"
        module_found = False
        new_path = basf2.create_path()
        for module in path.modules():
            if module.name() != track_qe_module_name:
                # call name(): comparing the bound method itself to a string is never equal
                if not module.name() == 'TrackCreator':
                    new_path.add_module(module)
                new_path.add_module(
                    recoTrackColName='RecoTracks',
                    trackColName='MDSTTracks')
                new_path.add_module(
                    "TrackQETrainingDataCollector",
                    collectEventFeatures=True,
                    SVDPlusCDCStandaloneRecoTracksStoreArrayName="SVDPlusCDCStandaloneRecoTracks",
        if not module_found:
            raise KeyError(f"No module {track_qe_module_name} found in path")
    """
    A teacher task runs the basf2 mva teacher on the training data provided by a
    data collection task.

    Since teacher tasks are needed for all quality estimators covered by this
    steering file and the only thing that changes is the required data
    collection task and some training parameters, I decided to use inheritance
    and have the basic functionality in this base class/interface and have the
    specific teacher tasks inherit from it.
    """
    n_events_training = b2luigi.IntParameter()
    experiment_number = b2luigi.IntParameter()
    process_type = b2luigi.Parameter(
    training_target = b2luigi.Parameter(
    exclude_variables = b2luigi.ListParameter(
        hashed=True, default=[]
    fast_bdt_option = b2luigi.ListParameter(
        hashed=True, default=[200, 8, 3, 0.1]

        """
        Property defining the basename for the .xml and .root weightfiles that are created.
        Has to be implemented by the inheriting teacher task class.
        """
        raise NotImplementedError(
            "Teacher Task must define a static weightfile_identifier"

        """
        Name of the xml weightfile that is created by the teacher task.
        It is subsequently used as a local weightfile in the following validation tasks.
        """
1220 if fast_bdt_option
is None:
1222 if recotrack_option
is None and hasattr(self,
'recotrack_option'):
1223 recotrack_option = self.recotrack_option
1225 recotrack_option =
''
1226 weightfile_details = create_fbdt_option_string(fast_bdt_option)
1228 if recotrack_option !=
'':
1229 weightfile_name = weightfile_name +
'_' + recotrack_option
1230 return weightfile_name +
".weights.xml"
Property defining the name of the tree in the ROOT file from the
``data_collection_task`` that contains the recorded training data. Must be
implemented by the inheriting specific teacher task class.
raise NotImplementedError("Teacher Task must define a static tree_name")
Property defining random seed to be used by the ``GenerateSimTask``.
Should differ from the random seed in the test data samples. Must be
implemented by the inheriting specific teacher task class.
raise NotImplementedError("Teacher Task must define a static random seed")
Property defining the specific ``DataCollectionTask`` to require. Must be
implemented by the inheriting specific teacher task class.
raise NotImplementedError(
    "Teacher Task must define a data collection task to require "
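The weightfile naming done by ``get_weightfile_xml_identifier`` can be sketched in isolation. This is a minimal sketch, not the steering file's code: ``fbdt_option_string`` is a hypothetical stand-in for ``create_fbdt_option_string``, which is defined elsewhere and may format the hyperparameters differently.

```python
def fbdt_option_string(fast_bdt_option):
    """Hypothetical stand-in: encode [nTrees, nCuts, nLevels, shrinkage] as text."""
    n_trees, n_cuts, n_levels, shrinkage = fast_bdt_option
    return (f"_nTrees{n_trees}_nCuts{n_cuts}_nLevels{n_levels}"
            f"_shrin{int(round(100 * shrinkage))}")


def weightfile_xml_identifier(basename, fast_bdt_option, recotrack_option=""):
    """Build '<basename><options>[_<recotrack_option>].weights.xml'."""
    weightfile_name = basename + fbdt_option_string(fast_bdt_option)
    if recotrack_option != "":
        # The reco-track option is appended only when one is given.
        weightfile_name = weightfile_name + "_" + recotrack_option
    return weightfile_name + ".weights.xml"
```

Encoding the hyperparameters in the file name lets weightfiles from different FastBDT settings coexist in one output directory.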
Generate list of luigi Tasks that this Task depends on.
num_processes=MasterTask.num_processes,
Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.
Use basf2_mva teacher to create MVA weightfile from collected training
This is the main process that is dispatched by the ``run`` method that
is inherited from ``Basf2Task``.
if hasattr(self, 'recotrack_option'):
    records_files = self.get_input_file_names(
        recotrack_option=self.recotrack_option))
records_files = self.get_input_file_names(
my_basf2_mva_teacher(
    records_files=records_files,
Task to run basf2 mva teacher on collected data for VXDTF2 track quality estimator
weightfile_identifier_basename = "vxdtf2_mva_qe"
random_seed = "train_vxd"
data_collection_task = VXDQEDataCollectionTask
Task to run basf2 mva teacher on collected data for CDC track quality estimator
weightfile_identifier_basename = "cdc_mva_qe"
tree_name = "records"
random_seed = "train_cdc"
data_collection_task = CDCQEDataCollectionTask
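The teacher tasks above supply the values that the base class leaves unimplemented by assigning plain class attributes. A minimal, self-contained sketch of that pattern follows; ``TeacherBase``, ``ToyCDCTeacher`` and ``describe`` are hypothetical names used only for illustration.

```python
class TeacherBase:
    """Toy base class: shared logic here, task-specific values in subclasses."""

    @property
    def weightfile_identifier_basename(self):
        raise NotImplementedError(
            "Teacher Task must define a static weightfile_identifier")

    @property
    def tree_name(self):
        raise NotImplementedError("Teacher Task must define a static tree_name")

    def describe(self):
        # Shared functionality that relies on the subclass-provided values.
        return f"{self.weightfile_identifier_basename}:{self.tree_name}"


class ToyCDCTeacher(TeacherBase):
    # Plain class attributes shadow the base-class properties.
    weightfile_identifier_basename = "cdc_mva_qe"
    tree_name = "records"
```

A subclass that forgets one of the attributes fails with a descriptive ``NotImplementedError`` as soon as the shared code accesses it.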
Task to run basf2 mva teacher on collected data for the final, combined
track quality estimator
recotrack_option = b2luigi.Parameter(
    default='deleteCDCQI080'
weightfile_identifier_basename = "recotrack_mva_qe"
random_seed = "train_rec"
data_collection_task = RecoTrackQEDataCollectionTask
cdc_training_target = b2luigi.Parameter()
Generate list of luigi Tasks that this Task depends on.
num_processes=MasterTask.num_processes,
Run track reconstruction with MVA quality estimator and write out
(="harvest") a root file with variables useful for the validation.
n_events_testing = b2luigi.IntParameter()
n_events_training = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
process_type = b2luigi.Parameter(
exclude_variables = b2luigi.ListParameter(
fast_bdt_option = b2luigi.ListParameter(
    hashed=True, default=[200, 8, 3, 0.1]
validation_output_file_name = "harvesting_validation.root"
reco_output_file_name = "reconstruction.root"
Teacher task to require to provide a quality estimator weightfile for ``add_tracking_with_quality_estimation``
raise NotImplementedError()
Add modules for track reconstruction to basf2 path that are to be
validated. Besides track finding it should include MC matching, fitted
track creation and a quality estimator module.
raise NotImplementedError()
Generate list of luigi Tasks that this Task depends on.
filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.
def create_path(self):
and adds the ``CombinedTrackingValidationModule`` to write out variables
path = basf2.create_path()
inputFileNames = ['datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root']
inputFileNames = self.get_input_file_names(GenerateSimTask.output_file_name(
inputFileNames=inputFileNames,
path.add_module("Gearbox")
tracking.add_geometry_modules(path)
tracking.add_hit_preparation_modules(path)
output_file_name=self.get_output_file_name(
Run VXDTF2 track reconstruction and write out (="harvest") a root file with
variables useful for validation of the VXD Quality Estimator.
validation_output_file_name = "vxd_qe_harvesting_validation.root"
reco_output_file_name = "vxd_qe_reconstruction.root"
teacher_task = VXDQETeacherTask
Add modules for VXDTF2 tracking with VXD quality estimator to basf2 path.
tracking.add_vxd_track_finding_vxdtf2(
    reco_tracks="RecoTracks",
    add_mva_quality_indicator=True,
basf2.set_module_parameters(
    name="VXDQualityEstimatorMVA",
    WeightFileIdentifier=self.get_input_file_names(
tracking.add_mc_matcher(path, components=["SVD"])
tracking.add_track_fit_and_track_creator(path, components=["SVD"])
Run CDC reconstruction and write out (="harvest") a root file with variables
useful for validation of the CDC Quality Estimator.
training_target = b2luigi.Parameter()
validation_output_file_name = "cdc_qe_harvesting_validation.root"
reco_output_file_name = "cdc_qe_reconstruction.root"
teacher_task = CDCQETeacherTask
Generate list of luigi Tasks that this Task depends on.
filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
Add modules for CDC standalone tracking with CDC quality estimator to basf2 path.
tracking.add_cdc_track_finding(
    output_reco_tracks="RecoTracks",
    add_mva_quality_indicator=True,
cdc_qe_mva_filter_parameters = {
    "identifier": self.get_input_file_names(
        CDCQETeacherTask.get_weightfile_xml_identifier(
basf2.set_module_parameters(
    name="TFCDC_TrackQualityEstimator",
    filterParameters=cdc_qe_mva_filter_parameters,
tracking.add_mc_matcher(path, components=["CDC"])
tracking.add_track_fit_and_track_creator(path, components=["CDC"])
Run track reconstruction and write out (="harvest") a root file with variables
useful for validation of the MVA track Quality Estimator.
cdc_training_target = b2luigi.Parameter()
validation_output_file_name = "reco_qe_harvesting_validation.root"
reco_output_file_name = "reco_qe_reconstruction.root"
teacher_task = RecoTrackQETeacherTask
Generate list of luigi Tasks that this Task depends on.
exclude_variables=MasterTask.exclude_variables_cdc,
exclude_variables=MasterTask.exclude_variables_vxd,
filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
Add modules for reco tracking with all track quality estimators to basf2 path.
tracking.add_tracking_reconstruction(
    add_cdcTrack_QI=True,
    add_vxdTrack_QI=True,
    add_recoTrack_QI=True,
    skipGeometryAdding=True,
    skipHitPreparerAdding=False,
cdc_qe_mva_filter_parameters = {
    "identifier": self.get_input_file_names(
        CDCQETeacherTask.get_weightfile_xml_identifier(
basf2.set_module_parameters(
    name="TFCDC_TrackQualityEstimator",
    filterParameters=cdc_qe_mva_filter_parameters,
basf2.set_module_parameters(
    name="VXDQualityEstimatorMVA",
    WeightFileIdentifier=self.get_input_file_names(
        VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
basf2.set_module_parameters(
    name="TrackQualityEstimatorMVA",
    WeightFileIdentifier=self.get_input_file_names(
        RecoTrackQETeacherTask.get_weightfile_xml_identifier(RecoTrackQETeacherTask, fast_bdt_option=self.fast_bdt_option)
Base class for evaluating a quality estimator with ``basf2_mva_evaluate.py`` on a
separate test data set.
Evaluation tasks for VXD, CDC and combined QE can inherit from it.
git_hash = b2luigi.Parameter(
    default=get_basf2_git_hash()
n_events_testing = b2luigi.IntParameter()
n_events_training = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
process_type = b2luigi.Parameter(
training_target = b2luigi.Parameter(
exclude_variables = b2luigi.ListParameter(
fast_bdt_option = b2luigi.ListParameter(
    hashed=True, default=[200, 8, 3, 0.1]
Property defining specific teacher task to require.
raise NotImplementedError(
    "Evaluation Tasks must define a teacher task to require "
Property defining the specific ``DataCollectionTask`` to require. Must be
implemented by the inheriting specific teacher task class.
raise NotImplementedError(
    "Evaluation Tasks must define a data collection task to require "
Acronym to distinguish between cdc, vxd and rec(o) MVA
raise NotImplementedError(
    "Evaluation Tasks must define a task acronym."
Generate list of luigi Tasks that this Task depends on.
filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
num_processes=MasterTask.num_processes,
Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.
evaluation_pdf_output = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
yield self.add_to_output(evaluation_pdf_output)
@b2luigi.on_temporary_files
Run ``basf2_mva_evaluate.py`` subprocess to evaluate QE MVA.
The MVA weight file created from training on the training data set is
evaluated on separate test data.
evaluation_pdf_output_basename = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
evaluation_pdf_output_path = self.get_output_file_name(evaluation_pdf_output_basename)
datafiles = 'datafiles/qe_records_N' + str(self.n_events_testing) + '_' + \
datafiles = self.get_input_file_names(
random_seed=self.process + '_test_' +
"basf2_mva_evaluate.py",
self.get_input_file_names(
evaluation_pdf_output_path,
log_file_dir = get_log_file_dir(self)
try:
    os.makedirs(log_file_dir, exist_ok=True)
except FileExistsError:
    print('Directory ' + log_file_dir + ' already exists.')
# os.path.join works whether or not log_file_dir ends with a path separator.
stderr_log_file_path = os.path.join(log_file_dir, "stderr")
stdout_log_file_path = os.path.join(log_file_dir, "stdout")
with open(stdout_log_file_path, "w") as stdout_file:
    stdout_file.write(f'stdout output of the command:\n{" ".join(cmd)}\n\n')
if os.path.exists(stderr_log_file_path):
    os.remove(stderr_log_file_path)
with open(stdout_log_file_path, "a") as stdout_file:
    with open(stderr_log_file_path, "a") as stderr_file:
        try:
            # Redirect the command's standard output and error into the log files.
            subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
        except subprocess.CalledProcessError as err:
            stderr_file.write(f"Evaluation failed with error:\n{err}")
Run ``basf2_mva_evaluate.py`` for the VXD quality estimator on separate test data
teacher_task = VXDQETeacherTask
data_collection_task = VXDQEDataCollectionTask
task_acronym = 'vxd'
Run ``basf2_mva_evaluate.py`` for the CDC quality estimator on separate test data
teacher_task = CDCQETeacherTask
data_collection_task = CDCQEDataCollectionTask
task_acronym = 'cdc'
Run ``basf2_mva_evaluate.py`` for the final, combined quality estimator on
teacher_task = RecoTrackQETeacherTask
data_collection_task = RecoTrackQEDataCollectionTask
task_acronym = 'rec'
cdc_training_target = b2luigi.Parameter()
Generate list of luigi Tasks that this Task depends on.
filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
num_processes=MasterTask.num_processes,
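The logging pattern used when the evaluation base task runs ``basf2_mva_evaluate.py`` can be tried out on its own. The following is a minimal sketch under stated assumptions: ``run_with_logs`` is a hypothetical helper, and any command can be passed in place of the real evaluation call.

```python
import os
import subprocess


def run_with_logs(cmd, log_file_dir):
    """Run ``cmd`` and write its stdout/stderr to files in ``log_file_dir``."""
    os.makedirs(log_file_dir, exist_ok=True)
    stdout_path = os.path.join(log_file_dir, "stdout")
    stderr_path = os.path.join(log_file_dir, "stderr")
    # Start a fresh stdout log with a header naming the command.
    with open(stdout_path, "w") as stdout_file:
        stdout_file.write(f'stdout output of the command:\n{" ".join(cmd)}\n\n')
    # Remove a stale stderr log from a previous run.
    if os.path.exists(stderr_path):
        os.remove(stderr_path)
    # Append mode keeps the header and lets the child process add its output.
    with open(stdout_path, "a") as stdout_file, open(stderr_path, "a") as stderr_file:
        try:
            subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
        except subprocess.CalledProcessError as err:
            stderr_file.write(f"Evaluation failed with error:\n{err}")
            raise
    return stdout_path, stderr_path
```

Re-raising after logging means the calling task still fails, while the error message is preserved next to the command's own output.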
Create a PDF file with validation plots for a quality estimator produced
from the ROOT ntuples produced by a harvesting validation task
n_events_testing = b2luigi.IntParameter()
n_events_training = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
process_type = b2luigi.Parameter(
exclude_variables = b2luigi.ListParameter(
fast_bdt_option = b2luigi.ListParameter(
    hashed=True, default=[200, 8, 3, 0.1]
primaries_only = b2luigi.BoolParameter(
Specifies related harvesting validation task which produces the ROOT
files with the data that is plotted by this task.
raise NotImplementedError("Must define a QI harvesting validation task for which to do the plots")
Name of the output PDF file containing the validation plots
return validation_harvest_basename.replace(".root", "_plots.pdf")
Generate list of luigi Tasks that this Task depends on.
Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.
@b2luigi.on_temporary_files
Use basf2_mva teacher to create MVA weightfile from collected training
Main process that is dispatched by the ``run`` method that is inherited
validation_harvest_path = self.get_input_file_names(validation_harvest_basename)[0]
    'is_fake', 'is_clone', 'is_matched', 'quality_indicator',
    'experiment_number', 'run_number', 'event_number', 'pr_store_array_number',
    'pt_estimate', 'z0_estimate', 'd0_estimate', 'tan_lambda_estimate',
    'phi0_estimate', 'pt_truth', 'z0_truth', 'd0_truth', 'tan_lambda_truth',
pr_df = uproot.open(validation_harvest_path)['pr_tree/pr_tree'].arrays(pr_columns, library='pd')
    'experiment_number',
    'pr_store_array_number',
mc_df = uproot.open(validation_harvest_path)['mc_tree/mc_tree'].arrays(mc_columns, library='pd')
mc_df = mc_df[mc_df.is_primary.eq(True)]
qi_cuts = np.linspace(0., 1, 20, endpoint=False)
with PdfPages(output_pdf_file_path, keep_empty=False) as pdf:
    titlepage_fig, titlepage_ax = plt.subplots()
    titlepage_ax.axis("off")
    title = f"Quality Estimator validation plots from {self.__class__.__name__}"
    titlepage_ax.set_title(title)
    weightfile_identifier = teacher_task.get_weightfile_xml_identifier(teacher_task, fast_bdt_option=self.fast_bdt_option)
        "Date": datetime.today().strftime("%Y-%m-%d %H:%M"),
        "Created by steering file": os.path.realpath(__file__),
        "Created from data in": validation_harvest_path,
        "weight file": weightfile_identifier,
    if hasattr(self, 'exclude_variables'):
    meta_data_string = (format_dictionary(meta_data) +
                        "\n\n(For all MVA training parameters look into the produced weight file)")
    luigi_params = get_serialized_parameters(self)
    luigi_param_string = (f"\n\nb2luigi parameters for {self.__class__.__name__}\n" +
                          format_dictionary(luigi_params))
    title_page_text = meta_data_string + luigi_param_string
    titlepage_ax.text(0, 1, title_page_text, ha="left", va="top", wrap=True, fontsize=8)
    pdf.savefig(titlepage_fig)
    plt.close(titlepage_fig)
    fake_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_fake", qi_cuts)
    fake_fig, fake_ax = plt.subplots()
    fake_ax.set_title("Fake rate")
    plot_with_errobands(fake_rates, ax=fake_ax)
    fake_ax.set_ylabel("fake rate")
    fake_ax.set_xlabel("quality indicator requirement")
    pdf.savefig(fake_fig, bbox_inches="tight")
    clone_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_clone", qi_cuts)
    clone_fig, clone_ax = plt.subplots()
    clone_ax.set_title("Clone rate")
    plot_with_errobands(clone_rates, ax=clone_ax)
    clone_ax.set_ylabel("clone rate")
    clone_ax.set_xlabel("quality indicator requirement")
    pdf.savefig(clone_fig, bbox_inches="tight")
    plt.close(clone_fig)
    pr_track_identifiers = ['experiment_number', 'run_number', 'event_number', 'pr_store_array_number']
        left=mc_df, right=pr_df[pr_track_identifiers + ['quality_indicator']],
        on=pr_track_identifiers
    missing_fractions = (
        _my_uncertain_mean(mc_df[
            mc_df.quality_indicator.isnull() | (mc_df.quality_indicator > qi_cut)]['is_missing'])
        for qi_cut in qi_cuts
    findeff_fig, findeff_ax = plt.subplots()
    findeff_ax.set_title("Finding efficiency")
    finding_efficiencies = 1.0 - upd.Series(data=missing_fractions, index=qi_cuts)
    plot_with_errobands(finding_efficiencies, ax=findeff_ax)
    findeff_ax.set_ylabel("finding efficiency")
    findeff_ax.set_xlabel("quality indicator requirement")
    pdf.savefig(findeff_fig, bbox_inches="tight")
    plt.close(findeff_fig)
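The finding-efficiency calculation above can be illustrated with plain pandas on toy data. This is a sketch under stated assumptions: the merge is taken to be a left join (suggested by the ``isnull`` check, but not visible in the listing), uncertainties are omitted, and the data frames and the helper ``finding_efficiency`` are made up for illustration.

```python
import pandas as pd

# Toy MC tracks; is_missing marks MC particles without a reconstructed track.
mc_df = pd.DataFrame({
    "event_number": [1, 1, 2, 2],
    "pr_store_array_number": [0, 1, 0, -1],
    "is_missing": [False, False, False, True],
})
# Toy reconstructed tracks with their quality indicator (QI).
pr_df = pd.DataFrame({
    "event_number": [1, 1, 2],
    "pr_store_array_number": [0, 1, 0],
    "quality_indicator": [0.9, 0.3, 0.6],
})

keys = ["event_number", "pr_store_array_number"]
# Left join keeps every MC track; tracks without a match get a NaN QI.
merged = pd.merge(left=mc_df, right=pr_df[keys + ["quality_indicator"]],
                  how="left", on=keys)


def finding_efficiency(df, qi_cut):
    """1 - mean(is_missing) over tracks with no QI or with QI above the cut."""
    selected = df[df.quality_indicator.isnull() | (df.quality_indicator > qi_cut)]
    return 1.0 - selected["is_missing"].mean()
```

Note that, with this definition, MC tracks whose matched track fails the cut are dropped from the denominator instead of being counted as missing.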
    fake_roc_fig, fake_roc_ax = plt.subplots()
    fake_roc_ax.set_title("Fake rate vs. finding efficiency ROC curve")
    fake_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=fake_rates.nominal_value,
                         xerr=finding_efficiencies.std_dev, yerr=fake_rates.std_dev, elinewidth=0.8)
    fake_roc_ax.set_xlabel('finding efficiency')
    fake_roc_ax.set_ylabel('fake rate')
    pdf.savefig(fake_roc_fig, bbox_inches="tight")
    plt.close(fake_roc_fig)
    clone_roc_fig, clone_roc_ax = plt.subplots()
    clone_roc_ax.set_title("Clone rate vs. finding efficiency ROC curve")
    clone_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=clone_rates.nominal_value,
                          xerr=finding_efficiencies.std_dev, yerr=clone_rates.std_dev, elinewidth=0.8)
    clone_roc_ax.set_xlabel('finding efficiency')
    clone_roc_ax.set_ylabel('clone rate')
    pdf.savefig(clone_roc_fig, bbox_inches="tight")
    plt.close(clone_roc_fig)
    kinematic_qi_cuts = [0, 0.5, 0.9]
    params = ['d0', 'z0', 'pt', 'tan_lambda', 'phi0']
        "tan_lambda": r"$\tan{\lambda}$",
        "tan_lambda": "rad",
    n_kinematic_bins = 75
        "pt": np.linspace(0, np.percentile(pr_df['pt_truth'].dropna(), 95), n_kinematic_bins),
        "z0": np.linspace(-0.1, 0.1, n_kinematic_bins),
        "d0": np.linspace(0, 0.01, n_kinematic_bins),
        "tan_lambda": np.linspace(-2, 3, n_kinematic_bins),
        "phi0": np.linspace(0, 2 * np.pi, n_kinematic_bins)
    kinematic_qi_cuts = [0, 0.5, 0.8]
    blue, yellow, green = plt.get_cmap("tab10").colors[0:3]
    for param in params:
        fig, axarr = plt.subplots(ncols=len(kinematic_qi_cuts), sharey=True, sharex=True, figsize=(14, 6))
        fig.suptitle(f"{label_by_param[param]} distributions")
        for i, qi in enumerate(kinematic_qi_cuts):
            ax.set_title(f"QI > {qi}")
            incut = pr_df[(pr_df['quality_indicator'] > qi)]
            incut_matched = incut[incut.is_matched.eq(True)]
            incut_clones = incut[incut.is_clone.eq(True)]
            incut_fake = incut[incut.is_fake.eq(True)]
            if any(series.empty for series in (incut, incut_matched, incut_clones, incut_fake)):
                ax.text(0.5, 0.5, "Not enough data in bin", ha="center", va="center", transform=ax.transAxes)
            bins = bins_by_param[param]
            stacked_histogram_series_tuple = (
                incut_matched[f'{param}_estimate'],
                incut_clones[f'{param}_estimate'],
                incut_fake[f'{param}_estimate'],
            histvals, _, _ = ax.hist(stacked_histogram_series_tuple,
                                     bins=bins, range=(bins.min(), bins.max()),
                                     color=(blue, green, yellow),
                                     label=("matched", "clones", "fakes"))
            ax.set_xlabel(f'{label_by_param[param]} estimate / ({unit_by_param[param]})')
            ax.set_ylabel('# tracks')
        axarr[0].legend(loc="upper center", bbox_to_anchor=(0, -0.15))
        pdf.savefig(fig, bbox_inches="tight")
Create a PDF file with validation plots for the VXDTF2 track quality
estimator produced from the ROOT ntuples produced by a VXDTF2 track QE
harvesting validation task
Harvesting validation task to require, which produces the ROOT files
with variables to produce the VXD QE validation plots.
num_processes=MasterTask.num_processes,
Create a PDF file with validation plots for the CDC track quality estimator
produced from the ROOT ntuples produced by a CDC track QE harvesting
training_target = b2luigi.Parameter()
Harvesting validation task to require, which produces the ROOT files
with variables to produce the CDC QE validation plots.
num_processes=MasterTask.num_processes,
Create a PDF file with validation plots for the reco MVA track quality
estimator produced from the ROOT ntuples produced by a reco track QE
harvesting validation task
cdc_training_target = b2luigi.Parameter()
Harvesting validation task to require, which produces the ROOT files
with variables to produce the final MVA track QE validation plots.
num_processes=MasterTask.num_processes,
Collect weightfile identifiers from different teacher tasks and merge them
into a local database for testing.
n_events_training = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
process_type = b2luigi.Parameter(
cdc_training_target = b2luigi.Parameter()
fast_bdt_option = b2luigi.ListParameter(
    hashed=True, default=[200, 8, 3, 0.1]
Required teacher tasks
exclude_variables=MasterTask.exclude_variables_vxd,
exclude_variables=MasterTask.exclude_variables_cdc,
exclude_variables=MasterTask.exclude_variables_rec,
yield self.add_to_output("localdb.tar")
Create local database
current_path = Path.cwd()
localdb_archive_path = Path(self.get_output_file_name("localdb.tar")).absolute()
output_dir = localdb_archive_path.parent
for task in (VXDQETeacherTask, CDCQETeacherTask, RecoTrackQETeacherTask):
    weightfile_xml_identifier_path = os.path.abspath(self.get_input_file_names(
        task.get_weightfile_xml_identifier(task, fast_bdt_option=self.fast_bdt_option))[0])
    os.chdir(output_dir)
        weightfile_xml_identifier_path,
        task.weightfile_identifier_basename,
    os.chdir(current_path)
shutil.make_archive(
    base_name=localdb_archive_path.as_posix().split('.')[0],
    root_dir=output_dir,
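The archiving step can be sketched separately from the database upload. In this minimal sketch, ``archive_localdb`` is a hypothetical helper and the ``localdb`` directory is assumed to already exist inside the output directory; ``Path.with_suffix("")`` is used here to drop the ``.tar`` extension, a choice that also behaves well when a parent directory name contains a dot.

```python
import shutil
from pathlib import Path


def archive_localdb(output_dir):
    """Pack ``output_dir/localdb`` into ``output_dir/localdb.tar``."""
    output_dir = Path(output_dir).resolve()
    archive_path = output_dir / "localdb.tar"
    # make_archive appends the extension itself, so pass the name without ".tar".
    created = shutil.make_archive(
        base_name=str(archive_path.with_suffix("")),
        format="tar",
        root_dir=output_dir,
        base_dir="localdb",
    )
    return Path(created)
```

Passing ``root_dir`` and ``base_dir`` keeps the paths inside the archive relative, so unpacking it elsewhere recreates a plain ``localdb`` directory.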
Remove local database and tar archives in output directory
localdb_archive_path = Path(self.get_output_file_name("localdb.tar"))
localdb_path = localdb_archive_path.parent / "localdb"
if localdb_path.exists():
    print(f"Deleting localdb\n{localdb_path}\nwith contents\n ",
          "\n ".join(f.name for f in localdb_path.iterdir()))
    shutil.rmtree(localdb_path, ignore_errors=False)
if localdb_archive_path.is_file():
    print(f"Deleting {localdb_archive_path}")
    os.remove(localdb_archive_path)
Cleanup: Remove local database to prevent existing outputs when task did not finish successfully
Wrapper task that needs to finish for b2luigi to finish running this steering file.
It is done if the outputs of all required subtasks exist. It is thus at the
top of the luigi task graph. Edit the ``requires`` method to steer which
tasks and with which parameters you want to run.
process_type = b2luigi.get_setting(
    "process_type", default='BBBAR'
n_events_training = b2luigi.get_setting(
    "n_events_training", default=20000
n_events_testing = b2luigi.get_setting(
    "n_events_testing", default=5000
n_events_per_task = b2luigi.get_setting(
    "n_events_per_task", default=100
num_processes = b2luigi.get_setting(
    "basf2_processes_per_worker", default=0
datafiles = b2luigi.get_setting("datafiles")
bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
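The key conversion for ``bkgfiles_by_exp`` matters when the setting is read from a JSON file, where object keys are always strings. A minimal sketch, assuming a JSON settings source and using made-up paths:

```python
import json

# Hypothetical settings content; the paths are placeholders.
settings_text = '{"bkgfiles_by_exp": {"0": "/path/to/exp0", "1003": "/path/to/exp1003"}}'
bkgfiles_by_exp = json.loads(settings_text)["bkgfiles_by_exp"]

# JSON object keys are strings, so convert them to integer experiment numbers.
bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
```

After the conversion, the dictionary can be indexed directly with an integer ``experiment_number`` parameter.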
exclude_variables_cdc = [
    "has_matching_segment",
    "cont_layer_variance",
    "cont_layer_max_vs_last",
    "cont_layer_first_vs_min",
    "cont_layer_occupancy",
    "super_layer_variance",
    "super_layer_max_vs_last",
    "super_layer_first_vs_min",
    "super_layer_occupancy",
    "drift_length_mean",
    "drift_length_variance",
    "norm_drift_length_mean",
    "norm_drift_length_variance",
    "norm_drift_length_max",
    "norm_drift_length_min",
    "norm_drift_length_sum",
exclude_variables_vxd = [
    'energyLoss_max', 'energyLoss_min', 'energyLoss_mean', 'energyLoss_std', 'energyLoss_sum',
    'size_max', 'size_min', 'size_mean', 'size_std', 'size_sum',
    'seedCharge_max', 'seedCharge_min', 'seedCharge_mean', 'seedCharge_std', 'seedCharge_sum',
    'tripletFit_P_Mag', 'tripletFit_P_Eta', 'tripletFit_P_Phi', 'tripletFit_P_X', 'tripletFit_P_Y', 'tripletFit_P_Z']
exclude_variables_rec = [
    'N_diff_PXD_SVD_RecoTracks',
    'N_diff_SVD_CDC_RecoTracks',
    'Fit_NFailedPoints',
    'N_TrackPoints_without_KalmanFitterInfo',
    'N_Hits_without_TrackPoint',
    'SVD_CDC_CDCwall_Chi2',
    'SVD_CDC_CDCwall_Pos_diff_Z',
    'SVD_CDC_CDCwall_Pos_diff_Pt',
    'SVD_CDC_CDCwall_Pos_diff_Theta',
    'SVD_CDC_CDCwall_Pos_diff_Phi',
    'SVD_CDC_CDCwall_Pos_diff_Mag',
    'SVD_CDC_CDCwall_Pos_diff_Eta',
    'SVD_CDC_CDCwall_Mom_diff_Z',
    'SVD_CDC_CDCwall_Mom_diff_Pt',
    'SVD_CDC_CDCwall_Mom_diff_Theta',
    'SVD_CDC_CDCwall_Mom_diff_Phi',
    'SVD_CDC_CDCwall_Mom_diff_Mag',
    'SVD_CDC_CDCwall_Mom_diff_Eta',
    'SVD_CDC_POCA_Pos_diff_Z',
    'SVD_CDC_POCA_Pos_diff_Pt',
    'SVD_CDC_POCA_Pos_diff_Theta',
    'SVD_CDC_POCA_Pos_diff_Phi',
    'SVD_CDC_POCA_Pos_diff_Mag',
    'SVD_CDC_POCA_Pos_diff_Eta',
    'SVD_CDC_POCA_Mom_diff_Z',
    'SVD_CDC_POCA_Mom_diff_Pt',
    'SVD_CDC_POCA_Mom_diff_Theta',
    'SVD_CDC_POCA_Mom_diff_Phi',
    'SVD_CDC_POCA_Mom_diff_Mag',
    'SVD_CDC_POCA_Mom_diff_Eta',
    'SVD_FitSuccessful',
    'CDC_FitSuccessful',
    'is_Vzero_Daughter',
    'weight_firstCDCHit',
    'weight_lastSVDHit',
    'smoothedChi2_mean',
    'smoothedChi2_median',
    'smoothedChi2_n_zeros',
    'smoothedChi2_firstCDCHit',
    'smoothedChi2_lastSVDHit']
Generate list of tasks that need to be done for luigi to finish running
cdc_training_targets = [
fast_bdt_options = []
fast_bdt_options.append([350, 6, 5, 0.1])
experiment_numbers = b2luigi.get_setting("experiment_numbers")
for experiment_number, cdc_training_target, fast_bdt_option in itertools.product(
        experiment_numbers, cdc_training_targets, fast_bdt_options
    if b2luigi.get_setting("test_selected_task", default=False):
        for cut in ['000', '070', '090', '095']:
                experiment_number=experiment_number,
                recotrack_option='useCDC_noVXD_deleteCDCQI' + cut,
                cdc_training_target=cdc_training_target,
                fast_bdt_option=fast_bdt_option,
            experiment_number=experiment_number,
            experiment_number=experiment_number,
            training_target=cdc_training_target,
            fast_bdt_option=fast_bdt_option,
            experiment_number=experiment_number,
            experiment_number=experiment_number,
            experiment_number=experiment_number,
            recotrack_option='deleteCDCQI080',
            cdc_training_target=cdc_training_target,
            fast_bdt_option=fast_bdt_option,
            experiment_number=experiment_number,
            cdc_training_target=cdc_training_target,
            fast_bdt_option=fast_bdt_option,
    if b2luigi.get_setting("run_validation_tasks", default=True):
            experiment_number=experiment_number,
            cdc_training_target=cdc_training_target,
            fast_bdt_option=fast_bdt_option,
            experiment_number=experiment_number,
            training_target=cdc_training_target,
            fast_bdt_option=fast_bdt_option,
            experiment_number=experiment_number,
            fast_bdt_option=fast_bdt_option,
    if b2luigi.get_setting("run_mva_evaluate", default=True):
            experiment_number=experiment_number,
            cdc_training_target=cdc_training_target,
            fast_bdt_option=fast_bdt_option,
            experiment_number=experiment_number,
            fast_bdt_option=fast_bdt_option,
            training_target=cdc_training_target,
            experiment_number=experiment_number,
            fast_bdt_option=fast_bdt_option,
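The parameter scan in ``requires`` iterates over every combination of experiment number, CDC training target and FastBDT option. The sketch below shows only that combinatorial part; the experiment numbers and the target name are placeholder values, and the dictionaries stand in for the task instances that the real method yields.

```python
import itertools

experiment_numbers = [0, 1003]          # placeholder setting values
cdc_training_targets = ["truth"]        # placeholder target name
fast_bdt_options = [[200, 8, 3, 0.1], [350, 6, 5, 0.1]]

# One parameter set per combination, in the order produced by product().
parameter_sets = [
    dict(experiment_number=exp, cdc_training_target=target, fast_bdt_option=option)
    for exp, target, option in itertools.product(
        experiment_numbers, cdc_training_targets, fast_bdt_options
    )
]
```

The number of required task chains grows multiplicatively with each list, so adding a second training target here would double the workload.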
if __name__ == "__main__":
    nEventsTestOnData = b2luigi.get_setting("n_events_test_on_data", default=-1)
    if nEventsTestOnData > 0 and 'DATA' in b2luigi.get_setting("process_type", default="BBBAR"):
        from ROOT import Belle2
        environment.setNumberEventsOverride(nEventsTestOnData)
    globaltags = b2luigi.get_setting("globaltags", default=[])
    if len(globaltags) > 0:
        basf2.conditions.reset()
        for gt in globaltags:
            basf2.conditions.prepend_globaltag(gt)
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(MasterTask(), workers=workers)
def get_background_files(folder=None, output_file_info=True)
static Environment & Instance()
Static method to get a reference to the Environment instance.
b2luigi random_seed
Random basf2 seed used by the GenerateSimTask.
def get_records_file_name(self, n_events=None, random_seed=None)
Filename of the recorded/collected data for the final QE MVA training.
def get_input_files(self, n_events=None, random_seed=None)
b2luigi n_events
Number of events to generate.
b2luigi experiment_number
Experiment number of the conditions database, e.g.
b2luigi training_target
Feature/variable to use as truth label in the quality estimator MVA classifier.
def add_tracking_with_quality_estimation(self, path)
CDCQETeacherTask teacher_task
Teacher task to require to provide a quality estimator weightfile for add_tracking_with_quality_estim...
def harvesting_validation_task_instance(self)
b2luigi training_target
Feature/variable to use as truth label in the quality estimator MVA classifier.
b2luigi filename
filename to check
b2luigi random_seed
Random basf2 seed.
def output_file_name(self, n_events=None, random_seed=None)
Name of the ROOT output file with generated and simulated events.
b2luigi bkgfiles_dir
Directory with overlay background root files.
b2luigi n_events
Number of events to generate.
b2luigi experiment_number
Experiment number of the conditions database, e.g.
b2luigi fast_bdt_option
Hyperparameter option of the FastBDT algorithm.
b2luigi n_events_training
Number of events to generate for the training data set.
TrackQETeacherBaseTask teacher_task(self)
b2luigi n_events_testing
Number of events to generate for the test data set.
b2luigi process_type
Define which kind of process shall be used.
str validation_output_file_name
Name of the "harvested" ROOT output file with variables that can be used for validation.
None add_tracking_with_quality_estimation(self, basf2.Path path)
b2luigi experiment_number
Experiment number of the conditions database, e.g.
str reco_output_file_name
Name of the output of the RootOutput module with reconstructed events.
b2luigi exclude_variables
List of collected variables to not use in the training of the QE MVA classifier.
b2luigi n_events_training
Number of events to generate for the training data set.
b2luigi n_events_testing
Number of events to generate for the test data set.
list exclude_variables_rec
list of variables to exclude for the recotrack mva:
b2luigi num_processes
Number of basf2 processes to use in Basf2PathTasks.
list exclude_variables_vxd
list of variables to exclude for the vxd mva:
b2luigi process_type
Define which kind of process shall be used.
list exclude_variables_cdc
list of variables to exclude for the cdc mva.
b2luigi fast_bdt_option
Hyperparameter option of the FastBDT algorithm.
b2luigi n_events_training
Number of events to generate for the training data set.
b2luigi n_events_testing
Number of events to generate for the test data set.
b2luigi process_type
Define which kind of process shall be used.
b2luigi primaries_only
Whether to normalize the track finding efficiencies to primary particles only.
def output_pdf_file_basename(self)
b2luigi experiment_number
Experiment number of the conditions database, e.g.
HarvestingValidationBaseTask harvesting_validation_task_instance(self)
b2luigi exclude_variables
List of collected variables to not use in the training of the QE MVA classifier.
b2luigi fast_bdt_option
Hyperparameter option of the FastBDT algorithm.
b2luigi n_events_training
Number of events to generate for the training data set.
b2luigi process_type
Define which kind of process shall be used.
b2luigi experiment_number
Experiment number of the conditions database, e.g.
b2luigi cdc_training_target
Feature/variable to use as truth label for the CDC track quality estimator.
def on_failure(self, exception)
b2luigi random_seed
Random basf2 seed used by the GenerateSimTask.
b2luigi fast_bdt_option
Hyperparameter option of the FastBDT algorithm.
b2luigi recotrack_option
RecoTrack option, use string that is additive: deleteCDCQI0XY (= deletes CDCTracks with CDC-QI below ...
def get_records_file_name(self, n_events=None, random_seed=None, recotrack_option=None)
Filename of the recorded/collected data for the final QE MVA training.
def get_input_files(self, n_events=None, random_seed=None)
b2luigi n_events
Number of events to generate.
b2luigi experiment_number
Experiment number of the conditions database, e.g.
b2luigi cdc_training_target
Feature/variable to use as truth label for the CDC track quality estimator.
RecoTrackQETeacherTask teacher_task
Task that is required by the evaluation base class to create the MVA weightfile that needs to be eval...
RecoTrackQEDataCollectionTask data_collection_task
Task that is required by the evaluation base class to collect the test data for the evaluation.
str task_acronym
Acronym that is required by the evaluation base class to find the correct collection task file.
b2luigi cdc_training_target
Feature/variable to use as truth label for the CDC track quality estimator.
RecoTrackQETeacherTask teacher_task
Teacher task to require to provide a quality estimator weightfile for add_tracking_with_quality_estim...
def add_tracking_with_quality_estimation(self, path)
b2luigi cdc_training_target
Feature/variable to use as truth label for the CDC track quality estimator.
b2luigi recotrack_option
RecoTrack option, use string that is additive: deleteCDCQI0XY (= deletes CDCTracks with CDC-QI below ...
RecoTrackQEDataCollectionTask data_collection_task
Defines the DataCollectionTask required by the base class to collect features for the MVA training.
b2luigi cdc_training_target
Feature/variable to use as truth label for the CDC track quality estimator.
str random_seed
Random basf2 seed used to create the training data set.
def harvesting_validation_task_instance(self)
b2luigi cdc_training_target
Feature/variable to use as truth label for the CDC track quality estimator.
b2luigi random_seed
Random basf2 seed.
def output_file_name(self, n_events=None, random_seed=None)
Name of the ROOT output file with generated and simulated events.
b2luigi bkgfiles_dir
Directory with overlay background ROOT files.
b2luigi n_events
Number of events to generate.
b2luigi experiment_number
Experiment number of the conditions database, e.g.
b2luigi fast_bdt_option
Hyperparameter options for the FastBDT algorithm.
b2luigi n_events_training
Number of events to generate for the training data set.
TrackQETeacherBaseTask teacher_task(self)
b2luigi n_events_testing
Number of events to generate for the test data set.
b2luigi training_target
Feature/variable to use as truth label in the quality estimator MVA classifier.
b2luigi process_type
Define which kind of process shall be used.
b2luigi experiment_number
Experiment number of the conditions database, e.g.
b2luigi exclude_variables
List of collected variables not to use in the training of the QE MVA classifier.
Basf2PathTask data_collection_task(self)
b2luigi fast_bdt_option
Hyperparameter option of the FastBDT algorithm.
b2luigi n_events_training
Number of events to generate for the training data set.
b2luigi training_target
Feature/variable to use as truth label in the quality estimator MVA classifier.
def weightfile_identifier_basename(self)
b2luigi process_type
Define which kind of process shall be used.
def get_weightfile_xml_identifier(self, fast_bdt_option=None, recotrack_option=None)
b2luigi experiment_number
Experiment number of the conditions database, e.g.
b2luigi exclude_variables
List of collected variables not to use in the training of the QE MVA classifier.
Basf2PathTask data_collection_task(self)
b2luigi random_seed
Random basf2 seed used by the GenerateSimTask.
def get_records_file_name(self, n_events=None, random_seed=None)
Filename of the recorded/collected data for the final QE MVA training.
def get_input_files(self, n_events=None, random_seed=None)
b2luigi n_events
Number of events to generate.
b2luigi experiment_number
Experiment number of the conditions database, e.g.
def add_tracking_with_quality_estimation(self, path)
VXDQETeacherTask teacher_task
Teacher task to require to provide a quality estimator weightfile for add_tracking_with_quality_estim...
def harvesting_validation_task_instance(self)
def add_simulation(path, components=None, bkgfiles=None, bkgOverlay=True, forceSetPXDDataReduction=False, usePXDDataReduction=True, cleanupPXDDataReduction=True, generate_2nd_cdc_hits=False, simulateT0jitter=True, isCosmics=False, FilterEvents=False, usePXDGatedMode=False, skipExperimentCheckForBG=False, save_slow_pions_in_mc=False)