combined_module_quality_estimator_teacher
-----------------------------------------

Information on the MVA Track Quality Indicator / Estimator can be found
on this `Confluence page
<https://confluence.desy.de/display/BI/MVA+Track+Quality+Indicator>`_.

Purpose of this script
~~~~~~~~~~~~~~~~~~~~~~

This Python script is used for the combined training and validation of three
classifiers, the actual final MVA track quality estimator and the two quality
estimators for the intermediate standalone track finders that it depends on.

- Final MVA track quality estimator:
  The final quality estimator for fully merged and fitted tracks (RecoTracks).
  Its classifier uses features from the track fitting, merger, hit pattern, etc.
  It also uses the outputs of the respective intermediate quality
  estimators for the VXD and the CDC track finding as inputs. It provides
  the final quality indicator (QI) exported to the track objects.

- VXDTF2 track quality estimator:
  MVA quality estimator for the VXD standalone track finding.

- CDC track quality estimator:
  MVA quality estimator for the CDC standalone track finding.

Each classifier requires a different training data set, and each needs to be
validated on a separate testing data set. Further, the final quality
estimator can only be trained when the trained weights for the intermediate
quality estimators are available. If the final estimator shall be trained without
one or both previous estimators, the requirements have to be commented out in the
``__init__.py`` file of tracking.

For all estimators, a list of variables to be ignored is specified in the MasterTask.
The current choice is mainly based on poor data-MC agreement in these quantities or
on outdated implementations. It was decided to leave them in the hardcoded "ugly" way
in here to remind future generations that they exist in principle and that they should
and could be added to the estimator once their modelling improves or an
alternative implementation is programmed.

To avoid mistakes, b2luigi is used to create a task chain for a combined training and
validation of all classifiers.

b2luigi: Understanding the steering file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All trainings and validations are done in the correct order in this steering
file. For the purpose of creating a dependency graph, the `b2luigi
<https://b2luigi.readthedocs.io>`_ python package is used, which extends the
`luigi <https://luigi.readthedocs.io>`_ package developed by Spotify.

Each task that has to be done is represented by a special class, which defines
parameters, output files and which other tasks with which
parameters it depends on. For example, a teacher task, which runs
``basf2_mva_teacher.py`` to train the classifier, depends on a data collection
task which runs a reconstruction and writes out track-wise variables into a ROOT
file for training. An evaluation/validation task for testing the classifier
requires both the teacher task, as it needs the weightfile to be present, and
also a data collection task, because it needs a dataset for testing the classifier.

The final task that defines which tasks need to be done for the steering file to
finish is the ``MasterTask``. When you only want to run parts of the
training/validation pipeline, you can comment out requirements in the
``MasterTask`` or replace them by lower-level tasks during debugging.

This steering file relies on b2luigi_ for task scheduling and `uncertain_panda
<https://github.com/nils-braun/uncertain_panda>`_ for uncertainty calculations.
uncertain_panda is not in the externals, and b2luigi is not included in them up
to v01-07-01. Both can be installed via pip::

    python3 -m pip install [--user] b2luigi uncertain_panda

Use the ``--user`` option if you do not have the rights to install python packages
into your externals (e.g. because you are using cvmfs) and install them in
``$HOME/.local`` instead.

Instead of command line arguments, the b2luigi script is configured via a
``settings.json`` file. Open it in your favorite text editor and modify it to
fit your requirements.

You can test the b2luigi steering file without running it via::

    python3 combined_quality_estimator_teacher.py --dry-run
    python3 combined_quality_estimator_teacher.py --show-output

This will show the outputs and show potential errors in the definitions of the
luigi task dependencies. To run the steering file in normal (local) mode,
run::

    python3 combined_quality_estimator_teacher.py

I usually use the interactive luigi web interface via the central scheduler,
which visualizes the task graph while it is running. For this, the scheduler
daemon ``luigid`` has to run in the background, which is located in
``~/.local/bin/luigid`` in case b2luigi has been installed with ``--user``. For
example, to start it on port 8886, run::

    luigid --port 8886

Then, execute your steering (e.g. in another terminal) with::

    python3 combined_quality_estimator_teacher.py --scheduler-port 8886

To view the web interface, open your web browser and enter into the URL bar::

    localhost:8886

If you don't run the steering file on the same machine on which you run your web
browser, you have two options:

1. Run both the steering file and ``luigid`` remotely and use
   ssh-port-forwarding to your local host. To do so, run on your local
   machine::

       ssh -N -f -L 8886:localhost:8886 <remote_user>@<remote_host>

2. Run the ``luigid`` scheduler locally and use the ``--scheduler-host <your
   local host>`` argument when calling the steering file.

Accessing the results / output files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All output files are stored in a directory structure in the ``result_path``. The
directory tree encodes the used b2luigi parameters. This ensures reproducibility
and makes parameter searches easy. Sometimes, it is hard to find the relevant
output files. You can view the whole directory structure by running ``tree
<result_path>``. Use the unix ``find`` command to find the files that interest
you, e.g.::

    find <result_path> -name "*.pdf" # find all validation plot files
    find <result_path> -name "*.root" # find all ROOT files
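
# standard library modules used further below (os.getenv, subprocess.check_call, textwrap.fill)
import os
import subprocess
import textwrap
from pathlib import Path
from datetime import datetime
from typing import Iterable

import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

from packaging import version

# basf2 framework modules referenced throughout this steering file
import basf2
import basf2_mva
import tracking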

# wrap python modules that are used here but not in the externals into a try except block
install_helpstring_formatter = ("\nCould not find {module} python module. Try installing it via\n"
                                " python3 -m pip install [--user] {module}\n")
try:
    import b2luigi
    from b2luigi.core.utils import get_serialized_parameters, get_log_file_dir, create_output_dirs
    from b2luigi.basf2_helper import Basf2PathTask, Basf2Task, HaddTask
    from b2luigi.core.task import Task, ExternalTask
    from b2luigi.basf2_helper.utils import get_basf2_git_hash
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="b2luigi"))
    raise
try:
    from uncertain_panda import pandas as upd
except ModuleNotFoundError:
    print(install_helpstring_formatter.format(module="uncertain_panda"))
    raise

# If b2luigi version 0.3.2 or older, it relies on $BELLE2_RELEASE being "head",
# which is not the case in the new externals.
if (
    version.parse(b2luigi.__version__) <= version.parse("0.3.2") and
    get_basf2_git_hash() is None and
    os.getenv("BELLE2_LOCAL_DIR") is not None
):
    print(f"b2luigi version could not obtain git hash because of a bug not yet fixed in version {b2luigi.__version__}\n"
          "Please install the latest version of b2luigi from github via\n\n"
          " python3 -m pip install --upgrade [--user] git+https://github.com/nils-braun/b2luigi.git\n")


def create_fbdt_option_string(fast_bdt_option):
    return "_nTrees" + str(fast_bdt_option[0]) + "_nCuts" + str(fast_bdt_option[1]) + "_nLevels" + \
        str(fast_bdt_option[2]) + "_shrin" + str(int(round(100*fast_bdt_option[3], 0)))


def my_basf2_mva_teacher(
    records_files,
    tree_name,
    weightfile_identifier,
    target_variable="truth",
    exclude_variables=None,
    fast_bdt_option=[200, 8, 3, 0.1]
):
    """
    My custom wrapper for basf2 mva teacher. Adapted from code in ``trackfindingcdc_teacher``.

    :param records_files: List of files with collected ("recorded") variables to use as training data for the MVA.
    :param tree_name: Name of the TTree in the ROOT file from the ``data_collection_task``
        that contains the training data for the MVA teacher.
    :param weightfile_identifier: Name of the weightfile that is created.
        Should either end in ".xml" for local weightfiles or in ".root", when
        the weightfile needs later to be uploaded as a payload to the conditions
        database.
    :param target_variable: Feature/variable to use as truth label in the quality estimator MVA classifier.
    :param exclude_variables: List of collected variables to not use in the training of the QE MVA classifier.
        In addition to variables containing the "truth" substring, which are excluded by default.
    :param fast_bdt_option: specified fast BDT options, default: [200, 8, 3, 0.1] [nTrees, nCuts, nLevels, shrinkage]
    """
    if exclude_variables is None:
        exclude_variables = []

    weightfile_extension = Path(weightfile_identifier).suffix
    if weightfile_extension not in {".xml", ".root"}:
        raise ValueError(f"Weightfile Identifier should end in .xml or .root, but ends in {weightfile_extension}")

    # extract names of all variables from one record file
    with root_utils.root_open(records_files[0]) as records_tfile:
        input_tree = records_tfile.Get(tree_name)
        feature_names = [leaf.GetName() for leaf in input_tree.GetListOfLeaves()]

    # get list of variables to use for training without MC truth and without excluded variables
    truth_free_variable_names = [
        name
        for name in feature_names
        if (
            ("truth" not in name) and
            (name != target_variable) and
            (name not in exclude_variables)
        )
    ]
    if "weight" in truth_free_variable_names:
        truth_free_variable_names.remove("weight")
        weight_variable = "weight"
    elif "__weight__" in truth_free_variable_names:
        truth_free_variable_names.remove("__weight__")
        weight_variable = "__weight__"
    else:
        # no weight column recorded: avoid an undefined ``weight_variable`` below
        weight_variable = ""

    # Set options for MVA training
    general_options = basf2_mva.GeneralOptions()
    general_options.m_datafiles = basf2_mva.vector(*records_files)
    general_options.m_treename = tree_name
    general_options.m_weight_variable = weight_variable
    general_options.m_identifier = weightfile_identifier
    general_options.m_variables = basf2_mva.vector(*truth_free_variable_names)
    general_options.m_target_variable = target_variable
    fastbdt_options = basf2_mva.FastBDTOptions()

    fastbdt_options.m_nTrees = fast_bdt_option[0]
    fastbdt_options.m_nCuts = fast_bdt_option[1]
    fastbdt_options.m_nLevels = fast_bdt_option[2]
    fastbdt_options.m_shrinkage = fast_bdt_option[3]
    # Train a MVA method and store the weightfile
    basf2_mva.teacher(general_options, fastbdt_options)
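
# The variable selection done by the teacher wrapper above can be tried out
# without ROOT or basf2. The following is only a minimal sketch of that
# filtering step: the function name ``select_training_variables`` and the
# feature names are hypothetical, and returning an empty string when no weight
# column is found is an assumption.

```python
def select_training_variables(feature_names, target_variable="truth", exclude_variables=None):
    """Return (training variable names, weight variable name)."""
    if exclude_variables is None:
        exclude_variables = []
    # Drop everything that carries MC truth, the target itself and excluded variables.
    selected = [
        name for name in feature_names
        if "truth" not in name and name != target_variable and name not in exclude_variables
    ]
    # A recorded weight column is not a feature; "weight" takes precedence over "__weight__".
    weight_variable = ""
    for candidate in ("weight", "__weight__"):
        if candidate in selected:
            selected.remove(candidate)
            weight_variable = candidate
            break
    return selected, weight_variable


# Hypothetical feature names, for illustration only.
features = ["pt", "truth", "truth_matched", "n_hits", "__weight__", "chi2"]
selected, weight = select_training_variables(features, exclude_variables=["chi2"])
print(selected, weight)
```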


def _my_uncertain_mean(series: upd.Series):
    """
    Temporary workaround for a bug in ``uncertain_panda`` where a ``ValueError`` is
    thrown for ``Series.unc.mean`` if the series is empty. Can be replaced by
    ``.unc.mean`` when the issue is fixed.
    https://github.com/nils-braun/uncertain_panda/issues/2
    """
    return series.unc.mean()


def get_uncertain_means_for_qi_cuts(df: upd.DataFrame, column: str, qi_cuts: Iterable[float]):
    """
    Return a pandas series with a mean of the dataframe column and
    uncertainty for each quality indicator cut.

    :param df: Pandas dataframe with at least ``quality_indicator``
        and another numeric ``column``.
    :param column: Column of which we want to aggregate the means
        and uncertainties for different QI cuts
    :param qi_cuts: Iterable of quality indicator minimal thresholds.
    :returns: Series of means and uncertainties with ``qi_cuts`` as index
    """
    uncertain_means = (_my_uncertain_mean(df.query(f"quality_indicator > {qi_cut}")[column])
                       for qi_cut in qi_cuts)
    uncertain_means_series = upd.Series(data=uncertain_means, index=qi_cuts)
    return uncertain_means_series
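
# To illustrate what the aggregation above computes, here is a minimal
# standard-library sketch that does not use ``uncertain_panda``. It assumes the
# standard error of the mean as uncertainty, which may differ from what
# ``uncertain_panda`` reports. The function name, the ``is_matched`` column and
# the numbers are hypothetical.

```python
import math
import statistics


def mean_with_error_for_qi_cuts(rows, column, qi_cuts):
    """Map each QI cut to (mean, standard error) of ``column`` for rows passing the cut."""
    results = {}
    for qi_cut in qi_cuts:
        values = [row[column] for row in rows if row["quality_indicator"] > qi_cut]
        if not values:
            # No track survives the cut: report NaN instead of raising.
            results[qi_cut] = (math.nan, math.nan)
            continue
        mean = statistics.fmean(values)
        if len(values) > 1:
            error = statistics.stdev(values) / math.sqrt(len(values))
        else:
            error = math.nan
        results[qi_cut] = (mean, error)
    return results


rows = [
    {"quality_indicator": 0.2, "is_matched": 0},
    {"quality_indicator": 0.6, "is_matched": 1},
    {"quality_indicator": 0.9, "is_matched": 1},
]
means = mean_with_error_for_qi_cuts(rows, "is_matched", qi_cuts=[0.0, 0.5, 0.95])
print(means)
```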


def plot_with_errobands(uncertain_series,
                        error_band_alpha=0.3,
                        plot_kwargs={},
                        fill_between_kwargs={},
                        ax=None):
    """
    Plot an uncertain series with error bands for y-errors
    """
    if ax is None:
        ax = plt.gca()
    uncertain_series = uncertain_series.dropna()
    ax.plot(uncertain_series.index.values, uncertain_series.nominal_value, **plot_kwargs)
    ax.fill_between(x=uncertain_series.index,
                    y1=uncertain_series.nominal_value - uncertain_series.std_dev,
                    y2=uncertain_series.nominal_value + uncertain_series.std_dev,
                    alpha=error_band_alpha,
                    **fill_between_kwargs)


def format_dictionary(adict, width=80, bullet="•"):
    """
    Helper function to format dictionary to string as a wrapped key-value bullet
    list. Useful to print metadata from dictionaries.

    :param adict: Dictionary to format
    :param width: Characters after which to wrap a key-value line
    :param bullet: Character to begin a key-value line with, e.g. ``-``
    """
    return "\n".join(textwrap.fill(f"{bullet} {key}: {value}", width=width)
                     for (key, value) in adict.items())
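
# A short usage sketch for the helper above. The function body is repeated so
# that the snippet runs on its own; the metadata values are hypothetical.

```python
import textwrap


def format_dictionary(adict, width=80, bullet="•"):
    """Format a dictionary as a wrapped key-value bullet list."""
    return "\n".join(textwrap.fill(f"{bullet} {key}: {value}", width=width)
                     for (key, value) in adict.items())


# Hypothetical task parameters, for illustration only.
metadata = {"n_events_training": 10000, "process_type": "BBBAR"}
text = format_dictionary(metadata, bullet="-")
print(text)
```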


Generate simulated Monte Carlo with background overlay.

Make sure to use different ``random_seed`` parameters for the training data
for the classifier trainings and for the test data for the respective
evaluation/validation tasks.

n_events = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
random_seed = b2luigi.Parameter()
bkgfiles_dir = b2luigi.Parameter(hashed=True)

if random_seed is None:
    random_seed = self.random_seed
return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

Create basf2 path to process with event generation and simulation.

path = basf2.create_path()

path.add_module("EvtGenInput")

import generators as ge

import beamparameters as bp
beamparameters = bp.add_beamparameters(path, "Y4S")
beamparameters.param("covVertex", [(14.8e-4)**2, (1.5e-4)**2, (360e-4)**2])

ge.add_babayaganlo_generator(path=path, finalstate='ee', minenergy=0.15, minangle=10.0)

ge.add_kkmc_generator(path=path, finalstate='mu+mu-')

babayaganlo = basf2.register_module('BabayagaNLOInput')
babayaganlo.param('FinalState', 'gg')
babayaganlo.param('MaxAcollinearity', 180.0)
babayaganlo.param('ScatteringAngleRange', [0., 180.])
babayaganlo.param('FMax', 75000)
babayaganlo.param('MinEnergy', 0.01)
babayaganlo.param('Order', 'exp')
babayaganlo.param('DebugEnergySpread', 0.01)
babayaganlo.param('Epsilon', 0.00005)
path.add_module(babayaganlo)
generatorpreselection = basf2.register_module('GeneratorPreselection')
generatorpreselection.param('nChargedMin', 0)
generatorpreselection.param('nChargedMax', 999)
generatorpreselection.param('MinChargedPt', 0.15)
generatorpreselection.param('MinChargedTheta', 17.)
generatorpreselection.param('MaxChargedTheta', 150.)
generatorpreselection.param('nPhotonMin', 1)
generatorpreselection.param('MinPhotonEnergy', 1.5)
generatorpreselection.param('MinPhotonTheta', 15.0)
generatorpreselection.param('MaxPhotonTheta', 165.0)
generatorpreselection.param('applyInCMS', True)
path.add_module(generatorpreselection)

ge.add_kkmc_generator(path, finalstate='tau+tau-')

ge.add_continuum_generator(path, finalstate='ddbar')

ge.add_continuum_generator(path, finalstate='uubar')

ge.add_continuum_generator(path, finalstate='ssbar')

ge.add_continuum_generator(path, finalstate='ccbar')

path.add_module("ActivatePXDPixelMasker")
path.add_module("ActivatePXDGainCalibrator")

components = ['PXD', 'SVD', 'CDC', 'ECL', 'TOP', 'ARICH', 'TRG']


Generate simulated Monte Carlo with background overlay.

Make sure to use different ``random_seed`` parameters for the training data
for the classifier trainings and for the test data for the respective
evaluation/validation tasks.

n_events = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
random_seed = b2luigi.Parameter()
bkgfiles_dir = b2luigi.Parameter(hashed=True)

if random_seed is None:
    random_seed = self.random_seed
return "generated_mc_N" + str(n_events) + "_" + random_seed + ".root"

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

n_events_per_task = MasterTask.n_events_per_task
quotient, remainder = divmod(self.n_events, n_events_per_task)
for i in range(quotient):
    num_processes=MasterTask.num_processes,
    random_seed=self.random_seed + '_' + str(i).zfill(3),
    n_events=n_events_per_task,

num_processes=MasterTask.num_processes,
random_seed=self.random_seed + '_' + str(quotient).zfill(3),

@b2luigi.on_temporary_files

create_output_dirs(self)

# collect the sub-task output files before merging
file_list = []
for _, file_name in self.get_input_file_names().items():
    file_list.append(*file_name)
print("Merge the following files:")
cmd = ["b2file-merge", "-f"]
args = cmd + [self.get_output_file_name(self.output_file_name())] + file_list
subprocess.check_call(args)
print("Finished merging. Now remove the input files to save space.")
for tempfile in file_list:
    args = cmd2 + [tempfile]
    subprocess.check_call(args)
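
# The way the split-and-merge task above divides a large sample into smaller
# generation jobs can be sketched in plain Python. This is only an illustration
# under assumptions: the function name is hypothetical, and it is assumed that
# a final job with the remaining events is only scheduled when the remainder is
# non-zero.

```python
def split_into_subtasks(n_events, n_events_per_task, random_seed):
    """Return a list of (sub-task random seed, number of events) pairs."""
    quotient, remainder = divmod(n_events, n_events_per_task)
    # Full-size sub-tasks get the seed suffixes _000, _001, ...
    subtasks = [(random_seed + '_' + str(i).zfill(3), n_events_per_task)
                for i in range(quotient)]
    # Assumed: the leftover events go into one extra sub-task.
    if remainder > 0:
        subtasks.append((random_seed + '_' + str(quotient).zfill(3), remainder))
    return subtasks


subtasks = split_into_subtasks(n_events=2500, n_events_per_task=1000, random_seed="train")
print(subtasks)
```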


Task to check if the given file really exists.

filename = b2luigi.Parameter()

from luigi import LocalTarget


Collect variables/features from VXDTF2 tracking and write them to a ROOT
file.

These variables are to be used as labelled training data for the MVA
classifier which is the VXD track quality estimator.

n_events = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
random_seed = b2luigi.Parameter()

if random_seed is None:
    random_seed = self.random_seed
if 'vxd' not in random_seed:
    random_seed += '_vxd'
if 'DATA' in random_seed:
    return 'qe_records_DATA_vxd.root'
if 'USESIMBB' in random_seed:
    random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
elif 'USESIMEE' in random_seed:
    random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'

def get_input_files(self, n_events=None, random_seed=None):
    if random_seed is None:
        random_seed = self.random_seed
    if "USESIM" in random_seed:
        if 'USESIMBB' in random_seed:
            random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
        elif 'USESIMEE' in random_seed:
            random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
        return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                n_events=n_events, random_seed=random_seed)]
    elif "DATA" in random_seed:
        return MasterTask.datafiles
    return self.get_input_file_names(GenerateSimTask.output_file_name(
        GenerateSimTask, n_events=n_events, random_seed=random_seed))
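
# The seed handling above decides which file a data collection task reads and
# writes. A minimal standalone sketch of the string logic follows; the helper
# names and the example seeds such as "USESIMBB_test_vxd" are hypothetical.

```python
def resolve_seed_label(random_seed):
    """Replace a USESIMBB/USESIMEE prefix by the process label of the existing sample."""
    if 'USESIMBB' in random_seed:
        return 'BBBAR_' + random_seed.split("_", 1)[1]
    if 'USESIMEE' in random_seed:
        return 'BHABHA_' + random_seed.split("_", 1)[1]
    return random_seed


def records_file_name(n_events, random_seed):
    """Build the name of the ROOT file holding the recorded variables."""
    if 'DATA' in random_seed:
        return 'qe_records_DATA_vxd.root'
    return 'qe_records_N' + str(n_events) + '_' + resolve_seed_label(random_seed) + '.root'


name_sim = records_file_name(1000, "USESIMBB_test_vxd")
name_train = records_file_name(1000, "train_vxd")
print(name_sim, name_train)
```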

Generate list of luigi Tasks that this Task depends on.

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

Create basf2 path with VXDTF2 tracking and VXD QE data collection.

path = basf2.create_path()

inputFileNames=inputFileNames,

path.add_module("Gearbox")
tracking.add_geometry_modules(path)

from rawdata import add_unpackers
add_unpackers(path, components=['SVD', 'PXD'])
tracking.add_hit_preparation_modules(path)
tracking.add_vxd_track_finding_vxdtf2(
    path, components=["SVD"], add_mva_quality_indicator=False
)

"VXDQETrainingDataCollector",
SpacePointTrackCandsStoreArrayName="SPTrackCands",
EstimationMethod="tripletFit",
ClusterInformation="Average",
MCStrictQualityEstimator=False,

"TrackFinderMCTruthRecoTracks",
RecoTracksStoreArrayName="MCRecoTracks",

"VXDQETrainingDataCollector",
SpacePointTrackCandsStoreArrayName="SPTrackCands",
EstimationMethod="tripletFit",
ClusterInformation="Average",
MCStrictQualityEstimator=True,


Collect variables/features from CDC tracking and write them to a ROOT file.

These variables are to be used as labelled training data for the MVA
classifier which is the CDC track quality estimator.

n_events = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
random_seed = b2luigi.Parameter()

if random_seed is None:
    random_seed = self.random_seed
if 'cdc' not in random_seed:
    random_seed += '_cdc'
if 'DATA' in random_seed:
    return 'qe_records_DATA_cdc.root'
if 'USESIMBB' in random_seed:
    random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
elif 'USESIMEE' in random_seed:
    random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
return 'qe_records_N' + str(n_events) + '_' + random_seed + '.root'

def get_input_files(self, n_events=None, random_seed=None):
    if random_seed is None:
        random_seed = self.random_seed
    if "USESIM" in random_seed:
        if 'USESIMBB' in random_seed:
            random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
        elif 'USESIMEE' in random_seed:
            random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
        return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                n_events=n_events, random_seed=random_seed)]
    elif "DATA" in random_seed:
        return MasterTask.datafiles
    return self.get_input_file_names(GenerateSimTask.output_file_name(
        GenerateSimTask, n_events=n_events, random_seed=random_seed))

Generate list of luigi Tasks that this Task depends on.

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

Create basf2 path with CDC standalone tracking and CDC QE with recording filter for MVA feature collection.

path = basf2.create_path()

inputFileNames=inputFileNames,

path.add_module("Gearbox")
tracking.add_geometry_modules(path)

filter_choice = "recording_data"
from rawdata import add_unpackers
add_unpackers(path, components=['CDC'])

filter_choice = "recording"

tracking.add_cdc_track_finding(path, with_ca=False, add_mva_quality_indicator=True)

basf2.set_module_parameters(
    name="TFCDC_TrackQualityEstimator",
    filter=filter_choice,


Collect variables/features from the reco track reconstruction including the
fit and write them to a ROOT file.

These variables are to be used as labelled training data for the MVA
classifier which is the MVA track quality estimator. The collected
variables include the classifier outputs from the VXD and CDC quality
estimators, namely the CDC and VXD quality indicators, combined with fit,
merger, timing, energy loss information etc. This task requires the
subdetector quality estimators to be trained.

n_events = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
random_seed = b2luigi.Parameter()
cdc_training_target = b2luigi.Parameter()
recotrack_option = b2luigi.Parameter(default='deleteCDCQI080')
fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])

if random_seed is None:
    random_seed = self.random_seed
if recotrack_option is None:
    recotrack_option = self.recotrack_option
if 'rec' not in random_seed:
    random_seed += '_rec'
if 'DATA' in random_seed:
    return 'qe_records_DATA_rec.root'
if 'USESIMBB' in random_seed:
    random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
elif 'USESIMEE' in random_seed:
    random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
return 'qe_records_N' + str(n_events) + '_' + random_seed + '_' + recotrack_option + '.root'

def get_input_files(self, n_events=None, random_seed=None):
    if random_seed is None:
        random_seed = self.random_seed
    if "USESIM" in random_seed:
        if 'USESIMBB' in random_seed:
            random_seed = 'BBBAR_' + random_seed.split("_", 1)[1]
        elif 'USESIMEE' in random_seed:
            random_seed = 'BHABHA_' + random_seed.split("_", 1)[1]
        return ['datafiles/' + GenerateSimTask.output_file_name(GenerateSimTask,
                                                                n_events=n_events, random_seed=random_seed)]
    elif "DATA" in random_seed:
        return MasterTask.datafiles
    return self.get_input_file_names(GenerateSimTask.output_file_name(
        GenerateSimTask, n_events=n_events, random_seed=random_seed))

Generate list of luigi Tasks that this Task depends on.

n_events_training=MasterTask.n_events_training,
exclude_variables=MasterTask.exclude_variables_cdc,

n_events_training=MasterTask.n_events_training,
exclude_variables=MasterTask.exclude_variables_vxd,

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

Create basf2 reconstruction path that should mirror the default path
from ``add_tracking_reconstruction()``, but with modules for the VXD QE
and CDC QE application and for collection of variables for the reco
track quality estimator.

path = basf2.create_path()

inputFileNames=inputFileNames,

path.add_module("Gearbox")

from rawdata import add_unpackers

cdc_identifier = 'datafiles/' + \
    CDCQETeacherTask.get_weightfile_xml_identifier(CDCQETeacherTask, fast_bdt_option=self.fast_bdt_option)
if os.path.exists(cdc_identifier):
    replace_cdc_qi = True

raise ValueError(f"CDC QI Identifier not found: {cdc_identifier}")

replace_cdc_qi = False

replace_cdc_qi = False

cdc_identifier = self.get_input_file_names(
    CDCQETeacherTask.get_weightfile_xml_identifier(

replace_cdc_qi = True

vxd_identifier = 'datafiles/' + \
    VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
if os.path.exists(vxd_identifier):
    replace_vxd_qi = True

raise ValueError(f"VXD QI Identifier not found: {vxd_identifier}")

replace_vxd_qi = False

replace_vxd_qi = False

vxd_identifier = self.get_input_file_names(
    VXDQETeacherTask.get_weightfile_xml_identifier(

replace_vxd_qi = True

cdc_qe_mva_filter_parameters = None

cdc_qe_mva_filter_parameters = {
    "identifier": cdc_identifier, "cut": cut}

cdc_qe_mva_filter_parameters = {

cdc_qe_mva_filter_parameters = {
    "identifier": cdc_identifier}
if cdc_qe_mva_filter_parameters is not None:
    basf2.set_module_parameters(
        name="TFCDC_TrackQualityEstimator",
        filterParameters=cdc_qe_mva_filter_parameters,

basf2.set_module_parameters(
    name="VXDQualityEstimatorMVA",
    WeightFileIdentifier=vxd_identifier)

track_qe_module_name = "TrackQualityEstimatorMVA"

new_path = basf2.create_path()
for module in path.modules():
    if module.name() != track_qe_module_name:
        new_path.add_module(module)

new_path.add_module(
    "TrackQETrainingDataCollector",
    collectEventFeatures=True,
    SVDPlusCDCStandaloneRecoTracksStoreArrayName="SVDPlusCDCStandaloneRecoTracks",

if not module_found:
    raise KeyError(f"No module {track_qe_module_name} found in path")


A teacher task runs the basf2 mva teacher on the training data provided by a
data collection task.

Since teacher tasks are needed for all quality estimators covered by this
steering file and the only thing that changes is the required data
collection task and some training parameters, I decided to use inheritance
and have the basic functionality in this base class/interface and have the
specific teacher tasks inherit from it.

n_events_training = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
process_type = b2luigi.Parameter(default="BBBAR")
training_target = b2luigi.Parameter(default="truth")
exclude_variables = b2luigi.ListParameter(hashed=True, default=[])
fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])

Property defining the basename for the .xml and .root weightfiles that are created.
Has to be implemented by the inheriting teacher task class.

raise NotImplementedError(
    "Teacher Task must define a static weightfile_identifier"
)

Name of the xml weightfile that is created by the teacher task.
It is subsequently used as a local weightfile in the following validation tasks.

if fast_bdt_option is None:
    fast_bdt_option = self.fast_bdt_option
if recotrack_option is None and hasattr(self, 'recotrack_option'):
    recotrack_option = self.recotrack_option
else:
    recotrack_option = ''
weightfile_details = create_fbdt_option_string(fast_bdt_option)

if recotrack_option != '':
    weightfile_name = weightfile_name + '_' + recotrack_option
return weightfile_name + ".weights.xml"

Property defining the name of the tree in the ROOT file from the
``data_collection_task`` that contains the recorded training data. Must be
implemented by the inheriting specific teacher task class.

raise NotImplementedError("Teacher Task must define a static tree_name")

Property defining random seed to be used by the ``GenerateSimTask``.
Should differ from the random seed in the test data samples. Must be
implemented by the inheriting specific teacher task class.

raise NotImplementedError("Teacher Task must define a static random seed")

Property defining the specific ``DataCollectionTask`` to require. Must be
implemented by the inheriting specific teacher task class.

raise NotImplementedError(
    "Teacher Task must define a data collection task to require "
)
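
# The naming scheme of the xml weightfiles produced by the teacher tasks can be
# sketched as follows. This is a standalone illustration, not the task method
# itself: the function names are hypothetical, and joining the weightfile
# basename directly with the FastBDT option string is an assumption.

```python
def fbdt_option_suffix(fast_bdt_option):
    """Encode [nTrees, nCuts, nLevels, shrinkage] as a file name suffix."""
    n_trees, n_cuts, n_levels, shrinkage = fast_bdt_option
    # The shrinkage is stored as an integer percentage, e.g. 0.1 -> 10.
    shrinkage_percent = int(round(100 * shrinkage, 0))
    return f"_nTrees{n_trees}_nCuts{n_cuts}_nLevels{n_levels}_shrin{shrinkage_percent}"


def weightfile_xml_name(basename, fast_bdt_option, recotrack_option=''):
    """Build the xml weightfile name from basename, BDT options and reco track option."""
    # Assumed: the basename is concatenated directly with the option suffix.
    weightfile_name = basename + fbdt_option_suffix(fast_bdt_option)
    if recotrack_option != '':
        weightfile_name = weightfile_name + '_' + recotrack_option
    return weightfile_name + ".weights.xml"


cdc_name = weightfile_xml_name("cdc_mva_qe", [200, 8, 3, 0.1])
reco_name = weightfile_xml_name("recotrack_mva_qe", [200, 8, 3, 0.1], 'deleteCDCQI080')
print(cdc_name)
print(reco_name)
```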

Generate list of luigi Tasks that this Task depends on.

num_processes=MasterTask.num_processes,

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

Use basf2_mva teacher to create MVA weightfile from collected training
data.

This is the main process that is dispatched by the ``run`` method that
is inherited from ``Basf2Task``.

if hasattr(self, 'recotrack_option'):
    records_files = self.get_input_file_names(
        recotrack_option=self.recotrack_option))

records_files = self.get_input_file_names(

my_basf2_mva_teacher(
    records_files=records_files,


Task to run basf2 mva teacher on collected data for VXDTF2 track quality estimator

weightfile_identifier_basename = "vxdtf2_mva_qe"
random_seed = "train_vxd"
data_collection_task = VXDQEDataCollectionTask


Task to run basf2 mva teacher on collected data for CDC track quality estimator

weightfile_identifier_basename = "cdc_mva_qe"
tree_name = "records"
random_seed = "train_cdc"
data_collection_task = CDCQEDataCollectionTask


Task to run basf2 mva teacher on collected data for the final, combined
track quality estimator

recotrack_option = b2luigi.Parameter(default='deleteCDCQI080')
weightfile_identifier_basename = "recotrack_mva_qe"
random_seed = "train_rec"
data_collection_task = RecoTrackQEDataCollectionTask
cdc_training_target = b2luigi.Parameter()

Generate list of luigi Tasks that this Task depends on.

num_processes=MasterTask.num_processes,


Run track reconstruction with MVA quality estimator and write out
(="harvest") a root file with variables useful for the validation.

n_events_testing = b2luigi.IntParameter()
n_events_training = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
process_type = b2luigi.Parameter(default="BBBAR")
exclude_variables = b2luigi.ListParameter(hashed=True)
fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])
validation_output_file_name = "harvesting_validation.root"
reco_output_file_name = "reconstruction.root"

Teacher task to require to provide a quality estimator weightfile for ``add_tracking_with_quality_estimation``

raise NotImplementedError()

Add modules for track reconstruction to basf2 path that are to be
validated. Besides track finding it should include MC matching, fitted
track creation and a quality estimator module.

raise NotImplementedError()

Generate list of luigi Tasks that this Task depends on.

filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

Create a basf2 path that uses ``add_tracking_with_quality_estimation()``
and adds the ``CombinedTrackingValidationModule`` to write out variables

path = basf2.create_path()

inputFileNames = ['datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root']

inputFileNames = self.get_input_file_names(GenerateSimTask.output_file_name(

inputFileNames=inputFileNames,

path.add_module("Gearbox")
tracking.add_geometry_modules(path)
tracking.add_hit_preparation_modules(path)

output_file_name=self.get_output_file_name(
Run VXDTF2 track reconstruction and write out (="harvest") a root file with
variables useful for validation of the VXD Quality Estimator.
validation_output_file_name = "vxd_qe_harvesting_validation.root"
reco_output_file_name = "vxd_qe_reconstruction.root"
teacher_task = VXDQETeacherTask
Add modules for VXDTF2 tracking with VXD quality estimator to basf2 path.
tracking.add_vxd_track_finding_vxdtf2(
    reco_tracks="RecoTracks",
    add_mva_quality_indicator=True,
basf2.set_module_parameters(
    name="VXDQualityEstimatorMVA",
    WeightFileIdentifier=self.get_input_file_names(
tracking.add_mc_matcher(path, components=["SVD"])
tracking.add_track_fit_and_track_creator(path, components=["SVD"])
Run CDC reconstruction and write out (="harvest") a root file with variables
useful for validation of the CDC Quality Estimator.
training_target = b2luigi.Parameter()
validation_output_file_name = "cdc_qe_harvesting_validation.root"
reco_output_file_name = "cdc_qe_reconstruction.root"
teacher_task = CDCQETeacherTask
Generate list of luigi Tasks that this Task depends on.
filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
Add modules for CDC standalone tracking with CDC quality estimator to basf2 path.
tracking.add_cdc_track_finding(
    output_reco_tracks="RecoTracks",
    add_mva_quality_indicator=True,
cdc_qe_mva_filter_parameters = {
    "identifier": self.get_input_file_names(
        CDCQETeacherTask.get_weightfile_xml_identifier(
basf2.set_module_parameters(
    name="TFCDC_TrackQualityEstimator",
    filterParameters=cdc_qe_mva_filter_parameters,
tracking.add_mc_matcher(path, components=["CDC"])
tracking.add_track_fit_and_track_creator(path, components=["CDC"])
Run track reconstruction and write out (="harvest") a root file with variables
useful for validation of the MVA track Quality Estimator.
cdc_training_target = b2luigi.Parameter()
validation_output_file_name = "reco_qe_harvesting_validation.root"
reco_output_file_name = "reco_qe_reconstruction.root"
teacher_task = RecoTrackQETeacherTask
Generate list of luigi Tasks that this Task depends on.
exclude_variables=MasterTask.exclude_variables_cdc,
exclude_variables=MasterTask.exclude_variables_vxd,
filename='datafiles/generated_mc_N' + str(self.n_events_testing) + '_' + process + '_test.root'
Add modules for reco tracking with all track quality estimators to basf2 path.
add_cdcTrack_QI=True,
add_vxdTrack_QI=True,
add_recoTrack_QI=True,
skipGeometryAdding=True,
skipHitPreparerAdding=False,
cdc_qe_mva_filter_parameters = {
    "identifier": self.get_input_file_names(
        CDCQETeacherTask.get_weightfile_xml_identifier(
basf2.set_module_parameters(
    name="TFCDC_TrackQualityEstimator",
    filterParameters=cdc_qe_mva_filter_parameters,
basf2.set_module_parameters(
    name="VXDQualityEstimatorMVA",
    WeightFileIdentifier=self.get_input_file_names(
        VXDQETeacherTask.get_weightfile_xml_identifier(VXDQETeacherTask, fast_bdt_option=self.fast_bdt_option)
basf2.set_module_parameters(
    name="TrackQualityEstimatorMVA",
    WeightFileIdentifier=self.get_input_file_names(
        RecoTrackQETeacherTask.get_weightfile_xml_identifier(RecoTrackQETeacherTask, fast_bdt_option=self.fast_bdt_option)
Base class for evaluating a quality estimator with ``basf2_mva_evaluate.py`` on a
separate test data set.
Evaluation tasks for VXD, CDC and combined QE can inherit from it.
git_hash = b2luigi.Parameter(default=get_basf2_git_hash())
n_events_testing = b2luigi.IntParameter()
n_events_training = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
process_type = b2luigi.Parameter(default="BBBAR")
training_target = b2luigi.Parameter(default="truth")
exclude_variables = b2luigi.ListParameter(hashed=True)
fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])
Property defining specific teacher task to require.
raise NotImplementedError(
    "Evaluation Tasks must define a teacher task to require "
Property defining the specific ``DataCollectionTask`` to require. Must be
implemented by the inheriting evaluation task class.
raise NotImplementedError(
    "Evaluation Tasks must define a data collection task to require "
Acronym to distinguish between cdc, vxd and rec(o) MVA
raise NotImplementedError(
    "Evaluation Tasks must define a task acronym."
Generate list of luigi Tasks that this Task depends on.
filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
num_processes=MasterTask.num_processes,
Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.
evaluation_pdf_output = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
yield self.add_to_output(evaluation_pdf_output)
@b2luigi.on_temporary_files
Run ``basf2_mva_evaluate.py`` subprocess to evaluate QE MVA.
The MVA weight file created from training on the training data set is
evaluated on separate test data.
evaluation_pdf_output_basename = self.teacher_task.weightfile_identifier_basename + weightfile_details + ".pdf"
evaluation_pdf_output_path = self.get_output_file_name(evaluation_pdf_output_basename)
datafiles = 'datafiles/qe_records_N' + str(self.n_events_testing) + '_' + \
datafiles = self.get_input_file_names(
random_seed=self.process + '_test_' +
"basf2_mva_evaluate.py",
self.get_input_file_names(
evaluation_pdf_output_path,
log_file_dir = get_log_file_dir(self)
try:
    os.makedirs(log_file_dir, exist_ok=True)
except FileExistsError:
    print('Directory ' + log_file_dir + ' already exists.')
stderr_log_file_path = log_file_dir + "stderr"
stdout_log_file_path = log_file_dir + "stdout"
with open(stdout_log_file_path, "w") as stdout_file:
    stdout_file.write("stdout output of the command:\n{}\n\n".format(" ".join(cmd)))
if os.path.exists(stderr_log_file_path):
    os.remove(stderr_log_file_path)
with open(stdout_log_file_path, "a") as stdout_file:
    with open(stderr_log_file_path, "a") as stderr_file:
        try:
            # Send the evaluation output to the log files opened above.
            subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
        except subprocess.CalledProcessError as err:
            stderr_file.write(f"Evaluation failed with error:\n{err}")
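The logging pattern of the evaluation step (write a header, then let the subprocess append its stdout and stderr to log files) can be tried out in isolation. The following is only a sketch: `run_with_logs` is a hypothetical helper, and a trivial Python one-liner stands in for ``basf2_mva_evaluate.py``.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_logs(cmd, log_dir):
    """Run cmd and redirect its stdout/stderr into files inside log_dir (hypothetical helper)."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    stdout_path = log_dir / "stdout"
    stderr_path = log_dir / "stderr"
    with open(stdout_path, "w") as stdout_file, open(stderr_path, "w") as stderr_file:
        stdout_file.write("stdout output of the command:\n{}\n\n".format(" ".join(cmd)))
        # Flush so that the header lands in the file before the child process writes.
        stdout_file.flush()
        try:
            subprocess.run(cmd, check=True, stdout=stdout_file, stderr=stderr_file)
        except subprocess.CalledProcessError as err:
            stderr_file.write(f"Evaluation failed with error:\n{err}")
            raise
    return stdout_path, stderr_path


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in command; the real task would call basf2_mva_evaluate.py here.
    out_path, err_path = run_with_logs([sys.executable, "-c", "print('evaluation done')"], tmp)
    logged_stdout = out_path.read_text()
```

Passing the file objects as `stdout=` and `stderr=` is what makes the subprocess output end up in the logs; re-raising after logging keeps the failure visible to the scheduler.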
Run ``basf2_mva_evaluate.py`` for the VXD quality estimator on separate test data
teacher_task = VXDQETeacherTask
data_collection_task = VXDQEDataCollectionTask
task_acronym = 'vxd'
Run ``basf2_mva_evaluate.py`` for the CDC quality estimator on separate test data
teacher_task = CDCQETeacherTask
data_collection_task = CDCQEDataCollectionTask
task_acronym = 'cdc'
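The evaluation classes above rely on a simple pattern: the base class declares `teacher_task`, `data_collection_task` and `task_acronym` as properties that raise ``NotImplementedError``, and each subclass overrides them with plain class attributes. A minimal sketch of that pattern, with hypothetical class names and without any b2luigi dependency:

```python
class EvaluationBase:
    """Hypothetical stand-in for the evaluation base task."""

    @property
    def task_acronym(self):
        raise NotImplementedError("Evaluation tasks must define a task acronym.")

    def describe(self):
        # Shared logic in the base class only uses the attribute, not its origin.
        return f"evaluating the {self.task_acronym} quality estimator"


class VXDEvaluation(EvaluationBase):
    # A plain class attribute shadows the property inherited from the base class.
    task_acronym = "vxd"


description = VXDEvaluation().describe()

try:
    EvaluationBase().describe()
    base_raised = False
except NotImplementedError:
    base_raised = True
```

Forgetting to set one of the attributes in a new subclass then fails loudly at the first access instead of silently using a wrong default.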
Run ``basf2_mva_evaluate.py`` for the final, combined quality estimator on
separate test data
teacher_task = RecoTrackQETeacherTask
data_collection_task = RecoTrackQEDataCollectionTask
task_acronym = 'rec'
cdc_training_target = b2luigi.Parameter()
Generate list of luigi Tasks that this Task depends on.
filename='datafiles/qe_records_N' + str(self.n_events_testing) + '_' + process + '_test_' +
num_processes=MasterTask.num_processes,
Create a PDF file with validation plots for a quality estimator produced
from the ROOT ntuples produced by a harvesting validation task
n_events_testing = b2luigi.IntParameter()
n_events_training = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
process_type = b2luigi.Parameter(default="BBBAR")
exclude_variables = b2luigi.ListParameter(hashed=True)
fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])
primaries_only = b2luigi.BoolParameter(default=True)
Specifies related harvesting validation task which produces the ROOT
files with the data that is plotted by this task.
raise NotImplementedError("Must define a QI harvesting validation task for which to do the plots")
Name of the output PDF file containing the validation plots
return validation_harvest_basename.replace(".root", "_plots.pdf")
Generate list of luigi Tasks that this Task depends on.
Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.
@b2luigi.on_temporary_files
Use basf2_mva teacher to create MVA weightfile from collected training
Main process that is dispatched by the ``run`` method that is inherited
validation_harvest_path = self.get_input_file_names(validation_harvest_basename)[0]
'is_fake', 'is_clone', 'is_matched', 'quality_indicator',
'experiment_number', 'run_number', 'event_number', 'pr_store_array_number',
'pt_estimate', 'z0_estimate', 'd0_estimate', 'tan_lambda_estimate',
'phi0_estimate', 'pt_truth', 'z0_truth', 'd0_truth', 'tan_lambda_truth',
pr_df = root_pandas.read_root(validation_harvest_path, key='pr_tree/pr_tree', columns=pr_columns)
'experiment_number',
'pr_store_array_number',
mc_df = root_pandas.read_root(validation_harvest_path, key='mc_tree/mc_tree', columns=mc_columns)
mc_df = mc_df[mc_df.is_primary.eq(True)]
qi_cuts = np.linspace(0., 1, 20, endpoint=False)
with PdfPages(output_pdf_file_path, keep_empty=False) as pdf:
titlepage_fig, titlepage_ax = plt.subplots()
titlepage_ax.axis("off")
title = f"Quality Estimator validation plots from {self.__class__.__name__}"
titlepage_ax.set_title(title)
weightfile_identifier = teacher_task.get_weightfile_xml_identifier(teacher_task, fast_bdt_option=self.fast_bdt_option)
"Date": datetime.today().strftime("%Y-%m-%d %H:%M"),
"Created by steering file": os.path.realpath(__file__),
"Created from data in": validation_harvest_path,
"weight file": weightfile_identifier,
if hasattr(self, 'exclude_variables'):
meta_data_string = (format_dictionary(meta_data) +
                    "\n\n(For all MVA training parameters look into the produced weight file)")
luigi_params = get_serialized_parameters(self)
luigi_param_string = (f"\n\nb2luigi parameters for {self.__class__.__name__}\n" +
                      format_dictionary(luigi_params))
title_page_text = meta_data_string + luigi_param_string
titlepage_ax.text(0, 1, title_page_text, ha="left", va="top", wrap=True, fontsize=8)
pdf.savefig(titlepage_fig)
plt.close(titlepage_fig)
fake_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_fake", qi_cuts)
fake_fig, fake_ax = plt.subplots()
fake_ax.set_title("Fake rate")
plot_with_errobands(fake_rates, ax=fake_ax)
fake_ax.set_ylabel("fake rate")
fake_ax.set_xlabel("quality indicator requirement")
pdf.savefig(fake_fig, bbox_inches="tight")
clone_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_clone", qi_cuts)
clone_fig, clone_ax = plt.subplots()
clone_ax.set_title("Clone rate")
plot_with_errobands(clone_rates, ax=clone_ax)
clone_ax.set_ylabel("clone rate")
clone_ax.set_xlabel("quality indicator requirement")
pdf.savefig(clone_fig, bbox_inches="tight")
plt.close(clone_fig)
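The fake and clone rates above are computed once per quality indicator requirement. A dependency-free sketch of that scan is given below. It assumes that the helper keeps the tracks with a quality indicator above each cut and averages the flag over them, and it leaves out the uncertainties; `rate_for_qi_cuts` and the toy tracks are hypothetical.

```python
def rate_for_qi_cuts(tracks, flag, qi_cuts):
    """For each cut, return the fraction of tracks with QI > cut for which `flag` is set."""
    rates = {}
    for cut in qi_cuts:
        selected = [t for t in tracks if t["quality_indicator"] > cut]
        if selected:
            rates[cut] = sum(t[flag] for t in selected) / len(selected)
        else:
            rates[cut] = float("nan")  # no track passes this requirement
    return rates


# Toy pattern-recognition tracks, invented for illustration.
tracks = [
    {"quality_indicator": 0.95, "is_fake": False},
    {"quality_indicator": 0.80, "is_fake": False},
    {"quality_indicator": 0.30, "is_fake": True},
    {"quality_indicator": 0.10, "is_fake": True},
]

# Same grid as np.linspace(0., 1, 20, endpoint=False): 0.0, 0.05, ..., 0.95
qi_cuts = [i / 20 for i in range(20)]
fake_rates = rate_for_qi_cuts(tracks, "is_fake", qi_cuts)
```

In this toy sample the fake rate drops from one half without a requirement to zero once the requirement exceeds the quality indicator of the fake tracks.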
pr_track_identifiers = ['experiment_number', 'run_number', 'event_number', 'pr_store_array_number']
left=mc_df, right=pr_df[pr_track_identifiers + ['quality_indicator']],
on=pr_track_identifiers
missing_fractions = (
    _my_uncertain_mean(mc_df[
        mc_df.quality_indicator.isnull() | (mc_df.quality_indicator > qi_cut)]['is_missing'])
    for qi_cut in qi_cuts
findeff_fig, findeff_ax = plt.subplots()
findeff_ax.set_title("Finding efficiency")
finding_efficiencies = 1.0 - upd.Series(data=missing_fractions, index=qi_cuts)
plot_with_errobands(finding_efficiencies, ax=findeff_ax)
findeff_ax.set_ylabel("finding efficiency")
findeff_ax.set_xlabel("quality indicator requirement")
pdf.savefig(findeff_fig, bbox_inches="tight")
plt.close(findeff_fig)
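The finding efficiency is obtained as one minus the fraction of missing MC particles, where the selection keeps particles without a matched track (no quality indicator) and particles whose matched track passes the cut. The sketch below mirrors that selection in plain Python without uncertainties; the function name and the toy particles are hypothetical.

```python
def finding_efficiency(mc_particles, qi_cut):
    """1 - missing fraction among particles that are unmatched or pass the QI cut."""
    selected = [
        p for p in mc_particles
        if p["quality_indicator"] is None or p["quality_indicator"] > qi_cut
    ]
    if not selected:
        return float("nan")
    missing_fraction = sum(p["is_missing"] for p in selected) / len(selected)
    return 1.0 - missing_fraction


# Toy MC particles after the merge with the pattern-recognition tracks:
# a missing particle has no matched track and hence no quality indicator.
mc_particles = [
    {"is_missing": False, "quality_indicator": 0.9},
    {"is_missing": False, "quality_indicator": 0.6},
    {"is_missing": False, "quality_indicator": 0.2},
    {"is_missing": True, "quality_indicator": None},
]

eff_loose = finding_efficiency(mc_particles, 0.0)  # all four particles selected
eff_tight = finding_efficiency(mc_particles, 0.5)  # the QI = 0.2 particle is dropped
```

With this selection, a particle whose track fails the requirement leaves the denominator rather than being counted as missing, which is worth keeping in mind when reading the resulting curve.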
fake_roc_fig, fake_roc_ax = plt.subplots()
fake_roc_ax.set_title("Fake rate vs. finding efficiency ROC curve")
fake_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=fake_rates.nominal_value,
                     xerr=finding_efficiencies.std_dev, yerr=fake_rates.std_dev, elinewidth=0.8)
fake_roc_ax.set_xlabel('finding efficiency')
fake_roc_ax.set_ylabel('fake rate')
pdf.savefig(fake_roc_fig, bbox_inches="tight")
plt.close(fake_roc_fig)
clone_roc_fig, clone_roc_ax = plt.subplots()
clone_roc_ax.set_title("Clone rate vs. finding efficiency ROC curve")
clone_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=clone_rates.nominal_value,
                      xerr=finding_efficiencies.std_dev, yerr=clone_rates.std_dev, elinewidth=0.8)
clone_roc_ax.set_xlabel('finding efficiency')
clone_roc_ax.set_ylabel('clone rate')
pdf.savefig(clone_roc_fig, bbox_inches="tight")
plt.close(clone_roc_fig)
kinematic_qi_cuts = [0, 0.5, 0.9]
params = ['d0', 'z0', 'pt', 'tan_lambda', 'phi0']
"tan_lambda": r"$\tan{\lambda}$",
"tan_lambda": "rad",
n_kinematic_bins = 75
"pt": np.linspace(0, np.percentile(pr_df['pt_truth'].dropna(), 95), n_kinematic_bins),
"z0": np.linspace(-0.1, 0.1, n_kinematic_bins),
"d0": np.linspace(0, 0.01, n_kinematic_bins),
"tan_lambda": np.linspace(-2, 3, n_kinematic_bins),
"phi0": np.linspace(0, 2 * np.pi, n_kinematic_bins)
kinematic_qi_cuts = [0, 0.5, 0.8]
blue, yellow, green = plt.get_cmap("tab10").colors[0:3]
for param in params:
    fig, axarr = plt.subplots(ncols=len(kinematic_qi_cuts), sharey=True, sharex=True, figsize=(14, 6))
    fig.suptitle(f"{label_by_param[param]} distributions")
    for i, qi in enumerate(kinematic_qi_cuts):
        ax.set_title(f"QI > {qi}")
        incut = pr_df[(pr_df['quality_indicator'] > qi)]
        incut_matched = incut[incut.is_matched.eq(True)]
        incut_clones = incut[incut.is_clone.eq(True)]
        incut_fake = incut[incut.is_fake.eq(True)]
        if any(series.empty for series in (incut, incut_matched, incut_clones, incut_fake)):
            ax.text(0.5, 0.5, "Not enough data in bin", ha="center", va="center", transform=ax.transAxes)
        bins = bins_by_param[param]
        stacked_histogram_series_tuple = (
            incut_matched[f'{param}_estimate'],
            incut_clones[f'{param}_estimate'],
            incut_fake[f'{param}_estimate'],
        histvals, _, _ = ax.hist(stacked_histogram_series_tuple,
                                 bins=bins, range=(bins.min(), bins.max()),
                                 color=(blue, green, yellow),
                                 label=("matched", "clones", "fakes"))
        ax.set_xlabel(f'{label_by_param[param]} estimate / ({unit_by_param[param]})')
        ax.set_ylabel('# tracks')
    axarr[0].legend(loc="upper center", bbox_to_anchor=(0, -0.15))
    pdf.savefig(fig, bbox_inches="tight")
Create a PDF file with validation plots for the VXDTF2 track quality
estimator produced from the ROOT ntuples produced by a VXDTF2 track QE
harvesting validation task
Harvesting validation task to require, which produces the ROOT files
with variables to produce the VXD QE validation plots.
num_processes=MasterTask.num_processes,
Create a PDF file with validation plots for the CDC track quality estimator
produced from the ROOT ntuples produced by a CDC track QE harvesting
training_target = b2luigi.Parameter()
Harvesting validation task to require, which produces the ROOT files
with variables to produce the CDC QE validation plots.
num_processes=MasterTask.num_processes,
Create a PDF file with validation plots for the reco MVA track quality
estimator produced from the ROOT ntuples produced by a reco track QE
harvesting validation task
cdc_training_target = b2luigi.Parameter()
Harvesting validation task to require, which produces the ROOT files
with variables to produce the final MVA track QE validation plots.
num_processes=MasterTask.num_processes,
Collect weightfile identifiers from different teacher tasks and merge them
into a local database for testing.
n_events_training = b2luigi.IntParameter()
experiment_number = b2luigi.IntParameter()
process_type = b2luigi.Parameter(default="BBBAR")
cdc_training_target = b2luigi.Parameter()
fast_bdt_option = b2luigi.ListParameter(hashed=True, default=[200, 8, 3, 0.1])
Required teacher tasks
exclude_variables=MasterTask.exclude_variables_vxd,
exclude_variables=MasterTask.exclude_variables_cdc,
exclude_variables=MasterTask.exclude_variables_rec,
yield self.add_to_output("localdb.tar")
Create local database
current_path = Path.cwd()
localdb_archive_path = Path(self.get_output_file_name("localdb.tar")).absolute()
output_dir = localdb_archive_path.parent
for task in (VXDQETeacherTask, CDCQETeacherTask, RecoTrackQETeacherTask):
    weightfile_xml_identifier_path = os.path.abspath(self.get_input_file_names(
        task.get_weightfile_xml_identifier(task, fast_bdt_option=self.fast_bdt_option))[0])
    os.chdir(output_dir)
        weightfile_xml_identifier_path,
        task.weightfile_identifier_basename,
    os.chdir(current_path)
shutil.make_archive(
    base_name=localdb_archive_path.as_posix().split('.')[0],
    root_dir=output_dir,
Remove local database and tar archives in output directory
localdb_archive_path = Path(self.get_output_file_name("localdb.tar"))
localdb_path = localdb_archive_path.parent / "localdb"
if localdb_path.exists():
    print(f"Deleting localdb\n{localdb_path}\nwith contents\n ",
          "\n ".join(f.name for f in localdb_path.iterdir()))
    shutil.rmtree(localdb_path, ignore_errors=False)
if localdb_archive_path.is_file():
    print(f"Deleting {localdb_archive_path}")
    os.remove(localdb_archive_path)
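The archive-then-clean-up handling of the local database can be exercised on a throw-away directory. This is only a sketch: the directory content is invented, the `format` and `base_dir` arguments are assumptions (the corresponding lines are not shown in this listing), and `with_suffix("")` is used here instead of splitting the path at the first dot so that dots in parent directory names do not matter.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    output_dir = Path(tmp)
    localdb_path = output_dir / "localdb"
    localdb_path.mkdir()
    # Hypothetical payload standing in for the uploaded weight files.
    (localdb_path / "database.txt").write_text("hypothetical payload entry\n")

    archive_path = output_dir / "localdb.tar"
    # make_archive appends the ".tar" suffix itself, so strip it from the base name.
    created = shutil.make_archive(
        base_name=str(archive_path.with_suffix("")),
        format="tar",            # assumption: plain tar archive
        root_dir=str(output_dir),
        base_dir="localdb",      # assumption: archive only the localdb directory
    )
    with tarfile.open(created) as tar:
        archived_names = tar.getnames()

    # Clean-up, analogous to the removal step above.
    shutil.rmtree(localdb_path, ignore_errors=False)
    archive_path.unlink()
    leftovers = sorted(p.name for p in output_dir.iterdir())
```

After the clean-up nothing is left in the output directory, so a failed task cannot leave behind outputs that would make it look finished.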
Cleanup: Remove local database to prevent existing outputs when task did not finish successfully
Wrapper task that needs to finish for b2luigi to finish running this steering file.
It is done if the outputs of all required subtasks exist. It is thus at the
top of the luigi task graph. Edit the ``requires`` method to steer which
tasks and with which parameters you want to run.
process_type = b2luigi.get_setting("process_type", default='BBBAR')
n_events_training = b2luigi.get_setting("n_events_training", default=20000)
n_events_testing = b2luigi.get_setting("n_events_testing", default=5000)
n_events_per_task = b2luigi.get_setting("n_events_per_task", default=100)
num_processes = b2luigi.get_setting("basf2_processes_per_worker", default=0)
datafiles = b2luigi.get_setting("datafiles")
bkgfiles_by_exp = b2luigi.get_setting("bkgfiles_by_exp")
bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
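The conversion of the `bkgfiles_by_exp` keys to integers is needed whenever the settings come from JSON, because JSON object keys are always strings. A short sketch, assuming a JSON settings file and using invented paths:

```python
import json

# Hypothetical excerpt of a settings file; the paths are placeholders.
settings_text = '{"bkgfiles_by_exp": {"0": "/path/to/bkg/exp0", "1003": "/path/to/bkg/exp1003"}}'

bkgfiles_by_exp = json.loads(settings_text)["bkgfiles_by_exp"]
string_keys = sorted(bkgfiles_by_exp)  # keys arrive as strings after parsing

# Same dictionary comprehension as in the steering file: cast the keys to int.
bkgfiles_by_exp = {int(key): val for (key, val) in bkgfiles_by_exp.items()}
```

After the conversion, a lookup with an integer experiment number works directly, while the original string key is no longer present.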
exclude_variables_cdc = [
    "has_matching_segment",
    "drift_length_mean",
    "drift_length_variance",
    "norm_drift_length_sum",
    "norm_drift_length_max",
    "norm_drift_length_min",
    "norm_drift_length_mean",
    "norm_drift_length_variance",
    "cont_layer_first_vs_min",
    "cont_layer_max_vs_last",
    "cont_layer_occupancy",
    "cont_layer_variance",
    "super_layer_first_vs_min",
    "super_layer_max_vs_last",
    "super_layer_occupancy",
    "super_layer_variance"]
exclude_variables_vxd = [
    'energyLoss_max', 'energyLoss_min', 'energyLoss_mean', 'energyLoss_std', 'energyLoss_sum',
    'size_max', 'size_min', 'size_mean', 'size_std', 'size_sum',
    'seedCharge_max', 'seedCharge_min', 'seedCharge_mean', 'seedCharge_std', 'seedCharge_sum',
    'tripletFit_P_Mag', 'tripletFit_P_Eta', 'tripletFit_P_Phi', 'tripletFit_P_X', 'tripletFit_P_Y', 'tripletFit_P_Z']
exclude_variables_rec = [
    'N_diff_PXD_SVD_RecoTracks',
    'N_diff_SVD_CDC_RecoTracks',
    'Fit_NFailedPoints',
    'N_TrackPoints_without_KalmanFitterInfo',
    'N_Hits_without_TrackPoint',
    'SVD_CDC_CDCwall_Chi2',
    'SVD_CDC_CDCwall_Pos_diff_Z',
    'SVD_CDC_CDCwall_Pos_diff_Pt',
    'SVD_CDC_CDCwall_Pos_diff_Theta',
    'SVD_CDC_CDCwall_Pos_diff_Phi',
    'SVD_CDC_CDCwall_Pos_diff_Mag',
    'SVD_CDC_CDCwall_Pos_diff_Eta',
    'SVD_CDC_CDCwall_Mom_diff_Z',
    'SVD_CDC_CDCwall_Mom_diff_Pt',
    'SVD_CDC_CDCwall_Mom_diff_Theta',
    'SVD_CDC_CDCwall_Mom_diff_Phi',
    'SVD_CDC_CDCwall_Mom_diff_Mag',
    'SVD_CDC_CDCwall_Mom_diff_Eta',
    'SVD_CDC_POCA_Pos_diff_Z',
    'SVD_CDC_POCA_Pos_diff_Pt',
    'SVD_CDC_POCA_Pos_diff_Theta',
    'SVD_CDC_POCA_Pos_diff_Phi',
    'SVD_CDC_POCA_Pos_diff_Mag',
    'SVD_CDC_POCA_Pos_diff_Eta',
    'SVD_CDC_POCA_Mom_diff_Z',
    'SVD_CDC_POCA_Mom_diff_Pt',
    'SVD_CDC_POCA_Mom_diff_Theta',
    'SVD_CDC_POCA_Mom_diff_Phi',
    'SVD_CDC_POCA_Mom_diff_Mag',
    'SVD_CDC_POCA_Mom_diff_Eta',
    'SVD_FitSuccessful',
    'CDC_FitSuccessful',
    'weight_firstCDCHit',
    'weight_lastSVDHit',
    'smoothedChi2_mean',
    'smoothedChi2_median',
    'smoothedChi2_n_zeros',
    'smoothedChi2_firstCDCHit',
    'smoothedChi2_lastSVDHit']
Generate list of tasks that need to be done for luigi to finish running
cdc_training_targets = [
fast_bdt_options = []
fast_bdt_options.append([350, 6, 5, 0.1])
experiment_numbers = b2luigi.get_setting("experiment_numbers")
for experiment_number, cdc_training_target, fast_bdt_option in itertools.product(
    experiment_numbers, cdc_training_targets, fast_bdt_options
if b2luigi.get_setting("test_selected_task", default=False):
experiment_number=experiment_number,
experiment_number=experiment_number,
training_target=cdc_training_target,
fast_bdt_option=fast_bdt_option,
experiment_number=experiment_number,
experiment_number=experiment_number,
experiment_number=experiment_number,
recotrack_option='deleteCDCQI080',
cdc_training_target=cdc_training_target,
fast_bdt_option=fast_bdt_option,
experiment_number=experiment_number,
cdc_training_target=cdc_training_target,
fast_bdt_option=fast_bdt_option,
if b2luigi.get_setting("run_validation_tasks", default=True):
experiment_number=experiment_number,
cdc_training_target=cdc_training_target,
fast_bdt_option=fast_bdt_option,
experiment_number=experiment_number,
training_target=cdc_training_target,
fast_bdt_option=fast_bdt_option,
experiment_number=experiment_number,
fast_bdt_option=fast_bdt_option,
if b2luigi.get_setting("run_mva_evaluate", default=True):
experiment_number=experiment_number,
cdc_training_target=cdc_training_target,
fast_bdt_option=fast_bdt_option,
experiment_number=experiment_number,
fast_bdt_option=fast_bdt_option,
training_target=cdc_training_target,
experiment_number=experiment_number,
fast_bdt_option=fast_bdt_option,
if __name__ == "__main__":
    globaltags = b2luigi.get_setting("globaltags", default=[])
    if len(globaltags) > 0:
        basf2.conditions.reset()
        for gt in globaltags:
            basf2.conditions.prepend_globaltag(gt)
    workers = b2luigi.get_setting("workers", default=1)
    b2luigi.process(MasterTask(), workers=workers)
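The ``requires`` method of ``MasterTask`` builds its task list from the Cartesian product of experiment numbers, CDC training targets and FastBDT options. How that grid grows can be seen in a small sketch. The experiment numbers and the training target below are placeholders, since the actual values come from the settings and from a list that is not fully shown here.

```python
import itertools

experiment_numbers = [0, 1003]          # placeholder for the "experiment_numbers" setting
cdc_training_targets = ["truth"]        # placeholder; the list in the steering file is elided
fast_bdt_options = [[350, 6, 5, 0.1]]   # the one option that is appended in the listing

grid = list(itertools.product(experiment_numbers, cdc_training_targets, fast_bdt_options))

for experiment_number, cdc_training_target, fast_bdt_option in grid:
    # In the steering file, tasks are yielded here with these parameters.
    print(experiment_number, cdc_training_target, fast_bdt_option)
```

Each additional entry in any of the three lists multiplies the number of parameter combinations, and with it the number of training and validation tasks that b2luigi has to schedule.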