Inheritance diagram for PlotsFromHarvestingValidationBaseTask:

Public Member Functions
HarvestingValidationBaseTask	harvesting_validation_task_instance (self)

def	output_pdf_file_basename (self)

def	requires (self)

def	output (self)

def	process (self)

Static Public Attributes
b2luigi	n_events_testing = b2luigi.IntParameter()
	Number of events to generate for the test data set.

b2luigi	n_events_training = b2luigi.IntParameter()
	Number of events to generate for the training data set.

b2luigi	experiment_number = b2luigi.IntParameter()
	Experiment number of the conditions database, which e.g. defines the simulation geometry.

b2luigi	process_type
	Define which kind of process shall be used.

b2luigi	exclude_variables
	List of collected variables to not use in the training of the QE MVA classifier.

b2luigi	fast_bdt_option
	Hyperparameter option of the FastBDT algorithm.

b2luigi	primaries_only
	Whether to normalize the track finding efficiencies to primary particles only.

Detailed Description

Create a PDF file with validation plots for a quality estimator. The plots
are made from the ROOT ntuples produced by a harvesting validation task.

Definition at line 2016 of file combined_quality_estimator_teacher.py.

Member Function Documentation

◆ harvesting_validation_task_instance()

HarvestingValidationBaseTask harvesting_validation_task_instance ( self )

Specifies related harvesting validation task which produces the ROOT
files with the data that is plotted by this task.

Reimplemented in VXDQEValidationPlotsTask, CDCQEValidationPlotsTask, and RecoTrackQEValidationPlotsTask.

Definition at line 2056 of file combined_quality_estimator_teacher.py.

    def harvesting_validation_task_instance(self) -> HarvestingValidationBaseTask:
        """
        Specifies related harvesting validation task which produces the ROOT
        files with the data that is plotted by this task.
        """
        raise NotImplementedError("Must define a QI harvesting validation task for which to do the plots")
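
Because the base implementation only raises, each concrete plotting task has to name its own harvesting task. The following is a minimal sketch of that override pattern, not the code of the real subclasses: `HarvestStandIn`, `PlotsBaseStandIn` and `MyPlotsTask` are hypothetical stand-ins that do not derive from b2luigi tasks. The listings on this page access the member as an attribute, so the sketch assumes it is exposed as a property.

```python
class HarvestStandIn:
    """Hypothetical stand-in for a harvesting validation task."""
    validation_output_file_name = "harvesting_validation.root"


class PlotsBaseStandIn:
    """Hypothetical stand-in for the plotting base task."""

    @property
    def harvesting_validation_task_instance(self):
        # The base class cannot know which harvesting task to plot.
        raise NotImplementedError("Must define a harvesting validation task")


class MyPlotsTask(PlotsBaseStandIn):
    """Hypothetical subclass that names its harvesting task."""

    @property
    def harvesting_validation_task_instance(self):
        return HarvestStandIn()


task = MyPlotsTask()
harvest_name = task.harvesting_validation_task_instance.validation_output_file_name
```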
 

◆ output()

def output ( self )

Generate list of output files that the task should produce.
The task is considered finished if and only if the outputs all exist.

Definition at line 2077 of file combined_quality_estimator_teacher.py.

    def output(self):
        """
        Generate list of output files that the task should produce.
        The task is considered finished if and only if the outputs all exist.
        """
        yield self.add_to_output(self.output_pdf_file_basename)
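
The completeness rule stated in the docstring can be illustrated without luigi. This is only a sketch of the "all outputs exist" condition: `is_complete` and the file name are hypothetical, and luigi performs its own check on target objects rather than on plain paths.

```python
import tempfile
from pathlib import Path


def is_complete(output_paths):
    """Return True only if every declared output file exists."""
    return all(Path(path).exists() for path in output_paths)


with tempfile.TemporaryDirectory() as tmp_dir:
    pdf_path = Path(tmp_dir) / "harvesting_validation_plots.pdf"
    done_before = is_complete([pdf_path])  # output not written yet
    pdf_path.touch()                       # pretend the task wrote its PDF
    done_after = is_complete([pdf_path])
```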
 

◆ output_pdf_file_basename()

def output_pdf_file_basename ( self )

Name of the output PDF file containing the validation plots

Definition at line 2064 of file combined_quality_estimator_teacher.py.

    def output_pdf_file_basename(self):
        """
        Name of the output PDF file containing the validation plots
        """
        validation_harvest_basename = self.harvesting_validation_task_instance.validation_output_file_name
        return validation_harvest_basename.replace(".root", "_plots.pdf")
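
The name derivation is a plain string substitution and can be tried in isolation. In this sketch the helper name `pdf_basename_from_harvest` and the example file name are hypothetical.

```python
def pdf_basename_from_harvest(harvest_basename: str) -> str:
    """Derive the PDF name from the harvest ROOT file name, as in the listing."""
    return harvest_basename.replace(".root", "_plots.pdf")


example = pdf_basename_from_harvest("vxd_qe_harvesting_validation.root")
```

Note that `str.replace` substitutes every occurrence of `.root`, not only a trailing suffix, so a base name that contains `.root` elsewhere would be altered there as well.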
 

◆ process()

def process ( self )

Create the validation plots from the harvested validation ntuples and
write them to a single PDF file.

Main process that is dispatched by the ``run`` method that is inherited
from ``Basf2Task``.

Definition at line 2085 of file combined_quality_estimator_teacher.py.

    def process(self):
        """
        Create the validation plots from the harvested validation ntuples and
        write them to a single PDF file.
 
        Main process that is dispatched by the ``run`` method that is inherited
        from ``Basf2Task``.
        """
        # get the validation "harvest", which is the ROOT file with ntuples for validation
        validation_harvest_basename = self.harvesting_validation_task_instance.validation_output_file_name
        validation_harvest_path = self.get_input_file_names(validation_harvest_basename)[0]
 
        # Load "harvested" validation data from root files into dataframes (requires enough memory to hold data)
        pr_columns = [  # Restrict memory usage by only reading in columns that are used in the steering file
            'is_fake', 'is_clone', 'is_matched', 'quality_indicator',
            'experiment_number', 'run_number', 'event_number', 'pr_store_array_number',
            'pt_estimate', 'z0_estimate', 'd0_estimate', 'tan_lambda_estimate',
            'phi0_estimate', 'pt_truth', 'z0_truth', 'd0_truth', 'tan_lambda_truth',
            'phi0_truth',
        ]
        # In ``pr_df`` each row corresponds to a track from Pattern Recognition
        pr_df = uproot.open(validation_harvest_path)['pr_tree/pr_tree'].arrays(pr_columns, library='pd')
        mc_columns = [  # restrict mc_df to these columns
            'experiment_number',
            'run_number',
            'event_number',
            'pr_store_array_number',
            'is_missing',
            'is_primary',
        ]
        # In ``mc_df`` each row corresponds to an MC track
        mc_df = uproot.open(validation_harvest_path)['mc_tree/mc_tree'].arrays(mc_columns, library='pd')
        if self.primaries_only:
            mc_df = mc_df[mc_df.is_primary.eq(True)]
 
        # Define QI thresholds for the FOM plots and the ROC curves
        qi_cuts = np.linspace(0., 1, 20, endpoint=False)
        # # Add more points at the very end between the previous maximum and 1
        # qi_cuts = np.append(qi_cuts, np.linspace(np.max(qi_cuts), 1, 20, endpoint=False))
 
        # Create plots and append them to single output pdf
 
        output_pdf_file_path = self.get_output_file_name(self.output_pdf_file_basename)
        with PdfPages(output_pdf_file_path, keep_empty=False) as pdf:
 
            # Add a title page to validation plot PDF with some metadata
            # Remember that most metadata is in the xml file of the weightfile
            # and in the b2luigi directory structure
            titlepage_fig, titlepage_ax = plt.subplots()
            titlepage_ax.axis("off")
            title = f"Quality Estimator validation plots from {self.__class__.__name__}"
            titlepage_ax.set_title(title)
            teacher_task = self.harvesting_validation_task_instance.teacher_task
            weightfile_identifier = teacher_task.get_weightfile_xml_identifier(teacher_task, fast_bdt_option=self.fast_bdt_option)
            meta_data = {
                "Date": datetime.today().strftime("%Y-%m-%d %H:%M"),
                "Created by steering file": os.path.realpath(__file__),
                "Created from data in": validation_harvest_path,
                "Background directory": MasterTask.bkgfiles_by_exp[self.experiment_number],
                "weight file": weightfile_identifier,
            }
            if hasattr(self, 'exclude_variables'):
                meta_data["Excluded variables"] = ", ".join(self.exclude_variables)
            meta_data_string = (format_dictionary(meta_data) +
                                "\n\n(For all MVA training parameters look into the produced weight file)")
            luigi_params = get_serialized_parameters(self)
            luigi_param_string = (f"\n\nb2luigi parameters for {self.__class__.__name__}\n" +
                                  format_dictionary(luigi_params))
            title_page_text = meta_data_string + luigi_param_string
            titlepage_ax.text(0, 1, title_page_text, ha="left", va="top", wrap=True, fontsize=8)
            pdf.savefig(titlepage_fig)
            plt.close(titlepage_fig)
 
            # Plot fake rates
            fake_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_fake", qi_cuts)
            fake_fig, fake_ax = plt.subplots()
            fake_ax.set_title("Fake rate")
            plot_with_errobands(fake_rates, ax=fake_ax)
            fake_ax.set_ylabel("fake rate")
            fake_ax.set_xlabel("quality indicator requirement")
            pdf.savefig(fake_fig, bbox_inches="tight")
            plt.close(fake_fig)
 
            # Plot clone rates
            clone_rates = get_uncertain_means_for_qi_cuts(pr_df, "is_clone", qi_cuts)
            clone_fig, clone_ax = plt.subplots()
            clone_ax.set_title("Clone rate")
            plot_with_errobands(clone_rates, ax=clone_ax)
            clone_ax.set_ylabel("clone rate")
            clone_ax.set_xlabel("quality indicator requirement")
            pdf.savefig(clone_fig, bbox_inches="tight")
            plt.close(clone_fig)
 
            # Plot finding efficiency
 
            # The Quality Indicator is only available in pr_tree and thus the
            # PR-track dataframe. To get the QI of the related PR track for an MC
            # track, merge the PR dataframe into the MC dataframe
            pr_track_identifiers = ['experiment_number', 'run_number', 'event_number', 'pr_store_array_number']
            mc_df = upd.merge(
                left=mc_df, right=pr_df[pr_track_identifiers + ['quality_indicator']],
                how='left',
                on=pr_track_identifiers
            )
 
            missing_fractions = (
                _my_uncertain_mean(mc_df[
                    mc_df.quality_indicator.isnull() | (mc_df.quality_indicator > qi_cut)]['is_missing'])
                for qi_cut in qi_cuts
            )
 
            findeff_fig, findeff_ax = plt.subplots()
            findeff_ax.set_title("Finding efficiency")
            finding_efficiencies = 1.0 - upd.Series(data=missing_fractions, index=qi_cuts)
            plot_with_errobands(finding_efficiencies, ax=findeff_ax)
            findeff_ax.set_ylabel("finding efficiency")
            findeff_ax.set_xlabel("quality indicator requirement")
            pdf.savefig(findeff_fig, bbox_inches="tight")
            plt.close(findeff_fig)
 
            # Plot ROC curves
 
            # Fake rate vs. finding efficiency ROC curve
            fake_roc_fig, fake_roc_ax = plt.subplots()
            fake_roc_ax.set_title("Fake rate vs. finding efficiency ROC curve")
            fake_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=fake_rates.nominal_value,
                                 xerr=finding_efficiencies.std_dev, yerr=fake_rates.std_dev, elinewidth=0.8)
            fake_roc_ax.set_xlabel('finding efficiency')
            fake_roc_ax.set_ylabel('fake rate')
            pdf.savefig(fake_roc_fig, bbox_inches="tight")
            plt.close(fake_roc_fig)
 
            # Clone rate vs. finding efficiency ROC curve
            clone_roc_fig, clone_roc_ax = plt.subplots()
            clone_roc_ax.set_title("Clone rate vs. finding efficiency ROC curve")
            clone_roc_ax.errorbar(x=finding_efficiencies.nominal_value, y=clone_rates.nominal_value,
                                  xerr=finding_efficiencies.std_dev, yerr=clone_rates.std_dev, elinewidth=0.8)
            clone_roc_ax.set_xlabel('finding efficiency')
            clone_roc_ax.set_ylabel('clone rate')
            pdf.savefig(clone_roc_fig, bbox_inches="tight")
            plt.close(clone_roc_fig)
 
            # Plot kinematic distributions
 
            # use fewer QI cuts, as each cut will now be its own subplot and not a point
            kinematic_qi_cuts = [0, 0.5, 0.9]
 
            # Define kinematic parameters which we want to histogram and define
            # dictionaries relating them to latex labels, units and binnings
            params = ['d0', 'z0', 'pt', 'tan_lambda', 'phi0']
            label_by_param = {
                "pt": "$p_T$",
                "z0": "$z_0$",
                "d0": "$d_0$",
                "tan_lambda": r"$\tan{\lambda}$",
                "phi0": r"$\phi_0$"
            }
            unit_by_param = {
                "pt": "GeV",
                "z0": "cm",
                "d0": "cm",
                "tan_lambda": "rad",
                "phi0": "rad"
            }
            n_kinematic_bins = 75  # number of bins per kinematic variable
            bins_by_param = {
                "pt": np.linspace(0, np.percentile(pr_df['pt_truth'].dropna(), 95), n_kinematic_bins),
                "z0": np.linspace(-0.1, 0.1, n_kinematic_bins),
                "d0": np.linspace(0, 0.01, n_kinematic_bins),
                "tan_lambda": np.linspace(-2, 3, n_kinematic_bins),
                "phi0": np.linspace(0, 2 * np.pi, n_kinematic_bins)
            }
 
            # Iterate over each parameter and for each make stacked histograms for different QI cuts
            kinematic_qi_cuts = [0, 0.5, 0.8]
            blue, yellow, green = plt.get_cmap("tab10").colors[0:3]
            for param in params:
                fig, axarr = plt.subplots(ncols=len(kinematic_qi_cuts), sharey=True, sharex=True, figsize=(14, 6))
                fig.suptitle(f"{label_by_param[param]}  distributions")
                for i, qi in enumerate(kinematic_qi_cuts):
                    ax = axarr[i]
                    ax.set_title(f"QI > {qi}")
                    incut = pr_df[(pr_df['quality_indicator'] > qi)]
                    incut_matched = incut[incut.is_matched.eq(True)]
                    incut_clones = incut[incut.is_clone.eq(True)]
                    incut_fake = incut[incut.is_fake.eq(True)]
 
                    # if any series is empty, skip this subplot and don't try to draw a stacked histogram
                    if any(series.empty for series in (incut, incut_matched, incut_clones, incut_fake)):
                        ax.text(0.5, 0.5, "Not enough data in bin", ha="center", va="center", transform=ax.transAxes)
                        continue
 
                    bins = bins_by_param[param]
                    stacked_histogram_series_tuple = (
                        incut_matched[f'{param}_estimate'],
                        incut_clones[f'{param}_estimate'],
                        incut_fake[f'{param}_estimate'],
                    )
                    histvals, _, _ = ax.hist(stacked_histogram_series_tuple,
                                             stacked=True,
                                             bins=bins, range=(bins.min(), bins.max()),
                                             color=(blue, green, yellow),
                                             label=("matched", "clones", "fakes"))
                    ax.set_xlabel(f'{label_by_param[param]} estimate / ({unit_by_param[param]})')
                    ax.set_ylabel('# tracks')
                axarr[0].legend(loc="upper center", bbox_to_anchor=(0, -0.15))
                pdf.savefig(fig, bbox_inches="tight")
                plt.close(fig)
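
The rate and efficiency calculations in the listing can be reproduced on a toy data set. The sketch below uses plain pandas instead of `uncertain_panda`, so it yields central values without uncertainties. All data values are invented. `fake_rate` is an assumption about what `get_uncertain_means_for_qi_cuts` computes, since that helper is not shown on this page. `finding_efficiency` mirrors the selection in the listing: MC tracks without a related PR track are always kept, while matched MC tracks are kept only if their PR track passes the cut.

```python
import numpy as np
import pandas as pd

# Toy PR tracks: one row per pattern-recognition track (made-up values).
pr_df = pd.DataFrame({
    "event_number": [1, 1, 2],
    "pr_store_array_number": [0, 1, 0],
    "quality_indicator": [0.9, 0.2, 0.6],
    "is_fake": [False, True, False],
})
# Toy MC tracks: the missing track has no PR partner (index -1 is an assumption).
mc_df = pd.DataFrame({
    "event_number": [1, 2, 2],
    "pr_store_array_number": [0, 0, -1],
    "is_missing": [False, False, True],
})

# Attach the QI of the related PR track to each MC track (NaN if none).
keys = ["event_number", "pr_store_array_number"]
merged = pd.merge(mc_df, pr_df[keys + ["quality_indicator"]], how="left", on=keys)

qi_cuts = np.linspace(0.0, 1.0, 4, endpoint=False)  # 0, 0.25, 0.5, 0.75


def fake_rate(cut):
    """Fraction of fakes among PR tracks passing the QI requirement."""
    passing = pr_df[pr_df.quality_indicator > cut]
    return passing.is_fake.mean()


def finding_efficiency(cut):
    """One minus the missing fraction among the selected MC tracks."""
    selected = merged[merged.quality_indicator.isnull() | (merged.quality_indicator > cut)]
    return 1.0 - selected.is_missing.mean()


fake_rates = [fake_rate(cut) for cut in qi_cuts]
efficiencies = [finding_efficiency(cut) for cut in qi_cuts]
```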
 
 

◆ requires()

def requires ( self )

Generate list of luigi Tasks that this Task depends on.

Definition at line 2071 of file combined_quality_estimator_teacher.py.

    def requires(self):
        """
        Generate list of luigi Tasks that this Task depends on.
        """
        yield self.harvesting_validation_task_instance
 

Member Data Documentation

◆ exclude_variables

b2luigi exclude_variables

static

Initial value:

=  b2luigi.ListParameter(
        
    )

List of collected variables to not use in the training of the QE MVA classifier.

These are excluded in addition to variables containing the "truth" substring, which are excluded by default.

Definition at line 2037 of file combined_quality_estimator_teacher.py.
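
The combined effect of the default "truth" exclusion and this parameter can be sketched as a simple filter. `select_training_variables` is a hypothetical helper written for illustration; the filtering code of the teacher task is not shown on this page and may differ.

```python
def select_training_variables(collected, exclude_variables=()):
    """Drop variables containing "truth" and any explicitly excluded ones."""
    excluded = set(exclude_variables)
    return [
        name for name in collected
        if "truth" not in name and name not in excluded
    ]


variables = select_training_variables(
    ["pt_estimate", "pt_truth", "z0_estimate", "d0_estimate"],
    exclude_variables=["d0_estimate"],
)
```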

◆ experiment_number

b2luigi experiment_number = b2luigi.IntParameter()

static

Experiment number of the conditions database, which e.g. defines the simulation geometry.

Definition at line 2026 of file combined_quality_estimator_teacher.py.

◆ fast_bdt_option

b2luigi fast_bdt_option

static

Initial value:

=  b2luigi.ListParameter(
        
    )

Hyperparameter option of the FastBDT algorithm.

Defaults to the FastBDT default values.

Definition at line 2043 of file combined_quality_estimator_teacher.py.

◆ n_events_testing

b2luigi n_events_testing = b2luigi.IntParameter()

static

Number of events to generate for the test data set.

Definition at line 2022 of file combined_quality_estimator_teacher.py.

◆ n_events_training

b2luigi n_events_training = b2luigi.IntParameter()

static

Number of events to generate for the training data set.

Definition at line 2024 of file combined_quality_estimator_teacher.py.

◆ primaries_only

b2luigi primaries_only

static

Initial value:

=  b2luigi.BoolParameter(
        
    )

Whether to normalize the track finding efficiencies to primary particles only.

Definition at line 2049 of file combined_quality_estimator_teacher.py.

◆ process_type

b2luigi process_type

static

Initial value:

=  b2luigi.Parameter(
        
    )

Define which kind of process shall be used.

Choose between simulating BBBAR, BHABHA, MUMU, YY, DDBAR, UUBAR, SSBAR or CCBAR events, reconstructing DATA or already simulated files (USESIMBB/EE), or running on existing reconstructed files (USERECBB/EE).

Definition at line 2030 of file combined_quality_estimator_teacher.py.

The documentation for this class was generated from the following file:

tracking/scripts/tracking/train/combined_quality_estimator_teacher.py
