.. _onlinebook_workflowmanagement:

Workflow Management
===================
 
Increasingly complex HEP analyses require many separate analysis steps on Monte Carlo simulations and data. In a typical Belle II analysis, we run skims, reconstructions and offline analysis on different computing resources:
   
.. figure:: workflowmanagement/workflows/workflow_steps.png
    :width: 40em
    :align: center
    
    Analysis steps on different computing resources in a sample Belle II analysis.
   
**The sequence of all processing steps required for your analysis is a workflow.** For your own benefit, and for that of the analysts who follow you, you should set up your analysis in a workflow management system, i.e. automate the execution of the entire workflow.

Currently, the interplay of the different scripts and jobs is often poorly documented, and the analyst executes them manually one by one. This is error-prone and time-consuming, it degrades the reproducibility of results and the transparency of collaborative reviews, and it hinders data preservation efforts.

In so-called workflow management tools, **dependencies between processing steps are made explicit in a stand-alone executable**, including job submission to remote computing resources, parallel computing etc. Previous boilerplate code (such as custom bash scripts) becomes obsolete.

A workflow can be visualized as a directed acyclic graph (DAG), which shows the dependencies between all processing steps. The DAG for a typical Belle II analysis quickly becomes large, and workflow management tools can save you a lot of headaches:

.. figure:: workflowmanagement/workflows/dag.jpg
    :width: 40em
    :align: center
    
    Directed acyclic graph (DAG) for a sample Belle II analysis.
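
As a minimal sketch of what such a graph encodes, the dependencies can be written down as a mapping from each step to the steps it depends on, from which a valid execution order follows. The step names below are hypothetical and only serve as an illustration; the example uses ``graphlib`` from the Python standard library (Python 3.9 or newer), not one of the workflow management tools discussed in this lesson.

.. code-block:: python

    from graphlib import TopologicalSorter

    # Hypothetical processing steps (assumption, for illustration only).
    # Each key maps to the set of steps that must finish before it can start.
    dependencies = {
        "skim_data": set(),
        "skim_mc": set(),
        "reconstruction_data": {"skim_data"},
        "reconstruction_mc": {"skim_mc"},
        "offline_analysis": {"reconstruction_data", "reconstruction_mc"},
    }

    # static_order() yields every step after all of its dependencies.
    # It raises graphlib.CycleError if the graph is not acyclic.
    execution_order = list(TopologicalSorter(dependencies).static_order())
    print(execution_order)

The data and Monte Carlo branches do not depend on each other, so several execution orders are valid. A workflow management tool can exploit this to run independent steps in parallel.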
    
A wide variety of workflow management tools exists (see for example `this curated list <https://github.com/pditommaso/awesome-pipeline/>`_). For Belle II analyses, the :doc:`b2luigi:index` (based on the luigi framework) and snakemake workflow management tools are particularly useful (see e.g. `our comparison <https://arxiv.org/abs/2212.01422>`_). In general, each processing step is implemented as a task in the workflow, with its inputs and outputs. The workflow management tool automatically schedules a task for execution as soon as all of its inputs exist and at least one of its outputs is missing. If all outputs of a task already exist at launch, the task is not run again.
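
The scheduling rule above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not the implementation used by b2luigi or snakemake: the ``Task`` class, the ``needs_run`` function and the file names are hypothetical, and inputs and outputs are assumed to be local files.

.. code-block:: python

    import tempfile
    from dataclasses import dataclass, field
    from pathlib import Path


    @dataclass
    class Task:
        """Hypothetical task description: a name plus input and output files."""
        name: str
        inputs: list = field(default_factory=list)
        outputs: list = field(default_factory=list)

        def is_ready(self):
            # Ready once every input file exists.
            return all(Path(p).exists() for p in self.inputs)

        def is_complete(self):
            # Complete once every output file exists.
            return all(Path(p).exists() for p in self.outputs)


    def needs_run(task):
        # Run only if all inputs exist and at least one output is missing.
        return task.is_ready() and not task.is_complete()


    with tempfile.TemporaryDirectory() as tmp:
        skim_output = Path(tmp) / "skim.root"
        ntuple = Path(tmp) / "ntuple.root"
        reco = Task("reconstruction", inputs=[skim_output], outputs=[ntuple])

        before_input = needs_run(reco)  # input missing: not scheduled
        skim_output.touch()
        with_input = needs_run(reco)    # input present, output missing: scheduled
        ntuple.touch()
        after_output = needs_run(reco)  # output already present: skipped

    print(before_input, with_input, after_output)  # False True False

Because completeness is decided from the existence of the outputs, an interrupted workflow can be relaunched and will resume at the first task whose outputs are missing.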

In this lesson, we build a minimalistic Belle II analysis in both tools, employing gbasf2, :py:mod:`basf2` and the LSF batch system:

.. toctree::
   :glob:
   :maxdepth: 1

   workflowmanagement/*

.. topic:: Author(s) of this lesson

   Caspar Schmitt

