.. _onlinebook_workflowmanagement:

Workflow Management
===================

Increasingly complex HEP analyses require many separate analysis steps on Monte Carlo simulations and data.
In a typical Belle II analysis, we run skims, reconstructions and offline analysis on different computing resources:

.. figure:: workflowmanagement/workflows/workflow_steps.png
    :width: 40em
    :align: center

    Analysis steps on different computing resources in a sample Belle II analysis.

**The sequence of all processing steps required for your analysis is a workflow.**
In your own interest (as well as in the interest of analysts following after you), you should set up your analysis in a workflow management system, i.e. automate the entire workflow execution.

Currently, the interplay of the different scripts and jobs is often poorly documented, and the analyst executes them manually one by one.
This is error-prone and time-consuming, reduces the reproducibility of results and the transparency of collaborative reviews, and hinders data preservation efforts.

In so-called workflow management tools, **dependencies between processing steps are made explicit in a stand-alone executable**, including job submission to remote computing resources, parallel computing etc.
Previous boilerplate code (such as custom bash scripts) becomes obsolete.

A workflow can be visualized as a directed acyclic graph (DAG), which illustrates the dependencies between all processing steps.
The DAG for a typical Belle II analysis quickly gets large, and workflow management tools can save you lots of headaches:

.. figure:: workflowmanagement/workflows/dag.jpg
    :width: 40em
    :align: center

    Directed acyclic graph (DAG) for a sample Belle II analysis.

A wide variety of workflow management tools exists (see for example `here `_).
For Belle II analyses, the b2luigi (based on the luigi framework) and snakemake workflow management tools are particularly useful (see e.g. `our comparison `_).

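To make the idea of explicit dependencies concrete, here is a minimal sketch in plain Python that uses only the standard library (``graphlib``, Python 3.9 or newer). It is not how b2luigi or snakemake are implemented. The step names are hypothetical and only loosely follow the skim, reconstruction and offline analysis chain described above. The sketch shows that once dependencies are written down as a DAG, a valid execution order can be derived automatically.

```python
from graphlib import TopologicalSorter

# Hypothetical step names (assumptions for illustration only).
# Each key maps to the set of steps that must finish before it can run.
dependencies = {
    "skim_mc": set(),
    "skim_data": set(),
    "reconstruction_mc": {"skim_mc"},
    "reconstruction_data": {"skim_data"},
    "offline_analysis": {"reconstruction_mc", "reconstruction_data"},
}


def execution_order(deps):
    """Return one valid order in which all steps of the DAG can be run."""
    # static_order() raises graphlib.CycleError if the graph contains a cycle,
    # i.e. if the dependencies do not form a valid DAG.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Several orders can be valid for the same graph, so the printed list may differ in detail. What is guaranteed is that every step appears after all steps it depends on. Workflow management tools go further and also handle job submission and parallel execution for you.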
In general, each processing step is implemented as a task in the workflow, with its input(s) and output(s).
The workflow management tool automatically schedules a task for execution as soon as all of its inputs exist and at least one of its outputs is missing.
If all outputs of a task already exist upon launch, the task will not be run again.

In this lesson, we build a minimalistic Belle II analysis in both tools, employing gbasf2, basf2 and the LSF batch system:

.. toctree::
    :glob:
    :maxdepth: 1

    workflowmanagement/*

.. topic:: Author(s) of this lesson

    Caspar Schmitt