Belle II Software development
HTCondor Class Reference
Inheritance diagram for HTCondor: HTCondor → Batch → Backend

Classes

class  HTCondorResult
 

Public Member Functions

 get_batch_submit_script_path (self, job)
 
 can_submit (self, njobs=1)
 
 condor_q (cls, class_ads=None, job_id="", username="")
 
 condor_history (cls, class_ads=None, job_id="", username="")
 
 submit (self, job, check_can_submit=True, jobs_per_check=100)
 
 get_submit_script_path (self, job)
 

Public Attributes

int global_job_limit = self.default_global_job_limit
 The active job limit.
 
int sleep_between_submission_checks = self.default_sleep_between_submission_checks
 Seconds we wait before checking if we can submit a list of jobs.
 
dict backend_args = {**self.default_backend_args, **backend_args}
 The backend args that will be applied to jobs unless the job specifies them itself.
 

Static Public Attributes

str batch_submit_script = "submit.sub"
 HTCondor batch script (different to the wrapper script of Backend.submit_script)
 
list default_class_ads = ["GlobalJobId", "JobStatus", "Owner"]
 Default ClassAd attributes to return from commands like condor_q.
 
list submission_cmds = []
 Shell command used to submit a script; should be implemented in the derived class.
 
int default_global_job_limit = 1000
 Default global limit on the total number of submitted/running jobs that the user can have.
 
int default_sleep_between_submission_checks = 30
 Default time between re-checking whether the number of active jobs is below the global job limit.
 
str submit_script = "submit.sh"
 Default submission script name.
 
str exit_code_file = "__BACKEND_CMD_EXIT_STATUS__"
 Default exit code file name.
 
dict default_backend_args = {}
 Default backend_args.
 

Protected Member Functions

 _make_submit_file (self, job, submit_file_path)
 
 _add_batch_directives (self, job, batch_file)
 
 _create_cmd (self, script_path)
 
 _submit_to_batch (cls, cmd)
 
 _create_job_result (cls, job, job_id)
 
 _create_parent_job_result (cls, parent)
 
 _add_wrapper_script_setup (self, job, batch_file)
 
 _add_wrapper_script_teardown (self, job, batch_file)
 

Static Protected Member Functions

 _add_setup (job, batch_file)
 

Detailed Description

Backend for submitting calibration processes to an HTCondor batch system.

Definition at line 1929 of file backends.py.

Member Function Documentation

◆ _add_batch_directives()

_add_batch_directives ( self,
job,
batch_file )
protected
For HTCondor leave empty as the directives are already included in the submit file.

Reimplemented from Batch.

Definition at line 1976 of file backends.py.

1976 def _add_batch_directives(self, job, batch_file):
1977 """
1978 For HTCondor leave empty as the directives are already included in the submit file.
1979 """
1980 print('#!/bin/bash', file=batch_file)
1981

◆ _add_setup()

_add_setup ( job,
batch_file )
staticprotectedinherited
Adds setup lines to the shell script file.

Definition at line 807 of file backends.py.

807 def _add_setup(job, batch_file):
808 """
809 Adds setup lines to the shell script file.
810 """
811 for line in job.setup_cmds:
812 print(line, file=batch_file)
813

◆ _add_wrapper_script_setup()

_add_wrapper_script_setup ( self,
job,
batch_file )
protectedinherited
Adds lines to the submitted script that help with job monitoring/setup. Mostly here so that we can insert
`trap` statements for Ctrl-C situations.

Definition at line 814 of file backends.py.

814 def _add_wrapper_script_setup(self, job, batch_file):
815 """
816 Adds lines to the submitted script that help with job monitoring/setup. Mostly here so that we can insert
817 `trap` statements for Ctrl-C situations.
818 """
819 start_wrapper = f"""# ---
820# trap ctrl-c and call ctrl_c()
821trap '(ctrl_c 130)' SIGINT
822trap '(ctrl_c 143)' SIGTERM
823
824function write_exit_code() {{
825 echo "Writing $1 to exit status file"
826 echo "$1" > {self.exit_code_file}
827 exit $1
828}}
829
830function ctrl_c() {{
831 trap '' SIGINT SIGTERM
832 echo "** Trapped Ctrl-C **"
833 echo "$1" > {self.exit_code_file}
834 exit $1
835}}
836# ---"""
837 print(start_wrapper, file=batch_file)
838
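The trap/exit-file pattern above can be sketched in isolation. This is a minimal reconstruction of the generated bash preamble, not the library's exact output; the exit-code file name is taken from this page's defaults:

```python
# Hypothetical reconstruction of the wrapper-script preamble: trap SIGINT and
# SIGTERM, and record the exit status in a known file so the caller can
# recover it even if the batch system purges the job record.
exit_code_file = "__BACKEND_CMD_EXIT_STATUS__"  # assumed default name


def build_wrapper_preamble(exit_code_file):
    """Build a bash preamble mirroring the _add_wrapper_script_setup pattern."""
    return f"""# ---
# trap ctrl-c and call ctrl_c()
trap '(ctrl_c 130)' SIGINT
trap '(ctrl_c 143)' SIGTERM

function write_exit_code() {{
  echo "$1" > {exit_code_file}
  exit $1
}}
# ---"""


preamble = build_wrapper_preamble(exit_code_file)
```

The doubled braces (`{{`, `}}`) are how literal bash braces survive the Python f-string; the real method uses the same trick.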

◆ _add_wrapper_script_teardown()

_add_wrapper_script_teardown ( self,
job,
batch_file )
protectedinherited
Adds lines to the submitted script that help with job monitoring/teardown. Mostly here so that we can insert
the exit code of the job cmd being written out to a file. This means we can know whether the command was
successful even if the backend server/monitoring database purges the data about our job, e.g. if PBS
removes job information too quickly we might never know whether a job succeeded or failed without some kind of exit
file.

Definition at line 839 of file backends.py.

839 def _add_wrapper_script_teardown(self, job, batch_file):
840 """
841 Adds lines to the submitted script that help with job monitoring/teardown. Mostly here so that we can insert
842 an exit code of the job cmd being written out to a file. Which means that we can know if the command was
843 successful or not even if the backend server/monitoring database purges the data about our job i.e. If PBS
844 removes job information too quickly we may never know if a job succeeded or failed without some kind of exit
845 file.
846 """
847 end_wrapper = """# ---
848write_exit_code $?"""
849 print(end_wrapper, file=batch_file)
850

◆ _create_cmd()

_create_cmd ( self,
script_path )
protected
 

Reimplemented from Batch.

Definition at line 1982 of file backends.py.

1982 def _create_cmd(self, script_path):
1983 """
1984 """
1985 submission_cmd = self.submission_cmds[:]
1986 submission_cmd.append(script_path.as_posix())
1987 return submission_cmd
1988

◆ _create_job_result()

_create_job_result ( cls,
job,
job_id )
protected
 

Reimplemented from Batch.

Definition at line 2126 of file backends.py.

2126 def _create_job_result(cls, job, job_id):
2127 """
2128 """
2129 B2INFO(f"Job ID of {job} recorded as: {job_id}")
2130 job.result = cls.HTCondorResult(job, job_id)
2131

◆ _create_parent_job_result()

_create_parent_job_result ( cls,
parent )
protected
We want to be able to call `ready()` on the top level `Job.result`. So this method needs to exist
so that a Job.result object actually exists. It will be mostly empty and simply updates subjob
statuses and allows the use of ready().

Reimplemented from Backend.

Definition at line 2133 of file backends.py.

2133 def _create_parent_job_result(cls, parent):
2134 parent.result = cls.HTCondorResult(parent, None)
2135

◆ _make_submit_file()

_make_submit_file ( self,
job,
submit_file_path )
protected
Fill HTCondor submission file.

Reimplemented from Batch.

Definition at line 1950 of file backends.py.

1950 def _make_submit_file(self, job, submit_file_path):
1951 """
1952 Fill HTCondor submission file.
1953 """
1954 # Find all files/directories in the working directory to copy on the worker node
1955
1956 files_to_transfer = [i.as_posix() for i in job.working_dir.iterdir()]
1957
1958 job_backend_args = {**self.backend_args, **job.backend_args} # Merge the two dictionaries, with the job having priority
1959
1960 with open(submit_file_path, "w") as submit_file:
1961 print(f'executable = {self.get_submit_script_path(job)}', file=submit_file)
1962 print(f'log = {Path(job.output_dir, "htcondor.log").as_posix()}', file=submit_file)
1963 print(f'output = {Path(job.working_dir, _STDOUT_FILE).as_posix()}', file=submit_file)
1964 print(f'error = {Path(job.working_dir, _STDERR_FILE).as_posix()}', file=submit_file)
1965 print('transfer_input_files = ', ','.join(files_to_transfer), file=submit_file)
1966 print(f'universe = {job_backend_args["universe"]}', file=submit_file)
1967 print(f'getenv = {job_backend_args["getenv"]}', file=submit_file)
1968 print(f'request_memory = {job_backend_args["request_memory"]}', file=submit_file)
1969 print('should_transfer_files = Yes', file=submit_file)
1970 print('when_to_transfer_output = ON_EXIT', file=submit_file)
1971 # Any other lines in the backend args that we don't deal with explicitly but maybe someone wants to insert something
1972 for line in job_backend_args["extra_lines"]:
1973 print(line, file=submit_file)
1974 print('queue', file=submit_file)
1975
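The submit-description layout that `_make_submit_file` produces can be sketched with hypothetical paths and backend_args; the real method derives the executable, log, and transfer list from the `Job` object:

```python
from pathlib import Path
import tempfile

# Illustrative backend_args only; the real defaults come from the backend and
# are merged with the job's own backend_args.
backend_args = {
    "universe": "vanilla",
    "request_memory": "2 GB",
    "extra_lines": ['+Project = "belle2"'],
}


def write_submit_file(path, executable, backend_args):
    """Write a minimal HTCondor submit description (sketch, not the real method)."""
    with open(path, "w") as f:
        print(f"executable = {executable}", file=f)
        print(f"universe = {backend_args['universe']}", file=f)
        print(f"request_memory = {backend_args['request_memory']}", file=f)
        # Pass through any extra lines the user wants in the submit file
        for line in backend_args["extra_lines"]:
            print(line, file=f)
        print("queue", file=f)


with tempfile.TemporaryDirectory() as d:
    sub = Path(d, "submit.sub")
    write_submit_file(sub, "submit.sh", backend_args)
    text = sub.read_text()
```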

◆ _submit_to_batch()

_submit_to_batch ( cls,
cmd )
protected
Do the actual batch submission command and collect the output to find out the job id for later monitoring.

Reimplemented from Batch.

Definition at line 1996 of file backends.py.

1996 def _submit_to_batch(cls, cmd):
1997 """
1998 Do the actual batch submission command and collect the output to find out the job id for later monitoring.
1999 """
2000 job_dir = Path(cmd[-1]).parent.as_posix()
2001 sub_out = ""
2002 attempt = 0
2003 sleep_time = 30
2004
2005 while attempt < 3:
2006 try:
2007 sub_out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, universal_newlines=True, cwd=job_dir)
2008 break
2009 except subprocess.CalledProcessError as e:
2010 attempt += 1
2011 if attempt == 3:
2012 B2ERROR(f"Error during condor_submit: {str(e)} occurred more than 3 times.")
2013 raise e
2014 else:
2015 B2ERROR(f"Error during condor_submit: {str(e)}, sleeping for {sleep_time} seconds.")
2016 time.sleep(sleep_time)
2017 return re.search(r"(\d+\.\d+) - \d+\.\d+", sub_out).groups()[0]
2018
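The job-id extraction at the end of `_submit_to_batch` can be exercised on its own. The sample output line below is made up, but matches the "first - last" job-id range pattern the regex expects from `condor_submit`:

```python
import re

# Hypothetical condor_submit output fragment containing a cluster.proc range;
# the regex keeps only the first job id.
sub_out = "1234.0 - 1234.2"
match = re.search(r"(\d+\.\d+) - \d+\.\d+", sub_out)
job_id = match.groups()[0]
```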

◆ can_submit()

can_submit ( self,
njobs = 1 )
Checks the global number of jobs in HTCondor right now (submitted or running) for this user.
Returns True if the number is lower than the limit, False if it is higher.

Parameters:
    njobs (int): The number of jobs that we want to submit before checking again. Lets us check if we
        are sufficiently below the limit in order to (somewhat) safely submit. It is slightly dangerous to
        assume that it is safe to submit too many jobs since there might be other processes also submitting jobs.
        So njobs really shouldn't be abused when you might be getting close to the limit i.e. keep it <=250
        and check again before submitting more.

Reimplemented from Batch.

Definition at line 2136 of file backends.py.

2136 def can_submit(self, njobs=1):
2137 """
2138 Checks the global number of jobs in HTCondor right now (submitted or running) for this user.
2139 Returns True if the number is lower than the limit, False if it is higher.
2140
2141 Parameters:
2142 njobs (int): The number of jobs that we want to submit before checking again. Lets us check if we
2143 are sufficiently below the limit in order to (somewhat) safely submit. It is slightly dangerous to
2144 assume that it is safe to submit too many jobs since there might be other processes also submitting jobs.
2145 So njobs really shouldn't be abused when you might be getting close to the limit i.e. keep it <=250
2146 and check again before submitting more.
2147 """
2148 B2DEBUG(29, "Calling HTCondor().can_submit()")
2149 jobs_info = self.condor_q()
2150 total_jobs = jobs_info["NJOBS"]
2151 B2INFO(f"Total jobs active in the HTCondor system is currently {total_jobs}")
2152 if (total_jobs + njobs) > self.global_job_limit:
2153 B2INFO(f"Since the global limit is {self.global_job_limit} we cannot submit {njobs} jobs until some complete.")
2154 return False
2155 else:
2156 B2INFO("There is enough space to submit more jobs.")
2157 return True
2158
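The headroom check performed by `can_submit` reduces to one comparison; the numbers below are illustrative, with the limit taken from the documented `default_global_job_limit`:

```python
# Sketch of the can_submit headroom check with made-up job counts.
global_job_limit = 1000  # documented default_global_job_limit


def can_submit(total_jobs, njobs, limit):
    """True if submitting njobs more jobs stays within the limit."""
    return (total_jobs + njobs) <= limit


ok = can_submit(total_jobs=870, njobs=100, limit=global_job_limit)
blocked = can_submit(total_jobs=950, njobs=100, limit=global_job_limit)
```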

◆ condor_history()

condor_history ( cls,
class_ads = None,
job_id = "",
username = "" )
Simplistic interface to the ``condor_history`` command. Lets you request information about all jobs matching the filters
``job_id`` and ``username``. Note that setting ``job_id`` overrides ``username``, which is then ignored.
The result is a JSON dictionary filled from the output of the ``-json`` ``condor_history`` option.

Parameters:
    class_ads (list[str]): A list of condor_history ClassAds that you would like information about.
        By default we give {cls.default_class_ads}; increasing the number of class_ads increases the time taken
        by the condor_history call.
    job_id (str): String representation of the Job ID given by condor_submit during submission.
        If this argument is given then the output of this function will be only information about this job.
        If this argument is not given, then all jobs matching the other filters will be returned.
    username (str): By default we return information about only the current user's jobs. By giving
        a username you can access the job information of a specific user's jobs. By giving ``username='all'`` you will
        receive job information from all known user jobs matching the other filters. This is limited to 10000 records
        and isn't recommended.

Returns:
    dict: JSON dictionary of the form:

    .. code-block:: python

      {
        "NJOBS":<number of records returned by command>,
        "JOBS":[
                {
                 <ClassAd: value>, ...
                }, ...
               ]
      }

Definition at line 2227 of file backends.py.

2227 def condor_history(cls, class_ads=None, job_id="", username=""):
2228 """
2229 Simplistic interface to the ``condor_history`` command. Lets you request information about all jobs matching the filters
2230 ``job_id`` and ``username``. Note that setting job_id overrides username, which is then ignored.
2231 The result is a JSON dictionary filled by output of the ``-json`` ``condor_history`` option.
2232
2233 Parameters:
2234 class_ads (list[str]): A list of condor_history ClassAds that you would like information about.
2235 By default we give {cls.default_class_ads}; increasing the number of class_ads increases the time taken
2236 by the condor_history call.
2237 job_id (str): String representation of the Job ID given by condor_submit during submission.
2238 If this argument is given then the output of this function will be only information about this job.
2239 If this argument is not given, then all jobs matching the other filters will be returned.
2240 username (str): By default we return information about only the current user's jobs. By giving
2241 a username you can access the job information of a specific user's jobs. By giving ``username='all'`` you will
2242 receive job information from all known user jobs matching the other filters. This is limited to 10000 records
2243 and isn't recommended.
2244
2245 Returns:
2246 dict: JSON dictionary of the form:
2247
2248 .. code-block:: python
2249
2250 {
2251 "NJOBS":<number of records returned by command>,
2252 "JOBS":[
2253 {
2254 <ClassAd: value>, ...
2255 }, ...
2256 ]
2257 }
2258 """
2259 B2DEBUG(29, f"Calling HTCondor.condor_history(class_ads={class_ads}, job_id={job_id}, username={username})")
2260 if not class_ads:
2261 class_ads = cls.default_class_ads
2262 # Output fields should be comma separated.
2263 field_list_cmd = ",".join(class_ads)
2264 cmd_list = ["condor_history", "-json", "-attributes", field_list_cmd]
2265 # If job_id is set then we ignore all other filters
2266 if job_id:
2267 cmd_list.append(job_id)
2268 else:
2269 if not username:
2270 username = os.environ["USER"]
2271 # If the username is set to all it is a special case
2272 if username != "all":
2273 cmd_list.append(username)
2274 # We get a JSON serialisable summary from condor_q. But we will alter it slightly to be more similar to other backends
2275 cmd = " ".join(cmd_list)
2276 B2DEBUG(29, f"Calling subprocess with command = '{cmd}'")
2277 try:
2278 records = subprocess.check_output(cmd, stderr=subprocess.STDOUT, universal_newlines=True, shell=True)
2279 except BaseException:
2280 records = None
2281
2282 if records:
2283 records = decode_json_string(records)
2284 else:
2285 records = []
2286
2287 jobs_info = {"JOBS": records}
2288 jobs_info["NJOBS"] = len(jobs_info["JOBS"]) # Just to avoid having to len() it in the future
2289 return jobs_info
2290
2291

◆ condor_q()

condor_q ( cls,
class_ads = None,
job_id = "",
username = "" )
Simplistic interface to the ``condor_q`` command. Lets you request information about all jobs matching the filters
``job_id`` and ``username``. Note that setting ``job_id`` overrides ``username``, which is then ignored.
The result is the JSON dictionary returned by the ``-json`` condor_q option.

Parameters:
    class_ads (list[str]): A list of condor_q ClassAds that you would like information about.
        By default we give {cls.default_class_ads}; increasing the number of class_ads increases the time taken
        by the condor_q call.
    job_id (str): String representation of the Job ID given by condor_submit during submission.
        If this argument is given then the output of this function will be only information about this job.
        If this argument is not given, then all jobs matching the other filters will be returned.
    username (str): By default we return information about only the current user's jobs. By giving
        a username you can access the job information of a specific user's jobs. By giving ``username='all'`` you will
        receive job information from all known user jobs matching the other filters. This may be a LOT of jobs
        so it isn't recommended.

Returns:
    dict: JSON dictionary of the form:

    .. code-block:: python

      {
        "NJOBS":<number of records returned by command>,
        "JOBS":[
                {
                 <ClassAd: value>, ...
                }, ...
               ]
      }

Definition at line 2160 of file backends.py.

2160 def condor_q(cls, class_ads=None, job_id="", username=""):
2161 """
2162 Simplistic interface to the `condor_q` command. Lets you request information about all jobs matching the filters
2163 'job_id' and 'username'. Note that setting job_id overrides username, which is then ignored.
2164 The result is the JSON dictionary returned by output of the ``-json`` condor_q option.
2165
2166 Parameters:
2167 class_ads (list[str]): A list of condor_q ClassAds that you would like information about.
2168 By default we give {cls.default_class_ads}; increasing the number of class_ads increases the time taken
2169 by the condor_q call.
2170 job_id (str): String representation of the Job ID given by condor_submit during submission.
2171 If this argument is given then the output of this function will be only information about this job.
2172 If this argument is not given, then all jobs matching the other filters will be returned.
2173 username (str): By default we return information about only the current user's jobs. By giving
2174 a username you can access the job information of a specific user's jobs. By giving ``username='all'`` you will
2175 receive job information from all known user jobs matching the other filters. This may be a LOT of jobs
2176 so it isn't recommended.
2177
2178 Returns:
2179 dict: JSON dictionary of the form:
2180
2181 .. code-block:: python
2182
2183 {
2184 "NJOBS":<number of records returned by command>,
2185 "JOBS":[
2186 {
2187 <ClassAd: value>, ...
2188 }, ...
2189 ]
2190 }
2191 """
2192 B2DEBUG(29, f"Calling HTCondor.condor_q(class_ads={class_ads}, job_id={job_id}, username={username})")
2193 if not class_ads:
2194 class_ads = cls.default_class_ads
2195 # Output fields should be comma separated.
2196 field_list_cmd = ",".join(class_ads)
2197 cmd_list = ["condor_q", "-json", "-attributes", field_list_cmd]
2198 # If job_id is set then we ignore all other filters
2199 if job_id:
2200 cmd_list.append(job_id)
2201 else:
2202 if not username:
2203 username = os.environ["USER"]
2204 # If the username is set to all it is a special case
2205 if username == "all":
2206 cmd_list.append("-allusers")
2207 else:
2208 cmd_list.append(username)
2209 # We get a JSON serialisable summary from condor_q. But we will alter it slightly to be more similar to other backends
2210 cmd = " ".join(cmd_list)
2211 B2DEBUG(29, f"Calling subprocess with command = '{cmd}'")
2212 # condor_q occasionally fails
2213 try:
2214 records = subprocess.check_output(cmd, stderr=subprocess.STDOUT, universal_newlines=True, shell=True)
2215 except BaseException:
2216 records = None
2217
2218 if records:
2219 records = decode_json_string(records)
2220 else:
2221 records = []
2222 jobs_info = {"JOBS": records}
2223 jobs_info["NJOBS"] = len(jobs_info["JOBS"]) # Just to avoid having to len() it in the future
2224 return jobs_info
2225
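The reshaping of the raw `condor_q -json` output (a JSON array of ClassAd records) into the `{"NJOBS": ..., "JOBS": [...]}` summary described above can be sketched with made-up records; `json.loads` stands in for the `decode_json_string` helper used in the source:

```python
import json

# Hypothetical condor_q -json output: a JSON array of ClassAd records.
raw = '[{"GlobalJobId": "sched.example#1234.0#1", "JobStatus": 2, "Owner": "user"}]'


def to_jobs_info(raw_output):
    """Reshape raw JSON output into the summary dict documented above."""
    records = json.loads(raw_output) if raw_output else []
    jobs_info = {"JOBS": records}
    jobs_info["NJOBS"] = len(records)  # avoid having to len() it later
    return jobs_info


info = to_jobs_info(raw)
```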

◆ get_batch_submit_script_path()

get_batch_submit_script_path ( self,
job )
Construct the Path object of the .sub file that we will use to describe the job.

Reimplemented from Batch.

Definition at line 1989 of file backends.py.

1989 def get_batch_submit_script_path(self, job):
1990 """
1991 Construct the Path object of the .sub file that we will use to describe the job.
1992 """
1993 return Path(job.working_dir, self.batch_submit_script)
1994

◆ get_submit_script_path()

get_submit_script_path ( self,
job )
inherited
Construct the Path object of the bash script file that we will submit. It will contain
the actual job command, wrapper commands, setup commands, and any batch directives.

Definition at line 860 of file backends.py.

860 def get_submit_script_path(self, job):
861 """
862 Construct the Path object of the bash script file that we will submit. It will contain
863 the actual job command, wrapper commands, setup commands, and any batch directives
864 """
865 return Path(job.working_dir, self.submit_script)
866
867

◆ submit()

submit ( self,
job,
check_can_submit = True,
jobs_per_check = 100 )
inherited
 

Reimplemented from Backend.

Definition at line 1205 of file backends.py.

1205 def submit(self, job, check_can_submit=True, jobs_per_check=100):
1206 """
1207 """
1208 raise NotImplementedError("This is an abstract submit(job) method that shouldn't have been called. "
1209 "Did you submit a (Sub)Job?")
1210

Member Data Documentation

◆ backend_args

dict backend_args = {**self.default_backend_args, **backend_args}
inherited

The backend args that will be applied to jobs unless the job specifies them itself.

Definition at line 797 of file backends.py.
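The merge expression shown above gives job/user settings precedence over the backend defaults; a minimal illustration with hypothetical keys:

```python
# Dict unpacking merges left to right, so later (user-supplied) keys win.
default_backend_args = {"universe": "vanilla", "request_memory": "2 GB"}
user_backend_args = {"request_memory": "4 GB"}

merged = {**default_backend_args, **user_backend_args}
```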

◆ batch_submit_script

batch_submit_script = "submit.sub"
static

HTCondor batch script (different to the wrapper script of Backend.submit_script)

Definition at line 1934 of file backends.py.

◆ default_backend_args

dict default_backend_args = {}
staticinherited

Default backend_args.

Definition at line 789 of file backends.py.

◆ default_class_ads

list default_class_ads = ["GlobalJobId", "JobStatus", "Owner"]
static

Default ClassAd attributes to return from commands like condor_q.

Definition at line 1948 of file backends.py.

◆ default_global_job_limit

int default_global_job_limit = 1000
staticinherited

Default global limit on the total number of submitted/running jobs that the user can have.

This limit will not affect the total number of jobs that are eventually submitted. But the jobs won't actually be submitted until this limit can be respected i.e. until the number of total jobs in the Batch system goes down. Since we actually submit in chunks of N jobs, before checking this limit value again, this value needs to be a little lower than the real batch system limit. Otherwise you could accidentally go over during the N job submission if other processes are checking and submitting concurrently. This is quite common for the first submission of jobs from parallel calibrations.

Note that if there are other jobs already submitted for your account, then these will count towards this limit.

Definition at line 1156 of file backends.py.
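The "a little lower than the real batch system limit" guidance can be made concrete. With made-up numbers for the scheduler's hard limit and the submission chunk size, leaving one full chunk of headroom guarantees a chunk submitted just below the configured limit cannot overshoot the real one:

```python
# Illustrative numbers only: a hypothetical hard scheduler limit and the
# chunk size submitted between limit checks.
real_system_limit = 1050
jobs_per_check = 100

# Configured limit that leaves one full chunk of headroom.
safe_global_job_limit = real_system_limit - jobs_per_check
```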

◆ default_sleep_between_submission_checks

int default_sleep_between_submission_checks = 30
staticinherited

Default time between re-checking whether the number of active jobs is below the global job limit.

Definition at line 1158 of file backends.py.

◆ exit_code_file

str exit_code_file = "__BACKEND_CMD_EXIT_STATUS__"
staticinherited

Default exit code file name.

Definition at line 787 of file backends.py.

◆ global_job_limit

int global_job_limit = self.default_global_job_limit
inherited

The active job limit.

This is 'global' because we want to prevent us accidentally submitting too many jobs from all current and previous submission scripts.

Definition at line 1167 of file backends.py.

◆ sleep_between_submission_checks

int sleep_between_submission_checks = self.default_sleep_between_submission_checks
inherited

Seconds we wait before checking if we can submit a list of jobs.

Only relevant once we hit the global limit of active jobs, which is a lot usually.

Definition at line 1170 of file backends.py.

◆ submission_cmds

list submission_cmds = []
staticinherited

Shell command used to submit a script; should be implemented in the derived class.

Definition at line 1143 of file backends.py.

◆ submit_script

submit_script = "submit.sh"
staticinherited

Default submission script name.

Definition at line 785 of file backends.py.


The documentation for this class was generated from the following file: backends.py