Belle II Software development
HTCondor.HTCondorResult Class Reference
Inheritance diagram for HTCondor.HTCondorResult:
Result

Public Member Functions

 __init__ (self, job, job_id)
 
 update_status (self)
 
 ready (self)
 
 get_exit_code_from_file (self)
 

Public Attributes

 job_id = job_id
 Job id given by HTCondor.
 
 job = job
 Job object for result.
 
 time_to_wait_for_exit_code_file = timedelta(minutes=5)
 After our first attempt to view the exit code file once the job is 'finished', how long should we wait for it to exist before timing out?
 
 exit_code_file_initial_time = None
 Time we started waiting for the exit code file to appear.
 

Static Public Attributes

dict backend_code_to_status
 HTCondor statuses mapped to Job statuses.
 

Protected Member Functions

 _update_result_status (self, condor_q_output)
 

Protected Attributes

bool _is_ready = False
 Quicker way to know if it's ready once it has already been found.
 

Detailed Description

Simple class to help monitor status of jobs submitted by HTCondor Backend.

You pass in a `Job` object and the job id returned by a condor_submit command.
Calling the `update_status` method runs condor_q and, if needed, ``condor_history``
to see whether or not the job has finished. The `ready` method then reports
readiness based on the known job status.

Definition at line 2014 of file backends.py.
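
The monitoring pattern described above can be sketched without a batch system. This is a minimal illustration, not the class itself: `FakeJob`, `FakeResult` and `wait_until_ready` are hypothetical stand-ins, and the scripted status list replaces real condor_q calls. Only the `update_status()`/`ready()` interface and the cached `_is_ready` flag mirror this page.

```python
import time


class FakeJob:
    """Hypothetical stand-in for a Job; only what ready() needs is modelled."""
    exit_statuses = ["failed", "completed"]

    def __init__(self, name):
        self.name = name
        self.status = "submitted"


class FakeResult:
    """Hypothetical stand-in with the same update_status()/ready() interface."""

    def __init__(self, job, statuses):
        self.job = job
        self._is_ready = False
        self._statuses = iter(statuses)  # scripted statuses instead of condor_q

    def update_status(self):
        # A real result would query the batch system here.
        self.job.status = next(self._statuses, self.job.status)

    def ready(self):
        # Cache readiness so later calls need no further checks.
        if self._is_ready:
            return True
        if self.job.status in self.job.exit_statuses:
            self._is_ready = True
            return True
        return False


def wait_until_ready(result, poll_interval=0.0, max_polls=100):
    """Poll until the result is ready; return the number of polls used."""
    for n_polls in range(1, max_polls + 1):
        result.update_status()
        if result.ready():
            return n_polls
        time.sleep(poll_interval)
    raise TimeoutError(f"{result.job.name} not ready after {max_polls} polls")


job = FakeJob("example")
result = FakeResult(job, ["submitted", "running", "completed"])
polls = wait_until_ready(result)
```

In practice the poll interval would be seconds or minutes rather than zero, since each status update costs a batch system query.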

Constructor & Destructor Documentation

◆ __init__()

__init__(self, job, job_id)
Pass in the job object and the job id to allow the result to do monitoring and perform
post processing of the job.

Definition at line 2033 of file backends.py.

    def __init__(self, job, job_id):
        """
        Pass in the job object and the job id to allow the result to do monitoring and perform
        post processing of the job.
        """
        super().__init__(job)

        self.job_id = job_id

Member Function Documentation

◆ _update_result_status()

_update_result_status(self, condor_q_output)
protected
To be slightly more efficient, we pass in the output of a previous call to condor_q and look for this job in it.
If the job is there, we update its status from that record. If not, we are forced to fall back on other sources:
the exit code file, our own call to condor_q and, if needed, ``condor_history``.

Parameters:
    condor_q_output (dict): The JSON output of a previous call to `HTCondor.condor_q` which we can reuse to find the
        status of this job if it was active when that command ran.

Definition at line 2055 of file backends.py.

    def _update_result_status(self, condor_q_output):
        """
        To be slightly more efficient, we pass in the output of a previous call to condor_q and look for this job in it.
        If the job is there, we update its status from that record. If not, we are forced to fall back on other sources:
        the exit code file, our own call to condor_q and, if needed, ``condor_history``.

        Parameters:
            condor_q_output (dict): The JSON output of a previous call to `HTCondor.condor_q` which we can reuse to find the
                status of this job if it was active when that command ran.
        """
        B2DEBUG(29, f"Calling {self.job}.result._update_result_status()")
        jobs_info = []
        for job_record in condor_q_output["JOBS"]:
            job_id = job_record["GlobalJobId"].split("#")[1]
            if job_id == self.job_id:
                B2DEBUG(29, f"Found {self.job_id} in condor_q_output.")
                jobs_info.append(job_record)

        # Let's look for the exit code file where we expect it
        if not jobs_info:
            try:
                exit_code = self.get_exit_code_from_file()
            except FileNotFoundError:
                waiting_time = datetime.now() - self.exit_code_file_initial_time
                # Give up only once we have waited longer than the allowed time
                if waiting_time > self.time_to_wait_for_exit_code_file:
                    B2ERROR(f"Exit code file for {self.job} missing and we can't wait longer. Setting exit code to 1.")
                    exit_code = 1
                else:
                    B2WARNING(f"Exit code file for {self.job} missing, will wait longer.")
                    return
            if exit_code:
                jobs_info = [{"JobStatus": 6, "HoldReason": None}]  # Set to failed
            else:
                jobs_info = [{"JobStatus": 4, "HoldReason": None}]  # Set to completed

        # If this job wasn't in the passed in condor_q output, let's try our own with the specific job_id
        if not jobs_info:
            jobs_info = HTCondor.condor_q(job_id=self.job_id, class_ads=["JobStatus", "HoldReason"])["JOBS"]

        # If no job information is returned then the job already left the queue
        # check in the history to see if it succeeded or failed
        if not jobs_info:
            try:
                jobs_info = HTCondor.condor_history(job_id=self.job_id, class_ads=["JobStatus", "HoldReason"])["JOBS"]
            except KeyError:
                hold_reason = "No Reason Known"

        # Still no record of it after waiting for the exit code file?
        if not jobs_info:
            jobs_info = [{"JobStatus": 6, "HoldReason": None}]  # Set to failed

        job_info = jobs_info[0]
        backend_status = job_info["JobStatus"]
        # if job is held (backend_status = 5) then report why then keep waiting
        if backend_status == 5:
            hold_reason = job_info.get("HoldReason", None)
            B2WARNING(f"{self.job} on hold because of {hold_reason}. Keep waiting.")
            backend_status = 2
        try:
            new_job_status = self.backend_code_to_status[backend_status]
        except KeyError as err:
            raise BackendError(f"Unidentified backend status found for {self.job}: {backend_status}") from err
        if new_job_status != self.job.status:
            self.job.status = new_job_status
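
The first step of the listing, matching records by the middle field of `GlobalJobId`, can be tried in isolation. The sketch below assumes the common `<schedd>#<cluster>.<proc>#<timestamp>` layout for `GlobalJobId`; the helper name `find_job_records` and the sample records are invented for illustration and are not part of backends.py.

```python
def find_job_records(condor_q_output, job_id):
    """Return the records in a condor_q-style dict whose GlobalJobId matches job_id."""
    matches = []
    for record in condor_q_output["JOBS"]:
        # Middle field of "<schedd>#<cluster>.<proc>#<timestamp>" (assumed layout)
        if record["GlobalJobId"].split("#")[1] == job_id:
            matches.append(record)
    return matches


# Made-up sample data, shaped like the dict the listing iterates over.
sample_output = {
    "JOBS": [
        {"GlobalJobId": "schedd.example.org#1234.0#1700000000", "JobStatus": 2},
        {"GlobalJobId": "schedd.example.org#1235.0#1700000100", "JobStatus": 1},
    ]
}
found = find_job_records(sample_output, "1234.0")
missing = find_job_records(sample_output, "9999.0")
```

An empty result, as for `missing`, corresponds to the case where the method falls back to the exit code file.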

◆ get_exit_code_from_file()

get_exit_code_from_file ( self)
inherited
Read the exit code file to discover the exit status of the job command. Useful fallback if the job is no longer
known to the job database (batch system purged it for example). Since some backends may take time to download
the output files of the job back to the working directory we use a time limit on how long to wait.

Definition at line 908 of file backends.py.

    def get_exit_code_from_file(self):
        """
        Read the exit code file to discover the exit status of the job command. Useful fallback if the job is no longer
        known to the job database (batch system purged it for example). Since some backends may take time to download
        the output files of the job back to the working directory we use a time limit on how long to wait.
        """
        if not self.exit_code_file_initial_time:
            self.exit_code_file_initial_time = datetime.now()
        exit_code_path = Path(self.job.working_dir, Backend.exit_code_file)
        with open(exit_code_path) as f:
            exit_code = int(f.read().strip())
        B2DEBUG(29, f"Exit code from file for {self.job} was {exit_code}")
        return exit_code
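
The interplay between a missing exit code file and the waiting time limit can be sketched as a standalone function. This is an illustration under stated assumptions, not the method above: `read_exit_code`, its `(exit_code, keep_waiting)` return value, the injectable `now` argument and the file name `exit_code.txt` are all hypothetical.

```python
import tempfile
from datetime import datetime, timedelta
from pathlib import Path


def read_exit_code(path, first_attempt, time_limit, now=None):
    """Return (exit_code, keep_waiting) for an exit code file that may not exist yet."""
    now = now or datetime.now()
    try:
        return int(Path(path).read_text().strip()), False
    except FileNotFoundError:
        if now - first_attempt > time_limit:
            return 1, False  # waited too long: treat the job as failed
        return None, True  # still within the limit: try again later


with tempfile.TemporaryDirectory() as tmp:
    exit_file = Path(tmp, "exit_code.txt")  # hypothetical file name
    start = datetime(2024, 1, 1, 12, 0, 0)
    limit = timedelta(minutes=5)
    # File missing, one minute after the first attempt: keep waiting.
    early = read_exit_code(exit_file, start, limit, now=start + timedelta(minutes=1))
    # File still missing after six minutes: give up with exit code 1.
    late = read_exit_code(exit_file, start, limit, now=start + timedelta(minutes=6))
    # Once the file exists its content is returned.
    exit_file.write_text("0\n")
    done = read_exit_code(exit_file, start, limit, now=start + timedelta(minutes=2))
```

Passing `now` explicitly keeps the sketch deterministic; production code would normally rely on the current time.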

◆ ready()

ready ( self)
inherited
Returns whether or not this job result is known to be ready. Doesn't actually change the job status. Just changes
the 'readiness' based on the known job status.

Definition at line 887 of file backends.py.

    def ready(self):
        """
        Returns whether or not this job result is known to be ready. Doesn't actually change the job status. Just changes
        the 'readiness' based on the known job status.
        """
        B2DEBUG(29, f"Calling {self.job}.result.ready()")
        if self._is_ready:
            return True
        elif self.job.status in self.job.exit_statuses:
            self._is_ready = True
            return True
        else:
            return False

◆ update_status()

update_status ( self)
Update the job's (or subjobs') status by calling condor_q.

Reimplemented from Result.

Definition at line 2042 of file backends.py.

    def update_status(self):
        """
        Update the job's (or subjobs') status by calling condor_q.
        """
        B2DEBUG(29, f"Calling {self.job.name}.result.update_status()")
        # Get all jobs info and reuse it for each status update to minimise time spent on this updating.
        condor_q_output = HTCondor.condor_q()
        if self.job.subjobs:
            for subjob in self.job.subjobs.values():
                subjob.result._update_result_status(condor_q_output)
        else:
            self._update_result_status(condor_q_output)
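
The point of querying once and reusing the output for every subjob can be shown with a counter. The sketch below is hypothetical throughout: `CountingQueue` stands in for the batch system query, `update_all` for the loop over subjobs, and the records are invented.

```python
class CountingQueue:
    """Hypothetical stand-in for a batch queue query; counts how often it is called."""

    def __init__(self, records):
        self.records = records
        self.calls = 0

    def query(self):
        self.calls += 1
        return {"JOBS": list(self.records)}


def update_all(queue, job_ids):
    """Look up every job id in a single shared snapshot of the queue."""
    snapshot = queue.query()  # one query, reused for all job ids
    known = {rec["GlobalJobId"].split("#")[1]: rec["JobStatus"] for rec in snapshot["JOBS"]}
    # Jobs absent from the snapshot map to None and would need a fallback lookup.
    return {job_id: known.get(job_id) for job_id in job_ids}


queue = CountingQueue([
    {"GlobalJobId": "schedd.example.org#10.0#1700000000", "JobStatus": 2},
    {"GlobalJobId": "schedd.example.org#10.1#1700000000", "JobStatus": 1},
])
statuses = update_all(queue, ["10.0", "10.1", "10.2"])
```

However many job ids are checked, `queue.calls` stays at one, which is the saving the comment in the listing refers to.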

Member Data Documentation

◆ _is_ready

bool _is_ready = False
protected inherited

Quicker way to know if it's ready once it has already been found.

Saves a lot of calls to batch system commands.

Definition at line 880 of file backends.py.

◆ backend_code_to_status

dict backend_code_to_status
static
Initial value:
= {0: "submitted",
   1: "submitted",
   2: "running",
   3: "failed",
   4: "completed",
   5: "submitted",
   6: "failed"
   }

HTCondor statuses mapped to Job statuses.

Definition at line 2024 of file backends.py.
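
A lookup against this mapping, with an explicit error for unknown codes, might look like the sketch below. The dictionary values are copied from the initial value above; the function name `translate_status` and the use of `ValueError` are choices made for this illustration only (the listing of `_update_result_status` raises `BackendError`).

```python
# Values copied from the initial value documented above.
BACKEND_CODE_TO_STATUS = {
    0: "submitted",
    1: "submitted",
    2: "running",
    3: "failed",
    4: "completed",
    5: "submitted",
    6: "failed",
}


def translate_status(code):
    """Map a numeric backend status to a job status string, failing loudly on unknown codes."""
    try:
        return BACKEND_CODE_TO_STATUS[code]
    except KeyError as err:
        raise ValueError(f"Unidentified backend status: {code}") from err


completed = translate_status(4)
held = translate_status(5)
try:
    translate_status(99)
    unknown_raised = False
except ValueError:
    unknown_raised = True
```

Note that `_update_result_status` rewrites a held status (5) to 2 before the lookup, so the entry for 5 is not the one used on that path.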

◆ exit_code_file_initial_time

exit_code_file_initial_time = None
inherited

Time we started waiting for the exit code file to appear.

Definition at line 885 of file backends.py.

◆ job

job = job
inherited

Job object for result.

Definition at line 878 of file backends.py.

◆ job_id

job_id = job_id

Job id given by HTCondor.

Definition at line 2040 of file backends.py.

◆ time_to_wait_for_exit_code_file

time_to_wait_for_exit_code_file = timedelta(minutes=5)
inherited

After our first attempt to view the exit code file once the job is 'finished', how long should we wait for it to exist before timing out?

Definition at line 883 of file backends.py.


The documentation for this class was generated from the following file: