Belle II Software development
LSF.LSFResult Class Reference
Inheritance diagram for LSF.LSFResult: LSFResult inherits from Result.

Public Member Functions

 __init__ (self, job, job_id)
 
 update_status (self)
 
 ready (self)
 
 get_exit_code_from_file (self)
 

Public Attributes

 job_id = job_id
 Job ID given by LSF.
 
 job = job
 Job object for result.
 
 time_to_wait_for_exit_code_file = timedelta(minutes=5)
 How long to wait for the exit code file to exist before timing out, measured from our first attempt to read it once the job is 'finished'.
 
 exit_code_file_initial_time = None
 Time we started waiting for the exit code file to appear.
 

Static Public Attributes

dict backend_code_to_status
 LSF statuses mapped to Job statuses.
 

Protected Member Functions

 _update_result_status (self, bjobs_output)
 
 _get_status_from_output (self, output)
 

Protected Attributes

bool _is_ready = False
 Quicker way to know if it's ready once it has already been found.
 

Detailed Description

Simple class to help monitor status of jobs submitted by LSF Backend.

You pass in a `Job` object and the job ID returned by a bsub command.
Calling the `update_status` method runs bjobs to refresh the job's status; the `ready` method then reports whether the job has reached one of its exit statuses.

Definition at line 1682 of file backends.py.
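
The monitoring pattern described above can be sketched as a polling loop: refresh the status, then test readiness. This is a minimal sketch, not the framework's own code. `FakeJob`, `FakeResult`, `wait_until_ready` and the contents of `exit_statuses` are hypothetical stand-ins so that the example runs without an LSF cluster; the stand-in replays canned statuses instead of calling bjobs.

```python
import time


class FakeJob:
    """Hypothetical stand-in for a Job: only the attributes that ready() reads."""

    # Assumed exit statuses, chosen from the Job statuses that appear in backend_code_to_status.
    exit_statuses = ["failed", "completed"]

    def __init__(self):
        self.status = "submitted"


class FakeResult:
    """Hypothetical stand-in for LSFResult that replays canned statuses instead of running bjobs."""

    def __init__(self, job, job_id, canned_statuses):
        self.job = job
        self.job_id = job_id
        self._canned = iter(canned_statuses)
        self._is_ready = False

    def update_status(self):
        # The real method queries the batch system; here we take the next canned status.
        self.job.status = next(self._canned, self.job.status)

    def ready(self):
        # Cache the answer once an exit status has been seen.
        if self._is_ready:
            return True
        if self.job.status in self.job.exit_statuses:
            self._is_ready = True
            return True
        return False


def wait_until_ready(result, poll_interval=0.0, max_polls=100):
    """Refresh the status, then test readiness, until ready or out of polls."""
    for n_polls in range(1, max_polls + 1):
        result.update_status()
        if result.ready():
            return n_polls
        time.sleep(poll_interval)
    raise TimeoutError(f"Job {result.job_id} not ready after {max_polls} polls")


job = FakeJob()
result = FakeResult(job, job_id="12345", canned_statuses=["submitted", "running", "completed"])
polls_needed = wait_until_ready(result)
print(polls_needed, job.status)
```

Keeping the refresh and the readiness test separate means `ready` can be called often without triggering extra batch system queries.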

Constructor & Destructor Documentation

◆ __init__()

__init__ ( self,
job,
job_id )
Pass in the job object and the job id to allow the result to do monitoring and perform
post processing of the job.

Definition at line 1698 of file backends.py.

    def __init__(self, job, job_id):
        """
        Pass in the job object and the job id to allow the result to do monitoring and perform
        post processing of the job.
        """
        super().__init__(job)

        self.job_id = job_id

Member Function Documentation

◆ _get_status_from_output()

_get_status_from_output ( self,
output )
protected
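Get this job's LSF status from the JSON output of bjobs. Raises KeyError if no record matches the job ID, and BackendError if bjobs reported an unrecognised error.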

Definition at line 1761 of file backends.py.

    def _get_status_from_output(self, output):
        """
        Get status from output
        """
        # An "ERROR" key in the first record means bjobs could not report on the job.
        if output["JOBS"] and "ERROR" in output["JOBS"][0]:
            if output["JOBS"][0]["ERROR"] == f"Job <{self.job_id}> is not found":
                raise KeyError(f"No job record in the 'output' argument had the 'JOBID'=={self.job_id}")
            else:
                raise BackendError(f"Unidentified Error during status check for {self.job}: {output}")
        else:
            for job_info in output["JOBS"]:
                if job_info["JOBID"] == self.job_id:
                    return job_info["STAT"]
            else:
                # for/else: reached only if the loop finished without finding a matching record.
                raise KeyError(f"No job record in the 'output' argument had the 'JOBID'=={self.job_id}")
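
To make the lookup concrete, here is a hedged, standalone sketch of the same logic applied to sample dictionaries. The dictionary shapes are assumptions inferred only from the keys the listing reads ("JOBS", "JOBID", "STAT", "ERROR"); the job IDs and the "Queue unavailable" message are invented, and `status_from_bjobs` and the local `BackendError` are hypothetical stand-ins rather than framework code.

```python
class BackendError(Exception):
    """Local stand-in for the BackendError raised in the listing."""


def status_from_bjobs(output, job_id):
    """Return the 'STAT' of the record whose 'JOBID' matches job_id."""
    jobs = output["JOBS"]
    if jobs and "ERROR" in jobs[0]:
        if jobs[0]["ERROR"] == f"Job <{job_id}> is not found":
            raise KeyError(f"No job record had JOBID == {job_id}")
        raise BackendError(f"Unidentified error during status check: {output}")
    for job_info in jobs:
        if job_info["JOBID"] == job_id:
            return job_info["STAT"]
    raise KeyError(f"No job record had JOBID == {job_id}")


# Assumed shapes: a list of job records, or a single error record.
sample_output = {"JOBS": [{"JOBID": "101", "STAT": "RUN"}, {"JOBID": "102", "STAT": "PEND"}]}
not_found_output = {"JOBS": [{"ERROR": "Job <999> is not found"}]}
other_error_output = {"JOBS": [{"ERROR": "Queue unavailable"}]}

status = status_from_bjobs(sample_output, "102")

try:
    status_from_bjobs(not_found_output, "999")
    not_found_raised = False
except KeyError:
    not_found_raised = True

try:
    status_from_bjobs(other_error_output, "999")
    other_error_raised = False
except BackendError:
    other_error_raised = True

print(status, not_found_raised, other_error_raised)
```

Raising KeyError for a missing job lets the caller treat "not found" as recoverable, while any other error surfaces as BackendError.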

◆ _update_result_status()

_update_result_status ( self,
bjobs_output )
protected
Parameters:
    bjobs_output (dict): The JSON output of a previous call to bjobs, reused to find the status of this job.
        It must contain the 'stat' and 'id' fields; without them the status cannot be determined.

Definition at line 1720 of file backends.py.

    def _update_result_status(self, bjobs_output):
        """
        Parameters:
            bjobs_output (dict): The JSON output of a previous call to bjobs which we can reuse to find the
                status of this job. Obviously you should only be passing a JSON dict that contains the 'stat' and
                'id' information, otherwise it is useless.

        """
        try:
            backend_status = self._get_status_from_output(bjobs_output)
        except KeyError:
            # If this happens then maybe the job id wasn't in the bjobs_output argument because it finished.
            # Instead of failing immediately we try re-running bjobs here explicitly and then fail if it still isn't there.
            bjobs_output = LSF.bjobs(output_fields=["stat", "id"], job_id=str(self.job_id))
            try:
                backend_status = self._get_status_from_output(bjobs_output)
            except KeyError:
                # If this happened, maybe we're looking at an old finished job. We could fall back to bhist, but it's
                # slow and terrible. Instead let's try looking for the exit code file.
                try:
                    exit_code = self.get_exit_code_from_file()
                except FileNotFoundError:
                    waiting_time = datetime.now() - self.exit_code_file_initial_time
                    # Give up only once we have waited longer than the allowed time.
                    if waiting_time > self.time_to_wait_for_exit_code_file:
                        B2ERROR(f"Exit code file for {self.job} missing and we can't wait longer. Setting exit code to 1.")
                        exit_code = 1
                    else:
                        B2WARNING(f"Exit code file for {self.job} missing, will wait longer.")
                        return
                if exit_code:
                    backend_status = "EXIT"
                else:
                    backend_status = "FINISHED"
        try:
            new_job_status = self.backend_code_to_status[backend_status]
        except KeyError as err:
            raise BackendError(f"Unidentified backend status found for {self.job}: {backend_status}") from err

        if new_job_status != self.job.status:
            self.job.status = new_job_status
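
The timeout decision for the missing exit code file can be isolated as a small predicate. The following is a minimal sketch of the behaviour that the log messages describe: keep waiting while the elapsed time is within the limit, and give up once it is exceeded. The function name `exit_code_file_timed_out` and the example timestamps are hypothetical; the five-minute default mirrors `time_to_wait_for_exit_code_file`.

```python
from datetime import datetime, timedelta


def exit_code_file_timed_out(first_attempt, now, time_to_wait=timedelta(minutes=5)):
    """Return True once more than time_to_wait has passed since the first read attempt."""
    waiting_time = now - first_attempt
    return waiting_time > time_to_wait


# Invented timestamps, used only to exercise both branches.
first_attempt = datetime(2024, 1, 1, 12, 0, 0)
still_waiting = exit_code_file_timed_out(first_attempt, first_attempt + timedelta(minutes=2))
gave_up = exit_code_file_timed_out(first_attempt, first_attempt + timedelta(minutes=6))
print(still_waiting, gave_up)
```

Passing `now` as an argument, rather than calling `datetime.now()` inside the function, keeps the check deterministic and easy to test.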

◆ get_exit_code_from_file()

get_exit_code_from_file ( self)
inherited
Read the exit code file to discover the exit status of the job command. Useful fallback if the job is no longer
known to the job database (batch system purged it for example). Since some backends may take time to download
the output files of the job back to the working directory we use a time limit on how long to wait.

Definition at line 909 of file backends.py.

    def get_exit_code_from_file(self):
        """
        Read the exit code file to discover the exit status of the job command. Useful fallback if the job is no longer
        known to the job database (batch system purged it for example). Since some backends may take time to download
        the output files of the job back to the working directory we use a time limit on how long to wait.
        """
        if not self.exit_code_file_initial_time:
            self.exit_code_file_initial_time = datetime.now()
        exit_code_path = Path(self.job.working_dir, Backend.exit_code_file)
        with open(exit_code_path) as f:
            exit_code = int(f.read().strip())
            B2DEBUG(29, f"Exit code from file for {self.job} was {exit_code}")
        return exit_code
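
As an illustration of this fallback, the sketch below reads an integer exit code from a file in a temporary working directory, first while the file is still missing and then after it has been written. It is a sketch under stated assumptions: `EXIT_CODE_FILE` is a hypothetical file name (the real one comes from `Backend.exit_code_file`), and `read_exit_code` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# Hypothetical file name; the real value is Backend.exit_code_file.
EXIT_CODE_FILE = "exit_code.txt"


def read_exit_code(working_dir, file_name=EXIT_CODE_FILE):
    """Parse the integer exit code stored in working_dir/file_name."""
    exit_code_path = Path(working_dir, file_name)
    with open(exit_code_path) as f:
        return int(f.read().strip())


with tempfile.TemporaryDirectory() as working_dir:
    # Before the file exists, the read fails with FileNotFoundError.
    try:
        read_exit_code(working_dir)
        missing_detected = False
    except FileNotFoundError:
        missing_detected = True

    # Once the job's output has arrived, the exit code can be parsed.
    Path(working_dir, EXIT_CODE_FILE).write_text("0\n")
    exit_code = read_exit_code(working_dir)

print(missing_detected, exit_code)
```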

◆ ready()

ready ( self)
inherited
Returns whether or not this job result is known to be ready. Doesn't actually change the job status. Just changes
the 'readiness' based on the known job status.

Definition at line 888 of file backends.py.

    def ready(self):
        """
        Returns whether or not this job result is known to be ready. Doesn't actually change the job status. Just changes
        the 'readiness' based on the known job status.
        """
        B2DEBUG(29, f"Calling {self.job}.result.ready()")
        if self._is_ready:
            return True
        elif self.job.status in self.job.exit_statuses:
            self._is_ready = True
            return True
        else:
            return False

◆ update_status()

update_status ( self)
Update the job's (or subjobs') status by calling bjobs.

Reimplemented from Result.

Definition at line 1707 of file backends.py.

    def update_status(self):
        """
        Update the job's (or subjobs') status by calling bjobs.
        """
        B2DEBUG(29, f"Calling {self.job.name}.result.update_status()")
        # Get all jobs info and reuse it for each status update to minimise time spent on this updating.
        bjobs_output = LSF.bjobs(output_fields=["stat", "id"])
        if self.job.subjobs:
            for subjob in self.job.subjobs.values():
                subjob.result._update_result_status(bjobs_output)
        else:
            self._update_result_status(bjobs_output)
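
The point of reusing one bjobs snapshot is that the number of batch system queries stays constant regardless of how many subjobs there are. A hedged sketch of that idea follows; `CountingBatchSystem` and `update_all` are hypothetical names, and the record shape is assumed from the "JOBID" and "STAT" keys used elsewhere on this page.

```python
class CountingBatchSystem:
    """Hypothetical stand-in for the batch system that counts how often it is queried."""

    def __init__(self, records):
        self.records = records
        self.n_queries = 0

    def query_all(self):
        self.n_queries += 1
        return {"JOBS": list(self.records)}


def update_all(subjob_ids, batch):
    """Query once, then look up every subjob in that single snapshot."""
    snapshot = batch.query_all()
    stat_by_id = {record["JOBID"]: record["STAT"] for record in snapshot["JOBS"]}
    # Subjobs missing from the snapshot map to None in this sketch.
    return {job_id: stat_by_id.get(job_id) for job_id in subjob_ids}


batch = CountingBatchSystem([{"JOBID": "201", "STAT": "RUN"}, {"JOBID": "202", "STAT": "DONE"}])
statuses = update_all(["201", "202", "203"], batch)
print(statuses, batch.n_queries)
```

With three subjobs the batch system is still queried only once; a per-subjob query would scale linearly with the number of subjobs.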

Member Data Documentation

◆ _is_ready

bool _is_ready = False
protected, inherited

Quicker way to know if it's ready once it has already been found.

Saves a lot of calls to batch system commands.

Definition at line 881 of file backends.py.

◆ backend_code_to_status

dict backend_code_to_status
static
Initial value:
= {"RUN": "running",
   "DONE": "completed",
   "FINISHED": "completed",
   "EXIT": "failed",
   "PEND": "submitted"
   }

LSF statuses mapped to Job statuses.

Definition at line 1691 of file backends.py.
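
A short sketch of how such a mapping can be applied, including what happens for an LSF status it does not cover (for example SSUSP, a suspended state). The dictionary is copied from the initial value above; `to_job_status` and the local `BackendError` are hypothetical stand-ins for illustration.

```python
class BackendError(Exception):
    """Local stand-in for the BackendError used by the backend code."""


# Copied from the initial value of backend_code_to_status.
backend_code_to_status = {
    "RUN": "running",
    "DONE": "completed",
    "FINISHED": "completed",
    "EXIT": "failed",
    "PEND": "submitted",
}


def to_job_status(backend_status):
    """Translate an LSF status code into a Job status, failing loudly on unknown codes."""
    try:
        return backend_code_to_status[backend_status]
    except KeyError as err:
        raise BackendError(f"Unidentified backend status found: {backend_status}") from err


done_status = to_job_status("DONE")

try:
    to_job_status("SSUSP")
    unknown_raised = False
except BackendError:
    unknown_raised = True

print(done_status, unknown_raised)
```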

◆ exit_code_file_initial_time

exit_code_file_initial_time = None
inherited

Time we started waiting for the exit code file to appear.

Definition at line 886 of file backends.py.

◆ job

job = job
inherited

Job object for result.

Definition at line 879 of file backends.py.

◆ job_id

job_id = job_id

Job ID given by LSF.

Definition at line 1705 of file backends.py.

◆ time_to_wait_for_exit_code_file

time_to_wait_for_exit_code_file = timedelta(minutes=5)
inherited

How long to wait for the exit code file to exist before timing out, measured from our first attempt to read it once the job is 'finished'.

Definition at line 884 of file backends.py.


The documentation for this class was generated from the following file: