Jobs
Submission
In order for a task to be executed, it must be instantiated with concrete inputs and then submitted as a job. The task is then executed in the context of the job, and if it spawns sub-tasks, they are executed as part of the same job as well.
Once a job is submitted, its root task is scheduled for execution, and any eligible task runner may pick it up and execute it.
To submit a job, you first need to instantiate a job client. This can be done by calling the jobs method on the workflows client.
Once you have a job client, you can submit a job by calling its submit method. This requires a name for the job, an instance of the root task, and a cluster to submit the job to.
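As a rough sketch, a submission could look like the following. The import path, the MyTask class, the credentials, and the cluster slug are assumptions for illustration; the jobs and submit calls follow the description above.

```python
from tilebox.workflows import Client, Task, ExecutionContext  # assumed import path

class MyTask(Task):  # hypothetical root task
    message: str

    def execute(self, context: ExecutionContext) -> None:
        print(f"Hello {self.message}")

client = Client(token="YOUR_API_KEY")  # hypothetical credentials
job_client = client.jobs()  # instantiate a job client from the workflows client

job = job_client.submit(
    "my-first-job",         # a name for the job
    MyTask("world"),        # an instance of the root task with concrete inputs
    cluster="dev-cluster",  # hypothetical cluster to submit the job to
)
```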
Now that a job has been submitted, it's immediately scheduled for execution. As soon as an eligible task runner is available, the root task of the job is picked up and executed.
Retry Handling
Tasks support retry handling when their execution fails. This also applies to the root task of a job, where the number of retries can be specified using the max_retries argument of the submit method.
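A sketch of such a submission, reusing the hypothetical job client and cluster from above; MyFlakyTask stands in for any task whose execution may fail:

```python
job = job_client.submit(
    "flaky-job",
    MyFlakyTask(),          # hypothetical task that fails intermittently
    cluster="dev-cluster",  # hypothetical cluster
    max_retries=5,          # retry the root task up to 5 times
)
```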
In this example, if MyFlakyTask fails, it gets retried up to 5 times before eventually being marked as failed.
Fetching a specific Job by ID
When a job is submitted, it gets assigned a unique identifier. This identifier can be used to retrieve the job at any time.
To retrieve a job by its identifier, use the find method on the job client.
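A sketch, assuming the job client from above and a placeholder identifier:

```python
job = job_client.find("some-job-id")  # placeholder job identifier
```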
Visualization
Sometimes it’s useful to visualize the execution of a job. Since the workflow orchestrator keeps track of the execution of all tasks in a job, including sub-tasks and dependencies, it’s possible to visualize the execution of a job as a graph diagram.
Assuming you have submitted a job, you can use the display method on the job client to display the execution of the job as a graph diagram. The display method is designed to be used in an interactive environment, such as a Jupyter notebook. In non-interactive environments, use job_client.visualize instead, which returns the rendered diagram as a string in SVG format.
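A sketch of both variants, assuming a job object obtained from submit or find; the output file name is arbitrary:

```python
# Interactive environments (e.g. a Jupyter notebook): render the diagram inline
job_client.display(job)

# Non-interactive environments: get the rendered diagram as an SVG string
svg = job_client.visualize(job)
with open("job_graph.svg", "w") as file:
    file.write(svg)
```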
The following diagram displays the execution of a job as a graph. Each task is represented as a node in the graph, and the edges between nodes represent a sub-task relationship. The diagram also shows the status of each task, by using a color code.
The color codes used to represent the state of a task are:
Task State | Color | Description |
---|---|---|
Queued | Salmon | The task is queued and waiting for a task runner to pick it up. |
Running | Blue | The task is currently being executed by a task runner. |
Succeeded | Green | The task has successfully finished executing. |
Failed | Red | The task has been executed, but resulted in an error. |
Below is a visualization of another job that is currently being executed by some task runners.
From this visualization, the following observations can be made:
- The root task of the job, MyTask, was already executed and spawned three sub-tasks.
- At least three task runners are available, since three tasks are currently being executed at the same time.
- The SubTask that is currently still being executed hasn't spawned any sub-tasks yet. That is because submitted sub-tasks are only queued for execution after the task that spawned them has itself finished executing.
- The DependentTask that is still queued is waiting for the completion of the LeafTask before it can be executed.
Visualization of a job is intended as an aid for development and debugging. It's not intended for large jobs with hundreds or thousands of tasks, as the diagram may become too complex to be useful or readable. That's why the visualization is currently limited to jobs with at most 200 tasks.
Customizing Task Display Names
The text representing a task in the diagram defaults to the class name of the task. If you want to customize this text, you can do so by changing the display field of the current_task object in the execution context of a task execution. The max length of the display name is 1024 characters. Any string larger than that is truncated. The display name may optionally contain line breaks.
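A sketch of setting a custom display name from within a task; the import path and the ProcessMovie task are assumptions, while the current_task.display field follows the description above:

```python
from tilebox.workflows import Task, ExecutionContext  # assumed import path

class ProcessMovie(Task):  # hypothetical task
    title: str

    def execute(self, context: ExecutionContext) -> None:
        # Override the default display name (the class name) shown in the diagram;
        # line breaks are allowed, strings longer than 1024 characters are truncated
        context.current_task.display = f"ProcessMovie\n{self.title}"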
Cancellation
A job can be canceled at any time. When a job is canceled, all tasks of the job that are queued are removed from the queue. Even if there are idle task runners available, the tasks of a canceled job are not executed. But if a task is currently being executed while the job is being canceled, it doesn't get interrupted and continues to execute until it finishes.
To cancel a job, you can use the cancel method on the job client.
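A sketch of canceling a job, assuming the job client from above:

```python
job = job_client.find("some-job-id")  # placeholder identifier of the job to cancel
job_client.cancel(job)                # queued tasks of the job are removed from the queue
```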
A canceled job can be resumed at any time, by retrying it.
If the execution of a task within a job fails, the job is automatically canceled. This prevents the execution of further tasks in the job, which may no longer be relevant due to the failure. In a future release, this behaviour will be configurable for each task individually, since there are use cases where you may want to continue the execution of a job even if some of its tasks fail.
Retries
If a task within a job fails due to a bug in the task's implementation, a lack of resources, or any other reason, it's not necessary to resubmit the job and re-execute all already-computed tasks after a fix is deployed. Instead, you can retry the job, and its execution is resumed from the point of the failure. This means that all the work that was already done up until the point of the failure isn't lost.
In a future release, automatic retries of a job for certain failure conditions will be configurable. This can be useful in cases where an automatic retry after a certain amount of time, or once a certain condition is met, makes sense.
Below is an example of a failing job due to a bug in the task’s implementation. After the initial buggy job submission, the bug is fixed and the job is retried.
The following workflow accepts a list of movie titles, and then queries the OMDb API for the release date of each movie.
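The workflow's code isn't reproduced here, but a sketch of what it could look like follows. The MoviesStats root task, the submit_subtask call, the use of requests, and the OMDb query parameters are assumptions for illustration; only the PrintMovieStats name and the Title and Released fields come from the surrounding text.

```python
import requests  # assumed HTTP client library
from tilebox.workflows import Task, ExecutionContext  # assumed import path

class MoviesStats(Task):  # hypothetical root task
    titles: list[str]

    def execute(self, context: ExecutionContext) -> None:
        # Spawn one sub-task per movie title
        for title in self.titles:
            context.submit_subtask(PrintMovieStats(title))

class PrintMovieStats(Task):
    title: str

    def execute(self, context: ExecutionContext) -> None:
        response = requests.get(
            "https://www.omdbapi.com/",
            params={"t": self.title, "apikey": "YOUR_OMDB_API_KEY"},  # placeholder key
        ).json()
        # Buggy: these keys are missing when the movie title isn't found
        print(f"{response['Title']} was released on {response['Released']}")
```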
Submitting the job below reveals a bug in the PrintMovieStats task.
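A sketch of the submission, with placeholder movie titles and the hypothetical cluster from above:

```python
job = job_client.submit(
    "movies-stats",
    MoviesStats(["Inception", "Seven", "The Matrix", "Non Existing Movie"]),  # placeholder titles
    cluster="dev-cluster",
)
```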
It seems like one of the PrintMovieStats tasks failed with a KeyError. This probably occurs if a movie title is not found by the OMDb API, in which case the response doesn't contain the Title and Released fields.
The console output of the task runners confirms this:
A fixed version of the PrintMovieStats task looks like this:
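A sketch of such a fix, using the same assumed imports as the workflow sketch above and guarding against the missing fields; the exact handling in the original may differ:

```python
class PrintMovieStats(Task):
    title: str

    def execute(self, context: ExecutionContext) -> None:
        response = requests.get(
            "https://www.omdbapi.com/",
            params={"t": self.title, "apikey": "YOUR_OMDB_API_KEY"},  # placeholder key
        ).json()
        # Fixed: fall back gracefully when the movie title isn't found
        if "Title" in response and "Released" in response:
            print(f"{response['Title']} was released on {response['Released']}")
        else:
            print(f"No release date found for {self.title}")
```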
Now with this fix in place, and the task runners redeployed with the updated implementation of the PrintMovieStats task, it's time to retry the job:
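Assuming the job object from the failed submission (or one fetched via find), and assuming the job client exposes a retry method analogous to cancel, this could look like:

```python
job = job_client.find("some-job-id")  # placeholder identifier of the failed job
job_client.retry(job)                 # resume execution from the point of the failure
```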
The console output of the task now looks like this:
The output of the task runner confirms that only two tasks were executed, instead of new tasks for all four movies.
Great, the job was retried and now succeeded. The two tasks that were already executed successfully before the failure were not executed again; instead, execution was resumed from the point of the failure.