Submission

For a task to be executed, it must be instantiated with concrete inputs and then submitted as a job. The task is then executed in the context of the job, and if it spawns sub-tasks, they are executed as part of the same job as well.

Once a job is submitted, its root task is scheduled for execution, and any eligible task runner may pick it up and execute it.

To submit a job, a job client needs to be instantiated first. This can be done by calling the jobs method on the workflows client.

Once you have a job client, you can submit a job by calling the submit method on it. This requires a name for the job, an instance of the root task, and a cluster to submit the job to.

Once a job has been submitted, it’s immediately scheduled for execution. As soon as an eligible task runner is available, the root task of the job is picked up and executed.
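The submission flow described above can be modelled with a small in-memory sketch. This is not the workflows client itself: ToyJobClient, run_next, and the representation of a task as a plain callable that returns its sub-tasks are all hypothetical stand-ins, chosen only to show that the root task is queued on submission and that sub-tasks run as part of the same job.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for a submitted job."""
    name: str
    cluster: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ToyJobClient:
    """Toy job client: one FIFO queue of (job, task) pairs per cluster."""

    def __init__(self):
        self.queues = {}  # cluster name -> deque of (job, task)

    def submit(self, name, root_task, cluster):
        job = Job(name, cluster)
        # The root task is scheduled as soon as the job is submitted.
        self.queues.setdefault(cluster, deque()).append((job, root_task))
        return job


def run_next(client, cluster, log):
    """Act as one eligible task runner: pick up and execute one queued task."""
    job, task = client.queues[cluster].popleft()
    # In this sketch a task is a callable that returns the sub-tasks it spawns.
    for sub_task in task(log):
        # Sub-tasks are queued under the same job as their parent.
        client.queues[cluster].append((job, sub_task))
    return job


def leaf(log):
    log.append("leaf")
    return []


def root(log):
    log.append("root")
    return [leaf, leaf]


client = ToyJobClient()
job = client.submit("my-job", root, cluster="dev-cluster")

log = []
while client.queues["dev-cluster"]:
    run_next(client, "dev-cluster", log)
print(log)  # ['root', 'leaf', 'leaf']
```

The job name and cluster name used here are placeholders. A real deployment has several task runners pulling from the queue concurrently, which this single loop does not attempt to reproduce.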

Retry Handling

Tasks support retry handling when their execution fails. This also applies to the root task of a job, where the number of retries can be specified by using the max_retries argument of the submit method.

In this example, if MyFlakyTask fails, it gets retried up to 5 times before finally being marked as failed.
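To make the retry count concrete, here is a minimal sketch of the semantics. It assumes that max_retries counts additional attempts after the first one, so a value of 5 allows at most 6 executions in total; that reading, along with the run_with_retries helper and the FlakyTask class, is an assumption for illustration and not the orchestrator's implementation.

```python
def run_with_retries(task, max_retries):
    """Run `task`; on failure, retry up to `max_retries` more times.

    Returns a tuple (succeeded, attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
            return True, attempts
        except Exception:
            # Give up once the initial attempt plus all retries are used.
            if attempts > max_retries:
                return False, attempts


class FlakyTask:
    """Hypothetical task that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.remaining_failures = failures

    def __call__(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("transient failure")


print(run_with_retries(FlakyTask(failures=3), max_retries=5))   # (True, 4)
print(run_with_retries(FlakyTask(failures=10), max_retries=5))  # (False, 6)
```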

Fetching a specific Job by ID

When a job is submitted, it gets assigned a unique identifier. This identifier can be used to retrieve the job at any time.

To retrieve a job by its identifier, the find method on the job client can be used.

Visualization

Sometimes it’s useful to visualize the execution of a job. Since the workflow orchestrator keeps track of the execution of all tasks in a job, including sub-tasks and dependencies, it’s possible to visualize the execution of a job as a graph diagram.

Assuming you have submitted a job, you can use the display method on the job client to display the execution of the job as a graph diagram.

The display method is designed to be used in an interactive environment, such as a Jupyter notebook. In non-interactive environments, use job_client.visualize instead, which returns the rendered diagram as a string in SVG format.

The following diagram displays the execution of a job as a graph. Each task is represented as a node in the graph, and the edges between nodes represent a sub-task relationship. The diagram also shows the status of each task, by using a color code.

The color codes used to represent the state of a task are:

Task State | Color  | Description
Queued     | Salmon | The task is queued and waiting for a task runner to pick it up.
Running    | Blue   | The task is currently being executed by a task runner.
Succeeded  | Green  | The task has successfully finished executing.
Failed     | Red    | The task has been executed, but resulted in an error.
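As a rough illustration of how such a diagram can be derived from tracked task states, the sketch below renders a task tree as Graphviz DOT text using the color codes listed above. It is not the renderer behind display or visualize (which, per the text, produces SVG); TaskNode and to_dot are hypothetical names, and only the sub-task edges are modelled, not dependencies between tasks.

```python
from dataclasses import dataclass, field

# Fill color per task state, following the color-code table.
STATE_COLORS = {
    "queued": "salmon",
    "running": "blue",
    "succeeded": "green",
    "failed": "red",
}


@dataclass
class TaskNode:
    """Hypothetical record of one task in a job's execution graph."""
    name: str
    state: str
    sub_tasks: list = field(default_factory=list)


def to_dot(root):
    """Render the task tree as DOT text with one colored node per task."""
    lines = ["digraph job {"]
    counter = 0

    def visit(node):
        nonlocal counter
        node_id = f"t{counter}"
        counter += 1
        color = STATE_COLORS[node.state]
        lines.append(
            f'  {node_id} [label="{node.name}", style=filled, fillcolor={color}];'
        )
        for child in node.sub_tasks:
            child_id = visit(child)
            # An edge represents a parent -> sub-task relationship.
            lines.append(f"  {node_id} -> {child_id};")
        return node_id

    visit(root)
    lines.append("}")
    return "\n".join(lines)


tree = TaskNode("MyTask", "succeeded", [
    TaskNode("SubTask", "running"),
    TaskNode("LeafTask", "queued"),
])
print(to_dot(tree))
```

The resulting text can be passed to any Graphviz-compatible tool to produce an image.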

Below is a visualization of another job that is currently being executed by some task runners.

From this visualization, the following observations can be made:

  • The root task of the job, MyTask, was already executed and spawned three sub-tasks.
  • At least three task runners are available, since three tasks are currently being executed at the same time.
  • The SubTask that is currently still being executed didn’t spawn any sub-tasks yet. That is because submitted sub-tasks are only queued for execution after the task that spawned them has itself finished executing.
  • The DependentTask that is still queued is waiting for the completion of the LeafTask before it can be executed.

Visualization of a job is intended as an aid for development and debugging. It’s not intended for large jobs with hundreds or thousands of tasks, as the diagram may become too complex to be useful or readable. For that reason, the visualization is currently limited to jobs with a maximum of 200 tasks.

Customizing Task Display Names

The text representing a task in the diagram defaults to the class name of the task. To customize this text, change the display field of the current_task object in the execution context of a task execution. The maximum length of the display name is 1024 characters. Any string longer than that is truncated. The display name may optionally contain line breaks.
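The defaulting and length rules can be summarized in a few lines. This is a sketch of the described behavior only: effective_display_name is a hypothetical helper, and it assumes truncation simply cuts the string off after the first 1024 characters, which the text does not specify.

```python
MAX_DISPLAY_LENGTH = 1024  # limit stated in the text


def effective_display_name(task, display=None):
    """Return the text shown for `task` in the diagram.

    Falls back to the task's class name when no custom display name is set.
    Assumption: names that are too long are cut off at the end.
    """
    name = display if display is not None else type(task).__name__
    return name[:MAX_DISPLAY_LENGTH]


class MyTask:
    """Hypothetical task class used only for this illustration."""


print(effective_display_name(MyTask()))                           # MyTask
print(effective_display_name(MyTask(), "Process\nscene 42"))      # two lines
print(len(effective_display_name(MyTask(), "x" * 2000)))          # 1024
```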

Cancellation

A job can be canceled at any time. When a job is canceled, all of its queued tasks are removed from the queue. Even if idle task runners are available, the tasks of a canceled job are not executed. However, a task that is already being executed while the job is canceled isn’t interrupted; it continues to run until it finishes.

To cancel a job you can use the cancel method on the job client.

A canceled job can be resumed at any time, by retrying it.
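The cancellation rules above can be expressed as a small state model. Everything here is a hypothetical sketch: JobState and its helper functions are not part of the workflows client, and keeping the withheld tasks in a separate parked list is just one way to model how a later retry can resume them.

```python
from dataclasses import dataclass, field


@dataclass
class JobState:
    """Toy model of the tasks belonging to one job."""
    queue: list = field(default_factory=list)     # tasks runners may pick up
    running: list = field(default_factory=list)   # tasks currently executing
    parked: list = field(default_factory=list)    # withheld while canceled
    finished: list = field(default_factory=list)  # tasks that completed


def cancel(job):
    """Remove queued tasks from the queue; running tasks are left alone."""
    job.parked.extend(job.queue)
    job.queue.clear()


def finish_running(job):
    """Let every running task run to completion."""
    job.finished.extend(job.running)
    job.running.clear()


def retry(job):
    """Resume a canceled job by putting the withheld tasks back in the queue."""
    job.queue.extend(job.parked)
    job.parked.clear()


job = JobState(queue=["B", "C"], running=["A"])

cancel(job)
print(job.queue, job.running)  # [] ['A']

finish_running(job)
print(job.finished)            # ['A']

retry(job)
print(job.queue)               # ['B', 'C']
```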

If the execution of a task within a job fails, the job is automatically canceled. This prevents the execution of further tasks in the job, which may no longer be relevant after the failure. In a future release, this behaviour will be configurable for each task individually, since there are use cases where you may want to continue the execution of a job even if some of its tasks fail.

Retries

If a task within a job fails due to a bug in the task’s implementation, a lack of resources, or any other reason, it’s not necessary to resubmit the job and re-execute all tasks that were already computed once a fix is deployed. Instead, you can retry the job, and its execution resumes from the point of the failure. This means that all the work done up until the point of the failure isn’t lost.

In a future release, it will be possible to configure an automatic retry of a job for certain failure conditions. This can be useful in cases where retrying automatically after a certain amount of time, or once a certain condition is met, makes sense.

Below is an example of a failing job due to a bug in the task’s implementation. After the initial buggy job submission, the bug is fixed and the job is retried.

The following workflow accepts a list of movie titles, and then queries the OMDb API for the release date of each movie.

Submitting the job below reveals a bug in the PrintMovieStats task.

It seems like one of the PrintMovieStats tasks failed with a KeyError. This probably occurs if a movie title is not found by the OMDb API, in which case the response doesn’t contain the Title and Released fields.

The console output of the task runners confirms this:

Output
The Matrix was released on 31 Mar 1999
Shrek 2 was released on 19 May 2004
ERROR: Task PrintMovieStats failed with exception: KeyError('Title')

A fixed version of the PrintMovieStats task looks like this:

Now with this fix in place, and the task runners redeployed with the updated implementation of the PrintMovieStats task, it’s time to retry the job:

The console output of the task now looks like this:

Output
Could not find the release date for Tilebox - The Movie
The Avengers was released on 04 May 2012

The output of the task runner confirms that only two tasks were executed, rather than one task for each of the four movies.

Great, the job was retried and has now succeeded. The two tasks that had already been executed successfully before the failure were not executed again; instead, execution resumed from the point of the failure.
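The whole fail, fix, and retry cycle can be replayed offline with a small simulation. This sketch makes several assumptions: FAKE_OMDB is a hand-written stand-in for the OMDb responses (the real API is not called, and the shape of the not-found response is simplified), the two print_movie_stats functions only approximate the buggy and fixed task, and run_job executes tasks one after another instead of using task runners.

```python
# Stand-in for OMDb responses. A title that is not found has neither a
# "Title" nor a "Released" field.
FAKE_OMDB = {
    "The Matrix": {"Title": "The Matrix", "Released": "31 Mar 1999"},
    "Shrek 2": {"Title": "Shrek 2", "Released": "19 May 2004"},
    "Tilebox - The Movie": {"Response": "False"},
    "The Avengers": {"Title": "The Avengers", "Released": "04 May 2012"},
}


def print_movie_stats_buggy(title, out):
    movie = FAKE_OMDB[title]
    # Bug: assumes every response contains "Title" and "Released".
    out.append(f"{movie['Title']} was released on {movie['Released']}")


def print_movie_stats_fixed(title, out):
    movie = FAKE_OMDB[title]
    if "Title" in movie and "Released" in movie:
        out.append(f"{movie['Title']} was released on {movie['Released']}")
    else:
        out.append(f"Could not find the release date for {title}")


def run_job(titles, task, succeeded, out):
    """Run one task per title, skipping titles that already succeeded.

    Stops at the first failure, mirroring a job that is canceled when one
    of its tasks fails. Returns True if every task has succeeded.
    """
    for title in titles:
        if title in succeeded:
            continue  # work done before the failure is not repeated
        try:
            task(title, out)
        except KeyError as error:
            out.append(
                f"ERROR: Task PrintMovieStats failed with exception: {error!r}"
            )
            return False
        succeeded.add(title)
    return True


titles = ["The Matrix", "Shrek 2", "Tilebox - The Movie", "The Avengers"]
succeeded = set()

first_run = []
run_job(titles, print_movie_stats_buggy, succeeded, first_run)
print("\n".join(first_run))

retry_run = []
run_job(titles, print_movie_stats_fixed, succeeded, retry_run)
print("\n".join(retry_run))
```

The retry run produces only two lines, one for the movie that failed and one for the movie that was never reached, which matches the console output shown above.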