What is a Task?
A task is defined as a class that extends the `Task` base class and implements the `execute` method. The `execute` method is the entry point for the task, where its logic is defined. It’s called when the task is executed.
`class MyFirstTask(Task)`: `MyFirstTask` is a subclass of the `Task` class, which serves as the base class for all defined tasks. It provides the essential structure for a task. Inheriting from `Task` automatically makes the class a `dataclass`, which is useful for specifying inputs. Additionally, by inheriting from `Task`, the task is automatically assigned an identifier based on the class name.
`def execute`: The `execute` method is the entry point for executing the task. This is where the task’s logic is defined. It’s invoked by a task runner when the task runs and performs the task’s operation.
`context: ExecutionContext`: The `context` argument is an `ExecutionContext` instance that provides access to an API for submitting new tasks as part of the same job and features like shared caching.
`type MyFirstTask struct{}`: `MyFirstTask` is a struct that implements the `Task` interface. It represents the task to be executed.
`func (t *MyFirstTask) Execute(ctx context.Context) error`: The `Execute` method is the entry point for executing the task. This is where the task’s logic is defined. It’s invoked by a task runner when the task runs and performs the task’s operation.
By inheriting from the `Task` class, the task is treated as a Python `dataclass`, allowing input parameters to be defined as class attributes.
`str`, `int`, `float`, `bool`
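To make the role of class attributes as inputs more concrete, here is a minimal Python sketch. The `Task` class in it is a small stand-in written for this example, not the Tilebox SDK class, and `default_identifier` is a hypothetical helper name. It only mimics the two behaviors described above: subclasses become dataclasses, and a default identifier is derived from the class name.

```python
from dataclasses import dataclass, fields


class Task:
    """Stand-in for the SDK base class (an assumption, for illustration only)."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        dataclass(cls)  # every subclass is turned into a dataclass

    @classmethod
    def default_identifier(cls) -> str:
        # Hypothetical helper: the default identifier follows the class name.
        return cls.__name__


class MyFirstTask(Task):
    # Inputs are declared as annotated class attributes.
    name: str
    repetitions: int = 1

    def execute(self, context) -> None:
        # Entry point: the task's logic lives here.
        for _ in range(self.repetitions):
            print(f"Hello {self.name}!")


task = MyFirstTask(name="Tilebox", repetitions=2)
task.execute(context=None)  # normally invoked by a task runner
print([f.name for f in fields(task)])
print(MyFirstTask.default_identifier())
```

In this sketch `context` is unused and passed as `None`. In a real workflow the task runner would supply an `ExecutionContext` instance.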
`ParentTask` submits `ChildTask` tasks as subtasks. The number of subtasks to be submitted is based on the `num_subtasks` attribute of the `ParentTask`. The `submit_subtask` method takes an instance of a task as its argument, meaning the task to be submitted must be instantiated with concrete parameters first.
Parent tasks do not have access to the results of their subtasks. Instead, tasks can use shared caching to share data with one another.
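The submission pattern described above can be sketched as follows. `RecordingContext` is a hypothetical stand-in for `ExecutionContext` that only records what is submitted, so the example runs without a task runner. The tasks use plain `@dataclass` instead of the SDK base class.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingContext:
    """Hypothetical stand-in for ExecutionContext: records submitted subtasks."""

    submitted: list = field(default_factory=list)

    def submit_subtask(self, task) -> None:
        self.submitted.append(task)


@dataclass
class ChildTask:
    index: int

    def execute(self, context) -> None:
        print(f"Executing ChildTask {self.index}")


@dataclass
class ParentTask:
    num_subtasks: int

    def execute(self, context) -> None:
        for i in range(self.num_subtasks):
            # Each subtask is instantiated with concrete parameters, then submitted.
            context.submit_subtask(ChildTask(index=i))


context = RecordingContext()
ParentTask(num_subtasks=3).execute(context)
print(context.submitted)
```

Note that `ParentTask.execute` never calls `ChildTask.execute` and receives nothing back from `submit_subtask` in this sketch, which mirrors the point that a parent does not see the results of its subtasks.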
Compared to calling a subtask’s `execute` method directly, submitting it as a subtask allows its execution to occur on a different machine or in parallel with other subtasks. To learn more about how tasks are executed, see the section on task runners.
`DownloadRandomDogImages` fetches a specific number of random dog image URLs from an API. It then submits a `DownloadImage` task for each received image URL.
`DownloadImage` downloads an image from a specified URL and saves it to a file.
`DownloadRandomDogImages` submits `DownloadImage` tasks as subtasks.
The execution of such a workflow can be visualized as a tree in which the `DownloadRandomDogImages` task is the root and the `DownloadImage` tasks are the leaves. For instance, when downloading five random dog images, the following tasks are executed.
In total, six tasks are executed: one `DownloadRandomDogImages` task and five `DownloadImage` tasks. The `DownloadImage` tasks can execute in parallel, as they are independent. If more than one task runner is available, the Tilebox Workflow Orchestrator automatically parallelizes the execution of these tasks.
A limit of `64` subtasks per task is in place to discourage creating workflows where individual parent tasks submit a large number of subtasks, which can lead to performance issues since those parent tasks are not parallelized. If you need to submit more than `64` subtasks, consider using recursive subtask submission instead.
`RecursiveTask` is a valid task that submits smaller instances of itself as subtasks.
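A minimal sketch of such a task is shown here. The halving rule (`num // 2`) and the `LocalContext` and `run_locally` helpers are assumptions made for illustration: they queue and execute subtasks in a single process instead of handing them to task runners.

```python
from collections import deque
from dataclasses import dataclass


class LocalContext:
    """Hypothetical stand-in for ExecutionContext: queues subtasks locally."""

    def __init__(self):
        self.queue = deque()

    def submit_subtask(self, task):
        self.queue.append(task)


@dataclass
class RecursiveTask:
    num: int

    def execute(self, context):
        print(f"Executing RecursiveTask with num={self.num}")
        if self.num >= 2:
            # Submit a smaller instance of the same task.
            context.submit_subtask(RecursiveTask(self.num // 2))


def run_locally(root):
    """Run a task and everything it submits; return the executed tasks in order."""
    context = LocalContext()
    context.queue.append(root)
    executed = []
    while context.queue:
        task = context.queue.popleft()
        task.execute(context)
        executed.append(task)
    return executed


executed = run_locally(RecursiveTask(num=8))
print([t.num for t in executed])
```

The recursion terminates because each submitted instance is strictly smaller and no subtask is submitted once `num` drops below 2.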
In the earlier example, `DownloadRandomDogImages` was responsible for fetching all random dog image URLs and only submitted the individual download tasks once all URLs were retrieved. For a large number of images, this setup can bottleneck the entire workflow.
To improve this, recursive subtask submission decomposes a `DownloadRandomDogImages` task with a high number of images into two smaller `DownloadRandomDogImages` tasks, each fetching half. This process can be repeated until a specified threshold is met, at which point the Dog API can be queried directly for image URLs. That way, image downloads start as soon as the first URLs are retrieved, without initial waiting.
An implementation of this recursive submission may look like this:
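A minimal offline sketch of the idea, under several assumptions: the threshold `MAX_IMAGES_PER_QUERY`, the placeholder `fetch_image_urls` (which returns made-up URLs instead of calling the Dog API), and the `LocalContext` stand-in for `ExecutionContext` are all hypothetical. The download itself is omitted.

```python
from collections import deque
from dataclasses import dataclass

MAX_IMAGES_PER_QUERY = 4  # hypothetical threshold for querying the API directly


def fetch_image_urls(count):
    """Placeholder for the API request; returns made-up URLs so the sketch runs offline."""
    return [f"https://example.com/dog-{i}.jpg" for i in range(count)]


class LocalContext:
    """Hypothetical stand-in for ExecutionContext: queues subtasks locally."""

    def __init__(self):
        self.queue = deque()

    def submit_subtask(self, task):
        self.queue.append(task)


@dataclass
class DownloadImage:
    url: str

    def execute(self, context):
        pass  # the actual download is left out of this sketch


@dataclass
class DownloadRandomDogImages:
    num_images: int

    def execute(self, context):
        if self.num_images > MAX_IMAGES_PER_QUERY:
            # Too many images for one query: split into two tasks of about half the size.
            half = self.num_images // 2
            context.submit_subtask(DownloadRandomDogImages(half))
            context.submit_subtask(DownloadRandomDogImages(self.num_images - half))
        else:
            # Small enough: get the URLs and submit one download per URL.
            for url in fetch_image_urls(self.num_images):
                context.submit_subtask(DownloadImage(url))


context = LocalContext()
context.queue.append(DownloadRandomDogImages(num_images=10))
executed = []
while context.queue:
    task = context.queue.popleft()
    task.execute(context)
    executed.append(task)

fetch_tasks = [t for t in executed if isinstance(t, DownloadRandomDogImages)]
download_tasks = [t for t in executed if isinstance(t, DownloadImage)]
print(len(fetch_tasks), len(download_tasks))
```

With ten images and a threshold of four, the root splits into two tasks of five, each of which splits into tasks of two and three. Those four leaf tasks together submit the ten downloads, and no single task has to fetch every URL before downloads can begin.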
The number of retries for a subtask can be set with the `max_retries` argument of the `submit_subtask` method.
Check out the example below to see how this might look in practice.
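A minimal sketch of retry behavior, assuming that `max_retries` counts additional attempts after the first one. `RetryingContext` is a hypothetical stand-in for `ExecutionContext` that runs a subtask immediately, and `FlakyTask` fails deterministically on its first two attempts so the outcome is reproducible.

```python
class FlakyTask:
    """Fails on its first two attempts, then succeeds (deterministic for the sketch)."""

    def __init__(self):
        self.attempts = 0

    def execute(self, context):
        self.attempts += 1
        if self.attempts < 3:
            raise RuntimeError(f"attempt {self.attempts} failed")
        print(f"succeeded on attempt {self.attempts}")


class RetryingContext:
    """Hypothetical stand-in for ExecutionContext that runs subtasks immediately."""

    def __init__(self):
        self.failed = []

    def submit_subtask(self, task, max_retries=0):
        # Assumption: one initial attempt plus up to max_retries further attempts.
        for _ in range(max_retries + 1):
            try:
                task.execute(self)
                return
            except Exception as error:
                print(f"{type(task).__name__}: {error}")
        self.failed.append(task)  # all attempts used up


context = RetryingContext()
flaky = FlakyTask()
context.submit_subtask(flaky, max_retries=5)
print(flaky.attempts, context.failed)
```

With `max_retries=5` the task succeeds on its third attempt. With `max_retries=1` it would use both attempts and end up in `failed`.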
Dependencies between tasks are defined with the `depends_on` argument of the `submit_subtask` method. This means that a dependent task will only execute after the task it relies on has successfully completed.
The `depends_on` argument accepts a list of tasks, enabling a task to depend on multiple other tasks.
`RootTask` submits three `PrintTask` tasks as subtasks. These tasks depend on each other, meaning the second task executes only after the first task has successfully completed, and the third only executes after the second completes. The tasks are executed sequentially.
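The sequential chain described above can be sketched as follows. `LocalContext` and `SubmittedTask` are hypothetical stand-ins. The sketch assumes that `submit_subtask` returns a handle that can be passed to `depends_on`. The local scheduler deliberately scans the newest submissions first, so the printed order is enforced by the dependencies alone.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class SubmittedTask:
    """Handle returned by the stand-in submit_subtask, usable in depends_on."""

    task: object
    depends_on: list = field(default_factory=list)
    done: bool = False


class LocalContext:
    """Hypothetical stand-in for ExecutionContext with a tiny local scheduler."""

    def __init__(self):
        self.submissions = []
        self.log = []

    def submit_subtask(self, task, depends_on=None):
        handle = SubmittedTask(task, list(depends_on or []))
        self.submissions.append(handle)
        return handle

    def run_submitted(self):
        pending = list(self.submissions)
        while pending:
            # Newest-first scan: only depends_on decides what may run next.
            ready = [s for s in reversed(pending) if all(d.done for d in s.depends_on)]
            if not ready:
                raise RuntimeError("unsatisfiable dependencies")
            ready[0].task.execute(self)
            ready[0].done = True
            pending.remove(ready[0])


@dataclass
class PrintTask:
    message: str

    def execute(self, context):
        print(self.message)
        context.log.append(self.message)


class RootTask:
    def execute(self, context):
        first = context.submit_subtask(PrintTask("first"))
        second = context.submit_subtask(PrintTask("second"), depends_on=[first])
        context.submit_subtask(PrintTask("third"), depends_on=[second])


context = LocalContext()
RootTask().execute(context)
context.run_submitted()
```

Without the `depends_on` arguments, this particular scheduler would run the three tasks in reverse submission order.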
| Task | Dependencies | Description |
|---|---|---|
| NewsWorkflow | - | The root task of the workflow. It spawns the other tasks and sets up the dependencies between them. |
| FetchNews | - | A task that fetches news articles from the API and writes the results to a file, which is then read by dependent tasks. |
| PrintHeadlines | FetchNews | A task that prints the headlines of the news articles to the console. |
| MostFrequentAuthors | FetchNews | A task that counts the number of articles each author has written and prints the result to the console. |
The `PrintHeadlines` and `MostFrequentAuthors` tasks do not depend on each other. This means they can execute in parallel, which the Tilebox Workflow Orchestrator will do, provided multiple task runners are available.
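To see which tasks in the table can run side by side, the dependency graph can be grouped into batches with Python's standard `graphlib` module. This is only an illustration of the dependency structure, not of how the orchestrator schedules work. The root `NewsWorkflow` task is left out because it only submits the other three.

```python
from graphlib import TopologicalSorter

# Each task mapped to the tasks it depends on, as listed in the table.
dependencies = {
    "FetchNews": set(),
    "PrintHeadlines": {"FetchNews"},
    "MostFrequentAuthors": {"FetchNews"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

batches = []
while sorter.is_active():
    # Tasks whose dependencies are all complete; these could run in parallel.
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)

print(batches)
```

The first batch contains only `FetchNews`. The second contains both `MostFrequentAuthors` and `PrintHeadlines`, which is the pair that can execute in parallel.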
In this example, the results of `FetchNews` are stored in a file. This is not the recommended method for passing data between tasks. When executing on a distributed cluster, there is no guarantee that a file written by one task exists on the machine that runs a dependent task. Instead, it’s better to use a shared cache.
By default, the identifier of the `PrintHeadlines` task in the previous example is `"PrintHeadlines"`. This is good for prototyping, but not recommended for production, as changing the class name also changes the identifier, which can lead to issues during refactoring. It also prevents different tasks from sharing the same class name.
To address this, Tilebox Workflows offers a way to explicitly specify the identifier of a task. This is done by overriding the `identifier` method of the `Task` class. This method should return a unique string identifying the task. This decouples the task’s identifier from its class name, allowing you to change the identifier without renaming the class. It also allows tasks with the same class name to have different identifiers. The `identifier` method can also specify a version number for the task; see the section on semantic versioning below for more details.
The `identifier` method must be defined as either a `classmethod` or a `staticmethod`, meaning it can be called without instantiating the class.
The `identifier` method can return a tuple of two strings, where the first string is the identifier and the second string is the version number. This allows for semantic versioning of tasks.
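A minimal sketch of overriding the identifier. The `Task` class here is a stand-in that derives a default identifier from the class name, and the identifier string `"tilebox.com/news_workflow/PrintHeadlines"` is a made-up value for illustration.

```python
class Task:
    """Stand-in base class: the default identifier follows the class name."""

    @classmethod
    def identifier(cls):
        return cls.__name__


class PrintHeadlines(Task):
    # No override: the identifier changes whenever the class is renamed.
    def execute(self, context):
        pass


class PrintHeadlinesV2(Task):
    @staticmethod
    def identifier():
        # Explicit identifier, decoupled from the class name (hypothetical value).
        return "tilebox.com/news_workflow/PrintHeadlines"

    def execute(self, context):
        pass


# Both calls work on the class itself; no instance is needed.
print(PrintHeadlines.identifier())
print(PrintHeadlinesV2.identifier())
```

Renaming `PrintHeadlinesV2` would leave its identifier unchanged, whereas renaming `PrintHeadlines` would not.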
Versioning is important for managing changes to a task’s execution method. It allows new features, bug fixes, and other changes to be introduced while existing workflows continue to operate as expected. It also lets multiple versions of a task coexist, so changes can be rolled out gradually without interrupting production deployments.
You assign a version number by overriding the `identifier` method of the task class. It must return a tuple of two strings: the first is the identifier and the second is the version number, which must match the pattern `vX.Y` (where `X` and `Y` are non-negative integers). `X` is the major version number and `Y` is the minor version number.
For example, this task has the identifier `"tilebox.com/example_workflow/MyTask"` and the version `"v1.3"`:
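A minimal sketch of such a task, written as a plain class so that it runs without the SDK. The `parse_version` helper and its regular expression are assumptions added here to illustrate the `vX.Y` pattern. They are not part of the Tilebox API.

```python
import re

# vX.Y, where X and Y are non-negative integers.
VERSION_PATTERN = re.compile(r"v([0-9]+)\.([0-9]+)")


class MyTask:
    @staticmethod
    def identifier():
        # (identifier, version)
        return "tilebox.com/example_workflow/MyTask", "v1.3"

    def execute(self, context):
        pass


def parse_version(version):
    """Hypothetical helper: split a version string into (major, minor)."""
    match = VERSION_PATTERN.fullmatch(version)
    if match is None:
        raise ValueError(f"version must match vX.Y, got {version!r}")
    return int(match.group(1)), int(match.group(2))


name, version = MyTask.identifier()
print(name, parse_version(version))
```

Here `parse_version("v1.3")` yields the major version `1` and the minor version `3`, while a string such as `"1.3"` or `"v1"` is rejected.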
Suppose `MyTask` is submitted as part of a job, with version `"v1.3"`.
A task runner with version `"v1.3"` of `MyTask` would execute this task.
A task runner with version `"v1.5"` of `MyTask` would also execute this task.
A task runner with version `"v1.2"` of `MyTask` would not execute this task, as its minor version is lower than that of the submitted task.
A task runner with version `"v2.5"` of `MyTask` would not execute this task, as its major version differs from that of the submitted task.
A task can be in one of the following states:
`QUEUED`: the task is queued and waiting to be run
`RUNNING`: the task is currently running on some task runner
`COMPUTED`: the task has been computed and the output is available. Once in this state, the task will never transition to any other state
`FAILED`: the task has failed
`CANCELLED`: the task has been cancelled due to user request
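Returning to the version-matching rules for task runners described earlier: a runner executes a submitted task only if the major versions are equal and the runner's minor version is at least the submitted minor version. That rule can be sketched as a small predicate. The function names are hypothetical and this is not the orchestrator's actual implementation.

```python
def parse_version(version: str) -> tuple[int, int]:
    """Split a 'vX.Y' string into (major, minor)."""
    major, minor = version.removeprefix("v").split(".")
    return int(major), int(minor)


def runner_can_execute(runner_version: str, submitted_version: str) -> bool:
    """Same major version, and a runner minor version that is not lower."""
    runner_major, runner_minor = parse_version(runner_version)
    submitted_major, submitted_minor = parse_version(submitted_version)
    return runner_major == submitted_major and runner_minor >= submitted_minor


for runner_version in ["v1.3", "v1.5", "v1.2", "v2.5"]:
    print(runner_version, runner_can_execute(runner_version, "v1.3"))
```

For a task submitted as `"v1.3"`, the predicate accepts runners at `"v1.3"` and `"v1.5"` and rejects `"v1.2"` and `"v2.5"`, matching the four cases listed for `MyTask`.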