Understanding and Creating Tasks
Creating a Task
Creating a task in Tilebox can be achieved by defining a class that extends the Task base class and implements the execute method.
The execute method is the entry point for the task and is where the task’s logic is defined. It’s called when the task is executed and performs the task’s operation.
This example showcases a simple task that prints "Hello World!" to the console. The key components of this task are:
The code samples on this page do not show how the task is actually executed, but don’t worry, that is covered in the next section on task runners. The reason for this is that the execution of tasks is a separate concern from the implementation of tasks.
Input Parameters
Tasks often require input parameters to perform their operation. These inputs can be simple values or complex data structures.
By inheriting from the Task class, the task is automatically converted to a Python dataclass. This means that input parameters can be defined as class attributes.
Tasks need to be serializable to JSON, because they may be distributed across a cluster of task runners.
Supported types for input parameters are:
- Basic types such as str, int, float, bool
- Lists and dictionaries of basic types
- Nested data classes that are again JSON-serializable
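The idea of a task whose class attributes become serializable input parameters can be sketched in plain Python. The snippet does not use the Tilebox library: the Task class here is a hypothetical stand-in that only mimics the dataclass conversion, and the Region and ProcessRegion names are invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, is_dataclass


class Task:
    """Hypothetical stand-in for a task base class (not the Tilebox one)."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Turn every subclass into a dataclass, so that annotated class
        # attributes become constructor parameters.
        dataclass(cls)


@dataclass
class Region:
    # A nested data class that only contains JSON-serializable fields.
    name: str
    bbox: list[float]


class ProcessRegion(Task):
    region: Region          # nested data class
    resolution: int         # basic type
    tags: dict[str, str]    # dictionary of basic types
    dry_run: bool = False   # basic type with a default value


task = ProcessRegion(Region("alps", [5.9, 45.8, 10.5, 47.8]), 10, {"sensor": "optical"})

# Serialize the input parameters to JSON and read them back.
payload = json.dumps(asdict(task))
restored = json.loads(payload)
```

Because every field is a basic type, a list or dictionary of basic types, or a nested data class, the round trip through JSON loses no information.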
Task Composition and subtasks
Until now tasks have only performed a single operation. But tasks can be much more powerful than that. Tasks can submit other tasks as their subtasks. This enables a modular design of workflows, where complex operations are broken down into simpler, manageable pieces. And as an added bonus, the execution of subtasks is automatically parallelized whenever possible.
In the preceding example, a ParentTask that submits ChildTask tasks as subtasks is defined. The number of subtasks to be submitted is determined by the num_subtasks attribute of the ParentTask. The submit_subtask method takes an instance of a task as its argument.
This means that the task to be submitted as a subtask must be instantiated with concrete parameters before it can be submitted.
By submitting a task as a subtask, the execution of the subtask is scheduled as part of the same job as the parent task. Compared to just calling the subtask’s execute method directly, this allows for the subtask to be executed on a different machine, or even in parallel with other subtasks. To learn more about how tasks are executed, see the section on task runners.
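The difference between submitting a subtask and calling it directly can be illustrated with a small plain-Python sketch. It does not use the Tilebox library: RecordingContext is a hypothetical stand-in that only records what was submitted, and the index attribute of ChildTask is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ChildTask:
    index: int  # hypothetical input parameter

    def execute(self, context) -> None:
        print(f"Executing child task {self.index}")


@dataclass
class ParentTask:
    num_subtasks: int

    def execute(self, context) -> None:
        for i in range(self.num_subtasks):
            # Each subtask is instantiated with concrete parameters
            # before it is handed to the context.
            context.submit_subtask(ChildTask(i))


class RecordingContext:
    """Hypothetical stand-in for an execution context: it records
    submitted subtasks instead of scheduling them."""

    def __init__(self) -> None:
        self.submitted: list = []

    def submit_subtask(self, task) -> None:
        self.submitted.append(task)


context = RecordingContext()
ParentTask(num_subtasks=3).execute(context)
```

After the parent has run, the context holds three fully parameterized ChildTask instances, none of which has been executed yet. A real orchestrator would be free to hand them to different runners.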
Larger subtasks Example
A practical example helps to show how task composition can be used to build a larger workflow. The workflow downloads a given number of random dog images from the internet: it uses the Dog API to get the URLs of random dog images and then downloads them. Implementing such a workflow using task composition could look like this:
This larger example consists of the following tasks:
Together, these tasks form a workflow that downloads a certain number of random dog images from the internet. The relationship between the two tasks, and the fact that they together form a workflow, is implicit in the fact that DownloadRandomDogImages submits DownloadImage tasks as subtasks.
Visualizing the execution of a workflow that uses these tasks could look like a tree of tasks, where the DownloadRandomDogImages task is the root and the DownloadImage tasks are the leaves. For example, when downloading five random dog images, the following tasks are executed.
In total, six tasks are executed: the DownloadRandomDogImages task and five DownloadImage tasks. The DownloadImage tasks can be executed in parallel, as they are independent of one another. Assuming that more than one task runner is available, the Tilebox Workflow Orchestrator automatically parallelizes the execution of the DownloadImage tasks.
Check out job_client.display to learn how this visualization was automatically generated from the task executions.
Currently, a limit of 64 subtasks per task is imposed. This is to discourage the creation of workflows where individual tasks submit a large number of subtasks, as this can lead to performance issues, since those individual tasks are not parallelized. If you need to submit more than 64 subtasks, consider using recursive subtask submission instead.
Recursive subtasks
Tasks can submit other tasks as subtasks, and this allows for the creation of complex workflows. Sometimes a task’s input is a list whose elements can be mapped to individual subtasks, and the outputs of those subtasks are then often aggregated in a reduce step. This is a common pattern in workflows, called MapReduce. But often the initial map step, submitting the individual subtasks, is already an expensive operation. And since this happens within a single task execution, it’s not further parallelizable. This can be a bottleneck for the entire workflow.
Fortunately, Tilebox Workflows offers a way to resolve this issue by using recursive subtask submission. A task can not only submit other tasks as subtasks, but it can also submit an instance of itself as a subtask. This allows for a recursive decomposition of a task into smaller subtasks.
As an example, the RecursiveTask below is a perfectly valid task that recursively submits smaller instances of itself as subtasks.
Recursive subtask Example
One example where this is useful is the random dog images workflow defined earlier. While the previous implementation was already downloading images in parallel, the initial orchestration of the individual download tasks was not parallelized. That’s the case because DownloadRandomDogImages, the root task of the workflow, first fetches the requested number of random dog image URLs, and only once all URLs are fetched are the individual download tasks for those images submitted.
If a large number of images is to be downloaded, first a query to the Dog API is made to return all those image URLs before any downloads can begin. This is not ideal, as it means that the DownloadRandomDogImages task is a bottleneck for the entire workflow.
As an improvement, recursive subtask submission can be used to decompose a DownloadRandomDogImages task with a large number of images into two smaller DownloadRandomDogImages tasks, each one fetching half of the images. This is then recursively applied, decomposing tasks into even smaller tasks, until a certain threshold is reached, at which point the Dog API is queried directly for the image URLs.
This way, there is no initial wait for the Dog API to return all the image URLs before image downloads can be started. Instead, the first images start downloading as soon as the first image URLs are retrieved.
An implementation of this recursive subtask submission could look like this:
With this implementation, downloading a large number of images (for example 9 in the workflow below) results in the following tasks being executed:
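The shape of such a task tree can be traced with a plain-Python simulation that does not use the Tilebox library or the Dog API. The decompose function and the threshold of 4 images per leaf task are assumptions chosen for illustration; the source does not state the actual threshold.

```python
def decompose(num_images: int, threshold: int = 4, depth: int = 0) -> list[str]:
    """Return an indented listing of the tasks a recursive
    DownloadRandomDogImages(num_images) task would lead to."""
    indent = "  " * depth
    lines = [f"{indent}DownloadRandomDogImages(num_images={num_images})"]
    if num_images > threshold:
        # Too large: split into two smaller instances of the same task.
        half = num_images // 2
        remaining = num_images - half  # accounts for odd numbers
        lines += decompose(half, threshold, depth + 1)
        lines += decompose(remaining, threshold, depth + 1)
    else:
        # Small enough: this task would query the API directly and
        # submit one download task per image.
        lines += [f"{indent}  DownloadImage"] * num_images
    return lines


tree = decompose(9)
print("\n".join(tree))
```

Under these assumptions, a request for 9 images splits into tasks for 4 and 5 images, and the task for 5 images splits again into tasks for 2 and 3 images. The leaf tasks can start submitting downloads independently of one another.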
Retry Handling
By default, when a task fails to execute, it is marked as failed.
In some cases, it might be beneficial to retry the task a certain number of times before marking it as failed.
This can be useful for tasks that are dependent on external services, which might be temporarily unavailable.
Tilebox Workflows offers a way to specify the number of retries for a task by using the max_retries argument of the submit_subtask method.
In the example below, the RootTask submits a FlakyTask as a subtask with a max_retries of 5.
If the FlakyTask then fails, it gets retried up to 5 times before eventually being marked as failed.
The failed task may be picked up by any available runner and not necessarily the same one that it failed on.
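The retry semantics can be sketched in plain Python without the Tilebox library. The run_with_retries helper and the FlakyService class are hypothetical. The sketch also assumes that max_retries counts retries after the first attempt, so a value of 5 allows up to 6 attempts in total.

```python
from typing import Callable


def run_with_retries(task: Callable[[], None], max_retries: int) -> tuple[str, int]:
    """Run task once and retry it up to max_retries more times on failure.

    Returns the final state and the number of attempts made.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            task()
        except Exception:
            if attempts > max_retries:
                # All retries are used up: mark the task as failed.
                return "failed", attempts
        else:
            return "succeeded", attempts


class FlakyService:
    """Fails a fixed number of times before it starts succeeding."""

    def __init__(self, failures_before_success: int) -> None:
        self.remaining_failures = failures_before_success

    def __call__(self) -> None:
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise ConnectionError("service temporarily unavailable")


recovering = run_with_retries(FlakyService(3), max_retries=5)
exhausted = run_with_retries(FlakyService(100), max_retries=5)
```

The first call recovers on its fourth attempt, while the second one is marked as failed once the initial attempt and all five retries have failed.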
Dependencies
Up until now all tasks were independent of one another. But often, tasks have dependencies on other tasks. For example, a task that processes data might depend on a task that fetches the data.
Tasks can express their dependencies on other tasks by using the depends_on argument of the submit_subtask method. A dependency between two tasks means that the dependent task is only executed after the task it depends on has been successfully executed.
The depends_on argument accepts a list of tasks, which means that a task can depend on a variable number of other tasks.
A workflow that has tasks with dependencies on other tasks is implemented as follows:
The RootTask submits three PrintTask tasks as subtasks. Those tasks are dependent on one another, meaning that the second task is only executed after the first task has been successfully executed, and the third task is only executed after the second task has been successfully executed.
This results in the tasks being executed sequentially.
If a task that’s being depended on submits subtasks, those subtasks are also executed before any dependent task.
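How a chain of dependencies forces sequential execution can be sketched in plain Python, without the Tilebox library. RecordingContext, Submitted and execution_order are hypothetical stand-ins. In particular, the sketch assumes that submit_subtask returns a handle that can be passed to depends_on.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Submitted:
    name: str
    depends_on: list["Submitted"] = field(default_factory=list)


class RecordingContext:
    """Hypothetical stand-in that records submissions and their dependencies."""

    def __init__(self) -> None:
        self.submitted: list[Submitted] = []

    def submit_subtask(self, name: str, depends_on: Optional[list[Submitted]] = None) -> Submitted:
        handle = Submitted(name, list(depends_on or []))
        self.submitted.append(handle)
        return handle


def execution_order(submitted: list[Submitted]) -> list[str]:
    """Repeatedly run every task whose dependencies have all completed."""
    done: set[str] = set()
    order: list[str] = []
    # Iterate in reverse to show that dependencies, not submission
    # order, decide when a task may run.
    pending = list(reversed(submitted))
    while pending:
        ready = [t for t in pending if all(d.name in done for d in t.depends_on)]
        if not ready:
            raise ValueError("circular dependency")
        for task in ready:
            order.append(task.name)
            done.add(task.name)
        pending = [t for t in pending if t.name not in done]
    return order


context = RecordingContext()
first = context.submit_subtask("first")
second = context.submit_subtask("second", depends_on=[first])
third = context.submit_subtask("third", depends_on=[second])

order = execution_order(context.submitted)
```

Even though the scheduler looks at the tasks in reverse, only one task is ready at each step, so the three tasks run strictly one after another.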
Dependencies Example
As a practical example, below is a workflow that fetches news articles from the News API and then processes them.
This workflow consists of four tasks:
Task | Dependencies | Description |
---|---|---|
NewsWorkflow | - | The root task of the workflow. It spawns the other tasks and correctly sets up the dependencies between them. |
FetchNews | - | A task that fetches the news articles from the API. It writes the result to a file, which is then read by dependent tasks. |
PrintHeadlines | FetchNews | A task that prints the headlines of the news articles to the console. |
MostFrequentAuthors | FetchNews | A task that counts the number of articles each author has written and prints the result to the console. |
One important aspect to note is that there is no dependency between the PrintHeadlines and MostFrequentAuthors tasks.
This means that they can be executed in parallel, which is exactly what the Tilebox Workflow Orchestrator does, if more than one task runner is available.
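Which tasks may run at the same time follows directly from the dependency table. As a sketch, the standard library graphlib module can group the three processing tasks into batches of tasks that are ready together. This only models the dependency graph and says nothing about how the Tilebox Workflow Orchestrator is implemented. The parallel_batches helper is hypothetical.

```python
from graphlib import TopologicalSorter


def parallel_batches(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; all tasks in a batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        # Every task returned here has all of its dependencies completed.
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


# Maps each task to the set of tasks it depends on (taken from the table).
dependencies = {
    "FetchNews": set(),
    "PrintHeadlines": {"FetchNews"},
    "MostFrequentAuthors": {"FetchNews"},
}

batches = parallel_batches(dependencies)
print(batches)
```

FetchNews forms a batch of its own, and both tasks that depend on it become ready at the same moment once it has completed.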
In this example, the results of the FetchNews task were written to a file. This is not the recommended way of passing data between tasks. When a workflow is executed on a distributed cluster, a file written by one task is not guaranteed to exist on the machine that executes a dependent task. Instead, it’s recommended to use a shared cache.
Task Identifiers
A task identifier is a unique string that the Tilebox Workflow Orchestrator uses to identify the task. It’s what task runners use to determine which task class to select for deserializing the task input and executing the task. The identifier is also used as the default name in visualizations of the execution of a job as a tree of tasks.
If not otherwise specified, the identifier of a task is the class name. For example, the identifier of the PrintHeadlines task in the previous example is "PrintHeadlines". This is great for quick prototyping, but it’s not recommended for production use.
The reason is that if the class name of a task changes, the identifier of the task changes as well. This can lead to problems when refactoring or changing the codebase. Additionally, it makes it impossible to create different tasks that share the same class name.
To solve this problem, Tilebox Workflows offers a way to explicitly specify the identifier of a task. This is done by overriding the identifier method of the Task class. The identifier method should return a unique string that identifies the task.
This way, the identifier of a task is decoupled from the class name and can be changed without having to rename the task class.
Additionally, it allows for different tasks to have the same class name, as long as they have different identifiers.
The identifier method can also be used to specify a version number for the task. This is why it must return a tuple of two strings, where the first string is the identifier and the second string is the version number.
See the section on semantic versioning down below for more information.
The identifier method is required to either be a classmethod or a staticmethod, which means that it can be called without instantiating the class.
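The relationship between class names, identifiers and the lookup performed by a task runner can be sketched in plain Python. The Task class below is a hypothetical stand-in, not the Tilebox base class, and the default version "v0.0" as well as the registry dictionary are assumptions made for illustration.

```python
class Task:
    """Hypothetical stand-in for a task base class."""

    @classmethod
    def identifier(cls) -> tuple[str, str]:
        # Default: the class name is the identifier.
        # The default version "v0.0" is an assumption.
        return cls.__name__, "v0.0"


class PrintHeadlines(Task):
    # No override: the identifier follows the class name.
    pass


class MyTask(Task):
    @staticmethod
    def identifier() -> tuple[str, str]:
        # Explicit identifier: renaming the class no longer changes it.
        return "tilebox.com/example_workflow/MyTask", "v1.3"


# A runner could map identifier strings to task classes. Calling
# identifier() does not require an instance of the task.
registry = {cls.identifier()[0]: cls for cls in (PrintHeadlines, MyTask)}
```

Looking up "tilebox.com/example_workflow/MyTask" in the registry yields the MyTask class regardless of what the class is called in the code.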
Semantic Versioning
As you might have noticed in the previous section, the identifier method can return a tuple of two strings, where the first string is the identifier and the second string is the version number. This allows for semantic versioning of tasks.
Versioning tasks is important for managing changes to the task’s execution method. It allows for the introduction of new features, bug fixes and other changes to the task, while ensuring that existing workflows continue to work as expected. Additionally, it allows for different versions of a task to coexist, which is useful for rolling out changes gradually without interrupting a production deployment.
Versioning is useful due to the distributed nature of Tilebox Workflows. When a task is submitted as part of a job, the version of the task it’s submitted from is recorded. It may be different from the version of the task available on the task runner that executes the task.
Assigning a version number to a task is done by overriding the identifier method of the task class.
It should return a tuple of two strings, where the first string is the identifier and the second string is the version number.
The version number must match the pattern vX.Y, where X and Y are non-negative integers. X is the major version number and Y is the minor version number.
As an example, below is a task that has the identifier "tilebox.com/example_workflow/MyTask" and the version "v1.3":
When task runners execute a task, they need to have a task with a matching identifier string and a compatible version number registered. A compatible version number is a version number where the major version number of the task on the task runner is equal to the major version number of the task that was submitted, and the minor version number of the task on the task runner is equal to or greater than the minor version number of the task that was submitted.
A few examples of compatible and incompatible version numbers are:
MyTask is submitted as part of a job. The version of MyTask on the machine submitting the job is "v1.3".
- A task runner that has version "v1.3" of MyTask deployed would execute this task.
- A task runner that has version "v1.5" of MyTask deployed would also execute this task.
- A task runner that has version "v1.2" of MyTask deployed would not execute this task, as the minor version number is lower than the minor version number of the task that was submitted.
- A task runner that has version "v2.5" of MyTask deployed would not execute this task, as the major version number is different from the major version number of the task that was submitted.
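The compatibility rule can be written down as a small plain-Python check. The parse_version and is_compatible functions are hypothetical helpers for illustration, not part of the Tilebox library.

```python
import re

# Matches the pattern vX.Y with non-negative integers X and Y.
VERSION_PATTERN = re.compile(r"v(\d+)\.(\d+)")


def parse_version(version: str) -> tuple[int, int]:
    """Split a version string such as "v1.3" into (major, minor)."""
    match = VERSION_PATTERN.fullmatch(version)
    if match is None:
        raise ValueError(f"version {version!r} does not match the pattern vX.Y")
    return int(match.group(1)), int(match.group(2))


def is_compatible(submitted: str, runner: str) -> bool:
    """Check whether a runner's task version can execute a submitted task."""
    submitted_major, submitted_minor = parse_version(submitted)
    runner_major, runner_minor = parse_version(runner)
    # Same major version, and the runner's minor version is equal or newer.
    return runner_major == submitted_major and runner_minor >= submitted_minor


for runner_version in ["v1.3", "v1.5", "v1.2", "v2.5"]:
    print(runner_version, is_compatible("v1.3", runner_version))
```

Applied to the four runner versions listed above, the check accepts "v1.3" and "v1.5" and rejects "v1.2" and "v2.5".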
Conclusion
Tasks are the foundation upon which Tilebox Workflows are built. By understanding how to create and manage tasks, you’re equipped to leverage the full power of Tilebox to automate and optimize your workflows. Experiment with defining your own tasks, utilizing subtasks, managing dependencies, and employing semantic versioning to develop robust, efficient workflows.