What is a Task Runner?
What is a Task Runner?
Task runners are the execution agents within the Tilebox Workflows ecosystem that execute tasks. They can be deployed in different computing environments, including on-premise servers and cloud-based auto-scaling clusters. Task runners execute tasks as scheduled by the workflow orchestrator, ensuring they have the necessary resources and environment for effective execution.
Implementing a Task Runner
A task runner is a continuously running process that listens for new tasks to execute. You can start multiple task runner processes to execute tasks concurrently. When a task runner receives a task, it executes it and reports the results back to the workflow orchestrator. The task runner also handles any errors that occur during task execution, reporting them to the orchestrator as well. To execute a task, at least one task runner must be running and available. If no task runners are available, tasks will remain queued until one becomes available. To create and start a task runner, follow these steps:1
Connecting to the Tilebox Workflows API
Instantiate a client connected to the Tilebox Workflows API.
2
Selecting a cluster (optional)
Select or create a cluster and specify its slug when creating a task runner.
If no cluster is specified, the task runner will use the default cluster.
3
Registering tasks
Register tasks by specifying the task classes that the task runner can execute as a list to the 
runner method.4
Listening for new tasks
Call the 
run_forever method of the task runner to listen for new tasks until the task runner process is shut down.Task Selection
For a task runner to pick up a submitted task, the following conditions must be met:- The cluster where the task was submitted must match the task runner’s cluster.
- The task runner must have a registered task that matches the task identifier of the submitted task.
- The version of the task runner’s registered task must be compatible with the submitted task’s version.
Often, multiple submitted tasks match the conditions for execution. In that case, the task runner selects one of the tasks to execute, and the remaining tasks stay in a queue until the selected task is completed or another task runner becomes available.
Parallelism
You can start multiple task runner instances in parallel to execute tasks concurrently. Each task runner listens for new tasks and executes them as they become available. This allows for high parallelism and can be used to scale the execution of tasks to handle large workloads. To test this, run multiple instances of the task runner script in different terminal windows on your local machine, or use a tool like call-in-parallel to start the task runner script multiple times. For example, to start five task runners in parallel, use the following command:Deploying Task Runners
Task runners are continuously running processes that can be deployed in different computing environments. The only requirement for deploying task runners is access to the Tilebox Workflows API. Once this is met, task runners can be deployed in many different environments, including:- On-premise servers
- Cloud-based virtual machines
- Cloud-based auto-scaling clusters
Scaling
One key benefit of task runners is their ability to scale even while workflows are executing. You can start new task runners at any time, and they can immediately pick up queued tasks to execute. It’s not necessary to have an entire processing cluster available at the start of a workflow, as task runners can be started and stopped as needed. This is particularly beneficial in cloud environments, where task runners can be automatically started and stopped based on current workload, measured by metrics such as CPU usage. Here’s an example scenario:- A single instance of a task runner is actively waiting for work in a cloud environment.
- A large workload is submitted to the workflow orchestrator, prompting the task runner to pick up the first task.
- The first task creates new sub-tasks for processing, which the task runner also picks up.
- As the workload increases, the task runner’s CPU usage rises, triggering the cloud environment to automatically start new task runner instances.
- Newly started task runners begin executing queued tasks, distributing the workload among all available task runners.
- Once the workload decreases, the cloud environment automatically stops some task runners.
- The first task runner completes the remaining work until everything is done.
- The first task runner remains idle until new tasks arrive.
Distributed Execution
Task runners can be distributed across different compute environments. For instance, some data stored on-premise may need pre-processing, while further processing occurs in the cloud. A job might involve tasks that filter relevant on-premise data and publish it to the cloud, and other tasks that read data from the cloud and process it. In such scenarios, a task runners can run on-premise and another in a cloud environments, resulting in them effectively collaborating on the same job. Another advantage of distributed task runners is executing workflows that require specific hardware for certain tasks. For example, one task might need a GPU, while another requires extensive memory. Here’s an example of a distributed workflow:DownloadData task, while the second picks up the ProcessData task.
The DistributedWorkflow does not require specific hardware, so it can be registered with both runners and executed by either one.
- download_task_runner.py
- gpu_task_runner.py
download_task_runner.py and gpu_task_runner.py are started, in parallel, on different machines with the required hardware for each. When DistributedWorkflow is submitted, it executes on one of the two runners, and it’s submitted sub-tasks are handled by the appropriate runner.
In this case, since ProcessData depends on DownloadData, the GPU task runner remains idle until the download completion, then picks up the processing task.
Task Failures
If an unhandled exception occurs during task execution, the task runner captures it and reports it back to the workflow orchestrator. The orchestrator then marks the task as failed, leading to job cancellation to prevent further tasks of the same job-that may not be relevant anymore-from being executed. A task failure does not result in losing all previous work done by the job. If the failure is fixable—by fixing a bug in a task implementation, ensuring the task has necessary resources, or simply retrying it due to a flaky network connection—it may be worth retrying the job. When retrying a job, all failed tasks are added back to the queue, allowing a task runner to potentially execute them. If execution then succeeds, the job continues smoothly. Otherwise, the task will remain marked as failed and can be retried again if desired. If fixing a failure requires modifying the task implementation, it’s important to deploy the updated version to the task runners before retrying the job. Otherwise, a task runner could pick up the original, faulty implementation again, leading to another failure.Task idempotency
Since a task may be retried, it’s possible that a task is executed more than once. Depending on where in the execution of the task it failed, it may have already performed some side effects, such as writing to a database, or sending a message to a queue. Because of that it’s crucial to ensure that tasks are idempotent. Idempotent tasks can be executed multiple times without altering the outcome beyond the first successful execution. A special case of idempotency involves submitting sub-tasks. After a task callscontext.submit_subtask and then fails and is retried, those submitted sub-tasks of an earlier failed execution are automatically removed, ensuring that they can be safely submitted again when the task is retried.