Task Runners
Implementing a Task Runner
A task runner is a continuously running process that listens for new tasks to execute. You can start multiple task runner processes to execute tasks concurrently. When a task runner receives a task, it executes it and reports the results back to the workflow orchestrator. The task runner also handles any errors that occur during task execution, reporting them to the orchestrator as well.
To execute a task, at least one task runner must be running and available. If no task runners are available, tasks will remain queued until one becomes available.
To create and start a task runner, follow these steps:
1. Connecting to the Tilebox Workflows API: instantiate a client connected to the Tilebox Workflows API.
2. Selecting a cluster: select or create a cluster and specify its slug when creating the task runner.
3. Registering tasks: register the task classes that the task runner can execute by passing them as a list to the runner method.
4. Listening for new tasks: call the run_forever method of the task runner to listen for new tasks until the task runner process is shut down.
Here is a simple example demonstrating these steps:
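The snippet below is a sketch only; the task class, cluster slug, and exact import paths are illustrative and may need adapting to your setup.

```python
from tilebox.workflows import Client, Task, ExecutionContext


class MyFirstTask(Task):
    """A placeholder task; any registered task class works the same way."""

    def execute(self, context: ExecutionContext) -> None:
        print("Hello from a task runner!")


def main() -> None:
    # connect to the Tilebox Workflows API
    # (assumes credentials are already configured, e.g. via an API key in the environment)
    client = Client()

    # select a cluster by its slug and register the tasks this runner can execute
    runner = client.runner("my-cluster-slug", tasks=[MyFirstTask])

    # listen for new tasks until the process is shut down
    runner.run_forever()


if __name__ == "__main__":
    main()
```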
To start the task runner locally, run it as a Python script:
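Assuming the example above is saved as task_runner.py:

```bash
python task_runner.py
```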
Task Selection
For a task runner to pick up a submitted task, the following conditions must be met:
- The cluster where the task was submitted must match the task runner’s cluster.
- The task runner must have a registered task that matches the task identifier of the submitted task.
- The version of the task runner’s registered task must be compatible with the submitted task’s version.
If a task meets these conditions, the task runner executes it. Otherwise, the task runner remains idle until a matching task is available.
Often, multiple submitted tasks match the conditions for execution. In that case, the task runner selects one of the tasks to execute, and the remaining tasks stay in a queue until the selected task is completed or another task runner becomes available.
Parallelism
You can start multiple task runner instances in parallel to execute tasks concurrently. Each task runner listens for new tasks and executes them as they become available. This allows for high parallelism and can be used to scale the execution of tasks to handle large workloads.
To test this, run multiple instances of the task runner script in different terminal windows on your local machine, or use a tool like call-in-parallel to start the task runner script multiple times.
For example, to start five task runners in parallel, use the following command:
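One possible form is a plain shell loop that starts five background processes (assuming the runner script is saved as task_runner.py; a tool like call-in-parallel achieves the same with a single invocation):

```bash
for i in $(seq 5); do
  python task_runner.py &
done
wait
```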
Deploying Task Runners
Task runners are continuously running processes that can be deployed in different computing environments. The only requirement for deploying task runners is access to the Tilebox Workflows API. Once this is met, task runners can be deployed in many different environments, including:
- On-premise servers
- Cloud-based virtual machines
- Cloud-based auto-scaling clusters
Scaling
One key benefit of task runners is their ability to scale even while workflows are executing. You can start new task runners at any time, and they can immediately pick up queued tasks to execute. It’s not necessary to have an entire processing cluster available at the start of a workflow, as task runners can be started and stopped as needed.
This is particularly beneficial in cloud environments, where task runners can be automatically started and stopped based on current workload, measured by metrics such as CPU usage. Here’s an example scenario:
- A single instance of a task runner is actively waiting for work in a cloud environment.
- A large workload is submitted to the workflow orchestrator, prompting the task runner to pick up the first task.
- The first task creates new sub-tasks for processing, which the task runner also picks up.
- As the workload increases, the task runner’s CPU usage rises, triggering the cloud environment to automatically start new task runner instances.
- Newly started task runners begin executing queued tasks, distributing the workload among all available task runners.
- Once the workload decreases, the cloud environment automatically stops some task runners.
- The first task runner completes the remaining work until everything is done.
- The first task runner remains idle until new tasks arrive.
CPU usage-based auto scaling is just one method to scale task runners. Other metrics, such as memory usage or network bandwidth, are also supported by many cloud environments. In a future release, configuration options for scaling task runners based on custom metrics (for example the number of queued tasks) are planned.
Distributed Execution
Task runners can be distributed across different compute environments. For instance, some data stored on-premise may need pre-processing, while further processing occurs in the cloud. A job might involve tasks that filter relevant on-premise data and publish it to the cloud, and other tasks that read data from the cloud and process it. In such scenarios, one task runner can run on-premise and another in a cloud environment, effectively collaborating on the same job.
Another advantage of distributed task runners is executing workflows that require specific hardware for certain tasks. For example, one task might need a GPU, while another requires extensive memory.
Here’s an example of a distributed workflow:
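The sketch below illustrates the idea; the task bodies are placeholders, and the depends_on argument expressing the dependency between the two sub-tasks is an assumption about the sub-task submission API.

```python
from tilebox.workflows import Task, ExecutionContext


class DownloadData(Task):
    """Downloads input data; benefits from a fast network connection."""

    def execute(self, context: ExecutionContext) -> None:
        print("downloading data to shared storage")  # placeholder for the actual download


class ProcessData(Task):
    """Processes the downloaded data; requires a GPU."""

    def execute(self, context: ExecutionContext) -> None:
        print("processing data on the GPU")  # placeholder for the actual processing


class DistributedWorkflow(Task):
    """Root task: submits the download and processing steps as dependent sub-tasks."""

    def execute(self, context: ExecutionContext) -> None:
        download = context.submit_subtask(DownloadData())
        # ProcessData only starts once DownloadData has finished
        context.submit_subtask(ProcessData(), depends_on=[download])
```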
To achieve distributed execution for this workflow, no single task runner is set up to execute all three tasks. Instead, two task runners are set up, each capable of executing a subset of them: one in a high-speed network environment and the other with GPU access. When the distributed workflow runs, the first task runner picks up the DownloadData task, while the second picks up the ProcessData task. The DistributedWorkflow task itself does not require specific hardware, so it can be registered with both runners and executed by either one.
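The two runner scripts could then look roughly as follows; the cluster slug and the module the task classes are imported from are assumptions.

```python
# download_task_runner.py - started on a machine with a fast network connection
from tilebox.workflows import Client

from my_workflow.tasks import DownloadData, DistributedWorkflow  # hypothetical shared module

client = Client()
runner = client.runner("my-cluster-slug", tasks=[DownloadData, DistributedWorkflow])
runner.run_forever()
```

```python
# gpu_task_runner.py - started on a machine with GPU access
from tilebox.workflows import Client

from my_workflow.tasks import ProcessData, DistributedWorkflow  # hypothetical shared module

client = Client()
runner = client.runner("my-cluster-slug", tasks=[ProcessData, DistributedWorkflow])
runner.run_forever()
```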
Now, both download_task_runner.py and gpu_task_runner.py are started in parallel, on different machines with the required hardware for each. When DistributedWorkflow is submitted, it executes on one of the two runners, and its submitted sub-tasks are handled by the appropriate runner.
In this case, since ProcessData depends on DownloadData, the GPU task runner remains idle until the download completes, then picks up the processing task.
Task Failures
If an unhandled exception occurs during task execution, the task runner captures it and reports it back to the workflow orchestrator. The orchestrator then marks the task as failed and cancels the job, preventing further tasks of the same job, which may no longer be relevant, from being executed.
A task failure does not mean losing all previous work done by the job. If the failure can be resolved, for example by fixing a bug in the task implementation, ensuring the task has the necessary resources, or simply retrying after a flaky network connection, it may be worth retrying the job.
When retrying a job, all failed tasks are added back to the queue, allowing a task runner to potentially execute them. If execution then succeeds, the job continues smoothly. Otherwise, the task will remain marked as failed and can be retried again if desired.
If fixing a failure requires modifying the task implementation, it’s important to deploy the updated version to the task runners before retrying the job. Otherwise, a task runner could pick up the original, faulty implementation again, leading to another failure.
Task idempotency
Since a task may be retried, it's possible for a task to be executed more than once. Depending on where in its execution the task failed, it may already have performed side effects, such as writing to a database or sending a message to a queue. Because of this, it's crucial to ensure that tasks are idempotent: an idempotent task can be executed multiple times without altering the outcome beyond the first successful execution.
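For example, a task that uploads a result file can be made idempotent by deriving the output location deterministically from its input, so a retried execution overwrites the same object instead of creating a duplicate. This is a sketch; compute_result and upload_file are hypothetical helpers.

```python
from tilebox.workflows import Task, ExecutionContext


class IngestGranule(Task):
    granule_id: str

    def execute(self, context: ExecutionContext) -> None:
        result = compute_result(self.granule_id)  # hypothetical processing step
        # the output key is derived from the task input, so re-running the task
        # overwrites the same object instead of producing a second copy
        upload_file(f"results/{self.granule_id}.nc", result)  # hypothetical storage helper
```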
A special case of idempotency involves submitting sub-tasks. If a task calls context.submit_subtask and then fails and is retried, the sub-tasks submitted by the earlier failed execution are automatically removed, so they can safely be submitted again when the task is retried.
Runner Crashes
Tilebox Workflows has an internal mechanism to handle unexpected task runner crashes. When a task runner picks up a task, it periodically sends a heartbeat to the workflow orchestrator. If the orchestrator does not receive this heartbeat for a defined duration, it marks the task as failed and automatically attempts to retry it up to 10 times. This allows another task runner to pick up the task and continue executing the job.
This mechanism ensures that scenarios such as power outages, hardware failures, or dropped network connections are handled effectively, preventing any task from remaining in a running state indefinitely.
Observability
Task runners are continuously running processes, making it essential to monitor their health and performance. You can achieve observability by collecting and analyzing logs, metrics, and traces from task runners. Tilebox Workflows provides tools to enable this data collection and analysis. To learn how to configure task runners for observability, head over to the Observability section.