The cache API is currently experimental. More features and additional backends are planned on the roadmap, and there may be breaking changes to the cache API in the future.

Caches are configured at the task runner level. Since task runners can be deployed in a distributed fashion, caches need to be accessible from all task runners contributing to the execution of a workflow.

That is why the current default cache implementation is backed by a Google Cloud Storage bucket, which offers a simple and scalable way to share data between tasks. For quick prototyping and local development, a local file system cache is also provided out of the box.

You can also write your own cache backend by implementing the Cache interface.

Configuring a Cache

A cache is configured when instantiating a task runner, by passing a suitable cache instance to the cache parameter. For example, to use an in-memory cache, use the InMemoryCache class from the tilebox.workflows.cache module. This implementation is useful for local development and quick prototyping. Check out the different supported cache backends below for more options.
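A minimal sketch of such a configuration is shown below. The Client and runner setup assume the usual Tilebox workflows API, with MyTask standing in for your own task classes and "dev-cluster" for your cluster:

```python
from tilebox.workflows import Client
from tilebox.workflows.cache import InMemoryCache

client = Client()

# pass a cache instance to the task runner via the cache parameter
runner = client.runner(
    "dev-cluster",          # the cluster this runner belongs to
    tasks=[MyTask],         # placeholder for your own task classes
    cache=InMemoryCache(),  # in-memory cache for prototyping
)
```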

With such a cache configured, the context object passed to each task execution gains a job_cache attribute that can be used to store and retrieve data from the cache.
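Inside a task, accessing the cache might look like the following sketch. The Task base class, the ExecutionContext type, and the execute method signature are assumptions about the task API; the save method is described in more detail below:

```python
from tilebox.workflows import Task, ExecutionContext

class CacheAwareTask(Task):
    def execute(self, context: ExecutionContext) -> None:
        # store a bytes value in the job cache under a string key
        context.job_cache.save("some_key", b"some data")
```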

Cache Backends

Tilebox Workflows comes with four cache implementations out of the box, each backed by a different storage system.

Google Storage Cache

A cache implementation backed by a Google Cloud Storage bucket to store and retrieve data. This is a scalable and reliable way to share data between tasks, even when running in a distributed fashion, and the recommended cache implementation for production use. All you need is a GCP project and a bucket you have access to.

The Tilebox Workflow orchestrator uses the official Python client for Google Cloud Storage. Check out its official documentation to learn how to set up the required authentication.
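As a sketch, wiring such a cache into a task runner could look like this. The GoogleStorageCache class name and its constructor signature are assumptions; the bucket handle comes from the official google-cloud-storage client:

```python
from google.cloud.storage import Client as StorageClient

from tilebox.workflows import Client
from tilebox.workflows.cache import GoogleStorageCache

# authenticate with GCP as described in the google-cloud-storage docs
storage_client = StorageClient(project="my-gcp-project")
bucket = storage_client.bucket("my-cache-bucket")

client = Client()
runner = client.runner(
    "dev-cluster",
    tasks=[MyTask],  # placeholder for your own task classes
    # prefix scopes all cache keys to a common prefix within the bucket
    cache=GoogleStorageCache(bucket, prefix="jobs"),
)
```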

The prefix parameter is optional and can be used to assign a common prefix to all cache keys. This can be useful to scope all cache objects to a certain prefix, for example when using the same bucket for other purposes.

Amazon S3 Cache

A cache implementation backed by an Amazon S3 bucket to store and retrieve data. Like the Google Cloud Storage cache, it's a scalable and reliable way to share data between tasks in a distributed setting, and recommended for production use. All you need is an S3 bucket you have access to.

The Tilebox Workflow orchestrator uses the official boto3 library to interact with Amazon S3. Check out its documentation to learn how to set up the required authentication.
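A corresponding sketch for S3, with the AmazonS3Cache class name and constructor arguments again being assumptions:

```python
from tilebox.workflows import Client
from tilebox.workflows.cache import AmazonS3Cache

client = Client()
runner = client.runner(
    "dev-cluster",
    tasks=[MyTask],  # placeholder for your own task classes
    # boto3 resolves credentials from the environment, see its docs
    cache=AmazonS3Cache("my-cache-bucket", prefix="jobs"),
)
```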

The prefix parameter is optional and can be used to assign a common prefix to all cache keys. This can be useful to scope all cache objects to a certain prefix, for example when using the same bucket for other purposes.

Local File System Cache

A cache implementation backed by the local file system. It's useful for quick prototyping and local development when all task runners run on the same machine or share access to the same file system. It's intended for testing workflows that execute across multiple task runners, but is not recommended for production use.
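A sketch of the configuration, assuming the class is named LocalFileSystemCache and takes a root directory:

```python
from tilebox.workflows import Client
from tilebox.workflows.cache import LocalFileSystemCache

client = Client()
runner = client.runner(
    "dev-cluster",
    tasks=[MyTask],  # placeholder for your own task classes
    # all cache data is stored below this directory
    cache=LocalFileSystemCache("/path/to/cache/directory"),
)
```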

In-Memory Cache

A simple in-memory cache implementation, useful for quick prototyping and local development. Cache data is not shared between task runners and is not persisted across task runner restarts. Only use this cache for prototyping workflows executed on a single task runner.
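Since it requires no external storage, configuring it is a one-liner (see the full runner setup sketch in Configuring a Cache above):

```python
from tilebox.workflows.cache import InMemoryCache

# no arguments needed, cache data lives in the runner's process memory
cache = InMemoryCache()
```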

Data Isolation

Caches are isolated per job: data stored in a cache is only accessible to tasks within the same job. This avoids conflicts between cache keys of different jobs from the same workflow, but it also means that data currently cannot be shared across jobs.

The ability to share data between jobs is on the roadmap. This can be especially useful for near-real-time processing workflows, or for workflows that require auxiliary data from third-party sources.

Storing and Retrieving Data

To access the cache from within a task, use the job_cache attribute of the context object passed to the task execution. This attribute is an instance of the JobCache interface, which provides methods to store and retrieve data from the cache. Where and how the data is stored depends on the configured cache backend.

The cache API is designed to be simple and generic, and to handle all kinds of data: it stores arbitrary binary data as bytes objects under str cache keys. This means you can store any kind of data, such as pickled Python objects, serialized JSON, UTF-8 or ASCII encoded strings, or raw binary data.

The following code snippet shows how to store and retrieve data from the cache.
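A sketch of such a workflow is shown below. The save method is the one described here; the load method used for retrieval, as well as the Task, ExecutionContext and submit_subtask APIs, are assumptions:

```python
from tilebox.workflows import Task, ExecutionContext

class ProducerTask(Task):
    def execute(self, context: ExecutionContext) -> None:
        # store binary data in the job cache under the key "data"
        context.job_cache.save("data", b"my_binary_data_to_store")
        # spawn a subtask that reads the data back
        context.submit_subtask(ConsumerTask())

class ConsumerTask(Task):
    def execute(self, context: ExecutionContext) -> None:
        # retrieve the data stored by the ProducerTask
        data = context.job_cache.load("data")
        print(f"Read {data} from cache")
```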

The data passed to the save method is stored in the cache under the key "data". It can be arbitrarily large binary data, as long as it fits within the constraints of the cache backend. The chosen key should be unique within the scope of the job, to avoid conflicts with other tasks that might use the same key.

To test the preceding workflow, you can start a task runner locally that uses the InMemoryCache backend. That way a job executing the ProducerTask can be submitted, and the output of the ConsumerTask observed.
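A sketch of such a test setup, assuming the job client and runner APIs used in the earlier examples:

```python
from tilebox.workflows import Client
from tilebox.workflows.cache import InMemoryCache

client = Client()

# submit a job executing the ProducerTask
job_client = client.jobs()
job_client.submit("cache-demo", ProducerTask(), cluster="dev-cluster")

# start a local task runner with an in-memory cache to work on the job
runner = client.runner(
    "dev-cluster",
    tasks=[ProducerTask, ConsumerTask],
    cache=InMemoryCache(),
)
runner.run_all()  # process tasks until no more work is available
```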

Output
Read b'my_binary_data_to_store' from cache

Groups and Hierarchical Keys

The cache API also supports groups and hierarchical keys, which behave similarly to directories and files in a file system. Groups can be used to organize cache keys into a hierarchical structure, which helps avoid conflicts between keys and keeps data organized. Additionally, the provided list_items method lists all keys in a group. This is especially useful when a workflow spawns many tasks that store data in the cache, and a subsequent task needs to access all the produced data, for example to compute an aggregated result.

The following code snippet shows how to use groups and hierarchical keys. Data is stored in a group called "random_numbers" under the keys "1", "2" and "3". Then all keys in the group are listed and the cached data is retrieved for each key. The output then displays the sum of all data points.
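The sketch below puts this together. The group accessor and the load method are assumptions about the JobCache interface, while list_items is the method described above:

```python
import random

from tilebox.workflows import Task, ExecutionContext

class CacheGroupDemoWorkflow(Task):
    def execute(self, context: ExecutionContext) -> None:
        # store three random numbers in the "random_numbers" group
        group = context.job_cache.group("random_numbers")
        for key in ["1", "2", "3"]:
            number = random.randint(0, 100)
            group.save(key, number.to_bytes(4, "big"))
        # aggregate the stored numbers in a subsequent task
        context.submit_subtask(PrintSumTask())

class PrintSumTask(Task):
    def execute(self, context: ExecutionContext) -> None:
        group = context.job_cache.group("random_numbers")
        total = 0
        for key in group.list_items():  # list all keys in the group
            total += int.from_bytes(group.load(key), "big")
        print(f"Sum of all numbers: {total}")
```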

Submitting a job for the CacheGroupDemoWorkflow and then running it with a task runner could look like this:
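For example, reusing the test setup from above (same assumed job client and runner APIs):

```python
from tilebox.workflows import Client
from tilebox.workflows.cache import InMemoryCache

client = Client()

job_client = client.jobs()
job_client.submit("cache-group-demo", CacheGroupDemoWorkflow(), cluster="dev-cluster")

runner = client.runner(
    "dev-cluster",
    tasks=[CacheGroupDemoWorkflow, PrintSumTask],
    cache=InMemoryCache(),
)
runner.run_all()
```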

Output
Sum of all numbers: 284

Different Concurrent Caches

Currently, only one cache can be configured per task runner, so all tasks executed by a task runner share the same cache. A similar effect can be achieved by configuring different task runners with different caches. Support for multiple concurrent caches is on the roadmap for a future release.