Why use async?

When working with external datasets, such as Tilebox datasets, loading data may take some time. To speed up this process, you can run requests in parallel. Multi-threading and multi-processing can do this, but they are often complex. A simpler option is often to load data asynchronously using coroutines and asyncio.

Switching to an async datasets client

To switch to the async client, change the import statement for the Client. The example below illustrates this change.

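# Before: the synchronous client
# from tilebox.datasets import Client

# After: the async client is imported from the aio submodule
from tilebox.datasets.aio import Client

# This client is asynchronous
client = Client()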

After switching to the async client, use await for operations that interact with the Tilebox API.

# Assumes client is an instance of tilebox.datasets.aio.Client

# Listing datasets
datasets = await client.datasets()

# Listing collections
dataset = datasets.open_data.copernicus.sentinel1_sar
collections = await dataset.collections()

# Collection information
collection = collections["S1A_IW_RAW__0S"]
info = await collection.info()
print(f"Data for {collection.name} is available for {info.availability}")

# Loading data
data = await collection.load(("2022-05-01", "2022-06-01"), show_progress=True)

# Finding a specific datapoint
datapoint_uuid = "01910b3c-8552-7671-3345-b902cc0813f3"
datapoint = await collection.find(datapoint_uuid)

Jupyter notebooks and similar interactive environments support asynchronous code execution. You can use await some_async_call() as the output of a code cell.
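Outside a notebook, a plain Python script has no running event loop, so await is only valid inside a coroutine that is started with asyncio.run. The following is a minimal sketch of that pattern. Here some_async_call is a hypothetical stand-in that simulates an API call with asyncio.sleep; it is not part of the Tilebox client.

```python
import asyncio


async def some_async_call() -> str:
    """Hypothetical stand-in for an awaitable API call."""
    await asyncio.sleep(0.01)  # simulate waiting on the network
    return "done"


async def main() -> str:
    # Inside a coroutine, await works just as it does in a notebook cell
    return await some_async_call()


# A script has no running event loop, so start one explicitly
result = asyncio.run(main())
print(result)
```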

Fetching data concurrently

The primary benefit of the async client is that it allows concurrent requests, which improves performance. In the example below, data is fetched from multiple collections. The synchronous approach retrieves data sequentially, while the async approach does so concurrently, resulting in faster execution.

# Example: fetching data sequentially

import time
from tilebox.datasets import Client
from tilebox.datasets.sync.timeseries import TimeseriesCollection

client = Client()
datasets = client.datasets()
collections = datasets.open_data.copernicus.landsat8_oli_tirs.collections()

def stats_for_2020(collection: TimeseriesCollection) -> tuple[str, int]:
    """Fetch data for 2020 and return the collection name and the number of data points loaded."""
    data = collection.load(("2020-01-01", "2021-01-01"), show_progress=True)
    n = data.sizes['time'] if 'time' in data else 0
    return (collection.name, n)

start = time.monotonic()
results = [stats_for_2020(collections[name]) for name in collections]
duration = time.monotonic() - start

for collection_name, n in results:
    print(f"There are {n} datapoints in {collection_name} for 2020.")
print(f"Fetching data took {duration:.2f} seconds")

The output demonstrates that the async approach runs approximately 30% faster for this example. With show_progress enabled, the progress bars update concurrently.

There are 19624 datapoints in L1GT for 2020.
There are 1281 datapoints in L1T for 2020.
There are 65313 datapoints in L1TP for 2020.
There are 25375 datapoints in L2SP for 2020.
Fetching data took 10.92 seconds
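
For comparison, the concurrent pattern can be sketched as follows. This sketch does not call the Tilebox API: StandInCollection is a hypothetical class that simulates a slow load with asyncio.sleep, and its delays and datapoint counts are made up for illustration. With the real async client, the collections would instead come from a Client imported from tilebox.datasets.aio, and load would be awaited in the same way.

```python
import asyncio
import time


class StandInCollection:
    """Hypothetical stand-in for an async collection (no network access)."""

    def __init__(self, name: str, n_datapoints: int, delay: float) -> None:
        self.name = name
        self._n = n_datapoints
        self._delay = delay

    async def load(self, interval: tuple[str, str]) -> list[int]:
        # Simulate waiting on a remote API for `delay` seconds
        await asyncio.sleep(self._delay)
        return list(range(self._n))


async def stats_for_2020(collection: StandInCollection) -> tuple[str, int]:
    """Load data for 2020 and return the collection name and the number of data points."""
    data = await collection.load(("2020-01-01", "2021-01-01"))
    return collection.name, len(data)


async def main() -> tuple[list[tuple[str, int]], float]:
    # Made-up collections: each simulated load takes 0.2 seconds
    collections = [
        StandInCollection("L1GT", 3, 0.2),
        StandInCollection("L1T", 1, 0.2),
        StandInCollection("L1TP", 5, 0.2),
        StandInCollection("L2SP", 2, 0.2),
    ]
    start = time.monotonic()
    # Start all loads at once; gather returns results in the order of its inputs
    results = await asyncio.gather(*(stats_for_2020(c) for c in collections))
    duration = time.monotonic() - start
    return list(results), duration


results, duration = asyncio.run(main())
for collection_name, n in results:
    print(f"There are {n} datapoints in {collection_name} for 2020.")
print(f"Fetching data took {duration:.2f} seconds")
```

Because all four simulated loads wait at the same time, the total duration is close to that of a single load rather than the sum of all four.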

Async workflows

The Tilebox workflows Python client does not have an async client. This is because workflows are designed for distributed and concurrent execution outside a single async event loop. Within a single task, however, you can still use async code to take advantage of asynchronous execution, such as parallel data loading. You can achieve this by wrapping your async code in asyncio.run.

Below is an example of using async code within a workflow task.

import asyncio
import xarray as xr

from tilebox.datasets.aio import Client as DatasetsClient
from tilebox.datasets.data import TimeIntervalLike
from tilebox.workflows import Task, ExecutionContext

class FetchData(Task):
    def execute(self, context: ExecutionContext) -> None:
        # The task execution itself is synchronous
        # But we can leverage async code within the task using asyncio.run

        # This will fetch three months of data in parallel
        data_jan, data_feb, data_mar = asyncio.run(load_first_three_months())
        
async def load_data(interval: TimeIntervalLike):
    datasets = await DatasetsClient().datasets()
    collections = await datasets.open_data.copernicus.landsat8_oli_tirs.collections()
    return await collections["L1T"].load(interval)

async def load_first_three_months() -> tuple[xr.Dataset, xr.Dataset, xr.Dataset]:
    jan = load_data(("2020-01-01", "2020-02-01"))
    feb = load_data(("2020-02-01", "2020-03-01"))
    mar = load_data(("2020-03-01", "2020-04-01"))
    # load the three months in parallel
    jan, feb, mar = await asyncio.gather(jan, feb, mar)
    return jan, feb, mar

If you encounter an error like RuntimeError: asyncio.run() cannot be called from a running event loop, it means you’re trying to start another asyncio event loop (with asyncio.run) from within an existing one. This often happens in Jupyter notebooks since they automatically start an event loop. A way to resolve this is by using nest-asyncio.
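
The error can be reproduced with the standard library alone. The sketch below calls asyncio.run from inside a coroutine that is already running in an event loop, catches the resulting RuntimeError, and then shows that simply awaiting the coroutine works in that situation. Here load_something is a hypothetical stand-in for a data loading call.

```python
import asyncio


async def load_something() -> str:
    """Hypothetical stand-in for an async data loading call."""
    await asyncio.sleep(0.01)
    return "data"


async def inside_running_loop() -> tuple[str, str]:
    coro = load_something()
    try:
        # An event loop is already running here, so this call fails
        asyncio.run(coro)
        error = ""
    except RuntimeError as exc:
        error = str(exc)
        coro.close()  # avoid a "coroutine was never awaited" warning

    # Inside a running loop, awaiting the coroutine directly works
    result = await load_something()
    return error, result


error, result = asyncio.run(inside_running_loop())
print(error)
print(result)
```

With nest-asyncio installed, the usual pattern is to call nest_asyncio.apply() once, before the first call to asyncio.run, which patches asyncio to allow nested event loops.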