When working with external datasets, such as Tilebox datasets, loading data may take some time. To speed this up, you can run requests in parallel. While you could use multi-threading or multi-processing, both add complexity; a simpler option is often to perform data loading asynchronously using coroutines and asyncio.
To switch to the async client, change the import statement for the Client. The example below illustrates this change.
```python
from tilebox.datasets import Client

# This client is synchronous
client = Client()
```
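The async client is imported from the aio module instead (the same module used in the workflow example later in this section):

```python
from tilebox.datasets.aio import Client

# This client is asynchronous
client = Client()
```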
After switching to the async client, use await for operations that interact with the Tilebox API.
```python
# Listing datasets
datasets = await client.datasets()

# Listing collections
dataset = datasets.open_data.copernicus.sentinel1_sar
collections = await dataset.collections()

# Collection information
collection = collections["S1A_IW_RAW__0S"]
info = await collection.info()
print(f"Data for My-collection is available for {info.availability}")

# Loading data
data = await collection.load(("2022-05-01", "2022-06-01"), show_progress=True)

# Finding a specific datapoint
datapoint_uuid = "01910b3c-8552-7671-3345-b902cc0813f3"
datapoint = await collection.find(datapoint_uuid)
```
Jupyter notebooks and similar interactive environments support asynchronous code execution. You can use await some_async_call() directly at the top level of a code cell.
The primary benefit of the async client is that it allows concurrent requests, enhancing performance.
In the example below, data is fetched from multiple collections. The synchronous approach retrieves data sequentially, while the async approach does so concurrently, resulting in faster execution.
```python
# Example: fetching data sequentially
# switch to the async example to compare the differences
import time

from tilebox.datasets import Client
from tilebox.datasets.sync.timeseries import TimeseriesCollection

client = Client()
datasets = client.datasets()
collections = datasets.open_data.copernicus.landsat8_oli_tirs.collections()

def stats_for_2020(collection: TimeseriesCollection) -> tuple[str, int]:
    """Fetch data for 2020 and return the number of data points that were loaded."""
    data = collection.load(("2020-01-01", "2021-01-01"), show_progress=True)
    n = data.sizes['time'] if 'time' in data else 0
    return collection.name, n

start = time.monotonic()
results = [stats_for_2020(collections[name]) for name in collections]
duration = time.monotonic() - start

for collection_name, n in results:
    print(f"There are {n} datapoints in {collection_name} for 2020.")
print(f"Fetching data took {duration:.2f} seconds")
```
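For comparison, here is a minimal sketch of the async counterpart. It assumes the async client from tilebox.datasets.aio and uses asyncio.gather to run the per-collection requests concurrently; it also assumes the async collections container supports the same iteration and indexing as the synchronous one.

```python
# Example: fetching data concurrently (sketch mirroring the synchronous example above)
import asyncio
import time

from tilebox.datasets.aio import Client


async def main() -> None:
    client = Client()
    datasets = await client.datasets()
    collections = await datasets.open_data.copernicus.landsat8_oli_tirs.collections()

    async def stats_for_2020(collection) -> tuple[str, int]:
        """Fetch data for 2020 and return the number of data points that were loaded."""
        data = await collection.load(("2020-01-01", "2021-01-01"), show_progress=True)
        n = data.sizes['time'] if 'time' in data else 0
        return collection.name, n

    start = time.monotonic()
    # issue one request per collection and wait for all of them concurrently
    results = await asyncio.gather(*(stats_for_2020(collections[name]) for name in collections))
    duration = time.monotonic() - start

    for collection_name, n in results:
        print(f"There are {n} datapoints in {collection_name} for 2020.")
    print(f"Fetching data took {duration:.2f} seconds")


asyncio.run(main())
```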
The output demonstrates that the async approach runs approximately 30% faster for this example. With show_progress enabled, the progress bars update concurrently.
```
There are 19624 datapoints in L1GT for 2020.
There are 1281 datapoints in L1T for 2020.
There are 65313 datapoints in L1TP for 2020.
There are 25375 datapoints in L2SP for 2020.
Fetching data took 10.92 seconds
```
The Tilebox workflows Python client does not have an async client, because workflows are designed for distributed and concurrent execution outside a single async event loop. Within a single task, however, you may still use async code to take advantage of asynchronous execution, such as parallel data loading. You can achieve this by wrapping your async code in asyncio.run.
Below is an example of using async code within a workflow task.
```python
import asyncio

import xarray as xr

from tilebox.datasets.aio import Client as DatasetsClient
from tilebox.datasets.data import TimeIntervalLike
from tilebox.workflows import Task, ExecutionContext


class FetchData(Task):
    def execute(self, context: ExecutionContext) -> None:
        # The task execution itself is synchronous
        # But we can leverage async code within the task using asyncio.run
        # This will fetch three months of data in parallel
        data_jan, data_feb, data_mar = asyncio.run(load_first_three_months())


async def load_data(interval: TimeIntervalLike):
    datasets = await DatasetsClient().datasets()
    collections = await datasets.open_data.copernicus.landsat8_oli_tirs.collections()
    return await collections["L1T"].load(interval)


async def load_first_three_months() -> tuple[xr.Dataset, xr.Dataset, xr.Dataset]:
    jan = load_data(("2020-01-01", "2020-02-01"))
    feb = load_data(("2020-02-01", "2020-03-01"))
    mar = load_data(("2020-03-01", "2020-04-01"))
    # load the three months in parallel
    jan, feb, mar = await asyncio.gather(jan, feb, mar)
    return jan, feb, mar
```
If you encounter an error like RuntimeError: asyncio.run() cannot be called from a running event loop, it means you're trying to start another asyncio event loop (with asyncio.run) from within an existing one. This often happens in Jupyter notebooks, since they automatically start an event loop. One way to resolve this is to use nest-asyncio.
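A minimal sketch of that workaround, assuming the third-party nest-asyncio package is installed (pip install nest-asyncio); some_async_call below is a placeholder for your own coroutine:

```python
import asyncio

import nest_asyncio

# patch the already-running event loop (e.g. the one started by a Jupyter notebook)
# so that nested asyncio.run() calls are allowed
nest_asyncio.apply()

# asyncio.run can now be used even though the notebook's event loop is running
# result = asyncio.run(some_async_call())
```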