Overview

This section provides an overview of the API for loading data from a collection. It includes usage examples for many common scenarios.

Method             Description
collection.load    Query data points from a collection.
collection.find    Find a specific data point in a collection by its ID.

Check out the examples below for common scenarios when loading data from collections.

from tilebox.datasets import Client

client = Client()
datasets = client.datasets()
collections = datasets.open_data.copernicus.sentinel1_sar.collections()
collection = collections["S1A_IW_RAW__0S"]

To load data points from a dataset collection, use the load method. It requires a time_or_interval parameter to specify the time or time interval for loading.

Filtering by time

Time interval

To load data for a specific time interval, use a tuple in the form (start, end) as the time_or_interval parameter. Both start and end must be TimeScalars, which can be datetime objects or strings in ISO 8601 format.

interval = ("2017-01-01", "2023-01-01")
data = collection.load(interval, show_progress=True)
Output
<xarray.Dataset> Size: 725MB
Dimensions:                (time: 1109597, latlon: 2)
Coordinates:
    ingestion_time         (time) datetime64[ns] 9MB 2024-06-21T11:03:33.8524...
    id                     (time) <U36 160MB '01595763-bae7-a646-99a5-d7f40d7...
  * time                   (time) datetime64[ns] 9MB 2017-01-01T00:17:50.8230...
  * latlon                 (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/30)
    granule_name           (time) object 9MB 'S1A_IW_RAW__0SSV_20170101T00175...
    processing_level       (time) <U2 9MB 'L0' 'L0' 'L0' 'L0' ... 'L0' 'L0' 'L0'
    satellite              (time) object 9MB 'SENTINEL-1' ... 'SENTINEL-1'
    ...                     ...

The show_progress parameter is optional and can be used to display a tqdm progress bar while loading data.

A time interval specified as a tuple is interpreted as a half-closed interval. This means the start time is inclusive, and the end time is exclusive. For instance, using an end time of 2023-01-01 includes data points up to 2022-12-31 23:59:59.999, but excludes those from 2023-01-01 00:00:00.000. This behavior mimics the Python range function and is useful for chaining time intervals.

import xarray as xr

data = []
for year in [2017, 2018, 2019, 2020, 2021, 2022]:
    interval = (f"{year}-01-01", f"{year + 1}-01-01")
    data.append(collection.load(interval, show_progress=True))

# Concatenate the data into a single dataset, which is equivalent
# to the result of the single request in the code example above.
data = xr.concat(data, dim="time")

The example above demonstrates how to split a large time interval into smaller chunks and load each chunk in a separate request. Typically, this is not necessary, as the datasets client auto-paginates large intervals.
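The half-open convention can be checked without any API call. The following is a minimal sketch using only the standard library; yearly_intervals and contains are hypothetical helper names, not part of the Tilebox client. It builds the same yearly chunks as the loop above and confirms that a timestamp lying exactly on a chunk boundary falls into exactly one chunk.

```python
from datetime import datetime


def yearly_intervals(first_year: int, last_year: int) -> list[tuple[datetime, datetime]]:
    """Build consecutive (start, end) tuples, one per calendar year."""
    return [
        (datetime(year, 1, 1), datetime(year + 1, 1, 1))
        for year in range(first_year, last_year + 1)
    ]


def contains(interval: tuple[datetime, datetime], t: datetime) -> bool:
    """Half-open membership test: start is inclusive, end is exclusive."""
    start, end = interval
    return start <= t < end


intervals = yearly_intervals(2017, 2022)

# A timestamp exactly on a chunk boundary belongs only to the later chunk,
# so chained chunks neither overlap nor leave gaps.
boundary = datetime(2018, 1, 1)
matches = [iv for iv in intervals if contains(iv, boundary)]
print(len(matches))  # 1
```

Because each chunk ends where the next one starts, concatenating the per-chunk results should not produce duplicate data points at the boundaries.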

Endpoint inclusivity

For greater control over inclusivity of start and end times, you can use the TimeInterval dataclass instead of a tuple of two TimeScalars. This class allows you to specify the start and end times, as well as their inclusivity. Here’s an example of creating equivalent TimeInterval objects in two different ways.

from datetime import datetime
from tilebox.datasets.data import TimeInterval

interval1 = TimeInterval(
    datetime(2017, 1, 1), datetime(2023, 1, 1),
    end_inclusive=False
)
interval2 = TimeInterval(
    # Python datetime resolution is one microsecond, hence 999999 below
    datetime(2017, 1, 1), datetime(2022, 12, 31, 23, 59, 59, 999999),
    end_inclusive=True
)

print("Inclusivity is indicated by interval notation: ( and [")
print(interval1)
print(interval2)
print(f"They are equivalent: {interval1 == interval2}")
print(interval2.to_half_open())

# Query data for a time interval
data = collection.load(interval1, show_progress=True)
Output
Inclusivity is indicated by interval notation: ( and [
[2017-01-01T00:00:00.000 UTC, 2023-01-01T00:00:00.000 UTC)
[2017-01-01T00:00:00.000 UTC, 2022-12-31T23:59:59.999 UTC]
They are equivalent: True
[2017-01-01T00:00:00.000 UTC, 2023-01-01T00:00:00.000 UTC)
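One way to reason about the equivalence shown in the output: an inclusive end can be rewritten as an exclusive end by advancing it by the smallest representable step. The sketch below does this with plain datetime objects, assuming microsecond resolution (the resolution of Python's datetime). The helper to_half_open_end is hypothetical and for illustration only; it does not claim to show how TimeInterval is implemented.

```python
from datetime import datetime, timedelta


def to_half_open_end(end: datetime, end_inclusive: bool) -> datetime:
    """Return the exclusive end that covers the same instants.

    Assumes microsecond resolution, the smallest step a datetime can represent.
    """
    if end_inclusive:
        return end + timedelta(microseconds=1)
    return end


exclusive_end = to_half_open_end(datetime(2023, 1, 1), end_inclusive=False)
inclusive_end = to_half_open_end(
    datetime(2022, 12, 31, 23, 59, 59, 999999), end_inclusive=True
)
print(exclusive_end == inclusive_end)  # True
```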

Time scalars

You can load all data points linked to a specific timestamp by specifying a TimeScalar as the time query argument. A TimeScalar can be a datetime object or a string in ISO 8601 format. When passed to the load method, it retrieves all data points matching exactly that time, with millisecond precision.

A collection may contain multiple data points for the same millisecond, so multiple data points could still be returned. If you want to fetch only a single data point, query the collection by ID instead.

Here’s how to load a data point at a specific millisecond from a collection.

data = collection.load("2024-08-01 00:00:01.362")
print(data)
Output
<xarray.Dataset> Size: 721B
Dimensions:                (time: 1, latlon: 2)
Coordinates:
    ingestion_time         (time) datetime64[ns] 8B 2024-08-01T08:53:08.450499
    id                     (time) <U36 144B '01910b3c-8552-7671-3345-b902cc08...
  * time                   (time) datetime64[ns] 8B 2024-08-01T00:00:01.362000
  * latlon                 (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/30)
    granule_name           (time) object 8B 'S1A_IW_RAW__0SDV_20240801T000001...
    processing_level       (time) <U2 8B 'L0'
    satellite              (time) object 8B 'SENTINEL-1'
    ...                     ...

Tilebox uses millisecond precision for timestamps. To load all data points within a specific second, use a time interval request instead, as described in the Time interval section.
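A query for one full second can be expressed as a half-open (start, end) tuple. The following sketch builds such a tuple with the standard library; one_second_interval is a hypothetical helper name, and the call to load is left as a comment so that the snippet runs without a client.

```python
from datetime import datetime, timedelta


def one_second_interval(timestamp: str) -> tuple[str, str]:
    """Return a (start, end) tuple covering the full second containing timestamp."""
    start = datetime.fromisoformat(timestamp).replace(microsecond=0)
    end = start + timedelta(seconds=1)
    return start.isoformat(sep=" "), end.isoformat(sep=" ")


interval = one_second_interval("2024-08-01 00:00:01.362")
print(interval)
# ('2024-08-01 00:00:01', '2024-08-01 00:00:02')

# The tuple can then be passed to load as a half-open interval, for example:
# data = collection.load(interval)
```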

The output of the load method is an xarray.Dataset object. To learn more about Xarray, visit the dedicated Xarray page.

Time iterables

You can specify a time interval by using an iterable of TimeScalars as the time_or_interval parameter. This is especially useful when you want to use the output of a previous load call as input for another load. Here’s how that works.

interval = ("2017-01-01", "2023-01-01")
meta_data = collection.load(interval, skip_data=True)

first_50_data_points = collection.load(meta_data.time[:50], skip_data=False)
print(first_50_data_points)
Output
<xarray.Dataset> Size: 33kB
Dimensions:                (time: 50, latlon: 2)
Coordinates:
    ingestion_time         (time) datetime64[ns] 400B 2024-06-21T11:03:33.852...
    id                     (time) <U36 7kB '01595763-bae7-a646-99a5-d7f40d7e6...
  * time                   (time) datetime64[ns] 400B 2017-01-01T00:17:50.823...
  * latlon                 (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/30)
    granule_name           (time) object 400B 'S1A_IW_RAW__0SSV_20170101T0017...
    processing_level       (time) <U2 400B 'L0' 'L0' 'L0' ... 'L0' 'L0' 'L0'
    satellite              (time) object 400B 'SENTINEL-1' ... 'SENTINEL-1'
    ...                     ...

This feature works by constructing a TimeInterval object from the first and last elements of the iterable, making both the start and end time inclusive.
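To illustrate why both ends need to be inclusive here, the sketch below reproduces the described behavior with plain datetime objects. The helpers closed_bounds and in_closed_interval are hypothetical names for illustration and are not part of the Tilebox client.

```python
from datetime import datetime
from typing import Iterable


def closed_bounds(times: Iterable[datetime]) -> tuple[datetime, datetime]:
    """Return the first and last element of an iterable of timestamps."""
    items = list(times)
    if not items:
        raise ValueError("expected at least one timestamp")
    return items[0], items[-1]


def in_closed_interval(t: datetime, start: datetime, end: datetime) -> bool:
    """Closed membership test: both start and end are inclusive."""
    return start <= t <= end


times = [
    datetime(2017, 1, 1, 0, 17, 50),
    datetime(2017, 1, 1, 0, 18, 15),
    datetime(2017, 1, 1, 0, 18, 40),
]
start, end = closed_bounds(times)

# With an inclusive end, the last timestamp of the iterable is still matched.
print(all(in_closed_interval(t, start, end) for t in times))  # True
# With an exclusive end, it would be dropped.
print(start <= times[-1] < end)  # False
```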

Timezones

All TimeScalars specified as a string are treated as UTC if they do not include a timezone suffix. If you want to query data for a specific time or time range in another timezone, it’s recommended to use a datetime object. In this case, the Tilebox API will convert the datetime to UTC before making the request. The output will always contain UTC timestamps, which will need to be converted again if a different timezone is required.

from datetime import datetime
import pytz

# Tokyo has a UTC+9 hours offset, so this is the same as
# 2017-01-01 02:45:25.679 UTC
tokyo_time = pytz.timezone('Asia/Tokyo').localize(
    datetime(2017, 1, 1, 11, 45, 25, 679000)
)
print(tokyo_time)
data = collection.load(tokyo_time)
print(data)  # time is in UTC since API always returns UTC timestamps
Output
2017-01-01 11:45:25.679000+09:00
<xarray.Dataset> Size: 725B
Dimensions:                (time: 1, latlon: 2)
Coordinates:
    ingestion_time         (time) datetime64[ns] 8B 2024-06-21T11:03:33.852435
    id                     (time) <U36 144B '015957ea-d82f-e454-34ab-a87603ee...
  * time                   (time) datetime64[ns] 8B 2017-01-01T02:45:25.679000
  * latlon                 (latlon) <U9 72B 'latitude' 'longitude'
Data variables:
    ...
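The conversion to UTC, and the conversion of a returned UTC timestamp back to local time, can be reproduced with the standard library alone. This sketch uses a fixed UTC+9 offset instead of pytz, which is accurate for Japan because it does not observe daylight saving time; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Japan Standard Time is a fixed UTC+9 offset without daylight saving time.
JST = timezone(timedelta(hours=9), name="JST")

tokyo_time = datetime(2017, 1, 1, 11, 45, 25, 679000, tzinfo=JST)
utc_time = tokyo_time.astimezone(timezone.utc)
print(utc_time.isoformat())  # 2017-01-01T02:45:25.679000+00:00

# Converting a UTC timestamp, such as one from a response, back to local time:
local_again = utc_time.astimezone(JST)
print(local_again.isoformat())  # 2017-01-01T11:45:25.679000+09:00
```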

Filtering by area of interest

Spatio-temporal datasets also come with spatial filtering capabilities. When querying, you can specify not only a time interval but also a bounding box or a polygon as an area of interest to filter by.

Spatio-temporal datasets - including spatial filtering capabilities - are currently in development and not available yet. Stay tuned for updates!

Fetching only metadata

Sometimes, it may be useful to load only dataset metadata fields without the actual data fields. This can be done by setting the skip_data parameter to True. For example, when only checking if a datapoint exists, you may want to use skip_data=True to avoid loading the data fields. If this flag is set, the response will only include the required fields for the given dataset type, but no additional custom data fields.

data = collection.load("2024-08-01 00:00:01.362", skip_data=True)
print(data)
Output
<xarray.Dataset> Size: 160B
Dimensions:         (time: 1)
Coordinates:
    ingestion_time  (time) datetime64[ns] 8B 2024-08-01T08:53:08.450499
    id              (time) <U36 144B '01910b3c-8552-7671-3345-b902cc0813f3'
  * time            (time) datetime64[ns] 8B 2024-08-01T00:00:01.362000
Data variables:
    *empty*

Empty response

The load method always returns an xarray.Dataset object, even if there are no data points for the specified query. In such cases, the returned dataset will be empty, but no error will be raised.

time_with_no_data_points = "1997-02-06 10:21:00"
data = collection.load(time_with_no_data_points)
print(data)
Output
<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
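Since no error is raised, calling code may want to check for an empty result explicitly. A minimal sketch, relying only on the sizes mapping (dimension name to length) that xarray.Dataset provides: is_empty is a hypothetical helper, and the objects below are stand-ins that mimic the sizes of the datasets shown on this page so that the snippet runs without network access.

```python
from types import SimpleNamespace


def is_empty(dataset) -> bool:
    """Return True if the dataset holds no data points along the time dimension.

    Relies only on the `sizes` mapping (dimension name -> length) that
    xarray.Dataset provides.
    """
    return dataset.sizes.get("time", 0) == 0


# Stand-ins mimicking the `sizes` attribute of real responses.
empty_response = SimpleNamespace(sizes={})
single_point = SimpleNamespace(sizes={"time": 1, "latlon": 2})

print(is_empty(empty_response))  # True
print(is_empty(single_point))    # False
```

With a real response, the same helper would be called as is_empty(data) on the return value of load.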

By datapoint ID

If you know the ID of the data point you want to load, you can use collection.find.

This method always returns a single data point or raises an exception if no data point with the specified ID exists.

datapoint_id = "01916d89-ba23-64c9-e383-3152644bcbde"
datapoint = collection.find(datapoint_id)
print(datapoint)
Output
<xarray.Dataset> Size: 725B
Dimensions:                (latlon: 2)
Coordinates:
    ingestion_time         datetime64[ns] 8B 2024-08-20T05:53:08.600528
    id                     <U36 144B '01916d89-ba23-64c9-e383-3152644bcbde'
    time                   datetime64[ns] 8B 2024-08-20T02:07:08.323000
  * latlon                 (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/30)
    granule_name           object 8B 'S1A_IW_RAW__0SDV_20240820T020708_202408...
    processing_level       <U2 8B 'L0'
    satellite              object 8B 'SENTINEL-1'
    ...                     ...

Since find returns only a single data point, the output dataset does not include a time dimension.

You can also set the skip_data parameter when calling find to load only the metadata of the data point, just as with load.

Possible errors

  • NotFoundError: raised if no data point with the given ID is found in the collection
  • ValueError: raised if the specified datapoint_id is not a valid UUID
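To avoid the ValueError case, an ID can be validated locally before calling find. The following sketch uses the standard library uuid module; is_valid_uuid is a hypothetical helper, and it is an assumption that the client's own validation is comparable. Note that uuid.UUID also accepts some alternative spellings, such as IDs without hyphens.

```python
import uuid


def is_valid_uuid(value: str) -> bool:
    """Return True if value can be parsed as a UUID by the standard library."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


print(is_valid_uuid("01916d89-ba23-64c9-e383-3152644bcbde"))  # True
print(is_valid_uuid("not-a-uuid"))                            # False
```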