Xarray is a library for working with labelled multi-dimensional arrays. Xarray is built on top of NumPy and Pandas. Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.

A good overview of the Xarray library and why it’s a perfect fit for N-dimensional data (such as Tilebox time series datasets) can be found in the official Why Xarray? documentation page.

The Tilebox Python client provides access to your satellite data in the form of a xarray.Dataset. This brings a great number of benefits compared to custom Tilebox specific data structures such as:

An example dataset

To get an understanding of how Xarray works, a sample dataset is used, as it could be returned by Tilebox datasets.

Output
<xarray.Dataset> Size: 305kB
Dimensions:                (time: 514, latlon: 2)
Coordinates:
    ingestion_time         (time) datetime64[ns] 4kB 2024-07-22T09:06:43.5586...
    id                     (time) <U36 74kB '01807eaa-86f8-2a72-1a03-794e7a55...
  * time                   (time) datetime64[ns] 4kB 2022-05-01T08:09:06.5520...
  * latlon                 (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/28)
    granule_name           (time) object 4kB 'LC08_L1GT_175018_20220501_20220...
    processing_level       (time) <U2 4kB 'L1' 'L1' 'L1' 'L1' ... 'L1' 'L1' 'L1'
    satellite              (time) object 4kB 'LANDSAT-8' ... 'LANDSAT-8'
    product_type           (time) object 4kB 'L1GT' 'L1GT' ... 'L1GT' 'L1GT'
    copernicus_id          (time) <U36 74kB '2181f9f6-1ef0-510c-b715-0f299320...
    online                 (time) bool 514B True True True ... True True True
    ...                     ...
    resolution             (time) int64 4kB 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
    bands                  (time) int64 4kB 12 12 12 12 12 12 ... 12 12 12 12 12
    path                   (time) int64 4kB 175 175 175 175 ... 209 209 209 209
    row                    (time) int64 4kB 18 30 31 32 27 32 ... 18 21 22 23 24
    sun_azimuth            (time) float64 4kB 164.0 144.9 143.0 ... 153.5 151.6
    sun_elevation          (time) float64 4kB 44.2 57.78 58.76 ... 56.72 57.82

This is a simple dataset that was generated to showcase some common Xarray use-cases. If you want to follow along, you can download the dataset as a NetCDF file. The Reading and writing files section explains how to save and load Xarray datasets to and from NetCDF files.

Here is a breakdown of the preceding output:

  • satellite_data dataset contains different dimensions, coordinates and variables
  • time dimension consists of 514 elements. This means there are 514 data points in the dataset
  • time dimension coordinate contains datetime values. This is the time when the data was measured. The * mark shows that it’s a dimension coordinate. Dimension coordinates are used for label based indexing and alignment, it means you can use the time to access individual data points in the dataset
  • ingestion_time non-dimension coordinate contains datetime values. This is the time when the data was ingested into the Tilebox database. Non-dimension coordinates are variables that contain coordinate data, but are not used for label based indexing and alignment. They can even be multidimensional
  • The dataset contains 28 variables
  • bands variable contains integers, this variable tells you how many bands the data contains
  • sun_elevation variable contains floating point values, this variable contains the sun elevation when the data was measured

Check out the xarray terminology overview to deepen your understanding of datasets, dimensions, coordinates, and variables.

The examples below showcase some of the most common use-cases for Xarray. Since the data is already loaded into memory, no more API requests are required, there is no difference between the sync and async Client in the examples below.

Accessing data in a dataset

Accessing by index

There a couple of different ways that you can access data in a dataset. The Xarray documentation provides a great overview of all those methods.

You can access the sun_elevation variable:

Accessing values
# Let's print the first sun elevation value
print(satellite_data.sun_elevation[0])
Output
<xarray.DataArray 'sun_elevation' ()> Size: 8B
array(44.19904463)
Coordinates:
    ingestion_time   datetime64[ns] 8B 2024-07-22T09:06:43.558629
    id               <U36 144B '01807eaa-86f8-2a72-1a03-794e7a556271'
    time             datetime64[ns] 8B 2022-05-01T08:09:06.552000

You can see in the preceding output that the first sun elevation value is 44.19904463, but the output is not just a plain float containing that value. Instead it’s an xarray.DataArray object. This is because that way you can still access the coordinates belonging to that value. To get the plain python object you can use the item() method:

Accessing raw values
sun_elevation = satellite_data.sun_elevation[0].item()
print(sun_elevation)
Output
44.19904463

You can access coordinates in a similar manner. For datetime fields Xarray additionally offers a special dt (datetime) accessor, which you can use to format the time as a string:

Accessing and formatting datetime fields
time_format = "%Y-%m-%d %H:%M:%S"
time = satellite_data.time[0].dt.strftime(time_format).item()
ingestion_time = satellite_data.ingestion_time[0].dt.strftime(time_format).item()
print(f"Measurement 0 was taken at {time} and ingested at {ingestion_time}")
Output
Measurement 0 was taken at 2022-05-01 08:09:06 and ingested at 2024-07-22 09:06:43

Similarly you can also retrieve a whole dataset containing all variables and coordinates for a single data point in the example dataset. For this Xarray offers the isel method (it stands for index selection):

Accessing a whole datapoint by index
datapoint = satellite_data.isel(time=0)
print(datapoint)
Output
<xarray.Dataset> Size: 665B
Dimensions:                (latlon: 2)
Coordinates:
    ingestion_time         datetime64[ns] 8B 2024-07-22T09:06:43.558629
    id                     <U36 144B '01807eaa-86f8-2a72-1a03-794e7a556271'
    time                   datetime64[ns] 8B 2022-05-01T08:09:06.552000
  * latlon                 (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/28)
    granule_name           object 8B 'LC08_L1GT_175018_20220501_20220504_02_T2'
    processing_level       <U2 8B 'L1'
    satellite              object 8B 'LANDSAT-8'
    product_type           object 8B 'L1GT'
    copernicus_id          <U36 144B '2181f9f6-1ef0-510c-b715-0f2993208b11'
    online                 bool 1B True
    ...                     ...
    resolution             int64 8B 0
    bands                  int64 8B 12
    path                   int64 8B 175
    row                    int64 8B 18
    sun_azimuth            float64 8B 164.0
    sun_elevation          float64 8B 44.2

Subsets of data

You can also access a subsets of the data. Here are a couple of ways you can retrieve the first 3 and last 3 sun elevations.

Accessing raw values
# individual variables
first_3_sun_elevations = satellite_data.sun_elevation[0:3]
print("First 3 sun_elevations", first_3_sun_elevations.values)
last_3_sun_elevations = satellite_data.sun_elevation[-3:]
print("Last 3 sun_elevations", last_3_sun_elevations.values)

# whole sub datasets
first_3 = satellite_data.isel(time=slice(0, 3))
last_3 = satellite_data.isel(time=slice(-3, None))
print("Sub dataset of the last 3 datapoints")
print(last_3)
Output
First 3 sun_elevations [44.19904463 57.77561083 58.76316786]
Last 3 sun_elevations [55.60690523 56.72453179 57.81917624]
Sub dataset of the last 3 datapoints
<xarray.Dataset> Size: 2kB
Dimensions:                (time: 3, latlon: 2)
Coordinates:
    ingestion_time         (time) datetime64[ns] 24B 2024-07-22T09:08:24.7395...
    id                     (time) <U36 432B '018119eb-5291-edbc-381e-ce71e885...
  * time                   (time) datetime64[ns] 24B 2022-05-31T11:41:01.4570...
  * latlon                 (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/28)
    granule_name           (time) object 24B 'LC08_L1GT_209022_20220531_20220...
    processing_level       (time) <U2 24B 'L1' 'L1' 'L1'
    satellite              (time) object 24B 'LANDSAT-8' 'LANDSAT-8' 'LANDSAT-8'
    product_type           (time) object 24B 'L1GT' 'L1GT' 'L1GT'
    copernicus_id          (time) <U36 432B '0ed5c8aa-9ecb-5559-bb9a-773965ce...
    online                 (time) bool 3B True True True
    ...                     ...
    resolution             (time) int64 24B 0 0 0
    bands                  (time) int64 24B 12 12 12
    path                   (time) int64 24B 209 209 209
    row                    (time) int64 24B 22 23 24
    sun_azimuth            (time) float64 24B 155.4 153.5 151.6
    sun_elevation          (time) float64 24B 55.61 56.72 57.82

Filtering data

Xarray also offers a convenient way of filtering a dataset based on a condition. For example, you can filter the dataset to only look at sun elevation values taken by cloud cover 0.

Filtering data by sensor
sun_elevations_without_cloud = satellite_data.sun_elevation[satellite_data.cloud_cover == 0]
print(sun_elevations_without_cloud)
Output
<xarray.DataArray 'sun_elevation' (time: 27)> Size: 216B
array([63.89629314, 63.35038654, 64.10330149, 64.11904038, 64.32007459,
       65.00696561, 60.81739662, 65.72788105, 65.90881403, 65.90881403,
       66.51835574, 66.51835574, 61.24068875, 66.34420723, 66.34420723,
       65.07319907, 65.07319907, 67.19808628, 67.19808628, 67.69088228,
       61.54950615, 67.76723723, 67.76723723, 68.23219829, 68.23219829,
       64.37400345, 64.37400345])
Coordinates:
    ingestion_time  (time) datetime64[ns] 216B 2024-07-22T09:06:43.558629 ......
    id              (time) <U36 4kB '01807f66-8411-2e5d-719b-ce51152175eb' .....
  * time            (time) datetime64[ns] 216B 2022-05-01T11:34:26.577000 ......

You can combine conditions, for example to filter for sun elevation values between 45 and 90 taken by cloud cover 0:

Filtering data by cloud_cover and sun_elevation value
data_filter = (
    (satellite_data.cloud_cover == 0) &
    (satellite_data.sun_elevation > 45) &
    (satellite_data.sun_elevation < 90)
)
filtered_sun_elevations = satellite_data.sun_elevation[data_filter]
print(filtered_sun_elevations)
Output
<xarray.DataArray 'sun_elevation' (time: 27)> Size: 216B
array([63.89629314, 63.35038654, 64.10330149, 64.11904038, 64.32007459,
       65.00696561, 60.81739662, 65.72788105, 65.90881403, 65.90881403,
       66.51835574, 66.51835574, 61.24068875, 66.34420723, 66.34420723,
       65.07319907, 65.07319907, 67.19808628, 67.19808628, 67.69088228,
       61.54950615, 67.76723723, 67.76723723, 68.23219829, 68.23219829,
       64.37400345, 64.37400345])
Coordinates:
    ingestion_time  (time) datetime64[ns] 216B 2024-07-22T09:06:43.558629 ......
    id              (time) <U36 4kB '01807f66-8411-2e5d-719b-ce51152175eb' .....
  * time            (time) datetime64[ns] 216B 2022-05-01T11:34:26.577000 ......

Selecting data by value

You can use the values of a primary coordinate to index your dataset. For example, you can access the data point taken at 2022-05-01T11:28:28.249000:

Indexing by time
specific_datapoint = satellite_data.sel(time="2022-05-01T11:28:28.249000")
print(specific_datapoint)
Output
<xarray.Dataset> Size: 665B
Dimensions:                (latlon: 2)
Coordinates:
    ingestion_time         datetime64[ns] 8B 2024-07-22T09:06:43.558629
    id                     <U36 144B '01807f61-0c59-99ed-8e33-c5d8ed6e7879'
    time                   datetime64[ns] 8B 2022-05-01T11:28:28.249000
  * latlon                 (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/28)
    granule_name           object 8B 'LC08_L1GT_207022_20220501_20220501_02_T2'
    processing_level       <U2 8B 'L1'
    satellite              object 8B 'LANDSAT-8'
    product_type           object 8B 'L1GT'
    copernicus_id          <U36 144B 'c263b8a9-2b4d-5b76-ac46-eee32338ff37'
    online                 bool 1B True
    ...                     ...
    resolution             int64 8B 0
    bands                  int64 8B 12
    path                   int64 8B 207
    row                    int64 8B 22
    sun_azimuth            float64 8B 158.2
    sun_elevation          float64 8B 49.03

When trying to access a value that is not in the dataset, a KeyError is raised.

Indexing by time (not found)
nearest_datapoint = satellite_data.sel(time="2022-05-01T11:28:28.000000")
>>> raises KeyError: "2022-05-01T11:28:28.000000"

The method parameter can be used to return the closest value instead of raising an error.

Finding the closest data point
nearest_datapoint = satellite_data.sel(time="2022-05-01T11:28:28.000000", method="nearest")
assert nearest_datapoint.equals(specific_datapoint)  # passes

Indexing requires the coordinate values to be unique. If there are duplicated values, Xarray raises an error, because it can not determine which value to return. An easy way to avoid this is to drop_duplicates before indexing. satellite_data = satellite_data.drop_duplicates("time")

Statistics

Xarray and NumPy offer a wide range of statistical functions that can be applied to a dataset or a DataArray. Here are a few examples:

Computing dataset statistics
cloud_cover = satellite_data.cloud_cover
min_meas = cloud_cover.min().item()
max_meas = cloud_cover.max().item()
mean_meas = cloud_cover.mean().item()
std_meas = cloud_cover.std().item()
print(f"Cloud cover from {min_meas:.2f} to {max_meas:.2f} with mean {mean_meas:.2f} and a std of {std_meas:.2f}")
Output
Cloud cover from 0.00 to 100.00 with mean 76.48 and a std of 34.17

You can also use many NumPy functions directly on a dataset or DataArray. For example, to find out how many bands the data contains, you can use np.unique to get all the unique values in the bands data array.

Finding unique values
import numpy as np
print("Sensors:", np.unique(satellite_data.bands))
Output
Sensors: [12]

Reading and writing files

Xarray also offers a convenient way to save and load datasets to and from files. This is especially useful if you want to share your data with others or if you want to persist your data for later use. Xarray supports a wide range of file formats, including NetCDF, Zarr, GRIB, and many more. For a full list of supported formats, please refer to the official documentation page.

You might need to install the netcdf4 package first. You can do this by running pip install netcdf4.

Here is how you can save the example dataset to a NetCDF file:

Saving a dataset to a file
satellite_data.to_netcdf("example_satellite_data.nc")

It creates a file called example_satellite_data.nc in the current directory. You can now load this file back into memory:

Loading a dataset from a file
import xarray as xr

satellite_data = xr.open_dataset("example_satellite_data.nc")

In case you want to follow along with the examples in this section, you can download the example dataset as a NetCDF file here.

Further reading

This section only covered a few of the most common use-cases for Xarray. Xarray offers many more functions and features. For more information, please refer to the Xarray documentation or check out the Xarray Tutorials.

Some useful capability that this section did not cover include: