Xarray is a library for working with labelled multi-dimensional arrays.
Xarray is built on top of NumPy and Pandas. Xarray introduces labels in
the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive,
more concise, and less error-prone developer experience. The package includes a large and growing library of
domain-agnostic functions for advanced analytics and visualization with these data structures.
A good overview of the Xarray library and why it’s a perfect fit for N-dimensional data (such as Tilebox time series
datasets) can be found in the official Why Xarray? documentation
page.
The Tilebox Python client provides access to your satellite data in the form of a
xarray.Dataset. This brings a great
number of benefits compared to custom Tilebox specific data structures such as:
An example dataset
To get an understanding of how Xarray works, a sample dataset is used, as it could be returned by
Tilebox datasets.
<xarray.Dataset> Size: 305kB
Dimensions: (time: 514, latlon: 2)
Coordinates:
ingestion_time (time) datetime64[ns] 4kB 2024-07-22T09:06:43.5586...
id (time) <U36 74kB '01807eaa-86f8-2a72-1a03-794e7a55...
* time (time) datetime64[ns] 4kB 2022-05-01T08:09:06.5520...
* latlon (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/28)
granule_name (time) object 4kB 'LC08_L1GT_175018_20220501_20220...
processing_level (time) <U2 4kB 'L1' 'L1' 'L1' 'L1' ... 'L1' 'L1' 'L1'
satellite (time) object 4kB 'LANDSAT-8' ... 'LANDSAT-8'
product_type (time) object 4kB 'L1GT' 'L1GT' ... 'L1GT' 'L1GT'
copernicus_id (time) <U36 74kB '2181f9f6-1ef0-510c-b715-0f299320...
online (time) bool 514B True True True ... True True True
... ...
resolution (time) int64 4kB 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0
bands (time) int64 4kB 12 12 12 12 12 12 ... 12 12 12 12 12
path (time) int64 4kB 175 175 175 175 ... 209 209 209 209
row (time) int64 4kB 18 30 31 32 27 32 ... 18 21 22 23 24
sun_azimuth (time) float64 4kB 164.0 144.9 143.0 ... 153.5 151.6
sun_elevation (time) float64 4kB 44.2 57.78 58.76 ... 56.72 57.82
Here is a breakdown of the preceding output:
satellite_data
dataset contains different dimensions, coordinates and variables
time
dimension consists of 514 elements. This means there are 514 data points in the dataset
time
dimension coordinate contains datetime values. This is the time when the data was measured.
The *
mark shows that it’s a dimension coordinate. Dimension coordinates are used for label based
indexing and alignment, it means you can use the time to access individual data points in the dataset
ingestion_time
non-dimension coordinate contains datetime values. This is the time when the data was
ingested into the Tilebox database. Non-dimension coordinates are variables that contain coordinate data, but are not
used for label based indexing and alignment. They can even be multidimensional
- The dataset contains 28 variables
bands
variable contains integers, this variable tells you how many bands the data contains
sun_elevation
variable contains floating point values, this variable contains the sun elevation when the data was measured
The examples below showcase some of the most common use-cases for Xarray. Since the data is already loaded into memory,
no more API requests are required, there is no difference between the sync and async Client in the examples below.
Accessing data in a dataset
Accessing by index
There a couple of different ways that you can access data in a dataset. The Xarray documentation provides a
great overview of all those methods.
You can access the sun_elevation
variable:
print(satellite_data.sun_elevation[0])
<xarray.DataArray 'sun_elevation' ()> Size: 8B
array(44.19904463)
Coordinates:
ingestion_time datetime64[ns] 8B 2024-07-22T09:06:43.558629
id <U36 144B '01807eaa-86f8-2a72-1a03-794e7a556271'
time datetime64[ns] 8B 2022-05-01T08:09:06.552000
You can see in the preceding output that the first sun elevation value is 44.19904463
, but the output is not just a plain
float
containing that value. Instead it’s an xarray.DataArray
object. This is because that way you can still
access the coordinates belonging to that value. To get the plain python object you can use the item()
method:
sun_elevation = satellite_data.sun_elevation[0].item()
print(sun_elevation)
You can access coordinates in a similar manner. For datetime fields Xarray additionally offers a special dt
(datetime)
accessor, which you can use to format the time as a string:
Accessing and formatting datetime fields
time_format = "%Y-%m-%d %H:%M:%S"
time = satellite_data.time[0].dt.strftime(time_format).item()
ingestion_time = satellite_data.ingestion_time[0].dt.strftime(time_format).item()
print(f"Measurement 0 was taken at {time} and ingested at {ingestion_time}")
Measurement 0 was taken at 2022-05-01 08:09:06 and ingested at 2024-07-22 09:06:43
Similarly you can also retrieve a whole dataset containing all variables and coordinates for a single data point in the
example dataset. For this Xarray offers the isel
method (it stands for index selection):
Accessing a whole datapoint by index
datapoint = satellite_data.isel(time=0)
print(datapoint)
<xarray.Dataset> Size: 665B
Dimensions: (latlon: 2)
Coordinates:
ingestion_time datetime64[ns] 8B 2024-07-22T09:06:43.558629
id <U36 144B '01807eaa-86f8-2a72-1a03-794e7a556271'
time datetime64[ns] 8B 2022-05-01T08:09:06.552000
* latlon (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/28)
granule_name object 8B 'LC08_L1GT_175018_20220501_20220504_02_T2'
processing_level <U2 8B 'L1'
satellite object 8B 'LANDSAT-8'
product_type object 8B 'L1GT'
copernicus_id <U36 144B '2181f9f6-1ef0-510c-b715-0f2993208b11'
online bool 1B True
... ...
resolution int64 8B 0
bands int64 8B 12
path int64 8B 175
row int64 8B 18
sun_azimuth float64 8B 164.0
sun_elevation float64 8B 44.2
Subsets of data
You can also access a subsets of the data.
Here are a couple of ways you can retrieve the first 3 and last 3 sun elevations.
first_3_sun_elevations = satellite_data.sun_elevation[0:3]
print("First 3 sun_elevations", first_3_sun_elevations.values)
last_3_sun_elevations = satellite_data.sun_elevation[-3:]
print("Last 3 sun_elevations", last_3_sun_elevations.values)
first_3 = satellite_data.isel(time=slice(0, 3))
last_3 = satellite_data.isel(time=slice(-3, None))
print("Sub dataset of the last 3 datapoints")
print(last_3)
First 3 sun_elevations [44.19904463 57.77561083 58.76316786]
Last 3 sun_elevations [55.60690523 56.72453179 57.81917624]
Sub dataset of the last 3 datapoints
<xarray.Dataset> Size: 2kB
Dimensions: (time: 3, latlon: 2)
Coordinates:
ingestion_time (time) datetime64[ns] 24B 2024-07-22T09:08:24.7395...
id (time) <U36 432B '018119eb-5291-edbc-381e-ce71e885...
* time (time) datetime64[ns] 24B 2022-05-31T11:41:01.4570...
* latlon (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/28)
granule_name (time) object 24B 'LC08_L1GT_209022_20220531_20220...
processing_level (time) <U2 24B 'L1' 'L1' 'L1'
satellite (time) object 24B 'LANDSAT-8' 'LANDSAT-8' 'LANDSAT-8'
product_type (time) object 24B 'L1GT' 'L1GT' 'L1GT'
copernicus_id (time) <U36 432B '0ed5c8aa-9ecb-5559-bb9a-773965ce...
online (time) bool 3B True True True
... ...
resolution (time) int64 24B 0 0 0
bands (time) int64 24B 12 12 12
path (time) int64 24B 209 209 209
row (time) int64 24B 22 23 24
sun_azimuth (time) float64 24B 155.4 153.5 151.6
sun_elevation (time) float64 24B 55.61 56.72 57.82
Filtering data
Xarray also offers a convenient way of filtering a dataset based on a condition.
For example, you can filter the dataset to only look at sun elevation values taken by cloud cover 0
.
sun_elevations_without_cloud = satellite_data.sun_elevation[satellite_data.cloud_cover == 0]
print(sun_elevations_without_cloud)
<xarray.DataArray 'sun_elevation' (time: 27)> Size: 216B
array([63.89629314, 63.35038654, 64.10330149, 64.11904038, 64.32007459,
65.00696561, 60.81739662, 65.72788105, 65.90881403, 65.90881403,
66.51835574, 66.51835574, 61.24068875, 66.34420723, 66.34420723,
65.07319907, 65.07319907, 67.19808628, 67.19808628, 67.69088228,
61.54950615, 67.76723723, 67.76723723, 68.23219829, 68.23219829,
64.37400345, 64.37400345])
Coordinates:
ingestion_time (time) datetime64[ns] 216B 2024-07-22T09:06:43.558629 ......
id (time) <U36 4kB '01807f66-8411-2e5d-719b-ce51152175eb' .....
* time (time) datetime64[ns] 216B 2022-05-01T11:34:26.577000 ......
You can combine conditions, for example to filter for sun elevation values between 45
and 90
taken by cloud cover 0
:
Filtering data by cloud_cover and sun_elevation value
data_filter = (
(satellite_data.cloud_cover == 0) &
(satellite_data.sun_elevation > 45) &
(satellite_data.sun_elevation < 90)
)
filtered_sun_elevations = satellite_data.sun_elevation[data_filter]
print(filtered_sun_elevations)
<xarray.DataArray 'sun_elevation' (time: 27)> Size: 216B
array([63.89629314, 63.35038654, 64.10330149, 64.11904038, 64.32007459,
65.00696561, 60.81739662, 65.72788105, 65.90881403, 65.90881403,
66.51835574, 66.51835574, 61.24068875, 66.34420723, 66.34420723,
65.07319907, 65.07319907, 67.19808628, 67.19808628, 67.69088228,
61.54950615, 67.76723723, 67.76723723, 68.23219829, 68.23219829,
64.37400345, 64.37400345])
Coordinates:
ingestion_time (time) datetime64[ns] 216B 2024-07-22T09:06:43.558629 ......
id (time) <U36 4kB '01807f66-8411-2e5d-719b-ce51152175eb' .....
* time (time) datetime64[ns] 216B 2022-05-01T11:34:26.577000 ......
Selecting data by value
You can use the values of a primary coordinate to index your dataset.
For example, you can access the data point taken at 2022-05-01T11:28:28.249000
:
specific_datapoint = satellite_data.sel(time="2022-05-01T11:28:28.249000")
print(specific_datapoint)
<xarray.Dataset> Size: 665B
Dimensions: (latlon: 2)
Coordinates:
ingestion_time datetime64[ns] 8B 2024-07-22T09:06:43.558629
id <U36 144B '01807f61-0c59-99ed-8e33-c5d8ed6e7879'
time datetime64[ns] 8B 2022-05-01T11:28:28.249000
* latlon (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/28)
granule_name object 8B 'LC08_L1GT_207022_20220501_20220501_02_T2'
processing_level <U2 8B 'L1'
satellite object 8B 'LANDSAT-8'
product_type object 8B 'L1GT'
copernicus_id <U36 144B 'c263b8a9-2b4d-5b76-ac46-eee32338ff37'
online bool 1B True
... ...
resolution int64 8B 0
bands int64 8B 12
path int64 8B 207
row int64 8B 22
sun_azimuth float64 8B 158.2
sun_elevation float64 8B 49.03
When trying to access a value that is not in the dataset, a KeyError
is raised.
Indexing by time (not found)
nearest_datapoint = satellite_data.sel(time="2022-05-01T11:28:28.000000")
>>> raises KeyError: "2022-05-01T11:28:28.000000"
The method
parameter can be used to return the closest value instead of raising an error.
Finding the closest data point
nearest_datapoint = satellite_data.sel(time="2022-05-01T11:28:28.000000", method="nearest")
assert nearest_datapoint.equals(specific_datapoint)
Indexing requires the coordinate values to be unique. If there are duplicated values, Xarray raises an error, because
it can not determine which value to return. An easy way to avoid this is to
drop_duplicates before indexing.
satellite_data = satellite_data.drop_duplicates("time")
Statistics
Xarray and NumPy offer a wide range of statistical functions that can be applied to a dataset or a DataArray. Here are
a few examples:
Computing dataset statistics
cloud_cover = satellite_data.cloud_cover
min_meas = cloud_cover.min().item()
max_meas = cloud_cover.max().item()
mean_meas = cloud_cover.mean().item()
std_meas = cloud_cover.std().item()
print(f"Cloud cover from {min_meas:.2f} to {max_meas:.2f} with mean {mean_meas:.2f} and a std of {std_meas:.2f}")
Cloud cover from 0.00 to 100.00 with mean 76.48 and a std of 34.17
You can also use many NumPy functions directly on a dataset or DataArray. For example, to find out how many bands
the data contains, you can use np.unique to
get all the unique values in the bands
data array.
import numpy as np
print("Sensors:", np.unique(satellite_data.bands))
Reading and writing files
Xarray also offers a convenient way to save and load datasets to and from files. This is especially useful if you want
to share your data with others or if you want to persist your data for later use. Xarray supports a wide range of file
formats, including NetCDF, Zarr, GRIB, and many more. For a full list of supported formats, please refer to the
official documentation page.
You might need to install the netcdf4
package first. You can do this by running pip install netcdf4
.
Here is how you can save the example dataset to a NetCDF file:
Saving a dataset to a file
satellite_data.to_netcdf("example_satellite_data.nc")
It creates a file called example_satellite_data.nc
in the current directory. You can now load this file back
into memory:
Loading a dataset from a file
import xarray as xr
satellite_data = xr.open_dataset("example_satellite_data.nc")
In case you want to follow along with the examples in this section, you can download the example dataset as a NetCDF
file here.
Further reading
This section only covered a few of the most common use-cases for Xarray. Xarray offers many more functions and features.
For more information, please refer to the Xarray documentation
or check out the Xarray Tutorials.
Some useful capability that this section did not cover include: