Overview of the Xarray library, common use cases, and implementation details.
Xarray is a library designed for working with labeled multi-dimensional arrays. Built on top of NumPy and Pandas, Xarray adds labels in the form of dimensions, coordinates, and attributes, enhancing the usability of raw NumPy-like arrays. This enables a more intuitive, concise, and less error-prone development experience. The library also includes a large and expanding collection of functions for advanced analytics and visualization.
An overview of the Xarray library and its suitability for N-dimensional data (such as Tilebox time series datasets) is available in the official Why Xarray? documentation page.
The Tilebox Python client provides access to satellite data as an xarray.Dataset. This approach offers several benefits over custom Tilebox-specific data structures:
Familiarity
Xarray is based on NumPy and Pandas—two of the most widely used Python libraries for scientific computing. Familiarity with these libraries translates well to using Xarray.
Performance
Leveraging NumPy, which is built on C and Fortran, Xarray benefits from extensive performance optimizations. This allows Xarray to efficiently handle large datasets.
Interoperability
As a widely used library, Xarray easily integrates with many other libraries. Many third-party libraries are also available to expand Xarray’s capabilities for diverse use cases.
Flexibility
Xarray is versatile and supports a broad range of applications. It’s also easy to extend with custom features.
The satellite_data dataset contains dimensions, coordinates, and variables.
The time dimension has 514 elements, indicating that there are 514 data points in the dataset.
The time dimension coordinate contains datetime values representing when the data was measured. The * indicates a dimension coordinate, which enables label-based indexing and alignment.
The ingestion_time non-dimension coordinate holds datetime values for when the data was ingested into Tilebox. Non-dimension coordinates carry coordinate data but are not used for label-based indexing. They can even be multidimensional.
The dataset includes 28 variables.
The bands variable contains integers indicating how many bands the data contains.
The sun_elevation variable contains floating-point values representing the sun’s elevation when the data was measured.
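To make the distinction between dimension coordinates, non-dimension coordinates, and variables concrete, here is a minimal sketch that builds a small synthetic dataset with the same kind of structure. It is not the real satellite_data dataset: the values, the three-element size, and the choice of only two variables are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for satellite_data; all values are made up for illustration.
time = pd.to_datetime(
    ["2022-05-01T08:09:06", "2022-05-01T11:28:28", "2022-05-02T09:30:00"]
)
ingested = pd.to_datetime(["2024-07-22T09:06:43"] * 3)

satellite_data = xr.Dataset(
    data_vars={
        "bands": ("time", np.array([12, 12, 12])),
        "sun_elevation": ("time", np.array([44.2, 57.8, 58.8])),
    },
    coords={
        # Dimension coordinate: shares its name with the dimension and is indexed.
        "time": time,
        # Non-dimension coordinate: lives along "time" but is not indexed.
        "ingestion_time": ("time", ingested),
    },
)

print(satellite_data)
print("Indexed coordinates:", list(satellite_data.indexes))
```

In the printed representation, only time is marked with a *, and only time appears in the list of indexed coordinates.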
Explore the xarray terminology overview to broaden your understanding of datasets, dimensions, coordinates, and variables.
You can access data in different ways. The Xarray documentation offers a comprehensive overview of these methods.
To access the sun_elevation variable:
Accessing values
# Print the first sun elevation value
print(satellite_data.sun_elevation[0])
Output
<xarray.DataArray 'sun_elevation' ()> Size: 8B
array(44.19904463)
Coordinates:
    ingestion_time  datetime64[ns] 8B 2024-07-22T09:06:43.558629
    id              <U36 144B '01807eaa-86f8-2a72-1a03-794e7a556271'
    time            datetime64[ns] 8B 2022-05-01T08:09:06.552000
In the output, the first sun elevation value is 44.19904463. It appears as an xarray.DataArray object to allow access to the corresponding coordinates. To retrieve the plain Python object, use the item() method:
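As a minimal sketch of the item() call described above, the following uses a standalone DataArray as a stand-in for satellite_data.sun_elevation. The array is constructed here only for illustration and carries no coordinates.

```python
import numpy as np
import xarray as xr

# Stand-in for satellite_data.sun_elevation (no coordinates attached).
sun_elevation = xr.DataArray(
    np.array([44.19904463, 57.77561083, 58.76316786]),
    dims="time",
    name="sun_elevation",
)

first = sun_elevation[0]  # still a 0-dimensional xarray.DataArray
value = first.item()      # plain Python float

print(type(first).__name__, type(value).__name__, value)
```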
You can access coordinates similarly. For datetime fields, Xarray provides a special dt (datetime) accessor for formatting time as a string:
Accessing and formatting datetime fields
time_format = "%Y-%m-%d %H:%M:%S"
time = satellite_data.time[0].dt.strftime(time_format).item()
ingestion_time = satellite_data.ingestion_time[0].dt.strftime(time_format).item()
print(f"Measurement 0 was taken at {time} and ingested at {ingestion_time}")
Output
Measurement 0 was taken at 2022-05-01 08:09:06 and ingested at 2024-07-22 09:06:43
You can also retrieve an entire dataset containing all variables and coordinates for a single data point using the isel method (index selection):
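A minimal sketch of such an isel call, assuming a small synthetic dataset in place of satellite_data (the variables and values below are illustrative only):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for satellite_data; values are illustrative.
ds = xr.Dataset(
    {
        "sun_elevation": ("time", np.array([44.19904463, 57.77561083, 58.76316786])),
        "cloud_cover": ("time", np.array([0.0, 76.5, 100.0])),
    },
    coords={"time": pd.date_range("2022-05-01", periods=3, freq="D")},
)

# Select the data point at integer position 0 along the time dimension.
first_point = ds.isel(time=0)
print(first_point)
```

The result is still an xarray.Dataset, but the time dimension is gone: every variable and coordinate is reduced to a single scalar value.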
You can access subsets of the data as well. Here are methods to retrieve the first three and last three sun elevations.
Accessing raw values
# Individual variables
first_3_sun_elevations = satellite_data.sun_elevation[0:3]
print("First 3 sun elevations", first_3_sun_elevations.values)
last_3_sun_elevations = satellite_data.sun_elevation[-3:]
print("Last 3 sun elevations", last_3_sun_elevations.values)

# Whole sub datasets
first_3 = satellite_data.isel(time=slice(0, 3))
last_3 = satellite_data.isel(time=slice(-3, None))
print("Sub dataset of the last 3 data points")
print(last_3)
Output
First 3 sun elevations [44.19904463 57.77561083 58.76316786]
Last 3 sun elevations [55.60690523 56.72453179 57.81917624]
Sub dataset of the last 3 data points
<xarray.Dataset> Size: 2kB
Dimensions:           (time: 3, latlon: 2)
Coordinates:
    ingestion_time    (time) datetime64[ns] 24B 2024-07-22T09:08:24.7395...
    id                (time) <U36 432B '018119eb-5291-edbc-381e-ce71e885...
  * time              (time) datetime64[ns] 24B 2022-05-31T11:41:01.4570...
  * latlon            (latlon) <U9 72B 'latitude' 'longitude'
Data variables: (12/28)
    granule_name      (time) object 24B 'LC08_L1GT_209022_20220531_20220...
    processing_level  (time) <U2 24B 'L1' 'L1' 'L1'
    satellite         (time) object 24B 'LANDSAT-8' 'LANDSAT-8' 'LANDSAT-8'
    ...                ...
Xarray allows convenient filtering of datasets based on conditions. For example, filter a dataset to only include sun elevation values where cloud cover is 0:
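One way to express such a filter is Dataset.where with drop=True, sketched here on a small synthetic dataset. The values are made up, and the real satellite_data dataset has many more variables.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for satellite_data; values are illustrative.
satellite_data = xr.Dataset(
    {
        "sun_elevation": ("time", np.array([44.2, 57.8, 58.8])),
        "cloud_cover": ("time", np.array([0.0, 76.5, 0.0])),
    },
    coords={"time": pd.date_range("2022-05-01", periods=3, freq="D")},
)

# Keep only the data points where cloud cover is exactly 0.
# drop=True removes the non-matching points instead of masking them with NaN.
cloud_free = satellite_data.where(satellite_data.cloud_cover == 0, drop=True)

print(f"Found {cloud_free.sizes['time']} cloud-free data points")
print(cloud_free.sun_elevation.values)
```

Without drop=True, where keeps the original shape and fills non-matching entries with NaN, which is useful when the positions along time need to stay aligned with another dataset.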
You can use dimension coordinate values to index your dataset. For instance, access the data point recorded at 2022-05-01T11:28:28.249000:
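Label-based selection uses the sel method. The sketch below assumes a synthetic dataset whose time coordinate happens to contain that timestamp. The other timestamps and the sun elevation values are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for satellite_data with three measurement times.
times = pd.to_datetime(
    [
        "2022-05-01T08:09:06.552",
        "2022-05-01T11:28:28.249",
        "2022-05-02T09:30:00.000",
    ]
)
ds = xr.Dataset(
    {"sun_elevation": ("time", np.array([44.19904463, 57.77561083, 58.76316786]))},
    coords={"time": times},
)

# Select by coordinate value (label) rather than by integer position.
point = ds.sel(time="2022-05-01T11:28:28.249000")
print(point)
```

Unlike isel, which takes integer positions, sel looks the value up in the index of the time dimension coordinate.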
Selecting data by value requires unique coordinate values. In case of duplicates, you will encounter an InvalidIndexError. To avoid this, you can drop duplicates.
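A minimal sketch of dropping duplicates along the time dimension with Dataset.drop_duplicates, assuming a recent Xarray version and a synthetic dataset that contains one repeated timestamp:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic dataset in which the first timestamp appears twice.
times = pd.to_datetime(
    ["2022-05-01T08:09:06", "2022-05-01T08:09:06", "2022-05-02T09:30:00"]
)
ds = xr.Dataset(
    {"sun_elevation": ("time", np.array([44.2, 44.2, 58.8]))},
    coords={"time": times},
)

print("Unique before:", ds.get_index("time").is_unique)

# Keep the first occurrence of every duplicated time value (the default).
deduped = ds.drop_duplicates("time")

print("Unique after:", deduped.get_index("time").is_unique)
```

Whether keeping the first occurrence is the right choice depends on the data. drop_duplicates also accepts keep="last" or keep=False to discard all duplicated entries.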
Xarray and NumPy include a wide range of statistical functions that you can apply to a dataset or DataArray. Here are some examples:
Computing dataset statistics
cloud_cover = satellite_data.cloud_cover
min_meas = cloud_cover.min().item()
max_meas = cloud_cover.max().item()
mean_meas = cloud_cover.mean().item()
std_meas = cloud_cover.std().item()
print(f"Cloud cover ranges from {min_meas:.2f} to {max_meas:.2f} with a mean of {mean_meas:.2f} and a standard deviation of {std_meas:.2f}")
Output
Cloud cover ranges from 0.00 to 100.00 with a mean of 76.48 and a standard deviation of 34.17
You can also directly apply many NumPy functions to datasets or DataArrays. For example, to find out how many unique bands the data contains, use np.unique:
Finding unique values
import numpy as np

# Distinct values of the bands variable
print("Bands:", np.unique(satellite_data.bands))
Xarray provides a simple method for saving and loading datasets from files. This is useful for sharing your data or storing it for future use. Xarray supports many different file formats, including NetCDF, Zarr, GRIB, and more. For a complete list of supported formats, refer to the official documentation page.
To save the example dataset as a NetCDF file:
You may need to install the netcdf4 package first.
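A minimal sketch of saving a dataset to NetCDF and loading it back, assuming a NetCDF backend such as netcdf4 is installed. The dataset is synthetic, and the file name and temporary directory are hypothetical choices for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for satellite_data; values are illustrative.
ds = xr.Dataset(
    {"sun_elevation": ("time", np.array([44.2, 57.8, 58.8]))},
    coords={"time": pd.date_range("2022-05-01", periods=3, freq="D")},
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_satellite_data.nc"  # hypothetical file name

    # Write the dataset to disk as NetCDF.
    ds.to_netcdf(path)

    # Read it back; load() pulls the data into memory so the file can be closed.
    with xr.open_dataset(path) as stored:
        loaded = stored.load()

print(loaded)
```

In practice you would write to a persistent path instead of a temporary directory. The temporary directory is used here only so the sketch cleans up after itself.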
This section covers only a few common use cases for Xarray. The library offers many more functions and features. For more information, please see the Xarray documentation or explore the Xarray Tutorials.
Some useful capabilities not covered in this section include: