Ingesting from common file formats

For ingesting data from common file formats, the Tilebox Python SDK is recommended. It works with third-party libraries that read many common formats into either a pandas.DataFrame or an xarray.Dataset, both of which can then be ingested into Tilebox directly.

Reading and previewing the data

To ingest data from a file, you first need to read it into a pandas.DataFrame or an xarray.Dataset. How to do that depends on the file format. The following sections show examples for several common file formats.

CSV

Comma-separated values (CSV) is a common file format for tabular data, widely used in data science. Tilebox supports CSV ingestion using the pandas.read_csv function. The example below assumes a CSV file named ingestion_data.csv with the following content. If you want to follow along, you can download the file here.

ingestion_data.csv

time,value,sensor,precise_time,sensor_history,some_unwanted_column
2025-03-28T11:44:23Z,45.16,A,2025-03-28T11:44:23.345761444Z,"[-12.15, 13.45, -8.2, 16.5, 45.16]","Unsupported"
2025-03-28T11:45:19Z,273.15,B,2025-03-28T11:45:19.128742312Z,"[300.16, 280.12, 273.15]","Unsupported"

import pandas as pd

data = pd.read_csv("ingestion_data.csv")
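pandas.read_csv loads timestamp and list-like columns as plain strings by default, so it can help to convert and preview them before going further. The following is a minimal sketch, assuming the sample content shown above; it reads the rows from an in-memory string so that it runs without the file. Whether Tilebox requires these conversions depends on the dataset schema, so treat the chosen types as an assumption.

```python
import io
import json

import pandas as pd

# Same rows as ingestion_data.csv, inlined so the sketch is self-contained
csv_text = '''time,value,sensor,precise_time,sensor_history,some_unwanted_column
2025-03-28T11:44:23Z,45.16,A,2025-03-28T11:44:23.345761444Z,"[-12.15, 13.45, -8.2, 16.5, 45.16]","Unsupported"
2025-03-28T11:45:19Z,273.15,B,2025-03-28T11:45:19.128742312Z,"[300.16, 280.12, 273.15]","Unsupported"
'''

data = pd.read_csv(io.StringIO(csv_text))

# Convert the ISO 8601 strings into timezone-aware timestamps
data["time"] = pd.to_datetime(data["time"], utc=True)
data["precise_time"] = pd.to_datetime(data["precise_time"], utc=True)

# The history column holds JSON-style arrays stored as text; parse them into lists
data["sensor_history"] = data["sensor_history"].apply(json.loads)

# Preview the column types and the first rows
print(data.dtypes)
print(data.head())
```

To run the same steps against the downloaded file, replace io.StringIO(csv_text) with the path "ingestion_data.csv".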

Parquet

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. Tilebox supports Parquet ingestion using the pandas.read_parquet function. The Parquet file used in this example is available here.

import pandas as pd

data = pd.read_parquet("ingestion_data.parquet")

GeoParquet

GeoParquet is an extension of the Parquet file format, adding geospatial features support to Parquet. Tilebox supports GeoParquet ingestion using the geopandas.read_parquet function. The GeoParquet file used in this example is available here.

import geopandas as gpd

data = gpd.read_parquet("modis_MCD12Q1.geoparquet")

For a step-by-step guide to ingesting a GeoParquet file, check out our Ingesting data guide.

Feather

Feather is a file format originating from the Apache Arrow project, designed for storing tabular data in a fast and memory-efficient way. Tilebox supports Feather ingestion using the pandas.read_feather function. The Feather file used in this example is available here.

import pandas as pd

data = pd.read_feather("ingestion_data.feather")

Mapping columns to dataset fields

Once data is read into a pandas.DataFrame or an xarray.Dataset, it can be ingested into Tilebox directly. The column names of the pandas.DataFrame or the variables and coordinates of the xarray.Dataset are mapped to the fields of the Tilebox dataset to ingest into. Depending on how closely the column names or variable/coordinate names match the field names in the Tilebox dataset, you might need to rename some columns/variables/coordinates before ingestion.
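Before renaming anything, it can be useful to see which columns already line up with the dataset fields. The following is a rough sketch using only pandas; compare_columns is a hypothetical helper, and the expected_fields list is an assumed schema made up for illustration, not something read from Tilebox.

```python
import pandas as pd


def compare_columns(frame, expected_fields):
    """Return (missing, extra): fields without a column, and columns without a field."""
    columns = set(frame.columns)
    expected = set(expected_fields)
    return sorted(expected - columns), sorted(columns - expected)


# Assumed field names of the target dataset (hypothetical, for illustration only)
expected_fields = ["time", "value", "sensor", "measurement_time", "sensor_history"]

# An empty frame with the column names of the sample CSV file
data = pd.DataFrame(
    columns=["time", "value", "sensor", "precise_time", "sensor_history", "some_unwanted_column"]
)

missing, extra = compare_columns(data, expected_fields)
print("fields without a matching column:", missing)
print("columns without a matching field:", extra)
```

Columns reported as extra are candidates for renaming or dropping, and fields reported as missing show which names a rename needs to produce.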

Renaming fields

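# For a pandas.DataFrame, pass columns=; a positional dict would rename index labels instead
data = data.rename(columns={"precise_time": "measurement_time"})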

Dropping fields

In case you want to skip certain columns/variables/coordinates entirely, you can drop them before ingestion.

data = data.drop(columns=["some_unwanted_column"])
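Renaming and dropping can be chained into a single preparation step. The sketch below applies both to a small in-memory pandas.DataFrame that mimics one row of the sample file; the target name measurement_time is taken from the renaming example, and the frame itself is illustrative. Note that DataFrame.rename treats a positional dictionary as a mapping of index labels, so the columns= keyword is used here.

```python
import pandas as pd

# One illustrative row with a subset of the sample file's columns
data = pd.DataFrame(
    {
        "time": ["2025-03-28T11:44:23Z"],
        "value": [45.16],
        "precise_time": ["2025-03-28T11:44:23.345761444Z"],
        "some_unwanted_column": ["Unsupported"],
    }
)

# Both calls return a new frame, so the original data is left unchanged
prepared = (
    data
    .rename(columns={"precise_time": "measurement_time"})
    .drop(columns=["some_unwanted_column"])
)

print(list(prepared.columns))
```

Keeping the result in a separate variable makes it easy to compare the prepared frame against the original while debugging a mapping.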

Ingesting the data

Once the data is in the correct format, you can ingest it into Tilebox.

from tilebox.datasets import Client

client = Client()
# my_custom_dataset has a schema that matches the data we want to ingest
dataset = client.dataset("my_org.my_custom_dataset")

collection = dataset.get_or_create_collection("Measurements")
collection.ingest(data)
