Ingesting Data
Learn how to ingest data into a Tilebox dataset.
You need to have write permission on the collection to be able to ingest data.
Check out the examples below for common scenarios of ingesting data into a collection.
Dataset schema
Tilebox Datasets are strongly typed. This means you can only ingest data that matches the schema of a dataset. The schema is defined when the dataset is created.
The examples on this page assume that you have access to a Timeseries dataset that has the following schema:
Once you’ve defined the schema and created a dataset, you can access it and create a collection to ingest data into.
Preparing data for ingestion
collection.ingest supports a wide range of input types. Below is an example of using either a pandas.DataFrame or an xarray.Dataset as input.
pandas.DataFrame
A pandas.DataFrame is a representation of two-dimensional, potentially heterogeneous tabular data. It’s a powerful tool for working with structured data, and Tilebox supports it as input for ingest.
The example below shows how to construct a pandas.DataFrame from scratch that matches the schema of the MyCustomDataset dataset and can be ingested into it.
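As a minimal sketch of such a DataFrame: the schema itself is not reproduced on this page, so the column names time, title and measurement below are illustrative assumptions. Only sensor_history is implied by the n_sensor_history dimension mentioned further down. Adjust the columns to whatever your dataset schema defines.

```python
import pandas as pd

# Column names are assumptions for illustration; they must match your dataset schema.
data = pd.DataFrame(
    {
        # Timestamps parsed as timezone-aware UTC datetimes.
        "time": pd.to_datetime(
            ["2025-03-28T11:44:23Z", "2025-03-28T11:45:19Z"], utc=True
        ),
        "title": ["first", "second"],
        "measurement": [8.3, 10.1],
        # An array field: each data point carries a list, possibly of different length.
        "sensor_history": [[-12.2, 4.4, 7.1], [0.5, 3.3]],
    }
)

print(data.dtypes)

# Assuming `collection` is a collection you have write permission on:
# collection.ingest(data)
```

The ingest call is left commented out because it needs a real collection object.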
Once you have the data ready in this format, you can ingest it into a collection.
You can now also head on over to the Tilebox Console and view the newly ingested data points there.
xarray.Dataset
xarray.Dataset is the default format in which Tilebox Datasets returns data when querying data from a collection.
Tilebox also supports it as input for ingestion. The example below shows how to construct an xarray.Dataset from scratch that matches the schema of the MyCustomDataset dataset and can then be ingested into it.
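A minimal sketch of such an xarray.Dataset follows. As with the DataFrame above, the variable names title and measurement and the use of time as a coordinate are assumptions made for illustration. The n_sensor_history dimension name is taken from this page. The shorter array in the second data point is padded with np.nan so that both rows have the same width.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Variable names are assumptions for illustration; they must match your dataset schema.
data = xr.Dataset(
    {
        "title": ("time", ["first", "second"]),
        "measurement": ("time", [8.3, 10.1]),
        # Array field: a second dimension holds the per-datapoint values.
        # The second row has only two real values and is padded with NaN.
        "sensor_history": (
            ("time", "n_sensor_history"),
            np.array([[-12.2, 4.4, 7.1], [0.5, 3.3, np.nan]]),
        ),
    },
    coords={
        "time": pd.to_datetime(["2025-03-28T11:44:23", "2025-03-28T11:45:19"]),
    },
)

print(data)

# Assuming `collection` is a collection you have write permission on:
# collection.ingest(data)
```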
To learn more about xarray.Dataset, visit the dedicated Tilebox Xarray documentation page.
Array fields manifest in xarray using an extra dimension, in this case n_sensor_history. If the array sizes differ between data points, the remaining values are filled up with a fill value that depends on the dtype of the array. For float64 this is np.nan (not a number).
Don’t worry: when ingesting data into a Tilebox dataset, Tilebox automatically skips those padding fill values and does not store them in the dataset.
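The padding idea can be illustrated in plain Python. This is only a sketch of the concept, not the code Tilebox or xarray actually runs. The helper names pad_arrays and strip_padding are hypothetical, and treating only trailing NaN values as padding is an assumption.

```python
import math


def pad_arrays(arrays, fill=math.nan):
    """Pad every array with `fill` so that all rows share the longest length."""
    width = max((len(a) for a in arrays), default=0)
    return [list(a) + [fill] * (width - len(a)) for a in arrays]


def strip_padding(row):
    """Drop trailing NaN fill values, keeping the real values in front of them."""
    end = len(row)
    while end > 0 and math.isnan(row[end - 1]):
        end -= 1
    return row[:end]


sensor_history = [[-12.2, 4.4, 7.1], [0.5, 3.3], []]

padded = pad_arrays(sensor_history)  # rectangular, as in an xarray dimension
restored = [strip_padding(row) for row in padded]  # variable-length again

print(padded)
print(restored)
```

After the round trip, restored holds the same variable-length arrays as the input, which mirrors why the fill values never need to be stored.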
Now that you have the xarray.Dataset in the correct format, you can ingest it into the Tilebox dataset collection.
Copying or moving data
Since collection.load returns an xarray.Dataset, and ingest takes such a dataset as input, you can easily copy or move data from one collection to another.
Copying data like this also works across datasets, provided the dataset schemas are compatible.
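The copy pattern boils down to a load followed by an ingest. In the sketch below, copy_data is a hypothetical helper, and passing the time interval as a (start, end) tuple is an assumption. FakeCollection is an in-memory stand-in that exists only so the snippet runs without a Tilebox account. It stores plain dictionaries, whereas a real collection.load returns an xarray.Dataset.

```python
def copy_data(source, target, interval):
    """Load all data points in `interval` from `source` and ingest them into `target`."""
    data = source.load(interval)
    target.ingest(data)
    return data


class FakeCollection:
    """In-memory stand-in for a collection, for illustration only."""

    def __init__(self, rows=None):
        self.rows = list(rows or [])

    def load(self, interval):
        # Half-open interval [start, end); ISO 8601 strings sort chronologically.
        start, end = interval
        return [row for row in self.rows if start <= row["time"] < end]

    def ingest(self, data):
        self.rows.extend(data)


source = FakeCollection(
    [
        {"time": "2025-01-01T00:00:00Z", "measurement": 1.0},
        {"time": "2025-01-15T12:00:00Z", "measurement": 2.0},
        {"time": "2025-02-03T08:30:00Z", "measurement": 3.0},
    ]
)
target = FakeCollection()

copied = copy_data(source, target, ("2025-01-01", "2025-02-01"))
print(len(copied), "data points copied")
```

With real collections, the same two calls apply: the result of load on the source is passed directly to ingest on the target.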
Automatic batching
Tilebox automatically batches the ingestion requests for you, so you don’t have to worry about the maximum request size.
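For intuition, batching means splitting a large list of data points into chunks that each fit into one request. The sketch below is not Tilebox's implementation. The batched helper and the batch size of 4 are arbitrary choices for illustration, and you never need to do this yourself when calling ingest.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls the next chunk; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch


datapoints = list(range(10))
batches = list(batched(datapoints, batch_size=4))

print([len(batch) for batch in batches])
```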
Idempotency
Tilebox auto-generates datapoint IDs based on the data of all fields except the auto-generated ingestion_time, so ingesting the same data twice results in the same ID being generated. By default, Tilebox silently skips any data points that are duplicates of existing ones in a collection. This behavior is especially useful when implementing idempotent algorithms: re-executions of ingestion tasks, due to retries or other reasons, never result in duplicate data points.
Alternatively, you can request that an error be raised if any of the generated datapoint IDs already exist. To do so, set the allow_existing parameter to False.
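The behavior described above can be modeled with a small in-memory sketch. It does not reproduce how Tilebox actually derives IDs. Hashing the JSON-encoded fields with SHA-256 is an assumption chosen for illustration, and datapoint_id and ingest_into are hypothetical helpers. Only the allow_existing parameter name and the exclusion of ingestion_time come from this page.

```python
import hashlib
import json
import uuid


def datapoint_id(fields):
    """Derive a deterministic ID from all fields except ingestion_time."""
    content = {k: v for k, v in fields.items() if k != "ingestion_time"}
    encoded = json.dumps(content, sort_keys=True).encode("utf-8")
    return uuid.UUID(bytes=hashlib.sha256(encoded).digest()[:16])


def ingest_into(store, datapoints, allow_existing=True):
    """Add data points to `store`; skip or reject ones whose ID already exists."""
    ids = [datapoint_id(point) for point in datapoints]
    if not allow_existing:
        # Check before writing so a failure leaves the store unchanged.
        duplicates = [i for i in ids if i in store]
        if duplicates:
            raise ValueError(f"{len(duplicates)} data point(s) already exist")
    added = 0
    for point_id, point in zip(ids, datapoints):
        if point_id not in store:
            store[point_id] = point
            added += 1
    return added


points = [
    {"time": "2025-03-28T11:44:23Z", "measurement": 8.3, "ingestion_time": "2025-04-01T10:00:00Z"},
    {"time": "2025-03-28T11:45:19Z", "measurement": 10.1, "ingestion_time": "2025-04-01T10:00:00Z"},
]
# A retry carries the same data, only the ingestion_time differs.
retry = [dict(point, ingestion_time="2025-04-02T09:00:00Z") for point in points]

store = {}
first = ingest_into(store, points)
second = ingest_into(store, retry)
print(first, second, len(store))
```

The retry adds nothing because every ID is already present. Calling ingest_into(store, retry, allow_existing=False) raises a ValueError instead of skipping silently.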
Ingestion from common file formats
Using xarray and pandas, you can also easily ingest existing datasets available in file formats such as CSV, Parquet, Feather and more.
CSV
Comma-separated values (CSV) is a common file format for tabular data. It’s widely used in data science. Tilebox supports CSV ingestion using the pandas.read_csv function.
Assume you have a CSV file named data.csv with the following content. If you want to follow along, you can download the file here.
This data already conforms to the schema of the MyCustomDataset dataset, except for some_unwanted_column, which you want to drop before you ingest it. Here is what this could look like:
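A minimal sketch of this step follows. Since the contents of data.csv are not reproduced here, the snippet reads a small made-up CSV from an in-memory string. Every column name apart from some_unwanted_column is an assumption. With a real file, you would pass the path "data.csv" to pandas.read_csv instead.

```python
import io

import pandas as pd

# Made-up stand-in for data.csv; replace io.StringIO(csv_text) with "data.csv".
csv_text = """time,title,measurement,some_unwanted_column
2025-03-28T11:44:23Z,first,8.3,a
2025-03-28T11:45:19Z,second,10.1,b
"""

data = pd.read_csv(io.StringIO(csv_text))
# Convert the timestamp strings to timezone-aware UTC datetimes.
data["time"] = pd.to_datetime(data["time"], utc=True)
# Remove the column that is not part of the dataset schema.
data = data.drop(columns=["some_unwanted_column"])

print(data)

# Assuming `collection` is a collection you have write permission on:
# collection.ingest(data)
```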
Parquet
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
Tilebox supports Parquet ingestion using the pandas.read_parquet function.
The parquet file used in this example is available here.
Feather
Feather is a file format originating from the Apache Arrow project, designed for storing tabular data in a fast and memory-efficient way. It’s supported by many programming languages, including Python. Tilebox supports Feather ingestion using the pandas.read_feather function.
The feather file used in this example is available here.