Preparing your data for Lace
Compared with many other machine learning tools, Lace has very few requirements for data: columns may be integer, continuous, or categorical string types; empty cells do not need to be filled in; and the table must contain a string row index column labeled 'ID' or 'Index' (case-insensitive).
Supported data types for inference
Lace supports several data types, and more can be supported (with some work from us).
Continuous data
Continuous columns are modeled as mixtures of Gaussian distributions. Find an explanation of the parameters in the codebook.
Categorical data
Categorical columns are modeled as mixtures of categorical distributions. Find an explanation of the parameters in the codebook.
Note: Lace currently supports up to 256 unique values per categorical column.
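If you are unsure whether a column fits within that limit, a quick cardinality check with pandas can help. This is a minimal sketch; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical file and column names; adjust to your data.
df = pd.read_csv('mydata.csv', index_col='ID')

n_unique = df['my_cat_col'].nunique(dropna=True)
if n_unique > 256:
    print(f"'my_cat_col' has {n_unique} unique values, too many for a categorical column")
```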
Count data
Lace also supports a count data type, modeled as a mixture of Poisson distributions, but it has some drawbacks that make it best to convert count data to continuous in most cases (a conversion sketch follows below):
- The Poisson distribution is a single-parameter model, so the location and variance of the mixture components cannot be controlled individually; in the Poisson model, higher magnitude means higher variance.
- The hyperprior for count data is finicky and can often cause underflow/overflow errors when the underlying data do not look like Poisson distributions.
Note: If you use the count type, do so because you know that the underlying mixture components will be Poisson-like, and be sure to set the prior and unset the hyperprior in the codebook.
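If you decide to treat counts as continuous instead, the conversion is a one-liner in pandas. This is a minimal sketch; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical file and column names; 'n_visits' stands in for your count column.
df = pd.read_csv('mydata.csv', index_col='ID')

# Cast to float so the values are written with decimals and modeled as continuous.
df['n_visits'] = df['n_visits'].astype(float)
df.to_csv('mydata.csv', index_label='ID')
```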
Lace is pretty forgiving when it comes to data: you can have missing values, string values, and numerical values all in the same table, but there are some rules your data must follow for Lace to interpret it correctly. The rest of this section covers the accepted file formats and how to format your data so Lace understands it properly.
Accepted formats
Lace currently accepts the following data formats:
- CSV
- CSV.gz (gzipped CSV)
- Parquet
- IPC (Apache Arrow v2)
- JSON (as output by the pandas function df.to_json('mydata.json'))
- JSON Lines
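For reference, here is a minimal sketch of writing a small, hypothetical DataFrame to a few of these formats with pandas (to_parquet requires pyarrow or fastparquet to be installed).

```python
import pandas as pd

# A small, hypothetical table with a string row index labeled 'ID'.
df = pd.DataFrame(
    {'weight': [1.5, 2.25, None], 'color': ['red', 'blue', 'red']},
    index=pd.Index(['a', 'b', 'c'], name='ID'),
)

df.to_csv('mydata.csv')                         # CSV
df.to_csv('mydata.csv.gz', compression='gzip')  # gzipped CSV
df.to_parquet('mydata.parquet')                 # Parquet
df.to_json('mydata.json')                       # JSON
```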
Using a schemaless data format
Formatting your data properly helps Lace understand your data. Under the hood, Lace uses polars to read the supported formats into a DataFrame. For more information about I/O in polars, see the polars API documentation.
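If you want to preview how your file will parse, you can read it with polars yourself and inspect the inferred dtypes. This is a sketch using a hypothetical file name.

```python
import polars as pl

# Hypothetical file name; each accepted format has a matching reader,
# e.g. pl.read_parquet, pl.read_ipc, or pl.read_ndjson.
df = pl.read_csv('mydata.csv')

# Check that each column was given the dtype you expect.
print(df.schema)
```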
Here are the rules:
- Real-valued (continuous) cells must have decimals.
- Integer-valued cells, whether count or categorical, must not have decimals.
- Categorical data cells may be integers (up to 255) or string values.
- In a CSV, missing cells should be empty.
- A string row index is required. The index label should be 'ID' or 'Index' (case-insensitive).
Not following these rules will confuse the codebook and could cause parsing errors.
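As a concrete example, a small CSV that follows all of these rules might look like this (the columns and values are hypothetical):

```
ID,weight,color,n_legs
otter,12.5,brown,4
flamingo,,pink,2
crab,0.4,red,
```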
Row and column names
Row and column indices or names must be strings. If you were to create a codebook from a CSV with integer row and column indices, Lace would convert them to strings.
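If your DataFrame uses the default integer labels, you can also convert them to strings yourself before writing the file. This is a minimal sketch with hypothetical data.

```python
import pandas as pd

# Hypothetical frame with integer column names and a default integer index.
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])

df.columns = df.columns.astype(str)  # column names become '0', '1', ...
df.index = df.index.astype(str)      # row names become '0', '1', ...
df.to_csv('mydata.csv', index_label='ID')
```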
Tips on creating valid data with pandas
When reading data from a CSV, pandas will convert integer columns with missing cells to float values, since floats can represent NaN, which is how pandas represents missing data. You have a couple of options for saving your CSV file with both missing cells and properly formatted integers:
You can coerce the types to Int64, pandas' nullable integer dtype (basically Int plus NaN), and then write to CSV:
```python
import pandas as pd

df = pd.DataFrame([10, 20, 30], columns=['my_int_col'])

# Int64 (capital 'I') keeps missing cells missing without forcing the column to float.
df['my_int_col'] = df['my_int_col'].astype('Int64')
df.to_csv('mydata.csv', index_label='ID')
```
If you have a lot of columns or particularly long columns, you might find it much faster just to reformat as you write to the CSV, in which case you can use the float_format option of DataFrame.to_csv:
```python
df.to_csv('mydata.csv', index_label='ID', float_format='%g')
```
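With float_format='%g', whole-number floats such as 10.0 are written as 10, so integer columns that pandas upcast to float come out without decimals.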