In the early 2010s, as HDFS was at its peak, big data was still growing. It was evident that the storage system alone was not enough to handle the data deluge. Innovations were needed to make data processing faster and more efficient at the storage format level itself.
Doug Cutting, the creator of Hadoop, initially started the Trevni columnar storage format. Twitter and Cloudera then collaborated to improve on it, which resulted in the creation of the Parquet format. First released in 2013, Parquet is a columnar storage file format optimized for large-scale data processing and analytics. Apache Parquet has been a top-level Apache Software Foundation project since 2015.
In this post we will explore the structure of Parquet files, their advantages, their use cases, and best practices for working with them. We'll also look at why we at Parseable chose Parquet as our primary storage format and the challenges we face.
What is a columnar format?
A column-oriented format differs structurally from row-oriented formats such as CSV or Avro. The easiest way to understand this is to start with a simple dataset in a row-oriented file such as a CSV, like this:
ID | First Name | Last Name | City | Age |
---|---|---|---|---|
1 | Alice | Brown | Boston | 29 |
2 | Bob | Smith | New York | 35 |
3 | Charlie | Davis | San Francisco | 42 |
When this data is stored in a CSV file, it is stored row-wise, i.e. the data for one row sits next to the data for the next row, and so on. Each row contains all the columns for a single record. So the actual physical storage of the data would look like this:
1,Alice,Brown,Boston,29;2,Bob,Smith,New York,35;3,Charlie,Davis,San Francisco,42;
In contrast, a columnar file stores data from a column together. So, the same data in a columnar file would be physically stored like this:
1,2,3;Alice,Bob,Charlie;Brown,Smith,Davis;Boston,New York,San Francisco;29,35,42;
Sure, it is difficult for humans to read, but this format is much more efficient for certain types of queries. For example, if you wanted to know the average age of all the people in the dataset, you would only need to read the Age column. In a row-based file, you would have to read the entire file to get the same information.
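To see this in practice, here's a minimal sketch using the pyarrow library (the file name and lower-cased column names are just for illustration): it writes the sample table as Parquet, then reads back only the age column to compute the average.

```python
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Build the sample table from above and write it out as Parquet.
people = pa.table({
    "id": [1, 2, 3],
    "first_name": ["Alice", "Bob", "Charlie"],
    "last_name": ["Brown", "Smith", "Davis"],
    "city": ["Boston", "New York", "San Francisco"],
    "age": [29, 35, 42],
})
pq.write_table(people, "people.parquet")

# Read back only the "age" column; the other columns are never touched.
ages = pq.read_table("people.parquet", columns=["age"])
print(pc.mean(ages["age"]).as_py())  # average age
```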
Implications of columnar format
The way data is organized in a columnar format has several important implications (both good and bad) for how data is stored, read, and processed. Let's examine some of them:
Writing the file: While writing a columnar file, the writer needs to know all the values of a column across all the rows being written. This means the writer has to buffer entire columns in memory before writing them to disk, which can be a challenge for very large datasets. Parquet and other columnar formats added a workaround for this, which we'll review later.
Column level configuration: Because a columnar file stores all the data of one column together physically, it can use different compression and encoding schemes for each column. For example, for a column with repeating values, run-length encoding (RLE) or dictionary encoding will be more efficient than other schemes. This flexibility allows optimizing storage and query performance per column (see the sketch after this list).
Random access: Because of the relatively complex format, data in a columnar file is difficult to access randomly. For example, to read a single row from a columnar file, you have to read the footer, then the metadata, then the relevant column chunks, and finally the pages. You may be able to skip a few steps, but at a minimum you'll need to read whole pages to get the values you need. This is in contrast to row-based files, where you can read a row directly.
Schema evolution: Schema evolution is the process of changing the schema of a dataset. For an already-written columnar file this is very hard: the schema is stored in the file's metadata, and changing the schema effectively means rewriting the file. This can be a challenge when working with large datasets.
Interaction with files: Columnar files are not human-readable. You can't open a columnar file in a text editor and read the data; you need a special tool or library. This can be a challenge when working with small datasets or when debugging. And since each columnar file format has its own structure, each needs its own reader and writer implementations in different languages before end users can interact with the files.
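As a rough sketch of the column-level configuration point above (the codec and column choices are assumptions for illustration, not recommendations), pyarrow lets you pick a compression codec per column and restrict dictionary encoding to the columns where values repeat:

```python
import pyarrow as pa
import pyarrow.parquet as pq

people = pa.table({
    "city": ["Boston", "Boston", "New York", "Boston"],  # repetitive values
    "age": [29, 35, 42, 31],
})

# Different codecs per column, dictionary encoding only for the repetitive one.
pq.write_table(
    people,
    "people_tuned.parquet",
    compression={"city": "zstd", "age": "snappy"},
    use_dictionary=["city"],
)
```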
Parquet file format
At a high level, a Parquet file is a collection of row groups, each of which contains column chunks. Each column chunk is further divided into pages. The file also contains metadata that helps in reading and decoding the data efficiently.
Let's take a visual example. In the example below, there are N columns in the table, split into M row groups. The file metadata contains the start locations of all the column chunks.
```
4-byte magic number "PAR1"
<Row Group 1>
<Column 1 Chunk 1>
<Column 2 Chunk 1>
...
<Column N Chunk 1>
</Row Group 1>
<Row Group 2>
<Column 1 Chunk 2>
<Column 2 Chunk 2>
...
<Column N Chunk 2>
</Row Group 2>
<Row Group M>
<Column 1 Chunk M>
<Column 2 Chunk M>
...
<Column N Chunk M>
</Row Group M>
File Metadata
4-byte length in bytes of file metadata (little endian)
4-byte magic number "PAR1"
```
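You can observe this layout without decoding any data pages: the file metadata in the footer already tells you how many row groups and columns the file contains. A minimal sketch with pyarrow (the file name is carried over from the earlier example):

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("people.parquet")
meta = pf.metadata  # parsed from the footer at the end of the file

print(meta.num_rows)        # total rows across all row groups
print(meta.num_row_groups)  # M in the layout above
print(meta.num_columns)     # N in the layout above
print(meta.created_by)      # the writer library and version
```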
Row Groups
A writer needs to know all the values of a column before writing it to disk. Why? Let's understand this better with an example, using our previous table.
ID | First Name | Last Name | City | Age |
---|---|---|---|---|
1 | Alice | Brown | Boston | 29 |
2 | Bob | Smith | New York | 35 |
3 | Charlie | Davis | San Francisco | 42 |
If you were to write this data in a columnar file, you would need to write all the ID values together, then all the First Name values together, and so on. The final file looks like:
1,2,3;Alice,Bob,Charlie;Brown,Smith,Davis;Boston,New York,San Francisco;29,35,42;
Now, imagine this file was not complete yet. Say there are 5 more rows to be appended after the first 3 rows have been written to the Parquet file.
One way to do this is to read the entire file, append the new rows, and write the whole file back to disk. That requires a lot of data shuffling to keep the columnar layout intact. The other way is to keep all the rows in memory (until you have all 8 rows), write them to disk in one go, and disallow appends once the file is written.
Now think of production workloads where writers need to write millions of rows worth of data to a Parquet file. Neither the shuffling of data nor the loading of all rows in memory is efficient. To solve this problem, Parquet introduced the concept of row groups.
Row groups are essentially a way to divide the data into smaller chunks so they can be written to disk efficiently. Each row group contains a subset of the rows and all the columns for those rows. The writer only needs to hold the rows of the current row group in memory to write them to disk, and can then wait for the next set of rows. When the file is complete, the writer writes the metadata to the end of the file.
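This is how streaming writers behave in practice. Here's a hedged sketch (the schema, batch size, and file name are made up for illustration) using pyarrow's ParquetWriter, which flushes each batch as its own row group so only the current batch has to fit in memory:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([("id", pa.int64()), ("city", pa.string())])

with pq.ParquetWriter("events.parquet", schema) as writer:
    # Each write_table call is flushed as one or more row groups;
    # rows written earlier do not need to stay in memory.
    for start in range(0, 1_000_000, 100_000):
        batch = pa.table({
            "id": list(range(start, start + 100_000)),
            "city": ["Boston"] * 100_000,
        })
        writer.write_table(batch)
# Closing the writer appends the file metadata (footer) at the end.
```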
Column Chunks
Within a Row Group, data is stored by columns. These column chunks are the core of Parquet's columnar storage format. By storing data this way, Parquet can efficiently read only the columns needed for a query.
Pages: Each column chunk is further divided into pages. Pages are the smallest unit of data storage in a Parquet file. They contain the actual data and metadata necessary for decoding.
Metadata: Parquet files contain metadata at different levels. File-level metadata includes information about the schema and the number of row groups. Row group metadata provides details about column chunks, such as their size and encoding. Column chunk metadata includes information about pages, and page headers carry the metadata needed to read and decode the data in each page.
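To make the metadata hierarchy concrete, here's a small sketch (again with pyarrow; the file name and indices are illustrative) that drills from the file metadata into one row group and one column chunk:

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("people.parquet").metadata

rg = meta.row_group(0)                     # row group level
print(rg.num_rows, rg.total_byte_size)

col = rg.column(0)                         # column chunk level
print(col.path_in_schema)                  # which column this chunk holds
print(col.compression)                     # codec used for this chunk
print(col.encodings)                       # encodings used by its pages
print(col.statistics)                      # min/max/null count, if present
```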
Supported Types
The data types supported by a Parquet file are deliberately minimal, to reduce the complexity of implementing readers and writers. Logical types extend these physical types by annotating how the stored values should be interpreted; the annotations live in the file metadata and are documented in the Parquet spec's LogicalTypes.md. A short sketch of how logical types map onto physical types follows the list below.
Some of the supported data types include:
- BOOLEAN: 1 bit boolean
- INT32: 32-bit signed ints
- INT64: 64-bit signed ints
- INT96: 96-bit signed ints
- FLOAT: IEEE 32-bit floating point values
- DOUBLE: IEEE 64-bit floating point values
- BYTE_ARRAY: arbitrarily long byte arrays
- FIXED_LEN_BYTE_ARRAY: fixed length byte arrays
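To see how logical types annotate these physical types, here's a small sketch (file and column names are illustrative): strings are stored as BYTE_ARRAY with a String annotation and timestamps as INT64 with a Timestamp annotation, which shows up when you print the Parquet schema.

```python
import datetime
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "name": ["Alice"],                          # BYTE_ARRAY + String annotation
    "joined": [datetime.datetime(2024, 1, 1)],  # INT64 + Timestamp annotation
    "active": [True],                           # BOOLEAN, no annotation needed
})
pq.write_table(table, "types.parquet")

# Printing the Parquet schema shows each physical type with its logical annotation.
print(pq.ParquetFile("types.parquet").schema)
```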
Building and Working with Parquet
Working with Parquet files in different programming environments is straightforward thanks to the wide range of libraries and tools available.
For example, parquet-tools is a popular Python-based CLI tool for interacting with Parquet files.
```
pip3 install parquet-tools
```
Refer to the parquet-tools documentation for usage details.
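If you'd rather use a library than a CLI, the equivalent round trip in Python looks roughly like this (a sketch assuming pyarrow, plus pandas for the final conversion, are installed; file and column names are made up):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Write a small table to Parquet...
pq.write_table(pa.table({"id": [1, 2, 3], "age": [29, 35, 42]}), "demo.parquet")

# ...then read it back, selecting only the columns you need,
# and hand it to pandas for further analysis.
df = pq.read_table("demo.parquet", columns=["age"]).to_pandas()
print(df.describe())
```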
Use Cases
Parquet is particularly advantageous in scenarios that involve large-scale data processing and analytics.
Big Data Processing with Apache Spark: Apache Spark uses Parquet as its default storage format. Parquet's compression and encoding schemes enable Spark to efficiently read and write large datasets, run fast analytical queries, and save storage space.
Data Warehousing: Parquet's efficient storage and fast query performance make it an excellent choice for storing and analyzing large volumes of data in a data warehouse. Data warehousing solutions such as Amazon Redshift and Google BigQuery commonly work with data stored as Parquet.
Machine Learning: Parquet is a popular file format for machine learning workloads because of its efficient storage and retrieval of large datasets. Fast data loading and processing help data scientists accelerate model training and evaluation.
Best Practices
To optimize your work with Parquet files, consider the following best practices:
Column Pruning: When querying data, select only the columns you need. This reduces the amount of data read from disk and improves query performance.
Predicate Pushdown: Use predicates in your queries to filter data early in the read process. Parquet supports predicate pushdown, which helps minimize the amount of data read; the sketch after this list shows column pruning and predicate pushdown applied together.
Compression and Encoding: Based on your data characteristics, choose the proper compression and encoding schemes. Experiment with different options to find the best balance between storage space and query performance.
Batch Processing: Process Parquet files in batches to take advantage of Parquet's efficient row group and page structures. This improves I/O performance and reduces overhead.
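As a rough illustration of the first two practices (the file name, columns, and filter value are assumptions), pyarrow can apply column pruning and predicate pushdown in a single read:

```python
import pyarrow.parquet as pq

# Read only the columns we need, and let the reader use Parquet
# statistics to skip row groups that cannot match the predicate.
table = pq.read_table(
    "people.parquet",
    columns=["first_name", "age"],        # column pruning
    filters=[("city", "=", "Boston")],    # predicate pushdown
)
print(table)
```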