Using Apache Parquet to work with big data efficiently
Analysing service data can be slow and costly, so how can we improve it?
Imagine we’ve got an enormous table with millions of rows and thousands of columns, and our business analysts ask us whether it’s possible to analyse this data to determine how successful the digital service is. We can either use the data as is, or we can improve it using Parquet.
Use data in its current format
Digital services often store their data in a database using a row-based format. This makes a lot of sense for humans using spreadsheets and for humans who use digital services. We create or find a record, view the information associated with it and make updates or delete it. Here’s a small table of people data as an example:
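As a sketch of what row-based storage means in practice, here’s the shape of that kind of data in Python (the names and values are made up for illustration):

```python
# A row-based layout: each record keeps all of its fields together,
# which suits creating, viewing, updating and deleting single records.
people = [
    {"name": "Alice", "age": 34, "joined": "2019-03-01"},
    {"name": "Bikash", "age": 28, "joined": "2020-11-15"},
    {"name": "Chen", "age": 45, "joined": "2018-07-23"},
]

# Reading one whole record is cheap: it is a single contiguous item.
record = people[1]
print(record["name"], record["age"])
```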
We might start by exporting this data out of a live database and into a data warehouse. This separation is important, because it enables analysts to access the data and run expensive queries without risking the stability of the live service.
Once the data’s in a service, for example AWS S3, it could be paired with AWS Athena, from which analysts can start running queries and sharing answers with interested stakeholders. This is a great start: value has been delivered relatively quickly. But, given the large amount of data, the team still has some problems:
- how do we prevent our queries from breaking in the future? The live service is still iterating and might change its data schema
- how much more will this cost? We’ve just duplicated our data storage
- queries are slow and compute time is expensive, how can we improve this?
Make it better with Parquet
To help us address these issues, we could go further and perform an automatic data transformation into a column-based format, like Parquet, using a tool like AWS Glue.
After converting our earlier example we can see it’s been reorganised with the original columns as the focus:
Data for new records will now be added horizontally instead of vertically.
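In Python terms, the same made-up data now looks like a mapping from column names to lists of values, a minimal sketch of the columnar idea:

```python
# A column-based layout: values for each column are stored together,
# so new records extend each column "horizontally".
columns = {
    "name": ["Alice", "Bikash", "Chen"],
    "age": [34, 28, 45],
    "joined": ["2019-03-01", "2020-11-15", "2018-07-23"],
}

# Appending a new record touches every column list...
columns["name"].append("Dana")
columns["age"].append(52)
columns["joined"].append("2021-01-09")

# ...but reading a narrow slice across many records only touches
# the columns we ask for.
ages = columns["age"]
print(sum(ages) / len(ages))
```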
The row-based format is often used by digital services as it’s optimised for interacting with all of the data within a single record. As we are now dealing with a copy of service data, we no longer need to worry about that optimisation. We can instead use the column-based format which is optimised for reading narrow slices of data across many records.
Though this example is small, remember that this problem applies to big datasets with millions of rows and thousands of columns.
The column-based format has a few benefits over the row-based one when working in a read-only, analytical mode.
With native support for compression by column, Parquet can help reduce storage costs. There are several codecs available, including SNAPPY, GZIP, LZO, BROTLI, LZ4 and ZSTD. The engineering team at Uber found compression with ZSTD could reduce file sizes by 79%.
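Parquet compresses each column chunk with the chosen codec; as a rough standard-library analogy, compressing a repetitive column with zlib shows why grouping similar values together pays off (the data here is invented):

```python
import zlib

# Parquet compresses each column chunk separately. A column of
# repetitive values compresses very well because similar bytes
# sit next to each other.
countries = [("GB", "FR", "DE")[i % 3] for i in range(1000)]

raw = "\n".join(countries).encode()
compressed = zlib.compress(raw)

print(len(raw), "->", len(compressed), "bytes")

# The original values are fully recoverable.
assert zlib.decompress(compressed).decode().split("\n") == countries
```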
Instead of slicing and dicing our data by selected records, we can target just the columns we’re interested in. By selecting columns, the query engine can avoid scanning through all of the unrelated record data. This limits the amount of computational resource we need to use and pay for.
We don’t have to learn a new query language to make this work with Parquet, SQL works just fine:
SELECT name FROM file.parquet WHERE age > 1
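A sketch of what a query like that means for a column store, in plain Python with made-up data: only the columns named in the query are scanned, and the rest of each record is skipped entirely.

```python
# Columnar layout lets a query engine read only the columns it needs.
columns = {
    "name": ["Alice", "Bikash", "Chen"],
    "age": [34, 28, 45],
    "city": ["Leeds", "London", "Bristol"],  # never read by the query
}

# SELECT name FROM ... WHERE age > 30: only two of the three columns
# are scanned; "city" is skipped entirely.
result = [
    name
    for name, age in zip(columns["name"], columns["age"])
    if age > 30
]
print(result)
```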
Parquet files each contain their own metadata describing the structure of the data within. Because these files are column-focused, Parquet is able to merge multiple files together without breaking existing queries.
New columns can be merged onto the end:
If the “Joined” column is later removed upstream, old data will remain and new data will stop appearing:
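A plain-Python sketch of that merge behaviour (Parquet does this via its file metadata; the data here is made up): old rows keep their “joined” values and new rows get nulls.

```python
# Two "files" with different schemas: the newer one has lost the
# "joined" column. Merging on the union of columns keeps old data
# and records nulls where a file has no value for a column.
old_file = [{"name": "Alice", "age": 34, "joined": "2019-03-01"}]
new_file = [{"name": "Dana", "age": 52}]

all_columns = ["name", "age", "joined"]
merged = [
    {col: row.get(col) for col in all_columns}
    for row in old_file + new_file
]
print(merged)
```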
If you want to try this yourself, you can:
- add a CSV to an AWS S3 bucket
- configure AWS Glue to consume this CSV and transform it to Parquet
- use AWS Athena to query the Parquet data from within the AWS Console
- set up automatic transformations so that a new CSV will update the Parquet dataset. You can use AWS S3 Event Notifications, AWS SQS and AWS SNS for this.