Using Apache Parquet to work with big data efficiently
Analysing service data can be slow and costly, so how can we improve it?
Imagine we’ve got an enormous table with millions of rows and thousands of columns, and our business analysts ask us whether it’s possible to analyse this data to determine how successful the digital service is. We can either use the data as is, or we can improve it using Parquet.
Use data in its current format
Digital services often store their data in a database using a row-based format. This makes a lot of sense for humans using spreadsheets and for humans who use digital services. We create or find a record, view the information associated with it and make updates or delete it. Here’s a small table of people data as an example:
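As a sketch of what row-based storage means in practice, here’s the shape of that kind of data in Python (the names and values are made up for illustration):

```python
# A row-based layout: each record keeps all of its fields together,
# which suits creating, viewing, updating and deleting single records.
people = [
    {"name": "Alice", "age": 34, "joined": "2019-03-01"},
    {"name": "Bikash", "age": 28, "joined": "2020-11-15"},
    {"name": "Chen", "age": 45, "joined": "2018-07-23"},
]

# Reading one whole record is cheap: it is a single contiguous item.
record = people[1]
print(record["name"], record["age"])
```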
We might start by exporting this data out of a live database and into a data warehouse. This separation is important, because it enables analysts to access the data and run expensive queries without risking the stability of the live service.
Once the data’s in a service, for example AWS S3, it could be paired with AWS Athena, from which analysts can start running queries and sharing answers with interested stakeholders. This is a great start: value has been delivered relatively quickly. But, given the large amount of data, the team still has some problems:
- how do we prevent our queries from breaking in the future? The live service is still iterating and might change its data schema
- how much more will this cost? We’ve just duplicated our data storage
- queries are slow and compute time is expensive, how can we improve this?
Make it better with Parquet
To help us address these issues, we could go further and perform an automatic data transformation into a column-based format, like Parquet, using a tool like AWS Glue.
After converting our earlier example we can see it’s been reorganised with the original columns as the focus:
Data for new records will now be added horizontally instead of vertically.
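In Python terms, the same made-up data now looks like a mapping from column names to lists of values, a minimal sketch of the columnar idea:

```python
# A column-based layout: values for each column are stored together,
# so new records extend each column "horizontally".
columns = {
    "name": ["Alice", "Bikash", "Chen"],
    "age": [34, 28, 45],
    "joined": ["2019-03-01", "2020-11-15", "2018-07-23"],
}

# Appending a new record touches every column list...
columns["name"].append("Dana")
columns["age"].append(52)
columns["joined"].append("2021-01-09")

# ...but reading a narrow slice across many records only touches
# the columns we ask for.
ages = columns["age"]
print(sum(ages) / len(ages))
```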
The row-based format is often used by digital services as it’s optimised for interacting with all of the data within a single record. As we are now dealing with a copy of service data, we no longer need to worry about that optimisation. We can instead use the column-based format which is optimised for reading narrow slices of data across many records.
Though this example is small, remember that this problem applies to big datasets with millions of rows and thousands of columns.
The column-based format has a few benefits over the row-based one when working in a read-only, analytical mode.
With native support for compression by column, Parquet can help reduce storage costs. There are several codecs available, including SNAPPY, GZIP, LZO, BROTLI, LZ4 and ZSTD. The engineering team at Uber found compression with ZSTD could reduce file sizes by 79%.
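Parquet compresses each column chunk with the chosen codec; as a rough standard-library analogy, compressing a repetitive column with zlib shows why grouping similar values together pays off (the data here is invented):

```python
import zlib

# Parquet compresses each column chunk separately. A column of
# repetitive values compresses very well because similar bytes
# sit next to each other.
countries = [("GB", "FR", "DE")[i % 3] for i in range(1000)]

raw = "\n".join(countries).encode()
compressed = zlib.compress(raw)

print(len(raw), "->", len(compressed), "bytes")

# The original values are fully recoverable.
assert zlib.decompress(compressed).decode().split("\n") == countries
```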
Instead of slicing and dicing our data by selected records, we can target just the columns we’re interested in. By selecting columns, the query engine can avoid scanning through all of the unrelated record data. This limits the amount of computational resource we need to use and pay for.
We don’t have to learn a new query language to make this work with Parquet, SQL works just fine:
SELECT name FROM file.parquet WHERE age > 1
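A sketch of what a query like that means for a column store, in plain Python with made-up data: only the columns named in the query are scanned, and the rest of each record is skipped entirely.

```python
# Columnar layout lets a query engine read only the columns it needs.
columns = {
    "name": ["Alice", "Bikash", "Chen"],
    "age": [34, 28, 45],
    "city": ["Leeds", "London", "Bristol"],  # never read by the query
}

# SELECT name FROM ... WHERE age > 30: only two of the three columns
# are scanned; "city" is skipped entirely.
result = [
    name
    for name, age in zip(columns["name"], columns["age"])
    if age > 30
]
print(result)
```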
Parquet files each contain their own metadata describing the structure of the data within. Because these files are column-focused, Parquet is able to merge multiple files together without breaking existing queries.
New columns can be merged onto the end:
If the “Joined” column is later removed upstream, old data will remain and new data will stop appearing:
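A plain-Python sketch of that merge behaviour (Parquet does this via its file metadata; the data here is made up): old rows keep their “joined” values and new rows get nulls.

```python
# Two "files" with different schemas: the newer one has lost the
# "joined" column. Merging on the union of columns keeps old data
# and records nulls where a file has no value for a column.
old_file = [{"name": "Alice", "age": 34, "joined": "2019-03-01"}]
new_file = [{"name": "Dana", "age": 52}]

all_columns = ["name", "age", "joined"]
merged = [
    {col: row.get(col) for col in all_columns}
    for row in old_file + new_file
]
print(merged)
```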
If you want to try this yourself, you can:
- add a CSV to an AWS S3 bucket
- configure AWS Glue to consume this CSV and transform it to Parquet
- use AWS Athena to query the Parquet data from within the AWS Console
- set up automatic transformations so that a new CSV will update the Parquet dataset. You can use AWS S3 Event Notifications, AWS SQS and AWS SNS for this.