An Introduction to Big Data Formats
Innovative, data-centric companies are increasingly relying on big data formats like Avro, Parquet, and ORC. At Nexla, we’ve seen that more and more companies are struggling to get a handle on these formats in their data operations. Some need to convert JSON logs into Parquet for use in Amazon Athena. Some need to convert web or mobile event data in Avro files to CSV to feed into other business processes. Most commonly we hear about Avro to JSON and JSON to Avro, but Avro to Parquet and Parquet to Avro are not rare either. Despite their increasing usefulness, few data professionals outside of database experts have a clear understanding of these formats. To help all audiences understand these building blocks of big data, we wrote the whitepaper, An Introduction to Big Data Formats.
What are Avro, Parquet, and ORC?
These formats are optimized for queries while minimizing costs for vast quantities of data. Companies use them to power machine learning, advanced analytics, and business processes. They’re common inputs into big data query tools like Amazon Athena, Spark, and Hive.
But what exactly are Avro, Parquet, and ORC? How do you decide which of these formats is right for the job? And what do you do when your data is not in the optimal format? If you’re not a database expert, the choices and nuances of big data formats can be overwhelming. We invite you to read the whitepaper to sharpen your big data knowledge and understand:
- Why different formats emerged, and some of the trade-offs required when choosing a format
- The evolution of data formats and ideal use cases for each type
- Why analysts and engineers may prefer certain formats – and what “Avro,” “Parquet,” and “ORC” mean
- The challenges involved in converting formats and how to overcome them
An Evaluation Framework for Avro, Parquet, and ORC
In the paper, we discuss a basic evaluation framework for deciding which big data format is right for any given job. The framework comprises four key considerations:
- Row vs. column orientation
- Schema evolution support
- Compression
- Splittability
If you need a refresher (or an introduction) to row- vs. column-based data stores, the paper will be a worthwhile read. Once you’ve determined whether you need data stored in rows or columns, we discuss the other important considerations. Schema evolution is a key topic that rarely gets the attention it deserves outside of engineering circles. Analysts, data scientists, and business users of data would be wise to brush up on it to prevent pain down the road. Finally, the framework discusses compression and splittability, two considerations that weigh heavily on performance and cost.
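To make the row-vs-column distinction concrete, here is a minimal, stdlib-only Python sketch. The records, field names, and the "total revenue" query are invented for illustration; real row stores (Avro, CSV) and column stores (Parquet, ORC) add encoding, compression, and metadata on top of this basic idea.

```python
# Illustrative event records (hypothetical data, not from any real system).
records = [
    {"user_id": 1, "event": "click", "revenue": 0.0},
    {"user_id": 2, "event": "purchase", "revenue": 9.99},
    {"user_id": 3, "event": "click", "revenue": 0.0},
]

# Row-oriented layout: each record is stored contiguously,
# roughly how Avro or CSV lays data out on disk.
row_store = [tuple(r.values()) for r in records]

# Column-oriented layout: each field is stored contiguously,
# roughly how Parquet or ORC lays data out on disk.
column_store = {field: [r[field] for r in records] for field in records[0]}

# An analytic query like "total revenue" scans one column in the
# columnar layout, instead of touching every field of every record.
total_revenue = sum(column_store["revenue"])
```

The same data is present in both layouts; the difference is access pattern. Writing or fetching a whole record favors the row layout, while aggregating one field across millions of records favors the column layout, which is why analysts tend to reach for columnar formats.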
Converting Data Formats
In an ideal world, you’d always choose the data format that was right for your use case and infrastructure. However, sometimes we don’t get to decide how we receive the data we need to work with. Data may arrive in any format—CSV, JSON, XML, or one of the big data formats we discussed. Converting data from the incoming format to the one optimally suited for a specific processing need can be a laborious process. It may include detecting, evolving, or modifying schemas, combining or splitting files, and applying partitioning. This is in addition to managing the difference between the frequency of incoming data and the desired output frequency. All things considered, converting data formats can significantly increase workloads.
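As a rough sketch of the schema-detection and schema-evolution steps above, the stdlib-only Python below infers a schema from incoming JSON lines, back-fills fields missing from older records, and writes a normalized output. The sample records are hypothetical, and CSV stands in for the target format; a Parquet writer (e.g. via pyarrow) would replace the last step in a real pipeline.

```python
import csv
import io
import json

# Hypothetical incoming JSON lines; a real pipeline would read a file or stream.
raw = """
{"user_id": 1, "event": "click"}
{"user_id": 2, "event": "purchase", "revenue": 9.99}
""".strip()

records = [json.loads(line) for line in raw.splitlines()]

# Detect the schema: the union of fields seen across all records.
fields = sorted({key for record in records for key in record})

# "Evolve" older records by filling newly added fields with a default (None).
normalized = [{f: record.get(f) for f in fields} for record in records]

# Convert to the output format (CSV here; a Parquet writer would slot in).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
writer.writerows(normalized)
print(buf.getvalue())
```

Even this toy version has to make policy decisions—what default to use, whether to sort or preserve field order—which hints at why conversion pipelines grow complicated once partitioning, file sizing, and arrival frequency enter the picture.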
Nexla makes these data format conversions easy. Point Nexla to any source—such as a datastore with Avro files—and Nexla can extract, transform, and convert the data into the preferred format. Companies use this capability to convert JSON CloudTrail logs into Parquet for use in Amazon Athena, or ingest Avro event data to process into database tables. Perhaps your system outputs data into Avro but you have a machine learning project that could benefit from Parquet. No matter how you’re getting the data, with Nexla you can easily create the pipeline to convert it into the format that works for you.
Download An Introduction to Big Data Formats and improve your big data IQ.