Humanity’s fight against Coronavirus depends on the speed and precision with which we can take defensive action like social distancing, as well as fight the disease through medical advancement. Data can help on both fronts, and a great example is this New York Times article that studied the origin and spread of the virus using location data.
At Nexla, we are committed to helping this cause. We have been providing our data integration tools as well as our expertise to teams that are contributing to the global effort to fight COVID-19. In addition, we are trying to help make public data more accessible and easier to use for researchers. Please contact us if you need help with your research on COVID-19. We are providing pro bono help to qualified researchers in the form of our software by committing a portion of our compute capacity and team’s time.
As in a typical data science or analytics process, researchers are getting data from various public and non-public sources. The first step for them in using this data is to create a pipeline from each data source into the data system (Spark, Athena, Snowflake, BigQuery, etc.) where they will run their analysis, models, and algorithms. The pipeline often includes significant and unavoidable work such as changing formats, reshaping data structures, handling errors, and delivering clean data. Keeping the process running continuously as unreliable or changing data arrives over time is even harder.
An example of highly valuable research data being made available is the COVID-19 Open Research Dataset (CORD-19) provided by the Allen Institute for AI. One of the ways we are helping researchers is by making such public data available in a variety of formats, so that it is ready to use in the data applications of their choice.
Fill out the form here to download* the CORD-19 data in Parquet, Avro, or JSONL formats. *Please be sure to read the terms and conditions of the data provider.
Continue reading if you would like to learn more or need help with additional data …
Source Data and Data Structure
The source data here has the complete contents of medical research papers on Coronavirus. The entire dataset comprises nearly 30,000 files, and new files are being added periodically. The data providers have done the heavy lifting of converting the information in research papers into structured data, making it a gold mine for researchers, who can now apply various data tools to analyze it. Each research paper is expressed as a single JSON record contained in a single file. This has the benefit of presenting the entire contents of the paper in an object structure with fields such as authors, keywords, sections, references, etc. As a result, all the information within each paper is well organized for analysis and cross-reference.
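To make the one-record-per-paper structure concrete, here is a minimal Python sketch of reading a single paper record and pulling out a few fields. The record below is made up for illustration, and its field names (title, authors, keywords, sections, references) simply mirror the description above; they are assumptions, not the actual CORD-19 schema, so check the real files before relying on them.

```python
import json

# Hypothetical paper record. The field names follow the prose description
# above and are NOT taken from the actual CORD-19 schema.
SAMPLE_PAPER_JSON = """
{
  "title": "Example paper",
  "authors": ["A. Author", "B. Author"],
  "keywords": ["coronavirus"],
  "sections": [{"heading": "Introduction", "text": "..."}],
  "references": [{"title": "Earlier work"}]
}
"""


def summarize_paper(paper):
    """Return a small summary of one paper record, tolerating missing fields."""
    return {
        "title": paper.get("title", ""),
        "author_count": len(paper.get("authors", [])),
        "section_headings": [s.get("heading", "") for s in paper.get("sections", [])],
        "reference_count": len(paper.get("references", [])),
    }


sample_paper = json.loads(SAMPLE_PAPER_JSON)
summary = summarize_paper(sample_paper)
print(summary)
```

Because every paper is a self-contained object, the same function could be applied to each file in turn, which is what makes the dataset convenient to process in bulk.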
Format Conversion Benefits
The source data here is a conversion of unstructured data, in the form of research papers, into structured data as JSON objects. Here are some of the benefits of the format conversion we have provided:
- Number of files: Thousands of individual source files were combined into a single file, which makes it easier to load into many tools
- File Size: The source data was in plain-text JSON. Converting to binary formats like Avro and Parquet adds the benefit of compression
- We combined data across files to massively reduce the number of files anyone would need to process
- During our processing, we partitioned the data so that any single file stays at or below roughly 256 MB
- Formats: Data is made available in Parquet, Avro, and JSONL formats. To learn more about different data formats and their benefits, read our white paper, Introduction to Big Data Formats: Understanding Avro, Parquet, and ORC
- Schema Management: Both Avro and Parquet contain the data schema within the file
- Change management: New data is periodically being added on the source website.
- Processing speed: Queries can be parallelized and typically run much faster on Parquet. For example, a search for all papers by a particular author would run significantly faster on Parquet than on the original source data
- Transforming Data: The data and data structure are unchanged from the source in the files we have provided. However, some users may need to structurally modify or enrich the data as an intermediate step. For help, please reach out to firstname.lastname@example.org
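The file-combining and partitioning idea from the list above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library and JSONL output, not a description of how Nexla's pipeline is implemented. The function name write_jsonl_partitions, the file-naming pattern, and the sample records are all hypothetical, and the size cap is applied to uncompressed bytes.

```python
import json
import os
import tempfile


def write_jsonl_partitions(records, out_dir, max_bytes, prefix="part"):
    """Write records as JSONL, starting a new file before max_bytes is exceeded.

    A single record larger than max_bytes still gets written, alone in its
    own file, so the cap is a target rather than a hard guarantee.
    """
    paths = []
    current = None
    current_size = 0
    for record in records:
        line = (json.dumps(record, ensure_ascii=False) + "\n").encode("utf-8")
        would_overflow = current_size > 0 and current_size + len(line) > max_bytes
        if current is None or would_overflow:
            if current is not None:
                current.close()
            path = os.path.join(out_dir, f"{prefix}-{len(paths):05d}.jsonl")
            current = open(path, "wb")
            paths.append(path)
            current_size = 0
        current.write(line)
        current_size += len(line)
    if current is not None:
        current.close()
    return paths


# Tiny demo with made-up records and a deliberately small cap.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_records = [{"paper_id": i, "text": "x" * 50} for i in range(10)]
    demo_paths = write_jsonl_partitions(demo_records, demo_dir, max_bytes=200)
    demo_sizes = [os.path.getsize(p) for p in demo_paths]
    print(len(demo_paths), "files with sizes", demo_sizes)
```

For a 256 MB target like the one mentioned above, max_bytes would be 256 * 1024 * 1024. Writing Avro or Parquet instead of JSONL would need a third-party library and is left out of this sketch.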
We are here to help
We will continue to proactively find public datasets and help make them more accessible. If you are a researcher working on COVID-19, let us know how we can help you. We are providing pro bono help to qualified researchers in the form of our software by committing a portion of our compute capacity and team’s time. Please contact us at email@example.com