
Standardizing Data Delivery with Data as a Product

On Dwight Eisenhower’s first day in the White House as President, an assistant handed him two unopened letters. Ike responded, “Never bring me a sealed envelope.” He only wanted to consume vetted information.

Like Ike, modern enterprises only want to consume vetted information. But their data engineers struggle to deliver the timely, accurate data this requires, because they must craft data pipelines across proliferating sources, targets, users, use cases, and platforms. They need to standardize and automate data delivery to reduce this complexity and avoid persistent performance and quality issues.

Automated pipeline and quality tools can help, but by themselves they cannot offer the necessary standardization.

Data as a Product Architecture

Data as a Product

A new way to standardize and automate data delivery is to treat data as a product. You can view a data product as a modular package that data teams create, use, curate, and reuse with no scripting. The data product enables data engineers to become more productive. It also empowers data scientists, data analysts, developers, or even data-savvy business managers to consume data without the help of data engineers. This blog explores what that looks like in a pipeline architecture.

The data product includes transformation logic, metadata, and schema, along with views of the data itself. Let’s break this down.

Transformation logic prepares a data product for consumption by combining its data from multiple sources, filtering out unneeded records or values, re-formatting data, and validating its accuracy. The transformation logic also might enrich a data product by correlating it with relevant third-party data. Finally, it might obfuscate sensitive data by masking those values.
Metadata—such as the name of a file, its characteristics, and its lineage—describes data so that users know where and how they might use it. Users might tag a data product with additional metadata, for example to describe its purpose and relevance to a given project.
Schema structures the data for consumption. For example, a schema defines how the rows and columns of a SQL table, or the columns of an Apache Parquet file, relate to one another. By structuring the data, the schema helps applications, tools, and users consume it.
Views, as you’d expect, present the data to those applications, tools, and users for consumption. These views are logical representations of underlying physical data. Multiple views might apply to a given physical data set, with each view offering a distinct combination or slice of the data.

The data product packages together transformation logic, metadata, schema, and views of the data itself
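
To make the packaging concrete, here is a minimal Python sketch of how the four components might sit together in one object. Everything in it is an assumption made for illustration: the `DataProduct` class, its field names, the `mask_email` transformation, and the sample `orders` data are hypothetical and do not come from any particular tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class DataProduct:
    """Hypothetical container bundling logic, metadata, schema, and views."""
    name: str
    schema: Dict[str, type]                      # column name -> expected type
    transform: Callable[[List[Row]], List[Row]]  # transformation logic
    metadata: Dict[str, str] = field(default_factory=dict)
    views: Dict[str, Callable[[List[Row]], List[Row]]] = field(default_factory=dict)
    rows: List[Row] = field(default_factory=list)

    def load(self, raw_rows: List[Row]) -> None:
        """Apply the transformation logic, check the schema, store the rows."""
        prepared = self.transform(raw_rows)
        for row in prepared:
            for column, expected in self.schema.items():
                if not isinstance(row.get(column), expected):
                    raise ValueError(f"{column!r} is not {expected.__name__}: {row}")
        self.rows = prepared

    def view(self, view_name: str) -> List[Row]:
        """Return a logical slice of the stored rows."""
        return self.views[view_name](self.rows)


def mask_email(rows: List[Row]) -> List[Row]:
    """Filter out rows with no amount and mask the sensitive email column."""
    return [
        {**row, "email": "***"}
        for row in rows
        if row.get("amount") is not None
    ]


orders = DataProduct(
    name="orders",
    schema={"order_id": int, "amount": float, "email": str},
    transform=mask_email,
    metadata={"lineage": "crm.orders", "purpose": "revenue reporting"},
    views={"large_orders": lambda rows: [r for r in rows if r["amount"] >= 100.0]},
)
orders.load([
    {"order_id": 1, "amount": 250.0, "email": "a@example.com"},
    {"order_id": 2, "amount": None, "email": "b@example.com"},
    {"order_id": 3, "amount": 40.0, "email": "c@example.com"},
])
```

In this sketch a consumer calls `orders.view("large_orders")` and never touches the transformation logic, which is the separation the data product is meant to provide.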

Implemented well, the data product automatically incorporates changes to ensure users maintain an accurate, consistent picture of the business at all times. These changes might include new rows in a table, a new tag from a user, or an altered schema from a source. The data product provides monitoring capabilities to keep users informed of material changes. For example, if new rows in a table fail a validation test, that should trigger an alert. A viable data product offering also includes role-based access controls that authenticate users and authorize the actions they take on the data.
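
The validation alert and the authorization check could look roughly like the sketch below. It is illustrative only: the `ROLE_PERMISSIONS` mapping, the role names, and the `validate_new_rows` helper are hypothetical, and the sketch covers authorization but not authentication.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, object]

# Hypothetical role-to-action mapping; a real product would define its own roles.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "analyst": {"read"},
    "engineer": {"read", "write"},
}


def is_authorized(role: str, action: str) -> bool:
    """Return True if the role may perform the action on the data product."""
    return action in ROLE_PERMISSIONS.get(role, set())


def validate_new_rows(
    rows: List[Row],
    rule: Callable[[Row], bool],
    alert: Callable[[str], None],
) -> List[Row]:
    """Keep rows that pass the rule and raise an alert for each row that fails."""
    accepted: List[Row] = []
    for row in rows:
        if rule(row):
            accepted.append(row)
        else:
            alert(f"validation failed: {row}")
    return accepted


alerts: List[str] = []  # stands in for a dashboard or notification channel
accepted = validate_new_rows(
    [{"order_id": 1, "amount": 25.0}, {"order_id": 2, "amount": -5.0}],
    rule=lambda row: row["amount"] >= 0,
    alert=alerts.append,
)
```

Here the negative amount is rejected and produces one alert, while the valid row is accepted.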

Data Pipelines

Data products should support all the major types of data pipelines. You can view this in three dimensions: the sequence of tasks (ETL vs. ELT), latency (batch vs. stream), and delivery method (push vs. pull).

ETL and ELT. A data pipeline can extract data from a source, transform it, then load it to a target in an ETL sequence. Another popular sequence is ELT, which extracts, loads, and then transforms data on the target. Whatever the sequence, here is how those tasks apply to a data product.
- They extract data, metadata, and source schemas.
- They transform the data using the transformation logic described above.
- They load these components, including the data views, as a bundled data product to the target.
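
The difference between the two sequences comes down to where the transformation runs. The following Python sketch shows one way to picture it, using in-memory lists and dictionaries in place of real sources and targets. The function names (`extract`, `run_etl`, `run_elt`, `drop_refunds`) are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Transform = Callable[[List[Row]], List[Row]]
Target = Dict[str, List[Row]]


def extract(source: List[Row]) -> List[Row]:
    """Copy rows out of the source (a list stands in for a database)."""
    return [dict(row) for row in source]


def run_etl(source: List[Row], transform: Transform, target: Target) -> None:
    """Extract, transform inside the pipeline, then load the prepared rows."""
    target["rows"] = transform(extract(source))


def run_elt(source: List[Row], transform: Transform, target: Target) -> None:
    """Extract, load the raw rows, then transform them on the target."""
    target["raw"] = extract(source)
    target["rows"] = transform(target["raw"])


def drop_refunds(rows: List[Row]) -> List[Row]:
    """Example transformation logic: filter out unneeded records."""
    return [row for row in rows if row["amount"] > 0]


source_rows = [{"id": 1, "amount": 30.0}, {"id": 2, "amount": -30.0}]
etl_target: Target = {}
elt_target: Target = {}
run_etl(source_rows, drop_refunds, etl_target)
run_elt(source_rows, drop_refunds, elt_target)
```

Both targets end up with the same prepared rows. The ELT target also retains the raw copy, because it loaded the data before transforming it.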
Batch and stream. A data pipeline can process full batches of data on a scheduled basis, say every hour or day, or continuously process incremental updates within a data stream. Because stream processing handles only the changes rather than reprocessing full data sets, it typically consumes fewer resources per run and can meet real-time business needs. Whatever the latency, the data product should automatically absorb incoming data, metadata, and schema changes. Users should see these changes via dashboards or alerts so they can adjust transformation logic or various settings as needed.
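
As a rough illustration of the latency dimension, the sketch below contrasts a batch job that rescans every row with an incremental consumer that tracks the last row it has seen. The `batch_total` function, the `IncrementalTotal` class, and the assumption that rows carry an increasing `id` are all hypothetical simplifications of real change capture.

```python
from typing import Dict, List

Row = Dict[str, float]


def batch_total(source: List[Row]) -> float:
    """Batch style: reprocess every row in the source on each scheduled run."""
    return sum(row["amount"] for row in source)


class IncrementalTotal:
    """Stream style: process only rows that arrived since the last call."""

    def __init__(self) -> None:
        self.total = 0.0
        self.last_seen_id = 0.0

    def consume(self, source: List[Row]) -> int:
        """Absorb new rows and return how many were processed."""
        new_rows = [row for row in source if row["id"] > self.last_seen_id]
        for row in new_rows:
            self.total += row["amount"]
            self.last_seen_id = max(self.last_seen_id, row["id"])
        return len(new_rows)


source_rows: List[Row] = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 5.5}]
stream = IncrementalTotal()
first_pass = stream.consume(source_rows)   # processes both existing rows

source_rows.append({"id": 3, "amount": 4.5})
second_pass = stream.consume(source_rows)  # processes only the new row
```

After the new row arrives, the incremental consumer touches one row, while `batch_total(source_rows)` would scan all three to reach the same total.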
Push and pull. A data pipeline can push data to a target, meaning that it delivers a batch or stream based on an event or rule. For example, a source database might register a customer transaction, prompting a traditional ETL or ELT pipeline to push that transaction record to a target such as a cloud data platform. Alternatively, an application programming interface (API) on the target might pull data from the source based on a consumption need. For example, an API pull service might fetch a customer’s transaction history because a SaaS application user requests it. This second scenario is also known as Data as a Service.
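
The two delivery methods can be sketched in a few lines of Python. This is a simplified illustration and not a real API: the `SourceDatabase` class, its `subscribe`, `register`, and `history` methods, and the `pull_history` function are hypothetical stand-ins for a source system, an event-driven pipeline, and an API pull service.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class SourceDatabase:
    """Hypothetical source that supports both push and pull delivery."""

    def __init__(self) -> None:
        self.transactions: List[Record] = []
        self._subscribers: List[Callable[[Record], None]] = []

    def subscribe(self, callback: Callable[[Record], None]) -> None:
        """Register a pipeline that should receive each new transaction."""
        self._subscribers.append(callback)

    def register(self, transaction: Record) -> None:
        """Store a transaction and push it to every subscribed pipeline."""
        self.transactions.append(transaction)
        for callback in self._subscribers:
            callback(transaction)

    def history(self, customer_id: str) -> List[Record]:
        """Return the stored transactions for one customer."""
        return [t for t in self.transactions if t["customer_id"] == customer_id]


def pull_history(source: SourceDatabase, customer_id: str) -> List[Record]:
    """Pull style: fetch data only when a consumer requests it."""
    return source.history(customer_id)


# Push: the event of registering a transaction drives delivery to the target.
platform_records: List[Record] = []  # stands in for a cloud data platform
source = SourceDatabase()
source.subscribe(platform_records.append)
source.register({"customer_id": "c1", "amount": 20.0})
source.register({"customer_id": "c2", "amount": 35.0})

# Pull: a consumer request drives delivery.
c1_history = pull_history(source, "c1")
```

In the push case the target receives every record as it is registered. In the pull case nothing moves until a request names the customer of interest.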

Consumption

There are many ways to access and consume a data product from a target such as a cloud data platform or cloud database. To support analytics, a data scientist or data analyst might use an analytics tool to discover and query the data product. To support operations or embedded analytics, a software as a service (SaaS) application might consume the data product through an API pull service. These scenarios enable enterprise workloads to consume data in a standardized and automated fashion.

This is the vision. Can enterprises make it work? Check out Eckerson Group’s webinar with Nexla on January 26 as we explore this question.