Data Operations (DataOps) has received a lot of attention in recent months, but the hype seems to outstrip the reality. When we say Factory Operations, we imagine a state where people, process, and technology work in harmony, creating high-quality products at very large scale. Similarly, from Data Operations we want a system that takes raw or dirty data from anywhere as input and creates high-quality, ready-to-use data that delivers business value for a nearly limitless number of data-driven applications: analytics, data science, AI, automation, and more.
The reality is less exciting and somewhat sobering. DataOps is harder than it looks on the surface, and DataOps solutions have mostly been built for highly technical data engineers delivering pipelines for data science. DataOps delivered in this manner barely scratches the surface of the unfilled demand for data.
There are really two fundamental problems with the current state of DataOps. First is the technical bias that ignores the needs of self-service data analysts. How can we imagine any scalable operations when a large set of people cannot perform at their full capacity because they are dependent instead of self-sufficient? Second is the singular focus on data pipelines. Data delivery is absolutely an important intermediate goal. But business impact is the meaningful end point—people and applications using the data to drive positive business results.
According to Wikipedia, “DataOps is an automated, process-oriented methodology, used by analytics and data teams to improve the quality and reduce the cycle time of data analytics.” Wikipedia goes on to say that DataOps includes agile methodology, and that it uses statistical process control (SPC) to monitor and control data pipelines. Herein lies the problem. The current practice is a complex methodology that focuses first on monitoring and control. DataOps in its current state is arguably well-suited to data science needs. Data scientists, however, are a very small part of the population that needs to access, transform, share, exchange, and analyze data, and they are the slowest-growing segment of that population. Self-service data analysts (business people using Tableau, Qlik, Power BI, Excel, and other business-friendly tools) are the majority and the fastest-growing group of data analysts.
It is time to rethink and redefine DataOps by looking at the whole picture from an idea through a well-run data pipeline to data that becomes part of critical business processes. We need to move to a new generation of DataOps that truly innovates with new discovery processes, automated platforms, and design for use by all people in the organization:
DataOps is a combination of processes, platforms, and people working together to manage data flow complexities throughout and across the organization, to produce and operate reliable, predictable, and scalable data pipelines for all who work with data, to integrate data and facilitate data sharing, and to ultimately derive business value from data through analytics, automation, and machine learning.
Let’s unpack this definition to examine all of the components of this new approach to DataOps. Figure 1 illustrates the scope of processes, platforms, and people that must be combined and must work together for DataOps success.
The New DataOps: Complete Process Coverage from Idea to Delivery
The processes in the scope of DataOps are those that are essential to extract business value from data. The five stages of DataOps (discovery, integration, preparation, deployment, and monitoring), each with its specific processes, work together to turn an idea into a system operating at scale.
Discovery is necessary to know what data is available and what is possible. At every company I visit, the discovery processes are overwhelmed and backlogged because modern data sources and formats are inaccessible and unusable by the business without data engineering support. In effect, you have to do some integration work to perform discovery and define your requirements. Making data discovery accessible to all improves ideation, produces better specifications for IT, and increases efficiency of data integration and data preparation.
Integration consolidates data for sharing, reuse, automation, and creation of user-ready datasets. Integration processes break down data silos and resolve inconsistencies and conflicts that exist in disparate data. However, these processes are increasingly expensive because of the shortage of data engineers. Building them is costly and time-consuming because manual coding is typically required to do the work. With more data sources and more data consumers driving an increasing need for data integration, the problem is only getting worse. One recent study shows that unfilled job openings for data engineers are 12 times greater than those for data scientists. The data engineering shortage won’t be resolved by hiring and training more engineers. The only practical solution is democratization of data integration, enabled through automation.
Preparation processes improve, enrich, blend, and format data for analysis and reporting. Business users know the data, while data engineers know the systems. Collaboration and communication are the main challenges in this phase because most business users can’t code and must hand their requirements to someone who can. As one user said, if you know the requirement, the work isn’t that bad. Providing a collaborative workspace for engineers and non-engineers allows for faster prototyping and fewer iterations to get the job done right.
Deployment is time-consuming, and small changes can have significant impacts. You need to understand what will break before you make a change. Deployment without a systematic approach that includes version management is a high-risk scenario that is likely to result in outages, disruptions, and potentially severe business impacts.
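One way to answer the question of what will break is to record which datasets feed which, and then walk downstream from the dataset you plan to change. The following is a minimal sketch under stated assumptions, not a description of any particular DataOps product: the dataset names in LINEAGE and the downstream_of helper are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical lineage map: each key feeds the datasets listed as its values.
LINEAGE = {
    "crm_raw": ["customers_clean"],
    "orders_raw": ["orders_clean"],
    "customers_clean": ["customer_orders"],
    "orders_clean": ["customer_orders", "daily_revenue"],
    "customer_orders": ["churn_dashboard"],
    "daily_revenue": [],
    "churn_dashboard": [],
}


def downstream_of(changed, lineage):
    """Return every dataset reachable from `changed`, in breadth-first order."""
    affected = []
    queue = deque(lineage.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in affected:
            continue  # already recorded via another path
        affected.append(node)
        queue.extend(lineage.get(node, []))
    return affected


if __name__ == "__main__":
    # Everything that could be affected by a change to the raw orders feed.
    for name in downstream_of("orders_raw", LINEAGE):
        print(name)
```

Running the same check before each release, against a versioned copy of the lineage map, gives a simple list of the pipelines and reports that need to be retested.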
Monitoring is a must for critical processes, yet we often don’t monitor data management processes as diligently as we should. All too frequently, examining process execution is a chore undertaken only after a specific flow has failed or suffered an outage. Modern DataOps must include serious, automated monitoring with the goals of being proactive, preventive, and predictive. It must minimize the probability and frequency of failures and provide useful information for root cause analysis when incidents do occur.
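As a rough illustration of proactive monitoring, the statistical process control idea mentioned earlier can be reduced to a simple check: compare a metric from the latest pipeline run against control limits derived from recent history. This sketch assumes the metric is a daily row count and uses a plain mean plus or minus three standard deviations, which is a simplification of a full control chart. The function names and sample numbers are hypothetical.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (lower, upper) limits as mean +/- `sigmas` sample standard deviations."""
    center = mean(history)
    spread = stdev(history)
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(value, history, sigmas=3.0):
    """True when `value` falls outside the limits computed from `history`."""
    lower, upper = control_limits(history, sigmas)
    return value < lower or value > upper


if __name__ == "__main__":
    # Hypothetical row counts from the last five successful runs.
    recent_row_counts = [1000, 1020, 980, 1010, 990]
    todays_row_count = 400

    if out_of_control(todays_row_count, recent_row_counts):
        print("Row count outside control limits; investigate before publishing.")
    else:
        print("Row count within control limits.")
```

A check like this flags a suspicious run before the data reaches consumers, instead of after someone notices a broken report.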
Figure 1. The Scope of Data Operations
The New DataOps: Bringing Together Business, Data, and People
Modern data platforms are designed to support complex data ecosystems and the many interdependent components needed to move data seamlessly from sources to consumers and, ultimately, to deliver business value from data. The platforms must provide the ability to connect to heterogeneous data sources, both internal and external, and ingest the data into a landing area. They must include features and functions that process data to various stages of refinement, with cleansing, integration, and aggregation as core capabilities. They must support the exploratory nature of data analysis with analyst sandboxes. They must deliver data in the right forms for a variety of use cases, ranging from basic reporting to leading-edge data science applications. And the platforms must support a strong data management foundation with features and functions for data governance, metadata management, master data management, platform and environment administration, and multi-cloud and hybrid infrastructure.
The human dimension of a new DataOps approach must address the needs of many different people. Old-style DataOps is geared to data engineers and data scientists, using highly technical tools designed for technically skilled people. Modern DataOps must reach beyond scientists and engineers to include self-service data consumers such as business and data analysts. DataOps tools can no longer demand programming and scripting skills as a prerequisite for building data pipelines. Low-code/no-code tools bring self-service dataflow capabilities to self-service data consumers. It is simply impractical to rely on DataOps technology that works only for a small part of the large population of data consumers. Modern DataOps must also enable those people with data management responsibilities for data curation, stewardship, and governance. These people fill important roles to make data discoverable and easy to use, and they perform critical activities to ensure security and protection of sensitive data.
The number of data consumers and the demand for data will continue to grow. Most of that growth will come from self-service data consumers. At the same time, the IT and data engineering backlog will increase. The gap between data demand and data pipeline supply will keep widening if we stay on the same DataOps path. The time has come to take a new path, a different approach to DataOps in which self-service and low-code/no-code data pipelines are a fundamental principle. Three characteristics of the new DataOps are critical to future success: automation of data management processes, inclusion of everyone who works with data, and an end-to-end view that begins with business needs and ends with business impact.