Introducing the 2018 DataOps Assessment

Jarah Euston

Every year Nexla surveys hundreds of data professionals to better understand the growth of data operations (DataOps) as a discipline. Companies are taking notice and increasingly looking to develop their DataOps capabilities. Last year we found that 70% of companies have plans to make investments in data operations—but where should they start? (Read last year’s report) In this post, we examine why companies are keen to make this investment and how they can assess their current DataOps capabilities to make sure they focus investment where it counts.

Why DataOps is Important

What is DataOps, exactly, and why are companies planning to invest in it? In a nutshell, DataOps controls the flow of data from source to value, speeding up the process of deriving value from data. Fundamentally, DataOps ensures that the processes and systems that control the data journey are scalable and repeatable. In today’s environment, the stream of data flows a company has to manage is never-ending. Because the volume, velocity, and variety of data are increasing, companies need a new way to manage this complexity. To maximize data efficiency and value creation, scalability and repeatability of data work are essential.

The activities that fall under the DataOps umbrella include integrating with data sources, performing transformations, converting data formats, and writing or delivering data to its required destination. DataOps also encompasses the monitoring and governance of data flows while ensuring security. Some of these tasks have been done for decades, but three fundamental shifts are changing the game.

1. More inter-company data collaboration: Data from outside the organization significantly increases variety (and hence complexity) in terms of schemas, formats, and delivery mechanisms (APIs, files, etc.).

2. More datasets: Growth into hundreds or thousands of unique data sources/datasets to work on. This means a need for higher operational velocity. How quickly can things get up and running? How quickly can modifications be made?

3. Emergence of real-time: Real-time data is a small but growing segment with impact. It doesn’t matter how fast our compute, network, and storage get. Software that is batch-based will not become real-time.

All these things are pushing a need for faster data movement, across thousands of data flows, with a high degree of variation. DataOps is a practice that ensures all aspects of these tasks are carried out efficiently to deliver data to where it can add value. DataOps is not a feature or a tool but a broader practice with the goal of making data work an infinitely scalable and repeatable process.

Introducing the DataOps Matrix

To help companies assess the current state of their Data Operations, we developed the DataOps Matrix. This is a helpful way to rate your capabilities across what we believe are the two most important DataOps vectors: scalability and repeatability. Once you’ve determined where your organization lies on these axes, you’ll have a clear picture of your current DataOps capabilities.

Focus on Scalability and Repeatability

To understand how the matrix works, it’s critical to align on the right definition and criteria for assessing scalability and repeatability.

Factor One: Scalability

Scalability in this context is a measure of how easily a DataOps system can grow with the volume of data, the number of data users, and operational complexity.

A data operations infrastructure that is highly scalable can handle high volumes of data and process it in near real-time. How you define “high volume” will be dependent on the industry. For example, if we consider online advertising technology, high volume can be measured in terabytes (or even petabytes) per hour. If we consider retail product data, volumes could be in gigabytes per day. While the definition of “high volume” is relative, the ability to elegantly process data at max volumes is critical.

In addition to scaling with data volumes, a DataOps infrastructure needs to scale with people. As companies ingest and send more data, the number of people who need to work with the data will only grow. Business analysts, data analysts, technical support teams, implementation teams, partnership teams and more all have people who understand the data very well but don’t have programming skills. Empowering them with tools is essential to the scalability of DataOps. If they can handle 80% of use cases with the right tools, that frees up data engineers to focus on the most complex problems.

This creates organizational leverage around data. The right tools, processes, and people as part of the DataOps solution can have a force multiplying effect.

Factor Two: Repeatability

Repeatability in this context is a measure of how easily a system can automate or repeat tasks.

In DataOps, rule-based automation is table stakes. Standard processes like ETL (extract, transform, load) fit into this category. For DataOps to move data quickly and easily, these rules must be followed. When errors occur, DataOps needs to provide alerts and potentially recommendations on how to correct the errors and reprocess the affected data. Errors can happen for any number of reasons: a partner changes a schema, unexpected values appear, or a colleague updates a downstream business process, to name a few.
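
As a rough illustration of rule-based checks with alerts, here is a minimal Python sketch. The schema, field names, and the `validate` helper are all hypothetical and chosen only for this example; they are not taken from any particular DataOps product. The sketch flags records whose fields or value types deviate from what is expected and sets them aside so they can be corrected and reprocessed.

```python
from dataclasses import dataclass, field

# Hypothetical expected schema (an assumption for this sketch):
# field name -> required Python type.
EXPECTED_SCHEMA = {"sku": str, "price": float, "quantity": int}


@dataclass
class ValidationResult:
    valid: list = field(default_factory=list)
    errors: list = field(default_factory=list)  # (row index, message) pairs


def validate(records, schema=EXPECTED_SCHEMA):
    """Apply simple rules to each record; set failures aside for reprocessing."""
    result = ValidationResult()
    for i, record in enumerate(records):
        problems = []
        missing = schema.keys() - record.keys()
        extra = record.keys() - schema.keys()
        if missing:
            problems.append(f"missing fields: {sorted(missing)}")
        if extra:
            problems.append(f"unexpected fields: {sorted(extra)}")
        for name, expected_type in schema.items():
            if name in record and not isinstance(record[name], expected_type):
                actual = type(record[name]).__name__
                problems.append(
                    f"{name}: expected {expected_type.__name__}, got {actual}"
                )
        if problems:
            result.errors.append((i, "; ".join(problems)))
        else:
            result.valid.append(record)
    return result


records = [
    {"sku": "A1", "price": 9.99, "quantity": 3},
    {"sku": "B2", "price": "n/a", "quantity": 1},  # unexpected value
    {"sku": "C3", "cost": 4.5, "quantity": 2},     # partner renamed a field
]
result = validate(records)
for index, message in result.errors:
    print(f"ALERT row {index}: {message}")
```

In a real system the alert would likely go to a monitoring channel rather than standard output, and the failed rows would be queued for reprocessing once the rule or the source is fixed.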

Sophisticated DataOps systems will be able to maintain repeatability despite heterogeneous data types and sources. The average company is processing data from flat files on FTP servers, APIs, and file sharing services like Box. The ability to easily process data from multiple sources is key to repeatability.

Because data flows are constantly being added, DataOps needs to continuously add new pipelines. Systems that can support duplicating pipelines, making edits, and pausing and activating flows provide the most repeatability. As the number of data flows increases, orchestration becomes more complex and more important. DataOps is responsible for managing the interdependence of data flows and must provide a mechanism to ensure changes to an upstream process don’t break things downstream.
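
One simple way to reason about that interdependence is to record which flows feed which, and then walk the graph before changing anything upstream. The Python sketch below assumes a hypothetical set of flow names and a hand-written dependency map; a real orchestration tool would derive this from pipeline definitions.

```python
from collections import deque

# Hypothetical flows (assumed for illustration): each key feeds the
# flows listed as its value.
DOWNSTREAM = {
    "partner_feed": ["clean_products"],
    "clean_products": ["pricing_report", "inventory_sync"],
    "pricing_report": [],
    "inventory_sync": ["warehouse_dashboard"],
    "warehouse_dashboard": [],
}


def affected_flows(changed, downstream=DOWNSTREAM):
    """Return every flow that directly or indirectly consumes `changed`,
    in breadth-first order."""
    seen = set()
    order = []
    queue = deque(downstream.get(changed, []))
    while queue:
        flow = queue.popleft()
        if flow in seen:
            continue
        seen.add(flow)
        order.append(flow)
        queue.extend(downstream.get(flow, []))
    return order


print(affected_flows("clean_products"))
# ['pricing_report', 'inventory_sync', 'warehouse_dashboard']
```

Running a check like this before editing a pipeline gives a list of the flows that should be tested, or temporarily paused, alongside the change.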

Systems need to allow for development and testing before pushing pipelines to production, where monitoring becomes critical. DataOps is at its most repeatable when it can be smart about the source connections and easily integrate and transform data. DataOps platforms that allow for shared transformations across an organization can put the power of repeatability in more hands.

For businesses that need to ingest data from and send data to an increasing number of partners, repeatability might even be more important than the ability to scale.

The DataOps Assessment

The first step to improving the data operations in your company is understanding where you are today. Based on our conversations with hundreds of companies, we’ve developed what we think is a handy matrix to understand where your company is on its DataOps journey. Depending on how your company rates on scalability and repeatability, your DataOps practice will fall into one of the four quadrants explained below. To see where you land, take the assessment.

State of the Art: High Scalability, High Repeatability

Your company recognizes the power of data and maintains ambitious goals to drive business value with data. The pain and effort typically involved with troubleshooting are minimized because you have tools and processes in place. Business users are empowered to access needed data. Key characteristics of state-of-the-art companies include:

  • You can quickly and easily integrate with many internal and external data sources
  • Systems can handle high volume and real-time data movement when needed
  • Ad-hoc engineering tasks are minimized and users are empowered to access their own data

Innovative: High Scalability, Low Repeatability

Data engineers have built an easy and efficient way to manage, prep, and clean data. Business users are empowered to access data and perform their own analyses with the occasional help from engineering. The company is currently using or experimenting with third-party data services. Key characteristics of innovative companies include:

  • The company has a process in place to integrate data smoothly with improvements on the roadmap
  • The foundation for high volume, real-time data processing is being built
  • The company has the resources to invest in building even more efficient data processes

Advanced: Low Scalability, High Repeatability

Data Operations is a recognized and valued methodology. Data engineers have built and maintain pipelines to feed data into the system of choice. Ad hoc analyses are infrequent but prioritized. Key characteristics of advanced companies include:

  • Management supports and has plans to fund continued investment in Data Operations
  • While acceptable, visibility into data quality could be improved with continued investment in tools and processes
  • There’s an opportunity to reduce the number of ad-hoc requests from business users and analysts with improved data access processes or tools

Basic: Low Scalability, Low Repeatability

Data Operations are just beginning. Data engineers run ETL (Extract-Transform-Load) jobs to join and prep data. Warehouses or data lakes are the current and desired way of storing data. Analysts build dashboards and perform ad hoc analyses as requested. Key characteristics include:

  • Data integrations with third-party partners must be scheduled and can be time-consuming
  • Relatively low data velocity and batch processing
  • Low visibility into data monitoring. Errors and troubleshooting can become significant projects
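
The four quadrants above amount to a two-by-two lookup. The Python sketch below shows that mapping under stated assumptions: the 0-to-1 scores and the 0.5 cutoff are hypothetical and are not how the actual assessment is scored; only the quadrant names come from this post.

```python
def quadrant(scalability, repeatability, threshold=0.5):
    """Map two scores in [0, 1] to a quadrant name from the DataOps Matrix.

    The score scale and the threshold are assumptions made for this sketch.
    """
    high_scale = scalability >= threshold
    high_repeat = repeatability >= threshold
    names = {
        (True, True): "State of the Art",
        (True, False): "Innovative",
        (False, True): "Advanced",
        (False, False): "Basic",
    }
    return names[(high_scale, high_repeat)]


print(quadrant(0.8, 0.3))  # Innovative
```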

Growing Your DataOps Capabilities

In this post, we introduced you to the DataOps Matrix, a handy framework for assessing your company’s DataOps maturity. We explained why scalability and repeatability are the most important factors when evaluating your current DataOps practice and designing future enhancements. Based on the Matrix, we examined the characteristics of each quadrant and their strengths and weaknesses.

We also discussed a broad definition for DataOps: Data Operations is the practice that controls the flow of data from source to value, with the goal of accelerating time to value. Its adoption is driven by three key trends: inter-company data collaboration, dataset growth, and the emergence of real-time data.

Finally, we shared the 2018 DataOps Assessment you can take right now to understand where your company sits on the matrix. If you’re not happy with where you are today, fear not. There are lots of ways to improve your DataOps infrastructure. To focus improvement efforts, it’s a good idea to decide which factor you need to optimize: scalability or repeatability. From there, evaluate your internal processes to determine if you have the tools and teams you need to be effective.

If you’d like to receive some 1:1 DataOps coaching, just reach out. We’d be happy to chat with you!