Data Operations for Cybersecurity Innovators: Five Best Practices to Use Today

Saket Saurabh

Today’s enterprises face new threats such as deepfakes and ransomware while still countering traditional ones like denial of service. It is widely recognized that countering these threats requires the ability to leverage vast amounts of diverse data and derive real-time intelligence from it. The cybersecurity industry also increasingly recognizes the importance of “the human element” and that technology alone isn’t the magic bullet for data-driven cybersecurity. This is particularly true of the data operations required to build the latest cybersecurity innovations, whether in threat detection, endpoint security, network monitoring, or security analytics.

So what should you consider when scaling up your ability to leverage data for building and adopting cutting-edge cybersecurity technology? We share some best practices from our experience working with cybersecurity innovators and helping them stay ahead of the game. Read on to also learn about enabling human collaboration as well as human-machine collaboration.

1. Leverage data from many sources

Advanced cybersecurity algorithms need data in many dimensions to enrich existing or in-house data. External data sources can provide a variety of data, including IP addresses, WHOIS records, user agents, DNS data, security advisories, and threat lists. Data providers include Pulsedive, Cymon, VirusTotal, Fingerbank, URLVoid, and hundreds more. Often there are multiple providers for similar types of data, each with varying coverage and quality, which makes it necessary to combine data from several sources. Once you have complete and accurate information, you can use it to enrich internal information such as event logs.

The Challenge:

Each data provider has:

  • Its own feed delivery. Information could arrive in files, through APIs, or even by email.
  • A unique payload structure in how it names fields and organizes data.
  • Varying data quality. For example, if the input signal is an IP address, two providers may have information of different recency and quality on the same IP.

Figure 1: Data from multiple sources

The Solution:

Tools that allow you to easily connect and unify data from multiple sources while making real-time decisions on what information to use from which provider.
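
To make the idea concrete, here is a minimal sketch of unifying two feeds and deciding which provider to trust for a given IP. The provider names, field names, and the score threshold of 80 are all invented for illustration; real providers will have their own payloads, and "most recent wins" is only one possible selection rule.

```python
from datetime import datetime, timezone

# Hypothetical payload shapes: the field names below are invented for
# illustration and do not match any real provider's API.

def normalize_provider_a(payload: dict) -> dict:
    """Map provider A's fields onto one canonical record."""
    return {
        "ip": payload["ip_address"],
        "malicious": payload["risk"] == "high",
        "last_seen": datetime.fromisoformat(payload["seen"]),
        "source": "provider_a",
    }

def normalize_provider_b(payload: dict) -> dict:
    """Map provider B's fields onto the same canonical record."""
    return {
        "ip": payload["indicator"],
        "malicious": payload["score"] >= 80,  # assumed threshold
        "last_seen": datetime.fromtimestamp(payload["updated_epoch"], tz=timezone.utc),
        "source": "provider_b",
    }

def pick_freshest(records: list) -> dict:
    """Choose the record with the most recent observation."""
    return max(records, key=lambda r: r["last_seen"])

a = normalize_provider_a(
    {"ip_address": "203.0.113.7", "risk": "high", "seen": "2024-05-01T12:00:00+00:00"}
)
b = normalize_provider_b(
    {"indicator": "203.0.113.7", "score": 35, "updated_epoch": 1714608000}
)
best = pick_freshest([a, b])
print(best["source"], best["malicious"])
```

Once every feed is mapped to the same canonical record, the selection rule becomes a one-line decision that can be swapped (for example, preferring a provider with better regional coverage) without touching the feed-specific code.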

2. Collaborate! Empower analysts, scale with the human element

The Challenge:

The promise of Data Operations is to deliver speed and reliability in data management. Achieving this is not easy, and leading cybersecurity technology companies dedicate some of their best engineers and product leaders to this mission-critical need. Unfortunately, this never-ending work can be overwhelming and can feel like grunt work even to the best engineers.

The Solution:

Empower and leverage the broader team. For example, threat analysts have all the know-how to:

  1. Expertly look at data from different providers
  2. Decide how to combine similar data from different sources
  3. Enrich data to create high quality curated datasets specifically for your technology
  4. Set up validation rules and quality checks
  5. Translate data from the structure of an external provider to your internal canonical view
  6. Further collaboration: Share datasets to enable teams to leverage the work of each other. Comments and descriptions are essential for enabling such collaboration.

What prevents threat analysts from doing this is a lack of tooling designed for business analysts. With the right tool, your threat analysts become collaborators and act as a force multiplier for your data engineering team.
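
The validation step in the list above lends itself to a declarative style that an analyst can maintain without touching pipeline code. Below is a minimal sketch, assuming a canonical record with `ip`, `confidence`, and `category` fields; those field names, the 0 to 100 confidence range, and the category set are hypothetical choices for illustration.

```python
from ipaddress import ip_address

def is_valid_ip(value) -> bool:
    """Return True if value parses as an IPv4 or IPv6 address."""
    try:
        ip_address(value)
        return True
    except ValueError:
        return False

# One rule per canonical field. The fields and allowed values are assumptions.
RULES = {
    "ip": is_valid_ip,
    "confidence": lambda v: isinstance(v, (int, float)) and 0 <= v <= 100,
    "category": lambda v: v in {"malware", "phishing", "botnet", "scanner"},
}

def validate(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means the record passed."""
    errors = []
    for field, check in RULES.items():
        if field not in record:
            errors.append(f"{field}: missing")
        elif not check(record[field]):
            errors.append(f"{field}: invalid value {record[field]!r}")
    return errors

print(validate({"ip": "203.0.113.7", "confidence": 90, "category": "malware"}))
print(validate({"ip": "999.1.1.1", "confidence": 150}))
```

Keeping the rules in a plain mapping means adding or tightening a check is a data change rather than a code change, which is the kind of task a no-code tool would expose through its UI.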

3. Create a flexible approach for delivering data to your application

Every application has its optimal way of consuming data. A database may work best for your analytics tool, a file feed for your ML training sets, and data APIs for your applications. Sometimes the same data is used in multiple applications, so staying flexible is the key to success.

Let’s consider the case of an application that needs to get data via an API. This is classic service-oriented architecture: applications depend only on the API’s response and performance and stay independent of the system that delivers the API.

Challenge 1:

Underlying data source is an API:

  • Your application calls an API which in turn calls the data provider’s API. Let’s say that data is for blacklisted IPs.
  • This data can change periodically so you need real-time, fresh data.
  • Data has to be sourced from multiple API providers in order to ensure quality and coverage. Each provider may have their own authentication and payload structure.
  • While data needs to be fresh, you may not want to call the data provider’s API too many times because of rate limits and usage-based costs. A little bit of intelligent caching can go a long way without negative impacts on your application.
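
The caching trade-off in the last bullet can be sketched as a small time-to-live (TTL) cache placed in front of the provider call. This is a minimal illustration, not a production design: `fetch` stands in for whatever function calls the provider's API, and the TTL value is an assumption you would tune against how quickly the data goes stale.

```python
import time
from typing import Callable

class TTLCache:
    """Serve cached results until they are older than ttl_seconds."""

    def __init__(self, fetch: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch      # hypothetical function that calls the provider
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the cache can be tested
        self._entries = {}       # key -> (time stored, value)

    def get(self, key: str) -> dict:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]     # still fresh: no provider call
        value = self._fetch(key) # stale or missing: call the provider
        self._entries[key] = (now, value)
        return value

def fake_provider_lookup(ip: str) -> dict:
    # Stand-in for a real, rate-limited provider API call.
    return {"ip": ip, "blacklisted": False}

cache = TTLCache(fake_provider_lookup, ttl_seconds=300)
cache.get("203.0.113.7")  # calls the provider
cache.get("203.0.113.7")  # served from the cache
```

A short TTL keeps answers close to real time while still absorbing bursts of repeated lookups for the same IP, which is where most of the rate-limit and cost savings come from.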

The Solution

An API proxy acts as a transparent layer between your application making the API call and the API provider, giving you a unified interface to the provider’s data in the format you need, without your having to write code. In any API proxy technology you should look for:

  • A self-service UI to set up any data provider API and translate its response payload.
  • No-code usability so that threat analysts can act as collaborators.
  • Ease of use in adding new providers or removing old ones.
  • Management of authentication information, with notifications when issues arise.
  • The ability to chain API calls if needed.
  • Handling of and conversion across response formats, e.g. XML to JSON.
  • An API proxy mechanism to access data.
  • Caching of recently fetched data.
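
As one example from the list above, converting a provider's XML response to JSON can be sketched with the Python standard library. The element names in the sample payload are invented, and this sketch only handles a flat record (one level of child elements); nested or repeated elements would need more care, and untrusted XML should go through a hardened parser.

```python
import json
import xml.etree.ElementTree as ET

def xml_record_to_json(xml_text: str) -> str:
    """Convert a flat XML record into a JSON object string."""
    root = ET.fromstring(xml_text)
    # Each direct child becomes one key; missing text becomes an empty string.
    record = {child.tag: (child.text or "").strip() for child in root}
    return json.dumps(record, sort_keys=True)

# Hypothetical provider response.
sample = "<indicator><ip>198.51.100.4</ip><verdict>malicious</verdict></indicator>"
print(xml_record_to_json(sample))
```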

Figure 2: API Proxy

Challenge 2:

Underlying data source is a Database, File, Email, or other static source.

  • API response needs to come from the underlying source data
  • Performance management becomes even more critical
  • Change management is also needed for data that has a periodic update via a file
  • As a user you need to design the API response format, such as JSON, to fit your needs, since the underlying data most likely has a flat structure

The Solution

Few companies offer a unified solution for this. Nexla is one of them. An ideal solution should have:

  • Automated generation of API from any underlying data source
  • A no-code UI to design the API response
  • Change management to keep the data fresh and caching to keep the API fast
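
Designing the API response on top of flat source data usually means grouping rows into a nested structure. The sketch below assumes a CSV file feed with `ip`, `category`, and `first_seen` columns; the column names and the response shape are hypothetical, chosen only to show the flat-to-nested step.

```python
import csv
import io
from collections import defaultdict

def build_response(csv_text: str) -> dict:
    """Group flat CSV rows by IP into a nested, JSON-ready response."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["ip"]].append(
            {"category": row["category"], "first_seen": row["first_seen"]}
        )
    return {
        "indicators": [
            {"ip": ip, "sightings": sightings} for ip, sightings in grouped.items()
        ]
    }

# Hypothetical file feed contents.
feed = (
    "ip,category,first_seen\n"
    "203.0.113.7,malware,2024-01-01\n"
    "203.0.113.7,botnet,2024-02-01\n"
    "198.51.100.4,phishing,2024-03-01\n"
)
response = build_response(feed)
print(len(response["indicators"]))
```

When a new file arrives, rebuilding this structure and replacing the cached response is the change-management step that keeps the API both fresh and fast.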

Figure 3: Delivering data in multiple ways

Additional Data Integrations: In addition to API access, applications need traditional ETL and ELT data integration mechanisms to support analytics, search, and machine learning use cases. We will cover this in greater detail in a follow-on post.

4. Monitor, manage, and stay on top of Data Quality

The Challenge

The quality of data providers can vary, especially in the amount of coverage across geographic regions or categories of data. It would be excellent if you could consistently rely on one provider for every dimension of a particular data set, but in reality a patchwork of providers will more likely be needed. Provider data quality can also vary over time, which makes monitoring your data providers for quality changes a best practice.

The Solution

Modern data-driven tools give you one consistent interface for merging data from multiple providers and automatically monitor your data and its changing quality. In addition, they allow you to set your own rules and thresholds. These are all tasks that have traditionally needed engineering. Look for design-oriented tools where the process can be code-free with instantaneous updates. Best-of-breed tools will let you extend the no-code functionality with plugin code modules, once again enabling collaboration and reusability between engineers and the rest of the organization.
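
A rule with a threshold can be as simple as tracking what fraction of lookups a provider actually answers. The sketch below is one possible metric, not a full monitoring system; the 80% default threshold and the convention that `None` means "provider had no data" are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class QualityReport:
    provider: str
    coverage: float  # fraction of lookups the provider answered
    ok: bool         # True when coverage meets the threshold

def check_coverage(provider: str, lookups: list, min_coverage: float = 0.8) -> QualityReport:
    """Compute coverage over recent lookups, where None means no data was returned."""
    answered = sum(1 for result in lookups if result is not None)
    coverage = answered / len(lookups) if lookups else 0.0
    return QualityReport(provider, coverage, coverage >= min_coverage)

# Hypothetical results from four recent lookups against one provider.
report = check_coverage("provider_a", [{"ip": "203.0.113.7"}, None, {"ip": "198.51.100.4"}, {"ip": "192.0.2.1"}])
print(report)
```

Running the same check per region or per data category, and over time, surfaces the coverage gaps and quality drift described above.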

5. Don’t get tied down to a specific data execution engine

Companies building cutting-edge cybersecurity technology need to leverage a wide variety of data of varying size, scale, and quality. That is why, when it comes to the underlying execution engine, they may need:

  • A real-time system for API data 
  • A streaming system such as Kafka for events and logs
  • A batch system such as Spark or MapReduce for large-scale data

The Challenge

Determining the right execution engine is often not easy. Having access to multiple execution engines is even harder, as it requires dedicated infrastructure and operations expertise. This is in addition to the custom engineering needed to plug data into a particular execution engine.

The Solution

Data teams need the ability to use the execution engine that is best suited for a given use case. Systems like Nexla encapsulate multiple execution engines. This approach has three benefits. 

  1. The entire execution process is taken care of automatically, and the user doesn’t need to deploy, manage, or monitor anything.
  2. The user doesn’t have to worry about which execution engine to use because the system can make that decision. 
  3. Most importantly, the user’s workflow or interface doesn’t change because they are working with more data, a different format, or streaming instead of batch.

Managed data execution with full flexibility again empowers the non-engineers. They can make decisions about what to do with data without having expertise in the underlying data systems.
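
To illustrate what "the system can make that decision" might look like, here is a deliberately simple rule-based sketch. It is not how Nexla or any specific product chooses an engine; the source kinds, the engine labels, and the 1 GB batch threshold are all hypothetical.

```python
from typing import Optional

# Hypothetical threshold above which a static source is handed to the batch engine.
BATCH_THRESHOLD_BYTES = 1_000_000_000

def choose_engine(source_kind: str, size_bytes: Optional[int] = None) -> str:
    """Pick an execution engine label from simple workload characteristics."""
    if source_kind == "api":
        return "real-time"
    if source_kind in {"event", "log"}:
        return "streaming"
    if size_bytes is not None and size_bytes >= BATCH_THRESHOLD_BYTES:
        return "batch"
    # Assumption: small static sources are cheap enough to process immediately.
    return "real-time"

print(choose_engine("api"))
print(choose_engine("log"))
print(choose_engine("file", size_bytes=5_000_000_000))
```

The point of the sketch is that the caller describes the data, not the infrastructure, so the user-facing workflow stays the same whichever engine ends up running the job.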

Summary: By adopting a well-thought-out strategy for managing data and its ongoing operations, companies building cybersecurity technology can accelerate their pace of innovation and stay nimble for future change. Agile data integration, automated operations, adaptive data access mechanisms, and collaborative ease of use across the team are all key elements of any solution.

The Nexla Workflow for Cybersecurity Innovators

The Nexla platform is built to help teams leverage data for intelligent cybersecurity technology.

  • Integrate with any cybersecurity data source. 
    • Analysts configure. Nexla handles all complexities and limitations including authentication, rate-limits, etc. to deliver the latest data.
    • Pre-built configurations provided by Nexla for popular data sources
  • Self-serve tools for analysts. No data engineering necessary
    • Analysts set up no-code transformations
    • Enrich data with internal or third party information
    • Modify and normalize data schema/model
    • Set validation rules and error conditions
  • Share, reuse, and collaborate on datasets, transformations, and error management. Create shared knowledge through comments, notes, and annotations.
  • Data integration flows including ETL, ELT, API Proxy, and Push API
  • Managed data execution engine including Kafka, Spark, and real-time with automatically monitored data streams and pipelines

Interested in learning more? Get an introductory demo to Nexla.