Data Engineering Podcast

Angela Chen

Transcript:

  1. Introduction 1:51 – 2:17

Saket: Hi, my name is Saket, I’m the Co-founder and CEO at Nexla. I’m an engineer by background but eventually went towards the business side. I’ve been in the world of data for almost 12 years now. 

Avinash: Hi, I’m Avinash, Co-founder and CTO at Nexla. I worked with Saket on his previous start-up, which was in the mobile ad-tech space. Before that, I spent a number of years building data systems in financial services and service management platforms. 

  1. How did you first get involved in data management? 2:23 – 04:05

Saket: Yeah, as I said, I’m an engineer by background and worked for a number of companies. In 2009, I started a company building a mobile ad server, one of the early ones at the time. That company was called Mobsmith, and we ended up building a pretty complex data system. There’s a lot of data in the advertising space; that’s why you see Google, Facebook, and all those companies building up data technology. We ended up building a system that would process nearly 300 billion records of data every day, and we were doing some very sophisticated machine learning on that for pricing, ad auctions, and so on. We ended up creating a huge data platform as part of that advertising startup. That’s what got us into the data space. 

Avinash: When we founded this ad-tech company, we were really early users of the current data technologies, very early users of Kafka and so on. You see this problem at every startup: every time the engineering team came to us with a requirement, we’d say, “Oh, you’ll need to put this at the back of the queue; it will take us a few months.” That’s when we started thinking about how to build something that makes it easy enough for a non-engineer to build data solutions on top of a data engineering system. 

  1. Can you describe what Nexla is and the story behind it? 3:58 – 5:46

Saket: There’s a lot of data in advertising. For Avinash and me, the co-founders, the data challenges in advertising were super interesting. We were not as enamoured by the media side; having worked at Nvidia, we were interested in the compute side. At our last company there was a lot of data, and we ended up getting data from hundreds of different partners and companies. Clearly, we got a lot of value applying that towards machine learning. The thinking that took us towards Nexla, back in 2014, was that this was a growing problem. We thought more and more companies would use data in more efficient ways, but that it would be very hard for them; there’s a lot of technology that has to go in. That planted the seed of the idea for Nexla in us: companies will have data inside their organization, and they will get data from the ecosystem. It will be hard to work with, and more and more people will need to work with data. So how do we make that happen? How do we deal with this growing complexity, and how do we enable someone who is not an engineer to eventually work with it and say, “This is what I want to do with my data”? That was the core idea behind starting Nexla. 

  1. What are the major problems that Nexla is aiming to solve? 5:47 – 8:06

Saket: When we posed the question “How can data be used by more people and in more applications?”, it let us see the core problems that had to be solved. We wanted non-technical users to be able to do this, and that led us to four key problems. The first was that we had to break through the connector barrier. There are so many systems connecting in so many formats and at different velocities of data. How can we connect different data systems without having to write and build new connectors? That led to our technology, the universal connector. The second problem was that once we connect to the data, how can we present it in a way that users can understand? That led to Nexsets as a logical unit, something that is consistent no matter where the data comes from. The third was that we were always thinking about data moving between companies and their ecosystems of customers, partners, and vendors. If data has to move across companies, it will have to move across clouds and from on-prem to cloud, so the architecture should support those scenarios. The fourth was that we wanted data users to be empowered, but data challenges are very diverse, so we would need a way to collaborate: collaboration all the way from someone who’s an expert in the data to someone who’s an expert in the data system. That led to the no-code capabilities of Nexla, the low-code capabilities, and the full developer platform with APIs, SDKs, and CLIs. 

  1. What are the components of a data platform that Nexla might replace? 8:07 – 11:22

Saket: At a high level, when it comes to technologies like Kafka and Spark, we don’t replace them; we make them more usable and adoptable. There are four different things we solve for companies. 1. Multi-modal integration: ETL, ELT, reverse ETL, streaming integration, API integration. We provide a solution for all of it, something that can be done in the user interface. When the actual execution happens, we are able to leverage whichever of these technologies is appropriate, like Kafka or Spark, or some of our home-grown engines as well. That way we sit on top of the benefits of the technologies and know-how we already have, but make them more accessible to users. We never think of it as a replacement; in data, so many new problems keep arriving that it’s usually not about going back and replacing things, it’s about solving the new problems. 2. Preparation of data: how do we transform, enrich, filter, and validate data so it becomes ready to use? 3. Monitoring: since we are in the flow of data, we enable error management, data quarantine, notifications, and alerting as far as the flow of data is concerned. 4. Discovery: we generate a catalog from our Nexsets where users can find the data they are looking for, access it, and use it. 

Avinash: A point to add here is that these are all complements to each other. You don’t have to use the entire thing; you can choose to use only parts of it. For example, if you’ve already written some code in Python but don’t have an integration layer to fetch the data from, or you have a new source you want to fetch data from, you could use Nexla for the integration piece but use your pre-written Python transformations to transform the data and push it out. Similarly, say you have a new customer who wants to move from GCP to Azure. Maybe you’ve written the logic for Azure but don’t have a Synapse connector; you can use Nexla to push data to Synapse. The user can get the data via Nexla in two or three clicks.  

Saket: So we work very much alongside the existing monitoring systems and catalog systems, and we end up supplementing them or feeding information into those systems, making them more capable. 

  1. What are the use cases and benefits of being able to publish datasets for use outside and across organizations? 11:23 – 17:55

Saket: First, a quick definition of Nexsets: Nexsets are logical data units. They don’t hold data themselves, but they are a way to use and access data. Nexsets have the concept of sharing and collaborating, within a team of course: I can design a Nexset of PII-compliant data and give a team access to use it. This also has applications across companies. When two companies work together, as you can imagine, there is always some flow of data between them. We see this with some of our customers. For example, one of our customers is Instacart. For them to deliver groceries, they need to know: does the store have the product available, and can I go pick it up and list it? So Instacart has access to data that comes from its merchant partners. In these cases, we enable the flow of data across companies. The second part is the community aspect. Someone might have worked on public health data and made it ready to use, and they can make it public so anybody can use it. There’s also a flexibility part. It’s not unusual for companies to put files on FTP or create an API so they can give data to somebody else. That’s a two-way integration challenge: one side has to generate the data into a file and push it to an FTP server, or create an API, and then the other side has to integrate, fetching that data and running a process on it. Nexsets remove that. Once you give someone access, you give them the choice: do you want to consume that data as a file, an API, a database, or not have to integrate at all? There’s a lot of power in that. We see these use cases growing, and there are real benefits for companies and their customers. 

Avinash: FTP is still going strong even after a few decades. The way companies share data, they probably give you a username and password, and then you forget about it. How do you know which integrations are currently connected to an FTP server or an API? It’s really hard with today’s systems. You look at the FTP server logs: when did these users last log in? Or you look at the API logs: which users are connected right now? With Nexsets, you can just go to the screen and see: these are the things I have access to, and this is when the data was last imported. The missing monitoring aspect of data exchange is, I think, something that prevents companies from connecting multiple data systems, because you’re scared: how would I know who is using my data? So having that aspect is always useful.

Saket: I have seen horror stories of somebody changing a column name between an insurance company and an automobile company, and the whole thing came down because of that. So it’s a big problem sometimes. 

The more modern analog to the FTP server is the capability that systems like BigQuery or Snowflake offer of sharing tables or datasets from data warehouses. Can you give us a comparison between Nexsets and the capabilities they provide? 

Saket: I think shared tables in systems like BigQuery and Snowflake are a great way to share data. Nexsets help companies by, first, bringing data into BigQuery or Snowflake. But once the data is in a warehouse, it has become a batch, static system. In use cases where data needs to go in real time from one system to another, Nexsets have multiple execution engines underneath and can move the data in real time. There are many use cases like e-commerce shipment tracking. 

Avinash: When you share a table from Snowflake or BigQuery, you still have to do a lift and shift to wherever you eventually want to use it. You might use it in a machine learning setting, where you extract the table and so on. With Nexsets, you solve the problem on both ends of the ecosystem: on one side, you can use a Nexset to push data into these systems, and on the other side, you can have a Nexset that feeds directly into your machine learning model. So we have one customer sharing tables in Snowflake, and another customer simply consuming the data from that table through a Nexset on top of it. 

Saket: I’ve had companies come to me and say, “We are not current customers of this or that data-sharing tool, but our customers are asking us to bring data there. Can you give us a mechanism to move the data from our data lake into that system so we can share it?”

  1. What are the different elements involved in implementing DataOps? 17:55 – 25:10

Saket: We were one of the early ones talking about DataOps, back in 2016. The purpose of operations is to bring scale: are more people working with datasets, are more datasets being used in applications? Our approach to the operations aspect was that you have to have a certain degree of automation in how you connect, how you flow data, and how many people can use data together; that’s how you scale your team. Those are the core principles. The resulting application for users is the integration, preparation, and monitoring of data. We think of operations as a layer on top of these fundamental data functions that allows a team to scale. 

The fundamental premise behind Nexla is that on one hand you have a variety of data, and on the other you have users who shouldn’t have to write code. We think that between them there has to be some technology that takes all that variety of systems and connectors and presents the data to the user in a useful fashion. The way Nexsets function is that they start by observing the data: as the connectors bring data in, the Nexsets figure out what the schema is. They get the schema from the first record but keep looking at it continuously. The second aspect is when the data arrives and how much data we see. Then we go into the data itself, recognize the attributes, and compute the characteristics of the data. The umbrella term for all of that is metadata intelligence, or data fabric technology. A Nexset also knows where the data came from; for example, that this record was line 500 of that file. So that information is gathered too. At that point, it knows enough about the data to present it to users in a UI they can look at. It doesn’t matter to a Nexset whether the data comes from a stream, an API, or a file; where it came from is immaterial, and it can present it to the user in a common interface. On top of that, a Nexset tracks who can access it, what its history is, how the schema changed over time, and which users have modified it; it keeps documentation about what the data is about and whether a user has created an annotation for an attribute. While data is flowing through, the Nexset isn’t doing anything to it; it isn’t moving anything yet, it’s observing. Once you connect a Nexset to a data system, it starts to materialize the data there. Point a Nexset at a database, and you have an ETL flow.
Or, if you want an API for the data, whenever you want to use the data, the Nexset knows enough about it to convert the format. It also knows enough about the data to run validations: let the user define validations like “this field should look like this,” and when those validations fail, the record goes into a quarantine with its own separate processing. So a Nexset is this logical entity in the middle that acts like a data router when you use it. 
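The observe, validate, and quarantine loop described here can be sketched in a few lines. This is purely an illustrative model, not Nexla’s actual API; every name in it (the Nexset class, infer_schema, the rule format) is hypothetical:

```python
# Hypothetical sketch of the observe/validate/quarantine behavior
# described above. Not Nexla's real implementation.

def infer_schema(record):
    """Infer a flat schema (attribute -> type name) from one record."""
    return {key: type(value).__name__ for key, value in record.items()}

class Nexset:
    """A logical view over a stream of records: learns the schema from
    the first record, keeps checking it, applies user-defined
    validations, and quarantines failing records."""

    def __init__(self, rules=None):
        self.schema = None          # set from the first record, re-checked after
        self.rules = rules or []    # list of (description, predicate) pairs
        self.quarantine = []        # (record, failed rule descriptions)

    def process(self, records):
        delivered = []
        for record in records:
            observed = infer_schema(record)
            if self.schema is None:
                self.schema = observed      # first record defines the schema
            failed = [desc for desc, check in self.rules if not check(record)]
            if failed or observed != self.schema:
                self.quarantine.append((record, failed))
            else:
                delivered.append(record)
        return delivered

# A user-defined validation, as in "let the user define the validations":
rules = [("price is non-negative", lambda r: r.get("price", 0) >= 0)]
orders = Nexset(rules)
ok = orders.process([{"sku": "a1", "price": 5}, {"sku": "b2", "price": -3}])
```

Here the second record fails the price rule and lands in `orders.quarantine` for separate processing, while the first is delivered downstream.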

Avinash: There’s complexity in the whole runtime of this. How do you run a Nexset that might have billions of records behind it alongside a Nexset that has a few hundred records in the system? How do you separate the execution engine for a user who’s waiting on the other side for a response from the one for users who expect a batch of data to be dropped into the data warehouse once a night? The challenge is presenting the same interface to the user while having different execution engines underneath. 

Saket: You’re looking at a UI, making decisions, designing things, and when that’s done, the execution layer knows from all that information which execution engine is right. 

Do you provide the infrastructure? 

Saket: We provide the execution part. Depending on your needs, we provide the execution engines, which scale the containers up and down; it’s all packaged. We also allow customers to run Nexla on their own. They might need this for high-value data on premises, so they can run their own data pipeline, and we provide the templates to execute and run that. The end execution is distributed in a federated fashion. 

  1. How is the Nexla platform implemented? 
    1. What have been the most complex engineering challenges? 25:11 – 32:10

Saket: 1. Connectors: how do we get to a place where we don’t have to create new connectors, and our customers don’t have to wait for us to create them? We ended up creating an abstraction layer with four connectors. Authentication, retries and probing a connector, managing credentials: all those layers are abstracted on top of that. 2. Multimodal data processing is another one. Say you have a database as a source: you’re reading from it and pushing the data into an API. That flow will normally run on Kafka. But if you want to serve the same flow through an API call, that doesn’t go through Kafka; that’s real time. So the exact same flow in Nexla can be run in two different ways. We also recently started to offer “bring your own data engine”: if you have created your own, we can give you the interface to run it. 3. The whole infrastructure is dynamically orchestrated, so containers come up at the right time to read the data. 
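That connector idea, with shared concerns like credentials and retries pulled into one abstraction layer so each concrete connector only implements fetching, might look roughly like this. All class and method names here are invented for the sketch, not Nexla’s real interfaces:

```python
# Illustrative connector abstraction: auth, credentials, and retries
# live in the base layer; subclasses only implement _fetch().
import time

class BaseConnector:
    """Shared layer: credential handling and retry logic."""

    def __init__(self, credentials, max_retries=3):
        self.credentials = credentials
        self.max_retries = max_retries

    def read(self):
        last_error = None
        for _ in range(self.max_retries):
            try:
                return self._fetch()        # connector-specific work
            except IOError as err:
                last_error = err
                time.sleep(0)               # placeholder for backoff
        raise last_error

    def _fetch(self):
        raise NotImplementedError

class CsvConnector(BaseConnector):
    """One concrete connector; self.lines stands in for a real file/API."""

    def __init__(self, credentials, lines):
        super().__init__(credentials)
        self.lines = lines

    def _fetch(self):
        header, *rows = [line.split(",") for line in self.lines]
        return [dict(zip(header, row)) for row in rows]

conn = CsvConnector({"token": "..."}, ["sku,qty", "a1,3"])
records = conn.read()
```

The point of the design is that a new source only needs a `_fetch()`, while retries and credentials come for free from the shared layer.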

Avinash: We want to simplify the view for the user, but there’s a lot happening in the back. The orchestration changes over time, and making that switch invisible at the top is super challenging. Users have done some fantastic things with Nexsets. Some users brought in media files and got Nexsets out of them; we had not originally thought about that, but people have done it. We started with a simple concept, but people have taken these building blocks and done a lot more. 

Consistency is tricky across different execution models. How do you ensure consistency among these different ways of executing? 

Saket: The key data-processing module lives in exactly one codebase but runs in different runtimes. That module can run in a Kafka stream, on our servers, etc. There’s only one place; we’re not trying to replicate the logic across different systems. 

Avinash: We keep the Nexset separate from where the Nexset gets executed. That separation between the definition and the execution of a Nexset is what lets it run in different places. We should be able to bridge the gap between a Nexset and a new engine without changing anything in the Nexset itself.  

Saket: A user comes to us and says, “Here’s the dataset I want, and this is what I want to do.” How we execute that does depend on a few things, like whether it’s real time, but it shouldn’t matter to the user.

  1. How has the architecture changed or evolved since you first began working on it? 32:10 – 35:10

Saket: When we first started, we focused on streaming and micro-batching. We did encounter real-time use cases early on, but we also came across cases where streaming was overhead and not necessary. So we started from the streaming engine and became multimodal over time. We always thought about who’s producing and using the data, and that concept has also evolved quite a bit within our product. It’s not just about the source and destination, but about whether you can do whatever you want with the Nexset. You can maybe use it to power a search function. 

Avinash: Having a platform that is operable in different places is a huge benefit. We thought about the points at which we want the platform to be open: can someone use the underlying Nexla infrastructure and bring their own connector, transformation, and validation logic? The CLI has made the platform more and more powerful. We originally focused only on the no-code and low-code features, but we eventually opened up all our APIs, and that’s what made the data space exciting, I think. Customers challenge you to improve your product constantly. 

  1. What are some of the heuristics that you have found most useful in generating logical units of data in an automated fashion? 36:02 – 37:43

Saket: In an organization, people have different skill levels. In our connectors, you can implement some complex configurations, but those can become templates. The configurations can be pre-packaged so a user can just come and click a button. We also consider how people will work together: how does a person come up with a solution once, so that it becomes a solution that can be reused forever? At some point it can be shared in the community as well, whatever you have built in the product. From a technical perspective, we have seen that some simple structural heuristics, even simple unions of schemas, can give you some interesting results. Of course we do more complex matching of schemas, but simple heuristics can go a long way. 
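As a toy illustration of that “simple union of schemas” heuristic, merging two observed record schemas and flagging attributes whose types disagree could look like this. The function and the schema format are invented for the sketch, not Nexla’s internals:

```python
# Illustrative schema-union heuristic: merge two observed schemas
# (attribute -> type name) and flag attributes with conflicting types.

def union_schemas(schema_a, schema_b):
    merged, conflicts = {}, []
    for attr in sorted(schema_a.keys() | schema_b.keys()):
        types = {s[attr] for s in (schema_a, schema_b) if attr in s}
        if len(types) > 1:
            conflicts.append(attr)              # same name, different types
        merged[attr] = "/".join(sorted(types))  # e.g. "int/str"
    return merged, conflicts

a = {"id": "int", "name": "str"}
b = {"id": "str", "email": "str"}
merged, conflicts = union_schemas(a, b)
```

Even this crude union surfaces useful signal: the combined attribute set, plus a list of attributes (here `id`) that need closer, more expensive matching.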

  1. Once a Nexset has been created, what are some of the ways that they can be used or further processed? 38:12 – 43:50

Saket: Nexset plus code creates a new Nexset. People can apply their own transformations, filters, enrichments, etc. to create new Nexsets. We also see customers combine multiple Nexsets together, and you can do lookups within Nexsets. We will soon introduce things similar to SQL joins. Users can create these themselves or pass them on to someone else to do. 

Avinash: Fundamentally, it’s about having an open entity that is composable. If you get data from Postgres and data from S3, today you end up writing a lot of integration code if you want to combine them. When you do it with Nexla, you get two Nexsets, join those Nexsets, and you have seven different ways of accessing the resulting Nexset out of the box. 
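A minimal stand-in for that composition, treating each source as a list of records and joining on a shared key, much like a SQL inner join, might look like this; nothing here is Nexla code, and the source data is made up:

```python
# Illustrative join of two record sources on a shared key.

def join_records(left, right, key):
    """Inner-join two lists of dicts on `key`, like a SQL join."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**match, **row})   # left wins on overlapping keys
    return joined

postgres_rows = [{"user_id": 1, "name": "Ada"}]       # stand-in for Postgres
s3_rows = [{"user_id": 1, "country": "UK"},           # stand-in for S3 files
           {"user_id": 2, "country": "DE"}]
combined = join_records(postgres_rows, s3_rows, "user_id")
```

The bespoke integration code the paragraph mentions is exactly what a platform-level join like this replaces: the user only names the two sources and the key.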

Saket: We had a very interesting use case with a freight broker that was getting emails from drivers. The emails would have attachments containing the driver’s license and insurance information. Their goal was to validate each driver and onboard them onto the platform. They made these emails sources in our system. The first Nexset detected from the email would contain, or point to, that attachment, and it would detect any tables in the email. That Nexset was then enriched by making calls to an OCR system to grab whatever information could be pulled from the image, so the resulting Nexset had a lot more data from those documents. Then they had the logic and criteria to say: is the driver’s license valid, is the insurance sufficient, and so on. The outcome of that Nexset is an API into their system. They were able to say, “From now on we can onboard 80-plus drivers automatically.” They could also pipe the outcome of that Nexset into a spreadsheet that goes through the approval process. We find people come up with extremely compelling and interesting applications by composing Nexsets together with the other systems they have. 
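The freight-broker flow reads naturally as a chain of steps: detect the attachment, enrich via OCR, validate, then route records to automatic onboarding or manual review. A toy sketch of that chain, with every function name hypothetical and a fake OCR call standing in for the real service:

```python
# Toy sketch of the freight-broker onboarding flow: each function is a
# hypothetical stand-in for one Nexset-like step in the chain.

def detect_attachments(email):
    """Step 1: pull the attachment (and sender) out of the email."""
    return {"driver": email["from"], "attachment": email["attachment"]}

def enrich_with_ocr(record, ocr):
    """Step 2: enrich the record with fields read by an OCR service."""
    return {**record, **ocr(record["attachment"])}

def validate(record):
    """Step 3: the broker's own criteria."""
    return bool(record.get("license_valid") and record.get("insured"))

def onboard_pipeline(emails, ocr):
    approved, review = [], []
    for email in emails:
        record = enrich_with_ocr(detect_attachments(email), ocr)
        (approved if validate(record) else review).append(record)
    return approved, review

# A fake OCR service standing in for the external call:
fake_ocr = lambda blob: {"license_valid": True, "insured": True}
approved, review = onboard_pipeline(
    [{"from": "d1@example.com", "attachment": b"img"}], fake_ocr)
```

Approved records would feed the API into their system; the `review` list corresponds to the spreadsheet that goes through manual approval.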

In the case where you have two Nexsets and you only want half of the information each one presents, how do you optimize the execution so that you don’t fully run both?

Avinash: The execution plan of a Nexset is tweakable. You can run this Nexset in execution engine A and another Nexset in execution engine B. There’s also a windowing concept you can apply on top of a stream, for example, in this case. You can get a Nexset almost immediately; from the perspective of an underlying streaming architecture, you can do this easily. 

Saket: We sometimes advise that you materialize the Nexsets in some system, then use that as a source to make a new Nexset. For example, you’re getting data from three different systems and trying to do something complex. You don’t want to force the solution. You can materialize those three Nexsets in a database, do some ELT transformations there, make the outcome a Nexset, and take that further downstream. Problem solving in computer science can often take multiple paths; sometimes a solution is too complicated for the user to figure out, and you can break it down into a few linear steps. The answer is always “it depends.” 

  1. What are some of your grand hopes and ambitions for the Nexla platform and the potential for data exchanges? 43:51 – 46:13

Saket: At the base level, our goal has always been to present frameworks that people can use to build these data solutions. The second part is how to go from an idea to something working very quickly. I’ve seen large companies spend months or years working on a framework and still not get to an output; it’s always about how you can get there faster and get the data ready to use in the system. Our big hope, in general, comes from stepping back and saying it makes sense to have certain abstractions in the data flow, and those abstractions can be applied in multiple places. We think that will make life easier for many users. You don’t want one system to discover the data, another system to integrate it, a third system to prepare it, a fourth system to monitor it, and a fifth system to catalog it. People have done that, and it’s extremely hard. What is a good baseline that can cover and solve this class of problem for people, with an open design so they can plug in their favorite tools and pieces? We go back to that approach and present it as a framework. We hope this becomes the collaborative way of working with data, with people of different skill levels using it, because I can see the future: kids are already learning to code at age 11 and looking at data. The future is everybody working with data, and that’s only possible when there are robust and usable data systems. 

  1. What are the most interesting, innovative, or unexpected ways that you have seen Nexla used? 46:13 – 47:41

Saket: The email attachment use case was a very interesting one. We also had a large financial institution use S-1 filings, which are publicly available, as a data source. Our system takes care of fetching that data, handling the rate limits, and even detecting the tables in those documents, so they can focus on their machine learning services that take the unstructured documents and extract identifiable entities and features for them. We felt that was a very interesting way to use it because it helps them with their roadmap to reduce dependency on software engineering. We routinely come across very interesting use cases. 

Avinash: The shipping broker use case: multiple e-commerce companies working with different shipping programs, each of which has built its tracking system in a slightly different way. Integrating with all of them can be a nightmare. When you have something like Nexla, you have Nexsets on the inside and the outside, and you can orchestrate this with a single API call; it’s clear where the data is coming from and going to. Things like this were not originally expected, but used this way, you get a Nexset on the input and a Nexset on the output, and with one API call you are solving a business problem that could have taken you years. 

  1. What are the most interesting, unexpected, or challenging lessons that you have learned while working on Nexla? 47:41 – 50:54

Saket: We came in with a lot of data experience, but we’ve still been surprised by some of the challenges. I’ll give you an example of a non-technical one. As entrepreneurs, we want to make sure we build a product that works, a product that actually solves a business problem, and do that in an efficient way. We ended up creating a company that started with a seed round of funding and then became profitable before taking further money. We wanted to make sure we weren’t in a cycle of raising money again and again while hoping to find the right use case someday. That was one of the challenges. 

Avinash: A technical challenge: when we started the company, we wanted to build it on the latest technology out there. We were like, okay, let’s build it on Kafka or whatever is current. Then we talked to the first five customers, and they were like, “We use AS/400; we have AS/400 on z/OS.” So marrying the two worlds is challenging: yes, we are using the latest technology, but we also need to work with the technology people have been using for the last 50 years. Imagine a Nexset on AS/400; what would that look like? This is something we didn’t start with, but we found the platform could do it. 

Saket: That happened while we were still debating whether everything should be streaming or real time. We thought maybe everyone would want that, but you ask those companies, and they say, “No, we don’t need streaming or real time for our data; that’s fine.” So how do you still keep everything compute-efficient and cost-efficient? Sometimes it’s good to use the latest tech and try to be ahead of the market, but you’ve got to meet the market at the right spot. 

  1. When is Nexla the wrong choice? 50:54 – 52:02

Saket: We’d like Nexla to be the right choice whenever you are working with data, of course. For people who work in AI or BI, Nexla helps bring the data to you. But there are many aspects; data is a very broad space. For example, we do monitor data as it flows through us, but we can’t say we will solve all of your data monitoring challenges. What we do best is generate a lot of signals that you can consume in Datadog or your favorite monitoring tool and handle yourself. Our goal is not to reinvent anything; if something is already there, we want to connect with it and power it. As I mentioned, we run SaaS, but we also offer an on-prem option. If you’re running on prem and you want to do a one-time data migration, I don’t know if you really want to take the time to get security approval to deploy our system for a one-time use case. So there are cases where you have to ask whether this is technically the right solution, and you also need to think about business efficiency. But I think we cover a very broad set of use cases extremely efficiently for many companies. 

  1. What do you have planned for the future of Nexla? 52:10 – 53:28

Saket: We are expanding on this concept of templates, where a more technical user can define something, like the template for my schema, validations, or the connector in a Nexset, so that other people can use it. We’ll also be bringing in more community capabilities: how you can share this knowledge inside the company but also outside. The concept of a data exchange is not yet something we formally offer as a product. Embedding Nexla into your own product is something some companies have started doing, using our APIs to embed those capabilities in their own products. As we work with more enterprises in more sectors, we see that the challenges are very similar. We’ve had education companies use us; they need to work with data across school systems and tech platforms. Marketing companies use us, and you also see finance and e-commerce companies using us. We hope to keep scaling up our capabilities across these sectors to meet their unique needs. 

  1. The biggest gap in data management technology? 53:28

Saket: The gap comes down to operational scale: how can tools be used in more applications by more people? I’m also getting more and more interested in the people aspect of it, because that’s also very challenging. How do you package this complexity in a way that people can use? It’s a hard challenge. The technology challenge is usually how you take something very complex and make it easy. 

The other thing is that when people say something is easy to use, it’s often not powerful enough. How can tools cover that continuum: be easy to use, yet also have the flexibility and power to let people solve complex problems, across no-code, low-code, and developer tools? 

Avinash: There is a lot of closed tooling in the data space. You have tools that do a lot of things, but you can’t interact with those tools easily. That’s what makes us think having an open platform is super helpful, and combined with concepts like Nexsets it becomes a very powerful combination: plug in whatever you deem right, while also having these basic, broad capabilities. 

Saket: At the last company where we worked together, Avinash actually designed a common standard for ad creatives so that multiple companies could come in and collaborate. Will there be a common standard for how data flows across companies? We are very close to that problem, and we hope it does happen over time. We work toward it through our APIs, and we publish a JSON manifest for everything in Nexla. We hope to ultimately help the end customer use all these tools together in an efficient way.