Flavia Saldanha, a consulting data engineer, joins host Kanchan Shringi to discuss the evolution of data engineering from ETL (extract, transform, load) and data lakes to modern lakehouse architectures enriched with vector databases and embeddings. Flavia explains the industry’s shift from treating data as a service to treating it as a product, emphasizing ownership, trust, and business context as critical for AI-readiness. She describes how unified pipelines now serve both business intelligence and AI use cases, combining structured and unstructured data while ensuring semantic enrichment and a single source of truth. She outlines key components of a modern data stack, including data marketplaces, observability tools, data quality checks, orchestration, and embedded governance with lineage tracking. The episode highlights strategies for abstracting tooling, future-proofing architectures, enforcing data privacy, and controlling AI-serving layers to prevent hallucinations. Saldanha concludes that data engineers must move beyond pure ETL thinking, embrace product and NLP skills, and work closely with MLOps, using AI as a co-pilot rather than a replacement.
Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
Related Episodes
- SE Radio 561: Dan DeMers on Dataware
- SE Radio 507: Kevin Hu on Data Observability
- SE Radio 456: Tomer Shiran on Data Lakes
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Kanchan Shringi 00:00:18 Welcome everyone to this episode of Software Engineering Radio. Today we welcome Flavia Saldanha. She’s a consulting data engineer who has designed and led enterprise-level automated data warehousing solutions. Her work spans banking and enterprise cloud platforms as well as data modernization and AI-driven innovation. Flavia also takes part in internet groups focused on data products and AI readiness. So happy to have you here, Flavia. Is there anything you’d like to add to your bio before we start talking about data engineering for AI?
Flavia Saldahna 00:00:55 Thank you Kanchan, that was a very nicely well-done introduction of myself and the role that I play in my organization around data engineering. I don’t think there is anything much to add other than my role is centered around data architecture, governance, and engineering enablement, now mostly moving towards AI readiness. So very grateful to be here on this platform to be able to share my own real experiences of working in a financial organization. Happy to be here.
Kanchan Shringi 00:01:28 Before we start drilling down further into data engineering, especially in the context of AI, I’d like to point the listeners to a few related episodes that we’ve done in the past. These are Episode 561, Dan DeMers on Dataware; 507, Kevin Hu on Data Observability; 523, Jessi Ashdown and Uri Gilad on Data Governance; and lastly 424, Sean Knapp on Data Flow Pipeline Automation. Flavia, let’s start off with you trying to explain to us in a simple way: what exactly do data engineers do today?
Flavia Saldahna 00:02:07 In a very colloquial definition of data engineering, typically as we hear the word data engineering, the first thing that comes to anybody’s mind is data engineers are professionals who work with data. May not necessarily articulate well like you know, what aspect or facet of data engineering a data engineer typically performs. And even if we look at the traditional history and the way this discipline has evolved, there is so much that has changed in terms of what a data engineer can actually do, how a data engineer is positioned in an organization. So, yes, while a data engineer works with data, I think it is important where, let’s look even a little bit into the history of how this discipline itself has evolved. Data engineering became very popular a few years ago when big data was trending and organizations and companies were talking about how do we transform process volumes of big data, right?
Flavia Saldahna 00:03:13 And the velocity and the speed with which data needs to be processed faster and be made available for our consumers. Let’s go back a little bit further, and this actually takes me back to the time when I was fresh out of college and into my first ITLP assignment. At that time, we didn’t really have these job titles of data engineers, but again, though these titles were not present, the aspect of data engineering or this job function itself has always existed. A lot of folks in the organization that have been here like me for a long time, more than a decade now, would know and relate with some of the job titles like SQL Developer, Oracle Developer or an RDBMS, or Relational Database Management System, Developer or architect, right? So we were very much confined to working with data in a container, or that particular relational database in itself, and we would write large, long pages of code, maybe in the form of stored procedures, user-defined functions and so on.
Flavia Saldahna 00:04:19 So that was also, a form of data engineering where we were trying to write code that solved a particular business problem, right? We provided actions around what needs to be done with the data, how do we structure it, how do we define it, and put into a monolith like a data warehouse in those times. And then after writing large, long pages of code, there came a phase or a period I would say, where a lot of ETL technologies came in. Now with ETL, I mean Extract, Transform and Load technologies, which is essentially the core and the heart of data engineering. This title of ETL Developer became quite prominent in the industry and there were a lot of job families even like you know, which today we call as data engineering revolved and centered mainly around ETL roles. Now with ETL job families coming in, a lot of innovative ETL tools came in, which is essentially what defined this role where we didn’t really have to write a lot of, you know, SQL code, transact SQL and so on.
Flavia Saldahna 00:05:32 But a lot of code started to shrink and these technologies took over, where this engineering professional then had to worry about where and how the data needs to be moved, right? We were essentially like data movers under this job title of ETL developers, and then came the big data phase where we were now talking about bigger volumes of data, and that’s when this term of data engineering became quite popular. But if you think about it, even though this term was coined, the aspect of data engineering still centered and revolved around the movement itself, where you’re moving data from point A to point B, and every time a data engineer gets designated to a task or a goal, the very first question the data engineer is going to ask is, what’s my source for the data? What’s my destination for the data, right?
Flavia Saldahna 00:06:27 This was up until I want to say maybe four or five years back where data engineering was everything around data movement. It was to a certain extent about making data usable, but the entire function was around how do we leverage data as a service? You provide a service to your data and you try to solve a business problem. In the recent years, where I see a bigger shift and how the data engineering has evolved is where we are now thinking about data as a product itself. From being service based data engineers are now being product based and I can tell very well from my own role and experience where now when we talk to our business stakeholders or we work on use cases where we as a data team have to plan around it, we try to talk more about how do we encapsulate the data itself as a product.
Flavia Saldahna 00:07:25 We want to understand what is the business problem they are trying to solve upfront rather than having to ask the very first question what the source of the data should be and where do you want the data to be landed or published? There’s a lot of product thinking that is now going into the data and I see this as an evolving trend in the data engineering discipline, which is where yes, the term data engineering still essentially revolves around working with data, but it has I feel like very strategically evolved into working smartly with data rather than having to keep it as a very pure service function. How it used to be in traditional times.
Kanchan Shringi 00:08:08 Flavia, regarding data as a service versus data as a product, do you think you could provide an example? That’ll probably help me and our listeners understand the roles better and the expectations from each of the roles: the producers, the consumers, and it sounds like there is a product owner too?
Flavia Saldahna 00:08:29 Yes, sure Kanchan. So when we say data as a service, this has typically been the main function of data engineers of having to provide a service in terms of delivering data to their consumers. And when we think about service, data engineers typically think about moving data from one place to another. They’re mostly focused on this primary aspect of delivering data and that is the service we are talking about here. A good example or an analogy I can help share here, you know, like how data engineers operate would be similar to our Amazon delivery persons or it could be even like our local grocers who provide delivery services. Let’s take the example of an Amazon delivery person. Now they might not know what product is packaged and they know that they are supposed to deliver this product to the destination or the designated customer, in this case.
Flavia Saldahna 00:09:31 All they care about is making sure the package gets delivered; they ensure the package is, you know, foolproof, it’s safe and is delivered. The operations in itself are secured and well tied and controlled, but at the same time, let’s say I as a customer have questions about the product itself, I may not be able to question the delivery person about the product because they don’t know, they do not have the knowledge about the product or what’s in it or so on. They’ve just done the delivery job. Very similarly, all this while, data engineers have always had their main responsibilities centered around delivering data, and with that, while the delivery and everything would take place error free, our pipelines were good, data was moving through the pipelines, we had nice error logging frameworks and mechanisms, but when the consumers or a line of businesses had questions about the data itself, our data teams and engineers did not have a lot of business context or understanding about the data.
Flavia Saldahna 00:10:41 Which is where this big mindset shift starts to come in that we treat data as a product, there has to be some sort of ownership around the data because only when there is some ownership attached to the data, our customers or line of businesses can start to trust the data. And the reason the data industry started to move towards the shift is just like any other software or even any other, you know, products that we use today when we know a certain product comes from a particular brand, our entire like you know, customer experience or the way customers work with that product starts to shift. The same concept we are now trying to put on data. In my case, in the banking industry example, if a reporting user is interested in let’s say mortgage data, they will not go to a loans application data provider, right?
Flavia Saldahna 00:11:38 Or another team maybe let’s say retail or marketing data team to get the data. They know if they go to the mortgage IT team, they are going to get the right data points, they are going to get the right business context of the data. Now in this particular example, if you see the mortgage IT team is taking full ownership of the data and they are building meaningful data products out of it. Which is where we have to start working towards not just being able to deliver data as a service but then start thinking about ownership which will then instill trust in the consumers and consumers know where my data is coming from, is it coming from the source of truth? And when they have questions around the data, they can go and talk to those data owners. In this case, I know you mentioned about product owner, when it comes to data products, we have this title called Data Product Owner, right? The data product owner, just like any product owner for a software or any other solution, a data product owner takes full responsibility and accountability of the data product and ensures the data meets its service level objectives, meets the expectations of the consumers.
Kanchan Shringi 00:12:53 Does that mean that every team now would have a data engineer and a data product owner embedded?
Flavia Saldahna 00:13:00 There are different ways in which organizations are handling this. Depending on the complexity of the use cases, if we have a complex use case or a reporting or a data need, we have seen organizations investing in dedicated data teams belonging to that certain business domains. You have a data team where you have a data product owner and then you may have maybe like you know, three to four data engineers that only take care of a single or multiple data products within their data domains. I have also seen cases where teams take a blended approach where you have an application team and in that application team you also see data engineers working along with other software engineers or platform engineers. It all depends on, you know, how big or large of a use case you have around data, what kind of data delivery needs you have and what consumers. It could be some of the high value use case consumers like in our case it’s around regulatory reporting, compliance, fraud, analytics and so on where you may have a much larger need of data to be provided. Based on the use case, based on the need teams can have like you know, data engineers or data product owners accordingly.
Kanchan Shringi 00:14:18 Where does AI fit into this? Was AI the push for this shift, or is it benefiting from this shift?
Flavia Saldahna 00:14:26 At least for our organization, Kanchan, AI was not the initial pull. We as a centralized data team have always found ourselves in the middle of every operational and strategic ask, call it resiliency, modernization, supportability, where the consumers have expected us to be that middle force, and we’ve always found our teams very much stretched to be able to provide that operational and DevOps support time and again, day in and day out, to make sure the data is available timely, the data is fresh and we have accurate and quality controlled data flowing to our consumers. And that was always a very time-consuming activity. We didn’t really feel the cost investment was going right, where we were investing mostly in just getting more and more people in this mix of having to provide these services. But again, for an organization like ours where we have enterprise scale data, it wasn’t becoming very cost effective just to add more people on but be able to scale and continue the same services the way we have traditionally offered them.
Flavia Saldahna 00:15:39 When AI came in, we felt that what we were trying to do here with our strategic shift, right, of changing our ways of working with data itself, helped our architecture to be much more flexible and extensible, where we were now able to cater to AI. What started with making our solution smart enough to have the data readily available for our BI consumers, or for human consumers I would say, is now also working really well as we are augmenting more and more AI-related solutions, both for business as well as for consumers and even for employees. I wouldn’t say we are shifting gears totally towards AI, but we can see how our framework is able to serve dual purposes: it’s able to serve our existing consumers, and at the same time it is now also able to flow all the data from the same single source of truth to our machines as well, with our AI capabilities.
Kanchan Shringi 00:16:46 Three things stood out for me when you were explaining that shift: you said timely, fresh and quality controlled, and we’ll come back to how exactly the stack supports BI and ML use cases. I’m curious about that, but before we get there, can you maybe spend some time on the stack evolution itself? What has changed in the last few years? You know, we started with warehouses, went to lakes, and as I understand we’re now on lakehouses and vector-native systems to support AI. Can you drill into that please?
Flavia Saldahna 00:17:22 Absolutely, yes. As you pointed out well, we started with a centralized monolith. Everything was on-prem and in a very controlled environment, very much local to our organization itself. And then you had a central team having to build, design, architect, own, and manage all the data that goes in and out of that centralized data warehouse function, right? Pretty much a relational database management system function. Now we knew that just having a centralized data warehouse may not really be enough considering the large volumes of data that we were looking at some of the source systems and applications producing. Then came the data lake concept, where in the data lake concept, we don’t really have to do a lot of pre-processing before all of the data starts landing into the warehouse, because anything that goes into the warehouse really required a lot of careful processing.
Flavia Saldahna 00:18:21 You wanted to make sure the pipelines are tightly governed and controlled and the data teams knew what data is being brought into the data warehouse before it gets the format of structured definitions. In the data lake concept, a very new and novel idea came up for data engineers: we do not spend a lot of time on pre-processing, but we just bring in all the data that is coming in raw, in its most raw format, into the lakes. It was almost like, if an organization had 40 to 50 different core applications or source systems, we are not going to do any due diligence to understand what data points we need to capture. Is this going to be even used by the business or not? Let’s just capture anything and everything that is being generated from our source systems, dump everything into the data lake, and it was almost like a problem that the data teams wanted to think about later.
Flavia Saldahna 00:19:20 It was very much, I would say, an after-the-fact or a retroactive way of looking at data: bring and dump in all the raw data and then we’ll figure out how to make it more manageable, how to make it more usable. But what happened during this whole cycle of the data lake trend is that the data engineers found themselves in a very complicated situation of having to now turn all this raw data into meaningful formats. There were really not a lot of business semantics that data teams were able to drive with this, again, because the data teams didn’t have a lot of the mastery or the subject matter expertise around the data to even be able to convert the data into a more usable format. Data lakes were good from an ingestion standpoint of bringing in large volumes of data, but the consumption lacked the finesse or the business semantics that were expected from the lakes. Following that cycle or phase came a very blended approach, right? Where the data industry started to realize it is just not enough that we have all the capabilities and the technology to bring in all the raw data and large volumes of data.
Flavia Saldahna 00:20:37 We need some structured and good processing around the data to drive those meaningful business semantics. The concept of the lakehouse came into existence where we are now trying to bring technology in place that gives you the ability of high magnitude data ingestion capabilities, but again at the same time also, gives you smart storage solutions that help you to persist the data that helps you to capture the trends of data, which in our world we call the change data capture process. Like how historically the data trends have evolved, what does the data say as the data starts to behave differently over a period of time and then those insights can be captured by businesses. So, lake houses are now still a pretty prominent and a popular concept in most of the industries and especially in industries that are looking for enterprise scale data solutions and it is still there, but with all the AI trend that is happening and where we are now thinking beyond humans or beyond just the businesses having to consume all of this data for analytics, for modeling and so on, we are now also thinking about machines.
Flavia Saldahna 00:21:54 When AI comes into the picture, we want to augment those capabilities also, where machines are now able to extract all this data and can bring some meaningful insights and decision making into this whole process, right? What we are trying to do is not shift away completely, I would say, from lakehouses to something new, but we are augmenting, as you mentioned, vector databases and embeddings and some of these additional concepts that traditional warehouses have never had, so we can even handle and process unstructured data. Unstructured data brings us to a very good point, right? Because traditional warehouses have always worked with structured data. We’ve always required data in appropriate table formats and rows and columns. But if you think about it, in any organization, structured data makes up such a small portion; it might be only 0.5 to maybe 5% of the data in an organization that is structured.
Flavia Saldahna 00:22:56 Rest everything is unstructured. Unless you have the capabilities and the functionalities to be able to process unstructured data, you are not going to have a complete picture of analytics for all of your data that is coming out of your organization. Then it becomes very important for lake houses to be able to augment these additional capabilities with embeddings, vector databases and so on. They’re able to now not just process but even store some of the unstructured data and we are able to feed both structured and unstructured data, be it to our AI agents, LLMs and models so they can process all of this information with the right context and can take or drive maybe automation or some meaningful decision making into our operational processes.
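To make the embedding and vector-search idea described here more concrete, here is a minimal, illustrative Python sketch. The embedding function is a toy stand-in and the document contents are hypothetical; a real pipeline would use an actual embedding model and a vector database integrated with the lakehouse.

```python
# Illustrative only: chunks of unstructured text are embedded into vectors, stored
# alongside metadata, and retrieved by cosine similarity. toy_embed() is a
# hypothetical stand-in for a real embedding model.
import hashlib
import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Hash words into a fixed-size, normalized vector (stand-in for a real model)."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Unstructured "documents" (e.g., text extracted from PDFs) stored with metadata.
docs = [
    {"id": "policy-001", "text": "Dormant accounts require manual review before closure."},
    {"id": "faq-017", "text": "A closed account cannot receive new deposits or transfers."},
]
index = np.stack([toy_embed(d["text"]) for d in docs])

def search(query: str, k: int = 1) -> list[tuple[str, float]]:
    """Return the k most similar documents by cosine similarity."""
    scores = index @ toy_embed(query)
    return [(docs[i]["id"], float(scores[i])) for i in np.argsort(scores)[::-1][:k]]

print(search("what happens to a closed account"))
```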
Kanchan Shringi 00:23:48 Okay. Just trying to wrap my head around everything you said. You said you started with raw data in the data lakes, but it had no meaning and meaning is very important for AI. With lake house you mentioned capturing trends, so that’s adding source information enriching in some form or fashion. I’d like you to talk a little bit about that. But then you went on to say that you also, need unstructured data and you need to add embeddings or semantic meaning and that’s where vector databases come into play. What I understood from you was that it’s still two separate concepts. You still have a lake house and a vector database or can you combine them together? Does the tooling today or the infrastructure today support that?
Flavia Saldahna 00:24:38 Absolutely. So, let’s first talk about the lakehouse and your question around what kind of enrichment or what kind of transformation we are looking at in it. Yes, with our lakehouse concept, irrespective of whether your lakehouse is positioned for AI to consume or not, just even with your regular set of data consumers, mostly your line of businesses and domain teams involved, you still have to be able to establish the right semantics, right? And make the data much more usable and meaningful, where they can come and consume the data directly without having to worry, or have the semantics be so overly complicated that our business teams do not really understand the syntax or the way in which the data itself has been presented in our data product layers. So yes, there are some light technical transformations that are done to the data, especially if my data products are more source aligned, where we don’t really try to use all the binary language of the applications that the applications could be producing.
Flavia Saldahna 00:25:46 For example, you may have a data field that has a value of one or zero but having a value of one or zero may not necessarily resonate with your consumers. In the lake house what we do is there are some transformations that are done which are mostly business aligned in order to be able to translate these data points that have had binary values of ones and zeros into meaningful interpretations. So, that one and zeros could be something like, maybe a closed account and an open account status. And as soon as the consumers start to process or read this data, they understand what one represents coming from so and so application and what zero represents and so on. Now this was again with as I said, the regular consumer base we’ve always had. When AI comes into the picture and with AI, now we are also unlocking the potential of being able to process unstructured data.
Flavia Saldahna 00:26:46 With unstructured data we are talking about PDFs, images, sound files and so on, which traditionally we have never really put in a lot of costs investment and thought into process, to process unstructured data. But now we have the capabilities and the ease of access to be able to process unstructured data also. At least in our case, what we would say is we’ve not tried to shift our gears to introduce vector databases in silos, but we’ve made it part of our modern data tech stack itself and augmented it on top of our lake house pipelines, right? We have a very common flow of pipelines depending on whether structured data is fed or unstructured, data is fed, all of that data is processed. And then we have data products that get generated out of all the data processed through our pipelines and then those data products serve as, call it the authorized source of truth both for our AI solutions as well as BI and other direct Ad Hoc consumption and solutions from our other analytical users.
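As an illustration of the light, business-aligned transformation described here (mapping raw 1/0 codes into meaningful account statuses before publishing a governed data product), a minimal sketch follows; the table and column names are hypothetical.

```python
# Sketch of enriching raw application codes with business-meaningful labels before
# publishing the result as a data product that both BI and AI consumers can use.
import pandas as pd

raw = pd.DataFrame({
    "account_id": ["A-100", "A-101", "A-102"],
    "status_cd": [1, 0, 1],                     # raw binary code from the source system
})

STATUS_LABELS = {1: "OPEN", 0: "CLOSED"}        # semantics agreed with the domain team

data_product = raw.assign(
    account_status=raw["status_cd"].map(STATUS_LABELS)   # business-aligned enrichment
).drop(columns=["status_cd"])

print(data_product)   # the same governed table serves dashboards and AI pipelines
```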
Kanchan Shringi 00:28:02 Okay, that explains this. It’s an augmentation strategy? And all the tooling and stack improvements that are happening are also there to support the BI use cases, making them better?
Flavia Saldahna 00:28:15 Absolutely. We don’t want to keep them separate. We do not want to duplicate our processes. We do not have to maintain standalone tooling processing for everything that we are doing for AI versus BI. We definitely want to ensure when the data is shared with be it humans or machines, it is consistent, it is accurate, and it is from the same single source of truth. We are definitely spending a lot of thoughtful time into how we are crafting the strategy for semantics, both for machines as well as for humans. And we are making sure there is no duplication and there is enough scalability and support to be able to handle both of our use cases
Kanchan Shringi 00:29:02 In terms of the infrastructure, sounds like the lake house, the vector database, the embedding model to actually create semantic meaning are the key? What else? What is the average stack size? What are the tools? What else do people use?
Flavia Saldahna 00:29:21 Yes, this is a question a lot of data teams, as they are migrating or shifting from on-prem to a much more modern tech stack, find themselves confused about, and they feel like, what is that starting point? Do we just go out to all the different vendors and buy six to eight tools and layer them on top of each other, or do we go to a specific vendor and get the entire package or suite of solutions from them?
Kanchan Shringi 00:29:50 Before we get to that, can you just talk about the different, you know, what are these tools for? For example, for observability or for capturing metrics or for troubleshooting. Maybe just first talk about different goals for these tools.
Flavia Saldahna 00:30:07 Absolutely, yes. I think in everything that I have talked so far, we’ve talked about the core of data engineering, which is extracting the data, being able to load the data, be it on on-prem or cloud solution and then transforming the data to align with your business requirements and use cases we have. Now with having a modern data tech stack, we’ve also been trying to invest a lot on how do you make your data much more discoverable? What additional tools you require to make your data discoverable because now your data products are federated, you have so many different IT owners or data teams that are managing their own data products. You need a marketplace just like an Amazon marketplace. Very similarly you need a tool that can represent and be a good interface as a data marketplace that gives accessibility to your consumers to self-serve themselves, go find the data products and be able to consume those data products.
Flavia Saldahna 00:31:08 That was again a very different investment and initiative. My organization spends some good thought process and time into how we build a data marketplace and how do we make our data products much more discoverable for our consumers. So, data marketplace definitely comes into the tech stack. Second one I would say is spending on a data observability tool. Yes, there are a lot of data observability tools that are available in the market but it is very important to understand for your domain or for your organization what data observability means because for every line of business, the meaning of data observability may differ. I can speak to the financial domain that I come from. And for us when we talk about data observability, we want to make sure when our pipelines run and bring the data, the data freshness is a very important metric that we want to track against.
Flavia Saldahna 00:32:09 We also, want the data to be very timely. Since we are in a very regulated environment, we have daily external reporting to feds and other compliance organizations where the data is going. So, we want to make sure we are able to maintain those service level objectives. And where a data observability tool becomes handy is it is monitoring the health of our pipelines, it’s alerting our systems, it’s alerting our DevOps and support teams to be able to understand how the health of our pipelines is and they can take necessary steps and actions. The third tool that I would very strongly emphasize would be a good robust automated data quality tool. The reason I say a data quality tool, which is different from a data observability tool, is while observability would act as a monitor and would keep a tab on the entire health check of your data stack and your pipelines, quality tool goes a little bit deeper into your data itself where you have maybe some very foundational technical data quality checks that you want to put in place to make sure all the data operations get executed in a timely manner.
Flavia Saldahna 00:33:27 And there are also sometimes business prescribed data quality rules and one of the tools that comes to my mind is Ataccama, which is pretty good at how do you incorporate some anomaly checks. How can the tool perform some trending and analysis to be able to understand if I am not receiving the data in an expected format value over a period of time and maybe it sees a certain dip or a spike around certain metrics. So, this data quality tool comes handy when it tries to alert the data teams before it can be further sent downstream and then you have repercussions if the data is sent for the downstream without having the right alerts, quality checks and mechanism in place. And then last but not the least, to sum it all from a bare minimum standpoint, I strongly recommend a great orchestration tool, right? Because if you don’t really have a great orchestration tool that brings all of your, not just like the data platform synchronization, but even your pipelines synchronize talking to each other and you have a right handshake, everything that you have built and designed may really not be worth if you don’t have the timely handshake happening.
Flavia Saldahna 00:34:44 So the data reaches the data consumers in downstream and analytical solutions, where data gets handed off between multiple teams and data points. You want to make sure you’re investing in a very reliable orchestration tool where you have defined all your predecessors. The orchestration tool knows what set of pipelines need to be run first, what comes in next and what happens if something doesn’t run. How does it call the observability monitor? How does the observability monitor then alert the teams? Everything is orchestrated in a way that everything runs in a very synchronous but good handshake manner, where consumers then start to feel trust not just in our platform but even in the data itself, where we have all the right governance controls and procedures in place.
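To illustrate the orchestration pattern outlined above (ingest, then quality checks, then publish, with alerts wired to observability when something fails), here is a minimal, framework-agnostic Python sketch; the function names, thresholds, and alert mechanism are hypothetical.

```python
# Toy orchestration: each step runs only after its predecessor succeeds, and any
# failure or stale data triggers an alert to the support/observability tooling.
from datetime import datetime, timedelta, timezone

def alert(message: str) -> None:
    print(f"[ALERT] {message}")                  # stand-in for paging/monitoring

def ingest() -> dict:
    return {"rows": 1200, "last_loaded": datetime.now(timezone.utc)}

def quality_checks(batch: dict, max_age_hours: int = 24) -> None:
    if batch["rows"] == 0:
        raise ValueError("no rows ingested")
    if datetime.now(timezone.utc) - batch["last_loaded"] > timedelta(hours=max_age_hours):
        raise ValueError("data is stale")        # freshness is a first-class check

def publish(batch: dict) -> None:
    print(f"published {batch['rows']} rows to the data product")

try:
    batch = ingest()          # predecessor
    quality_checks(batch)     # runs only if ingestion succeeded
    publish(batch)            # runs only if quality checks passed
except Exception as exc:
    alert(f"pipeline failed: {exc}")
```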
Kanchan Shringi 00:35:36 So: warehouse, lakehouse, embedding models, making it discoverable (are you creating a marketplace?), observability, data quality with anomaly checks included, and orchestration tooling. Where does data governance fit into this? How is that accomplished?
Flavia Saldahna 00:35:54 For data governance, when a data producer is in the ideation phase of how we want to design the data product: what is the purpose of the data product? What is it going to solve? The data product owner needs to be careful in articulating the use case or the purpose that the data product is going to solve the problem for. For example, if a data product is for a regulatory reporting initiative or the data product is for a compliance initiative, you have different data governance rules that the data product owner then has to follow. Now those governance rules technically get translated into a data contract. A data contract is something like a legal contract, and I say legal because it is a strong contract between the data producer and consumer, where they agree upon and decide what the technical semantics of the data are going to look like.
Flavia Saldahna 00:36:55 What is the metadata a consumer is going to expect, what are all the different data points that are contained in the contract and so on. This is like the starting principle of how you embed data governance right from the start of designing your data product. And then in this entire process, once you double click and dive deeper into how do you extend the data governance functionalities across your pipeline, the data product owner find themselves answering questions like, how am I protecting the data? Do I have all the right controls and capabilities towards sensitive data, nonsensitive data? Do I have all the right data classification measures in place? Am I tagging the data with the right labels so the consumers understand what are critical data assets or critical data elements and do I have lineage around my entire pipeline and tech stack? Tomorrow if there is a question or a problem, we’re able to go and trace the pipeline.
Flavia Saldahna 00:37:58 To sum it up, I would say data governance is not something where towards the end of the cycle of the data product development, a data team should start thinking about as a checklist that we have to ensure we are following and maintaining all the data governance requirements. But right now, it is more important for data teams to understand that we start taking or considering data governance as a guiding principle and framework and start embedding them as we design, as we build our data pipelines and our data products and make them available for our consumers.
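A data contract of the kind described above can be made concrete with a small sketch. The fields and values below are hypothetical examples of what a producer and consumer might agree on (schema, owner, SLOs, classification), not a formal standard.

```python
# Illustrative data contract plus a trivial check that a record meets its schema.
mortgage_contract = {
    "data_product": "mortgage_accounts",
    "owner": "mortgage-it-team",
    "purpose": "regulatory_reporting",
    "slo": {"freshness_hours": 24, "availability_pct": 99.5},
    "classification": {"customer_ssn": "sensitive-PII", "account_status": "internal"},
    "schema": {"account_id": "string", "account_status": "string", "balance": "decimal(18,2)"},
}

def contract_violations(record: dict, contract: dict) -> list[str]:
    """Return the contract schema fields missing from a record (presence check only)."""
    return [col for col in contract["schema"] if col not in record]

print(contract_violations({"account_id": "A-100", "balance": 1000.0}, mortgage_contract))
# -> ['account_status'], flagged before the data is published downstream
```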
Kanchan Shringi 00:38:36 You started to hint at this where you know you have all these different toolings and that’s what constitutes a unified data platform. Do you reach out to a specific vendor for these, or do you have to make sure you are getting the best in breed from multiple vendors? What’s your strategy?
Flavia Saldahna 00:38:54 Yes, we are typically an organization that do not want to put all the eggs in one basket, right? With that we don’t really believe in having our entire modern data tech stack and data technology and solutions federated out to a single vendor and then be at the mercy of changes and trends and evolution that happens with all the different technologies and then having to scramble around how do we make our framework and architecture extensible and scalable as new changes evolve and new trends come in. So the kind of strategy we’ve put in place at our organization is, we first put a very thoughtful and careful process of having to decide what things we can build even before we go and buy. What I mean by that is we know our strengths, we understand where are some of the things that are going to be more cost effective, where our teams spend time in having to build those solutions rather than our first instinct always having to be go talk to a vendor and buy a readymade solution.
Flavia Saldahna 00:40:06 Then, once we have formulated a plan around what areas or what aspects of the data tech stack we are able to build ourselves and make more cost effective, we look at the other areas or the other tools and technologies where we think it might not be a great time investment, cost investment and effort, or may not give us a great ROI if we were to build it by ourselves; maybe it’s going to take us a very long time. We would rather stick with what’s working in the industry, and we would rather go and buy a solution but be able to augment it. A critical piece in formulating the modern tech stack with this balanced approach of build versus buy is that we are abstracting the entire concept of tooling when we make these interfaces available to our users and consumers.
Flavia Saldahna 00:41:01 What I mean by abstracting, and a good example for this, is that our consumers sometimes do not know that our data products are in a Snowflake solution. Our consumers sometimes do not know that all the data ingestion that is happening is happening on an AWS platform or framework. We are abstracting the tooling, we are abstracting the technology itself, but we are leveraging the right fit and we are making those services available for our users. And in this context, we are talking about the data producers to whom we have provided self-serve capabilities. When I say self-serve capabilities, what I mean is that data producers today, in a very simplistic format, can come and bring data from files and from other databases, not just structured but even in unstructured formats, and land all that data into our lakehouse, and they may not even know that the lakehouse is Snowflake, right?
Flavia Saldahna 00:42:07 That part is abstracted. We have created a lot of user-friendly interfaces on top of it that help them to provide or that help them to carry all the business functions or technical functions that have to happen. Whether it’s data ingestion, data transformation. And this has also helped us to federate the capabilities itself out the organization and here we start to eliminate the strong reasoning of having a very high barrier of entry when it comes to technical things. A good example I’ve seen in my organization is we have a lot of mainframe developers who have supported legacy applications for a very long time, but when we go to these mainframe teams and we talk about you have so much of subject matter expertise and knowledge, why don’t you come and build your own data products? They raise their hands, and they have always been shy about not knowing or understanding the data engineering concept, right?
Flavia Saldahna 00:43:09 We’ve abstracted away that complexity, so that any IT team, whether they’re software engineers, platform engineers, or even data engineers, all have access to the same means and the tools. We’ve simplified the concept of bringing data into Snowflake for this persona, where these teams don’t really have to worry about what tech stack is in place or not. Now this helps us to future-proof the system also. Let’s say, and I don’t know this but am just making a prediction based on how I have seen tools and technologies evolving over a period of time, with everything happening around AI evolution, there is a very good chance that something else may come in, a much smarter data solution may come in, and we may have to rip and replace something that was already part of the modern tech stack.
Flavia Saldahna 00:44:01 So now, because we have abstracted it, the user will still have a very consistent experience with their interfaces. Nothing on the interface side may really change much, nor will they get an inconsistent experience. But behind the scenes we now have the flexibility to be able to swap different tools or bring in some new innovative tool augmented with our existing tools and framework. That’s how we have, I would say, future-proofed our modern data tech stack. We have ensured that we, the data technology team and the data strategy team, have control as things change, as times change; we can take control of the things behind the scenes but without having to impact business, without having to impact the day-to-day operations, and still be able to support modernization, support transformation initiatives and so on.
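The abstraction described here can be sketched as a thin interface that producers call without knowing which engine sits behind it, which is what makes swapping the backend later a non-event for users. The class and method names below are hypothetical.

```python
# Producers see only DataPlatform.ingest(); the storage engine is an internal detail
# that can be replaced (e.g., Snowflake today, something else tomorrow) without
# changing the self-serve interface.
from typing import Protocol

class LakehouseBackend(Protocol):
    def load(self, dataset: str, rows: list[dict]) -> None: ...

class SnowflakeBackend:
    def load(self, dataset: str, rows: list[dict]) -> None:
        print(f"[snowflake] loaded {len(rows)} rows into {dataset}")

class FutureEngineBackend:
    def load(self, dataset: str, rows: list[dict]) -> None:
        print(f"[new-engine] loaded {len(rows)} rows into {dataset}")

class DataPlatform:
    """The stable, self-serve interface producers interact with."""
    def __init__(self, backend: LakehouseBackend) -> None:
        self._backend = backend

    def ingest(self, dataset: str, rows: list[dict]) -> None:
        self._backend.load(dataset, rows)

platform = DataPlatform(SnowflakeBackend())
platform.ingest("loan_applications", [{"id": 1}, {"id": 2}])

# Swapping the engine changes nothing for producers:
platform = DataPlatform(FutureEngineBackend())
platform.ingest("loan_applications", [{"id": 3}])
```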
Kanchan Shringi 00:44:53 Okay, sounds like the fact that things are changing so rapidly is a key driver for your strategy. Let’s shift a little bit and let me ask you a specific question. Given the strategy, given the tooling that you’re using, do you have metrics on how delivery timelines have changed?
Flavia Saldahna 00:45:12 Absolutely, yes. One of the key initiatives I can talk about where everything that we have done in the modern tech stack has been proving out to be giving us such great and efficient ROIs would be everything we are doing in our regulatory reporting space. There has been a lot of time thought and effort put into crafting a framework and a unified data model and solution that works for an area that has the highest stakes, right? And for a banking function, we all know we want to ensure we keep our auditors happy, we keep our regulators happy and we want things to be put in check and many a times the kind of questions and things we get from this group is more around show evidences, be able to provide results, be able to provide metrics in a way where we are able to reconcile things back to the original like you know, in a financial system it would be our ledger system.
Flavia Saldahna 00:46:15 What we have done is we have taken some of the very high value use cases, the regulatory risk and compliance use cases, and understood their needs. We have made sure we’ve documented all their service level objectives, and those service level objectives have all been confined to these data products in a way where we are not just focusing on the timely delivery and the freshness aspect of the data, but we are making sure we have all the right governance, metadata controls, lineage and everything checked and in place. Given that at any point of time there are questions asked by our regulators and auditors, and sometimes the questions could go back in time and ask us to do a time travel with our data, where they may have questions around what the data looked like maybe seven years ago, we have ensured our modern tech stack, our modern data solutions, now have answers for this.
Flavia Saldahna 00:47:16 But if you were to compare some of the abilities that we are able to provide here with our modern tech stack to how we have traditionally done it, we’ve always struggled when auditors ask us about lineage and auditors ask us about traceability, right? Lineage was always a number one bottleneck. Lineage was always something that was done very retroactively, not proactively or actively as the data moves with the pipelines. A good example would be where a lot of our data teams had to map out all of their mappings and the data movements that happen, like from one data table to another, in Excel spreadsheets. There were so many lineage tools that came into the market, and we always found ourselves where certain lineage tools had limitations with what type of file formats or data formats they could and couldn’t work with.
Flavia Saldahna 00:48:14 Now what we’ve done with the modern tech stack is we’ve made sure we were able to build some of those capabilities, but also, at the same time, augment some of the SaaS solution capabilities that are provided in order to extract the active lineage as the data keeps moving. So, as data is being collected, we collect all the traceability information and we load it into what we call meta marks, where all of the metadata about our data flowing is being captured. All the places where the handshakes happen, or maybe even logic changes and transformations that are happening, those all get captured. So, now we have a way where we are not just storing the data into our lakehouse but we are also storing the metadata, the information about our core data, into our meta marks, and we are able to provide real time lineage viewing capabilities to not just our regulatory consumers; anybody who needs to understand and drill through how our data has been evolving and growing can go and double click into the insights on our marketplace interfaces and be able to consume all that lineage. So, lineage is one of the examples, but it has very well been a thing where we have been careful about what were our struggle and pain points with the traditional tech stack and how we have overcome some of those with just some innovative solutions, either building them or buying them from the vendors, putting them all together in a very synchronized and orchestrated manner. So, we’re still compliant on our governance use cases and policy checks and things.
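Capturing lineage actively as data moves, rather than reconstructing it afterwards in spreadsheets, can be sketched in a few lines: each pipeline step writes a lineage event into a queryable metadata store. The names below are hypothetical illustrations of the idea, not the actual implementation.

```python
# Each pipeline step records source, target, and transformation into a metadata
# store, so auditors can trace how a reporting table was produced.
from datetime import datetime, timezone

lineage_store: list[dict] = []        # stand-in for a queryable metadata mart

def record_lineage(source: str, target: str, transformation: str) -> None:
    lineage_store.append({
        "source": source,
        "target": target,
        "transformation": transformation,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })

def curate_accounts() -> None:
    # ... actual transformation logic would run here ...
    record_lineage("raw.accounts", "curated.accounts", "map status codes to labels")

def build_regulatory_view() -> None:
    record_lineage("curated.accounts", "reporting.reg_view", "aggregate by product line")

curate_accounts()
build_regulatory_view()

for event in lineage_store:           # lineage view for any consumer, on demand
    print(event["source"], "->", event["target"], "|", event["transformation"])
```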
Kanchan Shringi 00:49:52 Thanks Flavia. That’s a good segue into ensuring quality and scale and trust. So, you talked a lot about lineage and that’s the example you know, you gave with respect to how do you actually ensure that you have that by introducing meta marks, what about privacy? How do you make sure you achieve it and how do you make sure you prove it to any compliance team?
Flavia Saldahna 00:50:18 Again, especially for an organization that is working towards shifting all the data from on-prem to cloud, the number one thing that we get questioned about by our stakeholders, lines of business and customers is how do we ensure my data is protected? How does this modern tech stack, where we are actually publishing data to a cloud solution and this cloud solution is external to the boundaries of our organization and environment, how do we ensure it’s still secured? What kind of information security policies and things are put in place? And we get this question a lot. Some of the things, and I’ll start at a high level with how we have again carefully strategized the way we keep the data secure: we do have our own data tokenization methods. We are not necessarily leveraging what the platforms themselves are providing, but these are our own custom tokenization methods that we are using to ensure all of the PII data or PHI data is protected at all times.
Flavia Saldahna 00:51:23 From the time the data is being read to the time the data is being published and made available for the consumers, we ensure that for every bit of that lifecycle of the data we have all the right tokenization methods and things involved. Now, many a time our traditional BI or data consumers especially do have the need to be able to get hands on some of the sensitive data. A good example for the banking industry I would tell you is a customer account number. Okay? And somebody in the marketing department may want to understand customer behavior and analytics and be able to tie it back to customer information. There will be certain authorized groups that would still require access to sensitive data, but then we want to make sure we have the right entitlements, identity access management or role-based access management procedures in place, to make sure we are able to provide some of these capabilities only to a restricted set of users in a timely manner.
Flavia Saldahna 00:52:28 And that’s where we again have our own custom logic to do de-tokenization on the fly. So, tokenizing the data, and even before tokenizing, I should mention, being able to identify the sensitive elements and then applying the tokenization, and for certain restricted users giving the capability to de-tokenize, all as part of our data security framework. And I am proud of the fact that it has been going and running really well for the organization. We hear fewer complaints about trying to uncover or unlock sensitive data that some of the teams may not even require, like what used to happen with our on-prem systems, right? So we have the right obfuscation and tokenization controls in place, and we are also now becoming much more optimistic about how we apply some of the data classification procedures too. So as we talk about the California Privacy Act, or we talk about some of the GDPR rules, the HIPAA Act and so on, we are now even doing some auto detection of data.
Flavia Saldahna 00:53:37 In the past when we started this whole concept of entrusting our data product teams to be able to identify the sensitive elements and then take all the measures to ensure the data security is applied, we always had to have trust in our data product owners to be able to abide and follow the measures. Now currently what is happening at my organization is we are taking the next step and we are trying to augment here and leverage some of the AI technologies and again the enterprise scaled AI solutions that can auto detect some of the sensitive information, some of the information that should have the right data classification labels and things. And we are leveraging this new concept of data tagging where we can tag the required data points for end users or even AI solutions to know what information is sensitive versus what is non-sensitive, what data points are critical versus what are non-critical. There’s a lot of emphasis and thought process that has been put into place where we have a right strategy to ensure data is protected at all times. And especially for banking industry, there are reputational risks that are tied to it, and we want to ensure we do nothing even when it comes to data where we hamper or risks the reputation of the organization itself.
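The tokenization and role-based de-tokenization flow described here can be illustrated with a toy sketch; a real deployment would use hardened, custom tokenization methods and a secured vault, and the roles and values below are hypothetical.

```python
# Sensitive values are replaced with opaque tokens before data leaves the trusted
# boundary; only explicitly entitled roles can reverse the mapping, on the fly.
import secrets

token_vault: dict[str, str] = {}            # token -> original value (secured store)
AUTHORIZED_ROLES = {"fraud-analytics"}      # roles entitled to de-tokenize

def tokenize(value: str) -> str:
    token = "tok_" + secrets.token_hex(8)
    token_vault[token] = value
    return token

def detokenize(token: str, role: str) -> str:
    if role not in AUTHORIZED_ROLES:
        raise PermissionError(f"role '{role}' is not entitled to sensitive data")
    return token_vault[token]

record = {"customer": "Jane Doe", "account_number": tokenize("1234567890")}
print(record)                                                   # PII protected at rest
print(detokenize(record["account_number"], "fraud-analytics"))  # authorized, on the fly
# detokenize(record["account_number"], "marketing")  -> raises PermissionError
```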
Kanchan Shringi 00:54:59 Let’s drill down a little bit more into AI now. With all these improvements in place with governance toolings, what is the one type of data quality issue that still persists and how does that impact AI? Because there’s a direct correlation between low quality data and hallucinating AI. Can you explain that perhaps with an example?
Flavia Saldahna 00:55:24 Yes. Many a times when we challenge data engineers to start building data products in a way that AI solutions and machines now can start consuming it. The very first thought that runs in any data team or data engineer’s mind is to ensure we give a very clean data set to the AI solutions. What we have been doing in the recent past is challenging our data teams to think beyond data cleanliness. So yes, we want to make sure we are providing clean, accurate data, quality check, quality control data. But again, at the same time what is more important when we are about to feed AI solutions is to ensure we have the right context around the data and we are enriching and supporting it with the right metadata for all this information so AI and solutions are able to process all of this information and the context and the domain and everything that we provide it can take meaningful decisions without having to hallucinate anything.
Flavia Saldahna 00:56:33 A good example I can give you, and this is more around context setting, maybe a very simple example, is on our operational processes that happen at the bank, right? Like where you have loan processing that takes place for the documents. We are trying to leverage what AI and AI agents can do to automate some of the processes. These are very redundant and pretty manual processes that take place every single day at a bank, and where we’ve been trying to untap some of the things is how do we train the models in a way where we give them enough documentation and context in the right way? Let’s say, for example, there is a certain piece on the underwriting process itself. We are ensuring the documents on which AI is being trained to make decisions have the most accurate and the recently approved, quality-controlled documentation, per se, around that underwriting process itself.
Flavia Saldahna 00:57:31 If we try to give something that existed maybe years ago, and the team has really not gone and made any changes to the procedures and things, these are the times where your data is stale. The information that you’re providing in the documents may be stale, and these are again flags or signals and indications where AI starts to hallucinate, because you have provided stale data to it and now it cannot distinguish what is current. AI is going to give you decisions based on what you’re feeding it, right? You want to make sure you are feeding it with the right context, not just clean data or structured data, but that it is able to comprehend, able to understand all the different scenarios and the situations in which, and when, it has to make certain decisions. This is one example where I can say the teams are being challenged right now to ensure they’re investing a lot of time around building the right semantics for AI.
Flavia Saldahna 00:58:32 And one of the ways we are trying to do this is to make sure we have a controlled environment, or a control layer for data, against which the AI solutions would be sourcing the data. We are not keeping our complete unified data model exposed to our AI solutions, where AI can just go and grab raw data, or it can pick and choose and decide on its own what data it should look for and then be able to take decisions. But when you are able to control an environment, and you make sure that controlled environment, which in our world we call the AI and BI serving layer, which is nothing but where our semantic models have now taken shape, we ensure we are feeding the right input, the right sources and the right information to the AI solutions, and not necessarily keeping it so autonomous that AI can just plug and play the connection with our data or the lakehouse itself and be able to make its own decisions. This is one way of how we control the context, control the semantics, and we ensure we know what we are feeding to the AI, and AI can only make certain meaningful decisions. But again, even with those decisions, there’s always a human in the loop at this point of time who then observes and reviews these decisions before those decisions turn into actions.
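The controlled AI serving layer described above can be sketched as a retrieval function that only serves approved, sufficiently fresh semantic content to the model, with a human review step before any action. The view names, dates, and review mechanism below are hypothetical.

```python
# The model may only retrieve from an allow-list of approved views; stale content is
# filtered out, and every suggestion waits for human review before becoming an action.
from datetime import date

APPROVED_VIEWS = {
    "underwriting_policy_v3": {"approved_on": date(2025, 3, 1), "text": "Current underwriting rules ..."},
    "loan_status_semantics": {"approved_on": date(2024, 11, 5), "text": "1 = OPEN, 0 = CLOSED ..."},
}

def retrieve_context(view_names: list[str], max_age_days: int = 365) -> list[str]:
    context = []
    for name in view_names:
        view = APPROVED_VIEWS.get(name)
        if view is None:
            raise ValueError(f"{name} is not an approved source for AI consumption")
        if (date.today() - view["approved_on"]).days > max_age_days:
            continue                       # exclude stale documents to limit hallucination
        context.append(view["text"])
    return context

def propose_decision(question: str) -> dict:
    context = retrieve_context(["underwriting_policy_v3", "loan_status_semantics"])
    suggestion = f"Draft answer to '{question}' grounded in {len(context)} approved sources"
    return {"suggestion": suggestion, "status": "pending_human_review"}   # human in the loop

print(propose_decision("Can this application proceed to underwriting?"))
```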
Kanchan Shringi 00:59:58 Again, in today’s world with AI, what does Realtime mean? Is it seconds, milliseconds? And why is that really so critical for agentic applications?
Flavia Saldahna 01:00:11 Realtime? Yes. So, Realtime, as soon as we hear the word we think about instantaneous, right? Something needs to be made available instantly and at least in my opinion, I really don’t think there is a single definition for what real time would mean, especially with AI use cases and how we are working towards meeting some of the data needs, whether there could be some Realtime reporting needs. We do get a lot of these questions around does Realtime mean where data would be available in milliseconds, nanoseconds, or so, on? I think the definition is not really centered around the speed at which the data is available, but it is more around what is the purpose or what is the use case that we are after and what is the, in our world we call the data latency or what is the requirement or the timing requirement that the use case has.
Flavia Saldahna 01:01:14 I try to use the same example of fraud that I have used earlier. Teams like fraud analytics really require to understand if there is any suspicious activity happening when there are some ACH transactions occurring, like a credit card transaction, and these teams need to be alerted instantaneously. Now again, one could debate what instantaneously means; based on the use cases they have, based on the service level objectives they have, the team should be able to define what real time would mean. There is no single definition or a single time that we can pinpoint and say it applies to everything. As we start moving towards agentic AI solutions, or enabling our data solutions for AI adoption, even for those, the use cases are going to help define it. Now there are a lot of other reporting use cases, or maybe even batch extract deliveries that happen, that may not really require real time. The service level objective that gets defined for every data product in our case, or for every use case that you are catering to, will help define what real time actually means for that, and how you define the timing and the scoping of that use case.
Kanchan Shringi 01:02:41 If it is milliseconds, what are the challenges?
Flavia Saldanha 01:02:44 Yes. If it is milliseconds, all of our traditional technologies, tools, and tech stacks have always had this challenge of exchanging data between different modules and components at the pace the use case requires, right? In this particular example, where it requires milliseconds, we have really not seen promising results where all the different applications in an entire IT ecosystem can integrate and handshake with each other effectively enough to deliver results in milliseconds. But now, with all of the AI advancements that are happening and with some of the LLMs that are coming up, we are seeing how we can process and translate some of this data, and even integrate it between different components, applications, or modules, in a much faster way. So yes, we are seeing results from some of these AI advancements.
Flavia Saldanha 01:03:50 I can share only some limited information about the experimentation happening in my organization, where yes, we are relying on some of these LLMs to see how we can process unstructured and structured data seamlessly from the outside world, when bank users or customers interact with our digital solutions, so that our internal processes can comprehend the information and take meaningful actions on that data. This is where a lot of embeddings and vector databases come into play, and then how you retrieve information from them and arrive at a meaningful action all plays a role. So yes, I would say in certain use cases, coming up with an action in milliseconds is now happening with the help of some AI advancements. But I'm pretty sure a lot of organizations at this point are experimenting with their use cases and trying to see whether they can scale and deliver consistent results, that is, deliver results in milliseconds every time, in order to achieve that real-time goal.
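The embedding-and-retrieval flow she mentions can be sketched in a few lines. The embed() function below is only a stand-in for a real embedding model, and the "vector database" is just an in-memory list, but the shape of the pipeline, embed the documents, embed the query, return the nearest match, is the same.

```python
import math

# Toy sketch of embedding + vector retrieval. embed() is a stand-in for a
# real embedding model; it hashes characters into a small normalized vector
# purely so the example runs end to end.

def embed(text: str, dim: int = 8) -> list[float]:
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[i % dim] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# "Vector database" reduced to an in-memory index of (text, embedding) pairs.
documents = [
    "dispute a card transaction",
    "open a new savings account",
    "report a lost debit card",
]
index = [(doc, embed(doc)) for doc in documents]

query = "my card was charged twice, how do I dispute it?"
best_doc, _ = max(index, key=lambda item: cosine(embed(query), item[1]))
print("Nearest document by cosine similarity:", best_doc)
```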
Kanchan Shringi 01:05:07 For people who are moving from traditional data roles to this AI-focused one, can you talk about what they should unlearn, and then what concepts and tools they should learn, in priority order?
Flavia Saldanha 01:05:22 Absolutely, yes. Given the nature and evolution the data engineering discipline has always had, I think we've spent a lot of time working with structured data and keeping a human in the loop, making sure we are serving humans. It is high time data engineers start to think about the new consumer base that has come in, where we are now serving not just humans but also machines, and we want to be more careful and pragmatic in how we make data available to our AI solutions. Having said that, yes, it's time to go beyond the realm of just understanding ETL or ELT technologies, which have always been about extracting, loading, and transforming. How do you become part architect? How do you become part product thinker?
Flavia Saldanha 01:06:19 How do you become, call it, part evangelist for your data solution? That becomes a very essential role if we want to support our data solutions and organizations in a way where we are not compromising on ethical values or on data privacy and data governance measures. We definitely have to invest a lot of time in understanding the natural language processing concepts behind some of these AI solutions. The need of the hour is that everybody in the business is hungry for AI-enabled data solutions, which is okay, but at the same time, being careful not to just dump all of the data and make it available to AI solutions is equally important. So we make sure we are putting a lot of emphasis on crafting, articulating, and designing solutions in a way where we are not compromising on any of our core data engineering values and measures.
Flavia Saldanha 01:07:27 But at the same time, how do we leverage and augment some of these newer, more advanced technologies like vector databases? If you think about it, this is something very innovative for database developers like me who have always worked with structured data, right? How does word embedding happen? How does image embedding happen? How do they all get converted into numerical representations? And when AI converts them into a language it understands and then reconverts them back into a language humans understand, how do we make sure all the right controls and policies are in place across that conversion and reconversion? That becomes even more important. So yes, we have to stick to our core, but at the same time we also have to understand the role and essence of all the innovative technologies AI is bringing into the mix. That way we are not just providing a direct solution for the business, but also for machines, so machines can take informed decisions and meaningful actions on this data.
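One concrete control in that conversion path is to redact obviously sensitive values before any text is sent to an embedding model, so the numerical representation never encodes raw identifiers. This is a hedged sketch, not a complete policy; the regular expressions are illustrative only.

```python
import re

# Illustrative control: redact sensitive tokens before text is embedded,
# so account numbers or SSNs never end up inside a vector representation.
# The patterns below are examples, not an exhaustive data-privacy policy.

REDACTION_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{13,16}\b"), "[CARD_NUMBER]"),
]

def redact_before_embedding(text: str) -> str:
    for pattern, placeholder in REDACTION_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(redact_before_embedding(
    "Customer 123-45-6789 disputed a charge on card 4111111111111111"))
# -> Customer [SSN] disputed a charge on card [CARD_NUMBER]
```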
Kanchan Shringi 01:08:38 What about MLOps? What is the bridge between the data engineer and MLOps?
Flavia Saldanha 01:08:46 Yes, I can tell you from the last few years that the number one, or at least the most common, complaint data scientists have always had is not getting data in a format machines can consume in order to build what they call feature stores, or models on which they can perform advanced analytics and generate predictive insights. The data engineering team and the DevOps team have a very important role to play in building that bridge, as you alluded to earlier, to enable those analytical capabilities from a machine learning standpoint. When the data lake concept existed, machine learning was also evolving and growing, but at that time most of the MLOps effort went into unpacking all of the raw extracts and raw data that was coming in.
Flavia Saldahna 01:09:48 Most of the time was actually spent and having to process that raw data even before actual machine learning can happen on that data. There was a little bit of structuring that was required on that raw data. This is where the data engineering DevOps play a very important role where you then help ensure all that raw data has the right required semantics in place for machine learning to be able to act on it, to be able to proceed further into their predictive or analytical processing and insights on it. Definitely DevOps and data engineering, DevOps play a very critical as and a crucial role in ensuring we have data, which is ML ready. And I don’t think ML ops really needs to act directly on raw data that is just collected right from the source. There has to be some careful consideration and processing and structuring in place before MLOps can start acting on it.
Kanchan Shringi 01:10:50 Thank you, Flavia. Starting to wrap up now. You've covered a lot of ground, but is there anything you'd like to summarize as a takeaway, or something fundamental that we missed that you must communicate?
Flavia Saldanha 01:11:04 Sure. First of all, I want to really thank the Software Engineering Radio podcast for giving me the opportunity not just to talk about the recent and modern trends in data engineering, but also to speak to how this discipline has evolved. Many times I get the question of whether AI is going to replace all the data engineers. I definitely want to use this platform as a medium to let all my fellow data engineers know that AI, as an innovative solution, is not here to replace data engineers; treat AI as an assistant or a copilot to work with, to help modernize your solutions and make great strides with what you've already been doing with your data. With that said, there is a lot of data that is going to require processing, and we have a very big role to play in ensuring ethical and non-biased data and AI practices are put in place. We are essentially going to be fueling all of the innovation and technology that is happening, because without the data, AI really can't do very much. A very important point for all of us to understand: rather than worrying about our roles, our existence, and what's going to happen to our careers, it's high time we start embracing AI, understanding its potential, and making sure we use AI in the most ethical and governed way possible, not just for ourselves but for our organizations.
Kanchan Shringi 01:12:42 How can people contact you?
Flavia Saldanha 01:12:44 Yes, the easiest way to contact me is LinkedIn. If you search for my name, Flavia Saldanha, you should be able to get hold of me. Feel free to message me if there are certain areas or a particular data challenge you would like to talk about. Other than that, I am quite active in our IEEE communities, especially Women in Engineering, where I have been mentoring women, especially women in data organizations, and helping shape their careers. That could be another way of getting in touch with me. So do not be a stranger, and if you are looking for some guidance and help around data engineering, absolutely, please feel free to contact me.
Kanchan Shringi 01:13:27 Thank you, Flavia. This was a very interesting and informative conversation. I really appreciate you coming on.
Flavia Saldanha 01:13:34 Thank you so much, Kanchan. It was my pleasure to be here on this platform. Thank you.
[End of Audio]


