SE Radio 523: Jessi Ashdown and Uri Gilad on Data Governance

Jessi Ashdown and Uri Gilad, authors of the book Data Governance: The Definitive Guide, discuss what data governance entails and how to implement it. Host Akshay Manchale speaks with them about why data governance is important for organizations of all sizes and how it impacts everything in the data lifecycle from ingestion and usage to deletion. Jessi and Uri illustrate that data governance helps not only with enforcing regulatory requirements but also empowering users with different data needs. They present several use cases and implementation choices seen in industry, including how it’s easier in the cloud for a company with no policies over their data to quickly develop a useful solution. They describe some current regulatory requirements for different types of data and users and offer recommendations for smaller organizations to start building a culture around data governance.

Show Notes

Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Akshay Manchale 00:00:16 Welcome to Software Engineering Radio. I’m your host, Akshay Manchale. Today’s topic is data governance, and I have two guests with me, Jessi Ashdown and Uri Gilad. Jessi is a Senior User Experience Researcher at Google. She led data governance research for Google Cloud for three and a half years before moving to leading privacy, security, and trust research on Google Wallet. Before Google, Jessi led enterprise research for T-Mobile. Uri has been a Group Product Manager at Google for the last four years, helping cloud customers achieve better governance of their data through advanced policy management and data organization tooling. Prior to Google, Uri held executive product positions in security and cloud companies, such as Forescout, Check Point, and various other startups. Jessi and Uri are both authors of the O’Reilly book Data Governance: The Definitive Guide. Jessi, Uri, welcome to the show.

Uri Gilad 00:01:07 Thank you for having us.

Akshay Manchale 00:01:09 To start off, maybe Jessi, can we start with you? Can you define what data governance is and why it is important?

Jessi Ashdown 00:01:16 Yeah, definitely. So I think one of the things when defining data governance is really looking at it as a big-picture definition. Oftentimes when I talk to people about data governance, they’re like, “Isn’t that just data security?” And it’s not; it’s so much more than that. It is data security, but it’s also organizing your data, managing your data, and how you are able to distribute your data so that folks can use it. And in that same vein, if we ask why it is important and who it is important for: not to be dramatic, but it’s wildly important, because how you’re organizing and managing your data is really how you’re able to leverage the data that you have. And definitely, I mean, this is what we’re going to talk about pretty much the entire session: how you’re thinking about the data that you have, and how governance really gets you to a place where you’re able to leverage that data and really utilize it. And so when we’re thinking in that vein, who is it for? It’s really for everyone, all the way from satisfying legal inside your company to the end customer somewhere, right, who is exercising their right to delete their data.

Akshay Manchale 00:02:27 Outside of these legal and regulatory requirements that might say you need to have these governance policies, are there other consequences of not having any sort of governance policies over the data that you have? And is it different for small companies versus large companies in an unregulated industry?

Uri Gilad 00:02:45 Yes. So obviously the immediate go-to for people is, if I don’t have data governance, legal or the regulator will be after me. But really, putting legal and regulation aside, data governance is, for example, about understanding your data. If you have no understanding of your data, then you won’t be able to effectively use it. You will not be able to trust your data. You will not be able to efficiently manage the storage for your data, because you will be creating duplicates. People will be spending a lot of their time hunting down tribal knowledge: “Oh, I know this engineer who created this data set; he will tell you what the column means,” these kinds of things. So data governance is really part of the fabric of the data you use in your organization. And whether it’s big or small, it’s more about the size of your data store than the size of your organization. Think about a fabric which has loose threads that are beginning to fray. That is a data fabric without governance.

Akshay Manchale 00:03:50 Sometimes when I hear data governance, I think about maybe there are restrictions on it. Maybe there are controls about how you can access it, et cetera. Does that come at odds with actually making use of that data? For instance, if I’m a machine learning engineer or a data scientist, maybe I want all access to everything there is so that I can actually make the best possible model for the problem that we’re solving. So is it at odds with such use cases or can they coexist in a way you can balance the needs?

Uri Gilad 00:04:22 So the short answer is, of course, it depends. And the longer answer is that data governance is more of an enabler, in my opinion, than a restrictor. Data governance does not block you from data. It sort of funnels you to the right kind of data to use: for example, the data with the highest quality, the data that is most relevant, curated customer cases rather than raw customer cases. And when people think about data governance as a data restriction tool, the question to be asked is, what exactly is it restricting? Is it restricting access? Okay, why? And if the access is restricted because the data is sensitive, for example, the data should not be shared around the organization, then there are two immediate follow-up questions. One is, if the data is to be used only within the organization and you are generating a general-purpose, customer-facing machine learning model, for example, then maybe you shouldn’t, because that has issues with it. Or maybe, if you really want to do that, go and formally ask for that access, because maybe the organization needs to just record the fact that you asked for it. Again, data governance is not a gate to be unlocked or leapt over or whatever. It’s more of a highway that you need to properly signal and get on.

Jessi Ashdown 00:05:49 I would add to that, and this is definitely what we’re going to get more into, the idea of data governance really being an enabler. A lot of it, which hopefully folks will get out of listening to this, is how you think about it and how you strategize. And as Uri was saying, it matters whether you’re strategizing from that defensive standpoint versus a more offensive one of, “Okay, how do we protect the things that we need to, but how do we democratize it at the same time?” They don’t have to be at odds, but it does take some thought and planning and consideration in order for you to get to that point.

Akshay Manchale 00:06:22 Sounds great. And you mentioned earlier about having a way to find and know what data you have in your organization. So how do you go about classifying your data? What purpose does it serve? Do you have any examples to talk about how data is classified nicely versus something that is not classified nicely?

Jessi Ashdown 00:06:41 Yeah, it’s a great question. And one of my favorite quotes about data governance is “You can’t govern what you don’t know.” And that really stems back to your question about classification. Classification is really a place to start. You can’t govern, and govern meaning like I can’t restrict access, I can’t figure out what sort of analytics I even want to do, unless I really think about classifying. And I think sometimes when folks hear classification, they’re like, “Oh my gosh, I’m going to have to have 80 million different classes of my data, and it’s going to take an inordinate amount of tagging and things like that.” And it could; there are certainly companies that do that. But to your point about examples, through the research that I’ve done over the years, there have been many different approaches that companies have taken, all the way from just a literal binary of red, green, right?

Jessi Ashdown 00:07:33 Like red data goes here and people don’t use it, and green data goes here and people use it, to things that are more complex, like, “Okay, let’s have our top 35 classes or categories of data.” So we’re going to have marketing, we’re going to have financial, there’s HR, or what have you. Right. And then we’re just going to look at these 35 classes and categories, and that’s what we’re going to divide by and then set policies on. I know I’m jumping ahead a little bit by talking about policies. We’ll get more to that later, but yeah, think about classification as a method of organization. Uri, I think you have some to add to that too.

Uri Gilad 00:08:11 Think about data classification as the augmented reality glasses that enable you to look at your data. The underlying theme in the industry generally today is a combination of manual labeling, which Jessi mentioned, like we have X categories and we need to manually label them, and machine-assisted or even machine-generated classification. Take, for example, red, green. Red is everything we don’t want to touch. Maybe this data source always produces red data. You don’t need a human to do anything there. You just mark this data source as unsuitable or sensitive, and you’re done. Obviously classification and cataloging have evolved beyond that. There is a lot of technical metadata that is already available with your data and is immediately useful to end users without even going through actual classification. Where did the data come from? What is the data source? What is the data’s lineage, like, which data sources were used in order to generate this data?

Uri Gilad 00:09:19 If you think about structured data, what is the table name, the column name? Those are useful things that are already there. If it’s unstructured data, what is the file name? And then you can begin, and this is where we can talk a little bit about common data classification methods, going one layer deeper. One layer deeper for images is classic: there are a lot of data classification technologies for images and what they contain, and there are a lot of companies there. Also for structured data: it’s a table, it has columns. You can sample enough values from a column to get a sense of what that column is. It’s a nine-digit number. Great. Is it a nine-digit social security number or is it a nine-digit phone number? There are patterns in the data that can help you find that. Addresses, names, GPS coordinates, IP addresses: all of those are values that can be detected and extracted by machines. And now you begin to layer over that with human curation, which is where we get that overwhelming labeling that Jessi mentioned. And you can say, “Okay, humans, please tell me if this is a customer email or an employee email.” That is probably an immediate thing a human can do. And we are seeing tools that allow people to actually crowdsource this kind of information. And Jessi, I think you have more about that.
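
To illustrate the column sampling Uri describes, here is a minimal Python sketch. It is an assumption-laden toy, not any vendor's classifier: the pattern set, the `classify_column` name, the sample size, and the 80% threshold are all hypothetical. Note that a bare nine-digit value is deliberately reported as ambiguous, since pattern matching alone cannot tell a social security number from a phone number, which is where human curation would come in.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; real classifiers use far
# richer detectors (checksums, context words, dictionaries, ML models).
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ipv4": re.compile(r"^(?:\d{1,3}\.){3}\d{1,3}$"),
    "us_ssn_formatted": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    # Ambiguous on purpose: could be an SSN, a phone number, or an ID.
    "nine_digit_number": re.compile(r"^\d{9}$"),
}


def classify_column(values, sample_size=100, threshold=0.8):
    """Guess a label for a column from a sample of its values.

    Returns the most common matching label if it covers at least
    `threshold` of the sampled values, otherwise "unknown".
    """
    sample = [str(v).strip() for v in values[:sample_size] if v is not None]
    if not sample:
        return "unknown"

    counts = Counter()
    for value in sample:
        for label, pattern in PATTERNS.items():
            if pattern.match(value):
                counts[label] += 1
                break  # first matching pattern wins for this value

    if not counts:
        return "unknown"
    label, hits = counts.most_common(1)[0]
    return label if hits / len(sample) >= threshold else "unknown"
```

A column that comes back as `nine_digit_number` or `unknown` is a natural candidate to route to a data steward, while confidently labeled columns can be tagged automatically.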

Jessi Ashdown 00:10:53 Yeah. I’m so glad that you brought that up. I have a funny story of a company that I had interviewed, and they were talking about the curation of their data, right? Sometimes these folks are called data stewards, or they’re doing data stewardship tasks, and they’re the person who goes in and, as Uri was saying, is that human asking, “Okay, is this an email address? What is this sort of thing?” And this company had a full-time person doing this job, and that person quit, and I quote, because it was soul sucking. Uri’s point is so good: the classification and curation is so important, but my goodness, having a person do all that, no one’s going to do it, right? And oftentimes it doesn’t get done at all because it’s nobody’s full-time job.

Jessi Ashdown 00:11:44 And the poor folks whose job it is, I mean, this is just one case study, right? But they quit because they don’t want to do that. So, you know, there are many methods. The answer isn’t to just throw up your hands and say, “I’m not going to classify anything,” or “We have to classify everything.” But as Uri is really getting at, it’s finding those places: can we leverage some of that machine learning or some of the technologies that have come out that really automate some of these things, and then have your manual humans do some of these other things that the machines can’t quite do yet?

Akshay Manchale 00:12:17 I really like your initial approach of just classifying it as red and green. That takes you from having absolutely no classification to some sort of classification, and that’s really nice. However, when you come to, say, a large company, you might end up seeing data that’s in different storage mediums, right? You might have a data lake that’s a dump-all ground for things. You might have the database that’s running your operations. You might have logs and metrics that are just operational data. Can you talk a little bit about how you catalog these different data sources in different storage mediums?

Uri Gilad 00:12:52 So this is a bit where we talk about tooling and what tools are available, because you are already saying there’s a data store that looks like this and another data store that looks like that. And here’s what not to do, because I’ve seen this done many times. You have this conversation with a vendor, and I’m very much aware that Google Cloud is a vendor, and the vendor says, “Oh, that’s easy. First of all, move all of your data to this new magical data store, and everything will be right with the world.” I have seen many organizations who have a series of graveyards where, “Oh, this vendor told us to move there. We started a 6-year project. We moved half the data. We still had to use the data store that we were originally migrating out of. So we ended up with two data stores, and then another vendor came and told us to move to a third data store.”

Uri Gilad 00:13:47 So now we have three data stores, and those seem to be continuously duplicating. So don’t do that. Here’s a better approach. There are a lot of third-party as well as first-party catalogs, by which I mean cloud provider-based catalogs. All of these products have plugins and integrations to all of the common data stores. Again, the features and bells and whistles on each of those plugins and each of those catalogs differ, and this is where maybe you need to do a sort of ranked choice. But at the end of the day, the industry is in a place where you can point a data catalog at a certain data store, it will scrape it, it will collect the technical metadata, and then you can decide what you want to move, what you want to further annotate, what you are satisfied with. Oh, all of this is green, all of this is red, and move on. Think about a layered strategy and also a land-and-expand strategy.

Akshay Manchale 00:14:49 Is that like a plug-and-play sort of solution that you say might exist as a third-party tool, or maybe even in cloud providers, where you can just point to it and maybe it does the machine learning, saying, “Hey, okay, this looks like a nine-digit number. So maybe this is a social security number or something. So maybe I’m going to just limit access to this.” Is there an automated way to go from zero to something when you’re using third-party tools or cloud providers?

Uri Gilad 00:15:13 So I want to break down this question a little bit. There’s cataloging, and there’s classification. Those are normally two different steps. Cataloging usually collects technical metadata: file names, table names, column names. Classification usually gets kicked off by, “Please look at this table, data set, file, or bucket, and classify the contents of this destination.” And there are different classification tools. I’m obviously colored as coming from Google Cloud. We have Google Cloud DLP, which is fairly robust and actually was used internally within Google to sift through some of our own data. Interestingly enough, we had a case where Google was doing some of its support for some of its products over a sort of chat interface, and that chat interface, for regulatory purposes, was captured and stored. And customers would begin a chat like, “Hi, I am so-and-so, this is my credit card number. Please extend this subscription from this value to that value.” And that’s a problem, because that data store, speaking about governance, was not built to hold credit card numbers. Despite that, customers would really insist on providing them. And one of the key initial uses for the data classifier was to find credit card numbers and actually eliminate them, actually delete them from the record, because we didn’t want to keep them.
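
The credit card anecdote above can be sketched in a few lines of Python. This is not how Google Cloud DLP works internally and does not use its API; it is a simplified stand-in that assumes card numbers appear as 13 to 19 digits, optionally separated by spaces or dashes, and that the standard Luhn checksum is enough to filter out most non-card digit runs. The function names and the placeholder text are hypothetical.

```python
import re

# Candidate: 13 to 19 digits, each optionally followed by a space or dash.
CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_card_numbers(text: str, placeholder: str = "[REDACTED CARD]") -> str:
    """Replace likely card numbers in free text with a placeholder."""

    def replace(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        # Only redact digit runs that also pass the checksum.
        return placeholder if luhn_valid(digits) else match.group()

    return CANDIDATE.sub(replace, text)
```

Running a pass like this at ingestion, before the chat log is written to long-term storage, keeps the sensitive value out of a store that was never meant to hold it.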

Akshay Manchale 00:16:48 So is this whole process easier in the cloud?

Uri Gilad 00:16:51 That’s an excellent question. And the topic of cloud is really relevant when you talk about data classification and data cataloging, because think about the era that existed before cloud. Your big data storage was a SQL server on a mini tower in some cubicle, and it would happily churn through its disk space. And when you needed to store more data, somebody needed to walk over to the computer store and buy another disk or whatever. In the cloud, there’s an interesting situation where suddenly your infrastructure is unlimited. Really, your infrastructure is unlimited, costs are always going down, and now you are in a reverse situation. Before, you had to censor yourself in order not to overwhelm that poor SQL server in a mini tower in the cubicle, and suddenly you are in a different situation where your default is, “Ah, just keep it in the cloud and you will be fine.”

Uri Gilad 00:17:47 And then enters the topic of data governance and whether it’s easier in the cloud. It’s easier because compute is also more accessible. The data is immediately reachable. You don’t need to plug in another network connection to that SQL server. You just access the data through an API. You have highly trained machine learning models that can operate on your data and classify it. So, from that aspect, it is easier. On the other side, from the standpoint of scale and volume, it’s actually harder, because people default to just, “Ah, let’s just store it. Maybe we’ll use it later,” which kind of presents an interesting governance challenge.

Jessi Ashdown 00:18:24 Yes, that’s exactly what I was going to mention too. With the advent of cloud storage, as Uri was saying, you can just say, “Oh, I can store everything,” and just dump and dump and dump. And I think a lot of past dumpage is where we’re seeing a lot of the problems come from now, right? Because people just thought, “Well, I’ll just collect everything and put it somewhere. And maybe now I’ll put it in the cloud, because maybe that’s cheaper than my on-prem that can’t hold it anymore,” right? But now you’ve got a governance conundrum, right? You have so much, and honestly, some of it might not even be useful, that now you’re having to sift through and govern, and this poor guy, let’s call him Joe, is going to quit because he doesn’t want to curate all that. Right?

Jessi Ashdown 00:19:13 So I think one of the takeaways there is that there are tools that can help you, but you also need to be strategic about what you save and really think about it. And I guess we were kind of getting to that with our classification and curation: not that you have to then cut everything that you don’t need, but just think about it and consider, because there might be things that you put in this kind of storage or that place. Folks have different zones and data lakes and what have you. But yeah, don’t store everything, but don’t not store everything either.

Akshay Manchale 00:19:48 Yeah. I guess the elasticity of the cloud definitely brings in more challenges. Of course, it makes certain things easier, but it does make things challenging. Uri, do you have something to add there?

Uri Gilad 00:19:59 Yeah. So, here’s another unexpected benefit of cloud, which is formats. We, Jessi and I, talked recently to a government entity, and that government entity is actually bound by law to index and archive all kinds of data. And it was funny, they were sharing an anecdote with us: “Oh, we are just about to finish scanning the mountain of papers dating back to the 1950s. And now we are finally getting into advanced file formats such as Microsoft Word 6,” which is, by the way, the Microsoft Word which was prevalent in 1995. And they were like, those are available on floppy disks and that kind of stuff. Now, I’m not saying cloud will magically solve all your format problems, but you can definitely keep up with formats when all of your data is accessible through the same interface, rather than a filing cabinet, which is another kind of one point.

Akshay Manchale 00:20:58 In a world where maybe they are dealing with current data and they have an application out there, they have some sort of need or they understand the importance of data governance: you’re ingesting data, so how do you add policies around ingestion? Like, what is acceptable to store? Do you have any comments about how to think about that, how to approach that problem? Maybe Jessi.

Jessi Ashdown 00:21:20 Yeah. I mean, I think, again, this goes to that idea of really being planful, of thinking about what you need to store. And just as with classification, where we talked about these different ideas of red, green, or these top categories, Uri and I, in talking to many companies, have also heard different methods for ingestion. So, I certainly don’t think there’s only one good way to do it. We’ve heard different ways of, “Okay, I’m going to ingest everything into one place as a holding place. And then once I curate that data and I classify that data, I will move it into another location where I apply blanket policies.” So, in this location, the policy is everyone gets access, or the policy is no one gets access, or just these people do.

Jessi Ashdown 00:22:13 So there’s definitely a way to think about it in terms of the different kinds of ingestion methods that you have. But the other thing too is thinking about what those policies are and how they help you or how they hinder you. And this is something that we’ve heard a lot of companies talk about. And I think you were kind of getting at that at the beginning too: Are governance and data democratization at odds? Can you have them both? And it really comes down a lot of times to what the policies are that you create. A lot of folks for quite a long time have gone with very traditional role-based policies, right? If you are this analyst working in this team, you get access. If you are in HR, you get this kind of access. And I know Uri’s going to talk more about this, but what we found is that these sorts of role-based access methods of policy enforcement are sort of outdated, and Uri, I think you had more to mention on that.

Uri Gilad 00:23:14 So a couple of things. First of all, thinking about policies: policies are really tools that say who can do what with what. And what Jessi was alluding to earlier is that it’s not only who can do what with what, but also in what context. I may be a data analyst, and I’m spending 9AM till 1PM working for marketing, in which case I’m mailing a lot of customers our latest shiny, glossy catalog, in which case I need customers’ home addresses. In the second part of the day, it’s the same me looking at the same data, but now the context I’m operating in is that I need to understand, I don’t know, usage or invoices or something completely different. That means I should probably not access customers’ home addresses. That data should not be used as a source product for everything downstream from whatever reports I’m generating.

Uri Gilad 00:24:17 So context is also important, not just my role. But just to pause for a moment and acknowledge the fact that policies are much more than just access control: policies talk about life cycle. We talked about, for example, ingesting everything and dropping everything in a sort of holding place. That’s the beginning of a life cycle. The data is first held, then maybe curated, analyzed, and quality tested: you test that the data is high quality, that there are no broken records, no missing elements, no typos. Then you maybe want to retain certain data for certain durations. Maybe you want to delete certain data, like in my credit card example. Maybe you are allowed to use certain data for certain use cases and not for other use cases, as I explained. So all of these are like worldly policies, but it’s all about what you want to do with the data, and in what context.

Akshay Manchale 00:25:23 Do you have any example where maybe the sort of role-based classification where you are allowed to access this depending on your job function may not be sufficient to have a place where you’re able to extract the most out of the underlying data?

Jessi Ashdown 00:25:38 Yeah, we do. There was a company that we had spoken to that is a large retailer, and they were talking about how role-based policies aren’t necessarily working for them very well anymore. And it was very close to what Uri was discussing just a few minutes ago. They have analysts who are working on sending out catalogs or things like that, right? But let’s say that you also have access to customers’ emails and things like that, or shipping addresses, because you’ve had to ship something to them. So let’s say they bought, I don’t know, a chair or something. And you’re an analyst; you have access to their address and whatnot because you had to send them the chair. And now you see that, oh, our slipcovers for these chairs are on sale.

Jessi Ashdown 00:26:26 Well, now you have a different hat on. Now the analyst has a marketing hat on, right? My focus right now is marketing, sending out marketing material, emails on sales and whatnot. Well, if I collected that customer’s data for the purpose of just shipping something that they had bought, I can’t, unless they’ve given permission, use that same email address or home address to send marketing material to. Now, if your policy was just, here are my analysts who are working on shipping data, and then here are my marketing analysts, and I just had role-based access control, that would be fine. These things would not intersect. But if you have the same analyst who, as Uri had mentioned, is accessing these data sets, same data sets, same engineer, same analyst, but for completely different purposes, some of those are okay, and some of those are not. They were one of the first companies that we had talked to that were really saying, “I need something more that is more along a use case, like a purpose: what am I using that data for?” It’s not just who am I and what’s my job, but what am I going to be using it for? And in that context, is it acceptable to be accessing and using the data?
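
The retailer example can be expressed as a small purpose-aware access check. The following Python sketch is hypothetical throughout: the role, purpose, and column names, the two policy tables, and the consent flag are invented for illustration and are not drawn from any product discussed in the episode. The point it shows is that the decision key is the pair (role, purpose) rather than the role alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    role: str      # who is asking, e.g. "analyst"
    purpose: str   # why they are asking, e.g. "shipping" or "marketing"
    column: str    # what they want to read
    customer_consented_to_marketing: bool = False


# Hypothetical policy tables keyed by (role, purpose), not role alone.
ALLOWED = {
    ("analyst", "shipping"): {"home_address", "email"},
    ("analyst", "marketing"): {"purchase_history"},
}

# Columns usable for a purpose only if the customer has opted in.
CONSENT_GATED = {
    ("analyst", "marketing"): {"home_address", "email"},
}


def is_allowed(request: AccessRequest) -> bool:
    """Decide access from role, purpose, and customer consent together."""
    key = (request.role, request.purpose)
    if request.column in ALLOWED.get(key, set()):
        return True
    if request.column in CONSENT_GATED.get(key, set()):
        return request.customer_consented_to_marketing
    return False  # default deny for anything not listed
```

With a role-only table, the analyst in the example would either always or never see the home address; keying on purpose lets the same person be allowed for shipping and denied for marketing unless consent was given.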

Akshay Manchale 00:27:42 That’s a great example. Thanks. Now, when you’re ingesting data, maybe you’re getting these orders, or maybe you are looking at analytical stuff about where this user is accessing from, et cetera, how do you enforce the policies that you may have already defined on data that’s coming in from all of these sources? You might have streaming data, you might have data at rest, transactional stuff. So, how do you manage or enforce the policies on incoming data, especially things that are fresh and new?

Jessi Ashdown 00:28:12 So I love this question and I want to add a little bit to it. I want to give some background before we jump into that. When we’re thinking about policies, we’re often thinking about that step of enforcing it, right? And I think what gets lost is that there are really two steps that happen before that. There are probably more; I’m glossing over it all. But there’s defining the policy. So, do I get this from Legal? Is there some new law like CCPA or GDPR or HIPAA or something, and this is where I’m getting the nuts and bolts of the policy from, defining it. And then, you have to have someone who’s implementing it. And so this is what you’re talking about, kind of getting into: is it data at rest?

Jessi Ashdown 00:29:00 Is it at ingestion? Where am I writing these policies? And then there’s enforcing the policy, which isn’t just a tool doing that, but can also be, “Okay, I’m going to scan through and see how many people are accessing this data set that I know really shouldn’t be accessed much at all.” And the reason why I’m discussing these distinct pieces of policy definition, implementation, and enforcement is that those can often be different people. And so, having a line of communication or something between those folks, Uri and I have heard from many companies, gets super lost, and this can completely break down. So really acknowledging that there are these distinct parts of it, and parts that have to happen before enforcement even happens, is an important thing to wrap your head around. But Uri can definitely talk more about actually getting in there and enforcing the policies.

Uri Gilad 00:29:59 I agree with everything that was said. Again, yes, sometimes, for some reason, the people who actually audit the data, or actually not the data, who audit the data policies, get sort of forgotten, and they’re kind of important people. When we talked about why data governance is important, we said, forget legal for a moment. Data governance is important because you want to make sure the highest quality data gets to the right people. Great. Who can prove that? It’s the person who’s monitoring the policies who can prove that. Also, that person may be useful when you’re talking with the European Commission and you want to prove to them that you are compliant with GDPR. So that’s an important person. But talking about enforcing policies on data as it comes in, a couple of thoughts there. First of all, you have what we in Google call organization policies or org policies.

Uri Gilad 00:30:53 Those are like, what process can create what data store where? And this is kind of important even before you have the data, because you don’t necessarily want your apps in Europe to be beaming data to the US. Again, maybe you don’t know what the data is. You don’t know what it contains. It hasn’t arrived yet, but maybe you don’t even want to create a sink for it in a region of the world where it shouldn’t be, right? Because you are compliant with GDPR, or because you promised the German company that you work with that employee information remains in Germany. That’s very common. It’s beyond GDPR. Maybe you want to create a data store that is read-only, or write-once, read-only, more correctly, because you are a financial institution and you are required by laws that predate GDPR by a decade to hold transaction information for fraud detection.

Uri Gilad 00:31:47 And apparently there are fairly detailed regulations about that. After that it’s a bit of workflow management, once the data has already landed. Now you can say, okay, maybe I want to build an ETL system, like we discussed earlier, where there is a landing zone, and very few people can access this landing zone. Maybe only machines can access the landing zone, and they do basic scraping, augmenting, and enriching. Then it’s transferred to very few people, very few human people. And then later it’s published to the entire organization, and maybe there’s an even later step where it’s shared with partners, peers, and consumers. And this is, by the way, a pattern: this landing zone, intermediate zone, public zone or published zone. This is a pattern we are seeing more and more across the data landscape in our data products. And in Google, we actually created a product for that called Dataplex, which is first-of-a-kind, and which gives a first-class entity to those, kind of like, holding zones.
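
The landing, intermediate, and published zone pattern can be modeled as a tiny state machine with per-zone readers. The Python sketch below is an illustrative assumption, not the Dataplex API or data model: the zone names follow the transcript, while the principal names (`pipeline`, `data_steward`, `employee`) and the `promote` and `can_read` helpers are hypothetical.

```python
from enum import Enum


class Zone(Enum):
    LANDING = 1        # raw data; machines only
    INTERMEDIATE = 2   # curated by a few humans
    PUBLISHED = 3      # available to the whole organization


# Hypothetical principals allowed to read each zone.
READERS = {
    Zone.LANDING: {"pipeline"},
    Zone.INTERMEDIATE: {"pipeline", "data_steward"},
    Zone.PUBLISHED: {"pipeline", "data_steward", "employee"},
}


def can_read(principal: str, zone: Zone) -> bool:
    """Access widens as data moves toward the published zone."""
    return principal in READERS[zone]


def promote(zone: Zone) -> Zone:
    """Move a data set one step forward; zones cannot be skipped."""
    if zone is Zone.PUBLISHED:
        raise ValueError("data is already published")
    return Zone(zone.value + 1)
```

Because promotion is a single explicit step, it is a convenient place to hang the checks mentioned in the conversation, such as classification, quality tests, or redaction, before a wider audience can read the data.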

Akshay Manchale 00:32:50 Yeah. What about smaller to medium sized companies that might have very basic data access policies? Are there things that they can do today to have this policy enforcement or applying a policy when you don’t have all of these lines of communication established, let’s say between legal to marketing to PR to your engineers who are trying to build something, or analytics trying to give feedback back into the business? So, in a smaller context, when you’re not necessarily dealing with a vast amount of data, maybe you have two data sources or something, what can they do with limited amount of resources to improve their state of data governance?

Jesse Ashdown 00:33:28 Yeah, that’s a really great question. And it’s sort of one of these things that can sometimes make it easier, right? So, if you have a bit less data and if your organization is quite a bit smaller — for example, Uri and I had spoken with a company that I think had seven people total on their data analytics team, total in the entire company — it makes it a lot simpler. Do they all get access? Or maybe it’s just Steve, because Steve works with all the scary stuff. And so, he is the one, or maybe it’s Jane that gets it all. So, we’ve definitely seen the ability for smaller companies, with less people and less data, to be maybe a bit more creative or not have as much of a weight, but that isn’t necessarily always the case because there can also be small organizations that do deal with a large amount of data.

Jesse Ashdown 00:34:21 And to your point, it can be challenging. And I think Uri has more to add to this. But one thing I will say is that, kind of as we had spoken about in the beginning, it’s really about selecting what it is that you need to govern. And especially if you don’t have the headcount, which so many folks don’t, you’re going to have to strategically think about where can I start? You can’t boil the ocean, but where can you start? And maybe it’s five things, maybe it’s 10 things, right? Maybe it’s the things that most hit the bottom line of the business, or that are the most scary, because as Uri said, the auditor’s going to come in, we’ve got to make sure that this is locked down. I’ve got to make sure I can prove that this is locked down. So starting there, but not getting overwhelmed by all of it, and saying, “You know what, if I just start somewhere, then I can build out.” But just something.

Uri Gilad 00:35:16 Yeah. Adding to what Jesse said, the case of the small company with the small amount of data is potentially simpler. It’s actually quite common to have a small company with a lot of data. And that is because maybe that company was acquired or was acquiring. That happens. And also, maybe because it’s so easy for a single, simple mobile app to generate so much data, especially if the app is popular, which is a good case; it’s a good problem to have. Now you are suddenly crossing the threshold where regulators are starting to notice you, maybe your spend on cloud storage is beginning to be painful to your wallet, and you are still the same tiny team. There’s only Steve, and Steve is the only one who understands this data. What does Steve do? And the answer is, it’s a little bit of what Jesse said: start where you have the most impact, identify the top 20% of the data that is mostly used. But also, there are a lot of built-in tools that allow you to get immediate value without a lot of investment.

Uri Gilad 00:36:25 Google Cloud’s Data Catalog, out of the box, will give you a search bar that allows you to search across table names, column names, and file names. And maybe that makes a difference. Again, imagine just finding all the tables that have “email” as a column name; that is immediately useful and can be immediately impactful today. And that requires no installation. It requires no investment in processing or compute. It’s just there already. Similarly, for Amazon, there’s something similar; for Microsoft cloud, there is something similar. Now that you have sort of lowered the watermark of pressure a little bit, you can start thinking, okay, maybe I want to consolidate data stores. Maybe I want to consolidate data catalogs. Maybe I want to go and shop for a third-party solution. But start small, identify the top 20% impact, and you will go from there.
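
Editor's sketch: to make the “find every table with an email column” idea concrete, here is a minimal Python illustration over a made-up, in-memory catalog. It does not call any real catalog product; `CATALOG` and `find_tables_with_column` are hypothetical names, and a managed catalog would supply this metadata for you.

```python
# Hypothetical metadata: table name -> column names.
CATALOG = {
    "crm.customers": ["id", "full_name", "email", "phone"],
    "hr.employees": ["employee_id", "work_email", "manager"],
    "sales.orders": ["order_id", "customer_id", "total"],
}


def find_tables_with_column(catalog, term):
    """Return tables that have at least one column whose name contains `term`."""
    term = term.lower()
    return sorted(
        table
        for table, columns in catalog.items()
        if any(term in column.lower() for column in columns)
    )


print(find_tables_with_column(CATALOG, "email"))
# prints ['crm.customers', 'hr.employees']
```

Even a name-based search like this gives a small team a first list of places where contact data probably lives, which is a reasonable starting point for the top 20%.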

Jesse Ashdown 00:37:20 Yeah. I think that’s such a great point about starting with that 20%. I had gone to a data governance conference a couple of years ago now, right? Back when conferences were being held in person. And there was this presentation about kind of the ideal data governance state, right? And there were these beautiful images of, you have this person doing this thing, and then these people, and all this perfect way that it would all work. And this poor guy stood up and he said, “So I don’t have the headcount or the budget to do any of that. So how do I do this?” And the guy’s response was, “Well, then you just need to get it.” And we sincerely hope that through talking on podcasts and through the book, folks will not feel like that. They won’t feel like, well, my only recourse is to hire 20 more people, to get a million.

Jesse Ashdown 00:38:20 Well, probably not even a million, I don’t know, 10 million or whatever budget, buy all the tools, all the fancy things, and that’s the only way that I can do this. And that’s not the case. Uri talked about kind of starting with Steve, and the 20% that Steve can do, and then building from there. I mean, of course, clearly we feel very passionate about this, so we could talk for hours and hours. But if the folks listening take nothing else away, I hope that one of the takeaways is that this can be condensed. It can be made smaller, and then you can blow it out and make it bigger as you can.

Akshay Manchale 00:38:53 Yeah. I think that’s a great suggestion or a great recommendation, right? Because even as a consumer, for example, I’m better off knowing that maybe if I’m using your app, you have some sort of governance policy in place, even though you might not be too big, maybe you don’t have the headcount to have this crazy structure around it, but you have some start. I think that’s actually really nice. Uri, you mentioned earlier that one of the access policies can be something like “write once, read many times” for financial transactions, for example, and that makes me wonder, how do you keep track of the source of data? How do you track the lineage of data? Is that important? Why is it important?

Uri Gilad 00:39:31 So let’s start from the actual end of the question, which is why is that important? So, a couple of reasons. One is that lineage provides really important and sometimes actionable context to the data. It’s a very different kind of data if it was sourced from a consumer contact details table than if it was sourced from the employee database. Those are different kinds of groups of people. They have different kinds of needs and requirements. And actually, the data is shaped differently. For employees, it’s all about a user ID at company.com, for example. That’s a different shape of email than for a consumer, but the data itself will have the same sort of container: that will be a table of people with names, maybe addresses, maybe phone numbers, maybe emails. So that’s an easy example where context is important. But adding to that a little bit more, let’s say you have data which is sensitive.

Uri Gilad 00:40:30 You want all the derivatives of this data to be sensitive as well. And that’s a decision you can make automatically. There’s no need for a human to come in and check boxes. At some point upstream in the lineage graph, this column, table, whatever, was deemed to be sensitive; just make sure that context retains itself downstream as long as the data is evolving. Then there is another question: how do you collect lineage, and how do you deal with unknown data sources? So for lineage collection, you really need a tool. The speed of evolution of data in today’s environment really requires you to have some sort of automated tooling, so that as data is created, the information about where it came from physically, like this file bucket, that data set, is recorded. Humans cannot really effectively do that, because they will make mistakes or they’ll just be lazy.

Uri Gilad 00:41:25 I’m lazy. I know that. What do you do with unknown data sources? So this is where good defaults are really important. There’s data; somebody, some random person who is not available for questions at the moment, has created the data source. And this is being used widely. Now, you don’t know what the data source is. So you don’t know quality, you don’t know sensitivity, and you need to do something about it because tomorrow the regulator is coming for a visit. So good defaults means, like, what’s your risk profile? And if your risk profile is, this is going to come up in the review or audit, just mark it as sensitive and put it on somebody’s task list to go into it later and try and figure out what this is. If you have a good lineage collection tool, then you will be able to track all the by-products and be able to automatically categorize them. Does that make sense?
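
Editor's sketch: the two rules described here, that derivatives of sensitive data stay sensitive and that unknown sources get a cautious default, can be expressed as a short traversal of a lineage graph. This is a minimal illustration, assuming lineage has already been collected as a simple mapping from each dataset to the datasets derived from it. The function name `propagate_sensitivity`, the two labels, and the dataset names in the usage note are hypothetical.

```python
from collections import deque


def propagate_sensitivity(edges, labels, default="sensitive"):
    """Label every dataset in a lineage graph.

    edges:   dict mapping a dataset to the list of datasets derived from it.
    labels:  dict with known labels ("sensitive" or "public") for some datasets.
    default: label given to source datasets nobody has classified yet.
    """
    derived = {d for targets in edges.values() for d in targets}
    nodes = set(edges) | derived | set(labels)

    result = {}
    for node in nodes:
        if node in labels:
            result[node] = labels[node]
        elif node not in derived:
            result[node] = default   # unknown source: apply the cautious default
        else:
            result[node] = "public"  # provisional, may be overridden below

    # Anything downstream of a sensitive dataset becomes sensitive too.
    queue = deque(n for n in nodes if result[n] == "sensitive")
    seen = set(queue)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            result[child] = "sensitive"
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return result
```

For example, with `edges = {"contacts": ["mailing_list"], "mystery_dump": ["report"]}` and `labels = {"contacts": "sensitive"}`, both `mailing_list` and `report` come out as sensitive: the first through lineage, the second because its unclassified source defaulted to sensitive.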

Akshay Manchale 00:42:20 Yeah, absolutely. I think maybe applying the strongest, most restrictive one for derived data is maybe the safest approach. Right. And that totally makes sense. Can you, we’ve talked a lot about just regulatory requirements, right? We’ve mentioned it. Can you maybe give some examples of what regulatory requirements are out there? We’ve mentioned GDPR, CCPA, HIPAA previously. So maybe can you just dig into one of those or maybe all of those briefly, just say what exists right now and what are some of those most popular regulatory requirements that you really have to think about?

Uri Gilad 00:42:55 So, first of all, disclaimer: not a lawyer, not an expert on regulations. And also, this is important: regulations are different depending not only on where you are and what language you speak, but also on what kind of data you collect and what you use it for. Everybody is concerned about GDPR and CCPA, so I’ll talk about them, but I’ll also talk about what exists beyond that scope. GDPR, the General Data Protection Regulation, and CCPA, which is the California Consumer Privacy Act, are really novel a little bit in that they say, “Oh, if you are collecting people’s data, you should pay attention to that.” Now, this is not going to be an analysis of GDPR and whether this applies to that (talk to your lawyers), but in broad strokes, what I mean is, if you collect people’s data, you should do two very simple things. First of all, let those people know. That sounds surprising, but people did not use to do that.

Uri Gilad 00:43:56 And there were unexpected things that happened as a result of that. Second of all, if you are collecting people’s data, give them the option to opt out. Like, I don’t want my data to be collected. That may mean I cannot acquire the service from you, but I have the option to say no. And again, not many people understand that, but at least they have the option. They also have the option to come back later and say, “Hey, you know what? I want to be taken off your system. I love Google. It’s a great company. I enjoyed my Gmail very much, but I’ve changed my mind. I’m moving over to a competitor. Please delete everything you know about me so I can rest more easily.” And that’s another option. Both GDPR and CCPA are also novel in the fact that they contain teeth, which means there’s a financial penalty if people fail to comply; people, meaning companies.

Uri Gilad 00:44:45 And there’s a whole lot of other things; GDPR is a robust piece of legislation. It has hundreds of pages. But there’s also care to be taken, as a thread across the regulation, around: please be mindful about which companies, services, vendors, and people process people’s data. I’d be highly remiss if I did not mention two classes of regulation beyond GDPR and CCPA. The first is health-related regulations. In the US, there’s HIPAA. There’s an equivalent in Europe. There are equivalents actually all across the planet. And those are like, what do you do with medical data? Like, do I really want people that are not my own personal physician to know that I have a certain medical condition? What do you do about that? If my data is to be used in the creation of a lifesaving drug, how is that to be used?

Uri Gilad 00:45:45 And we were hearing a lot about that in, unfortunately, the pandemic; people were developing drugs very rapidly. There’s another class of regulation, which governs financial transactions. Again, highly sensitive, because I don’t want people to know how much money I have. I don’t want people to know who I negotiate and do business with. But sometimes banks need to know that, because certain patterns of your transactions indicate fraud, and that’s a valuable service they can provide: fraud detection, fraud prevention. There are also bad actors. We have this situation in Eastern Europe; Russian banks are being blocked. There’s a way for banks to detect trading with those entities and block them. And again, Russian banks are a recent example, but there are older examples of unwanted actors, and you can insert your financial crime here. So that will be my answer.

Akshay Manchale 00:46:47 Yeah. Thanks for that, like, quick walkthrough of those. It’s really, I think, going back to what you were emphasizing earlier about starting somewhere with respect to data governance, it’s all the more important when you have all of these policies and regulatory requirements really, to at least be aware of what you should be doing with data or what your responsibilities are as a company or as an engineer or whoever you are listening to the podcast. I want to ask another thing about just data storage. I think there are specifically, there are countries, or there are places where they say, data residency rules apply where you can’t really move data out of the country. Can you give an example about how that impacts your business? How does that impact your maybe operations, where you deploy your business, et cetera?

Uri Gilad 00:47:36 So in general (again, not a lawyer), keeping data in the same geographic region where it was sourced from is usually a good practice. That begets a lot of interesting questions which do not have a straight answer, do not have a simple answer. Let’s take something simple. I have a music app. The music app makes money by sending targeted ads to people listening to music. Fairly simple. Now, in order to send targeted ads, you need to collect data about the people listening to music, for example, what music they’re listening to. Fairly simple so far. Now, where do you store that data? Okay. So Uri said in the podcast, store it in the region of the world it was collected from. Great. Now here’s a question: where do you store the information about the existence of this data in the country?

Uri Gilad 00:48:32 Basically, if you now have a search bar to search for music listened to by people in Germany, do you need to go into each individual region where you store data and search for that data, or is there a centralized search? As things stand right now, the regulation on metadata, which is what I’m talking about, the data about data, does not exist yet. It’s trending toward also being restricted by region. And that presents all kinds of interesting challenges. The good news is, if you have this problem, that means that your music application was hugely successful, adopted all over the planet, and you have users all over the planet. That probably means you are in a good place. So that’s a happy start.

Akshay Manchale 00:49:20 Yeah, I think also when you look at machine learning, AI being so prevalent right now in the industry, I have to ask when you are trying to build a model out of data that is local to a region maybe, or maybe it contains personally identifiable information, and the user comes in and says, Hey, I want to be forgotten. How do you deal with this sort of derived data that exists in the form of an AI application or just a machine learning model where maybe you can’t get back the data that you started with, but you have used it in your training data or test data or something like that?

Jesse Ashdown 00:49:55 That’s a really good question. And to kind of even go back before we’re even talking about ML and AI, it’s really funny. Well, I don’t know if it’s funny, but you can’t go in and forget somebody unless you have a way to find that person, right? So one of the things that we’ve found in interviewing companies, as they’re really trying to get their governance off the ground and be in compliance, is that they can’t find people to forget them. They can’t find that data. And this is why it’s so important. I can’t extract that data; I can’t delete it. If you’ve ever had the case where you’ve unsubscribed from something, and you don’t get emails for a while, only to then all of a sudden get emails again, and you’re wondering why that is, well, it’s because the governance wasn’t that great.

Jesse Ashdown 00:50:46 Right? And I don’t mean governance in terms of like security and not that it’s any malicious point on those folks at all. Right. But it shows you of exactly what you’re saying of where is that kind of streaming down. And Uri was making this point of really looking at the lineage of kind of finding where all the places where this is going, and now you can’t capture all these things. But the better governance that you have, and as you’re thinking about how do I prioritize, right? Like we were kind of talking about, there might be some, I need to make data driven decisions in the business. So these are some things that I’m going to prioritize in terms of my classifying, my lineage tracking. And then maybe there’s other things related to regulations of, I have to prove this to that poor auditor that has to go in and look at things. So maybe I prioritize some of those things. So I think even before we get in to machine learning and things like that, these should be some of the things that folks are thinking about to like put eyes on and why some of that governance and strategy that you put into place beforehand is so important. But specifically with the ML and AI, Uri, that’s definitely more up your alley than mine.

Uri Gilad 00:51:59 Yeah. I can talk about that briefly. So first of all, as Jesse mentioned, there’s the case where you don’t have good data governance, and people are trying to unsubscribe, and you don’t know who these people are, and you are doing your best, but that’s not good enough. And if somebody has a stick to beat you with, they will wave that stick. So besides that, here’s something that has worked well for Google, actually. When you are training an AI model, again, it’s highly tempting to use all of the features you can, including people’s data and all that. There are sometimes very good results that you can achieve without actually saving any data about people. And there are two examples for that. One is, if anybody listening to this is familiar with the COVID exposure notifications app, that’s an app that is widely documented; just look it up in either Apple’s or Google’s information pages.

Uri Gilad 00:52:59 That app does not contain anything about you and does not share anything about you. The TL;DR on how it works: it’s a rolling random identifier. It’s keeping a rolling random identifier of everybody you have met. And if one of those rolling random identifiers happens to have a positive diagnosis, then it lets the other people know, but nothing personal is actually kept. No location, no usernames, no phone numbers, nothing, just the rolling random identifier, which by itself does not mean anything. That’s one example. The other example is actually very cool. It’s called federated learning. It’s a well-recognized technique, which is the basis for autocomplete in mobile phone keyboards. So if you type on your mobile phone, both Apple and Google, you will see a couple of suggestions for words, and you can actually build whole sentences out of that without typing a single letter.

Uri Gilad 00:53:55 And that is kind of fun. The way this works is there’s a machine learning model that’s trying to predict what word you are going to use, and it predicts that by looking at the sentence. That machine learning model runs locally on your phone. The only data that is shared is actually: okay, I’ve spent a day predicting words, and during this day, apparently “sunshine” was more common than “rainfall.” So I’m going to beam to the centralized database: “sunshine” is more common than “rainfall.” There’s nothing about the user there, there’s nothing about the individual, but it’s useful information. And apparently it works. So how do you deal with machine learning models? Try first not to save any data at all. Yes, there are some cases where you have to, and, again, not being a huge expert on it, in some cases you will need to rebuild and retrain your machine learning model. Try to make those cases the exception, not the rule.
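
Editor's sketch: the data flow in the keyboard example, where work happens on the device and only an aggregate is shared, can be illustrated with a toy word counter. To be clear about the assumptions: this is not an implementation of federated learning, where devices typically exchange model updates rather than counts, and it is not how any particular keyboard works. The function names and sample words are hypothetical.

```python
from collections import Counter


def local_summary(typed_words, vocabulary):
    """Runs on the device. Only counts of words from a shared, public
    vocabulary are kept; the raw text never leaves this function."""
    return Counter(word for word in typed_words if word in vocabulary)


def aggregate(summaries):
    """Runs centrally. Sees per-device counts only, with no user identifiers."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


vocabulary = {"sunshine", "rainfall"}
phone_a = local_summary(["sunshine", "sunshine", "secret", "rainfall"], vocabulary)
phone_b = local_summary(["sunshine"], vocabulary)

totals = aggregate([phone_a, phone_b])
print(totals["sunshine"] > totals["rainfall"])  # prints True
```

The design point is where the boundary sits: anything outside the agreed vocabulary, such as the word “secret” above, is dropped on the device and is never transmitted.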

Akshay Manchale 00:54:53 Yeah. I really like your first example of COVID right, where you can achieve the same result by using PII and also without using PII, just requires you to think about a way to achieve the same goals without putting all of the personal information in that path. And I think that’s a great example. I want to switch gears a little bit into just the monitoring aspects of it. You have like regulatory requirements maybe for monitoring, or maybe just as a company. You want to know that the ideal policies, access controls that you have are not being violated. What are strategies for monitoring? Do you have any examples?

Jesse Ashdown 00:55:31 That is a great question. And I’m sure anyone who’s listening who has dealt with this problem is like, yes, how do you do that? Because it’s really, really challenging. If I had a dollar, even a penny, for every time I talk to a company and they ask me, “But is there a dashboard? Like, is there a dashboard where I can see everything that’s going on?” So to your point, it’s definitely a big issue. It’s a problem of being able to do that. There certainly are some tools that are coming out that are aiming to be better at that. Certainly Uri can speak more on that. Dataplex is a product that he mentioned, and some of the monitoring capabilities in there are directly from years of interviews that we did with customers and companies about what they needed to see to enable them to better know: what the heck is going on with my data estate?

Jesse Ashdown 00:56:33 How is it doing? Who’s accessing what? How many violations are there? So I suppose my answer to your question is, there’s no great way to do it quite yet, save for some tooling that can help you. I think it’s another place of defining: I can’t monitor everything. What do I have to monitor most? What do I have to make sure that I’m monitoring, and how do I start there and then branch out? And I think another important part is really defining who is going to do what. That’s one thing that we found a lot: if it’s not someone’s job, someone’s explicit job, it’s often not going to get done. So really saying, okay, “Steve, poor Steve, Steve has got so much. Steve, you need to monitor how many folks are accessing this particular zone within our data lake that has all of the sensitive stuff,” or what have you. But defining those tasks and who is going to do them is definitely a start. But I know Uri has more on this.

Uri Gilad 00:57:37 Yeah, just briefly. It’s a common customer problem. And customers are like, “I understand that the file storage product has a detailed log. I understand that the data analytics product has a detailed log. Everything has a detailed log, but I want a single log to look at, which shows me both.” And that’s why we built Dataplex, which is sort of like a unifying management console that doesn’t care where your data is. It tells you how your data is governed: who’s accessing it, what they are doing, and where. It’s a first; it was launched recently, and it’s intended not to be a new way of processing your data, but actually to approach how customers think about the data. Customers don’t think about their data in terms of files and tables. Customers think about their data as: this is customer data, this is pre-processed data, this is data that I’m willing to share. And we are trying to approach those metaphors with our products, rather than only giving them a most excellent file storage, which is only the basis of the use case. We also give the most excellent file storage.
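
Editor's sketch: the “single log that shows me both” request can be approximated, for a small team, by merging per-product access logs into one timeline and then asking a question of it, such as who touched the sensitive zone. This is a minimal illustration with made-up log entries. The field names, product names, and the `sensitive/` prefix convention are assumptions, and none of this is the Dataplex API.

```python
import heapq
from collections import Counter
from datetime import datetime


def unified_log(*product_logs):
    """Merge per-product logs, each already sorted by time, into one timeline."""
    return list(heapq.merge(*product_logs, key=lambda entry: entry["time"]))


def accesses_per_user(log, resource_prefix):
    """Count how often each user touched resources under a given prefix."""
    return Counter(
        entry["user"]
        for entry in log
        if entry["resource"].startswith(resource_prefix)
    )


# Hypothetical entries from two different products.
storage_log = [
    {"time": datetime(2022, 5, 1, 9, 0), "product": "file-storage",
     "user": "steve", "resource": "sensitive/contacts.csv"},
    {"time": datetime(2022, 5, 1, 11, 30), "product": "file-storage",
     "user": "jane", "resource": "published/summary.csv"},
]
analytics_log = [
    {"time": datetime(2022, 5, 1, 10, 15), "product": "analytics",
     "user": "steve", "resource": "sensitive/contacts_table"},
]

timeline = unified_log(storage_log, analytics_log)
print(accesses_per_user(timeline, "sensitive/"))  # prints Counter({'steve': 2})
```

The merge assumes each product's log is already in time order, which is typical for append-only logs; if that does not hold, sorting the combined list by time would be the safer choice.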

Akshay Manchale 00:58:48 Yeah, I think a lot of tools are certainly adding in that sort of monitoring and auditing capabilities that I usually see with new products. And that’s actually a great step in the right direction. I want to start wrapping things up, and I think this sort of culture of having some controls in place or just starting somewhere is really great. And when I look at, say, a large company, they usually have different kinds of trainings that you have to take that explicitly spell out what is okay to do in this company. What can you access? There are security-based controls for accessing sensitive information, audits, and all of that. But if you take that same thing in an unregulated industry, maybe, or a small to medium sized company, how do you build that sort of data culture? How do you train your people who are coming in and joining your company about what your data philosophy or principles are, or what your data governance policies are? Do you have any examples or do you have any takes on how someone can get started on some of those aspects?

Jesse Ashdown 00:59:46 It’s a really good question. And something that often gets overlooked, like you said, in a big company, there’s okay. We know we have to have trainings and things like this, but in smaller companies or unregulated industries, it often gets forgotten. And I think you hit on an important point of having some of those principles. Again, it’s a place of starting somewhere, but I think even more than that, it’s just being purposeful. We literally have an entire chapter in the book dedicated to culture because that’s how important we feel it is. And I feel like it’s one of those places of where the people really matter, right? We’ve talked so much in this last hour plus together of there’s these tools, ingestion, storage, da na na and a little bit about the people, but that’s really where the culture can come into play.

Jesse Ashdown 01:00:32 And it’s about being planful and it doesn’t have to be fancy. It doesn’t have to be fancy trainings and whatnot. But as you had mentioned, having principles that you say, okay, “this is how we’re going to use data. This is what we’re going to do”. And taking the time to get the folks who are going to be touching the data, at least on board with that. And I had mentioned it before, but really defining roles and responsibilities and who does what? There can’t be one person that does everything. It has to be sort of a spreading out of responsibilities. But again, you have to be planful of thinking, what are those tasks? It doesn’t have to be a hundred tasks, but what are these tasks? Let’s literally list them out. Okay. Now who’s going to do what, because unless we define that Joe is going to get stuck doing all the curation and he’s going to quit and that’s just not going to work.

Uri Gilad 01:01:22 So adding to that a little bit, again: a small company in an unregulated industry does not have a huge hammer waiting for them. How do they get data governance? Being planful is a huge part of that. It’s also, and I’ve already confessed to being lazy, so I have no issue confessing to it again, someday you will believe me, about telling the employees what’s in it for them. Data governance is not a gatekeeper. It’s a huge enabler. Do you want to quickly find the data that’s relevant to you, to do the next version of the music app? Oh, then when you create a new data source, you’d better just add those, like, five words saying: what is this new database about? Who was it sourced from? Does it contain PII? Just click those five check boxes, and in return, we’ll give you a better index.

Uri Gilad 01:02:14 Oh, you want to make sure that you don’t need to go and requisition new permissions for data all the time? Make sure you don’t save PII. Oh, you don’t know what PII is? Here’s a handy classifier. Just make sure you run it as part of your workflow. We will take it from there. And again, that is the first step in making data work for you. Rather than having poor Joe be the only one who is classifying in the organization, so everybody leans on him and he quits, show employees what’s in it for them. They will be the ones to classify. That’s actually good news, because they’re actually the ones who know what the data is. Joe has no idea. And that will be a happier organization.
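
Editor's sketch: the workflow described in this answer, where whoever creates a data source fills in a few fields and runs a classifier, can be illustrated with a small Python helper. Everything here is an assumption for illustration: the two regular expressions are deliberately crude (the phone pattern will, for instance, also match dates written with digits and dashes), the function names are hypothetical, and a real deployment would use a proper PII classification service rather than these patterns.

```python
import re

# Deliberately simple patterns; real classifiers are far more thorough.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def classify_column(values):
    """Return the set of PII tags detected in a column's sample values."""
    tags = set()
    for value in values:
        text = str(value)
        if EMAIL_RE.search(text):
            tags.add("email")
        if PHONE_RE.search(text):
            tags.add("phone")
    return tags


def describe_new_source(name, description, sourced_from, columns):
    """Build the catalog entry a creator fills in when adding a data source.

    columns: dict mapping column name -> a few sample values.
    """
    pii_columns = {}
    for column, samples in columns.items():
        tags = classify_column(samples)
        if tags:
            pii_columns[column] = sorted(tags)
    return {
        "name": name,
        "description": description,
        "sourced_from": sourced_from,
        "contains_pii": bool(pii_columns),
        "pii_columns": pii_columns,
    }
```

Called as `describe_new_source("plays", "Songs played per user", "music app", {"contact": ["ann@example.com"], "play_count": [12, 7]})`, this returns an entry with `contains_pii` set to `True` and `pii_columns` equal to `{"contact": ["email"]}`, which is the kind of record that makes the later search and permission steps cheaper.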

Akshay Manchale 01:02:56 Yeah. I think that’s a really nice note to end on. You don’t really need to look at this as a regulatory requirement alone, but really look at it as, what can these sorts of governance policies do for you? What can they enable in the future? What can they simplify for you? I think that’s fantastic. With that, I’d like to end. Jesse and Uri, thank you so much for coming on the show. I’m going to leave a link to the book in our show notes. Thank you again. This is Akshay Manchale for Software Engineering Radio. Thank you for listening.

Uri Gilad 01:03:25 And the book is Data Governance: The Definitive Guide, the product is Cloud’s Dataplex, and they’re both Googleable. [End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
