SE Radio 641: Catherine Nelson on Machine Learning in Data Science

Catherine Nelson, author of the new O’Reilly book, Software Engineering for Data Scientists, discusses the collaboration between data scientists and software engineers — an increasingly common pairing on machine learning and AI projects. Host Philip Winston speaks with Nelson about the role of a data scientist, the difference between running experiments in notebooks and building an automated pipeline for production, machine learning vs. AI, the typical pipeline steps for machine learning, and the role of software engineering in data science. Brought to you by IEEE Computer Society and IEEE Software magazine.



Transcript

Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.

Philip Winston 00:00:35 Welcome to Software Engineering Radio. This is Philip Winston. My guest today is Catherine Nelson. Catherine is a freelance data scientist and the author of two O’Reilly books: this year’s Software Engineering for Data Scientists and her 2020 book, Building Machine Learning Pipelines, co-authored with Hannes Hapke. Previously, she was a principal data scientist at SAP Concur, and before that she had a career as a geophysicist. Catherine has a PhD in Geophysics from Durham University and a master’s in Earth Sciences from Oxford University. She is currently consulting for startups in the generative AI space. Welcome, Catherine.

Catherine Nelson 00:01:16 Thanks Philip. It’s great to be on the podcast.

Philip Winston 00:01:19 Today we’re going to discuss the role of the data scientist and how this role can overlap with or intersect software engineering. Let’s start with what is a data scientist?

Catherine Nelson 00:01:31 That’s such a great question because what a data scientist is depends on where you work. At some companies it can be more in the data analytics space and at others it can mean that you’re spending all your time training machine learning models. But overall, I’d say being a data scientist involves translating business problems into data problems, solving them where possible, and then sometimes building machine learning powered features.

Philip Winston 00:01:57 So what skills does a data scientist need either prior to getting the role or what skills do they need to develop to be good at the role?

Catherine Nelson 00:02:05 They need to have skills for working with data. So those would include a knowledge of statistics, a knowledge of coding to be able to manipulate the data, basic machine learning and the algorithms behind it, data visualization, sometimes storytelling with data and how to weave those data visualizations together into a coherent whole. A lot of data scientists will take courses on data ethics and data privacy, because sometimes that is part of the data scientist’s job as well. It’s a real mixed bag.

Philip Winston 00:02:43 It seems like data scientists need perhaps more domain knowledge or business knowledge than some engineering roles. Why do you think this is?

Catherine Nelson 00:02:55 I’d say that’s right. I think it’s because you are translating the problems from a business problem to a data problem. So you might be tasked to answer a problem such as: why are our customers churning? Why do some customers leave the business? And you dig into the data to try and see what features of a company are correlated with them stopping using your product. So it might be something like the size of the business, or they might have given you feedback when they left that has some reasons for that. So you can’t really answer a problem like that without having a good sense of what the business does, what products there are, how things fit together. So yeah, I think it involves a lot more context.

Philip Winston 00:03:41 On a typical project, who does the data scientist have to communicate with, typically?

Catherine Nelson 00:03:46 The interesting thing I’ve found with my data science career is I wouldn’t say I have a typical project. So I’ve done some projects where it’s been extremely exploratory. It’s been like, we might be considering creating a new feature for the product, is this even possible? It’s really blue sky. And then there’s other projects I’ve worked on where it’s been towards the production end of things, deploying new models into production. So I’m going to be working with different people depending on the type of project, but some commonalities would be a product organization and obviously engineers if it’s involving building features.

Philip Winston 00:04:32 For most of the episode we’re going to be talking about machine learning and AI, but as I understand it, there’s more to data science than just these two fields. Can you give me an example of a problem you solved or a solution you came up with in data science that didn’t involve ML or AI?

Catherine Nelson 00:04:51 Actually, the example that I just mentioned, looking at why customers might leave a business, involved no machine learning at all. It was a predictive modeling problem, but I didn’t use a machine learning solution. So the projects that are more around answering questions versus building features are the ones where there’s a lower level of machine learning and AI usage, and more statistics or data visualization or general data analysis skills.

Philip Winston 00:05:20 I want to mention two past episodes related to data science. There’s Episode 315, Jeroen Janssens on Tools for Data Science, from 2018, and Episode 286, Katie Malone on Intro to Machine Learning, from 2017. Katie Malone is a data scientist. So now let’s move into talking about machine learning and AI. To start with, what is the difference between these two fields? In my research, I’ve seen that the term AI has been evolving a lot, so I’m wondering what definitions you use.

Catherine Nelson 00:05:55 The most useful definition I’ve heard, and the one that I’ve adopted and continue to use, was from a podcast with Christopher Manning, who’s a professor at Stanford University working in natural language processing. And that is that if you’re dealing with machine learning, then you are training a model for one particular problem, one particular use. But an AI model can answer many problems. So you might use your AI model to power a chat bot, but you could use the same model to summarize some text or extract some information from some text. Whereas in classical, traditional machine learning, if you wanted to have a model that extracted some information from some text, you’d go and collect a dataset designed for exactly that problem. You’d take some of the input text and the output that you wanted the model to produce, and then you’d train your model and measure how accurate it was on that particular problem.

Philip Winston 00:07:00 I’d like to talk a little bit about the use of notebooks, like Google Colab, in data science. This is a technique or a method that I think is more common in data science than in software engineering at large. So I’m wondering, what are the pros, and are there any cons, of doing your work inside of a notebook?

Catherine Nelson 00:07:21 Definitely. I’m a huge fan of Jupyter Notebooks. I love being able to get the instant feedback on what my code is doing, which is particularly useful when I’m looking at data all the time. I can print that data, I can plot a small graph of that data, I can really interact with that data while I’m coding. I find them incredibly useful when I’m starting a project, I don’t quite know where things are going. I’m really exploring around and trying to see what the data I’m working with can do for me. Or I’m starting with a basic machine learning model and seeing if it learns anything about the problem that I’m working on.

Philip Winston 00:08:06 What sorts of signs are there that maybe you need to switch to just a traditional Git repository? What starts to become difficult with a notebook?

Catherine Nelson 00:08:17 For me, I refactor when I’m at the point where I want to train that model repeatedly. So in a machine learning problem, I have chosen the features that I want to work with. I’ve chosen the data that I want to work with. I’ve trained an initial model, it’s getting a reasonable result, but then I want to train it repeatedly and optimize the hyperparameters and then eventually move towards deploying it into production. So I think that’s the main difference: when I just have code that I may only run once, I don’t know where I’m going, I don’t know exactly what the final code base will look like, that’s when I’m happiest in the Jupyter Notebook. But when it’s going to be run repeatedly, when I need to write tests, when I need to make sure that code is robust, that’s when I do that refactor.

Philip Winston 00:09:19 In a little while, we’re going to talk through the steps in a machine learning workflow, focusing on what it would be like to make an automated reusable pipeline out of them. But let’s talk a little bit more about roles. So I think you mentioned data analyst relative to data scientists. Let’s talk about a machine learning engineer that certainly comes up. What is your feeling about their role and how it differs from either data science or software engineer?

Catherine Nelson 00:09:49 At many companies, I think it’s the data scientists that will make some initial explorations. Take a fresh problem and ask: is this even a problem that we should be solving with machine learning? What type of algorithms are suitable for this particular problem? Train an initial model, prove that that model is going to answer the question that’s under consideration. And then it’s the machine learning engineer that takes over when that has been established, when those initial experiments have been done, and puts that model into production. And then they look more at the side of monitoring that model, checking its performance, checking that the inference happens in the right amount of time, and so on.

Philip Winston 00:10:37 How about at a smaller company? I imagine that people end up wearing multiple hats until you’re able to hire for all of these different roles. Have you seen that?

Catherine Nelson 00:10:47 Absolutely. And sometimes at bigger companies too, if the data science team is small, then one person might wear a lot of hats. That’s challenging because the mindset you have when you’re running an experiment, being very open to trying lots of different things, is different from the production mindset, where everything has to run repeatedly and has to be very robust.

Philip Winston 00:11:13 Let’s specifically zoom in on the relationship between data science and software engineering. What motivated you to write your latest book, Software Engineering for Data Scientists?

Catherine Nelson 00:11:25 A couple of things really. One is that it’s a book that I wanted to read earlier in my career as a data scientist. So a while ago I joined a team where I was the only data scientist on a team of developers, designers and so on. And I found it hard to just even understand the language that the developers were using. Like many data scientists, my education didn’t include any kind of computer science courses. I didn’t have that much familiarity with software engineering ways of working. When I started as a data scientist, I had questions like, what is an API? How do you build one? And I started getting interested in how I could write better code. And in the books that were available, the examples were in Java or they were aimed at web development. They weren’t very accessible to me, writing code in Python and not needing to have all the skills and background of a web developer. And also, data scientists have a reputation for writing bad code, and I wanted to help change that.

Philip Winston 00:12:36 I think you answered this. I was going to ask how much software engineering training do data scientists have? And I think you’re saying that on the lower end it can be more or less none, I guess.

Catherine Nelson 00:12:45 Yep. There’s a couple of main routes for people getting into data science. One is from a hard science background, often physical sciences or other science PhDs. So they have their data analysis skills, but they might be writing more academic code that doesn’t need to be particularly robust and doesn’t need to be particularly well tested. And then another way is through data science undergraduate degrees or master’s programs, which may include some level of coding courses. But there are so many things to try and cover in a data science degree that it’s hard to do that in depth.

Philip Winston 00:13:26 From your book, can you pick just maybe two skills that you think would be most beneficial for a data scientist to learn?

Catherine Nelson 00:13:35 One where I see there’s often a gap is in writing tests. That’s often something that’s not familiar to people from a data science background. And that’s because data science projects can be so ad hoc, so exploratory, it’s not obvious when to add tests. You can’t add tests to every single piece of code that you’re writing in the data science project because half of them you’re going to throw away because you found that that particular line of inquiry goes nowhere. There’s not really a culture of going back and adding those tests later, but if you then move on from that exploratory code to putting your machine learning model into production, it’s a problem if your code’s not tested. Another one is that, again, that comes from this exploratory nature. Often data scientists are reluctant to use version control when it’s just an individual project. It seems like it’s more hassle than it’s worth. It’s not obvious what the benefits of that are until you start working on a larger code base.
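
As a concrete illustration of the kind of test Nelson is describing, here is a minimal sketch in Python with pytest. The clean_amounts function, the file names, and the sample values are hypothetical, not code from any project discussed in the episode.

```python
# feature_utils.py -- a small function pulled out of a notebook (hypothetical example)
import pandas as pd

def clean_amounts(df: pd.DataFrame, column: str = "amount") -> pd.DataFrame:
    """Drop rows with missing amounts and convert the column to float."""
    cleaned = df.dropna(subset=[column]).copy()
    cleaned[column] = cleaned[column].astype(float)
    return cleaned

# test_feature_utils.py -- a pytest test that locks in the expected behavior
def test_clean_amounts_drops_missing_rows():
    df = pd.DataFrame({"amount": [10, None, "3.5"]})
    result = clean_amounts(df)
    assert len(result) == 2
    assert result["amount"].tolist() == [10.0, 3.5]
```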

Philip Winston 00:14:35 What programming languages are commonly used for data science? I know Python is common in machine learning in general, but are there other, more data science-specific languages?

Catherine Nelson 00:14:49 I would say Python is the biggest data science language at this point. Previously R was pretty big as well, but its share is declining a bit. Some people use Julia, but it hasn’t received the widespread adoption that Python has.

Philip Winston 00:15:06 Those are the other two I had down here, R and Julia. Maybe tackling the same question from a different angle: what should software engineers keep in mind about working with data scientists?

Catherine Nelson 00:15:19 I think that data scientists will be coming from a different mindset than a software engineer. So they’re used to tackling very vague problems and turning those into more data-focused problems. And the thing with data science projects is that you often don’t know at the start where you’re going to end up. If you are working on a machine learning problem, you might find that the algorithm that you end up using is very simple, or you might end up using something that’s very complex. Say it’s a text data extraction problem: you might start off with trying a random forest-based approach with some very simple text features, but that doesn’t actually perform very well, the accuracy is low. Then you might move on to trying a deep learning model, seeing if that works any better. So this makes it hard to estimate at the start of the project how long it’s going to take or even what the final outcome is going to be. Is this going to be a large model, a two-gigabyte model where we’re going to need some specialized infrastructure to deploy it, or is it going to be very small and it’ll scale very easily? So I think keep in mind that uncertainty; it’s not that the data scientist is just bad at estimating, that is the nature of these projects. It’s not clear from the start what the end is going to look like.

Philip Winston 00:16:55 I could imagine a scenario where the software engineers are eager to start deep into the implementation phase, but the data scientist hasn’t yet found the model for sure. And that might take some patience and some time to iterate before it’s ready for application.

Catherine Nelson 00:17:10 Exactly, exactly. Yeah, it’s not clear what that model is going to be.

Philip Winston 00:17:15 Let’s go through some typical machine learning workflow steps and explain from a data science point of view, what are some considerations, what’s the procedure or technique that might be used? And if we’re working with software engineers to create an automated pipeline, what are some things to keep in mind about this particular workflow step? Basically, what sorts of tools or techniques should we keep in mind for each step? So the first one I have down is data ingestion. So I guess there’s a lot of different projects, but what are some things we might be ingesting and what are we feeding it into?

Catherine Nelson 00:17:54 Yeah, this step is when you take your data from wherever it’s stored in your company’s infrastructure and feed it into the rest of the pipeline. This is the point where you might also make the split into training data and validation data. It’s picking up that data from whatever format it’s stored in and then potentially transforming it into a format that can move through the rest of that pipeline.
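
For readers who want to picture this step, here is a minimal ingestion sketch in Python using pandas and scikit-learn. The file name, column names, and split ratio are illustrative assumptions, not details from the episode.

```python
# A minimal ingestion sketch: load tabular data and split it for training/validation.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("customer_accounts.csv")   # wherever the data happens to be stored
X = df.drop(columns=["churned"])            # features
y = df["churned"]                           # the label we want to predict

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```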

Philip Winston 00:18:24 Can you give an example of unstructured or structured data?

Catherine Nelson 00:18:28 Yeah, usually we call text or images unstructured data, and structured data is data that’s in a tabular format. So the tabular data could be data about the sizes of companies that you’re considering, and the unstructured data could be something like the text of the complaints that you’re trying to make a prediction from.

Philip Winston 00:18:54 So are there any specific tools that might come into play here, whether they’re standalone tools or libraries that are commonly used for ingestion?

Catherine Nelson 00:19:02 There’s a few different solutions for the entire pipeline that have a data ingestion component. TensorFlow Extended (TFX) is one of these. There’s also Amazon’s SageMaker Pipelines, and I believe MLflow has a similar structure, although I haven’t worked with that myself.

Philip Winston 00:19:24 The next step I have is data validation. You talked about dividing the data into training and validation sets, but that might be a different type of validation.

Catherine Nelson 00:19:33 Yeah, that’s right. So when you’re dividing your data into training and a validation set, that’s used during the model training or the model analysis and validation step to check whether that model is sufficiently accurate for the problem that you’re working on. Data validation is checking that the data that you’ve ingested is what you expect. So some problems that you might have with that data could include: the data’s missing, something’s gone wrong upstream, and suddenly you’re getting null values in your data, and then your machine learning model wouldn’t be able to train with those null values in there. So the point of the data validation step is that if there’s a problem with your data, you can stop the pipeline at this point rather than go through the lengthy training step only to find out that there’s an error at that point, or that your model isn’t as accurate as you expected because there’s a quality issue with the data.

Philip Winston 00:20:35 Let’s pause for a second and talk about when we would rerun the pipeline, or why we’re rerunning the pipeline. If this was just a one-off exploratory investigation, we’d create a model, produce a visualization, and that would be the end. But in this case, we’re talking about building a pipeline. So when is it that we rerun this pipeline? Is it because we have new data? Is it because we’re trying to train a better model? Or in what situations? And I guess related to that, do we rerun the entire thing, or are we able to rerun portions of it?

Catherine Nelson 00:21:07 So for many business problems, the data doesn’t stay static. The data changes through time, people behave differently with your product and so on. So that causes the model performance to degrade with time because if you’ve trained a model at a specific point in time, it’s been trained on that data and then as your usage pattern changes, then that model is not quite so relevant to that data. So the performance drops, that’s the time when you might want to retrain that model and usually, you’d want to run the entire training pipeline all the way through. If you just run part of it, you don’t actually change anything because the artifact that you get at the end of the pipeline that you’re going to deploy into production is that trained model on that updated data.

Philip Winston 00:22:03 So as part of validation, how do you measure data quality, or under what situation would the data fail validation? That might be specific to a project.

Catherine Nelson 00:22:15 Yeah, if you had some numeric data, then you would look at basic statistics of that data, like the mean and the standard deviation, and you could look at the proportion of nulls in that data. If that goes up, that’s a signal that your data quality has decreased. If you are using text data, then it’s less obvious what you should check, but you could check the length of that text. If something has gone wrong upstream, you might be getting empty strings coming through into your pipeline. So that’s something you could check.
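
A minimal sketch of the checks Nelson mentions, written in Python with pandas. The column names and thresholds are illustrative assumptions; a real pipeline would tailor them to its own data.

```python
# Simple data validation checks: null proportion, basic statistics, empty strings.
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    problems = []
    # Numeric column: watch the proportion of nulls and the basic statistics.
    null_fraction = df["revenue"].isna().mean()
    if null_fraction > 0.05:
        problems.append(f"too many nulls in revenue: {null_fraction:.1%}")
    if not (0 < df["revenue"].mean() < 1e9):
        problems.append("revenue mean outside expected range")
    # Text column: empty strings often signal an upstream failure.
    empty_texts = (df["complaint_text"].str.len() == 0).sum()
    if empty_texts > 0:
        problems.append(f"{empty_texts} empty complaint texts")
    return problems

# In a pipeline, a non-empty list of problems would stop the run before training.
```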

Philip Winston 00:22:47 And what are your options if the data fails validation, is that basically signaling for someone to intervene or is there any automated step you could take to allow you to continue?

Catherine Nelson 00:23:01 You could consider rerunning the ingestion step if it’s something that’s gone wrong in that step, or you could change the data that you are putting into that pipeline. But in general, it’s a kind of safety valve against the final model being incorrect rather than anything that you’d need to change automatically like that.

Philip Winston 00:23:22 That kind of raises the question, how long is this whole pipeline going to take? I’m sure it varies drastically by application, but in the systems you’ve worked on, can you give me an idea of the range of time the full pipeline takes? The reason I ask is because if we’re preventing proceeding with bad data, we’re saving this amount of time. So I guess if the whole thing was very short, it wouldn’t be a big deal. But if the whole thing was long, then an early out could benefit us a lot.

Catherine Nelson 00:23:52 The systems I’ve worked on, it’s usually been on the minutes-to-hours scale, so it’s not days and days. But the point is that ideally you would have this pipeline set up so that it runs automatically, without any intervention from me, without needing to do anything. So it’s more about being able to automate it from start to end than about the time saving in particular. That’s what’s important here.

Philip Winston 00:24:17 I think I might know the answer to this, but does the data have to be perfect or how can we judge how tolerant our pipeline is to bad data?

Catherine Nelson 00:24:26 I think that the data should be sort of reflective of the real world that it’s trying to model. So like if you have data about your customers, about a bunch of different companies, then that’s going to be very variable. One thing is that some machine learning algorithms can’t cope with missing data. So in that situation, all the values do need to be filled out and it needs to be perfect from that point of view, but it can have a very wide distribution and that’s fine.
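
If the chosen algorithm can’t cope with missing values, one common option is to impute them before training. A minimal sketch with scikit-learn, where the data and the imputation strategy are illustrative assumptions:

```python
# Fill missing values so that algorithms that can't handle NaNs can still train.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])  # toy data with gaps
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)   # no NaNs remain; the wide distribution is preserved
```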

Philip Winston 00:25:00 So let me read the next four steps so we have some idea where this is going and maybe what to talk about at which step. So I have next data pre-processing, then model training, then model analysis and validation, and then deployment. So let’s talk about data pre-processing next. I don’t know if this is an official step or if it depends on the workflow, but I guess, how is pre-processing different from the previous steps?

Catherine Nelson 00:25:27 Pre-processing is often synonymous with feature engineering, so that’s translating the raw data into something that you can use to train the model. So if your raw data was text, then it might be word frequencies or something like that. And that’s different from the validation step because in the validation step you are describing the data; you are checking that the data doesn’t contain nulls and so on.
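
As a small example of the pre-processing Nelson describes, here is a sketch that turns raw text into word-frequency features with scikit-learn. The sample texts and the vectorizer settings are illustrative.

```python
# Feature engineering for text: raw strings become word-frequency features.
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["flight was delayed", "great service, would fly again"]  # illustrative raw data
vectorizer = TfidfVectorizer(max_features=5000)
features = vectorizer.fit_transform(texts)   # sparse matrix the model can train on
```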

Philip Winston 00:25:56 Has deep learning kind of eroded the necessity for feature engineering? I remember I worked on a project a long time ago and a huge amount of effort was put into feature engineering, and more recently I worked on something and they’re kind of saying that feature engineering kind of goes away in some cases. What has your experience been?

Catherine Nelson 00:26:18 Yeah, I think that’s right. Especially, I’ve worked on a lot of text models, and it’s become a lot better to not do much with the text, put it in pretty much raw, and have a more complex model that’s able to learn a lot more from that text, rather than doing extensive feature engineering to extract those features from the text and then train the model. That seems right to me.

Philip Winston 00:26:44 Okay. Let’s move on to model training. How about this idea of training from scratch versus fine tuning an existing model? Are both of these possibilities in a pipeline?

Catherine Nelson 00:26:59 I’d say there’s actually three possibilities. There’s training from scratch, there’s fine tuning, and there’s retraining the exact same model on new data. Training from scratch, I wouldn’t do that in my machine learning pipeline. I would do that separately in standalone code to get that model established the first time around and check that it is actually accurate enough to solve that problem. Then I would build a machine learning pipeline only when I knew that I had that model and was going to be retraining it. Fine tuning you can certainly do within the pipeline, because you might want to tweak the hyperparameters of that model when there’s new data. So you might want to have a small step within that.
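
A rough sketch of what a retraining step with a small hyperparameter search might look like, reusing the X_train and y_train names from the earlier ingestion sketch. The model choice and parameter grid are illustrative assumptions, not a recommendation from the episode.

```python
# Retrain an established model on fresh data with a small hyperparameter search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)        # data produced by the ingestion step
model = search.best_estimator_      # the candidate artifact that moves on to analysis
```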

Philip Winston 00:27:45 You mentioned hyperparameters. I was wondering, when you say have the model and then retrain it, what is the model at that point? Is it all the parameters associated with it? I guess, what would be part of the model that then gets retrained? What stays?

Catherine Nelson 00:28:01 Yeah, that’s a great point. Sometimes it’s the model architecture and those hyperparameters, and sometimes it’s just the model architecture. So if you’re in the neural network world, then the number of layers in that model, the types of layers in there, how they’re connected, that’s probably going to stay static, because changing that up within a pipeline is hard. You don’t have quite such instant feedback on whether the model is working as you do in a separate piece of training code that’s just designed to run through those experiments. You’ve got a lot of other code around that model that makes it a little more complex to debug.

Philip Winston 00:28:47 You mentioned the duration of the pipelines you’ve worked with range from minutes to hours. Is most of that time in the model training?

Catherine Nelson 00:28:56 Yes. Yeah, that’s right. Usually the other steps are shorter and it’s the model training that’s the long one. So that’s why it’s important to have those other steps separate so that you know that your data is in good shape by the time it gets to the time-consuming training step.

Philip Winston 00:29:14 Another element of time would be how long we’re going to use a model before retraining. How does that vary? I think very early on, I imagine people used models for a long period of time and more recently I feel like people are retraining more and more often. Is that a trend?

Catherine Nelson 00:29:36 Yeah, I think from what I’ve worked on, that depends on the maturity of the use of machine learning in that organization. So early on you might have built these models fairly ad hoc, and then it’s a big effort to deploy them into production. But when you do that, it makes a big step change in the accuracy of your product. Whereas as time goes on, you are making smaller improvements in your product, but you want to make them more frequently. So having that pipeline set up allows you to change your model often as the input data changes. So yeah, I think that’s why you’re seeing that.

Philip Winston 00:30:19 I guess taking a step back for a second, during all of these steps of creating a pipeline, in what cases are we able to just hand this over to software engineers and kind of give them the information about the model? And in what cases do you feel the data scientist needs to be involved? What’s the trade-off between a handoff situation and a side-by-side collaboration?

Catherine Nelson 00:30:44 Part of this is going to depend on the team that you have and the skill sets available, but I would say it’s very useful to have the data scientist involved in setting up the initial pipeline. In particular, things like: what are the criteria for the data validation step, what is a sensible distribution of your data, what are the hyperparameters that you should be considering when you’re training the model? And particularly in the step that we haven’t talked about yet, which is the model analysis step, I think that’s where the data scientist has a really crucial part to play. I think any data scientist can learn the skills that they need to deploy a pipeline, but often being able to debug that complex system, being able to set it up so that it interfaces with the rest of the product, making sure that it’s well tested and so on. That’s where a software engineer can add so much value.

Philip Winston 00:31:46 I’ve worked with many scientists unrelated to data science, just, you know, biologists or physicists, and yeah, they can learn to code as part of their education or sort of on the side, but then there are certain skills that they don’t have as much experience in. But I definitely feel that many people end up learning to program by necessity. And I think that’s a good thing for the most part.

Catherine Nelson 00:32:12 And for me it’s also because I enjoy being able to write better code. It’s a lot of fun being able to do this well and write code that’s robust and scales and so on.

Philip Winston 00:32:22 You mentioned model analysis and validation, that’s the next step. So because the word’s the same, how is this different from a data validation? I guess it’s a question of what are we validating?

Catherine Nelson 00:32:34 Yeah, so this is where we are looking at the performance of the model in terms of how accurate it is, what’s the precision and recall, and also sometimes splitting that accuracy down into finer-grained sectors. So if you had a model that you were deploying in lots of different countries, does it perform equally well on the data from all those countries? That’s something that you could do with your validation data, which is the split of your data into training data and validation or test data. And I know that we’re using the word validation way too many times here, but that seems to be the way that the terminology’s gone. So analysis is looking at that accuracy across different aspects. This is a point where you might look for bias in your model as well. Is it providing better performance for certain groups? Is it providing better performance on your female users versus your male users? That would be something you’d want to look for at this step. And then the validation part of that is that the model should only be deployed if it is acceptable on all the analysis criteria. So this is kind of your final go or no-go step before you deploy that model into a production setting.
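
A minimal sketch of that kind of analysis in Python with scikit-learn, assuming a fitted model plus X_val (a DataFrame with a country column) and y_val from the earlier sketches. The country column and the idea of slicing by it are illustrative assumptions.

```python
# Overall metrics plus the same metric sliced by a grouping column (e.g. country).
import pandas as pd
from sklearn.metrics import precision_score, recall_score

preds = pd.Series(model.predict(X_val), index=X_val.index)
print("overall precision:", precision_score(y_val, preds))
print("overall recall:   ", recall_score(y_val, preds))

# Check whether the model performs equally well on every slice of the data.
for country, idx in X_val.groupby("country").groups.items():
    print(country, precision_score(y_val.loc[idx], preds.loc[idx]))

# A pipeline would only promote the model if every slice clears an agreed threshold.
```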

Philip Winston 00:33:56 I wanted to flag the term precision and recall. We’re not going to try to go through all that, but I’m guessing that relates to false positives versus false negatives.

Catherine Nelson 00:34:05 Yes. In a classification problem, those would be some of the metrics you’d look at.

Philip Winston 00:34:10 And when you say deciding whether the model is good enough, the reason we might want to make that decision is maybe we have a previous model that was pretty good already and we don’t want to make it worse.

Catherine Nelson 00:34:21 Exactly. That’s exactly right. So you’ve taken the same model and retrained it on new data. Does it perform better as a result of this?

Philip Winston 00:34:30 How about overfitting versus generalization? I don’t know if those are too technical, but can we just give an idea for what those have to do with the model analysis?

Catherine Nelson 00:34:40 This is where, if you have overfit your model, then you’ll see a higher accuracy on your training set than on your validation set. And then you’ll know that your model is too closely replicating your training set and it’s not able to generalize to new data. And what you want it to be able to do is exactly that: generalize to new data when you’re deploying it into production.
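
A quick way to see this in code, assuming the fitted model and the training and validation sets from the earlier sketches:

```python
# Overfitting check: a training accuracy much higher than validation accuracy
# suggests the model has memorized its training data rather than generalized.
train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```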

Philip Winston 00:35:07 This might be a naive question, but is the size of the model unrelated to the amount of training data you have, or does the model size grow depending on the dataset?

Catherine Nelson 00:35:19 If you have a small dataset and a large model, then it’s very prone to overfitting, because your model is basically able to memorize the data that you have, so then it might not perform that great when it sees new data. Yeah, that’s a good question.

Philip Winston 00:35:38 And I have this down as interpretability. I don’t know if that’s really part of this step, but I guess that’s an element of a model, whether you really understand what it’s doing or whether it’s sort of a black box?

Catherine Nelson 00:35:51 Yeah, that’s definitely part of model analysis, but it’s probably not something that you’d put in the pipeline, because for models that you need to explain, that’s almost the opposite of automating the problem. If you need to look very carefully into what features are causing it to make a certain prediction, you can’t really do that as part of an automated setup that’s going to deploy a model as soon as it’s trained. You’ve got to take that step back. If you needed to interpret your model, you might run the pipeline up to this point, but then maybe this is where your data scientist steps in to really take a good look at that model, and then it’s a manual step to deploy it to production.

Philip Winston 00:36:41 That also kind of raises the point, maybe if we have a large team, we’re working on building this pipeline with the involvement of data scientists and software engineers, but then maybe there’s some other data scientists working on sort of a next generation model or something like that so that these things could be happening in parallel. It’s not like we just drop everything and build a static pipeline.

Catherine Nelson 00:37:04 That’s right. Yeah.

Philip Winston 00:37:06 So let’s start talking about deployment at this point. We have a pipeline; we can run it hopefully with minimal intervention. Hopefully it runs kind of straight through. Maybe it retrains with more data. What is different about deployment, or maybe production? I’m sure it varies based on the project, but in general, what is this transition from “I have a pipeline I can run” to “it’s been deployed,” which maybe means it’s running at a larger scale or it’s running more often?

Catherine Nelson 00:37:41 Yeah, this is really the point at which I kind of hand off my tasks to the software engineers as well. The pipeline produces a model artifact, which can be a saved set of model weights and so on. And then that’s when that model gets handed over and is set up to run inference. So that’s really the point at which the data scientist’s job is done. That newly trained model has been produced and now it can be put into the product, so it can provide the service that it’s planned to do.
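
A minimal sketch of that handoff for a scikit-learn model, using joblib to save and reload the artifact. The file name and the new_data variable are illustrative; deep learning frameworks have their own save formats.

```python
# The training pipeline ends by writing out a model artifact; the serving code
# loads it and runs inference.
import joblib

joblib.dump(model, "churn_model_v2.joblib")            # last step of the training pipeline

serving_model = joblib.load("churn_model_v2.joblib")   # inside the serving code
predictions = serving_model.predict(new_data)          # new_data: features in the same format
```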

Philip Winston 00:38:14 How does scaling enter into the equation here? Is it possible that our model is so large or so computationally intensive that we can’t put it in production? Or is that something you’ve thought of from very early on? Is there sort of a point here where we have to decide if we can even deploy this?

Catherine Nelson 00:38:35 Ideally you would know that before you have started building the pipeline, because a lot of what affects that will be the model architecture, the number of layers, the size of the layers if you’re dealing with a neural network. So ideally you would want to know what some of the requirements are for inference at the experimentation stage when you’re trying out lots of models. Because if you need your model to be extremely fast, you might limit yourself in those experiments to models that are small and fast.

Philip Winston 00:39:08 This might apply to all the steps, not just deployment, but what is your opinion on heavily refactoring and evolving the initial sort of pipeline versus rewriting or having a sort of clean start when you’re implementing the pipeline?

Catherine Nelson 00:39:26 A lot of the pipeline solutions, like TensorFlow Extended or Amazon SageMaker Pipelines, will take as inputs scripts like your training script. So you don’t necessarily have to rewrite; you can just kind of pick up the code that you’ve used for training and drop that into the pipeline code. That works pretty well because you already know that that’s working. You don’t have to completely rewrite from scratch. But a lot of the boilerplate code around the pipeline is new; that’s written from scratch to actually link those pieces together and make sure that one step goes to the next, to the next, and so on.

Philip Winston 00:40:10 One role we didn’t talk about at the start was MLOps, which I guess is an offshoot of DevOps that’s more specific to machine learning. Do you have any experience with sort of what could go wrong in deployment that maybe is special to machine learning? So in regular backend work, there’s certain problems that crop up. I’m wondering if there’s any machine learning specific. Maybe it has to do with monitoring the inference time or the memory used or something.

Catherine Nelson 00:40:40 So one problem that I have heard about, but not experienced myself, is called training-serving skew. What this is, is when you might have feature engineering code and model training code, and you need to do that feature engineering when your model is deployed and running inference as well as when you’re training the model. So what you might do is update that feature engineering code in your training pipeline, and then train a model based on those particular features, and then deploy that model and forget to update the feature engineering code on the serving side. So when your model is running inference, the data’s coming in, it’s getting the old feature engineering, but the new model. The features might be completely valid after that feature engineering step, but they aren’t the right distribution for that particular model. So your model performs worse and you’re not quite sure why. So that’s a subtle one that can come up.
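
One common way to guard against this kind of skew is to bundle the feature engineering and the model into a single artifact, so the serving side can never apply stale feature code to a newly trained model. A minimal sketch with a scikit-learn Pipeline, where train_texts and train_labels are assumed to exist:

```python
# Bundle feature engineering and the model so training and serving can't diverge.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

text_model = Pipeline([
    ("features", TfidfVectorizer()),       # feature engineering step
    ("classifier", LogisticRegression()),  # model trained on those features
])
text_model.fit(train_texts, train_labels)  # training side

# Serving side: the identical fitted transform is applied automatically.
text_model.predict(["raw incoming text"])
```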

Philip Winston 00:41:49 Yes. That does sound like very machine learning-specific debugging, where the behavior of this model that we validated and analyzed is not what we thought it would be. And maybe we have to roll something back, or maybe we have to fix it.

Catherine Nelson 00:42:06 So it’s good to have monitoring in production to be able to sniff out these kinds of things to check that the model accuracy is what you expect.

Philip Winston 00:42:15 Okay. It sounds like we did all the steps of the pipeline. Let’s start wrapping up. We talked about different roles throughout this episode. Do you see any new roles on the horizon or roles that you think are changing or becoming more prominent?

Catherine Nelson 00:42:33 Yeah, I think there’s a couple of things here. Now that I’ve started working on generative AI solutions, AI engineer is the obvious one. Someone who’s not necessarily building the model or training the model, but is designing applications that are based on AI models. That’s huge, and that’s only going to continue to grow as well. The other thing I see is that I think there’s a big place for data scientists in the world of AI, and that’s in evaluating AI models. So if you are trying to use an LLM for some particular business application, it’s actually very hard to check how accurate that model is. So I think that’s a huge growth area for data science.

Philip Winston 00:43:23 And again, I think, yeah, when we say AI engineer, we’re talking about people working with foundation models of some type. It could be LLM or not. Maybe they’re just working through an API, and they don’t run any machine learning locally at all. I guess in some cases they’re programming, you know, in English writing prompts and things.

Catherine Nelson 00:43:44 Yeah, I think that’s going to be huge in the future, and it already is huge right now.

Philip Winston 00:43:49 We try to focus on collaboration. Moving beyond just the software and machine learning, what collaboration methods have you found to be useful between data scientists and software engineers or other members of the team, whether it be a tool or just a technique that you find is helpful?

Catherine Nelson 00:44:07 I think the best way of collaborating is having a team that’s open to ideas. It doesn’t really come down to any tools or techniques. It’s all about valuing each other’s ideas. And the best teams I’ve worked on for that have been where people are very supportive of each other and supportive of someone bringing in new ideas. That seems to me to be the key, rather than any particular tool or piece of software.

Philip Winston 00:44:38 So continuing to wrap up, what are you excited about looking ahead in machine learning projects you’re working on or that you see in the wider industry?

Catherine Nelson 00:44:47 So, having relatively recently started working with LLMs, I’m just so blown away by the capabilities at the moment. I’ve been working on a project to showcase a good example of a use of an LLM for a startup I’m working with, and the project we decided to choose was extracting people’s flight details out of an email. So you send the service an email with your flight details, and it will extract the origin, the destination, the time of departure, time of arrival, and so on, and populate those into whatever kind of app you want to work on. And I’ve worked on similar projects before and seen things like big piles of regular expressions or all this complex feature engineering to get this out of it, but now I can do it in a five-line prompt to OpenAI, and it works better than all those previous, incredibly complicated solutions. And you can even do things like ask it for the airport code instead of the name of the city, and even if the airport code isn’t in the email, you can still get that, because the LLM has that context. So I’m just really excited about what we’re going to be able to do with these things in the future.
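
This is not the actual project code, but a rough sketch of the kind of short extraction prompt described here, using the OpenAI Python client. The model name, prompt wording, and requested fields are illustrative assumptions.

```python
# Sketch of a short LLM extraction prompt for flight details in an email.
from openai import OpenAI

client = OpenAI()   # expects OPENAI_API_KEY in the environment
email_text = "..."  # the forwarded flight confirmation email

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": (
            "Extract the origin and destination airport codes, departure time, "
            "and arrival time from this email. Reply as JSON.\n\n" + email_text
        ),
    }],
)
print(response.choices[0].message.content)
```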

Philip Winston 00:46:11 Yeah. I guess that leads to the question: at the end of a project, all the technical success is there, but what is the business value? And in the projects you’ve worked on, what are they doing for the next version? Is it usually kind of doubling down on techniques, or is it tackling a completely new area of the business? Or what are the possible directions at the end of a project?

Catherine Nelson 00:46:34 Yeah, so some of the projects I’ve worked on, it adds a new feature that wasn’t possible without machine learning. And a lot of these have been extracting information out of unstructured data, and that gives you the capability to add something that you didn’t think you were going to be able to do, offer some new feature to your customer, and then you know, you might spend some time optimizing that so that the accuracy improves and so on. So I think, yeah, it’s a balance between improving these existing models, improving the accuracy, and then the step change of adding a completely new feature.

Philip Winston 00:47:09 Okay. Where can listeners find out more about you or your new book?

Catherine Nelson 00:47:14 Best place is to follow me on LinkedIn.

Philip Winston 00:47:17 Okay. I’ll put your handle for your LinkedIn name in the show notes. Thanks for talking to me today, Catherine.

Catherine Nelson 00:47:25 Thanks, Philip. It’s been great talking to you.

Philip Winston 00:47:28 This is Philip Winston for Software Engineering Radio. Thanks for listening.

[End of Audio]
