Pete Warden, CEO of Useful Sensors and a founding member of the TensorFlow team at Google, discusses TinyML, the technology enabling machine learning on low-power, small-footprint devices. This innovation opens up applications such as voice-controlled devices, offline translation tools, and smarter embedded systems, which are crucial for privacy and efficiency.
SE Radio host Kanchan Shringi speaks with Warden about challenges like model compression, deployment constraints, and privacy concerns. They also explore applications in agriculture, healthcare, and consumer electronics, and close with some practical advice from Pete for newcomers to TinyML development.
Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
- TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers
- Blog post: Why the Future of Machine Learning is Tiny
- Harvard Courses
- YouTube: @PeteWarden
- GitHub: ee292d labs
Transcript
Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.
Kanchan Shringi 00:00:19 Hello everyone, welcome to this episode of Software Engineering Radio. Our guest today is Pete Warden. Pete is CEO of Useful Sensors. Pete was a founding member of the TensorFlow team at Google and the tech lead for all non-cloud TensorFlow infrastructure at Google for seven years. Welcome Pete to the show. It's really great to have you here. Is there anything you'd like to add to your bio before we get started?
Pete Warden 00:00:44 No, I think that’s great. Thanks so much.
Kanchan Shringi 00:00:46 So Pete, our goal today is to explore the reasons for, and the challenges in, deploying machine learning on low-power, small-footprint hardware. This is the essence of TinyML, a term I understand was coined by you. So Pete, to start, what is TinyML? Why is it important?
Pete Warden 00:01:07 Yeah, and to start with, I should say that Evgeni Gousev of Qualcomm and I were the coiners of TinyML, and we actually started the TinyML conference series together. So I want to make sure he gets credit. But really the idea behind TinyML is that being able to run machine learning on tiny, cheap embedded devices that are in everyday objects all around us opens up a whole bunch of possibilities of doing things, especially around interacting with our environment, in ways that we've never been able to before.
Kanchan Shringi 00:01:43 Can you give an example?
Pete Warden 00:01:45 So one of the things I really want to be able to do is just look at a table lamp and say "on" and have the lamp come on. And that sounds like a straightforward task, but actually there's a whole bunch of technological challenges in order to make that happen. And then on top of those, a whole bunch of economic and business challenges.
Kanchan Shringi 00:02:08 Before we get more into that, is TinyML related to IoT, the Internet of Things or is it completely unrelated?
Pete Warden 00:02:16 So this is actually one of my bugbears: I really don't think that the Internet of Things has succeeded. And the definition that I've seen most often of the Internet of Things is the idea that by adding network connectivity to everyday objects, you are going to actually enable a whole bunch of really exciting new things, in the same way that adding internet connectivity to laptops and cell phones created incredible innovation and change and growth. I think that the analogy does not hold for embedded systems and everyday objects for a whole bunch of reasons. And I think it kind of misled a lot of people to invest in adding connectivity and then just hoping wonderful things happen. TinyML from my perspective is a lot more about saying, hey, let's leave connectivity until we need it, but let's try and make these individual objects smarter on their own locally, with no internet connection, and let's see how far we can go with that.
Kanchan Shringi 00:03:25 So in your example of being able to look at a lamp and have it turn on, that has got nothing to do with network connectivity?
Pete Warden 00:03:32 Yeah, and one of the key things with that is you can imagine buying a lamp, plugging it in and then immediately being able to talk to it and have it work just like you could in the old days with any other object you bought, you didn’t have to go through set up, you didn’t have to use a phone app, you didn’t have to have an account, you didn’t have to worry about firmware updates. It’s kind of the idea of going back to objects that just work out of the box is I think a big part of the promise.
Kanchan Shringi 00:04:05 But there is machine learning involved. How is this TinyML different from traditional machine learning?
Pete Warden 00:04:13 So I didn't know much about embedded devices when I first joined Google. I sort of knew that they were this esoteric area, but I'd never had much contact with them. When I joined Google, one of the first people I met was Raziel Alvarez and he explained that they were using 30 kilobyte sized models for the, I won't say the full wake word for Google's voice assistant because I don't want everybody's phones going off, but the detection of that wake word ran in only 30 kilobytes and I couldn't understand how. And so that led me down this path of realizing that there are these chips that only cost 50 cents or 25 cents, and they use almost no power. So you can have them running for a very long time on a battery, and being so cheap and low power means that they can be in almost everything you buy.
Pete Warden 00:05:11 So there were 40 billion of these being sold a year. So I grew fascinated by the constraints that were involved in taking what is traditionally maybe hundreds of megabytes for a machine learning model and trying to run it in something that may only have as much memory as a Commodore 64 from the '80s. And it turns out that machine learning actually scales down really, really well and you can do things, not with the same accuracy, but with enough accuracy to be useful, even in these really cheap, low-power devices. Especially things around understanding what people are doing and understanding the environment around them.
Kanchan Shringi 00:05:56 So you compared the amount of memory to very old hardware. How much is it actually, the memory?
Pete Warden 00:06:03 The smallest devices I've actually done work on have had 64 kilobytes of RAM, and one of the design goals of the open-source software that I worked on at Google, aimed at these kinds of devices, was that the whole of the software framework had to fit in less than 20 kilobytes. So we are really talking very much kind of 1980s scales of memory when we are looking at these very ubiquitous embedded devices.
Kanchan Shringi 00:06:35 So the devices that we’re looking at, you give an example of a lamp, what else are we talking, doorbells?
Pete Warden 00:06:43 I mean, I would say doorbells are one of the few places where IoT has actually shown its potential, because they tend to be connected because you need to watch images from afar. I would say actually, if you looked at things like coffee machines, Keurig actually ship a little computer vision model in a lot of their coffee machines now that recognizes the pods that you put in. So that is running locally at a very low cost, because coffee machines are often below a hundred bucks retail, so that means probably you can't have more than $15 or $20 worth of parts in them. So you're looking at a very, very low-cost piece of hardware, including the camera, but they want to be able to offer this functionality, so they figured out how to shrink down the machine learning models to do this kind of image recognition into something that's probably only 20 cents worth of hardware.
Kanchan Shringi 00:07:39 So it's cheap memory, and very, very little of it; you said 64 KB of RAM as an example. What about CPU and power consumption?
Pete Warden 00:07:47 So it depends on the particular target environment. Some applications really need to run with very low power because they need to run for a long time and they need to be left to run kind of unattended with no charging or battery changes happening. So in those cases you might have something like, I think Ambiq are an example of a company that does very low-power Cortex-M MCUs, and they might be running at one milliwatt or below. And when you get down to that low sort of power, you are paying a little bit more in cost; it might be more like a dollar for an MCU to get into that low-power regime. And then on the other side, I've started to see RISC-V MCUs with somewhere in the region of 64 kilobytes of memory for 10 cents on some of these sites. So they're not going to be using as little power, it might be 50 milliwatts of power that they're using, but they're incredibly cheap. You're starting to get to the point where they're cheaper than individual resistors and capacitors and things. So they start to edge out traditional electronics. When you want to build something, you might as well just grab an MCU.
Kanchan Shringi 00:09:08 You mentioned MCU, that’s microcontroller?
Pete Warden 00:09:11 Yes, thank you for spelling that out. Yeah.
Kanchan Shringi 00:09:14 And is there any other type of hardware that is used here?
Pete Warden 00:09:18 So another related piece of hardware is the Digital Signal Processor, or DSP. They often share a lot of characteristics with microcontroller units, but they tend to have a lot more compute capability. They're designed for things like audio processing, and as it turns out, the same operations that you need for audio processing are also very useful for machine learning operations. So these traditional DSPs are often very good at running machine learning models if you are able to take advantage of them.
Kanchan Shringi 00:09:58 So traditional machine learning is associated with big data. So how do you actually reconcile TinyML with the data-heavy requirements of traditional ML?
Pete Warden 00:10:10 So you often still require a large amount of data to train the models, and in some ways you might even require more data, because you are training a model that has no capacity to spare. You are training a model that's as small as you possibly can. So for a 30-kilobyte model to recognize a wake word, you might actually have something on the order of a million utterances of different words to help train the model to recognize the ones that you want it to recognize. So you are still using a whole bunch of GPUs on the training side because of all of this data throughput. The only advantage is that because the model itself is so small, you don't spend much compute on the backpropagation and the other parts of the training. So it's an interesting world, and it's actually good in a way, because the lowered GPU requirements mean that it's a lot more open to innovators and academics and other people who want to get involved but don't have a million dollars to drop on a GPU cluster, like some of the folks at OpenAI.
Kanchan Shringi 00:11:23 Sorry Pete, so you still need a lot of compute to train it but you don’t need a lot of compute to run it. Is that the takeaway?
Pete Warden 00:11:31 Yeah, and a lot of the compute that you’re using when you’re training it is actually just the processing of the training data. The actual training itself is pretty lightweight because the models are so small, which means that even though you are still having to deal with big data, you are not having to spend nearly as much as you would to train like a large language model or even a full-sized image recognition model.
Kanchan Shringi 00:11:57 And that's because of the size of the model? Can you try to connect the dots? Do you always start with a small model as part of the training, or is there some process to reduce the size after the training?
Pete Warden 00:12:09 So there's different schools of thought on how to approach this. Some people like to use techniques where they train a large model and then use what's known in the literature as distillation, or teacher-student approaches. The general approach that was used at Google, and that I tend to use, is just starting with an architecture for a model that only has a few layers and only has a few weights, and then just pushing a lot of data through that model, and then doing some things like quantizing down to eight bits and a few other things like that, to just make sure that you've compressed the model as much as you possibly can for deployment.
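To make the "few layers, few weights" starting point concrete, here is a minimal sketch, assuming PyTorch, of the kind of deliberately small architecture Pete describes; the layer sizes and class count are illustrative only, not from the episode.

```python
# Illustrative only: a deliberately tiny convolutional model, defined small up
# front rather than compressed down from a large one. Roughly a thousand weights.
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self, num_classes: int = 4):  # hypothetical number of classes
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = TinyNet()
print(sum(p.numel() for p in model.parameters()), "parameters")  # a little over 1,000
```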
Kanchan Shringi 00:12:55 As you start the process, you are aware that this model is going to be used for TinyML. You need much more data to train it, but the model size is small and hence easier, actually more accessible, for more people to train.
Pete Warden 00:13:09 Exactly, and this is one thing that's a bit different from TinyML compared to a lot of traditional ML projects: you really need to know about your deployment platform when you are training your model. The workflow for ML in general is often you'll have researchers who are using Python and PyTorch and these other frameworks, and their job is to just come up with a model that is as accurate as possible, and they aren't thinking too much about the resources involved because accuracy is the most important thing. And if you're running on the cloud, you can always just buy a bigger Amazon EC2 instance or buy a bigger graphics card or buy more virtual machines, and you can just pay money to run larger models. So those ML teams, those researchers, are able to effectively train a model as best they can and then just throw it over the wall to the deployment teams and have the production teams worry about how they're actually going to deploy those models.
Pete Warden 00:14:15 Whereas in TinyML you have to be thinking about the full stack of data, training, and deployment when you are actually creating your model. You may even have to think about things like, oh, this particular piece of hardware is good at doing five-by-five convolutions, say, but it's not good at doing anything larger than that. So you actually set up your model so that it only uses operations in ways that are optimized for the piece of hardware that you're actually going to be deploying on. So it requires a lot more knowledge of different layers of the whole stack, all the way from the world of embedded tools and chips up to PyTorch and ML frameworks, GPU training, and data. So in some ways it's a lot more challenging, but I also think it's really fun, because if you get frustrated in one area, you can always go and work on a problem in another.
Kanchan Shringi 00:15:18 Could you maybe drill down a little bit into the example you mentioned, about being aware of the hardware, the fact that it's good at five-by-five convolutions? Could you maybe just start with explaining what a convolution is and help us picture what the challenge is?
Pete Warden 00:15:33 Yeah, definitely. So the way that I like to think about neural networks and deep learning is as these pattern recognition machines, and a convolution, in its simplest form, is a square block which contains numerical values that form a pattern. So if you've ever done any image processing, you might think about trying to recognize vertical lines by having a five-by-five or a three-by-three square block where you have high values going down in a vertical line through the center and then zero values for the rest of the block. And what happens when you take that patch, that convolution patch, and run it over an image, kind of as a sliding window, is that when the patch and the underlying image match, you basically do what's known as a convolution, though actually it's not quite what we call a convolution in classical image processing, which is additionally confusing; but you basically multiply each pixel in the image against each value in this convolution patch and then you add up all of the results.
Pete Warden 00:16:54 So that means when you’ve got this sliding window patch that’s moving across the image, if it’s got a vertical line down through the middle, any parts of the image that have a vertical line in that same position will come up with a very high value and then the rest of the image will give a much lower value for that patch. What happens with neural networks, especially in the vision side, is you don’t create those patches manually like you would with classic image processing. Instead you have the network train through back propagation to figure out what patches it should use for kind of running across the image to find interesting things. So one example I like to use is if you imagine that you were training a neural network to look for a cat, you’d look for some low-level features first in the kind of first layer.
Pete Warden 00:17:51 So you might say, hey, I’ve got a patch which corresponds to kind of the texture of fur. And you run that across the image and you find out which parts of the image have kind of fur in them. And then you might also have something that looks for kind of triangular corners in the image as another patch and you run that over the image and you do that for maybe 64 or 128 different patches in the first layer and you end up with this new image that’s based on the original image but has the results of running those patches in sliding windows over the image in each channel. And the magic of neural networks is that they start off with these very low level representations of oh, maybe there’s fur in this corner of the image and maybe there’s like this triangular bit and each layer looks at the information from the layer before and does a similar kind of convolution, but instead of looking for the fur, you know the fur texture, it might say, hey, anywhere where the fur value is high in that channel, oh and somewhere where you’ve got this triangular point, then I’m going to say that might be an ear.
Pete Warden 00:19:11 So the next layer in the neural network has this higher-level semantic information about what’s happening in the image and eventually you get to the last layer which is able to say, hey, are there two ears, two eyes, whiskers and fur? And in that case there’s probably a high probability that there’s a cat. Now the channels and the layers don’t actually work out to be quite that neat. You can’t say oh here’s the fur layer or here’s the ear layer. But conceptually that’s how the kind of information processing works through a neural network, if that makes sense.
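To make that sliding-window description concrete, here is a minimal sketch in NumPy (not from the episode); the vertical-line kernel and the toy image are made up for illustration.

```python
import numpy as np

# A 3x3 patch that responds to vertical lines: high values down the middle column.
kernel = np.array([[0, 1, 0],
                   [0, 1, 0],
                   [0, 1, 0]], dtype=np.float32)

# A toy 5x5 image with a bright vertical line in column 2.
image = np.zeros((5, 5), dtype=np.float32)
image[:, 2] = 1.0

# Slide the patch over the image; at each position multiply element-wise and sum
# the results (a multiply-accumulate), exactly the operation described above.
out = np.zeros((3, 3), dtype=np.float32)
for y in range(3):
    for x in range(3):
        out[y, x] = np.sum(image[y:y + 3, x:x + 3] * kernel)

print(out)  # positions where the patch lines up with the vertical line score highest
```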
Kanchan Shringi 00:19:52 And here, because you have to be conscious that the number of layers is small, you have to modify the process.
Pete Warden 00:19:58 Exactly, and you have to think about what's most important. Because in the example I used of listening out for audio keywords to wake up a voice assistant, because it's only 30 kilobytes, it's not going to do as accurate a job as something that's running even on the main CPU on the phone or in the cloud. But what it does is it wakes up the CPU when it thinks it's found something that might be the audio wake word, and its job is to be okay with having false positives. So it can be mistaken about thinking that somebody said the wake word when they actually didn't; that's fine, because then you'll run a model on that same audio that's larger on the CPU, and then maybe another one in the cloud. And if it's not the right phrase, those larger, more accurate models will flag that and the assistant won't be activated. But you have to be very, very sure that you don't miss somebody saying the wake word when they actually said it, because in that case that 30-kilobyte model will just give the thumbs down and none of the larger models will get a chance to run. So you have to think carefully about how this model works within a larger system, which is also part of the fun.
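As a rough sketch of that cascade (the thresholds and scoring functions below are hypothetical placeholders, not Google's implementation), the always-on model's threshold is set permissively so it rarely misses, and the larger stages clean up its false positives.

```python
def tiny_model_score(audio) -> float:
    # Placeholder for the ~30 KB always-on model: returns a wake-word probability.
    return 0.9

def cpu_model_score(audio) -> float:
    # Placeholder for the larger, more accurate model running on the main CPU.
    return 0.95

TINY_THRESHOLD = 0.2  # deliberately low: false positives are acceptable here
CPU_THRESHOLD = 0.8   # stricter: this stage rejects most of the tiny model's mistakes

def maybe_wake(audio) -> bool:
    if tiny_model_score(audio) < TINY_THRESHOLD:
        return False  # a miss here is fatal, so the bar is kept very low
    if cpu_model_score(audio) < CPU_THRESHOLD:
        return False  # the bigger model filters out the false positives
    return True       # optionally hand off to an even larger model in the cloud
```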
Kanchan Shringi 00:21:20 You covered some of the challenges and, to some extent, the benefits: the device is responsible for catching all the wake words in your example, and hence there's definitely reduced latency because it doesn't have to go connect to the cloud for everything. You also covered low power consumption. Is there any other benefit of on-device processing?
Pete Warden 00:21:42 So, privacy. One of the things that I am pretty passionate about is you don't have to connect your keyboard to the internet and send all your keystrokes up to the cloud to use it. But at the moment, that's the way that voice interfaces work. So it's creepy, because there's a lot of very private personal information involved, and people are understandably freaked out about the implications of this. So being able to run everything locally on device, and even on a device that provably has no internet connection, like in the example of the lamp that you control by voice, if that doesn't have a WiFi or a Bluetooth chip in it, then you've got a pretty good assurance that it's not going to be used for spying on you. So privacy is something that I think is a big deal here too.
Kanchan Shringi 00:22:39 Absolutely, privacy is certainly a very big deal as we get into this new world. Okay, so maybe let’s take a step back and look at different industries where TinyML would be most impacted. Do you have some examples, specific use cases?
Pete Warden 00:22:55 Yeah, so honestly one of the challenges with TinyML is that there's a lot of engineering and academic and research technologist excitement about it, but it's been over five years now and it's been slow to be adopted in a lot of the industries where it can have the most impact. Now, having said that, some of the things that I've seen have been around the idea of actually using TinyML in agriculture. So for spotting pests or spotting disease, using tiny, cheap sensors that can be kind of scattered throughout a field or in a crop storage area, to detect when there's pests and things actually coming in. There's also been a lot of use cases around medical devices. So having something that is actually able to run for a long time on a battery to do monitoring of various vital statistics and look at the patterns to understand what's actually happening with your body and raise an alert if something significant is going wrong. But one of the challenges in this area is that we don't have enough of these stories yet. So there's a lot of possibilities for really important things that we can do here. But the adoption has not been as fast as I think any of us would've hoped.
Kanchan Shringi 00:24:30 What are the different sensors that are integrated into these devices?
Pete Warden 00:24:35 So often the most important sensors are cameras, small CCD sensors that can be as little as 10 cents each, and microphones, which also can be in the single-digit cents for a low-cost, low-end microphone. And what's interesting about these is you can actually figure out a lot of what's happening in the world around you using vision and using audio. So for example, you might not think of cameras and microphones as weather sensors, but if you are actually able to hear raindrops, or if you're actually able to see rain happening, or snow, or other kinds of things like that, then you are able to use these almost universal sensors to find out all sorts of other things that are happening in the proximity of these embedded devices. Now, there are a lot of other sensors that come in: temperature, vibration, accelerometers, for example. Like, one of my favorite use cases is detecting car crashes with phones.
Pete Warden 00:25:52 So you actually have the accelerometer running all the time, and if it detects a sharp change in acceleration, then that gets flagged by an ML model that's running, and it can't run on the main CPU because it would drain the battery almost immediately. So it has to run on this sensor hub, which is essentially a microcontroller or DSP sitting in the phone that uses very little power. But the interesting thing about that use case is it uses the accelerometer to detect a crash, but the accelerometer readings for a car crash are very similar to dropping your phone. So it actually also uses the microphone to see if there are sounds associated with a car crash, to make sure that it isn't just calling emergency services because you dropped your phone in the bathroom.
Kanchan Shringi 00:26:44 So it’s a combination of sensors?
Pete Warden 00:26:46 Exactly, this idea of sensor fusion, but really being able to use ML models to do this understanding of the raw data. So you've got this stream of audio as 16 kilohertz samples coming in, and if you just stare at the audio waveform, like, traditionally there's nothing you could really do with that. But using ML models you can say, hey, was there a car crash? Were there sounds of a car crash? And get a pretty reliable answer to that. So that's really the big change that's happening here.
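Here is a hypothetical sketch of that fusion logic in Python; the threshold values and the scoring function are made up for illustration only.

```python
import numpy as np

ACCEL_SPIKE_G = 4.0  # illustrative threshold, not a real product value

def accel_spike(accel_g: np.ndarray) -> bool:
    # Cheap check on the always-on sensor hub: did acceleration change sharply?
    return float(np.max(np.abs(accel_g))) > ACCEL_SPIKE_G

def crash_sound_probability(audio: np.ndarray) -> float:
    # Placeholder for an ML model scoring how crash-like the audio sounds.
    return 0.1

def detect_crash(accel_g: np.ndarray, audio: np.ndarray) -> bool:
    if not accel_spike(accel_g):
        return False  # no spike: nothing to check
    # A dropped phone and a crash both spike the accelerometer,
    # so the microphone model disambiguates before calling for help.
    return crash_sound_probability(audio) > 0.5
```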
Kanchan Shringi 00:27:20 You mentioned that there are many applications you can think of for agriculture, for healthcare, but there has not yet been adoption. I was reading about Useful Sensors, your company, and the offline translation tools, and that certainly struck a chord with me because of a recent trip to Italy where we were trying to use translation tools; they're good but pretty hard and clunky. So I was very interested in the translation tool that Useful Sensors has developed. Can you talk more about it?
Pete Warden 00:27:55 Yeah, so one of the things that I found as well, going abroad and even being in the US, which has a very large Spanish-speaking population, is that I'm cut off from a lot of people because I can't speak their language. And I love Google Translate, like it's a technological marvel, but it also hasn't evolved a massive amount over the last sort of decade or so, and it's also a fairly clunky experience. You have to kind of speak into your phone and kind of hand it over, and handing over your phone, it's kind of this weird experience. So really the dream was, can we make something that's a really fluent experience? I like to say talking to a person in a foreign language should be as easy as watching a movie with subtitles in a different language.
Pete Warden 00:28:53 And that's not something that you can do with Google Translate. If you tried to use Google Translate to watch a movie, it would just be extremely tough. And the other side of this as well is you really want something that will work anywhere in the world, and you want that privacy for these conversations. So it was a really good use case for this idea of running machine learning locally. Now, it's actually getting a bit away from the TinyML roots, because I'm not running this on a microcontroller or DSP, but actually running it on a Cortex-A, like the kind of SoC that you'd find in a phone. So it's a lot higher power, and it's like $20 instead of 20 cents. But a lot of the same principles about running everything locally with no network connection still apply, and I won't say that we've reached our goals yet, but it's been really great to be able to put these devices in front of people and have something that's on the table but still allowing people to have a conversation like we are having, without having to pause and feel like you are really in this kind of awkward flow.
Pete Warden 00:30:12 Being able to talk fluently between two people who don’t share a language. I just think that’s really cool.
Kanchan Shringi 00:30:19 You said it's not as low-power hardware as you would like and not as cheap as you would like, but it's still a model that's running on device?
Pete Warden 00:30:30 Exactly. With no network connection.
Kanchan Shringi 00:30:32 And what was the size of the model?
Pete Warden 00:30:35 So that's actually where a lot of the work we've put in is: the speech-to-text side of things. So actually taking in the raw audio and figuring out what people are saying as text strings. And we've open-sourced the models that we've developed as Moonshine, and of those models, the smallest one is about 26 million parameters and the larger one is 59 million parameters. So we quantize everything down to eight bits. So that's like 26 megabytes for the smaller model that does this and then 59 megabytes for the larger model, which is a little bit more accurate.
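Those file sizes follow directly from the parameter counts once each weight is stored as a single byte; here is the back-of-the-envelope arithmetic as a quick sketch (the model labels are just placeholders).

```python
# One byte per weight after 8-bit quantization, versus four bytes at float32.
for label, params in [("smaller model", 26_000_000), ("larger model", 59_000_000)]:
    print(f"{label}: ~{params / 1e6:.0f} MB at int8, ~{params * 4 / 1e6:.0f} MB at float32")
```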
Kanchan Shringi 00:31:23 Now, playing the devil's advocate about the comment about privacy: the processing is on device, but these devices are also more ubiquitous. How do you balance that with the fact that they're not sending the data to the cloud, but of course they could, right?
Pete Warden 00:31:41 Yeah, and this is part of my struggle, is that I am essentially advocating for putting cameras and microphones in everything. And that's a super creepy world if you think about it, which has led me to think very hard about, okay, how can we actually safeguard ourselves in this world? How can we avoid creating this world where we are setting things up so everything can spy on us? And one of the things I really want to see is essentially a nutrition info label for all devices to say what data they're actually gathering. And in the example of the lamp, you could have a little label that's saying, hey, we are listening to audio but we're never recording it, and we don't have the capability to send it up to the cloud. And if then somebody discovered that, oh, actually there was a WiFi chip and it was being used to transmit data, then it's kind of like if you were selling toothpaste that had an ingredients list that just had a list of nice things but you were actually putting sawdust in it and you just hadn't listed it in the ingredients; the companies that were actually doing that malicious behavior could be called out and prosecuted and everything else.
Pete Warden 00:33:06 So I think it's this combination of trying to have labels that tell people what's going to happen to their data, and having systems where the easiest engineering and product solution is to have your data never be transmitted and never be stored and just go away immediately, so that people aren't tempted to cut corners and build these insecure systems.
Kanchan Shringi 00:33:36 Now let’s get into a little bit about what does it take for someone to develop these applications, starting with what are the most popular frameworks and how do they help?
Pete Warden 00:33:48 Yeah, that's a good question. There are a whole bunch of really interesting solutions from chip manufacturers themselves. So if you look at NXP, ST Micro, Arm, all of the major microcontroller and DSP manufacturers have proprietary tools that let you take models from popular frameworks like PyTorch and convert them down into their own proprietary formats. The problem with those is that there are so many different models that you can potentially create in training frameworks, and the conversion tools are only able to handle quite a small subset of those. So it can be hard to actually get all the way through that process. The framework that I put together when I was at Google is called TensorFlow Lite Micro, and it's designed to be a cross-platform solution for running ML models on really tiny machines. And so it does things like, it never uses any memory allocation; everything is kind of pre-allocated.
Pete Warden 00:35:05 So that's often an important requirement for embedded hardware or for embedded system software. And it also doesn't rely on any of the C runtime library. So printf can take up 25 kilobytes of code space, so making sure that we never call things like printf is really important to having something that will fit in kind of 20 kilobytes of program space. The best way to get models into TensorFlow Lite Micro is to export them from TensorFlow. Now, TensorFlow honestly is a bit long in the tooth these days. I think we're coming up to its 10th anniversary, and so I tend to use PyTorch these days for my day-to-day work, but there are still a lot of resources where you can start with TensorFlow models, you can find a lot of the popular models that are in TensorFlow format, and then actually convert them down to TensorFlow Lite Micro and run them there.
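As a sketch of that export path, here is roughly what full-integer quantization with the TensorFlow Lite converter looks like; the model path, input shape, and representative data below are placeholders, and the resulting .tflite file would then typically be embedded as a C array for TensorFlow Lite Micro.

```python
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path

# Request full int8 quantization, which is what most microcontroller targets expect.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

def representative_data():
    # A sample of real inputs so the converter can choose quantization ranges.
    for _ in range(100):
        yield [np.random.rand(1, 49, 40, 1).astype(np.float32)]  # placeholder shape

converter.representative_dataset = representative_data
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```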
Pete Warden 00:36:10 Honestly, what I usually recommend to people getting started is you find some examples that do something similar to what you want to do, like doing audio recognition, dealing with accelerometer data, or doing image recognition, and you take some of those open-source examples and get them working on your device, and then actually try and customize them and retrain the models and change things to work as you want to. I would say these as well are all available for Arduino and for the RP2040 and RP2350 from Raspberry Pi. So you can run them on most of the popular platforms.
Kanchan Shringi 00:36:52 Is Arduino a microcontroller or what would you call it?
Pete Warden 00:36:56 So Arduino is a software framework and a family of microcontroller boards that are designed to be very accessible for everyday users who aren't necessarily embedded systems engineers. So they're one of the best ways I've found for actually getting into this kind of embedded space without losing your mind, honestly.
Kanchan Shringi 00:37:24 So if I wanted to be an embedded systems engineer developing TinyML applications, do I wear two hats, which is training the model with an idea of where it's going to be used and hence the size constraints, and also deploying the model and possibly retraining and running inference? Do I wear both hats, or are there two people in two roles here?
Pete Warden 00:37:52 So it depends, but if there are two people or three people or four people, they really need to be working very, very closely together. So one of my favorite groups at Google was the team that put together MobileNet, which is one of the most popular small models out there for doing image recognition. And what was great about them was they actually had Andrew Howard, who was the researcher and the architect of the model, working in the same group, in the same team, in the same room as the people who were writing the runtime code, like the inner loops of the kind of assembler that was going to execute these models, and working with the people who gathered the data. And they had this small team that was vertically integrated so that the person who's writing the inner loops of the assembler could be talking to Andrew and could give him advice on what operations and attributes the model should have to run optimally on the actual frameworks and the hardware that they're going to be deploying on. And kind of vice versa; you get this really, really productive exchange of ideas and requirements in a way that just isn't possible in any other way.
Kanchan Shringi 00:39:13 But ideally to develop TinyML applications, one needs to learn machine learning and embedded systems engineering. Is that correct or would you put it in a different way?
Pete Warden 00:39:26 So if you put it like that, it sounds like you can't do anything until you've learned both of those. The way I would think about it is you can get started with recipes, code examples, and tutorials that don't expect you to understand much of either embedded systems or ML, and then the way I always learn is by getting something running and then kind of looking under the hood and starting to tinker with stuff and seeing what happens. And I've found that to be a really great way for me to learn, and for a lot of the students that I work with to learn. So don't feel like you have to get all of this knowledge before you start doing this stuff. You can get something running, like a wake word detector or a simple image recognition model, just by following a set of directions, and then you can start to modify things and change things and learn what happens in a very hands-on way.
Pete Warden 00:40:29 And that's one of the things that I like most about this combination in TinyML. When you are doing just machine learning, traditionally that's a very mathematical, almost abstract thing, where you are not actually training full models yourself; you're kind of fine tuning, or you are a long way away from the actual end application. And the same with embedded systems: you might learn a lot about the low-level system, but you are kind of not getting to work on cutting-edge, cool problems very often. Whereas with TinyML, all of this wonderful stuff that's happening in the AI world is super relevant and you get to learn about it, but you are also actually putting stuff out in the world and having people use and play with the results of what you've done, in a way that you can hold in your hand. Which I love.
Kanchan Shringi 00:41:28 And that’s why you brought up Arduino.
Pete Warden 00:41:31 Yes, exactly. Like that’s something that I’ve seen be great for students and high school kids and all of these people who might not get into machine learning as this kind of abstract thing. But if you can build a little robot that you can give voice commands to that is a lot more concrete and tangible and kind of brings in a much wider audience of people and is a way of showing that machine learning doesn’t have to be this off-putting daunting thing to jump into, it can be this really fun playful world that I really honestly believe anybody can actually start to learn about.
Kanchan Shringi 00:42:12 Let's say I follow your practical advice and I get to that level. What should I do beyond that to improve my skills and develop real-world applications? What all do I have to learn in the world of model quantization, transfer learning, knowledge distillation? If you can explain these terms and talk about how I get to that level?
Pete Warden 00:42:31 Yeah, definitely. And as you say, I think the starting point is to pick real-world applications and to pick something you care about. And then I find the best way is to kind of pick up those techniques as you go. So for quantization, it's about those convolution patches I talked about that you run across the image. By default, each one of the values in the convolution patch would be a 32-bit floating-point value, and so would the underlying pixels that it's operating on. Now it turns out, if you imagine that vertical line example that I gave, you probably don't need 32 bits of precision to be able to represent that pattern in a way that's effective. And it turns out that if you encode the values in things like the convolution patch, what we call the weights, as eight bits, you can just use a linear encoding.
Pete Warden 00:43:35 So you say, hey, I'm going to store what the minimum floating-point value of the whole patch was and the maximum floating-point value of the whole patch. And then for each value within the patch, I'm going to find the zero-to-255 coded value that's closest to the floating-point value, if you linearly scale between that minimum and maximum. Then you really don't lose any accuracy in the models, or you lose a very, very tiny amount. And so that helps the models get smaller if you're worried about memory size. But the really fun thing is, if you then take the activation layers, or the underlying image that that convolution is working on, and you convert those 32-bit floating-point values into eight-bit values, then instead of doing 32-bit float by 32-bit float multiply-adds, which is effectively what you're doing when you are putting the convolution patch over each section of the image, you can actually do eight-bit by eight-bit integer math, which is exactly what a lot of digital signal processors have traditionally been designed for.
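Here is a minimal sketch, in NumPy, of the min/max linear encoding Pete describes; real converters use refinements of this (per-channel scales, symmetric ranges), but the idea is the same.

```python
import numpy as np

def quantize_linear(weights: np.ndarray):
    """Encode float32 weights as 0..255 codes plus a minimum value and a scale."""
    lo, hi = float(weights.min()), float(weights.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = np.round((weights - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize_linear(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale + lo

patch = np.random.randn(5, 5).astype(np.float32)   # a toy 5x5 convolution patch
codes, lo, scale = quantize_linear(patch)
restored = dequantize_linear(codes, lo, scale)
print("worst-case error:", float(np.abs(patch - restored).max()))  # tiny relative to the range
```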
Pete Warden 00:44:48 That's an operation you need to do a lot in signal processing, and it's a lot less expensive to implement in silicon, in terms of power and latency, than a full 32-bit floating-point implementation. So quantization helps you both shrink the amount of memory and file size that you need for these models, but it also makes the math a lot easier to implement on these microcontrollers that may well not even have floating-point hardware. So quantization is at the heart of a lot of these techniques that we use to be able to take these large models and actually get them running on this very constrained hardware. And now, I'm glad you mentioned transfer learning as well, because that's one of my favorite hacks in the world of machine learning. And the basic idea is, if you think of an image recognition model, for example, that you've trained to recognize say a thousand different animals.
Pete Warden 00:45:54 So everything from penguins to zebras to aardvarks. That model, like I was saying in the example of cat recognition, is going to figure out a lot of stuff like, hey, I know how to recognize ears, I know how to recognize fur, I know how to recognize eyes and hooves and horns and things like that, just because that selection of a thousand animals has all of those features in them and the model has to recognize different combinations of them to tell which animal it is. Now, if you come along with an animal that's not included in that original thousand animals that you trained on, it's likely that the animal's features have a lot in common with some of the other animals that were in that original training set. So it probably has ears, fur, might have hooves, might have horns. And what that means is you can chop off the very last layer of the model, because if you think about the increasing levels of semantic information that you are operating on as you go through the network, that very last layer is the one that says, oh, have you got two ears, fur, whiskers, and you don't have hooves, then you might be a cat. If you get rid of that and you instead put in a completely untrained layer, but you keep all of the previous layers intact.
Pete Warden 00:47:30 So all of the information about hey, are there hooves, ears, all of this stuff still comes through. But then that last layer, you have maybe a couple of new animals that you throw in the mix. You might have maybe like a monkey and a buffalo that weren’t originally included and you then train on a much smaller amount of data. Like you might have thousands of images of the original animals, but you might only have tens of images of the buffalo and the monkey. It turns out that you can train that final layer very accurately with much less data by using everything that it’s learned from identifying all of these other animals in the big training run. And this is a principle that applies across many, many different machine learning models. So if you have a model that can recognize a particular word that you’ve trained on massive amounts of data on that word, if you want to recognize a different word, then you can often freeze most of the layers, just kind of retrain the final layer and get it to recognize other words with much less work.
Pete Warden 00:48:46 So that’s one of the key techniques of this kind of TinyML world is take a model that works on a similar problem to the one you’re facing and then just gather a small amount of data and try and do transfer learning.
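A minimal sketch of that freeze-and-retrain recipe, assuming PyTorch and using torchvision's ImageNet-trained MobileNetV2 purely as a stand-in for "a model trained on lots of animals"; the two new classes are the hypothetical monkey and buffalo from the example above.

```python
import torch.nn as nn
from torchvision import models

# Start from a model already trained on a large dataset.
model = models.mobilenet_v2(weights="IMAGENET1K_V1")

# Freeze everything so the learned feature detectors (fur, ears, ...) stay intact.
for param in model.parameters():
    param.requires_grad = False

# Swap in an untrained final layer for the new classes; only it will be trained,
# which needs far less data than training the whole network from scratch.
num_new_classes = 2
model.classifier[1] = nn.Linear(model.classifier[1].in_features, num_new_classes)
```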
Kanchan Shringi 00:48:57 So beyond getting the model to the right size and quantizing it, is there something beyond that that is required to have it run on this embedded device and perform inference?
Pete Warden 00:49:12 So as well as the right size, the other thing that you have to think about is latency. Because these devices often don't have, well, they will not have as much compute as an Nvidia GPU. One of the pleasant surprises is that even very cheap devices may be running at hundreds of megahertz and have SIMD instructions on them. So you can actually do hundreds of millions of arithmetic operations per second. So you might be surprised at what kind of latency you can get on these pieces of hardware. And one of the things I've talked about sometimes is this idea of dark compute: that we've ended up with all of these toasters with incredible amounts of arithmetic compute available, but all they're doing is lighting up a couple of LEDs and noticing when somebody presses a button. So one of the things I like about TinyML is that it's a way of actually putting some of that compute that's not being used to use, but you do have to think about, okay, how much latency can I get?
Pete Warden 00:50:21 You know, with an image recognition model, it might be that it takes a second to run or it might be that it takes 10 seconds to run, and you've got to think about your application, about whether that actually works for you. If you want lights to come on when you talk to your lamp, you don't want your lamp to turn on 10 seconds after you said it, because then you might as well just walk over and hit the switch. But there are other applications, like if you're detecting pests in a field, you can take a minute to run that, because you're probably only running that image recognition like once an hour or something, waking up the device and running it, and there's no particular urgency in how fast you need to get the results back. So latency is the other big issue, but what I like is you can often find applications that can tolerate pretty high latency.
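The latency question usually starts as a back-of-the-envelope calculation like the one below; the operation count and clock rate are made-up illustrative numbers, not measurements.

```python
# Rough latency estimate: operations per inference divided by operations per second.
model_macs = 10_000_000        # multiply-accumulates per inference (illustrative)
macs_per_second = 200_000_000  # e.g. a ~200 MHz MCU sustaining roughly one MAC per cycle

latency_ms = model_macs / macs_per_second * 1000
print(f"~{latency_ms:.0f} ms per inference")  # ~50 ms with these numbers
```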
Kanchan Shringi 00:51:19 We’ll start to wrap up now, but before we do that, you did use the acronym SIMD as you were talking. I don’t want to end this section without you just explaining that to our listeners.
Pete Warden 00:51:31 Yes, sorry, I should have explained that, but it’s Single Instruction Multiple Data, and it’s this idea that you can have a single CPU instruction that instead of just operating on a single 32 bit value, say integer value or floating point value, you actually say, hey, I’m going to do this same operation across this set of values like four or eight or 16 values at a time, by running this single instruction. And it’s a really good fit for neural networks. Because if you think about that, coming back to that convolution patch example, you want to run an entire row of that convolution patch against an entire row of pixels. So if you have a SIMD instruction that’s able to kind of do that in a single instruction and execute it in a single cycle, then you can actually get through a lot more operations a lot more quickly than if you are reliant on more traditional computing approaches.
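As a loose analogy for SIMD, here is a NumPy sketch: the vectorized dot product applies the same multiply-accumulate across a whole row of values at once, which is what a SIMD instruction does in hardware in a single instruction; the values below are random placeholders.

```python
import numpy as np

kernel_row = np.random.randint(-128, 128, size=8).astype(np.int32)  # one row of a patch
pixel_row = np.random.randint(0, 256, size=8).astype(np.int32)      # one row of pixels

# Scalar style: one multiply and one add at a time.
acc = 0
for k, p in zip(kernel_row, pixel_row):
    acc += int(k) * int(p)

# Vectorized style: the whole row in a single call, analogous to a SIMD instruction.
acc_vec = int(np.dot(kernel_row, pixel_row))
assert acc == acc_vec
```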
Kanchan Shringi 00:52:46 I do want to start wrapping up now and give our listeners some pointers to things they can use to learn more. I found a blog post, a book that you've written, and also an excellent course at Harvard. Did I cover all of it?
Pete Warden 00:53:04 Yes, you did, though I will send you another link, because I've been teaching at Stanford how to run models on the Raspberry Pi for the last few years, and I'm hoping to put those materials together into something more organized, possibly a small ebook. But I actually have some examples of doing some of the things we've talked about up on GitHub, as the class assignments that I give our students when I'm teaching. So I'll send you a link to those. They're a little less polished than some of the other stuff up there, but I'm looking to get feedback, so I'd love to hear what people think.
Kanchan Shringi 00:53:42 We'll certainly put all of this in the show notes. Pete, is there anything today that we missed that you'd like to cover?
Pete Warden 00:53:49 No, I think this was really fun, really interesting questions and I would say I always love hearing from people as they’re diving into this world. So feel free to get in touch if you have further questions.
Kanchan Shringi 00:54:01 How can people get in touch with you?
Pete Warden 00:54:03 So my email is [email protected]. You can find my blog at petewarden.com where you can find all the details on how to get hold of me. So yeah, love to hear from you.
Kanchan Shringi 00:54:14 Thank you we’ll put all that in the show notes as well. And thank you so much for joining us today.
Pete Warden 00:54:20 Thank you too.
Kanchan Shringi 00:54:21 Thanks everyone for listening.
[End of Audio]