Stefan Tilkov talks to Camille Fournier about the challenges developers face when building distributed systems. Topics include the definition of a distributed system, whether developers can avoid building them at all, and what changes occur once they choose to. They also talk about the role distributed consensus tools such as Apache Zookeeper play, and whether and when they are needed, and give some examples from Fournier’s experience building real-world systems at Goldman Sachs and Rent the Runway.
- Camille Fournier’s website
- SE Radio Episode on consensus in distributed systems with Kyle Kingsbury
- SE Radio episode on distributed coordination with Apache ZooKeeper with Flavio Junqueira
- SE Radio episode on the CAP theorem with Eric Brewer
- SE Radio episode with Leslie Lamport on distributed systems
Transcript brought to you by innoQ
This is Software Engineering Radio, the podcast for professional developers, on the web at SE-Radio.net. SE-Radio brings you relevant and detailed discussions of software engineering topics at least once a month. SE-Radio is brought to you by IEEE Software Magazine, online at computer.org/software.
* * *
Stefan Tilkov: Welcome, listeners, to another episode of Software Engineering Radio. Today I’m talking to Camille Fournier, who until recently was CTO at Rent The Runway, where she led a team of over 60 engineers who were responsible for both the customer-facing technology and the custom logistics system.
Prior to that, Camille served as a software engineer at Microsoft, and most recently as a vice-president at Goldman Sachs, where she dealt with risk technology and technology infrastructure.
Camille is very well known within the tech community, speaking on a variety of topics such as distributed systems and technology leadership. Camille received a bachelor’s degree in Computer Science from Carnegie Mellon University and an MS in Computer Science from the University of Wisconsin. She’s currently a member of the Apache Foundation and lives in New York City with her husband and son. Camille, welcome to the show!
Camille Fournier: [00:02:11.07] Thank you for having me.
Stefan Tilkov: [00:02:13.16] Our topic today is real-world distributed systems. Could you start us off by briefly describing what a distributed system is?
Camille Fournier: [00:02:22.14] My favorite definition of a distributed system comes from Leslie Lamport. He was one of the famous early researchers in the field. He describes a distributed system as one in which the failure of a computer you didn’t know existed can render your own computer unusable. I love that definition, because people can get very hung up on distributed systems being about “There’s a network in the flow. Oh, well if you are talking over the network to a database, you are in a distributed system”, for example. Even when you are in an old-school, three-tier architecture land where you are building a view layer and an application layer and a database layer – if those went over the network to talk to each other, even if there were only two or three of them, that technically has some of the complexities of a distributed system, because you are going over the network, and there are more failure domains.
[00:03:37.13] The reason that I like the Lamport description where it’s “the failure of a computer you didn’t know existed”, is that it gets into the heart of when distributed systems become more than just another little thing that can fail, and something where you have so much more complexity than it’s possible for you to easily reason about. That’s that tipping point, where you don’t know things exactly. You’re calling systems that are calling other systems that you don’t even know about. Some downstream system’s failure can cause you to be failing, and you didn’t even know that that downstream system maybe even existed.
Stefan Tilkov: [00:04:15.15] Is it a fair assessment to say that we are all building distributed systems all the time? Because by that definition there isn’t really anything left that doesn’t talk to something else that you don’t know.
Camille Fournier: [00:04:28.05] Yes and no. For this book-writing thing that I did in November I used a piece of software called Scrivener. It’s a text editor made for people writing long works; if you’re writing books or screenplays, it’s made for that. If you’re writing a Scrivener-type program, you’re probably not too worried about network stuff; you’re worried about writing to disk. As long as the operating system that you’re writing this thing for is reliable, then your program is going to install on this person’s computer and work.
There’s less and less software like that these days, because we’re all building these cloud-aware systems, but I think there is something to be said about, “Don’t worry about problems you don’t have yet.” Even though you can be building even a networked piece of software that will have some complexity, that doesn’t mean that you have to engage with the entire world of distributed systems complexity to build that software as well as it needs to be built and to make it useable.
Stefan Tilkov: [00:05:47.22] When do people have to worry about this thing? What is the tipping point, the threshold that you have to cross when the simple thing where you don’t worry that much about parts that are being distributed turns into something where it really becomes a major issue and you need to be aware of it?
Camille Fournier: [00:06:03.14] That’s a good question. The tipping point tends to come when you really have many different systems that you, your team or your software is run across. Anyone who is working in a services architecture, especially a microservices-type architecture, you’re starting to get some of that complexity. It doesn’t mean that you need to be embracing all of it, but you’re starting to get some of it. Versus, for example, if you’re building code that runs in a web server and you happen to run that web server horizontally-scaled so that you can have more people accessing it, but you’re not sharding your database, you’re not using multiple different data stores, the distributed systems problems that you need to be engaging with in that horizontally-scaled but not that complex web server – they’re just a different scale of distributed systems complexity.
[00:07:10.01] It will come in a little bit depending on what you’re doing. If you’re working with a database over the network, you need to understand the way that failures are communicated to your client. If you do a request to that database and the network drops out, how do you figure out whether the requests succeeded or failed? There’s a little bit of a distributed systems complexity there, even if you’re just doing this simple, three-tier architecture model. But you probably don’t need to worry about really detailed transaction retries, storing extra state around multiple different versions of the same data because it’s spread across multiple different machines that may or may not be in sync and may or may not be able to communicate with each other.
[00:08:02.06] Distributed systems are as much an engineering problem as they are a theory problem. That means that you’re going to apply as much theory as you need to solve the engineering problem, but you don’t want to pile up more.
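One common way to handle the "did my request succeed or fail?" question raised above is to make the request safe to retry. The following is a minimal sketch, not something described in the interview: `OrderStore`, `FlakyNetwork` and `place_order_with_retry` are hypothetical names, and the in-memory dictionary merely stands in for a real database. The client attaches a unique key to each logical request, so a retry after a lost response does not create a second order.

```python
import uuid


class OrderStore:
    """Hypothetical in-memory stand-in for a transactional database."""

    def __init__(self):
        self.orders = {}  # idempotency key -> stored order

    def place_order(self, key, item):
        # A retry carrying the same key gets the original order back
        # instead of creating a duplicate.
        if key not in self.orders:
            self.orders[key] = {"id": len(self.orders) + 1, "item": item}
        return self.orders[key]


class FlakyNetwork:
    """Simulates a network that loses the response to the first call,
    after the server has already applied the write."""

    def __init__(self, store):
        self.store = store
        self.calls = 0

    def place_order(self, key, item):
        self.calls += 1
        result = self.store.place_order(key, item)
        if self.calls == 1:
            raise TimeoutError("response lost")
        return result


def place_order_with_retry(client, item, attempts=3):
    # One key per logical order, reused across every retry.
    key = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return client.place_order(key, item)
        except TimeoutError:
            continue  # outcome unknown, so ask again with the same key
    raise RuntimeError("order status unknown after retries")
```

In this sketch the first call times out even though the write was applied, and the second call returns the already-stored order, so the store ends up with exactly one order.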
Stefan Tilkov: [00:08:17.02] Some people say that you have a distributed system as soon as a web browser and there is a [unintelligible 00:08:21.25], because the browser, as a client, accesses a server – voila, a distributed system. Would you agree with that, or is that an example of something where you’d say that it is technically, but not practically relevant?
Camille Fournier: [00:08:37.00] What I would ask is, “What’s the point?” What are we getting from making that statement? What are you trying to teach developers about when you make that statement? What are you trying to get them to be aware of? What kinds of things do you think they should be worrying about that they are not worrying about normally? Because that’s a more useful thing than saying, “It’s a distributed system.” Okay, it’s a distributed system – so what?
If it’s a distributed system where you are never caching state that you’re trying to modify within the browser, then does it matter that it’s a distributed system? Or do you just need to think about, “Oh, well don’t try to cache a bunch of state, do a bunch of modifications on that state in the browser, and then talk to the web server, because you could fail, you could lose all that state that you cached because the customer closed it…
[00:09:37.16] I make a point in one of my talks – people underestimate humans’ ability to have an intuitive understanding of systems that they’re very deeply familiar with. That includes both developers and users of certain types of network software. We all understand that if we are typing something in our browser window and the browser window crashes, it’s likely that we will lose that data. We understand that just because it’s happened to us a bunch of times probably. We may not understand the reasoning behind it, but we know that that will happen, because it’s happened to us in the past, so we know that it’s a risk. So as users, we know that it’s a risk and we’ll get really annoyed when that happens. As web developers, we also know that it’s a risk and we know that if we’re doing something where losing that state halfway through matters, we’re probably going to be more actively sending it over to the server, versus “Whatever, I’m just building some little thing. If the browser crashes, the user’s got bigger problems; I don’t care. They’ll come back to me or they won’t – whatever.”
[00:10:50.15] You can’t underestimate people’s intuitive understandings of the problems and the spaces that they work in every day. Sometimes we get really hung up on the jargon of distributed systems – maybe because it sounds cool, or it’s hard, or it’s complex – without thinking about, “Okay, but how is this useful to the people who are working every day, trying to make good systems that the customers are going to want to use”, that are going to mostly do the right thing, within reasonable failure domains.
Stefan Tilkov: [00:11:27.25] Let’s talk about situations where we actually have crossed that threshold, where we are building something where the fact that it’s a distributed system becomes important. What are some of the challenges that people can run into, and how would you address them?
Camille Fournier: [00:11:46.25] The challenges that people tend to run into, especially when they get to these really large scaling systems… Let’s back off and say, “Why do people build these distributed systems?” When you think of the classic distributed system, you’re probably thinking of the Google-scale systems that they’ve built, and they’ve written a bunch of different papers about what they do. Or maybe you’re thinking about Facebook-scale systems, or Twitter. Netflix is a great example. You imagine these clusters, data centers full of computers running bunches of different code that’s all working together to create some kind of experience.
[00:12:31.09] One point that I like to make is when you think about why these companies built these systems, why they went to distributed systems, they went for two reasons. They went because they needed to scale, because they had problems that would not fit easily on a single machine; they had more data, they had more volume of users, they had more computation etc. and/or they also needed failure tolerance; they wanted to have a really highly available 24/7 system where if something failed, the system would survive a certain amount of failure.
[00:13:08.11] The challenges of scaling and failure tolerance are making each other harder. As you scale, you put more machines out there, you put more network out there, you have more moving parts, you have more complexity, and you have more places for failure to happen. There are countless post-mortems out there – from big distributed systems vendors, from Amazon, from Netflix, from whomever – where they talk about, “Oh yeah, so this other system that we built, that this third system relies on, it had some disk issues, so it’s sort of degrading its performance, and then the other system kept trying to hammer it, and it didn’t back off, and it caused this other system to start having problems, and thus we had downtime, basically.”
[00:13:58.28] You see these really interconnected, complex parts working together, and they can fail. Each one has their own independent failure domains and it’s really, really hard to write a system where you think about all of the possible failures that can happen to you, that can happen to tertiary systems, and account for all of them. Nobody does that perfectly. Even the best developers, the best companies, nobody does that perfectly.
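The post-mortem pattern described above, where a caller keeps hammering a degraded dependency without backing off, is often mitigated with exponential backoff plus random jitter. This is a minimal sketch under stated assumptions: `call_with_backoff` and its parameters are hypothetical names, and `ConnectionError` stands in for whatever failure a real client would see.

```python
import random
import time


def call_with_backoff(operation, sleep=time.sleep, base=0.1, cap=5.0, retries=5):
    """Call `operation`, retrying on ConnectionError.

    The wait before retry n is drawn uniformly from
    [0, min(cap, base * 2**n)], so callers spread out instead of
    retrying in lockstep against a struggling dependency.
    """
    for attempt in range(retries + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == retries:
                raise  # give up and surface the failure to the caller
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

Passing `sleep` as a parameter is only a design convenience here: it lets a test record the delays instead of actually waiting.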
You’ve got this massive scale, which makes failure harder. Okay, so how do we deal with failure? The ways that we deal with failure tend to make scaling harder. We put a bunch of redundancy and we write data in multiple places. All of a sudden, you’re doing more work. You’ve added more overhead, to make sure that when you do fail, you don’t lose data, or you lose less; you have some redundant work going on, so it’s less likely that you’re going to fail in all of these different ways. Okay, now you’ve made it so that your scaling is less efficient. You have to have more machines to scale the same amount.
[00:15:06.13] These two elements are the fundamental goals of building these big distributed systems, but they’re also fundamentally in conflict. What you see is that people usually design their distributed system by balancing those two things. They decide which of them they care about more to some extent. It’s almost like a slight rephrasing of the CAP theorem, one of those famous pieces of distributed systems research, which says that if you have a distributed system, you really can’t get rid of the P (partition tolerance): you have to be able to withstand some kind of network partition, or you’re not a distributed system.
[00:15:45.09] So, ignoring the P, you have C – consistency, meaning “Can you lose data? Can you have people read data at the same time and see different results? Where does that fall out?” And Availability – how many degradations/failures can you tolerate and still serve responses to clients? Most people are twiddling between those two axes, to say, “Okay, well we can show some stale data, or even potentially lose data. It’s better to do that than to not have something that responds to our clients.” Or, “You know what? If we lose the majority of nodes, we’d rather not be available at all, because that would reduce our ability to be consistent in the way that our clients are expecting.” You’re balancing between those two things as you decide how to structure and architect a distributed system that you’re building.
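The choice between those two axes can be illustrated with a toy decision function. This is a sketch only: `has_quorum`, `handle_write` and the `prefer` flag are hypothetical names, and a real system would make this decision inside its replication protocol rather than in a single function.

```python
def has_quorum(reachable_nodes, cluster_size):
    """True if a strict majority of the cluster is reachable."""
    return reachable_nodes > cluster_size // 2


def handle_write(reachable_nodes, cluster_size, prefer="consistency"):
    """Decide what to do with a write during a partial failure."""
    if has_quorum(reachable_nodes, cluster_size):
        return "committed"
    if prefer == "consistency":
        # Refuse to answer rather than risk divergent state.
        return "rejected"
    # Stay available, accepting that this write may be stale or lost later.
    return "accepted-locally"
```

With five nodes and only two reachable, the consistency-leaning configuration rejects the write, while the availability-leaning one accepts it locally and takes on the risk.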
Stefan Tilkov: [00:16:46.12] What seems to be implicit in what you’ve just said is also a tradeoff or an axis from simplicity to complexity. Things become much more complex, so you add a lot of possible things that can fail; that makes a lot of sense. You said at the beginning that this is what Google, Facebook and Netflix face. Do normal people face it as well, or is it only if you’re building global scale systems?
Camille Fournier: [00:17:18.10] Normal people certainly face it. Normal people face it when they build out services architectures, whether or not you call them microservices. Services architectures are a very sane way to think about software at a certain scale. That scale does not have to be Netflix scale. That scale does not have to be massive scale. It should probably be more than proof-of-concept scale… It’s not clear to me that you should start writing your software in a services architecture, but once you’ve proven out and gotten into the situation where you have 30, 40, 50 developers, there can be a lot of value in thinking about the entities of your system that can operate independently, and pulling those into services and writing a system that joins those entities as necessary, and allows independent development and independent ownership of data, instead of a big monolith.
[00:18:31.17] There’s nothing intrinsically wrong with monoliths, but the way that modern software developers tend to build software and the kinds of problems that we’re solving and the increasing ease of creating a distributed system… One thing that I want to make a point on is I think the reason that you see so many distributed systems today is not just because people have the need for more scale and for more failure tolerance, it’s actually that it’s just easier to do it because we have the cloud. It’s easier to get access to hardware, to network-attached servers and network-attached disk and the network.
There are more systems out there that are distributed systems themselves for data storage and data processing. Everybody is already starting in a way where it’s very easy to build a distributed system, at least from the hardware accessibility perspective, which was not the case ten years ago. Ten years ago, pre-cloud, it was very hard to get your hands on the kind of software, the kind of networking and frankly, distributed data storage types of systems that would make building an application distributed system (microservices-type architecture) fairly straightforward.
[00:19:55.00] You would see these kinds of systems in large companies. I worked for Goldman Sachs, and we definitely had service-oriented architectures all over that company, but they also had their own data centers. They had a lot of hardware, they had 10,000 developers, so there was a lot of independent software being written. But at a smaller company, even with hundreds of developers, you may not have done that because the accessibility and the speed at which you could spin up a new server was not quite there before the cloud all of a sudden made that much easier.
Stefan Tilkov: [00:20:27.16] I totally agree with that. Let me play devil’s advocate for a second. Let’s say that I claim that the problem with that is that it’s become too easy to build those things, and people build those things without knowing what they’re getting themselves into. All those problems that they have because they now have a really strongly distributed system are problems that they wouldn’t have if they just stuck to their boring, old database architecture. Maybe we should all stop pretending we’re Google and just build simple stuff like we did 15 years ago. For my reputation, I have to say I do not actually believe that, but let’s say that’s a very valid argument… What would you say to that?
Camille Fournier: [00:21:16.00] I’m more sympathetic than you might expect to that argument. My personal take on all things is don’t buy complexity you don’t need. I am a fan of if your data will be reasonably usable in a relational database and you’re doing transactions, you should probably use a relational database. If you really care about transactional, [unintelligible 00:21:44.25] you don’t want to lose data. I realize that even the relational MySQLs of the world have failure conditions where they can do unexpected things, but they are so much rarer and so much more well-understood by the operators out there in the world, and frankly, by developers and other people, than the MongoDB, Cassandra, yadda-yadda worlds, that I am actually very much like… We ran MySQL at Rent the Runway as the transactional database because that was what was there when I joined, but it’s a perfectly fine transactional database for what we were doing. Maybe we would have used Postgres if we were building it today. I certainly would not have used a NoSQL or lighter weight distributed system to build that because it’s not worth reasoning about the edge cases. When you’re talking about something where you’re taking money from people, where you’re taking orders, you want to make sure you’re going to be able to fulfill those orders.
[00:22:56.14] I am very sympathetic in being conservative where it matters. The balance that you want to make with conservatism is the following: if it’s truly faster for your developers to build things in a distributed system way, or if it’s truly faster for them to use certain types of systems for the type of data that they’re working, the type of volume that they’re working with, the type of application that they are building, then they should probably use a distributed system, even if it has edge cases that they haven’t fully considered. I think that the biggest cost to your business is developer time. It is such a huge expense… And I think it goes both ways. Sometimes we overbuild – we build too complex systems, because we want to keep developers entertained; that’s another great reason to build a distributed system, cool points or developer entertainment. But if you’re building a system that you genuinely believe will help your development team be more productive, scale better, be able to get more business features out there…
[00:24:11.23] And yes, okay, you have an outage that makes you realize, “Oh crap, the way that we’ve been writing this data into this table is not good and we need to fix that”, I think if you take a step back and you quantify the potential for danger there, it’s probably less than people think. You have outages of all sorts all the time, when you’re running a business. You have outages related to hurricane Sandy flooding half of New York and cutting off your access to certain servers. You have outages because somebody at one of those big companies like Amazon screws up their distributed system, and you didn’t really know you were relying on that. You have outages because hackers attack you, you have outages because somebody fat-fingers something. You’re going to have failure and outages. You are not going to be able to think through every single possible edge case, and if you are deeply uncomfortable working in a world where you can have failure, then certainly startup land is probably not for you.
[00:25:29.12] There are places where you don’t have outages. You’re building flight-control software maybe, you’re building software to land a person on Mars, you’re building certain kinds of financial systems software. But for most people working in small to mid-sized companies, you are always going to be balancing the cost of building something, the cost of maintaining it over time, and the risk of failure. That is the balance that you do all the time. You have to make that conscious calculation as often as possible; that includes being boring, sometimes, and that includes taking something that may be a little bit flaky, but that’s really easy and intuitive for your engineers to use and get started on and to run, like MongoDB, for example. MongoDB is incredibly easy for people to get started on and use.
That’s actually a brilliant aspect of that software, that they realized that the startup costs for so many of the distributed storage systems at the time were really high. I’m not even sure that they built it that way because they were trying to compete with other people, but they felt like the usability element of the distributed system was a more important problem to tackle first than the correctness. You may say that that’s horrible, and they are horrible people and how dare they, they tricked the industry, and I’m not going to argue either way on that, but I will say that’s an interesting choice that they have made. People adopted it not just because they were idiots, but people adopted it because it was easy for their teams to use it and to get going with it, and they got value out of it.
[00:27:19.20] You’re making that tradeoff both when you go conservative and when you go experimental. As long as you’re constantly thinking about “Where am I willing to take risks and where is it better for me to say, let’s just favor developer productivity and do this an easy way? Or let’s favor future scaling of developer productivity and put a little elbow grease into it now that will pay off when we have 50 developers or 500 developers.” These things are really hard, but these are the kinds of tradeoffs that you should be thinking about.
Stefan Tilkov: [00:27:56.03] A lot of things that you mentioned are related to areas where there is not that much complexity because the system is essentially stateless. You mentioned the web servers just scaling horizontally, and that was the implicit assumption that these were stateless things. How do things change if we’re talking about the stateful part, if we make that distributed?
Camille Fournier: [00:28:19.12] State is where everything gets complicated, certainly. When it comes to state, you want to be thoughtful about the level of consistency that you need for various parts of your data. Understand where you are comfortable having inconsistent state, and where you absolutely need that consistency.
Here’s a great example from my past job. We relied on MySQL to be the transactional part of our data, so if you were making a reservation, that would go to MySQL. We did that because I really didn’t want to lose that state. I wanted for that state to be kept in a place that was very core to the business. If the MySQL database went down, the business would not be able to function. There are core parts of every business, so you’re going to have some kind of system that’s storing some of the data of the core parts of your business where if it goes away for whatever reason, the whole business, or large parts of the business are not going to be able to work.
[00:29:50.07] So when you’re building a distributed system and you’re dealing with parts that you really care about consistent state, you definitely get a lot of complexity there. I’m a fan of use a relational database if you can for that part of your state, because those systems tend to be more well understood. You may still have to do things like shard it widely, which makes an incredibly operationally complex system, and there are many distributed systems aspects there. All of a sudden, you’re writing data to different places if you want to try to rebalance that data because one shard is getting full, how do you think about managing/rebalancing without taking the whole system down – again, balancing that scaling vs. failure tolerance that we’ve talked about earlier.
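To see why rebalancing a sharded database is operationally hard, consider the simplest possible routing scheme, where a key's shard is its hash modulo the number of shards. This is an illustrative sketch, not a description of how any particular database shards data; `shard_for` and `moved_keys` are hypothetical helpers.

```python
import hashlib


def shard_for(key, shard_count):
    """Route a key to a shard using a stable hash modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_keys(keys, old_count, new_count):
    """Return the keys whose shard changes when the shard count changes."""
    return [k for k in keys
            if shard_for(k, old_count) != shard_for(k, new_count)]


keys = [f"order-{i}" for i in range(1000)]
moved = moved_keys(keys, old_count=4, new_count=5)
print(f"{len(moved)} of {len(keys)} keys change shards")
```

With uniformly distributed hashes, going from four to five shards under this scheme relocates roughly four keys in five, which is why real systems tend to use schemes that move less data, such as consistent hashing or explicit range assignment, and why moving that data while staying online takes care.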
[00:30:41.11] One of the important things to think about as you’re building out a distributed system is where do you care about coherent, consistent state? You don’t lose data, you don’t lose writes, you don’t have people looking at the database at the same time and seeing different state. You want to be able to lock pieces of data so that multiple people can’t update them at once.
On the other hand you’re going to have what I think of as cache state. My personal feeling is if it’s okay if people see slightly stale data, it’s okay for it to be in a cache. You want to think very carefully about where is it okay for people to see slightly stale data, where is it okay for you to lose data because you can easily reconstruct that data, and you can serve those kinds of pieces of data out of systems like MongoDB or Redis.
For example, at Rent the Runway we used MongoDB for data that was more document-like, where that made sense (our product data). We have a lot of products that update, and they have a lot of metadata around them. That was all stored in MongoDB, because the process of creating that product – if we lost a detail, we could easily reconstruct it. It wasn’t that big a deal if it was slightly inconsistent for a little while. If you read something, and then you read it again and you got a different version of it because it had been updated – whatever. It just wasn’t that big of a deal. We were caching it in a whole bunch of places anyway. It’s not like placing an order, where if you hit Place Order and you place four orders, or you don’t place an order at all but you still think you did – that’s a really terrible customer experience.
[00:32:55.27] It’s about understanding where you can get away with thinking less about consistency but where certain kinds of scaling might make your life easier. Caching is one of these layers where even if you were building the horizontally-scaled web server – and okay, maybe it doesn’t have a lot of the challenges of a distributed system… You probably had memcache in there, and you probably thought about what you put in memcache; you thought about “Okay, what data matters if it changes, and I need to know that change immediately and evict it from the cache, and what data doesn’t matter? What data can I put on a CDN (content delivery network)?”
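The two caching policies mentioned here, "let it be stale for a while" and "evict it the moment it changes", can be sketched in a few lines. This is a hypothetical in-process cache, not memcache's API: `TTLCache`, `get` and `evict` are illustrative names, and the injectable `clock` exists only so the behavior can be exercised without waiting.

```python
import time


class TTLCache:
    """Tiny cache: entries may be served stale until their TTL expires,
    unless the writer explicitly evicts them first."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # key -> (value, expires_at)

    def get(self, key, load):
        entry = self.entries.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]  # possibly stale, but within the agreed window
        value = load(key)    # miss or expired: go back to the source
        self.entries[key] = (value, now + self.ttl)
        return value

    def evict(self, key):
        # For data where a change must be visible immediately.
        self.entries.pop(key, None)
```

Product descriptions might simply ride out the TTL, while anything where a stale read hurts would call `evict` on every write.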
[00:33:41.27] If I’m serving websites, some of my data is going to be cached at the edge, so it’s not even coming to my website; my images, things like that. Again, that’s data where it’s going to be stale. It might change and be stale for a little while, and that can be okay. One of the major considerations as you find yourself building a distributed system – because you will – is actually an extension of the nuance of the things that you had to understand to even build a working non-distributed system, which is what data has to be fresh, what data has to be locked, what data can only have a single writer and the readers must always see the updated versions of it, what data could be slightly stale, what data could be really stale and we don’t care? What’s the data that if we lose a little bit of it, it’s a bummer, but it’s not… Like, if I lose a little bit of the clickstream data on my website because I screw up the analytics processing, that sucks; I’ve lost valuable information, but I haven’t lost any orders. I’ve just lost metadata.
[00:34:53.00] Those are some of the complexities that developers do need to understand. You need to have a basic understanding of when do you care about losing data, and go from there. Because if you don’t have that understanding, then you cannot easily reason about a distributed system. If you say, “Well, I never want to lose data”, that’s not true. I don’t believe that’s true. Most people, in most environments, they can say “Okay, well if I lost this click data, if I lost the fact that somebody fav-ed this tweet”, that’s not good. I’m sure Twitter cares very much about not losing that, but it’s not like I lost somebody who gave me money placing an order. We have different thresholds of loss that we are willing to tolerate. I think asking that question, and then understanding the implications of “If I use this type of system, where CAN I lose data?”, because that’s the flipside of it. You will be able to lose data; different types of systems will have different ways in which you can either lose or read stale data, or see old results, and that’s a big part of the balance that you’re making as you build yourself into a distributed system.
Stefan Tilkov: [00:36:20.05] A lot of the examples that you’ve mentioned are considered application data, like the order, or maybe the product data. Another part of the data that a distributed system has to deal with is the actual configuration data, and the information about the runtime side of things. How do you deal with that?
Camille Fournier: [00:36:40.08] That is one of the places where you probably want it to be fairly consistent. I spend a lot of time talking about Apache’s Zookeeper, because I’m a member of that project. That was one of the original use cases for what I call the consensus systems, which are systems like Zookeeper; Chubby is the system inside of Google that does this, and that’s built on Paxos… Systems that are built where every update goes through a consensus algorithm. Paxos is the classic one; ZAB (Zookeeper Atomic Broadcast) is the one inside of Zookeeper; there’s also Raft, which is a new one that was developed in the past few years, and there’s a bunch of systems like Etcd (which now relies on Raft).
[00:37:34.21] All of these systems do key/value stores, but key/values stores of not very much data. The expectation is you’re not storing a ton of data inside of these systems, but you’re using them to make sure that if you write something and the system says “I got it”, everybody who can access the system sees that update. These are really useful, especially for service orchestration type work and large scale cluster management; understanding who is around and available, and giving those people information; distributed locking is another huge use case for these types of systems. If you need to elect a leader in your distributed system which is going to be managing all the other processes of the distributed system in some way, you don’t want to have a system that can give you two different leaders. If you have two leaders, bad things happen.
[00:38:42.07] You use a consensus algorithm, and you may just use an external consensus system like Zookeeper to find the leader and store the results of who is the leader, so that you never have that situation where two people think they have what is essentially a distributed lock.
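The "never two leaders" property can be sketched in a few lines of Python. This is a single-process stand-in, not Zookeeper's or Etcd's API: `TinyCoordinator`, `try_acquire`, and `expire` are hypothetical names, and an ordinary mutex plays the role that a replicated consensus algorithm plays in a real system. The point it shows is only that leadership is decided by one atomic "claim if unclaimed" operation, so racing candidates cannot both win.

```python
import threading


class TinyCoordinator:
    """In-memory stand-in for a consensus store (hypothetical, not a real API)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._leader = None  # session id currently holding the leader key

    def try_acquire(self, session_id):
        """Atomically claim leadership; only the first claimant succeeds."""
        with self._mutex:
            if self._leader is None:
                self._leader = session_id
            return self._leader == session_id

    def expire(self, session_id):
        """Simulate a session dying: the leader key disappears with it."""
        with self._mutex:
            if self._leader == session_id:
                self._leader = None

    def current_leader(self):
        with self._mutex:
            return self._leader


def run_election(coordinator, candidates):
    """Let every candidate race for leadership; return the ones that won."""
    winners = []
    winners_lock = threading.Lock()

    def campaign(name):
        if coordinator.try_acquire(name):
            with winners_lock:
                winners.append(name)

    threads = [threading.Thread(target=campaign, args=(c,)) for c in candidates]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


coordinator = TinyCoordinator()
winners = run_election(coordinator, ["node-a", "node-b", "node-c"])
```

Whichever thread gets there first, `winners` holds exactly one name. Once that session expires, the key is free and another candidate can claim it.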
Similarly, you do want to have the configuration of your global environment, understanding which servers are running and taking which data… Again, it depends on how stateless the servers might be. There are systems out there that use gossip protocols to identify servers that are running, which is slightly different than using a consensus algorithm and having to configure and know where the cluster of Zookeeper or Etcd is running, for example. Instead of doing that, you use UDP and these gossip protocols to find other nodes on the network and get information through them.
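As a rough picture of how gossip-style membership spreads, here is a small in-memory simulation in Python. It makes several assumptions: there is no UDP or real networking, the `gossip_round` and `converge` functions are hypothetical, and each node simply merges its membership view with one randomly chosen peer per round. Real gossip protocols add failure detection, versioning, and much more.

```python
import random


def gossip_round(views, rng):
    """Each node merges its membership view with one randomly chosen known peer."""
    for node in sorted(views):
        peers = sorted(views[node] - {node})
        if not peers:
            continue
        peer = rng.choice(peers)
        merged = views[node] | views[peer]
        views[node] = set(merged)
        views[peer] = set(merged)


def converge(views, seed=0, max_rounds=50):
    """Run rounds until every node knows every other node; return rounds used."""
    rng = random.Random(seed)
    everyone = set(views)
    for round_number in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(view == everyone for view in views.values()):
            return round_number
    return None  # did not converge within max_rounds


# Each node starts out knowing only itself and one seed neighbour (a ring).
names = [f"node-{i}" for i in range(8)]
views = {
    name: {name, names[(i + 1) % len(names)]}
    for i, name in enumerate(names)
}
rounds = converge(views)
```

No node is configured with the address of a central cluster. Each one starts from a single known neighbour and learns the rest through exchanges, which is the contrast with a consensus service that the passage above draws.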
[00:39:56.08] There are other ways to solve these problems – I’m honestly not nearly an expert in that – but I do think that when it comes to the operational complexity of running a distributed system, that’s where people can get a little fast and loose with, “Well, I don’t really care that much about the accuracy of my distributed lock. I’m just going to use the database to write a distributed lock”, and that’s actually not a really awesome way to do distributed locking. Even though databases are great for transactional actions, that’s a little bit different than what you need for distributed locking. In particular, you probably don’t want to have to poll a database to see if a lock state has changed, and these consensus systems all were built with the ability to notify you of state changes in the systems, which is really useful in cases where you want to immediately elect a new leader if one failed, so you don’t have to do manual failover.
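The difference between polling and being notified can be sketched as follows. This is a toy Python model, not the watch API of Zookeeper or Etcd: `WatchableStore`, `Standby`, and the `"leader"` key are hypothetical, and callbacks here stay registered permanently, which real systems may handle differently.

```python
class WatchableStore:
    """Toy key/value store that notifies watchers on change (hypothetical API)."""

    def __init__(self):
        self._data = {}
        self._watchers = {}

    def put(self, key, value):
        self._data[key] = value
        self._notify(key, value)

    def delete(self, key):
        if key in self._data:
            del self._data[key]
            self._notify(key, None)

    def get(self, key):
        return self._data.get(key)

    def watch(self, key, callback):
        """Register callback(key, new_value); new_value is None on delete."""
        self._watchers.setdefault(key, []).append(callback)

    def _notify(self, key, value):
        for callback in list(self._watchers.get(key, [])):
            callback(key, value)


class Standby:
    """A standby that takes over as soon as the leader key is deleted."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        store.watch("leader", self.on_change)

    def on_change(self, key, value):
        # Claim leadership only if the key is really gone; another standby
        # may already have reacted to the same notification.
        if value is None and self.store.get(key) is None:
            self.store.put(key, self.name)


store = WatchableStore()
store.put("leader", "node-a")
standby_b = Standby("node-b", store)
standby_c = Standby("node-c", store)
store.delete("leader")  # node-a fails; watchers fire without any polling loop
```

Neither standby loops asking "has the lock changed yet?". The store calls them when the leader key disappears, and the check inside `on_change` keeps the second standby from overwriting the first.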
Stefan Tilkov: [00:41:04.24] Would you say that every time you build a distributed system you should use one of those consensus-driven systems?
Camille Fournier: [00:41:11.02] No. First of all, if you’re building a distributed database, you may very well just be needing to implement your own Paxos or your own consensus state machine internally to do that kind of management. Or if you’re trying to sell someone a product, you probably don’t want to make them spin up a Zookeeper or an Etcd and then spin up your system, unless it’s part of a larger ecosystem.
Stefan Tilkov: [00:41:41.12] Don’t some of those tools do that anyway? Kafka, for example…
Camille Fournier: [00:41:46.23] Totally, yes. I would say that they do, because they were built within environments where there were already a lot of people using Zookeeper. You can have a Zookeeper that supports a whole bunch of Hadoop-family products, and a single Zookeeper can be used to support that whole family of things, because they’re not writing that much data into them.
Because these systems were built in environments where they already had operators who knew how to operate Zookeeper, it was easier for them to build assuming that there was a Zookeeper operational within the environment, rather than trying to build the state machine within them.
[00:42:31.05] Again, Kafka came out of LinkedIn. LinkedIn built it internally for their own usage. Now they’re trying to spin it out into a real product that other people may buy and use, so it will be interesting to see if they maintain the dependence on Zookeeper or if they decide to build some of that stuff internally. I don’t know where they’re going to go with that, but for their audience it may be reasonable to say, “You know what, because so many people are using Zookeeper anyway, because many of them are already using Hadoop for things, they’re using Spark or Storm or whatever…” I’m never quite sure which products require Zookeeper and which don’t these days, but if they’ve already got it, that’s not going to be the critical part of our roadmap to build it internally.
[00:43:16.10] That being said, if you look at Cassandra, if you look at Riak, or other systems like that which probably do have consensus protocols built internally, they’re not relying on you having the centralized system, but they’re still using consensus protocols and state machines in order to do that cluster management piece that they need to do, and to do certain kinds of transactional operations that they want to provide.
Now, you can run a distributed environment of microservices-type things without having a lot of wholesale cluster orchestration if the pieces of your system are independent. At Rent the Runway we did microservices; there were lots of services. We had lots of servers running lots of services, but each service was developed to be deployable independent of the other services. You don’t need to do coordination there. Configuration was always deployed with the service, it wasn’t dynamically changing, so we didn’t really need it for that element.
[00:44:29.08] We weren’t quite at the scale where humans couldn’t comprehend the scope of the entire system for pure operational perspectives, meaning like if you have an operations person saying, “Oh, we have a problem, it’s in this server over here.” We weren’t adding new instances so quickly that we didn’t even know where they were, or we didn’t have IP addresses for them and we had to dynamically discover, “Oh yeah, I needed to autoscale out that service, so it got automatically assigned to this server over there.” We didn’t have that level of dynamism, but still a distributed system. It just didn’t have the level of dynamism that required a Zookeeper or other kind of service coordinator to really be managing it. And that’s a good way to go.
[00:45:23.27] We used load balancers to enable us to deploy with no downtime. You can use a load balancer to drain traffic off of nodes, update their software and then put them back in the pool and drain the traffic off of the other ones, update their software, put them back in the pool. That doesn’t work if you break backward compatibility with your API endpoints, but breaking backward compatibility should be considered to be a major event in any architecture.
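The drain, update, and return cycle described above can be sketched as a small Python simulation. Everything here is hypothetical: the `LoadBalancer` class, the `rolling_deploy` function, and the node names stand in for a real load balancer and deployment tooling, and "updating" a node just means changing a version string.

```python
class LoadBalancer:
    """Toy load balancer pool (hypothetical; no real traffic is handled)."""

    def __init__(self, nodes):
        self.in_pool = list(nodes)

    def drain(self, node):
        self.in_pool.remove(node)

    def restore(self, node):
        self.in_pool.append(node)


def rolling_deploy(balancer, versions, new_version, batch_size=1):
    """Update nodes batch by batch, keeping the rest in the pool serving traffic."""
    nodes = list(balancer.in_pool)
    if batch_size >= len(nodes):
        raise ValueError("batch_size must leave at least one node serving")
    min_serving = len(nodes)
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            balancer.drain(node)          # stop sending new traffic
        min_serving = min(min_serving, len(balancer.in_pool))
        for node in batch:
            versions[node] = new_version  # update software while drained
            balancer.restore(node)        # put the node back in the pool
    return min_serving


versions = {"web-1": "v1", "web-2": "v1", "web-3": "v1", "web-4": "v1"}
balancer = LoadBalancer(versions)
min_serving = rolling_deploy(balancer, versions, "v2", batch_size=2)
```

Between batches the pool contains both old and new versions at once, which is why this approach depends on the API endpoints staying backward compatible.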
I am very reluctant to build software that makes it easy to do something that I think should be a major event because it’s so easy for other things to break when you break backward compatibility that it’s just not a good practice for normal software development.
[00:46:15.26] I don’t think you have to have a consensus system if you have a distributed system of manageable scale, where you don’t have a lot of interdependence between the various services. You still have a distributed system, you still have multiple services, you still have some complexity, but you’re not massively scaling, you’re not trying to autoscale your cloud, you’re not trying to throw a service wherever is available at the time and assign it an IP and tell everybody, “Hey, it’s over here at this IP.” You have a little bit more of a static environment that grows at human scale rather than at machine dynamic scale.
Stefan Tilkov: [00:46:57.06] That sounds like what a lot of the enterprise environments where people now start to adopt microservices look like. They will be much more static than a startup that might need to scale a hundredfold in a week or so, so that predictability is much higher and that seems much better suited to something slightly more static and slightly more reliable as a result of that.
Camille Fournier: [00:47:20.23] Yes, exactly.
Stefan Tilkov: [00:47:24.01] In the beginning you mentioned that you have a history with Goldman Sachs, as well as with Rent the Runway. I can’t think of two companies that could be more different than these two. Can you talk about the differences, both in software engineering and architecture, but maybe also in engineering-related organizational aspects that you experienced when you went from one to the other?
Camille Fournier: [00:47:50.00] Yes, so obviously Goldman Sachs is a huge financial institution, and a lot of what I did working there was building big distributed systems. When I joined Goldman Sachs (in 2005), the technology teams were aligned very closely with the business that they were supporting. I worked in risk technology, and the technology team – the immediate team for the product that I worked on was about 60-80 people, and then the wider team was much larger than that – we were very much dedicated to the risk management side of the firm-wide technology organization. In many ways, we were able to act fairly independently, and that was somewhat startup-like.
[00:48:50.17] It was enterprise in that I had to dress up to go to work; I didn’t have to wear a suit, but you know… And obviously, we had all of this support around us. We didn’t have to [unintelligible 00:49:01.19] hardware, although we did do what I would call much closer to dev ops than most people may have imagined. We certainly supported our own software in production and deployed it ourselves. We did not have an ops team that did all support and deployment. We had support of the database, but as developers of the software, we were also supporting it.
There were a lot of things about those early experiences that were very similar to being in a startup. You’re very close to the business, your focus is building something that will be valuable for that business. It wasn’t moving that slowly; we had some freedom of choice about what kinds of software we used, although not perfect freedom.
[00:49:53.29] It was not as different from working at a startup as you might expect. Later I worked in technology infrastructure, which was much more like a big company. We were the big, centralized infrastructure team for the entire company, which meant that everybody wanted to have a say in what we worked on, and our software could be used by a lot of teams. On the flipside, we had much less freedom because we had to get so much political buy-in for everything we did. That was what drove me to leave. In many ways, I loved the company; I had a very good experience there. Goldman Sachs is not as evil as people might want to believe, certainly not in the technology area. They’re still engineers, you know? And I got to work on some really large-scale problems that were much larger scale than most people at startups will ever get to experience. And they had much harder requirements than most startups, so I did have a lot higher threshold for the risk management that I had to do.
[00:51:17.12] The last system that I built at Goldman Sachs was actually a global service discovery system, so I had to think about how to build this distributed system that would be available in Asia, in the EU, in the Americas… And if part of it went down, I didn’t want the whole system to go down. That would be very bad. It would be very bad if I built a system where if for whatever reason the Asia region was having technical problems and America couldn’t get work done, that would be really, really bad. You had to think really large scale, and the amount of money that you were supporting and the amount of risk that you were taking on if you broke something was really high.
Startup land had some similarities, in that the things that I learned at Goldman about really understanding and caring about the business that I was building and working fast, getting stuff done, shipping frequently, were things that I took to Rent the Runway. The difference is that it’s a startup and you have to do everything yourself. You really have much less support from those secret areas that you didn’t even think about when you were working for a big company. We were on cloud hardware, cloud networks – much less reliable than hardware supported by dedicated teams inside of Goldman Sachs, where you really owned all the servers yourself. That still wasn’t a perfectly reliable world, but it was more stable, it didn’t change as much. In the cloud world we had a lot more problems. Learning about that kind of failure was interesting, and so was learning about the stress and failure of growing a team, and what kinds of things that didn’t work in enterprise were really necessary in a startup.
[00:53:18.09] There were some things that surprised me. Goldman Sachs did not necessarily have a lot of teams where every single piece of software required code review, which may seem surprising, but it was the case. It’s not that we never did code review, and they may very well have that requirement now, but when I worked there we did a lot of pair programming, and you wrote a lot of tests. Instead of relying heavily on code review, there was a lot more reliance on the process around building software and validation. I was very reluctant to require code review at Rent the Runway, and my engineers at some point just revolted and they were like, “The thing that’s not working here is that we’re not doing code reviews.” I was like, “Really? Okay, that’s an easy one to fix. Now we have to do code reviews!”
Sometimes there are non-intuitive things that are actually harder or easier, depending on the size and type of the company, in ways that people don’t really expect.
Stefan Tilkov: [00:54:29.06] Very interesting. Is there something you would take back from the lessons you learned in the startup world, if you were to work for a big company again?
Camille Fournier: [00:54:36.06] Goldman was good, but only in places… People get so far away from the business and the people that they’re supporting, that they lose their sense of what is important, what is valuable, what needs to happen. People feel alienated at big companies because they don’t understand why they’re building what they’re building, they don’t understand who they’re helping with the software that they’re building.
It’s very hard at a startup to get away from your customer. At Rent the Runway we let all the employees rent for free; the male employees would rent for their female friends, their girlfriends or their wives, and obviously, the female employees – we would rent for ourselves. So we all really got the experience of being the customer, using the software as a customer would use it.
[00:55:43.06] We had a big warehouse where we did a lot of logistics and we made everybody work a day in the warehouse. The team that built the software for the warehouse went there very frequently and observed the processes, and did some of the work to understand the way the software they were building impacted that team. That is such an important thing. Everybody pays a lot of lip service to this in startup land and in big company land, but it’s easy to lose sight of how important it is that people be thinking about and reminded of why they’re building what they’re building, who they’re helping, what matters to that person and what doesn’t matter to that person.
Stefan Tilkov: [00:56:27.15] Awesome. Camille, we’re at the end of our time. Are there any last words you want to add? Some place you want to point people to? We’ll have show notes, of course, but maybe there’s a particular thing you want to add.
Camille Fournier: [00:56:38.23] If you like listening to me talk, I have many talks that I’ve given that are online that you can see. What I would point people to – just to take a step back from this general topic of distributed systems – I hope that what everyone took away from my conversation was you are in a distributed world and you are going to need to think about that. I don’t want to downplay that as a fact of life right now, but that doesn’t mean you have to tackle every single problem. That doesn’t mean that you have to expect that every system is going to be perfectly reliable, and if Kyle does a Jepsen test on it and reveals a weakness, that therefore that system is garbage and you shouldn’t use it.
[00:57:34.09] We make a lot of tradeoffs as engineers, we always have. Now we just make them at larger scale. Sometimes those tradeoffs cause us sleepless nights and a lot of challenges, but you are always going to be making tradeoffs, no matter what you do. Try not to tackle every problem; tackle the problems that you really need to solve, and work on developing a sense of where value lies in your business, where value lies in your software and your infrastructure, where it’s reasonable to take risks and where it’s not reasonable to take risks. Just developing those instincts will make the process of living in the distributed world, building distributed systems much easier.
Stefan Tilkov: [00:58:26.01] Awesome. Camille, thank you so much for being with Software Engineering Radio, and to listeners, thank you for listening.
Camille Fournier: [00:58:31.29] Thank you.