Jeff Meyerson talks to Haoyuan Li about Alluxio, a memory-centric distributed storage system. The costs of memory and disk capacity are both decreasing every year – but only the throughput of memory is increasing exponentially. This trend is driving opportunity in the space of big data processing. Alluxio is an open source, memory-centric, distributed, and reliable storage system enabling data sharing across clusters at memory speed. Alluxio was formerly known as Tachyon. Haoyuan is the creator of Alluxio. Haoyuan was a member of the Berkeley AMPLab, which is the same research facility from which Apache Mesos and Apache Spark were born. In this episode, we discuss Alluxio, Spark, Hadoop, and the evolution of the data center software architecture.
Transcript brought to you by innoQ
This is Software Engineering Radio, the podcast for professional developers, on the web at SE-Radio.net. SE-Radio brings you relevant and detailed discussions of software engineering topics at least once a month. SE-Radio is brought to you by IEEE Software Magazine, online at computer.org/software.
* * *
Jeff Meyerson: [00:00:36.16] For Software Engineering Radio, this is Jeff Meyerson. Haoyuan Li is the creator of Alluxio. Haoyuan was a member of the Berkeley AMPLab, which is the same research facility from which Apache Mesos and Apache Spark were born. Haoyuan also goes by HY. HY, welcome to Software Engineering Radio!
Haoyuan Li: [00:00:55.05] Thank you, Jeff. Thank you for the invitation.
Jeff Meyerson: [00:00:59.02] Absolutely, thanks for coming on! You’ve said that memory is king, and there is a trend that motivates you to say that. The cost of memory and disk capacity are both decreasing every year, but only the throughput of memory is increasing exponentially. Give some more color on this trend – why is the growing throughput of memory so important to the big data stack?
Haoyuan Li: [00:01:28.01] That’s a great question. What we are seeing is that both memory and disk prices go down. The memory price goes down by 50% every 18 months; on the other side, memory performance is still increasing exponentially every year. Why this is important is because the data is also growing exponentially. This means that memory speed will keep growing along with the data size, so you can do exponentially more data access in the same amount of time. That’s the reason why this trend is so important, and why we’re saying memory is king.
[00:02:20.19] From Alluxio’s perspective, because we see this trend, we view memory as the primary storage medium, and we designed our architecture to be a memory-centric architecture, to leverage memory performance as much as possible. Around the year 2000, Wall Street companies started to leverage memory (DRAM) aggressively. In 2010, companies like Baidu and Google – those internet giants – started to leverage memory aggressively as well. What we are now seeing is the whole industry moving in this direction, and we are the leading memory-centric solution for storage.
Jeff Meyerson: [00:03:20.00] Before we get into talking about Alluxio in more depth, I want to first talk about FileSystems and storage to get a historical context, in order to understand the breakthroughs that Alluxio makes. Let’s go into the past a little bit.
Historically, we’ve had to make tradeoffs along what is called the ‘memory hierarchy’. Could you give a brief overview of how the memory hierarchy has evolved and how this relates to the modern Hadoop or the modern big data stack?
Haoyuan Li: [00:03:51.24] Traditionally, there have been a lot of storage systems throughout history, and some of these storage systems have a hierarchy. Inside every single machine you have level 1 (L1) cache, level 2 (L2), level 3 (L3), you have DRAM, and you have storage on your PC or your laptop, like SSDs and disks. People put hotter data in a higher layer of the hierarchy, to leverage its performance. This has been true since the beginning of storage systems.
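The hierarchy HY describes can be sketched with rough, order-of-magnitude access latencies. These are commonly cited ballpark figures for illustration only, not measurements from the episode:

```python
# Approximate access latency per tier of the memory hierarchy.
# Rough, order-of-magnitude figures for illustration, not benchmarks.
HIERARCHY_NS = {
    "L1 cache": 1,          # ~1 ns
    "L2 cache": 4,          # ~4 ns
    "L3 cache": 15,         # ~15 ns
    "DRAM": 100,            # ~100 ns
    "SSD": 100_000,         # ~100 microseconds
    "HDD": 10_000_000,      # ~10 milliseconds
}

def faster_tiers(tier):
    """Return the tiers that are faster than the given tier."""
    limit = HIERARCHY_NS[tier]
    return [t for t, ns in HIERARCHY_NS.items() if ns < limit]

# Hotter data belongs in the tiers with the smallest latency.
print(faster_tiers("DRAM"))  # only the CPU caches beat DRAM
```

The gap between DRAM and disk is what makes "put hot data higher in the hierarchy" such a consistent win.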
[00:04:34.00] What recently happened is that along with the increase in memory capacity and the popularity of distributed computing, the data size has increased as well. Because of the larger memory, every single machine has more memory space left for storage. What we are seeing is that a lot of companies are buying machines with 256 gigabytes of memory, or even half a terabyte of memory per server.
[00:05:16.18] You’ll have a cluster of this type of server – a hundred servers – so you will have around 50 terabytes of memory space to manage. It’s very intuitive in this sense that we need a distributed storage system which can leverage this giant memory space. That’s what we’re seeing, and because the space is much larger, it’s very intuitive for people to leverage this bigger and bigger memory size to do memory-centric storage and computing.
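The back-of-the-envelope arithmetic behind that cluster-wide memory pool, using the hundred-server, half-terabyte-per-server figures from above:

```python
# Aggregate memory available to a distributed, memory-centric storage layer.
servers = 100
memory_per_server_gb = 512          # "half a terabyte of memory per server"

cluster_memory_tb = servers * memory_per_server_gb / 1024
print(f"{cluster_memory_tb} TB of cluster memory to manage")
```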
Jeff Meyerson: [00:05:55.07] If memory usage is getting more appealing, does this mean that we should start migrating all of the functionality of disk to RAM?
Haoyuan Li: [00:06:06.07] It’s not really about removing disk from the picture per se; it’s more about asking, “How do we leverage the more and more resources from DRAM?” DRAM is great; people will move more and more hot data into memory, and along with our [unintelligible 00:06:27.17] storage, people will leverage more SSDs as well. But in the meantime, people now see disk (HDD) more as backup storage. It’s not being removed from the picture; the functionality, the main goal becomes different – they serve a different purpose.
Jeff Meyerson: [00:06:50.22] Let’s start to talk more specifically about the use cases of Alluxio and how it works. In the current big data ecosystem, we have many frameworks. They already leverage memory in certain ways, but file sharing among jobs is replicated to disk, because replication enables fault-tolerance, and there are always faults in distributed computing – things break. What are the problems with needing to replicate to disk? Or, perhaps we should start with what are the prototypical types of cases? Maybe Spark jobs, or Spark plus MapReduce jobs working together, where we need to replicate to disk.
Haoyuan Li: [00:07:38.13] Traditionally, the storage systems use replication to achieve the fault-tolerance of the data. What we are doing here – along with the new announcement we had last week – is that we see there are so many computation frameworks; in the meantime, there are all kinds of storage systems, as well.
What you can see is that from a storage perspective you have Amazon S3, you have Google Cloud Storage, you have EMC, you have NetApp, you have HDFS, you have Gluster and Ceph from Red Hat, and so on. So what we are doing here is we put ourselves between all these different persistent storage systems – as well as the different computation frameworks – and we bring a lot of benefit to the whole ecosystem, especially to our customers.
[00:08:43.19] For example, to the upper layer – MapReduce, Spark, Presto and other kinds of computation frameworks – by working with Alluxio, they can access data from any storage. We enable any framework to talk to any data from any storage. In the meantime, we bring a lot of performance benefit. That’s the benefit to the upper layer.
The benefit to the lower layer is similar – if a storage system works with us, it can suddenly reach a lot of different types of workloads and applications. If a storage system integrates with us, it will be able to work with any computation framework that Alluxio supports on top of us. In the meantime, we can also bring a lot of performance improvement. That’s our benefit to the upper layer and the lower layer.
[00:09:50.22] Another cool thing – we call this a future-proof architecture. It means that if you are a computation framework, and later on another storage system comes up in the ecosystem – your company may buy a new storage system – then by working with Alluxio, as long as the new storage is also plugged into Alluxio, that existing application will be able to access the data from the new storage system immediately. That’s a strong benefit, as well.
Jeff Meyerson: [00:10:33.23] You mentioned the top layer and the bottom layer – these are opposite sides of where Alluxio sits in the stack. Alluxio is typically going to sit somewhere in the stack where below it you have HDFS, S3, GlusterFS or all these other types of typical data storage. Alluxio is this layer between that layer and the higher level frameworks, like Spark, MapReduce, Flink, or these other types of data processing frameworks.
In order to get a finer idea of how Alluxio improves things – just in case listeners don’t quite understand some of the context, talk about how some data science job, or some multi-step job where you have multiple steps of Spark, or a Spark and then a MapReduce job working in conjunction, and how that would work prior to Alluxio being in the Hadoop stack, and how it would work after Alluxio is involved.
Haoyuan Li: [00:11:54.25] That’s a great point. We call ourselves a memory-centric virtual distributed storage. Without Alluxio, these different jobs all interact with another storage system. Normally, people need to move the data from different places/storage systems into one single storage system, and the different jobs then interact with that single storage system. In this interaction, there are different jobs in the user’s environment; the jobs could be written in one framework, or they could be written in multiple frameworks. When they need to share data, they have to read and write through a storage system such as Amazon S3, HDFS, or EMC or NetApp storage, and this sharing process is slow.
[00:12:57.12] The ETL process from different storages into this one single storage is slow, as well as complicated and time-consuming. If you put Alluxio between the computation frameworks and the different storage systems, it will present a unified namespace to the upper layer (Spark, MapReduce, Flink, Presto), which can suddenly access the data from different storage systems. In the meantime, because we have this memory-centric architecture, we provide very high bandwidth for the upper layer to interact with the data stored in our space.
[00:13:53.23] We first of all remove this costly ETL process, and in the meantime we provide very high throughput for the upper layer to interact with the data. We enable different application jobs, potentially written in different computation frameworks, to share data very efficiently, at high performance. That’s how things are different with us.
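A toy sketch of the unified-namespace idea: several storage systems are mounted under one logical tree, so a job only names a single path. The mount table, paths, and URIs here are invented for illustration; they are not Alluxio's actual API:

```python
# A minimal mount table: logical paths map to backing storage URIs.
MOUNTS = {
    "/warehouse": "hdfs://namenode:9000/warehouse",
    "/logs": "s3://company-bucket/logs",
    "/archive": "gcs://company-archive",
}

def resolve(logical_path):
    """Translate a logical path into the URI of the storage system backing it."""
    for mount_point, backing_uri in MOUNTS.items():
        if logical_path == mount_point or logical_path.startswith(mount_point + "/"):
            suffix = logical_path[len(mount_point):]
            return backing_uri + suffix
    raise FileNotFoundError(logical_path)

# A job sees one namespace; the layer routes each read to the right storage.
print(resolve("/logs/2016/03/events.json"))
```

This is why the ETL step disappears: instead of copying data into one system, the jobs address one namespace and the layer underneath does the routing.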
Jeff Meyerson: [00:14:28.27] Talk more about the API or the FileSystem protocol that you use to present this unified namespace to different frameworks.
Haoyuan Li: [00:14:44.22] We provide different APIs for people to access data from us. We have a native FileSystem API; it’s a Java-like FileSystem API. Any framework or any application can interact with Alluxio using our native FileSystem API. That provides the best performance, as well. On top of that, we also build other APIs. For example, we have a Hadoop-compatible FileSystem API. That means that any application or framework, as long as it can work with the Hadoop FileSystem API, can work with us transparently, without any code change. That’s the second example.
[00:15:35.07] The third example – from the Alluxio 1.0 release we have a FUSE API, as well. It’s joint work between us and IBM. With the FUSE API, we enable a much larger and wider set of applications to interact with us and access data from us. What’s more interesting is that we also rolled out an alpha version of the key/value API in the 1.0 release. That means that people can also use a key/value interface to interact with Alluxio from 1.0. We are working with very big trial users to improve this feature along the way.
[00:16:25.05] I just gave some real examples, but at our layer people can easily build more APIs on top of us, to provide different ways to access data. We really encourage the ecosystem to do this, to provide more ways to access data from us. That means people can access more data from different storage at high speed. We are seeing people building different ways, particularly suitable for them, to access data from us.
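The "work with us transparently, without any code change" property of a Hadoop-compatible API comes from scheme-based dispatch: the URI scheme selects the FileSystem client. This toy sketch mimics that mechanism; the client classes are invented stand-ins, and the real integration is done through Hadoop's FileSystem plugin machinery:

```python
# Toy version of Hadoop-style FileSystem dispatch: the URI scheme picks the client.
from urllib.parse import urlparse

class AlluxioClient:
    def open(self, uri):
        return f"reading {uri} through Alluxio (memory speed when cached)"

class HdfsClient:
    def open(self, uri):
        return f"reading {uri} directly from HDFS"

REGISTRY = {"alluxio": AlluxioClient, "hdfs": HdfsClient}

def open_file(uri):
    """Pick a FileSystem client by URI scheme, as Hadoop-compatible APIs do."""
    scheme = urlparse(uri).scheme
    return REGISTRY[scheme]().open(uri)

# The application only swaps hdfs:// for alluxio:// in its paths; no code changes.
print(open_file("alluxio://master:19998/data/part-0"))
```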
Jeff Meyerson: [00:17:05.27] What we have been talking about is the interface between Alluxio and the things that operate on top of it. Let’s talk about what happens beneath Alluxio. Alluxio has UnderFS, which is a file system that interacts with S3, HDFS, or these traditional storage systems. What is UnderFS, first of all? Or, was everything I said accurate?
Haoyuan Li: [00:17:38.15] Yes, that’s correct. We provide a very thin API from our side, from the bottom of Alluxio to be able to interact with any storage system. [unintelligible 00:17:52.00] storage system from our perspective is the layer below Alluxio. From our position, we really want to work well with a lot of different other storage systems. That’s based on the need from our users. It’s very simple; you can see different users that have different storage systems deployed in the environment, and we just want to help them to be able to leverage that.
[00:18:28.10] To make matters simpler, the under-storage system from our perspective is any storage that users have, which can serve as our under-storage, and we can load and save data into it. To our users, it’s totally transparent. We handle the complexity of working with these different storage systems.
Jeff Meyerson: [00:18:53.21] You handle the complexity, regardless of whether it’s S3, HDFS or something else, but what are the commonalities between those different types of interactions? I imagine the UnderFS interaction between S3 has some similarities between how UnderFS interacts with HDFS, but I imagine there are also differences. Maybe you could give some granularity on what is the relationship between UnderFS and these different storage systems.
Haoyuan Li: [00:19:26.14] Yes, they are different; they have different properties and they have different performance characteristics. We would normally leverage the existing infrastructure in our user’s environment, and then we enable more ways to access the data for that particular storage in our user’s environment, and provide a much better performance. Definitely, different storage systems have different properties: Amazon S3 has a different property than HDFS, and a different property than Ceph and Gluster file system. That’s up to the use case, in most cases, as well as up to the user’s environment.
[00:20:11.17] If they really value the performance of their under-storage system, they may buy very high-performance under-storage. If they don’t care, they may choose a cheaper solution. That’s up to our users, and in the meantime it depends on their requirements.
From our side, we have different integrations with different under-storage systems, as well. For certain under-storage systems we have a more primitive integration, because it’s early. For certain other under-storage systems, because of a lot of user requests, we may have a deeper integration.
[00:20:53.28] With a deeper integration we can improve the performance, as well as the reliability. We have a lot of progress ongoing on this front.
Jeff Meyerson: [00:21:06.07] Now that we’ve talked about what goes on above and below Alluxio, let’s talk about Alluxio itself. Tell me about the Alluxio system architecture.
Haoyuan Li: [00:21:20.00] We have a scale-out, memory-centric architecture. Simply speaking, logically it’s just a single-master architecture with [unintelligible 00:21:35.13]. It’s like many other scale-out architectures. The difference is that in our architecture design, we focus a lot on the memory-centric front. That’s one major difference.
In the meantime, we have a lot of features in our space, for example tiered storage. That means that yes, we focus on memory performance, but we also leverage other tiers as well, like SSD and HDD. This feature has been running in production environments for over a year, on clusters with more than a hundred nodes, like at Baidu. We also do master fault-tolerance to guarantee there is no single point of failure. That’s the high level of our architecture.
Jeff Meyerson: [00:22:33.00] So Alluxio has a master and it has some number of workers, and each worker has access to a RAM file system. Could you describe this architecture in a little more detail? Give me an idea of what the responsibilities of the workers are, what their relationship is to the master, and so on.
Haoyuan Li: [00:22:54.22] Sure, no problem. On every worker node we run an Alluxio worker daemon which manages the local space (the local tier structure). It reports its status periodically to the master, as well as [unintelligible 00:23:08.21]. Beyond the worker daemon, we also run a ramdisk on every single worker machine. Whenever there is data locality, our Alluxio client is going to access the ramdisk directly, to get full memory performance. That’s the worker part.
[00:23:36.18] On the master part, our master node daemon keeps track of the status of all the workers, as well as where the data is (the data location information). Sometimes it will respond to a client request asking for the location of the data. Of course, it also handles metadata.
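A minimal sketch of that master/worker split: workers heartbeat the blocks they hold to the master, and clients ask the master where data lives. The names and data structures are illustrative, not Alluxio internals:

```python
# Toy master that tracks which worker holds which block, via worker heartbeats.
class Master:
    def __init__(self):
        self.block_locations = {}   # block id -> set of worker addresses

    def heartbeat(self, worker, blocks):
        """A worker periodically reports the blocks it currently stores."""
        for block in blocks:
            self.block_locations.setdefault(block, set()).add(worker)

    def locate(self, block):
        """Answer a client's request for the locations of a block."""
        return self.block_locations.get(block, set())

master = Master()
master.heartbeat("worker-1", ["block-a", "block-b"])
master.heartbeat("worker-2", ["block-b"])

# A client co-located with worker-1 can then read block-a from the local ramdisk.
print(master.locate("block-b"))
```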
Jeff Meyerson: [00:24:08.10] How often do nodes fail in a typical cluster?
Haoyuan Li: [00:24:12.17] There are different types of failures. If you mean machine failure, what we are seeing is similar to what other companies [unintelligible 00:24:23.00], and we have a mechanism to recover from that. For the software itself, it depends on whether you are using a stable release or the cutting-edge master branch.
People are pretty happy about our software quality. For example Baidu, they have been running us in their production for more than a year and they had a pleasant experience, which we are very happy to hear. We’re constantly improving the system from every single perspective, including the stability of the system.
Jeff Meyerson: [00:25:23.21] We’ll talk more about some of the company use cases later on, including Baidu. I want to talk about failures and what you do around failures. Alluxio’s idea to deal with failures is similar to Spark’s idea, which is the lineage-based recovery. What is lineage-based storage?
Haoyuan Li: [00:25:53.08] Lineage refers to the relationship among different data. One concrete example would be a program that reads a bunch of files and writes a bunch of files, as well. This is lineage information in Alluxio. By knowing that information, if any data gets lost, Alluxio will be able to relaunch the necessary computation jobs to get the data back. That’s just one way we handle this. We can also offer different options to our users.
[00:26:33.18] You can do asynchronous persistence and use lineage to get fault-tolerance. Or you can use synchronous persistence, which means that while writing into Alluxio, Alluxio also synchronously flushes the data into the under-storage, which also provides reliability of the data. A third way – in some cases, our users don’t care about the reliability of certain data, so they may simply say, “Okay, I want to do the asynchronous flush, but if some data gets lost because of machine failure – who cares?” In some use cases it’s like that.
[00:27:15.18] There are even more interesting use cases. For example, some industries have a regulation which prevents them from saving the data to disk; they can only put the data in memory, and we can help them in that case, as well. We provide a pretty flexible architecture; we give people options to choose the ways that are best suited to them and make the most sense for them.
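The lineage mechanism described above can be sketched as a small dependency graph: each job records its inputs, outputs, and how to re-run itself, and if an output is lost, the producing jobs are re-executed. This is a conceptual toy, not Alluxio's implementation:

```python
# Toy lineage recovery: each job records its inputs, outputs, and recompute function.
class Lineage:
    def __init__(self):
        self.producers = {}   # output file -> (recompute function, input files)
        self.store = {}       # file -> data (simulated storage)

    def run(self, fn, inputs, output):
        """Run a job, record its lineage, and store its output."""
        self.producers[output] = (fn, inputs)
        self.store[output] = fn(*(self.store[i] for i in inputs))

    def read(self, path):
        """Read a file, recomputing it from lineage if it was lost."""
        if path not in self.store and path in self.producers:
            fn, inputs = self.producers[path]
            # Recursively restore missing inputs first, then re-run the job.
            self.store[path] = fn(*(self.read(i) for i in inputs))
        return self.store[path]

lin = Lineage()
lin.store["/raw"] = [3, 1, 2]
lin.run(sorted, ["/raw"], "/sorted")
lin.run(lambda xs: xs[-1], ["/sorted"], "/max")

del lin.store["/max"]          # simulate losing data on a failed worker
print(lin.read("/max"))        # recomputed from lineage: 3
```

The recursion is what makes cascading failures recoverable: if an intermediate output is also lost, it is rebuilt from its own lineage first.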
Jeff Meyerson: [00:27:48.01] That’s very interesting. I want to touch more on the lineage-based idea, because this trend of lineage-based operations is an important concept in distributed systems. You see it in Spark, you see it in Alluxio, and I imagine we’ll be seeing it for a long time. People who are newer to distributed systems may not understand the tradeoff between replicating the data and keeping the lineage. With simple data replication, if you have a fault, you’ve got the data replicated, so you can restore it. In a lineage-based system, a fault seems like it would be a lot more complicated – you have to take all these past operations and recompute the data in order to get it back. Why is the lineage-based operation actually cheaper than just replicating that data?
Haoyuan Li: [00:28:54.08] That’s a great point. What makes it possible, or attractive, is that in the big data world the data volume is really high, and the cost of replicating it is much higher than the cost of recording the lineage information. I agree with what you said, that handling this lineage information well is not an easy job. What we are doing is providing an API for the upper layer – basically, the framework layers – to leverage this functionality and do the integration.
[00:29:36.14] If any computation framework integrates with us for that particular feature, it means that any application on top of that framework will be able to leverage the feature transparently. Even though the system itself is hard, most of the complexity is handled by the computation framework developers, instead of the application developers. We try to hide the complexity from the operator as much as possible. It’s also an ongoing effort; you can always make it better and easier for the operator to consume.
Jeff Meyerson: [00:30:25.25] Why is it cheaper to — maybe I’m rephrasing the same question I just asked, and maybe I’m naive, but why is it cheaper to do this lineage-based strategy rather than just replicating everything in memory? Couldn’t you just have three copies of everything in memory, on different nodes?
Haoyuan Li: [00:30:47.22] Yes, you can do that in some cases. But as I said, in a big data environment your binary program may be 100 MB, while the data that program is dealing with – the input and output – could be hundreds of gigabytes; that’s a thousand times larger. Replicating that – first of all, you will hit the bottleneck of the network; that’s one issue. The other issue is that by replicating it in memory, even though the memory price is going down exponentially every year, you still have limited memory space, and replication consumes a lot of it. That’s an economical choice people can make, but with our way, that issue doesn’t exist.
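HY's economics argument in rough numbers (the figures are illustrative, scaled from the 100 MB program and hundreds-of-gigabytes output he mentions): in-memory replication multiplies the large output cost, while lineage only keeps the small program plus references to already-persisted inputs.

```python
# Compare the memory cost of 3x in-memory replication vs. recording lineage.
program_mb = 100                     # the job binary: ~100 MB
output_gb = 500                      # the job's output: hundreds of gigabytes
replicas = 3

replication_cost_gb = output_gb * replicas    # every copy occupies cluster RAM
lineage_cost_gb = program_mb / 1024           # just keep the program around
                                              # (inputs are already persisted)
print(f"replication: {replication_cost_gb} GB, lineage: {lineage_cost_gb:.2f} GB")
```

The roughly four-orders-of-magnitude gap is why recomputation on (rare) failure beats paying the replication cost on every write.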
Jeff Meyerson: [00:31:44.04] So in the event of a failure, and if you need to recompute from lineage rather than recomputing from a replicated type of system, how time-intensive is it to recompute data from the lineage?
Haoyuan Li: [00:32:02.11] That depends on the application. It’s not an easy issue. We do have ways to provide [unintelligible 00:32:13.06] from that perspective. In the meantime, this lineage feature we started with is only one part of the system. If you look at the whole system itself, or if you look at the new announcement we made, the system has progressed tremendously since the beginning of this project.
[00:32:43.08] Let me give you some numbers. Let’s talk about the number of commits; everyone can understand that. Now the system has around 13,000 commits. A year ago it was 3,000 commits, so over the past year this increased more than fourfold. The community also grew very fast. A year ago we had around 50-60 contributors, and now we have more than 200 contributors. The growth has been tremendous, and we are pleasantly surprised by it. We believe that with this growth, we and the community will add more features to the system, which will bring more advantages and use cases to our users.
[00:33:47.14] Back to the lineage point – that was the initial feature we had, and now we have so much more, like the tiered storage we discussed, the unified namespace, and the different ways to access the data (key/value, FUSE API). It has been amazing to see this progress.
Jeff Meyerson: [00:34:13.14] I saw several talks that you gave as I was preparing for this interview, and in each of these talks you talk about the overarching goal with Alluxio. Many people asked you this question – what is the goal of Alluxio, formerly Tachyon? At first glance you might think the goal is just this in-memory computing thing, but it’s actually a little more abstract. Talk about the goals of Alluxio.
Haoyuan Li: [00:34:47.07] As a system, the number one goal is to bring more benefit to the ecosystem. Before us there were two layers: one is the computation, the other is the persistent storage. There are issues with this, especially with the growth on both sides: you have more types of storage systems, you have more types of computation frameworks, and in many cases performance is the issue. As I said, we bring in this new layer which abstracts the under-storage systems, and we provide a unified namespace to make it easier for the upper layer (computation frameworks, as well as applications) to access data from any storage system. That’s the benefit. In the meantime, we provide much better performance, and our long-term vision is to become the de-facto standard for this unified data access layer.
[00:35:54.27] In the meantime, we provide tremendous performance improvement to our users. These are our goals. In that case, we can bring benefits to the whole ecosystem – to different frameworks, to different storage systems, and more importantly, to our users. That’s the way we see this.
Jeff Meyerson: [00:36:16.13] Let’s talk about some of those benefits in practice. For example, Alluxio obviously speeds things up, and that propagates to the people working at the higher end of the big data stack – we have data engineers, data scientists. How does the job of a data scientist or a data engineer working with Spark or some other set of tools – how does their job and their workflow change in a big data stack that is post-Alluxio?
Haoyuan Li: [00:36:50.25] That’s a great question. There are a lot of perspectives, but there are some major ones. From the data scientist’s perspective, on one side Alluxio makes their life much easier. Because it’s a unified namespace, they don’t need to explicitly move data anymore. They simply tell Alluxio, “I want to access this data”, and then they will just be able to access it. That’s one thing. They don’t need to wait one or two months for people to transfer data for them.
[00:37:26.21] On the other hand, for anyone with this type of data access, the performance will be much better. Barclays wrote an article two weeks ago, published on DZone. The article is called “Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds”. That’s tremendous. Previously, a data scientist needed to wait half a day to get data prepared; now, it takes seconds. This has tremendously changed their working behavior; it made them much more efficient and brought them a competitive advantage in their space.
[00:38:25.12] In short, first of all it makes their lives much easier, and it improves the performance tremendously. About this unified layer – my PhD co-advisor, Ion Stoica (a co-creator of Spark; he’s also in our press release) said that Alluxio can be to storage what the IP layer is to the internet, and that it can enable faster innovation and make people’s lives much easier. That’s the way we see it.
Jeff Meyerson: [00:39:08.22] What I find interesting – when you go from taking an hour to run a Spark job to seconds… Traditionally, at least when I was in school, five or six years ago, we always talked about things in terms of Moore’s law, and having this very predictable 18-month cycle of compute time or number of processors on a chip doubling; however you want to phrase it, compute power doubles every 18 months. But perhaps that set of trends was constricted by what occurred on the hardware layer, because you had to have all these people moving in lockstep in order to orchestrate the hardware progressions. But when you make an advance at the software layer, it seems like perhaps you need less coordination and therefore you can get bigger step changes, such as what we’re seeing with Alluxio. Does that idea resonate with you, or do you think that’s crazy?
Haoyuan Li: [00:40:10.29] Look at the whole industry – it’s improving from many different perspectives. There is always a lot of innovation on the hardware side, but in the meantime the software also adapts and improves. As an example, we saw the trend in DRAM, we leveraged it, and we provide much better software and an innovative architecture to the industry. What we are seeing is that we collaborate with the whole ecosystem to bring more benefit to our users. That’s the way we see it.
Jeff Meyerson: [00:40:51.28] With Alluxio, you’ve mentioned there are still cases where you do want to store to disk. Let’s talk about that in a little more detail. When would you still want to — I’m a data scientist, I’m a data engineer. When am I still going to want to use disk storage, despite Alluxio?
Haoyuan Li: [00:41:14.29] You still want to use it; you have a lot of data that you want to put on disk. That’s the under-storage, in the Alluxio world. Alluxio always works with at least one [unintelligible 00:41:27.25] under-storage system, in most cases. We really want to work well with these under-storage systems, and the goal is not to replace them. The goal is to work with them and to improve the whole stack.
Jeff Meyerson: [00:41:49.20] Let me close off with one other question. What is the future of the Hadoop ecosystem and how does Alluxio fit in?
Haoyuan Li: [00:42:08.02] That’s a great question. The Hadoop ecosystem is a very big ecosystem, and there are a lot of pioneers in it. The way we fit in is that with Alluxio we can bridge… First of all, of course, we can improve the performance. On the other side, we make people’s lives much easier, as well as bridging the traditional world with this ecosystem. In the meantime, from our perspective, Alluxio also works with a lot of different types of ecosystems, as well. That’s the way we see it. To come back to the Hadoop ecosystem, we will keep improving, to make our users’ experience better, by improving the performance and the usability of the system.
Jeff Meyerson: [00:43:02.18] Where can people find out more about Alluxio and about you?
Haoyuan Li: [00:43:06.21] If they google Alluxio, they can go to our websites – the project website, the company website… We have a lot more information there. In the meantime, if you’re interested, feel free to contact us as well; we’d be happy to help whenever it’s suitable.
Jeff Meyerson: [00:43:23.12] Great. HY, thanks for coming on Software Engineering Radio. It was great talking to you about Alluxio, and I wish you the best of luck; I’m really happy with your project.
Haoyuan Li: [00:43:32.26] My pleasure. Thank you very much, Jeff.