Evan Weaver of Fauna discusses the Fauna distributed database. Host Felienne spoke with him about the database’s design and its properties, as well as the FQL query language and the different models it supports: document-based as well as relational. They discuss how Fauna deals with data manipulation with stored procedure-like functions, and how it guarantees transactionality as a real-time streaming database.
This episode sponsored by O’Reilly.
Show Notes
Related Links
- Episode 194: Michael Hunger on Graph Databases
- Episode 353: Max Neunhoffer on Multi-Model Databases and ArangoDB
- Evan Weaver on Twitter
- Fauna
Transcript
Transcript brought to you by IEEE Software
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].
Intro 00:00:00 This is Software Engineering Radio, the podcast for professional developers, on the [email protected]. SE Radio is brought to you by the IEEE Computer Society and by IEEE Software magazine, online at computer.org/software. Keeping your teams on top of the latest tech developments is a monumental challenge. Helping them get answers to urgent problems they face daily is even harder. That’s why 66% of all Fortune 100 companies count on O’Reilly online learning. At O’Reilly, your teams will get live courses, tons of resources, interactive scenarios and sandboxes, and fast answers to their pressing questions. See what O’Reilly online learning can do for your teams. Visit O’Reilly.com for a demo.
Felienne 00:00:46 Hello everyone. This is Felienne for Software Engineering Radio. Today with me on the show is Evan Weaver. Evan is a former director of infrastructure at Twitter, where he was the 15th employee. He also has a master’s degree in computer science and has previously worked at CNET and SAP. Currently he is the co-founder, CTO, and co-inventor of Fauna. Welcome to the show, Evan.
Evan Weaver 00:01:09 Hey, thanks for having me.
Felienne 00:01:11 So today we are going to talk about your product, Fauna, and according to your website Fauna is a flexible, developer-friendly, transactional database delivered to you as a secure, web-native API. So that’s a lot to unpack, and I want to talk about all those separate attributes, but let’s first talk about the basics: what problem does Fauna solve?
Evan Weaver 00:01:37 So Fauna solves a problem that we experienced at Twitter. I joined Twitter in what is ancient history now: it was 2008, the height of the Web 2.0 era, in the midst of the early stages of cloud adoption. I came from CNET, where I built consumer websites with Ruby on Rails, primarily on MySQL. And, you know, I went to Twitter because I wanted the product to survive. I didn’t have big dreams about building distributed systems or implementing databases or anything like that. But we found, as the handful, the half dozen engineers who were at the company at the time, that there was nothing off the shelf that would let a team like ours scale a product, especially a global, soft real-time product like Twitter, from small to large. And we had to invest in learning and studying and inventing new technology and new strategies and algorithms in distributed systems just to keep the product scaling.
Evan Weaver 00:02:30 That didn’t make a lot of sense to us. We felt like, you know, there’s nothing intrinsic in information science that says you can’t have something which is both flexible to build against in the small and scalable in the large. But there were no systems off the shelf that could do it; in particular, there were no operational data systems that could do it. Operational data is the most conservative part of the industry in a lot of ways, because it’s the riskiest: it’s your mission-critical data, your user data. It’s things that are irreplaceable. We were dealing with systems like MySQL and its primary-secondary, statement-based replication. We looked at Postgres as well, looked at Mongo as well, looked at Cassandra, and invested a lot in Cassandra in the early days of the NoSQL era.
Evan Weaver 00:03:16 None of these systems could scale without these discontinuous architecture events along the way. So we ended up rearchitecting things almost on an annual basis. That was very frustrating. I don’t think anyone works on databases out of love for existing databases. They work on them because they’re frustrated that things are not better than they are, and they see a path to eliminating that area of pain for future developers who have to work with data. And we’re no different. So after spending four years at Twitter, scaling all the core business objects and writing a bunch of custom distributed systems, we founded Fauna and did a bunch of consulting for four years, exploring the data space. And then eventually we realized that if we didn’t solve this fundamental problem, it wasn’t going to get solved. So we embarked on product development. That’s Fauna, and FaunaDB.
Felienne 00:04:05 To summarize, I think what you’re saying is that the biggest problem you found is that existing solutions were not flexible, or they weren’t scalable enough.
Evan Weaver 00:04:14 Right, exactly. You had to make trade-offs at every step of the way. And that doesn’t just mean things are hard for companies that are succeeding. It means there are all kinds of companies and products and applications that just don’t get built, because the incidental cost of scalability and operations and management and administration and all this stuff is too high to justify the investment. So it’s kind of similar to the nineties: if you wanted to have a website, you had to rack a bunch of hardware and configure a bunch of custom databases, and fussing around with Apache was a big deal. We don’t have to do any of that stuff anymore, and we forget how much of a drag that was on productivity in the industry. Even getting a basic HTML page like yahoo.com out the door was tough, so there were few entrants in the early market of the web. But now we’re still in the same place, even with managed cloud, where you have to deal with provisioning and the physicality of your data in order to build an application. And especially as the world goes more global and goes more remote, none of that really makes sense. It’s all extraneous cost.
Felienne 00:05:20 Yeah, that makes sense. So can you describe a typical use case of Fauna? You already described, of course, the scalability part of it. In what type of situation would it make sense for developers to consider Fauna as an alternative to whatever database they’re currently using?
Evan Weaver 00:05:36 So Fauna is a serverless operational data API. It serves the same role as a traditional operational database, whether that’s a NoSQL database like Mongo or an RDBMS like MySQL or Postgres. What Fauna does is provide you the basic building blocks of a soft real-time consumer or B2B-facing service. It could be a CRM, it could be some other SaaS, it could be social media, it could be video gaming: anything where you need to transact across business data, user data, all the usual concerns in the OLTP or operational data sphere. Fauna is designed to offer both serverless provisioning and serverless scalability, so you don’t have to do any operations at all, and to fit well with the rest of the serverless stack, or as augmentation to the managed cloud stack, so that you can continue to build your products and applications without having to incur additional operational overhead or debt.
Evan Weaver 00:06:41 You know, another way to look at it is that serverless overall is a return to, and a realization of, the dream of utility computing. We used to talk about that a lot in the nineties. And, you know, we tried to manage it by fussing around with Perl scripts and that kind of thing, and it didn’t really work. But it’s back, and it’s back in a big way, with different interfaces: especially the POSIX interfaces for containers and that kind of thing, and WebAssembly and other lighter ways to isolate compute. And we have CDNs, which offer asset hosting in a truly serverless, zero-operations way. We started this project kind of before serverless was real, but the vision has always been the same. You know, what if you could interact with your data through a ubiquitous, utility-priced, global API, and you never had to think about where that data lived ever again?
Felienne 00:07:29 So I think what you’re saying is: if you already have a solution that runs in the cloud, runs on Heroku or AWS, and you want to extend that with database functionality, then it really makes sense to look at Fauna, because it’s also serverless.
Evan Weaver 00:07:43 Yeah, if you’re building with the rest of the serverless stack, and that can be Jamstack, it could be server-side compute with AWS Lambda. Or you have an existing managed, or even on-prem, application where you don’t want to make further investment in the operations of that system. Your typical Postgres installation, for example, gets to a point where people are afraid to migrate a column or add a table, because there’s no performance isolation. You know, it’s risky. So where else do you put your data? You don’t really want to keep spinning up Postgres instances that’ll keep having these problems. The one you currently have is working perfectly fine, so you probably don’t want to take the risk of migrating it to something brand new. Even a different edition of Postgres can be risky. But you can take something like Fauna and add it into the application stack, integrate it through web APIs and service APIs, and start to get the benefit of that utility computing, that serverless data, for the augmented features.
Felienne 00:08:40 This actually segues perfectly into the question I was going to ask next, because you also say that Fauna is a flexible database. And I think you just touched upon that in addressing the pain of adding a column to a database, where people might be worried to make changes. I guess that’s where you are different: the flexibility means that it is easier to make changes to your data model.
Evan Weaver 00:09:03 Yeah. Fauna is designed to minimize operational overhead and maximize developer productivity. And that means we chose not to directly implement SQL. Fauna is a relational database. It offers transactions and indexes and views and foreign keys and all that good stuff from the relational model. But it doesn’t, at least currently, expose a SQL interface. It exposes a document-relational interface over GraphQL, which is a web standard you may be familiar with and which is very popular, especially in the Jamstack space, and over FQL, which is our own language, a functional relational language. It’s essentially a Lisp, though we try not to talk about that too loudly, because a lot of people are scared of Lisps. But it gives you a safer and more semantically pure way of interacting with the data, which includes things like dealing with documents that also retain their history. So you’re not discarding data just by updating records; you can go back and look at a particular point in time. It’s type safe. It lets you compose queries. It lets you store stored procedures as functions, execute business logic globally in the database, and compose it. It essentially gives you programmable access, in a modern way, to your data set.
Felienne 00:10:22 Those are all, by the way, things I want to ask you a little bit more about. So the stored procedures and the different types of interfaces, we will definitely zoom into those later in the episode. But first, let’s talk about the flexibility a little bit more, to give listeners an idea of what Fauna can mean. Can you give an example of a case in which this flexibility really helped solve a customer problem?
Evan Weaver 00:10:45 Well, I’ll give you two examples, and I’ll give you one on sort of the feature set side. So we have a customer named ShiftX. They have a business process modeling tool. And one of the things they took advantage of in Fauna is the temporality, to let their own users go back in time and see how their process models have changed. That’s the kind of thing that in a traditional database, whether it’s NoSQL or SQL, you have to build as a secondary concern in your business model. And that gets very confusing. Especially in a typical SaaS application, you end up trying to implement these partially application-facing, partially database-facing concerns, like temporality and multitenancy, in application code. There are a lot of security features which are missing from the SQL interface in particular, and that pollutes your application.
Evan Weaver 00:11:38 It makes it hard to iterate. It makes it hard to maintain that kind of purity of approach to the business problem you’re trying to solve. They were able to use the temporality feature directly to expose time travel and just move on to bigger and better things. On the operational side, in particular, the serverless nature of Fauna means that you never have to provision any specific level of scale. You can rely on the database to scale like an API, which is to say you don’t think about it. Capacity is always there. You pay by usage; you don’t pay by scheduled time or some variation of that. And a use case I just saw the other day: someone’s building an application that is used at conferences, and they’re anticipating a return to in-person conferences. And, you know, one of the challenges with conference attendance, and especially connectivity, is that the wifi never works when you’re actually in the room with a hundred other people trying to get on the wifi. So they’re building a semi-offline application, which will work while it has partial or zero connectivity.
Evan Weaver 00:12:43 And then when connectivity is restored, it uploads the data to the cloud and makes it accessible to others. That means basically whenever these groups disperse, there’s a huge spike of synchronization from people who suddenly have network access. And you don’t want to provision for that spike all the time, because you’re wasting money running databases at 0% utilization while people have no connectivity, all as a group. But at the same time, if you provision for a small amount of capacity, then you’re going to have failed queries and timed-out queries, and the application isn’t going to work at these synchronization points. So they’re taking advantage of Fauna’s operational flexibility to just eliminate that concern entirely from their application process. Instead of doing exponential backoff and this and that, and all the kinds of things you would usually have to do to work around this problem, they can just rely on Fauna to make capacity available as needed.
Felienne 00:13:42 Great. So I think the two things you mentioned were, first, that you get scalability for free. You don’t have to think about it: if you have sudden peaks in use, that is covered automatically, and indeed you don’t have to buy lots of servers just to cover for these peaks that might happen very rarely. And the other point you mentioned is temporality. That sounds so interesting, that you have time travel by default. Can you talk a little bit more about how that works for a developer? Because what I’m envisioning is something like “select star from customers yesterday” or something. How do you interface with a database that has temporality built in?
Evan Weaver 00:14:22 That is essentially how it works. One of the reasons we implemented FQL, our own query language, instead of SQL is that, you know, SQL is a business analysis language. It was designed for people sitting at workstations writing reports. Security was managed outside the language itself. Performance was managed outside the language itself. There are all these concerns which are now intrinsic to building, in particular, SaaS and consumer-facing web applications, and which those kinds of interfaces make no attempt to solve. Temporality is one of the things that we saw a lot of need for, both in backing direct user-facing application features like time travel (select this document at a point in the past, see the changes between two points), and it’s related to the streaming capability, which I think we’re going to talk about in a little bit. But also, you know, for things like change data capture and synchronization with secondary systems like analytics systems, it’s very important, in particular, to be able to do cheap update queries to see if anything has changed, rather than having to table scan all the time. In a traditional database the records are fundamentally unpartitioned by time: even though they have a time column in them, that column lives in the application layer.
Evan Weaver 00:15:41 So the database has no awareness of it. And, you know, there are other things, like the web-native security model, which makes it possible to access the database safely over the public internet. Everything’s secure by default, unlike the opposite, where you have to use firewalls and VPCs and so on to try to lock down administrative access to the database. And there’s the granularity that lets you authenticate users directly against the database, rather than only operators who can access one table at a time, all that kind of stuff. What we’ve tried to do is basically retrace, in particular, the history of MySQL and its ability to be a general-purpose transactional data platform for what was at the time the modern web, the LAMP stack and Web 2.0, and do the same in a serverless operational context, for the way that the stack has changed. And I think historically, you know, we get a new development stack about every 10 years, and that stack typically has only one database that really becomes the default database that’s chosen for it.
Evan Weaver 00:16:46 And I think part of this points to some of the conservatism, or the riskiness, of making that choice. People aren’t choosing an operational database just for one project; they’re choosing it for their company overall, and oftentimes even for their career. It’s a huge investment to learn a new database: to learn to operate it, in particular for the pre-serverless databases, to operate it safely, to interact with the consistency model appropriately, and so on. So in the nineties, you know, we got MySQL with LAMP and SQL Server with the Microsoft stack, and there’s a bifurcation there, because we had two totally different stacks even though they shared some concepts. Then in the 2000s we got Postgres with kind of the ORM generation, with Hibernate and ActiveRecord. Then in the 2010s we got Mongo in the MEAN stack. And we have a new stack now, the serverless stack, and it needs a database. So, you know, we’re bringing back everything you would expect from the traditional relational model, but updated and revised and reimplemented for serverless operations and serverless development.
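As a rough illustration of the temporal model Evan describes in this answer (reading a document as of a past point in time, and cheaply checking whether anything has changed), here is a minimal Python sketch. It is not Fauna’s API and not FQL: the class `TemporalDocument` and its methods `update`, `at`, and `changed_since` are hypothetical names invented for this example, and it assumes integer timestamps that only ever increase.

```python
from bisect import bisect_right
from typing import Any, Dict, List, Optional


class TemporalDocument:
    """Hypothetical append-only document: every update is kept as a version."""

    def __init__(self) -> None:
        self._timestamps: List[int] = []          # write times, ascending
        self._versions: List[Dict[str, Any]] = []  # version written at each time

    def update(self, ts: int, data: Dict[str, Any]) -> None:
        # An update adds a new version instead of overwriting the old one.
        if self._timestamps and ts <= self._timestamps[-1]:
            raise ValueError("timestamps must be strictly increasing")
        self._timestamps.append(ts)
        self._versions.append(dict(data))

    def at(self, ts: int) -> Optional[Dict[str, Any]]:
        """Return the version visible at time ts, or None if none existed yet."""
        i = bisect_right(self._timestamps, ts)
        return self._versions[i - 1] if i else None

    def changed_since(self, ts: int) -> bool:
        """Answer 'has anything changed after ts?' without scanning versions."""
        return bool(self._timestamps) and self._timestamps[-1] > ts


doc = TemporalDocument()
doc.update(100, {"name": "Ada", "plan": "free"})
doc.update(200, {"name": "Ada", "plan": "pro"})
print(doc.at(150))             # the version written at time 100
print(doc.changed_since(200))  # False: nothing written after time 200
```

The design point is that time is an index the store itself understands, so a point-in-time read is a binary search and a change check looks only at the newest timestamp, instead of scanning rows for an application-level time column.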
Felienne 00:17:53 So diving a little bit more into the options there are to be, let’s say, the next database for this era: who do you think Fauna’s biggest competitors are, and how are you different from them?
Evan Weaver 00:18:05 We see DynamoDB the most. DynamoDB has done a good job in that utility computing aspect, giving you multitenant, granular billing based on usage. Where it hasn’t done a good job is in maintaining the flexibility of the relational model and eliminating the physicality of the database from the application-level concerns. Dynamo is an eventually consistent system. It was derived, funnily, you know, in one of these weird quirks of history, from Cassandra, which itself was derived from the old Dynamo, which was an internal Amazon project. So it came full circle as a NoSQL, rich key-value store that was designed for low latency and consistent performance, not for developer flexibility and productivity. Amazon has made a lot of retrofit updates to it, like adding optional transactions and optional indexes and that kind of thing, but they don’t compose, because the model is not fundamentally built around an architecture which is amenable to it. Compare that to Fauna: transactionality all the time, ubiquitously available consistent indexes all the time, rich relational queries that can express flexible business logic, and that kind of thing.
Evan Weaver 00:19:28 So on the pricing side, does it scale up and down dynamically? Yes. In that sense, it is competitive. On the developer flexibility side, it’s not. And we see that a lot. You can compare it to Mongo. I think Mongo offers more developer flexibility, but the Mongo Atlas cloud product is not serverless, and it’s never going to be, because the architecture isn’t there. Various other vendors have approached one aspect of this problem, but because of the legacy baggage, and because there’s so much path dependence, especially in operational data, since it’s so risky, there’s just no path for them to really deliver the full picture without starting over from ground zero like we did.
Felienne 00:20:12 Yeah. And of course you set all those requirements from the beginning. So that’s why it might be a little bit easier to actually hit that sweet spot.
Evan Weaver 00:20:19 Our story is a little atypical, I think, for Silicon Valley or for startups. We were a tech-first startup. We saw this technical problem at Twitter. We believed we could solve it. We watched as it continued to go unsolved in the industry, and then we embarked on the journey to solve it. Once we had solved these core problems of basically global, serverless, transactional database management, we needed to figure out how to take it to market. Compare that to a company like Mongo, which found its product-market fit very early, but then had to retrofit technology, because they literally had nothing. They had, like, a memory-mapped file and a JavaScript process. There’s no database there; that’s just interface. And we went the other way. We built architecture, and then essentially we were too early to market and had to wait for the serverless market to catch up and be ready for what our original vision was. And then, you know, we had to make tweaks along the way, like adding the GraphQL interface and so on, and in particular improving the comprehensiveness of the FQL standard library to meet the developer flexibility and productivity goals. But at the end of the day, the serverless vision of ubiquitous utility computing has always been our vision.
Felienne 00:21:32 Great. So one thing I want to zoom in on a little bit more: you say that you have transactionality all the time. What does that entail?
Evan Weaver 00:21:42 Transactionality is essentially data correctness at the highest level, and databases are more or less partitioned into consistent, transactional databases and eventually consistent, non-transactional databases. One of the issues, especially when we were starting out with Fauna, was that there was a general idea, pushed especially by the NoSQL vendors, that data consistency and scalability were fundamentally antagonistic to each other, and that you couldn’t have a system with both. The argument went: if you need to scale, then scalability is the most critical operational capability you can have, because it doesn’t matter if your data is correct if no one can get to it. So you should give up consistency and correctness in favor of scale, and then try to add back, in application code, whatever band-aids over the essentially corrupt state of your operational data you need to keep the application working. We knew that wasn’t the case, based on our experience building the distributed systems at Twitter.
Evan Weaver 00:22:49 But when we started Fauna, we wanted to make sure that we chose a replication and transaction architecture which would both give us low latency at global scale, literally the lowest possible latency, which I can elaborate on in a minute, and also never give up the highest level of consistency, which is strict serializability. That basically means you can treat the entire database cluster, not just a single machine but the entire cluster, as if it were local to you, as if everything were happening in a serial, physical order, one transaction at a time, in that database. And that’s a very easy model for developers to reason about compared to eventual consistency, which is essentially relativistic: different clients, different applications, different users accessing the database all see different views of the state of the world, and they cannot fundamentally be reconciled in real time.
Felienne 00:23:44 It might also be cognitively harder for programmers to think about all those different versions, whereas the transaction model might be closer to what people are used to.
Evan Weaver 00:23:54 Yeah. The transaction model is intuitive. You can essentially imagine that there’s a giant lock around the entire database and only one user can access it at a time. Obviously that isn’t the case in Fauna, and it can’t be the case, because although you could implement it that way, your throughput would be pretty poor. But yeah, we saw, based on our experience at Twitter and in our consulting days at Fauna, that not just your average developer but even your best developers just could not reasonably reason about the consistency of their data in the context of their application code. It’s just not possible. Data consistency is the domain of formal proofs and very aggressive adversarial analysis of the systems to make sure that the implementation matches the proof, and all that kind of stuff. And that’s not how people build products. There’s no formal proof for Twitter, TikTok, or what have you. It doesn’t make any sense. It’s a waste of time. It would not even be possible, given the complexity of the service topologies that go into these products.
Felienne 00:24:59 So coming to this topic a little bit more, because it looks like you do support replication: is Fauna a distributed system?
Evan Weaver 00:25:08 It is a distributed system. So we, as the Fauna provider, the vendor, operate Fauna in multiple globally distributed regions, which are hosted on multiple cloud providers under the hood. You as a user don’t have to care about that, because you just interact with an HTTPS API, which gives you this facade of ubiquitous, consistent access to the data wherever you happen to be querying from. And that leads to a lot of nice properties for you, because you can always guarantee that your writes will be consistently low latency regardless of where you are in the world, since there’s no primary region or cluster that you have to talk to in order to perform a write. You can guarantee the same thing for reads, with even lower latency. Whichever Fauna region is nearest to your application, that’s the region you’ll be routed to, to do your work.
Felienne 00:26:02 So if I understand this correctly, you have replication, you are a distributed system, and you even have regional distribution, but as a user, I don’t really see that. I don’t have to say I want to use this region, and I don’t have to think about where my data is or whether some things might go out of sync, because that’s all covered, all cared for, at your side.
Evan Weaver 00:26:23 Right. We’re adding the ability to restrict the replication topology for compliance purposes, principally because, you know, you don’t always want all your data, given various legal regimes, replicated all around the world. But Fauna’s replication is semi-synchronous and always consistent from the perspective of any individual group of applications which are accessing the same data center. So it’s impossible in the read-write transaction pipeline to violate consistency. And in the read-only transaction pipeline, it’s impossible to violate it without deliberately going out of your way to pass information behind the scenes, through channels that Fauna cannot see. So in practice, you get a consistent experience all the time, no matter who is interacting with your data set around the world, or how.
Felienne 00:27:14 Yeah, it is a distributed system, but as a user, you don’t have to worry about it. You can just benefit from it.
Evan Weaver 00:27:21 Yeah. The model for you as a user is essentially the model of a single-node RDBMS.
Felienne 00:27:28 Database.
Evan Weaver 00:27:30 You have a rack in the closet, and it has the data, and you access it. And the data is always the data, because it’s only coming from one place. Fauna restores that illusion of the data coming from only one place in a completely globally distributed and highly available operational context.
Speaker 1 00:27:49 At O’Reilly, we know your tech teams need quick answers to their most urgent questions. They need to stay on top of new tech developments. They need a safe place to learn the technologies your company adopts, and they need it all 24/7. Well, they can get it all at O’Reilly.com. With O’Reilly online learning, your team gets live online courses, tons of resources, safe interactive scenarios and sandboxes, and fast answers to their most pressing questions. Visit O’Reilly.com and request a demo.
Felienne 00:28:19 So let’s talk about real-time streaming, because that’s something that Fauna also offers. So what is this? What is real-time streaming?
Evan Weaver 00:28:28 We have a feature in beta now which basically takes the temporal model and upgrades it to an HTTP push, not just a poll. You could already poll on change sets, points in time, ranges in time. But in particular for backing real-time applications, you want to get notified rather than query every 10 seconds or something like that. So Fauna lets you listen on individual documents for updates, from both backend services and user applications, including browsers and mobile apps, that kind of thing. And we’re expanding this feature to also allow you to listen on entire collections and indexes as well.
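To make the push-versus-poll distinction above concrete, here is a small in-process Python sketch. It only illustrates the pattern of listening on an individual document; it does not use HTTP, and it is not Fauna’s streaming API. `DocumentStream`, `subscribe`, and `write` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Tuple

# A listener receives the document id and a copy of the new contents.
Listener = Callable[[str, Dict[str, Any]], None]


class DocumentStream:
    """Hypothetical store that pushes each write to that document's listeners."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}
        self._listeners: Dict[str, List[Listener]] = defaultdict(list)

    def subscribe(self, doc_id: str, listener: Listener) -> None:
        # Register interest in a single document, not the whole store.
        self._listeners[doc_id].append(listener)

    def write(self, doc_id: str, data: Dict[str, Any]) -> None:
        self._docs[doc_id] = dict(data)
        # Push: subscribers are called at write time, so nobody has to
        # re-query every few seconds to find out that something changed.
        for listener in list(self._listeners[doc_id]):
            listener(doc_id, dict(data))


received: List[Tuple[str, Dict[str, Any]]] = []
stream = DocumentStream()
stream.subscribe("console-42", lambda doc_id, data: received.append((doc_id, data)))
stream.write("console-42", {"in_stock": True})   # pushed to the subscriber
stream.write("console-99", {"in_stock": True})   # no subscriber, nothing pushed
print(received)
```

In a real deployment the callback would be replaced by a long-lived network connection, but the shape is the same: the writer side drives notification, and each client hears only about the documents it asked for.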
Felienne 00:29:08 So that means that I can implement something like new customer events, and then I can, I don’t know, send an email to someone saying that a new customer has registered.
Evan Weaver 00:29:19 Yeah. Or, you know, it’s kind of topical right now. There are a lot of waiting lines happening in the world right now. I’ve experienced some of them myself recently. If you want a new gaming console, you have to wait in line: you sign up, and then you get notified when they’re available, and then you have to rush to try to grab the reservation. Or for the coronavirus vaccines, it’s essentially the same process, even though it’s federal and state agencies, instead of someone like Sony, who are offering you the website. All these kinds of experiences are really good examples of the power of something like Fauna, because you need notification when something is available, and you need the ability to order, in a strictly serialized way, who is in line, so that you can be fair about allocating the opportunities.
Evan Weaver 00:30:09 Then you need to make sure that only one person can buy a specific console and only one person can get a specific vaccine reservation. That requires transactionality; in particular, it requires strict serializability. And if you’re trying to do this in an eventually consistent system, you see a lot of the failure patterns that we’re experiencing today, where it says, oh, consoles are available, but they’re not. You go to check out and it’s gone. Vaccine appointments are available, but they’re not. You go partly through the process, then you get booted out, and then you can’t go back in because it wasn’t reserved specifically for you. Someone has taken over your application and it’s a real mess.
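The “only one person can get a specific reservation” requirement can be sketched in a single process as follows. This is a toy model, not how Fauna works internally: it uses one `threading.Lock` so that each check-and-claim runs as an indivisible step, which is the single-machine analogue of the serial ordering discussed above. `ReservationBook` and its methods are hypothetical names for this example.

```python
import threading
from typing import Dict, List, Optional


class ReservationBook:
    """Hypothetical booking table where each slot can have at most one owner."""

    def __init__(self, slots: List[str]) -> None:
        self._lock = threading.Lock()
        self._owner: Dict[str, Optional[str]] = {slot: None for slot in slots}

    def reserve(self, slot: str, user: str) -> bool:
        # The availability check and the claim happen under one lock, so no
        # other request can slip in between "it looks free" and "it is mine".
        with self._lock:
            if slot not in self._owner or self._owner[slot] is not None:
                return False
            self._owner[slot] = user
            return True

    def owner(self, slot: str) -> Optional[str]:
        with self._lock:
            return self._owner.get(slot)


book = ReservationBook(["vaccine-9am"])
results: List[bool] = []


def attempt(user: str) -> None:
    results.append(book.reserve("vaccine-9am", user))


threads = [threading.Thread(target=attempt, args=(f"user-{i}",)) for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results.count(True), "winner:", book.owner("vaccine-9am"))
```

Fifty concurrent attempts produce exactly one success, and every other caller is told immediately that the slot is gone. The failure pattern described in the transcript corresponds to running the check and the claim as separate steps against replicas that may disagree.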
Felienne 00:30:46 That’s where you want to have transactionality. You want it to be very clear that this vaccine has been assigned and that there are so many left.
Evan Weaver 00:30:55 Right. Yeah. I mean, I think, you know, people say that banking is the canonical example for transactionality and the value of ACID, but it’s really not, because financial data can be reconciled after the fact. I think the best example is this kind of process, basically the Ticketmaster process, where people are competing to hear about an opportunity and reserve it in real time. And that just doesn’t work in a distributed context without a database like Fauna.
Felienne 00:31:23 That’s a great business case where it really makes sense to use a system like Fauna. Fauna also supports business logic as atomic functions. And we talked about this, I think, a little bit earlier, where you said something about stored procedures. Because when I read “custom business logic as atomic functions,” I did think: is that a stored procedure?
Evan Weaver 00:31:46 It is a stored procedure. Stored procedures are uncool, but they’re back in a big way. We call them functions, analogous to serverless functions. And like I mentioned, Fauna is a programmable database. What you’re submitting to the database is not only reads and writes. You’re submitting a mini program that executes in an atomic way over the state of the data at its point in the transaction log. That means in particular, you can’t do what you would do in SQL, which is have an interactive session transaction where your client chats back and forth with the database, accumulating the transaction over a period of requests. You have to submit the entire transaction to the database so that it can be strictly serialized as a block. That’s not really a barrier to development. Most people don’t really take advantage of interactive transactions in a meaningful way when they’re building products with RDBMSs. They use ORMs and other things like that, and those abstract away the session-based nature of the database
Evan Weaver 00:32:47 anyway. So your experience developing against Fauna is similar, but it does mean that we have to offer a very rich standard library so that you can do a lot more of your compute work in the database, co-located with the data in your transactions, than you would typically be used to doing in an RDBMS or in a NoSQL system. But that also gives us the benefit that we can then bring back stored procedures as functions and let you basically save particular parameterized transactions as your own standard library that you can then call and compose in your other queries to the database. And that lets you do a lot of your business modeling in the database, which has the advantage of also being secure. So you can access it directly from insecure clients like mobile apps and browsers and that kind of thing, but it lets you manage that code in a clear and consistent way.
Evan Weaver 00:33:40 This is another example where the temporality comes into play, because the database is temporal. So are the functions. You get, you know, version control over your code in the database itself and that kind of thing. And it really gives you back a lot of the power that was lost when we moved from the client-server architecture of the early nineties to the three-tier architecture of the web, where the application had to orchestrate everything and basically use the database as the dumbest possible transactional key-value store, because you couldn’t scale the compute layer of the database in particular. And there were also a host of security and code management issues involved in using stored procedures from web applications. But that doesn’t mean the model is broken. It just means that, you know, the implementation and the patterns people were trying to achieve in their applications had diverged.
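To make the idea of submitting a whole transaction as a mini program more concrete, here is a minimal Python sketch. It is not Fauna’s API or FQL: `ToyDatabase`, `create_function`, `call`, `submit`, `reserve`, and `reserve_pair` are hypothetical names invented for illustration, and the toy applies one transaction at a time in a single thread as a stand-in for strict serialization.

```python
import copy
from typing import Any, Callable, Dict

State = Dict[str, Any]
Transaction = Callable[[State], Any]


class ToyDatabase:
    """Hypothetical in-memory stand-in for a programmable database."""

    def __init__(self) -> None:
        self.state: State = {}
        self.functions: Dict[str, Callable[..., Transaction]] = {}

    def create_function(self, name: str, factory: Callable[..., Transaction]) -> None:
        # Save a parameterized transaction under a name, like a stored function.
        self.functions[name] = factory

    def call(self, name: str, *args: Any) -> Transaction:
        # Bind arguments to a stored function, producing a transaction.
        return self.functions[name](*args)

    def submit(self, txn: Transaction) -> Any:
        # The whole transaction runs against a private copy of the state.
        # The copy replaces the real state only if the transaction finishes
        # without raising, so a failed transaction leaves no partial writes.
        working = copy.deepcopy(self.state)
        result = txn(working)
        self.state = working
        return result


def reserve(slot_id: str, user: str) -> Transaction:
    """Stored function: reserve a slot unless someone already holds it."""

    def txn(state: State) -> str:
        slots = state.setdefault("slots", {})
        if slots.get(slot_id) is not None:
            raise ValueError(f"slot {slot_id} is already reserved")
        slots[slot_id] = user
        return slot_id

    return txn


def reserve_pair(db: ToyDatabase, first: str, second: str, user: str) -> Transaction:
    """Compose two calls to the stored function into one atomic transaction."""

    def txn(state: State) -> list:
        db.call("reserve", first, user)(state)
        db.call("reserve", second, user)(state)
        return [first, second]

    return txn
```

In this sketch, a second attempt to reserve the same slot raises an error, and a composed transaction that fails halfway leaves the stored state untouched, which is the property the console and vaccine examples depend on.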
Felienne 00:34:36 So I really want to hear more about this programmable database. What is programmable? Like, firstly, what programming language do you write in? Do you write in your FQL language? Is this what you write the stored procedures in? Or do you have language bindings for JavaScript?
Evan Weaver 00:34:52 You do write it in FQL. In the future, we want to expand that, but right now we’re focused on making the FQL standard library as comprehensive as possible. And FQL is fundamentally a Lisp. It’s a functional language, which is composable and type safe. It’s Turing complete because you can recurse. We don’t recommend mining Bitcoin in Fauna, but eventually someone’s going to do it. The benefit of this model, and it goes back to, you know, some of the Linda work, or you probably know it from, like, JavaSpaces and that kind of stuff, is that you can execute computations co-located with your data. That makes them much more efficient and also much more consistent than when you have to query back and forth from application code and run compute incrementally in the app while you do reads and writes, also incrementally, interleaved in the database. And that lets us give you essentially this ubiquitous data API, this data fabric, in which you can compose not just your schemas and your data models, but also a lot of the business logic around those data models too. And then you can focus on the presentation logic in application code, keep those tiers well segmented, and also maintain performance, high availability, and consistency for all the business logic, which is implemented in the database itself.
Felienne 00:36:19 So that also means that the stored procedures are run on your side. So they’re not running on the client’s side, but they’re run on your side. And are they also always run on the same machine where the data lives?
Evan Weaver 00:36:32 They are. They’re run in the database kernel. We don’t, like, shell out to Lambda or something like that. Fauna’s multitenancy model is really unique. Fauna is dynamically scheduled. When you sign up for Fauna, you instantaneously get a new database that you can access immediately. The reason we can do that is that we never provision a static resource for anybody. All the isolation is dynamic. It’s essentially equivalent to cooperative multithreading, which you may remember from Windows 95. The interpreter of the query ASTs is inserting yield points, like Windows 95, at I/O and loop boundaries, which then fall back to the scheduler. And that happens on a millisecond basis so that we can interleave long-running queries. We can track quotas, we can prioritize and deprioritize usage across tenants, within tenants, all that kind of thing. And we can do that within the stored procedures, the functions, too. So we can maintain performance and security isolation without having to statically allocate anything. And that lets us not have to worry about running complex computations co-located with the data, which then lets them obviously have zero latency to data on the local node, low latency to data that’s partitioned on adjacent nodes, and that kind of thing.
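The cooperative scheduling described above can be illustrated with a small Python sketch that uses generators as queries. This is only an analogy under stated assumptions, not Fauna’s interpreter: `sum_query` and `run_round_robin` are hypothetical names, each `yield` stands for a yield point at a loop boundary, and the scheduler is a plain round-robin queue with no quotas or priorities.

```python
from collections import deque
from typing import Deque, Dict, Generator, Iterable, List, Tuple

Query = Generator[None, None, int]


def sum_query(values: Iterable[int]) -> Query:
    """A toy query that hands control back to the scheduler after every loop step."""
    total = 0
    for value in values:
        total += value
        yield  # yield point at the loop boundary
    return total


def run_round_robin(queries: Dict[str, Query]) -> Tuple[Dict[str, int], List[str]]:
    """Run queries cooperatively, one step at a time, in round-robin order.

    Returns the result of each query and a trace of which query ran at each step.
    """
    ready: Deque[Tuple[str, Query]] = deque(queries.items())
    results: Dict[str, int] = {}
    trace: List[str] = []
    while ready:
        name, query = ready.popleft()
        trace.append(name)
        try:
            next(query)  # run until the next yield point
        except StopIteration as finished:
            results[name] = finished.value  # the query returned its result
        else:
            ready.append((name, query))  # not done yet, back of the queue
    return results, trace
```

Because every query gives up control at each loop step, a short query scheduled alongside a long one finishes without waiting for the long one to complete. A real scheduler could use the same yield points to track quotas or reorder tenants.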
Felienne 00:37:54 So let’s dive into writing queries in more depth, Evan, because what you say is that you have a multi-model interface: you support document storage, relational data, graph data, and also temporal data sets. And this is actually something that we covered earlier in the show, in episode 353. So can you tell us a little bit more about how those different models work, and especially how they also collaborate? When do you use one, when do you use the other, and can you mix and match within a query? How does that work?
Evan Weaver 00:38:26 I think we as a team have a unique view on this problem. To us, there are two fundamental data models. There’s the document model, like Mongo: denormalize everything into a rich, flexible document, and your transaction boundaries basically are the document boundaries. You can consistently update the document itself. Outside of that context, all bets are off. Then there’s the relational model: set up a schema ahead of time, build indexes, enforce foreign keys, achieve fifth normal form if you’re super into that, all that kind of stuff. It’s the opposite of denormalization. It’s aggressively normalized, literally in the name, and you use that to reduce redundancy, which improves consistency, because you’re not duplicating the same data where you might mess it up. And also, you know, it works well with the storage layouts and so on of the traditional RDBMS. That’s all you need in terms of fundamental data models to build almost any application.
Evan Weaver 00:39:26 And thus Fauna is a document-relational database, where it offers those relational links, indexes, views, foreign keys, and transactions between flexible documents that don’t have a rigid schema, although they are typed, and that allow you to dynamically populate data without having to go through migrations and so on. The other models you mentioned, temporality and graph, are not standalone models. They are additional capabilities that can be added to the core document-relational model. You don’t have to get a graph database to do graph queries. Many people who use join tables very successfully in RDBMSs to model graphs understand that. You know, you don’t have to go get some off-the-shelf, proprietary temporal database to have temporality. Sometimes people do dump change data into, in particular, logging systems like Splunk, and that becomes kind of an ad hoc temporal database. But to me, that’s a defect in the core system they’re using, not a paradigm in its own right.
Evan Weaver 00:40:31 So what we’ve done in Fauna is, with FQL, unify the document-relational model in a very coherent way and then augment it with temporality and graph querying. In particular, the kind of pointer traversal that you would do in a graph data model, where you don’t have to have enforced foreign keys and can just link documents willy-nilly to each other, interacts very well with the rest of the FQL standard library. So in the same application, in the same query, on the same data set, you can take advantage of the benefits of these models without having to think: am I in graph land now, am I in relational land now, is my data even in the right data system to run the query I’m going to run? And I think our GraphQL interface (and GraphQL, despite the name, is not really a graph API; it’s a document API) expresses the power of the core Fauna semantics, in that it offers a view into the same data with a different syntax, so that you can keep the semantics and the data fully and access it how you please from the application side.
Felienne 00:41:36 So before we dive into writing queries in these different models, I want to talk a little bit more about what it means that you have unified the document model and the relational model. Because to me, those things sound irreconcilable. How did you unify this? What does that even mean?
Evan Weaver 00:41:53 Every document has an implied schema. You have a local graph structure, you have keys, which are nested, that have values. They may have multi-value attributes like arrays. They could have objects, which themselves nest even further, and so on. So what we do is, rather than enforcing the schema up front on data ingest, we rely on the presence of this implied schema to introspect the documents and understand what the relations are supposed to be. So when you compose an index (and an index in Fauna is actually more like a relational view: it’s an explicit materialization of data from documents; it can even compose data from multiple collections of documents; it can transform it; it can do very view-like things to express a different dimensionality, a different transformation, and a different composition of the data in your dataset), you know, that doesn’t say, what is the schema for this collection?
Evan Weaver 00:42:51 It says, well, I’m going to assume that the keys and the values that are expressed in the view may exist in the documents. And as we traverse them, you know, if they’re there, I will include them. And if they’re not, the document will be elided from the view. That gives you the ability to basically take the core relational properties (joins, foreign keys, transactions, normalization) and apply them to what may not be a fully normalized data model. It could be a completely denormalized document, but the relational model fundamentally can be fit into the document-record paradigm, and documents can be fit into the relational query paradigm. Those things aren’t intrinsically opposed.
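As a rough illustration of an index acting like a view over documents with an implied schema, the following Python sketch includes a document only when every field named by the view is present and skips it otherwise. It is a simplified model, not Fauna’s index implementation: `lookup`, `build_index`, and the dotted-path notation for nested keys are assumptions made for this example.

```python
from typing import Any, Dict, Iterable, List, Sequence, Tuple

_MISSING = object()  # sentinel for "this path does not exist in the document"


def lookup(doc: Dict[str, Any], path: str) -> Any:
    """Follow a dotted path such as 'address.city' through nested dictionaries."""
    current: Any = doc
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def build_index(
    documents: Iterable[Dict[str, Any]],
    terms: Sequence[str],
    values: Sequence[str],
) -> Dict[Tuple[Any, ...], List[Tuple[Any, ...]]]:
    """Materialize a view keyed by `terms` that exposes `values`.

    No schema is declared for the collection. A document that lacks any of
    the named fields is simply left out of the view.
    """
    index: Dict[Tuple[Any, ...], List[Tuple[Any, ...]]] = {}
    for doc in documents:
        fields = list(terms) + list(values)
        if any(lookup(doc, field) is _MISSING for field in fields):
            continue  # the implied schema does not match, so elide the document
        key = tuple(lookup(doc, term) for term in terms)
        row = tuple(lookup(doc, value) for value in values)
        index.setdefault(key, []).append(row)
    return index
```

For example, indexing by `address.city` and exposing `name` would include every document that has both fields, regardless of what other fields it carries, and would silently omit a document that has no address.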
Felienne 00:43:36 Yeah. That makes sense. It’s really interesting. So how does that reflect into queries then? Do you have a relational version of FQL and a document-based version of the query language? Or can this be mixed within one query?
Evan Weaver 00:43:52 No, it’s all one language with shared semantics. Where there are differences is in the alternate interfaces to the database. So FQL is kind of the central point, the point that expresses all possible semantics. You interact with FQL through the application drivers, which give you an ORM-like experience. They give you a DSL; you’re not writing FQL syntax. But you can also interact with the database with GraphQL syntax. You can use GraphQL clients to access your data through the semantics of Fauna, to the extent that GraphQL overlaps with what we support and what we want to maintain. Like, we’re not trying to be a standards-compliant, comprehensive GraphQL client and let you do, for example, service composition, which is one of the big value props of GraphQL: letting you compose different data sources into a single query. Well, we’re just one data source, and we’re not the right place for you to be calling out to other services you may own. But if that’s the interface you’re familiar with, then you can access Fauna in this way. We intend to add other syntaxes in the future.
Felienne 00:45:03 Are those the only two languages that you support, or are there more that you support at this point?
Evan Weaver 00:45:09 So those are the only two right now, but we intend to add other languages to the extent that they fit the Fauna semantic model. We’ll implement that part of the spec, but we won’t try to be comprehensive. We won’t try to be drop-in compatible with existing other systems, that kind of thing.
Felienne 00:45:24 You just briefly mentioned indexing. And I read that you do indexing on the fly. What does that mean?
Evan Weaver 00:45:32 Fauna indexes are similar to relational indexes in that they’re kept transactionally up to date. You can add an index at any time. It asynchronously builds in the background. Once it’s ready, it’s available for querying. You express your queries by explicitly referring to the index, which is a current difference from the relational model in terms of performance management. Like, you know, a lot of people are familiar with hinting in SQL, where you build an index and you want to make sure it’s always used. Fauna is effectively an always-hinted system, which helps us maintain predictable performance for your queries at scale. Those indexes, once they’re asynchronously built, are themselves serializable. So when you update your document, the indexes are updated synchronously in real time, such that you don’t have any divergence when you query them from the application side.
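The two properties mentioned here, indexes that change in the same step as the document and queries that name the index explicitly, can be sketched in a few lines of Python. This is a toy model under stated assumptions rather than Fauna’s behavior: `Collection`, `create_index`, `write`, and `match` are hypothetical names, and the index is built immediately instead of asynchronously in the background.

```python
from typing import Any, Dict, List, Set, Tuple


class Collection:
    """Toy collection whose indexes are updated in the same step as each write."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, Any]] = {}
        # index name -> (indexed field, {field value: ids of matching documents})
        self.indexes: Dict[str, Tuple[str, Dict[Any, Set[str]]]] = {}

    def create_index(self, name: str, field: str) -> None:
        # Built immediately here; the real system is described as building
        # new indexes in the background before they become queryable.
        entries: Dict[Any, Set[str]] = {}
        for doc_id, doc in self.docs.items():
            if field in doc:
                entries.setdefault(doc[field], set()).add(doc_id)
        self.indexes[name] = (field, entries)

    def write(self, doc_id: str, doc: Dict[str, Any]) -> None:
        # Update every index together with the document so readers never
        # observe an index that disagrees with the stored documents.
        old = self.docs.get(doc_id)
        for field, entries in self.indexes.values():
            if old is not None and field in old:
                entries[old[field]].discard(doc_id)
            if field in doc:
                entries.setdefault(doc[field], set()).add(doc_id)
        self.docs[doc_id] = dict(doc)

    def match(self, index_name: str, value: Any) -> List[str]:
        # The caller names the index explicitly; there is no query planner
        # choosing one on the caller's behalf.
        _, entries = self.indexes[index_name]
        return sorted(entries.get(value, set()))
```

Naming the index in every query is what makes the cost of a query predictable in this model: the lookup is always the one the developer asked for.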
Felienne 00:46:24 Something else that I want to dive into. You could say that you are not entirely responsible, right, for how people use your system. But what I was thinking about when you were describing that you support both the relational model and the document model, the schema-less model, is: how do you force developers to work properly? Maybe I’m a bad developer, but I do think I would worry that what would happen to me is I think, oh, I don’t need a schema. I put a bit of data here and I put a bit of data there. The benefit of the traditional relational model is also that you are forced to think about what the data type of this column is, what the name of this column is, and how this collection goes together, which sometimes slows you down, but also forces you to work in a systematic way and to think in a systematic way. Like, sometimes you get a database error because it says, oh, but this element is of the datatype integer. And then you’re like, oh yeah, yeah, that’s true. I put that constraint there for a reason. So how do you balance this freedom?
Evan Weaver 00:47:29 Our goal is to maximize flexibility and let you, you know, basically opt in over time to additional constraints as your application matures. So when you’re starting out, you should totally just put some data here and there and mess around and see how it goes and try to get it working. I think it’s unrealistic to envision sitting down, you know, with a blank editor window and saying, I know what all my data models are going to be, let me write them out. You can do that once the application is mature, but especially now, applications are never mature. They’re always changing. So you always need the ability to both enforce and maintain the invariants which you’ve decided are, for the time being, truly invariant, and still have the flexibility to augment and experiment with new aspects of the application and the product you’re building. So right now, the best way to enforce those invariants is through indexes, which let you, like I said, you know, transform existing documents, skip over ones that don’t match, and expose a view which is very consistent in schema and output, and even in the values of the records themselves, because you can do almost anything within an index.
Evan Weaver 00:48:40 You can call stored procedures and so on. So you can really remix your data in unique and application-specific ways. And if you always query through those indexes, you’ll only see the consistent view of the data. Also, the functions, the stored procedures, can do things like check types and do casting and that kind of thing, and error out if something doesn’t match. So you can use those to lock down your writes and reads to patterns which themselves enforce proper data ingest and proper data egress. We do intend to add optional typing in the future, which will let you basically do what the stored procedures do in a declarative way instead of a procedural way, and say, you know, I do have a column, I know what its type is, I know what its default is, I want to enforce that: fail, or populate a specific value for writes that don’t match, and so on. And that will also let us get closer to the ability to take a relational schema, a relational database, and directly import it into Fauna with no ETL, no transformation from that columnar model.
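The procedural approach described above, where a stored function checks types, casts where it can, and errors out otherwise, might look roughly like this in Python. It is a sketch under stated assumptions, not FQL: `make_checked_writer` and the shape of `schema` (a mapping from field name to a Python type) are invented for illustration.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def make_checked_writer(schema: Dict[str, type]) -> Callable[[List[Record], Record], Record]:
    """Build a write function that enforces types procedurally.

    Fields listed in `schema` must be present and are cast to the expected
    type when possible. Fields not listed stay free-form and are stored as-is.
    """

    def write(collection: List[Record], record: Record) -> Record:
        checked = dict(record)
        for field, expected in schema.items():
            if field not in checked:
                raise KeyError(f"missing required field: {field}")
            value = checked[field]
            if not isinstance(value, expected):
                try:
                    checked[field] = expected(value)  # attempt a cast
                except (TypeError, ValueError) as exc:
                    raise TypeError(
                        f"field {field!r} must be {expected.__name__}"
                    ) from exc
        collection.append(checked)  # reached only when every check passed
        return checked

    return write
```

If all writes go through such a function, the stored data satisfies the constraint even though the collection itself never declared a schema. The declarative typing mentioned as a future feature would express the same rule without hand-written checks.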
Felienne 00:49:48 Yeah, this is the thing: when you were describing your design philosophy, it reminded me of gradual typing in Dart, where you can start entirely free-form. And then as your application matures, you add more structure and types. And this is something that really fits with what you’re saying as well. The functions, the stored procedures, help you to retrofit the structure into the unstructured data that you might’ve started off with. Right?
Evan Weaver 00:50:13 They do, but they’re procedural. They’re not type declarations. So Fauna is dynamically typed, but it is typed. Like, it’s not Tcl, right? You’re not just assuming a type when you go to execute a function; you’re doing type checks. And we can expose those types to the developers so they can declare them and enforce them and add code to, you know, make decisions when the type doesn’t match and so on, and merge the document model, with its dynamic typing, better with the relational columnar model, with its static typing, and let you mix and match. Even in the same collection, which is equivalent to a table, or in the same document, you can have some enforced types, some implied or derived types, and some aspects of the document which aren’t typed at all.
Felienne 00:51:00 So I wanted to circle back to something I said all the way at the beginning of the episode. When I was saying what I read about Fauna on your website, something that you also said was “developer friendly,” which we haven’t talked about. But I think we talked about it in a sense, because I heard many things that I think are very developer friendly, like the flexibility and the possibility to switch between these different forms of databases. But I do want to ask you as well: why do you say Fauna is developer friendly? Can you give an example of a feature where you specifically say, well, this is what we put in to make the lives of developers, the lives of our users, easier?
Evan Weaver 00:51:40 A good example of that is the authentication and identity system. If you’re coming from an RDBMS or even something like Mongo or Cassandra, you’re used to administrative security control: can somebody add and drop tables, which tables can they query. And so we took a look at that and we said, you know, this isn’t how people are building applications. They’re not building them for table-by-table access. And they don’t care which SQL statements a user can run. They care what data they can access. And also, they want users to be able to access data securely from anywhere, not just from a trusted application tier and so on. We built an authentication model native to the web that supports row-level, token-based, what we call attribute-based access control, which lets you very granularly declare what parts of a record, which records, in which collections, in which scopes can be accessible to specific identities, both at the user level and the administrative level.
Evan Weaver 00:52:42 And that’s also integrated with third-party authentication providers like Auth0, which, you know, let you then bring the database into the web service topology you’re already using and not have it be its own island, where the application has to interpret web security policies and apply them, usually without relying on the database security model at all, and enforce them in application code because there’s such an impedance mismatch there. So Fauna is a very complex system, and it may sound scary from this conversation to interact with it, but all that complexity is there to prevent you from ever having to think about it. From your perspective, you’re writing clear queries in your native application language, in a DSL that makes sense, that has access to the entire ecosystem of integrated services you would expect to be there, and you just don’t have to care what happens beyond that query boundary. But you can rely on enforcement of the security policies pervasively through the data tier. You don’t have to worry whether somebody writing a mobile app implemented the same security checks that the web app implements and all that kind of usual stuff. You don’t have to worry about SQL injection, because the system is type safe and doesn’t rely on string interpolation and that kind of thing. So we’re really trying to basically push the hurdles to building a secure, scalable, performant application down to the point that you don’t think about them anymore.
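Attribute-based access control of the kind described in this answer decides access from attributes of the identity and of each record, rather than from table-level grants. The Python sketch below is a minimal illustration, not Fauna’s security model: `Identity`, `can_read`, `redact`, `visible`, the `admin` role, the `orders` collection, and the `card_number` field are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Iterable, List

Document = Dict[str, Any]

# Hypothetical fields that only administrators may see.
SENSITIVE_FIELDS: FrozenSet[str] = frozenset({"card_number"})


@dataclass(frozen=True)
class Identity:
    user_id: str
    roles: FrozenSet[str] = frozenset()


def can_read(identity: Identity, collection: str, doc: Document) -> bool:
    """Record-level rule: admins read everything, users read their own orders."""
    if "admin" in identity.roles:
        return True
    return collection == "orders" and doc.get("owner") == identity.user_id


def redact(identity: Identity, doc: Document) -> Document:
    """Field-level rule: hide sensitive fields from non-admin identities."""
    if "admin" in identity.roles:
        return dict(doc)
    return {key: value for key, value in doc.items() if key not in SENSITIVE_FIELDS}


def visible(identity: Identity, collection: str, docs: Iterable[Document]) -> List[Document]:
    """Apply both rules inside the data tier, so every client gets the same answer."""
    return [redact(identity, doc) for doc in docs if can_read(identity, collection, doc)]
```

Because the rules run next to the data, a mobile app and a web app that present the same identity see exactly the same records and fields, without either client reimplementing the checks.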
Felienne 00:54:12 Yeah. So really, your interpretation of developer friendly is that you do all the work. All the hard work is on the side of Fauna, and for the user it seems like a simple database system, like we would have been used to before there were distributed systems, but you do enable the user to benefit from the power of a distributed system.
Evan Weaver 00:54:34 Right. And we want to give you the upside and not the downside. I mean, a lot of the downside is operational and administrative, and in particular we’ve done a lot of work to eliminate that completely. Fauna is a zero-operations system. And that’s not to say that we don’t endeavor to offer operational transparency. Like, you need to know what’s happening in your data, but you shouldn’t have to be responsible for doing things in the database just to keep it running.
Felienne 00:55:00 Yeah. I think that concludes everything I wanted to ask you today. Is there anything that we missed, anything you would like to share about fauna that I haven’t asked you?
Evan Weaver 00:55:09 I would just encourage you. We have drivers for almost all popular application languages. You know, try it out. You can sign up for free. You don’t need a credit card. Your database is instantly available. Try it out and let us know what you think.
Felienne 00:55:23 Should we try it out? And what is the link that we can put in the show notes?
Evan Weaver 00:55:25 fauna.com, F-A-U-N-A.
Felienne 00:55:30 We’ll definitely put that in the show notes. Is there any other place that we can follow you? Are you on Twitter? Do you have a blog?
Evan Weaver 00:55:36 I am on Twitter. My Twitter handle is Evan. Just E-V-A-N.
Felienne 00:55:41 Oh, that’s easy. There’s a benefit of being an early employee.
Evan Weaver 00:55:46 I’m also Evan on GitHub though. So
Felienne 00:55:49 We will put that in the show notes as well. So thank you very much for being on the show with us today.
Evan Weaver 00:55:54 Thanks so much for having me
Outro 00:56:27 Thanks for listening to SE Radio, an educational program brought to you by IEEE Software magazine. For more about the podcast, including other episodes, visit our [email protected]. To provide feedback, you can comment on each episode on the website or reach us on LinkedIn, Facebook, Twitter, or through our Slack [email protected]. You can also email [email protected]. This and all other episodes of SE Radio are licensed under Creative Commons license 2.5. Thanks for listening.
[End of Audio]
SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)