
SE Radio 681: Qian Li on DBOS Durable Execution/Serverless Computing Platform

Qian Li of DBOS, a durable execution platform born from research by the creators of Postgres and Spark, speaks with host Kanchan Shringi about building durable, observable, and scalable software systems, and why that matters for modern applications. They discuss database-backed program state, workflow orchestration, real-world AI use cases, and comparisons with other workflow technologies.

Li explains how DBOS persists not just application data but also program execution state in Postgres to enable automatic recovery and exactly-once execution. She outlines how DBOS uses workflow and step annotations to build deterministic, fault-tolerant flows for everything from e-commerce checkouts to LLM-powered agents. Observability features, including SQL-accessible state tables and a time-travel debugger, allow developers and business users to understand and troubleshoot system behavior. Finally, she compares DBOS with tools like Temporal and AWS Step Functions.

Brought to you by IEEE Computer Society and IEEE Software magazine.



Show Notes

Related Episodes

Other References


Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Kanchan Shringi 00:00:19 Hello everyone. Welcome to this episode of Software Engineering Radio. Our guest today is Qian Li. Qian Li is a co-founder and Chief Architect at DBOS. Before founding DBOS, Qian completed her PhD in Computer Science at Stanford in 2023. Her PhD research focused on abstractions for efficient and reliable cloud computing. Qian is also a co-organizer of South Bay Systems, an independent talk series focusing on systems programming. Before we talk a little bit more about DBOS and its research origins, I’d like to point listeners to Episode 596, Maxim Fateev on Durable Execution with Temporal. There are also a few other episodes we have done on related topics, and I’ll put those in the show notes: Episodes 351, 223, and 198. So happy to have you here, Qian, to talk about DBOS and durable execution. Welcome to the show. Would you like to add something to your bio before we jump right in?

Qian Li 00:01:22 Thanks for inviting me to the show. Yeah, so originally DBOS started as a joint research project, a collaboration between Stanford and MIT. We started the project in 2020. It was led by Postgres creator Prof. Mike Stonebraker and by Spark creator and Databricks co-founder Matei Zaharia. During the research project we really tested the capabilities of databases to see how databases can help you create reliable programs and how they can help you make your programs more observable and debuggable. We built several prototypes and wrote several papers, and when we presented them, people were really excited about the capabilities that DBOS can bring. So when we graduated in 2023, we decided to co-found DBOS based on the research project. As a company, we now focus on building durable software, and we believe that all software should be reliable, observable, and scalable by default. So now DBOS stands for durable backends that are observable and scalable.

Kanchan Shringi 00:02:31 So you mentioned the use of databases for basically running your applications, but why is that a new concept? Everybody uses databases for running apps or most people do. So what is the difference here? What is the secret sauce that you are talking about?

Qian Li 00:02:49 Yeah, so it’s true that people have been storing their business-critical data in databases for 30 or 40 years. But the new concept is that we also want to persist the program’s execution state in the database. Here’s your program, and it has multiple steps; we want to persist each step’s inputs and outputs in the database so that if your program crashes or the machine fails, we’ll be able to resume from exactly where it left off. So the idea is that, in addition to application data, we also store your program’s execution state in the database. Especially if you have long-running and dynamic workflows in your programs, you really don’t want to restart from scratch every time you hit a system error or a machine failure.

Kanchan Shringi 00:03:40 So Qian, maybe you can explain to listeners how exactly you define a workflow, as opposed to a service?

Qian Li 00:03:48 Yeah, so to get started I think we can talk about what a workflow is in this context. Traditionally people think of workflows as state machines where you have to define a DAG, stuff like that. But actually in DBOS, anything can be a workflow. To give a concrete example, a workflow is a sequence of operations or function calls. A very typical example in durable execution is, let’s say, a checkout service. If you’re implementing a checkout service, usually you first have to reserve inventory: you have to update a database to make sure that you have enough inventory. Then, if you successfully reserve the inventory, you call out to the payment processor. This could be an external service like Stripe or PayPal; you say, I want to charge this user this much. And after that you need to wait for the response from those services.

Qian Li 00:04:45 And then based on the result, you decide whether to fulfill the order and send a confirmation email to the user, or you have to undo your inventory reservation, cancel the order, and send a cancellation email. So this process can be abstracted as a workflow. And what guarantees do we want for this workflow? First, we want to make sure that once I click that checkout button, all steps will eventually complete. I don’t want to pay for my purchase but never receive my item, or reserve the inventory but never charge the user. So that’s the first guarantee. The second guarantee is effectively exactly-once execution: if I charge a user, I want to charge them only once, and if I reserve the inventory, I also want to reserve it only once.

Kanchan Shringi 00:05:38 So you talked about workflow and you mentioned Transaction. What’s the relationship?

Qian Li 00:05:43 So in this context, a transaction is essentially an interaction with a local database, and a workflow can be composed of transactions and also calls to external services, or what we just call steps in general. So in DBOS a workflow is a sequence of steps; some steps could be transactions that talk to a database, and the others are normal steps. Normal steps can be any functions: they can talk to external APIs or contain non-deterministic operations. But overall the workflow needs to be deterministic, and determinism and idempotency are the two foundations of durable workflows. Essentially what we build is a library that you can easily install and then add annotations to your program. Taking the checkout workflow I described, to make it durable in DBOS you essentially put a @DBOS.workflow annotation on the overall orchestration function, and then you put a @DBOS.step annotation on the definition of every individual step function.

Qian Li 00:06:54 Then when you call that function, we basically put wrappers around those functions to make them durable. To make it more concrete: when you start a workflow function, the wrapper first checkpoints the inputs of the workflow in the database. Then for each step we first check, have we executed this step before? If so, we directly read the recorded output from the database and return it instead of re-executing the step. This way we guarantee exactly-once. For external systems, the way to guarantee exactly-once is at-least-once execution plus an idempotency key, and DBOS automatically generates an idempotency key per workflow and per step, so we can guarantee that each step executes effectively exactly once. After each step finishes, we checkpoint its output into the database, and so on until we reach the end of the workflow. If something crashes in the middle, then on recovery the DBOS library looks into the database to see which workflows are still pending. For each pending workflow, it essentially replays the workflow from the start and walks through all the steps. For the finished steps, it sees the database records and skips them, so it eventually lands on the last completed step and resumes the workflow from where it left off. That’s the high-level idea of how DBOS works.
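As an editor’s illustration, the checkpoint-and-replay wrapper Li describes can be sketched in a few lines of plain Python. This is a toy, not the actual DBOS implementation: the results store here is an in-memory dict where DBOS uses Postgres tables, and every name (`durable_step`, `results`, and so on) is invented for the example.

```python
# Toy sketch of durable execution via checkpointed steps.
# In DBOS the `results` store is a Postgres table; here it is a dict
# keyed by (workflow_id, step_index). All names are illustrative.

results = {}          # (workflow_id, step_index) -> recorded output
step_counter = {}     # workflow_id -> next step index

def durable_step(workflow_id, func, *args):
    idx = step_counter.get(workflow_id, 0)
    step_counter[workflow_id] = idx + 1
    key = (workflow_id, idx)
    if key in results:            # step already ran: replay recorded output
        return results[key]
    out = func(*args)             # first execution: run and checkpoint
    results[key] = out
    return out

calls = []

def charge(user, amount):
    calls.append(("charge", user, amount))   # side effect we must not repeat
    return "charged"

def checkout(workflow_id):
    return durable_step(workflow_id, charge, "alice", 42)

checkout("wf-1")                  # first run executes the step
step_counter["wf-1"] = 0          # simulate a crash followed by recovery
checkout("wf-1")                  # replay reads the checkpoint, skips the charge
```

In real DBOS the replay happens automatically at recovery time; the manual counter reset above just simulates that replay.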

Kanchan Shringi 00:08:36 You mentioned the idempotency key; can you explain that further?

Qian Li 00:08:40 Yeah, so essentially an idempotency key is a way to guarantee that each step is executed once and only once. When working with an external API, say Stripe payments, you want to say: I only want to charge this user once. You can piggyback an idempotency key on your request to Stripe, and if Stripe receives the same key again, it will not issue another invoice to the user. Stripe can look up its own database and say, hey, I already have an outstanding invoice for this checkout session, so it charges the user only once. I think idempotency is a very important concept in durable execution. In DBOS, we automatically generate an idempotency key for the workflow if you don’t provide one, and then for each step we basically append the step sequence number to it.

Qian Li 00:09:35 So every step in the workflow is also uniquely identified. The reason we need an idempotency key is that you can’t really control external systems. Even if we only call the Stripe API once, we don’t know whether the network connection will fail, or maybe Stripe will fail, or some intermediate transient errors may happen. If we don’t get the result back, we may think the call failed, but in order to make sure the checkout goes through, we want to retry it, and we will retry it. We can’t generate a new session, because maybe the Stripe payment already succeeded and the network was just cut when Stripe tried to send the result back. So we have to retry, and when we retry, the way to tell Stripe that this is the same checkout request we sent before is to attach the idempotency key. So the idempotency key is the key to guaranteeing correctness and exactly-once execution when you work with external systems.
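The retry-safety Li describes can be shown with a tiny sketch where the payment provider deduplicates on the key. `PaymentProvider` and all names here are invented for illustration; only the key-per-workflow-plus-step-number scheme follows her description.

```python
import uuid

# Toy sketch of idempotency keys: the "payment provider" remembers keys
# it has seen, so a retried request does not create a second charge.

class PaymentProvider:
    def __init__(self):
        self.invoices = {}   # idempotency_key -> invoice

    def charge(self, idempotency_key, user, amount):
        if idempotency_key in self.invoices:       # duplicate request:
            return self.invoices[idempotency_key]  # return existing invoice
        invoice = {"user": user, "amount": amount}
        self.invoices[idempotency_key] = invoice
        return invoice

workflow_key = str(uuid.uuid4())   # DBOS generates one key per workflow...
step_key = f"{workflow_key}-0"     # ...and appends the step sequence number

stripe = PaymentProvider()
first = stripe.charge(step_key, "alice", 42)
retry = stripe.charge(step_key, "alice", 42)   # network timeout -> retry

assert first is retry              # same invoice: charged exactly once
assert len(stripe.invoices) == 1
```

Stripe’s actual API accepts an `Idempotency-Key` request header that behaves similarly on the server side.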

Kanchan Shringi 00:10:35 Could you talk more about the DBOS architecture? You did talk about the workflow and the step. Are there any other key elements of the architecture you’d like people to know about, especially if they have origins in your past research?

Qian Li 00:10:51 Yeah, so DBOS workflows, steps, and transactions are implemented directly as SQL interactions in the library. Beyond that, we also have several other primitives. One that is widely used is called DBOS queues. We implement queues on top of Postgres, too. With queues, instead of directly invoking a workflow synchronously, you can enqueue a workflow and other workers will be able to pick it up. The queue is also just a database record: you enqueue by saying, here’s the task, here’s the queue name, and then workers can pull from each queue. The benefit of queues is that they make it really easy to control concurrency and rate limiting. For example, we can have different queues, like an OpenAI queue and a Claude queue, and we can say that for the OpenAI queue we only want to send requests five times per minute.
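The queue-as-database-records idea can be sketched with a small self-contained example. This uses SQLite rather than Postgres so it runs anywhere, and the table layout, column names, and helper functions are invented for illustration; they are not the real DBOS schema.

```python
import sqlite3

# Toy sketch of a queue stored as database records, in the spirit of
# DBOS queues on Postgres. A worker claims pending rows up to a limit,
# which is a crude form of concurrency control.

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE queue (
    id INTEGER PRIMARY KEY,
    queue_name TEXT,
    task TEXT,
    status TEXT DEFAULT 'pending')""")

def enqueue(queue_name, task):
    db.execute("INSERT INTO queue (queue_name, task) VALUES (?, ?)",
               (queue_name, task))

def dequeue(queue_name, limit):
    # Claim up to `limit` pending tasks (a real system would also
    # track claims per time window for rate limiting).
    rows = db.execute(
        "SELECT id, task FROM queue WHERE queue_name = ? "
        "AND status = 'pending' ORDER BY id LIMIT ?",
        (queue_name, limit)).fetchall()
    for row_id, _ in rows:
        db.execute("UPDATE queue SET status = 'running' WHERE id = ?",
                   (row_id,))
    return [task for _, task in rows]

for i in range(8):
    enqueue("openai", f"request-{i}")

batch = dequeue("openai", 5)   # at most 5 outstanding requests at a time
```

Because the queue is just rows in a table, inspecting or repairing it is an ordinary SQL query.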

Qian Li 00:11:56 So this is a way to group your invocations and to control how many outstanding requests you send to external systems. Another primitive based on the research is the messaging system. Workflows and steps are all within DBOS territory, but when you interact with external systems, you need a way to communicate with the workflow. To give an example with the checkout workflow: when you call Stripe, you don’t block and wait for the user to pay. What actually happens is that Stripe directly returns a confirmation saying, okay, we are processing your payment, and when your payment is done, it sends you a webhook invocation saying the payment is done. The way to do this in DBOS is that in the workflow we can have a DBOS receive on a specific topic, waiting for the specific Stripe payment session, and then we can implement a webhook that listens for the Stripe callback.

Qian Li 00:12:59 When the Stripe callback arrives, we can extract the idempotency key, which, as we mentioned, uniquely identifies which workflow we should call back to. Then we can do a DBOS send with that idempotency key, sending the Stripe payment result, either paid or payment rejected, stuff like that. And the workflow, using the receive, will get the information. This is really useful; I think almost all payment systems, for example, use this type of pattern. It also applies to anything human-in-the-loop: say you send a human verification email and you want to wait for the human confirmation to come back. You can’t just block there and wait; you have to wait for a callback, and the DBOS send and receive primitives allow you to do that.

Kanchan Shringi 00:13:57 And this library is what you call DBOS Transact? Anybody can use it as long as they annotate their code?

Qian Li 00:14:05 Yes. So DBOS Transact is an open-source library. It’s MIT-licensed, so anyone can use it. It’s currently available in TypeScript and Python, and we are adding more language support. For the audience: if you have any feedback on which languages you wish to see, please contact us and we’ll add more. You can install the Transact library and run it anywhere.

Kanchan Shringi 00:14:29 What is the DBOS Cloud?

Qian Li 00:14:31 Right, so DBOS Cloud is a serverless hosting platform for Transact applications. You can run your Transact applications anywhere, on your own laptop or on a Kubernetes cluster, but if you don’t have any resources and you don’t want to manage any clusters, you can deploy your app to DBOS Cloud.

Kanchan Shringi 00:14:51 So if I use the DBOS Transact library, I must get a Postgres database. If I wanted to use another database system, could I?

Qian Li 00:14:59 Currently DBOS Transact supports any Postgres-compatible database, so you can use it with your own Postgres server. Many of our users simply add DBOS Transact because they already have a Postgres server, so they use the same server to store their execution state. But you can also use other offerings: each Cloud provider has a Postgres offering, and we’re also compatible with newer serverless Postgres options like Neon, Supabase, CockroachDB, or YugabyteDB. There are a lot of options here.

Kanchan Shringi 00:15:35 So is that just because that’s the database you tested with, or are there very specific features of Postgres and Postgres-compatible databases that you leverage?

Qian Li 00:15:45 Yeah, I guess it goes back to the question of why we chose Postgres. There are a couple of reasons. First, the ecosystem is huge. As I said, there are a lot of providers in the Cloud and a lot of on-prem solutions as well. People know how to operate Postgres; it’s a very mature and battle-tested technology, and people really trust it. The second reason is that it’s a relational database, so it has built-in transactions and it’s really reliable, with backups and replication if you want them. And finally, the extension ecosystem is great. Some of our users actually use Postgres as a vector store, and some people also use Postgres to store time-series data. With a single database, you can achieve a lot: you can store basically all data, from transactional data to analytics data to vector data. We really like this versatility of Postgres.

Kanchan Shringi 00:16:47 So Qian, a lot of our listeners I think are really interested in the how, like peering behind the black box. If they have access to the Postgres database where the state is getting stored and they peek in there, what will they see, and where should they peek?

Qian Li 00:17:04 Yeah, so DBOS stores all the information in a separate logical database inside the Postgres server. You can think of a logical database as another namespace within the Postgres server that isolates it from your main application database, so that we don’t disturb your normal tables or your normal queries. In that database, under the DBOS schema, you will find several tables that store the information. The first is the workflow status table; that’s the core table that stores workflow information. When we first execute a workflow, it will say the workflow is pending, and then as your workflow progresses and eventually succeeds or fails, we’ll update the workflow status to error with the actual error, or to success with the actual output. That table is the core that stores the state changes of DBOS.

Qian Li 00:18:04 The second table you want to look at is the operation outputs table. That’s where we store the serialized output of each step. By looking into that table you will see the workflow ID, the step ID inside the workflow, and the result of every single step. So by looking at those two tables, you can piece together which workflows executed when, and which steps executed at what timestamp. Besides that, we also have the workflow queues table. The queues table basically groups workflows into different queues; different queues are just different names we assign, and the table allows us to quickly look up which workflows are assigned to be executed on each queue. I would say the real beauty is in SQL: you can use SQL to manage your workflows, and you can also use SQL to simply query them and observe what happened in your system.
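Queries in that spirit look like ordinary SQL. The table and column names below approximate what the DBOS system schema exposes, but they may differ between DBOS versions, so treat them as an illustrative sketch rather than the literal schema:

```sql
-- Illustrative only: table and column names approximate the DBOS schema.

-- Which workflows are still pending?
SELECT workflow_uuid, name, status, created_at
FROM dbos.workflow_status
WHERE status = 'PENDING';

-- What did each step of a given workflow return?
SELECT function_id, output, error
FROM dbos.operation_outputs
WHERE workflow_uuid = 'some-workflow-id'
ORDER BY function_id;
```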

Kanchan Shringi 00:19:01 Do you provide out of the box visualizations for what you see in these tables?

Qian Li 00:19:07 Yes. If you log into the DBOS console, at console.dbos.dev, you’ll see we have a table visualization. You can select or filter based on the workflow names or the workflow status. We are also actively developing a graph visualization of the workflow execution graph: which workflow has started, how many steps it has finished, and the parent-child relationship between different workflows; for example, one workflow can invoke sub-workflows that execute other tasks. You’ll be able to connect the dots by looking at that graph.

Kanchan Shringi 00:19:44 Qian, do you have any real-world examples, maybe from talking to your customers, where unreliable workflow executions caused really problematic system failures or inefficiencies?

Qian Li 00:19:58 Yeah, so let’s see, there are several use cases. My favorite one is a customer that has to persist data across multiple systems. Say Shopify sends some data through Kafka, and each message contains some customer data. They want to persist it in their local Postgres database, in their CRM, and in their ERP system, and they want to make sure that the data is consistent across all those systems. So they chose DBOS. If you don’t have correct customer data, you can lose customers, and that’s what they really don’t want to see. DBOS guarantees that whenever you receive a message from Kafka, the message will appear across all systems.

Kanchan Shringi 00:20:46 Maybe that answers one of my questions, which was: why would I not roll my own, or use code generators to help me write the code for durable execution? Part of what you say tells me that there are a lot of guarantees, and compliance, that using DBOS might help me with. Is that fair? Are you going through some kind of compliance process that people can leverage?

Qian Li 00:21:11 Yeah, so I think we can talk a bit more about the observability angle of DBOS. Because everything is stored in Postgres, it’s really easy to query and visualize what’s going on. My favorite quote from customers is that they said DBOS is great because everything is in Postgres: I can just use SQL queries to see what workflows have run, what workflows failed, and what happened at each step. Based on the information we store in Postgres, we are also developing more observability features, like graph visualization of the workflows and steps, where a workflow may spawn multiple workflows and we visualize the parent-child relationship. More than that, we can provide management over your workflows. Because those are just database records, if you want to, say, resume a workflow, you can just re-enqueue it, put it back into a queued state.

Qian Li 00:22:10 If you want to cancel a workflow, you can just mark the workflow as canceled, and the downstream process will cancel the operation. And, as we can talk more about later when we discuss operating DBOS, if you want to restart a workflow on a new version of your code, you can just copy the workflow’s input information, assign it a new idempotency key, and start execution on the new version. And because the data is in Postgres, a relational database, it’s structured data, and structured data is really easy to analyze and observe.

Kanchan Shringi 00:22:45 A little bit of a segue here: I’d like to contrast DBOS with solutions already in the market, and I did mention our earlier episodes, the latest one being with Maxim Fateev on Temporal. How does DBOS contrast with Temporal, and is your definition of workflow and durable execution consistent with Temporal’s?

Qian Li 00:23:12 Yeah, so that’s a good question, and we get asked that a lot. I think the workflow definitions in DBOS and Temporal are essentially the same, but the core difference is how we approach the implementation of durable execution. We call Temporal’s pattern the external orchestration pattern. With external orchestration, you have to run a Temporal server, and when you run your workflow, instead of directly executing each step, your workflow function needs to enqueue the step, and the Temporal server notifies a worker node to process it; they call it an activity, we call it a step. After the worker node finishes the step or activity, it sends the result back to the Temporal server, which persists the result and then sends it back to the main workflow function.

Qian Li 00:24:12 So basically, for every invocation of a step there are multiple network hops. By contrast, DBOS essentially embeds durable execution in your program. When you call the workflow, it’s just a function call, and when a workflow calls a step, it’s again a function call, but one that is intercepted by the DBOS library. Everything happens inside your program. We believe the benefit of DBOS is that it’s really simple: all you need is your program and your Postgres database. You don’t have to deploy an extra orchestration server, and you don’t have to deploy distributed workers to process your steps.

Kanchan Shringi 00:24:54 I get the part about not having to deploy an orchestration server, but then how do you achieve orchestration?

Qian Li 00:25:02 So DBOS implements durable execution itself as the orchestration. Just to clarify: we achieve durable execution by intercepting function calls. You can think of it as, when you call a decorated function, instead of directly executing it, we wrap around the workflow function so that we first persist the input and then execute the workflow function. The workflow function calls each step, and each step is also wrapped by DBOS. In the wrapper, DBOS first checks if the step has finished before; if not, it executes the step and persists the result, and if it has been executed before, it directly returns the recorded result. Everything happens in the language layer, and that’s why we don’t need to send anything over to a separate worker or a separate message queue to achieve this.

Kanchan Shringi 00:25:59 Is there any other areas where it’s important to compare and contrast with Temporal?

Qian Li 00:26:06 I think there are two main areas. One is simplicity. Adding DBOS to your program is really simple: you install it as a library, add it to your existing program by decorating your functions, and you’re done. To implement something in Temporal, you basically have to restructure your program; you have to think in a distributed-systems way, where every call to a step is essentially an RPC call to another worker, and you wait for the result to come back. So we think simplicity is the number one differentiator, and DBOS is also very simple to operate. To run DBOS in production, all you need is a Postgres server, and people already know how to operate Postgres servers, so we don’t add much operational overhead. By contrast, if you use an external orchestration service, you have to host their server or rely on their Cloud offering. Every time you want to call a step, you have to communicate with their Cloud and say, I want to execute a step, and their Cloud sends a message to a worker to execute it, and so on. That also has performance implications: in DBOS every step basically adds a database write, which is a few milliseconds, while if you do everything over the network, that can easily reach a few hundred milliseconds.

Kanchan Shringi 00:27:39 Other than the simplicity of use, do you have any thoughts about when a developer who needs durable execution should consider DBOS versus Temporal?

Qian Li 00:27:49 Yeah, I mean, Temporal is really great technology; it’s used by a lot of companies. I think one benefit of the Temporal model is that if you want a workflow that consists of steps written in different languages, it could be easier to use Temporal. Say you have a workflow written in Python, but some steps are written in Go and others in Java or Rust; if you have such a heterogeneous workflow, it will currently be easier to do in Temporal. On the other hand, if your program is just in Python or TypeScript, it will definitely be easier to do in DBOS. In fact, a current trend is what we call full-stack applications: users write their front-end code in TypeScript, and they actually also want to write their backend in TypeScript. DBOS makes more sense in this case because it’s very lightweight: you just add it as another TypeScript library and use it in your program.

Kanchan Shringi 00:28:47 How much of this comparison also applies to some of the other technologies, like AWS Step Functions?

Qian Li 00:28:55 Yeah, so the external orchestration part is the same. We actually ran some performance benchmarks against Step Functions, and we found that in DBOS each step takes a few milliseconds, whereas in Step Functions every time you schedule a step it is pushed to a queue, and then you have to dequeue it and wait for results, so it comes back at around 200 milliseconds. The performance gap is pretty large. Another difference is that Step Functions requires you to use a declarative JSON description language to specify the DAG, and that can be another overhead. Though, to be fair, Step Functions also provides a graphical interface where you can drag and drop your workflow into a DAG. This is really nice, and it can be used by, for example, developers or people without too much coding experience.

Qian Li 00:29:51 However, the problem is that it’s very easy to use when you have simple tasks, but when you have more complicated tasks, it’s hard to manage. To share an anecdote from our conversations with users: someone was switching from Step Functions to DBOS because, they said, they eventually gave up on code review, since they had to review 3,000 lines of JSON. In that case, using DBOS is better because all your workflow logic is essentially just your code. You write workflows as code, so you can do code review and do your debugging normally, as you would when you develop your functions.

Kanchan Shringi 00:30:29 A follow-up question there: in some of the previous episodes on this topic, the pros and cons of the graphical representation were discussed. Certainly there was a sentiment that developers are not very fond of the graphical representation, but more business users find it useful. What is your experience there? And if I, as a developer using DBOS, did want to communicate that with a business user, what are my options?

Qian Li 00:30:59 That’s a great question. I have the same experience: when talking to developers, they usually say they want to see code, but when we talk to more business-focused people, they want to see what’s going on. They want a graph visualization of what happened: how many steps ran, how many workflows ran, and if something failed, which step failed. So in DBOS, we are actively developing a graphical visualization based on the information stored in the database. You don’t define workflows as a graph, but we provide observability into what happened, what has executed before. I think that’s a good balance: developers use code to develop their workflows, while business-focused people see the execution of those workflows. I think that’s a good combination.

Kanchan Shringi 00:32:02 You talked earlier about scheduling and how one would achieve that. Can you talk about rollback with DBOS? Rollback was another big plus of a workflow system, being able to achieve that. How would I, as a developer, achieve that if I were using DBOS?

Qian Li 00:32:20 Yeah, so if you use DBOS, you can handle exceptions normally, as you would when you develop your API or your service. In the checkout example, if the workflow sees that the payment failed, you will need to call the undo-inventory-reservation function, and you also need to call some functions to cancel the order. In DBOS you basically express this in code as error handling: if you see this error, call this sequence of functions; if you see other types of errors, do something else. So we do ask users to explicitly specify the rollback actions, and the rollback actions are also steps. The benefit of using DBOS is that we guarantee the workflow, all of its steps, will run to completion.
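The explicit-rollback pattern Li describes is ordinary exception handling around compensation functions. The function names below are invented for the example; in DBOS, each of them would additionally carry a step annotation so the compensation actions are themselves durable.

```python
# Toy sketch of explicit rollback (compensation) steps: on a failed
# payment, the workflow undoes the reservation and cancels the order.
# All names are illustrative, not real DBOS or payment APIs.

log = []

def reserve_inventory(item):
    log.append(f"reserved {item}")

def undo_reservation(item):
    log.append(f"unreserved {item}")

def charge_payment(item):
    raise RuntimeError("payment declined")   # simulate a payment failure

def cancel_order(item):
    log.append(f"cancelled order for {item}")

def checkout(item):
    reserve_inventory(item)
    try:
        charge_payment(item)
        log.append(f"fulfilled {item}")
    except RuntimeError:
        # Compensation path: undo the reservation, then cancel the order.
        undo_reservation(item)
        cancel_order(item)

checkout("widget")
```

Durable execution’s contribution here is that if the process crashes mid-compensation, the remaining compensation steps still run on recovery.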

Kanchan Shringi 00:33:10 How do I, as a developer, take an existing body of code and add the annotations that DBOS wants? What process would I go through?

Qian Li 00:33:20 Yeah, so we think it’s pretty easy. You first install the library. If you use Python, it’s pip install; if you use TypeScript, it’s npm install. After that you can just import DBOS. Then, say you have a function you want to be a workflow: you just decorate it as @DBOS.workflow. And then within that workflow function you will see what function calls it makes, and you decorate those functions as @DBOS.step.

Kanchan Shringi 00:33:50 I think my question was more like how do I figure out which functions I should now go and decorate?

Qian Li 00:33:56 Oh right, so I think you’ll essentially decorate any function that talks to external APIs or talks to the database, anything that generates side effects outside of your program.
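The checkpoint-and-replay mechanism behind these decorators can be sketched with a toy implementation. This is not the real DBOS library: the in-memory `checkpoints` dict stands in for the Postgres system tables, and the decorator names only mimic the @DBOS.workflow / @DBOS.step shape described above.

```python
# Toy durable execution: each step records its result keyed by
# (workflow_id, step_index); on re-execution after a crash, completed
# steps return their recorded result instead of running again.

import functools

checkpoints = {}  # stands in for the Postgres-backed step-output table

def workflow(fn):
    @functools.wraps(fn)
    def wrapper(workflow_id, *args):
        ctx = {"id": workflow_id, "step": 0}
        return fn(ctx, *args)
    return wrapper

def step(fn):
    @functools.wraps(fn)
    def wrapper(ctx, *args):
        key = (ctx["id"], ctx["step"])
        ctx["step"] += 1
        if key in checkpoints:        # step already ran before the crash
            return checkpoints[key]
        result = fn(*args)            # the side-effecting call (API, DB, ...)
        checkpoints[key] = result     # checkpoint the output
        return result
    return wrapper

calls = []  # records actual side effects, to show they happen only once

@step
def charge(amount):
    calls.append(("charge", amount))
    return f"charged {amount}"

@step
def send_email(addr):
    calls.append(("email", addr))
    return f"emailed {addr}"

@workflow
def order_flow(ctx, amount, addr):
    return [charge(ctx, amount), send_email(ctx, addr)]
```

Running `order_flow("wf-1", 10, "a@x.com")` twice with the same workflow ID produces the same results, but each side effect executes only once: the second run is served entirely from checkpoints, which is the exactly-once behavior the decorators exist to provide.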

Kanchan Shringi 00:34:07 Thanks, Qian. I’d like to now go into a different section where we talk about the use cases we see for durable execution, especially with AI agents. Before we go there, is there anything else you’d like to add?

Qian Li 00:34:23 Yeah, actually I’d like to talk a bit more about workflow recovery and failure recovery when you use it in production. As we discussed earlier, it is true that developers can write checkpointing code manually, but you want to use a durable execution system or a library like DBOS or others because you want automatic recovery. To give you a concrete example: in production, if you use Kubernetes, you’ll deploy your code to multiple containers or pods. In a large production deployment, you could have maybe a hundred or a thousand pods, and you’ll load-balance between those pods to serve requests. The problem is that at any point in time, within any hour, some pods will definitely fail; the probability of failure is high when you have a large deployment. And if you already use Kubernetes, you may think, yes, I can just restart those pods.

Qian Li 00:35:25 But the problem is that, sure, when you restart a pod, you still need some application logic to decide how to resume the work you had done before the crash. If you use DBOS, we provide a service called Conductor. This service connects to your running applications and detects when some of the workers are failing. If it detects that some workers failed, it will redistribute the pending workflows running on those workers to other healthy workers. This way we automatically recover from failures and guarantee that all workflows will eventually run to completion. So I think this is one reason you really want to delegate this kind of recovery to a library or service like DBOS.
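The redistribution idea Li describes can be sketched as a small control loop. This is a minimal illustration, not the real DBOS Conductor: the heartbeat timeout, the worker and workflow names, and the round-robin reassignment policy are all hypothetical simplifications.

```python
# Toy control plane: track which worker owns each pending workflow;
# when a worker stops heartbeating, reassign its pending workflows to
# healthy workers, which resume them from their checkpoints.

import itertools
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds without a heartbeat before a worker is "dead"

workers = {"w1": time.monotonic(), "w2": time.monotonic()}   # last heartbeat
assignments = {"wf-a": "w1", "wf-b": "w1", "wf-c": "w2"}     # pending workflows

def heartbeat(worker):
    workers[worker] = time.monotonic()

def redistribute(now=None):
    """Detect dead workers and hand their pending workflows to healthy ones."""
    now = now if now is not None else time.monotonic()
    dead = [w for w, t in workers.items() if now - t > HEARTBEAT_TIMEOUT]
    healthy = [w for w in workers if w not in dead]
    rr = itertools.cycle(healthy)            # simple round-robin reassignment
    for wf, owner in assignments.items():
        if owner in dead:
            assignments[wf] = next(rr)       # new owner resumes from checkpoints
    return dead
```

The key point from the conversation is that reassignment alone is enough: because every workflow’s progress is checkpointed in the database, the new owner only needs to re-run the workflow and the completed steps are skipped.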

Kanchan Shringi 00:36:15 How does that coexist with the recovery mechanisms in Kubernetes?

Qian Li 00:36:21 Yeah, so in Kubernetes, basically when you restart, it will restart your application, but it doesn’t have any application semantics; it doesn’t know what workflows failed before. Essentially, DBOS as a library will have some background threads that listen for tasks, say, you have to recover this and that. So DBOS will dispatch certain recovery commands to your workers. And this works perfectly alongside other mechanisms in, say, Kubernetes or other deployments.

Kanchan Shringi 00:36:50 So what you mean is that I develop my application with the DBOS Transact library, and then when I deploy it to production I can leverage DBOS Conductor, whether or not I’m deploying to Kubernetes.

Qian Li 00:37:04 Yes, exactly.

Kanchan Shringi 00:37:06 Do you use DBOS Conductor automatically in the DBOS Cloud?

Qian Li 00:37:10 Yes, we have the same capability in DBOS Cloud. So if you deploy to DBOS Cloud, you have all the automatic recovery plus autoscaling.

Kanchan Shringi 00:37:18 Alright, let’s spend some time on AI agents. That was one big use case on your website. Why is the problem surfacing now with AI agents, and why do they specifically require durable execution?

Qian Li 00:37:34 So I think this is a new and emerging use case for DBOS. Actually, besides AI agents, we also have another AI use case, which is AI data pipelines; if we have time, we can dive into that as well. For AI agents, basically the program will be driven by AI or LLMs. So instead of developers coding what functions to call, the function calling or tool calling will be decided based on the LLM responses. I actually built a very interesting refund agent using LangGraph and DBOS. Basically, what the agent does is that when it receives a refund request, it will talk to the LLM to decide which tool to use; maybe based on the customer record, based on the order, it’ll call a refund workflow.

Kanchan Shringi 00:38:25 So very simplistically, am I right in that the agent is essentially an LLM with tools at the most basic level? Yeah. Okay, sorry continue Qian.

Qian Li 00:38:36 So then the agent can say, basically, oh, if the purchase was above a threshold, I will need to send an email to an admin to verify whether they want to approve this refund or not. And then the refund workflow will have to wait for human input and then decide what to do next. So I would say with AI agents, the most challenging part is, first, dynamic execution of workflows: you can no longer have a very static workflow; the workflow branches and function calls will be very dynamic based on the LLM output. Second, it’s really unreliable in many ways. From my point of view, we treat AI as an unreliable service in the stack: your AI call may fail, and every time you call it, it may give you a different result. And the third thing is that human-in-the-loop is really tricky. How do we make AI work together with humans? It’s a new and challenging topic these days.

Kanchan Shringi 00:39:38 Can you talk more about the agent you developed and how you use DBOS there?

Qian Li 00:39:43 Yeah, so for the agent I developed, I basically decorate my tool function as a DBOS workflow so that it guarantees a refund happens only once. If the user asks to refund again, then based on the recorded information we’ll say you’ve already been refunded before, so we’ll skip the refund. And it guarantees that once we kick off the refund workflow, it will always finish. Say something happens to the program and it crashes in the middle: we’ll just resume the refund process from where it left off. Let’s say it already processed the payment, returned the money back to the user; we’ll also restore the inventory and so on.

Kanchan Shringi 00:40:26 Basic question though: if the agent is the LLM with tools, and the LLM is the one calling the tool, which in many cases is an API call, how does DBOS get involved in that part of the workflow?

Qian Li 00:40:42 Yeah, exactly. So the cool thing is that, as I introduced before, DBOS is a library and annotations that you can wrap around your functions. So when the LLM calls the function, it’ll call the wrapped function, and that wrapped function is a durable workflow. That’s why it’s super easy to add DBOS to any AI framework: instead of passing your bare function, you pass the annotated function into your AI tools.
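The idea of registering the wrapped function as the tool can be sketched without any real AI framework. The tool registry and the `fake_llm_tool_call` dispatcher below are hypothetical stand-ins: agent frameworks generally accept a list of callables as tools, and the point is only that handing them the wrapped callable makes every tool invocation durable and idempotent.

```python
# Sketch: register the durably-wrapped function (not the bare one) as the
# agent's tool, so repeated LLM tool calls refund an order exactly once.

refunds_done = {}   # stands in for recorded workflow state in the database
side_effects = []   # records actual external calls, to show exactly-once

def durable_workflow(fn):
    """Idempotency wrapper: a repeated request for the same order returns
    the recorded outcome instead of re-running the side effect."""
    def wrapper(order_id):
        if order_id in refunds_done:
            return refunds_done[order_id]
        result = fn(order_id)
        refunds_done[order_id] = result
        return result
    return wrapper

@durable_workflow
def refund(order_id):
    side_effects.append(order_id)        # e.g. the actual payment-API call
    return f"refunded {order_id}"

# The framework calls whatever callable is registered; we register the
# wrapped function rather than the bare one.
tools = {"refund": refund}

def fake_llm_tool_call(tool_name, arg):
    return tools[tool_name](arg)
```

If the LLM (or a retry after a crash) issues the same refund request twice, the second call is answered from the recorded state and the payment API is hit only once.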

Kanchan Shringi 00:41:11 That’s interesting. Were you going to talk about data pipelines as well? Is there an example there?

Qian Li 00:41:17 Yeah, so data pipelines are one huge use case for DBOS. The typical pipeline is a workflow that first scrapes websites or other data sources and stores the PDFs or images in some S3 bucket. The second step will, in parallel, use LLMs to analyze those images or PDFs. The third step will be to persist the return values from the LLMs into multiple data stores, either a vector database or Postgres or, you know, multiple data stores. And then maybe another step is to make some decision, or do some analytics, based on the results from the previous steps. A concrete use case is, for example, stock market monitoring. Some of our users scrape stock market websites, feed the data into an LLM to do some analysis, and then, as a final step, make a business decision based on that: do I want to invest in this stock? Do I want to sell this stock?

Kanchan Shringi 00:42:26 Why durable execution?

Qian Li 00:42:28 Yeah, durable execution is essential because you don’t want to lose data once you process it. Usually there are two reasons. One, the data volume is huge, right? You can easily process 1,000 documents, or not just 1,000, maybe 10,000 documents, and if something fails in the middle, you don’t want to restart from the beginning. You don’t want to say, oh, now something happened to my pipeline, either the AI failed or I hit some AI rate limit, and I have to reprocess all 10,000 documents again; I want to resume from where it left off. So that’s the first thing. The second thing is you want everything to complete. If I process 10,000 documents, I want all of them to complete; otherwise my business decision, for example which stock to trade, may be based on incomplete data, and that can cause financial loss and other consequences.

Kanchan Shringi 00:43:28 So in this specific example, if I did decorate the loading step with DBOS and the error was on, let’s say, the hundred-thousandth document in my set of 500,000, what is the behavior?

Qian Li 00:43:44 So basically in this case you want to use DBOS queues to parallelize those tasks. You don’t want to process all those documents in one giant step; you want each document to be a queued task spawned by your workflow. For documents one to 10,000, you enqueue a task for each, and it becomes a fan-out pattern, and then we wait until all tasks finish; we fan in by waiting on those results. So if any of the tasks failed, then when we recover the workflow, we’ll say, okay, now I have to check all those queued tasks, and if some tasks failed, we’ll restart execution of those tasks. But for those tasks that have already finished, we’ll just get the result from the database and return it. That’s how you can correctly recover from those failures.
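The fan-out/fan-in recovery behavior can be sketched sequentially in plain Python. This is not the DBOS queue API: the `results` dict stands in for the database-backed task results, the `fail_once` parameter simulates a transient failure, and real queued tasks would of course run in parallel.

```python
# Sketch: one task per document; finished results are recorded, so a
# recovery pass re-executes only the tasks that failed.

results = {}    # stands in for database-backed per-task results
attempts = {}   # counts executions per document, to show only failures re-run

def process_doc(doc_id, fail_once=()):
    attempts[doc_id] = attempts.get(doc_id, 0) + 1
    if doc_id in fail_once and attempts[doc_id] == 1:
        raise RuntimeError(f"transient failure on {doc_id}")
    return f"analysis of {doc_id}"

def run_pipeline(doc_ids, fail_once=()):
    failed = []
    for d in doc_ids:                # fan out: one task per document
        if d in results:             # already finished: reuse stored result
            continue
        try:
            results[d] = process_doc(d, fail_once)
        except RuntimeError:
            failed.append(d)
    if failed:
        return ("incomplete", failed)
    return ("complete", [results[d] for d in doc_ids])   # fan in
```

Running the pipeline again after a partial failure touches only the failed document; the rest are served from recorded results, which is the behavior described for the 500,000-document case above.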

Kanchan Shringi 00:44:41 You had earlier mentioned using Conductor for DBOS in production. Besides that, are there other recommendations for running DBOS at scale?

Qian Li 00:44:53 Yeah, so the question is mostly about best practices for using DBOS. To run DBOS at scale, you first really want to make sure that your workflows are deterministic. That’s the core of guaranteeing your workflows are recoverable, because if you have non-determinism in your workflow, when we recover it we may go through another execution path that we don’t know how to recover. The second thing is to use queues wisely. For example, when you deal with LLMs, they typically have rate limits and also concurrency limits on how many outstanding requests you can have. In this case, use DBOS queues. With queues you can add a rate limiter, say, I want at most five API calls within 30 seconds, and I want at most 10 outstanding requests at a time. With those primitives you can build your apps with parallel tasks while staying within the limits of those LLMs.

Kanchan Shringi 00:46:01 Is there anything else you’d like to add for folks that would like to get started with DBOS Transact?

Qian Li 00:46:08 Yeah, so for DBOS Transact: download it, install it, run it locally on your laptop. The cool thing is that because it’s a library, we can also leverage the debugger. We actually developed a time-travel debugger that you can install from the VS Code marketplace, where you will be able to replay anything that happened in the past. Say I want to investigate why a workflow executed the way it did yesterday: I just pick the workflow and tell our debugger extension that I want to re-execute it. And when we re-execute, instead of writing to the database, we’ll pull the information from the database: this is the workflow, this was the input, and these were the outputs of each step. So if your workflow contains sending an email or calling out to Stripe, we will not actually call those external APIs; we’ll instead return the recorded information, and you’ll be able to step through your workflow as if it happened in the past. So that’s very cool.
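The replay mode Li describes can be sketched as a flag on the step wrapper. This is a toy illustration of the idea, not the actual DBOS time-travel debugger: in replay mode each step returns its recorded output instead of hitting the external API.

```python
# Sketch: a live run records each step's output; a replay run serves the
# recorded outputs, so stepping through the past triggers no side effects.

recorded = {}       # (workflow_id, step_index) -> recorded output
external_calls = []

def make_step(fn):
    def wrapper(ctx, *args):
        key = (ctx["id"], ctx["step"])
        ctx["step"] += 1
        if ctx["replay"]:
            return recorded[key]       # serve the past output, no side effect
        out = fn(*args)
        recorded[key] = out            # record for later replay
        return out
    return wrapper

@make_step
def call_stripe(amount):
    external_calls.append(amount)      # stands in for the real external API
    return f"charge-id-{amount}"

def run(workflow_id, replay=False):
    ctx = {"id": workflow_id, "step": 0, "replay": replay}
    return call_stripe(ctx, 42)
```

A replayed run returns exactly what the original run returned, but the external-call log shows the API was touched only once, which is what makes it safe to step through a payment workflow after the fact.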

Kanchan Shringi 00:47:10 Thanks so much for today, Qian. A lot of the goal of the conversation, at least what I was trying to achieve, was to distinguish between workflow as a graphical representation aid for business users versus grouping together the series of steps or services that need to be durable. And how does that play into the present, where a lot of code generation is happening with AI as well, and where exactly does DBOS fit into it? You’ve explained that you can definitely use the simplicity of DBOS and the performance of DBOS in grouping together the steps that you need as a workflow. And I presume one would be able to use AI to then introspect that code and generate the graphical representation as required for business users. But then the question would be: why would you not use AI to even write all the framework that comes with DBOS? Could you maybe spend a few minutes on your thinking about how AI fits into all this?

Qian Li 00:48:23 Yeah, it’s a really interesting question, and I’ve been thinking a lot about it recently. Especially with AI, we’ve seen a lot of AI-generated code. It’s true that someone may say, okay, AI may be able to generate that checkpointing code, but AI won’t be able to generate automatic workflow recovery. It won’t be able to generate the control plane that can automatically recover on failure. So we think DBOS and other workflow engines still play a central role here in the AI era, right? And with AI-generated code, I think we really want to make sure it is reliable. And by reliable I mean, yes, that code may be buggy, so you want a way to say: if I caught a bug, I want to be able to investigate the bug; I want to be able to restart my workflow from a specific step and then fix the bug.

Qian Li 00:49:13 So those are the capabilities where durable execution engines really shine, because with potentially unreliable AI-generated code, you want a way to correctly store that information for debugging. One advantage of DBOS is that because we store which step has executed, and at what timestamp, in a relational database, we really have this structured data for AI to analyze. It’s easy for a human to analyze, and it will also be easy for AI to analyze what’s going wrong. So I think it’s a really exciting new capability that we are exploring. And I think simplicity and structured data logging are important. On simplicity: we actually created a prompt for DBOS that incorporates the latest version of the code and basically gives the AI instructions on how to add DBOS to your code.

Qian Li 00:50:11 So with that, we can just tell the AI, make this code durable, and the AI will correctly do that in one shot. And that’s really interesting, because if we provide durable execution as a library, we’ll be able to make AI-generated code durable very easily. And then when you execute that code, we automatically checkpoint data in a database, so it’ll be very easy for a human or for an AI to investigate what’s going on. And because everything is in the database, it also allows users or AI to automatically fix that code, because to fix it you just need several SQL statements to modify the table, to change the recorded outputs, and then continue from where it left off. So we are really exploring the synergy between your code, your database, and the ways to modify your code, to generate new traces, to modify the database, to observe the database. It’s a really exciting area.

Kanchan Shringi 00:51:15 How can listeners contact you if they have any feedback or questions?

Qian Li 00:50:17 Yeah, so to contact me, you can visit my website, QianLi.dev. You can follow me on LinkedIn, Bluesky, or Twitter. But also don’t forget to visit DBOS.dev; this is our main website, and we’re looking forward to your feedback.

Kanchan Shringi 00:51:31 Thank you so much Qian, for coming on.

Qian Li 00:50:33 Thanks, Kanchan, for inviting me.

[End of Audio]
