Paul Frazee, CTO of Bluesky, speaks with SE Radio’s Jeremy Jung about the Authenticated Transfer Protocol (ATProto) used by the Bluesky decentralized social network. They discuss why ATProto was created, as well as how it differs from the ActivityPub open standard, the scaling limitations of peer-to-peer solutions, cryptographic decentralized identifiers, and creating a protocol based on experience with distributed systems. They also examine the role of personal data servers, relays, and app views, the benefits of using domain names, allowing users to create algorithmic feeds and moderation tools, and the challenges of content moderation.
Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
- Bluesky
- Paul Frazee on Bluesky
- ATProtocol
- ATProto for distributed systems engineers
- Bluesky and the AT Protocol: Usable Decentralized Social Media
- Decentralized Identifiers (DIDs)
- ActivityPub
- Webfinger
Transcript
Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.
Jeremy Jung 00:00:18 This is Jeremy Jung for Software Engineering Radio and today I am talking to Paul Frazee. He’s the current CTO of Bluesky and he previously worked on other decentralized applications like Beaker and Secure Scuttlebutt. Paul, welcome to Software Engineering Radio.
Paul Frazee 00:00:34 Oh, thanks for having me.
Jeremy Jung 00:00:35 For people who aren’t familiar with Bluesky, what is it?
Paul Frazee 00:00:38 So Bluesky is an open social network, simplest way to put it, designed in particular for high scale. That's kind of one of the big requirements that we had when we were moving into it. And it is really geared towards making sure that the operation of the social network is open amongst multiple different organizations. So we're one of the operators, but other folks can come in, spin up the software, all the open-source software, and essentially have a full node with a full copy of the network, acquire users, and have their users join into our network. And they all work functionally as one shared application.
Jeremy Jung 00:01:17 So it sounds like it’s similar to say Twitter, but instead of there just being one Twitter, there could be any number and there is part of the underlying protocol that allows them to all connect to one another and act as one system.
Paul Frazee 00:01:34 That’s exactly right. And there’s a metaphor we use a lot, which is compared to the web and search engines, which actually kind of matches really well. Like when you use Bing or Google, you’re searching the same web. So on the AT Protocol and Bluesky, you use Bluesky, you use some alternative client or application, all the same. We call it the atmosphere, all one shared network.
Jeremy Jung 00:01:53 And more than just the client. Because I think sometimes when people think of a client, they’ll think of, I use a web browser, I could use Chrome or Firefox, but ultimately, I’m connecting to the same thing. But it’s not just people running alternate clients, right? It’s also running their own social networks in a way.
Paul Frazee 00:02:11 Their own full backend to it. That's right. Yeah. The anchoring point on that being the fire hose of data that runs the entire thing is open as well. And so you start up your own application, you spin up a service that just pipes into that fire hose and taps into all the activity.
Jeremy Jung 00:02:28 And so talking about this underlying protocol, maybe we could start where this all began so people get some context for where this all came from.
Paul Frazee 00:02:39 All right, so let's wind the clock back here in my brain. We started out in 2022; right at the beginning of the year we were formed as essentially a consulting company outside of Twitter with a contract with Twitter. And our goal was to build a protocol that could have run Twitter, much like the way that we just described, which set us up with a couple of pretty specific requirements. For one, we had to make sure that it could scale. So that ended up being a really important first requirement. And we wanted to make sure that there were strong guarantees that the network doesn't ever get captured by any one operator. So the idea was that Twitter would become the first adopter of the technology, other applications, other services would begin to take advantage of it, and users would be able to smoothly migrate their accounts between one or the other at any time.
Paul Frazee 00:03:31 And it's really anchored in a particular goal of just deconstructing monopolies, getting rid of those moats that make it so that there's kind of a lack of competition between these things. And making sure that if there was some kind of reason that you decided you're just not happy with what direction this service has been going, you move over to another one, you're still in touch with all the folks that you were in touch with before. You don't lose your data, you don't lose your follows. Those were the kind of initial requirements that we set out with. The team by and large came from the decentralized web movement, which is actually a pretty large community that's been around since, I want to say, around 2012 is when we first kind of started to form. It really got shaped into a community somewhere around 2015 or '16, I want to say.
Paul Frazee 00:04:16 When the Internet Archive started to host conferences for us. And so that gave us kind of a meeting point where it all started to meet up. There's kind of three schools of thought within that movement. There was the blockchain community, the federation community, and the peer-to-peer community. And so blockchain, I don't need to explain that one. You got federation, which was largely ActivityPub, Mastodon, and then peer-to-peer was IPFS, Dat Protocol, Secure Scuttlebutt. Those kinds of BitTorrent-style technologies, really, they were all kind of inspired by that. So these three different kinds of sub-communities were all working independently on different ways to attack how to make these open applications. How do you get something that's a high-scale web application without one corporation being the only operator? When this team came together in 2022, we largely sourced from the peer-to-peer group of the decentralized community.
Paul Frazee 00:05:12 Personally, I'd been working in the space and on those kinds of technologies for about 10 years at that stage. And the other folks that were in there, five to 10 each, respectively. So we all had a fair amount of time working on that. And we had really kind of hit some of the limitations of doing things entirely using client devices. We were running into challenges about reliability of connections. Punching holes to the individual devices, very hard. Synchronizing keys between the devices, very hard. Maintaining strong availability of the data, because people's devices are going off and on, things like that. Even when you're using the kind of BitTorrent style of shared distribution, that becomes a challenge. But probably the worst challenge was quite simply scale. You need to be able to create aggregations of a lot of behavior even when you're trying to model your application as largely pairwise interactions, like, think messaging.
Paul Frazee 00:06:06 You might need an aggregation of, like, what accounts even exist. How do you do notifications reliably? Things like that. Really challenging. And what I was starting to say to myself by the end of that peer-to-peer period was that it just can't be rocket science to do a comment section. Like at some point you just ask yourself, how hard are we willing to work to make these ideas work? But there were some pretty good pieces of tech that did come out of the peer-to-peer world. A lot of it had to do with what I might call cryptographic structures. Things like Merkle trees and advances within Merkle trees, ways to take data sets and reduce them down to hashes so that you can then create nice signatures and have signed data sets at rest at larger scales.
Paul Frazee 00:06:48 So our basic thought was, well, all right, we got some pretty good tech out of this, but let's drop that requirement that it all run off of devices and accept some servers in there, and instead almost kind of think of the entire network as a peer-to-peer mesh of servers. That's going to solve your scale problem, because you can throw big databases at it; it's going to solve your availability problems; it's going to solve your device sync problems. But you get a lot of the same properties of being able to move data sets between services, much like you could move them between devices in the peer-to-peer network, without losing their identifiers, because you're doing this indirection of cryptographic identifiers to the current host. That's what peer-to-peer is always doing. You're taking, like, a public key or hash, and then you're asking the network, hey, who has this?
Paul Frazee 00:07:31 Well, if you just move that into the server, you get the same thing, that dynamic resolution of who's your active host. So you're getting that portability that we wanted real bad. And then you're also getting that kind of enmeshing of the different services, where each of them is producing these data sets that they can sync from each other. So take peer-to-peer and apply it to the server stack. And that was our kind of initial thought of, like, hey, you know what? This might work. This might solve the problems that we have. And a lot of the design fell out from that basic mentality.
Jeremy Jung 00:07:57 When you talk about these cryptographic identifiers, is the idea that anybody could have data about a person, like a message or a comment and that could be hosted different places, but you would still know which person that originally came from. Is that the goal there?
Paul Frazee 00:08:15 That's exactly it. Yeah, you want to create identification that supersedes servers, right? So when you think about, like, if I'm using Twitter and I want to know what your posts are, I go to twitter.com/jeremy, right? I'm asking Twitter, and your ID is consequently always bound to Twitter. You're always kind of a second-class identifier. We wanted to boost up the user identifier to be kind of a thing freestanding on its own. I want to just know what Jeremy's posts are, and then once you get into the technical system, it'll be designed to figure out, okay, who knows that, who can answer that for you? We use cryptographic identifiers internally. So, like, all the data sets use these kinds of long URLs to identify things. But in the live system, like in the application, the user-facing part, we use domain names for people, which kind of, I think, gives the picture of how this all operates. It really moves the user accounts up into a free-standing first-class identifier within the system, and then consequently, whatever application you're using, it's really about whatever data is getting put into your account. And then that just exchanges between any application that anybody else is using.
Jeremy Jung 00:09:22 So in this case, it sounds like the identifier is some long string that, not sure if it's necessarily human readable or not. You're shaking your head no. But if you have that string, it's for a specific person, and since it's not really human readable, what you do is you put a layer on top of it, which in this case is a domain that somebody can use to look up and find the identifier.
Paul Frazee 00:09:52 Yeah, yeah. So we just use DNS. You put a TXT record in there, mapping to that long string, or you can do a well-known file on a web server if that's more convenient for you. And then the ID that's behind that, the non-human-readable one, those are called DIDs, which is actually a W3C spec. Those then map to a kind of certificate, what we call a DID document, that kind of confirms the binding by declaring what that domain name should be. So you get this bi-directional binding, and then that certificate also includes signing keys and active servers. So you pull down that certificate, and that's how the discovery of the active server happens, is through the DID system.
Jeremy Jung 00:10:29 So when you refer to an active server, what is that server and what is that server storing?
Paul Frazee 00:10:34 It's kind of like a web server, but instead of hosting HTML, it's hosting, the simplest way to put it, a bunch of JSON records. Every user has their own kind of document store of JSON documents; it's bucketed into collections. Whenever you're looking up somebody on the network, you're going to get access to that repository of data, jump into a collection, this collection is their post collection, get the rkey, and then you're pulling out JSON at the end of it, which is just a structured piece of stuff saying here's the createdAt, here's the text, here's the type, things like that. Really, one way you could look at the whole system is it's a giant, giant database network.
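Records in these repositories are addressed by repo, collection, and record key, which is what `at://` URIs encode. Here is a minimal parser, sketched under the assumption of the simple three-part form (the actual scheme also allows handles as the authority, and the DID and rkey below are made up for illustration):

```python
def parse_at_uri(uri: str) -> dict:
    """Split an at:// URI into its repo, collection, and rkey parts.

    e.g. at://did:plc:abc123/app.bsky.feed.post/3jx9 points at one JSON
    record in the 'app.bsky.feed.post' collection of that repository.
    """
    prefix = "at://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an at:// URI: {uri!r}")
    repo, collection, rkey = uri[len(prefix):].split("/")
    return {"repo": repo, "collection": collection, "rkey": rkey}
```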
Jeremy Jung 00:11:09 So if someone's going to look up someone's identifier, let's say they have the user's domain, they have to go to some source, right, to find the user's data? You've mentioned, I think, before the idea that this is decentralized, and by default I would picture some kind of centralized resource where I send somebody a domain and then they give me back the identifier and the links to the servers. So how does that work in practice, where it actually can be decentralized?
Paul Frazee 00:11:38 Yeah, I mean the servers are pretty dumb, right? They're just hosting those data repositories for you. So what you're doing is, I mentioned your DID, that non-human-readable identifier, and that has that certificate attached to it that lists servers and signing keys and things like that. So you're just going to look up inside that DID document what that server is, your data repository host. And then you contact that guy and say, all right, I'm told you're hosting this thing. Here's the person I'm looking for, hand over the data. So it's really pretty straightforward. The way that gets decentralized is by the fact that I could swap out that active server that's in my certificate, and probably want to rotate the signing keys, because I don't want to keep using the same signing keys as I was using previously, because I just changed the authority.
Paul Frazee 00:12:24 So that's the migration: change the hosting server, change out the signing keys. Somebody that's looking for me now, they're going to load up my document, my DID document. They're going to say, okay, oh, new server, new keys, pull down the data. Looks good, right? Matches up with the DID doc. So that's how you get that level of portability. But when those changes happen, the DID doesn't change, right? The DID document changes. So there's a level of indirection there, and that's pretty important, because if you don't have a persistent identifier whenever you're trying to change out servers, all those backlinks are going to break. That's the kind of stuff that stops you from being able to do clean migrations on things like web-based services. The only real option is to go out and ask everybody to update their data. And when you're talking about, like, interactions on the social network, like people replying to each other, there's no chance, right? Every time somebody moves, you're going to go back and modify all those records? You don't even control all the records from the top down, because they're hosted all over the web. So it's just, you can't do it. So generally we call this account portability; it's kind of like phone number portability, that you can change your host. So that part's portable, but the ID stays the same. And keeping that ID the same is the real key to making sure that this can happen without breaking the whole system.
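The stable-DID-plus-mutable-document split can be sketched like this. The document shape is heavily simplified (real W3C DID documents use `verificationMethod` and `service` arrays, and the field names here are made up); the point is just that migration rewrites everything except `id`, so existing backlinks keep resolving.

```python
def migrate(did_doc: dict, new_host: str, new_signing_key: str) -> dict:
    """Move an account to a new host: swap server and keys, keep the DID.

    Anyone holding the DID re-fetches this document, sees the new host
    and keys, and pulls the signed data from the new location.
    """
    updated = dict(did_doc)                   # copy the mutable layer
    updated["service"] = new_host             # new data repository host
    updated["signing_key"] = new_signing_key  # rotated key
    return updated                            # updated["id"] is untouched
```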
Jeremy Jung 00:13:32 And so it sounds like there’s the decentralized ID then there’s the decentralized ID document that’s associated with that. That points you to where the actual location of your data, your posts, your pictures and whatnot. But then you also mentioned that they could change servers, I guess. So let’s say somebody changes where their data is stored, that would change the servers I guess in their document. But then how do all of these systems know, okay, I need to change all these references to your old server, to these new servers?
Paul Frazee 00:14:06 Yeah, well the good news is that you've got the public data set of all the user's activity, and then you have, like, internal caches of where the current server is. You just got to update those internal caches when you're trying to contact their server. So it's actually a pretty minimal thing to just, like, update: oh, they moved, update my table, my Redis that's holding onto that kind of temporary information, put it on a TTL, that sort of thing. And honestly, in practice, a fair amount of the system, for scalability reasons, doesn't necessarily work by servers directly contacting each other. It's actually a little bit more like how I told you before, I'm going to use this metaphor a lot, the search engines with the web, right? What we do is we actually end up crawling the repositories that are out in the world and funneling them into event streams, like a Kafka.
Paul Frazee 00:14:56 And that allows the entire system to act like a data processing pipeline, where you're just tapping into these event streams and then pushing those logs into databases that produce, like, these large-scale aggregations. So a lot of the application behavior ends up working off of these event logs. If I reply to somebody, for instance, it's not that my server has to, like, talk to your server and say, hey, I'm replying to you. What I do is I just publish a reply in my repository. That gets shot out into the event logs, and then these aggregators pick up that the reply got created and just update their database with it. So it's not that our hosting servers are constantly having to send messages to each other; you actually use these aggregators to pull together the picture of what's happening on the network.
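That pipeline can be sketched as a fold over an ordered event log. The event shapes below are invented for illustration (the real firehose carries signed binary frames over a WebSocket), but the principle matches what is described here: the aggregator never contacts the author's server; it just replays events into its own index.

```python
from collections import defaultdict

def build_index(events: list[dict]) -> dict:
    """Replay an event log into a tiny in-memory 'app view' index."""
    index = {"posts": {}, "reply_counts": defaultdict(int)}
    for ev in events:
        if ev["type"] == "create_post":
            index["posts"][ev["uri"]] = ev["text"]
        elif ev["type"] == "create_reply":
            # the reply record lives in the replier's own repository;
            # the aggregator only tallies it into the parent's count
            index["reply_counts"][ev["parent"]] += 1
    return index
```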
Jeremy Jung 00:15:40 Okay. So like you were saying, it's an event stream model where everybody publishes the events, the things that they're doing, whether that's making a new post, making a reply, that's all being posted to this event stream. And then everybody who provides, I'm not sure if instance is the right term, but an implementation of the AT Protocol, they are listening for all those changes, and they don't necessarily have to know that you moved servers, because they're just listening for the events and you still have the same identifier.
Paul Frazee 00:16:23 Generally speaking, yeah. Because, like, if you're listening to one of these event streams, what you end up looking for is just the signature on it and making sure that the signature matches up, because you're not actually having to talk to their live server. You're just listening to this relay that's doing this aggregation for you. But I think actually, to kind of give a little more clarity to what you're talking about, it might be a good idea to kind of refocus how we're talking about the system here. I mentioned before that our goal was to make a high-scale system, right? We need to handle a lot of data. And if you're thinking about this in the way that Mastodon does it, the ActivityPub model, that's actually going to give you the wrong intuition, because we chose a dramatically different system. What we did instead was we picked up essentially the same practices you're going to use for a data center, a high-scale application data center, and said, all right, how do you tend to build these sorts of things?
Paul Frazee 00:17:06 Well, what you're going to do is you're going to have multiple different services running different purposes. It gets pretty close to a microservices approach. You're going to have a set of databases, and then, generally speaking for high scale, you're going to have an event log, some kind of a Kafka, that you are tossing changes about the state of these databases into. And then you have a bunch of secondary systems that are tapping into the event log and processing that into the large-scale databases, like your search index, your nice Postgres of user profiles. And that makes sure that you get each of these different systems to perform really well at their particular task. And then you can detach them in their design. For instance, your primary storage can be just a key value store that scales horizontally.
Paul Frazee 00:17:55 And then on the event log you're using a Kafka that's designed to handle particular semantics of making sure that the messages don't get dropped, that they come through at a particular throughput. And then, for us, we're using ScyllaDB for the big scale indexes, which scales horizontally really well. So it's just different kinds of profiles for different pieces. If you read Martin Kleppmann's book, Designing Data-Intensive Applications I think it's called, a lot of it gets captured there. He talks a lot about this kind of thing. It's sometimes called a Kappa architecture; event sourcing is a similar term for it as well, stream processing too. That's pretty standard practice for how you would build a traditional high-scale service. So you take this kind of microservice architecture and essentially say, okay, now imagine that each of the services that are a part of your data center could be hosted by anybody, not just within our data center but outside of our data center as well, and should be able to all work together.
Paul Frazee 00:18:50 That's basically how the AT Protocol is designed. So we were talking about the data repository hosts; those are just the primary data stores. They hold onto the user keys and they hold onto those JSON records. And then we have another service category we call a relay that just crawls those data repositories and sucks in that fire hose of data, that event log we were talking about. And then we have what we call app views that sit there and tail the log and produce indexes off of it. They're listening to those events and then, like, making threads. Like, okay, that guy posted, that guy replied, that guy replied, that's a thread. They assemble it into that form. So when you're running an application, you're talking to the app view to read the network, and you're talking to the hosts to write to the network, and each of these different pieces sync up together in this open mesh. So we really took a traditional sort of data center model and just turned it inside out, where each piece is a part of the protocol and communicates with the others, and therefore anybody can join into that mesh.
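The thread-building step ("that guy posted, that guy replied, that guy replied, that's a thread") can be sketched as recursively grouping records by their parent reference. The record shapes are hypothetical; a real app view does this against the firehose and a proper database rather than a Python list.

```python
def assemble_thread(root_uri: str, records: list[dict]) -> dict:
    """Nest reply records under the post they point at, forming a thread."""
    replies = [
        assemble_thread(rec["uri"], records)
        for rec in records
        if rec.get("parent") == root_uri
    ]
    return {"uri": root_uri, "replies": replies}
```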
Jeremy Jung 00:19:47 And to just make sure I am tracking the data repository is the data about the user. So it has your decentralized identifier, it has your replies, your posts, all that sort of thing. And then you have a relay, which its responsibility is to somehow find all of those data repositories and collect them as they happen so that it can publish them to some kind of event stream. And then you have the app view, which it’s receiving messages from the relay as they happen and then it can have its own store and index that for search, it can collect them in a way so that it can present them onto a UI, that sort of thing. That’s the user facing part I suppose.
Paul Frazee 00:20:35 Yeah, that's exactly it. And again, it's actually quite similar to how the web works. If you combine together the relay and app view, it's how the web works, where you got all these different websites, they're hosting their stuff, and then the search engine is going around aggregating all that data and turning it into a search experience. Totally the same model. It's just being applied to more varieties of data, like structured data, like posts and replies, follows, likes, all that kind of stuff. And then instead of producing a search application at the end, I mean it does that too, but it also produces timelines and threads and people's profiles and stuff like that. So it's actually a pretty bog-standard way of doing it. That's one of the models that we've seen work for large-scale decentralized systems. And so we're just transposing it onto something that is more focused towards social applications.
Jeremy Jung 00:21:22 So I think I'm tracking that the data repository itself, since it has your decentralized identifier and because the data is cryptographically signed, it's from a specific user. I think the part that I am still not quite sure about is the relays. I understand if you run all the data repositories, you know where they are, so you know how to collect the data from them. But if someone's running another system outside of your organization, how do they find your data repositories? Or do they have to connect to your relay? What's the intention for that?
Paul Frazee 00:21:58 That logic runs, again, really similar to how search engines find out about websites. So there is actually a way for one of these data hosts to contact a relay and say, hey, I exist. Go ahead and get my stuff. And it'll be up to the relay to decide, like, if they want it or not. Right now, generally, we're just like, yeah, we want it. But as you can imagine, as the thing matures and gets to higher scale, there might be some trust kind of things to worry about. That's kind of the naive operation that currently exists. But over time, as the network gets bigger and bigger, it'll probably involve some more traditional kinds of spidering behaviors, because as more relays come into the system, each of these hosts, they're not going to know who to talk to.
Paul Frazee 00:22:37 Say you're trying to start a new relay. What you're going to do is discover all of the different users that exist in the system by looking at what data you have to start with. It'll probably involve a little bit of manual feeding-in at first: whenever I'm starting up a relay, okay, there's Bluesky's relay, let me just pull what they know, and then I go from there. And then anytime you discover a new user you don't have, you're like, oh, let me look them up, pull them into the relay too, right? So there's a pretty straightforward discovery process that you'll just have to bake into a relay to make sure you're crawling as much of the network as possible.
Jeremy Jung 00:23:09 And so I don’t think we’ve defined the term federation, but maybe you could explain what that is and if that is what this is?
Paul Frazee 00:23:19 We are so unsure.
Jeremy Jung 00:23:22 Okay.
Paul Frazee 00:23:23 Yeah. This gets debated pretty bad, because I think everybody pretty strongly agrees that ActivityPub is federation, right? And ActivityPub kind of models itself pretty similarly to email, in a way; the metaphor they use is that there are inboxes and outboxes, and every ActivityPub server, they're standing up the full vertical stack. They set up the primary hosting, the views of the data that's happening there, the interface for the application, all of it, a pretty traditional, like, closed service. But then they're making the perimeter permeable by sending, exchanging, essentially mailing records to each other, right? That's their kind of logic of how that works. And that's pretty much in line with, I think, what most people think of with federation. Whereas what we're doing isn't like that. Instead of having a bunch of vertical stacks communicating horizontally with each other, we kind of sliced in the other direction.
Paul Frazee 00:24:16 We sliced horizontally into this microservices mesh and have, like, a total mix and match of different microservices between different operators. Is that federation? I don't know. Right? We tried to invent a term; it didn't really work. At the moment we just kind of don't worry about it that much, see what happens, see what the world sort of has to say to us about it. But I compare it to the web a lot, because it shares a lot of properties with how the web works, the web and kind of search engine dynamic. And beyond that, yeah, I don't know.
Jeremy Jung 00:24:48 Because I'm trying to picture the distinction here. So I think people probably are thinking of something like, say, a Mastodon instance. When you're talking about everything being included: the webpage where you view the posts, the Postgres database that's keeping the messages. That same instance is responsible for basically everything. And I believe what you're saying is that the difference with the Authenticated Transfer Protocol, is that? Okay.
Paul Frazee 00:25:25 Yeah, thank you for your help.
Jeremy Jung 00:25:25 And the difference there is that at the protocol level, you've split it up into the data itself, which can be validated completely separately from other parts of the system. Like, you could have the JSON files on your hard drive, and somebody else can have that same JSON file, and they would know who the user is and that these are real things that user posted. That's like a separate part. And then the relay component that looks for all these different repositories that have people's data, that can also be its own independent thing, where its job is just to output events. And that can exist just by itself. It doesn't need the application part, the user-facing part; it can just be this event stream on its own. And that's the part where it sounds like you can make decisions on who to collect data from. I guess you have to agree that somebody can connect to you and get the users from your data repositories. And likewise, other people that run relays, they also have to agree to let you pull the users from theirs.
Paul Frazee 00:26:37 Yeah, that’s right. Yeah.
Jeremy Jung 00:26:39 And so I think the Mastodon example makes sense, but I wonder if the underlying ActivityPub protocol forces you to use it in that way in like a whole full application that talks to another full application or is it more like that’s just how people tend to use it and it’s not necessarily a characteristic of the protocol?
Paul Frazee 00:26:58 Yeah, that's a good question, actually. So, generally what I would say is pretty core to the protocol is the expectations about how the services interact with each other, the mailbox metaphor that's used in ActivityPub. So in that design, if I reply to you, I'll update my local database with what I did, and then I'll send a message over to your server saying, hey, by the way, add this reply, I did this. And that's how they find out about things. That's how they find out about activity outside of their network. That's the part that, as long as you're doing ActivityPub, I suspect you're going to see reliably happening. I can say for sure that's a pretty tight requirement. That's ActivityPub. If you wanted to split it up the way we're talking about, you could. I don't know if you necessarily would want to, because, I don't know, I think I'd have to dig into their stack a little bit more to see how meaningful that would be.
Paul Frazee 00:27:51 I do know that there's some talk of introducing a similar kind of aggregation method into the ActivityPub world, which I believe they're also calling a relay, to make things even more complicated. And Nostr has a concept of a relay. So there are three different protocols using this term. I think you could do essentially what a search engine does on any of these things. You could go crawling around for the data, pull it into a fire hose, and then tap into that aggregation to produce bigger views of the network. So that principle can certainly apply anywhere. With the AT Protocol, I think it's a little bit, we focused on that from the get-go. We focused really hard on making sure that the data is signed at rest. That's why it's called the Authenticated Transfer Protocol. And that's a nice advantage to have when you're running a relay like this, because it means that you don't have to trust the relay.
Paul Frazee 00:28:38 Like generally speaking, when I look at results from Google, I'm trusting pretty well that they're accurately reflecting what's on the website, which is fine, that's not actually a huge risk or anything. But whenever you're trying to build entire applications and you're using somebody else's relay, you could really run into things where they say like, oh, you know what, Paul tweeted the other day, I hate dogs. And I'm like, no I didn't. That's a lie, right? And you sneak in little lies like that and over a while it becomes a problem. So having the signatures on the data is important. If you're going to try to get people to cooperate, you've got to manage the trust model. I know that ActivityPub does have mechanisms for signed records. I don't know how deep they go, if they could fully replace that utility.
Paul Frazee 00:29:16 And then Mastodon and ActivityPub, they also use a different identifier system. They're actually taking a look at DIDs right now. I don't know what's going to happen there. We're totally on board to give any kind of insight that we got working on them, but at the moment they use, I think it's Webfinger-based identifiers. They look like emails, so you've got host names in there, and those identifiers are being used in the data records. So you don't get that continuous identifier. They actually do have to do that, hey, I moved, update your records sort of thing. And that causes it to, I mean, it works decently well, but not as well as it could. They got it to the point where it moves your profile over and you update all the folks that were following you so they can update their follow records, but your posts, they're not coming, right? Because that's too far into that mesh of interlinking records. There's just no chance. So that's kind of the upper limit on that. It's a different set of choices and trade-offs. You're always kind of asking, how important is the migration? Does that work out? Anyway, now I'm just kind of digging into differences between the two here.
Jeremy Jung 00:30:22 I’m trying to understand the difference there. So you were saying that with ActivityPub, all of the instances need to be notified that you’ve changed your identifier, but then all of the messages that they had already received, they don’t point to the new identifier somehow.
Paul Frazee 00:30:40 Yeah, you run into basically just the practicalities of actual engineering with that is what happens, right? Because imagine you've got a multimillion user social network, they've got all their posts, maybe the user has, let's say, a thousand posts and 10,000 likes, and that activity can range back three years, let's say. They changed their identifier and now you need to change the identifier on all those records. If you're in a traditional system that's already a tall order, you're going back and rewriting a ton of indexes, you're rewriting a lot of primary keys. Anytime somebody replied to you, they have these links to your posts, and now you've got to update the identifiers on all of those things. You could end up with a pretty significant explosion of rewrites that would have to occur. Now that's tough if you're in a centralized model. If you're in a decentralized one, it's pretty much impossible, because when you notify all the other servers that this changed, how successful are all of them at actually updating those pointers?
Paul Frazee 00:31:40 It’s a good chance that things are going to fall out of correctness. That’s just a reality of it. So if you’ve got a mutable identifier, you’re in trouble for migrations. So the DID is meant to keep it permanent and it ends up being the kind of the anchoring point. If you lose control of your DID, well that’s it, your account’s done, but we took some pretty traditional approaches to that where siding keys get managed by your hosting server instead of like trying to, this may seem like really obvious, but if you’re from the decentralization community, we spend a lot of time, like with blockchain. It’s like, hey, how have the users hold onto their keys? And the tooling on that is getting better for what it’s worth, we’re starting to see a lot better key peer management in like Apple’s ecosystem and Google’s ecosystem, but it’s still in the range of like, nah, people lose their keys.
Paul Frazee 00:32:24 So having the servers manage those is important. And then we have ways of exporting paper keys so that you could kind of adversarially migrate if you wanted to. That was in the early spec. We wanted to make sure that this portability idea works: you can always migrate your accounts, so you can export a paper key that can override. And that was how we figured that out. Okay, yeah, we don't have to have everything getting signed by keys that are on the user's devices. We just need these master backup keys that can say, you know what? I'm done with that host. No matter what they say, I'm overriding what they think. And that's how we squared that one.
Jeremy Jung 00:32:58 So it seems like one of the big differences with account migration is that with ActivityPub, when you move to another instance, you have to actually change your identifier. And with the AT Protocol, you’re actually not allowed to ever change that identifier. And maybe what you’re changing is just you have some kind of a lookup, like you were saying, you could use a domain name to look that up, get a reference to your decentralized identifier, but your decentralized identifier, it can never change.
Paul Frazee 00:33:32 It can’t change. Yeah. And it shouldn’t need to, what I mean. It’s really a total disaster kind of situation if that happens. So that it’s designed to make sure that doesn’t happen in the applications. We use these domain name handles to identify folks and you can change those anytime you want because that’s really just a user facing thing. Then in practice what you see pretty often is that you may, if you change hosts, we give sub domains to folks because like not everybody has their own domain. A lot of people do actually, to our surprise, people actually kind of enjoy doing that. But a lot of folks are just using paul.bsky.social as their handle. And so if you migrated off of that, you probably lose that. So your handle’s going to change but you’re not losing the followers and stuff. Because the internal system isn’t using paul.bsky.social, it’s using that DID and that DID stays the same.
Jeremy Jung 00:34:18 Yeah, I thought that was interesting about using the domain names, because you have a lot of users, everybody's got their own subdomain, you could have however many millions of users. Does that become an issue at some point?
Paul Frazee 00:34:33 It’s a funny thing. I mean like the number of users, like that’s not really a problem because you run into the same kind of namespace crowding problem that any service is going to have, right? Like if you just take the subdomain part of it, like the name Paul, like actors, there’s only, you only get to have one paul.bsky.social. In terms of the number of users, that part’s fine I guess. As fine as ever. You, where gets more interesting of course is like really kind of around the usability questions. For one, it’s not exactly the prettiest to always have that Bsky.social in there. If we had some kind of solution to that, we would use it. But like the reality is that we’ve committed to the domain name approach and some folks, they kind of like, ah, that’s a little bit ugly.
Paul Frazee 00:35:11 And we’re like, yeah, it is, that’s life. The plus side though is that you can actually use like TLD the domain name. It’s like I’m phrase z.com and that starts to get more fun. It can actually act as a pretty good trust signal in certain some areas, for instance, well-known domain names like nytimes.com strong authentication right there. We don’t even need a blue check for it. Similarly, the .gov domain name space is tightly regulated. So you actually get a really strong signal out of that. Senator Wyden is one of our users, so he’s, it’s widen.gov and same thing, strong identity signal right there. So that’s actually a really nice upside. So that’s like positives, negatives. Then that trust signal only works so far. If somebody were to make p phrase.net, then that could be a bit confusing.
Paul Frazee 00:36:00 People may not be paying attention to the.com versus.net. So it’s not, I don’t want to give the impression that we’ve solved blue checks. It’s a complicated and multifaceted situation, but it’s got some juice. It’s also kind of nice too because a lot of folks that are doing social, they’ve got other stuff that they’re trying to promote, like I’m pretty sure that NY Times would love it if you went to their website. And so tying it to their online presence so directly like that is a really nice kind of feature of it and tells I think a good story about what we’re trying to do with an open internet where, everybody has their space on the internet where they can do whatever they want on that. And then these social profiles, it’s that presence showing up in a shared space.
Paul Frazee 00:36:42 It’s all kind of part of the same thing that feels like a nice kind of thing to be chasing. It also kind of speaks well to the naming worked out for us. We, chose AT Protocol as a name, we background our way into that one because it was a simple sort of thing, but it actually ended up really reflecting the biggest part of it, which is that it’s about putting people’s identities at the front. Kind of promoting everybody from a second-class identity that’s underneath Twitter or Facebook or something like that up at two. Nope, you’re freestanding. You exist as a person independently, which is what a lot of it’s about.
Jeremy Jung 00:37:15 Yeah, I think just in general, not necessarily just for Bluesky, if people had more of an interest in getting their own domain, that would be pretty cool, if people could tie more of that to something you basically own, right? I mean, I guess you're kind of leasing it from ICANN or whatever. But rather than everybody having @gmail or @outlook or whatever, they could actually have something unique that they control, more or less.
Paul Frazee 00:37:44 Yeah. And we actually have a little experimental service for registering domain names that we haven't integrated into the app yet, because we just kind of wanted to test it out and see what the appetite is for folks to register domain names. Way higher than you'd think. We did that early on. It's funny, when you're coming from decentralization, it's like an activist space, right? A group of people trying to change how this tech works, and sometimes you're trying to parse between what might come off as a fascination of technologists compared to what people actually care about. And it varies. The domain name thing, to a surprising degree, folks really got into that. We saw people picking it up almost straight away, kind of more so than we ever predicted. And I think that's just because, I guess, it speaks to something that people really get about the internet at this point, which is great.
Paul Frazee 00:38:35 We did a couple of other things that are similar, and we saw varied levels of adoption on them. We had similar kinds of user-facing opening up of the system with algorithms and with moderation. And those have both been pretty interesting in and of themselves. So algorithms, what we did was we set that up so that anybody can create a new feed algorithm. And this is kind of one of the big things that you run into whenever you use the app. So if you wanted to create a new kind of For You feed, you can set up a service somewhere that's going to tap into that fire hose, right? And then all it needs to do is serve a JSON endpoint that's just a list of URLs, like, here's what should be in that feed, and then the Bluesky app will pick that up, hydrate in the content of the posts, and show that to folks.
Paul Frazee 00:39:21 And so there’s, I want to say this is a bit of a misleading number and I’ll explain why, but I think there’s about 35,000 of these feeds that have been created. Now the reason it’s little misleading is that I mean not significantly, but it’s not everybody went and sat down in their IDE and wrote these things. Essentially one of our users created, actually multiple of our users made little platforms for building these feeds, which is awesome. That’s the kind of thing you want to see because we haven’t gotten around to it. Our app still doesn’t give you a way to make these things, but they did. And so lots of, they’re, it is cool, like one person made a kind of a combinatorial logic thing that’s like visual almost like scratch it’s like, so if it has this hashtag and includes these users, not those users and you’re kind of arranging these blocks and that constructs the feed and then publish it on your profile and then folks can use it?
Paul Frazee 00:40:03 So that has been, I would say, fairly successful, except we had one group of hackers put in a real effort to make a replacement For You feed, like a magic algorithmic feed kind of thing, and they kept it going for a while and then ended up giving up on it. Most of what we see are actually kind of weird niche use cases for feeds. You get straightforward content-oriented ones, like a cat feed, a politics feed, things like that. It's great. Some of those are using ML detection. So the cat feed is ML detection, so sometimes you get a beaver in there, but most of the time it's a cat. And then we've got some that are kind of funny, like a change in the dynamic of freshness or selection criteria, things that you wouldn't normally see.
Paul Frazee 00:40:45 But because they can do whatever they want, they try it out. So the quiet posters feed ended up being a pretty successful one. And that one just shows people you're following that don't post that often, when they do, just those folks. I use that one all the time because, yeah, they get lost in the noise, so it's a way to keep up with them. The moderation one, that one's a real interesting situation. What we did there, essentially we wanted to make sure that the moderation system was capable of operating across different apps so that they can share their work, so to speak. And so we created what we call labeling. And labeling is a metadata layer that exists over the network. It doesn't actually live in the normal data repositories. It uses a completely different synchronization, because a lot of these labels are getting produced.
Paul Frazee 00:41:28 It’s just one of those things where the engineering characteristics of the labels is just too different from the rest of the system. So we, we created a separate synchronization for this and it’s really kind of straightforward. It’s here’s a URL and here’s a string saying something like NSFW or gore or, whatever. And then those get merged onto the records brought down by the client and then the client based on the user’s preferences will put like warning screens up, hide it, stuff like that. So yeah, these label streams can then anybody that’s running a moderation service can, are publishing these things and so anybody can subscribe to them and you get that kind of collaborative thing we’re always trying to do with this. And we had some users set up moderation services. So then as an end user you find it, it looks like a profile in the APP and you subscribe to it and you configure it and off you go.
Paul Frazee 00:42:12 That one has had probably the least amount of adoption throughout all of them. Moderation, it's a sticky topic, as you can imagine, challenging for folks. These moderation services, they do receive reports. Whenever I'm reporting a post, I choose from all my moderation services who I want to report this to. What has ended up happening, more than them being used to actually filter out subjective stuff, is more kind of algorithmic systems, or what you might call informational. So for the algorithmic ones, one of the more popular ones is a thing that's looking for posts from other social networks, like a screenshot of a Reddit post or a Twitter post or a Facebook post. Which, you're kind of like, why? But the thing is, some folks just get really tired of seeing screenshots from the other networks.
Paul Frazee 00:43:01 Because often it's like, look what this person said, can you believe it? And it's like, ah, okay, bad enough. So one of our users, Aendra, made a moderation service that just runs ML detection on it and labels it, and then folks that are tired of it, they subscribe to it, and it just hides it. So it's like a smart filter kind of thing that they're doing. Hypothetically you could do that for things like spiders, like if you've got arachnophobia, things like that. So that's a pretty straightforward kind of automated way of doing it, which takes a lot of the spice out of running moderation. So users have been like, yeah, okay, we can do that. Those are user-facing ways that we try to surface the decentralized principle, right?
Paul Frazee 00:43:40 And may take advantage of how this whole architecture can have this kind of a plugability into it. But then really at the end of the day, kind of the important core part of it is those pieces we were talking about before, the hosting the relay and the applications themselves, having those be swappable in completely. So we tend to think of those as kind of ranges of infrastructure into application and then into particular client-side stuff. So a lot of folks right now, for instance, they’re making their own clients to the application and those clients are able to do customizations, add features, things like that as you might expect. But most of them are not running their own backend, they’re just using our backend. But at any point it’s right there for you. You can go ahead and clone that software and start running the backend. If you wanted to run your own relay, you could go ahead and go all the way to that point. If you want to do your own hosting, you can go ahead and do that. It’s all there. It’s just kind of how much effort your project really wants to take. That’s the kind of systemically important part. That’s the part that makes sure that overall mission of de-monopolizing social media online. That’s where that really gets enforced.
Jeremy Jung 00:44:44 And so someone has their own data repository with their own users and their own relay. They can request that your relay collect the information from their own data repositories. That’s how these connections get made.
Paul Frazee 00:45:01 Yeah, and we have a fair number of those already. We call those the self-hosters, right? And we've got, I want to say, 75 self-hosters going right now, which is, I'd love to see that be more. It's really the folks that, if you're running a service, you probably would end up doing that. But the folks that are just doing it for themselves, it's the nerdiest of the nerds over there doing that, because it doesn't end up showing itself in the application at all, right? It's totally abstracted away. So that one's really about, like, measure your paranoia kind of thing, or if you're just proud of the self-hosting or curious. That's kind of where that sits at the moment.
Jeremy Jung 00:45:40 We haven’t really touched on the fact that there’s this underlying protocol and everything we’ve been discussing has been centered around the Bluesky Social Network where you run your own instance of the relay and the data repositories with the purpose of talking to Bluesky. But the protocol itself is also intended to be used for other uses, right?
Paul Frazee 00:46:01 Yeah, it’s generic. The data types are set up in a way that anybody can build new data types in the system. There’s a couple that have already begun front page, which is kind of a hacker news clone. There’s Smoke Signals, which is a events app. There’s Blue Cast, which is like a Twitter spaces clubhouse kind of thing. Those are the folks that are kind of willing to trudge into the bleeding edge and deal with some of the rough edges there pretty, I think obvious reasons a lot of our work gets focused in on making sure that the Bluesky app and that use case is working correctly. But we are starting to round the corner on getting to a full kind of how to make alternative applications state. If you go to the ATProto.com, there’s a kind of an introductory tutorial where that actually shows that whole stack and how it’s done.
Paul Frazee 00:46:47 So it’s getting pretty close. There’s a couple of still things that we want to finish up. Where we’re working for instance, right now on OAuth for sign in, which will include with it that are permissioning, which is very important. What kind of data does the app get access to? Make sure that it’s getting enforced. But it’ll be nice because then when you’re making one of these apps, you get that kind of one SSO, you press a button to sign in with whatever and that’ll be nice. Yeah. So, and like the way that works roughly is you set up a server and they’ll SSO in and then along with that session that’s being given to you, you get pointed to like, okay, here’s their account. Here’s their DID, right? You use that to look up their home server and then your app kind of, it’s almost like you’re being handed a couple of databases to work with in a way because their data repository is their primary storage that you want to write to.
Paul Frazee 00:47:37 So what happens is, you're going to write your normal server-side application. They're going to ask, hey, set my status to blank, or update my profile this way. They'll send that request to your server, and then you're just going to send that request over to their home server and say, okay, update this record for me, and it'll say, great, done. And now it's committed to that kind of primary storage. From there, you've got your view of the network that you've accumulated, that you've aggregated together. You could go ahead and optimistically update it and say, okay, I know that I just changed their profile, so their display name should change locally. But at the same time, what you're going to do is listen to that fire hose, and hopefully pretty soon that write you just made will show up on the fire hose, and then you ingest it that way too.
Paul Frazee 00:48:20 And it’s how you continue to be a part of the larger network. It’s not just your users that you’re observing, you’re seeing all the, all the activity by listening to the fire hose. So it’s a little slight, slight modification from how you normally make a web app, but not as much as you might think for a whole protocol. Like this really comes down to like right into that home server and then listening to that fire hose. That’s kind of the thing that comes along with it. And once you get used to that, it’s pretty straightforward. And in fact it’s kind of nice because you’re not having to do that. You’re getting handed essentially part of your stack as a part of the network. So it’s kind of an interesting way to do this. We’re shaping that up now and, I don’t know, maybe six months it, it’ll feel pretty good by then. We should get all the rough edges kind of calmed down.
Jeremy Jung 00:49:05 So in a way you can almost think of it as having an eventually consistent data store on the network, I suppose, where you don’t have to architect that part. You can make a traditional web application with a relational database and the source of truth can actually be wherever that data repository is stored on the network.
Paul Frazee 00:49:26 Yeah, that’s exactly, it is an eventually consistent system. That’s exactly right. The source of truth is their native repo and that relational database that you might be using. I think the best way to think about it is like secondary indexes or computed indexes, right? They reflect the source of truth, and you can structure that whatever makes it easy for you, which is lucky because that’s where most of your data work is happening. Like updating a row in a database, not that big a deal that and that you just send that off to the personal server. Yeah. All the heavy lifting is happening in your aggregation database and so you could use whatever you want there and that, that is fortunate. Keeps it from being too much of a pain.
Jeremy Jung 00:50:04 And built into the protocol itself, because you mentioned signing in with OAuth, for example, would you be able to restrict an app that signs in as you to only be able to write to certain things in the data store, that kind of thing?
Paul Frazee 00:50:20 Yeah. Yeah. All the data's typed. We have a kind of reverse-DNS identifier for the data. So Bluesky posts are identified with a type of app.bsky.feed.post. We did reverse DNS to make sure you don't confuse it with a web address. It's jarring at first, you're like, what is this, Java? But you get used to it pretty fast. So everything gets well defined. You have these well-defined schemas, and as a consequence, it started off as just, well, let's make sure that people making apps have a way to predict what everybody's doing. That's why we introduced that schema and typing system. In previous projects I had worked on, like Secure Scuttlebutt, we went entirely by convention, and that was a nightmare, because you just never knew what other folks were doing. There was no coordination system.
Paul Frazee 00:51:06 So we had this kind of a hacker ethos like, ah, now you’ll figure it out. But now actually you, you started to run into situations where you would make a schema change in your application that seemed benign but turns out you just broke two other apps because you just didn’t even know what they were doing. So the schema system is to really just help developers coordinate with each other. At least that was the initial impulse. We get to the permissions story and we realize, oh man, we are lucky we did this because now we have a way to describe in these schemas what the different kind of record types are. And so whenever you do your OAuth sign in, it’ll be able to say, write to your Blue Scout posts. Read your smoke signals, events, things like that. That permissioning definitely a core piece of what the oof is introducing.
Jeremy Jung 00:51:45 That’s pretty cool because it’s taking care of a certain part of your design, right? In terms of how am I going to store my data? How am I going to manage permissions? At least that part is taken care of for you.
Paul Frazee 00:51:59 Yeah, this is getting kind of grandiose, I don't tend to pose it in these terms, but it is almost like we're trying to have an OS layer at a protocol level, where the permissioning and the storage, it's like having your own network-wide database or network-wide file system. These are the kind of facilities you expect out of a platform like an OS. And so the hope would be that this ends up being actually quite a convenient introduction into the internet protocols, the internet stack, so to speak, so that it's getting usage outside of just the initial social app like we're doing here. If it doesn't end up working out that way, if it ends up being good for the Twitter-style use case and the other ones not so much, that's fine too. That's our initial goal, but we wanted to make sure to build it in a way that keeps that availability, makes sure that you're getting the most utility you can out of it.
Jeremy Jung 00:52:49 Yeah, I can see some of the parallels to some of the decentralized stuff that I suppose people are still working on, but more on the peer-to-peer side, where the idea was that I can have a network host this data. And in this case, it's a network of maybe larger providers where they could host a bunch of people's data, versus just straight peer-to-peer, where everybody has to have a piece of it. And it seems like your angle there was really the scalability part.
Paul Frazee 00:53:19 It was the scalability part and there’s great work happening in peer-to-peer. There’s a lot of advances on it that are still happening. I think really the limiter that you run into is running queries against aggregations of data because you can get the network BitTorrent sort approved that you can do distributed open horizontal scaling of hosting. That basic idea of hey, everybody’s got a piece and you sync it from all these different places. We know you can do things like that. What nobody’s been able to really get into a good place is running queries across large data sets. In a model like that, there’s been some research in what’s called federated queries, which is where you’re sending a query to multiple different nodes and asking them to fulfill as much of it as they can and then collating the results back.
Paul Frazee 00:54:08 But it didn’t work that well. That’s still kind of an open question and until that is in a place where it can like reliably work and at very large scales you’re just going to need a big database somewhere that does give the properties that you need, you need these big indexes. And once we were pretty sure of that requirement, then from there you start asking, what about the else about the system? Could we make easier if we just apply some more traditional techniques, merge that in with the peer-to-peer ideas. And so key hosting, that’s an obvious one availability, let’s just have a server it’s no big deal but you’re trying to, you’re trying to make as much of them dumb as possible so that they have that kind of easy replaceability.
Jeremy Jung 00:54:53 Earlier you were talking a little bit about the moderation tools that people could build themselves. There was some process where people could label posts and then build their own software to determine what a feed should show for a person. But I think before that layer for the platform itself, there’s a base level of moderation that has to happen. And I wonder if you could speak to, as the app has grown, how that’s handled.
Paul Frazee 00:55:25 Yeah, you've got to take some requirements in moderation pretty seriously to start, and with decentralization, sometimes that gets a little bit dropped. You need to have systems that can deal with questions about CSAM, so you've got those big questions you've got to answer. And then you've got stuff that's more in the line of, alright, what makes a good platform? What kind of guarantees are we trying to give there? So not just legal concerns but good product experience concerns. There we're in the realm of spam and abusive behavior and things like that. And then you get into even finer grain, like what a person's subjective preference is and how they can make their thing better. And so you get a kind of telescoping level of concerns, from the really big, the legal sort of concerns.
Paul Frazee 00:56:15 And then the really small subjective preference kind of concerns. That actually that telescoping maps really closely to the design of the system as well. Where the further you get up in the kind of the legal concern territory, you’re now in core infrastructure and then you go from infrastructure, which is the relay down into the application, which is kind of a platform and then down into the client and that’s where we’re having those labelers apply. And each of them, as you kind of move closer to infrastructure, the importance of the decision gets bigger too. So you’re trying to do just legal concerns with the relay right? Stuff that you objectively can, everybody’s in agreement like Yeah, yeah, yeah. No bigs don’t include that. And the reason is that at the relay level you’re anybody that’s using your relay, they depend on the decisions you’re making that sort of selection you’re doing, any filtering you’re doing, they don’t get a choice after that.
Paul Frazee 00:57:07 So you want to try to keep that focus really on legal concerns and doing that well so that applications that are downstream of it can make their choices. The applications themselves somebody can run a parallel, I guess you could call it like a parallel platform. So we got Bluesky doing the microblogging use case. Other people can make an application doing the microblogging use case. There’s choice that users can easily switch easily enough switch between, it’s still a big choice. We’re operating that in many ways. Like any other app nowadays might do it. You’ve got policies for what’s acceptable on the network. You’re still trying to keep that to be as objective as possible, make it fair, things like that. You want folks to trust your TNS team. But from the kind of systemic decentralization question, you get to be a little bit more opinionated down all the way into the client with that labeling system where you can for this is individuals turning on and off preferences.
Paul Frazee 00:58:04 You can be as opinionated as you want at that layer. And that's how we have basically approached this. And in a lot of ways it really just comes down to, in the day to day, the volume of moderation tasks is huge. You don't actually have high-stakes moderation decisions most of the time. Most of them are pretty straightforward: shouldn't have done that, that's got to go. You get a couple every once in a while that are a little spicier, or a policy that's a little spicier, and it probably feels pretty common to end users, but that just speaks to the volume of reports and problems that come through. And we don't want to make it so that the system is seized up trying to decentralize itself. It needs to be able to operate day to day. What you want to make is back-pressure checks on that power, so that if an application or a platform really does start to go down the wrong direction on moderation, then people have this credible exit, this way of saying, that's a problem, we're moving from here. And somebody else can come in with different policies that better fit people's expectations about what should be done at these levels. So it's not about taking away authority, it's about checking authority, kind of a checks-and-balances mentality.
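The client-side layer Paul describes, where individuals turn label preferences on and off, can be sketched roughly in Go. The types and function names here are hypothetical illustrations of the idea, not the actual Bluesky client API:

```go
package main

import "fmt"

// Visibility is what the client does with a labeled post.
type Visibility int

const (
	Show Visibility = iota // display normally
	Warn                   // display behind a warning
	Hide                   // drop from the feed
)

// applyLabels folds a user's per-label preferences over the labels
// attached to a post and returns the most restrictive outcome.
func applyLabels(postLabels []string, prefs map[string]Visibility) Visibility {
	result := Show
	for _, l := range postLabels {
		if v, ok := prefs[l]; ok && v > result {
			result = v
		}
	}
	return result
}

func main() {
	prefs := map[string]Visibility{
		"nsfw": Hide,
		"gore": Warn,
	}
	fmt.Println(applyLabels([]string{"gore"}, prefs))         // Warn (1)
	fmt.Println(applyLabels([]string{"nsfw", "gore"}, prefs)) // Hide (2)
	fmt.Println(applyLabels([]string{"sports"}, prefs))       // Show (0)
}
```

Each user carries their own preference map, so two users of the same app can see the same firehose filtered completely differently, which is the point of pushing this decision to the edge rather than the relay.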
Jeremy Jung 00:59:17 And high level, because you're saying there's such a high volume of things where, if you know what it is, you want to remove it, but there's just so much of it. So do you have automated tools to label these things? Do you have a team of moderators? Do they have to understand all the different languages that are coming through your network? That kind of thing?
Paul Frazee 00:59:39 Yes, yes and yes. Yeah, you use every tool at your disposal to stay on top of it, because you're trying to move as fast as you can when problems show up; the slower you are to respond, the more irritating it is to folks. Likewise, if you make a missed call, if somebody misunderstands what's happening, and believe me, sometimes just figuring out what the heck is going on is hard. People's beefs definitely surface up to the moderation team, or, you know, a misunderstanding or a wrong application of policy. Moderators make mistakes, so you're trying to maintain a pretty quick turnaround on stuff that's tough, especially when you need to move fast on some really upsetting content that can make its way through. Again, illegal stuff, for instance, but war videos, stuff like that, it's a real problem.
Paul Frazee 01:00:28 So yeah, you've got to be using some automated systems as well, clamping down on bot rings and spam. You can imagine that's gotten a lot harder thanks to LLMs. Just doing a text analysis with kind of dumb statistics of what they're talking about, that doesn't even work anymore, because the LLMs are capable of producing consistently varied responses while still achieving the same goal of, like, plugging an online betting site of some kind. So we do use kind of dumb heuristic systems where they work, but boy, that won't work for much longer. And we've already got cases where it's, oh boy. So moderation's in a dynamic place, to say the least, right now with LLMs coming in. It was tough before and now it's real interesting. So you use everything you can and you keep playing the cat-and-mouse game. That's that.
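As a rough illustration of the kind of dumb heuristic Paul mentions, here is a minimal Go sketch (hypothetical function, not Bluesky's actual rules) that flags links an account posts repeatedly. LLM-generated spam can vary its wording, so text statistics miss it, but the payload URL usually has to stay the same:

```go
package main

import (
	"fmt"
	"regexp"
)

var urlRe = regexp.MustCompile(`https?://[^\s]+`)

// flagRepeatedLinks is a deliberately dumb heuristic: if the same URL
// shows up in more than `limit` posts, flag it for human review.
func flagRepeatedLinks(posts []string, limit int) []string {
	counts := map[string]int{}
	for _, p := range posts {
		for _, u := range urlRe.FindAllString(p, -1) {
			counts[u]++
		}
	}
	var flagged []string
	for u, n := range counts {
		if n > limit {
			flagged = append(flagged, u)
		}
	}
	return flagged
}

func main() {
	posts := []string{
		"Great odds tonight! https://bet.example",
		"You won't believe these lines https://bet.example",
		"Totally unrelated thought https://bet.example",
	}
	fmt.Println(flagRepeatedLinks(posts, 2)) // [https://bet.example]
}
```

In practice a signal like this only feeds a review queue; on its own it would also flag, say, a journalist repeatedly linking their own article.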
Jeremy Jung 01:01:24 Are the tools that you use all built in-house, or are there outside services?
Paul Frazee 01:01:29 Not all of them. We do use a couple of services that are out there. For one, with the CSAM stuff, there's a company called Thorn that is actually set up to handle that. Obviously it's a very sensitive kind of thing. They maintain these hash fingerprint databases that we check every post against. We also use a service called Hive to help us automatically detect pornography and gore. We're still tuning that one, but it hits significantly; it's correct more than it's incorrect. And we try to encourage users to self-label their stuff, because good actors don't want to be posting NSFW to people that don't want to get it. We do see people do it, but we still need to have that extra check, and so we're running that kind of detection on it and things like that. And then we've got a couple of in-house things, something called Automod, which you can find the source for. It's a Go code base that's sitting there running all kinds of heuristics, trying to detect patterns of behavior that we've identified as being a problem. Often that has to do with spam, and then there's people.
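The fingerprint check against a known-bad database can be sketched as follows. This is a simplified illustration using exact SHA-256 matching; real services like Thorn use perceptual hashes, so near-duplicate images match too, and the database itself stays on the provider's side:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint stands in for a perceptual hash. Exact hashing is used
// here only to keep the sketch self-contained.
func fingerprint(media []byte) string {
	sum := sha256.Sum256(media)
	return hex.EncodeToString(sum[:])
}

// checkPost reports whether any media blob in a post matches the
// known-bad fingerprint set.
func checkPost(media [][]byte, knownBad map[string]bool) bool {
	for _, m := range media {
		if knownBad[fingerprint(m)] {
			return true
		}
	}
	return false
}

func main() {
	bad := []byte("known-bad-image-bytes")
	knownBad := map[string]bool{fingerprint(bad): true}
	fmt.Println(checkPost([][]byte{bad}, knownBad))                // true
	fmt.Println(checkPost([][]byte{[]byte("harmless")}, knownBad)) // false
}
```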
Jeremy Jung 01:02:36 Do you have a sense of, after you've gone through all these tools, how many reports people actually have to flag or judge each day?
Paul Frazee 01:02:47 I don't have the numbers in front of me, but our head of T&S keeps up regular reports on that. So actually, if you looked up his account, he may have said something about it recently. I just don't have the numbers on me right now.
Jeremy Jung 01:03:00 Yeah, because I think with any application that has user generated content, it just seems like such a huge challenge to deal with.
Paul Frazee 01:03:13 It really is, and it shows up in a lot of places. What's funny is, what we're talking about is the obvious way that it shows up, and even then it's not really apparent to the average person just how much there really is. It's a pretty big task, but it also makes its way into product development, and it's one of those hidden costs that shows up all over the place. It's funny, there's that meme about, oh, I can build Twitter in a weekend, kind of thing.
Jeremy Jung 01:03:42 Right.
Paul Frazee 01:03:43 There's a lot of reasons why building a microblogging app takes longer. I've had a lot of time to reflect on how many weekends it's taken to build.
Jeremy Jung 01:03:52 Yeah.
Paul Frazee 01:03:53 And why it took longer than a weekend for us. Certainly, targeting a lot of different platforms will do it to you. React Native makes that a lot easier these days, and I'll give them a plug: they've been great. The moderation, the safety, that part of the design that you've got to work into all new features, that is a real hidden cost of product development that you've got to take seriously from the get-go. And you've got to look at everything you do and say, how's this going to get messed with? What are people going to do with this? And our track record has been pretty good. There are definitely moments where we get something out and go, ah, didn't think of that. That's just how it goes. But that shows up everywhere.
Jeremy Jung 01:04:32 For sure. Yeah. I didn't even think about the fact, where you had mentioned the LLMs, that now people just have the power to generate posts that are indistinguishable from people. So I don't know where we're going here, but…
Paul Frazee 01:04:46 Somewhere new.
Jeremy Jung 01:04:48 Any closing thoughts or other things you want to mention?
Paul Frazee 01:04:54 Yeah, this has been an interesting project. I'm happy with how it's gone so far. For sure, the first question that we had when we started out was just whether or not it would scale, because historically, for folks that work in the decentralization space, that's what jams us up, in pretty bad and obvious ways for peer-to-peer, but even blockchains actually got jammed up real bad on scaling. And that's kind of why the NFT bubble popped as hard as it did when it did; everybody knows the transaction fees got really high because the system got too crowded. So making these big-scale systems that have open operation is a real challenge. And that's the part where, so far, we're up to now 11 million registered accounts, and we sit at somewhere around 1.5 to 2 million DAU, and we haven't identified anything that makes us think, oh, that's a scaling limit for this design.
Paul Frazee 01:05:42 We’re pretty sure that it can make its way all the way to the kind of scale that people expect out of this. So I’m really happy to see just kind of from the engineering side of things, like getting that validated has felt pretty good about this project. So I’m pretty curious to see kind of how this all unfolds moving forward. I’m feeling, feeling very happy with the outcome so far and just actually genuinely interested on a personal level to see how these dynamics play out over time. Really looking forward to seeing somebody else start to operate full nodes like we are. I think that’ll be a great milestone that we’re trying to head for. And hopefully in a year or two, I think we kind of need to make a little social network in a box, make it a little bit easier for somebody to just press a button and get the whole thing going. That’s a milestone in the future I’m looking forward to. Other than that, a lot of my day to day just comes down to, a lot of it’s just running the social app and learning all the lessons that come along with that, which has been a trip in and of itself.
Jeremy Jung 01:06:37 Like you were mentioning, people think, oh, build Twitter in a weekend. Right? But it’s taken a few more weekends than that.
Paul Frazee 01:06:42 It's taken a few more, and your problems only just start there. All the source code is available on GitHub, so if you want to see how we do it, it's right there. And if you're an application developer, I think the social app repo in particular is pretty interesting to look at. That's a React Native repository; you can kind of see the full thing, lots of good forkable code in there if you're building an app. So definitely check that out. All the protocol code is up there. We've got specs up on the AT Protocol website, atproto.com, so you can check all that out, some good intro material, things like that, if you want to learn a little bit more about how all this works. Find it at atproto.com. It's all up there, all ready to look at.
Jeremy Jung 01:07:22 And if people want to follow you or see what you’re up to?
Paul Frazee 01:07:26 You can jump on Bluesky; I'm at pfrazee.com.
Jeremy Jung 01:07:29 Paul, thank you so much for taking the time.
Paul Frazee 01:07:32 Absolutely. Thank you.
Jeremy Jung 01:07:34 This has been Jeremy Jung for Software Engineering Radio. Thanks for listening.
[End of Audio]