Journey with AT Protocol: Part 1

I read about the AT Protocol (Authenticated Transfer Protocol), made popular by Bluesky, a couple of years ago. But more recently, I decided to do a side project so I could deep dive into it. About five years ago I did a side project with Activity Pub, which I thought would potentially take over how social media platforms are built, but that didn't quite pan out. Part of it was certainly that Mastodon just didn't have the same brand clout as Bluesky, the latter being founded by Twitter's ex-CEO Jack Dorsey. That clout means anyone who builds upon the AT Protocol has access to the content of over 30 million registered Bluesky users (as of the time I'm writing this).
Now, I don't think the AT Protocol (abbreviated atproto) is objectively better than Activity Pub as far as open social media standards go. The goals for atproto are simply different from those of Activity Pub. The major difference is that Activity Pub, as Mastodon uses it, specifies how data is passed between "mailboxes" within the fediverse (federated universe), but ultimately each server (i.e. where you signed up for an account) holds most of the power over your experience with Mastodon.
For example, there's no concept of a Relay or Firehose, and querying a server for a list of recent content isn't in the spec (to my knowledge), meaning it's incredibly difficult to build a feed that includes content from accounts you don't follow. That means your experience with Mastodon as a new user is limited strictly to (a) the content that exists on the server you signed up on and (b) people you already knew were on Mastodon and whom you followed immediately after signup. That's a hard sell, honestly. You basically have to choose your community before you create an account.
That's enough preamble though. I'm going to start a series that captures my own personal journey into writing an application using atproto. There's a lot to cover so let's get this started with part 1.
- Part 1: Setup/PDS (you're here!)
- Part 2: Lexicons/Event Streams
- Part 3: Multiformats Madness
- Part 4: Public vs Private
The Setup
First, let's get the underlying setup out of the way.
Because atproto is built around the concept of a custom data storage solution, called a PDS (Personal Data Server), I decided to use the open source PDS solution provided by Bluesky. There are currently two options: a TypeScript PDS, which is the official and feature-complete version, and a Go PDS, which is experimental and only contains the most fundamental features.
But there's more than just a difference in programming language between them: the TypeScript PDS seems to use sqlite to store the underlying blocks in the MST (I'll get to this in a bit) as well as metadata about the collections and records contained within the PDS. The Go PDS, on the other hand, uses sqlite for metadata but has configurable options for storing block data, with the default being ScyllaDB. This is actually a huge boon for the Go PDS despite it being considered experimental and lacking full feature parity with the TypeScript PDS.
For my purposes, however, I didn't necessarily need the hyper-scalability afforded by ScyllaDB as a sqlite alternative. I preferred something more feature-complete for my first adventure with atproto so I chose to use the TypeScript PDS.
For the application itself, I used the Elixir programming language, as I just find it fun to program in. Additionally, "metaprogramming" in Elixir is quite simple, which would allow me to quickly build a code generator that could take in Lexicons and produce Elixir source code.
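To give a flavor of what I mean, here's a minimal sketch of that idea. The shape of the parsed lexicon map and the `MyApp.Records` namespace are hypothetical placeholders (real lexicon JSON is richer than this); the point is just how little code it takes to turn a definition into a module:

```elixir
defmodule MyApp.LexiconGen do
  @moduledoc """
  Minimal sketch: turn a (hypothetical, already-parsed) lexicon definition
  into an Elixir struct module using Module.create/3.
  """

  def define_record_module(%{"id" => id, "properties" => props}) do
    # "app.bsky.feed.post" -> MyApp.Records.App.Bsky.Feed.Post
    parts = id |> String.split(".") |> Enum.map(&Macro.camelize/1)
    module_name = Module.concat([MyApp.Records | parts])

    fields = Enum.map(props, &String.to_atom/1)

    contents =
      quote do
        defstruct unquote(fields)

        # the NSID of the lexicon this struct was generated from
        def collection, do: unquote(id)
      end

    Module.create(module_name, contents, Macro.Env.location(__ENV__))
    module_name
  end
end

# Usage:
# MyApp.LexiconGen.define_record_module(%{
#   "id" => "app.bsky.feed.post",
#   "properties" => ["text", "createdAt"]
# })
# #=> MyApp.Records.App.Bsky.Feed.Post
```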
PDS (Personal Data Server)
The PDS is, unfortunately, not well-documented even though the structure is well-defined in the atproto specification. It's effectively up to the person implementing the PDS to decide how to store the underlying data.
Looking at the repository spec, it's not immediately clear, but what I've figured out is that you need at least two separate data storage mechanisms to implement a repo.
The first is the MST, or Merkle Search Tree, a search-tree variant of the Merkle trees used by systems like Bitcoin and IPFS: an efficient way to store a set of cryptographic hashes anywhere you need to quickly verify the contents of a large data structure. Each node effectively "commits" data to the tree, but the data itself doesn't live there. Leaves are just pointers to content, and in this case the pointers are CIDs. Branches (i.e. non-leaf nodes) use CIDs as pointers as well, but a branch's pointer is derived from the hash of its entire sub-tree.
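As a mental model (this is a simplification, not the exact node layout from the repository spec, which prefix-compresses record keys among other things), an MST node looks roughly like this:

```elixir
defmodule MyApp.MST.Node do
  @moduledoc """
  Simplified mental model of an MST node; the real node format in the
  repository spec is more compact, but the shape is essentially this.
  """

  defmodule Entry do
    defstruct key: nil,   # record key, e.g. "app.bsky.feed.post/<rkey>"
              value: nil, # CID of the record's block (a leaf pointer)
              right: nil  # CID of the sub-tree to the right of this entry, if any
  end

  defstruct left: nil,    # CID of the sub-tree to the left of all entries, if any
            entries: []   # list of %Entry{} structs, sorted by key
end
```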
The MST doesn't store the underlying content (which is either a record or a blob), so something else needs to do that. Because each piece of content is addressed by a CID, it is simplest to use a key-value store with CIDs as keys and the content as values. And those values must be, according to atproto, DAG-CBOR encoded (the CIDs are based on this encoded value and not the raw content). So official atproto PDS implementations use a "block store" to store DAG-CBOR encoded binary data by CID. The process to add a record goes something like this:
- Request a record to be added.
- Generate a "block": a DAG-CBOR encoded record (or a blob) paired with its CID.
- Add the CID to the MST.
- Update the MST (all hashes/CIDs above the leaf node need to be re-computed).
- Add the block to storage.
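To make the "generate a block" step concrete, here's a rough Elixir sketch. I'm assuming the `cbor` Hex package for encoding (real DAG-CBOR has stricter canonical-form rules than a generic CBOR encoder guarantees), and I'm hand-assembling a CIDv1 from the multiformat constants (dag-cbor multicodec `0x71`, sha2-256 multihash `0x12`), which only works this directly because each value fits in a single varint byte:

```elixir
defmodule MyApp.Blocks do
  @moduledoc """
  Sketch of turning a record into a block: encode, hash, build a CID.
  Assumes the `cbor` Hex package; a real PDS must produce canonical DAG-CBOR.
  """

  @cidv1 0x01       # CID version 1
  @dag_cbor 0x71    # multicodec code for dag-cbor
  @sha2_256 0x12    # multihash code for sha2-256
  @digest_size 32   # sha2-256 digest length in bytes

  def make_block(record) when is_map(record) do
    # 1. Encode the record (canonical DAG-CBOR in a real implementation).
    bytes = record |> CBOR.encode() |> IO.iodata_to_binary()

    # 2. Hash the *encoded* bytes, not the raw record.
    digest = :crypto.hash(:sha256, bytes)

    # 3. CIDv1 = version ++ codec ++ multihash(code, size, digest),
    #    rendered with the "b" (lowercase base32, unpadded) multibase prefix.
    cid_bytes = <<@cidv1, @dag_cbor, @sha2_256, @digest_size>> <> digest
    cid = "b" <> Base.encode32(cid_bytes, case: :lower, padding: false)

    {cid, bytes}
  end
end
```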
That CID-keyed access pattern makes ScyllaDB a great choice for block storage, because it's effectively a key-value store built specifically for high performance and distributed deployments. However, anyone making their own PDS could choose whatever block store they want, so long as the MST is built according to spec and records are stored as DAG-CBOR encoded binaries.
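In Elixir terms, the block store contract is small enough to capture as a behaviour. This is my own formulation for the side project, not an interface defined anywhere in atproto:

```elixir
defmodule MyApp.BlockStore do
  @moduledoc """
  Minimal block-store contract: content-addressed get/put of DAG-CBOR
  encoded binaries keyed by CID. Atproto only cares that the blocks and
  MST follow the spec, not how or where you store them.
  """

  @callback put(cid :: String.t(), block :: binary()) :: :ok | {:error, term()}
  @callback get(cid :: String.t()) :: {:ok, binary()} | {:error, :not_found}
end
```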
Now this part isn't strictly necessary, although it does make certain operations far simpler: metadata storage. Both official implementations store metadata for each PDS. For example, if you wanted to list the most recent 50 `app.bsky.feed.post` records for a user, it's not as efficient to go searching the MST to find those 50 records. It's far more efficient to have a `record` table in sqlite that stores the collection name (e.g. `app.bsky.feed.post`), the record's last-updated timestamp, the record key, and the record's CID. You just query it, get back the CIDs, and then fetch those from the block store; done.
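If you were building that metadata layer yourself in Elixir (the official PDS does this in TypeScript over sqlite), the lookup might look something like this with Ecto. The table and column names here are my own approximation, not the official PDS schema:

```elixir
defmodule MyApp.RecordIndex do
  import Ecto.Query

  # Schemaless query against a hypothetical "record" metadata table with
  # columns: collection, rkey, cid, updated_at.
  def recent_record_cids(repo, collection, limit \\ 50) do
    query =
      from r in "record",
        where: r.collection == ^collection,
        order_by: [desc: r.updated_at],
        limit: ^limit,
        select: r.cid

    repo.all(query)
  end
end

# Usage (MyApp.Repo is assumed to be an Ecto repo):
# MyApp.RecordIndex.recent_record_cids(MyApp.Repo, "app.bsky.feed.post")
# then hydrate each returned CID from the block store
```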
I won't get into this now, but it's also recommended for custom applications to use their own data store for dehydrated records. For example, if you wanted to build your own content feed, it's more efficient to store only the necessary bits of a record rather than duplicating the whole thing. This is especially true if you need information beyond the scope of an individual record, such as a user id or a creation timestamp. Then, when those records are queried, they would contain the repo and CID, and each returned record would be hydrated before being sent to the end client.
Registering Users
Alright, now we've got a PDS running and a bare-bones application. Each user has their own repo within the PDS, so you essentially create a separate metadata store and MST per user, although the block storage typically isn't partitioned by user. Content is addressed by CID, which should be universally unique, and partitioning the actual content data by user id could create unfavorable hot spots within a distributed system, so the blocks all get stored together.
Now, oddly enough, the official TypeScript PDS requires an invite code to register a user. In theory, requirements like an invite code are completely optional and up to the different PDS implementations. I do not particularly like that the official PDS from Bluesky forces this requirement rather than making it configurable, but the projects maintained by Bluesky also tend to blur the lines between atproto as an underlying standard and Bluesky as a consumer application, meaning sometimes choices are made in the atproto implementations because of what was wanted/needed by the Bluesky platform.
Thankfully, it's not too painful to work around this. The registration flow just requires us to first request an invite code from the PDS and then immediately use the returned code to register an account. A "session" is returned along with the account's decentralized identifier (or DID), which serves as both the account's unique id and the id of the account's repo. So anywhere you need to identify a repo, you use the DID.
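Here's roughly what that flow looks like from Elixir, using the Req HTTP client against the `com.atproto.server.createInviteCode` and `com.atproto.server.createAccount` XRPC endpoints. The admin-auth scheme and exact response fields reflect my reading of the docs, so treat this as a sketch and verify against the PDS you're running:

```elixir
defmodule MyApp.PDS.Registration do
  @moduledoc """
  Sketch of the account registration flow against a local PDS.
  """

  def register(pds_url, admin_password, %{email: email, handle: handle, password: password}) do
    # 1. Request an invite code (an admin-authenticated endpoint).
    invite =
      Req.post!("#{pds_url}/xrpc/com.atproto.server.createInviteCode",
        json: %{useCount: 1},
        auth: {:basic, "admin:#{admin_password}"}
      )

    %{"code" => code} = invite.body

    # 2. Immediately spend the invite code to create the account.
    account =
      Req.post!("#{pds_url}/xrpc/com.atproto.server.createAccount",
        json: %{email: email, handle: handle, password: password, inviteCode: code}
      )

    # The response is the "session": the account's DID plus access and
    # refresh tokens.
    account.body
  end
end
```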
Authentication sessions look a lot like OAuth 2.0, even though this isn't strictly OAuth: a true OAuth flow would mean authenticating the user through a third-party system, whereas the PDS stores a username (a.k.a. handle) and password, so you just do standard password authentication to create a session. You then get an access token and a refresh token back as your session response.
Now, something that I don't necessarily like is how user authentication is implemented within the PDS. Each repo has a public-private key pair that is used to sign content, but there's no required link between the key pair and your account credentials. And because they require a password to be stored within the PDS (at least, the official PDS implementations require it), you are effectively required to add more complexity to your own authentication solution if you want stronger account security.
And when I say stronger account security, I'm talking about magic links, passkeys, 2FA, etc. None of that is supported by the PDS. If you want to support OAuth login (e.g. Google/Apple/etc.), passkeys (e.g. a Yubikey), or anything else, you're required to maintain an entirely separate user authentication solution. But you still need to know the password for the PDS. So even if you only support OAuth login and would rather not maintain passwords at all, too bad: the PDS needs a password to create and use an account.
So that's what I ended up doing: maintaining my own user authentication solution and storing a PDS-only password per user in my own database. When a user registers, they get a strong, random password used only for creating PDS sessions. And just to add a little more security for the sake of paranoia, my application also holds a secret "salt" that is used to compute a cryptographic hash on demand, producing the final password string used when creating a PDS session. That helps prevent bad actors from gaining direct access to everyone's accounts in the event of a database leak, as they would also need the "salt" to make it work.
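Here's the shape of that idea, along with the `com.atproto.server.createSession` call that uses the derived password. The derivation scheme (an HMAC over the stored per-user secret using an application-level secret) is just my own approach, not anything atproto prescribes:

```elixir
defmodule MyApp.PDS.Sessions do
  @moduledoc """
  Sketch: derive the PDS-only password from a per-user random secret plus
  an application-level secret ("salt"), then create a PDS session.
  """

  # `stored_secret` is the random per-user value saved in my database;
  # `app_salt` never leaves the application's configuration.
  def derive_pds_password(stored_secret, app_salt) do
    :crypto.mac(:hmac, :sha256, app_salt, stored_secret)
    |> Base.encode64(padding: false)
  end

  def create_session(pds_url, handle, stored_secret, app_salt) do
    resp =
      Req.post!("#{pds_url}/xrpc/com.atproto.server.createSession",
        json: %{identifier: handle, password: derive_pds_password(stored_secret, app_salt)}
      )

    # => %{"did" => ..., "accessJwt" => ..., "refreshJwt" => ...}
    resp.body
  end
end
```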
This setup gives me the option to use any number of authentication methods in my own application while still having a way for users to create new sessions against the PDS. I could do passkeys, OAuth, magic links, SMS, 2FA, OTP... whatever I want, because user authentication is now decoupled from the PDS.
And honestly I don't think this is in direct conflict with the intention behind the PDS requirements. Within the scope of the PDS you don't have users, you have accounts. And since they aren't, on their own, considered end users, I think this is one of the intended PDS integrations. Applications handle user authentication and the PDS handles account authentication. Still, I wish they had chosen a more flexible solution to avoid needing a password stored on the PDS at all.
Next Steps...
At this point, I've got a PDS running, I can manually execute commands via my terminal (e.g. `http <my-local-pds>/xrpc/_health`), and I can even create accounts. The next step is to generate Elixir code from the base set of lexicons. Because lexicons are part of the specification, they have a strict way they are organized and defined. This is great, as it will allow for a straightforward implementation of a code generator.