Journey with AT Protocol: Part 2
I am on a journey to understand the AT Protocol better by implementing it from scratch in the Elixir programming language (though I did not implement a custom PDS). I already discussed my project setup and the PDS in Part 1. This part focuses on lexicons and generating usable code from them.
- Part 1: Setup/PDS
- Part 2: Lexicons/Event Streams (you're here!)
- Part 3: Multiformats Madness
- Part 4: Public vs Private
Most of the magic covered in this part is handled by the code generator, which turns Lexicon JSON files into Elixir source code. You can find the source code for it on my GitHub.
What is a Lexicon
The concept of a Lexicon is huge in atproto. Whereas other specifications such as HTTP/1.1 and ActivityPub are defined in a set of documents that mostly use prose and pseudo-code, atproto uses Lexicons: JSON files that adhere to a strict schema.
In short, the atproto schema defines a set of types, four of which are "primary" types. Each Lexicon is uniquely identified by an NSID (namespace + id) and contains one or more definitions built from those schema-defined types. Aside from what are commonly referred to as "primitive types" (e.g. numbers, strings, etc.), there are also the following "complex types":
- `record`
- `query`
- `procedure`
- `subscription`
- `object`

The first four in the above list are primary types, and the schema states that each Lexicon (i.e. each individual JSON file) can have at most one primary type in its set of definitions. It's also possible to have no primary types, which is seen with `defs.json` files, which contain only `object` types.
But let's break down these complex types because they are the ones that will require special conversion into source code.
- A `record` is data that gets persisted to block storage/addressed in the MST. You can think of it as being functionally equivalent to a row of data in a SQL database.
- Both `query` and `procedure` refer to HTTP requests, which are called XRPC within the scope of atproto. Queries are `GET` requests and procedures are `POST` requests.
- A `subscription` defines an event stream (which we'll discuss briefly at the end of this post) and the structure of the data it emits.
- An `object` is effectively just a type definition: it gives an NSID to a particular "shape" of data, e.g. the response to a `query` request.
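To make that concrete, here's roughly what a small Lexicon looks like once its JSON has been decoded into an Elixir map. The NSID and fields are invented for illustration, but the overall shape (`"lexicon"`, `"id"`, and a `"defs"` map holding at most one primary definition) follows the spec:

```elixir
# An illustrative (made-up) Lexicon, as an Elixir map after Jason.decode!/1.
# One primary type ("main" is a query) plus a supporting object definition.
%{
  "lexicon" => 1,
  "id" => "com.example.getThing",
  "defs" => %{
    "main" => %{
      "type" => "query",
      "description" => "Fetch a thing by its id.",
      "parameters" => %{
        "type" => "params",
        "required" => ["id"],
        "properties" => %{"id" => %{"type" => "string"}}
      },
      "output" => %{
        "encoding" => "application/json",
        "schema" => %{"type" => "ref", "ref" => "#thingView"}
      }
    },
    "thingView" => %{
      "type" => "object",
      "required" => ["id"],
      "properties" => %{
        "id" => %{"type" => "string"},
        "text" => %{"type" => "string"}
      }
    }
  }
}
```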
And that's about it. Each type has a schema it must follow, but everything about Lexicons is rather straightforward, which makes them a pleasure to work with.
Code Generation
So, looking at the schema, we can plan to do the following at a high level:
- Every `record` models data in a database.
- Every `query` and `procedure` is just a wrapper around an HTTP request.
- Every `subscription` defines an event stream name and how to decode the data it emits.
- Every `object` is a type definition, which we can take one step further in Elixir by creating a `struct`.
Thankfully, some wonderful libraries exist to help us create simple and effective Elixir modules to represent Lexicon definitions.
For `record` types we'll be using Ecto to create "schemas". An Ecto Schema can cast an arbitrary map of data into a schema struct, allows for advanced validation rules, and integrates with Ecto Changesets. Changesets are a wonderful way to incrementally build up new records or modify existing ones, apply validation rules, and get back sensible errors; many other libraries also have specialized support for Ecto Changesets, like Phoenix Forms, which understands how to render a Changeset into an HTML form.
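As a tiny illustration of the Changeset flow those generated modules lean on (plain Ecto here, nothing lexgen-specific): cast a raw map, run validations, and read the errors off the result.

```elixir
# Schemaless changeset: cast a raw params map against a type spec, then validate.
types = %{text: :string, createdAt: :utc_datetime}

changeset =
  {%{}, types}
  |> Ecto.Changeset.cast(%{"text" => "hi"}, Map.keys(types))
  |> Ecto.Changeset.validate_required([:text, :createdAt])

changeset.valid?
#=> false

changeset.errors
#=> [createdAt: {"can't be blank", [validation: :required]}]
```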
For `query` and `procedure` types we'll be using Req, which gives us dead-simple HTTP request functionality. Req is built on top of a lower-level HTTP library called Mint, which itself has a companion library specifically for managing WebSocket streams called Mint.WebSocket. So we'll use Req for queries and procedures, and Mint.WebSocket for subscriptions.
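To show the general shape of an XRPC call, here's roughly what a generated query wrapper boils down to with Req. The endpoint and parameter come from the `app.bsky.actor.getProfile` Lexicon; the host is Bluesky's public AppView and is only an example.

```elixir
# A query is just an HTTP GET against /xrpc/<nsid> with query-string parameters.
resp =
  Req.get!("https://public.api.bsky.app/xrpc/app.bsky.actor.getProfile",
    params: [actor: "bsky.app"]
  )

# Req decodes the JSON response body for us.
resp.body["handle"]
#=> "bsky.app"
```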
And while we could just create type specs for `object` types, we can gain additional functionality by creating structs for them. This will allow us to have some "validation" when creating an object type, as well as give us access to the dot-syntax method of accessing data (e.g. `struct.field`).
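For instance, a generated object module might reduce to something like the following; the module and field names here are hypothetical, not taken from a real Lexicon.

```elixir
# A struct gives us compile-time field checks, @enforce_keys "validation",
# and dot-syntax access to the data.
defmodule My.Lexicon.SomeObject do
  @enforce_keys [:uri]
  defstruct [:uri, :cid, :value]
end

obj = %My.Lexicon.SomeObject{uri: "at://did:plc:example/com.example.thing/abc"}
obj.uri
#=> "at://did:plc:example/com.example.thing/abc"
```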
Finally, I chose to use EEx as a template engine to avoid building up source file strings incrementally. Instead, this allows me to have a separate file I can maintain per output type. And because EEx templates get executed within the context of the running Elixir application, the template itself can call functions present in the application. This leads to "helpers" which encapsulate some of the more messy presentation layer logic.
So the `lexgen` application will do the following (sketched in code just after this list):
- Read in all Lexicon files.
- Parse each Lexicon file into a Lexicon struct and store it in-memory.
- For each Lexicon, parse the appropriate template file(s) and pass in the Lexicon struct for rendering.
- Save the rendered string to a file.
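Condensed into code, that loop looks roughly like this. The template path, output path, and use of raw decoded maps are illustrative, not the actual lexgen internals (which build a proper Lexicon struct first).

```elixir
# Read every Lexicon JSON file, render a template per Lexicon, write out the source.
"lexicons/**/*.json"
|> Path.wildcard()
|> Enum.map(fn path -> path |> File.read!() |> Jason.decode!() end)
|> Enum.each(fn lexicon ->
  source = EEx.eval_file("templates/schema.ex.eex", lexicon: lexicon)
  File.write!("lib/generated/#{lexicon["id"]}.ex", source)
end)
```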
I won't delve into the code generator source too much - you can look at it on my GitHub if you'd like - but I'll just briefly show a few things off as an example of how it all ended up working.
First, here's what the template looks like for an Ecto Schema:
```elixir
defmodule <%= Lexgen.Schema.deref_main(lexicon, lexicon.defs.schema.key) %> do
  use Ecto.Schema
  import Ecto.Changeset

  @moduledoc """
  <%= lexicon.defs.schema.description %>
  """

  @primary_key {:id, <%= lexicon.defs.schema.pktype %>, autogenerate: false}
  schema "<%= lexicon.nsid %>" do
    <%= Lexgen.Schema.fields(lexicon.defs.schema) %>

    # DO NOT CHANGE! This field is required for all records and must be set to the NSID of the lexicon.
    # Ensure that you do not change this field via manual manipulation or changeset operations.
    field :"$type", :string, default: "<%= lexicon.nsid %>"
  end

  def new(params \\ %{}), do: changeset(%__MODULE__{}, params)

  def changeset(struct, params \\ %{}) do
    struct
    <%= Lexgen.Schema.operations(lexicon.defs.schema) %>
  end
end
```
Simple, right? And you can see the "helpers" I'm using, such as `Lexgen.Schema.fields/1`, which keeps the template file itself clean and readable.
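To give a flavor of what such a helper does, here's a sketch of the idea. This is not the actual lexgen code; it assumes a simplified Lexicon struct whose schema carries a `properties` map.

```elixir
defmodule Lexgen.Schema do
  # Turn each Lexicon property into a `field :name, :type` line for the template.
  # Joining with "\n    " keeps the lines aligned with the template's indentation.
  def fields(schema) do
    schema.properties
    |> Enum.map(fn {name, prop} -> "field :#{name}, #{ecto_type(prop)}" end)
    |> Enum.join("\n    ")
  end

  # Map (a subset of) Lexicon primitive types onto Ecto types.
  defp ecto_type(%{type: "string", format: "datetime"}), do: ":utc_datetime"
  defp ecto_type(%{type: "string"}), do: ":string"
  defp ecto_type(%{type: "integer"}), do: ":integer"
  defp ecto_type(%{type: "boolean"}), do: ":boolean"
  defp ecto_type(%{type: "array", items: items}), do: "{:array, #{ecto_type(items)}}"
  defp ecto_type(_other), do: ":map"
end
```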
And here's an example of converting the `app.bsky.feed.post` Lexicon into an Elixir file:
```elixir
defmodule App.Bsky.Feed.Post do
  use Ecto.Schema
  import Ecto.Changeset

  @moduledoc """
  Record containing a Bluesky post.
  """

  @primary_key {:id, :id, autogenerate: false}
  schema "app.bsky.feed.post" do
    field :createdAt, :utc_datetime
    field :embed, :map
    field :entities, {:array, :map}
    field :facets, {:array, :map}
    field :labels, :map
    field :langs, {:array, :string}
    field :reply, :map
    field :tags, {:array, :string}
    field :text, :string
    field :"$type", :string, default: "app.bsky.feed.post"
  end

  def new(params \\ %{}), do: changeset(%__MODULE__{}, params)

  def changeset(%__MODULE__{} = struct, params \\ %{}) do
    struct
    |> cast(params, [:createdAt, :embed, :entities, :facets, :labels, :langs, :reply, :tags, :text])
    |> validate_required([:createdAt, :text])
    |> validate_length(:langs, max: 3)
    |> validate_length(:tags, max: 8)
  end
end
```
Not bad, eh? We're making good use of Ecto validations, we've got a bit of documentation, we've mapped atproto types into Elixir and Ecto types... all-in-all, quite good.
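And since the output is just a normal Ecto-backed module, using it feels like any other changeset workflow. A quick usage sketch, assuming the generated module above is compiled into your app:

```elixir
# Valid params produce a changeset we can apply to get a struct back.
changeset =
  App.Bsky.Feed.Post.new(%{
    text: "Hello from Elixir!",
    createdAt: DateTime.utc_now() |> DateTime.truncate(:second)
  })

changeset.valid?
#=> true

post = Ecto.Changeset.apply_changes(changeset)
Map.get(post, :"$type")
#=> "app.bsky.feed.post"

# Missing required fields come back as sensible errors instead of raising.
App.Bsky.Feed.Post.new(%{text: "no timestamp"}).errors
#=> [createdAt: {"can't be blank", [validation: :required]}]
```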
I also want to point out that there are over 200 official lexicons published by Bluesky right now, if you include both `com.atproto.*` and `app.bsky.*`. This makes sense given that the Lexicon spec only allows a maximum of one primary type per JSON file. Even so, it took me less than 2 seconds to generate a full suite of source files from the entire set of lexicons. Not bad at all.
Event Streams
Although event streams are extremely powerful tools, atproto kind of falls flat here in my opinion, though I do have an idea of why they chose the path they did.
If you search for every single `subscription` definition in the more than 200 files present in the Bluesky/atproto lexicon set, you'll find... two. Just two subscriptions are defined in the entire set of Lexicons. They are `com.atproto.label.subscribeLabels` and `com.atproto.sync.subscribeRepos`.
The former (`subscribeLabels`) isn't used much at the moment, and the latter (`subscribeRepos`) is where all the magic happens. It's responsible for letting you know about any and all changes that happen within the PDS, and I do mean any and all. Every single thing that changes the state of a repo within the PDS emits an event over the `subscribeRepos` subscription.
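Opening that stream with Mint.WebSocket (the library we picked for subscriptions earlier) looks roughly like this. The flow follows the library's documented handshake; the relay host is only illustrative, and error handling and frame reassembly are omitted.

```elixir
# Connect over HTTP/1 and upgrade to a WebSocket on the subscribeRepos XRPC path.
{:ok, conn} = Mint.HTTP.connect(:https, "bsky.network", 443, protocols: [:http1])

{:ok, conn, ref} =
  Mint.WebSocket.upgrade(:wss, conn, "/xrpc/com.atproto.sync.subscribeRepos", [])

# Finish the HTTP upgrade handshake.
handshake = receive(do: (msg -> msg))

{:ok, conn, [{:status, ^ref, status}, {:headers, ^ref, headers} | _rest]} =
  Mint.WebSocket.stream(conn, handshake)

{:ok, conn, websocket} = Mint.WebSocket.new(conn, ref, status, headers)

# Each binary frame is a DAG-CBOR header + payload that still has to be decoded.
message = receive(do: (msg -> msg))
{:ok, _conn, [{:data, ^ref, data} | _]} = Mint.WebSocket.stream(conn, message)
{:ok, _websocket, frames} = Mint.WebSocket.decode(websocket, data)
```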
On one hand, I get their decision. They wanted a firehose for events. They wanted individuals to have a single endpoint they could receive all event data from so that those subscribers could process the data however they want. No need to subscribe to multiple endpoints, no need to worry about new future endpoints; very plug-and-play.
On the other hand, that subscription is heavy. You get every little change for a repo whether you want it or not. And it's all DAG-CBOR encoded binary data, which means you need to be capable of properly decoding DAG-CBOR binaries to get at the underlying data. Not only that, but you get the actual content as a CAR file, another non-trivial, specialized format you also need to be able to decode. That is, assuming you want to read the data you were given rather than fire off an HTTP request for every relevant event just to fetch the underlying content in a more accessible format.
Thankfully, someone at Bluesky realized how impractical it was for most people to interface with this subscription, and in response they gave us Jetstream. Jetstream binds to `subscribeRepos` on your behalf, does all of the DAG-CBOR and CAR decoding, formats the result as JSON, and emits the event in a more lightweight and universally recognizable format. The tradeoff is that you can't validate the incoming data, because you don't have the DAG-CBOR encoded content block, which is what the multihash inside the CID is validated against.
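The upside is that a Jetstream event is plain JSON, so a frame can be decoded and fed straight into the generated schema. The event shape below is my paraphrase of Jetstream's commit events (treat the exact keys as an assumption), and `frame` stands for a text frame read off the WebSocket.

```elixir
# Decode a Jetstream frame and cast the embedded record with the generated module.
event = Jason.decode!(frame)

case event do
  %{"kind" => "commit", "commit" => %{"collection" => "app.bsky.feed.post", "record" => record}} ->
    App.Bsky.Feed.Post.new(record)

  _other ->
    :ignore
end
```

Convenient, but note that we're taking the record entirely on trust here.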
You might be thinking, "Well that's fine. I don't need to validate the data coming from Bluesky." Sure, no one does. But Bluesky is intended to be only one part of the atmosphere (i.e. what it calls the collection of all connected atproto applications). In reality, without doing the validation yourself, you would need to be very particular about which data hosts you would use a Jetstream interface for. After all, since anyone can stand up an atproto application, and therefore anyone can stand up a Jetstream service to replace/supplement the `subscribeRepos` endpoint, anyone can create a fraudulent Jetstream service and you would probably never know.
TL;DR: I wish they had more dedicated subscriptions with more sensible event data, but I get why they went the way they did. And we do have Jetstream as an option, which is especially useful when you're trying to get data from Bluesky.
Putting It Together...
Alright, so at this point I have a working PDS, code being generated from lexicons, and a very simple website that will let me create a user, create Bluesky posts, and view a feed of Bluesky posts from people I follow. User creation includes what I mentioned in the previous part, which means I can use Google Sign-In to create a user and still access the PDS just fine with my custom user schema.
The one really weird part I've encountered so far is that the PDS requires an invite code to create a user. The actual Lexicon for creating an account does not specifically flag invite codes as required, but the PDS implementation by Bluesky requires it. Thankfully, it's simple enough to get around that: just create an invite code and immediately use it during account registration. It adds a bit of overhead to the user creation process, but it's not noticeable.
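A hedged sketch of that workaround using plain Req: the endpoint names come from the `com.atproto.server` Lexicons, while the PDS URL and admin credential are placeholders for your own deployment.

```elixir
pds = "https://pds.example.com"
admin = {:basic, "admin:" <> System.fetch_env!("PDS_ADMIN_PASSWORD")}

# 1. Mint a single-use invite code via the admin-authenticated procedure.
%{body: %{"code" => code}} =
  Req.post!("#{pds}/xrpc/com.atproto.server.createInviteCode",
    json: %{useCount: 1},
    auth: admin
  )

# 2. Immediately spend it during account registration.
Req.post!("#{pds}/xrpc/com.atproto.server.createAccount",
  json: %{
    handle: "alice.pds.example.com",
    email: "alice@example.com",
    password: "a-strong-password",
    inviteCode: code
  }
)
```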
And because I want to make this more than a superficial dive into atproto, I've also started on debugging the `subscribeRepos` subscription. I can subscribe, listen for events, and even decode the DAG-CBOR data that gets sent over. What I can't do yet is decode the CAR data that gets sent along with any repo commits. At this point, though, it's looking like I'll need to take a detour and learn about IPLD and multiformats!