Journey with AT Protocol: Part 3

Multiformats - a self-describing value format

I am on a journey to understand the AT Protocol better by implementing it from scratch in the Elixir programming language. I have a running PDS, I've generated working code from the official set of Lexicons, and I can stream in repo events over a websocket connection. But that event data is a bunch of DAG-CBOR binary data, CID references, and CAR file exports! This means it is a good time to talk about multiformats, a set of protocols for self-describing values.

Is This Necessary?

Before I begin talking about multiformats, I want to answer the question, "is this digression a necessary endeavor?" And the answer is quite simple: no.

The PDS bears the primary burden of using multiformats: it is what generates CIDs, manages CAR blocks, and so on. Jetstream, which I talked about in the previous post, also needs to understand multiformats — specifically the DAG-CBOR and CAR codecs — so that it can decode event stream data properly.

But me? At most I would need to be capable of decoding CBOR data so that I could parse out CIDs from events from the subscribeRepos subscription. But that would be it. If I didn't want to deal with CAR block data, I could simply take the CIDs for relevant events and use them to fetch the underlying content in JSON format. It would require additional interactions and increase overall latency before I had usable data, but it's a reasonable solution.

However, keep in mind that this journey is about understanding, not just the quickest path to implementation. I had read the original research paper on Merkle-Search Trees to understand the underlying data structures present in the spec, even though you never directly interact with MSTs if you aren't building a PDS. And so when I came across references to concepts like CIDs and multihashes, I just had to know more about them.

What Is Multiformats?

At the beginning of this post I mentioned that multiformats is a set of protocols for self-describing values. But what is a self-describing value? It's any value that contains identifying information about itself. This can be a confusing definition if you're not familiar with the concept, so let's dive into an example to better understand what a self-describing value is and why they matter.

Imagine that I gave you a base-encoded string and told you to decode it correctly on your first try. Could you do that? On your first try? If I said, "decode the base-encoded string 74657374", would you be able to definitively give me the correct answer in one attempt without any other information? Keep in mind, the content that is base-encoded could be arbitrary binary data (e.g. DAG-CBOR).

The answer is "no", in case the obvious trap was not so obvious. All of the characters in the above string are present in multiple base encodings (e.g. 16, 32, 64, etc). Additionally, the byte length would be acceptable in multiple base encodings. There's simply no information about this string that gives you definitive information about which base encoding was used to create it.
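To make the trap concrete, here's what Elixir's standard Base module does with that exact string. Every decoder accepts it without complaint, and each one hands back different bytes:

```elixir
# The same eight characters are valid input to several base decoders,
# and each decoder returns different bytes.
Base.decode16!("74657374")
# => "test"

byte_size(Base.decode32!("74657374"))
# => 5 (an arbitrary 5-byte binary, nothing like "test")

byte_size(Base.decode64!("74657374"))
# => 6 (yet another unrelated binary)
```

Nothing in the string itself tells you which of these answers is the "right" one.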

But what if the string did contain information about itself? What if the first byte of the string actually represented a specific base encoding? Well, then the string would be describing what it is, which in turn gives us instruction on how to decode it. And that's what multiformats aims to standardize: how to prefix binary data with a well-defined mapping of codecs and other metadata so that anyone can inspect the data itself to understand what it is/how it was created.
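Here's that idea as a tiny sketch. The prefix characters "f", "b", and "m" are real multibase prefixes (base16, base32, and base64 respectively), but the module itself is illustrative, not a full implementation:

```elixir
# Illustrative sketch of the multibase idea: the first character of the
# string names the base used to encode the rest. "f", "b", and "m" are
# real multibase prefixes; the module itself is just a toy.
defmodule MultibaseSketch do
  def decode("f" <> rest), do: Base.decode16!(rest, case: :lower)                  # base16, lowercase
  def decode("b" <> rest), do: Base.decode32!(rest, case: :lower, padding: false)  # base32, lowercase
  def decode("m" <> rest), do: Base.decode64!(rest, padding: false)                # base64, no padding
end

MultibaseSketch.decode("f74657374")
# => "test" — no guessing required
```

With one leading character, the ambiguity from the earlier puzzle disappears entirely.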

If you're wondering, "but how often is that an issue?", well it's a lot more prominent in some software systems than it is others. Blockchain and distributed computing will run into these issues far more frequently, especially those where a specification might not enforce a particular solution, or where two or more standards want to interact with each other but use different codecs to encode/decode content.

Multiformats All The Way Down

One of the most interesting aspects of multiformats is that it is specifically designed to be composable. That is to say, you typically see multiple components of multiformats used together. But rather than hypotheticals, let's take something actually used in atproto that demonstrates this: the CID (Content IDentifier).

Although atproto has a "blessed" CID spec of its own, it admits that CIDs are selected specifically to allow evolution over the life of the protocol. So even if you think you can "cheat" with CID decoding in the context of atproto, you can't because there's an inherent risk that the protocol will evolve and the allowed values within a CID will change. For that reason, it's good to understand the CID as it is described in the context of multiformats.

Here's a breakdown of a CID into its various components to illustrate how multiformats are intended to be used. Since CID v1 is used by atproto, I'll look only at that version for now.

cid = <multibase codec><multibase string>
multibase string => decoded => <cid version><content type><content multihash>
content multihash = <multihash codec><multihash string>
multihash string => decoded => <hash algo><hash size><hash string>

The above shows roughly how you would decode a CID, but I'll explain it in terms of encoding (i.e. reverse the order of operations in the list above).

A CID starts with a multihash of the content it is addressing. Since CIDs could be part of a binary stream, we include the hash size to tell CID decoders how many subsequent bytes are for the hash string. We also need the hash algorithm used (e.g. sha256) to be fully self-describing.
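Sketched in Elixir, assuming sha2-256 (whose multicodec code is 0x12). In general the length is varint-encoded, but a 32-byte digest fits in a single byte:

```elixir
# A multihash is <hash function code><digest length><digest>.
# 0x12 is the multicodec code for sha2-256; a sha-256 digest is 32 bytes,
# so the varint-encoded length is the single byte 0x20.
digest = :crypto.hash(:sha256, "hello world")
multihash = <<0x12, byte_size(digest)>> <> digest

byte_size(multihash)
# => 34 (1 code byte + 1 length byte + 32 digest bytes)
```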

Once we have the multihash, we prefix that with the content type. In atproto you only have two supported types currently: raw and dag-cbor. But in multiformats there are many content types you can specify. The reason we need this information is because it tells us what format the content was in when the hash was generated. So even if a record is stored as a JSON string in a PDS, if the content type is dag-cbor then it means the hash in the CID was based off the dag-cbor encoded content value.

After that, we prefix the CID version identifier (v1 in this case) to describe the overall structure of the CID. In essence, the v1 identifier tells us that the remainder of the binary data is a content type codec followed by a multihash. And you could stop there and have a valid CID if you wanted; that's the entirety of the spec for it.

But if you stop at the CID version identifier, then you are only guaranteed to have an array of bytes. It might not output well to client devices like a terminal window or web page and would almost certainly not embed properly in things like URLs. For this reason, CIDs can be multibase encoded to become a "human-friendly" string. This would allow a CID to be passed in something like a URL with ease and safety.
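Putting all of those layers together, here's a sketch of producing the string form of a CIDv1 the way atproto blesses it (dag-cbor content, sha2-256, base32 multibase). The module name is mine, and writing the codes as single bytes only works because all of these codec values happen to fit in one varint byte:

```elixir
defmodule CidSketch do
  # Sketch: CIDv1 = <version><content codec><multihash>, then multibase-
  # encoded for display. 0x01 = v1, 0x71 = dag-cbor, 0x12 = sha2-256.
  def from_dag_cbor(block_bytes) do
    digest = :crypto.hash(:sha256, block_bytes)
    cid = <<0x01, 0x71, 0x12, byte_size(digest)>> <> digest
    # "b" is the multibase prefix for lowercase, unpadded base32.
    "b" <> Base.encode32(cid, case: :lower, padding: false)
  end
end

CidSketch.from_dag_cbor(<<0xA0>>)
# => a string beginning with "bafyrei", like real atproto CIDs
```

The familiar "bafyrei..." shape of atproto CIDs falls straight out of those fixed leading bytes being base32-encoded.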

And the CID is just one example. Other use cases exist where you mix and match multiformats components to arrive at a final encoded value (usually because you've multibase-encoded other multiformats outputs).

Now why are CIDs so important? Well, for starters, you have a fairly good algorithm for preventing (or deliberately causing!) key collisions, because the key (the CID) is based on the content it references. If the content changes then the key naturally changes as well.

On the other hand, we get this cool effect where the same content will generate the same CID. Let's say you upload an image; that image will be stored as raw bytes with its own CID. Now let's assume someone else wants to upload the same image. It already exists! This means we can avoid uploading the same file twice because the CID for this specific piece of content is already present in block storage.

Additionally, you can take a hash of hashes to build a treelike structure that allows for quick and simple verification of all nodes within the tree regardless of which position you're looking at. This is the underlying concept behind a Merkle-Search Tree (MST).

As discussed previously, MST nodes use a hash value to reference other nodes. In atproto, they specifically use CIDs. This means the references in the MST are dependent on multiformats. So, again, multiformats is very important for the PDS, and less so for everyone else.

Multiformats in Elixir

As I have mentioned a few times, you don't need multiformats. But I wanted to have the option to (a) decode DAG-CBOR data and (b) decode CIDs. So I created my own implementation of multiformats in Elixir. All of the other implementations I found tended to be very old and some parts did not work in newer versions of Elixir.

This work was also done before I was aware of Jetstream, when I thought it would be necessary to at least decode DAG-CBOR data. I was also considering validating the content which means I would need to decode the CIDs to extract the multihash information.

The first thing I did was build the codec map. My project actually has a Mix task which, when provided the codec map or similarly-formatted CSV, will produce the Multiformats.Multicodec module, which is effectively just a way to hard-code a mapping of byte prefixes to codec metadata. Since multiformats might need to deal with decoding binary data streams, I used this code generation approach to ensure that the decode function was fast.

For example, sha2-256 is a codec tagged with "multihash" and given the prefix 0x12 (hex format). This means that if you have a multihash string prefixed with 0x12, you know the algorithm that produced that hash was sha2-256. And everything in multiformats has a codec tagged appropriately. This is used for multibase, multihash, multi-everything.

Thankfully, again, this is well-formatted data and we can just use a code generator. This one is particularly simple, with one exception: getting the binary string representation of the hex-encoded byte prefix so we can match on the head of a binary in the function clause itself, rather than in the function body. That head-matching is what gives us a high-performance lookup. But with EEx templates it's actually trivial: we calculate the byte prefix value while rendering the template and output it directly.
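For illustration, the generated module ends up shaped roughly like this. The module name is mine and the metadata is abbreviated; the real module has one clause per codec in the table:

```elixir
# Roughly the shape of the generated code: one function clause per codec,
# matching the varint-encoded prefix bytes in the function head.
defmodule MulticodecSketch do
  def lookup(<<0x12, rest::binary>>),
    do: {%{code: 0x12, name: "sha2-256", tag: "multihash"}, rest}

  def lookup(<<0x80, 0x04, rest::binary>>),
    do: {%{code: 0x0200, name: "json", tag: "serialization"}, rest}
end

MulticodecSketch.lookup(<<0x12, 1, 2, 3>>)
# => {%{code: 0x12, name: "sha2-256", tag: "multihash"}, <<1, 2, 3>>}
```

Because the prefix is matched in the head, the BEAM compiles this into an efficient dispatch over the leading bytes rather than a linear scan of a data structure.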

Oh, and this will require one more thing which I won't speak too much on, but we technically need variable-length integers (unsigned LEB128). In short, it's a way to compress arbitrarily large integers into a smaller number of bytes for more efficient storage and transmission. It does not make a lot of sense to use it for the codec prefixes because they are never large enough to warrant the extra effort, but multiformats is all about reuse and standardization.
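As a sketch of the mechanics (my project uses a library for this; the module below is just to show how unsigned LEB128 works):

```elixir
defmodule VarintSketch do
  import Bitwise

  # Unsigned LEB128: emit the low 7 bits of the number per byte, setting
  # the high bit on every byte except the last to say "more follows".
  def encode(n) when n < 0x80, do: <<n>>
  def encode(n), do: <<1::1, n::7>> <> encode(n >>> 7)
end

VarintSketch.encode(0x12)
# => <<0x12>> — one-byte codes pass through unchanged

VarintSketch.encode(0x0200)
# => <<0x80, 0x04>> — the two-byte prefix for the json codec
```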

Alright, back on track: the hex code in the table is the raw, un-encoded value, so we need to parse the hex string to an integer, varint-encode it, and then convert that to a binary representation. This is the code I use to do that:

code_hex                      # e.g. "0200" for the json codec
|> Integer.parse(16)          # {512, ""}
|> elem(0)                    # 512
|> Varint.LEB128.encode()     # <<128, 4>>
|> inspect()                  # "<<128, 4>>", ready to drop into the EEx template

That gives us the binary string matchers (e.g. << <%= prefix %>, _rest >>), so now let's look at the metadata that gets returned:

%{
  code: 0x12,
  prefix: <<18>>,
  name: "sha2-256",
  description: "",
  status: "permanent",
  tag: "multihash"
}

Elixir can also natively handle writing integers as hex-style numbers, so the compiler will turn the hex code 0x12 into the integer 18 for us. And while 18 is also the value within prefix, that isn't guaranteed to be the case. Each byte of a binary string holds a value of up to 255, and because the prefix is varint-encoded (the high bit of each byte is a continuation flag), any codec whose code is 128 (0x80) or greater requires two or more bytes in prefix.

Now something that really impressed me is the forethought of the codec map. To my knowledge, every codec in the codec map has a sequentially unique prefix. I made up that phrase, but what I intend to convey is that the prefixes form a prefix-free set: no codec's prefix is ever a leading sub-sequence of another codec's prefix. That's a lot of jargon, so let's look at an example.

Take json: it has a prefix of <<128, 4>>. There are NO codecs which have a prefix of <<128>>. So if you try to find a codec based on a binary string of <<128>>, you would get nothing back. Which basically tells you to fetch another byte of data from the stream and try again.

But this also means that if you did get <<128, 4>>, there are NO other codecs which start with <<128, 4>>; that sub-sequence is always going to be unique to json alone. Nothing in the codec map contains <<128, 4, 10>> or <<128, 4, 86>>. If you get <<128, 4>> when parsing the codec, you're done, no more work necessary.
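Here's a tiny sketch of what that buys you when reading from a stream. The map holds just two codecs for illustration; a real decoder would also distinguish "unknown codec" from "incomplete prefix":

```elixir
defmodule PrefixSketch do
  # Because the table is prefix-free, an exact lookup either succeeds or
  # tells us unambiguously that we should pull more bytes and retry.
  @codecs %{<<0x12>> => "sha2-256", <<0x80, 0x04>> => "json"}

  def lookup(bytes) do
    case Map.fetch(@codecs, bytes) do
      {:ok, name} -> {:ok, name}
      :error -> :need_more_bytes
    end
  end
end

PrefixSketch.lookup(<<0x80>>)
# => :need_more_bytes — <<0x80>> alone names no codec

PrefixSketch.lookup(<<0x80, 0x04>>)
# => {:ok, "json"}
```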

And to build this extremely long codec map while also designing for simplicity in codec parsing, that's just... incredible. I was really impressed. And super happy! This rule allowed me to greatly simplify my match logic.

Multibase

I want to touch on multibase specifically for a moment because it's handled a little differently from the rest of multiformats.

While some of the multiformats components may output a human-readable value, the intent is to have machine-ready binary data. Multibase is the only multiformat (currently) where machine-processing is not the intended use case. Multibase is specifically about self-describing a human-friendly string representation of arbitrary binary data. This means that you typically aren't dealing with streams of binary data when you have a multibase value.

Why does that matter? A few reasons:

  1. There are no entries in the codec map for multibase types (e.g. base32). The multibase spec defines which bases are valid within the context of multiformats.
  2. No varint encoding is used on the prefix.
  3. No length information is stored per the current version of multibase.

Quite the Digression

One day I might actually build my own Relay or need to ingest event data directly from the subscribeRepos subscription. But for now, I'm not doing that. I'll use Jetstream to simplify my ability to read in relevant events from Bluesky. But it was still impressive and enlightening to learn about multiformats, CIDs, and Merkle-Search Trees.

Next time, I'm going to talk about one "problem" I typically keep butting my head against: public vs private content.