A Discordant Journey: Elixir + Discord
A few months ago I decided to embark on some personal programming using Elixir. I loved Erlang when I first played around with it many years ago and have enjoyed Elixir since the initial release (long before it hit 1.0). Around this same time I was setting up to host a Dungeons and Dragons campaign for my friends and wanted some way to record our sessions for personal enjoyment and reminiscing. We chose Discord for our voice chat service and Roll20 for our map viewer, dice roller, etc. I was having an awful time trying to get my computer set up to record both my incoming audio from Discord and my outgoing audio from the microphone, and so I looked to see if Discord had any voice recording bots, and they did. I had stumbled upon Craig.
Craig was a wonderfully simple-to-use bot that would record your audio, but one of the biggest features to me was the fact that each speaker was recorded as a separate audio track. I have a friend who types loudly when we’re playing and you can hear it frequently in the background, but if he is recorded as a separate track then it becomes simple to silence the portions where he’s just typing without messing up anyone else’s audio. It also allows me to raise or lower the volume of specific individuals in an attempt to equalize the average speaking volume. But there was one thing I didn’t like about Craig: it was written in node.js.
I won’t go into a rant against node.js, but I personally don’t like it. Craig works wonderfully though, so don’t think I’m against it because node “doesn’t work” - it does, obviously it does. But when I opened up the code to read it… well, a large amount of readability comes from the developers themselves, taking the time to give obvious meaning to variable and function names as well as ensuring that logic is simple to follow. But my main problem with all node code, especially Craig, is that developers are essentially forced to create “callback hell”. It’s just callback within callback within callback or chain this future to that future to this other future to, wait this function isn’t marked properly so you need to work around how to turn this non-asynchronous operation into an asynchronous one that uses futures so that all the code actually works. It’s awful. I ended up ranting, sorry, but it was a short one. Back to the point…
I decided to see if it was possible to rewrite Craig (which is open-source software) in Elixir. I had hoped it would be “better” in a sense because Discord itself uses Erlang/Elixir for its backend services. Unfortunately, that wasn’t quite the case. Without going too deep into any one part of my journey, here are some of the major problems I encountered.
ETF (Erlang Term Format)
One of the things I was most happy to initially see was that you could tell Discord to use ETF (Erlang Term Format) for the over-the-wire protocol. The other protocol available is the more prolific JSON format, but I appreciated the binary format of ETF and, since Elixir is Erlang, I could natively encode and decode data structures into and out of ETF. I was super excited when my initial tests were working… and then anything more complicated than the initial handshake started experiencing partial failures. That is, some of the data couldn’t be accessed.
Originally I had assumed the data in the payloads was inaccessible because it was simply missing or because I had used the wrong key to access it. Nope, it was more insidious than that. After inspecting the raw Erlang map structure that was being used for the payload, I noticed that a small number of keys were atoms instead of strings. For those not familiar with the concept of atoms, they are like Ruby’s symbols: they look like strings, but each one is a named constant stored once by the runtime, which makes comparison and lookup much faster than with strings, at the cost of direct access to string functionality. Because Elixir doesn’t require all keys in a map to have the same type, it is completely valid to mix key types, though this is generally avoided for exactly this reason.
The engineers at Discord probably don’t view this as a problem, assuming that almost everyone uses JSON anyway, and there is probably little real support for those using ETF. In the JSON conversion an atom becomes a string, and strings are still strings, so all the keys end up with the same type in that format, because JSON object keys are always strings. It’s just disappointing that the teams at Discord don’t have better standards or review processes that ensure services adhere to such standards.
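To make the mixed-key problem concrete, here is a minimal sketch of one way to work around it: convert every atom key to a string right after decoding. The KeyNormalizer module and the sample payload are made up for illustration and are not taken from the bot’s actual code or from a real Discord payload.

```elixir
defmodule KeyNormalizer do
  # Hypothetical helper: recursively converts atom keys to string keys.
  def stringify_keys(map) when is_map(map) do
    Map.new(map, fn {key, value} -> {stringify_key(key), stringify_keys(value)} end)
  end

  def stringify_keys(list) when is_list(list), do: Enum.map(list, &stringify_keys/1)
  def stringify_keys(other), do: other

  defp stringify_key(key) when is_atom(key), do: Atom.to_string(key)
  defp stringify_key(key), do: key
end

# A made-up payload with mixed key types, round-tripped through ETF.
payload = %{"op" => 0, :t => "MESSAGE_CREATE", "d" => %{:id => "1", "content" => "hi"}}
decoded = payload |> :erlang.term_to_binary() |> :erlang.binary_to_term()

# Looking up "t" fails because the real key is the atom :t.
IO.inspect(decoded["t"])
# After normalizing, every key is a string and the lookup works.
IO.inspect(KeyNormalizer.stringify_keys(decoded)["t"])
```

Normalizing once at the edge means the rest of the code can use string keys everywhere, at the cost of an extra pass over each payload.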
Gateway to Madness
Discord has two kinds of connectivity: REST and WebSocket. REST is for anything where you don’t need to receive real-time updates, WebSocket for anything you do. To receive real-time notification of people joining or leaving a voice channel, we need to be connected to a WebSocket, which Discord refers to as the Gateway. There are some operations that you can invoke over your Gateway connection, but not many. Usually if you want to invoke an action, you are required to use the REST interface.
It is odd, to me, that I could receive a chat message event, parse it, and then not be able to send a response over the same Gateway I received the event from initially. Instead I needed to toss the response to a REST client to invoke the “create message” function. I would have liked to see such actions possible from within the Gateway; it would have allowed me to remove my dependencies on HTTP and JSON which are both required by the REST interface.
Oh, except I still wouldn’t have been able to get rid of JSON. There’s actually another Gateway just for voice channels. When a voice channel is created, it is assigned a unique WebSocket endpoint, and you must generate a new WebSocket connection using that dynamically-retrieved endpoint. And this Gateway only accepts the JSON format. At least, that’s what the documentation states; I have not tried ignoring the documentation to see if ETF would work.
I almost forgot: after connecting to the second Gateway just for voice channels, there’s a need to open a UDP connection to another dynamically-retrieved endpoint that is used solely for the raw voice data.
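For anyone who hasn’t used UDP from Elixir, the sketch below shows the basic mechanics with Erlang’s built-in :gen_udp module. It is only an illustration: the UdpDemo module is hypothetical, and a second local socket stands in for Discord’s voice endpoint, whose address and port would really come from the voice Gateway.

```elixir
defmodule UdpDemo do
  # Sends `payload` to a local UDP socket and returns what that socket received.
  def round_trip(payload) do
    # Port 0 asks the OS for any free port; `active: true` delivers datagrams
    # to the owning process as {:udp, socket, ip, port, data} messages.
    {:ok, receiver} = :gen_udp.open(0, [:binary, active: true])
    {:ok, port} = :inet.port(receiver)
    {:ok, sender} = :gen_udp.open(0, [:binary])

    :ok = :gen_udp.send(sender, {127, 0, 0, 1}, port, payload)

    result =
      receive do
        {:udp, ^receiver, _ip, _port, data} -> {:ok, data}
      after
        1_000 -> {:error, :timeout}
      end

    :gen_udp.close(sender)
    :gen_udp.close(receiver)
    result
  end
end

IO.inspect(UdpDemo.round_trip(<<0x90, 0x78, 1, 2, 3>>))
```

Because the socket is opened in active mode, incoming datagrams arrive as ordinary process messages, which is what makes it natural to hand them to a pattern-matching function.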
The Sound of Success
This was my first time needing to work with libsodium and libopus. Sodium was used to encrypt the voice data, and it was awesome to see that all voice data sent over Discord was encrypted. Opus was used to encode the audio data. It took a lot of trial-and-error, mostly to get the packet decrypted and parsed correctly, but I believe that Elixir was definitely the correct choice here. In my opinion, the pattern-matching power of Erlang/Elixir really makes it simple to know exactly what data you want and creates code that is more readable to others. Let me show you an example of my packet processing function, which is invoked whenever the UDP socket receives some data.
    def process_packet(<<0x90, 0x78, _::binary>> = packet, state) do
      # do stuff
    end

    def process_packet(_, state), do: {:noreply, state}
It’s easy to see that the first function body is only invoked when the incoming packet starts with 0x90 and 0x78, which is the header that denotes audio data (as opposed to extraneous metadata that is not useful to my bot’s purpose). You can also see the catch-all after. It’s just a far more readable version of an if/else or switch/case statement. I can also pull out data from the packet easily.
<<_ :: binary-size(2), s :: binary-size(2), ts :: binary-size(4), ssrc :: binary-size(4)>> = header
After successfully decrypting the audio packet there is a 12-byte header that contains some useful information. In this case, s is the sequence number, ts is the timestamp, and ssrc is a unique identifier for the source of the audio. In the case of the voice channel, the audio source tends to be the people in the channel. But look at how simple it was to extract byte-data of various sizes. For reference, binary-size refers to how many bytes you want, so binary-size(2) translates to 2 bytes.
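The same match can also hand back integers instead of raw binaries by giving each field a bit size. The sketch below is a variation under a few assumptions: the VoiceHeader module name and the sample bytes are made up, and the fields are read as unsigned big-endian integers, which is the default in Elixir’s binary syntax.

```elixir
defmodule VoiceHeader do
  # Parses a 12-byte header: 2 ignored bytes, a 16-bit sequence number,
  # a 32-bit timestamp, and a 32-bit SSRC.
  def parse(<<_prefix::binary-size(2), seq::16, ts::32, ssrc::32>>) do
    {:ok, %{sequence: seq, timestamp: ts, ssrc: ssrc}}
  end

  # Anything that is not exactly 12 bytes is rejected.
  def parse(_other), do: :error
end

# Made-up sample bytes, not captured from Discord.
header = <<0x90, 0x78, 0x00, 0x2A, 0x00, 0x00, 0x03, 0xC0, 0x00, 0x00, 0x30, 0x39>>

IO.inspect(VoiceHeader.parse(header))
# => {:ok, %{sequence: 42, timestamp: 960, ssrc: 12345}}
```

Getting integers straight out of the match saves a separate decoding step when the sequence number or timestamp is needed for ordering packets.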
Sadly, all of this success was the result of much trial-and-error, because there is practically no documentation on this part of Discord. I’m sure that having applications and bots connect to the voice channels isn’t a major focus for the Discord team, but the documentation is sorely lacking. They link to the Wikipedia entry on UDP hole punching without any additional information on how their particular service implements this flow (it requires the SSRC given in an earlier connection step padded with zeroes, for example). Thankfully it was somewhat fun to look around the sodium and opus documentation, inspect the binary packet data, and figure out how everything fit together.
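As a rough illustration of the “SSRC padded with zeroes” detail, building such a packet in Elixir is nearly a one-liner. Treat this as a sketch only: the Discovery module is hypothetical, the 32-bit big-endian SSRC matches the 4-byte field in the header above, and the default total length of 70 bytes is an assumption that should be checked against Discord’s current voice documentation.

```elixir
defmodule Discovery do
  # Builds the SSRC followed by zero padding up to `total_size` bytes.
  # The default of 70 bytes is an assumption, not a documented guarantee.
  def packet(ssrc, total_size \\ 70) when total_size >= 4 do
    <<ssrc::32>> <> :binary.copy(<<0>>, total_size - 4)
  end
end

packet = Discovery.packet(12345)

IO.inspect(byte_size(packet))
# => 70
```

The resulting binary can be passed directly to :gen_udp.send/4 once the voice endpoint’s address and port are known.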
The Tricky Part
In all honesty, there are a lot of tricky parts to creating an application that interacts meaningfully with Discord. But, thankfully, the trickiest part of all was Craig’s custom logic for taking decrypted audio data and storing that in a format that the “cooking” scripts could process later. And that’s a good thing, because at this point I’ve got practically all of the Discord-related code done. Aside from a few bumps due to a lack of standards and a lack of documentation, it was still straightforward to create an application that could interact with Discord and its services.
The one part that I haven’t yet fixed, but will in the near future, is leaving a voice channel “the right way.” I put that in quotes because even if I request the bot to leave the voice channel it is currently in, it will still have an open WebSocket connection that will literally throw an error once everyone has left the voice channel. Why? Because that dynamically-generated endpoint for a voice channel is generated every time the voice channel goes from zero members to one. To put that another way, the voice channel is destroyed every time the member count drops to zero. So I was getting an error that “oh hey, the voice channel crashed, super weird but you should try reconnecting.” Only, it didn’t crash, it literally was closed because that’s how the flow works. Yet, for some reason, it will still occasionally throw the “try reconnecting” error.
Fast Iteration
One of the things I loved about doing this project in Elixir was the fast iteration. Compiling took very little time but also caught a majority of issues, leading to more frequent successes than crashing errors. That in turn meant I spent less time testing bad code, which shortened the overall development period. I also enjoyed that the increased readability from Elixir’s language features allowed me to more quickly grasp what a piece of code was doing so that I could fix it, improve it, or safely remove it as necessary.
There are still some tweaks to be made, but those all exist within the scope of how the audio data is stored and processed. I’ve got all the Discord integrations completed. It took some stumbling through a lack of standards and poor documentation around some of the key components of voice gateways, but overall it was a positive experience. And one of the main goals was simply to get back into Elixir, for which this project really helped me tackle a breadth of Elixir features and programming concepts, so it was a good educational project.