Making the ALSA Sequencer Talk Plain Text
I was staring at a hex dump of MIDI events at 3 AM on my Ubuntu 24.04 machine, wondering why my Python script was dropping note-off messages. Trying to get a local AI agent to compose and play MIDI directly through the Linux kernel is an absolute nightmare. Raw bytes, timing issues, obscure C structs. I kept getting the dreaded snd_seq_open failed: -2 error every time the agent tried to initialize the port.
ALSA is old. It works, and it runs practically every piece of audio hardware connected to a Linux box, but the API is actively hostile to modern scripting. If you want an agent to generate music or control a physical hardware synth, the agent can't output raw snd_seq_event_t structs. Agents spit out text. They are text prediction engines.
Usually, we write massive wrappers to bridge this gap. You feed the agent a JSON schema, hope it formats the JSON correctly, parse it, map it to a library like mido, and pray the ALSA backend doesn’t crash. I wasted three days building a pipeline exactly like this. It was brittle. The agent kept hallucinating nested JSON objects that broke the parser.
Then I ripped it all out and tried something stupidly simple: piping ALSA sequencer events directly to plain text, and vice versa.
Text as the Universal Interface
The Unix philosophy is supposed to be about text streams. We somehow forgot this when it came to audio. But if you write a tiny utility that translates ALSA sequencer events into a standard, human-readable text stream, everything suddenly clicks into place.

Imagine your sequencer output looks like this instead of raw hex:
NOTE_ON ch=0 note=60 vel=100
WAIT ms=120
NOTE_OFF ch=0 note=60 vel=0
NOTE_ON ch=0 note=64 vel=90
Agents understand this perfectly. You don’t need a complex system prompt or a massive JSON schema. You literally just tell the model, “Write a text stream that plays a C major chord progression,” and pipe that standard output directly into your text-to-ALSA utility.
I built a quick prototype in C to handle the ALSA side, keeping the Python side strictly for the agent logic. The agent generates lines of text. The C utility reads stdin, parses the text strings, and fires the ALSA events to my Korg Minilogue.
The Latency Question
My immediate worry was performance. Text parsing is slow. Audio requires tight timing. I assumed converting strings to integers and managing string buffers would introduce horrible jitter.
I was wrong. I ran a stress test using Python 3.12.2 to blast 5,000 dense MIDI events (a chaotic, unmusical arpeggio) through the text pipe into the ALSA sequencer, and measured the parsing overhead on my M2 Mac (running an Ubuntu VM). The translation delay was barely 1.4 ms per event block.
