Your Music Has No Genre: Reading It Off the Waveform
Serendeep Rudraraju


I wanted my Spotify playlists as local files, sorted into a tree my DJ software can
read: genre on the outside, tempo band on the inside, hard techno/140-147/. A gig
set or a workout mix as something I own, not something I stream. The tempo was a solved
problem. The genre turned out not to exist. Not in the file, not in any database I
could reach. So I stopped looking it up and started computing it from the audio.
The pieces almost exist. spotdl pulls
a playlist down as audio. librosa computes tempo. Nobody joins
download → analyze → sort into one pass, but that's plumbing. The part that wasn't
plumbing, and the part I spent the week on, was that the music I actually listen to has
its genre written down nowhere. Fixing that ended with an audio transformer running on
my laptop, listening to each track and telling me it's hard techno.
TL;DR
Spotify cut the BPM API for new apps in late 2024, so tempo has to be computed
locally; librosa handles that. The genre is the hard part: underground electronic
ships with no genre tag, and every metadata source I tested came back empty, wrong,
or too coarse to be useful. The genre is in the audio, though, so I run a small ONNX
model (MAEST, trained on 400 Discogs styles) that reads the sub-genre straight off
the waveform, locally, with no account and no upload. The bug worth writing down: the
model labelled techno tracks as "Cumbia", "Chanson", and "Grime" until I reproduced its
mel-spectrogram front end exactly. ML preprocessing fails silently; the model
never tells you it's being fed the wrong thing.
The tempo problem is solved; I'm not going to pretend otherwise
Spotify used to expose audio-features: tempo, key, the lot. They closed it to new
apps in November 2024 and shipped no replacement. So you compute tempo yourself, and
librosa.beat.beat_track is the standard. It has one well-documented failure mode:
octave errors. It will report a 140 BPM track as 70, because half-tempo is a valid
beat grid too. The fix is a plausible-tempo window and a post-processing step that
snaps the obvious half/double mistakes back into range. Genre-informed prior, global
window, done. This is known territory and I'll leave it there.
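A minimal sketch of that snap, assuming the raw estimate has already come out of a beat tracker such as librosa.beat.beat_track. The function name `fold_tempo` and the default 100-200 BPM window are hypothetical choices for illustration; a genre-specific window would be passed in the same way.

```python
import math


def fold_tempo(bpm: float, lo: float = 100.0, hi: float = 200.0) -> float:
    """Move a tempo estimate by whole octaves until it sits in [lo, hi).

    `lo` and `hi` are illustrative defaults, not values from any library.
    If no octave shift lands inside the window, the raw estimate is
    returned unchanged rather than forced into range.
    """
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    # Candidate tempos: quarter, half, original, double, quadruple time.
    candidates = [bpm * 2.0 ** k for k in (-2, -1, 0, 1, 2)]
    in_window = [c for c in candidates if lo <= c < hi]
    if not in_window:
        return bpm
    # Prefer the candidate that needs the fewest octave shifts.
    return min(in_window, key=lambda c: abs(math.log2(c / bpm)))


if __name__ == "__main__":
    for raw in (70.0, 140.0, 290.0):
        print(raw, "->", fold_tempo(raw))
```

A window that spans exactly one octave always contains one candidate, so the result is unambiguous. A narrower genre window can come up empty, which is why the function falls back to the raw value.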
The genre is not in any database
Here is the part nobody warns you about: for underground electronic, genre metadata barely exists.
spotdl writes whatever Spotify has into the file tags. For a T78 or a Timmo or a
Clara Cuvé track, the genre field comes back empty. Spotify knows the artist, the
album, the release date. It does not know the genre of the individual track. So I
reached for MusicBrainz, the open music database, the
obvious next stop. I pulled eight of my own tracks and looked each up by hand. Four
returned empty genre and tag lists. Four had no match at all. MusicBrainz is a good
project, but it was built by people cataloguing the music they care about, and nobody
has sat down to tag last month's hard-groove white label.
I was convinced for about a day that the answer was a better lookup. There is a whole chain of them, so I tested each against my real tracks rather than against the docs:
| Source | Credential | What it actually returned |
|---|---|---|
| Embedded tag (spotdl/Spotify) | none | empty for every track |
| MusicBrainz | none | 4/8 empty, 4/8 no match |
| Last.fm tags | free key | matched the wrong artist on a Unicode lookalike; otherwise generic ("edm") |
| Deezer | none | matched 17/18, but everything is "Electro" or "Dance", with no sub-genre |
| Discogs styles | token | genuinely good ("Peak Time / Driving"), but misses the newest releases |
Every row has the same flaw. Each depends on someone, somewhere, having already written the genre down. Deezer matches almost everything and tells me nothing, because its taxonomy can't separate hard techno from tech house. Discogs is the closest to right and still misses exactly the recent underground releases I care about. For new music, the authoritative label doesn't exist yet, so no lookup can return it.
That is the reframe that mattered: I didn't have a lookup problem. I had a label that only existed in the audio.
So compute the label from the signal
A DJ can hear four bars and name the sub-genre. The information is in the waveform; it simply isn't in any text field. Which turns the problem from "find a better database" into "classify the audio", and there is a model for that.
MAEST is an audio transformer trained on 400
Discogs styles. You feed it audio, it returns probabilities over a taxonomy that
natively separates Hard Techno, Tech House, Minimal Techno, Trance. It runs on
onnxruntime on a CPU. No account, no API, no cloud
round-trip: the model file sits in a cache directory and the audio never leaves the
machine. For a tool whose whole premise is "this is local and it's yours," that
mattered more than a point or two of accuracy.
The cost is size, and I'd rather be honest about it. I went for the lighter 18 MB variant first, but it ships its classifier head as a TensorFlow graph with no ONNX export, which would have pulled TensorFlow back into a project working hard to avoid it. The larger self-contained model is 330 MB and end-to-end ONNX. One download, then it runs offline forever. I took the disk over the dependency.
The bug: confidently wrong, and silent about it
The first run was garbage, delivered with total confidence. T78's "Bombacid", a peak-time techno track, came back as Grime. Timmo's "Salty" came back as Cumbia. Clara Cuvé got Chanson. The model wasn't broken. It was being handed nonsense and answering the question it was actually asked.
MAEST doesn't take raw audio. It takes a mel-spectrogram, a specific 2D representation, and that spectrogram has to be computed with the exact recipe the model trained on. I had written a reasonable-looking front end that was wrong in two ways. I used the magnitude spectrum where the model expects power, and I skipped the per-dataset normalization entirely. To the model, my input wasn't quiet techno or loud techno; it was off any manifold the network had ever seen, so it landed wherever.
The fix was to stop approximating and reproduce the published feature extractor exactly: power spectrogram, slaney mel filters, log compression, then subtract the training-set mean and divide by twice the standard deviation. Two constants, 2.0676 and 1.2683, copied out of the model's own feature-extractor config. Same audio, same model, corrected input:
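The order of those steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the published extractor. The function name `front_end` is hypothetical. `mel_filters` stands in for a Slaney-style filterbank built elsewhere. The `log10(1 + 10000 * x)` compression is an assumption about the recipe that should be checked against the model's own config. Only the two normalization constants come from the text above.

```python
import numpy as np

# Rounded constants quoted in the post (training-set mean and std).
MEAN = 2.0676
STD = 1.2683


def front_end(magnitude: np.ndarray,
              mel_filters: np.ndarray,
              compression: float = 10000.0) -> np.ndarray:
    """Turn an STFT magnitude into a normalized log-mel input.

    magnitude:   (n_bins, n_frames) STFT magnitudes
    mel_filters: (n_mels, n_bins) filterbank, assumed Slaney-style
    compression: ASSUMED log-compression constant, verify before use
    """
    power = magnitude ** 2                  # power, not magnitude
    mel = mel_filters @ power               # (n_mels, n_frames)
    log_mel = np.log10(1.0 + compression * mel)
    # Subtract the mean, divide by twice the standard deviation.
    return (log_mel - MEAN) / (2.0 * STD)


if __name__ == "__main__":
    mag = np.full((3, 2), 0.1)
    print(front_end(mag, np.eye(3)))
```

Dropping the squaring step or the final line still yields a well-formed array of the right shape, which is exactly why the mistake stays silent.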
| Track | Before | After |
|---|---|---|
| Luciid - Era Of Us | Drum n Bass | hard techno (0.78) |
| OMAKS - On Da Beat | Hardcore | hard techno (0.62) |
| Timmo - Salty | Cumbia | techno (0.60) |
| Lorenzo Raganzini - Born Slippy | Hardcore | hard techno (0.43) |
Fourteen of eighteen landed on a correct hard-electronic sub-genre. The remaining four sit at low confidence, which is where you want your errors: below a threshold, where they fall through to the next option instead of asserting nonsense.
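The threshold idea, as a minimal sketch. The name `confident_label` and the 0.4 cut-off are hypothetical; the post does not state the actual threshold, so treat the number as a placeholder to tune against tracks whose genre you already know.

```python
from typing import Mapping, Optional


def confident_label(probs: Mapping[str, float],
                    threshold: float = 0.4) -> Optional[str]:
    """Return the top label only if its probability clears the threshold.

    Returning None signals "no opinion", so the caller can fall through
    to the next genre source instead of asserting a weak guess.
    """
    if not probs:
        return None
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None


if __name__ == "__main__":
    print(confident_label({"hard techno": 0.78, "cumbia": 0.05}))
    print(confident_label({"cumbia": 0.21, "chanson": 0.19}))
```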
The transferable lesson is the silence. A misconfigured preprocessor doesn't raise. The model doesn't report that its input is out of distribution. It returns a clean, well-formed, completely wrong answer, and if you don't have ground truth to check against, like eighteen tracks you happen to know the genre of, you ship it. I keep relearning this: with models, the failure that costs you is rarely a crash. It's a confident answer to a question you didn't realize you were asking.
Never strand a file
A sorting tool has one unforgivable bug: losing a track. So genre resolution is a chain that always lands somewhere. Embedded tag first if it exists, then the audio model, then a coarse Deezer lookup by name, the last only if you opt in, because it is the single step that leaves your machine. If everything comes up empty, the track is filed under the artist's name rather than dropped in an "unsorted" bucket. An artist folder is a real location; "unsorted" is where files go to be forgotten.
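One way to express that chain is a list of resolver functions tried in order, with the artist name as the final fallback. This is a sketch, not cratemind's code: `resolve_folder`, `build_chain`, and the track dictionary keys are hypothetical names, and the three resolvers are passed in as plain callables.

```python
from typing import Callable, List, Optional

Resolver = Callable[[dict], Optional[str]]


def build_chain(embedded: Resolver,
                audio_model: Resolver,
                network_lookup: Resolver,
                allow_network: bool = False) -> List[Resolver]:
    """Order the genre sources; the network step is strictly opt-in."""
    chain = [embedded, audio_model]
    if allow_network:
        chain.append(network_lookup)
    return chain


def resolve_folder(track: dict, resolvers: List[Resolver]) -> str:
    """Return the first non-empty genre, else the artist name.

    There is deliberately no "unsorted" outcome: every track gets a
    real folder name.
    """
    for resolve in resolvers:
        genre = resolve(track)
        if genre:
            return genre
    return track["artist"]


if __name__ == "__main__":
    track = {"artist": "T78", "title": "Bombacid", "genre_tag": ""}
    chain = build_chain(
        embedded=lambda t: t.get("genre_tag") or None,
        audio_model=lambda t: None,        # stub: model had no confident label
        network_lookup=lambda t: "Electro",
    )
    print(resolve_folder(track, chain))
```

Building the chain separately from walking it keeps the opt-in decision in one place, so the network resolver cannot run unless it was explicitly added.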
The same instinct closed a quieter gap. Songs that failed to download used to vanish: spotdl couldn't find a match, the file never appeared, and the interface simply showed fewer tracks than the playlist had, with no error. cratemind now fetches the full expected tracklist up front, diffs it against what actually landed, and shows the missing ones as failed rows. A silent gap reads as success; a red row reads as the truth.
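The diff itself is small. A sketch, assuming each track carries some stable identifier shared by the expected tracklist and the downloaded files; `track_rows` and the "ok"/"failed" status strings are hypothetical names for illustration.

```python
from typing import Iterable, List, Tuple


def track_rows(expected_ids: Iterable[str],
               landed_ids: Iterable[str]) -> List[Tuple[str, str]]:
    """Pair every expected track with a status, in playlist order.

    Anything expected but not landed becomes an explicit "failed" row
    instead of silently disappearing from the list.
    """
    landed = set(landed_ids)
    return [(tid, "ok" if tid in landed else "failed")
            for tid in expected_ids]


if __name__ == "__main__":
    rows = track_rows(["a", "b", "c"], ["c", "a"])
    for tid, status in rows:
        print(tid, status)
```

Iterating over the expected list, rather than over what landed, is the whole point: the output always has as many rows as the playlist has tracks.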
What it is

Paste a playlist link. cratemind downloads each track, computes a true-tempo BPM and
the Camelot key for harmonic mixing, reads the sub-genre off the audio, and files
everything into a {genre}/{tempo}/ tree you can template. It resumes cheaply, exports
a small crate.json you can hand to someone else, and runs entirely on your machine.
The audio, the analysis, all of it local.
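A sketch of how a templated {genre}/{tempo}/ path could be built. The helper names, the 8-BPM band width, and the offset that makes bands start at 140 are assumptions inferred from the hard techno/140-147/ example, not cratemind's actual settings.

```python
from pathlib import PurePosixPath


def tempo_band(bpm: float, width: int = 8, origin: int = 4) -> str:
    """Label the band containing `bpm`, e.g. 143.2 -> "140-147".

    `width` and `origin` are illustrative: bands are `width` BPM wide
    and start at values congruent to `origin` modulo `width`.
    """
    start = ((int(bpm) - origin) // width) * width + origin
    return f"{start}-{start + width - 1}"


def target_dir(genre: str, bpm: float,
               template: str = "{genre}/{tempo}") -> PurePosixPath:
    """Fill the folder template with the genre and tempo band."""
    return PurePosixPath(template.format(genre=genre, tempo=tempo_band(bpm)))


if __name__ == "__main__":
    print(target_dir("hard techno", 143.2))
    print(target_dir("hard techno", 143.2, template="{tempo}/{genre}"))
```

A real tool would also need to sanitize genre names that contain path separators, which this sketch leaves out.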
It began as "download my playlists" and turned into a small argument I didn't expect to be making: for the music worth digging for, the metadata was never going to save you. The label isn't in the database. It's in the signal, and you have to compute it.
cratemind is on GitHub.