
Why We Killed Librosa

TuneLab's key detection pipeline called librosa.cqt() for one function. That single import pulled in 280 MB of transitive dependencies — numba, LLVM, LAPACK, and half the scientific Python ecosystem. Replacing it with a 50-line standalone CQT eliminated the dependency tree, cut cold start from 14.65 seconds to 2.0 seconds, and made the same model portable to browsers via ONNX Runtime WASM.

The Cost of One Import

librosa is an excellent research library. It implements dozens of audio feature extraction routines with clean, readable interfaces. TuneLab used exactly one of them: librosa.cqt(), the Constant-Q Transform — a frequency analysis where bin spacing matches musical semitones. Perfect for key detection.

The problem was everything that came with it.

pip install librosa installs numba (a JIT compiler for Python), which installs llvmlite (Python bindings for LLVM), which installs the LLVM toolchain itself. It also installs scipy, which links against LAPACK and BLAS — Fortran linear algebra libraries originally written for supercomputers in the 1970s. The total dependency footprint: approximately 280 MB of compiled code, none of which the CQT actually needs at runtime.

For a research notebook, this is irrelevant. For a production API that handles thousands of requests per hour in containerized workers, it creates two concrete problems:

  • Cold start. numba's JIT compiler runs on the first call to any numba-decorated function. librosa's filterbank initialization triggers this. On a fresh container, the first key detection request took 14.65 seconds — mostly waiting for LLVM to compile Python bytecode into native machine code. Subsequent calls ran in milliseconds, but cold start was a deployment tax.
  • Portability. The entire numba/LLVM/LAPACK stack is native code. It cannot compile to WebAssembly. As long as key detection depended on librosa, the model was locked to server-side Python — it could never run client-side in a browser.

What CQT Actually Does

The Constant-Q Transform is conceptually simple. Unlike the standard FFT (which spaces frequency bins uniformly), CQT spaces them logarithmically — each bin is a fixed ratio apart, matching the 12-tone equal temperament system used in Western music. One octave = double the frequency = 12 equally-spaced bins. This makes CQT output directly interpretable as pitch-class energy.
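The geometric bin spacing takes only a couple of lines of numpy to demonstrate (the fmin and bin-count values here are illustrative defaults, not necessarily TuneLab's configuration):

```python
import numpy as np

# CQT bin center frequencies: geometric spacing, 12 bins per octave.
# fmin = C1 (~32.70 Hz) and n_bins = 84 (7 octaves) are illustrative choices.
fmin = 32.70
bins_per_octave = 12
n_bins = 84

freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)

# One octave up = double the frequency: bin 12 is exactly 2x bin 0.
assert np.isclose(freqs[12] / freqs[0], 2.0)
```

Because each bin maps to one semitone, folding this axis modulo 12 gives pitch-class energy directly.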

The actual computation is a windowed inner product between the audio signal and a bank of complex sinusoid kernels, one per target frequency. Each kernel is a Hann-windowed cosine + sine pair at that frequency, with a window length proportional to the wavelength (longer windows for lower frequencies — the "constant Q" part).

That's it. For each frequency bin: window a chunk of audio, multiply by the kernel, sum. In matrix form: kernel_conj @ frames.T. One matrix multiply. No JIT compilation, no LAPACK, no LLVM.

Building the Standalone CQT

The standalone implementation pre-computes the kernel bank once and caches it. At inference time, the audio is framed with reflection padding, and the cached kernels are applied via a single matrix multiplication.

Pseudocode
# Pre-compute kernels (runs once, cached)
for each frequency bin:
    L = Q * sr / freq            # window length ∝ wavelength (constant-Q)
    t = arange(L) / sr           # kernel time axis, in seconds
    window = hann(L)             # Hann window
    kernel = window * exp(2πj * freq * t)
    normalize: L1 norm + sqrt(L) ortho scaling

# Inference (runs per request)
frames = pad_reflect(audio, max_kernel_len)
cqt = abs(kernel_conj @ frames.T)

The kernel cache is keyed by (sample_rate, n_bins, bins_per_octave, fmin). For a given configuration, kernels are computed once on startup and reused for all subsequent requests.
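The pseudocode above maps onto a compact numpy implementation. The following is a sketch rather than TuneLab's actual code: the function names (`make_kernels`, `cqt`), the defaults, and the framing details are illustrative, and a production version would handle edge cases this omits:

```python
import numpy as np
from functools import lru_cache

@lru_cache(maxsize=8)
def make_kernels(sr, n_bins, bins_per_octave, fmin):
    """Hann-windowed complex kernel bank, zero-padded to a common length.
    Cached per (sr, n_bins, bins_per_octave, fmin) configuration."""
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    freqs = fmin * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    lengths = np.ceil(Q * sr / freqs).astype(int)
    max_len = int(lengths.max())
    kernels = np.zeros((n_bins, max_len), dtype=np.complex128)
    for k, (f, L) in enumerate(zip(freqs, lengths)):
        t = np.arange(L) / sr
        kern = np.hanning(L) * np.exp(2j * np.pi * f * t)
        kern /= np.abs(kern).sum()          # L1 normalization
        kern *= np.sqrt(L)                  # the sqrt(window_length) scaling
        start = (max_len - L) // 2          # center shorter kernels
        kernels[k, start:start + L] = kern
    return kernels

def cqt(audio, sr=22050, n_bins=84, bins_per_octave=12,
        fmin=32.70, hop=512):
    """CQT magnitudes as one matrix multiply against the cached kernels."""
    kernels = make_kernels(sr, n_bins, bins_per_octave, fmin)
    klen = kernels.shape[1]
    padded = np.pad(audio, klen // 2, mode="reflect")
    n_frames = 1 + (len(padded) - klen) // hop
    idx = np.arange(klen) + hop * np.arange(n_frames)[:, None]
    frames = padded[idx]                    # (n_frames, klen)
    return np.abs(np.conj(kernels) @ frames.T)  # (n_bins, n_frames)
```

A 440 Hz sine run through this sketch peaks at the bin nearest A4, which is the sanity check that matters for key detection.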

The ghost factor. librosa applies a sqrt(window_length) scaling factor per frequency bin when scale=True (the default). This is a frequency-dependent normalization that adjusts magnitude based on window length. Without it, CQT magnitudes are approximately 28× too small. The downstream key detection model trained on librosa CQT output expects this scaling — drop it, and confidence values collapse silently. No error. No warning. Just worse predictions.

This took three days to find. The magnitudes "looked right" visually. The model still produced key labels. Accuracy just degraded by 15%. Only a systematic A/B test against librosa's output caught the discrepancy.
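A uniform magnitude mismatch like this is invisible to eyeballing but trivial for a numeric parity harness to catch. A minimal sketch (the function name is illustrative; `ref` would be librosa CQT output captured once as a fixture, `new` the standalone CQT on the same audio):

```python
import numpy as np

def cqt_parity(ref, new, rtol=0.05):
    """Compare a reference CQT magnitude array against a reimplementation.

    Returns (max relative error, fraction of cells within tolerance)."""
    ref = np.asarray(ref, dtype=float)
    new = np.asarray(new, dtype=float)
    rel = np.abs(new - ref) / np.maximum(np.abs(ref), 1e-12)
    return float(rel.max()), float((rel <= rtol).mean())
```

A missing per-bin scaling factor shows up immediately in this report: every cell is off by the same large ratio, so the fraction-within-tolerance collapses to zero even though the output still "looks right".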

The final standalone CQT computes in 10 ms versus librosa's 35 ms — and with zero transitive dependencies beyond numpy.

The Cascade Effect

Removing librosa from key detection was the first domino. Once the pattern was established, every remaining librosa callsite was replaceable:

| librosa function | Replacement |
| --- | --- |
| librosa.cqt() | Standalone CQT (matrix multiply, cached kernels) |
| librosa.feature.melspectrogram() | FFT + pre-computed mel filterbank (saved as .npz) |
| librosa.feature.spectral_centroid() | FFT + weighted mean formula (3 lines) |
| librosa.resample() | Polyphase FIR resampler (30× faster) |
| librosa.feature.chroma_cqt() | Standalone CQT + octave folding |
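To give a sense of scale for the "3 lines" claim: a spectral centroid really is just the magnitude-weighted mean of the FFT bin frequencies. A single-frame sketch, with illustrative windowing and framing choices:

```python
import numpy as np

def spectral_centroid(frame, sr):
    """Spectral centroid of one audio frame: the magnitude-weighted
    mean of the FFT bin frequencies, in Hz."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return float((freqs * mag).sum() / max(mag.sum(), 1e-12))
```

For a pure 1 kHz tone the centroid lands at roughly 1000 Hz, which is an easy spot check against the librosa version.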

Seven callsites across five modules. After the last one was replaced, the entire import librosa line was deleted. Container images shrank by 280 MB. Cold start dropped from 14.65 s to 2.0 s. And the dependency audit turned green — no more native code compilation required at build time.

ONNX Everywhere

With librosa gone, the entire key detection pipeline — audio preprocessing, CQT feature extraction, and neural inference — consisted of pure numpy operations plus an ONNX model. This meant it could run anywhere ONNX Runtime exists:

  • Server (bare metal): ONNX Runtime Python, 70 ms per request including CQT.
  • Browser (client-side): ONNX Runtime WebAssembly, running entirely on the user's device. No upload required. The same model weights, the same inference logic, the same results.

This is TuneLab's deployment model for all audio analysis: one ONNX graph, two runtimes, identical outputs. The preprocessing that feeds the model (CQT, mel spectrograms, spectral features) must also be portable — which is why librosa had to go. You can't ship LLVM to a browser.

Same model, same results. TuneLab's client-side tools and server API produce identical key labels for the same audio. This isn't an approximation — the ONNX graph is bit-for-bit identical. The preprocessing is mathematically equivalent. The only difference is where the compute runs.

Before and After

| Metric | With librosa | Standalone | Change |
| --- | --- | --- | --- |
| CQT computation | 35 ms | 10 ms | 3.5× |
| Key detection (end-to-end) | 3.94 s | 70 ms | 56× |
| Cold start (container) | 14.65 s | 2.0 s | 7.3× |
| Runtime dependencies | ~280 MB | 0 MB | |
| Browser deployment | impossible | WASM + ORT | |

The 56× improvement in key detection is not entirely from the CQT change — it includes replacing a slower classification algorithm with a convolutional neural network trained on pitch-class spectrograms, which was only possible once the feature extraction pipeline was free of non-portable dependencies.

The Lesson

librosa is a well-maintained, well-documented library that has powered thousands of research projects. The decision to remove it was not a judgment on its quality — it was a recognition that research tools and production tools have different constraint profiles.

Research libraries optimise for readability and flexibility. They expose every parameter, support every edge case, and compose with the broader scientific Python ecosystem. This is exactly what researchers need.

Production audio systems optimise for latency, portability, and minimal dependency surface. The actual math is usually simpler than the library wrapping it. A CQT is a matrix multiply. A mel spectrogram is an FFT followed by a dot product with a filter bank. Spectral centroid is a weighted mean. When you need exactly one function from a 280 MB library, the right answer is to implement that function.
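The mel-spectrogram claim can be made concrete too: the filter bank is a fixed matrix you precompute once, after which each frame costs one FFT and one dot product. A sketch using the common HTK-style mel formula (sizes and defaults are illustrative):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels=64, fmin=0.0, fmax=None):
    """Triangular mel filters as an (n_mels, n_fft//2 + 1) matrix.
    A mel spectrogram is then filterbank @ |rfft(frame)|**2."""
    fmax = fmax or sr / 2.0
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                      # rising edge
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                      # falling edge
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb
```

Saving this matrix to disk (e.g. as .npz) at build time is all the "pre-computation" the replacement in the table above required.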

Most audio ML projects inherit research-grade dependency trees that they don't need. If you're deploying audio models to production — especially to environments where native compilation is impossible, like browsers or serverless containers with tight size limits — audit your imports. The scaffolding that makes research convenient is the same scaffolding that makes production deployments slow and fragile.
