Auralis — 6D Audio Embedding Visualizer
Turns any sound into a navigable 6-dimensional universe. Audio is mapped onto seven embedding tracks — interpretable spectral features, PCA/t-SNE/UMAP projections, the Tonnetz harmonic space, YAMNet event semantics, and CLAP audio-text meaning — and rendered as a luminous 3D trajectory you can fly through, across a curated library of 102 sounds.
Business Context
Audio is one of the hardest data types to explore intuitively. Spectrograms are dense and unreadable to non-specialists, and embedding spaces from modern audio models are high-dimensional and abstract. There is rarely a way to actually see how a machine-learning model "hears" a sound, or to compare what different representations capture about the same audio.
Strategic Value
Auralis makes audio embeddings tangible. By projecting seven different representations of the same sound into one navigable 6D space, it turns abstract feature vectors into trajectories you can fly through and compare directly. The Features track emphasizes raw acoustic structure. PCA, t-SNE, and UMAP are corpus-wide projections of MFCC frames (one linear method, two manifold methods). Tonnetz reveals harmonic and tonal relationships. YAMNet adds event-level semantics (1024-D AudioSet embeddings). CLAP links sound to natural-language meaning (512-D contrastive audio-text embeddings), so semantically related sounds cluster even when their spectra differ. A curated library of 102 sounds (space, nature, music, human-made) and ten render modes make it both an analytical lens on representation learning and an expressive instrument. Auralis is built as a FastAPI + React/Three.js monorepo and deployed live; the CLAP embeddings are precomputed offline so the deployed app stays light.
The Challenge
Conventional audio visualizers — spectrum analyzers, waveforms — show the signal but not its structure or meaning. Two sounds that share meaning but differ acoustically look unrelated; there is no single view that places a sound by both how it sounds and what it is.
Our Approach
Each sound is analyzed into seven 6D embedding tracks (Features, PCA, t-SNE, UMAP, Tonnetz, YAMNet, CLAP), all min-max normalized so any feature can drive any axis — spatial XYZ plus color and size, with time as the implicit sixth axis. A React/Three.js frontend renders the trajectory in real time with ten render modes, synchronized to Web Audio playback. The offline data pipeline (librosa + scikit-learn/UMAP + TensorFlow/YAMNet + CLAP) extracts features and writes per-clip JSON; the FastAPI backend serves them verbatim.
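The normalization and axis assignment described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: it assumes a track is a list of 6-value frames and that each column is scaled independently over the frames passed in (whether Auralis normalizes per clip or across the whole corpus is not stated here). The function names, channel names, and sample values are hypothetical.

```python
from typing import Dict, List, Sequence


def min_max_normalize(track: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column of a (frames x dims) track to [0, 1] independently."""
    if not track:
        return []
    dims = len(track[0])
    lows = [min(frame[d] for frame in track) for d in range(dims)]
    highs = [max(frame[d] for frame in track) for d in range(dims)]
    normalized = []
    for frame in track:
        row = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A constant column has no range; map it to 0.0 instead of dividing by zero.
            row.append((frame[d] - lows[d]) / span if span > 0 else 0.0)
        normalized.append(row)
    return normalized


def map_to_axes(frame: Sequence[float], assignment: Dict[str, int]) -> Dict[str, float]:
    """Pick which normalized feature index drives each visual channel."""
    return {axis: frame[index] for axis, index in assignment.items()}


# Three frames of a made-up 6D track (one row per moment in time).
track = [
    [0.0, 10.0, 5.0, 1.0, 2.0, 7.0],
    [2.0, 20.0, 5.0, 3.0, 4.0, 7.5],
    [4.0, 30.0, 5.0, 2.0, 6.0, 8.0],
]
normalized = min_max_normalize(track)

# Hypothetical assignment: feature 2 is left unused, time is the frame order itself.
point = map_to_axes(normalized[1], {"x": 0, "y": 1, "z": 3, "color": 4, "size": 5})
```

Because every column ends up in the same [0, 1] range, swapping which feature drives which channel only means changing the indices in the assignment dictionary.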
Key Performance Indicators
| KPI | Baseline | Result | Impact |
|---|---|---|---|
| Embedding Tracks | Single spectrogram view | 7 tracks (spectral → YAMNet → CLAP) | Compare what each representation hears |
| Audio Exploration | Flat waveform / spectrogram | Navigable 6D trajectories, 10 render modes | Sound as a space, not a signal |
Architecture
[Figure: Auralis embedding space]
Sound as a Navigable Space
Auralis turns any sound into a luminous trail you can fly through. Upload audio and the backend analyzes it into a six-dimensional feature space — spatial position (X, Y, Z) plus color and motion — then the frontend renders it as a 3D trajectory where every point is a moment in time, positioned by its acoustic and semantic properties.
Seven Ways to Hear the Same Sound
Auralis computes seven 6D embedding tracks per sound, each a different lens on the same audio (all min-max normalized to [0,1] so any feature can drive any axis):
| Track | What it captures | Source |
|---|---|---|
| Features | Six interpretable spectral scalars (brightness, bandwidth, rolloff, …) | direct 6D |
| PCA | Linear projection of MFCC frames | corpus-wide → 6D |
| t-SNE | Nonlinear manifold of MFCC frames | corpus-wide → 6D |
| UMAP | Nonlinear manifold of MFCC frames | corpus-wide → 6D |
| Tonnetz | Harmonic space — fifths, minor/major thirds (Harte 2006) | natural 6D |
| YAMNet | Deep AudioSet event embeddings (Hershey 2017) | 1024-D → 6D PCA |
| CLAP | Contrastive language-audio embeddings (Wu 2023) | 512-D → 6D PCA |
Features emphasizes raw acoustic structure; PCA/t-SNE/UMAP are three projections of the same MFCC frames (one linear, two manifold methods) so you can see how each geometry reshapes the corpus; Tonnetz reveals tonal relationships; YAMNet brings event-level semantics; and CLAP links sound to natural-language meaning — so two sounds that mean similar things cluster together even when their raw spectra differ. (CLAP is precomputed offline; its heavy runtime is not bundled in the production deploy.)
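To make the "natural 6D" of the Tonnetz track concrete, here is a small Python sketch of the tonal centroid transform from Harte et al. (2006): a 12-bin chroma vector is projected onto three circles (fifths, minor thirds, major thirds), giving two coordinates per circle. The pipeline itself uses librosa, which provides this as `librosa.feature.tonnetz`; the pure-Python version below only shows the arithmetic, and the function name is hypothetical.

```python
import math
from typing import List, Sequence

# (angular step per pitch class, radius) for the circle of fifths,
# the circle of minor thirds, and the circle of major thirds.
_CIRCLES = (
    (7 * math.pi / 6, 1.0),
    (3 * math.pi / 2, 1.0),
    (2 * math.pi / 3, 0.5),
)


def tonal_centroid(chroma: Sequence[float]) -> List[float]:
    """Project a 12-bin chroma vector onto the 6D tonal centroid space."""
    if len(chroma) != 12:
        raise ValueError("expected 12 pitch-class energies")
    total = sum(abs(c) for c in chroma)  # L1 norm of the chroma vector
    if total == 0:
        return [0.0] * 6  # silence has no tonal position
    centroid = []
    for step, radius in _CIRCLES:
        # Each circle contributes a (sin, cos) pair weighted by pitch-class energy.
        centroid.append(
            sum(radius * math.sin(l * step) * c for l, c in enumerate(chroma)) / total
        )
        centroid.append(
            sum(radius * math.cos(l * step) * c for l, c in enumerate(chroma)) / total
        )
    return centroid


# A single pitch class (C) sits at angle zero on every circle.
c_only = tonal_centroid([1.0] + [0.0] * 11)

# Equal energy in all twelve pitch classes cancels out to the origin.
uniform = tonal_centroid([1.0] * 12)
```

The raw coordinates fall in [-1, 1] (and [-0.5, 0.5] for the major-thirds circle), so they would still pass through the same min-max normalization as the other tracks before driving an axis.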
Ten Render Modes
The same trajectory can be drawn ten ways — Trail, Comet, Constellation, Ribbon, Tube, Particles, Light Painting, Galaxy, Nebula, and Aurora — each interpreting the path differently for distinct analytical and aesthetic effects.
Architecture
Auralis is a monorepo with two parts. The FastAPI backend uses librosa for spectral, MFCC, chroma, mel, and Tonnetz features, TensorFlow/YAMNet for event embeddings, and CLAP via transformers for semantic embeddings, with per-track PCA models persisted for consistent projection. The React + TypeScript + Vite + Three.js frontend uses react-three-fiber for rendering, the Web Audio API for playback synchronization, and Zustand for state. The CLAP runtime is a heavy optional dependency, so production serves precomputed embeddings rather than bundling the torch/transformers stack. Live at auralis.fasl-work.com.
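The "heavy optional dependency" arrangement could look roughly like the Python sketch below: check whether the torch/transformers stack is importable, and otherwise read a CLAP track that was written offline. Everything specific here is an assumption for illustration, including the one-file-per-clip layout (`<clip_id>.json` with a `tracks` object), the function names, and the sample clip; the project's actual schema is not shown in this document.

```python
import importlib.util
import json
import tempfile
from pathlib import Path
from typing import List, Optional


def clap_runtime_available() -> bool:
    """True only if both heavy optional packages can be found for import."""
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("torch", "transformers")
    )


def load_precomputed_track(
    data_dir: Path, clip_id: str, track: str
) -> Optional[List[List[float]]]:
    """Read one track from a per-clip JSON file, or None if it is missing."""
    path = data_dir / f"{clip_id}.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as handle:
        payload = json.load(handle)
    # Hypothetical schema: {"tracks": {"clap": [[...6 floats...], ...]}}
    return payload.get("tracks", {}).get(track)


# Demo with a throwaway directory standing in for the offline pipeline's output.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    sample = {"tracks": {"clap": [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]]}}
    (data_dir / "rain.json").write_text(json.dumps(sample), encoding="utf-8")

    clap_track = load_precomputed_track(data_dir, "rain", "clap")
    missing = load_precomputed_track(data_dir, "thunder", "clap")
```

With this shape, the serving path never imports torch or transformers; only the offline pipeline needs them, which is the property the deployment relies on.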
Technology Stack
Visual assets for this project are not publicly available.