Auralis — 6D Audio Embedding Visualizer

Business Context

Audio is one of the hardest data types to explore intuitively. Spectrograms are dense and unreadable to non-specialists, and embedding spaces from modern audio models are high-dimensional and abstract. There is rarely a way to actually see how a machine-learning model "hears" a sound, or to compare what different representations capture about the same audio.

Strategic Value

Auralis makes audio embeddings tangible. It projects seven representations of the same sound into one navigable 6D space, turning abstract feature vectors into trajectories you can fly through and compare directly. Each track shows something different: the interpretable spectral Features track emphasizes raw acoustic structure; PCA, t-SNE, and UMAP are corpus-wide projections of MFCC frames (one linear, two manifold methods); Tonnetz reveals harmonic and tonal relationships; YAMNet adds event-level semantics (1024-D AudioSet embeddings); and CLAP links sound to natural-language meaning (512-D contrastive audio-text embeddings), so semantically related sounds cluster even when their spectra differ. A curated library of 102 sounds (space, nature, music, human-made) and ten render modes make it both an analytical lens on representation learning and an expressive instrument. It is built as a FastAPI + React/Three.js monorepo and deployed live; CLAP embeddings are precomputed offline, so the deployed app stays light and does not need the heavy CLAP runtime.

KPI	Baseline	Result	Impact
Embedding Tracks	Single spectrogram view	7 tracks (spectral → YAMNet → CLAP)	Compare what each representation hears
Audio Exploration	Flat waveform / spectrogram	Navigable 6D trajectories, 10 render modes	Sound as a space, not a signal

Sound as a Navigable Space

Auralis turns any sound into a luminous trail you can fly through. Upload audio and the backend maps it into a six-dimensional feature space: spatial position (X, Y, Z) plus color and motion. The frontend then renders it as a 3D trajectory in which every point is a moment in time, positioned by its acoustic and semantic properties.

Seven Ways to Hear the Same Sound

Auralis computes seven 6D embedding tracks per sound, each a different lens on the same audio (all min-max normalized to [0,1] so any feature can drive any axis):
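The normalization step can be illustrated with a minimal NumPy sketch. It assumes each track is a (frames, 6) array scaled column by column; whether Auralis normalizes per sound or across the corpus is not stated here. The channel names in AXIS_MAP (hue, brightness, speed) are hypothetical stand-ins for the "color and motion" channels.

```python
import numpy as np

def minmax_normalize(track: np.ndarray) -> np.ndarray:
    """Scale each column of a (frames, 6) track to [0, 1] independently."""
    track = np.asarray(track, dtype=float)
    lo = track.min(axis=0)
    span = track.max(axis=0) - lo
    # Constant columns have zero span; map them to 0.0 instead of dividing by zero.
    safe_span = np.where(span > 0, span, 1.0)
    return (track - lo) / safe_span

# Hypothetical assignment of track columns to visual channels.
AXIS_MAP = {"x": 0, "y": 1, "z": 2, "hue": 3, "brightness": 4, "speed": 5}

def remap(track: np.ndarray, axis_map: dict) -> list:
    """Return one dict per frame, picking the mapped column for each channel."""
    return [
        {name: float(frame[col]) for name, col in axis_map.items()}
        for frame in track
    ]
```

Because every column shares the same [0, 1] range after scaling, swapping two entries in the mapping (for example, letting column 3 drive "x") needs no further rescaling.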

Track	What it captures	Source
Features	Six interpretable spectral scalars (brightness, bandwidth, rolloff, …)	direct 6D
PCA	Linear projection of MFCC frames	corpus-wide → 6D
t-SNE	Nonlinear manifold of MFCC frames	corpus-wide → 6D
UMAP	Nonlinear manifold of MFCC frames	corpus-wide → 6D
Tonnetz	Harmonic space — fifths, minor/major thirds (Harte 2006)	natural 6D
YAMNet	Deep AudioSet event embeddings (Hershey 2017)	1024-D → 6D PCA
CLAP	Contrastive language-audio embeddings (Wu 2023)	512-D → 6D PCA

Each track answers a different question. Features emphasizes raw acoustic structure. PCA, t-SNE, and UMAP are three projections of the same MFCC frames (one linear, two manifold methods), so you can see how each geometry reshapes the corpus. Tonnetz reveals tonal relationships, and YAMNet brings event-level semantics. CLAP links sound to natural-language meaning, so two sounds that mean similar things cluster together even when their raw spectra differ. (CLAP embeddings are precomputed offline; the heavy CLAP runtime is not bundled in the production deploy.)
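The "1024-D → 6D PCA" reduction listed for YAMNet can be sketched with plain NumPy. This is an illustration under assumptions, not the project's code: it uses an SVD-based PCA, random data stands in for real embeddings, and fit_pca and project are hypothetical helper names.

```python
import numpy as np

def fit_pca(corpus: np.ndarray, n_components: int = 6):
    """Fit PCA on a (frames, dims) corpus; return (mean, components)."""
    mean = corpus.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(corpus - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(frames: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project (frames, dims) embeddings onto the fitted components."""
    return (frames - mean) @ components.T

rng = np.random.default_rng(0)
corpus = rng.normal(size=(200, 1024))  # stand-in for YAMNet-sized embeddings
mean, components = fit_pca(corpus)
track = project(corpus[:10], mean, components)  # shape (10, 6)
```

Keeping the fitted mean and components around means a later sound can be projected onto the same six axes as the rest of the corpus, which is what makes trajectories from different sounds comparable.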

Ten Render Modes

The same trajectory can be drawn ten ways — Trail, Comet, Constellation, Ribbon, Tube, Particles, Light Painting, Galaxy, Nebula, and Aurora — each interpreting the path differently for distinct analytical and aesthetic effects.

Architecture

Auralis is a monorepo with two parts. The FastAPI backend uses librosa for spectral, MFCC, chroma, mel, and Tonnetz features, TensorFlow/YAMNet for event embeddings, and CLAP via transformers for semantic embeddings; per-track PCA models are persisted for consistent projection. The React + TypeScript + Vite + Three.js frontend uses react-three-fiber for rendering, the Web Audio API for playback synchronization, and Zustand for state. The CLAP runtime is a heavy optional dependency, so production serves precomputed embeddings rather than bundling the torch/transformers stack. Live at auralis.fasl-work.com.
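One way to serve precomputed embeddings without importing the heavy stack is a plain file lookup, sketched below with the standard library only. The directory layout (cache_dir/track/sound_id.json), the function names, and the "whale_song" identifier are all hypothetical; the actual storage format used by Auralis is not described here.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional

def save_precomputed_track(cache_dir: Path, sound_id: str, track: str,
                           frames: List[List[float]]) -> Path:
    """Offline step: write a sound's 6D frames for one track to disk."""
    path = cache_dir / track / f"{sound_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(frames), encoding="utf-8")
    return path

def load_precomputed_track(cache_dir: Path, sound_id: str,
                           track: str) -> Optional[List[List[float]]]:
    """Serving step: return stored frames, or None if nothing was precomputed."""
    path = cache_dir / track / f"{sound_id}.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)

# Demo in a temporary directory: one track is stored, another is absent.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    save_precomputed_track(cache, "whale_song", "clap",
                           [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
    found = load_precomputed_track(cache, "whale_song", "clap")
    missing = load_precomputed_track(cache, "whale_song", "yamnet")
```

With a lookup like this, the serving process only needs JSON parsing at request time, and a missing file can be reported to the client as "track unavailable" rather than triggering model inference.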
