Audio & Signal April 2026

Auralis — 6D Audio Embedding Visualizer

Turns any sound into a navigable 6-dimensional universe. Audio is mapped onto seven embedding tracks — interpretable spectral features, PCA/t-SNE/UMAP projections, the Tonnetz harmonic space, YAMNet event semantics, and CLAP audio-text meaning — and rendered as a luminous 3D trajectory you can fly through, across a curated library of 102 sounds.

Embedding Tracks: 7 (Features, PCA, t-SNE, UMAP, Tonnetz, YAMNet, CLAP)
Render Modes: 10
Curated Library: 102 sounds
Stack: FastAPI + librosa/YAMNet/CLAP · React/Three.js
#audio #embeddings #visualization #threejs #clap #yamnet #librosa #fastapi

Business Context

Audio is one of the hardest data types to explore intuitively. Spectrograms are dense and unreadable to non-specialists, and embedding spaces from modern audio models are high-dimensional and abstract. There is rarely a way to actually see how a machine-learning model "hears" a sound, or to compare what different representations capture about the same audio.

Strategic Value

Auralis makes audio embeddings tangible. By projecting seven different representations of the same sound into one navigable 6D space, it turns abstract feature vectors into trajectories you can fly through and compare directly. Each track contributes a different view: interpretable spectral Features emphasize raw acoustic structure; PCA, t-SNE, and UMAP are corpus-wide projections of MFCC frames (one linear method and two manifold methods); Tonnetz reveals harmonic and tonal relationships; YAMNet adds event-level semantics (1024-D AudioSet embeddings); and CLAP links sound to natural-language meaning (512-D contrastive audio-text embeddings), so semantically related sounds cluster even when their spectra differ. A curated library of 102 sounds (space, nature, music, human-made) and ten render modes make it both an analytical lens on representation learning and an expressive instrument. Auralis is built as a FastAPI + React/Three.js monorepo and deployed live; the CLAP embeddings are precomputed offline so the deployed app stays light.

The Challenge

Conventional audio visualizers — spectrum analyzers, waveforms — show the signal but not its structure or meaning. Two sounds that share meaning but differ acoustically look unrelated; there is no single view that places a sound by both how it sounds and what it is.

Our Approach

Each sound is analyzed into seven 6D embedding tracks (Features, PCA, t-SNE, UMAP, Tonnetz, YAMNet, CLAP), all min-max normalized so any feature can drive any axis — spatial XYZ plus color and size, with time as the implicit sixth axis. A React/Three.js frontend renders the trajectory in real time with ten render modes, synchronized to Web Audio playback. The offline data pipeline (librosa + scikit-learn/UMAP + TensorFlow/YAMNet + CLAP) extracts features and writes per-clip JSON; the FastAPI backend serves them verbatim.
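As a rough illustration of the normalization step described above, the following Python sketch scales each dimension of a track to [0, 1] and then picks one dimension per render channel. It assumes the range is taken over whatever array is passed in (the source does not say whether normalization is per clip or corpus-wide). The channel names, the `assign_axes` helper, and the fallback for constant columns are hypothetical, not taken from the Auralis code.

```python
from typing import Dict, List, Sequence


def min_max_normalize(frames: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each column of an (n_frames, n_dims) track to the range [0, 1]."""
    if not frames:
        return []
    columns = list(zip(*frames))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in frames:
        normalized.append([
            # Assumption: a constant column carries no information, so map it to 0.0.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized


def assign_axes(frame: Sequence[float], mapping: Dict[str, int]) -> Dict[str, float]:
    """Pick one normalized dimension per render channel (hypothetical channel names)."""
    return {channel: frame[dim] for channel, dim in mapping.items()}


# Three frames of a toy 3-dimensional track; a real track would have six dimensions.
track = [[0.0, 10.0, 5.0], [2.0, 30.0, 5.0], [1.0, 20.0, 5.0]]
normalized = min_max_normalize(track)
point = assign_axes(normalized[2], {"x": 0, "y": 1, "size": 2})
```

Because every dimension ends up on the same [0, 1] scale, swapping which dimension drives which channel only requires changing the mapping, not rescaling the data.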

Key Performance Indicators

| KPI | Baseline | Result | Impact |
|---|---|---|---|
| Embedding Tracks | Single spectrogram view | 7 tracks (spectral → YAMNet → CLAP) | Compare what each representation hears |
| Audio Exploration | Flat waveform / spectrogram | Navigable 6D trajectories, 10 render modes | Sound as a space, not a signal |

Architecture

auralis embedding space

Sound as a Navigable Space

Auralis turns any sound into a luminous trail you can fly through. Upload audio and the backend analyzes it into a six-dimensional feature space — spatial position (X, Y, Z) plus color and motion — then the frontend renders it as a 3D trajectory where every point is a moment in time, positioned by its acoustic and semantic properties.
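Since every point on the trajectory corresponds to a moment in time, playback can be tied to the trail by looking up the frame that is active at the current playback position. The sketch below shows that lookup in Python for illustration only; the real frontend is TypeScript, and the `active_frame` name and the per-frame timestamp list are assumptions, not details from the source.

```python
from bisect import bisect_right
from typing import Sequence


def active_frame(times: Sequence[float], playback_time: float) -> int:
    """Return the index of the last frame whose timestamp is <= playback_time.

    `times` must be sorted ascending. Positions before the first frame or
    after the last frame are clamped to the nearest valid index.
    """
    if not times:
        raise ValueError("trajectory has no frames")
    index = bisect_right(times, playback_time) - 1
    return max(0, min(index, len(times) - 1))


# Hypothetical frame timestamps in seconds.
frame_times = [0.0, 0.5, 1.0, 1.5]
current = active_frame(frame_times, 0.7)  # frame 1 covers 0.5 s up to 1.0 s
```

A binary search keeps the lookup cheap enough to run once per rendered frame, even for long clips.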

Seven Ways to Hear the Same Sound

Auralis computes seven 6D embedding tracks per sound, each a different lens on the same audio (all min-max normalized to [0,1] so any feature can drive any axis):

| Track | What it captures | Source |
|---|---|---|
| Features | Six interpretable spectral scalars (brightness, bandwidth, rolloff, …) | direct 6D |
| PCA | Linear projection of MFCC frames | corpus-wide → 6D |
| t-SNE | Nonlinear manifold of MFCC frames | corpus-wide → 6D |
| UMAP | Nonlinear manifold of MFCC frames | corpus-wide → 6D |
| Tonnetz | Harmonic space: fifths, minor/major thirds (Harte 2006) | natural 6D |
| YAMNet | Deep AudioSet event embeddings (Hershey 2017) | 1024-D → 6D PCA |
| CLAP | Contrastive language-audio embeddings (Wu 2023) | 512-D → 6D PCA |

Features emphasizes raw acoustic structure. PCA, t-SNE, and UMAP are three projections of the same MFCC frames (one linear, two manifold methods), so you can see how each geometry reshapes the corpus. Tonnetz reveals tonal relationships, YAMNet brings event-level semantics, and CLAP links sound to natural-language meaning, so two sounds that mean similar things cluster together even when their raw spectra differ. (CLAP is precomputed offline; its heavy runtime is not bundled in the production deploy.)
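The Tonnetz track is the one representation here that is six-dimensional by construction. As a sketch of why, the following pure-Python function computes the tonal centroid of a single 12-bin chroma vector using the formulation from Harte et al. (2006): the chroma is L1-normalized and projected onto three circles (fifths, minor thirds, major thirds), each contributing a sine and a cosine coordinate. Auralis uses librosa for this step; the function below is an independent illustration, not the project's code, and the `tonal_centroid` name is hypothetical.

```python
import math
from typing import List, Sequence


def tonal_centroid(chroma: Sequence[float]) -> List[float]:
    """Map a 12-bin chroma vector to a 6-D tonal centroid (Harte et al., 2006)."""
    if len(chroma) != 12:
        raise ValueError("expected 12 pitch-class energies")
    total = sum(abs(c) for c in chroma)
    if total == 0:
        return [0.0] * 6  # silence: no tonal content to place
    weights = [c / total for c in chroma]  # L1 normalization

    # (radius, angular step per semitone in units of pi) for the three circles:
    # fifths, minor thirds, major thirds.
    circles = [(1.0, 7.0 / 6.0), (1.0, 3.0 / 2.0), (0.5, 2.0 / 3.0)]
    centroid: List[float] = []
    for radius, step in circles:
        centroid.append(sum(
            w * radius * math.sin(pitch * step * math.pi)
            for pitch, w in enumerate(weights)
        ))
        centroid.append(sum(
            w * radius * math.cos(pitch * step * math.pi)
            for pitch, w in enumerate(weights)
        ))
    return centroid


# A chroma vector with all energy on pitch class 0 (C).
c_only = [1.0] + [0.0] * 11
print(tonal_centroid(c_only))
```

Each pair of coordinates is a position on one circle, so the six values need no further dimensionality reduction before normalization.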

Ten Render Modes

The same trajectory can be drawn ten ways — Trail, Comet, Constellation, Ribbon, Tube, Particles, Light Painting, Galaxy, Nebula, and Aurora — each interpreting the path differently for distinct analytical and aesthetic effects.

Architecture

Auralis is a monorepo with two parts. The FastAPI backend uses librosa for spectral, MFCC, chroma, mel, and Tonnetz features, TensorFlow/YAMNet for event embeddings, and CLAP via transformers for semantic embeddings; per-track PCA models are persisted so projections stay consistent. The frontend is React + TypeScript + Vite + Three.js, with react-three-fiber for rendering, the Web Audio API for playback synchronization, and Zustand for state. The CLAP runtime is a heavy optional dependency, so production serves precomputed embeddings rather than bundling the torch/transformers stack. Live at auralis.fasl-work.com.
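Persisting a fitted PCA model matters because a new clip must be projected with the same mean and principal axes as the rest of the corpus; refitting per clip would put every sound in a different coordinate system. The sketch below shows the projection step alone, assuming the persisted model amounts to a mean vector and a list of component rows (the storage format and the `project` name are hypothetical). It computes the same quantity as scikit-learn's `PCA.transform` without whitening, namely (x − mean) · Wᵀ.

```python
from typing import List, Sequence


def project(
    vector: Sequence[float],
    mean: Sequence[float],
    components: Sequence[Sequence[float]],
) -> List[float]:
    """Project one embedding onto persisted principal axes."""
    if len(vector) != len(mean):
        raise ValueError("vector and mean must have the same length")
    centered = [v - m for v, m in zip(vector, mean)]
    projected = []
    for axis in components:
        if len(axis) != len(centered):
            raise ValueError("each component must match the embedding length")
        # Dot product of the centered embedding with one principal axis.
        projected.append(sum(c * w for c, w in zip(centered, axis)))
    return projected


# Toy example: a 4-D embedding reduced to 2-D. In Auralis the YAMNet and
# CLAP tracks would go from 1024-D and 512-D down to six components.
mean = [1.0, 1.0, 1.0, 1.0]
components = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
reduced = project([3.0, 5.0, 2.0, 9.0], mean, components)
```

The projected values would still need the min-max normalization applied to every track before they can drive render axes.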

Technology Stack

Python · FastAPI · librosa · TensorFlow · YAMNet · CLAP · TypeScript · React · Three.js · Web Audio API · PCA

Visual assets for this project are not publicly available.