Hyperspectral Mineral Analysis by Topic Modelling (2022)
The original 2022 conference work that reframed hyperspectral mineral characterisation as a topic-modelling problem: spectra as documents, LDA topics as a routing layer for per-topic regression. It cut copper-recovery error by roughly an order of magnitude on private drill-core data.
Business Context
In geometallurgy, knowing how an ore will behave in processing requires lab work that is too slow and expensive to run on everything. Hyperspectral imaging promises a cheap proxy, but only if the spectral evidence can be mapped reliably to the lab targets. To our knowledge, this 2022 work was the first to show that the document/topic metaphor from text mining can provide that mapping on real mineral samples.
Strategic Value
To our knowledge, this was the first formalisation of HSI mineral-sample characterisation as a probabilistic topic-modelling problem. The contribution over the prior hierarchical scheme (Egaña et al., Minerals 2020) was to make the clustering stage probabilistic and interpretable: each sample is a soft mixture of latent mineral topics inferred by LDA, rather than a hard cluster assignment. On the DB1 drill-core holdout, topic-routed regression with LDA Version 1 cut copper-recovery MAE from 4.568 (naive baseline) to 0.422, roughly a 10x error reduction, and Version 3 was comparable (0.432). This is the seed idea that later scaled into the full LDA-HSI research platform; here it stays modest and period-accurate: three recipes, one backbone (LDA), small private datasets.
The Challenge
Geometallurgical targets (copper/molybdenum recovery, acid consumption, Bond work index) need slow, costly lab assays on every drill-core composite. Hyperspectral imaging is fast and cheap, but a raw spectrum is hundreds of correlated bands with no obvious link to a recovery number — and a naive per-spectrum regressor is effectively useless.
Our Approach
Reframe each mineral sample as a document. Three "wordification" recipes were compared: Version 1 (word = wavelength band, count = summed quantised intensity), Version 2 (word = quantised intensity level), and Version 3 (joint per-spectrum band intensities). A gensim LDA model infers a per-sample topic distribution; a separate regressor is then trained per topic, and a new sample is estimated by weighting the per-topic regressors by its inferred topic mixture. This is a probabilistic extension of an earlier hierarchical clustered-regression scheme.
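The routing step can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the linear regressors, the small ridge term, and the choice to weight each training sample by its topic proportion are all assumptions made for illustration. The topic mixtures `theta` are taken as given (in practice they would come from the fitted LDA model).

```python
import numpy as np

def fit_topic_regressors(X, y, theta, ridge=1e-6):
    """Fit one weighted linear regressor per topic (hypothetical helper).

    X: (n_samples, n_features) features; y: (n_samples,) lab target;
    theta: (n_samples, n_topics) topic mixtures, rows summing to 1.
    Assumption: sample i contributes to topic k's fit with weight theta[i, k].
    """
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append intercept column
    coefs = []
    for k in range(theta.shape[1]):
        w = theta[:, k]
        # Weighted normal equations with a tiny ridge term for stability.
        A = Xb.T @ (Xb * w[:, None]) + ridge * np.eye(Xb.shape[1])
        b = Xb.T @ (w * y)
        coefs.append(np.linalg.solve(A, b))
    return np.array(coefs)  # (n_topics, n_features + 1)

def predict_routed(X, theta, coefs):
    """Blend the per-topic predictions using each sample's topic mixture."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    per_topic = Xb @ coefs.T  # (n_samples, n_topics)
    return np.sum(theta * per_topic, axis=1)

# Synthetic check: two regimes with different linear laws.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 1))
theta = np.zeros((40, 2))
theta[:20, 0] = 1.0   # first 20 samples belong to topic 0
theta[20:, 1] = 1.0   # last 20 samples belong to topic 1
y = np.where(theta[:, 0] == 1.0, 2.0 * X[:, 0] + 1.0, -3.0 * X[:, 0] + 5.0)

coefs = fit_topic_regressors(X, y, theta)
pred = predict_routed(X, theta, coefs)

# A new sample that is an even mixture of both topics, at x = 0.
mixed = predict_routed(np.array([[0.0]]), np.array([[0.5, 0.5]]), coefs)
```

On this toy data a single global regressor would have to average two incompatible slopes, while the routed model recovers each regime and interpolates between them for mixed samples.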
Key Performance Indicators
| KPI | Baseline | Result | Impact |
|---|---|---|---|
| Copper-recovery error (DB1) | Naive per-spectrum MAE 4.568 | LDA Version 1 MAE 0.422 | ~10x error reduction |
| Method | Hard clustering + regression | Probabilistic LDA topic routing | Soft, interpretable membership |
Architecture
[Figure: HSI mineral classification architecture]
The Founding Idea (2022)
Presented as “Geometallurgical Estimation of Mineral Samples from Hyperspectral Images and Statistical Topic Modelling” at the 18th International Conference on Mineral Processing and Geometallurgy (Procemin Geomet 2022, Gecamin), from postdoctoral research at ALGES / AMTC, Universidad de Chile. The idea: treat a hyperspectral mineral sample as a document, its quantised spectral patterns as vocabulary, and let an LDA topic model infer a small set of latent mineral “topics” — then use that topic mixture to route a per-topic regression onto the lab targets.
Spectra as Documents
Three “wordification” recipes were compared (Table 2 of the paper). In Version 1, each wavelength band is a word and the document counts the summed quantised intensities per band, giving a reduced and interpretable vocabulary. In Version 2, words are quantised intensity levels. In Version 3, words are joint per-spectrum band intensities. Reflectance was quantised to Q levels, the topic count was chosen by coherence score, and the engine was gensim LDA, with pyLDAvis for inspection.
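A minimal sketch of how the first two recipes might turn a sample into a bag of words follows. The function names, the assumption that reflectance lies in [0, 1], the uniform quantisation, and the reading of Version 2 as "count how often each level occurs" are illustrative assumptions, not details taken from the paper. Version 3 is omitted because its exact word construction is not specified here.

```python
import numpy as np

def quantise(reflectance, Q):
    """Map reflectance in [0, 1] to integer levels 0..Q-1 (uniform bins, assumed)."""
    r = np.clip(np.asarray(reflectance, dtype=float), 0.0, 1.0)
    return np.minimum((r * Q).astype(int), Q - 1)

def wordify_v1(sample, Q):
    """Version 1: word = band index, count = summed quantised intensity over pixels."""
    levels = quantise(sample, Q)  # (n_pixels, n_bands)
    counts = levels.sum(axis=0)
    return [(int(b), int(c)) for b, c in enumerate(counts) if c > 0]

def wordify_v2(sample, Q):
    """Version 2 (assumed reading): word = quantised level, count = its occurrences."""
    levels = quantise(sample, Q)
    counts = np.bincount(levels.ravel(), minlength=Q)
    return [(int(q), int(c)) for q, c in enumerate(counts) if c > 0]

# Toy sample: 2 pixels x 3 bands of reflectance values.
sample = np.array([[0.12, 0.51, 0.95],
                   [0.23, 0.57, 1.00]])

doc_v1 = wordify_v1(sample, Q=10)  # [(0, 3), (1, 10), (2, 18)]
doc_v2 = wordify_v2(sample, Q=10)  # [(1, 1), (2, 1), (5, 2), (9, 2)]
```

Each document is a list of `(word_id, count)` pairs, which is the bag-of-words shape gensim expects for a corpus, so a list of such documents could be handed to an LDA model directly.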
The Result
On a 20% holdout of the DB1 drill-core set, topic-routed hierarchical regression with LDA Version 1 cut copper-recovery MAE from 4.568 (naive per-spectrum baseline) to 0.422, an order-of-magnitude reduction, with Version 3 comparable (0.432) and Version 2 weaker (0.714). Molybdenum-recovery error improved similarly (18.6 → 2.2). On the smaller DB2 set (7 topics), estimation error dropped by ~10–15% versus baselines. Version 1, the band-frequency recipe, was the strong, interpretable one and survives as the canonical baseline in the modern platform.
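The reduction factors behind these figures can be checked with a few lines of arithmetic. The snippet below only divides the reported numbers; the variable names are illustrative.

```python
# Reported DB1 copper-recovery errors (naive baseline vs. each recipe).
baseline_cu = 4.568
mae_cu = {"Version 1": 0.422, "Version 2": 0.714, "Version 3": 0.432}

# How many times smaller each recipe's error is than the baseline.
factors = {name: baseline_cu / mae for name, mae in mae_cu.items()}

# Reported molybdenum-recovery errors, before and after.
mo_factor = 18.6 / 2.2

for name, f in factors.items():
    print(f"{name}: {f:.1f}x lower error than the naive baseline")
print(f"Molybdenum: {mo_factor:.1f}x lower error")

# Version 1: 10.8x, Version 2: 6.4x, Version 3: 10.6x, Molybdenum: 8.5x
```

So Versions 1 and 3 clear the 10x mark on copper, Version 2 reaches about 6x, and molybdenum lands a little under an order of magnitude.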
Scope (Period-Accurate)
This entry stays faithful to the 2022 paper: three recipes, one backbone (LDA), a few small private mineral datasets (drill-core DB1/DB2 plus the early HIDSAG geological subsets) — no public benchmark scenes, no neural backbones, no design-space sweep. That breadth came later. The idea seeded here — spectra as documents, topics as structure — scaled into the LDA-HSI platform: 19 recipes, four backbones, six public scenes, and a live web app.
Technology Stack
Visual assets for this project are not publicly available.