EMSLIBS 2025
Published:
Attendance at the 2025 EMSLIBS international conference (Euro-Mediterranean Symposium on Laser-Induced Breakdown Spectroscopy). Participation in the EMSLIBS 2025 Data Challenge. Even though the method presented here is neither novel nor exotic, the primary goal was to obtain better results. Before settling on this approach, other classification techniques (such as Random Forest and XGBoost) were also tested.
Data Challenge
A data challenge was proposed by the organizing committee of the conference. It consisted of the blind classification of LIBS spectra into 6 different classes. Participants were given no information about the acquisition instruments, the acquisition parameters, or the classes. Entire spectra were provided without manual preprocessing, although they appeared to have undergone baseline removal, probably applied by the acquisition instruments.
The dataset comprised:
- spectra_cal (size: 582 rows × 23,431 columns): the 582 spectra to be used for training. The spectral range consists of 23,431 wavelengths.
- classes_cal (size: 582 rows × 1 column): the class corresponding to each spectrum in spectra_cal. There are six classes in total, coded from 1 to 6.
- spectra_test_perm (size: 510 rows × 23,431 columns): the 510 spectra to be used for testing.
- wave (size: 1 row × 23,431 columns): the wavelength values of the considered spectral range.
Data Preparation and Preprocessing
The dataset was partitioned into a training set and a stratified validation set (10%) to preserve class balance. To enhance robustness and mitigate overfitting, the Mixup data augmentation technique was applied to the training samples.
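Mixup blends random pairs of training samples and their one-hot labels into convex combinations. A minimal sketch with NumPy; the mixing parameter `alpha` is an assumption, as the post does not state the value used:

```python
import numpy as np

def mixup(X, y_onehot, alpha=0.2, rng=None):
    """Blend random pairs of spectra and their one-hot labels.

    alpha controls the Beta distribution of mixing weights; 0.2 is an
    assumed value, not one stated in the post.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    lam = rng.beta(alpha, alpha, size=len(X))[:, None]  # one weight per pair
    idx = rng.permutation(len(X))                       # random partner for each sample
    X_mix = lam * X + (1 - lam) * X[idx]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return X_mix, y_mix
```

Because the labels are mixed as well, the augmented targets are soft (they sum to 1 but are no longer one-hot), which is what discourages the network from overfitting to individual spectra.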
To further eliminate residual instrumental noise, the following preprocessing steps were applied sequentially:
- Savitzky-Golay smoothing: applied to minimize high-frequency noise while preserving peak intensity.
- Standard Normal Variate (SNV): applied to normalize the spectra and correct for baseline shifts.
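The two steps above can be sketched with NumPy and SciPy. The Savitzky-Golay window length and polynomial order are assumptions; the post does not report the values actually used:

```python
import numpy as np
from scipy.signal import savgol_filter

def preprocess(spectra, window_length=11, polyorder=3):
    """Savitzky-Golay smoothing followed by SNV, one row per spectrum."""
    # 1) Savitzky-Golay: local polynomial smoothing along the wavelength axis
    smoothed = savgol_filter(spectra, window_length=window_length,
                             polyorder=polyorder, axis=1)
    # 2) SNV: centre each spectrum and scale it to unit variance
    mean = smoothed.mean(axis=1, keepdims=True)
    std = smoothed.std(axis=1, keepdims=True)
    return (smoothed - mean) / std
```

After SNV, every spectrum has zero mean and unit standard deviation, so intensity offsets between acquisitions no longer dominate the classifier's input.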
Architecture
I started with a straightforward approach, using the entire spectral range without manual wavelength selection.
- Dimensionality reduction: I reduced the dimensionality of the 23,431 points per spectrum using Principal Component Analysis (PCA).
- Supervised learning: These PCA scores served as input features for a three-layer Multi-Layer Perceptron (MLP). The network architecture was kept simple.
The model was evaluated using stratified 10-fold cross-validation. It demonstrated stable classification throughout, achieving an average accuracy of 94.15% with a standard deviation of 2.07%. Once stability was confirmed, the final model was retrained on all available training data.
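The pipeline and its evaluation might look as follows with scikit-learn. The number of principal components, the hidden-layer sizes, and the synthetic stand-in data are all assumptions; the post only says the PCA scores feed a simple three-layer MLP evaluated with stratified 10-fold cross-validation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the preprocessed spectra (582 x 23,431 in the post)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 300))
y = np.tile(np.arange(1, 7), 20)  # six classes, coded 1 to 6

# PCA scores feed a small three-hidden-layer MLP; the component count and
# layer sizes are assumed, since the post only says the network was kept simple.
model = make_pipeline(
    PCA(n_components=20),
    MLPClassifier(hidden_layer_sizes=(64, 32, 16), max_iter=300, random_state=0),
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)
print(f"accuracy: {scores.mean():.2%} +/- {scores.std():.2%}")
```

Wrapping PCA inside the pipeline matters: the projection is refit on each fold's training split, so no information from the held-out fold leaks into the features.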
Alternative Approach
I also thought that going into the frequency domain could yield better results. I transformed each spectrum into 3 different time-frequency representations: one short-time FFT (ST-FFT) and two continuous wavelet transforms (CWT), stacked into a 224×224×3 image. This image was then used to fine-tune a ResNet-50 model to predict the 6 classes.
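A rough NumPy/SciPy sketch of the image construction. The hand-rolled Ricker wavelet and CWT (SciPy's own were removed in recent versions), the wavelet widths, the STFT segment length, and the nearest-neighbour resize are all illustrative assumptions; the ResNet-50 fine-tuning itself (e.g. via torchvision) is omitted:

```python
import numpy as np
from scipy.signal import stft

def ricker(points, a):
    # Ricker ("Mexican hat") wavelet sampled on `points` samples, width `a`
    t = np.arange(points) - (points - 1) / 2
    A = 2 / (np.sqrt(3 * a) * np.pi ** 0.25)
    return A * (1 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt(signal, widths, points=101):
    # One convolution per wavelet width -> (len(widths), len(signal)) scalogram
    return np.array([np.convolve(signal, ricker(points, w), mode="same")
                     for w in widths])

def resize224(img):
    # Crude nearest-neighbour resize to 224x224 (a real pipeline would interpolate)
    ri = np.linspace(0, img.shape[0] - 1, 224).astype(int)
    ci = np.linspace(0, img.shape[1] - 1, 224).astype(int)
    return img[np.ix_(ri, ci)]

def spectrum_to_image(spectrum):
    # Channel 1: short-time FFT magnitude; channels 2-3: CWT at two width ranges.
    # nperseg and the width ranges are assumed, not taken from the post.
    _, _, Z = stft(spectrum, nperseg=256)
    ch1 = resize224(np.abs(Z))
    ch2 = resize224(np.abs(cwt(spectrum, np.arange(1, 31))))
    ch3 = resize224(np.abs(cwt(spectrum, np.arange(31, 61))))
    return np.stack([ch1, ch2, ch3], axis=-1)  # shape (224, 224, 3)
```

The 224×224×3 shape is chosen to match the input expected by an ImageNet-pretrained ResNet-50, so the spectrum can be treated as an ordinary RGB image during fine-tuning.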
While I knew this was overkill, I wanted to see whether the two methods, despite being drastically different, would yield the same predictions, with a view to a potential ensemble. The two techniques agreed on more than 80% of the predictions. However, I suspected that the test dataset was balanced (roughly the same number of samples in each class). The PCA+MLP method produced a much more balanced class distribution than the ResNet, so I decided to submit the first technique.
Results
I achieved 2nd place, with an accuracy of 87.65%. After the results were announced, we were informed that we had actually been classifying bones from 6 different individuals.
