Simulated Raman spectra representing two different datasets. Simulated spectra were generated using R.

Multivariate statistical methods are commonly used by researchers who study fossils with spectroscopy. Moreover, these methods are often performed with the free software language ‘R’. However, it can sometimes be a real challenge to assemble R code for a new stats method you are not overly familiar with. How do you know if your R code is working properly? Actually, the answer is pretty simple – just analyse a dataset that has a known outcome. For example, if the method is supposed to distinguish spectra into groups, then work with a dataset that contains different groups of spectra. Simple, right?

Well, it’s often the case that a good dataset is not easily available for simply testing code. My solution? Simulated data. The following lines of R code simulate two groups of Raman spectra that vary slightly within each group, and vary substantially between each group. This code will let you test a whole range of multivariate methods.

Code for generating two ‘random’ sets of Raman spectra (feel free to use, adapt and modify – attribution would be nice but is not necessary).

##Code Starts Here##

wavenumber=200:3500

nspectra=20

set.seed(2)

baseline_1=runif(length(wavenumber),min=0,max=0.1)

group_1_peaks=sample(wavenumber[10:(length(wavenumber)-10)],10)

dataset_1=matrix(0,ncol=nspectra,nrow=length(wavenumber))

for(i in 1:ncol(dataset_1)){

for(j in 1:length(group_1_peaks)){

width=round(runif(1,min=1,max=10),digits=0)

peak_position=dnorm((group_1_peaks[j]-width):(group_1_peaks[j]+width),mean=group_1_peaks[j],sd=width)

peak_intensity=peak_position*round(runif(1,min=1,max=100),digits=0)

baseline_1[(group_1_peaks[j]-width):(group_1_peaks[j]+width)]=peak_intensity

}

dataset_1[,i]=baseline_1

}

set.seed(20)

baseline_2=runif(length(wavenumber),min=0,max=0.1)

group_2_peaks=sample(wavenumber[10:(length(wavenumber)-10)],10)

peak_intensity_means_2=runif(length(group_2_peaks),min=10,max=100)

dataset_2=matrix(0,ncol=nspectra,nrow=length(wavenumber))

for(i in 1:ncol(dataset_2)){

for(j in 1:length(group_2_peaks)){

width=round(runif(1,min=1,max=10),digits=0)

peak_position=dnorm((group_2_peaks[j]-width):(group_2_peaks[j]+width),mean=group_2_peaks[j],sd=width)

peak_intensity=peak_position*(peak_intensity_means_2[j]*runif(1,min=0.5,max=2))

baseline_2[(group_2_peaks[j]-width):(group_2_peaks[j]+width)]=peak_intensity

}

dataset_2[,i]=baseline_2

}

##Code Ends Here##

Stacked spectra, with all simulated spectra within each dataset overlying one another.

The following principal component analyses is a good example of how these simulated spectra might be used. The following code calls on two libraries (signal, baseline), so be aware that you will need to install the signal and baseline packages to run it. This code first runs through a set of preprocessing steps before the principal component analysis. The citation for this code is: Thomas D B, Chinsamy A, 2011. Chemometric analysis of EDXRF measurements from fossil bone. *X-ray Spectrometry* 40: 441-445

##Code Starts Here##

library (signal)

library(baseline)

combined.data=cbind(dataset_1,dataset_2)

combined.baseline=matrix(0,ncol=length(combined.data[,1]),nrow=length(combined.data[1,]))

combined.bc=c()

combined.data=t(combined.data)

for (n in 1:(length(combined.data[,1]))) {

combined.bc=baseline(combined.data[n,, drop=FALSE],lambda=1,hwi=20, it=30, int=800, method=’fillPeaks’)

combined.baseline[n,]=combined.bc@corrected[1:length(combined.bc@corrected)]

}

combined.data=combined.baseline

combined.data=t(combined.data)

combined.min=matrix(rep(apply(combined.data,2,min),length(combined.data[,1])),ncol=length(combined.data[1,]),byrow=T)

combined.max=matrix(rep(apply(combined.data,2,max),length(combined.data[,1])),ncol=length(combined.data[1,]),byrow=T)

combined.minmax=(combined.data-combined.min)/(combined.max-combined.min)

combined.mc=matrix(rep(colMeans(t(combined.minmax)),length(combined.minmax[1,])),ncol=length(combined.minmax[1,]))

combined.minmax.mc=combined.minmax-combined.mc

combined.minmax.mc=t(combined.minmax.mc)

combined.pc=prcomp(combined.minmax.mc)

combined.loadings=combined.pc$rotation

combined.scores=combined.pc$x

eigenval=combined.pc$sdev^2

explained=round(eigenval/sum(eigenval) * 100, digits = 1)

PCx=1

PCy=2

combined.scores.PCx=combined.scores[,PCx]

combined.scores.PCy=combined.scores[,PCy]

combined.loadings.PCx=combined.loadings[,PCx]

combined.loadings.PCy=combined.loadings[,PCy]

layout(matrix(c(1,1,2,1,1,3),nrow=3,ncol=2))

plot(combined.scores.PCx,combined.scores.PCy,font=5, ann=FALSE,col=c(rep(“red”,(nspectra)),rep(“blue”,(nspectra))))

title(xlab=paste(“Principal component “, PCx, ” (“,explained[PCx],”%)”,sep=””))

title(ylab=paste(“Principal component “, PCy, ” (“,explained[PCy],”%)”,sep=””))

plot(wavenumber,combined.loadings.PCx,type=”l”,xlab=”wavenumber (1/cm)”,ylab=paste(“Principal component”, PCx))

plot(wavenumber,combined.loadings.PCy,type=”l”,xlab=”wavenumber (1/cm)”,ylab=paste(“Principal component”, PCy))

##Code Ends Here##

The results of a PCA applied to a simulated dataset of Raman spectra.

Cool beans. We can clearly recover two groupings of PC scores along PC1. Here the two groups (red, blue) correspond to the two simulated Raman spectral datasets.