Using Machine Learning for Classification of Cancer Cells from Raman Spectroscopy

Lerina Aversano [1, a], Mario Luca Bernardi [1, b], Vincenzo Calgano, Marta Cimitile [2, c], Concetta Esposito, Martina Iammarino [1, d], Marco Pisco, Sara Spaziani and Chiara Verdone

University of Sannio, Department of Engineering, Benevento, Italy

Unitelma Sapienza University, Rome, Italy

University of Sannio, Optoelectronic Division - Engineering Department, Benevento, Italy

Keywords:

Machine Learning, Classification, Raman Spectroscopy Analysis, Health Informatics.

Abstract:

Since cancer represents one of the leading causes of death worldwide, the development of approaches capable of discerning healthy from diseased cells would be of fundamental importance to support diagnostic and screening techniques. Raman spectroscopy is the most effective molecular analysis technique currently available and provides information on the molecular composition, bonds, chemical environment, phase, and crystalline structure of the samples under examination. This work exploits a combination of Raman spectroscopy and machine learning models to classify a patient's liver cells as tumor or non-tumor. The research uses real patient data, provided by the Center for Nanophotonics and Optoelectronics for Human Health (CNOS), which analyzed the cells of a patient with liver cancer. Specifically, the dataset has been built through a long data collection process, which first involved the analysis of the cells with Raman spectroscopy and then the training of two classifiers, Decision Tree and Random Forest. The results show good performance for the trained classifiers, especially the Random Forest, which reaches an accuracy of 90%.

1 INTRODUCTION

Health institutions' reports clearly highlight that one of the leading causes of death in the world is cancer (Torre et al., 2016; Miller et al., 2016). In 2020 alone, there were about 19.3 million new cases of cancer in the world and about 10 million deaths due to the disease.

Tumors represent a very complex disease and can manifest in distinct phases: (i) the initial or localized disease, in which there is only a single tumor in a single site; (ii) the phase of relapse, with or without prior surgery, in which the disease recurs, but only in the site where it first appeared; and (iii) the disseminated form, in which the malignant cells have left the organ of origin to colonize other organs, even distant ones (metastasis). The main problem for its identification and possible treatment is that each type of tumor requires a different approach and, often, different treatment times.

a: https://orcid.org/0000-0003-2436-6835

b: https://orcid.org/0000-0002-3223-7032

c: https://orcid.org/0000-0003-2403-8313

d: https://orcid.org/0000-0001-8025-733X

Research in this area is constantly evolving, and medicine currently has several methodologies at its disposal to fight cancer: active surveillance, which is reserved for very slow-growing cancers; surgery; radiotherapy, which uses X-rays to destroy cancer cells; chemotherapy, which uses cytotoxic drugs that block the division of rapidly replicating cells without distinguishing between healthy and diseased cells; hormone therapy; biological or molecular target drugs, which are substances capable of "recognizing" the tumor cell and promoting its destruction by the immune system; and finally immunotherapy, which consists of drugs capable of stimulating the immune system against cancer cells. In general, the earlier a diagnosis is made, the more timely and effective the treatment can be. The study (Neal et al., 2015) showed that there is a correlation between a diagnosis made in a short time and a favorable outcome for the patient. Therefore, developing approaches that can distinguish healthy cells from compromised ones, in support of diagnostic and screening techniques, would be fundamental. During the first phase of diagnosis, the most important step is distinguishing between cancer cells and normal cells. Currently, the methods in use for tumor detection include MRI, tomography, endoscopy, and different biochemical pathways in combination with mass or optical spectroscopy (Henschke et al., 1999; Sun et al., 2008; Zhu et al., 2008). Unfortunately, however, these require time and very sensitive equipment for the early diagnosis of the tumor.

Aversano, L., Bernardi, M., Calgano, V., Cimitile, M., Esposito, C., Iammarino, M., Pisco, M., Spaziani, S. and Verdone, C.

Using Machine Learning for Classification of Cancer Cells from Raman Spectroscopy.

DOI: 10.5220/0011142600003277

In Proceedings of the 3rd International Conference on Deep Learning Theory and Applications (DeLTA 2022), pages 15-24

ISBN: 978-989-758-584-5; ISSN: 2184-9277

In the late 1970s, Raman spectroscopy with an optical microscope was introduced and has since been used for microanalysis in many fields (Mulvaney and Keating, 2000). Micro-Raman spectroscopy has become an important tool in biology, particularly for single-cell studies (Smith et al., 2016). Raman spectroscopy exploits the interaction of light with matter through a process of diffusion (called scattering) to obtain information on the characteristics of a material and its molecular structure. Whereas infrared spectroscopy is based on the absorption of light, Raman spectroscopy is based on scattering and provides information on intra- and intermolecular interactions. Therefore, by analyzing Raman spectra, it should be possible to identify differences in molecular composition and structure between normal and cancer cells and tissues.

At the same time, machine learning is increasingly used in different fields of medicine and bioengineering (Aversano et al., 2021b; Ardimento et al., 2021; Aversano et al., 2020). Machine learning algorithms learn quickly and effectively from the data provided as input, generating computational models capable of automatically and rapidly producing evaluation results.

Therefore, the combination of Raman spec-

troscopy and machine learning could lead to a reduc-

tion in the time required for the diagnosis of patient

cells, in order to subject the patient to specialist ex-

aminations as soon as possible in case of need (Zhang

et al., 2021; Zhang et al., 2022).

This research study is placed in this context, and its objective is the classification of cells analyzed with Raman spectroscopy into tumor and non-tumor cells.

This document is structured as follows. Section 2 briefly discusses the related work. Section 3 provides some basic information on Raman spectroscopy. The proposed approach is described in Section 4, while the results of the experiment are discussed in Section 5. Finally, Sections 6 and 7 report the threats to validity and the conclusions, respectively.

2 RELATED WORKS

In recent years, numerous studies have focused on the classification and prediction of human diseases of different types. In particular, machine learning techniques have been widely used for the early diagnosis of many diseases, such as diabetes (G. and K., 2019), heart disease (Karayılan and Kılıç, 2017), Parkinson's disease (Aversano et al., 2020) and thyroid diseases (Aversano et al., 2021a). More recently they have also been used for Covid-19 diagnosis (Rasheed et al., 2021). These works are intended to reduce the time and costs required for the diagnosis and treatment of the patient.

Similarly, the proposed study aims to develop a

predictive model of liver cancer in a patient, start-

ing from the Raman spectroscopy analysis of patient

cells.

Few other studies have investigated approaches

based on the combination of Raman spectroscopy and

machine learning with this objective.

The study (Germond et al., 2018) concerns the application of techniques based on machine learning for the classification of cell types. The authors present different approaches to exploit Raman hyperspectral images: they extract information on the cell (i) from the calculation of the average spectrum, (ii) from the wavenumbers used to map the distribution of molecular compounds, and (iii) by combining the two previous methods. The adopted classification method is the PCA-DA approach, consisting of a principal component analysis (PCA) step followed by discriminant analysis (DA). In addition, the authors investigated projection on latent structures (PLS-DA), the K-means predictive model, and Support Vector Machine (SVM). With spectrum-based classification, the PCA-DA model showed an accuracy of 83.3%, while image-based classification scored an accuracy of 96.3%. The combination of the two approaches achieves 100% accuracy in cell line discrimination in their experimentation.

Like the previous work, in (Pavillon et al., 2018) the authors also used the statistical approach of PCA for the analysis of small cellular changes in response to stimuli, acquiring different parameters with label-free microscopy and achieving a model accuracy of 85%.

In (Schie et al., 2016) the authors used Raman spectroscopy for the differentiation of eukaryotic from prokaryotic cells. Since prokaryotic cells are smaller than eukaryotic ones, for them a single Raman spectrum is often enough to generate a dataset sufficient for the training phase. For eukaryotic cells, more than one spectrum is necessary. Since probing entire cells with Raman spectroscopy at high resolution takes a long time, the authors propose a method that acquires integrated Raman spectra covering a large portion of the cell. The approach exploits support vector machines for classification, comparing single spectra with integrated Raman images and spectra of cells. Their results show that the sensitivity of the model can be as high as 90%.

DeLTA 2022 - 3rd International Conference on Deep Learning Theory and Applications

The study (Hsu et al., 2020) deals with stem cells, which can self-renew and differentiate into multiple cell types, allowing the evaluation of pharmaceutical effects and the treatment of various neural diseases. The authors propose a platform exploiting Raman-labeled spectroscopy to classify cells into the different classes of neural cells (derived from induced stem cells). To this end, the authors used several classification models (i.e., Support Vector Machine, Random Forest, K-Nearest Neighbor, and Stochastic Gradient Boosting (SGB)) that achieve an average accuracy of 97.5%.

In (Ren, 2020), the authors used normal breast

cells and prostate cancer cells and an NGK machine

learning approach to obtain a prediction value be-

tween 87% and 89% without considering the outliers.

The study (Lussier et al., 2019) reports an approach based on the combination of Raman spectroscopy and Deep Learning for the simultaneous analysis of at least eight metabolites in vitro near different cell lines. Analyzing these components is fundamental because it allows research on the responses of living cells to inflammation and wound healing. The authors propose a supervised artificial neural network (ANN) that assigns multiple spectra to the same metabolite. The network consists of two convolutional layers and two pooling layers and uses the softmax function at the output. The results show good model performance, with an accuracy of 86.8%.

Our study aims to distinguish tumor cells from non-tumor cells through Machine Learning methods for the first time. Moreover, the proposed model was tested on real patient records provided by the Center for Nanophotonics and Optoelectronics for Human Health (CNOS).

3 RAMAN SPECTROSCOPY

Raman spectroscopy is a widely used spectroscopic method. It is an analysis technique providing information on the chemical structure and molecular interactions of a target sample (e.g., a tissue segment or even a single cell). It works by pointing a laser beam at a sample. As shown in Figure 1, the laser light excites molecules in the sample and the scattering effect is observed. Then, the scattered light is collected by an optical system including a microscope objective, and decomposed by the spectrograph.

Figure 1: Scattering processes occurring when light interacts with a molecule.

A small amount of the scattered light shifts in en-

ergy from the laser frequency because of interactions

between the incident electromagnetic waves and the

vibrational energy levels of the molecules in the sam-

ple. Plotting the intensity of the shifted light against

the frequency produces a Raman spectrum of the sam-

ple.

The measured scattered light shows a broader spectrum with additional wavelengths. An emission filter behind the probe blocks the incident wavelength, so that the residual scattered light can be clearly distinguished from the incident light. Raman spectra are usually plotted relative to the laser frequency, so that the Rayleigh band lies at 0 cm⁻¹. On this scale, the band positions sit at the frequencies corresponding to the energy levels of the vibrations of different functional groups.
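As an illustration of plotting relative to the laser frequency, the Raman shift can be computed from the excitation and scattered wavelengths as the difference of their wavenumbers. The following is a minimal sketch; the function name and the 532 nm excitation wavelength are assumptions chosen for the example and are not taken from the experimental setup described in this paper.

```python
def raman_shift_cm1(excitation_nm: float, scattered_nm: float) -> float:
    """Raman shift in cm^-1 of scattered light relative to the excitation line."""
    # 1 nm = 1e-7 cm, so 1e7 / wavelength_in_nm is the wavenumber in cm^-1.
    return 1e7 * (1.0 / excitation_nm - 1.0 / scattered_nm)


# Hypothetical 532 nm excitation laser (an assumption for this example).
rayleigh = raman_shift_cm1(532.0, 532.0)     # elastic scattering: zero shift
stokes = raman_shift_cm1(532.0, 562.0)       # light lost energy: positive shift
anti_stokes = raman_shift_cm1(532.0, 505.0)  # light gained energy: negative shift
```

With this convention the elastically scattered light sits at the origin of the axis, Stokes bands have positive shifts, and anti-Stokes bands have negative shifts.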

In summary, the light leaves the sample hit by the laser in three components. A part of the radiation is scattered elastically in all directions without loss of energy, i.e. at the same frequency as the incident radiation (elastic or Rayleigh scattering). A large portion passes through the sample, and a very small portion is scattered inelastically. This inelastic scattering can be of two types: scattering in which the radiation cedes energy in the interaction with the molecule (Raman Stokes scattering), and scattering in which the radiation acquires energy in the interaction with the molecule (Raman anti-Stokes scattering).

As for the excitation source, an intense monochromatic beam is used. It is monochromatic because the frequency shifts of the scattered radiation with respect to the incident radiation are very small, and therefore the source must be monochromatic to facilitate observation. It is intense because the intensity of the scattered radiation is very low, and therefore the incident radiation must have a much greater intensity.


4 PROPOSED APPROACH

In this section, we report the proposed approach,

which aims to classify a cell as a tumor or not a tumor.

First, we describe the process used for data collection,

then the machine learning algorithms used and the pa-

rameters we have set for their operation, and ﬁnally,

the metrics used for the validation of the model.

4.1 Data Collection

The data for the research was provided by the Center for Nanophotonics and Optoelectronics for Human Health (CNOS), which analyzed the cells of a patient with liver cancer (hepatocarcinoma) using Raman spectroscopy. For this purpose, two tissue samples were taken from the patient's liver, one in the area affected by the tumor, and another in a part distant from the tumor but belonging to the same liver.

The cells have been supplied to the laboratory by the National Cancer Institute IRCCS G. Pascale Foundation. These are real cells that have not undergone any preliminary processing and that, unlike cell lines, which correspond to a cell taken from the patient and replicated thousands of times, have not been cultured or immortalized. Therefore, the lack of replication ensures that there is no loss of information related to their fundamental properties.

For the analysis using Raman spectroscopy, the LabRAM HR Evolution has been used, a system that, thanks to the Raman effect, can obtain spectra with high spatial and spectral resolution using ultra-fast confocal imaging. This instrument offers a wide range of wavelengths, from 200 to 2200 nm, and can reach frequencies of the order of 10 cm⁻¹ using an ultra-low frequency module. For the processing and setup of the measurements, the instrument has been supplied with the LabSpec software.

During the measurement, the first phase consists of the self-calibration of the instrument to set the right parameters to obtain the Raman effect. The instrument also has a Rayleigh filter, which filters out the non-informative vibrations, thus showing only the Raman Stokes vibrations. In the second phase, instead, the cell suitable for measurement is searched for using 10x magnification. Figure 2 shows the candidate cells for possible measurement.

The selection of the suitable cell for measurement is made based on the experience of the operator, who chooses the one that has the best visual characteristics, excluding cells with irregular shapes or cells that are involved in the cell division process. Once the cell is selected, 100x magnification is used for a more detailed view of the cell. At this point, the operator moves the laser towards a point within its nucleus, making 5 measurements for each cell at different points within the nucleus of the same cell. Figure 3 shows the cell at 100x magnification, where the green dot represents the laser that is manually guided by the operator.

Figure 2: Candidate cells for measurement with 10x magnification.

Figure 3: Visualization of a cell with 100x magnification and laser positioning.

National Cancer Institute IRCCS G. Pascale Foundation: https://www.alleanzacontroilcancro.it/en/istituto/istituto-nazionale-tumori-fondazione-pascale/

The measurements carried out concern the Fingerprint Region, which captures the Raman spectrum in the region ranging from 600 cm⁻¹ to 1800 cm⁻¹. There is also another region for acquisitions, the High Wavenumber Region, which instead ranges from 1800 cm⁻¹ to 3100 cm⁻¹. In other cases, it is possible to make measurements on the union of the two regions, therefore from 600 cm⁻¹ to 3100 cm⁻¹.

Once the data were collected, they were subjected to preprocessing operations: vector normalization and windowing.

Since the spectra belong to a vector space that has an inner product and a norm, vector normalization has been performed to scale each spectrum to unit norm. In addition to normalization, the removal of outliers, background, and substrate was also performed.

Figure 4: Dataset is built using a sliding windows approach. (Plot of amplitude against Raman shift in cm⁻¹, annotated with the window size, the overlap, and the frequency range of each window.)
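The vector normalization step can be illustrated as follows. This is a minimal sketch, assuming the Euclidean (L2) norm and spectra stored as rows of a NumPy array; the function name and the toy values are hypothetical and do not reproduce the preprocessing code used in the study.

```python
import numpy as np


def vector_normalize(spectra: np.ndarray) -> np.ndarray:
    """Scale each row (one spectrum) to unit Euclidean norm."""
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    # Leave all-zero spectra unchanged instead of dividing by zero.
    norms[norms == 0] = 1.0
    return spectra / norms


# Two toy "spectra" with three amplitude samples each.
spectra = np.array([[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]])
normalized = vector_normalize(spectra)
```

After this step every spectrum has the same overall scale, so differences between spectra reflect the relative shape of the peaks rather than the absolute signal intensity.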

The windowing technique, exemplified in Figure 4, has been used to divide the sample into windows of varying widths on which the ensemble is trained. In particular, the window size ranges from 2 samples to 90 samples. Data is divided into multiple windows, characterized by different sizes and overlaps. In this regard we have used four levels of overlap: (i) no overlap; (ii) overlap of 0.5, meaning that windows overlap by half of their size; (iii) overlap of 0.75, meaning that windows overlap by three quarters of their size; and (iv) leave-one-out overlap, meaning that the overlap is almost total, advancing the window by just one sample.
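The four overlap levels can be sketched with plain NumPy. This is only an illustration under stated assumptions: the study generated its windows with the Seglearn library, whereas the helper below (`make_windows`, a hypothetical name) uses `numpy.lib.stride_tricks.sliding_window_view` and derives the step between windows from the overlap fraction.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(amplitudes, width, overlap):
    """Split a 1-D spectrum into windows of `width` samples.

    `overlap` is the fraction of a window shared with the next one
    (0, 0.5 or 0.75), or the string "leave-one-out" for a step of one sample.
    """
    if isinstance(overlap, str) and overlap == "leave-one-out":
        step = 1
    else:
        step = max(1, int(round(width * (1.0 - overlap))))
    # All windows with step 1, then keep every `step`-th one.
    windows = sliding_window_view(np.asarray(amplitudes), width)
    return windows[::step]


spectrum = np.arange(10.0)  # toy spectrum with 10 amplitude samples
no_overlap = make_windows(spectrum, width=4, overlap=0)
half_overlap = make_windows(spectrum, width=4, overlap=0.5)
one_step = make_windows(spectrum, width=4, overlap="leave-one-out")
```

In this sketch trailing samples that do not fill a complete window are dropped; whether the original pipeline pads or discards them is not stated in the text.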

The ﬁnal dataset that we have used consists of 364

frequencies, and each of them corresponds to a sam-

ple of amplitude in the various records of the dataset.

Speciﬁcally, within the dataset used, each record has:

• a progressive number representing the cell num-

ber and the measurement number, separated by a

dot. These two numbers, taken together, identify

one of the ﬁve points measured in the cell nucleus

(for instance, “1.1” identiﬁes the ﬁrst measure-

ment of the ﬁrst cell);

• the measurement samples that represent the am-

plitudes corresponding to the various frequencies.

• the class label to predict, which identiﬁes the type

of cell, and can take two values: Tumor if it is a

measurement made on a tumor cell, Non Tumor if

the cell is healthy.

4.2 Data Augmentation

To mitigate the problem of the reduced size of the dataset, a data augmentation step has been performed. Data augmentation is a set of techniques that extend the available dataset without actually collecting new elements: it applies controlled random changes to the already existing data, making modified copies. Augmented data can be either slightly modified copies of already existing data or synthetic data created starting from the initial dataset (van Dyk and Meng, 2001). With this technique, we have increased the set of spectral data. For spectral data, data augmentation is mainly carried out with changes in the slope of the spectrum, random multiplication of the amplitudes, addition of random offsets, or, as done in this study, with the shift of the wavelengths of the spectra. Another advantage of this technique is that it reduces the phenomenon of overfitting. In this work, augmentation was therefore carried out by randomly generating new spectra shifted in wavelength, with the help of the aug_xshift() function.

More speciﬁcally, to use this function we have set

the following parameters:

• the spectrum, randomly selected from the set of

spectra available;

• the shift range, which specifies the range of possible random shift values for the spectrum. In this case, 6 indicates the left limit at -6 and the right limit at +6, corresponding to the interval [-6, +6];

• the quantity, the number of new spectra generated

starting from the original spectrum. In this case,

we have opted for a value of 1 corresponding to

the default value of the function, so only a new

spectrum will be generated;

• the classes to predict, 1 in the case of tumor cell

spectra, 0 for non-tumor cells.

Therefore, an equal number of new spectra were gen-

erated for the two prediction classes, where each spec-

trum was randomly selected from the available set.

With this technique, we conducted an additional anal-

ysis, using a dataset made up of 632 samples as the

input of the classiﬁers, doubling the initial dataset.
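The idea of shift-based augmentation can be sketched as follows. This is not the aug_xshift() function used in the study, whose implementation is not shown here; `shift_spectrum` is a hypothetical stand-in that draws a random offset from [-6, +6] and resamples the spectrum by linear interpolation. The axis range, the unit of the shift, and the synthetic single-peak spectrum are assumptions made for the example.

```python
import numpy as np


def shift_spectrum(x, y, shift_range=6.0, rng=None):
    """Return a copy of spectrum `y` shifted along the axis `x` by a random
    offset drawn uniformly from [-shift_range, +shift_range], plus the offset."""
    rng = np.random.default_rng() if rng is None else rng
    shift = rng.uniform(-shift_range, shift_range)
    # y_shifted(x) = y(x - shift); np.interp holds the edge values constant
    # where the shifted axis falls outside the measured range.
    return np.interp(x - shift, x, y), shift


rng = np.random.default_rng(0)
x = np.linspace(600.0, 1800.0, 364)            # assumed axis with 364 points
y = np.exp(-0.5 * ((x - 1000.0) / 15.0) ** 2)  # synthetic single-peak spectrum
y_aug, shift = shift_spectrum(x, y, shift_range=6.0, rng=rng)
```

Generating one shifted copy per original spectrum, with the class label copied unchanged, doubles the number of samples, which matches the doubling described above.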

4.3 Classiﬁcation Methods and Setup

For the classification of cells into tumor and non-tumor cells, different classifiers have been used, varying the algorithm, the window size, the window overlap, and the adopted features. The goal is to train a model that can effectively support the diagnosis of liver cancer. More specifically, we have used learning algorithms based on decision trees.

Table 1: Validation Metrics.

| Metrics | Formula | Description |
| --- | --- | --- |
| Accuracy | $A = \frac{TP + TN}{TP + FP + TN + FN}$ (1) | The accuracy of the model on the test set. True positives (TP) and true negatives (TN) are correctly classified instances, while false positives (FP) and false negatives (FN) are incorrectly classified instances. |
| Precision | $P = \frac{TP}{TP + FP}$ (2) | The ability of a classifier not to label as positive an instance that is in fact negative. It represents the ratio between the correct positive predictions (TP) and the total positive predictions (TP + FP). |
| Recall | $R = \frac{TP}{TP + FN}$ (3) | The sensitivity of the model. It represents the ratio between the correct predictions for a given class and the total number of cases in which the class occurs. |
| F1 Score | $F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (4) | The harmonic mean of Precision and Recall. A classifier gets a high F1-Score only when both Precision and Recall are high. |

The Decision Tree Classifier (DTC) (Rokach and Maimon, 2014) is an algorithm in which classification functions are learned in the form of a tree. Within the tree, each variable is represented by a node, each possible value of that variable is represented by an arc to a child node, and the predicted class, given the values of the other variables, is represented by a leaf. At each iteration, every attribute is evaluated and its information gain is calculated; the attribute with the highest information gain is set as the new tree node. Once a leaf node is reached, the algorithm assigns to the remaining instances the class to which they belong. If there are multiple classes, a probability distribution is assigned.
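To make the information gain criterion concrete, the sketch below computes the Shannon entropy of a set of labels and the gain obtained by splitting on a threshold of one numeric attribute. The function names, the binary threshold split, and the toy data are assumptions for illustration; library implementations differ in detail.

```python
import numpy as np


def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def information_gain(feature, labels, threshold):
    """Entropy reduction obtained by splitting on `feature <= threshold`."""
    feature, labels = np.asarray(feature), np.asarray(labels)
    left, right = labels[feature <= threshold], labels[feature > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # the split separates nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


amplitude = np.array([0.1, 0.2, 0.8, 0.9])  # toy attribute values
label = np.array([0, 0, 1, 1])              # 0 = non-tumor, 1 = tumor
gain_good = information_gain(amplitude, label, threshold=0.5)
gain_none = information_gain(amplitude, label, threshold=0.05)
```

A split that separates the two classes perfectly removes all the uncertainty in the labels, while a split that leaves every instance on the same side yields no gain, so the tree prefers the former.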

The Random Forest Classiﬁer (RFC) (Breiman,

2001) is a classiﬁer obtained by aggregating multiple

decision trees, which are trained with a random subset

of the training dataset. Each tree produces a predic-

tion of the class. Once all the predictions have been

produced, the result will be the one that appears most

frequently.

To train the classifiers we have used these parameters:

• DecisionTreeClassifier(criterion = 'entropy', max_depth = 5000), where the criterion parameter sets the way the quality of a split is evaluated. We have used entropy, which bases the criterion on information gain. The max_depth parameter indicates the maximum depth of the decision tree. In our case, a depth of 5000 was set.

• RandomForestClassifier(max_depth = 1000, n_estimators = 1000), where the max_depth parameter has the same meaning as in the previous case and has been set to a depth of 1000, while n_estimators represents the number of trees belonging to the forest, in our case 1000.

The dataset construction and the classifiers have been implemented in Python, using the open-source scikit-learn library for the machine learning algorithms and the Seglearn library to generate windows with different sizes and overlaps.
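The configuration above can be sketched with scikit-learn as follows. The hyperparameter values are the ones stated in the text; the synthetic data, the train/test split, and the `random_state` arguments are assumptions added only to make the example reproducible and self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in data: 80 windows of 20 amplitudes each, two classes.
X = rng.normal(size=(80, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy rule: 1 = tumor, 0 = non-tumor

# Hyperparameters as reported in the text; random_state is an addition.
dtc = DecisionTreeClassifier(criterion="entropy", max_depth=5000, random_state=0)
rfc = RandomForestClassifier(max_depth=1000, n_estimators=1000, random_state=0)

# Train on the first 60 windows, predict the remaining 20.
dtc.fit(X[:60], y[:60])
rfc.fit(X[:60], y[:60])
dtc_pred = dtc.predict(X[60:])
rfc_pred = rfc.predict(X[60:])
```

With depth limits this large the trees are in practice grown until their leaves are pure, so the limits act as an upper bound rather than as regularization.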

4.4 Validation

To validate the performance of the different classi-

ﬁers, we have used the metrics reported in Table 1.

More speciﬁcally, the ﬁrst column contains the name

of the metrics, the second the formula to calculate it,

and the third a brief description.
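The four metrics follow directly from the counts of true and false positives and negatives. The sketch below is a plain-Python illustration of formulas (1) to (4); the function name and the example counts are hypothetical.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a test set of 20 windows.
metrics = classification_metrics(tp=8, fp=2, tn=7, fn=3)
```

The sketch assumes that at least one instance is predicted positive and at least one is actually positive; otherwise precision or recall is undefined and the divisions would fail.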

5 DISCUSSION OF RESULTS

In this section, we report the results obtained. Speciﬁ-

cally, we have conducted, for each classiﬁcation algo-

rithm, ﬁve different experiments, with a different set

of features provided in input to the classiﬁers:

• original dataset, as provided to us by the CNOS

center;

• addition of the base frequency (i.e., the lowest fre-

quency of each window) to the windows samples;

• addition of the frequency range (i.e., the lowest

and the highest frequencies of each window) to

the windows samples;

• addition of all the frequencies to the windows

samples (hence doubling the features);

• original dataset, as provided to us by the CNOS

center, extended with data augmentation (hence

doubling the samples).

In the second, third, and fourth experiments, we have added more information to allow the classifier to learn relationships between the peaks in the spectrum and the frequencies at which they occur.
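The first four feature sets can be sketched as a single helper that appends frequency information to the amplitudes of a window. The function name, the mode labels, and the toy window are hypothetical; the sketch only illustrates how the feature vector grows in each experiment and is not the code used in the study.

```python
import numpy as np


def build_features(amplitudes, frequencies, mode="original"):
    """Feature vector for one window under the first four experimental settings."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    frequencies = np.asarray(frequencies, dtype=float)
    if mode == "original":   # experiment 1: amplitudes only
        extra = np.empty(0)
    elif mode == "base":     # experiment 2: lowest frequency of the window
        extra = np.array([frequencies.min()])
    elif mode == "range":    # experiment 3: lowest and highest frequencies
        extra = np.array([frequencies.min(), frequencies.max()])
    elif mode == "all":      # experiment 4: every frequency of the window
        extra = frequencies
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.concatenate([amplitudes, extra])


# Toy window with four samples (values are illustrative only).
amps = [0.12, 0.30, 0.25, 0.10]
freqs = [900.0, 903.3, 906.6, 909.9]
features = {m: build_features(amps, freqs, m) for m in ("original", "base", "range", "all")}
```

The fifth experiment does not change the features: it uses the original feature set on the dataset extended with data augmentation.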

For each experiment conducted, the results are shown in Tables 2, 3, 4, 5, and 6. Each table shows the results obtained with both classifiers used, the Decision Tree and the Random Forest: the first six columns refer to the first classifier, and the following six to the second. For each experiment, we report the four cases in which we found the best results.

More specifically, for each classifier, the first two columns, colored in yellow, report the parameters we have set for the window size and overlap, and the following columns report the metrics we used for model validation, namely accuracy, precision, recall, and F-measure.

Table 2 reports the results obtained by training and evaluating the model on the starting dataset, containing the progressive number representing the cell number and the measurement number, the amplitudes corresponding to the various wavenumbers, and the type of cell (tumor or non-tumor). As the table shows, with the Decision Tree the best performance has been obtained by setting the window size to 85 and the overlap to 50%. In this case, the model has an accuracy of 70%, a precision of 75%, a recall of 77%, and an F-measure of 76%. With the Random Forest the results are better: the F-measure lies between about 85% and 86% for all four of the best cases reported. In particular, with this classifier, the best results were obtained by setting the window size to 60 with no overlap. In this case, accuracy, precision, recall, and F-measure are 82.25%, 83.11%, 88.66%, and 85.79%, respectively. These results attest that the classifiers, and specifically the Random Forest, can classify the instances with some success even with a modestly sized dataset.

Table 3 shows the results obtained for the second experiment, in which, in addition to the information previously described, we have added the lowest frequency of each window as an additional feature. With these input data, the Decision Tree classifier obtained its best results with a window size of 80 and no overlap, reaching an F-score of 74%. Therefore, compared to the previous experiment, the additional information did not help, but led to a lowering, albeit minimal, of the model validation metrics.

This did not happen in the case of the Random Forest, where the F-score reached 86% in the best case, the one with a window size of 60 and no overlap.

In the third experiment, the frequency range (i.e., the lowest and highest frequencies of each window) has been added as additional features. Table 4 shows that in the case of the Decision Tree there is still no increase in the F-score, which is almost 75% in the best case, the one with a window size of 50 and zero overlap; on the original dataset, the highest F-score with this model was 76%. With the Random Forest, instead, there continues to be a slight increase in the best case (i.e., window size 75 and zero overlap): the F-score improves slightly, to about 86%, and the Recall increases strongly, reaching 91%. Therefore, these results suggest that with the addition of this information the Random Forest is able to learn relationships between peaks and frequencies better than the Decision Tree, managing to further reduce the classification errors on the input instances.

Table 5 instead shows the results of the fourth experiment, where all the frequencies corresponding to each window have been added as additional features (doubling the features). As in the two previous cases, for the Decision Tree there are no substantial variations in the validation metrics, which remain stable. On the contrary, with the Random Forest we continue to have a small increase in the F-score. In fact, in the best case, with a window size of 75 and no overlap, we obtain an increase of 0.3% compared to the experiment that included the frequency range. These improvements, albeit small, indicate that the classifier benefits from the additional information during the training phase.

Finally, the last experiment has been conducted using the data augmentation technique, and its results are shown in Table 6. The results show a decisive improvement in the case of the Decision Tree, which in the best case, with a window size of 70 and no overlap, obtains a substantial increase in accuracy, going from 70% on the initial dataset to 82%. The other metrics, on the other hand, remain stable. This is also true for the Random Forest, whose best case has a window size of 75 and an overlap of 50%. With these parameters we obtain an accuracy of almost 90%, a precision of almost 81%, a recall of 84%, and an F-score of 82%.

Therefore, among the chosen classifiers, the one that best fits the data is the Random Forest, which obtained the best results in all five experiments, in the first four with an overlap equal to 0. Its best performance overall in terms of F-score was obtained in the fourth experiment, where the classifier received as input the starting dataset with the addition of all the frequencies of the window under examination.
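The comparison across experiments can be retraced from the tables. The Python sketch below copies the Random Forest row with the highest F1-score from each of Tables 2 to 6 and selects the best experiment by F1-score and by accuracy; the variable names are illustrative only.

```python
# Random Forest row with the highest F1-score in each table (values in percent).
best_rf_rows = [
    {"experiment": "original dataset", "window": 60, "overlap": 0.0,
     "accuracy": 82.25, "f1": 85.79},   # Table 2
    {"experiment": "base frequency", "window": 60, "overlap": 0.0,
     "accuracy": 82.60, "f1": 86.00},   # Table 3
    {"experiment": "interval frequency", "window": 75, "overlap": 0.0,
     "accuracy": 82.37, "f1": 86.19},   # Table 4
    {"experiment": "all frequencies", "window": 75, "overlap": 0.0,
     "accuracy": 82.63, "f1": 86.42},   # Table 5
    {"experiment": "data augmentation", "window": 75, "overlap": 0.5,
     "accuracy": 89.98, "f1": 82.55},   # Table 6
]

best_by_f1 = max(best_rf_rows, key=lambda row: row["f1"])
best_by_accuracy = max(best_rf_rows, key=lambda row: row["accuracy"])

print(best_by_f1["experiment"], best_by_f1["f1"])
# all frequencies 86.42
print(best_by_accuracy["experiment"], best_by_accuracy["accuracy"])
# data augmentation 89.98
```

The two criteria pick different experiments: the fourth experiment gives the highest F-score, while data augmentation gives the highest accuracy.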

6 THREATS TO VALIDITY

The proposed study is subject to three types of threats to validity: internal, external, and construct.

A threat to internal validity could be classification errors due to incorrect data labeling. This risk is

Table 2: Results on the original dataset.

Classifier     Window size  Overlap  Accuracy  Precision  Recall  F1-score
Decision Tree  85           0        68.95%    72.84%     77.29%  75.00%
Decision Tree  90           0.5      70.03%    76.14%     74.07%  75.09%
Decision Tree  80           0        69.47%    73.84%     76.42%  75.11%
Decision Tree  85           0.5      70.33%    75.00%     77.04%  76.00%
Random Forest  90           0.5      81.17%    82.71%     87.41%  84.99%
Random Forest  75           0        81.32%    80.86%     90.39%  85.36%
Random Forest  55           0        81.90%    82.48%     88.95%  85.59%
Random Forest  60           0        82.25%    83.11%     88.66%  85.79%

Table 3: Results with the base frequency.

Classifier     Window size  Overlap  Accuracy  Precision  Recall  F1-score
Decision Tree  80           0.75     68.42%    71.89%     74.94%  73.38%
Decision Tree  65           0.75     68.87%    72.20%     75.31%  73.72%
Decision Tree  75           0.75     69.91%    72.83%     75.60%  74.19%
Decision Tree  80           0        68.95%    74.03%     74.67%  74.35%
Random Forest  75           0.75     80.89%    78.17%     92.41%  84.69%
Random Forest  75           0        81.32%    80.86%     90.39%  85.36%
Random Forest  55           0        82.43%    82.97%     89.24%  85.99%
Random Forest  60           0        82.60%    83.75%     88.37%  86.00%

Table 4: Results with the interval frequency.

Classifier     Window size  Overlap  Accuracy  Precision  Recall  F1-score
Decision Tree  75           0.75     69.42%    72.79%     74.30%  73.54%
Decision Tree  75           0        68.95%    74.67%     73.36%  74.01%
Decision Tree  80           0.75     69.48%    72.95%     75.42%  74.17%
Decision Tree  50           0        70.33%    77.37%     72.59%  74.90%
Random Forest  75           0.75     80.58%    77.91%     92.19%  84.45%
Random Forest  60           0        82.25%    83.47%     88.08%  85.71%
Random Forest  55           0        82.43%    82.80%     89.53%  86.03%
Random Forest  75           0        82.37%    81.64%     91.27%  86.19%

Table 5: Results with all frequencies.

Classifier     Window size  Overlap  Accuracy  Precision  Recall  F1-score
Decision Tree  50           0        69.28%    76.52%     71.60%  73.98%
Decision Tree  60           0        68.89%    74.78%     73.26%  74.01%
Decision Tree  75           0.75     70.22%    73.46%     75.05%  74.25%
Decision Tree  80           0        68.95%    74.03%     74.67%  74.35%
Random Forest  80           1        80.77%    78.56%     91.59%  84.57%
Random Forest  55           0        82.07%    82.70%     88.95%  85.71%
Random Forest  60           0        82.25%    83.47%     88.08%  85.71%
Random Forest  75           0        82.63%    81.71%     91.70%  86.42%

Table 6: Results with data augmentation.

Classifier     Window size  Overlap  Accuracy  Precision  Recall  F1-score
Decision Tree  75           0        83.38%    69.19%     70.19%  69.69%
Decision Tree  85           0.5      82.56%    68.59%     71.65%  70.09%
Decision Tree  75           0.5      82.25%    66.60%     74.19%  70.19%
Decision Tree  70           0        82.39%    69.18%     72.14%  70.63%
Random Forest  85           0.75     89.97%    83.36%     81.18%  82.25%
Random Forest  85           1        89.76%    80.65%     84.13%  82.35%
Random Forest  90           1        89.82%    80.51%     84.43%  82.42%
Random Forest  75           0.5      89.98%    80.98%     84.19%  82.55%

strongly mitigated because the dataset was provided by a specialized center, the Center for Nanophotonics and Optoelectronics for Human Health (CNOS), which analyzed the cells of a patient under treatment at a nationally renowned institute (National Cancer Institute IRCCS G. Pascale Foundation).

The threat to external validity, on the other hand, concerns the generalization of the results. A limitation of the study is that the classification was carried out on the cells of a single patient, so the dataset does not contain a very high number of instances. To mitigate this threat, we used the data augmentation technique.

Finally, threats to construct validity could be rep-

resented by inaccuracies or omissions made during

the construction phase of the dataset. To mitigate this

problem, the cells have been analyzed using Raman

spectroscopy.

7 CONCLUSIONS AND FUTURE WORK

This paper addressed an important issue, since oncological diseases are among the leading causes of death in the world. Developing a system that helps the oncologist evaluate tumor markers could soon lead to tools that greatly shorten the time needed to diagnose the pathology. The study carried out on real cells opens the door to "diagnosis in real-time" through detection using the Raman spectrum, a non-invasive and non-destructive technique for the patient.

The proposed approach, based on the combination of Raman spectroscopy and machine learning models, allows the patient's cells to be identified as "malignant" or not in a matter of minutes. This methodology does not aim to replace the work of the doctor, who remains at the center of the diagnosis and treatment process, but is a tool made available to them.

The main contribution of this work consists in the

use of a dataset containing real information about a

patient under treatment at the National Cancer Institute

IRCCS G. Pascale Foundation, whose cells have

been analyzed by the Center for Nanophotonics and

Optoelectronics for Human Health (CNOS).

The proposed approach has been tested on a dataset containing 364 wavenumbers, each of which corresponds to an amplitude sample in the records of the dataset. The results show good performance of the Random Forest classifier, which with data augmentation reached an accuracy of 89.98%.

The limitation of the study is that the classification was carried out on cells from a single patient: some were collected in the tumor area, and others in adjacent but healthy areas.

In the future, it would be interesting to investigate three directions: (i) classification of cells from different patients with the same pathology, to assess whether the pathology has similar traits in different patients; (ii) classification of cells from different patients with different tumor pathologies, to evaluate whether there is an indicator, that is, a set of biological components, common to all oncological pathologies; (iii) classification of cells from healthy patients and from patients suffering from oncological diseases, to understand whether some tumor traits are also present in healthy patients, so that the disease could be avoided with effective preventive therapy.

REFERENCES

Ardimento, P., Aversano, L., Bernardi, M. L., and Cimitile,

M. (2021). Deep neural networks ensemble for lung

nodule detection on chest ct scans. In 2021 Interna-

tional Joint Conference on Neural Networks (IJCNN),

pages 1–8.

Aversano, L., Bernardi, M. L., Cimitile, M., Iammarino,

M., Macchia, P. E., Nettore, I. C., and Verdone,

C. (2021a). Thyroid disease treatment prediction

with machine learning approaches. Procedia Com-

puter Science, 192:1031–1040. Knowledge-Based

and Intelligent Information and Engineering Sys-

tems: Proceedings of the 25th International Confer-

ence KES2021.

Aversano, L., Bernardi, M. L., Cimitile, M., and Pecori,

R. (2020). Early detection of parkinson disease us-

ing deep neural networks on gait dynamics. In 2020

International Joint Conference on Neural Networks

(IJCNN), pages 1–8.

Aversano, L., Bernardi, M. L., Cimitile, M., and Pecori,

R. (2021b). Deep neural networks ensemble to de-

tect covid-19 from ct scans. Pattern Recognition,

120:108135.

Breiman, L. (2001). Random forests. Machine Learning,

45(1):5–32.

G., S. S. and K., M. (2019). Diagnosis of diabetes diseases

using optimized fuzzy rule set by grey wolf optimiza-

tion. Pattern Recognition Letters, 125:432 – 438.

Germond, A., Ichimura, T., da Chiu, L., Fujita, K., Watan-

abe, T. M., and Fujita, H. (2018). Cell type discrimina-

tion based on image features of molecular component

distribution. Scientiﬁc Reports, 8.

Henschke, C. I., McCauley, D. I., Yankelevitz, D. F.,

Naidich, D. P., McGuinness, G., Miettinen, O. S.,

Libby, D. M., Pasmantier, M. W., Koizumi, J., Altorki,

N. K., and Smith, J. P. (1999). Early lung cancer ac-

tion project: overall design and ﬁndings from baseline

screening. The Lancet, 354(9173):99–105.

Hsu, C.-C., Xu, J., Brinkhof, B., Wang, H., Cui, Z., Huang,

W. E., and Ye, H. (2020). A single-cell raman-based

platform to identify developmental stages of human

pluripotent stem cell-derived neurons. Proceedings

of the National Academy of Sciences, 117(31):18412–

18423.

Karayılan, T. and Kılıç, Ö. (2017). Prediction of heart disease

using neural network. In 2017 International Confer-

ence on Computer Science and Engineering (UBMK),

pages 719–723.

Lussier, F., Missirlis, D., Spatz, J. P., and Mas-

son, J.-F. (2019). Machine-learning-driven surface-

enhanced raman scattering optophysiology reveals

multiplexed metabolite gradients near cells. ACS

Nano, 13(2):1403–1411. PMID: 30724079.

Miller, K. D., Siegel, R. L., Lin, C. C., Mariotto, A. B.,

Kramer, J. L., and Rowland, J. H. (2016). Cancer

treatment and survivorship statistics, 2016. CA Can-

cer J Clin, 66:271–289.

Mulvaney, S. P. and Keating, C. D. (2000). Raman spec-

troscopy. Analytical Chemistry, 72(12):145–158.

Neal, R. D., Tharmanathan, P., France, B., Din, N. U.,

Cotton, S. J., Fallon-Ferguson, J., Hamilton, W. T.,

Hendry, A., Hendry, M., Lewis, R., Macleod, U.,

Mitchell, E. D., Pickett, M., Rai, T. K., Shaw, K., Stu-

art, N. S., Tørring, M. L., Wilkinson, C., Williams,

B., Williams, N., and Emery, J. D. (2015). Is increased

time to diagnosis and treatment in symptomatic cancer

associated with poorer outcomes? systematic review.

British Journal of Cancer, 112:S92 – S107.

Pavillon, N., Hobro, A. J., Akira, S., and Smith, N. I.

(2018). Noninvasive detection of macrophage activa-

tion with single-cell resolution through machine learn-

ing. Proceedings of the National Academy of Sciences,

115(12):E2676–E2685.

Rasheed, J., Hameed, A. A., Djeddi, C., Jamil, A., and Al-

Turjman, F. (2021). A machine learning-based frame-

work for diagnosis of covid-19 from chest x-ray im-

ages. Interdisciplinary Sciences: Computational Life

Sciences, 13(1):103–117.

Ren, X., Nam, W., Ghassemi, P., et al. (2020). Scalable nanolaminated

SERS multiwell cell culture assay. Microsyst Nanoeng,

6(47).

Rokach, L. and Maimon, O. (2014). Data Mining with De-

cision Trees. WORLD SCIENTIFIC, 2nd edition.

Schie, I. W., Kiselev, R., Krafft, C., and Popp, J. (2016).

Rapid acquisition of mean raman spectra of eukary-

otic cells for a robust single cell classiﬁcation. The

Analyst, 141(23):6387–6395.

Smith, R., Wright, K. L., and Ashton, L. (2016). Raman

spectroscopy: an evolving technique for live cell stud-

ies. Analyst, 141:3590–3600.

Sun, C., Lee, J. S., and Zhang, M. (2008). Magnetic

nanoparticles in mr imaging and drug delivery. Ad-

vanced Drug Delivery Reviews, 60(11):1252–1265.

Inorganic Nanoparticles in Drug Delivery.

Torre, L. A., Siegel, R. L., and Jemal, A. (2016). Lung

Cancer Statistics, pages 1–19. Springer International

Publishing, Cham.

van Dyk, D. A. and Meng, X.-L. (2001). The art of data

augmentation. Journal of Computational and Graph-

ical Statistics, 10(1):1–50.

Zhang, H., Chen, C., Gao, R., Yan, Z., Zhu, Z., Yang, B.,

Chen, C., Lv, X., Li, H., and Huang, Z. (2021). Rapid

identiﬁcation of cervical adenocarcinoma and cervical

squamous cell carcinoma tissue based on raman spec-

troscopy combined with multiple machine learning al-

gorithms. Photodiagnosis and Photodynamic Ther-

apy, 33:102104.

Zhang, L., Li, C., Peng, D., Yi, X., He, S., Liu, F., Zheng,

X., Huang, W. E., Zhao, L., and Huang, X. (2022).

Raman spectroscopy and machine learning for the

classiﬁcation of breast cancers. Spectrochimica Acta

Part A: Molecular and Biomolecular Spectroscopy,

264:120300.

Zhu, Q.-L., Jiang, Y.-X., Liu, J.-B., Liu, H., Sun, Q.,

Dai, Q., and Chen, X. (2008). Real-time ultra-

sound elastography: Its potential role in assessment of

breast lesions. Ultrasound in Medicine and Biology,

34(8):1232–1238.

DeLTA 2022 - 3rd International Conference on Deep Learning Theory and Applications