Predicting the Malignant Breast Cancer using Tumor Tissue Features
Wenrui Zhao
College of Art and Science, the Ohio State University, Columbus, OH, 43210, U.S.A.
Keywords: Breast Cancer, Breast Cancer Datase, Feature Selection, FNA, Cancer Diagnosis.
Abstract: Breast cancer is one of the most common cancers in women and is the second leading cause of death after
lung cancer. In clinical diagnosis, fine needle aspiration cytology is often used in tumor diagnosis, considering
safety, accuracy, and ease of operation. Pathologists can judge whether the patient's tumor tissue is malignant
by observing the cell population. The accuracy of fine-needle biopsy largely depends on the doctors who
participate in sampling and analysis. Therefore, it is crucial to study which characteristics of cells can become
a solid basis for discrimination. This article constructs univariate and multivariate logistic regression models
to analyze the predictive value of 9 features of the cell to breast cancer. By evaluating the ROC curve, the
article shows that the constructed model accurately predicts malignant tumor tissue. The 9 characteristics of
FNA quantitative detection of tumor tissue are of great value in predicting malignant breast cancer.
1 INTRODUCTION
Breast cancer is one of the most common cancers in
women and is the second leading cause of death after
lung cancer (Nguyen,1970) (Mangasarian, 1990). In
2020, over 2.3 million women were diagnosed with
breast cancer worldwide, and 685 thousand died. Due
to population growth, aging, and the increasing
prevalence of known cancer risk factors (such as
smoking and unhealthy eating), WHO believes that if
the global incidence rate remains the same as in 2020,
there will be around 28.4 million new cancer cases
worldwide in 2040. Women in every country face the
risks of developing breast cancer at any age after
puberty, but the incidence rate will increase with age
growth (Piro,2021). Existing diagnostic techniques,
including nuclear magnetic resonance imaging,
ultrasound, CT (computer tomography) or PET
(positron emission tomography), are very effective in
tumor detection (World Health Organization, 2021)
However, when doctors find suspicious tumor tissue,
they still hope to obtain tissue samples for analysis.
Biopsy isan essential technique for the diagnosis of
cancer in the clinic. Because fine needle biopsy does
not need any preparation in advance, nor does it need
special dietary norms, fine-needle aspiration (FNA)
has become the preliminary diagnostic basis for
judging whether breast tissue is cancerous. A large
number of data show that although FNA has many
advantages, a few cases may be misdiagnosed.
Therefore, it is vital to study which characteristics of
cells can become a solid basis for discrimination.
From 1989 to 1991, Dr. Wolberg, Dr. Mangasarian
and two graduate students constructed a classifier
using the pattern separation multi-surface method
(MSM) for these nine features and successfully
diagnosed 97% of new cases (Nguyen,1970)
(Wolberg,1989). These led to the Wisconsin breast
cancer dataset. This article constructs univariate and
multivariate logistic regression models to analyze the
predictive value of 9 features of the cell to breast
cancer. This article used biometric methods for
exploratory data analysis to focus more narrowly on
checking the fitting degree of the model (Chatfield,
2021) By studying the different importance of the 9
features of cells, the article helps people establish a
more standard method to judge whether tumor tissue
is malignant.
2 ANALYSES
FNA uses a tiny needle tube of about 20-27G (similar
to or smaller than the needle tube for regular blood
testing. Generally, the larger the number of G, the
smaller the needle tube) (CancerQuest,2021). Due to
the small amount of tissue and its cellular
components collected, pathologists will pay more
attention to the observation of cell populations. The
study used the Wisconsin Breast Cancer Dataset
204
Zhao, W.
Predicting the Malignant Breast Cancer using Tumor Tissue Features.
DOI: 10.5220/0011196000003443
In Proceedings of the 4th International Conference on Biomedical Engineering and Bioinformatics (ICBEB 2022), pages 204-211
ISBN: 978-989-758-595-1
Copyright
c
2022 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
(WBCD) for women in the UCI machine learning
dataset (Nguyen, 1970) (Wolberg, 1989) (Taylor,
2021) (Tukey, 1992) The dataset contains 699
records. It includes nine features in this dataset.
Depending on the values of features, benign and
malignant masses can be distinguished. This dataset
has 16 missing values, this article discarded these
missing values in our experiment, and we considered
the remaining 683 records. 444 records belong to the
benign category from the cleaned data set, and the
remaining 239 records belong to the malignant
category. An important step in the breast cancer
diagnosis model is feature extraction. The optimum
feature set should have adequate and discriminating
features while mainly reducing the redundancy of
components space to avoid low sampling density.
This dataset provides nine crucial features of the cell
population, which are clump thickness, uniformity of
cell size, uniformity of cell shape, marginal adhesion,
single epithelial cell size, bare nuclei, bland
chromatin, normal nucleoli, and mitoses.
Exploratory data analysis begins with the
establishment of logistic regression models and
random forest. By observing the performance of the
model in each category through the visual confusion
matrix, the accuracy and recall of the model
corresponding to each class can be calculated.
Because judging the diagnosis result is a binary
classification problem, the ROC curve can analyze
the coordinates well. An essential feature of the ROC
curve is its area. An area of 0.5 is random
classification, and the recognition ability is 0. The
closer the area is to 1, the stronger the recognition
ability is, and an area equal to 1 is complete
recognition.
3 RESULTS
3.1
Exploratory Data Analysis
The diagnosis of malignant tumors belongs to binary
classification problems. In other words, for a patient,
her tumor result is either malignant or not. Therefore,
according to the data type, the results can be divided
into benign and malignant.
Exploratory data analysis (EDA) is an approach
for data analysis that employs a table to maximize
insight into the dataset and extract important
variables. Through EDA, figure 1 was derived. In the
table, the lower the p-value, the greater the statistical
significance of the observed difference. The result
shows that the nine characteristics of the cells we
observed are statistically significantly correlated with
the diagnostic results.
In addition, in 241 patients with malignant
tumors, the values of 9 characteristics of cells were
significantly higher than those in patients with benign
tumors, and the p-value of the statistical test was less
than 0.05 (< 0.0001), ndicating that these nine
variables are adverse predictors of malignant tumors.
Figure 1: Baseline characteristics of the study population
Correlation analysis is necessary before logistic
regression analysis. This article further discussed the
correlation between these nine variables and derived
Figure 2. It is found that these variables are positively
correlated, which is consistent with the information
prompted in Figure 1. Among them, judging by the
Predicting the Malignant Breast Cancer using Tumor Tissue Features
205
color depth of the chart, the uniformity of cell size
and cell shape has the strongest correlation, which is
closest to 1. Through correlation analysis, it is
suggested that all 9 features have good consistency in
describing the morphology of tumor cells.
Figure 2: Correlation between 9 Features.
3.2 Building Logistic Regression
Models
Because the number of cases given by the database is
not large, only nine factors appear in about 600
diagnoses. Therefore, if all elements are analyzed in
the equation, the results may be problematic
(Mangasarian,1990). In this case, univariate analysis
can help to screen out significant variables and then
put these variables into the equation for multivariate
analysis.
In the univariate logistic regression model, it can
be observed that every single factor is correlated with
tumor categories. For example, the results in clump
thickness (OR:2.55, 95%CI: [2.25,2.97]) indicates
that when the clump thickness of cells increases by
one unit, the probability that the tissue is malignant
increases by 1.55 times. All else follows: uniformity
of cell size (OR:4.43, 95%CI: [3.54,/70]), uniformity
of cell shape (OR:4.08, 95%CI: [3.32,5.17]),
marginal adhesion (OR:2.62, 95%CI: [2.26,3.11]),
single epithelial cell size (OR:3.89, 95%CI:
[3.17,4.89]), bare nuclei (OR:2.18, 95%CI:
[1.96,2.48]), bland chromatin (OR:3.75, 95%CI:
[3.06,4.74]), normal nucleoli (OR:2.36, 95%CI:
[2.06,2.76]), and mitoses (OR:3.84, 95%CI:
[2.78,5.62]).
ICBEB 2022 - The International Conference on Biomedical Engineering and Bioinformatics
206
Figure 3: Uni-variate Logistic Regression Model.
In multivariate analysis, only statistically
significant independent variables in the univariate
analysis may lead to some influencing factors not
having the opportunity to enter the multivariate
model. However, for the nine features studied in this
article, the P-value is far less than 0.05, so it is
feasible to incorporate all these factors into the
multivariable logistic regression model.
In this multivariable logistic regression model, for
example, when the other eight variables remain
constant, if the clump thickness of cells increases by
one unit, the probability of malignant tissue increases
by 0.71 times. All else follows: uniformity of cell size
(OR:100, 95%CI: [0.71,1.48]), uniformity of cell
shape (OR:1.41, 95%CI: [0.93, 2.11]), marginal
adhesion (OR:1.27, 95%CI: [1.02,1.60]), single
epithelial cell size (OR:1.07, 95%CI: [0.80,1.43]),
bare nuclei (OR:1.45, 95%CI: [1.24, 1.74]), bland
chromatin (OR:1.54, 95%CI: [1.15, 2.11]), normal
nucleoli (OR:1.15, 95%CI: [0.94, 1.41]), and mitoses
(OR:1.73, 95%CI: [1.02, 2.94]).
Figure 4: Multi-variate Logistic Regression Model.
3.3 Analysis from Confusion Matrix
In the confusion matrix, the amount of data is 698, in
which the row represents the prediction category of
the data and the column represents the real category.
There are 675 cases whose predicted category is
consistent with the real category, accounting for
96.7% of all cases. Therefore, the matrix indicates
that the prediction can correctly classify 96.7% of the
samples.
The model correctly predicts 443 data in the
benign category, and 232 data in the malignant
category. There are 14 data which the prediction is
benign, but the real category is malignant, and 9 data
which the prediction is malignant, but the real
category is benign. This shows that although FNC has
many advantages, there are still a few cases that may
be misdiagnosed. In this sample, the false negative is
greater than the false positive.
Variable OR 95%CI
fit1:Clump Thickness 2.55 [2.25,2.97]
fit2:Uniformity of cell size 4.43 [3.54,5.70]
fit3:Uniformity of Cell Shape 4.08 [3.32,5.17]
fit4:Marginal Adhesion 2.62 [2.26,3.11]
fit5:Single Epithelial Cell Size 3.89 [3.17,4.89]
fit6:Bare Nuclei 2.18 [1.96,2.48]
fit7:Bland Chromatin 3.75 [3.06,4.74]
fit8:Normal Nucleoli 2.36 [2.06,2.76]
fit9:Mitoses 3.84 [2.78,5.62]
Univariable Logistic regression analysis predicting Breast Cancer
Variable OR 95%CI
fit1:Clump Thickness 1.71 [1.35,2.26]
fit2:Uniformity of cell size 1.00 [0.71,1.48]
fit3:Uniformity of Cell Shape 1.41 [0.93,2.11]
fit4:Marginal Adhesion 1.27 [1.02,1.60]
fit5:Single Epithelial Cell Size 1.07 [0.80,1.43]
fit6:Bare Nuclei 1.45 [1.24,1.74]
fit7:Bland Chromatin 1.54 [1.15,2.11]
fit8:Normal Nucleoli 1.15 [0.94,1.41]
fit9:Mitoses 1.73 [1.02,2.94]
Multivariable Logistic regression analysis predicting Breast Cancer
Predicting the Malignant Breast Cancer using Tumor Tissue Features
207
Figure 5: Confusion Matrix for Multi-variable Regress Model.
3.4 Analysis from Evaluation Matrix
The accuracy rate represents the proportion of
correctly predicted samples in all samples, which is
0.9670487. The accuracy rate of the benign class
represents the proportion of samples whose real
category is benign among the samples predicted as
benign, which is 0.9541667; The recall rate of the
benign class represents the proportion of samples
successfully predicted by the model in the real benign
samples, which is 0. 7502075. The F-measure is the
harmonic mean of precision and recall. In most
situations, there would be a trade-off between
precision and recall. Since the F-Measure is
0.952183, which is pretty close to 1, the regression
analysis will give out both high recall and high
precision.
Figure 6: Evaluation Matrix.
3.5 Confusion Matrices’ Comparison
between the Multivariable Logistic
Regression Model and Random
Forest
The matrix for random forest correctly predicts 443
data in the benign category and 234 data in the
malignant category. There are 14 data which the
prediction is benign, but the real category is
malignant, and 7 data which the prediction is
malignant, but the real category is benign. There are
677 cases whose predicted category is consistent with
the real category, accounting for 96.9% of all cases.
Benign Malignant
Benign
443 14
Malignant
9 232
Actual Class
Predicted Class
Accuracy 0.9670487
Precision 0.9541667
Recall 0.9502075
F_Measure 0.952183
ICBEB 2022 - The International Conference on Biomedical Engineering and Bioinformatics
208
Since the confusion matrix for the random forest
is not significantly different from the multivariable
logistic regression model in Figure 5, considering the
interpretability of the multivariable regression model,
it would be better to use the multivariable logistic
regression model and evaluate its ROC curve.
Figure 7: Confusion Matrix for Random Forest.
3.6 Analysis of ROC Curve
The closer the ROC curve is to the upper left corner
(0,1) model on the image, the better, the larger the
area (AUC value) surrounded by the horizontal axis
and straight-line FPR = 1 under the ROC curve, the
better.
The area under the ROC curve, that is, the AUC,
is 0.995, which shows that the prediction effect of this
model for malignant tumor tissue is very accurate.
When setting the cut-off of the model output to 0.195,
the model can achieve its best outcome. Meanwhile,
the sensitivity of the model is 0.992, and its
specificity is 0.967.
Figure 8: ROC Curve.
3.7 Analysis of Random Forest
The Mean Decrease Accuracy plot expresses how
much accuracy the model losses by excluding each
variable. Bare Nuclei and clump thickness are
influential factors from the first diagram, which
means that these two variables are more important for
successful classification than the others. Whereas
under different standards, uniformity of cell size and
uniformity of cell shape become contributing factors
to the homogeneity of the nodes and leaves in the
resulting random forest.
Predicting the Malignant Breast Cancer using Tumor Tissue Features
209
Figure 9: Random Forest.
4 CONCLUSIONS
FNA is simple, fast and safe. Now it has become one
of the critical diagnostic methods of clinical diseases.
However, due to the small amount of tissue and its
cellular components, the tissue morphology and
interstitial structure in the specimen are mostly or
entirely lost, which cannot reflect the overall picture
of the type of lesion and interfere with the observation
of cell characteristics. Therefore, quantitative
detection of the 9 features of tumor tissues using FNA
technology is of great value in predicting malignant
breast cancer. Among them, clump thickness,
mitoses, and bland chromatin are of high predictive
value.
A larger sample size collected from over the
world or some certain nations instead of a state would
be always expected. It could improve the accuracy
and generality of the study.
A lack of personal information (i.e. age, family
history of breast cancer) may impede the progress
of generalizing the conclusion. This information
would be helpful when applying our study to a border
sample, or in practical conditions.
REFERENCES
Biopsy. CancerQuest. (n.d.). Retrieved November 6, 2021,
from https://www.cancerquest.org/zh-
hans/geihuanzhe/jianceyuzhenduan/huotizuzhijiancha
zuzhi.
Chatfield, C. (n.d.). Problem solving: A statistician's guide,
second edition. in Search Works catalog. Retrieved
October 7, 2021.
Nguyen, Q. H., Do, T. T. T., Wang, Y., Heng, S. S., Chen,
K., Ang, W. H. M., Philip, C. E., Singh, M., Pham, H.
N., Nguyen, B. P., & Chua, M. C. H. (1970, January 1).
Breast cancer prediction using feature selection and
ensemble voting: Semantic scholar. undefined.
O. Mangasarian W. Wolberg, O. Mangasarian
Multisurface method of pattern separation for medical
diagnosis applied to breast cytology. Proceedings of the
National Academy of Sciences of the United States of
America.
Piro, M., Bona, R. D., Abbate, A., Biasucci, L. M., & Crea,
F. (2010, March 1). Sex-related differences in
myocardial remodeling: Journal of the American
College of Cardiology. Retrieved September 30, 2021,
from
https://www.jacc.org/doi/abs/10.1016/j.jacc.2009.09.0
65.
Robust linear programming discrimination of two linearly
inseparable sets. Taylor & Francis. (n.d.). Retrieved
October 1, 2021, from
https://www.tandfonline.com/doi/abs/10.1080/105567
89208805504.
ICBEB 2022 - The International Conference on Biomedical Engineering and Bioinformatics
210
Tukey J.W. (1992) The Future of Data Analysis. In: Kotz
S., Johnson N.L. (eds) Breakthroughs in Statistics.
Springer Series in Statistics (Perspectives in Statistics).
Springer, New York, NY.
Wolberg, W. H., Mangasarian, O. L., & Setiono, R. (1989,
January 1). Pattern recognition via Linear
Programming: Theory and application to medical
diagnosis. MINDS@UW Home. Retrieved October 1,
2021.
World Health Organization. (n.d.). Breast cancer. World
Health Organization. Retrieved November 6, 2021,
from https://www.who.int/news-room/fact-
sheets/detail/breast-cancer.
Predicting the Malignant Breast Cancer using Tumor Tissue Features
211