MLP-Based Lung Cancer Prediction and Feature Importance Evaluation

Zonglin Jiang

Computer Science, Arizona State University, Tempe, U.S.A.

Keywords: Machine Learning, Lung Cancer Prediction, MLP.

Abstract: Lung cancer remains one of the deadliest cancers globally, with high mortality rates due to the challenges of

early detection. Traditional diagnostic methods, such as CT scans and biopsies, have limitations, including

the risk of human error and patient discomfort. With the advent of machine learning (ML) technologies, early

detection has improved significantly. This paper investigates the importance of features in lung cancer

prediction using a Random Forest model and a Multilayer Perceptron (MLP) model. The dataset used consists

of 309 clinical samples and 15 features, with binary classification into cancerous and non-cancerous cases.

After data preprocessing, the models were trained and evaluated to assess the contribution of different features.

Age, Allergy, and Swallowing Difficulty were found to be the most important features in both models. The

study highlights the impact of dataset imbalance on feature importance and model performance. Future work

will focus on addressing this imbalance to improve prediction accuracy and reliability in clinical applications.

1 INTRODUCTION

Cancer is well known for being painful and difficult to cure, and lung cancer is the leading cause of cancer-related deaths. According to the World Health Organization (WHO), lung cancer caused 1.8 million deaths in 2020. It is not only hard to cure but also hard to detect (World Health Organization, 2023). Early-stage lung cancer is mostly asymptomatic; by the time symptoms appear, the cancer has likely progressed to later stages with limited treatment options. Therefore, screening high-risk individuals and achieving early detection are important for widening treatment options and raising survival rates.

However, traditional methods of diagnosing lung cancer, such as chest X-rays, Computed Tomography (CT) scans, and biopsies, have several limitations. CT scans rely on the expertise of radiologists, which introduces the possibility of human error and subjective judgment. These methods can identify visible abnormalities, but when a lesion is very small it is hard for doctors to tell whether it is cancer, inflammation, or something else before a tissue diagnosis, with only a few exceptions (Connolly et al., 2003). Biopsy, in turn, increases patient discomfort and the risk of infection. Therefore, a method that reduces human error and increases accuracy is required.

Author ORCID: https://orcid.org/0009-0009-0296-7614

In recent years, the rise of artificial intelligence (AI) and machine learning (ML) has provided a new opportunity to improve diagnostic methods (Kononenko, 2001; Erickson, 2017; Giger, 2018). These technologies use large datasets to automatically analyze medical images and clinical data and to recognize patterns and abnormalities that may indicate cancer. ML models such as neural networks, support vector machines, and random forests are used to recognize lung cancer and have made substantial progress in accuracy and comprehensibility (Pacurari et al., 2023). Convolutional neural networks (CNNs), for example, have achieved significant improvements in early detection and diagnostic accuracy (Javed et al., 2024). Such models may detect early-stage lung cancer from patterns that are hard for humans to recognize quickly, which opens a path to early intervention.

However, one of the main challenges in developing effective ML-based lung cancer detection is choosing relevant features from complicated, high-dimensional datasets. Feature selection is vital for model performance because irrelevant or redundant features introduce noise, which can cause overfitting and lower the model's accuracy. In addition, different types of data, such as imaging data, clinical records, and genomic information, are difficult to integrate and analyze together. A successful model must not only recognize cancer accurately but also generalize across different patient groups and clinical environments. This research examines the importance of different lung cancer dataset features for a Multilayer Perceptron (MLP). A Random Forest model and an MLP model were built for lung cancer prediction on a clinical dataset, and each feature is evaluated by its contribution to the accuracy of the diagnosis.

Jiang, Z.

MLP-Based Lung Cancer Prediction and Feature Importance Evaluation.

DOI: 10.5220/0013329900004558

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 1st International Conference on Modern Logistics and Supply Chain Management (MLSCM 2024), pages 320-323

ISBN: 978-989-758-738-2

2 METHODS

2.1 Dataset Preparation

The dataset used in this study to evaluate feature importance is Lung Cancer (Aswad, 2022). It contains 309 samples and 15 features. The task is binary classification: each sample is labeled as either having lung cancer or not having lung cancer. During preprocessing, feature values coded as 1 and 2 were recoded to 0 and 1, which is easier to interpret. The Gender feature, coded as M and F, was one-hot encoded into two features, Gender_F and Gender_M, so that female and male are each represented by a 0/1 indicator. Because the dataset has no missing values or missing features, no preprocessing was needed for those cases. Finally, the train_test_split method was used to split the dataset into an 80% training set and a 20% test set.
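The preprocessing steps in Section 2.1 can be sketched roughly as follows. This is a minimal illustration, not the code used in the study: the small table, its column names (GENDER, SMOKING, ALLERGY, LUNG_CANCER), the YES/NO label coding, and the random_state value are all assumptions made for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in table; column names and values are hypothetical.
df = pd.DataFrame({
    "GENDER": ["M", "F", "F", "M", "F", "M", "M", "F", "M", "F"],
    "AGE": [69, 74, 59, 63, 63, 75, 52, 51, 68, 53],
    "SMOKING": [1, 2, 1, 2, 1, 1, 2, 2, 2, 2],
    "ALLERGY": [1, 2, 1, 1, 1, 2, 1, 2, 1, 2],
    "LUNG_CANCER": ["YES", "YES", "NO", "NO", "NO", "YES", "YES", "YES", "NO", "YES"],
})

def preprocess(frame):
    """Recode 1/2 features to 0/1, one-hot encode gender, and binarize the label."""
    out = frame.copy()
    binary_cols = ["SMOKING", "ALLERGY"]  # features coded as 1/2 in this toy table
    out[binary_cols] = out[binary_cols] - 1
    # Produces the two indicator columns Gender_F and Gender_M.
    out = pd.get_dummies(out, columns=["GENDER"], prefix="Gender", dtype=int)
    # Assumed label coding: YES -> 1 (cancer), NO -> 0 (no cancer).
    out["LUNG_CANCER"] = (out["LUNG_CANCER"] == "YES").astype(int)
    return out

clean = preprocess(df)
X = clean.drop(columns=["LUNG_CANCER"])
y = clean["LUNG_CANCER"]

# 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

With a class distribution as skewed as the one reported for this dataset, passing stratify=y to train_test_split would keep the class ratio similar in both splits; the paper does not say whether this was done.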

2.2 Random Forest

The first model used in this paper to assess feature importance is a random forest. Random forest is a widely used machine learning algorithm for classification and regression tasks. As an ensemble learning method, it builds multiple decision trees during training and obtains the final prediction by voting over, or averaging, the outputs of these trees. The main reason for choosing random forest as the feature importance evaluation model is that it can measure feature importance by calculating the contribution of each feature to the reduction of impurity. When splitting a node, each tree chooses the feature that minimizes impurity, which allows the model to evaluate automatically which features matter most for the prediction task. Most hyperparameters in this paper were determined by random search cross-validation: RandomizedSearchCV samples a number of random combinations from the hyperparameter space and evaluates each by cross-validation to find the best configuration. For n_estimators, however, this paper uses a loop instead, stepping through values between 1 and 200 in increments of 10 and retraining the model at each value to determine the optimal number of trees. In this way, the paper finds a hyperparameter configuration suited to the dataset while maintaining model performance, so that feature importance can be evaluated more accurately.
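One way to realize the procedure in Section 2.2 with scikit-learn is sketched below. It is an illustration under stated assumptions rather than the study's code: the data is synthetic (generated with make_classification to match the 309 x 15 shape), the search space, n_iter, and cv values are hypothetical, and the n_estimators sweep is read as the values 1, 11, ..., 191 scored by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score, train_test_split

# Synthetic stand-in with the same shape as the dataset (309 samples, 15 features).
X, y = make_classification(n_samples=309, n_features=15, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 1: random search over a hypothetical space for everything except n_estimators.
param_distributions = {
    "max_depth": [None, 3, 5, 10],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Step 2: sweep n_estimators from 1 to 200 in steps of 10 and keep the best value.
best_n, best_score = None, -1.0
for n in range(1, 201, 10):
    model = RandomForestClassifier(n_estimators=n, random_state=0, **search.best_params_)
    score = cross_val_score(model, X_train, y_train, cv=3).mean()
    if score > best_score:
        best_n, best_score = n, score

# Step 3: refit with the chosen settings and read the impurity-based importances.
final = RandomForestClassifier(n_estimators=best_n, random_state=0, **search.best_params_)
final.fit(X_train, y_train)
importances = final.feature_importances_   # one score per feature, normalized to sum to 1
ranking = importances.argsort()[::-1]      # feature indices, most important first
```

The feature_importances_ attribute holds the mean decrease in impurity per feature, which is the quantity the paragraph above describes.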

2.3 MLP Model

The other model used in this study is the MLP (Taud & Mas, 2018; Pinkus, 1999). MLP is a neural network model that is widely used in classification and regression tasks. The main idea is to process the input data layer by layer through multiple hidden layers of neurons and finally output the predicted result. An MLP consists of an input layer, one or more hidden layers, and an output layer, with the neurons in each layer applying nonlinear transformations via activation functions to capture complex patterns in the data. In this paper, the proposed MLP architecture has two hidden layers with 64 and 32 neurons, respectively. To keep training time manageable, the maximum number of iterations is set to 1500. The activation function is the Rectified Linear Unit (ReLU), which effectively mitigates the vanishing gradient problem in deep networks. For the optimizer, this paper explores different learning rates, regularization parameters, and optimization algorithms using random search cross-validation. Using RandomizedSearchCV, the best model for the dataset and task was found in a search space containing 100 different parameter combinations. This method helps stabilize the model's generalization performance and improves its prediction accuracy. The resulting MLP model performs well on the test set, which indicates its effectiveness on this classification task. The classification report, accuracy, and Receiver Operating Characteristic (ROC) curve of the model further support its reliability and predictive ability. The study then evaluates features in the MLP model in the order given by the Random Forest feature importance: the importance of each feature to the MLP model is calculated by comparing its contribution to the accuracy.

Figure 1: Accuracy and ROC of the MLP model after removing the corresponding features (Photo/Picture credit: Original).

Figure 2: Feature importance in the Random Forest model (Photo/Picture credit: Original).
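The MLP configuration described in Section 2.3 can be sketched with scikit-learn's MLPClassifier as below. Only the two hidden layers (64 and 32 neurons), the ReLU activation, and the 1500-iteration cap come from the text; the synthetic data, the candidate values in the search space, and the small n_iter and cv settings (chosen to keep the example fast) are assumptions. Reporting ROC AUC as the scalar "ROC" value is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in with the same shape as the dataset (309 samples, 15 features).
X, y = make_classification(n_samples=309, n_features=15, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Architecture from the text: hidden layers of 64 and 32 neurons, ReLU, max 1500 iterations.
mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation="relu",
    max_iter=1500,
    random_state=0,
)

# Hypothetical search space over optimizer, learning rate, and L2 regularization.
param_distributions = {
    "solver": ["adam", "sgd"],
    "learning_rate_init": [0.0001, 0.001, 0.01],
    "alpha": [0.0001, 0.001, 0.01],
}
# n_iter is kept small here; the paper reports searching 100 combinations.
search = RandomizedSearchCV(mlp, param_distributions, n_iter=5, cv=3, random_state=0)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
best = search.best_estimator_
acc = accuracy_score(y_test, best.predict(X_test))
auc = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])
```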


3 RESULTS AND DISCUSSION

This study evaluated the importance of features in a lung cancer dataset using Random Forest and Multilayer Perceptron (MLP) models. The dataset comprises 309 samples and 15 features. These features were used to better identify factors related to lung cancer and to improve the model's predictive accuracy. The results are provided in Figure 1 and Figure 2.

3.1 Feature Importance Evaluation

Using the Random Forest model, this study calculated the contribution of each feature to the model’s performance. The feature importance results are shown in Figure 2. Age (AGE) was identified as the most significant feature, with an importance score of 0.1955, highlighting its crucial role in lung cancer prediction. The next most important features were Allergy and Swallowing Difficulty, with importance scores of 0.1136 and 0.0893, respectively, suggesting a strong association with lung cancer in this dataset.

For the MLP model, this study assessed the contribution of each feature to the model’s accuracy, guided by the feature importance ranking from the Random Forest model. Detailed results for the MLP model are presented in Figure 1. Removing the Age feature decreased model accuracy to 0.94 (ROC of 0.90), while removing the Allergy feature decreased accuracy to 0.96 (ROC of 0.93).
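The feature-removal evaluation behind these numbers can be illustrated with a drop-one-feature loop. The paper does not spell out its exact procedure, so the sketch below is one plausible reading: rank features with a Random Forest, then retrain the MLP once per removed feature and record accuracy and ROC AUC. The synthetic data, the feature names, and the use of ROC AUC for "ROC" are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small synthetic stand-in; feature names are hypothetical.
X, y = make_classification(n_samples=309, n_features=6, n_informative=3, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def fit_and_score(cols):
    """Train the MLP on the given column indices; return (accuracy, ROC AUC) on the test set."""
    model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1500, random_state=0)
    model.fit(X_train[:, cols], y_train)
    acc = accuracy_score(y_test, model.predict(X_test[:, cols]))
    auc = roc_auc_score(y_test, model.predict_proba(X_test[:, cols])[:, 1])
    return acc, auc

# Rank features by Random Forest importance, most important first.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
order = np.argsort(rf.feature_importances_)[::-1]

all_cols = list(range(X.shape[1]))
baseline = fit_and_score(all_cols)  # reference scores with every feature present

# Remove one feature at a time, in ranked order, and rescore the MLP.
ablation = {}
for i in order:
    kept = [c for c in all_cols if c != i]
    ablation[names[i]] = fit_and_score(kept)
```

Comparing each entry of ablation against baseline gives the drop in accuracy attributable to the removed feature, which is what a plot like Figure 1 would display.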

In summary, both Random Forest and MLP

models highlight the importance of features such as

Age, Allergy, and Swallowing Difficulty in lung

cancer prediction. The identification and weighting of

these features are crucial for enhancing early

detection accuracy. These findings help optimize

model performance in practical applications and

provide a foundation for further research and feature

selection strategies.

Although both the Random Forest and MLP models rank Age, Allergy, and Swallowing Difficulty as the most important features in lung cancer prediction, it is important to consider the impact of dataset imbalance on the evaluation of feature importance.

Additionally, features with lower importance scores, such as Fatigue and Wheezing, might still have clinical significance in specific patient groups or disease stages. Therefore, it is important for future research to address dataset imbalance through techniques such as oversampling and undersampling. These methods can provide a more balanced view of feature importance and improve the model's overall performance.
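Random oversampling and undersampling, the two techniques named above, can be sketched with sklearn.utils.resample. The class counts mirror the imbalance reported in the conclusions (39 non-cancerous samples out of 309), but the feature values are placeholders and the helper name rebalance is hypothetical. This was not part of the study; it illustrates the proposed future work.

```python
import numpy as np
from sklearn.utils import resample

# Placeholder features with labels mirroring the reported imbalance:
# 270 cancerous (1) and 39 non-cancerous (0) samples.
X = np.arange(309 * 2, dtype=float).reshape(309, 2)
y = np.array([1] * 270 + [0] * 39)

def rebalance(X, y, n_per_class, random_state=0):
    """Resample every class to exactly n_per_class rows."""
    X_parts, y_parts = [], []
    for cls in np.unique(y):
        X_cls = X[y == cls]
        # Draw with replacement only when the class has to grow.
        X_new = resample(
            X_cls,
            replace=len(X_cls) < n_per_class,
            n_samples=n_per_class,
            random_state=random_state,
        )
        X_parts.append(X_new)
        y_parts.append(np.full(n_per_class, cls))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Oversampling: grow the minority class to the majority size.
X_over, y_over = rebalance(X, y, n_per_class=270)

# Undersampling: shrink the majority class to the minority size.
X_under, y_under = rebalance(X, y, n_per_class=39)
```

In practice, resampling would be applied to the training split only, so that the test set keeps the original class distribution.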

4 CONCLUSIONS

This study evaluated the importance of features for lung cancer prediction in an MLP model, using a Random Forest model to rank them. Features such as Age, Allergy, and Swallowing Difficulty are important for the diagnostic process. However, the significant imbalance in this dataset, with only 39 non-cancerous samples, may affect the accuracy of the feature importance evaluations. Future research should focus on addressing dataset imbalance with advanced sampling techniques and validating the findings on larger, more balanced datasets. Overcoming these challenges will enhance the accuracy of predictive models, improve early detection strategies for lung cancer, and benefit patient outcomes.

REFERENCES

Connolly, J. L., et al. 2003. Role of the surgical pathologist

in the diagnosis and management of the cancer patient.

Holland-Frei Cancer Medicine - NCBI Bookshelf.

Erickson, B. J., Korfiatis, P., Akkus, Z., & Kline, T. L. 2017.

Machine learning for medical imaging. Radiographics,

37(2), 505-515.

Giger, M. L. 2018. Machine learning in medical

imaging. Journal of the American College of

Radiology, 15(3), 512-520.

Javed, R., Abbas, T., Khan, A. H., Daud, A., Bukhari, A.,

& Alharbey, R. 2024. Deep learning for lungs cancer

detection: A review. Artificial Intelligence Review,

57(8).

Kononenko, I. 2001. Machine learning for medical

diagnosis: history, state of the art and perspective.

Artificial Intelligence in medicine, 23(1), 89-109.

Lung cancer. 2022. Kaggle. https://www.kaggle.com/datasets/nancyalaswad90/lung-cancer/data

Pacurari, A. C., et al. 2023. Diagnostic Accuracy of

Machine Learning AI architectures in detection and

Classification of lung Cancer: A Systematic review.

Diagnostics, 13(13), 2145.

Pinkus, A. 1999. Approximation theory of the MLP model

in neural networks. Acta numerica, 8, 143-195.

Taud, H., & Mas, J. F. 2018. Multilayer perceptron

(MLP). Geomatic approaches for modeling land change

scenarios, 451-455.

World Health Organization. 2023. Lung cancer. Retrieved September 4, 2024, from https://www.who.int/news-room/fact-sheets/detail/lung-cancer
