Real-time Covid-19 Detection based on Symptoms using Machine

Learning Models

Youssef Mellah, Zakaria Kaddari, Mohammed Ghaouth Belkasmi, Noureddine Rahmoun

and Toumi Bouchentouf

Mohammed First University, LARSA/SmartICT Laboratory, ENSAO, Oujda, Morocco

Keywords: COVID-19, SARS-Cov-2, PCR, X-Ray, Machine-Learning, World Health Organization.

Abstract: The rapid spread of coronavirus has led to the 2019 global coronavirus disease pandemic (COVID-19). In this

sense, people suspected of having this virus must know quickly if they are infected so that they can receive

appropriate treatment, isolate themselves and inform people with whom they have been in close contact. This

situation has attracted worldwide attention, developing exact methods for the identification and isolation of

patients infected with SARS-CoV-2. However, those methods (as PCR test, chest X-ray and CT scan images...

etc.) take time to get the exact result of the patient. In this paper, we propose different classifier methods using

Machine Learning for real-time COVID-19 detection based only on symptoms. These methods can achieve

an accuracy of up to 98.0%, evaluated on a DataSet based on significant COVID-19 symptoms according to

the World Health Organization (WHO).

INTRODUCTION

Coronavirus is a specific type of virus, which

depending on the inherited property, grows and

replicates. The severity of this virus is due to its

nature of spreading very quickly. Common signs of

disease are fever, cough, respiratory symptoms,

shortness of breath, and difficulty breathing. The

disease, in more severe cases, can cause kidney

failure, severe acute respiratory syndrome,

pneumonia, and even death.

Standard recommendations for limiting the spread

of this virus include covering the mouth and nose

when coughing and sneezing, washing hands

regularly, and taking a bib.

In this sense, and to limit the spread of the

pandemic, research is popping out to give guidance in

this domain. Authors in Reference (Shan, 2020) used

the deep learning-based segmentation method to

identify regions of interest in CT Images of lungs for

quantifying COVID-19.

Authors (Gozes, 2020) used 2D and 3D deep

learning techniques and obtained 99.6% accuracy in

the context of COVID-19. The work (Wang, 2020)

used a bidirectional Recurrent Neural Network with

and attentional mechanisms in the context of COVID-

19. Authors in the paper (Xu, 2020) come out with

86.7% accuracy on benchmark datasets using deep

learning in the context of coronavirus. In addition,

other researches use X-Ray images to detect COVID-

19. The literature is growing and many efforts are

underway in the scientific community to limit the

adverse effects of this virus.

From our side, in this paper, we propose to

incorporate Machine Learning models for real-time

COVID-19 detection based only on symptoms. We

applied these models to a DataSet built base on World

Health Organization (WHO) 1 reports. This dataset

was published on Kaggle2, containing 5,435 possible

combinations, each with seven significant symptoms

of COVID-19 according to the same organization. By

doing that, we achieve 98.0% as our best accuracy

reached by Artificial Neural Network.

SETTING UP MACHINE

LEARNING CLASSIFIERS

Classification is the way to predict the given data

class. Each method attempt to recognize a pattern that

best matches the relationship between data.

The objective of training is to have a predictive

model that predicts the correct class, especially on

data not seen during the training phase. The

206

Mellah, Y., Kaddari, Z., Belkasmi, M., Rahmoun, N. and Toumi, B.

Real-time Covid-19 Detection based on Symptoms using Machine Learning Models.

DOI: 10.5220/0010731100003101

In Proceedings of the 2nd International Conference on Big Data, Modelling and Machine Learning (BML 2021), pages 206-210

ISBN: 978-989-758-559-3

classification is a method of grouping a pattern from

the data in input. There are diverse algorithms for

solving a classification problem such as K-

neighbours classifier, decision tree classifier, and

gradient boosting classifier, without forgetting neural

networks, which have shown great success.

2.1

Random Forest Classifier

The Random forest (Culter, 2012) is a supervised

learning algorithm. The idea is to build a “forest” like

a set of decision trees, generally trained with

combining several models with the “bagging”

method, which helps in increasing the general

precision.

This algorithm can be used for both regression and

classification problems, which make it useful and

powerful.

2.2

Logistic Regression Classifier

Logistic regression (Wright, 1995) is a predictive

technique. It aims to build a model making it possible

to predict / explain the values taken by a qualitative

target variable (most often binary, we then speak of

binary logistic regression; if it has more than 2

modalities, we speak of polychromous logistic

regression) from a set of quantitative or qualitative

explanatory variables (coding is necessary in this

case).

2.3

Decision Tree Classifier

This decision support or data mining tool allows you

to represent a set of choices in the graphic form of a

tree. It is one of the most popular supervised learning

methods for data classification problems. Concretely,

a decision tree models a hierarchy of tests to predict a

result.

The possible decisions are located at the ends of

the branches (the "leaves" of the tree) and are reached

based on decisions made at each stage.

2.4

Naïve Bayes Classifier

The Naive Bayesian classification (Leung, 2007)

method is a supervised machine learning algorithm

that classifies a set of observations according to rules

determined by the algorithm itself. There is a theory

called Bayes, where the name of the algorithm comes

from.

This classification tool must first be trained on a

training dataset, which shows the expected class

according to the inputs during the learning phase, the

algorithm develops its classification rules on this data

set, in order to apply them secondly to the

classification of a prediction data set.

2.5

Support Vector Machine Classifier

The main idea behind the Support Vector Classifier

(Hearst, 1998) is to find a decision boundary with

maximum width that can classify both classes.

Maximum margin classifiers are extremely sensitive

to outliers in training data, which makes them quite

lame. Choosing a threshold that allows classification

errors is an example of the Bias-Variance tradeoff

that affects all machine learning algorithms.

2.6

K-Nearest Neighbours Classifier

In artificial intelligence, more precisely in machine

learning, the k nearest neighbours (Peterson, 2009)

method is a supervised learning method. In

abbreviated form k-NN or KNN, from English k-

nearest neighbors.

In this context, we have a training database made

up of N "input-output" pairs. To estimate the output

associated with a new input x, the k nearest neighbors

method consists of taking into account (identically)

the k training samples whose according to a defined

distance, input is closest to the new input x.

For example, in a classification problem, we will

retain the most represented class among the k outputs

associated with the k inputs closest to the new input

2.7

Gradient Boosting Classifier

Gradient Boosting (Friedman, 2002) classifiers are a

category of machine learning algorithms that

combine multiple learning models to create a stronger

one. Decision trees are generally used when

increasing gradients. Gradient enhancement models

are popular due to their efficiency in classifying

complex data sets and have recently been used to win

many Kaggle Data Science competitions.

2.8

Artificial Neural Network Classifier

First, the neural network is a concept. It's not

physical. The concept of Artificial Neural Networks

(ANN) (Wang, 2003) was inspired by biological

neurons. In a biological neural network, several

neurons work together, receive input signals, process

information, and trigger an output signal. The

artificial intelligence neural network is based on the

same model as the biological neural network.

Real-time Covid-19 Detection based on Symptoms using Machine Learning Models

207

Neural networks are trained with a multitude of

input data coupled with their respective output data.

They then calculate the output data, they compare it

to the known actual output data and constantly update

themselves to improve the results (if necessary).

There are several types of ANN, such as recurrent

neural network used in natural language processing,

convolutional neural network used in image

processing, and hybrid ones. Thy have been very

successful and outperform most other algorithms,

especially with new advancements like the attention

mechanism and the architecture of Transformers.

IMPLEMENTATION

3.1

Environment

To make all Machine Leaning Classifiers, we use

sklearn, Numpy, and Pandas. All packages are

managed through pip. We used Jupyter Notebook as

the code editor.

3.2

Steps for Setting up the Various

Machine Learning Classifiers

3.2.1 Dataset Preprocessing

The DataSet is already cleaned and there are no

missing values. Using StandardScaler, we scale

various feature value ranges in standard format.

3.2.2 Dataset Splitting

We split the DataSet into:



Training Set: 80% from the whole DataSet



Test Set: 20% from the whole DataSet

3.2.3 Make and Train the Various Models

We have set up all the models mentioned below, then

we trained and evaluated them in order to compare

them and see which classifier gives the best results.

The training is made using a model.fit() method.

3.2.4 Evaluate the Various Models

We evaluate our models on the Test Set to see the

score and the generalizability of them, by comparing

predicted and expected outputs. Using sklearn, we get

the predictions as follow:

Begin

Prediction=

model.predict(inputs_features)

END

Also, the accuracy, precision, and other scores are

calculated between the predicted and the expected

output of all the various MC classifier models by

employing classification reports from

sklearn.metrics. The evaluation is made by the

following snippet code:

Begin

print(classification_report(expected,

predicted);

END

RESULTS AND ANALYSIS

4.1

Evaluation and Results

The different statistical metric used for evaluating our

models are described as follow (see Table 1 for

obtained results):



Accuracy



Recall



Precision



F1 score



AUC

Table 1: Results of the evaluation of the various Models.

Models Accuracy F1 Precision Recall AUC

Artificial Neural Network 0.980 0.987 0.989 0.986 0.971

K-Neighbors Classifier 0.977 0.986 0.989 0.982 0.969

Gradient Boosting Classifier 0.977 0.986 0.984 0.987 0.961

Support Vector Machines 0.977 0.986 0.984 0.987 0.961

Random Forrest Classifier 0.974 0.984 0.985 0.982 0.961

Logistic Regression 0.962 0.977 0.968 0.986 0.925

Decision Tree Classifier 0.948 0.967 0.972 0.963 0.925

Naïve Bayes 0.759 0.824 1.000 0.701 0.850

BML 2021 - INTERNATIONAL CONFERENCE ON BIG DATA, MODELLING AND MACHINE LEARNING (BML’21)

208

Figure 1: Features Importance of the various Machine Learning Classifiers

Also, we are interested in features impotence,

which plays a vital role in a predictive modelling

project, Figure 1 shows the importance of the features

given by the various classifiers.

4.2

Analysis and Discussion

Artificial Neural Network (ANN) gave the highest

score of 0.980 in terms of precision metrics,

compared to all other classifiers.

The same thing for F1 and AUC, ANN provides

the highest score compared to other ML classifiers. In

terms of Precision, the Naïve Bayes Classifier reaches

the top one achieving a score of one.

While Gradient Boosting Classifier and the

Support

Vector Machines are given top scores on Recall

with a minimal difference (0.001) compared to ANN.

As shown in Figure 1, the majority of ML

classifiers do not take into account all features, except

Random Forest and Neural Network ones so that we

can consider this last as the more powerful one

compared to all models, in terms of accuracy as well

as features importance.

5 CONCLUSIONS

The exponential rate of spread of the COVID-19

pandemic requires the implementation of a solid and

effective strategy for the detection of patients affected

by the virus, especially in terms of detection time. In

this sense, we have implemented machine learning

models for real-time detection for more efficiency

and time savings. We have found that the model

Real-time Covid-19 Detection based on Symptoms using Machine Learning Models

209

based on artificial neural networks gives more

precision than other classical classifiers, reaching

98.0% on a DataSet based on symptoms. In addition,

it takes into consideration all the features (symptoms)

which make it the most optimal.

REFERENCES

Shan, F., Gao, Y., Wang, J., Shi, W., Shi, N., Han, M., ... &

Shi, Y. (2020). Lung infection quantification of

COVID-19 in CT images with deep learning. arXiv

preprint arXiv:2003.04655.

Gozes, O., Frid-Adar, M., Greenspan, H., Browning, P. D.,

Zhang, H., Ji, W., ... & Siegel, E. (2020). Rapid ai

development cycle for the coronavirus (covid-19)

pandemic: Initial results for automated detection &

patient monitoring using deep learning ct image

analysis. arXiv preprint arXiv:2003.05037.

Wang, Y., Hu, M., Li, Q., Zhang, X. P., Zhai, G., & Yao,

N. (2020). Abnormal respiratory patterns classifier may

contribute to large-scale screening of people infected

with COVID-19 in an accurate and unobtrusive

manner. arXiv preprint arXiv:2002.05534

Xu, X., Jiang, X., Ma, C., Du, P., Li, X., Lv, S., ... & Li, L.

(2020). A deep learning system to screen novel

coronavirus disease 2019 pneumonia. Engineering,

6(10), 1122-1129

Cutler, A., Cutler, D. R., & Stevens, J. R. (2012). Random

forests. In Ensemble machine learning (pp. 157-175).

Springer, Boston, MA.

Wright, R. E. (1995). Logistic regression.

Leung, K. M. (2007). Naive bayesian

classifier. Polytechnic University Department of

Computer Science/Finance and Risk

Engineering, 2007, 123-156.

Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., &

Scholkopf, B. (1998). Support vector machines. IEEE

Intelligent Systems and their applications, 13(4), 18-28.

Peterson, L. E. (2009). K-nearest

neighbor. Scholarpedia, 4(2), 1883.

Friedman, J. H. (2002). Stochastic gradient

boosting. Computational statistics & data

analysis, 38(4), 367-378.

Wang, S. C. (2003). Artificial neural network.

In Interdisciplinary computing in java

programming (pp. 81-100). Springer, Boston, MA.

BML 2021 - INTERNATIONAL CONFERENCE ON BIG DATA, MODELLING AND MACHINE LEARNING (BML’21)

210