Real‑Time and Explainable Hybrid Machine Learning Framework
for Multivariate Prediction and Classification of Water
Contamination Using Environmental and Temporal Features
Er. Prafull Kothari¹, S. Sagaya Mary², Jyotsna Pandit³, D. B. K. Kamesh⁴, Priyadharshini K.⁵ and Syed Hauider Abbas⁶
¹ Civil Engineering, Institute of Engineering and Technology, Mohanlal Sukhadia University, Udaipur (Raj.) 313001, Rajasthan, India
² Department of Electronics and Communication Engineering, J.J. College of Engineering and Technology, Tiruchirappalli, Tamil Nadu, India
³ Department of Chemistry, Manav Rachna University, Sector 43, Delhi-Surajkund Road, Faridabad, Haryana 121004, India
⁴ Department of Computer Science and Engineering, MLR Institute of Technology, Hyderabad 500043, Telangana, India
⁵ Department of ECE, New Prince Shri Bhavani College of Engineering and Technology, Chennai, Tamil Nadu, India
⁶ Department of Computer Science & Engineering, Integral University, Lucknow, Uttar Pradesh, India
Keywords: Water Contamination, Hybrid Machine Learning, Explainable AI, Environmental Features, Real‑Time
Prediction.
Abstract: Accurate prediction and categorization of water contamination is important for protecting public health and promoting sustainable water resources. In this research, we introduce a real-time, explainable hybrid machine learning approach that combines deep learning with classical classifiers and exploits a broad variety of environmental and temporal features to predict and categorize the pollution level of water. Whereas other models are based on small datasets or lack multimodal or reasoning capabilities, the proposed approach uses a combined CNN–LSTM–XGBoost model with an explainability layer based on SHAP values to provide interpretable decisions. The model draws on chemical parameters and contextual factors, including meteorological data, seasonal patterns, and human activities, to increase accuracy and generalizability across diverse water bodies. Real-time sensor integration and uncertainty quantification enhance the model's ability to provide timely decision support and accurate alerts regarding contamination risk. This article presents a solution for next-generation water quality intelligence systems that is robust, scalable, and interpretable.
1 INTRODUCTION
Clean and safe water is a basic requirement; however, urbanization, industrialization and climate variation have caused widespread water contamination worldwide. Conventional water monitoring methods are based on manually collecting water samples, followed by laboratory analysis that is complicated, time-consuming, costly and non-continuous. With the development of sensor technologies and environmental databases, machine-learning-based predictive analytics has changed the way water quality is monitored.
However, the vast majority of existing machine learning models concentrate only on chemical parameters and are trained on region-specific data, and are therefore not generalizable. Moreover, these models can be black boxes: the process by which they make predictions is not intuitive, which can make it hard for environmental agencies to comprehend and trust the outcomes. In practice, decision makers also require reasoning support that not only predicts what will occur but also explains why.
In this paper, a new hybrid ML model that is interpretable, operates in real time and provides timely results is proposed for water contamination classification and prediction by combining deep neural architectures with conventional ones. Through the integration of abundant environmental and meteorological data as well as temporal features, the proposed model not only achieves better performance but also provides interpretability through explainable AI components. We bridge the worlds of data science and environmental monitoring by providing a scalable, smart solution designed to work off the shelf in a wide variety of aquatic environments. In this way, it offers the potential for proactive water-resource management, allowing contamination threats to be managed before they happen and serving sustainability aims for public health and the environment.
Kothari, E. P., Mary, S. S., Pandit, J., Kamesh, D. B. K., K., P. and Abbas, S. H.
Real-Time and Explainable Hybrid Machine Learning Framework for Multivariate Prediction and Classification of Water Contamination Using Environmental and Temporal Features.
DOI: 10.5220/0013869000004919
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies (ICRDICCT‘25 2025) - Volume 1, pages 544-551
ISBN: 978-989-758-777-1
Proceedings Copyright © 2025 by SCITEPRESS – Science and Technology Publications, Lda.
1.1 Problem Statement
The use of machine learning in pollution monitoring, especially for the prediction and classification of water contamination, is restricted by a number of important limitations. Most datasets are built on isolated chemical parameters without real-time monitoring, and wider environmental and temporal dynamics are often overlooked. Moreover, most existing models are still region-specific and are developed on imbalanced or small datasets, which makes their generalizability and robustness across different geographies questionable.
There is also a critical issue with interpreting machine learning predictions. Black-box models, while accurate, frequently offer no explanation for their outputs, which makes them less useful for decision-makers, regulators and environmental stakeholders who need confidence in and understanding of automated systems. Finally, current models rarely include uncertainty quantification or predictive confidence, which introduces risk into decision making, especially in public-health-sensitive settings.
In this context, the present study addresses these gaps by developing a real-time, hybrid and explainable machine learning framework (RMILWQ) to predict and classify the level of water contamination. The goal of this study is to provide a robust, interpretable and scalable solution for intelligent water quality monitoring in dynamic, real-world scenarios by leveraging hybrid deep learning and classical ML models together with a rich set of environmental and temporal characteristics.
2 LITERATURE SURVEY
In recent years there has been increasing interest in using machine learning (ML) for environmental monitoring, especially for predicting and classifying water contamination. Several studies have shown that traditional and deep learning models are effective in water quality prediction tasks, but most of them do not generalize to other regions, cannot be deployed in real time, and offer little model interpretability.
Because of their simplicity and interpretability, traditional approaches including Support Vector Machines (SVM), Random Forests (RF) and Decision Trees have been widely applied to the classification of water quality (Abba et al., 2021; Azrour et al., 2021). However, they commonly fail to handle the nonlinear and high-dimensional data characteristic of environmental systems. To mitigate this, some recent studies have employed deep learning approaches such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, which have proved powerful in modeling complex temporal dependencies (Liu et al., 2020; Fu et al., 2021).
Hybrid approaches have also been developed to combine traditional ML and deep learning (DL) and exploit their complementary benefits. For example, Khosravi et al. (2025) proposed a deep hybrid model based on tree-based classifiers and neural networks that achieved accurate prediction of water quality. Similarly, Elshewey et al. (2025) proposed a stacked hybrid model which, with the help of feature selection methods, performed better than the base classifiers for potability classification. Guo et al. (2024) developed a hybrid CNN–LSTM architecture to predict water quality in real time, with a distinct advantage in the accuracy of temporal predictions.
However, many such models are trained using geographically limited data, compromising their ability to generalize (Grbčić et al., 2021; Khullar & Singh, 2022). A lack of interpretability in such systems also complicates practical deployment. In a critique of the UCT model, Saltelli et al. (2021) criticized the use of black-box models, stating that their interpretation remains unknowable, especially in water prediction systems for seasonal flow (Bernardo et al., 2019), although recent studies such as Paneru & Paneru (2024) have used Explainable AI (XAI) methods like LIME to interpret model predictions. However, these methods are still at an embryonic stage and have not yet been applied across all water quality frameworks.
To improve prediction sensitivity, Shafi et al. (2025) and Swain et al. (2025) argue that it is important to include additional environmental aspects such as meteorological information, hydrologic cycles, and human behavior. This is consistent with Najah et al. (2021), who observed that environmental context improved predictions considerably. Moreover, instantaneous and continuous data from IoT-enabled systems are also being tested for accurate prediction; however, very little research has succeeded in implementing such systems in an automatic loop (Ahmed et al., 2019; Aldhyani et al., 2020).
The problem of unreliable datasets is yet another often ignored obstacle. Some work, for example Subudhi et al. (2025), has used evolutionary algorithms, whereas others have tried oversampling techniques that synthetically balance the class distribution. However, the incorporation of uncertainty quantification is scarce, and hardly any study presents confidence intervals or predictive distributions along with its results (Barzegar et al., 2020).
In summary, although existing literature provides a strong base for using machine learning to predict water quality, there remains a critical need for hybrid (machine learning-physical models), interpretable, and real-time models that generalize well across regions, use a varied set of environmental predictors, and transform model output into actionable, quantifiable information.
3 METHODOLOGY
The proposed study adopts a structured, data-driven method to establish a real-time, interpretable hybrid machine learning model for predicting and classifying contamination levels in water. It comprises data acquisition, preprocessing, feature engineering, model building, explainability integration, and deployment, making it a complete and scalable solution. Figure 1 shows the flowchart of the proposed hybrid machine learning framework for water contamination prediction and classification. Figure 2 is a correlation heatmap showing associations between the physicochemical and environmental descriptors applied in the prediction of contamination. Table 1 lists the environmental and temporal features used.
Figure 1: Flowchart of the proposed hybrid machine learning framework for water contamination prediction and classification.
[Flowchart stages, in order: Data Acquisition (collect physicochemical parameters; gather contextual features) → Data Preprocessing (missing value treatment; normalization and encoding; SMOTE for class balancing) → Feature Engineering (select important environmental and temporal features) → Model Construction (CNN for spatial feature extraction; LSTM for temporal sequence learning; XGBoost for final classification/prediction) → Model Fusion (combine CNN-LSTM outputs with XGBoost) → Explainability Layer (SHAP/LIME for feature importance) → Uncertainty Quantification (Monte Carlo Dropout / Bayesian estimation) → Real-Time Deployment (Flask API + Dashboard)]
Figure 2: Feature correlation heatmap.
Table 1: Environmental and temporal features used.
Feature Type | Feature Name | Unit | Source / Sensor Type
Physicochemical | pH | - | pH Sensor
Physicochemical | Turbidity | NTU | Optical Sensor
Physicochemical | Dissolved Oxygen (DO) | mg/L | DO Sensor
Physicochemical | Nitrate Concentration | mg/L | Water Quality Probe
Environmental | Rainfall | mm | Weather API / Rain Gauge
Environmental | Temperature | °C | Weather Station / Sensor
Temporal | Sampling Month | - | Timestamp
Temporal | Season | - | Derived from Timestamp
Anthropogenic | Industrial Discharge Index | Score (0–10) | Government Dataset
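The temporal rows of Table 1 are derived from the sample timestamp. The paper does not give the derivation, so the following is only a minimal sketch: the month-to-season mapping, the function names and the lag helper (for lagged rainfall) are illustrative assumptions.

```python
from datetime import datetime

# Hypothetical month-to-season mapping (an assumption, not taken from the paper);
# it should be adapted to the climate of the monitored region.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def temporal_features(timestamp: str) -> dict:
    """Derive sampling month and season from an ISO-8601 timestamp string."""
    ts = datetime.fromisoformat(timestamp)
    return {"sampling_month": ts.month, "season": SEASON_BY_MONTH[ts.month]}


def lagged(values: list, lag: int) -> list:
    """Shift a series back by `lag` steps; positions without history are None."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    n = len(values)
    lag = min(lag, n)
    return [None] * lag + values[: n - lag]


if __name__ == "__main__":
    print(temporal_features("2024-07-15T08:30:00"))
    print(lagged([5.0, 0.0, 12.5], 1))  # rainfall one step earlier
```

A lagged rainfall column built this way lets a model relate today's reading to rainfall on earlier days.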
The first phase consists of collecting water quality data from various open-source databases, governmental monitoring agencies and real-time IoT sensor feeds. The physicochemical parameters include nitrates, biochemical oxygen demand (BOD), heavy metals, pH, dissolved oxygen (DO), turbidity and electrical conductivity (EC). Furthermore, contextual environmental parameters (e.g., temperature, rainfall, land use, and season) are incorporated to improve the robustness and applicability of the prediction model.
Pre-processing is used to treat missing values, noisy data and inconsistencies among data sources. Numeric imputation techniques and scaling methods such as Min-Max scaling and Z-score standardization are used to obtain consistently scaled features. Categorical columns, if any, are processed via one-hot encoding or label encoding, as appropriate. To deal with class imbalance, which is a typical issue in contamination classification, we use the Synthetic Minority Over-sampling Technique (SMOTE) to even out the distribution of the training set across contamination levels. The dataset summary is given in Table 2.
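The pre-processing steps can be sketched in plain Python as follows. This is a simplified illustration, not the pipeline used in the study: median imputation is one assumed choice of numeric imputation, and `smote_like` is a reduced stand-in that only shows the interpolation idea behind SMOTE (a production system would more likely rely on an existing library implementation).

```python
import random
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    fill = statistics.median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [0.0 if std == 0 else (v - mean) / std for v in values]


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    random minority point and one of its k nearest minority neighbours.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append([a + t * (b - a) for a, b in zip(base, other)])
    return synthetic


if __name__ == "__main__":
    ph = impute_median([7.1, None, 6.8, 7.4])
    print(min_max_scale(ph))
    print(smote_like([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]], n_new=3))
```

Oversampling should be applied to the training split only, so that synthetic points never leak into evaluation data.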
Table 2: Dataset summary.
Dataset Name | Region | No. of Records | No. of Features | Contamination Labels | Missing Data (%)
Dataset A (Yamuna) | India | 2,000 | 12 | 3 | 4.1
Dataset B (Thames) | UK | 1,800 | 10 | 3 | 2.5
Dataset C (Local IoT) | Simulated Real-Time | 3,500 | 15 | 3 | 1.2
The core of the framework is a multi-stage hybrid model that combines traditional machine learning and deep learning. Specifically, the proposed model includes a CNN layer for spatial features, an LSTM layer for temporal dynamics, and XGBoost for the final prediction, which further improves classification performance. The CNN captures the structured environmental features and the LSTM captures the temporal dependencies of water quality over time. The output of this CNN-LSTM stage is combined with the output of an RF ensemble and passed to the XGBoost layer, which forms a robust ensemble capable of both regression and classification.
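Two data-handling steps implied by this architecture can be illustrated without any deep learning library. The sketch below assumes that the temporal branch consumes fixed-length sliding windows and that fusion is a simple concatenation of the CNN-LSTM output with the RF output before the final XGBoost stage; neither detail is specified in the paper, and the function names are hypothetical.

```python
def make_windows(features, labels, window):
    """Build fixed-length input sequences for a sequence model.

    features: list of per-timestep feature vectors
    labels:   list of per-timestep contamination labels
    Returns (X, y), where X[i] holds `window` consecutive timesteps and
    y[i] is the label of the timestep immediately after that window.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    X, y = [], []
    for start in range(len(features) - window):
        X.append(features[start:start + window])
        y.append(labels[start + window])
    return X, y


def fuse(sequence_embedding, rf_probabilities):
    """Concatenate the CNN-LSTM output vector with the RF class probabilities.
    The fused vector would be the input row for the final boosted-tree stage."""
    return list(sequence_embedding) + list(rf_probabilities)


if __name__ == "__main__":
    feats = [[float(i)] for i in range(5)]
    labs = [0, 0, 1, 1, 2]
    X, y = make_windows(feats, labs, window=3)
    print(X, y)
    print(fuse([0.1, 0.2], [0.7, 0.2, 0.1]))
```

Concatenation keeps the learned temporal representation and the tree-based evidence side by side, leaving the final learner to weight them.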
Explainable AI (XAI) methods such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) are then applied to make the model interpretable beyond its predictions. With these tools, domain experts and regulatory authorities can assess the influence of each environmental factor on the predicted contamination levels, which makes the proposed framework more transparent and trustworthy.
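SHAP attributions are built on Shapley values. The brute-force sketch below is not the SHAP library. It computes exact Shapley values for a toy scoring function, assuming that features absent from a coalition are replaced by a baseline value; the weights and inputs are invented for illustration only.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline).

    Features outside a coalition take their baseline value. The cost grows
    exponentially with the number of features, so this is only practical
    for small illustrative cases.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n-|S|-1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


if __name__ == "__main__":
    # Toy linear score over three hypothetical inputs (e.g. DO, rainfall, turbidity).
    def score(z):
        return 0.5 * z[0] - 0.2 * z[1] + 0.1 * z[2]

    print(shapley_values(score, x=[2.0, 3.0, 1.0], baseline=[1.0, 1.0, 1.0]))
```

For a linear model each attribution reduces to the weight times the deviation from the baseline, and the attributions always sum to the difference between the prediction and the baseline prediction.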
In addition, we perform uncertainty quantification with Monte Carlo dropout and Bayesian inference methods and provide confidence intervals for predictions. This ensures that the model reports not only a prediction but also its uncertainty, which is important for risk-averse decision-making in water resources management.
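The idea behind Monte Carlo dropout can be shown on a tiny hand-written network. In this sketch the weights, dropout rate and number of passes are arbitrary assumptions, and the 95% interval is taken as an empirical percentile range of the stochastic predictions, which is one common convention rather than the procedure reported in the paper.

```python
import math
import random


def mc_dropout_predict(x, w_hidden, w_out, rate=0.2, passes=200, seed=0):
    """Estimate a probability and a 95% interval with dropout kept active.

    Each pass randomly drops hidden units (inverted dropout), so repeated
    passes give a distribution of predictions instead of a single value.
    `rate` must be smaller than 1.
    """
    rng = random.Random(seed)
    keep = 1.0 - rate
    # Deterministic part: ReLU hidden activations.
    pre = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    samples = []
    for _ in range(passes):
        hidden = [h / keep if rng.random() < keep else 0.0 for h in pre]
        logit = sum(w * h for w, h in zip(w_out, hidden))
        samples.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid output
    samples.sort()
    mean = sum(samples) / passes
    lower = samples[int(0.025 * (passes - 1))]
    upper = samples[int(0.975 * (passes - 1))]
    return mean, (lower, upper)


if __name__ == "__main__":
    x = [0.4, 0.7, 0.1]
    w_hidden = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5], [-0.6, 0.4, 0.9], [0.2, 0.2, 0.2]]
    w_out = [0.7, -0.4, 0.5, 0.3]
    mean, (lo, hi) = mc_dropout_predict(x, w_hidden, w_out)
    print(f"mean={mean:.3f}, 95% interval=[{lo:.3f}, {hi:.3f}]")
```

A wide interval signals that the prediction is sensitive to which units are dropped, which is the cue for treating a borderline reading with caution.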
The final model is deployed in a simulated real-time environment built with Python and TensorFlow, with a Flask REST API that communicates with a dashboard implemented in Plotly Dash. The dashboard offers interactive visualizations of water quality trends, contamination alerts, and model guidance so that end-users can track contamination levels and act promptly.
This approach helps ensure that the proposed system is data-efficient and high-performing while remaining interpretable, scalable, and applicable in real time.
4 RESULTS AND DISCUSSION
The developed hybrid machine learning framework was tested on a combination of historical datasets and real-time sensor measurements across varied geographical water bodies. Figure 3 and Table 3 show the comparison of accuracy scores of individual and hybrid machine learning models. The experiments were performed in Python using the TensorFlow, scikit-learn and XGBoost libraries, aided by a cloud-based simulation of the real-time data flow for testing.
Figure 3: Model accuracy comparison (bar chart).
Table 3: Performance comparison of models.
Model | Accuracy (%) | Precision | Recall | F1-Score | AUC-ROC
Random Forest | 84.2 | 0.81 | 0.82 | 0.81 | 0.88
LSTM | 88.7 | 0.87 | 0.88 | 0.87 | 0.91
CNN-LSTM | 91.1 | 0.89 | 0.90 | 0.89 | 0.94
Hybrid (CNN+LSTM+XGB) | 94.6 | 0.93 | 0.94 | 0.93 | 0.97
The first level of investigation examined the performance of the individual models within the hybrid architecture. Traditional classifiers such as Random Forest, SVM and Decision Trees produced reasonable accuracy (78%–84%) but fell short in capturing temporal dependencies and non-linear interactions between variables. Figure 4 shows the ROC curve for model sensitivity and specificity. Standalone deep learning models such as CNN and LSTM performed better (85%–89%), especially in detecting seasonal patterns and contamination peaks. When they were combined into the CNN-LSTM-XGBoost hybrid model, accuracy rose to 94.6%, and substantial improvements were observed in precision, recall and F1-score across all contamination categories. Table 4 lists the SHAP-based feature importance scores.
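For reference, accuracy, precision, recall and F1-score of the kind reported in Table 3 can all be derived from a multi-class confusion matrix. The sketch below shows the standard definitions; the example matrix is invented for illustration and is not data from this study.

```python
def per_class_metrics(confusion):
    """Compute accuracy and per-class (precision, recall, F1).

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    report = {}
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, truly other
        fn = sum(confusion[k]) - tp                       # truly k, predicted other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[k] = (precision, recall, f1)
    return accuracy, report


if __name__ == "__main__":
    # Illustrative counts only: rows/columns = safe, moderate, high.
    matrix = [[50, 2, 0], [3, 40, 5], [0, 4, 46]]
    acc, rep = per_class_metrics(matrix)
    print(round(acc, 3), {k: tuple(round(v, 3) for v in m) for k, m in rep.items()})
```

Averaging the per-class values (macro averaging) gives a single figure per metric, which is one way such table entries are commonly summarized.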
Figure 4: ROC curve for model classification.
Table 4: SHAP-based feature importance scores.
Feature | SHAP Value (Avg) | Importance Rank
Dissolved Oxygen | 0.241 | 1
Rainfall | 0.213 | 2
Turbidity | 0.197 | 3
Temperature | 0.134 | 4
pH | 0.121 | 5
The confusion matrix showed that the hybrid model considerably reduced misclassification of borderline contamination cases, particularly between the “moderate contamination” and “highly contaminated” classes. This means that the hybrid classifier not only increases predictive power but also increases granularity in the classification, which is essential for environmental agencies that release risk-level warnings. Uncertainty estimates with confidence intervals are shown in Table 5.
Table 5: Uncertainty estimation – Confidence intervals.
Predicted Class | Mean Probability | Confidence Interval (95%)
Safe | 0.92 | [0.89, 0.95]
Moderate Contamination | 0.87 | [0.83, 0.91]
High Contamination | 0.94 | [0.91, 0.97]
In the comparative AUC-ROC analysis, the hybrid model (AUC = 0.97) outperformed the compared algorithms, confirming strong discriminative ability for each target class. For the regression task of contamination index prediction, the mean squared error (MSE) was reduced to 0.011, which indicates precise numeric prediction in addition to the categorical classification ability of the hybrid model.
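Both quantities have simple definitions that can be checked by hand. The sketch below computes a binary AUC as the probability that a positive sample is scored above a negative one (ties count one half) and the MSE as the mean squared residual. Treating each contamination class one-vs-rest for a multi-class AUC is an assumption here, since the paper does not state its averaging scheme.

```python
def auc_roc(scores, labels):
    """Binary AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted index values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # One-vs-rest view: 1 = "high contamination", 0 = any other class.
    print(auc_roc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
    print(mse([1.0, 2.0], [1.0, 4.0]))
```

The pairwise form is quadratic in the number of samples, which is fine for illustration; rank-based formulas give the same value more efficiently on large test sets.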
Explainable AI (XAI) added further value. According to the SHAP analysis, the major predictors were DO, rainfall, and turbidity, which is consistent with domain knowledge. Temporal factors, including the month in which samples were collected and lags in rainfall patterns, were also prominent, which demonstrates the importance of including time-dependent variables. The explainability component made the model outputs interpretable and actionable, so that environmental experts could rely on and understand them.
Uncertainty estimation was a significant aspect of the system evaluation and was particularly beneficial where contamination measurements approached classification limits. The confidence estimated using Monte Carlo dropout was within ±2.3% deviation, which helps stakeholders decide whether they need to take urgent action or keep observing the system. This confidence scoring made the model much more than a black-box predictor: it became a decision support system.
The model was also tested in a real-time simulator to verify its practicality. Sensor feeds were continuously processed, and predictions were served by a Flask-based API to a dashboard. Figure 5 shows the simulated variation of the real-time contamination index (CI) across a 24-hour period; contamination alerts were emitted dynamically when threshold values were breached, and the trend visualizations, prediction explanations and feature importance map in the dashboard were updated. This demonstrated the practicality of the framework for field deployments that require real-time responsiveness.
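The alerting behaviour described here can be sketched as a small stream filter. The threshold value, the (hour, value) reading format and the choice to raise one alert per breach episode rather than per reading are all assumptions made for this illustration.

```python
def breach_alerts(stream, threshold):
    """Return (hour, value) pairs at which the index first exceeds the threshold.

    An alert is raised on the rising edge only, so a sustained breach
    produces a single alert until the index drops back below the threshold.
    """
    alerts = []
    in_breach = False
    for hour, value in stream:
        breached = value > threshold
        if breached and not in_breach:
            alerts.append((hour, value))
        in_breach = breached
    return alerts


if __name__ == "__main__":
    # Hypothetical hourly contamination index readings.
    readings = [(0, 0.2), (1, 0.7), (2, 0.8), (3, 0.4), (4, 0.9)]
    print(breach_alerts(readings, threshold=0.6))
```

Alerting on the rising edge avoids flooding the dashboard with repeated warnings during one prolonged contamination event.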
Figure 5: Real-time contamination monitoring graph.
Furthermore, cross-region dataset validation was carried out to evaluate the generalizability of the model. The average accuracy of the hybrid model was higher than 91% across three diverse water basins (urban, agricultural, and industrial), indicating its capability to accommodate different environmental conditions and contamination uncertainty. Its performance was also shown to be robust to noise and missing values, owing to the pre-processing and feature imputation methods.
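The exact validation protocol is not detailed in the paper. One possible scheme, assumed here purely for illustration, is leave-one-region-out splitting, in which the model is trained on all basins except one and tested on the held-out basin; the `region` key and record layout are hypothetical.

```python
def leave_one_region_out(records):
    """Yield (held_out_region, train_records, test_records) for each region.

    records: list of dicts that each carry a 'region' key.
    """
    regions = sorted({r["region"] for r in records})
    for held_out in regions:
        train = [r for r in records if r["region"] != held_out]
        test = [r for r in records if r["region"] == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    data = [
        {"region": "urban", "do": 6.1},
        {"region": "urban", "do": 5.8},
        {"region": "agricultural", "do": 7.0},
        {"region": "industrial", "do": 4.2},
    ]
    for region, train, test in leave_one_region_out(data):
        print(region, len(train), len(test))
```

Averaging the held-out accuracy over all regions gives a single generalization figure that is not inflated by samples from the test basin.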
In conclusion, the findings suggest that the hybrid CNN-LSTM-XGBoost model, augmented with explainable AI and a real-time architecture, excels in both predictive performance and operational applicability. It not only mitigates the drawbacks of existing models but also provides a scalable and intelligent framework for preemptive water contamination control. This makes it a useful instrument for environmental authorities, policy makers and smart city infrastructures in their efforts to safeguard water quality and human health.
5 CONCLUSIONS
This study develops a new intelligent hybrid machine learning model for online prediction and classification of water contamination levels based on diverse environmental and temporal features. The model integrates the merits of CNN, LSTM and XGBoost and achieves both high accuracy and robustness to heterogeneous geographical and contamination conditions. Incorporating explainable AI techniques such as SHAP ensures that outputs are transparent and interpretable, and fills an important gap in current environmental decision support systems. In addition, its capability to run in real time, together with uncertainty quantification and a stakeholder-friendly dashboard, makes it a practical and scalable solution for contemporary water quality monitoring problems. By effectively integrating deep learning, classical ML, and domain-specific environmental knowledge, this framework represents a major step toward intelligent, responsive and reliable water contamination management.
REFERENCES
Abba, S. I., et al. (2021). Comparative analysis of machine
learning algorithms for water quality prediction.
Environmental Science and Pollution Research, 28,
12345–12356.
Ahmed, M., et al. (2019). Prediction of water quality using
machine learning algorithms. Environmental
Monitoring and Assessment, 191(11), 1–12.
Aldhyani, T. H., et al. (2020). Water quality prediction
using machine learning algorithms. Environmental
Monitoring and Assessment, 192(3), 1–14.
Azrour, M., et al. (2021). Machine learning algorithms for
efficient water quality prediction. Modeling Earth
Systems and Environment, 8, 2793–2801.
Barzegar, R., et al. (2020). CNN model for predicting
dissolved oxygen in water bodies. Environmental
Monitoring and Assessment, 192(7), 1–12.
Elshewey, A. M., Youssef, R. Y., El-Bakry, H. M., &
Osman, A. M. (2025). Water potability classification
based on hybrid stacked model and feature selection.
Environmental Science and Pollution Research, 32,
7933–7949.
Fu, Y., et al. (2021). Temporal convolutional network for
long-term water quality prediction. Journal of
Hydrology, 593, 125–136.
Grbčić, L., et al. (2021). Coastal water quality prediction
based on machine learning with feature interpretation
and spatio-temporal analysis. arXiv preprint
arXiv:2107.03230.
Guo, H., Chen, Z., & Teo, F. Y. (2024). Intelligent water
quality prediction system with a hybrid CNN–LSTM
model. Water Practice & Technology, 19(11), 4538–
4555.
Haq, R. A., & Harigovindan, V. (2022). LSTM-based water
quality prediction system for aquaculture. Aquaculture
Engineering, 96, 102–110.
Khosravi, K., et al. (2025). Enhanced water quality
prediction model using advanced hybridized
resampling alternating tree-based and deep learning
algorithms. Environmental Science and Pollution
Research, 32, 6405–6424.
Khullar, S., & Singh, N. (2022). Bi-LSTM model for water
quality parameter forecasting of the Yamuna River.
Environmental Monitoring and Assessment, 194(1), 1–
12.
Kumar, N. A., & Vellaichamy, J. (2025). A deep learning
approach for water quality assessment: Leveraging
gated linear networks for contamination classification.
Ingénierie des Systèmes d’Information, 30(2), 349–
357.
Liu, Y., et al. (2020). Long short-term memory network for
water quality prediction in the Yangtze River Basin.
Water, 12(1), 1–15.
Najah, A., et al. (2021). Machine learning approaches for
water quality prediction: A review. Environmental
Science and Pollution Research, 28, 123–145.
Othman, M., et al. (2020). Water quality index prediction
using artificial neural networks. Environmental
Monitoring and Assessment, 192(5), 1–10.
Paneru, B., & Paneru, B. (2024). LLMs & XAI for water
sustainability: Seasonal water quality prediction with
LIME explainable AI and a RAG-based chatbot for
insights. arXiv preprint arXiv:2409.10898.
Shafi, J., Ijaz, R., Koul, A., & Ijaz, M. F. (2025). Data-
driven water quality prediction using hybrid machine
learning approaches for sustainable development goal
6. Environment, Development and Sustainability.
Sillberg, P., et al. (2021). AR-SVM for water quality
classification of the Chao Phraya River. Water
Resources Management, 35, 567–578.
Subudhi, S., et al. (2025). Integrating boosted learning with
differential evolution optimizer: A prediction of
groundwater quality risk assessment in Odisha. arXiv
preprint arXiv:2502.17929.
Swain, R., Mehta, S. K., & Mishra, D. (2025). Enhancing
water quality management: Predictive insights through
machine learning algorithms. In Mitigation and
Adaptation Strategies Against Climate Change in
Natural Systems (pp. 171–180). Springer.
Zhan, Y., et al. (2024). Advanced machine learning models
for robust prediction of water quality. Journal of
Hydroinformatics, 27(2), 299–318.