Abstract
In a fast-changing world with growing water demand, water pollution, and environmental problems, and with ever more related data, information on water quality and its suitability for a given purpose must be prompt and reliable. Traditional approaches often fail to predict water quality classes, and new ones are needed that can handle large data volumes or missing data and predict water quality in real time. One such approach is machine-learning (ML) based prediction. This paper presents the results of applying Naïve Bayes, a widely used ML method, to build a prediction model. The proposed model is based on nine water quality parameters: temperature, pH value, electrical conductivity, oxygen saturation, biological oxygen demand, suspended solids, nitrogen oxides, orthophosphates, and ammonium. It was created in Netica software and tested and verified using data covering the period 2013–2019 from five locations in Vojvodina Province, Serbia. Forty-eight samples were used to train the model. Once trained, the Naïve Bayes model correctly predicted the class of the water sample in 64 out of 68 cases, including cases with missing data. This recommends it as a trustworthy tool in the transition from traditional to digital water management.
HIGHLIGHTS
The study tests the efficiency of water quality prediction by the Naïve Bayes method.
Nine water quality parameters, defined by the SWQI methodology, are analyzed.
Water quality is assessed by the Naïve Bayes model for 68 samples (cases) from five locations in Serbia.
The prediction model is trained on 48 cases.
The model predicted the water quality class accurately in 64 out of 68 cases.
INTRODUCTION
Water demands are growing due to population growth, urbanization, and agricultural and industrial development. For the same reasons, water professionals face growing challenges related to water pollution and the effects of climate change on the amount of available water (de Marsily 2007; Hofstra et al. 2019; Jain et al. 2021). Providing sufficient quantities for all requirements is not the only challenge. For water to be considered a resource, its quality, which encompasses physical, chemical, and biological properties, is also essential. Different uses require different quality, but each use entails quality control and assessment. To know the suitability of water for different purposes, many water quality parameters should be assessed and evaluated. Determining water quality includes sampling, analysis, and interpretation of results, which requires money, time, and human labor. New trends in research and practice show attempts to reduce costs by matching available and real-time data to predict water quality. This is a difficult task because of the uncertainties that characterize natural systems: there are numerous influential factors, some of which can change easily and quickly, and incorrect predictions may have serious consequences (Srđević & Srdjević 2020). Nevertheless, effective water resource management increasingly depends on modeling and predicting water quality in a fast-changing world with growing demands.
In recent years, there have been many studies on machine-learning-based water quality prediction and attempts to find the most accurate prediction approach (e.g. Emamgholizadeh et al. 2014; Haghiabi et al. 2018; Hmoud Al-Adhaileh & Waselallah Alsaade 2021). As Bilali & Taleb (2020) stated in their research, the implementation of artificial intelligence techniques for water quality has shown highly accurate prediction results. However, they also emphasize that site-specific conditions strongly influence both the results and the required input data, so generalizing the models and their accuracy is not recommended.
One of the models used for prediction in many different domains is the Bayesian Network (BN). Bayesian Networks are directed acyclic graphs that represent the causes and effects of a system's components through conditional probability distributions (Jensen 1996). The use of BNs in the modeling of environmental systems is expanding because they can integrate different components and handle missing data and uncertainty, among other advantages (Kragt 2009; Chen & Pollino 2012). BNs are also widely used in the field of water management (Drury et al. 2017; Phan et al. 2019; Guzmán-Fierro et al. 2020; Govender et al. 2021).
Traditional machine-learning algorithms assume that data are precise, but for uncertain data the Naïve Bayes (NB) classifier is a favorable choice (Murphy 2006; Ren et al. 2009). Feki-Sahnoun et al. (2018) concluded that NB models are very popular and effective in many complex problems despite their simplicity, which is a great advantage for their use. However, their application in water quality prediction has not yet been sufficiently researched. This is especially the case for the study area in this research, not just for Bayesian Networks or Naïve Bayes, but for artificial intelligence techniques in general. Research on their application in this area therefore deserves special attention.
In this paper, an NB model was created and used to predict the water quality class of 68 water samples taken at five measuring points in Vojvodina Province, Serbia. The basis of Vojvodina's water resources are the Danube, Tisa, and Sava rivers, smaller watercourses, the basic canal network with a total length of 930 km, and the detailed canal network of about 20,000 km (Savić et al. 2013). This hydrographical network makes it possible to supply the population with water for various needs, including drinking, irrigation, industry, and recreation. However, these valuable resources are affected by large amounts of wastewater from various sources, which significantly reduces their usability (Belic et al. 2005). The rich multipurpose canal network has much greater potential than is currently used, and one of the main problems is water quality. Untreated or insufficiently or inadequately treated wastewater discharged into the canals often makes the water unusable for further use. This requires strict control and continuous quality monitoring. In addition to real-time quality assessment, it implies the need to predict future quality and to estimate the impacts of various pollutants. To answer these requirements and test the usability and accuracy of the NB model, data on nine parameters for the period 2013–2019 were pre-processed and used as input for training the model and testing the accuracy of the predicted classification of water samples.
METHODS
Water quality assessment
Table 1. SWQI parameters and their maximum value
Water quality parameter | Maximum value of SWQI (qi × wi) |
---|---|
Oxygen saturation (%) | 18 |
BOD5 (mg/l) | 15 |
Ammonium (mg/l) | 12 |
pH value | 9 |
Total nitrogen oxides (mg/l) | 8 |
Orthophosphates (mg/l) | 8 |
Suspended solids (mg/l) | 7 |
Temperature (°C) | 5 |
Electrical conductivity (μS/cm) | 6 |
E. coli (per 100 ml) | 12 |
∑ qi × wi = SWQI | 100 |
Table 2. Water quality indexes of the parameters according to their values
Water quality (qi × wi) | Oxygen saturation (%) | BOD5 (mg/l) | Ammonium (mg/l) | pH value |
---|---|---|---|---|
18 | 93–109 | | | |
17 | 88–92, 110–119 | | | |
16 | 85–87, 120–129 | | | |
15 | 81–84, 130–134 | 0–0.9 | | |
14 | 78–80, 135–139 | 1.0–1.9 | | |
13 | 75–77, 140–144 | 2.0–2.4 | | |
12 | 72–74, 145–154 | 2.5–2.9 | 0–0.09 | |
11 | 69–71, 155–164 | 3.0–3.4 | 0.10–0.14 | |
10 | 66–68, 165–179 | 3.5–3.9 | 0.15–0.19 | |
9 | 63–65, 180+ | 4.0–4.4 | 0.20–0.24 | 6.5–7.9 |
8 | 59–62 | 4.5–4.9 | 0.25–0.29 | 6.0–6.4, 8.0–8.4 |
7 | 55–58 | 5.0–5.4 | 0.30–0.39 | 5.8–5.9, 8.5–8.7 |
6 | 50–54 | 5.5–6.1 | 0.40–0.49 | 5.6–5.7, 8.8–8.9 |
5 | 45–49 | 6.2–6.9 | 0.50–0.59 | 5.4–5.5, 9.0–9.1 |
4 | 40–44 | 7.0–7.9 | 0.60–0.99 | 5.2–5.3, 9.2–9.4 |
3 | 35–39 | 8.0–8.9 | 1.00–1.99 | 5.0–5.1, 9.5–9.9 |
2 | 25–34 | 9.0–9.9 | 2.00–3.99 | 4.5–4.9, 10.0–10.4 |
1 | 10–24 | 10.0–14.9 | 4.00–9.99 | 3.5–4.4, 10.5–11.4 |
0 | 0–9 | 15.0+ | 10.00+ | 0–3.4 |

Water quality (qi × wi) | Total nitrogen oxides (mg/l) | Orthophosphates (mg/l) | Suspended solids (mg/l) | Temperature (°C) | Electrical conductivity (μS/cm) |
---|---|---|---|---|---|
8 | 0–0.49 | 0–0.029 | | | |
7 | 0.5–1.49 | 0.030–0.059 | 0–9 | | |
6 | 1.5–2.49 | 0.060–0.099 | 10–14 | | 0–49, 50–188 |
5 | 2.5–3.49 | 0.100–0.129 | 15–19 | 0–17.4 | 189, 190–239 |
4 | 3.5–4.49 | 0.130–0.179 | 20–29 | 17.5–19.4 | 240–298 |
3 | 4.5–5.49 | 0.180–0.219 | 30–44 | 19.5–21.4 | 290–379 |
2 | 5.5–6.99 | 0.220–0.279 | 45–64 | 21.5–22.9 | 380–539 |
1 | 7.0–9.99 | 0.280–0.369 | 65–119 | 23.0–24.9 | 540–839 |
0 | 10.00+ | 0.370+ | 120+ | 25+ | 810+ |
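Reading a score off the table is a range lookup. The following is a minimal sketch for one parameter only (temperature), using the breakpoints listed in the temperature column; the names `TEMPERATURE_BREAKS` and `temperature_score` are hypothetical, and the handling of values that fall between two printed ranges (for example 17.45 °C) is an assumption, not part of the SWQI methodology as described here.

```python
from bisect import bisect_right

# Temperatures (°C) at which the score drops by one point, read from the
# temperature column of the table (0-17.4 -> 5, 17.5-19.4 -> 4, ..., 25+ -> 0).
TEMPERATURE_BREAKS = [17.5, 19.5, 21.5, 23.0, 25.0]
TEMPERATURE_MAX_SCORE = 5


def temperature_score(temp_c: float) -> int:
    """Return the q_i x w_i score (0-5) for a water temperature in °C.

    Assumption: a value between two printed ranges is scored with the
    lower-temperature range (e.g. 17.45 °C is scored like 17.4 °C).
    """
    if temp_c < 0:
        raise ValueError("the table starts at 0 °C")
    # Count how many breakpoints the value has reached or passed.
    return TEMPERATURE_MAX_SCORE - bisect_right(TEMPERATURE_BREAKS, temp_c)


if __name__ == "__main__":
    for t in (12.0, 17.5, 22.0, 30.0):
        print(t, temperature_score(t))
```

The other parameters could be handled the same way, although oxygen saturation and pH would need two-sided breakpoints because their scores fall off in both directions.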
Table 3. SWQI water quality indicators
Numerical indicator (index) | Descriptive indicator |
---|---|
90–100 | Excellent |
84–89 | Very good |
72–83 | Good |
39–71 | Bad |
0–38 | Very bad |
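The two steps behind the descriptive indicator, summing the parameter scores and mapping the total to a class, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the sample scores are invented for the example and are not taken from the study data, and the official SEPA calculator may treat edge cases differently.

```python
from bisect import bisect_right

# Lower bound of each class, from the SWQI indicator table.
CLASS_LOWER_BOUNDS = [0, 39, 72, 84, 90]
CLASS_NAMES = ["Very bad", "Bad", "Good", "Very good", "Excellent"]


def swqi_total(scores: dict) -> float:
    """Sum the per-parameter scores (q_i x w_i) into the SWQI."""
    return sum(scores.values())


def swqi_class(index: float) -> str:
    """Map an SWQI value (0-100) to its descriptive indicator."""
    if not 0 <= index <= 100:
        raise ValueError("SWQI must lie between 0 and 100")
    return CLASS_NAMES[bisect_right(CLASS_LOWER_BOUNDS, index) - 1]


if __name__ == "__main__":
    # Hypothetical scores for one sample (illustrative values only).
    sample = {
        "oxygen_saturation": 17, "bod5": 13, "ammonium": 11, "ph": 9,
        "nitrogen_oxides": 7, "orthophosphates": 6, "suspended_solids": 5,
        "temperature": 5, "conductivity": 4, "e_coli": 10,
    }
    total = swqi_total(sample)
    print(total, swqi_class(total))
```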
The first step in this research was to use the SWQI to assess the water quality of samples taken on the Danube, Tisa, and Begej watercourses and on two sections of the DTD canal network (Table 4). The study area and the location of the measuring points are shown in Figure 1. The measuring stations were selected because of the different natural conditions at their locations. Very different water volumes (from a big river to a small canal), different sources of pollution (agriculture, industry, urban areas), and other factors make the quality of these water sections very different. The data records from the period 2013–2019 used in the study were also restricted to April and August, because of the observed differences in water quality between these months. These differences are necessary to train the model with an appropriate data set covering all existing water quality classes and combinations of the parameters on the site.
Table 4. Information on the study area and sampling locations
Watercourse | Measuring station | River basin | Coordinates (Gauss-Krüger projection) |
---|---|---|---|
Danube | Novi Sad | Danube | 5009538, 7409075 |
Tisza | Titel | Danube | 5006900, 7446600 |
Begej | Hetin | Danube | 5056488, 7484738 |
DTD canal Bački Petrovac-Karavukovo | Bač | Danube | 5028554, 7362001 |
DTD canal Banatska Palanka – Novi Bečej | Melenci | Danube | 5044463, 7448738 |
Naïve Bayes classifier
Bayesian networks in general are composed of a set of nodes, which are the variables of a system. Links between the nodes represent the relationships among them. A link is directed from cause to effect: if the link goes from node A to node B, then node A is called the ‘parent’ node of the ‘child’ node B. Nodes are defined by their states, which can be discrete or continuous. Behind every node is a conditional probability table covering every possible combination of the parent nodes’ states. The table gives the likelihood of the variable being in a certain state, and the more information it contains, the lower the degree of uncertainty. The dependencies are updated every time new information is obtained, and a change in one node causes a change in the nodes linked to it (Olalla et al. 2005; Tang et al. 2019).
The structure of a Naïve Bayes network is very simple: it consists of one parent node (the classification node, C) of all the other nodes, the child nodes (variables X1, X2, …, Xn), and no other links are allowed (Figure 2).
The main idea is to train the model from j sets of cases (training data). As an example, let U (X1, X2, …, Xn, C) be the set of cases, where variables X1(j), X2(j), …, Xn(j) represent the attributes of the system and C is the class. C is the root of the network and the attributes have just one parent, i.e. $\mathrm{pa}(X_i) = \{C\}$ for $i = 1, \dots, n$.
The probabilities of all combinations of attributes X1 (j), X2 (j),…, Xn (j) for the j-th case and the corresponding outcome C for that case define the predictive rules for determining the outcome class. Knowing these rules, the classification is done by calculating the probability of a certain class C (j+1) based on the new, given data set X1 (j+1), X2 (j+1),…, Xn (j+1) and predicting the class with the highest posterior probability.
According to Cheng & Greiner (2013), the classifier has two advantages over many other classifiers: it is easy to construct and the classification process is very efficient, both because of the assumption that the nodes are independent. For detailed definitions and background of the classifier, the reader is referred to, for example, Rish (2001) and Zhang (2004).
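The classification rule described above can be written compactly. The following is the standard Naïve Bayes formulation, stated here with the symbols used in this section (class C, attributes X1, …, Xn); it assumes the attributes are conditionally independent given the class and is not a transcription of Netica's internal procedure.

```latex
P(C = c \mid x_1, \dots, x_n)
  = \frac{P(c)\,\prod_{i=1}^{n} P(x_i \mid c)}
         {\sum_{c'} P(c')\,\prod_{i=1}^{n} P(x_i \mid c')},
\qquad
\hat{c} = \arg\max_{c}\; P(c)\,\prod_{i=1}^{n} P(x_i \mid c)
```

The denominator is the same for every class, so only the numerator needs to be compared when selecting the class with the highest posterior probability.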
Development of the NB model for water quality prediction
In this study, the Bayesian network consists of 10 nodes: one parent node (class node) named ‘Quality’ and nine child nodes that represent the nine water quality parameters (Figure 3). The parameters are defined within the SWQI methodology (Table 1). The monitoring data for the parameters were taken from the database of the Serbian Environmental Protection Agency (SEPA 2020) for April and August of each year in the period 2013–2019, 68 samples in total. The samples were taken at five measurement points: Novi Sad, Titel, Hetin, Melenci, and Bač.
After the network is created, the next step is to feed the model with the training data, from which the model learns the causal relationships and the corresponding probabilities and identifies prediction rules. In this way, the network becomes able to predict the water class of new samples.
To train the network, the parameters of 48 samples from the period 2015–2019 were selected, classified into five classes, and inserted into a file with training data (Figure 4). The SWQI of each water sample was then calculated with the SWQI calculator on the official website of the Agency (www.sepa.gov.rs), and the results were inserted as the last column of the file. Note that each row in the file represents one of the 48 learning cases (the classes of the nine parameters and the corresponding SWQI for one water sample). Once created, the file is imported into the BN model in Netica (Netica 2020).
After the training file is imported, Netica calculates the probabilities at each node of the network learned from the 48 cases (Figure 5). The model is then ready to predict the water class of new samples with known or estimated parameter classes, and also of samples with missing data.
Among other options, it is worth mentioning that Netica allows the user to enter a likelihood for all states of a node, to calibrate it, or to mark the state as unknown. Thus, when a parameter is missing, the user marks its state as unknown, and the network is updated based on previous knowledge (training inputs, probabilities). It is also possible to enter an action (a specified state) to see the effects of changing it. The results can be displayed in different forms (including graphs and meters), exported, saved, and used later; nodes can be deleted without changing the overall relations; and so on. All features, with many application examples, can be found on the website of Norsys Software Corporation (https://www.norsys.com/index.html).
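To make the learning step concrete, the sketch below estimates the class prior and the conditional probability tables by counting cases, then computes the class posterior for a sample in which some parameters are unknown. It is an illustration only, not Netica's learning algorithm: the add-one smoothing (`alpha`), the function names, and the tiny two-parameter data set are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(cases, class_key, states, alpha=1.0):
    """Estimate P(class) and P(parameter state | class) by counting.

    cases  : list of dicts, one per training case; None marks a missing value
    states : dict mapping every variable (class included) to its state list
    alpha  : additive smoothing constant (an assumption of this sketch)
    """
    classes = states[class_key]
    class_counts = Counter(case[class_key] for case in cases)
    prior = {c: (class_counts[c] + alpha) / (len(cases) + alpha * len(classes))
             for c in classes}
    cpt = {}
    for var, var_states in states.items():
        if var == class_key:
            continue
        counts = defaultdict(Counter)
        for case in cases:
            if case.get(var) is not None:  # skip missing entries for this variable
                counts[case[class_key]][case[var]] += 1
        cpt[var] = {
            c: {s: (counts[c][s] + alpha)
                   / (sum(counts[c].values()) + alpha * len(var_states))
                for s in var_states}
            for c in classes
        }
    return prior, cpt


def class_posterior(prior, cpt, evidence):
    """Posterior over classes; parameters absent from evidence (or None) are ignored."""
    scores = {}
    for c, p in prior.items():
        for var, state in evidence.items():
            if state is not None:
                p *= cpt[var][c][state]
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical toy data: two parameters with two classes each.
STATES = {"Quality": ["Very good", "Good"],
          "pH": ["c1", "c2"], "Ammonium": ["c1", "c2"]}
CASES = [
    {"Quality": "Very good", "pH": "c1", "Ammonium": "c1"},
    {"Quality": "Very good", "pH": "c1", "Ammonium": "c1"},
    {"Quality": "Very good", "pH": "c1", "Ammonium": "c2"},
    {"Quality": "Good", "pH": "c2", "Ammonium": "c2"},
    {"Quality": "Good", "pH": "c1", "Ammonium": "c2"},
    {"Quality": "Good", "pH": "c2", "Ammonium": None},
]

if __name__ == "__main__":
    prior, cpt = train_naive_bayes(CASES, "Quality", STATES)
    # Ammonium is unknown here, so only pH contributes to the posterior.
    print(class_posterior(prior, cpt, {"pH": "c1", "Ammonium": None}))
```

Leaving an unknown parameter out of the product is valid for this network structure because, given the class, each child node is independent of the others.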
RESULTS AND DISCUSSION
Table 5 presents the calculated indexes, the corresponding classes, and the classes predicted by the classifier. Samples 1–14 represent the quality of the Danube water, 15–28 of the Tisza River, 29–42 of the Begej, 43–54 of the DTD canal (Melenci), and 55–68 of the DTD canal (Bač).
Table 5. Serbian water quality index (SWQI), corresponding SWQI class, and the NB estimation of quality class
Graphical presentation of the differences between assessed and predicted classes is shown in Figure 6. The ordinal number of the water sample is marked on the abscissa, and the number of the water quality class is on the ordinate (from ‘Excellent’ – class 1, to worst observed ‘Bad’ – class 4).
Figure 6. The differences between assessed and predicted water quality classes.
The classifier correctly predicted 64 out of 68 cases. Thirteen of these 64 cases were assigned to a class different from the assessed one, but the prediction is not considered wrong because the index value lies at, or very close to, the threshold of the previous or the next class. This is the case for 10 samples. A further three samples (No. 20, 31, and 46) were allocated to the real class and to the contiguous class with almost 50% probability each; here, too, the index values are almost on the border between the classes. Every sample assessed as ‘Excellent’ was predicted as the class below (‘Very good’). The reason lies in the small number of learning cases for the ‘Excellent’ quality class. However, only one of the four ‘Excellent’ samples is actually misclassified; the others are very close to the threshold of the predicted class.
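Expressed as rates, the reported counts correspond to the following; this is plain arithmetic on the figures given above, not an additional result.

```latex
\text{accuracy} = \frac{64}{68} \approx 0.941 \;(94.1\%),
\qquad
\text{error rate} = \frac{4}{68} \approx 0.059 \;(5.9\%)
```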
The quality class can be predicted correctly even if some parameters are unknown. Take sample No. 1 as an example: if four of the nine water quality parameters are known (here temperature, pH, ammonium, and orthophosphates) and the rest are unknown, the probability of the ‘Very good’ class is still high (77%). This is presented in Figure 7, where yellow-marked nodes represent unknown parameters and grey ones are known (determined). In this case, in addition to the overall water quality class, the quality classes of the unknown parameters are also predicted with a certain probability. Compared to the real data, the parameters EC and oxygen saturation are predicted correctly: the most probable quality class is in line with the real one. However, the model did not accurately predict the classes of the three remaining parameters. The actual BOD5 class is the third one, while the model assigned almost 44% to the first class and only around 19% to the real one. The same holds for total nitrogen oxides (NO): the calculated class is the third, but the model estimated the second as the most probable class (56%), while the probability of the real one is 22.4%. For suspended solids, the real class is the second, to which the model assigned 14%, while the most probable class is class 4 (almost 48%) (Figure 7).
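The probabilities reported for the unknown parameters follow from the network structure: the class posterior obtained from the known parameters is propagated to each unknown child node, P(Xk | evidence) = Σc P(Xk | c) · P(c | evidence). The sketch below illustrates this with invented numbers; the prior, the conditional probability tables, and the function names are hypothetical and are not the values learned in this study.

```python
def class_posterior(prior, cpt, evidence):
    """Posterior over quality classes given the known parameter classes."""
    scores = {}
    for c, p in prior.items():
        for var, state in evidence.items():
            p *= cpt[var][c][state]
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def predict_unknown(prior, cpt, evidence, target):
    """Distribution over the states of an unobserved parameter.

    Computes sum over classes of P(target state | class) * P(class | evidence).
    """
    if target in evidence:
        raise ValueError("target parameter is already observed")
    post = class_posterior(prior, cpt, evidence)
    target_states = next(iter(cpt[target].values())).keys()
    return {s: sum(post[c] * cpt[target][c][s] for c in post)
            for s in target_states}


# Hypothetical numbers for illustration only.
PRIOR = {"Very good": 0.5, "Good": 0.5}
CPT = {
    "pH":   {"Very good": {"c1": 0.8, "c2": 0.2},
             "Good":      {"c1": 0.4, "c2": 0.6}},
    "BOD5": {"Very good": {"c1": 0.6, "c2": 0.4},
             "Good":      {"c1": 0.25, "c2": 0.75}},
}

if __name__ == "__main__":
    # pH is known (class c1); BOD5 is unknown and is predicted from the network.
    print(predict_unknown(PRIOR, CPT, {"pH": "c1"}, "BOD5"))
```

Because the unknown parameter is inferred only through the class node, its predicted distribution reflects how that parameter typically behaves within each quality class, which may explain why individual parameters can be predicted poorly even when the overall class is correct.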
CONCLUSIONS
Water quality prediction is an important issue for efficient water management. Very often some parameters are missing, are expensive or difficult to measure, or require time-consuming sampling. Some of them can change easily and quickly, and the consequences of incorrect predictions can be enormous. The NB classifier is well suited to this purpose because the final decision can be made based on a probability distribution, i.e. on known uncertainty. Further, in the NB model the parameters are conditionally independent, so it is easy to manipulate the data (add, delete, change) within the network. For network construction and manipulation, the software Netica has proven to be very suitable. The simplicity of the method and the relatively low level of data preprocessing are also great advantages and set this method apart from other machine-learning methods that provide satisfactory results.
The model in this work accurately predicted 64 out of 68 cases and gives correct predictions of the overall quality class when some data are missing, which is very important. The NB classifier can thus be recommended as a trustworthy tool in the transition from traditional to digital water management.
Further research in this direction may include the development of a Bayes-theory-based software tool or app that would be linked to a network of sensors on watercourses and provide a quick estimate of the suitability of water quality for certain purposes.
ACKNOWLEDGEMENTS
This work was supported by the Ministry of Education, Science and Technological Development of Serbia (Grant No. 451-03-9/2021-14/200117).
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.