Abstract
This research aimed to improve the assessment of surface water quality in the Constantine region. It compares three classical indices, WQINSF (National Sanitation Foundation Water Quality Index), WQICCME (Canadian Council of Ministers of the Environment Water Quality Index) and WQIAP (weighted arithmetic Water Quality Index), develops a new index, and predicts the WQI values with an artificial neural network (ANN). Principal component analysis (PCA) was used to select the 10 parameters entering the calculation of the classical WQIs and the eight principal components used as input to the newly proposed index (regularized WQI, WQIR). The ANN was then applied to build prediction models for both the classical and the new index. The results show that WQIAP assesses water quality better than the other two classical indices, and that the regularized WQI improves the assessment further. WQIR shows that, after the pollution peak, the water quality does not return to its initial state. ANN modeling offers an effective alternative for predicting the WQI: it predicts the new index WQIR (R2 = 0.999) better than the classical index WQIAP (R2 = 0.99).
HIGHLIGHTS
The first principal components with an eigenvalue greater than 1 are used as input in the calculation of the newly developed index (WQI regularized).
The regularized WQI improves the assessment of the water quality of the Constantine catchment area compared with the classical indices WQIAP (weighted arithmetic), WQINSF and WQICCME.
ANN prediction of classical and regularized WQI is evaluated using six fitness criteria: R, RMSE, MEA, NE, IOS and R%
The classical ANN model is mainly influenced by temperature, OS, NO3 and BOD5.
The regularized ANN model is influenced by Component 2 and Component 6; Component 2 is closely related to the parameters NO3, Tu, pH and temperature, while Component 6 is positively correlated with NO2 and OS.
INTRODUCTION
Surface water quality assessment is important in hydro-environmental management, and many organizations and agencies have adopted water quality indices (WQIs) as a tool for water quality assessment and management. The water quality index (WQI) is a single-valued numerical expression that assesses the quality of a given body of water at a specific location and time, based on several water quality parameters. WQIs also provide classification values that describe the quality status of surface waters and allow the pollutant load to be categorized and classes to be designated (Khuan et al. 2002; Sujana Prajithkumar & Mane 2014; Sutadian et al. 2016).
WQIs reduce a large number of parameters into a simpler expression to allow easy and efficient interpretation of monitoring data (Sujana Prajithkumar & Mane 2014).
The quality parameters commonly measured to assess water quality are divided into physical parameters (temperature, electrical conductivity, taste, total suspended solids (TSS), turbidity, odor, color and total dissolved solids (TDS)), chemical parameters (pH, biochemical oxygen demand (BOD), chemical oxygen demand (COD), dissolved oxygen (DO), total hardness, phosphates, pesticides, nitrates, surfactants and heavy metals) and biological parameters (bacteria and pathogens such as fecal coliforms, Escherichia coli, Cryptosporidium and Giardia lamblia, as well as viruses, fungi, protozoa and parasitic worms).
Most studies analyzing water quality indices using biological parameters are focused on marine systems (Liu et al. 2015; Frena et al. 2019) rather than rivers (Crabill et al. 1999; Lee et al. 2016; Seo et al. 2019). Often these biological parameters are treated as water quality factors (Lin & Ganesh 2013; Seo et al. 2019; Kothari et al. 2021).
The concept of WQI was first introduced in Germany in 1848, where the presence or absence of certain elements in water was used as an indicator of water quality (Abbasi & Abbasi 2012). Horton (1965) developed the first modern WQI. Since then, many variants and applications have been proposed: Yidana et al. (2010) combined GIS with a multivariate statistical method to calculate a WQI, and Achary (2017) used the WQI to assess groundwater quality for drinking in the Bhubaneswar region of Odisha state, India.
Mukaite et al. (2019) proposed an improved WQI (IWQI) that considers both the desirable limit (DL) and the maximum allowable limit of each water parameter. The WQImin index suggested by Nong et al. (2020), in contrast, relies on only five crucial parameters: total phosphorus, fecal coliform, mercury, water temperature and dissolved oxygen. It was constructed using stepwise multiple linear regression analysis and showed excellent performance in assessing water quality.
The water pollution index (WPI) of Hossain & Patra (2020) is based on the standard permissible limits of groundwater parameters recommended by the Bureau of Indian Standards (BIS) and the World Health Organization (WHO). It evaluates the degree of pollution of groundwater intended for drinking using water quality parameters.
Islam et al. (2020) proposed a modified integrated WQI (MIWQI) based on principal component analysis (PCA) and compared it with the entropy WQI (EWQI). These indices rest mainly on four steps: the selection of parameters, the assignment of weights and relative weights, the conversion to a common scale (i.e. the calculation of sub-indices) and the aggregation of the sub-indices (Brown et al. 1970; Abbasi & Abbasi 2012; Sutadian et al. 2016; Hossain & Patra 2020). The aforementioned work indicates that, apart from the number and choice of parameters, the sample size and, in particular, the weighting of the parameters are of great importance in the assessment of water quality. The weighting can be fixed, variable and sometimes subjective, depending on the opinions of experts (Abbasi & Abbasi 2012; Rezaei et al. 2017; Tripathi & Singal 2019).
To better understand water quality and to determine the parameters that directly influence the estimation of the indices, we used surface water quality modeling. Surface water quality is difficult to model because water quality data are limited and monitoring is costly. Artificial intelligence offers a practical solution to several types of environmental problems, as the computations are very fast and require far fewer parameters and input conditions than deterministic models (Hameed et al. 2016). Its application to simulate water quality parameters is cost-effective, fast and reliable.
Artificial neural network (ANN) techniques are among the common methods for water quality prediction; they are used intensively for model comparison and optimization (Sujana Prajithkumar & Mane 2014) and for data compression and prediction (Hameed et al. 2016; Sahaya Vasanthi & Adish Kumar 2019; Singh et al. 2021).
Various ANN applications have been used to predict water quality and to determine the index from independent variables. Sujana Prajithkumar & Mane (2014) used modular neural networks and radial basis function networks to create two models for predicting the WQI of the Panava River in India. Nourani et al. (2013) used a neural network to calculate WQI and found that it performed better than other conventional methods.
Hameed et al. (2016) used the ANN application to predict tropical water quality parameters in the Langat and Klang river basins in Malaysia. Two different models were applied to examine and imitate the relationship of WQI with water quality variables, namely the back propagation neural network (BPNN) and the radial basis function neural network (RBFNN).
Sahaya Vasanthi & Adish Kumar (2019) predicted the WQI of Parakai Lake, Tamil Nadu, India, and showed that the ANN model performs better and is more accurate than the multiple linear regression (MLR) model. Isiyaka et al. (2018) used an ANN and a multivariate statistical technique to reduce the number of parameters and the number of water quality monitoring stations in the Kinta River, Malaysia. Sahoo et al. (2015) proposed an adaptive neuro-fuzzy inference system (ANFIS) for water quality prediction in the Brahmani River. Kouadri et al. (2021) used an ANN to predict the WQI of groundwater in the El Merk region (south-east Algeria), using mineralization, TH, NO3 and NO2 as inputs.
Singh et al. (2021) used two neural-based soft computing techniques, an ANN and a generalized regression neural network (GRNN), and a hybrid soft computing technique, an ANFIS with four membership functions, to predict WQIs in the Khorramabad, Biranshahr and Alashtar subwatersheds in Iran.
In practice, there is no globally accepted methodology for improving a WQI. The absence or poor choice of parameters, insufficient knowledge of the relative importance of the parameters, unadjusted weightings and small databases are the main causes of the weakness of the models.
This work proposes an alternative approach for developing a new index, with the aim of better assessing and predicting water quality in the Constantinois coastal watershed. It combines multivariate statistics (PCA) with ANN regression to improve on the evaluation provided by the classical indices. The newly developed WQIR index takes advantage of the ability of principal components to summarize the information in a reduced number of variables, and it revealed pollution points that were eclipsed in the classical indices.
In the first step, the application of PCA is performed to select the variables to be involved in the calculation of the classical WQI, and the components to be used as input to the developed model.
In the second step, the focus is on finding the best assessment of surface water quality in the basin by comparing three traditional WQIs: the Weighted Arithmetic Index (WQIAP), the National Sanitation Foundation Index (WQINSF), and the Canadian Council of Ministers of the Environment Water Quality Index (WQICCME).
In the third step, we sought to improve the evaluation of the retained WQI by proposing a new WQI index: the WQIR (regularized WQI, based on principal components). Particular attention was given to the weighting of the parameters.
In the fourth step, the neural network model prediction of the WQIAP indices and the new WQIR index is detailed.
MONITORING AREA AND DESCRIPTION
Study area
western Constantinian coasts with an area of 2,424 km²;
central Constantinian with an area of 5,582 km²;
eastern Constantinian coasts with an area of 3,203 km².
Figure 1. Map of the geographical location of the Constantinois coastal watershed.
Data description
Seven sampling stations (Table 1) were chosen to observe water quality in the Constantinois coastal watershed, in the eastern region of Algeria. Sampling was carried out monthly over a period of 9 years (January 2010–December 2018) by agents of the National Water Resources Agency; only one sample per month was taken into account for the temporal follow-up of water quality. The data for the study area suffer from gaps, missing pollution parameters and small sample sizes.
Table 1. Location of the measuring stations in the Constantinois coastal watershed
Code | Sub-basin | State | X (m) Lambert | Y (m) Lambert |
---|---|---|---|---|
Bge.Mexa St.031609 | Oued Kebir East | El Tarf (Bougous) | 1,007,932 | 398,848 |
Bge.ZitEmba St.031102 | Oued Kebir Hammem | Skikda (BekkoucheLakhdar) | 909,629 | 383,821 |
Bge. Zerdezas St.030902 | Oued Safsaf | Skikda (Zerdezas) | 875,820 | 373,112 |
Bge. Guenitra St.030701 | Oued Guebli | Skikda (OumToub) | 851,771 | 385,930 |
Bge. BniZid St.030711 | Oued Guebli | Skikda (BniZid) | 836,630 | 406,182 |
Bge. Cheffia St.031501 | Cotiers Bounamoussa | El Tarf (Cheffia) | 977,367 | 380,540 |
Bge.ElAgreme St.030303 | Cotiers Jijel | Jijel (Kaous) | 779,250 | 385,450 |
Source: National River Basin Agency.
Twenty-three physical and chemical parameters were collected: turbidity (Tu), total suspended solids (TSS), temperature (T), electrical conductivity (Cond), pH, phosphates (PO4), calcium (Ca), magnesium (Mg), sulfates (SO4), chlorides (Cl), bicarbonates (HCO3), sodium (Na), potassium (K), nitrates (NO3), nitrites (NO2), biochemical oxygen demand (BOD5), chemical oxygen demand (COD), dissolved oxygen (DO), oxygen saturation (OS), organic matter (OM), ammonium (NH4), total alkalinity (TA) and dry residue (Rs).
The data set is summarized by box plots (Figure 2), which show the extreme values (minimum and maximum), the median, the quartiles and, where present, outlying values (extremes, coded *). The top border of the box represents the 75th percentile and the bottom border the 25th percentile. The vertical length of the box represents the interquartile range, and the centerline shows the median.
The box plots show that for OM, turbidity, NH4, PO4, NO2, K and TSS the body of the box is small (the rectangles are squashed) and the whiskers are short, which indicates that the values are uniform, weakly scattered and close to the median. The distribution is elongated towards the maximum values for BOD5 and towards the minimum values for COD. For NO3, SO4, pH, DO, Cond, TA, HCO3, Na and temperature, the median (Q2, the 50th percentile) lies at the center of the box; these parameters are homogeneous and the box shape is symmetrical. The box plots also reveal distant values (extremes), which denote the great variability of the data.
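The quantities read from a box plot can be computed directly. The following is a minimal sketch, not the authors' R procedure: the function name `box_summary`, the sample values and the 1.5 × IQR fence used to flag distant values are all assumptions made for illustration.

```python
import numpy as np

def box_summary(values):
    """Return the statistics drawn in a box plot for a 1-D sample."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]                          # ignore missing measurements
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1                                # height of the box
    # Assumed convention: points beyond 1.5 * IQR from the box are "distant".
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = x[(x < low_fence) | (x > high_fence)]
    return {
        "min": x.min(), "q1": q1, "median": median, "q3": q3,
        "max": x.max(), "iqr": iqr, "outliers": outliers,
    }

# Hypothetical monthly values for one parameter, with one extreme reading.
sample = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0]
summary = box_summary(sample)
print(summary)
```

A small box with short whiskers corresponds to a small `iqr`, and a median far from the middle of `q1` and `q3` corresponds to the elongated distributions described above.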
METHODS OF ANALYSIS
Principal component analysis
There is no general rule for selecting the parameters of WQI models, and experts propose several approaches. One is the Delphi method, which organizes the consultation of a group of experts in order to obtain a final, convergent opinion (Saha 2014). The other commonly used approach relies on statistical methods, including Pearson's correlation coefficient and principal component/factor analysis (PCA/FCA) (Abbasi & Abbasi 2012; Tripathi & Singal 2019). PCA accomplishes the task without losing much information (Diallo et al. 2014; Reggam et al. 2015; Bouslah 2017; Tripathi & Singal 2019; Islam et al. 2020). According to Liu et al. (2021), the PCA method is more accurate than the Delphi method, and its application to the calculation of WQIs is well founded. In this study, PCA is applied to reduce the dimensionality of the dataset, to avoid bias in parameter selection, to identify the main possible sources of pollution, and to select the parameters used to develop the ANN prediction model. All analyses were performed using R (version 4.1.2, x64).
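As an illustration of this step, the sketch below runs a PCA on the correlation matrix and keeps the components whose eigenvalue exceeds 1. It is written in Python with NumPy rather than the R software used in the study, the data are synthetic, and the function name `pca_kaiser` is hypothetical.

```python
import numpy as np

def pca_kaiser(X):
    """PCA on the correlation matrix; keep components with eigenvalue > 1."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
    R = np.corrcoef(Z, rowvar=False)                   # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                               # eigenvalue-greater-than-1 rule
    explained = eigvals / eigvals.sum()                # share of total variance
    scores = Z @ eigvecs[:, keep]                      # component scores per sample
    # Loadings: correlations between the variables and the retained components.
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return eigvals, explained, scores, loadings

# Synthetic data: 200 samples, 6 variables driven by 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.3 * rng.normal(size=(200, 6))

eigvals, explained, scores, loadings = pca_kaiser(X)
print("eigenvalues:", np.round(eigvals, 3))
print("components kept:", scores.shape[1])
```

The `scores` array plays the role of the principal components used as inputs of the new index, while the `loadings` help to decide which original variables each component represents.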
Artificial neural network
Artificial neural networks have been used to solve many engineering problems, notably in the environmental field of water quality, because they avoid ambiguity and the eclipsing effect of variables (Behboudian et al. 2014; Hameed et al. 2016; Isiyaka et al. 2018; Garcia et al. 2019). The ANN is characterized by its capacity to model complex and non-linear processes without any prior knowledge of the relationship between input and output variables (Hameed et al. 2016).
The signals are transmitted from the input layer (independent variables) to the hidden layer via a system of weighted connections for processing, and finally to the output layer (dependent variable). The network is trained extensively to optimize the number of hidden nodes.
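The signal path described above can be sketched as a single forward pass. This is only an illustration under stated assumptions: one hidden layer, a tanh activation, a linear output node and four hidden nodes are choices made here, not details reported in the study, and the weights are random rather than trained.

```python
import numpy as np

def forward(x, W_ih, b_h, W_ho, b_o):
    """Propagate inputs through one hidden layer to a single output."""
    hidden = np.tanh(x @ W_ih + b_h)   # input layer -> hidden layer (assumed tanh)
    return hidden @ W_ho + b_o         # hidden layer -> output layer (linear)

rng = np.random.default_rng(0)
n_inputs, n_hidden = 10, 4             # 10 inputs; hidden size is hypothetical
x = rng.normal(size=(5, n_inputs))     # five samples of standardized inputs
W_ih = rng.normal(size=(n_inputs, n_hidden))
b_h = np.zeros(n_hidden)
W_ho = rng.normal(size=(n_hidden, 1))
b_o = np.zeros(1)

y = forward(x, W_ih, b_h, W_ho, b_o)
print(y.shape)
```

Training would adjust `W_ih`, `b_h`, `W_ho` and `b_o`; trying several values of `n_hidden` is one way to search for the number of hidden nodes.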
The data were divided into training (80%) and validation (20%) sets. The training set is used to adjust the weights and learn the model parameters, while the validation set is used to evaluate the performance of the trained network (Thurston et al. 2011; Isiyaka et al. 2018). In this study, the networks are trained and validated based on the formula proposed by Caudill (1988) (Benbouras et al. 2021).
To validate the model and to estimate its generalization capacity, we use the K-fold cross-validation approach, which provides a more accurate and robust assessment of the ability of the optimal model to avoid over-fitting and under-fitting. The approach divides the database into K equal splits (folds). In each round, K − 1 folds are used for training and the remaining one for validation. The procedure is repeated until every fold has been used once for validation (Benbouras et al. 2021).
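The splitting procedure can be sketched as follows. The value K = 5 is an assumption chosen here because each round then reproduces the 80%/20% proportion; the study does not state the value of K, and the function name `k_fold_indices` is hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index arrays for K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)          # shuffle the sample indices once
    folds = np.array_split(perm, k)            # K folds of (nearly) equal size
    for i in range(k):
        validation = folds[i]                  # one fold held out per round
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, validation

splits = list(k_fold_indices(n_samples=100, k=5))
for round_no, (train, validation) in enumerate(splits, start=1):
    print(round_no, len(train), len(validation))
```

A model would be fitted on `train` and scored on `validation` in each round, and the K scores averaged to estimate the generalization error.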
We used the connection weight approach to analyze the importance of the input variables for the prediction of the WQI. For each hidden neuron, the hidden-output connection weight is multiplied by the input-hidden connection weights; summing these products over the hidden neurons for each input neuron gives its contribution to the output (Sekiou 2014).
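A minimal sketch of this computation for a network with one hidden layer and one output is given below. The weight values are invented for illustration, and the normalization into relative shares is an addition made here for readability, not a step taken from the study.

```python
import numpy as np

def connection_weight_importance(W_ih, W_ho):
    """Sum over hidden neurons of (input-hidden weight * hidden-output weight)."""
    W_ih = np.asarray(W_ih, dtype=float)               # shape (n_inputs, n_hidden)
    W_ho = np.asarray(W_ho, dtype=float).reshape(-1)   # shape (n_hidden,)
    raw = W_ih @ W_ho                                  # signed contribution per input
    share = np.abs(raw) / np.abs(raw).sum()            # assumed: relative importance
    return raw, share

# Hypothetical trained weights: 3 inputs, 2 hidden neurons, 1 output.
W_ih = [[0.5, -1.0],
        [2.0,  0.0],
        [0.1,  0.1]]
W_ho = [1.0, 0.5]

raw, share = connection_weight_importance(W_ih, W_ho)
print(raw, share)
```

The sign of `raw` indicates whether an input pushes the predicted index up or down, and `share` ranks the inputs by the magnitude of their contribution.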
Calculation of WQI
The WQI is an index that expresses the overall quality of the water at a certain location and time (Saha 2014), thus determining whether the water in question is of good quality. Water quality indices are used by many water quality monitoring agencies and managers as a strategy to address several environmental issues (Noori et al. 2019; Tripathi & Singal 2019; Kouadri et al. 2021).
In this work, we tested three classical indices (WQINSF, WQICCME and WQIAP) for the best evaluation of the surface water quality of the Constantinois basin. The three indices are presented in Table 2.
Table 2. Use, aggregation form and interpretation of WQI
Name | Use | Aggregation form | Interpretation | Reference |
---|---|---|---|---|
WQINSF | General surface water quality evaluation | Weighted geometrical average | 0: very bad; 100: excellent | Brown et al. (1970), Noori et al. (2019) |
WQICCME | General surface water quality evaluation | The measure for amplitude, F3, is calculated as follows. An excursion is the number of times by which an individual concentration is greater than (or less than, when the objective is a minimum) the objective; it is computed separately for test values that must not exceed the objective and for test values that must not fall below it. The collective amount by which individual tests are out of compliance is calculated by summing the excursions of individual tests from their objectives and dividing by the total number of tests (both those meeting objectives and those not meeting objectives). This variable is referred to as the normalized sum of excursions (nse). F3 is then calculated by an asymptotic function that scales the normalized sum of the excursions from objectives (nse) to yield a range between 0 and 100. | 0: bad; 100: excellent | CCME (2001), Haile & Gabbiye (2021) |
WQIAP (weighted arithmetic) | General surface water quality evaluation | Weighted arithmetic average | 0: excellent; 100: bad | Abbasi & Abbasi (2012) |
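The three aggregation forms named in Table 2 can be sketched in code. The forms below are the commonly published ones from the cited references (weighted geometric mean, CCME 2001 factors F1 to F3, weighted arithmetic mean with weights inversely proportional to the standards); they are an assumption about what the table's equations contained, and the function names, argument names and numerical values are hypothetical.

```python
import numpy as np

def wqi_weighted_arithmetic(values, standards, ideals):
    """Weighted arithmetic index: 0 is excellent, 100 corresponds to the standard."""
    v, s, v0 = (np.asarray(a, dtype=float) for a in (values, standards, ideals))
    q = 100.0 * (v - v0) / (s - v0)            # quality rating of each parameter
    w = (1.0 / s) / np.sum(1.0 / s)            # unit weights, proportional to 1/standard
    return float(np.sum(w * q) / np.sum(w))

def wqi_weighted_geometric(sub_indices, weights):
    """Weighted geometric mean of sub-indices scored from 0 (bad) to 100."""
    q = np.asarray(sub_indices, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                            # weights are normalized to sum to 1
    return float(np.prod(q ** w))

def wqi_ccme(tests, objectives, must_not_exceed):
    """CCME index from a (n_samples, n_variables) array of complete test values."""
    t = np.asarray(tests, dtype=float)
    obj = np.asarray(objectives, dtype=float)
    upper = np.asarray(must_not_exceed, dtype=bool)
    failed = np.where(upper, t > obj, t < obj)             # failed tests
    f1 = 100.0 * failed.any(axis=0).sum() / t.shape[1]     # scope: failed variables
    f2 = 100.0 * failed.sum() / t.size                     # frequency: failed tests
    with np.errstate(divide="ignore", invalid="ignore"):
        excursion = np.where(upper, t / obj - 1.0, obj / t - 1.0)
    nse = excursion[failed].sum() / t.size                 # normalized sum of excursions
    f3 = nse / (0.01 * nse + 0.01)                         # amplitude, scaled to 0-100
    return float(100.0 - np.sqrt(f1**2 + f2**2 + f3**2) / 1.732)

# Hypothetical example: two samples of two variables; the first variable must
# stay below 5 and the second must stay above 6.
ccme_value = wqi_ccme([[10.0, 8.0], [2.0, 9.0]], [5.0, 6.0], [True, False])
print(round(ccme_value, 2))
```

Note the opposite orientation of the scales: a high value is good for the geometric and CCME forms and bad for the weighted arithmetic form, as stated in the interpretation column of Table 2.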
Development of a new index: the regularized WQI, a principal component-based index (WQIR)
To investigate the estimation of water quality by indices further, we propose a new index whose inputs are the scores of the principal components with eigenvalues greater than 1. This variant of the index is intended to improve the assessment of water quality, particularly where data are lacking or scarce. The new index is based on the final aggregation of the weighted arithmetic index, with redeveloped weights. Four steps are necessary to calculate the new WQIR index.
- 1. Selection of the input parameters for the WQIR calculation

$$PC_1 = c_{11}X_1 + c_{12}X_2 + c_{13}X_3 + \cdots + c_{1p}X_p \quad (\text{axis } Y_1)$$
$$PC_2 = c_{21}X_1 + c_{22}X_2 + c_{23}X_3 + \cdots + c_{2p}X_p \quad (\text{axis } Y_2)$$
$$\vdots$$
$$PC_p = c_{p1}X_1 + c_{p2}X_2 + c_{p3}X_3 + \cdots + c_{pp}X_p \quad (\text{axis } Y_p)$$

where $X_1, \ldots, X_p$ are the original variables and $c_{ij}$ is the coefficient of variable $j$ in component $i$.
- 2. Calculation of the parameter rating scale
According to this new approach, the water quality rating is based mainly on the observed concentration and the standard allowable concentration of the new parameters, and it can be adjusted to the total number of applied variables, as desired by the user.
- 3. Establishment of weights
In general, weights are assigned to parameters based on their relative importance and influence on the final index value (Sutadian et al. 2016). We propose to use the contribution of each component to the total variance as its weight, in order to avoid any subjectivity and the arbitrary selection of weights.
- 4. Calculation of the final index
The WQI was divided into five water quality classes based on the weighted arithmetic WQI method.
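The weighting and aggregation steps can be illustrated numerically. The sketch below takes as weights the explained variances of the eight retained components reported in Table 3 and applies a weighted arithmetic aggregation; normalizing the weights to sum to 1, the helper name `aggregate` and the rating values `q_example` are assumptions made for illustration, since the rating-scale equation itself is not reproduced here.

```python
import numpy as np

# Explained variance (%) of the eight retained components, as reported in Table 3.
variance_pct = np.array([26.344, 9.506, 8.162, 6.312, 5.304, 5.207, 4.855, 4.494])

# Assumed normalization: each weight is the component's share of the retained variance.
weights = variance_pct / variance_pct.sum()

def aggregate(ratings, w):
    """Weighted arithmetic aggregation of the component ratings."""
    ratings = np.asarray(ratings, dtype=float)
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * ratings) / np.sum(w))

# Hypothetical ratings of the eight components for one sample.
q_example = np.array([40.0, 55.0, 30.0, 60.0, 45.0, 50.0, 35.0, 65.0])
wqir_example = aggregate(q_example, weights)

print(np.round(weights, 3))
print(round(wqir_example, 2))
```

With this choice the first component carries a little over a third of the total weight, so the mineralization axis dominates the aggregated value unless the other ratings differ strongly.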
RESULTS AND DISCUSSION
Determination of WQI entries
The PCA results presented in this study relate to the first stage of WQI development, i.e. parameter selection. Water quality parameters are divided into physical and chemical parameters and biological parameters (coliforms, heterotrophic bacteria); the latter are important indicators of water quality related to human health. Researchers have used biological parameters in the determination of indices and shown their great usefulness, but most studies treat coliforms as water quality factors (Lin & Ganesh 2013; Seo et al. 2019; Kothari et al. 2021).
The number of parameters used varies from one model to another: only five or six parameters for the Malaysian index models (Hameed et al. 2016), and up to 47 for the BCWQI. In general, the 10 most used parameters are temperature, turbidity, pH, TDS, fecal coliforms (FC), dissolved oxygen (DO), biochemical oxygen demand (BOD5), chemical oxygen demand (COD), nitrite (NO2) and ammoniacal nitrogen (NH3).
The principal component analysis was carried out on the data from the seven stations and the 23 variables: COD, BOD5, OM, DO, NH4, NO2, NO3, PO4, SO4, Cl, Tu, TSS, Rs, Cond, TA, Mg, Na, Ca, T, pH, HCO3, K and OS.
Following the application of PCA, eight principal components with eigenvalues greater than 1 were extracted according to Kaiser's rule; together they explain more than 70% of the total variance, as shown in Table 3 (Abbasi & Abbasi 2012).
Table 3. Results of determination of eigenvalues and explained variances
Statistic | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 |
---|---|---|---|---|---|---|---|---|---|
Eigenvalue | 6.06 | 2.186 | 1.877 | 1.452 | 1.220 | 1.197 | 1.116 | 1.033 | 0.942 |
Variability (%) | 26.344 | 9.506 | 8.162 | 6.312 | 5.304 | 5.207 | 4.855 | 4.494 | 4.094 |
Cumulative (%) | 26.344 | 35.851 | 44.013 | 50.325 | 55.629 | 60.837 | 65.692 | 70.186 | 74.280 |
Matrix of principal components
Variable | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 |
---|---|---|---|---|---|---|---|---|
BOD5 | 0.093 | 0.070 | −0.465 | 0.606 | −0.229 | 0.069 | 0.221 | 0.223 |
COD | 0.135 | 0.402 | 0.417 | 0.473 | 0.109 | −0.163 | −0.215 | 0.102 |
OM | 0.033 | 0.022 | 0.199 | 0.140 | 0.334 | −0.031 | 0.077 | 0.513 |
NH4 | 0.077 | 0.301 | 0.203 | 0.103 | −0.110 | 0.320 | 0.016 | −0.546 |
NO2 | 0.333 | 0.376 | −0.082 | 0.150 | 0.030 | 0.520 | −0.218 | 0.141 |
NO3 | 0.101 | 0.577 | −0.043 | −0.151 | −0.239 | 0.219 | −0.279 | 0.154 |
OS | −0.232 | −0.191 | 0.758 | −0.105 | 0.049 | 0.414 | 0.138 | 0.110 |
PO4 | −0.001 | 0.142 | −0.170 | 0.360 | 0.264 | 0.317 | 0.374 | −0.416 |
SO4 | 0.839 | 0.111 | −0.005 | −0.031 | 0.194 | 0.109 | −0.101 | 0.132 |
Cl | 0.725 | −0.109 | 0.274 | 0.211 | −0.067 | −0.235 | 0.246 | −0.148 |
Tu | −0.327 | 0.493 | 0.000 | 0.367 | 0.226 | −0.098 | 0.012 | 0.006 |
TSS | 0.167 | 0.117 | −0.452 | −0.273 | 0.376 | 0.017 | 0.281 | 0.070 |
Rs | 0.954 | −0.010 | 0.063 | −0.024 | 0.096 | −0.059 | 0.067 | 0.001 |
pH | −0.025 | −0.526 | −0.162 | 0.120 | 0.050 | 0.394 | 0.359 | 0.260 |
DO | −0.276 | 0.459 | 0.525 | −0.287 | −0.184 | 0.089 | 0.479 | 0.167 |
Cond | 0.927 | 0.008 | 0.152 | 0.000 | 0.081 | −0.122 | 0.046 | −0.016 |
Ca | 0.701 | 0.205 | −0.035 | 0.005 | 0.245 | 0.100 | −0.121 | 0.072 |
Mg | 0.778 | −0.192 | 0.022 | −0.030 | −0.162 | 0.006 | −0.009 | 0.040 |
Na | 0.835 | −0.109 | 0.252 | 0.063 | 0.029 | −0.193 | 0.161 | −0.104 |
TA | 0.742 | 0.074 | −0.126 | −0.237 | −0.158 | 0.165 | −0.012 | −0.024 |
HCO3 | 0.522 | −0.018 | −0.227 | −0.259 | −0.180 | 0.209 | −0.033 | −0.007 |
T | −0.026 | −0.716 | 0.236 | 0.227 | 0.215 | 0.271 | −0.414 | −0.082 |
K | 0.217 | −0.165 | 0.041 | 0.361 | −0.653 | 0.015 | 0.034 | 0.144 |
Figure 4. (a) Projection of the variables on the principal plane 1–2. (b) Correlation matrix between the variables and the principal components.
Figure 4(a) shows a homogeneous distribution of the variables in the 1–2 projection plane (Dim1–Dim2) and the absence of a size effect, which indicates that all the pollution parameters are well represented.
The analysis of the factorial plane F1 (Dim1)–F2 (Dim2) shows that about 36% of the information is expressed. The F1 (Dim1) axis carries 26.3% of the variance and characterizes the mineralization of the waters. It is determined by Rs, Cond, TA, SO4, Cl, Mg, Na and Ca, which are strongly correlated with each other and positively correlated with F1, since their vectors point in the same direction. They present the following correlations, among others: Cond&Rs (0.919), Na&Rs (0.852), Na&Cond (0.817), Cond&SO4 (0.799), Cond&Cl (0.726) and Mg&Rs (0.702). The correlations between these variables are stronger when the variables are positioned at the ends of the axis defined by principal component 1.
The F2 (Dim2) axis represents only 9.5% of the information and is considered an axis characterizing organic and agricultural pollution. It is determined by NO3, NO2, NH4, COD, DO, Tu, PO4, pH and temperature (Figure 4(a) and (b)). These variables are probably better explained by principal components other than PC1 and PC2.
Figure 6. (a) Correlation matrix of the 10 variables. (b) Combination between the correlogram and the significance test.
Several significant correlations could be identified (Figure 6(b)): OS with BOD5 and TSS; pH with temperature, NO3, COD and BOD5; NO2 with NO3 and conductivity; and TSS with OS and temperature. This shows the important role that these elements play in determining the salt load of these waters.
Assessment of water quality by calculating WQIAP, WQINSF, WQICCME and WQIR
The WQIAP, WQINSF and WQICCME values calculated for the seven stations show notable similarities and dissimilarities in the assessment of water quality, as shown in Figure 7.
Comparison between the three Indices used in the assessment of surface water quality in the Constantine Watershed from 2010 to 2018.
The weighted arithmetic index WQIAP revealed pollution peaks in summer 2013 for the Mixa station, in autumn 2011 for the Zit El-Emba station, and in autumn 2012 for both the Zerdezas and Guenitra stations. The same index also shows pollution in autumn 2016 and in winter and spring 2017 for the Zerdezas station, and in summer 2018 for the Guenitra station. These pollution peaks were not clear when the two indices WQICCME and WQINSF were used.
The high WQIAP values recorded at all stations probably originate from the large volume of river flow in the wet season and the volume of storm flooding in the summer season, in accordance with the comments of Noori et al. (2019), and also from the bypassing of wastewater to the natural environment instead of its delivery to the treatment plant.
Contrary to the other indices, the WQIAP showed that the overall quality of the waters at all the stations studied was, apart from some good values, considerably degraded (‘bad to very bad’), with a significant number of values unsuitable for consumption, whereas the WQICCME and WQINSF indices rate these waters as fair and good, respectively. This result agrees with House (1989) and Gao et al. (2020), who note that the weighted arithmetic index WQIAP provides the best results for indexing general water quality (Fernandez et al. 2005).
The differences between the evaluations given by the three indices, which are based on the same calculation parameters, are due a priori to the final aggregation formula, to the specificity of each index to the geographical region where it was developed, and to their water quality classification ranges (Fernandez et al. 2005).
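The effect of the aggregation formula can be illustrated with a minimal sketch of a weighted arithmetic index in its commonly cited form: each parameter receives a quality rating q_i = 100 (V_i - V_ideal) / (S_i - V_ideal) and a unit weight w_i inversely proportional to its standard S_i, and the index is the weighted mean of the ratings. This form, the function name, the standards and the sample values are assumptions for illustration only; they are not the parameters or limits used in this study.

```python
def weighted_arithmetic_wqi(measured, standard, ideal):
    """Weighted arithmetic WQI: sum(w_i * q_i) / sum(w_i).

    measured, standard, ideal: dicts keyed by parameter name (hypothetical).
    """
    # Proportionality constant K chosen so that the unit weights sum to 1.
    k = 1.0 / sum(1.0 / standard[p] for p in measured)
    weighted_sum = 0.0
    weight_total = 0.0
    for name, value in measured.items():
        w = k / standard[name]  # unit weight, inversely proportional to S_i
        q = 100.0 * (value - ideal[name]) / (standard[name] - ideal[name])
        weighted_sum += w * q
        weight_total += w
    return weighted_sum / weight_total


# Hypothetical sample: values, permissible limits and ideal values are made up.
sample = {"pH": 7.5, "BOD5": 6.0, "NO3": 25.0}
limits = {"pH": 8.5, "BOD5": 5.0, "NO3": 50.0}
ideals = {"pH": 7.0, "BOD5": 0.0, "NO3": 0.0}
wqi = weighted_arithmetic_wqi(sample, limits, ideals)
```

Because the weights scale with 1/S_i, parameters with strict limits (here BOD5) dominate the score, which is one reason an index of this type can flag pollution peaks that other aggregation schemes smooth out.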
To further investigate water quality, we developed a new index based on principal components by replacing the classical index entries with the first eight principal components with eigenvalues greater than 1.
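The selection rule used here (keeping components whose eigenvalue exceeds 1) can be sketched as follows. The sketch assumes a PCA on the correlation matrix, in which case the eigenvalues sum to the number of variables and each component's share of variance is its eigenvalue divided by that sum; under this assumption, the 26.3% reported for Dim1 with 23 variables would correspond to an eigenvalue of about 0.263 × 23 ≈ 6.05. The function name and the eigenvalues in the example are hypothetical.

```python
def kaiser_selection(eigenvalues):
    """Keep components with eigenvalue > 1 and report their variance share.

    Returns (kept, cumulative) where kept is a list of
    (component_number, eigenvalue, percent_of_variance) tuples.
    """
    # For a correlation-matrix PCA this total equals the number of variables.
    total = sum(eigenvalues)
    kept = [
        (i + 1, ev, 100.0 * ev / total)
        for i, ev in enumerate(eigenvalues)
        if ev > 1.0
    ]
    cumulative = sum(share for _, _, share in kept)
    return kept, cumulative


# Hypothetical eigenvalues of a 4-variable correlation matrix.
kept, cumulative = kaiser_selection([2.4, 1.3, 0.8, 0.5])
for number, eigenvalue, share in kept:
    print(f"PC{number}: eigenvalue={eigenvalue:.2f}, variance={share:.1f}%")
print(f"cumulative variance of kept components: {cumulative:.1f}%")
```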
Variation of the WQIAP and the WQIR of the surface waters of the Constantinois Coastal Watershed from 2010 to 2018.
Also, in contrast to WQIAP, the evolution of WQIR shows that after the pollution peaks (summer 2014, winter 2010 and autumn 2011), the water quality at the Mixa, Zit El Emba and El Egreme stations, respectively, did not return to its initial state.
The difference in evaluation between the classical WQIAP and the WQIR probably comes down to the fact that the WQIAP uses 10 parameters (selected by PCA), whereas the WQIR uses a summary of the 23 parameters provided by the first eight principal components, which allowed the detection of the eclipsed pollution points.
Furthermore, the variations among the WQIs were confirmed by the ANOVA results. WQIAP differed significantly from WQINSF and WQICCME (P = 0.015 < 0.05) and very highly significantly from WQIR (P < 0.001). According to the Tukey test, which separates the means into groups, the means are significantly different, with the highest recorded for WQIR, intermediate values for WQICCME and WQINSF, and the lowest for WQIAP.
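For readers who want to reproduce this kind of comparison, the F statistic of a one-way ANOVA can be computed as in the sketch below. It is a generic textbook computation, not the authors' script; the group values are made up, and converting F to a P value additionally requires the F distribution (for example from a statistics library), which is left out here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical index values for three indices (one list per index).
f_stat, df_b, df_w = one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
```

A large F relative to the F distribution with (df_between, df_within) degrees of freedom indicates that at least one index mean differs; a post hoc test such as Tukey's is then needed to say which ones.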
Prediction of the classical weighted arithmetic index WQIAP and the regularized index WQIR
Two ANN prediction models were proposed. The first was built on the 10 variables DO, BOD5, COD, NO3, NO2, PO4, TSS, Cond, T and pH as input parameters, with WQIAP as the output parameter. The second takes the first eight principal components as input and the WQIR (regularized weighted arithmetic index) as output.
Tested ANN models of the first case
Model | Inputs | Hidden neurons | Outputs | MAE | RMSE | IOS | R | NE
---|---|---|---|---|---|---|---|---
M1 | 10 | 5 | 1 | 11.6197 | 5.1003 | 0.0977 | 0.9842 | 0.9687
M2 | 10 | 6 | 1 | 11.3513 | 4.0292 | 0.0772 | 0.9901 | 0.9805
M3 | 10 | 7 | 1 | 9.7247 | 2.9393 | 0.0563 | 0.9948 | 0.9896
M4 | 10 | 8 | 1 | 11.5119 | 2.8565 | 0.0547 | 0.9951 | 0.9902
M5 | 10 | 9 | 1 | 11.4436 | 4.0187 | 0.0769 | 0.9908 | 0.9806
M6 | 10 | 10 | 1 | 10.0134 | 3.6638 | 0.0702 | 0.9921 | 0.9838
M7 | 10 | 11 | 1 | 10.1393 | 4.4953 | 0.0862 | 0.9877 | 0.9756
M8 | 10 | 12 | 1 | 11.378 | 4.4286 | 0.0848 | 0.9882 | 0.9771
This result is also reported by Singh et al. (2021), who note that the model (10.8.1) produces good agreement of the predicted values, with R2 and RMSE equal to 0.9810 and 0.1324, so the two sets of results are similar. A comparable architecture is reported by Hameed et al. (2016), who mention that the tropical water index using the RBFNN model showed the best performance evaluation criteria of R2, RMSE, and NE (0.9872, 0.0157, and 0.9871, respectively) with six input layer neurons (number of water quality parameters), eight hidden layer neurons, and one output layer neuron (WQI as target) (6.8.1). In contrast, Isiyaka et al. (2018), using a multilayer ANN, show that a model with 14 physicochemical parameters as input and 10 nodes in the hidden layer with a target output (14.10.1.1) gives the best performance criteria (highest R2 = 0.998 with lowest RMSE = 0.432).
The results indicate that, apart from the number and choice of parameters, the nature of the learning algorithm and the type and architecture (structure) of the model (number of hidden layers) are of great importance to achieve the best prediction model.
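To make the architecture notation concrete, the sketch below builds a fully connected network from a list of layer sizes such as [10, 8, 1], counts its trainable parameters, and runs a forward pass. The tanh hidden activation, the linear output, the random initial weights and all function names are assumptions chosen for illustration; the section does not state the authors' activation functions or training algorithm, and no training is performed here.

```python
import math
import random


def init_mlp(layer_sizes, seed=0):
    """Create untrained weights and zero biases for a fully connected network."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def count_parameters(layers):
    """Number of trainable weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


def forward(layers, x):
    """Forward pass: tanh on hidden layers, linear output layer (assumed)."""
    for idx, (weights, biases) in enumerate(layers):
        z = [sum(wi * xi for wi, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
        is_output = idx == len(layers) - 1
        x = z if is_output else [math.tanh(v) for v in z]
    return x


classical = init_mlp([10, 8, 1])   # 10 inputs, 8 hidden neurons, 1 output
reduced = init_mlp([8, 4, 1])      # 8 inputs, 4 hidden neurons, 1 output
print(count_parameters(classical))  # 97
print(count_parameters(reduced))    # 41
```

Under these assumptions a 10.8.1 network has 97 trainable parameters and an 8.4.1 network has 41, which gives one simple way to compare the complexity of candidate architectures.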
Performance of the selected ANN model
Model | Inputs | Hidden neurons | Outputs | MAE | RMSE | IOS | R | NE
---|---|---|---|---|---|---|---|---
M1 | 8 | 2 | 1 | 0.2034 | 0.70852 | 0.00361 | 0.99994 | 0.9987
M2 | 8 | 3 | 1 | 0.2693 | 1.09485 | 0.00558 | 0.99986 | 0.9969
M3 | 8 | 4 | 1 | 0.2026 | 0.35369 | 0.00180 | 0.99998 | 0.9996
M4 | 8 | 5 | 1 | 0.3175 | 1.76558 | 0.00901 | 0.99965 | 0.9922
For the second model, the architecture (8.4.1), with eight input layer neurons (the principal components with eigenvalues greater than 1, multiplied with the raw data), four hidden layer neurons and one output layer neuron (WQIR as target) (Figure 11), yielded the best model for predicting the WQI, with R2 = 0.9998 and RMSE = 0.155 (Table 6).
The selected model appears less complex and more efficient (a single hidden layer, high R2 and low RMSE) than the two models proposed by Isiyaka et al. (2018), whose more complex architectures contain two hidden layers (Table 7).
Performance of the best model in the different phases
Model | Architecture | R2 | RMSE
---|---|---|---
Selected Model | 8.4.1 | 0.9998 | 0.155
Model 1 | 14.8.1.1 | 0.999 | 0.159
Model 2 | 6.4.1.1 | 0.950 | 2.351
This indicates that the model with reduced data set dimensionality has the best combination of inputs and outputs capable of predicting WQI with high accuracy. The alternative approach that we proposed to develop a new index (WQIR) gave commendable results compared to the classical WQI because of the advantages of principal components in summarizing all the information in a reduced number of variables.
Table 8 summarizes the training, validation, and overall performance of the two ANN prediction models. The results show the high performance of the ANN models in predicting the WQIAP and WQIR indices, with a marked improvement in favor of the WQIR prediction model, which is parsimonious and achieves a low RMSE and a high R.
Performance of the best model in the different phases
Model | Metric | Training | Validation | All
---|---|---|---|---
M1 | RMSE | 2.7045 | 3.3929 | 2.8565
M1 | R | 0.9953 | 0.99501 | 0.99512
M2 | RMSE | 0.3805 | 0.2156 | 0.3536
M2 | R | 0.99998 | 0.99999 | 0.99998
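The fitness criteria reported in the tables can be computed as in the following sketch, which uses standard textbook definitions: mean absolute error, root mean square error, Pearson correlation R and Nash-Sutcliffe efficiency NE. The definition of IOS as RMSE divided by the mean of the observed values (a scatter index) is an assumption; it is compatible with the roughly constant RMSE/IOS ratio within each table, but the authors' exact formulas are not given in this section. The data in the example are made up.

```python
import math


def fitness_criteria(observed, predicted):
    """Return MAE, RMSE, IOS, R and NE for paired observed/predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    errors = [p - o for o, p in zip(observed, predicted)]
    sse = sum(e * e for e in errors)

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sse / n)
    ios = rmse / mean_o  # assumed definition: RMSE normalized by the observed mean

    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / math.sqrt(var_o * var_p)   # Pearson correlation coefficient
    ne = 1.0 - sse / var_o               # Nash-Sutcliffe efficiency

    return {"MAE": mae, "RMSE": rmse, "IOS": ios, "R": r, "NE": ne}


# Hypothetical calculated vs. ANN-predicted index values.
scores = fitness_criteria([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
```

Computing the criteria separately on the training and validation subsets, as in Table 8, shows whether a model that fits the training data also generalizes.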
Correlation between predicted WQI and calculated WQI. (a) ANN model using 10 variables. (b) ANN model using eight principal components.
Relative error distribution. (a) ANN model using 10 variables. (b) ANN model using eight principal components.
Relative importance of input variables on the ANN prediction model of WQIAP.
Relative importance of input variables on the ANN model for predicting WQIR.
CONCLUSION
This study is part of the search for a better evaluation and prediction of the water quality of the Constantine coastal watershed. It uses principal component analysis (PCA) in combination with ANN regression to improve the evaluation of data by classical indices, to develop a new WQI and to establish predictive ANN models.
So, data collection and analyses are discussed, along with the various water quality parameters and indices for the assessment of the environmental aspects of surface resources.
Principal component analysis simplified the initial WQI models constructed from 23 variables down to only 10 variables (Cond, DO, BOD5, COD, NO3, NO2, PO4, TSS, pH, and T) and extracted eight principal components, explaining over 70% of the variance, to be used in the development of the new WQIR index and a regularized ANN regression model.
Comparing the three indices CCME-WQI, NSF-WQI, and AP-WQI, the latter adapts best to the geographical region, better assesses the water quality of the Constantinois coastal watershed, and reveals more pollution peaks and seasons. These are mainly due to the large volume of river flow in the wet season, the volume of storm floods in the summer season, and the bypassing of wastewater to the natural environment instead of to the treatment plant.
Generally, the ANN is a technique for predicting surface water quality that supports proper management of the river (threshold), so that adequate measures can be taken to keep pollution within permissible limits.
For a better assessment, we developed the WQIR, whose inputs are the first principal components with eigenvalues greater than 1. This index relies on specific principal components to condense the data into a small number of variables, and it reveals water quality during and after pollution peaks.
This model is a new technique based on a flexible mathematical structure that is capable of identifying complex non-linear relationships between input and output data when compared with other classical modeling techniques.
Hence, the ANN model appears able to describe the behavior of water quality parameters more accurately than linear regression models, and we observed that the WQIR performs better than the classical WQIAP.
Thus, the WQIR is a suitable approach that meets the needs of researchers, since it allows easy and efficient interpretation of monitoring data and thereby helps to ensure a sustainable, green environment.
In fact, the use of multivariate statistics in combination with artificial intelligence techniques allowed us to better assess water quality, to reveal eclipsed pollution points, to develop a new index and to establish parsimonious prediction models.
This study is a first research work on the development of a new WQI that best suits the data-poor study region.
ACKNOWLEDGEMENTS
The authors are thankful to the ANRH for providing the necessary data for the study reported in the paper free of charge; and to the anonymous reviewers for their careful reading and precious comments.
COMPETING INTERESTS
The authors are not affiliated with or involved with any organization or entity with any financial interest or nonfinancial interest in the subject matter or materials discussed in this paper.
FUNDING
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.