Soft computing technique-based prediction of water quality index

Water quality plays a crucial role in the management of water resources. Water quality indexes (WQIs) are frequently used to assess water quality for drinking purposes. A WQI can be determined by chemical analysis, which might not, however, be viable over long periods for all rivers at a country scale. Thus, in this investigation, two neural-based soft computing techniques, an artificial neural network (ANN) and a generalized regression neural network (GRNN), and one hybrid soft computing technique, an adaptive neuro-fuzzy inference system (ANFIS) with four membership functions, were used to predict WQIs in the Khorramabad, Biranshahr and Alashtar sub-watersheds in Iran. Ten distinct physicochemical parameters were used as input variables and the WQI as output. A correlation plot and a pairs plot were used to ascertain the relation between input and output variables. The soft computing techniques were compared using six fitness criteria: Nash-Sutcliffe efficiency (NSE), mean absolute error (MAE), Legates-McCabe Index (LMI), root mean square error (RMSE), mean absolute percentage error (MAPE), and coefficient of correlation (CC). Results indicated that ANN predicted the WQI better than GRNN and ANFIS. Among the different membership functions of ANFIS, ANFIS_trimf performed far better than the others. Thus, it was concluded that ANN is a viable tool for the prediction of a WQI.


INTRODUCTION
Industries, agriculture, and people pollute water resources through a variety of activities (Katyal 2011). In some watersheds, pollution has exceeded the permissible limit. There is global concern because water quality has degraded almost everywhere (Adriaenssens et al. 2004; Azad et al. 2018). Good-quality water is fundamental for sustainable living, so the abatement of pollution and the protection of water resources are necessary, and both require an assessment of water quality (Witek & Jarosiewicz 2009; Reza & Singh 2010; Tiri et al. 2018). Water quality can be determined by chemical, physical and biological analyses (Abbasi & Abbasi 2012; Ewaid & Abed 2017; Medeiros et al. 2017; Tiri et al. 2018). A Water Quality Index (WQI) is one of the commonly used methods for the assessment of water quality (Medeiros et al. 2017; Tiri et al. 2018). A WQI aggregates several parameters into a single grade of water quality and can hence be used to classify the health of water systems, such as rivers (Hasan et al. 2015; Ewaid & Abed 2017).
WQIs were proposed by Horton (1965) and Brown et al. (1970), and various methods have since been developed for calculating them (Debels et al. 2005; Tsegaye et al. 2006; Saeedi et al. 2010). Using a WQI, Kannel et al. (2007) analyzed seasonal and spatial changes in the water quality of the Bagmati River. Debels et al. (2005) calculated a WQI using nine physicochemical parameters in the Chillán River.  combined GIS with a multivariate statistical method to calculate a WQI.
Soft computing techniques have been devised to address the non-stationarity and non-linearity of water quality. These techniques are attractive because they model water quality directly and quickly (Gaya et al. 2020; Karim & Kamsani 2020; Yasin & Karim 2020; Hmoud Al-Adhaileh & Waselallah Alsaade 2021) and can greatly reduce errors and computation time (Bhagat et al. 2019). The M5P model tree, adaptive neuro-fuzzy inference system (ANFIS), support vector machine (SVM), random forest (RF), Gaussian process (GP), and artificial neural network (ANN) are among the most frequently used soft computing techniques (Barzegar et al. 2016). Chen & Zheng (2008) used soft computing techniques to predict water quality and observed that ANN gave the most appropriate results. Singh et al. (2009) implemented an ANN model of water quality to compute dissolved oxygen and biological oxygen demand. Particle swarm optimization with ANN was applied to predict the water quality of sewage effluent by Zheng et al. (2010). Gao et al. (2015) applied a special type of neural network, a back propagation neural network combined with particle swarm optimization, to the prediction of water quality. Nourani et al. (2013) employed ANN for computing a WQI and found that it outperformed conventional methods. Emamgholizadeh et al. (2014) used ANN and ANFIS to estimate a WQI in the Karoon watershed and found that ANN predicted better than ANFIS. Different combinations of soft computing techniques have also been employed for the estimation of WQIs (Yaseen et al. 2018).
The reliability of soft computing techniques for WQIs has been amply demonstrated (Bui et al. 2020; Gaya et al. 2020; Najafzadeh & Lottfi-Dashbalagh 2020; Tung & Yaseen 2020; Riahi-Madvar et al. 2021). Since the water quality of all rivers cannot always be analyzed at frequent intervals on a countrywide scale over a substantial period of time (De LR Wagener et al. 2019), the prospect of modeling water quality easily and with fewer parameters motivates the use of soft computing techniques. Although a large body of literature predicts water quality with soft computing techniques, no study has predicted the water quality of three sub-watersheds with a combination of ANFIS, ANN and GRNN (generalized regression neural network), which underlines the novelty of this work. Laboratory analysis of water quality is a costly and time-consuming process that requires sample collection, transportation and testing. In this regard, the study presents an alternative, real-time approach based on soft computing techniques for predicting water quality. The objectives of this study are: (i) to develop a neuro-fuzzy based model, an ANFIS, for the prediction of a WQI for the Khorramabad, Biranshahr and Alashtar sub-watersheds in Iran; (ii) to validate the output of ANN, GRNN and ANFIS; and (iii) to compare the performances of the soft computing techniques using model fitness criteria.
The paper is organized as follows. In the second section, a description of the soft computing techniques and study area is given. Details of the methodology, data description, and model fitness criteria are also given. Results and discussion summarizing the performances of soft computing techniques are presented in the third section, followed by conclusions and references.

Soft computing techniques
Artificial neural network (ANN)
ANN, a concept inspired by the human brain, is a widely used prediction technique (Sepahvand et al. 2019; Sihag et al. 2020; Singh 2020). It has a brain-like architecture of interconnected neurons and contains a single input layer, a single target layer and one or more hidden layers. Every layer has a certain number of nodes, and the weighted connections between the layers describe the relationships among the nodes. The input layer, which has as many nodes as there are input parameters, delivers data to the network but takes no part in processing. The target layer is the last processing unit. As input information moves through the links among the nodes, the values are multiplied by the associated weights and added together to give the final target ($Z_d$) of the unit:

$$Z_d = \sum_{c} A_{cd} B_c$$

where $A_{cd}$ = weight of the interconnection from unit c to unit d, $B_c$ = input value at the input layer, and $Z_d$ = the weighted sum that the activation function converts into the target for unit d. Haykin (1999) has given a complete discussion of ANN. The main advantage of this method is that it learns from data automatically and can produce outputs that are not limited to the inputs provided. It is also tolerant of missing data, because information is stored across the network itself rather than in a database.
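The weighted-sum step described above can be illustrated with a minimal Python sketch. It is not the Weka implementation used in the study: the sigmoid activation, the absence of bias terms, the function names, and all numeric values are assumptions made purely for illustration.

```python
import math

def weighted_sum(weights, inputs):
    """Net input to one unit d: Z_d = sum over c of A_cd * B_c."""
    return sum(a * b for a, b in zip(weights, inputs))

def sigmoid(z):
    """Logistic activation (an assumed choice for this sketch)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(inputs, hidden_weights, output_weights):
    """One hidden layer of sigmoid units followed by a linear target unit."""
    hidden = [sigmoid(weighted_sum(w, inputs)) for w in hidden_weights]
    return weighted_sum(output_weights, hidden)

# Hypothetical toy network: 3 inputs, 2 hidden units (illustrative values only).
inputs = [0.5, -1.0, 2.0]
hidden_weights = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]
output_weights = [1.0, -1.0]
prediction = forward(inputs, hidden_weights, output_weights)
```

Training (for example, back propagation with a learning rate and momentum) would adjust `hidden_weights` and `output_weights`; only the forward pass is sketched here.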
Generalized regression neural network (GRNN)
Specht (1991) introduced the GRNN, a normalized radial basis function (RBF) network with one hidden unit centered at each training example. The RBF, also called the kernel function, is typically a probability density function such as the Gaussian; kernels of this kind are also used in Gaussian processes and support vector machines. The hidden-to-output weights are simply the target values, so the output is a weighted average of the target values of the training cases close to the given input case. The widths of the RBF units are the only parameters that need to be determined. A GRNN has only four layers. The input values are in the first layer, the pattern units are in the second layer, the outputs of the pattern layer are passed to the summation units in the third layer, and the output units are in the final layer. The first layer is fully connected to the second (pattern) layer, where each unit represents a training pattern and its output is a measure of the distance of the input from the stored pattern. The optimal value of the user-defined parameter known as the spread (s) is determined experimentally. For more information about GRNN, readers are referred to Specht (1991) and Wasserman (1993). The advantage of GRNN is that it handles noise in the input easily and uses single-pass learning, so no back propagation is needed.
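To make the "weighted average of training targets" idea concrete, the following is a minimal Python sketch of a GRNN prediction with a Gaussian kernel. It is not the MATLAB routine used in the study; the function name, the fallback for vanishing kernel weights, and the toy data are assumptions for illustration.

```python
import math

def grnn_predict(x, train_X, train_y, spread):
    """Predict by a Gaussian-kernel-weighted average of training targets."""
    weights = []
    for xi in train_X:
        # Pattern layer: squared Euclidean distance to a stored training pattern.
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
        weights.append(math.exp(-sq_dist / (2.0 * spread ** 2)))
    total = sum(weights)
    if total == 0.0:
        # All kernels underflowed to zero; fall back to the plain mean (assumed behavior).
        return sum(train_y) / len(train_y)
    # Summation and output layers: normalized weighted sum of targets.
    return sum(w * y for w, y in zip(weights, train_y)) / total

# Hypothetical one-dimensional toy data (illustrative values only).
train_X = [[0.0], [1.0]]
train_y = [10.0, 20.0]
estimate = grnn_predict([0.5], train_X, train_y, spread=0.2)
```

A small spread makes the prediction follow the nearest training case closely, while a large spread pulls every prediction toward the overall mean of the targets, which is why the spread has to be tuned experimentally.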
Adaptive neuro-fuzzy inference system (ANFIS)
The configuration of ANFIS is displayed in Figure 1. The system has five layers, detailed as follows:
1. The membership degree is computed in the first layer. Every node produces a membership degree from the membership function of its fuzzy set:

$$O_{1i} = \mu_{A_i}(x), \quad i = 1, 2; \qquad O_{1i} = \mu_{B_{i-2}}(y), \quad i = 3, 4$$

where $x$ and $y$ are the inputs, $A_i$ and $B_i$ are the linguistic labels, and $\mu_{A_i}$ and $\mu_{B_{i-2}}$ are the degrees of the membership functions for $A_i$ and $B_{i-2}$, respectively.
2. Based on the calculated membership degrees, the output of the second layer (the firing strengths) is derived. To produce the firing strength, the previous layer's membership degrees are multiplied together:

$$O_{2i} = w_i = \mu_{A_i}(x)\,\mu_{B_i}(y), \quad i = 1, 2$$

where $O_{2i}$ is the output of this layer, called the firing strength.
3. At this stage, the firing strengths are normalized. The fixed nodes evaluate the relative contribution of each firing strength:

$$O_{3i} = \bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2$$

4. In this layer, the contribution of the i-th rule to the final output is evaluated using the consequent parameters:

$$O_{4i} = \bar{w}_i f_i = \bar{w}_i \left(p_i x + q_i y + r_i\right)$$

where $p_i$, $q_i$, and $r_i$ are the consequent parameters.
5. To obtain the total output, this layer sums all incoming signals:

$$O_{5} = \sum_{i} \bar{w}_i f_i = \frac{\sum_{i} w_i f_i}{\sum_{i} w_i}$$

It has been shown that a Gaussian membership function can lead to accurate outputs (Azimi et al. 2019; Ebtehaj et al. 2019; Gholami et al. 2019). Thus, the current study uses this membership function:

$$\mu_{A_i}(x) = \exp\left(-\frac{(x - c_i)^2}{2\sigma_i^2}\right)$$

where $c_i$ and $\sigma_i$ are the parameters of the membership function. The ANFIS model has the advantage of combining numerical and linguistic knowledge.
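The five layers can be traced in a short Python sketch of a forward pass through a two-input, two-rule first-order Sugeno system with Gaussian membership functions. This is a simplified illustration rather than the MATLAB model used in the study; the function names and every parameter value are hypothetical, and no parameter training is shown.

```python
import math

def gauss_mf(v, c, sigma):
    """Gaussian membership function with center c and width sigma."""
    return math.exp(-((v - c) ** 2) / (2.0 * sigma ** 2))

def anfis_forward(x, y, mf_a, mf_b, consequents):
    """Forward pass; mf_a and mf_b hold one (c, sigma) pair per rule."""
    # Layer 1: membership degrees of each input.
    mu_a = [gauss_mf(x, c, s) for c, s in mf_a]
    mu_b = [gauss_mf(y, c, s) for c, s in mf_b]
    # Layer 2: firing strength of each rule (product of membership degrees).
    w = [ma * mb for ma, mb in zip(mu_a, mu_b)]
    # Layer 3: normalized firing strengths (assumes at least one rule fires).
    total = sum(w)
    w_bar = [wi / total for wi in w]
    # Layer 4: each rule's weighted linear consequent p*x + q*y + r.
    rule_out = [wb * (p * x + q * y + r)
                for wb, (p, q, r) in zip(w_bar, consequents)]
    # Layer 5: overall output is the sum of the rule contributions.
    return sum(rule_out)

# Hypothetical parameters for two rules (illustrative values only).
mf_a = [(0.0, 1.0), (2.0, 1.0)]
mf_b = [(0.0, 1.0), (2.0, 1.0)]
consequents = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.5)]
output = anfis_forward(1.0, 1.5, mf_a, mf_b, consequents)
```

Swapping `gauss_mf` for a triangular, trapezoidal, or generalized bell function would give the other membership-function variants compared in the study without changing the remaining layers.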

Area of the research work
The study used flow and water quality data measured in a watershed consisting of three sub-watersheds (Khorramabad, Biranshahr, and Alashtar) in Lorestan province, Iran, located between 48°03′10″E and 48°59′07″E and between 33°11′47″N and 34°03′27″N, with an area of 3,562.1 km². The elevation of the catchment area varies from 1,158 to 3,646 m above sea level. The observations were recorded from September 2014 to August 2017. Most of the rainfall occurs from November to May. The average rainfall is 442 mm, 484 mm, and 556 mm for the Khorramabad, Biranshahr, and Alashtar sub-watersheds, respectively. Figure 2 shows the details of the study area; the black dots represent the sampling sites for the three sub-watersheds.

Methodology and data descriptions
Three types of parameters (biological, physical, and chemical) were used to analyze the water quality by which the WQI was determined (Dogan et al. 2009). These parameters were total dissolved solids (TDS), sodium (Na), sulfate (SO$_4$), electrical conductivity (EC), calcium (Ca), the potential of hydrogen (pH), bicarbonate ions (HCO$_3$), chlorides (Cl), magnesium (Mg), and potassium (K). With the use of these parameters, the WQI was calculated as follows:
1. A weight ($z_i$) was assigned to each parameter on a scale of one to five according to its significance for drinking suitability and human health. The $z_i$ values are given in Table 1.
2. The relative weight ($Z_i$) of each parameter was calculated as:

$$Z_i = \frac{z_i}{\sum_{i=1}^{n} z_i}$$

3. A scale for the quality rating ($s_i$) was calculated for each of the parameters as:

$$s_i = \frac{Con_i}{Std_i} \times 100$$

4. The sub-index level number ($SIL_i$) was calculated as:

$$SIL_i = Z_i \times s_i$$

5. The water quality index (WQI) was determined as:

$$WQI = \sum_{i=1}^{n} SIL_i$$

where $Con_i$ is the concentration of each parameter in mg/l, $Std_i$ is the standard value of each parameter as per WHO, and $n$ is the number of parameters.
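The calculation steps can be sketched in Python as follows. The sketch assumes the commonly used weighted-arithmetic form, in which each assigned weight is divided by the sum of all weights before being multiplied by the quality rating; that normalization, the function name, and all parameter names and numbers below are assumptions for illustration and are not values from Table 1 or from WHO guidelines.

```python
def water_quality_index(concentrations, standards, weights):
    """Sum of sub-indices: relative weight times quality rating per parameter."""
    total_weight = sum(weights[name] for name in concentrations)
    wqi = 0.0
    for name, conc in concentrations.items():
        relative_weight = weights[name] / total_weight    # assumed normalization
        quality_rating = conc / standards[name] * 100.0   # s_i = Con_i / Std_i * 100
        wqi += relative_weight * quality_rating           # accumulate SIL_i
    return wqi

# Hypothetical two-parameter sample (placeholder names and values only).
concentrations = {"param_a": 50.0, "param_b": 100.0}
standards = {"param_a": 100.0, "param_b": 100.0}
weights = {"param_a": 1, "param_b": 3}
sample_wqi = water_quality_index(concentrations, standards, weights)
```

With this form, a sample in which every parameter sits exactly at its standard value scores 100, so lower scores correspond to concentrations below the standards.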
A total of 124 observations, recorded from September 2014 to August 2017, were used. The dataset consists of 10 input variables (pH, Na, Mg, SO$_4$, Ca, TDS, K, Cl, HCO$_3$ and EC) and one output variable, the WQI. The pairs plot of all variables is shown in Figure 3, which illustrates how the variables relate to one another and also reveals the outliers, which were few. Seventy percent of the dataset was used to train the soft computing techniques and the remaining 30 percent was used to test them. The characteristics of the water quality parameters for the sub-watersheds are tabulated in Table 2. The characteristics of the three sub-watersheds are similar, except for some values that were higher in the Khorramabad sub-watershed. Figure 4 shows the flow chart and the architecture of the soft computing techniques (ANN, GRNN and ANFIS) implemented for the study area.
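A 70/30 partition of 124 observations could look like the Python sketch below. The text does not state whether the split was random or chronological, nor the exact counts, so the shuffling, the fixed seed, the rounding rule, and the function name are all assumptions.

```python
import random

def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it at the training fraction."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # assumed: random rather than chronological split
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Indices 0..123 stand in for the 124 observations.
train, test = split_train_test(range(124))
```

Under this rounding rule the training set holds 87 observations and the test set 37.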

Correlation plot
This investigation used the 10 input variables to estimate the output variable in the three watersheds. The correlation coefficients were obtained to evaluate the correlation among the output and input variables. Figure 5 shows the correlation plot of

Model fitness criteria (MFC)
The performances of ANN, GRNN, and ANFIS were compared by model fitness criteria (MFC). These MFC were defined as:
• Nash-Sutcliffe Efficiency: Nash & Sutcliffe (1970) introduced the Nash-Sutcliffe efficiency (NSE), which was used to evaluate the performance of the soft computing models. NSE ranges from −∞ to 1. An NSE of 1 indicates a perfect model. An NSE of 0 means the model is only as accurate as the mean of the experimental values, and a negative value indicates that the mean is a better predictor than the model (Wilcox et al. 1990; Legates & McCabe 1999). NSE can be computed as:

$$NSE = 1 - \frac{\sum_{i=1}^{F} (D_i - E_i)^2}{\sum_{i=1}^{F} (D_i - \bar{D})^2}$$

• Mean Absolute Error: The mean absolute error (MAE) is the mean of the absolute differences between observed and predicted values. MAE is non-negative, and 0 indicates a perfect prediction. The formula for MAE is:

$$MAE = \frac{1}{F} \sum_{i=1}^{F} \left| D_i - E_i \right|$$

• Legates-McCabe Index: The Legates-McCabe Index (LMI) was introduced by Legates & McCabe (1999). Its upper bound is 1, which indicates perfect agreement; lower values indicate poorer results. The LMI can be calculated as:

$$LMI = 1 - \frac{\sum_{i=1}^{F} \left| D_i - E_i \right|}{\sum_{i=1}^{F} \left| D_i - \bar{D} \right|}$$

• Root Mean Square Error: The root mean square error (RMSE) is one of the most widely used error measures for assessing model fitness. As the name indicates, it is the square root of the mean square error. A value of zero indicates the best prediction, and larger values indicate poorer predictions. The equation of RMSE is:

$$RMSE = \sqrt{\frac{1}{F} \sum_{i=1}^{F} (D_i - E_i)^2}$$

• Mean Absolute Percentage Error: The mean absolute percentage error (MAPE) is also known as the mean absolute percentage deviation. It can be formulated as:

$$MAPE = \frac{100}{F} \sum_{i=1}^{F} \left| \frac{D_i - E_i}{D_i} \right|$$

• Coefficient of Correlation: The coefficient of correlation (CC) shows the interrelation between observed and predicted values. CC ranges from −1 to +1. It can be calculated as:

$$CC = \frac{F \sum D_i E_i - \sum D_i \sum E_i}{\sqrt{F \sum D_i^2 - \left(\sum D_i\right)^2}\, \sqrt{F \sum E_i^2 - \left(\sum E_i\right)^2}}$$

where $D$ represents the experimentally obtained values, $E$ denotes the values obtained from the soft computing models, $\bar{D}$ denotes the mean of the experimentally observed values, and $F$ is the number of observations in the dataset.
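The six criteria described above can be computed with a few lines of standard-library Python. This is a minimal sketch using the usual textbook definitions of each measure; the function names and the sample values are hypothetical, and `obs` and `pred` play the roles of D and E.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus squared error over variance about the mean."""
    mean_obs = _mean(obs)
    num = sum((d - e) ** 2 for d, e in zip(obs, pred))
    den = sum((d - mean_obs) ** 2 for d in obs)
    return 1.0 - num / den

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(d - e) for d, e in zip(obs, pred)) / len(obs)

def lmi(obs, pred):
    """Legates-McCabe index: NSE analogue built on absolute differences."""
    mean_obs = _mean(obs)
    num = sum(abs(d - e) for d, e in zip(obs, pred))
    den = sum(abs(d - mean_obs) for d in obs)
    return 1.0 - num / den

def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((d - e) ** 2 for d, e in zip(obs, pred)) / len(obs))

def mape(obs, pred):
    """Mean absolute percentage error (observed values must be non-zero)."""
    return 100.0 / len(obs) * sum(abs((d - e) / d) for d, e in zip(obs, pred))

def cc(obs, pred):
    """Pearson coefficient of correlation in its raw-sums form."""
    f = len(obs)
    sum_d, sum_e = sum(obs), sum(pred)
    sum_de = sum(d * e for d, e in zip(obs, pred))
    sum_dd = sum(d * d for d in obs)
    sum_ee = sum(e * e for e in pred)
    num = f * sum_de - sum_d * sum_e
    den = math.sqrt(f * sum_dd - sum_d ** 2) * math.sqrt(f * sum_ee - sum_e ** 2)
    return num / den

# Hypothetical observed and predicted values (illustrative only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
scores = {fn.__name__: fn(observed, predicted) for fn in (nse, mae, lmi, rmse, mape, cc)}
```

In this toy case the predictions are offset by a constant, so CC is still 1 while NSE and LMI are penalized, which illustrates why correlation alone is not a sufficient fitness criterion.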

RESULTS AND DISCUSSION
Three soft computing techniques, namely, ANN, GRNN, and ANFIS with four membership functions (ANFIS_trimf, ANFIS_trapmf, ANFIS_gbellmf, and ANFIS_gaussianmf), were used in this study. Tuning soft computing techniques is a trial-and-error process. ANFIS and GRNN were implemented in MATLAB, while ANN was run in Weka (3.9). The soft computing techniques have user-defined parameters, and model accuracy can increase or decrease as these parameters change. Therefore, to improve accuracy, the optimal parameter values were determined by trial and error, and modeling was done thereafter. The optimal values of the hyperparameters were as follows: 1. For ANN: momentum = 0.2, learning rate = 0.1, hidden layers = 1, neurons per hidden layer = 8, iterations = 1,500. 2. For GRNN: spread = 0.2.
Based on the optimal values of the hyperparameters, the soft computing techniques were used to predict the WQI, and the resulting MFC are summarized in Table 3. The predictions of the soft computing techniques are plotted against the actual WQI in Figure 6. For the training dataset, all the soft computing techniques worked well, as Figure 6 clearly shows: only a few points fall outside the best agreement zone, and the predictions follow the same trend as the actual WQI, except for some outliers. Within the training dataset, however, ANN gave the best results: all of the ANN points lay in the best agreement zone and followed the same trend as the actual line. The ANFIS_trapmf predictions, in both the scatter plots and the line diagram, did not follow the trend of the actual WQI. Thus, Figure 6 indicates that ANN performed best in the prediction of the WQI.
The Taylor diagram, developed by Taylor (2001), is a widely used modern method that summarizes how closely a pattern or set of patterns matches observations. It quantifies the match in terms of correlation, standard deviation, and root mean square error. It is useful for evaluating multiple aspects of complex models at once. In this investigation, Taylor diagrams were constructed in the R language using the 'plotrix' package, and Figure 7 shows the Taylor diagram for the different soft computing techniques in the prediction of the WQI. It gives the same results as the other plots. In training, all the soft computing techniques worked well in the prediction of the WQI but, in testing, ANN was the most accurate. ANFIS_trapmf was the least accurate, with a correlation of 0.42 (Figure 7). Thus, the Taylor diagram (Figure 7) also indicates that ANN was the most reliable soft computing technique for the prediction of the WQI.
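The three statistics can share a single diagram because they are linked by a law-of-cosines relation (Taylor 2001): the squared centered RMS difference equals the sum of the two variances minus twice the product of the two standard deviations and the correlation. The Python sketch below computes the statistics and checks that relation numerically; it does not reproduce the R 'plotrix' workflow used in the study, and the function name and sample values are hypothetical. Note that the RMS quantity on a Taylor diagram is the centered difference, computed after subtracting each series' mean.

```python
import math

def taylor_statistics(obs, pred):
    """Return (sd_obs, sd_pred, correlation, centered RMS difference)."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred) / n)
    corr = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred)) / (n * sd_o * sd_p)
    crmsd = math.sqrt(
        sum(((p - mean_p) - (o - mean_o)) ** 2 for o, p in zip(obs, pred)) / n
    )
    return sd_o, sd_p, corr, crmsd

# Hypothetical observed and predicted series (illustrative values only).
sd_o, sd_p, corr, crmsd = taylor_statistics([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 5.0, 3.0])

# Law-of-cosines check: crmsd^2 == sd_o^2 + sd_p^2 - 2 * sd_o * sd_p * corr
relation_gap = crmsd ** 2 - (sd_o ** 2 + sd_p ** 2 - 2.0 * sd_o * sd_p * corr)
```

On the diagram, a model plots closer to the reference point when its standard deviation matches that of the observations and its correlation is high, which is how a model can be identified as closest to the observed WQI.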