Modeling of contaminant concentration using the classification-based model integrated with data preprocessing algorithms

Water quality is one of the most important factors contributing to a healthy life, and total dissolved solids (TDS) and electrical conductivity (EC) are among its most important parameters; many water development plans have been implemented to characterize them. The accurate prediction of water quality parameters (WQPs) is an essential requisite for water quality management, human health, public consumption, and domestic uses. This study differs from the existing literature in using three novel data preprocessing algorithms (DPAs), namely empirical mode decomposition (EMD), ensemble EMD (EEMD), and variational mode decomposition (VMD), to estimate two important WQPs, TDS and EC. The acceptability and reliability of the proposed models (i.e., model tree (MT), EMD-MT, EEMD-MT, and VMD-MT) were evaluated using five performance metrics and visual plots. A comparison of the standalone and hybrid models indicated that DPAs can enhance the performance of the standalone MT model for both TDS and EC estimation. For instance, the VMD-MT model (root-mean-square error (RMSE) = 24.41 mg/l, ratio of RMSE to standard deviation (RSD) = 0.231, and Nash-Sutcliffe efficiency ($E_{ns}$) = 0.94 at Garmrood; RMSE = 31.85 mg/l, RSD = 0.133, and $E_{ns}$ = 0.98 at Varand) outperformed the other hybrid models and the original MT model for TDS estimation. For EC estimation, VMD improved the $R^2$ of the MT model at the Garmrood and Varand stations by 10.2 and 7.6%, respectively.
• [...] used to address the nonstationarity of the dataset.
• To validate the proposed models, a classification-based MT was used as the benchmark model.
• The VMD-MT proves to be an effective tool to provide strong technical support for WQPs.


INTRODUCTION
Conductivity or electrical conductivity (EC) and total dissolved solids (TDS) are some of the water quality parameters (WQPs; dependent variables) that should be controlled in maintaining human health and welfare.
Conductivity measurements are used routinely in many industrial and environmental applications as a fast, inexpensive, and reliable way of measuring the ionic content in a solution (Olson & Hawkins ; Asadi et al. ).
In many cases, conductivity is linked directly to the TDS.
TDS is a measure of the solid fraction of a sample that is able to pass through a filter. The amount of dissolved solids gives a general indication of the suitability of the water as a drinking source and for certain agricultural and industrial uses (Sheykhzadeh 2019). In today's industrial world, most natural water sources worldwide, including those in Iran, contain impurities that are reflected in their TDS and EC levels.
Numerous factors affect the concentration of these parameters in natural water systems, including cations such as sodium (Na⁺), potassium (K⁺), calcium (Ca²⁺), and magnesium (Mg²⁺) and anions such as chloride (Cl⁻), bicarbonate (HCO₃⁻), and sulfate (SO₄²⁻) (Asadollahfardi et al. ).
In the last decades, artificial intelligence ( [...]
(1) Presenting an accurate and stable formula for each of TDS and EC using a classification-based AI model at both above-mentioned stations.
(2) Introducing a more reliable DPA coupled with MT in order to estimate monthly scaled EC and TDS in the Tajan basin. In this regard, the most effective DPAs are adopted for managing the volatility and randomness of the WQP series data. After decomposition and reconstruction, the volatility and randomness of the original WQP series were effectively reduced, and the forecasting performance was enhanced.
(3) Conducting a comprehensive and scientific evaluation to test the performance of the proposed system. So, all standalone and hybrid models, MT, EMD-MT, EEMD-MT, and VMD-MT, were compared with several performance metrics and graphical plots.

MATERIALS AND METHODS
Case study: Tajan
The statistical analysis of the physicochemical data for the training and testing periods was conducted, and the values obtained for the minimum, maximum, mean, skewness coefficient ($C_s$), and standard deviation (SD) are given in Table 1.

Model tree
MT was initially implemented by Quinlan () to establish a relationship between input and output parameters and was later reintroduced and improved by Wang & Witten (), who presented physically meaningful insights into the modeled phenomena.
The M5 model can solve highly complex problems by dividing the data space into several subproblems (subspaces) and building piecewise linear functions for each subdomain at its terminal nodes. The algorithm therefore proceeds in two steps. In the first step, the dataset is divided into subsets to construct an initial tree through a recursive splitting process. The SD is used as the splitting criterion to determine the best attribute for splitting the dataset at each node. Trees are constructed using the SD reduction (SDR) scheme, which maximizes the expected error reduction at each node as follows (Quinlan ):
$$\mathrm{SDR} = \mathrm{SD}(I) - \sum_{i} \frac{|I_i|}{|I|}\,\mathrm{SD}(I_i)$$
where $I$ refers to the set of instances that reach the node, $I_i$ represents a subset of the input data to the parent node, and SD is the standard deviation. In the second step, after pruning an overgrown tree, linear regression models are created for each of the subsets of samples in the terminal nodes.
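The splitting criterion can be illustrated with a minimal Python sketch. It assumes the usual M5 form SDR = SD(I) - Σ (|I_i|/|I|) SD(I_i), uses the population SD, and scans thresholds on a single attribute; the function names `sd_reduction` and `best_split` are hypothetical and not taken from the study.

```python
from statistics import pstdev


def sd_reduction(parent, subsets):
    """SDR = SD(I) - sum(|I_i| / |I| * SD(I_i)) for one candidate split.

    Population SD is used here; that choice is an assumption.
    """
    n = len(parent)
    weighted = sum(len(s) / n * pstdev(s) for s in subsets if s)
    return pstdev(parent) - weighted


def best_split(x, y):
    """Scan midpoints of one attribute x; return (threshold, SDR) with the largest SDR."""
    pairs = sorted(zip(x, y))
    targets = [t for _, t in pairs]
    best = (None, 0.0)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate equal attribute values
        threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
        gain = sd_reduction(targets, [targets[:k], targets[k:]])
        if gain > best[1]:
            best = (threshold, gain)
    return best
```

A full model tree would apply `best_split` recursively over all attributes and then fit a linear model in each leaf; only the node-level criterion is shown here.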
The pruning process is performed to prune back the overgrown trees, which is a pivotal step to avoid the over- [...]
[...] if $X_1 < 2$ and $X_2 > 2.5$, then the third model [...] where $a_0$, $a_1$, and $a_2$ represent regression coefficients, $X_1$ and $X_2$ are inputs, and $Y$ is the output. Readers can refer [...]
Step 1 [...]
Step 2: A new series $h(t)$ is calculated by subtracting the mean $m(t)$ from the original series $E(t)$:
$$h(t) = E(t) - m(t)$$
Step 3: The EMD stopping criterion for sifting determines whether sifting should stop. If the stopping condition is met, $h(t)$ is taken as an intrinsic mode function (IMF), and the next step is executed. If the stopping condition is not met, $h(t)$ is used as the original series, and steps 1 and 2 are repeated until the stopping condition is met, which yields the first IMF, $c_1(t)$.
Step 4: The residual series $r_1(t)$ is obtained by subtracting the IMF $c_1(t)$ from the original series $E(t)$:
$$r_1(t) = E(t) - c_1(t)$$
Step 5: The residual series $r_1(t)$ is used as the new original series, and steps 1-4 are repeated. The IMFs $c_1(t), c_2(t), \ldots, c_n(t)$ are extracted in turn until the remaining residual $r_n(t)$ is monotonic or has a single extreme point.
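The sifting loop in Steps 2 to 5 can be sketched in plain Python as follows. Several choices here are assumptions rather than details from the study: m(t) is taken as the mean of the upper and lower envelopes of the local extrema (the usual EMD reading), the envelopes are piecewise linear rather than cubic splines to avoid dependencies, and the stopping test is a relative-change threshold `tol`. All function names are hypothetical.

```python
def _extrema(x):
    """Indices of strict interior local maxima and minima."""
    inner = range(1, len(x) - 1)
    maxima = [i for i in inner if x[i - 1] < x[i] > x[i + 1]]
    minima = [i for i in inner if x[i - 1] > x[i] < x[i + 1]]
    return maxima, minima


def _envelope(x, idx):
    """Piecewise-linear envelope through x[idx], held flat at both ends (assumption)."""
    knots = [0] + idx + [len(x) - 1]
    vals = [x[idx[0]]] + [x[i] for i in idx] + [x[idx[-1]]]
    env, k = [], 0
    for t in range(len(x)):
        while k < len(knots) - 2 and t > knots[k + 1]:
            k += 1
        w = (t - knots[k]) / (knots[k + 1] - knots[k])
        env.append(vals[k] + w * (vals[k + 1] - vals[k]))
    return env


def sift(series, max_iter=50, tol=0.05):
    """Repeat h(t) = E(t) - m(t) until the relative change drops below tol."""
    h = list(series)
    for _ in range(max_iter):
        maxima, minima = _extrema(h)
        if not maxima or not minima:
            break
        upper, lower = _envelope(h, maxima), _envelope(h, minima)
        m = [(u + l) / 2 for u, l in zip(upper, lower)]  # mean envelope m(t)
        change = sum(v * v for v in m) / (sum(v * v for v in h) or 1.0)
        h = [a - b for a, b in zip(h, m)]  # Step 2
        if change < tol:  # Step 3: stopping test
            break
    return h


def emd(series, max_imfs=8):
    """Return (imfs, residual); the IMFs plus the residual sum back to the series."""
    residual, imfs = list(series), []
    for _ in range(max_imfs):
        maxima, minima = _extrema(residual)
        if not maxima or not minima:  # Step 5: nothing left to sift
            break
        c = sift(residual)
        imfs.append(c)
        residual = [r - v for r, v in zip(residual, c)]  # Step 4
    return imfs, residual
```

Because each IMF is subtracted from the running residual, the components always add back to the original series, which is a convenient sanity check for any EMD implementation.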

Ensemble EMD
The EEMD method, proposed by Wu & Huang (), is an empirical procedure used to represent a nonlinear and nonstationary signal from original data. This data preprocessing method is an improvement of the EMD that reduces mode [...]
Step 1: White noise $n_i(t)$ with a mean of 0 and a constant SD is added to the original signal $E(t)$ multiple times. The SD of the white noise is set to 0.1-0.4 times the SD of the original signal (0.2 in this study):
$$X_i(t) = E(t) + n_i(t)$$
where $X_i(t)$ represents the signal after the ith addition of Gaussian white noise.
Step 2: Each $X_i(t)$ undergoes the EMD procedure. The IMF components obtained are denoted by $c_{ij}(t)$, and the residual term is denoted by $r_i(t)$, where $c_{ij}(t)$ represents the jth IMF from the decomposition of the signal after the ith addition of Gaussian white noise.
Step 3: Steps 1 and 2 are repeated $N$ times. Based on the principle that the statistical mean of an uncorrelated random series is 0, the IMFs are averaged over all realizations to eliminate the effect of the repeatedly added Gaussian white noise on the actual IMFs. The IMF obtained from EEMD is then:
$$C_j(t) = \frac{1}{N}\sum_{i=1}^{N} c_{ij}(t)$$
where $C_j(t)$ represents the jth IMF of the original signal obtained by EEMD. As the value of $N$ increases, the averaged contribution of the added white noise approaches 0. At this point, the result of EEMD is as follows:
$$E(t) = \sum_{j} C_j(t) + r(t)$$
where $r(t)$ is the final residual, which represents the average trend of the signal. Any signal $E(t)$ can be decomposed into multiple IMFs and one residual via EEMD. The IMFs $C_j(t)$ ($j = 1, 2, \ldots$) represent the signal's components from high frequency to low frequency. Each frequency contains distinct components and varies with the signal $E(t)$.
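The ensemble step can be sketched as a thin wrapper around any decomposition routine. In the hedged Python example below, `decompose` is a hypothetical callable that stands in for EMD and is assumed to return the same number of equally long components on every call (real EMD may return a varying number of IMFs, which this sketch does not handle). The default `noise_ratio=0.2` follows the value quoted above; `n_realizations` and `seed` are illustrative.

```python
import random
from statistics import pstdev


def eemd(signal, decompose, n_realizations=100, noise_ratio=0.2, seed=0):
    """Average the components returned by `decompose` over noisy copies of `signal`.

    Assumption: decompose(series) returns a list of component lists, with the
    same number of components and the same length on every call.
    """
    rng = random.Random(seed)
    noise_sd = noise_ratio * pstdev(signal)  # noise SD as a fraction of the signal SD
    sums = None
    for _ in range(n_realizations):
        # Step 1: X_i(t) = E(t) + n_i(t)
        noisy = [e + rng.gauss(0.0, noise_sd) for e in signal]
        # Step 2: decompose the noisy copy
        comps = decompose(noisy)
        if sums is None:
            sums = [[0.0] * len(signal) for _ in comps]
        for acc, comp in zip(sums, comps):
            for t, v in enumerate(comp):
                acc[t] += v
    # Step 3: C_j(t) = (1/N) * sum_i c_ij(t)
    return [[v / n_realizations for v in acc] for acc in sums]
```

With `noise_ratio=0.0` and `n_realizations=1`, the wrapper simply returns the output of `decompose`, so EEMD reduces to the underlying decomposition in that limit.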

Variational mode decomposition
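The VMD, a recently developed algorithm for adaptive and quasi-orthogonal signal decomposition, was applied (Dragomiretskiy & Zosso ). The VMD algorithm decomposes a signal $x(t)$ into a discrete number $K$ of subsignals or modes $u_k$, each of which is compact around its respective center frequency $w_k$. The VMD is formulated as a constrained optimization problem, which is expressed as follows (Dragomiretskiy & Zosso ):
$$\min_{\{u_k\},\{w_k\}} \sum_{k=1}^{K} \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_k(t) \right] e^{-j w_k t} \right\|_2^2 \quad \text{subject to} \quad \sum_{k=1}^{K} u_k(t) = x(t) \tag{4}$$
where $\delta(t)$ is the Dirac delta, $*$ denotes convolution, and $j$ is the imaginary unit. Equation (4) can be solved with the alternating direction method of multipliers.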

Development of the DPA-based models
Hydro-climatic series often include many IMFs with differ- [...]
The preprocessing-based model (i.e., EMD-/EEMD-/VMD-MT) mainly includes the following steps. In the first step, the original data is divided into two parts, the training part and the testing part. Secondly, the EMD/EEMD/VMD procedures are used to decompose the original input and output observed time-series data $E(t)$ into several IMF components $C_j(t)$ ($j = 1, 2, 3, \ldots, n$) and one residual component $r_n(t)$. In the next step, an MT model is established as a WQP-predicting tool for each extracted IMF component and for the residual component; each output component (e.g., IMF1) is simulated using the same subseries (IMF1) of the input variables. Finally, the values predicted by the MT models for all extracted IMF and residual components are aggregated to generate the TDS and EC estimates, and the prediction error is then evaluated using the predicted dataset.
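The decompose, model, and aggregate workflow described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's implementation: a per-component least-squares line (`fit_line`) stands in for the MT model, a moving-average split (`trend_detail`) stands in for EMD/EEMD/VMD, and the training and testing parts are decomposed separately. All names are hypothetical.

```python
def trend_detail(series, width=5):
    """Stand-in decomposition (NOT EMD): [detail, trend] from a centered moving average."""
    half = width // 2
    trend = []
    for t in range(len(series)):
        window = series[max(0, t - half): t + half + 1]
        trend.append(sum(window) / len(window))
    detail = [s - m for s, m in zip(series, trend)]
    return [detail, trend]


def fit_line(x, y):
    """Least-squares slope and intercept; a simple stand-in for the MT model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return slope, my - slope * mx


def hybrid_predict(x_train, y_train, x_test, decompose):
    """Fit one model per component pair, predict each test component, then sum them."""
    x_tr, y_tr, x_te = decompose(x_train), decompose(y_train), decompose(x_test)
    total = [0.0] * len(x_test)
    for cx, cy, cx_new in zip(x_tr, y_tr, x_te):
        slope, intercept = fit_line(cx, cy)  # component-wise model
        for t, v in enumerate(cx_new):
            total[t] += slope * v + intercept  # aggregate component predictions
    return total
```

Swapping `trend_detail` for a real decomposition and `fit_line` for a model tree keeps the same structure: the aggregate prediction is always the sum of the component-wise predictions.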
As the proposed algorithms are not model-adaptive in real applications, the best and optimal structure of hybrid models is chosen based on the problem. In other words, different problems need distinct optimal structures of WQP predictive models, and determining the structure of the network is an intellectual challenge for all researchers.
As no general procedure has been established for finding the optimum values of the models' parameters, this study employed a trial-and-error procedure to obtain the optimum user-defined parameters. Table 2 lists the design parameters of the MT, EEMD, and VMD algorithms for WQP prediction. Note that if 'Noise SD = 0' and 'Number of realization = 1', then the EMD decomposition is obtained.

Model's performance analysis
The acceptance and reliability of the proposed models need to be evaluated in order to assess the models' performance.
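The evaluation criteria used in this study were the coefficient of determination ($R^2$), RMSE, mean absolute error (MAE), Nash-Sutcliffe efficiency ($E_{ns}$), and ratio of RMSE to SD (RSD), which can be formulated as follows:
$$R^2 = \frac{\left[\sum_{i=1}^{N} (O_i - \bar{O})(P_i - \bar{P})\right]^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2 \sum_{i=1}^{N} (P_i - \bar{P})^2}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (O_i - P_i)^2}$$
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |O_i - P_i|$$
$$E_{ns} = 1 - \frac{\sum_{i=1}^{N} (O_i - P_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}$$
$$\mathrm{RSD} = \frac{\mathrm{RMSE}}{\mathrm{STDEV}_{obs}}$$
where $N$ is the number of observations, $O_i$ and $P_i$ are the observed and predicted values of the WQP, $\bar{O}$ and $\bar{P}$ are the means of the observed and predicted data, and $\mathrm{STDEV}_{obs}$ is the SD of the observed values of the WQP.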
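A minimal Python sketch of the five metrics is given below. It uses their standard definitions and the population SD for STDEV_obs; both are assumptions, since the exact forms used in the study are not recoverable here. The function name `evaluate` is hypothetical.

```python
import math


def evaluate(obs, pred):
    """Return R2, RMSE, MAE, Ens, and RSD for paired observed and predicted values."""
    n = len(obs)
    o_bar, p_bar = sum(obs) / n, sum(pred) / n
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))   # squared errors
    sso = sum((o - o_bar) ** 2 for o in obs)             # spread of observations
    ssp = sum((p - p_bar) ** 2 for p in pred)            # spread of predictions
    cov = sum((o - o_bar) * (p - p_bar) for o, p in zip(obs, pred))
    rmse = math.sqrt(sse / n)
    sd_obs = math.sqrt(sso / n)                          # population SD (assumption)
    return {
        "R2": cov ** 2 / (sso * ssp),
        "RMSE": rmse,
        "MAE": sum(abs(o - p) for o, p in zip(obs, pred)) / n,
        "Ens": 1 - sse / sso,
        "RSD": rmse / sd_obs,
    }
```

Under the population-SD assumption, the two normalized metrics are linked by RSD² = 1 - E_ns, so they rank models identically; if the sample SD is used instead, the relation holds only approximately.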

APPLICATION RESULTS AND DISCUSSION
In this section, the performance of the proposed hybrid and standalone models in terms of accuracy and error will be investigated by several evaluation metrics and visual plots.

Model's performance in terms of TDS estimation
In this study, the capability of DPAs, EMD, EEMD, and VMD, integrated with the MT model, is investigated on monthly TDS estimation, and their performances are

Model's performance in terms of EC estimation
In terms of EC estimation, the evaluation metrics $R^2$, RMSE, MAE, $E_{ns}$, and RSD were employed for [...] (Table 4).
A similar trend is also observed for the testing stage. The VMD-MT model generally estimated monthly EC better than the MT, EMD-MT, and EEMD-MT models; for instance, the decreases in RSD for the VMD-MT at Garmrood and [...]
To sum up, the proposed VMD-MT method proves to be an effective tool to provide strong technical support for the monthly WQPs' estimation in the Tajan basin.
Based on these results, the authors recommend applying decomposition algorithms to the assessment of other WQPs, with the same scale of input/output parameters and watershed physical characteristics, in order to assess the generalizability of the proposed hybrid models. Furthermore, nonlinear and/or dynamic AI programming based on simulation models could be used to find the contributors to the WQPs in the river system; however, this type of assessment typically imposes a prohibitive computational burden, especially for the prediction of large and complex river systems. More generally, given the correlations and interactions between WQPs, it would be interesting to investigate whether a domain-specific mechanism governing the observed patterns exists that would establish the predictability of these variables.
The identification of such forecasting models is particularly useful for ecologists and environmentalists, since they will be able to predict water pollution levels and take necessary precautionary measures in advance.
On the other hand, hydrological processes can be regarded as nonlinear processes in nature. Some variables, such as streamflow and precipitation at different places, have time-dependent parameters, so these nonlinear processes need to be described, analyzed, and interpreted. It is therefore crucial to address the complexity, nonlinearity, and random distribution of hydrological datasets before using them to develop AI models. In this study, three DPAs were employed to overcome these drawbacks. A hybrid model that needs no further information about parameters and removes the noise trends of the WQP datasets can therefore be fruitful.

SUMMARY AND CONCLUSION
Over the past few decades, approaches for improving hydrological estimation accuracy have received significant attention from scholars and engineers in the water resources field. To contribute to this effort, the application of three DPAs, EMD, EEMD, and VMD, was considered to overcome characteristic difficulties of time-series data: nonstationarity, nonlinearity, and complex- [...] For Varand station, this metric was enhanced by 10 and 3% for the TDS and EC parameters, respectively. Among these DPAs, VMD was a promising alternative for improving the estimation results for both TDS and EC on a monthly scale. However, further studies might be required to improve the proposed models by introducing more input variables, which were not investigated here because such information was lacking. In addition, incorporating an advanced optimizer into the data-driven models might lead to further improvement in predicting WQPs.

DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.