Abstract
Water quality prediction is fundamental to water resource management and pollution control, and accurately predicting how pollutant concentrations in water bodies change over time is crucial. Such predictions provide data support for the effective estimation of water quality and are an indirect way to protect water resources and the environment. A variety of water quality prediction methods exist, but they still have shortcomings. In this paper, the main water quality pollution indicators, namely dissolved oxygen (DO), ammonia nitrogen (NH3-N) and total phosphorus (P), are taken as the objects of study to build a water quality prediction model. Water quality prediction indices contain numerous nonlinear correlation characteristics, which result in low training efficiency on large-scale data. Therefore, a combined water quality prediction model based on ensemble empirical mode decomposition (EEMD) and cascade support vector machine (Cascade SVM) is proposed. First, the EEMD method is used to highlight the real characteristics of the original water quality data series. Then, the traditional Cascade SVM is parallelized with Spark, a distributed computing engine, so that training and prediction run in parallel. The experimental results show that the proposed combined model performs strongly in both training efficiency and prediction accuracy.
HIGHLIGHTS
Proposes a combined water quality prediction model based on EEMD and Cascade SVM.
Improves the accuracy of the prediction results.
The combined water quality prediction model proposed in this paper achieves higher accuracy.
The combined water quality prediction model proposed in this paper requires less prediction time.
The proposed combined model shows a strong superiority in training efficiency and prediction accuracy.
INTRODUCTION
Due to the unreasonable living habits and production methods of human beings, the problem of water pollution has become increasingly serious worldwide, exceeding the maximum pollution load that the natural environment can bear. Water quality degradation is considered one of the most serious environmental problems worldwide, as it can destroy the ecological balance of water bodies and endanger regional environmental security. Therefore, how to accurately predict water quality is of great importance for both social and economic development. In recent years, with the high frequency of water pollution events, water quality prediction has gradually become a popular issue of concern for environmental management departments in many countries (regions) (Qaderi & Engineering 2017; Ding et al. 2019; Rashid et al. 2021; Singha et al. 2021).
Water quality forecasting is the basic work of water resources and environmental management. Scientific and reasonable forecasting methods are used to accurately reflect current water quality and pollution, to clarify the laws governing water quality change, and to identify the main pollution problems. The purpose of water quality prediction is to prevent the further deterioration of water quality. Using monitoring data and other relevant information, scientific predictions and inferences about the future trend of water quality are made from its current state, so that corresponding improvement measures can be proposed. At this stage, the main methods of water quality prediction can be divided into two types (Zhang et al. 2016; Avila et al. 2017; Khadr & Elshemy 2017): (1) prediction methods based on classical statistical analysis and (2) prediction methods based on artificial intelligence modeling. Shi & Zou (2014) used probability distributions of independent residuals to generate synthetic water quality data and used autoregressive integrated moving average (ARIMA) models to predict future water quality data for complex waters. Park & Koo (2015) proposed an ARIMA-based water quality prediction model that can predict common monitoring indicators such as dissolved oxygen (DO) and ammonia nitrogen (NH3-N). Katimon et al. (2018) used the ARIMA model to model the water quality data of the Johor River and achieved accurate prediction of hydrological variables. Zhai et al. (2021) used statistical analysis methods to predict hazardous chemical accidents in drinking water sources in the Three Gorges reservoir area.
Since most statistics-based water quality prediction methods assume normally distributed data, they cannot readily be applied to other river waters. As research has continued, artificial intelligence methods have developed rapidly in the field of water quality prediction. Setshedi et al. (2021) used artificial neural networks (ANNs) to predict the red tide phenomenon in a water quality dataset. Hrnjica & Bonacci (2019) used feedforward and recurrent neural networks for lake water level prediction. In addition to ANNs, many researchers have attempted to use various machine learning techniques to predict multiple water quality metrics (Shihab & Al-Tayyar 2019; Wang et al. 2019; Yi et al. 2019; Deng et al. 2021; Searcy & Boehm 2021). For example, Cao et al. (2020) proposed a genetic algorithm-optimized support vector machine (SVM) to predict future water quality conditions. However, such optimized support vector machines are inefficient to train on large-scale water quality data.
Therefore, to solve the above problems, a combined water quality prediction model based on ensemble empirical mode decomposition (EEMD) and cascade support vector machine (Cascade SVM) is proposed in this paper. First, DO, NH3-N, and total phosphorus (P) from water quality monitoring data are selected as predictors (Haleem et al. 2019; Hu et al. 2019; Harun et al. 2020), and EEMD is used to decompose the water quality time series data to obtain relatively realistic components. Second, Cascade SVM, as a parallel SVM, can improve training efficiency on large-scale data through global problem decomposition, filtering and feedback. Therefore, in this paper, a Spark-based parallelized Cascade SVM is used to make predictions for each component obtained from the decomposition. Finally, the corresponding outputs are combined to obtain the water quality prediction results of the combined model.
The rest of the paper is organized as follows: In Section 2, the EEMD method for water quality data series is studied in detail, while Section 3 provides the details of the proposed combined water quality prediction model. Section 4 provides the results and discussion. Finally, the paper is concluded in Section 5.
EEMD METHOD FOR WATER QUALITY DATA SERIES
Principle of empirical mode decomposition method
Empirical mode decomposition (EMD) is usually used to deal with non-stationary, nonlinear signal sequences (Chen et al. 2018; Du et al. 2018; Lu et al. 2018). Generally speaking, most decomposition methods are only suitable for data with certain fixed characteristics. For example, wavelet transform decomposition methods require the decomposed data to be stationary and linear. Likewise, Fourier transform decomposition is mainly used to deal with stationary periodic data. EMD, by contrast, uses local parts of the signal itself as the basis to resolve the signal tendency or fluctuation pattern, and generates several intrinsic mode functions (IMFs). Theoretically speaking, EMD can be used to analyze different types of signals.
Process of EEMD
EEMD exploits the uniform spectral distribution of Gaussian white noise so that signal components at different time scales are automatically projected onto suitable reference scales. Adding Gaussian white noise to the initial signal provides a uniformly distributed reference for resolving the signal and also smooths out impulsive interference, which helps to highlight the real characteristics of the initial water quality sequence. In essence, EEMD performs EMD repeatedly, adding Gaussian white noise each time. The detailed steps of the process are as follows:
Add a Gaussian white noise sequence to the signal and decompose the result into a set of IMFs by EMD.
Repeat the above step several times, adding a different Gaussian white noise sequence each time.
Calculate the mean of the decomposed IMFs, using the principle that uncorrelated random sequences average out, so as to suppress the influence of the Gaussian white noise on the real IMFs. The final IMF obtained by EEMD is:
$$c_i(t) = \frac{1}{N}\sum_{m=1}^{N} c_{i,m}(t) \qquad (1)$$
where $N$ is the number of EMD runs in the ensemble and $c_{i,m}(t)$ is the ith IMF obtained from the mth EMD.
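The three steps above can be sketched in a few lines of Python. This is only an illustration of the noise-and-average loop, not the authors' implementation: `moving_average_split` is a hypothetical toy stand-in for EMD (it is not real sifting and always returns two components), and the function names, `n_trials`, `noise_std` and `seed` are all assumptions introduced here.

```python
import random
from typing import Callable, List, Sequence

Components = List[List[float]]


def moving_average_split(signal: Sequence[float], window: int = 5) -> Components:
    """Toy stand-in for EMD (NOT real sifting): split a signal into a
    fast 'detail' component and a slow moving-average 'trend' component."""
    n = len(signal)
    half = window // 2
    trend = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        trend.append(sum(signal[lo:hi]) / (hi - lo))
    detail = [x - s for x, s in zip(signal, trend)]
    return [detail, trend]


def ensemble_average(
    signal: Sequence[float],
    decompose: Callable[[Sequence[float]], Components],
    n_trials: int = 100,
    noise_std: float = 0.1,
    seed: int = 0,
) -> Components:
    """Average the i-th component over n_trials noise-perturbed decompositions.

    Assumes `decompose` returns the same number of components on every call;
    a real EMD may not, and would need extra alignment logic.
    """
    rng = random.Random(seed)
    sums: Components = []
    for _ in range(n_trials):
        # Steps 1 and 2: add a fresh Gaussian white noise sequence, then decompose.
        noisy = [x + rng.gauss(0.0, noise_std) for x in signal]
        components = decompose(noisy)
        if not sums:
            sums = [[0.0] * len(signal) for _ in components]
        for i, component in enumerate(components):
            for t, value in enumerate(component):
                sums[i][t] += value
    # Step 3: the ensemble mean suppresses the added noise.
    return [[v / n_trials for v in component] for component in sums]
```

In practice a proper EMD routine would be passed as `decompose`; the averaging loop itself would stay the same.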
PROPOSED COMBINED WATER QUALITY PREDICTION MODEL
Basic framework of the model
In the water quality prediction literature, single-model prediction, such as SVM, the gray model and ANN, is widely used, but applications of combined models are relatively few. This paper proposes a new combined model and shows empirically that it achieves high prediction accuracy for water quality. Three indices in the water quality data, DO, NH3-N and P, are selected as prediction indices, and a combined water quality prediction model is constructed.
The EEMD–Spark–Cascade SVM combined model is built by the following steps:
Determine the optimal input structure of the model.
Establish the training samples of the model according to the determined optimal input structure, and train and test the model on these samples until the error is minimized.
Predict the water quality data with the constructed combined model.
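As a hedged illustration of the second step, the sketch below builds training samples from a time series once an input structure has been chosen. It assumes that "input structure" means the number of lagged values fed to the model, which the text does not state explicitly; `make_lagged_samples` and `n_lags` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def make_lagged_samples(
    series: Sequence[float], n_lags: int
) -> Tuple[List[List[float]], List[float]]:
    """Turn a series into (inputs, targets) pairs.

    Each input holds the n_lags values preceding its target, so the model
    learns to predict the value at time t from times t - n_lags .. t - 1.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    inputs = [list(series[t - n_lags:t]) for t in range(n_lags, len(series))]
    targets = [series[t] for t in range(n_lags, len(series))]
    return inputs, targets
```

Under this assumption, choosing the input structure amounts to comparing the test error obtained with different values of `n_lags`.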
Spark-based Cascade SVM
First, the entire training set on HDFS (Huang et al. 2017) is divided randomly and equally, a step called the Initial Random Partition (IRP). In this step, the entire training set is randomly partitioned into m subsets, each corresponding to a Split block. A Resilient Distributed Dataset (RDD) is assigned to each subset and the partitioning is set so that each subset corresponds to one partition.
Then, the training process is performed on the partitioned RDDs, i.e., an SVM training process is executed for each partition of the dataset. Spark starts multiple Executor processes (Zhang et al. 2021) on each Worker node of the cluster to complete the training process for each subset.
After all the subsets in the previous layer have been trained in parallel, the support vectors (SVs) of each subset are merged by an RDD merge operation and persisted to HDFS as input for the next layer.
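The layered control flow described above can be sketched without Spark as follows. Several things here are assumptions rather than the paper's method: the survivors are merged pairwise (the text only says the SVs are merged), the list comprehension runs sequentially where Spark would run one task per partition, and `boundary_points` is a hypothetical stand-in trainer that is only valid for separable one-dimensional data.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, int]  # (feature value, label in {-1, +1})
Trainer = Callable[[List[Sample]], List[Sample]]


def initial_random_partition(
    samples: Sequence[Sample], m: int, seed: int = 0
) -> List[List[Sample]]:
    """Shuffle the training set and deal it into m near-equal subsets."""
    if m < 1:
        raise ValueError("m must be at least 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::m] for i in range(m)]


def boundary_points(subset: List[Sample]) -> List[Sample]:
    """Stand-in trainer: for separable 1-D data the hard-margin support
    vectors are the closest point on each side of the boundary."""
    positives = [s for s in subset if s[1] > 0]
    negatives = [s for s in subset if s[1] < 0]
    kept = []
    if positives:
        kept.append(min(positives, key=lambda s: s[0]))
    if negatives:
        kept.append(max(negatives, key=lambda s: s[0]))
    return kept


def cascade(
    samples: Sequence[Sample], train_and_filter: Trainer, m: int = 4, seed: int = 0
) -> List[Sample]:
    """Train each subset, keep only its support vectors, merge, and repeat."""
    subsets = initial_random_partition(samples, m, seed)
    while True:
        # One cascade layer; Spark would execute this map in parallel.
        survivors = [train_and_filter(subset) for subset in subsets]
        if len(survivors) == 1:
            return survivors[0]
        # Merge survivors pairwise to form the next layer's inputs.
        subsets = [
            survivors[i] + (survivors[i + 1] if i + 1 < len(survivors) else [])
            for i in range(0, len(survivors), 2)
        ]
```

With a real SVM trainer in place of `boundary_points`, each layer passes on far fewer samples than it received, which is where the training-time saving on large data comes from.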
EXPERIMENT AND RESULT ANALYSIS
Experimental environment and data sources
HDFS, the Hadoop distributed file system used here as Spark's storage layer, was installed in the experimental environment and the Spark cluster was configured. All services were deployed on three Linux virtual machines in VMware running CentOS-6.8. The virtual machine configuration is shown in Table 1.
No. | Host | Operating system | CPU cores | Memory (GB) | Hard disk (GB) |
---|---|---|---|---|---|
1 | Node1 | CentOS-6.8 | 2 | 2 | 20 |
2 | Node2 | CentOS-6.8 | 4 | 4 | 100 |
3 | Node3 | CentOS-6.8 | 4 | 4 | 100 |
Spark services run on the Java virtual machine. The software versions used in the experiments are shown in Table 2. Once the configuration was complete, the packages were distributed to all nodes to finish setting up the HDFS and Spark cluster.
No. | Software | Version |
---|---|---|
1 | JDK | jdk 1.8.0_152 |
2 | Spark | spark-2.0.0-bin-hadoop2.6 |
The experimental data were obtained from the Chuanyang River water quality site in Shanghai Taihu Lake Basin on the China Environmental Monitoring website (http://www.cnemc.cn/). The monthly data of DO, NH3-N and P in the water body were selected as the numerical experimental data for this paper. The data period is from December 1991 to December 2021, with a total of 458 samples. The commonly used statistical discriminant 3σ criterion is adopted to effectively discriminate and eliminate outliers. An empty string labeled ‘unknown’ was used to fill gaps. In this paper, the sample data are divided into the following two parts: training samples and testing samples. The training sample is the water quality data from 1991 to 2019, and the test sample is the water quality data from 2020 to 2021.
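The outlier screening and the train/test split described above can be illustrated as follows. This is a minimal sketch, assuming each record is a hypothetical `(year, value)` pair and that the 3σ criterion is applied once to the whole series using the population standard deviation; the paper does not specify these details.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Record = Tuple[int, float]  # (year, measured value), a hypothetical layout


def three_sigma_filter(records: List[Record]) -> List[Record]:
    """Keep only records whose value lies within mean ± 3 standard deviations."""
    values = [value for _, value in records]
    mu, sigma = mean(values), pstdev(values)
    return [(year, value) for year, value in records if abs(value - mu) <= 3 * sigma]


def split_by_year(
    records: List[Record], first_test_year: int = 2020
) -> Tuple[List[Record], List[Record]]:
    """Records before first_test_year form the training set, the rest the test set."""
    train = [r for r in records if r[0] < first_test_year]
    test = [r for r in records if r[0] >= first_test_year]
    return train, test
```

Splitting by date rather than at random keeps the test period strictly after the training period, which matches the 1991 to 2019 and 2020 to 2021 division used here.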
Evaluation indicators
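The results in Table 3 are reported as MAPE, MSE and MAE. The sketch below gives the standard textbook definitions of these three indicators; it is assumed, not confirmed by the text, that the paper uses the same forms. The function names are introduced here for illustration.

```python
from typing import Sequence


def _check(actual: Sequence[float], predicted: Sequence[float]) -> int:
    """Validate that both sequences are non-empty and of equal length."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent. Undefined if any actual value is 0."""
    n = _check(actual, predicted)
    return 100.0 / n * sum(abs((a - p) / a) for a, p in zip(actual, predicted))


def mse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error."""
    n = _check(actual, predicted)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error."""
    n = _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n
```

For all three indicators, smaller values mean better predictions; MAPE is scale-free, while MSE and MAE carry the units of the indicator being predicted.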
Comparison of water quality prediction results
Five models, LIBSVM, Cascade SVM, EMD–Cascade SVM, EEMD–Cascade SVM and EEMD–Spark–Cascade SVM, were used to predict DO, NH3-N and P, and the results are shown in Table 3.
Indicator (mg·L−1) | Model | MAPE (%) | MSE | MAE | Time (s) |
---|---|---|---|---|---|
DO | LIBSVM | 6.03 | 0.5404 | 0.5698 | 44.3 |
DO | Cascade SVM | 5.00 | 0.2714 | 0.379 | 75.5 |
DO | EMD–Cascade SVM | 4.01 | 0.4257 | 0.5517 | 87.1 |
DO | EEMD–Cascade SVM | 3.73 | 0.1813 | 0.3397 | 99.6 |
DO | EEMD–Spark–Cascade SVM | 3.72 | 0.1812 | 0.3381 | 53.7 |
NH3-N | LIBSVM | 12.40 | 5.9841 | 4.8362 | 47.1 |
NH3-N | Cascade SVM | 8.41 | 4.607 | 3.0632 | 76.4 |
NH3-N | EMD–Cascade SVM | 7.30 | 3.9223 | 2.2413 | 85.7 |
NH3-N | EEMD–Cascade SVM | 6.93 | 3.2133 | 1.8269 | 103.6 |
NH3-N | EEMD–Spark–Cascade SVM | 6.95 | 3.2198 | 1.8185 | 51.2 |
P | LIBSVM | 8.91 | 0.3566 | 0.3761 | 43.6 |
P | Cascade SVM | 6.57 | 0.2567 | 0.2978 | 73.6 |
P | EMD–Cascade SVM | 5.52 | 0.1545 | 0.1843 | 87.2 |
P | EEMD–Cascade SVM | 4.99 | 0.1411 | 0.1635 | 96.1 |
P | EEMD–Spark–Cascade SVM | 4.93 | 0.1402 | 0.1631 | 47.1 |
From Table 3 and Figure 5, it can be seen that the proposed combined prediction model EEMD–Spark–Cascade SVM obtains the highest or nearly the highest prediction accuracy on the MAPE, MSE and MAE indicators; for NH3-N, its MAPE and MSE are marginally higher than those of EEMD–Cascade SVM. This is because EEMD can better uncover the intrinsic connections in the water quality data, making the original data series smoother after noise processing. EEMD alleviates the mode mixing problem of the EMD method and shows the fluctuation trend of the original series more clearly, thus effectively improving the prediction accuracy of the model. In addition, compared with the EEMD–Cascade SVM model, the combined prediction model EEMD–Spark–Cascade SVM significantly reduces the training and prediction time, whereas the prediction accuracy remains essentially unchanged. The parallel prediction time is roughly half the standalone prediction time (for example, 53.7 s versus 99.6 s for DO). Moreover, as the cluster size and degree of parallelism increase, the parallel prediction time on the Spark platform can be expected to fall further. Therefore, we conclude that the EEMD–Spark–Cascade SVM water quality prediction model outperforms the other benchmark models considered in accuracy, while running much faster than the standalone EEMD–Cascade SVM. The proposed EEMD does not use the probability distribution of independent residuals to analyze water quality data, but instead uses a combination of residuals as shown in Equation (1).
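As a quick check of the timing comparison, the speedup of the Spark-based model over the standalone EEMD–Cascade SVM can be computed directly from the Time column of Table 3. The variable names below are introduced here for illustration only.

```python
# Times in seconds, copied from Table 3.
standalone_time = {"DO": 99.6, "NH3-N": 103.6, "P": 96.1}  # EEMD–Cascade SVM
spark_time = {"DO": 53.7, "NH3-N": 51.2, "P": 47.1}        # EEMD–Spark–Cascade SVM

# Speedup = standalone time divided by parallel time, per indicator.
speedup = {name: standalone_time[name] / spark_time[name] for name in standalone_time}

for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x faster with Spark")
```

The ratios come out between roughly 1.85 and 2.04 on the three-node cluster.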
It can be seen from Figure 7 that the prediction accuracy of all four models, Cascade SVM, EMD–Cascade SVM, EEMD–Cascade SVM, and EEMD–Spark–Cascade SVM, decreases on the test set as the number of initial divisions increases. This is because, as the number of initial divisions increases, the distribution of each randomly generated subset deviates further from that of the original data, and some of the global SVs may be filtered out after the first layer of training. However, the accuracy of EEMD–Spark–Cascade SVM declines significantly more slowly and fluctuates less than that of the other three models, so its decline is smoother and more stable as the number of divisions grows.
CONCLUSIONS
A combined water quality prediction model based on EEMD and Cascade SVM is proposed in this paper. Considering the nonlinearity and instability of water quality data, EEMD is introduced to process the data in the time and frequency domains, which reduces the instability of the time series and improves the accuracy of the prediction results. Considering the low efficiency of water quality prediction on large-scale data, the Spark-based parallelized Cascade SVM is used to predict each component obtained from the decomposition. The monthly data of DO, NH3-N and P in water bodies were selected for experimental analysis, and the results showed that the combined water quality prediction model proposed in this paper has higher accuracy and a shorter prediction time than the other prediction models. The shortcoming of this study is that the prediction accuracy of the model decreases as the number of initial divisions increases; future work will try a secondary redistribution of the subsets after the initial division to address this problem.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.