To effectively prevent river water pollution, water quality monitoring is necessary. However, existing water quality assessment methods are limited in their ability to characterize water quality conditions, and few researchers have focused on feature extraction methods for water pollution identification or on obtaining accurate information about pollution sources. Thus, this study proposes a feature extraction method based on the entropy-minimal description length principle and the gradient boosting decision tree (GBDT) algorithm for identifying the type of surface water pollution, taking into account the distribution characteristics and intrinsic associations of conventional water quality indicators. To improve robustness to noise, we constructed coarse-grained discretization features of each water quality indicator based on information entropy. The nonlinear correlations between water quality indicators and pollution classes were mined by the GBDT algorithm, which was used to acquire tree transformed features. Water samples collected by the Environmental Monitoring Center of a southern city were used to test the performance of the proposed algorithm. Experimental results demonstrate that the features extracted by the proposed method are more effective than the water quality indicators without feature engineering and the features extracted by the principal component analysis algorithm.

  • Different types of water pollution have unique attributes for risk characterization.

  • Based on our study of the characteristics of water quality data, we propose an innovative feature extraction method that combines the entropy-minimal description length principle with the gradient boosting decision tree algorithm.

  • We focus on feature extraction methods for water pollution identification.

Water resources are critical for living organisms and human society. However, economic and population growth has highlighted the negative effect of surface water pollution on people's lives. Recent studies (Sun et al. 2017; Ahmed & Ismail 2018) have shown that untreated sewage and the dumping of waste are the primary causes of water pollution and of several diseases. To address this issue, many countries have gradually been establishing water monitoring stations and improving water monitoring hardware facilities (Wang & Yang 2016). Meanwhile, various techniques and algorithms have been used to monitor and evaluate surface water quality, helping the relevant departments use water quality information to make decisions about water resource management.

For analyzing surface water quality, the pollution index method (Liu et al. 2011; Bin et al. 2014), the fuzzy comprehensive evaluation method (Liu & Zou 2012), anomaly detection methods (Jeffrey et al. 2009), and spectroscopy methods (Lv et al. 2016) are used, and many other methods have been applied to assess surface water quality (Wang et al. 2016, 2017, 2019). The pollution index method and the fuzzy comprehensive evaluation method help researchers quantify the level of water pollution. However, these approaches to water quality assessment are limited in their ability to characterize the conditions of surface water quality; they cannot provide valuable information such as the cause of surface water pollution. The spectroscopy method can identify the category of specific pollutants, but because such instruments are inconvenient to maintain, it cannot be extended to large-scale real-time online monitoring. To monitor water quality effectively, conventional water quality indicators are therefore more widely applied.

Recently, water pollution identification with conventional water quality indicators has been explored using data-driven algorithms; it is essentially a classification task. A classification algorithm is mainly composed of two parts: feature extraction and the classification model. Owing to the uncertainty and distribution characteristics of conventional water quality monitoring indicators, the original data should not be used directly as the input features of classification models. Liu et al. (2015a, 2015b) used Mahalanobis and cosine distances to measure the similarity between characteristic pollutant vectors and demonstrated that the type of contaminant could be determined as the class with the minimum distance. Muhammad et al. (2015) compared the performance of five classification models and analyzed the critical features among 53 water quality indicators. However, water quality monitoring data are affected by many external factors, such as the accuracy of the instruments and incorrect manual operation, which leads to misjudgment. Previous studies have reported feature extraction approaches to overcome this limitation, such as principal component analysis (PCA) (Olsen et al. 2012), the artificial neural network (ANN) algorithm (Wechmongkhonkon et al. 2012), and cluster analysis (Azhar et al. 2015). The PCA algorithm can extract latent variables from noisy hydrological data, and the ANN algorithm can extract useful features by adjusting the weights of its hidden layers via the gradient descent algorithm. However, the joint response pattern of multiple indicators is still not well captured, and many algorithms do not offer interpretable results for policymakers, managers, and other nontechnical people (Singh & Kaur 2017).

Different forms of water pollution have unique attributes for risk characterization. However, only a few studies specializing in surface water pollution have considered the distribution characteristics and intrinsic association of water quality indicators. Thus, this is the first study to analyze the relationships between the conventional water quality indicators, explore classification methods for identifying significant surface water pollution with conventional water quality indicators, and develop a new feature extraction method based on the entropy-minimal description length principle (Entropy-MDLP) and the gradient boosting decision tree (GBDT) algorithm. Coarse-grained discretization features of each water quality indicator were constructed based on information entropy to improve the robustness to noise. The nonlinear correlations between water quality indicators and water pollution were mined by the GBDT algorithm, which was used to acquire combined features. Comparison experiments with different feature extraction methods were conducted, and the results reveal that the proposed method achieves better identification performance than the water quality indicators without feature engineering and the features extracted by the PCA algorithm.

This study proposes a feature extraction method for water quality data to help classification algorithms automatically identify whether surface water is polluted and distinguish the source of the pollution. As shown in Figure 1, data for four significant types of surface water pollution were collected for this study: industrial sewage, domestic sewage, muddy water, and salty tidewater; and six conventional water quality indicators, namely, chemical oxygen demand (COD), NH3N, dissolved oxygen (DO), pH, turbidity, and electrical conductivity (EC), were measured for identifying water pollution. Depending on the water quality monitoring conditions, other water quality indicators could also be selected.

Figure 1: Pollution source identification based on conventional water quality indicators.

Feature extraction is a type of data transformation that includes feature construction, discretization, and subset selection (Guyon & Elisseeff 2006). In this study, the proposed feature extraction method focuses on feature discretization and construction. The Entropy-MDLP and GBDT algorithms are used to obtain discrete features and combined features, respectively. The discrete and combined features are then integrated into the classification model to identify the source of surface water pollution.

Feature discretization based on the Entropy-MDLP algorithm

There are certain drawbacks associated with the direct use of noisy hydrological data as input for the classification model in identifying the source of surface water pollution. The commonly used feature extraction algorithms, such as PCA or independent component analysis, can be used for noise reduction but ignore the class information. The entropy theory (Babovic & Keijzer 2000; Sun et al. 2010), which uses the entropy value as a measure of variability, has been applied in hydrology for years. The Entropy-MDLP algorithm (Dougherty et al. 1995; Lustgarten et al. 2008) discretizes each continuous water quality indicator into multiple intervals to obtain discrete features, which can improve the robustness of classification models to outliers and noise data. The algorithm can also help the classification model discover nonlinear relations between water quality indicators.

The Entropy-MDLP algorithm (Fayyad & Irani 1993) is a supervised algorithm that uses the information entropy minimization heuristic to select cut points recursively. This algorithm constructs an objective function of stopping criteria for the recursive discretization strategy based on the MDLP, and the steps are summarized as follows.

  • Step 1: Calculate the class entropy of the set $S$:
    $Ent(S) = -\sum_{i=1}^{k} P(C_i, S)\,\log_2 P(C_i, S)$     (1)
    where $C_i$ is the ith class of water quality, $k$ is the number of different classes, and $P(C_i, S)$ stands for the proportion of examples in $S$ that belong to class $C_i$.
  • Step 2: For a given feature $A$, let a boundary $T$ partition the set $S$ of examples into subsets $S_1$ and $S_2$, and calculate the class information entropy of the partition induced by $T$:
    $E(A, T; S) = \frac{|S_1|}{|S|}\,Ent(S_1) + \frac{|S_2|}{|S|}\,Ent(S_2)$     (2)

The boundary $T_A$ that minimizes the entropy function over all possible boundaries is selected as a cut point.

  • Step 3: Recursively split the set $S$ as long as the following MDLP criterion is satisfied:
    $Gain(A, T_A; S) > \frac{\log_2(N-1)}{N} + \frac{\Delta(A, T_A; S)}{N}$     (3)
    $Gain(A, T_A; S) = Ent(S) - E(A, T_A; S)$     (4)
    where $N$ is the number of instances in the set $S$, $k_i$ is the number of unique classes present in the subset $S_i$, and $\Delta(A, T_A; S)$ is calculated by the following formula:
    $\Delta(A, T_A; S) = \log_2(3^k - 2) - [k\,Ent(S) - k_1\,Ent(S_1) - k_2\,Ent(S_2)]$     (5)
  • Step 4: Discretize all water quality indicators, converting every continuous value into a discrete value according to the cut points found by Entropy-MDLP discretization for all instances (a minimal sketch of this recursive procedure is given below).
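The recursive search described above can be sketched in Python as follows. This is a minimal illustration, assuming one indicator is passed as a NumPy array together with its pollution-class labels; the function names and the commented usage with an `EC` column are illustrative, not the authors' implementation.

```python
import numpy as np

def entropy(labels):
    """Class entropy Ent(S) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mdlp_cut_points(values, labels):
    """Recursively find cut points for one indicator (Fayyad & Irani 1993)."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    cuts = []

    def split(lo, hi):
        v, y = values[lo:hi], labels[lo:hi]
        n = len(v)
        if n < 2:
            return
        ent_s = entropy(y)
        best = None  # (weighted class information entropy, split index)
        for i in range(1, n):
            if v[i] == v[i - 1]:
                continue  # candidate boundaries lie between distinct values
            e = (i * entropy(y[:i]) + (n - i) * entropy(y[i:])) / n
            if best is None or e < best[0]:
                best = (e, i)
        if best is None:
            return
        e_min, i = best
        gain = ent_s - e_min
        k, k1, k2 = (len(np.unique(a)) for a in (y, y[:i], y[i:]))
        delta = np.log2(3 ** k - 2) - (k * ent_s
                                       - k1 * entropy(y[:i])
                                       - k2 * entropy(y[i:]))
        # MDLP stopping criterion: accept the cut only if the gain is large enough.
        if gain > (np.log2(n - 1) + delta) / n:
            cuts.append((v[i - 1] + v[i]) / 2.0)
            split(lo, lo + i)
            split(lo + i, hi)

    split(0, len(values))
    return sorted(cuts)

# Hypothetical usage on a training DataFrame with 'EC' and 'label' columns:
# ec_cuts = mdlp_cut_points(train["EC"].to_numpy(), train["label"].to_numpy())
# ec_discrete = np.digitize(train["EC"], ec_cuts)
```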

Feature construction based on the GBDT algorithm

The discrete features extracted by the Entropy-MDLP algorithm are coarse-grained. Thus, misclassification readily occurs when an instance lies close to the classification boundaries. The GBDT (He et al. 2014) algorithm aims to construct new features at a finer granularity by building decision trees to obtain a set of decision rules that take the original features as input. This algorithm works as a supervised feature encoding method that converts a real-valued vector into a binary-valued vector. A tree transformed feature is constructed by a traversal from the root node to a leaf node. In each learning iteration, the algorithm creates a new tree to model the residual of the previous trees (Friedman 2001). As a result, the constructed features are easily distinguished. The specific steps of the algorithm are summarized as follows.

  • Step 1: Initialize the model by minimizing the loss function:
    $f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$     (6)
    where $c$ is a constant, $y_i$ is the label of the ith instance, which stands for the category of water pollution, and $L(y, f(x))$ is the squared error loss:
    $L(y, f(x)) = (y - f(x))^2$     (7)
  • Step 2: Let $M$ be the number of trees; for $m = 1, 2, \ldots, M$:

    • Step 2.1: Calculate the residual of the previous trees as follows:
      $r_{mi} = y_i - f_{m-1}(x_i), \quad i = 1, 2, \ldots, N$     (8)
    • Step 2.2: Create a new tree $T(x; \Theta_m)$ using the CART (classification and regression tree) algorithm to fit the residuals $r_{mi}$.

    • Step 2.3: Update $f_m(x)$ as follows:
      $f_m(x) = f_{m-1}(x) + T(x; \Theta_m)$     (9)

  • Step 3: Obtain the boosted tree model $f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$.

  • Step 4: Convert the real-valued indicator vector of each instance into a binary-valued vector. Each tree is treated as a categorical feature whose value is the index of the leaf that the instance falls into. The features generated by each individual tree are then encoded with the one-hot encoding algorithm (Alkharusi 2012). An example is shown in Figure 2.

Figure 2: Example of feature construction based on the GBDT algorithm.

Figure 2 shows that, if an instance ends up in leaf 1 in the first subtree, leaf 3 in the second subtree, and leaf 1 in the last subtree, then the overall input to the classifier will be binary vector [1, 0, 0, 0, 0, 1, 0, …, 1, 0], where each entry corresponds to the leaves of each subtree.
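A minimal sketch of this tree-leaf encoding is given below, using scikit-learn's GradientBoostingClassifier and OneHotEncoder rather than the authors' exact implementation; the randomly generated arrays simply stand in for the six-indicator water quality data and five pollution classes.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Placeholder data standing in for the six indicators and five pollution classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 6))
y_train = rng.integers(0, 5, size=400)
X_test = rng.normal(size=(100, 6))

# Fit a boosted tree model on the water quality indicators.
gbdt = GradientBoostingClassifier(n_estimators=25, max_depth=5)
gbdt.fit(X_train, y_train)

# apply() returns the index of the leaf each sample falls into for every tree;
# flatten the per-class trees so each column corresponds to one tree.
train_leaves = gbdt.apply(X_train).reshape(X_train.shape[0], -1)
test_leaves = gbdt.apply(X_test).reshape(X_test.shape[0], -1)

# One-hot encode the leaf indices: each tree becomes a categorical feature and
# the result is the binary-valued vector described above.
encoder = OneHotEncoder(handle_unknown="ignore")
train_tree_features = encoder.fit_transform(train_leaves)
test_tree_features = encoder.transform(test_leaves)
```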

Evaluation of feature extraction algorithm

Classification algorithm

To evaluate the effectiveness of the proposed algorithm comprehensively, two distinct and commonly used algorithms are selected as classifiers. The first is the support vector machine (SVM) (Hearst et al. 1998), a pattern classification algorithm based on the idea of finding a hyperplane with the largest possible margin among different classes of data; the second is random forest (Liaw & Wiener 2002), an ensemble method that creates a forest of uncorrelated decision trees and aggregates their predictions. To compare the generalization performance of the classifiers with features extracted by different algorithms, 80% of the water samples are randomly chosen as a training set, and the remaining water samples are used as a test set.

Confusion matrix

Identifying the source of surface water pollution is primarily a multi-class classification problem. Thus, a confusion matrix is used to visualize the performance of the classification algorithm. Each row of the matrix represents the instances in an actual class, whereas each column represents the instances in a predicted class. All correct predictions are located in the diagonal of the matrix. The classification result can be easily visually inspected because the prediction errors are represented by values outside the diagonal.

Assessment criteria

The average precision rate, recall rate, and f1 score over the classes of pollution source are used as assessment criteria. Their formulas are presented as follows:
$\text{Precision} = \frac{TP}{TP + FP}$     (10)
$\text{Recall} = \frac{TP}{TP + FN}$     (11)
$f1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$     (12)
where TP represents the correctly identified sources of surface water pollution, FP represents incorrectly identified sources, and FN represents the source incorrectly identified as other sources of surface water pollution.
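These criteria and the confusion matrix can be computed with scikit-learn as sketched below; the two short label lists are hypothetical examples, not results from the paper.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true and predicted pollution classes for a handful of samples.
y_true = ["normal", "domestic", "industrial", "salty", "muddy", "normal"]
y_pred = ["normal", "domestic", "industrial", "normal", "muddy", "normal"]

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

# Macro averaging: compute each metric per class, then average over the classes.
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(recall_score(y_true, y_pred, average="macro", zero_division=0))
print(f1_score(y_true, y_pred, average="macro", zero_division=0))
```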

Based on the proposed feature extraction method, this study designed the experiment as shown in Figure 3.

  • i. Data acquisition and preprocessing: outliers are eliminated and the data are standardized.

  • ii. Dataset partitioning: the dataset is randomly divided into a training set (80%, used for model training) and a test set (20%, used for model evaluation).

  • iii. Feature extraction: the method proposed in this paper is applied to the original water quality data to obtain the discretization features and the combination features.

  • iv. Model training and tuning: the random forest algorithm, which combines multiple trees following the idea of ensemble learning, and the SVM algorithm, which performs well on small datasets, are used as classifiers, and the extracted features are used as input to the models. The parameters are adjusted according to the performance of the models on the test set. A condensed sketch of this workflow is given after the list.
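The sketch below condenses steps (ii) and (iv), assuming the feature matrices produced in step (iii) are already available as arrays; the placeholder data, the feature dimensions taken from the text (51 and 161), and the classifier parameter values are illustrative, not the tuned settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Placeholder matrices standing in for the discrete (Entropy-MDLP) and
# tree transformed (GBDT) features from step (iii), plus the pollution labels.
rng = np.random.default_rng(0)
discrete_features = rng.integers(0, 2, size=(1000, 51))
tree_features = rng.integers(0, 2, size=(1000, 161))
labels = rng.integers(0, 5, size=1000)

# Step (ii): combine the feature groups and split 80%/20% into training/test sets.
features = np.hstack([discrete_features, tree_features])
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Step (iv): train and compare both classifiers (parameter values are illustrative).
for clf in (SVC(kernel="rbf", C=1.0), RandomForestClassifier(n_estimators=100)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, f1_score(y_test, clf.predict(X_test), average="macro"))
```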

Figure 3: The framework of the classification algorithm for identifying the source of water pollution.

Data acquisition and preprocessing

The proposed pollution source identification method is tested on the real data obtained by Hangzhou Environmental Monitoring Center. Water samples are collected from water quality monitoring stations, domestic sewage outlets, industrial sewage outlets, and sewage treatment plants; some water samples are obtained by artificial sampling. The dataset has 2,843 instances. The specific numbers of water quality samples belonging to different classes are shown in Table 1.

Table 1: Distribution of water samples

Pollution source   | Sample size | Data source
Normal river water | 1,768       | 15 water quality monitoring stations
Domestic sewage    | 290         | 8 domestic sewage outlets
Industrial sewage  | 238         | 5 industrial sewage outlets
Salty tide         | 300         | Artificial sampling
Muddy water        | 247         | Artificial sampling

Six conventional water quality indicators, namely, COD, NH3N, DO, pH, turbidity, and EC, are measured for each of these water samples. All water samples are measured with instruments provided by the Laboratory of Smart Environmental Sensing and Control at Zhejiang University. Information about the devices is shown in Table 2.

Table 2: Information of measuring instruments

Indicator | Instrument      | Measurement technique | Range        | Resolution
COD       | spectra::lyser™ | Spectrometry          | 0–4,000 mg/L | 0.01 mg/L
NH3N      | GR-3411         | Spectrophotometry     | 0–10 mg/L    | 0.01 mg/L
DO        | SDF-02          | Fluorescence          | 0–20 mg/L    | 0.01 mg/L
pH        | SPC-02          | Potential analysis    | 0–14         | 0.01
Turbidity | CTR-01          | 90° scattering        | 0–1,000 NTU  | 0.01 NTU
EC        | SCE-01          | Potential analysis    | –            | 1 μS/cm

We measured the water samples with the above instruments at different times and used them as training and testing sets for our model, and the details of the dataset are summarized in Table 3.

Table 3: Distribution of water dataset

Dataset      | Normal river water | Domestic sewage | Industrial sewage | Salty tide | Muddy water
Training set | 1,100              | 800             | 548               | 932        | 608
Testing set  | 275                | 200             | 137               | 233        | 152

For a precise observation of the distribution of water quality indicators under different types of water pollution, the distribution of each indicator is plotted using a box plot. An outlier is defined as a data point located more than 1.5 times the interquartile range above the upper quartile or below the lower quartile (Dawson 2011).

As shown in Figure 4, evident outliers are present in the original water quality data. These outliers, caused by instrument faults or incorrect experimental operations, are removed because they may negatively affect the performance of the algorithm. The other necessary preprocessing steps, namely normalization of the data and deletion of missing values, are also performed.
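A pandas sketch of this preprocessing is shown below, assuming the raw measurements sit in a DataFrame with one column per indicator; the column names, the min-max normalization, and applying the outlier rule to the pooled data rather than per pollution class are all assumptions.

```python
import numpy as np
import pandas as pd

INDICATORS = ["COD", "NH3N", "DO", "pH", "Turbidity", "EC"]  # assumed column names

def remove_iqr_outliers(df, columns):
    """Drop rows lying more than 1.5 x IQR beyond the quartiles of any indicator."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df.loc[mask].copy()

# Placeholder frame standing in for the measured samples.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 6)), columns=INDICATORS)

df = df.dropna(subset=INDICATORS)          # delete rows with missing values
df = remove_iqr_outliers(df, INDICATORS)   # remove box-plot outliers
df[INDICATORS] = (df[INDICATORS] - df[INDICATORS].min()) / (
    df[INDICATORS].max() - df[INDICATORS].min())  # min-max normalization
```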

Figure 4: Box plot of water quality indicators under different types of pollution source.

Feature extraction for water quality indicators

Feature discretization based on the Entropy-MDLP algorithm

The EC indicator is used to show the specific calculation process of the Entropy-MDLP algorithm. The probability density distribution of EC under different classes of surface water pollution sources is shown in Figure 5.

Figure 5: Probability density distribution of EC under different pollution sources.

As shown in Figure 5, a considerable difference is observed in the probability density distribution of EC under the different classes of water pollution. Moreover, because the EC values of salty tidewater pollution far exceed those of the other types, the remaining types of surface water pollution become difficult to distinguish after data normalization. This phenomenon often causes monitoring failures. Feature discretization based on the Entropy-MDLP algorithm is used to obtain discrete features for the water quality indicators to solve this issue. In this way, the robustness of classification models to the distribution of water quality indicators can be improved.

The EC values are sorted, and the best cut point is recursively found. The results of discretization are shown in Figure 6. After discretization, the EC values are divided into seven intervals, which are distinguishable for identifying water pollution. The discretization results for the other water quality indicators are shown in Table 4.

Figure 6: The discretization result for EC.

The continuous water quality indicators are converted into discrete features with the cut points found by the Entropy-MDLP algorithm, which can improve the robustness of classification. After that, the one-hot encoding technique is used to encode the discrete features into a binary-valued vector for introducing a nonlinear relationship between water quality indicators.
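For illustration, the conversion from a continuous indicator to a binary-coded discrete feature might look as follows, using the EC cut points listed in Table 4; np.digitize and OneHotEncoder are stand-ins for whatever encoding routine the authors used, and the sample EC values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# EC cut points found by the Entropy-MDLP algorithm (Table 4); six cuts give seven intervals.
ec_cuts = [313, 377, 529, 711, 1002, 1397]

ec_values = np.array([120.0, 350.0, 600.0, 1500.0])  # example EC measurements (μS/cm)
ec_interval = np.digitize(ec_values, ec_cuts)         # interval index in 0..6

# One-hot encode the interval index into a binary-valued feature vector.
encoder = OneHotEncoder(categories=[np.arange(len(ec_cuts) + 1)])
ec_binary = encoder.fit_transform(ec_interval.reshape(-1, 1)).toarray()
```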

Feature construction based on the GBDT algorithm

Feature construction based on the GBDT algorithm aims to integrate different water quality indicators automatically. To obtain the newly constructed features, a GBDT model is trained with water quality data first. The GBDT model used in this study is XGBoost (Chen & Guestrin 2016), an implementation of the GBDT algorithm. The maximum depth of each boosted tree is 5, and the number of boosted trees is 25. The first generated boosted tree is shown in Figure 7.
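A sketch of this configuration with the xgboost Python package is given below; the random arrays are placeholders for the real indicator data, and the dumped text of tree 0 corresponds to what Figure 7 visualizes.

```python
import numpy as np
from xgboost import XGBClassifier

# Placeholder data standing in for the six indicators and five pollution classes.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 6))
y_train = rng.integers(0, 5, size=500)

# 25 boosted trees with a maximum depth of 5, as stated in the text.
model = XGBClassifier(n_estimators=25, max_depth=5)
model.fit(X_train, y_train)

# Text dump of the first boosted tree (cf. Figure 7): each root-to-leaf path is
# a decision rule built from specific water quality indicators.
print(model.get_booster().get_dump()[0])

# Leaf index of every tree for each sample, later one-hot encoded into the
# tree transformed features.
leaf_indices = model.apply(X_train)
```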

Figure 7: First boosted tree of the GBDT model.

The decision tree shown in Figure 7 is the first boosted tree generated by the GBDT model, which is trained on the training set. A traversal from the root node to a leaf node represents a rule constructed by specific water quality indicators. For example, leaf 1 in the first subtree indicates that the feature is built by turbidity, pH, and EC. The feature construction based on the boosted decision tree can be understood as a supervised feature encoding method that converts a real-valued vector into a binary-valued vector. Unlike the discrete features that treat each water quality indicator separately, tree transformed features take into account the internal connection and nonlinear relationship between water quality indicators.

The leaves in the first boosted tree are visualized by a heat map, as shown in Figure 8, for analyzing the principle of the features constructed by the GBDT algorithm. The vertical and horizontal axes in the graph stand for leaves and sources of surface water pollution. The different shades of color represent the probability of samples from various sources falling into the leaf nodes.

Figure 8: Distribution of water quality samples in the first boosted tree.

Figure 8 shows that most of the normal water quality samples fall into leaf 2, which is distinguished from other classes of water quality samples. The leaves in other boosted trees can separate the other categories of water quality by different combinations of water quality indicators. Therefore, the features extracted by the GBDT algorithm can considerably affect the identification of water pollution sources.

We obtain 51 dimensions of discrete features and 161 dimensions of tree transformed features using Entropy-MDLP and GBDT algorithms. We take the discrete features, tree transformed features, and the combination of them as input to the classifiers. To comprehensively evaluate the effectiveness of the proposed feature extraction method, we choose the SVM and random forest algorithms as classifiers, and the precision rate, recall rate, and f1 score are used as assessment criteria.

As shown in Table 5, the combination of discrete and tree transformed features outperforms the discrete or tree transformed features alone. The Entropy-MDLP algorithm extracts features based on the distribution characteristics of the water quality indicators, whereas the GBDT algorithm relies on the intrinsic association of different indicators. Given that they work in different ways, their combination performs better than either used separately.

Table 4: Cut points for the discretization of water quality indicators

Indicator       | Cut points
COD (mg/L)      | [6.12, 6.45, 7.85, 8.74, 10.17, 11.33, 14.94]
NH3N (mg/L)     | [0.32, 1.11, 1.57, 1.79, 2.22, 2.42, 3.42, 4.21]
DO (mg/L)       | [0.99, 2.12, 2.96, 3.53, 4.6, 5.98, 7.22, 7.71]
pH              | [5.62, 6.56, 6.88, 7.1, 7.43, 7.8, 8.78]
Turbidity (NTU) | [19.03, 30.24, 37.24, 42.41, 46.23, 61.93, 86.6, 109.82, 155.87]
EC (μS/cm)      | [313, 377, 529, 711, 1002, 1397]

The proposed feature extraction algorithm is also compared with the original water quality indicators without feature extraction and with the features extracted by the PCA algorithm. The results shown in Table 6 indicate that the proposed algorithm outperforms the other feature extraction algorithms, and the classifiers perform worst when the original water quality indicators are used as features. To explore the patterns and distribution of water quality indicators under different types of water pollution, a pair plot is used to visualize the measured original water quality indicators, as shown in Figure 9.

Table 5: Classification results based on the proposed algorithm

Features  | SVM, MDLP | SVM, GBDT | SVM, MDLP–GBDT | Random forest, MDLP | Random forest, GBDT | Random forest, MDLP–GBDT
Precision | 0.791     | 0.787     | 0.865          | 0.819               | 0.808               | 0.913
Recall    | 0.770     | 0.769     | 0.854          | 0.809               | 0.791               | 0.904
f1 score  | 0.771     | 0.771     | 0.849          | 0.805               | 0.787               | 0.903
Table 6: Comparison of different feature extraction algorithms

Features  | SVM, Original | SVM, PCA | SVM, MDLP–GBDT | Random forest, Original | Random forest, PCA | Random forest, MDLP–GBDT
Precision | 0.694         | 0.806    | 0.865          | 0.787                   | 0.813              | 0.913
Recall    | 0.663         | 0.788    | 0.854          | 0.755                   | 0.802              | 0.904
f1 score  | 0.695         | 0.785    | 0.849          | 0.751                   | 0.796              | 0.903
Figure 9: Pair plot for water quality indicators under different types of pollution source.

Figure 10: Performance of different feature extractors with the SVM classifier.

The pair plot builds on two basic figures: the histogram and the scatter plot. The histogram on the diagonal visualizes the distribution of a single water quality indicator, while the scatter plots on the upper and lower triangles show the relationship between two indicators. As shown in Figure 9, it is difficult to identify the type of water pollution by a simple thresholding method using a single indicator because of the nonlinear relationship between water quality indicators and water pollution.
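Such a pair plot can be produced with seaborn along the lines below; the DataFrame construction is a placeholder for the real measurements, and the column names are assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

INDICATORS = ["COD", "NH3N", "DO", "pH", "Turbidity", "EC"]

# Placeholder frame: six indicators plus the pollution-source label.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(150, 6)), columns=INDICATORS)
df["source"] = rng.choice(["normal", "domestic", "industrial", "salty", "muddy"], size=150)

# Histograms on the diagonal, pairwise scatter plots off the diagonal, coloured by source.
sns.pairplot(df, vars=INDICATORS, hue="source", diag_kind="hist")
plt.show()
```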

To explore why the water quality features extracted by the proposed algorithm can effectively identify the source of surface water pollution, 80% of the instances in our dataset are used for training and the rest for validation. The classification results are displayed as confusion matrices.

Using the SVM as the classifier, with the original water quality features, the discrete features extracted by the Entropy-MDLP algorithm, and the combined features extracted by the proposed algorithm as input in turn, the confusion matrices shown in Figure 10 are obtained.

The results show that the discrete features perform better than the original water quality data without feature engineering: the precision rate for normal river water and the recall rates for the other water quality classes are considerably improved. The reason is that the SVM classifier is sensitive to outliers and to the distribution of the water quality data. With the original water quality indicators, the long-tailed or overlapping distributions under different kinds of water quality considerably degrade the classification result of the SVM. The discrete features reduce the effect of the long-tailed distribution by introducing nonlinearity into the original water quality indicators. However, if the SVM classifier takes only the discrete features as input, it easily misclassifies domestic sewage or salty tidewater as industrial sewage. This situation is effectively improved when the discrete and tree transformed features are combined as input, because the intrinsic association of the water quality indicators is then taken into account.

The results with the random forest as the classifier are shown in Figure 11. With the discrete features as input, the classification results are not as good as those obtained with the original water quality features. The reason is that the random forest classifier is sensitive to the distribution of the water quality data rather than to their specific values. Although discretization improves the robustness of the model to some extent, it transforms fine-grained features into coarse-grained ones, which can easily lead to misjudgment near the interval boundaries. Therefore, discretization does not considerably improve the performance of the random forest classifier. However, the tree transformed features obtained by the GBDT algorithm can be regarded as more fine-grained features, so the combination of coarse-grained and fine-grained features significantly improves the performance.

Figure 11: Performance of different feature extractors with the random forest classifier.

The PCA feature extraction algorithm is also compared with the proposed algorithm. The confusion matrices obtained when the SVM and random forest classifiers take the features extracted by the PCA algorithm as input are shown in Figure 12. Moreover, macro-average receiver operating characteristic (ROC) curves are used to compare the performance of the different feature extractors; they compute the average area under the curve over all possible pairwise combinations of classes and are insensitive to class imbalance (Hand & Till 2001).
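One way to compute such a macro-average, pairwise ROC statistic with scikit-learn is sketched below; the label and probability arrays are placeholders, and roc_auc_score is used as a summary instead of re-drawing the full curves in Figure 13.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

classes = ["normal", "domestic", "industrial", "salty", "muddy"]

# Placeholder true labels and predicted class probabilities (each row sums to 1).
rng = np.random.default_rng(0)
y_true = rng.choice(classes, size=200)
y_score = rng.dirichlet(np.ones(len(classes)), size=200)

# Macro average of the AUC over all pairwise class combinations (Hand & Till 2001),
# which is insensitive to class imbalance.
auc = roc_auc_score(y_true, y_score, multi_class="ovo", average="macro", labels=classes)
print(auc)
```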

As shown in Figures 12 and 13, the features extracted by the PCA algorithm identify the real source of surface water pollution better than the original water quality indicators. However, the misjudgment rate of the PCA algorithm is still higher than that of the proposed algorithm, especially in classifying other kinds of water quality as normal water quality. The water quality features extracted by the PCA algorithm aim to eliminate the correlation among different water quality indicators but ignore the relationship between the water quality indicators and water pollution. As a result, the features extracted by PCA are as independent as possible but are still affected by the noise and nonlinearity of the original water quality indicators. In contrast, the proposed algorithm can better discover the nonlinear relationships between water quality indicators and achieves better performance than PCA.

Figure 12: Performance of water quality features extracted by PCA.

Figure 13: Comparison of ROC curves for different feature extraction algorithms.

This study focuses on a feature extraction algorithm for identifying the classes of surface water pollution. From the original water quality data, we extract discrete features based on the Entropy-MDLP algorithm, in consideration of the distribution characteristics of the water quality indicators, and tree transformed features based on the GBDT algorithm, in consideration of their intrinsic association. We validate the effectiveness of the extracted features by comparing them with the original water quality data and with the features extracted by the PCA algorithm, using different classification algorithms such as SVM and random forest. The results indicate that the proposed feature extraction algorithm can effectively improve the performance of water quality classification. The Entropy-MDLP and GBDT feature extraction methods exploit the distribution characteristics and intrinsic correlation of different indicators, and the proposed algorithm also offers a degree of interpretability: by analyzing abnormal discrete features or tracking the decision nodes of the tree transformed features, we can provide auxiliary decision-making advice for environmental remediation.

The proposed algorithm has certain limitations as well. Given that feature discretization based on the Entropy-MDLP algorithm and feature construction based on the GBDT algorithm must be based on supervised learning, we need sufficient data to ensure adequate generalization capability. In the future, we will mitigate the problem of over-fitting caused by small datasets using techniques such as feature selection or model regularization.

This work was funded by the Key Technology Research and Development Program of Zhejiang Province (No. 2015C03G2010034), the National Natural Science Foundation of China (Nos 61573313 and U1509208), and the National Key R&D Program of China (No. 2017YFC1403801).

Data cannot be made publicly available; readers should contact the corresponding author for details.

Ahmed S. & Ismail S. 2018 Water pollution and its sources, effects & management: a case study of Delhi. International Journal of Current Advanced Research 07(2), 10436–10442.
Azhar S. C., Aris A. Z., Yusoff M. K., Ramli M. F. & Juahir H. 2015 Classification of river water quality using multivariate analysis. Procedia Environmental Sciences 30, 79–84.
Babovic V. & Keijzer M. 2000 Forecasting of river discharges in the presence of chaos and noise. In: Flood Issues in Contemporary Water Management (J. Marsalek, W. E. Watt, E. Zeman & F. Sieker, eds). Springer, Dordrecht, pp. 405–419.
Bin X. U., Lin C. & Mao X. 2014 Analysis of applicability of Nemerow pollution index to evaluation of water quality of Taihu Lake. Water Resources Protection 30(2), 38–40.
Chen T. & Guestrin C. 2016 XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, pp. 785–794.
Dawson R. 2011 How significant is a boxplot outlier? Journal of Statistics Education 19(2), 1–13.
Dougherty J., Kohavi R. & Sahami M. 1995 Supervised and unsupervised discretization of continuous features. In: Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, USA, pp. 194–202.
Fayyad U. & Irani K. 1993 Multi-interval discretization of continuous-valued attributes for classification learning. In: International Joint Conference on Artificial Intelligence, Chambery, France, pp. 1022–1027.
Friedman J. H. 2001 Greedy function approximation: a gradient boosting machine. Annals of Statistics 29(5), 1189–1232.
Guyon I. & Elisseeff A. 2006 An Introduction to Feature Extraction. Springer, Berlin.
He X., Pan J., Jin O., Xu T., Liu B., Xu T. & Candela J. Q. 2014 Practical lessons from predicting clicks on ads at Facebook. In: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, New York, USA, pp. 1–9.
Hearst M. A., Dumais S. T., Osuna E., Platt J. & Scholkopf B. 1998 Support vector machines. IEEE Intelligent Systems and their Applications 13(4), 18–28.
Liaw A. & Wiener M. 2002 Classification and regression by random forest. R News 2(3), 18–22.
Liu S., Lou S., Kuang C., Huang W., Chen W., Zhang J. & Zhong G. 2011 Water quality assessment by pollution-index method in the coastal waters of Hebei Province in western Bohai Sea, China. Marine Pollution Bulletin 62(10), 2220–2229.
Liu D. & Zou Z. 2012 Water quality evaluation based on improved fuzzy matter-element method. Journal of Environmental Sciences 24(7), 1210–1216.
Liu S., Che H., Smith K. & Chang T. 2015a A real time method of contaminant classification using conventional water quality sensors. Journal of Environmental Management 154, 13–21.
Liu S., Che H., Smith K. & Chang T. 2015b Contaminant classification using cosine distances based on multiple conventional sensors. Environmental Science: Processes & Impacts 17(2), 343–350.
Lustgarten J. L., Gopalakrishnan V., Grover H. & Visweswaran S. 2008 Improving classification performance with discretization on biomedical datasets. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association, pp. 445–449.
Lv Q., Xu S.-q., Gu J.-q., Wang S.-f., Wu J., Cheng C. & Tang J.-k. 2016 Pollution source identification of water body based on aqueous fingerprint-case study. Spectroscopy and Spectral Analysis 36(8), 2590–2595.
Muhammad S. Y., Makhtar M., Rozaimee A., Aziz A. & Jamal A. 2015 Classification model for water quality using machine learning techniques. International Journal of Software Engineering & Its Applications 9(6), 45–52.
Singh P. & Kaur P. D. 2017 Review on data mining techniques for prediction of water quality. International Journal of Advanced Research in Computer Science 8(5), 396–401.
Sun Y., Babovic V. & Chan E. S. 2010 Multi-step-ahead model error prediction using time-delay neural networks combined with chaos theory. Journal of Hydrology 395(1–2), 109–116.
Wang X., Babovic V. & Li X. 2017 Application of spatial-temporal error correction in updating hydrodynamic model. Journal of Hydro-Environment Research 16, 45–57.
Wechmongkhonkon S., Poomtong N. & Areerachakul S. 2012 Application of artificial neural network to classification surface water quality. World Academy of Science, Engineering and Technology 6(9), 574–578.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).