Abstract
To effectively prevent river water pollution, water quality monitoring is necessary. However, existing methods for water quality assessment are limited in their ability to characterize water quality conditions, and few researchers have focused on feature extraction methods for water pollution identification or on obtaining accurate information about pollution sources. Thus, this study proposes a feature extraction method based on the entropy-minimal description length principle and the gradient boosting decision tree (GBDT) algorithm for identifying the type of surface water pollution, in consideration of the distribution characteristics and intrinsic associations of conventional water quality indicators. To improve robustness to noise, we constructed coarse-grained discretization features of each water quality indicator based on information entropy. The nonlinear correlations between water quality indicators and pollution classes were mined by the GBDT algorithm, which was used to acquire tree transformed features. Water samples collected by a southern city's Environmental Monitoring Center were used to test the performance of the proposed algorithm. Experimental results demonstrate that features extracted by the proposed method are more effective than the water quality indicators without feature engineering and than features extracted by the principal component analysis algorithm.
HIGHLIGHTS
Different types of water pollution have unique attributes for risk characterization.
Drawing on the characteristics of water quality data, we propose an innovative feature extraction method combining the entropy-minimal description length principle with the gradient boosting decision tree algorithm.
We focus on feature extraction methods for water pollution identification.
INTRODUCTION
Water resources are critical for living organisms and human society. However, economic and population growth has highlighted the negative effects of surface water pollution on people's lives. Recent studies (Sun et al. 2017; Ahmed & Ismail 2018) have shown that untreated sewage and the dumping of waste are the primary causes of water pollution and several diseases. To address this issue, many countries have gradually been establishing water monitoring stations and improving water monitoring hardware facilities (Wang & Yang 2016). Meanwhile, various techniques and algorithms have been used to monitor and evaluate surface water quality, helping the relevant departments use water quality information to make decisions about water resource management.
For analyzing surface water quality, the pollution index method (Liu et al. 2011; Bin et al. 2014), the fuzzy comprehensive evaluation method (Liu & Zou 2012), the anomaly detection method (Jeffrey et al. 2009), and the spectroscopy method (Lv et al. 2016) are used, and many other methods have been applied to assess surface water quality (Wang et al. 2016, 2017, 2019). The pollution index method and the fuzzy comprehensive evaluation method help researchers quantify the level of water pollution. However, these approaches are limited in their ability to characterize the conditions of surface water quality; they cannot provide valuable information, such as the cause of surface water pollution. The spectroscopy method can identify the category of specific pollutants, but its maintenance burden prevents it from being extended to large-scale real-time online monitoring. To monitor water quality effectively, conventional water quality indicators are therefore more widely applied.
Recently, water pollution identification with conventional water quality indicators has been explored using data-driven algorithms; it is essentially a classification task. A classification algorithm is mainly composed of two parts: feature extraction and the classification model. Because of the uncertainty and distribution characteristics of conventional water quality monitoring indicators, the original data should not be used directly as input features of classification models. Liu et al. (2015a, 2015b) used Mahalanobis and cosine distances to measure the similarity between characteristic pollutant vectors. They demonstrated that the type of contaminant could be determined as the class with the minimum distance. Muhammad et al. (2015) compared the performance of five classification models and analyzed the critical features among 53 water quality indicators. However, water quality monitoring data are affected by many external factors, such as instrument accuracy and incorrect manual operation, which lead to misjudgment. Previous studies have applied feature extraction to overcome this limitation, using methods such as principal component analysis (PCA) (Olsen et al. 2012), artificial neural networks (ANN) (Wechmongkhonkon et al. 2012), and cluster analysis (Azhar et al. 2015). The PCA algorithm can extract latent variables from noisy hydrological data, and the ANN algorithm can extract useful features by adjusting its hidden layers via the gradient descent algorithm. However, these methods do not capture the joint response patterns across multiple indicators well, and many of them do not offer interpretable results for policymakers, managers, and other nontechnical people (Singh & Kaur 2017).
Different forms of water pollution have unique attributes when it comes to risk characterization. However, only a few studies on surface water pollution have considered the distribution characteristics and intrinsic associations of water quality indicators. Thus, this is the first study to analyze the relationships between conventional water quality indicators, explore classification methods for identifying the significant types of surface water pollution with conventional water quality indicators, and develop a new feature extraction method based on the entropy-minimal description length principle (Entropy-MDLP) and the gradient boosting decision tree (GBDT) algorithm. Coarse-grained discretization features of each water quality indicator, based on information entropy, were constructed to improve robustness to noise. The nonlinear correlations between water quality indicators and water pollution were mined by the GBDT algorithm, which was used to acquire combined features. Comparison experiments with different feature extraction methods were conducted; the results reveal that the proposed method has better identification performance than the water quality indicators without feature engineering and than features extracted by the PCA algorithm.
METHODOLOGY
This study proposes a feature extraction method for water quality data to help classification algorithms automatically determine whether surface water is polluted and distinguish the source of the pollution. As shown in Figure 1, data on four significant types of surface water pollution were collected for this study: industrial sewage, domestic sewage, muddy water, and salty tidewater. Six conventional water quality indicators, namely, chemical oxygen demand (COD), NH3N, dissolved oxygen (DO), pH, turbidity, and electrical conductivity (EC), were measured for identifying water pollution. Depending on the water quality monitoring conditions, other water quality indicators can also be chosen.
Feature extraction is a type of data transformation that includes feature construction, discretization, and subset selection (Guyon & Elisseeff 2006). In this study, the proposed feature extraction method focuses on feature discretization and construction. The Entropy-MDLP and the GBDT algorithms are proposed to obtain discrete features and combined features. The distinct and combined features are integrated into the classification model to identify the source of surface water pollution.
Feature discretization based on the Entropy-MDLP algorithm
There are certain drawbacks associated with the direct use of noisy hydrological data as input for the classification model in identifying the source of surface water pollution. The commonly used feature extraction algorithms, such as PCA or independent component analysis, can be used for noise reduction but ignore the class information. The entropy theory (Babovic & Keijzer 2000; Sun et al. 2010), which uses the entropy value as a measure of variability, has been applied in hydrology for years. The Entropy-MDLP algorithm (Dougherty et al. 1995; Lustgarten et al. 2008) discretizes each continuous water quality indicator into multiple intervals to obtain discrete features, which can improve the robustness of classification models to outliers and noise data. The algorithm can also help the classification model discover nonlinear relations between water quality indicators.
The Entropy-MDLP algorithm (Fayyad & Irani 1993) is a supervised algorithm that uses the information entropy minimization heuristic to select cut points recursively. This algorithm constructs an objective function of stopping criteria for the recursive discretization strategy based on the MDLP, and the steps are summarized as follows.
The boundary that minimizes the entropy function over all candidate cut points is selected as a cut point.
Step 4: Discretize all water quality indicators, converting each continuous value into a discrete value under the rules found by Entropy-MDLP discretization, for all instances.
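The cut-point selection above can be sketched in a few lines. This is a minimal illustration of one recursion level of the entropy-minimization heuristic, with a toy dataset; it omits the MDLP stopping criterion and the recursive partitioning of the two resulting intervals, and all names here (`entropy`, `best_cut_point`) are illustrative, not from the paper.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a class-label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut_point(values, labels):
    """Pick the boundary minimizing the weighted class entropy of the
    two-interval partition (one recursion level of Entropy-MDLP)."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    n = len(v)
    best_t, best_e = None, np.inf
    # Candidate boundaries: midpoints between adjacent distinct values
    for i in range(1, n):
        if v[i] == v[i - 1]:
            continue
        t = (v[i] + v[i - 1]) / 2
        e = (i / n) * entropy(y[:i]) + ((n - i) / n) * entropy(y[i:])
        if e < best_e:
            best_t, best_e = t, e
    return best_t

# Toy EC-like readings: class 1 (salty tide) has much higher values
ec = [300, 350, 400, 500, 1200, 1300, 1400]
cls = [0, 0, 0, 0, 1, 1, 1]
print(best_cut_point(ec, cls))  # → 850.0
```

In the full algorithm, this search is applied recursively to each resulting interval until the MDLP criterion says that further splitting is not worthwhile.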
Feature construction based on the GBDT algorithm
The discrete features extracted by the Entropy-MDLP algorithm are coarse-grained; thus, misjudgment easily occurs when an instance lies close to a classification boundary. The GBDT (He et al. 2014) algorithm aims to construct new features at a finer granularity by building decision trees to obtain a set of decision rules with the original features as input. It works as a supervised feature encoding method that converts a real-valued vector into a binary-valued vector. A tree transformed feature is constructed by a traversal from the root node to a leaf node. The algorithm creates a new tree to model the residual of the previous trees in each learning iteration (Friedman 2001). As a result, the constructed features are easily distinguishable. The specific steps of the algorithm are summarized as follows.
Step 2: Assume the number of trees is M; for m = 1 to M:
Step 2.2: Create a new tree using the CART (classification and regression tree) algorithm to model the residual.
Step 3: Obtain the final boosted tree model.
Step 4: Convert the real-valued vector into a binary-valued vector for each instance. Each tree is treated as a categorical feature whose value is the index of the leaf in which the instance ends up. The features generated by each individual tree are encoded by the one-hot encoding algorithm (Alkharusi 2012). An example is shown in Figure 2.
Figure 2 shows that, if an instance ends up in leaf 1 in the first subtree, leaf 3 in the second subtree, and leaf 1 in the last subtree, then the overall input to the classifier will be the binary vector [1, 0, 0, 0, 0, 1, 0, …, 1, 0], where each entry corresponds to a leaf of a subtree.
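The leaf-index encoding described above can be sketched with scikit-learn, whose `GradientBoostingClassifier.apply` returns, for every sample, the leaf it falls into in each tree. This is a hedged sketch on synthetic data, not the paper's exact pipeline (which uses XGBoost); the model sizes here are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for six-indicator water quality data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Fit a small boosted-tree model (analogous to the paper's GBDT step)
gbdt = GradientBoostingClassifier(n_estimators=25, max_depth=5, random_state=0)
gbdt.fit(X, y)

# apply() gives the leaf index of each sample in every tree:
# shape (n_samples, n_estimators, n_trees_per_stage); flatten per sample
leaves = gbdt.apply(X).reshape(X.shape[0], -1)

# One-hot encode the leaf indices tree by tree, as in Figure 2
enc = OneHotEncoder()
tree_features = enc.fit_transform(leaves).toarray()

print(leaves.shape)         # (200, 25)
print(tree_features.shape)  # (200, total number of leaves)
```

Each row of `tree_features` contains exactly one 1 per tree, matching the binary vector construction of Figure 2.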
Evaluation of feature extraction algorithm
Classification algorithm
To evaluate the effectiveness of the proposed algorithm comprehensively, two distinct, commonly used algorithms are selected as classifiers. The first is the support vector machine (SVM) (Hearst et al. 1998), a pattern classification algorithm based on finding a hyperplane with the largest possible margin among different classes of data; the second is random forest (Liaw & Wiener 2002), an ensemble method that builds a forest of uncorrelated random decision trees and aggregates their predictions. To compare the generalization performance of the classifiers with features extracted by different algorithms, 80% of the water samples are randomly chosen as a training set, and the remaining samples are used as a test set.
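The evaluation protocol above can be sketched with scikit-learn. Synthetic data stands in for the extracted water quality features, and the split ratio matches the paper's 80/20 partition; everything else (model hyperparameters, dataset shape) is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data standing in for the extracted water quality features
X, y = make_classification(n_samples=500, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

# 80/20 random split, as in the paper
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Train and score both classifiers on the held-out test set
for clf in (SVC(kernel='rbf'), RandomForestClassifier(random_state=0)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, round(clf.score(X_te, y_te), 3))
```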
Confusion matrix
Identifying the source of surface water pollution is primarily a multi-class classification problem. Thus, a confusion matrix is used to visualize the performance of the classification algorithm. Each row of the matrix represents the instances in an actual class, whereas each column represents the instances in a predicted class. All correct predictions are located in the diagonal of the matrix. The classification result can be easily visually inspected because the prediction errors are represented by values outside the diagonal.
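A small worked example of this matrix layout, using hypothetical labels for three pollution classes (the class coding here is illustrative, not from the paper):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = normal river water,
# 1 = domestic sewage, 2 = industrial sewage
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]

# Rows are actual classes, columns are predicted classes;
# correct predictions lie on the diagonal
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1 0]
#  [0 2 0]
#  [1 0 2]]
```

The off-diagonal entries immediately show, for instance, that one normal sample was misclassified as domestic sewage.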
Assessment criteria
EXPERIMENT
Based on the proposed feature extraction method, this study designed the experiment as shown in Figure 3.
- i. Data acquisition and preprocessing: outliers are eliminated and the data are standardized.
- ii. Dataset partitioning: the dataset is randomly divided into a training set (80%), used for model training, and a test set (20%), used for model evaluation.
- iii. Feature extraction: the method proposed in this paper extracts discretization features and combination features from the original water quality data.
- iv. Model training and tuning: the random forest algorithm, which combines multiple trees under the ensemble learning paradigm, and the SVM algorithm, which performs well on small datasets, are taken as classifiers; the extracted features are used as model input, and the parameters are tuned according to the model's performance on the test set.
Data acquisition and preprocessing
The proposed pollution source identification method is tested on real data obtained from the Hangzhou Environmental Monitoring Center. Water samples are collected from water quality monitoring stations, domestic sewage outlets, industrial sewage outlets, and sewage treatment plants; some water samples are obtained by artificial sampling. The dataset contains 2,843 instances. The numbers of water quality samples belonging to the different classes are shown in Table 1.
| Pollution source | Sample size | Data source |
| --- | --- | --- |
| Normal river water | 1,768 | 15 water quality monitoring stations |
| Domestic sewage | 290 | 8 domestic sewage outlets |
| Industrial sewage | 238 | 5 industrial sewage outlets |
| Salty tide | 300 | Artificial sampling |
| Muddy water | 247 | Artificial sampling |
Six conventional water quality indicators, namely, COD, NH3N, DO, pH, turbidity, and EC, are measured for each instance of these water samples. These water samples are all measured by the instruments supported by the Laboratory of Smart Environmental Sensing and Control at Zhejiang University. The information of devices is shown in Table 2.
| Indicator | Instrument | Measurement technique | Range | Resolution |
| --- | --- | --- | --- | --- |
| COD | spectra::lyser™ | Spectrometry | 0–4,000 mg/L | 0.01 mg/L |
| NH3N | GR-3411 | Spectrophotometric | 0–10 mg/L | 0.01 mg/L |
| DO | SDF-02 | Fluorescence | 0–20 mg/L | 0.01 mg/L |
| pH | SPC-02 | Potential analysis | 0–14 | 0.01 |
| Turbidity | CTR-01 | 90° scattering | 0–1,000 NTU | 0.01 NTU |
| EC | SCE-01 | Potential analysis | | 1 μS/cm |
We measured the water samples with the above instruments at different times and used them as training and testing sets for our model, and the details of the dataset are summarized in Table 3.
| Dataset | Normal river water | Domestic sewage | Industrial sewage | Salty tide | Muddy water |
| --- | --- | --- | --- | --- | --- |
| Training set | 1,100 | 800 | 548 | 932 | 608 |
| Testing set | 275 | 200 | 137 | 233 | 152 |
For a precise observation of the distribution of water quality indicators under different types of water pollution, the distribution of each indicator is plotted using a box plot. An outlier is defined as a data point located more than 1.5 times the interquartile range above the upper quartile or below the lower quartile (Dawson 2011).
As shown in Figure 4, evident outliers are present in the original water quality data. These outliers, caused by the instruments or incorrect experimental operations, are removed because they may negatively affect the performance of the algorithm. Necessary preprocessing of the water quality data, namely normalization and deletion of missing values, is also performed.
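The 1.5×IQR outlier rule and the standardization step can be sketched as follows. The turbidity readings below are hypothetical, invented purely to show the mechanics; the paper applies the same rule per indicator.

```python
import numpy as np

def iqr_filter(x, k=1.5):
    """Keep values within k*IQR of the quartiles (Tukey's rule)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    mask = (x >= q1 - k * iqr) & (x <= q3 + k * iqr)
    return x[mask]

# Hypothetical turbidity readings (NTU) with one sensor glitch
turbidity = np.array([20.1, 22.4, 19.8, 21.0, 23.5, 980.0])
clean = iqr_filter(turbidity)
print(clean)  # the 980.0 glitch is removed

# Z-score standardization of the remaining values
standardized = (clean - clean.mean()) / clean.std()
```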
Feature extraction for water quality indicators
Feature discretization based on the Entropy-MDLP algorithm
The EC indicator is used to show the specific calculation process of the Entropy-MDLP algorithm. The probability density distribution of EC under different classes of surface water pollution sources is shown in Figure 5.
As shown in Figure 5, a considerable difference is observed in the probability density distribution of EC under different classes of water pollution. However, because the EC values of salty tidewater pollution far exceed those of the other types, the other types of surface water pollution become difficult to distinguish after data normalization. This phenomenon often causes monitoring failures. Feature discretization based on the Entropy-MDLP algorithm is used to obtain discrete features for the water quality indicators to solve this issue. In this way, the robustness of classification models to the distribution of water quality indicators can be improved.
The EC values are sorted, and the best cut point is recursively found. The results of the discretization are shown in Figure 6: the EC values are discretized into seven intervals, which are distinguishable for identifying water pollution. The discretization results for the other water quality indicators are shown in Table 3.
The continuous water quality indicators are converted into discrete features with the cut points found by the Entropy-MDLP algorithm, which can improve the robustness of classification. After that, the one-hot encoding technique is used to encode the discrete features into a binary-valued vector for introducing a nonlinear relationship between water quality indicators.
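The conversion from continuous values to one-hot-encoded interval indices can be sketched with the EC cut points reported later in the paper. The sample EC readings here are made up for illustration.

```python
import numpy as np

# Cut points found by Entropy-MDLP for EC (from the paper)
ec_cuts = [313, 377, 529, 711, 1002, 1397]

# Hypothetical EC readings (μS/cm)
ec_values = np.array([250.0, 450.0, 800.0, 1500.0])

# Map each continuous value to the index of its interval
interval = np.digitize(ec_values, ec_cuts)
print(interval)  # [0 2 4 6]

# One-hot encode each interval index into a binary-valued vector,
# introducing nonlinearity for the downstream classifier
n_intervals = len(ec_cuts) + 1  # 6 cut points -> 7 intervals
one_hot = np.eye(n_intervals, dtype=int)[interval]
print(one_hot)
```

Each row of `one_hot` has exactly one active bit, so the classifier sees interval membership rather than the raw magnitude.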
Feature construction based on the GBDT algorithm
Feature construction based on the GBDT algorithm aims to integrate different water quality indicators automatically. To obtain the newly constructed features, a GBDT model is trained with water quality data first. The GBDT model used in this study is XGBoost (Chen & Guestrin 2016), an implementation of the GBDT algorithm. The maximum depth of each boosted tree is 5, and the number of boosted trees is 25. The first generated boosted tree is shown in Figure 7.
The decision tree shown in Figure 7 is the first boosted tree generated by the GBDT model, which is trained on the training set. A traversal from the root node to a leaf node represents a rule constructed by specific water quality indicators. For example, leaf 1 in the first subtree indicates that the feature is built by turbidity, pH, and EC. The feature construction based on the boosted decision tree can be understood as a supervised feature encoding method that converts a real-valued vector into a binary-valued vector. Unlike the discrete features that treat each water quality indicator separately, tree transformed features take into account the internal connection and nonlinear relationship between water quality indicators.
The leaves in the first boosted tree are visualized by a heat map, as shown in Figure 8, for analyzing the principle of the features constructed by the GBDT algorithm. The vertical and horizontal axes represent the leaves and the sources of surface water pollution, respectively. The shades of color represent the probability that samples from the various sources fall into each leaf node.
Figure 8 shows that most of the normal water quality samples fall into leaf 2, which is distinguished from other classes of water quality samples. The leaves in other boosted trees can separate the other categories of water quality by different combinations of water quality indicators. Therefore, the features extracted by the GBDT algorithm can considerably affect the identification of water pollution sources.
RESULTS AND DISCUSSION
We obtain 51 dimensions of discrete features and 161 dimensions of tree transformed features using Entropy-MDLP and GBDT algorithms. We take the discrete features, tree transformed features, and the combination of them as input to the classifiers. To comprehensively evaluate the effectiveness of the proposed feature extraction method, we choose the SVM and random forest algorithms as classifiers, and the precision rate, recall rate, and f1 score are used as assessment criteria.
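The three assessment criteria can be computed directly from predicted and true labels; the macro-averaged variants used here weight every pollution class equally. The labels below are hypothetical, chosen only to demonstrate the computation over five classes.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical predictions over five pollution classes
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 0, 1, 2, 2, 2, 3, 0, 4, 4]

# Macro averaging: per-class metrics averaged with equal class weight
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0)
print(round(precision, 3), round(recall, 3), round(f1, 3))
# → 0.867 0.8 0.787
```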
As shown in Table 4, the combination of discrete and tree transformed features outperforms the discrete or tree transformed features alone. The Entropy-MDLP algorithm extracts features based on the distribution characteristics of water quality indicators, whereas the GBDT algorithm is based on the intrinsic association of different indicators. Given that they work in different ways, their combination has better performance than using them separately.
| Indicator | Cut points |
| --- | --- |
| COD (mg/L) | [6.12, 6.45, 7.85, 8.74, 10.17, 11.33, 14.94] |
| NH3N (mg/L) | [0.32, 1.11, 1.57, 1.79, 2.22, 2.42, 3.42, 4.21] |
| DO (mg/L) | [0.99, 2.12, 2.96, 3.53, 4.6, 5.98, 7.22, 7.71] |
| pH | [5.62, 6.56, 6.88, 7.1, 7.43, 7.8, 8.78] |
| Turbidity (NTU) | [19.03, 30.24, 37.24, 42.41, 46.23, 61.93, 86.6, 109.82, 155.87] |
| EC (μS/cm) | [313, 377, 529, 711, 1002, 1397] |
The proposed feature extraction algorithm is also compared with original water quality indicators without feature extraction and the features extracted by the PCA algorithm. The results shown in Table 5 indicate that the proposed algorithm outperforms other feature extraction algorithms. As shown in Table 6, the classifiers perform worst when we use the original water quality indicators as features. To explore the patterns and distribution of water quality indicators under different types of water pollution, as shown in Figure 9, the pair plot is used to visualize the measured original water quality indicators.
| Metric | SVM (MDLP) | SVM (GBDT) | SVM (MDLP–GBDT) | Random forest (MDLP) | Random forest (GBDT) | Random forest (MDLP–GBDT) |
| --- | --- | --- | --- | --- | --- | --- |
| Precision | 0.791 | 0.787 | 0.865 | 0.819 | 0.808 | 0.913 |
| Recall | 0.770 | 0.769 | 0.854 | 0.809 | 0.791 | 0.904 |
| f1 score | 0.771 | 0.771 | 0.849 | 0.805 | 0.787 | 0.903 |
| Metric | SVM (Original) | SVM (PCA) | SVM (MDLP–GBDT) | Random forest (Original) | Random forest (PCA) | Random forest (MDLP–GBDT) |
| --- | --- | --- | --- | --- | --- | --- |
| Precision | 0.694 | 0.806 | 0.865 | 0.787 | 0.813 | 0.913 |
| Recall | 0.663 | 0.788 | 0.854 | 0.755 | 0.802 | 0.904 |
| f1 score | 0.695 | 0.785 | 0.849 | 0.751 | 0.796 | 0.903 |
The pair plot builds on two basic figures: the histogram and the scatter plot. The histogram on the diagonal visualizes the distribution of a single water quality indicator, while the scatter plots on the upper and lower triangles show the relationship between two indicators. As shown in Figure 9, it is difficult to identify the type of water pollution by a simple thresholding method using a single indicator because of the nonlinear relationship between water quality indicators and water pollution.
To explore the intrinsic mechanism of water quality features extracted by the proposed algorithm that can effectively identify the source of surface water pollution, 80% of instances in our dataset are used for training, and the rest is for validation. The classification results are displayed by a confusion matrix.
Using the SVM as a classifier, with the original water quality features, the discrete features extracted by the Entropy-MDLP algorithm, and the combined features extracted by the proposed algorithm as input, the confusion matrices shown in Figure 10 are obtained.
The results show that the discrete features perform better than the original water quality data without feature engineering. The precision rate for normal river water and the recall rates for the other water quality classes are considerably improved. The reason is that the SVM classifier is sensitive to outliers and to the distribution of the water quality data. With the original water quality indicators, long-tailed or similar distributions under different kinds of water quality considerably affect the classification results of the SVM. The discrete features reduce the effect of the long-tailed distribution by introducing nonlinearity into the original water quality indicators. However, if the SVM classifier takes only discrete features as input, it easily misclassifies domestic sewage or salty tidewater as industrial sewage. This situation is effectively improved when the discrete and tree transformed features are combined as input, because the intrinsic associations among water quality indicators are then taken into account.
The results with the random forest as the classifier are shown in Figure 11. With the discrete features as input, the classification results are not as good as with the original water quality features. The reason is that the random forest classifier is sensitive to the distribution of the water quality data rather than to their specific values. Although discretization improves the robustness of the model to some extent, it transforms fine-grained features into coarse-grained ones, which can easily lead to misjudgment. Therefore, discretization does not considerably improve the performance of the random forest classifier. However, the tree transformed features obtained by the GBDT algorithm can be considered more fine-grained features, so the combination of coarse-grained and fine-grained features significantly improves the performance.
The PCA feature extraction algorithm is also compared with the proposed algorithm. The confusion matrices obtained when the SVM and random forest take the features extracted by the PCA algorithm as input are shown below. Moreover, macro-average receiver operating characteristic (ROC) curves are used to compare the performance of the different feature extractors; they compute the average area under the curve over all possible pairwise combinations of classes and are insensitive to class imbalance (Hand & Till 2001).
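The pairwise macro-averaged measure of Hand & Till (2001) is available in scikit-learn as the 'ovo'/'macro' mode of `roc_auc_score`. The class-probability outputs below are hypothetical, constructed so that every pairwise ranking is perfect.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical class-probability outputs for three pollution classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_prob = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])

# 'ovo' + 'macro' averages the AUC over all pairwise class
# combinations: the class-imbalance-insensitive measure of
# Hand & Till (2001)
auc = roc_auc_score(y_true, y_prob, multi_class='ovo', average='macro')
print(auc)  # 1.0 here: every pairwise ranking is perfect
```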
As shown in Figures 12 and 13, the features extracted by the PCA algorithm identify the real source of surface water pollution better than the original water quality indicators. However, the misjudgment rate of the PCA algorithm is still higher than that of the proposed algorithm, especially in classifying other kinds of water quality as normal. The features extracted by the PCA algorithm aim to eliminate the correlation among different water quality indicators but ignore the relationship between the indicators and water pollution. As a result, these features are as independent as possible but are still affected by the noise and nonlinearity of the original indicators. In contrast, our proposed algorithm better captures the nonlinear relationships between water quality indicators and achieves better performance than PCA.
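The point that PCA decorrelates indicators without using class labels can be seen in a short sketch. The data here are random stand-ins for standardized six-indicator readings; note that `fit_transform` never sees the pollution labels.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical standardized readings of six indicators
X = rng.normal(size=(100, 6))

# PCA is unsupervised: components maximize variance and decorrelate
# the indicators, with no reference to pollution-class labels
pca = PCA(n_components=3)
Z = pca.fit_transform(X)
print(Z.shape)                    # (100, 3)
print(np.round(np.cov(Z.T), 8))  # off-diagonal ≈ 0: decorrelated
```

This label-blindness is exactly why PCA features, unlike the supervised Entropy-MDLP and GBDT features, can discard directions that are discriminative for pollution classes.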
CONCLUSION
This study focuses on a feature extraction algorithm for identifying the classes of surface water pollution. From the original water quality data, we extract discrete features based on the Entropy-MDLP algorithm, in consideration of the distribution characteristics of the water quality indicators, and tree transformed features based on the GBDT algorithm, which capture the intrinsic associations among indicators. We validate the effectiveness of the extracted features by comparing them with the original water quality data and with features extracted by the PCA algorithm, using different classification algorithms, namely SVM and random forest. The results indicate that the proposed feature extraction algorithm can effectively improve the performance of water quality classification. The Entropy-MDLP and GBDT feature extraction methods extract features based on the distribution characteristics and intrinsic correlations of the different indicators. The proposed algorithm also offers a degree of interpretability: by analyzing abnormal discrete features or tracking the decision nodes of tree transformed features, we can provide auxiliary decision-making advice for environmental remediation.
The proposed algorithm also has certain limitations. Given that feature discretization based on the Entropy-MDLP algorithm and feature construction based on the GBDT algorithm are both supervised, sufficient data are needed to ensure adequate generalization capability. In the future, we will mitigate the over-fitting caused by small datasets using techniques such as feature selection or model regularization.
ACKNOWLEDGEMENTS
This work was funded by the Key Technology Research and Development Program of Zhejiang Province (No. 2015C03G2010034), the National Natural Science Foundation of China (Nos 61573313 and U1509208), and the National Key R&D Program of China (No. 2017YFC1403801).
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.