Machine learning method for quick identification of water quality index (WQI) based on Sentinel-2 MSI data: Ebinur lake case study

Surface water quality is an important factor affecting both the ecological environment and the human living environment. Remote sensing offers a practical means of monitoring surface water quality in support of water resource protection and water quality evaluation. Identifying the spectral index most sensitive to water quality is therefore important for surface water quality analysis and treatment in the arid Ebinur Lake Basin. This study used Sentinel-2 MSI data at 10 m resolution to rapidly monitor the water quality of the watershed. From laboratory experiments and field measurements in the Ebinur Lake Basin, 22 water quality parameters (WQPs) were obtained. Through Z-score and redundancy analysis, nine WQPs with significant contributions were extracted. From the remote sensing spectral bands, four water indices (NDWI, NWI, EWI, AWEI-nsh) and three 2D modeling spectral indices (DI, RI, NDI) were computed. A correlation analysis between the WQPs and the two kinds of spectral indices showed that the WQPs are, overall, more strongly correlated with the 2D spectral indices. The Water Quality Index (WQI) was then calculated, related to the 2D spectral indices, and predicted with four machine learning algorithms (RF, SVM, PLSR, PLSR-SVM). The results show that the 2D spectral modeling indices invert the WQPs better than the water indices do, and the correlation coefficient of the difference index DI (R12-R1), built from the SWIR-2 and BLUE bands, reaches 0.787. On this basis, the three 2D spectral modeling indices were used to invert the WQI; the best result was obtained with the ratio index RI (R11/R8), built from the SWIR-1 and near-infrared (NIR) bands, with a correlation coefficient of 0.69.
In the WQI prediction, the partial least squares regression-support vector machine (PLSR-SVM) model showed good modeling and prediction performance (Rc = 0.873, Rv = 0.87). The results provide a reference for the remote monitoring of surface water in arid areas and a basis for water quality prediction and safety evaluation.


INTRODUCTION
With the continuous development of water resources, water quality has become a popular issue of global concern (Pahlevan et al. ). The area of human activity continues to expand worldwide, and the impact on water quality is gradually becoming more severe (Mananze et al. ).
Water quality is deteriorating increasingly because of pollution emissions from urban clusters, industry, and agricultural fertilizers (Shao et al. ; Lintern et al. ; Wang et al. ). Water pollution has had a considerable impact on human production and life and has increased the health risks of ecosystems (Chen et al. ). Therefore, water quality and safety assessments are extremely important. Water quality has become the most important indicator in aquatic ecosystems. The quality of water directly affects the health of aquatic ecosystems and fully reflects the spatiotemporal heterogeneity and integrity of river water environment ecosystems (Staponites et al. ). It is therefore necessary to monitor water pollution in real time and to record changes in water quality.
In the current situation, remote sensing can be used to accurately and quickly monitor river water quality, and there is an opportunity to regularly monitor aquatic systems using different sensor satellite products (Tyler et al. ).
Remote sensing technology has become an important method in water resource surveys and monitoring, the remote monitoring of water environments, regional ecological remote sensing monitoring, urban environmental remote sensing monitoring and wetland protection. Reducing the interference of non-water factors and enhancing the expression of water body factors are key tasks in remote sensing to identify water body information (Li et al. ).
Landsat image recognition technology for water bodies has been well established around the world. With the continued development of machine learning methods, more and more water quality studies use them to predict water quality (Gao et al. ). PLSR is widely used to build multivariate models because it can handle the collinearity among variables and reduce redundancy in related data sets (Sidike et al. ; Wang et al. ). Random forest (RF) has a good application basis for water quality index prediction (Meyers et al. ), and the support vector machine (SVM) can perform well on nonlinear relationships when used alone (Nawar et al. ; Lucà et al. ). PLSR-SVM not only yields high-quality results but also allows spectral bands to be combined to improve the inversion effect (Zhang et al. ).
Additionally, this technology applies a cluster analysis method to establish a stable water index, extract surface water information (Wang et al. b) and establish an automatic classification set (Fisher et al. ). The remote sensing of water quality mainly includes water color remote sensing and water quality remote sensing. Water color remote sensing mainly inverts the optical and spectral characteristics of water bodies, such as the suspended matter and chlorophyll a levels. Conversely, if a given water body has no significant optical and spectral properties, other WQPs that are closely related to the directly observable WQPs, such as total nitrogen and total phosphorus, can be inverted by water quality remote sensing (Ma & Dai ). In such research, chromaticity is used to predict water quality, based on changes in chromaticity and on the inversion of water body radiation and WQPs through spectroscopy. The spectral complexity of inland lakes has also gradually been considered (Bukata et al. ).
In 1991, Anatoly Gitelson () studied the remote sensing of inland water quality through aerospace remote sensing and studied the potential associated with using chlorophyll a (Lintern et al. ), dissolved organic matter (DOM), and suspended solids (SM) concentrations.
BABAN and Serwan (Baban ) used Landsat satellites to perform a regression analysis of TM data and surface WQPs in 1993 and to simulate and predict the water quality of inland lakes. Since 1990, Chinese scholars have studied the water quality of inland rivers through remote sensing (Yu et al. ). The research on chlorophyll and suspended solids has been continuously improved, and the inversion accuracy also been improved (Ma & Dai ). In the inversion of the water quality of inland rivers, by selecting different spectral band combinations and combining the water quality spectral characteristic parameters, the accuracy of establishing estimation spectral and water quality estimation models has been improved.  (Terrado et al. ). The WQI appears to lag behind the spatial and temporal changes in water quality samples when evaluating individual water samples, and problems related to the continuous large-scale comprehensive evaluation of water quality changes must be solved with the development of remote sensing technology (Wang et al. ). In this paper, WQPs are extracted to establish the corresponding WQI; then, the effects of the most prominent parameters of river basin water on river water quality are investigated.
The river water ecosystem in the Ebinur Lake Basin, Xinjiang, was studied. The Ebinur Lake Basin is located in an arid zone and adjacent to the largest arid zone in the temperate region of the Northern Hemisphere. The basin has typical regional characteristics, large climate changes and considerable evapotranspiration, and the artificial largescale extraction of water for agricultural production has led to regional ecological and hydrological changes. Intensive non-point source pollution caused by the development of water and soil resources has affected the water quality of the basin to a certain extent. To this end, it is necessary to quickly identify and manage water quality through remote sensing images and provide technical support for environmental protection and safety.
In this article, the water quality in arid regions is studied through an analysis of the correlation between the water index and the WQI based on a spectral combination method, and a correlation analysis between the band combination index and the WQI is performed to identify the optimal WQI. The spectral band combination that is most suitable for water quality in arid areas is also determined. Furthermore, the effects of different parameters on the water quality in the entire basin is studied to provide important technical support for the future macrocontrol of water bodies.

Study area
The study was performed in the Ebinur Lake Watershed, in the arid hinterland of the Eurasian continent. The plain is approximately 220 km from east to west and 60-120 km from north to south. The total area of the basin is 2,080 km². Ebinur Lake is located in the low-lying area of the river tail basin (Yu & Jiang ), and the lake is recharged by the surrounding surface water and groundwater. The water of Ebinur Lake comes mainly from several inflowing rivers, such as the Jing River, Boertala River and Kuitun River. The complexity of the geographical environment has affected the ecological environment of the Ebinur Lake Basin to varying degrees.
The water quality of river water directly affects the growth of vegetation in the basin, the degree of salinization of the soil and the living conditions for human beings (Wang et al. ).

Sentinel-2 image acquisition and preprocessing
The MultiSpectral Instrument (MSI) was launched by the European Space Agency onboard the Sentinel-2A satellite on June 23, 2015. The MSI has 13 bands ranging from the visible (VIS) to the near-infrared (NIR) and shortwave-infrared (SWIR), at three spatial resolutions of 10, 20, and 60 m (Table 1). Notably, the MSI is the only terrestrial remote sensing satellite in the world with a ground resolution of up to 20 m.
Four specialized bands (B5, B6, B7, and B8a) were designed to obtain the spectral characteristics of the vegetation in the near-infrared 'red edge' region (690-800 nm), and these bands are close to B4 at the red wavelength (Peterson et al. ).
We downloaded the Sentinel-2 image from the ESA Sentinel Scientific Data Hub (https://scihub.copernicus.eu/); the data are a radiometrically calibrated L1C product. The image date (October 2017, cloudless) was chosen to match the field survey. To obtain the L2A product, we performed atmospheric correction using the Sen2Cor toolbox in SNAP software (Main-Knorn et al. ).
Geometric correction was also carried out, to an accuracy of ±0.5 pixels. Using ENVI 5.5 software, the 20-m and 60-m resolution bands were resampled to 10 m. To verify the water spectrum, the spectrum of a typical sample was extracted (Figure 2).
In general, the water spectrum shows a clear reflectance feature in the B3 band, and absorption continues to increase after the B4 band. Absorption is strongest in the B8a band, which is consistent with the known behavior of water across this spectral range. This comparison indicates that the spectra used in this study were accurate.

Water sampling collection and analysis
River water quality data selection is mainly based on China's surface water environmental quality standard GB3838-2002 (GB- ); in this process, water samples were collected from the basin, and outdoor and indoor tests were

Water spectral indices
There are many established water indices that can be applied to Landsat TM/ETM+ data. This study selected four classic water indices for analysis. Based on the Landsat WQI, the improved WQI was developed to reflect the inversion of water quality in each band of Sentinel-2. The   found that the two types of indices were suitable for water extraction in the Ebinur Lake Basin and that the extraction accuracy was sufficient (Table 3).
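As an illustration of how band-based water indices of this kind are evaluated, the following minimal sketch computes two of the four indices for single pixels. It assumes the standard NDWI form (green and NIR, mapped here to Sentinel-2 B3 and B8, which is an assumption about the band choice) and takes the AWEI-nsh coefficients as they are printed in this paper; those coefficients are exposed as parameters and should be checked against the original AWEI publication before reuse. The reflectance values are hypothetical.

```python
def ndwi(green, nir):
    """Normalized difference water index: (green - NIR) / (green + NIR)."""
    total = green + nir
    # Guard against division by zero on empty or masked pixels.
    return (green - nir) / total if total != 0 else 0.0


def awei_nsh(green, nir, swir1, swir2, nir_coef=2.5, swir2_coef=2.75):
    """AWEI-nsh; default coefficients are copied from this paper (assumption)."""
    return 4.0 * (green - swir1) - (nir_coef * nir + swir2_coef * swir2)


# Hypothetical surface reflectances (not measured data) for two pixels.
pixels = {
    "water": {"B3": 0.08, "B8": 0.02, "B11": 0.01, "B12": 0.01},
    "soil": {"B3": 0.12, "B8": 0.25, "B11": 0.30, "B12": 0.28},
}

for name, px in pixels.items():
    print(
        name,
        round(ndwi(px["B3"], px["B8"]), 3),
        round(awei_nsh(px["B3"], px["B8"], px["B11"], px["B12"]), 3),
    )
```

With these illustrative inputs, both indices are positive for the water pixel and negative for the soil pixel, which is the contrast that water extraction relies on.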

Water quality index (WQI)
As there are a variety of chemical, physical and biological variables among the water quality parameters, not all of these parameters are influential in terms of overall water quality. A better way to invert the overall water quality is to select the water quality parameters with greater weight in the PCA analysis. This is a good, comprehensive water quality evaluation method (Şener et al. ).
The contribution of each parameter is expressed through its relative weight, where $W_i$ is the weight value, $w_i$ is the weight of each parameter, and $n$ is the number of WQPs. The weight value of each water quality parameter is based on the surface water quality standards of the WHO (Alghamdi et al. ). In the rating of each parameter, $q_i$ is the ratio of the measured WQP concentration to its standard value (Table 4), and the result is multiplied by 100%.
Among the measured parameters, pH was measured with a pH-40A portable pH acidity meter. Among the water indices, AWEI-nsh is calculated as AWEI-nsh = 4 × (ρB3 − ρB11) − (2.5 × ρB8 + 2.75 × ρB12).
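The WQI equations themselves did not survive in this text, so the following is only a sketch under stated assumptions. It assumes the weighted arithmetic form that is commonly paired with the definitions above: a relative weight W_i = w_i / Σ w_i, a rating q_i = (C_i / S_i) × 100, and WQI = Σ W_i q_i, where C_i and S_i (measured concentration and standard value) are names introduced here for illustration. The weights, standards, and concentrations below are hypothetical, not the values of Table 4.

```python
def relative_weights(weights):
    """W_i = w_i / sum(w): normalize the assigned weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def quality_ratings(measured, standards):
    """q_i = (C_i / S_i) * 100: measured concentration relative to its standard."""
    return {name: measured[name] / standards[name] * 100.0 for name in measured}


def wqi(measured, standards, weights):
    """Assumed weighted arithmetic WQI: sum of W_i * q_i over all parameters."""
    rel = relative_weights(weights)
    ratings = quality_ratings(measured, standards)
    return sum(rel[name] * ratings[name] for name in measured)


# Hypothetical inputs for three parameters (illustrative only).
weights = {"COD": 4, "TN": 3, "TDS": 3}
standards = {"COD": 20.0, "TN": 1.0, "TDS": 1000.0}
measured = {"COD": 10.0, "TN": 0.5, "TDS": 500.0}

print(round(wqi(measured, standards, weights), 2))
```

In this form, a sample whose parameters all sit at half of their standard values scores 50 regardless of the weights, which is a convenient sanity check. Parameters such as pH and DO are often rated differently in practice; that refinement is not covered here.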

Constructions of 2D spectral indices
Information enhancement is a common processing step for remote sensing data. Complex terrain information is difficult to analyze using single-band data.
Therefore, multispectral remote sensing data are selected for two-dimensional spectrum analysis and calculations:

$$\mathrm{NDI} = \frac{R_i - R_j}{R_i + R_j} \quad (1)$$

$$\mathrm{DI} = R_i - R_j \quad (2)$$

$$\mathrm{RI} = \frac{R_i}{R_j} \quad (3)$$

where NDI (Equation (1)) is the normalized remote sensing index, DI (Equation (2)) is the difference remote sensing index, RI (Equation (3)) is the ratio remote sensing index, $R_i$ and $R_j$ are the values of bands $i$ and $j$, and $i$, $j$ are any two bands among the 1-7, 8a, 8, 9, 11, and 12 bands.
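The two-band search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual forms NDI = (R_i - R_j)/(R_i + R_j), DI = R_i - R_j and RI = R_i/R_j, scores every ordered band pair by its Pearson correlation with a water quality variable, and keeps the pair with the largest absolute correlation. The function names, the four-band subset, and all sample values are hypothetical, and reflectances are assumed to be positive.

```python
import math
from itertools import permutations


def ndi(a, b):
    return (a - b) / (a + b)


def di(a, b):
    return a - b


def ri(a, b):
    return a / b


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_band_pair(samples, target, index_fn):
    """Return (band_i, band_j, r) for the ordered pair with the largest |r|."""
    bands = list(samples[0])
    best = None
    for i, j in permutations(bands, 2):
        values = [index_fn(s[i], s[j]) for s in samples]
        r = pearson(values, target)
        if best is None or abs(r) > abs(best[2]):
            best = (i, j, r)
    return best


# Hypothetical band values at five sampling points (illustrative only).
samples = [
    {"B1": 0.05, "B8": 0.20, "B11": 0.15, "B12": 0.10},
    {"B1": 0.06, "B8": 0.18, "B11": 0.22, "B12": 0.14},
    {"B1": 0.04, "B8": 0.25, "B11": 0.12, "B12": 0.09},
    {"B1": 0.07, "B8": 0.15, "B11": 0.20, "B12": 0.20},
    {"B1": 0.05, "B8": 0.22, "B11": 0.18, "B12": 0.12},
]
# Hypothetical water quality values at the same points.
wqp = [12.0, 18.5, 11.0, 30.2, 16.4]

for name, fn in (("NDI", ndi), ("DI", di), ("RI", ri)):
    print(name, best_band_pair(samples, wqp, fn))
```

Ordered pairs are used because swapping the bands changes the sign of DI and NDI and inverts RI, so the full band-by-band matrix is searched rather than only its upper triangle.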

Research calibration models
In recent years, a growing number of studies have used machine learning to model data, and many modeling methods are available. The three methods selected in this article (SVM, RF, PLSR) are relatively mature machine learning methods and are therefore used here to model and predict the data, with the aim of improving prediction accuracy.
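This paper compares its calibration models through goodness-of-fit statistics for the calibration and validation sets, including R² and RPD. A minimal sketch of how such statistics are commonly computed is given here; it assumes R² as the coefficient of determination and RPD as the sample standard deviation of the observations divided by the RMSE, which may differ in detail from the definitions the authors used. The observed and predicted values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rpd(observed, predicted):
    """Ratio of performance to deviation: SD of observations / RMSE."""
    n = len(observed)
    mean_obs = sum(observed) / n
    sd = math.sqrt(sum((o - mean_obs) ** 2 for o in observed) / (n - 1))
    return sd / rmse(observed, predicted)


# Hypothetical observed and predicted WQI values (illustrative only).
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

print(round(r_squared(observed, predicted), 3), round(rpd(observed, predicted), 3))
```

Reporting both statistics is useful because R² describes the share of variance explained, while RPD relates the prediction error to the spread of the data being predicted.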

Support vector machines (SVM)
An SVM is a supervised classifier based on small samples in machine learning that can quickly and accurately fit and predict samples; this method was proposed by Boser (Boser With these two characteristics, various data sets can be quickly extracted without repeated accumulation, which is better for data classification.

Partial least squares regression (PLSR)
PLSR is a new multivariate statistical analysis method

The parameters with more than four outliers include the TSS, turbidity, sulfur, Cu, Zn, volatile phenol, and cobalt concentration, and the largest outliers appear for these indicators. The water quality data are standardized, and it is clear that the median value of the data appears below the average in these cases.

RDA analysis and correlation
In correspondence of multiple parameters. After the redundancy analysis of the data, the nine parameters with the highest contributions were selected for linear correlation analysis. The correlation analysis in Table 5 shows that COD and TDS have the highest correlations with BOD5 and TN, with a correlation coefficient of 0.891 at the 0.01 significance level. The COD in water reflects the level of electrolytes in the water, and BOD5 constrains the TN content in the water to a certain extent.
Additionally, the TN content limits changes in the COD in the water. The TDS is also influenced by the COD, BOD5, DO, TN and salt content in the water. As shown above, each parameter in the water is affected by other factors, and the water quality changes accordingly.

Correlation between the water quality indices and water quality parameters (WQPs)
Sentinel-2 MSI imagery was used to retrieve the typical water index values in July 2016. Using SPSS software, the correlations between the four classical water indices extracted from the remote sensing images and the nine WQPs were analyzed. The results are shown in Table 6.
According to the correlation analysis, COD, BOD5, DO, TN, TSS, salt content, TDS and pH were highly correlated with the NWI, suggesting that the NWI can effectively reflect most WQPs. The correlation between the NDWI and turbidity reached only 0.128, the highest among the water indices; the bands were not sensitive to water turbidity, and the turbidity correlations were generally low. The correlations of the NDWI with TDS, COD and TN were above 0.5. The EWI exhibited correlations of more than 0.5 with TN, the salt content and TDS.
The correlation between the AWEI-nsh and COD was 0.558.
In general, among the classic water indices, the NWI correlated well with the various water quality indicators, indicating that the WQI is sensitive to bands 1, 8, 11, and 12. Remote sensing image extraction is a sensitive reflection of the water indices.

Correlation between the 2D spectrum modeling and water quality parameters (WQPs)
By using 2D spectrum modeling, the multiband pixel values at each sampling point on the image are extracted, and three empirical indices (the normalized index, difference index and ratio index) are established. Correlation analysis was performed on the nine water quality parameters and the 13 bands of Sentinel-2 sampled at each point. The analysis results are shown in Figure 5.
The color bar on the right side of the figure shows the correlation between the band combination and WQPs, where dark red is the maximum value of positive correlation and dark blue is the maximum value of negative correlation.
As can be seen from the figure, for two WQPs, COD and TDS, the correlation coefficients with the band indices are better than 0.7.
Comparing the correlations of the water indices and of the spectral band indices with the measured WQPs, the correlation coefficients between the water indices and the WQPs are all below 0.7, with a maximum of 0.679. In contrast, the correlations of the constructed spectral band indices are better. Although each index reflects only a combination of two bands, it can still identify the correlation of the optimal band. Overall, the 2D modeled band index can effectively invert the WQI, and the accuracy is higher than that of the water index. Thus, the 2D modeled band index is the optimal spectral index for water quality monitoring.

WQI and modeled spectral index
To estimate the overall water quality, a WQI was introduced. The WQI is a comprehensive evaluation of water quality indicators: it summarizes the overall effect of the selected indicators so that the overall water quality situation can be analyzed more intuitively. We used the nine WQPs selected by RDA in the WQI calculation, obtained the WQI at 38 sampling points, and compared the WQI with the water indices and the modeled spectral indices, respectively. As can be seen in Table 7, among the correlations between WQI and the four water indexes, the correlation between WQI and

Water indices based on spectral bands can be used to quickly identify surface water and highlight certain features (Xu ). This article extracts the DN values of the Sentinel-2 data at the sampling points and, on this basis, establishes a WQI and a two-dimensional spectral model. First, the correlations between the water index and individual WQPs are studied.
As shown in Table 6, the correlations of the NDWI, NWI and EWI with TDS are better than 0.5. The correlation of AWEI-nsh with COD reaches 0.558. Notably, the water spectral index can be used to invert the corresponding WQPs in the band calculation. Based on this method, we established a two-dimensional spectral model of the Sentinel-2 MSI and three model indices, namely the NDI, RI and DI. Comparing the correlations of the water index and of the 2D spectrum model with a single WQI shows that the two-dimensional spectrum model performs better than the water index. The two-dimensional spectrum model can filter the spectrum information and select the optimal band combination for analysis (Hong et al. ); used in water quality monitoring, such methods make it possible to find more sensitive bands and band combinations for a single type of water quality. Through the correlation analysis of WQPs, water quality indices and spectral bands, the water quality of sampling points in the basin can be analyzed in depth.
In the Ebinur Basin, located in an arid area, many scho-

In future research, it is necessary to consider the distribution of water samples and the selection of measured data in the entire basin.

CONCLUSIONS
This study explores the relationship between the water quality in the Ebinur Lake watershed and the multispectral bands of Sentinel-2 MSI data. The relevant WQPs are related to the water index and analyzed by spectral bands.
The following results are obtained.
The Z-score and RDA are used to reduce 22 WQPs to nine while dividing them into different groups. COD, BOD5, DO, TN, TSS, turbidity, the salt content, TDS and the pH are the nine selected WQPs with high contribution values. TDS, COD, and TN are the most influential WQPs.
The WQI is established from the nine selected WQPs, and modeling and prediction are performed through machine learning and linear correlation models. The PLSR-SVM model, which combines a linear correlation model with machine learning, is the best model, with R²v = 0.87 and RPD = 2.755; its predictions are accurate, and the approach provides an effective method for water quality prediction.