## Abstract

This study utilized the ECO Lab model calculation samples of Tai Lake, in combination with robust analysis and the GCV test, to promote a faster intelligent application of machine learning and evaluate the MARS machine learning method. The results revealed that this technique can be better trained with small-scale samples, as indicated by the R^{2} values of the water quality test results, which were all >0.995. In combination with the Sobol sensitivity analysis method, the contribution degree of the parameterized external conditions as well as the relationship with the water quality were examined, which indicated that TP and TN are primarily related to the external input water quality and flow, while Chl-a is related to inflow (36.42%), TP (26.65%), wind speed (25.89%), temperature (8.38%), thus demonstrating that the governance of Chl-a is more difficult. In general, the accuracy and interpretability of MARS machine learning are more in line with the actual situation, and the use of the Sobol method can save computer calculation time. The results of this research can provide a certain scientific basis for future intelligent management of lake environments.

## HIGHLIGHTS

Introduce a MARS machine learning method coupled with a Sobol sensitivity analysis approach.

Coupled methods can solve the same problems with less time.

The declared goal of this research is to provide a certain scientific basis for future intelligent management of lake environments.

## INTRODUCTION

After the Tai Lake cyanobacteria crisis of 2007, a more serious cyanobacteria crisis broke out 10 years later, in 2017, indicating that there are still some shortcomings in the original management of the watershed (Zhang *et al.* 2020). However, since management of the watershed is very complex, involving many administrative conflicts and natural areas, determining how to rely on either measured data or model results in order to achieve intelligent management is a difficult problem that needs to be solved (Daneshfaraz *et al.* 2020; Xu *et al.* 2020). The basis of intelligent management is the need to analyze and explain the connection between measured data or model calculations and actual phenomena. Therefore, research on the sensitivity of external conditions and water quality results is very important.

As computer performance has improved, the accuracy of the original water quality model has been greatly enhanced, and the current research on parameter sensitivity has also achieved great breakthroughs (Liang *et al.* 2020; Liu & Ding 2020). Koo *et al.* (2020), for example, used the Sobol method to analyze the SWAT model parameters and nitrate flux sensitivity, eliminating some unimportant parameters and thereby saving time for later calculations. Through the combined use of the Morris and Sobol methods, García-Nieto *et al.* (2016) developed a new system for selecting important parameters while reducing the amount of calculation. Jiang *et al.* (2018) used the global sensitivity analysis (GSA) method to study and analyze the water quality parameters of different areas in Tai Lake, finding that parameter sensitivity differs between regions with different water quality, and that the dependence of algae growth on temperature and on water quality can shift from one to the other. When the nutrient concentration is sufficient, algae growth has a closer relationship with temperature; otherwise, it is still mainly affected by nutrient salt concentration.

The Sobol method is a global sensitivity analysis approach that can quantify parameter sensitivity and correlation, although it is not widely used in the early stages of analysis due to the huge amount of calculation (Wang 2018; Jordan *et al.* 2020). At present, given the development and application of machine learning methods, the goal of this study was to combine multivariate adaptive regression splines (MARS) (Demneh 2019) with the Sobol method, referred to here as the MARS-Sobol method, to reduce the workload and improve future management efficiency. The MARS method, which was formally proposed by the American statistician Jerome Friedman in 1991 (Kuter *et al.* 2018), can handle large amounts of data and high dimensionality, and also features the advantages of fast calculation and accurate modeling. Deo *et al.* (2017) conducted an in-depth study of the relationship between rainfall runoff and regional drought with the help of the MARS method and established a refined model with the dual factors of geography and seasonality via the measurement data of long-term drought and the determination of related parameters, thus providing a successful intelligent application case for later regional drought research. García-Nieto *et al.* (2016) successfully constructed the MARS-ABS prediction model by combining the MARS method and an artificial bee colony algorithm. Since the estimated coefficients of total phosphorus (TP) and chlorophyll-a (Chl-a) in the lake body were both >0.8 and had good physical, chemical, and biological interpretability, this was a relatively successful and innovative research project.

Based on the measured data of Tai Lake for the past 10 years, this study parameterized the external conditions of the lake such as temperature, wind speed, water entering and exiting, and water quality. With the help of the Latin hypercube sampling (LHS) method, 250 random combinations were selected within the range of values, and the ECO Lab water quality model was then utilized. The TP, TN, and Chl-a were simulated, and the results obtained were trained and tested using the MARS method. A total of 6,000 groups were then sampled according to the Sobol sequence, combined with the Sobol method, to study the sensitivity between external conditions and water quality, and finally passed to the mini-batch k-means clustering algorithm for analysis of the overall conclusion when the lake algae reached the standard level. This manuscript is mainly composed of four sections (Figure 1): (1) Research area; (2) Research methods; (3) Results and discussion; and (4) Conclusions and prospects.

## STUDY AREA AND METHODS

### Study area

Tai Lake (30°05′–32°08′N, 119°08′–122°55′E) is located in the lower reaches of the Yangtze River in China (Yao *et al.* 2020). It has a total area of approximately 2,338 km^{2} and is a typical large-scale shallow-water freshwater lake, with an average depth of about 1.9 m. In the past 10 years, the air temperature has ranged from −4.5 °C to 33 °C, the wind speed at a height of 10 m has varied from 0.5 to 8.1 m/s, the average annual rainfall was ∼1,222 mm, and the annual average evaporation was 1,051 mm. There are approximately 219 rivers that flow into and out of the surrounding area. The main lake areas can be divided into seven regions: the lake center, Zhushan Bay, Meiliang Bay, Gong Bay, Northwest Lake, Southwest Lake, and East Lake (Tang *et al.* 2020), and there is a corresponding hydrological and water quality synchronization monitoring site (Figure 2).

Based on the relevant measured data for the past 10 years from the Jiangsu Monitoring Center, Taihu Basin Administration of Ministry of Water Resources (http://www.tba.gov.cn/) and the China Meteorological Data Network (http://data.cma.cn/), this study parameterized the amount of water entering and leaving the lake, the water quality, temperature, and wind speed, and used the LHS method to draw a total of 250 sampling combinations of external input conditions within the range of 50–150% of the average value. The ECO Lab water quality model then simulated the TP, TN, and Chl-a at the seven main monitoring stations. The actual measured external conditions are illustrated in Figure 2. The obtained simulation results were used as the basic data for later research on alternative models.
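The sampling step can be illustrated with a minimal Latin hypercube sketch. The function name `latin_hypercube`, the variable names, and the mean values below are hypothetical placeholders rather than the study's data or code; only the sample count (250) and the 50–150% range come from the text.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that each variable hits every stratum exactly once."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each equal-width stratum of the unit interval.
        unit = [(k + rng.random()) / n_samples for k in range(n_samples)]
        # Shuffle so that strata are paired at random across variables.
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    # Transpose: one row per sampled combination of external conditions.
    return [list(row) for row in zip(*columns)]


# Hypothetical mean values of four external conditions (placeholders, not study data).
means = {"inflow": 100.0, "tp_in": 0.1, "wind_speed": 3.0, "temperature": 17.0}
# Each condition is varied between 50% and 150% of its mean, as in the text.
bounds = [(0.5 * m, 1.5 * m) for m in means.values()]
samples = latin_hypercube(250, bounds)
```

Each row of `samples` would correspond to one set of boundary conditions for a water quality model run. Compared with plain random sampling, stratification spreads the 250 combinations evenly over each variable's range.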

### Methods

#### Establishing the ECO Lab model

The ECO Lab model is based on a three-dimensional unsteady hydrodynamic model (Waldman *et al.* 2017). It is formed from a hydrodynamic-ecological model coupled with ecological modules, including algae, oxygen cycle, nitrogen cycle, phosphorus cycle, and carbon cycle. The model employed in this study featured a Cartesian coordinate grid of 5,881 rectangular cells, each of which had a length of 300–500 m. To better simulate the lake bottom terrain, *σ* coordinates were used in the vertical direction, which was divided into three layers on average. Based on hydrostatic continuity and to avoid the pressure gradient error caused by the *σ* coordinates, the slope of the lake bottom should be <0.33. The model calculation time step was 3,600 s, and the simulation time was 365 d.

#### Establishing the alternative model based on the MARS method

Multivariate adaptive regression splines (MARS) is a prediction method proposed by the American statistician Jerome Friedman in 1991. This method uses the tensor product of the spline function as the basis function and is divided into three steps: forward process, backward pruning process, and model selection (Metya *et al.* 2017). Furthermore, the generalized cross-validation (GCV) criterion is adopted, and the fitting path is adjusted according to the dynamic characteristics of the fitting object and the interaction between variables, which can fully fit functions of different dimensions.

In Equation (3), *n* is the number of predictors; the *k*^{th} dependent variable is excluded when constructing the model, and its predicted value is then compared with its observed value; the remaining term is the function complexity correction.

This study used MATLAB programming tools. Based on the n × d dimensions of the input variable matrix, as well as the selection of independent variables and nodes, the basis functions (BFs) consisted of several pairs of piecewise polynomials given by Equations (1) and (2). Hence, screening out suitable BFs was the key to the establishment of the MARS model. The modeling process was divided into three steps.
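For reference, the standard textbook forms of the MARS hinge basis functions and the GCV criterion are sketched below. The notation is assumed here ($t$ for a node, $\hat{f}_M$ for a fitted model with $M$ basis functions, $C(M)$ for the complexity correction, and $N$ for the number of samples) and may differ from that of Equations (1)–(3).

$$
\begin{aligned}
b^{+}(x; t) &= \max(0,\; x - t), \\
b^{-}(x; t) &= \max(0,\; t - x), \\
\mathrm{GCV}(M) &= \frac{\dfrac{1}{N}\sum_{k=1}^{N}\left(y_k - \hat{f}_M(x_k)\right)^{2}}{\left(1 - C(M)/N\right)^{2}}.
\end{aligned}
$$

The numerator is the mean squared residual, and the denominator inflates it as the model becomes more complex, so a lower GCV indicates a better trade-off between fit and complexity.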

Step 1: Start with a basic model with only one constant term, and then use the direct truncation process to split the sample function. Taking into consideration the interaction of variables, continue to increase the number of BFs, and improve the accuracy of the model until the residual sum of squares reaches the minimum value or the number of BFs reaches the maximum value, resulting in an overfitting model.

Step 2: Delete the basis functions with small contributions using the backward pruning process, and continuously modify the coefficients of the remaining items. If the accuracy of the model can be guaranteed, delete the redundant BFs; otherwise, keep them.

Step 3: Finally, compare the series of models obtained from the backward pruning process and select the optimal model via the GCV criterion in Equation (3). When the accuracy of the model increases, the GCV value decreases.
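The selection in Step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's MATLAB code: the complexity correction uses one common convention (number of terms plus a penalty per node, with nodes counted as half the non-constant terms), the penalty value of 3 is an assumed default, and the candidate models and their residual sums of squares are invented numbers.

```python
def hinge_pair(x, knot):
    """Mirrored hinge functions max(0, x - knot) and max(0, knot - x)."""
    return max(0.0, x - knot), max(0.0, knot - x)


def gcv(rss, n_samples, n_terms, penalty=3.0):
    """GCV score: mean squared residual inflated by a complexity correction.

    Assumption: effective parameters = n_terms + penalty * n_knots, with
    n_knots = (n_terms - 1) / 2 because hinge functions are added in pairs.
    """
    n_knots = (n_terms - 1) / 2.0
    effective = n_terms + penalty * n_knots
    if effective >= n_samples:
        raise ValueError("model too complex for the sample size")
    return (rss / n_samples) / (1.0 - effective / n_samples) ** 2


def select_by_gcv(candidates, n_samples):
    """Return the candidate model with the lowest GCV score."""
    return min(candidates, key=lambda c: gcv(c["rss"], n_samples, c["n_terms"]))


# Hypothetical pruning path: residual sum of squares after successive deletions.
candidates = [
    {"n_terms": 21, "rss": 1.00},  # overfitted model from the forward pass
    {"n_terms": 11, "rss": 1.05},
    {"n_terms": 5, "rss": 1.60},
    {"n_terms": 1, "rss": 9.00},   # constant-only model
]
best = select_by_gcv(candidates, n_samples=200)
```

With these invented numbers, the largest model fits the training data best but is penalized for its complexity, so an intermediate model is selected.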

#### Sensitivity analysis of external conditions based on the Sobol method

The Sobol method (Jordan *et al.* 2020) was first proposed by Ilya M. Sobol in 1993. It is a widely used quantitative global sensitivity analysis method. As opposed to the qualitative sensitivity analysis method, the Sobol method can directly provide the sensitivity of the model parameters through the calculation of the sensitivity quantitative index. The specific calculation process is as follows (Figure 3).
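As a rough illustration of how such indices can be estimated, the sketch below uses the widely used Saltelli (first-order) and Jansen (total-order) Monte Carlo estimators. It is an assumption-laden example rather than the study's procedure: it draws plain pseudo-random samples instead of a Sobol sequence, and `toy_model` is a hypothetical stand-in for a trained surrogate model.

```python
import random


def sobol_indices(model, bounds, n_samples=8000, seed=0):
    """Estimate first-order and total-order Sobol indices by Monte Carlo."""
    rng = random.Random(seed)
    d = len(bounds)

    def draw():
        return [rng.uniform(low, high) for low, high in bounds]

    # Two independent sample matrices A and B (n_samples rows, d columns).
    a = [draw() for _ in range(n_samples)]
    b = [draw() for _ in range(n_samples)]
    fa = [model(row) for row in a]
    fb = [model(row) for row in b]

    # Output mean and variance estimated from both matrices together.
    pooled = fa + fb
    mean = sum(pooled) / len(pooled)
    var = sum((y - mean) ** 2 for y in pooled) / (len(pooled) - 1)

    first, total = [], []
    for i in range(d):
        # Matrix A with column i taken from B: only input i differs from A.
        fab = [model(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        # Saltelli estimator of the first-order partial variance.
        v_i = sum((yb - mean) * (yab - ya)
                  for ya, yb, yab in zip(fa, fb, fab)) / n_samples
        # Jansen estimator of the total-order partial variance.
        e_i = sum((ya - yab) ** 2 for ya, yab in zip(fa, fab)) / (2 * n_samples)
        first.append(v_i / var)
        total.append(e_i / var)
    return first, total


def toy_model(x):
    """Hypothetical additive model; analytic indices are 0.2 and 0.8."""
    return x[0] + 2.0 * x[1]


first_order, total_order = sobol_indices(toy_model, [(0.0, 1.0), (0.0, 1.0)])
```

For an additive model such as `toy_model`, the first-order and total-order indices coincide because there are no interactions. The estimator needs `n_samples * (d + 2)` model evaluations, which is why a fast surrogate model makes the analysis much cheaper.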

#### Cluster analysis based on the mini-batch k-means method

The mini-batch k-means algorithm (Jia *et al.* 2020) uses a method called mini-batch (batch processing) to calculate the distance between data points. The advantage of the mini-batch technique is that it is not necessary to use all the data samples in the calculation process. Instead, a portion of each sample from different types of samples is extracted to represent their respective types for calculation. Since the calculation sample size is small, the running time will be reduced accordingly, although this type of sampling will inevitably bring about a decrease in accuracy. This technique is employed when the dataset is huge.

In fact, this approach is not only applied to k-means clustering but also widely used in machine learning and deep learning algorithms such as gradient descent and deep networks.

In this study, in order to more clearly determine the characteristics of the influence of different external conditions on algae growth, the clustering method was used to analyze and count the 200 groups of data in the later period. The flow of this algorithm is similar to k-means and consists of the following steps.

Step 1: First, extract part of the dataset, and use the k-means algorithm to construct a model with k clustering points.

Step 2: Continue to extract part of the sample data of the dataset in the training dataset, add it to the model, and assign it to the nearest cluster center point.

Step 3: Update the center point value of the cluster.

Step 4: Iterate the second and third steps of the loop until the center point is stable or the number of iterations is reached, then stop the calculation operation.
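The four steps above can be sketched in plain Python as follows. This is a minimal illustration on synthetic two-dimensional data, not the study's implementation: the batch size, iteration counts, and the per-center `1/count` step size are assumptions, and the stopping rule is simplified to a fixed number of iterations.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers]
    return dists.index(min(dists))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def mini_batch_kmeans(data, k, batch_size=40, n_iter=50, seed=0):
    rng = random.Random(seed)

    # Step 1: ordinary k-means (a few Lloyd iterations) on a first mini-batch.
    batch = rng.sample(data, batch_size)
    centers = [list(p) for p in rng.sample(batch, k)]
    for _ in range(10):
        groups = [[] for _ in range(k)]
        for point in batch:
            groups[nearest(point, centers)].append(point)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    counts = [max(len(g), 1) for g in groups]

    for _ in range(n_iter):
        # Step 2: draw a new mini-batch and assign each point to its nearest center.
        batch = rng.sample(data, batch_size)
        labels = [nearest(point, centers) for point in batch]
        # Step 3: move each center toward its points with a 1/count step size.
        for point, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * p for c, p in zip(centers[j], point)]
    # Step 4 (simplified): stop after a fixed number of iterations.
    return centers


# Synthetic data: two well-separated groups around (0, 0) and (10, 10).
data_rng = random.Random(1)
data = [[data_rng.gauss(0.0, 0.5), data_rng.gauss(0.0, 0.5)] for _ in range(100)]
data += [[data_rng.gauss(10.0, 0.5), data_rng.gauss(10.0, 0.5)] for _ in range(100)]
centers = sorted(mini_batch_kmeans(data, k=2))
```

Because each update only touches one mini-batch, the cost per iteration does not grow with the size of the full dataset, at the price of slightly noisier center estimates.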

## RESULTS AND DISCUSSION

### ECO Lab model evaluation

This study used the root mean square error (RMSE), mean relative error (MRE), coefficient of determination (*R*^{2}), Nash-Sutcliffe efficiency (NSE), and comprehensive prognostic index (CPI) evaluation system to analyze the error and correlation between the measured value *M* and the simulated value *S* (Qiu *et al.* 2020). In the corresponding equations, *N* is the total number of simulations; *i* is the simulation number; *S*_{i} is the value of the *i*^{th} simulation; *M*_{i} is the *i*^{th} measured value; $\bar{S}$ is the simulated average value; and $\bar{M}$ is the measured average value.

For the CPI, *i* takes 1–4; that is, it traverses *R*^{2}, RMSE, MRE, and NSE. According to the relationship between the trends of the four parameters and model accuracy, the coefficient *α*_{i} is −1 for RMSE and MRE and 1 for NSE and *R*^{2}; CPI_{i} is the comprehensive prognostic index of model *i*. The larger the CPI, the higher the model prediction accuracy.
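The commonly used definitions of these four metrics, written with the symbols defined above, are sketched below. They are given as standard forms under the assumption that the study follows them, and may differ in detail from the equations actually used.

$$
\begin{aligned}
\mathrm{RMSE} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(S_i - M_i\right)^{2}}, \\
\mathrm{MRE} &= \frac{1}{N}\sum_{i=1}^{N}\frac{\left|S_i - M_i\right|}{M_i}\times 100\%, \\
R^{2} &= \frac{\left[\sum_{i=1}^{N}\left(S_i - \bar{S}\right)\left(M_i - \bar{M}\right)\right]^{2}}{\sum_{i=1}^{N}\left(S_i - \bar{S}\right)^{2}\,\sum_{i=1}^{N}\left(M_i - \bar{M}\right)^{2}}, \\
\mathrm{NSE} &= 1 - \frac{\sum_{i=1}^{N}\left(M_i - S_i\right)^{2}}{\sum_{i=1}^{N}\left(M_i - \bar{M}\right)^{2}}.
\end{aligned}
$$

Smaller RMSE and MRE and larger *R*^{2} and NSE all indicate a closer match, which is why the two groups enter the CPI with opposite signs.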

The results revealed that the water quality simulation of Tai Lake by the ECO Lab model was highly credible, with a comprehensive error <20% (Figure 4) and a CPI > 1.5 (Table 1), indicating that it could better invert the actual water quality in 2017. At the same time, it can be seen from the simulation results that the change trends of Chl-a and TP were relatively close, and the relationship with TN was very weak. DO was almost inversely proportional to temperature. These results are not only consistent with the trends of the actual measurement results but also consistent with the research conclusions of Wang *et al.* (2017). This proves once again that the ECO Lab water quality model can provide basic support for the subsequent mechanism research.

| | RMSE | MRE | R^{2} | NSE | CPI |
|---|---|---|---|---|---|
| TP | 0.008 | 10.71% | 83.99% | 0.93 | 1.65 |
| TN | 0.13 | 9.98% | 87.08% | 0.95 | 1.59 |
| Chl-a | 0.002 | 16.98% | 82.64% | 0.91 | 1.56 |
| DO | 0.35 | 4.01% | 93.01% | 0.97 | 1.52 |
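The tabulated CPI values appear consistent with a signed sum of the four metrics, with RMSE and MRE weighted by −1, *R*^{2} and NSE weighted by +1, and percentages expressed as fractions. This reading is an assumption inferred from the coefficient description and the table, not a confirmed formula; the sketch below recomputes the CPI under it.

```python
# Rows of the evaluation table: (RMSE, MRE, R2, NSE, reported CPI).
# MRE and R2 are converted from percentages to fractions.
TABLE = {
    "TP": (0.008, 0.1071, 0.8399, 0.93, 1.65),
    "TN": (0.13, 0.0998, 0.8708, 0.95, 1.59),
    "Chl-a": (0.002, 0.1698, 0.8264, 0.91, 1.56),
    "DO": (0.35, 0.0401, 0.9301, 0.97, 1.52),
}

# Assumed coefficients: -1 for the error metrics, +1 for the agreement metrics.
ALPHA = (-1.0, -1.0, 1.0, 1.0)


def cpi(metrics):
    """Signed sum of (RMSE, MRE, R2, NSE) under the assumed CPI definition."""
    return sum(a * m for a, m in zip(ALPHA, metrics))


recomputed = {name: round(cpi(row[:4]), 2) for name, row in TABLE.items()}
largest_gap = max(abs(recomputed[name] - row[4]) for name, row in TABLE.items())
```

Under this assumption, the recomputed values for TP, TN, and Chl-a match the reported ones, and the DO value differs by 0.01, which is within the rounding of the tabulated inputs.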


### Robustness test and MARS substitution model establishment

The *F*-test (Sanderson & Windmeijer 2016) was used to measure the homogeneity of variance. The hypotheses were *H*_{0}: no significant difference exists in the simulated data; and *H*_{1}: a significant difference exists in the simulated data. The test statistic is built from the sum of the squared total deviations, where *s* is the number of evaluations, each deviation is taken between the *i*^{th} evaluation value at the *j*^{th} level and the average of all evaluation values, and *n* is the degrees of freedom; *H*_{0} is rejected when the statistic falls in the rejection region.
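A minimal sketch of such a test is given below, assuming the one-way analysis-of-variance form of the *F* statistic, which matches the description in terms of levels and total squared deviations. The function name and the example values are hypothetical.

```python
def one_way_f(groups):
    """Return the one-way ANOVA F statistic and its degrees of freedom.

    groups: list of lists, one list of evaluation values per level.
    """
    n_total = sum(len(g) for g in groups)
    n_levels = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Sum of squared total deviations from the grand mean.
    ss_total = sum((x - grand_mean) ** 2 for g in groups for x in g)
    # Part explained by differences between level means.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Remainder: scatter within each level.
    ss_within = ss_total - ss_between

    df_between = n_levels - 1
    df_within = n_total - n_levels
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, (df_between, df_within)


# Hypothetical evaluation values at three levels.
f_stat, dof = one_way_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
```

The statistic would then be compared with the critical value of the *F* distribution for the returned degrees of freedom at the chosen significance level.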

The *N* = 25, 50, 100, 150, and 200 groups of parameter samples were simulated and calculated, after which the Morris index corresponding to each parameter and the *F*-test value between independent samples were obtained (Figure 5). The convergence of the Morris index of the parameter group was closely related to the number of parameter samples; however, when the number of samples was >150 groups, the basic conditions of *R*^{2}, *p* value >0.90, and the GCV criterion could all be satisfied at the same time, indicating that calculating the Morris index required a higher computational workload. A sensitivity analysis can be performed on multi-dimensional parameters with a small sample size while maintaining high accuracy (Wang *et al.* 2020). This shows that the method can obtain high accuracy with a small number of samples, can save a lot of calculation time in the intermediate process for the later stages, and can also provide scientific support for the final research goal (Sima *et al.* 2018). In order to ensure the reliability of the study, 200 groups of parameter samples were utilized for the subsequent simulation study.

Further optimization of the BFs and number of nodes revealed that when the TN and TP basis functions were set to 20 and the number of nodes was set to three, the accuracy reached a steady state. In addition, when the basis function of Chl-a was set to 50 and the number of nodes was three, the accuracy reached a steady state. The calculation accuracies of TP, TN, and Chl-a were all >0.995 (Figure 6). While ensuring the accuracy of the replacement model, the impact on the computer load was minimal, and the calculation time for 1,000 operations was only five minutes. Overall, the computing performance was greatly improved. In terms of the basis functions, we initially found that the BFs of TP and TN were relatively simple, and the correlation between external conditions was weak, while the BF of Chl-a was more complex, and the correlation between external conditions was also strong, indicating that the MARS model was in the learning stage. Different judgments were made based on the number of parameters and the magnitude of the interaction, which could simultaneously reflect the linear and nonlinear relationships and increase the credibility of the interpretability of the actual situation (Conoscenti *et al.* 2015). The study of Huang *et al.* (2019) also shows that the MARS method has certain advantages in nonlinear prediction, which can avoid common problems such as overfitting, and the large savings in time also provide more possibilities for the later intelligent management of big data. Subsequently, this research will combine the analysis with the Sobol method in order to further analyze the sensitivity of the external conditions, thereby determining whether the interpretability of the alternative model can meet the actual research needs.

### Sensitivity analysis of external conditions and thoughts on future governance

Tai Lake is currently in a stage of severe eutrophication, and the outbreak of algae blooms has seriously affected the safety of the drinking water. Following the specific requirements of the ‘Technical Specification for Water Bloom Remote Sensing and Ground Monitoring and Evaluation’ issued by the Ministry of Ecology and Environment of China (http://www.mee.gov.cn/ywgz/fgbz/bz/bzwb/other/qt/202002/t20200213_762889.shtml), this study set a Chl-a concentration of <10 mg/L as the assessment standard for algae water quality. Based on the cluster analysis of 200 groups of results calculated by the ECO Lab model, it was found that in the 66 groups of compliance scenarios, algae growth was mainly affected by three external conditions: wind speed, flow rate, and TP concentration (Figure 7). This conclusion is similar to that of Jalil *et al.* (2018), indicating that the algae in Tai Lake are mainly affected by hydrodynamics and phosphorus flux. However, since wind speed and temperature are natural factors that are not currently controlled by humans, the future compliance of algae requires dual control of the hydrodynamics and nutrients of Tai Lake by closely combining flow and phosphorus flux. The overall stable compliance scenario indicates that the total phosphorus entering the lake has already exceeded the acceptable threshold. The amount of TP needs to be reduced by 30–50% in order to prevent large-scale outbreaks of algae, which is basically consistent with previous research conclusions (Xu *et al.* 2016; Liu *et al.* 2020).

Studies have shown that TP and TN are affected in far simpler ways than algae. TP and TN are directly affected by external input pollution flux, the influences of both reaching more than 90% (Figure 8). This is similar to the conclusions of the experimental research on Tai Lake performed by many scientists (Deng *et al.* 2018; Wang *et al.* 2019a, 2019b), indicating that under the premise of maintaining a dynamic balance of the internal sources of nutrients, TP and TN are still mainly affected by external flux input. This influence is not closely related to meteorological conditions such as temperature and wind, although the Chl-a concentration, which reflects the growth of algae, is greatly affected by external comprehensive factors. This is because algae are a type of living organism and are therefore more significantly affected by the water environment of eutrophic lakes. In particular, under conditions of suitable temperature and nutrients, they are primarily affected by wind speed and flow velocity, the influence of which can reach more than 60%. In addition, Chl-a is closely related to TP (Wu *et al.* 2019) but has almost no relationship with TN, indicating that TN in Tai Lake is not a factor that limits the growth of Chl-a. Therefore, more attention should be paid to the dynamic changes of TP in the future. Moreover, in the Sobol calculation results, we found that parameters with large first-order sensitivity also had large total-order sensitivity, indicating that these parameters have a strong correlation with other parameters (Jaxa-Rozen & Kwakkel 2018).

According to the above research results, the MARS-Sobol method used in this study has good interpretability when calculating multiple parameters and multiple dimensions (Zhang *et al.* 2015). In addition, although it takes 45 min to calculate a Sobol sample alone, when combined with the MARS method, it only takes about five minutes to calculate 1,000 groups of samples, thus greatly reducing the calculation load of the workstation without changing the actual situation. Furthermore, the MARS method requires fewer samples than the back propagation artificial neural network (BP-ANN) method (Sun *et al.* 2019). In the future, it is strongly recommended that more in-depth research be conducted on more complex models or measured data.

## CONCLUSIONS AND PROSPECTS

- (1) The ECO Lab model was found to fully reflect the actual situation in the water quality simulation of Tai Lake, and the CPI was >1.5. The learning accuracy of the MARS method was related to the number of training samples, basis functions, number of nodes, and other factors. In the later stage, appropriate parameter sizes should be selected according to the specific situation in order to adjust the learning accuracy of the MARS method. In general, the MARS method requires a small number of training samples, and the learning accuracy of 150 groups can reach more than 0.990. At the same time, this research proves that it is suitable for the learning of high-dimensional parameter groups.

- (2) The MARS-Sobol method has the dual analysis functions of sensitivity and correlation. The results revealed that the factors with strong sensitivity are strongly correlated; different external conditions will have a significant impact on the water quality of Tai Lake. When Chl-a is the output, the sensitivity ranking is inflow (36.42%) > TP (26.65%) > wind speed (25.89%) > temperature (8.38%). When TP is the output indicator, the sensitivity ranking is TP (81.17%) > inflow (16.64%); when TN is used as the output index, the sensitivity is closely related to TN concentration (92.52%). In general, the impact of a single external condition on Chl-a is less than that on TP and TN, and the growth of algae depends largely on the magnitude of the hydrodynamic force. Therefore, the government should devote much more time to treating the algae problem.

- (3) The clustering results of 200 groups of data revealed that the growth of algae is mainly affected by three external conditions: wind speed, flow rate, and TP concentration. In other words, algae are more significantly affected by hydrodynamic forces and phosphorus flux in Tai Lake. Since the temperature and water level cannot be controlled properly by humans, this study suggests that the water flow and phosphorus flux should be the dual controls of the hydrodynamic and nutrient levels in the future.

## ACKNOWLEDGEMENTS

The authors thank the Chinese National Science Foundation (Grant No. 51879070). This work was supported by the Fundamental Research Funds for the Central Universities, the World-class Universities (Disciplines), and the Characteristic Development Guidance Funds for the Central Universities. This research was also funded by the Major Science and Technology Program for Water Pollution Control and Treatment of China (Grant No. 2018ZX07208003). We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this manuscript.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.