Sensitivity analysis of external conditions based on the MARS-Sobol method: case study of Tai Lake, China

This study utilized the ECO Lab model calculation samples of Tai Lake, in combination with robust analysis and the GCV test, to promote a faster intelligent application of machine learning and evaluate the MARS machine learning method. The results revealed that this technique can be better trained with small-scale samples, as indicated by the R² values of the water quality test results, which were all > 0.995. In combination with the Sobol sensitivity analysis method, the contribution degree of the parameterized external conditions as well as the relationship with the water quality were examined, which indicated that TP and TN are primarily related to the external input water quality and flow, while Chl-a is related to inflow (36.42%), TP (26.65%), wind speed (25.89%), and temperature (8.38%), thus demonstrating that the governance of Chl-a is more difficult. In general, the accuracy and interpretability of MARS machine learning are more in line with the actual situation, and the use of the Sobol method can save computer calculation time. The results of this research can provide a certain scientific basis for future intelligent management of lake environments.


INTRODUCTION
After the Tai Lake cyanobacteria crisis of 2007, a more serious cyanobacteria crisis broke out 10 years later, in 2017, indicating that there are still some shortcomings in the original management of the watershed (Zhang et al. ). However, since management of the watershed is very complex, involving many administrative conflicts and natural areas, determining how to rely on either measured data or model results in order to achieve intelligent management is a difficult problem that needs to be solved (Daneshfaraz et al. ; Xu et al. ). The basis of intelligent management is the need to analyze and explain the connection between measured data or model calculations and actual phenomena. Therefore, research on the sensitivity of external conditions and water quality results is very important.
As computer performance has improved, the accuracy of the original water quality model has been greatly enhanced, and the current research on parameter sensitivity has also achieved great breakthroughs (Liang et  method to study and analyze the water quality parameters of different areas in Tai Lake, finding that the sensitivity of the parameters affected by the water quality in different regions exhibits certain differences, and the relationship between the growth of algae with temperature and water quality is mutually transformable. When the water quality concentration is sufficient, algae growth has a closer relationship with temperature; otherwise, it is still mainly affected by nutrient salt concentration. The Sobol method is a global sensitivity analysis approach that can quantify parameter sensitivity and correlation, although it has not been widely used in the early stages of analysis because of its heavy computational cost (Wang ; Jordan et al. ). Given the development and application of machine learning methods, the goal of this study was to employ the multivariate adaptive regression splines (MARS)-Sobol method (Demneh ) to reduce the workload and improve future management efficiency. The MARS method, which was formally proposed by the American statistician Jerome Friedman in 1991 (Kuter et al. ), can handle large amounts of data and high dimensionality, and also features the advantages of fast calculation and accurate modeling. Deo (Deo et al. ) conducted an in-depth study of the relationship between rainfall runoff and regional drought with the help of the MARS method and established a refined model with the dual factors of geography and seasonality via the measurement data of long-term drought and the determination of related parameters, thus providing a successful intelligent application case for later regional drought research. Garcia et al. 
Based on the measured data of Tai Lake for the past 10 years, this study parameterized the external conditions of the lake, such as temperature, wind speed, inflow and outflow, and water quality. With the help of the Latin hypercube sampling (LHS) method, 250 random combinations were selected within the range of values, and the ECO Lab water quality model was then used to simulate TP, TN, and Chl-a; the results obtained were used to train and test the MARS model. A total of 6,000 groups were then sampled according to the Sobol sequence and analyzed with the Sobol method to study the sensitivity between external conditions and water quality, and finally passed to the mini-batch k-means clustering algorithm for analysis of the overall conclusion when the
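The Latin hypercube step described above can be sketched as follows. This is a minimal illustration in plain Python, not the sampling code used in the study: the function name `latin_hypercube`, the seed, and the wind speed and inflow ranges are hypothetical, and only the air temperature range is taken from the text.

```python
import random


def latin_hypercube(n_samples, bounds, seed=0):
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point."""
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum [k/n, (k+1)/n) ...
        unit = [(k + rng.random()) / n_samples for k in range(n_samples)]
        # ... then shuffle so strata are paired at random across dimensions.
        rng.shuffle(unit)
        columns.append([low + u * (high - low) for u in unit])
    return [list(row) for row in zip(*columns)]


# Illustrative ranges: air temperature (from the text), then
# wind speed and inflow (hypothetical values and units).
bounds = [(-4.5, 33.0), (0.0, 10.0), (0.0, 500.0)]
samples = latin_hypercube(250, bounds)
```

Each row of `samples` is one combination of external conditions that could be passed to the water quality model; stratifying every dimension is what lets a sample as small as 250 cover the full range of each condition.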

Study area
Tai Lake (30°05′-32°08′N, 119°08′-122°55′E) is located in the lower reaches of the Yangtze River in China (Yao et al. ). It has a total area of approximately 2,338 km² and is a typical large-scale shallow-water freshwater lake, with an average depth of about 1.9 m. In the past 10 years, the air temperature has ranged from −4.5 °C to 33 °C, the wind

The MARS model is based on an input variable X with a dimension of n × d and a dependent continuous variable Y with a dimension of n × 1; there is no need to establish preliminary assumptions about X and Y in advance. The model uses piecewise polynomials to construct a smoothly connected basis function (BF) and divides X into different intervals. The piecewise point p is called a node, and its spline curve is a bilateral truncated power function, as expressed by Equation (1):

$$[-(x-p)]_+^q = \begin{cases}(p-x)^q, & x < p\\ 0, & \text{otherwise}\end{cases}, \qquad [+(x-p)]_+^q = \begin{cases}(x-p)^q, & x > p\\ 0, & \text{otherwise}\end{cases} \tag{1}$$

If N basis functions are considered, the MARS model can be expressed as

$$\hat{y} = c_0 + \sum_{i=1}^{N} c_i \, BF_i(X) \tag{2}$$

where c_0 is a constant; c_i is the coefficient of the i-th basis function BF_i; q (q > 0) is the power of the spline function, which determines the smoothness of the spline curve; and x is the data in the input variable X. The generalized cross-validation (GCV) criterion is given by Equation (3):

$$GCV(M) = \frac{\frac{1}{n}\sum_{k=1}^{n}\left(y_k-\hat{y}_k\right)^2}{\left[1-\frac{C(M)}{n}\right]^2} \tag{3}$$

In Equation (3), n is the number of samples; y_k is the k-th value of the dependent variable, which is excluded when constructing the model; ŷ_k is the predicted value of y_k; and C(M) is the function complexity correction.
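To make the truncated power basis and the GCV criterion concrete, the following sketch evaluates a one-dimensional MARS-style model and its GCV score. It is an illustration only: the function names, the example coefficients and node, and the form C(M) = (M + 1) + d·M with a penalty d per non-constant basis function are assumptions (the text does not state how C(M) is computed), and the study itself used MATLAB rather than Python.

```python
def truncated_power_pair(x, p, q=1):
    """Return ([+(x - p)]_+^q, [-(x - p)]_+^q) for node p and power q > 0."""
    return max(0.0, x - p) ** q, max(0.0, p - x) ** q


def mars_predict(x, c0, terms, q=1):
    """Evaluate y_hat = c0 + sum_i c_i * BF_i(x).

    Each term is (coefficient, node, side): side=+1 selects [+(x - p)]_+^q
    and side=-1 selects [-(x - p)]_+^q.
    """
    y_hat = c0
    for coef, node, side in terms:
        right, left = truncated_power_pair(x, node, q)
        y_hat += coef * (right if side > 0 else left)
    return y_hat


def gcv(y, y_hat, n_basis, penalty=3.0):
    """GCV = mean squared residual / (1 - C(M)/n)^2.

    Assumed complexity correction: C(M) = (n_basis + 1) + penalty * n_basis,
    where n_basis counts the non-constant basis functions.
    """
    n = len(y)
    mse = sum((yk - yk_hat) ** 2 for yk, yk_hat in zip(y, y_hat)) / n
    c_m = (n_basis + 1) + penalty * n_basis
    if c_m >= n:
        raise ValueError("C(M) must be smaller than the number of samples")
    return mse / (1.0 - c_m / n) ** 2


# Hypothetical model with a single node at p = 2.0 and both sides of the pair.
terms = [(2.0, 2.0, +1), (0.5, 2.0, -1)]
xs = [float(v) for v in range(10)]
y_fit = [mars_predict(x, 1.0, terms) for x in xs]
# Synthetic "observations": the fitted curve plus an alternating +/-0.1 offset.
ys = [f + 0.1 * (-1) ** k for k, f in enumerate(y_fit)]
score = gcv(ys, y_fit, n_basis=len(terms))
```

Among candidate models with similar residuals, the one with fewer basis functions has a smaller C(M) and therefore a smaller GCV, which is why the criterion can be used to choose between pruned models.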
This study used MATLAB programming tools, and based on the dimension n × d of the input variable X, as well as the selection of independent variables and nodes, the basis function consisted of several pairs of piecewise polynomials given by Equations (1) and (2). Hence, screening out a suitable basis function was the key to the establishment of the MARS model. The modeling process was divided into three steps.
Step 1: Start with a basic model with only one constant term, and then use the direct truncation process to split the sample function. Taking into consideration the interaction of variables, continue to increase the number of BFs and improve the accuracy of the model until the residual sum of squares reaches the minimum value or the number of BFs reaches the maximum value, which results in an overfitted model.
Step 2: Delete the basis functions with small contributions using the backward pruning process, and continuously modify the coefficients of the remaining items. If the accuracy of the model can be guaranteed, delete the redundant BFs; otherwise, keep them.
Step 3: Finally, compare the series of models obtained from the backward pruning process and select the optimal model via the GCV criterion in Equation (3). When the accuracy of the model increases, the GCV value decreases.

Since the calculation sample size is small, the running time will be reduced accordingly, although this type of sampling will inevitably bring about a decrease in accuracy. This technique is employed when the dataset is huge.
In fact, this approach is not only applied to k-means clustering but also widely used in machine learning and deep learning algorithms such as gradient descent and deep networks.
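A minimal sketch of the mini-batch idea applied to k-means is given below, in plain Python. It is not the implementation used in the study: the function names, batch size, iteration count, and toy data are hypothetical, and the initial centers are simply drawn from the data (or supplied by the caller) rather than obtained from a first k-means run on a subset.

```python
import random


def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))


def mini_batch_kmeans(data, k, batch_size=20, n_iter=100, seed=0,
                      init_centers=None):
    rng = random.Random(seed)
    if init_centers is None:
        centers = [list(p) for p in rng.sample(data, k)]
    else:
        centers = [list(c) for c in init_centers]
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(n_iter):
        # Work on a small random subset instead of the whole dataset.
        batch = rng.sample(data, min(batch_size, len(data)))
        # Assign every batch point to its nearest center first ...
        labels = [nearest(p, centers) for p in batch]
        # ... then move each center toward its points with a step of
        # 1/count, which keeps the center equal to the running mean.
        for p, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1.0 - eta) * c + eta * x
                          for c, x in zip(centers[j], p)]
    return centers


# Toy data: two well-separated groups of 50 two-dimensional points.
cluster_a = [(0.1 * (j % 10), 0.1 * (j // 10)) for j in range(50)]
cluster_b = [(x + 10.0, y + 10.0) for x, y in cluster_a]
data = cluster_a + cluster_b
centers = mini_batch_kmeans(data, k=2, init_centers=[(1.0, 1.0), (9.0, 9.0)])
```

Because each iteration touches only `batch_size` points, the cost per update does not grow with the size of the dataset, which is the trade of accuracy for running time described above.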
In this study, in order to more clearly determine the characteristics of the influence of different external conditions on algae growth, the clustering method was used to analyze and count the 200 groups of data in the later period. The flow of this algorithm is similar to k-means and consists of the following steps.
Step 1: First, extract part of the dataset, and use the k-means algorithm to construct a model with k clustering points.
Step 2: Continue to extract part of the sample data from the training dataset, add it to the model, and assign it to the nearest cluster center point.
Step 3: Update the center point value of the cluster.
Step 4  .

The specific equations are as follows: where N is the total number of simulations; i is the simulation number; S_i is the value of the i-th simulation; M_i is the i-th measured value; M̄ is the measured average value; and S̄ is the simulated average value. In the CPI, i takes 1-4; that is, it traverses R², RMSE, MRE, and NSE. According to the relationship between the trend of each of the four indicators and model accuracy, the coefficient α_i is −1 for RMSE and MRE and 1 for NSE and R². CPI_i is the comprehensive prognostic index of model i. The larger the CPI, the higher the model prediction accuracy.
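The four accuracy measures named above can be computed as in the sketch below. Because the equations themselves are not reproduced here, the code uses common textbook definitions as an assumption (R² as the squared Pearson correlation, MRE as the mean of |S_i − M_i| / |M_i|), which may differ in detail from the forms used in the study; the function names and sample values are hypothetical. The CPI is not computed because its formula is not given.

```python
import math


def rmse(sim, obs):
    """Root mean square error between simulated and measured values."""
    return math.sqrt(sum((s - m) ** 2 for s, m in zip(sim, obs)) / len(obs))


def mre(sim, obs):
    """Mean relative error (assumes no measured value is zero)."""
    return sum(abs(s - m) / abs(m) for s, m in zip(sim, obs)) / len(obs)


def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, values <= 0 mean the
    model is no better than the measured mean."""
    m_bar = sum(obs) / len(obs)
    num = sum((m - s) ** 2 for s, m in zip(sim, obs))
    den = sum((m - m_bar) ** 2 for m in obs)
    return 1.0 - num / den


def r2(sim, obs):
    """Squared Pearson correlation between simulated and measured values."""
    s_bar = sum(sim) / len(sim)
    m_bar = sum(obs) / len(obs)
    cov = sum((s - s_bar) * (m - m_bar) for s, m in zip(sim, obs))
    var_s = sum((s - s_bar) ** 2 for s in sim)
    var_m = sum((m - m_bar) ** 2 for m in obs)
    return cov ** 2 / (var_s * var_m)


# Hypothetical measured and simulated series with a constant bias of +1.
measured = [1.0, 2.0, 3.0]
simulated = [2.0, 3.0, 4.0]
scores = {
    "R2": r2(simulated, measured),
    "RMSE": rmse(simulated, measured),
    "MRE": mre(simulated, measured),
    "NSE": nse(simulated, measured),
}
```

The biased example shows why several measures are combined: the correlation-based R² is insensitive to a constant offset, whereas RMSE, MRE, and NSE all penalize it.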
The results revealed that the water quality simulation of Tai Lake by the ECO Lab model was highly credible, with a comprehensive error < 20% (Figure 4) and a CPI > 1.5 (Table 1), and can therefore provide basic support for the subsequent mechanism research.

Robustness test and MARS substitution model establishment
For a more scientifically sound analysis of the significant differences in the simulation results, this study used an F-test (Sanderson & Windmeijer ) to measure the homogeneity of variance. The hypotheses were H_0: no significant difference exists in the simulated data; and H_1: a significant difference exists in the simulated data, with α = 0.05. The specific equations are as follows: where S_T is the sum of the squared total deviations, s is the number of evaluations, n_j is the number of levels, X_ij is the i-th evaluation value at the j-th level, X̄ is the average of all evaluation values, and n is the degrees of freedom; the rejection region is F ≥ F_α(n − 1, n − 1).
The   ) but has almost no relationship with TN, indicating that TN in Taihu Lake is not a factor that limits the growth of Chl-a. Therefore, more attention should be paid to the dynamic changes of TP in the future. Moreover, in the Sobol calculation results, we found parameters with large first-order sensitivity, and the total-order sensitivity
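First-order and total-order Sobol indices of the kind discussed here can be estimated from model evaluations alone. The sketch below is an illustration under stated assumptions, not the study's workflow: it uses the Saltelli (first-order) and Jansen (total-order) Monte Carlo estimators, pseudo-random uniform inputs in place of a Sobol sequence, and a hypothetical linear `toy_model` standing in for the MARS surrogate.

```python
import random


def sobol_indices(model, n_inputs, n_base=20000, seed=0):
    """Estimate first-order and total-order Sobol indices for inputs that
    are independent and uniform on [0, 1]."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n_inputs)] for _ in range(n_base)]
    b = [[rng.random() for _ in range(n_inputs)] for _ in range(n_base)]
    y_a = [model(row) for row in a]
    y_b = [model(row) for row in b]
    y_all = y_a + y_b
    mean = sum(y_all) / len(y_all)
    var = sum((y - mean) ** 2 for y in y_all) / (len(y_all) - 1)
    first, total = [], []
    for i in range(n_inputs):
        # AB_i: rows of A with column i replaced by column i of B.
        y_ab = [model(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        # Saltelli estimator of the first-order variance (output centered
        # to reduce Monte Carlo noise).
        v_i = sum((yb - mean) * (yab - ya)
                  for ya, yb, yab in zip(y_a, y_b, y_ab)) / n_base
        # Jansen estimator of the total-order variance.
        vt_i = sum((ya - yab) ** 2
                   for ya, yab in zip(y_a, y_ab)) / (2 * n_base)
        first.append(v_i / var)
        total.append(vt_i / var)
    return first, total


def toy_model(x):
    """Hypothetical additive model; the third input has no effect."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]


first, total = sobol_indices(toy_model, n_inputs=3)
```

For this additive toy model the analytic indices are 0.8, 0.2, and 0 for both orders, so the estimates should be close to those values and the first-order and total-order results should nearly coincide. A gap between the two for a given input would indicate that the input acts mainly through interactions with other inputs.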

CONCLUSIONS AND PROSPECTS
(1) The ECO Lab model was found to fully reflect the actual situation in the water quality simulation of Tai Lake, and
(2) The MARS-Sobol method has the dual analysis functions of sensitivity and correlation. The results revealed that the factors with strong sensitivity are strongly correlated; different external conditions will have a significant impact on the water quality of Tai Lake. When Chl-a is the output indicator, the sensitivity ranking is inflow (36.42%) > TP (26.65%) > wind speed (25.89%) > temperature (8.38%). When TP is the output indicator, the sensitivity ranking is TP (81.17%) > inflow (16.64%); when TN is the output indicator, the sensitivity is closely related to TN concentration (92.52%). In general, the impact of a single external condition on Chl-a is less than that on TP and TN, and the growth of algae depends largely on the magnitude of the hydrodynamic force.
Therefore, the government should devote much more time to treating the algae problem.
(3) The clustering results of 200 groups of data revealed that the growth of algae is mainly affected by three external conditions: wind speed, flow rate, and TP concentration.
In other words, algae are more significantly affected by hydrodynamic forces and phosphorus flux in Tai Lake.
Since the temperature and water level cannot be controlled properly by humans, this study suggests that the water flow and phosphorus flux should be the dual controls of the hydrodynamic and nutrient levels in the future.