Abstract
In this study, a large amount of lake monitoring data was collected and genetic programming, a machine learning technique based on natural selection, was used to search for a robust relationship between phosphorus concentration, wind speed and water temperature. No functional form was specified before the search; the search itself produced a new prediction formula. The formula provides acceptable simulation accuracy and a theoretical reference for water environment management in shallow lakes or reservoirs.
HIGHLIGHT
A genetic programming-based model for predicting TP concentration.
INTRODUCTION
With the increasing shortage of water resources and the aggravation of water pollution, the effective prevention of lake water pollution is a topic of growing concern (Bunting et al. 2007; Higgins et al. 2018). Water quality in shallow lakes is especially vulnerable to the external environment (Dapeng et al. 2011; Wang et al. 2015). Phosphorus is a major factor in lake eutrophication, and the total phosphorus (TP) threshold for cyanobacteria has been shown to be 0.010 mg/L (Steinberg & Hartmann 1988). When TP reaches or exceeds 0.100 mg/L, the cyanobacteria content increases sharply and water quality declines (Downing et al. 2001). If TP variation in shallow lakes can be kept under control, water quality will improve.
Many factors affect the concentration of TP in lakes, including the release of endogenous phosphorus from the sediment and the input of exogenous phosphorus. TP release from sediment is affected by wind speed, temperature, light, pH, dissolved oxygen (DO) concentration, biological activity, etc. (Xie et al. 2003; Jin et al. 2006; Jiang et al. 2008). The hydrodynamic effect of wind is often considered one of the main factors affecting shallow lake water quality (Bormans et al. 2016; Havens et al. 2016). Wind can disturb the sediment in shallow lakes, releasing the nutrients adsorbed on the suspended sediment (Qin et al. 2006; James et al. 2008). As water temperature is a key parameter in most biological systems, temperature change also plays an important role in lake ecosystems (Liu et al. 2011; Wan et al. 2018). In general, phosphorus is a limiting nutrient in most lake ecosystems for most of the year (e.g., Dillon & Rigler 1974; Guildford & Hecky 2000). At appropriate water temperatures, microorganisms can absorb TP more readily and thus affect its concentration in the water (Ye et al. 2011; Wu et al. 2014).
In order to study TP migration in lakes, phosphorus release can be investigated via monitoring data and laboratory experiments. TP release experiments from sediment under variable hydrodynamic conditions in the laboratory enabled establishment of a TP release model (Bai et al. 2019). Imboden (1974, 1985) proposed a two-box lake TP model to estimate the response of Lake Baldegger to different restoration strategies. Jørgensen et al. (1986) estimated the universality of the dynamic phosphorus model and described the sediment-water phosphorus exchange in detail. Wang et al. (2003a, 2003b) developed a model to predict TP release fluxes at the sediment-water interface in Chesapeake Bay. This takes dissolved, exchangeable and organic particulate TP into account, but ignores TP release caused by sediment resuspension and subsequent migration in the water column. Hu et al. (2011) established an empirical relationship between the sediment resuspension rate and flow velocity using flume experiments. On this basis, Huang et al. (2016) created a more complex sediment transport model to assess the importance of internal phosphorus cycling on water phosphorus concentrations. The LEEDS-model is a dynamic, theoretical phosphorus model, which includes several temperature-dependent processes (Malmaeus & Håkanson 2004) and can be used as a physical lake model (Malmaeus et al. 2006).
Many numerical models of TP transport in lakes exist, but they need large sets of accurate, original lake data. Some well-known lakes, e.g., Taihu Lake, have been studied extensively, but many smaller, shallow lakes and reservoirs have not been studied well, and water quality predictions for them are very difficult to make. These models also require complex calculations, which are time-consuming and limit their use in the real-time management of lakes. A simple model based on the relationships between hydrodynamics, temperature and TP concentration would therefore make better use of available data in managing shallow lakes and reservoirs. In this study, data from shallow lakes around the world were investigated, and the variation in TP concentration with wind-induced hydrodynamics and water temperature was examined, in order to establish the relationship between wind speed, water temperature and TP concentration using an artificial intelligence optimization algorithm. The aim is to predict pollutant concentrations in a lake using data that are easy to obtain, so that the model's range of application can be extended easily.
METHOD
Data pre-processing
The data were derived from field monitoring. To cover as wide a range of conditions as possible, some 210 sets of field data were used, from different lakes in China, the United States, Europe and Japan (Table 1).
Source | Study area | Wind-speed range (m/s) | Water temperature range (°C) | TP concentration range (mg/L) |
---|---|---|---|---|
Chao et al. (2017) | Taihu, China | 2.9–7.4 | 30.5–31.7 | 0.085–0.242 |
Zhu et al. (2005) | Taihu, China | 0.3–12.3 | 0–5.6 | 0.076–0.299 |
Shinohara et al. (2016) | Lake Kasumigaura, Japan | 1.7–8.7 | 19.9–30.4 | 0.039–0.129 |
Wu et al. (2019) | Taihu, China | 2–6.7 | 27.3–31.2 | 0.103–0.241 |
Tammeorg et al. (2013) | Lake Peipsi, Russia | 0.5–2.3 | 10–30 | 0.043–0.093 |
Tammeorg et al. (2013) | Lake Lämmijärv, Russia | 0.5–2.3 | 10–30 | 0.095–0.14 |
Maceina & Soballe (1990) | Okeechobee, Florida | 0.98–1.57 | 18–29.7 | 0.057–0.109 |
Bai et al. (2020) | Yangshapao Reservoir, China | 1.2–4.6 | 1.6–29.8 | 0.011–0.645 |
Wang et al. (2016) | Lake Donghu, China | 1.4–1.7 | 26.8–29.7 | 0.043–0.06 |
Coppens et al. (2016) | Lake Eymir, Turkey | 3.26–4.32 | 4–24.5 | 0.52–0.75 |
Ramm & Scheps (1997) | Lake Blankensee, Germany | 1.5–8.5 | 0.5–18 | 0.042–0.407 |
Selig et al. (2002) | Lake Bützow, Germany | 1.3–3.1 | 4.5–18 | 0.019–0.034 |
Søndergaard et al. (2003) | Lake Vest Stadil Fjord, Denmark | 0.3–6.2 | 12.2–14.9 | 0.052–0.315 |
Spears et al. (2007) | Loch Leven, UK | 1.6–6.4 | 4.5–15.5 | 0.028–0.116 |
Kamiya et al. (2011) | Lake Shinji, Japan | 1.9–6.7 | 3.6–24.3 | 0.029–0.095 |
Kamiya et al. (2011) | Lake Nakaumi, Japan | 2.9–6.7 | 8.5–26.3 | 0.03–0.083 |
Phillips et al. (1994) | Norfolk Broads, UK | 1.1–7.2 | 4.5–16.5 | 0.016–0.209 |
Establishing the formula
The main steps in Eureqa are:
1: Enter the training and validation data.
2: The stochastic equation generator generates preliminary equations that combine operands (e.g., constants and variables) with operations (addition, subtraction, multiplication, division, etc.).
3: The mean squared error (MSE) and mean absolute error (MAE) are used to compare the predicted values generated by the equations with the measured values from the test group. Bad solutions are discarded. The rest are hybridized using the probability function given by the program, and new sub-expressions are generated according to the inherent mutation probability function.
4: Stop the program when a reasonable solution appears (it will not stop automatically). The program will give a series of solutions with different accuracies and complexities.
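The steps above can be illustrated with a toy genetic-programming loop. This is only a minimal sketch of the general technique, not Eureqa's actual algorithm: the expression encoding, the operator set, the selection rule (keep the better half), the mutation rate and the names `random_expr`, `graft`, `evolve` and `MAX_SIZE` are all assumptions made for illustration, and the inputs are simply called `x` and `y`.

```python
import math
import random

BINARY = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
UNARY = {"sin": math.sin, "cos": math.cos}
MAX_SIZE = 30  # hypothetical cap on tree size, to keep formulas small

def random_expr(rng, depth=2):
    """Step 2: random equation built from constants, variables and operations."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return ("const", round(rng.uniform(-1.0, 1.0), 3))
        return ("var", rng.choice(["x", "y"]))
    if rng.random() < 0.25:
        return (rng.choice(list(UNARY)), random_expr(rng, depth - 1))
    return (rng.choice(list(BINARY)), random_expr(rng, depth - 1), random_expr(rng, depth - 1))

def evaluate(expr, x, y):
    """Compute the value of an expression tree for one (x, y) pair."""
    kind = expr[0]
    if kind == "const":
        return expr[1]
    if kind == "var":
        return x if expr[1] == "x" else y
    if kind in UNARY:
        return UNARY[kind](evaluate(expr[1], x, y))
    return BINARY[kind](evaluate(expr[1], x, y), evaluate(expr[2], x, y))

def size(expr):
    """Number of nodes in the tree, a simple stand-in for 'complexity'."""
    if expr[0] in ("const", "var"):
        return 1
    return 1 + sum(size(child) for child in expr[1:])

def mse(expr, rows):
    """Step 3: mean squared error between predicted and measured values."""
    try:
        total = sum((evaluate(expr, x, y) - w) ** 2 for x, y, w in rows)
    except (OverflowError, ValueError):
        return float("inf")
    return total / len(rows) if math.isfinite(total) else float("inf")

def random_subtree(expr, rng):
    """Walk down the tree at random and return the subtree reached."""
    while expr[0] not in ("const", "var") and rng.random() < 0.5:
        expr = rng.choice(expr[1:])
    return expr

def graft(expr, donor, rng):
    """Return a copy of expr with one randomly chosen subtree replaced by donor."""
    if expr[0] in ("const", "var") or rng.random() < 0.3:
        return donor
    children = list(expr[1:])
    i = rng.randrange(len(children))
    children[i] = graft(children[i], donor, rng)
    return (expr[0], *children)

def evolve(rows, generations=30, pop_size=60, seed=0):
    """Evolve expressions that fit rows of (x, y, w); return the best one found."""
    rng = random.Random(seed)
    population = [random_expr(rng) for _ in range(pop_size)]
    for _ in range(generations):  # Step 4: a fixed budget here; Eureqa is stopped by hand
        population.sort(key=lambda e: mse(e, rows))
        survivors = population[: pop_size // 2]  # discard the worse half
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = rng.sample(survivors, 2)
            child = graft(a, random_subtree(b, rng), rng)  # hybridize two parents
            if rng.random() < 0.2:
                child = graft(child, random_expr(rng), rng)  # mutate a sub-expression
            if size(child) > MAX_SIZE:
                child = a  # reject oversized offspring
            children.append(child)
        population = survivors + children
    return min(population, key=lambda e: mse(e, rows))
```

Calling `evolve(rows)` with a list of `(x, y, w)` tuples returns the best tree found within the generation budget. Because the better half of each generation is carried over unchanged, the best error can never get worse from one generation to the next.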
Like other GP-based software, Eureqa requires the data to be divided into three groups: 40% for training, 30% for validation and 30% for testing. Two criteria were used in this study to select the final solution from the formulas provided by Eureqa: (1) accuracy does not increase significantly with any further increase in complexity, and (2) the complexity is not too great.
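A 40/30/30 partition of this kind might look as follows. The sketch assumes a random shuffle before splitting, which the text does not specify, and the function name `split_40_30_30` and the `seed` parameter are hypothetical.

```python
import random

def split_40_30_30(samples, seed=0):
    """Shuffle the samples and split them 40/30/30 into train, validation, test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # assumed: random assignment to groups
    n = len(shuffled)
    n_train = n * 40 // 100  # integer arithmetic avoids float rounding
    n_val = n * 30 // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test group
    return train, validation, test
```

For the 210 data sets used here, this gives groups of 84, 63 and 63 samples.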
RESULTS
The simulation results are shown in Table 2 and Figure 1. In general, prediction accuracy increases with the equation's complexity. However, once the complexity exceeds 14, the fitting accuracy improves only slightly while the complexity increases substantially. More complex equations are not suited to practical application, so the equation with a complexity of 14 was selected. Its MSE and MAE are 0.0324 and 0.130, respectively. The value of R² for the test data is 0.7261, and 0.7159 for all data (Figures 2 and 3). The points are distributed evenly on either side of the line.
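As a worked illustration, the selected equation and the error measures can be written out directly. The coefficients are those of the complexity-14 row of Table 2. The function names are hypothetical, `x` and `y` are simply the two model inputs as denoted in that table, and the coefficient of determination is included on the assumption that it is the goodness-of-fit statistic plotted in Figures 2 and 3.

```python
import math

def selected_formula(x, y):
    """Complexity-14 equation from Table 2: w = 0.356 + 0.456*x + 0.145*cos(1.811*y)."""
    return 0.356 + 0.456 * x + 0.145 * math.cos(1.811 * y)

def mse(measured, predicted):
    """Mean squared error."""
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)

def mae(measured, predicted):
    """Mean absolute error."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot

# Hypothetical (x, y, w) samples, invented only to show the call pattern.
samples = [(0.2, 0.5, 0.52), (0.6, 0.1, 0.75), (0.9, 0.8, 0.80)]
measured = [w for _, _, w in samples]
predicted = [selected_formula(x, y) for x, y, _ in samples]
print(mse(measured, predicted), mae(measured, predicted), r_squared(measured, predicted))
```

The sample values carry no physical meaning. With real data, the same three calls would be applied separately to the test group and to the full data set.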
Complexity | Solution | MSE | MAE |
---|---|---|---|
1 | w = x | 0.0405 | 0.149 |
3 | w = 0.967×x | 0.0391 | 0.148 |
5 | w = 0.102 + 0.798×x | 0.0338 | 0.140 |
7 | w = 0.775×x + 0.165×y | 0.0322 | 0.136 |
9 | w = 0.792×x + 0.195×y − 0.039 | 0.0327 | 0.135 |
12 | w = 0.785×x + 0.262×sin(y) − 0.0617 | 0.0325 | 0.135 |
14 | w = 0.356 + 0.456×x + 0.145×cos(1.811×y) | 0.0324 | 0.130 |
20 | w = 1.239×x + 0.25×y − 0.113 − x²×cos(1.183×x) | 0.0318 | 0.128 |
22 | w = 1.376×x + 0.251×y − 0.137 − x²×cos(1.058×x²) | 0.0311 | 0.127 |
28 | w = 1.737×x + 1.803×x³ + sin(y) − 0.19 − 0.559×y − 2.633×x² | 0.0307 | 0.126 |
Note: w represents , x, and y.
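The accuracy-complexity trade-off in Table 2 can also be examined numerically. The sketch below applies one possible rule: take the simplest equation whose error lies within a given tolerance of the best error in the table. Both the rule and the 5% tolerance are assumptions made for illustration. The selection described in the text is qualitative and states no threshold.

```python
# (complexity, MSE, MAE) as reported in Table 2
TABLE_2 = [
    (1, 0.0405, 0.149), (3, 0.0391, 0.148), (5, 0.0338, 0.140),
    (7, 0.0322, 0.136), (9, 0.0327, 0.135), (12, 0.0325, 0.135),
    (14, 0.0324, 0.130), (20, 0.0318, 0.128), (22, 0.0311, 0.127),
    (28, 0.0307, 0.126),
]

def simplest_within(rows, metric_index, tolerance):
    """Return the lowest complexity whose error is within `tolerance` of the best."""
    best = min(row[metric_index] for row in rows)
    for row in sorted(rows):  # ascending complexity
        if row[metric_index] <= best * (1.0 + tolerance):
            return row[0]

print(simplest_within(TABLE_2, metric_index=2, tolerance=0.05))  # rule applied to MAE
print(simplest_within(TABLE_2, metric_index=1, tolerance=0.05))  # rule applied to MSE
```

With these values the rule returns complexity 14 when applied to MAE and complexity 7 when applied to MSE. The outcome is therefore sensitive to the metric and tolerance chosen, so the rule is a cross-check on the final selection and cannot replace the qualitative judgement.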
DISCUSSION
The validation data used came from worldwide sources, so the data span is large. The TP concentration ranges from 0.011 to 0.747 mg/L. The wide range of the data selected lays a foundation for a universal model, although, compared with some complex numerical models, the fitting accuracy is relatively low (Huang et al. 2016; Bai et al. 2019). However, complex models often need more detailed data – e.g., topography, pH, biological communities, etc – which are not available for many lakes, making the models' application more troublesome. The model developed in this study is based on wind speed and water temperature, which can be monitored easily on most lakes, so its universality is higher.
If wind speed and water temperature, the two parameters required by the simple equation, are difficult to obtain, it would be better to measure the TP concentration directly. Nevertheless, both factors are important in TP concentration variation in shallow lakes: wind can cause a large amount of sediment resuspension, resulting in large increases in TP concentration (Bormans et al. 2016; Havens et al. 2016), and temperature affects flow velocity. Bai et al. (2019) found that flow velocity increases the saturated concentration of TP along with its upper limit.
Lake depth is an important factor affecting phosphorus concentration. In both shallow and deep lakes, the effect depends on whether sufficient sediment can be suspended to influence the concentration (Qin et al. 2020). If a lake is too deep, the wind cannot disturb the sediment surface, and the basis for the equation is not present.
Temperature is very important for microorganisms and organisms in lakes. High temperatures often increase their activity, and accelerate their absorption or release of TP, affecting the TP concentration indirectly, and vice versa (Ye et al. 2011; Wu et al. 2014; Zhang et al. 2014). It is difficult to investigate the many kinds of organisms and microorganisms in lakes in detail, but correlations can be established between temperature and biological and microbial activity in lakes (Liu et al. 2011; Wu et al. 2014). Thus, the formula reflects not only the physical effects (the disturbance of sediment by wind) but also the chemical effects (the change in biological activity with water temperature), which is advantageous. The study results could be improved by increasing the amount of data used in verification.
CONCLUSIONS
In this study, the relationship between TP concentration, water temperature and wind speed was explored using data collected from previous research. A GP algorithm was used to develop a simple, easy-to-use TP concentration predictor for shallow lakes or reservoirs. This provides a theoretical reference for lake and/or reservoir management.
ACKNOWLEDGEMENTS
Funding is acknowledged for key technology development and application demonstration of comprehensive management and resource utilization of cyanobacteria in Taihu Lake Basin (Key R & D funds of Zhejiang Province: 2021C03196); Zhejiang Basic Public Welfare Research Project (LGF19E090001). The manuscript was proofread by Qianqian Mao and Zhicheng Qiu in Zhejiang University of Water Resources and Electric Power.
ETHICAL APPROVAL
Not applicable
CONSENT TO PARTICIPATE
Not applicable
CONSENT TO PUBLISH
Not applicable
AUTHORS' CONTRIBUTIONS
Yu Bai analyzed and interpreted the data and was a major contributor in writing the manuscript. Jianquan Yang and Guojin Sun carried out the calculations for the model. All authors read and approved the final manuscript.
FUNDING
Funding is acknowledged for key technology development and application demonstration of comprehensive management and resource utilization of cyanobacteria in Taihu Lake Basin (Key R & D funds of Zhejiang Province: 2021C03196); Zhejiang Basic Public Welfare Research Project (LGF19E090001).
COMPETING INTERESTS
The authors declare that they have no competing interests.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.