The use of spectral theory and chaos theory on river water quality modeling is reported in a very limited way. This study proposes a wavelet-maximum Lyapunov exponent (WMLE) hybrid model for river water quality dynamics, combining spectral theory and chaos theory. The methodology involves the following major steps: (1) use of wavelet transformation to filter the noisy signal in the water quality time series; (2) reconstruction of phase space to embed the water quality time series and determine the trajectory of the underlying dynamics; and (3) identification of the presence/absence of chaos and prediction using the largest Lyapunov exponent value. Case studies on the Huaihe River in China and the Potomac River in the United States, as representatives of low-frequency and high-frequency forecast, show average relative errors on weekly dissolved oxygen (DO), chemical oxygen demand (COD), and ammonia nitrogen (NH3-N) data are 2.35%, 4.53%, and 18.85%, and on 15-minute based DO data are 1.185%. It also indicates that the hybrid model performs better to some extent when compared to the purely Lyapunov exponent model, ARMA model, and ANN model. This study is a proof that the combination of spectral theory and chaos theory is promising to describe and predict fluctuation of particular water quality indicators in rivers.
River water quality dynamics of DO, COD and NH3-N fluctuation present chaos behavior.
A hybrid Wavelet-Lyapunov exponent model for water quality forecast is newly proposed.
WMLE model performs well for weekly DO and COD forecast but relative poor for NH3-N.
Forecast for 15-minute based DO time series achieve high accuracy in Potomac river.
WMLE model slightly overweight ANN, ARMA and pure Lyapunov model in the study cases.
Forecasting the river water quality fluctuation is one of the key elements of watershed water quality management. It provides an early warning for chemical release incidents, risk assessment for water usage, guiding the operation of water supply and drainage system. The existing modeling or forecast tools are normally grouped as mechanistic models and data-driven models. Due to the ability to represent physical processes, the mechanistic models are commonly used in pollution control planning of water environment or situational analysis of long-term transition (Gordillo et al. 2019). However, those models generally require data of variables, which are hard to collect in reality.
Data-driven models, due to their ability to identify patterns in the system dynamics often based on a single time series alone, have clear advantages, especially in short-term prediction (Babovic 2005; Solomatine & Ostfeld 2008; Tongal & Booij 2018). Although the types of data used in the data-driven models are much less when compared to those in the mechanistic models, the data-driven models often require long-term observations of water quality (Sun et al. 2010). With growing infrastructure and analysis methods for environmental monitoring and data accumulation around the world in recent years, the arrival of the era of data-intensive innovation provides new opportunities for the development and use of more sophisticated data-driven models. In this regard, the rapid development of many machine learning techniques certainly provides new avenues for modeling and prediction of water quality dynamics (Meszaros & El Serafy 2018; Yajima & Derot 2018).
River systems are open, complex, and nonlinear systems. The dynamics of water quality (and quantity) in rivers are often governed by complex and nonlinear interactions among the variables influencing river systems (Lintern et al. 2018). Such dynamics are also often subjected to sensitive dependence on initial conditions, which limits our ability to predict the dynamics in the long term but, nevertheless, allows reliable short-term predictions. Therefore, in the context of complex and nonlinear interactions with sensitive dependence on initial conditions, the dynamics of water quality (and quantity) in rivers can be studied using the concepts of nonlinear dynamic and chaos theories; see the monograph by Sivakumar (2017) for a comprehensive account of chaos theory applications to hydrological systems.
Chaos theory was discovered in 1963 by the famous American meteorologist Edward Lorenz through numerical experiments of the atmosphere (Lorenz 1963). The term ‘chaos’ was coined, and has been widely used to refer to situations where complex and ‘random-looking’ behaviors arise from simple deterministic systems with sensitive dependence on initial conditions. The development of many methods for identification of chaos in time series since the 1980s has led to their applications in many different scientific and engineering fields, such as meteorology, hydrology, geology, biology, medical science, electronics, transportation, and astronomy. In the field of hydrology, there have been numerous applications of the concepts of chaos theory to study many different processes, data, and problems associated with river systems. These studies include analysis of rainfall, river flow, and sediment transport processes (Yu et al. 2004; Sivakumar 2005; Huang et al. 2017), addressing system characterization, prediction, missing data estimation, and data disaggregation, among others.
The outcomes of such studies are certainly encouraging, especially for more reliable short-term predictions of river system dynamics when compared to the predictions achieved using stochastic and other data-based approaches. A particular advantage of the concepts of chaos theory is that they can offer useful clues regarding the predictability of system dynamics. For instance, the Lyapunov exponents provide important information on the predictability horizon for a system (Huang et al. 2017), as the Lyapunov exponents are essentially the average exponential rates of divergence (expansion) or convergence (contraction) of nearby orbits in the phase space which is essentially a graph or a coordinate diagram, whose coordinates represent the variables necessary to completely describe the state of the system at any moment. Recently, some studies have reported forecasting high-dimensional chaotic systems by long short-term memory networks (Lu et al. 2018; Vlachas Pantelis et al. 2018).
While the chaos theory has found widespread applications in river system studies, the applications have largely been to study water quantity, such as river flow discharge and water level (Tongal 2013, 2020; Tongal & Berndtsson 2017). To our knowledge, chaos applications to study river water quality dynamics have, thus far, been very limited. One possible reason for this situation may be the general belief that chaos identification and prediction methods require infinite (or at least very long) and noise-free time series, while real water quality time series are often very short and always contaminated with noise. However, studies have shown that the data size issue for chaos identification and prediction methods is not as serious as it is generally believed to be and that the methods can provide reliable results even when the data size is small; see, for example, Sivakumar et al. (2002), Sivakumar (2005), Siek & Solomatine (2010), and Zhang (2013) for some details on the issue of data size in the application of chaos identification and prediction methods for river system-related time series.
In recent years, there has been increasing attention on developing hybrid models to increase the predictability in data-driven modeling (Huang et al. 2017; Li et al. 2017). A good understanding of the specific advantages and disadvantages of the existing models allows selection of two (or more) models that are suitable for the problem/data at hand for more reliable modeling and predictions. Wavelet technology has the feature of good localization performance of time frequency, and it has strong adaptability to represent the local feature of signal in the time-frequency domain (Sang 2013; Alizadeh et al. 2018). There have been many studies that have successfully combined wavelets with other machine learning methods, such as artificial neural networks (ANNs), support vector machine (SVM) (Yu et al. 2004; Lin et al. 2013), and extreme learning machine (Roushangar et al. 2018). For example, Shi et al. (2017) proposed a coupled wavelet-ANN model and applied it for early warning of water quality conditions. Barzegar et al. (2018) combined wavelets and extreme learning machine to predict electrical conductivity. Ren et al. (2013) proposed a combination of adaptive neural-fuzzy inference system and wavelet analysis to predict monthly runoff and reported very good prediction. With the usefulness and advantages of the concepts of chaos theory, there seems great potential to couple chaos methods and wavelets to enhance our analysis, modeling, and prediction of time series. To our knowledge, no study has attempted to study the effectiveness of the hybrid model for water quality time series. The present study makes an effort to further advance research in this direction.
In the present study, a hybrid prediction model is developed by coupling chaos theory and wavelets for water quality prediction. In particular, a coupled wavelet-Lyapunov exponent model is proposed. The wavelet is used to effectively extract feature information of the observed time series through decomposition and reconstruction. It can be applied in short time series and greatly simplifies the calculation complexity of prediction model of the maximal Lyapunov exponent (MLE); meanwhile, it increases flexibility and improves the forecast precision of the MLE model when applying on larger fluctuations of time-sequence data. Case studies are carried out based on long-term water quality monitoring on Huaihe River, China and high-frequency monitoring on Potomac River, United States (US). Based on the data availability, chemical oxygen demand (COD), dissolved oxygen (DO), and ammonia nitrogen (NH3-N), nitrite-nitrogen (NO2-N) were the main water quality indicators under investigation.
As well, this work discussed the following hypotheses or questions: (1) Is there presence/absence of chaos on the river water quality variation? (Does the chaos exist in the river water quality variation?) (2) Can the chaos theory-based modeling approach be qualified for predicting different water quality parameters? (3) Does the hybrid method combining with spectral theory improve the performance of purely chaos-based predication? (4) How is the performance of the hybrid method compared with traditional data-driven models, i.e., ARMA, ANN, etc. The following sections include illustration of methodology, descriptions of study area and monitoring data, results and discussion on chaos identification, model performance and comparison.
Decomposition and reconstruction of wavelet
Wavelet transformation has good localized performance of time frequency and it is one kind of conversion tool of time frequency that is widely used (Sang 2013). It has high frequency resolution and low time resolution in the low-frequency part and low frequency resolution and high time resolution in the high-frequency part. Therefore, it has strong adaptability to represent the localized feature of signal in the time-frequency domain. Wavelet function in wavelet analysis can be diverse, and different wavelet functions are often needed in different situations in applications. The db-wavelet has reliable time-frequency resolution, and it can match the original signal to the greatest degree. The db5 wavelet has been found to provide optimal effect upon repeated calculations and analysis in many applications (Sang 2013; Alizadeh et al. 2018; Roushangar et al. 2018), and so this wavelet function is selected here. The specific steps are as follows (Kim et al. 1999):
Identification of chaos
There exist many methods for identification and prediction of chaos in a time series. These include correlation dimension method, Lyapunov exponent method, false nearest neighbor algorithm, and nonlinear local approximation prediction method, among others; see, for example, Kantz & Schreiber (2004) and Sivakumar (2017) for details. Almost all of the chaos identification and prediction methods use the concept of phase space for embedding (reconstruction) of the time series (e.g., Packard et al. 1980; Takens 1981). In the present study, the correlation dimension method and the Lyapunov exponent method are used for identification of chaos in the water quality time series and, therefore, they are briefly described here. Since it is common to both methods, it is presented first.
Determination of delay time τ
Determination of optimal embedded dimension m and correlation dimension D
The correlation dimension of a time series is a reliable indicator of the number of dominant variables governing the underlying system dynamics. For estimation of the correlation dimension method, the Grassberger-Procaccia (GP) algorithm (Li et al. 2013) is widely used. The main steps of the GP algorithm are as follows:
A small (say, m0 = 2) is first offered to reconstruct the phase space with time series , as shown in Equation (1).
The dimension d and cumulative distribution function of the attractor should satisfy for certain proper scope of r. The estimated value of correlation exponent corresponding to can be obtained by fitting.
Then, an embedded dimension is added, and steps (2) and (3) are repeated until the corresponding estimated value of dimension is convergent to a stable value. The value d obtained in this way is the correlation dimension of the attractor, and the embedding dimension that is just above the correlation dimension value is considered to be equal to the number of variables dominantly governing the underlying system dynamics.
Lyapunov exponents are the average exponential rates of divergence (expansion) or convergence (contraction) of nearby orbits in the phase space. In general, the presence of a positive Lyapunov exponent in a time series indicates the presence of chaos, i.e., at least the MLE λ1 is above zero. There are many different methods to calculate the Lyapunov exponent of a time series (Zhang 2013). However, the algorithm by Wolf et al. (1985) is widely used, and so is used in the present study (Huang et al. 2017). The main calculation process is as follows:
Prediction model of MLE
Wavelet-Lyapunov exponent model
As long as the presence of chaos in a time series is identified, the hybrid WLME model can be used for decomposition of the time series and subsequent predictions. It is assembled in three steps. First, wavelet decomposition is conducted on the original time series, to obtain the low-frequency signal and high-frequency signal. Second, the low-frequency signal is predicted in the application of the MLE model and the high-frequency signals are all set as zero. Third, wavelet reconstruction is conducted in combination with low-frequency prediction result and the reset high-frequency signal to obtain the final prediction result. A flow chart of the WLE model is shown in Figure 1.
STUDY AREA AND MONITORING DATA
For watershed water quality management, routine monitoring is typically conducted on a monthly basis or weekly basis. In many cases, online high-frequency monitoring (e.g., 5 to 60 min interval) is available for physical (e.g., temperature, turbidity) and chemical (e.g., conductivity, dissolved oxygen) indicators due to the development of in-situ water quality sensors (Rode et al. 2016). This is of particular importance in monitoring drinking water resources. Considering these, in this study, we select water quality data observed at two different time resolutions for investigation, to represent low-resolution monitoring and high-resolution monitoring: (1) weekly water quality data observed at the Huaihe River in China and (2) 15-minute water quality data observed at the Potomac River in the US. In both cases, three water quality parameters are analyzed: COD, DO, and NH3-N. One-step ahead and multi-step ahead forecasts are made. The two study regions and the data considered are described next.
Weekly monitoring at Huaihe River, China
The Huaihe River Basin is located in eastern China between the Yellow River Basin and the Yangtze River Basin (Figure 2). The Huaihe River flows through five provinces, including Hubei, Henan, Anhui, Shandong, and Jiangsu. The area of the entire basin is about 270,000 km3, and the basin population is about 165 million. The average annual rainfall in the basin is about 900 mm, over 70% of which is concentrated in the rainy season between June and October. In the Huaihe River Basin, there are 27 automatic water quality monitoring stations, and water quality data have been monitored at weekly intervals, through the national automatic monitoring network.
In the present study, water quality data observed at the Aishanxi Bridge monitoring station are considered for analysis. This station is located in Picang, Pizhou in Jiangsu province. The specific location of this station is westward drift of the floodway in Picang in the junction of Shandong and Jiangsu provinces. Weekly water quality data from January 2009 to June 2014 are considered for analysis, for a total of 281 data (weeks). The water quality variables considered for study are COD, DO, and NH3-N. The time series of these three water quality variables over the 281-week period are shown in Figure 3.
Sub-hourly monitoring in Potomac River, US
The Potomac River is the fourth largest river in the US and is located along the coast of the Atlantic Ocean (Figure 4). The river originates in the Appalachian Mountains of West Virginia and runs into Chesapeake Bay. Approximately five million people live in this basin. The average flow in this river is 306 m³/s (USEPA 2017).
The United States Geological Survey (USGS) (https://waterdata.usgs.gov/nwis/sw) has installed more than 850,000 monitoring stations to collect surface-water data, including water levels, streamflow (discharge), surface-water quality, rainfall, etc. They are mainly detected by sensors at stations with a fixed interval of 15 to 60 minutes and transmitted to the USGS every hour. The surface water quality parameters include water temperature (WT), specific conductivity (SC), DO, pH, turbidity (TURB), and nitrate + nitrite nitrogen (NOx). Among the variables, only DO is common for both the river basins considered in this study. Long-term COD and NH3-N for the Potomac River are not available.
For the present study, water quality data observed from Smith Creek close to New Market in Virginia (USGS Station #01632900) are considered for analysis. The data are available at a very high resolution of 15 minutes as shown in Figure 5.
RESULTS AND DISCUSSION
The COD dataset in the Huaihe River is first used for model verification as a baseline scenario. The WMLE model is then compared with pure Lyapunov exponent model (LE model) and two classical data-driven prediction models, ARMA (Shi et al. 2017) and ANN prediction model (Du et al. 2017). WMLE model performance of COD dynamics is compared on the application on DO and NH3-N dynamics and, finally, the performance on high-resolution water quality dynamics is compared.
COD monitoring records are divided into two groups: calibration period or ‘warm-up’ period (total 250 data from the first week in 2009 to the 41st week in 2013) and verification period (total 30 data from the 42nd week in 2013 to the 19th week in 2014) (Grassberge & Procaccia 1983; Zhang et al. 2016). Warm-up means the chaos feature of the low-frequency signal of decomposed surface water quality time series is identified at this period. Tests on DO and NH3-N dynamics followed the same routine.
Chaos identification of low-frequency signal of surface water quality
Wavelet decomposition is first conducted on COD monitoring data from the first 250 weeks of the original time series with db5 wavelet function and the decomposition layer is two. Thus, the low-frequency signal presented the main information in monitoring data of water quality and the high-frequency signal is composed of main noise. Wavelet decomposition waveform of the original time series can be seen in Figure 6, and S is the original signal. a2 is low-frequency signal and d2 and d1 are high-frequency signals.
To determine the delay time , non-partial/multiple autocorrelation coefficient is calculated according to the method hereinbefore and the result can be seen in Supplementary Material, Table S1. Time of value reduced to (1 − 1/e) of the initial value for the first time is selected as time delay , namely, optimal delay time of reconstructed phase space. It can be found that = 3 from Figure 7 and Supplementary Material, Table S1.
Phase space of the low-frequency signal is then reconstructed. The projection shown in Figure 8(b) corresponds to a delay time value τ = 3. The reconstruction yields relatively well-defined dynamics. Figure 8(c) shows correlation integral, C(r), and the radius, r, for embedding dimensions, m, from 2 to 13. Fitting should be conducted on the parts of all the curves mostly close to the straight line with least squares method and the slope of the fitted straight line is, namely, correlation exponent D. It can be found from Supplementary Material, Table S2 and Figure 8(d) that correlation coefficient D tends to be stable when embedded dimension m = 11, which is much larger than the case for runoff modeling (Sivakumar 2017). Thus, it is determined that embedded dimension m = 11 and corresponding correlation coefficient D = 3.8415.
Correlation exponent has a platform as the embedding dimension goes higher in Figure 8(d), which is a proof of chaos absence. Further, it can be obtained that MLE λ1 = 1.1710 (low-frequency signal) and 1.2413 (original data) through calculation according to Equation (4) and delay time = 3 and embedded dimension m = 11 determined in step 1 and step 2. MLE is positive, also denoting that the low-frequency signal of time series of river water quality has chaotic property and can be predicted with chaos theory.
Performance of WMLE model and comparison with classical data-driven approaches
COD dataset was first used for investigation. The one-step prediction results of the WMLE model on the upcoming 30 weeks have maximum relative error (MRE) 14.61% and average relative error (ARE) 4.53% (Figure 9).
Prediction can be conducted on the same data and scenario with pure Lyapunov exponent model (LE model) without wavelet filtering (Equation (4)). It can be obtained that = 3, m = 15 and corresponding correlation exponent D = 4.5000 and MLE = 1.2413 according to the mentioned method hereinbefore. LE model results can be seen in Figure 9. MRE is 17.17% and ARE is 6.72% in this case.
ARMA was proposed by American statisticians Jenkins and Box in the 1970s and it has become the benchmark model for time series prediction. The basic principle is: time series can be regarded as a random process and it can be described or simulated with a mathematical model (Thornton & Chambers 2017). ARMA model was constructed here and results of COD at Aishanxi Bridge monitoring station are reported in Figure 9 as well. MRE is 19.82% and ARE is 7.11%.
ANN model is another benchmark gray-box model on the scope of machine learning. A back-propagation ANN was used for comparison (Tongal 2013). The log-sigmoid transfer function logsig was used in the hidden layer, and the pure linear function purelin in the output layer. After training test and optimization (see Supplementary information), a structure of 10 × 8 × 1 was selected. During the training, maximum step of 1,000, minimum error with 0.0001, and learning rate 0.05 were set. Results show MRE is 13.80% and the ARE is 7.25% in the case of Huaihe River. Supplementary Material, Figures S1 and S2 with corresponding interpretation give the information on the construction of the ANN model. The detailed steps of the ARMA model are shown in Figures S3 and S4 and the accompanying text.
Table 1 lists the MRE and ARE of the above four methods for comparison. The hybrid WMLE model is superior to almost all other models in this case. The wavelet process largely improved the performance of the purely Lyapunov model which denotes the merit of the hybrid model. As a benchmark model, ARMA interestingly performs well in this case (see Supplementary information for more details). The ANN model performs worst in ARE, however, with a minimum MRE. This is probably due to the noise and input neurons. ANN is normally used as a ‘half-mechanism’ model, and relative impact factors of COD, like temperature, DO, etc., are necessary to be considered in an input layer besides historical COD data (Wang et al. 2013). As a nonlinear local approximation approach, it was found that although the pure Lyapunov model is significantly superior to the ANN model for river flow forecast (Sivakumar et al. 2002), the ANN model performs slightly better on COD forecast in this study. Performance on long-term forecast, and multi-step, was not investigated here since the time scale in this case is coarse.
|Models .||Maximum relative error (%) .||Average relative error (%) .|
|Models .||Maximum relative error (%) .||Average relative error (%) .|
The combination of chaos system with machine learning is also promising. One pathway is chaos evidence improving machine learning models (Sun et al. 2010; Huang et al. 2017). The other is machine learning models reconstructing chaos and improve the forecasting performance (Pathak et al. 2018). No studies were reported on riverine water quality dynamics.
Performance of WMLE model on different water quality parameters
The same process of prediction, verification, and analysis is also conducted on DO and NH3-N time series as COD. It was found that DO has delay time τ = 5 and embedding dimension m = 4 while NH3-N has t = 3 and m > 20 (misconvergence). Results shown MRE of DO prediction is 6.11% and ARE is 2.35%; MRE of NH3-N prediction is 55.20% and ARE is 18.85%.
NH3-N prediction is most challenged here compared with DO and COD. This is reasonable, since sources and dynamics of the in-stream NH3-N process are more complex than those of COD and DO. The many big jumps in the original datasets of NH3-N also offer some preliminary observations on this (see Figure 3). Such jumps probably result from storm water or pollutant discharge and they can be identified as anomalies, in practice. For example, NH3-N abruptly jumped to 3.5 mg/Lat the 156th week, six times higher than the average value 0.57 mg/L of the whole series. Such abnormalities surely exceed the predication capacity of models, hence the increase of both MRE and ARE to a certain degree.
Performance of WMLE model on high-resolution water quality dynamic
High-resolution observation obviously leads to a different pattern of water quality dynamics from that of weekly observation. In the Potomac River case, DO and NO2−-N were investigated due to data availability. The previous 5 days, from January 1 to January 5, is set as the warm-up period which contains 480 data on a 15 minute basis. Ninety-six data on the 6th day are used as model verification. Delay time τ is selected according to non-partial multiple autocorrelation coefficient method, as shown in the Methodology section. As presented in Table S1 and Figure 10(a), τ = 18 is determined, i.e., 4.5 hours. Embedded dimension m is determined by ln C(r)-ln r relationship (Figure 10(a)) and m = 6 is selected. Further, Lyapunov exponent λ1 = 0.0042 is obtained which indicates the presence of chaos of the high-resolution DO system. NO2−-N dataset did not present a chaotic behavior after the same analysis.
Figure 10(d) shows the forecast results of the WMLE model of high-resolution DO time series. The MRE is 4.76% and ARE is 1.18%. It is a good enough result. A smooth and narrow diurnal fluctuation of DO accounts for this. Compared with the weekly DO forecast, short-term forecast succeeded better since the Lyapunov method is a local approximation approach.
Interestingly, the results also reveal that the embedded dimension m for weekly time series (Huaihe River case) is obviously larger than the high-resolution monitoring time series (Potomac River) while the characteristic of delay time τ is reversed. For delay time, it is reasonable for high-resolution observation to have a larger time lag. Embedded dimension m is a feature that denotes complexity. The larger m the more complex water quality dynamics. Here, the weekly monitoring data need a higher dimension to present attractors – that is, only in higher dimensional space can the chaos emerge.
The following conclusions can be drawn in this study: (1) Chaos phenomenon surely existed in river water quality dynamic systems, but not all the water quality indexes presented chaos behavior, possibly due to the data properties. Some stations in Huaihe River Basin and NO2− in the Potomac River did not present chaotic dynamics. Therefore, chaos behavior recognition of water quality dynamics should be first conducted before utilizing a chaos theory-based modeling approach. (2) A hybrid model wavelet–maximal Lyapunov exponent method for river water quality prediction was successfully established. The WMLE model is superior to traditional data-driven models such as ARMA and ANN, and outweighs the pure MLE model in this investigation. It also has good performance on weekly forecasting of DO and COD time series with average relative error 2.35% and 4.53%, respectively. NH3-N forecasting is challenged with the average relative error of 18.85%. (3) High-frequency DO time series can be more accurately forecasted than weekly ones with a relative average error of 1.18%. This indicates the developed hybrid model WMLE combined with sensors and abnormal detection algorithm can provide an eligible alternative for river water quality early warning. (4) Data-driven models are experiencing a renaissance. Further studies will be meaningful on the chaos characteristics of water quality dynamics based on large-scale datasets, and their fundamental mechanism of generic chaos dynamics can be discussed since chaos bridges the deterministic and stochastic.
This work was supported by Key-Area Research and Development Program of Guangdong Province (Grant No. 2019B110205005), National Natural Science Foundation of China (Grant No. 51979136), and Open Project of State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology (Grant No. ESK201901), Science, Technology and Innovation Commission of Shenzhen (JSGG20180508151852303). We are grateful to Dr Bin Shi from Research Center for Eco-Environmental Sciences, CAS for assistance in coding and data collection.
DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories (https://github.com/nantekoto/WQprediction).