Abstract
Machine learning (ML) has been increasingly adopted because of its ability to model complex, non-linear relationships between river water temperature (RWT) and its predictors (e.g., air temperature, AT). Most of these ML approaches have been applied using average AT without any detailed sensitivity analysis of other forms of AT (e.g., maximum and minimum). The present study demonstrates how new ML approaches, such as ridge regression (RR), K-nearest neighbors (KNN) regressor, random forest (RF) regressor, and support vector regression (SVR), can be coupled with Sobol’ global sensitivity analysis (GSA) to predict RWT accurately with the most appropriate form of AT. Furthermore, the proposed ML approaches have been combined with the Ensemble Kalman Filter (EnKF), a data assimilation (DA) technique, to improve the predicted values based on the measured data. The effectiveness of the proposed modelling framework is demonstrated with a tropical river system of India, the Tunga-Bhadra River, as a case study. The SVR was found to be the most robust ML model, and it predicted RWT better at a monthly time scale than at daily and seasonal time scales. The study demonstrates how ML methods can be coupled with a global sensitivity algorithm and DA techniques to generate accurate RWT predictions in river water quality modelling.
HIGHLIGHTS
Machine learning models coupled with global sensitivity analysis to predict RWT.
Ridge regression, KNN, random forest, SVR, along with Sobol’ sensitivity analysis were explored.
Maximum AT as the most sensitive variable in RWT prediction.
The SVR as the most robust ML model to predict RWT at monthly time scale.
Application on a tropical river system of India.
INTRODUCTION
The river water temperature (RWT) directly affects the river's physical, biological, and chemical characteristics and determines the fitness and life of all aquatic organisms. The RWT is of particular significance as (i) the discharge of excess heat from industries and municipal effluents can affect the aquatic ecosystem, (ii) temperature influences both biological and chemical reactions, and (iii) temperature fluctuations affect the density of water and hence the transport of water (Thomann & Mueller 1987). For many environmental, hydrological, and ecological applications, accurate prediction and assessment of RWT has become a key problem (Zhu et al. 2019b, 2019c). In this context, process-based RWT models have evolved based on heat advection-dispersion transport equations (Stefan & Sinokrot 1993) and net heat transfer processes at the surface based on thermal equilibrium concepts (Mohseni et al. 1999; Rehana & Mujumdar 2012). Although such process-based models give accurate results, they require a large amount of detailed data and are computationally intensive. Owing to their simplicity of implementation, regression models have been developed using the relationship between air and water temperatures (e.g., Stefan & Preud'homme 1993; Pilgrim et al. 1998; Erickson Troy & Stefan Heinz 2000; Neumann David et al. 2003; Rehana & Mujumdar 2011). Typical examples are linear regression models (Morrill et al. 2005; Krider et al. 2013), non-linear regression models (Mohseni et al. 1998; van Vliet et al. 2012), stochastic regression models (Ahmadi-Nedushan et al. 2007; Rabi et al. 2015), and hybrid statistical-physical based models (Gallice et al. 2015; Toffolon & Piccolroaz 2015; Piccolroaz et al. 2016), which have been developed successfully for data at different time scales in recent years.
Artificial neural networks (ANNs) have proven to be a promising mathematical tool for modelling non-linear relationships and have been widely applied in RWT prediction (Chenard & Caissie 2008; Sahoo et al. 2009; DeWeber & Wagner 2014; Hadzima-Nyarko et al. 2014; Piotrowski et al. 2015; Rabi et al. 2015; Temizyurek & Dadaser-Celik 2018; Zhu et al. 2018, 2019d, 2019e). In recent years, Zhu et al. (2018, 2019a, 2019b) and Graf et al. (2019) developed wavelet neural network (WT-ANN), decision tree (DT), feedforward neural network (FFNN), Gaussian process regression (GPR), and extreme learning machine (ELM) based models to estimate RWT, and these models were very effective compared with linear and non-linear models. However, support vector regression (SVR), which is based on structural risk minimization to avoid overfitting (Vapnik et al. 1996), has been adopted over ANN for RWT predictions because its solution is unique and global (Rasouli et al. 2012; Wang et al. 2013; Huang et al. 2017; Heddam & Kisi 2018; Komasi et al. 2018; Rehana 2019). Random forest (RF) models have been used extensively in hydrology (Balk & Elder 2000; Tehrany et al. 2013; Li et al. 2020), but only a few researchers have applied them to RWT modelling (Lu & Ma 2020). The K-nearest neighbors (KNN) approach has been used in many hydrology applications (Souza & Lall 2003; Beersma & Buishand 2004; Leander et al. 2005) and can be a suitable choice for RWT predictions (Muluye 2012; Antunes et al. 2018; Gavahi et al. 2019).
In this context, the robustness of any such data-driven ML algorithm depends on the feature vector (predictors) considered in the prediction of RWT. A few studies have tried to model RWT by considering multiple factors, such as river flow discharge (Webb et al. 2003; Laanaya et al. 2017), solar radiation (Sahoo et al. 2009), riparian shade (Johnson et al. 2014), landform attributes, and forested land cover (DeWeber & Wagner 2014). However, the use of air temperature (AT) as the sole variable in predicting RWT has gained much popularity in the research community because temperature variables are readily available (e.g., Caissie 2006; Rehana & Mujumdar 2011). To this end, many studies have used average AT as the most promising variable in RWT estimation using data-driven and hybrid algorithms because of the direct and linear relationship between average air and water temperatures (Piccolroaz et al. 2016; Rehana & Dhanya 2018; Zhu et al. 2018, 2019c; Graf et al. 2019; Rehana 2019). However, at maximum ATs, which prevail under seasonal temperature variations, the atmosphere's moisture-holding capacity increases and the rate of evaporative cooling also increases, so the RWT no longer increases linearly with average AT (Mohseni et al. 1998; Bogan et al. 2003). Therefore, a thorough sensitivity analysis must be performed to identify the most influential AT variable (average, maximum, or minimum) for predicting RWT before applying any data-driven algorithm. Because several studies have focused on average AT as the only variable to predict RWT using various ML algorithms, the selection of an appropriate AT variable (average, maximum, or minimum) has not been studied intensively in the literature. To the author's best knowledge, no study has applied sensitivity analysis to select the most suitable and effective AT variable among maximum, minimum, and average and then tested various ML models in the prediction of RWT.
The present study assessed the capability of ML models coupled with a global sensitivity analysis (GSA) to better predict RWT. The study used a variance-based GSA algorithm, the Sobol’ method (Sobol 1990; Sobol′ 2001), to identify the most influential AT variables in the prediction of RWT. Although the Sobol’ method has been used in many fields of science and engineering, its use in hydrology applications has been very limited (Tang et al. 2006; Cloke et al. 2008; Pappenberger et al. 2008; van Werkhoven et al. 2009; Cibin et al. 2010; Yang 2011). The present study uses the Sobol’ method to select highly sensitive features in RWT prediction.
One of the major limitations of ML algorithms includes the difficulty of incorporating existing physical knowledge (Boukabara et al. 2020). The most appropriate way forward is to combine the best of the two approaches: theory-driven and understanding-rich processes with data-driven discovery processes (Babovic 2005). Recent progress in ML inspires the idea of learning data assimilation (DA) models directly from the real observations – these are uncertain, sparsely sampled, and only indirectly sensitive to the processes of interest (Geer 2020). DA is a methodology that uses observational data and combines it with (or assimilates it into) numerical models (Babovic et al. 2005). The DA method can be categorized into four groups (WMO 1992; Babovic 2005): (i) updating input parameters, (ii) updating model parameters, (iii) updating state variables, and (iv) updating output variables. The fourth type updates output directly, and the possibility of forecasting these errors and superimposing them to the simulation model forecasts usually gives a good performance (Babovic et al. 2005).
DA has been used to enhance simulation accuracy in many engineering applications. One of the most efficient sequential DA methods is the Kalman filter (KF) developed by Kalman (1960), and it has been applied successfully in hydrology (Liu et al. 2010; Li et al. 2013; Wang & Babovic 2016; Wang et al. 2016, 2017; Mehrparvar & Asghari 2018). In RWT forecasting, only a few studies have addressed the use of DA (Morrison & Foreman 2005; Yearsley 2009; Pike et al. 2013; Ouellet-Proulx et al. 2017). Moreover, to the author's knowledge, few systematic DA methods combined with ML have been applied in the context of RWT forecasting. Hence, this study presents an attempt to use an Ensemble Kalman Filter (EnKF) DA method to update and balance the ML model estimates with available observed historical data in RWT forecasting. This paper proposes an integrated modelling framework that combines ML and DA approaches to improve the predicted values based on the measured data. The proposed algorithm has been demonstrated with daily temperature data from the river gauging station at Shimoga along the Tunga River, a tributary of the Tunga-Bhadra River, which is a major tributary of the Krishna River, India. In summary, the objectives of the present study are to (i) identify the most influential AT variable using the GSA algorithm; (ii) apply various ML models (ridge regression (RR), KNN regressor, RF regressor, and SVR) with the best selected AT for RWT prediction; (iii) apply the EnKF with each ML model; and (iv) compare the performance of the four ML algorithms coupled with the GSA and EnKF algorithms when applied to a tropical river system of India.
STUDY AREA AND DATA
The river location considered for the modelling of RWT is Shimoga along the Tunga River, which joins the Bhadra River to form the Tunga-Bhadra River, a major tributary of the Krishna River basin, India (Figure 1). A storage dam is situated about 15 km upstream from Shimoga at Gajanur across the river Tunga. The monthly mean discharge at the Shimoga station is about 166.95 m³/s. The means of the observed minimum, maximum, and average air (water) temperatures were 19.66, 29.74, and 24.78 °C (27.54 °C), and the standard deviations were 3.48, 3.47, and 2.77 °C (2.66 °C), respectively. A significant decrease in discharge of about 3.1% has been noted at Shimoga along the Tunga River between 1971–1991 and 1992–2006 (Rehana & Mujumdar 2011). The Tunga River location receives the waste load from the Shimoga city municipal effluent. The daily average RWT data and the average, maximum, and minimum AT data from 1 January 1989 to 1 January 2004 recorded at the Shimoga station were obtained from the Central Water Commission (CWC), Bangalore, Karnataka, India, and the Advanced Centre for Integrated Water Resources Management (ACIWRM), Karnataka, India. Water quality data, i.e., water temperature, are collected ten times a day, and the reported water temperature is the daily mean of the ten samples (Central Water Commission 2018). To create a complete time-series dataset, the na.interp() function in the R forecast package was used to interpolate missing time-series values (Hyndman et al. 2018). For seasonal data, na.interp() uses STL (Seasonal and Trend decomposition using Loess) for this interpolation.
METHODOLOGY
The overview of the proposed modelling framework is shown in Figures 2 and 3. The first step is to apply sensitivity analysis to select the most appropriate form of AT variable to predict the RWT. Various ML approaches such as RR, KNN, RF, and SVR were applied to the study location to predict RWT at a daily time scale. Figure 2 shows the architectural flow diagram proposed for the prediction of RWT using sensitivity and ML. Figure 3 shows the ML model and the EnKF DA method's architectural flow diagram to improve the ML model's efficiency in each simulation step.
Sensitivity analysis
Sensitivity analysis (SA), which is often used as a powerful technique to measure the strength of relationships between model inputs and outputs, is an important part of the assessment of any model, including environmental models (Nossent et al. 2011). SA is crucial in hydrologic and water quality models because of the various aspects involved in the modelling process, such as spatiotemporal scales and complexity, which require an assessment of the influence of parameters on the model's prediction (Yuan et al. 2015). In recent years, various SA methods for environmental models, based on variance decomposition, have become available in the literature (Saltelli et al. 2010; Yang 2011). The variance-based Sobol’ method is an SA method that is very common in many fields (Sobol 1990). In general, variance-based SA methods aim to measure the amount of variance that each parameter contributes to the unconditional variance of the model output; these amounts are expressed as (Sobol’) sensitivity indices (SIs).
Sobol’ SA method
To compute the variances to obtain the sensitivity measures, Sobol’ proposed a shortcut in the calculations, based on the assumption of mutually orthogonal summands in the decomposition. The shortcut is attained by transforming the double-loop integral of Equation (4) into an integral of the product of and . Because environmental models are mostly complex and non-linear, it is almost impossible to calculate the variances using analytical integrals. The SIs can be calculated by performing Monte-Carlo simulations.
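For orientation, the quantities that the Sobol’ method estimates can be written in their standard textbook form. The notation below (output Y, inputs X_i, D inputs, partial variances V_i, first-order index S_i, total index S_Ti) is a common convention assumed here for illustration and is not taken from the numbered equations of the study.

```latex
% Variance decomposition of the model output Y = f(X_1, ..., X_D)
V(Y) = \sum_{i} V_i + \sum_{i<j} V_{ij} + \dots + V_{1,2,\dots,D}

% First-order index: share of variance explained by X_i alone
S_i = \frac{V_i}{V(Y)}, \qquad
V_i = V_{X_i}\!\left( E_{X_{\sim i}}\!\left[ Y \mid X_i \right] \right)

% Total index: X_i together with all of its interactions
S_{Ti} = 1 - \frac{V_{X_{\sim i}}\!\left( E_{X_i}\!\left[ Y \mid X_{\sim i} \right] \right)}{V(Y)}
```

Here X_{\sim i} denotes all inputs except X_i. For a purely additive model the first-order indices sum to one, and S_Ti equals S_i.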
The evaluation of the SA
Because of its advantageous properties and the drawbacks of the qualitative results of the one-factor-at-a-time (OAT) sensitivity analysis approach (Yang 2011), the Sobol’ method was used in this study to identify the most sensitive parameters. The maximum, minimum, and average AT parameters were selected for the Sobol’ sensitivity analysis of the model. One thousand independent samples of the parameter sets were generated from the Sobol’ sequence using the SALib module (Herman & Usher 2017) to assess the second-order sensitivity indices and total sensitivity effects. For the second-order effect, the cross-sampling scheme of Saltelli (Saltelli et al. 2008) creates a total of N * (2D + 2) parameter sets, where D is the number of input parameters and N is the number of independent samples of the parameter sets. Since no prior knowledge is available on the parameters, the SA's input parameter values were sampled from a uniform distribution (Sobol 1990). The different parameter ranges were normalized to lie between 0 and 1. The mean response to 10% changes in the AT parameters was used to compare the shift in the mean response with the changes over the entire range of simulated river temperatures. For assessment and comparison purposes, sensitivity indices can be ranked into the four classes given in Table 1, as defined by Lenhart et al. (2002). Normalized SIs for the RWT model input parameters are listed in Table 4.
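The Monte-Carlo estimation of Sobol’ indices described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the SALib workflow used in the study: the surrogate `toy_rwt_model` and its weights are hypothetical, the inputs are drawn with a pseudo-random generator instead of a Sobol’ sequence, and only first-order and total indices are estimated, which needs N * (D + 2) model runs. The second-order design mentioned in the text needs N * (2D + 2) runs, i.e. 8000 for N = 1000 and D = 3.

```python
import random


def toy_rwt_model(x):
    """Hypothetical linear surrogate; x = (min AT, max AT, avg AT) scaled to [0, 1]."""
    t_min, t_max, t_avg = x
    return 0.3 * t_min + 0.8 * t_max + 0.1 * t_avg


def sobol_indices(model, dim, n, seed=0):
    """Estimate first-order and total Sobol' indices by Monte-Carlo sampling."""
    rng = random.Random(seed)
    # Two independent sample matrices A and B, uniform on [0, 1]
    a = [[rng.random() for _ in range(dim)] for _ in range(n)]
    b = [[rng.random() for _ in range(dim)] for _ in range(n)]
    ya = [model(row) for row in a]
    yb = [model(row) for row in b]

    all_y = ya + yb
    mean = sum(all_y) / len(all_y)
    var = sum((y - mean) ** 2 for y in all_y) / len(all_y)

    first, total = [], []
    for i in range(dim):
        # Matrix A with column i replaced by column i of B
        yab = [model(ra[:i] + [rb[i]] + ra[i + 1:]) for ra, rb in zip(a, b)]
        # First-order estimator (Saltelli et al. 2010); yb is centred
        # to reduce sampling noise, which leaves the expectation unchanged
        v_i = sum((fb - mean) * (fab - fa)
                  for fb, fab, fa in zip(yb, yab, ya)) / n
        # Total-effect estimator (Jansen form)
        vt_i = sum((fa - fab) ** 2 for fa, fab in zip(ya, yab)) / (2 * n)
        first.append(v_i / var)
        total.append(vt_i / var)
    return first, total


first, total = sobol_indices(toy_rwt_model, dim=3, n=2000)
for name, s1, st in zip(("min AT", "max AT", "avg AT"), first, total):
    print(f"{name}: S1 = {s1:.2f}, ST = {st:.2f}")
```

Because the surrogate is additive, the first-order and total indices of each input should roughly coincide; a gap between them would point to interaction effects.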
Index | Sensitivity |
---|---|
Small to negligible | |
Medium | |
High | |
Very high |
Performance rating | RSR | NSE |
---|---|---|
Very good | ||
Good | ||
Satisfactory | ||
Unsatisfactory |
Ridge regression
KNN regressor
In this study, the KNN model is developed on a daily scale to predict the RWT with minimum and maximum AT as predictor variables. The number of neighbors, the main tuning parameter, was set to five to fit the model.
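To make the KNN setting concrete, the sketch below predicts RWT as the plain average of the five nearest training samples in (minimum AT, maximum AT) space. It is an illustrative stand-in and not the study's implementation: the Euclidean distance, the uniform weighting, and the small training set are all assumptions made for the example.

```python
import math


def knn_predict(train_x, train_y, query, k=5):
    """Average the targets of the k training points closest to `query`."""
    # Pair each Euclidean distance with its target, then keep the k smallest
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Hypothetical samples: (minimum AT, maximum AT) in deg C -> RWT in deg C
train_x = [(18, 28), (19, 29), (20, 30), (21, 31), (22, 32), (23, 33)]
train_y = [25.0, 26.0, 27.0, 28.0, 29.0, 30.0]

print(knn_predict(train_x, train_y, query=(20, 30), k=5))
```

With so few neighbors the prediction is sensitive to local noise in the data, which is why the number of neighbors is usually treated as a tuning parameter.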
Support vector regression
Based on previous studies (Dibike et al. 2001; Rehana 2019), Radial Basis Function (RBF) was chosen as the kernel function to measure the performance of the model for the RWT. A detailed introduction to the SVR method may be found in Dibike et al. (2001).
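The RBF kernel named above measures the similarity of two input vectors as a function of their squared distance. The following sketch shows the standard form of the kernel only; the value of `gamma` is a hypothetical placeholder, since the tuned SVR hyperparameters are not stated in this section.

```python
import math


def rbf_kernel(x, z, gamma=0.1):
    """Standard RBF kernel: exp(-gamma * ||x - z||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Two hypothetical (minimum AT, maximum AT) inputs in deg C
print(rbf_kernel((20.0, 30.0), (21.0, 32.0)))
```

The kernel equals 1 for identical inputs and decays towards 0 as the inputs move apart; a larger `gamma` makes the decay faster and the fitted function more local.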
RF regressor
Ensemble Kalman filter
EnKF model development
In this study, the EnKF is implemented as a DA technique to improve the efficiency of the ML models in each simulation step. The ML model is first developed to predict or simulate RWT, and the EnKF is then used to update and optimize the ML model predictions. Figure 3 shows the ML and DA architectural flow diagram.
In Figure 3, Yp is the result of the ML model prediction, and YF is the blended data obtained by updating the ML model prediction with the RWT observations Ym using the EnKF technique. The steps of this model are as follows:
(1) The ML model is trained with the observed data at t − 1 to form the model.
(2) The subsequent observations are used to predict the RWT at t.
(3) This step updates the predicted data Yp with the available RWT measurements Ym using the EnKF technique, and then the updated data YF are used as inputs to update the ML model if the error is less than in the previous simulation step. The process then returns to step (1) for the next prediction until there are no new data.
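The update in the last step can be illustrated with a scalar EnKF analysis step using perturbed observations. This is a minimal sketch under stated assumptions rather than the study's implementation: the ensemble is generated by adding Gaussian noise to the ML prediction, the observation operator is the identity, and the error standard deviations `model_std` and `obs_std` are hypothetical values.

```python
import random


def enkf_update(forecast_ensemble, observation, obs_var, rng):
    """Scalar EnKF analysis step with perturbed observations (H = 1)."""
    n = len(forecast_ensemble)
    mean_f = sum(forecast_ensemble) / n
    var_f = sum((x - mean_f) ** 2 for x in forecast_ensemble) / (n - 1)
    gain = var_f / (var_f + obs_var)  # Kalman gain
    obs_std = obs_var ** 0.5
    # Each member is nudged towards its own noisy copy of the observation
    return [
        x + gain * (observation + rng.gauss(0.0, obs_std) - x)
        for x in forecast_ensemble
    ]


def blend(y_p, y_m, model_std=1.0, obs_std=0.3, n_members=100, seed=0):
    """Return the blended value Y_F from prediction Y_p and measurement Y_m."""
    rng = random.Random(seed)
    # Assumed ensemble: the ML prediction plus Gaussian model error
    ensemble = [y_p + rng.gauss(0.0, model_std) for _ in range(n_members)]
    analysis = enkf_update(ensemble, y_m, obs_std ** 2, rng)
    return sum(analysis) / n_members


y_f = blend(y_p=26.0, y_m=28.0)
print(f"blended RWT: {y_f:.2f} deg C")
```

The gain weights the two sources by their uncertainty: when the assumed model error is large relative to the observation error, the blended value lies close to the measurement, and vice versa.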
MODEL EVALUATION
RESULTS AND DISCUSSION
The data used in this paper consist of daily water temperature and corresponding daily minimum, maximum, and mean AT for the period from 1 January 1989 to 1 January 2004. The means of the observed minimum, maximum, and average air (water) temperatures were 19.66, 29.74, and 24.78 °C (27.54 °C), and the standard deviations were 3.48, 3.47, and 2.77 °C (2.66 °C), respectively. To study the statistical dependency between the air and water temperature variables, Spearman's correlation coefficients between RWT and the maximum, minimum, and average ATs were estimated for the period from 1 January 1989 to 1 January 2004. RWT is significantly correlated with the maximum, minimum, and average ATs (p-value < 0.001) (Table 3). Based on these statistical dependency measures, the maximum AT showed the strongest positive correlation with daily RWT for the case study.
Season | RWT – maximum AT | RWT – minimum AT | RWT – average AT |
---|---|---|---|
Monsoon (Jun–Sep) | 0.90 | 0.18 | 0.71 |
Post-monsoon (Oct–Nov) | 0.77 | 0.26 | 0.59 |
Winter (Dec–Feb) | 0.84 | 0.20 | 0.62 |
Summer (Mar–May) | 0.77 | 0.55 | 0.76 |
Annual | 0.84 | 0.31 | 0.70 |
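Spearman's coefficient, used for the values above, is the Pearson correlation of the ranks of the two series, so it captures any monotonic relationship and not only a linear one. The sketch below is a generic illustration with hypothetical temperature values and does not use the Shimoga data.

```python
def ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical daily values in deg C
max_at = [28.1, 30.4, 33.0, 31.2, 29.5]
rwt = [26.0, 27.1, 29.3, 28.0, 26.4]
print(f"Spearman rho = {spearman(max_at, rwt):.2f}")
```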
Furthermore, the SA (Table 4) shows that the maximum AT is highly sensitive, with a sensitivity index of 0.95 in the prediction of RWT, compared with the minimum and average ATs. The SA therefore also supports the use of maximum AT as the most important independent variable in the prediction of RWT. To show the variability of maximum AT with RWT, the daily data from 1 January 1989 to 1 January 2004 have been compared, as shown in Figure 4. Most of the earlier studies considered the average AT as the independent variable in RWT prediction. For example, Rehana & Mujumdar (2011) used the average AT, with discharge as another independent variable, to predict the RWT for the Tunga-Bhadra River at the Shimoga station and obtained a coefficient of determination (R2) of 0.53. As the main objective of the present study is to select an appropriate AT among average, maximum, and minimum to model RWT, river discharge has not been used in the RWT prediction.
Furthermore, the performance of RWT prediction with maximum AT and with average AT was compared using a linear regression model. The resulting R2 values in RWT prediction were 0.58 and 0.83 with the average and maximum ATs, respectively. This improvement is notable when set against the earlier study by Rehana & Mujumdar (2011), which used average AT as the predictor variable in RWT modelling.
To understand the variability of air and water temperature over the long term, the study estimated the linear trends of both variables (Figure 7(a) and 7(b)). As can be observed, the long-term maximum AT and the RWT varied during the period from 1989 to 2004 (Figure 5). The monthly seasonal dynamics of RWT and maximum AT based on 15-year averages at the Shimoga station (1989–2004) are presented in Figure 6. RWT and maximum AT show a strong seasonal pattern, with larger values in summer and lower values in winter. As shown in Figure 7, the long-term AT and the water temperature increased during the period 1989–2004 at the Shimoga station. AT increased by about 0.077 °C year⁻¹, while RWT increased by about 0.062 °C year⁻¹. Such increasing trends of RWT have been reported in many parts of the world. For example, the observed RWT has shown an increasing trend of about 0.029–0.046 °C year⁻¹ over China (Chen et al. 2016), about 0.009–0.077 °C year⁻¹ over the USA (Isaak et al. 2012; van Vliet et al. 2013; Rice & Jastram 2015), and 0.006–0.18 °C year⁻¹ over Europe (Albek & Albek 2009; Orr et al. 2015; Hardenbicker et al. 2017). From the plot, AT increased by 1.0 °C over the 15-year interval, while the water temperature increased by 0.8 °C. Such increasing air and water temperature trends agree with earlier research findings for the case study (Rehana & Mujumdar 2011). Furthermore, there is strong evidence of the impact of climate change on river water quality through the increase of RWTs and the decrease of stream flows for the river of interest (e.g., Rehana & Mujumdar 2012; Rehana & Dhanya 2018).
ML model performance
The next step in the prediction of RWT is to use an appropriate ML model that performs accurately in calibration and validation, judged against acceptable performance measures, as shown in Figure 2. To make better use of the data, assess the effectiveness of the model, and avoid overfitting, the cross-validation (CV) technique was applied. With time-series data, traditional CV (such as k-fold) cannot be used because adjacent data points are often highly dependent, which violates the independence that standard CV assumes. To overcome this issue, the time-series split CV technique was used in the present study (Pedregosa et al. 2011; Scavuzzo et al. 2018). This CV proceeds chronologically: it starts with a small subset of the data for training, predicts the data points that follow, and then checks the accuracy of those predictions. The same data points are then included in the next training dataset, and the subsequent data points are predicted. This CV procedure provides an almost unbiased estimate of the true error (Varma & Simon 2006). The error on each split is averaged to compute a robust estimate of model error, as shown in Figure 2. While fitting a model on a dataset, all possible combinations of parameter values are evaluated using the GridSearchCV module of the scikit-learn Python library (Pedregosa et al. 2011), and the best-performing combination is retained.
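The chronological splitting described above can be sketched as an expanding-window generator. The function below is a dependency-free illustration written for this purpose; it mirrors the general behaviour of a time-series split (each training set ends where its test block begins and grows with every split), and the function name and the example sizes are assumptions, not details from the study.

```python
def time_series_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) pairs in chronological order."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("too few samples for the requested number of splits")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        # Train on everything before the test block, never on later data
        yield list(range(start)), list(range(start, start + test_size))


# Example with 12 time steps and 3 splits
for train_idx, test_idx in time_series_splits(12, n_splits=3):
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

Averaging the validation error over these splits gives the model error estimate, and repeating the procedure for every candidate parameter combination reproduces the idea of a grid search over a time-ordered dataset.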
The results of the ML approaches (Ridge, KNN, RF, and SVR) for the prediction of RWT were evaluated using several goodness-of-fit statistics (MSE, MAE, RMSE, RSR, NSE, and R2) and graphical tools (seasonal plots and box plots). The experimental results showed a good trade-off between training and validation performance, confirming the stable generalization capacity of the ML approaches. The developed models were able to predict RWT successfully using AT as input. Figure 8 shows the box plot for observed and predicted RWT using the Ridge, KNN, RF, and SVR models. For the observed data, the minimum RWT is 21 °C and the maximum RWT is 31 °C, while the lower and upper quartiles range between 24 and 28 °C with a median RWT of 26 °C. According to Figure 8, all four models gave comparable predictions, differing by about 1 °C in the median, and there is a clear resemblance between the observed and predicted RWT. In addition, the lower and upper quartile ranges predicted by these models varied only marginally from those of the observed data.
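Several of the statistics listed above are simple functions of the residuals. The sketch below computes MSE, RMSE, MAE, NSE, and RSR using their usual definitions (RSR following Moriasi et al. 2007 as the RMSE divided by the standard deviation of the observations); R2 and KGE are left out for brevity, and the observed and predicted values are hypothetical.

```python
import math


def evaluate(observed, predicted):
    """Return common goodness-of-fit statistics for paired series."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # Sum of squared errors and sum of squared deviations from the mean
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    mse = sse / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": sum(abs(o - p) for o, p in zip(observed, predicted)) / n,
        "NSE": 1.0 - sse / sst,
        "RSR": math.sqrt(sse) / math.sqrt(sst),
    }


# Hypothetical observed and predicted RWT in deg C
scores = evaluate([24.0, 26.0, 28.0, 30.0], [25.0, 26.0, 27.0, 30.0])
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With these definitions RMSE is the square root of MSE and NSE equals 1 − RSR², which gives a quick consistency check on any reported set of statistics.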
The performance of the Ridge, KNN, RF, and SVR models for daily data at the Shimoga station is provided in Table 5 and Figure 9. The seasonal variations of predicted RWT are almost synchronous and comparable with the observed values (Figure 9), but the Ridge model performed poorly, overestimating values in the high water temperature period. The performance statistics (R2, KGE, MSE, RMSE, RSR, NSE, and MAE) can be found in Table 5. From Table 5, the SVR model (R2 = 0.84, KGE = 0.86, MSE = 0.99, RMSE = 0.99, RSR = 0.40, NSE = 0.84, and MAE = 0.77) performed slightly better than KNN (R2 = 0.82, KGE = 0.87, MSE = 1.11, RMSE = 1.05, RSR = 0.42, NSE = 0.82, and MAE = 0.84), RF (R2 = 0.83, KGE = 0.87, MSE = 1.05, RMSE = 1.03, RSR = 0.41, NSE = 0.83, and MAE = 0.81), and Ridge (R2 = 0.76, KGE = 0.87, MSE = 1.44, RMSE = 1.01, RSR = 0.31, NSE = 0.76, and MAE = 0.90) at the daily time scale. All ML approaches showed very good performance in terms of NSE (NSE > 0.75) and RSR (RSR < 0.50) (Moriasi et al. 2007; Table 2), with low values of MSE and RMSE. The relationship between daily RWT and maximum AT at the Shimoga station is relatively strong for all four models (R2 values). The RMSE values for the Shimoga station range from 0.99 to 1.05 for the four ML models (Table 5) for daily data, which are reasonable compared with Jackson et al. (2018) (1.57) and Sohrabi et al. (2017) (1.25), and far better than those of Temizyurek & Dadaser-Celik (2018) (2.10–2.64). Based on the RSR and NSE performance ratings (Moriasi et al. 2007; Table 2) and the performance measures in Table 5, the best performing model for RWT prediction at the daily time scale was the SVR (NSE = 0.84; KGE = 0.86; R2 = 0.84; RSR < 0.50). The superiority of SVR in the prediction of RWT revealed in the present study agrees with the study of Rehana (2019) for the same case study.
However, the study by Rehana (2019) used the average AT as the independent variable without testing for the most influential AT variable in the prediction of RWT, as demonstrated in the present study. Furthermore, the model performance improved when the SVR used maximum AT as the independent variable (NSE: 0.84 and RMSE: 0.99) compared with average AT (NSE: 0.61 and RMSE: 1.69) (Rehana 2019) for the same case study at the daily time scale.
Input parameter | Sensitivity indices |
---|---|
Minimum air temperature | 0.05 |
Maximum air temperature | 0.95 |
Average air temperature | 0.00 |
Data | Model | R2 | KGE | MSE | RMSE | RSR | NSE | MAE |
---|---|---|---|---|---|---|---|---|
Daily | Ridge | 0.76 | 0.87 | 1.44 | 1.01 | 0.31 | 0.76 | 0.90 |
Daily | KNN | 0.82 | 0.87 | 1.11 | 1.05 | 0.42 | 0.82 | 0.84 |
Daily | RF | 0.83 | 0.87 | 1.05 | 1.03 | 0.41 | 0.83 | 0.81 |
Daily | SVR | 0.84 | 0.86 | 0.99 | 0.99 | 0.40 | 0.84 | 0.77 |
Monthly | Ridge | 0.79 | 0.87 | 1.02 | 1.00 | 0.35 | 0.79 | 0.74 |
Monthly | KNN | 0.85 | 0.85 | 0.87 | 0.93 | 0.38 | 0.84 | 0.74 |
Monthly | RF | 0.87 | 0.94 | 0.71 | 0.84 | 0.39 | 0.87 | 0.67 |
Monthly | SVR | 0.88 | 0.88 | 0.61 | 0.78 | 0.39 | 0.88 | 0.57 |
Season (Jan–Apr) | Ridge | 0.64 | 0.72 | 1.93 | 1.38 | 0.30 | 0.64 | 1.06 |
Season (Jan–Apr) | KNN | 0.76 | 0.90 | 1.42 | 1.19 | 0.35 | 0.76 | 0.97 |
Season (Jan–Apr) | RF | 0.80 | 0.89 | 1.15 | 1.07 | 0.36 | 0.80 | 0.86 |
Season (Jan–Apr) | SVR | 0.82 | 0.92 | 1.00 | 1.00 | 0.36 | 0.82 | 0.80 |
Season (May–Aug) | Ridge | 0.84 | 0.88 | 1.42 | 1.19 | 0.27 | 0.84 | 0.88 |
Season (May–Aug) | KNN | 0.86 | 0.89 | 1.30 | 1.14 | 0.28 | 0.85 | 0.86 |
Season (May–Aug) | RF | 0.87 | 0.86 | 1.17 | 1.08 | 0.28 | 0.87 | 0.82 |
Season (May–Aug) | SVR | 0.87 | 0.95 | 1.18 | 1.08 | 0.28 | 0.86 | 0.76 |
Season (Sep–Dec) | Ridge | 0.52 | 0.86 | 0.71 | 0.84 | 0.56 | 0.52 | 0.68 |
Season (Sep–Dec) | KNN | 0.50 | 0.70 | 0.77 | 0.88 | 0.53 | 0.49 | 0.69 |
Season (Sep–Dec) | RF | 0.53 | 0.72 | 0.73 | 0.85 | 0.53 | 0.52 | 0.68 |
Season (Sep–Dec) | SVR | 0.61 | 0.74 | 0.61 | 0.78 | 0.58 | 0.60 | 0.60 |
A summary of the Ridge, KNN, RF, and SVR model performances for monthly data is given in Table 5 and Figure 10. The seasonal variations of predicted RWT are almost synchronous and comparable with the observed values (Figure 10), but the Ridge model performed poorly, overestimating values in the high water temperature period; the performance statistics are given in Table 5. Among the four ML models, the SVR (R2 = 0.88, KGE = 0.88, MSE = 0.61, RMSE = 0.78, RSR = 0.39, NSE = 0.88, and MAE = 0.57) performed slightly better than KNN (R2 = 0.85, KGE = 0.85, MSE = 0.87, RMSE = 0.93, RSR = 0.38, NSE = 0.84, and MAE = 0.74), RF (R2 = 0.87, KGE = 0.94, MSE = 0.71, RMSE = 0.84, RSR = 0.39, NSE = 0.87, and MAE = 0.67), and Ridge (R2 = 0.79, KGE = 0.87, MSE = 1.02, RMSE = 1.00, RSR = 0.35, NSE = 0.79, and MAE = 0.74) at the monthly time scale. The performance at the monthly time scale improved over the daily time scale in terms of higher R2 and NSE and lower RMSE and MAE values (Table 5). The accuracy of the ML models thus increased with monthly data compared with daily data, with SVR (RSR = 0.39; NSE = 0.88), RF (RSR = 0.39; NSE = 0.87), KNN (RSR = 0.38; NSE = 0.84), and Ridge (RSR = 0.35; NSE = 0.79) all showing very good performance based on the RSR and NSE performance ratings (Moriasi et al. 2007; Table 2).
The performance of the Ridge, KNN, RF, and SVR models for seasonal data (Jan–Apr, May–Aug, and Sep–Dec) (Laizé et al. 2017; Zhu et al. 2019c) is shown in Figure 11. The seasonal variations of predicted RWT are almost in agreement with the observed values (Figure 11), but the Ridge model performed poorly, overestimating values in high water temperature periods; the performance statistics are given in Table 5. From Table 5, the SVR model performed slightly better than KNN, RF, and Ridge in all three seasons (Jan–Apr, May–Aug, and Sep–Dec). The NSE and RSR values were poorer for the Sep–Dec season than for the other two seasons and for the daily and monthly time scales. Table 5 shows that the four models constructed in this paper can learn the RWT variation rules from the historical data and reproduce the seasonal dynamics of RWT. This case study demonstrates that integrating scientific knowledge into ML tools promises to improve the prediction of many important environmental variables.
ML-EnKF model performance
In the next step of RWT prediction, the EnKF DA technique is implemented to improve the ML models at each simulation step. Table 6 shows the results of the ML-EnKF model at different simulation steps with the assimilated data. With the blended data, the results improve from simulation-1 (1 January 2001 to 1 January 2002) to simulation-2 (1 January 2002 to 1 January 2003), most clearly for SVR (NSE from 0.885 to 0.908; RSR from 0.338 to 0.303) and Ridge (NSE from 0.829 to 0.867), whereas the NSE of KNN and RF remains nearly unchanged. These results indicate that the ML-EnKF model predicts RWT better with assimilated data than the direct ML models do. Table 6 suggests that, as the simulation steps continue, the ML-EnKF model and its simulation results improve further. As stated in the Introduction, the ML-EnKF model is designed to improve ML model performance by combining ML models with a DA approach that updates the predicted values based on the measured data.
Data | Model | R2 | KGE | MSE | RMSE | RSR | NSE | MAE |
---|---|---|---|---|---|---|---|---|
Simulation-1 (1 Jan 2001 to 1 Jan 2002) | Ridge | 0.829 | 0.807 | 0.829 | 0.910 | 0.413 | 0.829 | 0.759 |
Simulation-1 (1 Jan 2001 to 1 Jan 2002) | KNN | 0.855 | 0.925 | 0.699 | 0.836 | 0.379 | 0.855 | 0.667 |
Simulation-1 (1 Jan 2001 to 1 Jan 2002) | RF | 0.860 | 0.934 | 0.676 | 0.822 | 0.373 | 0.860 | 0.656 |
Simulation-1 (1 Jan 2001 to 1 Jan 2002) | SVR | 0.886 | 0.915 | 0.555 | 0.745 | 0.338 | 0.885 | 0.593 |
Simulation-2 (1 Jan 2002 to 1 Jan 2003) | Ridge | 0.867 | 0.843 | 0.841 | 0.917 | 0.363 | 0.867 | 0.710 |
Simulation-2 (1 Jan 2002 to 1 Jan 2003) | KNN | 0.855 | 0.883 | 0.921 | 0.959 | 0.379 | 0.856 | 0.764 |
Simulation-2 (1 Jan 2002 to 1 Jan 2003) | RF | 0.865 | 0.880 | 0.898 | 0.947 | 0.375 | 0.859 | 0.741 |
Simulation-2 (1 Jan 2002 to 1 Jan 2003) | SVR | 0.911 | 0.921 | 0.564 | 0.741 | 0.303 | 0.908 | 0.573 |
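To make the assimilation step concrete, the following is a minimal sketch of a single EnKF analysis update for a scalar RWT state that is observed directly. It is a generic stochastic (perturbed-observation) EnKF update, not the authors' implementation. The function name `enkf_update`, the ensemble size, the spread of the ML prediction, the observation error, and all numerical values are assumptions chosen for illustration. Retraining the ML model on the assimilated data, as described in the text, is not shown.

```python
import numpy as np

def enkf_update(forecast_ens, obs, obs_std, rng):
    """One stochastic EnKF analysis step for a scalar state observed directly (H = 1)."""
    forecast_ens = np.asarray(forecast_ens, dtype=float)
    n = forecast_ens.size

    # Forecast error variance estimated from the ensemble spread.
    p_f = np.var(forecast_ens, ddof=1)
    # Observation error variance.
    r = obs_std ** 2

    # Kalman gain for a scalar state with H = 1.
    gain = p_f / (p_f + r)

    # Each member is updated against its own perturbed copy of the observation.
    obs_pert = obs + rng.normal(0.0, obs_std, size=n)
    return forecast_ens + gain * (obs_pert - forecast_ens)

rng = np.random.default_rng(0)

# Hypothetical ML-predicted RWT (deg C) with an assumed spread of 1 deg C.
ml_prediction = 27.5
ensemble = ml_prediction + rng.normal(0.0, 1.0, size=200)

# Hypothetical measured RWT (deg C) with an assumed error of 0.5 deg C.
analysis = enkf_update(ensemble, obs=26.0, obs_std=0.5, rng=rng)
```

The analysis ensemble mean lies between the ML forecast and the measurement, weighted by their respective uncertainties, and its spread is smaller than that of the forecast ensemble.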
CONCLUSIONS
ML techniques represent a potentially disruptive force for many scientific disciplines. The purpose of this study was to assess the performance of a suite of ML models for RWT prediction for the Tunga-Bhadra River, India, with the aid of the minimum and maximum AT at daily, monthly, and seasonal time scales. In this study, an attempt has been made to identify the most sensitive AT variable (average, maximum, and minimum) using the Sobol’ sensitivity analysis method, which can serve as an input variable in the prediction of RWT. The results indicated that the maximum AT was the most important variable in the prediction of RWT for the river location of interest. In general, it can be concluded that the Sobol’ sensitivity analysis can be successfully applied for input variable fixing and prioritization of any RWT model. Therefore, the Sobol’ sensitivity analysis method can be considered as a robust and powerful sensitivity analysis method for RWT prediction modelling.
Furthermore, each model's configurable variables were optimized, and the performances of the ML models were analysed to test the applicability of data-driven models to the RWT under investigation. The study revealed that ML model performance coefficients improved with monthly data compared with the daily time scale, whereas the seasonal time scale models performed poorly compared with both daily and monthly data. Overall, the monthly time scale ML models performed better than the daily and seasonal models for the study location. The SVR was noted as the most robust ML model to predict RWT. Furthermore, coupling the EnKF DA algorithm with the ML approaches improves the predicted values based on the measured data: updating the predictions with the observed data through the ML-EnKF model gives better results. Generally, assimilation is considered only to bring model predictions closer to the observations rather than to improve the model structure. Here, because the updated data are used to train the ML model for the next prediction, assimilation does enhance the model and makes it more practical for hydrologic applications. As the simulation steps continue, the ML-EnKF model and the simulation results improve further.
This case study demonstrated how a data-driven modelling framework could be scaled up and used for the prediction of RWT. DA methods can also be combined with ML models to improve the predicted values based on the measured data. Overall, the data-driven modelling framework presented in the study indicated that all ML models were effective in RWT prediction. This case study demonstrates that integrating scientific knowledge into ML tools can improve predictions of many important environmental variables, and it shows the applicability of data-driven models in the water sector. In addition, the ML model architectures and parameter-setting procedures demonstrated in the present study can be valuable for river water quality management problems.
Despite the robustness of the modelling framework presented in the study, it has some caveats. One of the major limitations is that the data cover only the period from 1989 to 2004, which is the only long period of data available along the river stretch with minimal missing and erroneous data. The proposed framework can always be implemented with newly updated data, as demonstrated in the present study, and can be extended to other stations based on data availability. RWT prediction models should consider the spatial dependency of air and water temperature variables when the framework is implemented with multiple stations along a river stretch. Furthermore, the study demonstrated how to identify the most sensitive variable in predicting RWT among various AT variables (average, maximum, and minimum). However, several other variables have a direct impact on RWT, such as streamflow (Isaak et al. 2010; Toffolon & Piccolroaz 2015; Sohrabi et al. 2017) and river geometry (Gu & Li 2002), and these need to be considered in the sensitivity analysis and consequently in the ML algorithms. Further research into robust and hybrid approaches to RWT modelling is required, as accurate simulation of RWT plays an important role in water resources management.
ACKNOWLEDGEMENT
We thank the Advanced Centre for Integrated Water Resources Management (ACIWRM), Bengaluru, Karnataka, India, for providing the water quality data. The research work is supported by the Science and Engineering Research Board (SERB), Department of Science and Technology, Government of India, through Core Research Grant Project No. CRG/2020/002028 to Dr. S. Rehana.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.