Abstract
This paper evaluates two machine learning (ML) algorithms, namely the convolutional neural network (CNN) and long short-term memory (LSTM) deep learning algorithms, for predicting the hydrological regime of the 3S River Basin under various climate change scenarios. The climate models CMCC-CMS, HadGEM-AO2, and MIROC5 were used to project future climate and streamflow for three future periods: near-future (2020–2050), mid-future (2050–2080), and far-future (2080–2100), under two Representative Concentration Pathways (RCPs), 4.5 and 8.5. The projections show mean annual temperature changing by 0.08 to 4.3 °C under CMCC-CMS, by 0.13 to 4.4 °C under HadGEM-AO2, and by −0.07 to 4.2 °C under MIROC5. Similarly, annual precipitation is projected to change by 13.3 to 62.5% under CMCC-CMS, by −12.4 to 26.1% under HadGEM-AO2, and by 6.9 to 49% under MIROC5. Within the 3S River Basin, streamflow is projected to increase in the Srepok and Sesan Rivers and to decrease in the Sekong. The ML models predict increasing flood risk in the Sekong and Sesan catchments, indicated by a rising Q5 index, but a decrease in the Srepok.
HIGHLIGHTS
Machine learning (ML) models were used to predict the hydrological changes in the 3S River Basin.
The 3S River Basin is expected to become warmer and to experience fluctuating rainfall patterns in the future.
The Srepok and Sesan Rivers are expected to have an increasing trend of streamflow while the Sekong is projected to have reduced streamflow.
ML models predicted the increasing flood risk in the Sekong and Sesan catchments.
INTRODUCTION
The significance of precise streamflow modeling cannot be overstated in hydrological modeling, as it holds crucial implications for water resource planning, including dam construction, water resource allocation, catchment area management, and flood control. However, the inherent complexity of the hydrological system, marked by dynamic variables, inconsistent data quality, and a web of sensitive parameters in non-linear relationships, poses a substantial challenge (Akhtar et al. 2009; Bourdin et al. 2012; Cui & Singh 2015; Zaghloul et al. 2022). To the best of our knowledge, there is not yet a single, universally superior model that performs optimally under all conditions and catchment characteristics. Current research emphasizes the development of robust and flexible models that can yield improved performance based on historical data (Mohammadi 2021).
Despite these challenges, the research community continues to explore solutions. Machine learning (ML) approaches have shown promise for simulating complex hydrological processes (Adnan et al. 2021). ML algorithms have been widely used in health care (Tran et al. 2021; Widiputra 2021), 3D modeling analysis (Pepe et al. 2021; Khayyal et al. 2022), streamflow prediction (Ghimire et al. 2021; Lin et al. 2021; Singh et al. 2023), urban flood susceptibility prediction (Ekwueme 2022), weather forecasting (Chen et al. 2022), and climate change projection (Obahoundje et al. 2022; Prodhan et al. 2022; Nguyen et al. 2023). In streamflow modeling, ML algorithms avoid explicitly representing complex physical processes and simulate rainfall-runoff with fewer variables. These data-driven models mimic hydrological processes by assimilating measured inputs (Chang & Chang 2006; Mosavi et al. 2018; Liu et al. 2020). Convolutional neural networks (CNNs) and long short-term memory (LSTM) networks are preferred for streamflow prediction due to their ability to handle time series data and express non-linear relationships (Chang & Chang 2006; Bourdin et al. 2012; Nguyen et al. 2015; Abadi et al. 2016; Assem et al. 2017; Nguyen 2022). These models capture the non-linearity of streamflow based solely on historical data, and their user-friendly concept makes them popular and quick to develop (Mosavi et al. 2018). ML and artificial intelligence (AI) have been applied to learn the non-linearity of streamflow from historical time series and to predict future streamflow, with studies demonstrating the advantage of CNN and LSTM architectures, among others (Nguyen et al. 2015; Assem et al. 2017; Duan et al. 2020; Liu et al. 2020; Ghimire et al. 2021). In addition, Jabbari & Bae (2018) and He et al. (2021) demonstrated the capability of ML techniques in capturing the complex relationships between different elements of hydrometeorological systems. Duan et al. (2020) used a CNN to simulate precipitation, temperature, solar radiation, and streamflow.
Although ML and AI are recognized as advanced technologies in streamflow forecasting, most research has focused on short-term predictions, such as 6 h (Hu et al. 2018), 1 day (Duan et al. 2020; Le et al. 2021), 5 days (Nguyen et al. 2015), up to 9 months (Ghimire et al. 2021), and 2 years (Liu et al. 2020). Long-term prediction applications remain relatively unexplored, presenting a considerable research gap (Assem et al. 2017). In addition, as alluded to previously, streamflow modeling significantly varies across different basins due to their unique characteristics. These distinguishing features include topography, regional climate, and variations in land use and land cover. Consequently, it becomes imperative to develop and apply customized streamflow models tailored to the specific conditions of each basin.
Locally, the 3S River Basin holds a prominent position in hydrological research due to its significant contribution to the Lower Mekong River flow and the reservoir development initiatives it supports (Ngo et al. 2018). Numerous studies have been conducted in this area, focusing on assessing and projecting hydrological changes within the region. For instance, Trang et al. (2017) utilized the Soil and Water Assessment Tool (SWAT) to investigate the climate change impacts on hydrology in the 3S River Basin across five different general circulation models (GCMs). Similarly, Pradhan et al. (2022) employed SWAT and CROPWAT software to evaluate the influence of human activities and climate change on water resources in the Srepok sub-basin. Ty et al. (2012) implemented the hydrological model (HEC-HMS) to project future hydrological variables under the ECHAM4 GCM.
Despite the extensive research conducted in the area, there is a noticeable lack of studies applying ML to streamflow modeling in this basin. Given that ML technologies have been successfully applied in various other hydrological contexts, this absence represents a significant research gap. Thus, creating novel ML prediction models to capture hydrological variations under different scenarios is an essential advancement. Such models can provide valuable insights to decision-makers, enhancing their capacity to make more informed, future-proof choices for water resources management (Evers & Pathirana 2018; Liu et al. 2020).
This study aims to (i) create climate change scenarios for the 3S River Basin based on GCMs; (ii) develop ML models (CNN and LSTM) to predict daily streamflow using future rainfall and temperature; and (iii) analyze the future streamflow in the 3S (Sekong, Sesan, and Srepok) River Basin under climate change scenarios using these ML models. An ensemble of three GCMs is used for climate change projection over three future periods: near-future (2020–2050), mid-future (2050–2080), and far-future (2080–2100), each compared to the baseline period (1980–2005).
STUDY AREA
The climate of the 3S River Basin is dictated by its topography and the seasonal monsoon, with a strong orographic effect due to the Annamite Mountains along the eastern boundary. Owing to the elevation difference, ranging from 43 to 2,409 m above mean sea level (masl), the average annual precipitation within the basin varies from 2,800 mm in the upper Sekong to around 2,500 mm in the lowland of the Srepok River Basin. Similarly, the average daily temperature in the basin ranges from 7.8 to 32.8 °C, with the Sekong being slightly cooler than the Sesan and Srepok. Streamflow in the 3S River Basin is highly seasonal unless controlled by upstream reservoir operation. The observed discharge at three selected hydrological stations, Attapeu, Kon Tum, and Bandon, indicates the typical flow regime of the Sekong, Sesan, and Srepok, respectively. The average annual flow at these stations is 425.6, 95.8, and 277.9 m³/s, respectively, with the Sekong being much larger, followed by the Srepok and the Sesan. The peak flow in all three tributaries occurs in August, September, and October.
DATA ACQUISITION
This study used 21 years (1985–2005) of observed daily mean temperature and precipitation data for climate change projection. We also used 21 years (1985–2005) of observed daily discharge data from three selected stations: Attapeu, Kon Tum, and Bandon, as shown in Figure 1. All the observed hydrometeorological data were acquired from the Mekong River Commission (MRC). To improve the quality of the data used for training and testing (validation), all the data were cleaned by removing or filling in missing values, errors, and duplicates. The mean of nearby stations was used to fill in the missing data if the missing data exceeded 10% of the total sample (Yang et al. 2017). For climate change projection, three GCMs from the Coupled Model Intercomparison Project Phase 5 (CMIP5) were selected (Table 1): the Centro Euro-Mediterraneo sui Cambiamenti Climatici Climate Model (CMCC-CMS), the Model for Interdisciplinary Research on Climate (MIROC5), and the Hadley Centre Global Environmental Model (HadGEM-AO2). They were chosen for their ability to simulate the most realistic onset timing, as suggested in the literature (Hasson et al. 2016; Ruan et al. 2018; Ruan et al. 2019). For each GCM, two RCPs, namely RCP4.5 and RCP8.5, were chosen, corresponding to medium (4.5 W/m²) and high (8.5 W/m²) radiative forcing scenarios.
S.No. | Data | Spatio-temporal resolution | Duration | Source/Developer |
---|---|---|---|---|
Hydrometeorological data | ||||
1 | Temperature | Point/Daily | 1980–2005 | MRC |
2 | Rainfall | Point/Daily | 1981–2005 | |
3 | Discharge | Point/Daily | 1985–2005 | |
GCMs' data (Historical and Future: RCP4.5 and 8.5) | ||||
1 | CMCC-CMS | 0.5°/Daily | 1980–2100 | CMCC-CM |
2 | MIROC5 | 1.41°/Daily | 1980–2100 | CCSR, NIES, JAMSTEC-FRCGC |
3 | HadGEM2-AO | 1.9° × 1.25°/Daily | 1980–2100 | Hadley Center, UKMO |
Note: MRC, Mekong River Commission, Lao-PDR; CMCC-CM, Centro Euro-Mediterraneo sui Cambiamenti Climatici Climate Model, Italy; CCSR, Center for Climate System Research, University of Tokyo; NIES, National Institute for Environmental Studies, Japan; JAMSTEC-FRCGC, Japan Agency for Marine-Earth Science and Technology Frontier Research Center for Global Change, Japan.
Beyond temperature and rainfall, factors such as human activities and land use and land cover (LULC) play significant roles in climate change processes. Human activities, including urbanization, deforestation, and varied agricultural practices, can profoundly influence the greenhouse gas balance and the albedo effect, directly impacting global and regional climate patterns (Pradhan et al. 2022). Moreover, LULC changes can affect local microclimates and contribute to or mitigate broader climate change, depending on the specifics of those changes (Ghaderpour et al. 2023). However, due to the complexities and difficulties associated with data collection and interpretation, these critical elements are frequently not adequately accounted for in current climate models. Future research endeavors should include these variables for more accurate and comprehensive climate change assessments.
METHODOLOGY
Climate change projection
ML models
Convolutional neural network (CNN)
Long short-term memory (LSTM) networks
In the LSTM model, the input gate It combines Xt and Ht−1 and passes them through the sigmoid function, as shown in Equation (8). Then, an activation function such as the hyperbolic tangent (tanh) function (Equation (9)) is used to create a candidate vector, which is added to the memory to update the older memory state Ct−1, as shown in Equation (10). Next, the sigmoid layer decides which part of the memory state will be output (Equation (11)). Finally, the memory unit calculates the final output using Equation (12), which is a filtered version of the cell state. In summary, the weight matrices W(f, i, c, o) and bias vectors b(f, i, c, o) in Equations (7)–(12) are updated iteratively in the LSTM network through the Backpropagation Through Time (BPTT) algorithm. Because the network actively chooses useful information to store and rejects uninformative information, LSTM offers a better solution to the exploding and vanishing gradient problems that standard recurrent neural networks (RNNs) face.
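The gate sequence described above can be sketched as a single LSTM time step. This is a minimal illustration of the textbook LSTM formulation, assumed here to correspond to Equations (7)–(12), with the forget gate assumed to be Equation (7). The function names, the `params` layout, and the use of plain Python lists are illustrative choices, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function used by the forget, input, and output gates."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, v, b):
    """Return W @ v + b for a list-of-lists matrix W and list vectors v, b."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params maps each of "f", "i", "c", "o" to a (W, b) pair acting on the
    concatenation [h_prev, x_t] (hypothetical layout chosen for this sketch).
    Returns the new hidden state h_t and cell state c_t.
    """
    z = h_prev + x_t  # list concatenation [H_{t-1}, X_t]

    def gate(name, activation):
        W, b = params[name]
        return [activation(a) for a in affine(W, z, b)]

    f_t = gate("f", sigmoid)      # forget gate
    i_t = gate("i", sigmoid)      # input gate
    c_hat = gate("c", math.tanh)  # candidate memory
    # New cell state: keep part of the old memory, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = gate("o", sigmoid)      # output gate
    # Hidden state: filtered view of the cell state.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every sigmoid gate evaluates to 0.5 and the candidate to 0, so the cell state is simply halved at each step, which makes a convenient sanity check.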
Model development
This study used TensorFlow, a deep learning library in Python, to develop both the CNN and LSTM models. The performance of these models depends on the number of layers, the number of nodes in each layer, input size, normalization method, optimizer, batch size, learning rate, and number of epochs chosen during model development. These factors were defined here based on experimentation and the available computational resources. First, the entire observed dataset was divided into two sets, training (1985–2000) and testing (2000–2006), at a ratio of 80–20%. Temperature and precipitation were fed into the model to simulate the streamflow.
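The 80–20% division can be sketched as a chronological split that keeps the earlier records for training and the later ones for testing, without shuffling. The helper name `chronological_split` and the rounding of the cut point are assumptions made for illustration.

```python
def chronological_split(records, train_fraction=0.8):
    """Split a time-ordered sequence into (train, test) without shuffling.

    The first `train_fraction` of the records is used for training and the
    remainder for testing, so the test period always follows the training
    period in time.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]
```

Keeping the split chronological avoids leaking information from the test period into training, which matters for time series such as daily streamflow.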
Training any neural network involves two critical processes: forward and backward propagation. Forward propagation receives the input data, processes the information, and generates output. During this phase, weights, biases, and filters are randomly initialized and treated as parameters by the convolutional neural network algorithm. During backward propagation, the model calculates the error and updates the parameters using the gradient descent technique. The predicted discharge is compared with the observed discharge to measure the goodness of fit (loss). A good model is expected to have a minimal loss during both training and testing. After the training process, a set of weights and biases is obtained for all the layers, which is later used for further testing and for simulating future flow regimes.
The CNN model architecture in this study consists of two 2D convolutional layers, one flatten layer, and three fully connected (dense) layers. In the 2D convolutional layers, we use eight filters with a kernel size of 10 × 2 in the first layer and four filters with a kernel size of 5 × 2 in the second layer. Next, the flatten layer transforms the 2D output of the second convolutional layer into a single-column vector, which is then fed to the dense layers for processing. We use three dense layers with 60, 30, and 1 nodes (neurons), respectively. In each layer, batch normalization was used to reduce data redundancy and eliminate undesirable characteristics. Another vital component of building a CNN model is choosing an activation function, which influences how the weights and biases are adjusted during training. Here, we chose the leaky rectified linear unit (Leaky ReLU) (Maas et al. 2013; Goodfellow et al. 2016) as the activation function over the standard rectified linear unit (ReLU) because of its better gradient propagation and its ability to mitigate the vanishing gradient problem of the standard ReLU. Leaky ReLU is a faster-learning activation function and offers better performance and generalization in deep learning than the sigmoid and tanh functions (LeCun et al. 2015; Goodfellow et al. 2016; Sharma 2017). To ensure the model can converge without overfitting, we chose a learning rate of 10⁻⁵ and momentum = 0.9. We also use the mean absolute error (MAE) to measure the loss at each epoch.
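The difference between ReLU and Leaky ReLU, and the MAE loss, can be illustrated with a few lines of plain Python. The negative slope `alpha=0.01` is an assumed value for this sketch, since the study does not report the slope it used.

```python
def relu(x):
    """Standard ReLU: passes positive inputs, zeroes out negative ones."""
    return x if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: like ReLU, but negative inputs keep a small slope alpha."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU; never exactly zero, unlike ReLU for x < 0."""
    return 1.0 if x > 0 else alpha


def mean_absolute_error(observed, predicted):
    """MAE between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
```

Because the gradient for negative inputs is `alpha` rather than zero, units that receive negative pre-activations can still be updated during backpropagation.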
Other components affecting the performance of the CNN model include the optimizer, batch size, and number of epochs. The choice of optimizer determines how effectively the loss is reduced and therefore how accurate the outcomes are. Popular optimizers include gradient descent (GD), stochastic gradient descent (SGD), Nesterov Accelerated Gradient (NAG), Adaptive Gradient (AdaGrad), and Adam. Among these, Adam helps the model converge faster and generalize better to the test data (Kingma & Ba 2014; Dogo et al. 2018; Llugsi et al. 2021). Batch size is another hyperparameter to tune in modern deep learning systems. A larger batch size allows a computational speed-up from parallel processing on graphical processing units (GPUs); however, it can lead to poor generalization. On the other hand, a smaller batch size considerably slows down training. Based on our available computational resources, the batch size was set to 28 for both the CNN and LSTM models. The number of epochs is the number of times the model runs through the whole dataset; more epochs usually lead to more accurate results but slow down the training process. After observing the model losses, we set the number of epochs to 500 for both the CNN and LSTM architectures.
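A single Adam update for one scalar parameter can be sketched as follows, assuming Adam is the optimizer adopted, as the discussion above suggests. The learning rate of 10⁻⁵ comes from the text; reading "momentum = 0.9" as Adam's first-moment decay `beta1` is an assumption, and `beta2` and `eps` are the defaults proposed by Kingma & Ba (2014) rather than values reported in this study.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter.

    theta : current parameter value
    grad  : gradient of the loss with respect to theta
    m, v  : running first and second moment estimates (start at 0.0)
    t     : 1-based step counter, used for bias correction
    Returns the updated (theta, m, v).
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment (mean of squares)
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

In a training loop this update would be applied to every weight and bias once per batch, so with a batch size of 28 each epoch performs roughly one update per 28 training samples.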
The LSTM model was developed using two LSTM layers, which capture the essential features of the input data. The input size (number of cells) of the first LSTM layer depends on the length of the input window (the number of training days), while the size of the second LSTM layer is set to 30. In each LSTM layer, we again use batch normalization and Leaky ReLU as the activation function. After passing through the LSTM layers, the input is processed by two fully connected layers containing 10 and 1 nodes (neurons), respectively. All the hyperparameters, including the optimizer, learning rate, batch size, and number of epochs, were kept the same as in the CNN model.
Model performance evaluation
This study used observations over four different window lengths (number of days), 30, 60, 180, and 365 (1 year), to predict the next day's streamflow using the CNN and LSTM models. The performance of the proposed models was assessed through five statistical indicators and evaluated based on the models' ability to predict the observed streamflow at three different stations: Attapeu (Sekong River Basin), Kon Tum (Sesan River Basin), and Bandon (Srepok River Basin). The statistical indicators used are the Pearson correlation coefficient (R) (Legates & McCabe 1999), Nash–Sutcliffe efficiency (NSE) (Nash & Sutcliffe 1970), root mean square error (RMSE) (Zhu et al. 2020), MAE (Legates & McCabe 1999), and percent bias (PBIAS) (Gupta et al. 1999).
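Building the input windows can be sketched as below. The sketch assumes each sample consists of the previous `window` days of (temperature, precipitation) pairs and that the target is the streamflow on the following day; the exact alignment of inputs and target used in the study is not stated, so this layout and the helper name `make_windows` are illustrative.

```python
def make_windows(features, target, window):
    """Turn daily series into supervised samples for next-day prediction.

    features : list of per-day feature tuples, e.g. (temperature, precipitation)
    target   : list of per-day streamflow values, same length as features
    window   : number of consecutive past days in each sample (e.g. 30 or 365)

    Returns (X, y), where X[k] holds `window` consecutive days of features
    and y[k] is the streamflow on the day immediately after that window.
    """
    if len(features) != len(target):
        raise ValueError("features and target must have the same length")
    if window < 1 or window >= len(features):
        raise ValueError("window must be at least 1 and shorter than the series")
    X, y = [], []
    for end in range(window, len(features)):
        X.append(features[end - window:end])
        y.append(target[end])
    return X, y
```

For a series of n days, this yields n − window samples, so longer windows such as 365 days leave noticeably fewer training samples than a 30-day window.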
R measures how well differences in one variable can be explained by differences in another; it is a measure of the linear correlation between two sets of data. It varies from −1 to 1, where −1 denotes a perfect negative correlation, 0 means no correlation, and 1 expresses a perfect positive correlation (Equation (13)). NSE (Equation (14)) is another commonly used indicator of a model's predictive power and describes the model's accuracy quantitatively. Its value ranges from −∞ to 1, with 1 indicating a perfect model, whereas an efficiency of less than 0 means the observed mean is a better predictor than the model. Similarly, RMSE measures the spread of the residuals around the line of best fit; it is the standard deviation of the residuals (prediction errors) (Equation (15)).
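The five indicators can be sketched with their textbook definitions, which are assumed here to match Equations (13)–(15) and the cited sources. The PBIAS sign convention, 100 × Σ(observed − simulated) / Σ(observed), so that positive values indicate underestimation, is an assumption; some references use the opposite sign.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated flows."""
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_s = sum((s - ms) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the observed mean."""
    mo = _mean(obs)
    residual = sum((o - s) ** 2 for o, s in zip(obs, sim))
    spread = sum((o - mo) ** 2 for o in obs)
    return 1.0 - residual / spread


def rmse(obs, sim):
    """Root mean square error."""
    return math.sqrt(_mean([(o - s) ** 2 for o, s in zip(obs, sim)]))


def mae(obs, sim):
    """Mean absolute error."""
    return _mean([abs(o - s) for o, s in zip(obs, sim)])


def pbias(obs, sim):
    """Percent bias; positive means the model underestimates (assumed sign)."""
    return 100.0 * sum(o - s for o, s in zip(obs, sim)) / sum(obs)
```

For example, a simulation that is uniformly 1 unit too high for observations 1, 2, 3, 4 keeps R at 1 but lowers NSE to 0.2, which shows why correlation alone cannot detect systematic bias.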
RESULTS AND DISCUSSION
Future projected climate
GCMs indicate that the temporal distribution of temperature in the basin remains unchanged except at station 140704 (Pleiku, Sesan catchment): the hot period spans roughly March–June, and temperatures start cooling down in July (Supplementary Figure S3). Even though the hot period at station 140703 still spans March to June, the hottest month there is delayed from March to April. The results indicate that the basin must prepare for extreme heat events, since the hot-season temperature increases significantly at most studied locations.
Besides the change in total rainfall, the GCMs also predict a shift in the temporal distribution of precipitation in this basin. The rainy season still typically occurs between May and October, similar to the historical record. The Sekong catchment will have the most rainfall in July instead of August, according to four of the six GCM projections. The Sesan catchment shows the least change in annual precipitation amount and temporal pattern compared to the others; however, all studied GCMs project an increase in its rainfall during the wet period. The three investigated stations in the Srepok watershed show a significant increase in monthly precipitation during the rainy season, suggesting a higher risk of flooding due to precipitation seasonality. In this sub-basin, August and September usually see the highest river discharge, but the peak is projected to shift to September and October.
Performances of ML models in hydrologic simulation
Location | R | NSE | MAE | RMSE | PBIAS |
---|---|---|---|---|---|
Attapeu-Sekong | 0.85 | 0.7 | 176.7 | 367.5 | 17.9 |
Bandon-Srepok | 0.84 | 0.7 | 64.3 | 125.9 | 3.2 |
Kon Tum-Sesan | 0.8 | 0.58 | 29.6 | 61.6 | −5.95 |
Period | Station | RCP4.5: CMCC-CMS | RCP4.5: HadGEM-AO2 | RCP4.5: MIROC5 | RCP8.5: CMCC-CMS | RCP8.5: HadGEM-AO2 | RCP8.5: MIROC5 |
---|---|---|---|---|---|---|---|
Wet period (June–November) | Kon Tum | 6.03 | −3.57 | 34.41 | 5.86 | 23.68 | 34.33 |
Wet period (June–November) | Bandon | −10.33 | −5.82 | −13.78 | −11.78 | −18.67 | −7.95 |
Wet period (June–November) | Attapeu | −24.18 | 52.45 | 117.86 | −24.75 | 35.64 | 114.69 |
Dry period (December–May) | Kon Tum | 17.81 | 22.56 | 36.98 | 14.38 | 16.87 | 33.12 |
Dry period (December–May) | Bandon | 53.98 | 50.35 | 146.22 | 30.10 | 48.65 | 174.49 |
Dry period (December–May) | Attapeu | 42.32 | 82.60 | 149.01 | 49.69 | 67.91 | 145.27 |
On the other hand, most GCMs project that the Srepok River will have a downtrend in high flow, with a decrease of around 1–26%, except for a minor increase of about 5% under the MIROC5 model. The GCMs also predict that the 2050–2080 period will have the largest shift at this location, followed by the 2020–2050 period, while the far-future (2080–2100) will have the least change.
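The high-flow comparison can be sketched as follows, assuming Q5 denotes the daily flow equalled or exceeded 5% of the time (the usual flow-duration-curve convention) and that changes are expressed as a percentage of the baseline value. Both function names and the linear interpolation between ranked flows are illustrative assumptions, not the study's stated procedure.

```python
def exceedance_flow(flows, exceed_pct):
    """Flow equalled or exceeded `exceed_pct` percent of the time.

    Uses linear interpolation between ranked daily flows, so
    exceedance_flow(flows, 5) is the 95th percentile of the series.
    """
    if not flows:
        raise ValueError("flows must not be empty")
    if not 0.0 <= exceed_pct <= 100.0:
        raise ValueError("exceed_pct must be between 0 and 100")
    ordered = sorted(flows)
    position = (1.0 - exceed_pct / 100.0) * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def percent_change(future, baseline):
    """Relative change of a future value with respect to the baseline, in %."""
    return 100.0 * (future - baseline) / baseline
```

A typical use would compute `exceedance_flow(series, 5)` for the baseline and for each future period, then pass the two values to `percent_change` to obtain the shift in high flow.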
CONCLUSIONS
This study demonstrates that the selected ML algorithms, CNN and LSTM, can predict streamflow effectively using only daily mean temperature and precipitation data. Of the two ML models, LSTM performs better, with excellent correlation and NSE scores, provided that the river has a natural flow condition. This study finds that precipitation, temperature, and river discharge in the 3S River Basin will change significantly in the future. The temperature in the basin will increase by 0.13–4.2 °C, varying by location and scenario. The northwest region will have the most significant temperature change, followed by the south region, with the least change in the central region. GCMs also predict that the precipitation in the 3S River Basin will increase by up to 40.6%, with the southeast region experiencing the most significant shift in rainfall. The streamflow in the basin will increase significantly in the mid- and far-future (2050–2100). The ML models also suggest increasing flood risk in the Sekong and Sesan catchments, indicated by an increase in the Q5 index, while the Srepok will have a downtrend in high flow, with a decrease of around 1–26%. The HadGEM-AO2 and MIROC5 models forecast that the Q5 flow at Attapeu station will increase by 11–58%, while CMCC-CMS predicts that this flow will be reduced by about a third compared to the 1985–2005 period. In addition to affecting the amount of river discharge, climate change alters the temporal streamflow pattern in the study area by delaying the wet season in most of the studied areas.
ACKNOWLEDGEMENTS
The authors would like to express sincere gratitude to Silver Anniversary Scholarships for the financial support of this research. The authors would like to thank the USAID-funded PEER project ‘Connecting climate change, hydrology, and fisheries for energy and food security in Lower Mekong Basin’ (Thailand and Cambodia – Project 6-436) carried out in co-operation with the Asian Institute of Technology, Stockholm Environmental Institute, Inland Fisheries Research and Development Institute, and Arizona State University for their support in data acquisition.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.