Abstract
Extreme weather conditions such as floods and droughts call for careful planning and management of water resources in order to prevent fatalities and other negative effects. Modern soft computing and machine learning approaches offer a way to simulate these hydrological phenomena despite their complexity and non-linear character, which depend on diverse parameters. Distributed or semi-distributed models for large-size watershed areas with geographical irregularity and heterogeneity require a substantial amount of high-quality spatial data. This research uses 40 years (1981–2020) of daily rainfall-runoff data to illustrate the application of two data-driven models, random forest regression (RFR) and feed-forward neural network (FFNN), to semi-arid, large-size watershed areas. To understand the effect of input data, different input–output combinations were considered, giving eight rainfall-runoff models. Results show that both RFR and FFNN models performed well, with the RFR models performing best, reaching correlation coefficients of 0.9928 (M6) and 0.9926 (M1).
HIGHLIGHTS
Spatially well-distributed, low rain-gauge density data can provide good results.
RFR model performance for peak-flow estimation is better than FFNN.
A data-driven model proves a better option for the large watershed with low rain-gauge density for flood warning systems.
INTRODUCTION
Among all extreme weather events, floods and heavy rains not only claim the highest number of casualties worldwide but also cause loss of property, infrastructure, and agriculture. Heavier precipitation, more frequent hurricanes, and the recent flash flooding at Death Valley National Park are some of the key ways in which climate change has increased flood risk. Record-breaking heavy rainfall events are projected to increase along with temperatures through the 21st century. As the frequency of extreme weather events is increasing at an alarming rate, early warning around 3–6 h prior to an event, based on technology, advanced models, observations, and relevant computing, is needed. Not only heavy rain in a short time but even a moderate amount of rainfall can cause severe damage because of man-made changes such as increased urbanisation, alterations in natural drainage systems, and changing land use. On the other hand, water scarcity due to droughts and overuse is also a major challenge for societies and countries. This necessitates sustainable water management that helps to determine future water requirements for livelihood and irrigation purposes. Hence, extensive research has been carried out, and is still required, on rainfall-runoff analysis. A variety of models have been used to develop the rainfall-runoff relationship, including empirical, conceptual, deterministic, stochastic, physical, and data-driven models. Soft computing and machine learning techniques such as neural networks, genetic algorithms, and fuzzy logic, which are able to model complex or unknown relationships, are becoming increasingly useful and feasible (Chandwani et al. 2015).
The artificial neural network (ANN) has remained the most popular tool and has been used for a wide range of hydrological processes such as rainfall-runoff modelling, flood frequency analysis, stream flow prediction, sedimentation, reservoir inflow prediction, water quality modelling, and groundwater modelling, with the majority of the work being in rainfall-runoff modelling (Maier et al. 2010). The rainfall-runoff relationship is a physical phenomenon that plays an important role in water resource management planning and flood forecasting. A rainfall-runoff model is a mathematical description of the intricate relationship between variables and parameters. The parameters that need to be considered for rainfall-runoff modelling are the topographical and hydrological features, land use and land cover, the size of the watershed (small, medium, or large), the density of rain-gauge (RG) stations, and the quality and length of the data series (Sinha et al. 2015).
An artificial neural network (ANN) model is a data-driven machine learning approach that gives good results not only for areas prone to frequent, disastrous floods (Kisi et al. 2013; Ruslan et al. 2014; Chaipimonplin & Vangpaisal 2015; Setiono & Hadiani 2015; Kumar & Yadav 2021) but also for arid and semi-arid regions (Riad et al. 2004; Solaimani 2009; Ghumman et al. 2011; Aichouri et al. 2015; Parmar et al. 2016; Hussain et al. 2017). In terms of time step, rainfall-runoff relations have been developed for hourly, six-hourly, daily, weekly, monthly, and annual data; the appropriate time step also depends on the size of the basin, because the distance that precipitation has to travel to reach the basin outlet varies accordingly. In terms of watershed size, most rainfall-runoff simulations have been carried out on medium-sized watersheds with areas ranging from 250 to 2,500 km² (Kalteh 2008; Zadeh et al. 2010; Rezaeianzadeh et al. 2013; Ruslan et al. 2014). Comparatively less work has been carried out on mini, micro, and milli watersheds. For example, Jain & Indurthy (no date) presented a comparative analysis of event-based rainfall-runoff modelling in a watershed of less than 1 km² and found the ANN to be the most suitable technique. In a study on a watershed with an area of 26.58 km², Kovář et al. (2015) noted that it is hard to estimate runoff for extremely small areas because of the possibility of severe, torrential water flow. As a watershed's spatial variance rises with its size, larger watersheds have heterogeneous characteristics. With increasing watershed area, more water can be stored, and both the number and the variety of characteristics increase. For instance, the nature of the ground profile, which includes topography, soil properties, and land use, varies more as the size of the watershed increases.
Yet data-driven models such as ANN have given good results without consideration of the physical, chemical, or biological characteristics of the basin. Rajurkar et al. (2002) demonstrated that coupling the ANN with a multiple-input, single-output model predicts daily runoff with high accuracy for a large-size catchment area of 17,157 km², where the spatial variation of rainfall is accounted for by subdividing the catchment and treating the average rainfall of each sub-catchment as a parallel and separate lumped input to the model. Patel & Joshi (2017) built an ANN model using the feed-forward back-propagation algorithm to establish monthly and annual rainfall-runoff correlations for the Dharoi watershed, with an area of 21,674 km², and the results indicated that the ANN model captured the relationship between input and output well. The aim of this paper is to develop a rainfall-runoff model for the semi-arid, large-size Manjira watershed (9,960 km²), spread along the Deccan plateau of India, using random forest regression (RFR) and a feed-forward neural network (FFNN) with different input and output parameters, and to compare their accuracy. The paper is organised as follows: the next section gives detailed information about the study area and data, followed by a section explaining the theory and methodology. The results and discussion are given next, and the last section gives concluding remarks.
STUDY AREA AND DATA DESCRIPTION
IMD RGs
Sr. No. | Name | Latitude | Longitude |
---|---|---|---|
1 | Patoda | 18° 43′ 12″ | 75° 42′ 0″ |
2 | Chowasala | 18° 48′ 0″ | 75° 28′ 48″ |
3 | Kallam | 18° 34′ 12″ | 76° 1′ 12″ |
4 | Ausa | 18° 15′ 0″ | 76° 30′ 0″ |
5 | Latur | 18° 24′ 0″ | 76° 48′ 0″ |
6 | Halsoor | 18° 1′ 12″ | 77° 1′ 12″ |
NHP, Nashik RGs
Sr. No. | Name | Latitude | Longitude |
---|---|---|---|
1 | Alni | 18° 17′ 30″ | 76° 0′ 55″ |
2 | Digholamba | 18° 40′ 44″ | 76° 17′ 41″ |
3 | Jadhala | 18° 38′ 0″ | 73° 42′ 31″ |
4 | Jagalpur | 18° 37′ 45″ | 77° 04′ 11″ |
5 | Karajkheda | 18° 03′ 03″ | 76° 16′ 33″ |
6 | Limbaganesh | 18° 48′ 00″ | 75° 40′ 0″ |
7 | Limbla | 18° 55′ 21″ | 76° 17′ 31″ |
8 | Matola | 18° 04′ 57″ | 76° 24′ 25″ |
9 | Nalegaon | 18° 25′ 11″ | 76° 48′ 47″ |
10 | Nitur | 18° 14′ 25″ | 76° 46′ 38″ |
11 | Padoli | 18° 11′ 54″ | 76° 16′ 43″ |
12 | Tadola | 18° 22′ 33″ | 76° 03′ 0″ |
13 | Ujani | 18° 42′ 42″ | 76° 41′ 25″ |
14 | Yellamghat | 18° 47′ 17″ | 75° 49′ 35″ |
15 | Yeoti | 18° 55′ 55″ | 76° 14′ 31″ |
THEORY AND METHODOLOGY
Data processing is an important step that is carried out by first identifying the percentage of missing data for each RG station, and RGs with missing data of more than 30% are eliminated. For selected RGs, missing data were estimated with data from other well-associated gauges at a minimum distance by the normal ratio method, as the normal ratio method provides the most accurate estimations of missing data and is observed to be the most reliable method (Armanuos et al. 2020). To understand the relationship of each RG's rainfall value with runoff in terms of water level and discharge, correlation
The models with different structures are presented in Table 3.
Input–output combination of the models
Model | Input | Abbreviation for input | Output | Full-year data/seasonal data |
---|---|---|---|---|
M1 | 6 RGs rainfall | P(1–6) | Discharge (Q) | Full year |
M2 | 6 RGs rainfall | P(1–6) | Discharge (Q) | Seasonal |
M3 | 6 RGs rainfall | P(1–6) | Level (L) | Full year |
M4 | 6 RGs rainfall | P(1–6) | Level (L) | Seasonal |
M5 | 6 RGs, weighted rainfall (Thiessen polygon method) | Weighted P(1–6) | Level (L) | Full year |
M6 | 18 RGs | P(1–18) | Discharge (Q) | Full year |
M7 | 18 RGs | P(1–18) | Level (L) | Full year |
M8 | 18 RGs, weighted rainfall (Thiessen polygon method) | Weighted P(1–18) | Level (L) | Full year |
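Models M5 and M8 in the table above use rainfall weighted by the Thiessen polygon method. A minimal sketch of that weighting follows, assuming each gauge's weight is its polygon area divided by the total watershed area. The polygon areas and rainfall values are hypothetical, as are the function names. The paper does not state whether the weighted inputs were fed to the models per gauge or as a single areal mean, so both are shown.

```python
def thiessen_weights(polygon_areas_km2):
    """Return each gauge's share of the total watershed area."""
    total_area = sum(polygon_areas_km2)
    if total_area <= 0:
        raise ValueError("total polygon area must be positive")
    return [area / total_area for area in polygon_areas_km2]


def weighted_rainfall(rainfall_mm, weights):
    """Scale each gauge reading by its Thiessen weight (one value per gauge)."""
    if len(rainfall_mm) != len(weights):
        raise ValueError("need one weight per gauge")
    return [p * w for p, w in zip(rainfall_mm, weights)]


def areal_mean_rainfall(rainfall_mm, weights):
    """Thiessen areal mean: the sum of the weighted gauge readings."""
    return sum(weighted_rainfall(rainfall_mm, weights))


# Hypothetical polygon areas (km2) for six gauges; they sum to 9,960 km2.
areas = [2000.0, 1500.0, 1800.0, 1660.0, 1500.0, 1500.0]
weights = thiessen_weights(areas)

# Hypothetical one-day rainfall (mm) at the same six gauges.
daily_rain = [12.0, 0.0, 5.5, 20.0, 3.0, 0.0]
print(weighted_rainfall(daily_rain, weights))
print(areal_mean_rainfall(daily_rain, weights))
```

When all polygons have equal area, the areal mean reduces to the simple arithmetic mean of the gauge readings.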
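The data-processing step described at the start of this section (dropping gauges with more than 30% missing data and filling the remaining gaps with the normal ratio method) can be sketched as follows. This is an illustration only: it uses the standard form of the normal ratio method, P_x = (1/n) Σ (N_x / N_i) P_i, where N denotes a gauge's normal (long-term mean) rainfall. The gauge names, values, and function names are hypothetical and are not taken from the study.

```python
def missing_fraction(series):
    """Share of records in a daily series that are missing (None)."""
    if not series:
        raise ValueError("empty series")
    return sum(value is None for value in series) / len(series)


def normal_ratio_estimate(target_normal, neighbours):
    """Estimate a missing reading at the target gauge.

    target_normal: normal (long-term mean) rainfall at the target gauge.
    neighbours: list of (observed_rainfall, normal_rainfall) pairs for
                nearby gauges on the day to be filled.
    """
    if not neighbours:
        raise ValueError("at least one neighbouring gauge is required")
    total = 0.0
    for observed, normal in neighbours:
        if normal <= 0:
            raise ValueError("normal rainfall must be positive")
        total += (target_normal / normal) * observed
    return total / len(neighbours)


# Hypothetical daily series; None marks a missing record.
gauges = {
    "gauge_a": [0.0, 4.2, None, 10.0, 0.0],
    "gauge_b": [None, None, 3.0, None, 1.0],
}
# Keep only gauges with at most 30% missing data.
kept = [name for name, series in gauges.items() if missing_fraction(series) <= 0.30]
print(kept)

# Fill one gap from two hypothetical neighbours: (observed mm, normal mm).
print(normal_ratio_estimate(800.0, [(10.0, 1000.0), (20.0, 400.0)]))
```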
Random forest regressor
Random forest is a popular supervised machine learning algorithm for classification and regression problems (Figure 8). It constructs decision trees from different samples of the data and uses their majority vote for classification and their average for regression. The steps involved in RFR are given below.
Step 1: In a random forest, n records are drawn at random from a dataset of k records.
Step 2: Individual decision trees are created for each sample.
Step 3: Each decision tree will generate an output.
Step 4: For classification and regression, the final result is based on the majority vote or average, respectively.
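The four steps above can be sketched for the regression case in a few lines of standard-library Python. This is a simplified illustration, not the implementation used in the study: it bootstraps records and averages small regression trees, but omits the random feature selection at each split that a full random forest also performs. The tree depth, number of trees, and toy data are assumptions.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Return the (feature, threshold) pair with the lowest SSE, or None."""
    best, best_score = None, float("inf")
    for f in range(len(rows[0])):
        for thr in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            if left and right and sse(left) + sse(right) < best_score:
                best, best_score = (f, thr), sse(left) + sse(right)
    return best


def build_tree(rows, targets, depth=0, max_depth=3):
    """Grow a regression tree; a leaf stores the mean target of its records."""
    split = best_split(rows, targets) if depth < max_depth else None
    if split is None or len(set(targets)) == 1:
        return mean(targets)
    f, thr = split
    left = [(r, t) for r, t in zip(rows, targets) if r[f] <= thr]
    right = [(r, t) for r, t in zip(rows, targets) if r[f] > thr]
    return (f, thr,
            build_tree([r for r, _ in left], [t for _, t in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [t for _, t in right], depth + 1, max_depth))


def predict_tree(node, row):
    """Step 3: follow the splits until a leaf value is reached."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node


def fit_forest(rows, targets, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step 1: draw a bootstrap sample of the records (with replacement).
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Step 2: grow one tree per sample.
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx]))
    return forest


def predict_forest(forest, row):
    """Step 4 (regression): average the outputs of all trees."""
    return mean(predict_tree(tree, row) for tree in forest)


# Toy data: one rainfall feature, with runoff proportional to rainfall.
rows = [[float(x)] for x in range(20)]
targets = [3.0 * x for x in range(20)]
forest = fit_forest(rows, targets)
print(predict_forest(forest, [2.0]), predict_forest(forest, [17.0]))
```

Because each leaf is an average of observed targets, a prediction can never fall outside the range of the training values.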
Artificial neural network
An ANN, also called a neural network, is a computing system inspired by the biological neural networks that constitute the human brain. Such systems ‘learn’ to perform tasks by considering examples, generally without being programmed with any task-specific rules. The ANN is a concept from the field of artificial intelligence modelled on the human nervous system. There are many types of ANN and training processes, including the perceptron and back propagation. The ANN models constructed in this study were of the FFNN type. In a feed-forward model, information is processed in one direction only: although the data may pass through multiple hidden nodes, it always moves forwards and never backwards. Detailed theoretical information about ANNs can be found in ASCE Task Committee on Application of Artificial Neural Networks in Hydrology (2000).
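The one-directional flow of information in a feed-forward network can be illustrated with a short forward pass. The sketch below is an assumption-laden example: the layer sizes, the sigmoid hidden activation, the linear output, and the weights are all hypothetical, since the architecture used in the study is not described here. Training (for example, by back propagation) is not shown.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def identity(z):
    return z


def dense(inputs, weights, biases, activation):
    """One layer: weighted sum of the inputs plus a bias, then an activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def ffnn_forward(inputs, layers):
    """Pass the signal through the layers in order; nothing flows backwards."""
    signal = inputs
    for weights, biases, activation in layers:
        signal = dense(signal, weights, biases, activation)
    return signal


# Hypothetical network: 2 rainfall inputs -> 2 hidden nodes -> 1 output.
layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], sigmoid),   # hidden layer
    ([[1.0, 1.0]], [0.0], identity),                   # output layer
]
output = ffnn_forward([5.0, 12.0], layers)
print(output)
```

With all hidden weights set to zero, each hidden node outputs sigmoid(0) = 0.5, so the single output is 1.0 regardless of the input; non-zero weights would make the output depend on the rainfall values.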
RESULT AND DISCUSSION

Model results
Performance evaluator | Model | M1 | M2 | M3 | M4 | M5 | M6 | M7 | M8 |
---|---|---|---|---|---|---|---|---|---|
MAE | RFR | 3.9816 | 6.9224 | 0.1024 | 0.1644 | 0.1024 | 3.9812 | 0.1429 | 0.0967 |
MAE | FFNN | 0.0313 | 0.0514 | 0.1533 | 0.1197 | 0.1546 | 0.0383 | 0.1480 | 0.1514 |
RMSE | RFR | 20.6617 | 27.5881 | 0.1703 | 0.2749 | 0.1703 | 20.3023 | 0.2383 | 0.1617 |
RMSE | FFNN | 0.1179 | 0.1414 | 0.2223 | 0.1794 | 0.2158 | 0.1296 | 0.2193 | 0.2212 |
R² | RFR | 0.9852 | 0.9820 | 0.9764 | 0.9599 | 0.9764 | 0.9857 | 0.9556 | 0.9807 |
R² | FFNN | 0.9327 | 0.9018 | 0.9206 | 0.9593 | 0.9277 | 0.9365 | 0.9288 | 0.9235 |
R | RFR | 0.9926 | 0.9910 | 0.9881 | 0.9797 | 0.9903 | 0.9928 | 0.9775 | 0.9779 |
R | FFNN | 0.9657 | 0.9496 | 0.9595 | 0.9794 | 0.9632 | 0.9677 | 0.9638 | 0.9610 |
The correlation coefficients of 0.9926 (M1) and 0.9928 (M6) are the highest RFR results, which signifies that the RFR model performs better with discharge as output, while 0.9794 (M4) and 0.9677 (M6) are the highest FFNN results, for seasonal level and full-year discharge as output, respectively. In short, these are the top two results for each of the RFR and FFNN models.
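The four performance evaluators in the table above (MAE, RMSE, R², and R) can be computed from observed and simulated series as sketched below. The definitions are the standard ones. Treating R² as the square of the Pearson correlation coefficient R is an assumption, although it agrees with most table entries (for example, 0.9926² ≈ 0.985 for RFR M1). The sample values are hypothetical.

```python
import math


def mae(observed, simulated):
    """Mean absolute error."""
    return sum(abs(o - s) for o, s in zip(observed, simulated)) / len(observed)


def rmse(observed, simulated):
    """Root mean square error."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(observed, simulated)) / len(observed))


def pearson_r(observed, simulated):
    """Pearson correlation coefficient R."""
    mean_o = sum(observed) / len(observed)
    mean_s = sum(simulated) / len(simulated)
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, simulated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    if var_o == 0 or var_s == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_o * var_s)


def r_squared(observed, simulated):
    """R2 taken here as the squared Pearson correlation (an assumption)."""
    return pearson_r(observed, simulated) ** 2


# Hypothetical observed and simulated discharge values.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.5, 2.5, 3.5, 4.5]
print(mae(observed, simulated), rmse(observed, simulated))
print(pearson_r(observed, simulated), r_squared(observed, simulated))
```

The example also shows why R alone can be misleading: a constant bias of 0.5 leaves R at 1 while MAE and RMSE are both 0.5, so the error measures and the correlation should be read together.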
Time series plot for training state (from 20 August to 30 October 1998).
Time series plot for the testing state (from 14 September to 14 October 2016).
As mentioned in Table 3, models M1–M5 were calibrated with 6 RG data as input, while models M6–M8 were calibrated with 18 RG data. Seventy percent of the dataset is used for training, 15% for validation, and 15% for testing. According to the World Meteorological Organisation (1976), one RG station per 1,500–10,000 km² is recommended for arid regions, while the Bureau of Indian Standards (BIS) suggests one RG station for up to 500 km² (Subramanya 2008). The results show that the RFR and FFNN models with data from six spatially well-distributed RGs (1,660 km² per RG) as input gave the same or comparatively better results than those with 18 RGs as input. As the Manjira watershed belongs to a semi-arid climatic zone, the effect of the dry period, during which runoff is zero, was examined by calibrating the models with full-year data (M1 and M3) as well as with seasonal data (1 June–31 October; M2 and M4) as input. For FFNN, the seasonal-input model performs better (M4, 0.9794), while for RFR, the full-year-input model performs best (M1, 0.9926). Models M1, M2, and M6 are calibrated with discharge as the output parameter. RFR models with discharge as output performed better than FFNN. The correlation coefficient of FFNN varies from 0.9496 to 0.9794, whereas that of RFR varies from 0.9775 to 0.9928. In terms of the correlation coefficient, the RFR model therefore performs at least as well as the FFNN model. Model M6, with data from 18 RGs as input and discharge as output, is the best-performing RFR model, which is clearly visible in the scatter plot (Figure 8 (M6)). Also, the time series plot for RFR-M6 shows high accuracy in peak-flow discharge. Furthermore, the time series plots of M1, M2, and M6 for both the 1998 and 2016 flood events show that RFR estimates peak discharge values more precisely than FFNN.
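The data partitioning and gauge-density figures quoted above can be reproduced with a short sketch. It assumes a chronological 70/15/15 split over the daily record from 1 January 1981 to 31 December 2020; the paper does not state whether the split was chronological or random, and the function names are hypothetical.

```python
from datetime import date, timedelta


def daily_dates(start, end):
    """All calendar days from start to end, inclusive."""
    n_days = (end - start).days + 1
    return [start + timedelta(days=i) for i in range(n_days)]


def in_season(day, first=(6, 1), last=(10, 31)):
    """True for days between 1 June and 31 October (seasonal models M2, M4)."""
    return first <= (day.month, day.day) <= last


def split_70_15_15(records):
    """Chronological split into training, validation, and testing parts."""
    n = len(records)
    n_train = n * 70 // 100
    n_val = n * 15 // 100
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])


days = daily_dates(date(1981, 1, 1), date(2020, 12, 31))
seasonal_days = [d for d in days if in_season(d)]
train, val, test_set = split_70_15_15(days)

# Gauge density for the 9,960 km2 watershed with 6 and with 18 rain gauges.
area_per_gauge_6 = 9960 / 6
area_per_gauge_18 = 9960 / 18

print(len(days), len(seasonal_days))
print(len(train), len(val), len(test_set))
print(area_per_gauge_6, round(area_per_gauge_18, 1))
```

With six gauges the density is 1,660 km² per gauge, which lies within the 1,500–10,000 km² range quoted for arid regions, whereas 18 gauges give roughly 553 km² per gauge, close to the BIS figure of 500 km².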
As the stream is dry during summer and most of the winter, models M3, M4, M5, M7, and M8, which have level as output, are less effective at estimating the zero of the gauge reading, i.e., 542.723 m. For FFNN, the best-performing model is M4, with rainfall from six RGs on rainy-season days as input and level as output, giving a correlation coefficient of 0.9794. For RFR, the best-performing model is M6, with 18 RG data as input and discharge as output, giving a correlation coefficient of 0.9928.
Models M5 and M8, which use weighted rainfall as input, do not show any drastic improvement in model results, since the watershed has a gentle slope of 0–2.93 m/km, which implies little spatial variation in rainfall. The time series plot generated for peak flow from 14 September to 14 October 2016 clearly shows that level estimation is better than discharge estimation for both RFR and FFNN.
CONCLUSION
In order to understand the impact of various input–output data combinations on the model and to identify the best-fit model, this study uses RFR and FFNN models to simulate the rainfall-runoff process for large-size semi-arid watershed areas. It is observed that rainfall data from 6 RGs as input perform as well as rainfall data from 18 RGs. This means that, with due consideration of the characteristics of the watershed area, even low-density, spatially well-distributed RG data can provide good results. The model can therefore provide a solution for runoff estimation, and subsequently flood forecasting, in areas where data availability is an issue because of the low density of RGs. From the time series plots for the training state (20 August–30 October 1998) and the testing state (14 September–14 October 2016), it is clearly visible that the RFR model with level as output estimates peak flow more precisely than FFNN. Thus, it is again shown that artificial neural networks are capable of simulating complex rainfall-runoff relationships but need a large volume of data and are less accurate in peak-flow estimation. On the other hand, both RFR and FFNN are less effective at reproducing the zero gauge value observed during the dry season. The FFNN model performs well, with a correlation coefficient of 0.9794 (M4) for seasonal P(1–6) as input and level as output. The RFR model performed better, with 0.9928 (M6) and 0.9926 (M1) for P(1–18) as input with discharge as output and P(1–6) as input with discharge as output, respectively. The model performance graph (Figure 13) clearly shows that for the M4 model, where seasonal rainfall data from six RGs are the input and level is the output, the FFNN result is most accurate while RFR is least accurate. The major effect of climate change is a change in the rainfall pattern, with heavy rainfall in a short period of time causing flash floods in arid and semi-arid regions.
This necessitates the accurate estimation and forecasting of runoff that will help provide flood warnings. The simulation findings show that the data-driven models deliver useful information without an in-depth understanding of watershed characteristics and are useful for the management and planning of water resources.
ACKNOWLEDGEMENTS
The authors wish to sincerely thank the Meteorological Department, Shivajinagar, Pune; the Central Water Commission, Krishna Godavari Basin Organization, Hyderabad; and the Hydrology Project (SW), Jal Vidnyan Bhawan, Dindori Road, Nashik, for providing the necessary data for the research work. The authors also acknowledge ADVIT AI Labs, Pune, for technical support.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.