Abstract

The study of the infiltration process is considered essential and necessary for all hydrology studies. Therefore, accurate predictions of infiltration characteristics are required to understand the behavior of the subsurface flow of water through the soil surface. The aim of the current study is to simulate and improve the prediction accuracy of the infiltration rate and cumulative infiltration of soil using regression tree methods. Experimental data recorded with a double ring infiltrometer for 17 different sites are used in this study. Three regression tree methods: random tree, random forest (RF) and M5 tree, are employed to model the infiltration characteristics using basic soil characteristics. The performance of the modelling approaches is compared in predicting the infiltration rate as well as cumulative infiltration, and the obtained results suggest that the performance of the RF model is better than the other applied models with coefficient of determination (R2) = 0.97 and 0.97, root mean square error (RMSE) = 8.10 and 6.96 and mean absolute error (MAE) = 5.74 and 4.44 for infiltration rate and cumulative infiltration respectively. The RF model is used to represent the infiltration characteristics of the study area. Moreover, parametric sensitivity is adopted to study the significance of each input parameter in estimating the infiltration process. The results suggest that time (t) is the most influencing parameter in predicting the infiltration process using this data set.

HIGHLIGHTS

  • Infiltration rate and cumulative infiltration characteristics in a semi-arid region were modelled using regression trees.

  • Random forest outperforms random tree as well as M5 tree to predict the infiltration characteristics.

  • Time is observed as the most important parameter in affecting the prediction performance of both the infiltration rate and the cumulative infiltration of soil.

INTRODUCTION

The infiltration process influences the hydrological cycle, stream flow, groundwater recharge, irrigation, soil erosion and drainage systems (Singh et al. 2018). Infiltration can be defined as the movement of irrigation or precipitation water into the soil (Sihag et al. 2018a, 2018b). The actual rate at which water moves into the soil is termed the infiltration rate (Haghighi et al. 2011). The infiltration process plays a significant role in the selection of crop type and agricultural practices. Prediction of infiltration is also necessary for the efficient design of irrigation systems (Hillel 1998). The sustainability of the groundwater system also depends on this process (Chen & Young 2006).

The quantity and quality of recharged water are the most important parameters for water resources management, especially in regions where water scarcity occurs. Infiltration divides water into the two most important hydrological segments, groundwater flow and surface flow. It is the main factor in watershed modelling for the prediction of surface runoff. Accurate prediction of infiltration rate is essential for reliable prediction of surface runoff (Diamond & Shanley 2003). At catchment level, infiltration characteristics are one of the main factors in determining the flooding condition (Bhave & Sreeja 2013). The significance of the infiltration process has led soil and water researchers to develop several models (e.g. Green & Ampt 1911; Kostiakov 1932; Horton 1941; Philip 1957; Holtan 1961; Soil Conservation Service 1972; Smith & Parlange 1978; Swartzendruber 1987; Singh & Yu 1990; Sihag et al. 2017a). However, only a few of the developed models have been effectively used with field data. Parameter calculation is the main criterion for selecting one model over another. These infiltration models can be divided into three groups: physical models, semi-empirical models and empirical models.

The main problem in modelling the infiltration process is the variation of soil type and texture (Machiwal et al. 2006). Variation in the infiltration process in any watershed causes difficulty in the management of water for agricultural practices. The process of infiltration is influenced by several factors, including soil type, texture, soil hydraulic properties, soil moisture profile, precipitation and climate properties. The spatial and temporal prediction of the infiltration process in the real environment cannot at present be deduced by direct or indirect methods. Some infiltration models are based on the law of mass conservation and Darcy's law (physical models), a few rely on field and laboratory measurements (empirical models) and some are based on simple hypotheses about the relation between infiltration rate and cumulative infiltration (semi-empirical models). These infiltration models are developed for vertically homogeneous soils with constant initial moisture content in the soil, constant density and over horizontal planes. Very few researchers have attempted to use soft computing in the prediction of soil infiltration (Sy 2006; Sihag et al. 2018a, 2018b).

During the last few years, several researchers have used soft computing techniques such as adaptive neuro fuzzy inference system (ANFIS), artificial neural network (ANN), support vector machine (SVM), M5 tree, Gaussian process (GP), random forest (RF) and gene expression programming (GEP) in the prediction of hydraulic properties and the infiltration process (Schaap & Leij 1998; Erzin et al. 2009; Anari et al. 2011; Das et al. 2011; Arshad et al. 2013; Al-Sulaiman 2015; Elbisy 2015; Esmaeelnejad et al. 2015; Al-Sulaiman & Aboukarima 2016; Rahmati 2017; Singh et al. 2017; Tiwari et al. 2017; Sihag et al. 2017b, 2017c; Sihag 2018) and found them to work as well as or better than empirical relations. Very few studies have been carried out on the spatial variability of soil hydraulic properties in arid regions. However, to the best of our knowledge, no study has been carried out for describing the spatial variability of the soil water infiltration process using soft computing techniques in semi-arid regions of India. Thus, the main objective of this research is to develop infiltration models using regression-tree-based computing approaches and to determine the spatial variability at the same scale in some locations of Haryana (India). Both cumulative infiltration and infiltration rate are predicted using random tree, random forest and M5 tree-based models. In the last few decades, conventional infiltration models such as Kostiakov, Philip, Novel, Horton and Holtan have been widely used for the calculation or estimation of infiltration rate and cumulative infiltration. These models are point- and location-specific. The aim of this investigation is to develop a general model for a complete study area using tree-based models.

STUDY AREA AND DATA COLLECTION

Four districts of Haryana State (Hisar, Kurukshetra, Kaithal, and Karnal) are selected for the infiltration measurements (Figure 1). The study area comprises a total of 17 sites represented by the numbers on the district maps (Figure 1).

Figure 1

Location map of the study area showing the measuring points.

The infiltration observations have been recorded in the field with the help of a double ring infiltrometer. The double ring infiltrometer has two concentric rings having diameter 30 and 60 cm for the inner and outer ring, respectively, with 30 cm depth. Both rings are driven about 10 cm deep into the soil by using a hammer without undue disturbance to the soil surface. Both rings are filled with water to equal depth and the initial reading of the water level is noted. The rate of water level drop against time in the inner ring of the infiltrometer is noted at regular intervals until a constant rate is attained, which is called the steady-state infiltration rate. The soil samples for calculating the soil properties are collected from their respective locations selected for the measurement of infiltration rate. The sample is withdrawn along the side of the infiltrometer and represents the soil properties of that location. The properties of the soil samples are measured in the laboratory. The properties of soils respective to each location are listed in Table 1.
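The conversion from inner-ring water-level readings to the two modelled quantities can be illustrated with a minimal Python sketch. It assumes readings are taken as water levels in millimetres at known elapsed times in minutes; the function name and the sample readings are hypothetical and are not data from this study.

```python
def infiltration_from_readings(times_min, levels_mm):
    """Convert inner-ring water-level readings into infiltration series.

    times_min: elapsed time of each reading in minutes (first entry is the start).
    levels_mm: water level in the inner ring at each reading, in mm.
    Returns one (time_min, rate_mm_per_hr, cumulative_mm) tuple per interval.
    """
    results = []
    cumulative = 0.0
    for k in range(1, len(times_min)):
        dt_min = times_min[k] - times_min[k - 1]
        drop_mm = levels_mm[k - 1] - levels_mm[k]  # level falls as water infiltrates
        cumulative += drop_mm
        rate = drop_mm / dt_min * 60.0  # mm per minute converted to mm per hour
        results.append((times_min[k], rate, cumulative))
    return results


# Hypothetical readings for illustration only.
times = [0, 5, 10, 15]
levels = [150.0, 140.0, 134.0, 130.0]
series = infiltration_from_readings(times, levels)
for t, rate, cum in series:
    print(f"t = {t:>3} min  rate = {rate:6.1f} mm/hr  cumulative = {cum:5.1f} mm")
```

The steady-state infiltration rate would then correspond to the value at which successive interval rates stop changing.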

Table 1

Basic properties of soil

Site no.  Sand (%)  Clay (%)  Silt (%)  Dry density (g/cm³)  Moisture content (%)
1         47.50     25.20     27.30     1.64                 12.74
2         50.71     23.19     26.10     1.66                 19.85
3         39.84     52.34     7.82      1.58                 19.74
4         42.85     24.00     33.15     1.58                 18.39
5         48.74     29.38     21.88     1.67                 8.77
6         59.58     30.72     9.70      1.60                 14.21
7         26.63     41.82     31.55     1.54                 18.55
8         46.70     16.56     36.74     1.65                 5.28
9         24.81     39.62     35.57     1.24                 19.06
10        79.73     6.45      13.83     1.60                 8.47
11        84.14     7.83      8.03      1.75                 7.47
12        66.63     19.62     13.75     1.70                 12.71
13        44.27     23.12     32.61     1.65                 11.56
14        45.67     39.16     15.17     1.62                 8.33
15        26.12     66.14     7.74      1.61                 18.60
16        19.73     62.41     17.86     1.61                 15.37
17        32.71     54.21     13.08     1.75                 7.64

DATA SET

The data set used for the modelling in this study is a collection of infiltration observations obtained from field experiments at 17 sites. In this study, infiltration rate (I) and cumulative infiltration are considered as the output variables, whereas time (t), dry density, moisture content (M), and the percentages of sand (Sa), silt (Si), and clay (C) are the input variables. The basic soil properties of each site are measured in laboratory experiments conducted on the field soil samples. The total data set is split into two groups: one used for model development (training data) and the other used for model validation (testing data). The training data comprise 70% of the complete data set, chosen randomly, and the remaining 30% form the testing data. The features of the data set used for modelling the infiltration characteristics of the soil are presented in Table 2.
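A random 70/30 partition of this kind can be sketched in a few lines of Python. This is only an illustration: the function name, the fixed seed and the placeholder records are assumptions, and the study does not report how its random selection was seeded or implemented.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=0):
    """Randomly split records into training and testing subsets."""
    shuffled = list(records)               # copy so the input order is untouched
    random.Random(seed).shuffle(shuffled)  # seed is an assumption, for repeatability
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder records standing in for (inputs, output) observations.
records = list(range(100))
train, test = split_train_test(records)
print(len(train), len(test))
```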

Table 2

Features of data set used in this study

                              Training data                       Testing data
Parameter                     Min.   Max.    Mean   Std. dev.     Min.   Max.    Mean   Std. dev.
Time (minutes)                5.00   180.00  52.08  53.21         5.00   180.00  46.50  47.10
Sand (%)                      19.73  84.14   47.08  17.81         19.73  84.14   44.53  17.66
Clay (%)                      6.45   66.14   32.45  17.29         6.45   66.14   33.98  17.45
Silt (%)                      7.74   36.74   20.47  10.28         7.74   36.74   21.49  10.20
Dry density (g/cm³)           1.24   1.75    1.62   0.11          1.24   1.75    1.61   0.11
Moisture content (%)          5.28   19.85   13.20  4.93          5.28   19.85   13.47  5.00
Infiltration rate (mm/hr)     0.25   252.00  57.49  59.32         0.90   204.00  47.29  45.92
Cumulative infiltration (mm)  1.50   383.00  46.99  64.62         3.00   244.00  38.54  41.61

MODELLING METHODS

Random tree (RT)

Random trees are decision/regression-tree-based models used for classification/regression problems. Random model trees are essentially the combination of two existing algorithms: single model trees are combined with random forest ideas. Every leaf in model trees holds a linear model which is optimized for the local subspace described by that leaf (Pfahringer 2010). The collection of tree predictors in random trees is called a forest. With random features at each node, a random tree is a tree drawn randomly from a set of possible trees. The distribution of trees is uniform and each tree in the set of trees has an equal chance of being sampled. Random trees can be generated efficiently and the ensemble of random trees generally leads to accurate models (Zhao & Zhang 2008).

Random forest (RF)

The random forest proposed by Breiman (2001) is a classification and regression method. It considerably improves on the performance of single decision trees by adding randomness to the training process. The algorithm combines the concept of bagging (Breiman 1996) with random feature selection (Amit & Geman 1997). In bagging, a new training set (bootstrap sample) is drawn with replacement from the original training set for each tree. During the selection of a bootstrap sample, some of the training data may be left out of the sample and some may be repeated in it (Adusumilli et al. 2013). To generate a tree, instead of computing the best split among all attributes at each node, only a random subset of the attributes is chosen at every node, and the best split for that subset is computed.

The procedure for the random forest regression model is as follows:

  1. Draw bootstrap samples from the data set.

  2. Generate a regression tree from each bootstrap sample with the following modification: at each node, choose the best split among a randomly sampled subset of the predictors.

  3. Tune the models for the user-defined parameters: the number of trees grown and the number of variables used at each node to construct a tree.
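The procedure above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the WEKA implementation used in this study: splits minimise the summed squared error of the two children, nodes with fewer than `min_split` samples become leaves, and all function names, parameter names and the toy data are hypothetical.

```python
import random
from statistics import mean


def squared_error(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(X, y, feature_ids):
    """Best (feature, threshold) among feature_ids, or None if nothing helps."""
    best, best_score = None, squared_error(y)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X})[:-1]:
            left = [t for row, t in zip(X, y) if row[f] <= threshold]
            right = [t for row, t in zip(X, y) if row[f] > threshold]
            score = squared_error(left) + squared_error(right)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def grow_tree(X, y, n_vars, rng, min_split=4):
    """Step 2: grow a tree, sampling n_vars candidate predictors at each node."""
    if len(y) < min_split:
        return mean(y)
    split = best_split(X, y, rng.sample(range(len(X[0])), n_vars))
    if split is None:
        return mean(y)
    f, threshold = split
    left = [i for i, row in enumerate(X) if row[f] <= threshold]
    right = [i for i, row in enumerate(X) if row[f] > threshold]
    return (f, threshold,
            grow_tree([X[i] for i in left], [y[i] for i in left], n_vars, rng, min_split),
            grow_tree([X[i] for i in right], [y[i] for i in right], n_vars, rng, min_split))


def predict_tree(tree, row):
    """Follow the splits down to a leaf and return its mean target value."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


def fit_forest(X, y, n_trees, n_vars, seed=0):
    """Steps 1 and 3: one bootstrap sample per tree; n_trees and n_vars are tuned."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # sample with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], n_vars, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return mean([predict_tree(tree, row) for tree in forest])


# Toy data for illustration only: two inputs and a synthetic target.
X = [(t, s) for t in range(1, 21) for s in (0, 1)]
y = [100.0 / t + 5.0 * s for t, s in X]
forest = fit_forest(X, y, n_trees=25, n_vars=1)
pred = predict_forest(forest, (1, 0))
print(round(pred, 2))
```

Because each leaf stores the mean of the training targets that reach it, every forest prediction lies within the range of the training targets.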

M5 tree (M5)

Quinlan (1992) proposed this high-accuracy regression algorithm for dealing with continuous-class problems with high dimensionality. M5 generates a tree-based model which has multivariate linear regression models at the leaves. The first step is to build a tree, which is similar to inducing a regular decision tree. The difference is that the splitting criterion is based on maximizing the expected reduction in error instead of maximizing the information gain at each interior node. The standard deviation reduction (SDR) can be calculated by:
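$$\mathrm{SDR} = \mathrm{sd}(T) - \sum_{i} \frac{|T_i|}{|T|}\,\mathrm{sd}(T_i) \tag{1}$$
where $|T|$ represents the number of instances that reach the node; $|T_i|$ represents the number of instances that have the ith outcome of the potential set; and $\mathrm{sd}(\cdot)$ represents the standard deviation of the target values of the instances in the corresponding set.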
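As a small numerical illustration, the standard deviation reduction of a candidate split can be computed as below. The sketch assumes the usual M5 form, in which the weighted standard deviations of the subsets are subtracted from the standard deviation at the node, and it uses the population standard deviation; both choices, the function name and the toy target values are assumptions made for illustration.

```python
from statistics import pstdev


def sdr(targets, subsets):
    """Standard deviation reduction achieved by splitting targets into subsets."""
    n = len(targets)
    weighted = sum(len(s) / n * pstdev(s) for s in subsets)
    return pstdev(targets) - weighted


# Toy target values reaching a node (not data from the study).
parent = [10.0, 12.0, 11.0, 50.0, 52.0, 51.0]
good = sdr(parent, [parent[:3], parent[3:]])     # separates low from high values
poor = sdr(parent, [parent[::2], parent[1::2]])  # mixes low and high values
print(round(good, 2), round(poor, 2))
```

The split that separates the low and high target values yields the larger reduction, so it would be preferred at this node.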

The second step is to construct a multivariate linear regression function using the instances at each non-terminal node and then prune the built tree back from the leaves. Instead of using all attributes, this multivariate linear regression function only uses the attributes that are referenced by tests or linear functions somewhere in the sub-tree at this node.

The last step is to improve the prediction accuracy by applying some smoothing techniques. The smoothed predicted values of M5 are backed up from the leaf node to the root node (Li & Li 2011).

Performance assessment parameters

The performance of the modelling techniques is assessed using three statistical parameters, namely coefficient of determination (R2), root mean square error (RMSE) and mean absolute error (MAE), which quantify the difference between the actual values and the values predicted by the implemented soft computing approaches.

Coefficient of determination (R2): The coefficient of determination is used to measure the success of numeric prediction. The coefficient of determination (R2) is computed as follows (Adnan et al. 2019; Malik et al. 2019):
$$R^2 = \left( \frac{n\sum_{i=1}^{n} O_i P_i - \sum_{i=1}^{n} O_i \sum_{i=1}^{n} P_i}{\sqrt{n\sum_{i=1}^{n} O_i^2 - \left(\sum_{i=1}^{n} O_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} P_i^2 - \left(\sum_{i=1}^{n} P_i\right)^2}} \right)^2 \tag{2}$$
where $O_i$ is the observed value; $P_i$ is the predicted value; and n is the number of observations. The correlation coefficient is a statistical measure of the strength of the linear relationship between the observed ($O_i$) and predicted ($P_i$) values, and R2 is its square. The range of R2 is 0 to 1.0. A value of 1.0 indicates perfect correlation between the observed and predicted values, while a value of 0.0 indicates that there is no linear relationship between them.

Mean absolute error (MAE)

The mean absolute error is used to measure the success of numeric estimation. The mean absolute error (MAE) is computed as follows (Sammen et al. 2017; Pham et al. 2021):
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| O_i - P_i \right| \tag{3}$$
where $O_i$ and $P_i$ are the observed and predicted values and n is the number of observations.

MAE can range from 0 to ∞. A value of MAE = 0 corresponds to a perfect match between the modelled and observed data.

Root mean square error (RMSE)

Mean-squared error (MSE) is the most commonly used statistical evaluator of performance, and root mean square error is the square root of the mean-squared error, which gives it the same dimensions as the predicted values themselves. Because the differences between predicted and actual values are squared, this measure exaggerates large prediction errors. The root mean square error (RMSE) can be computed from the following equation (Sammen et al. 2020):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( O_i - P_i \right)^2} \tag{4}$$
where $O_i$ and $P_i$ are the observed and predicted values and n is the number of observations.
RMSE can range from 0 to ∞. A value of RMSE = 0 corresponds to an ideal model whose predictions exactly match the observed values.
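The three performance measures can be computed with a short Python sketch such as the one below. It assumes that R2 is taken as the square of the Pearson correlation coefficient between observed and predicted values, which is consistent with the description above but is still an assumption; the function names and the sample values are illustrative only.

```python
import math


def r_squared(observed, predicted):
    """Square of the Pearson correlation between observed and predicted values."""
    n = len(observed)
    so, sp = sum(observed), sum(predicted)
    sop = sum(o * p for o, p in zip(observed, predicted))
    soo = sum(o * o for o in observed)
    spp = sum(p * p for p in predicted)
    r = (n * sop - so * sp) / math.sqrt((n * soo - so ** 2) * (n * spp - sp ** 2))
    return r ** 2


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


# Illustrative values only (not data from the study).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
print(round(r_squared(observed, predicted), 3),
      round(mae(observed, predicted), 3),
      round(rmse(observed, predicted), 3))
```

RMSE is never smaller than MAE for the same data, and the gap between the two widens when a few predictions have large errors.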

EVALUATION OF MODELS

The performance measures adopted for model assessment are coefficient of determination (R2), root mean square error (RMSE) and mean absolute error (MAE). The value of R2 determines the closeness of the regression line to the observed data. R2 = 1 indicates perfect fitting of the regression line to the observed data. RMSE and MAE are the error values that measure the accuracy of the models in approximating the actual observations. Lower values of RMSE and MAE indicate better prediction of the actual data. Two output values: infiltration rate and cumulative infiltration, are predicted relative to the input variables (Table 2) using the regression-tree-based models. In this study, WEKA 3.9 software is used for all applied regression-tree-based model development and validation. The models are developed using the training data set after setting the values of tuning parameters (user-defined) respective to each regression tree model. The performance of the constructed model is tested on the unseen testing data set using the statistical performance measures (R2, RMSE and MAE). Several models are constructed and tested by assigning the values of the user-defined parameters to the modelling techniques and the best one is chosen based on optimal fitting of the model on the testing data set. The procedure is repeated for all the regression tree models using both outputs and the tuning parameters selected for performance analysis are shown in Table 3.

Table 3

Tuning parameters of regression models

                     Tuning parameters
Regression model     Infiltration rate prediction                          Cumulative infiltration prediction
Random model tree    Number of randomly chosen attributes = 4              Number of randomly chosen attributes = 3
Random forest        Number of trees = 200, variables per node = 1         Number of trees = 78, variables per node = 2
M5 tree              Number of instances allowed at a leaf node = 5        Number of instances allowed at a leaf node = 5

RESULTS AND DISCUSSION

Modelling of infiltration rate

The models are formed and checked for accuracy after setting the tuning parameters to near-optimal values (Table 3) in order to predict the values of soil infiltration rate. The simulation results of training and testing data are presented in Figures 2 and 3, respectively, which show the scattering of the data points for the prediction of three regression-tree-based models viz. random tree (RT), random forest (RF) and M5 tree. To investigate the superiority of the modelling techniques, the statistical performance measures acquired by the models with the observed infiltration rate data are elucidated in Table 4. As indicated from Figure 2, the scattering of training data predicted by the M5 tree with both pruned as well as unpruned model is higher relative to random tree and random forest. The higher values of infiltration rate predicted by the M5 tree model scatter relatively more than the lower values and degrade the modelling performance. Statistical measures from the M5 tree confirm the poor performance of model development with the training data as the RMSE and MAE values are substantially higher. The performance of the model is improved with the testing data set, but the testing performance is inferior to the other regression tree models. The trained model based on the random tree is best (R2 = 0.99) as the data prediction is close to the line of perfect agreement in the training stage but the model is not capable of giving accurate predictions with the testing data as compared with the random forest model. The performance of the model developed with RT drops at the validation stage having RMSE and MAE values of 10.82 and 5.98, respectively (Table 4). Closer fitting of predicted points to the perfect agreement line using the RF model in the testing stage reveals its suitability in the generalization of the infiltration rate data as the error values (RMSE = 8.10 and MAE = 5.74) are minimum (Table 4). 
Figure 4 shows the residual error values obtained from the experimental observations and predictions of infiltration rate with the regression tree models. The residuals are minimum with the RT model during the training stage, but during the testing stage the model deviates further from the zero error line than the RF model. The M5 tree-based models have larger residuals during model training; their quality improves during testing, with smaller error values, but they still rank last, after the RT model. Therefore, the modelling results of this study indicate superior performance of the RF model in simulating the actual infiltration rate of soil, followed by the RT and M5 tree models, respectively.

Table 4

Performance of regression models in simulating the infiltration rate

                     Training data                         Testing data
Regression model     R2     RMSE (mm/hr)   MAE (mm/hr)     R2     RMSE (mm/hr)   MAE (mm/hr)
Random model tree    1.00   0.62           0.27            0.95   10.82          5.98
Random forest        0.98   9.93           4.73            0.97   8.10           5.74
M5 tree pruned       0.81   26.81          18.02           0.85   19.08          15.28
M5 tree unpruned     0.81   26.81          17.96           0.85   18.49          14.10

Table 5

Performance of regression models in simulating the cumulative infiltration

                     Training data                   Testing data
Regression model     R2     RMSE (mm)   MAE (mm)     R2     RMSE (mm)   MAE (mm)
Random model tree    0.99   0.92        0.54         0.94   10.76       7.87
Random forest        0.99   10.33       4.52         0.97   6.96        4.44
M5 tree pruned       0.81   28.92       16.09        0.81   18.43       11.21
M5 tree unpruned     0.82   28.68       15.91        0.81   18.11       10.97

Figure 2

Scatter plot of infiltration rate prediction by the regression models using training data.

Figure 3

Scatter plot of infiltration rate prediction by the regression models using testing data.

Figure 4

Residual error values from observed and predicted infiltration rate for training and testing data sets.

Modelling of cumulative infiltration

In this section, the cumulative infiltration of soil is simulated using random tree (RT), random forest (RF) and M5 tree (M5) models with the basic soil properties. The scattering plots of the regression-tree based models for the training and testing data sets are depicted in Figures 5 and 6, respectively. Table 5 provides the information about the statistical measures obtained from the observed training and testing cumulative infiltration data with their corresponding predicted values by the models. The scattering of the training data predicted by the M5 pruned and unpruned models is higher, particularly for the larger values, as compared with the RT and RF models (Figure 5). The predicted data of the RT and RF model fits closely to the perfect agreement line having lower values of RMSE and MAE than the M5 tree in the model development stage. The RT model perfectly predicts the training data with a maximum value of R2 = 0.99 and minimum values of RMSE = 0.92 and MAE = 0.54 as most of the data points reside on the perfect agreement line, but the prediction capability of the model degrades during validation with the testing data. Analysing the testing data scatter plot (Figure 6), the prediction tendency of the RF model is superior as compared with the other regression models as the predicted points are comparatively closer to the perfect agreement line than the other models. The lowest values of RMSE = 6.96 and MAE = 4.44 obtained from the RF model at the testing stage affirm the stronger capability of the model in the generalization of the cumulative infiltration data (Table 5). So, statistical performance measures acquired from the testing data set suggest better performance by the RF model as compared with the RT and M5 tree models in approximating the cumulative infiltration of soil. Figure 7 shows the residual errors observed with the current data set using different regression models. 
The residual error values peak in the training stage, mostly predicted by the M5 tree model, while the RT model has smaller residual values. The residuals are minimum with the RF model relative to the RT model in the testing stage, which assures its higher predictability.

Figure 5

Scatter plot of cumulative infiltration prediction by the regression models using training data.

Figure 6

Scatter plot of cumulative infiltration prediction by the regression models using testing data.

Figure 7

Residual error values from observed and predicted cumulative infiltration for training and testing data sets.

SIMULATION OF INFILTRATION CHARACTERISTICS OF SOIL BY RANDOM FOREST MODEL

Based on the modelling results discussed above, it is clear that the RF model has higher generalization performance in simulating the infiltration rate as well as cumulative infiltration of soil as compared with the other applied models. In this section, infiltration characteristics experimentally observed from their respective sites are compared with the predicted values of the RF model. In order to accomplish this task, the RF model developed with the training data set is tested on the complete data set of the locations. Both outputs, infiltration rate and cumulative infiltration, observed from the field are presented with the RF model-based predictions for their respective sites (Figure 8). The model trained for both outputs using the tuning parameters (Table 3) is implemented to their respective data sets for prediction. For comparative analysis, the results are plotted for each site (specified by the location number). The infiltration rate/cumulative infiltration values are plotted against time to show the variation of the model in predicting the observed data. The results reveal that the prediction of points by the RF model agrees reasonably well with the observed infiltration values, particularly in predicting the infiltration rate, and the model has a decent tendency to follow the actual observations.

Figure 8

Comparison of observed and RF model predicted infiltration characteristics of soil respective to each site.

PARAMETRIC SENSITIVITY

The sensitivity of each input variable in predicting the output variables (infiltration rate and cumulative infiltration of soil) is tested. The random forest (RF) model developed with the training data set is used to study the parametric importance in prediction. Models are developed using various combinations of input parameters by removing each parameter one by one, and the sensitivity is judged by means of R2 and RMSE. The fluctuation in the statistical measures across the input combinations reflects the change in model performance in the prediction of both outputs. Table 6 reveals that the removal of time (t) from the input parameters causes the largest fluctuation in both R2 and RMSE. This combination drastically increases the RMSE value, which makes time (t) the most important input variable in the prediction of infiltration rate as well as cumulative infiltration. The absence of any other input parameter in the RF model yields only a small change in R2 and RMSE values. Removal of the parameters sand, clay and silt, one by one, marginally decreases the RMSE and hence slightly improves the efficiency of the model in predicting the infiltration rate, whereas neglecting these parameters has a negative effect on cumulative infiltration prediction, with a slight increase in RMSE values (Table 6).
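The one-at-a-time removal scheme can be expressed as a short Python sketch. The scoring callable is a hypothetical stand-in for training and evaluating a model on the retained inputs; the toy scorer below returns made-up numbers purely to exercise the loop and does not reproduce any result of this study. The label `dry_density` is likewise an illustrative name.

```python
def leave_one_out_sensitivity(feature_names, evaluate):
    """Score the full input set, then each set with one input removed.

    evaluate: callable that takes a list of input names and returns an error
    score such as RMSE (hypothetical; it would train and test a model).
    Returns a dict mapping the removed input (None for the full set) to its score.
    """
    results = {None: evaluate(list(feature_names))}
    for name in feature_names:
        kept = [f for f in feature_names if f != name]
        results[name] = evaluate(kept)
    return results


def toy_rmse(features):
    # Stand-in scorer, not a trained model: the error jumps when "t" is absent.
    return 10.0 if "t" in features else 40.0


inputs = ["t", "Sa", "Si", "C", "M", "dry_density"]
scores = leave_one_out_sensitivity(inputs, toy_rmse)
# The most influential input is the one whose removal raises the error most.
most_important = max(inputs, key=lambda name: scores[name] - scores[None])
print(most_important)
```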

Table 6

Sensitivity analysis of parameters using random forest (RF) model

                                   Infiltration rate prediction   Cumulative infiltration prediction
Combination   Parameter removed    R2     RMSE                    R2     RMSE
              –                    0.98   9.93                    0.99   10.34
              t                    0.61   37.08                   0.39   50.40
              Sa                   0.98   9.53                    0.99   10.46
                                   0.98   9.41                    0.99   11.49
                                   0.98   9.42                    0.99   10.65
                                   0.98   9.45                    0.99   10.55
                                   0.98   10.49                   0.99   10.65

Infiltration is the flow of water from above ground into the subsurface. It has received a great deal of attention because of its importance to topics ranging from irrigation and contaminant transport to groundwater recharge and ecosystem viability. Over the last few decades, conventional infiltration models such as Kostiakov, Philip, Novel, Horton and Holtan have been widely used to estimate the infiltration rate and cumulative infiltration. Machiwal et al. (2006) observed that the infiltration process was well described by the Philip model in a wasteland of Kharagpur, India. However, soil management practices influence the final infiltration rate and are therefore a major factor in deciding the applicability of these models. Thus, the variability of soil infiltration characteristics and the goodness of fit of the infiltration models for different soils should be given due consideration in infiltration modelling studies for predicting the constant infiltration rate. The applicability of these models for estimating the infiltration rate under different soil management has been examined by several researchers. Gifford (1976) observed that, amongst the Horton, Kostiakov and Philip models, the Horton model best fitted the infiltration data from mostly semi-arid regions of Australia, but under specific conditions only. Mishra et al. (2003) compared the performance of 14 different conventional infiltration models using 243 sets of infiltration data collected from field and laboratory tests conducted in India and the USA on soils ranging from coarse sand to fine clay. These models are point- or location-specific, and until now no general model has been developed.

The results of this study suggest that the RF-based model is suitable for predicting the infiltration characteristics of the study area. The RF model is a general model for the study area and gives predictions close to the observed values. It is therefore used to represent the infiltration characteristics of the study area, and it saves time and effort in comparison with experimentation and conventional models.
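For readers less familiar with the conventional models named above, the following minimal Python sketch evaluates the infiltration rate for three of them in their commonly cited textbook forms: Kostiakov (cumulative infiltration F = a·t^b, so f = a·b·t^(b−1)), Philip's two-term equation (f = 0.5·S·t^(−1/2) + A) and Horton (f = fc + (f0 − fc)·e^(−kt)). The function names and the parameter values in the usage loop are illustrative assumptions and are not fitted to the data of this study.

```python
import math


def kostiakov_rate(t, a, b):
    """Kostiakov: cumulative F(t) = a * t**b, so rate f(t) = a * b * t**(b - 1)."""
    return a * b * t ** (b - 1)


def philip_rate(t, sorptivity, a_term):
    """Philip two-term: f(t) = 0.5 * S * t**-0.5 + A."""
    return 0.5 * sorptivity / math.sqrt(t) + a_term


def horton_rate(t, f0, fc, k):
    """Horton: f(t) = fc + (f0 - fc) * exp(-k * t)."""
    return fc + (f0 - fc) * math.exp(-k * t)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the decaying shape.
    for t in (1.0, 4.0, 16.0):
        print(
            t,
            round(kostiakov_rate(t, a=2.0, b=0.5), 3),
            round(philip_rate(t, sorptivity=2.0, a_term=1.0), 3),
            round(horton_rate(t, f0=10.0, fc=2.0, k=0.5), 3),
        )
```

Each of these forms has only two or three parameters that must be fitted per site, which is one way to see why such models tend to be location-specific.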

CONCLUSIONS

Prediction of infiltration characteristics is essential for several hydrologic studies associated with surface and subsurface flow. In this study, the effectiveness of regression-tree-based models in simulating the infiltration characteristics of soil is evaluated. In estimating the infiltration rate and cumulative infiltration of soil, random forest outperforms both random tree and M5 tree, with R2 = 0.97 and 0.97, RMSE = 8.10 and 6.96 and MAE = 5.74 and 4.44, respectively. The training and testing results of the random forest support the use of this method over the other tested methods and show good potential for representing the infiltration characteristics of the study area. Based on the sensitivity results, time is the most important parameter affecting the prediction performance for both the infiltration rate and the cumulative infiltration of soil.
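The three performance statistics quoted above can be computed from paired observed and predicted values as in the following minimal sketch. It assumes that R2 is defined as 1 − SS_res/SS_tot; some studies instead report the squared correlation coefficient. The sample numbers in the usage lines are made up for illustration and are not data from this study.

```python
import math


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up values for illustration only.
    obs = [1.0, 2.0, 3.0, 4.0]
    pred = [1.0, 2.0, 3.0, 5.0]
    print(r_squared(obs, pred), rmse(obs, pred), mae(obs, pred))
```

RMSE and MAE carry the units of the predicted quantity, so they are comparable between models for the same target but not between the infiltration rate and the cumulative infiltration.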

Finally, it is worth noting that the proposed approach uses only the six fundamental parameters of t, , M, Sa, Si and C for prediction of the infiltration rate and cumulative infiltration of soil. This is important in situations where other parameters, such as porosity and the soil type at lower depths, cannot be measured. Considering the ability of the RF model to handle missing values in data sets as well as its capacity to bypass noisy data, the implemented soft computing methodology shows potential for accurately approximating the infiltration process. The RF model should be compared with hybrid techniques in an effort to develop a more accurate model. More experimental data are also required for better and more general results.

DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

REFERENCES

Adnan R. M., Malik A., Kumar A., Parmar K. S. & Kisi O. 2019 Pan evaporation modeling by three different neuro-fuzzy intelligent systems using climatic inputs. Arabian Journal of Geosciences 12 (19), 606.
Adusumilli S., Bhatt D., Wang H., Bhattacharya P. & Devabhaktuni V. 2013 A low-cost INS/GPS integration methodology based on random forest regression. Expert Systems with Applications 40 (11), 4653–4659.
Al-Sulaiman M. A. 2015 Applying of an adaptive neuro fuzzy inference system for prediction of unsaturated soil hydraulic conductivity. Biosciences Biotechnology Research Asia 12 (3), 2261–2272.
Al-Sulaiman M. A. & Aboukarima A. M. 2016 Distribution of natural radionuclides in the surface soil in some areas of agriculture and grazing located in west of Riyadh, Saudi Arabia. Journal of Applied Life Sciences International 7 (2), 1–12.
Amit Y. & Geman D. 1997 Shape quantization and recognition with randomized trees. Neural Computation 9 (7), 1545–1588.
Anari P. L., Darani H. S. & Nafarzadegan A. R. 2011 Application of ANN and ANFIS models for estimating total infiltration rate in an arid rangeland ecosystem. Research Journal of Environmental Sciences 5 (3), 236–247.
Arshad R. R., Sayyad G., Mosaddeghi M. & Gharabaghi B. 2013 Predicting saturated hydraulic conductivity by artificial intelligence and regression models. ISRN Soil Science 2013, 308159.
Bhave S. & Sreeja P. 2013 Influence of initial soil condition on infiltration characteristics determined using a disk infiltrometer. ISH Journal of Hydraulic Engineering 19 (3), 291–296.
Breiman L. 1996 Bagging predictors. Machine Learning 24 (2), 123–140.
Breiman L. 2001 Random forests. Machine Learning 45 (1), 5–32.
Chen L. & Young M. H. 2006 Green–Ampt infiltration model for sloping surfaces. Water Resources Research 42 (7), W07420.
Das T., Dettinger M. D., Cayan D. R. & Hidalgo H. G. 2011 Potential increase in floods in California's Sierra Nevada under future climate projections. Climatic Change 109 (1), 71–94.
Diamond J. & Shanley T. 2003 Infiltration rate assessment of some major soils. Irish Geography 36 (1), 32–46.
Erzin Y., Gumaste S. D., Gupta A. K. & Singh D. N. 2009 Artificial neural network (ANN) models for determining hydraulic conductivity of compacted fine-grained soils. Canadian Geotechnical Journal 46 (8), 955–968.
Esmaeelnejad L., Ramezanpour H., Seyedmohammadi J. & Shabanpour M. 2015 Selection of a suitable model for the prediction of soil water content in north of Iran. Spanish Journal of Agricultural Research 13 (1), e12-002.
Green W. H. & Ampt G. A. 1911 Studies on soil physics. The Journal of Agricultural Science 4 (1), 1–24.
Haghighi F., Kheirkhah M. & Saghafian B. 2011 Evaluation of soil hydraulic parameters in soils and land use change. In: Earth and Environmental Sciences (I. A. Dar & M. A. Dar, eds), InTech, Rijeka, Croatia, pp. 457–464.
Hillel D. 1998 Environmental Soil Physics: Fundamentals, Applications, and Environmental Considerations. Elsevier, London, UK.
Holtan H. N. 1961 A Concept for Infiltration Estimates in Watershed Engineering. US Department of Agriculture, Washington, DC, USA.
Horton R. E. 1941 An approach toward a physical interpretation of infiltration-capacity. Soil Science Society of America Journal 5 (C), 399–417.
Kostiakov A. N. 1932 On the dynamics of the coefficient of water percolation in soils and the necessity of studying it from the dynamic point of view for the purposes of amelioration. Trans. Sixth Comm. Int. Soc. Soil Sci. 1, 7–21.
Li C. & Li H. 2011 Learning random model trees for regression. International Journal of Computers and Applications 33 (3), 258–265.
Machiwal D., Jha M. K. & Mal B. C. 2006 Modelling infiltration and quantifying spatial soil variability in a wasteland of Kharagpur, India. Biosystems Engineering 95 (4), 569–582.
Malik A., Kumar A. & Singh R. P. 2019 Application of heuristic approaches for prediction of hydrological drought using multi-scalar streamflow drought index. Water Resources Management 33 (11), 3985–4006.
Mishra S. K., Tyagi J. V. & Singh V. P. 2003 Comparison of infiltration models. Hydrological Processes 17 (13), 2629–2652.
Pfahringer B. 2010 Random Model Trees: An Effective and Scalable Regression Method, Working Paper 03/2010. Department of Computer Science, University of Waikato, Hamilton, New Zealand.
Pham Q. B., Mohammadpour R., Linh N. T. T., Mohajane M., Pourjasem A., Sammen S. S., Anh D. T. & Nam V. T. 2021 Application of soft computing to predict water quality in wetland. Environmental Science and Pollution Research 28, 185–200. https://doi.org/10.1007/s11356-020-10344-8.
Quinlan J. R. 1992 Learning with continuous classes. In: AI '92: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence (A. Adams & L. Sterling, eds), World Scientific, Singapore, pp. 343–348.
Sammen S. S., Mohamed T. A., Ghazali A. H., El-Shafie A. H. & Sidek L. M. 2017 Generalized regression neural network for prediction of peak outflow from dam breach. Water Resources Management 31, 549–562. https://doi.org/10.1007/s11269-016-1547-8.
Sammen S. S., Ghorbani M. A., Malik A., Tikhamarine Y., AmirRahmani M., Al-Ansari N. & Chau K.-W. 2020 Enhanced artificial neural network with harris hawks optimization for predicting scour depth downstream of ski-jump spillway. Applied Sciences 10, 5160. https://doi.org/10.3390/app10155160.
Schaap M. G. & Leij F. J. 1998 Using neural networks to predict soil water retention and soil hydraulic conductivity. Soil and Tillage Research 47 (1–2), 37–42.
Sihag P., Tiwari N. K. & Ranjan S. 2017a Estimation and inter-comparison of infiltration models. Water Science 31 (1), 34–43.
Sihag P., Tiwari N. K. & Ranjan S. 2017b Modelling of infiltration of sandy soil using gaussian process regression. Modeling Earth Systems and Environment 3 (3), 1091–1100.
Sihag P., Tiwari N. K. & Ranjan S. 2017c Prediction of unsaturated hydraulic conductivity using adaptive neuro-fuzzy inference system (ANFIS). ISH Journal of Hydraulic Engineering 25 (2), 132–142. doi:10.1080/09715010.2017.1381861.
Sihag P. 2018 Prediction of unsaturated hydraulic conductivity using fuzzy logic and artificial neural network. Modeling Earth Systems and Environment 4 (1), 189–198.
Sihag P., Tiwari N. K. & Ranjan S. 2018a Support vector regression-based modeling of cumulative infiltration of sandy soil. ISH Journal of Hydraulic Engineering 26 (1), 44–50. doi:10.1080/09715010.2018.1439776.
Sihag P., Singh B., Sepah Vand A. & Mehdipour V. 2018b Modeling the infiltration process with soft computing techniques. ISH Journal of Hydraulic Engineering 26 (2), 138–152. doi:10.1080/09715010.2018.1464408.
Singh V. P. & Yu F. X. 1990 Derivation of infiltration equation using systems approach. Journal of Irrigation and Drainage Engineering 116 (6), 837–858.
Singh B., Sihag P. & Singh K. 2017 Modelling of impact of water quality on infiltration rate of soil by random forest regression. Modeling Earth Systems and Environment 3 (3), 999–1004.
Singh B., Sihag P. & Singh K. 2018 Comparison of infiltration models in NIT Kurukshetra campus. Applied Water Science 8 (2), 63.
Smith R. E. & Parlange J.-Y. 1978 A parameter-efficient hydrologic infiltration model. Water Resources Research 14 (3), 533–538.
Soil Conservation Service 1972 SCS National Engineering Handbook, Section 4: Hydrology. US Department of Agriculture, Washington, DC, USA.
Swartzendruber D. 1987 A quasi-solution of Richards’ Equation for the downward infiltration of water into soil. Water Resources Research 23 (5), 809–817.
Tiwari N. K., Sihag P. & Ranjan S. 2017 Modeling of infiltration of soil using adaptive neuro-fuzzy inference system (ANFIS). Journal of Engineering & Technology Education 11 (1), 13–21.
Zhao Y. & Zhang Y. 2008 Comparison of decision tree methods for finding active objects. Advances in Space Research 41 (12), 1955–1959.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).