## Abstract

Understanding the infiltration process is essential to hydrological studies, and accurate predictions of infiltration characteristics are required to understand how water moves through the soil surface into the subsurface. The aim of the current study is to simulate and improve the prediction accuracy of the infiltration rate and cumulative infiltration of soil using regression tree methods. Experimental data recorded with a double ring infiltrometer at 17 different sites are used in this study. Three regression tree methods, random tree, random forest (RF) and M5 tree, are employed to model the infiltration characteristics using basic soil characteristics. The performance of the modelling approaches is compared in predicting the infiltration rate as well as cumulative infiltration, and the results suggest that the RF model performs better than the other applied models, with coefficient of determination (*R*^{2}) = 0.97 and 0.97, root mean square error (RMSE) = 8.10 mm/hr and 6.96 mm, and mean absolute error (MAE) = 5.74 mm/hr and 4.44 mm for infiltration rate and cumulative infiltration, respectively. The RF model is used to represent the infiltration characteristics of the study area. Moreover, parametric sensitivity is adopted to study the significance of each input parameter in estimating the infiltration process. The results suggest that time (*t*) is the most influential parameter in predicting the infiltration process using this data set.

## HIGHLIGHTS

Infiltration rate and cumulative infiltration characteristics in a semi-arid region were modelled using regression trees.

Random forest outperforms random tree as well as M5 tree to predict the infiltration characteristics.

Time is observed as the most important parameter in affecting the prediction performance of both the infiltration rate and the cumulative infiltration of soil.

## INTRODUCTION

The infiltration process influences the hydrological cycle, stream flow, groundwater recharge, irrigation, soil erosion and drainage systems (Singh *et al.* 2018). Infiltration can be defined as irrigation or precipitation water moving into the soil (Sihag *et al.* 2018a, 2018b). The actual rate at which water moves into the soil is termed the infiltration rate (Haghighi *et al.* 2011). The infiltration process plays a significant role in the selection of crop type and agricultural practices. Prediction of infiltration is also necessary for the efficient design of irrigation systems (Hillel 1998). The sustainability of the groundwater system also depends on this process (Chen & Young 2006).

The quantity and quality of recharged water are the most important parameters for water resources management, especially in regions where water scarcity occurs. Infiltration divides water into the two most important hydrological segments, groundwater flow and surface flow. It is the main factor in watershed modelling for the prediction of surface runoff. Accurate prediction of infiltration rate is essential for reliable prediction of surface runoff (Diamond & Shanley 2003). At catchment level, infiltration characteristics are one of the main factors in determining the flooding condition (Bhave & Sreeja 2013). The significance of the infiltration process has led soil and water researchers to develop several models (e.g. Green & Ampt 1911; Kostiakov 1932; Horton 1941; Philip 1957; Holtan 1961; Soil Conservation Service 1972; Smith & Parlange 1978; Swartzendruber 1987; Singh & Yu 1990; Sihag *et al.* 2017a). However, only a few of these models have been used effectively with field data. Parameter calculation is the main criterion for selecting one model over another. These infiltration models can be divided into three groups: physical models, semi-empirical models and empirical models.

The main problem in modelling the infiltration process is the variation of soil type and texture (Machiwal *et al.* 2006). Variation in the infiltration process within a watershed makes water management for agricultural practices difficult. The process of infiltration is influenced by several factors, including soil type, texture, soil hydraulic properties, soil moisture profile, and precipitation and climate properties. The spatial and temporal behaviour of the infiltration process in the real environment cannot at present be deduced by direct or indirect methods. Some infiltration models are based on the law of mass conservation and Darcy's law (physical models), a few rely on field and laboratory measurements (empirical models), and others are based on simple hypotheses about the relation between infiltration rate and cumulative infiltration (semi-empirical models). These infiltration models are developed for vertically homogeneous soils with constant initial moisture content, constant density and horizontal surfaces. Very few researchers have attempted to use soft computing in the prediction of soil infiltration (Sy 2006; Sihag *et al.* 2018a, 2018b).

During the last few years, several researchers have used soft computing techniques such as adaptive neuro fuzzy inference system (ANFIS), artificial neural network (ANN), support vector machine (SVM), M5 tree, Gaussian process (GP), random forest (RF) and gene expression programming (GEP) in the prediction of hydraulic properties and the infiltration process (Schaap & Leij 1998; Erzin *et al.* 2009; Anari *et al.* 2011; Das *et al.* 2011; Arshad *et al.* 2013; Al-Sulaiman 2015; Elbisy 2015; Esmaeelnejad *et al.* 2015; Al-Sulaiman & Aboukarima 2016; Rahmati 2017; Singh *et al.* 2017; Tiwari *et al.* 2017; Sihag *et al.* 2017b, 2017c; Sihag 2018) and found them to work as well as or better than empirical relations. Very few studies have been carried out on the spatial variability of soil hydraulic properties in arid regions. Moreover, to the best of our knowledge, no study has described the spatial variability of the soil water infiltration process using soft computing techniques in semi-arid regions of India. Thus, the main objective of this research is to develop infiltration models using regression-tree-based computing approaches and to determine the spatial variability at the same scale in some locations of Haryana (India). Both the cumulative infiltration and the infiltration rate are predicted using random tree, random forest and M5 tree-based models. In the last few decades, conventional infiltration models such as Kostiakov, Philip, Novel, Horton and Holtan have been widely used for the calculation or estimation of infiltration rate and cumulative infiltration. These models are point- and location-specific. The aim of this investigation is to develop a general model for a complete study area using tree-based models.

## STUDY AREA AND DATA COLLECTION

Four districts of Haryana State (Hisar, Kurukshetra, Kaithal, and Karnal) are selected for the infiltration measurements (Figure 1). The study area comprises a total of 17 sites represented by the numbers on the district maps (Figure 1).

The infiltration observations have been recorded in the field with the help of a double ring infiltrometer. The double ring infiltrometer has two concentric rings having diameter 30 and 60 cm for the inner and outer ring, respectively, with 30 cm depth. Both rings are driven about 10 cm deep into the soil by using a hammer without undue disturbance to the soil surface. Both rings are filled with water to equal depth and the initial reading of the water level is noted. The rate of water level drop against time in the inner ring of the infiltrometer is noted at regular intervals until a constant rate is attained, which is called the steady-state infiltration rate. The soil samples for calculating the soil properties are collected from their respective locations selected for the measurement of infiltration rate. The sample is withdrawn along the side of the infiltrometer and represents the soil properties of that location. The properties of the soil samples are measured in the laboratory. The properties of soils respective to each location are listed in Table 1.

| Site no. | Sand (%) | Clay (%) | Silt (%) | Dry density (g/cm^{3}) | Moisture content (%) |
|---|---|---|---|---|---|
| 1 | 47.50 | 25.20 | 27.30 | 1.64 | 12.74 |
| 2 | 50.71 | 23.19 | 26.10 | 1.66 | 19.85 |
| 3 | 39.84 | 52.34 | 7.82 | 1.58 | 19.74 |
| 4 | 42.85 | 24.00 | 33.15 | 1.58 | 18.39 |
| 5 | 48.74 | 29.38 | 21.88 | 1.67 | 8.77 |
| 6 | 59.58 | 30.72 | 9.70 | 1.60 | 14.21 |
| 7 | 26.63 | 41.82 | 31.55 | 1.54 | 18.55 |
| 8 | 46.70 | 16.56 | 36.74 | 1.65 | 5.28 |
| 9 | 24.81 | 39.62 | 35.57 | 1.24 | 19.06 |
| 10 | 79.73 | 6.45 | 13.83 | 1.60 | 8.47 |
| 11 | 84.14 | 7.83 | 8.03 | 1.75 | 7.47 |
| 12 | 66.63 | 19.62 | 13.75 | 1.70 | 12.71 |
| 13 | 44.27 | 23.12 | 32.61 | 1.65 | 11.56 |
| 14 | 45.67 | 39.16 | 15.17 | 1.62 | 8.33 |
| 15 | 26.12 | 66.14 | 7.74 | 1.61 | 18.60 |
| 16 | 19.73 | 62.41 | 17.86 | 1.61 | 15.37 |
| 17 | 32.71 | 54.21 | 13.08 | 1.75 | 7.64 |
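The double ring infiltrometer procedure described above yields a series of water-level readings in the inner ring, from which both model outputs can be derived. The following is a minimal sketch of that arithmetic, assuming readings are taken without refilling the ring between them; the function name and the example readings are hypothetical and are not taken from the field data.

```python
def infiltration_series(times_min, levels_mm):
    """Convert inner-ring water levels into infiltration rate and cumulative infiltration.

    times_min: elapsed time of each reading in minutes (first reading at the start).
    levels_mm: water level in the inner ring at each reading, in mm.
    Returns (rates in mm/hr, cumulative infiltration in mm), one value per interval.
    """
    rates, cumulative = [], []
    total = 0.0
    for k in range(1, len(times_min)):
        drop_mm = levels_mm[k - 1] - levels_mm[k]      # water that entered the soil
        interval_min = times_min[k] - times_min[k - 1]
        total += drop_mm
        rates.append(drop_mm * 60.0 / interval_min)    # mm per minute scaled to mm/hr
        cumulative.append(total)
    return rates, cumulative


# Hypothetical readings at 5-minute intervals (illustrative values only).
times = [0, 5, 10, 15, 20]
levels = [100.0, 90.0, 84.0, 80.0, 76.0]
rates, cumulative = infiltration_series(times, levels)
```

In this illustration the rate falls from interval to interval and then levels off, which is the pattern the steady-state infiltration rate refers to.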


## DATA SET

The data set used for the modelling in this study is a collection of infiltration observations obtained from field experiments at 17 sites. In this study, infiltration rate (*I*) and cumulative infiltration are considered the output variables, whereas time (*t*), dry density, moisture content (*M*), and the percentages of sand (*Sa*), silt (*Si*), and clay (*C*) are the input variables. The basic soil properties for each site are measured in laboratory experiments conducted on the field soil samples. The total data set is split into two groups: one used for model development (training data) and the other used for model validation (testing data). The training data comprise 70% of the complete data set, chosen randomly, and the remaining 30% form the testing data. The features of the data set used for modelling the infiltration characteristics of the soil are presented in Table 2.

| Parameter | Training min. | Training max. | Training mean | Training standard deviation | Testing min. | Testing max. | Testing mean | Testing standard deviation |
|---|---|---|---|---|---|---|---|---|
| Time, *t* (minutes) | 5.00 | 180.00 | 52.08 | 53.21 | 5.00 | 180.00 | 46.50 | 47.10 |
| Sand, *Sa* (%) | 19.73 | 84.14 | 47.08 | 17.81 | 19.73 | 84.14 | 44.53 | 17.66 |
| Clay, *C* (%) | 6.45 | 66.14 | 32.45 | 17.29 | 6.45 | 66.14 | 33.98 | 17.45 |
| Silt, *Si* (%) | 7.74 | 36.74 | 20.47 | 10.28 | 7.74 | 36.74 | 21.49 | 10.20 |
| Dry density (g/cm^{3}) | 1.24 | 1.75 | 1.62 | 0.11 | 1.24 | 1.75 | 1.61 | 0.11 |
| Moisture content, *M* (%) | 5.28 | 19.85 | 13.20 | 4.93 | 5.28 | 19.85 | 13.47 | 5.00 |
| Infiltration rate, *I* (mm/hr) | 0.25 | 252.00 | 57.49 | 59.32 | 0.90 | 204.00 | 47.29 | 45.92 |
| Cumulative infiltration (mm) | 1.50 | 383.00 | 46.99 | 64.62 | 3.00 | 244.00 | 38.54 | 41.61 |
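The random 70/30 split and the summary statistics reported for each subset can be illustrated with a short sketch. It uses synthetic records rather than the field data, and the helper names, the fixed seed and the use of the sample standard deviation are assumptions made for illustration; the study does not state how its split or statistics were computed.

```python
import random
import statistics


def split_train_test(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and cut it into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def summarise(values):
    """Min, max, mean and sample standard deviation of one variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


# Synthetic records: time in minutes and a decaying infiltration rate (not field data).
records = [{"t": float(t), "I": 200.0 / (1.0 + 0.1 * t)} for t in range(5, 185, 5)]
train, test = split_train_test(records)
train_summary = summarise([r["t"] for r in train])
test_summary = summarise([r["t"] for r in test])
```

Comparing the two summaries is a quick check that a random split has left both subsets covering a similar range of each variable.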


## MODELLING METHODS

### Random tree (RT)

Random trees are decision/regression-tree-based models used for classification/regression problems. Random model trees are essentially the combination of two existing algorithms: single model trees are combined with random forest ideas. Every leaf in model trees holds a linear model which is optimized for the local subspace described by that leaf (Pfahringer 2010). The collection of tree predictors in random trees is called a forest. With random features at each node, a random tree is a tree drawn randomly from a set of possible trees. The distribution of trees is uniform and each tree in the set of trees has an equal chance of being sampled. Random trees can be generated efficiently and the ensemble of random trees generally leads to accurate models (Zhao & Zhang 2008).

### Random forest (RF)

The random forest proposed by Breiman (2001) is a classification and regression method. The performance of single decision trees is shown to be improved considerably by adding randomness to the training data using a random forest. The algorithm combines the concept of bagging (Breiman 1996) and the random selection feature (Amit & Geman 1997). In bagging, the training data is sampled with replacement for each tree. Therefore, a new training set (bootstrap samples) is drawn with replacement from the original training set. During the selection of the bootstrap samples some of the training data may be left out of the sample and some may be repeated in the sample (Adusumilli *et al.* 2013). To generate a tree, instead of computing the best split among attributes for each node, only a random subset of all attributes is chosen at every node, and the best split for that subset is computed.

The procedure for the random forest regression model is as follows:

1. Draw bootstrap samples from the data set, one for each tree to be grown.
2. Generate a regression tree from each bootstrap sample with the following modification: at each node, choose the best split from among a randomly sampled subset of the predictors.
3. Tune the user-defined parameters: the number of trees grown and the number of variables used at each node to construct a tree.
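The procedure above can be sketched in plain Python. This is a simplified illustration and not the WEKA implementation used in the study: the names `n_trees` and `n_try` stand for the two user-defined parameters, the split criterion is the summed squared error of the two children, the forest prediction is the average over trees (the usual choice for regression forests), and the data are synthetic.

```python
import random


def sse(ys):
    """Sum of squared deviations of ys from their mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def grow_tree(rows, targets, n_try, rng, min_leaf=2):
    """Grow one regression tree; a leaf is a float, a split is a 4-tuple."""
    mean = sum(targets) / len(targets)
    if len(rows) < 2 * min_leaf or len(set(targets)) == 1:
        return mean
    n_features = len(rows[0])
    best = None
    # Visit features in random order and stop once n_try of them have been
    # examined and at least one valid split has been found.
    for count, f in enumerate(rng.sample(range(n_features), n_features), start=1):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [i for i, row in enumerate(rows) if row[f] <= thr]
            right = [i for i, row in enumerate(rows) if row[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse([targets[i] for i in left]) + sse([targets[i] for i in right])
            if best is None or score < best[0]:
                best = (score, f, thr, left, right)
        if count >= n_try and best is not None:
            break
    if best is None:
        return mean
    _, f, thr, left, right = best

    def subset(idx):
        return [rows[i] for i in idx], [targets[i] for i in idx]

    return (f, thr,
            grow_tree(*subset(left), n_try, rng, min_leaf),
            grow_tree(*subset(right), n_try, rng, min_leaf))


def predict_tree(node, row):
    """Follow the splits down to a leaf and return its mean target value."""
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if row[f] <= thr else right
    return node


def fit_forest(rows, targets, n_trees, n_try, seed=0):
    """Step 1: bootstrap sample per tree; step 2: grow a randomised tree on it."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sampling with replacement
        forest.append(grow_tree([rows[i] for i in idx], [targets[i] for i in idx], n_try, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return sum(predict_tree(tree, row) for tree in forest) / len(forest)


# Synthetic data: inputs are (time in minutes, sand %), output decays with time.
rows = [[float(t), float(s)] for t in range(5, 185, 5) for s in (20, 50, 80)]
targets = [200.0 / (1.0 + 0.1 * t) + 0.5 * s for t, s in rows]
forest = fit_forest(rows, targets, n_trees=20, n_try=1)
pred_early = predict_forest(forest, [5.0, 50.0])
pred_late = predict_forest(forest, [180.0, 50.0])
```

Step 3 of the procedure corresponds to repeating `fit_forest` for different values of `n_trees` and `n_try` and keeping the combination with the lowest error.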

### M5 tree (M5)

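The M5 model tree is constructed in three steps. In the first step, the tree is grown by splitting the data at each node on the attribute that maximises the expected error reduction, expressed as the standard deviation reduction (SDR):

$$\mathrm{SDR} = sd(T) - \sum_{i} \frac{|T_i|}{|T|}\, sd(T_i)$$

where $T$ is the set of instances that reaches the node; $T_i$ is the subset of instances that has the *i*^{th} outcome of the potential set; and $sd$ represents the standard deviation of the target values of the instances in the data set.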

The second step is to construct a multivariate linear regression function using the instances at each non-terminal node and then prune the built tree back from the leaves. Instead of using all attributes, this multivariate linear regression function only uses the attributes that are referenced by tests or linear functions somewhere in the sub-tree at this node.

The last step is to improve the prediction accuracy by applying some smoothing techniques. The smoothed predicted values of M5 are backed up from the leaf node to the root node (Li & Li 2011).
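The tree-growing step of M5 is usually described in terms of the standard deviation reduction (SDR), that is, the standard deviation of the target values at a node minus the size-weighted standard deviations of the subsets produced by a candidate split. The sketch below assumes that criterion and uses made-up values; the function names are hypothetical, and the pruning and smoothing steps are not shown.

```python
import statistics


def sdr(targets, subsets):
    """Standard deviation reduction achieved by splitting targets into subsets."""
    total = len(targets)
    weighted = sum(len(s) / total * statistics.pstdev(s) for s in subsets)
    return statistics.pstdev(targets) - weighted


def best_threshold(xs, ys):
    """Try every midpoint between consecutive distinct x values; keep the largest SDR."""
    best_thr, best_gain = None, float("-inf")
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        gain = sdr(ys, [left, right])
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain


# Made-up example: time (minutes) against infiltration rate (mm/hr).
time_min = [5, 10, 15, 20, 60, 120, 180]
rate = [120.0, 110.0, 100.0, 90.0, 20.0, 15.0, 10.0]
thr, gain = best_threshold(time_min, rate)
```

For these values the chosen threshold falls between 20 and 60 minutes, separating the early high rates from the late low ones.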

### Performance assessment parameters

The performance of the modelling techniques has been assessed using three statistical parameters, coefficient of determination (*R*^{2}), root mean square error (RMSE) and mean absolute error (MAE), which quantify the difference between the actual values and the values predicted by the implemented soft computing approaches.

#### Coefficient of determination (*R*^{2})

The coefficient of determination is used to measure the success of numeric prediction. The coefficient of determination (*R*^{2}) is computed as follows (Adnan *et al.* 2019; Malik *et al.* 2019):

$$R^{2} = \left[\frac{\sum_{i=1}^{n}\left(O_i-\bar{O}\right)\left(P_i-\bar{P}\right)}{\sqrt{\sum_{i=1}^{n}\left(O_i-\bar{O}\right)^{2}\,\sum_{i=1}^{n}\left(P_i-\bar{P}\right)^{2}}}\right]^{2}$$

where $O_i$ is the observed value; $P_i$ is the predicted value; $\bar{O}$ and $\bar{P}$ are the corresponding means; and *n* is the number of observations. The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of the observed ($O_i$) and predicted ($P_i$) values. The range of *R*^{2} is 0 to 1.0. A value of 1.0 indicates perfect correlation between the observed ($O_i$) and predicted ($P_i$) values, while a value of 0.0 indicates that there is no linear relationship between them.

#### Mean absolute error (MAE)

The MAE is computed as follows ([…] *et al.* 2017; Pham *et al.* 2021):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|O_i-P_i\right|$$

MAE can range from 0 to ∞. A value of 0 (MAE = 0) corresponds to a perfect match between the modelled and observed data.

#### Root mean square error (RMSE)

The RMSE is computed as follows ([…] *et al.* 2020):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(O_i-P_i\right)^{2}}$$

RMSE can range from 0 to ∞. A value of 0 (RMSE = 0) corresponds to an ideal model whose predictions exactly equal the observed values.
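The three measures can be computed with a few lines of Python. This sketch assumes that *R*^{2} is taken as the squared Pearson correlation between observed and predicted values, which matches the description above of a correlation ranging from 0 to 1; the function names and the example numbers are illustrative only.

```python
import math


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def rmse(observed, predicted):
    """Root mean square error, in the units of the observations."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error, in the units of the observations."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


# Illustrative values only (e.g. infiltration rate in mm/hr).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
scores = {
    "R2": r_squared(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
}
```

Because RMSE squares the residuals before averaging, it is never smaller than MAE on the same data, and a large gap between the two points to a few large errors.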

## EVALUATION OF MODELS

The performance measures adopted for model assessment are coefficient of determination (*R*^{2}), root mean square error (RMSE) and mean absolute error (MAE). The value of *R*^{2} determines the closeness of the regression line to the observed data. *R*^{2} = 1 indicates perfect fitting of the regression line to the observed data. RMSE and MAE are the error values that measure the accuracy of the models in approximating the actual observations. Lower values of RMSE and MAE indicate better prediction of the actual data. Two output values: infiltration rate and cumulative infiltration, are predicted relative to the input variables (Table 2) using the regression-tree-based models. In this study, WEKA 3.9 software is used for all applied regression-tree-based model development and validation. The models are developed using the training data set after setting the values of tuning parameters (user-defined) respective to each regression tree model. The performance of the constructed model is tested on the unseen testing data set using the statistical performance measures (*R*^{2}, RMSE and MAE). Several models are constructed and tested by assigning the values of the user-defined parameters to the modelling techniques and the best one is chosen based on optimal fitting of the model on the testing data set. The procedure is repeated for all the regression tree models using both outputs and the tuning parameters selected for performance analysis are shown in Table 3.

| Regression model | Tuning parameters (infiltration rate prediction) | Tuning parameters (cumulative infiltration prediction) |
|---|---|---|
| Random model tree | Number of randomly chosen attributes = 4 | Number of randomly chosen attributes = 3 |
| Random forest | Number of trees = 200, number of variables used at each node = 1 | Number of trees = 78, number of variables used at each node = 2 |
| M5 tree | Number of instances allowed at a leaf node = 5 | Number of instances allowed at a leaf node = 5 |


## RESULTS AND DISCUSSION

### Modelling of infiltration rate

The models are formed and checked for accuracy after setting the tuning parameters to near-optimal values (Table 3) in order to predict the values of soil infiltration rate. The simulation results of training and testing data are presented in Figures 2 and 3, respectively, which show the scattering of the data points for the prediction of three regression-tree-based models viz. random tree (RT), random forest (RF) and M5 tree. To investigate the superiority of the modelling techniques, the statistical performance measures acquired by the models with the observed infiltration rate data are elucidated in Table 4. As indicated from Figure 2, the scattering of training data predicted by the M5 tree with both pruned as well as unpruned model is higher relative to random tree and random forest. The higher values of infiltration rate predicted by the M5 tree model scatter relatively more than the lower values and degrade the modelling performance. Statistical measures from the M5 tree confirm the poor performance of model development with the training data as the RMSE and MAE values are substantially higher. The performance of the model is improved with the testing data set, but the testing performance is inferior to the other regression tree models. The trained model based on the random tree is best (*R*^{2} = 0.99) as the data prediction is close to the line of perfect agreement in the training stage but the model is not capable of giving accurate predictions with the testing data as compared with the random forest model. The performance of the model developed with RT drops at the validation stage having RMSE and MAE values of 10.82 and 5.98, respectively (Table 4). Closer fitting of predicted points to the perfect agreement line using the RF model in the testing stage reveals its suitability in the generalization of the infiltration rate data as the error values (RMSE = 8.10 and MAE = 5.74) are minimum (Table 4). 
Figure 4 shows the residual error values obtained from the experimental observations and predictions of infiltration rate with the regression tree models. The residuals are smallest with the RT model during the training stage, but during the testing stage the model deviates further from the zero error line than the RF model does. The M5 tree-based models have larger residuals during model training, and although their quality improves during testing with smaller error values, they are still ranked last, after the RT model. Therefore, the modelling results of this study indicate superior performance of the RF model in simulating the actual infiltration rate of soil, followed by the RT and M5 tree models, respectively.

| Regression model | Training *R*^{2} | Training RMSE (mm/hr) | Training MAE (mm/hr) | Testing *R*^{2} | Testing RMSE (mm/hr) | Testing MAE (mm/hr) |
|---|---|---|---|---|---|---|
| Random model tree | 1.00 | 0.62 | 0.27 | 0.95 | 10.82 | 5.98 |
| Random forest | 0.98 | 9.93 | 4.73 | 0.97 | 8.10 | 5.74 |
| M5 tree pruned | 0.81 | 26.81 | 18.02 | 0.85 | 19.08 | 15.28 |
| M5 tree unpruned | 0.81 | 26.81 | 17.96 | 0.85 | 18.49 | 14.10 |


| Regression model | Training *R*^{2} | Training RMSE (mm) | Training MAE (mm) | Testing *R*^{2} | Testing RMSE (mm) | Testing MAE (mm) |
|---|---|---|---|---|---|---|
| Random model tree | 0.99 | 0.92 | 0.54 | 0.94 | 10.76 | 7.87 |
| Random forest | 0.99 | 10.33 | 4.52 | 0.97 | 6.96 | 4.44 |
| M5 tree pruned | 0.81 | 28.92 | 16.09 | 0.81 | 18.43 | 11.21 |
| M5 tree unpruned | 0.82 | 28.68 | 15.91 | 0.81 | 18.11 | 10.97 |


### Modelling of cumulative infiltration

In this section, the cumulative infiltration of soil is simulated using random tree (RT), random forest (RF) and M5 tree (M5) models with the basic soil properties. The scattering plots of the regression-tree based models for the training and testing data sets are depicted in Figures 5 and 6, respectively. Table 5 provides the information about the statistical measures obtained from the observed training and testing cumulative infiltration data with their corresponding predicted values by the models. The scattering of the training data predicted by the M5 pruned and unpruned models is higher, particularly for the larger values, as compared with the RT and RF models (Figure 5). The predicted data of the RT and RF model fits closely to the perfect agreement line having lower values of RMSE and MAE than the M5 tree in the model development stage. The RT model perfectly predicts the training data with a maximum value of *R*^{2} = 0.99 and minimum values of RMSE = 0.92 and MAE = 0.54 as most of the data points reside on the perfect agreement line, but the prediction capability of the model degrades during validation with the testing data. Analysing the testing data scatter plot (Figure 6), the prediction tendency of the RF model is superior as compared with the other regression models as the predicted points are comparatively closer to the perfect agreement line than the other models. The lowest values of RMSE = 6.96 and MAE = 4.44 obtained from the RF model at the testing stage affirm the stronger capability of the model in the generalization of the cumulative infiltration data (Table 5). So, statistical performance measures acquired from the testing data set suggest better performance by the RF model as compared with the RT and M5 tree models in approximating the cumulative infiltration of soil. Figure 7 shows the residual errors observed with the current data set using different regression models. 
The residual error values peak in the training stage, mostly predicted by the M5 tree model, while the RT model has smaller residual values. The residuals are minimum with the RF model relative to the RT model in the testing stage, which assures its higher predictability.

## SIMULATION OF INFILTRATION CHARACTERISTICS OF SOIL BY RANDOM FOREST MODEL

Based on the modelling results discussed above, it is clear that the RF model has higher generalization performance in simulating the infiltration rate as well as cumulative infiltration of soil as compared with the other applied models. In this section, infiltration characteristics experimentally observed from their respective sites are compared with the predicted values of the RF model. In order to accomplish this task, the RF model developed with the training data set is tested on the complete data set of the locations. Both outputs, infiltration rate and cumulative infiltration, observed from the field are presented with the RF model-based predictions for their respective sites (Figure 8). The model trained for both outputs using the tuning parameters (Table 3) is implemented to their respective data sets for prediction. For comparative analysis, the results are plotted for each site (specified by the location number). The infiltration rate/cumulative infiltration values are plotted against time to show the variation of the model in predicting the observed data. The results reveal that the prediction of points by the RF model agrees reasonably well with the observed infiltration values, particularly in predicting the infiltration rate, and the model has a decent tendency to follow the actual observations.

## PARAMETRIC SENSITIVITY

The sensitivity of each input variable in predicting the output variables (infiltration rate and cumulative infiltration of soil) is tested. The random forest (RF) model developed with the training data set is used to study the parametric importance in prediction. Models are developed using various combinations of input parameters by removing one parameter at a time, and the sensitivity is judged by means of *R*^{2} and RMSE. The change in these statistical measures across input combinations reflects the change in model performance in the prediction of both outputs. Table 6 shows that removing time (*t*) from the input parameters causes the largest change in both *R*^{2} and RMSE. This combination drastically increases the RMSE value, which makes time (*t*) the most important input variable in the prediction of infiltration rate as well as cumulative infiltration. Removing any of the other input parameters from the RF model yields only a small change in the *R*^{2} and RMSE values. Removing the parameters sand, clay and silt, one at a time, marginally decreases the RMSE and hence slightly improves the efficiency of the model in predicting the infiltration rate, whereas neglecting these parameters has a negative effect on cumulative infiltration prediction, with a slight increase in RMSE values (Table 6).

| Combination | Parameter removed | Infiltration rate *R*^{2} | Infiltration rate RMSE (mm/hr) | Cumulative infiltration *R*^{2} | Cumulative infiltration RMSE (mm) |
|---|---|---|---|---|---|
| | – | 0.98 | 9.93 | 0.99 | 10.34 |
| | *t* | 0.61 | 37.08 | 0.39 | 50.40 |
| | *Sa* | 0.98 | 9.53 | 0.99 | 10.46 |
| | | 0.98 | 9.41 | 0.99 | 11.49 |
| | | 0.98 | 9.42 | 0.99 | 10.65 |
| | | 0.98 | 9.45 | 0.99 | 10.55 |
| | | 0.98 | 10.49 | 0.99 | 10.65 |
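The remove-one-input procedure behind this sensitivity analysis can be sketched as a loop that refits and rescores a model with each input column dropped in turn. To keep the example short and dependency-free, a simple 3-nearest-neighbour regressor stands in for the RF model, the score is computed on held-out rows, and the data are synthetic; none of these choices is taken from the study.

```python
import math


def knn_predict(train_x, train_y, row, k=3):
    """Stand-in regressor: mean target of the k nearest training rows."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], row))
    return sum(train_y[i] for i in order[:k]) / k


def rmse(observed, predicted):
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def drop_column(rows, j):
    """Return a copy of rows without input column j."""
    return [row[:j] + row[j + 1:] for row in rows]


def evaluate(train_x, train_y, test_x, test_y):
    predictions = [knn_predict(train_x, train_y, row) for row in test_x]
    return rmse(test_y, predictions)


# Synthetic data: the output depends strongly on time and weakly on sand content.
names = ["t", "sand"]
rows = [[float(t), float(s)] for t in range(5, 185, 5) for s in (20, 50, 80)]
targets = [200.0 / (1.0 + 0.1 * t) + 0.1 * s for t, s in rows]
train_x, train_y = rows[::2], targets[::2]
test_x, test_y = rows[1::2], targets[1::2]

# Baseline with all inputs, then one model per removed input.
results = {"none": evaluate(train_x, train_y, test_x, test_y)}
for j, name in enumerate(names):
    results[name] = evaluate(drop_column(train_x, j), train_y,
                             drop_column(test_x, j), test_y)
```

The input whose removal raises the RMSE most is read as the most influential one; in this synthetic case that is time, by construction.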


Infiltration is characterised as the flow of water into the subsurface from above ground. The topic of infiltration has received a great deal of attention because of its importance to topics as wide-ranging as irrigation, contaminant transport, groundwater recharge, and ecosystem viability. In the last few decades, conventional infiltration models such as Kostiakov, Philip, Novel, Horton and Holtan have been widely used for the calculation or estimation of infiltration rate and cumulative infiltration. Machiwal *et al.* (2006) observed that the infiltration process was well described by the Philip model in a wasteland of Kharagpur, India. However, soil management practices influencing the final infiltration rate are a major factor in deciding the applicability of these models. Thus, the variability of soil infiltration characteristics and the goodness of fit of the infiltration models for different soils should be given due consideration in infiltration modelling studies for predicting the constant infiltration rate. The applicability of these models for estimating the infiltration rate under different soil management has been examined by researchers. Gifford (1976) observed that amongst the Horton, Kostiakov and Philip models, the Horton model gave the best fit to infiltration data from mostly semi-arid regions of Australia, but only under specific conditions. Mishra *et al.* (2003) compared the performance of 14 different conventional infiltration models using 243 sets of infiltration data collected from field and laboratory tests conducted in India and the USA on soils ranging from coarse sand to fine clay. These models are point- or location-specific, and until now no general model has been developed. The results of this study suggest that the RF-based model is suitable for predicting the infiltration characteristics of the study area. The RF model is a general model for the study area and gives values close to the observed values. The RF model is used to represent the infiltration characteristics of the study area, and it saves time and effort in comparison with experimentation and conventional models.

## CONCLUSIONS

Prediction of infiltration characteristics is essential for several hydrologic studies associated with surface and subsurface flow. In this study, the effectiveness of regression-tree-based models in simulating the infiltration characteristics of soil is evaluated. In estimating the infiltration rate and cumulative infiltration of soil, random forest outperforms random tree as well as M5 tree with *R*^{2} = 0.97 and 0.97, RMSE = 8.10 and 6.96 and MAE = 5.74 and 4.44, respectively. The training and testing results of the random forest encourage the utility of this method relative to the other tested methods and show a good potential in representing the infiltration characteristics of the study area. Based on the sensitivity results, time is observed as the most important parameter in affecting the prediction performance of both the infiltration rate and the cumulative infiltration of soil.

Finally, it is worth noting that the proposed approach uses only six fundamental parameters, time (*t*), dry density, moisture content (*M*), sand (*Sa*), silt (*Si*) and clay (*C*), for prediction of the infiltration rate and cumulative infiltration of soil. This is important in situations where other parameters, such as porosity and the soil type at lower depths, cannot be measured. Considering the ability of the RF model to handle missing values in data sets as well as its capacity to bypass noisy data, the implemented soft computing methodology showed the potential to be used accurately in approximating the infiltration process. The RF model's ability needs to be compared with hybrid techniques in an effort to develop a more accurate model. More experimental data are also required for better and more general results.

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.