An expert system for predicting the infiltration characteristics

Infiltration plays a fundamental role in streamflow, groundwater recharge, subsurface flow, and surface and subsurface water quality and quantity. This study presents a comparative analysis of two machine learning techniques, the M5P model tree (M5P) and Gene Expression Programming (GEP), in predicting infiltration characteristics. The models were trained and tested using seven combinations (CMB1 to CMB7) of the input parameters moisture content (M), bulk density of soil (D), percentages of silt (SI), sand (SA) and clay (C), and time (T), with cumulative infiltration (CI) and infiltration rate (IR) as the output parameters. The results suggest that GEP has an edge over M5P in predicting IR and CI, with R, RMSE and MAE values of 0.9343, 15.9667 mm/hr and 8.7676 mm/hr for IR and 0.9586, 9.2522 mm and 7.7865 mm for CI, both with CMB1. The M5P model also gave good results, with R, RMSE and MAE values of 0.9192, 14.1821 mm/hr and 19.2497 mm/hr for IR and 0.8987, 11.2144 mm and 18.4328 mm for CI, but these were weaker than those of GEP. Furthermore, single-factor ANOVA and uncertainty analysis were used to test the significance of the predicted results and to identify the more efficient soft computing technique, respectively.

• Infiltration characteristics are predicted using two soft computing techniques, M5P and GEP.
• Linear relationship models are generated from M5P and GEP.
• Single-factor ANOVA and uncertainty analysis are carried out for the best selected models.

(Patle et al. 2018). A good water management system requires efficient control of the infiltration characteristics of the soil. The infiltration characteristics consist of two terms: infiltration rate and cumulative infiltration. Cumulative infiltration is the total amount of water that infiltrates into the soil, and the infiltration rate is the rate at which it infiltrates (Haghighi et al. 2010). Good knowledge of the infiltration characteristics helps with a wide range of problems, such as artificial and natural groundwater recharge, flooding, pollution of underground water, the optimum amount of water for irrigation, and runoff (Dahan et al. 2007). Infiltration is also the most dominant factor in the accurate prediction of flooding conditions in any catchment (Bhave & Sreeja 2013). An irrigation scheme should be planned with the lateral flow of water in mind, so that irrigation water is used efficiently and effectively (Chowdary et al. 2006). Infiltration characteristics also help in measuring irrigation efficiency, increasing crop yields, and minimizing soil erosion (Adeniji et al. 2013). Many factors affect the infiltration characteristics, such as soil properties and texture, water content, humidity, rainfall intensity, and field density (Igbadun et al. 2007). Soil properties and texture affect the water holding capacity of the soil (Gupta & Gupta 2008). Sand has a comparatively larger pore size than clay and thus has a high infiltration rate and a very low water holding capacity (Micheal 1978; Smith 2006).
Infiltration characteristics also play a significant role in the prediction of runoff for the design of hydraulic structures, as well as in water resources planning and management (Heinz et al. 2007; Souchère et al. 2010).

Many researchers have used different types of infiltration models, such as the Philip model (Philip 1957), Green and Ampt model (Green & Ampt 1911), Holtan model (Holtan 1961), Singh-Yu model (Singh & Yu 1990), Kostiakov model (Kostiakov 1932), Huggins-Monke model (Huggins & Monke 1966), Novel model (Sihag et al. 2017), Horton model (Horton 1938), and revised modified Kostiakov model (Parhi et al. 2007), with different types of soil. Mishra et al. (2003) compared 14 infiltration models on 243 datasets covering different soil types and found that the Singh-Yu model gave the most precise results; the Huggins-Monke, Holtan, and Horton models also gave good results, but less accurate than those of the Singh-Yu model. Mirzaee et al. (2014) introduced a modified Kostiakov model that works well with silty clay, clay loam, and loam soils in Iran. Sihag et al. (2017) introduced a new model (the Novel model) for the soil of Kurukshetra, India. Zakwan (2018) applied various infiltration models to 16 datasets from around the world and found the Holtan model to be the most suitable, while the Novel model of Sihag et al. (2017) worked well in 5 cases.
In the 21st century, machine learning techniques such as GEP, ANN, M5P, and RF have become some of the most widely used techniques in water resources, structural, hydrological, and environmental engineering (Parsaie & Haghiabi 2017; Angelaki et al. 2018; Haghiabi et al. 2018; Sihag et al. 2018a, 2018b, 2020; Singh et al. 2018b, 2019a, 2019b; Vand et al. 2018; Mohanty et al. 2019; Kumar et al. 2020; Malik et al. 2020; Pandey et al. 2020; Pandhiani et al. 2020; Mohammed et al. 2021; Rehamnia et al. 2021). CFD is another method that can be used for water resources problems (Al-Obaidi 2019; Al-Obaidi & Mohammed 2019; Al-Obaidi 2021), but embedding machine learning techniques inside traditional fluid simulations can improve both accuracy and speed (Kochkov et al. 2021). Karbasi & Azamathulla (2016) used the GEP technique to predict the characteristics of a hydraulic jump over a rough bed and found that GEP performed slightly better than the other modeling techniques (ANN and SVM). Shabanlou et al. (2018) employed GEP to determine scour dimensions around submerged vanes and found that it estimated the scour dimensions precisely. Gholami et al. (2018) applied GEP to determine the stable threshold channel slope. Khozani et al. (2018) compared GEP and ANN and found that GEP gave the superior result in predicting the shear stress distribution in circular open channels.
Compared with GEP, the M5P model tree is less popular among machine learning techniques, but it has been used successfully in civil engineering problems (Bhattacharya & Solomatine 2005; Pal & Deswal 2009; Singh et al. 2010; Goyal & Ojha 2011). Ayoubloo et al. (2011) compared two machine learning techniques, the M5P model tree and ANN, for calculating local scour downstream of spillways and found that M5P predicted the results efficiently. Singh et al. (2017) employed the M5P model tree, ANN, and RF to predict the infiltration rate of the soil. Goyal & Ojha (2011) found that the M5P model tree gave more accurate results than ANN for local scour downstream of a ski-jump bucket. The main advantage of GEP and the M5P model tree is that they yield linear models that predict the output at any instance. The present investigation was initiated with the following goals: (i) to develop M5P and GEP models for predicting the infiltration characteristics, i.e. infiltration rate and cumulative infiltration; (ii) to compare the results of the M5P and GEP models to find the most accurate values of the infiltration characteristics; and (iii) to generate linear relationship equations for calculating the infiltration characteristics with M5P and GEP. In pursuing these objectives, two machine learning based models were developed, their results were compared using performance evaluation parameters, and novel linear relationship equations were generated to calculate the values of IR and CI at any instance.

M5P
M5P, initially introduced by Quinlan (1992), builds a decision tree that uses linear regression functions at its nodes to model the relationship between input and output variables. At each node, the split is chosen to gain the maximum information, that is, to minimize the variation in the class values within the subsets passed down each branch. Splitting stops when the class values of the instances reaching a node vary only slightly, when only a few instances remain, or when the tree is pruned back. The resulting tree has a good structure and estimation precision because the leaf nodes hold linear models (Singh et al. 2017). The standard deviation reduction (SDR) is calculated as follows:

$$\mathrm{SDR} = sd(T) - \sum_{i} \frac{|T_i|}{|T|}\, sd(T_i)$$

where $T$ is the set of instances reaching the node, $T_i$ is the subset of instances produced by the $i$-th outcome of the candidate split, and $sd$ denotes the standard deviation.

The main advantage of the M5P model is that it can deal with continuous and categorical variables and can also handle missing values. Figure 1 illustrates the concept of M5P, in which M_i are the models and n_i are the split nodes of the tree.

GEP
GEP (2002) is a search technique that involves computer programs. It was developed on the basis of the genetic algorithm (GA) and has been widely implemented in recent studies. The computer programs of GEP are all encoded in linear chromosomes, which are then expressed or translated into expression trees (ETs). A concise flowchart of GEP is shown in Figure 2. The first step in solving any problem is to produce the initial population by randomly generating chromosomes. The chromosomes are then converted to ETs, which are evaluated against performance criteria to assess the fitness of the solutions they represent. If the outcomes satisfy the performance criteria, population generation stops; otherwise, the system reproduces with modification to create a new, improved generation, and this process repeats until the best results are achieved.
These results depend on the primary parameters, the details of which are summarized in Table 1. An advantage of GEP is that its chromosomes are symbolic linear strings of fixed length. For further explanation of GEP, readers are referred to Ferreira (2006) and Ebtehaj et al. (2015).
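The split criterion that M5P uses, the standard deviation reduction (the standard deviation of the target values at a node minus the size-weighted standard deviations of the subsets created by a split), can be illustrated with a short sketch. This is a minimal illustration and not the authors' implementation: the function names `sdr` and `best_split`, the use of the population standard deviation, and the toy clay and IR values are all assumptions made for the example.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction obtained by splitting `parent` into `subsets`."""
    n = len(parent)
    # Weight each subset's standard deviation by its share of the instances.
    weighted = sum(len(s) / n * pstdev(s) for s in subsets)
    return pstdev(parent) - weighted


def best_split(xs, ys):
    """Return (threshold, sdr) of the best binary split on one numeric attribute."""
    pairs = sorted(zip(xs, ys))
    best_thr, best_gain = None, float("-inf")
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical attribute values
        thr = (lo + hi) / 2
        left = [y for x, y in pairs if x <= thr]
        right = [y for x, y in pairs if x > thr]
        gain = sdr(ys, [left, right])
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain


# Hypothetical toy data: clay percentage (attribute) and infiltration rate (target).
clay = [10, 15, 20, 25, 30, 35]
ir = [90, 85, 88, 30, 28, 25]
threshold, gain = best_split(clay, ir)
print(threshold, round(gain, 3))
```

In a full model tree this search would be repeated over every input attribute at every node, and a linear regression model would then be fitted to the instances in each leaf.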

Study area and methodology
The infiltration characteristics data were determined experimentally in four districts of Haryana state, India: Kurukshetra, Hisar, Kaithal, and Karnal. These districts are situated in the northern part of India, far from the sea, which results in very low temperatures (1°C to 7°C) in winter and high temperatures (40°C to 45°C) in summer. The geographical location of the districts, along with the experimentation points, is shown in Figure 3. Experiments were carried out at a total of 17 locations (Figure 3), all of which lie between 29.14° and 29.96° north latitude and between 75.72° and 76.99° east longitude. Rainfall in this part of India is due to two factors, the monsoon (June to August) and western disturbances (December to February), but approximately 95% of the total rainfall occurs in the monsoon period. The soil is loam, sandy loam, clay loam, and sandy in Kurukshetra, Kaithal, Karnal, and Hisar, respectively.
The infiltration characteristics consist of two properties, the IR and CI of water. The IR and CI were determined using a double ring infiltrometer (DRI) (ASTM 2009), the standard instrument for measuring the infiltration characteristics of soil. The DRI consists of two cylinders, an outer cylinder (diameter 600 mm) and an inner cylinder (diameter 300 mm), connected with iron strips as displayed in Figure 4. The instrument, which has a total depth of 300 mm, was driven 100 mm into the soil with uniform strikes of a falling-weight type hammer, without disturbing the top layer of the soil. Both rings were filled with water to an equal depth, and the initial depth of water in the inner ring was noted, because the water from the inner ring moves directly downwards rather than laterally. The total amount of water that infiltrates during the process is the CI, and the amount of water that infiltrates per unit time is the IR. Soil samples were collected simultaneously to determine soil properties such as the percentages of SI, SA, and C, as well as D and M. The percentages of SI, SA, and C were determined by the hydrometer test (ASTM 2007), and D and M were determined by the Proctor test (Connely et al. 2008) and the oven-dry method (Rowe 2018), respectively. The main reason for determining these parameters is to characterize the soil of the study area. These soil properties are tabulated in Table 2 for the locations in the four districts.
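The relationship between the two measured quantities can be made concrete with a small sketch: given cumulative infiltration readings from the inner ring at successive times, the infiltration rate over each interval is the increase in CI divided by the elapsed time. The function name `infiltration_rates`, the choice of interval-averaged rates, and the readings below are illustrative assumptions, not data or procedures taken from the study.

```python
def infiltration_rates(times_min, cumulative_mm):
    """Interval-averaged infiltration rate (mm/hr) from cumulative infiltration (mm)."""
    if len(times_min) != len(cumulative_mm):
        raise ValueError("times and cumulative readings must have equal length")
    rates = []
    for k in range(1, len(times_min)):
        dt_min = times_min[k] - times_min[k - 1]
        d_mm = cumulative_mm[k] - cumulative_mm[k - 1]
        # Convert mm per minute to mm per hour.
        rates.append(d_mm * 60.0 / dt_min)
    return rates


# Hypothetical inner-ring readings: elapsed time (min) and cumulative infiltration (mm).
times = [0, 5, 10, 20, 30]
ci = [0.0, 12.0, 20.0, 30.0, 36.0]
rates = infiltration_rates(times, ci)
print(rates)  # [144.0, 96.0, 60.0, 36.0]
```

The falling sequence of rates in this made-up example reflects the usual behaviour of a soil that takes in water quickly at first and more slowly as it wets.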

RESULTS AND DISCUSSIONS
The results of this investigation are structured as follows. First, the dataset is analyzed, covering its descriptive characteristics, the construction of the various model combinations, and the details of the performance evaluation parameters. This is followed by the prediction of the IR and CI using M5P and GEP and a performance comparison among the models. Finally, single-factor ANOVA and uncertainty analysis are carried out for the best-fit models of both techniques.

Analysis of dataset
To check the effectiveness of the M5P and GEP models, M (%), D (kg/m³), SI, SA, and C (%), and T (minutes) were used as input parameters, and CI and IR were the output variables. The total dataset was divided into two sections: a training section (2/3 of the total dataset) and a testing section (1/3 of the total dataset). The study was carried out in two sets; the first was the analysis of the IR and the second was the analysis of the CI. The IR and CI dataset consists of 185 observations, of which 125 are in the training section and 60 in the testing section. The characteristics of the training and testing data are summarized in Table 3. A total of seven models (CMB1 to CMB7) were created from the input parameters for IR as well as CI. CMB1 contained all the parameters, while CMB2 to CMB7 were created by removing one input parameter in each combination. These combinations were created to examine the effect of each parameter on the prediction of the infiltration characteristics. The details of the input combinations are given in Table 4.
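The construction of the seven input combinations and the train/test partition can be sketched as follows. This is a minimal sketch under stated assumptions: which parameter is dropped in which combination is defined in Table 4 and is not reproduced here, so the ordering below is hypothetical, and the random shuffle with a fixed seed is an assumed way of selecting the 125 training and 60 testing observations.

```python
import random

INPUTS = ["M", "D", "SI", "SA", "C", "T"]


def build_combinations(inputs):
    """CMB1 uses every input; CMB2..CMB7 each drop one input (order is hypothetical)."""
    combos = {"CMB1": list(inputs)}
    for k, dropped in enumerate(inputs, start=2):
        combos[f"CMB{k}"] = [p for p in inputs if p != dropped]
    return combos


def split_indices(n_total, n_train, seed=0):
    """Shuffle row indices and cut them into training and testing sets."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return sorted(idx[:n_train]), sorted(idx[n_train:])


combos = build_combinations(INPUTS)
train_idx, test_idx = split_indices(185, 125)

for name, params in combos.items():
    print(name, params)
print(len(train_idx), len(test_idx))  # 125 60
```

Comparing the error of each reduced combination with that of CMB1 then indicates how much the dropped parameter contributes to the prediction.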
The performance evaluation parameters, including MAE (Malik et al. 2021a; Singh et al. 2021a, 2021b), were used to compare the predicted values of M5P and GEP. [...] Figure 5 using the M5P and GEP techniques. Secondly, single-factor ANOVA was used to compare the predicted values statistically. This is a hypothesis-testing method for determining whether the differences among two or more modeling techniques are significant. Single-factor ANOVA gives three values: the F-value, the P-value, and F critical. If the P-value is more than 0.05 and the F-value is less than F critical, the result of that particular approach is insignificant, and vice versa (Singh et al. 2017).
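The F-value of a single-factor ANOVA is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it in plain Python for any number of groups. The function name `one_way_anova_f` and the sample values are assumptions for illustration, and which groups the study compared (for example observed against predicted values) is not specified here. The P-value and F critical require the F distribution and are not computed in this sketch.

```python
def one_way_anova_f(*groups):
    """Return (F, df_between, df_within) for a single-factor ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical samples, e.g. observed and predicted values from one model.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
f_value, df_b, df_w = one_way_anova_f(group_a, group_b)
print(f_value, df_b, df_w)  # 1.5 1 4
```

The computed F-value is then compared with F critical for the same degrees of freedom at the 0.05 level. If SciPy is available, `scipy.stats.f_oneway` returns both the F statistic and the P-value.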

IR
M5P and GEP were used to obtain the best prediction of the IR. The details of the performance evaluation parameters are listed in Figure 4. GEP provided good results in the prediction of the IR. The results in Figure 5 (Figure 6) for GEP and M5P show that the best models for these techniques are CMB1 and CMB2, respectively. The performance of the most accurate model combinations was plotted in [...]. Similarly, the value of R for this case is 0.8510. The major benefit of M5P and GEP is that they provide a linear relationship between the input and output variables. The details of the linear relations for both models are summarized in Table 5. In M5P, if C ≤ 21.37, linear relation 1 (LM num 1) is followed; if C > 21.37 and T ≤ 17.5, linear relation 2 (LM num 2) is followed; and so on. A total of five linear models were given by the M5P model with model combination CMB2. For GEP, only one model, M num1, was created, which is listed in Table 5. CMB1 and CMB2 were the best-fitted model combinations for predicting the IR with GEP and M5P, respectively. The values of R, RMSE, and MAE were 0.9343, 15.9667 mm/hr, and 8.7676 mm/hr for GEP and 0.9192, 14.1821 mm/hr, and 19.2497 mm/hr for M5P (Figure 5). The output from Figures 4-6 suggests that the GEP technique is more accurate than the M5P technique. The plot for the GEP technique is more symmetrical than that for M5P, with the highest value of R (0.9159), which is much higher than that of M5P (R = 0.8510) (Figure 7). Thus, the GEP technique is the most precise technique for predicting the IR with CMB1.
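The three statistics quoted in this section, R, RMSE, and MAE, follow their standard definitions: the Pearson correlation coefficient, the root mean square error, and the mean absolute error between observed and predicted values. A minimal sketch of how they can be computed is given below; the function names and the sample values are assumptions for illustration and do not come from the study's dataset.

```python
import math


def rmse(obs, pred):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def pearson_r(obs, pred):
    """Pearson correlation coefficient between observed and predicted values."""
    mo = sum(obs) / len(obs)
    mp = sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = sum((o - mo) ** 2 for o in obs)
    sp = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(so * sp)


# Hypothetical observed and predicted infiltration rates (mm/hr).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
r_val = pearson_r(observed, predicted)
rmse_val = rmse(observed, predicted)
mae_val = mae(observed, predicted)
print(round(r_val, 4), round(rmse_val, 4), round(mae_val, 4))
```

For any dataset the MAE cannot exceed the RMSE, which gives a quick sanity check when reading tabulated values.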

CI
In this section, the CI was predicted with the seven model combinations using GEP and M5P. The values of the performance evaluation parameters, i.e. R, RMSE, and MAE, are provided in Figure 8. [...] models with GEP and M5P by the Taylor diagram and performance graph (Figures 9 and 10). The values of R for these combinations were 0.9586 and 0.8987, which were the highest among all model combinations, and the values of RMSE and MAE were 9.2522 mm and 11.2144 mm, and 7.7865 mm and 18.4328 mm, which were the lowest, from CMB5 with GEP and CMB1 with M5P. CMB7, however, was the worst model combination with both modeling techniques. The performance of GEP and M5P with the best models (CMB1 and CMB6, respectively) is shown in Figures 9 and 10. Similarly, the performance graph combines the two outputs explained in the IR section. The linear relation models for the CI are listed in Table 6. It is clear from Table 6 that a total of 9 linear models were developed for CMB6 using M5P. CMB1 and CMB6 were the most accurate model combinations for predicting the CI with GEP and M5P, respectively. Figure 8 gives the value of R², which was higher for GEP (0.8949) than for M5P (0.8349), and all the plots for GEP were symmetrical. It also shows that GEP gave better values of R, RMSE, and MAE (0.9586, 9.2522 mm, and 7.7865 mm) than M5P (0.8987, 11.2144 mm, and 18.4328 mm). Thus, for CI as well, the GEP technique is the most precise technique, with CMB1.

[...] ((SA-M)).*((M - 3.887726).*D))) + ((((SA + 9.966279)-(M - 9.966279))./T).*9.280182) + (SI-((log(C)-sin(SA)).*(SA.^(1.0./3.0)))) + ((((0.257995-M).*(-2.68805)) + M).*cos(log(C))) + ((-3.535065)./cos((((SI + D).*SI).*(-3.535065 - 3.556549)))).

A comparison was made with previous studies to assess the potential of the best machine learning technique, GEP. The selected previous studies are Vand et al. (2018), Sihag et al. (2017), Singh et al. (2019a), and Sepahvand et al. (2021). Figure 11 shows the result of the comparison, which suggests that the result of the current study is the best compared with the previous studies. The order of the results is as follows: Singh et al. 2017, Singh et al. 2019a, Sihag et al. 2018b, Vand et al. 2018, Sepahvand et al. 2021, current study. Thus, the current study outperforms the previous studies.
Additionally, both machine learning techniques (M5P and GEP) are capable of providing explicit equations (given in Tables 5 and 6) that can be useful in calculating the infiltration characteristics. These equations may be used in regions where measuring the infiltration characteristics is very difficult. Furthermore, accurate estimation of the soil infiltration rate, and thereby the runoff rate, will help to develop proper soil management strategies and conservation measures to minimize the risk of erosion and land degradation.

Single factor ANOVA
The single-factor ANOVA results for both techniques are given in Table 7. For an insignificant result, the P-value should be more than 0.05 and the F-value should be less than F critical. Table 7 shows that, for both techniques, the P-value is more than 0.05 and the F-value is less than F critical for both IR and CI. Thus, the results of both predictors are statistically insignificant for IR and CI.

Uncertainty analysis
The more accurate technique, GEP, underwent uncertainty analysis (UA) after the single-factor ANOVA. In this investigation, UA gives a good description of the performance of the GEP models. Several researchers have used UA for data testing (Azimi et al. 2019). Uncertainty is the difference between the actual and estimated values, $er_i = j_i - k_i$. The mean difference and the standard deviation ($SD_{er}$) of the prediction errors are calculated as $\overline{er} = \frac{1}{m}\sum_{i=1}^{m} er_i$ and $SD_{er} = \sqrt{\sum_{i=1}^{m} (er_i - \overline{er})^2 / (m - 1)}$, where $m$ is the number of observations. Positive and negative signs of the prediction error indicate underestimation and overestimation by the model. The parameters $\overline{er}$ and $SD_{er}$ are used to define a certainty band around the prediction error values with the Wilson score method without continuity correction. Table 8 gives the details of the UA. For IR, GEP CMB1 gave the minimum values of the indices used for the UA ($\overline{er}$ = +0.0006, $SD_{er}$ = 0.0079, and 95% prediction error interval = ±0.0480), and GEP CMB1 also gave the minimum values for CI. Hence, CMB1 gave the best results for predicting the infiltration characteristics of soil, as was also indicated by the performance assessment parameters (Figures 4 and 7), Taylor diagrams (Figures 5 and 8), and performance graphs (Figures 6 and 9).
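The quantities used in the uncertainty analysis, the mean prediction error and its standard deviation, can be computed with a few lines of code. The sketch below is illustrative only. The function name `uncertainty_summary` and the sample values are assumptions, and the 95% band is formed with a simple normal approximation (mean error ± 1.96 standard deviations), which stands in for, and is not the same as, the Wilson score band named in the text.

```python
import math


def uncertainty_summary(actual, estimated, z=1.96):
    """Return (mean error, sample SD of errors, (lower, upper) band)."""
    errors = [a - e for a, e in zip(actual, estimated)]
    m = len(errors)
    mean_err = sum(errors) / m
    # Sample standard deviation with m - 1 in the denominator.
    sd_err = math.sqrt(sum((x - mean_err) ** 2 for x in errors) / (m - 1))
    # ASSUMPTION: normal-approximation band, not the Wilson score method.
    band = (mean_err - z * sd_err, mean_err + z * sd_err)
    return mean_err, sd_err, band


# Hypothetical actual and estimated values.
actual = [10.0, 20.0, 30.0]
estimated = [9.0, 21.0, 30.0]
mean_err, sd_err, band = uncertainty_summary(actual, estimated)
print(mean_err, sd_err, band)
```

A mean error close to zero with a narrow band indicates that the model neither systematically underestimates nor overestimates the measured values.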

CONCLUSIONS
Measuring infiltration characteristics is a time-consuming and complicated task for water resources and agriculture researchers. Thus, a trustworthy soft computing technique can be a suitable replacement. To this end, M5P and GEP were used to model the infiltration characteristics with seven model combinations. The following conclusions are drawn from this investigation:
• GEP is the most efficient technique for predicting the infiltration characteristics (both IR and CI) of the soil of four districts of Haryana, India.
• CMB1, with the input combination M, D, SI, C, SA, and T, gave better values of the performance evaluation parameters (R, RMSE, and MAE = 0.9343, 15.9667 mm/hr, and 8.7676 mm/hr) than the other model combinations in predicting the IR.
• CMB1, with the input combination M, D, SI, C, SA, and T, was the best model for predicting the CI, with R, RMSE, and MAE values of 0.9586, 9.2522 mm, and 7.7865 mm.
• Linear relationships among the input and output variables were obtained with the GEP and M5P soft computing techniques to calculate the values of the infiltration characteristics at any instance.
• A comparison with past studies also revealed that the GEP model of this study is superior.
• Single-factor ANOVA suggested that both techniques gave insignificant results in the prediction of the IR and CI, with a P-value of more than 0.05 and an F-value less than F critical.
• UA also confirmed that GEP is the best soft computing technique for predicting the infiltration characteristics.