Abstract
The permeability coefficient of soil (k) is one of the most important parameters in groundwater studies. In this study, two robust explicit data-driven methods, classification and regression trees (CART) and the group method of data handling (GMDH), were developed using the characteristics of soil, i.e., clay content (CC), water content (ω), liquid limit (LL), plastic limit (PL), specific density (γ), and void ratio (e), to generate predictive equations for k. In the testing stage, the GMDH equation produced lower error values (mean absolute error (MAE) = 0.0044, root mean square error (RMSE) = 0.0072, scatter index (SI) = 52.17%) and higher agreement (correlation coefficient (R) = 0.8493, index of agreement (IA) = 0.9184) than CART (MAE = 0.0051, RMSE = 0.0088, SI = 64.00%, R = 0.7841, IA = 0.8830). Although GMDH performed better, both CART and GMDH can be considered effective approaches for the prediction of k.
HIGHLIGHTS
Two predictive models were developed to estimate the permeability coefficient of soil.
GMDH and CART algorithms were evaluated in this study.
GMDH provided more accurate results when compared to CART for the prediction of permeability coefficient of soil.
The water content was the most effective parameter for determining the permeability coefficient of soil.
Field data were used in this research.
INTRODUCTION
One of the most important and challenging tasks for hydrogeologists in water resources management is the high-resolution characterization of k in aquifers. Infiltration is the physical process by which water from surface sources such as precipitation penetrates the soil. Different types of soils have different infiltration characteristics (Sajjadi et al. 2016; Torabi et al. 2020). The hydraulic conductivity of saturated soil, which is highly variable in space, is one of the most important hydraulic parameters for simulating water movement across the soil profile.
For complicated engineering problems, data-driven methods are frequently used and have been applied in many civil engineering fields, especially in water-related problems (Samadi et al. 2015, 2020a, 2020b; Mojaradi et al. 2018; Torabi et al. 2022). These methods include artificial neural networks (ANNs), the adaptive neuro-fuzzy inference system (ANFIS), wavelet–hybrid (W-hybrid) data-driven methods, evolutionary polynomial regression (EPR), support vector machines (SVMs), classification and regression trees (CART), multivariate adaptive regression splines (MARS), gene expression programming (GEP), and the group method of data handling (GMDH).
Data-driven models are generally classified into white-box and black-box models. White-box models give mathematical equations that allow the relationship between the independent and dependent variables to be understood and interpreted directly (Samadi et al. 2020a, 2020b). By contrast, black-box models predict output parameters numerically, without a straightforward equation linking the input and output variables (Samadi et al. 2021a, 2021b). The primary advantage of white-box models over black-box models is therefore that they generate usable predictive expressions for estimating the output parameter, which gives them the potential to address the limitations of black-box approaches. The present study used two white-box models, GMDH and CART, for the prediction of k. The primary reason for developing explicit data-driven methods is their ability to provide equations and to extract knowledge hidden in the data in the form of mathematical expressions.
CART and GMDH have proven to be dependable data-driven approaches for making good estimates and resolving complicated issues in various civil engineering areas. Both establish an equation between multiple independent variables and a single target variable. CART and other regression tree methods have been applied successfully to water-engineering problems. Samadi & Jabbari (2012) and Kamranzad & Samadi (2013) indicated that CART provides mathematical expressions that can easily be used to predict scour depth and the time series of wave height. Samadi et al. (2014) used the ANN method to determine the scour depth below ski-jump bucket and free overfall spillways. Samadi et al. (2021a, 2021b) indicated that CART is more convenient for modeling dynamic pressure distribution in hydraulic structures. Sihag et al. (2021a, 2021b) simulated the soil infiltration rate using regression tree methods.
Recently, the GMDH predictive approach has been implemented to model a range of water-related problems. For example, Qaderi et al. (2018) assessed GMDH for predicting bed form dimensions. Khedri et al. (2020) used GMDH to estimate groundwater levels. Sihag et al. (2021a, 2021b) predicted the aeration efficiency of a flume using GMDH. Nasrabadi et al. (2021) used the GMDH method to predict submerged hydraulic jump characteristics such as jump length, relative energy loss, and relative submergence depth. Zeinali et al. (2021) showed that the GMDH model performed better in estimating the hydro-socio-technology-knowledge indicators than radial basis function and regression tree models. Yonesi et al. (2022) used GMDH to estimate discharge in compound open channels.
Building on the above-mentioned studies, the present research explores the predictive performance of the CART and GMDH methods for estimating k. Using sufficient data, k was modeled mathematically, and the performance of the two soft computing methods was compared.
MATERIALS AND METHODS
This section provides information about the soil data samples and the preparation of the data set for model development. In addition, it gives a brief review of the CART and GMDH algorithms.
Data collection
Preparation of data set for development of the proposed methods
The data set was split randomly into two groups, training data and testing data: 70% of the data were randomly selected for training, and the rest were used for testing. The training data were used for the learning process, i.e., to generate the CART and GMDH models, and the test data were used to evaluate the performance of the proposed models. The split was made so that the main statistical parameters of the training and testing sets were similar. These parameters, for the field and laboratory data, are shown in Table 1.
Variable | Data range (Train) | Data range (Test) | Average (Train) | Average (Test) |
---|---|---|---|---|
CC (%) | 5.70–64 | 6.10–63.4 | 25.04 | 25.49 |
ω (%) | 15.09–99.90 | 16.97–95.58 | 34.84 | 32.78 |
LL | 18.90–88.93 | 19.81–82.25 | 37.65 | 36.37 |
PL | 12.20–54.80 | 12.50–53.00 | 22.26 | 22.10 |
γ | 2.58–2.74 | 2.59–2.73 | 2.67 | 2.68 |
e | 0.46–2.63 | 0.52–2.51 | 0.98 | 0.93 |
k | 0.003–0.071 | 0.003–0.061 | 0.015 | 0.014 |
As can be seen, the statistical parameters of the training and testing sets have similar values. Six variables, clay content (CC), water content (ω), liquid limit (LL), plastic limit (PL), specific density (γ), and void ratio (e), were used as inputs to CART and GMDH for predicting k.
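The random 70/30 split described above can be sketched as follows. This is only an illustration: the array contents are synthetic placeholders rather than the study's samples, and the function name `split_70_30` and the fixed seeds are assumptions introduced here.

```python
import numpy as np

def split_70_30(X, y, train_fraction=0.7, seed=0):
    """Randomly split samples into training and testing subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # shuffled sample indices
    n_train = int(round(train_fraction * len(y)))
    train, test = idx[:n_train], idx[n_train:]
    return X[train], X[test], y[train], y[test]

# Synthetic placeholder data (NOT the study's measurements):
# 100 samples with 6 inputs and one target in the range reported for k.
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(100, 6))
y = rng.uniform(0.003, 0.071, size=100)

X_train, X_test, y_train, y_test = split_70_30(X, y)

# Compare the basic statistics of the two subsets, in the spirit of Table 1.
for name, part in (("train", y_train), ("test", y_test)):
    print(f"{name}: n={part.size}, range={part.min():.3f}-{part.max():.3f}, "
          f"mean={part.mean():.3f}")
```

If the subset statistics differ markedly, one option is to redraw the split with another seed until the ranges and averages are comparable.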
CART and modeling k
Breiman et al. (1984) created CART as a predictive method. The CART strategy explores the hidden relationship between a set of independent variables and a dependent variable via recursive partitioning. It is a nonparametric technique that constructs binary trees from discrete and continuous data characteristics, and it can handle both classification and regression problems (Endalie et al. 2021). Typically, a regression tree created by CART reveals the relationship between the output and the major independent variables in a highly complex dataset without using parametric approaches. During the training phase, the data series are segmented into homogeneous groups to predict or control an objective variable, which yields the regression tree structure of the model. The process of generating a tree is divided into three stages: splitting the nodes, dividing the nodes, and assigning the nodes to the appropriate end classes (Samadianfard et al. 2022). The tree obtained with the CART algorithm is shown in Figure 1.
Two rules were obtained for predicting k. Only ω was selected as an important variable for predicting k. Therefore, the value of k can be approximated quickly from ω alone.
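A single binary split on one variable, which yields exactly two rules, can be sketched as below. The sketch searches for the threshold that minimises the summed squared error of the two groups, the usual CART criterion for regression. The data values are hypothetical and the function names are assumptions; the thresholds and rule values of the tree in Figure 1 are not reproduced here.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimising the summed squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best = (None, None, None)
    best_sse = np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical values cannot be separated by a threshold
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse = sse
            threshold = 0.5 * (x_sorted[i - 1] + x_sorted[i])  # midpoint between neighbours
            best = (threshold, left.mean(), right.mean())
    return best

def predict(x, rule):
    """Apply the two rules: one constant below the threshold, one above."""
    threshold, left_mean, right_mean = rule
    return np.where(x <= threshold, left_mean, right_mean)

# Hypothetical toy data, not the study's measurements.
omega = np.array([20.0, 25.0, 30.0, 35.0, 60.0, 70.0, 80.0, 90.0])
k = np.array([0.004, 0.005, 0.004, 0.005, 0.030, 0.032, 0.031, 0.029])

rule = best_split(omega, k)
print(rule)
print(predict(np.array([22.0, 75.0]), rule))
```

A full CART tree applies the same search recursively to each resulting group and to every candidate input variable.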
GMDH and modeling k
The expressions obtained via the GMDH algorithm are as follows:
Evaluation criteria
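The five criteria reported in this study (MAE, RMSE, SI, R, and IA) can be computed as in the sketch below. The definitions used are the commonly used ones and are assumptions here: SI is taken as RMSE divided by the mean of the observations, and IA as Willmott's index of agreement. The function name `evaluation_metrics` is hypothetical.

```python
import numpy as np

def evaluation_metrics(observed, predicted):
    """Compute MAE, RMSE, SI (%), R and IA for paired observations and predictions."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = p - o
    o_bar = o.mean()

    mae = np.mean(np.abs(err))                 # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))          # root mean square error
    si = 100.0 * rmse / o_bar                  # scatter index, assumed RMSE / mean(observed)
    r = np.corrcoef(o, p)[0, 1]                # Pearson correlation coefficient
    # Willmott's index of agreement (assumed definition)
    ia = 1.0 - np.sum(err ** 2) / np.sum((np.abs(p - o_bar) + np.abs(o - o_bar)) ** 2)
    return {"MAE": mae, "RMSE": rmse, "SI": si, "R": r, "IA": ia}

# Small hand-checkable example: every prediction is one unit too high.
metrics = evaluation_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
print(metrics)
```

Lower MAE, RMSE, and SI indicate smaller errors, while R and IA closer to 1 indicate better agreement.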
RESULTS AND DISCUSSIONS
The statistical results of the evaluation criteria of the employed models in this research are presented in Table 2.
Model | RMSE | SI (%) | R | MAE | IA |
---|---|---|---|---|---|
CART (Train) | 0.0069 | 44.58 | 0.9084 | 0.0042 | 0.9465 |
CART (Test) | 0.0088 | 64.00 | 0.7841 | 0.0051 | 0.8830 |
GMDH (Train) | 0.0080 | 51.68 | 0.8721 | 0.0049 | 0.9314 |
GMDH (Test) | 0.0072 | 52.17 | 0.8493 | 0.0044 | 0.9184 |
As can be seen from Table 2, in the testing stage CART achieved R = 0.7841 and IA = 0.8830, whereas GMDH achieved R = 0.8493 and IA = 0.9184. Compared to CART, GMDH increased R and IA by 8.32% and 4.01%, respectively. In addition, the error values RMSE, SI, and MAE decreased by about 18.18%, 18.48%, and 13.73%, respectively. Therefore, the statistical indices indicate that GMDH outperformed CART for the prediction of k.
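The percentage changes quoted above follow from the testing-stage values in Table 2, taking CART as the reference. The short check below reproduces the arithmetic; the helper name `relative_change` is an assumption.

```python
# Testing-stage values taken from Table 2.
cart = {"RMSE": 0.0088, "SI": 64.00, "R": 0.7841, "MAE": 0.0051, "IA": 0.8830}
gmdh = {"RMSE": 0.0072, "SI": 52.17, "R": 0.8493, "MAE": 0.0044, "IA": 0.9184}

def relative_change(reference, value):
    """Percentage change of `value` relative to `reference`."""
    return 100.0 * (value - reference) / reference

changes = {name: relative_change(cart[name], gmdh[name]) for name in cart}

# Positive values are increases (R, IA); negative values are reductions (errors).
for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```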
As mentioned earlier, six independent variables were used to estimate k with the CART and GMDH approaches. The equations provided by GMDH are nonlinear quadratic polynomial expressions. The algorithm accounts for the interaction between the parameters involved in the estimation of k, and it ultimately retained three of the six independent variables.
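The building block of a GMDH network is a quadratic polynomial of two inputs whose six coefficients are fitted by least squares. The sketch below fits one such neuron. It is illustrative only: the data are synthetic, the function names are assumptions, and the coefficients are not those of the equations obtained in this study.

```python
import numpy as np

def design_matrix(x1, x2):
    """Terms of the quadratic polynomial: 1, x1, x2, x1*x2, x1^2, x2^2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def fit_neuron(x1, x2, y):
    """Least-squares estimate of the six polynomial coefficients."""
    coeffs, *_ = np.linalg.lstsq(design_matrix(x1, x2), y, rcond=None)
    return coeffs

def predict_neuron(coeffs, x1, x2):
    """Evaluate the fitted quadratic polynomial."""
    return design_matrix(x1, x2) @ coeffs

# Synthetic, noise-free data generated from known (hypothetical) coefficients.
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1.0, 50)
x2 = rng.uniform(0.0, 1.0, 50)
true_coeffs = np.array([0.5, 1.0, -2.0, 0.3, 0.7, -0.4])
y = design_matrix(x1, x2) @ true_coeffs

coeffs = fit_neuron(x1, x2, y)
print(np.round(coeffs, 6))
```

In a full GMDH model, such neurons are fitted for pairs of inputs, the best ones are kept according to their error on held-out data, and their outputs feed the next layer. This selection is how some input variables drop out of the final equations.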
By contrast, the CART algorithm selected only one of the six input variables, i.e., ω, to estimate k, and generated two decision rules. The rules provided by CART require negligible computation, so CART needs less computational time and effort for the prediction of k. However, because of its simplicity, CART is less accurate than the GMDH algorithm. The GMDH algorithm was able to extract the nonlinear behavior of the independent variables because of its ability to model the nonlinear behavior of phenomena. Therefore, the nonlinear relationships provided by GMDH were more skillful than the simple rules provided by CART for estimating k.

The results of the present study were compared with two earlier studies conducted by Pham et al. (2021a, 2021b). The first used the M5 Model Tree (M5MT) and Gaussian Process (GP) for the estimation of k. Its findings showed that M5MT, with RMSE = 0.0081 and MAE = 0.0045, was more accurate than GP, with RMSE = 0.0093 and MAE = 0.0054. The result of CART, a decision tree algorithm, with RMSE = 0.0088 and MAE = 0.0051, is very similar to that of M5MT, another decision tree algorithm. Both are binary decision trees and provide simple decision rules for predicting k. The later study by Pham et al. (2021b) employed three soft computing models, random forest (RF), ANN, and SVM, for predicting k. They found that RF, with R = 0.851, RMSE = 0.0084, and MAE = 0.0049, was more accurate than ANN and SVM. In addition, they examined the importance of the input parameters using the Relief F approach and found that ω and e were the most important parameters for estimating k.
Notably, CART selected ω as the most important variable for the prediction of k, and GMDH used ω and e as the most influential variables. The findings of this study are therefore supported by the previous studies. CART and GMDH identified the most influential independent variables for the prediction of k without any separate sensitivity analysis.
Compared to previous research, one of the main achievements of this study is that it provides explicit mathematical equations for the prediction of k. These equations calculate the value of k directly. The CART algorithm yields k from the value of ω alone, whereas the GMDH equations compute k from the values of three variables. In other words, GMDH can estimate k accurately using only three input variables. The graphical performance of CART and GMDH for the training and testing sets is illustrated in Figures 2–5.
SUMMARY AND CONCLUSIONS
This study investigated the application of CART and GMDH as white-box data-driven models for predicting k from soil characteristics including CC, ω, LL, PL, γ, and e. The polynomial equations developed by the GMDH model estimated k more accurately and may be a flexible and viable option for the prediction of k. CART provided a simpler relationship than the nonlinear GMDH method, determining k from ω alone. The presented methods are capable of overcoming the limitations of black-box models, such as the ANN, for the prediction of k. Both developed models indicated that ω and e have the greatest impact on the modeling and estimation of k, which is consistent with previous studies. In addition, the statistical indices and scatter diagrams indicated that, in the testing stage, GMDH had the highest values of R = 0.8493 and IA = 0.9184 and the lowest errors of MAE = 0.0044, RMSE = 0.0072, and SI = 52.17%, making it more accurate than CART in the prediction of k.
ACKNOWLEDGEMENTS
The authors would like to thank Dr Mehrshad Samadi for his valuable technical support in this research.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICTS OF INTEREST
The authors declare there is no conflict.