The permeability coefficient of soil (k) is one of the most important parameters in groundwater studies. In this study, two robust explicit data-driven methods, classification and regression trees (CART) and the group method of data handling (GMDH), were developed using soil characteristics, i.e., clay content (CC), water content (ω), liquid limit (LL), plastic limit (PL), specific density (γ), and void ratio (e), to generate predictive equations for k. In the testing stage, the GMDH equation produced lower errors and higher agreement (mean absolute error (MAE) = 0.0044, root mean square error (RMSE) = 0.0072, scatter index (SI) = 52.17%, correlation coefficient (R) = 0.8493, index of agreement (IA) = 0.9184) than CART (MAE = 0.0051, RMSE = 0.0088, SI = 64.00%, R = 0.7841, IA = 0.8830). Although GMDH performed better, both CART and GMDH can be considered effective approaches for the prediction of k.

  • Two predictive models were developed to estimate the permeability coefficient of soil.

  • GMDH and CART algorithms were evaluated in this study.

  • GMDH provided more accurate results when compared to CART for the prediction of permeability coefficient of soil.

  • The water content was the most effective parameter for determining the permeability coefficient of soil.

  • Field data were used in this research.

One of the important and challenging tasks of hydro-geologists in water resources management is the high-resolution characterization of k in aquifers. Infiltration is defined as a physical phenomenon in which water from surface sources such as precipitation penetrates into the soil. Different types of soils have different infiltration characteristics (Sajjadi et al. 2016; Torabi et al. 2020). The hydraulic conductivity of saturated soil, which is highly variable in space, is one of the most important hydraulic parameters for simulating water movement across the soil profile.

In complicated engineering problems, data-driven methods such as artificial neural networks (ANNs), adaptive neuro-fuzzy inference systems (ANFIS), wavelet–hybrid (W-hybrid) data-driven methods, evolutionary polynomial regression (EPR), support vector machines (SVMs), classification and regression trees (CART), multivariate adaptive regression splines (MARS), gene expression programming (GEP) and the group method of data handling (GMDH) are frequently used and have been applied in many civil engineering fields, especially in water-related problems (Samadi et al. 2015, 2020a, 2020b; Mojaradi et al. 2018; Torabi et al. 2022).

Generally, data-driven models are classified into white-box and black-box models. White-box data-driven models give mathematical equations that enable the relationship between the independent and dependent variables to be understood and interpreted directly (Samadi et al. 2020a, 2020b). By contrast, black-box models predict output parameters using numerical values rather than a straightforward equation between the input and output variables (Samadi et al. 2021a, 2021b). Thus, the primary advantage of white-box models over black-box models is that they generate explicit predictive expressions for estimating the output parameter, and therefore have the potential to address the limitations of black-box approaches. The present study used two white-box models, GMDH and CART, for the prediction of k. The primary reason for developing explicit data-driven methods is their ability to provide equations and to extract knowledge hidden in the data in the form of mathematical expressions.

CART and GMDH have proved to be dependable data-driven approaches for making good estimates and resolving complicated issues in various civil engineering areas. Both establish an equation between multiple independent variables and a single target variable. CART and other regression trees have been applied successfully to water-engineering problems. Samadi & Jabbari (2012) and Kamranzad & Samadi (2013) indicated that CART provides mathematical expressions that could easily be used to predict scour depth and the time series of wave height. Samadi et al. (2014) used the ANN method to determine the scour depth below ski-jump bucket and free overfall spillways. Samadi et al. (2021a, 2021b) indicated that CART is more convenient for modeling dynamic pressure distribution in hydraulic structures. Sihag et al. (2021a, 2021b) simulated soil infiltration rate using regression tree methods.

Recently, the GMDH predictive approach has been implemented to model a range of water-related problems. For example, Qaderi et al. (2018) assessed GMDH for predicting bed form dimensions. Khedri et al. (2020) used GMDH to estimate groundwater levels. Sihag et al. (2021a, 2021b) predicted the aeration efficiency of a flume using GMDH. Nasrabadi et al. (2021) used the GMDH method to predict submerged hydraulic jump characteristics such as jump length, relative energy loss and relative submergence depth. Zeinali et al. (2021) showed that the GMDH model performed better in estimating the hydro-socio-technology-knowledge indicators than radial basis function and regression tree models. Yonesi et al. (2022) used GMDH to estimate discharge in compound open channels.

Building on the above-mentioned studies, the present research explores the predictive performance of the CART and GMDH methods for estimating k. Using the available data, k was modeled mathematically and the performance of the two soft computing methods was compared.

This section provides information about the soil data samples and the preparation of the data set for model development. In addition, it gives a brief review of the CART and GMDH algorithms.

Data collection

The data set contains 84 soil samples taken from Pham et al. (2021a), and the relationship between k and the soil characteristics is expressed as follows:
$$k = f(CC, \omega, LL, PL, \gamma, e)$$ (1)

Preparation of data set for development of the proposed methods

The data set was split randomly into training and testing groups: 70% of the data were randomly selected for training and the remaining 30% were used for testing. The training data were used for the learning process, i.e., the generation of the CART and GMDH models, and the test data were used to evaluate the performance of the proposed models. The split was made so that the main statistical parameters of the training and testing sets were similar. The main statistical parameters of the training and testing field and laboratory data are shown in Table 1.
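As an illustration of the random 70/30 partition described above, the following minimal sketch shuffles sample indices and cuts them at 70%. The function name `split_train_test`, the rounding rule, and the seed are assumptions made for the example; the paper does not state how the split was implemented.

```python
import numpy as np

def split_train_test(n_samples, train_fraction=0.7, seed=0):
    """Return shuffled index arrays for the training and testing subsets."""
    rng = np.random.default_rng(seed)          # fixed seed: assumption, for repeatability
    order = rng.permutation(n_samples)         # random ordering of 0..n_samples-1
    n_train = int(round(train_fraction * n_samples))
    return order[:n_train], order[n_train:]

# 84 samples, as in the data set described above
train_idx, test_idx = split_train_test(84)
print(len(train_idx), len(test_idx))  # 59 25
```

In practice one would repeat the shuffle with different seeds until the ranges and averages of the two subsets are comparable, which is the property reported in Table 1.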

Table 1

Summary of basic statistical characteristics of the training and testing data

Variable   Range (train)   Range (test)   Average (train)   Average (test)
CC (%)     5.70–64         6.10–63.4      25.04             25.49
ω (%)      15.09–99.90     16.97–95.58    34.84             32.78
LL         18.90–88.93     19.81–82.25    37.65             36.37
PL         12.20–54.80     12.50–53.00    22.26             22.10
γ          2.58–2.74       2.59–2.73      2.67              2.68
e          0.46–2.63       0.52–2.51      0.98              0.93
k          0.003–0.071     0.003–0.061    0.015             0.014

As seen in Table 1, the ranges and averages of the training and testing sets are similar. Six variables, namely clay content (CC), water content (ω), liquid limit (LL), plastic limit (PL), specific density (γ) and void ratio (e), were used as inputs to CART and GMDH for predicting k.

CART and modeling k

Breiman et al. (1984) created CART as a predictive method. The CART strategy explores the hidden relationship between a set of independent variables and a dependent variable via recursive partitioning. It is a nonparametric technique for constructing binary trees from discrete and continuous data characteristics. Typically, a regression tree created by CART reveals the relationship between the output and the major independent variables in a highly complex dataset without using parametric approaches. CART is based on binary splitting and can handle both regression and classification problems (Endalie et al. 2021). During the training phase, the data series are segmented into homogeneous groups to predict or control an objective variable, which yields the regression tree structure of the model. The process of generating a tree is divided into three stages: splitting the nodes, dividing the nodes, and assigning the nodes to the appropriate end classes (Samadianfard et al. 2022). Using the CART algorithm, the regression tree shown in Figure 1 was obtained.

The decision rules concerning Figure 1:
(2)
Figure 1

The regression tree for prediction of k.

Two rules were obtained for predicting k. Only ω was selected as an important variable for predicting k. Therefore, one can quickly approximate the value of k by knowing only ω.
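A tree with two rules on a single variable is a one-split regression tree. The sketch below shows how such a split can be found by minimizing the summed squared error of the two leaf means, which is the usual CART regression criterion. The data are synthetic and purely illustrative, the function names are hypothetical, and the resulting threshold and leaf values are not those of the paper's tree.

```python
import numpy as np

def fit_stump(x, y):
    """Find the threshold on x that minimizes the summed squared error of two leaf means.

    Assumes x contains at least two distinct values.
    """
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no valid threshold between equal values
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            threshold = 0.5 * (xs[i] + xs[i - 1])  # midpoint between neighbours
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(x, threshold, left_value, right_value):
    """Apply the two decision rules: x <= threshold -> left leaf, else right leaf."""
    return np.where(x <= threshold, left_value, right_value)

# Synthetic illustration only (NOT the study's data)
omega = np.array([20.0, 25.0, 30.0, 35.0, 60.0, 70.0, 80.0, 90.0])
k_obs = np.array([0.005, 0.006, 0.005, 0.006, 0.030, 0.032, 0.031, 0.029])
threshold, low, high = fit_stump(omega, k_obs)
k_hat = predict_stump(omega, threshold, low, high)
```

Each leaf predicts the mean of the training targets that fall into it, so evaluating the fitted model needs only one comparison per sample.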

GMDH and modeling k

Ivakhnenko (1971) developed a strategy that could be adapted to complicated systems to generate a self-organizing model capable of predicting and identifying system problems. The GMDH neural network employs a class of polynomials to define relationships between input and output parameters. It is a feedforward, multi-layered network made up of neurons, each of which combines a pair of inputs through a second-order polynomial, and these neurons together form the final network structure. Kolmogorov-Gabor polynomials can simulate nonlinear systems with a single output and numerous inputs as below:
$$y = a_0 + \sum_{i=1}^{n} a_i x_i + \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} x_i x_j + \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} a_{ijk} x_i x_j x_k + \cdots$$ (3)
where y is the output of the nonlinear system, x = (x_1, ..., x_n) is the vector of input variables, and a is the vector of polynomial coefficients. It is possible to have one or more neurons per layer in this network; however, each has two inputs and only a single output. These neurons serve as the model's constituents, and each is taken to be a second-order polynomial of its two inputs x_p and x_q, as in Equation (4).
$$\hat{y} = G(x_p, x_q) = a_0 + a_1 x_p + a_2 x_q + a_3 x_p x_q + a_4 x_p^2 + a_5 x_q^2$$ (4)
The coefficients of Equation (4) are found by the least-squares approach, i.e., by minimizing the mean squared error between the observed outputs y_m and the neuron outputs over the M training samples, Equation (5).
$$E = \frac{1}{M}\sum_{m=1}^{M}\left(y_m - G(x_{mp}, x_{mq})\right)^2 \rightarrow \min$$ (5)
The partial derivatives of Equation (5) with respect to the coefficients are used to determine the minimum error. By substituting Equation (4) into these partial derivatives, a matrix equation can be presented as follows:
$$A a = Y$$ (6)
where $a = \{a_0, a_1, a_2, a_3, a_4, a_5\}^T$ and $Y = \{y_1, y_2, \ldots, y_M\}^T$. In addition, the matrix A is expressed as:
$$A = \begin{bmatrix} 1 & x_{1p} & x_{1q} & x_{1p}x_{1q} & x_{1p}^2 & x_{1q}^2 \\ 1 & x_{2p} & x_{2q} & x_{2p}x_{2q} & x_{2p}^2 & x_{2q}^2 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 1 & x_{Mp} & x_{Mq} & x_{Mp}x_{Mq} & x_{Mp}^2 & x_{Mq}^2 \end{bmatrix}$$ (7)
The technique of singular value decomposition is utilized to solve the aforementioned matrix equation, through which the unknown coefficient vector a is calculated using Equation (8), in which $A^T$ is the transpose of the matrix A.
$$a = (A^T A)^{-1} A^T Y$$ (8)
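The least-squares step for a single two-input quadratic neuron can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the column order of the design matrix (1, x1, x2, x1·x2, x1², x2²), the function names, and the synthetic data are all choices made for the example. `numpy.linalg.lstsq` is used because it solves the least-squares problem with an SVD-based routine, which matches the solution strategy described above.

```python
import numpy as np

def design_matrix(x1, x2):
    """One row per sample; columns: 1, x1, x2, x1*x2, x1^2, x2^2 (assumed order)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

def fit_neuron(x1, x2, y):
    """Least-squares coefficients of the quadratic neuron (SVD-based solver)."""
    A = design_matrix(x1, x2)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def predict_neuron(coeffs, x1, x2):
    """Evaluate the fitted quadratic polynomial for new input pairs."""
    return design_matrix(x1, x2) @ coeffs

# Synthetic, noise-free check: the fit should recover the generating coefficients
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1.0, 50)
x2 = rng.uniform(0.0, 1.0, 50)
true_coeffs = np.array([0.5, 1.0, -2.0, 0.3, 0.1, 0.7])  # hypothetical values
y = design_matrix(x1, x2) @ true_coeffs
coeffs = fit_neuron(x1, x2, y)
```

A full GMDH network repeats this fit for every pair of inputs in a layer, keeps the best-performing neurons on held-out data, and feeds their outputs to the next layer.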

The expressions obtained via the GMDH algorithm are as follows:

The first layer:
(9)
The output layer:
(10)

Evaluation criteria

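The performance of GMDH and CART was evaluated using the root mean square error (RMSE), scatter index (SI), correlation coefficient (R), mean absolute error (MAE) and index of agreement (IA), defined as follows:
$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(P_i - O_i)^2}$$ (11)
$$SI = \frac{RMSE}{\bar{O}} \times 100$$ (12)
$$R = \frac{\sum_{i=1}^{N}(O_i - \bar{O})(P_i - \bar{P})}{\sqrt{\sum_{i=1}^{N}(O_i - \bar{O})^2 \sum_{i=1}^{N}(P_i - \bar{P})^2}}$$ (13)
$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|P_i - O_i\right|$$ (14)
$$IA = 1 - \frac{\sum_{i=1}^{N}(P_i - O_i)^2}{\sum_{i=1}^{N}\left(\left|P_i - \bar{O}\right| + \left|O_i - \bar{O}\right|\right)^2}$$ (15)
where $O_i$ and $P_i$ represent the observed and predicted values, $\bar{O}$ and $\bar{P}$ represent the averages of $O_i$ and $P_i$, respectively, and N is the number of data points.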
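The five criteria can be computed in a few lines of NumPy. The sketch below assumes their standard definitions (SI as RMSE divided by the observed mean and expressed in percent, Pearson's correlation coefficient for R, and Willmott's index of agreement for IA). The function name `evaluate` and the sample values are illustrative and are not taken from the study.

```python
import numpy as np

def evaluate(observed, predicted):
    """Return RMSE, SI (%), R, MAE and IA for paired observed/predicted values."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = p - o
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    si = 100.0 * rmse / o.mean()               # scatter index in percent
    r = np.sum((o - o.mean()) * (p - p.mean())) / np.sqrt(
        np.sum((o - o.mean()) ** 2) * np.sum((p - p.mean()) ** 2)
    )
    ia = 1.0 - np.sum(err ** 2) / np.sum(
        (np.abs(p - o.mean()) + np.abs(o - o.mean())) ** 2
    )
    return {"RMSE": rmse, "SI": si, "R": r, "MAE": mae, "IA": ia}

# Illustrative values only
scores = evaluate([0.010, 0.020, 0.030], [0.012, 0.018, 0.033])
perfect = evaluate([0.010, 0.020, 0.030], [0.010, 0.020, 0.030])
```

For a perfect prediction the error measures are zero and both R and IA equal one, which is a quick sanity check on any implementation.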

The statistical results of the evaluation criteria of the employed models in this research are presented in Table 2.

Table 2

The statistical results of the present study

Model          RMSE     SI (%)   R        MAE      IA
CART (Train)   0.0069   44.58    0.9084   0.0042   0.9465
CART (Test)    0.0088   64.00    0.7841   0.0051   0.8830
GMDH (Train)   0.0080   51.68    0.8721   0.0049   0.9314
GMDH (Test)    0.0072   52.17    0.8493   0.0044   0.9184

As can be seen from Table 2, in the testing stage CART has R = 0.7841 and IA = 0.8830, whereas GMDH has R = 0.8493 and IA = 0.9184. Compared to CART, GMDH increases R and IA by 8.32% and 4.01%, respectively. In addition, the error measures RMSE, SI and MAE decrease by about 18.18%, 18.48% and 13.73%, respectively. Therefore, the statistical indices indicate that GMDH outperformed CART for the prediction of k.
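The percentages quoted above are relative changes with respect to the CART value, 100 × (GMDH − CART) / CART, applied to the testing-stage rows of Table 2. The short sketch below reproduces the arithmetic; the dictionary layout and function name are illustrative choices.

```python
# Testing-stage values copied from Table 2
table2_test = {
    "CART": {"RMSE": 0.0088, "SI": 64.00, "R": 0.7841, "MAE": 0.0051, "IA": 0.8830},
    "GMDH": {"RMSE": 0.0072, "SI": 52.17, "R": 0.8493, "MAE": 0.0044, "IA": 0.9184},
}

def relative_change(metric):
    """Percentage change of GMDH relative to CART for one metric."""
    cart = table2_test["CART"][metric]
    gmdh = table2_test["GMDH"][metric]
    return 100.0 * (gmdh - cart) / cart

changes = {m: round(relative_change(m), 2) for m in table2_test["CART"]}
print(changes)
# {'RMSE': -18.18, 'SI': -18.48, 'R': 8.32, 'MAE': -13.73, 'IA': 4.01}
```

Negative values are reductions in error, and positive values are gains in correlation and agreement.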

As mentioned earlier, six independent variables were used for the estimation of k by the CART and GMDH approaches. The equations provided by GMDH were nonlinear quadratic polynomial expressions. This algorithm considers the interaction between the parameters involved in the estimation of k, and three of the six independent variables were retained in its final equations.

On the other hand, the CART algorithm selected only one independent variable, i.e., ω, from among the six input variables and generated two decision rules to estimate k. The rules provided by CART require almost no computation, so predicting k with them takes less time and effort. However, due to its simplicity, CART is less accurate than the GMDH algorithm. The GMDH algorithm was able to extract the nonlinear behavior of the independent variables because its polynomial structure is designed to model nonlinear phenomena. Therefore, the nonlinear relationships provided by GMDH were more skillful than the simple rules provided by CART for estimating k.

The results of the present study were compared with two earlier studies by Pham et al. (2021a, 2021b). The first used the M5 Model Tree (M5MT) and Gaussian Process (GP) for the estimation of k. Its findings showed that M5MT, with RMSE = 0.0081 and MAE = 0.0045, was more accurate than GP, with RMSE = 0.0093 and MAE = 0.0054. The result of CART, with RMSE = 0.0088 and MAE = 0.0051, is very similar to that of M5MT, which is another decision tree algorithm. Both are binary decision trees and provide simple decision rules for the prediction of k. The later study by Pham et al. (2021b) employed three soft computing models, random forest (RF), ANN and SVM, for predicting k. They found that RF, with R = 0.851, RMSE = 0.0084 and MAE = 0.0049, was more accurate than ANN and SVM. In addition, they examined the importance of the input parameters using the Relief F approach and found that ω and e were the most important parameters for estimating k.

Notably, CART selected ω as the most important variable for the prediction of k, and GMDH used ω and e as the most influential variables. Therefore, the findings of this study are supported by the previous studies. CART and GMDH identified the most influential independent variables for the prediction of k without any separate sensitivity analysis.

Compared to previous research, one of the main achievements of this study is that it provides explicit mathematical equations that calculate the value of k directly. The CART algorithm yields k from the value of ω alone, whereas GMDH computes k from three variables. In fact, GMDH makes it possible to estimate k accurately using only three variables. The graphical performance of CART and GMDH for the training and testing sets is illustrated in Figures 2–5.

Figure 2

(a) Scatter plot of k using CART in the training stage. (b) Actual results versus CART results of k in the training stage.

Figure 3

(a) Scatter plot of k using CART in the testing stage. (b) Actual results versus CART results of k in the testing stage.

Figure 4

(a) Scatter plot of k using GMDH in the training stage. (b) Actual results versus GMDH results of k in the training stage.

Figure 5

(a) Scatter plot of k using GMDH in the testing stage. (b) Actual results versus GMDH results of k in the testing stage.

This study investigated the application of CART and GMDH as white-box data-driven models for predicting k from soil characteristics including CC, ω, LL, PL, γ and e. The polynomial equations developed by the GMDH model estimated k more accurately, and they may be a flexible, viable option for the prediction of k. CART provided a simpler relationship than the nonlinear GMDH equations, determining k from ω alone. The presented methods can overcome the limitations of black-box models such as ANN for the prediction of k. The developed CART and GMDH models indicated that ω and e have the greatest impact on the modeling and estimation of k, which is consistent with previous studies. In addition, statistical indices and scatter diagrams indicated that, in the testing stage, GMDH had the higher values of R = 0.8493 and IA = 0.9184 and the lower errors of MAE = 0.0044, RMSE = 0.0072 and SI = 52.17%, making it more accurate than CART in the prediction of k.

The authors would like to thank Dr Mehrshad Samadi for his valuable technical support in this research.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Breiman L., Friedman J., Olshen R. & Stone C. 1984 Classification and regression trees. Wadsworth International Group 37 (15), 237–251.
Endalie D., Haile G. & Taye W. 2021 Deep learning model for daily rainfall prediction: case study of Jimma, Ethiopia. Water Supply 22 (3), 3448–3461. https://doi.org/10.2166/ws.2021.391.
Ivakhnenko A. 1971 Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics SMC-1 (4), 364–378.
Kamranzad B. & Samadi M. 2013 Assessment of soft computing models to estimate wave heights in Anzali port. Journal of Marine Engineering 9 (17), 27–36.
Mojaradi B., Alizadeh S. F. & Samadi M. 2018 Estimation of Water Quality Index in Talar River Using Gene Expression Programming and Artificial Neural Networks.
Nasrabadi M., Mehri Y., Ghassemi A. & Omid M. H. 2021 Predicting submerged hydraulic jump characteristics using machine learning methods. Water Supply 21 (8), 4180–4194.
Pham B. T., Ly H. B., Al-Ansari N. & Ho L. S. 2021a A comparison of Gaussian process and M5P for prediction of soil permeability coefficient. Scientific Programming 2021, 1–13. https://doi.org/10.1155/2021/3625289.
Pham B. T., Nguyen M. D., Al-Ansari N., Tran Q. A., Ho L. S., Le H. V. & Prakash I. 2021b A comparative study of soft computing models for prediction of permeability coefficient of soil. Mathematical Problems in Engineering 2021, 1–11. https://doi.org/10.1155/2021/7631493.
Qaderi K., Maddahi M. R., Rahimpour M. & Masoumi Shahr-babak M. 2018 Investigating the capability of two hybrid intelligence methods to predict bedform dimensions of alluvial channels. Water Science and Technology: Water Supply 18 (5), 1706–1718.
Sajjadi S. A. H., Mirzaei M., Nasab A. F., Ghezelje A., Tadayonfar G. & Sarkardeh H. 2016 Effect of soil physical properties on infiltration rate. Geomechanics & Engineering 10 (6), 727–736.
Samadi M. & Jabbari E. 2012 Assessment of regression trees and multivariate adaptive regression splines for prediction of scour depth below the ski-jump bucket spillway. Journal of Hydraulics 7 (3), 73–79.
Samadi M., Jabbari E., Azamathulla H. M. & Mojallal M. 2015 Estimation of scour depth below free overfall spillways using multivariate adaptive regression splines and artificial neural networks. Engineering Applications of Computational Fluid Mechanics 9 (1), 291–300.
Samadi M., Afshar M. H., Jabbari E. & Sarkardeh H. 2020a Application of multivariate adaptive regression splines and classification and regression trees to estimate wave-induced scour depth around pile groups. Iranian Journal of Science and Technology, Transactions of Civil Engineering 44 (1), 447–459.
Samadi M., Sarkardeh H. & Jabbari E. 2020b Explicit data-driven models for prediction of pressure fluctuations occur during turbulent flows on sloping channels. Stochastic Environmental Research and Risk Assessment 34 (5), 691–707.
Samadi M., Afshar M. H., Jabbari E. & Sarkardeh H. 2021a Prediction of current-induced scour depth around pile groups using MARS, CART, and ANN approaches. Marine Georesources & Geotechnology 39 (5), 577–588.
Samadianfard S., Mikaeili F. & Prasad R. 2022 Evaluation of classification and decision trees in predicting daily precipitation occurrences. Water Supply 22 (4), 3879–3895. https://doi.org/10.2166/ws.2022.017.
Torabi M., Sarkardeh H. & Mirhosseini S. M. 2020 Effect of water temperature on hydraulic conductivity of soil with and without coarse aggregates. In: 19th Iranian Hydraulic Conference, Mashhad, Iran.
Torabi M., Sarkardeh H. & Mirhosseini S. M. 2022 Prediction of soil permeability coefficient using GEP approach. Numerical Methods in Civil Engineering (in press).
Yonesi H. A., Parsaie A., Arshia A. & Shamsi Z. 2022 Discharge modeling in compound channels with non-prismatic floodplains using GMDH and MARS models. Water Supply 22 (4), 4400–4421. https://doi.org/10.2166/ws.2022.058.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).