Understanding the interactions among the natural components of hydrological systems is essential for managing water resources, addressing environmental concerns, and mitigating the impacts of natural events such as floods and droughts. In this study, long-term (1993–2019) hydrologic connectivity among precipitation, humidity, evaporation, surface runoff, minimum/maximum temperature, and their interactions with groundwater level (GWL) in the southeastern Caspian Sea region was assessed using mutual information theory and the state-of-the-art jittered genetic programming approach. While the former was used to find dominant components, their effective time delay, and the power of potential nonlinear interactions, the latter was utilized to extract an explicit relation between the GWL and the most influential components. The data were gathered from several piezometers, meteorological stations, and a stream gauge available in the study area. The results showed an overall positive trend in the GWL with an increasing rate since 2007 that reflects the influence of artificial recharge infrastructures built in the study area. Statistical connectivity analyses demonstrated that historical precipitation and streamflow series have the least impact on the temporal variation of the average GWL.

  • The Jittered Genetic Programming (JGP) model was introduced for hydrologic connectivity analysis.

  • A new MATLAB script was developed for lagged mutual information calculation.

  • The study highlighted the importance of hydrological connectivity analysis for dynamic groundwater pattern recognition.

Hydrosystems, also known as hydrological systems, encompass various natural components that interact to regulate the movement, distribution, and quality of water. The key natural components of hydrosystems include (but are not limited to) surface and groundwater bodies, precipitation, soil moisture, vegetation, temperature, etc. The groundwater level (GWL) is an essential aspect of hydrological systems and plays a vital role in water supply, agriculture, and ecosystems. It refers to the depth at which water is found beneath the Earth's surface in underground aquifers or water-bearing formations. The GWL can vary depending on several factors, including precipitation, evaporation, surface water interactions, geological conditions, and human activities such as groundwater pumping and recharging. When precipitation exceeds evaporation and water infiltrates into the ground, the GWL tends to rise. Conversely, it can decline during periods of low rainfall or excessive pumping.

Soft computing techniques (SCTs) have been increasingly used in the field of civil engineering, especially hydrology, to detect underlying patterns and accurately predict environmental variables (Creaco et al. 2022; Ghasemi et al. 2023; Yin et al. 2023). Regarding the changing trends of GWL, several studies have been conducted on regional and catchment scales (Roushangar et al. 2020; Nourani et al. 2023). Among different SCTs, artificial neural networks (ANNs) and support vector regression (SVR) have been frequently used as an efficient way to solve problems in the groundwater-related field, such as GWL prediction (Daliakopoulos et al. 2005; Sreekanth et al. 2009; Nourani et al. 2011; Gholami & Sahour 2022; Roshani & Hamidi 2022; Dąbrowska et al. 2023). For example, Nourani et al. (2016) used the Self-Organizing Map (SOM)-based clustering method and the hybrid wavelet-Adaptive Neuro-Fuzzy Inference System (ANFIS) approach for GWL prediction and showed the strength of ANNs to provide estimation accuracy. Adamowski & Chan (2011) proposed a new method based on the wavelet and ANN predictions for GWL prediction. The exact prediction results for both St-Remi and Mercier sites in the Chateauguay basin indicate that the Wavelet Analysis based ANN (WA-ANN) method is a suitable method for predicting GWLs. Hosseini et al. introduced a program of evolved ANNs using a simple genetic algorithm to simulate monthly GWLs in the coastal aquifer in the Shabestar Plain, Iran. This research shows that employing a genetic algorithm to initialize neural connection weights in complex space avoids the risk of becoming stuck in local minima, which can prevent high costs and time-wasting in water resource management projects for drilling more piezometers. Gholami (2022) combined modular ANNs with adaptive fuzzy inputs to increase the accuracy of the annual GWL modeling process on the southern coasts of the Caspian Sea (Iran). More recently, Roshani & Hamidi (2022) used ANNs to predict GWL fluctuations in the Sari-Neka aquifer, the Caspian Sea basin. The authors showed that the groundwater depth of the Sari-Neka basin is expected to decrease with a gentle slope during the period 2021–2040.

Genetic programming (GP) is rather a new class of SCTs that is applied to a wide range of problem domains, including symbolic regression, classification, control systems, and optimization. It allows for the automatic generation of programs without the need for explicit programming by humans (Danandeh Mehr et al. 2023). GP has been used successfully in various fields, such as data mining, robotics, bioinformatics, and finance, to tackle complex problems and discover novel solutions. Various versions of GP such as monotonic GP, linear GP, multigene GP, and gene expression GP have been satisfactorily used in hydrological studies (Uyumaz et al. 2014; Rahmani-Rezaeieh et al. 2020; Samadi et al. 2020, 2021; Mehr & Gandomi 2021). A recent review study on the application of SCTs for GWL prediction conducted by Tao et al. (2022) indicated that GP was used only in 5.0% of the artificial intelligence models applied for GWL modeling, so that there is room for more research on the application of GP for GWL prediction. Some of the recent implications of GP variants for GWL estimation include (not limited to) the study of Fallah-Mehdipour et al. (2013), Kasiviswanathan et al. (2016), Aryafar et al. (2019), and Abdolahzadeh & Schmalz (2022). The existing studies mostly trained a standalone GP engine or hybridized it to find the best short-term (less than a season) GWL estimating model. Differing from the earlier studies, we applied GP to identify how different hydrometeorological variables interconnect with GWL fluctuations from a temporal perspective. Thus, standalone monotonic GP was used in this study without any attempt to enhance its performance via common hybridization techniques.

In recent years, there has been a burgeoning interest in the interconnectedness of ecosystems and how various components respond to both natural variations and human-induced activities. This approach has gained traction as a valuable tool for comprehending the fundamental principles that inform policy-making and ecosystem management (Rinderer et al. 2018; Zhang et al. 2023; Zhou & Yang 2023). For instance, Rinderer et al. (2018) explored hydrologic connectivity among groundwater and streamflow time series from a Swiss catchment and demonstrated that employing ground-based high-frequency sensors as well as emerging remote sensing techniques can provide new opportunities for a better understanding of connectivity across ecosystems. Another noteworthy study by Zhou et al. (2022) introduced a novel algorithm for mining prevalent negative co-location patterns in spatial data. Meanwhile, Jia & Zhou (2023) proposed a response surface methodology to estimate the global karst carbon sink, demonstrating that functional connectivity analysis could significantly reduce numerical calculation time, making it a viable tool for global weathering models. These investigations collectively highlight the versatility of the concept of connectivity in both ecology and hydrology. In ecology, it is defined as the extent to which the landscape facilitates or hinders movement among resource patches. In hydrology, on the other hand, it pertains to the water-mediated transfer of matter, energy, and/or organisms within or between elements of the hydrologic cycle (Rinderer et al. 2018). The integration of these perspectives underscores the interdisciplinary nature of connectivity studies, offering valuable insights for informed decision-making in environmental science and management.

Monitoring GWL is important for hydrosystem management and assessing the overall health of aquifers. This information helps determine the availability of water for various uses, including drinking water, irrigation, and industrial purposes. Additionally, GWL estimation helps to prevent issues such as land subsidence, saltwater intrusion, and ecological disruptions caused by significant changes in water table depths. Therefore, the purpose of this study is to use emerging statistical methods and soft computing tools (i) to assess the hydrologic connectivity between various hydrosystem components and then (ii) to identify the underlying pattern of GWL based on the hydrologic connectivity and evolutionary GP technique. The study focuses on the southeastern Caspian Sea region that covers three coastal catchments and an unconfined aquifer. This article is expected to contribute to the development of a more effective and sustainable water supply for the domestic and agricultural water consumption in the study area, that is, facing water scarcity and negative climate change impacts these years.

As previously mentioned, this study was conducted in a rural region laid between the Alborz mountains and the southeastern coast of the Caspian Sea. The study area encompasses three stream catchments, namely Jafakende, Nowkandeh, and Vatana, several villages, and a port town known as Bandar Gaz (see Figure 1). The underground water body of the area is an unconfined aquifer, called Jafakande, which is the main source of industrial, domestic, and agricultural water use in the study area. This area has an approximate population of 24,000 with 250-l daily water use per capita.
Figure 1

The Jafakende watershed and monitoring stations in the study area.

Figure 1

The Jafakende watershed and monitoring stations in the study area.

Close modal
To accomplish statistical analysis, observed monthly GWL data together with total monthly precipitation (P), mean monthly humidity (H), total monthly evaporation (E), minimum temperature (Tmin), maximum temperature (Tmax), and total runoff volume (Q) during the period of 1993–2019 were used. The monthly data were collected from two rain gauge stations (Jafakadeh and Vatana), one evaporation measuring station (Kordkoy), one hydrometry station (Jafakende), and six piezometers located in the study area. The data were acquired from the Iran Water Resources Management company that guarantees observation quality before sharing them with academia. Table 1 presents the geographical features of the measuring stations. The time series of the observed data together with their main statistical features are presented in Figure 2 and Table 2, respectively.
Table 1

Geographical features of hydrometeorological stations and piezometers

Station nameStation typeHeight (m)Universal Transverse Mercator (UTM)Coordinates
LongitudeLatitude
Jafakende Weather 30 228941 4069303 
Vatana Weather 100 229122 4067740 
Jafakende Hydrometric 30 226153 4069676 
Livan Piezometer 1 20 757138 4070313 
Nokande Piezometer 2 20 759738 4070278 
Banafsheh Tapeh Piezometer 3 38.58 224602 4068109 
Gaz Piezometer 4 21.02 228430 4069821 
Dashteh Kalateh Piezometer 5 19.57 227145 4070728 
Bandar Gaz Piezometer 6 −21.63 227473 4075185 
Station nameStation typeHeight (m)Universal Transverse Mercator (UTM)Coordinates
LongitudeLatitude
Jafakende Weather 30 228941 4069303 
Vatana Weather 100 229122 4067740 
Jafakende Hydrometric 30 226153 4069676 
Livan Piezometer 1 20 757138 4070313 
Nokande Piezometer 2 20 759738 4070278 
Banafsheh Tapeh Piezometer 3 38.58 224602 4068109 
Gaz Piezometer 4 21.02 228430 4069821 
Dashteh Kalateh Piezometer 5 19.57 227145 4070728 
Bandar Gaz Piezometer 6 −21.63 227473 4075185 
Table 2

Main statistical features of the observed data during the period 1993–2019

VariableMeanMedianMinimumMaximumStd.Dev.Skewness
P (mm) 59.233 51.85 231.3 41.097 0.69382 0.81621 
H (%) 80.973 81.1 61.33 94.4 7.8966 0.09752 −0.80209 
E (mm) 81.656 75.05 9.9 193.8 46.191 0.56567 −0.99235 
Tmin (°C) 12.28 12.1 −2.9 25.13 7.3604 0.59936 −1.3663 
Tmax (°C) 22.789 22.65 7.7 37.7 7.6748 0.33678 −1.3318 
R (MCM) 4.4406 2.285 36.85 6.039 1.3599 6.5207 
GWL1 3.8388 3.72 0.8 9.15 1.7259 0.4496 −0.47116 
GWL2 3.0732 3.025 5.02 0.63703 0.20729 −0.27055 
GWL3 11.806 11.83 8.84 14.2 1.1676 0.098894 −0.50337 
GWL4 12.075 12.31 8.4 15.62 1.8095 0.14986 −1.0243 
GWL5 3.6446 3.28 1.41 6.52 1.061 0.29112 −0.65206 
GWL6 1.9919 1.93 0.87 3.91 0.51672 0.25941 0.66812 
VariableMeanMedianMinimumMaximumStd.Dev.Skewness
P (mm) 59.233 51.85 231.3 41.097 0.69382 0.81621 
H (%) 80.973 81.1 61.33 94.4 7.8966 0.09752 −0.80209 
E (mm) 81.656 75.05 9.9 193.8 46.191 0.56567 −0.99235 
Tmin (°C) 12.28 12.1 −2.9 25.13 7.3604 0.59936 −1.3663 
Tmax (°C) 22.789 22.65 7.7 37.7 7.6748 0.33678 −1.3318 
R (MCM) 4.4406 2.285 36.85 6.039 1.3599 6.5207 
GWL1 3.8388 3.72 0.8 9.15 1.7259 0.4496 −0.47116 
GWL2 3.0732 3.025 5.02 0.63703 0.20729 −0.27055 
GWL3 11.806 11.83 8.84 14.2 1.1676 0.098894 −0.50337 
GWL4 12.075 12.31 8.4 15.62 1.8095 0.14986 −1.0243 
GWL5 3.6446 3.28 1.41 6.52 1.061 0.29112 −0.65206 
GWL6 1.9919 1.93 0.87 3.91 0.51672 0.25941 0.66812 
Figure 2

Timeseries plots of the observed monthly data during 1993–2019.

Figure 2

Timeseries plots of the observed monthly data during 1993–2019.

Close modal
According to Table 2, the Jafakende River carries approximately 4.4 MCM of water each month to the Caspian Sea. Thus, the local authorities designed an artificial recharge system (Figure 3) on the river where approximately 1.6 MCM of floodwater is diverted into the floodwater distribution system every year to recharge groundwater (Jafariroodsari et al. 2021). Despite significant fluctuations in GWL, Figure 2 illustrates overall increasing trends in all piezometers. With respect to their mean level, P1, P2, and P5 can be grouped in the same class. P3 and P4 show the maximum level, and P6 shows the minimum level, indicating that GWL increases by distance from the Caspian Sea.
Figure 3

The artificial recharge system constructed (2005–2007) to transfer the water of the Jafakande River into three reservoirs for the purpose of surface runoff storage and feeding the aquifer.

Figure 3

The artificial recharge system constructed (2005–2007) to transfer the water of the Jafakande River into three reservoirs for the purpose of surface runoff storage and feeding the aquifer.

Close modal

To assess hydrologic connectivity between the hydrometeorological variables (i.e., H, E, Tmax, Tmin, R) and the GWL in the study area, first, a representative GWL time series (hereafter GWLave) was obtained for the whole area by calculating the arithmetic mean of GWL at all piezometers. Then, a set of linear and nonlinear connectivity measures including the pairwise correlation coefficient (PCC) and mutual information (MI) was calculated. In hydrological time series, the value of a variable at a particular time can be influenced by previous values, so-called lags, of the same variable or other indicator variables. Therefore, we also looked for leading indicators when conducting conductivity analysis. In other words, in addition to evaluating the linear and nonlinear correlations between two concurrent time series, we investigate the correlation between the GWL and the independent variables with a few time lags (here 36 months). Thus, lagged PCC and lagged MI (LMI) were also calculated to find the dominant lags. Finally, a monotonic jittered genetic programming (JGP) method was employed to link the GWL in the study area with the most dominant independent variables and their lags. To cope with the existing trends and dimensional inconsistency of both dependent (i.e., GWLave) and independent variables, they were standardized first, and then made stationary via a differencing method before the modeling stage. Due to the rather short period of historical observations, jittered-based augmented time series were used for training and testing the GP model.

Pairwise correlation coefficient

Pearson's product-moment correlation, also known as PCC, is perhaps the most utilized linear connectivity measure in hydro-environmental studies. PCC (Equation (1)) quantifies the strength and direction of the linear association between two signals X = {x1, x2, … , xn} and Y = {y1, y2, … , yn}. PCC varies in the range from −1.0 to 1.0 indicating negative (reverse) or positive linear correlation between the signals.
(1)
where and are the arithmetic means of the signals X and Y, respectively.

A value of 0.0 implies no linear correlation between the signals. A value of 1 indicates a perfect positive linear relationship, where an increase in one variable is associated with a proportional increase in the other variable. A value of −1 indicates a perfect negative linear relationship, where an increase in one variable is associated with a proportional decrease in the other variable.

Mutual information and lagged mutual information

MI measures the statistical dependence or information shared between two random variables. It quantifies how much information is obtained about one variable by knowing the value of another variable. In other words, it measures the amount of information that two variables share. MI is often used in the fields where a nonlinear relationship between variables is aimed to be detected (Batina et al. 2011).

MI is defined based on the concept of entropy. Entropy measures the uncertainty or randomness of a random variable. By comparing the joint distribution of two variables with their individual distributions, MI captures the reduction in uncertainty about one variable when the other variable is known. Higher values of MI indicate stronger dependence or correlation between the variables, while lower values indicate a weaker relationship or independence. Mathematically, MI can be calculated by estimating the probability distribution function (PDF) of the signals X and Y.
(2)
where is the joint PDF of X and Y signals and and are their marginal PDFs, respectively.
LMI is a measure used to analyze the time series data and assess the relationship between variables at different time lags. It is useful in several applications, such as time series prediction, identifying causal relationships, and detecting patterns or dependencies within time-dependent data (Granger & Lin 1994). In this study, the LMI considers the temporal delay between the target variable Y and lagged versions of the indicators. X(tk) is calculated X and itself Y(tk) for various lag values (k). The developed MATLAB script to compute LMI is presented in Supplementary Materials.
(3)

Genetic programming and jittered genetic programming

GP (Koza 1994) is a subfield of evolutionary computing that applies the principles of natural selection and genetics to evolve computer programs. It is an SCT that uses a population of candidate programs (solution trees) to solve complex problems by iteratively evolving and refining the programs over generations. The GP is known as the grey-box ML model. It is more explainable compared to the black box SCTs such as ANNs or SVR. In GP, an initial population of randomly generated programs is created. Each program represents a potential solution to the problem at hand. The programs are evaluated and assigned a fitness score based on their performance in solving the problem. The fitter programs, which exhibit better solution quality, have a higher chance of being selected for reproduction. The core genetic operators in GP are mutation and crossover (Figure 4). Mutation involves randomly altering parts of a program tree, introducing small variations in the program structure or behavior. For instance, the subtree ‘b. sin a’ was replaced by ‘a’ in Parent 1, Figure 4. Crossover, on the other hand, combines program subtrees from two-parent programs to create offspring programs with a blend of their characteristics. After the genetic operators are applied, a new generation of programs is created. The process of evaluation, selection, and reproduction is repeated iteratively for multiple generations, with the hope that the population gradually improves its fitness and converges towards an optimal or near-optimal solution.
Figure 4

An exemplary Monotonic Genetic Programming (MGP) structure illustrating mutation and crossover operations.

Figure 4

An exemplary Monotonic Genetic Programming (MGP) structure illustrating mutation and crossover operations.

Close modal

To cope with the comment overfitting problem of GP models, particularly when they encounter a small number of samples, we introduced the JGP model in which the jittering technique was used to create additional samples for training and testing GP. This technique can be valuable for improving the robustness and generalization ability of all machine learning models trained on time series data. It involves adding small amounts of random noise to the data to create variations while maintaining the overall structure of the time series. This technique can be valuable for improving the robustness and generalization ability of machine learning models trained on time series data. To this end, first, the level of jitter was specified as one-third of the standard deviation of each indicator. Then, random noise values that follow a Gaussian distribution were created. Finally, the random noise value was added to each data point in the original time series. The result is a new time series that was added to the end of existing ones and used for training and testing GP.

Performance evaluation

In addition to graphical comparisons, three statistical measures were used to evaluate the performance of the evolved models: the root mean square error (RMSE), Nash–Sutcliffe efficiency (NSE) coefficient, and relative bias (RE). Their relationships are given below:
(4)
(5)
(6)
where and represent the estimated and observed target values (here GWL), respectively. RMSE shows the difference between the predicted value and the calculated target values, which have a value between (0, +∞), and values closer to zero indicate a higher accuracy of the forecasting model. The NSE is a dimensionless and normal parameter sensitive to limit values and takes values between (−∞, 1]. RE provides a quantitative assessment of the average difference between predicted and observed values, expressed as a percentage. A positive RE indicates that the model tends to overestimate the observed values, while a negative RE indicates an underestimation. RE evaluates the size of the bias due to under coverage with respect to the true unknown parameter to estimate.

Data preprocessing

To account for the non-stationarity and inconsistent dimensionality, the historical GWL and indicator variables were standardized first and then an augmented Dickey-Fuller test was performed on the standardized time series to assess the presence of a unit root in the time series, which constitutes the null hypothesis. The test statistic indicated that the null hypothesis of the test could not be disproven, albeit the sharp difference in their standard deviation was alleviated through the initial standardization of the data. Thus, the first-order differencing technique was posed on both standardized dependent and independent series to achieve stationary series. Figure 5 illustrates the stationary state of the observed hydrometeorological data and the GWLave series attained for the study area.
Figure 5

Stationary states of the observed hydrometeorological time series and GWLave during 1993–2019.

Figure 5

Stationary states of the observed hydrometeorological time series and GWLave during 1993–2019.

Close modal

Linear correlation

PCC and lagged PCC values between the original indicators and GWLave time series are tabulated in Table 3 and depicted in Figure 6. The color gradient in the table indicates the strength and direction of functional connectivity, allowing us to understand how changes in one variable correspond to changes in GWLave. The red and blue colors stand for positive and negative connectivity, respectively. The darker the cell, the stronger the connectivity. In addition, we classified the strength of connectivity into three groups as below:
  • Insignificant connectivity when the value of associated PCC is within the range −0.2 to 0.2.

  • Positive connectivity when the value of the PCC exceeds 0.2.

  • Negative connectivity when the value of the PCC falls below −0.2.

Figure 6

PCCs between GWLave and independent variables at different lags.

Figure 6

PCCs between GWLave and independent variables at different lags.

Close modal
Table 3

PCC vectors between hydrometeorological variables and GWLave time series

 
 

Color gradient from blue (−1.0) to red (+1.0) indicates where the cell value falls within that range.

The graphical analysis of leading indicators (Figure 6) illustrates the temporal offset between the signals at a time delay 12 that reflects the common annual periodicity existing in hydrological time series. It is observed from Table 3 and Figure 6 that almost all PCC values of P remain in the insignificant connectivity range and therefore, their influence on average groundwater depth in the study area could be considered negligible. Despite being insignificant, an increasing trend in the PCC values of P is seen, indicating that P may influence GWLave at longer lags. The associated lagged PCC values had predominantly positive values after a 30-month delay, signaling that the mean GWL at the site has a direct relation with P after two and half years. All the PCC values of H and Q fell in the blue cells and hence gave an impression of widespread adverse connectivity across the groundwater monitoring wells and the catchment humidity and runoff. Despite the increasing trend, almost all PCC values of H and Q are in the range of negative connectivity. Therefore, it is concluded that increasing H and Q in the basin may decrease GWLave. The associated PCC values of E, Tmin, and Tmax show almost the same periodicity pattern including both positive and negative connectivities in the range of −0.43 to +0.43. The peaks and troughs appear with 3–4- and 9–10-month lags, respectively.

Nonlinear dependency: MI and LMI results

The PCC only captures linear dependency and may not capture nonlinear associations between variables. Thus, MI and LMI were calculated to capture the dependence between GWLave, its preceding values and hydrometeorological indicators. To this end, we first calculated the marginal PDFs of all time series and their joint PDFs with GWLave using the histogram method. Figure 7 illustrates an example of marginal and joint probabilities depicted for GWLave (t) and GWLave (t + 36). Then, MI and LMI values between the indicators and GWLave time series were calculated using Equations (2) and (3). The obtained results are tabulated in Table 4 and depicted in Figure 8. Like Table 3, the color gradient in the table indicates the strength of functional connectivity. However, the MI values are always positive as they are based on the concept of entropy, and entropy is a non-negative quantity. Thus, cells' color varies between white (0.0) and red. The darker the cell, the stronger the connectivity. To interpret the MI results, we classified the strength of connectivity into two groups as below.
  • Insignificant connectivity when the value of associated MI/LMI is less than 0.25.

  • Significant connectivity when the value of the MI/LMI exceeds or equals 0.25.

Figure 7

Example of (a) marginal probability of GWLave (t) and (b) joint probability histogram between and GWLave (t) and GWLave (t + 36).

Figure 7

Example of (a) marginal probability of GWLave (t) and (b) joint probability histogram between and GWLave (t) and GWLave (t + 36).

Close modal
Figure 8

MI between GWLave and independent variables at different lags.

Figure 8

MI between GWLave and independent variables at different lags.

Close modal
Table 4

MI (Lag = 0) and LMI vectors between hydrometeorological variables and GWLave time series

 
 

The color gradient from white (0.0) to red (2.24) indicates where the cell value falls within that range.

According to the figure, The MI values are insignificant for P and Q in all lags. This reveals the insignificant influence of concurrent and past rainfall and runoff on the current behavior of GWLave. On the other hand, H, E, Tmin and Tmax are highly correlated to GWLave after 18 lags (Figure 8(b)), and the maximum association is seen between GWLave and its lagged values back to the 27-month lag (Figure 8(a)). This long significant correlation might be a consequence of our decision to limit our threshold to 0.25. The LMI series (Figure 8(a)) shows that antecedent GWLave back to 3 months together with its past year value (i.e., GWLave (t − 12)) can provide valuable information for modeling and understanding the groundwater dynamic system in the Jafakende aquifer. The magnitudes of LMI of Tmin were greater than Tmax in the first 6 months lag and contrasting results were obtained after the 18-month lag, thus suggesting that the interaction between groundwater dynamics and temperature should be assessed at different seasons separately.

JGP modeling results

To develop a nonlinear relation between GWLave, its past values, and hydrometeorological indicators, nonlinear dependency results (Table 4) were used. To identify a month-ahead underlying pattern, the most associated lag of individual hydrometeorological variables provided that its LMI is greater than 0.25 together with the most effective antecedent GWLave valued were selected, as expressed in Equation (7).
(7)
The observed original monthly time series has 357 samples (Figure 2). The stationary state of the time series has one less sample due to differencing with order one. Besides, the time series must be split into train and test sets. Thus, the number of samples is not large enough to train and test any machine learning model. Thus, to increase the number of samples and have more data points within the same time interval, the jittering technique was used and the total number of samples at each time series was augmented to 891 samples (three times of stationary data after resizing owing to 27 lags). Then, to formulate Equation (7), they were divided into two subseries of training (the first 70%) and testing (the last 30%) samples. The GPdotNet tool (Hrnjica & Danandeh Mehr 2018) was utilized to find the most appropriate GWLave estimation model. There are some other free GP alternatives such as FlexGP and GeneticEngine with the codes available on GitHub. Table 5 summarizes the GP parameters used to run GPdotNet. It is worth remembering that a set of random floating numbers between −1.0 and 1.0 were also added to the input set to increase the model flexibility for fitting on the data. Figure 9 (Equation (8)) shows the best JGP tree (model) attained after 500 generations together with progress in fitness values. The performance measures of the best-evolved model are also tabulated in Table 5. It is clear from the figure that the best estimation model evolved almost after 50 generations. It does not possess Tmaxt−26 and GWLavet−1. Therefore, the evolved model represents 2 months ahead prediction model.
(8)
Table 5

GP setup parameters

ParameterValue
Functional set +, −, ×, ÷, sin, cos, exp,sqrt 
Population size 500 
Initialization method Half and half 
Crossover rate 0.7 
Mutation rate 0.2 
Reproduction rate 0.1 
Maximum operation level 
ParameterValue
Functional set +, −, ×, ÷, sin, cos, exp,sqrt 
Population size 500 
Initialization method Half and half 
Crossover rate 0.7 
Mutation rate 0.2 
Reproduction rate 0.1 
Maximum operation level 
Figure 9

The best GP model and evolution fitness during the training period.

Figure 9

The best GP model and evolution fitness during the training period.

Close modal
Figure 10 represents time series plots of the historical and predicted GWLave during the augmented training and testing dataset. The associated performance measures are tabulated in Table 6.
Table 6

Performance indices calculated for the 2 years ahead estimation GP model

Training set
Testing set
RMSE (m)NSERE (%)RMSE (m)NSERE (%)
0.217 0.458 1.2 0.230 0.454 0.8 
Training set
Testing set
RMSE (m)NSERE (%)RMSE (m)NSERE (%)
0.217 0.458 1.2 0.230 0.454 0.8 
Figure 10

Observed and 2-month ahead estimations of the average GWL in the study area during (a) the training and (b) testing datasets.

Figure 10

Observed and 2-month ahead estimations of the average GWL in the study area during (a) the training and (b) testing datasets.

Close modal

The figure shows that the evolved model can well detect the fluctuating dynamics of the GWL. However, the extreme peaks and troughs are not well predicted. This is mainly due to the intentional implementation of the ad hoc GP technique to discover the simplest nonlinear relation (Figure 9(b)) between GWLave and the indicators. The use of advanced optimization techniques for fine-tuning of GP may increase the estimation accuracy of extreme values as seen in Sepahvand et al. (2019). Compared to the literature, our study showed some agreements and conflicting results with those of similar studies. For example, like the works of Kasiviswanathan et al. (2016), our study showed that GP effectively captures the non-linearities in GWL fluctuations, demonstrating a reasonable performance using observed hydrometeorological and antecedent GWL data, all without necessitating explicit knowledge of the underlying physics of the system. In contrast to Kasiviswanathan et al. (2016), we found E as a dominant parameter in GWL. Such conflicting results could be due to the higher perspectives of lag times (up to 36 months) that were considered in our study. Thus, the authors believe that our findings could better represent GWL dynamics from a long-term perspective. In addition, our findings showed that precipitation is not a dominant factor in GWL fluctuations in our study area.

Our study revealed apprehensions regarding the feasibility of developing a forecasting model for GWL estimation. Like previous research, our study was constrained to datasets obtained from in situ hydrometeorological measurements. Subsequent investigations could incorporate additional factors, including demographic and geological data such as topographic wetness index, pumping rate, groundwater recharge, lithology, land use, population growth, and more.

Hydrologic connectivity analysis was conducted across the Jafakende River catchment to identify potential correlation and causality between hydrometeorological parameters and the GWL at the Jafakende aquifer. First, linear and nonlinear methods were utilized to determine the dominant features, and then, their temporal variation was linked to the target GWL using an innovative GP technique. Our study generally showed that the relationship between GWL and hydrometeorological factors is complex and challenging to be modeled. The detailed conclusions are itemized below.

  • Both linear and nonlinear statistics suggested the absence of significant connectivity between average groundwater depth and P and Q at the catchment.

  • There is a strong correlation between H, E, Tmin, Tmax, and GWLave after 18 lags.

  • Incorporating antecedent GWLave up to 3 months, along with its value from the previous year (GWLave (t−12)), can offer valuable insights for modeling and comprehending the dynamics of the groundwater system in the Jafakende aquifer.

  • Jittering is useful for preventing the overfitting of GP. However, it's essential to strike a balance in the level of jitter introduced, as excessive noise can hinder model performance.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Adamowski
J.
&
Chan
H. F.
2011
A wavelet neural network conjunction model for groundwater level forecasting
.
Journal of Hydrology
407
(
1–4
),
28
40
.
Batina
L.
,
Gierlichs
B.
,
Prouff
E.
,
Rivain
M.
,
Standaert
F. X.
&
Veyrat-Charvillon
N.
2011
Mutual information analysis: A comprehensive study
.
Journal of Cryptology
24
(
2
),
269
291
.
Creaco
E.
,
Zheng
F.
&
Pezzinga
G.
2022
Minimum transport-driven algorithm for water distribution network partitioning
.
Aqua Water Infrastructure, Ecosystems and Society
71
(
1
),
120
138
.
Dąbrowska
D.
,
Rykała
W.
&
Nourani
V.
2023
The impact of weather conditions on the quality of groundwater in the area of a municipal waste landfill
.
Environmental & Socio-Economic Studies
11
(
3
),
14
21
.
Daliakopoulos
I. N.
,
Coulibaly
P.
&
Tsanis
I. K.
2005
Groundwater level forecasting using artificial neural networks
.
Journal of Hydrology
309
(
1–4
),
229
240
.
Danandeh Mehr
A.
,
Marttila
H.
,
Haghighi
A. T.
,
Croghan
D.
&
Fathollahzadeh Attar
N.
2023
GTAR: A new ensemble evolutionary autoregressive approach to model dissolved organic carbon
.
Aqua Water Infrastructure, Ecosystems and Society
72
(
3
),
381
394
.
Fallah-Mehdipour
E.
,
Haddad
O. B.
&
Mariño
M. A.
2013
Prediction and simulation of monthly groundwater levels by genetic programming
.
Journal of Hydro-Environment Research
7
(
4
),
253
260
.
Ghasemi
M.
,
Samadi
M.
,
Soleimanian
E.
&
Chau
K. W.
2023
A comparative study of black-box and white-box data-driven methods to predict landfill leachate permeability
.
Environmental Monitoring and Assessment
195
(
7
),
862
.
Gholami
V.
&
Sahour
H.
2022
Prediction of groundwater drawdown using artificial neural networks
.
Environmental Science and Pollution Research
29
(
22
),
33544
33557
.
Granger
C.
&
Lin
J. L.
1994
Using the mutual information coefficient to identify lags in nonlinear models
.
Journal of Time Series Analysis
15
(
4
),
371
384
.
Hrnjica
B.
&
Danandeh Mehr
A.
2018
Optimized Genetic Programming Applications: Emerging Research and Opportunities
.
IGI Global
,
Hershey, PA
.
Jafariroodsari
S.
,
Nourani
V.
&
Gokçekuş
H.
2021
Investigating sea-level change on the coastal aquifer, case study: Jafakendeh aquifer
.
Modeling Earth Systems and Environment
7
,
2643
2651
.
Jia
B.
&
Zhou
G.
2023
Estimation of global karst carbon sink from 1950 to 2050s using response surface
methodology
.
Geo-spatial Information Science
1
18
.
doi:10.1080/10095020.2023.2165974
.
Kasiviswanathan
K. S.
,
Saravanan
S.
,
Balamurugan
M.
&
Saravanan
K.
2016
Genetic programming based monthly groundwater level forecast models with uncertainty quantification
.
Modeling Earth Systems and Environment
2
,
1
11
.
Nourani
V.
,
Alami
M. T.
&
Vousoughi
F. D.
2016
Hybrid of SOM-clustering method and wavelet-ANFIS approach to model and infill missing groundwater level data
.
Journal of Hydrologic Engineering
21
(
9
),
05016018
.
Rahmani-Rezaeieh
A.
,
Mohammadi
M.
&
Danandeh Mehr
A.
2020
Ensemble gene expression programming: A new approach for evolution of parsimonious streamflow forecasting model
.
Theoretical and Applied Climatology
139
(
1–2
),
549
564
.
Roushangar
K.
,
Nourani
V.
&
Dolatshahi
M.
2020
Investigation and trend identification of groundwater level variations using discrete wavelet transform and non-parametric tests (case study: Azarshahr plain)
.
Iran-Water Resources Research
16
(
1
),
102
115
.
Samadi
M.
,
Sarkardeh
H.
&
Jabbari
E.
2020
Explicit data-driven models for prediction of pressure fluctuations occur during turbulent flows on sloping channels
.
Stochastic Environmental Research and Risk Assessment
34
,
691
707
.
Sreekanth
P. D.
,
Geethanjali
N.
,
Sreedevi
P. D.
,
Ahmed
S.
,
Kumar
N. R.
&
Jayanthi
P. K.
2009
Forecasting groundwater level using artificial neural networks
.
Current Science
96
(
7
),
933
939
.
Tao
H.
,
Hameed
M. M.
,
Marhoon
H. A.
,
Zounemat-Kermani
M.
,
Heddam
S.
,
Kim
S.
,
Sulaiman
S. O.
,
Tan
M. L.
,
Sa'adi
Z.
,
Mehr
A. D.
,
Allawi
M. F.
,
Abba
S. I.
,
Zain
J. M.
,
Falah
M. W.
,
Jamei
M.
,
Bokde
N. D.
,
Bayatvarkeshi
M.
,
Al-Mukhtar
M.
,
Bhagat
S. K.
,
Tiyasha
T.
,
Khedher
K. M.
,
Al-Ansari
N.
,
Shahid
S.
&
Yaseen
Z. M.
2022
Groundwater level prediction using machine learning models: A comprehensive review
.
Neurocomputing
489
,
271
308
.
Uyumaz
A.
,
Danandeh Mehr
A.
,
Kahya
E.
&
Erdem
H.
2014
Rectangular side weirs discharge coefficient estimation in circular channels using linear genetic programming approach
.
Journal of Hydroinformatics
16
(
6
),
1318
1330
.
Yin
L.
,
Wang
L.
,
Li
T.
,
Lu
S.
,
Tian
J.
,
Yin
Z.
,
Li
X.
&
Zheng
W.
2023
U-Net-LSTM: Time series-enhanced lake boundary prediction model
.
Land
12
(
10
),
1859
.
Zhou
G.
&
Yang
Z.
2023
Analysis for 3-D morphology structural changes for underwater topographical in Culebrita Island
.
International Journal of Remote Sensing
44
(
7
),
2458
2479
.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).

Supplementary data