## Abstract

Understanding the interactions among the natural components of hydrological systems is essential for managing water resources, addressing environmental concerns, and mitigating the impacts of natural events such as floods and droughts. In this study, long-term (1993–2019) hydrologic connectivity among precipitation, humidity, evaporation, surface runoff, minimum/maximum temperature, and their interactions with groundwater level (GWL) in the southeastern Caspian Sea region was assessed using mutual information theory and the state-of-the-art jittered genetic programming approach. While the former was used to find dominant components, their effective time delay, and the power of potential nonlinear interactions, the latter was utilized to extract an explicit relation between the GWL and the most influential components. The data were gathered from several piezometers, meteorological stations, and a stream gauge available in the study area. The results showed an overall positive trend in the GWL with an increasing rate since 2007 that reflects the influence of artificial recharge infrastructures built in the study area. Statistical connectivity analyses demonstrated that historical precipitation and streamflow series have the least impact on the temporal variation of the average GWL.

## HIGHLIGHTS

The Jittered Genetic Programming (JGP) model was introduced for hydrologic connectivity analysis.

A new MATLAB script was developed for lagged mutual information calculation.

The study highlighted the importance of hydrological connectivity analysis for dynamic groundwater pattern recognition.

## INTRODUCTION

Hydrosystems, also known as hydrological systems, encompass various natural components that interact to regulate the movement, distribution, and quality of water. The key natural components of hydrosystems include (but are not limited to) surface and groundwater bodies, precipitation, soil moisture, vegetation, temperature, etc. The groundwater level (GWL) is an essential aspect of hydrological systems and plays a vital role in water supply, agriculture, and ecosystems. It refers to the depth at which water is found beneath the Earth's surface in underground aquifers or water-bearing formations. The GWL can vary depending on several factors, including precipitation, evaporation, surface water interactions, geological conditions, and human activities such as groundwater pumping and recharging. When precipitation exceeds evaporation and water infiltrates into the ground, the GWL tends to rise. Conversely, it can decline during periods of low rainfall or excessive pumping.

Soft computing techniques (SCTs) have been increasingly used in the field of civil engineering, especially hydrology, to detect underlying patterns and accurately predict environmental variables (Creaco *et al.* 2022; Ghasemi *et al.* 2023; Yin *et al.* 2023). Regarding the changing trends of GWL, several studies have been conducted on regional and catchment scales (Roushangar *et al.* 2020; Nourani *et al.* 2023). Among different SCTs, artificial neural networks (ANNs) and support vector regression (SVR) have been frequently used as an efficient way to solve problems in the groundwater-related field, such as GWL prediction (Daliakopoulos *et al.* 2005; Sreekanth *et al.* 2009; Nourani *et al.* 2011; Gholami & Sahour 2022; Roshani & Hamidi 2022; Dąbrowska *et al.* 2023). For example, Nourani *et al.* (2016) used the Self-Organizing Map (SOM)-based clustering method and the hybrid wavelet-Adaptive Neuro-Fuzzy Inference System (ANFIS) approach for GWL prediction and showed the strength of ANNs to provide estimation accuracy. Adamowski & Chan (2011) proposed a new method based on the wavelet and ANN predictions for GWL prediction. The exact prediction results for both St-Remi and Mercier sites in the Chateauguay basin indicate that the Wavelet Analysis based ANN (WA-ANN) method is a suitable method for predicting GWLs. Hosseini *et al.* introduced a program of evolved ANNs using a simple genetic algorithm to simulate monthly GWLs in the coastal aquifer in the Shabestar Plain, Iran. This research shows that employing a genetic algorithm to initialize neural connection weights in complex space avoids the risk of becoming stuck in local minima, which can prevent high costs and time-wasting in water resource management projects for drilling more piezometers. Gholami (2022) combined modular ANNs with adaptive fuzzy inputs to increase the accuracy of the annual GWL modeling process on the southern coasts of the Caspian Sea (Iran). More recently, Roshani & Hamidi (2022) used ANNs to predict GWL fluctuations in the Sari-Neka aquifer, the Caspian Sea basin. The authors showed that the groundwater depth of the Sari-Neka basin is expected to decrease with a gentle slope during the period 2021–2040.

Genetic programming (GP) is rather a new class of SCTs that is applied to a wide range of problem domains, including symbolic regression, classification, control systems, and optimization. It allows for the automatic generation of programs without the need for explicit programming by humans (Danandeh Mehr *et al.* 2023). GP has been used successfully in various fields, such as data mining, robotics, bioinformatics, and finance, to tackle complex problems and discover novel solutions. Various versions of GP such as monotonic GP, linear GP, multigene GP, and gene expression GP have been satisfactorily used in hydrological studies (Uyumaz *et al.* 2014; Rahmani-Rezaeieh *et al.* 2020; Samadi *et al.* 2020, 2021; Mehr & Gandomi 2021). A recent review study on the application of SCTs for GWL prediction conducted by Tao *et al.* (2022) indicated that GP was used only in 5.0% of the artificial intelligence models applied for GWL modeling, so that there is room for more research on the application of GP for GWL prediction. Some of the recent implications of GP variants for GWL estimation include (not limited to) the study of Fallah-Mehdipour *et al.* (2013), Kasiviswanathan *et al.* (2016), Aryafar *et al.* (2019), and Abdolahzadeh & Schmalz (2022). The existing studies mostly trained a standalone GP engine or hybridized it to find the best short-term (less than a season) GWL estimating model. Differing from the earlier studies, we applied GP to identify how different hydrometeorological variables interconnect with GWL fluctuations from a temporal perspective. Thus, standalone monotonic GP was used in this study without any attempt to enhance its performance via common hybridization techniques.

In recent years, there has been a burgeoning interest in the interconnectedness of ecosystems and how various components respond to both natural variations and human-induced activities. This approach has gained traction as a valuable tool for comprehending the fundamental principles that inform policy-making and ecosystem management (Rinderer *et al.* 2018; Zhang *et al.* 2023; Zhou & Yang 2023). For instance, Rinderer *et al.* (2018) explored hydrologic connectivity among groundwater and streamflow time series from a Swiss catchment and demonstrated that employing ground-based high-frequency sensors as well as emerging remote sensing techniques can provide new opportunities for a better understanding of connectivity across ecosystems. Another noteworthy study by Zhou *et al.* (2022) introduced a novel algorithm for mining prevalent negative co-location patterns in spatial data. Meanwhile, Jia & Zhou (2023) proposed a response surface methodology to estimate the global karst carbon sink, demonstrating that functional connectivity analysis could significantly reduce numerical calculation time, making it a viable tool for global weathering models. These investigations collectively highlight the versatility of the concept of connectivity in both ecology and hydrology. In ecology, it is defined as the extent to which the landscape facilitates or hinders movement among resource patches. In hydrology, on the other hand, it pertains to the water-mediated transfer of matter, energy, and/or organisms within or between elements of the hydrologic cycle (Rinderer *et al.* 2018). The integration of these perspectives underscores the interdisciplinary nature of connectivity studies, offering valuable insights for informed decision-making in environmental science and management.

Monitoring GWL is important for hydrosystem management and assessing the overall health of aquifers. This information helps determine the availability of water for various uses, including drinking water, irrigation, and industrial purposes. Additionally, GWL estimation helps to prevent issues such as land subsidence, saltwater intrusion, and ecological disruptions caused by significant changes in water table depths. Therefore, the purpose of this study is to use emerging statistical methods and soft computing tools (i) to assess the hydrologic connectivity between various hydrosystem components and then (ii) to identify the underlying pattern of GWL based on the hydrologic connectivity and evolutionary GP technique. The study focuses on the southeastern Caspian Sea region that covers three coastal catchments and an unconfined aquifer. This article is expected to contribute to the development of a more effective and sustainable water supply for the domestic and agricultural water consumption in the study area, that is, facing water scarcity and negative climate change impacts these years.

## STUDY AREA AND DATA

*P*), mean monthly humidity (

*H*), total monthly evaporation (

*E*), minimum temperature (

*T*

_{min}), maximum temperature (

*T*

_{max}), and total runoff volume (

*Q*) during the period of 1993–2019 were used. The monthly data were collected from two rain gauge stations (Jafakadeh and Vatana), one evaporation measuring station (Kordkoy), one hydrometry station (Jafakende), and six piezometers located in the study area. The data were acquired from the Iran Water Resources Management company that guarantees observation quality before sharing them with academia. Table 1 presents the geographical features of the measuring stations. The time series of the observed data together with their main statistical features are presented in Figure 2 and Table 2, respectively.

Station name . | Station type . | Height (m) . | Universal Transverse Mercator (UTM)Coordinates . | |
---|---|---|---|---|

Longitude . | Latitude . | |||

Jafakende | Weather | 30 | 228941 | 4069303 |

Vatana | Weather | 100 | 229122 | 4067740 |

Jafakende | Hydrometric | 30 | 226153 | 4069676 |

Livan | Piezometer 1 | 20 | 757138 | 4070313 |

Nokande | Piezometer 2 | 20 | 759738 | 4070278 |

Banafsheh Tapeh | Piezometer 3 | 38.58 | 224602 | 4068109 |

Gaz | Piezometer 4 | 21.02 | 228430 | 4069821 |

Dashteh Kalateh | Piezometer 5 | 19.57 | 227145 | 4070728 |

Bandar Gaz | Piezometer 6 | −21.63 | 227473 | 4075185 |

Station name . | Station type . | Height (m) . | Universal Transverse Mercator (UTM)Coordinates . | |
---|---|---|---|---|

Longitude . | Latitude . | |||

Jafakende | Weather | 30 | 228941 | 4069303 |

Vatana | Weather | 100 | 229122 | 4067740 |

Jafakende | Hydrometric | 30 | 226153 | 4069676 |

Livan | Piezometer 1 | 20 | 757138 | 4070313 |

Nokande | Piezometer 2 | 20 | 759738 | 4070278 |

Banafsheh Tapeh | Piezometer 3 | 38.58 | 224602 | 4068109 |

Gaz | Piezometer 4 | 21.02 | 228430 | 4069821 |

Dashteh Kalateh | Piezometer 5 | 19.57 | 227145 | 4070728 |

Bandar Gaz | Piezometer 6 | −21.63 | 227473 | 4075185 |

Variable . | Mean . | Median . | Minimum . | Maximum . | Std. . | Dev. . | Skewness . |
---|---|---|---|---|---|---|---|

P (mm) | 59.233 | 51.85 | 0 | 231.3 | 41.097 | 0.69382 | 0.81621 |

H (%) | 80.973 | 81.1 | 61.33 | 94.4 | 7.8966 | 0.09752 | −0.80209 |

E (mm) | 81.656 | 75.05 | 9.9 | 193.8 | 46.191 | 0.56567 | −0.99235 |

T_{min} (°C) | 12.28 | 12.1 | −2.9 | 25.13 | 7.3604 | 0.59936 | −1.3663 |

T_{max} (°C) | 22.789 | 22.65 | 7.7 | 37.7 | 7.6748 | 0.33678 | −1.3318 |

R (MCM) | 4.4406 | 2.285 | 0 | 36.85 | 6.039 | 1.3599 | 6.5207 |

GWL1 | 3.8388 | 3.72 | 0.8 | 9.15 | 1.7259 | 0.4496 | −0.47116 |

GWL2 | 3.0732 | 3.025 | 1 | 5.02 | 0.63703 | 0.20729 | −0.27055 |

GWL3 | 11.806 | 11.83 | 8.84 | 14.2 | 1.1676 | 0.098894 | −0.50337 |

GWL4 | 12.075 | 12.31 | 8.4 | 15.62 | 1.8095 | 0.14986 | −1.0243 |

GWL5 | 3.6446 | 3.28 | 1.41 | 6.52 | 1.061 | 0.29112 | −0.65206 |

GWL6 | 1.9919 | 1.93 | 0.87 | 3.91 | 0.51672 | 0.25941 | 0.66812 |

Variable . | Mean . | Median . | Minimum . | Maximum . | Std. . | Dev. . | Skewness . |
---|---|---|---|---|---|---|---|

P (mm) | 59.233 | 51.85 | 0 | 231.3 | 41.097 | 0.69382 | 0.81621 |

H (%) | 80.973 | 81.1 | 61.33 | 94.4 | 7.8966 | 0.09752 | −0.80209 |

E (mm) | 81.656 | 75.05 | 9.9 | 193.8 | 46.191 | 0.56567 | −0.99235 |

T_{min} (°C) | 12.28 | 12.1 | −2.9 | 25.13 | 7.3604 | 0.59936 | −1.3663 |

T_{max} (°C) | 22.789 | 22.65 | 7.7 | 37.7 | 7.6748 | 0.33678 | −1.3318 |

R (MCM) | 4.4406 | 2.285 | 0 | 36.85 | 6.039 | 1.3599 | 6.5207 |

GWL1 | 3.8388 | 3.72 | 0.8 | 9.15 | 1.7259 | 0.4496 | −0.47116 |

GWL2 | 3.0732 | 3.025 | 1 | 5.02 | 0.63703 | 0.20729 | −0.27055 |

GWL3 | 11.806 | 11.83 | 8.84 | 14.2 | 1.1676 | 0.098894 | −0.50337 |

GWL4 | 12.075 | 12.31 | 8.4 | 15.62 | 1.8095 | 0.14986 | −1.0243 |

GWL5 | 3.6446 | 3.28 | 1.41 | 6.52 | 1.061 | 0.29112 | −0.65206 |

GWL6 | 1.9919 | 1.93 | 0.87 | 3.91 | 0.51672 | 0.25941 | 0.66812 |

*et al*. 2021). Despite significant fluctuations in GWL, Figure 2 illustrates overall increasing trends in all piezometers. With respect to their mean level, P1, P2, and P5 can be grouped in the same class. P3 and P4 show the maximum level, and P6 shows the minimum level, indicating that GWL increases by distance from the Caspian Sea.

## METHODOLOGY

To assess hydrologic connectivity between the hydrometeorological variables (i.e., *H*, *E*, *T*_{max}, *T*_{min}, *R*) and the GWL in the study area, first, a representative GWL time series (hereafter GWL_{ave}) was obtained for the whole area by calculating the arithmetic mean of GWL at all piezometers. Then, a set of linear and nonlinear connectivity measures including the pairwise correlation coefficient (PCC) and mutual information (MI) was calculated. In hydrological time series, the value of a variable at a particular time can be influenced by previous values, so-called lags, of the same variable or other indicator variables. Therefore, we also looked for leading indicators when conducting conductivity analysis. In other words, in addition to evaluating the linear and nonlinear correlations between two concurrent time series, we investigate the correlation between the GWL and the independent variables with a few time lags (here 36 months). Thus, lagged PCC and lagged MI (LMI) were also calculated to find the dominant lags. Finally, a monotonic jittered genetic programming (JGP) method was employed to link the GWL in the study area with the most dominant independent variables and their lags. To cope with the existing trends and dimensional inconsistency of both dependent (i.e., GWL_{ave}) and independent variables, they were standardized first, and then made stationary via a differencing method before the modeling stage. Due to the rather short period of historical observations, jittered-based augmented time series were used for training and testing the GP model.

### Pairwise correlation coefficient

*X*= {

*x*

_{1},

*x*

_{2}, … ,

*x*

_{n}} and

*Y*= {

*y*

_{1},

*y*

_{2}, … ,

*y*

_{n}}. PCC varies in the range from −1.0 to 1.0 indicating negative (reverse) or positive linear correlation between the signals.where and are the arithmetic means of the signals X and Y, respectively.

A value of 0.0 implies no linear correlation between the signals. A value of 1 indicates a perfect positive linear relationship, where an increase in one variable is associated with a proportional increase in the other variable. A value of −1 indicates a perfect negative linear relationship, where an increase in one variable is associated with a proportional decrease in the other variable.

### Mutual information and lagged mutual information

MI measures the statistical dependence or information shared between two random variables. It quantifies how much information is obtained about one variable by knowing the value of another variable. In other words, it measures the amount of information that two variables share. MI is often used in the fields where a nonlinear relationship between variables is aimed to be detected (Batina *et al.* 2011).

*X*and

*Y*.where is the joint PDF of

*X*and

*Y*signals and and are their marginal PDFs, respectively.

*Y*and lagged versions of the indicators.

*X*(

*t*

*−*

*k*) is calculated

*X*and itself

*Y*(

*t*

*−*

*k*) for various lag values (

*k*). The developed MATLAB script to compute LMI is presented in Supplementary Materials.

### Genetic programming and jittered genetic programming

To cope with the comment overfitting problem of GP models, particularly when they encounter a small number of samples, we introduced the JGP model in which the jittering technique was used to create additional samples for training and testing GP. This technique can be valuable for improving the robustness and generalization ability of all machine learning models trained on time series data. It involves adding small amounts of random noise to the data to create variations while maintaining the overall structure of the time series. This technique can be valuable for improving the robustness and generalization ability of machine learning models trained on time series data. To this end, first, the level of jitter was specified as one-third of the standard deviation of each indicator. Then, random noise values that follow a Gaussian distribution were created. Finally, the random noise value was added to each data point in the original time series. The result is a new time series that was added to the end of existing ones and used for training and testing GP.

### Performance evaluation

## RESULTS AND DISCUSSION

### Data preprocessing

_{ave}series attained for the study area.

### Linear correlation

_{ave}time series are tabulated in Table 3 and depicted in Figure 6. The color gradient in the table indicates the strength and direction of functional connectivity, allowing us to understand how changes in one variable correspond to changes in GWL

_{ave}. The red and blue colors stand for positive and negative connectivity, respectively. The darker the cell, the stronger the connectivity. In addition, we classified the strength of connectivity into three groups as below:

Insignificant connectivity when the value of associated PCC is within the range −0.2 to 0.2.

Positive connectivity when the value of the PCC exceeds 0.2.

Negative connectivity when the value of the PCC falls below −0.2.

Color gradient from blue (−1.0) to red (+1.0) indicates where the cell value falls within that range.

The graphical analysis of leading indicators (Figure 6) illustrates the temporal offset between the signals at a time delay 12 that reflects the common annual periodicity existing in hydrological time series. It is observed from Table 3 and Figure 6 that almost all PCC values of *P* remain in the insignificant connectivity range and therefore, their influence on average groundwater depth in the study area could be considered negligible. Despite being insignificant, an increasing trend in the PCC values of *P* is seen, indicating that *P* may influence GWL_{ave} at longer lags. The associated lagged PCC values had predominantly positive values after a 30-month delay, signaling that the mean GWL at the site has a direct relation with *P* after two and half years. All the PCC values of *H* and *Q* fell in the blue cells and hence gave an impression of widespread adverse connectivity across the groundwater monitoring wells and the catchment humidity and runoff. Despite the increasing trend, almost all PCC values of *H* and *Q* are in the range of negative connectivity. Therefore, it is concluded that increasing *H* and *Q* in the basin may decrease GWL_{ave}. The associated PCC values of *E*, *T*_{min}, and *T*_{max} show almost the same periodicity pattern including both positive and negative connectivities in the range of −0.43 to +0.43. The peaks and troughs appear with 3–4- and 9–10-month lags, respectively.

### Nonlinear dependency: MI and LMI results

_{ave}, its preceding values and hydrometeorological indicators. To this end, we first calculated the marginal PDFs of all time series and their joint PDFs with GWL

_{ave}using the histogram method. Figure 7 illustrates an example of marginal and joint probabilities depicted for GWL

_{ave}(

*t*) and GWL

_{ave}(

*t*+ 36). Then, MI and LMI values between the indicators and GWL

_{ave}time series were calculated using Equations (2) and (3). The obtained results are tabulated in Table 4 and depicted in Figure 8. Like Table 3, the color gradient in the table indicates the strength of functional connectivity. However, the MI values are always positive as they are based on the concept of entropy, and entropy is a non-negative quantity. Thus, cells' color varies between white (0.0) and red. The darker the cell, the stronger the connectivity. To interpret the MI results, we classified the strength of connectivity into two groups as below.

Insignificant connectivity when the value of associated MI/LMI is less than 0.25.

Significant connectivity when the value of the MI/LMI exceeds or equals 0.25.

The color gradient from white (0.0) to red (2.24) indicates where the cell value falls within that range.

According to the figure, The MI values are insignificant for *P* and *Q* in all lags. This reveals the insignificant influence of concurrent and past rainfall and runoff on the current behavior of GWL_{ave}. On the other hand, *H*, *E*, *T*_{min} and *T*_{max} are highly correlated to GWL_{ave} after 18 lags (Figure 8(b)), and the maximum association is seen between GWL_{ave} and its lagged values back to the 27-month lag (Figure 8(a)). This long significant correlation might be a consequence of our decision to limit our threshold to 0.25. The LMI series (Figure 8(a)) shows that antecedent GWL_{ave} back to 3 months together with its past year value (i.e., GWL_{ave} (*t* − 12)) can provide valuable information for modeling and understanding the groundwater dynamic system in the Jafakende aquifer. The magnitudes of LMI of *T*_{min} were greater than *T*_{max} in the first 6 months lag and contrasting results were obtained after the 18-month lag, thus suggesting that the interaction between groundwater dynamics and temperature should be assessed at different seasons separately.

### JGP modeling results

_{ave}, its past values, and hydrometeorological indicators, nonlinear dependency results (Table 4) were used. To identify a month-ahead underlying pattern, the most associated lag of individual hydrometeorological variables provided that its LMI is greater than 0.25 together with the most effective antecedent GWL

_{ave}valued were selected, as expressed in Equation (7).

_{ave}estimation model. There are some other free GP alternatives such as FlexGP and GeneticEngine with the codes available on GitHub. Table 5 summarizes the GP parameters used to run GPdotNet. It is worth remembering that a set of random floating numbers between −1.0 and 1.0 were also added to the input set to increase the model flexibility for fitting on the data. Figure 9 (Equation (8)) shows the best JGP tree (model) attained after 500 generations together with progress in fitness values. The performance measures of the best-evolved model are also tabulated in Table 5. It is clear from the figure that the best estimation model evolved almost after 50 generations. It does not possess

*T*

_{maxt−26}and GWL

_{avet−1.}Therefore, the evolved model represents 2 months ahead prediction model.

Parameter . | Value . |
---|---|

Functional set | +, −, ×, ÷, sin, cos, exp,sqrt |

Population size | 500 |

Initialization method | Half and half |

Crossover rate | 0.7 |

Mutation rate | 0.2 |

Reproduction rate | 0.1 |

Maximum operation level | 6 |

Parameter . | Value . |
---|---|

Functional set | +, −, ×, ÷, sin, cos, exp,sqrt |

Population size | 500 |

Initialization method | Half and half |

Crossover rate | 0.7 |

Mutation rate | 0.2 |

Reproduction rate | 0.1 |

Maximum operation level | 6 |

_{ave}during the augmented training and testing dataset. The associated performance measures are tabulated in Table 6.

Training set . | Testing set . | ||||
---|---|---|---|---|---|

RMSE (m) . | NSE . | RE (%) . | RMSE (m) . | NSE . | RE (%) . |

0.217 | 0.458 | 1.2 | 0.230 | 0.454 | 0.8 |

Training set . | Testing set . | ||||
---|---|---|---|---|---|

RMSE (m) . | NSE . | RE (%) . | RMSE (m) . | NSE . | RE (%) . |

0.217 | 0.458 | 1.2 | 0.230 | 0.454 | 0.8 |

The figure shows that the evolved model can well detect the fluctuating dynamics of the GWL. However, the extreme peaks and troughs are not well predicted. This is mainly due to the intentional implementation of the ad hoc GP technique to discover the simplest nonlinear relation (Figure 9(b)) between GWL_{ave} and the indicators. The use of advanced optimization techniques for fine-tuning of GP may increase the estimation accuracy of extreme values as seen in Sepahvand *et al.* (2019). Compared to the literature, our study showed some agreements and conflicting results with those of similar studies. For example, like the works of Kasiviswanathan *et al.* (2016), our study showed that GP effectively captures the non-linearities in GWL fluctuations, demonstrating a reasonable performance using observed hydrometeorological and antecedent GWL data, all without necessitating explicit knowledge of the underlying physics of the system. In contrast to Kasiviswanathan *et al.* (2016), we found *E* as a dominant parameter in GWL. Such conflicting results could be due to the higher perspectives of lag times (up to 36 months) that were considered in our study. Thus, the authors believe that our findings could better represent GWL dynamics from a long-term perspective. In addition, our findings showed that precipitation is not a dominant factor in GWL fluctuations in our study area.

Our study revealed apprehensions regarding the feasibility of developing a forecasting model for GWL estimation. Like previous research, our study was constrained to datasets obtained from *in situ* hydrometeorological measurements. Subsequent investigations could incorporate additional factors, including demographic and geological data such as topographic wetness index, pumping rate, groundwater recharge, lithology, land use, population growth, and more.

## CONCLUSION

Hydrologic connectivity analysis was conducted across the Jafakende River catchment to identify potential correlation and causality between hydrometeorological parameters and the GWL at the Jafakende aquifer. First, linear and nonlinear methods were utilized to determine the dominant features, and then, their temporal variation was linked to the target GWL using an innovative GP technique. Our study generally showed that the relationship between GWL and hydrometeorological factors is complex and challenging to be modeled. The detailed conclusions are itemized below.

Both linear and nonlinear statistics suggested the absence of significant connectivity between average groundwater depth and

*P*and*Q*at the catchment.There is a strong correlation between

*H*,*E*,*T*_{min},*T*_{max}, and GWL_{ave}after 18 lags.Incorporating antecedent GWL

_{ave}up to 3 months, along with its value from the previous year (GWL_{ave (t−12)}), can offer valuable insights for modeling and comprehending the dynamics of the groundwater system in the Jafakende aquifer.Jittering is useful for preventing the overfitting of GP. However, it's essential to strike a balance in the level of jitter introduced, as excessive noise can hinder model performance.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.