ABSTRACT
Reliable simulation of groundwater fluctuation is imperative for implementing sustainable groundwater management so that not only local people will be covered, but also environmental threats are going to be restrained in the future. This research aims to scrutinize the accuracy of numerical model (NM) and machine learning (ML) methods for groundwater level prediction (GWLP) for Yazd-Ardakan Plain from 2012 to 2019 in monthly steps. Additionally, principal component analysis (PCA) is applied to investigate the impact of each ML input feature on GWLP. The study area's aquifer data were analyzed and prepared to develop the conceptual model of MODFLOW and train ML algorithms for GWLP. Considering observation wells (OBWs), operation wells (OPWs), and their latitude and longitude as input features in convolutional neural networks (CNN), support vector machine (SVM), and decision tree (DT) algorithms, GWLP was performed. The results demonstrate that although MODFLOW considers the unique features of the aquifer, the most accurate GWLP was achieved by SVM, with root mean square errors (RMSE), correlation coefficient (), and area under the receiver operating characteristics (ROC) curve (AUC) values of 0.12, 0.90, and 0.94, respectively. Furthermore, PCA presented that the observed groundwater level (GWL) was the most effective feature with 71%.
HIGHLIGHTS
In this study, MODFLOW, a numerical model, is compared with machine learning methods for GWLP.
PCA was employed in this study to assess the effectiveness of each input feature used in the algorithms and each one-year time period.
This study also uses the important features of latitude and longitude, which significantly affect GWLP results when machine learning is applied.
The results demonstrate the better performance of the MODFLOW model in terms of generalization ability (GeA) and the superiority of machine learning methods in terms of accuracy for GWLP.
INTRODUCTION
Groundwater resources supply the majority proportion of the water in arid and semi-arid regions. In these areas, only 5–10% of the precipitation contributes to the recharging of the aquifer (Bouwer 2002). Due to climate change, the reduction in precipitation in these areas has made the natural recharging of the aquifer potentially insufficient (Sherif et al. 2022). Iran is a country with an arid and semi-arid climate, where the imbalance between natural recharge and groundwater extraction has caused a sharp drop in the groundwater level (GWL) (Gholami 2022). As a result of the indiscriminate use of these resources, groundwater level prediction (GWLP) is performed to manage and utilize them for drinking, industrial, and agricultural purposes (Rahman et al. 2020).
MODFLOW as a numerical model (NM) and a graphic simulator based on boundary physics (Tigabu et al. 2020), along with its flexibility in integrating with qualitative models (Lenin Sundar et al. 2022) and hydrological models (Yifru et al. 2022), has been an efficient tool over the last three decades for evaluating and solving many challenging problems of water resources management (Beyene et al. 2023). By comparing the effects of artificial recharge of aquifers using various techniques, such as infiltration basins (Chenini & Ben Mammou 2010) and injection wells (Bana Bafroei & Alimohammadi 2023), MODFLOW simulation can be utilized to determine which technique is most effective in advancing groundwater resources management objectives. MODFLOW simulator–optimizer models are employed to address groundwater challenges, such as reducing pumping costs from aquifers (Ayvaz & Elçi 2013), identifying pollution sources (Chakraborty & Prakash 2023), locating OPWs locations (Yin et al. 2020), addressing environmental concerns, and protecting the aquifer (Maghsoudi et al. 2023). MODFLOW simulator–optimizer models' results have proven to be highly beneficial for optimal aquifer management.
Hydrological problems across different scales have been predicted and solved in recent decades using artificial intelligence and machine learning (ML) methods. With numerous hidden layers, the convolutional neural network (CNN) is an advanced version of the artificial neural network and is classified as one of the deep learning algorithms that can be adapted for time series forecasting. This adaptability allows CNN to effectively stimulate patterns in GWL (Afzaal et al. 2020). CNN has been used less frequently than other techniques in GWLP (Afrifa et al. 2022). According to a study by Afzaal et al. (2020), CNN outperforms other algorithms both in terms of speed and precision. When utilizing large datasets for GWLP, CNN demonstrates superior performance (Wunsch et al. 2021), making them a valuable tool in scenarios where extensive historical data is available.
ML models for GWLP enhance accuracy (Amaranto et al. 2019), and one of the most frequently employed is support vector machine (SVM). Amaranto et al. (2018) applied five ML models to provide GWLP models for an aquifer located in Nebraska, USA, and SVM demonstrated accurate predictive abilities like other ML models, particularly in non-linear scenarios. Khan Mohammad & Coulibaly (2006) for Lake Erie in North America and Yoon et al. (2016) in South Korea evaluated GWL using several algorithms and compared their performances. In both studies, SVM demonstrated superior accuracy in GWLP. Additionally, Choi et al. (2020) predicted the water level of a vast wetland in South Korea using various techniques including SVM and decision tree (DT), and found that incorporating multiple features was both essential and highly effective. However, the DT is used less frequently in GWLP studies, and researchers are encouraged to investigate it further (Afrifa et al. 2022).
A few studies have compared NM and ML techniques for GWLP in order to discover which is the most effective. These comparisons used groundwater unit hydrograph (Mohammed et al. 2023), precipitation (Amiri et al. 2023), OPWs (Almuhaylan et al. 2020), GWL (Malekzadeh et al. 2019), recharging and evaporation rates (Moghaddam et al. 2019), temperature, and surface flow (Mohanty et al. 2013). Furthermore, in all the mentioned studies, ML was found to be a superior method for GWLP. However, Ebrahimi et al. (2023) noticed that although both models are robust, MODFLOW is the better model. Chen et al. (2020) also conducted another comparison between MODFLOW and ML using RMSE, , and generalization ability (GeA) indicators, and the results highlighted MODFLOW's strength in GeA and ML's superiority in prediction accuracy. In another comparison, Almuhaylan et al. (2020) emphasized the need to compare and utilize deep learning or CNN techniques. Interestingly though, PCA has not been applied in any of the comparisons mentioned in the literature. PCA is commonly applied in hydrology (Cai et al. 2021), frequently used to decrease data dimensionality and improve the accuracy and efficiency of GWLP models by extracting the key characteristics from datasets when paired with ML models for GWLP by extracting core features that affect prediction accuracy in hydrologic datasets (Seifi et al. 2020).
Very little GWLP using NM and ML has been conducted in the study area. The area, which experiences extreme water stress and scarcity, is one of Iran's highest-density industrial and agricultural regions. The purpose of this research is to compare the GWLP methods by employing MODFLOW, SVM, CNN, and DT algorithms. Furthermore, PCA is utilized to determine the importance of each input feature used in algorithms and each 1-year time period. This study also utilizes the impressive features of latitude and longitude, which have a substantial effect on GWLP results when using ML.
METHODOLOGY
Study area



An illustration of the aquifer's input and output boundaries, the model grid, the OBWs, and the OPWs.
An illustration of the aquifer's input and output boundaries, the model grid, the OBWs, and the OPWs.
Numerical modeling








MODFLOW has the ability to simulate confined and unconfined aquifers, transient and steady flows, rivers, drains, wells, springs, evaporation, and infiltration by rainfall and irrigation. In order to identify and comprehend the mathematical model of the initial conditions, parameters, and physics of the surrounding environment and to simplify it, the first step in groundwater modeling is the creation of a conceptual model. The hydrologist's knowledge of hydrology, groundwater flow dynamics, and hydrogeological conditions is essential for developing a conceptual model. Aquifer data, including field data, OPWs, the initial distribution of hydrological parameters, the DEM of the aquifer's bedrock and surface, and recharging, were utilized and implemented in GMS to develop a conceptual model. The solitary outflow boundary is found in the aquifer's northern region, and the majority of its borders are inflow boundaries (Ministry of Energy 2012). The model's grid was created with dimensions of 1,000 × 1,000 m. As shown in Figure 2, the model contains 1,725 active cells and 2,271 inactive cells. Using the parameter estimation (PEST) program, the model was calibrated due to GWL calculation, by taking into account the aquifer's hydraulic conductivities and calculating the difference between the recharging and discharge factors. Through the OBWs, the estimated and actual GWL of the aquifer are compared. Notably, the parameters introduced to PEST were calibrated within a suitable range that matched the aquifer's attributes. The model incorporates an interval representing the permitted range of fluctuation between the computed and observed GWLs; if that difference surpasses the established limit, a new calibration occurs. The interval that has been introduced to the model is 1 m. Following calibration stages, values of calibrated parameters were employed to validate the model in order to gain more assurance that the model will be able to simulate GWL by applying calculated parameters in the calibration steps and different time periods, groundwater withdrawal, and groundwater recharge of calibration stage in acceptable RMSE.
ML models
For the ML simulation, the input features, including OBWs, OPWs, and the latitude and longitude of these parameters, were used for GWLP with CNN, SVM, and DT algorithms. Python 3.2 and K-fold cross-validation (with 5 folds) were applied during the simulation. To ensure reliable evaluation and minimize bias, the dataset was split into two subsets: 80% for training and the remaining 20% were utilized for testing the model. Data preprocessing, including normalization of all input variables, was performed to transform the data into a format that is more easily and effectively processed. Additionally, PCA was applied to reduce the dimensionality of the data and improve model efficiency and accuracy.
Convolutional neural network
Lecun et al. (1998) introduced CNN, a deep learning algorithm that is similar to the connections among neurons in the human brain and nearly mimics human vision. Unlike traditional ML algorithms, which require manual preprocessing for feature extraction, CNN automatically performs this task, reducing the need for such preprocessing. However, like other deep learning algorithms, CNN needs an enormous volume of data as well as computational power. Upon receiving input data, CNN assigns weights to different elements, distinguishing them based on their importance and regional characteristics (Xu et al. 2020).
The convolution layer, pooling, and fully connected are the three components of CNN. The CNN multi-layer structure is part of the convolution layer. The convolution layer can be expressed mathematically as the product of two matrices that have been multiplied correspondingly. In CNNs, each of the filtered pictures goes through a pooling step after the convolution layer. Smaller and more superfluous parts of information vanish during this process, which accelerates and improves the accuracy of learning the data processing algorithm by passing simpler and more useful information to the following layers. Every neuron in the fully connected layer is linked to every other neuron in the pooling layer, but not to any other neurons within the layer (Xu et al. 2020). This layer boosts the prediction accuracy by interpreting and analyzing the data.
CNNs are effective at capturing spatial and temporal patterns in GWL (Wunsch et al. 2021). The model learns hierarchical features from the input data, including OBWs, OPWs, and geographic coordinates, without the need for manual feature extraction. By treating latitude and longitude as spatial variables, CNNs can capture the spatial dependencies between OBWs and GWL. This makes CNN particularly suitable for modeling complex regional and temporal variations in GWL based on the input features.
Support vector machine
SVM, introduced by Vapnik (1995), is a powerful and precise ML algorithm used for both regression and classification (Goodarzi et al. 2024a). SVM is widely utilized because of its accurate detection abilities in a variety of fields. The primary function of SVM is to define a hyperplane in a multidimensional space that separates data points with maximum margin, fitting the regression to minimize errors in the predicted output (Li et al. 2016). The hyperplane's dimensions are determined by the number of features in the data (Goodarzi et al. 2024b). As the number of features increases, visualizing the hyperplane becomes more complex. SVM uses various kernel functions to search for optimal hyperplanes in high-dimensional spaces. Its ability to handle regression tasks with fewer features, combined with its complex structure and advanced techniques, makes it particularly suitable for estimating GWL.




Decision tree
The DT is one of the best ML algorithms for regression and classification based on predetermined answers (Quinlan 1986). It operates based on a set of rules that guide the decision-making process. The root of the tree represents the initial decision point, and the branches represent possible outcomes from this decision. The process continues until it reaches the leaf nodes, which represent the final predictions. The complexity of a DT is directly related to its interpretability, and it can be controlled through pruning, which reduces the number of nodes, leaves, and attributes involved in the tree.
For GWLP, DT effectively models non-linear relationships by capturing the spatial and temporal dependencies between GWL and features such as OBWs, OPWs, and geographic coordinates. DT's ability to handle both categorical and continuous data allows it to model diverse groundwater conditions effectively.
Principal component analysis
High-dimensional feature spaces complicate models, increase computation times, and make interpretation more difficult. PCA, proposed by Pearson (1901), is a dimensionality reduction technique that addresses these issues. PCA has the ability to generate a plan to minimize the number of features while maintaining the highest level of information by determining the correlation and patterns in the features as well as the maximum variance among them (Zhang et al. 2021). PCA enables the model to employ features that have the greatest impact on its result rather than analyzing every feature.
The input features of the algorithms are the pumping values of aquifer OPWs, OBWs, and their longitude and latitude. In this study, PCA has been used to assess the impact and effect of each ML algorithm input feature over an interval of 1 year. The model's accuracy will be preserved, and computation times will also get quicker.
Model's performance criteria




A GeA value of 1 represents that the model performs the simulation optimally.
ROC is one of the most powerful statistical and graphical techniques for calculating how accurate predictions and data-driven models are. The ROC curve displays false positive values on the Y-axis and true positive values on the X-axis for different thresholds (Chen et al. 2022). The sensitivity of the model increases as the false positive value rises. The AUC indicates the model's reliability and ability for prediction. AUC values between 0.9 and 1 are outstanding, 0.8 and 0.9 are very good, 0.7 and 0.8 are good, 0.6 and 0.7 are satisfactory, and 0.6 and 0.5 are acceptable since the more accurate model has a greater AUC. ROC-AUC provides valuable insights in regression-based GWLP by evaluating the models' ability to discriminate between GWL thresholds. By utilizing ROC-AUC, we analyze the discriminatory ability and overall reliability of the models. This approach has been applied to a variety of groundwater-related problems, including risk assessment of nitrate groundwater contamination (Liang et al. 2024), groundwater potentiality modeling (Mallick et al. 2022), and groundwater spring potential (Seifu et al. 2023).
RESULT

RMSE values for the GWLP models during the training/calibration and testing/verification periods.
RMSE values for the GWLP models during the training/calibration and testing/verification periods.
R² values for the GWLP models during the training/calibration and testing/verification periods.
R² values for the GWLP models during the training/calibration and testing/verification periods.
Turning to the results of the ML models, monthly GWLP for the years 2012–2019 was performed. For the test period, the RMSE values for the CNN, SVM, and DT algorithms were calculated to be 0.26, 0.12, and 0.17, respectively. For the training period, these values were determined to be 0.25, 0.12, and 0.16 (Figure 4). The values for the CNN, SVM, and DT algorithms were 0.64, 0.90, and 0.80 during the test period, and 0.66, 0.91, and 0.81 for the training period (Figure 5).


RMSE maps comparing observed and predicted GWL for the test/validation period for (a) MODFLOW, (b) CNN, (c) SVM, and (d) DT.
RMSE maps comparing observed and predicted GWL for the test/validation period for (a) MODFLOW, (b) CNN, (c) SVM, and (d) DT.
maps comparing observed and predicted GWL for the train/calibration period for (a) MODFLOW, (b) CNN, (c) SVM, and (d) DT.
maps comparing observed and predicted GWL for the train/calibration period for (a) MODFLOW, (b) CNN, (c) SVM, and (d) DT.

Comparison of time series observed GWL (m) and obtained GWL (m) from ML models and MODFLOW in verification or testing period for OBW numbers 7, 94, and 89.
Comparison of time series observed GWL (m) and obtained GWL (m) from ML models and MODFLOW in verification or testing period for OBW numbers 7, 94, and 89.
Correlation of the simulated GWL (m) and observed GWL (m) during the verification/testing period for OBW numbers 7, 94, and 89 for (a) MODFLOW, (b) CNN, (c) SVM, and (d) DT.
Correlation of the simulated GWL (m) and observed GWL (m) during the verification/testing period for OBW numbers 7, 94, and 89 for (a) MODFLOW, (b) CNN, (c) SVM, and (d) DT.
ROC curve and AUC values of the models in the training steps for (a) CNN, (b) SVM, and (c) DT.
ROC curve and AUC values of the models in the training steps for (a) CNN, (b) SVM, and (c) DT.
The RMSE values indicate that the GWLP performed perfectly despite the uncertainty regarding the input parameters of the ML models and MODFLOW. One reason for the accuracy of ML predictions is the inclusion of the latitude and longitude as input features. Table 1 presents the MODFLOW, CNN, SVM, and DT performance in GWLP using the criteria from 36 OBWs. The SVM algorithm provided highly accurate predictions. The RMSE, , and AUC values for the SVM algorithm were 0.12, 0.91, and 0.94, respectively. For 11 OBWs out of 36, the computed RMSE value was less than 0.099, and for 18 OBWs, it was less than 0.99. Moreover, for 24 OBWs, the
value between the observed GWL using the SVM model was greater than 0.9; for 11 OBWs, it was greater than 0.87; and for one OBW, it was 0.81. The DT algorithm, the next most accurate model after SVM in this study, has been utilized in only a small percentage of GWLP studies. The RMSE,
, and AUC values for the DT algorithm were 0.16, 0.81, and 0.85, respectively. Additionally, the CNN algorithm achieved good results, with an RMSE of 0.25,
of 0.66, and an AUC of 0.73, demonstrating its effectiveness in predicting GWL within the study area.
Comparison of performance criteria for numerical and ML models in the validation/testing stage
Criteria . | MODFLOW . | CNN . | SVM . | DT . |
---|---|---|---|---|
RMSE | 0.84 | 0.26 | 0.12 | 0.17 |
![]() | 0.68 | 0.64 | 0.90 | 0.80 |
GeA | 1.02 | 1.06 | 1.04 | 1.04 |
Criteria . | MODFLOW . | CNN . | SVM . | DT . |
---|---|---|---|---|
RMSE | 0.84 | 0.26 | 0.12 | 0.17 |
![]() | 0.68 | 0.64 | 0.90 | 0.80 |
GeA | 1.02 | 1.06 | 1.04 | 1.04 |
The effect of each parameter and its percentage impact on the accuracy of the GWLP by the ML models.
The effect of each parameter and its percentage impact on the accuracy of the GWLP by the ML models.
DISCUSSION
Numerous studies have emphasized the importance of using NM and ML algorithms for GWLP. CNN, a deep learning algorithm; DT, a tree-based algorithm; SVM, an ML algorithm; and MODFLOW, an NM, were employed in this study for GWLP. 80% of the data from the study area were applied for the training, and 20% were used for testing. The results demonstrate the superiority of ML techniques over the NM by performing statistical indicators for comparison methods. Among ML methods, SVM exhibited the best performance in GWLP according to the model's RMSE and , aligning with Chen et al. (2020). Additionally, in this study, AUC values and the ROC curve were also utilized to further assess and compare ML methods. The highest AUC score shows that SVM performs remarkably well in distinguishing between different GWL classes, proving its capability to handle non-linear relationships effectively. Furthermore, the DT capability to handle non-linear relationships was effective, though its value was slightly less than SVM. Nevertheless, the relatively lower AUC score indicates that CNN struggles with the task compared to SVM and DT. In contrast to CNN's powerful abilities for image and spatial data, their performance might be limited by the structure of the groundwater data as they are numerical, or time series in nature, which may not present the kind of spatial or hierarchical patterns that CNNs are designed to exploit. However, in this study, like the results of a study by Wunsch et al. (2021), CNN achieved notable accuracy for GWLP. Especially, when it trained on rich input features such as latitude and longitude, which our PCA analysis identified as effective features because CNNs significantly benefit from extensive datasets.
MODFLOW outperformed ML methods in the GeA index, similar to the findings of Chen et al. (2020). The fact is that MODFLOW considers the unique properties of the aquifer, making it more robust when the groundwater basin experiences changes in recharge due to drought, climate change, or stress from pumping at levels different from the study period, leading to superior performance in scenarios requiring comprehensive flow simulations. In contrast, ML algorithms do not perform well under such altered conditions. However, one of the challenges of MODFLOW is the extensive data preparation required; it can take months to complete a modeling project. Additionally, understanding groundwater flow and hydrogeological sciences is necessary for successful MODFLOW modeling. Unlike ML techniques, MODFLOW simulates groundwater flow by balancing the flow budget, offering a more comprehensive approach.
Turning to ML, users can simulate groundwater with fewer data requirements, and preparing the necessary GWLP data using ML is easier than collecting MODFLOW data (Mohanty et al. 2013). On the one hand, ML approaches, particularly when trained with robust datasets, offer flexibility, reduced computational demands, and faster model preparation. On the other hand, ML models can be extremely helpful in cases of aquifers with limited data. Notably, PCA revealed that latitude and longitude contributed significantly, with a combined effectiveness of 22% in improving model predictions, along with the GWL having the greatest effectiveness at 71%. The findings from PCA also imply that perfect feature selection can fill in the gaps between various modeling techniques.
Consequently, by applying management scenarios over time, both NM and ML models are able to predict GWL changes, providing decision-makers with tools to analyze groundwater flow under conditions of drought, climate change, and water stress due to aquifer over-exploitation.
CONCLUSION
The study area is located in the Yazd-Arkan Plain, one of Iran's most industrialized regions, which has recently experienced a dramatic decline in GWL. Consequently, the implementation of management programs is required to prevent further damage and use groundwater simulators to predict potential impacts. This study used ML models and MODFLOW as an NM to perform a GWLP, utilizing aquifer features and data from 36 OBWs. SVM provides the most accurate modeling for GWLP, with AUC of 0.94, RMSE of 0.12, and of 0.91. When compared to other models used in the study, MODFLOW outperformed the others in the GeA index. However, all methods produced good GWL modeling results. MODFLOW achieved an RMSE value of 0.84, indicating its effectiveness in simulation. The most efficient features in ML prediction are identified by employing the PCA technique. Four features of OBWs and OPWs, along with their latitude and longitude, were introduced as ML input. According to PCA, these four input features contributed to ML accuracy by 71, 7, 13, and 9%, respectively. PCA also determined that the fifth time period was the most effective for CNN and SVM models, while the sixth time period was the most effective for DT. One of the most notable distinctions between the two approaches is the amount of time required for GWLP. However, physically based models like MODFLOW are not hindered by non-stationary conditions because they incorporate physical modules. Finally, the authors recommend modeling the entire Yazd-Ardakan Plain using a hybrid algorithm for groundwater forecasting, especially for CNN. Future studies could assess the autocorrelation of groundwater variations in the study area to identify patterns or dependencies within the GWL time series. Additionally, the use of the detrended fluctuation analysis (DFA) method could help remove trends and focus on the intrinsic fluctuations of GWL data from the OBWs.
ACKNOWLEDGEMENTS
The authors express their gratitude to the editor and reviewing team for their helpful comments, which helped elevate the quality of this work. They also thank Yazd Regional Water for supplying the data.
FUNDING
The authors assert that they have no known financial or personal relationships that could have potentially influenced the work reported in this paper.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.