The precise prediction of groundwater levels is a challenging task due to the complex relationships between hydrological parameters and the lack of in situ climate data. The present research proposes an integrated machine learning model for groundwater level prediction based on long short-term memory (LSTM) combined with principal component analysis (PCA) and the discrete wavelet transform (DWT), i.e. the PCA–DWT–LSTM model. The proposed model was developed using 23 years (2000–2022) of seasonal groundwater level data and climatic variables for nine wells in the district of Kangra in Himachal Pradesh, India. The proposed model attains a higher range of R² (0.8253–0.8828) and a lower range of root mean square error (RMSE) (0.1011–2.0025) than the alternative model (PCA–LSTM), whose R² and RMSE values lie in the ranges 0.7019–0.8005 and 0.2662–2.9565, respectively. Moreover, when compared to the hybrid models, the accuracy of the DWT-based models is much higher. The developed model (PCA–DWT–LSTM) improves the accuracy and interpretability of groundwater level prediction and has the potential to estimate the groundwater level accurately, particularly in regions where obtaining hydrogeological data is difficult.

  • A hybrid principal component analysis (PCA)–discrete wavelet transform (DWT)–long short-term memory (LSTM) model for enhanced prediction accuracy is proposed.

  • Integration of hydrological and remote sensing data with machine learning is adopted.

  • Comprehensive performance evaluation using R² and root mean square error (RMSE) is presented.

  • Benefits for groundwater management and policy-making are presented.

The depletion of groundwater is a worldwide issue that needs immediate action by the water management bodies, the general public, and other stakeholders to guarantee the sustainability of groundwater resources (Konikow & Kendy 2005). The sustainable development of groundwater resources requires the application of efficient management techniques and a thorough understanding of the mechanisms underlying groundwater depletion. In India, the loss of groundwater can reduce agricultural intensity by 20–68% in areas where groundwater is limited (Bhattarai et al. 2021). The long-term balance between groundwater abstraction and recharge is mostly accountable for this effect (Wendt et al. 2020).

In order to anticipate possible groundwater situations and comprehend the dynamics of groundwater systems, advanced modelling techniques can be used to monitor aquifer properties, groundwater levels, and water quality. Moreover, groundwater level changes over the long term are helpful in guiding future research and decision making (Hadidi et al. 2019).

For environmental scientists, water resource planners, and policymakers around the world, managing and predicting groundwater levels has become a critical undertaking in recent decades. Saqr et al. (2024) developed solar energy-based groundwater exploitation suitability maps that integrate region-specific characteristics and the three pillars of sustainability to assess the groundwater level. Conventional hydrological models such as MODFLOW and HYDRUS have been employed to provide informative assessments of groundwater dynamics (Twarakavi et al. 2008; Zhu et al. 2012; Sundar et al. 2022). The hybrid simulation–optimisation (S–O) framework of the MODFLOW-UnStructured Grids (USG) model has been effectively utilised for groundwater prediction (Saqr et al. 2022). These are physically based models: when the groundwater problem is solved numerically, a set of the aquifer's physical characteristics is utilised implicitly. However, these models are not able to simulate complex geological features (Kumar 2019). Numerical models are also hindered by long runtimes, whereas machine learning models are an efficient alternative. Adopting the machine learning approach overcomes the constraints present in numerical modelling and remote sensing approaches (Tao et al. 2022). Machine learning models maintain the level of detail and precision in groundwater level predictions while reducing the time needed for model development and calibration (Di Salvo 2022).

In recent years, groundwater modelling and forecasting have changed significantly with the introduction of machine learning (ML). Machine learning techniques can determine weak relationships among hydrological variables via simulation through a training network (Banerjee et al. 2011). Artificial intelligence models are an increasingly prevalent alternative to complicated numerical approaches for predicting groundwater levels (Derbela & Nouiri 2020). Several studies have used artificial intelligence in groundwater resource simulation (Lallahem et al. 2005; Tsanis et al. 2008; Suryanarayana et al. 2014; Hussein et al. 2020; Khedri et al. 2020; Samani et al. 2023; Singh et al. 2024). Abd-Elmaboud et al. (2024) utilised the artificial neural network (ANN) technique for mapping groundwater potential zones. However, standard artificial intelligence models lack the ability to learn from time-series data since they cannot retain past information, which limits their prediction potential for long-term time-series data (Wiese & Omlin 2009). Convolutional neural network (CNN) and long short-term memory (LSTM) neural networks are important for dealing with long- and short-term dependencies (Wunsch et al. 2021), while LSTM is more efficient than regular neural networks for data with long-term dependencies (Lechner & Hasani 2020).

For noisy and sparse data, various standalone modelling techniques presented in the literature are inefficient for precise prediction of groundwater level (Nguyen et al. 2019). Therefore, for limited data, combining several individual techniques, i.e. hybrid models, can be utilised to improve the predictive efficacy of machine learning algorithms (Maxhuni et al. 2016). A multivariate statistical method called principal component analysis (PCA) is used to reduce the dimensionality of the data by converting it into orthogonal components, which are linear combinations of the original variables. PCA is typically used to minimise a dataset's dimensionality while preserving as much of the original information as feasible (Singh et al. 2011). The literature indicates that PCA is also widely used to analyse the parameters affecting groundwater quality (Usman et al. 2014). The discrete wavelet transform (DWT) is a data processing technique extensively used to improve model predictions. DWT is used to separate the dynamic and multi-scale features in the groundwater level (Nourani et al. 2015). Moreover, combining the DWT with a machine learning model enhances the model's prediction metrics (Wei et al. 2023).

Driven by the various effective applications of hybrid machine learning models, a novel hybrid PCA–DWT–LSTM model is proposed in the present study for groundwater prediction. The PCA is used to identify the most important variables influencing groundwater level and to generate new variables that represent the highest variance of input dataset. To remove outliers from the groundwater time-series and to obtain the trend and residual series, the DWT technique has been used. The seasonal groundwater level is then predicted using LSTM. The study uses the groundwater level data of 23 years, i.e. for years 2000–2022, for the nine wells in the study area. The data of 19 years were used for the training, whereas the remaining data were utilised for validation purposes. The objectives of the study are as follows: (1) to identify the dominant variables affecting groundwater level and to reduce the dimensionality of the dataset using PCA; (2) to obtain the trend series, i.e. the approximation component from groundwater level series using the DWT; and (3) to develop a model for predicting groundwater level using the LSTM technique and evaluate its efficacy using statistical indicators.

Study area

The Kangra district lies in the western region of Himachal Pradesh, India. The coordinates of the study area are 75°47′55″ to 77°45′ E longitude and 31°21′ to 32°59′ N latitude. The study uses groundwater level data of nine wells that are located at Bandh, Bod, Panjpir, Paprola, Dehra Gopipur, Raja ka Talab, Bharoli, Hardogi, and Jawalaji in the Kangra district of Himachal Pradesh, India. The groundwater level data are recorded by the Central Ground Water Board (CGWB) on seasonal basis. The study area experiences significant climatic variations, with annual temperatures ranging from 0 to 40 °C and annual precipitation varying between 1,200 and 3,000 mm due to its unique topography. In total, 70% of the yearly rainfall that falls in the region occurs during the monsoon season.

Data description

Observed groundwater level data

The study uses 23 years of groundwater level data from the year 2000 to 2022. The selection of the time frame (2000–2022) for data collection was based on capturing both long-term trends and seasonal variations in groundwater levels, providing a robust basis for modelling. This period was chosen due to the availability of high-resolution consistent data with minimal gaps in it. Additionally, the data were attained from a government agency, i.e. the CGWB, which further ensures its accuracy and credibility. In this study, the dataset of over two decades has been utilised for the model development, which offers a significant comprehensive view that enhances the representativeness and reliability of the collected data. A total of nine wells, designated as X1, X2, X3, X4, X5, X6, X7, X8, and X9, were chosen for the study from Kangra district in Himachal Pradesh. Figure 1 represents the study area and location of the groundwater monitoring wells. The water levels are monitored during the monsoon in August, the early post-monsoon period in November, the pre-monsoon period in May, and the late post-monsoon period in January every year. During pre-monsoon and post-monsoon periods, the depth to water level ranges from 1.56 to 15.44 m and 0.48 to 12.30 m below ground level (bgl), respectively.
Figure 1

Overview of the study area and location of the groundwater monitoring wells.


Satellite data

CHIRPS, which stands for Climate Hazards Group InfraRed Precipitation with Station data, was created through a collaboration between the National Oceanic and Atmospheric Administration (NOAA) and the National Aeronautics and Space Administration (NASA), both of the United States of America (USA). The dataset, with a spatial resolution of 0.05° × 0.05°, integrates satellite observations and rain gauge measurements (Funk et al. 2015). The CHIRPS dataset, which spans from 1981 to the present, was acquired from https://www.chc.ucsb.edu/data/chirps. Several studies using gauge-based observations have validated the CHIRPS dataset's reliability and accuracy in capturing precipitation patterns across India (Prakash 2019). CHIRPS rainfall data were successfully used by Moudgil et al. (2024) to analyse the spatiotemporal variation of terrestrial water storage over various Indian river basins.

As previously mentioned, the observed CGWB data are available seasonally; therefore, the CHIRPS average monthly precipitation data for August, November, May, and January have been used in the study. FLDAS (USA) stands for the Famine Early Warning Systems Network (FEWS NET) Land Data Assimilation System. The hydrological variables considered in the present study, i.e. evapotranspiration, surface soil moisture, and root zone soil moisture, are taken from the FLDAS dataset. The datasets corresponding to August, November, May, and January of every year in the study period are taken from the Noah 3.6.1 (Land Surface Model, USA) simulations with a spatial resolution of 0.1° × 0.1°.

TerraClimate is a high-resolution global gridded dataset of monthly climate and climatic water balance variables that extends back to 1958. The variables taken from the TerraClimate dataset in the present study, i.e. monthly maximum and minimum temperatures and runoff, have a spatial resolution of 0.04° × 0.04°. Table 1 summarises the different datasets used in the present study to predict the groundwater levels.

Table 1

Summary of different datasets used in this study to predict the groundwater level

| Sr. no. | Variables | Source | Resolution | Time span |
|---|---|---|---|---|
| 1 | Precipitation | CHIRPS | 0.05° × 0.05° | 2000–2022 |
| 2 | Evapotranspiration | FLDAS | 0.1° × 0.1° | 2000–2022 |
| 3 | Surface soil moisture | FLDAS | 0.1° × 0.1° | 2000–2022 |
| 4 | Root zone soil moisture | FLDAS | 0.1° × 0.1° | 2000–2022 |
| 5 | Groundwater level | CGWB | In situ data | 2000–2022 |
| 6 | Minimum temperature | TerraClimate | 0.04° × 0.04° | 2000–2022 |
| 7 | Maximum temperature | TerraClimate | 0.04° × 0.04° | 2000–2022 |
| 8 | Runoff | TerraClimate | 0.04° × 0.04° | 2000–2022 |

Identification and processing of outliers

Errors in data management (such as field recording and database entry) and observation bore failures (such as collapsed or flooded bores) result in inaccurate measurements of groundwater level. Such mistakes and anomalies ought to be found and considered before being included in the study (Peterson et al. 2018). The ‘3σ’ criterion has been applied widely by several authors to detect the outliers present in the groundwater level data (Tran et al. 2016; Azimi et al. 2018). The ‘3σ’ rule states that 99.73% of all values for a normally distributed parameter should lie within the range of (μ − 3σ, μ + 3σ), where μ is the parameter's mean and σ is its standard deviation.

Data points outside the range (μ − 3σ, μ + 3σ) are regarded as outliers and can be treated as potentially unusual measurements. The 3 σ criterion is used for the identification of outliers and the value of outliers was revised using the weighted average method. The preprocessing steps ensured an effective dataset, reduced overfitting risks, and improved predictive accuracy. Also, the data preprocessing maintains the important time-series data intact while lessening the impact of extreme values as updated during the identification and processing of the outliers.
$$E_t = \frac{\sum_{i=1}^{k} w_i\, x_{t-i}}{\sum_{i=1}^{k} w_i} \tag{1}$$

Using Equation (1), the smoothed outlier values, represented by $E_t$, are determined. The parameter $w_i$ refers to the weights applied to the input values, which vary with time. The parameter $x_{t-i}$ is a historical value near the outlier at time $t$, and $k$ is a positive integer.
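The two preprocessing steps described above (3σ detection followed by weighted-average replacement) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the choice of linearly increasing weights, and the sample series are all assumptions made for the example.

```python
from statistics import mean, pstdev

def three_sigma_outliers(values):
    """Return indices of values lying outside (mu - 3*sigma, mu + 3*sigma)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > 3 * sigma]

def smooth_outliers(values, k=2):
    """Replace each flagged value with a weighted average of up to k
    preceding values. The weights 1, 2, ..., k are hypothetical: they
    simply give the most recent observation the largest weight."""
    cleaned = list(values)
    for i in three_sigma_outliers(values):
        history = cleaned[max(0, i - k):i]      # up to k earlier values
        if not history:
            continue                            # nothing to average from
        weights = range(1, len(history) + 1)    # nearest value weighs most
        cleaned[i] = sum(w * x for w, x in zip(weights, history)) / sum(weights)
    return cleaned

# Made-up depth-to-water series (m bgl) with one spurious reading.
levels = [5.0, 5.2, 4.8, 5.1, 4.9] * 4
levels[10] = 25.0
cleaned = smooth_outliers(levels, k=2)
```

With only a handful of observations a single extreme value inflates σ so much that it may not exceed the 3σ bound, so the rule is more dependable on longer series such as the 23-year records used here.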

Principal component analysis

PCA is an unbiased method used to extract ‘relevant’ information from high-dimensional datasets by only considering principal components (PCs) that represent sufficiently large portions of the total dataset in terms of variance (Daffertshofer et al. 2004). The PCs are the linear functions of the original variables, and the sum of variances for the original and derived variables are equal. The PCs are listed in descending order of their respective eigenvalues, which represent the variance explained by each PC. The first principal component (PC1) has the highest eigenvalue and explains the most variance, followed by the second principal component (PC2). The details of PC1 and PC2 are displayed in Equations (2) and (3), the remaining PCs can be obtained in the same way.
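$$\mathrm{PC}_1 = a_{11}Y_1 + a_{12}Y_2 + \dots + a_{1j}Y_j \tag{2}$$

$$\mathrm{PC}_2 = a_{21}Y_1 + a_{22}Y_2 + \dots + a_{2j}Y_j \tag{3}$$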
where {Y1, Y2, …, Yj} represent the initial variables. The parameters a1j and a2j are the coefficients that define the linear combination of the original variables for the first and second PCs, respectively, and j represents the original variable number. Climate variables that have a major influence on groundwater level estimation were chosen to capture the intricate geographical and temporal patterns. Initially, the dimensionality of the dataset was reduced using PCA, which preserved the most effective variables. Then the processed data were combined with effective variables to predict the groundwater level.
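As a rough illustration of how the coefficients of the first PC arise, the sketch below standardises a few variables, forms their covariance matrix, and extracts the leading eigenvector. Power iteration is used here only to keep the example dependency-free; it stands in for the full eigendecomposition a statistics library would perform, and the three short input series are invented for the example.

```python
import math

def standardise(columns):
    """Scale each variable (one list per variable) to zero mean, unit variance."""
    out = []
    for col in columns:
        mu = sum(col) / len(col)
        sd = math.sqrt(sum((v - mu) ** 2 for v in col) / len(col))
        out.append([(v - mu) / sd for v in col])
    return out

def covariance_matrix(columns):
    """Covariance matrix of already-centred variables."""
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in columns]
            for ci in columns]

def first_component(cov, iterations=200):
    """Power iteration: returns the largest eigenvalue (variance of PC1)
    and its unit eigenvector (the coefficients of the first PC)."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(v * sum(c * w for c, w in zip(row, vec))
                     for v, row in zip(vec, cov))
    return eigenvalue, vec

# Hypothetical inputs: precipitation, runoff, maximum temperature.
precipitation = [10.0, 20.0, 30.0, 40.0, 50.0]
runoff = [1.0, 2.0, 3.0, 4.0, 5.5]
max_temp = [30.0, 25.0, 28.0, 22.0, 27.0]

cov = covariance_matrix(standardise([precipitation, runoff, max_temp]))
eigenvalue, coefficients = first_component(cov)
share_of_variance = eigenvalue / len(cov)   # trace of a correlation matrix = number of variables
```

Because the variables are standardised, the eigenvalues sum to the number of variables, so dividing an eigenvalue by that number gives the fraction of total variance carried by the corresponding PC.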

Wavelet transform

Removing or minimising input noise in hydrologic modelling reduces the impact of noise on the simulation, leading to a more efficient model that can generate more precise results. The Fourier transform (FT) and the wavelet transform (WT) are two of the many transforms used to increase model efficiency (Jeihouni et al. 2019). There are two types of wavelet transform: the DWT and the continuous wavelet transform (CWT). DWT is a more effective and useful technique for studying time-series data than CWT, as CWT requires large amounts of processing time and data volume (Wu et al. 2021).

The groundwater time-series data are decomposed into approximation and detail components. The high-frequency, quickly varying component of the signal is represented by the detail component, while the low-frequency, slowly variable component is represented by the approximation component. The approximation component provides the most important information about the underlying trends and patterns in the data because it catches the low-frequency changes that are frequently of major relevance in groundwater level analysis. By contrast, the detail component includes high-frequency noise and fluctuations in the data, which may be less useful in predicting the overall groundwater level trends. By retaining the approximation component and rejecting the detail component, the DWT provides a data filtration strategy that enables more efficient analysis of the groundwater level time-series, as only the most essential information is maintained (Deepmala & Piscoran 2016).

The process of denoising involves identifying a suitable mother wavelet and a number of decomposition levels. In fact, it decomposes the signal into a list of functions (Cohen & Kovacevic 1996).
$$\psi_{j,k}(x) = 2^{j/2}\,\psi\!\left(2^{j}x - k\right) \tag{4}$$

In Equation (4), a mother wavelet $\psi(x)$ that is dilated by $j$ (which determines the level of decomposition) and translated by $k$ (which determines the position at which the wavelet analyses the function) gives rise to $\psi_{j,k}(x)$. The discrete wavelet representation of a signal $f(x)$ is obtained from the set of Equations (5).

$$c_{j,k} = \int_{-\infty}^{\infty} f(x)\,\psi_{j,k}^{*}(x)\,dx, \qquad f(x) = \sum_{j,k} c_{j,k}\,\psi_{j,k}(x) \tag{5}$$

where the wavelet coefficient of a signal is denoted by $c_{j,k}$, $f(x)$ is the analysed signal, and $\psi_{j,k}^{*}(x)$ is the complex conjugate of the wavelet function. The scaling function $\phi(x)$ is used to formulate the mother wavelet as given in the set of Equations (6):

$$\phi(x) = \sqrt{2}\sum_{n} h_0(n)\,\phi(2x - n), \qquad \psi(x) = \sqrt{2}\sum_{n} h_1(n)\,\phi(2x - n), \qquad h_1(n) = (-1)^{n}\,h_0(1 - n) \tag{6}$$

where $h_0(n)$ refers to the low-pass filter coefficients; $h_1(n)$ refers to the high-pass filter coefficients, which help identify sudden changes in groundwater levels; $\phi(2x - n)$ is the scaling function with its argument scaled and translated; and $h_0(1 - n)$ is a mirrored version of the low-pass coefficients, which help to identify the long-term trends in groundwater levels. It is possible to find distinct sets of coefficients $h_0(n)$ that correspond to wavelet bases with different properties (Rajaee 2011).
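The split into approximation and detail components, and the filtering strategy of keeping only the approximation, can be illustrated with the simplest Daubechies wavelet (db1, also known as Haar), whose low-pass and high-pass filters each have two taps of magnitude 1/√2. The sketch below is a hand-written one-level transform for even-length series, intended as an illustration rather than a substitute for a wavelet library; the function names and sample values are assumptions.

```python
import math

SCALE = 1 / math.sqrt(2)   # magnitude of both Haar filter taps

def haar_dwt(signal):
    """One-level Haar (db1) DWT of an even-length sequence.
    Returns (approximation, detail) coefficient lists."""
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [SCALE * (a + b) for a, b in pairs]   # low-pass: local average
    detail = [SCALE * (a - b) for a, b in pairs]   # high-pass: local difference
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the series from both components."""
    out = []
    for a, d in zip(approx, detail):
        out.extend([SCALE * (a + d), SCALE * (a - d)])
    return out

def denoise(signal):
    """Keep the approximation, discard the detail, and reconstruct."""
    approx, detail = haar_dwt(signal)
    return haar_idwt(approx, [0.0] * len(detail))

series = [4.0, 6.0, 10.0, 12.0]          # made-up groundwater depths (m)
approx, detail = haar_dwt(series)
trend = denoise(series)                   # each pair replaced by its mean
roundtrip = haar_idwt(approx, detail)     # recovers the original series
```

Higher-order wavelets such as db2 or db4 use longer filters and need boundary handling, which is why a dedicated library is normally preferred in practice.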

LSTM

The LSTM model was proposed by Hochreiter & Schmidhuber (1997) to address the vanishing gradient problem. This problem arises in classic recurrent neural network (RNN) models when long-term information is stored via recurrent backpropagation. In a traditional RNN, the backpropagated error decreases exponentially over long sequences, causing the network weights to update slowly and preventing the RNN from retaining long-term memory. The LSTM model uses a special RNN architecture. A typical LSTM unit comprises a memory cell that maintains data for arbitrarily long periods and three gates, i.e. an input gate, an output gate, and a forget gate, which control the information flow into and out of the cell. The 'cell state', or Ct, which is defined in the following equations, is the fundamental component of the LSTM model. It allows information to flow and represents the temporal variation in the memory storage space. The three self-parameterised control gates govern the cell state and information flow. LSTM formulae from Graves et al. (2013) are expressed as follows:
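$$f_t = \sigma\!\left(w_f\,[h_{t-1}, g_t] + b_f\right) \tag{7}$$

$$i_t = \sigma\!\left(w_i\,[h_{t-1}, g_t] + b_i\right) \tag{8}$$

$$\tilde{C}_t = \tanh\!\left(w_c\,[h_{t-1}, g_t] + b_c\right) \tag{9}$$

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{10}$$

$$o_t = \sigma\!\left(w_o\,[h_{t-1}, g_t] + b_o\right) \tag{11}$$

$$h_t = o_t \odot \tanh\!\left(C_t\right) \tag{12}$$

where $g_t$ is the current input to the cell at time step $t$; $h_{t-1}$ is the final output of the previous time step; $w_f$, $w_c$, $w_i$, and $w_o$ are the weight matrices for the forget gate, cell state, input gate, and output gate; $b_f$, $b_i$, $b_c$, and $b_o$ are the bias terms for the forget gate, input gate, cell state, and output gate; $\sigma$ (sigmoid) is the activation function, which varies between 0 and 1 and indicates how much of each component should be kept or forgotten; $f_t$ is the forget gate outcome; $C_{t-1}$ is the previous cell state; $i_t$ is the input gate outcome; $\tilde{C}_t$ is the candidate cell state, which contains the new information that may be added to the cell state; $o_t$ is the output gate outcome; $C_t$ is the current cell state; tanh is the hyperbolic tangent function; and $h_t$ is the final output of the new cell.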
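To make the roles of the three gates and the cell state concrete, the following sketch performs one forward step of a single-unit LSTM cell using the standard formulation (forget gate, input gate, candidate state, cell update, output gate, hidden output). Scalars replace the weight matrices for readability, and the dictionary layout and the numeric weights are assumptions made purely for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(g_t, h_prev, c_prev, w, b):
    """One forward step of a scalar LSTM cell.

    w[name] = (weight on h_prev, weight on g_t) and b[name] = bias,
    for name in 'f' (forget), 'i' (input), 'c' (candidate), 'o' (output).
    Returns the new hidden output h_t and the new cell state c_t.
    """
    def pre_activation(name):
        return w[name][0] * h_prev + w[name][1] * g_t + b[name]

    f_t = sigmoid(pre_activation("f"))        # share of old memory to keep
    i_t = sigmoid(pre_activation("i"))        # share of new information to admit
    c_hat = math.tanh(pre_activation("c"))    # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat          # updated cell state
    o_t = sigmoid(pre_activation("o"))        # share of memory to expose
    h_t = o_t * math.tanh(c_t)                # output of the cell
    return h_t, c_t

# Hypothetical weights; a trained network would learn these values.
weights = {"f": (0.5, 0.1), "i": (0.4, 0.3), "c": (0.2, 0.6), "o": (0.3, 0.2)}
biases = {"f": 0.0, "i": 0.0, "c": 0.0, "o": 0.0}

h, c = 0.0, 0.0
for g in [0.2, 0.4, 0.1]:                     # a short made-up input sequence
    h, c = lstm_step(g, h, c, weights, biases)
```

Because the cell state is updated additively rather than by repeated multiplication through an activation function, gradients can persist over many time steps, which is the property the vanishing-gradient discussion above relies on.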

Model evaluation

In this study, the accuracy of the model was assessed by comparing the measured and predicted values of seasonal groundwater level using the coefficient of determination (R²) and the root mean square error (RMSE). An R² value close to 1 is preferred for optimal model prediction (Chandel et al. 2023); R² lies in the interval (−∞, 1]. The RMSE varies from 0 to ∞, and a value of 0 indicates an ideal model prediction. Equations (13) and (14) are used to determine the R² and RMSE values, where $y_i$ is the observed groundwater level value; $\hat{y}_i$ is the predicted groundwater level value; $\bar{y}$ is the mean value of observed groundwater level, all in metres; and $N$ is the number of observations.

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \tag{13}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \tag{14}$$
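Both indicators are straightforward to compute from paired observed and predicted values. The sketch below uses the common residual-sum-of-squares form of R², which is consistent with the stated lower bound of −∞; the function names and the sample values are illustrative assumptions.

```python
import math

def rmse(observed, predicted):
    """Root mean square error, in the same units as the inputs."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Made-up groundwater depths (m bgl).
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

error = rmse(observed, predicted)        # 0.5
score = r_squared(observed, predicted)   # 0.95
```

A model that always predicts the observed mean scores R² = 0, and one that does worse than the mean scores below zero, which is why the measure has no lower bound.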

Proposed hybrid PCA–DWT–LSTM model

It is evident from the literature review that neural networks are capable of learning the intricate properties of time-series data. To improve the model's predictive capabilities, this study develops a PCA–DWT–LSTM hybrid model. The model has three components: (1) lowering the dimensionality of the influencing variables; (2) finding outliers in the series and smoothing them using the 3σ and weighted average methods, and then using the DWT to remove the noise component and extract the trend from the time series; and (3) employing the hybrid model to predict the groundwater level. Figure 2 presents the structure of the proposed PCA–DWT–LSTM model. First, PCA is used to reduce multicollinearity among the input features: runoff, maximum and minimum temperatures, precipitation, evapotranspiration, and soil moisture content at the surface and root zone. The resulting PCs of these input variables, represented as PC1, PC2, and PC3 in Figure 3, reduce the dimensionality of the data. Subsequently, outliers are identified with the 3σ criterion, and the weighted average method is used to obtain an outlier-free series. The DWT then separates the groundwater level data into approximation and detail components. The approximation series has a consistent variance and reflects the essential characteristics of the original data. Following that, the approximation component is subtracted from the outlier-processed series to yield the residual series. Thereafter, two LSTM models are developed to predict the groundwater level.
Figure 2

Flow diagram depicting the methodology of the groundwater prediction model.

Figure 3

Presentation of the groundwater level data: (a) The observed groundwater level with the identification of outliers and (b) the processed groundwater level.


The first LSTM model is designed to give outputs from the main features, i.e. the approximation component. Therefore, the input features for the first model are the approximation component, which is the denoised series, and the three PCs obtained from PCA. The second LSTM model was created to capture the peak variations in groundwater level and thereby improve prediction outcomes. The input for the second model is the residual series and the PCs (PC1, PC2, and PC3). The final groundwater level prediction results are obtained by combining the outputs from both models.
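The data flow of the two-model design can be sketched as follows: the series is split into a trend and a residual, each part is forecast separately, and the two forecasts are combined. The sketch assumes the combination is a simple sum, which mirrors how the residual is defined but is not stated explicitly in the text. A last-value forecaster stands in for each trained LSTM, the pairwise mean stands in for a level-1 db1 approximation, and the PC inputs are omitted for brevity; all names are hypothetical.

```python
def pairwise_mean_trend(series):
    """Level-1 Haar approximation rebuilt at the original length:
    each consecutive pair of values is replaced by its mean."""
    trend = []
    for i in range(0, len(series), 2):
        m = (series[i] + series[i + 1]) / 2
        trend.extend([m, m])
    return trend

def persistence_forecast(history):
    """Placeholder for a trained LSTM: predicts the last seen value."""
    return history[-1]

def hybrid_forecast(series, trend_model, residual_model):
    """Forecast trend and residual separately, then add the results."""
    trend = pairwise_mean_trend(series)
    residual = [s - t for s, t in zip(series, trend)]
    return trend_model(trend) + residual_model(residual)

series = [4.0, 6.0, 10.0, 12.0]    # made-up outlier-processed depths (m)
next_level = hybrid_forecast(series, persistence_forecast, persistence_forecast)
```

Since the residual is the processed series minus the approximation, the two parts always add back to the processed series, so no information is lost by modelling them separately.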

DWT analysis

The 3σ rule is applied for the identification of the outliers in groundwater level. This rule states that normally distributed parameters are bracketed in the range (μ − 3σ, μ + 3σ), where μ and σ indicate the parameter's mean and standard deviation. Any data points that fall outside of this range are considered outliers and analysed as potentially unusual measurements. Then, the weighted average approach is employed to smooth the outlier. Figure 3 presents the preprocessing of data for a well. To capture the important characteristics of time series, the Daubechies wavelets with different orders (dbn) are utilised for the decomposition of the series on the outlier processed series. The RMSE and signal-to-noise ratio (SNR) are used to evaluate the performance of the DWT analysis at various levels of decomposition. A high SNR for the groundwater level time series indicates that the groundwater level is clear and easy to detect or interpret, while the noise is relatively low. According to the evaluation criteria the lowest RMSE and highest SNR will reflect the best decomposition effect. From Figure 4 it is concluded that when the wavelet is db1 and decomposition level is 1, then the best decomposition effect is achieved for well X9. All nine wells are processed using similar criteria. The DWT analysis's findings show the following: The optimal decomposition effects for wells X1, X6, X7, X8, and X9 are obtained by employing the db1 wavelet at a decomposition level of 1. The db4 wavelet at a decomposition level of 1 yields the best results for wells X2 and X4. The most efficient decomposition can be obtained for wells X3 and X5 using the db2 wavelet with a decomposition level of 1. Over the course of the study, it was found that these configurations produced the most accurate and reliable decomposition findings.
Figure 4

Performance metrics of DWT analysis for the well X9.
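The two selection criteria can be computed as below. The SNR definition used here (signal power divided by the power of the removed component, in decibels) is a common convention and is assumed rather than taken from the text; the candidate names and series are invented for the example.

```python
import math

def denoising_rmse(original, denoised):
    n = len(original)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(original, denoised)) / n)

def snr_db(original, denoised):
    """Signal-to-noise ratio in decibels (assumed definition):
    10 * log10(signal power / power of the removed noise)."""
    signal_power = sum(x * x for x in original)
    noise_power = sum((x - y) ** 2 for x, y in zip(original, denoised))
    if noise_power == 0:
        return math.inf                      # nothing was removed
    return 10 * math.log10(signal_power / noise_power)

def best_candidate(original, candidates):
    """Name of the candidate reconstruction with the lowest RMSE."""
    return min(candidates, key=lambda name: denoising_rmse(original, candidates[name]))

original = [4.0, 6.0, 10.0, 12.0]
candidates = {
    "candidate_a": [5.0, 5.0, 11.0, 11.0],   # hypothetical mild smoothing
    "candidate_b": [8.0, 8.0, 8.0, 8.0],     # hypothetical heavy smoothing
}
chosen = best_candidate(original, candidates)
snr_a = snr_db(original, candidates["candidate_a"])
snr_b = snr_db(original, candidates["candidate_b"])
```

Under this definition both metrics depend on the same sum of squared differences, so for a given series the candidate with the lowest RMSE is also the one with the highest SNR, and ranking by either gives the same choice.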


Determination of the PCs of variables using PCA

In this study, the meteorological factors, i.e. maximum and minimum temperature, precipitation, evapotranspiration, surface soil moisture, root zone soil moisture, and runoff, were used in the PCA to reduce the dimensionality of the input dataset. PCA reduced the high-dimensional dataset to a compact dataset that explains the majority of the original dataset's variance. The results of PCA on the seven independent variables for well X9 are shown in Figure 5. The cumulative variance contribution ratio (CVCR) method is applied to select the PCs. The CVCR of the first three PCs is 96.00%, which exceeds the 95% threshold. This indicates that the first three components explain most of the variance present in the dataset and contain almost the same amount of information as the input dataset.
Figure 5

PCA results for well X9, which show that the first three PCs explain the 96.00% variance of data.
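The CVCR rule amounts to sorting the eigenvalues, accumulating their share of the total variance, and stopping once a threshold is reached. A minimal sketch follows; the 0.95 threshold comes from the text, while the function name and the seven eigenvalues are invented for illustration and are not results from the study.

```python
def components_for_cvcr(eigenvalues, threshold=0.95):
    """Smallest number of leading PCs whose cumulative variance
    contribution ratio reaches the threshold, plus the ratio attained."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count, cumulative / total
    return len(ordered), 1.0

# Hypothetical eigenvalues for seven standardised input variables.
eigenvalues = [4.2, 1.8, 0.72, 0.15, 0.08, 0.03, 0.02]
n_components, cvcr = components_for_cvcr(eigenvalues)
```

With these made-up values the first two PCs reach only about 86% of the variance, so a third is needed to pass the 95% threshold.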


Table 2 shows the results of PCA for all nine wells, each with a CVCR greater than 95%. The high-dimensional dataset is reduced to and represented by the PCs, i.e. PC1, PC2, and PC3. Each PC is a linear combination of the original variables, and the loadings that define it are presented in Table 2. The loadings are generated by normalising the eigenvectors associated with each eigenvalue. They indicate the contribution of each original feature, i.e. each initial variable in the dataset, to the PC. A high absolute loading means that the feature contributes significantly to that PC. In Table 2, the features runoff (RUNOFF), evapotranspiration (EVAPO), and minimum temperature (MINI_TEMP) show the highest loadings and contribute significantly to PC1, PC2, and PC3, respectively.

Table 2

PCA results for all nine wells with the cumulative explained variance

| Well | CVCR of three PCs | Principal component | Feature | Loading |
|---|---|---|---|---|
| X1 | 97.88% | PC1 | RUNOFF | 0.460041 |
| | | PC2 | EVAPO | −0.687797 |
| | | PC3 | MINI_TEMP | 0.833887 |
| X2 | 97.81% | PC1 | RUNOFF | 0.446283 |
| | | PC2 | EVAPO | −0.698117 |
| | | PC3 | MINI_TEMP | 0.847849 |
| X3 | 97.88% | PC1 | RUNOFF | 0.452446 |
| | | PC2 | EVAPO | −0.700963 |
| | | PC3 | MINI_TEMP | 0.872350 |
| X4 | 97.84% | PC1 | RUNOFF | 0.454899 |
| | | PC2 | EVAPO | −0.699614 |
| | | PC3 | MINI_TEMP | 0.834079 |
| X5 | 98.00% | PC1 | RUNOFF | 0.448317 |
| | | PC2 | EVAPO | −0.701701 |
| | | PC3 | MINI_TEMP | 0.856566 |
| X6 | 97.90% | PC1 | RUNOFF | 0.457426 |
| | | PC2 | EVAPO | −0.695376 |
| | | PC3 | MINI_TEMP | 0.843056 |
| X7 | 98.06% | PC1 | RUNOFF | 0.441410 |
| | | PC2 | EVAPO | −0.661033 |
| | | PC3 | MINI_TEMP | 0.801298 |
| X8 | 97.93% | PC1 | RUNOFF | 0.464323 |
| | | PC2 | EVAPO | −0.682250 |
| | | PC3 | MINI_TEMP | 0.819262 |
| X9 | 96.00% | PC1 | RUNOFF | 0.448317 |
| | | PC2 | EVAPO | −0.701701 |
| | | PC3 | MINI_TEMP | 0.856566 |

Results of the PCA–DWT–LSTM model

As previously mentioned, two LSTM models were trained, one for the denoised sequence and the other for the residual series, using the PCA results as input. Data from 19 years were used to train the proposed model. The model is trained individually for each well, and after training each model is validated using the data for 2019–2022. A trial-and-error method is applied to select the structure of the LSTM model across various hyperparameters, such as the number of hidden layers, the number of neurons in the hidden layers, and the time steps of the input variables.
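The chronological split and the role of the time-step hyperparameter can be sketched as follows. The sketch assumes a complete record of four seasonal observations per year, and the helper names and the tuple layout are hypothetical.

```python
SEASON_MONTHS = (1, 5, 8, 11)   # January, May, August, November

def chronological_split(records, last_training_year=2018):
    """Split (year, month, ...) records by year without shuffling,
    so that validation data always follow the training data in time."""
    train = [r for r in records if r[0] <= last_training_year]
    valid = [r for r in records if r[0] > last_training_year]
    return train, valid

def make_windows(values, time_steps):
    """Turn a series into (input window, next value) pairs, the usual
    supervised layout for a recurrent model with a given look-back."""
    return [(values[i:i + time_steps], values[i + time_steps])
            for i in range(len(values) - time_steps)]

records = [(year, month) for year in range(2000, 2023) for month in SEASON_MONTHS]
train, valid = chronological_split(records)
windows = make_windows([1.0, 2.0, 3.0, 4.0, 5.0], time_steps=2)
```

Under the stated assumption, 19 training years give 76 seasonal records and the four validation years give 16, which shows how little data each per-well model sees and why regularisation matters in this setting.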

Early stopping is a form of regularisation applied during iterative model training (such as gradient descent). Early stopping rules determine how many iterations can be run before the model starts overfitting. A patience parameter of 35 was used to enable early stopping and prevent overfitting. If the batch size is too small, a large amount of variance and noise occurs within each batch, as a small sample is unlikely to represent the entire dataset accurately. Conversely, if the batch size is too large, the model tends to overfit and the batch may exceed the memory available to the training process. To find satisfactory conditions for the complete set of monitoring wells, batch sizes of 1, 2, 5, and 10 were tested. The minimum average and standard deviation of the RMSE were obtained with a batch size of 10. Dropout was also employed as a regularisation strategy to prevent the model from overfitting.
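A patience rule of this kind can be expressed in a few lines: training stops once the validation loss has failed to improve for a set number of consecutive epochs, and the best epoch is remembered. The sketch below assumes the monitored quantity is a validation loss and that any decrease counts as an improvement; the function name and the loss values are illustrative.

```python
def early_stopping_epoch(val_losses, patience=35):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops at the first epoch that is `patience` epochs past the
    best epoch seen so far; if that never happens, it runs to the end.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up losses; a small patience is used so the example stays short.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

A larger patience, such as the value of 35 used in the study, tolerates longer plateaus before stopping, which suits noisy loss curves from small batches.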

As proposed by Muhammad et al. (2021), the dropout effect with a probability 0.2 gives best results. In order to prevent overfitting, a batch size of 10 and a dropout value of 0.2 were selected. Moreover, to establish the ideal number of hidden layers for modelling, five hidden layer scenarios were evaluated, ranging from 1 to 5. A large range of RMSE values for the first and fifth hidden layers were observed, but the average RMSE for the second, third, and fourth hidden layers were low with a closer range. Since the average RMSE of the three hidden layers was the lowest, the condition was chosen as the model's best design. Moreover, an excessive number of neurons cause overfitting, whereas an insufficient number of neurons hinders the network's ability to learn. Consequently, 40 neurons were selected for each of the LSTM models with three hidden layers. To assess the developed hybrid model's performance against the PCA–LSTM model, the similar hyperparameters were used to train the PCA–LSTM models for each well. The wells that showed the highest prediction accuracy were X1, X2, and X9, with more stable groundwater dynamics. The area under these wells has consistent recharge pattern, particularly during the monsoon season. Also, the lower RMSE values for these wells, i.e. 0.1011–0.1447 indicate the best performance of the hybrid model. For X5, X7, and X8 wells, the hybrid model demonstrates moderate prediction performance as the RMSE values for these wells vary from 0.1935 to 0.2794. However, the X3, X4, and X6 wells situated in hydrogeological systems with frequent fluctuations in monsoon and recharge pattern show poor performance because the RMSE values for these wells vary from 0.3315 to 2.0025. The well locations in the present study are in an urban ecosystem, which are subjected to variations in groundwater levels due to the unequal recharge rates and extraction. The results for each of the nine wells are presented in Table 3. 
For the PCA–DWT–LSTM model, R² ranged from 0.8253 to 0.8828 and RMSE from 0.1011 to 2.0025. The R² values of the PCA–LSTM model varied between 0.7019 and 0.8005, so the PCA–DWT–LSTM model achieved a higher R² at every well. Likewise, the RMSE of the PCA–DWT–LSTM model was lower at every well than that of the PCA–LSTM model, which ranged from 0.2662 to 2.9565 m. After evaluating a number of models for groundwater prediction, Sun et al. (2022) concluded that the LSTM model performed better, with RMSE values ranging from 0.60 to 1.98 for wells in different zones. By comparison, the PCA–DWT–LSTM model developed in the present study yielded RMSE values from 0.10 to 2.00, i.e. a lower minimum error and a comparable maximum, which suggests improved precision and reliability at the better-performing wells. The findings suggest that the PCA–DWT–LSTM model is better suited to forecasting seasonal groundwater level depth. The R² range of 0.8253–0.8828 for the developed hybrid model indicates satisfactory prediction performance and shows that the model captures a large share of the variance in groundwater levels. Lower RMSE values indicate more precise predictions, particularly in areas with stable groundwater patterns, and the RMSE range of 0.1011 to 2.0025 corresponds to low to moderate prediction errors. These outcomes indicate that the model is dependable for practical purposes in groundwater management. The model provides predictions that support the planning of water resources, especially in areas where groundwater quality varies.
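
For reference, the two evaluation metrics can be computed as sketched below. This assumes R² is the coefficient of determination, 1 − SS_res/SS_tot; the source does not say whether this form or the squared correlation coefficient was used. The function names and the sample depths are made up for illustration and are not study data.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up groundwater depths in metres, for illustration only.
observed = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 5.0, 5.5, 7.0]

print(round(rmse(observed, predicted), 4))
print(round(r_squared(observed, predicted), 4))
```

RMSE carries the units of the predicted variable, so it is not directly comparable between wells with very different depth ranges, whereas R² is dimensionless.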

Table 3

The performance statistics for the PCA–DWT–LSTM and PCA–LSTM models in groundwater table depth prediction for the 4 years, 2019–2022

| Well | PCA–LSTM RMSE | PCA–LSTM R² | PCA–DWT–LSTM RMSE | PCA–DWT–LSTM R² |
|------|---------------|-------------|-------------------|-----------------|
| X1 | 0.3538 | 0.7074 | 0.1011 | 0.8253 |
| X2 | 0.2956 | 0.7628 | 0.1447 | 0.8828 |
| X3 | 0.5247 | 0.7444 | 0.3315 | 0.8643 |
| X4 | 0.6124 | 0.7019 | 0.4246 | 0.8650 |
| X5 | 0.2951 | 0.7303 | 0.1935 | 0.8459 |
| X6 | 2.9565 | 0.8005 | 2.0025 | 0.8598 |
| X7 | 0.3118 | 0.7163 | 0.2794 | 0.8635 |
| X8 | 0.5992 | 0.7215 | 0.2073 | 0.8271 |
| X9 | 0.2662 | 0.7358 | 0.1312 | 0.8733 |
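
The ranges quoted in the text can be checked directly against Table 3. The sketch below copies the table values and derives the minimum and maximum of each metric, together with the relative RMSE reduction of the hybrid model at each well. The reduction is a derived quantity computed here for illustration and is not reported in the source; the names `TABLE_3` and `value_range` are assumptions.

```python
# (well, PCA-LSTM RMSE, PCA-LSTM R2, PCA-DWT-LSTM RMSE, PCA-DWT-LSTM R2),
# copied from Table 3.
TABLE_3 = [
    ("X1", 0.3538, 0.7074, 0.1011, 0.8253),
    ("X2", 0.2956, 0.7628, 0.1447, 0.8828),
    ("X3", 0.5247, 0.7444, 0.3315, 0.8643),
    ("X4", 0.6124, 0.7019, 0.4246, 0.8650),
    ("X5", 0.2951, 0.7303, 0.1935, 0.8459),
    ("X6", 2.9565, 0.8005, 2.0025, 0.8598),
    ("X7", 0.3118, 0.7163, 0.2794, 0.8635),
    ("X8", 0.5992, 0.7215, 0.2073, 0.8271),
    ("X9", 0.2662, 0.7358, 0.1312, 0.8733),
]


def value_range(values):
    """Return (minimum, maximum) of a sequence."""
    return min(values), max(values)


baseline_rmse_range = value_range([row[1] for row in TABLE_3])
baseline_r2_range = value_range([row[2] for row in TABLE_3])
hybrid_rmse_range = value_range([row[3] for row in TABLE_3])
hybrid_r2_range = value_range([row[4] for row in TABLE_3])

# Relative RMSE reduction of the hybrid model per well, in percent.
rmse_reduction = {
    well: 100.0 * (base_rmse - hyb_rmse) / base_rmse
    for well, base_rmse, _, hyb_rmse, _ in TABLE_3
}

print("PCA-LSTM     RMSE", baseline_rmse_range, "R2", baseline_r2_range)
print("PCA-DWT-LSTM RMSE", hybrid_rmse_range, "R2", hybrid_r2_range)
for well, reduction in rmse_reduction.items():
    print(well, f"{reduction:.1f}%")
```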

Figure 6 shows scatter plots of the observed versus predicted groundwater levels obtained with the PCA–DWT–LSTM models for the nine wells. The dotted line is the linear fit to the scattered points. Figure 7 shows that the PCA–DWT–LSTM predictions closely follow the observed groundwater levels. During validation, the PCA–DWT–LSTM model produces more tightly clustered points than the PCA–LSTM model, indicating its effectiveness.
Figure 6

Scatter plots of the observed and predicted groundwater level values using the PCA–DWT–LSTM model for wells X1, X2, X3, X4, X5, X6, X7, X8, and X9.

Figure 7

Comparison between the observed and predicted groundwater level values using the PCA–LSTM and PCA–DWT–LSTM models for the nine wells.

To further visualise the agreement between the observed and predicted values, Figure 7 presents the model results during the validation phase for the nine wells. The PCA–DWT–LSTM model outperforms the PCA–LSTM model in prediction accuracy, as indicated by both the statistical measures and the agreement with the observed data. The PCA–LSTM predictions lag behind the actual values, making them less reliable at notable peaks than those of the PCA–DWT–LSTM model. The hybrid model uses the DWT to handle complicated temporal patterns and filter noise, which leads to more accurate groundwater level predictions. The hybrid model is more time-intensive but performs well at eliminating noise and smoothing outliers, whereas the PCA–LSTM model is simpler and faster but more susceptible to noise and less accurate in capturing complex groundwater dynamics. The PCA–DWT–LSTM model successfully predicts seasonal groundwater levels with satisfactory accuracy. Moreover, the hybrid model can be modified to account for projected changes in temperature, evapotranspiration, and precipitation by integrating general circulation model (GCM) data. Incorporating climate projections would allow the model to simulate future groundwater levels under various climate scenarios and improve its capacity to reflect long-term climatic shifts. The developed model can also be applied to regions with diverse hydrogeological characteristics by adapting the input variables to include relevant local factors. Adjusting the wavelet transform for local fluctuations and incorporating region-specific data would enhance the reliability of the model in predicting groundwater levels. Additionally, recalibrating the LSTM component to reflect local recharge and discharge cycles would support precise predictions, making the model adaptable to various environments.
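
The noise-filtering role of the DWT can be illustrated with a one-level Haar transform. This is a sketch under stated assumptions: this section does not name the wavelet family, decomposition level, or thresholding rule used in the study, so the Haar wavelet, the single level, and the hard threshold below are choices made here for illustration. The function names and the input series are likewise hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(signal):
    """One level of the Haar DWT. Returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / SQRT2 for a, b in pairs]  # smooth trend component
    detail = [(a - b) / SQRT2 for a, b in pairs]  # high-frequency component
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuild the signal from both coefficient sets."""
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


def denoise(signal, threshold):
    """Hard-threshold the detail coefficients, then reconstruct."""
    approx, detail = haar_dwt(signal)
    kept = [d if abs(d) >= threshold else 0.0 for d in detail]
    return haar_idwt(approx, kept)


# Synthetic series (not study data) with small pairwise fluctuations.
series = [5.0, 5.2, 6.0, 5.8, 7.0, 7.1, 6.0, 6.3]
approx, detail = haar_dwt(series)
reconstructed = haar_idwt(approx, detail)
smoothed = denoise(series, threshold=1.0)
print([round(v, 3) for v in smoothed])
```

Keeping all coefficients reproduces the input exactly, whereas discarding small detail coefficients leaves the approximation (trend) component, which is the part a downstream LSTM would then be trained on in a DWT-based pipeline.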

Sensitivity analysis

The sensitivity analysis of the PCA–DWT–LSTM model was performed for the well with the highest predictive accuracy, i.e. well X2. The hybrid model uses three inputs: PC1, PC2, and PC3. The analysis shows that the predictive accuracy differs with the combination of input variables. Table 4 shows that the most sensitive variable in the developed model is PC3, which contains minimum temperature among other variables: when PC3 is excluded and only PC1 and PC2 are used, R² drops sharply to 0.3686 and RMSE rises to 1.6727. This indicates that the model's performance is highly sensitive to PC3. By contrast, removing PC1 or PC2 causes a smaller decline in performance (R² of 0.4354 and 0.6392, respectively), suggesting that the model is less sensitive to these variables.

Table 4

Sensitivity analysis of hybrid model predictive accuracy based on principal components

| Variable | R² | RMSE |
|----------|--------|--------|
| PC1, PC2 | 0.3686 | 1.6727 |
| PC2, PC3 | 0.4354 | 0.7978 |
| PC1, PC3 | 0.6392 | 0.5485 |
| PC1, PC2, PC3 | 0.8828 | 0.1447 |
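
The leave-one-out reading of Table 4 can be made explicit by computing, for each excluded component, the loss in R² and the rise in RMSE relative to the full three-input model. The sketch below uses only the values in Table 4; the ranking by R² loss and the names `TABLE_4` and `sensitivity_ranking` are illustrative choices, not part of the source.

```python
# (R2, RMSE) for each input combination at well X2, copied from Table 4.
TABLE_4 = {
    ("PC1", "PC2"): (0.3686, 1.6727),
    ("PC2", "PC3"): (0.4354, 0.7978),
    ("PC1", "PC3"): (0.6392, 0.5485),
    ("PC1", "PC2", "PC3"): (0.8828, 0.1447),
}
ALL_INPUTS = ("PC1", "PC2", "PC3")


def sensitivity_ranking(table, all_inputs=ALL_INPUTS):
    """Rank inputs by the R2 lost when each one is left out.

    Returns a list of (excluded_input, r2_drop, rmse_rise), largest drop first.
    """
    full_r2, full_rmse = table[all_inputs]
    rows = []
    for combo, (r2, rmse) in table.items():
        if combo == all_inputs:
            continue
        # Exactly one input is missing from each reduced combination.
        (excluded,) = set(all_inputs) - set(combo)
        rows.append((excluded, round(full_r2 - r2, 4), round(rmse - full_rmse, 4)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = sensitivity_ranking(TABLE_4)
for excluded, r2_drop, rmse_rise in ranking:
    print(f"without {excluded}: R2 drop {r2_drop}, RMSE rise {rmse_rise}")
```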

This study developed a hybrid PCA–DWT–LSTM model that combines PCA and DWT with an LSTM network for seasonal groundwater level prediction. The model was validated using the seasonal groundwater level dataset for nine wells in the Kangra district of Himachal Pradesh, India. The study showed that PCA and DWT reduce the dimensionality of the inputs, provide high-quality input variables with large variance, and yield the trend series for the groundwater level. Furthermore, integrating two LSTM models improves the prediction accuracy at peak points while lowering the average error. The sensitivity analysis indicates that the developed model is most sensitive to PC3, which contains minimum temperature as its main component. The proposed PCA–DWT–LSTM model outperforms the PCA–LSTM model in terms of R² and RMSE, which range from 0.8253 to 0.8828 and from 0.1011 to 2.0025, respectively, for the former. The study establishes the efficacy of the PCA–DWT–LSTM approach for predicting seasonal groundwater levels. It recommends incorporating distinct climate projections and adapting the model for real-time monitoring to enhance the prediction accuracy and versatility of the PCA–DWT–LSTM model.

Data cannot be made publicly available; readers should contact the corresponding author for details.

The authors declare there is no conflict.

Abd-Elmaboud M. E., Saqr A. M., El-Rawy M., Al-Arifi N. & Ezzeldin R. (2024) Evaluation of groundwater potential using ANN-based mountain gazelle optimization: a framework to achieve SDGs in East El Oweinat, Egypt, Journal of Hydrology: Regional Studies, 52, 101703.
Azimi S., Azhdary Moghaddam M. & Hashemi Monfared S. A. (2018) Anomaly detection and reliability analysis of groundwater by crude Monte Carlo and importance sampling approaches, Water Resources Management, 32, 4447–4467.
Banerjee P., Singh V., Chattopadhyay K., Chandra P. & Singh B. (2011) Artificial neural network model as a potential alternative for groundwater salinity forecasting, Journal of Hydrology, 398 (3), 212–220.
Bhattarai N., Pollack A., Lobell D. B., Fishman R., Singh B., Dar A. & Jain M. (2021) The impact of groundwater depletion on agricultural production in India, Environmental Research Letters, 16 (8), 085003.
Chandel A., Shankar V. & Kumar N. (2023) Neural computing techniques to estimate the hydraulic conductivity of porous media, Water Supply, 23 (6), 2586–2603.
Cohen A. & Kovacevic J. (1996) Wavelets: the mathematical background, Proceedings of the IEEE, 84, 514–522.
Daffertshofer A., Lamoth C. J., Meijer O. G. & Beek P. J. (2004) PCA in studying coordination and variability: a tutorial, Clinical Biomechanics, 19 (4), 415–428.
Derbela M. & Nouiri I. (2020) Intelligent approach to predict future groundwater level based on artificial neural networks (ANN), Euro-Mediterranean Journal for Environmental Integration, 5, 1–11.
Funk C., Peterson P., Landsfeld M., Pedreros D., Verdin J., Shukla S., Husak G., Rowland J., Harrison L., Hoell A. & Michaelsen J. (2015) The climate hazards infrared precipitation with stations - a new environmental record for monitoring extremes, Scientific Data, 2 (1), 1–21.
Graves A., Mohamed A. R. & Hinton G. (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, pp. 6645–6649.
Hadidi A., Holzbecher E. & Zirulia A. (2019) 'Trends in groundwater observation data and implications', 13th Gulf Water Conference–Water in the GCC: Challenges and Innovative Solutions, pp. 12–14.
Hochreiter S. (1997) Long Short-Term Memory, Neural Computation, MIT Press, Cambridge, MA, USA.
Hussein E. A., Thron C., Ghaziasgar M., Bagula A. & Vaccari M. (2020) Groundwater prediction using machine-learning tools, Algorithms, 13 (11), 300.
Konikow L. F. & Kendy E. (2005) Groundwater depletion: a global problem, Hydrogeology Journal, 13, 317–320.
Kumar C. P. (2019) An overview of commonly used groundwater modelling software, International Journal of Advanced Research in Science, Engineering and Technology, 6, 7854–7865.
Lallahem S., Mania J., Hani A. & Najjar Y. (2005) On the use of neural networks to evaluate groundwater levels in fractured media, Journal of Hydrology, 307 (1–4), 92–111.
Lechner M. & Hasani R. (2020) Learning long-term dependencies in irregularly-sampled time series. arXiv preprint arXiv:2006.04418.
Maxhuni A., Hernandez-Leal P., Sucar L. E., Osmani V., Morales E. F. & Mayora O. (2016) Stress modelling and prediction in presence of scarce data, Journal of Biomedical Informatics, 63, 344–356.
Moudgil P. S., Rao G. S. & Heki K. (2024) Bridging the temporal gaps in GRACE/GRACE–FO terrestrial water storage anomalies over the major Indian river basins using deep learning, Natural Resources Research, 1–20.
Muhammad P. F., Kusumaningrum R. & Wibowo A. (2021) Sentiment analysis using word2vec and long short-term memory (LSTM) for Indonesian hotel reviews, Procedia Computer Science, 179, 728–735.
Nguyen D., Ouala S., Drumetz L. & Fablet R. (2019) EM-like learning chaotic dynamics from noisy and partial observations. arXiv preprint arXiv:1903.10335.
Nourani V., Alami M. T. & Vousoughi F. D. (2015) Wavelet-entropy data pre-processing approach for ANN-based groundwater level modeling, Journal of Hydrology, 524, 255–269.
Peterson T. J., Western A. W. & Cheng X. (2018) The good, the bad and the outliers: automated detection of errors and outliers from groundwater hydrographs, Hydrogeology Journal, 26 (2), 371–380.
Samani S., Vadiati M., Nejatijahromi Z., Etebari B. & Kisi O. (2023) Groundwater level response identification by hybrid wavelet–machine learning conjunction models using meteorological data, Environmental Science and Pollution Research, 30 (9), 22863–22884.
Saqr A. M., Nasr M., Fujii M., Yoshimura C. & Ibrahim M. G. (2022) 'Optimal solution for increasing groundwater pumping by integrating MODFLOW-USG and particle swarm optimization algorithm: a case study of Wadi El-Natrun, Egypt', International Conference on Environment Science and Engineering, pp. 59–73.
Singh V., Agrawal H. M., Joshi G. C., Sudershan M. & Sinha A. K. (2011) Elemental profile of agricultural soil by the EDXRF technique and use of the Principal Component Analysis (PCA) method to interpret the complex data, Applied Radiation and Isotopes, 69 (7), 969–974.
Singh A., Patel S., Bhadani V., Kumar V. & Gaurav K. (2024) AutoML-GWL: automated machine learning model for the prediction of groundwater level, Engineering Applications of Artificial Intelligence, 127, 107405.
Sundar M. L., Ragunath S., Hemalatha J., Vivek S., Mohanraj M., Sampathkumar V., Mohammed Siraj Ansari A., Parthiban V. & Manoj S. (2022) Simulation of groundwater quality for Noyyal River Basin of Coimbatore City, Tamilnadu, using MODFLOW, Chemosphere, 306, 135649.
Suryanarayana C., Sudheer C., Mahammood V. & Panigrahi B. K. (2014) An integrated wavelet-support vector machine for groundwater level prediction in Visakhapatnam, India, Neurocomputing, 145, 324–335.
Tao H., Hameed M. M., Marhoon H. A., Zounemat-Kermani M., Heddam S., Kim S., Sulaiman S. O., Tan M. L., Sa'adi Z., Danandeh Mehr A., Allawi M. F., Abba S. I., Zain J. M., Falah M. W., Jamei M., Bokde N. D., Bayatvarkeshi M., Al-Mukhtar M., Bhagat S. K., Tiyasha T. & Yaseen Z. M. (2022) Groundwater level prediction using machine learning models: a comprehensive review, Neurocomputing, 489, 271–308.
Tsanis I. K., Coulibaly P. & Daliakopoulos I. N. (2008) Improving groundwater level forecasting with a feedforward neural network and linearly regressed projected precipitation, Journal of Hydroinformatics, 10 (4), 317–330.
Usman U. N., Toriman M. E., Juahir H., Abdullahi M. G., Rabiu A. A. & Isiyaka H. (2014) Assessment of groundwater quality using multivariate statistical techniques in Terengganu, Science and Technology, 4 (3), 42–49.
Wendt D. E., Van Loon A. F., Bloomfield J. P. & Hannah D. M. (2020) Asymmetric impact of groundwater use on groundwater droughts, Hydrology and Earth System Sciences, 24 (10), 4853–4868.
Wiese B. & Omlin C. (2009) Credit card transactions, fraud detection, and machine learning: modelling time with LSTM recurrent neural networks. In: (Monica, B., Marco, M., Franco, S. & Lakhmi, C. J., eds.) Innovations in Neural Information Paradigms and Applications, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 231–268.
Wu C., Zhang X., Wang W., Lu C., Zhang Y., Qin W., Tick G. R., Liu B. & Shu L. (2021) Groundwater level modeling framework by combining the wavelet transform with a long short-term memory data-driven model, Science of The Total Environment, 783, 146948.
Zhu C., Luan Q. & Hao Z. (2012) Groundwater level simulation with combined grey neural networks and MODFLOW models, Advanced Science Letters, 7 (1), 245–248.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY-NC 4.0), which permits copying, adaptation and redistribution for non-commercial purposes, provided the original work is properly cited (http://creativecommons.org/licenses/by-nc/4.0/).