ABSTRACT
The statistical downscaling of global circulation models (GCMs) presents a significant challenge in selecting appropriate input variables from a vast pool of predictors. To address this issue, we developed an ensemble approach based on Combining Multiple Clusters via Similarity Graph (COMUSA), which integrates the k-means and self-organizing map (SOM) methods with a mutual information (MI)-random sampling approach. This feature extraction technique improved the classification efficacy of large-scale climatic variables by 21%. When comparing feature extraction methods, the combination of MI-random sampling and ensemble clustering yielded more accurate results than SOM clustering alone. The most efficient artificial neural network (ANN)-based downscaling model was employed to project near- and mid-future precipitation and temperature (2025–2035 and 2035–2045), revealing varied outcomes under the SSP3-7.0 and SSP5-8.5 scenarios. Annual mean precipitation is projected to decrease by 2–3% under SSP3-7.0 and by 4–5% under SSP5-8.5, while projected annual mean temperature indicates increases of 21–27% and 29–35% under the SSP3-7.0 and SSP5-8.5 scenarios, respectively. Integrating COMUSA ensemble clustering with MI-random sampling enhances the estimation accuracy of the ANN downscaling model, contributing to accurate projections of future precipitation and temperature.
HIGHLIGHTS
Integrating k-means and self-organizing map (SOM) as feature extraction methods to enhance the precision of the artificial neural networks downscaling model.
Combining Multiple Clusters via Similarity Graph (COMUSA) ensemble clustering approach to integrate k-means and SOM feature selection approaches.
The MI-random sampling method for selecting the dominant predictor from each cluster as that cluster's representative.
Comparison of COMUSA ensemble clustering and SOM approaches.
NOMENCLATURE
- AI
artificial intelligence
- ANN
artificial neural network
- ARIL
average relative interval length
- BP
back propagation
- CC
correlation coefficient
- COMUSA
combining multiple clusters via similarity graph
- DL
distance of lower bands
- DU
distance of upper bands
- FFNN
feed-forward neural network
- GCMs
global circulation models
- GT
gamma test
- IPCC
Intergovernmental Panel on Climate Change
- KGE
Kling–Gupta efficiency
- LM
Levenberg–Marquardt
- MI
mutual information
- NSE
Nash–Sutcliffe efficiency
- PCA
principal component analysis
- RCMs
regional climate models
- RMSE
root mean square error
- SC
Silhouette coefficient
- SOM
self-organizing maps
- SSP
shared socioeconomic pathway
- TS
tangent sigmoid
INTRODUCTION
The emission of greenhouse gases is a significant factor contributing to climate change. According to the Intergovernmental Panel on Climate Change (IPCC), human activities are the primary drivers behind the increasing levels of greenhouse gases, which in turn lead to global warming and significant disruptions in the water cycle (IPCC 2018). Consequently, accurate forecasting of regional variations in precipitation and temperature is essential for developing effective strategies to adapt to and mitigate the adverse effects of climate change (Mirdashtvan et al. 2019; Zhang et al. 2022). Global circulation models (GCMs) are widely regarded as the most reliable frameworks for projecting future climate change scenarios (Rahimi et al. 2021). However, their coarse spatial resolution often fails to capture the fine-scale processes necessary for accurate regional climate predictions (Wilby et al. 2002). Therefore, GCM outputs cannot be directly used for simulating and projecting the impacts of climate change on land surface variables.
To address this limitation, downscaling methods have been developed to enhance the resolution of projected climatic variables, including precipitation, air temperature, and humidity (Mora et al. 2014; Elkiran et al. 2021; Mirdashtvan et al. 2021; Chen et al. 2023). Downscaling techniques are generally classified into two broad categories: dynamical and statistical methods (Mirdashtvan et al. 2018). High-resolution regional climate models use boundary conditions from GCMs to derive finer climate variables at the local scale. In contrast, statistical downscaling techniques establish empirical relationships between local climate variables (predictands) and large-scale GCM outputs (predictors) (Mirdashtvan & Malekian 2020).
Statistical downscaling methods are increasingly favored for their simplicity and lower computational costs (Tavakol-Davani et al. 2013). Among various statistical techniques, the use of artificial intelligence (AI) methods, such as artificial neural networks (ANNs), is increasingly successful. This is attributed to their effectiveness in capturing the nonlinear dynamics of hydro-climatic variables across different spatiotemporal scales (Nourani et al. 2018; Haji Hosseini et al. 2020; Wang et al. 2020; Rabezanahary Tanteliniaina et al. 2021; Gumus et al. 2023).
The use of ANN-based downscaling models has demonstrated various advantages and disadvantages, highlighting both their effectiveness and limitations in processing GCM outputs (Snell et al. 2000; Hosseini Baghanam et al. 2019). The differing results across studies utilizing ANN-based downscaling can often be attributed to the quality and quantity of GCM data used as input. The inclusion of irrelevant or insignificant data can create significant challenges during the training of AI-based models (Wang et al. 2024). Thus, selecting the most relevant features as potential input variables is crucial for enhancing the efficiency of ANN-based downscaling models (Ahmadi et al. 2015; Asghari & Nasseri 2015; Ang et al. 2023; Ghimire et al. 2023).
Given the importance of robust feature selection methods in data mining-based downscaling approaches, several commonly used techniques have proven effective in identifying the most relevant input variables for statistical downscaling. These techniques include the correlation coefficient (CC) (Mehta et al. 2023; Nourani et al. 2023), principal component analysis (Haji Hosseini et al. 2020), mutual information (MI) (Nasseri et al. 2013), decision trees (Nourani et al. 2018), and the gamma test (Ahmadi et al. 2015). In recent decades, several studies have utilized various clustering-based feature selection methods coupled with different pre-processing techniques (e.g., MI and CC) to identify the dominant inputs of AI-based models in hydro-environmental contexts (Bowden et al. 2002; Chang et al. 2016; Feng et al. 2023) and the statistical downscaling of GCMs (Sehgal et al. 2018; Hosseini Baghanam et al. 2019).
The process of selecting clustering methods is influenced by their underlying assumptions, and there is no consensus on which technique performs best. To address the issue, researchers have proposed a range of solutions, including the use of ensemble clustering approaches. One such method, Combining Multiple Clusters via Similarity Graph (COMUSA), developed by Mimaroglu & Erdil (2011), has been widely used in various studies, coupled with filter-based feature selection via MI to identify dominant inputs for AI-based hydrological models (Nourani et al. 2022; Sharghi et al. 2022).
Given the dynamic nature of hydro-climatic data, the hybrid MI-random sampling technique (MI-random sampling) offers a preferred alternative to MI for representative selection within each cluster. This technique increases the number of dominant data points and fills the gap between dominant and secondary data, potentially better representing the entire time series during clustering.
To the authors' best knowledge, prior research has primarily focused on the effectiveness of ensemble clustering methods as feature extraction approaches for AI-based hydrological models. However, their application as feature selection methods for identifying and grouping similar parameters in the statistical downscaling of GCMs is relatively limited. Furthermore, the use of the MI-random sampling technique to represent the entire time series during clustering is underexplored, particularly when coupled with clustering-based feature extraction approaches. The innovation of the current study lies in addressing the following research objectives:
Integrating k-means and self-organizing map techniques with the COMUSA ensemble clustering approach as feature extraction methods for an AI-based downscaling model.
Coupling MI-random sampling with COMUSA ensemble clustering in the statistical downscaling of predictands.
MATERIALS AND METHODS
Case study and data explanation
In this study, the monthly observed precipitation and temperature data of the Ardabil synoptic station were collected from the Iran Meteorological Organization for the period 1981–2014. Different GCMs from the IPCC's 6th Assessment Report (CMIP6) were considered (see Table 1). Among the 25 GCMs considered, three (ACCESS-CM2, FGOALS-g3, and CanESM5-CanOE) were selected based on the CC metric. The CC was calculated by assessing the linear relationships between the precipitation and temperature data of each GCM and the observed historical data, all at a monthly temporal resolution. This analysis evaluated the performance of each GCM in simulating monthly precipitation and temperature patterns. The GCMs with the highest CCs, indicating the best agreement with observed historical data, were selected for the modeling procedure. The monthly historical GCM dataset (1981–2014) and the projections under different shared socioeconomic pathway (SSP) scenarios (SSP1-2.6, SSP3-7.0, and SSP5-8.5) were obtained from the Copernicus Climate Change Service (https://cds.climate.copernicus.eu/).
No. | Predictor | Description
---|---|---
1 | pr | Precipitation |
2 | ta^a | Air temperature |
3 | hur^a | Relative humidity |
4 | hus^a | Specific humidity |
5 | ua^a | Eastward wind |
6 | zg^a | Geopotential height |
7 | va^a | Northward wind |
8 | uas | Eastward near-surface wind |
9 | evspsbl | Evaporation including sublimation and transpiration |
10 | tas | Near-surface air temperature |
11 | huss | Near-surface specific humidity |
12 | vas | Northward near-surface wind |
13 | psl | Sea level pressure |
14 | tauv | Surface downward northward wind stress |
15 | rsds | Surface downwelling shortwave radiation |
16 | ts | Surface temperature |
17 | hfls | Surface upward latent heat flux |
18 | rlus | Surface upwelling longwave radiation |
19 | rsdt | TOA incident shortwave radiation |
20 | rsut | TOA outgoing shortwave radiation |
21 | tasmax | Daily maximum near-surface air temperature |
22 | tasmin | Daily minimum near-surface air temperature |
23 | hurs | Near-surface relative humidity |
24 | sfcWind | Near-surface wind speed |
26 | ps | Surface air pressure |
27 | tauu | Surface downward eastward wind stress |
28 | rlds | Surface downwelling longwave radiation |
29 | snw | Surface snow amount |
30 | hfss | Surface upward sensible heat flux |
31 | rsus | Surface upwelling shortwave radiation |
32 | rlut | TOA outgoing longwave radiation |
33 | clt | Total cloud cover percentage |
^a Predictors at the 100, 500, 1,000, 2,000, 3,000, 5,000, 7,000, 10,000, 15,000, 20,000, 25,000, 30,000, 40,000, 50,000, 60,000, 70,000, and 85,000 Pa pressure levels.
Some previous studies have highlighted the advantages of using data from multiple grid points surrounding the study area (Tavakol-Davani et al. 2013; Beecham et al. 2014). Consequently, predictors were derived from four grid points (see Figure 1). The selected predictors used in the statistical downscaling process for both temperature and precipitation are listed in Table 2.
Model | Precipitation CC | Temperature CC | Model | Precipitation CC | Temperature CC
---|---|---|---|---|---
CanESM5-CanOE | 0.69 | 0.95 | MIROC-ES2L | 0.43 | 0.90 |
INM-CM4-8 | 0.06 | 0.71 | E3SM-1-1-ECA | 0.42 | 0.90 |
FGOALS-f3-L | 0.30 | 0.82 | BCC-ESM1 | 0.27 | 0.81 |
FGOALS-g3 | 0.51 | 0.94 | CESM2-WACCM | 0.40 | 0.89 |
AWI-CM-1-1-MR | 0.29 | 0.81 | GFDL-ESM4 | 0.30 | 0.88 |
AWI-ESM-1-1-LR | 0.34 | 0.90 | MPI-ESM1-2-HR | 0.25 | 0.84 |
HadGEM3-GC31-LL | 0.30 | 0.87 | MPI-ESM1-2-LR | 0.41 | 0.91 |
ACCESS-CM2 | 0.54 | 0.92 | KACE-1-0-G | 0.30 | 0.85 |
ACCESS-ESM1–5 | 0.29 | 0.85 | CMCC-CM2-HR4 | 0.36 | 0.82 |
EC-Earth3-CC | 0.36 | 0.90 | CMCC-CM2-SR5 | 0.29 | 0.81 |
EC-Earth3-Veg-LR | 0.38 | 0.88 | CMCC-ESM2 | 0.40 | 0.90 |
EC-Earth3-AerChem | 0.40 | 0.89 | CESM2-WACCM-FV2 | 0.44 | 0.92 |
IPSL-CM6A-LR | 0.31 | 0.87 |
k-means and SOM clustering
In this study, both k-means and SOM clustering techniques were employed for their respective advantages in handling different aspects of the data. k-means was chosen for its straightforward application and rapid convergence, with its simple linear structure making it easy to implement and understand. In contrast, SOM was utilized for its ability to transform complex, nonlinear statistical associations among high-dimensional attributes into simple geometric relationships on a low-dimensional map, while preserving the dataset's structure. By leveraging k-means for its linear capabilities with SOM for its nonlinear approach, we employed two complementary methods to effectively identify patterns in the data.
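As a rough illustration of how the two base clusterers behave, the sketch below runs a hand-rolled k-means (Lloyd's algorithm) and a minimal 1-D SOM side by side on synthetic data; the data, unit counts, and learning schedule are illustrative assumptions, not the study's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=50):
    # Lloyd's algorithm: alternate nearest-centroid assignment and centroid update.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def som_1d(X, n_units, epochs=30, lr=0.5):
    # Minimal 1-D SOM: the best-matching unit (BMU) and its map neighbors move
    # toward each sample; the neighborhood radius shrinks over the epochs.
    W = X[np.linspace(0, len(X) - 1, n_units).astype(int)].astype(float)
    for t in range(epochs):
        sigma = (n_units / 2) * (1 - t / epochs) + 0.05
        for x in X[rng.permutation(len(X))]:
            bmu = np.argmin(((W - x) ** 2).sum(-1))
            h = np.exp(-((np.arange(n_units) - bmu) ** 2) / (2 * sigma ** 2))
            W += lr * h[:, None] * (x - W)
    return np.argmin(((X[:, None] - W[None]) ** 2).sum(-1), axis=1)

# Two well-separated synthetic "predictor" groups.
X = np.vstack([rng.normal(0, 0.1, (10, 3)), rng.normal(5, 0.1, (10, 3))])
km_labels = kmeans(X, 2)
som_labels = som_1d(X, 2)
```

On such clearly separated data both methods recover the same partition; the SOM's neighborhood update is what lets it additionally preserve topology in higher-dimensional, nonlinear settings.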
Ensemble of k-means and SOM clustering
The ensemble clustering technique aims to integrate multiple base clustering algorithms to create a robust and accurate clustering approach that produces reliable results. Several ensemble clustering methods have been proposed, including factor graphs (Huang et al. 2016), weighted co-association matrices (Berikov & Pestunov 2017), and density-based similarity matrices (Beauchemin 2015). However, there is currently no consensus on the best technique among these approaches. In this study, the similarity graph (SG) approach was employed to integrate k-means and SOM clustering methods for the input dataset (D). This approach is regarded as one of the most accurate and straightforward methods for ensemble clustering. It combines the outputs of k-means and SOM, resulting in a unified clustering solution. The SG approach, as described by Mimaroglu & Erdil (2011), is outlined as follows:
1. In the first step, each individual clustering result is defined as π = {C1, C2, …, Ck}, where Ci is a cluster of π; the best cluster set of each individual clustering approach (here, k-means and SOM) enters the ensemble.
2. A similarity (co-association) matrix SM is then computed over all pairs of objects as SMi,j = votesi,j/m, where votesi,j denotes the number of times objects i and j are allocated to the same cluster and m is the number of base clusterings.
3. To represent the SM, the SG is constructed as an undirected, weighted graph SG = (D, E), in which each edge (di, dj) carries a weight equal to SMij. For every object, an attachment index a(di) = sw(di)/df(di) is computed, where df(di) denotes the degree of di (the number of edges connected to it) and sw(di) is the sum of the weights of those edges. The object with the maximum attachment index is chosen as the pivot (initial member).
4. Starting from the pivot, all of its neighbors are considered for expansion of the cluster. A neighbor is incorporated into the pivot's cluster if it exhibits the highest similarity to that pivot. Once added, the neighbor becomes a new pivot and evaluates its own neighbors for further expansion. Cluster expansion in COMUSA stops when the current pivots can no longer incorporate additional objects. If unassigned objects remain in the input dataset, COMUSA creates a new cluster by selecting a new pivot, and the process continues until every object is assigned to a cluster, at which point COMUSA terminates. For more details on COMUSA, please refer to Mimaroglu & Erdil (2011).
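A simplified sketch of the steps above (co-association matrix, pivot selection, neighbor expansion) might look as follows; the fixed `threshold` and the use of total edge weight in place of the attachment index sw/df are simplifying assumptions:

```python
import numpy as np

def comusa(labelings, threshold=0.5):
    """Simplified COMUSA-style ensemble clustering sketch.

    labelings: list of label arrays from base clusterers (e.g. k-means, SOM).
    Builds the co-association ("votes") similarity matrix, then grows clusters
    from pivots (here: the unassigned object with the largest total edge weight).
    """
    L = np.asarray(labelings)
    n = L.shape[1]
    # SM[i, j]: fraction of base clusterings that put objects i and j together.
    SM = np.mean(L[:, :, None] == L[:, None, :], axis=0)
    np.fill_diagonal(SM, 0.0)
    labels = -np.ones(n, dtype=int)
    cluster = 0
    while (labels < 0).any():
        unassigned = np.where(labels < 0)[0]
        pivot = unassigned[np.argmax(SM[unassigned].sum(axis=1))]
        labels[pivot] = cluster
        frontier = [pivot]
        while frontier:
            p = frontier.pop()
            for q in np.where((SM[p] >= threshold) & (labels < 0))[0]:
                labels[q] = cluster   # q joins the cluster, then acts as a pivot
                frontier.append(q)
        cluster += 1
    return labels

# Two base clusterings that agree on objects 0-2 vs 3-5 (label names differ).
a = np.array([0, 0, 0, 1, 1, 1])
b = np.array([1, 1, 1, 0, 0, 0])
ensemble_labels = comusa([a, b])
```

Note how the co-association matrix makes the ensemble invariant to the arbitrary label names of the base clusterings: only co-membership counts.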
Artificial neural networks
In the present study, the statistical downscaling of GCM data was performed using a three-layer feed-forward neural network (FFNN). Previous research has shown that FFNNs equipped with the back propagation (BP) algorithm are commonly employed to establish regression-based relationships between hydro-climatologic predictors and predictands (Maier & Dandy 2000). To achieve the highest efficiency of the three-layer FFNN-BP model, the Levenberg–Marquardt scheme was utilized for training the ANNs due to its faster convergence rate (Haykin 1994).
In this study, the tangent sigmoid activation function was selected as the nonlinear kernel for the ANNs. The training process of the network was terminated when the error rate on the test data increased, indicating the completion of the training phase. It is important to note that a crucial aspect of ANN modeling is the design of appropriate architectures, including determining the number of hidden neurons and the number of iterations (epochs). The optimal network structures were obtained through a trial-and-error process. For a more comprehensive explanation of the mathematical principles underlying ANNs, readers are advised to see the work by Haykin (1994).
In this study, the ANN model was utilized for statistical downscaling due to its widespread popularity, ease of application, and proven accuracy.
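A minimal sketch of a three-layer FFNN trained by backpropagation is shown below; for brevity, plain batch gradient descent stands in for the Levenberg–Marquardt scheme, and the synthetic data, layer size, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def tansig(x):
    # Tangent sigmoid (tanh) activation, the nonlinear kernel used here.
    return np.tanh(x)

# Toy standardized predictors and a synthetic nonlinear predictand.
X = rng.uniform(-1, 1, (200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

n_hidden = 5
W1 = rng.normal(0, 0.5, (3, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.5, n_hidden);      b2 = 0.0

lr = 0.05
for epoch in range(500):
    H = tansig(X @ W1 + b1)           # hidden layer
    pred = H @ W2 + b2                # linear output layer
    err = pred - y
    # Backpropagate the squared-error gradient through both layers.
    gW2 = H.T @ err / len(X); gb2 = err.mean()
    dH = (err[:, None] * W2) * (1 - H ** 2)
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

rmse = float(np.sqrt(np.mean((tansig(X @ W1 + b1) @ W2 + b2 - y) ** 2)))
```

In practice the hidden-layer size and epoch count would be tuned by trial and error, and training stopped when the test error starts to rise, as described above.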
Mutual information
MI is a statistical dependency metric derived from Shannon's entropy. As a commonly used feature extraction method, MI detects nonlinear relationships between predictors and predictands while reducing computational costs. Shannon information content is mathematically formulated using data probability distributions, applicable in both discrete and continuous forms depending on the data and problem context (Macedo et al. 2022). In this study, the MI approach is employed to compute nonlinear relationships between the predictors of each cluster and the predictands (i.e., precipitation and temperature). For a deeper understanding of the fundamental mathematics of MI, readers are referred to Shannon (1948).
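A simple histogram-based MI estimate illustrates why MI is preferred over linear correlation here: it detects a purely nonlinear (quadratic) dependence that the CC misses. The bin count and synthetic data are illustrative assumptions:

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of mutual information (in nats)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(2)
x = rng.normal(size=5000)
dependent = x ** 2 + 0.1 * rng.normal(size=5000)   # nonlinear link, near-zero CC
independent = rng.normal(size=5000)

mi_dep = mutual_information(x, dependent)
mi_ind = mutual_information(x, independent)
```

Here `mi_dep` is large while the Pearson correlation between `x` and `dependent` is close to zero, which is exactly the kind of relationship a CC-based screen would discard.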
Evaluation metrics
Cluster evaluation metrics
Model evaluation metrics
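For reference, the model evaluation metrics listed in the nomenclature (NSE, RMSE, CC, and KGE) can be sketched as below; KGE is written in its standard form (correlation, variability, and bias terms), an assumption since the paper's exact formulation is not reproduced here:

```python
import numpy as np

def nse(obs, sim):
    # Nash-Sutcliffe efficiency: 1 is perfect; 0 matches the mean of obs.
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rmse(obs, sim):
    return float(np.sqrt(np.mean((np.asarray(obs) - np.asarray(sim)) ** 2)))

def cc(obs, sim):
    return float(np.corrcoef(obs, sim)[0, 1])

def kge(obs, sim):
    # Kling-Gupta efficiency: combines correlation (r), variability ratio
    # (alpha) and bias ratio (beta); 1 is a perfect score.
    r = cc(obs, sim)
    alpha = np.std(sim) / np.std(obs)
    beta = np.mean(sim) / np.mean(obs)
    return 1 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
```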
Uncertainty evaluation metrics
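Two of the uncertainty metrics, Pcl (the percentage of observations covered by the confidence band) and ARIL (the average relative interval length), can be sketched in their commonly used forms; this is an assumption about the exact formulas, and DU/DL are omitted here since their definitions follow the cited literature:

```python
import numpy as np

def pcl(obs, lower, upper):
    # Percentage of observations falling inside the uncertainty band.
    obs, lower, upper = map(np.asarray, (obs, lower, upper))
    return 100.0 * np.mean((obs >= lower) & (obs <= upper))

def aril(obs, lower, upper):
    # Average relative interval length: band width relative to each observation.
    obs, lower, upper = map(np.asarray, (obs, lower, upper))
    return float(np.mean((upper - lower) / obs))

obs = np.array([10.0, 12.0, 8.0, 11.0])
lower, upper = obs - 2.0, obs + 2.0   # a hypothetical symmetric band
```

A narrower band (smaller ARIL) at the same coverage (Pcl) indicates a more confident, better-constrained model.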
MODELING PROCEDURE
First step: Feature selection and screening of dominant inputs
Considering that various GCMs have distinct resolutions and utilize different modeling specifications, a multi-GCM ensemble (i.e., ACCESS-CM2, FGOALS-g3, and CanESM5-CanOE) was utilized. Appropriate GCMs were selected based on the highest CC values between the GCM predictors and observed precipitation and temperature datasets. This ensemble approach aims to reduce uncertainties and encompass both the advantages and limitations of multiple GCMs.
Given that prevailing climatic conditions in each region significantly influence the climate of adjacent areas, the predictors were evaluated at the four grid points surrounding the study area. The four grid points closest to the Ardabil synoptic station (i = 1–4) for FGOALS-g3, CanESM5-CanOE, and ACCESS-CM2 are illustrated in Figure 1.
In general, predictors do not uniformly influence the predictands: while some exhibit a strong correlation, others show little relevance. Moreover, using a large set of predictors can diminish the accuracy of the ANN downscaling model. Feature selection in this context involves identifying, via a clustering approach coupled with pre-processing techniques, the subset of variables (features) that contributes most effectively to predicting a target outcome such as precipitation or temperature. It is therefore essential to group similar predictors and select the dominant ones from each cluster. To achieve this, the COMUSA ensemble clustering algorithm, which leverages the strengths of both k-means and SOM, was employed to identify optimal cluster structures. The similarity metrics (linear, nonlinear, and multi-linear) between a predictor and the predictand within a cluster may not always yield the highest correlation but can still exceed the maximum similarity values found in other clusters. Consequently, a hybrid MI and random sampling technique was used to select the dominant predictors: MI was applied alongside random sampling to choose a representative from each cluster.
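One plausible reading of the MI-random sampling step is sketched below: for each cluster, MI scores are recomputed over repeated random subsamples of the time series, and the predictor that most often ranks first becomes the cluster representative. The subsample fraction, number of rounds, and histogram MI estimator are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def mi(x, y, bins=8):
    # Compact histogram estimate of mutual information (nats).
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def cluster_representative(predictors, predictand, n_rounds=50, frac=0.8):
    """Pick the column of one cluster's predictor matrix that most often has
    the highest MI with the predictand over random subsamples of the series."""
    n = len(predictand)
    wins = np.zeros(predictors.shape[1], dtype=int)
    for _ in range(n_rounds):
        idx = rng.choice(n, int(frac * n), replace=False)
        scores = [mi(predictors[idx, j], predictand[idx])
                  for j in range(predictors.shape[1])]
        wins[np.argmax(scores)] += 1
    return int(np.argmax(wins))

# Hypothetical cluster: column 0 is informative, columns 1-2 are noise.
t = rng.normal(size=500)
y = np.sin(t)
cluster = np.column_stack([y + 0.05 * rng.normal(size=500),
                           rng.normal(size=500),
                           rng.normal(size=500)])
```

Scoring over many subsamples, rather than once on the full series, is what introduces the robustness to sampling variability that the MI-random sampling approach is credited with above.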
Second step: AI-based statistical downscaling model
The second step involves developing the ANN-based downscaling models. These models are trained using the dominant predictors identified in the first step. It is worth noting that standardizing the GCM outputs is highly recommended (Wilby & Dawson 2004).
Third step: Future precipitation and temperature projection
In the final step, the calibrated downscaling model was employed to project future precipitation and temperature for the Ardabil synoptic station. Projections were conducted under two SSP scenarios: SSP3–7.0 and SSP5–8.5, for the periods 2025–2035 and 2035–2045. These scenarios are considered to yield more realistic and appropriate outcomes; thus, they are recommended for inclusion in CMIP6 to assess the impacts of climate change (Hausfather & Peters 2020; Nourani et al. 2023).
RESULTS AND DISCUSSION
The purpose of this study was to evaluate the effectiveness of coupled ensemble clustering methods with MI-random sampling as a robust pre-processing technique in ANN-based downscaling to project precipitation and temperature at the target station. Since the proposed methodology consists of three phases, the results are accordingly presented in three sections as follows.
Results of feature selection
Considering the correlation between the historical precipitation and temperature of 25 GCMs and the predictands over the period 1981–2014, suitable GCMs were selected. Consequently, FGOALS-g3, ACCESS-CM2, and CanESM5-CanOE were employed in the modeling procedure. The grid points with the highest CCs with the predictands are listed in Table 2. After identifying the main GCMs, the proposed feature extraction approach was applied to select the most dominant predictors from the high-dimensional input matrix across multiple GCMs. Each GCM contains 119 predictors across 17 pressure levels, resulting in a total of 119 × 3 × 4 = 1,428 predictors to evaluate in the downscaling process.
In the first step, all predictors from the three GCMs were clustered using two distinct methods: k-means and SOM. Evaluating cluster numbers from 2 to 10, the optimal number of clusters was determined based on the SC metric (Table 3). Mean SC values greater than 0.5 indicate well-structured clusters, while values below 0.5 suggest poorly structured clusters. According to Table 3, k-means clustering with seven clusters was selected as its best structure. The mean SC value for the SOM clustering technique reached a maximum of 0.571 with six clusters. A comparison of SC values between the two methods shows that SOM outperformed k-means across all cluster numbers, a superiority that can be attributed to the nonlinear nature of SOM and its ability to manage complex interactions within datasets. After clustering the input variables with the individual methods, the COMUSA ensemble clustering technique was employed to integrate the best outcomes of k-means and SOM. The results of the evaluation metrics for the ensemble clustering approach are presented in Table 4. A comparison of the SC values between ensemble clustering (Table 4) and the individual k-means and SOM methods (Table 3) demonstrates that ensemble clustering can improve the efficacy of these methods by up to 21%, effectively recognizing patterns and features of climatic variables.
Mean SC by number of clusters (2–10):

Method | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
---|---|---|---|---|---|---|---|---|---
k-means | 0.51 | 0.41 | 0.41 | 0.34 | 0.40 | 0.51 | 0.38 | 0.30 | 0.29
SOM | 0.53 | 0.53 | 0.52 | 0.47 | 0.57 | 0.52 | 0.44 | 0.41 | 0.40
Ensemble clustering (SC = 0.634):

Cluster | 1 | 2 | 3 | 4 | 5 | 6 | 7
---|---|---|---|---|---|---|---
Number of predictors | 15 | 229 | 67 | 302 | 454 | 329 | 33
As the final phase of pre-processing, dominant predictors from each cluster were selected using a hybrid approach that combines MI and random sampling. These selected predictors are considered the best representatives of the clusters. Given the superior performance of ensemble clustering compared to individual clustering methods, the MI-random sampling feature selection method was applied 50 times to determine the dominant predictors for ensemble clustering. Additionally, to compare the results of ensemble clustering with those of distinct methods in the downscaling process, the dominant predictors from SOM (due to their superior SC values) were also selected using the MI-random sampling feature extraction method. The multi-GCM dominant predictors selected by the MI-random sampling technique from the clusters of both ensemble clustering and SOM methods are presented in the Supplementary Materials, Tables S1 and S2, respectively.
Results of AI-based statistical downscaling model
In the second step, an ANN-based downscaling model was employed as the statistical downscaling method. The main predictors selected in the previous step through MI-random sampling from the two clustering approaches (ensemble clustering and SOM) were standardized over the baseline period from 1981 to 2014. To calibrate and validate the developed model, the predictors and predictands dataset was divided into calibration (75% from 1981 to 2006) and validation (25% from 2006 to 2014) sets. This data division has been widely used in AI-based hydro-climatological modeling studies (Chau 2007; Komasi & Sharghi 2016; Nourani et al. 2018). To expedite the training process, both input and output data were normalized before training.
A three-layer FFNN using a BP algorithm was employed to downscale the precipitation and temperature for the Ardabil synoptic station. To identify the optimal number of hidden layer neurons, a sequential search was performed over 1,000 epochs. The evaluation metrics indicated that the best training epoch and the optimal number of hidden neurons were found within the ranges of 70–240 and 3–8, respectively. Subsequently, the four statistics (NSE, RMSE, CC, and KGE) of the downscaled precipitation and temperature (using predictors obtained through MI and MI-random sampling from ensemble clustering) are reported in Table 5. The results suggest that ensemble clustering combined with MI-random sampling outperforms the combination of ensemble clustering with MI. This superiority may stem from MI-random sampling's ability to explore different sections of the feature space, thereby introducing diversity into the feature selection process.
Approach | Variable | Phase | NSE | RMSE^a | CC | KGE
---|---|---|---|---|---|---
MI-random sampling | Precipitation | Train | 0.73 | 0.07 | 0.90 | 0.72
MI-random sampling | Precipitation | Test | 0.59 | 0.07 | 0.78 | 0.62
MI-random sampling | Temperature | Train | 0.98 | 0.02 | 0.99 | 0.98
MI-random sampling | Temperature | Test | 0.97 | 0.03 | 0.98 | 0.94
MI | Precipitation | Train | 0.68 | 0.16 | 0.79 | 0.67
MI | Precipitation | Test | 0.50 | 0.14 | 0.70 | 0.58
MI | Temperature | Train | 0.95 | 0.03 | 0.98 | 0.93
MI | Temperature | Test | 0.94 | 0.04 | 0.97 | 0.91
^a Normalized RMSE results.
The highest SC value for ensemble clustering compared to SOM (as shown in Tables 3 and 4) further confirms the superiority of ensemble clustering as a feature extraction method for ANN downscaling models. The robustness of ensemble clustering may be attributed to its ability to leverage the strengths of various clustering algorithms, thus addressing the limitations of individual methods. It is important to note that, due to SOM's superiority over k-means in terms of SC value, the evaluation metrics of the downscaling model based on inputs from the ensemble clustering feature extraction approach were compared with those based on inputs from SOM. In summary, the results indicate that the combination of ensemble clustering, which capitalizes on the strengths of various clustering algorithms, and MI-random sampling, which selects representative features from different parts of the feature space, significantly enhances the accuracy of ANN-based downscaling models for precipitation and temperature projections.
In addition, Table 6 shows the results of four uncertainty evaluation metrics (Pcl, ARIL, DU, and DL) for downscaled precipitation and temperature using both the ensemble clustering and SOM methods. According to Table 6, the ARIL value for downscaled precipitation is 1.60 with the ensemble clustering feature extraction approach, versus 1.50 with the SOM method. For downscaled temperature, however, the ARIL value of ensemble clustering (2.05) is lower than that of the SOM feature selection approach (2.15), meaning the observed data lie within a narrower uncertainty band. To better assess the uncertainty results, DL and DU were considered as additional uncertainty metrics alongside Pcl and ARIL. For both downscaled precipitation and temperature, the DU and DL values of the ensemble clustering feature extraction method represent more appropriate uncertainty outcomes than those of the SOM approach (Table 6). These favorable uncertainty results can be attributed to the effective clustering of the input dataset.
Table 6 | Uncertainty evaluation metrics (Pcl, ARIL, DU, DL) for downscaled temperature and precipitation using the ensemble clustering and SOM feature extraction methods

| Variable | Method | Pcl | ARIL | DU | DL |
|---|---|---|---|---|---|
| Temperature | Ensemble clustering | 100 | 2.05 | 0.73 | 1.74 |
| Temperature | SOM | 100 | 2.15 | 0.82 | 2.86 |
| Precipitation | Ensemble clustering | 100 | 1.60 | 11.34 | −2.98 |
| Precipitation | SOM | 100 | 1.50 | 11.79 | −3.17 |
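The band-based metrics in Table 6 can be computed from the upper and lower bounds of the downscaled series. Below is a minimal numpy sketch of two of them, Pcl (the percentage of observations falling inside the uncertainty band) and ARIL (the average band width relative to the observation); the function names are hypothetical, and DU and DL follow definitions given in the paper that are not reproduced here.

```python
import numpy as np

def aril(lower, upper, obs):
    """Average Relative Interval Length: mean band width relative to the
    observation. Smaller values indicate a narrower uncertainty band."""
    return float(np.mean((upper - lower) / obs))

def p_cl(lower, upper, obs):
    """Percentage of observations that fall inside the uncertainty band."""
    inside = (obs >= lower) & (obs <= upper)
    return float(100.0 * inside.mean())
```

A Pcl of 100, as in Table 6, means every observation lies within the band; ARIL then distinguishes how tight that band is.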
Results of the projected precipitation and temperature
Conversely, projected winter and summer precipitation shows increasing trends in the mid-term under both scenarios, indicating a shift in precipitation patterns relative to the historical record. The increase in winter and summer precipitation is, however, less pronounced than the decrease in spring and fall precipitation. Under the SSP3–7.0 scenario, near-future precipitation is projected to exceed mid-future precipitation in all seasons except summer. Under SSP5–8.5, precipitation is projected to decrease in spring, fall, and winter in the near future, while summer shows an increasing trend, as illustrated in Figure 7(a). Furthermore, the analysis of projected mean temperatures under both scenarios reveals a substantial rise in mean monthly and seasonal temperatures for both the near- and mid-future periods, as shown in Figure 7(b). Mean temperature values under the SSP3–7.0 and SSP5–8.5 scenarios exhibit increases of 21–27 and 29–35%, respectively, relative to the baseline period.
To assess the potential variability of projected precipitation and temperature values, uncertainty bands were calculated, representing two standard deviations around the yearly mean of the models. In Figure 8(a), the uncertainty band for projected precipitation ranges from 219 to 345 mm under SSP3–7.0 and from 215 to 325 mm under SSP5–8.5. In Figure 8(b), the uncertainty bands for projected temperature vary from 7 to 16 °C for SSP3–7.0 and from 7 to 11 °C for SSP5–8.5. A comparison of the uncertainty band widths reveals that the band is wider for SSP3–7.0 in both projected precipitation and temperature, indicating greater uncertainty associated with projections under this scenario compared to SSP5–8.5.
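A band defined as two standard deviations around the yearly ensemble mean, as described above, can be sketched as follows (the function name and the use of the sample standard deviation are assumptions):

```python
import numpy as np

def uncertainty_band(ensemble):
    """ensemble: (n_models, n_years) array of projected yearly means.
    Returns (lower, upper) bounds as the ensemble mean minus/plus two
    standard deviations across models for each year."""
    mu = ensemble.mean(axis=0)
    sd = ensemble.std(axis=0, ddof=1)
    return mu - 2 * sd, mu + 2 * sd
```

A wider band, as seen for SSP3–7.0 in Figure 8, reflects greater spread among the GCM projections for that scenario.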
CONCLUSIONS
In this study, we assessed the impact of climate change on precipitation and temperature at the Ardabil synoptic station for the near (2025–2035) and mid-future (2035–2045) periods using an ANN-based downscaling model. Three different GCMs from CMIP6 were incorporated into the downscaling models under the SSP3–7.0 and SSP5–8.5 climate change scenarios, employing a multi-GCM input technique. To identify the most informative predictors across multiple grid points, we applied the innovative coupled COMUSA ensemble clustering method along with MI-random sampling for feature selection.
The SC values indicated that ensemble clustering yielded the most accurate partitions, with SOM in turn outperforming k-means. To select a representative predictor from each cluster, we utilized the MI and MI-random sampling techniques. Under ensemble clustering, MI-random sampling outperformed MI alone by 18 and 3% in terms of NSE during testing of the precipitation and temperature downscaling models, respectively. Furthermore, when coupled with MI-random sampling, ensemble clustering outperformed SOM clustering by 16 and 2% in mean NSE for precipitation and temperature downscaling, respectively. Overall, the projected trends indicate a decline in precipitation and a rise in temperature in the coming years: precipitation is expected to decrease by 2–3 and 4–5% under the SSP3–7.0 and SSP5–8.5 scenarios, while temperatures are projected to rise by 21–27 and 29–35%, respectively.
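The NSE figures quoted above compare simulated and observed series. For reference, a minimal implementation of the Nash–Sutcliffe efficiency used as the comparison metric (the function name is illustrative):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the model error
    variance to the variance of the observations. 1 is a perfect fit;
    0 means the model is no better than the observed mean."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))
```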
The findings provide strong evidence for the effectiveness of statistical downscaling, particularly the advantage of COMUSA ensemble clustering combined with MI-random sampling for selecting large-scale climatic predictors. Integrating additional clustering methods within the COMUSA approach, such as Ward's method alongside SOM and k-means, presents a promising avenue for future research. We also suggest pairing a decision-tree input-screening approach with multi-linear entities alongside MI-random sampling to select dominant predictors from the generated clusters. Additionally, a committee of other AI and statistical learning methods (such as support vector machines) could be applied to statistical downscaling to enable accuracy comparisons with the ANN model.
AUTHOR CONTRIBUTION
Z.R. contributed to conceptualization, methodology, data curation, formal analysis; investigated the work; wrote the original draft; and also reviewed and edited the manuscript. M.N. contributed to conceptualization, methodology, formal analysis; validated and supervised the work and also wrote, reviewed, and edited the manuscript. M.T. contributed to conceptualization; wrote, reviewed, and edited the manuscript; and also supervised the work. F.M. supervised the work; contributed to data curation; and also wrote and edited the manuscript.
DECLARATION OF COMPETING INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.