## Abstract

An important issue in statistical downscaling of general circulation models (GCMs) is the selection of dominant large-scale climate variables (predictors). This study developed a predictor screening framework, which integrates wavelet-entropy (WE) and the self-organizing map (SOM), to downscale station rainfall. WEs were computed as representatives of the predictors and fed into the SOM to cluster them. SOM-based clustering of the predictors according to their WEs can lead to a physically meaningful selection of the dominant predictors. An artificial neural network (ANN) was then developed as the statistical downscaling method. To benefit from the advantages of different GCMs, a multi-GCM ensemble approach was used based on the Can-ESM2, BNU-ESM, and INM-CM4 GCMs. Moreover, NCEP reanalysis data were used to calibrate the downscaling model and for comparison purposes. The calibration, validation, and projection of the proposed model were performed over January 1951 to December 1991, January 1992 to December 2005, and January 2017 to December 2100, respectively. The proposed data screening model could reduce the dimensionality of the data and select appropriate predictors for generalizing future rainfall. The results showed better performance of the ANN than the multiple linear regression (MLR) model. The projections yielded 29% and 21% decreases of rainfall in the study area for 2017–2050 under RCPs 4.5 and 8.5, respectively.

## INTRODUCTION

Rainfall, as the major component of the hydrologic cycle, deposits most of the fresh water on the earth and can lead to disasters such as floods, erosion, sedimentation, and surface and groundwater contamination. The role of precipitation in water resource management cannot be disregarded; therefore, an extensive range of studies on the issue is being carried out, and the basic requirement of such studies is rainfall data. Access to high-resolution rainfall data can lead to more reliable studies; however, the availability of high-resolution rainfall data is limited due to the large spatiotemporal variation of rainfall as well as restrictions of cost, technical capability, and scarcity of hydrological stations. In this regard, general circulation models (GCMs) can be considered reliable hydro-climatologic data sources. GCMs use physically based equations for various processes of the atmosphere and ocean to simulate the response of the global climate system to increasing greenhouse gas concentrations (IPCC 2013). The skill of GCMs in reproducing hydrologic parameters such as precipitation is of critical importance. Although GCMs are tools that provide reliable atmospheric data, the coarse spatial resolution of GCMs limits their applicability as input to local-scale hydrologic models. Downscaling is an approach to obtain local-scale weather data from large-scale GCMs. The two methods of downscaling are dynamical and statistical downscaling. Dynamical downscaling derives smaller-scale climatic information over a bounded area via a high-resolution regional model driven by boundary conditions from GCMs.
Statistical downscaling encompasses statistically relating large-scale climate features (predictors) to local climate data (the predictand) (Wilby & Wigley 1997); in this way, the physical characteristics of the study area can be involved in downscaling by extracting the statistical pattern between predictors and predictand, which is indeed a black-box modeling approach.

In spite of the high-resolution outputs of dynamical downscaling, the intensive computational procedures due to the numerical solution of motion and thermodynamic equations, the use of physical principles to reproduce local climates, the need for large volumes of data, and the high level of expertise required to implement and interpret results limit the application of dynamical downscaling (Trzaska & Schnarr 2014; Danandeh Mehr & Kahya 2016). Instead, in statistical downscaling techniques, the purpose is to find statistical relations between GCM and local weather data without needing any physical knowledge of the region. Therefore, due to the convenience in implementing and interpreting results, statistical downscaling techniques have been used widely in several studies (e.g., Sailor & Li 1999; Olsson *et al.* 2001, 2004; Harpham & Wilby 2005; Chen & Adams 2006; Sousa *et al.* 2007; Beecham *et al.* 2014). The main concept of statistical downscaling is to build a relationship between predictors and predictand by means of a statistical method such as: (i) linear regression models, e.g., the statistical downscaling model (SDSM) (Wilby *et al.* 2002); (ii) nonlinear regression models, e.g., artificial neural network (ANN) (Zorita & Von Storch 1999), support vector machine (SVM) (Tripathi *et al.* 2006), relevance vector machine (RVM) (Ghosh & Mujumdar 2008), and gene expression programming (GEP) (Sachindra & Perera 2016); (iii) weather generators, e.g., the Long Ashton Research Station Weather Generator (LARS-WG) (Racsko *et al.* 1991).

Trzaska & Schnarr (2014) prepared a thorough review of downscaling methods for climate change projections, which can be referred to in order to study more about downscaling methods.

Among nonlinear regression techniques, ANN has a high potential to extract complex patterns and relations between predictors and predictand. The capability of ANN in simulating the nonlinear and time-varying characteristics of atmospheric variables at different scales has led to several successful applications of ANNs to downscaling issues in the literature (e.g., Wilby & Wigley 1997; Dibike & Coulibaly 2006; Chadwick *et al.* 2011; Okkan & Fistikoglu 2014; Okkan & Kirdemir 2016).

A closer look at the outcomes of studies with ANN-based downscaling methods shows contradictory results: some studies state the superiority of ANN, while others denote its drawbacks and inefficiency in downscaling GCM data (e.g., Dibike & Coulibaly 2006; Khan *et al.* 2006; Tisseuil *et al.* 2010; Abdellatif *et al.* 2013). These inconsistent results can depend on the quality and quantity of the applied data. The huge data sets made available by GCMs are one issue that has become a major challenge in ANN-based modeling. Redundant information and the noise involved in the data are other challenging issues concerning ANN, because noise might be magnified by nonlinear models such as ANN. In this case, the application of input screening as a pre-processing scheme can largely enhance the efficiency of an ANN-based downscaling model.

Input data screening is an important step in any data-driven modeling and usually improves modeling performance. Since different researchers have already shown that the application of input data screening approaches can enhance the efficiency of data-based models such as ANN (Nourani *et al.* 2017), it is expected that a robust input data screening technique can increase the efficiency of ANN in downscaling problems as well. In statistical downscaling problems, the main concern is encountering huge data sets of several large-scale climate variables with long historical time series (i.e., temporal variation) at diverse grid points (i.e., spatial variation). Generally, such large input data sets may decrease ANN performance; thus, input data screening of the GCM data to reduce the input size, in other words, dominant input selection among the GCM variables, might be useful.

In downscaling subjects, apart from different climate variables at multiple grid points over long time intervals, the selection of effective GCMs among several candidates is another factor that multiplies the data dimensionality. Although various research centers around the globe prepare GCM-based climate data, they generally do not offer the same values for a specific variable in a particular region, owing to different parameterization schemes, variations in boundary layers, and different resolutions; relying on the output of a single model is therefore risky, especially given abrupt spatial variability in climate (Khan & Pilz 2017). Thus, proper selection of the GCM is of prime importance in a downscaling approach (Lee & Kim 2017). The question, then, is which variables and GCMs should be included in a statistical downscaling model.

In spite of the development of new GCM parameter selection techniques (Khan & Pilz 2017) and extensive research projects which aim to standardize and coordinate climate models (i.e., the Coupled Model Intercomparison Project (CMIP5: http://cmip-pcmdi.llnl.gov/cmip5/) and the Paleoclimate Modelling Intercomparison Project (PMIP3: https://pmip3.lsce.ipsl.fr/)) (Varela *et al.* 2015), general approaches such as correlation analysis (Devak & Dhanya 2014), principal component analysis (PCA) (Ahmadi & Han 2013; Chuang *et al.* 2016), and fuzzy and Gamma tests (Ahmadi *et al.* 2015) have been employed for input selection of statistical downscaling models. Hence, a robust method for dominant input selection among various GCMs and several climate variables is needed, which the current study tries to address.

One of the effective methods to decrease the dimensionality of the input space is clustering and selecting a representative member from each cluster (Bowden *et al.* 2005). Although several studies have already used different clustering-based methods for dominant input selection of ANN models in hydrological applications (e.g., May *et al.* 2008; Markus *et al.* 2010; Nourani & Parhizkar 2013; Li *et al.* 2015; Nourani *et al.* 2017), the implementation of such clustering-based input screening is indeed scarce in statistical downscaling of GCMs. The advantage of the proposed input screening model is that it selects a comprehensive group of variables for generalizing future rainfall while the whole data domain contributes to selecting the dominant predictors, since an appropriate clustering method categorizes the data space into homogeneous clusters with similar characteristics and extracts features from the whole domain of observed data, even those with moderate or low relevancy criteria.

In the current study, the self-organizing map (SOM), a robust clustering method fed by wavelet-entropies (WEs), classifies the features of large-scale climate variables to select the dominant inputs. SOMs are often described as a type of neural network, but they are more easily understood as a type of constrained k-means. Various studies have indicated the superiority of SOM over classical methods such as k-means (Openshaw & Openshaw 1997; Bação *et al.* 2005; Hsu & Li 2010; Nourani & Parhizkar 2013). Some advantages of SOM over k-means are as follows: k-means is sensitive to initialization, while SOM is not; the gradient orientation of k-means forces a premature convergence which, depending on the initialization, may frequently yield local optima, whereas SOM is less prone to local optima and can deal effectively with problems having multiple optima; and the performance of the k-means algorithm depends largely not only on the number but also on the placement of the initial codebooks, which has been a limiting factor in its application, while the SOM mapping tries to preserve topological relations, i.e., patterns that are close in the input space are mapped to units that are close in the output space, and vice versa. By allowing easy visualization, SOM offers the opportunity for an early exploration of the search space, and as the process continues it gradually narrows the search. By the end of the search process (provided the neighborhood radius decreases to zero), the SOM becomes exactly the same as k-means, which allows for minimization of the distances between the observations and the cluster centers.

Since climatologic data sets involve non-stationary time series, entropy, as a measure of information content (Shannon 1948), could be preferred to statistical moments (i.e., mean, variance, skewness, etc.) in order to represent whole time series in the clustering procedure. On the other hand, the underlying multi-resolution seasonality of the process is drawn out by the wavelet transform, which extracts various features of the time series at different time scales (Johnson *et al.* 2011; Nourani *et al.* 2014; Rashid *et al.* 2015).

The proposed novel input data screening method is incorporated into ANN-based statistical downscaling of GCMs in order to reduce the redundant information of the input data set for projection of monthly rainfall at Tabriz station, located in Iran. Furthermore, to exploit the effective characteristics of each GCM in future rainfall simulation, an ensemble of three GCMs (i.e., Can-ESM2, BNU-ESM, and INM-CM4) is used in the proposed downscaling. An ensemble of GCMs has been recommended by other studies in order to cover the uncertainties involved in each individual GCM (Li *et al.* 2012; Miao *et al.* 2014; Rashid *et al.* 2015). Since various GCMs are developed according to different structures, climate downscaling and projection using different GCMs can lead to diverse results (Sachindra *et al.* 2014). Application of a multi-GCM approach can lead to efficient results, since there is the opportunity to select dominant variables among various GCMs with different structural inheritance rather than being forced to select from one specific GCM. In this way, ANN downscaling was performed using the ensemble of GCMs as well as reanalysis data (i.e., NCEP).

This article is organized in four sections as follows: the section below describes the study area as well as the applied data, including the station and GCM data; furthermore, the applied methodology is presented in the last part of that section. This is followed by a results and discussion section, and the final section provides concluding remarks.

## MATERIAL AND METHODS

### Study area and data set

Tabriz City (latitude: , longitude: ) is the capital of East Azerbaijan province in north-west Iran and is located in the valley of a seasonal river (Figure 1). The city lies on the Tabriz plain, which has a moderate slope and, about 60 km to the west, ends at the east bank of Urmia Lake. The elevation of the city varies between 1,350 and 1,600 meters above sea level. The climate is semi-arid with regular seasons. The annual precipitation of Tabriz is around 280 millimeters, mostly falling in the wet season (i.e., October to April). During winter, precipitation falls as snow, and during spring and fall as rain. Overall, the city's weather is mild and fine in spring, dry and semi-hot in summer, humid and rainy in fall, and cold with snowfall in winter. The average annual temperature is 12.6 °C. Cool winds blow from east to west, mostly in summer.

In order to project large-scale GCM data to local-scale station data, the Tabriz hydrologic observation station was considered. The monthly rainfall data of the station during January 1951 to December 2016, prepared by the Meteorological Organization of East Azerbaijan, were used in the current study. In order to develop the proposed downscaling model, large-scale historical climate data for the period January 1951–December 2005 were extracted from the Can-ESM2, BNU-ESM, and INM-CM4 GCMs, developed respectively in research centers of Canada, China, and Russia, as mean monthly atmospheric variables (Table 1). Since several studies have described the benefits of applying several grid points around the study location (Frost *et al.* 2011; Guo *et al.* 2012; Beecham *et al.* 2014), predictors at four grid points around the study station were adopted in this study as well (Figure 1). The four closest grid points of each GCM around the Tabriz station were considered as the potential grid points, based on the grid size of each GCM. Then, according to the proposed methodology, effective climate variables, and consequently effective grid points, were selected as the dominant ones. As Figure 1 shows, grid points A1, A2, A3, and A4 are relevant to the BNU and CAN GCMs, which coincide spatially, and B1, B2, B3, and B4 to the INM GCM. In order to validate the GCM-based downscaling, monthly reanalysis data sets of the National Center for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) (termed NCEP reanalysis hereafter in this study) were collected from the National Oceanic and Atmospheric Administration/Earth System Research Laboratory (NOAA/ESRL). The resolution of the NCEP data is 2.5° × 2.5°; to match the grids, the GCM outputs were linearly interpolated onto the NCEP grid points.
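Since this regridding step is only a linear interpolation, it can be sketched in a few lines. The following illustrative snippet (with hypothetical corner coordinates and temperature values, not actual GCM output) bilinearly interpolates a 2 × 2 GCM cell onto a single target grid point:

```python
def bilinear(lon, lat, grid):
    """Bilinearly interpolate a value at (lon, lat) from a dict mapping
    the four (lon, lat) corner coordinates of a cell to field values."""
    lons = sorted({p[0] for p in grid})
    lats = sorted({p[1] for p in grid})
    x0, x1 = lons[0], lons[1]
    y0, y1 = lats[0], lats[1]
    tx = (lon - x0) / (x1 - x0)  # fractional position within the cell, x-direction
    ty = (lat - y0) / (y1 - y0)  # fractional position within the cell, y-direction
    return ((1 - tx) * (1 - ty) * grid[(x0, y0)]
            + tx * (1 - ty) * grid[(x1, y0)]
            + (1 - tx) * ty * grid[(x0, y1)]
            + tx * ty * grid[(x1, y1)])

# Hypothetical 2 x 2 GCM cell (degrees) with air-temperature values (K)
cell = {(45.0, 37.5): 280.0, (47.8, 37.5): 282.0,
        (45.0, 40.3): 278.0, (47.8, 40.3): 281.0}
# Interpolate onto a target grid point at 46.25 E, 38.75 N
value = bilinear(46.25, 38.75, cell)
```

In practice such regridding is done with a dedicated library over whole fields; this sketch only illustrates the per-point arithmetic.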

| Centre | Centre acronym | Model | RCP (W/m^{2}) | Grid size (approximately) | Predictor number | Applied climate variables in GCMs | Pressure levels (Pa) |
|---|---|---|---|---|---|---|---|
| Beijing Normal University, China | BNU | BNU-ESM | RCP4.5; RCP8.5 | | 380 | ^{a}ua: eastward wind; ^{a}va: northward wind; ^{a}zg: geo-potential height; ^{a}hur: relative humidity; ^{a}hus: specific humidity; tas: air temperature; uas: eastward near-surface wind; vas: northward near-surface wind; psl: air pressure at sea level; hfls: surface upward latent heat flux; prc: convective precipitation flux; pr: precipitation flux; hurs: near-surface relative humidity; huss: near-surface specific humidity; evspsbl: water evaporation flux | ^{b}100; ^{b}200; ^{b}300; ^{b}500; ^{b}700; 1,000; 2,000; 3,000; 5,000; 7,000; 10,000; 15,000; 20,000; 25,000; 30,000; 40,000; 50,000; 60,000; 70,000; 85,000; 92,500; 100,000 |
| Canadian Centre for Climate Modelling and Analysis, Canada | CCCma | Can-ESM2 | RCP4.5; RCP8.5 | | 480 | | |
| Russian Academy of Sciences, Institute of Numerical Mathematics, Russia | INM | INM-CM4 | RCP4.5; RCP8.5 | | 380 | | |


^{a}Variables which vary at pressure levels.

^{b}Pressure levels of Can-ESM2 model in addition to others.

It is noted that the representative concentration pathways (RCPs) of the Fifth Assessment Report (AR5) of the United Nations Intergovernmental Panel on Climate Change (IPCC), i.e., the RCP4.5 and RCP8.5 scenarios, were considered for future simulation in this study. The applied data sets were collected from the IPCC data distribution center (http://cera-www.dkrz.de/) for the period 1951–2100.

Table 1 shows the list of available variables for each considered GCM. Ninety-five large-scale atmospheric variables at each grid point of the BNU-ESM and INM-CM4 models and 120 variables at each grid point of the Can-ESM2 model were considered, so in total 1,240 variables from the three GCMs are the potential input candidates of the downscaling model. Selecting dominant inputs by trial and error would require examining a vast number of input combinations, which is an exhausting process and not a feasible technique. Although the linear correlation coefficient may be used as a criterion to select dominant inputs, a nonlinear input selection method is essential for a nonlinear process, as judged by Nourani *et al.* (2015). In this regard, a nonlinear input screening method was applied in this study to find the significant inputs of the ANN-based downscaling model.

### Proposed methodology

In order to downscale the GCM data, an ANN-based statistical method was applied, and to improve the efficiency of the ANN model, pre-processing was performed on the GCM variables at the four nearest grid points around the studied station. The most effective GCMs in this study were selected according to the results of Cai *et al.* (2009). Selection of appropriate predictors was the next step in statistically downscaling rainfall. Since involving the full set of potential variables simultaneously in a downscaling model can negatively impact the outcomes due to redundant information, and since general guidelines are lacking, careful assessment in selecting particular predictors becomes necessary. Important predictor selection among the three GCMs was implemented in this study using a newly proposed method involving the WE and SOM tools. In this way, the ensemble of three different GCMs was formed in the screening step before downscaling. The ensemble idea in the current study concerns the inputs, not the outputs: all the considered predictors in Table 1 from four grid points of three GCMs (i.e., in total 12 sources of predictors) were integrated into the screening methodology in order to select the important variables impacting rainfall generation at Tabriz. Since the selected variables belonged to different GCMs at diverse grid points, the procedure was named the multi-GCM ensemble procedure.

The proposed methodology comprises three steps to achieve statistical downscaling of the predictand. According to the schematic diagram presented in Figure 2, the first step is the dominant input selection procedure, the second step is the ANN-based downscaling model, and the third step is the projection of future rainfall at Tabriz hydrological station according to a multi-input ensemble ANN model under RCPs 4.5 and 8.5.

#### First step

Owing to the non-stationary fluctuations inherent in climate phenomena, climatic variables encompass temporal features as well as seasonal attributes, and the wavelet transform can extract such features. Therefore, the discrete wavelet transform (DWT) was applied to decompose the time series of large-scale climate variables into multi-resolution sub-series. In this way, the large- and small-scale hidden features of the time series were separated into approximation and detail sub-series representing the general trend and the different levels of periodicity involved in the time series. Dominant feature extraction over such a huge data set of numerous sub-series would lead to poor outcomes, with accuracy degraded by the imposed noise. Therefore, it is suggested to compute and use the entropy of each sub-series as its representative. Entropy shows the information content, or degree of order/disorder, of the WT-decomposed time series and is designated as WE here.
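As a minimal illustration of such a multi-resolution decomposition (using the simple Haar wavelet on a synthetic series, rather than the wavelet families and GCM series of the study), the following sketch splits a series into one approximation and several detail sub-series:

```python
def haar_dwt(series):
    """One Haar DWT step: returns (approximation, detail) half-length series."""
    approx = [(series[i] + series[i + 1]) / 2 ** 0.5
              for i in range(0, len(series) - 1, 2)]
    detail = [(series[i] - series[i + 1]) / 2 ** 0.5
              for i in range(0, len(series) - 1, 2)]
    return approx, detail

def multilevel_dwt(series, level):
    """Decompose into [approximation_level, detail_level, ..., detail_1]."""
    details = []
    approx = list(series)
    for _ in range(level):
        approx, d = haar_dwt(approx)
        details.append(d)
    return [approx] + details[::-1]

# Synthetic monthly-like series (length 64), decomposed at level 3
series = [float(i % 12) for i in range(64)]
subs = multilevel_dwt(series, 3)
print([len(s) for s in subs])  # [8, 8, 16, 32]
```

The orthonormal Haar filters preserve the energy of the signal across the sub-series, which is why entropy computed on the sub-series remains meaningful.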

In this way, each sub-series is substituted by only one number (i.e., its WE), so that the dimensionality of the data domain is reduced greatly. In more detail, a monthly time series with many components is first decomposed into sub-series via the wavelet transform at level *l*, and then WEs are computed to represent the whole time series with only a few numbers, each representing a specific feature of the original time series. In order to use the WEs in the clustering tool, vectors of components comprising the WEs of the approximation and detail sub-series were established as the features of each predictor variable. Generally, not all features are equally informative: some may include noise, some may be correlated, and some have no remarkable relevancy to the output variable. Therefore, feature screening is necessary to select the informative features. SOM, an ANN-based clustering method, was used here to screen the inputs. Basically, SOM is an unsupervised ANN tool that can capture linear and nonlinear statistical relations in complex high-dimensional data sets and illustrate them in a comprehensible geometric relation map. SOM-based clustering was implemented over the computed WEs of the three GCMs' variables. Application of the WEs instead of the whole set of decomposed sub-series in the clustering procedure optimizes the SOM input layer and improves clustering performance by reducing the dimensionality of the data set.
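A sketch of how each sub-series collapses to a single WE number (Shannon entropy of a histogram of the sub-series values; the series and bin count below are illustrative, not from the study):

```python
from math import log2

def wavelet_entropy(sub_series, bins=10):
    """Shannon entropy (bits) of a sub-series, estimated from a histogram."""
    lo, hi = min(sub_series), max(sub_series)
    width = (hi - lo) / bins or 1.0               # guard against a constant series
    counts = [0] * bins
    for v in sub_series:
        k = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[k] += 1
    n = len(sub_series)
    return -sum(c / n * log2(c / n) for c in counts if c)

# Each wavelet sub-series is replaced by a single WE feature value
sub = [0.1, 0.4, 0.35, 0.8, 0.2, 0.55, 0.9, 0.05]
we = wavelet_entropy(sub, bins=4)
```

The WE values of all sub-series of one predictor then form its feature vector for the SOM.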

Due to the complexity of climate variables, a poor linear correlation between predictors and predictand is probable in spite of a strong nonlinear relation. In this regard, the mutual information (MI) concept was used as a measure to detect nonlinear relations between the predictors within the clusters and the predictand. Finally, the variables which passed the screening procedure were used in the ANN-based rainfall downscaling model.
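As an illustration of histogram-based MI estimation (a minimal sketch with synthetic data; the bin counts and series are arbitrary, not taken from the study):

```python
from math import log2

def mutual_information(x, y, bins=4):
    """Histogram-based MI (bits) between two equal-length series."""
    def bin_index(v, lo, width):
        return min(int((v - lo) / width), bins - 1)
    xw = (max(x) - min(x)) / bins or 1.0
    yw = (max(y) - min(y)) / bins or 1.0
    n = len(x)
    pxy, px, py = {}, {}, {}
    for xv, yv in zip(x, y):
        i, j = bin_index(xv, min(x), xw), bin_index(yv, min(y), yw)
        pxy[(i, j)] = pxy.get((i, j), 0) + 1 / n   # joint probability
        px[i] = px.get(i, 0) + 1 / n               # marginal of x
        py[j] = py.get(j, 0) + 1 / n               # marginal of y
    return sum(p * log2(p / (px[i] * py[j])) for (i, j), p in pxy.items())

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [v ** 2 for v in x]   # nonlinear but fully dependent relation
mi = mutual_information(x, y)
```

A linear correlation coefficient understates such curved dependence, whereas the MI value above is clearly positive, which is exactly why MI is preferred here.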

#### Second step

In the second step, the ANN-based downscaling model was developed. In this way, a multi-GCM ensemble model was trained on the screened variables of the three GCMs at the four grid points surrounding the study area to calculate the rainfall values of Tabriz station. NCEP-derived reanalysis data were also used to calibrate the ANN downscaling model. Since the application of GCM-derived data in downscaling has shown systematic biases between downscaled results and observations, statistical amendment of GCM data is often essential (Wilby *et al.* 2004). One of the common procedures to remove the systematic biases in mean and variance is standardization (Tripathi *et al.* 2006; Chen *et al.* 2010; Acharya *et al.* 2013). According to Sachindra *et al.* (2014), standardization of the predictors (i.e., subtracting the mean from the data and dividing by the standard deviation) scales the predictor data down to a single uniform scale and removes the units of the variables.
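The standardization described above amounts to the following sketch (illustrative values; in practice the mean and standard deviation estimated over the calibration period are reused when scaling future GCM data):

```python
def standardize(series):
    """Scale a predictor series to zero mean and unit standard deviation.
    For projection, reuse the calibration-period mean/std on future data."""
    n = len(series)
    mean = sum(series) / n
    std = (sum((v - mean) ** 2 for v in series) / n) ** 0.5
    return [(v - mean) / std for v in series]

# Hypothetical predictor values (any unit); output is dimensionless
z = standardize([10.0, 12.0, 14.0, 16.0])
```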

#### Third step

Finally, the third step was performed under RCPs 4.5 and 8.5 to simulate the future monthly rainfall of Tabriz hydrological station for the near and distant future, during the periods 2017–2050 and 2051–2100, respectively. It is noted that, reflecting various plausible viewpoints about future anthropogenic actions due to different rates of growth in population, economy, energy, and socioeconomic development and their impact on greenhouse gas emissions, each RCP demonstrates a distinct radiative forcing pathway by 2100. In this way, RCPs 4.5 and 8.5 relate to intermediate and high emission scenarios, respectively.

The sections below briefly explain the required tools for the proposed methodology.

### Wavelet-entropy

Seasonal patterns encompassed in climate time series are captured by WT through decomposing the time series into sub-series with different periods. The wavelet localizes a signal in the time and scale domains by comparing the relationship between the wavelet function and the signal. In this study, the discrete form of WT is used, defined as (Mallat 1998):

$$T_{m,n}=\frac{1}{\sqrt{a_0^{m}}}\int_{-\infty}^{+\infty}g^{*}\!\left(\frac{t-nb_0a_0^{m}}{a_0^{m}}\right)x(t)\,dt \tag{1}$$

where $*$ denotes the complex conjugate and $g(t)$ is the wavelet function or mother wavelet. Different types of mother wavelet include coif2, the Daubechies (db) family, sym3, Meyer, etc. (Mallat 1998); $m$ and $n$ denote the wavelet dilation and translation, respectively; $a_0$ is a specified fixed dilation step greater than 1; and $b_0$ is the location parameter, which must be greater than zero. The most common and simplest choices for the parameters are $a_0=2$ and $b_0=1$. This power-of-two logarithmic scaling of the dilation and translation is known as the dyadic grid arrangement. The dyadic wavelet can be written in more compact notation as (Mallat 1998):

$$g_{m,n}(t)=2^{-m/2}\,g\!\left(2^{-m}t-n\right) \tag{2}$$

For a discrete time series $x_i$, the dyadic WT becomes (Mallat 1998):

$$T_{m,n}=2^{-m/2}\sum_{i=0}^{N-1}g\!\left(2^{-m}i-n\right)x_i \tag{3}$$

where $T_{m,n}$ is the wavelet coefficient of the discrete wavelet at scale $a=2^{m}$ and location $b=2^{m}n$. Equation (3) considers a finite time series $x_i$, $i=0,1,2,\ldots,N-1$, with $N=2^{q}$; this gives the ranges of $m$ and $n$ as $1\le m\le q$ and $0\le n\le 2^{q-m}-1$, respectively. The original series can be reconstructed from the decomposition as:

$$x_i=\bar{T}(t)+\sum_{m=1}^{q}W_m(t) \tag{4}$$

where $\bar{T}(t)$ is called the approximation sub-signal at level $q$ and $W_m(t)$ are the detail sub-signals at levels $m=1,2,\ldots,q$. The wavelet coefficients $W_m(t)$ ($m=1,2,\ldots,q$) provide the detail signals, which can capture small features of interpretational value in the data; the residual term $\bar{T}(t)$ represents the background information of the data and is named the approximation signal.

The entropy of a coefficient series $T_{m,n}$, which takes values $T_{m,n}^{1},T_{m,n}^{2},\ldots,T_{m,n}^{N_s}$ with probabilities $P(T_{m,n}^{1}),P(T_{m,n}^{2}),\ldots,P(T_{m,n}^{N_s})$, respectively, is defined as (Shannon 1948):

$$E\left(T_{m,n}\right)=-\sum_{j=1}^{N_s}P\left(T_{m,n}^{j}\right)\log_2 P\left(T_{m,n}^{j}\right) \tag{5}$$

where $E(T_{m,n})$ is the entropy of $T_{m,n}$ (also referred to as the entropy function) and $N_s$ is the number of intervals or bins used to form the histogram. In this study, the two variables $X$ and $Y$ stand for a predictor and the predictand, respectively, and their joint entropy is:

$$E\left(X,Y\right)=-\sum_{i=1}^{N_s}\sum_{j=1}^{M_s}P\left(x_i,y_j\right)\log_2 P\left(x_i,y_j\right) \tag{6}$$

where $P(x_i,y_j)$ is the joint probability of $x_i$ and $y_j$, with numbers of bins $N_s$ and $M_s$, respectively. In order to find the relation between the two variables, the MI concept, an entropy-based criterion, can be used as well. In this way, the mutual dependence between the two variables is calculated by quantifying the information content of the random variables relative to each other (Nourani *et al.* 2015). While the correlation coefficient detects the linear relation between two variables, MI has the capability to detect the nonlinear dependency of variables (Cover & Thomas 1991). MI is computed by the following equation for a predictor $x$ and the predictand $y$ (Yang *et al.* 2000):

$$MI\left(X,Y\right)=\sum_{i=1}^{N_s}\sum_{j=1}^{M_s}P\left(x_i,y_j\right)\log_2\frac{P\left(x_i,y_j\right)}{P\left(x_i\right)P\left(y_j\right)} \tag{7}$$

For numerical calculation of $E$ and MI using Equations (5)-(7), the probability density functions (PDFs) of all the variables should be specified; the histogram method is the most common approach for calculating the PDF (Yang *et al.* 2000).

### Self-organizing map (SOM)

SOMs are unsupervised ANNs which reduce the dimensionality of data by generating a low-dimensional, discretized representation of the input space of the training samples, called a map. A neighborhood function is used to preserve the topological attributes of the input space; therefore, SOMs are effective in producing low-dimensional visualizations of high-dimensional data.

To find the output node whose weight vector $w_j$ best matches the $n$-dimensional input vector $x$, the Euclidean distance concept is used (Kohonen 1997):

$$D_j=\left\|x-w_j\right\| \tag{8}$$

The node whose weight vector is the closest match to the input data is the winner, named the best matching unit ($BMU$). In order to further decrease the distance between the weights and the $BMU$, learning continues by updating the weights at each training iteration $t$ (Kohonen 1997):

$$w_j(t+1)=w_j(t)+\alpha(t)\,h_{lm}(t)\left[x(t)-w_j(t)\right] \tag{9}$$

where $\alpha(t)$ is the learning rate, ranging in $[0,1]$, and $h_{lm}(t)$ denotes the neighborhood function. The most commonly used neighborhood function is the Gaussian function (Kohonen 1997):

$$h_{lm}(t)=\exp\!\left(-\frac{d_{lm}^{2}}{2\sigma^{2}(t)}\right) \tag{10}$$

where $l$ and $m$ correspond to the positions of the $BMU$ and its neighboring output nodes, $d_{lm}$ is the distance between them on the map, and $\sigma(t)$ is the width of the topological neighborhood at iteration $t$.

Finally, the SOM gathers homogeneous data with similar patterns into clusters and thereby reduces the dimensionality of the data.
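To make the algorithm concrete, the following deliberately minimal sketch trains a one-dimensional SOM with scalar weights on synthetic feature values (real applications, including the WE features of this study, use multi-dimensional inputs and a 2-D map; all parameter values here are illustrative):

```python
import math
import random

def train_som(data, n_nodes=3, iters=200, seed=0):
    """Minimal 1-D SOM: each node holds a scalar weight; the learning rate
    and Gaussian neighborhood width both shrink over the iterations."""
    rng = random.Random(seed)
    weights = [rng.uniform(min(data), max(data)) for _ in range(n_nodes)]
    for t in range(iters):
        x = rng.choice(data)
        alpha = 0.5 * (1 - t / iters)                    # decaying learning rate
        sigma = max(n_nodes / 2 * (1 - t / iters), 0.5)  # decaying neighborhood width
        bmu = min(range(n_nodes), key=lambda j: abs(x - weights[j]))
        for j in range(n_nodes):
            h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))  # Gaussian neighborhood
            weights[j] += alpha * h * (x - weights[j])
    return weights

# Two well-separated groups of WE-like feature values
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
w = sorted(train_som(data, n_nodes=2))   # one node should settle near each group
```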

### Artificial neural network (ANN)

ANN learns according to input and output data, while information flows through the network. The underlying pattern in the data affects the structure of ANN and leads to generation of the appropriate network according to data. ANN comprises neurons in different layers which interconnect through the network. The simplest ANN includes three layers: input, hidden, and output layers. The input layer includes input neurons that send information to the hidden layer of neurons, and finally processed information goes to the third layer of output neurons. The greater the number of layers, the more complex is the network.

The output of a three-layer feed-forward network can be written as:

$$\hat{y}_k = f_o\!\left(\sum_{j=1}^{M} w_{kj}\, f_h\!\left(\sum_{i=1}^{N} w_{ji}\, x_i + w_{jo}\right) + w_{ko}\right)$$

where *i*, *j*, and *k* show the input, hidden, and output layer neurons, respectively; $w_{ji}$ is a weight in the hidden layer, which connects the *i*th neuron in the input layer to the *j*th neuron in the hidden layer; $w_{jo}$ is the bias for the *j*th hidden neuron; $f_h$ is the activation function of the hidden neuron; $w_{kj}$ is a weight in the output layer connecting the *j*th neuron in the hidden layer to the *k*th neuron in the output layer; $w_{ko}$ is the bias for the *k*th output neuron; $f_o$ is the activation function for the output neuron; $x_i$ is the *i*th input variable for the input layer; and $\hat{y}_k$ is the computed output. *N* and *M* are the neuron numbers of the input and hidden layers, respectively. For more information on the ANN, readers are referred to relevant books in this field, such as Haykin (1994).
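A single forward pass through the network defined by these symbols can be sketched directly in numpy. The tanh hidden activation and linear output are assumptions for illustration; the shapes follow the notation above.

```python
import numpy as np

def forward(x, w_ji, w_jo, w_kj, w_ko):
    """One pass through a three-layer feed-forward network (a sketch).
    x: (N,) inputs; w_ji: (M, N) hidden weights; w_jo: (M,) hidden biases;
    w_kj: (K, M) output weights; w_ko: (K,) output biases."""
    f_h = np.tanh                        # hidden activation f_h (assumed tansig)
    f_o = lambda z: z                    # output activation f_o (assumed linear)
    hidden = f_h(w_ji @ x + w_jo)        # sum_i w_ji * x_i + w_jo, per hidden neuron j
    return f_o(w_kj @ hidden + w_ko)     # sum_j w_kj * hidden_j + w_ko, per output k

# With all-zero weights the output is exactly zero, since tanh(0) = 0
y = forward(np.ones(8), np.zeros((4, 8)), np.zeros(4), np.zeros((1, 4)), np.zeros(1))
```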

### Evaluation criteria

#### Clustering evaluation criterion

In order to measure the validity of the SOM-based clustering, the silhouette coefficient (SC) is applied. SC verifies the consistency of data within clusters. Increasing the number of clusters generally yields more homogeneous clusters and greater within-cluster consistency, but such an increment may not be wise for high-dimensional data such as in the current study, where the dominant feature from each cluster will be used by the ANN. Since numerous inputs to the ANN can decrease the efficiency of the model, a reasonable number of inputs obtained from the clusters should be considered; to do so, the mean SC value for each number of clusters was plotted. The SC calculation is explained in the following.

For each datum *i*, let a(i) be the mean distance of *i* to the other data in its own cluster and b(i) the lowest mean distance of *i* to the data of any other cluster; then

$$S(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

In order to assess the overall clustering configuration, the average of S(i) over all data of all clusters is calculated, which shows how appropriately the data have been clustered:

$$SC = \frac{1}{n}\sum_{i=1}^{n} S(i)$$

where *n* is the total number of arrays in the data space.
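The SC computation can be sketched as follows; the tiny two-cluster data set is a made-up example, and for well-separated clusters the coefficient approaches 1.

```python
import numpy as np

def silhouette_coefficient(X, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all n data (a sketch)."""
    n = len(X)
    s = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)              # distances from datum i
        same = labels == labels[i]
        a = d[same & (np.arange(n) != i)].mean()          # mean intra-cluster distance
        b = min(d[labels == c].mean()                     # lowest mean inter-cluster distance
                for c in set(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

X = np.array([[0.0, 0], [0, 1], [10, 10], [10, 11]])      # two tight, distant clusters
sc = silhouette_coefficient(X, np.array([0, 0, 1, 1]))    # close to 1 for this layout
```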

#### Simulation evaluation criteria

The downscaling models were evaluated via the determination coefficient (DC), root mean square error (RMSE), and correlation coefficient (CC):

$$DC = 1 - \frac{\sum_{i=1}^{N}(O_i - P_i)^2}{\sum_{i=1}^{N}(O_i - \bar{O})^2}$$

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(O_i - P_i)^2}$$

$$CC = \frac{\sum_{i=1}^{N}(O_i - \bar{O})(P_i - \bar{P})}{\sqrt{\sum_{i=1}^{N}(O_i - \bar{O})^2}\,\sqrt{\sum_{i=1}^{N}(P_i - \bar{P})^2}}$$

where *N*, $O_i$, $P_i$, and $\bar{O}$ are the number of observation data, the observed data, the calculated values, and the mean of the observed data, respectively ($\bar{P}$ denotes the mean of the calculated values). It is necessary to mention that Legates & McCabe (1999) proved the adequacy of DC and RMSE as evaluation criteria of models in hydro-climatology prediction processes.
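These three criteria translate directly into code; the sample observed/simulated series below are hypothetical stand-ins.

```python
import numpy as np

def dc(obs, sim):
    """Determination coefficient: 1 - sum((O-P)^2) / sum((O-Obar)^2)."""
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rmse(obs, sim):
    """Root mean square error."""
    return np.sqrt(np.mean((obs - sim) ** 2))

def cc(obs, sim):
    """Pearson correlation coefficient."""
    return np.corrcoef(obs, sim)[0, 1]

obs = np.array([12.0, 30.0, 8.0, 45.0, 20.0])   # hypothetical monthly rainfall (mm)
sim = np.array([10.0, 28.0, 11.0, 40.0, 22.0])  # hypothetical downscaled values
```

A perfect simulation gives DC = 1 and RMSE = 0, which makes the functions easy to sanity-check.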

## RESULTS AND DISCUSSION

Among several GCMs, three potential GCMs (i.e., Can-ESM2, BNU-ESM, and INM-CM4) were employed to select dominant predictors. In this regard, important climate variables of all the GCMs were identified according to the WT-EN-SOM methodology, which was proposed as the first step in the current study.

### First step-input screening

In the first step, DWT was applied to decompose the time series into approximation and detail sub-series. The proper selection of the mother wavelet and decomposition level is an important issue when using DWT. Based on Nourani *et al.* (2014), the structure of the Daubechies mother wavelet with four vanishing moments (db4) is similar to the time series of hydro-climatologic processes; thus, it can accurately capture the hidden features of the time series through wavelet analysis. Hence, the decomposition procedure was established with the db4 mother wavelet. Regarding the decomposition level, since the aim of the study is to downscale rainfall for the future, which requires knowledge about different rainfall frequencies over a distant horizon, level 7 was selected as the appropriate decomposition level to capture a large range of frequencies, from approximately monthly to seasonal up to five-yearly. Therefore, each predictor was decomposed into one approximation and seven detail sub-series, and the WE of each sub-series was calculated. In this way, a vector of eight members replaced the eight long sub-series of a predictor. The procedure was conducted on all predictors of each GCM.
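The entropy computation for the decomposed sub-series can be sketched in numpy. The per-band Shannon-entropy formulation below is an assumed reading of the paper's WE measure, and the stand-in signals are made up; the db4 level-7 decomposition itself would come from `pywt.wavedec(series, 'db4', level=7)` (PyWavelets) and is only referenced in a comment.

```python
import numpy as np

def band_entropy(coeffs):
    """Shannon entropy of one sub-series, from its normalized squared
    coefficients (an assumed formulation of the paper's WE measure)."""
    e = np.asarray(coeffs, float) ** 2
    p = e / e.sum()                  # relative energy of each coefficient
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def we_vector(subseries):
    """Eight-member WE feature vector from 1 approximation + 7 detail sub-series.
    In the paper's pipeline: subseries = pywt.wavedec(series, 'db4', level=7)."""
    return np.array([band_entropy(c) for c in subseries])

spread = np.ones(64)                  # energy spread evenly -> maximal entropy ln(64)
spike = np.zeros(64); spike[0] = 1.0  # energy in one coefficient -> zero entropy
```

Smooth, ordered sub-series thus yield low WE, while noisy ones spread energy and yield high WE, matching the interpretation given below.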

Afterwards, SOM clustering approach was used to cluster the similar features of the predictors in several groups. In order to reduce dimensionality of input space, dominant features from each cluster were selected according to MI.

For each of the three GCMs (i.e., Can-ESM2, BNU-ESM, and INM-CM4), 15 atmospheric predictors at four grid points around the hydrological station were considered to screen dominant variables, as shown in Table 1. The advantage of imposing four grid points around the downscaling station is the ability to take into account the physical characteristics of locations, which affect behavior of climate predictors due to spatial variation on an area covering nearly 103,850 square kilometers (area is calculated based on grid size).

After handling the non-stationary property of the predictors by wavelet, the dominant features must be selected among large volumes of data (e.g., for the Can-ESM2 GCM there are 3,840 sub-series over 50 years of monthly data; the number comes from 120 large-scale climate variables, considering 22 pressure levels, multiplied by eight decomposed sub-series at four grid points, as seen in Table 1). In this regard, the entropies of one approximation and seven detail sub-series were calculated, and a matrix with eight rows and N columns was constructed for each GCM (N depends on the number of predictors in each GCM; e.g., N = 120 for the Can-ESM2 GCM). Since the wavelet representation of the predictors differs according to the frequency state of the process, their WE values also vary. In fact, high WE values reflect high stochastic fluctuations obtained from contributions of all frequency bands. In contrast, low WE values are expected for ordered sub-series with deterministic and smooth signals.

Subsequently, SOM classified the approximately 1,200 feature vectors of predictors obtained from the three GCMs into several clusters. The dominant feature of each cluster was selected based on the maximum MI computed between the predictors in that cluster and the predictand; thus, the predictor with the greatest nonlinear relevancy to the observed rainfall data was designated as the agent for the cluster.
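The per-cluster agent selection can be sketched as follows. The histogram-based MI estimator is a simple stand-in (the paper does not specify its estimator), and the correlated/uninformative predictors are synthetic examples.

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Histogram estimate of MI(x; y) in nats (a simple sketch, not
    necessarily the paper's exact estimator)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)          # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)          # marginal p(y)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

def cluster_agents(predictors, labels, rainfall):
    """Pick, in each cluster, the predictor with maximum MI to the predictand."""
    agents = {}
    for c in set(labels):
        members = [i for i, l in enumerate(labels) if l == c]
        agents[c] = max(members, key=lambda i: mutual_info(predictors[i], rainfall))
    return agents

rng = np.random.default_rng(0)
y = rng.normal(size=500)                         # stand-in predictand (rainfall)
p0 = y + 0.05 * rng.normal(size=500)             # strongly informative predictor
p1 = rng.normal(size=500)                        # uninformative predictor
agents = cluster_agents(np.array([p0, p1]), [0, 0], y)
```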

In order to determine the optimum cluster number, the SC measure was calculated and plotted against the number of clusters, which varies between 4 and 100 (Figure 3) according to the possible two-dimensional SOM structures (i.e., cluster numbers). According to Figure 3, the SC values increase very smoothly with a growing number of clusters after a considerable drop around ten clusters. Although increasing the cluster number raises the SC value, too many clusters should be avoided in order to maintain the efficacy of the ANN (numerous inputs can lead to poor performance of the ANN in the verification step). Therefore, considering ANN efficiency and an acceptable SC value, the optimum cluster number for the approximately 1,200 climate features lies around 9 to 25: the growth of SC over 9 to 25 clusters is smooth in comparison to the gradient of SC before and after this range, as depicted by the vertical arrows in Figure 3.

After grouping the approximately 1,200 climate features of the three GCMs into nine clusters, the dominant agents of the clusters, which can operate as the representatives of the clusters, were determined by MI. Figure 4(a) illustrates the number of members in each of the nine clusters in the SOM hit plot, and Figure 4(b) shows the SOM neighbor weight distances plot. The hit plot shows the neuron locations in the topology and indicates how many of the training data are associated with each of the neurons (cluster centers). The topology is a 3-by-3 grid, so there are nine neurons. The maximum number of hits associated with any neuron is 273; thus, there are 273 input vectors in the relevant cluster. There are eight elements in each input vector, so the input space is eight-dimensional, and the weight vectors (cluster centers) fall within this space. The weight distance matrix (also called the U-matrix) is a visualization tool for the SOM, in which the small hexagons represent the neurons and the lines connect neighboring neurons. The regions containing the lines indicate the distances between neurons, with darker and lighter shades representing larger and smaller distances, respectively.

According to the SOM hit plot in Figure 4(a), among the nine groups of clustered data, the first cluster (i.e., at the bottom left corner), with just two components, is notable. These two members refer to the eastward and northward winds (85,000 Pa pressure level) at grid point B4 of the INM-CM4 model, with a majority of zero values. The sensitivity analysis proved that these components have no effect on the rainfall of Tabriz; therefore, they were eliminated and not considered in the modeling procedure. The SOM neighbor weight distances plot (Figure 4(b)) shows the discrepancy of these two variables by the darkest shade. It is worth noting that the eastward and northward winds at the 92,500 and 100,000 Pa pressure levels of the INM-CM4 model were entirely zero at all four grid points and were therefore not considered in the clustering process. The dominant features of the clusters, selected in the first step of the proposed methodology, are tabulated in Table 2.

| | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 | Cluster 7 | Cluster 8 | Cluster 9 |
|---|---|---|---|---|---|---|---|---|---|
| Variable | va, ua 85000 | hfls | ua 100000 | hus7000 | hfls | zg3000 | hus7000 | hfls | hfls |
| Model | INM-CM4 | BNU-ESM | BNU-ESM | INM-CM4 | Can-ESM2 | INM-CM4 | Can-ESM2 | BNU-ESM | Can-ESM2 |
| Grid No. | B4 | A1 | A2 | B4 | A3 | B1 | A2 | A4 | A4 |


Although various predictors were selected from the three GCMs through the clusters, representative predictors were related to humidity, wind velocity, geopotential height, and heat flux of the atmosphere; accordingly, some previous studies more or less pointed to such atmospheric predictors in rainfall downscaling (Wilby *et al.* 2004; Frost *et al.* 2011; Rashid *et al.* 2015).

In the absence of clustering, selecting important variables by MI alone would retain only the variables with the highest MI scores, which do not cover the entire variable domain, while variables with moderate or low MI values may still contribute to generalizing future rainfall and its fluctuations. Application of clustering makes their inclusion possible. The schematic diagram in Figure 5 clarifies the advantage of SOM-MI over the simple MI feature extraction method. In the first approach of Figure 5, which illustrates the simple MI method, MI values are ranked in descending order. The issue in this case is how to separate the dominant inputs from the ranked inputs; in other words, how many of the top-ranked MIs should be selected? The second approach of Figure 5, which applies a clustering method such as SOM before MI-based ranking, solves this problem. By clustering the potential inputs into groups, the number of dominant inputs is defined by the number of clusters, and the inputs with the largest MI values are not necessarily the ones selected; instead, different variables with various MI values are chosen. Selecting dominant inputs solely according to the largest MI values leads to only one kind of input, whereas with clustering (SOM), the input variables are grouped into clusters of similar inputs with specific patterns and various ranges of MI. Accordingly, various kinds of inputs can be selected, which exposes the AI models to inputs with different patterns and, subsequently, elevates the accuracy of predicting unseen data in the verification step. Therefore, by applying a clustering approach in this study, dominant variables from the whole domain of variables, even those with low or moderate scores, were selected according to the MI measure.

Figure 6 displays the dominant predictors obtained in the first step from the different grid points. It is noted that the overlapping area between the grid points of the Can-ESM2 and BNU-ESM GCMs is larger than the area formed by the INM-CM4 grid points around the study station. Important predictors on each grid can denote the morphologic impact of that region on atmospheric variables, which consequently affects the Tabriz station rainfall. In this respect, humidity-related variables became important on grids influenced by nearby water bodies. The specific humidity variable hus7000 showed a dominant effect on Tabriz rainfall at the grids nearest to the study area (i.e., grids A2 and B4). The Caspian Sea and Urmia Lake, as sources of moisture, lead to the dominancy of the humidity variable (i.e., hus7000) on grids A2 and B4, respectively.

Eastward wind (i.e., ua100000) at the nearest grid (i.e., the grid A2) was the other dominant variable in downscaling rainfall; the wind variable was in line with the prevailing wind of the region.

Hfls was the most important climate variable, which affected Tabriz rainfall from distant grid points and became significant at the grids A1, A3, and A4. The latent heat flux emerges as winds carry moisture away from the surface; thus, it can be involved in regional rainfall as an important factor. In this way, faraway water bodies can also influence Tabriz rainfall through the hfls climate variable. The source of such moisture can be Urmia Lake adjacent to grid A1, Sevan Lake close to grid A3, and the Caspian Sea near grid A4 (see Figure 6).

Near-surface geopotential height (i.e., zg3000) was selected as the dominant climate variable at the southern grid point B1 of the Tabriz station. The arrangement of air layers and the atmospheric pressure at a point vary with temperature: hot air is associated with low pressure, whereas cold air is associated with high pressure. Thus, the alternation of seasons leads to temperature changes, which in turn cause changes in atmospheric pressure. Changes in atmospheric pressure are the source of wind generation at the earth's surface, and winds are the principal cause of cloud displacement, which can be effective in producing rainfall. In this way, the input screening methodology of the current study, with its ability to detect nonlinear relationships, extracted the effect of geopotential height on rainfall production.

From Figure 6, it is noticed that the variables of the eastern grid points of the INM-CM4 model (i.e., B2 and B3) did not contribute to the SOM-MI input screening.

### Second step-ANN-based downscaling model

The eight climate variables selected in the first step were used as inputs to the ANN model. The ANN inputs included two types of data: (i) the ensemble of GCMs and (ii) the analogous NCEP reanalysis data. The large-scale climate variables that passed the SOM screening procedure were standardized over the period 1951–2005. The first 75% of the dominant predictors and observed rainfall data was used for training the downscaling model, and the remaining 25% was used for validation. Hence, the calibration period of the ANN model was from 1951 to 1991, and the verification period was from 1992 to 2005.
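The standardization and chronological split can be sketched as follows; the random predictor matrix is a stand-in for the eight screened variables, and standardizing over the full record mirrors the text.

```python
import numpy as np

def standardize_and_split(X, y, train_frac=0.75):
    """Standardize predictors over the whole record, then split chronologically:
    first 75% for calibration, last 25% for verification (a sketch)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    n_train = int(len(X) * train_frac)
    return (Xs[:n_train], y[:n_train]), (Xs[n_train:], y[n_train:])

months = 660                                  # Jan 1951 - Dec 2005, monthly
rng = np.random.default_rng(0)
X = rng.normal(size=(months, 8))              # eight dominant predictors (stand-ins)
y = rng.random(months)                        # observed rainfall (stand-in)
(train_X, train_y), (val_X, val_y) = standardize_and_split(X, y)
# the first ~75% of months corresponds roughly to the 1951-1991 calibration period
```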

A three-layer feed-forward neural network with the back-propagation algorithm and the Levenberg–Marquardt scheme, with the tangent sigmoid (tansig) activation function, was used to downscale the Tabriz station rainfall. Different numbers of hidden neurons (selected via the trial and error method) were examined over 1,000 epochs. The evaluation criteria of the ANN model demonstrated that the maximum efficiency occurred at 480 epochs with four hidden neurons. Results of the multi-GCM ensemble downscaling model are tabulated in Table 3.
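A rough sketch of such a network in numpy is given below. Plain batch gradient descent stands in for the Levenberg–Marquardt scheme (which is not reproduced here), and the toy data and learning rate are assumptions; only the architecture (tansig hidden layer, four hidden neurons, 480 epochs) follows the text.

```python
import numpy as np

def train_ann(X, y, hidden=4, epochs=480, lr=0.1, seed=0):
    """Feed-forward net with a tansig hidden layer, trained by batch gradient
    descent (a sketch; the paper uses Levenberg-Marquardt instead)."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=0.1, size=(hidden, X.shape[1]))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.1, size=hidden)
    b2 = 0.0
    n = len(y)
    for _ in range(epochs):
        h = np.tanh(X @ w1.T + b1)                 # hidden layer, tansig activation
        err = h @ w2 + b2 - y                      # linear output minus target
        gh = np.outer(err, w2) * (1 - h ** 2)      # gradient at hidden pre-activations
        w2 -= lr * h.T @ err / n                   # back-propagate the MSE gradient
        b2 -= lr * err.mean()
        w1 -= lr * gh.T @ X / n
        b1 -= lr * gh.mean(axis=0)
    return w1, b1, w2, b2

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))                      # eight standardized predictors
y = X[:, 0]                                        # toy target for the sketch
w1, b1, w2, b2 = train_ann(X, y)
pred = np.tanh(X @ w1.T + b1) @ w2 + b2
```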

| Model | GCM-based input selection | DC train | DC verify | N^{a}-RMSE train | N^{a}-RMSE verify | CC train | CC verify |
|---|---|---|---|---|---|---|---|
| ANN | CC | 0.23 | 0.18 | 0.91 | 0.99 | 0.30 | 0.28 |
| ANN | WE-SOM-MI | 0.50 | 0.41 | 0.74 | 0.69 | 0.70 | 0.67 |
| MLR | WE-SOM-MI | 0.30 | 0.40 | 0.86 | 0.70 | 0.55 | 0.59 |


^{a}N denotes normalized RMSE values.

In order to determine the efficiency of the proposed screening methodology, standard correlation analysis was also used to select dominant predictors (predictors with CC values of 0.3 and higher were accepted). The predictors obtained from the CC-based method were fed into the ANN, and the results are also tabulated in Table 3. The results showed that the proposed screening method was 46% more successful than CC-based downscaling, owing to its recognition of nonlinear relations and filtering of redundant information, and thus could detect a more robust pattern between predictors and the predictand.
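The CC-based baseline screening amounts to a simple threshold rule; the sketch below uses synthetic predictors, and taking the absolute correlation is an assumption.

```python
import numpy as np

def cc_screen(predictors, rainfall, threshold=0.3):
    """Baseline predictor screening by linear correlation: keep predictors whose
    CC with the predictand reaches 0.3 (absolute value assumed here)."""
    return [i for i, p in enumerate(predictors)
            if abs(np.corrcoef(p, rainfall)[0, 1]) >= threshold]

rng = np.random.default_rng(0)
rain = rng.normal(size=500)                   # stand-in predictand
linear = rain + 0.5 * rng.normal(size=500)    # linearly related predictor
unrelated = rng.normal(size=500)              # no linear relation
kept = cc_screen([linear, unrelated], rain)   # keeps only the linear predictor
```

Such a rule misses predictors with purely nonlinear links to rainfall, which is the weakness the MI-based screening addresses.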

To investigate the relationship between the grid point distance from the Tabriz station and the impact of variables at the grid points over the predictand forecasting, ANN-based sensitivity analysis was performed on the variables of each grid point and the results are shown in Table 4.

| Grid points | Point A1 | Point A2 | Point A3 | Point A4 | Point B1 | Point B4 |
|---|---|---|---|---|---|---|
| DC verify of ANN downscaling | 0.19 | 0.30 | 0.28 | 0.27 | 0.08 | 0.22 |


According to Table 4, neither of the two grid points with the least distance from the study area (i.e., A1 and B4) caught the maximum effect on production of the predictand; rather, the middle-distance grid point (i.e., A2) had the maximum effect on calibration of the predictand. Moreover, the two distant grid points (i.e., A3 and A4), with only a small difference from point A2, were more effective for downscaling the Tabriz station rainfall.

Therefore, the significance of large-scale climate variables in the rainfall production of the region was not related to the distance of the grid points from the studied area; distant points could even be more effective. This sensitivity analysis confirmed that the simultaneous effect of climate variables from different locations leads to a more robust downscaling model.

Since some commonly used statistical downscaling models, such as SDSM, use the multiple linear regression (MLR) method to establish a statistical relationship between predictors and the predictand, an MLR model was also developed in this study (Table 3) and compared with the results of the proposed ANN-based methodology to assess the performance of the ANN model. Table 3 demonstrates that the ANN-based downscaling model outperforms the MLR model in terms of the evaluation criteria, because MLR cannot capture nonlinear relations among the dominant predictors and the predictand. It should be mentioned that the MLR model was built from the same dominant climate variables used in the ANN downscaling model.

In general, the ANN model showed better performance than the MLR model in reproducing the rainfall statistics of the Tabriz station, along with a reasonable trend and variation of the observed data. Figure 7 depicts the mean monthly rainfall for the calibration and validation steps according to the observed data of the Tabriz station, as well as the ANN (with two types of inputs: multi-GCM and NCEP) and MLR downscaling models. Moreover, it was observed that the ANN-based downscaling model calibrated by NCEP data performed better than the ANN model calibrated by multi-GCM historical data.

Moreover, Figure 7 demonstrates that both ANN downscaling models showed high efficiency in forecasting the peak points of rainfall; for example, the April mean rainfall was forecast as accurately as the observed data, while the MLR model could not perform as efficiently as the ANN model.

Furthermore, in order to test the ANN downscaling accuracy, the BI criterion was calculated for the multi-GCM ensemble model at the verification step. Figure 8 shows that, during the dry months of summer (i.e., June, July, August, and September), the proposed multi-GCM ANN model overestimates Tabriz rainfall, while in the other months the model performs more reliably, with acceptable BI values falling inside a 10% deviation from the observed values (the two horizontal dashed lines mark this acceptable BI range).
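A per-month bias check of this kind can be sketched as below. The relative-bias formulation is an assumed definition (the paper does not spell out the BI formula here), and the three-year series is a made-up example.

```python
import numpy as np

def monthly_bias(obs, sim):
    """Relative bias per calendar month, BI_m = (Pbar_m - Obar_m) / Obar_m
    (an assumed definition). obs, sim: monthly series starting in January."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    bi = np.empty(12)
    for m in range(12):
        o, s = obs[m::12], sim[m::12]          # all Januaries, Februaries, ...
        bi[m] = (s.mean() - o.mean()) / o.mean()
    return bi

obs = np.tile(np.arange(1.0, 13.0), 3)         # three years of monthly rainfall (stand-in)
sim = 1.05 * obs                               # simulated values 5% too wet
bi = monthly_bias(obs, sim)                    # within the +/-10% acceptance band
```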

### Third step-rainfall projection for future

Since projecting far-future emissions and other human factors that affect the atmosphere is difficult, researchers employ a variety of scenarios based on different assumptions about future economic, social, technological, and environmental conditions, named projection scenarios (here, RCPs). RCPs denote four greenhouse gas concentration (not emission) pathways adopted in the IPCC AR5. Among the four RCPs (RCP2.6, RCP4.5, RCP6, and RCP8.5), RCPs 4.5 and 8.5 were used in the current study.

Future rainfall projection under RCPs 4.5 and 8.5 was conducted for near and distant future over the 2017–2050 and 2051–2100 periods after calibrating the ANN downscaling model with NCEP data. The boxplots in Figure 9 compare the distribution of rainfall data based on the minimum, first quartile, median, third quartile, and maximum for observation, calibration and projection values in near and distant future under the two RCPs.

According to Figure 9(a), the mean values of the ANN-calibrated rainfall at the baseline (i.e., 1951–2005) were almost equal to the observed rainfall values; however, the data dispersion was reduced in the calibrated models. The comparison of the annual observed and simulated rainfall values demonstrated that under both RCPs and over both periods (i.e., near and distant future), the mean rainfall values decrease. The highest decrement in annual rainfall appears under RCP4.5 at the end of the century. Rainfall under RCP4.5 and RCP8.5 shows decreases of 29%–36% and 21%–30% for the near and distant future compared with the baseline, respectively. Figure 9(b) demonstrates the seasonal projections of observed and simulated rainfall during the baseline and future (2017–2100): mean rainfall during all seasons under both RCPs will decrease, and the decline during autumn and winter is more notable than during spring and summer. In particular, during spring, rainfall under both RCPs remains almost constant compared with the baseline observations. The projections show that although mean rainfall will decrease, extreme events will become more probable. This outcome is in line with the IPCC (2013) report, which stated that 'the amount of rain falling in heavy precipitation events is likely to increase in most regions, while storm tracks are projected to shift poleward.'

Figure 10 depicts the annual variations of rainfall for the observed and simulated values under RCPs 4.5 and 8.5. It indicates a decreasing trend line, and the changes in annual rainfall encompass considerable year-to-year variations.

## CONCLUDING REMARKS

The rainfall variation at the Tabriz hydrological station is an important parameter in developing hydrologic impact studies of the city. In this regard, the future rainfall of the city was assessed by an ANN-based SDSM. The dominant climate predictors were determined via the novel WE-SOM predictor screening methodology among three GCMs (as the multi-GCM ensemble model) and the NCEP reanalysis data sets. Owing to the proper calibration results of the NCEP-based ANN downscaling, the latter model was also used in the simulation step.

The advantages of the proposed predictor screening methodology were that the wavelet transform handled the non-stationary effects of climate variables by depicting the various periods involved in the time series, and WE diminished the dimensionality of the data by replacing the decomposed sub-series while preserving the information content underlying the data. Moreover, SOM clustered the features of the predictors into homogeneous clusters, so one agent from each cluster could represent that cluster instead of several resembling data being applied. The agents were selected by MI, with its capability of detecting nonlinear relations, to find the predictors most related to the predictand in each cluster. Although the proposed screening methodology, with its different mathematical tools, may seem complicated (which may be considered a limitation of the modeling), the selection of optimum variables (i.e., 8 variables) from 1,200 potential variables resulted in a robust and suitable downscaling model.

Finally, the selected dominant predictors were applied in the ANN to downscale rainfall at the location. The ANN captured a nonlinear statistical relationship between the large-scale circulation variables and the rainfall of the Tabriz station. The comparison of the linear (i.e., MLR) and nonlinear (i.e., ANN) regression models indicated that the ANN is more reliable than the MLR model for downscaling rainfall at the study station, with 40% better performance in the model training step.

In contrast to research in which variables were selected at important grid points according to inverse distance weighting (IDW) based on the distance from the study station (Zhou & Zhang 2014), the present study showed that distance is not always a suitable criterion for choosing an effective variable to simulate the predictand. Indeed, in this study, the contribution of variables at distant points to downscaling the Tabriz station rainfall was even greater than that of close points.

In the last step, the future rainfall of the Tabriz station under RCPs 4.5 and 8.5 was projected. The simulations indicated that the station rainfall will decrease by 29%–36% under the intermediate emission scenario (RCP4.5) and by 21%–30% under the high emission scenario (RCP8.5).

Overall, the results of this study provide promising evidence for statistical downscaling and, more specifically, for the proposed WE-SOM input screening method for selecting climate variables. To build on the current study, it is recommended that the proposed WE-SOM screening method be used in the future to downscale other climatologic parameters of the station (e.g., temperature). Moreover, the proposed downscaling method can be compared with statistical and dynamical downscaling models (other than the MLR used in this study). In this way, given the different abilities of artificial intelligence and machine learning methods, other versions and kinds of AI models (e.g., SVM, ANFIS, GEP) and clustering approaches may also be used.