## Abstract

One of the weaknesses of water resources management is the neglect of the nonstructural aspects that involve the most important relationships between water resources and socioeconomic parameters. Particularly, socioeconomic evaluation for different regions is crucial before implementing water resources management policies. To address this issue, 14 countries in the world that have continuous increasing trends of using renewable water per capita (RWPC) during 1998–2017 were used for the estimation of eight socioeconomic parameters associated with four key indicators (i.e., economy, demographics, technology communication, and health sanitation) by using four different data-driven methods, including artificial neural networks, support vector machines (SVMs), gene expression programming (GEP), and wavelet-gene expression programming (WGEP). The performances of the models were evaluated by using correlation coefficient (R), root-mean-square error (RMSE), and mean absolute error (MAE). It was found that the WGEP model had the best performance in estimating all parameters. The mathematical expressions for these socioeconomic parameters were explored and their potential to be expanded in different spatial and temporal dimensions was assessed. The derived equations provide a quantitative means for the future estimation of the socioeconomic parameters in the studied countries.

## HIGHLIGHTS

The relationships between water resources and socioeconomic parameters were evaluated.

The mathematical equations of the hydro-socioeconomic parameters were explored.

Different data-driven methods were compared in the estimation of hydro-socioeconomic parameters and the best ones were determined.

## INTRODUCTION

One of the main goals of all engineering disciplines is to create more prosperity for communities. Therefore, any decision that departs from the needs and interests of society loses its value. All communities are dependent on water; water is needed for agricultural production, energy generation, health service, and industrial manufacture. The sustainability of communities depends directly or indirectly on the quantity, quality, reliability, and affordability of water. Water resources and socioeconomic systems are closely interconnected. On the one hand, decisions on water resources can create social challenges, and on the other hand, social behaviors can change the status of water systems.

A wide range of human sciences is needed to solve water issues, including economics (Langer 2020), behavioral and perceptual studies, decision-making, social values, community psychology, and politics. In many places, basic human water needs cannot be met. On the other hand, plenty of water is available in some places for human needs and industrial use. Both cases pose challenges for water resources management. Over many decades, efforts have been made to benefit society through effective water resources management.

In the last decade, data-driven soft computing methods such as artificial neural network (ANN), adaptive neuro-fuzzy inference system (ANFIS), gene expression programming (GEP), multivariate adaptive regression splines (MARS), M5 tree model, support vector machines (SVMs), random forest (RF), multi-linear regression (MLR), and hybrid wavelet methods have been successfully employed to address both water quality and quantity issues. Shabani *et al.* (2016) forecasted water demand of the City of Kelowna (CKD), Canada, using intelligent soft computing models and found that the GEP models were more sensitive to data classification, genetic operators, and optimum lag time than other intelligent soft computing models. Based on a review of 43 papers about the applications of the ANN method, Maier & Dandy (2000) concluded that ANNs have been increasingly used for the prediction of water resources. Mohammadrezapour *et al.* (2019) estimated monthly potential evapotranspiration in an arid region by using the SVM, ANFIS, and GEP models in Sistan and Baluchestan Province, Iran and indicated that the SVM, GEP, and ANFIS models, respectively, took the first, second, and third places in the estimation of monthly potential evapotranspiration. Roboredo *et al.* (2016) used an aggregate index of social-environmental sustainability to evaluate the social-environmental quality for a watershed in the southern Amazon. Soil, water, vegetation, socioeconomic, and social organization qualities were considered as indicators in their study. Pande & Sivapalan (2017) examined the human impacts on water resources and found that technology, economy, and trade were closely relevant to water sustainability. Li *et al.* (2019) demonstrated how socioeconomic development affected water quality in Tai Lake by analyzing population, per capita gross domestic production, and sewage discharge and their relationships with water quality.

Using various data-driven methods, Najafzadeh *et al.* (2018) estimated scour depth under clear water conditions in rectangular channels. Kisi *et al.* (2019) modeled the separation (transition) zone using the GEP, MARS, M5T, and DENFIS techniques. Surono *et al.* (2022) forecasted the air quality by using genetic algorithm-fuzzy k-medoids clustering (GA-FKM) and fuzzy k-medoids clustering particle swarm optimization (FKM-PSO). In addition, ZamanZad-Ghavidel *et al.* (2021) applied GEP models to 14 countries to determine the appropriate hydro-socioeconomic index (HSEI) for the evaluation of the sustainability of water resource systems. To improve the estimation of socioeconomic parameters for those 14 countries, different data-driven methods are used in this study. The main goal of this study is to determine the best data-driven methods to estimate the socioeconomic parameters for the future since few studies have been conducted to address this issue (Dong *et al.* 2019; Zhang *et al.* 2020). Artificial intelligence methods have been widely used in the field of water resources (Bozorg-Haddad *et al.* 2017) and some efforts have been made to use these methods to address the hydro-socioeconomic issues. Nowadays, it is particularly imperative to understand and determine the complex relationships between water resources and socioeconomic factors/parameters for future water resources management.

## METHODOLOGY

### Selection of key hydro-socioeconomic indicators and parameters

Interdisciplinary approaches are generally needed for managing water systems. The uncertainties in the status of future water resources and the response of a community to them make management more difficult. Given the interactions between the physical water system and the socioeconomic dynamics, effective water resources management is a complex process. For example, some of the questions that need an interdisciplinary answer are (Sivapalan *et al.* 2012) as follows:

How do social systems relate to water resources systems?

How do water resources decisions affect socioeconomic parameters?

*et al.* 2012). Figure 1 shows the relationship of hydro-socioeconomic indicators and parameters, including gross domestic product per capita (*GDPC*), income index (*II*), exports and imports (*EI*), human development index (*HDI*), population density (*PD*), internet users (*IU*), mortality rate (*MR*), and population served with piped water (*RPPW*). Figure 2 shows the main stages of the current study. The data of this study were extracted from the Knoema database (https://knoema.com). Figure 3 shows the geographic locations of these selected countries and Table 1 shows the basic information about these countries, including the average RWPC, GDP, and PD. In this study, 14 countries with continuous increasing trends of renewable water per capita (RWPC) during the 20-year period (1998–2017) were selected as an input dataset for all data-driven methods. Ten countries (including Albania, Belarus, Bosnia, Bulgaria, Croatia, Estonia, Georgia, Hungary, Latvia, and Lithuania) were used for training, while the remaining four countries (including Poland, Romania, Serbia, and Ukraine) were used for testing. The socioeconomic parameters were estimated in each year of the study period as the model outputs (Figure 4).
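The country-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `TRAIN_COUNTRIES`, `TEST_COUNTRIES`, and `country_year_pairs` are hypothetical, and the sketch only enumerates (country, year) sample keys rather than loading any data from Knoema.

```python
# Country-level hold-out split as described in the text (names from Table 1).
TRAIN_COUNTRIES = ["Albania", "Belarus", "Bosnia", "Bulgaria", "Croatia",
                   "Estonia", "Georgia", "Hungary", "Latvia", "Lithuania"]
TEST_COUNTRIES = ["Poland", "Romania", "Serbia", "Ukraine"]
YEARS = list(range(1998, 2018))  # 1998-2017 inclusive, i.e. 20 years


def country_year_pairs(countries, years=YEARS):
    """Return one (country, year) key per sample for the given countries."""
    return [(country, year) for country in countries for year in years]


train_keys = country_year_pairs(TRAIN_COUNTRIES)
test_keys = country_year_pairs(TEST_COUNTRIES)

# Sample counts of the training and testing sets.
print(len(train_keys), len(test_keys))
```

Splitting by country rather than by year means the test set contains only countries the models never saw during training.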

Countries | Average RWPC (cubic meters) | Average GDP (US dollars) | Average PD (people per sq. km) |
---|---|---|---|
Albania | 10.16 | 3,044.45 | 108.56 |
Belarus | 6.00 | 4,330.60 | 47.60 |
Bosnia | 10.14 | 3,587.80 | 72.29 |
Bulgaria | 2.82 | 5,153.35 | 69.54 |
Croatia | 24.16 | 10,801.65 | 78.09 |
Estonia | 9.50 | 12,696.00 | 31.78 |
Georgia | 15.63 | 2,434.45 | 71.16 |
Hungary | 10.37 | 10,879.15 | 111.43 |
Latvia | 16.15 | 10,033.15 | 34.91 |
Lithuania | 7.69 | 10,240.20 | 51.12 |
Poland | 1.59 | 9,811.60 | 124.61 |
Romania | 10.16 | 6,387.40 | 90.90 |
Serbia | 22.13 | 4,359.55 | 83.84 |
Ukraine | 3.75 | 2,262.50 | 80.85 |


### Introduction to the selected socioeconomic parameters

In this study, *RWPC* is considered an indicator of water resources status (i.e., hydro), while the socioeconomic parameters include *GDPC*, *II*, *EI*, *HDI*, *PD*, *IU*, *MR*, and *RPPW* (Figure 1). The distribution of the population in each country varies according to its natural parameters and characteristics. Therefore, access to water resources plays an important role in *PD*. With knowledge of the water resources of each area, the facilities needed for residents can be estimated. The *HDI* is an indicator for the social evaluation of a society, which consists of life expectancy, education index, and *II*. This index depends on several main factors, including water resources (Sinha & Sengupta 2019). *MR* is an index for measuring the number of deaths. One of the causes of disease is the lack of adequate water resources or their pollution, which has a great impact on the health of the people who live in such areas. For example, Keshavarz *et al.* (2013) highlighted the great impact of water scarcity on the health of people who lived in two villages of Shiraz Province, Iran. The *GDP* is the total value of all finished goods and services produced in a country over a specific period, indicating the overall economic condition of the country. Another economic index used in this study is the *II*, which is obtained by dividing the gross national income (*GNI*) by the population of the country. Both economic indicators depend on the amount of water resources in the area. *EI* is another economic parameter used in this study, which also depends on water resources. The number of internet users (*IU*) is important in this regard because, as an educational tool, the internet can have a significant impact on people's awareness of water issues (Aerts *et al.* 2018). Table 2 lists the abbreviations and units of the selected socioeconomic parameters.

Parameters | Abbreviations | Units |
---|---|---|
Renewable water per capita | RWPC | Cubic meters |
GDP per capita | GDP | US dollars |
Income index | II | Score |
Exports and imports | EI | US dollars |
Human development index | HDI | Score |
Population density | PD | People per sq. km |
Internet users | IU | Percent |
Mortality rate | MR | Deaths per 1,000 live births |
Population served with piped water | RPPW | Percent |


### Different data-driven methods

#### Artificial neural network

Artificial neural networks are computational systems that are inspired by biological neural networks. ANN approaches include three main layers (i.e., input, output, and hidden layers). The Levenberg–Marquardt (LM) algorithm is one of the faster and more reliable back propagation (BP) algorithms. The detailed theory of ANNs can be found in Haykin (1998).

The characteristics of the ANN models can be summarized as follows:

Applied algorithm: The LM algorithm with three layers was applied for training of the ANN estimation models.

Activation functions: The logsig, tansig, and purelin functions were applied to the nodes as needed.

Determination of the neuron number: The trial-and-error method is the best way to determine the optimal number of neurons in the hidden layer of the ANN models (Barzegar *et al.* 2016). The ANN program code was written in MATLAB in the current study.

#### Support vector machines

SVMs are powerful data-driven methods introduced by Vapnik (1995). The major advantages of SVMs over ANNs include their improved generalization ability, unique and globally optimal architectures, and the ability to be rapidly trained.

*et al.* 2002). The choice of the kernel function type for SVM depends on the amount of training data and the feature vector dimensions. A kernel function should be chosen so that it is capable of learning from the inputs of the problem. Four types of kernel functions, including linear kernel, polynomial kernel, hyperbolic tangent kernel, and radial basis function (RBF) kernel, can be used for SVM models (Deka 2014). In the current study, the RBF kernel was selected. It can be expressed as follows:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^{2}\right)$$

where $K(x_i, x_j)$ is the kernel function; $x_i$ and $x_j$ are the training and testing datasets, respectively; and $\gamma$ is the kernel width parameter. In this study, the SVM code was written in MATLAB.
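As a rough illustration of the RBF kernel named above, the sketch below uses the common form K(x, y) = exp(−γ‖x − y‖²). The function name `rbf_kernel`, the parameter name `gamma`, and its default value are assumptions made for illustration; they are not settings reported in this study, whose SVM models were built in MATLAB.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical inputs give the maximum similarity of 1.0.
same = rbf_kernel([0.2], [0.2])
# The similarity decays towards 0 as the inputs move apart.
apart = rbf_kernel([0.0], [1.0], gamma=1.0)
print(same, apart)
```

A larger `gamma` makes the kernel more local, so only very close inputs are treated as similar.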

#### Gene expression programming

The GEP model is based on Darwin's theory of natural selection. The fundamental steps of this model include (1) selecting the terminal dataset; (2) selecting the function set; (3) selecting the indicators of model evaluation; (4) determining the control components; and (5) determining the requirements/criteria to stop the program run. The GEP model has many advantages. One of the most important is that it generates an expression tree and an explicit formulation, which can be very useful in engineering fields (Ferreira 2006).

The characteristics of the GEP model developed in this study can be summarized as follows:

The functions set (*F*): Different mathematical functions are applied to compare and evaluate the estimation models:

The terminal set (*T*): The terminal set includes RWPC. Other characteristic parameters used in the GEP model include number of chromosomes = 30, head length *h* = 7, and genes per chromosome = 3 (function set defined in GeneXproTools). Additional values were selected to link the sub-trees. In this study, GeneXproTools 4.0 was used to estimate the socioeconomic parameters.

#### Wavelet analysis

*a* and *b* are the scaling and translation functions with integer *m*; *t* is an integer that refers to a point of the input signal; *n* is the discrete time index; *x*(*t*) is a given signal; and *f*(*t*) is the mother wavelet. Selecting the mother wavelet type and a suitable number of decomposition levels based on the nature of the signal is the most important step in the wavelet analysis. In the present study, a one-dimensional Daubechies wavelet, chosen for the similarity in shape between the mother wavelet and the data series, is used at a suitable level to decompose the data into subseries.

The decomposition level, *L*, is given by (Barzegar *et al.* 2016):

$$L = \mathrm{int}\left[\log_{10}(N)\right]$$

where *N* is the number of data points. Wavelet decomposition and approximation are performed by using the db4 at level 2. For the three parts (i.e., A2, D2, and D1), A2 represents the low-frequency part of the signal, while D2 and D1 represent the high-frequency parts. Figure 4 shows the flowchart of the modeling and analysis in the current study.
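The choice of decomposition level can be checked with a short calculation. The sketch below assumes the commonly used rule L = int[log10(N)], which agrees with the level of 2 reported for the 280 data points of this study; the function name `decomposition_level` is hypothetical.

```python
import math


def decomposition_level(n_points):
    """Return L = int[log10(N)] for a series with N data points."""
    if n_points < 1:
        raise ValueError("n_points must be positive")
    return int(math.log10(n_points))


n_points = 14 * 20  # 14 countries x 20 years = 280 samples
level = decomposition_level(n_points)
print(level)  # 2
```

Under this rule the level stays at 2 for any series of 100 to 999 points, so the result is not sensitive to small changes in the number of samples.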

The data are normalized as follows:

$$X_N = \frac{X_i - X_{min}}{X_{max} - X_{min}}$$

where $X_N$ is the normalized value; $X_i$ is the real value; $X_{min}$ is the minimum value; and $X_{max}$ is the maximum value. Normalizing the training inputs generally improves the quality of the training.
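The normalization defined by these symbols can be sketched as min-max scaling to the range [0, 1]. That range is an assumption, chosen because it is consistent with the value ranges in Table 5; the function name `min_max_normalize` is hypothetical, and the four input values are the average RWPC figures of the test countries in Table 1, used only as a small example.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] using (x - min) / (max - min)."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot normalize a constant series")
    return [(value - lowest) / (highest - lowest) for value in values]


# Average RWPC of Poland, Romania, Serbia, and Ukraine (Table 1).
rwpc = [1.59, 10.16, 22.13, 3.75]
normalized = min_max_normalize(rwpc)
print(normalized)
```

The smallest value maps to 0 and the largest to 1, so parameters with very different units become comparable as model inputs.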

The augmented Dickey–Fuller (ADF) test (Dickey & Fuller 1979) is a unit root test for the stationarity of a time series. The null hypothesis is the presence of a unit root (i.e., the series is non-stationary). In general, a *p*-value less than 0.05 implies that the null hypothesis can be rejected and the time series is stationary. In this study, the ADF test is first performed on the time series of the existing parameters using the EVIEWS software. EVIEWS supports various types of information criteria; here, the Schwarz information criterion (SIC) is used for the ADF test. Moreover, to consider the 20-year data for each country in the macroeconomic series, a 20-year interval is defined for each country (i.e., 1–20 for the first country, 21–40 for the second country, and so on). Therefore, the lag length is 20. For instance, the first year of the second country must be defined for the software analysis as a new initiation of data for better pattern recognition. Since the main purpose of this study is to predict socioeconomic parameters, the stationarity of the data is very important. Thus, stationary time series have been used, as verified by the results of the ADF test. According to the *p*-values shown in Table 3, all the time series used in this study have a *p*-value less than 0.05.

Parameter | *p*-value |
---|---|
RWPC | <0.001 |
GDP | 0.0018 |
II | 0.0215 |
EI | 0.0179 |
HDI | 0.0038 |
PD | <0.001 |
IU | 0.0228 |
MR | 0.0058 |
RPPW | 0.0186 |


### Assessment of model performance

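The performances of the models were evaluated by using the correlation coefficient (*R*), root mean square error (*RMSE*), and mean absolute error (*MAE*), which are respectively given by:

$$R = \frac{\sum_{i=1}^{N}\left(SE_{io}-\overline{SE}_{o}\right)\left(SE_{ie}-\overline{SE}_{e}\right)}{\sqrt{\sum_{i=1}^{N}\left(SE_{io}-\overline{SE}_{o}\right)^{2}\sum_{i=1}^{N}\left(SE_{ie}-\overline{SE}_{e}\right)^{2}}}$$

$$RMSE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(SE_{io}-SE_{ie}\right)^{2}}$$

$$MAE = \frac{1}{N}\sum_{i=1}^{N}\left|SE_{io}-SE_{ie}\right|$$

where $\overline{SE}_{o}$ and $\overline{SE}_{e}$ are the average values of the observed and estimated socioeconomic parameter values; $SE_{io}$ and $SE_{ie}$ are the observed and estimated socioeconomic parameter values; and *N* is the total number of datasets. The correlation coefficient (*R*) measures the strength and direction of the linear relationship between variables; the *RMSE* shows the goodness of fit relevant to the high values; and the *MAE* measures the balanced distribution of goodness of fit at moderate values. In general, the model performances are optimum if *R* and *RMSE* are closer to 1 and 0, respectively.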
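The three criteria can be computed with a few lines of code. The sketch below uses the standard definitions of the Pearson correlation coefficient, RMSE, and MAE; the function names are hypothetical, and the `observed` and `estimated` lists are made-up normalized values for illustration only, not data from this study.

```python
import math


def correlation_coefficient(observed, estimated):
    """Pearson correlation coefficient R between two equal-length series."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    covariance = sum((o - mean_o) * (e - mean_e)
                     for o, e in zip(observed, estimated))
    spread_o = sum((o - mean_o) ** 2 for o in observed)
    spread_e = sum((e - mean_e) ** 2 for e in estimated)
    return covariance / math.sqrt(spread_o * spread_e)


def rmse(observed, estimated):
    """Root mean square error."""
    squared = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    return math.sqrt(squared / len(observed))


def mae(observed, estimated):
    """Mean absolute error."""
    absolute = sum(abs(o - e) for o, e in zip(observed, estimated))
    return absolute / len(observed)


# Hypothetical normalized values, for illustration only.
observed = [0.10, 0.40, 0.55, 0.90]
estimated = [0.15, 0.35, 0.60, 0.80]
print(correlation_coefficient(observed, estimated),
      rmse(observed, estimated),
      mae(observed, estimated))
```

Because the RMSE squares each error before averaging, it is never smaller than the MAE and reacts more strongly to a few large errors.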

## RESULTS AND DISCUSSION

### ANN, SVM, and GEP models

The three data-driven approaches (i.e., ANN, SVM, and GEP) were applied to estimate eight different socioeconomic parameters with consideration of economy, demographics, technology communication, and health sanitation. All the selected output parameters showed significant correlations with *RWPC*. For example, with increasing *RWPC*, *PD* decreased because there was no need to concentrate the population in a particular area to use water resources, and the population was spread over different parts of the country. *MR* had an inverse relationship with *RWPC*: access to adequate water resources and a stronger agricultural sector could help reduce the majority of diseases. As a result, increasing the quantity of available water resources could improve people's health and reduce the number of deaths due to water-related diseases. The other selected parameters showed a direct relationship with *RWPC*. The ANN with the LM algorithm and one hidden layer was applied, and the number of neurons in the hidden layer, ranging from 1 to 10, was determined by the trial-and-error method. The numbers of neurons in the hidden layer of the models were 3, 2, 4, 4, 3, 3, 4, and 2 for *GDPC*, *II*, *EI*, *HDI*, *PD*, *IU*, *MR*, and *RPPW*, respectively. The activation function of the hidden nodes of the ANN models was the tangent sigmoid for all parameters. The activation function of the output nodes was the tangent sigmoid for *HDI* and *IU*, and the linear function for *GDPC*, *II*, *EI*, *PD*, *MR*, and *RPPW*.

The root relative squared error (*RRSE*) was selected as an appropriate fitness function with a pressure tree. The results of ANN, SVM, and GEP for the test period are shown in Table 4. Among these three methods, the GEP models achieved the best values of *RMSE*, *R*, and *MAE* for all socioeconomic parameters. SVM models showed good performances for the socioeconomic parameters, while ANN models had the worst performances in predicting all parameters. Figures 5–7 show the observed and estimated socioeconomic parameters during the testing period for the ANN, SVM, and GEP methods, respectively. As shown in Table 4, the values of *R* and *RMSE* for the three methods have the following relationships: $R_{GEP} > R_{SVM} > R_{ANN}$ and $RMSE_{ANN} > RMSE_{SVM} > RMSE_{GEP}$, indicating that the GEP model is the best of the three for estimating the aforementioned socioeconomic parameters, in addition to generating an expression tree and an explicit formulation.

Aspects | Parameters | WGEP *R* | WGEP *RMSE* | WGEP *MAE* | ANN *R* | ANN *RMSE* | ANN *MAE* | SVM *R* | SVM *RMSE* | SVM *MAE* | GEP *R* | GEP *RMSE* | GEP *MAE* |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Economy | GDPC | 0.857 | 0.188 | 0.133 | 0.706 | 0.259 | 0.195 | 0.739 | 0.249 | 0.192 | 0.763 | 0.24 | 0.177 |
Economy | II | 0.872 | 0.16 | 0.119 | 0.785 | 0.205 | 0.161 | 0.793 | 0.201 | 0.158 | 0.803 | 0.195 | 0.147 |
Economy | EI | 0.876 | 0.167 | 0.108 | 0.677 | 0.259 | 0.191 | 0.709 | 0.243 | 0.175 | 0.745 | 0.227 | 0.156 |
Demographics | HDI | 0.918 | 0.135 | 0.097 | 0.885 | 0.16 | 0.117 | 0.889 | 0.157 | 0.117 | 0.895 | 0.153 | 0.114 |
Demographics | PD | 0.999 | 0.011 | 0.008 | 0.999 | 0.016 | 0.012 | 0.999 | 0.015 | 0.011 | 0.999 | 0.014 | 0.011 |
Technology communication | IU | 0.934 | 0.172 | 0.134 | 0.831 | 0.238 | 0.187 | 0.867 | 0.235 | 0.187 | 0.877 | 0.228 | 0.174 |
Health sanitation | MR | 0.931 | 0.175 | 0.136 | 0.856 | 0.218 | 0.169 | 0.877 | 0.215 | 0.158 | 0.89 | 0.212 | 0.158 |
Health sanitation | RPPW | 0.936 | 0.158 | 0.132 | 0.888 | 0.194 | 0.161 | 0.893 | 0.192 | 0.16 | 0.901 | 0.189 | 0.152 |


The bold values represent the best values for each criterion among the different methods.

### Wavelet analysis

The one-dimensional Daubechies-4 (db4) wavelet was used to decompose the data into subseries. The Daubechies-4 wavelet has been applied in many studies (e.g., Barzegar *et al.* 2016). In the current study, the number of data points is 280 (14 countries over 20 years), so the level of wavelet decomposition is 2. The discrete db4 wavelet decomposed the *GDPC*, *II*, *EI*, *HDI*, *PD*, *IU*, *MR*, *RPPW*, and *RWPC* parameters at level 2. The ranges of the A2, D2, and D1 subseries are shown in Table 5. For example, the values of the low-frequency part A2 at level 2 for the economy-related parameters *GDPC*, *II*, and *EI* vary from −0.033 to +1.036, from −0.026 to +1.080, and from −0.041 to +1.058, respectively. The values of the high-frequency parts D2 and D1, which contain the signal details, range from −0.206 to +0.152 and from −0.367 to +0.375 for *GDPC*, respectively.

Parameters | A2 min | A2 max | D2 min | D2 max | D1 min | D1 max |
---|---|---|---|---|---|---|
GDPC | −0.033 | 1.036 | −0.206 | 0.152 | −0.367 | 0.375 |
II | −0.026 | 1.080 | −0.168 | 0.175 | −0.368 | 0.383 |
EI | −0.041 | 1.058 | −0.346 | 0.343 | −0.455 | 0.516 |
HDI | −0.012 | 1.082 | −0.154 | 0.181 | −0.363 | 0.383 |
PD | −0.110 | 1.024 | −0.157 | 0.152 | −0.432 | 0.366 |
IU | −0.059 | 1.086 | −0.163 | 0.186 | −0.349 | 0.353 |
MR | −0.077 | 1.071 | −0.145 | 0.189 | −0.368 | 0.349 |
RPPW | −0.032 | 1.119 | −0.172 | 0.225 | −0.341 | 0.352 |
RWPC | −0.031 | 1.112 | −0.161 | 0.152 | −0.367 | 0.432 |


### WGEP models

*DWT* wavelet tool. To build the WGEP model, each decomposed subseries of the selected parameters was estimated separately with the GEP model, and the WGEP estimation model was generated directly by summing the estimated subseries. The results of the data-driven models with the db4 mother wavelet and WGEP for the testing period are shown in Table 4. The *RMSE* values of the WGEP model with the db4 mother wavelet are 0.188, 0.160, 0.167, 0.135, 0.011, 0.172, 0.175, and 0.158 for *GDPC*, *II*, *EI*, *HDI*, *PD*, *IU*, *MR*, and *RPPW*, respectively. Figure 8 shows the results of the observed and estimated values of the A2 decomposed series by applying the WGEP for the eight socioeconomic parameters in the selected countries. Figure 9 shows the observed and estimated socioeconomic parameters during the testing period from the WGEP method. The corresponding *R*² values of the WGEP model are 0.734, 0.761, 0.767, 0.842, 0.999, 0.872, 0.864, and 0.876 for *GDPC*, *II*, *EI*, *HDI*, *PD*, *IU*, *MR*, and *RPPW*, respectively. Figure 10 shows the comparisons of the *R*, *RMSE*, and *MAE* values of the ANN, SVM, GEP, and WGEP methods, indicating that the WGEP model had the best performance.
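The additive structure that the WGEP model relies on (final estimate = A2 estimate + D2 estimate + D1 estimate) can be illustrated with a short sketch. Note the substitution: for brevity it uses a Haar-style block-average decomposition instead of the db4 wavelet applied in this study, and it passes the exact subseries in place of GEP estimates. All function names and the input series are hypothetical.

```python
def block_means(series, block):
    """Replace each consecutive block of `block` samples by its mean."""
    smoothed = []
    for start in range(0, len(series), block):
        chunk = series[start:start + block]
        smoothed.extend([sum(chunk) / len(chunk)] * len(chunk))
    return smoothed


def haar_components(series):
    """Split a series into additive parts A2, D2, and D1 (Haar, level 2).

    The length is assumed to be a multiple of 4.
    """
    a1 = block_means(series, 2)  # level-1 approximation
    a2 = block_means(series, 4)  # level-2 approximation (low frequency)
    d1 = [x - a for x, a in zip(series, a1)]  # finest detail
    d2 = [a - b for a, b in zip(a1, a2)]      # coarser detail
    return a2, d2, d1


def combine(a2_hat, d2_hat, d1_hat):
    """Final estimate as the sum of the three subseries estimates."""
    return [a + b + c for a, b, c in zip(a2_hat, d2_hat, d1_hat)]


# Hypothetical normalized series, for illustration only.
series = [0.1, 0.3, 0.2, 0.6, 0.5, 0.9, 0.7, 1.0]
a2, d2, d1 = haar_components(series)

# With perfect subseries estimates the sum recovers the original series.
rebuilt = combine(a2, d2, d1)
print(rebuilt)
```

In a hybrid model of this kind, each of the three subseries would be predicted by its own fitted model, and `combine` would then add the three predictions.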

$Q_1$ (25%), the median (50%), the third quartile $Q_3$ (75%), and the maximum. As shown in Figure 11, the *PD* and *EI* parameters have the minimum median and the *IU* parameter has the maximum median. The *EI* parameter has lower values of the first to third quartiles. The estimated *HDI* has the maximum value in the third quartile (75%). Most *MR* and *GDPC* values vary between the first quartile (25%) and the third quartile (75%). Table 6 shows the ranks of the eight parameters, in terms of *R*, *RMSE*, and *MAE*, for the models used in this study, indicating that *PD* is the best calculated parameter in all four models.

Ranking | ANN | SVM | GEP | WGEP |
---|---|---|---|---|
1 | PD | PD | PD | PD |
2 | HDI | HDI | HDI | HDI |
3 | RPPW | RPPW | RPPW | RPPW |
4 | II | II | II | II |
5 | MR | MR | MR | EI |
6 | IU | IU | EI | IU |
7 | EI, GDPC | EI | IU | MR |
8 | – | GDPC | GDPC | GDPC |


Table 7 lists all the mathematical equations used in the models to estimate the socioeconomic parameters. The performances of the models for all selected socioeconomic parameters for the studied countries follow the order WGEP > GEP > SVM > ANN (Table 4). The WGEP model outperformed its simple form (i.e., GEP) for all eight parameters. In terms of *RMSE*, the WGEP model improved the performance by 22, 18, 26, 12, 22, 25, 18, and 16% compared with the GEP model for *GDPC*, *II*, *EI*, *HDI*, *PD*, *IU*, *MR*, and *RPPW*, respectively.
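The improvement percentages are consistent with the relative reduction in RMSE from GEP to WGEP. That reading is an inference from the Table 4 values rather than a definition given in the text, and the sketch below checks it; the function and variable names are hypothetical. Because Table 4 is rounded to three decimals, the recomputed values can differ from the quoted ones by up to about one percentage point.

```python
# Test-period RMSE values copied from Table 4.
RMSE_GEP = {"GDPC": 0.240, "II": 0.195, "EI": 0.227, "HDI": 0.153,
            "PD": 0.014, "IU": 0.228, "MR": 0.212, "RPPW": 0.189}
RMSE_WGEP = {"GDPC": 0.188, "II": 0.160, "EI": 0.167, "HDI": 0.135,
             "PD": 0.011, "IU": 0.172, "MR": 0.175, "RPPW": 0.158}

# Improvement percentages quoted in the text.
REPORTED = {"GDPC": 22, "II": 18, "EI": 26, "HDI": 12,
            "PD": 22, "IU": 25, "MR": 18, "RPPW": 16}


def rmse_reduction_percent(baseline, improved):
    """Relative RMSE reduction of `improved` over `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


reductions = {name: rmse_reduction_percent(RMSE_GEP[name], RMSE_WGEP[name])
              for name in RMSE_GEP}

for name, value in reductions.items():
    print(f"{name}: computed {value:.1f}%, reported {REPORTED[name]}%")
```

The same function can be applied to the ANN or SVM columns of Table 4 to compare WGEP against the other baselines.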

Parameters | Subseries | Equations |
---|---|---|
GDPC | A2 | |
GDPC | D2 | |
GDPC | D1 | |
II | A2 | |
II | D2 | |
II | D1 | |
EI | A2 | |
EI | D2 | |
EI | D1 | |
HDI | A2 | |
HDI | D2 | |
HDI | D1 | |
PD | A2 | |
PD | D2 | |
PD | D1 | |
IU | A2 | |
IU | D2 | |
IU | D1 | |
MR | A2 | |
MR | D2 | |
MR | D1 | |
RPPW | A2 | |
RPPW | D2 | |
RPPW | D1 | |
For all parameters | Final equation | Equation of A2 + Equation of D2 + Equation of D1 |


$H_{ts}$, *GDPC*, *II*, *UR*, *EI*, *HDI*, *PD*, *IU*, *RPPW*, and *MR* denote renewable water per capita (Hydro), GDP per capita, income index, unemployment rate, exports and imports, human development index, population density, internet users, proportion of rural population served with piped water, and mortality rate (under five years old), respectively.

According to Table 7, various operators have been used to increase the accuracy of the models, and these relations have been applied to quantify the dependence between socioeconomic parameters and water resources. In the GEP models, it is also possible to select simpler mathematical equations to reduce the number of operators, but doing so may reduce the accuracy of the proposed models (Bagatur & Onen 2018).

In addition, the results from this study highlighted the importance of examining the relationships between the status of water resources and socioeconomic parameters. This study indicated that water resources parameters had significant impacts on socioeconomic parameters (Sivapalan *et al.* 2012). WGEP had the best performance among all the data-driven models used for predicting the socioeconomic parameters in this study. In fact, the socioeconomic conditions of a country can be a good indicator of the status of its water resources, and the two are mutually related, which is very important for decision-making in integrated water resources management.

## CONCLUSIONS

Water resources are important in terms of production and social, economic, and environmental values for a country. Socioeconomic considerations are needed to cope with the decreasing trend of water resources in many countries in recent decades and the increasing demand for water resources. The new contributions of this study include the following: (1) To the best of our knowledge, this is the first effort to jointly apply various data-driven methods, including artificial neural networks, SVMs, GEP, and WGEP, for analyses of linked hydrologic and socioeconomic systems. (2) Different socioeconomic parameters, including *GDPC*, *II*, *EI*, *HDI*, *PD*, *IU*, *MR*, and population served with piped water (*RPPW*), were estimated by using *RWPC* as a representative parameter of water resources. (3) The potential to expand the mathematical relationships in different spatial and temporal dimensions was assessed. In this study, the relationship between water resources and socioeconomic parameters was modeled by data-driven methods and their performances were compared and assessed. This study indicated that the hybrid data-driven models based on the wavelet theory improved the performances of GEP models. It was demonstrated that the WGEP models had the best performance and the ANN models showed the poorest performance. Thus, it is possible to assess the socioeconomic status of a region/country by developing such models before implementing major water projects. The methods developed in this study can significantly improve the related water resources planning and management and also provide useful information for socioeconomic development. The main limitation of this study is the unavailability of data on all socioeconomic parameters that are likely to be strongly correlated with water resources. In future research, other data mining models can be used to characterize the relationship between water resources and socioeconomic parameters. Other socioeconomic parameters that account for different environmental and/or political dimensions, such as the health index, happiness index, and employment rate, can also be studied.

## ACKNOWLEDGEMENTS

The authors thank Iran's National Science Foundation (INSF) for its financial support for this research.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.