Prediction and analysis of water resources demand in Taiyuan City based on principal component analysis and BP neural network

Water is a fundamental natural resource and a strategic economic resource that plays a vital role in economic and social development. With accelerating urbanization and industrialization in China, the potential demand for water resources will be enormous. Accurate prediction of water resources demand is therefore important for formulating industrial and agricultural policies, developing economic plans, and many other purposes. In this study, we develop a model based on principal component analysis (PCA) and a back propagation (BP) neural network to predict water resources demand in Taiyuan, Shanxi Province, a city with severe water shortage in China. The prediction accuracy is then compared with that of PCA-ANN, ARIMA, NARX, Grey-Markov, serial regression, and LSTM models, and the results show that the PCA-BP model outperforms the other models on many evaluation indicators. The proposed PCA-BP model uses PCA to reduce the dimensionality of the high-dimensional variables and transform them into uncorrelated composite data, which makes them easier to compute. More importantly, BP and the adjustment of weights and thresholds during model training further improve the prediction accuracy of the model. The model analysis will provide an important reference for water demand assessment and optimal water allocation in other regions.

government, and a critical link in its goal for the construction of an ecological civilization, has been a clean, safe, and sufficient water supply, which has, in turn, made the accurate prediction of domestic water supply become the primary prerequisite for the formulation of optimal provision and allocation of the resource.
Shanxi Province is a critical energy and heavy chemical base in China and one of the provinces with severe water shortages. As the capital of Shanxi Province, Taiyuan also faces an urgent water resources situation. According to the Shanxi Water Resources Bulletin, Taiyuan's water resources per capita is 177 m³, less than 9% of the national per capita amount, which places the city in a state of extreme water shortage as determined by the United Nations. Because of this serious shortage, Taiyuan City has to rely on more than 0.7 billion m³ of groundwater every year to support economic and social development and to supply water to residents. Over the years, serious overdrafting of groundwater has led to a significant drop in the groundwater level in the area: the aquifer has been continuously drained, a large number of shallow wells have been scrapped, and water extraction wells have had to be drilled ever deeper. Together with coal mining seepage that has not been effectively curbed, this puts the groundwater resources of Taiyuan in great danger. Surface water pollution further exacerbates the shortage of water resources in Taiyuan City, making the contradiction between the supply and demand of water resources more prominent. With economic development and population growth, Taiyuan City has a growing demand for water resources, which poses a serious threat to sustainable development. Therefore, it is of great practical significance to study and forecast the water resources demand in Taiyuan City, which can provide useful theoretical references for the rational allocation of water resources and the coordinated development of the regional economy.
The remainder of this paper is divided into four parts. Section 2 introduces the current research status and shortcomings of water resources prediction. Section 3 discusses the two models, principal component analysis (PCA) and the back propagation (BP) neural network. Section 4 takes the relevant data of Taiyuan City, which has a large population and a relative lack of water resources, as an example: the demand for water resources is analyzed and predicted with the PCA-BP model, and the prediction results are evaluated and compared with those of other models. The comparison shows that the PCA-BP model outperforms the other models in all indicators. The last section summarizes this study and gives relevant suggestions.

CURRENT STATUS OF RESEARCH ON WATER RESOURCES PREDICTION
Organizing and using water resources effectively and rationally is essential to the sustainable development of urban water resources. Precise prediction of water demand is the primary task in the optimal allocation of water resources, so it is necessary to analyze the models and methods used to forecast water demand accurately. However, water resources demand is affected by many uncertain factors such as climate, population, and economic level, which makes precise prediction of water consumption a great challenge. At present, the existing methods and models have different application scenarios depending on the periodicity of the prediction variables and the size of the prediction range (Donkor et al. 2014). Billings & Jones (2008) divide water demand forecasts into three terms: a forecast of more than 2 years is a long-term forecast, a forecast of 3 months to 2 years is a medium-term forecast, and a forecast of fewer than 3 months is a short-term forecast.
Short-term water resources demand can be predicted with time-series models. Using historical data, Wong et al. (2010) developed a time-series model of daily urban water consumption based on rainfall and temperature and applied it to the prediction of water consumption in Hong Kong, China. To handle the nonlinear variation of the data, Adamowski et al. (2012) used the artificial neural network (ANN) to predict short-term urban water demand. The results show that the ANN is superior to linear regression in urban water demand prediction. Besides ANNs, the support vector machine (SVM) is another machine learning technique for forecasting short-term water demand. Herrera et al. (2010) established a support vector regression model to predict the future urban water demand of southeast Spain. The model's predictions were more accurate than those of ANNs and other machine learning methods. Additionally, the support vector regression model developed by Braun et al. (2014) for the 24-h water demand prediction of residential areas in Berlin performs better than the seasonal autoregressive model. Since then, support vector regression has been widely used in urban water demand forecasting, and the effectiveness of the method has been further verified (Wang et al. 2015; Shabani et al. 2017; Antunes et al. 2018; Bata et al. 2020; Deng et al. 2020a). For the medium-term forecast, an ANN method was proposed to forecast Bangkok's 6-month water demand (Babel & Shinde 2011). Ziervogel et al. (2010) used information on seasonal climate change to predict and plan water resources in South Africa. Lvu (2014) established a precipitation forecast model of Zhengzhou City using time-series analysis and gave 3-month precipitation forecast results. Polebitski & Palmer (2010) developed a water demand regression model that could accurately predict a single household's water demand at a bimonthly time step.
Ordinarily, nonlinear models, statistical analysis models, and grey forecasting models can be used for long-term forecasting (Hernandez et al. 2014). The nonlinear models mainly include ANN (Shouxiang et al. 2016; Ren et al. 2020), SVM (Zhu & Wei 2013), time-series analysis models (Angelopoulos et al. 2019), and Markov chain models (Tsiliyannis 2018). The Grey model is a prediction model based on grey theory, which is mainly applied in uncertain settings with little data and information. Through data processing and analysis, it establishes a prediction model, predicts the development trend, and makes a reasonable evaluation (Ding 2018; Deng et al. 2020e). Wang et al. (2021b) used the optimized Grey-Markov model to forecast domestic water consumption in Shaanxi Province, China. The results showed that the accuracy of the model was significantly improved compared with the general grey model and the unbiased grey model.
In recent years, several new water resources prediction models have been proposed to predict water demand more accurately. Tian et al. (2016) used a simulation method and newly developed retrospective forecasts from numerical weather prediction to improve the short-term forecasting of urban water demand. Their results demonstrate that the simulation method based on numerical prediction has good prospects for improving prediction accuracy. Sanchez et al. (2020) established indexes of social economy, environment, and landscape pattern and used a geographically weighted regression model to predict urban water demand. The model takes the water demand of North Carolina and South Carolina as empirical cases to evaluate the influence of population density and climate warming on future water demand. The research also reveals that the prediction results are directly affected by the parameter values of the water demand prediction model. Therefore, the reasonable calculation of model parameters is key to the accuracy of water demand prediction (Rehman et al. 2017). Using historical water demand data, Oliveira et al. (2017) applied the harmony search algorithm to optimize the parameters of the autoregressive integrated moving average (ARIMA) model, which improved the efficiency of short-term water demand forecasting. Mohammad & Pezhman (2019) used the extended ARIMA model and the nonlinear autoregressive exogenous (NARX) method to predict Teheran water consumption successfully. Recently, stochastic optimization algorithms based on biological evolutionary mechanisms have become mainstream tools for complex problems in which the initial values are difficult to choose and the objective functions struggle to meet the accuracy requirements (Deng et al. 2021a, 2021b; Li et al. 2021; Wang et al. 2021a).
The algorithm, as a population intelligence model, searches and optimizes the solution space through information exchange and sharing, simulating animal foraging behavior and population optimization for survival (Deng et al. 2020b, 2020c, 2020d). It has the advantages of simple design, fast convergence speed, and few control parameters. For example, the particle swarm optimization (PSO) algorithm is a representative meta-heuristic swarm intelligence algorithm that integrates the social and historical cognition of particles into their behavior, which opens up a new way to solve optimization problems based on the survival principle of animal evolution (Lalwani et al. 2019; Deng et al. 2020b; Wu et al. 2021). Since then, many scholars have proposed swarm intelligence algorithms with different mechanisms, such as the moth flame optimization algorithm (Mirjalili 2015a), ant lion optimizer (Mirjalili 2015b), grey wolf optimization algorithm (Teng et al. 2019), Harris hawk optimization (Heidari et al. 2019), and SALP colony algorithm (Braik 2021). Because of their excellent performance on optimization problems, swarm intelligence optimization algorithms have been used to predict water demand in recent years. Bai et al. (2014) proposed an urban water demand estimation method based on multi-scale measurement, in which the adaptive chaotic PSO algorithm was used to search for the optimal weighting factor of the relevance vector regression model. Wang et al. (2018) proposed a hybrid model based on linear and exponential models, using the firefly algorithm to solve the weight operators. Guo et al. (2020) proposed an improved whale optimization algorithm based on social learning and a wavelet mutation strategy and used the CEC2017 benchmark functions to verify the superiority of the algorithm. Du et al. (2021) used PCA to reduce the dimensionality of the factors affecting urban water demand, followed by a discrete wavelet transform to eliminate the noisy part of the water demand data. Then, LSTM was used to predict water demand with satisfactory results.
Nevertheless, previous studies often focus on the model design itself and do not fully consider the functional relationship between water resources demand and its influencing factors. In addition, many variables constrain water resources demand, and their degrees of influence differ. Therefore, it is necessary to quantify and compare the influence of the various factors and select the key indicators as the input of the prediction model. Through in-depth study of historical data, it becomes more feasible and practical to identify regular patterns and forecast future water resources demand.

BASIC PRINCIPLE AND METHOD STEPS OF THE MODEL
This section is divided into three parts. The basic principle of PCA is given in the first subsection. Next, the principle and steps of BP neural network are shown in the second subsection. Finally, the methods to test the accuracy of the model are listed in the third subsection.

Basic principle of PCA
PCA was first proposed by K. Pearson in 1901, then improved by Hotelling in 1933 and extended to random vectors. The mathematical model of PCA is given by Equation (1). The specific operation steps of PCA are as follows:
(1) Standardize the original data to prevent the dimensional difference between different indicators from affecting the results.
(2) Establish the correlation coefficient matrix $R$ between variables.
(3) Calculate the eigenvalues $\lambda_j$ and eigenvectors of the correlation coefficient matrix $R$.
(4) Get the principal component factors and calculate the comprehensive score.
Calculate the information contribution rate and cumulative contribution rate of the eigenvalues $\lambda_j$ ($j = 1, 2, \cdots, m$), respectively. Among them, $b_j = \lambda_j / \sum_{k=1}^{m} \lambda_k$ is the information contribution rate of the $j$th principal component, and $\alpha_p = \sum_{k=1}^{p} \lambda_k / \sum_{k=1}^{m} \lambda_k$ is the cumulative contribution rate of the first $p$ principal components. When $\alpha_p$ reaches 0.85, the first $p$ components are able to explain the original variables. Therefore, the first $p$ variables are selected as the principal component factors.
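The four PCA steps and the 0.85 cumulative-contribution rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the SPSS procedure used in the study: the function name `pca_by_correlation`, its return values, and the use of the sample standard deviation (`ddof=1`) are assumptions made for the example.

```python
import numpy as np

def pca_by_correlation(X, threshold=0.85):
    """Hypothetical helper: PCA on the correlation matrix of X (rows = years, columns = indicators)."""
    X = np.asarray(X, dtype=float)
    # Step 1: standardize each indicator (assumes no indicator is constant).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 2: correlation coefficient matrix R between the indicators.
    R = np.corrcoef(Z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors of the symmetric matrix R, largest first.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Contribution rate b_j and cumulative contribution rate alpha_p.
    b = eigvals / eigvals.sum()
    alpha = np.cumsum(b)
    # Smallest p whose cumulative contribution reaches the threshold.
    p = min(int(np.searchsorted(alpha, threshold)) + 1, len(eigvals))
    # Step 4: principal component scores of the first p components.
    scores = Z @ eigvecs[:, :p]
    return eigvals, b, alpha, p, scores
```

The returned `scores` columns are mutually uncorrelated, which is the property that makes them convenient inputs for a downstream prediction model.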

Principle and steps of BP neural network
ANNs have long been an active frontier of research in the international academic community. They have been widely used in various fields, such as power load forecasting (Xie et al. 2020; Wang et al. 2021c), prediction of river runoff (Ghose et al. 2018), and prediction of the return rate of capital markets (Galeshchuk 2016). Common neural network structures include the RBF neural network (Huang & Yang 2020), particle swarm optimization neural network (Li et al. 2017), BP neural network (Ma et al. 2017), and genetic neural network (Ding et al. 2014). Figure 1 shows the mathematical model of a single neuron.
AQUA -Water Infrastructure, Ecosystems and Society Vol 00 No 0, 5

Uncorrected Proof
A typical neural network generally consists of three to four layers: an input layer, a hidden layer for data processing, and an output layer for result output (Figure 2). The relationship between input and output can be expressed as $y_i = f\left(\sum_{j=1}^{n} w_{ij} x_j - \theta\right)$, where $\theta$ is the threshold value and $\sum_{j=1}^{n} w_{ij} x_j$ is the net activation, that is, the sum of each neuron input multiplied by its weight. In the neural network, the activation function $f$ is the mapping between the net activation and the output. The structure of the ANN is shown in the literature (Adesanya et al. 2021; Cicek & Ozturk 2021). A neural network operates in two states: the learning state and the working state. In the learning state, the weights of the network are adjusted to bring the output close to the real value, while in the working state, the established network is used to classify and predict without changing the weights. The learning mode of the neural network is supervised learning: the weights of the network are adjusted according to the difference between the actual output and the expected output of the network so that the model fits as accurately as possible. The BP algorithm used in this study is mainly composed of two parts: the forward propagation of the signal and the back propagation of the error. The basic idea of the algorithm is to use the gradient search technique to minimize the mean square error between the actual output value and the expected output value; by propagating the error layer by layer, an error estimate for each layer is obtained. Then, the weights of each layer are modified until the error falls within the acceptable range.
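The single-neuron relationship described above (weighted sum of the inputs, minus a threshold, passed through an activation function) can be illustrated as follows. The sigmoid activation is an assumption made for this sketch, since the source's activation formula is not reproduced in the text, and the names `sigmoid` and `neuron_output` are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation; assumed here, not confirmed by the source."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs, theta, activation=sigmoid):
    """Output of one neuron: activation(sum_j w_j * x_j - theta)."""
    # Net activation: each input multiplied by its weight, then summed.
    net = sum(w * x for w, x in zip(weights, inputs))
    # Subtract the threshold and map through the activation function.
    return activation(net - theta)
```

With weights `[0.5, -0.25]`, inputs `[2.0, 4.0]`, and a zero threshold, the net activation is zero, so the sigmoid returns 0.5; raising the threshold lowers the output.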
Suppose there are n neurons in the input layer, p neurons in the hidden layer, and q neurons in the output layer. The parameters and functions of the BP model are listed in Table 1.
The steps of the BP neural algorithm are as follows:
Step 1. Calculate the input and output of each layer.
Step 2. Use the expected output and the actual output of the network to calculate the partial derivatives of the error function with respect to the neurons in the output layer.
Step 3. Use the connection weights from the hidden layer to the output layer, the output-layer term $\delta_o(k)$, and the output of the hidden layer to calculate the partial derivative of the error function with respect to each neuron in the hidden layer, $\delta_h(k)$.
Step 4. Use the error term of each neuron in the output layer and the output of each neuron in the hidden layer to correct the connection weight $w_{oh}(k)$. The parameter $\mu$ is the set learning rate, and the error function $e$ is one half of the sum of squared differences between the expected and actual outputs.
Step 5. Calculate the global error.
Step 6. Judge whether the network error meets the set accuracy requirements. When the error reaches the preset accuracy or the number of learning iterations exceeds the set maximum, the algorithm ends. Otherwise, the next learning sample and the corresponding expected output are selected, and the algorithm returns to the next round of learning.
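The six steps can be sketched as a small NumPy training loop for a network with one hidden layer. This is an illustration under stated assumptions rather than the authors' implementation: sigmoid activations in both layers, a squared-error function with the factor 1/2, per-sample weight updates, and the names `train_bp`, `w_ih`, `w_ho`, `b_h`, and `b_o` are all choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, D, p=4, mu=0.5, target_error=1e-3, max_epochs=10000, seed=0):
    """Hypothetical BP trainer: X is (samples, n) inputs, D is (samples, q) expected outputs."""
    rng = np.random.default_rng(seed)
    n, q = X.shape[1], D.shape[1]
    w_ih = rng.uniform(-1, 1, size=(n, p))   # input -> hidden weights
    w_ho = rng.uniform(-1, 1, size=(p, q))   # hidden -> output weights
    b_h, b_o = np.zeros(p), np.zeros(q)      # thresholds of hidden and output neurons
    for epoch in range(1, max_epochs + 1):
        for x, d in zip(X, D):
            # Step 1: input and output of each layer (threshold is subtracted).
            h = sigmoid(x @ w_ih - b_h)
            y = sigmoid(h @ w_ho - b_o)
            # Step 2: output-layer term, the negative derivative of e = 1/2 * sum((d - y)^2).
            delta_o = (d - y) * y * (1 - y)
            # Step 3: hidden-layer term, propagated back through w_ho.
            delta_h = (w_ho @ delta_o) * h * (1 - h)
            # Step 4: correct weights and thresholds with learning rate mu.
            w_ho += mu * np.outer(h, delta_o)
            b_o -= mu * delta_o
            w_ih += mu * np.outer(x, delta_h)
            b_h -= mu * delta_h
        # Step 5: global error averaged over all samples.
        Y = sigmoid(sigmoid(X @ w_ih - b_h) @ w_ho - b_o)
        E = 0.5 * np.sum((D - Y) ** 2) / len(X)
        # Step 6: stop when the preset accuracy is reached (or epochs run out).
        if E <= target_error:
            break
    return (w_ih, b_h, w_ho, b_o), E, epoch
```

Because the output layer here is a sigmoid, expected outputs should be scaled into the interval (0, 1) before training; a linear output layer would be an equally reasonable choice for regression.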

Test of model accuracy
The methods to test the accuracy of the model generally include the mean absolute error test, mean relative error test, residual test, and posterior variance test. In this study, the average absolute error and the average relative error are used to test the prediction results of the research model. The absolute error of the kth period is $|x_k - L_k|$, and the corresponding mean absolute error is $\frac{1}{n}\sum_{k=1}^{n} |x_k - L_k|$. The relative error is $\delta_k = \frac{|x_k - L_k|}{L_k} \times 100\%$, and the average relative error is $\bar{\delta} = \frac{1}{n}\sum_{k=1}^{n} \delta_k$. In the above formulas, $x_k$ and $L_k$ are the predicted value and the true value of the kth period, respectively, and $n$ is the number of periods in the test period.
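As a small sketch of the two accuracy measures, the function below computes the mean absolute error and the mean relative error (in percent) from lists of predicted and true values. The function name `error_metrics` is hypothetical, and the numbers in the usage note are made up for illustration; they are not taken from the study's tables.

```python
def error_metrics(predicted, actual):
    """Return (mean absolute error, mean relative error in %) over the test periods."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    n = len(actual)
    # Absolute error of each period: |x_k - L_k|.
    abs_errors = [abs(x - L) for x, L in zip(predicted, actual)]
    # Relative error of each period, as a percentage of the true value.
    rel_errors = [a / abs(L) * 100.0 for a, L in zip(abs_errors, actual)]
    return sum(abs_errors) / n, sum(rel_errors) / n
```

For example, predictions of 9 and 22 against true values of 10 and 20 give absolute errors of 1 and 2 and relative errors of 10% each, so the mean absolute error is 1.5 and the mean relative error is 10%.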

ANALYSIS OF WATER RESOURCES PREDICTION MODEL IN TAIYUAN CITY
This section is divided into six parts. The first subsection introduces the physical geography and social economy of Taiyuan City to give a better understanding of the research object. The second part introduces the dataset. The third part analyzes the results of the PCA model. Next, according to the selected influencing factors, the details of the BP neural network model are provided in the fourth subsection. In the following subsection, the prediction performance of the proposed model is compared with that of other recent models. Finally, the future water consumption of Taiyuan City is predicted in the last subsection. The data processing flow in the study is as follows. First, we analyze the influencing factors of domestic water demand in Taiyuan and identify nine closely related influencing factors. Next, the influencing factor dataset is selected from the Taiyuan Water Resources Bulletin and the Statistical Bulletin on the National Economic and Social Development of Taiyuan City. Through PCA, the nine influencing factors are reduced to the first three principal components. Then, the first three principal components are input into the BP neural network model for training and verification to obtain the final prediction results. The relevant flow chart of the paper is shown in Figure 2.

Physical geography
Taiyuan City is located in the central part of Shanxi Province in China, with an average altitude of about 800 m. The terrain is high in the north and low in the south. The city is bordered by the Taihang Mountains to the east and the Luliang Mountains to the west, with Yunzhong Mountain and Xizhou Mountain to the north. The climate type is the north temperate continental climate. The annual precipitation distribution is very uneven, and the temperature varies greatly during the day. Precipitation is mainly concentrated in summer, and the winter is long, cold, and dry. In spring, the temperature rises rapidly and there are more gale days. At this time, the rain belt has not yet moved to northern China and evaporation is high, so spring droughts occur. The annual average precipitation is 468.4 mm.

Social and economic situation
Taiyuan is a city with a long history, connecting Jinzhong City in the south and Yangquan City in the east. The Fenhe River, the second-largest tributary of the Yellow River, flows through Taiyuan City from north to south. Taiyuan has long been a commercial center of China, and the 'Shanxi Merchants' are famous all over the world. It is also a heavy industry base and resource base, with deposits of iron, copper, lead, manganese, and other metals. In 2017, the city's total population reached 4.3797 million, and the urbanization rate reached 84.7%. In 2018, Taiyuan's gross domestic product (GDP) was 388.448 billion yuan, with a growth rate of 9.2%. By the end of 2018, Taiyuan City had jurisdiction over six municipal districts, three counties, and a total of 54 streets.

Introduction to the dataset
The dataset is selected from the Taiyuan National Economic and Social Development Statistical Bulletin and the Taiyuan Water Resources Bulletin. Nine indicators (precipitation, resident population, GDP, total water resources, total industrial output value, total agricultural output value, average temperature, annual average relative humidity, and annual sunshine hours) from 2012 to 2020 were selected as the influencing factors of water demand in Taiyuan City. The dataset is shown in Table 2.

Analysis of the results of the PCA model
In this study, we set the following variables to represent the indicator values: $x_1$ is the annual rainfall (m), $x_2$ is the permanent population (million people), $x_3$ is the gross regional product (100 billion yuan), $x_4$ is the total water resources (100 billion m³), $x_5$ is the total industrial output value ($10^4$ yuan), $x_6$ is the total agricultural output value ($10^4$ yuan), $x_7$ is the annual average temperature (°C), $x_8$ is the annual average relative humidity (%), and $x_9$ is the annual sunshine hours (h). Taking these nine indicators as the influencing factors, SPSS 26 is used to standardize the data, and then PCA is carried out to calculate the eigenvectors, eigenvalues, and cumulative variance contribution rate. Tables 3 and 4 show the PCA results for the factors affecting water demand in Taiyuan City. From Table 3, it can be seen that the eigenvalue of the first principal component is 4.799, the eigenvalue of the second principal component is 1.626, and the eigenvalue of the third principal component is 1.255. The variance percentage of a component is its variance divided by the sum of the variances of all components. According to the principle that the first k components with eigenvalues greater than 1 or a cumulative contribution percentage greater than 85% are selected as principal components, the first three components are selected.
From Table 4, it can be seen that the first principal component has the highest correlation with GDP, gross industrial output value, gross agricultural output value, higher correlation with the number of population and sunshine hours per year, and a smaller and negative correlation with precipitation. The second principal component has the highest correlation with precipitation and is positively correlated. The third principal component has a higher correlation with the average temperature per year.
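The selection of three components can be checked with simple arithmetic on the eigenvalues quoted from Table 3. The sketch below assumes that the total variance equals 9, which holds when PCA is run on the correlation matrix of nine standardized indicators; the percentages printed in Table 3 itself may differ slightly through rounding.

```python
# Eigenvalues of the first three principal components, as quoted in the text.
eigenvalues = [4.799, 1.626, 1.255]
# Assumption: nine standardized indicators, so the eigenvalues of R sum to 9.
total_variance = 9

# Contribution rate of each component and the running cumulative rate.
rates = [ev / total_variance for ev in eigenvalues]
cumulative = []
running = 0.0
for r in rates:
    running += r
    cumulative.append(running)

for j, (r, c) in enumerate(zip(rates, cumulative), start=1):
    print(f"PC{j}: contribution {r:.1%}, cumulative {c:.1%}")
```

Under this assumption the three components contribute about 53.3%, 18.1%, and 13.9%, for a cumulative 85.3%. Two components alone reach only about 71.4%, so the third is needed to pass the 85% threshold, and all three also satisfy the eigenvalue-greater-than-1 rule.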

Construction of the BP neural network model
After selecting the first, second, and third principal components as the main factors affecting the water demand of Taiyuan City, this work takes the historical data as the input sample of the BP neural network model and the annual water consumption of Taiyuan from 2009 to 2017 as the output sample. In traditional machine learning practice, the training and test sets are generally divided 7:3. Our study covers 12 years in total, so the training set should generally contain eight to nine samples. To make the model training relatively adequate and the test sets comparable, we set the data from 2009-2017 as the training set and the data from 2018-2020 as the test set. The test set is used to observe whether the error between the output of the neural network and the actual results reaches the expected level. Finally, the model is applied to the prediction of water resources demand in Taiyuan City from 2021 to 2030 to observe future changes in water resources demand. The number of iterations is set to 10,000 and the target error to $10^{-7}$; Figure 3 shows the error curve during training. It can be seen from Figure 3 that the final minimum error of the BP neural network is $3.6545 \times 10^{-13}$, which meets the expected requirements. We then evaluate the model on the test set. The results are shown in Table 5.
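The chronological split described above can be sketched as follows. The helper `split_by_year` and the placeholder values are hypothetical; in practice each year would map to its principal component scores and observed water consumption.

```python
def split_by_year(records, last_train_year):
    """Split a {year: sample} mapping chronologically, without shuffling."""
    train = {y: v for y, v in records.items() if y <= last_train_year}
    test = {y: v for y, v in records.items() if y > last_train_year}
    return train, test

# Placeholder samples for 2009-2020; real inputs are not reproduced here.
records = {year: None for year in range(2009, 2021)}
train, test = split_by_year(records, last_train_year=2017)
print(len(train), len(test))  # 9 training years, 3 test years
```

Nine of twelve samples is a 75% training share, close to the 7:3 convention (which would give 8.4 samples), and keeping the most recent years for testing avoids leaking future information into training.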

Compared with other prediction algorithms
To show that the BP neural network model predicts more accurately, this study compares its results with those of the Grey prediction model, the optimized Grey-Markov model, the time-series prediction model (Sena & Nagwani 2016), the ARIMA model (Jamil 2020), and the NARX method (Mohammad & Pezhman 2019), taking the water consumption data of Taiyuan from 2009 to 2017 as the training sample and the data from 2018 to 2020 as the prediction sample. Table 6 shows the forecasts of the different models for water demand in Taiyuan City in 2018, 2019, and 2020, together with the error comparison.
As can be seen from Table 6, the average absolute error of the PCA-BP model is 0.07, and the average relative error is 0.92%. Compared with the optimized Grey-Markov chain model, Grey prediction model, serial regression model, ARIMA model, NARX model, LSTM model, NAR model, and ANN model, the average absolute error decreased by 0.6277, 0.5785, 0.2913, 0.3869, 0.0179, 0.12, 0.554, and 0.163, and the average relative error decreased by 7.71, 7.11, 3.56, 4.76, 0.16, 1.458, 6.71, and 1.99%, respectively. The comparison shows that the prediction results of the PCA-BP model are better than those of each of the other models, which demonstrates the superiority of the PCA-BP model; the model can therefore serve as an important method for future water demand prediction in Taiyuan City.

Forecast of future water consumption in Taiyuan City
On the basis of the test results of the PCA-BP model, the future water consumption in Taiyuan City can be reasonably predicted. According to the historical GDP growth of Taiyuan City, this study assumes that GDP grows by 7% each year relative to the previous year and that the resident population (million) increases by 0.93% every year. The annual precipitation is taken as the 10-year moving average of precipitation in Taiyuan City, and the results are shown in Table 7. According to the prediction results of the PCA-BP neural network model, the domestic water consumption of Taiyuan in 2025 and 2030 will be 962.85 and 1,053.59 million m³, respectively, an increase of 18.3 and 29.4% compared with 814 million m³ in 2020. This indicates that Taiyuan will face a large water gap in the coming years.
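The scenario assumptions and the reported increases can be reproduced with elementary arithmetic. The sketch below is only illustrative: the helper names are hypothetical, the three consumption figures are taken in the same unit, and the projected drivers would still have to pass through the PCA-BP model to yield a demand forecast.

```python
def compound(value, annual_rate, years):
    """Value after growing by a fixed annual rate for a number of years."""
    return value * (1.0 + annual_rate) ** years

def next_moving_average(history, window=10):
    """Next-year estimate as the mean of the most recent `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def percent_increase(new, base):
    """Increase of `new` over `base`, in percent."""
    return (new - base) / base * 100.0

# Reported consumption for 2025 and 2030 against the 2020 value of 814.
print(round(percent_increase(962.85, 814.0), 1))    # 18.3
print(round(percent_increase(1053.59, 814.0), 1))   # 29.4
```

The same `compound` helper shows what the growth assumptions imply over a decade: 7% annual GDP growth roughly doubles GDP (a factor of about 1.97), while 0.93% annual population growth adds just under 10%.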
Based on the above projections for water demand in Taiyuan City from 2021 to 2030, we make the following recommendations:
1. Strengthen the legal system and regulate water use strictly in accordance with the law.
2. Use groundwater correctly and effectively, and allocate water sources reasonably to different industries. After analyzing the proportion of groundwater use in the various industries, redistribute the sources of water supply among them.
3. Strengthen the monitoring and protection of the groundwater geological environment. Establish and improve the groundwater monitoring system and raise the level of dynamic monitoring of groundwater use in Taiyuan City, which can reduce or prevent environmental problems, and gradually improve the long-term monitoring of groundwater quality to prevent water pollution.

CONCLUSION
Water demand forecasting is a typical nonlinear problem with various influencing factors. The weight to be attributed to these variables is difficult to determine because they have complex relationships. This study proposes a water demand forecasting model based on principal component analysis and the BP neural network (PCA-BP) and takes Taiyuan, a city in Shanxi Province, China, with severe water shortage, as the study site. By using PCA, the main factors affecting the water demand in Taiyuan City are found to be annual precipitation, total local output, and local permanent population. After determining the main factors, the BP neural network model is used to predict the water demand for the next few years, and its predictions are compared with those of the optimized Grey-Markov chain model, Grey prediction model, serial regression model, ARIMA model, NARX model, LSTM model, NAR model, and ANN model. The results show that the PCA-BP model has higher prediction accuracy and better performance.
The water resources prediction model based on the BP neural network and PCA also has some shortcomings. First, because the original data collected are limited, the performance of the model cannot be evaluated completely and accurately, which a mature model requires. Second, although the computational efficiency of the model meets the accuracy requirements to a certain extent, there is still room for further improvement. Nevertheless, the model proposed in this study has reference value for the prediction of water resources demand and can be applied more widely. We believe that with further development of the research, the model will become more promising and specific and will better support research on water resources demand forecasting. Future research will mainly focus on the adjustment and optimization of the neural network parameters to further improve the accuracy of the model.