Research on stage–discharge relationship model based on information entropy

To improve the estimation accuracy of the stage–discharge relationship model, a back propagation neural network optimized by a genetic algorithm (GA-BP) and based on information entropy is proposed. First, information entropy and hierarchical clustering are used to cluster the hydrological sample data quickly and to obtain the optimal number of clusters. Second, the k-nearest neighbor algorithm is used to assign new stage data to the most appropriate cluster. Finally, the daily river discharge is estimated. Measured data collected from a hydrological station were used to test the model. The simulation results show that the proposed method achieves higher estimation accuracy than the classical analytical model, the BP neural network algorithm and the GA-BP neural network algorithm, and thus provides a new, effective method for parameter estimation of the stage–discharge relationship model.


INTRODUCTION
The stage-discharge relationship model is a curve describing the relationship between the water level at the basic section of a hydrological measuring station and the flow through that section. It plays an important role in compiling hydrological and water resources data, in hydrological prediction, and in engineering design and construction (Maghrebi et al., 2016). Current equipment for measuring river flow is both error-prone and costly (Nezamkhiavy & Nezamkhiavy, 2014). Therefore, the most effective and common approach is to establish a stage-discharge curve and obtain flow data by converting the average water level, which highlights the importance of the stage-discharge curve (Roushangar & Alizadeh, 2019).
Researchers at home and abroad have proposed a large number of methods for studying stage-discharge relationships. Jain & Chalisgaonkar (2000) took the lead in applying a three-layer feed-forward artificial neural network (ANN) to the modeling of river flow. Their results show that, compared with the traditional curve fitting method, the ANN has obvious advantages and can estimate the discharge well. Lohani et al. (2006) applied the Takagi-Sugeno (TS) fuzzy model to the stage-discharge curve and compared it with the traditional least squares method and the ANN. The results show that TS is superior to both the traditional method and the ANN-based modeling method. Wolfs & Willems (2014) used an ANN and an M5 model tree to train the hydrological data and compared them with the traditional stage-discharge curve. The results show that the ANN can accurately simulate the calibration data, but over-fitting occurs when only a small amount of data is used for training. The architecture of the M5 model tree is easier to interpret than that of the ANN and can achieve higher fitting accuracy. Birbal et al. (2021) proposed Gene Expression Programming (GEP) as an extension of Genetic Programming (GP) and applied it to stage-discharge curves. The method was compared with the traditional stage-discharge relationship curves (SRC) and regression methods, and the results show that the GEP model performs significantly better than the GP model and the traditional model. Kashani et al. (2015) compared the performance of the ANN, the adaptive neuro-fuzzy inference system (ANFIS), gene expression programming and conventional methods (the water level-flow relationship curve and the regression method) using water level and flow data from the Kizlemak River, Turkey. The results show that the machine learning methods achieve high accuracy in terms of the correlation coefficient (r), root mean square error (RMSE) and mean absolute error (MAE), and that the ANFIS model performs best of all the methods. Alizadeh et al. (2021) used a hybrid preprocessing method combining ensemble empirical mode decomposition (EEMD), wavelet transform (WT) and mutual information (MI) with a support vector machine to predict river flow. The results show that the proposed WT-EEMD-MI method can effectively improve the prediction accuracy of river flow.
Due to the influence of natural conditions and human activities, the process of water change is becoming more and more complex, and the classical analytical model of stage-discharge cannot describe the dynamic relationship between stage and discharge well (Petersen-Øverleir, 2006). In recent years, the rapid development of automatic control technology, computer technology and image display technology has provided new methods and means for establishing stage-discharge models and for studying hydrological parameter prediction. Hence, to improve the accuracy of the stage-discharge curve, a back propagation neural network optimized through the genetic algorithm (GA-BP) and based on information entropy is proposed in this paper. The method is a non-parametric learning method based on machine learning. It does not need to form an explicit hypothesis that defines the complete objective function over the entire sample space, and it can form a different local approximation of the objective function for each query sample. The method first uses information entropy and hierarchical clustering (HAC) to set up a non-analytical relationship model between stage and discharge samples, quickly clustering the hydrological data samples and obtaining the optimal number of clusters. It then uses the K-nearest neighbor (KNN) method to classify new water level data into the most appropriate cluster. Finally, the daily flow of the river is estimated using the newly established relationship model. To verify the effectiveness of the proposed method, it is compared and tested against the classical analytical model of stage-discharge, the BP neural network algorithm and the GA-BP neural network algorithm.

STAGE-DISCHARGE RELATIONSHIP MODEL
Classical analytical model
The stage-discharge relationship refers to the empirical relationship between the discharge Q of a basic section and the corresponding water level h. Researchers have proposed different expressions for the stage-discharge relationship curve, of which two forms are common: the polynomial model and the power-law model. The polynomial model is expressed as:

$$Q = a_0 + a_1 h + a_2 h^2 + \cdots + a_m h^m = \sum_{i=0}^{m} a_i h^i$$

where Q is the flow through the basic section (m³/s); $a_0, a_1, \cdots, a_m$ are the undetermined coefficients; and $h^i$ is the i-th power of the water level h (m). The power-law relationship is:

$$Q = k h^b$$

where Q is the flow through the section (m³/s); k is the constant coefficient to be estimated; h is the average water level (m); and b is the exponent. The parameters $a_i$, k and b in the above formulas can generally be obtained by linear regression, least squares and other methods. In this paper, the least squares (LS) method is used to calculate the analytical model.
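As an illustration of how the power-law parameters k and b might be estimated, the following minimal sketch fits the model by ordinary least squares after a logarithmic transform (log Q = log k + b log h). The function name `fit_power_law` and the data are hypothetical; the data are synthetic and are not measured values. Note that fitting in log space minimizes the squared error of log Q rather than of Q itself, which is an assumption of this sketch and not necessarily the procedure used in this paper.

```python
import math


def fit_power_law(stages, discharges):
    """Fit Q = k * h**b by least squares on log Q = log k + b * log h."""
    xs = [math.log(h) for h in stages]
    ys = [math.log(q) for q in discharges]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope and intercept of the simple linear regression in log space.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    k = math.exp(mean_y - b * mean_x)
    return k, b


# Synthetic, illustrative data generated from Q = 2.5 * h**1.8.
stages = [0.5, 1.0, 1.5, 2.0, 3.0]
discharges = [2.5 * h ** 1.8 for h in stages]
k, b = fit_power_law(stages, discharges)
print(f"k = {k:.3f}, b = {b:.3f}")
```

Both stage and discharge must be strictly positive for the logarithms to be defined.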

The least squares method
LS is a traditional mathematical optimization method (Ruiz et al., 1996). It estimates the unknown quantities by minimizing the sum of the squared errors between the estimated data and the measured data. LS is widely used in mathematical statistics, mathematical optimization and prediction. Based on the theory of errors, LS is highly reliable and addresses the problem of how to obtain credible estimates from a set of measured data. In this paper, LS is used to calculate the polynomial form of the stage-discharge curve mentioned above, and the estimated discharge is then obtained from the stage. The main steps are as follows:
Step 1. Determine the degree m of the fitting polynomial.
Step 2. Determine the coefficients of the polynomial according to the LS principle, that is, minimize the sum of squared errors for a polynomial of degree m.
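The two steps above can be sketched as follows. This is a minimal illustration, assuming the polynomial coefficients are obtained from the normal equations and solved by Gaussian elimination. The helper names `solve_linear`, `fit_polynomial` and `predict` are hypothetical, and the data are synthetic rather than measured.

```python
def solve_linear(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    aug = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_polynomial(stages, discharges, m):
    """Return [a_0, ..., a_m] minimizing sum_i (Q_i - sum_j a_j * h_i**j)**2."""
    size = m + 1
    # Normal equations (V^T V) a = V^T Q, with V[i][j] = h_i**j.
    vtv = [[sum(h ** (r + c) for h in stages) for c in range(size)]
           for r in range(size)]
    vtq = [sum(q * h ** r for h, q in zip(stages, discharges))
           for r in range(size)]
    return solve_linear(vtv, vtq)


def predict(coeffs, h):
    """Evaluate the fitted polynomial at stage h."""
    return sum(a * h ** j for j, a in enumerate(coeffs))


# Step 1: choose the degree m (here m = 2, an arbitrary illustrative choice).
# Step 2: estimate the coefficients from synthetic data Q = 1 + 2h + 3h^2.
stages = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
discharges = [1.0 + 2.0 * h + 3.0 * h ** 2 for h in stages]
coeffs = fit_polynomial(stages, discharges, m=2)
print([round(a, 6) for a in coeffs])
```

For high polynomial degrees the normal equations become ill-conditioned, so in practice a QR-based solver is usually preferred.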
The classical analytical models above are simple in structure and easy to solve. However, their parameter values are strongly affected by many complex factors, such as flood fluctuations, variable backwater and sedimentation, which the traditional stage-discharge relationship model established by mathematical methods cannot take into consideration. As a result, these empirical relationships cannot accurately reflect the true stage-discharge relationship in the river (Ajmera & Goyal, 2012).
Based on the above analysis, a GA-BP network algorithm based on information entropy is proposed in this paper to establish a non-parametric relationship between stage and discharge, which aims to more accurately and truly reflect the hydrological situation in the river, and provide reliable data for water conservancy planning and design, as well as hydrological prediction. To verify the effectiveness of the proposed algorithm, BP network algorithm and GA-BP network algorithm are compared with the novel method. The details are as follows.

BP NEURAL NETWORK ALGORITHM
The back propagation (BP) neural network was proposed by Rumelhart (1986). It is a multi-layer feed-forward neural network trained with the error back propagation algorithm. BP neural networks have made great contributions to computer science, information science and mathematical statistics and remain the most widely used neural networks so far. The basic principle is shown in Figure 1. In forward propagation, the sample is passed from the input layer through the hidden layer to the output layer. If the actual output of the output layer is inconsistent with the expected output, the network enters back propagation. In error back propagation, the output error is transferred layer by layer through the hidden layer to the input layer and is allocated to all the units of each layer, so as to obtain the error signal of each unit. These error signals are used to adjust the weights of each layer. Forward and back propagation are repeated until the error of the network output is reduced to an acceptable level or the preset number of learning iterations is reached. This process of weight adjustment is the learning process of the network (Tang et al., 2020; Wang et al., 2020).
Studies have shown that a three-layer BP neural network model, that is, one with an input layer, a hidden layer and an output layer, can simulate any complex nonlinear problem (Peter, 2019). In this paper, the following parameters are set: the input layer has 1 node, the hidden layer has 5 nodes and the output layer has 1 node; the S-type (sigmoid) transfer function logsig is selected as the transfer function; the Levenberg-Marquardt method is used as the training function; the number of training cycles is set to 5,000, the number of iterations to 500, the learning rate to 0.1 and the allowable error to 0.001.
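To make the forward and backward passes concrete, the following minimal sketch trains a 1-5-1 network with logsig activations. It is an illustration under stated assumptions, not the configuration used in this paper: plain stochastic gradient descent is used instead of Levenberg-Marquardt, the output layer also uses logsig (so targets must be scaled into (0, 1)), and the names `init_network`, `forward`, `train` and `mse` as well as the synthetic samples are hypothetical.

```python
import math
import random


def logsig(z):
    """S-type (logistic sigmoid) transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def init_network(hidden=5, seed=0):
    """Random initial weights and thresholds for a 1-hidden-1 network."""
    rng = random.Random(seed)
    return {
        "w1": [rng.uniform(-1, 1) for _ in range(hidden)],  # input -> hidden
        "b1": [rng.uniform(-1, 1) for _ in range(hidden)],  # hidden thresholds
        "w2": [rng.uniform(-1, 1) for _ in range(hidden)],  # hidden -> output
        "b2": rng.uniform(-1, 1),                           # output threshold
    }


def forward(net, x):
    """Forward propagation: input -> hidden layer -> output."""
    hidden = [logsig(w * x + b) for w, b in zip(net["w1"], net["b1"])]
    out = logsig(sum(w * h for w, h in zip(net["w2"], hidden)) + net["b2"])
    return hidden, out


def train(net, samples, lr=0.1, epochs=2000):
    """Error back propagation with per-sample gradient descent."""
    for _ in range(epochs):
        for x, target in samples:
            hidden, out = forward(net, x)
            # Error signal of the output unit (squared-error loss).
            delta_out = (out - target) * out * (1 - out)
            for j, h in enumerate(hidden):
                # Error signal propagated back to hidden unit j.
                delta_h = delta_out * net["w2"][j] * h * (1 - h)
                net["w2"][j] -= lr * delta_out * h
                net["w1"][j] -= lr * delta_h * x
                net["b1"][j] -= lr * delta_h
            net["b2"] -= lr * delta_out
    return net


def mse(net, samples):
    """Mean squared error of the network on the samples."""
    return sum((forward(net, x)[1] - t) ** 2 for x, t in samples) / len(samples)


# Synthetic normalized (stage, discharge) pairs, for illustration only.
samples = [(h / 10.0, (h / 10.0) ** 1.5) for h in range(1, 10)]
net = init_network()
error_before = mse(net, samples)
train(net, samples, lr=0.1, epochs=2000)
error_after = mse(net, samples)
print(f"MSE before: {error_before:.4f}, after: {error_after:.4f}")
```

Real stage and discharge series would first need to be normalized to the unit interval, and the estimates mapped back to physical units afterwards.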

GA-BP NEURAL NETWORK ALGORITHM
The BP (back propagation) neural network can learn independently and has a strong nonlinear mapping ability and a rigorous derivation, but it suffers from slow convergence and weak generalization ability. To address these problems, researchers have proposed using GA to optimize the BP neural network (Chen et al., 2019). Through selection, crossover and mutation operations, GA obtains the optimal initial weights and thresholds of the network and thereby effectively mitigates these problems (Jan, 2019). The concrete steps are as follows:
1. Establish the BP model. The BP neural network is configured as in the BP model described above, that is, the input layer has 1 node, the hidden layer has 5 nodes and the output layer has 1 node.
2. Initialize the population. The chromosomes are encoded according to the number of nodes of the BP neural network in the previous step. A real-number coding scheme is used, and the individual coding length is $1 \times 5 + 5 \times 1 + 1 + 1 = 12$.
3. Determine the fitness function. The reciprocal of the sum of squared errors between the actual output value and the expected output value is taken as the fitness function value F of each individual:

$$F = \frac{1}{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where n is the number of samples; $x_i$ is the actual output value of the i-th node; and $y_i$ is the expected output value of the i-th node.
4. Selection. Stochastic universal sampling (random traversal sampling) is used for selection.
5. Crossover operation. The real-number crossover method is used, and the crossover of the k-th chromosome $m_k$ and the l-th chromosome $m_l$ at position j is as follows:

$$m_{kj} = m_{kj}(1 - \mu) + m_{lj}\,\mu, \qquad m_{lj} = m_{lj}(1 - \mu) + m_{kj}\,\mu$$

where $\mu$ is a random number and $\mu \in [0, 1]$.
6. Mutation operation. The j-th gene of the i-th individual is selected for mutation, and the specific operation is as follows:

$$m_{ij} = \begin{cases} m_{ij} + (m_{ij} - m_{\max})\, f(g), & r > 0.5 \\ m_{ij} + (m_{\min} - m_{ij})\, f(g), & r \le 0.5 \end{cases}$$

where $f(g) = k(1 - g/G_{\max})^2$; g is the current iteration number; k is a random number; $G_{\max}$ is the maximum number of generations; r is a random number and $r \in [0, 1]$; and $m_{\min}$ and $m_{\max}$ are the lower and upper bounds of $m_{ij}$.
7. Replace the original chromosome with the new chromosome and calculate the fitness. If the termination condition is satisfied or the maximum number of iterations is reached, go to Step (8); otherwise, return to Step (3) to continue the optimization.
8. Assign the optimized weights and thresholds to the BP neural network and train on the data until the requirements set for the network are met.
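The fitness, crossover and mutation operators in Steps 3, 5 and 6 can be sketched as below. The sketch assumes the commonly used real-coded forms of these operators, which may differ in detail from the exact equations used in this paper. The function names, the symbol `mu` for the crossover random number and the final clamping of the mutated gene to its bounds are assumptions added for illustration.

```python
import random


def fitness(actual, expected):
    """Reciprocal of the sum of squared errors (undefined for a perfect fit)."""
    sse = sum((x - y) ** 2 for x, y in zip(actual, expected))
    return 1.0 / sse


def crossover(chrom_k, chrom_l, j, mu):
    """Real-number crossover of two chromosomes at gene position j."""
    new_k, new_l = chrom_k[:], chrom_l[:]
    new_k[j] = chrom_k[j] * (1 - mu) + chrom_l[j] * mu
    new_l[j] = chrom_l[j] * (1 - mu) + chrom_k[j] * mu
    return new_k, new_l


def mutate(chrom, j, g, g_max, m_min, m_max, rng):
    """Mutate gene j; the step size f(g) shrinks as generation g approaches g_max."""
    k = rng.random()
    r = rng.random()
    f = k * (1 - g / g_max) ** 2
    new = chrom[:]
    if r > 0.5:
        value = chrom[j] + (chrom[j] - m_max) * f
    else:
        value = chrom[j] + (m_min - chrom[j]) * f
    # Assumption of this sketch: keep the mutated gene inside its bounds.
    new[j] = min(max(value, m_min), m_max)
    return new


rng = random.Random(42)
parent_k = [0.2, -0.4, 0.9]
parent_l = [-0.7, 0.1, 0.3]
child_k, child_l = crossover(parent_k, parent_l, j=1, mu=rng.random())
mutant = mutate(child_k, j=2, g=10, g_max=100, m_min=-1.0, m_max=1.0, rng=rng)
print(child_k, child_l, mutant)
```

The crossover leaves the sum of the two genes at position j unchanged, and the mutation has no effect in the final generation because f(g) is zero when g equals the maximum.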

GA-BP BASED ON INFORMATION ENTROPY
In 1948, Claude Elwood Shannon, the founder of information theory, introduced the concept of entropy into information theory (Hasan & Rai, 2020). Information entropy is often used as a quantitative indicator of the information content of a system, and it can further serve as the objective or as a parameter selection criterion in the optimization of system equations (Capozziello & Luongo, 2017). Shannon proved mathematically that the uncertainty function of random variables that satisfies monotonicity, nonnegativity and additivity has a unique form:

$$H(x) = -\sum_{i} P(x_i) \log P(x_i)$$

where $x_i$ denotes the various random events; $P(x_i)$ represents the probability of random event $x_i$; and H(x) is the entropy. If the data are divided into a set of k clusters $C = \{C_1, C_2, \cdots, C_k\}$, the information entropy can be expressed as:

$$H(C) = -\sum_{j=1}^{k} \sum_{i} P(x_{ij}) \log P(x_{ij})$$

where $P(x_{ij})$ is the probability of event i in set $C_j$. It is worth noting that as the number of clusters increases, the amount of data in each class decreases, the probability that each datum belongs to one class increases, and the information entropy of the whole data set becomes larger. As the number of classes increases, the class division passes through the sequence disorder → order → disorder. The initial disorder arises because the clustering is too coarse to reveal the characteristics of the data set, and the final disorder arises because the clustering is too fine and loses the overall picture. Therefore, the transition value of the total information entropy of the data set can be used to determine the optimal number of clusters (Mahata & Sing, 2020). We define the state with k classes as the k state of the data set, $l_k$ as the information entropy of the k state, and $l_k - l_{k-1}$ as the information entropy jump value. The information entropy transition value is the difference between the jump from the k−1 state to the k state and the jump from the k state to the k+1 state, that is, $|(l_{k+1} - l_k) - (l_k - l_{k-1})|$. When the transition value of information entropy reaches its minimum, the entropy jump from the k-th state to the (k+1)-th state, relative to the jump from the (k−1)-th state to the k-th state, is the smallest among all the jumps, which means that the increase in uncertainty across the data set is minimal (Su et al., 2010). At this point there is no need to increase the number of clusters further, and the number of clusters is fixed at k.
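The selection rule based on the entropy transition value can be sketched as follows. The entropy values in `entropy_by_k` are hypothetical numbers chosen only to illustrate the arithmetic, the base-2 logarithm is an assumption, and the function names are not taken from this paper.

```python
import math


def shannon_entropy(probabilities):
    """Shannon entropy of a discrete distribution (base 2 assumed here)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def transition_values(entropy_by_k):
    """Map k to |(l_{k+1} - l_k) - (l_k - l_{k-1})| wherever both neighbours exist."""
    result = {}
    for k in sorted(entropy_by_k):
        if k - 1 in entropy_by_k and k + 1 in entropy_by_k:
            jump_in = entropy_by_k[k] - entropy_by_k[k - 1]
            jump_out = entropy_by_k[k + 1] - entropy_by_k[k]
            result[k] = abs(jump_out - jump_in)
    return result


def optimal_k(entropy_by_k):
    """Number of clusters with the smallest entropy transition value."""
    values = transition_values(entropy_by_k)
    return min(values, key=values.get)


# Hypothetical total entropies l_k for k = 2, ..., 6 clusters.
entropy_by_k = {2: 1.00, 3: 1.60, 4: 1.90, 5: 2.15, 6: 2.70}
print(transition_values(entropy_by_k))
print("optimal number of clusters:", optimal_k(entropy_by_k))
```

With these illustrative values the transition value is smallest at k = 4, where the jump from 3 to 4 clusters (0.30) and the jump from 4 to 5 clusters (0.25) are nearly equal.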
Clustering belongs to unsupervised learning, that is, the label information of the training samples is unknown, and the goal is to reveal the attributes, structure and information of the training samples and to provide a basis for further data mining. In machine learning, the clustering process is essentially an optimization process in which the objective function of the system is driven to a minimum through fast computation. Clustering methods mainly include density-based, model-based, partition-based and hierarchy-based methods. Among them, HAC creates a hierarchical nested clustering tree by calculating the similarity between different categories of data points (Bonetto & Latzko, 2021). It calculates the distance between each category of data points and all other data points to determine their similarity: the smaller the distance, the higher the similarity. The two nearest data points or categories are then merged to generate the cluster tree.
A GA-BP algorithm based on information entropy is proposed in this paper in order to reduce the estimation error of stage-discharge curve and improve the estimation accuracy of discharge. The implementation process is as follows.
Firstly, the number of clusters is determined. To obtain the optimal clustering scheme, HAC is used to divide the hydrological data, and information entropy is introduced to judge the optimal number of clusters. The details are as follows:
Step 1. Determine the initial range of the number of clusters $[C_{\min}, C_{\max}]$; generally, $C_{\min} = 2$ and $C_{\max} = \sqrt{n}$, where n is the number of samples.
Step 2. Select k cluster centers from the n training objects (the initial value of k is set to 2).
Step 3. Classify each sample into its own class (initializing the data) and calculate the distance between every two classes, that is, the similarity between every two samples.
Step 4. According to the principle of minimum variance, select the classes that meet the distance requirement and complete the inter-class merger.
Step 5. Recalculate the distance (similarity) between the newly generated classes and the old classes in Step 2.
Step 6. Repeat Step 2 and Step 3 until all objects finally merge to form k clusters.
Step 7. Calculate the information entropy and the transition value of the information entropy. When $k < C_{\max}$, return to Step 1; otherwise, determine the optimal number of clusters according to the entropy transition value of the data set.
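The clustering loop in the steps above can be sketched as follows, assuming Ward's minimum-variance criterion for the merge in Step 4. The entropy computed here is the entropy of the cluster membership proportions, which is a simplifying assumption standing in for the cluster entropy defined earlier. The function names and the (stage, discharge) points are hypothetical, and the features are not normalized in this sketch.

```python
import math


def centroid(cluster):
    """Mean point of a cluster of equal-length tuples."""
    dim = len(cluster[0])
    return [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]


def merge_cost(a, b):
    """Increase in within-cluster variance caused by merging a and b (Ward)."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def ward_clusters(points, k):
    """Agglomerative clustering: start from singletons, merge until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(clusters[i], clusters[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def membership_entropy(clusters):
    """Entropy of the cluster-size proportions (assumed proxy, base 2)."""
    n = sum(len(c) for c in clusters)
    return -sum(len(c) / n * math.log2(len(c) / n) for c in clusters)


# Hypothetical (stage in m, discharge in m^3/s) samples in three flow regimes.
points = [(1.0, 10.0), (1.1, 11.0), (1.2, 12.0),
          (3.0, 80.0), (3.1, 84.0), (3.2, 88.0),
          (5.0, 300.0), (5.1, 310.0), (5.2, 320.0)]
c_max = int(math.sqrt(len(points)))  # upper bound on the number of clusters
entropy_by_k = {k: membership_entropy(ward_clusters(points, k))
                for k in range(2, c_max + 1)}
print(entropy_by_k)
```

The resulting entropies per k would then be passed to the transition-value rule to choose the number of clusters. This pairwise search is cubic in the number of samples, so a library implementation is preferable for long records.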
Water Policy Vol 23 No 4, 1080
After that, this paper clusters the data with the optimal number of clusters and uses the KNN algorithm to assign the new data (for testing) to the classes produced by the HAC rules (Sharma & Shamkuwar, 2019). KNN is a supervised learning method in machine learning. It determines the category of a sample to be classified according to the category of one or more nearest samples; that is, when a new water level datum is given, KNN searches the clustered water level samples for the one or more samples closest to it. The Euclidean distance between samples $x_i$ and $x_j$ is used:

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$$

where $\langle a_1(x), a_2(x), \cdots, a_n(x) \rangle$ is the feature vector of the new sample x, and $x_i$ is the stage and discharge on day i. Finally, once each new sample has been classified, the river flow is estimated using the GA-BP algorithm described in the previous section.
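The classification step can be sketched as follows. This is a minimal illustration assuming a majority vote among the k = 3 nearest clustered samples and a feature vector containing only the stage, since the discharge of a new sample is what remains to be estimated. The function names, labels and data are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(sample, labelled, k=3):
    """Assign sample to the majority cluster among its k nearest neighbours."""
    nearest = sorted(labelled, key=lambda item: euclidean(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical clustered stage samples: (feature vector, cluster label).
labelled = [((1.0,), 0), ((1.1,), 0), ((1.2,), 0),
            ((3.0,), 1), ((3.1,), 1), ((3.2,), 1)]
new_stage = (2.9,)
print("assigned cluster:", knn_classify(new_stage, labelled))
```

Once the cluster is known, the GA-BP network trained on that cluster would be used to estimate the discharge for the new stage.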

METHOD TEST
Experimental data
In this paper, the algorithm is tested with data measured from 2007 to 2010 in a basin of the Minjiang River. The Minjiang River originates from Langjialing, Songpan County, at the south foot of Minshan Mountain, and it is the tributary with the largest runoff input to the Yangtze River. Flowing from north to south, it passes through the west of the Sichuan Basin, converges with the Jinsha River in Yibin City and joins the Yangtze River. The drainage basin above Dujiangyan is called the upper reaches of the Minjiang River. It lies between the Sichuan Basin and the Qinghai-Tibet Plateau, and most of it consists of alpine canyon areas. It is located between 30°45′–33°10′ N and 102°35′–103°57′ E. The total length of the main stream of the Minjiang River is 337 km, and the drainage area is about 22,612 km². The geographical location of the study area is shown in Figure 2.
The paper collected annual and monthly data on runoff, precipitation and temperature in the upper reaches of the Minjiang River from 2007 to 2010. The distributions of meteorological and hydrological stations are shown in Figure 3. The data from 2007 to 2009 are used for training the relationships of stage and discharge, and the data collected in 2010 are used to test the trained model. Part of the measured data is shown in Table 1.

Parameter estimation for relationship model of stage-discharge
When the GA-BP algorithm based on information entropy is used to test, the clustering center of the method is judged and determined by the information entropy. The results of clustering are shown in Figure 4. Figure 4(a) is the distribution of the measured data, and Figure 4(b) is the distribution of the information entropy transition value.
It can be seen from Figure 4 that when the transition state is 3, the transition value of information entropy from (3 classes → 4 classes) to (4 classes → 5 classes) reaches its minimum and then starts to increase again. This shows that 4 clusters maximize the overall amount of information and are therefore the optimal number of clusters. Accordingly, the number of clusters is set to 4 for the following steps in this paper. The clustering results are shown in Figure 5.
In this paper, the mean absolute error (MAE) and the mean absolute percentage error (MAPE) are used as the criteria to test the performance of each method. Mean absolute error:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|$$

Mean absolute percentage error:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{x_i - y_i}{y_i} \right|$$

where $x_i$ is the estimated river flow on the i-th day, $y_i$ is the actual river flow on the i-th day and n is the number of test samples. The results of MAE and MAPE are shown in Figure 6.
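The two criteria can be computed as in the following minimal sketch. The function names and the flow values are hypothetical and serve only to illustrate the arithmetic.

```python
def mae(estimated, actual):
    """Mean absolute error between estimated and actual flows."""
    return sum(abs(x - y) for x, y in zip(estimated, actual)) / len(actual)


def mape(estimated, actual):
    """Mean absolute percentage error, in percent (actual flows must be nonzero)."""
    return 100.0 * sum(abs((x - y) / y)
                       for x, y in zip(estimated, actual)) / len(actual)


# Hypothetical daily flows (m^3/s), for illustration only.
estimated = [110.0, 190.0, 330.0]
actual = [100.0, 200.0, 300.0]
print(f"MAE = {mae(estimated, actual):.2f}, MAPE = {mape(estimated, actual):.2f}%")
```

For these values the absolute errors are 10, 10 and 30, giving an MAE of about 16.67, and the relative errors are 10%, 5% and 10%, giving a MAPE of about 8.33%.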
The analysis of Figure 6 shows that the MAE and MAPE of the proposed method are lower than those of the other three methods. The proposed method can effectively reduce the error of parameter estimation and improve the precision of model estimation.
Based on the above calculations, the results are summarized in Table 2. Table 2 shows that the MAE and MAPE of the method proposed in this paper (the GA-BP model based on information entropy) are lower than those of the classical analytical model, the BP model and the GA-BP model. The MAE and MAPE of the proposed method are 25.15% and 16.38% lower than those of the classical analytical model, 24.21% and 18.05% lower than those of the BP model, and 24.53% and 15.43% lower than those of the GA-BP model. Figure 7 shows the daily river discharge estimated by the several methods. It can be seen from Figure 7 that the estimates of the classical analytical model, the BP model and the GA-BP model basically agree with the measured river flow, but the BP model has a large error at high flow rates and therefore cannot estimate large flows well. The GA-BP model based on information entropy proposed in this paper has higher estimation accuracy and estimates the flow better. Figure 7 also shows that the proposed method effectively reduces the large deviation between the measured and the estimated daily flow under normal climatic conditions.
Comparing the four methods mentioned above, the method proposed in this paper achieves a smaller estimation error than the other three methods. Combining the daily flow distribution shown in Figure 4(a) with Figure 6, three periods can be distinguished. In the dry season from January to mid-April, the daily flow of the river is relatively small, the estimated daily flow is basically the same as the observed value, and the accuracy of the four schemes differs little. From mid-April to September, the daily flow is relatively large, and many unpredictable factors, such as floods, rainfall and weather changes, affect the flow of the day, so that the measured daily flow takes many different values with large deviations at the same water level. The traditional methods of estimating river flow produce large errors during this period, whereas the scheme proposed in this paper can significantly improve the accuracy of daily river flow estimation. From October to December, with the passing of the rainy season, the river flow gradually decreases and the uncontrollable influencing factors diminish. The accuracy of the classical analytical model, the BP algorithm and the GA-BP algorithm in estimating the daily flow of the river increases, and the GA-BP algorithm can further improve the estimation accuracy. The analysis of Figures 6 and 7 shows that the proposed method obtains more accurate estimates than the analytical model, the BP model and the GA-BP model, and effectively reduces the large deviation between the estimated and the measured daily river flow under normal climatic conditions.
This is because the river hydrological data are first clustered as training samples, and the KNN method is then used to assign the new river data samples to the appropriate classes, which avoids the interference of irrelevant information and thus improves the efficiency and accuracy of daily river flow estimation. In the absence of extreme climate, the method proposed in this paper is of strong practical significance for capturing changes in river hydrology more accurately.

CONCLUSION
Most of the classical SRCs are based on empirical regression, which cannot be well applied to the study of the flow characteristics of complex rivers. In this paper, the GA-BP model based on information entropy is proposed to estimate the parameters of the river water level and flow curve in the Minjiang River Basin, and the estimation results are compared with those of the classical analytical model, the BP model and the GA-BP model. The method is verified with measured data from the hydrological station in the Minjiang River Basin, and the simulation results show that:
1. Compared with the classical analytical model, the models based on the neural network estimate the runoff flow better and have a higher estimation accuracy. The models based on the neural network can better capture the change characteristics of dynamic flow.
2. The GA-BP model based on information entropy proposed in this paper can keep the mean absolute error of flow estimation below 80, which is a reduction of 25.15, 24.21 and 24.53% compared with the classical analytical model, the BP model and the GA-BP model, respectively. The proposed method therefore obtains higher estimation accuracy than the classical analytical model, the BP model and the GA-BP model.