## Abstract

River water level prediction (WLP) plays an important role in flood control, navigation, and water supply. In this study, a WaveNet-based convolutional neural network (WCNN) with a lightweight structure and good parallelism was developed to improve the prediction accuracy and time efficiency of WLP. It was applied to predict the water levels 1/2/3 days ahead at the Waizhou gauging station of the Ganjiang River (GR) in China, and it was compared with two recurrent neural networks (long short-term memory (LSTM) and gated recurrent unit (GRU)). The results showed that the WCNN model achieved the best prediction performance with the fewest training parameters and the shortest training time. Compared with the LSTM and GRU models in the 1-day ahead prediction, the number of training parameters was reduced from 73,851 and 55,851, respectively, to 32,937; the root mean square error (RMSE) was reduced from 0.071 and 0.076 to 0.057; and the mean absolute error (MAE) was reduced from 0.052 and 0.059 to 0.038. The Nash–Sutcliffe efficiency (NSE) and coefficient of determination (*R*^{2}) both increased to 0.998. These results indicated that the improved model was more efficient for WLP.

## HIGHLIGHTS

A WaveNet-based convolutional neural network was proposed for water level prediction.

WCNN with a lightweight structure and good parallelism achieved better prediction performance.

WCNN obtained higher accuracy than the RNNs with the fewest parameters and the shortest training time.

Influence of different inputs and hyperparameters of models on prediction results was revealed.

## INTRODUCTION

The water level is an important hydrological feature of rivers, and accurate water level prediction (WLP) is crucial to flood control, shipping and water supply planning and management (Deng *et al.* 2021). However, variations in the water level are highly nonlinear due to various factors, such as rainfall, runoff, topography, water conservancy projects, and human activities, which increase the difficulty of accurate prediction (Lai *et al.* 2013; Zhang *et al.* 2016).

The WLP models mainly include hydrologic models based on physical processes and machine learning (ML) models based on a data-driven approach (hereafter physical models and ML models). Physical models can simulate water level processes in detail by establishing basic equations that express the interaction mechanism between variables (Lai *et al.* 2013). However, establishing physical models requires a wealth of professional knowledge and a large amount of basic data and physical parameters, and the modelling process is complex, difficult and time-consuming. ML models do not need to understand the mechanism of physical systems; they predict the water level by directly detecting correlations between variables and mapping inputs to outputs. The data that ML models require are easy to obtain, and the modelling process is relatively simple (Yin *et al.* 2021).

Various ML models have been used in hydrological prediction tasks. For example, Ahmed *et al.* (2022) developed several ML algorithms (i.e., linear regression (LR), interaction regression (IR), robust regression (RR), stepwise regression (SR), support vector regression (SVR), boosted trees ensemble regression (BOOSTER), bagged trees ensemble regression (BAGER), XGBoost, tree regression (TR), and Gaussian process regression (GPR)) to predict the daily water level of a river, using data collected from 1990 to 2019 to train and test the models. They found that the GPR model was capable of predicting the water level of the river with high precision and less uncertainty. In Latif & Ahmed (2020), Latif & Ahmed (2023) and Latif (2023), the LSTM model was used to predict the daily streamflow of the Kowmung River at Cedar Ford in Australia, the daily reservoir inflow of the Dokan Dam in Iraq and the daily pan evaporation at Sydney airport in Australia, respectively. The results showed that the LSTM model outperformed other conventional ML models (e.g., random forest (RF), tree boost (TB), multilayer perceptron neural network (MLP-NN), and boosted regression tree (BRT)). Huang *et al.* (2021) built three ML models, namely, an artificial neural network (ANN), a nonlinear autoregressive model with exogenous input (NARX) and GRU, to simulate the daily Poyang Lake level from 2003 to 2016. They found that ML models with historical memory (i.e., the GRU model) were more suitable for simulating the Poyang Lake level under the influence of the Three Gorges Dam. In Ho *et al.* (2022), the LSTM model was proposed to predict short-term water levels at tidal sluice gates from 6 to 48 h ahead in the Bac Hung Hai irrigation system in Vietnam. The findings of that study highlighted the performance of LSTM models in providing high-accuracy short-period water level forecasts for areas near estuaries.
Kima *et al.* (2022) used ML models such as gradient boosting (GB), support vector machine (SVM), and LSTM to predict the flood water level of the Heungcheon bridge station, which is located downstream of the Bokha bridge. Rainfall, water level, and discharge data of the Bokha bridge station from 2005 to 2020 were collected, and the rainfall data were classified into 53 rainfall events using interevent time definition (IETD) analysis. The LSTM model showed the best predictive power and was selected as the optimal model for real-time flood water level forecasting in that study. Zhang *et al.* (2018) built four different neural networks to predict the water level of a combined sewer overflow structure in Norway. Compared with the other neural networks (e.g., multilayer perceptron (MLP) and wavelet neural network (WNN)), the LSTM and GRU presented superior capabilities for multistep-ahead time series prediction. In Cai *et al.* (2021), a GRU model was built for groundwater level simulation in 78 catchments in the central-eastern continental United States. The results showed that the GRU model performed better in regions where hydrogeological properties could promote more effective responses of groundwater to external changes. The above shows that the LSTM and GRU models in recurrent neural networks (RNNs) can usually obtain better prediction performance. However, they still have some shortcomings. For example, their internal recurrent connections require inputs to be processed in time order, so the models cannot be trained in parallel and the training time increases (Fan *et al.* 2021). Furthermore, they consume additional memory to hold long-term information. For practical problems with very large datasets, the LSTM and GRU models are relatively low in terms of cost-efficiency.

A convolutional neural network (CNN) is a lightweight structure that has unique advantages in capturing spatial dependencies of input data (Collado-Villaverde *et al.* 2021). Nevertheless, its performance on time series regression tasks is poor (Yan *et al.* 2020). A new convolutional structure named WaveNet, proposed by DeepMind in 2016, has performed well in sequential analysis problems (van den Oord *et al.* 2018). It combines the advantages of dilated causal convolutions, residual connections, and skip connections. It not only receives nonfixed-length inputs but also learns complex long-term dependencies of sequences.

Compared with RNN models, WaveNet has a smaller number of parameters, large receptive fields, and good parallelism. It has good application prospects in solving sequence analysis problems with large datasets (Borovykh *et al.* 2019; Rizvi *et al.* 2021). However, WaveNet was initially proposed for audio generation tasks, and it cannot directly address the time series regression problem. Many researchers have improved it and achieved better prediction performance than LSTM or GRU in the prediction tasks on the power load (Wang *et al.* 2021), traffic flow (Zhang *et al.* 2021a, 2021b), and air quality (Benhaddi & Ouarzazi 2021) datasets. However, to the best of our knowledge, WaveNet has not been improved for WLP tasks.

This study aims to develop a highly efficient WLP model with a convolutional structure. A more lightweight CNN, named WaveNet-based convolutional neural network (WCNN), was built by improving WaveNet on the basis of WLP characteristics. The GR in China was selected as the study area, and the 1/2/3 days ahead water levels were predicted for the Waizhou gauging station (hereafter Waizhou station), where long-term sequence data of gauging stations could be used for verification. To make full use of the ability of the WCNN to extract spatial information, the water level and discharge sequences of three upstream gauging stations were used as auxiliary input features. Two RNN models (LSTM and GRU) were established, and the performance of the WCNN model was evaluated by comparing their results on the test set.

## STUDY AREA AND DATA DESCRIPTION

^{2} (Wen *et al.* 2019). The GR crosses through Jiangxi Province from south to north, and it bifurcates near Nanchang City, which is the provincial capital. Then it flows into Poyang Lake. The Waizhou station, whose catchment area accounts for 97% of the GR basin, is an important gauging station in the lower reach of the GR, and it is located near Nanchang City. The WLP of this station is significant to the safety of flood control, shipping and water supply of the Nanchang reach; thus, this station was selected as the prediction station.

*et al.* 2019). In this study, the sequence data of the Waizhou station and upstream gauging stations from 1965 to 2001 (37 years in total) with relatively stable hydrological rules were used to set up the model. Considering the water transmission time and predicted time, Ji'an gauging station with a distance of 230 km was selected as the farthest station in the upstream of the Waizhou station. The locations of the gauging stations are shown in Figure 1. Specifically, the sequences of daily average water level and discharge data belonged to Waizhou, Zhangshu, Xiajiang and Ji'an stations (hereafter referred to as *Z*_{wz}, *Z*_{zs}, *Z*_{xj}, *Z*_{ja}, *Q*_{wz}, *Q*_{zs}, *Q*_{xj}, and *Q*_{ja}, respectively). These data were supplied by the Hydrological Bureau of Jiangxi Province. The statistical information from 1965 to 2001 is shown in Figure 3, which shows that the distributions of the water level and discharge at all stations are uneven, and there are many extreme values on the right side of the mean value.

## METHODS

### Procedure of WLP of models

### Feature selection

The maximal information coefficient (MIC) was proposed by Reshef *et al.* (2011), which could express the nonlinear correlations between two variables and was widely used in feature selection in hydrological datasets (Lu *et al.* 2021; Zhang *et al.* 2021b). The calculation formula of MIC is shown in Formula (1), and the results between each variable (*Z*_{wz}, *Z*_{zs}, *Z*_{xj}, *Z*_{ja}, *Q*_{wz}, *Q*_{zs}, *Q*_{xj}, *Q*_{ja}, which are regarded as *H*) and predicted variables (1/2/3 days ahead water levels at the Waizhou station expressed as *Z*_{wz}^{1}, *Z*_{wz}^{2}, *Z*_{wz}^{3}, respectively, which are regarded as *L*) from 1965 to 1999 are shown in Table 1. Variables with MIC values larger than 0.6 were selected in the original datasets to constitute input features. In the three forecast periods, all candidate variables met the requirements. Thus, the input features were *Z*_{wz}, *Z*_{zs}, *Z*_{xj}, *Z*_{ja}, *Q*_{wz}, *Q*_{zs}, *Q*_{xj}, and *Q*_{ja}. In addition, Min-Max normalization was used to scale features to [0, 1].

where *H* = {*h*_{1}, *h*_{2}, …, *h*_{q}} and *L* = {*l*_{1}, *l*_{2}, …, *l*_{q}} are independent variables and dependent variables, respectively, and *q* is the length of *H* and *L*; *D* is a set of ordered pairs, and *D* = {(*h*_{i}, *l*_{i}), *i* = 1, 2, …, *q*}; *G* is the cells of grid; *D|G* is the probability distribution caused by data *D* on *G*; *I(D|G)* denotes mutual information of *D* on *G*; *B(q)* = *q*^{0.6}; and *|D|* is the size of *D*.

Predicted variable | Water level, Waizhou station | Water level, Zhangshu station | Water level, Xiajiang station | Water level, Ji'an station | Discharge, Waizhou station | Discharge, Zhangshu station | Discharge, Xiajiang station | Discharge, Ji'an station
---|---|---|---|---|---|---|---|---
1-day ahead water level *Z*_{wz}^{1} | 0.89 | 0.70 | 0.64 | 0.61 | 0.70 | 0.71 | 0.65 | 0.62
2-day ahead water level *Z*_{wz}^{2} | 0.80 | 0.65 | 0.63 | 0.63 | 0.64 | 0.66 | 0.62 | 0.62
3-day ahead water level *Z*_{wz}^{3} | 0.73 | 0.61 | 0.60 | 0.60 | 0.60 | 0.60 | 0.60 | 0.61
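
The selection and scaling steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the MIC values are copied from Table 1, the names `FEATURES`, `MIC`, `select_features` and `min_max_scale` are hypothetical, and the comparison uses `>=` because the table reports values rounded to two decimals (an assumption made so that entries shown as 0.60 pass the 0.6 threshold, in line with the statement that all candidates were retained).

```python
import numpy as np

# Candidate variables and their MIC values, copied from Table 1.
FEATURES = ["Z_wz", "Z_zs", "Z_xj", "Z_ja", "Q_wz", "Q_zs", "Q_xj", "Q_ja"]
MIC = {
    1: [0.89, 0.70, 0.64, 0.61, 0.70, 0.71, 0.65, 0.62],  # 1-day ahead
    2: [0.80, 0.65, 0.63, 0.63, 0.64, 0.66, 0.62, 0.62],  # 2-day ahead
    3: [0.73, 0.61, 0.60, 0.60, 0.60, 0.60, 0.60, 0.61],  # 3-day ahead
}


def select_features(mic_values, names, threshold=0.6):
    """Keep the variables whose (rounded) MIC reaches the threshold."""
    return [name for name, mic in zip(names, mic_values) if mic >= threshold]


def min_max_scale(x):
    """Scale each column of x to [0, 1]; also return the column minima and maxima."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo), lo, hi


selected = {horizon: select_features(v, FEATURES) for horizon, v in MIC.items()}
```

In practice the minima and maxima would be taken from the training period only and reused for the validation and test periods; that choice is an assumption and is not stated in the text.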


### The WCNN

WaveNet is a deep network model proposed by DeepMind for generating raw audio waveforms. It has large and flexible receptive fields and good parallelism, and it can capture long-term dependencies of sequences (Rethage *et al.* 2018; van den Oord *et al.* 2018). WaveNet can be extended to datasets outside of audio, and many achievements have been made in research on time series prediction (Luo *et al.* 2021; Rueda *et al.* 2021; Nie *et al.* 2022).

The sequences of predicted variables (*X*) and other auxiliary input features (*A*) before the *t*th timestep were used to predict *x*_{t} (*t*th timestep predictor), and the mapping formula is shown in Equation (2). Here, *A* is treated as the local condition *LC*, which is received using the ReLU function.

where *X* = {*x*_{1}, *x*_{2}, …, *x*_{T−1}}, and *T* is the input length; *t* = range(1, *T*); *A* = {*a*^{1}, *a*^{2}, …, *a*^{p}}, *a*^{i} denotes the *i*th auxiliary input feature, *i* = range(1, *p*), and *p* is the number of auxiliary input features; and *a*_{t}^{i} denotes the value of auxiliary input condition *a*^{i} at the *t*th timestep.
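
The mapping from past values of *X* and *A* to a future water level can be illustrated with a sliding-window construction of training samples. The sketch below is an assumption about how such samples might be assembled, not the authors' code: `make_windows` and its arguments are hypothetical names, the target column is taken to be the Waizhou water level, and the way the 2- and 3-day horizons are aligned with the input window is a guess.

```python
import numpy as np


def make_windows(data, target_col=0, timesteps=10, horizon=1):
    """Build supervised samples from a daily series.

    data: array of shape [n_days, n_features].
    Returns inputs of shape [n_samples, timesteps, n_features] and targets
    of shape [n_samples]; each target lies `horizon` days after the last
    day of its input window.
    """
    data = np.asarray(data, dtype=float)
    n_samples = data.shape[0] - timesteps - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for this window and horizon")
    inputs = np.stack([data[i:i + timesteps] for i in range(n_samples)])
    first_target = timesteps + horizon - 1
    targets = data[first_target:first_target + n_samples, target_col]
    return inputs, targets
```

With 10 timesteps and eight features, each sample has the [timesteps, features] shape that the input layer expects.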

- (1) The input layer received the sequences of predicted variables *X* and auxiliary input *A*, and the input shape was [timesteps, features]. As mentioned in Section 3.2, *X* was *Z*_{wz}, and *A* included *Z*_{zs}, *Z*_{xj}, *Z*_{ja}, *Q*_{wz}, *Q*_{zs}, *Q*_{xj}, and *Q*_{ja} in this study. To implement the residual operation, inputs were first transformed into the output shape of the residual blocks through a causal convolutional layer.
- (2) The main components of the residual block were the convolutional layer and ReLU function. Through these two structures, the order of data was ensured and the spatiotemporal nonlinear mapping of data was learned. Moreover, a temporal-excitation (TE) block based on a squeeze-and-excitation (SE) block was proposed to learn the long-term dependencies of data (Hu *et al.* 2019). The TE block obtained the global temporal information by explicitly modelling the relationships between the timesteps of the convolution channel *U*. The structure is shown in Figure 5. First, the transpose function *F*_{tr}(·) is used to swap the coordinate systems of the temporal features and channel features of the channel *U*. Then, the excitation operation *F*_{ex}(·) is used to capture the temporal dependence of the channel *U*, generating a set of modulation weights for each channel. Specifically, a fully connected (FC) layer with a dimensionality-reduction ratio *r* (*r* = 2) and ReLU function are used to parameterize the nonlinearity between the time steps, with an FC layer that restores the dimension and a sigmoid function that scales the weights, as shown in Figure 5. Finally, *F*_{tr}(·) is used to restore the coordinate system, and a multiplication operation *F*_{mul}(·,·) is used to integrate the results into the backbone network.

The number and size of the convolution kernels for all residual blocks in the WCNN were the same, which ensured that all residual blocks outputted a uniform shape. As shown in Figure 5, in the first residual block, the predicted variables *X* and condition *A* are passed through a convolution operation and ReLU function to obtain channel *U* containing temporal and spatial features. The calculation formula is shown in Equation (3). Subsequently, the TE block is utilized to learn the global temporal information of the channel *U* to recalibrate the convolutional channel features, and the calculation formula is shown in Equation (4). The output of the TE block is fed into the backbone network by multiplying it with *U*, and then, it is added to the inputs to obtain the final output of the residual block *z*_{1}, as shown in Equations (5) and (6). The other residual blocks (*k* > 1) receive *z*_{k−1} and output *z*_{k}, and the calculation formulas are as follows in Equations (7)–(10).

where *k* = range(1, *K*) denotes the *k*th residual block, and *K* is the number of residual blocks; *W*_{f,k} and *b*_{f,k} represent the weight and bias of the convolutional filters of *X* in the *k*th layer, respectively; *j* = range(1, *p*) is the conditional index; *V*_{g}^{i} represents the weight of the convolutional filters of *a*^{i} in the first layer, and *b*_{g} represents the bias of the convolutional filters in the first layer; *δ* is the ReLU function; *u*_{k} is the output of *δ* in the *k*th residual block. Additionally, *W*_{k,1}, *b*_{k,1} and *W*_{k,2}, *b*_{k,2} denote the weights and biases of the first and second FC layers in the TE block of the *k*th residual block; *σ* is the sigmoid function; *u*_{k}^{’} is the transpose of *u*_{k}; *s*_{k} is the output of the TE block in the *k*th residual block; *e*_{k} and *z*_{k} indicate the intermediate and final outputs of the *k*th residual block, respectively; ʘ represents the multiplication of corresponding elements; and the remaining parameters are the same as in Formula (2).

- (3) The output layer is a 1 × 1 convolutional layer with linear activation. The output *z*_{K} of the last residual block first performs the calculation of the ReLU function and then enters the output layer and outputs *O* = (*o*_{t−n+1}, *o*_{t−n+2}, …, *o*_{t}), where *n* is the input length, and *o*_{t} denotes the *t*th timestep prediction, which is the required predicted value. The calculation formula is shown in Equation (11).

where *z*_{K} is the output of the last residual block; *W*_{o} and *b*_{o} represent the weight and bias of the output layer, respectively; and *O* refers to the result of the output layer.
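
The TE recalibration, the residual addition and the 1 × 1 output layer described in items (2) and (3) can be sketched in NumPy as below. This is one possible reading of the text, offered under stated assumptions rather than as the authors' implementation: the channel *U* is taken as a given array of shape [timesteps, filters] (the convolution that produces it is omitted), the FC layers act along the time axis after the transpose, and all function and argument names are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def te_block(u, w1, b1, w2, b2):
    """Temporal-excitation weights for a channel u of shape [T, C].

    w1: [T, T // r] and w2: [T // r, T] are the two FC layers acting on time.
    Returns modulation weights s of shape [T, C], each between 0 and 1.
    """
    u_t = u.T                             # transpose: [C, T]
    hidden = relu(u_t @ w1 + b1)          # reduce the time dimension: [C, T // r]
    weights = sigmoid(hidden @ w2 + b2)   # restore the time dimension: [C, T]
    return weights.T                      # transpose back: [T, C]


def residual_output(u, block_input, te_params):
    """Recalibrate u with the TE weights, then add the block input."""
    s = te_block(u, *te_params)
    e = s * u                             # element-wise multiplication
    return e + block_input


def output_layer(z_last, w_o, b_o):
    """ReLU followed by a linear 1 x 1 convolution: [T, C] -> [T]."""
    return relu(z_last) @ w_o + b_o
```

The last element of the vector returned by `output_layer` corresponds to *o*_{t}, the required prediction. Under this reading, with 10 timesteps and *r* = 2, each TE block holds 10 × 5 + 5 + 5 × 10 + 10 = 115 trainable values.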

The WCNN inherits the advantages of WaveNet. It has a lightweight structure, good learning ability and parallelism, and can mitigate network degradation (vanishing/exploding gradient) problems. It differs from the original WaveNet in the following three points:

- (1) Replace the gated activation function in the residual structure with the ReLU function. It can learn nonlinear correlations of sequences effectively, making the WCNN more applicable to 1-D sequential regression problems. This replacement has been proven to be effective in nonstationary and noisy time series forecasting tasks (Borovykh *et al.* 2019). In addition, the application of the ReLU function can also reduce the complexity of the model and training time.
- (2) Replace the dilated convolutional structure with a TE block. A TE block based on an SE block is proposed to learn the long-term temporal dependencies of sequences. It obtains global temporal information by capturing the information between time steps and feeds back this information to the convolutional channel. Thus, the network can select key temporal features for mapping.
- (3) For application to 1-D sequential regression studies, the softmax distribution of the output layer is replaced with a linear function.

### Baseline model

LSTM and GRU are widely used ML models, and have already been applied to various WLP tasks (Zhang *et al.* 2018; Ren *et al.* 2020; Noor *et al.* 2022). Thus, LSTM and GRU were chosen as the baseline models in this study. LSTM solves the short-term memory and vanishing gradient problems by setting three gate (e.g., forget, input, and output gate) structures and two hidden states (e.g., short-term and long-term state). The specific structure and calculation formulas of LSTM are shown in Gers *et al.* (1999). GRU is a simplified version of LSTM that combines the two state variables in the LSTM unit into one. It controls the forget gate and input gate by a single gate controller (Géron 2019).

### Model parameter settings and evaluation metrics

The hyperparameters have a great influence on the prediction performance of the model. The basic network architecture of the WCNN is different from that of the LSTM and GRU models; thus, its hyperparameters are also different. The network hyperparameters of the WCNN model include the residual block number, kernel size and filter number of the convolutional layer. The network hyperparameters of the LSTM and GRU models are the number of recurrent layers and the number of neurons in each layer. The training hyperparameters are the same for the three models: optimizer, epochs, batch size and learning rate. For the above hyperparameters, the grid search method was used to determine the optimum value. This method selected one value from the value range of each hyperparameter, combined the selected values to build a model, and then trained and validated that model. The optimal hyperparameter combination was determined by comparing the loss between the predicted value and the measured value on the validation set under different parameter combinations. The range of hyperparameters and the optimal parameter configuration of the three models are shown in Table 2. After optimization by the grid search method, the optimal parameter configuration of the three models is as follows. The WCNN model consists of four residual blocks, and the filter number and kernel size of each residual block are 40 and 5, respectively. The LSTM and GRU models consist of two recurrent layers, with 100 and 50 neurons in the first and second layers, respectively, and a dropout rate of 0.2. The optimal training hyperparameters are the same for the three models: Adam is selected as the optimizer, the batch size is set to 100 and the timestep is 10. The initial learning rate is set to 0.001, and it is reduced by 50% whenever the validation loss has not decreased for 20 epochs.

Parameters | Optimization range | WCNN | LSTM | GRU
---|---|---|---|---
Layers | [1, 2, 3] | – | 2 | 2
Neurons | [50, 100, 150] | – | [100, 50] | [100, 50]
Dropout | [0.1, 0.2, 0.3, 0.4, 0.5] | – | 0.2 | 0.2
Residual blocks | [1, 2, 3, 4, 5, 6] | 4 | – | –
Kernel size | [1, 2, 3, 4, 5, 6, 7, 8] | 5 | – | –
Filter number | [20, 30, 40, 50, 60] | 40 | – | –
Timesteps or input length | [6, 10, 17, 30, 45] | 10 | 10 | 10
Optimizer | [SGD, RMSprop, Adam, Nadam] | Adam | Adam | Adam
Epochs | [50, 100, 200, 300, 400, 500] | 500 | 500 | 500
Batch size | [32, 64, 100, 128, 256] | 100 | 100 | 100
Initial learning rate | [0.01, 0.005, 0.001, 0.0005, 0.0001] | 0.001 | 0.001 | 0.001
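
The grid search over the WCNN network hyperparameters can be sketched as below. This is a minimal illustration under assumptions: only the three WCNN-specific ranges from Table 2 are enumerated, `grid_search` is a hypothetical helper, and `evaluate` stands for a user-supplied function that trains a model with the given configuration and returns its validation loss (no such function is specified in the text).

```python
from itertools import product

# WCNN network hyperparameter ranges, copied from Table 2.
WCNN_GRID = {
    "residual_blocks": [1, 2, 3, 4, 5, 6],
    "kernel_size": [1, 2, 3, 4, 5, 6, 7, 8],
    "filters": [20, 30, 40, 50, 60],
}


def grid_search(grid, evaluate):
    """Try every combination in `grid` and keep the lowest validation loss."""
    keys = list(grid)
    best_config, best_loss = None, float("inf")
    for values in product(*(grid[key] for key in keys)):
        config = dict(zip(keys, values))
        loss = evaluate(config)  # assumed: train the model, return validation loss
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


n_configurations = len(list(product(*WCNN_GRID.values())))
```

These three ranges alone give 6 × 8 × 5 = 240 combinations, so searching the training hyperparameters jointly would multiply the cost accordingly.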


The number of training parameters is an important measure of model training efficiency. The fewer the training parameters, the shorter the training time of the model. The number of training parameters of WCNN, LSTM and GRU under the same input conditions identified in Section 3.2 was calculated, which were 32,937, 73,851, and 55,851, respectively. The WCNN model had the fewest parameters, and the LSTM model had the most parameters.
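
The reported parameter counts of the two baseline models can be checked with the standard per-layer formulas. The sketch below rests on assumptions that are not stated in the text: eight input features, two recurrent layers of 100 and 50 units, a single-unit dense output layer, and a GRU variant with separate input and recurrent bias vectors (as in the Keras default with `reset_after=True`). Under these assumptions the totals match the reported 73,851 and 55,851.

```python
def lstm_params(n_in, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (n_in + units) + units)


def gru_params(n_in, units, separate_recurrent_bias=True):
    """Three gates; optionally with separate input and recurrent biases."""
    n_bias = 2 * units if separate_recurrent_bias else units
    return 3 * (units * (n_in + units) + n_bias)


def dense_params(n_in, n_out):
    return n_in * n_out + n_out


N_FEATURES = 8  # Z and Q at the four gauging stations

lstm_total = lstm_params(N_FEATURES, 100) + lstm_params(100, 50) + dense_params(50, 1)
gru_total = gru_params(N_FEATURES, 100) + gru_params(100, 50) + dense_params(50, 1)

print(lstm_total, gru_total)
```

Dropout layers add no trainable parameters, so they do not affect these totals.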

Loss_{LSTM}, Loss_{GRU}, and Loss_{WCNN} represent the loss functions of the LSTM, GRU, and WCNN models, respectively; *m* is the batch size of the training set; *ŷ*_{i} and *y*_{i} represent the predicted and measured values of the *i*th input, respectively; *ŷ*_{i,−1} and *y*_{i,−1} represent the last predicted and measured values of the *i*th input.

The MAE, RMSE, NSE and coefficient of determination (*R*^{2}) were used to evaluate the performance of different models (Huang *et al.* 2021; Latif & Ahmed 2023). The closer the values of MAE and RMSE are to 0, and NSE and *R*^{2} are to 1, the better the performance of the models. The formulas for calculating MAE, RMSE, NSE and *R*^{2} are as follows:

where *N* is the test set size; an overbar on *y* and *ŷ* denotes the mean value of the measured and predicted values, respectively; and the remaining parameters are the same as in Equation (12).
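
The four metrics can be computed as in the sketch below, which uses their standard definitions. Two points are assumptions rather than statements from the text: *R*^{2} is taken as the squared Pearson correlation between measured and predicted values (consistent with the mean of the predicted values appearing in the definitions), and the function names are hypothetical.

```python
import numpy as np


def mae(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean(np.abs(y - y_hat)))


def rmse(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def nse(y, y_hat):
    """Nash-Sutcliffe efficiency: 1 minus error variance over measured variance."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))


def r2(y, y_hat):
    """Squared Pearson correlation between measured and predicted values."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.corrcoef(y, y_hat)[0, 1] ** 2)
```

Under these definitions NSE penalizes systematic bias, whereas *R*^{2} does not, which is one reason the two can differ for the same predictions.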

## RESULTS

### Performance comparison of different models

The WCNN, LSTM and GRU models were built to simulate *Z*_{wz} for 1/2/3 days ahead, and the results are shown in Table 3. Compared with the LSTM and GRU models for 1-day ahead prediction over all water levels, the WCNN model improves the MAE by 26.30 and 34.98% and the RMSE by 19.97 and 24.93%, respectively. The NSE and *R*^{2} are both improved to 0.998 on the complete test set.
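
As a rough cross-check, the relative improvements can be recomputed from the rounded 1-day ahead entries for all water levels in Table 3. The sketch below is only illustrative; `relative_improvement` is a hypothetical helper. The rounded entries give approximately 26.9 and 35.6% for MAE and 19.7 and 25.0% for RMSE, close to but not identical with the percentages quoted above, which were presumably computed from unrounded metrics (an assumption).

```python
def relative_improvement(baseline, proposed):
    """Percentage reduction of an error metric relative to the baseline."""
    return 100.0 * (baseline - proposed) / baseline


# 1-day ahead metrics for all water levels, copied from Table 3 (rounded values).
TABLE3_1DAY = {
    "MAE": {"WCNN": 0.038, "LSTM": 0.052, "GRU": 0.059},
    "RMSE": {"WCNN": 0.057, "LSTM": 0.071, "GRU": 0.076},
}

for metric, row in TABLE3_1DAY.items():
    for baseline in ("LSTM", "GRU"):
        gain = relative_improvement(row[baseline], row["WCNN"])
        print(f"{metric} vs {baseline}: {gain:.1f}%")
```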

Range of water level | Metrics | WCNN *t* + 1 | WCNN *t* + 2 | WCNN *t* + 3 | LSTM *t* + 1 | LSTM *t* + 2 | LSTM *t* + 3 | GRU *t* + 1 | GRU *t* + 2 | GRU *t* + 3
---|---|---|---|---|---|---|---|---|---|---
All water levels | MAE | 0.038 | 0.115 | 0.213 | 0.052 | 0.127 | 0.222 | 0.059 | 0.131 | 0.238
All water levels | RMSE | 0.057 | 0.171 | 0.333 | 0.071 | 0.179 | 0.337 | 0.076 | 0.180 | 0.348
All water levels | NSE | 0.998 | 0.979 | 0.920 | 0.996 | 0.977 | 0.918 | 0.996 | 0.977 | 0.912
All water levels | *R*^{2} | 0.998 | 0.980 | 0.921 | 0.997 | 0.979 | 0.920 | 0.997 | 0.979 | 0.918
Water level < 17 m | MAE | 0.021 | 0.047 | 0.077 | 0.036 | 0.077 | 0.087 | 0.071 | 0.092 | 0.104
Water level < 17 m | RMSE | 0.031 | 0.067 | 0.102 | 0.044 | 0.089 | 0.112 | 0.080 | 0.106 | 0.130
Water level < 17 m | NSE | 0.982 | 0.916 | 0.801 | 0.964 | 0.848 | 0.758 | 0.878 | 0.786 | 0.676
Water level < 17 m | *R*^{2} | 0.982 | 0.922 | 0.831 | 0.977 | 0.914 | 0.819 | 0.973 | 0.904 | 0.805
17 m ≤ Water level ≤ 19 m | MAE | 0.041 | 0.117 | 0.204 | 0.052 | 0.120 | 0.215 | 0.048 | 0.121 | 0.230
17 m ≤ Water level ≤ 19 m | RMSE | 0.058 | 0.153 | 0.278 | 0.069 | 0.159 | 0.285 | 0.064 | 0.159 | 0.297
17 m ≤ Water level ≤ 19 m | NSE | 0.990 | 0.930 | 0.768 | 0.986 | 0.924 | 0.757 | 0.988 | 0.924 | 0.736
17 m ≤ Water level ≤ 19 m | *R*^{2} | 0.990 | 0.941 | 0.805 | 0.987 | 0.933 | 0.805 | 0.989 | 0.936 | 0.807
Water level > 19 m | MAE | 0.059 | 0.219 | 0.466 | 0.078 | 0.230 | 0.469 | 0.071 | 0.229 | 0.484
Water level > 19 m | RMSE | 0.082 | 0.297 | 0.619 | 0.106 | 0.306 | 0.621 | 0.101 | 0.301 | 0.633
Water level > 19 m | NSE | 0.985 | 0.809 | 0.170 | 0.976 | 0.797 | 0.165 | 0.978 | 0.804 | 0.131
Water level > 19 m | *R*^{2} | 0.986 | 0.832 | 0.451 | 0.979 | 0.842 | 0.442 | 0.978 | 0.836 | 0.427


### Influence of different inputs on results

(1) Input A: [*Z*_{wz}, *Q*_{wz}]; (2) Input B: [*Z*_{wz}, *Q*_{wz}, *Z*_{zs}, *Q*_{zs}]; (3) Input C: [*Z*_{wz}, *Q*_{wz}, *Z*_{xj}, *Q*_{xj}]; (4) Input D: [*Z*_{wz}, *Q*_{wz}, *Z*_{ja}, *Q*_{ja}]; (5) Input E: [*Z*_{wz}, *Q*_{wz}, *Z*_{zs}, *Q*_{zs}, *Z*_{xj}, *Q*_{xj}]; (6) Input F (recommended model input): [*Z*_{wz}, *Q*_{wz}, *Z*_{zs}, *Q*_{zs}, *Z*_{xj}, *Q*_{xj}, *Z*_{ja}, *Q*_{ja}]. The parameters of the models were the same in the experiments. Evaluation metrics of the 1/2/3 days ahead predicted water levels on the test set (2000–2001) of the WCNN model under different inputs are plotted in Figure 9. When the data of one upstream station are added to the input data of the Waizhou station (input A) to form inputs B, C, and D, the prediction accuracy of the models is significantly improved in all cases. In particular, adding the data of the Zhangshu, Xiajiang and Ji'an gauging stations, which are 92, 170, and 230 km away from the Waizhou station, improves the accuracy of the 1-, 2- and 3-day ahead predictions the most, respectively. This result shows that as the forecast period increases, adding the data of gauging stations farther away from the Waizhou station yields better prediction accuracy, which reflects the influence of the water propagation characteristics from upstream to downstream stations on the WLP. When the data of one or two further gauging stations are added to the inputs with two gauging stations (inputs B, C, D), the prediction accuracy of the models is further improved (input B compared with inputs E, F; input C compared with inputs E, F; input D compared with input F). However, as the number of input features increases, the rate of improvement in prediction accuracy gradually slows down.
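
The six input combinations can be written down as a small configuration, which makes it easy to rerun the same model on each of them. The sketch below is illustrative only: the feature names, the column order in `ALL_FEATURES` and the helper functions are assumptions, not part of the original experiments.

```python
# Assumed column order of the full data array (water levels, then discharges).
ALL_FEATURES = ["Z_wz", "Z_zs", "Z_xj", "Z_ja", "Q_wz", "Q_zs", "Q_xj", "Q_ja"]


def station_features(*stations):
    """Water level and discharge names for the given station codes."""
    return [f"{var}_{code}" for code in stations for var in ("Z", "Q")]


INPUT_SETS = {
    "A": station_features("wz"),
    "B": station_features("wz", "zs"),
    "C": station_features("wz", "xj"),
    "D": station_features("wz", "ja"),
    "E": station_features("wz", "zs", "xj"),
    "F": station_features("wz", "zs", "xj", "ja"),  # recommended model input
}


def column_indices(input_set):
    """Column positions of an input set within the full data array."""
    return [ALL_FEATURES.index(name) for name in INPUT_SETS[input_set]]
```

For example, `data[:, column_indices("B")]` would select the Waizhou and Zhangshu columns from a NumPy array laid out in the assumed order.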

### Influence of hyperparameters

## DISCUSSION

The proposed WCNN model was compared with LSTM and GRU models in the WLP of 1/2/3 days ahead of the Waizhou station in GR, China. The result shows that the best prediction with the fewest training parameters was achieved by the WCNN model, which is consistent with the findings for variant models of WaveNet proposed by researchers (Benhaddi & Ouarzazi 2021; Zhang *et al.* 2021a) in air quality prediction and other sequence modelling tasks, respectively. The WCNN model can be generalized to other rivers with different climate conditions, because the inputs of the model (i.e., water level and discharge data of the gauging stations) reflect the characteristics of the weather conditions, and the data of the predicted station and its upstream stations are easy to obtain.

On the other hand, according to the physical propagation characteristics of rivers, the water level and discharge sequences of gauging stations were selected as the model inputs. Reasonable input features were selected by calculating the MIC of the variables, following the studies of Lu *et al.* (2021) and Zhang *et al.* (2021b). The influence of input features on the prediction results of the models was quantified. The results showed that adding variables that are highly correlated with the predicted variable to the model inputs could improve prediction performance. However, as the number of input features increased, the rate of improvement in prediction accuracy gradually slowed, and the training time of the models also increased. Therefore, when selecting input features, the time cost of adding input features must be weighed against the degree of performance improvement. When setting the model parameters, the grid search method was used to optimize the parameters and determine the optimal values. The influence of the main hyperparameters of the WCNN model, such as input length, layer number, kernel size, and filter number, on the prediction results was analyzed and discussed in section 4.3. At present, there are various other methods for hyperparameter optimization, such as Random Search, Bayesian Optimization, and Genetic Algorithm. These methods will be considered in our future research.
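The grid search mentioned above can be illustrated with a minimal sketch. The search space values and the `toy_validation_error` function below are hypothetical placeholders, not the grid or the training procedure used in this study; in practice, the scoring function would train a model with the given configuration and return its validation error.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not the paper's grid.
SEARCH_SPACE = {
    "input_length": [7, 14, 30],
    "layer_number": [2, 3, 4],
    "kernel_size": [2, 3],
    "filter_number": [16, 32],
}


def grid_search(space, evaluate):
    """Score every combination in `space` and return (best_config, best_score).

    `evaluate` maps a config dict to a validation error (lower is better).
    """
    names = list(space)
    best_config, best_score = None, float("inf")
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_error(config):
    """Stand-in for training a model and returning its validation RMSE."""
    return (
        abs(config["input_length"] - 14) / 14
        + abs(config["layer_number"] - 3)
        + abs(config["kernel_size"] - 2)
        + abs(config["filter_number"] - 32) / 32
    )


best_config, best_score = grid_search(SEARCH_SPACE, toy_validation_error)
print(best_config, best_score)
```

The number of evaluations grows as the product of the list lengths (here 3 × 3 × 2 × 2 = 36), which is one reason the alternative optimization methods listed above may be attractive for larger search spaces.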

*R*^{2} are 0.998, 0.997, and 0.996, respectively. However, compared with the WLP results of 2001–2002, the accuracy was reduced due to the influence of riverbed topographic changes. In further research, the impact of riverbed topographic changes, tributary inlet, rainfall and water conservancy projects should be considered to improve the prediction accuracy of models (Zhang *et al.* 2018; Deng *et al.* 2021). Furthermore, the multistep ahead WLP could not be significantly improved by the WCNN model, while Borovykh *et al.* (2019) and Benhaddi & Ouarzazi (2021) only studied the variant models of WaveNet to predict 1-day variables. Therefore, follow-up work should be carried out to extend the effective forecast periods of the model. Considering the poor effect of the sequence-to-value structure, a sequence-to-sequence structure using an encoder–decoder network (Rueda *et al.* 2021) should be adopted.
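To make the sequence-to-value structure concrete, the following is a minimal sketch of how training samples for a single forecast horizon could be built from a multivariate daily series. The function name `make_windows`, the array layout, and the convention that column 0 holds the target water level are assumptions for illustration, not the authors' preprocessing code.

```python
import numpy as np


def make_windows(series, input_length, horizon):
    """Build sequence-to-value samples from a (T, n_features) array.

    Column 0 is assumed to hold the target water level.
    Returns X with shape (N, input_length, n_features) and y with shape (N,),
    where y[i] is the target `horizon` steps after the last step of X[i].
    """
    series = np.asarray(series, dtype=float)
    n_samples = series.shape[0] - input_length - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for the requested window and horizon")
    X = np.stack([series[i:i + input_length] for i in range(n_samples)])
    y = series[input_length + horizon - 1:, 0][:n_samples]
    return X, y


# Toy series: 10 days, 2 features; column 0 is 0, 2, 4, ..., 18.
demo = np.arange(20).reshape(10, 2)
X, y = make_windows(demo, input_length=3, horizon=2)
print(X.shape, y)
```

Under this layout, a separate model is trained for each horizon (1, 2, or 3 days). A sequence-to-sequence variant would instead return all horizons as one target vector per window.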

| Range of water level | Metrics | WCNN t + 1 | WCNN t + 2 | WCNN t + 3 | LSTM t + 1 | LSTM t + 2 | LSTM t + 3 | GRU t + 1 | GRU t + 2 | GRU t + 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| All water levels | MAE | 0.063 | 0.180 | 0.370 | 0.088 | 0.181 | 0.316 | 0.097 | 0.212 | 0.408 |
| | RMSE | 0.098 | 0.267 | 0.503 | 0.123 | 0.281 | 0.470 | 0.130 | 0.284 | 0.517 |
| | NSE | 0.997 | 0.981 | 0.934 | 0.996 | 0.979 | 0.943 | 0.996 | 0.979 | 0.930 |
| | R^{2} | 0.998 | 0.982 | 0.936 | 0.997 | 0.980 | 0.945 | 0.996 | 0.981 | 0.939 |
| Water level < 14 m | MAE | 0.047 | 0.127 | 0.245 | 0.099 | 0.107 | 0.183 | 0.079 | 0.208 | 0.452 |
| | RMSE | 0.069 | 0.164 | 0.298 | 0.113 | 0.147 | 0.248 | 0.097 | 0.235 | 0.492 |
| | NSE | 0.970 | 0.837 | 0.454 | 0.921 | 0.869 | 0.623 | 0.941 | 0.665 | 0.486 |
| | R^{2} | 0.975 | 0.894 | 0.670 | 0.972 | 0.903 | 0.744 | 0.957 | 0.897 | 0.696 |
| 14 ≤ Water level ≤ 16 m | MAE | 0.060 | 0.172 | 0.390 | 0.070 | 0.179 | 0.298 | 0.107 | 0.173 | 0.312 |
| | RMSE | 0.097 | 0.258 | 0.516 | 0.111 | 0.267 | 0.411 | 0.146 | 0.252 | 0.418 |
| | NSE | 0.971 | 0.792 | 0.171 | 0.962 | 0.778 | 0.473 | 0.933 | 0.802 | 0.454 |
| | R^{2} | 0.971 | 0.809 | 0.500 | 0.962 | 0.815 | 0.601 | 0.934 | 0.824 | 0.560 |
| Water level > 16 m | MAE | 0.082 | 0.245 | 0.479 | 0.096 | 0.261 | 0.479 | 0.103 | 0.261 | 0.474 |
| | RMSE | 0.122 | 0.353 | 0.641 | 0.145 | 0.386 | 0.670 | 0.141 | 0.357 | 0.633 |
| | NSE | 0.994 | 0.946 | 0.823 | 0.991 | 0.936 | 0.806 | 0.991 | 0.945 | 0.827 |
| | R^{2} | 0.994 | 0.949 | 0.853 | 0.992 | 0.942 | 0.836 | 0.993 | 0.948 | 0.854 |
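A per-range evaluation like the one tabulated above could be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, *R*^{2} is taken to be the squared Pearson correlation between observed and predicted levels, and the NSE inside each range is computed against the mean of the observations in that range. The 14 m and 16 m thresholds come from the table.

```python
import numpy as np


def metrics(obs, pred):
    """MAE, RMSE, NSE and squared Pearson correlation for one set of samples."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - obs
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        # NSE compares squared errors with the variance of the observations.
        "NSE": float(1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)),
        # Assumption: R2 is the squared Pearson correlation coefficient.
        "R2": float(np.corrcoef(obs, pred)[0, 1] ** 2),
    }


def metrics_by_range(obs, pred, low=14.0, high=16.0):
    """Evaluate all samples and the three water level ranges separately."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    masks = {
        "all": np.ones(obs.shape, dtype=bool),
        "below_low": obs < low,
        "between": (obs >= low) & (obs <= high),
        "above_high": obs > high,
    }
    # Ranges with fewer than two samples are skipped (variance undefined).
    return {name: metrics(obs[m], pred[m]) for name, m in masks.items() if m.sum() >= 2}


# Toy data: every prediction is 0.1 m too high.
obs = np.array([13.0, 13.5, 15.0, 15.5, 17.0, 18.0])
pred = obs + 0.1
result = metrics_by_range(obs, pred)
print(result["all"]["NSE"], result["below_low"]["NSE"])
```

Because the NSE normalizes by the variance of the observations, the same absolute error yields a lower NSE inside a narrow water level band than over the full record. This may partly explain why the NSE values for individual ranges in the table are lower than those for all water levels even where MAE and RMSE are comparable.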


## CONCLUSIONS

To improve the effectiveness of WLP models on large datasets, a new CNN model named WCNN was proposed by adapting the WaveNet architecture from the audio generation domain. It was applied to predict the water level at the Waizhou station of the GR in China. The main conclusions are as follows:

- (1) For 1/2/3 days ahead WLP of the Waizhou station, the WCNN, LSTM, and GRU models all achieved good prediction accuracy. Moreover, the WCNN showed better prediction performance than the LSTM and GRU baseline models. Compared with the other two models, the WCNN had a lighter structure, the fewest training parameters, and good parallelism, which could significantly reduce the training time. It is therefore a more efficient method for WLP.

- (2) It is recommended to select the water level and discharge of upstream stations that are highly correlated with the prediction station as input features to improve the prediction accuracy. However, once the number of input features reaches a certain amount, the rate of improvement in prediction accuracy gradually slows, and the training time of the models also increases. The time and data costs of adding input features should be weighed against the incremental performance improvement.

## ACKNOWLEDGEMENTS

This work was supported by the National Key Research and Development Programme of China (2021YFD1700802).

## DATA AVAILABILITY STATEMENT

Data cannot be made publicly available; readers should contact the corresponding author for details.

## CONFLICT OF INTEREST

The authors declare there is no conflict.