Abstract
Real-time detection of water level outliers is critical for the real-time regulation of gates and pump stations in open-channel water transfer projects. However, this remains a challenging task because water level outliers lack an agreed definition and the flow monitoring data are imbalanced. In this study, we define water level outliers and then propose a highly accurate outlier index for their real-time detection based on the water level-flow relationship; the thresholds for water level outliers are determined from the orders of magnitude of the flow and water level differences. A case study is performed with the South-to-North Water Diversion Project of China. To verify the accuracy of the index, random noise is added to 15 randomly selected non-adjacent monitoring data points, and the noise is increased from 4 to 9 cm in steps of 1 cm. The results show that 159 of the 180 outliers are detected, a detection rate of 88.3%.
HIGHLIGHTS
An outlier index is proposed based on the water level-flow relationship.
The definition of water level outliers in open-channel water transfer projects is proposed.
A case study shows that the proposed method can effectively identify outliers in real time.
INTRODUCTION
Cross-basin water transfer projects are intended to alleviate water shortages and optimize water allocation. Such projects are complex systems characterized by long-distance water transmission with the help of various hydraulic structures such as gates and pump stations. To prevent slope damage caused by water pressure differences, the water levels in front of pump stations or gates should be kept as stable as possible (Clemmens et al. 2001). Therefore, accurate real-time monitoring of these water levels is needed. As the presence of outliers can considerably reduce the quality of the datasets, effective methods are needed to detect these outliers in real time (Boiten 2008; Herschy 2014). Many hydrodynamic models and statistical methods have been proposed as potential solutions, but they are not particularly useful in a real-time setting, especially for long-distance open-channel water transfer projects.
The one-dimensional hydrodynamic model based on the Saint-Venant equations (Yi et al. 2017; Zhu et al. 2021) can detect outliers in water level datasets by simulating water movement. The Saint-Venant equations consist of a continuity equation and a momentum equation, where the continuity equation requires a balance between the inflow and outflow of the same channel. However, this balance is difficult to achieve for long-distance open-channel water transfer projects, whose channels are often tens of kilometers long, because of abnormal operation of equipment, monitoring errors, rainfall, leakage, and even flow inversion (when the water level is stable, the downstream flow monitoring data points remain larger than the upstream ones for a long time), which makes the equation set unsolvable. For this reason, the model is not applicable to real-time detection of outliers in water level datasets.
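For reference, one commonly used textbook form of the one-dimensional Saint-Venant equations is sketched below. The notation is assumed here for illustration and may differ from the form used in the cited models: A is the wetted cross-sectional area, Q the flow, Z the water level, q the lateral inflow per unit length, g the gravitational acceleration, S_f the friction slope, n the Manning roughness coefficient, and R the hydraulic radius.

```latex
\begin{aligned}
\frac{\partial A}{\partial t} + \frac{\partial Q}{\partial x} &= q, \\
\frac{\partial Q}{\partial t}
  + \frac{\partial}{\partial x}\!\left(\frac{Q^{2}}{A}\right)
  + gA\,\frac{\partial Z}{\partial x}
  + gA\,S_f &= 0,
\qquad S_f = \frac{n^{2}\,Q\,|Q|}{A^{2}R^{4/3}}.
\end{aligned}
```

The first line is the continuity equation: if the monitored inflow and outflow of a channel do not balance over long periods, this constraint cannot be satisfied with the measured boundary data.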
Statistically, the distribution-based 3-sigma method (Le et al. 2013; Hwang et al. 2016, 2019; Chen & Liao 2020) and the quantile-based box-plot method (Hubert & Vandervieren 2008; Zhao & Yang 2019) proposed by Tukey (1977) have been widely used to detect outliers in water level datasets. Nevertheless, these two methods do not apply to open-channel water transfer projects because of the masking effect (Rousseeuw & Hubert 2011) and the swamping effect (Rousseeuw & Hubert 2011). The masking effect means that the fitted model cannot detect the deviating observations, and the swamping effect means that some data points are incorrectly identified as outliers due to the presence of another good subset. It is apparent that the former effect may lead to false-negative misdiagnosis (missing of outliers), while the latter effect may lead to false-positive misdiagnosis (the outliers identified are actually not outliers). Such misdiagnoses are attributed to the neglect of the hydraulic correlation between water level changes and flow changes in statistical methods. Because of changes in the water level of open-channel water transfer projects (i.e., continuous short- or long-term changes in adjacent water levels in response to changes in operational requirement or scheduling objective), it is difficult to determine the length and method parameters (e.g., sigma estimation) for detection of outliers. Hence, statistical methods are not applicable to real-time detection of outliers in water level datasets of water transfer projects.
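The two statistical rules discussed above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the cited studies; the function names, the choice of the population standard deviation, and the fence multiplier `k = 1.5` are assumptions made for the example.

```python
import statistics


def three_sigma_outliers(values):
    """Return indices whose value lies more than 3 standard deviations from the mean."""
    mean = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumed choice)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) > 3 * sigma]


def boxplot_outliers(values, k=1.5):
    """Return indices outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


# A 0.5 m spike in a 30-point window is flagged by both rules.
long_window = [144.75] * 30
long_window[10] = 145.25
print(three_sigma_outliers(long_window), boxplot_outliers(long_window))

# In an 8-point window, even a 1 m spike inflates sigma enough to hide itself
# from the 3-sigma rule, which illustrates how sensitive the rule is to window length.
short_window = [144.75] * 7 + [145.75]
print(three_sigma_outliers(short_window), boxplot_outliers(short_window))
```

The second case shows why the window length matters: with eight points, no single value can deviate from the mean by more than about 2.65 population standard deviations, so the 3-sigma rule cannot flag anything.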
The inconsistency of the definition of water level outliers and the imbalance of flow in water level datasets make real-time detection of water level outliers extremely difficult. Outliers can be defined in different ways (Hawkins 1980; Barnett & Lewis 1994; Chandola et al. 2009; Rousseeuw & Hubert 2011; Liu et al. 2012), and no consensus is reached for open-channel water transfer projects due to the complexity of water level changes. Very often, datasets are reviewed manually at regular intervals, which is labor-intensive and inefficient and easily causes false or incomplete identification. It is clear that changes in the water levels of a given channel are affected not only by changes in inflow but also by changes in outflow. However, the monitoring of the inflow and outflow of the channel is subject to a variety of uncertainties, resulting in long-term water imbalance in the datasets. Therefore, the key for real-time detection of water level outliers is to process the flow datasets under the influence of uncertainties in an easier and quicker way. Based on the water level-flow relationship, we propose for the first time a novel high-accuracy outlier index for the real-time detection of outliers in water level datasets, which contributes to improving the quality of the water level-flow datasets and provides high-precision data for hydrodynamic simulation, emergency warning, and data mining. Our major contributions are threefold:
1. An outlier index is proposed based on the water level-flow relationship and the thresholds for water level outliers are determined based on the order of magnitude of flow and water level differences.
2. The definition of water level outliers for open-channel water transfer projects is proposed.
3. A case study is presented to demonstrate the effectiveness of the outlier index in the real-time identification of outliers for open-channel water transfer projects.
The paper is organized as follows. Section 2 describes the outlier index and its definition and derivation process; Section 3 presents the study area and the method for outlier detection; Section 4 describes and analyzes the results; Section 5 presents the conclusions.
OUTLIER INDEX
Definition of water level outliers
How water level outliers are defined is closely related to the accuracy of the detection method. Water levels are generally measured using a water level ruler, with a deviation of about 3–5 cm due to the influence of various factors; in this study, a measured water level that deviates by more than 3 cm from the reading of the ruler is defined as an outlier. However, manual readings of the water ruler are difficult to obtain, so the water level simulated by the one-dimensional hydrodynamic model (see Section 3.2) is used in place of the ruler reading in this paper.
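The definition above amounts to a simple threshold test. The sketch below is only an illustration of that test; the function name and the rounding of the deviation to 0.1 mm (to avoid floating-point noise at the 3 cm boundary) are assumptions, not part of the original method.

```python
def is_water_level_outlier(measured_m, reference_m, tolerance_m=0.03):
    """Return True when the measured level deviates from the reference by more than the tolerance.

    Both levels are in metres; the reference stands in for the ruler reading
    (in the paper, the simulated level from the hydrodynamic model).
    """
    deviation = round(abs(measured_m - reference_m), 4)  # suppress float noise
    return deviation > tolerance_m


print(is_water_level_outlier(144.79, 144.75))  # deviation of 4 cm
print(is_water_level_outlier(144.78, 144.75))  # deviation of exactly 3 cm
```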
Classification based on the order of magnitude
The water level ruler is generally accurate to the centimeter level. To make the analysis more practical, the raw datasets of water level and flow are processed before the first-order differential (FOD) calculation: the values of water level and flow are rounded to two decimals (0.01 m) and one decimal (0.1 m3/s), respectively. After FOD calculation, ΔZ and ΔQ are classified by order of magnitude, where Z is the water level data point, m; and Q is the flow data point, m3/s. The classification results are shown in Table 1.
Category | A | B | C | D | E |
---|---|---|---|---|---|
ΔQ (m3/s) | | [0.1, 1) | [1, 10) | [10, 100) | [100, 1,000) |
ΔZ (m) | [0.01, 0.1) | [0.1, 1) | [1, 10) | [10, 100) | [100, 1,000) |
Order of magnitude | −2 | −1 | 0 | 1 | 2 |
Number | 0.01 | 0.1 | 1 | 10 | 100 |
Scientific notation | 10^−2 | 10^−1 | 10^0 | 10^1 | 10^2 |
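The classification in Table 1 can be sketched as a lookup on the rounded magnitude of a difference. The function and constant names below are hypothetical, and rounding before the comparison is an assumption made so that floating-point residue (for example 0.7 − 0.6) does not push a value into the wrong class.

```python
# Class boundaries taken from Table 1: each class covers [lower, upper).
CLASS_BOUNDS = [
    ("A", 0.01, 0.1),
    ("B", 0.1, 1),
    ("C", 1, 10),
    ("D", 10, 100),
    ("E", 100, 1000),
]


def magnitude_class(difference, decimals):
    """Return the Table 1 class of |difference|, or None if it falls outside all classes.

    `decimals` is 2 for water level differences (0.01 m resolution)
    and 1 for flow differences (0.1 m3/s resolution).
    """
    magnitude = round(abs(difference), decimals)
    for label, lower, upper in CLASS_BOUNDS:
        if lower <= magnitude < upper:
            return label
    return None  # zero difference or a value of 1,000 and above


print(magnitude_class(-0.01, 2))  # water level difference of 1 cm
print(magnitude_class(-1.6, 1))   # flow difference of 1.6 m3/s
```

With one-decimal rounding a flow difference can never land in class A, which matches the empty class A cell in the flow row of Table 1.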
Derivation of the outlier index
The range of ΔQ/ΔZ is determined depending on the water transfer project. For open-channel water transfer projects, the flow difference is usually several orders of magnitude higher than the water level difference. In general, ΔZ belongs to class A or B. If ΔZ belongs to class A, then ΔQ belongs to class B or C. In this case, the corresponding range of ΔQ/ΔZ is 10–90 or 100–900, respectively; similarly, if ΔZ belongs to class B, then ΔQ belongs to class C or D, and the corresponding range of ΔQ/ΔZ is 10–99 or 100–999, respectively. When ΔQ/ΔZ is outside the above range, there is an outlier in ΔQ/ΔZ, but whether ΔQ or ΔZ is the outlier needs to be further determined.
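The range check described above can be sketched as follows, assuming the outlier index is the ratio ΔQ/ΔZ (consistent with the m2/s values tabulated in the results). The function names are hypothetical, and merging the two sub-ranges for each class into a single interval (10 to 900 for class A, 10 to 999 for class B) is a simplifying assumption of this sketch.

```python
def outlier_index(dq, dz):
    """Return the ratio dQ/dZ in m2/s, or None when dZ is zero and the ratio is undefined."""
    if dz == 0:
        return None
    return round(dq / dz, 6)  # rounding suppresses floating-point noise


def expected_bounds(dz):
    """Return the expected (low, high) magnitude of dQ/dZ for dZ in class A or B, else None."""
    magnitude = round(abs(dz), 2)
    if 0.01 <= magnitude < 0.1:  # class A
        return 10, 900
    if 0.1 <= magnitude < 1:     # class B
        return 10, 999
    return None


def screen(dq, dz):
    """Label one (dQ, dZ) pair as 'normal', 'suspect', or 'undetermined'."""
    index = outlier_index(dq, dz)
    bounds = expected_bounds(dz)
    if index is None or bounds is None or dq == 0:
        return "undetermined"  # a zero difference gives no usable ratio
    low, high = bounds
    return "normal" if low <= abs(index) <= high else "suspect"


print(screen(-1.6, -0.01))  # ratio of 160 lies inside the class A range
print(screen(0.1, 0.03))    # ratio of about 3.3 lies below the range
```

A "suspect" label only says that the ratio is out of range; as the text notes, deciding whether the flow or the water level is at fault requires a further step.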
The low-frequency (e.g., hourly) flow data need to be homogenized. For low-frequency flow datasets, the flow data points at adjacent moments may fluctuate substantially due to various causes such as equipment errors and gate or pump regulation. Under stable water level conditions, the water level changes only slightly (<0.05 cm) between adjacent moments, so the fluctuation range of the low-frequency flow data at adjacent moments has a large impact on the ΔQ/ΔZ value. Therefore, the flow data at adjacent moments can be averaged to reduce the fluctuation and improve the identification of outliers.
STUDY AREA AND METHODS
Study area
The project has been in operation since 12 December 2014, and a large volume of water level data is now available under different climate and operating conditions. Because of security and privacy concerns, minute-level monitoring data are not accessible. In this study, 15,570 rows of data at 2-h intervals are collected for each gate for the period 2017–2021.
No. | Date and time | Water level before the gate (m) | Gate-hole 1 | Gate-hole 2 | Flow (m3/s) | Outlet 1 (m3/s) | Outlet 2 (m3/s) |
---|---|---|---|---|---|---|---|
4 | 2018-01-01 00:00:00 | 144.775 | 1,050 | 1,080 | 128.0543 | 0 | 0 |
4 | 2018-01-01 02:00:00 | 144.7781 | 1,050 | 1,080 | 125.9459 | 0 | 0 |
4 | ……… | … | … | … | … | … | … |
4 | 2018-07-01 02:00:00 | 144.7558 | 1,720 | 1,720 | 182.1611 | 0 | 0 |
4 | ……… | … | … | … | … | … | … |
4 | 2018-12-25 14:00:00 | 144.6307 | 1,280 | 1,280 | 143.2842 | 0 | 0 |
4 | 2018-12-25 16:00:00 | 144.6273 | 1,280 | 1,280 | 142.3666 | 0 | 0 |
Note: Null values in the raw datasets are filled with zero.
No.: gates are numbered sequentially from upstream to downstream. Date and time: the date and time of each data record, at a time interval of 2 h. Water level before the gate: the values are obtained automatically from the monitoring equipment and are the average over all the gate holes; gate 4 has 2 gate holes. These are the water level data in which outliers are to be detected. Gate-hole 1, Gate-hole 2: the opening size of each gate hole, obtained automatically from the monitoring equipment. Flow: the sum of the flow through each gate hole; the monitoring equipment is usually deployed in front of the gate. Outlet 1, Outlet 2: the outlets are located in channel 3, and their total indicates the outflow of channel 3.
Figure 3(a) shows that the overall variation of the water level before the gate in 2018 (144.4–144.9 m) is within 0.5 m. The dataset also fluctuates more smoothly after FOD calculation (Figure 3(b)). In most instances, the changes in water level are less than 0.05 m and symmetrical about 0, and the differences between the inflow and outflow of channel 3 are between 0 and 10 m3/s. This indicates that evaporation and leakage loss during water transfer should not be ignored and that the outlier index formula is reasonable. The 2-h frequency flow data need to be homogenized. In this study, the average of the flow at two adjacent moments is taken as the flow at the latter moment.
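The preprocessing chain described so far (averaging adjacent flow values, rounding, then first-order differencing) can be sketched as below. The function names are hypothetical, and keeping the first flow value unchanged (it has no predecessor to average with) is an assumption of this sketch. The sample values are the last two rows of Table 2.

```python
def homogenize(flow):
    """Replace each flow value by the mean of itself and the previous value.

    The first value has no predecessor and is kept as is (assumed handling).
    """
    return [flow[0]] + [(prev + cur) / 2 for prev, cur in zip(flow, flow[1:])]


def first_order_difference(series, decimals):
    """Round the series to the given resolution, then difference adjacent values."""
    rounded = [round(v, decimals) for v in series]
    return [round(cur - prev, decimals) for prev, cur in zip(rounded, rounded[1:])]


levels = [144.6307, 144.6273]  # water level before the gate, m
flows = [143.2842, 142.3666]   # flow through the gate, m3/s

dz = first_order_difference(levels, 2)             # 0.01 m resolution
dq = first_order_difference(homogenize(flows), 1)  # 0.1 m3/s resolution
print(dz, dq)
```

Here the two water levels round to the same centimeter, so the water level difference is zero and the ratio ΔQ/ΔZ is undefined for this step, which is the situation the text assigns to manual diagnosis.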
Method
In this section, the simulated results of the 1D hydrodynamic model are used as water ruler data to find the true values of the water level, and a random noise greater than 3 cm is added to some true values. Finally, the outlier index is used for the detection of outliers.
In the hydrodynamic model, the water level results of gate 49 for 7 consecutive days (total 85 data points) are used as the upstream boundary condition, and the flow results of gate 50 for the same time series are used as the downstream boundary condition. The model output is the water level data in front of gate 50. The Preissmann four-point implicit difference scheme is used to discretize Equations (7) and (8). If the deviation between simulated and gauged data at the corresponding moment is within 3 cm, the 85 monitored data points are considered to be true values.
Random noise greater than 3 cm is added to 15 randomly selected non-adjacent data points to construct the water level dataset with outliers. In each run, noise of the same size is added to all 15 data points, and the size is gradually increased from 4 to 9 cm in steps of 1 cm. The outlier index is then calculated.
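One way to construct such a test dataset is sketched below. The sampling scheme (drawing from a shortened range and spreading the draws apart) and the function names are assumptions of this example; the paper does not state how the 15 non-adjacent points were drawn.

```python
import random


def pick_non_adjacent(n_points, k, rng):
    """Draw k sorted indices from range(n_points) such that no two are consecutive.

    Choosing k distinct values from range(n_points - k + 1) and adding the rank
    of each value guarantees a gap of at least 2 between neighbours.
    """
    base = sorted(rng.sample(range(n_points - k + 1), k))
    return [value + rank for rank, value in enumerate(base)]


def add_noise(levels, indices, noise_m):
    """Return a copy of the levels with the same noise (in metres) added at each index."""
    noisy = list(levels)
    for i in indices:
        noisy[i] = round(noisy[i] + noise_m, 2)
    return noisy


rng = random.Random(0)            # fixed seed so the run is repeatable
true_levels = [144.75] * 85       # placeholder series; 85 points as in the 7-day window
selected = pick_non_adjacent(len(true_levels), 15, rng)
for noise_cm in range(4, 10):     # 4 to 9 cm in steps of 1 cm
    noisy_levels = add_noise(true_levels, selected, noise_cm / 100)
```

Repeating the loop with negative noise gives the second half of the test cases, for 15 × 6 × 2 = 180 injected outliers in total.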
RESULTS AND DISCUSSION
The 15 randomly selected non-adjacent water levels and other data are shown in Table 3. Adding a noise to a water level point changes the two ΔZ values associated with that point by the same amount but in opposite directions. Some of the outlier indexes after adding noise are shown in Table 4.
No. | Serial No. | ΔQ_i (m3/s) | ΔZ_i (m) | ΔQ_i/ΔZ_i (m2/s) | ΔQ_{i+1} (m3/s) | ΔZ_{i+1} (m) | ΔQ_{i+1}/ΔZ_{i+1} (m2/s) |
---|---|---|---|---|---|---|---|
1 | 7 | − 1.6 | − 0.01 | 160 | 0.3 | 0.01 | 30 |
2 | 12 | 0.1 | − 0.01 | − 10 | 0.2 | 0 | / |
3 | 18 | − 0.3 | 0 | / | 3.1 | 0 | / |
4 | 23 | − 0.3 | 0 | / | 0.8 | 0.01 | 80 |
5 | 37 | 0.5 | 0 | / | − 3.1 | − 0.01 | 310 |
6 | 44 | − 0.2 | 0 | / | − 0.1 | 0.01 | − 10 |
7 | 48 | 0.1 | 0 | / | 0.5 | 0 | / |
8 | 51 | − 0.1 | 0.01 | − 10 | − 0.3 | 0 | / |
9 | 57 | 0.4 | − 0.01 | − 40 | − 0.2 | 0 | / |
10 | 61 | 0 | 0.01 | 0 | 0.3 | 0 | / |
11 | 65 | − 1 | 0 | / | 0 | 0 | / |
12 | 70 | − 0.3 | 0.01 | − 30 | − 0.3 | 0 | / |
13 | 73 | 0.5 | − 0.01 | − 50 | − 0.6 | 0 | / |
14 | 77 | 3.6 | 0 | / | − 0.1 | 0 | / |
15 | 81 | 1.2 | 0 | / | − 0.1 | 0 | / |
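No. | Serial No. | Noise = 0.04 m: ΔQ_i/ΔZ_i (m2/s) | Noise = 0.04 m: ΔQ_{i+1}/ΔZ_{i+1} (m2/s) | Noise = −0.04 m: ΔQ_i/ΔZ_i (m2/s) | Noise = −0.04 m: ΔQ_{i+1}/ΔZ_{i+1} (m2/s) | Noise = 0.06 m: ΔQ_i/ΔZ_i (m2/s) | Noise = 0.06 m: ΔQ_{i+1}/ΔZ_{i+1} (m2/s) | Noise = −0.07 m: ΔQ_i/ΔZ_i (m2/s) | Noise = −0.07 m: ΔQ_{i+1}/ΔZ_{i+1} (m2/s) | Noise = 0.09 m: ΔQ_i/ΔZ_i (m2/s) | Noise = 0.09 m: ΔQ_{i+1}/ΔZ_{i+1} (m2/s) |
---|---|---|---|---|---|---|---|---|---|---|---|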
1 | 7 | − 53 | − 10 | 32 | 6 | − 32 | − 6 | 20 | 4 | − 20 | − 4 |
2 | 12 | 3 | − 5 | − 2 | 5 | 2 | − 3 | − 1 | 3 | 1 | − 2 |
3 | 18 | − 8 | − 78 | 8 | 78 | − 5 | − 52 | 4 | 44 | − 3 | − 34 |
4 | 23 | − 8 | − 27 | 8 | 16 | − 5 | − 16 | 4 | 10 | − 3 | − 10 |
5 | 37 | 13 | 62 | − 13 | − 103 | 8 | 44 | − 7 | − 52 | 6 | 31 |
6 | 44 | − 5 | 3 | 5 | − 2 | − 3 | 2 | 3 | − 1 | − 2 | 1 |
7 | 48 | 3 | − 13 | − 3 | 13 | 2 | − 8 | − 1 | 7 | 1 | − 6 |
8 | 51 | − 2 | 8 | 3 | − 8 | − 1 | 5 | 2 | − 4 | − 1 | 3 |
9 | 57 | 13 | 5 | − 8 | − 5 | 8 | 3 | − 5 | − 3 | 5 | 2 |
10 | 61 | 0 | − 8 | 0 | 8 | 0 | − 5 | 0 | 4 | 0 | − 3 |
11 | 65 | − 25 | 0 | 25 | 0 | − 17 | 0 | 14 | 0 | − 11 | 0 |
12 | 70 | − 6 | 8 | 10 | − 8 | − 4 | 5 | 5 | − 4 | − 3 | 3 |
13 | 73 | 17 | 15 | − 10 | − 15 | 10 | 10 | − 6 | − 9 | 6 | 7 |
14 | 77 | 90 | 3 | − 90 | − 3 | 60 | 2 | − 51 | − 1 | 40 | 1 |
15 | 81 | 30 | 3 | − 30 | − 3 | 20 | 2 | − 17 | − 1 | 13 | 1 |
Figure 4 shows that the gauged water level is very stable over the 7 days, and more than half of the data have an FOD value of 0. The simulated water level is much less stable, but all the deviations from the corresponding gauged data are within 0.03 m. All the gauged data can therefore be considered the true values of the water level. The deviation of the water level at adjacent moments is 0 or 0.01 m, which means that ΔQ/ΔZ is very sensitive to the change of ΔQ. Thus, it is necessary to homogenize Q. Figure 5 shows that when the water level is in a steady state, the fluctuation of the raw flow difference is relatively large, while the mean-shifting treatment retains the overall variation trend and reduces the fluctuation of the flow difference. Most of the flow difference data lie between −1 and 2, and only three data points (19th, 36th, and 77th) lie between 6 and 8; these are averaged into six points that lie between 3 and 5.
In Table 3, 13 of the 15 data points have at least one ΔQ value with a magnitude of 0.3 m3/s or less, which means that the magnitude of ΔQ/ΔZ is likely to fall below 10 once the magnitude of the noisy ΔZ exceeds 0.03 m. Therefore, small-amplitude outliers can be easily detected in most data. However, data such as the 5th and 11th points (both ΔQ values are relatively large, or one is 0 and the other is relatively large) are not sensitive to small-amplitude outliers. Manual diagnosis is needed to determine whether the error lies in the flow or in the water level when the outlier index is greater than 90 or 990, or when ΔQ or ΔZ is 0.
Based on Table 1 and the discussion at the end of Section 2.3, the outliers resulting from the addition of different random noises can be identified according to the outlier index of either of the two adjacent differences, ΔQ_i/ΔZ_i or ΔQ_{i+1}/ΔZ_{i+1}, as shown in Table 4. As the noise is increased from 0.04 m (−0.04 m) to 0.09 m (−0.09 m) in steps of 1 cm, a total of 180 outliers are involved. Of these 180 outliers, 159 are detected: 11 (12), 12 (13), 13 (14), 14 (14), 14 (14), and 14 (14) for the six noise levels, respectively. Because the 2-h dataset is not sufficient to characterize the regulation process of the gates, manual diagnosis cannot be performed for cases such as the abnormal 11th data point in Figure 4, and these are temporarily counted as missed diagnoses. According to Equation (6) and Table 4, when ΔQ is unchanged, the larger the deviation of the abnormal data, the closer the outlier index is to 0, and thus the random noise in Table 4 is representative. The total detection rate of outliers reaches 88.3%, and it increases as the noise increases.
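The counts above can be checked with a short sketch. The detection rule used here (a point is detected when either adjacent index is non-zero and smaller than 10 in magnitude) is an inference that reproduces the reported count for the 0.04 m columns of Table 4; it is not stated in this form in the text, and the names are hypothetical.

```python
def is_detected(index_i, index_next, threshold=10):
    """Return True when either adjacent index is non-zero and below the threshold in magnitude."""
    return any(0 < abs(idx) < threshold for idx in (index_i, index_next))


# Index pairs copied from the 0.04 m columns of Table 4 (serial Nos. 7 to 81).
pairs_004 = [
    (-53, -10), (3, -5), (-8, -78), (-8, -27), (13, 62),
    (-5, 3), (3, -13), (-2, 8), (13, 5), (0, -8),
    (-25, 0), (-6, 8), (17, 15), (90, 3), (30, 3),
]
detected_004 = sum(is_detected(a, b) for a, b in pairs_004)

# Detected counts as reported in the text, positive noise then negative noise
# for each of the six noise levels from 4 to 9 cm.
detected_per_case = [11, 12, 12, 13, 13, 14, 14, 14, 14, 14, 14, 14]
total_outliers = 15 * 6 * 2
detected_total = sum(detected_per_case)
detection_rate = round(100 * detected_total / total_outliers, 1)
print(detected_004, detected_total, total_outliers, detection_rate)
```

Under this rule, the 11th point (serial No. 65) is never flagged because its second index is always 0, which matches its treatment as a missed diagnosis.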
However, it should be noted that the outlier index may not identify outliers in any one of the following four situations:
- (a)
, and ;
- (b)
, and ;
- (c)
, and ;
- (d)
, and .
CONCLUSIONS
This study may contribute to the detection of water level outliers for open-channel water transfer projects. To our knowledge, this is the first study on the real-time detection of water level outliers for open-channel water transfer projects. The main findings are: (1) an outlier index is proposed based on water balance; (2) the water level outlier is defined for water transfer projects; and (3) a case study with the Middle Route of the South-to-North Water Diversion Project shows that, of the 180 outliers with a deviation of at least 0.04 m, 159 can be detected, a detection rate of 88.3%.
The limitations of this study warrant further investigation. First, some well-established indicators are excluded because of the lack of high-frequency data. Second, an outlier can be detected in ΔQ/ΔZ, but whether ΔQ or ΔZ is the outlier needs to be further studied. Finally, the outlier index still has units since water wave motion is not considered, and a dimensionless outlier index should be derived on the basis of momentum conservation in the future.
AUTHOR CONTRIBUTIONS
L.Z., X.L., and H.W. conceptualized the whole article; L.Z. and Z.Z. developed the methodology; Y.Q., L.Z., Z.H., and Z.Z. validated the article; L.Z. and Y.Q. rendered support in data curation; L.Z. wrote the original draft; L.Z. wrote the review and edited the article. All authors have read and agreed to the published version of the manuscript.
FUNDING
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.