Abstract
A parallel algorithm for 1D free-surface flow simulation in irrigation canals is presented. The model is based on the Hartree method applied to the Saint-Venant equations. Because the flow in irrigation canals is close to steady, external and internal boundary conditions are linearized to preserve the parallel character of the scheme. Gate trajectories, off-take withdrawals, and external boundary conditions are modeled as piece-wise functions of time, so they contain discontinuities. To achieve a fully parallelized algorithm, an explicit version of the Hartree method is chosen, and the external and internal boundary conditions are linearized around the operating point. This approach is used to build a computer simulator written in C-CUDA. Two tests by the ASCE Task Committee on Canal Automation Algorithms have been used to evaluate the accuracy and performance of the algorithm. The Maricopa Stanfield benchmark has been used to prove its accuracy, and the Corning Canal benchmark to evaluate performance in terms of processing time. Remarkably, solving a 12 hr-long prediction horizon with a cell size of about Δx = 10 m takes less than 1 s on an NVIDIA K40 card. Results were compared with a serial and a multi-CPU version of the same algorithm. Among the implementations tested on different platforms, the GPU version showed the best performance.
HIGHLIGHTS
A high-performance parallel algorithm for solving the 1D Saint-Venant equations in irrigation canals has been implemented. It is highly efficient, enabling many simulations to be run in a reasonable period of time. In this way, genetic algorithms can be used in the context of inverse problems in free-surface flow phenomena.
The GPU parallel algorithm used in the paper is very efficient, making it possible to obtain steady-flow solutions by using the transient equations as an alternative to the step method.
The philosophy inherent in the methodology could also be applied to other phenomena, such as flow in pipes. The resolution time of the transients is so short that the steady state could be found from the final state of a transient in very large networks in real time.
INTRODUCTION
The development of numerical models that represent real systems usually involves a calibration phase, which implies determining the unknown parameters of the governing equations of the phenomenon from observed data. One of the most widely used methodologies is genetic algorithms, which can involve thousands of simulations (Whitley 1994). Therefore, having efficient simulation algorithms is very important. Herein, the phenomenon studied is free-surface flow in open canals, which is described by means of the 1D Saint-Venant equations (1D-SVE). In the context of irrigation canal control – more specifically, when a model predictive control strategy is adopted – simulations are needed to obtain good predictions of the response to control actions (Martín-Sanchez & Rodellar 1996). Efficient simulators could also be useful in this field.
The use of graphics processing units (GPUs) for scientific computing has increased in the last decade, since scientists and engineers have been looking for alternatives to central processing units (CPUs) for high-performance computing (Anders & Kandrot 2010). Nevertheless, most GPU applications have been developed in the field of 2D and 3D computational fluid dynamics, and the number of 1D applications in civil engineering hydraulics/hydrology is significantly lower. In this paper, we implement a parallel algorithm for solving the 1D-SVE in irrigation canals that uses GPUs. Several works in the literature report that explicit methods have a greater potential than implicit ones to take advantage of parallelization on GPUs (Aissa et al. 2017; Bondarenco et al. 2017; Bessone et al. 2018). Thus, an explicit numerical scheme has been chosen for the algorithm.
According to Wylie (1969), all numerical methods, whether explicit, implicit, or based on characteristics, give similar results when compared with observed data, depending more on the accuracy of the starting data than on the particular methodology. In this paper, the Hartree method with linear space interpolations has been implemented, that is, a first-order version of the method of characteristics (MOC). The MOC was chosen since it is well known for its merits regarding theoretical and physical interpretation in 1D-SVE modeling (Yang & Chiu 1993). Details of the Hartree method can be found in Katapodes (2016) and Litrico & Fromion (2009). Gómez (1988) considers the MOC one of the best methods to solve unsteady flow problems in prismatic canals. Soler et al. (2015) used the Hartree method in the context of inverse problems to reproduce the flow in a laboratory canal, and were able to accurately reproduce observed water depth hydrographs in a transient.
The application of numerical techniques requires discretizing the x/t (space/time) domain and finding the solution only at certain points. The canal is divided into cross-sections and two variables are considered at each section: the water depth yi and the flow velocity vi. As the spatial discretization is fixed, the solution is always obtained at the same points of the canal, and the solution at each instant depends on the solution at the previous instant. The specified-time-interval scheme has commonly been used when applying the MOC to unsteady open-canal flow problems. However, with this scheme a numerical error is always introduced by the interpolation used to approximate the characteristic trajectories. To overcome this lack of accuracy, Yang & Chiu (1993) proposed using the MOC with higher-order splines for interpolation, and in their demonstration example cell sizes from Δx = 200 m to Δx = 800 m were used. Alternatively, it is possible to use the four-point implicit scheme, also known as the box finite-difference scheme or Preissmann's method, which is known to be conservative. Although the mesh size in the box scheme can be larger than in the MOC, the two methods give good results, both in terms of surface profiles at a given time instant and in terms of the flow hydrograph at a section (Ostad-Ali-Askari et al. 2018). Another strategy for increasing accuracy is fine discretization, which has been adopted herein.
The fine linear discretization strategy has been adopted since the increased number of computational nodes is compensated by the simplicity of the computation. A fine discretization forces a reduction of the time step size according to the Courant–Friedrichs–Lewy (CFL) condition, so the efficiency in terms of computation time is reduced. However, the concept of efficiency is not as simple as it first seems, since explicit schemes can make the most of parallel architectures. In an explicit scheme, the solution at each node depends exclusively on the solution at the surrounding nodes. Thus, each node can be updated independently of the rest, and it is not necessary to fetch values from memory far from the point being updated. Under these conditions, each point can be solved by a different processing unit at the same time, either on the CPU or on the GPU. Adopting an explicit scheme is therefore amenable to multi-threading.
The box finite-difference scheme, together with the external and internal boundary conditions, results in a system of linear equations which must be solved at each time step. For a single canal, the coefficient matrix has a bandwidth of five and can be solved by one of many banded-matrix solvers. Direct solvers for banded linear systems cannot run in parallel without communication, which is accomplished through a shared memory bus and synchronization, both of which may cause delays. Moreover, for canal-network problems, sparse terms destroy the banded structure, and the consequent loss of data locality deteriorates performance on many-core platforms. In addition, the memory use of implicit solvers strongly limits the size of the problems that can be solved.
Currently, research on new numerical schemes for the 1D-SVE is aimed at improving finite-volume techniques; specifically, new ways of formulating the Riemann solver are investigated, such as in Hodges & Liu (2019). These techniques are very suitable for the complex flows that occur in rivers, for example in the presence of hydraulic jumps and very irregular geometries. They do not contribute much in the case of flow in irrigation canals, where it is unusual to find hydraulic jumps or other types of discontinuity. It is well known that the finite-volume technique is based on the principle of conservation of momentum and is therefore adequate for local phenomena with discontinuities where energy losses are unknown, as occurs just downstream of a gate (Cozzolino et al. 2015). In other words, finite-volume techniques admit non-continuous solutions. Conversely, when the hydraulic variables are continuous, as is most often the case in irrigation canals, the energy principle seems more appropriate, since the resulting solutions are always continuous.
In view of these points, a totally explicit scheme based on the Hartree method with linear interpolations has been chosen. The method belongs to the MOC category, and its presentation is the objective of the following section. In the past, the method was rejected due to its lack of accuracy, but that was a time when computers had little power, so very coarse spatial discretizations had to be used (Δx = 300 m in Gómez (1988), for example). CPU-GPU acceleration makes it possible to use fine discretizations (Δx = 1 m, for example) in large-scale canals (a 28 km-long canal, for example), so the lack of accuracy is overcome.
The remaining sections of this paper are organized as follows. In the next section, the parallel algorithm is presented, arguments justifying each of its steps are given, and the reasons why the Hartree method has been chosen for solving the 1D-SVE are explained. The numerical scheme used is then described in detail, including the discretization of the 1D-SVE and the treatment of boundary conditions, both external and internal. In the section titled ‘Illustrative examples’, a pair of hydraulically complicated scenarios, based on benchmarks from the canal control literature, is presented for testing the algorithm. The examples show the advantages of parallel algorithms over traditional serial algorithms, and the algorithm's performance, in terms of processing time to complete the scenario simulation, is compared across different platforms.
PARALLEL ALGORITHM
Although the 1D-SVE can be applied in several research fields, such as river analysis systems or drainage system modeling, this paper is focused on the flow in irrigation canals. The flow in irrigation canals, which are often lined and prismatic, shows characteristics that differentiate it from river and drainage system flow and that influence the choice of the numerical scheme. Usually, water management in irrigation canals relies on electro-hydraulic devices that control the flow (such as in-line gates) and remain in the same position for long periods of time. When a change of position occurs relatively quickly, it can be considered instantaneous, especially when the movement time is compared with the time the gates remain stopped. All this results in a close-to-steady, sub-critical regime in which changes occur very slowly. This behavior is common in irrigation canal control systems, and it lends itself to a mathematical description by means of piece-wise functions of time, like the gate trajectories shown in Figure 1. As can be seen below, this description is interesting for parallelization purposes.
Sketch of a piece-wise function of time. The piece Ui(n) represents the control variable inside the nth Regulation Period, that is, the period of time between instants Tn−1 and Tn. The position of the i-th device at instant tk is denoted by lower-case letters. Notice that there is a jump discontinuity at the end of each Regulation Period.
In the context of irrigation canal control, the flow control devices stay in the same position, without moving, most of the time. Only occasionally does the control algorithm change the device positions according to a desired target (the period between movements is usually called the sampling period, but is referred to as the ΔT-long Regulation Period in this paper). As a result, the unsteady flow varies weakly and slowly around the operating point, in a close-to-steady canal state. Accordingly, it is possible to make the following assumptions:
Flow control trajectories can be defined as ΔT-long piece-wise functions of time. Figure 1 plots a generic piece-wise function.
An explicit scheme with fine discretizations for the Saint-Venant equations can be used in the numerical model.
External and internal boundary conditions can be linearized in a small region around the operating point.
The number of CFL stability condition verifications, and the corresponding time step size computations, can be greatly reduced, because it is not necessary to perform them at every time step.
Taking these points into account, it is possible to apply the previous basic strategies for parallelization efficiently by solving the flow equations for an entire Regulation Period in the GPU with no global memory accesses, as can be seen below.
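As an illustration of the first assumption, a gate trajectory can be stored as one opening per Regulation Period and evaluated by indexing with the current time. The structure and function names in the following sketch are assumptions made for illustration, not the simulator's own:

```c
// Piece-wise-constant gate trajectory: one opening per Regulation Period.
// Struct and function names are illustrative, not the simulator's own.
typedef struct {
    double dT;         /* Regulation Period length [s]              */
    int    lambda;     /* number of Regulation Periods              */
    const double *u;   /* u[n] = opening during the (n+1)-th period */
} GateTrajectory;

static double gateOpening(const GateTrajectory *g, double t)
{
    int n = (int)(t / g->dT);                  // index of the current Regulation Period
    if (n < 0) n = 0;
    if (n >= g->lambda) n = g->lambda - 1;     // hold the last piece after the horizon
    return g->u[n];                            // constant within the period; jump at t = n*dT
}
```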
At the end of each Regulation Period, external and internal boundary conditions change instantaneously; for example, actuators change the gate openings. Therefore, between two consecutive Regulation Periods there is a discontinuity, called herein a Jump Step. At these discontinuities, the original non-linearized equations have to be solved using iterative techniques, in which the number of floating-point operations varies greatly depending on the flow conditions. This is why it has been decided to compute the Jump Step on the CPU. Broadly speaking, the simulation problem in open canals is divided into two parts, depending on where they are solved: on the GPU or on the CPU. An algorithm that meets all of these prescriptions is written as follows:
Load initial data and initial conditions into the CPU.
Compute the first time step size.
CPU external loop:
1. Initialize the Regulation Period counter: n = 1.
2. Compute the Jump Step: JumpStepModel(Δt, Y, V).
3. Compute the time step size: CFL.
4. Copy from CPU-memory to GPU-memory.
5. GPU inner loop:
   1. k = 0; t = (n − 1)·ΔT; tend = n·ΔT; dstOut = true.
   2. If dstOut is true, call RegulationPeriodModel reading from the first pair of vectors and writing into the second; otherwise, call it with the vector pointers exchanged.
   3. k = k + 1; t = t + ΔtG; dstOut = !dstOut.
   4. If t < tend, go to GPU inner loop step 2.
   5. If dstOut is false, exchange the vector pointers so that the final state lies in the vectors to be copied back.
6. Copy from GPU-memory to CPU-memory.
7. n = n + 1.
8. If n ≤ λ, go to CPU external loop step 2.
The algorithm has two loops: the external one, implemented on the CPU, and the inner one, on the GPU. The NVIDIA compiler (nvcc) separates the source code into CPU and GPU components: the GPU functions are processed by nvcc, while the CPU functions are processed by a standard compiler, such as gcc.
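The following C-CUDA fragment sketches how such a host/device split can be organized. The function names follow the algorithm above, but the signatures, data layout, and helper routines are illustrative assumptions, not the exact interfaces of the simulator:

```c
// Sketch of the host (CPU external loop) / device (GPU inner loop) split.
#include <cuda_runtime.h>
#include <stdbool.h>

void   jumpStepModel(double *y, double *v, int nNodes);                        // assumed host helper
double cflTimeStep(const double *y, const double *v, int nNodes,
                   double dx, double g, double nc);                            // assumed host helper

__global__ void RegulationPeriodModel(const double *yIn, const double *vIn,
                                      double *yOut, double *vOut,
                                      int nNodes, double dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < nNodes - 1) {
        yOut[i] = yIn[i];   // placeholder: explicit Hartree update of node i
        vOut[i] = vIn[i];   // from nodes i-1, i, i+1 (a kernel sketch is given below)
    }
    // Boundary and checkpoint nodes would apply the linearized conditions here.
}

void simulate(double *hY, double *hV, int nNodes, int lambda, double dT, double dx)
{
    size_t bytes = (size_t)nNodes * sizeof(double);
    double *dY[2], *dV[2];
    for (int b = 0; b < 2; ++b) { cudaMalloc(&dY[b], bytes); cudaMalloc(&dV[b], bytes); }

    int blocks = (nNodes + 255) / 256;
    for (int n = 1; n <= lambda; ++n) {                          // CPU external loop
        jumpStepModel(hY, hV, nNodes);                           // non-linear Jump Step on the CPU
        double dt = cflTimeStep(hY, hV, nNodes, dx, 9.81, 0.95); // CFL once per Regulation Period
        cudaMemcpy(dY[0], hY, bytes, cudaMemcpyHostToDevice);    // one upload per period
        cudaMemcpy(dV[0], hV, bytes, cudaMemcpyHostToDevice);

        bool dstOut = true;                                      // GPU inner loop, ping-pong buffers
        for (double t = (n - 1) * dT; t < n * dT; t += dt) {
            int src = dstOut ? 0 : 1, dst = dstOut ? 1 : 0;
            RegulationPeriodModel<<<blocks, 256>>>(dY[src], dV[src],
                                                   dY[dst], dV[dst], nNodes, dt);
            dstOut = !dstOut;
        }
        int last = dstOut ? 0 : 1;                               // buffer holding the final state
        cudaMemcpy(hY, dY[last], bytes, cudaMemcpyDeviceToHost); // one download per period
        cudaMemcpy(hV, dV[last], bytes, cudaMemcpyDeviceToHost);
    }
    for (int b = 0; b < 2; ++b) { cudaFree(dY[b]); cudaFree(dV[b]); }
}
```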
The following comments about the algorithm can be made:
Theoretically, the CFL condition has to be satisfied always, that is, at every time step. Due to the close-to-steady nature of the flow, it is possible to update the time step size only once (CPU external loop step 3), just before entering the GPU inner loop. Since the largest changes in flow conditions occur when the control devices are re-positioned, the time step size update has been placed just after the Jump Step computation, and its computation is avoided in the inner loop. Here the vectors are accessed in global memory (referred to herein as CPU-memory).
Communication and synchronization between different sub-tasks are usually among the biggest obstacles to obtaining good performance in parallel programming. Updating the time step size according to the CFL condition is one example, because it cannot be fully parallelized: searching for the minimum value of a vector is, by definition, not fully parallelizable, since the candidate minimum has to be compared with all the components one by one. This is another reason why the CFL computation has also been placed on the CPU.
CUDA threads may access data from multiple memory spaces during their execution. Although every thread can access the same global memory, each has a private local memory (referred to herein as GPU-memory), which is accessed much faster than global memory. Threads launched by the kernel RegulationPeriodModel read from this private local memory. The kernel has been programmed so that it never needs to access global memory and uses only private local memory, as shown in GPU inner loop steps 4 and 5.
Optimizing memory usage starts with minimizing low-bandwidth data transfers between the CPU and the GPU, since otherwise a large fraction of the time is spent on data movement rather than on arithmetic. CPU-to-GPU data transfer, and vice versa, has been limited to once every external loop iteration.
Applying the Hartree method to the 1D-SVE results in an explicit system of two linear equations for each computational node. This feature makes the method very suitable for parallelization. Moreover, the boundary and interior conditions can be linearized to preserve this characteristic, and, since the control variables remain constant within a Regulation Period, the governing equations are simplified because their time derivatives cancel. Lumping all the equations together, a fully parallelizable set is obtained. The subroutine for solving this set is denoted RegulationPeriodModel (GPU inner loop step 2); how the set is obtained is one objective of this section. Here the vectors are accessed in private local memory.
Instantaneous changes in the hydraulic variables are represented by piece-wise functions of time. At these discontinuities the equations cannot be linearized, because the variables are not continuous and their derivatives do not exist. Therefore, the original equations have to be solved using iterative techniques; the corresponding subroutine is called JumpStepModel. This kind of method is inadequate for parallelization, since the number of iterations required by each node depends on the flow conditions at every time step. This is why JumpStepModel has been isolated on the CPU, at external loop step 2.
In the second step of the GPU inner loop, the input and output vectors are exchanged alternately to avoid having to update them at every iteration; only the vector pointers are passed to the function as parameters. This speeds up the inner loop.
In the rest of this section, the numerical formulation of the Hartree method and of the boundary conditions is given, showing how the three functions JumpStepModel(Δt, Y, V), RegulationPeriodModel(Δt, Y, V), and CFL are built.
Governing equations for 1D free-surface flow
The system (1) is applicable under the flow conditions found in irrigation canals, where:
the curvature of the free surface is small (vertical accelerations are neglected, and the vertical pressure distribution is hydrostatic),
the canal slope is assumed to be sufficiently small that the sine of the bed angle can be approximated by the bed slope,
the changes in flow conditions are not fast enough to generate steep wave fronts.
The mathematical transformation of system (1) into the equivalent system (4) can be found in many references, such as Ames (1977) or Ducheteau & Zachmann (1988), so it is not repeated here. System (4) describes the flow conditions in a canal in the same way as (1), and no new hypothesis is added in the transformation. However, its application is restricted: the variable x, which was initially independent, now depends on time t. Equation (4(a)) holds only along the curves, denoted x+, that satisfy Equation (4(b)), and, in the same way, (4(c)) holds only along the curves, denoted x−, that satisfy Equation (4(d)). Hence, the four equations of the system must be solved simultaneously.
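For reference, a standard way of writing this characteristic form with the variables y and v used here is the following (the notation of the paper's system (4) may differ in detail):

$$
\frac{dv}{dt} \pm \frac{g}{c}\,\frac{dy}{dt} + g\left(S_f - S_0\right) = 0
\quad \text{along} \quad \frac{dx}{dt} = v \pm c,
\qquad c = \sqrt{\frac{g\,A}{T}},
$$

where $S_0$ is the bed slope, $S_f$ the friction slope, $A$ the wetted area, and $T$ the free-surface width; the plus signs correspond to the $x^{+}$ curves and the minus signs to the $x^{-}$ curves.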
Linear space interpolations
The Hartree method belongs to the MOC family, and it uses linear interpolations in the space direction (Figure 2).
The solution at instant k + 1 for node i is explicit and depends only on the values at the three nodes i − 1, i, and i + 1 at instant k. Since these values are stored in a vector, they are located close together in memory and the processor can access them quickly. In contrast, this is not the case for implicit methods, where the solution at instant k + 1 for node i depends on the whole vectors at instants k − 1 and k. Another interesting property of (10) is that the system can be solved in a fixed number of floating-point operations for every i. Both are key factors for good parallelization, because the processing time is the same for all nodes i at time instant k, and the nodes can be solved simultaneously by the corresponding threads without synchronization requirements: one thread can be launched per node with no waiting time for synchronization.
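As a more concrete sketch of the interior-node update that such a kernel performs, the fragment below assumes a rectangular prismatic section and uses the standard first-order MOC relations (celerity, Manning friction slope, and linear interpolation at the characteristic feet); the exact formulas of the paper's Equation (10) may differ in detail:

```c
// One thread per interior node; rectangular section assumed for illustration.
__global__ void hartreeInterior(const double *yIn, const double *vIn,
                                double *yOut, double *vOut,
                                int nNodes, double dx, double dt,
                                double b, double nMan, double s0, double g)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= 0 || i >= nNodes - 1) return;   // boundary nodes handled elsewhere

    double yi = yIn[i], vi = vIn[i];
    double c  = sqrt(g * yi);                // wave celerity (rectangular section)

    // Feet of the forward (R) and backward (S) characteristics, linear interpolation.
    double xiP = (vi + c) * dt / dx;         // fraction towards node i-1
    double xiM = (c - vi) * dt / dx;         // fraction towards node i+1 (subcritical)
    double yR = yi - xiP * (yi - yIn[i - 1]);
    double vR = vi - xiP * (vi - vIn[i - 1]);
    double yS = yi + xiM * (yIn[i + 1] - yi);
    double vS = vi + xiM * (vIn[i + 1] - vi);

    // Friction slopes at the characteristic feet (Manning, rectangular section).
    double rR = b * yR / (b + 2.0 * yR);
    double rS = b * yS / (b + 2.0 * yS);
    double sfR = nMan * nMan * vR * fabs(vR) / pow(rR, 4.0 / 3.0);
    double sfS = nMan * nMan * vS * fabs(vS) / pow(rS, 4.0 / 3.0);

    // Compatibility relations along C+ and C-, solved as a 2x2 linear system.
    double cR = sqrt(g * yR), cS = sqrt(g * yS);
    double kP = vR + (g / cR) * yR - g * (sfR - s0) * dt;   // C+ : v + (g/c) y = kP
    double kM = vS - (g / cS) * yS - g * (sfS - s0) * dt;   // C- : v - (g/c) y = kM
    double yNew = (kP - kM) / (g / cR + g / cS);
    double vNew = kP - (g / cR) * yNew;

    yOut[i] = yNew;
    vOut[i] = vNew;
}
```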
The free-surface flow in open canals is governed by the 1D-SVE, but additional equations are needed to describe the flow through structures such as gates, siphons, junctions, jumps, weirs, off-takes, etc. Appropriate boundary conditions must be specified at off-takes, check gates, and the head and tail ends of the canal system; thus, additional relationships are also needed to describe the upstream and downstream external boundary conditions, as well as the internal conditions at checkpoints located along the canal.
Upstream external boundary condition (gate at canal header)
Flow in irrigation canals is usually sub-critical, so the second equation in (10) has to be satisfied along the x− characteristic (Figure 3). According to this, and taking into account uniqueness theorems for hyperbolic systems (see e.g., Crandall 1956), the second equation in (10) and Equation (11) have to be satisfied at the same time.
The set (12) is nonlinear, so it has to be solved using iterative techniques, such as the Newton–Raphson method. As mentioned above, this kind of technique is inappropriate for parallelization. In order to avoid it, the equations are linearized around the operating point. Since the gate opening u(t) and HUP(t) are defined as piece-wise functions of time, there are time instants at which these functions are not continuous, that is, at the Jump Steps. This is a problem for linearization, because the derivatives do not exist there. Therefore, it is mandatory to solve (12) with iterative techniques at the Jump Steps. This is the first reason why the Jump Steps are solved at CPU external loop step 2.
Downstream external boundary condition (canal tail)
Internal boundary condition (checkpoints)
Checkpoints are cross structures for canal flow control which may contain in-line gates, off-take withdrawals, lateral weirs, pumping stations, and so on. Their function is to interfere with the stream in order to regulate the flow. Canal models are often divided into pools separated by checkpoints, and the flow through checkpoints is usually regulated by gates. Herein, for simplicity, checkpoints are considered to have only a sluice gate and a pumped off-take withdrawal, as shown in Figure 5.
Consider a sluice gate at a checkpoint located between two canal pools, as Figure 6 shows. On the left is the upstream cell, defined by points xn and xn−1, and on the right the downstream cell, defined by x1 and x2. The objective is to compute the solution at instant k + 1 from the solution at instant k, that is, at points RU and RD from the variables interpolated at P and Q.
In the context of parallelization, all nodes are solved at the same time by the corresponding threads, and all of them have to finish before the next time step starts, which implies synchronization among threads. When solving (25), a 4 × 4 matrix has to be inverted to obtain a double solution, so the checkpoint node requires the largest computational time of any node; it is therefore highly advisable to optimize the source code for this kind of node. One way of optimizing is to invert the matrix by Cramer's rule: since four of the 16 elements are zero, many operations can be eliminated. Note that this node involves the largest number of floating-point operations among all node types defined in this paper, so this optimization reduces the computing time significantly.
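The following generic sketch shows a 4 × 4 Cramer's-rule solve with a fixed number of floating-point operations. The matrix entries of the paper's Equation (25) are not reproduced here, and the implementation described in the text additionally exploits the four known zero entries to remove operations, which this generic version does not do:

```c
// Generic 4x4 Cramer's-rule solver (sketch); usable from host or device code.
__host__ __device__ static double det4(const double m[4][4])
{
    double d = 0.0;
    for (int c = 0; c < 4; ++c) {            // cofactor expansion along row 0
        int cols[3], k = 0;
        for (int j = 0; j < 4; ++j) if (j != c) cols[k++] = j;
        double minor =
              m[1][cols[0]] * (m[2][cols[1]] * m[3][cols[2]] - m[2][cols[2]] * m[3][cols[1]])
            - m[1][cols[1]] * (m[2][cols[0]] * m[3][cols[2]] - m[2][cols[2]] * m[3][cols[0]])
            + m[1][cols[2]] * (m[2][cols[0]] * m[3][cols[1]] - m[2][cols[1]] * m[3][cols[0]]);
        d += ((c % 2 == 0) ? 1.0 : -1.0) * m[0][c] * minor;
    }
    return d;
}

__host__ __device__ static void cramer4(const double a[4][4], const double rhs[4],
                                        double x[4])
{
    double dA = det4(a);
    for (int c = 0; c < 4; ++c) {
        double ac[4][4];
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                ac[i][j] = (j == c) ? rhs[i] : a[i][j];  // replace column c by the RHS
        x[c] = det4(ac) / dA;                            // Cramer's rule
    }
}
```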
Courant–Friedrichs–Lewy stability condition
Computing Δtk with (27) is not totally parallelizable, because it is necessary to compare all the wave-speed values with each other. If one thread per node i is launched to compute the wave speed, the values cannot be compared until all the threads have finished, because the threads work independently of each other. The domain could be divided into groups of cells, launching one thread per group to obtain the minimum value within the group and, once all the threads have finished, comparing the values among groups. However, this is not a fully parallel process, and the threads would need to use shared memory.
The computation of the CFL condition is the greatest obstacle to good parallel performance, especially if it is carried out at every time instant k. However, when the flow is close to steady, it is possible to update the time step size just once before entering a Regulation Period. It has been decided to compute the time step size between two consecutive Regulation Periods, just after the Jump Step, at the third step of the algorithm. In this way, its computation on the GPU is avoided, and the time step size remains constant during the Regulation Period.
Theoretically, it is possible to take NC = 1 in (27) when the time step size is computed at every step; but when it is computed only at the Jump Step, it is prudent to take NC = 0.95.
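A host-side sketch of this computation, placed at CPU external loop step 3, is shown below. It assumes the rectangular-section celerity c = √(gy); the paper's Equation (27) may use the general celerity √(gA/T):

```c
// Host-side CFL time-step computation, executed once per Regulation Period.
#include <math.h>

static double cflTimeStep(const double *y, const double *v,
                          int nNodes, double dx, double g, double nc)
{
    double dtMin = 1e30;
    for (int i = 0; i < nNodes; ++i) {       // sequential search for the minimum
        double c  = sqrt(g * y[i]);          // celerity of a rectangular section (assumed)
        double dt = dx / (fabs(v[i]) + c);
        if (dt < dtMin) dtMin = dt;
    }
    return nc * dtMin;                       // nc = 0.95 when updated only at Jump Steps
}
```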
Parallel-in-space updating functions
By using one of these models, it is possible to update the hydraulic unknowns for all sections from the variables at instant k. Thanks to this strategy, programming both the CPU and the GPU solvers is straightforward. The parallel part of the algorithm consists of a single kernel processed on the GPU, in which each thread computes the pair of unknowns of one node, so it does not present any difficulty. The CPU version, implemented in FORTRAN, only requires the ‘omp parallel do’ directive before the loop that runs over all the nodes in the inner loop, calling the function that updates the variables at instant k + 1; this function is similar to the one implemented in the GPU kernel.
ILLUSTRATIVE EXAMPLES
The ASCE Task Committee on Canal Automation Algorithms (Clemmens et al. 1998) developed a series of test cases to evaluate the general suitability of canal-control logic. The cases consist of several scenarios defined on two canals, and the objective is to find a way to control the canals. The test cases cover a reasonable range of the properties and operating characteristics of irrigation canals. Among them, there is a control part in which flow-rate changes are scheduled, so the control problem has a feedforward character. In this part, the objective of the tests is to bring the canal from an initial steady state to a final one.
GoRoSo is a feedforward control algorithm for irrigation canals based on sequential quadratic programming (Soler et al. 2013a), with which it is possible to compute gate trajectories that smoothly take the canal from an initial steady state to a final one while keeping the water depths constant at the downstream end of each pool, that is, at the checkpoints. Constraints on gate movements are required to stabilize the control algorithm and to avoid gate-stroking movements, which could cause changes in the flow regime. The algorithm makes it possible to compute gate trajectories while constraining the gate movements to a user-defined amount. The application of the GoRoSo method to these tests can be found in Soler et al. (2013b). The two canals defined in the test cases, together with the internal and boundary conditions obtained from the feedforward control algorithm, define the two scenarios used to validate the parallel algorithm.
The first example shows how, by using a fine discretization, it is possible to overcome the numerical diffusion inherent in the Hartree scheme, and the second example is provided to show the high performance in terms of computing speed. The second example is also solved with the HEC-RAS platform in order to compare simulation results and diffusive errors (https://www.hec.usace.army.mil/software/hec-ras/). HEC-RAS is a computer program for modeling water flow through open canal systems and computing water surface profiles (Barkau 1992). For unsteady flow, HEC-RAS solves the full dynamic 1D-SVE using the above-mentioned four-point box implicit scheme, first used by Preissmann (1961). Since the flow conditions in the scenario are close to steady and subcritical, HEC-RAS does not show the numerical instability problems associated with regime changes.
Description of the test cases
Scenarios are implemented on two different canals, each with eight pools separated by checkpoints with sluicegates (Figure 7).
The first canal is based on the 9.5 km-long lateral canal WM within the Maricopa Stanfield Irrigation and Drainage district in Central Arizona. It is a steep canal characterized by high Froude number and little storage. More details are given in Table 1.
Maricopa Stanfield Canal characteristics
| Pool | S0 (m/m) | Pool length (m) | n | b (m) | m (m/m) | Canal depth (m) | d (m) | k0/y0 (m) |
|---|---|---|---|---|---|---|---|---|
| I | 0.0020 | 100 | 0.014 | 1.0 | 1.5 | 1.1 | 1.0 | – |
| II | 0.0020 | 1,200 | 0.014 | 1.0 | 1.5 | 1.1 | 1.0 | – |
| III | 0.0020 | 400 | 0.014 | 1.0 | 1.5 | 1.0 | 1.0 | – |
| IV | 0.0020 | 800 | 0.014 | 0.8 | 1.5 | 1.1 | 1.0 | – |
| V | 0.0020 | 2,000 | 0.014 | 0.8 | 1.5 | 1.1 | 1.0 | – |
| VI | 0.0020 | 1,700 | 0.014 | 0.8 | 1.5 | 1.0 | 1.0 | – |
| VII | 0.0020 | 1,600 | 0.014 | 0.6 | 1.5 | 1.0 | 1.0 | – |
| VIII | 0.0020 | 1,700 | 0.014 | 0.6 | 1.5 | 1.0 | 1.0 | 0.9486833/0.4 |
The second canal is based on the 28 km-long upstream portion of Corning Canal in California, and is essentially flat and has significant storage. More details are given in Table 2.
Corning canal characteristics
| Pool | S0 (m/m) | Pool length (m) | n | b (m) | m (m/m) | Canal depth (m) | d (m) | k0/y0 (m) |
|---|---|---|---|---|---|---|---|---|
| I | 0.0001 | 7,000 | 0.020 | 7.0 | 1.5 | 2.5 | 0.2 | – |
| II | 0.0001 | 3,000 | 0.020 | 7.0 | 1.5 | 2.5 | 0.2 | – |
| III | 0.0001 | 3,000 | 0.020 | 7.0 | 1.5 | 2.5 | 0.2 | – |
| IV | 0.0001 | 4,000 | 0.020 | 6.0 | 1.5 | 2.3 | 0.2 | – |
| V | 0.0001 | 4,000 | 0.020 | 6.0 | 1.5 | 2.3 | 0.2 | – |
| VI | 0.0001 | 3,000 | 0.020 | 5.0 | 1.5 | 1.9 | 0.2 | – |
| VII | 0.0001 | 2,000 | 0.020 | 5.0 | 1.5 | 1.9 | 0.2 | – |
| VIII | 0.0001 | 2,000 | 0.020 | 5.0 | 1.5 | 1.9 | 0.2 | 1.08465229/0.85 |
Even though the original tests include gravity off-takes, the withdrawals are modeled here as scheduled pumping stations, which operate even when the water depth is below the orifice center (y0). This is one of the most difficult situations to control, because the withdrawal rates are scheduled and independent of the water depth in the canal. Only the last canal pool discharges through orifices. The sluice-gate discharge coefficient is cd = 0.61. The prediction horizon length is 12 hr in both scenarios. The control objective is to bring the canal from an initial steady state to a final one by following the respective scheduled gate trajectories.
In the Maricopa Stanfield Canal test, the prediction horizon is divided into λ = 144 Regulation Periods of ΔT = 300 s (5 min), and in the Corning Canal test into λ = 48 periods of ΔT = 900 s (15 min). The pumping discharge rates remain constant during the first 2 hours of the prediction horizon; at instant 2 hr, all rates change to their final-state values and then remain constant until the prediction horizon ends. The upstream boundary condition is kept constant along the prediction horizon at a water depth HUP = 1.1 m for the Maricopa Stanfield case and HUP = 3.0 m for the Corning Canal case.
Figure 9 shows the gate trajectories that bring both canals from the initial state to the final one described in Figure 8. Feasible gate-opening changes are constrained to 3 cm for the Maricopa Stanfield case and 10 cm for the Corning Canal case.
Initial and final steady-states for the Maricopa Stanfield and Corning Canal test cases.
Piece-wise gate trajectories along the prediction horizon. They are constituted by λ= 144 pieces of ΔT= 300 s for the Maricopa Stanfield example and by λ= 48 pieces of 900 s for the Corning Canal case.
Simulation results
When the gate trajectories shown in Figure 9 are applied to the models described by the characteristics given in Tables 1 and 2, the evolution of the water depths at the checkpoints is as shown in Figure 10.
Water depth hydrographs at checkpoints 2 to 9 obtained by the parallel algorithm along the 12 hr-long prediction horizon.
As seen in Figure 10, the water depths show small oscillations around the target values given in Table 3. These oscillations are small enough to be considered acceptable for regulation purposes, except at CP 9, where the water level rises about 10 cm above the target for approximately 15 min, although without overflowing.
Definition of the initial and final steady-states of the prediction horizons
| Pool | Maricopa initial and final water depths at downstream (m) | Maricopa initial flow (m³/s) | Maricopa final flow (m³/s) | Corning Canal initial and final water depths at downstream (m) | Corning Canal initial steady-state discharge (m³/s) | Corning Canal final steady-state discharge (m³/s) |
|---|---|---|---|---|---|---|
| I | 0.9 | 2.0 | 2.0 | 2.1 | 2.7 | 13.7 |
| II | 0.9 | 1.8 | 1.8 | 2.1 | 2.5 | 12.0 |
| III | 0.8 | 1.6 | 1.4 | 2.1 | 2.2 | 10.2 |
| IV | 0.9 | 1.4 | 1.4 | 1.9 | 2.0 | 7.5 |
| V | 0.9 | 1.2 | 1.4 | 1.9 | 1.7 | 7.2 |
| VI | 0.8 | 1.0 | 1.4 | 1.7 | 1.5 | 7.0 |
| VII | 0.8 | 0.8 | 1.1 | 1.7 | 1.2 | 6.2 |
| VIII | 0.8 | 0.6 | 0.9 | 1.7 | 1.0 | 5.0 |
Maricopa Stanfield diffusive error
The Maricopa Stanfield case has been chosen to analyze the accuracy of the parallel algorithm because its water depth profiles are highly curved, which tends to produce large diffusive errors. The left-hand panel of Figure 8 shows these highly curved water depth profiles. In this situation, using straight lines for the interpolation generates significant errors, which can be reduced by using a fine discretization. To show this, four levels of discretization have been implemented, from Δx = 10.0 m to Δx = 0.5 m. As Table 4 shows, the diffusive error decreases as the discretization is refined, roughly halving each time the cell size is halved, which is consistent with the first-order character of the linear interpolation. Therefore, the accuracy of the Hartree scheme is good enough when a fine discretization is used.
Maricopa Stanfield scenario results
| Cell size Δx (m) | Nodes | ε | Processing time (s) |
|---|---|---|---|
| 10 | 958 | 0.04213 | 2.4 |
| 5 | 1,908 | 0.02317 | 5.7 |
| 1 | 9,508 | 0.00497 | 63.0 |
| 0.5 | 19,008 | 0.00251 | 180.0 |
A computer with a CPU (Intel® Core™ 2 Duo CPU E7500 @ 2.93 GHz × 2) and a GPU (NVIDIA GeForce GT 710/PCIe/SSE2 710 CUDA cores running at 954 MHz) has been used to simulate these models. As the last column in Table 4 shows, processing times are low enough to use this numerical scheme in the real-time canal control context.
Corning canal scenario diffusive error
When the HEC-RAS model, discretized with a space cell size of about Δx = 10 m, runs constrained by a time step size lower than 30 s, results similar to those of the Hartree method are obtained (Figure 11).
Discharge hydrograph by HEC-RAS model (left) and parallel algorithm (right).
From Figure 11, the following comments can be made:
Simulation results can be considered as equivalent.
The gate trajectories carry the canal to the final steady state.
The canal only needs a transient of about 3 hours to achieve the final steady-state.
The differences between both hydrographs are due to the fact that the gate formulation used in the HEC-RAS model is different from the one used here (11).
Curves are drawn using 721 points, that is, one point every minute.
As Table 5 shows, the parallel algorithm has a diffusive error equivalent to that of the HEC-RAS model.
Numerical diffusive error index from parallel algorithm and HEC-RAS model
| Model | ɛ |
|---|---|
| Parallel algorithm | 0.0007987 |
| HEC-RAS model | 0.0007374 |
Corning canal scenario processing time evaluation
To test the performance of the parallel algorithm, five Corning Canal models with different levels of spatial discretization were implemented, and the processing time required by each model was compared.
Two parallel versions were implemented, one on CPU and one on CPU-GPU: the first is implemented in FORTRAN and the second in C-CUDA. The characteristics of the hardware on which the tests were run are:
Equipment 1: A computer with a CPU (Intel® Core™ 2 Duo E7500 @ 2.93 GHz × 2) and a GPU (NVIDIA GeForce GT 710/PCIe/SSE2, 710 CUDA cores running at 954 MHz). It can be considered representative of the cheapest equipment that can be found on the market.
Equipment 2: A Dell PowerEdge R720 server (2 Intel Xeon® E5-2620 v2 processors @ 2.1 GHz × 6 cores) and a GPU (NVIDIA K40, 2,880 CUDA cores running at 745 MHz, Kepler architecture, 12 GB DDR5). This can be considered a moderate data-center server.
Equipment 3: A Dell PowerEdge R740 server (2 Intel Xeon Gold 6138 processors @ 2 GHz × 20 cores) and a GPU (NVIDIA Tesla V100, 5,120 CUDA cores running at 1,370 MHz, 32 GB DDR5). This corresponds to a latest-generation data-center server.
Performance test 1
In this test, the processing times for three levels of spatial discretization are compared (Δx = 10 m, Δx = 5 m, and Δx = 1 m). Three different solvers were implemented: the first is a serial CPU version written in FORTRAN, the second is a parallel CPU implementation using OpenMP with a shared-memory strategy, and the third is the parallel C-CUDA version that runs on CPU-GPU as described in the ‘Parallel algorithm’ section. It is worth mentioning that the gfortran-7 compiler is used for the CPU code, so the code is translated to C by the compiler before compilation. The parallel CPU version is compiled with the ‘-fopenmp’ flag, which enables parallel execution. The three solvers were run on Equipment 1, 2, and 3, using 2, 12, and 40 CPU cores for the parallel CPU version, respectively, while the CPU-GPU version uses the graphics card of each piece of equipment. Table 6 summarizes the execution times obtained in each case.
Processing time of the performance test 1
| Equipment | Version | Δx = 10 m (2,808 nodes) | Δx = 5 m (5,608 nodes) | Δx = 1 m (28,008 nodes) | # Cores |
|---|---|---|---|---|---|
| 1 | Serial version | 25.80 s | 103.8 s | 2,672.4 s | 1 (CPU) |
| 1 | Parallel CPU | 15.01 s | 55.80 s | 1,382.4 s | 2 (CPU) |
| 1 | Parallel GPU | 3.44 s | 10.57 s | 232.7 s | 710 (GPU) |
| 2 | Serial version | 42.22 s | 169.6 s | 4,193.5 s | 1 (CPU) |
| 2 | Parallel CPU | 5.07 s | 19.38 s | 460.9 s | 12 (CPU) |
| 2 | Parallel GPU | 0.837 s | 1.705 s | 12.34 s | 2,880 (GPU) |
| 3 | Serial version | 25.17 s | 100.2 s | 2,501.2 s | 1 (CPU) |
| 3 | Parallel CPU | 2.122 s | 6.976 s | 142.0 s | 40 (CPU) |
| 3 | Parallel GPU | 0.288 s | 5.578 s | 3.011 s | 5,120 (GPU) |
From Table 6, the following comments can be made:
When the HEC-RAS model runs on Equipment 1 constrained with a user-defined time step size of about 30 s, it takes about 17 s to complete the simulation; when the step size is 10 s, it takes about 28 s. It is important to note that only a serial version of HEC-RAS is available.
The processing times for the serial version of the code are similar on the three sets of computer equipment. This is because the clock frequencies of the CPUs have not evolved much in recent years (2006, 2013, and 2017, respectively, for each CPU); moreover, Equipment 2 has a lower clock frequency and therefore consumes more time. Regarding the parallel CPU version, the code scales considerably with the number of parallel processes, obtaining speed-ups of 1.9, 9.1, and 17.6 over the corresponding serial versions for Δx = 1 m, which is the most computationally expensive case.
For the GPU versions, a substantial improvement in computing time can be seen with respect to the parallel CPU solver. The speed-up obtained with the different GPU versions is measured against the fastest parallel CPU version (Equipment 3). The speed-up for the GT 710 card is 0.6 for all discretization levels, due to the saturation of the relatively few GPU cores that this card has. Speed-ups from 2.5 to 11.5 were obtained on the K40 card and from 7.3 to 47.2 on the V100 card, showing that the code scales very well when the mesh is refined and/or the number of available GPU cores increases.
As the finest case (Δx = 1 m) shows, it is possible to solve relatively large problems on the GPU in a reasonable time. Moreover, when using the parallel GPU version, the time required to solve the problem depends mainly on the number of available threads. Since the method involves trivial parallelism, it would be easy to use multiple GPUs, allowing larger problems to be solved over long periods of time. Implementing the Hartree method on fine meshes reduces the interpolation errors and, according to the times obtained on the GPU, it is suitable for use in optimization problems.
Finally, it is worth mentioning that the cost of a small card like the GT 710 (2017) is around 50 USD, while the Xeon Gold 6138 (2017) processors cost around 5,224 USD, and the processing times of the two configurations are of the same order (a speed-up of 0.6 in favor of the CPU). If the costs of the K40 and V100 cards (approximately 750 USD and 10,000 USD, respectively) are considered, speed-ups of 11.5 and 47.2 in processing time may not be as attractive from a cost/speed-up point of view. However, we emphasize that a hydraulic modeler can obtain acceptable results in terms of time and precision at a very low cost by using low-profile graphics cards.
Performance test 2
The purpose of the second test is to check the relationship between the processing time required per time step and the cell size of the model, while keeping the total number of time steps constant. As required by the CFL condition, the time step size Δt is proportional to the cell size Δx; therefore, to keep the number of time steps constant when the cell size is doubled, the prediction horizon must also be doubled. Since the cost per time step of the Hartree scheme is fixed, the processing time can then be estimated easily and approximately from the time-per-step ratio and the number of time steps.
Five numerical models were implemented, each one with a different discretization level (Δx = 1 m, Δx = 2.5 m, Δx = 5 m, Δx = 7.5 m, and Δx = 10 m), and a processing time was determined for each level. The tests were performed on Equipment 1 using the GPU parallel solver. The results are given in Table 7.
Results of the performance test 2
| Δx (m) | PH (hr) | TS (steps) | PT (s) | η = PT/TS (ms/step) |
|---|---|---|---|---|
| 1.0 | 1.2 | 28,498 | 24.216 | 0.850 |
| 2.5 | 3 | 27,390 | 9.828 | 0.359 |
| 5.0 | 6 | 27,375 | 5.263 | 0.192 |
| 7.5 | 9 | 27,366 | 3.829 | 0.140 |
| 10.0 | 12 | 27,358 | 3.439 | 0.126 |
| 1.0 | 1.75 | 39,825 | 33.915 | 0.852 |
| 2.5 | 4 | 36,514 | 13.116 | 0.359 |
| 5.0 | 8 | 36,485 | 7.188 | 0.197 |
| 7.5 | 12 | 36,470 | 5.146 | 0.141 |
| 10.0 | 16 | 36,461 | 4.294 | 0.118 |
| 1.0 | 2.5 | 57,145 | 48.548 | 0.850 |
| 2.5 | 6 | 54,740 | 19.739 | 0.361 |
| 5.0 | 12 | 54,689 | 10.567 | 0.193 |
| 7.5 | 18 | 54,662 | 7.634 | 0.140 |
| 10.0 | 24 | 54,637 | 6.433 | 0.118 |
Δx (m) = cell size; PH (hr) = prediction horizon; TS = total number of time steps; PT (s) = processing time; η = PT/TS = time-per-step ratio in milliseconds per step.
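As a check of how the last column is obtained, the Δx = 10 m entry of the first block gives η = 3.439 s / 27,358 steps ≈ 0.126 ms per step.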
Looking at the last column, it can be concluded that each cell size has its own approximately constant ratio η, independent of the prediction horizon; that is, η depends only on the cell size Δx. This means that the processing time is proportional to the prediction horizon, as expected. Figure 12 shows the curve of η versus Δx obtained from Table 7.
From Figure 12, it can be concluded that the curve of η versus Δx is clearly nonlinear. The ratios η for the models with Δx = 5 m, Δx = 7.5 m, and Δx = 10 m show small variations among each other, whereas the models with Δx = 1 m, Δx = 2.5 m, and Δx = 5 m show larger variations. This suggests the existence of a trend-change limit point below which the time-per-step ratio η increases strongly. Obviously, this point is associated with specific hardware, in this case Equipment 1, and the performance already decays considerably at Δx = 2.5 m. In other words, for discretizations finer than Δx = 5 m this execution configuration does not have enough 'parallel capacity' and, therefore, such models are inappropriate for this configuration.
SUMMARY AND CONCLUSIONS
Due to its lack of precision, the Hartree method was abandoned for open-canal flow simulations. However, it should be borne in mind that this happened in a context of much less computing power, where the coarse grids that had to be used made the interpolation error very large. The Maricopa Stanfield benchmark test 1.2 of the ASCE Task Committee on Canal Automation Algorithms has been used to evaluate the accuracy of the parallel algorithm. This example was chosen because its water depth profiles are highly curved, a situation prone to large diffusive errors. It has been found that the diffusive error is reduced when the discretization is fine enough; therefore, the accuracy of the Hartree scheme can be considered good enough when a fine discretization is used. The use of parallel codes makes it possible to implement models with a high level of discretization that can be solved in relatively short times.
Parallel codes on CPU and GPU were developed for solving free-surface flow in open canals. Parallelization was achieved by applying the explicit Hartree method to the Saint-Venant equations and by linearizing the equations for the external and internal boundary conditions. As the gate trajectories, boundary conditions, and off-take withdrawals can be modeled with piece-wise functions, owing to the flow behavior in irrigation canals, the prediction horizon has been discretized into Regulation Periods and Jump Steps. Taking these ideas into account, a solver has been implemented that solves the non-linear Jump Step part on the CPU and the linear Regulation Period part on the GPU.
The Corning Canal benchmark test 2.2 of the ASCE Task Committee on Canal Automation Algorithms was used for the performance evaluation. It was found that the parallel C-CUDA version on GPU allows the problem to be solved in much less time than the FORTRAN version on CPU.
Finally, a study of the sensitivity of the computation-speed ratio to the level of discretization has been carried out. The results suggest the existence of a trend-change limit point, associated with the specific computer equipment used for the calculations. Discretizations coarser than this point give lower time-per-step ratios, whose behavior seems to be linear and less sensitive to changes in discretization; in contrast, denser discretizations are more sensitive to changes in the cell size.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.