## Abstract

Wavelet-based grid resolution adaptation driven by the ‘multiresolution analysis’ (MRA) of the Haar wavelet (HW) enables the design of an adaptive first-order finite volume (FV1) model (HWFV1) that can readily preserve the modelling fidelity of its reference uniform-grid FV1 counterpart. However, the MRA entails an enormous computational effort, as it involves ‘encoding’ (coarsening), ‘decoding’ (refining), analysing and traversing modelled data across a deep hierarchy of nested, uniform grids. GPU-parallelisation of the MRA is needed to handle this computational effort, but its algorithmic structure (1) hinders coalesced memory access on the GPU and (2) involves an inherently sequential tree traversal problem. This work redesigns the algorithmic structure of the MRA in order to parallelise it on the GPU, addressing (1) by applying *Z*-order space-filling curves and (2) by adopting a parallel tree traversal algorithm. The result is a GPU-parallelised HWFV1 model (GPU-HWFV1). GPU-HWFV1 is verified against its CPU predecessor (CPU-HWFV1) and its GPU-parallelised reference uniform-grid counterpart (GPU-FV1) over five shallow water flow test cases. GPU-HWFV1 preserves the modelling fidelity of GPU-FV1 while being up to 30 times faster. Compared to CPU-HWFV1, it is up to 200 times faster, suggesting that the GPU-parallelised MRA could be used to speed up other FV1 models.

## HIGHLIGHTS

- Wavelet-based grid adaptation is parallelised on the GPU via a *Z*-order space-filling curve and a parallel tree traversal algorithm.
- An adaptive Haar wavelet first-order finite volume shallow water model running on the GPU is developed (GPU-HWFV1).
- GPU-HWFV1 is 20–300 times faster than its serial single-core CPU version.
- GPU-HWFV1 is 1.3–30 times faster than its GPU-parallelised reference uniform-grid counterpart.

## INTRODUCTION

Parallelisation of finite volume shallow water models that numerically solve the full two-dimensional (2D) shallow water equations (SWE) on the graphical processing unit (GPU) has become common practice for handling the computational effort of real-scale flood modelling applications (Xia *et al.* 2019; Dazzi *et al.* 2020; Gordillo *et al.* 2020; Carlotto *et al.* 2021; Shaw *et al.* 2021; Buttinger-Kreuzhuber *et al.* 2022; Delmas & Soulaïmani 2022; Han *et al.* 2022; Caviedes-Voullième *et al.* 2023; Ferrari *et al.* 2023; Sanz-Ramos *et al.* 2023). GPU-parallelised finite volume models using uniform grids have become the *de facto* standard in industrial hydraulic modelling packages (InfoWorks ICM 2018; MIKE 21 GPU 2019; Flood Modeller 2D 2022), with one also using a non-uniform grid via static-in-time grid resolution adaptation (TUFLOW HPC 2018). However, the development of GPU-parallelised shallow water models with *dynamic*-in-time grid resolution adaptation has received much less attention.

Classical adaptive mesh refinement (AMR; Berger & Colella 1989) enforces grid adaptation by generating a non-uniform grid that uses finer cells only where higher resolution is needed, reducing the overall number of cells in the grid and thus the computational cost (Hu *et al.* 2018; Lakhlifi *et al.* 2018; Ghazizadeh *et al.* 2020; Gong *et al.* 2020; Wallwork *et al.* 2020; Holzbecher 2022). A desirable goal in shallow water modelling is then to combine AMR and GPU-parallelisation in an effort to further reduce computational costs. However, the GPU-parallelisation of AMR is challenging due to its algorithmic complexity (de la Asunción & Castro 2017; Qin *et al.* 2019): because of this, often, only the finite volume shallow water model is parallelised on the GPU, while the AMR operations for generating the non-uniform grid are retained on the central processing unit (CPU). Such a hybrid CPU-GPU approach, in turn, introduces computational overhead because data must be transferred between the CPU and the GPU. Other models eliminate this data transfer overhead by parallelising the AMR operations on the GPU as well; examples include applications in hydrodynamics (Sætra *et al.* 2014; Beckingsale *et al.* 2015; Wahib *et al.* 2016; Dunning *et al.* 2020), magnetohydrodynamics (Schive *et al.* 2018), combustion (Lung *et al.* 2016) and gas dynamics (Giuliani & Krivodonova 2019). Nonetheless, the combination of AMR and GPU-parallelisation has not been as popular in shallow water modelling compared to the *de facto* standard GPU-parallelised uniform-grid shallow water models, as its design cannot guarantee the preservation of the modelling fidelity delivered by the reference uniform-grid finite volume shallow water model without AMR (Zhou *et al.* 2013; Donat *et al.* 2014; Liang *et al.* 2015; Kévin *et al.* 2017; Ghazizadeh *et al.* 2020; Zhang *et al.* 2021).

In contrast, wavelet-based grid adaptation is fundamentally designed on the basis of preserving a similar level of modelling fidelity initially deliverable by the reference uniform-grid counterpart, as demonstrated for shallow water models (Caviedes-Voullième & Kesserwani 2015; Haleem *et al.* 2015). The design of the wavelet-based grid adaptation process requires specifying a single user-specified error threshold *ε* that rigorously controls the deviation allowed to occur from the reference uniform-grid counterpart (Gerhard *et al.* 2015). However, wavelet-based grid adaptation imposes a much higher computational effort than classical AMR as it uses the ‘multiresolution analysis’ (MRA) to generate the non-uniform grid, which involves ‘encoding’ (coarsening), ‘decoding’ (refining), analysing and traversing modelled data across a deep hierarchy of nested, uniform grids. Although finite volume shallow water models incorporating wavelet-based grid adaptation have been shown to be 20 times faster than their uniform-grid counterparts when run in serial on the CPU (Kesserwani & Sharifian, 2020), it is unknown whether such a speedup can be achieved in the GPU setting, i.e., whether wavelet-based grid adaptation can accelerate GPU-parallelised uniform-grid finite volume shallow water models by the same extent as the CPU versions. Owing to the high computational effort imposed by the MRA, the implementation of wavelet-based grid adaptation on the GPU, in particular the MRA process, seems to be a direction worth exploring.

To date, wavelet-based grid adaptation has mainly been parallelised on CPUs (Domingues *et al.* 2019; Soni *et al.* 2019; Deiterding *et al.* 2020; Semakin & Rastigejev 2020; Julius & Marie 2021; Gillis & van Rees 2022; Zeidan *et al.* 2022). Its parallelisation on the GPU is unreported and remains necessary, as mentioned, to reduce the computational overhead of performing the MRA. In practice, this is difficult to achieve due to two obstacles. First, the modelled data involved in the MRA process are close together in the hierarchy of grids in physical space, but they are not guaranteed to be close together in GPU memory space unless the hierarchy is indexed in a deliberate way, hindering the coalesced memory access that should be maximised to achieve fast GPU performance (Brodtkorb *et al.* 2013; NVIDIA 2023). Second, traversing the hierarchy of grids boils down to a tree traversal problem, which is fundamentally a sequential task usually completed with recursive algorithms such as depth-first traversal (DFT; Sedgewick & Wayne 2011).

This paper aims to overcome these two obstacles and implement wavelet-based grid adaptation on the GPU by combining two computational ingredients. This entails (1) introducing so-called space-filling curves (SFCs; Sagan 1994; Bader 2013) to index the hierarchy of grids and ensure coalesced memory access and (2) adopting a parallel tree traversal (PTT) algorithm (Karras 2012) to replace the recursive DFT algorithm. To be of practical relevance, the GPU-parallelised wavelet-based grid adaptation process must make an adaptive wavelet-based shallow water model competitive with its GPU-parallelised reference uniform-grid counterpart.

The rest of this paper is organised as follows. Section 2 presents a GPU-parallelised adaptive Haar wavelet (HW) first-order finite volume (FV1) shallow water model (Haleem *et al.* 2015; Kesserwani & Sharifian 2020), termed GPU-HWFV1. Section 3 compares GPU-HWFV1 against the GPU-parallelised uniform-grid FV1 solver of the open-source LISFLOOD-FP 8.0 flood modelling package (Shaw *et al.* 2021), referred to hereafter as GPU-FV1. For completeness, GPU-HWFV1 is also compared against its sequential CPU predecessor, hereafter called CPU-HWFV1, which has already been extensively validated for one-dimensional and two-dimensional shallow water test cases (Kesserwani *et al.* 2019; Kesserwani & Sharifian 2020). Section 4 presents conclusions on the potential benefits of using GPU-HWFV1 for shallow water modelling applications.

## NEW GPU-HWFV1 MODEL

This section presents a GPU-parallelised adaptive HW first-order finite volume (FV1) shallow water model, GPU-HWFV1. GPU-HWFV1 is obtained by parallelising the sequential CPU version of the HWFV1 model presented in Kesserwani & Sharifian (2020), CPU-HWFV1. CPU-HWFV1 is difficult to parallelise because it incorporates a wavelet-based grid adaptation process whose algorithmic structure (1) hinders coalesced memory access on the GPU and (2) features an inherently sequential tree traversal problem. To clarify why this is so, Section 2.1 gives an overview of CPU-HWFV1 with a focus on presenting the grid adaptation process. Section 2.2 then presents the GPU-parallelisation of CPU-HWFV1, detailing in particular how two computational ingredients, namely, *Z*-order SFCs and a PTT algorithm, are used to redesign the algorithmic structure of the grid adaptation process and make it suitable for GPU-parallelisation.

### Existing sequential CPU-HWFV1 model

#### Overview

CPU-HWFV1 is an adaptive shallow water model that runs shallow water simulations by solving the two-dimensional SWE over a dynamically adaptive non-uniform grid. CPU-HWFV1 is made up of two mechanisms: a wavelet-based grid adaptation process for generating the non-uniform grid and an FV1 scheme for updating the modelled data on this non-uniform grid. Since CPU-HWFV1 is dynamically adaptive, it performs the grid adaptation process every timestep before performing the FV1 update. In contrast to classical AMR methods (e.g., Berger & Colella 1989), which start with a coarse reference grid and selectively refine it, wavelet-based grid adaptation starts with a reference uniform grid at the finest allowable resolution and selectively coarsens it. The wavelet-based grid adaptation process is driven by the ‘multiresolution analysis’ (MRA) of the HW (Haleem *et al.* 2015; Kesserwani & Sharifian 2020), which starts on a reference uniform grid made up of 2^{L} × 2^{L} cells, where *L* is a user-specified integer denoting the maximum refinement level that controls the resolution of the finest grid.

The 2D SWE can be written in vector form as:

∂**U**/∂*t* + ∂**F**(**U**)/∂*x* + ∂**G**(**U**)/∂*y* = **S**_{b}(**U**) + **S**_{f}(**U**)

where the derivatives are taken with respect to time and the two spatial coordinates, denoted by *t*, *x* and *y*, respectively. The vector **U** = [*h*, *hu*, *hv*]^{T} contains the flow variables; **F**(**U**) = [*hu*, *hu*^{2} + *gh*^{2}/2, *huv*]^{T} and **G**(**U**) = [*hv*, *huv*, *hv*^{2} + *gh*^{2}/2]^{T} are the components of the flux vector; and **S**_{b}(**U**) and **S**_{f}(**U**) are source-term vectors. The variable *h* is the water depth (m), *u* and *v* are the *x*- and *y*-components of the velocity (m/s), respectively, and *g* is the gravitational acceleration constant (m/s^{2}). In **S**_{b} = [0, −*gh*(∂*z*/∂*x*), −*gh*(∂*z*/∂*y*)]^{T}, *z* is the topographic elevation (m), and in **S**_{f} = [0, −*C*_{f}*u*, −*C*_{f}*v*]^{T}, *C*_{f} = *gn*_{M}^{2}√(*u*^{2} + *v*^{2})/*h*^{1/3}, where *n*_{M} is Manning's coefficient (s m^{−1/3}).

CPU-HWFV1 represents *h*, *hu*, *hv* and *z* as piecewise-constant modelled data over each cell in the grid, for which scalar coefficients *h*_{c}, (*hu*)_{c}, (*hv*)_{c} and *z*_{c} are assigned per cell. Ordinarily, on a uniform grid, an FV1 update is performed at every timestep using a forward-Euler scheme applied to a spatial operator **L**_{c} to advance the vector of flow coefficients **U**_{c} = [*h*_{c}, (*hu*)_{c}, (*hv*)_{c}]^{T} in time as follows:

**U**_{c}^{t+Δt} = **U**_{c}^{t} + Δ*t* **L**_{c}

To compute **L**_{c}, the local coefficients *S*_{c} = {*h*_{c}, (*hu*)_{c}, (*hv*)_{c}, *z*_{c}} as well as the coefficients of the neighbour cells, denoted by *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south}, are needed. The expression for **L**_{c} is established in previous studies (Liang 2010) and is therefore outside the scope of this paper. In summary, **L**_{c} involves flux calculations using a Harten-Lax-van Leer Riemann solver, and also includes treatments to ensure the positivity of the water depth and the so-called well-balancedness of the FV1 scheme over wet/dry zones and fronts.

#### Wavelet-based grid adaptation process driven by MRA

The MRA operates on a hierarchy of nested, uniform grids underlying the reference 2^{L} × 2^{L} grid at the maximum refinement level *L*. Figure 1(a) shows a hierarchy of grids with *L* = 2. The refinement level of each grid in the hierarchy is denoted by *n*: the finest grid in the hierarchy is at refinement level *n* = *L* = 2, the second-finest grid is at refinement level *n* = 1 and the coarsest grid is at *n* = 0. Let *s*^{(n)} denote a coefficient *s* at refinement level *n*, where *s* denotes any of the quantities in the set *S*_{c}, for ease of presentation of the MRA.

##### Encoding

Initially, only the coefficients *s*^{(L)} are available, corresponding to the coefficients obtained from initially discretising the SWE over the reference uniform grid. The MRA process continues by producing *s*^{(n)} at the lower refinement levels, i.e., for *n* = *L* − 1, …, 0. The coefficient *s*^{(n)} of a cell at refinement level *n* (called the ‘parent’) is produced using the coefficients of four cells at refinement level *n* + 1 (called the ‘children’), in particular by applying Equation (3a). Figure 1(b) shows the parent–children stencil dictating how the children at level *n* + 1 are positioned relative to the parent at level *n*. Equations (3b)–(3d) are also applied to produce so-called details, denoted by *d*_{α}, *d*_{β} and *d*_{γ}. These details are used to decide which cells to include in the non-uniform grid based on a normalised detail *d*_{norm}, obtained by normalising against the largest coefficient over all *s*^{(L)}. Cells whose *d*_{norm} is greater than 2^{n−L}*ε* are deemed to have significant details, where *ε* is a user-specified error threshold.

The process of computing the parent coefficients and the details *d*_{α}, *d*_{β} and *d*_{γ} at the lower refinement levels is called ‘encoding’. Algorithm 1 shows pseudocode summarising the process of encoding. The pseudocode has an outer loop (lines 2–10) and an inner loop (lines 3–9). The outer loop iterates over each grid in the hierarchy, starting from the grid at the second-highest refinement level, *n* = *L* − 1, to the grid at the lowest refinement level, *n* = 0. The inner loop iterates through each cell in the grid while applying Equations (3a)–(3d) (lines 5 and 6) and flagging significant details (line 8).

```
1  algorithm ENCODING(L, ε)
2    for each grid in hierarchy from L − 1 to 0 do
3      for each cell in the grid do
4        load children for Equations (3a)–(3d)
5        compute parent coefficient using Equation (3a)
6        compute details d_α, d_β, d_γ using Equations (3b)–(3d)
7        compute normalised detail d_norm
8        flag as significant if d_norm ≥ 2^(n−L)ε
9      end for
10   end for
11 end algorithm
```


**Algorithm 1:** Pseudocode for the process of computing the parent coefficients and the details at the lower refinement levels, called ‘encoding’.
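For concreteness, one encoding pass over a single refinement level can be sketched as follows, assuming the four children of parent *p* are stored consecutively at slots 4*p* to 4*p* + 3. The filter weights (a plain average for the parent, signed quarter-sums for the details) and the pre-scaled threshold argument are illustrative stand-ins for Equations (3a)–(3d) and the 2^{n−L}*ε* test, not HWFV1's exact coefficients:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of one encoding pass (inner loop of Algorithm 1) for a single
// quantity s. Children of parent p sit at slots 4p..4p+3, so the four loads
// per parent touch contiguous memory. The average/difference weights below
// are illustrative stand-ins for the Haar filters in Equations (3a)-(3d).
struct EncodeResult {
    std::vector<double> parents, d_alpha, d_beta, d_gamma;
};

EncodeResult encode_level(const std::vector<double>& child, double eps_n,
                          std::vector<bool>& significant) {
    const std::size_t np = child.size() / 4;  // number of parents
    EncodeResult r{std::vector<double>(np), std::vector<double>(np),
                   std::vector<double>(np), std::vector<double>(np)};
    for (std::size_t p = 0; p < np; ++p) {    // one iteration per parent cell
        const double a = child[4 * p], b = child[4 * p + 1];
        const double c = child[4 * p + 2], d = child[4 * p + 3];
        r.parents[p] = 0.25 * (a + b + c + d);  // coarsened coefficient
        r.d_alpha[p] = 0.25 * (a - b + c - d);  // detail components
        r.d_beta[p]  = 0.25 * (a + b - c - d);
        r.d_gamma[p] = 0.25 * (a - b - c + d);
        const double d_norm = std::max({std::fabs(r.d_alpha[p]),
                                        std::fabs(r.d_beta[p]),
                                        std::fabs(r.d_gamma[p])});
        significant[p] = d_norm >= eps_n;  // eps_n plays the role of 2^(n-L) * ε
    }
    return r;
}
```

With these stand-in weights, a locally uniform field produces zero details (the cell is coarsened away), while any variation among the four children shows up in the details and can trip the significance flag.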

##### Decoding

The process of performing a DFT while applying Equations (4a)–(4d) and identifying leaf cells is called ‘decoding’. Algorithm 2 shows pseudocode describing the decoding process. The algorithm launches at the coarsest cell in the tree. The children of this cell are computed using Equations (4a)–(4d) (line 6) and then the algorithm is relaunched using the children coefficients at one refinement level higher (lines 7–10). The algorithm is recursively relaunched unless a cell with a detail that is not significant is reached or a cell on the finest grid is reached (line 2), at which point the cell is identified as a leaf cell (line 3).

```
1  recursive algorithm DECODING(cell, n)
2    if detail is not significant or reached finest grid then
3      identify cell as leaf cell
4      stop decoding
5    else
6      compute children with Equations (4a)–(4d)
7      DECODING(first child, n + 1)
8      DECODING(second child, n + 1)
9      DECODING(third child, n + 1)
10     DECODING(fourth child, n + 1)
11   end if
12 end recursive algorithm
```

**Algorithm 2:** Pseudocode for the recursive decoding process via DFT.


#### Neighbour finding to perform the FV1 update over the non-uniform grid

To compute **L**_{c} for performing the FV1 update, CPU-HWFV1 needs to retrieve the sets of coefficients of the neighbours *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south} for each leaf cell. Retrieving these sets is trivial on a uniform grid because a cell can look left, right, up and down to find its neighbours, but this is not as straightforward on a non-uniform grid, since a cell can have multiple neighbours in a given direction. Figure 3 shows an example of a cell and its neighbours in a non-uniform grid. In this example, finding *S*_{west} (blue cell), *S*_{south} and *S*_{north} (grey cells) is straightforward. However, it is not clear what *S*_{east} should be because the eastern neighbours (red cells) are at a higher refinement level and there are multiple of them. CPU-HWFV1 avoids this ambiguity by taking *S*_{east} to be the set of coefficients of the eastern neighbour at the same refinement level (yellow cell in Figure 3), which is readily available since the encoding and decoding processes have produced *s*^{(n)} at all refinement levels. Being able to find the neighbours for retrieving *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south} makes it trivial for CPU-HWFV1 to compute **L**_{c} and perform the FV1 update per cell.

This completes the description of the steps involved in the CPU-HWFV1 model, and Algorithm 3 shows pseudocode summarising CPU-HWFV1. To run CPU-HWFV1, the user needs to specify the maximum refinement level (*L*), the error threshold (*ε*) and the simulation end time (*t*_{end}) (line 1). CPU-HWFV1 runs in a loop until *t*_{end} is reached (lines 3–10). A non-uniform grid is generated at every iteration of the loop, i.e., at every timestep (lines 4–6). Note that after the first timestep, the details are zeroed first before re-encoding, and encoding is performed only along the tree of significant details. On this non-uniform grid, CPU-HWFV1 performs an FV1 update (line 7), after which the simulation time is incremented by the current timestep (line 8) while a new timestep is recomputed based on the Courant–Friedrichs–Lewy (CFL) condition (line 9); for stability, a CFL number of 0.5 is used (Kesserwani & Sharifian 2020).

```
1  algorithm HWFV1(L, ε, t_end)
2    get s^(L) from initial discretisation of SWE on finest grid
3    while current time < t_end do
4      ENCODING(L, ε)
5      DECODING(coarsest cell, 0)  // start decoding from coarsest cell
6      find neighbours to compute L_c
7      use L_c to perform FV1 update
8      increment current time by timestep
9      compute new timestep based on CFL condition
10   end while
11 end algorithm
```


**Algorithm 3:** Pseudocode summarising the steps involved in the HWFV1 model.
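As a concrete sketch of line 9 of Algorithm 3, the timestep restriction can be computed per cell and minimised over the grid. The wave-speed estimate |*u*| + √(*gh*) and the skipping of dry cells are standard FV1 assumptions introduced here for illustration, not necessarily HWFV1's exact formulae:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the CFL timestep computation (line 9 of Algorithm 3), assuming
// the common FV1 wave-speed estimate |velocity| + sqrt(g*h) per cell. The
// minimum over all wet cells is taken; GPU-HWFV1 does this with a parallel
// reduction, whereas this serial stand-in simply loops over the cells.
double cfl_timestep(const std::vector<double>& h, const std::vector<double>& u,
                    const std::vector<double>& v, double dx, double cfl = 0.5) {
    const double g = 9.80665;  // gravitational acceleration (m/s^2)
    double dt = 1e9;           // large initial value
    for (std::size_t c = 0; c < h.size(); ++c) {
        if (h[c] <= 0.0) continue;  // dry cells impose no constraint
        const double speed =
            std::max(std::fabs(u[c]), std::fabs(v[c])) + std::sqrt(g * h[c]);
        dt = std::min(dt, cfl * dx / speed);
    }
    return dt;
}
```

For a single quiescent cell of depth 1 m on a 1 m grid, the restriction reduces to 0.5/√g seconds, i.e., roughly 0.16 s.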

### GPU-parallelisation of CPU-HWFV1

Achieving fast GPU performance requires maximising coalesced memory access and minimising warp divergence (Brodtkorb *et al.* 2013; NVIDIA 2023). Coalesced memory access occurs when adjacent threads within a warp access contiguous memory locations. Warp divergence occurs when threads within a warp execute different instructions. A naive parallelisation of the steps in Algorithm 3 would not properly address the issues of coalesced memory access and/or warp divergence. Consider the parallelisation of the encoding process (line 4 of Algorithm 3; Algorithm 1), assuming the cells in the hierarchy of grids are naively indexed in row-major order. Figure 4 shows the hierarchy of grids in Figure 1(a) indexed in row-major order (left panel) and the corresponding locations of the children in memory (right panel). Using row-major indexing leads to uncoalesced memory access when loading the children (line 4 of Algorithm 1) because there are gaps between the memory locations.

#### Ensuring coalesced memory access in encoding and decoding via *Z*-order curves

To facilitate coalesced memory access, a different type of indexing is needed that ensures that coefficients that are nearby in the grid are also nearby in memory. SFCs allow this kind of indexing by definition because they map spatial data to a one-dimensional line such that data close together in space tend to be close together on the line (Sagan 1994; Bader 2013). There are a few different types of SFCs such as the Sierpinski curve, the Peano curve, the Hilbert curve and the *Z*-order curve (also known as Lebesgue or Morton curve). All of these SFCs have been previously used in the context of AMR (Brix *et al.* 2009; Burstedde *et al.* 2011; Weinzierl & Mehl 2011; Meister *et al.* 2016), and also once in the context of wavelet-based grid adaptation (Brix *et al.* 2009), but none of these works involved GPU-parallelisation. To the authors' knowledge, this is the first work to use an SFC to implement wavelet-based grid adaptation on the GPU. In particular, the *Z*-order SFC is chosen because its motif matches exactly with the parent–children square stencil shown in Figure 1(b).

A *Z*-order curve can be created for a square 2^{n} × 2^{n} grid by following the so-called Morton codes of each cell in the grid in order. The Morton code of a cell is obtained by interleaving the bits of its *i* and *j* positional indices in the grid. Figure 5 shows the creation of a *Z*-order curve for a 2^{2} × 2^{2} grid: the left panel shows the *i* (black) and *j* (red) indices of each cell in binary form and how the bits are interleaved (alternating red and black) to yield Morton codes (in binary form). The right panel shows how these Morton codes (in decimal form) are followed in ascending order to create the *Z*-order curve. The curve enforces a *Z*-order indexing of the grid because each cell in the grid can be identified on the curve via its (unique) Morton code.
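The bit interleaving can be sketched as below (a loop-based version; production codes typically use bit-twiddling with magic masks). The convention of placing the bits of *i* at even positions and the bits of *j* at odd positions is an assumption for illustration:

```cpp
#include <cstdint>

// Interleave the bits of the (i, j) positional indices to form the Morton
// code of a cell. Bits of i go to even positions, bits of j to odd positions
// (an assumed convention); following the codes in ascending order traces the
// Z-order curve over the grid.
uint32_t morton_encode(uint32_t i, uint32_t j) {
    uint32_t code = 0;
    for (int b = 0; b < 16; ++b) {                // 16 bits per index
        code |= ((i >> b) & 1u) << (2 * b);       // bit b of i -> position 2b
        code |= ((j >> b) & 1u) << (2 * b + 1);   // bit b of j -> position 2b+1
    }
    return code;
}
```

For the 2^{2} × 2^{2} grid, cells (*i*, *j*) = (0, 0), (1, 0), (0, 1) and (1, 1) receive codes 0, 1, 2 and 3, tracing out the Z motif.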

*Z*-order curves are created for each grid in the hierarchy while enforcing continuity in the indexing of the curves of subsequent grids, resulting in a unique *Z*-order index for each cell in the hierarchy. Figure 6 shows the *Z*-order indexing of the cells in the hierarchy of grids in Figure 1(a), alongside the corresponding locations of the children in memory. Enforcing this *Z*-order indexing allows for coalesced memory access during encoding because the children are in contiguous memory locations, as seen in the right panel of Figure 6.

Parallelising the decoding process (line 5 of Algorithm 3; Algorithm 2) is more difficult than parallelising the encoding process because decoding involves the inherently sequential DFT algorithm. To overcome this difficulty, decoding is broken down into two parts that are parallelised separately: the first part is the application of Equations (4a)–(4d) and the second part is the identification of leaf cells. *Hereafter, decoding refers exclusively to the application of* Equations (4a)–(4d)*, not the identification of leaf cells*.

Decoding can be parallelised relatively easily because it can be performed using loops (similar to those in Algorithm 1) instead of using the DFT. Algorithm 4 shows pseudocode describing how to perform decoding in parallel. The pseudocode involves an outer loop and a parallelised inner loop. The outer loop iterates over the grids in the hierarchy from the coarsest grid up to the second-finest grid (lines 2–9), while the inner loop iterates through each cell in the grid in parallel (lines 3–8). The inner loop checks if the detail is significant (line 4), loads the parent and the details (line 5) and computes the children (line 6). This parallel inner loop has coalesced memory access because of the *Z*-order indexing, in particular when storing the computed children coefficients in line 6; this can be seen by interpreting the arrows in Figure 6 in the reverse direction.

```
1  algorithm PARALLEL_DECODING(L)
2    for each grid in hierarchy from 0 to L − 1 do
3      for each cell in the grid do in parallel
4        if detail is significant then
5          load parent coefficient and details d_α, d_β, d_γ
6          compute children with Equations (4a)–(4d)
7        end if
8      end for
9    end for
10 end algorithm
```


**Algorithm 4:** Pseudocode for the decoding process in parallel.
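To illustrate the inverse relationship between decoding and encoding, the sketch below reconstructs four children from a parent coefficient and three details. The child ordering and filter weights are illustrative stand-ins for Equations (4a)–(4d), chosen as the exact inverse of an average/difference encoding (parent = mean of children, details = signed quarter-sums):

```cpp
#include <array>

// Illustrative inverse (decoding) transform: reconstructs the four children
// of a cell from its parent coefficient and three details. The weights are a
// stand-in for Equations (4a)-(4d), forming the exact inverse of the
// average/difference encoding stand-in used earlier.
std::array<double, 4> decode(double parent, double d_alpha, double d_beta,
                             double d_gamma) {
    return {parent + d_alpha + d_beta + d_gamma,   // first child
            parent - d_alpha + d_beta - d_gamma,   // second child
            parent + d_alpha - d_beta - d_gamma,   // third child
            parent - d_alpha - d_beta + d_gamma};  // fourth child
}
```

Decoding the values obtained from encoding the children (1, 2, 3, 4) under the stand-in filters, i.e., parent 2.5 and details (−0.5, −1, 0), recovers (1, 2, 3, 4) exactly, which is the round-trip property the MRA relies on.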

#### Parallel tree traversal algorithm

Unlike decoding, the identification of leaf cells boils down to a tree traversal problem. There have been many investigations into PTT algorithms on the GPU (Lohr 2009; Bédorf *et al.* 2012; Karras 2012; Goldfarb *et al.* 2013; Zola *et al.* 2014; Nam *et al.* 2016; Chitalu *et al.* 2018), hinting that the identification of leaf cells can be parallelised by adopting a PTT algorithm. In this work, a modified version of the PTT algorithm developed by Karras (2012) is adopted because it can be easily modified to work with *Z*-order indexing. The PTT algorithm traverses the tree of significant details as follows, explained by example considering the tree in Figure 2(a) without loss of generality.

Figure 7(a) shows the tree after enforcing *Z*-order indexing of the hierarchy of grids, with different traversal paths highlighted in cyan, magenta, yellow and grey. The PTT algorithm starts by launching as many threads as there are cells on the finest grid (2^{2} × 2^{2} = 16). Each thread tries to reach a target cell on the finest grid by traversing progressively finer cells. A thread is denoted by *t*_{m}, where *m* is the thread index (here *m* = 0, 1, …, 15). The target cell of *t*_{m} is the cell on the finest grid whose Morton code equals the thread index *m*; e.g., *t*_{3} has thread index 3 and tries to reach the cell on the finest grid with Morton code 3. A thread stops if it either reaches this target cell or encounters a cell with a detail that is not significant (analogous to identifying a leaf cell). The thread then records the *Z*-order index of the cell at which it stops. Figure 7(b) shows the traversal paths of each thread during the PTT in terms of the *Z*-order indices of the cells they traverse. These paths show that warp divergence is greatly minimised because adjacent threads perform similar traversals. For example, *t*_{0} to *t*_{3} are adjacent threads and they follow the same cyan path in Figure 7(a). Similarly, *t*_{4} to *t*_{7} follow the magenta path, *t*_{8} to *t*_{11} follow the yellow path and *t*_{12} to *t*_{15} follow the grey path. Figure 7(c) shows the *Z*-order indices recorded by each thread after the PTT is complete. These *Z*-order indices correspond to the indices of leaf cells.

```
1  algorithm PARALLEL_TREE_TRAVERSAL
2    for each cell in finest grid do in parallel
3      start at coarsest cell
4      while try to reach finer cell do
5        if detail is not significant then
6          record Z-order index of cell
7          stop traversing
8        else
9          if reached finest cell then
10           record Z-order index of cell
11           stop traversing
12         else
13           try to reach finer cell
14         end if
15       end if
16     end while
17   end for
18 end algorithm
```


**Algorithm 5:** Pseudocode for PTT of the tree of significant details. The PTT features an iterative procedure (lines 4–16) instead of the recursive procedure in the DFT in Algorithm 2.
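A single thread's descent in Algorithm 5 can be sketched serially as follows, assuming a hypothetical level-offset layout offset(n) = (4^n − 1)/3 with grids stored level by level in *Z*-order (an illustration, not necessarily the paper's exact indexing). Each thread peels off two bits of its target Morton code per level to pick the child to descend into, and stops at the first cell whose detail is not significant:

```cpp
#include <cstdint>
#include <vector>

// One PTT thread's descent (serial sketch of Algorithm 5). `significant`
// flags each cell in the hierarchy, indexed level by level in Z-order with a
// hypothetical offset(n) = (4^n - 1)/3 layout. The thread aims at the
// finest-grid cell with Morton code `target` and returns the hierarchy index
// of the cell at which it stops (a leaf cell).
int64_t traverse(int64_t target, int L, const std::vector<bool>& significant) {
    auto offset = [](int n) { return ((int64_t(1) << (2 * n)) - 1) / 3; };
    int64_t idx = 0;                        // start at the coarsest cell
    for (int n = 0; n < L; ++n) {
        if (!significant[idx]) return idx;  // leaf: detail not significant
        // Morton code, at level n + 1, of the child containing the target.
        const int64_t child = target >> (2 * (L - n - 1));
        idx = offset(n + 1) + child;        // descend one level
    }
    return idx;                             // reached the finest grid
}
```

For example, with *L* = 2 and only the coarsest cell and its first child flagged as significant, threads targeting Morton codes 0–3 reach the finest grid while the remaining threads stop at the three coarse leaf cells, so adjacent threads follow identical descent paths.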

#### Neighbour finding and FV1 update

As seen in Figure 7(c), several threads (e.g., *t*_{0} to *t*_{3} and *t*_{12} to *t*_{15}) finish at the same leaf cell. These duplicates are used to record the *Z*-order indices of the neighbour cells (which is necessary for the FV1 update) in parallel (line 6 of Algorithm 3) by making each thread in the grid in Figure 7(c) look left, right, up and down. The *Z*-order indices of the leaf cells and their neighbours are stored in memory. Figure 8(a) shows how the *Z*-order indices of the leaf cells and their neighbours are stored in five contiguous arrays. Duplicate groups of indices are removed, as indicated by the double black lines, via so-called parallel stream compaction using the CUB library (Merrill 2022). Figure 8(b) shows the *Z*-order indices of the leaf cells and their neighbours without any duplicates, stored in five arrays. The leaf cells make up the non-uniform grid.
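Because adjacent threads that stop at the same leaf cell record identical indices, the duplicates form contiguous runs in the recorded arrays, so compaction amounts to dropping adjacent repeats. GPU-HWFV1 does this with CUB's parallel stream compaction; a serial stand-in using `std::unique` illustrates the effect:

```cpp
#include <algorithm>
#include <vector>

// Serial stand-in for the parallel stream compaction: duplicate leaf indices
// recorded by adjacent threads form contiguous runs, so removing them reduces
// to erasing adjacent repeats while preserving order.
std::vector<int> compact(std::vector<int> indices) {
    indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
    return indices;
}
```

For instance, the recorded run {2, 2, 2, 2, 5, 6, 7, 8} compacts to {2, 5, 6, 7, 8}, leaving one entry per leaf cell.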

The next step is to parallelise the FV1 update on the leaf cells making up the non-uniform grid (line 7 of Algorithm 3), which is relatively simple. The parallel FV1 update launches one thread per leaf cell. In Figure 8(b), there are ten leaf cells, so ten threads *t*_{0} to *t*_{9} are launched. Each thread uses the arrays of leaf cell and neighbour *Z*-order indices to retrieve *S*_{c}, *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south} from within the hierarchy. For example, *t*_{0} uses the first column of indices (red box in Figure 8(b)), *t*_{1} the next column and so on. Since each thread *t*_{0} to *t*_{9} retrieves *S*_{c}, *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south} for its leaf cell, **L**_{c} can be computed to perform the FV1 update in parallel for all leaf cells. After the FV1 update, incrementing the current simulation time by the timestep (line 8 of Algorithm 3) is trivial. The new minimum timestep based on the CFL condition (line 9 of Algorithm 3) is computed by performing a so-called parallel minimum reduction using the CUB library (Merrill 2022). Hence, the steps involved in the CPU-HWFV1 model (lines 4–9 of Algorithm 3) are all parallelised and GPU-HWFV1 is obtained.

## NUMERICAL RESULTS AND DISCUSSION

This section will compare the proposed GPU-HWFV1 model against two existing and validated shallow water models, namely, the reference uniform-grid FV1 counterpart parallelised on the GPU, GPU-FV1, available in the open-source LISFLOOD-FP 8.0 flood modelling package (Shaw *et al.* 2021), and the sequential CPU predecessor (Kesserwani & Sharifian 2020), CPU-HWFV1. The comparisons will be performed for five test cases (see Table 1) that explore a range of topographies and flow dynamics. The primary aim of the comparisons will be to quantify GPU-HWFV1's runtime performance relative to GPU-FV1 and CPU-HWFV1. A secondary aim is to verify that the proposed GPU-HWFV1 model is valid, which is done by checking that GPU-HWFV1 preserves a similar level of fidelity as GPU-FV1 at the finest resolution accessible to GPU-HWFV1.

| Test name | Test type | Reason for use | Previously used in |
|---|---|---|---|
| Quiescent flow over irregular topographies with different steepness; dam-break flow over realistic terrain with friction effects (Section 3.1) | Synthetic | Verifying fidelity | (Song et al. 2011; Huang et al. 2013; Kesserwani et al. 2018; Kesserwani & Sharifian 2020; Shirvani et al. 2021) |
| Circular 2D dam-break flow (Section 3.2) | Synthetic | Assessing runtime performance | (Wang et al. 2011; Kesserwani & Sharifian 2020) |
| Pseudo-2D dam-break flow (Section 3.2) | Synthetic | Assessing runtime performance | (Kesserwani et al. 2019; Kesserwani & Sharifian 2020) |
| Dam-break wave interaction with an urban district (Section 3.3) | Experimental | Assessing runtime performance | (Jeong et al. 2012; Caviedes-Voullième et al. 2020; Kesserwani & Sharifian 2020) |
| Tsunami wave propagation over a complex beach (Section 3.3) | Experimental | Assessing runtime performance | (Hou et al. 2015; Arpaia & Ricchiuto 2018; Caviedes-Voullième et al. 2020; Kesserwani & Sharifian 2020) |


The first test case (Section 3.1) will focus solely on verifying GPU-HWFV1's fidelity under idealised scenarios where a benchmark solution is available. The second and third test cases (Section 3.2) will then mainly focus on assessing GPU-HWFV1's runtime performance against CPU-HWFV1 and GPU-FV1 in more complex cases considering synthetic dam-break flows over flat terrain. As there is no terrain data to consider in these scenarios, the fidelity of GPU-HWFV1 will be assessed by verifying that its results match those of CPU-HWFV1 and GPU-FV1. This allows a systematic analysis of the runtime performance in relation to the sensitivity to triggering grid refinement (*ε*), the depth in grid resolution (*L*) and the flow type (vigorous or smooth). Therefore, simulations are run by pairing different values of *ε* and *L*, with *ε* = {10^{−4}, 10^{−3}, 10^{−2}} to cover the recommended ranges for maintaining a fair balance between predictive accuracy and runtime performance (Kesserwani *et al.* 2019; Kesserwani & Sharifian 2020), and *L* = {8, 9, 10, 11} as no gains in runtime performance were identified for *L* ≤ 7 and using *L* ≥ 12 exceeded the memory capacity of the GPU card used (RTX 2070). The effects of these parameters and of the flow type on the wavelet-based grid adaptation of the HWFV1 models are summarised in Table 2.
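For context on the *L* range, the finest grid of the hierarchy holds 2^{L} × 2^{L} cells, so the cell count quadruples with every level. The sketch below (with an assumed per-cell byte count, used for rough scaling only) illustrates the growth that makes deep refinement levels memory-hungry:

```python
# Illustrative sketch of why deepening the maximum refinement level L is
# memory-hungry: the finest grid holds 2^L x 2^L cells, a fourfold growth
# per level. The per-cell byte count below (four 8-byte variables, e.g.
# h, qx, qy, z) is an assumption used only for rough scaling; the full MRA
# hierarchy stores many more arrays across all levels, so real usage is
# considerably larger.
BYTES_PER_CELL = 4 * 8  # hypothetical

def finest_grid_cells(L):
    """Number of cells on the finest grid of the hierarchy."""
    return (2 ** L) ** 2

for L in (8, 9, 10, 11, 12):
    mib = finest_grid_cells(L) * BYTES_PER_CELL / 2 ** 20
    print(f"L = {L:2d}: {finest_grid_cells(L):>10,} cells, ~{mib:8.1f} MiB")
```

Each increment of *L* quadruples both the cell count and, roughly, the per-timestep work, which is consistent with the memory limit reported for *L* ≥ 12 on the card used.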

| Aspect | Description | Effect on finest grid resolution |
|---|---|---|
| L | Controls the finest accessible grid resolution | Deeper with higher L |
| ε | Controls how far the finest grid resolution is accessed | More accessible with smaller ε |
| Flow | Vigorous (with discontinuities) to smooth (up to flat) | Triggered often for vigorous flows |


The expectations on runtime performance established from the synthetic test cases will finally be explored in the fourth and fifth test cases (Section 3.3), which involve realistic topographies represented by digital elevation models (DEMs). Running the HWFV1 models in the presence of a DEM means that the maximum refinement level *L* must be set to accommodate the DEM resolution, and no grid coarsening is allowed beyond what the MRA of the DEM suggests.

### Verification of fidelity

| Hump profile | h + z (m) | Wetting conditions | Reference |
|---|---|---|---|
| Smooth | 0.875 | Dry around the highest hump, critical (h = 0 m) over the two small humps | (Song et al. 2011; Huang et al. 2013; Shirvani et al. 2021) |
| Steeper | 1.78 | Dry around the highest hump, critical (h = 0 m) at the peak of the medium hump, wet above the shortest hump (h > 0 m) | (Kesserwani et al. 2018) |
| Rectangular | 1.95 | Dry around the highest hump, critical (h = 0 m) at the peak of the medium hump, wet above the shortest hump (h > 0 m) | (Kesserwani et al. 2018) |


GPU-HWFV1 simulations are run up to 100 s with a maximum refinement level *L* = 8 and an error threshold *ε* = 10^{−3} (requiring around 3,000 timesteps to complete) while measuring the discharges. For GPU-HWFV1 to be deemed well-balanced, the free-surface elevation should stay undisturbed during the simulation; hence, the discharges should not deviate from zero beyond machine precision. The bottom panels of Figure 9 show the time histories of the maximum discharge errors (i.e., the maximum deviation from zero) for the three hump profiles. The errors grow with increased irregularity in the hump profile, but nonetheless remain bounded, as also observed for the CPU model counterparts (Kesserwani & Sharifian 2020). This demonstrates that GPU-HWFV1 is well-balanced irrespective of the steepness of the bed slope and the presence of wet–dry zones and fronts in the domain area.
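The well-balancedness check described above amounts to monitoring the maximum absolute discharge over all cells at each output time and verifying that it stays within a few machine epsilons of zero. A minimal sketch (with hypothetical discharge arrays) is:

```python
# Sketch of the well-balancedness check: for a quiescent flow over uneven
# topography, the discharges should never deviate from zero beyond round-off.
# The qx, qy arrays below are hypothetical per-cell discharges, not model
# output from the paper.

def max_discharge_error(qx, qy):
    """Maximum deviation of the discharges from zero over all cells,
    used to confirm the quiescent free surface stays undisturbed."""
    return max(max(abs(v) for v in qx), max(abs(v) for v in qy))

# Hypothetical per-cell discharges after a quiescent-flow step: round-off only.
qx = [0.0, 1.2e-16, -3.1e-16]
qy = [0.0, -2.2e-16, 4.4e-16]
err = max_discharge_error(qx, qy)
assert err < 10 * 2.22e-16  # bounded by a few machine epsilons
```

A model failing this test would show the error drifting upward in time rather than staying bounded.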

A dam-break flow simulation with friction (*n*_{M} = 0.018 m^{1/3}/s) is next considered for the smooth hump profile (top left panel, Figure 9). The initial dam-break flow conditions assume a water body of *h* = 1.875 m held by an imaginary dam located at *x* = 16 m with zero discharges. Using the same choice of *ε* and *L*, a GPU-HWFV1 simulation is run up to 12 s. A GPU-FV1 simulation on the finest uniform grid accessible to GPU-HWFV1 is also performed to allow for like-for-like comparisons of flood depth profiles at output times reported in previous studies (Song *et al.* 2011; Shirvani *et al.* 2021). Figure 10 includes the 2D contour maps of the flood depths predicted by GPU-HWFV1 (left panel) compared to those predicted by GPU-FV1 (right panel) at 0, 6 and 12 s. At 0 s (top panel), both models are seen to start from the same flood depth profile. At 6 s (middle panel), both models predict that the small humps are completely submerged and that the dam-break wave has reached the large hump, and the *L*^{1} error difference between the depths predicted by GPU-HWFV1 and GPU-FV1 is 4.6 × 10^{−4}. Similar wave patterns surround the large hump by 12 s (bottom panel), where the *L*^{1} error is 9.2 × 10^{−4}. In all the predictions, GPU-HWFV1 shows symmetrical flood extent profiles that are similar to those reproduced by GPU-FV1 and to other hydrodynamic profiles reported in previous works (Song *et al.* 2011; Shirvani *et al.* 2021). This indicates that the GPU-HWFV1 implementation preserves the fidelity of well-established models used for real-world applications.

### Assessing runtime performance for synthetic test cases

#### Circular 2D dam-break flow

This test case considers a circular 2D dam-break flow in a square domain area (Toro 2001). Initially, the water depth inside the cylindrical dam is 2.5 m, separating it from a water depth of 0.5 m elsewhere. The dam-break flow occurs over a frictionless and flat terrain, resulting in a shock moving radially outwards and a rarefaction wave moving radially inwards, which eventually collapses to form a secondary shock. The test is first used to further verify GPU-HWFV1 using the same choice of *ε* and *L* as in Section 3.1 and by comparing its simulation outputs to those of CPU-HWFV1 and GPU-FV1. As in Kesserwani & Sharifian (2020), simulations are run up to *t* = 3.5 s for GPU-HWFV1, CPU-HWFV1 and GPU-FV1. Figure 11 shows the water depth centrelines predicted by the three models. GPU-HWFV1 predicts water depths that are identical to those predicted by CPU-HWFV1, GPU-FV1 and the benchmark solution. The benchmark solution was produced using the FV1 numerical solution of the 1D radial form of the 2D SWE on 256 × 256 cells, following Toro (2001).

GPU-HWFV1 and CPU-HWFV1 simulations were then run for all pairings of {*ε*, *L*}, and their runtimes were recorded to produce the speed-up ratios of GPU-HWFV1 relative to CPU-HWFV1 and GPU-FV1, respectively. Figure 12 contains the plots of the speed-up ratios with increasing maximum refinement level *L*, relative to CPU-HWFV1 in the left panel and to GPU-FV1 in the right panel. The black lines indicate the average speed-up ratios obtained for the three error thresholds and the dash-dotted lines indicate the breakeven point above which GPU-HWFV1 demonstrates speed-up (these indicators are also used in the subsequent figures).

GPU-HWFV1 is found to be 5× to 46× faster than CPU-HWFV1. This speed-up grows with increasing *L* and decreasing *ε*, suggesting that wavelet-based grid adaptation is much more efficient when parallelised on the GPU, in particular as the maximum refinement level is deepened and the sensitivity to refine resolution is increased. Compared to the runtime performance of GPU-FV1, GPU-HWFV1 is not faster in this test (right panel of Figure 12) until *L* ≥ 9 for all the *ε* values, reaching a maximum speed-up of 3× for the largest *ε* = 10^{−2}, and around 2× for the smaller *ε* = 10^{−3} and 10^{−4}. This means that GPU-HWFV1, despite the overhead costs of the MRA process, can still compete with the speed of a fine uniform-grid GPU-FV1 simulation, even for a vigorous flow that causes overrefinement on the adaptive grid. Namely, GPU-HWFV1 is likely to be faster than GPU-FV1 the deeper the grid resolution (which would lead to an excessively fine uniform grid for GPU-FV1) and the lower the sensitivity for triggering grid refinement. Next, a transient analysis of the speed-ups is performed in a longer simulation that sees a gradual change in the flow from vigorous to very smooth.

#### Pseudo-2D dam-break flow

This 1D dam-break flow test case has conventionally been used to verify shallow water models over a short simulation time (2.5 s) involving transient shock and rarefaction wave propagation in two opposite directions. It was used recently over a much longer simulation time (40 s) to assess the speed-up of CPU-based adaptive grid models over their uniform-grid counterparts by considering a flow with a gradual transition from vigorous to smooth (Kesserwani *et al.* 2019; Kesserwani & Sharifian 2020).

The dam, located at *x* = 10 m, initially separates an upstream water depth of 6 m from a downstream water depth of 2 m. After the dam removal at *t* = 0 s, the shock and rarefaction waves remain present in the domain area up to 2.5 s. After 3 s, the shock wave has left the domain area from the downstream end and the flow dynamics are only driven by the presence of the rarefaction wave until 10 s, after which it exits from the upstream end. Therefore, after 10 s, the flow dissipates gradually with increased smoothness until 40 s. The models are first verified by running simulations up to 2.5 s using the same choice of *ε* and *L* as in Section 3.1 (*L* = 8 and *ε* = 10^{−3}) for the HWFV1-based models and the finest uniform grid for the GPU-FV1 model. Figure 13 shows the plots of the water depth centrelines predicted by the models, all showing good agreement with the exact solution (Delestre *et al.* 2013).

Simulations were then run for all pairings of {*ε*, *L*} alongside GPU-FV1 simulations on the finest uniform grid. Time histories of the runtimes were recorded throughout the 40 s simulations, during which the flow transitions from vigorous to very smooth. Time series of the speed-up ratios of GPU-HWFV1 over CPU-HWFV1 and GPU-FV1 for the different values of *L* and *ε* are plotted in Figure 14.

Looking at the speed-up over CPU-HWFV1 (Figure 14, top panels), GPU-HWFV1 exceeds the breakeven point for all settings except the largest *ε* = 10^{−2} combined with the lowest maximum refinement level *L* = 8 up to 2.5 s (Figure 14, top left panel). This means that CPU-HWFV1 only remained as fast as GPU-HWFV1 while the flow included the shock and rarefaction waves and for the setting with the least depth in resolution refinement and the least sensitivity to trigger grid refinement. However, even at *ε* = 10^{−2}, up to 8× speed-up is noted after 3 s, when the shock wave is no longer present. With any other combination of {*ε*, *L*}, there is a significant demonstration of speed-up: with reduced *ε* and increased *L*, GPU-HWFV1 becomes increasingly faster than CPU-HWFV1, reaching, for the highest *L* and smallest *ε*, an average speed-up of 68× throughout the simulation and a maximum speed-up of 88× at 2.5 s when flow discontinuities were still present. This confirms the benefit of parallelising the wavelet-based grid adaptation on the GPU as an alternative to the CPU version for general-purpose modelling involving all types of flow.

In terms of speed-ups over GPU-FV1 (Figure 14, bottom panels), at *ε* = 10^{−2}, GPU-HWFV1 demonstrates a maximum speed-up of 25× when *L* = 11, though it could only outrun GPU-FV1 for *L* ≥ 9, beyond which GPU-HWFV1 increasingly shows speed-up with increased smoothening of the flow. At *ε* = 10^{−3}, GPU-HWFV1's maximum speed-up reduces to 12× and it outruns GPU-FV1 for *L* ≥ 10, whereas for *L* ≤ 9, it begins to demonstrate speed-up only after 10 s when the flow starts smoothening. This suggests that less speed-up over GPU-FV1 should be expected with increased sensitivity for triggering grid refinement in GPU-HWFV1 and reduced depth of the finest resolution. The same can be noted for *ε* = 10^{−4}, but here GPU-HWFV1 starts to be faster than GPU-FV1 for *L* ≥ 9 and the overall maximum speed-up reduces to 8× (reached again after 10 s when the flow is smoothening). These analyses indicate that an adaptive-grid GPU-HWFV1 simulation is likely to be more efficient than a uniform-grid GPU-FV1 simulation for very fine resolution modelling of gradual to smooth flows, with *L* ≥ 9, and when the sensitivity to grid refinement is not maximal, with *ε* > 10^{−4}.

### Further investigations into runtime performance: realistic flow simulations

#### Dam-break wave interaction with an urban district

This test case is selected (as in Caviedes-Voullième *et al.* 2020) as it has a set of spatial experimental data for the water depth and the velocities (Soares-Frazão & Zech 2008). It involves a dam-break wave propagating in a 36 m × 3.6 m smooth channel (*n*_{M} = 0.01) that includes a wall barrier with a gate initially separating an upstream water body of 0.4 m from a water depth of 0.011 m (Figure 15). Downstream of the gate, there are twenty-five 0.3 m × 0.3 m square blocks, with 0.1 m gaps. The ground height of the wall barrier and the square blocks is 2 m. Based on this height and the dimensions reported in Soares-Frazão & Zech (2008), a DEM file was built at a resolution of 0.02 m × 0.02 m, made of 324,000 cells. The DEM includes the two rectangular blocks forming the wall barrier linked to the gate and the 25 square blocks. These discontinuous blocks are included in the grid and are accounted for as part of the well-balanced topography integration.

As the gate opens abruptly, a dam-break wave forms and flows swiftly to collide with the blocks. The blocks almost entirely impede the shock, creating a backwater zone upstream, while the unimpeded flow cascades through the gaps to form a hydraulic jump downstream as the simulation progresses (e.g., Figure 24 in Soares-Frazão & Zech (2008)).
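The DEM construction described above (a 36 m × 3.6 m channel rasterised at 0.02 m, with block footprints burnt in at a ground height of 2 m) can be sketched as follows; the block position used is hypothetical, and only the grid dimensions and cell count follow the text:

```python
# Sketch of building the DEM grid for the urban-district test: a 36 m x 3.6 m
# channel at 0.02 m resolution. The wall barrier and the 25 square blocks
# would be burnt in at a ground height of 2 m; their exact positions are not
# reproduced here (see Soares-Frazão & Zech 2008), so the block below is
# placed at a hypothetical corner (x0, y0).
DX = 0.02                       # DEM resolution (m)
NX = round(36.0 / DX)           # cells along the channel
NY = round(3.6 / DX)            # cells across the channel

dem = [[0.0] * NX for _ in range(NY)]   # flat bed elevation (m)

# One hypothetical 0.3 m x 0.3 m block with its lower-left corner at (x0, y0):
x0, y0 = 9.0, 1.0
for j in range(round(y0 / DX), round((y0 + 0.3) / DX)):
    for i in range(round(x0 / DX), round((x0 + 0.3) / DX)):
        dem[j][i] = 2.0         # block ground height (m)

assert NX * NY == 324_000       # matches the cell count reported in the text
```

Representing the blocks directly in the bed elevation is what lets the well-balanced topography integration account for them without special boundary treatment.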

GPU-HWFV1 simulations are run with *L* = 11 for two values of *ε* = {10^{−4}, 10^{−3}}, alongside GPU-FV1 on a uniform grid at the finest resolution accessible to GPU-HWFV1. Figure 16 shows the water depth (left panel) and velocity (right panel) profiles along *y* = 0.2 m at 6 s predicted by GPU-HWFV1 and GPU-FV1, as well as the experimental profiles. All the model-predicted profiles are within the expected range of agreement with the experimental profiles (Caviedes-Voullième *et al.* 2020; Kesserwani & Sharifian 2020). Compared to the predictions made by GPU-FV1, those made by GPU-HWFV1 with *ε* = 10^{−4} are closer than with *ε* = 10^{−3}, though the difference is not significant.

GPU-HWFV1 demonstrates sustained speed-up over CPU-HWFV1 with both *ε* = 10^{−3} and 10^{−4} throughout the 10 s simulation. Higher levels of speed-up are demonstrated with larger *ε*, which is in line with the findings in Section 3.2. GPU-HWFV1 is also faster than GPU-FV1 in this test, on average ∼2.4× faster with both *ε* = 10^{−3} and 10^{−4}. This can be expected for a run with *L* = 11 accommodating the very fine resolution of the DEM. Up to 2 s, the run with GPU-HWFV1 at *ε* = 10^{−3} demonstrates higher levels of speed-up than at *ε* = 10^{−4}, which is in line with the observations made in Section 3.2. In contrast, after 2 s, GPU-HWFV1 at *ε* = 10^{−3} reduces its level of speed-up to become lower than that of GPU-HWFV1 at *ε* = 10^{−4}. This could be due to GPU-HWFV1's higher sensitivity to grid refinement around the rectangular and square topographic blocks. Overall, GPU-HWFV1, besides being more performant than CPU-HWFV1, also remains faster than GPU-FV1 for this test. Supported by the analysis in Section 3.2, this can be expected given the maximised depth in the resolution level (*L* = 11) needed to accommodate the domain size at the fine resolution of the DEM.

#### Tsunami wave propagation over a complex beach

The test case considers a 1:400 scaled replica of the 1993 Okushiri tsunami (Matsuyama & Tanaka 2001). It has been used in other works for model verification and for runtime performance assessments of wavelet-based adaptive models versus their uniform counterparts for simulations on the CPU (Caviedes-Voullième *et al.* 2020; Kesserwani & Sharifian 2020). It is here used to assess the runtime performance of the adaptive GPU-HWFV1 model versus CPU-HWFV1 and GPU-FV1 models.

The domain is described by a DEM with friction (*n*_{M} = 0.01 m^{1/3}/s) that has a uniform resolution of 0.014 m × 0.014 m and is made of 163,840 cells (i.e., around half the number of cells of the previous test case). The domain area has closed boundaries except for the western boundary, through which a tsunami-generated inflow (Kesserwani & Sharifian 2020) enters and eventually reaches the coastal area to the east, before which there is a gauge point (*x* = 4.521 m, *y* = 1.696 m) hit by the tsunami-generated flood wave. Experimental time histories of the free-surface water elevation are available at this point and will be used to verify the GPU-HWFV1 and GPU-FV1 models' ability to achieve a 22.5 s simulation. The left panel of Figure 18 displays a view of the domain area, including the gauge point location, marked by a red dot, and the coastal area at the eastern end (yellow colour). Given the smaller size of the domain area, fitting the DEM resolution to the domain size requires using *L* = 9 in this test to run the GPU-HWFV1 simulations with *ε* = {10^{−3}, 10^{−4}}. The GPU-FV1 simulation was run on a uniform grid at the DEM resolution. The right panel of Figure 18 contains the time histories of the free-surface water elevations predicted by the models, which are in good agreement with the experimental time histories. All the models predict the expected gradual retraction in the free-surface elevation between 12 and 15 s, followed by a sharp increase that peaks at around 17 s. GPU-HWFV1 at *ε* = 10^{−4} leads to predictions that are visually indistinguishable from those of GPU-FV1. With *ε* = 10^{−3}, the predictions remain comparable, subject to small, localised discrepancies at times where there is a sharp flow transition, such as at around 18 and 21 s.

Speed-up ratios of GPU-HWFV1 over CPU-HWFV1 and GPU-FV1 were again recorded with *ε* = 10^{−3} and 10^{−4}. Compared to the previous test case, the terrain is complex all over the domain and the grid cannot be coarsened much, leading to an overrefined grid that is further refined by flow disturbances during the HWFV1 simulations. In such a case, GPU-HWFV1 demonstrates remarkable speed-up over CPU-HWFV1, which increases as *ε* is decreased from 10^{−3} to 10^{−4}, in line with the observations in Section 3.2, but the speed-up doubles in this test.

Compared to GPU-FV1, GPU-HWFV1 only demonstrates speed-up with *ε* = 10^{−3} (around 1.25×) and is slightly slower to run with *ε* = 10^{−4} where its speed-up falls below the breakeven line. This implies that it is worthwhile to perform wavelet-based grid adaptation in this test for *ε* = 10^{−3} but not for *ε* = 10^{−4}, at which the grid is overrefined (due to the terrain and flow disturbances). Overall, GPU-HWFV1 is shown to be generally faster than CPU-HWFV1 but could not outrun GPU-FV1 for *ε* = 10^{−4} (which leads to an overrefined grid) and for *L* = 9 (to accommodate a relatively small domain). Nonetheless, for *ε* = 10^{−3}, GPU-HWFV1 remains a viable choice over GPU-FV1.

## CONCLUSION

This paper presented an adaptive first-order finite volume (FV1) shallow water model with a dynamic-in-time wavelet-based grid adaptation process driven by the MRA of the HW that is fully parallelised on the GPU, called GPU-HWFV1. The MRA involves a nested hierarchy of grids, where the coarsest grid in the hierarchy is made up of a single cell while the finest grid is made up of 2^{L} × 2^{L} cells, where *L* is a user-specified maximum refinement level. In the ‘encoding’ (coarsening) process, a user-specified error threshold *ε* is needed to flag significant ‘details’ that decide which cells to include in the non-uniform grid. The encoding process results in a tree-like structure of significant details that is traversed in the ‘decoding’ (refinement) process by applying a sequential depth-first traversal (DFT) algorithm to identify the cells making up the non-uniform grid. Encoding and decoding have been parallelised on the GPU by adopting the indexing of a *Z*-order space-filling curve to ensure coalesced memory access. Meanwhile, the DFT algorithm has been replaced with a parallel tree traversal (PTT) algorithm to traverse the tree of significant details on the GPU with minimal warp divergence. The PTT algorithm also allows the neighbour cells of each cell in the non-uniform grid to be identified easily for performing the FV1 update in parallel.
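As context for the *Z*-order indexing, the mapping from a cell's 2D position to its index on a level-*L* grid amounts to interleaving the bits of its column and row numbers. A minimal Python illustration (a sketch of the standard Morton-code construction, not the paper's GPU implementation, which would use bitwise intrinsics) is:

```python
def morton_encode(i, j, L):
    """Interleave the bits of column i and row j (each < 2**L) to obtain the
    Z-order (Morton) index of a cell on the level-L grid, so that spatially
    nearby cells map to nearby memory locations (coalesced access)."""
    z = 0
    for b in range(L):
        z |= ((i >> b) & 1) << (2 * b)       # bits of i go to even positions
        z |= ((j >> b) & 1) << (2 * b + 1)   # bits of j go to odd positions
    return z

# The four children of a parent cell are contiguous in Z-order:
assert [morton_encode(i, j, 2) for j in (0, 1) for i in (0, 1)] == [0, 1, 2, 3]
assert morton_encode(3, 3, 2) == 15  # last cell of a 4 x 4 grid
```

This contiguity of sibling cells is what lets threads working on the same part of the hierarchy read adjacent memory locations.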

GPU-HWFV1 was first verified and then its runtime performance was assessed against its sequential predecessor running on the central processing unit (CPU-HWFV1) as well as the reference uniform-grid FV1 counterpart parallelised on the GPU (GPU-FV1), run on the finest grid accessible to the HWFV1 models. The verification was performed using *ε* = 10^{−3} (recommended for flood modelling) and *L* = 8 for four synthetic test cases involving motionless, vigorous, gradual and smooth flows. A systematic runtime performance assessment was performed for two synthetic test cases involving dam-break flow, where lower and higher bounds for *ε* = {10^{−2}, 10^{−3}, 10^{−4}} were also considered alongside an increase in the maximum refinement level *L* = {8, 9, 10, 11}. Verification and runtime performance assessments were finally performed for realistic test cases with DEMs, for which the value of *L* was chosen to match the DEM resolution, while GPU-HWFV1 was run with *ε* = {10^{−3}, 10^{−4}}.

GPU-HWFV1's overall performance for all the test cases provided strong evidence that it delivers a similar level of fidelity as GPU-FV1 in replicating the realistic flows including the presence of uneven topographies, wet–dry fronts and friction effects. In terms of runtime performance over CPU-HWFV1, GPU-HWFV1 yielded significant speed-ups for all the test cases, ranging between 20× and 400×. Hence, this work offers compelling evidence for porting the GPU-parallelised wavelet-based grid adaptation process to FV1 models in other fields. From the systematic runtime performance assessment for the synthetic test cases, GPU-HWFV1 tends to demonstrate speed-up of around 1.1× to 30× over GPU-FV1 for *L* ≥ 9 and/or by avoiding the smallest *ε**=* 10^{−4}. For the realistic test cases, GPU-HWFV1 showed speed-up over GPU-FV1 for the test with *L* = 11 and for *ε**=* 10^{−3} for the test with *L* = 9. Hence, GPU-HWFV1 can be favoured to gain runtime performance over GPU-FV1 for shallow water modelling over real DEMs, namely with an increased fineness in the DEM resolution and an increased domain size.

## ACKNOWLEDGEMENTS

Alovya Ahmed Chowdhury and Georges Kesserwani were supported by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/R007349/1. This work is part of the SEAMLESS-WAVE project (SoftwarE infrAstructure for Multi-purpose fLood modElling at variouS scaleS based on “WAVElets”).

## DATA AVAILABILITY STATEMENT

All relevant data are available from an online repository at https://dx.doi.org/10.5281/zenodo.8075133.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

In Equations (3a)–(4d), *H*^{0}, *H*^{1}, *G*^{0} and *G*^{1} are scalar coefficients obtained from the Haar wavelets; their derivation is available in previous works (Keinert 2003; Kesserwani & Sharifian 2020) and is outside the scope of this paper.

## REFERENCES
