## Abstract

Wavelet-based grid resolution adaptation driven by the ‘multiresolution analysis’ (MRA) of the Haar wavelet (HW) enables the design of an adaptive first-order finite volume (FV1) model (HWFV1) that can readily preserve the modelling fidelity of its reference uniform-grid FV1 counterpart. However, the MRA entails an enormous computational effort, as it involves ‘encoding’ (coarsening), ‘decoding’ (refining), analysing and traversing modelled data across a deep hierarchy of nested, uniform grids. GPU-parallelisation of the MRA is needed to handle this computational effort, but its algorithmic structure (1) hinders coalesced memory access on the GPU and (2) involves an inherently sequential tree traversal problem. This work redesigns the algorithmic structure of the MRA in order to parallelise it on the GPU, addressing (1) by applying *Z*-order space-filling curves and (2) by adopting a parallel tree traversal algorithm. The result is a GPU-parallelised HWFV1 model (GPU-HWFV1). GPU-HWFV1 is verified against its CPU predecessor (CPU-HWFV1) and its GPU-parallelised reference uniform-grid counterpart (GPU-FV1) over five shallow water flow test cases. GPU-HWFV1 preserves the modelling fidelity of GPU-FV1 while being up to 30 times faster. Compared to CPU-HWFV1, it is up to 200 times faster, suggesting that the GPU-parallelised MRA could be used to speed up other FV1 models.

## HIGHLIGHTS

- Wavelet-based grid adaptation is parallelised on the GPU via a *Z*-order space-filling curve and a parallel tree traversal algorithm.
- An adaptive Haar wavelet first-order finite volume shallow water model running on the GPU is developed (GPU-HWFV1).
- GPU-HWFV1 is 20–300 times faster than its serial single-core CPU version.
- GPU-HWFV1 is 1.3–30 times faster than its GPU-parallelised reference uniform-grid counterpart.

## INTRODUCTION

Parallelisation of finite volume shallow water models that numerically solve the full two-dimensional (2D) shallow water equations (SWE) on the graphical processing unit (GPU) has become common practice for handling the computational effort of real-scale flood modelling applications (Xia *et al.* 2019; Dazzi *et al.* 2020; Gordillo *et al.* 2020; Carlotto *et al.* 2021; Shaw *et al.* 2021; Buttinger-Kreuzhuber *et al.* 2022; Delmas & Soulaïmani 2022; Han *et al.* 2022; Caviedes-Voullième *et al.* 2023; Ferrari *et al.* 2023; Sanz-Ramos *et al.* 2023). GPU-parallelised finite volume models using uniform grids have become the *de facto* standard in industrial hydraulic modelling packages (InfoWorks ICM 2018; MIKE 21 GPU 2019; Flood Modeller 2D 2022), with one also using a non-uniform grid via static-in-time grid resolution adaptation (TUFLOW HPC 2018). However, the development of GPU-parallelised shallow water models with *dynamic*-in-time grid resolution adaptation has received much less attention.

Classical adaptive mesh refinement (AMR; Berger & Colella 1989) enforces grid adaptation by generating a non-uniform grid that uses finer cells only where higher resolution is needed, reducing the overall number of cells in the grid and thus the computational cost (Hu *et al.* 2018; Lakhlifi *et al.* 2018; Ghazizadeh *et al.* 2020; Gong *et al.* 2020; Wallwork *et al.* 2020; Holzbecher 2022). A desirable goal in shallow water modelling is then to combine AMR and GPU-parallelisation in an effort to further reduce computational costs. However, the GPU-parallelisation of AMR is challenging due to its algorithmic complexity (de la Asunción & Castro 2017; Qin *et al.* 2019): because of this, often, only the finite volume shallow water model is parallelised on the GPU, while the AMR operations for generating the non-uniform grid are retained on the central processing unit (CPU). Such a hybrid CPU-GPU approach, in turn, introduces computational overhead because data must be transferred between the CPU and the GPU. Other models eliminate this data transfer overhead by parallelising the AMR operations on the GPU as well; examples include applications in hydrodynamics (Sætra *et al.* 2014; Beckingsale *et al.* 2015; Wahib *et al.* 2016; Dunning *et al.* 2020), magnetohydrodynamics (Schive *et al.* 2018), combustion (Lung *et al.* 2016) and gas dynamics (Giuliani & Krivodonova 2019). Nonetheless, the combination of AMR and GPU-parallelisation has not been as popular in shallow water modelling compared to the *de facto* standard GPU-parallelised uniform-grid shallow water models, as its design cannot guarantee the preservation of the modelling fidelity delivered by the reference uniform-grid finite volume shallow water model without AMR (Zhou *et al.* 2013; Donat *et al.* 2014; Liang *et al.* 2015; Kévin *et al.* 2017; Ghazizadeh *et al.* 2020; Zhang *et al.* 2021).

In contrast, wavelet-based grid adaptation is fundamentally designed on the basis of preserving a similar level of modelling fidelity initially deliverable by the reference uniform-grid counterpart, as demonstrated for shallow water models (Caviedes-Voullième & Kesserwani 2015; Haleem *et al.* 2015). The design of the wavelet-based grid adaptation process requires specifying a single user-specified error threshold *ε* that rigorously controls the deviation allowed to occur from the reference uniform-grid counterpart (Gerhard *et al.* 2015). However, wavelet-based grid adaptation imposes a much higher computational effort than classical AMR as it uses the ‘multiresolution analysis’ (MRA) to generate the non-uniform grid, which involves ‘encoding’ (coarsening), ‘decoding’ (refining), analysing and traversing modelled data across a deep hierarchy of nested, uniform grids. Although finite volume shallow water models incorporating wavelet-based grid adaptation have been shown to be 20 times faster than their uniform-grid counterparts when run in serial on the CPU (Kesserwani & Sharifian, 2020), it is unknown whether such a speedup can be achieved in the GPU setting, i.e., whether wavelet-based grid adaptation can accelerate GPU-parallelised uniform-grid finite volume shallow water models by the same extent as the CPU versions. Owing to the high computational effort imposed by the MRA, the implementation of wavelet-based grid adaptation on the GPU, in particular the MRA process, seems to be a direction worth exploring.

To date, wavelet-based grid adaptation has mainly been parallelised on CPUs (Domingues *et al.* 2019; Soni *et al.* 2019; Deiterding *et al.* 2020; Semakin & Rastigejev 2020; Julius & Marie 2021; Gillis & van Rees 2022; Zeidan *et al.* 2022). Its parallelisation on the GPU is unreported and remains necessary, as mentioned, to reduce the computational overhead of performing the MRA. In practice, this is difficult to achieve due to two obstacles. First, the modelled data involved in the MRA process are close together in the hierarchy of grids in physical space, but they are not guaranteed to be close together in GPU memory space unless the hierarchy is indexed in a deliberate way, hindering the coalesced memory access that should be maximised to achieve fast GPU performance (Brodtkorb *et al.* 2013; NVIDIA 2023). Second, traversing the hierarchy of grids boils down to a tree traversal problem, which is fundamentally a sequential task usually completed with recursive algorithms such as depth-first traversal (DFT; Sedgewick & Wayne 2011).

This paper aims to overcome these two obstacles and implement wavelet-based grid adaptation on the GPU by combining two computational ingredients. This entails (1) introducing so-called space-filling curves (SFCs; Sagan 1994; Bader 2013) to index the hierarchy of grids and ensure coalesced memory access and (2) adopting a parallel tree traversal (PTT) algorithm (Karras 2012) to replace the recursive DFT algorithm. To be of practical relevance, the GPU-parallelised wavelet-based grid adaptation process must make an adaptive wavelet-based shallow water model competitive with its GPU-parallelised reference uniform-grid counterpart.

The rest of this paper is organised as follows. Section 2 presents a GPU-parallelised adaptive Haar wavelet (HW) first-order finite volume (FV1) shallow water model (Haleem *et al.* 2015; Kesserwani & Sharifian 2020), termed GPU-HWFV1. Section 3 compares GPU-HWFV1 against the GPU-parallelised uniform-grid FV1 solver of the open-source LISFLOOD-FP 8.0 flood modelling package (Shaw *et al.* 2021), referred to hereafter as GPU-FV1. For completeness, GPU-HWFV1 is also compared against its sequential CPU predecessor, hereafter called CPU-HWFV1, which has already been extensively validated for one-dimensional and two-dimensional shallow water test cases (Kesserwani *et al.* 2019; Kesserwani & Sharifian 2020). Section 4 presents conclusions on the potential benefits of using GPU-HWFV1 for shallow water modelling applications.

## NEW GPU-HWFV1 MODEL

This section presents a GPU-parallelised adaptive HW first-order finite volume (FV1) shallow water model, GPU-HWFV1. GPU-HWFV1 is obtained by parallelising the sequential CPU version of the HWFV1 model presented in Kesserwani & Sharifian (2020), CPU-HWFV1. CPU-HWFV1 is difficult to parallelise because it incorporates a wavelet-based grid adaptation process whose algorithmic structure (1) hinders coalesced memory access on the GPU and (2) features an inherently sequential tree traversal problem. To clarify why this is so, Section 2.1 gives an overview of CPU-HWFV1 with a focus on presenting the grid adaptation process. Section 2.2 then presents the GPU-parallelisation of CPU-HWFV1, detailing in particular how two computational ingredients, namely, *Z*-order SFCs and a PTT algorithm, are used to redesign the algorithmic structure of the grid adaptation process and make it suitable for GPU-parallelisation.

### Existing sequential CPU-HWFV1 model

#### Overview

CPU-HWFV1 is an adaptive shallow water model that runs shallow water simulations by solving the two-dimensional SWE over a dynamically adaptive non-uniform grid. CPU-HWFV1 is made up of two mechanisms: a wavelet-based grid adaptation process for generating the non-uniform grid and an FV1 scheme for updating the modelled data on this non-uniform grid. Since CPU-HWFV1 is dynamically adaptive, it performs the grid adaptation process every timestep before performing the FV1 update. In contrast to classical AMR methods (e.g., Berger & Colella 1989), which start with a coarse reference grid and selectively refine it, wavelet-based grid adaptation starts with a reference uniform grid at the finest allowable resolution and selectively coarsens it. The wavelet-based grid adaptation process is driven by the ‘multiresolution analysis’ (MRA) of the HW (Haleem *et al.* 2015; Kesserwani & Sharifian 2020), which starts on a reference uniform grid made up of 2^{L} × 2^{L} cells, where *L* is a user-specified integer denoting the maximum refinement level that controls the resolution of the finest grid.

The 2D SWE can be written in vector form as:

∂**U**/∂*t* + ∂**F**(**U**)/∂*x* + ∂**G**(**U**)/∂*y* = **S**_{b}(**U**) + **S**_{f}(**U**)

where the derivatives are taken with respect to time and the two spatial coordinates, denoted by *t*, *x* and *y*, respectively. The vector **U** = [*h*, *hu*, *hv*]^{T} contains the flow variables; **F**(**U**) = [*hu*, *hu*^{2} + *gh*^{2}/2, *huv*]^{T} and **G**(**U**) = [*hv*, *huv*, *hv*^{2} + *gh*^{2}/2]^{T} are the components of the flux vector; and **S**_{b}(**U**) and **S**_{f}(**U**) are source-term vectors. The variable *h* is the water depth (m), *u* and *v* are the *x*- and *y*-components of the velocity (m/s), respectively, and *g* is the gravitational acceleration constant (m/s^{2}). In **S**_{b} = [0, −*gh*(∂*z*/∂*x*), −*gh*(∂*z*/∂*y*)]^{T}, *z* is the topographic elevation (m), and in **S**_{f} = [0, −*C*_{f}*u*, −*C*_{f}*v*]^{T}, *C*_{f} = *gn*_{M}^{2}√(*u*^{2} + *v*^{2})/*h*^{1/3}, where *n*_{M} is Manning's coefficient (s m^{−1/3}).

CPU-HWFV1 represents *h*, *hu*, *hv* and *z* as piecewise-constant modelled data over each cell in the grid, for which scalar coefficients *h*_{c}, (*hu*)_{c}, (*hv*)_{c} and *z*_{c} are assigned per cell. Ordinarily, on a uniform grid, an FV1 update is performed at every timestep using a forward-Euler scheme applied to a spatial operator **L**_{c} to advance the vector of flow coefficients **U**_{c} = [*h*_{c}, (*hu*)_{c}, (*hv*)_{c}]^{T} in time as follows:

**U**_{c}^{t+Δt} = **U**_{c}^{t} + Δ*t* **L**_{c}

To compute **L**_{c}, the local coefficients *S*_{c} = {*h*_{c}, (*hu*)_{c}, (*hv*)_{c}, *z*_{c}} as well as the coefficients of the neighbour cells, denoted by *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south}, are needed. The expression for **L**_{c} is established in previous studies (Liang 2010) and is therefore outside the scope of this paper. In summary, **L**_{c} involves flux calculations using a Harten-Lax-van Leer Riemann solver, and also includes treatments to ensure the positivity of the water depth and the so-called well-balancedness of the FV1 scheme over wet/dry zones and fronts.

#### Wavelet-based grid adaptation process driven by MRA

The MRA operates on a hierarchy of nested, uniform grids underlying the reference 2^{L} × 2^{L} grid at the maximum refinement level *L*. Figure 1(a) shows a hierarchy of grids with *L* = 2. The refinement level of each grid in the hierarchy is denoted by *n*: the finest grid in the hierarchy is at refinement level *n* = *L* = 2, the second-finest grid is at refinement level *n* = 1 and the coarsest grid is at *n* = 0. Let *s*^{(n)} denote a coefficient *s* at refinement level *n*, where *s* denotes any of the quantities in the set *S*_{c}, for ease of presentation of the MRA.

##### Encoding

Initially, only the coefficients *s*^{(L)} are available, corresponding to the coefficients obtained from initially discretising the SWE over the reference uniform grid. The MRA process continues by producing *s*^{(n)} at the lower refinement levels, i.e., for *n* = *L* − 1, …, 0. The coefficient *s*^{(n)} of a cell at refinement level *n* (called the ‘parent’) is produced using the coefficients of four cells at refinement level *n* + 1 (called the ‘children’), in particular by applying Equation (3a). Figure 1(b) shows the parent–children stencil dictating how the children at level *n* + 1 are positioned relative to the parent at level *n*. Equations (3b)–(3d) are also applied to produce so-called details, denoted by *d*_{α}, *d*_{β} and *d*_{γ}. These details are used to decide which cells to include in the non-uniform grid based on a normalised detail *d*_{norm}, obtained by normalising against the largest coefficient over all *s*^{(L)}. Cells whose *d*_{norm} is greater than 2^{n−L}*ε* are deemed to have significant details, where *ε* is a user-specified error threshold.

The process of computing the parent coefficients and the details *d*_{α}, *d*_{β} and *d*_{γ} at the lower refinement levels is called ‘encoding’. Algorithm 1 shows pseudocode summarising the process of encoding. The pseudocode has an outer loop (lines 2–10) and an inner loop (lines 3–9). The outer loop iterates over each grid in the hierarchy, starting from the grid at the second-highest refinement level, *n* = *L* − 1, to the grid at the lowest refinement level, *n* = 0. The inner loop iterates through each cell in the grid while applying Equations (3a)–(3d) (lines 5 and 6) and flagging significant details (line 8).

```
1  algorithm ENCODING(L, ε)
2    for each grid in hierarchy from L − 1 to 0 do
3      for each cell in the grid do
4        load children for Equations (3a)–(3d)
5        compute parent coefficient using Equation (3a)
6        compute details d_α, d_β, d_γ using Equations (3b)–(3d)
7        compute normalised detail d_norm
8        flag as significant if d_norm ≥ 2^(n−L)ε
9      end for
10   end for
11 end algorithm
```


**Algorithm 1:** Pseudocode for the process of computing the parent coefficients and the details at the lower refinement levels, called ‘encoding’.
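For concreteness, one encoding pass over a single refinement level can be sketched as follows, assuming the four children of parent *p* are stored consecutively at slots 4*p* to 4*p* + 3. The filter weights (a plain average for the parent, signed quarter-sums for the details) and the pre-scaled threshold argument are illustrative stand-ins for Equations (3a)–(3d) and the 2^{n−L}*ε* test, not HWFV1's exact coefficients:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of one encoding pass (inner loop of Algorithm 1) for a single
// quantity s. Children of parent p sit at slots 4p..4p+3, so the four loads
// per parent touch contiguous memory. The average/difference weights below
// are illustrative stand-ins for the Haar filters in Equations (3a)-(3d).
struct EncodeResult {
    std::vector<double> parents, d_alpha, d_beta, d_gamma;
};

EncodeResult encode_level(const std::vector<double>& child, double eps_n,
                          std::vector<bool>& significant) {
    const std::size_t np = child.size() / 4;  // number of parents
    EncodeResult r{std::vector<double>(np), std::vector<double>(np),
                   std::vector<double>(np), std::vector<double>(np)};
    for (std::size_t p = 0; p < np; ++p) {    // one iteration per parent cell
        const double a = child[4 * p], b = child[4 * p + 1];
        const double c = child[4 * p + 2], d = child[4 * p + 3];
        r.parents[p] = 0.25 * (a + b + c + d);  // coarsened coefficient
        r.d_alpha[p] = 0.25 * (a - b + c - d);  // detail components
        r.d_beta[p]  = 0.25 * (a + b - c - d);
        r.d_gamma[p] = 0.25 * (a - b - c + d);
        const double d_norm = std::max({std::fabs(r.d_alpha[p]),
                                        std::fabs(r.d_beta[p]),
                                        std::fabs(r.d_gamma[p])});
        significant[p] = d_norm >= eps_n;  // eps_n plays the role of 2^(n-L) * ε
    }
    return r;
}
```

With these stand-in weights, a locally uniform field produces zero details (the cell is coarsened away), while any variation among the four children shows up in the details and can trip the significance flag.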

##### Decoding

The process of performing a DFT while applying Equations (4a)–(4d) and identifying leaf cells is called ‘decoding’. Algorithm 2 shows pseudocode describing the decoding process. The algorithm launches at the coarsest cell in the tree. The children of this cell are computed using Equations (4a)–(4d) (line 6) and then the algorithm is relaunched using the children coefficients at one refinement level higher (lines 7–10). The algorithm is recursively relaunched unless a cell with a detail that is not significant is reached or a cell on the finest grid is reached (line 2), at which point the cell is identified as a leaf cell (line 3).

```
1  recursive algorithm DECODING(cell, n)
2    if detail is not significant or reached finest grid then
3      identify cell as leaf cell
4      stop decoding
5    else
6      compute children with Equations (4a)–(4d)
7      DECODING(first child, n + 1)
8      DECODING(second child, n + 1)
9      DECODING(third child, n + 1)
10     DECODING(fourth child, n + 1)
11   end if
12 end recursive algorithm
```

**Algorithm 2:** Pseudocode for the recursive decoding process via DFT.


#### Neighbour finding to perform the FV1 update over the non-uniform grid

To compute **L**_{c} for performing the FV1 update, CPU-HWFV1 needs to retrieve the sets of coefficients of the neighbours *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south} for each leaf cell. Retrieving these sets is trivial on a uniform grid because a cell can look left, right, up and down to find its neighbours, but this is not as straightforward on a non-uniform grid, since a cell can have multiple neighbours in a given direction. Figure 3 shows an example of a cell and its neighbours in a non-uniform grid. In this example, finding *S*_{west} (blue cell), *S*_{south} and *S*_{north} (grey cells) is straightforward. However, it is not clear what *S*_{east} should be because the eastern neighbours (red cells) are at a higher refinement level and there are multiple of them. CPU-HWFV1 avoids this ambiguity by taking *S*_{east} to be the set of coefficients of the eastern neighbour at the same refinement level (yellow cell in Figure 3), which is readily available since the encoding and decoding processes have produced *s*^{(n)} at all refinement levels. Being able to find the neighbours for retrieving *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south} makes it trivial for CPU-HWFV1 to compute **L**_{c} and perform the FV1 update per cell.

This completes the description of the steps involved in the CPU-HWFV1 model, and Algorithm 3 shows pseudocode summarising CPU-HWFV1. To run CPU-HWFV1, the user needs to specify the maximum refinement level (*L*), the error threshold (*ε*) and the simulation end time (*t*_{end}) (line 1). CPU-HWFV1 runs in a loop until *t*_{end} is reached (lines 3–10). A non-uniform grid is generated at every iteration of the loop, i.e., at every timestep (lines 4–6). Note that after the first timestep, the details are zeroed first before re-encoding, and encoding is performed only along the tree of significant details. On this non-uniform grid, CPU-HWFV1 performs an FV1 update (line 7), after which the simulation time is incremented by the current timestep (line 8) while a new timestep is recomputed based on the Courant–Friedrichs–Lewy (CFL) condition (line 9); for stability, a CFL number of 0.5 is used (Kesserwani & Sharifian 2020).

```
1  algorithm HWFV1(L, ε, t_end)
2    get s^(L) from initial discretisation of SWE on finest grid
3    while current time < t_end do
4      ENCODING(L, ε)
5      DECODING(coarsest cell, 0)  // start decoding from coarsest cell
6      find neighbours to compute L_c
7      use L_c to perform FV1 update
8      increment current time by timestep
9      compute new timestep based on CFL condition
10   end while
11 end algorithm
```


**Algorithm 3:** Pseudocode summarising the steps involved in the HWFV1 model.
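As a concrete sketch of line 9 of Algorithm 3, the timestep restriction can be computed per cell and minimised over the grid. The wave-speed estimate |*u*| + √(*gh*) and the skipping of dry cells are standard FV1 assumptions introduced here for illustration, not necessarily HWFV1's exact formulae:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the CFL timestep computation (line 9 of Algorithm 3), assuming
// the common FV1 wave-speed estimate |velocity| + sqrt(g*h) per cell. The
// minimum over all wet cells is taken; GPU-HWFV1 does this with a parallel
// reduction, whereas this serial stand-in simply loops over the cells.
double cfl_timestep(const std::vector<double>& h, const std::vector<double>& u,
                    const std::vector<double>& v, double dx, double cfl = 0.5) {
    const double g = 9.80665;  // gravitational acceleration (m/s^2)
    double dt = 1e9;           // large initial value
    for (std::size_t c = 0; c < h.size(); ++c) {
        if (h[c] <= 0.0) continue;  // dry cells impose no constraint
        const double speed =
            std::max(std::fabs(u[c]), std::fabs(v[c])) + std::sqrt(g * h[c]);
        dt = std::min(dt, cfl * dx / speed);
    }
    return dt;
}
```

For a single quiescent cell of depth 1 m on a 1 m grid, the restriction reduces to 0.5/√g seconds, i.e., roughly 0.16 s.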

### GPU-parallelisation of CPU-HWFV1

Achieving fast GPU performance requires maximising coalesced memory access and minimising warp divergence (Brodtkorb *et al.* 2013; NVIDIA 2023). Coalesced memory access occurs when adjacent threads within a warp access contiguous memory locations. Warp divergence occurs when threads within a warp execute different instructions. A naive parallelisation of the steps in Algorithm 3 would not properly address the issues of coalesced memory access and/or warp divergence. Consider the parallelisation of the encoding process (line 4 of Algorithm 3; Algorithm 1), assuming the cells in the hierarchy of grids are naively indexed in row-major order. Figure 4 shows the hierarchy of grids in Figure 1(a) indexed in row-major order (left panel) and the corresponding locations of the children in memory (right panel). Using row-major indexing leads to uncoalesced memory access when loading the children (line 4 of Algorithm 1) because there are gaps between the memory locations.

#### Ensuring coalesced memory access in encoding and decoding via *Z*-order curves

To facilitate coalesced memory access, a different type of indexing is needed that ensures that coefficients that are nearby in the grid are also nearby in memory. SFCs allow this kind of indexing by definition because they map spatial data to a one-dimensional line such that data close together in space tend to be close together on the line (Sagan 1994; Bader 2013). There are a few different types of SFCs such as the Sierpinski curve, the Peano curve, the Hilbert curve and the *Z*-order curve (also known as Lebesgue or Morton curve). All of these SFCs have been previously used in the context of AMR (Brix *et al.* 2009; Burstedde *et al.* 2011; Weinzierl & Mehl 2011; Meister *et al.* 2016), and also once in the context of wavelet-based grid adaptation (Brix *et al.* 2009), but none of these works involved GPU-parallelisation. To the authors' knowledge, this is the first work to use an SFC to implement wavelet-based grid adaptation on the GPU. In particular, the *Z*-order SFC is chosen because its motif matches exactly with the parent–children square stencil shown in Figure 1(b).

A *Z*-order curve can be created for a square 2^{n} × 2^{n} grid by following the so-called Morton codes of each cell in the grid in order. The Morton code of a cell is obtained by interleaving the bits of its *i* and *j* positional indices in the grid. Figure 5 shows the creation of a *Z*-order curve for a 2^{2} × 2^{2} grid: the left panel shows the *i* (black) and *j* (red) indices of each cell in binary form and how the bits are interleaved (alternating red and black) to yield Morton codes (in binary form). The right panel shows how these Morton codes (in decimal form) are followed in ascending order to create the *Z*-order curve. The curve enforces a *Z*-order indexing of the grid because each cell in the grid can be identified on the curve via its (unique) Morton code.
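The bit interleaving can be sketched as below (a loop-based version; production codes typically use bit-twiddling with magic masks). The convention of placing the bits of *i* at even positions and the bits of *j* at odd positions is an assumption for illustration:

```cpp
#include <cstdint>

// Interleave the bits of the (i, j) positional indices to form the Morton
// code of a cell. Bits of i go to even positions, bits of j to odd positions
// (an assumed convention); following the codes in ascending order traces the
// Z-order curve over the grid.
uint32_t morton_encode(uint32_t i, uint32_t j) {
    uint32_t code = 0;
    for (int b = 0; b < 16; ++b) {                // 16 bits per index
        code |= ((i >> b) & 1u) << (2 * b);       // bit b of i -> position 2b
        code |= ((j >> b) & 1u) << (2 * b + 1);   // bit b of j -> position 2b+1
    }
    return code;
}
```

For the 2^{2} × 2^{2} grid, cells (*i*, *j*) = (0, 0), (1, 0), (0, 1) and (1, 1) receive codes 0, 1, 2 and 3, tracing out the Z motif.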

*Z*-order curves are created for each grid in the hierarchy while enforcing continuity in the indexing of the curves of subsequent grids, resulting in a unique *Z*-order index for each cell in the hierarchy. Figure 6 shows the *Z*-order indexing of the cells in the hierarchy of grids in Figure 1(a), alongside the corresponding locations of the children in memory. Enforcing this *Z*-order indexing allows for coalesced memory access during encoding because the children are in contiguous memory locations, as seen in the right panel of Figure 6.

Parallelising the decoding process (line 5 of Algorithm 3; Algorithm 2) is more difficult than parallelising the encoding process because decoding involves the inherently sequential DFT algorithm. To overcome this difficulty, decoding is broken down into two parts that are parallelised separately: the first part is the application of Equations (4a)–(4d) and the second part is the identification of leaf cells. *Hereafter, decoding refers exclusively to the application of* Equations (4a)–(4d)*, not the identification of leaf cells*.

Decoding can be parallelised relatively easily because it can be performed using loops (similar to those in Algorithm 1) instead of using the DFT. Algorithm 4 shows pseudocode describing how to perform decoding in parallel. The pseudocode involves an outer loop and a parallelised inner loop. The outer loop iterates over the grids in the hierarchy from the coarsest grid up to the second-finest grid (lines 2–9), while the inner loop iterates through each cell in the grid in parallel (lines 3–8). The inner loop checks if the detail is significant (line 4), loads the parent and the details (line 5) and computes the children (line 6). This parallel inner loop has coalesced memory access because of the *Z*-order indexing, in particular when storing the computed children coefficients in line 6; this can be seen by interpreting the arrows in Figure 6 in the reverse direction.

```
1  algorithm PARALLEL_DECODING(L)
2    for each grid in hierarchy from 0 to L − 1 do
3      for each cell in the grid do in parallel
4        if detail is significant then
5          load parent coefficient and details d_α, d_β, d_γ
6          compute children with Equations (4a)–(4d)
7        end if
8      end for
9    end for
10 end algorithm
```


**Algorithm 4:** Pseudocode for the decoding process in parallel.
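To illustrate the inverse relationship between decoding and encoding, the sketch below reconstructs four children from a parent coefficient and three details. The child ordering and filter weights are illustrative stand-ins for Equations (4a)–(4d), chosen as the exact inverse of an average/difference encoding (parent = mean of children, details = signed quarter-sums):

```cpp
#include <array>

// Illustrative inverse (decoding) transform: reconstructs the four children
// of a cell from its parent coefficient and three details. The weights are a
// stand-in for Equations (4a)-(4d), forming the exact inverse of the
// average/difference encoding stand-in used earlier.
std::array<double, 4> decode(double parent, double d_alpha, double d_beta,
                             double d_gamma) {
    return {parent + d_alpha + d_beta + d_gamma,   // first child
            parent - d_alpha + d_beta - d_gamma,   // second child
            parent + d_alpha - d_beta - d_gamma,   // third child
            parent - d_alpha - d_beta + d_gamma};  // fourth child
}
```

Decoding the values obtained from encoding the children (1, 2, 3, 4) under the stand-in filters, i.e., parent 2.5 and details (−0.5, −1, 0), recovers (1, 2, 3, 4) exactly, which is the round-trip property the MRA relies on.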

#### Parallel tree traversal algorithm

Unlike decoding, the identification of leaf cells boils down to a tree traversal problem. There have been many investigations into PTT algorithms on the GPU (Lohr 2009; Bédorf *et al.* 2012; Karras 2012; Goldfarb *et al.* 2013; Zola *et al.* 2014; Nam *et al.* 2016; Chitalu *et al.* 2018), hinting that the identification of leaf cells can be parallelised by adopting a PTT algorithm. In this work, a modified version of the PTT algorithm developed by Karras (2012) is adopted because it can be easily modified to work with *Z*-order indexing. The PTT algorithm traverses the tree of significant details as follows, explained by example considering the tree in Figure 2(a) without loss of generality.

Figure 7(a) shows the tree after enforcing *Z*-order indexing of the hierarchy of grids, with different traversal paths highlighted in cyan, magenta, yellow and grey. The PTT algorithm starts by launching as many threads as there are cells on the finest grid (2^{2} × 2^{2} = 16). Each thread tries to reach a target cell on the finest grid by traversing progressively finer cells. A thread is denoted by *t*_{m}, where *m* is the thread index (here *m* = 0, 1, …, 15). The target cell of *t*_{m} is the cell on the finest grid whose Morton code equals the thread index *m*; e.g., *t*_{3} has thread index 3 and tries to reach the cell on the finest grid with Morton code 3. A thread stops if it either reaches this target cell or encounters a cell with a detail that is not significant (analogous to identifying a leaf cell). The thread then records the *Z*-order index of the cell at which it stops. Figure 7(b) shows the traversal paths of each thread during the PTT in terms of the *Z*-order indices of the cells they traverse. These paths show that warp divergence is greatly minimised because adjacent threads perform similar traversals. For example, *t*_{0} to *t*_{3} are adjacent threads and they follow the same cyan path in Figure 7(a). Similarly, *t*_{4} to *t*_{7} follow the magenta path, *t*_{8} to *t*_{11} follow the yellow path and *t*_{12} to *t*_{15} follow the grey path. Figure 7(c) shows the *Z*-order indices recorded by each thread after the PTT is complete. These *Z*-order indices correspond to the indices of leaf cells.

```
1  algorithm PARALLEL_TREE_TRAVERSAL
2    for each cell in finest grid do in parallel
3      start at coarsest cell
4      while try to reach finer cell do
5        if detail is not significant then
6          record Z-order index of cell
7          stop traversing
8        else
9          if reached finest cell then
10           record Z-order index of cell
11           stop traversing
12         else
13           try to reach finer cell
14         end if
15       end if
16     end while
17   end for
18 end algorithm
```


**Algorithm 5:** Pseudocode for PTT of the tree of significant details. The PTT features an iterative procedure (lines 4–16) instead of the recursive procedure in the DFT in Algorithm 2.
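A single thread's descent in Algorithm 5 can be sketched serially as follows, assuming a hypothetical level-offset layout offset(n) = (4^n − 1)/3 with grids stored level by level in *Z*-order (an illustration, not necessarily the paper's exact indexing). Each thread peels off two bits of its target Morton code per level to pick the child to descend into, and stops at the first cell whose detail is not significant:

```cpp
#include <cstdint>
#include <vector>

// One PTT thread's descent (serial sketch of Algorithm 5). `significant`
// flags each cell in the hierarchy, indexed level by level in Z-order with a
// hypothetical offset(n) = (4^n - 1)/3 layout. The thread aims at the
// finest-grid cell with Morton code `target` and returns the hierarchy index
// of the cell at which it stops (a leaf cell).
int64_t traverse(int64_t target, int L, const std::vector<bool>& significant) {
    auto offset = [](int n) { return ((int64_t(1) << (2 * n)) - 1) / 3; };
    int64_t idx = 0;                        // start at the coarsest cell
    for (int n = 0; n < L; ++n) {
        if (!significant[idx]) return idx;  // leaf: detail not significant
        // Morton code, at level n + 1, of the child containing the target.
        const int64_t child = target >> (2 * (L - n - 1));
        idx = offset(n + 1) + child;        // descend one level
    }
    return idx;                             // reached the finest grid
}
```

For example, with *L* = 2 and only the coarsest cell and its first child flagged as significant, threads targeting Morton codes 0–3 reach the finest grid while the remaining threads stop at the three coarse leaf cells, so adjacent threads follow identical descent paths.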

#### Neighbour finding and FV1 update

As seen in Figure 7(c), several threads (e.g., *t*_{0} to *t*_{3} and *t*_{12} to *t*_{15}) finish at the same leaf cell. These duplicates are used to record the *Z*-order indices of the neighbour cells (which is necessary for the FV1 update) in parallel (line 6 of Algorithm 3) by making each thread in the grid in Figure 7(c) look left, right, up and down. The *Z*-order indices of the leaf cells and their neighbours are stored in memory. Figure 8(a) shows how the *Z*-order indices of the leaf cells and their neighbours are stored in five contiguous arrays. Duplicate groups of indices are removed, as indicated by the double black lines, via so-called parallel stream compaction using the CUB library (Merrill 2022). Figure 8(b) shows the *Z*-order indices of the leaf cells and their neighbours without any duplicates, stored in five arrays. The leaf cells make up the non-uniform grid.
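Because adjacent threads that stop at the same leaf cell record identical indices, the duplicates form contiguous runs in the recorded arrays, so compaction amounts to dropping adjacent repeats. GPU-HWFV1 does this with CUB's parallel stream compaction; a serial stand-in using `std::unique` illustrates the effect:

```cpp
#include <algorithm>
#include <vector>

// Serial stand-in for the parallel stream compaction: duplicate leaf indices
// recorded by adjacent threads form contiguous runs, so removing them reduces
// to erasing adjacent repeats while preserving order.
std::vector<int> compact(std::vector<int> indices) {
    indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
    return indices;
}
```

For instance, the recorded run {2, 2, 2, 2, 5, 6, 7, 8} compacts to {2, 5, 6, 7, 8}, leaving one entry per leaf cell.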

The next step is to parallelise the FV1 update on the leaf cells making up the non-uniform grid (line 7 of Algorithm 3), which is relatively simple. The parallel FV1 update launches one thread per leaf cell. In Figure 8(b), there are ten leaf cells, so ten threads *t*_{0} to *t*_{9} are launched. Each thread uses the arrays of leaf cell and neighbour *Z*-order indices to retrieve *S*_{c}, *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south} from within the hierarchy. For example, *t*_{0} uses the first column of indices (red box in Figure 8(b)), *t*_{1} the next column and so on. Since each thread *t*_{0} to *t*_{9} retrieves *S*_{c}, *S*_{west}, *S*_{east}, *S*_{north} and *S*_{south} for its leaf cell, **L**_{c} can be computed to perform the FV1 update in parallel for all leaf cells. After the FV1 update, incrementing the current simulation time by the timestep (line 8 of Algorithm 3) is trivial. The new minimum timestep based on the CFL condition (line 9 of Algorithm 3) is computed by performing a so-called parallel minimum reduction using the CUB library (Merrill 2022). Hence, the steps involved in the CPU-HWFV1 model (lines 4–9 of Algorithm 3) are all parallelised and GPU-HWFV1 is obtained.

## NUMERICAL RESULTS AND DISCUSSION

This section will compare the proposed GPU-HWFV1 model against two existing and validated shallow water models, namely, the reference uniform-grid FV1 counterpart parallelised on the GPU, GPU-FV1, available in the open-source LISFLOOD-FP 8.0 flood modelling package (Shaw *et al.* 2021), and the sequential CPU predecessor (Kesserwani & Sharifian 2020), CPU-HWFV1. The comparisons will be performed for five test cases (see Table 1) that explore a range of topographies and flow dynamics. The primary aim of the comparisons will be to quantify GPU-HWFV1's runtime performance relative to GPU-FV1 and CPU-HWFV1. A secondary aim is to verify that the proposed GPU-HWFV1 model is valid, which is done by checking that GPU-HWFV1 preserves a similar level of fidelity as GPU-FV1 at the finest resolution accessible to GPU-HWFV1.

| Test name | Test type | Reason for use | Previously used in |
|---|---|---|---|
| Quiescent flow over irregular topographies with different steepness; dam-break flow over realistic terrain with friction effects (Section 3.1) | Synthetic | Verifying fidelity | (Song et al. 2011; Huang et al. 2013; Kesserwani et al. 2018; Kesserwani & Sharifian 2020; Shirvani et al. 2021) |
| Circular 2D dam-break flow (Section 3.2) | Synthetic | Assessing runtime performance | (Wang et al. 2011; Kesserwani & Sharifian 2020) |
| Pseudo-2D dam-break flow (Section 3.2) | Synthetic | Assessing runtime performance | (Kesserwani et al. 2019; Kesserwani & Sharifian 2020) |
| Dam-break wave interaction with an urban district (Section 3.3) | Experimental | Assessing runtime performance | (Jeong et al. 2012; Caviedes-Voullième et al. 2020; Kesserwani & Sharifian 2020) |
| Tsunami wave propagation over a complex beach (Section 3.3) | Experimental | Assessing runtime performance | (Hou et al. 2015; Arpaia & Ricchiuto 2018; Caviedes-Voullième et al. 2020; Kesserwani & Sharifian 2020) |


The first test case (Section 3.1) will focus solely on verifying GPU-HWFV1's fidelity under idealised scenarios where a benchmark solution is available. The second and third test cases (Section 3.2) will then mainly focus on assessing GPU-HWFV1's runtime performance against CPU-HWFV1 and GPU-FV1 in more complex cases considering synthetic dam-break flows over flat terrain. As there is no terrain data to consider in these scenarios, the fidelity of GPU-HWFV1 will be assessed by verifying that its results match those of CPU-HWFV1 and GPU-FV1. This allows a systematic analysis of the runtime performance in relation to the sensitivity to triggering grid refinement (*ε*), the depth in grid resolution (*L*) and the flow type (vigorous or smooth). Therefore, simulations are run by pairing different values of *ε* and *L*, with *ε* = {10^{−4}, 10^{−3}, 10^{−2}} to cover the recommended ranges for maintaining a fair balance between predictive accuracy and runtime performance (Kesserwani *et al.* 2019; Kesserwani & Sharifian 2020), and *L* = {8, 9, 10, 11} as no gains in runtime performance were identified for *L* ≤ 7 and using *L* ≥ 12 exceeded the memory capacity of the GPU card used (RTX 2070). The effects of these parameters and of the flow type on the wavelet-based grid adaptation of the HWFV1 models are summarised in Table 2.
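For context on the *L* range, the finest grid of the hierarchy holds 2^{L} × 2^{L} cells, so the cell count quadruples with every level. The sketch below (with an assumed per-cell byte count, used for rough scaling only) illustrates the growth that makes deep refinement levels memory-hungry:

```python
# Illustrative sketch of why deepening the maximum refinement level L is
# memory-hungry: the finest grid holds 2^L x 2^L cells, a fourfold growth
# per level. The per-cell byte count below (four 8-byte variables, e.g.
# h, qx, qy, z) is an assumption used only for rough scaling; the full MRA
# hierarchy stores many more arrays across all levels, so real usage is
# considerably larger.
BYTES_PER_CELL = 4 * 8  # hypothetical

def finest_grid_cells(L):
    """Number of cells on the finest grid of the hierarchy."""
    return (2 ** L) ** 2

for L in (8, 9, 10, 11, 12):
    mib = finest_grid_cells(L) * BYTES_PER_CELL / 2 ** 20
    print(f"L = {L:2d}: {finest_grid_cells(L):>10,} cells, ~{mib:8.1f} MiB")
```

Each increment of *L* quadruples both the cell count and, roughly, the per-timestep work, which is consistent with the memory limit reported for *L* ≥ 12 on the card used.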

| Aspect | Description | Effect on finest grid resolution |
|---|---|---|
| L | Controls the finest accessible grid resolution | Deeper with higher L |
| ε | Controls how far the finest grid resolution is accessed | More accessible with smaller ε |
| Flow | Vigorous (with discontinuities) to smooth (up to flat) | Triggered often for vigorous flows |


The expectations on runtime performance established from the synthetic test cases will finally be explored in the fourth and fifth test cases (Section 3.3), which involve realistic topographies represented by digital elevation models (DEMs). Running the HWFV1 models in the presence of a DEM means that the maximum refinement level *L* must be set to accommodate the DEM resolution, and no grid coarsening is allowed beyond what the MRA of the DEM suggests.

### Verification of fidelity

| Hump profile | h + z (m) | Wetting conditions | Reference |
|---|---|---|---|
| Smooth | 0.875 | Dry around the highest hump, critical (h = 0 m) over the two small humps | (Song et al. 2011; Huang et al. 2013; Shirvani et al. 2021) |
| Steeper | 1.78 | Dry around the highest hump, critical (h = 0 m) at the peak of the medium hump, wet above the shortest hump (h > 0 m) | (Kesserwani et al. 2018) |
| Rectangular | 1.95 | Dry around the highest hump, critical (h = 0 m) at the peak of the medium hump, wet above the shortest hump (h > 0 m) | (Kesserwani et al. 2018) |


GPU-HWFV1 simulations are run up to 100 s with a maximum refinement level *L* = 8 and an error threshold *ε* = 10^{−3} (requiring around 3,000 timesteps to complete) while measuring the discharges. For GPU-HWFV1 to be deemed well-balanced, the free-surface elevation should stay undisturbed during the simulation; hence, the discharges should not deviate from zero beyond machine precision. The bottom panels of Figure 9 show the time histories of the maximum discharge errors (i.e., the maximum deviation from zero) for the three hump profiles. The errors grow with increased irregularity in the hump profile, but nonetheless remain bounded, as also observed for the CPU model counterparts (Kesserwani & Sharifian 2020). This demonstrates that GPU-HWFV1 is well-balanced irrespective of the steepness of the bed slope and the presence of wet–dry zones and fronts in the domain area.
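The well-balancedness check described above amounts to monitoring the maximum absolute discharge over all cells at each output time and verifying that it stays within a few machine epsilons of zero. A minimal sketch (with hypothetical discharge arrays) is:

```python
# Sketch of the well-balancedness check: for a quiescent flow over uneven
# topography, the discharges should never deviate from zero beyond round-off.
# The qx, qy arrays below are hypothetical per-cell discharges, not model
# output from the paper.

def max_discharge_error(qx, qy):
    """Maximum deviation of the discharges from zero over all cells,
    used to confirm the quiescent free surface stays undisturbed."""
    return max(max(abs(v) for v in qx), max(abs(v) for v in qy))

# Hypothetical per-cell discharges after a quiescent-flow step: round-off only.
qx = [0.0, 1.2e-16, -3.1e-16]
qy = [0.0, -2.2e-16, 4.4e-16]
err = max_discharge_error(qx, qy)
assert err < 10 * 2.22e-16  # bounded by a few machine epsilons
```

A model failing this test would show the error drifting upward in time rather than staying bounded.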

A dam-break flow simulation with friction (*n*_{M} = 0.018 m^{1/3}/s) is next considered for the smooth hump profile (top left panel, Figure 9). The initial dam-break flow conditions assume a water body of *h* = 1.875 m held by an imaginary dam located at *x* = 16 m with zero discharges. Using the same choice of *ε* and *L*, a GPU-HWFV1 simulation is run up to 12 s. A GPU-FV1 simulation on the finest uniform grid accessible to GPU-HWFV1 is also performed to allow for like-for-like comparisons of flood depth profiles at output times reported in previous studies (Song *et al.* 2011; Shirvani *et al.* 2021). Figure 10 includes the 2D contour maps of the flood depths predicted by GPU-HWFV1 (left panel) compared to those predicted by GPU-FV1 (right panel) at 0, 6 and 12 s. At 0 s (top panel), both models are seen to start from the same flood depth profile. At 6 s (middle panel), both models predict that the small humps are completely submerged and that the dam-break wave has reached the large hump, and the *L*^{1} error difference between the depths predicted by GPU-HWFV1 and GPU-FV1 is 4.6 × 10^{−4}. Similar wave patterns surround the large hump by 12 s (bottom panel), where the *L*^{1} error is 9.2 × 10^{−4}. In all the predictions, GPU-HWFV1 shows symmetrical flood extent profiles that are similar to those reproduced by GPU-FV1 and to other hydrodynamic profiles reported in previous works (Song *et al.* 2011; Shirvani *et al.* 2021). This indicates that the GPU-HWFV1 implementation preserves the fidelity of well-established models used for real-world applications.

### Assessing runtime performance for synthetic test cases

#### Circular 2D dam-break flow

This test case considers a circular 2D dam-break flow in a square domain area (Toro 2001). Initially, the water depth inside the cylindrical dam is 2.5 m, separating it from a water depth of 0.5 m elsewhere. The dam-break flow occurs over a frictionless and flat terrain, resulting in a shock moving radially outwards and a rarefaction wave moving radially inwards, which eventually collapses to form a secondary shock. The test is first used to further verify GPU-HWFV1 using the same choice of *ε* and *L* as in Section 3.1 and by comparing its simulation outputs to those of CPU-HWFV1 and GPU-FV1. As in Kesserwani & Sharifian (2020), simulations are run up to *t* = 3.5 s for GPU-HWFV1, CPU-HWFV1 and GPU-FV1. Figure 11 shows the water depth centrelines predicted by the three models. GPU-HWFV1 predicts water depths that are identical to those predicted by CPU-HWFV1, GPU-FV1 and the benchmark solution. The benchmark solution was produced using the FV1 numerical solution of the 1D radial form of the 2D SWE on 256 × 256 cells, following Toro (2001).

GPU-HWFV1 and CPU-HWFV1 simulations were then run for all pairings of {*ε*, *L*}, and their runtimes were recorded to produce the speed-up ratios of GPU-HWFV1 relative to CPU-HWFV1 and GPU-FV1, respectively. Figure 12 contains the plots of the speed-up ratios with increasing maximum refinement level *L*, relative to CPU-HWFV1 in the left panel and to GPU-FV1 in the right panel. The black lines indicate the average speed-up ratios obtained for the three error thresholds and the dash-dotted lines indicate the breakeven point above which GPU-HWFV1 demonstrates speed-up (these indicators are also used in the subsequent figures).

GPU-HWFV1 is found to be 5× to 46× faster than CPU-HWFV1. This speed-up grows with increasing *L* and decreasing *ε*, suggesting that wavelet-based grid adaptation is much more efficient when parallelised on the GPU, in particular as the maximum refinement level is deepened and the sensitivity to refine resolution is increased. Compared to the runtime performance of GPU-FV1, GPU-HWFV1 is not faster in this test (right panel of Figure 12) until *L* ≥ 9 for all the *ε* values, reaching a maximum speed-up of 3× for the largest *ε* = 10^{−2}, and around 2× for the smaller *ε* = 10^{−3} and 10^{−4}. This means that GPU-HWFV1, despite the overhead costs of the MRA process, can still compete with the speed of a fine uniform-grid GPU-FV1 simulation, even for a vigorous flow that causes overrefinement on the adaptive grid. Namely, GPU-HWFV1 is likely to be faster than GPU-FV1 the deeper the grid resolution (which would lead to an excessively fine uniform grid for GPU-FV1) and the lower the sensitivity for triggering grid refinement. Next, a transient analysis of the speed-ups is performed in a longer simulation that sees a gradual change in the flow from vigorous to very smooth.

#### Pseudo-2D dam-break flow

This 1D dam-break flow test case has conventionally been used to verify shallow water models over a short simulation time (2.5 s) involving transient shock and rarefaction wave propagation in two opposite directions. It was used recently over a much longer simulation time (40 s) to assess the speed-up of CPU-based adaptive grid models over their uniform-grid counterparts by considering a flow with a gradual transition from vigorous to smooth (Kesserwani *et al.* 2019; Kesserwani & Sharifian 2020).

The dam, located at *x* = 10 m, initially separates an upstream water depth of 6 m from a downstream water depth of 2 m. After the dam removal at *t* = 0 s, the shock and rarefaction waves remain present in the domain area up to 2.5 s. After 3 s, the shock wave has left the domain area from the downstream end and the flow dynamics are only driven by the presence of the rarefaction wave until 10 s, after which it exits from the upstream end. Therefore, after 10 s, the flow dissipates gradually with increased smoothness until 40 s. The models are first verified by running simulations up to 2.5 s using the same choice of *ε* and *L* as in Section 3.1 (*L* = 8 and *ε* = 10^{−3}) for the HWFV1-based models and the finest uniform grid for the GPU-FV1 model. Figure 13 shows the plots of the water depth centrelines predicted by the models, all showing good agreement with the exact solution (Delestre *et al.* 2013).

Simulations were then run for all pairings of {*ε*, *L*} alongside GPU-FV1 simulations on the finest uniform grid. Time histories of the runtimes were recorded throughout the 40 s simulations, during which the flow transitions from vigorous to very smooth. Time series of the speed-up ratios of GPU-HWFV1 over CPU-HWFV1 and GPU-FV1 for the different values of *L* and *ε* are plotted in Figure 14.

Looking at the speed-up over CPU-HWFV1 (Figure 14, top panels), GPU-HWFV1 exceeds the breakeven point for all settings except the largest *ε* = 10^{−2} combined with the lowest maximum refinement level *L* = 8 up to 2.5 s (Figure 14, top left panel). This means that CPU-HWFV1 only remained as fast as GPU-HWFV1 while the flow included the shock and rarefaction waves and for the setting with the least depth in resolution refinement and the least sensitivity to trigger grid refinement. However, even at *ε* = 10^{−2}, up to 8× speed-up is noted after 3 s, when the shock wave is no longer present. With any other combination of {*ε*, *L*}, there is a significant demonstration of speed-up: with reduced *ε* and increased *L*, GPU-HWFV1 becomes increasingly faster than CPU-HWFV1, reaching, for the highest *L* and smallest *ε*, an average speed-up of 68× throughout the simulation and a maximum speed-up of 88× at 2.5 s when flow discontinuities were still present. This confirms the benefit of parallelising the wavelet-based grid adaptation on the GPU as an alternative to the CPU version for general-purpose modelling involving all types of flow.

In terms of speed-ups over GPU-FV1 (Figure 14, bottom panels), at *ε* = 10^{−2}, GPU-HWFV1 demonstrates a maximum speed-up of 25× when *L* = 11, though it could only outrun GPU-FV1 for *L* ≥ 9, beyond which GPU-HWFV1 increasingly shows speed-up with increased smoothening of the flow. At *ε* = 10^{−3}, GPU-HWFV1's maximum speed-up reduces to 12× and it outruns GPU-FV1 for *L* ≥ 10, whereas for *L* ≤ 9, it begins to demonstrate speed-up only after 10 s when the flow starts smoothening. This suggests that less speed-up over GPU-FV1 should be expected with increased sensitivity for triggering grid refinement in GPU-HWFV1 and reduced depth of the finest resolution. The same can be noted for *ε* = 10^{−4}, but here GPU-HWFV1 starts to be faster than GPU-FV1 for *L* ≥ 9 and the overall maximum speed-up reduces to 8× (reached again after 10 s when the flow is smoothening). These analyses indicate that an adaptive-grid GPU-HWFV1 simulation is likely to be more efficient than a uniform-grid GPU-FV1 simulation for very fine resolution modelling of gradual to smooth flows, with *L* ≥ 9, and when the sensitivity to grid refinement is not maximal, with *ε* > 10^{−4}.

### Further investigations into runtime performance: realistic flow simulations

#### Dam-break wave interaction with an urban district

This test case is selected (as in Caviedes-Voullième *et al.* 2020) as it has a set of spatial experimental data for the water depth and the velocities (Soares-Frazão & Zech 2008). It involves a dam-break wave propagating in a 36 m × 3.6 m smooth channel (*n*_{M} = 0.01) that includes a wall barrier with a gate initially separating an upstream water body of 0.4 m from a water depth of 0.011 m (Figure 15). Downstream of the gate, there are twenty-five 0.3 m × 0.3 m square blocks, with 0.1 m gaps. The ground height of the wall barrier and the square blocks is 2 m. Based on this height and the dimensions reported in Soares-Frazão & Zech (2008), a DEM file was built at a resolution of 0.02 m × 0.02 m, made of 324,000 cells. The DEM includes the two rectangular blocks forming the wall barrier linked to the gate and the 25 square blocks. These discontinuous blocks are included in the grid and are accounted for as part of the well-balanced topography integration.

As the gate opens abruptly, a dam-break wave forms and flows swiftly to collide with the blocks. The blocks almost entirely impede the shock, creating a backwater zone upstream, while the unimpeded flow cascades through the gaps to form a hydraulic jump downstream as the simulation progresses (e.g., Figure 24 in Soares-Frazão & Zech (2008)).
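The DEM construction described above (a 36 m × 3.6 m channel rasterised at 0.02 m, with block footprints burnt in at a ground height of 2 m) can be sketched as follows; the block position used is hypothetical, and only the grid dimensions and cell count follow the text:

```python
# Sketch of building the DEM grid for the urban-district test: a 36 m x 3.6 m
# channel at 0.02 m resolution. The wall barrier and the 25 square blocks
# would be burnt in at a ground height of 2 m; their exact positions are not
# reproduced here (see Soares-Frazão & Zech 2008), so the block below is
# placed at a hypothetical corner (x0, y0).
DX = 0.02                       # DEM resolution (m)
NX = round(36.0 / DX)           # cells along the channel
NY = round(3.6 / DX)            # cells across the channel

dem = [[0.0] * NX for _ in range(NY)]   # flat bed elevation (m)

# One hypothetical 0.3 m x 0.3 m block with its lower-left corner at (x0, y0):
x0, y0 = 9.0, 1.0
for j in range(round(y0 / DX), round((y0 + 0.3) / DX)):
    for i in range(round(x0 / DX), round((x0 + 0.3) / DX)):
        dem[j][i] = 2.0         # block ground height (m)

assert NX * NY == 324_000       # matches the cell count reported in the text
```

Representing the blocks directly in the bed elevation is what lets the well-balanced topography integration account for them without special boundary treatment.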

GPU-HWFV1 simulations are run with *L* = 11 for two values of *ε* = {10^{−4}, 10^{−3}}, alongside GPU-FV1 on a uniform grid at the finest resolution accessible to GPU-HWFV1. Figure 16 shows the water depth (left panel) and velocity (right panel) profiles along *y* = 0.2 m at 6 s predicted by GPU-HWFV1 and GPU-FV1, as well as the experimental profiles. All the model-predicted profiles are within the expected range of agreement with the experimental profiles (Caviedes-Voullième *et al.* 2020; Kesserwani & Sharifian 2020). Compared to the predictions made by GPU-FV1, those made by GPU-HWFV1 with *ε* = 10^{−4} are closer than with *ε* = 10^{−3}, though the difference is not significant.

GPU-HWFV1 demonstrates sustained speed-up over CPU-HWFV1 with both *ε* = 10^{−3} and 10^{−4} throughout the 10 s simulation. Higher levels of speed-up are demonstrated with larger *ε*, which is in line with the findings in Section 3.2. GPU-HWFV1 is also faster than GPU-FV1 in this test, on average ∼2.4× faster with both *ε* = 10^{−3} and 10^{−4}. This can be expected for a run with *L* = 11 accommodating the very fine resolution of the DEM. Up to 2 s, the run with GPU-HWFV1 at *ε* = 10^{−3} demonstrates higher levels of speed-up than at *ε* = 10^{−4}, which is in line with the observations made in Section 3.2. In contrast, after 2 s, GPU-HWFV1 at *ε* = 10^{−3} reduces its level of speed-up to become lower than that of GPU-HWFV1 at *ε* = 10^{−4}. This could be due to GPU-HWFV1's higher sensitivity to grid refinement around the rectangular and square topographic blocks. Overall, GPU-HWFV1, besides being more performant than CPU-HWFV1, also remains faster than GPU-FV1 for this test. Supported by the analysis in Section 3.2, this can be expected given the maximised depth in the resolution level (*L* = 11) needed to accommodate the domain size at the fine resolution of the DEM.

#### Tsunami wave propagation over a complex beach

The test case considers a 1:400 scaled replica of the 1993 Okushiri tsunami (Matsuyama & Tanaka 2001). It has been used in other works for model verification and for runtime performance assessments of wavelet-based adaptive models versus their uniform counterparts for simulations on the CPU (Caviedes-Voullième *et al.* 2020; Kesserwani & Sharifian 2020). It is here used to assess the runtime performance of the adaptive GPU-HWFV1 model versus CPU-HWFV1 and GPU-FV1 models.

The domain is described by a DEM with friction (*n*_{M} = 0.01 m^{1/3}/s) that has a uniform resolution of 0.014 m × 0.014 m and is made of 163,840 cells (i.e., around half the number of cells of the previous test case). The domain area has closed boundaries except for the western boundary, through which a tsunami-generated inflow (Kesserwani & Sharifian 2020) enters and eventually reaches the coastal area to the east, before which there is a gauge point (*x* = 4.521 m, *y* = 1.696 m) hit by the tsunami-generated flood wave. Experimental time histories of the free-surface water elevation are available at this point and will be used to verify the GPU-HWFV1 and GPU-FV1 models' ability to achieve a 22.5 s simulation. The left panel of Figure 18 displays a view of the domain area, including the gauge point location, marked by a red dot, and the coastal area at the eastern end (yellow colour). Given the smaller size of the domain area, fitting the DEM resolution to the domain size requires using *L* = 9 in this test to run the GPU-HWFV1 simulations with *ε* = {10^{−3}, 10^{−4}}. The GPU-FV1 simulation was run on a uniform grid at the DEM resolution. The right panel of Figure 18 contains the time histories of the free-surface water elevations predicted by the models, which are in good agreement with the experimental time histories. All the models predict the expected gradual retraction in the free-surface elevation between 12 and 15 s, followed by a sharp increase that peaks at around 17 s. GPU-HWFV1 at *ε* = 10^{−4} leads to predictions that are visually indistinguishable from those of GPU-FV1. With *ε* = 10^{−3}, the predictions remain comparable, subject to small, localised discrepancies at times where there is a sharp flow transition, such as at around 18 and 21 s.

Speed-up ratios of GPU-HWFV1 over CPU-HWFV1 and GPU-FV1 were again recorded with *ε* = 10^{−3} and 10^{−4}. Compared to the previous test case, the terrain is complex all over the domain and the grid cannot be coarsened much, leading to an overrefined grid that is further refined by flow disturbances during the HWFV1 simulations. In such a case, GPU-HWFV1 demonstrates remarkable speed-up over CPU-HWFV1, which increases as *ε* is decreased from 10^{−3} to 10^{−4}, in line with the observations in Section 3.2, but the speed-up doubles in this test.

Compared to GPU-FV1, GPU-HWFV1 only demonstrates speed-up with *ε* = 10^{−3} (around 1.25×) and is slightly slower to run with *ε* = 10^{−4} where its speed-up falls below the breakeven line. This implies that it is worthwhile to perform wavelet-based grid adaptation in this test for *ε* = 10^{−3} but not for *ε* = 10^{−4}, at which the grid is overrefined (due to the terrain and flow disturbances). Overall, GPU-HWFV1 is shown to be generally faster than CPU-HWFV1 but could not outrun GPU-FV1 for *ε* = 10^{−4} (which leads to an overrefined grid) and for *L* = 9 (to accommodate a relatively small domain). Nonetheless, for *ε* = 10^{−3}, GPU-HWFV1 remains a viable choice over GPU-FV1.

## CONCLUSION

This paper presented an adaptive first-order finite volume (FV1) shallow water model with a dynamic-in-time wavelet-based grid adaptation process driven by the MRA of the HW that is fully parallelised on the GPU, called GPU-HWFV1. The MRA involves a nested hierarchy of grids, where the coarsest grid in the hierarchy is made up of a single cell while the finest grid is made up of 2^{L} × 2^{L} cells, where *L* is a user-specified maximum refinement level. In the ‘encoding’ (coarsening) process, a user-specified error threshold *ε* is needed to flag significant ‘details’ that decide which cells to include in the non-uniform grid. The encoding process results in a tree-like structure of significant details that is traversed in the ‘decoding’ (refinement) process by applying a sequential depth-first traversal (DFT) algorithm to identify the cells making up the non-uniform grid. Encoding and decoding have been parallelised on the GPU by adopting the indexing of a *Z*-order space-filling curve to ensure coalesced memory access. Meanwhile, the DFT algorithm has been replaced with a parallel tree traversal (PTT) algorithm to traverse the tree of significant details on the GPU with minimal warp divergence. The PTT algorithm also allows the neighbour cells of each cell in the non-uniform grid to be identified easily for performing the FV1 update in parallel.
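As context for the *Z*-order indexing, the mapping from a cell's 2D position to its index on a level-*L* grid amounts to interleaving the bits of its column and row numbers. A minimal Python illustration (a sketch of the standard Morton-code construction, not the paper's GPU implementation, which would use bitwise intrinsics) is:

```python
def morton_encode(i, j, L):
    """Interleave the bits of column i and row j (each < 2**L) to obtain the
    Z-order (Morton) index of a cell on the level-L grid, so that spatially
    nearby cells map to nearby memory locations (coalesced access)."""
    z = 0
    for b in range(L):
        z |= ((i >> b) & 1) << (2 * b)       # bits of i go to even positions
        z |= ((j >> b) & 1) << (2 * b + 1)   # bits of j go to odd positions
    return z

# The four children of a parent cell are contiguous in Z-order:
assert [morton_encode(i, j, 2) for j in (0, 1) for i in (0, 1)] == [0, 1, 2, 3]
assert morton_encode(3, 3, 2) == 15  # last cell of a 4 x 4 grid
```

This contiguity of sibling cells is what lets threads working on the same part of the hierarchy read adjacent memory locations.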

GPU-HWFV1 was first verified and then its runtime performance was assessed against its sequential predecessor running on the central processing unit (CPU-HWFV1) as well as the reference uniform-grid FV1 counterpart parallelised on the GPU (GPU-FV1), run on the finest grid accessible to the HWFV1 models. The verification was performed using *ε* = 10^{−3} (recommended for flood modelling) and *L* = 8 for four synthetic test cases involving motionless, vigorous, gradual and smooth flows. A systematic runtime performance assessment was performed for two synthetic test cases involving dam-break flow, where lower and higher bounds for *ε* = {10^{−2}, 10^{−3}, 10^{−4}} were also considered alongside an increase in the maximum refinement level *L* = {8, 9, 10, 11}. Verification and runtime performance assessments were finally performed for realistic test cases with DEMs, for which the value of *L* was chosen to match the DEM resolution, while GPU-HWFV1 was run with *ε* = {10^{−3}, 10^{−4}}.

GPU-HWFV1's overall performance for all the test cases provided strong evidence that it delivers a similar level of fidelity as GPU-FV1 in replicating the realistic flows including the presence of uneven topographies, wet–dry fronts and friction effects. In terms of runtime performance over CPU-HWFV1, GPU-HWFV1 yielded significant speed-ups for all the test cases, ranging between 20× and 400×. Hence, this work offers compelling evidence for porting the GPU-parallelised wavelet-based grid adaptation process to FV1 models in other fields. From the systematic runtime performance assessment for the synthetic test cases, GPU-HWFV1 tends to demonstrate speed-up of around 1.1× to 30× over GPU-FV1 for *L* ≥ 9 and/or by avoiding the smallest *ε**=* 10^{−4}. For the realistic test cases, GPU-HWFV1 showed speed-up over GPU-FV1 for the test with *L* = 11 and for *ε**=* 10^{−3} for the test with *L* = 9. Hence, GPU-HWFV1 can be favoured to gain runtime performance over GPU-FV1 for shallow water modelling over real DEMs, namely with an increased fineness in the DEM resolution and an increased domain size.

## ACKNOWLEDGEMENTS

Alovya Ahmed Chowdhury and Georges Kesserwani were supported by the UK Engineering and Physical Sciences Research Council (EPSRC) grant EP/R007349/1. This work is part of the SEAMLESS-WAVE project (SoftwarE infrAstructure for Multi-purpose fLood modElling at variouS scaleS based on “WAVElets”).

## DATA AVAILABILITY STATEMENT

All relevant data are available from an online repository at https://dx.doi.org/10.5281/zenodo.8075133.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

In Equations (3a)–(4d), *H*^{0}, *H*^{1}, *G*^{0} and *G*^{1} are scalar coefficients obtained from the Haar wavelets; their derivation is available in previous works (Keinert 2003; Kesserwani & Sharifian 2020) and is outside the scope of this paper.

## REFERENCES
