## Abstract

A spillway discharging a high-speed flow is susceptible to cavitation damages. As a countermeasure, an aerator is often used to artificially entrain air into the flow. Its air demand is of relevance to cavitation reduction and requires accurate estimations. The main contribution of this study is to establish an embedded multi-gene genetic programming (EMGGP) model for improved prediction of air demand. It is an MGGP-based framework coupled with the gene expression programming acting as a pre-processing technique for input determination and the Pareto front serving as a post-processing measure for solution optimization. Experimental data from a spillway aerator are used to develop and validate the proposed technique. Its performance is statistically evaluated by the coefficient of determination (CD), Nash–Sutcliffe coefficient (NSC), root-mean-square error (RMSE) and mean absolute error (MAE). Satisfactory predictions are yielded with CD = 0.95, NSC = 0.94, RMSE = 0.17 m^{3}/s and MAE = 0.12 m^{3}/s. Compared with the best empirical formula, the EMGGP approach enhances the fitness (CD and NSC) by 23% and reduces the errors (RMSE and MAE) by 48%. It also exhibits higher prediction accuracy and a simpler expressional form than the genetic programming solution. This study provides a procedure for the establishment of parameter relationships for similar hydraulic issues.

## HIGHLIGHTS

An embedded multi-gene genetic programming is developed.

An explicit and simple correlation is proposed for air demand estimation.

Existing empirical models are assessed and recalibrated.

The proposed model improves the prediction accuracy significantly.

## INTRODUCTION

A spillway is an indispensable dam component, controlling the release of floods and preventing overtopping. In a high dam, the resulting flow velocity can be as high as 40–50 m/s. Even at medium heads, it amounts to 20–30 m/s (Kells & Smith 1991). Such a high-velocity flow would result in severe surface damage due to cavitation and affect the normal functions. Cavitation is often a result of surface irregularities in combination with the high velocity. Flow separations occurring near the boundary lead to a sharp local reduction in flow pressure, which could drop below the vapour pressure of water, forming small vapour-filled cavities. Transported by the flow, they collapse at high frequencies due to pressure recovery, generating forces up to 1,500 MPa (Lesleighter 1988). Fluctuating pressures acting on the boundary give rise to fatigue failure. Subsequently, the surface materials are taken away. Engineering measures should be taken to counteract its occurrence.

Countermeasures for cavitation reduction include: (a) improving construction quality and even out surface irregularities, (b) adopting high-strength materials and (c) introducing air into the flow by an aerator. The former two are feasible but costly. The latter is an effective and low-cost solution. Studies show that a 1–2% increase in air concentration could lead to a 5–7% reduction in cavitation erosion (Peterka 1953; Russell & Sheehan 1974). A typical chute aerator consists of an offset, a deflector and a groove (Figure 1). The deflector ramps up the approach flow away from the bottom, forming a plugging jet and an air cavity beneath. Connected to air shafts, the groove provides space for air passage. The offset prolongs the trajectory and enlarges the cavity. In the cavity, air gets entrained through the lower nappe surface and the shear action between the jet and the water recirculation at the end of the cavity (Chanson 1989).

Being effective and easy to construct, an aerator is commonly used in spillways (Figure 2). The aerator flow is characterized by strong air–water interactions. Chanson (1989) performs an earlier study of its entrainment and detrainment. A later study points out that the detrainment process occurs in the impact region (Figure 1), where the bottom pressure attains its maximum (Chanson 1994). With a focus on the aeration via the lower nappe, Bai *et al.* (2016) suggest that only a small portion of air in the cavity is entrained into the exiting flow. By means of fibre-optical instrumentation, Kramer *et al.* (2006) examine streamwise air–water features, setting up empirical models for the prediction of the bottom and average air concentration. In a parametric study of a chute aerator flow, Pfister & Hager (2010) suggest that the factors dominating the air transport downstream are chute slope *α*, deflector angle *θ* and aerator Froude number (Figure 1). Bai *et al.* (2019) experimentally explore the development of air bubbles alongside the chute, dissipation rate and breakup frequency. Kökpınar & Göğüş (2002) analyse the characteristics of aerated jet flows, with equations derived to predict the jet length and cavity sub-pressure.

The amount of air entrained into the flow is a proxy that indicates the aeration efficiency. In laboratory tests, Chanson (1990) shows that the air demand is mainly dependent on aerator geometry, flow discharge and cavity sub-pressure. After performing model tests on a bottom-inlet aerator at varying and *θ* values, Aydin (2018) proposes empirical formulae for the estimation of air entrainment. Based on prototype data, Lian *et al.* (2017) evaluate the existing empirical approaches for air demand estimation and establish a new prediction method. A numerical study by Aydin & Ozturk (2009) shows that computational fluid dynamics (CFD) modelling provides a reasonable prediction of air entrainment in an aerator, and the agreement is good with the field observations and empirical equations. Wei *et al.* (2020) employ a numerical model to simulate the air demand for a tunnel aerator, exhibiting an average deviation from the experiments by 25%. For air demand of low-level outlet works, Tullis & Larchar (2011) discuss the influence of gate geometry, gate opening, water head and tailwater submergence.

Apart from CFD, experimental and field tests, the machine learning technique is an advanced approach for various engineering applications (Taormina & Chau 2015; Gholami *et al.* 2016; Fotovatikhah *et al.* 2018; Mosavi *et al.* 2018; Fu *et al.* 2020; Sammen *et al.* 2020). Using the adaptive network-based fuzzy inference system, Baylar *et al.* (2007) simulate the aeration of a stepped cascade aerator. The comparison of its results with the experimental ones shows an insignificant difference. Baylar *et al.* (2008) successfully apply the same method to the air entrainment prediction of a weir. To better estimate the air discharge into a gated tunnel, Najafi *et al.* (2012) propose a combination of the Bayesian scheme with three fuzzy-based models. Compared to the Levenberg–Marquardt neural network and multiple linear regression (MLR), Zounemat-Kermani & Scholz (2013) suggest that the fuzzy rule-based model yields better air demand prediction for dam outlets. Overall, these models give acceptable estimations of air demand. However, these methods belong to the so-called black box network, in which the established relationships are implicit and provide no physical interpretations. These features inhibit their applications, as no explicit expressions can be derived for the use in a similar context. To overcome this weakness, genetic programming (GP) and its variants are an option.

The GP-based models are an evolutionary machine learning technique that allows extracting functional expressions. This unique feature enhances engineering applications and provides better physical interpretations (Giustolisi *et al.* 2007). It is applied to open channel flows and sediment transport (Ebtehaj *et al.* 2015; Hadi & Tombul 2018; Khozani *et al.* 2020). However, previous studies suggest that a standard GP-based model might fail to generate accurate results and that pre- or post-processing procedures are necessary for optimization (Ghorbani *et al.* 2018; Mehr 2018; Mehr & Safari 2020). To develop a simple, accurate and explicit prediction model for aerator air demand, this study proposes a hybrid scheme known as embedded multi-gene genetic programming (EMGGP). It is constructed by integrating the conventional MGGP model with the gene expression programming (GEP) and the Pareto front (PF). The GEP acts as a pre-processing measure for input selection (functions and variables), aiming to provide the model with informative information and help better interpret the system behaviours. The PF serves as a post-processing method for solution optimization, generating a simplified model that is favourable for practical application. The main novelty of this study is to establish a framework for improved air demand prediction. The effectiveness of the proposed method is established by comparing its accuracy with traditional empirical methods and GP models

## MODELS AND PERFORMANCE EVALUATION

The proposed EMGGP framework is a GP-based model. As a result, the theoretical background of the GP method is first introduced, after which the EMGGP details are presented. The effectiveness of both models is evaluated using statistical criteria.

### GP model

The GP model is a supervised learning algorithm that mimics the process of biological evolution. It adopts the fundamental concept of *survival of the fittest* (Khozani *et al.* 2020). A GP solution represents a tree-structured computer programme with different linking branches, root, inner and terminal nodes. Through evolution, it establishes a non-linear expression, whose form is not given or known in advance. Figure 3 illustrates a three-level tree-structured GP solution, representing the function of *x*_{1}*x*_{2} + cos(*x*_{2}) ÷ *x*_{3}. This programme consists of one root node (green), three inner nodes (yellow), four terminal nodes (pink) and three variables (*x*_{1}, *x*_{2} and *x*_{3}). They are connected by linking branches (mathematical operators).

The GP model starts with the construction of chromosomes that are best fitted with the given target. Then, they act as the parents, which create offspring using genetic operators, e.g. reproduction, crossover and mutation. In the reproduction, one offspring is generated by solely copying its parent, and no additional information is added. Crossover is an operation in which tree materials between two parents are exchanged. It begins with the selection of a crossover point, by which the parent is divided into two parts. The sub-tree structure is then exchanged with another one from a second parent, resulting in two new offspring (Figure 4(a)). Mutation produces a new offspring by replacing the sub-tree structure with a random one controlled by pre-defined operation parameters (Figure 4(b)). This evolution process continues until the chromosome reaches the desired fitness.

### EMGGP model

#### Model description

The proposed embedded model is an MGGP-based evolutionary scheme that is coupled with the GEP and PF. It consists of three phases, namely, input selection of variables and function sets by the GEP, simulation by the MGGP and solution optimization by the PF (Figure 5). The GEP, an advanced variant of the GP family, creates multiple low-depth GP blocks (Figure 3), known as sub-expression trees (sub-ETs). The best solution is established by connecting the sub-ETs with algebraic or Boolean functions (AND, OR and NOT). It differs fundamentally from the GP in the construction of the genes. In the GEP, the genetic operators are imposed on the tree-shaped genes, and the computer programmes are considered as linear strings with fixed lengths consisting of multiple genes. In the GP, the genes are expressed as parse trees with different sizes and shapes (Ferreira 2006). Due to the GEP's unique feature of creating a population of potential solutions (most data-driven models only develop a single one at each iteration), it is employed as a selection tool for informative input parameters and functions (Khozani *et al.* 2020). To this end, a number of best solutions are generated by the GEP, after which a sensitivity analysis is performed to determine the optimal inputs by computing their frequency in the solutions (first phase). The effective ones are then entered into the MGGP.

*T*is the weighted sum of several GP trees

*G*plus a bias term

_{i}*b*.where

*c*is the gene coefficient, determined by the least-squares method. The combination of classic linear regression and non-linear genes enables the pseudo-linear approach to capture the behaviours of a complex physical system. However, this may also lead to a complex solution that contributes little to the model performance (e.g. horizontal bloat) and is impractical for applications.

_{i}In the third phase, a trade-off between model accuracy and simplicity is thus performed using the PF, a multi-objective optimization method. It is based on the concept of *dominance* with the PF at any instant consisting of *non-dominated* solutions (Smits & Kotanchek 2005). Given a population of prediction models, the PF is composed of the members that outperform others in the pre-defined criteria in terms of accuracy and complexity. As a result, one can focus on the identified members and easily make a decision according to the requirements.

#### Model set-up

The EMGGP model starts, using the GEP, with the generation of a population of best solutions. To develop an accurate and explicit model, a number of initial function members are selected (e.g. arithmetic operations and trigonometric functions). The maximum number of sub-ETs and tree depth is also defined to reduce the risks of overfitting. The GEP evolved potential solutions are then employed to determine the effects of the variables and functions. This step is achieved by computing their frequency, and the less informative ones are removed from the initial inputs. The number of candidate models is essential, as a large population may lead to complex secondary genes, while a small one may result in a loss of effective parameters and functions. In two relevant studies, Uyumaz *et al.* (2014) adopt the top 30 GP models and Khozani *et al.* (2020) use the top 20 and 40 GEP solutions.

In this study, a median value of 30 is selected for the number of best models. As inputs to evolve new models, the MGGP engine picks up the most informative features and functions from the top expressions. Since the goal is to develop an easy-to-use yet accurate prediction model, the number of genes is limited to four, each having a maximum depth of five, which are optimized based on the references (Eray *et al.* 2018; Mehr & Safari 2020). With the initial population size set at 300, the MGGP is run multiple times to reduce the risks of trapping into local minima and to achieve high generalization. After some trials, the crossover and mutation rates are adjusted to 0.84 and 0.14 to produce diverse gene materials. Table 1 summarizes the parameter set-ups in the MGGP engine.

Parameter . | Value . | Parameter . | Value . |
---|---|---|---|

Population size | 300 | Reproduction rate | 0.2 |

Max. generations | 150 | Crossover rate | 0.84 |

Tournament size | 25 | Mutation rate | 0.14 |

Elite fraction | 0.7 | Max. total nodes | Infinitive |

Max. genes | 4 | Objective function | RMSE |

Max. tree depth | 5 |

Parameter . | Value . | Parameter . | Value . |
---|---|---|---|

Population size | 300 | Reproduction rate | 0.2 |

Max. generations | 150 | Crossover rate | 0.84 |

Tournament size | 25 | Mutation rate | 0.14 |

Elite fraction | 0.7 | Max. total nodes | Infinitive |

Max. genes | 4 | Objective function | RMSE |

Max. tree depth | 5 |

### Evaluation metrics

*et al.*2007; Hadi & Tombul 2018; Mehr 2018; Khozani

*et al.*2020), defined as follows:where

*O*is the

_{i}*i*th observation,

*S*is the

_{i}*i*th simulation, is the mean of observations, is the mean of simulations and

*N*is the number of data.

## EXPERIMENTS AND ANALYSIS

Experimental results from published work are employed to verify the proposed method. The experimental set-up and data source are given in this section. Moreover, dimensional analysis is performed to select the input variables for data-driven models.

### Data source

To develop and validate the EMGGP approach, a reliable and representative database is a prerequisite. The data employed here are based on the flume experiments of a chute spillway aerator conducted by Chanson (1988). The physical aerator model is placed in a 360.0 cm long and 25.0 cm wide Perspex flume that slopes at *α* = 52.3° (Figure 1). It is connected to a water supply tank with an overflow weir, thus providing a constant head to the rig. The gross water head above the aerator is 900.0 cm. A thin-plane gate, installed 50.3 cm upstream of the groove, controls the approach flow to the aerator. The outflow at the flume's downstream end is un-regulated.

The offset between the flume floors up- and downstream is *t* = 3.0 cm. The groove length and height are *l* = 19.0 cm and *h* = 25.0 cm. Two aerator configurations, with and without a deflector, are studied. The deflector height and length are *s* = 3.0 cm and *L _{r}* = 30.0 cm. The air supply to the aerator is realised via a central vent in the bottom of the groove and two identical vents on its two sides. All the vents are circular. The diameters of the former and the latter are 14.0 and 8.0 cm, respectively.

The water flow rate *Q*_{w} is measured volumetrically using a calibrated 10 m^{3} pit. The airflow rate, denoted as *Q*_{a}, is, by a Pitot tube, determined via measurements of airflow velocity in the vents. The nappe under-pressure *p _{0}* is obtained by a micro-manometer of high sensitivity. The

*Q*

_{w}varies by adjusting the gate opening, resulting in an average velocity

*V*

_{0}= 3.0–14.0 m/s and F = 3.0–25.0 at the gate. From the experiments, significant results are available for the configuration with a deflector. As such a configuration is more relevant to engineering applications, its results are used to evaluate the performance of the proposed EMGGP model. The datasets with sealed air vents (

*Q*

_{a}= 0) are excluded in this study, leading to a total number of 545 datasets for model development. 70% of the data will be used for model calibration and 30% for validation. The statistical properties of experimental results are presented in Table 2.

. | Parameter . | Min. (m^{3}/s)
. | Max. (m^{3}/s)
. | Mean (m^{3}/s)
. | SD^{a} (m^{3}/s)
. |
---|---|---|---|---|---|

Training | Q_{w} | 0.028 | 0.107 | 0.069 | 0.018 |

Q_{a} | 0.001 | 0.175 | 0.059 | 0.049 | |

Testing | Q_{w} | 0.107 | 0.194 | 0.143 | 0.028 |

Q_{a} | 0.003 | 0.181 | 0.068 | 0.048 |

. | Parameter . | Min. (m^{3}/s)
. | Max. (m^{3}/s)
. | Mean (m^{3}/s)
. | SD^{a} (m^{3}/s)
. |
---|---|---|---|---|---|

Training | Q_{w} | 0.028 | 0.107 | 0.069 | 0.018 |

Q_{a} | 0.001 | 0.175 | 0.059 | 0.049 | |

Testing | Q_{w} | 0.107 | 0.194 | 0.143 | 0.028 |

Q_{a} | 0.003 | 0.181 | 0.068 | 0.048 |

^{a}Standard deviation.

### Theoretical analysis

*W*is the chute width,

*k*

_{s}is the chute surface roughness,

*V*is the approach flow velocity, Δ

*p*is the air cavity pressure drop (=

*P*

_{atm}–

*P*

_{a}),

*P*

_{atm}is the atmospheric pressure,

*P*

_{a}is the air cavity pressure,

*ρ*

_{w}is the water density,

*ρ*

_{a}is the air density,

*μ*is the dynamic viscosity of water,

*σ*is the surface tension of water,

*g*is the gravitational acceleration and

*F*

_{1}is the functional symbol. In light of the Π-theorem, select

*μ*,

*g*and

*d*as the dimensionally independent quantities. The remaining ones reduce to a dimensionless from the following equation:in which Π

*is the Π-theorem symbol. The equations represent a physical system independent of the choice of dimensionless quantities. To simplify the formulation, new non-dimensional groups are introduced by rearranging the original ones.where*

_{i}*β*is the dimensionless air discharge,

*φ*is the geometrical factor, is the Reynolds number, is the Weber number, is the Euler number,

*P*is the pressure gradient number and

*f*is the functional symbol. As , and exert negligible effects on

*Q*

_{a}(Chanson 1988), Equation (6) is rewritten as follows:where is the functional symbol. Consequently, these dimensionless parameters are considered as the dominant ones, which are used for model inputs.

## RESULTS

This part first examines the existing empirical approaches for air demand estimation. They are recalibrated and assessed using the experimental data. Then, the EMGGP model is established and validated. In the end, the performance of all methods (empirical and soft computing) is compared.

### Evaluation of existing correlations

The air demand of an aerator is essential for aerator design. For the prediction of *β*, several correlations are developed in previous studies and they are summarized in Table 3. Campbell & Guyton (1953) suggest one of the first prediction models for gated outlet works (M1). Later, the US Army Corps of Engineers (1964) and Wisner (1965) propose two similar empirical formulae, denoted as M2 and M3. Sharma (1976) indicates that *β* is a linear function of F (M4). Volkart & Rutschmann (1984) provide a general form with F and *P* (M5, *k _{i}* = coefficient). Based on the experiments of a chute aerator, Pfister & Hager (2010) put forward one of the latest relationships (M6).

Model . | Expression . | Model . | Expression . |
---|---|---|---|

M1 | β = 0.04( – 1)^{0.85} | M5 | β = k_{1}( – k_{2}) – k_{3}P^{k4} |

M2 | β = 0.03( – 1)^{1.06} | M6 | β = 0.0028^{2}(1+ tanθ) – 0.1 |

M3 | β = 0.24( – 1)^{1.4} | MLR | β = 0.12– 4.19P +0.15 |

M4 | β = 0.09F |

Model . | Expression . | Model . | Expression . |
---|---|---|---|

M1 | β = 0.04( – 1)^{0.85} | M5 | β = k_{1}( – k_{2}) – k_{3}P^{k4} |

M2 | β = 0.03( – 1)^{1.06} | M6 | β = 0.0028^{2}(1+ tanθ) – 0.1 |

M3 | β = 0.24( – 1)^{1.4} | MLR | β = 0.12– 4.19P +0.15 |

M4 | β = 0.09F |

The experimental data from Chanson (1988) are used to assess the models, and the results are shown in Table 4. It is noticed that all the models exhibit a low level of fitness (CD and NSC) and large errors (RMSE and MARE), failing obviously to generate acceptable predictions. To improve their performance, recalibration is needed. By minimizing the sum of the square errors, the empirical constants are recalculated using the experimental data. The performance criteria, listed in Table 4, indicate however that some of the recalibrated models still do not provide accurate results. Compared with others, the M5 model presents a relatively better estimation, which is presumably attributable to the consideration of two contributing factors.

Method . | Original . | Method . | Recalibrated . | ||||||
---|---|---|---|---|---|---|---|---|---|

CD . | NSC . | RMSE (m^{3}/s)
. | MAE (m^{3}/s)
. | CD . | NSC . | RMSE (m^{3}/s)
. | MAE (m^{3}/s)
. | ||

M1 | 0.272 | –0.395 | 0.829 | 0.565 | M1–M3 | 0.281 | 0.279 | 0.596 | 0.447 |

M2 | 0.257 | –0.226 | 0.777 | 0.532 | M4 | 0.261 | 0.253 | 0.607 | 0.439 |

M3 | 0.233 | –95.708 | 6.903 | 5.073 | M5 | 0.776 | 0.776 | 0.288 | 0.235 |

M4 | 0.261 | 0.182 | 0.635 | 0.472 | M6 | 0.163 | 0.163 | 0.642 | 0.491 |

M6 | 0.163 | –3.363 | 1.466 | 1.315 | MLR | 0.759 | 0.759 | 0.345 | 0.271 |

Method . | Original . | Method . | Recalibrated . | ||||||
---|---|---|---|---|---|---|---|---|---|

CD . | NSC . | RMSE (m^{3}/s)
. | MAE (m^{3}/s)
. | CD . | NSC . | RMSE (m^{3}/s)
. | MAE (m^{3}/s)
. | ||

M1 | 0.272 | –0.395 | 0.829 | 0.565 | M1–M3 | 0.281 | 0.279 | 0.596 | 0.447 |

M2 | 0.257 | –0.226 | 0.777 | 0.532 | M4 | 0.261 | 0.253 | 0.607 | 0.439 |

M3 | 0.233 | –95.708 | 6.903 | 5.073 | M5 | 0.776 | 0.776 | 0.288 | 0.235 |

M4 | 0.261 | 0.182 | 0.635 | 0.472 | M6 | 0.163 | 0.163 | 0.642 | 0.491 |

M6 | 0.163 | –3.363 | 1.466 | 1.315 | MLR | 0.759 | 0.759 | 0.345 | 0.271 |

Additionally, a new equation is proposed using the MLR method, included in Table 4. In comparison with the M5 model, it exhibits a comparative level of accuracy, with a CD value over 0.75. However, its large errors indicate that further improvements are necessary. Meanwhile, it indicates the non-linear features of the aerator system in a sense that an oversimplified model might be inadequate to yield satisfactory results.

### EMGGP application

*x*

_{no}is the normalized value of variable

*x*. The pre-processed data (70% for training and 30% for testing) are fed into the programme, and 1,500 prediction models are generated. At the training stage, the one with the highest CD value survives, which is shown in the following equation:

As the key feature, it is noted that, from the top 30 GEP runs, the solution takes the *P* and , which is consistent with the models of relatively high accuracy (M4 and MLR) and confirms their dominance in *β*. As for the mathematical operators, addition, subtraction, multiplication, division, square root and tangent show a high frequency and are transferred into the MGGP. Meanwhile, the equation is not straightforward and features high non-linearity, indicating the complexity of the aerator flow and the inadequacy of the existing correlations (Table 4). The performance metrics of this model are shown in Table 6. At both the training and testing stage, this method accurately estimates the air demand, demonstrating a high generalization ability. Both the CD and NSC values are over 0.94, and the RMSE and MAE values are below 0.16 m^{3}/s, which is satisfactory.

Despite its high performance, Equation (11) shows some complexity in form, which is unpractical in applications. To overcome this, the PF method is applied to optimize the solution through the trade-off between the model accuracy and complexity. It is intended to single out symbolic models with expressional simplicity but with insignificant loss of accuracy in comparison with the best one in the population. The complexity of the solution is defined as the total number of nodes in its constituent trees. For a single tree, it is the sum of the node count of itself and all possible full sub-trees (a leaf node is also considered a full sub-tree) (Searson 2015). The accuracy of the model is defined as the CD value at the training phase, as the model is, in practice, developed based on only the training data and the testing data are the ones to be predicted (unavailable).

Figure 6 plots, in terms of fitness and complexity, all the evolved population members (1,500) and the Pareto optimal models (16). Each red square represents a Pareto optimal solution that is not outperformed by any other ones in both criteria, and the green-filled one corresponds to the best model (Equation (11)). The best solution exhibits the highest performance. However, it is at the cost of increased mathematical complexity. It is noticed that some other PF members are simpler and exhibit a comparative level of efficiency with Equation (11). Accordingly, it is possible to replace the most accurate model with one having insignificantly lower performance and much less complexity.

Model No. . | ΔCD (%) . | ΔComplexity (%) . | Model No. . | ΔCD (%) . | ΔComplexity (%) . |
---|---|---|---|---|---|

1 (benchmark) | – | – | 9 | 0.08 | 35.5 |

2 | 0 | 4.3 | 10 | 0.08 | 46.2 |

3 | 0.03 | 7.9 | 11 | 7.98 | 60.2 |

4 | 0.05 | 18.3 | 12 | 8.81 | 71.0 |

5 | 0.05 | 19.4 | 13 | 9.57 | 76.3 |

6 | 0.06 | 24.7 | 14 | 64.91 | 83.9 |

7 | 0.07 | 30.1 | 15 | 65.14 | 89.2 |

8 | 0.08 | 33.3 | 16 | 90.16 | 98.9 |

Model No. . | ΔCD (%) . | ΔComplexity (%) . | Model No. . | ΔCD (%) . | ΔComplexity (%) . |
---|---|---|---|---|---|

1 (benchmark) | – | – | 9 | 0.08 | 35.5 |

2 | 0 | 4.3 | 10 | 0.08 | 46.2 |

3 | 0.03 | 7.9 | 11 | 7.98 | 60.2 |

4 | 0.05 | 18.3 | 12 | 8.81 | 71.0 |

5 | 0.05 | 19.4 | 13 | 9.57 | 76.3 |

6 | 0.06 | 24.7 | 14 | 64.91 | 83.9 |

7 | 0.07 | 30.1 | 15 | 65.14 | 89.2 |

8 | 0.08 | 33.3 | 16 | 90.16 | 98.9 |

Its statistical indexes suggest that the simplified expression leads to a negligible loss in accuracy. At both the training and testing phase, the solution presents a high fitness level (CD > 0.94 and NSC > 0.93) and small errors (RMSE < 0.17 m^{3}/s and MAE < 0.13 m^{3}/s).

*X*

_{1}= 0.05 – 0.17 and

*X*

_{2}= 1.36

*P*+ 0.15. Its statistical measures are presented in Table 6. With some more complexity in form, the equation seems to be slightly less effective than Equation (12).

Method . | Training . | Testing . | ||||||
---|---|---|---|---|---|---|---|---|

CD . | NSC . | RMSE (m^{3}/s)
. | MAE (m^{3}/s)
. | CD . | NSC . | RMSE (m^{3}/s)
. | MAE (m^{3}/s)
. | |

Equation (11) | 0.955 | 0.953 | 0.155 | 0.119 | 0.947 | 0.946 | 0.152 | 0.118 |

Equation (12) | 0.955 | 0.950 | 0.161 | 0.121 | 0.945 | 0.936 | 0.166 | 0.124 |

Equation (13) | 0.933 | 0.931 | 0.188 | 0.155 | 0.931 | 0.923 | 0.178 | 0.144 |

Method . | Training . | Testing . | ||||||
---|---|---|---|---|---|---|---|---|

CD . | NSC . | RMSE (m^{3}/s)
. | MAE (m^{3}/s)
. | CD . | NSC . | RMSE (m^{3}/s)
. | MAE (m^{3}/s)
. | |

Equation (11) | 0.955 | 0.953 | 0.155 | 0.119 | 0.947 | 0.946 | 0.152 | 0.118 |

Equation (12) | 0.955 | 0.950 | 0.161 | 0.121 | 0.945 | 0.936 | 0.166 | 0.124 |

Equation (13) | 0.933 | 0.931 | 0.188 | 0.155 | 0.931 | 0.923 | 0.178 | 0.144 |

### Comparisons of empirical and soft computing methods

This study assesses the capability of empirical and soft computing methods (GP and EMGGP) in air demand estimation. To compare the performance of these approaches, Figure 7 plots the Taylor diagram in terms of standard deviation, Pearson's correlation coefficient (PCC) and RMSE, denoted by the radial arcs (cyan) from the origin, the radial lines (pink) and the semi-circles around the star symbol (red), respectively. This figure simultaneously captures the efficiency metrics of the models. A perfect model, which is in full concurrence with the observations, is set apart by the reference value with *PCC* = 1 and a similar level of varieties in comparison with the observations (Heo *et al.* 2014). Note that the machine learning models outperform all the empirical ones (recalibrated) and the EMGGP approach is the most accurate solution. The empirical equations generally fail to produce accurate estimations. The two categories of methods give rise to a remarkable difference in efficiency.

As compared with the best empirical model M5, Figure 8 presents the statistical improvement rate (all datasets). A positive rate of fitness indicates higher prediction accuracy and vice versa. For the CD and NSC, the EMGGP and GP approaches demonstrate the highest improvements, which are 23 and 20%, respectively. Other methods are not superior to the M5. For the RMSE and MAE, the M1–M3, M4 and M6 present the largest increase in errors, and the EMGGP and GP significantly reduce the errors by 48 and 36%, respectively.

Consequently, it is stated that the soft computing approaches generally provide more satisfactory predictions than the conventional ones, and the proposed EMGGP gains obviously by comparison. Figure 9 compares the air discharges from the experiments by Chanson (1988) with the predicted ones. Both the GP and EMMGP results exhibit a close match with the experimental data, particularly at large *Q*_{a} values. At low values, although the EMMGP performs slightly better, both are somewhat inefficient (Figure 9(b)).

## CONCLUSION

In a large dam, the spillway is subjected to high-speed flows, introducing cavitation risks to the structure. By artificially entrained air into the flow, an aerator is an effective device for cavitation reduction. The amount of entrained air is a proxy that indicates the aerator performance. Many empirical relations are developed for its prediction. However, some only take into consideration a single factor and have an oversimplified form, thus leading to low accuracy.

This study proposes an EMGGP for aerator air demand estimation. The standard MGGP model is integrated with the GEP and PF method. The approach comprises three phases: (a) determination of the informative input variables and function sets, the GEP is used to produce a number of potential solutions, the frequency of the inputs is computed and the effective ones are fed into the MGGP; (b) simulation by the MGGP, with setting up low-depth expression trees that are linearly combined; and (c) solution optimization. The PF is applied to simplify the model through the trade-off between its accuracy and complexity. In the end, an accurate and straightforward solution is generated.

Based on the experimental data for a spillway aerator, the proposed EMGGP approach is developed. Its performance is statistically assessed in terms of CD, NSC, RMSE and MAE. The EMGGP technique exhibits high accuracy with CD = 0.95, NSC = 0.94, RMSE = 0.17 m^{3}/s and MAE = 0.12 m^{3}/s (testing phase). In comparison with the most accurate empirical correlation, the EMGGP demonstrates a significant improvement in fitness (CD and NSC) and a considerable reduction in errors (RMSE and MAE). It also outperforms the standard GP in both accuracy and expressional simplicity.

This study introduces a hybrid method for aerator air demand estimation. The model is developed based on experimental datasets. For any data-driven models, the quality of the data is essential for model performance. Consequently, a wider range of data sources (from both physical and field studies) could be employed for further optimization. This paper focuses on offset aerators, and future studies can be extended to aerator piers and sidewall-ramp aerators.

## ACKNOWLEDGEMENTS

This study is funded by the Swedish Hydropower Centre (SVC), as part of two research projects entitled ‘Two-phase flow modelling: evaluations and simulations for safer spillway discharge’ and ‘Quality and trust of numerical modelling of water-air flows for safe spillway discharge’. The projects are supervised by James Yang and Anders Ansell. SVC has been established by the Swedish Energy Agency, Energiforsk and Svenska Kraftnät together with the Luleå University of Technology, KTH Royal Institute of Technology, Chalmers University of Technology and Uppsala University. Participating companies and industry associations include AFRY, Andritz Hydro, Boliden, Fortum Generation, Holmen Energi, Jämtkraft, Karlstads Energi, LKAB, Mälarenergi, Norconsult, Rainpower, Skellefteå Kraft, Sollefteåforsens, Statkraft Sverige, Sweco Energuide, Sweco Infrastructure, Tekniska verken i Linköping, Uniper, Vattenfall R&D, Vattenfall Vattenkraft, Voith Hydro, WSP Sverige and Zinkgruvan.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.