## Abstract

Cyanobacterial blooms are a persistent concern to water management and treatment, with blooms potentially causing the release of toxins and degrading water quality. However, previous models have not considered the zero inflation of cyanobacteria count data. Typically, a relatively large proportion of measured count data are zeros or non-detects of cyanobacteria, representing either no cyanobacteria was present or the cell number was too low to be detected. Commonly used Poisson and negative binomial models for count data underestimate the probability of zero data, making these models less reliable. This study proposes a Bayesian approach to fit the cyanobacteria abundance data with mixture models that handle zero-inflated data. Predictor variables considered included weather and water quality measures that can easily be obtained day-to-day. The optimal model (zero-inflated negative binomial) was used to predict cyanobacteria alert levels on a separate test set. The ability to predict narrow alert levels was limited, however, 76% accuracy was achieved in predicting cyanobacteria counts above or below 1,000 cells/mL. Parameter estimates were highly variable and demonstrated that complex and uncertain factors influence cyanobacteria count predictions. The modelling approach can be applied to a wide range of environmental problems where zero-inflated data is common.

## HIGHLIGHTS

Bayesian mixture models were used to model zero-inflated cyanobacteria count data.

A Bayesian variable selection method was applied to select important variables.

A zero-inflated model achieved 76% accuracy in predicting binary alert levels.

Bayesian framework produced probabilistic categorization of alert levels.

The model is well suited for management of complex systems with high uncertainty.

## INTRODUCTION

Cyanobacteria are photosynthetic microorganisms that can result in degraded freshwater quality and threaten human health. Cyanobacterial blooms can significantly increase turbidity, result in dissolved oxygen depletion due to the biological degradation of cyanobacteria biomass, and produce unpleasant taste and odour compounds (Huisman *et al.* 2018). Furthermore, some species can release toxins such as microcystins, nodularins, cylindrospermopsin, anatoxins, and saxitoxins (Catherine *et al.* 2013). It has previously been observed that a significant positive relation exists between non-alcoholic liver disease and large-scale blooms associated with toxin release (harmful algal blooms or CyanoHABs) (Zhang *et al.* 2015). Furthermore, associations between drinking surface water from cyanobacteria contaminated water bodies and a higher incidence of colorectal cancer have been noted (Lee *et al.* 2017). CyanoHABs also pose severe problems for ecological systems. Even low concentrations of microcystin-LR (5 μg/L) and microcystins (50 μg/L) have been found to impact fish growth and survival rates. At high concentrations of microcystins (>10 mg/L), morphological effects on fish have been observed (Oberemm *et al.* 1999). In addition, the accumulation of microcystins and cyanotoxins through the food web is a threat to human health (Bownik 2016). Based on analysis of lake sediments over the past 200 years, data show that cyanobacteria have increased significantly, with the most rapid growth in blooms occurring from 1945 until the present (Taranu *et al.* 2015).

CyanoHABs are caused or promoted by a combination of environmental factors, with strong associations with several anthropogenic and natural processes. Agricultural activities can increase nitrogen and phosphorus input into the water system, promoting cyanobacteria growth (O'neil *et al.* 2012). Climate change impacts are also likely to increase the occurrence of blooms in the future (Chapra *et al.* 2017). Higher water temperatures stimulate the growth of cyanobacteria, since their optimal growth rate is often reached at temperatures above 25 °C (Thomas & Litchman 2016). Cyanobacteria are carbon-fixing bacteria that rely on a CO_{2} concentrating mechanism, and therefore rising concentrations of CO_{2} in the atmosphere and water bodies may also promote blooms (Verspagen *et al.* 2014). Elevated pH is also known to reduce the energy cost of the CO_{2} concentrating process, with higher efficiencies observed in acidic environments (Mangan *et al.* 2016).

Cyanobacteria bloom density is usually counted with a mechanical or electronic counter using an inverted microscope following sedimentation in a chamber or filtration (Chorus & Welker 2021). Cell counting is a labour-intensive, time-consuming, and expensive method that limits the extent and frequency of monitoring campaigns. As such, there is a need for methods that can enumerate or estimate cyanobacterial levels rapidly and preferably without the need for sampling. Several studies have developed models to fit count data and make predictions of day-to-day counts based on easy-to-measure parameters. Dzialowski *et al.* (2009) attempted to build a linear regression model for predicting the cyanobacteria abundance and toxins in five reservoirs in Kansas, USA. However, their results suggest that simple linear models could not accurately predict cyanobacteria counts (Dzialowski *et al.* 2009). Pyo *et al.* (2020) utilized a convolutional neural network applied to the output of a spatial fluid dynamics model of cyanobacteria abundance, which achieved good short-term prediction of microcystis. Zhao *et al.* (2019) put forward a species identification model and analysed the dominant species using canonical correspondence analysis (CCA). The model was used to identify major driving factors, including water temperature, pH, total phosphorus, ammonia nitrogen, chemical oxygen demand and dissolved oxygen, and predict the risk of algal blooms. Harris & Graham (2017) developed 12 linear and non-linear models to predict cyanobacteria abundance, microcystin and geosmin in a reservoir. Support vector machines, random forests, boosted trees, and cubist modelling approaches were observed to have the best performance. However, all models underestimated cyanobacteria abundance, and none of the models predicted peak bloom events or the highest counts.

A common challenge with modelling cyanobacteria abundance is the innate imbalance in monitoring datasets. A significant excess number of zero counts is typical and may have resulted from either failure to detect cyanobacteria or an actual absence of cyanobacteria. Although data balancing techniques, such as SMOTE (Synthetic Minority Oversampling Technique), have been proven to be effective in addressing imbalanced data, they also come with disadvantages, such as increase the risking of overfitting (He & Garcia 2009), and the generated data may not fully preserve the true distribution of the minority class (Japkowicz & Stephen 2002). Poisson and negative binomial distributions are commonly used for modelling count data, but they cannot account for the information contained in the excess proportion of zeros. Several mixture models have been proposed to consider better high numbers of zero counts: zero-inflated models and hurdle models. Zero-inflated models assume zeros are generated by a Bernoulli distribution with probability *P* and negative binomial (or Poisson) distribution with probability (Lambert 1992). In hurdle models, the zeros and non-zero values are generated separately by a Bernoulli distribution and negative binomial (or Poisson) distribution (Min & Agresti 2005). Both hurdle and zero-inflated models have been used in environmental and ecological fields of study. Wenger & Freeman (2008) showed an improved fit of zero-inflated models to duck species abundance and stream fish abundance. Cha *et al.* (2014) developed a Bayesian hurdle Poisson model for predicting cyanobacteria abundance in Lake Paldang, Korea. However, comparisons between hurdle models and zero-inflated models were not made, and Poisson models cannot accommodate a common challenge of greater variability in observed data than expected for a given model (overdispersion) (Cha *et al.* 2014).

This study presents a Bayesian approach to fit cyanobacteria data with a negative binomial model, zero-inflated negative binomial (ZINB) model, and hurdle negative binomial model to address challenges with inflated zero counts. It is hypothesized that through the novel use of zero-inflated models for this application, the elevated zero counts inherent in the majority of cyanobacteria abundance data can be accounted for and model fit will be improved. Additionally, a Bayesian framework was used to present abundance predictions as distributions rather than point estimates, allowing for a more direct interpretation of uncertainty. Through these two key aspects of the presented models, the aim is to improve the integratability of models in water management by accounting for expected data distributions and emphasizing the need for knowledge of uncertainty in predictions of environmental systems. The fit of each model is compared to select an optimal model. Predictions from the optimal model are then classified according to Australian Management Strategies for Cyanobacteria (Newcombe *et al.* 2010) to assess the capabilities of the presented approach to identify cyanobacteria levels used in water management. The application of the selected model integrated into a Bayesian framework and utilizing the predictive distribution of each prediction obtained from MCMC sampling to assign the prediction into predefined categories are novel and can achieve higher accuracy compared to the regression method. The established model was also used to assess the importance and impact of environmental variables on the probability of cyanobacteria blooms. We employed the state-of-the-art projection predictive variable selection for generalized linear models which has shown superior performance to competing variable selection methods (Catalina *et al.* 2020), and validate the selected model through the posterior predictive checks (PPCs), which are useful tools to inspect the discrepancies between real and predicted/simulated data. The proposed approach for predicting cyanobacterial blooms offers clear advancements over existing approaches: unlike other methods that rely on traditional variable selection approaches, our approach incorporates a Bayesian variable selection method to the framework, which allows for more effective identification of key site-specific predictors under Bayesian framework. The approach also account for the distinctive zero-inflation characteristics of such imbalanced datasets and significantly improves the predictive performance, which is especially important when large or extensive datasets are not accessible. Finally, the Bayesian framework enables us to incorporate prior beliefs and expert knowledge into the analysis and capture the uncertainty of the prediction, providing a more informed and robust understanding and decision making. The whole process has not been used to resolve water quality issues, and the developed framework is appropriate to resolve a wide range of problems of predicting the classification of imbalance data in environmental and ecological fields. The experimental results validate the effectiveness of our methods, highlighting their potential for practical applications and further research in this field.

## METHODS

### Study site and data source

Data used in this study was collected from a eutrophic lake, Cheney Reservoir (37°45′35″N, 97°50′06″W), the main water supply for Wichita, Kansas, USA (Christensen *et al.* 2006). The reservoir has experienced frequent cyanobacterial blooms, presence of microcystin, and taste-and-odor problems. In part, this could be due to the shallow depth (average depth = 6.1 m) and persistent winds that cause maximal turbulence and a resulting turbid environment. Among the 185 samples in the dataset, 34 samples indicate zero counts of cyanobacteria (18.4% of the data).

The dataset that included water quality variables was obtained from the United States Geological Survey (USGS) from 2002 to 2015 (US Geological Survey 2015). The station code is 07144790. Precipitation, solar radiation, and wind speed were obtained from NASA Power project. The original dataset included nine variables: temperature, pH, total phosphorous, total nitrogen, chlorophyll *a* (Chl *a*), all sky insolation incident on a horizontal surface (i.e. solar radiation), wind, turbidity, and precipitation (Table 1). Although Chl *a* and turbidity are not the cause of cyanobacterial blooms, they are parameters that are expected to be correlated with the presence of cyanobacteria, and therefore can be used as predictors. Wind, temperature, and precipitation were also considered to explore the relationship between weather conditions and cyanobacterial blooms. Prior to modelling, variables were further selected by projection predictive inference, a Bayesian approach for model selection and decision making.

Variable . | Abbreviation . | Units . |
---|---|---|

Total phosphorous | TP | mg/L as P |

pH | pH | NA |

Temperature | Temp | °C |

Chlorophyll a | Chl a | μg/L |

All sky insolation incident on a horizontal surface | Solar radiation | |

Wind | Wind | m/s |

Total nitrogen | TN | mg/L as N |

Precipitation | Precipitation | mm |

Turbidity | Turb | FNU |

Variable . | Abbreviation . | Units . |
---|---|---|

Total phosphorous | TP | mg/L as P |

pH | pH | NA |

Temperature | Temp | °C |

Chlorophyll a | Chl a | μg/L |

All sky insolation incident on a horizontal surface | Solar radiation | |

Wind | Wind | m/s |

Total nitrogen | TN | mg/L as N |

Precipitation | Precipitation | mm |

Turbidity | Turb | FNU |

### Mixture models for zero-inflated count data

#### Zero-inflated negative binomial model

*Y*, the ZINB model can be written as:where

*π*is the parameter denoting the probability of zeros in a binomial distribution. is the probability density function of the negative binomial distribution.

, are the selected variables. In a zero-inflated model, *X* and can be different sets of variables. Here, *X* and are the same, which are the selected predictor variables, *Y* is the cyanobacteria counts.

#### Hurdle negative binomial model

*et al.*1996). The first part decides the presence of a positive count, which is typically accomplished through logistic modelling. The second part, a truncated negative binomial model, models the number of observations (non-zero value). The hurdle NB model can be written as:

The parameter and link functions of the alternative negative binomial distribution are the same as the above ZINB model.

### Bayesian approach

In a Bayesian approach, parameter estimation workflow consists of three processes: first, a prior distribution of the parameters in sections 2.2.1 and 2.2.2. is determined based on available experience and knowledge. Second, the likelihood of observed data is calculated using the parameters . Finally, the likelihood and prior are combined to determine the posterior distribution , reflecting an updated representation of knowledge (van de Schoot *et al.* 2021).

Priors used typically fall into three categories: informative, weakly informative, and diffuse (Depaoli *et al.* 2020). For our generalized linear models, the prior distributions for the parameters were specified as weakly informative priors with a large spread. The specific information of prior parameters and prior sensitivity analysis can be found in Supplementary Table S1.

Once the posterior distribution of parameters is determined, sample observations can be drawn. However, the parameter distribution is high-dimensional and usually not a probability distribution we are familiar with, making exact inference intractable (Bishop 2006). Markov Chain Monte Carlo (MCMC) methods, such as the Metropolis-Hastings algorithm (Metropolis *et al.* 1953; Hastings 1970), are used to generate random samples from the target distribution.

Stan is a probabilistic programming language for Bayesian statistical inference written in C + +. It provides a No-U-Turn sampler (NUTS) to obtain simulations from the user-specified posterior distribution (Carpenter *et al.* 2017). In this study, we have used the R package *rstan*, which provides an interface to Stan using R. Through *rstan,* we implemented mixture models such as zero-inflated and hurdle models for discrete distributions.

Convergence of MCMC chains can be diagnosed with trace plots and Gelman–Rubin diagnostic (Brooks & Gelman 1998). Trace plots are helpful when identifying the burn-in process and the convergence of Markov chains. Gelman–Rubin statistic compares the total-within and between-chain variation to analyse the difference between multiple Markov chains. indicates good convergence. Practically, a 0.975 quantile for denotes convergence.

### Model development, selection, and validation

After splitting the data into training/test sets, the most representative variables were selected by projection predictive inference. The selected variables were then used to build a Bayesian negative binomial model, a Bayesian zero-inflated model and a Bayesian hurdle negative binomial model. Model comparison and selection were achieved by leaving-one-out cross-validation (LOO-CV) and model validation (PPCs) for the best model.

When using generalized linear models to solve regression problems (e.g. binary and multinomial logistic regression), a threshold is commonly chosen as the decision rule. For example, in binary logistic regression, it is a general practice to choose 0.5 as the threshold, but in practice, different thresholds can be manually selected for specific situations. If high discriminative accuracy is required for positive cases, a larger threshold can be chosen (Kuk *et al.* 2014). Traditional multinomial logistic regression is subject to large bias when dealing with imbalanced data and does not take the distribution of the data into account. Thus, in order to make the result more indicative, we approximated the probability distribution of the prediction points by the density distribution obtained by MCMC sampling, and assigned the predictions to alert levels according to the management strategies for cyanobacteria by Water Quality Research Australia (WQRA) (Table 2). The categorization process is analog to assigning the predicted class according to the posterior distribution and the probability threshold we set in advance. Finally, we applied the fitted model to our test set to generate predictions and classified the results.

Alert Level . | Definition . | Description . |
---|---|---|

Safe | Safe for drinking water | |

Low | Detected at low levels | |

Medium | Potential toxin to to guideline concentration | |

High | Potential toxin greater than guideline concentration | |

Very high | Potential toxin greater than guideline concentration |

Alert Level . | Definition . | Description . |
---|---|---|

Safe | Safe for drinking water | |

Low | Detected at low levels | |

Medium | Potential toxin to to guideline concentration | |

High | Potential toxin greater than guideline concentration | |

Very high | Potential toxin greater than guideline concentration |

#### Projection predictive inference

Projection predictive inference (Piironen *et al.* 2020) is a Bayesian variable selection method. Variable selection was carried out using the *projpred* package in R.

#### Leave-one-out cross-validation

*et al.*2013).

Measures including AIC, BIC, DIC, and WAIC methods utilize all data to determine fit and therefore can be biased in assessment. Therefore, a LOO-CV (*loo* package in R) approach was used in order to determine model fit based on out-of-sample data. This can be extremely computationally expensive, especially if the dataset is large. However, with less than 200 samples, computation time can be ignored. In LOO-CV, a single sample from the dataset is removed to test the model and the remaining samples are used to train the model. The process is repeated *n* times (where *n* is the size of the dataset) so that each sample is considered.

#### Posterior predictive checks

Posterior predictive check is a classical approach to compare the test statistics (arbitrary function of data) of the actual observed data and the data generated from the model with parameters sampled from the posterior predictive distribution (Berkhof *et al.* 2000).

*p*-value (Posterior

*p*-value) is a quantitative measurement of the goodness of fit. The

*p*-value-like measure represents the probability that the test statistic (such as mean, maximum, minimum, and zero proportion) in the replicated (or predicted new observations) dataset exceeds that in the original data (or new observations).

If the model provides a good fit, the Bayesian *p*-value should be around 0.5. A value close to 0 or 1 indicates that the model is a poor fit (Meng 1994).

For each simulation () of parameters from the posterior distribution, we have a -dimensional vector of *n* predicted outcomes of *y*. Thus, the result is an sized matrix of predicted outcomes from all simulations.

In doing PPCs, either the same predictors *X* or new observations of predictors could be used when building the model. In the latter case, the test statistics of predicted values and test statistics of the actual observed values are compared. The *bayesplot* package in R was used for plotting PPCs.

## RESULTS AND DISCUSSION

### Variable selection

*a*, temperature, turbidity, total phosphorus, solar radiation, wind, pH, total nitrogen, and precipitation (Figure 3).

Figure 3 shows that the first five variables were sufficient predictors as they result in a similar elpd and RMSE to the reference model (the final point includes all variables). The selected variables are consistent with prior knowledge of factors that can be used to determine cyanobacteria counts. Chl *a* is produced by cyanobacteria and, therefore, a strong indicator of abundance. Temperature promotes cyanobacterial growth (Thomas & Litchman 2016) and is expected to be a significant driver of blooms. Nutrients (nitrogen and phosphorous) stimulate the growth of cyanobacteria (O'neil *et al.* 2012). However, it is worth noting that only total phosphorous was identified as a variable of importance and total nitrogen had no impact on model fit. Previous studies indicate that the optimal mass-based ratio of total nitrogen to total phosphorus is 16:1 (Davidson *et al.* 2012). The variable selection indicates the reservoir was phosphorus-dominated, and the nitrogen concentration was either sufficient or very stable. However, the average mass-based ratio of total nitrogen to total phosphorus was 11:1 in the reservoir, suggesting that nitrogen should be limited. The unselected variables, wind and precipitation, have been previously reported to have effects on cyanobacterial blooms: strong winds can disrupt the water layers, leading to mixing and reducing the stability of the water column, which can change the turbidity and limit the availability of light and nutrients (Anderson *et al.* 2002). Precipitation, particularly rainfall, also has effects on cyanobacterial blooms: heavy rainfalls can result in nutrients runoff from land into water bodies (Paerl & Otten 2013), which can promote the growth of cyanobacteria and lead to blooms. It is therefore possible that the effects of wind and precipitation can be better represented by the selected variables of turbidity, radiation, and nutrients, and are not directly needed for modelling.

### Model selection

Weakly informative priors were used for parameters, and four Markov chains were run for each model for 1,000 iterations, discarding the first 500 iterations as a burn-in process. Supplementary Figures S1–S3 present traceplots for parameters in NB, ZINB, and hurdle NB models. The overlapping of different chains indicates convergence. Furthermore, parameters from all three models have < 1.003, further suggesting convergence of each chain (Brooks & Gelman 1998).

After confirming the convergence of all MCMC chains, LOO-CV was applied to assess the strength of each modelling approach. Assessment of model strength was based on both elpd and standard error (SE) (Table 3). The difference in relative to the model with the largest (i.e. the ZINB model) can be used to consider the magnitude of difference between models. The significance of observed differences in elpd was determined by calculating *z*-scores and corresponding *p*-values of paired comparisons (Lambert 2018). Results indicate zero-truncated models (zero-inflated and hurdle models) were better than a negative binomial model (*p* = 0.002); however, the performance of ZINB and hurdle NB were comparable (*p**=* 0.14).

Model . | elpd difference . | SE difference . |
---|---|---|

ZINB | 0.0 | 0.0 |

Hurdle NB | −1.2 | 2.7 |

NB | −25.3 | 8.9 |

Model . | elpd difference . | SE difference . |
---|---|---|

ZINB | 0.0 | 0.0 |

Hurdle NB | −1.2 | 2.7 |

NB | −25.3 | 8.9 |

Differences in elpd and standard error (SE) were calculated using the highest performing model (ZINB).

While the fit between ZINB and hurdle NB were comparable, it should be considered that the mechanism of zero generation is different between them. In a zero-inflated model, zero counts may come from two sources: (a) the cell number is too low to be by the enumeration method used and (b) the cell number was truly zero. Zero counts are assumed only to be caused by cell numbers below the detection limit in a hurdle model. Cyanobacteria are likely not present in the reservoir at some times, and therefore, the ZINB was chosen as the best model based on goodness of fit and the ability to consider true zero counts.

### Model checking

#### Posterior predictive checks

PPCs are used to evaluate if the model fit is reasonable and identify potential differences between observed data and the fitted model. PPCs were initially run using the training set of data. We chose zero proportion as a test statistics for *y* and , which represents the proportion of zero values in the real observed data and the replicated data (predicted data for the same dataset) and calculated the Bayesian *p*-value. Bayesian *p*-values in this context indicate the probability that replicated data are not more extreme than the observed distribution (Gelman 2005). A Bayesian *p*-value close to 0.5 indicates a good fit, values approaching 0 indicate lack of fit, and values close to 1 indicate overfitting (Korner-Nievergelt *et al.* 2015).

*p*-value is 0.2, indicating that the model tends to underestimate the zero proportions. It is possible that the zero-inflated generalized linear models still cannot account for all zeros in the data due to not capturing non-linear relationships between cyanobacteria and predictor variables.

A 5-fold PPC cross-validation was applied to evaluate the model using out-of-bag samples. PPCs were repeated for each validation set and compared the test statistics for and . The Bayesian *p*-value of the five validation sets were 0.43, 0.56, 0.32, 0.42, and 0.53 with an average of 0.45. One validation set is shown as an example in Figure 4 (Bayesian *p*-value = 0.32). The predicted and the actual new observations overlap (Figure 4, bottom left), although there is a slight underestimation of zero proportion (Figure 4, bottom right). The difference in estimated zero proportions and *p*-values is likely due to the varying proportion of zero counts in each of the five validation sets. The Bayesian *p*-values of both replicated data and new data close to zero suggest that the linear model may be inadequate for the cyanobacteria growth model. Considering non-linear models, such as the dynamic phytoplankton model proposed by Malve *et al.* (2007) would add complexity but may also increase model fit.

### Cyanobacteria alert level prediction

From Figure 5, it can be observed that even if the peaks of the predictive density do not fall precisely on the true observed value, the maximum predictive density may be approximately adjacent to the true value and the overall predictive density shifts. It is also of note that despite density being highest immediately adjacent to true predictions, there is non-zero probability of elevated cyanobacteria abundance. The Bayesian modelling approach allows for direct interpretation of this uncertainty and the uncertain nature of factors influencing cyanobacterial population dynamics to carry through to predictions.

Predictive density was used to categorize predictions according to WQRA alert levels. By taking probability density in bins rather than point estimates, we account for the high levels of uncertainty in both the impacts of influencing factors and how to interpret risk from cyanobacteria abundance. Not all species will release toxins (Lee *et al.* 2015) and environmental conditions such as temperature impacts toxin release (Walls *et al.* 2018). As such, management of surface waters often is in response to categorized levels of cell counts or other water quality parameters (Ibelings *et al.* 2014).

The predicted class was determined by the mode or most common predicted class based on probability density. The accuracies in each fold were 0.50, 0.32, 0.36, 0.32, and 0.45. The overall confusion matrix for multiclass prediction (all WQRA alert levels) is shown in Table 4a. The average accuracy was found to be 0.40, generally indicating poor performance. In particular, it was noted that the model performed poorly in predicting low or medium alert levels and predictions of safe levels dominated. The dominant safe level probabilities are evident from the figure inset in Table 4a.

(a) All WQRA levels
. | Predicted. | |||||
---|---|---|---|---|---|---|

Safe
. | Low
. | Medium
. | High
. | Very high
. | ||

Actual | Safe | 21 | 1 | 24 | 14 | 3 |

Low | 9 | 0 | 9 | 14 | 0 | |

Medium | 5 | 1 | 12 | 20 | 2 | |

High | 1 | 0 | 4 | 40 | 1 | |

Very high | 0 | 0 | 0 | 4 | 0 | |

(b) Binary decision
. | Predicted. | |||||

Safe (<1,000 cells/mL)
. | Potential toxin presence (>= 1,000 cells/mL)
. | |||||

Actual | Safe (<1,000 cells/mL) | 14 | 65 | |||

Potential toxin presence (>= 1,000 cells/mL) | 1 | 105 |

(a) All WQRA levels
. | Predicted. | |||||
---|---|---|---|---|---|---|

Safe
. | Low
. | Medium
. | High
. | Very high
. | ||

Actual | Safe | 21 | 1 | 24 | 14 | 3 |

Low | 9 | 0 | 9 | 14 | 0 | |

Medium | 5 | 1 | 12 | 20 | 2 | |

High | 1 | 0 | 4 | 40 | 1 | |

Very high | 0 | 0 | 0 | 4 | 0 | |

(b) Binary decision
. | Predicted. | |||||

Safe (<1,000 cells/mL)
. | Potential toxin presence (>= 1,000 cells/mL)
. | |||||

Actual | Safe (<1,000 cells/mL) | 14 | 65 | |||

Potential toxin presence (>= 1,000 cells/mL) | 1 | 105 |

Based on poor performance with narrow alert level bands, and generally better separation of ‘high’ vs. ‘safe’ levels, it was considered to reduce classification to a binary decision of potential toxin presence or not. We set the threshold to 1,000 cells/mL, corresponding to the middle of the low alert level in WQRA and associated with a level where toxin release may be possible. For this binary decision, the precision and recall were found to be 0.62 and 0.99, respectively. As such, on a more coarse level, the model performance improved and has potential for distinguishing conditions that could result in toxin presence (Table 4b). In particular, the binary decision approach did not under-predict alert (false negatives), and performance was high for correctly predicting counts greater than 1,000 cells/mL.

### Influence of weather and water quality factors on cyanobacteria counts

*a*was found to have the largest positive coefficient, indicating a strong positive relationship with cyanobacteria counts. This was expected since cyanobacteria will produce Chl

*a*, and this measure is often used as a surrogate for cell counts (Chaffin

*et al.*2018). The temperature coefficient is distributed above zero, implying a positive impact on the probability of a bloom. A positive relationship between temperature was anticipated based on a significant amount of literature highlighting increased growth with increasing temperature (Thomas & Litchman 2016; Rousso

*et al.*2020).

The coefficient of solar radiation is mainly distributed below zero, indicating a negative correlation with cyanobacteria levels. A negative relationship between radiation and cell counts could be explained by photobleaching of pigments in cyanobacteria, such as phycobiliproteins (Sinha *et al.* 2005) or by relative competitive advantages of cyanobacteria compared to other algal taxa under limited light conditions (LeBlanc Renaud *et al.* 2011). Long-term exposure to increasing light intensity and UV-B light in particular has resulted in decreased Chl *a* content and decreased cyanobacteria populations (Xue *et al.* 2005; Cirés *et al.* 2011). At high radiation levels (340 μE m^{−2} s^{−1}), the cyanobacteria growth rate was previously found to be 30% lower than at moderate radiation (60 μE m^{−2} s^{−1}) or low radiation levels (Cirés *et al.* 2011). However, it should be noted that radiation intensity and temperature are strongly correlated, and increasing solar radiation was expected to result in increased cyanobacteria levels due to a corresponding increase in temperature (Jöhnk *et al.* 2008).

The turbidity coefficient was distributed on both sides of zero, indicating the possibility of either positive or negative correlations with cyanobacteria abundance. Turbidity is a general measure and does not distinguish types of matter, including no distinction between cyanobacteria and non-algal matter that would contribute to turbidity. Previously, cyanobacteria abundance of Kansas reservoirs was reported to be negatively correlated to non-algal turbidity (Dzialowski *et al.* 2011). As the non-algal turbidity increases, light penetration is reduced, and less cyanobacteria biomass is expected. Alternatively, cyanobacteria presence would lead to a measured increase in turbidity (Klemer & Konopka 1989). As such, the role of turbidity cannot be easily identified, and the parameter distribution appears to represent the uncertain relationship between turbidity and cyanobacteria counts accurately.

The coefficient distribution for total phosphorus was primarily distributed below zero, implying a negative correlation with cyanobacteria abundance. This result contradicts the expectation of phosphorous levels being positively associated with cell counts, given the substantial evidence that nutrient reduction strategies reduce blooms (Hamilton *et al.* 2016). It should be considered that there were relatively elevated levels of phosphorous in the reservoir (mean value of 0.1 mg/L), and nutrients may generally not have been a limiting factor for growth in this system. The recommended limit of total phosphorus in lakes is 0.05 mg/L (Litke 1999), and 92% of the recorded phosphorous levels in this dataset would imply the reservoir being studied is eutrophic or hypertrophic (Carlson & Simpson 1996). Relatively flat biomass responses with increasing phosphorous above a limiting threshold have also been previously reported (Dolman *et al.* 2012).

## CONCLUSIONS

Bayesian mixture models were applied to model cyanobacteria abundance in a reservoir, with particular consideration for the tendency for cyanobacteria abundance to be highly imbalanced with a high proportion of zero values. Two models that can account for the high proportionality of zero measurements, including a ZINB and hurdle NB, were compared. An NB model was also applied to act as a baseline approach that does not account for excess zero counts.

Based on fit determined from LOO-CV, it was found that the ZINB and hurdle NB models performed significantly better than the NB model. The observed improvement of fit when using models that account for excessive zero counts supports the hypothesis that inflated zero counts are important to consider when modelling cyanobacteria abundance. Furthermore, a slight increase of fit was observed when using ZINB compared to the hurdle NB approach. ZINB models can account for zero measurements being present either from the cell number being below detection limits, or from the true absence of cyanobacteria. As such, the improvement of fit using ZINB illustrates that both mechanisms of zero generation should be considered when modelling cyanobacteria.

The ZINB model was then applied to predict cyanobacteria levels using a separated test set. Although the performance was poor when predicting narrow alert level bands, precision and recall were high (0.62 and 0.99, respectively) for binary prediction of elevated vs. low risk levels of cyanobacteria. The established model utilizes a limited number of easy-to-measure parameters including Chl *a*, total phosphorous, pH, temperature, and solar radiation to generate these predictions. Furthermore, the predictions produced from the Bayesian approach utilized in this paper are probabilistic. The uncertainty from the data and interactions in the system are carried through the modelling process to produce an estimated cell count with an associated level of uncertainty. The high uncertainty levels in parameter estimates demonstrate that cyanobacteria count prediction is difficult, and the impact of influencing factors is complex. As such, we believe that the presented modelling process is well suited to inform the management of complex systems with high uncertainty, such as early warning of cyanobacteria blooms in surface waters.

## ACKNOWLEDGEMENT

The research was funded through Canada's NSERC Discovery program (RGPIN-2019-05449).

## DATA AVAILABILITY STATEMENT

All relevant data are available from https://github.com/meowliono/ZINB_cyanobacteria.

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

*Management Strategies for Cyanobacteria (Blue-Green Algae): A Guide for Water Utilities*. Water Quality Research Australia (WQRA). Reserach Report 74, pp. 60–76

arXiv preprint arXiv:2010.06994