## Abstract

Robust scientific inference is crucial to ensure evidence-based decision making. Accordingly, the selection of appropriate statistical tools and experimental designs is integral to achieve accuracy from data analytical processes. Environmental monitoring of water quality has become increasingly common and widespread as a result of technological advances, leading to an abundance of datasets. We conducted a scoping review of the water quality literature and found that correlation and linear regression are by far the most used statistical tools. However, the accuracy of inferences drawn from ordinary least squares (OLS) techniques depends on a set of assumptions, most prominently: (a) *independence among observations*, (b) *normally distributed errors*, (c) *equal variances of errors*, and (d) *balanced designs*. Environmental data, however, are often faced with temporal and spatial dependencies, and unbalanced designs, thus making OLS techniques not suitable to provide valid statistical inferences. Generalized least squares (GLS), linear mixed-effect models (LMMs), and generalized linear mixed-effect models (GLMMs), as well as Bayesian data analyses, have been developed to better tackle these problems. Recent progress in the development of statistical software has made these approaches more accessible and user-friendly. We provide a high-level summary and practical guidance for those statistical techniques.

## HIGHLIGHTS

Correlation and linear regression are commonly used to assess water quality data.

Environmental data, however, are often characterized by temporal and spatial dependency structures in the data thus making ordinary least squares techniques inappropriate.

Generalized least squares, linear mixed, and generalized linear mixed-effect models, as well as Bayesian techniques, may be more suitable for such data.

### Graphical Abstract

## INTRODUCTION

Anthropogenic activity influences the biological, chemical, and physical components of the environment. Environmental monitoring systematically measures these components over time to determine if changes have occurred or are occurring (Yoccoz *et al.* 2001). Among the many measurement outcomes used in monitoring, such as atmospheric deposition (Horb *et al.* in press) or wildlife health (Roberts *et al.* in press), ambient water quality monitoring is intended to characterize broader chemical changes and enable the identification of potential risks to water resources, including its ability to support aquatic life or suitability for consumption, recreation, or other uses (Eckner 1998). While any identified changes can provide context for evaluation of more localized monitoring activities such as end-of-pipe compliance monitoring (Walker *et al.* 2002) or to evaluate the influence of diffuse exposure pathways (Arciszewski *et al.* in press), the incorporation of water quality into a monitoring program also has other advantages. Water samples can easily be obtained, and an increasing number of reliable analysis methods enable a wide range of chemical analytes to be quantified. This has led, in some cases, to the accumulation of large and often publicly available datasets (e.g., Open Government 2017). These datasets are typically evaluated via statistical analyses and comparison with relevant jurisdictional water quality guidelines including, but not limited to, local watershed or sub-watershed management framework targets and limits, provincial limits, and federal limits (e.g., Glozier *et al.* 2018).

While some purpose-driven data collection and analysis have clear techniques and criteria for assigning significance, such as measurements of *E. coli* in surface waters (Eckner 1998), water quality analyses can also include undirected analyses to detect unknown, changing, or unusual conditions (Wintle *et al.* 2010). Similarly, data may also be routinely collected in water quality programs without a specific statistical evaluation framework developed *a priori*. Additionally, water quality programs can continue on for decades, increasing the likelihood of changes in sampling methods, frequency, locations, and analytical precision, which may limit the use of statistical analyses (Helsel *et al.* 2020). As with any data, analyzing water quality data requires careful consideration of both what the investigators hope to learn from the data and the state of the dataset itself.

Despite the potential challenges, most water quality research primarily utilizes conventional statistical tools based on the method of ordinary least squares (OLS). Techniques using OLS are popular because of their statistical power, familiarity, their analytical solvability, and computational simplicity (McElreath 2020). OLS techniques, however, can also be quite restrictive because of statistical assumptions, which are often challenging to satisfy in environmental and field-based datasets. Those assumptions are (a) *independence among observations* (independence assumption), (b) *normally distributed errors* (normality assumption), (c) *equal variances of errors* (homoscedasticity assumption), and (d) *a balanced design* (identical sample size among factor levels) in the case of Analysis of Variance (ANOVA). In such cases when statistical assumptions cannot easily be satisfied, applying more appropriate methodologies, including mixed-effect models, can facilitate more effective decision-making by minimizing potential uncertainty in results of environmental data analyses.

In this work, we quantified the usage of specific statistical methods for the analysis of water quality data via a scoping review of literature. This review informs subsequent discussion around statistical assumptions and the selection of appropriate methods where those assumptions are violated. We further provide a high-level summary and practical guidance for those statistical techniques using a real-world water quality dataset. Analyses were carried out in the R Language for Statistical Computing (R Core Team 2021) and all scripts are provided in Supplementary Material S1.

## SCOPING REVIEW

In this study, we conducted a scoping review of literature to quantify the number of statistical techniques used in water quality (WQ) research and identify patterns over time. Contrary to a full systematic review, a scoping review is conducted to identify the available literature on a topic in a systematic way but does not attempt to appraise the *quality* of the identified studies and their methodology (Arksey & O'Malley 2005; Grant & Booth 2009; Peters *et al.* 2020).

The research question for the scoping review was ‘What statistical methods have been commonly used in analysis of water quality data?’, and conversely, ‘What suitable methods have not been used?’. Primary studies that focused on investigation of water quality or monitoring of water quality in ecosystems were identified by conducting a search using a pre-established list of statistical techniques and keywords. Eighteen statistical methods were identified as a basis for initial identification of studies including: Analysis of Variance (ANOVA), Bayesian Analysis, Cluster Analysis, Control Charts, Correlation, Correspondence Analysis, Factor Analysis, Kruskal–Wallis, Machine Learning, Mann-Kendall, Mann–Whitney, Generalized Linear Mixed-Effect Models (GLMMs), Linear Mixed-Effect Models (LMMs), Principal Component Analysis (PCA), Non-metric Multidimensional Scaling (NMDS), Regression, Simulation and Forecasting, and *t*-test. WQ studies that did not employ any of these methods were reviewed for additional or alternate methods, which were the basis for separate follow-up searches of the water quality literature. The *Web of Science Core Collection database* and *Google Scholar* were used to conduct this review covering a time frame from Jan 1, 1990 to Sep 27, 2020.

Frequencies of each statistical approach were plotted to determine the most common techniques for assessment of water quality in river ecosystems (Figure 1(a)). Many articles used more than one statistical approach (e.g., Pearson correlation in conjunction with linear regression analysis), and, in these cases, a single paper is counted in multiple categories. Of 24,819 identified references that assessed WQ in water ecosystems, 11,024 (44.4%) mentioned using at least one statistical approach. To address any potential bias, of 13,795 remaining articles, we randomly selected 580 (∼4%), for which full-texts were reviewed and new statistical methods were identified (Figure 1(b)). All references were reviewed by one reviewer and verified for inclusion by a second independent reviewer.

The four most used statistical approaches included various simulation and forecasting methods (*N*=5,374, 34.4%), correlation (*N*=3,701, 23.7%), linear regression (*N*=1,727, 11.1%), and PCA (*N*=1,258, 8.1%), accounted for 77.3% of the entire classified literature (Figure 1). Of the simulation techniques, the most commonly used method was the Soil and Water Assessment Tool (SWAT), a watershed scale-based model that has a continuous daily time-step (Neitsch *et al.* 2011; Shawul *et al.* 2013; Worku *et al.* 2017) accounting for 19.8% of all identified simulation-specific papers.

Multivariate ordination techniques including correspondence analysis, factor analysis and cluster analysis, were used in 10.1% (*N*=1,577) of publications (Figure 1). Our review recorded the first use of Bayesian methods to analyze WQ in 1993 (Varis *et al.* 1993) in a study considering different analytical techniques for a WQ forecasting system. In the 20 years that followed, only 83 papers mentioned using Bayesian analysis. In the 8 years that precede present-day, 197 papers were identified, indicating that the Bayesian approaches have become more frequently used, with published papers incorporating this family of methods steadily increasing (Figure 2).

Multilevel models are another group of statistical methods whose popularity has steadily been increasing in published WQ studies, particularly in recent years (Figure 2). This family of analyses includes popular methods such as linear mixed-effect (LMMs) and generalized linear mixed-effect models (GLMMs). Multilevel models were first introduced by Laird & Ware (1982) to better address incorrect analysis of datasets using common statistical procedures such as *t*-test or ANOVA when certain statistical assumptions are not met, such as statistical independence among data points. Hawkins *et al.* (2010) described this problem and noted that, while ‘replication is a critical component in any type of assessment, its use in ecological assessment and monitoring has differed in important ways from its use in classic controlled experiments’. In ecological field experiments, while replicate samples taken within a site are often regarded as independent measures, they are in fact pseudoreplicates (Hawkins *et al.* 2010). Consequently, analyzing such data using a *t*-test or similar approach inflates the likelihood of observing a statistically significant difference in means (increasing the probability of a Type I error). This phenomenon of pseudoreplication was defined by Hurlbert (1984) as ‘the use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated or experimental units are not statistically independent’. By comparison, multilevel models avoid this problem as they do not require balanced data and they allow explicit modeling and analysis of between- and within-individual variation.

The first use of linear multilevel models (LMM) for water quality analysis in our paper sample was by Tate *et al.* (2003), who looked at the effect of cattle feces distributions on water quality. Similar to the use of Bayesian approaches, multilevel models began to notably increase in popularity from 2012 onward, with only 10 WQ studies using the method between 2003 and 2012.

## STATISTICAL METHODS AND APPROACHES

An understanding of how to select the most appropriate statistical methods is critical if practitioners are to make informed decisions. When inappropriate or suboptimal methods are applied to even the most robust datasets, consequences may include drawing false conclusions, including missing environmentally critical changes. To facilitate an improved understanding by practitioners, we briefly describe the difference between traditional or frequentist statistics and Bayesian statistics (Section 3.1). We then highlight key statistical assumptions within the traditional statistical framework and what other tools are available if these assumptions are not sufficiently met (Section 3.2). We further provide resources in the form of required functions and references to run those methods in the R Language for Statistical Computing (R Core Team 2021). Lastly, we use a publicly available water quality dataset to demonstrate the use case for more flexible statistical approaches such as multilevel models over traditional tools such as *t*-tests, ANOVA and simple linear regression techniques (Section 4).

### Frequentist vs. Bayesian statistics

The frequentist approach to inferential statistics assumes that observed data are repeatable random samples originating from a data generating process with unknown but fixed parameters, such as the mean and standard deviation for the normal distribution. The goal is to estimate these unknown and fixed parameters by taking samples from the populations of interest (e.g., control and treatment group), calculate their parameter estimates and decide whether they belong to the same population (no effect). This is known as the null hypothesis. The decision metric is a probability value known as the *p*-value which can be calculated from a test statistic, such as Student's *t*-test, which calculates a test score and the corresponding *p*-value from the *t*-distribution. If the resulting *p*-value falls below a certain threshold level (the -level), the test statistic is assumed to be too extreme (or unlikely) to have originated from the same population and the null hypothesis is rejected in favor of an alternative hypothesis. It is important to note that the *p*-value (a) is not the probability of the null hypothesis being true or the alternative hypothesis being false, (b) is not the probability of the observed effect being the result of random chance alone, and (c) cannot provide any insight into the magnitude of an observed effect. The belief that *p*-values do provide this information, however, continues to persist in scientific literature, which has led to a vast body of literature criticizing null hypothesis significance testing (NHST), the interpretation of *p*-values, and the 0.05 significance level (Wasserstein & Lazar 2016; Wasserstein *et al.* 2019).

In contrast, the goal of Bayesian statistics is to estimate a probability distribution of unknown parameters given the observed data and some prior knowledge about those parameters in question. Incorporating prior knowledge into the calculation of the distribution of parameters is the defining feature of Bayesian statistics, and a key argument for why this approach could produce more realistic evaluations of data (i.e., because our prior knowledge is seldom zero, as is assumed in frequentist analysis). In Bayesian statistics, parameters are considered random variables and can therefore be modeled probabilistically and displayed in a distribution function unlike in frequentist statistics where parameters are assumed to be fixed. Bayesian statistics can answer questions such as ‘What is the probability that dissolved metal concentrations fall below a certain threshold level in a river ecosystem?’ Written probabilistically, we are asking ‘What is the probability of our hypothesis given the data, or P(hypothesis|data)?’ This is not possible in frequentist statistics, where we can only make statements about the probability of the observed data given a hypothesis, i.e., *P(data|hypothesis)*. In most cases, attaching a probability to a hypothesis, such as ‘there is a 70% probability that metal concentrations are above threshold levels’, feels more intuitive than attaching a probability to the observed data, such as ‘there is a 5% probability or less of observing our data if we assume metal concentrations are within threshold levels’ (frequentist interpretation of probability). The fact that the Bayesian interpretation of probability feels more intuitive may also explain the often-documented misinterpretation of the frequentist *p*-values (Kruschke & Liddell 2018). Whether researchers should choose frequentist or Bayesian analysis should depend on the research context and questions and not simply on preference. For an in-depth discussion about Bayesian data analyses using R, see Kruschke (2015) and McElreath (2020). We also provide a conceptual explanation of the Bayesian analysis approach with a simple dataset in Supplementary Material S2.

### Implications of statistical assumptions on method selection

Linear regression is a well-known statistical technique routinely used to analyze data. Linear regression models a mean outcome or response variable as a linear function of, or conditioned on, one or more predictor variables using a straight line of best fit. While many methods are available to describe best-fit lines, the most common is the OLS method. The mathematical foundation of the OLS regression is the minimization of the sum of squared differences (or residuals) of the observed data to the regression line. Similar to other statistical techniques, OLS regression relies on assumptions. The method of least squares provides unbiased parameter estimates if the errors are independent (uncorrelated) and normally distributed around the regression line with a mean of zero and constant variance. While diagnostic methods to evaluate for statistical assumptions can be numerical, graphical methods are generally considered more versatile and informative (Faraway 2016) which is usually done by inspecting the residuals for any obvious patterns that may be present and may reflect uncaptured systematic bias (Zuur *et al.* 2010; Zuur & Ieno 2016).

While OLS models are constrained by assumptions that must be tested, they can be powerful tools when assumptions are met and can be easily calculated. For environmental scientists, however, field-collected data often present limitations on the use of simple linear regression techniques. Field data often do not meet assumptions because either measurements could not be taken, or measurements were in close proximity to each other and are not independent, or the observed response variable may not follow a normal distribution and may not even be linear at all, or any combination of the three. In such cases, using linear regression techniques could lead to biased parameter estimates and conclusions. Implications of violation of statistical assumptions of constant variance, normality, linearity, and independence are discussed in Sections 3.2.1 through 3.2.4, respectively and more appropriate techniques suitable for analyzing water quality data are provided.

#### Violation of the constant variance assumption

Whenever the assumption of constant variance (homoscedasticity) is violated, but the other assumptions hold, generalized least squares (GLS) techniques represent a more appropriate alternative. Constant variance is assumed across all values/levels of the predictor variable(s). Lack of homoscedasticity may result in biased standard errors and *p*-values. GLS allows errors to be correlated and/or have unequal variances as it allows us to model the within-group correlation and/or heteroscedasticity structure and therefore results in more accurate *p*-values. While it is possible to use variance stabilizing transformations to satisfy the constant variance assumptions in OLS models, it is generally not recommended since potentially important characteristics of the sampled data are removed (Bolker *et al.* 2009; Zuur *et al.* 2009; Feng *et al.* 2014). Instead, choosing the appropriate statistical tool would be more appropriate. In R, GLS models can be run using the *gls()* function of the nlme package (Pinheiro *et al.* 2020).

#### Violation of the assumption of normality

Quite often in environmental research the distributions of the collected sample data do not appear to be normally distributed at each value or at each level of the predictor variable. In fact, we often have only one observation at each value of a continuous predictor variable and cannot say anything about the normality assumption. Hence, in order to assess whether the data meets this assumption, we need to fit a model first and then inspect the distribution of the pooled residuals. Quite frequently those distributions look left or right skewed, or show other ‘non-normal’ attributes. Sometimes transforming the data (i.e., performing a mathematical operation on the data, such as taking the logarithm) can force data to comply with OLS model assumptions; however, by altering the original scale of the data, important information about the sample may be removed and any subsequent statistical tests are only valid on the transformed scale (Feng *et al.* 2014). It can also make the interpretation of results cumbersome and unintuitive (e.g., reporting significant differences in ‘log transformed metal concentration’). In some cases, data transformations may even be incalculable, such as when many zeros are part of the data (Bolker *et al.* 2009).

Generalized linear models (GLMs) offer an alternative as they accommodate response variables with non-normal error distributions (Nelder & Wedderburn 1972; McCullagh & Nelder 1989; Faraway 2016; Dobson & Barnett 2018). The functional differences between OLS models and GLMs are (a) *the response variable in GLMs can take on any distribution that is a member of the exponential family of probability distributions* and (b) *GLMs use a so-called link function, which describes how the mean of that response distribution and a linear combination of the predictors are related* (Faraway 2016).

Two very common GLMs are known as logistic regression, when observed data are binary, and Poisson regression, when observed data represent counts. Another very useful GLM, especially when data are continuous with no upper bound but truncated at zero, such as in concentrations, heights, and weights, is the gamma GLM. The gamma distribution is a very flexible continuous probability distribution that accommodates data with a wide variety of shapes. In R, GLMs can be run using the *glm()* function.

Lastly, another very useful technique is called beta regression. Beta regression can accommodate datasets containing proportion and percentages and as such provides support for data that naturally fall in between zero and one. In R, beta regression can be run using the *betareg()* function of the betareg package (Cribari-Neto & Zeileis 2010).

#### Violation of the assumption of linearity

It is common in nature for observations to not follow linear trends. In these cases, neither OLS models nor GLMs are appropriate analytical tools. In these cases, non-linear least squares (Baty *et al.* 2015) or generalized additive modeling represent suitable alternatives (Zuur & Ieno 2016; Stasinopoulos *et al.* 2017; Wood 2017). Generalized additive models (GAMs) are GLMs with a linear predictor involving one or more smoothing functions at each covariate. Smoothing functions allow relationships within GAMs to be sinuous or wiggly, and thus permit great flexibility for modeling non-linear relationships. The risk when using non-linear or generalized additive models, however, comes in the form of overfitting, i.e., when too many smoothing functions are added into the model. The general problem with overfitting is that the model becomes less and less useful in generalizing the results and predicting future observations. In R, GAMs can be run, for example, using the *mgcv* package (Wood 2017).

#### Violation of the assumption of independence

Very often in field based or environmental research the assumption of independence cannot be satisfied. In this case, multilevel models are the most appropriate choice as they allow us to account for temporally and spatially repeated measurements as well as for unbalanced datasets, i.e., when the target variable has more observations in one specific class than the others. Furthermore, they enable assessments of variation among plants, animals, ecosystems, and other groupings or clusters, and do not require data averaging before conducting the analysis. Multilevel regression models represent by far the most useful models in science. In fact, McElreath (2020) argues that ‘[…] multilevel regression deserves to be the default form of regression. Papers that do not use multilevel models should have to justify not using a multilevel approach’ (p. 15). Multilevel models such as LMMs and GLMMs require the inclusion of random effects and optionally fixed effects (Pinheiro & Bates 2000; Bolker *et al.* 2009). Fixed effects are those variables for which parameters such as the mean or the slope are estimated, while random effects are grouping terms that account for the dependency structure in the data and estimate the variation within those groups. In R, such models can be run for example using the *lme4* package (Bates *et al.* 2015) and the function *lmer()* and *glmer()* or the *glmmTMB* package (Brooks *et al.* 2017) with the *glmmTMB()* function. The *glmmTMB()* function also allows us to specify various covariance structures to deal with autocorrelation (https://cran.r-project.org/web/packages/glmmTMB/vignettes/covstruct.html).

#### Other challenges in the analyses of water quality data

While solutions to violations of statistical assumptions are available, other challenges can also affect the availability of techniques. A common problem in water quality analyses is the occurrence of censored observations and many censoring limits (Helsel 2006, 2012). Some approaches to overcome these limits include non-parametric techniques, but robust approaches which can address all possible constraints common in water quality datasets are not widely available, although they can be combined with other techniques to identify changes in environmental indicators (Arciszewski *et al.* 2018). Similar to the selection of techniques described above, the use of techniques for censored data is based on the attributes of the data and the desires of the investigators.

## PRACTICAL EXAMPLE ON SURFACE WATER QUALITY IN THE ATHABASCA OIL SANDS

To demonstrate the issues with respect to statistical assumptions when analyzing environmental data, we chose a multi-year, multi-station surface water quality monitoring project along the Athabasca River in northeastern Alberta. The goal of this monitoring project was to assess surface water quality (including spatial and temporal trends) in relation to potential impacts of oil-sands mining activities (Glozier *et al.* 2018). Multiple water samples were taken at six sampling stations along the Athabasca River (Figure 3) over an 8-year period. The experimental design could be described as a crossed-nested multilevel design with six sampling stations along the Athabasca River sampled over a time period from 2011 to 2018 (i.e., crossed) and sampling dates (month-day) nested in years. Our hypothetical research question was whether there are significant differences in vanadium (V) concentrations along the sampling stations and among years. In the Oil Sands Region, the mobilization of V is associated with both industrial development and natural exposures (Glozier *et al.* 2018; Gopalapillai *et al.* 2019) and is suitable and useful to illustrate the approaches advocated here.

The data are publicly available (https://data.ec.gc.ca/data/substances/monitor/surface-water-quality-oil-sands-region/mainstem-water-quality-oil-sands-region). From there we downloaded the following file: ‘MainstemWaterQuality-LTWQM-M2-M7-Metals45-EPA-2011-2018-v2’ and prepared it for data analysis in R (R Core Team 2021). All R scripts for data preparation, descriptive statistics as well as the statistical analyses can be found in Supplementary Material S1. All analyses were conducted in R version 4.1.2 (2021-11-01) – ‘Bird Hippie’ with packages updated on Jan 4, 2022.

The first step in our example is to visualize the raw data (Figure 4). Visualizing the raw data can help determine which statistical analyses might be most appropriate and whether there are any underlying problems in the data. Figure 4(a) shows the data points across years and grouped by sampling stations. To quickly check whether the data show any signs of seasonality we uniquely colored the points representing the quarters, i.e., Jan to Mar, Apr to Jun, Jul to Sep, and Oct to Dec. This information could be helpful at the data analysis stage, for example by identifying additional model terms that potentially explain variation due to seasonality and hence improve the overall fit of the model. In the context of water quality in rivers, such variables could be water displacement and water depth. These data suggest that a generic variable, such as season or quarter, can be used to account for effects of flow where no stream gauges are present.

Figure 4(b) visualizes the data using boxplots. Boxplots are an effective way to check the variation and shape of the data. At this point, some expected patterns emerge from the box plots, such as increases in concentrations between M2 and M7 in 2016, but it also becomes clear that this dataset has various problems such as missing data, skewed data, as well as unbalanced data (see also Table 1). Figure 4(c) shows yet another visualization of the raw data, which are known as density curves. Density curves can help identify skewness, multimodality and other issues with respect to the data distribution and hence can help choose the appropriate distribution for the final statistical model.

Year . | M2 . | M3 . | M4 . | M5 . | M6 . | M7 . | Totals (years) . |
---|---|---|---|---|---|---|---|

2011 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |

2012 | 0 | 3 | 9 | 10 | 3 | 3 | 28 |

2013 | 0 | 4 | 9 | 9 | 4 | 6 | 32 |

2014 | 0 | 8 | 8 | 9 | 4 | 8 | 37 |

2015 | 5 | 9 | 2 | 4 | 2 | 9 | 31 |

2016 | 3 | 7 | 4 | 0 | 4 | 3 | 21 |

2017 | 8 | 10 | 2 | 0 | 2 | 9 | 31 |

2018 | 3 | 3 | 0 | 0 | 0 | 3 | 9 |

Totals (location) | 19 | 44 | 35 | 33 | 19 | 41 |

Year . | M2 . | M3 . | M4 . | M5 . | M6 . | M7 . | Totals (years) . |
---|---|---|---|---|---|---|---|

2011 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |

2012 | 0 | 3 | 9 | 10 | 3 | 3 | 28 |

2013 | 0 | 4 | 9 | 9 | 4 | 6 | 32 |

2014 | 0 | 8 | 8 | 9 | 4 | 8 | 37 |

2015 | 5 | 9 | 2 | 4 | 2 | 9 | 31 |

2016 | 3 | 7 | 4 | 0 | 4 | 3 | 21 |

2017 | 8 | 10 | 2 | 0 | 2 | 9 | 31 |

2018 | 3 | 3 | 0 | 0 | 0 | 3 | 9 |

Totals (location) | 19 | 44 | 35 | 33 | 19 | 41 |

Note: Within each sampling date, multiple transect measurements were taken.

Lastly, since the data show nesting structures (e.g., sampling dates nested in years) as well as multiple measurements along a transect (potential violation of the assumption of independent sampling), it becomes clear that analyzing this dataset using an OLS approach, e.g., ANOVA, linear regression, etc., is not advisable. Instead, multilevel models should be employed (Zuur *et al.* 2009). For our example, we decided to use a GLMM since it can account (a) for the nesting structure, (b) for the dependency structure, and (c) for the non-normal distribution of the data. For this model, we used the gamma distribution since it is naturally truncated at zero, which makes sense since we have concentration data that cannot be negative. One frequently observed problem when using OLS techniques on data that cannot be negative, e.g., biomass, height, or concentrations, is that confidence intervals can extend into the negative. This is not possible when using the gamma distribution. Another benefit of the gamma distribution is its ability to model highly left as well as right skewed data.

*glmmTMB()*function from the

*glmmTMB*package (Brooks

*et al.*2017) to model a GLMM (Table 2). We assumed the data to be gamma distributed with the distribution parameters (Faraway 2016, p. 175). We modeled the expected value

*E*of concentration as using the log-link function. The variance () was modeled as . The fixed effects were location and year and the random effect was sampling date, representing a unique ID of 191 combinations of sampling dates and sampling stations. The statistical model looks as follows:where

*concentration*represents the vanadium concentration for

_{ijk}*Station*,

_{k}*Year*, and

_{j}*sampling_date*.

_{i}. | Concentration . | ||
---|---|---|---|

Predictors . | Estimates . | CI . | p
. |

(Intercept) | 0.14 | 0.07–0.29 | <0.001 |

station id [M3] | 1.06 | 0.82–1.37 | 0.65 |

station id [M4] | 1.14 | 0.85–1.53 | 0.375 |

station id [M5] | 1.05 | 0.78–1.42 | 0.744 |

station id [M6] | 1.13 | 0.83–1.55 | 0.434 |

station id [M7] | 1.02 | 0.78–1.32 | 0.899 |

yr [2012] | 1.59 | 0.82–3.08 | 0.168 |

yr [2013] | 2.16 | 1.12–4.18 | 0.022 |

yr [2014] | 1.88 | 0.97–3.62 | 0.061 |

yr [2015] | 1.8 | 0.92–3.51 | 0.086 |

yr [2016] | 2.77 | 1.40–5.47 | 0.003 |

yr [2017] | 2.07 | 1.05–4.06 | 0.035 |

yr [2018] | 1.37 | 0.66–2.82 | 0.399 |

Random Effects | |||

τ_{00}_{unique_id} | 0.21 | ||

N_{unique_id} | 191 | ||

Observations | 588 |

. | Concentration . | ||
---|---|---|---|

Predictors . | Estimates . | CI . | p
. |

(Intercept) | 0.14 | 0.07–0.29 | <0.001 |

station id [M3] | 1.06 | 0.82–1.37 | 0.65 |

station id [M4] | 1.14 | 0.85–1.53 | 0.375 |

station id [M5] | 1.05 | 0.78–1.42 | 0.744 |

station id [M6] | 1.13 | 0.83–1.55 | 0.434 |

station id [M7] | 1.02 | 0.78–1.32 | 0.899 |

yr [2012] | 1.59 | 0.82–3.08 | 0.168 |

yr [2013] | 2.16 | 1.12–4.18 | 0.022 |

yr [2014] | 1.88 | 0.97–3.62 | 0.061 |

yr [2015] | 1.8 | 0.92–3.51 | 0.086 |

yr [2016] | 2.77 | 1.40–5.47 | 0.003 |

yr [2017] | 2.07 | 1.05–4.06 | 0.035 |

yr [2018] | 1.37 | 0.66–2.82 | 0.399 |

Random Effects | |||

τ_{00}_{unique_id} | 0.21 | ||

N_{unique_id} | 191 | ||

Observations | 588 |

CI=95% confidence interval, *p*=*p*-value.

Significant p-values are indicated in bold (α = 0.05).

The model results were significant for *Year* but not for *Station* (Table 3). The next step is to check the fit and usefulness of the model given the underlying data. This assessment was done using the *DHARMa* package (Hartig 2021). The ‘DHARMa’ package uses a simulation-based approach to create readily interpretable scaled (quantile) residuals for fitted (generalized) linear mixed models. The default output of the *simulateResiduals()* function is a plot with two panels. The first (left) panel is a QQ plot to detect overall deviations from the expected distribution. If the chosen distribution provides a perfect fit, all data points would fall on the red line. The second (right) panel plots the residuals against the predicted values. If no residual patterns are detected, the model fits the data accurately only leaving random errors. If residual patterns are present (indicated by red wiggly lines), removing or adding model terms (fixed effects and/or random effects) can help capture unexplained systematic variation. Figure 5 shows the residual plot for our model. Both the QQ plot residuals, as well as the residuals vs. the predicted values, showed significant deviations from what would be expected if the model would adequately describe the data. This information needs to be kept in mind when judging the usefulness of the model as well as when interpreting significant differences and *p*-values.

. | Analysis of Deviance Table (Type II Wald Chi-Square tests) . | ||
---|---|---|---|

. | Response: concentration . | ||

. | Chisq . | df . | Pr(>Chisq) . |

station_id | 1.613 | 5 | 0.8997 |

yr | 28.098 | 7 | 0.0002 |

. | Analysis of Deviance Table (Type II Wald Chi-Square tests) . | ||
---|---|---|---|

. | Response: concentration . | ||

. | Chisq . | df . | Pr(>Chisq) . |

station_id | 1.613 | 5 | 0.8997 |

yr | 28.098 | 7 | 0.0002 |

Significant p-values are indicated in bold (α = 0.05).

The next step was to calculate mean values based on the fitted model. This can be done using estimated marginal means. Those mean values are also known as lsmeans (or least squares means) in other statistical software packages. In R, estimated marginal means were calculated using the *emmeans* package and the *emmeans()* function (Lenth 2020). The next step was to add letters to the mean values using the *cld()* function of the *multcomp* package to indicate significant differences for the factor *Year* (Figure 6).

Next, we also fitted the same model as a Bayesian model using the *brms* R package and the *brm()* function (Bürkner 2017, 2018; Table 4 and Figure 7). The *brms* package provides straightforward tools for fitting Bayesian models, following the identical model specification syntax as in *lme4* (Bates *et al.* 2015) or *glmmTMB* (Brooks *et al.* 2017). Since this an illustrative example only, we used flat (uninformative) priors for this analysis, hence both models resulted in almost identical outcomes. Table 5 shows the default flat priors used by the *brm()* function given the model specification. This information can be requested using the *get_priors()* function and modified in accordance with the investigators' prior beliefs. Furthermore, choosing minimally informative priors over flat priors is generally recommended when running a Bayesian analysis (Lemoine 2019). Model validation in a Bayesian context is done by inspecting the posterior distributions, MCMC (Markov chain Monte Carlo) chains as well as by conducting posterior predictive checks. Functions for conducting these model checks are available in the *brms* package, such as *plot()* and *pp_check()*.

Concentration . | ||
---|---|---|

Predictors . | Estimates . | CI (95%) . |

Intercept | 0.14 | 0.07–0.30 |

station_id: M3 | 1.05 | 0.81–1.41 |

station_id: M4 | 1.14 | 0.82–1.58 |

station_id: M5 | 1.05 | 0.76–1.43 |

station_id: M6 | 1.14 | 0.82–1.66 |

station_id: M7 | 1.01 | 0.76–1.31 |

yr: yr2012 | 1.59 | 0.81–3.14 |

yr: yr2013 | 2.17 | 1.12–4.31 |

yr: yr2014 | 1.89 | 0.96–3.71 |

yr: yr2015 | 1.81 | 0.94–3.62 |

yr: yr2016 | 2.69 | 1.38–5.73 |

yr: yr2017 | 2.06 | 1.04–4.16 |

yr: yr2018 | 1.38 | 0.65–3.03 |

Random Effects | ||

σ^{2} | 0.02 | |

τ_{00} | 0.01 | |

ICC | 0.76 | |

N_{unique_id} | 191 | |

Observations | 588 | |

Marginal R^{2}/Conditional R^{2} | 0.177/0.906 |

Concentration . | ||
---|---|---|

Predictors . | Estimates . | CI (95%) . |

Intercept | 0.14 | 0.07–0.30 |

station_id: M3 | 1.05 | 0.81–1.41 |

station_id: M4 | 1.14 | 0.82–1.58 |

station_id: M5 | 1.05 | 0.76–1.43 |

station_id: M6 | 1.14 | 0.82–1.66 |

station_id: M7 | 1.01 | 0.76–1.31 |

yr: yr2012 | 1.59 | 0.81–3.14 |

yr: yr2013 | 2.17 | 1.12–4.31 |

yr: yr2014 | 1.89 | 0.96–3.71 |

yr: yr2015 | 1.81 | 0.94–3.62 |

yr: yr2016 | 2.69 | 1.38–5.73 |

yr: yr2017 | 2.06 | 1.04–4.16 |

yr: yr2018 | 1.38 | 0.65–3.03 |

Random Effects | ||

σ^{2} | 0.02 | |

τ_{00} | 0.01 | |

ICC | 0.76 | |

N_{unique_id} | 191 | |

Observations | 588 | |

Marginal R^{2}/Conditional R^{2} | 0.177/0.906 |

CI=95% credible interval.

Prior . | class . | coef . | Group . | source . |
---|---|---|---|---|

(flat) | b | default | ||

(flat) | b | station_idM3 | (vectorized) | |

(flat) | b | station_idM4 | (vectorized) | |

(flat) | b | station_idM5 | (vectorized) | |

(flat) | b | station_idM6 | (vectorized) | |

(flat) | b | station_idM7 | (vectorized) | |

(flat) | b | yr2012 | (vectorized) | |

(flat) | b | yr2013 | (vectorized) | |

(flat) | b | yr2014 | (vectorized) | |

(flat) | b | yr2015 | (vectorized) | |

(flat) | b | yr2016 | (vectorized) | |

(flat) | b | yr2017 | (vectorized) | |

(flat) | b | yr2018 | (vectorized) | |

student_t(3, −1.3, 2.5) | Intercept | default | ||

student_t(3, 0, 2.5) | sd | default | ||

student_t(3, 0, 2.5) | sd | unique_id | (vectorized) | |

student_t(3, 0, 2.5) | sd | Intercept | unique_id | (vectorized) |

gamma(0.01, 0.01) | shape | default |

Prior . | class . | coef . | Group . | source . |
---|---|---|---|---|

(flat) | b | default | ||

(flat) | b | station_idM3 | (vectorized) | |

(flat) | b | station_idM4 | (vectorized) | |

(flat) | b | station_idM5 | (vectorized) | |

(flat) | b | station_idM6 | (vectorized) | |

(flat) | b | station_idM7 | (vectorized) | |

(flat) | b | yr2012 | (vectorized) | |

(flat) | b | yr2013 | (vectorized) | |

(flat) | b | yr2014 | (vectorized) | |

(flat) | b | yr2015 | (vectorized) | |

(flat) | b | yr2016 | (vectorized) | |

(flat) | b | yr2017 | (vectorized) | |

(flat) | b | yr2018 | (vectorized) | |

student_t(3, −1.3, 2.5) | Intercept | default | ||

student_t(3, 0, 2.5) | sd | default | ||

student_t(3, 0, 2.5) | sd | unique_id | (vectorized) | |

student_t(3, 0, 2.5) | sd | Intercept | unique_id | (vectorized) |

gamma(0.01, 0.01) | shape | default |

The *brms* package does not require to specify the priors. In this case, it uses default priors given the specified model. This output can be generated using the *get_priors()* function.

In both analyses, we found similar point estimates and 95% confidence/credible intervals for vanadium concentration for sampling station and year (Figures 6 and 7). The frequentist approach suggested that there are significant differences between sampling years but not sampling stations (Figure 6). The Bayesian analysis does not provide frequentist *p*-values and significant differences. Instead, it generates posterior probability distributions for all model terms, which contain all possible vanadium concentrations given the data and prior information. This allows the researcher to make probabilistic statements about hypotheses given the data, such as ‘what is the probability of observing a specific metal concentration for a given sampling station?’. This is in contrast to frequentist statistics where statements are made about the probability of observing the collected data given a null hypothesis (see Section 3.1). It should also be noted that the interpretation of the aforementioned confidence and credible intervals for frequentist and Bayesian inference is different. In a frequentist setting, a confidence interval represents one possible interval that either contains the true population parameter or not. A 95% confidence interval means that (on average) out of 100 identically repeated experiments, 95 calculated confidence intervals will contain the true population parameter (see simulation in Supplementary Material S3). On the other hand, the Bayesian 95% credible interval contains all possible vanadium concentrations at a given sampling station with 95% probability. This is possible since Bayesian analysis treats parameters as random variables and computes posterior probability distributions, which include all possible values of that parameter given the data. To learn more about how to report Bayesian data analyses in scientific publications, please refer to the Bayesian analysis reporting guidelines by Kruschke (2021).

Lastly, we would also note that there may be cases where datasets simply cannot be meaningfully analyzed even using the most sophisticated inferential statistical techniques due to various problems with the underlying data collection. In such instances, a thorough descriptive statistical approach, including raw data visualizations, may still be very useful in extracting important information.

## CONCLUSION

In this study, we conducted a scoping review of literature to quantify the number of statistical techniques used in water quality (WQ) research and identify patterns over time. We further built on the results and highlighted useful statistical techniques for water quality assessment with simple guidance for practitioners.

The most frequently used statistical approaches included simulation and forecasting methods, correlation, linear regression, and PCA. Simulation techniques do not provide direct information on water quality, but rather provide results and trends in the context of water processes and functioning of water systems. Similarly, PCA provides descriptive rather than inferential information and is most commonly used in exploratory data analysis as well as being a tool for dimensionality reduction. Correlation and regression may appear more appealing techniques to infer water quality, however, field data typically are not normally distributed or independent. As a result, the estimates and conclusions may be biased.

Consequently, multilevel models that do not require balanced data and that allow explicit modelling and analysis of between- and within-individual variation are more appropriate because they better address analyses of datasets when traditional statistical assumptions are not met, which is frequently observed in ecological research. Moreover, Bayesian inference may be suitable for decision making especially when there is prior information available on datasets that are being analyzed.

When inappropriate or suboptimal methods are applied to even the most robust datasets, consequences may include drawing false conclusions, including missing environmentally critical triggers or changes. In agreement with recommendations of McElreath (2020), we suggest that multilevel models in ecological datasets, including water quality analyses, should be the default statistical approach.

## ACKNOWLEDGEMENTS

The authors acknowledge the financial support of the Oil Sands Monitoring Program (OSM). While this work was funded by OSM, it does not necessarily reflect the position of the Program. We thank Eleanor Stern for abstract and full-text screening for the scoping review. We also want to thank Dr Florian Hartig for comments on an earlier version of this manuscript.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## REFERENCES

*Wiley Series in Statistics in Practice*(M. Scott & V. Barnett eds.)

*Statistics for Biology and Health*(M. Gail, K. Krickeberg, J. M. Samet, A. Tsiatis & W. Wong, eds.)