## Abstract

A non-linear surrogate model of nitrate concentration in the Kansas River (USA) is described. The model is an (almost) Piece-wise Linear response surface that provides a mean field approximation to the dynamics of the measured data for nitrate plus nitrite (target product) correlations to turbidity and chlorophyll-a concentrations (input variables). The method extends the United States Geological Survey’s linear procedures for surrogate data modeling allowing for better approximations for river systems exhibiting algal blooms due to nutrient-rich source waters. The model and visualization procedures illustrated in the Kansas River example should be generally applicable to many medium-size rivers in agricultural regions.

## HIGHLIGHT

Non-linear surrogate modeling of nutrients is illustrate in the Kansas River that uses commonly available sensors, and initial nitrate plus nitrate calibration data, to enable future estimations of nutrients using only the turbidity and chlorophyll-a sensors.

## INTRODUCTION

Surrogate data models have been successfully employed to infer nutrients such as nitrate and phosphorus from high-frequency (HF) time series measurements such as turbidity, discharge, and chlorophyll concentrations (Rasmussen *et al.* 2005; Williams 2021). For instance, in the Kansas River, nitrate concentrations, near the intake of the source waters for the city of Lawrence, KS, have been estimated using a log-transformed linear model with input variables of HF time series consisting of chlorophyll, turbidity, and seasonality (Foster & Graham 2016). Linear models provide useful correlations for predicting target water quality quantities, but it is also appreciated that many of the correlations between input and output variables are best described by non-linear functions (Yang & Moyer 2020). The expectation, therefore, is that non-linear modeling in surrogate data applications will more accurately track the target variables. Non-linear modeling, though, can introduce thorny issues including concerns about introducing model bias by assuming a (global) non-linear model structure that, may or may not, be inherent in the data. Additionally, topics such as over-fitting (typically addressed by ‘regularization’), and sensible model extrapolations need to be considered.

All three of these topics – bias, regularization, and extrapolation – are often addressed in a non-linear setting by using the data itself to best inform a solution. One route in this direction is Bayesian inference methods and black box models (e.g., neural nets) that make weak prior assumptions. An alternative approach is the use of a domain expert to produce a parametrized model of the data with strong priors. In operational applications, black box models can be less than ideal because of their lack of transparency to the underlying biogeochemical processes. Expert models are more compelling in that regard but introduce strong priors that might not accurately describe the relevant correlations. Nevertheless, when models are connected to the physical process this inspires more confidence in practical applications, such as monitoring water quality.

In this paper, we address the points mentioned above in the context of a specific water quality surrogate model application. We examine the HF time series from the Kansas River at the USGS De Soto, Kansas, gauging station which has a long-term record of more common water quality variables, such as turbidity and chlorophyll-a concentrations, and a less common HF monitoring record for nitrate. The site is one of many where the USGS engages in estimating water quality variables following a detailed and documented surrogate modeling procedure developed and validated by the USGS over two decades (Rasmussen *et al.* 2009). Here, we illustrate how to augmente the linear surrogate modeling procedure with the introduction of an (almost) Piece-wise Linear (aPL) model that captures the global non-linearity of the data.

This surrogate modeling example sits midway between the black box and expert modeling approaches. It best falls under the rubric of manifold learning (Van Der Maaten *et al.* 2009). When embedded in a multivariate space feature space the data reveals a shape, or geometric structure, that is connected to physical processes. A backbone for this geometrical arrangement in the De Soto example can be captured with a smooth surface in two dimensions (or more generally a manifold in multiple dimensions) and serves as a useful starting point for modeling both local and global relations. The global geometric arrangement of the data points is non-linear, and some large model variances, observed in a linear model, is due to projections of data clusters that are not one-to-one (the projection function is not injective) – and the variance caused by this noninjective maps are reduced by a projection onto a non-linear manifold. Additionally, capturing the global non-linear geometry of the data serves as a good starting point to decompose the problem into subdomains – identified by clusters – which can be modeled individually. Thus, in some instances, it is best to check for, and model, any global non-linear data set organization as a first step to arriving at a parsimonious model with minimal variance. In addition to its research content, this paper is also intended as a tutorial on how to approach surrogate data modeling of water quality indicators from a manifold learning perspective.

## METHODS

### De Soto, Kansas: study area and high-frequency USGS data

The USGS station at De Soto, Kansas (Station ID: 06892350; Lat: 38.982, Lon: -94.964) is approximately 32 km downstream from Lawrence, KS, and has a large set of *in situ* water quality sensors including turbidity, chlorophyll-a, and nitrate plus nitrides () dating back to 2014 and earlier. When we refer to nitrate, nitride, and chlorophyll in this paper, we are actually referencing the quantity measured, () and chlorophyll-a concentration. USGS measurements are recorded every 15 min. Data from 2014 to 31 March 2023 had discharges in the range of to , turbidity between 0 and 1,500 FNU, chlorophyll-a concentrations between 0 and 200 , and nitrate plus nitride concentrations between 0 and 7 .

### Comment on remote sensing application and variable selection

Given the temporally dense time series record, it is natural to consider the state dependent auto-regressive (i.e., ‘dynamic’) models for target variables as was recently done to predict nitrate concentrations (Di Nunno *et al.* 2022). However, our ultimate aim in this paper is to develop a model that can be used with satellite data. For that, we need a (non-state-dependent, i.e., ‘static’) response surface model. Satellite data have extensive spatial coverage but intermittent temporal sampling due to variable atmospheric conditions (e.g., clouds), and sparse sampling (typically with less than daily overpasses). Thus, in this study, we only considered time-independent models with a choice of input variables determined by what is readily available from water quality satellite remote sensing products. The application of the surrogate model developed in this paper is described in a companion paper using data from the satellite sensor Sentinel-2 (Tufillaro *et al.* 2024).

### Data preprocessing: subsetting and normalization

The *in situ* USGS data at De Soto, KS, have temporal data gaps that can last from hours to months. For this study, we aggregated the data into gap-less bundles. Starting with the full data record from 1 January 2014 till 31 March 2023, we first select a set of variables for the surrogate model, typically turbidity, chlorophyll concentration, and discharge as the input variables, and nitrate concentration as the target variable. If gaps in this data are less than 3 h (12 samples) then the gaps are filled in by linear interpolation. The resulting data record is then broken into gapless subsets. All subsets with a gapless record length of less than 2 days (192 samples) are deleted. The remaining records are averaged into 6-h intervals (24 samples), and this final collection of gapless data records is used for further analysis and modeling. The resulting data set contains 52 gapless data subsets with record lengths varying from 32 (2 days) to 2916 (182 days). The total data set length is 9,279 points, or (6.3 years).

### Data visualization

### Identification of data geometry with biogeochemical processes

*et al.*2022). As the spring, summer, and fall progress a sequence of blooms oscillates between low and high nitrate states swinging back and forth in the upper right half of the plane in Figure 3, but generally avoiding the moderated nitrate states (low chlorophyll and turbidity) characteristic of winter nitrate levels. This overall seasonal bloom pattern is sometimes described as hysteresis (Burns

*et al.*2019).

It is also observed that the ‘variance’ of a model prediction can be expected to change by considering different projections of a co-domain into the domain. For instance, in Figure 2, if we consider a model of nitrate using only turbidity as an input variable, then any model prediction will have a high variance for moderate values of turbidity (2–4 log(FNU)) since the projection of a point onto the horizontal axis (turbidity) cannot distinguish data points occurring during the winter nitrate state (yellow-orange cluster in Figure 2) and a spring bloom (blue-cyan cluster in Figure 2), so points close together in the domain can take on a wide range of nitrate values simply due to the geometric projection resulting from this choice of embedding. Informally, we say the model embedding has low injectivity to indicate that a partial source of variance is geometric, and not intrinsic data noise.

There are two ways to improve the low injectivity – increase the input dimension, and consider alternative (possibly non-linear) embeddings or projections that better capture the shape of the data. Generally increasing the dimension helps, but also leads to an (exponentially) less dense sampling of the model space – the so-called curse of dimensionality. The search for non-linear low-dimensional models with lower variance (informally, higher injectivity) is an aim of manifold learning.

The linear surrogate data model has a high variance in part because it is projecting the data onto a linear subspace. If we are able to create a surface (a two-dimensional non-linear manifold, or non-linear subspace) that goes through the mean of the data defined locally, than we can reduce the variance of the surrogate model in two ways: first, by passing more closely through the local mean, we can often remove ambiguities arising from the overlap of projections passing through multiple data clusters, and second, models built from projecting onto this non-linear surface can be closer to a normal distribution. Informally, we refer to this as a mean field model.

The non-linear shape of the data reflects two processes moderating the nitrate concentration. First, (in log space) the nitrate concentration varies approximately linearly with turbidity (and sediment concentration), but non-linearly with phytoplankton assimilation. That is, the sharp non-linear step in the shape of the data is associated with the increase (or decrease) of nitrate assimilation by phytoplankton (see Figure 4 with chlorophyll values between 3.5 and 4.5). Thus, linear surrogate models are adequate for winter nitrate variations but are less than ideal for the spring, summer, and fall, when phytoplankton assimilation kicks in.

### aPL model of data geometry

To include the shape of the data in the modeling process we consider the data from a manifold learning point of view. Roughly, the aim of manifold learning is to find a projection, and change of coordinates, which both reduces the dimensionality of the data model, and also, in a local mean sense, straightens out the data. Mathematically speaking, manifolds need to be ‘smooth’ objects, which means all their derivatives exists – a (two–dimensional) sphere is smooth, but a two-dimensional cone is not smooth and hence is not a manifold since, at its tip, a local derivative is not well defined. Here we describe the construction of a non-linear surface which is smooth (so technically a manifold) but numerically stiff. It is essentially a spline, but spline constructions are typically not smooth surfaces (i.e., there are discontinuities in some of their derivatives).

The constants are estimated using non-linear optimization, specifically the *fitnlm* function in MATLAB (Seber & Wild 2003). Non-linear optimization requires a choice of starting initial conditions, and a good way to pick those is to estimate an initial set of constants for the model with a curve separating the two clusters, and an approximate mean value for the response surface in each cluster.

### Regression and error metrics

Two metrics are used to evaluate the model quality. The *coefficient of determination*, , is an indicator of the how much of the variance in the dependent variable is explained by the (multiple) independent variables. This is provided for the linear surface fits in Figure 4 and the aPL surface fit in Figure 6. A value of indicates that the model is no better than using the mean value for estimation; a indicates perfect correlation. Since we are interested in how well the target variable of nitrate is predicted, we also present a *one-to-one* line () and its single variable linear *goodness-of-fit* metric (), the difference between the model prediction value (horizontal axis) and the measured value (vertical axis). The goodness-of-fit is a useful metric to consider when estimating daily or seasonal nitrate loads. Neither metric indicates causation, only a correlation between the independent and dependent variables.

## RESULTS AND DISCUSSION

The increase in model quality from a linear model to a non-linear (i.e., almost piece-wise linear model) is indicated by the metrics: The increase from 0.66 to 0.75, and the goodness of fit () increase from 0.55 to 0.68.

*out-of-sample*predictions. Figure 10 shows a time series of the results for a training set to early 2020, the model prediction for the training data is indicated in blue. The red data points indicate the model prediction for the out of sample data from early 2020 through to March 2023. The goodness of fit parameter is = 0.65 in each time series. That the values are identical is coincidence, but the fact that the modeled times series looks similar before and after 2020, and that the metrics are close, indicates that model performs well on unseen future nitrate states.

Also, Figure 10 shows that the model does not predict extreme events well. The model predictions on each cluster are limited by the linear assumption of the model on each domain, and the variability permitted by the parameter fit to this locally planar model. To handle extreme events we next develop a model specific to the high nitrate (domain ) cluster. A good approach to modeling data on each cluster is the use of additional variables such as discharge, or a seasonal variable, to further disentangle the observed nitrate values.

Figure 11 shows an example surface fit for predicting nitrate in the high nitrate cluster. The fit indicates that the maximum value of nitrate – on average – occurs late spring-early summer with a discharge in the vicinity of 9.5 (13,000 – that is, higher and lower values of discharge tend to result in lower nitrate values. Further, nitrate values on the high nitrate cluster are suppressed in the early spring at lower discharge rates. The perhaps nonobvious result, suggested by this particular estimation method, is that (log) of discharge values in the range of 9.5–10 (13,000–20,000 ), on average, are correlated with higher nitrate concentrations at this specific site.

## CONCLUSION

This paper presents a visually rich introduction to surrogate time series methods for estimating nitrate with an emphasis on connecting the geometry observed in the time series data to the underlying biogeochemical processes. The work demonstrates that, during rapid nitrate assimilation by phytoplankton, changes in nitrate concentrations are non-linear relative to variations in turbidity. We present a simple data driven method to model this non-linearity by describing an (almost) Piece-wise linear response surface that provides a mean field approximation to the dynamics of the measured data for nitrate plus nitrite correlations to turbidity and chlorophyll-a concentrations. We argue that one advantage to switching to a coordinate system lying on a mean field approximation surface is that some of the model variance observed in linear models is due to a projections that mix points from different data clusters, and that by considering a projection to a non-linear subspace, the model variance is reduced. The method extends the USGS linear procedures for surrogate data modeling allowing for better approximations for river systems exhibiting algal blooms due to nutrient rich source waters. There is nothing in the model that is unique to the Kansas river compared to similar rivers in the American agricultural midwest. Therefore, the model and visualization procedures illustrated in the Kansas river example should be generally applicable to many medium size rivers in agricultural regions.

## DATA AVAILABILITY STATEMENT

All relevant data are available from an online repositoryor repositories (please ensure the DOI/URL has been provided as a submssion item).

## CONFLICT OF INTEREST

The authors declare there is no conflict.

## REFERENCES

*Logistic and linear regression model documentation for statistical relations between continuous real-time and discrete water-quality constituents in the Kansas River, Kansas, July 2012 through June 2015*. U.S. Geological Survey Open-File Report 2016–1040, p. 27. doi:10.3133/ofr20161040

*Estimation of constituent concentrations, densities, loads, and yields in lower Kansas River, northeast Kansas, using regression models and continuous water-quality monitoring, January 2000 through December 2003*. U.S. Geological Survey Scientific Investigations Report 2005–5165

*Ecohydrology*, e2631. https://doi.org/10.1002/eco.2631

*Linear regression model documentation and updates for computing water-quality constituent concentrations or densities using continuous real-time water-quality data for the Kansas River, Kansas, July 2012 through September 2019*U.S. Geological Survey Open-File Report 2021–1018, p. 18. https://doi.org/10.3133/ofr20211018">https://doi.org/10.3133/ofr20211018 (see Appendix 18 for USGS linear surrogate models of Total Nitrogen at the De Soto, KS site)