## Abstract

The present understanding of how changes in climate conditions will impact the flux of natural organic matter (NOM) from the terrestrial to aquatic environments and thus aquatic dissolved organic carbon (DOC) concentrations is limited. In this study, three machine learning algorithms were used to predict variations in DOC concentrations in an Australian drinking water catchment as a function of climate, catchment and physical water quality data. Four independent variables including precipitation, temperature, leaf area index and turbidity (*n* = 5,540) were selected from a large dataset to develop and train each machine learning model. The accuracy of the multivariable linear regression, support vector regression (SVR) and Gaussian process regression algorithms with different kernel functions was determined using adjusted *R*-squared (adj. *R*^{2}), root-mean-squared error (RMSE) and mean absolute error (MAE). Model accuracy was very sensitive to the time interval used to average climate observations prior to pairing with DOC observations. The SVR model with a quadratic kernel function and a 12-day time interval between climate and water quality observations outperformed the other machine learning algorithms (adj. *R*^{2} = 0.71, RMSE = 1.9, MAE = 1.35). The area under the receiver operating characteristic curve method (AUC) confirmed that the SVR model could predict 92% of the elevated DOC observations; however, it was not possible to estimate DOC values at specific sampling sites in the catchment, probably due to the complex local geological and hydrological changes in the sites that directly surround and feed each sampling point. Further research is required to establish potential relationships between climatological data and NOM concentration in other water catchments – especially in the face of a changing climate.

## Highlights

The application of different machine learning algorithms to quantify NOM in an Australian catchment.

Investigate potential relationships between climatological factors and water quality parameters.

The estimation of DOC concentration of water from simple measurable water quality and climatological variables.

The detection of high DOC concentration events using the machine learning technique.

### Graphical Abstract

## INTRODUCTION

A warming climate is driving a suite of changes in global surface waters, from nutrient, temperature and sediment profiles in Chesapeake Bay, USA (Najjar *et al.* 2010) to the timing and intensity of monsoons in Southeast Asia (Loo *et al.* 2015). Higher temperatures impact water quality in lakes by reducing oxygen concentrations, increasing the release of phosphorus from sediments (Arnell *et al.* 2015) and increasing the concentration of dissolved organic matter (El-Jabi *et al.* 2014). Intense, but less frequent rainfall events will increase sediment loads while decreasing water security in drinking water catchments. These changes will have profound effects on the potable water supplies of current and future generations (Hansen *et al.* 2013).

The character and composition of natural organic matter (NOM) in freshwater ecosystems also appears to respond to variations in climatic conditions (Hejzlar *et al.* 2003; Delpla *et al.* 2009; Zhu *et al.* 2017; Gavin *et al.* 2018; Parr *et al.* 2019). This trend is also observed in Australian catchments with deleterious consequences for the performance of existing treatment infrastructure on more highly variable potable water supplies (Mohiuddin *et al.* 2014). Observed increases in dissolved organic carbon (DOC) concentrations have been linked to changes in climatic conditions (Tranvik & Jansson 2002; Freeman *et al.* 2004; Evans *et al.* 2006). Potential climatic drivers of the upward trends in DOC concentrations have included temperature (Evans *et al.* 2006), soil moisture (SM) (Hudson *et al.* 2003) and precipitation (Erlandsson *et al.* 2008). However, the present understanding of how changes in climatic conditions will continue to impact the flux of NOM from the terrestrial to aquatic environments and thus aquatic DOC concentrations is limited. Hence, establishing potential relationships between climatological data and NOM concentration in water catchment will be critical in planning for the delivery of potable water in a warmer climate.

The link between climate, catchment and water quality is dynamic, complex and with some exceptions not well understood. The expanded use of satellite-derived observations and increases in the type and frequency of water quality measurements has created more data but not necessarily better outcomes for the management of water quality. Consequently, there is an opportunity to apply advanced data-driven modelling techniques to catalogue and interrogate these complex datasets to identify the trends and interdependencies between the variables, and to perform dimensionality reduction of the variables (Lary *et al.* 2016; Ruescas *et al.* 2018). Supervised data-driven machine learning algorithms can potentially map DOC concentration in a catchment to climatological data without attempting to accurately model underlying processes. Machine learning methods have been widely used and achieved promising results in different environmental settings (Kim *et al.* 2014; Park *et al.* 2015; Granata *et al.* 2017; Ruescas *et al.* 2017), with the dataset size varying depending on the type of observation from *n* = 63 (Kim *et al.* 2014) to *n* = 2,255 (Park *et al.* 2015) for analytical measurements and *n* > 20,000 for remote sensing applications (Ruescas *et al.* 2017). The application of data-driven modelling techniques to estimate DOC concentrations has been demonstrated in other studies (Clair *et al.* 1999; Snauffer *et al.* 2018). For example, a neural network model was developed to estimate DOC concentration using hydrological, and climatic variables such as temperature, and total precipitation (Clair *et al.* 1999). In another application, the concentration of coloured dissolved organic matter was determined from remote sensing signals by comparing different machine learning regression approaches such as Gaussian process regression (GPR), support vector regression (SVR), random forest and kernel ridge regression (Ruescas *et al.* 2017, 2018).

The low computational cost and interpretability makes the multivariate linear regression (MLR) algorithm desirable as there are clear trade-offs between model complexity and interpretability. In many cases, particularly where limited data are available, more complex algorithms can achieve high levels of generalisation and prediction accuracy (Khalil *et al.* 2005). Among non-linear approaches, SVR is a nonparametric (i.e. not limited by a functional form) supervised learning algorithm that simultaneously can minimise prediction error and model complexity (Vapnik 2000). In contrast to linear and non-linear machine learning algorithms that are trained from exact values for every parameter in a function, GPR uses a (Gaussian) probability distribution over all possible values that fit the data. Exponential kernel function and squared exponential are widely used choices for covariance kernel functions (Jiang *et al.* 2019). GPR has been applied in regression problems in different environmental fields of studies including prediction of stream water temperature (Grbić *et al.* 2013), groundwater salinity (Lal & Datta 2018) or streamflow forecasting (Sun *et al.* 2014). However, the approach has not been used to explore correlations between DOC and catchment and climate variables.

The prospective application of an appropriate machine learning algorithm to a sufficiently large dataset may enable water utilities to predict what climate conditions are conducive to excursions in NOM in water catchments. The first challenge is to select the appropriate machine learning algorithm and develop a sufficiently large dataset to train and evaluate the performance of each machine learning algorithm. In most cases, the data required can be sourced from routinely collected water quality data, which are often fragmented by water utilities and catchment management authorities and temporally disconnected. An ever-growing library of climatological data is freely available from multiple weather and geographic satellite sources. However, as surface water is influenced by climate conditions through the transfer of momentum and matter, a time gap exists between climatological and water quality data. Moreover, as the catchment area increases, the lag time (here the optimal lagged day) increases (Rostami *et al.* 2018). Therefore, the second challenge is to explore the effects of different statistical pairing of observed DOC concentrations with averaged climatological variables from multiple time steps prior to the water quality observation. Thus, by combining and harmonising temporal differences between fragmented datasets, and training and validating an appropriate machine learning algorithm, it may be possible to identify potential relationships between the climatological and water quality data.

The objective of this paper was to evaluate three commonly used machine learning algorithms, namely MLR, SVR and GPR for the estimation of DOC concentration of water from simple measurable water quality and climatological variables using datasets from an Australian catchment. The secondary aim of this study was to establish new knowledge about the potential relationships between climatological factors and water quality parameters by using the exhaustive feature selection technique. This may enable water utilities to minimise the impact of changes in climate conditions on treatment plant performance and potable water quality.

## METHODOLOGY

### Study area

This study was conducted in the Nepean catchment in Australia (386 km^{2}) over a spatial domain of 33.81°S to 35.09°S and 149.97°E to 151.16°E with an elevation of 130–720 m above sea level (Figure 1), The dominant climate class is temperate, characterised by no dry season, a warm summer and 800–1,600 mm of annual rainfall. The major land cover are trees (82%) with the remaining areas used for rural residential and agricultural purposes, including pasture and cropping (Peel *et al.* 2007b; Sixsmith *et al.* 2015; Office of Environment and Heritage 2017).

### Dataset

The dataset (*n* = 5,540) was compiled from surface water quality observations and climatological and catchment variables derived from satellites, and a land surface model with a temporal range of 15 years and 2 months (1 November 2002 to 1 January 2018) (Table S1, Supporting Information). A pre-processing step was employed to remove inaccuracies, including reading errors, abnormal data and missing values.

#### Water quality data

Water quality data were collected at the Burke River (SP-1), one of three main tributaries to Lake Nepean and Lake Nepean (SP-2) in the Nepean catchment (Figure S1, Supporting Information). Parameters obtained from routine water quality monitoring included pH, water temperature (°C), turbidity (NTU), alkalinity (mg CaCO_{3}/L), true colour (UTC) and DOC concentration (mg/L).

#### Catchment and climate data

The catchment boundary (DEM) was obtained from the Shuttle Radar Topography Mission (SRTM) and the Hydrologically Enforced Digital Elevation Model (DEM-H) Version 1.0 (Read *et al.* 2011); land cover (LC) was obtained from the Dynamic Land Cover Dataset Version 2.1 (Sixsmith *et al.* 2015) in 2012–2013, consisting of 22 LC classes; and climate class (CZ) was obtained from the updated Köppen–Geiger climate classification (Peel *et al.* 2007a) (Table S1). Daily rainfall (mm/day) and air temperature (°C) were obtained from the Australian Water Availability Project (AWAP) dataset, gridded at 0.05° (Raupach *et al.* 2009). Mean temperatures were expressed as arithmetic averages of daily maximum and minimum temperatures.

Daily SM (%) and actual evapotranspiration (ET) (mm/day) were extracted from the Australian Water Resource Assessment Landscape (AWRA-L) dataset, gridded at 0.05° (Frost *et al.* 2016). Leaf area index (LAI) (m^{2}/m^{2}) defined as the one-sided green leaf area per unit ground surface area was calculated from 8-day composite Moderate Resolution Imaging Spectroradiometer (MODIS) measurements from Terra and Aqua satellites products (Didan *et al.* 2015). Raw LAI data were re-projected and resampled to 0.05° × 0.05° grids (identical to the gridded rainfall data) by using the MODIS re-projection tool (Dwyer & Schmidt 2006). Outliers were removed using the Savitzky–Golay filter (Savitzky & Golay 1964) in the TIMESAT software package (Jönsson & Eklundh 2004). The filtered 8-day LAI maps were linearly interpolated to a daily time scale for consistency with the other parameters in the dataset.

### Data processing

Water quality data were collected on different time steps, and the dates of sample collection were not consistent. When multiple samples were taken within a day, the corresponding measurements were averaged to produce a single representative daily value. The climatological data series was complete and void of missing values, so additional data processing was not necessary. The climatological and water quality data series were merged with each other such that only the dates with both climatological and water quality data available could be considered for further data processing.

### Establishing the machine learning algorithms

The workflow to establish, train and assess the machine learning algorithms is presented in Figure 2. To evaluate predictive models, the original dataset was randomly partitioned into 10 equal size non-overlapping subsets to enable the 10-fold cross-validation technique. Nine subsets were used for training and one set for validation. The cross-validation process was repeated 10 times such that each of the 10 subsets was used exactly once as the validation data. Simple regression analysis was used to identify the potential correlations between the observed DOC concentrations (Target) and other water quality, catchment and climate data (Predictors). After understanding the associations between the predictors and the target, the three algorithms were programmed, trained and validated using MATLAB 2018b (Mathworks, Inc.).

**Algorithm 1. MLR**

*k*th sample point, , using the MLR model:where is the intercept and a linear term for each predictor (

*x*). The MLR is the simplest algorithm and was used to establish the baseline response between target and predictors. The ordinary least square approximation by QR decomposition was used to estimate the algorithm parameters.

_{i}**Algorithm 2. SVR**

*K*is the kernel function which transforms the data to map into a high-dimensional space. The effectiveness of several kernel functions, including linear, quadratic and cubic, to estimate DOC concentration using climatological parameters were tested. MATLAB allows the setting of certain parameters along with the kernel function as the SVR hyper parameters. Default MATLAB values were used for building the model. Briefly, the epsilon parameter, which represents the half width of epsilon-insensitive band and set to the interquartile range of DOC divided by 13.49 (i.e. an estimator for 10% of the standard deviation). When training an SVR, no penalty is associated with the training points inside the epsilon-insensitive band. The box constraint parameter, which controls the penalty of misclassified training samples, hence limiting model complexity, was set to a multiple of 10 times the epsilon value. Prior to training the model, the predictors were standardised, and the kernel scale parameter was set automatically by MATLAB.

**Algorithm 3. GPR**

*et al.*2019). The squared exponential covariance function used in the GPR algorithm was expressed as Equation (3):where

*σ*is the noise standard deviation and

*l*is the characteristic length-scale which controls the effect of distance between

*x*and

*x*′. Intuitively, when

*x*and

*x*′ are very close, then

*k*(covariance function) leads to a higher value, which means

*f*(

*x*) is highly correlated with

*f*(

*x*′). On the other hand, for two distant points,

*k*(

*x*,

*x*′) is near to zero. This property of the squared exponential function is also highly desirable as for predicting the target variable at a new

*x*, negligible effects from distant observations are anticipated.

### Time-lag function

### Performance and accuracy assessment of the DOC models

Three summary statistics describing the accuracy of the models were used to assess the models' performances: adjusted *R*-squared (adj. *R*^{2}), the root-mean-squared error (RMSE) and mean absolute error (MAE). The adj. *R*^{2} indicates the proportion of the total sum of squares explained by each model adjusted for the number of coefficients. To quantify the level of statistical significance in the performance of selected machine learning algorithms, the paired *t*-test as a commonly used statistical hypothesis test for comparing machine learning algorithms was applied into the 10-fold cross-validation results. The paired *t*-test results that are often a test statistic and a *p*-value can be interpreted to confirm whether there is a real or statistically significant difference between the machine learning algorithms (Nadeau & Bengio 2003; Bouckaert & Frank 2004). Although predicting the DOC as a continuous variable is highly desirable, ultimately predicting the events with high DOC concentrations might be more useful in practice. To do so, thresholding the predicted DOC value to produce a binary outcome was required.

To visualise and quantify the trade-off between the false alarms and missed events, the receiver operating characteristic (ROC) curve was generated for the best performing model (the classification process in Figure 2). The ROC curve plots true positive rate (that is known as sensitivity) against false positive rate (that is also known as one minus specificity), obtained by varying the probability threshold on the predicted output (DOC concentration) (Fawcett 2006). The sensitivity measures the accuracy of the model in predicting high DOC concentration events, while the specificity measures the accuracy of the model in predicting instances with below-threshold DOC concentration. The probability threshold was used to find out how accurately the model could determine if there exists a high DOC event or not. Since increases by as much as 100% in DOC concentrations were observed during wet weather events in other researches (Hinton *et al.* 1997; Schoenheinz & Grischek 2011; Whitworth *et al.* 2012; Dhillon & Inamdar 2013), a high DOC event was defined in this study when the DOC concentration was two times higher than the averaged observed historical DOC concentration in the dataset. To provide an aggregate measure to quantify the model performance across all possible threshold values for , the area under the ROC curve (AUC) was calculated. The AUC value of 1.0 indicates a perfect model performance, while a value of 0.5 represents the chance-level (Kordestani *et al.* 2019).

## RESULTS AND DISCUSSION

### Removal of irrelevant data

Table 1 enumerates the basic statistics of the monitored water quality data at selected sampling points (SP-1 and SP-2 in Figure 1) including the mean and standard deviation of each monitored water quality data and their *R*^{2} with DOC concentration for each site. pH and alkalinity were dropped from the analysis, as they were variables exhibiting an insignificant relationship with DOC concentration. It should be noted that some studies suggested rising DOC concentrations may have been linked to increases in alkalinity or pH, associated with recovery from acidification (Evans *et al.* 2005; Winterdahl *et al.* 2014). However, there are some studies that challenged the role of pH (Couture *et al.* 2012; Pärn & Mander 2012). Although the temporal correlations were not strong between pH and DOC concentrations in this dataset, the possibility that as pH increases the DOC concentrations increase in this catchment is not dismissed. True colour was infrequently measured in our dataset (*n* = 1,758 compared with *n* = 10,761 for turbidity at SP-1), thus it was excluded as a parameter in DOC modelling. Temperature and turbidity were retained as water quality variables for the modelling of the DOC concentration.

Variable . | SP-1 . | SP-2 . | ||||
---|---|---|---|---|---|---|

Mean . | Standard deviation . | R^{2} with DOC
. | Mean . | Standard deviation . | R^{2} with DOC
. | |

pH | 6.53 | 0.50 | 0.00 | 7.31 | 0.38 | 0.00 |

Temperature (̊C) | 16.00 | 5.30 | 0.26 | 14.81 | 3.52 | 0.17 |

Turbidity (NTU) | 7.72 | 4.94 | 0.30 | 2.15 | 1.86 | 0.10 |

Alkalinity (mg CaCo_{3}/L) | 5.17 | 3.54 | 0.00 | 11.07 | 1.08 | 0.00 |

True colour (HU) | 23.12 | 10.40 | 0.22 | 7.14 | 1.86 | 0.25 |

DOC concentration (mg/L) | 4.30 | 1.81 | 1.00 | 3.7 | 0.99 | 1.00 |

Variable . | SP-1 . | SP-2 . | ||||
---|---|---|---|---|---|---|

Mean . | Standard deviation . | R^{2} with DOC
. | Mean . | Standard deviation . | R^{2} with DOC
. | |

pH | 6.53 | 0.50 | 0.00 | 7.31 | 0.38 | 0.00 |

Temperature (̊C) | 16.00 | 5.30 | 0.26 | 14.81 | 3.52 | 0.17 |

Turbidity (NTU) | 7.72 | 4.94 | 0.30 | 2.15 | 1.86 | 0.10 |

Alkalinity (mg CaCo_{3}/L) | 5.17 | 3.54 | 0.00 | 11.07 | 1.08 | 0.00 |

True colour (HU) | 23.12 | 10.40 | 0.22 | 7.14 | 1.86 | 0.25 |

DOC concentration (mg/L) | 4.30 | 1.81 | 1.00 | 3.7 | 0.99 | 1.00 |

### Selection of correct predictor variables

Climatological data including precipitation (Pre.), temperature (Temp.), SM, ET and LAI measured at SP-1 from November 2002 to January 2018 are shown in Figure S2. To select the correct climatological variables for DOC estimation, Pearson's correlation coefficient of each climatological variable and selected water quality parameters with DOC concentration in selected sampling sites were determined (Table 2 and Table S2). Strong correlations were observed between SM and precipitation data (*r* = 0.91 at SP-1) and between temperature and ET data (*r* = 0.80 at SP-1) (Table 2 and Table S2). To ensure parameters used to estimate DOC concentration were independent from each other, parameters with high correlation were excluded. Therefore, two pathways arise for potential machine learning model input variables, one considering precipitation and temperature and the other including SM and ET. The effects of including more input variables on estimated DOC concentration were also studied by increasing the number of model input variables from only two variables (Pre. and Temp. versus SM and ET) to three and four input variables by adding turbidity and LAI, respectively.

MP1 . | DOC . | Turb. . | Pre. . | Temp. . | SM . | ET . | LAI . |
---|---|---|---|---|---|---|---|

DOC | 1.00 | ||||||

Turb. | 0.29 | 1.00 | |||||

Pre. | 0.45 | 0.19 | 1.00 | ||||

Temp. | 0.26 | 0.06 | 0.07 | 1.00 | |||

SM | 0.41 | 0.18 | 0.91 | 0.07 | 1.00 | ||

ET | 0.29 | 0.02 | 0.34 | 0.80 | 0.38 | 1.00 | |

LAI | 0.12 | 0.00 | 0.11 | 0.15 | 0.05 | 0.18 | 1.00 |

MP1 . | DOC . | Turb. . | Pre. . | Temp. . | SM . | ET . | LAI . |
---|---|---|---|---|---|---|---|

DOC | 1.00 | ||||||

Turb. | 0.29 | 1.00 | |||||

Pre. | 0.45 | 0.19 | 1.00 | ||||

Temp. | 0.26 | 0.06 | 0.07 | 1.00 | |||

SM | 0.41 | 0.18 | 0.91 | 0.07 | 1.00 | ||

ET | 0.29 | 0.02 | 0.34 | 0.80 | 0.38 | 1.00 | |

LAI | 0.12 | 0.00 | 0.11 | 0.15 | 0.05 | 0.18 | 1.00 |

### Impact of considering time lags for the climatological data on estimated DOC concentrations

To investigate the effects of considering time lags for the climatological data on estimated DOC concentrations, time lags of up to 20 days prior to water sampling for the climatological data were investigated in this study. Adj. *R*^{2} and RMSE values of estimated DOC concentrations at SP-1 using MLR as well as SVR and GPR with different kernel functions are presented in Figure 3(a)–3(f) and Figure S3(a)–S3(f), respectively. Figure S4(a)–S4(f) represents how MAE values of estimated DOC concentration at SP-1 varied by increasing the time interval for consideration of climatological data prior to water sampling. The results in Figure 3(a)–3(f) showed that increasing the number of inputs gave better results, especially with the more sophisticated machine learning approaches. The model performance was improved by considering turbidity and LAI as input variables, and the first scenario (Pre. and Temp.) resulted better than the second one (SM and ET) (Figure 3(a)–3(f)). SVR with a linear kernel function (Figure 3(b)) resulted in slightly better performance than MLR in adj. *R*^{2} values when the time lag between climatological data and water quality data was increased. Changing the kernel function from linear to quadratic also improved the model estimation results (adj. *R*^{2} of the model was increased) (Figure 3(c)). However, applying SVR with a cubic kernel function was unsuccessful for this dataset for DOC concentration estimation (Figure 3(d)). Similarly, GPR with kernel functions either exponential (Figure 3(e)) or squared exponential (Figure 3(f)) exhibited poor performance with maximum adj. *R*^{2} values on average less than 0.5. The adj. *R*^{2} increased when the time lag between the climatological data and the day of water sampling was increased from 1 to 12 days, beyond which it continued to decrease (Figure 3(a)–3(d)).

Therefore, using SVR with a quadratic kernel function with Pre., Temp., LAI and Turb. as model input variables with a 12-day time lag between climatological and water quality data outperformed the other selected machine learning algorithms in this study in terms of statistical indices (adj. *R*^{2} = 0.71, RMSE = 1.9, MAE = 1.35). To estimate whether the differences between the performance of SVR with a quadratic kernel function and other selected machine learning algorithms are true and reliable or are just due to statistical chance, a paired *t*-test was conducted on 10-fold cross-validation results, while a 12-day time lag between climatological and water quality data was considered. The paired *t*-test results between SVR with a quadratic kernel function and other machine learning algorithms rejected the null hypothesis at the 5% significance level, and the averaged *p*-value was 1.37% which is smaller than the considered significance level (i.e. 5%). Hence, the results statistically provided convincing evidence that machine learning algorithms performed differently and any observed difference in the performance of SVR with a quadratic kernel function is likely due to a difference in the models. Therefore, the SVR showed a better capability than MLR and GPR to handle the nonlinearity and complexity of climatological characteristics of the catchment. However, it should be acknowledged that the performance of the GPR model was investigated with only two kernel functions, the exponential and the squared exponential. Exploring other kernels could be a potential future work to this study and might result in a higher performance metrics for the GPR.

The trend of RMSE (Figure S3(a)–S3(f)) was opposite to that of adj. *R*^{2}, as the objective function with RMSE was to minimise the estimation error. It should be noted that by using the entire dataset in the hyperparameter selection process, a bias is introduced, which likely results in an overly optimistic performance metrics. The hyperparameters were the best model types (i.e. SVR with quadratic kernel function), the best set of features (Pre., Temp., LAI and Turb.) and the most appropriate value time delay (i.e. 12 days). It should be noted that the 12-day time interval would not be a universal value for all the other catchments, as the time interval could be related to some physical characteristics of the catchment such as catchment area/slope or the LC.

### Machine learning model performance assessment

One potential avenue is to explore the performance of the best model for a completely blinded dataset, which was not used for the feature and model selection. Therefore, a 12-day time lag was chosen to consider the impacts of climatological data prior to water sampling, as it led to the highest adj. *R*^{2} value and lowest RMSE value in this study (Figure 3(c)). The measured and estimated DOC concentrations and their scatter plot using the SVR model with the quadratic kernel function at SP-1 are presented in Figure 4. As seen from the figure, the plotted data points revealed a good agreement between measured and estimated DOC concentrations, as they generally correlated close towards the 1:1 sloped line.

#### ROC analysis

To investigate if the developed SVR model with a quadratic kernel function and a 12-day time lag between climatological and water quality data could detect high DOC concentration events, the ROC analysis was carried out. The averaged measured DOC concentration at SP-1 from November 2002 to January 2018 was 3.4 mg/L (Figure 4). In this study, the DOC concentration threshold for such events was set to be two times higher than the averaged observed DOC concentrations over the past 15 years (6.8 mg/L at SP-1). Figure 5 shows the ROC curve for the high DOC concentration events using the SVR model with the quadratic kernel function with Pre., Temp., LAI and Turb. as model input variables and a 12-day time lag between climatological and water quality data at SP-1. Different cut-off values can be selected from the ROC curve. Here, as high sensitivity will guarantee that most of the high DOC concentration events will be detected, a lax operating point is desirable. Therefore, a knee point in the ROC curve, which ensures a sensitivity of 90%, was selected as the operating point. The corresponding sensitivity and specificity were 0.89 and 0.88, respectively (denoted by a red circle in Figure 5). The confidence intervals in each point were computed by generating 100 bootstrap replicas. It was found that the AUC value for the SVR model with the quadratic kernel function and a 12-day time lag was 0.92 which is above the random level (i.e. 0.5), indicating that the developed SVR model, which uses climatological variables as model inputs, is capable of being used as an alarm system, indicating a possible high DOC event.

Adj. *R*^{2}, RMSE and MAE values for the estimated DOC concentrations at SP-2 with the same machine learning algorithms are shown in Figure S5(a)–S5(f), Figure S6(a)–S6(f) and Figure S7(a)–S7(f), respectively. Unlike SP-1 that is a river-based sampling point, SP-2 is located at a lake whose water comes from three main tributaries (SP-1 is at one of these three main tributaries into the lake). The results show that none of the applied machine learning algorithms could estimate the DOC concentration at SP-2 (adj. *R*^{2} <0.3). As shown in Figure S5(a)–S5(f), Figure S6(a)–S6(f) and Figure S7(a)–S7(f), estimated DOC concentrations at SP-2 did not improve by including more input parameters to machine learning algorithms, or by increasing the time lag between climatological and water quality datasets. This could largely be due to the complex local geological and hydrological changes in the sites that directly surround and feed each sampling point or because the lake water combines water from its tributaries with potentially different physiochemical and climatological characteristics. One future approach to deal with lake-based sites could be to consider data from all input tributaries. Another possible reason, that would make it hard to understand how water quality parameters in a lake are affected by climatological parameters, could be the inherent variability in lake characteristics such as lake size, shape and depth compared to the rivers. It could also be due to the complex nature of lake–atmosphere interactions (O'Reilly *et al.* 2015; Winslow *et al.* 2015) or water clarity in the lake as it influences the depth range over which heat can be absorbed (Rose *et al.* 2016) or the strength of lake stratification (Winslow *et al.* 2015, 2017). Hence, the response of the lake to changing climatological parameters was complex, and considering only the climatological parameters could not estimate the DOC concentration at the lake sampling point in our study (SP-2).

## CONCLUSION

This research utilised machine learning and satellite data to develop and train a model to estimate DOC concentration in water in an Australian catchment. Given the ease of accessing high-resolution satellite data, it is feasible to use climatological data derived from the satellite datasets and link them with water quality indicators to meet the current and future challenges in large-area water quality monitoring. The analysis of the results showed that precipitation, temperature, LAI and turbidity yielded the optimal results using the SVR model with the quadratic kernel function. Considering the impact of time lag on climatological data prior to water sampling showed an impact on model accuracy, and a 12-day time lag between climatological and water quality data brought far better results for the datasets used in this study in terms of statistical indices (adj. *R*^{2} = 0.71, RMSE = 1.9, MAE = 1.35). To show the usefulness of the proposed method compared with other traditional model and kernel-based models, different machine learning models were constructed on the dataset. Experimental results show that the forecasting capability of the SVR model outperforms those of other kernel-based models, thereby generating more accurate results.

From the ROC analysis, the developed SVR model with the quadratic kernel function can be successfully used to indicate possible high DOC events (AUC = 92%). Although the results showed that the SVR model works well for the river-based sampling point, the response of the lake to changing climatological parameters was complex and none of the applied machine learning algorithms could estimate the DOC concentration at lake-based sampling point, probably due to higher water residence times in lakes than rivers or diverse variability in lake characteristics such as lake size, shape and depth. Further studies would be needed to test and clarify the effect of considering other potential lake characteristics as model input variables for the DOC estimation at lake-based sampling points and the potential for utilising the approach to river-based modelling used in this study as inputs to DOC estimation models for lake-based sites.

## ACKNOWLEDGEMENTS

This research was supported via the Australian Research Council's Linkage Projects funding scheme (project number LP160100620) which included support from Sydney Water and WaterNSW. In addition, the authors thank Water Research Australia for the PhD top-up scholarship provided to Anthony Agostino. The authors also acknowledge the support of the UNESCO Centre for Membrane Science and Technology.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## REFERENCES

_{2}atmospheric scenario