Quantifying natural organic matter concentration in water from climatological parameters using different machine learning algorithms

The present understanding of how changes in climate conditions will impact the ﬂ ux of natural organic matter (NOM) from the terrestrial to aquatic environments and thus aquatic dissolved organic carbon (DOC) concentrations is limited. In this study, three machine learning algorithms were used to predict variations in DOC concentrations in an Australian drinking water catchment as a function of climate, catchment and physical water quality data. Four independent variables including precipitation, temperature, leaf area index and turbidity ( n ¼ 5,540) were selected from a large dataset to develop and train each machine learning model. The accuracy of the multivariable linear regression, support vector regression (SVR) and Gaussian process regression algorithms with different kernel functions was determined using adjusted R -squared (adj. R 2 ), root-mean-squared error (RMSE) and mean absolute error (MAE). Model accuracy was very sensitive to the time interval used to average climate observations prior to pairing with DOC observations. The SVR model with a quadratic kernel function and a 12-day time interval between climate and water quality observations outperformed the other machine learning algorithms (adj. R 2 ¼ 0.71, RMSE ¼ 1.9, MAE ¼ 1.35). The area under the receiver operating characteristic curve method (AUC) con ﬁ rmed that the SVR model could predict 92% of the elevated DOC observations; however, it was not possible to estimate DOC values at speci ﬁ c sampling sites in the catchment, probably due to the complex local geological and hydrological changes in the sites that directly surround and feed each sampling point. Further research is required to establish potential relationships between climatological data and NOM concentration in other water catchments – especially in the face of a changing climate. The estimation of DOC concentration of water from simple measurable water quality and climatological variables. The detection of high DOC concentration events using the machine learning technique.


INTRODUCTION
A warming climate is driving a suite of changes in global surface waters, from nutrient, temperature and sediment profiles in Chesapeake bay, USA (Najjar et al. 2010) to the timing and intensity of monsoons in Southeast Asia (Loo et al. 2015). Higher temperatures impact water quality in lakes by reducing oxygen concentrations, increasing the release of phosphorus from sediments (Arnell et al. 2015) and increasing the concentration of dissolved organic matter (El-Jabi et al. 2014). Intense, but less frequent rainfall events will increase sediment loads while decreasing water security in drinking water catchments. These changes will have profound effects on the potable water supplies of current and future generations (Hansen et al. 2013).
The character and composition of natural organic matter (NOM) in freshwater ecosystems also appears to respond to variations in climatic conditions (Hejzlar et al. 2003;Delpla et al. 2009;Zhu et al. 2017;Gavin et al. 2018;Parr et al. 2019). This trend is also observed in Australian catchments with deleterious consequences for the performance of existing treatment infrastructure on more highly variable potable water supplies (Mohiuddin et al. 2014). Observed increases in dissolved organic carbon (DOC) concentrations have been linked to changes in climatic conditions (Tranvik & Jansson 2002;Freeman et al. 2004;Evans et al. 2006). Potential climatic drivers of the upward trends in DOC concentrations have included temperature (Evans et al. 2006), soil moisture (SM) (Hudson et al. 2003) and precipitation (Erlandsson et al. 2008). However, the present understanding of how changes in climatic conditions will continue to impact the flux of NOM from the terrestrial to aquatic environments and thus aquatic DOC concentrations is limited. Hence, establishing potential relationships between climatological data and NOM concentration in water catchment will be critical in planning for the delivery of potable water in a warmer climate.
The link between climate, catchment and water quality is dynamic, complex and with some exceptions not well understood. The expanded use of satellite-derived observations and increases in the type and frequency of water quality measurements has created more data but not necessarily better outcomes for the management of water quality. Consequently, there is an opportunity to apply advanced data-driven modelling techniques to catalogue and interrogate these complex datasets to identify the trends and interdependencies between the variables, and to perform dimensionality reduction of the variables (Lary et al. 2016;Ruescas et al. 2018). Supervised data-driven machine learning algorithms can potentially map DOC concentration in a catchment to climatological data without attempting to accurately model underlying processes. Machine learning methods have been widely used and achieved promising results in different environmental settings (Kim et al. 2014;Park et al. 2015;Granata et al. 2017;Ruescas et al. 2017), with the dataset size varying depending on the type of observation from n ¼ 63 (Kim et al. 2014) to n ¼ 2,255 (Park et al. 2015) for analytical measurements and n . 20,000 for remote sensing applications (Ruescas et al. 2017). The application of data-driven modelling techniques to estimate DOC concentrations has been demonstrated in other studies (Clair et al. 1999;Snauffer et al. 2018). For example, a neural network model was developed to estimate DOC concentration using hydrological, and climatic variables such as temperature, and total precipitation (Clair et al. 1999). In another application, the concentration of coloured dissolved organic matter was determined from remote sensing signals by comparing different machine learning regression approaches such as Gaussian process regression (GPR), support vector regression (SVR), random forest and kernel ridge regression (Ruescas et al. 2017. The low computational cost and interpretability makes the multivariate linear regression (MLR) algorithm desirable as there are clear trade-offs between model complexity and interpretability. In many cases, particularly where limited data are available, more complex algorithms can achieve high levels of generalisation and prediction accuracy (Khalil et al. 2005). Among non-linear approaches, SVR is a nonparametric (i.e. not limited by a functional form) supervised learning algorithm that simultaneously can minimise prediction error and model complexity (Vapnik 2000). In contrast to linear and non-linear machine learning algorithms that are trained from exact values for every parameter in a function, GPR uses a (Gaussian) probability distribution over all possible values that fit the data. Exponential kernel function and squared exponential are widely used choices for covariance kernel functions (Jiang et al. 2019). GPR has been applied in regression problems in different environmental fields of studies including prediction of stream water temperature (Grbićet al. 2013), groundwater salinity (Lal & Datta 2018) or streamflow forecasting (Sun et al. 2014). However, the approach has not been used to explore correlations between DOC and catchment and climate variables.
The prospective application of an appropriate machine learning algorithm to a sufficiently large dataset may enable water utilities to predict what climate conditions are conducive to excursions in NOM in water catchments. The first challenge is to select the appropriate machine learning algorithm and develop a sufficiently large dataset to train and evaluate the performance of each machine learning algorithm. In most cases, the data required can be sourced from routinely collected water quality data, which is often fragmented by water utilities and catchment management authorities and temporally disconnected. An ever-growing library of climatological data is freely available from multiple weather and geographic satellite sources. However, as surface water is influenced by climate conditions through the transfer of momentum and matter, a time gap exists between climatological and water quality data. Moreover, as the catchment area increases, the lag time (here the optimal lagged day) increases (Rostami et al. 2018). Therefore, the second challenge is to explore the effects of different statistical pairing of observed DOC concentrations with averaged climatological variables from multiple time steps prior to the water quality observation. Thus, by combining and harmonising temporal differences between fragmented datasets, and training and validating an appropriate machine learning algorithm, it may be possible to identify potential relationships between the climatological and water quality data.
The objective of this paper was to evaluate three commonly used machine learning algorithms, namely MLR, SVR and GPR for the estimation of DOC concentration of water from simple measurable water quality and climatological variables using datasets from an Australian catchment. The secondary aim of this study was to establish new knowledge about the potential relationships between climatological factors and water quality parameters by using the exhaustive feature selection technique. This may enable water utilities to minimise the impact of changes in climate conditions on treatment plant performance and potable water quality.

Study area
This study was conducted in the Nepean catchment in Australia (386 km 2 ) over a spatial domain of 33.81°S to 35.09°S and 149.97°E to 151.16°E with an elevation of 130-720 m above sea level ( Figure 1), The dominant climate class is temperate, characterised by no dry season, a warm summer and 800-1,600 mm of annual rainfall. The major land cover are trees (82%) with the remaining areas used for rural residential and agricultural purposes, including pasture and cropping (Peel et al. 2007b;Sixsmith et al. 2015; Office of Environment & Heritage 2017).

Dataset
The dataset (n ¼ 5,540) was compiled from surface water quality observations and climatological and catchment variables derived from satellites, and a land surface model with a temporal range of 15 years and 2 months (1 November 2002 to 1 January 2018) (Table S1). A pre-processing step was employed to remove inaccuracies, including reading errors, abnormal data and missing values.

Water quality data
Water quality data were collected at the Burke River (SP-1), one of main three tributaries to Lake Nepean and Lake Nepean (SP-2) in the Nepean catchment (Fig. S1). Parameters obtained from routine water quality monitoring included pH, water temperature (°C), turbidity (NTU), alkalinity (mg CaCO 3 /L), true colour (UTC) and DOC concentration (mg/L).

Catchment and climate data
The catchment boundary (DEM) was obtained from the Shuttle Radar Topography Mission (SRTM) and the Hydrologically Enforced Digital Elevation Model (DEM-H) Version 1.0 (Read et al. 2011); land cover (LC) was obtained from the Dynamic Land Cover Dataset Version 2.1 (Sixsmith et al. 2015) in 2012-2013, consisting of 22 LC classes; and climate class (CZ) was obtained from the updated Köppen-Geiger climate classification (Peel et al. 2007a) (Table S1). Daily rainfall (mm/ day) and air temperature (°C) were obtained from the Australian Water Availability Project (AWAP) dataset, gridded at 0.05° (Raupach et al. 2009). Mean temperatures were expressed as arithmetic averages of daily maximum and minimum temperatures.
Daily SM (%) and actual evapotranspiration (ET) (mm/day) were extracted from the Australian Water Resource Assessment Landscape (AWRA-L) dataset, gridded at 0.05° (Frost et al. 2016). Leaf area index (LAI) (m 2 /m 2 ) defined as the one-sided green leaf area per unit ground surface area was calculated from 8-day composite Moderate Resolution Imaging Spectroradiometer (MODIS) measurements from Terra and Aqua satellites products (Didan et al. 2015). Raw LAI data were re-projected and resampled to 0.05°Â0.05°grids (identical to the gridded rainfall data) by using the MODIS re-projection tool (Dwyer & Schmidt 2006). Outliers were removed using the Savitzky-Golay filter (Savitzky & Golay 1964) in the TIMESAT software package ( Jönsson & Eklundh 2004). The filtered 8-day LAI maps were linearly interpolated to a daily time scale for consistency with the other parameters in the dataset.

Data processing
Water quality data were collected on different time steps, and the dates of sample collection were not consistent. When multiple samples were taken within a day, the corresponding measurements were averaged to produce a single representative daily value. The climatological data series was complete and void of missing values, so additional data processing was not necessary. The climatological and water quality data series were merged with each other such that only the dates with available both climatological and water quality data be considered for further data processing.

Establishing the machine learning algorithms
The workflow to establish, train and assess the machine learning algorithms is presented in Figure 2. To evaluate predictive models, the original dataset was randomly partitioned into 10 equal size nonoverlapping subsets to enable the 10-fold cross-validation technique. Nine subsets were used for  training and one set for validation. The cross-validation process was repeated 10 times such that each of the 10 subsets was used exactly once as the validation data. Simple regression analysis was used to identify the potential correlations between the observed DOC concentrations (Target) and other water quality, catchment and climate data (Predictors). After understanding the associations between the predictors and the target, the three algorithms were programmed, trained and validated using MATLAB 2018b (Mathworks, Inc.). Algorithm 1. MLR The MLR model learns the linear function to map the climatological parameters to the DOC concentrations in the catchment. Equation (1) represents the relationship between the predicted DOC for the kth sample point, d DOC k , using the MLR model: where a 0 is the intercept and a linear term for each predictor (x i ). The MLR is the simplest algorithm and was used to establish the baseline response between target and predictors. The ordinary least square approximation by QR decomposition was used to estimate the algorithm parameters.

Algorithm 2. SVR
The SVR can be expressed as Equation (2): where K is the kernel function which transforms the data to map into a high-dimensional space. The effectiveness of several kernel functions, including linear, quadratic and cubic, to estimate DOC concentration using climatological parameters were tested. MATLAB allows the setting of certain parameters along with the kernel function as the SVR hyper parameters. Default MATLAB values were used for building the model. Briefly, the epsilon parameter, which represents the half width of epsilon-insensitive band and set to the interquartile range of DOC divided by 13.49 (i.e. an estimator for 10% of the standard deviation). When training an SVR, no penalty is associated with the training points inside the epsilon-insensitive band. The box constraint parameter, which controls the penalty of misclassified training samples, hence limiting model complexity, was set to a multiple of 10 times the epsilon value. Prior to training the model, the predictors were standardised, and the kernel scale parameter was set automatically by MATLAB. Algorithm 3. GPR The exponential kernel function and the squared exponential kernel function were used in this study, as they are widely used choices for covariance kernel functions ( Jiang et al. 2019). The squared exponential covariance function used in the GPR algorithm was expressed as Equation (3): where σ is the noise standard deviation and l is the characteristic length-scale which controls the effect of distance between x and x 0 . Intuitively, when x and x 0 are very close, then k (covariance function) leads to a higher value, which means f (x) is highly correlated with f (x 0 ). On the other hand, for two distant points, k(x,x 0 ) is near to zero. This property of the squared exponential function is also highly desirable as for predicting the target variable at a new x, negligible effects from distant observations are anticipated.

Time-lag function
As the system time lag was unknown, the performance of the machine learning algorithms as measured by different statistical criteria (highlighted in this section) was evaluated as a function of the lag (T D ). Equation (4), whereŷ and u indicate the predicted value and model parameters, respectively, summarises the dominant delay estimation process. Perf(ŷ, y) shows a performance metric describing the accuracy of the model. As implied by Equation (4), the largest spike (arg max) in the performance metric specified the dominant lag time.
arg max Performance and accuracy assessment of the DOC models Three summary statistics describing the accuracy of the models were used to assess the models' performances: adjusted R-squared (adj. R 2 ), the root-mean-squared error (RMSE) and mean absolute error (MAE). The adj. R 2 indicates the proportion of the total sum of squares explained by each model adjusted for the number of coefficients. To quantify the level of statistical significance in the performance of selected machine learning algorithms, paired t-test as a commonly used statistical hypothesis test for comparing machine learning algorithms was applied into the 10-fold cross-validation results. The paired t-test results that are often a test statistic and a p-value can be interpreted to confirm whether there is a real or statistically significant difference between the machine learning algorithms (Nadeau & Bengio 2003;Bouckaert & Frank 2004). Although predicting the DOC as a continuous variable is highly desirable, ultimately predicting the events with high DOC concentrations might be more useful in practice. To do so, thresholding the predicted DOC value to produce a binary outcome was required.
To visualise and quantify the trade-off between the false alarms and missed events, the receiver operating characteristic (ROC) curve was generated for the best performing model (the classification process in Figure 2). The ROC curve plots true positive rate (that is known as sensitivity) against false positive rate (that is also known as one minus specificity), obtained by varying the probability threshold on the predicted output (DOC concentration) (Fawcett 2006). The sensitivity measures the accuracy of the model in predicting high DOC concentration events, while the specificity measures the accuracy of the model in predicting instances with below-threshold DOC concentration. The probability threshold was used to find out how accurately the model could determine if there exists a high DOC event or not. Since increases by as minimum as 100% in DOC concentrations were observed during wet weather events in other researches (Hinton et al. 1997;Schoenheinz & Grischek 2011;Whitworth et al. 2012;Dhillon & Inamdar 2013), a high DOC event was defined in this study when the DOC concentration was two times higher than the averaged observed historical DOC concentration in the dataset. To provide an aggregate measure to quantify the model performance across all possible threshold values for d DOC, the area under the ROC curve (AUC) was calculated. The AUC value of 1.0 indicates a perfect model performance, while a value of 0.5 represents the chance-level (Kordestani et al. 2019). Table 1 enumerates the basic statistics of the monitored water quality data at selected sampling points (SP-1 and SP-2 in Figure 1) including the mean and standard deviation of each monitored water quality data and their R 2 with DOC concentration for each site. pH and alkalinity were dropped from the analysis, as they were variables exhibiting an insignificant relationship with DOC concentration. It should be noted that some studies suggested rising DOC concentrations may have been linked to increases in alkalinity or pH, associated with recovery from acidification (Evans et al. 2005;Winterdahl et al. 2014). However, there are some studies that challenged the role of pH (Couture et al. 2012;Pärn & Mander 2012). Although the temporal correlations were not strong between pH and DOC concentrations in this dataset, the possibility that as pH increases the DOC concentrations increase in this catchment is not dismissed. True colour was infrequently measured in our dataset (n ¼ 1,758 compared with n ¼ 10,761 for turbidity at SP-1), thus it was excluded as a parameter in DOC modelling. Temperature and turbidity were retained as water quality variables for the modelling of the DOC concentration.

Selection of correct predictor variables
Climatological data including precipitation (Pre.), temperature (Temp.), SM, ET and LAI measured at SP-1 from November 2002 to January 2018 are shown in Fig. S2. To select the correct climatological variables for DOC estimation, Pearson's correlation coefficient of each climatological variable and selected water quality parameters with DOC concentration in selected sampling sites were determined (Table 2 and Table S2). Strong correlations were observed between SM and precipitation data (r ¼ 0.91 at SP-1) and between temperature and ET data (r ¼ 0.80 at SP-1) ( Table 2 and  Table S2). To ensure parameters used to estimate DOC concentration were independent from each other, parameters with high correlation were excluded. Therefore, two pathways arise for potential machine learning model input variables, one considering precipitation and temperature and the other including SM and ET. The effects of including more input variables on estimated DOC concentration were also studied by increasing the number of model input variables from only two variables Uncorrected Proof (Pre. and Temp. versus SM and ET) to three and four input variables by adding turbidity and LAI, respectively.

Impact of considering time lags for the climatological data on estimated DOC concentrations
To investigate the effects of considering time lags for the climatological data on estimated DOC concentrations, time-lags of up to 20 days prior to water sampling for the climatological data were investigated in this study. Adj. R 2 and RMSE values of estimated DOC concentrations at SP-1 using MLR as well as SVR and GPR with different kernel functions are presented in Figure 3(a)-3(f) and Fig. S3 (a-f),  Uncorrected Proof respectively. Fig. S4 (a-f) represents how MAE values of estimated DOC concentration at SP-1 varied by increasing the time interval for consideration of climatological data prior to water sampling. The results in Figure 3(a)-3(f) showed that increasing the number of inputs gave better results, especially with the more sophisticated machine learning approaches. The model performance was improved by considering turbidity and LAI as input variables, and the first scenario (Pre. and Temp.) resulted better than the second one (SM and ET) (Figure 3(a)-3(f)). SVR with a linear kernel function (Figure 3(b)) resulted in slightly better performance than MLR in adj. R 2 values when the time lag between climatological data and water quality data was increased. Changing the kernel function from linear to quadratic also improved the model estimation results (adj. R 2 of the model was increased) (Figure 3(c)). However, applying SVR with a cubic kernel function was unsuccessful for this dataset for DOC concentration estimation (Figure 3(d)). Similarly, GPR with kernel functions either exponential (Figure 3(e)) or squared exponential (Figure 3(f)) exhibited poor performance with maximum adj. R 2 values on average less than 0.5. The adj. R 2 increased when the time lag between the climatological data and the day of water sampling was increased from 1 to 12 days, beyond which it continued to decrease (Figure 3(a)-3(d)). Therefore, using SVR with a quadratic kernel function with Pre., Temp., LAI and Turb. as model input variables with a 12-day time lag between climatological and water quality data outperformed the other selected machine learning algorithms in this study in terms of statistical indices (adj. R 2 ¼ 0.71, RMSE ¼ 1.9, MAE ¼ 1.35). To estimate whether the differences between the performance of SVR with a quadratic kernel function and other selected machine learning algorithms are true and reliable or they are just due to statistical chance, a paired t-test was conducted on 10-fold cross-validation results, while a 12-day time lag between climatological and water quality data was considered. The paired t-test results between SVR with a quadratic kernel function and other machine learning algorithms rejected the null hypothesis at the 5% significance level, and the averaged p-value was 1.37% which is smaller than the considered significance level (i.e. 5%). Hence, the results statistically provided convincing evidence that machine learning algorithms performed differently and any observed difference in the performance of SVR with a quadratic kernel function is likely due to a difference in the models. Therefore, the SVR showed a better capability than MLR and GPR to handle the nonlinearity and complexity of climatological characteristics of the catchment. However, it should be acknowledged that the performance of the GPR model was investigated with only two kernel functions, the exponential and the squared exponential. Exploring other kernels could be a potential future work to this study and might result in a higher performance metrics for the GPR.
The trend of RMSE ( Fig. S3 (a-f)) was opposite to that of adj. R 2 , as the objective function with RMSE was to minimise the estimation error. It should be noted that by using the entire dataset in the hyperparameter selection process, a bias is introduced, which likely results in an overly optimistic performance metrics. The hyperparameters were the best model types (i.e. SVR with quadratic kernel function), the best set of features (Pre., Temp., LAI and Turb.) and the most appropriate value time delay (i.e. 12 days). It should be noted that the 12-day time interval would not be a universal value for all the other catchments, as the time interval could be related to some physical characteristics of the catchment such as catchment area/slope or the LC.

Machine learning model performance assessment
One potential avenue is to explore the performance of the best model for a completely blinded dataset, which was not used for the feature and model selection. Therefore, a 12-day time lag was chosen to consider the impacts of climatological data prior to water sampling, as it led to the highest adj. R 2 value and lowest RMSE value in this study (Figure 3(c)). The measured and estimated DOC concentrations and their scatter plot using the SVR model with the quadratic kernel function at SP-1 are presented in Figure 4. As seen from the figure, the plotted data points revealed a good agreement between measured and estimated DOC concentrations, as they generally correlated close towards the 1:1 sloped line.

ROC analysis
To investigate if the developed SVR model with a quadratic kernel function and a 12-day time lag between climatological and water quality data could detect high DOC concentration events, the ROC analysis was carried out. The averaged measured DOC concentration at SP-1 from November 2002 to January 2018 was 3.4 mg/L (Figure 4). In this study, the DOC concentration threshold for such events was set to be two times higher than the averaged observed DOC concentrations over the past 15 years (6.8 mg/L at SP-1). Figure 5 shows the ROC curve for the high DOC concentration events using the SVR model with the quadratic kernel function with Pre., Temp., LAI and Turb. as model input variables and a 12-day time lag between climatological and water quality data at SP-1. Different cut-off values can be selected from the ROC curve. Here, as high sensitivity will guarantee that most of the high DOC concentration events will be detected, a lax operating point is desirable. Therefore, a knee point in the ROC curve, which ensures a sensitivity of 90%, was selected as the operating point. The corresponding sensitivity and specificity were 0.89 and 0.88, respectively (denoted by a red circle in Figure 5). The confidence intervals in each point were computed by generating 100 bootstrap replicas. It was found that the AUC value for the SVR model with the quadratic kernel function and a 12-day time lag was 0.92 which is above the random level (i.e. 0.5), indicating that the developed SVR model, which uses climatological variables as model inputs, is capable to be used as an alarm system, indicating a possible high DOC event.
Adj. R 2 , RMSE and MAE values for the estimated DOC concentrations at SP-2 with the same machine learning algorithms are shown in Fig. S5(a-f), Fig. S6(a-f) and Fig. S7(a-f), respectively.  Unlike SP-1 that is a river-based sampling point, SP-2 is located at a lake that its water comes from three main tributaries (SP-1 is at one of these three main tributaries into the lake). The results show that none of the applied machine learning algorithms could estimate the DOC concentration at SP-2 (adj. R 2 ,0.3). As shown in Fig. S5(a-f), Fig. S6(a-f) and Fig. S7(a-f), estimated DOC concentrations at SP-2 did not improve by including more input parameters to machine learning algorithms, or by increasing the time lag between climatological and water quality datasets. This could largely be due to the complex local geological and hydrological changes in the sites that directly surround and feed each sampling point or because the lake water combines water from its tributaries with potentially different physiochemical and climatological characteristics. One future approach to deal with lake-based sites could be to consider data from all input tributaries. Another possible reason, that would make it hard to understand how water quality parameters in a lake are affected by climatological parameters, could be the inherent variability in lake characteristics such as lake size, shape and depth compared to the rivers. It could also be due to the complex nature of lake-atmosphere interactions (O'Reilly et al. 2015;Winslow et al. 2015) or water clarity in the lake as it influences the depth range over which heat can be absorbed (Rose et al. 2016) or the strength of lake stratification (Winslow et al. 2015(Winslow et al. , 2017. Hence, the response of the lake to changing climatological parameters was complex, and considering only the climatological parameters could not estimate the DOC concentration at the lake sampling point in our study (SP-2).
Figure 5 | The ROC curve and the AUC corresponding to the high DOC concentration events using the SVR model with the quadratic kernel function with Pre., Temp., LAI and Turb. as model input variables and a 12-day time lag between climatological and water quality data at SP-1.

CONCLUSION
This research utilised machine learning and satellite data to develop and train a model to estimate DOC concentration in water in an Australian catchment. Given the ease of accessing high-resolution satellite data, it is feasible to use climatological data derived from the satellite datasets and link them with water quality indicators to meet the current and future challenges in large-area water quality monitoring. The analysis of the results showed that precipitation, temperature, LAI and turbidity yielded the most optimal results using the SVR model with the quadratic kernel function. Considering the impact of time lag on climatological data prior to water sampling showed an impact on model accuracy, and a 12-day time lag between climatological and water quality data resulted better for the datasets used in this study in terms of statistical indices (adj. R 2 ¼ 0.71, RMSE ¼ 1.9, MAE ¼ 1.35). To show the usefulness of the proposed method compared with other traditional model and kernel-based models, different machine learning models were constructed on the dataset. Experimental results show that the forecasting capability of the SVR model outperforms those of other kernelbased models, thereby generating more accurate results. From the ROC analysis, the developed SVR model with the quadratic kernel function can be successfully used to indicate possible high DOC events (AUC ¼ 92%). Although the results showed that the SVR model works well for the river-based sampling point, the response of the lake to changing climatological parameters was complex and none of the applied machine learning algorithms could estimate the DOC concentration at lake-based sampling point, probably due to higher water residence times in lakes than rivers or diverse variability in lake characteristics such as lake size, shape and depth. Further studies would be needed to test and clarify the effect of considering other potential lake characteristics as model input variables for the DOC estimation at lake-based sampling points and the potential for utilising the approach to river-based modelling used in this study as inputs to DOC estimation models for lake-based sites.