Abstract
The application of machine learning (ML) approaches to predict estuarine dissolved oxygen (DO) from a set of environmental covariates that includes nutrients remains relatively unexplored because nutrient data are often unavailable. Using data from 12 water quality stations on the southwest coast of Florida, the applicability of four ML models – support vector machine (SVM), random forest (RF), decision tree, and Wang–Mendel – was examined for predicting DO under a limited nutrient data environment. Monthly water temperature, pH, salinity, total nitrogen (TN), and total phosphorus (TP) data were used for model development. A multiple linear regression model was trained as a benchmark against which to compare ML model performance. The site-specific RF and SVM showed superior model efficiency (Nash–Sutcliffe Efficiency > 0.80) when all the predictor variables were used for model development. However, models trained without nutrients demonstrated reduced prediction accuracy. Modeling by synthesizing all site data under TN-limited, TP-limited, and TN- & TP-co-limited regimes showed that RF performed best. Overall, the study yielded two conclusions that could complement the existing approaches to estimating total daily loads for environmental management: (1) nutrients serve as a necessary predictor of estuarine DO dynamics and (2) RF performs best among the tested ML methods under a limited data environment.
HIGHLIGHTS
Machine learning application on predicting dissolved oxygen (DO) is performed.
DO prediction including nutrient as a driver under a limited data environment is applied.
Inclusion of nutrient clearly depicts the dynamics of estuary DO.
INTRODUCTION
Estuarine water quality is pivotal for maintaining a healthy aquatic environment, as estuarine ecosystem productivity and metabolism are heavily mediated by variability in surrounding water quality (Worthington et al. 2020). Among different water quality indicators, dissolved oxygen (DO) is one of the crucial indicators that contributes to the aquatic ecosystem's health and affects the life cycle of marine species. DO represents a dynamic equilibrium between oxygen sources (e.g., photosynthesis) and sinks (e.g., respiration, nitrification, and chemical oxidation) in an aquatic system. Nutrients originating from different land-use types are transported via municipal and industrial discharges that culminate in estuaries and bays, depleting DO and thereby impacting coastal health (Clune et al. 2020). The DO dynamics are highly sensitive to climatic, biogeochemical, land use, and watershed processes, which poses a substantial constraint on data-driven predictive model development. Furthermore, data scarcity and the lack of continuous time-series data, particularly for variables that act as surrogates for upstream land uses (i.e., nutrients), limit the applicability of sophisticated data-based modeling approaches such as machine learning (ML) methods for predicting DO in coastal systems. While ML approaches (Chen et al. 2020) are useful for predicting coastal DO (Valera et al. 2020) from a predictor variable set with large time-series data for model training, nutrients are rarely incorporated as predictors because continuous time-series data are lacking. Therefore, it remains relatively unexplored whether ML approaches could include nutrients as predictors along with the climatic and biogeochemical covariates under a limited data environment.
Choosing the influential water quality indicators for predicting DO is the most important step in empirically explaining the DO dynamics (Chen et al. 2020). The use of freshwater discharge, tidal levels, wind, and climatic factors as drivers to predict DO in coastal aquatic environments is well established (Fennel & Testa 2019). In general, the variability of DO is significantly modulated by water temperature (Tw), as higher Tw reduces the dissolution capacity of the aquatic system. Furthermore, biogeochemical factors such as specific conductivity (a proxy for salinity) and pH also influence DO, although their linkages with DO are weaker than the DO–Tw correlations. pH reduction (increase in H+ concentration) leads to the requirement of more O2 to produce water (H2O), resulting in an inverse relationship between DO and pH. Similarly, salinity acts as a sink for aquatic DO (Onabule et al. 2020). Along with the above-mentioned factors, nutrients (i.e., total nitrogen (TN) and total phosphorus (TP)) play an important role in explaining DO dynamics, as nutrients increase biomass and negatively impact DO (Hamid et al. 2020). High nutrient concentrations lead to algal overgrowth, resulting in a dense scum layer on the water surface. Eventually, this dense scum layer restricts the sunlight, which is the primary factor responsible for the modulation of DO. This mechanism highlights the relevance of additionally including nutrients as predictors in ML-based DO modeling and prediction for estuarine systems.
However, one of the bottlenecks of including nutrients in the predictor variable set for DO modeling is that continuous nutrient time-series data are not readily available when compared to the availability of climatic and biogeochemical data. The grab sampling-based measurement technique hinders the continuous measurements of TN and TP, although there are recent advancements in auto-sampling techniques (Valkova et al. 2021). The land use/land cover (LULC) classes indirectly reflect the impacts of nutrients originating from the contributing watersheds, and recent ML models incorporated LULC classes as predictors of aquatic DO as a surrogate of nutrients (Ahmed & Lin 2021); however, direct inclusion of nutrients in the predictor variable set for DO modeling using ML techniques is still limited.
Different ML models such as artificial neural network (Melesse et al. 2008), Wang–Mendel (WM; Shaghaghian 2010), support vector machine (SVM; Li et al. 2017), random forest (RF; Valera et al. 2020), and decision tree (DT; Chen et al. 2020) have been used to simulate and predict aquatic DO. However, none of these studies directly included nutrients as predictors for training and testing. Furthermore, as ML models demand extensive continuous predictor data to ensure model reliability (Chen et al. 2020), it is challenging to evaluate the effectiveness of ML models incorporating nutrients under discrete data environments.
One further constraint in data-driven modeling of DO is that models developed for a specific site often cannot be applied to predictions at adjacent locations because of large spatial variability. Moreover, ML models are known to underperform when they are applied to data outside the range of the dataset that was used to train them (Kisi et al. 2012). Therefore, the uncertainty in the spatial transferability of ML models, coupled with the unavailability of longer time-series datasets that include nutrient information, hinders the spatially robust application of ML models in predicting DO when a comprehensive set of variables (i.e., climatic, biogeochemical, and nutrients) is used as drivers.
The goal of this study is to evaluate the effectiveness of ML models in predicting DO at estuaries and bays from limited discrete datasets that include climatic, biogeochemical, and nutrient (proxy for land use) predictors, and to assess whether the DO predictions using ML models are spatially transferable within similar nutrient-limited conditions. Twelve water quality stations located along the southwest coast of Florida were used as testbeds for modeling assessments. Four representative ML models (RF, SVM, DT, and WM) were used for training and testing. The traditional multiple linear regression (MLR) model was used as a benchmark to compare the efficacy of the developed ML models in predicting DO. The spatial transferability and efficiency of ML models were further tested by developing ensemble DO models based on spatiotemporal calibration and independent validations. This study would provide critical information on incorporating nutrients to develop ML models for estimating total daily load in the coastal/estuarine environment under a limited data setting.
MATERIALS AND METHODS
Study area and dataset
Statistical summary of different water quality variables across the southwest coast of Florida
Water quality variables . | Mean . | Standard deviation . | Coefficient of variation (%) . | Minimum . | 25th percentile . | 50th percentile . | 75th percentile . | Maximum . |
---|---|---|---|---|---|---|---|---|
DO (mg/L) | 5.74 | 1.31 | 21.08 | 1.90 | 4.80 | 5.68 | 6.56 | 10.01 |
Tw (°C) | 25.93 | 4.25 | 16.39 | 14.70 | 22.50 | 26.80 | 29.70 | 33.90 |
pH | 7.82 | 0.33 | 4.22 | 5.20 | 7.70 | 7.84 | 8.00 | 8.80 |
Sal (PSU) | 29.39 | 7.97 | 27.11 | 1.29 | 25.80 | 31.80 | 35.10 | 40.27 |
TN (mg/L) | 0.42 | 0.20 | 47.62 | 0.03 | 0.28 | 0.39 | 0.52 | 1.91 |
TP (mg/L) | 0.03 | 0.02 | 66.67 | 0.0001 | 0.02 | 0.03 | 0.04 | 0.20 |
Note: DO, Tw, Sal, TN, and TP represent dissolved oxygen, water temperature, salinity, total nitrogen, and total phosphorus, respectively. The units of each water quality parameter are given in parentheses.
Locations of the water quality monitoring stations across the southwest coast of Florida. The stations used to train and test the ML techniques are denoted by circles and triangles, respectively.
Statistical analysis and modeling
Different combinations of predictor variables (Tw, pH, Sal, TN, and TP) used to train ML models to predict estuary DO
Combination name . | Number of predictor variables . | Trained function . |
---|---|---|
M1 | 5 | DO = f (Tw, pH, Sal, TN, TP) |
M2 | 4 | DO = f (Tw, pH, Sal, TN) |
M3 | 4 | DO = f (Tw, pH, Sal, TP) |
M4 | 3 | DO = f (Tw, pH, Sal) |
Note: DO, Tw, Sal, TN, and TP refer to estuary dissolved oxygen, water temperature, salinity, total nitrogen, and total phosphorus, respectively.
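The four predictor combinations in the table can be expressed as a simple lookup, which makes it straightforward to train the same model on each subset. The following is a minimal Python sketch rather than the study's R code; the names `COMBINATIONS` and `select_predictors` are hypothetical, and the sample record simply reuses the median values from the statistical summary table for illustration.

```python
# Predictor combinations M1-M4 as listed in the table above.
COMBINATIONS = {
    "M1": ["Tw", "pH", "Sal", "TN", "TP"],
    "M2": ["Tw", "pH", "Sal", "TN"],
    "M3": ["Tw", "pH", "Sal", "TP"],
    "M4": ["Tw", "pH", "Sal"],
}


def select_predictors(record, combination):
    """Return the predictor values of one monthly record, in table order."""
    return [record[name] for name in COMBINATIONS[combination]]


# Illustrative record (median values from the summary table, not a real sample).
record = {"DO": 5.68, "Tw": 26.8, "pH": 7.84, "Sal": 31.8, "TN": 0.39, "TP": 0.03}

for name in COMBINATIONS:
    print(name, select_predictors(record, name))
```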
Schematic of the workflow and application of different ML models to predict estuary dissolved oxygen (DO). The first 70% of the data from each station were used to train the model and the remaining 30% were used to test the corresponding trained site-specific model. The available time series of similar estuaries, grouped by TN:TP ratio, were used to train the multisite synthesized model, which was then tested on the available time series of independent sites. Tw, Sal, TN, and TP refer to water temperature, salinity, total nitrogen, and total phosphorus, respectively; RF, SVM, DT, WM, and MLR refer to random forest, support vector machine, decision tree, Wang–Mendel, and multiple linear regression, respectively. The model performance was evaluated based on three indices: NSE, RSR, and MBE, which refer to the Nash–Sutcliffe Efficiency, the ratio of root-mean-square error to the standard deviation of observations, and mean bias error, respectively.
The model performance was evaluated based on the Nash–Sutcliffe efficiency (NSE), the ratio of root-mean-square error to the standard deviation of observations (RSR), and mean bias error (MBE) (Text S1 in the Supplementary Material for details). RStudio 1.3.1073 was used to carry out all statistical analysis and modeling.
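For orientation, the three indices can be sketched in a few lines of Python using their commonly cited definitions. This is an illustrative sketch only: the exact formulas used in the study are given in Text S1, the sign convention for MBE (predicted minus observed) is an assumption, and the sample values are hypothetical.

```python
import math


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 - SSE / sum of squares about the observed mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def rsr(obs, pred):
    """Ratio of RMSE to the standard deviation of the observations."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return math.sqrt(sse) / math.sqrt(sst)


def mbe(obs, pred):
    """Mean bias error, assumed here to be mean(predicted - observed)."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


# Hypothetical DO values in mg/L, for illustration only.
observed = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 5.0, 6.0, 6.5]
print(nse(observed, predicted), rsr(observed, predicted), mbe(observed, predicted))
```

Under these definitions RSR equals the square root of (1 − NSE), so the two indices carry the same ranking information, while MBE adds the direction of the bias.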
Pearson correlation to identify the linkages of DO
The Pearson correlation matrix provides useful information to identify the influential predictor variables and their linkages with DO. The correlation analysis also indicates the presence of multicollinearity in the predictor variable set.
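As a reminder of what each entry of the correlation matrix measures, the Pearson coefficient between two variables can be computed as in the following Python sketch. The function name and the sample values are hypothetical and are not taken from the study data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: DO falling as water temperature rises.
tw = [1.0, 2.0, 3.0]
do = [6.0, 4.0, 2.0]
print(pearson(tw, do))
```

Applying the function to every pair of variables yields the full matrix; large absolute values between two predictors are the usual sign of multicollinearity.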
ML and MLR models
The brief description and methodology of the applied techniques are described below.
Random forest (RF)
The broad applicability of RF, ranging from water source identification to filling in missing observations in biochemistry (Kim et al. 2020), motivated its use for DO prediction. RF consists of multiple regression trees, each built on a random subset of the data, and ensembles all trees to predict the response variable. The ‘randomForest’ function was used with 10 regression trees to train the RF model. Its tolerance of correlated variables and its use of discrete predictions to identify nonlinear relationships distinguish the RF model from the other models (Vincenzi et al. 2011).
Support vector machine (SVM)
SVM is a well-known technique (Kisi & Parmar 2016) used in water quality prediction. In general, SVM regression fits a hyperplane to a multi-dimensional dataset such that the observations fall within a margin surrounding the hyperplane. Additionally, the SVM algorithm projects the input vectors into a higher-dimensional feature space with a predefined kernel function to turn a nonlinear problem into a linear regression problem. SVM provides a unique and global solution because it employs the structural risk minimization principle instead of empirical risk minimization. The ‘svm’ function in R was used to develop the model. The cost and gamma parameters for SVM modeling were both set to 1, and the radial basis kernel function was used to account for the nonlinearity in the dataset.
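The radial basis kernel mentioned above measures the similarity of two predictor vectors as exp(−gamma · ‖x − y‖²). The Python sketch below illustrates that kernel with gamma = 1, as in the text; the function name and the input vectors are hypothetical, and the sketch does not reproduce the SVM fitting itself.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical, already-scaled predictor vectors.
print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))  # identical points give 1.0
print(rbf_kernel([1.0, 0.0], [0.0, 0.0]))  # similarity decays with distance
```

Because the kernel depends on raw distances, predictors on very different scales (for example salinity versus TP) are usually standardized before fitting; whether that was done here is not stated in the text.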
Decision tree (DT)
High efficiency and insensitivity to missing values are the key attributes that widen the applicability of the DT algorithm in modeling and predictions (Lu & Ma 2020). In the DT algorithm, the predictor space is split into a set of rectangles based on the minimum sum of squares, following a binary sequence to develop set-specific models. The ‘rpart’ R package was used with the ‘anova’ method for regression fitting.
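The minimum-sum-of-squares splitting rule can be illustrated for a single predictor: each candidate threshold divides the observations into two groups, and the threshold with the smallest total within-group sum of squared deviations is chosen. The Python sketch below shows one such split under that assumption; it is not the ‘rpart’ implementation, and the function names and data are hypothetical.

```python
def sum_of_squares(values):
    """Sum of squared deviations from the group mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total SSE) of the best binary split on one predictor."""
    pairs = sorted(zip(x, y))
    best_threshold, best_sse = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [target for _, target in pairs[:i]]
        right = [target for _, target in pairs[i:]]
        total = sum_of_squares(left) + sum_of_squares(right)
        if total < best_sse:
            best_threshold, best_sse = threshold, total
    return best_threshold, best_sse


# Hypothetical water temperature (deg C) and DO (mg/L) observations.
tw_values = [20.0, 22.0, 28.0, 30.0]
do_values = [7.0, 6.8, 5.0, 4.8]
print(best_split(tw_values, do_values))
```

A full regression tree repeats this search over every predictor and then recurses into each resulting group.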
Wang–Mendel (WM)
The WM model, a variant of the fuzzy framework, is well known for structural simplicity and good predictability (Wang & Mendel 1992). The WM method divides both predictor and response variable data into fuzzy regions. Each fuzzy region corresponds to an interval associated with a linguistic term, and a Gaussian membership function is generated for each region. The ‘frbs.learn’ function of the ‘frbs’ R package was used for model training.
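The fuzzy-region step can be sketched as follows: the range of a variable is covered by evenly spaced region centers, each with a Gaussian membership function, and an observation is assigned to the region in which its membership degree is highest. This Python sketch is an illustration only; the number of regions (3), the width `sigma`, and all function names are assumptions, and the ‘frbs’ implementation may differ in detail.

```python
import math


def region_centers(low, high, n_regions):
    """Evenly spaced centers of the fuzzy regions covering [low, high]."""
    step = (high - low) / (n_regions - 1)
    return [low + i * step for i in range(n_regions)]


def gaussian_membership(value, center, sigma):
    """Degree (0 to 1) to which a value belongs to the region at `center`."""
    return math.exp(-0.5 * ((value - center) / sigma) ** 2)


def strongest_region(value, centers, sigma):
    """Index of the region with the highest membership degree for a value."""
    degrees = [gaussian_membership(value, c, sigma) for c in centers]
    return max(range(len(centers)), key=degrees.__getitem__)


# Assumed setup: three regions ("low", "medium", "high") over 0-10 mg/L DO.
centers = region_centers(0.0, 10.0, 3)
print(centers, strongest_region(5.74, centers, sigma=2.5))
```

In the WM procedure, the strongest regions of the predictors and of the response for each training observation together form one candidate fuzzy rule.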
MLR model for model benchmarking
The MLR model provides a simple linear relationship between response and predictor variables. The ‘lm’ function, together with the ‘olsrr’ package, was used in RStudio to develop the MLR model. The MLR was used as a benchmark model to compare the predictability of the ML models under a limited data environment.
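To make the benchmark concrete, an ordinary least squares fit with an intercept can be obtained by solving the normal equations (XᵀX)b = Xᵀy. The Python sketch below does this with Gauss–Jordan elimination in place of R's ‘lm’; the function names and the synthetic data are hypothetical, and the sketch assumes the predictors are not perfectly collinear.

```python
def fit_mlr(X, y):
    """Least squares coefficients [intercept, b1, b2, ...] via normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting; a zero pivot would indicate collinear predictors.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict_mlr(coefs, x):
    """Predicted response for one vector of predictor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


# Synthetic data generated from DO = 10 - 0.2*Tw + 0.5*pH (illustration only).
X = [[20.0, 7.0], [25.0, 8.0], [30.0, 7.5], [22.0, 8.2]]
y = [10 - 0.2 * tw + 0.5 * ph for tw, ph in X]
coefs = fit_mlr(X, y)
print(coefs, predict_mlr(coefs, [26.0, 7.8]))
```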
Data preparation and model training and testing
A complete data matrix without missing observations is a prerequisite of data-based model development. To prepare the complete data matrix, all data for a given month were removed if an observation was missing for any variable in that month. Monthly water quality data for 20 years are expected to contain 240 observations for each station. However, data unavailability (58–66%) and the aforementioned removal process (1–14%) resulted in a complete data matrix of 80–98 observations across all stations (Table S1 in the Supplementary Material).
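The month-wise removal rule, followed by the chronological 70/30 split described in the workflow schematic, can be sketched as below. This is an illustrative Python sketch; the record layout (one dictionary per month, with `None` marking a missing value) and the function names are assumptions, and the data are synthetic.

```python
VARIABLES = ["DO", "Tw", "pH", "Sal", "TN", "TP"]


def complete_cases(records):
    """Keep only months in which every variable has an observation."""
    return [r for r in records if all(r.get(v) is not None for v in VARIABLES)]


def chronological_split(records, train_percent=70):
    """Split time-ordered records into the first 70% and the remaining 30%."""
    n_train = len(records) * train_percent // 100
    return records[:n_train], records[n_train:]


# Ten synthetic monthly records; the fourth has a missing TN value.
records = [
    {"DO": 5.0 + 0.1 * i, "Tw": 25.0, "pH": 7.8, "Sal": 30.0, "TN": 0.4, "TP": 0.03}
    for i in range(10)
]
records[3]["TN"] = None

clean = complete_cases(records)
train, test = chronological_split(clean)
print(len(clean), len(train), len(test))
```

Splitting by position rather than at random keeps the test period after the training period, which matches the description of the first 70% of each station's data being used for training.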
Multisite ensemble modeling
Along with the site-specific training and testing, a multisite synthesis approach was used to develop a combined ensemble DO model for each ML method to test the spatial robustness under a similar nutrient regime in predicting DO. First, the water quality stations were categorized based on the N-limited, P-limited, and N&P-co-limited criteria (Figure 1). Second, the calibration and validation stations were selected within each limited nutrient category (Table S2 in the Supplementary Material). Third, the calibration sites under each category were combined for model training, and the corresponding validation site was used for independent testing. Finally, all five models (RF, SVM, DT, WM, and MLR) were applied to each of the nutrient-limited classes.
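The categorize-then-pool procedure can be sketched as follows. The Python sketch is illustrative only: the TN:TP cut-offs used in the study are not stated in this section, so `low_ratio` and `high_ratio` are required arguments and the values in the usage example (10 and 20) are purely hypothetical, as are the station names and function names.

```python
def classify_site(mean_tn, mean_tp, low_ratio, high_ratio):
    """Assign a nutrient-limitation class from the TN:TP ratio of a station."""
    ratio = mean_tn / mean_tp
    if ratio < low_ratio:
        return "N-limited"
    if ratio > high_ratio:
        return "P-limited"
    return "N&P-co-limited"


def pool_sites(site_records, validation_site):
    """Combine all calibration sites for training; hold out one site for testing."""
    train = [
        rec
        for name, recs in site_records.items()
        if name != validation_site
        for rec in recs
    ]
    test = list(site_records[validation_site])
    return train, test


# Hypothetical cut-offs and stations, for illustration only.
print(classify_site(0.42, 0.03, low_ratio=10, high_ratio=20))

sites = {"A": [{"DO": 5.1}, {"DO": 5.4}], "B": [{"DO": 6.0}], "C": [{"DO": 4.8}]}
train, test = pool_sites(sites, validation_site="C")
print(len(train), len(test))
```

Each of the five models would then be trained on the pooled calibration records of one class and evaluated on the held-out station of the same class.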
RESULTS
Environmental controls of DO
Pearson correlation coefficient among all water quality variables across the coast of Florida. Tw, Sal, TN, and TP refer to the estuary temperature, salinity, total nitrogen, and total phosphorus, respectively.
Site-specific modeling
Different site-specific models showing efficiency (NSE) and accuracy (RSR) for all stations during training ((a), (b)) and testing ((c), (d)) periods. The model performance was calculated using different models (RF, random forest; SVM, support vector machine; DT, decision tree; WM, Wang–Mendel; MLR, multiple linear regression) and different sets of predictor variables (M1–M4; details in Table 2), independently. NSE and RSR refer to the Nash–Sutcliffe Efficiency and the ratio of root-mean-square error to the standard deviation of observations, respectively.
The site-specific ML models using M2 (NSE = 0.48–0.98; RSR = 0.12–0.72; MBE = −0.24 to 0.21 mg/L), M3 (NSE = 0.49–0.98; RSR = 0.14–0.71; MBE = −0.11 to 0.14 mg/L), and M4 (NSE = 0.13–0.97; RSR = 0.19–0.93; MBE = −0.14 to 0.58 mg/L) combinations showed lower performance compared to the M1 combination during testing across all stations (Figure 4 and Figures S8–S22 in the Supplementary Material). Similar to the M1 combination results, RF (NSE = 0.86–0.98; RSR = 0.12–0.38; MBE = −0.08 to 0.07 mg/L) and SVM (NSE = 0.72–0.96; RSR = 0.21–0.53; MBE = −0.11 to 0.16 mg/L) performed better than DT (NSE = 0.64–0.90; RSR = 0.32–0.60; MBE = −0.10 to 0.14 mg/L) and WM (NSE = 0.48–0.82; RSR = 0.42–0.72; MBE = −0.24 to 0.21 mg/L) during testing for the M2 and M3 combinations. However, the RF model (NSE = 0.90–0.97; RSR = 0.19–0.32; MBE = −0.03 to 0.05 mg/L) predicted DO better than SVM (NSE = 0.49–0.86; RSR = 0.38–0.71; MBE = −0.14 to 0.08 mg/L) when TN and TP were not included in the predictor variable set (i.e., M4). Furthermore, the RF model (NSE = 0.81–0.98; RSR = 0.12–0.44; MBE = −0.10 to 0.11 mg/L) outperformed the MLR model (NSE = 0.26–0.71; RSR = 0.54–0.86; MBE = −0.18 to 0.17 mg/L) considering all combinations across all stations.
Spatially ensemble (multisite synthesized) modeling
The multisite synthesis approach was applied to predict DO after categorizing the stations based on the TN:TP ratio (Table S2 in the Supplementary Material). The trained sub-models were used to independently evaluate the prediction performance of the validation stations that are not included in training for each limiting nutrient category.
The performance of the multisite synthesized model for different nutrient-limited estuaries for corresponding independent validation estuary DO observations. The model performance was calculated using different models (MLR, multiple linear regression; SVM, support vector machine; RF, random forest; DT, decision tree; WM, Wang–Mendel) and different sets of predictor variables (M1–M4; details in Table 2). N-, P-, and N&P-co-limited refer to the nitrogen, phosphorus, and nitrogen & phosphorus co-limited streams, respectively. NSE and RSR refer to the Nash–Sutcliffe Efficiency and the ratio of root-mean-square error to the standard deviation of observations, respectively.
Scatter plots of observed and predicted DO during training (blue circles) and testing (red triangles) periods. The available time series of observed estuary DO under each limiting nutrient category were compiled to train the models (multiple linear regression (a)–(d), SVM (e)–(h), RF (i)–(l), decision tree (m)–(p), and WM (q)–(t)), which then predicted the corresponding available independent estuary DO observations considering the M1 combination. The dotted line represents the 1:1 line. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wqrj.2022.002.
DISCUSSION
In this study, four different machine learning techniques were trained and tested to predict estuarine DO as a function of climatic and water quality drivers in a limited data setting. The inclusion of nutrients in the model provides better performance in predicting DO (Figures 4 and 5; Figure S2 in the Supplementary Material), indicating that nutrients (i.e., TN and TP) are potentially essential components for model training. Furthermore, the RF and SVM models exhibited better model performance compared to DT, WM, and MLR when nutrients were included as predictors. Additionally, the model training and testing with the spatially aggregated datasets suggested an overall better predictive ability of RF compared to the other ML models and the benchmark MLR under similar limiting nutrient conditions. Overall, the study suggested that the RF model incorporating both climatic variables and nutrients as predictors (M1 combination) would provide the best performance in predicting estuarine DO under a limited data environment.
The drivers of DO and the choice of predictors for model training
Although increasing the number of predictor variables may reduce prediction errors, modeling complexity and uncertainty increase with the number of predictors. The correlation results indicated that Tw is the most influential factor in describing DO (Figure 3), while correlations of salinity, pH, and nutrients with DO were weak to moderate. The nutrients (TN and TP) were positively correlated with Tw as the mineralization rate of organic matter increases with Tw (Sobek et al. 2017). TN was negatively related to salinity as salinity causes nutrient deficiency. Although the strength of influential predictor variables (pH, Sal, TN, and TP) changes spatially and temporally across the southwest coast of Florida, their overall linkages and optimal influences remained similar. Therefore, the incorporation of both source- (pH) and sink- (Tw, salinity, TN, and TP) type predictors of DO supports the reasonably accurate selection and combinations of predictor variables for the model training.
Comparing the model performance at each site across the different combinations (M1–M4) helped identify the best combination of independent variables for predicting DO. The results suggested that the site-specific model performance was moderately reduced after removing TP (0.16–42.13%) and TN (0.12–34.15%) from the predictor variable set, in 66 and 73% of cases, respectively, considering all combinations across all stations. Based on the comparison between M2 and M3, TP is comparatively more important than TN for capturing the estuary DO dynamics at the site-specific level. The exclusion of both nutrient variables (combination M4) from the M1 combination notably reduced the site-specific model performance (0.42–84.6%) during testing across all stations, highlighting the importance of keeping nutrients as predictors for ML-based estuary DO modeling.
The importance of the nutrients as predictors was further explored by using a multisite synthesis approach where the water quality stations were categorized based on their TN:TP ratio. The ratio between TN and TP defines the limiting nutrient of the estuary ecosystem and was used to categorize the stations as N-limited, P-limited, or N&P-co-limited estuaries. A high TN:TP ratio indicates higher TN and introduces eutrophication in the estuary system. Eutrophication restricts the penetration of incoming light into deeper water and leads to a lack of oxygen in the estuary system. The results suggested that DO follows a decreasing trend with increasing TN:TP ratio (Figure S1 in the Supplementary Material). This supports the idea of categorizing the sites based on the TN:TP ratio for model training under similar nutrient-limited regimes.
The driving mechanisms of DO due to changes in land use are well known (Dos Reis Oliveira et al. 2019). Furthermore, the availability of LULC data has led to the incorporation of LULC in data-driven DO model development. Although LULC is a potential predictor of the nutrient contribution to an estuary when an extensive dataset is available (Kang et al. 2010), LULC changes little over short periods, which limits its ability to capture the effects of nutrient dynamics on the monthly or seasonal variability of DO. Consequently, short-term nutrient dynamics are integral to investigating coastal hypoxia, harmful algal blooms, and impacts on human health (Pellerin et al. 2016). However, the scarcity of nutrient data restricts the incorporation of nutrients in model development. This study addresses these limitations by incorporating nutrients (TN and TP) into the model to investigate the most influential nutrient for DO modeling.
Performance-based ranking of the developed ML models
The ML model performances were evaluated using different combinations of predictor variables (M1–M4) under a limited data environment, and the ML model results were compared with the traditional MLR model. Consistent with the existing literature (Heddam & Kisi 2017), the trained ML models reasonably captured the nonlinear relationships between response and predictor variables. Furthermore, the observation that the ML models consistently performed better than the MLR model (Figure 4) was analogous to the findings of the existing literature (Zhang et al. 2019). The selection of fuzzy rules in the WM algorithm led to the inability of the site-specific WM model to adequately predict DO (Figure 4) (Shaghaghian 2010). Although the site-specific DT predicted DO with acceptable accuracy, the site-specific RF and SVM performed better than DT for all stations.
Similar results were observed in the multisite synthesis approach considering all combinations. The incompatibility with overlapping target data hinders the applicability of the SVM model, which works well for high-dimensional spaces. Hence, SVM was able to predict within acceptable accuracy only for ROOK 456 considering all combinations. Furthermore, generating new nodes to capture untrained data leads to high variance in the response variable, and DT consequently failed to capture the dynamics for the independent validation estuaries (NSE = 0.06–0.57; RSR = 0.65–0.96; MBE = −0.06 to 0.45 mg/L) in 90% of cases, considering all combinations across all independent validation stations. Similarly, the incompatibility of the WM model with predicting data outside the training period (Wang & Mendel 1992) was consistent with the results for the studied estuaries (NSE = −0.37 to 0.49; RSR = 0.71–1.16; MBE = −0.09 to 0.45 mg/L) considering all combinations. Therefore, the RF model is more stable and accurate for capturing the estuary DO dynamics in the multisite synthesized approach (Figures 5 and 6 and Figures S23–S25 in the Supplementary Material).
Overall, the analysis indicated the superior performance of RF for all the model combinations, although the prediction accuracy decreases with increasing TN:TP ratio. Its freedom from the normality assumption and its ability to capture complex nonlinear relationships distinguish the RF model (Vincenzi et al. 2011) from the other models. In general, an increasing TN:TP ratio represents higher nitrogen pollution in an estuary. Therefore, the results imply that ML models, particularly RF, perform better in less polluted systems, which is also consistent with the literature (Zhu & Heddam 2020).
On the effectiveness of the ML approach under a limited data environment
There is a wide range of ML methods for DO simulation and prediction in estuaries, streams, and rivers (Valera et al. 2020). However, the application of ML models that ensure satisfactory predictive performance using discrete data is still limited in water quality prediction (Chen et al. 2020). Khoshgoftaar et al. (2007) reported that ML models are not optimal for rare events (small datasets). Furthermore, the existing models require large continuous time-series data along with many predictor variables, which increases model complexity and computation time. Considering these limitations, the study explored the application of ML under a limited data environment, and the predictive power obtained for the site-specific RF model supported the application of ML models using optimal predictor variables under a limited data setting (Figure 4). Furthermore, the RF model also supports the application of ML models for predicting independent validation observations of estuary DO when the testing period lies outside the range of the training period (Figure 4 and Figures S3–S22 in the Supplementary Material).
CONCLUSIONS
The objective of this study was to investigate the applicability of ML models that consider nutrient variables for predicting estuary DO with limited water quality data. Estuary DO dynamics demand incorporating nutrients into the ML model to better explain the response variable, and this study confirms the potential of nutrients for estuary DO prediction. Furthermore, the nonlinear relationship between DO and predictor variables limits the applicability of the MLR model, and this study confirms that site-specific ML models (RF, SVM, DT, and WM) were able to capture the estuary DO dynamics in the good (RSR = 0.50–0.60) to very good (RSR < 0.50) range during testing. Furthermore, the satisfactory ML model performance for N- and N&P-co-limited estuaries confirms the spatial transferability of the RF model when the stations are categorized based on the TN:TP ratio. Overall, this study facilitates the application of both site-specific and multisite synthesis approaches for developing an RF model under a limiting nutrient category with fine-resolution data (e.g., weekly, daily, and hourly) to acquire knowledge about the estuary ecosystem. The spatial transferability within the limiting nutrient category also widens the application of the RF model as a gap-filling technique and improves the repository of water quality data. Therefore, the successful application of the RF model using limited data helps to obtain more insight into similar limiting nutrient-based estuary ecosystems. Incorporating more observations, as nutrient data become available, is recommended to establish a more concrete foundation for the applicability of ML models in the limiting nutrient-based multisite synthesis approach. As a result, this study would provide insight for estimating total daily load for the coastal/marine ecosystem.
ACKNOWLEDGEMENT
The data used in this study were obtained from the SFWMD environmental monitoring database. The author would like to thank the staff of the DBhydro and the SFWMD who are involved in the sample collection, data processing, and management of the corresponding websites.
DATA AVAILABILITY STATEMENT
All relevant data are available from an online repository or repositories (https://my.sfwmd.gov/dbhydroplsql/show_dbkey_info.main_menu).
CONFLICT OF INTEREST
The authors declare there is no conflict.