ABSTRACT
Machine learning models for water quality prediction often face challenges due to insufficient data and uneven spatial-temporal distributions. To address these issues, we introduce a framework combining machine learning, numerical modeling, and remote sensing imagery to predict coastal water turbidity, a key water quality proxy. This approach was tested in the Great Lakes region, specifically Cleveland Harbor, Lake Erie. We trained models using observed data and synthetic data from 3D numerical models and tested them against in situ and remote sensing data from PlanetLabs' Dove satellites. High-resolution (HR) data improved prediction accuracy, with RMSE values of 0.154 and 0.146 log10(FNU) and R2 values of 0.92 and 0.93 for validation and test datasets, respectively. Our study highlights the importance of unified turbidity measures for data comparability. The machine learning model demonstrated skill in predicting turbidity through transfer learning, indicating applicability in diverse, data-scarce regions. This approach can enhance decision support systems for coastal environments by providing accurate, timely predictions of water quality variables. Our methodology offers robust strategies for turbidity and water quality monitoring and has potential for improving input data quality for numerical models and developing predictive models from remote sensing data.
HIGHLIGHTS
Developed a framework combining machine learning, numerical modeling, and remote sensing for accurate turbidity prediction.
Need for uniform turbidity measurement to improve monitoring and data comparability.
Explored turbidity prediction in data-scarce areas via transfer learning.
INTRODUCTION
Understanding the clarity and quality of water through turbidity, a key parameter influenced by particulate matter, is crucial for the health of aquatic ecosystems (Boyd & Tucker 2012; Water Quality & Health 2017). Turbidity reflects not only the physical conditions of water bodies but also hints at the presence of pollutants, making it an essential indicator for environmental monitoring. While measurements in nephelometric turbidity units (NTU), formazin nephelometric units (FNU), and milligrams per liter (mg/L) are common, the relationship between turbidity and suspended solids is complex, varying with particle characteristics (Bilotta & Brazier 2008; U.S. Geological Survey 2022). This underscores the need for reliable approaches to accurately gauge water quality and address the environmental challenges posed by elevated levels of turbidity.
Monitoring and predicting turbidity in coastal areas present a formidable challenge, necessitating the convergence of different methods, from field measurements to numerical modeling and remote sensing. Mechanistic models, particularly those based on three-dimensional conservation principles, have been used extensively to simulate the fate and transport of contaminants in water bodies (Pelletier et al. 2006; Nguyen et al. 2014, 2017; Safaie et al. 2020; Feizabadi et al. 2022; Memari & Phanikumar 2024a, 2024b). These models can simulate complex phenomena such as turbidity currents with reversing buoyancy (Sequeiros et al. 2009). However, their accuracy pivots on the realistic representation of forcing fields and boundary conditions, such as river discharge and turbidity (Jalón-Rojas et al. 2021; Zhu et al. 2022; Feizabadi et al. 2023). This suggests that while these models have furnished pivotal insights, their reliability and applicability necessitate further enhancements in the context of dynamic environments such as coastal waters.
In parallel with numerical modeling, remote sensing, particularly the use of satellite data, has emerged as a potent tool for assessing water quality parameters (Nelson et al. 2002; Olmanson et al. 2008; Topp et al. 2020; Li et al. 2022). This technology offers a synoptic view of the water bodies, allowing the consistent tracking of parameters, including turbidity. It exploits remote sensing indices such as the normalized difference turbidity index (NDTI) and normalized difference water index (NDWI) to infer turbidity levels from space (Gao 1996; Lacaux et al. 2007). However, despite its promise, remote sensing confronts several limitations, such as the inability to penetrate deep water bodies and the influence of atmospheric conditions on the received signal (Cunningham et al. 2013), necessitating supplementary approaches for water quality assessment.
In light of this, the fusion of machine learning with remote sensing and field measurement has gained momentum in recent years (Filisbino Freire da Silva et al. 2021). Machine learning models, equipped with the ability to discern complex patterns in data, have been utilized to predict outcomes related to water quality, including turbidity (Normandin et al. 2019; Li et al. 2021). For example, machine learning models such as random forests and decision trees have been effective in predicting turbidity and water quality from environmental factors (Anmala & Turuganti 2021; Venkateswarlu & Anmala 2023). These models, however, are data-sensitive and demand high-quality labeled data for training, which may not always be available. A significant limitation is the lack of spatial and temporal resolution in field measurements, which is critical in dynamic coastal environments. Similarly, the generalizability of these models hinges on the training data that encompasses a wide range of concentration values, which field sampling often lacks, especially for high turbidity samples.
The advent of transfer learning, a technique that allows the application of knowledge from related source domains to improve the performance of models in a target domain, has shown promising results in overcoming these challenges (Farahani et al. 2021; Zhuang et al. 2021; Hee et al. 2022). Transfer learning techniques, coupled with machine learning algorithms, have shown potential in applications such as remote sensing for water quality assessment (Pu et al. 2019; Zhu et al. 2019; Gambin et al. 2021; Syariz et al. 2022; Arias-Rodriguez et al. 2023).
Despite these advancements, considerable gaps remain, particularly concerning data availability in a wide range of concentration values for training machine learning models and their generalizability. This calls for an integrated approach that leverages the strengths of numerical modeling, machine learning, and remote sensing. Our study addresses these gaps by utilizing synthetic data from well-tested numerical models to train machine learning models for turbidity prediction using remote sensing imagery in the visible bands. We also investigate the applicability of the trained models through transfer learning in different domains. This approach will allow us to predict turbidity in a unified concentration unit (FNU) for the study site and potentially for other coastal areas. The study contributes novel insights into the precision and limitations of numerical models, as turbidity can serve as a natural tracer. This holistic approach has the potential to transform our understanding of water quality in dynamic coastal environments.
MATERIALS AND METHODS
Study area
The Port of Cleveland, located along the banks of the Cuyahoga River, has played a vital role in the economic development of the city. However, the port has also contributed to the pollution of the river through the discharge of wastewater and other industrial effluents. Recent endeavors addressed these concerns and diminished the adverse impacts on the river and its surrounding ecosystems. The port has implemented several initiatives aimed at enhancing water quality and safeguarding the health of the river and the Great Lakes ecosystem (Cleveland-Cuyahoga County Port Authority 2024).
Multiple factors contribute to the pollution and turbidity observed in the Cuyahoga River. Notably, the discharge of untreated or partially treated wastewater from municipal and industrial sources is a significant source of contamination. This encompasses sewage, stormwater runoff, and various types of water contaminated with chemicals, metals, and other pollutants. Another contributing factor is the release of untreated or partially treated industrial effluent from factories and other industrial facilities, which contains chemicals, metals, and other pollutants utilized or produced during the manufacturing process (American Rivers 2024). In addition to these pollution sources, stormwater runoff can transport sediment and debris into the river, exacerbating its turbidity. Moreover, the river is affected by nonpoint source pollution, including agricultural and urban runoff.
Mechanistic modeling details and data
Circulation and transport models
This study utilized the finite volume community ocean model (FVCOM) to conduct simulations of lake-wide and nearshore circulation. FVCOM is a three-dimensional numerical model based on triangular unstructured grids, capable of prognostic, free-surface simulations of water circulation and transport in coastal and oceanic environments (Chen et al. 2003). It is widely employed for investigating hydrodynamic processes such as ocean currents and waves. FVCOM offers versatility in applications, including forecasting water levels and tides, modeling the movement of pollutants such as oil spills, and assessing the impacts of climate change on the ocean (Yang & Khangaonkar 2007; Chen et al. 2008; Lai et al. 2010; Ma et al. 2011; Memari & Siadatmousavi 2018). The model incorporates hydrostatic and Boussinesq approximations to solve the governing hydrodynamic equations. Vertical eddy viscosity and diffusivity are described using the Mellor-Yamada level 2.5 turbulence closure scheme (Mellor & Yamada 1982; Galperin et al. 1988), while the Smagorinsky turbulence closure model is employed for horizontal diffusion to determine coefficients for horizontal momentum and thermal diffusion (Smagorinsky 1963). These approaches allow the horizontal and vertical mixing coefficients to be calculated dynamically from local conditions, so that mixing processes are represented consistently in space and time within the model. Detailed equations for FVCOM can be found in Chen et al. (2003).
The upwind scheme was applied as it is particularly suitable for advection-dominant flows, such as those occurring within the harbor influenced by river flow. Moreover, careful attention was given to accurately representing the computational domain to minimize errors associated with steep gradients and numerical diffusion.
The hydrodynamic model simulation was conducted from 1 April 2019 to 30 July 2019 with a time step of 0.2 s. The simulation was initialized as a cold start with zero velocity. The initial temperature values were assigned as 0.1 °C at the surface and linearly increased to 2.0 °C at the bottom of the lake. The first 2 months of the simulation (April and May) were designated as a spin-up period.
The coupled transport model describing the turbidity transport in space and time can be described by the following advection-dispersion equation:
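A generic three-dimensional form of such an equation is sketched here under assumed notation. The symbols $C$, $u$, $v$, $w$, $K_h$, $F_C$, and $S$ are illustrative and are not taken from the source, and the exact form used in the study may differ:

```latex
\frac{\partial C}{\partial t}
+ u \frac{\partial C}{\partial x}
+ v \frac{\partial C}{\partial y}
+ w \frac{\partial C}{\partial z}
= \frac{\partial}{\partial z}\left( K_h \frac{\partial C}{\partial z} \right)
+ F_C + S
```

In this sketch, $C$ is the turbidity, $(u, v, w)$ are the velocity components supplied by the circulation model, $K_h$ is the vertical eddy diffusivity, $F_C$ denotes the horizontal diffusion term, and $S$ represents any sources and sinks.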
The coupled transport model run was initialized on 30 May 2019 and continued until 30 July 2019, with a time step of 0.2 s. The initial turbidity was set uniformly to 8 FNU across all computational nodes. Output from the model was saved at hourly intervals throughout the simulation period.
Domain discretization and bathymetry data
The computational domain encompassing Lake Erie and the Cleveland Harbor Area was discretized into 56,556 triangular elements and 30,071 nodes in the horizontal direction. The resolution of the triangular mesh in the horizontal plane varied based on the distance from the shoreline and bathymetry contours. In the harbor area and at the river mouth, the mesh resolution ranged from 10 to 40 m (Figure 1(b) and 1(c)), gradually increasing to 1,800 m in the central and deeper regions of the lake. The utilization of an unstructured triangular mesh facilitated precise representation of abrupt changes in the coastline shape, particularly within the harbor area. In the vertical direction, the computational domain was divided into 20 uniformly spaced sigma layers (21 levels). The resolution varied from a few centimeters near the shoreline to several meters in the offshore and deep sections of the lake.
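The relationship between water depth and vertical resolution under uniform sigma layers can be illustrated with a short sketch. The two example depths below are hypothetical values chosen for illustration, not depths taken from the model mesh:

```python
def sigma_levels(n_layers=20):
    """Sigma coordinates of the layer interfaces, from 0 (surface) to -1 (bottom)."""
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    return [-k / n_layers for k in range(n_layers + 1)]


def sigma_layer_thickness(depth_m, n_layers=20):
    """Thickness (m) of each layer when a water column is split uniformly."""
    if depth_m <= 0 or n_layers <= 0:
        raise ValueError("depth_m and n_layers must be positive")
    return depth_m / n_layers


# Hypothetical depths: a shallow nearshore node and a deep offshore node.
for depth in (0.5, 60.0):
    print(f"depth {depth:5.1f} m -> layer thickness {sigma_layer_thickness(depth):.3f} m")
```

With 20 layers, a 0.5 m water column gives layers a few centimeters thick, while a 60 m column gives layers of about 3 m, consistent with the range described above.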
For most parts of the computational domain, 3-arc-second bathymetry data with a resolution of approximately 90 m, obtained from NOAA (National Geophysical Data Center 1999), were employed. While this resolution is considered accurate for the majority of lake and ocean models, it lacks the required precision within the harbor, near breakwaters, and at the river mouth where there are sharp bathymetry gradients and HR mesh. To address this issue, two additional sets of bathymetry data were utilized. The first was 10-m bathymetry data obtained from the United States Army Corps of Engineers, USACE, (United States Army Corps of Engineers. Hydrographic Surveys 2023), which covered most of the harbor and river mouth areas. The second consisted of electronic navigational charts (ENC) data from the NOAA's Office of Coast Survey (www.nauticalcharts.noaa.gov) (NOAA Electronic Navigational Charts (ENC) | InPort, n.d.), which accounted for less than 1% of the total bathymetry data and specifically captured the bathymetry near the breakwater walls. This level of detail was necessary to ensure an accurate representation of the computational domain, given our focus on the circulation within and in proximity to the harbor area and the transport between the lake and the river. Additionally, the fine mesh resolution of 10–40 m applied inside and near the harbor further justified the inclusion of such detailed bathymetry. All bathymetry data were interpolated to the computational mesh using the natural neighbor method. This method, also known as Sibson interpolation, generates smooth surfaces by calculating a weighted average of neighboring data points based on the proportion of the overlapping Voronoi cells. It is particularly effective for scattered data and less sensitive to data point distribution, ensuring high accuracy in representing the bathymetric surface (Sibson 1981; Edelsbrunner & Shah 1992; Hiyoshi & Sugihara 2004).
Meteorological forcing data
Meteorological data from the National Data Buoy Center (NDBC, NOAA National Data Buoy Center 1971) and the ERA5 reanalysis dataset from the European Center for Medium-Range Weather Forecast (ECMWF, Hersbach et al. 2020) were combined to generate the forcing field for the mechanistic model (FVCOM model). To integrate the two meteorological datasets, hourly wind and air temperature data from the ERA5 grid points situated at least 15 km away from any NDBC stations were extracted and merged with the NDBC data. The observed wind speeds from the NDBC stations were adjusted to a height of 10 m as described by Schwab & Morton (1984), who used a formulation that incorporates parameters such as the drag coefficient, stability length, and roughness length. The height-adjusted NDBC wind speeds were then combined with the ERA5 10-m wind speeds. On the other hand, no height correction was applied to the NDBC air temperature, as air temperature variations over short heights are negligible. The combined wind speeds and air temperatures were subsequently interpolated to the numerical mesh using the natural neighbor method.
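The height adjustment step can be illustrated with a simplified sketch. Note that this is not the Schwab & Morton (1984) formulation, which also accounts for atmospheric stability and a drag coefficient; the sketch assumes a neutral logarithmic wind profile, and the roughness length and sensor height are hypothetical values:

```python
import math


def adjust_wind_to_10m(speed_ms, sensor_height_m, roughness_length_m=2e-4):
    """Scale a wind speed measured at sensor_height_m to 10 m.

    Assumes a neutral logarithmic profile (a simplification, not the
    stability-dependent method cited in the text). roughness_length_m is an
    assumed open-water value.
    """
    if roughness_length_m <= 0 or sensor_height_m <= roughness_length_m:
        raise ValueError("sensor height must exceed a positive roughness length")
    factor = math.log(10.0 / roughness_length_m) / math.log(
        sensor_height_m / roughness_length_m
    )
    return speed_ms * factor


# Hypothetical buoy anemometer 3 m above the water surface.
print(round(adjust_wind_to_10m(5.0, 3.0), 2))
```

Under this assumption, a wind measured below 10 m is scaled up and a wind measured exactly at 10 m is left unchanged.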
For other forcing fields, including solar radiation (shortwave and longwave), cloud cover, and relative humidity, interpolation was performed solely from the ERA5 grid points over the lake, without integration with field observations. The decision to use an integrated approach combining both ERA5 and NDBC data was driven by the limited availability and spatial inconsistency of field observations over the lake, particularly on the northern side (Canadian side) of Lake Erie. By integrating the field observations with the ERA5 reanalysis dataset, we aimed to enhance the spatiotemporal resolution of the data while utilizing valuable field observations. This integrated method is expected to provide a more accurate representation of the forcing fields for the mechanistic model, compared to relying solely on either ERA5 or NDBC data.
River forcing data
Hourly observations of river discharge and turbidity were utilized to generate input (boundary condition) for the transport model. The data were obtained from the USGS stream gauge (#04208000, U. S. Geological Survey 2016) at the Cuyahoga River at Independence, OH, which is located 20.1 km away from the mouth of the river. However, the turbidity time series had continuous gaps due to missing data. To address this, interpolation methods were employed to predict the missing values. Classic interpolation methods proved ineffective due to the non-linear nature of turbidity values and continuous missing data. Therefore, Gaussian Process Regression (GPR), a machine learning technique (Wang & Jing 2022), was employed to predict the missing values and to generate hourly turbidity values as input for the mechanistic model.
GPR is a Bayesian nonparametric model suitable for interpolation and prediction (Rasmussen & Williams 2005; Wang & Jing 2022). It assumes that the underlying function describing the data follows a Gaussian process, which consists of random variables indexed by inputs with a joint Gaussian distribution. GPR provides a flexible and powerful approach to modeling complex datasets by defining a prior over functions and updating this prior with observed data to obtain a posterior distribution. This allows GPR to provide not only predictions but also uncertainty estimates, and this is particularly useful in environmental modeling.
One of the key advantages of GPR is its ability to handle noise in the data by explicitly incorporating it into the likelihood function, leading to more robust and reliable predictions (Rasmussen & Williams 2005). The kernel function, a crucial component of GPR, determines the covariance structure of the Gaussian process and can be tailored to capture the specific characteristics of the data, such as periodicity or smoothness. Commonly used kernels include the radial basis function (RBF) kernel and the Matérn kernel, each offering different properties suited to various types of data (Duvenaud 2014). By leveraging GPR, we effectively addressed the challenges posed by the non-linear and missing turbidity data, ensuring accurate and reliable input for our mechanistic model.
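The gap-filling idea can be sketched in a few lines of dependency-free Python. This is a minimal illustration of GPR with an RBF kernel, not the configuration used in the study: the kernel choice, the hyperparameters (`length_scale`, `variance`, `noise`), and the time series are all hypothetical, and in practice the hyperparameters would normally be fitted to the data (for example with scikit-learn's `GaussianProcessRegressor`).

```python
import math


def rbf(a, b, length_scale, variance):
    """Radial basis function (squared exponential) kernel."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L y = b for lower-triangular L."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def backward_sub(L, y):
    """Solve L^T x = y for lower-triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


def gpr_fill(t_obs, y_obs, t_new, length_scale=2.0, variance=1.0, noise=1e-2):
    """Return (mean, std) of the GP posterior at each time in t_new."""
    mean = sum(y_obs) / len(y_obs)          # constant prior mean
    resid = [y - mean for y in y_obs]
    n = len(t_obs)
    # Covariance of the observations, with noise added on the diagonal.
    K = [[rbf(t_obs[i], t_obs[j], length_scale, variance)
          + (noise if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = backward_sub(L, forward_sub(L, resid))   # K^{-1} (y - mean)
    out = []
    for t in t_new:
        k_star = [rbf(t, ti, length_scale, variance) for ti in t_obs]
        mu = mean + sum(k * a for k, a in zip(k_star, alpha))
        v = forward_sub(L, k_star)
        var = max(variance - sum(x * x for x in v), 0.0)
        out.append((mu, math.sqrt(var)))
    return out


# Hypothetical hourly series (log10 turbidity) with a gap between hours 6 and 9.
t_obs = [0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15]
y_obs = [1.0, 1.1, 1.3, 1.6, 1.8, 1.9, 1.2, 1.1, 1.0, 1.0, 0.9, 0.9]
for t, (mu, sd) in zip([6, 7, 8, 9], gpr_fill(t_obs, y_obs, [6, 7, 8, 9])):
    print(f"hour {t}: mean {mu:.2f}, std {sd:.2f}")
```

The posterior standard deviation grows toward the middle of the gap, which is the uncertainty estimate referred to above.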
Remote sensing
In this study, remote sensing data from Planet Labs' PlanetScope Dove satellites (Planet Labs PBC 2019) were utilized to train a machine learning model based on the synthetic data from the mechanistic model. PlanetScope satellite constellation operates a fleet of small remote sensing satellites that capture HR images of the Earth's surface on a daily basis. These images, obtained through multispectral and hyperspectral sensors, offer a frequent revisit time (1–2 days) and high spatial resolution (∼3 m), making them suitable for our research focused on a small area with dynamic circulation and transport patterns. It should be noted that the spectral resolution of Dove satellite data is comparatively lower than other remote sensing datasets like Landsat 8, Landsat 9, Sentinel 2, and Sentinel 3.
The specific data employed in our research were the surface reflectance (SR) product, which provides radiometrically calibrated and atmospherically corrected images. This product is available for orthorectified scenes captured by the sun-synchronous orbit Dove satellites. To achieve atmospheric correction, lookup tables generated using the 6SV2.1 radiative transfer code were employed, enabling the mapping of top-of-atmosphere (TOA) reflectance to bottom-of-atmosphere (BOA) reflectance. The SR product is provided as a 16-bit GeoTIFF image, with reflectance values scaled by a factor of 10,000. For atmospheric inputs, water vapor and ozone information were retrieved from MODIS (Moderate Resolution Imaging Spectroradiometer, https://modis.gsfc.nasa.gov/) near-real-time (NRT) data, while aerosol optical depth (AOD) input was determined from MODIS NRT aerosol data. By considering localized atmospheric conditions, the SR product ensures consistency and minimizes uncertainty in spectral response across various time points and locations. Further details can be found in Table 1.
Instrument | Spectral Bands (nm) | Pixel Size (m) | Atmospheric Corrections
---|---|---|---
PS2 | Blue: 455–515; Green: 500–590; Red: 590–670; NIR: 780–860 | 3.0 | Conversion to top-of-atmosphere (TOA) reflectance values using at-sensor radiance and supplied coefficients. Conversion to surface reflectance values using the 6SV2.1 radiative transfer code and MODIS NRT data.
Data preparation for machine learning
To ensure data quality for training the machine learning model, the scope was strictly limited to the data derived from the harbor sampling area (Figure 1(b)). This strategic selection was crucial in guaranteeing that the detected turbidity predominantly originated from the Cuyahoga River, thereby minimizing the influence of lake-wide turbidity and turbidity from other rivers (both upstream and downstream). This decision also capitalized on the resolution of the mechanistic and transport model, which, considering the data-intensive nature commonly associated with ML models, supplied ample training data samples.
The machine learning model was then developed using processed simulated turbidity and remote sensing data, specifically focusing on the harbor sampling area. Of the data from the five images used to develop the model, 75% was used for training and validation, and the remaining 25% was dedicated to testing the ML model. This allocation keeps the test data separate from model development, so that the reported performance reflects data not seen during training. The trained ML model can be utilized to predict turbidity (in FNU) based on the remote sensing SR at four bands, as well as the NDTI and NDWI indices. The flowchart for developing the ML model, based on the simulated turbidity and the remote sensing data, is shown in Figure 2(c).
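The six predictors and the 75/25 split described above can be sketched as follows. Several details are assumptions rather than facts from the source: NDTI is taken as (red - green)/(red + green), NDWI is taken in its green/NIR form because the PS2 instrument has no shortwave-infrared band, the split is a simple random shuffle, and the digital numbers are hypothetical. Only the scale factor of 10,000 comes from the text.

```python
import random

SR_SCALE = 10_000  # surface reflectance scale factor stated for the SR product


def normalized_difference(a, b):
    """(a - b) / (a + b), or NaN when the denominator is zero."""
    denom = a + b
    return float("nan") if denom == 0 else (a - b) / denom


def build_features(blue_dn, green_dn, red_dn, nir_dn):
    """Return [blue, green, red, nir, NDTI, NDWI] for one pixel."""
    blue, green, red, nir = (dn / SR_SCALE for dn in (blue_dn, green_dn, red_dn, nir_dn))
    ndti = normalized_difference(red, green)   # assumed NDTI definition
    ndwi = normalized_difference(green, nir)   # assumed green/NIR NDWI form
    return [blue, green, red, nir, ndti, ndwi]


def split_samples(samples, test_fraction=0.25, seed=0):
    """Shuffle and split into (train_and_validation, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical digital numbers for a turbid-water pixel.
print(build_features(400, 900, 1200, 300))
train_val, test = split_samples(range(100))
print(len(train_val), len(test))
```

A random split is only one option; splitting by image or by location would be a stricter test of generalization, and the source does not state which was used.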
Transfer learning: detecting turbidity in other regions of the lake
Transfer learning, briefly defined as the application of a model trained in one domain to a new, but related, domain, was employed in our study to enhance the adaptability of machine learning models across diverse regions (Zhuang et al. 2021; Syariz et al. 2022). We utilized the trained ML model, which was developed for predicting turbidity based on remote sensing images in the Cleveland Harbor area, to predict turbidity in other regions of Lake Erie, namely Cattaraugus River and River Raisin (Figure 1). This allowed us to assess the generalizability of the trained ML model across different regions of the lake.
Validation process
The performance of the mechanistic model in accurately describing water velocity was assessed by comparing model results for water current velocity with observed data obtained from a Nortek Aquadopp HR (2.0 MHz frequency) Profiler ADCP (Acoustic Doppler Current Profiler) deployed in Lake Erie at Erie, PA (coordinates: 42.1886 N and −79.9821 W), from 9 August to 10 September 2019 (Memari & Phanikumar 2024a). The instrument was deployed at a depth of 5.80 m from the surface and the measurement cell size was set to 0.25 m. Likewise, the accuracy of the mechanistic model in representing surface water temperature was evaluated using data from two NDBC buoys (#45164 and #45169), as illustrated in Figure 1.
Error metrics
Two metrics, the root mean square error (RMSE) and the coefficient of determination (R2), were used to evaluate the performance of the mechanistic and machine learning models against the observed data. RMSE measures the average magnitude of the differences between predicted and observed values, indicating the overall accuracy of the models. A lower RMSE signifies better agreement between predictions and observations. R2 assesses the proportion of variance in the observed data explained by the models. Higher R2 values indicate that the models capture a greater portion of the observed variability. Considering both RMSE and R2 allowed the models' accuracy, predictive capability, and ability to capture data patterns to be evaluated together.
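The two metrics can be written out as a short sketch. It assumes that R2 is computed as one minus the ratio of residual to total sum of squares; the squared Pearson correlation is another common definition, and the source does not state which was used. The turbidity values are hypothetical and are log10-transformed here only to mirror the log10(FNU) units reported in the abstract.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between paired observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(
        sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


def r_squared(observed, predicted):
    """Assumed definition: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical turbidity values in FNU, compared in log10 space.
obs_fnu = [5.0, 12.0, 40.0, 150.0]
pred_fnu = [6.0, 10.0, 45.0, 130.0]
obs_log = [math.log10(v) for v in obs_fnu]
pred_log = [math.log10(v) for v in pred_fnu]
print(f"RMSE = {rmse(obs_log, pred_log):.3f} log10(FNU), R2 = {r_squared(obs_log, pred_log):.3f}")
```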
RESULTS
Application of ML to river turbidity prediction
Validation of the circulation model
Variable | RMSE | R2
---|---|---
 | 0.064 | 0.716
 | 0.047 | 0.650
Water Temp (°C), NDBC #45164 | 0.437 | 0.938
Water Temp (°C), NDBC #45169 | 0.482 | 0.955
Additionally, we assessed the mechanistic model's ability to accurately simulate the water temperature by comparing the simulated water surface temperature with observations from two NDBC stations (#45164 and #45169) close to the harbor area. Figure 5(c) and 5(d) demonstrate that the mechanistic model successfully simulated water surface temperature values and captured the overall trend. Table 2 provides the RMSE and R2 values for the comparison between the simulated and observed water surface temperatures. The RMSE was found to be smaller than 0.5 °C, and R2 was greater than 0.93.
Validation of the transport model
In Figure 6, we observe similarities between the extent and intensity of the turbidity plume in the simulated results and the remote sensing RGB images, particularly at a smaller scale within the harbor. Inside the harbor, the simulated turbidity closely resembles the observed data, whereas differences become more pronounced with increasing distance from the point source at the mouth of the river.
Combined performance of circulation and transport models
The accuracy and reliability of turbidity predictions in this study are significantly influenced by the combined performance of the circulation and transport models. The circulation model provides essential velocity fields and boundary conditions that drive the transport model, directly influencing turbidity transport and dispersion. The validation of the circulation model using ADCP data (see Figure 5(a) and 5(b)) confirmed its accuracy in simulating water currents, which is vital for reliable turbidity predictions. The transport model, utilizing these accurate velocity fields, effectively simulates the spatial and temporal distribution of turbidity. This combined approach was validated by comparing simulated turbidity plumes with remote sensing images, demonstrating strong agreement, especially within the harbor (see Figures 6 and 7).
Extracting turbidity maps from remote sensing images
Transfer learning: detecting turbidity in other regions of the lake
Overall, the ML model demonstrated considerable skill in predicting turbidity through transfer learning, particularly when juxtaposed with the RGB images. However, it is worth noting that the model fell short in predicting accurate concentration levels as shown in Figure 10 (indicated by a red box). The absence of fine-tuning for these new sites likely contributed to this shortcoming.
The discrepancy might also be linked to the limitations of the training data. The model was originally trained on a limited number of images from a different region of the lake, which could have influenced the results. Additionally, variations in the concentration of dissolved and suspended compounds within the water might have affected the SR at the measured bands in these regions, thereby potentially contributing to the observed discrepancies.
DISCUSSION
The analysis of our results indicates that the most accurate results are obtained in the harbor area due to detailed domain discretization and precise boundary conditions (Sequeiros et al. 2009). This detailed discretization provides a clearer understanding of turbidity patterns, similar to the insights provided by HR Landsat 8 imagery in the Po River prodelta (Braga et al. 2017). Our results align with the findings of Braga et al. (2017), underscoring the importance of HR data for accurate prediction and analysis of turbidity. It should be noted that this approach can be implemented for the areas without coastal structures as long as the circulation and transport are resolved accurately.
Significantly, our use of a mechanistic model, calibrated with ADCP and temperature data, resolved the water dynamics and captured the spatiotemporal variability of turbidity currents with high fidelity. This approach distinguishes our work from previous studies, as the ADCP data-driven calibration ensures a realistic representation of the underlying physical processes, enhancing the precision and accuracy of our predictions, while providing abundant simulated (synthetic) data for the machine learning model. Excluding this 3D mechanistic modeling component would sacrifice the spatiotemporal resolution, compromise the robustness and reliability of our assessments, and render our model susceptible to oversimplifications and poor performance.
Notably, our study also highlights the need for a unified measure of turbidity across the lake, which aligns with the method used by Zhu et al. (2022) in their study of the Great Lakes. By using the same unit as the USGS gauges, we can compare turbidity values at the river mouth with those in the lake, enhancing the comparability of data.
In contrast, the accuracy of results decreases outside the harbor area, mainly because of a larger mesh size which translates to larger numerical diffusion errors and the absence of other sources of turbidity in the model such as wastewater treatment plants, as well as mechanisms such as deposition and resuspension (Felix 2002; Schulz et al. 2018). This mirrors the conclusions drawn by Schulz et al. (2018), where varying hydro- and meteorological conditions affected sediment fluxes in different locations.
As we moved farther from the river mouth and harbor area, we encountered issues related to mesh sensitivity and numerical diffusion. This suggests that an accurate representation of the domain and forcing fields, such as the wind field (Beletsky et al. 2013), and river boundary conditions is required to address the uncertainty of boundary conditions (Hunt & Jones 2020).
Our study also recognized that turbidity is influenced by a variety of sources along the river, from the USGS gauge to the mouth of the river at the harbor. However, deriving turbidity from one remote sensing index poses challenges; for example, a single NDTI value can correspond to multiple turbidity values (Garg et al. 2017, 2020). This supports the findings of Zheng & DiGiacomo (2022), who utilized a simplified water clarity–turbidity index (CTI) to better capture major changes in water clarity/turbidity by including multiple variables from Visible Infrared Imaging Radiometer Suite (VIIRS) measurements, Secchi disk depth, and particulate backscattering coefficient.
The type of remote sensing instrument used has a significant impact on the results. If a model is trained on imagery from a specific instrument, it may produce inaccurate predictions when applied to images captured by other sensors such as Sentinel or Landsat (Le Fouest et al. 2015; Vanhellemont & Ruddick 2021). Yet, our approach proved effective when training machine learning models on different instruments, supporting the claims of Saberioon et al. (2020) and Filisbino Freire da Silva et al. (2021) regarding the potential of machine learning and satellite data in water quality prediction.
Notably, our study also highlights the need for a unified measure of turbidity across the lake and at the river mouth through in situ instruments, which aligns with the method used by Garg et al. (2017). By using the same unit as the USGS gauges, we can compare turbidity values at the river mouth with those in the lake, enhancing the comparability of data.
Finally, our study proposes using a GPR model to address uncertainty effectively, supporting the conclusions of Filisbino Freire da Silva et al. (2021) about the potential of machine learning in water quality assessment. This strategy, coupled with generating ample data for training machine learning models, can significantly improve our understanding of turbidity detection and monitoring, effectively addressing issues of water composition variability noted in our study and the research conducted by Normandin et al. (2019). This approach also aids in evaluating and refining the performance of mechanistic models in predicting turbidity accurately, building upon the work of Felix (2002) and Sequeiros et al. (2009).
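As a rough illustration of how a GPR attaches an uncertainty estimate to each prediction, the following minimal sketch implements the standard Gaussian process posterior mean and standard deviation in plain Python. The squared-exponential kernel, the hyperparameters (`length`, `variance`, `noise`), and the toy data are assumptions made for illustration; they are not the configuration used in this study.

```python
import math


def rbf(a, b, length=1.0, variance=1.0):
    """Squared-exponential kernel between two scalar inputs."""
    return variance * math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gpr_predict(x_train, y_train, x_new, noise=0.01):
    """Return the GP posterior mean and standard deviation at x_new."""
    n = len(x_train)
    # Covariance of the training inputs plus observation noise on the diagonal.
    K = [[rbf(x_train[i], x_train[j]) + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x_new, xi) for xi in x_train]
    alpha = solve(K, y_train)  # (K + noise*I)^-1 y
    v = solve(K, k_star)       # (K + noise*I)^-1 k_star
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))


# Toy data: a generic scaled predictor and log10 turbidity (hypothetical values).
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [0.5, 0.8, 1.1, 1.4]

mean_in, std_in = gpr_predict(x_train, y_train, 1.5)     # inside the data range
mean_far, std_far = gpr_predict(x_train, y_train, 10.0)  # far from any data

print(f"inside data range: mean={mean_in:.3f}, std={std_in:.3f}")
print(f"far from data:     mean={mean_far:.3f}, std={std_far:.3f}")
```

The predictive standard deviation grows as the query point moves away from the training data, which is the property that makes a GPR useful for flagging predictions made outside the range of conditions seen during training.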
Overall, our approach offers a promising direction for the development of robust, data-rich, and effective strategies for turbidity detection and monitoring. The synergy between in situ measurements, remote sensing imagery, and machine learning algorithms can help in developing a more comprehensive understanding of turbidity dynamics in various marine and estuarine environments.
CONCLUSION
Our research aimed to address data limitations and enhance model generalizability for predicting water turbidity from remote sensing imagery by integrating mechanistic modeling, machine learning, and remote sensing. Our findings confirmed the efficacy of this integrated approach, showing significant improvements in both the precision and accuracy of turbidity predictions across a broad range of turbidity values.
We used a machine learning model to enhance hourly interpolation of river turbidity values across gaps in the turbidity measurements at the Cuyahoga River mouth. This was demonstrated to be effective in describing the boundary conditions of the mechanistic model and improving its performance. Validation results showed strong correlations between observed and predicted turbidity, confirming the reliability of our models. Specifically, the machine learning model for river turbidity was validated not only by using performance metrics such as RMSE and R2 but also by comparing the turbidity plume generated by the mechanistic model with remote sensing images at multiple times.
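The abstract reports RMSE and R2 in log10(FNU), so errors are measured after log-transforming turbidity. The minimal sketch below shows one way such metrics can be computed. The observed and predicted values are invented for illustration, and R2 is taken here as the coefficient of determination, which is an assumption since the definition is not stated in this section.

```python
import math


def rmse(obs, pred):
    """Root-mean-square error between paired observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


# Hypothetical turbidity values in FNU; not data from this study.
obs_fnu = [5.0, 20.0, 80.0, 300.0]
pred_fnu = [6.0, 18.0, 95.0, 260.0]

# Metrics are evaluated on log10-transformed turbidity.
log_obs = [math.log10(v) for v in obs_fnu]
log_pred = [math.log10(v) for v in pred_fnu]

rmse_log = rmse(log_obs, log_pred)
r2_log = r2(log_obs, log_pred)

print(f"RMSE = {rmse_log:.3f} log10(FNU), R2 = {r2_log:.3f}")
```

Working in log10 units keeps a few very turbid samples from dominating the error, which matters when turbidity spans several orders of magnitude.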
We bridged the gap in data availability for training machine learning models for predicting lake turbidity values from remote sensing images by using synthetic (simulated) turbidity data from the mechanistic model. This approach improved data availability across a wide range of concentration values. The accurate and abundant simulated data generated by the mechanistic model proved invaluable for training the machine learning model. This integration underscores the necessity for unified turbidity measures and highlights the impact of remote sensing data quality on prediction accuracy. This machine learning model not only performed well in predicting turbidity from the remote sensing images in the study site (Cleveland Harbor) but also demonstrated acceptable capability in predicting turbidity at other coastal areas, showing model transferability. However, fine-tuning of the machine learning model for new locations will further improve the predictions and generalizability of the model.
A number of decision support systems for coastal environments, including those for beach closures and harmful algal bloom (HAB) severity prediction, involve turbidity as a key variable. Our approach can be extended to enhance such systems by providing accurate and timely predictions of turbidity and other water quality variables of interest (e.g., bacteria, viruses, nutrients, HABs, etc.).
ACKNOWLEDGEMENTS
We thank Drs. Mary Anne Evans and Muruleedhara Byappanahalli (USGS) and Mr Glen Black (USGS dive safety instructor) for their assistance with field data collection. The FVCOM model is available from the University of Massachusetts (http://fvcom.smast.umassd.edu/). ADCP field data used in this research are available on HydroShare (Memari & Phanikumar 2024a). The remote sensing data used in this research are available from the Planet website (https://www.planet.com/). Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.
DATA AVAILABILITY STATEMENT
All relevant data are available from the online repository HydroShare at the following link: https://doi.org/10.4211/hs.5ee190b481c749fb8398f182742720f1.
CONFLICT OF INTEREST
The authors declare there is no conflict.