To analyze the applicability of direct insertion of total suspended matter (TSM) concentration field based on turbidity derived from satellite data to numerical simulation, dispersion studies of suspended matter in Lake Säkylän Pyhäjärvi (lake area 154 km^{2}; mean depth 5.4 m) were conducted using the 3D COHERENS simulation model. To evaluate the practicality of direct insertion, five cases with different initialization frequencies were conducted: (1) every time, when satellite data were available; (2) every 10 days; (3) 20 days; (4) 30 days; and (5) control run without repeated initialization. To determine the effectiveness of initialization frequency, three methods of comparison were used: simple spatial differences of TSM concentration without biomass in the lake surface layer; averaged spatial differences between initialization data and the forecasts; and time series of TSM concentration and observation data at 1 m depth at the deepest point of the lake. Results showed that direct insertion improves the forecast significantly, even if it is applied less often.

## INTRODUCTION

In management and decision-making processes related to lakes, rapid and effective means to achieve a realistic picture of lake conditions, and especially water quality, are required. Such circumstances include the accidental release of wastewater from malfunctioning purification plants, industrial or traffic contaminants, or algal blooms. Furthermore, the Water Framework Directive (WATECO 2003) requires that all EU countries manage their water bodies to obtain a good ecological status. In a lake-rich country such as Finland, the collection of *in situ* observations from all water bodies is practically impossible. Consequently, the Finnish Environment Institute (SYKE) is seeking new methods to obtain information related to water bodies such as automated lake stations, satellite observation and modelling (Lepistö *et al*. 2010). This paper provides new insight into the possibilities of obtaining information using satellite observation and modelling, where data are available, reflecting the possibilities of using similar methodology and parameterization for lakes with sparse data, and even in very rapidly evolving situations, in which careful model calibration and validation are not possible.

Modelling of the dispersion of suspended particulate matter or harmful substances in aquatic environments has long remained an active field of research (James 2002; Virtanen 2009). The difficulty of modelling an actual lake arises from the complex time-dependent boundary conditions, such as inlets and outlets of rivers and the sedimentation process at the bottom, which are usually unknown. Moreover, some physical processes are usually parameterized, and correct parameterization requires great effort from observation work. Furthermore, parameters describing these processes might be sporadic. Detailed description and correct modelling of sedimentation and resuspension are demanding tasks, especially in shallow lakes where the effects of wind and waves penetrate to the bottom.

Data assimilation methods have been known as ways to improve model accuracy. Satellite-image-based information has been assimilated into sediment transport models in several studies with numerous assimilation techniques. In the southern North Sea, the ensemble Kalman filter method was applied. The obtained model results provided consistent sea surface features when compared to remote sensing data, and showed a clear decrease of the model bias for independent *in situ* observation data (El Serafy *et al*. 2011). In Lake Michigan, direct insertion and a Kriging-based approach were applied. They improved the root-mean-square error of the forecast by 40% compared to purely model-based approaches, and by 20% over purely data-based approaches (Stroud *et al*. 2009). In the James River estuary in Virginia, variational inverse schemes were applied and retrieved poorly known parameters in a cohesive sediment transport model (Yang and Hamrick 2003). Weaknesses of those sophisticated methods are the assumptions and earlier information that one must have. The dependency of the original model and the outcome when data are assimilated might be harder to explain.

For these reasons, we studied how conceptually and computationally simple data assimilation methods (direct insertion of the key variable of the model) perform when considering the transport of suspended matter. Sometimes, direct insertion is regarded as an assimilation method when the temporal frequency of the input data is small, but we tried to determine whether direct insertion improved the forecasts, even in rapidly evolving and complex situations, which are usually encountered in demanding calibrations.

To achieve the goal presented above, we performed dispersion studies in Lake Säkylän Pyhäjärvi (Figure 1), the largest lake in south-western Finland. Its water quality demands much attention from local municipalities and industries because it supplies drinking water resources, recreational facilities, and raw materials for fishery-related enterprises. Its eutrophication has been severe because of the load from its agriculturally dominated catchment: rivers introduce 70% of all lake water. Many scientific studies have been conducted to investigate and preserve its water quality (Räsänen *et al*. 1992; Huttula 1994; Malve *et al*. 1994; Ventelä *et al*. 2007; Lepistö *et al*. 2008). This lake is one of the most studied lakes in Finland. Large amounts of information are available for modelling.

## METHODS

### 3D hydrodynamic model and initial values and boundary conditions

We used ver. 2.1.1 of the COHERENS model (Luyten 2011) as our 3D hydrodynamic model to provide the physics related to the advection–diffusion equation. COHERENS solves Navier–Stokes equations using Boussinesq approximation and the assumption of vertical hydrostatic equilibrium using a horizontally uniform rectangular grid with a resolution of 100 m for both horizontal directions (137 and 243 grids in longitude (i direction) and latitude (j direction), respectively) and the vertical 20 σ-layers. The parameterization that we used for the circulation model is found to be reasonable in other lakes in Finland.

We implemented a simple sediment transport module based on COHERENS ver. 1 (Luyten *et al*. (1999) and references of it), that includes resuspension. Modelled sediments are inorganic mineral particles (total suspended matter (TSM)). The settling velocity was calibrated simply as 3.0 × 10^{−8} [m/s] so that the value qualitatively reproduced the observed TSM concentration profile at the deepest point vertically and temporally, i.e., a higher concentration at the bottom than at the surface, and the value near the surface maintained the same level for days, as shown in Figure 4. It is worth noting that the value was actually smaller than the one calculated for this lake by Ekholm *et al*. (1997) by three orders of magnitude: 5.18 [m/d], which is 6.01 × 10^{−5} [m/s]. The impact of that fact on the model results is explained in the ‘Results and discussion’ section.

The calculation period extended from noon on May 15 to noon on 6 July 2009. The initial condition of the lake water temperature was 10.1 [°C], which was the vertical average at the deepest point on 10 and 20 May in 2009 according to the SYKE Hertta database. The initial current velocities were set to zero. The TSM concentration was set to 0 mg/l to stabilize the first stage of the calculation. After 24 hours, at noon on May 16, data initialization for TSM concentration (defined in the ‘Satellite data and their assimilation to the model’ section) was applied for all cases. As boundary conditions, we used TSM loading, river discharges and temperatures at river mouths. Meteorological data were applied homogeneously on the lake surface. The output interval was set to once a day at noon.

### Data and materials

The time series for river discharges [m^{3}/s] and TSM concentrations [mg/l] at the three river mouths (Figure 1) were obtained from the SYKE Hertta database. The river discharges were observed daily. TSM loading data were only available a few times a month. They were linearly interpolated to obtain the daily time series. Daily temperatures of the river water were obtained from catchment model WSFS-Vemala results (Huttunen *et al*. 2013) from SYKE. Meteorological data from the Finnish Meteorological Institute (FMI) include data for the wind speed [m/s] and direction [°], air temperature [°C], humidity (between 0 and 100%), cloud coverage (between 0 and 100%), and air pressure [Pa] on a daily basis.

Satellite data for turbidity and chlorophyll (Chl)-a concentration at the lake surface were provided by SYKE for 7 days: May 16 (obtained at 12 h 47 m 42 s Eastern European Summer Time); June 1 (12 h 44 m 51 s), 8 (12 h 24 m 44 s), 18 (12 h 10 m 22 s), 21 (12 h 16 m 7 s), and 26 (12 h 59 m 14 s); and July 6 (12 h 44 m 51 s). The satellite-based TSM data were processed from MERIS images with the Boreal Lakes processor (Doerffer & Schiller 2008; Koponen *et al*. 2008; Kallio *et al*. 2014). For Chl-a, the FUB/WeW (Schroeder 2005) processor was used. Both quantities were calibrated with observations from the automatic stations. Only cloud-free images (based on visual examination) were used in this analysis. Thus, there were no missing values due to cloud cover. Each image is a single image (no composites). MERIS Full Resolution data were used, which have spatial resolution of 300 m. *In situ* turbidity values [FNU] were obtained daily and manually from the automatic station shown in Figure 1. The manual samplings were carried out three times around the study period.

### Satellite data and their assimilation to the model

_{(no biomass)}[mg/l] denotes the TSM concentration without biomass, Tur and Chl denote turbidity [FNU] and Chl-a concentration [μg/l], respectively, derived from satellite data. When TSM

_{(no biomass)}becomes negative, we simply put zero because the values are so small that it can be ignored. The first term on the right-hand side is the calibration equation between

*in situ*turbidity and Tur with the Boreal processor. The relative error of this relation is estimated as 30%. The last term on it, 0.1·Chl, was adopted as a rough estimation of the amount of dried biomass (Reynolds 2006). Relative error of the pre-factor 0.1 was estimated as 30%. This approximation is based on the probability interpretation of the time-averaged histograms of the measured appearance of different algae groups on the lake during the open water period 2009. The relative error associated to Chl is 20%. Using error propagation procedures (with an average value for measured quantities) the total relative error of Equation (1) is 40%. We will hereinafter call the obtained TSM

_{(no biomass)}‘nudging data’. We also used the same equation for the

*in situ*turbidity presented in Figure 4. The error for that is slightly smaller because the error of

*in situ*measured turbidity is smaller. However, the Chl-a corrections maintain a high relative error. Therefore, it is approximated as more than 30%.

The satellite data are fundamentally two-dimensional because the vertical information is limited to the range of penetration depth. In May–September in Lake Säkylän Pyhäjärvi in the 410–710 nm wavelength region (covering MERIS channels 1–9), the average spectral penetration depth varied between 0.4 and 1.9 m, as estimated with the bio-optical model developed for Finnish lakes (Kallio 2006). TSM interpretation by MERIS in this wavelength region is most sensitive to reflectance at 700–710 nm (e.g., Härmä *et al*. 2001; Kallio *et al*. 2005), where the average penetration depth was about 1.0 m. Therefore, we assume that the penetration depth of 1.0 m corresponds reasonably well with the real value during our simulation period. The nudging data were applied only within this penetration depth. The difference between the nudging data and the mean of the model result over the penetration depth taken beforehand was added vertically consistently to grid points below the penetration depth. Mismatched data for the lake surface area derived from the satellite data and bathymetry introduced to the model were unified to the greatest extent possible using data array conversion. Satellite data are generally saturated by high back scattering at water edges, and the edge becomes vague. Therefore the land areas were defined once grids were saturated during 2009 (all available satellite data were from 11 days in mid-May to mid-September).

### Tested cases and initialization timing

To evaluate the practicality of the assimilation (direct insertion) of the model, five model runs (1–5) were performed using different initialization frequencies: (1) every time, when satellite data were available; (2) every 10 days; (3) 20 days; (4) 30 days; and (5) the control run without repeated initialization (except for the initialization on 16 May 2009). The initialization timings were at noon on: (1) June 1, 8, 18, 21 and 26 and July 6; (2) June 1, 8, 18 and 26 and July 6; (3) June 1 and 21 and July 6; and (4) June 1 and July 6.

### Quantities used to measure the effectiveness of occasional initialization

At this point, it is worth mentioning that the nudging data, Equation (1), provide the best presentation of the actual TSM concentration on a lake surface that is available from the satellite data. Based on this consideration, we regard the nudging data as the ‘true’ value in this paper.

*x*(*i,j*) denotes a forecast, which is the averaged TSM concentration at grid point (*i,j*) within the penetration depth of the model results, and *X*(*i,j*) denotes the ‘true’ value, which is the nudging data at grid point (*i,j*). Moreover, when Forecast RMSE (root mean square error), Equation (3), is shown with dashed lines in Figure 3(a) or 3(b), respectively, *x*(*i,j*) also denotes the average of TSM concentration converted with Equation (1) from all available satellite data from 2009, or the nudging data at the latest initialization, respectively. To determine the effectiveness of different initialization timings, we have calculated Forecast RMSE as described:

*N _{x}* and

*N*respectively, denote the number of grid points in longitude and latitude directions except for land areas defined from the average of satellite data 2009.

_{y}## RESULTS AND DISCUSSION

The main idea of this paper is to demonstrate that the combination of satellite data and model code provides more accurate forecasts than using either technique alone. We compared the model results for each case with *delta(i,j)*, *Forecast RMSE* and single-point measure of TSM without biomass.

Figure 2 presents the *delta(i,j)* for each case. The completely equivalent model results (forecasts) are shown for June 1 (top-right five images) because of the same initial calculation conditions. Subsequently, the variation of *delta(i,j)* among the five cases becomes more explicit. Similarities according to the most recent initialization timing exist despite the initialization rate. This is because TSM is slow to sink, and this is attributed to the small settling velocity. One might see that the increased initialization frequency would increase the model's capability to reproduce the spatial characteristics of nudging data. It is worth recalling that relatively-large *delta(i,j)* exist near the lake edge. This could be caused by saturation of satellite data at water edges. This mismatch is expected to add a certain amount of error to the *Forecast RMSE* presented in Figure 3.

Figure 3(a) and 3(b), respectively, show *Forecast RMSE* as a function of time and the temporal separation from the latest initializations. Compared to the case with the average or control run in Figure 3(a), *Forecast RMSE* are always lower in 1–4 assimilated cases, except for only one day on June 18. Therefore, assimilation provides the better forecast, even if assimilation is applied with an interval of 30 days. In addition, case 1 almost always has the lowest *Forecast RMSE*. Thus, because the nudging data are regarded as the ‘true’ value in this study, it can be said that the more the initialization frequency increases, the better the model predicts the future situation.

From Figure 3(b), one might see that *Forecast RMSE* is increasing as a function of temporal separation. Furthermore, it is below 1.0 within 10 days of the last initialization. Compared to *Forecast RMSE* for longer temporal separation, these values are kept small. One should consider, however, that the actual TSM concentration during the simulation period is estimated roughly as 1–2 mg/l. Compared to this, the observed *Forecast RMSE* of 1.0 is quite high. On the other hand, the gap separating the solid and dashed lines represents how much the model improved the forecasts compared to the forecasts using only the latest obtained satellite data. The gap seems to become larger and larger as the temporal separation becomes large. The solid lines always go below the dashed lines. This shows that the combination of the regular initialization and even a simple calibrated computational model improves the accuracy of forecasts. In particular, in the case with the long-term forecasts, it is considered that the improvement will be more explicit.

Figure 4 shows that every time initialization was performed on models, the forecast value was modified by the nudging data. After initialization, the modelled TSM concentration level is almost constant and less temporal variation exists. As described in the ‘3D hydrodynamic model and initial values and boundary conditions’ section, the settling velocity might be too small to reproduce realistic variability. Moreover, unimplemented detailed resuspension processes, considering wave interaction at the lake surface and its influence on the bottom, and time and spatial variability of the bottom shear stress can also be expected to have a strong influence on the TSM concentration. The high-turbidity peaks in mid-June are caused by phytoplankton growth as reported in Lepistö *et al*. (2008). Because simple sedimentation processes were modelled, it is no surprise that the seasonal phytoplankton growth is not shown in the all forecasts. Nevertheless, we can see that the forecasts do not become diverged from observations as direct insertion greatly assists the forecast to provide realistic concentrations. This could be the reason why assimilation provides better forecasts, even if the used sediment model and assimilation method are simple and the parameterization of sediment transport is not exhaustively calibrated, as only the sinking velocity is calibrated at a single observation point.

## CONCLUSIONS

To analyze the applicability of direct insertion of the key variable of a computational model, a dispersion study of suspended matter in Lake Säkylän Pyhäjärvi was conducted using the COHERENS Version 2 model. We assessed five cases with different initialization frequencies: every time that satellite data were available; every 10 days; 20 days; 30 days; and control run without repeated initialization.

This paper demonstrates that the predictive performance of a coupled circulation-sediment model can be notably improved by incorporating satellite data, even if the used sediment model and assimilation method are simple and its parameterization of sediment transport is not exhaustively calibrated, as only the sinking velocity is calibrated at a single observation point. This is also true when direct insertion is applied less often, such as the interval of 30 days. In particular, the 10-day period shows values of Forecast RMSE increases that are smaller when compared to situations where no simulation is used at all.

For usability, until the necessary modules are implemented and until the demanding calibrations have been done, direct insertion can obtain more realistic forecasts of lake water quality. Future studies should be performed carefully to examine considerations of the possibilities of using satellite-based bathymetrical data and detailed sedimentation processes. Such studies can enrich our experience in using more sophisticated data assimilation methods.

## ACKNOWLEDGEMENTS

The authors would like to thank SYKE for data and its parallel computing environment for the model calculation, and FMI for meteorological data. In addition, the authors appreciate the significant contributions made by Marko Järvinen (SYKE) and Niina Kotamäki (SYKE). This project was supported by the Japan Society for the Promotion of Science and Academy of Finland under the Japan–Finland Bilateral Core Program and the JSPS Institutional Program for Young Researcher Overseas Visits.