## Abstract

As the identified results of groundwater pollution source identification (GPSI) can influence the cost for the polluter in paying for remediating groundwater resources, it is important that the accuracy of the estimated result should be as high as possible. However, many factors can influence the result, such as noisy concentration data and incomplete concentration data. Thus, this paper is aimed at studying the difference between using the observation data before and after denoising and interpolating for solving GPSI problems. Four kinds of noise level and 20 groups of missing data were designed to test the performance of wavelet denoising and cubic spline interpolation, respectively. The results show that the denoising process can improve the estimated result for the GPSI problem, and the higher the noise level, the stronger this effect. In terms of interpolation, more accurate results can be made after interpolating if the missing data belong to the period after the source releases the pollutant. If the missing data are from when the pollution source is active, interpolation cannot help increase the estimated performance.

## INTRODUCTION

Since groundwater pollution sources are buried in underground rock and soil media, the characteristics of these sources (including the number of pollution sources, their exact location, and release history) are difficult to make clear. However, information about groundwater pollution sources is connected with designs of remediation schemes, estimating risks, and conducting pollution liability assessments (Woodrow *et al.* 1986; Mirghani *et al.* 2012; Zhao *et al.* 2016). Therefore, studies of groundwater pollution source identification (GPSI) are especially important.

Comprehensive reviews on GPSI have been provided by several researchers (Atmadja & Bagtzoglou 2001; Michalak & Kitanidis 2004; Bagtzoglou & Atmadja 2005; Yeh 2015). Methods for GPSI can be divided into four categories: optimization approaches, probabilistic and geostatistical simulation approaches, analytical solution and regression approaches, and direct approaches. No matter which approach is used, the effect of the error in the observation data is shown to have a strong sensitivity for the estimated result. Even very small measurement errors can make the estimated release history change dramatically (Skaggs & Kabala 1994; Singh *et al.* 2004). Thus, it is necessary to extract the actual concentration data from the noisy data first, then use the inversion method to identify the characteristics of the release sources. The technique for data denoising develops from simple nonparametric regression to spatially adaptive nonparametric methods (Percival & Walden 2000). The latter methods include variable-bandwidth kernel methods, variable-knot splines and wavelets (Hsu *et al.* 2005). This research chooses wavelet denoising over other methods because it handles abrupt changes better than the others (Donoho & Johnstone 1994).

In addition, identification accuracy is reduced if observation data are missing (Milnes & Perrochet 2007; Datta *et al.* 2011). The filling in of missing values, called data imputation, can be used to complete this incomplete data. The classic methods for dealing with incomplete data include Lagrange imputation, Newton imputation, Hermit imputation and cubic spline interpolation. The reason to choose cubic spline interpolation is its accuracy for the kind of missing data with a few missing values (Lakshminarayan *et al.* 1999).

Although the influence of both noisy observation data and missing data on the identified results is mentioned in many studies (Ayvaz 2010; Zhao *et al.* 2015), there has been no specific study on identifying the source characteristics of the observation data after denoising and interpolation has been performed. Thus, this study focuses on the differences of the identified source characteristics in the observation data before and after denoising and interpolating. Four kinds of noise level and 20 groups of missing data were designed to test the performance of wavelet denoising and cubic spline interpolation, respectively.

## METHODOLOGY

A flow diagram of the proposed approach is presented in Figure 1.

A brief description of the approach steps is presented below.

### Wavelet denoising

Noise is a phenomenon that affects all frequencies. In general, useful signals occupy the low-frequency domain. On the contrary, noise occupies the high-frequency domain. Wavelet analysis can separate the high-frequency components and low-frequency components effectively.

The first procedure of wavelet denoising applies a wavelet transform to a data object to obtain the vector of the wavelet coefficient. Second, we suppress or remove those elements above a certain threshold. Third, we reconstruct the denoised data using a cognate inverse wavelet transform. The detail of how to denoise data with noise can be seen in Cao *et al.* (2003).

### Cubic spline interpolation

The spline is a draftsman to draw a smooth curve slat. The spline function is obtained after the spline curve has been described by mathematical modeling. Its characteristics are piecewise polynomial and each piecewise polynomial has a certain connection, so the spline function is not only sufficiently smooth but also retains a certain discontinuity. Smoothness is to ensure the smooth graceful shape of the curve. Discontinuity can make it turn freely flexible when used.

Let be a partition of [*a*,*b*]. Then *s* is said to be a cubic spline over if and *s* restricted to [*x*_{i-1},*x*_{i}] is a cubic polynomial for . The space of all such cubic splines *s* is denoted by . If, in addition, for , *s* is said to be an -interpolate of *f* (Lucass 1974).

### The simulation–optimization method

*T*represents the transmissivity tensor [L

_{ij}^{2}T

^{−1}],

*h*represents the hydraulic head [L],

*W*represents the volumetric flux per unit area representing fluid inflow (negative sign) and outflow (positive sign) [LT

^{−1}],

*b*represents the saturated thickness of the aquifer,

*C*represents the dissolved concentration [ML

^{−3}],

*D*represents the hydrodynamic dispersion tensor [L

_{ij}^{2}T

^{−1}],

*v*represents the average linear seepage velocity [LT

_{i}^{−1}],

*C*

_{S}represents the dissolved concentration in a source or sink flux [ML

^{−3}],

*θ*represents the effective porosity [−], and

*K*represents the hydraulic conductivity tensor [LT

_{ij}^{−1}].

The meaning of the objective function shown in Equation (4) is to minimize the deviation between actual observations and model predictions at each sampling location and each stress period. *N _{t}* stands for the number of stress periods,

*N*stands for the number of sampling locations.

_{d}*C*represents the simulated pollution concentration at sampling location

_{k}(t)*k*and stress period

*t*and Ĉ

*means the measured pollution concentration at sampling location*

_{k}(t)*k*and stress period

*t.*In Equation (5), stands for the simulation model which transforms the source flux

*q*into concentrations

*c*. Equations (6) and (7) stand for the bounds of the concentrations and source fluxes, respectively.

### Performance evaluation measurements

As the NE value goes to zero, the better the identified values are.

*Sj*. The formula for

*Sj*can be defined as follows: where

*w*stands for approximation coefficients and detailed coefficients.

_{j,i}*Sj*reflects the content of useful and noisy information of the signal. If

*Sj*< 0.01, it means the detailed coefficient only contains noisy information.

### Incorporating measurement errors

*ɛ*is a normally distributed error matrix, and

*a*represents the level of noise. The noise level is low if 0.10. The noise level is moderate if . The noise level is high if 0.15 (Singh & Datta 2006).

## NUMERICAL APPLICATIONS

### Basic introduction of the aquifer

The performance of both wavelet denoising and cubic spline interpolation is tested on a hypothetical case. The aquifer (Figure 2) is assumed to be uniform and the flow condition is steady-state. Boundary conditions of the aquifer model are linearly varying specified heads at the east and west sides, and no-flow at the north and south sides. There is one pollution source (S1) and six sampling wells (O1–O6). Observation pollution concentrations at each sampling location were gathered in each stress period by running the simulation model. The simulation time for this case is 60 months (5 years), which has been divided into 20 equal stress periods. Each stress period is 3 months. We assume that the released pollutant is typically conservative and the initial conditions, boundary conditions, and the length of the release periods are known. Moreover, the source is active in the first four stress periods (1 year). The detailed parameter values and the actual values of source fluxes are shown in Tables 1 and 2.

Parameters | Values |
---|---|

Hydraulic conductivity in x-direction, (m/s) | 0.0002 |

Hydraulic conductivity in y-direction, (m/s) | 0.0002 |

Effective porosity, | 0.25 |

Longitudinal dispersivity, (m) | 40 |

Transverse dispersivity, (m) | 9.6 |

Saturated thickness, (m) | 30.5 |

Volume flux per unit area, W (m/s) | 1.010^{−9} |

Initial concentration (mg/L) | 100 |

Parameters | Values |
---|---|

Hydraulic conductivity in x-direction, (m/s) | 0.0002 |

Hydraulic conductivity in y-direction, (m/s) | 0.0002 |

Effective porosity, | 0.25 |

Longitudinal dispersivity, (m) | 40 |

Transverse dispersivity, (m) | 9.6 |

Saturated thickness, (m) | 30.5 |

Volume flux per unit area, W (m/s) | 1.010^{−9} |

Initial concentration (mg/L) | 100 |

Pollution sources | Source fluxes in each stress period (g/s) | |||
---|---|---|---|---|

SP1 | SP2 | SP3 | SP4 | |

S1 | 255 | 497 | 0 | 297 |

Pollution sources | Source fluxes in each stress period (g/s) | |||
---|---|---|---|---|

SP1 | SP2 | SP3 | SP4 | |

S1 | 255 | 497 | 0 | 297 |

Note that the actual values of the source flux (Table 2) are not used in the calculation process. It only acts as the input of the simulation model to generate the observation concentrations at each sampling location, and the standard values evaluate the performance estimated results. This is also an advantage of the hypothetical case, as in the field study, the actual values are unknown. Therefore, the identified results could not be evaluated for the field study.

### Performance of wavelet denoising

At first, the actual observation concentrations are perturbed by Equation (9). Then the observation concentrations with noise are denoised using wavelet denoising. Both code programming and wavelet toolbox are used to train the parameter and filter. Wavedec functions and wrcoef functions are the most important functions when using Matlab. The parameters needing to be estimated for wavelet denoising are wavelet filter and denoising level. The Daubechies 4 filter, Coiflet 2 filter, and Symmlet 4 are the alternative wavelet filters due to their popularity for noised groundwater concentrations, and denoising level (*N*) is set from 1 to 4. The performance of the parameters is shown in Table 3.

Noise level | Criterion | Daubechies 4 | Coiflet 2 | Symmlet 4 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

N = 1 | N = 2 | N = 3 | N = 4 | N = 1 | N = 2 | N = 3 | N = 4 | N = 1 | N = 2 | N = 3 | N = 4 | ||

a = 0.05 | Sj | 0.002 | 0.011 | 0.025 | 0.033 | 0.010 | 0.011 | 0.033 | 0.025 | 0.012 | 0.015 | 0.029 | 0.017 |

NE | 1.99 | 4.87 | 10.39 | 14.16 | 1.98 | 4.50 | 10.53 | 14.08 | 2.00 | 4.16 | 10.69 | 14.32 | |

a = 0.10 | Sj | 0.003 | 0.02 | 0.02 | 0.03 | 0.011 | 0.013 | 0.034 | 0.026 | 0.012 | 0.015 | 0.030 | 0.017 |

NE | 3.81 | 5.76 | 10.87 | 14.62 | 3.69 | 5.36 | 11.02 | 14.46 | 3.70 | 4.97 | 11.11 | 14.66 | |

a = 0.15 | Sj | 0.004 | 0.018 | 0.025 | 0.036 | 0.010 | 0.014 | 0.035 | 0.027 | 0.012 | 0.015 | 0.031 | 0.017 |

NE | 5.65 | 4.68 | 11.36 | 15.09 | 5.42 | 6.25 | 11.52 | 14.85 | 5.43 | 5.82 | 11.54 | 14.99 | |

a = 0.20 | Sj | 0.004 | 0.020 | 0.026 | 0.037 | 0.010 | 0.015 | 0.036 | 0.027 | 0.012 | 0.014 | 0.032 | 0.017 |

NE | 7.52 | 7.62 | 11.85 | 15.56 | 7.23 | 7.16 | 12.03 | 15.25 | 7.24 | 6.69 | 12.01 | 15.32 |

Noise level | Criterion | Daubechies 4 | Coiflet 2 | Symmlet 4 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

N = 1 | N = 2 | N = 3 | N = 4 | N = 1 | N = 2 | N = 3 | N = 4 | N = 1 | N = 2 | N = 3 | N = 4 | ||

a = 0.05 | Sj | 0.002 | 0.011 | 0.025 | 0.033 | 0.010 | 0.011 | 0.033 | 0.025 | 0.012 | 0.015 | 0.029 | 0.017 |

NE | 1.99 | 4.87 | 10.39 | 14.16 | 1.98 | 4.50 | 10.53 | 14.08 | 2.00 | 4.16 | 10.69 | 14.32 | |

a = 0.10 | Sj | 0.003 | 0.02 | 0.02 | 0.03 | 0.011 | 0.013 | 0.034 | 0.026 | 0.012 | 0.015 | 0.030 | 0.017 |

NE | 3.81 | 5.76 | 10.87 | 14.62 | 3.69 | 5.36 | 11.02 | 14.46 | 3.70 | 4.97 | 11.11 | 14.66 | |

a = 0.15 | Sj | 0.004 | 0.018 | 0.025 | 0.036 | 0.010 | 0.014 | 0.035 | 0.027 | 0.012 | 0.015 | 0.031 | 0.017 |

NE | 5.65 | 4.68 | 11.36 | 15.09 | 5.42 | 6.25 | 11.52 | 14.85 | 5.43 | 5.82 | 11.54 | 14.99 | |

a = 0.20 | Sj | 0.004 | 0.020 | 0.026 | 0.037 | 0.010 | 0.015 | 0.036 | 0.027 | 0.012 | 0.014 | 0.032 | 0.017 |

NE | 7.52 | 7.62 | 11.85 | 15.56 | 7.23 | 7.16 | 12.03 | 15.25 | 7.24 | 6.69 | 12.01 | 15.32 |

If *Sj* < 0.01, it means the detailed coefficient only contains noisy information. It can be seen that only the combination of Daubechies 4 filter and denoising level 1 can reach this requirement. Thus, the denoising level is set to be 1, and the Daubechies 4 filter is chosen as the wavelet filter. After the parameters are estimated, the denoising results can be calculated. From the limitation of space, only one sample of denoising results is shown in Figure 3. To evaluate the accuracy of wavelet denoising, the Gaussian filter denoising method (Janecki 2011; Afshari *et al.* 2017) is employed. Figure 4 shows the comparison of residual results by wavelet and Gaussian filter under four different noise levels. It can be seen clearly that wavelet denoising performs more accurately than Gaussian filter denoising under all four noise levels. The average NE values for wavelet denoising and Gaussian filter denoising are 2.0% vs 3.2%, 3.8% vs 5.4%, 5.7% vs 7.6%, 7.5% vs 9.9% when noise level equals 0.05, 0.10, 0.15, 0.20, respectively.

The identified results for both the noisy observation concentrations and denoising observation concentrations are shown in Table 4. As the noise level increased, the accuracy of the identified results decreased for both the noisy (from 18.81% to 65.24%) and denoised data(19.30% to 51.39%). In contrast with the two groups of data, the identified results have a slight difference when noise level is low. However, when noise level increased from 0.10 to 0.20, the estimated result using denoised observation concentrations is much better than those using noisy observation concentrations for the same noise level.

Pollution sources | Source fluxes at each stress period (g/s) | NE of the identified results | ||||
---|---|---|---|---|---|---|

Noise level a | SP1 | SP2 | SP3 | SP4 | ||

Before denoising | a = 0.05 | 258 | 486 | 101 | 214 | 18.81 |

a = 0.10 | 253 | 492 | 181 | 138 | 33.00 | |

a = 0.15 | 253 | 485 | 276 | 57 | 50.39 | |

a = 0.20 | 247 | 492 | 357 | −19 | 65.24 | |

After denoising | a = 0.05 | 259 | 482 | 102 | 215 | 19.30 |

a = 0.10 | 247 | 517 | 139 | 163 | 28.65 | |

a = 0.15 | 240 | 542 | 186 | 106 | 41.52 | |

a = 0.20 | 238 | 559 | 224 | 60 | 51.39 | |

Actual values | a = 0 | 255 | 497 | 0 | 298 |

Pollution sources | Source fluxes at each stress period (g/s) | NE of the identified results | ||||
---|---|---|---|---|---|---|

Noise level a | SP1 | SP2 | SP3 | SP4 | ||

Before denoising | a = 0.05 | 258 | 486 | 101 | 214 | 18.81 |

a = 0.10 | 253 | 492 | 181 | 138 | 33.00 | |

a = 0.15 | 253 | 485 | 276 | 57 | 50.39 | |

a = 0.20 | 247 | 492 | 357 | −19 | 65.24 | |

After denoising | a = 0.05 | 259 | 482 | 102 | 215 | 19.30 |

a = 0.10 | 247 | 517 | 139 | 163 | 28.65 | |

a = 0.15 | 240 | 542 | 186 | 106 | 41.52 | |

a = 0.20 | 238 | 559 | 224 | 60 | 51.39 | |

Actual values | a = 0 | 255 | 497 | 0 | 298 |

### Performance of cubic spline interpolation

To avoid any influence on the interpolation results, all the observation data are the output of the simulation model with no noise added. The performance of three different interpolation methods is shown in Figure 5.

The green line shows the interpolation performance by cubic spline interpolation. As the stress period goes up, the NE of the interpolation results decreases. The NE value after the seventh period is less than 1%. Moreover, except for the data in the first stress period, the other NE values are all below 10%. In fact, in estimating the first concentration data, it is not interpolation, but deduction of the back-up value (using the second to the 20th data to estimate the first value). For other interpolation results, the accuracy is not bad.

To verify the performance of cubic spline interpolation, linear interpolation and nearest interpolation methods are used for comparison. The Interp1 function in Matlab is the most important function used for interpolation. From Figure 5, it can be seen that. cubic spline interpolation performs obviously more accurately in all 20 stress periods. Thus, cubic spline interpolation is employed to interpolate incomplete concentrations. The performance of the identified results before and after interpolation is shown in Figure 6.

The blue line in Figure 6 illustrates the normalized errors using the incomplete observation concentrations, which missed the corresponding stress period concentration data. In the first four stress periods, the identified results get worse as we move from period 1 to 4. After the fourth stress period, the identified results do not have a significant trend. The reason why the first four stress periods are so important to the identified results is that the pollution source releases pollutant for those first four periods. We can infer that the pollutant concentrations in the period when the pollution source is released need to be gathered totally if high accuracy of the identified results is desired. The concentration data found in the period when the pollution source has finished releasing do not need to be collected totally because they have little influence on the estimated results.

The trend of the red line is consistent with the blue line. The results also indicate that the pollutant concentrations need to be gathered totally when the source releases the pollutant. Comparing the red line with the blue line, we can draw two conclusions. Firstly, for the first four stress periods, the NE of the identified results from the missing data is lower than that of the interpolated data. This means that when the source releases the pollutant, the actual observation data are much more reliable than the interpolated data. However, if the observation data are missing in the stress period after the end of the release time, the identified result from the interpolation data is more reliable than the identified result using just missing data.

## CONCLUSIONS

This paper illustrates the difference between using observation data before and after denoising and interpolating for solving GPSI problems. The purpose of this research is to explore whether the identified results are more accurate after denoising and interpolating.

The results show that denoising has an effect on the estimated result for the GPSI problem, and the higher the noise level, the stronger this effect. In terms of interpolation, more accurate results can be made after interpolating if the missing data belong to the period after the source releases pollutant. If the missing data are from when the pollution source is active, interpolating does not increase the estimated performance.

Although denoising and interpolating can improve the accuracy of the identified results to a certain extent, the accurate actual concentration data are the most reliable information to use. For real-world cases, the more accurate and complete the concentration data collected, the better the identified results could be.

## ACKNOWLEDGEMENTS

This study is based on the work supported by the National Natural Science Foundation of China (41807196, 41672232), China's Post-doctoral Science Fund (2018M641793). The authors thank the editor and anonymous reviewers for their constructive comments and suggested revisions.

## REFERENCES

*et al.*versus Velsicol Chemical Corporation