Identifying the source of a sudden water contamination incident in a water supply system is important. The accuracy of the identified source information depends on the accuracy of the simulation model's parameters. However, the true values of these parameters are difficult to obtain with existing methods, so reducing the errors caused by parameter uncertainty is a crucial problem. This paper establishes a source identification framework that considers the uncertainty of the model's sensitive parameters and combines Bayesian inference with Markov Chain Monte Carlo (MCMC) simulation, and takes the South-to-North Water Diversion Project as a case study. Compared with a framework that does not consider the uncertainty of the model's parameters, the proposed framework reduces the error caused by a wrong choice of model parameters and obtains more accurate results. In addition, the proposed framework based on traditional MCMC is compared with that based on Delayed Rejection and Adaptive Metropolis (DRAM-MCMC), showing that DRAM-MCMC converges better and is more accurate. Lastly, the proposed framework based on DRAM-MCMC is shown to solve the problem with high practicality and generality in the studied long-distance water diversion project.

## INTRODUCTION

In recent years, sudden water contamination incidents in water supply systems or rivers, resulting from transport accidents, sewage pipe bursts, unregulated sewage treatment plant discharges, terrorist attacks and extreme natural disasters (such as earthquakes), have occurred more frequently and have caused environmental, economic and societal consequences, particularly in developing countries such as China (He *et al*. 2011; Tang *et al*. 2014). Rapid and accurate identification of the contamination source is crucial to reducing the potential impacts of such incidents and is thus an essential step in developing mitigation and adaptation strategies in water quality management.

The contamination source identification problem, which is normally regarded as an ill-posed inverse problem, has attracted a great deal of attention. Many methods have been developed in the literature, such as classical regularization methods, simulation-optimization methods and Bayesian inference methods (Hamdi & Mahfoudhi 2013). The Tikhonov regularization method, one of the classical regularization methods, was the first to be used for such problems. It has proved to be reliable and simple, and it can cope with noise in the experimental data (Nguyen *et al*. 1999; Akçelik *et al*. 2003; Wang *et al*. 2013; Qi *et al*. 2016). With the development of computing power, optimization algorithms were combined with water quality simulation models for source identification, such as genetic algorithms (Khlaifi *et al*. 2009), heuristic harmony search algorithms (Ayvaz 2010) and parallel evolutionary strategies (Mirghani *et al*. 2009). Such algorithms are likely to find solutions that are more accurate than those obtained from classical regularization methods. However, regularization and optimization methods provide only point estimates and cannot account for uncertainties in the inverse problem. This increases the risk of inaccurate identification and reduces the reliability of the optimal solution, because an increasing number of model parameters are uncertain. In contrast, Bayesian approaches have a number of distinctive advantages and have been used in many areas (Wang *et al*. 2013; Zhang *et al*. 2015). A Bayesian approach can provide a posterior probability distribution of the source parameters and quantify random errors in the data (Hassan *et al*. 2009).
Takaishi (2013) combined Bayesian inference with a Markov Chain Monte Carlo (MCMC) method implemented by importance sampling in the generalized autoregressive conditional heteroscedasticity (GARCH) model, and found that the approach could reduce the statistical error of the GARCH parameters. Wang & Harrison (2013) used Bayesian and MCMC methods to identify the contaminant profile in water distribution systems. They examined a statistical learning approach that builds a regression model between the proposed parameters and the likelihood for each pair of source and sensor nodes in the network, and showed that the method was feasible and efficient. Shao *et al*. (2014) combined Bayesian approaches and MCMC simulation to identify water quality model parameters, and their results show that the method has high reliability and anti-noise capability. These studies suggest that combining Bayesian approaches with MCMC is a good way to solve the contamination source identification problem.

In previous research, model parameters were regarded as deterministic. However, because they were generally obtained by the tracer tracking method, analogy method and empirical formula method, they had high uncertainties, which directly affected the accuracy of the results (Van Griensven & Meixner 2007; Blasone *et al*. 2008; Xu *et al*. 2009; Zhang *et al*. 2013; Tian *et al*. 2014). Therefore, how the uncertainties of model parameters are considered is crucially important. This paper proposes a framework that replaces the deterministic values with prior probability functions of the model parameters, obtained from practical measurement, the analogy method and the empirical formula method, to make identification more accurate and faster. Furthermore, a modified MCMC method, DRAM-MCMC, which combines Delayed Rejection, Adaptive Metropolis and MCMC (Haario *et al*. 2006), is used to improve the efficiency and accuracy of the source identification.

The remainder of this paper provides an overview of the framework of the source identification which considers the uncertainties of the parameters and gives the scheme design, followed by details of the case study. The sensitivity analysis results for the parameters and uncertainty analysis of different scenarios are then provided and the conclusions drawn.

## METHODOLOGY

### Water quality model

In the water quality model, *x* represents the distance along the channel, *y* represents the distance along the cross-section, *t* is the time, *C* (g/l) is the pollutant concentration at the point (*x*, *y*, *t*), *U*_{x} and *U*_{y} (m/s) denote the flow velocity along the channel and across the section, respectively, *D*_{x} and *D*_{y} are the longitudinal and transverse mixing coefficients, respectively, and *K* (d^{-1}) represents the pollutant degradation coefficient.
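The governing equation itself is not reproduced in this section. As a sketch only, a standard two-dimensional advection-dispersion equation with first-order decay that is consistent with the symbols defined above would read as follows. This form is an assumption based on the listed symbols and is not necessarily the exact equation used in the study.

$$
\frac{\partial C}{\partial t} + U_x \frac{\partial C}{\partial x} + U_y \frac{\partial C}{\partial y}
= D_x \frac{\partial^2 C}{\partial x^2} + D_y \frac{\partial^2 C}{\partial y^2} - K C
$$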

### Bayesian inference and MCMC

#### Bayesian inference

Bayesian inference is a useful approach in which prior knowledge is taken into consideration naturally and which allows the user to quantify the uncertainty of the estimated parameters (Zio & Zoia 2008; Wang & Chen 2013; Zhao *et al*. 2014).

*et al*. (2006) and Wang & Chen (2013). The parameters are independent of each other. Therefore, according to Bayesian inference, the total prior distribution is the product of the prior distributions of the individual parameters, where the prior of the *i*th parameter is bounded by its upper and lower limits.
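The independence assumption can be illustrated with a short sketch. It assumes, hypothetically, that each prior is uniform between its lower and upper limits, since only these limits are mentioned in the text. The function name and the example bounds are illustrative and are not taken from the study.

```python
import math


def log_uniform_prior(theta, lower, upper):
    """Log of the joint prior for independent parameters.

    Assumption (not confirmed by the source): each parameter is
    uniform on [lower[i], upper[i]].
    """
    log_p = 0.0
    for value, lo, hi in zip(theta, lower, upper):
        if not lo <= value <= hi:
            return -math.inf          # outside the bounds: zero prior probability
        log_p -= math.log(hi - lo)    # independent priors multiply, so logs add
    return log_p


# Two hypothetical parameters with bounds [0, 2] and [10, 20]
print(log_uniform_prior([1.0, 15.0], [0.0, 10.0], [2.0, 20.0]))
```

Working with the logarithm of the prior avoids numerical underflow when the prior is later multiplied by a likelihood inside an MCMC sampler.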

#### MCMC sampling

The main methods to construct the Markov chain transition probability matrix include the Gibbs sampling algorithm, the Metropolis–Hastings algorithm and the Metropolis algorithm (Cowles & Carlin 1996; Cowles & Rosenthal 1998; Zio & Zoia 2008; Haghighattalab *et al*. 2012). However, according to previous studies (e.g., Haario *et al*. 2006; Mbalawata *et al*. 2015), the challenge with these standard methods is that it is very hard to find a good proposal distribution in complicated high-dimensional models. Haario *et al*. (2006) proposed a modified Metropolis algorithm, Delayed Rejection and Adaptive Metropolis (DRAM-MCMC), and showed that the combined method was more efficient than standard MCMC. In this method, Delayed Rejection (DR) adds a higher-stage candidate when a proposal is rejected, while preserving the Markovian property and the reversibility of the chain with respect to the distribution of interest. DR can also save simulation time by exploiting the hierarchy between kernels. Moreover, Adaptive Metropolis (AM), which has the correct ergodic properties, is introduced so that the proposal distribution can be updated along the process using the full information accumulated so far. More details and theory can be found in Haario *et al*. (2006).

In this paper, there are not enough data to construct the likelihood distribution. Therefore, in the traditional MCMC it is assumed to follow a normal distribution (Wang & Chen 2013; Zhao *et al*. 2014), and in DRAM-MCMC it is assumed to follow a correlated Gaussian distribution, following Haario *et al*. (2006). The key problems are how to set the covariance and how many stages to use. On the one hand, the covariance is determined by the initial covariance C_{0}, the length of the initial non-adaptation period N_{0}, the target's dimension *d* and the standard optimal scaling factor S_{d}. According to Haario *et al.* (2006), the target's dimension *d* should be less than 15, so it is set to 6. The length N_{0} of the initial non-adaptation period, which is related to *d*, is set to 1,000. The Gaussian proposal is scaled by the standard optimal factor S_{d} = 2.4^{2}/*d* (Gelman *et al*. 1996). On the other hand, a second-stage proposal is used in DR.
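The settings described above can be sketched as a minimal two-stage DRAM sampler. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the function name, the argument names, the second-stage shrink factor `dr_scale` and the regularization term `eps` are all hypothetical. The initial covariance `c0` is used directly as the proposal covariance during the first `n0` iterations, and the starting point is assumed to have a finite log posterior.

```python
import numpy as np


def dram_sample(log_post, x0, c0, n_iter=6000, n0=1000,
                dr_scale=0.1, eps=1e-8, seed=0):
    """Two-stage DRAM sketch with Gaussian proposals (names are hypothetical)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    d = x.size
    s_d = 2.4 ** 2 / d                    # scaling factor S_d = 2.4^2 / d
    cov = np.asarray(c0, dtype=float)     # proposal covariance before adaptation
    lp_x = log_post(x)
    chain = np.empty((n_iter, d))

    for t in range(n_iter):
        if t >= n0:
            # AM: rebuild the proposal covariance from the chain history.
            hist_cov = np.atleast_2d(np.cov(chain[:t], rowvar=False))
            cov = s_d * (hist_cov + eps * np.eye(d))

        # Stage 1: ordinary Metropolis proposal.
        y1 = rng.multivariate_normal(x, cov)
        lp_y1 = log_post(y1)
        a1 = np.exp(min(0.0, lp_y1 - lp_x))
        if rng.random() < a1:
            x, lp_x = y1, lp_y1
        else:
            # Stage 2 (DR): narrower proposal, still centred on x.
            y2 = rng.multivariate_normal(x, dr_scale ** 2 * cov)
            lp_y2 = log_post(y2)
            if np.isfinite(lp_y2):
                a1_rev = np.exp(min(0.0, lp_y1 - lp_y2))  # alpha_1(y2, y1)
                if a1_rev < 1.0:
                    cov_inv = np.linalg.inv(cov)
                    r_fwd, r_rev = y1 - x, y1 - y2
                    # log q1(y2, y1) - log q1(x, y1) for the stage-1 Gaussian
                    log_q = 0.5 * (r_fwd @ cov_inv @ r_fwd
                                   - r_rev @ cov_inv @ r_rev)
                    log_a2 = (lp_y2 - lp_x + log_q
                              + np.log1p(-a1_rev) - np.log1p(-a1))
                    if np.log(rng.random()) < log_a2:
                        x, lp_x = y2, lp_y2
        chain[t] = x
    return chain
```

With the values quoted in the text, one would call it with a six-dimensional `x0`, `n0=1000` and `n_iter=50000`. Recomputing the covariance from the full history at every step keeps the sketch short, whereas a recursive covariance update would be cheaper for long chains.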

### Scenario design

Three different scenarios for identifying the source are designed for comparison. Scenario 1 and Scenario 2 compare the traditional framework, which does not consider the uncertainty of the model parameters, with the framework proposed in this paper. Scenario 2 and Scenario 3 are used to demonstrate the superiority of the modified MCMC. The mode of the posterior distribution is used to represent the characteristics of the contamination event. The number of iterations was set to 50,000 in all three scenarios to make them comparable. The detailed scenarios are shown in Table 1.

| Scenario | S1 | S2 | S3 |
|---|---|---|---|
| Is parameter uncertainty considered | No | Yes | Yes |
| Sampling method | Traditional-MCMC | Traditional-MCMC | DRAM-MCMC |

Scenario 1 (S1): the main purpose of this scenario is to illustrate that parameter uncertainty can easily lead to greater errors in the results. Because of the uncertainties, it is difficult to select a single value within the range that matches the true value. In each identification, we choose a set of fixed parameters from the ranges of the model parameters and use the traditional MCMC to estimate the source terms, without considering model parameter uncertainties.

Scenario 2 (S2) is based on the proposed framework and uses traditional MCMC, and Scenario 3 (S3) is based on the proposed framework and uses the modified MCMC, DRAM-MCMC.

The scenarios designed above can demonstrate the accuracy and efficiency of the proposed framework for the specified event. However, they cannot show that the proposed framework is suitable for other events, which differ in occurrence time, total contaminant load and position. Therefore, 2,000 events with different levels of Load, Location and Time are designed to be identified with S3.

## CASE STUDY

To demonstrate the accuracy and versatility of the scenario, the middle route of the South-to-North Water Diversion Project was chosen as a case study. The project transfers water from Danjiangkou reservoir in Hubei province, crosses the Yangtze River, Huaihe River, Yellow River and Haihe River basins, and finally arrives at Tuancheng Lake in Beijing. The total length of the project is 1,277 km, crossing nearly 150 cities and 151,000 hectares of land via open channels, culverts and pipes. In addition, the span of the project is very large, and 936 structures of different types, including 44 railways and 571 bridges over the channel, have been built. Many factories have also been built near the project, especially in the Shijiazhuang area. Therefore, there is a high probability of a sudden water pollution incident, especially in the Shijiazhuang area. For these reasons, a 20 km section in the Shijiazhuang area is taken as an example to identify the source term of a chemical contaminant.

| Parameter | Interval | Default value |
|---|---|---|
| | [980, 1,470] | 1,225 |
| | [28.8, 43.2] | 36 |
| | [17.28, 25.92] | 21.6 |
| | [0.096, 0.144] | 0.12 |
| | [0.08, 0.12] | 0.1 |
| | [0.8, 1.2] | 1 |

## RESULTS AND DISCUSSION

### Sensitivity analysis

*U*_{x} is the only sensitive parameter, while for Location there are two sensitive parameters, *U*_{x} and *D*_{x}. The sensitivity indices of the other parameters, including *U*_{y}, *D*_{y}, *h* and *K*, are all below the 10% threshold, which indicates that they are insensitive parameters and that their impact on the results is very small. This is because the source identification model is built for a long-distance water transfer project, where the longitudinal distance is much longer than the transverse distance. During transport, diffusion and decay take place, but the influence of the transverse direction is much lower than that of the longitudinal direction and can even be ignored, so the longitudinal parameters *U*_{x} and *D*_{x} are important for the source identification model. In conclusion, the uncertainties of the sensitive parameters *U*_{x} and *D*_{x} need to be considered, while the insensitive parameters *U*_{y}, *D*_{y}, *h* and *K* are regarded as constants.

### Uncertainty analysis

The *x*-axis represents the relative errors, the left *y*-axis represents the frequency and the right *y*-axis represents the cumulative frequency.

The assumption is that a result is acceptable only if the relative errors of all three source terms (Load, Location and Time) are less than 20%. As can be seen from Figure 4(a), only 27.2% of the 2,000 results obtained with S1 are acceptable, because it is difficult to choose the correct values even within these small parameter ranges. For more than 60% of the 2,000 results, the relative error exceeds 50%. This reveals that when the uncertainty of model parameters is not considered, the identification of the contamination event is very likely to fail.
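The acceptance criterion can be expressed as a short calculation. The sketch below is illustrative only: the function name, the array layout (one row per event, with columns for Load, Location and Time) and the example numbers are assumptions and are not data from the study.

```python
import numpy as np


def acceptance_rate(estimates, truths, tol=0.20):
    """Relative errors and the fraction of events whose source terms all pass.

    estimates, truths: arrays of shape (n_events, 3), with columns assumed
    to be Load, Location and Time. True values must be non-zero.
    """
    estimates = np.asarray(estimates, dtype=float)
    truths = np.asarray(truths, dtype=float)
    rel_err = np.abs(estimates - truths) / np.abs(truths)
    accepted = np.all(rel_err < tol, axis=1)   # every source term must be within tol
    return rel_err, accepted.mean()


# Two made-up events: the first passes, the second fails on Load
truths = [[100.0, 10.0, 5.0], [100.0, 10.0, 5.0]]
estimates = [[110.0, 9.0, 5.5], [150.0, 10.0, 5.0]]
rel_err, rate = acceptance_rate(estimates, truths)
print(rate)
```

Requiring all three source terms to pass at once is stricter than checking each term separately, which is why the joint acceptance rate can be much lower than the per-term rates.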

Figure 4(b) shows that 63% of the results obtained with S2 are acceptable, which implies that the proposed framework based on the traditional MCMC reduces the effect of parameter uncertainty and makes the results more accurate. The disadvantage of this scenario is that the relative error exceeds 100% for more than 17% of the 2,000 results. This means that convergence deteriorates when the uncertainty of the model's sensitive parameters is considered, because ranges are used instead of fixed values. To solve this problem, we use the modified method DRAM-MCMC (Haario *et al*. 2006) in S3. The results, shown in Figure 4(c), indicate that 96.55% of the 2,000 results are acceptable, revealing that DRAM-MCMC is more efficient and accurate than the traditional MCMC. Compared with the traditional MCMC, it improves convergence because DR keeps the Markovian property and reversibility while adding higher-stage candidates, and AM provides ‘global’ adaptation based on the past history of the chain.

According to the analysis above, the scenario based on DRAM-MCMC generalizes better than that based on the traditional MCMC. To assess the applicability of the scenario, we analyzed the data and obtained the mean, the standard deviation and the quantiles of the relative error, as listed in Table 3. The 80th quantile, the mean and the standard deviation of the relative errors are all less than 20%, which is relatively small, and the 90th quantiles of the relative errors are all no more than 31.18%.

| | Distance (D) | Time (T) | Quality (Q) |
|---|---|---|---|
| Standard deviation of the relative error | 11.89% | 16.54% | 12.65% |
| 80th quantile of relative error | 17.80% | 19.08% | 14.20% |
| 90th quantile of relative error | 28.45% | 31.18% | 21.01% |
| Mean of relative error | 11.50% | 11.56% | 9.28% |
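Statistics of the kind listed in Table 3 could be computed from an array of relative errors as sketched below. The function name and the input layout (one row per event, one column per source term) are assumptions. The study does not state which standard deviation or quantile estimator it used, so NumPy's defaults (population standard deviation and linear interpolation) are used here.

```python
import numpy as np


def summarize_relative_errors(rel_err):
    """Column-wise mean, standard deviation, and 80th and 90th quantiles."""
    rel_err = np.asarray(rel_err, dtype=float)
    return {
        "mean": rel_err.mean(axis=0),
        "std": rel_err.std(axis=0),                  # population std (ddof=0)
        "q80": np.quantile(rel_err, 0.80, axis=0),   # linear interpolation
        "q90": np.quantile(rel_err, 0.90, axis=0),
    }


# Made-up relative errors for a single source term
print(summarize_relative_errors([[0.1], [0.2], [0.3], [0.4], [0.5]]))
```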

## CONCLUSIONS

According to previous research, the uncertainty of the parameters of the model used to identify the source leads to large errors in the results. Therefore, a framework based on MCMC that considers the uncertainty of the model's sensitive parameters is proposed to improve the accuracy of the results. The framework includes two improvements. First, the sensitive parameters are selected by sensitivity analysis and described by prior probability functions instead of constant values. Second, DRAM-MCMC is used to improve the accuracy of the results and the convergence of the identification.

This study provides a step-wise analysis of the source identification framework, which considers the uncertainty of the model's sensitive parameters. The main findings from this study can be summarized as follows: (1) the sensitivity analysis of the framework that does not consider the uncertainty of the model parameters shows that *D*_{x} and *U*_{x} are the sensitive parameters, which lead to very large uncertainties in the results; (2) a comparison of the traditional framework, which does not consider the uncertainty of the model's parameters, with the proposed framework shows that the proposed framework obtains more accurate results; (3) a comparison of the proposed framework based on the traditional MCMC with that based on DRAM-MCMC reveals that DRAM-MCMC performs better in improving the accuracy and convergence of the source terms; and, finally, (4) the proposed framework based on DRAM-MCMC is used to identify many different events, and the 80th quantile, the mean and the standard deviation of the relative errors are all less than 20%, which is small. These results indicate that the proposed framework is effective for the case study of the South-to-North Water Diversion Project, but it should be tested further on more case studies in the future.

## ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (Grant No. 51320105010). Moreover, this study is partly funded by the national science and technology major project under grant 2014ZX03005001.