This paper presents a data mining (DM)-based approach to developing a watershed water quality evaluation model based on data mining (WQEMD) as an alternative to physical watershed models. Three DM techniques (i.e. model tree, artificial neural network, and radial basis function) were employed to develop a WQEMD based on watershed characteristics (e.g. hydrology, geology, and land usage). To represent watershed characteristics, three cases and ten scenarios were considered. The three cases were defined by (1) the size (area) allocation of sub-watersheds, (2) the watershed imperviousness ratio, and (3) the combination of the area and imperviousness ratio. The ten scenarios were composed of the following parameters: impervious, pervious, land usage, rainfall, and slope. The best WQEMDs were subsequently identified using statistics (correlation coefficient, mean absolute error, root mean squared error, and root relative squared error). The WQEMDs developed were then verified using the Geum-Sum-Youngsan River watershed. The percentage differences for biochemical oxygen demand (BOD), total nitrogen (T-N) and total phosphorus (T-P) were 30.6%, 23.44%, and 2.79%, respectively. The results show that a WQEMD developed in this way is effective, can be used in place of a physical watershed model, and is useful in identifying areas with the best potential for successful remediation.

INTRODUCTION

Over the past decade, population pressure and land-use conversion, together with their accompanying pollution, appear to have become the major diffuse pollution problems (Novotny 2003). Increasing impervious surface has affected urban stream hydrographic conditions and resulted in significantly higher and earlier peak-discharge rates, increased pollutant loads, increased stream temperatures, increased stream-bank erosion, and other adverse effects on stream biota, compared to conditions in rural or undeveloped streams (Schueler 1987, 1994; Brabec et al. 2002; CWP 2005; Wickham et al. 2014).

To deal with these problems, various watershed models have been applied to estimate and analyze watershed management and restoration plans (Cho & Roesner 2014). Watershed models can be classified as either physically or empirically based models depending on the modeling process and basic assumptions for calibration. In general, the data and computation requirements for calibrating a physically based model are enormous; the process is extremely complicated and time-consuming (Daniel et al. 2011). In addition, on many occasions, these data may be unavailable or expensive to acquire (Chau et al. 2005). Therefore, considering the cost/product factor, if DM provides accuracy levels similar to those of a physically based model, the DM approach could be used as a cost-effective alternative (Amirhossien et al. 2015). Furthermore, in order to predict urbanization impacts and evaluate watershed management plans for restoration, nationwide watershed models should be adapted before applying watershed restoration or rehabilitation measures (Singh & Frevert 2004; Novotny 2008).

Therefore, there is a great need for simplified methods by which to evaluate watershed environmental conditions, which could also be used to forecast the effects of restoration. To address these issues, researchers have studied various DM techniques, including model tree (MT), artificial neural network (ANN), and radial basis function (RBF). The results of these studies demonstrated how these techniques could be employed for water quality prediction models (Atkins et al. 2007; Hatzikos et al. 2008; Palani et al. 2008; Wu et al. 2014).

The objectives of this research were to establish a computation-efficient water quality evaluation model based on data mining (WQEMD) and to advance a decision-making system that utilizes data such as land coverage and watershed areas.

DATA PREPARATION FOR DEVELOPING THE WQEMD

To establish a WQEMD, data about watershed characteristics should be compiled, including water quality data for specific watersheds. Numerous studies have been published on the correlation between land use types and water quality parameters (Sliva & Williams 2001; Woli et al. 2004; Mehaffey et al. 2005; Schoonover et al. 2005; Stutter et al. 2007; Tu 2011). Therefore, in this research, land usage, water quality, hydrology, and geology data of a watershed were collected, as shown in Table 1.

Table 1

Data source and period in order to compare land use and water quality

| Division | Parameter | Source | Period |
|---|---|---|---|
| Water quality | COD, BOD, T-N, T-P | Ministry of Environment | 10 years (2001–2010) |
| Hydrology | Rainfall | Water Management Information System | 30 years (1966–2007) |
| Geology | Slope | Remote sensing. Landsat TM data were collected between 2008 and 2010 at a 30 × 30 m spatial resolution, and were processed in order to reveal the LUCC features. The data geometrical corrections, classification and accuracy assessment were carried out with the support of the digital image processing software PCI. Topographical maps (1:50,000) were used as the reference for the geometric corrections. According to the geographical names, the maps were re-drawn on the digitized topographical map base using ArcView GIS 3.2 ver. software to complete editing, labeling, projection, transformation, edge matching and overlaying processes. | |
| Land usage | Pervious & impervious, urban, agriculture, forest, grass, wetland, barren, water | | 2008–2010 |

COD: chemical oxygen demand; BOD: biochemical oxygen demand; T-N: total nitrogen; T-P: total phosphorus.

In order to discover the relationship between water quality and watershed characteristics, several scenarios involving combinations of land usage and imperviousness were assumed, as shown in the following section.

PARAMETER SELECTION PROCESS

The correlation between land use types and water quality parameters has been studied by many researchers. Previous research has verified that most water quality indicators are significantly associated with most land use indicators (Randhir et al. 2001; Wang et al. 2004; Tu 2011). Imperviousness also has a strong relationship with water quality (CWP 2005; Conway 2007). Therefore, land use and imperviousness were employed to find the correlation between watershed characteristics and water quality in this research.

To develop a WQEMD for watershed water quality forecasting, three steps were assumed. The first step takes into consideration the area size of sub-watersheds, the second step considers the imperviousness ratio, and the last step considers both the area size and imperviousness ratio.

For the area-based allocation of sub-watersheds, the Korean peninsula was divided into the watersheds of three rivers. The correlation between land use and water quality was not constant across regions, because the characteristics and pollution sources of different watersheds were not the same (Tu 2011). The three groups of watersheds, and the three steps taken, followed the same process as the development of a WQAM (Cho 2014).

Allocation to each scenario

The Han River, the Nakdong River, and the Geum-Sum-Youngsan River are shown in Figure 1. The first step in correlating water quality, hydrology, geology, and land usage is to classify the sub-watersheds into bins according to area: 0–200 km², 200–500 km², and over 500 km² for the Han and Nakdong River watersheds, and 0–100 km², 100–150 km², 150–200 km², and over 200 km² for the Geum-Sum-Youngsan River watershed, as shown in Table 2. In the second step, watershed data are classified according to imperviousness because it has a strong relationship with water-quality impacts (Conway 2007). In this case, imperviousness is divided into three intervals (below 20%, 20–25%, and over 25%), as shown in Table 3. In the third step, to consider both sub-watershed area and the percentage of imperviousness, sub-watersheds are divided into those below and above 250 km², and their impervious cover is divided into categories of 0–20%, 20–25%, and over 25%, as shown in Table 4.
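
The three classification steps can be illustrated with a small Python sketch. The bin edges follow the text and Tables 2 to 4; the function names, the example sub-watershed, and the rule that a value exactly on an edge falls into the upper bin are assumptions made for illustration, since the paper does not state how boundary values were handled.

```python
from bisect import bisect_right

# Bin edges quoted in the text (km2 for area, % for imperviousness).
AREA_EDGES_HAN_NAKDONG = [200, 500]
AREA_EDGES_GSY = [100, 150, 200]
IMPERVIOUS_EDGES = [20, 25]
COMBINED_AREA_EDGES = [250]


def bin_label(value, edges, unit):
    """Return a readable label for the bin that `value` falls into.

    Assumption: a value equal to an edge goes to the upper bin.
    """
    idx = bisect_right(edges, value)
    if idx == 0:
        return f"below {edges[0]} {unit}"
    if idx == len(edges):
        return f"over {edges[-1]} {unit}"
    return f"{edges[idx - 1]}-{edges[idx]} {unit}"


def classify(area_km2, impervious_pct, area_edges):
    """Classify one sub-watershed for the first, second, and third cases."""
    return {
        "first": bin_label(area_km2, area_edges, "km2"),
        "second": bin_label(impervious_pct, IMPERVIOUS_EDGES, "%"),
        "third": (
            bin_label(area_km2, COMBINED_AREA_EDGES, "km2"),
            bin_label(impervious_pct, IMPERVIOUS_EDGES, "%"),
        ),
    }


# Hypothetical sub-watershed in the Geum-Sum-Youngsan group.
example = classify(area_km2=120.0, impervious_pct=22.0, area_edges=AREA_EDGES_GSY)
print(example)
```

Passing the edge list as an argument keeps the river-specific area bins separate from the shared imperviousness bins.
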
Table 2

The number of applied sub-watersheds for each river watershed (step one)

Number of applied sub-watersheds:

| Area (km²) | Han River | Nakdong River | Geum-Sum-Youngsan River |
|---|---|---|---|
| 0–100 | – | – | 14 |
| 100–150 | – | – | 17 |
| 150–200 | – | – | 13 |
| Over 200 | – | – | 38 |
| 0–200 | 22 | 17 | – |
| 200–500 | 23 | 28 | – |
| Over 500 | 16 | 19 | – |
| Total (207) | 61 | 64 | 82 |
Table 3

The number of applied sub-watersheds for each river watershed (step two)

The number of sub-watersheds:

| Imperviousness | Total | Han River | Nakdong River | Geum-Sum-Youngsan River |
|---|---|---|---|---|
| Below 20% | 75 | 33 | 21 | 21 |
| 20–25% | 88 | 19 | 35 | 34 |
| Over 25% | 36 | 19 | | |
Table 4

The number of standard basins for both area allocations and the percentage of imperviousness of standard basins

| Area allocation (km²) | Imperviousness below 20% | Imperviousness 20–25% | Imperviousness over 25% |
|---|---|---|---|
| Below 250 km² | 23 | 53 | 24 |
| Over 250 km² | 53 | 41 | 13 |
Figure 1

Three groups’ watersheds on a map of South Korea.

To identify critical parameters that have strong relationships with water quality, five parameters (impervious area, pervious area, rainfall, slope, and land usage) were combined into the ten scenarios shown in Table 5.

Table 5

Data classification and scenarios for generating the WQEMD

| Cases | Watershed | Range | Total scenarios |
|---|---|---|---|
| First (area size) | Han River watershed | Below 200 km²; 200–500 km²; over 500 km² | 60 (Han) |
| | Nakdong River watershed | Below 200 km²; 200–500 km²; over 500 km² | 60 (Nak) |
| | Geum-Sum-Youngsan River watershed | Below 100 km²; 100–150 km²; 150–200 km²; over 200 km² | 80 (GSY) |
| | | | Total: 200 |
| Second (impervious ratio) | Han River watershed | Below 20%; 20–25%; over 25% | 60 (Han) |
| | Nakdong River watershed | Below 20%; 20–25%; over 25% | 60 (Nak) |
| | Geum-Sum-Youngsan River watershed | Below 20%; 20–25%; over 25% | 60 (GSY) |
| | | | Total: 180 |
| Third (area size + impervious ratio) | Below 250 km² | Below 20%; 20–25%; over 25% | 60 (be250) |
| | Over 250 km² | Below 20%; 20–25%; over 25% | 60 (ov250) |
| | | | Total: 120 |

The same attributes and scenarios are listed for each case:

km and % of land use (attribute): (01) km (land use); (02) % (land use).

Scenarios (a): (01) imp; (02) imp, per; (03) imp, ra; (04) imp, ra, sl; (05) imp, sl; (06) sl; (07) land; (08) land, ra; (09) land, sl; (10) land, ra, sl.

(a) imp: impervious; per: pervious; ra: rain; sl: slope; land: land usage.

Land usage: urban, agriculture, forest, grass, wetland, barren, water (seven items).
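
The scenario totals in Table 5 are consistent with counting one model run per combination of range, land-use attribute (km or %), and scenario. The sketch below enumerates the first case under that reading; the counting rule is an inference from the totals rather than something the paper states, and the variable names are illustrative.

```python
from itertools import product

# Attributes and scenario definitions as abbreviated in Table 5.
ATTRIBUTES = ["km (land use)", "% (land use)"]
SCENARIOS = {
    1: ("imp",), 2: ("imp", "per"), 3: ("imp", "ra"), 4: ("imp", "ra", "sl"),
    5: ("imp", "sl"), 6: ("sl",), 7: ("land",), 8: ("land", "ra"),
    9: ("land", "sl"), 10: ("land", "ra", "sl"),
}

# Area ranges of the first case (km2).
FIRST_CASE_RANGES = {
    "Han": ["below 200", "200-500", "over 500"],
    "Nakdong": ["below 200", "200-500", "over 500"],
    "GSY": ["below 100", "100-150", "150-200", "over 200"],
}


def enumerate_runs(ranges_by_group):
    """List every (range, attribute, scenario number) combination per group."""
    return {
        group: list(product(ranges, ATTRIBUTES, SCENARIOS))
        for group, ranges in ranges_by_group.items()
    }


runs = enumerate_runs(FIRST_CASE_RANGES)
counts = {group: len(items) for group, items in runs.items()}
total = sum(counts.values())
print(counts, total)
```

Under this assumption the counts reproduce the first-case entries of Table 5 (60, 60, and 80, for a total of 200).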

METHODS AND ARCHITECTURE

Three new WQEMDs were established by applying the three cases and ten scenarios to MT, ANN, and RBF, as shown in Figure 2. In order to determine the best WQEMD, the model-selection process and the verification model with parameter estimation were reviewed (Kutner et al. 2004).
Figure 2

Architecture to establish WQEMD based on data analysis methods.

WQEMDs based on MT, ANN and RBF were implemented using Weka 3.6.9 (Witten et al. 2011). The input files (Table 1) are opened in the preprocessing stage of Weka, and the MT, ANN, and RBF models can be chosen in the Weka Classify tab. The Weka Explorer offers four test options: (1) use training set, (2) supplied test set, (3) cross-validation, and (4) percentage split. In this study, the percentage-split option was selected: about two thirds of the data (66%) were used for calibration and the rest (34%) were used for verification.
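
A percentage split of this kind can be sketched in a few lines of Python. This is an illustration of the idea only, not Weka's implementation: the shuffling, the seed, and the rounding rule are assumptions, so the resulting partition will not match the one Weka produced.

```python
import random


def percentage_split(instances, train_percent=66.0, seed=1):
    """Shuffle the instances and split them into calibration and verification sets.

    The seed and rounding rule are illustrative assumptions.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_percent / 100.0)
    return shuffled[:n_train], shuffled[n_train:]


# Example with 82 placeholder records (the number of Geum-Sum-Youngsan
# sub-watersheds in Table 2); the records themselves are just indices here.
calibration, verification = percentage_split(range(82))
print(len(calibration), len(verification))
```

With 82 records, 66% corresponds to 54 calibration records and 28 verification records.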

MT

An MT is an extension of a regression tree because it associates leaves with multivariate linear models rather than with a zero-order model. There are two approaches to MTs: MARS (multivariate adaptive regression splines; Friedman 1991) and M5 MTs (Quinlan 1992). The MARS algorithm is implemented in the MARS software, and the M5 MT algorithm is implemented in the Cubist software and, with some changes, in the Weka software (Solomatine & Siek 2006).

The M5 algorithm was used to derive an MT, and this required two major procedures. The first procedure was to build the tree and the second was to draw inferences from the knowledge in the tree. An illustrative example is presented in Figure 3(a) and 3(b) (Solomatine & Siek 2006). In the tree-building procedure, the input space is partitioned into mutually exclusive regions, each with a linear regression model. In the inference procedure, a new instance is fed into one of the models in the tree leaves, according to the splitting conditions constructed in the tree-building procedure. The predicted output is then obtained from the linear model within that leaf.
Figure 3

Splitting of the input space and prediction of a new instance based on the MT (a), (b) and schematic diagram of an ANN (c). (a) Splitting of the input space such as $X_1 \times X_2$ by the M5 MT algorithm; each model is a linear regression model $y = a_0 + a_1 x_1 + a_2 x_2$, (b) prediction of a new instance based on the MT, (c) schematic diagram of an artificial neuron.

In this study, both unpruned and pruned trees were tried in order to optimize the results. The minimum number of instances was set to ‘4’.
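
The inference procedure of a model tree can be sketched as follows. The tree below is hand-built with made-up split values and coefficients purely for illustration; it is not a tree produced by M5 or by this study, and the class names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Leaf:
    """Leaf holding a linear model y = a0 + a1*x1 + a2*x2 + ..."""
    intercept: float
    coefs: Sequence[float]

    def predict(self, x):
        return self.intercept + sum(c * v for c, v in zip(self.coefs, x))


@dataclass
class Split:
    """Internal node: go left if x[feature] <= threshold, otherwise right."""
    feature: int
    threshold: float
    left: object
    right: object


def predict(node, x):
    """Route the instance down the tree, then apply the leaf's linear model."""
    while isinstance(node, Split):
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.predict(x)


# Illustrative tree: one split on x1 at 2.0 with a linear model in each leaf.
tree = Split(
    feature=0,
    threshold=2.0,
    left=Leaf(intercept=1.0, coefs=[0.5, 0.0]),
    right=Leaf(intercept=0.0, coefs=[1.0, 2.0]),
)

print(predict(tree, [1.0, 3.0]))  # left leaf:  1.0 + 0.5*1.0 + 0.0*3.0
print(predict(tree, [3.0, 1.0]))  # right leaf: 0.0 + 1.0*3.0 + 2.0*1.0
```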

ANN

ANN is inspired by the study of the human brain and nervous system. A neuron, the basic element of a nervous system, is activated and produces output signals when the value of an input signal is above a certain threshold. An ANN is a computing system that numerically models these structures and operations. The schematic of the ANN is shown in Figure 3(c). In this study, the ANN model was built with a multilayer perceptron (MLP). To configure the MLP, the training time and validation threshold were set to ‘700 iterations’ and ‘20’, respectively. The learning rate and momentum were ‘0.3’ and ‘0.2’ (Kim et al. 2014). In order to optimize the results, several hidden-layer settings were tried. This neural network used back-propagation to train the model.
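
As a rough illustration of how the learning rate and momentum enter training, the sketch below shows a single artificial neuron and one momentum-based weight update. The sigmoid activation (a smooth stand-in for a hard threshold) and the textbook update rule are assumptions; the paper does not describe Weka's internals, and the function names are illustrative.

```python
import math

LEARNING_RATE = 0.3  # value reported in the text
MOMENTUM = 0.2       # value reported in the text


def sigmoid(z):
    """Smooth activation; assumed here in place of a hard threshold."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, inputs):
    """Weighted sum of the inputs passed through the activation function."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, inputs)))


def momentum_step(weight, gradient, previous_delta,
                  learning_rate=LEARNING_RATE, momentum=MOMENTUM):
    """One gradient-descent update with momentum.

    Returns the new weight and the applied change, which is fed back
    as `previous_delta` on the next step.
    """
    delta = -learning_rate * gradient + momentum * previous_delta
    return weight + delta, delta


new_weight, delta = momentum_step(weight=1.0, gradient=0.5, previous_delta=0.1)
print(neuron_output([0.0, 0.0], 0.0, [1.0, 2.0]), new_weight, delta)
```

The momentum term reuses part of the previous change, which tends to smooth the updates from one iteration to the next.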

RBF

An RBF neural network is an advanced form of ANN. The RBF network architecture, for classification purposes, consists of three layers: an input layer, a hidden layer, and an output layer. The input layer has as many neurons as there are input features. Input neurons propagate input features to the next layer. Each neuron $j$ in the hidden layer is associated with a function $\varphi_j$ (Gaussian functions were used), characterized by a center $c_j$ and a width $\sigma_j$. The output layer is composed of as many neurons as classes to be recognized. Each output neuron $i$ computes a simple weighted summation over the response of the hidden neurons for a given input pattern $x$:

$$ y_i(x) = \sum_{j=1}^{k} w_{ij}\,\varphi_j(x) + w_{\mathrm{bias},i} \tag{1} $$

where $k$ is the number of hidden neurons, $w_{ij}$ represents the weight associated with the connection between the function $\varphi_j$ and the output neuron $i$, and $w_{\mathrm{bias},i}$ is the bias of the output neuron $i$.

The RBF model was built using the RBF network, which implements a normalized Gaussian radial basis function network. The RBF network uses the k-means clustering algorithm to provide the basis functions, and then learns either a logistic or a linear regression. The clustering seed, minimum standard deviation, and number of clusters were set to ‘1’, ‘0.1’ and ‘4’, respectively. In order to optimize the results, the number of clusters was varied from ‘2’ to ‘5’ or ‘6’.
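
The weighted summation over hidden-neuron responses described above can be sketched in Python. The specific Gaussian form exp(-||x - c||² / (2σ²)) and the normalization step are standard choices assumed here for illustration; the centers, widths, and weights are made-up values, not fitted ones.

```python
import math


def gaussian(x, center, width):
    """Gaussian response of one hidden neuron (assumed standard form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def rbf_output(x, centers, widths, weights, bias, normalize=True):
    """Bias plus the weighted sum of hidden responses for one output neuron."""
    responses = [gaussian(x, c, s) for c, s in zip(centers, widths)]
    if normalize:
        total = sum(responses)
        if total > 0.0:  # guard against all responses underflowing to zero
            responses = [r / total for r in responses]
    return bias + sum(w * r for w, r in zip(weights, responses))


# Two hypothetical hidden neurons placed symmetrically around the input.
y = rbf_output(
    x=[0.0, 0.0],
    centers=[[1.0, 0.0], [-1.0, 0.0]],
    widths=[1.0, 1.0],
    weights=[2.0, 4.0],
    bias=1.0,
)
print(y)
```

Because the input is equidistant from both centers, each normalized response is 0.5 and the output is the bias plus the average of the two weights.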

Finally, the WQEMDs using MT, ANN, and RBF were evaluated and selected by computing the correlation coefficient (CC), mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE), and root relative squared error (RRSE). In order to select the best WQEMD, CCs were compared first and the models with the higher CC were selected. If the CCs were equal, the WQEMDs with the smaller errors (MAE, RMSE, RAE, and RRSE) were selected (Witten et al. 2011).
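
The five statistics can be computed as in the sketch below. The formulas follow their common definitions; using the mean of the observed test values as the baseline for RAE and RRSE is an assumption, as is the tie-break order in `select_best`, which uses only CC, RMSE, and RRSE. The function names are illustrative.

```python
import math


def evaluate(predicted, observed):
    """Return CC, MAE, RMSE, RAE (%) and RRSE (%) for paired values.

    Assumes the observed values are not all identical.
    """
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    abs_err = sum(abs(p - o) for p, o in zip(predicted, observed))
    sq_err = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    base_abs = sum(abs(o - mean_o) for o in observed)
    base_sq = sum((o - mean_o) ** 2 for o in observed)
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    denom = math.sqrt(var_p * base_sq)
    return {
        "CC": cov / denom if denom > 0 else float("nan"),
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": 100.0 * abs_err / base_abs,
        "RRSE": 100.0 * math.sqrt(sq_err / base_sq),
    }


def select_best(candidates):
    """Pick the model with the highest CC; break ties by smaller RMSE, then RRSE."""
    return min(
        candidates,
        key=lambda name: (
            -candidates[name]["CC"],
            candidates[name]["RMSE"],
            candidates[name]["RRSE"],
        ),
    )


metrics = evaluate([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(metrics)
```

An RAE or RRSE of 100% means the model does no better than always predicting the mean of the observed values, which is why values above 100% in a results table deserve a second look even when the CC is high.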

RESULTS

DM (MT, ANN, and RBF)

The best WQEMD was derived from the first, second, and third cases using DM. The WQEMD CC values for the first case were higher than those of the second and third cases with the exception of the Nakdong River (Figure 4 and Table 6). Therefore, the first WQEMD was selected as the best case.
Table 6

Detailed explanation of the range for the index shown in Figure 4 

| Cases | Watershed | Range | Index |
|---|---|---|---|
| First (area size) | Han River watershed | Below 200 km² | FH0–200 |
| | | 200–500 km² | FH200–500 |
| | | Over 500 km² | FH500– |
| | Nakdong River watershed | Below 200 km² | FN0–200 |
| | | 200–500 km² | FN200–500 |
| | | Over 500 km² | FN500– |
| | Geum-Sum-Youngsan River watershed | Below 100 km² | FG0–100 |
| | | 100–150 km² | FG100–150 |
| | | 150–200 km² | FG150–200 |
| | | Over 200 km² | FG200– |
| Second (impervious ratio) | Han River watershed | Below 20% | SH0–20% |
| | | 20–25% | SH20–25% |
| | | Over 25% | SH25%– |
| | Nakdong River watershed | Below 20% | SN0–20% |
| | | 20–25% | SN20–25% |
| | | Over 25% | SN25%– |
| | Geum-Sum-Youngsan River watershed | Below 20% | SG0–20% |
| | | 20–25% | SG20–25% |
| | | Over 25% | SG25%– |
| Third (area size + impervious ratio) | Below 250 km² | Below 20% | TB250_IP0–20% |
| | | 20–25% | TB250_IP20–25% |
| | | Over 25% | TB250_IP25%– |
| | Over 250 km² | Below 20% | TO250_IP0–20% |
| | | 20–25% | TO250_IP20–25% |
| | | Over 25% | TO250_IP25%– |
Figure 4

Comparing WQEMD CCs of the three steps.

Table 7 shows the best WQEMD derived from the first case. It was noted that water quality displayed a strong and significant relationship with watershed land usage, rainfall, slope, imperviousness, and perviousness. For biochemical oxygen demand (BOD) simulation, the Han River from 200 km² to 500 km² was the best WQEMD [CC = 0.995, MAE = 2.572, RMSE = 4.344, RAE = 101.9%, RRSE = 146.3%]. The Nakdong River from 0 km² to 200 km² was the worst WQEMD [CC = 0.743, MAE = 0.657, RMSE = 0.692, RAE = 79.6%, RRSE = 77.2%]. The results of chemical oxygen demand (COD), total nitrogen (T-N) and total phosphorus (T-P) simulation are shown in Supplementary Information Table S1 (available with the online version of this paper).

Table 7

WQEMD for water quality simulation using DM

CC, MAE, RMSE, RAE, and RRSE are evaluated on the test split.

| Water quality | Model | Scenario (a) | No. of instances | No. of rules/hidden/cluster | CC | MAE | RMSE | RAE | RRSE | Total no. of instances | km²/% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BOD | han (M5P)_200 | | 22 | | 0.941 | 0.721 | 1.011 | 61.1% | 79.5% | | |
| | han (ANN)_500 | | 23 | | 0.995 | 2.572 | 4.344 | 101.9% | 146.3% | | km |
| | han (RBF)_500over | | 16 | | 0.968 | 0.222 | 0.296 | 57.8% | 67.0% | | km |
| | nakdong (M5P)_200 | | 21 | | 0.743 | 0.657 | 0.692 | 79.6% | 77.2% | | |
| | nakdong (ANN)_500 | | 32 | | 0.765 | 0.526 | 0.586 | 69.0% | 66.8% | 11 | km |
| | nakdong (RBF)_500over | | 11 | | 0.883 | 0.333 | 0.404 | 49.9% | 49.1% | | |
| | gsy (M5P)_100 | | 14 | | 0.994 | 0.204 | 0.240 | 40.5% | 42.5% | | km |
| | gsy (ANN)_150 | | 14 | | 0.959 | 1.198 | 1.422 | 67.1% | 63.9% | | km |
| | gsy (ANN)_200 | | 10 | | 0.937 | 0.366 | 0.432 | 38.9% | 34.0% | | km |
| | gsy (RBF)_200over | | 36 | | 0.919 | 0.250 | 0.327 | 28.5% | 35.1% | 13 | |

(a) Scenario 1: area + impervious, Scenario 2: area + impervious + pervious, Scenario 3: area + impervious + rainfall, Scenario 4: area + impervious + rainfall + slope, Scenario 5: area + impervious + slope, Scenario 6: area + slope, Scenario 7: land use, Scenario 8: land use + rainfall, Scenario 9: land use + slope, Scenario 10: land use + rainfall + slope.

Verification

Selected WQEMDs were verified in order to demonstrate their reliability. The Yongdam Dam watershed, which is located in the southern part of Korea, was used for the verification exercise. Figure 5 shows the Yongdam Dam watershed (127°18′49″ to 127°44′47″ east longitude and 35°34′50″ to 36°1′37″ north latitude).
Figure 5

Yongdam Dam watershed.

In order to simulate water quality, the models shown in Table 7 were used for DM. The ‘Re-evaluate model on current test set’ option of the Weka program was applied to the data sets shown in Table 8.

Table 8

Data sets for water quality simulation using WQEMD (Geum-Sum-Youngsan River watershed)

Columns Urban to Water give land use. Water quality values (mg/L) are averages from 2001 to 2010.

| Watershed | Sub-watershed | Pervious | Impervious | Rainfall (mm/month) | Slope (%) | Urban | Agriculture | Forest | Grass | Wetland | Barren | Water | BOD | T-N | T-P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Geum-Sum-Youngsan River watershed | Geum River (km²) | 224 | 57.70 | 112.4 | 32.87 | 8.4 | 74.5 | 178.1 | 16.6 | 1.0 | 3.4 | 2.2 | 2.12 | 2.717 | 0.151 |
| | Geum River (%) | 80 | 20 | 112.4 | 32.87 | 3.0 | 26.4 | 63.1 | 5.9 | 0.4 | 1.2 | 0.8 | | | |
| | Guryang Stream (km²) | 129 | 33.23 | 146.9 | 29.75 | 4.6 | 44.7 | 109.2 | 2.6 | 0.2 | 1.3 | 1.0 | 2.45 | 2.683 | 0.158 |
| | Guryang Stream (%) | 80 | 20 | 146.9 | 29.75 | 2.8 | 27.5 | 67.2 | 1.6 | 0.1 | 0.8 | 0.6 | | | |
| | Jinan Stream (km²) | 27 | 7.63 | 113.7 | 26.98 | 2.0 | 10.7 | 20.4 | 1.1 | 0.2 | 0.1 | 0.2 | 1.93 | 1.862 | 0.096 |
| | Jinan Stream (%) | 78 | 22 | 113.7 | 26.98 | 5.8 | 31.1 | 59.1 | 3.2 | 0.5 | 0.4 | 0.6 | | | |
| | Jeongja Stream (km²) | 80 | 16.68 | 134.4 | 40.46 | 1.3 | 11.7 | 81.9 | 1.6 | 0.1 | 0.2 | 0.5 | 1.29 | 2.577 | 0.026 |
| | Jeongja Stream (%) | 83 | 17 | 134.4 | 40.46 | 1.4 | 12.1 | 84.5 | 1.7 | 0.2 | 0.2 | 0.5 | | | |
| | Juja Stream (km²) | 48 | 9.40 | 137.8 | 40.23 | 0.6 | 5.2 | 50.2 | 0.7 | 0.1 | 0.2 | 0.3 | 1.17 | 1.615 | 0.015 |
| | Juja Stream (%) | 84 | 16 | 137.8 | 40.23 | 1.1 | 9.1 | 88.0 | 1.3 | 0.2 | 0.3 | 0.6 | | | |

The data sets used for simulation of the Geum River watershed are shown in Table 8. They include the categories pervious, impervious, rainfall, slope, and land usage. Land usage was determined using Landsat TM data (Table 1). Water quality data from the end site of each sub-watershed were used, particularly those of average water quality in 2010.

The results of water quality simulation using the WQEMDs are shown in Table 9 and in Figure 6.
Table 9

Results of water quality simulation for the Yongdam Dam watershed using the WQEMD

| Sub-watershed | BOD (mg/L), WQEMD | BOD (mg/L), observed | T-N (mg/L), WQEMD | T-N (mg/L), observed | T-P (mg/L), WQEMD | T-P (mg/L), observed |
|---|---|---|---|---|---|---|
| Geum River | 1.46 | 2.12 | 1.89 | 2.72 | 0.030 | 0.151 |
| Guryang Stream | 1.90 | 2.45 | 3.12 | 2.68 | 0.148 | 0.158 |
| Jinan Stream | 1.03 | 1.93 | 1.23 | 1.86 | 0.110 | 0.096 |
| Jeongja Stream | 1.04 | 1.29 | 1.49 | 2.58 | 0.085 | 0.026 |
| Juja Stream | 0.80 | 1.17 | 1.03 | 1.62 | 0.085 | 0.015 |
| Sum | 6.23 | 8.97 | 8.77 | 11.45 | 0.458 | 0.446 |
| % difference | 30.62 | – | 23.44 | – | 2.786 | – |
Figure 6

Water quality simulation results based on the WQEMD of the Yongdam Dam watershed.

In the BOD simulation, the WQEMD results follow a trend similar to that of the observed data. The T-N and T-P simulations also show trends similar to those of the observed data.

DISCUSSION

Parameters that influence water quality

In this research, imperviousness and land use were the main parameters used to develop the WQEMD. Scenarios 1 to 5 were based on imperviousness and Scenarios 7 to 10 on land usage. In Figure 7, it can be seen that BOD and COD were influenced more by imperviousness (60% and 70%, respectively) than by land usage (40% and 10%, respectively). In contrast, T-N and T-P were influenced more by land usage (70% and 70%, respectively) than by imperviousness (30% and 30%, respectively).
Figure 7

Percentages for each of the selected scenarios for the best WQEMDs based on DM (Scenario 1: area + impervious, Scenario 2: area + impervious + pervious, Scenario 3: area + impervious + rainfall, Scenario 4: area + impervious + rainfall + slope, Scenario 5: area + impervious + slope, Scenario 6: area + slope, Scenario 7: land use, Scenario 8: land use + rainfall, Scenario 9: land use + slope, Scenario 10: land use + rainfall + slope).

Based upon the results of the selected WQEMD, we assumed that organic matter, like BOD and COD, was affected by both impervious and pervious surfaces. This means that organic matter could be influenced by rainfall runoff. On the other hand, nutrients like T-N and T-P were influenced more by land usage.

Comparing input data between WQEMD and a physically based model

According to the results of this research, predicting water quality with the WQEMD requires only a few input data: pervious area, impervious area, rainfall, slope, and land use. On the other hand, a physically based model needs areal precipitation, watershed representation, surface runoff, infiltration, subsurface flow and interflow, groundwater flow and base flow, evaporation and evapotranspiration, interception, depression storage, detention storage, rainfall-excess/soil moisture accounting, snowmelt runoff, stream–aquifer interaction, reservoir flow routing, channel flow routing, water quality, model calibration and model testing (Singh & Frevert 2004). Table 10 shows the different input data for both WQEMD and HSPF. Therefore, in terms of input data, the WQEMD could be very useful in the first stage of a project, such as a feasibility study, to compare the water quality status of each watershed and to decide site priorities for restoration and rehabilitation.

Table 10

Input data for the WQEMD and HSPF

| Model | WQEMD | HSPF |
|---|---|---|
| Input data | Pervious, impervious, rainfall, slope, and land use | Continuous meteorological time series records including (at a minimum): rainfall; potential evapotranspiration |
| | | For SNOW simulation, additional required meteorological time series include: temperature; wind speed; solar radiation; dew point temperature |
| | | For additional simulation options, other required meteorological time series may include: pan evaporation; cloud cover |
| | | Soils data (auxiliary data set to guide hydrologic calibration), pollutant buildup and wash off, stream dimensions or rating curves, and point-source loading inputs |
| | | A large number of parameters need to be specified (some default values are available) |

CONCLUSIONS

In this study, the best WQEMDs were selected after analyzing three cases and ten scenarios using MT, ANN, and RBF. The quality of each WQEMD was assessed using statistical criteria (CC, MAE, RMSE, RAE, and RRSE). The WQEMD approach serves as an alternative to complex watershed modeling. It was developed using the Yongdam Dam watershed as the case study, and the WQEMD of the Geum-Sum-Youngsan River watershed was applied because the Yongdam Dam lies within that watershed. The inputs to the WQEMD were pervious area, impervious area, rainfall, slope, and land usage.
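As an illustration of the statistical criteria named above, the following is a minimal sketch of how CC, MAE, RMSE, RAE, and RRSE might be computed for paired observed and simulated values. It assumes the commonly used definitions of these statistics (as given, for example, in Witten et al. 2011), with RAE and RRSE reported as percentages; the function name `evaluation_statistics` and the sample values are hypothetical and are not taken from this study.

```python
import math


def evaluation_statistics(actual, predicted):
    """Return CC, MAE, RMSE, RAE (%) and RRSE (%) for paired values.

    Assumed definitions (not restated in this section of the paper):
      CC   = Pearson correlation coefficient
      MAE  = mean of |p - a|
      RMSE = sqrt(mean of (p - a)^2)
      RAE  = 100 * sum|p - a| / sum|a - mean(a)|
      RRSE = 100 * sqrt(sum (p - a)^2 / sum (a - mean(a))^2)
    """
    if len(actual) != len(predicted) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n

    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    abs_dev = sum(abs(a - mean_a) for a in actual)
    sq_dev_a = sum((a - mean_a) ** 2 for a in actual)
    sq_dev_p = sum((p - mean_p) ** 2 for p in predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))

    if sq_dev_a == 0 or sq_dev_p == 0:
        raise ValueError("CC, RAE and RRSE are undefined for constant series")

    return {
        "CC": cov / math.sqrt(sq_dev_a * sq_dev_p),
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": 100.0 * abs_err / abs_dev,
        "RRSE": 100.0 * math.sqrt(sq_err / sq_dev_a),
    }


# Hypothetical observed and simulated concentrations, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [2.0, 3.0, 4.0, 5.0]
stats = evaluation_statistics(observed, simulated)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions, a model with a high CC can still have a large RAE or RRSE when its predictions are systematically offset, which is one reason for reporting several criteria together.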

Water quality simulated with the WQEMD was compared with the observed data for the Yongdam Dam watershed. The percent differences for BOD, T-N, and T-P were 30.62% (fair), 23.44% (good), and 2.79% (very good), respectively. The percent difference for water quality parameters can be classified as very good (<15%), good (15–25%), or fair (25–35%) (Source: ‘Watershed Model Calibration and Validation: Issues and Procedures’ from BASINS/HSPF Training Lecture No.15).
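The classification above can be sketched in a few lines of code. This is a minimal illustration, not the procedure used in the study: the function names are hypothetical, the percent difference is assumed to be the absolute difference between simulated and observed means relative to the observed mean, and values falling exactly on 25% or 35% are assumed to belong to the better class, since the cited ranges do not say how shared boundaries are handled.

```python
def percent_difference(simulated_mean, observed_mean):
    """Absolute percent difference relative to the observed mean (assumed definition)."""
    if observed_mean == 0:
        raise ValueError("observed mean must be non-zero")
    return 100.0 * abs(simulated_mean - observed_mean) / abs(observed_mean)


def classify_percent_difference(value):
    """Map a percent difference to the rating ranges quoted in the text.

    Boundary handling (15, 25, 35) is an assumption of this sketch.
    """
    if value < 15.0:
        return "very good"
    if value <= 25.0:
        return "good"
    if value <= 35.0:
        return "fair"
    return "outside the listed ranges"


# Percent differences reported for the Yongdam Dam watershed.
for parameter, value in [("BOD", 30.62), ("T-N", 23.44), ("T-P", 2.79)]:
    print(parameter, value, classify_percent_difference(value))
```

Applied to the reported values, the sketch reproduces the ratings given in the text: fair for BOD, good for T-N, and very good for T-P.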

The models for the other watersheds (e.g. the Han and Nakdong River watersheds, particularly those with a high relative absolute error) should also be verified so that a suitable WQEMD can be selected for each watershed. This verification is a natural direction for future research.

Finally, these results show that WQEMDs, which relate watershed parameters to watershed water quality, can be used for management purposes. Furthermore, they can be used to identify and prioritize areas for restoration and rehabilitation in a watershed even when existing data are insufficient to meet the requirements of a physically based model.

REFERENCES

Amirhossien F., Alireza F., Kazem J. & Mohammadbagher S. 2015 A comparison of ANN and HSPF models for runoff simulation in Balkhichai River Watershed, Iran. American Journal of Climate Change 2015 (4), 203–216.
Atkins J. P., Burdon D. & Allen J. H. 2007 An application of contingent valuation and decision tree analysis to water quality improvements. Marine Pollution Bulletin 55 (2007), 591–602.
CWP 2005 Urban Subwatershed Restoration Manual No. 1, An Integrated Framework to Restore Small Urban Watershed, Center for Watershed Protection.
Chau K. W., Wu C. L. & Li Y. S. 2005 Comparison of several flood forecasting models in Yangtze River. Journal of Hydrologic Engineering, ASCE 10 (6), 485–491.
Cho Y. D. 2014 Development of a WQAM: a water quality assessment model based on watershed characteristics by non-linear regression. Water Science & Technology: Water Supply 15 (2), 236–247.
Cho Y. D. & Roesner L. A. 2014 Development of SPAWM: selection program for available watershed models. Water Science & Technology 70 (3), 387–396.
Daniel E. B., Camp J. V., LeBoeuf E. J., Penrod J. R., Dobbins J. P. & Abkowitz M. D. 2011 Watershed modeling and its applications: a state-of-the-art review. Open Hydrology Journal 2011 (5), 26–50.
Friedman J. H. 1991 Multivariate adaptive regression splines. Annals of Statistics 19, 1–141.
Hatzikos E. V., Tsoumakas G., Tzanis G., Bassiliades N. & Vlahavas I. 2008 An empirical study on sea water quality prediction. Knowledge-Based Systems 21 (2008), 471–478.
Kim J., Kim J. & Cho Y. 2014 Establishing a predictive model for Chlorophyll-A concentration in Lake Daechung, Korea using multilinear statistical techniques. Journal of Environmental Engineering, ASCE 141 (2).
Kutner M. H., Christopher J., Nachtsheim J. N. & Li W. 2004 Applied Linear Statistical Models.
Mehaffey M. H., Nash M. S., Wade T. G., Ebert D. W., Jones K. B. & Rager A. 2005 Linking land cover and water quality in New York City's water supply watershed. Environmental Monitoring and Assessment 107, 29–44.
Novotny V. 2003 Water Quality Diffuse Pollution and Watershed Management. J. Wiley & Sons, New York, NY.
Novotny V. 2008 Watershed Models, Encyclopedia of Ecology, pp. 3748–3759.
Palani S., Liong S. Y. & Tkalich P. 2008 An ANN application for water quality forecasting. Marine Pollution Bulletin 56 (2008), 1586–1597.
Quinlan J. R. 1992 Learning with continuous classes. In: Proceedings AI '92, Adams & Sterling (eds), World Scientific (1992), pp. 343–348.
Randhir T. O., O'Connor R., Penner P. R. & Goodwin D. W. 2001 A watershed-based land prioritization model for water supply protection. Forest Ecology and Management 143 (2001), 47–56.
Schueler T. 1987 Controlling Urban Runoff: A Practical Manual for Planning and Designing Urban Best Management Practices. Metropolitan Washington Council of Governments, Washington, DC.
Schueler T. R. 1994 The Importance of Imperviousness. Watershed Protect. Tech. 1, 100–111.
Singh V. P. & Frevert D. K. 2004 Watershed Modeling, World Water Congress.
Solomatine D. P. & Siek M. B. 2006 Modular learning models in forecasting natural phenomena. Neural Networks 19 (2006), 215–224.
Wang X., Yu S. & Huang G. H. 2004 Land allocation based on integrated GIS-optimization modeling at a watershed level. Landscape and Urban Planning 66 (2004), 61–74.
Wickham J. D., Wade T. G. & Norton D. J. 2014 Spatial patterns of watershed impervious cover relative to stream location. Ecological Indicators 40 (2014), 109–116.
Witten I. H., Frank E. & Hall M. A. 2011 Data Mining: Practical Machine Learning Tools and Techniques.
Woli K. P., Nagumo T., Kuramochi K. & Hatano R. 2004 Evaluating river water quality through land use analysis and N budget approaches in livestock farming areas. Science of the Total Environment 329, 61–74.
