Shimabara City in Nagasaki Prefecture, Japan, is located on a volcanic peninsula that has abundant groundwater. Almost all public water supplies in this region use groundwater. For this reason, understanding groundwater characteristics is a prerequisite for proper water supply management. We therefore investigated the groundwater chemistry characteristics in Shimabara using self-organizing maps (SOMs). The input to the SOM was the concentrations of eight major groundwater chemical components, namely Cl−, NO3−, SO42–, HCO3−, Na+, K+, Mg2+, and Ca2+, collected at 36 sampling locations. The locations comprised private and public water supply wells, springs, and a river, sampled from April 2012 to May 2015. Results showed that, depending on the chemistry, surface water and groundwater could be classified into five main clusters displaying unique patterns. Further, the five clusters could be divided into two major water types, namely nitrate-polluted and non-polluted water. According to Stiff and Piper trilinear diagrams, the nitrate-polluted water was of the Ca-(SO4 + NO3) (calcium sulfate nitrate) type, while the non-polluted water was classified as Ca-HCO3 (calcium bicarbonate) type. This indicates that rainwater recharged in the upstream areas is polluted by agricultural activities in the mid-slope areas of Shimabara.
INTRODUCTION
Groundwater is used for various purposes, such as water supply, agriculture, and industry. During recent decades, groundwater has been polluted by increasing fertilizer applications to meet the demand of food supply due to population growth. Monitoring and protection of groundwater are essential to meet the demand for safe groundwater. To understand the effects of hydrogeological processes and anthropogenic activities on regional groundwater, it is important to study the chemical characteristics. The hydrogeochemistry of groundwater is influenced by many factors, such as climate, mineralogy of aquifers, chemical composition of rainfall and surface water, topography, and anthropogenic activities. Thus, a hydrogeochemical interpretation of groundwater quality from representative water samples can provide useful information on the geochemical processes, hydrodynamics, origin, and interaction of the groundwater with aquifer materials.
Shimabara City is known as a region that, to a great extent, relies on groundwater for the public water supply (Committee on Nitrate Reduction in Shimabara Peninsula 2011). However, Shimabara groundwater has been increasingly polluted by nitrate since 1988. We analyzed the present situation of groundwater pollution by nitrate in Shimabara and showed that agricultural activities are the main polluter of the groundwater (Nakagawa et al. 2016). To better understand the characteristics of the water chemistry, multivariate analysis such as principal component analysis (PCA), which can reduce data dimensionality and extract synthetic indexes with minimum information loss, is often used (e.g., Aiuppa et al. 2003; Cloutier et al. 2008; Banoeng-Yakubo et al. 2009; Sonkamble et al. 2012; Nadiri et al. 2013; Omonona et al. 2014; Singaraja et al. 2014; Ghesquière et al. 2015; Marghade et al. 2015; Matiatos 2016). Using groundwater chemistry, we classified Shimabara water by use of principal component and cluster analysis (Nakagawa et al. 2016). The results showed that groundwater could be classified into four clusters, where one cluster expressed nitrate pollution and the other clusters showed ion dissolution from the aquifer matrix. However, it is sometimes difficult to decipher PCA results due to bias resulting from the complexity and nonlinearity of large data (Choi et al. 2014). Recently, multivariate analysis using self-organizing maps (SOMs) has been applied to various research fields, such as ecology (Céréghino et al. 2001; Bedoya et al. 2009), geomorphology (Hentati et al. 2010), hydrology (Kalteh & Berndtsson 2007), meteorology (Nishiyama et al. 2007), and wastewater treatment (García & González 2004). SOM has also been used to classify the water chemistry of rivers and groundwater (Hong & Rosen 2001; Jin et al. 2011; Choi et al. 2014; Nguyen et al. 2015). Thus, SOM is a powerful and effective tool for detection and interpretation of spatially varying phenomena. 
In particular, SOM is better able to handle nonlinear, noisy or irregular, and multivariate data without requiring a mechanistic understanding of the system. SOM is also easily and quickly updated when new data are added (Hong & Rosen 2001; Kalteh et al. 2008). The similarity of the extracted pattern classifications can be compared visually using color gradients (Jin et al. 2011).
In the previous study (Nakagawa et al. 2016), we used field observed data from August 2011 to November 2013. We continued to collect data, and the available data were extended to May 2015. Therefore, in this study, we confirmed our previous results by using a more informative method, SOM, together with an extended database. With SOM, visual representation of groundwater characteristics is easy, and more detailed clustering with better analysis results is possible compared to conventional PCA. To improve the understanding of groundwater characteristics in Shimabara, we applied SOM combined with hierarchical cluster analysis using water chemistry as input. Based on the results of the SOM analysis, we discuss the spatial trends of groundwater characteristics in Shimabara and the practical application of SOM for future water use.
STUDY AREA AND DATA USED
In total, 353 water samples were collected from April 2012 to May 2015. Sampling was performed at seven resident wells (RW), 21 public water supply wells (W), two observation wells (O), five springs (S), and one river (R) (Figure 1). To obtain spatially representative groundwater conditions, the sampling sites covered the whole area of Shimabara except for forest and other land use (Figures 1 and 2). Sampling was done four times annually at 2–4 month intervals to capture temporal variation in groundwater conditions. Sampling at specific locations (RW-14, b, W-21, O-2, S-2, 3, 5, and R-2) was done less frequently. The hydrogeochemical data used in this study consist of major dissolved ion concentrations for Cl−, NO3−, SO42–, HCO3−, Na+, K+, Mg2+, and Ca2+. The mean and standard deviation over the 36 sampling sites, computed from the temporally averaged ion concentrations at each site, are summarized in Table 1. The data must be normalized before applying SOM so that all parameters are given the same importance. SOM results are highly sensitive to the pre-processing method because the Euclidean distance between input data is used (e.g., Jin et al. 2011). Therefore, as preprocessing in this study, each ion concentration was rescaled to the range [0, 1] using its minimum and maximum values (Nishiyama et al. 2007; Jin et al. 2011).
Major ion (mg L−1) | Mean | SD |
---|---|---|
Cl− | 12.4 | 1.4 |
NO3− | 38.4 | 5.0 |
SO42– | 21.9 | 3.2 |
HCO3− | 55.7 | 6.6 |
Na+ | 12.1 | 2.4 |
K+ | 6.4 | 1.2 |
Mg2+ | 8.7 | 1.1 |
Ca2+ | 22.4 | 2.9 |
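The [0, 1] rescaling applied before SOM training can be illustrated with a minimal sketch. The function name `minmax_scale` and the demo values are hypothetical and are not taken from the study; the sketch only assumes that each ion (column) is rescaled independently using its own minimum and maximum.

```python
import numpy as np

def minmax_scale(samples):
    """Rescale each column (ion) to [0, 1] using its own minimum and maximum."""
    samples = np.asarray(samples, dtype=float)
    col_min = samples.min(axis=0)
    col_range = samples.max(axis=0) - col_min
    # Guard against a constant column, which would otherwise divide by zero.
    col_range[col_range == 0.0] = 1.0
    return (samples - col_min) / col_range

# Three made-up samples with two ions (mg/L); not data from the study.
demo = np.array([[5.0, 10.0],
                 [10.0, 40.0],
                 [15.0, 80.0]])
scaled = minmax_scale(demo)
```

After this step every ion contributes on the same scale to the Euclidean distance, so ions with large absolute concentrations such as HCO3− do not dominate the map.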
METHODOLOGY
The SOM is a modified artificial neural network characterized by unsupervised training that can project high-dimensional information onto a low-dimensional array (e.g., Vesanto et al. 2000). Many researchers have chosen a two-dimensional array (e.g., Jiang et al. 2014). The result is a readily understandable, visual pattern classification. The objective of the SOM application here was to obtain physically explainable reference vectors from the input vectors. The input vectors comprised, in total, 353 hydrogeochemical data points (approximately quarterly sampling at the 36 sampling locations) with eight variables (major dissolved ion concentrations: Cl−, NO3−, SO42–, HCO3−, Na+, K+, Mg2+, and Ca2+). Reference vectors were obtained after iterative updates through a training phase that comprised three main procedures: competition between nodes, selection of a winner node, and updating of the reference vectors (e.g., Vesanto et al. 2000). Selecting proper initialization and data transformation methods is important when designing a relevant SOM methodology. In SOM applications, a larger map size generally gives a higher resolution for pattern recognition. The optimum number of SOM nodes is determined by applying the heuristic rule m = 5√n, where m denotes the number of SOM nodes and n represents the number of input data (García & González 2004; Hentati et al. 2010; Jin et al. 2011). Herein, this heuristic rule was used to determine the total number of nodes in the SOM. The ratio between the side lengths of the map (rows and columns) is determined by the square root of the ratio between the two largest eigenvalues of the correlation matrix of the input data. The eigenvalues are obtained from PCA. In a previous study using the sampled data from August 2011 to November 2013, two principal components (Factor 1 and Factor 2) explained 86.5% of the total variance (Nakagawa et al. 2016).
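The map-sizing rule can be sketched as follows, assuming the heuristic takes the commonly used form m = 5√n and that the side-length ratio is the square root of the ratio of the two largest eigenvalues. The function name `som_grid_shape` and the rounding scheme are illustrative choices, and the eigenvalues passed in the example are hypothetical, since the actual eigenvalues are not reported in the text.

```python
import math

def som_grid_shape(n_samples, eig_first, eig_second):
    """Return (short_side, long_side) of a SOM grid.

    n_samples  : number of input vectors (n)
    eig_first  : largest eigenvalue of the correlation matrix
    eig_second : second-largest eigenvalue
    """
    n_nodes = 5.0 * math.sqrt(n_samples)              # heuristic m = 5 * sqrt(n)
    side_ratio = math.sqrt(eig_first / eig_second)    # long side / short side
    short_side = max(1, round(math.sqrt(n_nodes / side_ratio)))
    long_side = max(1, round(n_nodes / short_side))
    return short_side, long_side

# n = 353 as in the study; the eigenvalues 3.5 and 1.0 are hypothetical.
rows, cols = som_grid_shape(353, 3.5, 1.0)
```

For n = 353 the heuristic gives about 94 nodes before rounding to a rectangular grid. With the hypothetical eigenvalue ratio used here, the sketch yields a 7 × 13 grid; other eigenvalue ratios would give other shapes.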
After the SOM structure was organized with the above rule, a linear initialization technique assigned a reference vector to each node. Linear initialization speeds up the training phase and supports proper pattern abstraction from limited data (Jeong et al. 2010). Further, when only limited data are available, linear initialization is more suitable for pattern classification than random initialization because of small data sets and boundary effects (Nguyen et al. 2015). The linear initialization used eigenvalues and eigenvectors of the input data to set the initial reference vectors on the structured SOM. This means that the initial reference vectors already include prior information about the input data, resulting in a quicker and more efficient training phase (Vesanto et al. 2000). In this study, each reference vector was updated through the SOM training process using a batch mode with a Gaussian neighborhood function. Although some issues with the implementation of the batch SOM are discussed in detail in Jiang et al. (2014), the results of the SOM analysis supported previous clustering results (Nakagawa et al. 2016; shown below). The reference vectors obtained at the end of the training process were fine-tuned using cluster analysis.
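The following is a minimal NumPy sketch of linear initialization followed by batch training with a Gaussian neighborhood. It is not the SOM Toolbox implementation used in the study: the function names, the number of epochs, and the neighborhood radius schedule are all assumptions made for illustration, and the input is random placeholder data.

```python
import numpy as np

def linear_init(data, rows, cols):
    """Place initial reference vectors on the plane of the first two principal axes."""
    mean = data.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(data - mean, rowvar=False))
    top = np.argsort(eigvals)[::-1][:2]
    axes = eigvecs[:, top] * np.sqrt(eigvals[top])        # shape (d, 2)
    r = np.linspace(-1.0, 1.0, rows)
    c = np.linspace(-1.0, 1.0, cols)
    # Node k = i * cols + j; the longer side follows the first principal axis.
    grid = np.array([(cj, ri) for ri in r for cj in c])   # shape (rows*cols, 2)
    return mean + grid @ axes.T

def train_batch_som(data, rows, cols, n_epochs=20, sigma_end=1.0):
    """Batch SOM: each epoch replaces every reference vector by a
    neighborhood-weighted mean of all input vectors."""
    weights = linear_init(data, rows, cols)
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], dtype=float)
    grid_d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    sigma_start = max(rows, cols) / 4.0                   # assumed starting radius
    for sigma in np.linspace(sigma_start, sigma_end, n_epochs):
        # Competition: find the winner (best-matching) node of every sample.
        d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
        winners = d2.argmin(axis=1)
        # Gaussian neighborhood: h[k, i] is the influence of sample i on node k.
        h = np.exp(-grid_d2[:, winners] / (2.0 * sigma ** 2))
        # Update: weighted mean of the inputs for each node.
        weights = (h @ data) / h.sum(axis=1, keepdims=True)
    return weights

# Placeholder input: 353 random vectors with 8 variables scaled to [0, 1].
rng = np.random.default_rng(0)
data = rng.random((353, 8))
reference_vectors = train_batch_som(data, rows=7, cols=13)
```

Because each batch update is a weighted average of the inputs, the trained reference vectors always stay within the range of the data, which keeps them interpretable as ion concentrations after the scaling is inverted.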
There are various clustering algorithms available in the literature (e.g., García & González 2004; Jin et al. 2011). In this study, a partitive algorithm (k-means) and a hierarchical algorithm (Ward's method) were applied for appropriate clustering of the reference vectors. Among partitive clustering methods, the k-means algorithm is most frequently used for SOM (e.g., Jin et al. 2011). The Davies–Bouldin Index (DBI), computed from the k-means results, determines the optimal number of clusters (García & González 2004; Jin et al. 2011). The DBI values, based on similarity within a cluster and dissimilarity between clusters, were calculated for cluster numbers from a minimum of two up to the total number of nodes. The DBI becomes smaller as the clusters become more compact and more dissimilar from each other. In other words, the minimum DBI represents the optimal number of clusters for the trained SOM. Ward's linkage method, which is one of the hierarchical techniques, is the most commonly used clustering method (Faggiano et al. 2010; Hentati et al. 2010; Jin et al. 2011). In this study, the final fine-tuning cluster analysis was carried out using Ward's method. The above calculation processes were carried out using a modified version of SOM Toolbox 2.0 (Vesanto et al. 2000). The output SOM clusters were plotted on Piper trilinear and Stiff diagrams to explain the main features of each cluster. Furthermore, the SOM clusters were mapped spatially to clarify the influence of land use.
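Choosing the number of clusters by the minimum DBI can be sketched in plain NumPy as below. This is an illustrative reimplementation, not the SOM Toolbox routine: the deterministic farthest-point initialization of k-means is a simplification chosen for reproducibility, and the three synthetic blobs merely stand in for trained reference vectors.

```python
import numpy as np

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point farthest from all centers chosen so far."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    return np.array(centers)

def kmeans_labels(points, k, n_iter=50):
    """Plain Lloyd iterations; returns the cluster label of every point."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def davies_bouldin(points, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / separation_ij."""
    ids = np.unique(labels)
    cents = np.array([points[labels == i].mean(axis=0) for i in ids])
    scatter = np.array([np.linalg.norm(points[labels == i] - c, axis=1).mean()
                        for i, c in zip(ids, cents)])
    sep = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=2)
    np.fill_diagonal(sep, np.inf)            # exclude a cluster's pairing with itself
    return ((scatter[:, None] + scatter[None, :]) / sep).max(axis=1).mean()

# Synthetic stand-in for reference vectors: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
nodes = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in blob_centers])

scores = {k: davies_bouldin(nodes, kmeans_labels(nodes, k)) for k in range(2, 7)}
best_k = min(scores, key=scores.get)         # number of clusters with minimum DBI
```

On real reference vectors the DBI curve is usually less clear-cut than in this synthetic case, which is one reason to combine it with a hierarchical method such as Ward's for the final grouping.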
RESULTS AND DISCUSSION
Based on the methodology described above, the number of SOM nodes was determined to be equal to 91. The number of rows and columns was 7 and 13, respectively. Thus, this SOM design was used for the cluster analysis of standardized water chemistry data from the 36 locations in Shimabara.
To confirm quantitative relationships, as mentioned above, correlation coefficients between reference vectors for each parameter were calculated (Table 2). There is a high correlation (r = 0.99) between Cl− and NO3−. There is also a high correlation between Na+ and Mg2+ (r = 0.92). Similarly, the color gradient for the relationship between SO42– and Ca2+ indicates a high correlation coefficient (r = 0.94). The relationships between ions indicate the factors affecting groundwater chemistry. For example, a high co-variation (R2 = 0.72) between higher concentrations of NO3− and Cl− was observed, indicating that they originate from common sources, such as human and animal waste (e.g., Diédhiou et al. 2012). A similar relationship can be observed between SO42– and Ca2+ (r = 0.79). The high correlation implies that the dissolution of gypsum may be one of the key factors controlling the geochemical evolution of groundwater (Liu et al. 2015).
Ion | NO3− | SO42– | HCO3− | Na+ | K+ | Mg2+ | Ca2+ |
---|---|---|---|---|---|---|---|
Cl− | 0.99* | 0.82* | −0.51* | 0.47* | 0.86* | 0.46* | 0.78* |
NO3− | | 0.75* | −0.60* | 0.38* | 0.82* | 0.36* | 0.71* |
SO42– | | | −0.03 | 0.84* | 0.92* | 0.79* | 0.94* |
HCO3− | | | | 0.43* | −0.11 | 0.52* | 0.11 |
Na+ | | | | | 0.71* | 0.92* | 0.82* |
K+ | | | | | | 0.72* | 0.94* |
Mg2+ | | | | | | | 0.88* |
*Correlations significant at p = 0.01.
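A correlation table of this kind can be computed from a matrix of reference vectors with one call to `np.corrcoef`, as in the sketch below. The matrix here is random placeholder data with an artificially coupled Cl/NO3 pair; it does not reproduce the values in Table 2.

```python
import numpy as np

ions = ["Cl", "NO3", "SO4", "HCO3", "Na", "K", "Mg", "Ca"]

# Placeholder reference vectors (91 nodes x 8 ions); not the study's values.
rng = np.random.default_rng(1)
reference_vectors = rng.random((91, len(ions)))
# Couple NO3 to Cl artificially so that one pair shows a strong correlation.
reference_vectors[:, 1] = 0.9 * reference_vectors[:, 0] + 0.1 * rng.random(91)

# Pearson correlation between columns (ions); rowvar=False treats columns as variables.
r = np.corrcoef(reference_vectors, rowvar=False)

# Keep only the upper triangle, matching the layout of a correlation table.
upper = {(ions[i], ions[j]): r[i, j]
         for i in range(len(ions)) for j in range(i + 1, len(ions))}
```

With eight ions the upper triangle holds 28 pairwise coefficients, which is the number of filled cells in Table 2.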
The five classified clusters can generally be divided into two water quality types. Cluster-2 and -3 can be characterized as polluted water due to the high concentration of NO3−. The other group includes cluster-1, -4, and -5, representing non-polluted water (pristine water type).
Table 3 shows mean ion concentrations calculated from the raw data classified into each cluster. The mean NO3− concentration for cluster-3 is higher than 50 mg L−1, which is the maximum contamination level recommended by the World Health Organization (WHO 2011) for drinking water. The mean NO3− concentration for cluster-2 meets the WHO standard. However, it exceeds 13 mg L−1, which is the maximum nitrate concentration unaffected by human activities (Eckhardt & Stackelberg 1995). This confirms that the two clusters include polluted water, as mentioned above. Cluster-1, -4, and -5 display much lower mean NO3− concentrations. NO3− concentrations exceeding the maximum concentration level recommended by the WHO have also been reported in other studies (e.g., Diédhiou et al. 2012; Hansen et al. 2012; Liu et al. 2015; Dragon et al. 2016; Matiatos 2016). In these investigations, the maximum NO3− concentration ranged from 91 to 855 mg L−1.
Cluster | Cl− (mg L−1) | NO3− (mg L−1) | SO42– (mg L−1) | HCO3− (mg L−1) | Na+ (mg L−1) | K+ (mg L−1) | Mg2+ (mg L−1) | Ca2+ (mg L−1) |
---|---|---|---|---|---|---|---|---|
Cluster-1 | 5.1 | 9.9 | 3.2 | 37.7 | 6.5 | 3.4 | 3.2 | 8.7 |
Cluster-2 | 14.3 | 42.1 | 22.5 | 39.0 | 11.2 | 6.2 | 8.1 | 20.5 |
Cluster-3 | 21.3 | 78.8 | 37.7 | 27.5 | 14.4 | 8.6 | 11.2 | 31.5 |
Cluster-4 | 6.4 | 9.9 | 10.5 | 108.5 | 11.1 | 4.9 | 10.6 | 21.0 |
Cluster-5 | 6.8 | 6.2 | 41.3 | 175.4 | 25.1 | 7.9 | 17.6 | 33.5 |
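The comparison of the cluster mean NO3− concentrations with the two thresholds cited in the text (50 mg L−1 from WHO 2011 and 13 mg L−1 from Eckhardt & Stackelberg 1995) can be written out as a small check. The mean values are copied from Table 3; the function name and the class labels are illustrative.

```python
WHO_LIMIT = 50.0         # mg/L, WHO (2011) drinking-water level cited in the text
BACKGROUND_LIMIT = 13.0  # mg/L, upper bound for water unaffected by human activities

# Mean NO3- concentration (mg/L) per cluster, from Table 3.
mean_no3 = {
    "Cluster-1": 9.9,
    "Cluster-2": 42.1,
    "Cluster-3": 78.8,
    "Cluster-4": 9.9,
    "Cluster-5": 6.2,
}

def nitrate_class(concentration):
    """Label a NO3- concentration relative to the two thresholds."""
    if concentration > WHO_LIMIT:
        return "exceeds WHO level"
    if concentration > BACKGROUND_LIMIT:
        return "above background"
    return "within background"

classes = {name: nitrate_class(value) for name, value in mean_no3.items()}
polluted = [name for name, label in classes.items() if label != "within background"]
```

Only cluster-2 and cluster-3 fall above the background threshold, which matches the division into polluted and non-polluted water types.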
Based on the Stiff and Piper trilinear diagrams, the polluted water type is represented as Ca-(SO4 + NO3) (calcium sulfate nitrate type), while the non-polluted water type is classified as Ca-HCO3 (calcium bicarbonate type). Similar results were reported by Shin et al. (2013). According to that study, water samples collected from the upper reaches of Korean rivers were of Ca-HCO3 type, whereas water samples collected from the lower reaches, with relatively high nitrate concentrations, were classified as Na-Cl-NO3 type. This indicates that the water samples are affected by anthropogenic factors such as fertilizer, manure, and septic waste.
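As a rough numerical counterpart to reading water types off the diagrams, the sketch below converts the Table 3 cluster means from mg L−1 to meq L−1 and picks the dominant cation and anion group. This is a simplification and not the procedure used in the study, which plotted reference vectors on Piper and Stiff diagrams; the grouping of Na + K and of SO4 + NO3 is an assumption that follows the usual Piper axes, and the function name is hypothetical.

```python
# Equivalent weights (g per equivalent) = molar mass / ionic charge.
EQ_WEIGHT = {"Cl": 35.45, "NO3": 62.00, "SO4": 48.03, "HCO3": 61.02,
             "Na": 22.99, "K": 39.10, "Mg": 12.15, "Ca": 20.04}

ION_ORDER = ["Cl", "NO3", "SO4", "HCO3", "Na", "K", "Mg", "Ca"]

# Mean concentrations (mg/L) per cluster, copied from Table 3.
TABLE3 = {
    "Cluster-1": [5.1, 9.9, 3.2, 37.7, 6.5, 3.4, 3.2, 8.7],
    "Cluster-2": [14.3, 42.1, 22.5, 39.0, 11.2, 6.2, 8.1, 20.5],
    "Cluster-3": [21.3, 78.8, 37.7, 27.5, 14.4, 8.6, 11.2, 31.5],
    "Cluster-4": [6.4, 9.9, 10.5, 108.5, 11.1, 4.9, 10.6, 21.0],
    "Cluster-5": [6.8, 6.2, 41.3, 175.4, 25.1, 7.9, 17.6, 33.5],
}

def water_type(mg_per_l):
    """Return 'cation-anion' using the dominant groups in meq/L."""
    meq = {ion: c / EQ_WEIGHT[ion] for ion, c in zip(ION_ORDER, mg_per_l)}
    cations = {"Ca": meq["Ca"], "Mg": meq["Mg"], "Na+K": meq["Na"] + meq["K"]}
    anions = {"HCO3": meq["HCO3"], "Cl": meq["Cl"],
              "(SO4 + NO3)": meq["SO4"] + meq["NO3"]}
    return max(cations, key=cations.get) + "-" + max(anions, key=anions.get)

types = {name: water_type(values) for name, values in TABLE3.items()}
```

Under these assumptions cluster-2 and cluster-3 come out as Ca-(SO4 + NO3) and cluster-1, -4, and -5 as Ca-HCO3, in line with the types read from the diagrams.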
SUMMARY AND CONCLUSION
In this study, water chemistry data from 36 sampling locations, obtained from April 2012 to May 2015, were classified using SOM in combination with hierarchical cluster analysis to clarify groundwater characteristics in Shimabara, Japan. The SOM provided readily understandable results for classifying the water chemistry data into distinguishable hydrogeochemical types. The Piper trilinear and Stiff diagrams for the reference vectors were plotted to display the fundamental characteristics of each cluster. In addition, the spatial distribution of the respective clusters explained the spatial variability of the hydrogeochemical characteristics determined by the SOM. Based on the SOM results, the water chemistry data could be divided into five clusters that revealed two representative water types: nitrate-polluted water (cluster-2 and -3) and non-polluted water (cluster-1, -4, and -5). The spatial distribution of cluster-2 and -3 shows that agricultural activities are causing groundwater pollution in the northern part of Shimabara. The Stiff and Piper trilinear diagrams based on the reference vectors for each cluster showed that non-polluted water and polluted water are characterized by Ca-HCO3 type and Ca-(SO4 + NO3) type, respectively. This indicates that nitrate pollution, which originates from agricultural activities, is captured by cluster-2 and -3.
The SOM analysis showed that pristine groundwater recharged on the mountainside is classified into cluster-1. Some groundwater in cluster-1 is also located close to the mid-slope hills. This means that non-polluted water can be used from this agricultural area. For other purposes, water quality evaluation methods such as the Wilcox classification diagram (Wilcox 1955) can be used to evaluate whether water in cluster-2 or -3 can be used for, e.g., irrigation. The clusters from the SOM analysis are useful for evaluating further groundwater remediation alternatives.
The application and results of the SOM support our previous conclusion (Nakagawa et al. 2016) regarding the spatial distribution of nitrate pollution in the study area and its causes. Data that display a scattered distribution in the Piper trilinear diagram can be difficult to analyze by PCA. However, in this case, SOM can be an alternative method (Choi et al. 2014). In this study, both PCA and SOM successfully classified groundwater chemistry in the study area. However, SOM gives more robust and explainable results that can be used to characterize groundwater chemistry. More detailed characteristics along this line will be described in a new paper (Amano et al. in press).
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant Numbers 24360194 and 15KT0120.