Shimabara City in Nagasaki Prefecture, Japan, is located on a volcanic peninsula that has abundant groundwater. Almost all public water supplies use groundwater in this region. For this reason, understanding groundwater characteristics is a pre-requisite for proper water supply management. Thus, we investigated the groundwater chemistry characteristics in Shimabara by use of self-organizing maps (SOMs). The input to SOM was concentrations of eight major groundwater chemical components, namely Cl−, NO3−, SO42–, HCO3−, Na+, K+, Mg2+, and Ca2+ collected at 36 sampling locations. The locations constituted private and public water supply wells, springs, and a river sampled from April 2012 to May 2015. Results showed that depending on the chemistry, surface water and groundwater could be classified into five main clusters displaying unique patterns. Further, the five clusters could be divided into two major water types, namely, nitrate- and non-polluted water. According to Stiff and Piper trilinear diagrams, the nitrate-polluted water represented Ca-(SO4 + NO3) (calcium sulfate nitrate) type, while the non-polluted water was classified as Ca-HCO3 (calcium bicarbonate) type. This indicates that recharging rain water in the upstream areas is polluted by agricultural activities in the mid-slope areas of Shimabara.
Groundwater is used for various purposes, such as water supply, agriculture, and industry. During recent decades, groundwater has been polluted by increasing fertilizer applications to meet the demand of food supply due to population growth. Monitoring and protection of groundwater are essential to meet the demand for safe groundwater. To understand the effects of hydrogeological processes and anthropogenic activities on regional groundwater, it is important to study the chemical characteristics. The hydrogeochemistry of groundwater is influenced by many factors, such as climate, mineralogy of aquifers, chemical composition of rainfall and surface water, topography, and anthropogenic activities. Thus, a hydrogeochemical interpretation of groundwater quality from representative water samples can provide useful information on the geochemical processes, hydrodynamics, origin, and interaction of the groundwater with aquifer materials.
Shimabara City is known as a region that, to a great extent, relies on groundwater for the public water supply (Committee on Nitrate Reduction in Shimabara Peninsula 2011). However, Shimabara groundwater has been increasingly polluted by nitrate since 1988. We analyzed the present situation of groundwater pollution by nitrate in Shimabara and showed that agricultural activities are the main polluter of the groundwater (Nakagawa et al. 2016). To better understand the characteristics of the water chemistry, multivariate analysis such as principal component analysis (PCA), which can reduce data dimensionality and extract synthetic indexes with minimum information loss, is often used (e.g., Aiuppa et al. 2003; Cloutier et al. 2008; Banoeng-Yakubo et al. 2009; Sonkamble et al. 2012; Nadiri et al. 2013; Omonona et al. 2014; Singaraja et al. 2014; Ghesquière et al. 2015; Marghade et al. 2015; Matiatos 2016). Using groundwater chemistry, we classified Shimabara water by use of principal component and cluster analysis (Nakagawa et al. 2016). The results showed that groundwater could be classified into four clusters, where one cluster expressed nitrate pollution and the other clusters showed ion dissolution from the aquifer matrix. However, it is sometimes difficult to decipher PCA results due to bias resulting from the complexity and nonlinearity of large data (Choi et al. 2014). Recently, multivariate analysis using self-organizing maps (SOMs) has been applied to various research fields, such as ecology (Céréghino et al. 2001; Bedoya et al. 2009), geomorphology (Hentati et al. 2010), hydrology (Kalteh & Berndtsson 2007), meteorology (Nishiyama et al. 2007), and wastewater treatment (García & González 2004). SOM has also been used to classify the water chemistry of rivers and groundwater (Hong & Rosen 2001; Jin et al. 2011; Choi et al. 2014; Nguyen et al. 2015). Thus, SOM is a powerful and effective tool for detection and interpretation of spatially varying phenomena. 
In particular, SOM is better able to handle nonlinearities, noisy or irregular data, and multivariate data without a mechanistic understanding of the system. SOM is also easily and quickly updated when new data are added (Hong & Rosen 2001; Kalteh et al. 2008). The similarity of the extracted pattern classifications can be compared visually using color gradients (Jin et al. 2011).
In the previous study (Nakagawa et al. 2016), we used field data observed from August 2011 to November 2013. We continued to collect data, and the available data were extended to May 2015. Therefore, in this study, we confirmed our previous results by using a more informative method, SOM, together with an extended database. With SOM, visual representation of groundwater characteristics is easy, and more detailed clustering with better analytical results is possible compared to conventional PCA. To improve the understanding of groundwater characteristics in Shimabara, we applied SOM combined with hierarchical cluster analysis using water chemistry as input. Based on the results of the SOM analysis, we discuss the spatial trends of groundwater characteristics in Shimabara and the practical application of SOM for future water use.
STUDY AREA AND DATA USED
Figure 1 shows the study area and the sampling locations in Shimabara, Nagasaki Prefecture, Japan. Shimabara has an area of 82.8 km2 and is located in the northeastern part of Shimabara Peninsula. In the center of the peninsula, the active volcano Unzen (Mt Fugendake) is located. The geology of the Shimabara area is thus formed by volcanic deposits composed of dacite, andesite, volcanic ash, and lapilli. Average annual precipitation is about 2,100 mm (1967–2013). The mean annual temperature is 16.9 °C, and the average monthly temperature ranges from 4.2 °C (January) to 29.0 °C (August) (Japan Meteorological Agency 2015).
Figure 2 shows altitude and land use in Shimabara. According to the figure, the land use can generally be divided into forest, agriculture, and urban areas. Areas above an altitude of 200 m are generally occupied by forest. According to the estimated regional groundwater flow, the forest areas, which comprise 36.5% of Shimabara, may be recognized as groundwater recharge zones. Upland and paddy fields are concentrated into the northern parts of the area, occupying 23.6% and 7.5% of Shimabara, respectively. Buildings are usually located at altitudes below 100 m along the coast and represent 14.9% of Shimabara. Other land use is 17.5%.
In total, 353 water samples were collected from April 2012 to May 2015. Sampling was performed at seven resident wells (RW), 21 public water supply wells (W), two observation wells (O), five springs (S), and one river (R) (Figure 1). To obtain spatially representative groundwater conditions, the sampling sites covered the whole area of Shimabara except for forest and other land use (Figures 1 and 2). Sampling was done four times annually at 2–4 month intervals to capture temporally varying groundwater conditions. Sampling at specific locations (RW-14, b, W-21, O-2, S-2, 3, 5, and R-2) was done less frequently. The hydrogeochemical data used in this study consist of major dissolved ion concentrations for Cl−, NO3−, SO42–, HCO3−, Na+, K+, Mg2+, and Ca2+. Means and standard deviations over the 36 sampling sites, computed from the temporally averaged ion concentrations at each site, are summarized in Table 1. The data must be normalized before applying SOM so that all parameters are given the same importance. SOM results are highly sensitive to the pre-processing method because the Euclidean distance between input data is used (e.g., Jin et al. 2011). Therefore, as preprocessing in this study, each ion concentration was rescaled to the range [0, 1] using its minimum and maximum values (Nishiyama et al. 2007; Jin et al. 2011).
|Major ion (mg L−1)||Mean||SD|
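The min-max rescaling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `min_max_scale` and the example concentrations are hypothetical and are not taken from the study's data or software.

```python
def min_max_scale(samples):
    """Rescale each column (ion) of `samples` to [0, 1]; rows are water samples."""
    columns = list(zip(*samples))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in samples:
        scaled.append([
            # A constant column has no range; map it to 0.0 to avoid dividing by zero.
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Hypothetical concentrations (mg/L) for three samples and two ions (NO3-, Ca2+).
example = [[2.0, 10.0], [26.0, 20.0], [50.0, 40.0]]
scaled = min_max_scale(example)
```

After scaling, every ion spans the same range, so an ion with large absolute concentrations no longer dominates the Euclidean distances used by the SOM.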
The SOM is a modified artificial neural network characterized by unsupervised training that can project high-dimensional information onto a low-dimensional array (e.g., Vesanto et al. 2000). Many researchers have chosen a two-dimensional array (e.g., Jiang et al. 2014). The result is a readily understandable and visual pattern classification. The objective of the SOM application here was to obtain physically explainable reference vectors from the input vectors. Thus, the input vectors were composed of, in total, 353 hydrogeochemical data points (approximately quarterly sampling at the 36 sampling locations) with eight variables (major dissolved ion concentrations: Cl−, NO3−, SO42–, HCO3−, Na+, K+, Mg2+, and Ca2+). Reference vectors were obtained after iterative updates through a training phase that comprised three main procedures: competition between nodes, selection of a winner node, and updating of the reference vectors (e.g., Vesanto et al. 2000). Selection of proper initialization and data transformation methods is an important factor when designing a relevant SOM methodology. In SOM applications, in general, a larger map size gives a higher resolution for pattern recognition. The optimum number of SOM nodes is determined by applying the heuristic rule $m = 5\sqrt{n}$, where m denotes the number of SOM nodes and n represents the number of input data (García & González 2004; Hentati et al. 2010; Jin et al. 2011). Herein, this heuristic rule was used to determine the total number of nodes in the SOM. The ratio of the number of rows and columns is determined by the square root of the ratio between the two largest eigenvalues of the correlation matrix of input data. The eigenvalues are obtained from PCA. In a previous study using the sampled data from August 2011 to November 2013, two principal components (Factor 1 and Factor 2) explained 86.5% of the total variance (Nakagawa et al. 2016).
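As a rough sketch of how such a sizing rule can be applied, the snippet below assumes the heuristic takes the commonly used form m = 5√n and derives a grid whose side ratio follows the square root of the eigenvalue ratio. The function name `som_grid`, the rounding scheme, and the two eigenvalues are assumptions made for illustration; the eigenvalues are not reported in this paper, and the SOM Toolbox applies its own rounding.

```python
import math


def som_grid(n_samples, eig_first, eig_second):
    """Return (short_side, long_side) of a SOM grid from the sizing heuristics."""
    target_nodes = 5.0 * math.sqrt(n_samples)     # heuristic m = 5 * sqrt(n)
    side_ratio = math.sqrt(eig_first / eig_second)  # long side / short side
    short_side = max(1, round(math.sqrt(target_nodes / side_ratio)))
    long_side = max(1, round(target_nodes / short_side))
    return short_side, long_side


# n = 353 samples as in this study; the eigenvalues 3.45 and 1.0 are hypothetical.
rows, cols = som_grid(353, 3.45, 1.0)
total_nodes = rows * cols
```

With these illustrative eigenvalues the sketch yields a 7 by 13 grid. The heuristic itself gives about 94 nodes for n = 353, and rounding to whole rows and columns brings the total to 91.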
After organizing the SOM structure with the above rule, a linear initialization technique was used to assign an initial reference vector to each node. Linear initialization speeds up the training phase and supports proper pattern abstraction from limited data (Jeong et al. 2010). Further, when only limited data are available, linear initialization is more suitable for pattern classification than random initialization, because of small data sets and boundary effects (Nguyen et al. 2015). The linear initialization used eigenvalues and eigenvectors of the input data to set initial reference vectors on the structured SOM. This means that the initial reference vectors already include prior information about the input data, resulting in a quicker and more efficient training phase (Vesanto et al. 2000). In this study, each reference vector was updated through the SOM training process using a batch mode with a Gaussian neighborhood function. Although some issues with the implementation of the batch SOM are discussed in some detail in Jiang et al. (2014), the results of the SOM analysis supported previous clustering results (Nakagawa et al. 2016; shown below). The reference vectors obtained at the end of the training process were fine-tuned using cluster analysis.
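To make the batch update with a Gaussian neighborhood more concrete, here is a minimal pure-Python sketch of one training epoch. It is not the SOM Toolbox implementation: it omits linear initialization, assumes a rectangular grid with Euclidean grid distances (the lattice type is not stated in the text), and uses hypothetical toy data and a hypothetical function name `batch_som_epoch`.

```python
import math


def batch_som_epoch(data, weights, grid, sigma):
    """Run one batch-SOM epoch and return the updated reference vectors."""
    # Competition and winner selection: index of the closest node per sample.
    winners = []
    for x in data:
        sq_dists = [sum((a - b) ** 2 for a, b in zip(x, w)) for w in weights]
        winners.append(sq_dists.index(min(sq_dists)))

    # Update: each node becomes the Gaussian-weighted mean of all samples,
    # weighted by the grid distance between the node and each sample's winner.
    updated = []
    for row_j, col_j in grid:
        numerator = [0.0] * len(data[0])
        denominator = 0.0
        for x, win in zip(data, winners):
            row_w, col_w = grid[win]
            grid_sq = (row_j - row_w) ** 2 + (col_j - col_w) ** 2
            h = math.exp(-grid_sq / (2.0 * sigma ** 2))
            denominator += h
            numerator = [s + h * v for s, v in zip(numerator, x)]
        updated.append([s / denominator for s in numerator])
    return updated


# Toy data (hypothetical, already scaled to [0, 1]) on a 1 x 2 map.
data = [[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0]]
grid = [(0, 0), (0, 1)]
weights = [[0.2, 0.2], [0.8, 0.8]]
for sigma in (1.0, 0.5, 0.1):  # shrink the neighborhood over the epochs
    weights = batch_som_epoch(data, weights, grid, sigma)
```

As the neighborhood width shrinks, each node settles near the mean of the samples it wins, which is the behavior that lets the final reference vectors summarize groups of similar water samples.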
There are various clustering algorithms available in the literature (e.g., García & González 2004; Jin et al. 2011). In this study, a partitional algorithm (k-means) and a hierarchical algorithm (Ward's method) were applied for appropriate clustering of the reference vectors. Among partitional clustering methods, the k-means algorithm is most frequently used for SOM (e.g., Jin et al. 2011). The Davies–Bouldin Index (DBI), computed from the k-means results, determines the optimal number of clusters (García & González 2004; Jin et al. 2011). The DBI values, based on similarity within a cluster and dissimilarity between clusters, were calculated from a minimum of two clusters to the total number of nodes. The DBI becomes smaller as the clusters become more compact and more dissimilar from one another. In other words, a minimum DBI represents the optimal number of clusters for the trained SOM. Ward's linkage method, which is one of the hierarchical techniques, is the most commonly used clustering method (Faggiano et al. 2010; Hentati et al. 2010; Jin et al. 2011). In this study, the final fine-tuning cluster analysis was carried out using Ward's method. The above calculation processes were carried out using a modified version of SOM Toolbox 2.0 (Vesanto et al. 2000). The output SOM clusters were plotted on Piper trilinear and Stiff diagrams to explain the main features of each cluster. Furthermore, the SOM clusters were mapped spatially to clarify the influence of land use.
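The Davies–Bouldin Index can be illustrated with a short, self-contained sketch. The version below uses the standard definition (mean over clusters of the worst ratio of summed within-cluster scatter to centroid separation) and assumes that cluster labels are already available, for example from k-means; the function name `davies_bouldin` and the four example points are hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin Index for labeled points; lower means better separation."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    centroids = {
        label: [sum(coord) / len(members) for coord in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter: mean distance from each member to its cluster centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster
        # (assumes at least two clusters with distinct centroids).
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)


# Hypothetical 2-D reference vectors forming two obvious groups.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
dbi_good = davies_bouldin(points, [0, 0, 1, 1])  # groups match the geometry
dbi_bad = davies_bouldin(points, [0, 1, 0, 1])   # groups cut across the geometry
```

In practice the index would be evaluated for each candidate number of clusters, and the number giving the smallest value would be retained before the hierarchical step.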
RESULTS AND DISCUSSION
Based on the methodology described above, the number of SOM nodes was determined to be equal to 91. The number of rows and columns was 7 and 13, respectively. Thus, this SOM design was used for the cluster analysis of standardized water chemistry data from the 36 locations in Shimabara.
Figure 3 shows the obtained component planes for the 91 reference vectors (nodes) of the eight ion component concentrations (standardized to a range between 0 and 1). Each component plane shows the standardized value of each parameter (concentration) of the 91 reference vectors (nodes) using a color gradient. Comparison between the component planes shows relationships (or correlation) among the parameters. For example, a similar color gradient can be observed for Cl− (Figure 3(a)) and NO3− (Figure 3(b)). The same trend can be seen for Na+ (Figure 3(e)) and Mg2+ (Figure 3(g)) in their respective component planes. This means that there is high positive correlation between these variables. A great advantage of SOM is that relationships between nodes on the component plane are clearly visualized. For example, the node located at the uppermost left end shows lower normalized concentrations for all ions (Cl−:0.00, NO3−:0.00, SO42–:0.00, HCO3−:0.11, Na+:0.00, K+:0.00, Mg2+:0.00, and Ca2+:0.00). The node, located at the uppermost right end, shows moderately higher normalized concentrations for HCO3−, Mg2+, and Ca2+ (Cl−:0.13, NO3−:0.09, SO42–:0.15, HCO3−:0.46, Na+:0.09, K+:0.18, Mg2+:0.33, Ca2+:0.40). The node located at the lowermost left shows relatively higher normalized ion concentrations except for HCO3− (Cl−:0.85, NO3−:0.83, SO42–:0.78, HCO3−:0.04, Na+:0.30, K+:1.00, Mg2+:0.43, Ca2+:0.90). On the other hand, the node located at the lowermost right shows higher normalized ion concentrations except for Cl− and NO3− (Cl−:0.29, NO3−:0.17, SO42–:0.95, HCO3−:0.95, Na+:1.00, K+:0.65, Mg2+:0.98, Ca2+:1.00).
To confirm quantitative relationships, as mentioned above, correlation coefficients between reference vectors for each parameter were calculated (Table 2). There is a high correlation (r = 0.99) between Cl− and NO3−. There is also a high correlation between Na+ and Mg2+ (r = 0.92). Similarly, the color gradient for the relationship between SO42– and Ca2+ indicates a high correlation coefficient (r = 0.94). The relation between each ion indicates factors affecting groundwater chemistry. For example, a high co-variation (R2 = 0.72) between higher concentrations of NO3− and Cl− was observed, indicating that they originate from common sources, such as human and animal waste (e.g., Diédhiou et al. 2012). Moreover, the same result can be observed between SO42– and Ca2+ (r = 0.79). The high correlation implies that the dissolution of gypsum may be one of the key factors controlling the geochemical evolution of groundwater (Liu et al. 2015).
*Correlations significant at p = 0.01.
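The correlation between two component planes is an ordinary Pearson coefficient computed over the node values. The sketch below applies it to the four corner-node values of Cl− and NO3− quoted in the text; using only four nodes is purely illustrative, so the result is not the coefficient reported in Table 2, which is based on all 91 reference vectors. The function name `pearson_r` is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Normalized values at the four corner nodes (upper left, upper right,
# lower left, lower right) as quoted in the text for Cl- and NO3-.
cl_plane = [0.00, 0.13, 0.85, 0.29]
no3_plane = [0.00, 0.09, 0.83, 0.17]
r_cl_no3 = pearson_r(cl_plane, no3_plane)
```

Even on these four nodes the two planes vary almost in step, which is what the similar color gradients in the component planes express visually.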
Figure 4 shows the variation of DBI, with a magnified view of the range between 2 and 14 clusters. The minimum DBI occurs for five clusters, meaning that this number should be used as the optimal value. After determining the number of clusters, Ward's hierarchical clustering algorithm was carried out for the five clusters to fine-tune the pattern classification. Figure 5 shows the hierarchical cluster dendrogram. The 91 nodes of the SOM were classified into five different clusters. Figure 6 shows the pattern classification map for these five clusters. The number shown for each node represents the number of raw data samples classified into that node. Simultaneous analysis of the component planes (Figure 3) and the pattern classification result (Figure 6) indicates what kind of data the respective clusters include. For example, cluster-3 (the lower left part of Figure 6) is associated with a high content of Cl− and NO3−. This pattern is observed in the same part of the respective component planes for each parameter, as shown in Figure 3. On the other hand, groundwater samples in nodes with an extremely low concentration of all ions are located at the upper left part of each component plane (associated with cluster-1), as shown in Figure 3.
More quantitative information than the visualized pattern classification can be extracted and interpreted from the obtained reference vectors. Stiff diagrams for the respective clusters were represented by mean and upper and lower limits of one standard deviation using reference vectors of each cluster to characterize the clustered data. For example, the Stiff diagram for cluster-1 is represented by reference vectors of 18 nodes classified into the cluster. Figure 7 shows Stiff diagrams for the five clusters, with eight parameters containing mean values and standard deviations. Cluster-1 (Figure 7(a)) shows low values for all ions compared to other clusters. The visible patterns of cluster-2 (Figure 7(b)) and cluster-3 (Figure 7(c)) are not similar, as shown in the figure. However, they are characterized by high concentrations of NO3−. Cluster-2 represents lower concentrations than that of cluster-3 for all ions except HCO3−. The pattern with the highest Ca2+ in cations and HCO3− in anions is associated with cluster-4 (Figure 7(d)). In this cluster, the concentration of Na+, K+, and Mg2+ is slightly lower than that for Ca2+. For anions, the concentration of HCO3− is significantly higher than other anions. This pattern is also shown in cluster-5 (Figure 7(e)). It is clear that all ion concentrations except for Cl− and NO3− of cluster-5 are higher than that of cluster-4.
The five classified clusters can generally be divided into two water quality types. Cluster-2 and -3 can be characterized as polluted water due to the high concentration of NO3−. The other group includes cluster-1, -4, and -5, representing non-polluted water (pristine water type).
Table 3 shows mean ion concentrations calculated from raw data and classified into the respective cluster. The NO3− for cluster-3 indicates a higher mean value than 50 mg L−1 which is the maximum contamination level recommended by the World Health Organization (WHO 2011) for drinking water. The NO3− for cluster-2 meets the WHO standard. However, it exceeds 13 mg L−1 which is the maximum nitrate concentration unaffected by human activities (Eckhardt & Stackelberg 1995). It confirms that the two clusters include polluted water as mentioned above. Cluster-1, -4, and -5 display much lower mean NO3− concentrations. An NO3− concentration exceeding the maximum concentration level recommended by the WHO has also been reported in other studies (e.g., Diédhiou et al. 2012; Hansen et al. 2012; Liu et al. 2015; Dragon et al. 2016; Matiatos 2016). In these investigations, the maximum NO3− concentration ranged from 91 to 855 mg L−1.
|Cl− (mg L−1)||NO3− (mg L−1)||SO42– (mg L−1)||HCO3− (mg L−1)||Na+ (mg L−1)||K+ (mg L−1)||Mg2+ (mg L−1)||Ca2+ (mg L−1)|
Figure 8 shows Piper trilinear diagrams for all reference vectors (91) and the respective cluster. With respect to cations, most vectors of all clusters are located in zone B in the lower left delta-shaped region, indicating a non-typical water. However, a part of the reference vectors for cluster-3 is located in zone A, indicating a calcium-type water. For anions, reference vectors are mostly located in zone B, E, and F in the lower right delta-shaped region, suggesting that the reference vectors of cluster-1, -4, and -5 are bicarbonate-type water and the reference vectors of cluster-2 and -3 are sulfate and nitrate-type water or non-typical water. Thus, in the Piper trilinear diagram, two main water types are revealed. These are calcium-magnesium bicarbonate type (zone I) including cluster-1, 4, and 5 (non-polluted water type) and calcium-magnesium chloride-sulfate-nitrate type (zone III) including cluster-2 and -3 (polluted water type).
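Positions on a Piper trilinear diagram are based on each ion's share of the total cation or anion charge, expressed in milliequivalents per liter. The sketch below shows that conversion under stated assumptions: the equivalent weights are rounded standard values, the sample concentrations are hypothetical, and NO3− is simply listed as a fourth anion because the text does not state how nitrate was grouped on the anion triangle.

```python
# Equivalent weights (g per equivalent) = molar mass / |ionic charge|.
EQ_WEIGHT = {
    "Ca": 40.08 / 2, "Mg": 24.31 / 2, "Na": 22.99, "K": 39.10,
    "HCO3": 61.02, "SO4": 96.06 / 2, "Cl": 35.45, "NO3": 62.00,
}
CATIONS = ("Ca", "Mg", "Na", "K")
ANIONS = ("HCO3", "SO4", "Cl", "NO3")


def to_meq(sample_mg_l):
    """Convert concentrations in mg/L to meq/L for every major ion."""
    return {ion: sample_mg_l[ion] / EQ_WEIGHT[ion] for ion in EQ_WEIGHT}


def percent_shares(meq, ions):
    """Percentage of the summed charge carried by each ion in `ions`."""
    total = sum(meq[ion] for ion in ions)
    return {ion: 100.0 * meq[ion] / total for ion in ions}


# Hypothetical sample (mg/L), chosen so that the meq/L values are round numbers.
sample = {"Ca": 40.08, "Mg": 12.155, "Na": 22.99, "K": 0.0,
          "HCO3": 61.02, "SO4": 48.03, "Cl": 35.45, "NO3": 62.00}
meq = to_meq(sample)
cation_pct = percent_shares(meq, CATIONS)
anion_pct = percent_shares(meq, ANIONS)
dominant_cation = max(cation_pct, key=cation_pct.get)
```

The dominant cation and anion shares indicate the water type, for example Ca with HCO3− for the bicarbonate type discussed in the text.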
Based on the Stiff and Piper trilinear diagrams, the polluted water type is represented as Ca-(SO4 + NO3) (calcium sulfate nitrate type), while the non-polluted water type is classified as Ca-HCO3 (calcium bicarbonate type). Similar results were reported by Shin et al. (2013). According to the study, water samples collected from the upper reaches of Korean rivers were of Ca-HCO3 type, whereas water samples collected from lower reaches and with relative high nitrate concentrations were classified as Na-Cl-NO3 type. This indicates that water samples are affected by anthropogenic factors such as fertilizer, manure, and septic waste.
Figure 9 shows the spatial distribution of the five clusters in Shimabara. All sampling locations belonging to cluster-2 and -3, representing the polluted water type, are located in the northern part of Shimabara, where agricultural fields are concentrated. To investigate the interaction between groundwater and river water, one sample was taken from the river (R-2) and included in the SOM analysis. The results showed that R-2 is classified into cluster-3, as are O-1 and O-2. This indicates that the river and the groundwater at these wells are connected and exchange water with each other. Samples with high nitrate concentrations often correspond with agricultural land use (Babiker et al. 2004; Esmaeili et al. 2014). This confirms that agricultural activities are related to high nitrate concentrations in groundwater. Ishihara et al. (2002) reported that fecal coliforms were detected in the northern part of Shimabara. This means that the groundwater in this area is affected by livestock waste. Most sampling locations for cluster-1 are distributed in the mountainside forest area upstream of the heavily polluted areas. This shows that groundwater is recharged in this area and is typically of the pristine water type. The average NO3− concentration of cluster-1 is slightly lower than that of cluster-4 according to Table 3. Sampling points such as W-12 and W-13, which belong to cluster-1 but are located in the agricultural area, are thus affected by agricultural activities. This suggests that cluster-1 shows a transition of water chemistry from pristine to polluted water type. The sampling locations for cluster-4 and -5, characterized by high ion concentrations, are located in the urban area at a lower altitude (below 100 m). This suggests that dissolution of ions from the aquifer matrix during groundwater flow from the mountainside to the urbanized area may increase ion concentrations. Mayuyama avalanche debris deposits are distributed in the eastern area of Mt Mayuyama (Ozeki et al. 2005). This area corresponds to the sampling locations for cluster-5. The pattern of cluster-5 has high concentrations for all ions, as shown in Figure 7. This is due to the effect of volcanic deposits on the groundwater chemistry in the area.
SUMMARY AND CONCLUSION
In this study, water chemistry data from 36 sampling locations, obtained from April 2012 to May 2015, were classified using SOM in combination with hierarchical cluster analysis to clarify groundwater characteristics in Shimabara, Japan. The SOM provided readily understandable results for classifying the water chemistry data into distinguishable hydrogeochemical types. The Piper trilinear and Stiff diagrams for the reference vectors were plotted to display fundamental characteristics of each cluster. In addition, the spatial distribution of the respective clusters explained the spatial variability of the hydrogeochemical characteristics determined by the SOM. Based on the SOM results, the water chemistry data could be divided into five clusters that revealed two representative water types characterized by nitrate pollution (cluster-2 and -3) and non-polluted (cluster-1, -4, and -5) water. The spatial distribution of cluster-2 and -3 shows that agricultural activities are causing groundwater pollution in the northern part of Shimabara. The Stiff and Piper trilinear diagrams based on the reference vectors for each cluster showed that non-polluted water and polluted water are characterized by Ca-HCO3 type and Ca-(SO4 + NO3) type, respectively. This indicates that nitrate pollution is a product from agricultural activities and classified into cluster-2 and -3.
The SOM analysis showed that pristine groundwater recharged on the mountainside is classified into cluster-1. Some groundwater in cluster-1 is also located close to the mid-slope hills. This means that non-polluted water can be used from this agricultural area. For other purposes, water quality evaluation methods such as the Wilcox classification diagram (Wilcox 1955) can be used to evaluate whether water in cluster-2 or -3 can be used for, e.g., irrigation. The clusters from the SOM analysis are useful for evaluating further groundwater remediation alternatives.
The application and results of the SOM support our previous conclusion (Nakagawa et al. 2016) regarding the spatial distribution of nitrate pollution in the study area and its causes. Data that display a scattered distribution in the Piper trilinear diagram can be difficult to analyze by PCA. However, in this case, SOM can be an alternative method (Choi et al. 2014). In this study, both PCA and SOM successfully classified groundwater chemistry in the study area. However, SOM gives more robust and explainable results that can be used to characterize groundwater chemistry. More detailed characteristics along this line will be described in a new paper (Amano et al. in press).
This work was supported by JSPS KAKENHI Grant Number 24360194 and 15KT0120.