River water is an important source of drinking water in the Northwestern New Territories of Hong Kong, so monitoring its quality is essential for local residents. In this study, a mixed multivariate analysis method was used to lower monitoring costs by optimizing the layout of water quality monitoring stations. To this end, a five-year data set of over 36,000 observations was evaluated. Cluster analysis was used to categorize the monitoring stations into three groups. Factor analysis was then used to assess three latent factors that predominantly influence river water quality: anthropogenic pollution, seawater intrusion and geological processes, and the nitrification process. A spatial pattern of the three latent factor scores was plotted, and six redundant monitoring stations were identified from this pattern. Finally, discriminant analysis was used to extract seven significant parameters. The results showed that the surface water monitoring program for the watercourses in the Northwestern New Territories (Hong Kong) could be adjusted by reducing the number of monitoring stations to 18 and the number of measured chemical parameters to seven, which would still ensure the detection of water quality changes while reducing cost.

  • Studying the spatial variation of water resources by a mixed multivariate statistical method.

  • Performing cluster analysis for three latent factors that affect river water quality.

  • Establishing an evaluation system to identify redundant monitoring stations.

  • Extracting seven significant parameters for measuring contamination conditions.

  • Proposing a more reasonable monitoring station layout that ensures the detection of water quality changes and reduces cost.

Graphical Abstract

River water is one of the vital water resources. However, unlike groundwater and lakes, it is particularly vulnerable to pollution: anthropogenic activities and natural processes can readily degrade the quality of surface water and impair its usability. Regular water quality monitoring is therefore essential to control water pollution, monitor water quality in river basins (Simeonov et al. 2003), and interpret the temporal and spatial variations in water quality (Dixon & Chiswell 1996; Singh et al. 2004). In Hong Kong, for instance, the Northwestern New Territories is one of the most polluted regions. The Hong Kong Environmental Protection Department (Hong Kong EPD) has established 24 water quality monitoring stations in the Northwestern New Territories and performs routine water quality monitoring to protect the water resource.

Multivariate statistical techniques such as cluster analysis (CA), factor analysis (FA), and discriminant analysis (DA) are commonly used to categorize raw data because they can analyze multiple related factors. For instance, they are applied to characterize the quality of river water, which allows a better understanding of the temporal and spatial variations in water quality, and to identify discriminant parameters that are useful in optimizing monitoring networks (Simeonov et al. 2003; Shrestha & Kazama 2007).

Prior research has shown that multivariate statistical methods can extract meaningful information from a data set. FA and CA were applied to assess the impacts of human activities on spatial variations in the water quality of 19 rivers (Wang et al. 2007). The usefulness of CA, FA, and DA for interpreting complex data sets, assessing water quality, identifying pollution factors, and understanding temporal and spatial variations has been demonstrated for effective river water quality management (Kowalkowski et al. 2006; Shrestha & Kazama 2007; Venkastesharaju et al. 2010; Juahir et al. 2011; Samsudin et al. 2011; Wang et al. 2012). CA and DA were used to assess temporal and spatial variations in water quality, and DA extracted the significant parameters responsible for most variations in river water quality (Zhou et al. 2007). CA, FA, and DA were used to reduce the number of monitoring stations and measured chemical parameters and thereby lower monitoring cost (Wang et al. 2014). Moreover, multivariate statistical techniques have been applied to assess groundwater and lake water (Yang et al. 2010; Lu et al. 2012). The CA, FA, and DA methods used in the current study have thus been shown to be highly useful tools for extracting key information from complex water quality data sets.

In the present study, large data sets obtained during a five-year (2009–2013) monitoring program were subjected to CA, FA, and DA to extract latent information about the similarities or dissimilarities among the monitoring sites. In Agra, Uttar Pradesh, India, fourteen water quality parameters collected quarterly for one year were analyzed using a water quality index (WQI) and multivariate statistical techniques (Isaac & Siddiqui 2022). Although such work has been reported, sampling four times a year is not sufficient to capture the actual situation well. Considering the influences of temporal differences, we focus on determining redundant monitoring stations and identifying the water quality variables responsible for spatial variations in water quality. The WQI has also been applied to categorize water quality, which helps residents and policy makers in the area concerned to infer the quality of their water (Ghoderao et al. 2022). We also focus on testing the validity of the results in spatial DA. Previous research has shown that spatial landscape patterns produce differences in the concentrations of TCB and TP, and that these concentrations can serve, to some degree, as the main indicators of water quality in the recipient rivers and streams (Chang et al. 2022).

Monitoring area and sampling

The Northwestern New Territories (see Figure 1) include 13 inland watercourses (rivers, creeks, nullahs and streams). The Indus, Beas, and Ganges Rivers, the three major rivers in the North District, join the Shenzhen River; the other four major watercourses, including the Shenzhen River, flow into the inner Deep Bay, located in the Yuen Long Basin. Meanwhile, the Yuen Long Creek, the Kam Tin River, the Tin Shui Wai Nullah, the Fairview Park Nullah, and six minor streams (the Ngau Hom Sha, Ha Pak Nai, Tai Shui Hang, Pak Nai, Sheung Pak Nai, and Tsang Kok streams) near Lau Fau Shan drain into the outer Deep Bay. According to the report (Hong Kong EPD 2009-2013), the Northwestern New Territories is one of the most polluted regions in Hong Kong. It covers the entire Deep Bay Water Control Zone and the North and Yuen Long Districts. Except in the downtown areas of Sheung Shui and Fan Ling, the major watercourses drain most of the untreated sewage from the rural areas and also receive domestic wastewater and agricultural runoff. Consequently, it is vital to assess spatial variations of water quality in this region.
Figure 1

Study area and location of the monitoring stations.

The Indus River is one of the largest rivers in the area, with a total length of about 49 km and a catchment area of 43 km². The Beas River, a major branch of the Indus, flows from Lam Tsuen Country Park and covers an area of 20 km². The Ganges River, which originates in Wo Ken Shan, has a smaller area of 10 km². Yuen Long Creek is around 60 km long and covers an area of 27 km². With an area of 44.3 km², the 50-km-long Kam Tin River passes through the urban areas of Kam Tin and Yuen Long. All four watercourses in the Yuen Long Basin flow into the inner Deep Bay via concrete channels.

The Hong Kong EPD has collected water quality data from 24 monitoring sites, covering a wide range of the 13 inland watercourses.

Monitored parameters and data pretreatment

The data for 24 water quality monitoring sites, consisting of 48 water quality parameters monitored monthly over five years (2009–2013), were obtained from the Hong Kong EPD (2009-2013). Of these 48 parameters, 25 were selected for the present analysis based on their sampling continuity at all the selected monitoring sites. The selected parameters were electrical conductivity (EC), pH, dissolved oxygen (DO), temperature (TEMP), chemical oxygen demand (COD), five-day biochemical oxygen demand (BOD5), ammonia–nitrogen, total Kjeldahl nitrogen (TKN), nitrate nitrogen, total phosphorus (TP), Escherichia coli (E. coli), fecal coliforms (F. coli), total solids (TS), total suspended solids (TSS), sulfide, fluoride (F), arsenic (As), aluminum (Al), iron (Fe), copper (Cu), chromium (Cr), manganese (Mn), lead (Pb), nickel (Ni), and zinc (Zn). All the water quality parameters are expressed in mg/L, except pH, EC (μS·cm−1), TEMP (°C), E. coli (cfu/100 ml) and F. coli (cfu/100 ml). The basic statistics of the five-year data set (36,000 observations) on river water quality are summarized in Table 1.

Table 1

Statistical descriptives of water quality parameters

| Parameters | Mean | SD | SE | Minimum | Maximum |
|---|---|---|---|---|---|
| TEMP | 24.77 | 4.82 | 0.13 | 12.20 | 37.10 |
| pH | 7.54 | 0.47 | 0.01 | 6.30 | 10.10 |
| EC | 782.35 | 2,630.91 | 69.33 | 22.00 | 29,410.00 |
| TS | 580.65 | 1,932.69 | 50.93 | 30.00 | 24,000.00 |
| TSS | 30.84 | 80.87 | 2.13 | 0.25 | 1,100.00 |
| DO | 7.22 | 2.34 | 0.06 | 1.50 | 18.60 |
| COD | 19.32 | 30.27 | 0.80 | 1.00 | 420.00 |
| BOD5 | 13.00 | 27.80 | 0.73 | 0.05 | 240.00 |
| TKN | 4.56 | 7.55 | 0.20 | 0.03 | 100.00 |
| | 0.86 | 1.56 | 0.04 | 0.00 | 40.00 |
| | 3.43 | 6.36 | 0.17 | 0.00 | 84.00 |
| TP | 0.75 | 1.14 | 0.03 | 0.01 | 12.00 |
| As | 2.94 | 3.76 | 0.10 | 0.50 | 39.00 |
| | 0.03 | 0.13 | 0.00 | 0.01 | 4.02 |
| | 0.29 | 0.17 | 0.00 | 0.10 | 1.20 |
| Zn | 43.83 | 77.68 | 2.05 | 5.00 | 1,500.00 |
| Ni | 2.45 | 3.77 | 0.10 | 0.50 | 63.00 |
| Mn | 163.14 | 255.52 | 6.73 | 5.00 | 3,100.00 |
| Pb | 4.20 | 10.91 | 0.29 | 0.00 | 190.00 |
| Fe | 694.51 | 907.54 | 23.92 | 25.00 | 15,000.00 |
| Cr | 1.01 | 2.61 | 0.07 | 0.50 | 67.00 |
| Cu | 5.20 | 10.19 | 0.27 | 0.50 | 190.00 |
| Al | 222.80 | 299.91 | 7.90 | 25.00 | 3,900.00 |
| Escherichia coli | 180,196.93 | 538,185.64 | 14,182.44 | 0.50 | 6,800,000.00 |
| Fecal coliforms | 441,387.30 | 1,549,019.78 | 40,820.26 | 0.50 | 40,000,000.00 |

Most multivariate statistical methods require the variables to conform to the normal distribution, so the normality of each variable must be checked by analyzing kurtosis and skewness before multivariate statistical analysis (Johnson & Wichern 1992; Lattin et al. 2003; Papatheodorou et al. 2006). The original data had kurtosis values ranging from −0.727 to 633.359 and skewness values ranging from −0.332 to 22.449, indicating with 95% confidence that the distributions were far from normal. Since most of the kurtosis or skewness values were greater than zero, the original data were log-transformed (Kowalkowski et al. 2006; Papatheodorou et al. 2006). After log-transformation, the kurtosis and skewness values ranged from −1.261 to 2.978 and from −1.757 to 1.18, respectively. However, the distributions of the log-transformed TS, and Cr were also non-normal. Therefore, they were excluded from the following analysis. For CA, all log-transformed variables were also z-scale standardized (the mean and variance were set to zero and one, respectively) to minimize the effects of different units and variances and to render the data dimensionless (Liu et al. 2003; Singh et al. 2004).
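The pretreatment steps described above can be illustrated with a minimal sketch. The base-10 logarithm, the moment-based skewness and kurtosis formulas, and the sample values are assumptions made for illustration only; the study does not state which log base or estimator was used, and the numbers are not taken from the data set.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness (third standardized moment)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution gives 0."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0

def log_transform(values):
    """Base-10 log transform (assumed base); values must be positive."""
    return [math.log10(v) for v in values]

def z_scale(values):
    """Standardize to zero mean and unit (sample) variance."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical right-skewed concentrations, for illustration only.
raw = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 5.0, 12.0, 40.0]
logged = log_transform(raw)
scaled = z_scale(logged)

print("skewness raw / logged:", skewness(raw), skewness(logged))
print("kurtosis raw / logged:", excess_kurtosis(raw), excess_kurtosis(logged))
```

In this sketch the log transform reduces the positive skew of the hypothetical sample, and the z-scaled values are dimensionless, which is the property CA relies on.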

CA is an unsupervised pattern recognition method that divides a large group of cases into smaller groups or clusters of relatively similar cases that are dissimilar to those in other groups. Hierarchical CA, the most common approach, starts with each case in a separate cluster and joins the clusters together step by step until only one cluster remains (Lattin et al. 2003; McKenna 2003). The Euclidean distance usually provides an index of the similarity between two samples, and a distance can be represented by the difference between the transformed values of the samples (Otto 1998). In this study, hierarchical CA was performed on the standardized data using Ward's method with squared Euclidean distances as a measure of similarity. The method uses analysis of variance (ANOVA) to calculate the distances between clusters, minimizing the sum of squares of any two possible clusters at each step. Both temporal and spatial variations in water quality were determined from hierarchical CA using the linkage distance (Wunderlin et al. 2001; Simeonov et al. 2003; Singh et al. 2004; Astel et al. 2006; Kowalkowski et al. 2006; Shrestha & Kazama 2007).
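The agglomerative procedure described above can be sketched as follows. This is a naive, dependency-free illustration of Ward's criterion (merge the two clusters whose union gives the smallest increase in within-cluster sum of squares), not the software used in the study; the site names S1 to S6 and their standardized values are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def ward_clustering(points, labels, n_clusters):
    """Merge clusters step by step until n_clusters remain."""
    clusters = [([p], [lab]) for p, lab in zip(points, labels)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(clusters[i][0], clusters[j][0])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(labs) for _, labs in clusters)

# Hypothetical z-scaled values of two parameters at six sites.
sites = ["S1", "S2", "S3", "S4", "S5", "S6"]
values = [(-1.2, -1.0), (-1.0, -1.1), (0.0, 0.1),
          (0.1, -0.1), (1.1, 1.0), (1.0, 1.2)]

groups = ward_clustering(values, sites, n_clusters=3)
print(groups)
```

Stopping at a fixed number of clusters is a simplification; a dendrogram is normally cut at a chosen linkage distance instead.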

Another multivariate technique, FA, yields the general relationship between measured water quality parameters by elucidating the multivariate patterns that might help to simplify and classify the original data. It can be used to determine the spatial and temporal distribution of the resultant factors and to interpret them. This may yield insight into the main processes that govern the distribution of water quality parameters. Firstly, the raw data were standardized and made dimensionless. Secondly, the correlation coefficient matrix, eigenvalues, and eigenvectors were determined to yield the covariance matrix. Finally, the data were transformed into factors, and only factors with eigenvalues exceeding 1 were retained in this study (Reyment & Joreskog 1993). The contribution of each factor (factor score) at each monitoring station was computed and depicted spatially.
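The retention rule described above (keep only factors whose eigenvalue exceeds 1) can be illustrated with a small sketch. The Jacobi rotation routine is chosen here only to keep the example free of external libraries and is not claimed to be the procedure used in the study; the three data columns are hypothetical, and factor rotation and factor scores are omitted.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long data columns."""
    n = len(columns[0])
    z_cols = []
    for col in columns:
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        z_cols.append([(v - mean) / sd for v in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z_cols]
            for zi in z_cols]

def jacobi_eigenvalues(matrix, max_sweeps=100, tol=1e-18):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    p = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (i, j) entry.
                theta = 0.5 * math.atan2(2 * a[i][j], a[i][i] - a[j][j])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(p):  # rotate rows i and j
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k] = c * aik + s * ajk
                    a[j][k] = -s * aik + c * ajk
                for k in range(p):  # rotate columns i and j
                    aki, akj = a[k][i], a[k][j]
                    a[k][i] = c * aki + s * akj
                    a[k][j] = -s * aki + c * akj
    return sorted((a[i][i] for i in range(p)), reverse=True)

# Hypothetical data: x2 is a linear function of x1, x3 is partly related.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

eigenvalues = jacobi_eigenvalues(correlation_matrix([x1, x2, x3]))
retained = [ev for ev in eigenvalues if ev > 1.0]
print("eigenvalues:", eigenvalues)
print("factors retained:", len(retained))
```

Because the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above 1 marks a factor that explains more variance than a single standardized variable.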

DA is a method of analyzing dependence and a special case of canonical correlation. One of its objectives is to determine the significance of different variables, which can allow the separation of two or more naturally occurring groups. DA operates on original data, and constructs a discriminant function for each group (Johnson & Wichern 1992; Wunderlin et al. 2001; Lattin et al. 2003) as follows:
$$f(G_i) = k_i + \sum_{j=1}^{n} w_{ij}\, p_{ij} \qquad (1)$$
where $i$ is the number of groups ($G$), $k_i$ is the constant inherent to each group, $n$ is the number of parameters used to classify a set of data into a given group, and $w_{ij}$ is the weight coefficient assigned by DA to a given parameter ($p_{ij}$).

In this study, DA was performed on original data using the standard and backward stepwise modes to evaluate the spatial variations in water quality. The best discriminant functions for each mode were constructed considering the quality of the classification matrix and the number of parameters.
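How a set of group-wise discriminant functions assigns a sample can be sketched as follows, assuming the usual linear form in which each group's score is its constant plus the weighted sum of the parameter values and the sample is assigned to the group with the highest score. The group names, parameter names (`x1`, `x2`), and coefficients are hypothetical and are not taken from the study.

```python
def classification_scores(constants, weights, sample):
    """Score of each group: constant + sum of weight * parameter value."""
    return {
        group: constants[group]
        + sum(weights[group][name] * value for name, value in sample.items())
        for group in constants
    }

def assign_group(constants, weights, sample):
    """Assign the sample to the group with the largest score."""
    scores = classification_scores(constants, weights, sample)
    return max(scores, key=scores.get)

# Hypothetical constants and weight coefficients for two groups.
constants = {"A": -1.0, "B": -4.0}
weights = {
    "A": {"x1": 0.5, "x2": 0.2},
    "B": {"x1": 2.0, "x2": 0.1},
}

clean_sample = {"x1": 1.0, "x2": 1.0}
polluted_sample = {"x1": 3.0, "x2": 1.0}

print(assign_group(constants, weights, clean_sample))     # A
print(assign_group(constants, weights, polluted_sample))  # B
```

In backward stepwise mode the same scoring applies, only with fewer parameters left in each function.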

Temporal similarity and period grouping

An initial exploratory approach involved the use of hierarchical CA on standardized log-transformed data sets sorted by season. CA generated a dendrogram (Figure 2) that grouped the 12 months into two clusters, and the difference between the clusters was significant. Cluster 1 (the first period) included December, January, February, March and April, approximately corresponding to the dry season in Hong Kong (October to March; Yeung 1999). Cluster 2 (the second period) included the remaining months (May, June, July, August, September, October, and November), closely corresponding to the wet season (April to September). However, if the 12 months had been divided empirically into spring (March to May), summer (June to September), fall (October to December), and winter (January to February), or into dry/wet seasons, the grouping would have been wrong. Studies in which data are collected only four times a year can therefore seldom reflect the actual situation.
Figure 2

Dendrogram showing clustering of monitoring periods.

It can be concluded that monthly data reflect local water quality better than quarterly data. Indeed, Figure 2 demonstrates that the temporal patterns of water quality were not fully consistent with either the four seasons or the dry/wet seasons. Water quality analysis therefore requires that the relevant data be collected at least once a month to ensure accuracy.

Spatial similarity and site grouping

Drawing on the experience obtained from temporal-CA, spatial-CA was also used to identify similar monitoring sites. Spatial similarity analysis was carried out both for each temporal cluster and for the integrated clusters (the first and second periods), but the results were nearly identical. Therefore, only the latter result is discussed. Spatial-CA produced a dendrogram, shown in Figure 3, with three groups. Group A consisted of DB1, DB2, DB3, DB4, and DB6. Group B consisted of DB5, GR1, GR2, GR3, IN2, IN3, RB1, RB2, RB3, and TSR2, and group C consisted of FVR1, IN1, KT1, KT2, TSR1, YL1, YL2, YL3, and YL4. The group classifications varied with significance level, because the sites in these groups had similar features and natural backgrounds that were affected by similar sources. Owing to the impacts of hydrological and ecological processes on water quality, discharges from upstream catchments have caused significant pollution in the receiving waters. Different types of land use and landscape patterns determine water quality and the catchment hydro-geological and environmental variables (Chang et al. 2022). Thus, the sites were divided into three groups according to their degree of pollution. In group A, five sites were located in the Ha Pak Nai, Tai Shui Hang, Pak Nai, and Sheung Pak Nai streams, which are free from major point and non-point pollution sources. Moreover, according to the Hong Kong Annual River Water Quality Report, the water quality of these streams remained pristine over the five years of this study (2009–2013). Group B corresponded to moderately polluted sites; most sites in this group were upstream, where the major pollution sources were discharges from unsewered villages and livestock farms. Group C corresponded to highly polluted sites that received pollution from point sources and from the Yuen Long and Tin Shui Wai town centers. Tailoring control measures to the circumstances of each group can achieve the greatest effect with the least effort.
Hierarchical CA provided a useful classification of the surface watercourses in the study area that can be used to design an optimal future spatial monitoring network with lower cost (Simeonov et al. 2003; Singh et al. 2004).
Figure 3

Dendrogram showing clustering of monitoring sites.

Dominating water quality factor and patterns

Table 2 lists the FA results, including the rotated loadings, eigenvalues, and variance percentage of each major factor. An eigenvalue greater than unity, which indicates a significant contribution to the variance, was used as the criterion for retaining a factor. Three major factors accounted for 84.64% of the variance in the original data set. The highlighted variables denote absolute loadings greater than 0.5, which form the constituent parameters of the corresponding factor. Factor 1 explains 65.50% of the total variance, with strong positive loadings of EC, TS, DO, COD, BOD5, TKN and ; moderate positive loadings of E. coli, TEMP, ammonia, and nitrite. TOC, BOD, COD, TP, ammonia, E. coli, TEMP, and DO represent the pollution load caused by untreated sewage. In addition, Zn, COD, and Cu indicate the load from factory effluents. Factor 1 is therefore denoted as the anthropogenic contamination factor. Factor 2 explains 11.6% of the total variance and is mainly associated with high positive loadings of nitrate and moderate positive loadings of nitrite. Nitrate and nitrite are associated with the nitrification reaction, so Factor 2 is designated the nitrification factor. Factor 3 explains 9.3% of the total variance, with strong positive loadings of EC and moderate positive loadings of As and SS. It can be labeled as the seawater intrusion and geological factor (Liu et al. 2003). Meanwhile, Factor 4, which explains 6.6% of the total variance with strong negative loadings of pH and moderate positive loadings of SS, is denoted as weathering processes. The result can be seen in Figure 4.
Table 2

Rotated component loadings of the three principal components including eigenvalues greater than one, their percentage of variance, and cumulative percentage of variance in the FA

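| Parameters | Factor 1 | Factor 2 | Factor 3 |
|---|---|---|---|
| TEMP | 0.52 | 0.18 | 0.74 |
| pH | 0.02 | −0.23 | 0.69 |
| EC | 0.51 | 0.54 | 0.31 |
| TS | 0.84 | 0.43 | −0.19 |
| TSS | −0.84 | −0.28 | 0.38 |
| DO | 0.83 | 0.52 | 0.11 |
| COD | 0.84 | 0.49 | 0.02 |
| BOD5 | 0.77 | 0.58 | 0.15 |
| TKN | −0.60 | 0.21 | 0.56 |
| | 0.76 | 0.58 | 0.14 |
| | 0.68 | 0.62 | 0.20 |
| TP | 0.03 | 0.87 | 0.07 |
| As | 0.82 | 0.00 | 0.36 |
| | 0.84 | 0.50 | −0.03 |
| | 0.69 | 0.57 | 0.09 |
| Zn | 0.25 | 0.77 | −0.33 |
| Ni | 0.88 | 0.17 | −0.02 |
| Mn | 0.29 | 0.84 | −0.23 |
| Pb | 0.81 | 0.50 | 0.04 |
| Fe | 0.89 | −0.10 | 0.22 |
| Cr | 0.87 | 0.40 | 0.05 |
| Cu | 0.86 | 0.44 | 0.00 |
| Eigenvalue | 14.41 | 2.36 | 1.85 |
| Percent of variance | 65.50 | 10.72 | 8.42 |
| Cumulative variance (%) | 65.50 | 76.22 | 84.64 |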
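The relation between the eigenvalues and the percentages of variance in Table 2 can be checked with a short calculation. The sketch assumes that each percentage equals the eigenvalue divided by the number of standardized variables (22, the number of parameter rows in Table 2), which holds when the factors are extracted from a correlation matrix.

```python
# Eigenvalues as reported in Table 2.
eigenvalues = [14.41, 2.36, 1.85]
# Assumed number of standardized variables: the 22 parameter rows of Table 2.
n_variables = 22

percent = [ev / n_variables * 100 for ev in eigenvalues]

cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(running)

for k, (p, c) in enumerate(zip(percent, cumulative), start=1):
    print(f"Factor {k}: {p:.2f}% of variance, cumulative {c:.2f}%")
```

The first factor and the cumulative total reproduce the tabulated 65.50% and 84.64%; the second and third factors differ from the table in the last digit, presumably because the published eigenvalues are rounded to two decimals.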
Figure 4

Cluster analysis of lithograph in Table 2.

Figure 5 displays the score pattern of the three major factors at the 24 monitoring stations in the watercourses of the Northwestern New Territories (Hong Kong). The score pattern shows the spatial variance of the major factors that predominantly influence surface water quality. Many neighboring monitoring stations have similar score patterns for these three major factors. Six sets of stations had a similar factor score pattern (i.e., Stations DB1 and DB6, DB3 and DB4, IN2 and IN3, RB1 and RB2, KT1 and KT2, YL3 and YL4). This finding indicates that some of the monitoring stations with similar score patterns are redundant. The results suggest that the Hong Kong EPD could retain only one station in each set as an indicator station to reduce monitoring costs.
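One way to make the comparison of score patterns explicit is to flag pairs of stations whose factor scores lie close together. The following is only a sketch of that idea: the Euclidean distance measure, the threshold, the station names ST1 to ST4, and the scores are all hypothetical, and the study itself identified the similar sets from the plotted pattern in Figure 5.

```python
import math
from itertools import combinations

def similar_station_pairs(scores, threshold):
    """Return station pairs whose factor-score vectors differ by at most
    `threshold` in Euclidean distance."""
    pairs = []
    for (name_a, score_a), (name_b, score_b) in combinations(
            sorted(scores.items()), 2):
        if math.dist(score_a, score_b) <= threshold:
            pairs.append((name_a, name_b))
    return pairs

# Hypothetical scores on three factors for four stations.
factor_scores = {
    "ST1": (1.2, -0.3, 0.1),
    "ST2": (1.1, -0.2, 0.0),
    "ST3": (-0.8, 0.9, 0.4),
    "ST4": (0.1, 1.5, -1.0),
}

candidates = similar_station_pairs(factor_scores, threshold=0.3)
print(candidates)
```

Each flagged pair is a candidate set from which a single indicator station could be kept; the choice of threshold would need to be justified against the monitoring objectives.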
Figure 5

Major factor score pattern of 24 monitoring stations.

Spatial variations in water quality

Spatial-DA was performed on the original data set of 22 parameters after the sites had been classified into the three major groups (A, B, and C) obtained through CA. The sites were the dependent variables and the measured parameters constituted the independent variables. The discriminant functions (DFs) and classification matrices (CMs) obtained from the standard and backward stepwise modes of DA are shown in Tables 3–5. The standard DA mode constructed DFs using 22 parameters. However, the backward stepwise DA showed that EC, , TP, As, Ni, Fe and F. coli were the discriminant parameters in spatial variation, with 90.4% correct assignations for the three groups of sites (Table 5). Thus, the spatial-DA results suggested that only seven parameters, i.e., EC, , TP, As, Ni, Fe, and F. coli, were needed to account for most of the expected spatial variations in water quality.

Table 3

Wilks’ lambda and chi-square test of DA of spatial variation of water quality

| Mode | Test of function(s) | R | Wilks’ lambda | Chi-square | p level |
|---|---|---|---|---|---|
| Standard | | 0.900 | 0.093 | 3,391.609 | 0.000 |
| | | 0.733 | 0.560 | 828.344 | 0.000 |
| Backward | | 0.876 | 0.108 | 3,196.171 | 0.000 |
| | | 0.707 | 0.607 | 716.639 | 0.000 |
Table 4

Classification functions coefficients for DA of spatial variation

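| Parameters | Standard mode: A | Standard mode: B | Standard mode: C | Backward stepwise mode: A | Backward stepwise mode: B | Backward stepwise mode: C |
|---|---|---|---|---|---|---|
| TEMP | 119.056 | 117.096 | 122.896 | | | |
| pH | 1,890.406 | 1,926.923 | 1,921.663 | | | |
| EC | 36.417 | 40.743 | 40.369 | 24.536 | 28.305 | 28.329 |
| TSS | −1.068 | −2.879 | −3.841 | | | |
| DO | −40.907 | −39.121 | −39.505 | | | |
| COD | 2.640 | 4.041 | 5.784 | | | |
| BOD5 | −21.665 | −21.704 | −21.387 | | | |
| TKN | −20.251 | −21.483 | −21.823 | | | |
| | −1.690 | 0.037 | 0.571 | 2.636 | 4.505 | 4.836 |
| | 20.226 | 19.990 | 22.191 | | | |
| TP | 0.851 | 6.783 | 7.771 | −14.531 | −9.120 | −5.824 |
| As | −38.406 | −36.820 | −38.773 | −7.135 | −4.620 | −7.034 |
| | −89.199 | −90.628 | −89.879 | | | |
| Zn | 16.902 | 16.177 | 17.093 | | | |
| Ni | 25.312 | 21.205 | 27.540 | −20.065 | −25.898 | −18.528 |
| Mn | 24.957 | 24.775 | 24.260 | | | |
| Pb | −32.943 | −33.213 | −32.147 | | | |
| Fe | 43.198 | 52.260 | 51.758 | 34.513 | 41.679 | 40.917 |
| Cu | −4.672 | −3.119 | −2.583 | | | |
| Al | 14.433 | 13.065 | 12.646 | | | |
| E. coli | 4.340 | 4.082 | 4.096 | | | |
| F. coli | 5.331 | 7.910 | 8.626 | 7.467 | 9.272 | 10.632 |
| Constant | −1,061.939 | −1,122.534 | −1,127.033 | −88.898 | −14.057 | −117.496 |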
Table 5

Classification matrix for DA of spatial variation

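| Monitoring sites | Percent correct | Group assigned by DA (a): A | B | C |
|---|---|---|---|---|
| Standard mode | | | | |
| A | 93.00 | 279 | 19 | |
| B | 86.2 | 18 | 362 | 40 |
| C | 93.3 | | 48 | 672 |
| Total | 91.2 | 297 | 429 | 714 |
| Backward stepwise mode | | | | |
| A | 92.3 | 277 | 20 | |
| B | 85.7 | 19 | 360 | 41 |
| C | 91.7 | | 60 | 660 |
| Total | 90.4 | 296 | 440 | 704 |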

aChecked by cross-validation method.
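The "percent correct" column of Table 5 is the share of a group's samples that DA assigns back to that group. A minimal check is sketched below using the one row of each mode whose three counts all survive in the table (the middle row, assumed here to be group B, so that the correctly assigned count is the second entry).

```python
def percent_correct(row, correct_index):
    """Share (%) of a group's samples assigned to the correct group."""
    return row[correct_index] / sum(row) * 100

# Middle row of each mode in Table 5: counts assigned to A, B and C.
standard_row = [18, 362, 40]
backward_row = [19, 360, 41]

print(round(percent_correct(standard_row, 1), 1))
print(round(percent_correct(backward_row, 1), 1))
```

Both values agree with the tabulated 86.2% and 85.7%, which supports reading the rows as the true groups and the columns as the groups assigned by DA.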

Based on the above results, backward stepwise DA proved to be a valuable tool for recognizing the discriminant parameters in spatial variations of surface water quality; in addition, the monitoring accuracy of EC, , TP, As, Ni, Fe, and F. coli should be strengthened so that future variations can be identified clearly. Furthermore, compared with the other two groups, pollution in group C was relatively serious and should be controlled.

In this case study, different multivariate statistical methods were used to assess spatial variations in the water quality of watercourses in the Northwestern New Territories, Hong Kong, taking the influence of temporal differences into consideration. Hierarchical CA grouped the 12 months into two periods (the first and second periods) and classified the 24 sampling sites into three groups (A, B, and C) based on the similarity of their water quality characteristics. The temporal and spatial similarities and groupings could facilitate the design of an optimal future monitoring strategy that decreases the monitoring frequency, the number of sampling stations, and the corresponding costs for the Northwestern New Territories. Moreover, DA provided good spatial results with great discriminatory ability, according to significance tests. DA rendered an important reduction in the amount of data required for the three groups of monitoring sites, because it used only seven parameters (EC, , TP, As, Ni, Fe, and F. coli) for the spatial analysis and still produced 90.4% correct assignations. Water quality is influenced by color, hardness, and EC. The pH values in the original data suggest that the color of these waters is similar, because factories now neutralize acidic and alkaline sewage before discharge, so it is reasonable to exclude color. The remaining factors, hardness and EC, are those that we screened out. Therefore, DA allowed a reduction in the dimensionality of the large data set and indicated a few significant parameters responsible for large variations in water quality, which could reduce the number of sampling parameters.
This study illustrates that multivariate statistical methods are an excellent exploratory tool for interpreting complex water quality data sets and for understanding spatial variations while accounting for temporal differences, which makes them useful and effective for water quality management. In addition, the water quality analysis of all the monitoring stations identified the monitoring points that can be optimized. Reducing the testing cost while maintaining the effectiveness of water quality testing is of economic value.

This work was supported by the National Natural Science Foundation of China (No. 12101604).

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Astel A., Biziuk M., Przyjazny A. & Namiesnik J. 2006 Chemometrics in monitoring spatial and temporal variations in drinking water quality. Water Research 8, 1706–1716.
Chang T., Nie L. M., Killingtveit A., Nost T. & Lu J. M. 2022 Assessment of the impacts of landscape patterns on water quality in Trondheim rivers and Fjord, Norway. Water Supply 22 (5), 5558–5574.
Dixon W. & Chiswell B. 1996 Review of aquatic monitoring program design. Water Research 30, 1935–1948.
Ghoderao S. B., Meshram S. G. & Meshram C. 2022 Development and evaluation of a water quality index for groundwater quality assessment in parts of Jabalpur District, Madhya Pradesh, India. Water Supply 22 (6), 6002–6012.
Hong Kong Environmental Protection Department 2009-2013, River Water Quality in Hong Kong, www.epd.gov.hk/epd/sc_chi/environmentinhk/water/hkwqrc/waterquality/river-2.html
Johnson R. A. & Wichern D. W. 1992 Applied Multivariate Statistical Analysis, 5th edn. Prentice-Hall, New Jersey.
Juahir H., Zain S. Z., Yusoff M. K., Hanida T. I. T., Armi A. S. M. & Toriman M. E. 2011 Spatial water quality assessment of Langat River Basin (Malaysia) using environmetric techniques. Environmental Monitoring and Assessment 173, 625–641.
Kowalkowski T., Zbytniewski R., Szpejna J. & Buszewski B. 2006 Application of chemometrics in water classification. Water Research 40, 744–752.
Lattin J., Carroll D. & Green P. 2003 Analyzing Multivariate Data. Duxbury, New York.
Otto M. 1998 Multivariate methods. In: Analytical Chemistry (Kellner R., Mermet J. M., Otto M. & Widmer H. M., eds). Wiley-VCH, Weinheim.
Reyment R. A. & Joreskog K. H. 1993 Applied Factor Analysis in the Natural Sciences. Cambridge University Press, New York.
Samsudin M. S., Juahir H., Zain S. M. & Adhan N. H. 2011 Surface river water quality interpretation using environmetric techniques: case study at Perlis river basin, Malaysia. International Journal of Environmental Protection 1 (5), 1–8.
Simeonov V., Stratis J. A., Samara C., Zachariadis G., Voutsa D. & Anthemidis A. 2003 Assessment of the surface water quality in Northern Greece. Water Research 37, 4119–4124.
Venkastesharaju K., Somashejar R. K. & Prakash K. L. 2010 Study of seasonal and spatial variation in surface water quality of Cauvery river stretch in Karnataka. Journal of Ecology and Natural Environment 2 (1), 1–9.
Wang X. L., Lu Y. L., Han J. Y., He G. Z. & Wang T. Y. 2007 Identification of anthropogenic influence on water quality of rivers in Taihu watershed. Journal of Environmental Sciences 19, 475–481.
Wang X., Cai Q., Ye L. & Qu X. 2012 Evaluation of spatial and temporal variation in stream water by multivariate statistical techniques: a case study of the Xiangxi River basin, China. Quaternary International 1, 1–8.
Wunderlin D. A., Diaz M. D. P., Ame M. V., Pesce S. F., Hued A. C. & Bistoni M. D. 2001 Pattern recognition techniques for the evaluation of spatial and temporal variations in water quality. A case study: Suquia River Basin (Cordoba Argentina). Water Research 35, 2881–2894.
Yang Y. H., Zhou F., Guo H. C., Sheng H., Liu H. & Dao X. 2010 Analysis of spatial and temporal water pollution patterns in Lake Dianchi using multivariate statistical methods. Environmental Monitoring and Assessment 170, 407–416.
Yeung I. M. H. 1999 Multivariate analysis of the Hong Kong Victoria harbour water quality data. Environmental Monitoring and Assessment 593, 331–342.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).