Water quality evaluation and apportionment of pollution sources: a case study of the Baralia and Puthimari River (India)

Water quality monitoring programs are indispensable for developing water conservation strategies, but interpreting the large and random datasets generated by these programs has become a global challenge. Rapid urbanization, industrialization and population growth pose a threat of pollution to the surface water bodies of Assam, a state in northeastern India. This calls for strict water quality monitoring programs, which would help in understanding the status of water bodies. In this study, the water quality of the Baralia and Puthimari Rivers of Assam was assessed using cluster analysis (CA), information entropy, and principal component analysis (PCA) to derive useful information from observed data. Fifteen sampling sites were selected for collection of samples during the period May 2016 to June 2017. Collected samples were analysed for 20 physicochemical parameters. Hierarchical CA was used to classify the sampling sites into different clusters. CA grouped all the sites into 3 clusters based on observed variables. Water quality of the rivers was evaluated using the entropy weighted water quality index (EWQI). EWQI of the rivers varied from 61.62 to 314.68. PCA was applied to recognise various pollution sources. PCA identified six principal components that explained 87.9% of the total variance and represented surface runoff, untreated domestic wastewater and illegally dumped municipal solid waste (MSW) as major factors affecting the water quality. This study will help policymakers and managers in making better decisions in allocating funds and determining priorities. It will also assist in framing effective and efficient policies for the improvement of water quality.


INTRODUCTION
Rivers have always been the most significant freshwater source for life, social progress, and economic development, as ancient civilisations have prospered along with them (Varol et al. 2012). Anthropogenic activities and natural processes deteriorate surface water quality and adversely affect their importance (Singh et al. 2004; Dash et al. 2020; Prasad et al. 2020). Due to domestic and industrial wastewater discharges, agricultural runoff and uncontrolled dumping of municipal solid waste near the river banks, the surface water quality has been gravely affected (Singh et al. 2004, 2018, 2019; Shrestha & Kazama 2007; Dash et al. 2018; Zavaleta et al. 2021). In the past few decades, surface water quality has gained utmost importance, especially in developing countries like India, and has become a sensitive issue (Simeonov et al. 2003; Singh et al. 2019; Borah et al. 2020).
Assam, a north-eastern state of India situated between latitudes 24° and 28° North and longitudes 90° and 96° East, has long been home to cultural and ethnic diversity. A 19th-century account states: 'The number and magnitude of rivers in Assam probably exceed those of any other country in the world of equal extent'. However, in the past few decades, there has been the rehabilitation of communities along with rapid urbanisation and industrial growth, which has taken a toll on the surface water quality of the state (Singh et al. 2019). There has been an increasing demand for water quality monitoring and policies to diminish the additional stresses on rivers. Reliable information on water quality and the identification of pollution sources is essential for preventing and controlling surface water pollution (Bu et al. 2010). Water pollution is defined as the presence of natural organic matter (a complex mixture of various organic molecules mainly originating from aquatic organisms, soil and terrestrial vegetation) and toxic chemicals at levels that exceed what is naturally found in the water and may pose a threat to the environment (Avsar et al. 2014; Avsar et al. 2015).
Pollutants compromising the health of river systems depend on the economic and social characteristics of the beneficiary/user societies (Lekkas et al. 2004). Environmental protection agencies monitor water quality based on comprehensive sets of indicators. In order to guard the ecological status, the Water Framework Directive declared that not only the chemical concentrations of pollutants in rivers are to be used to assess water quality, but also their effects on trophic chains. However, chemical monitoring of parameters will continue to be an important data source. Monitoring of water resources is vital for obtaining reliable information about their quality and for preventing and controlling their pollution (Zavaleta et al. 2021). Major issues associated with water quality monitoring are handling the huge and complex datasets generated by many water quality parameters at different sampling locations and deriving useful information from them. Application of water quality indices (WQIs) and various multivariate statistical techniques (MSTs) such as cluster analysis (CA) and principal component analysis (PCA) offers a better understanding of data (Singh et al. 2017; Zavaleta et al. 2021).
In the present paper, the water quality of the Baralia and Puthimari Rivers has been assessed and the possible sources of pollution have been identified to understand the status of water quality. This will help in developing policies to reduce the additional stresses on these surface water resources. For the evaluation of the quality of river water, water quality is expressed in terms of the entropy-weighted water quality index (EWQI). The concept of modern WQI was introduced in 1960 (Sutadian et al. 2016). Since then, many indices have been proposed, but there is no globally accepted WQI. There is a lot of subjectivity and uncertainty involved in WQI development. Water quality parameters are random variables, and their probability distribution affects the index's probability distribution (Landwehr 1979). Assigning fixed weights to the indices based on the indices' inherent information would reduce subjective disturbances (Li et al. 2011). Such information may be explained by Shannon or information entropy.
EWQI tries to provide an improved method for offering a cumulatively derived, numerical expression describing a certain level of quality of water based on information entropy (Amiri et al. 2014; Fagbote et al. 2014; Gorgij et al. 2017; Karunanidhi et al. 2020). Information theory involves quantifying information and analyses the statistical structure of a series of numbers or symbols that builds a communication signal (Ozkul et al. 2000; Liu et al. 2012). Entropy refers to the randomness of a system. The concept of information entropy, also commonly known as Shannon entropy, was introduced by Claude Shannon in 1948. Shannon entropy is the expected value of a random variable formed by information generated by any event or a particular set of events. The entropy concept of information theory has been successfully used in various water resource and environmental engineering fields. In this study, the concept of entropy is used to determine the contribution of each water quality parameter to the calculation of the WQI.
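For reference, the expected-value definition of Shannon entropy mentioned above is commonly written as follows. The symbols X, p_k and n are notation chosen here for illustration and do not come from the original text; the base of the logarithm only sets the unit of entropy.

```latex
% Shannon entropy of a discrete random variable X with n possible outcomes,
% where p_k is the probability of the k-th outcome.
H(X) = -\sum_{k=1}^{n} p_k \log p_k ,
\qquad \sum_{k=1}^{n} p_k = 1 ,
\qquad 0 \le H(X) \le \log n
```

The upper bound log n is reached when all outcomes are equally likely, which corresponds to a parameter that carries no distinguishing information between samples.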
Hierarchical cluster analysis (HCA) was applied to identify similar sites based on their characteristics. PCA is a useful tool for data reduction and explains the variance of inter-correlated variables by transforming them into a smaller set of independent variables (Yang et al. 2010; Dash et al. 2018). Over the last three decades, researchers have widely used these methods in surface water quality assessment. In this study, HCA and PCA methods have been used to identify the parameters responsible for water quality variations. The novelty of the present work lies in the combined use of WQI and MSTs in water quality monitoring and management. WQI evaluates the quality of water and MSTs recognise the unobservable, latent pollution sources of water bodies.

Study area
Sample collection was done from Baralia River and Puthimari River during the period of May 2016 to June 2017. Baralia and Puthimari Rivers are northern bank tributaries of the Brahmaputra River (Figure 1). Puthimari River originates from the foothills of the Himalayan Ranges in Bhutan. After crossing the Indo-Bhutan border, it bifurcates into Baralia and Puthimari Rivers near Bornodi Wildlife Sanctuary, Arangajuli, Assam (26°43′24.82″N, 91°41′8.25″E), and possesses all the characteristics of a flashy river. The length of Baralia River is approximately 39.1 km and that of Puthimari River is 139 km. It meanders freely and has many loops, the slope being somewhat flatter in its lower reaches. Baralia River flows through the heart of Rangia, a town in the Kamrup Rural district of Assam, whereas Puthimari River flows outside the town. According to the provisional report of the 2011 Census of India, Rangia had a population of 26,389, of which males account for 54% and females for 46%. The region has a humid subtropical climate with heavy rainfall, hot summers and high humidity. The average temperature varies from 12 to 38 °C during the year. The principal food crops produced in the region are rice (paddy) and vegetables. Heavy floods also characterise the region due to high rainfall during the monsoon.

Site sampling, preservation and analysis
Sampling was done from the two rivers from May 2016 to June 2017. Before sampling, a preliminary survey of the catchment area was carried out to decide the sampling sites' location and identify the various point and nonpoint pollution sources. Prior information on the basic characteristics of the catchment area or basin is required before applying the mathematical or statistical tools on the measured parameters to validate and interpret the results judiciously (Alberto et al. 2001). Nine sites of the Baralia River and six sites of the Puthimari River were selected as sampling sites.
Water sampling was carried out in triplicate, from the well-mixed section of the rivers. Clean plastic bottles of 1 L capacity were used for collection of the water samples. Samples were collected in two forms: preserved samples (for the analysis of heavy metals) and non-preserved samples. For the preserved samples, HNO₃ (2 mL/L) was added to ensure pH < 2. Standard Methods for the Examination of Water and Wastewater (APHA 2012) were adopted to analyse the samples. Temperature, pH, EC, DO and turbidity were determined in situ. Quality control was maintained as recommended in the standard methods. Parameters such as pH, EC, and turbidity were analysed as early as possible in the laboratory since their properties change over time. Samples were protected from contamination and deterioration before their arrival in the laboratory. After collection, samples were immediately placed in a lightproof insulated box containing melting ice-packs to ensure rapid cooling. Reagents were prepared as recommended by standard methods (APHA 2012). Deionised water was used for carrying out the dilutions. Standard solutions were prepared by diluting the stock solutions. The water quality parameters analysed are shown in Table 1 along with the units, abbreviations and analytical methods used.

Entropy weighted water quality index (EWQI)
WQI is a single arithmetic number, based on a weighted average of selected parameters, that expresses overall water quality. Assignment of weight to each selected parameter is an important and challenging task. Generally, assignment of weight to water quality parameters is a matter of opinion and hence subjective (Abbasi & Abbasi 2012). In this study, an entropy-based weight is assigned to each parameter. Entropy and related information measures offer valuable descriptions of the long-term behaviour of random processes. The steps involved in the calculation of EWQI are as follows (Amiri et al. 2014; Fagbote et al. 2014; Gorgij et al. 2017):
• A matrix 'U' was developed with all 'm' water samples (i = 1, 2, …, m) and 'n' measured parameters (j = 1, 2, …, n).
• The initial matrix was converted to the standard grade matrix 'V' to remove the error caused by differences in the units and magnitudes of the parameters, where 'v_ij' is the normalised value of parameter j in water sample i.
• Information entropy was calculated for each parameter from the normalised values.
• The entropy weight of each parameter (j) was calculated from its information entropy.
• A quality rating scale was assigned for each parameter (j) based on C_j, the measured concentration of the parameter.
• EWQI was calculated as the entropy-weighted sum of the quality ratings.
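The equations for these steps did not survive in the text above. One formulation that is commonly used for entropy-weighted indices is sketched below as an assumption, not as the authors' exact equations. In particular, the shift by 1 in P_ij (which avoids taking the logarithm of zero), the symbol e_j for information entropy, w_j for entropy weight, q_j for quality rating and S_j for the permissible limit of parameter j are notation introduced here for illustration; only U, V, v_ij, m, n and C_j appear in the original.

```latex
\begin{align}
U &= \left[ u_{ij} \right]_{m \times n}, \quad i = 1,\dots,m,\; j = 1,\dots,n \\
v_{ij} &= \frac{u_{ij} - \min_i u_{ij}}{\max_i u_{ij} - \min_i u_{ij}} \\
P_{ij} &= \frac{1 + v_{ij}}{\sum_{i=1}^{m} \left( 1 + v_{ij} \right)} \\
e_j &= -\frac{1}{\ln m} \sum_{i=1}^{m} P_{ij} \ln P_{ij} \\
w_j &= \frac{1 - e_j}{\sum_{k=1}^{n} \left( 1 - e_k \right)} \\
q_j &= \frac{C_j}{S_j} \times 100 \\
\mathrm{EWQI} &= \sum_{j=1}^{n} w_j \, q_j
\end{align}
```

Under this formulation a parameter whose values are nearly identical across samples has e_j close to 1 and therefore receives a weight close to zero, while a parameter that varies strongly between samples receives a larger weight.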
Classification of EWQI into five ranks is shown in Table 2.
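The EWQI procedure can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `entropy_weights`, `ewqi` and `classify_ewqi`, the `standards` argument (permissible limits), the shift by 1 before computing proportions, and the sample numbers are all assumptions made for illustration. The rank boundaries at 100 and 150 and the labels 'good', 'average', 'poor' and 'extremely poor' follow the values quoted in this paper; the boundaries at 50 and 200 and the label 'excellent' are assumed, since Table 2 is not reproduced here.

```python
import math


def entropy_weights(samples):
    """Return one entropy weight per parameter.

    samples: list of m rows (water samples), each with n parameter values.
    Assumes m >= 2 and that at least one parameter varies between samples.
    """
    m, n = len(samples), len(samples[0])
    diversity = []  # 1 - e_j for each parameter j
    for j in range(n):
        column = [row[j] for row in samples]
        low, high = min(column), max(column)
        span = high - low
        # Min-max normalisation; a constant column is mapped to zeros.
        v = [(x - low) / span if span else 0.0 for x in column]
        # Assumed variant: shift by 1 so that no proportion is zero.
        total = sum(1.0 + x for x in v)
        p = [(1.0 + x) / total for x in v]
        e_j = -sum(pi * math.log(pi) for pi in p) / math.log(m)
        diversity.append(1.0 - e_j)
    denom = sum(diversity)
    if denom <= 0:
        raise ValueError("no parameter varies between samples")
    return [d / denom for d in diversity]


def ewqi(concentrations, standards, weights):
    """Entropy-weighted sum of quality ratings q_j = C_j / S_j * 100."""
    return sum(w * (c / s) * 100.0
               for c, s, w in zip(concentrations, standards, weights))


def classify_ewqi(index):
    """Map an EWQI value to a rank (cut-offs at 50 and 200 are assumed)."""
    if index < 50:
        return "excellent"
    if index < 100:
        return "good"
    if index < 150:
        return "average"
    if index < 200:
        return "poor"
    return "extremely poor"


# Hypothetical data: three samples, two parameters, and their limits.
samples = [[1.0, 10.0], [2.0, 10.5], [3.0, 30.0]]
standards = [2.0, 20.0]
weights = entropy_weights(samples)
for row in samples:
    value = ewqi(row, standards, weights)
    print(round(value, 2), classify_ewqi(value))
```

Because the weights sum to one, a sample whose concentrations all equal their permissible limits scores exactly 100 under this sketch, which is a convenient sanity check.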

Multivariate statistical techniques (MSTs)
CA is an exploratory analysis that divides a large number of objects into a smaller number of different groups based on similarity. Clustering is unsupervised classification and its procedures may be hierarchical or non-hierarchical. A tree-like structure called a dendrogram characterises a hierarchical CA (HCA). HCA can be agglomerative or divisive. In the present study, agglomerative HCA has been used to identify the similarity among sampling locations. HCA was performed on z-transformed datasets using Ward's method of linkage. Ward's method of linkage begins with 'n' clusters, each containing a single observation, and continues until all the observations are merged into one cluster. This method is based on the error sum of squares. For the measure of similarity, Euclidean distance has been used. Euclidean distance measures the geometric distance between two observations.
PCA was applied to transform the original variables into new and uncorrelated variables (Shrestha & Kazama 2007). It is a powerful data reduction technique used to reduce the number of variables so that the variance can be explained with fewer variables (Zhang et al. 2009; Dash et al. 2020). The following steps are involved in the PCA:
Step 1: Standardisation of the dataset (all the variables will be transformed to the same scale).
Step 2: Computation of covariance matrix (to observe how the variables are varying from the mean with respect to each other).
Step 3: Computation of eigenvalues and eigenvectors for the covariance matrix (to decide the principal components).
Step 4: Computation of the Principal Components (PCs).
Step 5: Reorientation of the data from the original axes to the ones represented by the PCs.
The basic idea behind PCA is to ascertain patterns and correlations among observed variables. Based on a strong correlation between different variables, a final judgement is made about reducing the dimensions of the datasets in such a way that the substantial statistical information is still retained.
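The five PCA steps listed above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the SPSS procedure used in the study: the function name `pca_kaiser` and the synthetic data are hypothetical, and no factor rotation is applied. The rule of retaining components with an eigenvalue greater than one is the extraction criterion reported in the results section of this paper. The sketch assumes every variable has non-zero variance.

```python
import numpy as np


def pca_kaiser(data):
    """PCA on standardised data, keeping components with eigenvalue > 1.

    data: array-like of shape (samples, variables).
    Returns (all eigenvalues, retained eigenvectors, scores, % variance).
    """
    x = np.asarray(data, dtype=float)
    # Step 1: standardise every variable to zero mean and unit variance.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardised data
    # (this equals the correlation matrix of the original data).
    cov = np.cov(z, rowvar=False)
    # Step 3: eigenvalues and eigenvectors of the symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]  # sort from largest to smallest
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: retain the principal components with eigenvalue > 1.
    keep = eigvals > 1.0
    components = eigvecs[:, keep]
    # Step 5: re-express the data along the retained components.
    scores = z @ components
    explained = eigvals[keep] / eigvals.sum() * 100.0
    return eigvals, components, scores, explained


# Synthetic example: 15 "sites", three correlated variables and one
# independent variable (values are random, not from the study).
rng = np.random.default_rng(0)
base = rng.normal(size=(15, 1))
columns = [base + 0.1 * rng.normal(size=(15, 1)) for _ in range(3)]
columns.append(rng.normal(size=(15, 1)))
eigvals, components, scores, explained = pca_kaiser(np.hstack(columns))
print("retained components:", components.shape[1])
print("variance explained (%):", np.round(explained, 2))
```

Since the eigenvalues of a correlation matrix sum to the number of variables, an eigenvalue above one marks a component that explains more variance than any single standardised variable does on its own.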
Statistical analysis was performed using IBM SPSS Statistics 20 software.
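The agglomerative clustering described in this section can also be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the SPSS routine used in the study: the function names `zscore_columns` and `ward_clusters` and the six sample points are hypothetical. Each merge joins the pair of clusters whose union gives the smallest increase in the error sum of squares, which for clusters of sizes n_a and n_b equals n_a·n_b/(n_a+n_b) times the squared Euclidean distance between their centroids.

```python
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardise each column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [stdev(c) for c in columns]
    return [[(x - mu) / sd for x, mu, sd in zip(row, mus, sds)]
            for row in rows]


def ward_clusters(rows, k):
    """Agglomerative clustering with Ward's criterion.

    Starts with one cluster per observation and merges until k remain.
    Returns a list of k sorted lists of row indices.
    """
    clusters = [{"members": [i], "centroid": list(r)}
                for i, r in enumerate(rows)]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                na, nb = len(ca["members"]), len(cb["members"])
                d2 = sum((p - q) ** 2
                         for p, q in zip(ca["centroid"], cb["centroid"]))
                # Increase in the error sum of squares if a and b merge.
                cost = na * nb / (na + nb) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        na, nb = len(ca["members"]), len(cb["members"])
        merged = {
            "members": ca["members"] + cb["members"],
            # Size-weighted mean of the two centroids.
            "centroid": [(na * p + nb * q) / (na + nb)
                         for p, q in zip(ca["centroid"], cb["centroid"])],
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


# Hypothetical measurements for six sites and two parameters.
sites = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0],
         [5.2, 4.9], [10.0, 0.0], [10.1, 0.3]]
print(sorted(ward_clusters(zscore_columns(sites), 3)))
```

The pairwise search makes this sketch slow for large datasets, but it keeps the merge criterion visible; for 15 sampling sites the cost is negligible.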

RESULTS AND DISCUSSION
Descriptive statistics of the various observed water quality parameters of Baralia and Puthimari Rivers are shown in Tables 3-6. It has been observed that TDS and TUR have very high standard deviation (SD) and variance. This may be due to the influence of rainfall, surface runoff, river water flow and erosion from the river bed and banks. Erosion is more pronounced on both banks than sedimentation (Baishya & Sahariah 2015).
In the present study, HCA was used to categorise the sampling sites and a dendrogram was generated. HCA grouped the sampling locations into three different clusters. Grouped sampling sites under each cluster are shown in Figure 2. Along its flow path, Baralia River encounters mostly agricultural and forest areas in the upper reaches, the densely populated Rangia town in the middle reach, and scattered population, forest areas and farming land in the lower reaches. In contrast, Puthimari River encounters scattered population, forest areas and agricultural land throughout its flow length. Sampling sites located at the middle reach of the stream and near Rangia town were grouped under cluster 1. The EWQI of all water samples from these sites was more than 150, which indicated that the water quality was 'poor' or 'extremely poor' (Table 7). Higher EWQI was observed at sampling sites located near the densely populated market area of Rangia town. Sampling stations near the town receive pollutants from domestic wastewater. Wastewater from household activities was disposed of into open drains in front of the houses, which discharged it into the river without any treatment. There is no well-connected drainage system in the town. Baralia River is also used for washing clothes, bathing of pets, and fishing, which also contribute to the pollution (CPCB 2015). Another important factor contributing to pollution was municipal solid waste (MSW).
MSW is routinely dumped in town streets and along the banks of the rivers. MSW was found dumped in thin, non-contiguous layers at numerous locations along the riverbank, while in many areas thicker, contiguous fills lay on the river bank in contact with the flowing water. Water leaching through solid waste directly affects the water quality of the river (CPCB 2015). Sampling sites SPPR1 and SPBR1 were grouped in cluster 2 (Figure 2). These sites were located at the rivers' uppermost reach, where population density is significantly lower and human activities are minimal. The EWQI values of these two sampling locations were 69.65 and 61.62, respectively (Table 7), which indicate the water quality as 'good' (Table 2).
Cluster 3 consisted of sampling sites SPPR2, SPBR2, SPBR3, SPBR8 and SPBR9. Sampling sites SPPR2, SPBR2 and SPBR3 were located upstream of Rangia town, in the part of the basin where population density is low and agricultural activities and livestock breeding dominate the land use pattern. The EWQI at those locations was in the range of 100-150, which indicated the water quality as 'average'. Sampling sites SPBR8 and SPBR9 were located in the river's downstream section, away from Rangia town. EWQI of SPBR8 was 149.45 and that of SPBR9 was 147.43, which indicated water quality as 'average'. Water quality of cluster 3 was better than the water quality of cluster 1. This indicates the self-assimilative process of the river.
PCA was performed on all observed water quality parameters collected from various sampling locations. For extraction of principal components, to explain the sources of variance in observed water quality parameters, an eigenvalue greater than one was taken as the criterion. PCA generated six useful factors which explained 87.98% of the total variance (Table 8). Factor 1, which explained 28.73% of the total variance, was associated with inorganic constituents. It had strong positive loading on EC, TH, TA, K⁺, Ca²⁺ and Mg²⁺. Conductivity in water is affected by the presence of inorganic dissolved solids such as Cl⁻, SO₄²⁻, Na⁺, K⁺ and Ca²⁺. Ca²⁺ and Mg²⁺ dissolved in water are the two most common sources of hardness. This factor is associated with surface runoff (Goonetilleke et al. 2005). Factor 2 represented 17.5% of the total variance and was related to heavy metals such as Fe, Mn, Cu and Zn. This factor had strong positive loading on Fe, Mn and Cu and a moderately strong positive loading on Zn. This heavy metal factor can be interpreted as metal pollution leaching from MSW illegally dumped near the bank. Factor 3, which explained 12.7% of the total variance, had strong positive loading on DO and Cl⁻ and moderate loading on Na⁺. This factor represents pollution sources mainly from municipal effluents (USGS 1999). Cl⁻ is a major constituent of municipal wastewater, normally coming from kitchen wastewater. Salts such as table salt are composed of Na⁺ and Cl⁻. When table salt is mixed with water, its Na⁺ and Cl⁻ ions separate as they dissolve. Chlorinated drinking water also increases chloride levels in the wastewater of a community (USGS 1999; Ha & Bae 2001). Factor 4 accounted for 11.89% of the total variance and had strong positive loading on pH and SO₄²⁻. Sulfates naturally occur in minerals of some soil and rock formations (Al-Khashman & Shawabkeh 2006). This factor may be attributed to the physicochemical source of variability.
Factor 5 had strong positive loading on TDS and moderate loading on NO₃⁻. This factor can be attributed to pollution due to the use of fertilisers for agricultural activities. This can also occur with animal waste and manure finding their way into the river. Factor 6 explained 6.95% of the total variance and had strong negative loading on Pb and moderate positive loading on F⁻. This factor may also be due to the physicochemical source of variability (Varol & Sen 2009).

CONCLUSION
In this study, water quality data for 20 physical and chemical parameters, collected from 9 sampling sites of Baralia River and 6 sampling sites of Puthimari River in Assam (India) during the period of May 2016 to June 2017, were analysed. EWQI was used to assess the water quality of the rivers. HCA was applied to group similar sites and it grouped all the monitored sites into 3 clusters based on pollution levels. PCA was applied to identify possible sources of pollution. The important conclusions drawn from the study are as follows:
• The analysis showed that domestic discharge coming from various household activities and runoff leaching from the illegally dumped municipal solid waste near the river bank are adversely affecting the water quality of Baralia and Puthimari Rivers. The worst water quality has been observed near Rangia town.
• HCA grouped all the sampling sites into 3 clusters based on similarities in the water quality characteristics. This method can be used for the optimisation of sampling sites.
• The study demonstrated the importance of Shannon entropy and MSTs in water quality assessment. The study illustrated the utility of EWQI in evaluating surface water quality, the results of which were further reinforced by the application of PCA and HCA.
• The present work justifies the effectiveness of the combined use of EWQI and MSTs in water quality monitoring and management.
The study will help policymakers responsible for water supply and water pollution control, since these methods form a significant tool that is easy to understand and straightforward to apply. Indeed, these methodologies make water quality datasets much easier to use and interpret. This study will also assist in making decisions on allocating funds and determining priorities.

DISCLOSURE STATEMENT
The authors reported no potential conflict of interest.

DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.