Abstract
Water Quality Index (WQI) is a unique and effective rating technique for assessing the quality of water. Nevertheless, most of the indices are not applicable to all water types as these are dependent on core physico-chemical water parameters that can make them biased and sensitive towards specific attributes including: (i) time, location and frequency for data sampling; (ii) number, variety and weights allocation of parameters. Therefore, there is a need to evaluate these indices to eliminate uncertainties that make them unpredictable and which may lead to manipulation of the water quality classes. The present study calculated five WQIs for two temporal periods: (i) June to December 2019 obtained in real time (using the Internet of Things (IoT) nodes) at inlet and outlet streams of Rawal Dam; (ii) 2012–2019 obtained from the Rawal Dam Water Filtration Plant, collected through GIS-based grab sampling. The computed WQIs categorized the collected datasets as ‘Very Poor’, primarily owing to the uneven distribution of the water samples that has led to class imbalance in the data. Additionally, this study investigates the classification of water quality using machine learning algorithms namely: Decision Tree (DT), k-Nearest Neighbor (KNN), Logistic Regression (LogR), Multilayer Perceptron (MLP) and Naive Bayes (NB); based on the parameters including: pH, dissolved oxygen, conductivity, turbidity, fecal coliform and temperature. The classification results showed that the DT algorithm outperformed other models with a classification accuracy of 99%. Although WQI is a popular method used to assess the water quality, there is a need to address the uncertainties and biases introduced by the limitations of data acquisition (such as specific location/area, type and number of parameters or water type) leading to class imbalance. 
This can be achieved by developing a more refined index that considers other factors, such as topographical and hydrological parameters with spatio-temporal variations, combined with machine learning techniques, so that water quality can be estimated effectively for all regions.
HIGHLIGHTS
Evaluated five WQIs based on six physico-chemical parameters to analyze their sensitivity to the location, type and frequency of data sampling.
Computed WQIs categorized the datasets as ‘Very Poor’ because of the uneven distribution of water samples, which leads to class imbalance.
Five ML models were applied, of which the Decision Tree achieved the highest classification accuracy (99%).
A refined index should also consider topographical and hydrological parameters.
INTRODUCTION
Water is a prime natural resource and the chief constituent of the ecosystem. It is used in agriculture, forestry, livestock production, industry and other activities, and it is a primary need for industrial, agricultural and growing household uses (Hamzaoui-Azaza et al. 2011; Pazand & Hezarkhani 2012). The quality of water is defined by its physical, biological and chemical characteristics. Water quality deteriorates because of growing populations and urbanization, anthropogenic activities and the rapid increase in industrialization (Carpenter et al. 1998; Neal et al. 1998; Singh et al. 2004). Human health is affected by waterborne diseases, which have caused up to 5–10 million deaths worldwide (Chung & Yoo 2015). Because good water quality is essential to sustaining civilization, water needs to be monitored and managed properly.
Physico-chemical and biological parameters are mostly used to monitor the quality of water that should fall under set standards and guidelines. The occurrence of these parameters beyond the defined limit can be harmful for human health. To express water quality in some standard form researchers have come up with several water quality indices, which are the most effective tools used to describe the quality of water (Couillard & Lefebvre 1985; House & Newsome 1989; Secunda et al. 1998; Nives 1999; Jonnalagadda & Mhere 2001). Water Quality Index (WQI) is a mathematical tool that represents the water quality class by categorizing different water parameters into a standard numerical value that lies between 0 and 100. WQI classifies water quality typically in five classes or categories ranging from excellent to worst and summarizes the complex water quality data for the general public (Nives 1999).
Over many years, different water agencies have proposed several water quality indices but there was no major breakthrough until 1970. A two-phased approach was used to calculate such indices in which at first the raw water quality parameters are converted into a sub-index (SI) value and then further accumulated to a WQI value (Ott 1978). Scaled on a rate of 0–100, the WQI has five or six classes accordingly (Pesce & Wunderlin 2000; Sargaonkar & Deshpande 2003; Štambuk-Giljanović 2003; Tsegaye et al. 2006; Bouza-Deaño et al. 2008). A higher value yields a better WQI class, whereas lower WQI values correspond to a low or inferior class (Landwehr 1979; Brown et al. 1972; Lohani & Todino 1984; Dinius 1987). This classification has helped many studies to determine the quality of water (Bhatt & Pathak 1992; Kumar & Shukla 2002; Mulani et al. 2009; Khanna et al. 2013); it may also help to analyze the trend of water quality over a period of time and can identify how environmental impact and anthropogenic activities have affected the water quality for drinking or other water consumption.
The important water quality indices used worldwide include the Weighted Arithmetic WQI Method (WAWQI) (Brown et al. 1972), the Minimum Operator Index (MOI) (Smith 1990), the National Sanitation Foundation WQI (NSF-WQI) (McClelland 1974), the Canadian Council of Ministers of the Environment WQI (CCME-WQI) (Canadian Council of Ministers of the Environment 2001) and the Oregon WQI (OWQI) (Cude 2001). Over many years, water quality has been assessed in different regions using various techniques. Mostly, water samples are collected through GIS-based grab sampling, which is labor intensive and therefore consumes many resources. In 2006 (Lumb et al. 2006), CCME-WQI was computed for the years 1960–2002 for the Mackenzie River. The parameters used for the assessment included suspended solids, turbidity (Tur), trace metals and true color. The study revealed that the high presence of such trace metals had deteriorated the water quality. In 2006 (Davies 2006), CCME-WQI was calculated after collecting samples from six sites in central North America for the years 1969–2002, based on 17 parameters including chloride, total phosphorus, nitrate, ammonia, fecal coliform (FC) and dissolved oxygen (DO). This study revealed that CCME-WQI is not an ideal tool for water quality assessment, as it is affected by the number of measurements, parameters and samples collected over a certain time duration.
In 2008, one paper (Qadir et al. 2008) assessed the Chenab River (Pakistan), which was monitored from September 2004 to April 2006. Samples from seven sites were analyzed on a seasonal basis for 24 physico-chemical parameters, such as pH, DO, conductivity and total dissolved solids (TDS), and were assessed using statistical techniques including hierarchical agglomerative cluster analysis and principal component analysis.
In 2009, one group (Kumar & Dua 2009) described the water quality of the River Ravi (India), which was monitored from January 2003 to December 2005. The water samples were collected using a standard Century Water Analysis kit. The quality was assessed using the CCME-WQI method based on eight physico-chemical parameters: hardness, calcium, pH, TDS, DO, magnesium, alkalinity and conductivity. Their analysis showed that the water quality was directly proportional to the value of DO in the water. In 2009, one study (Ramakrishnaiah et al. 2009) assessed the water quality for the pre-monsoon period after collecting samples from 269 locations in Tumkur Taluk (India) during February 2006. A weighted WQI method was used along with the Indian BIS standard to calculate the WQI based on 17 parameters: pH, conductivity, TDS, hardness, calcium, magnesium, sodium, bicarbonate, phosphate, nitrate, carbonate, chloride, sulphate, fluoride, potassium, iron and manganese. It was concluded that the water quality was excellent for this region due to the high values of iron, nitrate and chloride etc. In addition, the analysis showed a high correlation between magnesium and chloride. In 2009, one study (Rajankar et al. 2009) described 22 sites in the Nagpur region (India), which were monitored in 2005 and 2006 in the post-monsoon, summer and winter seasons and showed poor water quality for the region. The water samples were collected through grab sampling from tube and dug wells. Nine parameters (pH, temperature (T), TDS etc.) were used to calculate the WQI using the standard Q-value (NSF method). In another study in 2009 (De Rosemond et al. 2009), water quality data were extracted from 71 mining facilities in Canada for the year 2004. The National Environmental Effects Monitoring Office (NEEMO), Canada collected the physico-chemical parameters, including DO, iron, lead, mercury and pH, four times annually and assessed the application of numeric water quality objectives, including the freshwater aquatic life water quality guidelines (WQG) and Region-Specific Objectives (RSO), for the CCME-WQI. Their analysis showed that WQG is an effective tool but has limited use in observing spatial changes, whereas RSO is a much better option as the spatial changes can be easily evaluated.
In 2010 (Khan 2010), the water supply of Attock city was examined based on WQI after collecting 30 groundwater samples from 30 sampling stations. Six physico-chemical parameters were measured in the samples, namely pH, TDS, DO, conductivity, sulphates and nitrates. The high concentration of nitrate ions made the water unsuitable for human and animal consumption. In another study in 2010 (Vasanthavigar et al. 2010), 148 samples were examined that were collected from January to May 2008 from the Tamilnadu (India) region in the pre- and post-monsoon seasons. WQI was computed based on the Indian BIS standard combined with a weighted method using the following parameters: calcium, magnesium, sodium, potassium, chlorine, pH, sulphate, conductivity and TDS. The analysis showed that the increase in anthropogenic activities during the post-monsoon season had led to an increase in magnesium, potassium and sulphates, causing water quality deterioration. In 2010 (Das et al. 2010), WQI and an Urbanization Index (UI) were calculated based on the All-India Public Health Engineering standard and also computed with the help of software. This WQI examined how rainwater quality was affected by construction activities at six sampling stations in Kolkata (India). Thirty-six samples were collected in buckets, and parameters such as TDS, pH, chloride, hardness, Tur, magnesium, calcium and T were used. According to the study, WQI and UI were inversely proportional to one another, but this conclusion is based on very few samples.
In 2011 (Puri et al. 2011), the water quality of Nagpur city (India) was studied for the months January to December 2008. The data were extracted from four permanent stations on a monthly basis, collected in sterilized glass bottles and afterwards assessed using NSF-WQI. The parameters monitored included conductivity, chlorine, TDS, DO, hardness, biological oxygen demand (BOD), pH and FC. Results revealed that the lake water was unsafe for drinking and showed a medium to poor quality rating for all seasons except the monsoon. In 2011, one group (Sharma & Kansal 2011) analyzed the Yamuna River (India) using the CCME-WQI method. In the pre-monsoon and post-monsoon seasons, parameters such as pH, BOD, DO, FC and ammonia were measured at four sites for the years 2000 to 2009. The river was highly polluted and belonged to the poor water quality class. In 2012 (Chowdhury et al. 2012), 34 water stations along the Faridpur-Barishal road in Bangladesh were monitored using data from March to April 2011. The water samples were collected through grab sampling. Six parameters (pH, TDS, DO, BOD, conductivity and T) were used to assess the water quality based on the NSF-WQI and WAWQI methods. The results showed that BOD and DO were the most important parameters in determining the quality of water. In another study in 2012 (Srebotnjak et al. 2012), the GEMS/Water program and the European Environment Agency (EEA) provided data to produce a composite index named the Environmental Pollution Index (EPI). The data covered 100 countries, with 2 million samples from different lakes, rivers and reservoirs. Hot-deck imputation was applied to replace missing values and improved the results of the WQI. The proposed index was claimed to immediately identify the issues affecting the quality of water, such as the low presence of nitrogen or total phosphorus parameters.
In 2014 (Selvam et al. 2014), CCME-WQI was used to evaluate the water resources in Tuticorin (India). Fourteen physico-chemical parameters, such as pH, conductivity, hardness, TDS, sulphate and phosphate, were examined in 72 samples collected from wells. The water quality was rated fair in the pre-monsoon period and good in the post-monsoon period. Another study in 2014 (Liou et al. 2004) proposed a generalized WQI for Taiwan. The water samples were collected for the years 1994–2000 from 205 monitoring stations (21 main rivers). They used 12 parameters, namely DO, ammonia, nitrogen, Tur, FC, suspended solids, T, pH, cadmium, zinc, lead, copper and chromium. The proposed index highlighted areas that had been missed by the existing index and were affected by industrial activities in the Keya River. In research carried out in 2014 (Nazeer et al. 2014), data for the Soan River (Pakistan) were examined through the CCME-WQI method in the pre-monsoon (April to May) and post-monsoon (September to October) seasons of 2008. Eighteen samples were collected through grab sampling, and parameters such as pH, T, DO, TDS and conductivity were measured. The results were better in the pre-monsoon period, but the water was deemed unsuitable for human and animal consumption due to the high presence of nickel, lead and cadmium.
In 2016 (Ewaid 2017), Al-Gharraf river's quality was analyzed on a monthly basis from 10 stations for the year 2015. The NSF-WQI and Heavy Metal Pollution Index (HPI) were calculated based on 13 parameters namely; BOD, DO, Tur, nitrates, phosphates, T, pH and four heavy metals. The river quality is determined to be poor after analyzing the HPI due to the anthropogenic activities in the environment such as soil erosion, sewage discharge and other industrial activities. In 2018 (Bhatti et al. 2018), 29 water samples were collected from Nagarparkar, Pakistan using grab sampling. The quality was assessed using WAWQI based on 18 physico-chemical parameters including; pH, conductivity, TDS, Tur, alkalinity, hardness, chloride, DO, sulphate, calcium, magnesium, iron, cadmium, nickel, copper, manganese, arsenic and fluoride. The study revealed that only 35% of the parameters were within the defined WHO limits, while the remaining 65% were beyond the defined range for good water quality. In another study in 2018 (Wu et al. 2018), 96 sites in Lake Taihu were examined using samples from September 2014 to January 2016 on a seasonal basis. The Pesce & Wunderlin (2000) WQI and WQI min methods were used to assess the quality based on 13 physico-chemical parameters. The quality was high in autumn but the lowest in the winter season.
In 2019 (Golbaz et al. 2019), a unique swimming pool WQI (SPWQI) was developed for monitoring the quality of swimming pools based on 13 physico-chemical and biological parameters. The SPWQI is a modified version of the WAWQI method. This index helped in managing and treating the water quality. In another study in 2019 (Gupta et al. 2019), artificial neural networks (ANN) were used to develop a universal WQI based on the WHO parameters. The study revealed that ANN based on cascade forward architecture was successful for predicting the WQI using five physico-chemical parameters such as Tur, pH, conductivity, DO and FC. However, the limitation of ANN method was that it can vary with the change in parameters and therefore needs to be further worked on to get the desired results. In 2019 (Abbasnia et al. 2019), the quality of 654 dug wells in Sistan and Baluchistan, Iran was studied using the WAWQI method. Overall, the drinking quality of the dug wells was categorized as excellent and good.
In 2020 (Karunanidhi et al. 2020), 61 samples were collected through grab sampling from the Shanmuganadhi River basin, India, and eight physico-chemical parameters, such as calcium, sodium, sulphate and fluoride, were assessed using the WAWQI method. The WQI results showed that 52% of the samples were unfit for consumption whereas 48% were classified as good. The researchers suggested that, to reduce the fluoride concentration, the groundwater should be treated or artificially recharged before being used as drinking water. In another study in 2020 (Chabuk et al. 2020), the water of the Tigris River, Iraq was assessed for the wet and dry seasons in 2016 by measuring 12 parameters at 11 locations using the WAWQI method and GIS software. The results showed that the parameter concentrations were higher in the dry season compared with the wet season except for potassium, conductivity, TDS and bicarbonate. The computed WQI showed that the river had poor water quality due to human activities surrounding the river. The study also revealed that the application of WQI was only effective after the water was treated, due to the high parameter concentrations that could be present in raw water. In 2020 (Ustaoğlu et al. 2020), the quality of the Turnasuyu Basin, Turkey was evaluated using the WAWQI method and data for the period February 2017 to January 2018. The water quality of the basin was deemed good for public use. The physico-chemical parameters observed for the year did not exceed the permissible WHO limits. However, anthropogenic activities may impact the quality downstream of the basin. In 2020 (Seifi et al. 2020), the researchers altered the WAWQI index by introducing a Monte-Carlo simulation for weight allocation. The quality of the Kerman aquifer, Iran was observed by collecting 1189 samples during dry and wet seasons. The water quality of the aquifer was considered to be poor based on the computed WQI. The findings revealed that the Monte-Carlo method was useful for the WQI evaluation.
The review of past research has shown that mostly the CCME-WQI and the WAWQI methods were used to evaluate the quality of water. The most common parameters used to build these indices comprise the physical, chemical and biological water parameters. However, there was a certain amount of uncertainty present with the application of these water quality indices in that they are unpredictable in complex environmental situations (Silvert 2000). These indices are mostly biased, as they use a limited number of parameters and are developed for a specific place. These uncertainties were associated with the development and evaluation of the WQI. For instance, the quality of water may vary between two certain points of a lake at a specific time of the day. The physico-chemical properties can change from dawn to dusk in a single day because of the dynamic nature of the water bodies (Khan & Abbasi 1998). Therefore, there are some reasons why most of these indices fail to accurately classify water quality: (1) firstly, the sensitivity to the type of predefined parameters used for development of each standard, (2) the incorporation of a limited set of variables or parameters, and (3) the weight allocation to each parameter. The high concentration of a single parameter can increase the WQI value, which can manipulate the class or category of water quality. Therefore, no index has been universally accepted and there is a need to evaluate and perform a comparative analysis of these indices that can eliminate the uncertainties and biases in these standards.
In this study, five water quality indices were applied to two datasets: (1) Dataset 1, the water samples collected using IoT sensors from selected locations at Rawal Dam; and (2) Dataset 2, the data provided by the Rawal Dam Water Filtration Plant, collected using GIS-based grab sampling. In addition, classification was performed on two of the indices (WAWQI and OWQI) by applying five machine learning algorithms: Naive Bayes (NB), Multilayer Perceptron (MLP), Logistic Regression (LogR), k-Nearest Neighbor (KNN) and Decision Tree (DT).
STUDY AREA
The study area was the Rawal Dam (Ali et al. 2013), which is located in the capital city of Islamabad within an isolated section of the Margalla Hills National Park at a longitude of 73°7′E, latitude of 33°42′N and altitude of 1,800 m. The dam has a height of 133.5 ft, a depth of 102 ft and a surface area of 8.8 km². The dam is 213 m long and 33.5 m high. The catchment area is 106.25 square miles with three zones, namely Kurang, Nurpur village and Shahdara. The Kurang River is the outlet stream of the dam. Four major streams and 43 small streams run off into the Rawal Dam lake. However, during the rainy season, polluted runoff water, local spring discharges and untreated sewage water occasionally flow into the Kurang River and the streams. The reservoir has a maximum capacity of 58,581,810 m³. The citizens of both Rawalpindi and Islamabad receive about 22 million gallons of water per day from this dam. The lake has 15 different fish species including Rahu, Doula, Tilapia, Thaila, Carp fish and Mori. Bhara Kahu, Bani Gala, Malpur and Noorpur Shahan are located in the catchment area of the Rawal Dam and are highly populated. With the development of housing colonies in the catchment area of Rawal Lake, the quality of water is deteriorating due to solid waste disposal and untreated sewage in the tributaries. Another pollution factor is the disposal of poultry waste, as the catchment area contains over 360 poultry sheds. Tourist attractions such as Murree Hills and Chattar Park are also located in the catchment area and are a further source of water pollution.
The dam was selected as the subject of this study because the data were readily accessible, as the dam supplies water to the cities of Rawalpindi and Islamabad. Obtaining data from government bodies usually involves many administrative hindrances, which makes these data invaluable. Moreover, the Rawal Dam is a rain-fed area that is interesting to explore because of changing climatic factors.
Sample analysis
Figure 1 shows the streams and the experimental points selected for data collection. The physical parameters observed for the lake include ‘T’, ‘Tur’, ‘pH’, ‘DO’, ‘conductivity’ and ‘FC’. For Dataset 1, the T of the lake varied from 29 °C in June to 18.75 °C in December. Tur varied from 30 NTU in June to 429 NTU in December. pH varied from 1.74 in June to 6.23 in December. DO varied from 1.95 mg/L in June to 1.46 mg/L in December. Conductivity varied from 795.65 μS/cm in June to 30,803 μS/cm in December. For Dataset 2, the T of the lake varied from 23 °C in 2012 to 24 °C in 2019. Tur varied from 18 NTU in 2012 to 22 NTU in 2019. pH varied from 7.19 in 2012 to 7.25 in 2019. DO varied from 3.5 mg/L in 2012 to 6.6 mg/L in 2019. Conductivity varied from 736 μS/cm in 2012 to 520 μS/cm in 2019. FC varied from 170 colonies/100 mL in 2012 to 140 colonies/100 mL in 2019.
DATA COLLECTION AND PREPROCESSING
Dataset collection
Dataset 1
The data collected from Rawal Dam from June to December 2019 were named Dataset 1. The data were collected in real time using Internet of Things (IoT) sensors (see Figure 1) deployed at identified stations on the inlet and outlet streams of Rawal Lake. The data were transmitted to a local server using GSM technology for further preprocessing and analysis.
Five parameters were recorded namely: ‘T’, ‘Tur’, ‘pH’, ‘DO’ and ‘conductivity’. The dataset had 5672 instances. Figure 2(a) shows the change in concentrations of the parameters collected over time. The initial version of the dataset can be seen on Kaggle (https://www.kaggle.com/mahmedphdcs17seecs/rawal-dam-water-quality-dataset-2019).
Variation in the concentrations of parameters for Datasets 1 and 2. (a) June to December 2019 (Dataset 1), (b) 2012–2019 (Dataset 2).
Dataset 2
Dataset 2 contains 1114 samples collected from 2013 to 2018 through GIS-based grab sampling. This dataset comprises six parameters, namely ‘T’, ‘Tur’, ‘pH’, ‘DO’, ‘conductivity’ and ‘FC’. The dataset was provided by the Rawal Dam Water Filtration Plant. Figure 2(b) shows the variation of the parameters over the years in Dataset 2.
Preprocessing of dataset
The samples in both the datasets needed to be converted to a WQI value that can categorize them as best or worst, depending on the index method used. Table 1 displays the WQI values and their respective classifications. Overall, the parameters in Dataset 1 do not show a positive correlation except for ‘conductivity with Tur’ and ‘conductivity with pH’ as seen in Figure 3. The parameters in Dataset 2 do not show any positive correlations as seen in Figure 3.
Classification of WQI values for the five indices. Rating classes are abbreviated as Excellent (E), Good (G), Fair (F), Poor (P), Very Poor (VP), Unfit for Drinking (U), Medium (M), Bad (B), Very Bad (VB), Marginal (Ml), Eminently suitable for all uses (ES), Suitable for all uses (S), Main use may be compromised (C), Unsuitable for several uses (Uns), Totally unsuitable for many uses (TU) (Brown et al. 1972; Smith 1990; McClelland 1974; Canadian Council of Ministers of the Environment 2001; Cude 2001)
Index | No. of parameters | WQI value | Rating class | Class no. |
---|---|---|---|---|
WAWQI | 10 | 0–25 | E | 0 |
WAWQI | 10 | 25–50 | G | 1 |
WAWQI | 10 | 51–75 | F | 2 |
WAWQI | 10 | 76–100 | P | 3 |
WAWQI | 10 | 101–150 | VP | 4 |
WAWQI | 10 | Above 150 | U | 5 |
NSF-WQI | 9 | 90–100 | E | 0 |
NSF-WQI | 9 | 70–90 | G | 1 |
NSF-WQI | 9 | 50–70 | M | 2 |
NSF-WQI | 9 | 25–50 | B | 3 |
NSF-WQI | 9 | 0–25 | VB | 4 |
CCME-WQI | Up to 47 | 95.0–100.0 | E | 0 |
CCME-WQI | Up to 47 | 80.0–94.9 | G | 1 |
CCME-WQI | Up to 47 | 65.0–79.9 | F | 2 |
CCME-WQI | Up to 47 | 45.0–64.9 | Ml | 3 |
CCME-WQI | Up to 47 | 0.0–44.9 | P | 4 |
OWQI | 8 | 90–100 | E | 0 |
OWQI | 8 | 85–89 | G | 1 |
OWQI | 8 | 80–84 | F | 2 |
OWQI | 8 | 60–79 | P | 3 |
OWQI | 8 | less than 60 | VP | 4 |
MOI | 8 | 80–100 | ES | 0 |
MOI | 8 | 60–79 | S | 1 |
MOI | 8 | 40–59 | C | 2 |
MOI | 8 | 20–39 | Uns | 3 |
MOI | 8 | 0–19 | TU | 4 |
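The mapping from an index value to a rating class in Table 1 can be sketched as a simple lookup. The snippet below is an illustrative sketch only, covering WAWQI and OWQI; the function names `wawqi_class` and `owqi_class` are hypothetical, and the handling of values that fall exactly on a shared boundary (for example 25 for WAWQI) or between two integer ranges (for example 89.5 for OWQI) is an assumption, since Table 1 does not specify it.

```python
from bisect import bisect_left

# Upper class boundaries and labels transcribed from Table 1.
# Assumption: a value exactly on a shared boundary (e.g. 25) is assigned
# to the better class; Table 1 does not settle this.
WAWQI_BOUNDS = [25, 50, 75, 100, 150]
WAWQI_LABELS = ["E", "G", "F", "P", "VP", "U"]

# OWQI is "higher is better": lower limits of classes E, G, F and P.
# Assumption: fractional values between two printed ranges (e.g. 89.5)
# are assigned to the lower class.
OWQI_LOWER = [90, 85, 80, 60]
OWQI_LABELS = ["E", "G", "F", "P", "VP"]


def wawqi_class(value):
    """Return (rating, class number) for a WAWQI value (lower is better)."""
    idx = bisect_left(WAWQI_BOUNDS, value)
    return WAWQI_LABELS[idx], idx


def owqi_class(value):
    """Return (rating, class number) for an OWQI value (higher is better)."""
    for idx, lower in enumerate(OWQI_LOWER):
        if value >= lower:
            return OWQI_LABELS[idx], idx
    return OWQI_LABELS[-1], len(OWQI_LOWER)


print(wawqi_class(120))
print(owqi_class(87))
```

The class numbers returned here correspond to the 'Class no.' column of Table 1, which is the form in which a label would typically be passed to a classifier.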
Correlation among the water quality parameters in Dataset 1 and Dataset 2. Yellow and green colors represent a high correlation whereas blue represents low correlation.
For applying different indices on the datasets, the SI or quality rating (q) was calculated based on the values of the physico-chemical parameters. The formulae of these indices are mentioned in detail below.
WAWQI method
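The WAWQI is calculated using Equation (1):

$$\mathrm{WQI} = \frac{\sum q_n w_n}{\sum w_n} \qquad (1)$$

where,
n = the number of parameters, here the value is 5 (for Dataset 1) and 6 (for Dataset 2),
qn = quality rating of the nth parameter given in Equation (2),
wn = unit weight of the nth parameter given in Equation (3).

$$q_n = 100 \times \frac{V_n - V_{id}}{S_n - V_{id}} \qquad (2)$$

$$w_n = \frac{k}{S_n} \qquad (3)$$

where,
Sn = standard value of nth water quality parameter,
Vn = observed value of nth water quality parameter,
Vid = ideal value of nth water quality parameter (see Table 2),
k = proportionality constant given in Equation (4).

$$k = \frac{1}{\sum \left(1/S_n\right)} \qquad (4)$$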
Water quality parameters and their corresponding values calculated using WAWQI
Parameters | Sn | Ideal value (Vid) | k | Unit weight (wn) |
---|---|---|---|---|
Tur | 5 | 0 | 2.3801 | 0.476023801 |
pH | 8.5 | 7 | 2.3801 | 0.280014001 |
DO | 15 | 14.6 | 2.3801 | 0.1586746 |
Conductivity | 400 | 0 | 2.3801 | 0.005950298 |
T | 30 | 0 | 2.3801 | 0.0793373 |
FC | 0.99 | 0 | 0.699 | 0.7062 |
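The weights in Table 2 can be reproduced with a few lines of code. The following is a minimal sketch, assuming the commonly used weighted arithmetic formulation: k = 1/Σ(1/Sn) and wn = k/Sn (both consistent with the values in Table 2), qn = 100 × (Vn − Vid)/(Sn − Vid), and WQI = Σ(qn·wn)/Σwn. The function and variable names are hypothetical, and only the five Dataset 1 parameters are included.

```python
# Standard values S_n and ideal values V_id transcribed from Table 2
# (the five parameters recorded in Dataset 1).
STANDARD = {"Tur": 5.0, "pH": 8.5, "DO": 15.0, "Conductivity": 400.0, "T": 30.0}
IDEAL = {"Tur": 0.0, "pH": 7.0, "DO": 14.6, "Conductivity": 0.0, "T": 0.0}


def proportionality_constant(standard):
    """k = 1 / sum(1 / S_n)."""
    return 1.0 / sum(1.0 / s for s in standard.values())


def unit_weights(standard):
    """w_n = k / S_n for every parameter; the weights sum to 1."""
    k = proportionality_constant(standard)
    return {name: k / s for name, s in standard.items()}


def quality_rating(observed, standard, ideal):
    """q_n = 100 * (V_n - V_id) / (S_n - V_id) (assumed form, see lead-in)."""
    return 100.0 * (observed - ideal) / (standard - ideal)


def wawqi(sample, standard=STANDARD, ideal=IDEAL):
    """Weighted arithmetic index: sum(q_n * w_n) / sum(w_n)."""
    weights = unit_weights(standard)
    total = sum(
        quality_rating(sample[name], standard[name], ideal[name]) * w
        for name, w in weights.items()
    )
    return total / sum(weights.values())


k = proportionality_constant(STANDARD)
weights = unit_weights(STANDARD)
print(round(k, 4), {name: round(w, 6) for name, w in weights.items()})
```

Under these assumptions a sample whose observed values all equal the standard values receives a WQI of exactly 100, which is a convenient sanity check on the implementation.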
CCME-WQI method
- nse = normalized sum of excursions given in Equation (9):
Water quality parameters and their corresponding values calculated using CCME-WQI for Dataset 1 and Dataset 2
Data | Scope (F1) | Frequency (F2) | Normalized sum of excursions (NSE) | Amplitude (F3) | CCME value | CCME-WQI rating |
---|---|---|---|---|---|---|
Dataset 1 | 80 | 40.13 | 15.745 | 94.03 | 25.05 | Poor (0–44.9) (see Table 1) |
Dataset 2 | 66.67 | 45.18 | 18.21 | 94.79 | 28.18 | Poor (0–44.9) (see Table 1) |
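The relationship between the columns of Table 3 can be checked numerically. The sketch below assumes the standard CCME (2001) formulation, in which the amplitude is F3 = nse/(0.01·nse + 0.01) and the index is 100 − √(F1² + F2² + F3²)/1.732. Scope (F1), frequency (F2) and nse are taken directly from Table 3 rather than recomputed from raw samples, and the function names are hypothetical.

```python
import math


def amplitude(nse):
    """F3 from the normalized sum of excursions: nse / (0.01 * nse + 0.01)."""
    return nse / (0.01 * nse + 0.01)


def ccme_wqi(f1, f2, nse):
    """CCME-WQI = 100 - sqrt(F1^2 + F2^2 + F3^2) / 1.732."""
    f3 = amplitude(nse)
    return 100.0 - math.sqrt(f1 ** 2 + f2 ** 2 + f3 ** 2) / 1.732


# Inputs taken from Table 3.
dataset_1 = ccme_wqi(f1=80.0, f2=40.13, nse=15.745)
dataset_2 = ccme_wqi(f1=66.67, f2=45.18, nse=18.21)
print(round(amplitude(15.745), 2), round(dataset_1, 2))
print(round(amplitude(18.21), 2), round(dataset_2, 2))
```

With the Table 3 inputs, this sketch gives amplitudes of 94.03 and 94.79 and CCME values of 25.05 and 28.18, matching the tabulated figures for Dataset 1 and Dataset 2.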
NSF-WQI method
wn = unit weight of the nth parameter,
qn = quality rating of the nth parameter.
Table 4 shows the weightages updated accordingly due to the number of parameters used for the current study. The weightages are updated with respect to the ratio of the presently used NSF weightages.
Water quality parameters and their corresponding weights calculated using NSF-WQI.
Parameters | NSF-WQI weightages | New weightages (Dataset 1) | New weightages (Dataset 2) |
---|---|---|---|
DO | 0.17 | 0.34 | 0.34 |
pH | 0.11 | 0.22 | 0.16 |
T | 0.1 | 0.20 | 0.1 |
Tur | 0.08 | 0.16 | 0.08 |
Conductivity | – | 0.08 | 0 |
FC | 0.16 | – | 0.32 |
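As a rough illustration of how the Table 4 weightages would enter the index, the sketch below assumes the additive form of the NSF-WQI, that is, the weighted sum Σ(wn·qn) of quality ratings on a 0–100 scale. The quality ratings themselves come from the NSF rating curves, which are not reproduced here, so the `example_ratings` values are purely hypothetical placeholders; the function name `nsf_wqi` is also an assumption.

```python
# New weightages transcribed from Table 4.
WEIGHTS_DATASET_1 = {"DO": 0.34, "pH": 0.22, "T": 0.20, "Tur": 0.16,
                     "Conductivity": 0.08}
WEIGHTS_DATASET_2 = {"DO": 0.34, "pH": 0.16, "T": 0.10, "Tur": 0.08,
                     "Conductivity": 0.0, "FC": 0.32}


def nsf_wqi(quality_ratings, weights):
    """Additive NSF-WQI: sum of w_n * q_n, with the weights summing to 1."""
    total_weight = sum(weights.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(w * quality_ratings[name] for name, w in weights.items())


# Hypothetical quality ratings (0-100), for illustration only.
example_ratings = {"DO": 60.0, "pH": 90.0, "T": 80.0, "Tur": 50.0,
                   "Conductivity": 70.0}
print(nsf_wqi(example_ratings, WEIGHTS_DATASET_1))
```

The check inside `nsf_wqi` reflects the fact that both sets of new weightages in Table 4 sum to 1, so the result stays on the same 0–100 scale as the individual ratings.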
OWQI method
n = number of parameters,
SIi = sub-index of the ith parameter, given in Table 5.
SI calculation for T, pH, DO and FC in OWQI
Parameters | Condition | Sub-index calculation |
---|---|---|
T | T ≤ 11 °C | SIT = 100 |
T | 11 °C < T ≤ 29 °C | SIT = 76.54 + 4.172*T − 0.1623*T² − 2.0557E−3*T³ |
T | 29 °C < T | SIT = 10 |
DO | DO ≤ 3.3 mg/L | SIDO = 10 |
DO | 3.3 mg/L < DO < 10.5 mg/L | SIDO = −80.29 + 31.88*DO − 1.401*DO² |
DO | 10.5 mg/L ≤ DO | SIDO = 100 |
pH | pH < 4 or pH > 11 | SIpH = 10 |
pH | 4 ≤ pH < 7 | SIpH = 2.628*exp(pH*0.5200) |
pH | 7 ≤ pH ≤ 8 | SIpH = 100 |
pH | 8 < pH ≤ 11 | SIpH = 100*exp((pH − 8)*(−0.5188)) |
FC | FC ≤ 50/100 mL | SIFC = 98 |
FC | 50/100 mL < FC ≤ 1600/100 mL | SIFC = 98*exp((FC − 50)*(−9.9178E−4)) |
FC | 1600/100 mL < FC | SIFC = 10 |
OWQI was calculated for both datasets, but with the conductivity and Tur parameters excluded, as OWQI does not include an SI range for these parameters. Table 5 shows the SI formulae for T, DO, pH and FC. Table 6 shows the top 20 samples with SI calculations for Datasets 1 and 2.
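The piecewise SI formulae of Table 5 can be written directly as functions. The following Python sketch transcribes the ranges and coefficients from Table 5; the function names are illustrative, and the boundary handling follows the inequalities as printed in the table.

```python
import math

def si_temperature(t_celsius):
    """Sub-index for temperature (T in degrees Celsius), per Table 5."""
    if t_celsius <= 11:
        return 100.0
    if t_celsius <= 29:
        return (76.54 + 4.172 * t_celsius
                - 0.1623 * t_celsius ** 2
                - 2.0557e-3 * t_celsius ** 3)
    return 10.0

def si_dissolved_oxygen(do_mg_l):
    """Sub-index for dissolved oxygen (mg/L), per Table 5."""
    if do_mg_l <= 3.3:
        return 10.0
    if do_mg_l < 10.5:
        return -80.29 + 31.88 * do_mg_l - 1.401 * do_mg_l ** 2
    return 100.0

def si_ph(ph):
    """Sub-index for pH, per Table 5."""
    if ph < 4 or ph > 11:
        return 10.0
    if ph < 7:
        return 2.628 * math.exp(ph * 0.5200)
    if ph <= 8:
        return 100.0
    return 100.0 * math.exp((ph - 8) * -0.5188)

def si_fecal_coliform(fc_per_100ml):
    """Sub-index for fecal coliform (counts per 100 mL), per Table 5."""
    if fc_per_100ml <= 50:
        return 98.0
    if fc_per_100ml <= 1600:
        return 98.0 * math.exp((fc_per_100ml - 50) * -9.9178e-4)
    return 10.0

# Spot checks against values listed in Table 6.
print(round(si_dissolved_oxygen(9.31), 3))  # DO = 9.31 mg/L
print(round(si_ph(6.91), 3))                # pH = 6.91
print(round(si_fecal_coliform(170), 2))     # FC = 170/100 mL
```

The three spot checks agree with the SIDO, SIpH and SIFC values given for the corresponding samples in Table 6 to within rounding.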
Top 20 samples of water quality parameters and their corresponding SI values calculated using OWQI for Datasets 1 and 2
Dataset 1 | ||||||||
---|---|---|---|---|---|---|---|---|
T | SIT | DO | SIDO | pH | SIpH | FC | SIFC | OWQI value |
29.67 | 10 | 9.31 | 95.079 | 2.31 | 10 | – | – | 12.213 |
29.67 | 10 | 9.31 | 95.079 | 2.31 | 10 | – | – | 12.213 |
29.48 | 10 | 10.1 | 98.781 | 2.35 | 10 | – | – | 12.216 |
29.48 | 10 | 10.1 | 98.781 | 2.35 | 10 | – | – | 12.216 |
29.67 | 10 | 9.05 | 93.478 | 2.42 | 10 | – | – | 12.212 |
29.67 | 10 | 9.05 | 93.478 | 2.42 | 10 | – | – | 12.212 |
29.67 | 10 | 9.24 | 94.667 | 2.35 | 10 | – | – | 12.213 |
29.67 | 10 | 9.24 | 94.667 | 2.35 | 10 | – | – | 12.213 |
29.48 | 10 | 9.39 | 95.534 | 1.74 | 10 | – | – | 12.214 |
29.48 | 10 | 9.39 | 95.534 | 1.74 | 10 | – | – | 12.214 |
29.48 | 10 | 7.67 | 81.810 | 6.91 | 95.528 | – | – | 17.1 |
29.48 | 10 | 7.67 | 81.810 | 6.91 | 95.528 | – | – | 17.1 |
29.48 | 10 | 8.62 | 90.415 | 6.64 | 83.015 | – | – | 17.093 |
29.48 | 10 | 8.62 | 90.415 | 6.64 | 83.015 | – | – | 17.093 |
29.39 | 10 | 9.62 | 96.740 | 5.58 | 47.838 | – | – | 16.867 |
29.39 | 10 | 9.62 | 96.740 | 5.58 | 47.838 | – | – | 16.867 |
29.48 | 10 | 9.36 | 95.365 | 6.34 | 71.024 | – | – | 17.059 |
29.48 | 10 | 9.36 | 95.365 | 6.34 | 71.024 | – | – | 17.059 |
29.48 | 10 | 9.31 | 95.079 | 6.64 | 83.015 | – | – | 17.103 |
29.48 | 10 | 9.31 | 95.079 | 6.64 | 83.015 | – | – | 17.103 |
Dataset 2 | ||||||||
T | SIT | DO | SIDO | pH | SIpH | FC | SIFC | OWQI value |
23 | −1158.62 | 3.5 | 14.13 | 7.19 | 100.00 | 170 | 87.00 | 27.62 |
16 | −317.47 | 3.8 | 20.62 | 7.45 | 100.00 | 53 | 97.71 | 39.48 |
16 | −317.47 | 2.5 | 10.00 | 7.99 | 100.00 | 63 | 96.74 | 19.79 |
16 | −317.47 | 2.3 | 10.00 | 8.05 | 97.44 | 55 | 97.52 | 19.78 |
10 | −0.32 | 4.6 | 36.71 | 8.18 | 91.08 | 57 | 97.32 | 0.63 |
10 | −0.32 | 4.2 | 28.89 | 8.41 | 80.84 | 57 | 97.32 | 0.63 |
12 | −73.62 | 3.6 | 16.32 | 8.6 | 73.25 | 40 | 98.00 | 30.75 |
13 | −121.51 | 3.1 | 10.00 | 7.99 | 100.00 | 70 | 96.08 | 19.73 |
12 | −73.62 | 3.6 | 16.32 | 8.3 | 85.59 | 45 | 98.00 | 30.94 |
12 | −73.62 | 3.9 | 22.73 | 8.5 | 77.15 | 55 | 97.52 | 40.89 |
14 | −177.70 | 3.4 | 11.91 | 8.18 | 91.08 | 28 | 98.00 | 23.39 |
13 | −121.51 | 2.4 | 10.00 | 8.34 | 83.83 | 46 | 98.00 | 19.69 |
14 | −177.70 | 2.4 | 10.00 | 8.4 | 81.26 | 44 | 98.00 | 19.72 |
14 | −177.70 | 3 | 10.00 | 8.3 | 85.59 | 60 | 97.03 | 19.73 |
14 | −177.70 | 3 | 10.00 | 8.44 | 79.59 | 50 | 98.00 | 19.71 |
12 | −73.62 | 2.4 | 10.00 | 8.44 | 79.59 | 40 | 98.00 | 19.57 |
21 | −855.26 | 3.6 | 16.32 | 8.13 | 93.48 | 60 | 97.03 | 31.72 |
13 | −121.51 | 3.6 | 16.32 | 8.8 | 66.03 | 45 | 98.00 | 31.03 |
14 | −177.70 | 3.6 | 16.32 | 8.42 | 80.42 | 53 | 97.71 | 31.45 |
13 | −121.51 | 5.5 | 52.67 | 8.4 | 81.26 | 95 | 93.72 | 75.95 |
MOI method
The MOI takes the minimum of the sub-indices:
$$\mathrm{MOI} = \min(SI_1, SI_2, \ldots, SI_n)$$
Here,
n = number of parameters,
SIi = sub-index of the ith parameter.
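A minimal Python sketch of this aggregation, assuming that the MOI is the minimum of the sub-indices, which is consistent with the MOI values listed for the samples in this study. The function name `moi` is illustrative.

```python
def moi(sub_indices):
    """Minimum operator index: the lowest sub-index sets the overall score."""
    if not sub_indices:
        raise ValueError("at least one sub-index is required")
    return min(sub_indices)

# First Dataset 2 sample in Table 7: SIT, SIDO, SIpH, SITur, SIFC.
print(moi([71.45, 39.55, 70.21, 84.00, 1.53]))
```

With a minimum operator, a single poorly rated parameter (here FC) determines the class of the whole sample, regardless of how the remaining parameters are rated.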
Top 20 samples of water quality parameters and their corresponding SI values calculated using MOI for Datasets 1 and 2
Dataset 1 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
T | SIT | DO | SIDO | pH | SIpH | Tur | SITur | FC | SIFC | MOI value |
29.67 | 92 | 9.31 | 105 | 2.31 | 23 | 201.35 | 940 | – | – | 23 |
29.67 | 92 | 9.31 | 105 | 2.31 | 23 | 201.35 | 940 | – | – | 23 |
29.48 | 92 | 10.1 | 114 | 2.35 | 23 | 126.68 | 591 | – | – | 23 |
29.48 | 92 | 10.1 | 114 | 2.35 | 23 | 126.68 | 591 | – | – | 23 |
29.67 | 92 | 9.05 | 102 | 2.42 | 24 | 107.67 | 502 | – | – | 24 |
29.67 | 92 | 9.05 | 102 | 2.42 | 24 | 107.67 | 502 | – | – | 24 |
29.67 | 92 | 9.24 | 104 | 2.35 | 23 | 238.68 | 1114 | – | – | 23 |
29.67 | 92 | 9.24 | 104 | 2.35 | 23 | 238.68 | 1114 | – | – | 23 |
29.48 | 92 | 9.39 | 106 | 1.74 | 17 | 204.74 | 955 | – | – | 17 |
29.48 | 92 | 9.39 | 106 | 1.74 | 17 | 204.74 | 955 | – | – | 17 |
29.48 | 92 | 7.67 | 87 | 6.91 | 67 | 33 | 154 | – | – | 67 |
29.48 | 92 | 7.67 | 87 | 6.91 | 67 | 33 | 154 | – | – | 67 |
29.48 | 92 | 8.62 | 97 | 6.64 | 65 | 34.35 | 160 | – | – | 65 |
29.48 | 92 | 8.62 | 97 | 6.64 | 65 | 34.35 | 160 | – | – | 65 |
29.39 | 91 | 9.62 | 109 | 5.58 | 54 | 45.89 | 214 | – | – | 54 |
29.39 | 91 | 9.62 | 109 | 5.58 | 54 | 45.89 | 214 | – | – | 54 |
29.48 | 92 | 9.36 | 106 | 6.34 | 62 | 48.61 | 227 | – | – | 62 |
29.48 | 92 | 9.36 | 106 | 6.34 | 62 | 48.61 | 227 | – | – | 62 |
29.48 | 92 | 9.31 | 105 | 6.64 | 65 | 52 | 243 | – | – | 65 |
29.48 | 92 | 9.31 | 105 | 6.64 | 65 | 52 | 243 | – | – | 65 |
Dataset 2 | ||||||||||
T | SIT | DO | SIDO | pH | SIpH | Tur | SITur | FC | SIFC | MOI value |
23 | 71.45 | 3.5 | 39.55 | 7.19 | 70.21 | 18 | 84.00 | 170 | 1.53 | 1.53 |
16 | 49.70 | 3.8 | 42.94 | 7.45 | 72.75 | 42.15 | 196.70 | 53 | 0.48 | 0.48 |
16 | 49.70 | 2.5 | 28.25 | 7.99 | 78.02 | 46.7 | 217.93 | 63 | 0.57 | 0.57 |
16 | 49.70 | 2.3 | 25.99 | 8.05 | 78.61 | 47.15 | 220.03 | 55 | 0.50 | 0.50 |
10 | 31.06 | 4.6 | 51.97 | 8.18 | 79.88 | 22 | 102.67 | 57 | 0.51 | 0.51 |
10 | 31.06 | 4.2 | 47.45 | 8.41 | 82.12 | 28 | 130.67 | 57 | 0.51 | 0.51 |
12 | 37.28 | 3.6 | 40.68 | 8.6 | 83.98 | 34.8 | 162.40 | 40 | 0.36 | 0.36 |
13 | 40.38 | 3.1 | 35.03 | 7.99 | 78.02 | 30.2 | 140.93 | 70 | 0.63 | 0.63 |
12 | 37.28 | 3.6 | 40.68 | 8.3 | 81.05 | 32.7 | 152.60 | 45 | 0.41 | 0.41 |
12 | 37.28 | 3.9 | 44.06 | 8.5 | 83.00 | 34.7 | 161.93 | 55 | 0.50 | 0.50 |
14 | 43.49 | 3.4 | 38.42 | 8.18 | 79.88 | 50.2 | 234.27 | 28 | 0.25 | 0.25 |
13 | 40.38 | 2.4 | 27.12 | 8.34 | 81.44 | 49.3 | 230.07 | 46 | 0.41 | 0.41 |
14 | 43.49 | 2.4 | 27.12 | 8.4 | 82.02 | 65 | 303.33 | 44 | 0.40 | 0.40 |
14 | 43.49 | 3 | 33.90 | 8.3 | 81.05 | 56 | 261.33 | 60 | 0.54 | 0.54 |
14 | 43.49 | 3 | 33.90 | 8.44 | 82.41 | 60 | 280.00 | 50 | 0.45 | 0.45 |
12 | 37.28 | 2.4 | 27.12 | 8.44 | 82.41 | 66 | 308.00 | 40 | 0.36 | 0.36 |
21 | 65.23 | 3.6 | 40.68 | 8.13 | 79.39 | 13 | 60.67 | 60 | 0.54 | 0.54 |
13 | 40.38 | 3.6 | 40.68 | 8.8 | 85.93 | 34 | 158.67 | 45 | 0.41 | 0.41 |
14 | 43.49 | 3.6 | 40.68 | 8.42 | 82.22 | 30.35 | 141.63 | 53 | 0.48 | 0.48 |
13 | 40.38 | 5.5 | 62.14 | 8.4 | 82.02 | 330 | 1540.00 | 95 | 0.86 | 0.86 |
Machine learning techniques
Five machine learning algorithms were applied in this study. They were selected because they are easy to use and interpret and are computationally efficient. These algorithms have been widely tested on small to medium-sized datasets and have been shown to give satisfactory performance in a minimal amount of time (Ashari et al. 2013).
k-Nearest Neighbors
Logistic Regression
Naive Bayes
Multilayer Perceptron
An MLP is an artificial neural network (ANN) (Fausett 2006) with a multilayer architecture in which neurons are connected to each other by a set of links called synapses. Each link has a synaptic weight. The neurons are arranged in the layers of the network and work in parallel. The first layer is the input layer, whose nodes simply hold the unprocessed information that enters the network; the input layer does not perform any computations. Next come the hidden layers, of which a network can have none, one or many. The hidden layers are responsible for increasing the performance of the network. The last layer is the output layer, which performs the calculations that give the output of the whole network. The behavior of the output layer depends on the activity of the hidden layers.
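The layered computation described above can be illustrated with a small forward pass in plain Python. This is only a sketch: the sigmoid activation, the layer sizes and all weight values are assumptions chosen for illustration and are not the configuration used in this study.

```python
import math

def sigmoid(x):
    """Logistic activation squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Forward pass through one hidden layer and one output layer."""
    # The input layer only passes values on; computation starts in the hidden layer.
    hidden = dense(inputs, hidden_w, hidden_b)
    return dense(hidden, output_w, output_b)

# Hypothetical synaptic weights: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_w = [[0.5, -0.4], [0.3, 0.8]]
hidden_b = [0.0, -0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

output = forward([0.6, 0.9], hidden_w, hidden_b, output_w, output_b)
print(output)
```

Training would adjust the synaptic weights so that the output matches the known class labels; that step is omitted here.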
Decision Tree
DT is a hierarchical structure of decisions and their outcomes. It is used to identify the path for reaching a specific goal, and it classifies each instance into one of a set of predefined classes. DT is very popular because of its simplicity. It is made up of nodes and edges. The node with no incoming edges is called the ‘root’. A node with outgoing edges is called an ‘internal node’. All other nodes are called ‘leaves’. DT splits each internal node according to the value of a single attribute (Maimon & Rokach 2014).
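The single-attribute split at an internal node can be sketched as follows. The example assumes Gini impurity as the split criterion, which is one common choice and is not confirmed as the criterion used in this study; the pH readings and labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one attribute."""
    best = (None, float("inf"))
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(ordered, ordered[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical pH readings with illustrative class labels.
ph_values = [2.3, 2.4, 6.6, 6.9, 7.2, 7.5]
classes = ["Poor", "Poor", "Good", "Good", "Good", "Good"]
threshold, impurity = best_threshold(ph_values, classes)
print(threshold, impurity)
```

A full tree repeats this search over every attribute at every internal node and stops when the nodes are pure or a size limit is reached.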
FINDINGS AND DISCUSSION
This section first discusses the results of applying the five indices to the two datasets and then analyzes in detail the classification of water quality using machine learning algorithms.
Analysis of water quality indices
In general, the indices assigned Dataset 1 a ‘Poor’, ‘Unsuitable’ or ‘Unfit for Drinking’ water quality status. Figure 4 shows the water quality indices calculated using the five methods for Dataset 1, displaying the five months of 2019 and the respective count of water quality classes for the samples in each month. The WAWQI calculated for October places the water quality mostly in the ‘Excellent’ class, whereas the quality is ‘Unfit’ for November and December. The CCME-WQI for Dataset 1 lies in the ‘Poor’ category for all five months. For NSF-WQI, the water quality varies from ‘Poor’ to ‘Excellent’ in December, while in the other months it remains ‘Poor’. The OWQI classified all the water samples in Dataset 1 as ‘Very Poor’, an outcome broadly in line with the other indices. However, some parameters had to be excluded to calculate these indices. For example, to calculate the OWQI, two parameters, namely ‘conductivity’ and ‘Tur’, were ignored because the OWQI has no SI range for them, as seen in Table 5. Using only four parameters may affect the information obtained from this index. For MOI, the samples were mostly categorized as ‘Totally Unsuitable’ for the months of October to December.
The water quality for Dataset 2 can be seen in Figure 5. Here, the WAWQI categorized the samples as ‘Unfit’ for all the years. The problem identified for the WAWQI is that its values are affected by the addition of the ‘FC’ parameter, which causes the samples to be classified as ‘Unfit for Drinking’. The CCME-WQI for Dataset 2 lies in the ‘Poor’ category for all the years. The drawback of this index is that all the samples are used to compute a single CCME value, which is then assigned to every sample and yields a single CCME-WQI class. The NSF-WQI for Dataset 2 mostly falls in the ‘Unclassified’ class, which was assigned to samples that do not fall within the rating ranges defined by the index. This makes the NSF-WQI inapplicable to some types of water samples. The OWQI categorized most of the water samples in Dataset 2 as ‘Very Poor’. Similarly, the MOI categorized the samples as ‘Totally Unsuitable’ for all the years.
Figure 6(a) shows the month-wise comparison of the indices for Dataset 1. The X-axis represents the months, whereas the classes ‘Excellent’, ‘Good’, etc. are assigned numerical values on the Y-axis so that the five indices can be compared and month-wise changes observed. The indices mostly lie in the 4–5 range, which represents the ‘Poor’ or ‘Unfit’ class, and therefore indicate poor water quality for Rawal Lake. Moreover, the results computed for Dataset 1 with NSF-WQI and WAWQI are the same throughout the months of June to December. Similarly, CCME-WQI and OWQI show the same classification for the water quality of Rawal Lake. For Dataset 2, the water quality belonged to either class 4 or class 5, which represent the ‘Poor’ or ‘Unfit’ category, as seen in Figure 6(b). Some samples are unclassified because they did not fall within the range specified by the respective WQI; these are represented by a value of 10 in Figure 6(b). Here, the CCME-WQI, MOI and WAWQI have assigned a similar classification to Rawal Lake throughout the years.
(a) Comparison of indices over the months June 2019 to December 2019 (Dataset 1), where the Y-axis represents the class 0–5 for each index and the X-axis represents the months. (b) Comparison of indices over the years 2012–2019 (Dataset 2), where the Y-axis represents the class 0–5 for each index and the X-axis represents the years.
The indices have been computed with a limited set of parameters obtained from Rawal Dam Lake. The main constraint of this study was that these indices are developed around certain selected parameters, and the absence of some major parameters can affect the outcomes. The indices can be unpredictable, and every index has its own limitations or disadvantages. The WAWQI is very sensitive to individual parameters, as a single parameter with a high concentration value can change the index classification. Similarly, the NSF-WQI loses important information during data processing, as the classification depends on the weights assigned to each parameter. Moreover, this index works well only if the parameters involved are independent of each other. It requires all nine parameters, and excluding any one may affect its performance. The calculation of CCME-WQI is a subjective process, as it involves the combination of three factors and more mathematical calculations than the other indices. In addition, these indices are not generic and cannot be applied to all water types. Most indices are created for a specific location and may not be applicable to other sites, including the CCME-WQI, which was developed for the province of British Columbia, Canada, and the OWQI, which was developed for the state of Oregon.
Numerous WQIs have been developed for classifying water quality, but they have limited global applicability. The validity of these indices depends on the data handling process, as valuable information might be lost. Consider the NSF-WQI, in which eight out of nine parameters may have a satisfactory value while pH has the value zero. The water body would still be assigned a high class, which would be invalid, as water with such a low pH would not be able to support aquatic life. The ability of these indices to handle missing values, outliers and other anomalies is still unknown. The removal or exclusion of certain parameters may influence the outcome of the indices. However, these indices may prove effective under specific conditions for the water body of interest.
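The NSF-WQI case above can be made concrete with a short calculation. The Python sketch below assumes the additive (weighted arithmetic) form of the index and uses the original pH weightage of 0.11 from Table 4; the other eight parameters are lumped into one hypothetical entry carrying the remaining weight, and the sub-index values are illustrative.

```python
def weighted_sum_index(sub_indices, weights):
    """Weighted arithmetic aggregation (assumed additive NSF-WQI-style form)."""
    return sum(weights[name] * sub_indices[name] for name in weights)

# pH weight (0.11) is from Table 4; "other_eight" is a hypothetical stand-in
# for the remaining eight parameters, which together carry a weight of 0.89.
weights = {"pH": 0.11, "other_eight": 0.89}
sub_indices = {"pH": 0.0, "other_eight": 95.0}  # illustrative ratings on a 0-100 scale

additive = weighted_sum_index(sub_indices, weights)
minimum = min(sub_indices.values())
print(round(additive, 2), minimum)
```

The additive score remains at 84.55 even though one parameter is rated 0, whereas a minimum operator applied to the same ratings would return 0. This illustrates how a weighted sum can mask a single critical parameter.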
Furthermore, the data collected through IoT nodes have their own limitations, as the nodes are deployed at the edges of the inlet and outlet streams and there is no access to the center and other parts of the dam for data collection. Due to this constraint, the dataset suffered from class imbalance, as the water at the edges and near the bank of the dam has high Tur. This meant that the majority of the data samples were of the same category or class, leading to a lack of variation in the samples gathered. From this it can be inferred that factors such as location, time and data collection frequency may have a noticeable impact on the computation of water quality. Similarly, the five WQIs give a water quality classification of ‘Poor’ or ‘Very Poor’ for Rawal Dam, which suggests that the collected datasets suffer from class imbalance. The high number of distinct values for the parameters led to a skewed distribution, which has introduced this imbalance.
Classification of water quality using machine learning
The datasets used for classification were Dataset 1 and Dataset 2, with WAWQI and OWQI as the class variables, respectively. For water quality classification, five machine learning classifiers were applied to both datasets: DT, MLP, LogR, NB and KNN. NB and DT are less time consuming and gave a satisfactory performance compared with MLP, which is known for its flexibility and high classification accuracy (Su & Zhang 2006; Yang et al. 2015; Hemalatha & Rani 2017).
For classification of the water quality status, the WAWQI classes, namely Excellent (E), Good (G), Fair (F), Poor (P), Very Poor (VP) and Unfit for Drinking (U), were used for Dataset 1. The dataset has five parameters, namely ‘T’, ‘Tur’, ‘pH’, ‘DO’ and ‘conductivity’. DT achieved an accuracy of 99.6% on the test set of Dataset 1, while KNN and NB also performed well, with 95% and 90% accuracy, respectively. Table 8 shows the evaluation results. Figure 7 displays the confusion matrices of MLP, KNN, DT, NB and LogR. These matrices show the actual and predicted samples for all six classes. With DT, 365/367 samples of class ‘Excellent’ were correctly classified, whereas two were misclassified as class ‘Good’. Similarly, 19/19 class ‘Good’ samples were correctly classified. For each of the ‘Fair’ and ‘Poor’ classes, one sample was misclassified as class ‘Excellent’.
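The class-wise metrics follow directly from these confusion-matrix counts. As a worked example in Python, the sketch below computes precision, recall and F1-score for the ‘Good’ class under DT, assuming that the two misclassified ‘Excellent’ samples are the only false positives for that class. The function names are illustrative.

```python
def precision(tp, fp):
    """Share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that are recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# DT on Dataset 1, class 'Good': 19 correct, none missed,
# and 2 'Excellent' samples predicted as 'Good'.
p_good = precision(tp=19, fp=2)
r_good = recall(tp=19, fn=0)
f1_good = f1_score(p_good, r_good)
print(round(100 * p_good), round(100 * r_good), round(100 * f1_good))
```

The resulting values (90, 100 and 95%) agree with the ‘Good’ column of the DT row in Table 8.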
Class-wise precision (Pre), recall (Rec) and F1-score (F1-Sc) (%) on the test set (Dataset 1)
Classifier | Pre E | Pre G | Pre F | Pre P | Pre VP | Pre U | Rec E | Rec G | Rec F | Rec P | Rec VP | Rec U | F1-Sc E | F1-Sc G | F1-Sc F | F1-Sc P | F1-Sc VP | F1-Sc U |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DT | 99 | 90 | 100 | 100 | 100 | 100 | 99 | 100 | 95 | 97 | 100 | 100 | 99 | 95 | 98 | 99 | 100 | 100 |
MLP | 98 | 0 | 0 | 0 | 0 | 77 | 66 | 0 | 0 | 0 | 0 | 100 | 79 | 0 | 0 | 0 | 0 | 83 |
KNN | 95 | 50 | 71 | 59 | 68 | 98 | 100 | 26 | 24 | 47 | 74 | 99 | 97 | 34 | 36 | 52 | 71 | 99 |
NB | 88 | 0 | 0 | 0 | 78 | 95 | 99 | 0 | 0 | 0 | 81 | 97 | 93 | 0 | 0 | 0 | 79 | 96 |
LogR | 87 | 0 | 0 | 0 | 0 | 89 | 100 | 0 | 0 | 0 | 0 | 100 | 93 | 0 | 0 | 0 | 0 | 94 |
For Dataset 2, the OWQI classes, namely Excellent (E), Good (G), Fair (F), Poor (P), Very Poor (VP) and Unclassified (Un), were used. The models were trained for supervised classification of the OWQI classes on Dataset 2, which has six parameters: ‘T’, ‘Tur’, ‘pH’, ‘DO’, ‘FC’ and ‘conductivity’. MLP, KNN, NB, LogR and DT were applied to predict the OWQI classes. DT gave the best performance, with 96% accuracy on the test set, followed by KNN, NB and LogR. The evaluation results are displayed in Table 9. Figure 8 displays the confusion matrices for MLP, KNN, DT, NB and LogR. These matrices show the predicted and actual samples for all six classes. DT correctly classified 17/20 samples of the ‘Excellent’ class; the remaining three were misclassified as class ‘Good’. Similarly, 5/5 ‘Good’ class samples were correctly predicted. For the ‘Fair’ class, one sample was misclassified as ‘Poor’ and three samples were misclassified as ‘Good’. For the ‘Poor’ class, one sample was misclassified as ‘Fair’, and for the ‘Unclassified’ class, one sample was misclassified as ‘Very Poor’. All 176/176 samples of the class ‘Very Poor’ were correctly predicted. The evaluation results for both datasets are displayed in Table 10.
Class-wise precision (Pre), recall (Rec) and F1-score (F1-Sc) (%) on the test set (Dataset 2; U = Unclassified)
Classifier | Pre E | Pre G | Pre F | Pre P | Pre VP | Pre U | Rec E | Rec G | Rec F | Rec P | Rec VP | Rec U | F1-Sc E | F1-Sc G | F1-Sc F | F1-Sc P | F1-Sc VP | F1-Sc U |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DT | 100 | 45 | 86 | 89 | 99 | 100 | 85 | 100 | 60 | 89 | 100 | 67 | 92 | 62 | 71 | 89 | 100 | 80 |
NB | 100 | 33 | 100 | 89 | 99 | 30 | 85 | 100 | 20 | 89 | 96 | 100 | 92 | 50 | 33 | 89 | 98 | 46 |
LogR | 83 | 10 | 0 | 43 | 97 | 67 | 100 | 20 | 0 | 33 | 99 | 67 | 91 | 13 | 0 | 38 | 98 | 67 |
KNN | 48 | 0 | 33 | 14 | 89 | 0 | 65 | 0 | 10 | 11 | 90 | 0 | 55 | 0 | 15 | 12 | 90 | 0 |
MLP | 0 | 0 | 0 | 0 | 76 | 0 | 0 | 0 | 0 | 0 | 84 | 0 | 0 | 0 | 0 | 0 | 80 | 0 |
Accuracy (Acc) and average Pre, Rec and F1-Sc (%) on the test set (Datasets 1 and 2)
Classifier | Acc (Dataset 1) | Avg Pre (Dataset 1) | Avg Rec (Dataset 1) | Avg F1-Sc (Dataset 1) | Acc (Dataset 2) | Avg Pre (Dataset 2) | Avg Rec (Dataset 2) | Avg F1-Sc (Dataset 2) |
---|---|---|---|---|---|---|---|---|
DT | 99.65 | 100 | 100 | 100 | 95.96 | 97 | 96 | 96 |
KNN | 95.15 | 93 | 94 | 93 | 91.5 | 76 | 78 | 77 |
NB | 90.4 | 86 | 90 | 88 | 91.5 | 97 | 91 | 92 |
LogR | 88 | 78 | 88 | 83 | 89.68 | 87 | 90 | 88 |
MLP | 76.8 | 71 | 77 | 72 | 66 | 60 | 66 | 63 |
Water quality classification using machine learning algorithms can prove to be a more effective and reliable method than the WQI, as the WQI performs its mathematical calculations on the current instance data only, whereas machine learning algorithms consider the historical data or previous trends for water quality classification. Nevertheless, the WQI calculation can help to label a dataset so that classification can be performed using machine learning algorithms.
The findings therefore revealed that applying the WQIs to water samples exposed the unpredictable nature of each index, as each index comes with its own limitations: (1) sensitivity to a high concentration of a specific parameter, (2) dependence on the weights assigned to each parameter, and (3) applicability only to certain locations or water types. Moreover, classification of water quality may be more effective when other environmental factors (Pu et al. 2016, 2019; Pu 2019; Pu et al. 2020) are considered along with the classic water quality parameters already used to compute WQIs. Using such factors may eliminate the uncertainties and biases introduced by the selection of the location, the type and number of parameters, and the weights assigned to these parameters. There is therefore a need to develop a more advanced version of the water pollution index that takes as input topographical and hydrological parameters, including slope and lineament density, and environmental parameters such as vegetation resistance, velocity distribution and natural irregular channel impact to the dam. These parameters, when combined with machine learning techniques, could prove to be a more effective way to predict the water quality of any location.
CONCLUSION
In this study, the quality of Rawal Dam Lake's water was studied by applying five widely used water quality indices to two distinct datasets, one collected in real time using IoT sensors and one collected through GIS-based grab sampling. The findings have shown that the indices may be affected by how the data are collected, the number and type of parameters, the time and frequency of measurement, and the weights allocated to each parameter by the respective index, which can increase the WQI value and bias the class allocation. The limitations of this study include: (1) the uneven distribution of the types of water samples in the datasets and (2) the use of only six physico-chemical parameters, leading to water quality being categorized as either ‘Poor’ or ‘Very Poor’. Moreover, this paper used the Rawal Dam data to examine whether machine learning could be useful for determining the class of water quality instead of the WQI. The analysis showed that the DT algorithm had the highest accuracy, 99.6%, and could be regarded as suitable for classifying the water quality of the dataset used. The indices, however, have their own limitations and are mostly developed for a specific location/area or water type, making them generally less applicable to all water types and locations. Therefore, these water quality indices need to be updated to eliminate the uncertainties and biases introduced by the selection of the location, the type and number of parameters, and the weights assigned to these parameters. To this end, there is a need to develop more advanced and enhanced water pollution indices, based on other parameters such as topographical, environmental and hydrological parameters, including slope, lineament density, land use/land cover and rainfall, combined with machine learning techniques that can effectively contribute to estimating the quality of water for all regions.
ACKNOWLEDGEMENTS
We would like to thank the Rawal Dam Water Filtration Plant for providing the data (referred to as Dataset 2) and fulfilling other requirements for research purposes.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.