Abstract
When environmental disasters affect water resources, deploying an emergency monitoring network to assess water quality is among the first measures to be taken. Emergency networks usually cover a large set of water quality variables and monitoring stations along the watershed. Focusing on variables that represent greater risk to the environment and have a less predictable spatial and temporal distribution is a strategy to optimize monitoring efforts. The goal of this study is to assess the use of Shannon's entropy to identify non-critical water quality variables in an emergency monitoring network implemented in a watershed impacted by the collapse of an iron mining tailings dam, the Doce River watershed (Brazil). Monitoring stations were grouped into water quality subregions through cluster analysis, and Shannon's entropy was used to estimate the information redundancy of monitored variables. Based on information redundancy, and after checking for compliance with environmental regulations, non-critical water quality variables were identified. Results indicated that non-critical variables represent 32–50% of the variables monitored. This method offers emergency network managers a robust tool to improve network performance. However, special attention should be paid to the presence of outliers, which can bias analyses based on Shannon's entropy.
HIGHLIGHTS
Water quality variable selection criteria based on Shannon's entropy have not been explored in emergency monitoring networks.
The selection criterion based on the concept of ‘non-critical variables’ is appropriate for water quality variables.
The set of selected non-critical variables was substantial and demonstrates the potential of the method to guide adjustments in emergency networks.
Metals and metalloids are predominant in terms of the chemical nature of the main non-critical variables in the Doce River watershed.
INTRODUCTION
Water is one of the main natural resources affected by disasters involving environmental impacts, such as mine tailings dam breaks (da Cunha Richard et al. 2020; Baudson et al. 2021), leakage or spillage of toxic and harmful substances (Raven & Georg 1989; Pedrosa 2007; Hou 2012), gas pipe drilling (Graham & Wilcox 2021), and spills of ash (Deonarine et al. 2013), oil (He et al. 2023), vinasse (da Silva et al. 2022), radioactive substances (Koo et al. 2014) and other hazardous residues from anthropogenic activities (e.g. Guo & Duan 2021; Cacciuttolo & Cano 2022). The COVID-19 pandemic has shown that health disasters can also have environmental implications and influence water consumption (Alvisi et al. 2021; Berglund et al. 2022), wastewater characteristics (Sharif et al. 2021; de Araújo et al. 2022) and water quality (Jerez et al. 2023; Lian et al. 2023).
Disasters related to water pollution are often sudden, difficult to predict and lead to serious environmental, social and economic impacts (Hou et al. 2014). Regarding space-time behavior, the impacts resulting from environmental disasters can range from short to long-term and can be localized or cover large areas (Semenova 2020; Gabriel et al. 2021).
Environmental disasters can directly impact water resources and aquatic ecosystems. Changes in water quality, alterations in the morphology of riverbed sediment, and the bioavailability of pollutants previously immobilized in sediment can occur (Miller et al. 2023). Regarding biological aspects, there may be mortality of fish and other aquatic organisms, habitat destruction, mutagenicity, bioaccumulation of pollutants, carcinogenicity, and genotoxicity (Deonarine et al. 2013; Graham & Wilcox 2021; da Silva et al. 2022; Lusweti et al. 2022).
Additionally, depending on the characteristics of pollutants released by disasters, there may be persistence in the environment and a risk to human health (Deonarine et al. 2013; Graham & Wilcox 2021; Lusweti et al. 2022).
The occurrence of such events requires quick actions from the stakeholders aiming at mitigating impacts and, when possible, avoiding additional losses (Tang et al. 2019; He et al. 2023). Environmental monitoring data, especially water quality data, will play an important role, both for short-term responses such as helping to predict space–time dispersion of pollutants (Ding & Fang 2019; Pereira et al. 2021), and for medium and long-term responses as for designing and monitoring recovery strategies.
Emergency networks for water quality monitoring usually differ from pre-existing regulatory networks. First, disaster impacts on aquatic ecosystems are, as a rule, little known because they may result in complex mechanisms involving physicochemical and biological interactions which may have synergistic impacts on aquatic biota and be amplified over the trophic chain (Brinkmann & Rowan 2018; Zorzal-almeida & Fernandes 2021). Consequently, emergency monitoring networks cover a broader spectrum of pollutants (some unusual in regulatory networks but directly related to disasters), in an attempt to encompass all likely impacts on water uses and aquatic ecosystem.
Additionally, emergency monitoring networks have a greater number of monitoring stations aiming at producing a great volume of spatially distributed information; however, the length of monitored time series is generally short (UNEP 2005; Shi et al. 2018; Jing et al. 2019; Manley et al. 2020; Mendes et al. 2022; Oehrig et al. 2023; Pacheco et al. 2023; Wild et al. 2023). Finally, data analysis from emergency monitoring networks is a challenging task because many statistical methods make assumptions regarding data distribution and/or require many observations (N) compared to the number of variables, known in the literature as ‘small N large P problem’ (Mirauda & Ostoich 2020).
Studies on emergency networks for water quality monitoring are scarce in the literature, but water quality monitoring networks for regulatory purposes have been the object of numerous studies in the past decades (Karamouz et al. 2009). These studies aimed to identify the best scenarios regarding the monitored variables (Khalil et al. 2010; Barcellos & Souza 2022), monitoring frequency (Do et al. 2013; da Luz et al. 2022), number of samples and location of monitoring stations (Nguyen et al. 2020; Reina-García et al. 2020; de Almeida et al. 2022). The large number of water quality variables leads to the need for strategies for optimizing monitoring networks (Calazans et al. 2018a); however, most of the studies have focused on other aspects of monitoring network design.
In a review study, Nguyen et al. (2019) found 14 research studies from a total of 311 that investigated strategies for the selection of critical water quality variables. Identifying non-critical variables allows distinct monitoring strategies to be proposed for them, such as a reduction in monitoring frequency aimed at lowering costs without loss of relevant information.
As for the methods used for assessing water quality variables relevance in monitoring programs, there has been a predominance of statistical methods, with emphasis on principal components analysis (Calazans et al. 2018b), correlation-regression (Khalil et al. 2010) and discriminant analysis (Wang et al. 2014). Alternative approaches are considered promising, especially those based on information entropy (Shannon 1948).
Shannon's entropy, also known as information entropy, measures uncertainty about random processes and has been applied to several water resources engineering problems (e.g. Baran et al. 2017; Xiong et al. 2018; Liuzzo et al. 2019; Wang et al. 2021). Recent research has highlighted that information entropy remains a valuable tool for hydrological studies (Yazdi 2018; Singh et al. 2019; Mirauda & Ostoich 2020; Ursulak & Coulibaly 2021). In absolute numbers, however, such studies are still few, and much of the method's potential remains unexplored (Singh et al. 2019). By 2023, only nine studies using information entropy for the selection of water quality variables had been published (Jiang et al. 2020a; Barbaros 2022). Our literature review revealed that none of them addressed water quality variables within the context of emergency networks. On the other hand, Shannon's entropy is a widely employed method in research concerning the design, redesign and optimization of hydrological monitoring networks (Keum et al. 2017; Nguyen et al. 2019; Jiang et al. 2020a).
An optimized monitoring of water quality should focus on variables that pose a greater risk to the environment and present a less predictable spatial and time distribution. In this sense, Shannon's entropy can be a useful tool by treating water quality parameters as random variables and allowing quantifying their information content (uncertainty). A more predictable water quality variable is a candidate for lower-frequency monitoring. In this context, the goal of this study is to assess the use of Shannon entropy to identify non-critical water quality variables in the context of emergency monitoring networks implemented in watersheds impacted by environmental disasters.
Background on information entropy
Marginal entropy can be applied to both discrete and continuous data, measured at different time scales (annual, monthly, seasonal or others), to assess the uncertainty associated with the data set (Shannon 1948; Singh et al. 2019). Maximum entropy is the entropy of the probability distribution that maximizes entropy subject to the existing constraints. In the case of discrete variables with a defined domain [a,b] (finite interval) and no restrictions on moments, the uniform distribution maximizes entropy (Singh 2013; Singh et al. 2019). When formulating the maximum entropy solution, estimates improve if physical principles are considered, which helps to eliminate physically inconsistent solutions (Perdigão et al. 2020).
Information redundancy assumes values in the range [0,1]. Variables with redundancy close to 0 exhibit a high degree of uncertainty, making them highly informative. Conversely, variables with high predictability, indicated by information redundancy approaching the value of 1, are less informative (Shannon & Weaver 1949; Singh 2013).
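For reference, the quantities described above can be written in their standard textbook form. The notation below ($p(x_i)$, $n$, $H$, $H_{\max}$, $H_r$, $R$) is an assumption made for illustration and may differ from the symbols used in the study's own equations; the base-2 logarithm is used because entropies are reported in bits.

```latex
% Marginal (Shannon) entropy of a discrete variable X with n states
H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)

% Relative entropy: marginal entropy scaled by the maximum entropy
H_r(X) = \frac{H(X)}{H_{\max}(X)}

% Information redundancy, bounded in [0, 1]
R(X) = 1 - \frac{H(X)}{H_{\max}(X)}
```

With these definitions, $R$ approaches 1 when $H$ is much smaller than $H_{\max}$ (a highly predictable variable) and approaches 0 when $H$ is close to $H_{\max}$.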
MATERIAL AND METHODS
Study area
The regional climate is characterized by a humid tropical type, exhibiting clearly defined seasonality. The wet season in the watershed contributes to 85% of the annual precipitation and spans from October to March, while the dry season persists from April to September. The annual rainfall ranges from 900 to 1,500 mm and the air temperature exceeds 18 °C (Kütter et al. 2023).
Doce River historically records high values of thermotolerant coliforms, turbidity, and total phosphorus. The presence of some metals above the permitted levels, such as dissolved iron, total manganese, and total lead, was highlighted in the watershed's water resources plan, occurring at several locations and indicating the impact of industrial and agricultural activities (ANA 2016).
Mining is one of the main economic activities in the watershed, especially in the upper Doce River region, and large mining projects have been conducted in this region for decades (Espindola et al. 2017). On 5 November 2015, at the Germano mining complex, in the municipality of Mariana (MG), the Fundão dam collapsed. The reservoir accumulated approximately 50 million m3 of iron mining tailings. The released tailings reached the Santarém reservoir, causing its overtopping and forcing the wave to pass along 55 km in the Gualaxo do Norte River until it flowed into the Carmo River. From there, the plume traveled 22 km until it reached the Doce River, where the small hydroelectric reservoir Risoleta Neves retained part of the tailings (Figure A1 – Supplementary material). From there, it continued to the watershed outlet, where the tailings were released into the Atlantic Ocean on 21 November 2015, totaling 663.2 km of directly impacted water bodies (IBAMA 2015).
Along its trajectory, the tailings caused numerous biophysical impacts on the river system and the coastal zone, affecting the channel and the banks of the Doce River, impairing the water quality, and making it unfit for human and animal consumption and for aquatic biota. In addition, water supply was suspended to 12 cities served directly by the Doce River, affecting an estimated population of 424,000 people (IBAMA 2015; Sánchez et al. 2018).
Data collection and treatment
This research uses data from the Systematic Quali-quantitative Water and Sediment Monitoring Program (PMQQS), operated by Renova foundation in the Doce River watershed (RENOVA 2017). The PMQQS program started on 31 July 2017, and is permanent. The network currently operates 92 stations distributed among the coastal, estuarine, Doce River, Doce River lagoons, and its tributaries. This study focused on lotic water bodies, comprising a total of 39 monitoring stations (Figure 1) (Table A1 – Supplementary material).
For this study, variables that had a percentage of invalidated data greater than 30% were discarded. Likewise, variables with the percentage of missing values above 30% were discarded. Out of the 90 water quality variables (refer to Table A2 in Supplementary material), 78 were chosen for analysis (see Table 1). Data below the limit of quantification or outside the limits of analytical detection (censored data) were replaced by the respective limits of detection. Considering all non-missing records as valid data, a total of 107,585 values from monthly monitoring conducted between August 2017 and December 2020 were used in this research.
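A minimal sketch of this screening step is shown below, in Python for illustration only. The record encoding (`None` for a missing value, the string `"<LOD"` for a censored one), the function names and all numbers are hypothetical and are not taken from the PMQQS data set.

```python
CENSORED = "<LOD"  # hypothetical flag marking a censored record


def missing_fraction(series):
    """Share of missing (None) records in a series."""
    return sum(v is None for v in series) / len(series)


def screen_and_fill(data, detection_limits, max_missing=0.30):
    """Drop variables with too many missing records and replace
    censored records by the variable's detection limit."""
    cleaned = {}
    for name, series in data.items():
        if missing_fraction(series) > max_missing:
            continue  # variable discarded: more than 30% missing
        lod = detection_limits[name]
        cleaned[name] = [lod if v == CENSORED else v for v in series]
    return cleaned


# Illustrative values only
data = {
    "total_iron": [0.52, 0.61, None, 0.48, 0.55],
    "dissolved_silver": ["<LOD", "<LOD", 0.002, "<LOD", None],
    "free_cyanide": [None, None, 0.01, None, 0.02],
}
limits = {"total_iron": 0.01, "dissolved_silver": 0.001, "free_cyanide": 0.005}
cleaned = screen_and_fill(data, limits)
```

In this toy example the third variable is dropped because three of its five records are missing, while the censored records of the second variable are set to its detection limit.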
Water quality variables – PMQQS network | | |
---|---|---|
Total alkalinity | Total copper | Ammoniacal nitrogen |
Dissolved aluminum | In situ conductivity | Total Kjeldahl nitrogen |
Total aluminum | True color | Organic nitrogen |
Dissolved antimony | Dissolved chromium | In situ dissolved oxygen |
Total antimony | Total chromium | In situ saturated dissolved oxygen |
Dissolved arsenic | BOD | Polyphosphate |
Total arsenic | Total hardness | In situ redox potential |
Dissolved barium | Escherichia coli | Dissolved silver |
Total barium | Pheophytin | Total silver |
Dissolved beryllium | Iron II | Dissolved selenium |
Total beryllium | Iron III | Total selenium |
Dissolved boron | Dissolved iron | Total sodium |
Total boron | Total iron | Total dissolved solids |
Dissolved cadmium | Dissolved phosphorus | Sedimentable solids |
Total cadmium | Total phosphorus | Total suspended solids |
Total calcium | Total magnesium | Total solids |
Dissolved organic carbon | Dissolved manganese | Sulfides as undissociated H2S |
Total organic carbon | Total manganese | Total sulfides |
Dissolved lead | Dissolved mercury | In situ sample temperature |
Total lead | Total mercury | In situ turbidity |
Free cyanide | Dissolved molybdenum | Dissolved vanadium |
Total chloride | Total molybdenum | Total vanadium |
Chlorophyll a | Dissolved nickel | Dissolved zinc |
Dissolved cobalt | Total nickel | Total zinc |
Total cobalt | Nitrate | In situ pH |
Dissolved copper | Nitrite | Laboratory pH |
It was also necessary to identify and remove outliers, because Shannon's entropy, and particularly the maximum entropy, is sensitive to them (Nooghabi & Nooghabi 2016). Several outlier detection algorithms were tested, and the Gini method (Nair 1936) gave the best estimates.
Setting water quality spatial and temporal boundaries
The Doce River watershed was divided into homogeneous water quality subregions by applying cluster analysis to identify groups of monitoring stations whose data showed strong similarity. This analysis was conducted with the aim of defining subregions with similar characteristics, as large watersheds may encompass different hydrological and environmental contexts (Robertson et al. 2006; Versiani et al. 2009; Baldan et al. 2022).
Cluster analysis is a multivariate statistical method usually used in exploratory analyses to identify patterns in datasets and to gather similar observations into groups (Kettenring 2006). In this study, Ward's method (Ward 1963) was adopted, which is an agglomerative hierarchical technique, and the Euclidean distance was taken as a measurement of dissimilarity. Ward's method forms quite homogeneous groups with minimal internal variance (Szekely & Rizzo 2005) and is extensively applied in water quality studies (Azhar et al. 2015; Lobo et al. 2015; Hajigholizadeh & Melesse 2017; Kändler et al. 2017; Pinto et al. 2018; Li et al. 2019; da Silva 2020; Jiang et al. 2020b).
Data were standardized, as suggested by Härdle & Simar (2015), to eliminate distortions arising from the different measurement scales of water quality variables. The ideal number of groups was determined by analysing the fusion behavior in the dendrogram (Hair et al. 2006) in combination with knowledge about the spatial distribution of stations.
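The grouping procedure can be illustrated with a small, dependency-free sketch of agglomerative clustering under Ward's criterion on standardized data. This is an illustration in Python, not the study's R implementation; the function names, the toy station values and the choice to pass the number of groups as a parameter (rather than reading it from a dendrogram) are assumptions.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column of a list of equal-length rows."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]


def centroid(points):
    """Column-wise mean of a list of points."""
    return [mean(c) for c in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * d2


def ward_clusters(rows, n_clusters):
    """Merge the cheapest pair of clusters until n_clusters remain."""
    z = standardize(rows)
    clusters = [[i] for i in range(len(z))]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: ward_cost(
            [z[k] for k in clusters[ij[0]]],
            [z[k] for k in clusters[ij[1]]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Four hypothetical stations described by two variables
stations = [[1.0, 10.0], [1.1, 11.0], [5.0, 50.0], [5.2, 49.0]]
groups = ward_clusters(stations, n_clusters=2)
```

Each returned group lists the row indices of the stations assigned to it. The naive pairwise search is adequate for a few dozen stations but is not meant for large data sets.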
To assess the temporal variability of water quality, data were also classified according to the period of the year in the wet (October to March) or dry (April to September) season.
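The seasonal split amounts to a simple rule on the sampling month, sketched here in Python; the function name is hypothetical.

```python
def season(month: int) -> str:
    """Classify a calendar month (1-12) as 'wet' (October to March)
    or 'dry' (April to September)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return "wet" if month >= 10 or month <= 3 else "dry"


# Label a year of monthly sampling campaigns
labels = [season(m) for m in range(1, 13)]
```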
Computing information redundancy of water quality variables
For every water quality variable observed in each monitoring station, redundancy was obtained from treated time series through (1) discretizing observed data; (2) calculating marginal entropies according to Equation (1); (3) calculating maximum entropy and; (4) obtaining relative entropy (Equation (2)) and redundancy (Equation (3)).
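Steps (2) to (4) can be sketched as follows, assuming the series has already been discretized and that the maximum entropy is supplied by a separate step. The code is an illustrative Python sketch with hypothetical function names; treating a variable whose maximum entropy is zero as fully redundant is an assumption, not a rule stated in the study.

```python
from collections import Counter
from math import log2


def marginal_entropy(states):
    """Shannon entropy (bits) of a sequence of discrete states,
    using relative frequencies as probabilities."""
    n = len(states)
    counts = Counter(states)
    # sum of p * log2(1/p) is the same as -sum of p * log2(p)
    return sum((c / n) * log2(n / c) for c in counts.values())


def redundancy(h, h_max):
    """Information redundancy R = 1 - H / Hmax, in [0, 1]."""
    if h_max == 0:
        return 1.0  # assumption: a single possible state is fully predictable
    return 1.0 - h / h_max


h = marginal_entropy([1, 1, 2, 2])  # two equally frequent states
r = redundancy(h, h_max=2.0)        # h_max chosen for illustration only
```

A series with a single repeated state yields zero entropy, which matches the null-entropy case discussed for low-variance variables.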
The procedure applied in this work for discretizing water quality variables is described in Section 2.5. Discretizing random variables was necessary because it is particularly difficult to determine the marginal entropy of continuous variables. The use of discrete distributions, even in monitoring networks where continuous variables prevail, has become increasingly frequent (Keum & Coulibaly 2017a; Mirauda & Ostoich 2020). Maximum entropy was obtained from uniform distribution and using Monte Carlo Simulation (MCS) as explained in Section 2.5.1.
An exploratory analysis was performed to assess marginal and maximum entropy and information redundancy considering each water quality subregion and data seasonality.
Quantization of continuous random variables
In the case of PMQQS program, many water quality variables concern trace elements and present censored values. Using Equation (2) as proposed by Alfonso et al. (2013) would distort the probability density function (PDF), leading to a loss of variability, impacting the number of discrete states which strongly affects marginal entropy values (Çengel 2021). To overcome this limitation, we proposed to approximate the quantized value of each observation to the number of decimals one unit lower than the sensitivity level of the variable's analytical method. For example, if the variable had a measurement precision represented by the fourth decimal digit, in the quantization process, the third decimal digit was assumed as the precision of the quantized variable. This ensured that relatively close values represented the same discrete state and that remarkably different values became distinct states of the new discrete variable after quantization.
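The quantization rule described above can be illustrated with a short Python sketch. The function name is hypothetical, and the use of rounding (rather than truncation) to drop the last decimal digit is an assumption, since the text only says that values are approximated.

```python
def quantize(values, precision_decimals):
    """Map observations to discrete states by keeping one decimal digit
    fewer than the analytical precision of the variable."""
    keep = max(precision_decimals - 1, 0)
    return [round(v, keep) for v in values]


# Variable measured to the fourth decimal digit, quantized to the third
observed = [0.0012, 0.0014, 0.0021, 0.0058]
states = quantize(observed, precision_decimals=4)
n_states = len(set(states))
```

Here the first two observations fall into the same discrete state, while the clearly different values remain distinct states.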
Maximum entropy using MCS
To obtain the maximum entropy of each variable, the uniform distribution was considered as the one that maximizes the marginal entropy. The uniform distribution was selected because (i) the variables used to determine marginal entropy had been previously discretized through a quantization method, (ii) we adopted a defined domain [a,b] based on the data original range, and (iii) no assumption was made on the behavior of data mean and variance, as time series are generally considered to be relatively short for this purpose.
MCS was applied to obtain the maximum entropy through the pseudorandom number generator Mersenne Twister (Matsumoto & Nishimura 1998) which was developed in the late 1990s for use in Monte Carlo simulation. For every water quality variable, an ensemble of 100 different synthetic datasets was drawn from the uniform distribution taking the observed data interval as a domain to preserve the variable characteristics. Thus, considering the number of variables and stations of the PMQQS network investigated in this research, 304,200 simulations were performed.
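A minimal sketch of this Monte Carlo step is given below in Python, whose standard `random` module is also based on the Mersenne Twister. Several choices are assumptions made for illustration: the synthetic series has the same length as the observed one, synthetic values are quantized by rounding, and the ensemble is summarized by the mean of the simulated entropies. The study does not state these details here, and its own implementation used R.

```python
import random
from collections import Counter
from math import log2


def entropy(states):
    """Shannon entropy (bits) from relative frequencies."""
    n = len(states)
    return sum((c / n) * log2(n / c) for c in Counter(states).values())


def max_entropy_mcs(observed, decimals, n_sim=100, seed=0):
    """Estimate maximum entropy from uniform draws on [min, max] of the
    observed data, quantized to the same number of decimals."""
    rng = random.Random(seed)  # Mersenne Twister generator
    a, b = min(observed), max(observed)
    n = len(observed)
    sims = []
    for _ in range(n_sim):
        sample = [round(rng.uniform(a, b), decimals) for _ in range(n)]
        sims.append(entropy(sample))
    return sum(sims) / n_sim  # assumption: ensemble mean as the estimate


# Hypothetical quantized series for one variable at one station
observed = [0.12, 0.15, 0.11, 0.30, 0.12, 0.14,
            0.13, 0.12, 0.45, 0.12, 0.13, 0.12]
h_max = max_entropy_mcs(observed, decimals=2)

# Simulation count reported in the text: variables x stations x ensemble size
n_simulations = 78 * 39 * 100
```

The last line reproduces the 304,200 simulations from the 78 variables, 39 stations and 100 synthetic datasets per variable.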
The native stats package of R was used for the Monte Carlo simulations (R Core Team 2021).
Assessing non-critical water quality variables
The threshold for classifying variables as low information content variables (LICV) was determined by using the information redundancy of all water quality variables in both seasonal periods across the investigated watershed. The threshold was defined by observing the percentile at which there was a leap in information redundancy toward 100% or values close to it.
Variables classified as LICV in all monitoring stations of a subregion, in both seasonal periods, were compared with the maximum admissible limits for class 2 fresh waters established by Resolution no. 357/2005 of the National Council for the Environment (BRASIL 2005). The variables that exceeded the maximum limits set by the legislation were removed from the group of ‘non-critical variables’. For this consistency check, the LICV series included measurements identified as outliers. The LICV that were within the legal limits were identified as non-critical variables in the Doce River watershed and could be subjected to a reduction in monitoring frequency.
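The two-stage selection can be sketched in Python as follows. The variable names, redundancy values and limits are placeholders and do not reproduce the study's data or the limits of the Brazilian resolution; the 0.80 threshold follows the value reported in the results, and letting variables without a listed limit pass the compliance check is an assumption of this sketch.

```python
def is_licv(redundancies, threshold=0.80):
    """True if redundancy exceeds the threshold at every station of the
    subregion and in both seasons."""
    return all(r > threshold for r in redundancies)


def non_critical(red_by_var, max_observed, limits, threshold=0.80):
    """Keep LICV whose highest observed value complies with its limit."""
    selected = []
    for var, reds in red_by_var.items():
        if not is_licv(reds, threshold):
            continue
        limit = limits.get(var)  # None means no limit listed (assumption)
        if limit is not None and max_observed[var] > limit:
            continue  # low information content, but non-conforming
        selected.append(var)
    return selected


# Placeholder inputs for one subregion (stations x seasons flattened)
red_by_var = {
    "var_a": [0.95, 0.91, 0.99, 0.88],
    "var_b": [0.97, 0.93, 0.96, 0.90],
    "var_c": [0.92, 0.35, 0.90, 0.85],
}
max_observed = {"var_a": 0.4, "var_b": 7.0, "var_c": 0.1}
limits = {"var_a": 1.0, "var_b": 5.0, "var_c": 1.0}
result = non_critical(red_by_var, max_observed, limits)
```

In this example `var_b` is an LICV but exceeds its limit, and `var_c` fails the redundancy criterion at one station, so only `var_a` is retained.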
Monitored variables in dissolved form without environmental standards were analysed according to established limits for their total concentration.
Validation
Principal component analysis (PCA) was used to validate non-critical water quality variables selected in the previous step. PCA is a commonly employed technique for handling multivariate data, with the aim of organizing and reducing dimensionality. By linearly combining the original variables, PCA produces a new set of orthogonal variables, referred to as principal components. The components are independent, non-correlated, and the sum of their variances equals that of the original variables (Abdi & Williams 2010; Olsen et al. 2012; Sergeant et al. 2016).
It is expected that, due to their low information content, the non-critical water quality variables will contribute little to the first two components, showing that they are not determinant for water quality characteristics in each subregion. The data used in the PCA were the same as described in Section 3.2. However, since PCA does not admit missing data, it was necessary to eliminate sampling campaigns with missing data. The dataset was standardized to eliminate distortions arising from the different measurement scales of water quality variables. As for the correlation between the variables, one variable from each pair whose Pearson correlation coefficient was greater than 0.8 was eliminated. This procedure aimed at reducing redundant variables, optimizing the set of analysed variables (Hair et al. 2006) and avoiding distortions in the results that could attribute greater importance to multicollinear variables. Outliers were not eliminated, to preserve as much as possible the information contained in the dataset.
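The correlation filter applied before the PCA can be sketched in Python as below. Two details are assumptions: the absolute value of the coefficient is compared with the cutoff, and the variable that appears later in the input is the one removed from each correlated pair. The variable names and values are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series.
    Series with zero variance must be removed beforehand."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_collinear(data, cutoff=0.8):
    """Keep a variable only if it is not strongly correlated with any
    variable already kept."""
    kept = []
    for name in data:
        if all(abs(pearson(data[name], data[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


data = {
    "total_iron": [1.0, 2.0, 3.0, 4.0, 5.0],
    "total_manganese": [2.0, 4.0, 6.0, 8.0, 10.5],
    "nitrate": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_collinear(data)
```

The second series is almost a multiple of the first and is removed, while the weakly correlated third series is kept.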
RESULTS AND DISCUSSION
Water quality spatial boundaries
In the development of all methodological stages, the models and functions were implemented in the R programming language using entropy, FactoMineR and MultivariateAnalysis packages (Le et al. 2008; Hausser & Strimmer 2021; Azevedo 2022).
Subregion A is formed by one station located in Gualaxo do Norte River, upstream Fundão dam. Subregions B, C and D are formed by monitoring stations in Gualaxo do Norte River, downstream Fundão dam. Subregion E is formed by monitoring stations in Carmo River. Subregions F, H, K and L are formed by monitoring stations in Doce River tributaries (Piranga, Santo Antônio, Piracicaba, Guandu and Caratinga Rivers), while subregions G, J and M are formed by monitoring stations located both in the Doce River and in its tributaries (Piranga, Matipó, Suaçui Grande and Manhuaçu Rivers). Subregions I and N have one monitoring station each, in Doce River.
Assessing entropy measures
Marginal entropy values are asymmetrically distributed, with greater spread up to the 50th percentile and higher values in the upper tail of the distribution. The lowest marginal entropy obtained, both in the dry and wet seasons, was zero, which indicates little or negligible uncertainty. The variance of the observed data was represented by discrete states after quantizing the variables. The smaller the variance of a given variable, the smaller the number of discrete states to describe it, and it may reach a single state and zero uncertainty (null entropy).
The maximum entropy was limited to the highest value of marginal entropy. During the dry season, the maximum entropy reached 4.09 bits and in the rainy season, 4.39 bits. As with marginal entropy, seasonality drives maximum entropy toward greater values in the wet season (Figure 4(b)).
The similarity among the highest values of marginal and maximum entropies was expected because, when the uniform distribution is used with a finite interval obtained from the observed data, the statistical characteristics of the variables are preserved in the results of the Monte Carlo simulations.
Selection of non-critical water quality variables in the Doce River watershed
A total of 51 variables in the wet season and 62 variables in the dry season were classified as LICV across the 14 water quality subregions. Those with information redundancy greater than 80% in all subregions in both seasonal periods included a total of 41 variables. Comparing their values with Brazilian environmental standards revealed non-conformances. In these cases, despite the low information content presented by the variables in all stations of the same subregion, they were not labeled as non-critical variables.
Subregion L (in Caratinga River) accounted for the highest number of variables with non-conforming values, four in total. Subregions D (in Gualaxo do Norte River) and J (in Doce River) recorded three non-conforming variables each. Subregions B (in Gualaxo do Norte River and Suaçuí Grande River), C (in Gualaxo do Norte River), E (in Carmo River), G (in Doce River), M (in Manhuaçu River and Manhuaçu River) and N (in Doce River) recorded two non-conforming variables each, while F (in Santo Antônio Rivers), H (in Piracicaba River), I (in Doce River) and K (in Guandu River) had the lowest number (one non-conforming variable each). In subregion A (station RGN-01 in Gualaxo do Norte River, upstream of the Fundão dam), all variables were within environmental standard limits. The number of non-critical water quality variables ranged from 25 to 39, according to the water quality subregion (Figure A3 – Supplementary material).
Regarding non-conforming variables, the one that stood out the most in the comparison with environmental standards was biochemical oxygen demand (BOD), with non-conforming values in eight subregions. It was followed by total phosphorus in six subregions, total lead in four subregions, total mercury and dissolved phosphorus in two subregions each, and dissolved boron, total boron, total cadmium and total nickel in one subregion each (Table A4 – Supplementary material).
The final set of variables selected as non-critical has a predominance of metals and metalloids. In summary, 15 non-critical variables were identified in all 14 subregions, another six variables in 13 subregions and five variables in 12 subregions. The other non-critical variables cover a smaller number of subregions (Table 2).
Non-critical variables are characterized by low information entropy, which means small uncertainty in their expected values. Uncertainty is linked to the degree of redundancy, in this case understood as the repeatability of the data. When the measurements of these variables are repeated continuously, each new piece of data is generated under great predictability, producing little or no new information content. These variables are likely assigned as ‘non-critical’ because they are frequently below quantification limits and/or are not related to the mine tailings discharged in the watershed. Moreover, the monitoring program started a few years after the dam break, and the pollutant levels may have decreased due to physical–chemical processes, such as sedimentation, taking place in the aquatic system.
Spatial and temporal variability of non-critical variables
Due to changes in rainfall and runoff patterns associated with the spatial dynamics of land use/occupation in watersheds, seasonality often impacts water quantity and quality (Rodrigues et al. 2018; Schliemann et al. 2021). Seasonality influence was also observed by researchers who investigated water quality in the Doce River watershed, before and after the Fundão disaster (Petrucio et al. 2005; Nogueira et al. 2021; Passos et al. 2021).
The dry period predominated over the wet period in terms of the number of non-critical water quality variables. Point source pollution loads, mainly from untreated and treated sewage, remain nearly constant over the year while during the dry period, small precipitated volumes and reduced flows in the river network generate lower diffuse pollution loads from the watershed and smaller erosion and resuspension of sediments in the rivers (Oliveira & Quaresma 2017; da Cunha Richard et al. 2020). Consequently, water quality variable concentrations in the dry period present less variability (less uncertainty) and lower entropy when compared to the wet period.
Information redundancy of dissolved arsenic shows a great spread of values along the watershed (Figure 7), both in dry and wet seasons, ranging from 0 to 100%. Less redundancy zones are very well delimited in the upper reach of the Doce River with values below 40%. Monitoring stations included in this area correspond to all stations of subregions E, G and I. Subregion E corresponds to the Carmo River watershed which was historically explored by gold mining (Borba et al. 2000; Daus et al. 2005; Silva et al. 2018). This activity, especially in past centuries when mining was rudimentary and with little care regarding environmental impacts, promotes arsenic release from the soil and minerals, representing a serious risk to ecosystems and human health due to its toxicity characteristics (Alonso et al. 2020; Barcelos et al. 2020).
Subregions G and I are located in the upper reach of the Doce River and are probably influenced by gold mining in the Carmo watershed, as concentrations of dissolved arsenic in non-compliance with environmental standards have been recorded in this location before and after the Fundão disaster. The maps in Figure 7 show that dissolved arsenic redundancy decreases from the dry to the wet season in the middle reach of the Doce River (subregion J). Less frequently, dangerous levels of arsenic have also been identified in water for human consumption closer to the middle reach of the Doce River (Teixeira et al. 2020). Dissolved arsenic was indicated as a non-critical variable in all subregions except the four aforementioned ones (E, G, I and J).
The maps in Figure 8 show that, because total manganese is present and variable all over the watershed, the spatial distribution of redundancy is quite homogeneous and similar between dry and wet periods. Low levels of redundancy predominate throughout the year, ranging from 0 to 34.4% in the dry season and from 1.7 to 27.1% in the wet season. This behavior reflects the more pronounced variability of total manganese, both temporal and spatial, in the Doce River watershed. Recent studies report total manganese at high concentrations, even violating environmental standards, all over the Doce River watershed (de Carvalho et al. 2018). This is likely related to the Fundão disaster, since manganese is a constituent of iron mining tailings and tends to be gradually remobilized from sediment deposited in river channels to surface waters by chemical and physical–chemical processes, representing a medium- and long-term risk to the population and aquatic ecosystems (Carvalho et al. 2018; Baudson et al. 2021; Moreira et al. 2021; Duarte & Neves 2023).
Validation
The correlation analysis showed strong multicollinearity between the water quality variables in all subregions (see Figures A4 to A17 – Supplementary material). The number of variables eliminated from strongly correlated pairs ranged from 9 (subregion F) to 40 (subregion N) (Table A5 – Supplementary material). Variables with null variability, which could not be standardized, were also excluded from the analysis. The total set of variables investigated in the PCA therefore ranged from 9 to 46 (Table A6 – Supplementary material).
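The screening step described above can be sketched as follows. This is an illustration under stated assumptions, not the study's procedure: the correlation threshold of 0.9, the rule of keeping the first variable of each strongly correlated pair, and the example series are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def screen_variables(data, threshold=0.9):
    """Return the variable names retained for the PCA.

    data: dict mapping variable name -> list of measurements.
    threshold: absolute correlation above which a pair counts as
               strongly correlated (hypothetical value).
    """
    # Variables with null variability cannot be standardized, so drop them first.
    varying = {k: v for k, v in data.items() if max(v) != min(v)}
    kept = []
    for name, values in varying.items():
        # Keep a variable only if it is not strongly correlated with one already kept.
        if all(abs(pearson(values, varying[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical series, not data from the study.
example = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],   # almost perfectly correlated with "a"
    "c": [5, 5, 5, 5, 5],      # null variability
    "d": [3, 1, 4, 1, 5],
}
retained = screen_variables(example)
```

In this example, "b" is dropped as the second member of a strongly correlated pair and "c" is dropped because it is constant, leaving "a" and "d" for the PCA.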
The PCA showed that the cumulative explained variance of the first two principal components ranged from 21.5% (subregion E) to 67.6% (subregion N), with an average across subregions of 29.9%. The eigenvalues of these components were greater than 1 in all subregions, and the components are therefore representative according to the Kaiser criterion (Ferré 1995).
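The two quantities used here follow directly from the eigenvalues of the correlation matrix. The sketch below assumes a PCA on standardized variables, so that the eigenvalues sum to the number of variables; the eigenvalues shown are hypothetical and are not results from the study.

```python
def cumulative_explained_variance(eigenvalues, n_components=2):
    """Percentage of total variance explained by the leading components."""
    ordered = sorted(eigenvalues, reverse=True)
    return 100.0 * sum(ordered[:n_components]) / sum(ordered)

def kaiser_retained(eigenvalues):
    """Eigenvalues that satisfy the Kaiser criterion (greater than 1)."""
    return [ev for ev in sorted(eigenvalues, reverse=True) if ev > 1.0]

# Hypothetical eigenvalues for five standardized variables (they sum to 5).
eigenvalues = [2.0, 1.5, 0.8, 0.5, 0.2]

explained_pc1_pc2 = cumulative_explained_variance(eigenvalues)  # 3.5 / 5 of the variance
retained_components = kaiser_retained(eigenvalues)              # components with eigenvalue > 1
```

With these illustrative values, the first two components explain 70% of the variance and both pass the Kaiser criterion.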
The dataset of every subregion included some non-critical water quality variables. Subregion N had only one non-critical variable investigated in the PCA, while subregion M had 11 (Table A7 – Supplementary material).
Variables whose individual percentage contribution was higher than the average for each subregion were considered to have the greatest contribution to a component. In subregions A, B, C, E, F, H, L, M and N, the set of variables with the greatest contribution to the first principal component (PC1) did not include non-critical variables. Subregions G, I and K each had one non-critical variable contributing to PC1, and subregions D and J each had two (Table 3) (Figure A18 – Supplementary material).
Non-critical water quality variable | Subregions |
---|---|
Total chromium | D |
Total Kjeldahl nitrogen | D |
Total lead | I |
Total nickel | G |
Dissolved cobalt | J |
Total selenium | J |
Chlorophyll a | K |
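The selection rule for the variables with the greatest contribution to a component can be sketched as below. The sketch assumes that the contribution of a variable is its squared loading expressed as a percentage of the sum of squared loadings, which is a common convention but is not confirmed by the text; the loadings and variable names are hypothetical.

```python
def contributions(loadings):
    """Percentage contribution of each variable to one principal component.

    loadings: dict mapping variable name -> loading on the component.
    Assumes contribution = 100 * loading^2 / sum of squared loadings.
    """
    total = sum(l * l for l in loadings.values())
    return {name: 100.0 * l * l / total for name, l in loadings.items()}

def above_average(loadings):
    """Variables whose contribution exceeds the average contribution."""
    contrib = contributions(loadings)
    mean = sum(contrib.values()) / len(contrib)  # equals 100 / number of variables
    selected = [name for name, c in contrib.items() if c > mean]
    return sorted(selected, key=lambda name: contrib[name], reverse=True)

# Hypothetical PC1 loadings, not results from the study.
pc1_loadings = {"turbidity": 0.8, "total iron": 0.6, "total lead": 0.3, "pH": -0.1}
greatest_contributors = above_average(pc1_loadings)
```

With four variables the average contribution is 25%, so only "turbidity" and "total iron" are selected in this example. A non-critical variable would appear in a table such as the one above only if its contribution exceeded that average.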
Similarly, non-critical variables were identified among those with the greatest contribution to the second principal component (PC2), except in subregions E, F and I. In the other subregions, the number of non-critical variables contributing significantly to PC2 ranged from 1 (subregions A, G, H, M and N) to 3 (subregion B) (Table 4) (Figure A18 – Supplementary material).
Non-critical water quality variables | Subregions |
---|---|
Ammoniacal nitrogen | A, C, K |
Free cyanide | B, H |
Total Kjeldahl nitrogen | B |
Iron II | B |
Total sulfides | C, L |
Settleable solids | D |
Total lead | N |
Polyphosphate | D, K, L, M |
Dissolved antimony | G |
Total beryllium | J, K |
Total selenium | J |
Addressing outliers in environmental data is challenging, as these values are not necessarily errors and may represent extreme values that actually occurred in the environment. In this study, outliers were removed only before applying the method based on information entropy. This strategy was adopted because outliers would obscure the quantization process and, consequently, hinder the identification of non-critical water quality variables through the information entropy method. Outliers could also have been excluded before conducting the PCA, but this step was intentionally omitted to illustrate their potential impact on the results. Indeed, Figure 9 shows that the greater importance attributed to non-critical variables in the PC1 and PC2 components comes exclusively from the outliers, corroborating the selection of non-critical water quality variables using information entropy.
It is therefore recommended that future applications of the non-critical variable selection method keep outlier removal as a preliminary data treatment step. This step does not have to rely solely on the Gini method. Other methods should also be evaluated, as they may yield satisfactory results depending on the specific characteristics of each dataset.
Management context
The occurrence of large-scale environmental disasters has placed environmental management, especially the management of water resources, at the center of attention worldwide. Depending on the nature of the impacts and their extension in geographical terms, the environmental recovery process becomes quite complex and challenging, often requiring the construction of solutions that are not readily available.
In water resources management, the entire chain of actions necessary for the recovery of impacted ecosystems, from the immediate monitoring of the spatial dynamics of pollutant dispersion to the temporal monitoring of mitigation measures, depends on the establishment and operation of monitoring networks. In this context, these networks play an increasingly strategic role in safeguarding ecosystems and guaranteeing water security for the affected populations. Emergency monitoring networks are very often oversized, both in the number of stations and in the number of monitored parameters. Although this approach is legitimate and necessary, especially where pollutants are highly mobile and dangerous, strategies are needed to optimize monitoring efforts without losing sight of the specificities of each crisis scenario. In the Doce River watershed, for instance, emergency water quality monitoring must consider the dynamics of pollutant bioavailability. Under varying environmental conditions, elements immobilized in sediments may become bioavailable more rapidly through physicochemical processes (Costa et al. 2021; Queiroz et al. 2021a). This phenomenon is highlighted by several authors who caution against the potential for chronic effects stemming from the Fundão disaster. Metals and metalloids adsorbed to or complexed with iron oxides and hydroxides in sediments can undergo exchange, raising their concentrations in water resources and presenting a significant risk to ecosystems and human populations (Queiroz et al. 2018, 2021b; Gabriel et al. 2021).
Emergency monitoring programs require regular assessment. Identifying variables that are less informative and pose lower risk to water resources and aquatic ecosystems makes it possible to reduce the monitoring frequency of these non-critical variables. Monitoring can then focus on the most critical variables, minimizing wasted resources and allowing the surplus to be redirected to other actions of environmental recovery programs. The proposed method can thus help managers of emergency monitoring networks make more confident and less subjective decisions regarding monitoring criteria.
CONCLUSIONS
This paper presented a method based on information entropy for selecting non-critical water quality variables in emergency monitoring networks, in the context of environmental disasters impacting water resources. The method has the potential to support managers in decision-making regarding the planning and operation of emergency monitoring networks. For the Doce River case study, the main findings are:
14 subregions with homogeneous water quality characteristics were identified in the Doce River watershed through cluster analysis;
Seasonality strongly influences the information content of water quality variables, generally making them more informative during the wet season;
Across the watershed, 41 non-critical water quality variables were identified and distributed among distinct sets within each water quality subregion;
Non-critical water quality variables ranged from 32 to 50% of the total monitored variables in the water quality subregions;
Metals and metalloids stand out as the prevailing non-critical water quality variables in the Doce River watershed.
The Doce River watershed served as a case study; nevertheless, the applied method is easily transferable to other watersheds due to (i) its independence from data following a normal distribution; (ii) its ability to capture the spatiotemporal fluctuations in the information content of water quality variables, even for elements with concentrations as low as parts per billion (ppb); and (iii) its objectivity in the identification of non-critical water quality variables, avoiding the influence of personal judgments. In other study areas, outliers should be carefully evaluated and the discretization method for continuous variables carefully defined.
ACKNOWLEDGEMENTS
We acknowledge IFMG – Campus Governador Valadares for the support provided as a waiver for exclusive dedication to the doctoral research of the first author.
Variables were standardized for allowing visualization.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.