Abstract
This paper studies the performance of three machine learning techniques, multivariate adaptive regression splines (MARS), feed forward neural network-back propagation (FFNN-BP), and decision tree regression (DTR), for estimating the physico-chemical properties of groundwater in the coastal plain of Vinh Linh and Gio Linh districts, Quang Tri province, Vietnam. From 290 groundwater samples collected in the two districts, the study identified three main components, CO2, Ca, and CaCO3, for simulation. Quantitative analysis showed that CaCO3 ranged from 0 to 25.8 mg/l, Ca from 0 to 87.55 mg/l, and CO2 from 0 to 12 mg/l. Groundwater quality index (GQI) values and their categories were taken from the Vietnam Groundwater Standard (QCVN01). Statistical accuracy parameters were used to compare the models. To deploy FFNN-BP and DTR, different types of transfer and kernel functions were tested, respectively. The results showed that all three models perform suitably for forecasting water quality components. Compared with the FFNN-BP and DTR models, the MARS model performed well, with slightly higher accuracy than the other two. Ranked by the accuracy parameters for the training phase, MARS gave the best result, followed by DTR and then FFNN-BP.
HIGHLIGHTS
Machine learning methods are used for spatial modeling of physico-chemical properties of groundwater.
MARS achieves suitable precision compared with the DTR and FFNN-BP models.
The total CaCO3 value in the experimental samples met the regulatory limit of QCVN01 at the ‘Excellent’ level.
The water quality parameters (i.e., CaCO3, Ca, and CO2) of the coastal plain area were predicted.
The study results have shown that the water quality in these two districts is usable for humans, livestock, and agriculture activities.
INTRODUCTION
The presence of contaminants in natural freshwater is considered one of the most crucial environmental problems in many areas of developing countries, where several communities have little access to a potable water supply (World Health Organization 2004; Giang et al. 2021). Low-income communities, which rely on untreated surface water and groundwater for domestic and agricultural purposes, are the most affected by poor water quality (Ayoko et al. 2007). Unfortunately, they also lack adequate tools to monitor water quality regularly (Resh 2008; Omarova et al. 2018; Najafzadeh & Niazmardi 2021). Thus, such communities increasingly need reliable and usable assessments of water quality (Bonansea et al. 2015).
Climate change leads to seawater intrusion, which affects the groundwater resources of coastal cities (Kumar 2012; Alfarrah & Walraevens 2018). Urbanization and industrialization have in turn caused uncontrolled over-exploitation and depletion of groundwater (Kanwal et al. 2015; Sunardi et al. 2021). Furthermore, untreated wastewater from residential areas and industrial zones has seeped into the ground, increasing the amount and concentration of chemical elements in groundwater (Mukate et al. 2018; Khan et al. 2020a).
The chemical composition of groundwater is used as a measure of its suitability for many purposes, such as human and animal drinking, agriculture, and industry. Practice has shown that different uses of groundwater require different standard indicators to distinguish water quality conditions (Loaiciga et al. 1992). Groundwater quality is an integrative concept composed of the chemical, physical, and biological features that sustain the intended uses of groundwater. Hence, groundwater is classified by composition using a groundwater quality index (GQI) for the management and consumption of groundwater resources (Najafzadeh et al. 2021a).
To evaluate the quality of water for drinking and agricultural irrigation, several variables are routinely monitored. This process produces a large database, but data acquisition can be time-consuming and accurate interpretation of the multivariate data can be challenging.
Machine learning has been widely used to forecast physico-chemical parameters in water. Artificial neural networks (ANN) have been used to estimate river water quality components (Niroobakhsh et al. 2012; Emamgholizadeh et al. 2014; Najah et al. 2014; Raheli et al. 2017; Haghiabi et al. 2018; İlhan et al. 2021; Najafzadeh et al. 2021b); multivariate adaptive regression splines (MARS) have been employed to predict physico-chemical parameters in water (Haghiabi 2016; Bhatt et al. 2017; Ahmadi et al. 2019; Esmaeilbeiki et al. 2020); and decision tree regression (DTR) has been deployed to forecast water quality (Liao & Sun 2010; He et al. 2012; Jaloree et al. 2014; Chandanapalli et al. 2018; Gakii & Jepkoech 2019; Jalal & Ezzedine 2020; Lu & Ma 2020). Furthermore, MARS, feed forward neural network-back propagation (FFNN-BP), and DTR all belong to nonparametric learning, and such models have been used in these areas (Bengio et al. 2010; Al Iqbal et al. 2012; Genuer et al. 2017; Khaldi et al. 2019; Kohler et al. 2019; Yurochkin et al. 2019; Antoniadis et al. 2020; Devianto et al. 2020; Khan et al. 2020b; Zheng et al. 2020; Amiri-Ardakani & Najafzadeh 2021). Najafzadeh & Ghaemi (2019) implemented LS-SVM and MARS models to estimate the BOD5 and COD parameters from 200 samples collected from the Karoun River in southwestern Iran. The results showed that the MARS model gave precise approximations of the observed data. Saghebian et al. (2014) applied a decision tree model to classify groundwater quality in Ardebil, Iran. Their results showed that the model performs within an acceptable range of criteria for classifying groundwater quality. Khan et al. (2021) used an FFNN-BP model to estimate Escherichia coli in groundwater from 1,301 groundwater samples obtained from 348 villages and cities in Rajasthan state, India, from 2016 to 2019. They found that deploying the model based on Grover's algorithm was more efficient in forecasting all patterns in the calculated E. coli in groundwater.
Najafzadeh et al. (2021a) studied the groundwater quality of the Rafsanjan Plain of Iran, using artificial intelligence (AI) to assess GQI values over 15 years. The MARS model predicted groundwater quality with RMSE = 2.444 and SI = 0.0304. Comparison with the World Health Organization groundwater standard also showed that, with high probability, no part of the Rafsanjan area has water quality at the ‘Excellent’ level. The chance of ‘Good’ water quality varies from 1% (at GQI = 50, worst case) to 55% (at GQI = 100, best case).
Groundwater quality prediction is subject to error for various reasons, such as the quality of the collected groundwater samples, measurement variability, the subjective opinion of the sample analysts, and other random parameters related to groundwater quality prediction that have not yet been studied. Therefore, assessing the reliability of the implied quality classifications remains a problem. In addition, analytical methods are also subject to the bias of environmentalists, geologists, and other experts.
This paper presents the prediction of the physico-chemical properties of groundwater using FFNN-BP, DTR, and MARS models. The input vectors used in the models are based on 290 samples collected from 290 household wells in the coastal plain of Gio Linh and Vinh Linh districts, Quang Tri province. Using the collected data, the GQI values were analyzed against the Vietnam Groundwater Standard (QCVN01) to determine their suitability for the proposed uses. The three models are then compared on the basis of statistical accuracy parameters: mean (M), bias (expressed as the mean error), root mean square error (RMSE), mean absolute error (MAE), standard deviation (St Dev), Pearson correlation coefficient (R), skewness coefficient (Skew), minimum (Min), maximum (Max), scatter index (SI), and Nash–Sutcliffe efficiency (NSE). Finally, the combined results of the three models indicate how effectively they predict water quality.
The paper is organized as follows. Section 1 gives the introduction. Section 2 presents the study area and explains the MARS, FFNN-BP, and DTR models as used throughout this paper. Section 3 describes the study results. Finally, Sections 4 and 5 present the discussion and conclusions.
STUDY AREA AND METHODOLOGY
Study area
MARS







Feed forward neural network-back propagation




Structure of FFNN-BP network for physico-chemical properties groundwater prediction.
DTR


Performance metrics



Data collection
Statistical characteristics of physico-chemical components data
Item | St Dev | Mean | Min | Max | Skew |
---|---|---|---|---|---|
CaCO3 | 4.23 | 1.30 | 0 | 25.80 | 2.06 |
Ca | 16.1 | 6.05 | 0 | 87.55 | 1.57 |
CO2 | 2.42 | 0.79 | 0 | 12 | 1.56 |
Unit: mg/l.
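As a rough illustration of how summary statistics of this kind can be computed, the following Python sketch derives the standard deviation, mean, minimum, maximum, and skewness for one component. The sample values are hypothetical, and both the sample standard deviation and the moment-based skewness coefficient are assumptions, since the exact formulas used in the study are not given here.

```python
import statistics


def describe(values):
    """Return St Dev, Mean, Min, Max, and Skew for a list of concentrations (mg/l)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments for the (assumed) moment-based skewness g1 = m3 / m2**1.5.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {
        "St Dev": statistics.stdev(values),  # sample standard deviation (assumption)
        "Mean": mean,
        "Min": min(values),
        "Max": max(values),
        "Skew": skew,
    }


# Hypothetical concentrations, not the study's measurements.
example = [0.0, 0.0, 0.0, 4.0]
summary = describe(example)
print(summary)
```

A positive skewness, as reported for all three components, indicates that most samples have low concentrations with a few much higher values.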
RESULTS
The output function of MARS is presented as below:
MARS = 7.907 − 0.249 F1 − 0.129 F2, where F1 = max(0, Ca − 55.79) and F2 = max(0, 55.79 − Ca).
Fi denotes a basis function. F1 is the larger of 0 and Ca − 55.79, and F2 is the larger of 0 and 55.79 − Ca. A minus sign in front of a maximum is equivalent to a minimum, since −max(0, x) = min(0, −x). The MARS analysis also indicates that Ca is the most important variable. FFNN-BP and DTR, by contrast, do not yield an explicit output function.
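The fitted MARS function can be evaluated directly. The sketch below is only an illustration of the formula given above: the coefficients and the knot at 55.79 come from that formula, while the function names `hinge` and `mars_output` and the chosen Ca values are introduced here for demonstration.

```python
KNOT = 55.79  # knot location for Ca (mg/l), taken from the fitted function


def hinge(x):
    """Hinge function max(0, x) used to build MARS basis functions."""
    return max(0.0, x)


def mars_output(ca):
    """Evaluate the fitted MARS function for a Ca concentration (mg/l)."""
    f1 = hinge(ca - KNOT)  # active only when Ca is above the knot
    f2 = hinge(KNOT - ca)  # active only when Ca is below the knot
    return 7.907 - 0.249 * f1 - 0.129 * f2


# Illustrative Ca values spanning the observed range of 0 to 87.55 mg/l.
for ca in (0.0, 30.0, KNOT, 87.55):
    print(f"Ca = {ca:6.2f} mg/l -> MARS output = {mars_output(ca):.3f}")
```

At most one of F1 and F2 is non-zero for any Ca value, so the function is piecewise linear with a single break at Ca = 55.79.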
Figure 4(a)–4(c) shows the relationship between the three variables. As the contents of CO2 and Ca increase, the content of CaCO3 increases. The FFNN-BP model produces a forecast surface that resembles a cone, whereas the DTR and MARS surfaces look like sheets of paper with some folds. From these three images alone, it is hard to judge which model gives the best estimate. Hence, the performance metrics of the three models are presented in Table 2 and Figure 5. The NSE values of the three models for the training and testing phases range from 0.89 to 0.95, close to 1. The MAE values range from 0.14 to 0.50 mg/l, the RMSE values from 0.24 to 0.99 mg/l, and the bias values from −0.19 to 0.01 mg/l, all close to 0. The SI values range from 0.21 to 3.23 mg/l. These results show that the forecasts are consistent with the observed data. For the individual models, MARS in the training phase has NSE, MAE, RMSE, bias, and SI values of 0.95, 0.14 mg/l, 0.24 mg/l, 0.00 mg/l, and 0.26 mg/l, respectively, which are better than those of the DTR and FFNN-BP models. In the testing phase, the highest forecast accuracy is given by the MARS model with NSE = 0.95, followed by the FFNN-BP model with NSE = 0.91 and the DTR model with NSE = 0.89.
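For readers who want to reproduce accuracy parameters of this kind, the sketch below computes MAE, RMSE, bias, SI, R, and NSE from paired observed and predicted values. It uses the common textbook definitions, which are an assumption here because the paper's own formulas are not reproduced in this text: bias is taken as the mean of (predicted − observed), and SI as RMSE divided by the observed mean. The sample numbers are hypothetical.

```python
import math


def accuracy_metrics(observed, predicted):
    """Compute common accuracy parameters for paired observed/predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    sse = sum(e * e for e in errors)                      # sum of squared errors
    sst = sum((o - mean_obs) ** 2 for o in observed)      # spread of observations
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))

    rmse = math.sqrt(sse / n)
    return {
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": rmse,
        "Bias": sum(errors) / n,            # mean error (sign convention assumed)
        "SI": rmse / mean_obs,              # scatter index (definition assumed)
        "R": cov / math.sqrt(sst * var_pred),  # Pearson correlation coefficient
        "NSE": 1.0 - sse / sst,             # Nash-Sutcliffe efficiency
    }


# Hypothetical CaCO3 values (mg/l), not the study's data.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.1, 1.9, 3.2, 3.8]
metrics = accuracy_metrics(obs, pred)
print(metrics)
```

Under these definitions, a perfect model gives NSE = 1 and R = 1, with MAE, RMSE, bias, and SI all equal to 0.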
Accuracy parameters for physico-chemical components prediction
Parameter | DTR (testing) | DTR (training) | FFNN-BP (testing) | FFNN-BP (training) | MARS (testing) | MARS (training) |
---|---|---|---|---|---|---|
MAE (mg/l) | 0.25 | 0.25 | 0.50 | 0.46 | 0.21 | 0.14 |
RMSE (mg/l) | 0.34 | 0.33 | 0.99 | 0.90 | 0.41 | 0.24 |
Bias (mg/l) | −0.09 | −0.19 | −0.12 | 0.01 | −0.04 | 0.00 |
SI (mg/l) | 3.23 | 3.10 | 1.53 | 1.19 | 0.21 | 0.26 |
R | 0.91 | 0.93 | 0.92 | 0.91 | 0.93 | 0.94 |
NSE | 0.89 | 0.94 | 0.91 | 0.90 | 0.95 | 0.95 |
GCV (mg/l) | | | | | 0.14 | 0.14 |
Physico-chemical properties prediction with (a) MARS model, (b) FFNN-BP model, and (c) DTR model (Unit: mg/l).
The best performance indicators for CaCO3 prediction: (a) MARS training model, (b) FFNN-BP training model, (c) DTR training model.
The best performance indicators for CaCO3 prediction for training and testing.
DISCUSSIONS
Seawater intrusion, untreated wastewater from residential areas and industrial zones, and over-exploitation of groundwater have been seriously affecting the quality and quantity of the groundwater system. Water quality assessment is therefore a regular and continuous task that helps residents and authorities identify solutions for treating groundwater for daily use. Accordingly, this paper presented a comparative study and analysis of the MARS, FFNN-BP, and DTR models for estimating the physico-chemical properties of groundwater. Different circumstances, influential factors, and indicators were observed in the experiments. The key findings are that the predictive errors of the models decreased as the testing set decreased, and that MARS stood out in comparison with the other models. Furthermore, comparing the RMSE and MAE of the MARS model in this study with those reported by Najafzadeh et al. (2021a) for groundwater quality shows that their values (RMSE = 0.55 mg/l, MAE = 0.00 mg/l) are comparable to those of this study (RMSE = 0.41 mg/l, MAE = 0.21 mg/l).
According to QCVN01, the CaCO3 index for groundwater ranges from 0 to 300 mg/l (National technical regulation on domestic water quality of Vietnam, QCVN 01: 2009/BYT 2021). Table 3 shows the water quality categories based on the weighted groundwater quality index (GWQI) for human consumption. The total CaCO3 value in the experimental samples ranged from 0 to 25.8 mg/l, with a mean of 1.3 mg/l, which is within the QCVN01 limit and corresponds to the ‘Excellent’ category. This indicates that the water in these two districts is usable for humans, livestock, and agricultural activities. Nevertheless, rapid urbanization, industrialization, and climate change will negatively affect the groundwater resources in the area. Therefore, households need to build water purification systems for drinking water, and the local government needs to provide a clean water supply system for daily living. In addition, local authorities and households should install early-warning sensors at some wells to detect changes in the chemical content of groundwater. Such monitoring would periodically check groundwater pollution levels and support advice on them.
Water quality categorization based on the weighted groundwater quality index (GWQI)
GWQI | < 50 | 50–100 | 100–200 | 200–300 | > 300 |
---|---|---|---|---|---|
Quality categorization | Excellent | Good | Bad | Very bad | Unsuitable for drinking |
Unit: mg/l.
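The categorization in Table 3 can be expressed as a simple lookup. The sketch below is illustrative only: the function name `classify_gwqi` is hypothetical, and because the table does not say which class a boundary value such as 100 or 200 belongs to, the code assumes that each shared boundary falls in the poorer class, except that 300 is kept in ‘Very bad’ because the last class is stated as greater than 300.

```python
def classify_gwqi(value):
    """Map a GWQI value (mg/l) to its quality category from Table 3."""
    if value < 0:
        raise ValueError("GWQI must be non-negative")
    if value < 50:
        return "Excellent"
    if value < 100:   # boundary handling at 100 is an assumption
        return "Good"
    if value < 200:   # boundary handling at 200 is an assumption
        return "Bad"
    if value <= 300:  # the table lists '> 300' as the last class
        return "Very bad"
    return "Unsuitable for drinking"


# The observed CaCO3 mean and maximum reported in the study (mg/l).
for v in (1.3, 25.8):
    print(v, classify_gwqi(v))
```

Both the reported mean (1.3 mg/l) and maximum (25.8 mg/l) fall below 50 and are therefore classified as ‘Excellent’.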
The groundwater samples for the study were mainly collected at the beginning of the dry season, so the hydrodynamic coefficient differs from that of the rainy season. The groundwater analysis equipment also limited the study, because it could not detect other chemicals in the water that may affect public health in these two areas. Nevertheless, the results of this study help local authorities develop appropriate solutions so that households can use clean water.
CONCLUSIONS
Assessing and estimating water quality is a difficult and complex task, and the results can warn users and authorities so that appropriate treatment solutions are adopted. This study therefore deployed the MARS, DTR, and FFNN-BP models to predict the physico-chemical properties of groundwater in the coastal plain of Vinh Linh and Gio Linh, located in north-central Vietnam. In both the training and testing phases, the observed CO2 and Ca data were used as inputs, while CaCO3 was used as the output. The simulated results showed that all three models are well suited to forecasting water quality components. The best performance was obtained with MARS. The accuracy of DTR and FFNN-BP was also suitable for practical purposes. Furthermore, the comparison of the three models showed that the outcomes of the MARS and DTR models were slightly more reliable than those of FFNN-BP. The qualitative description of the GWQI showed that the whole region of Vinh Linh and Gio Linh districts had ‘Excellent’ water quality in 100% of cases. The study results also show that the water in these two districts is usable for humans, livestock, and agriculture.
In addition, this study demonstrated that machine learning models can play a key role in the decision-making process for assessing the effects of climate change, urbanization, and industrialization on groundwater quality. Possible future work includes improving the groundwater analysis equipment so that other elements in the groundwater can be detected, and collecting samples evenly throughout the seasons of the year.
ACKNOWLEDGEMENTS
We would like to thank the Department of Natural Resources and Environment and the Department of Science and Technology of Quang Tri province, and the residents of Vinh Linh and Gio Linh. We also thank the Hue University project ‘Assessing the corrosion capacity of groundwater for foundation concrete structures in the northern coastal plains of Quang Tri province’ (code DHH2013-01-151), which supported the collection of field data in Quang Tri province.
AUTHOR CONTRIBUTIONS
Conceptualization, Discussion, Writing, Material, Methods, Review, and Editing: Nguyen Hong Giang, Tran Dinh Hieu; Data collection, Writing: Hoang Ngo Tu Do; Methods: Thinh-Tien Nguyen. All authors have read and agreed to the published version of the manuscript.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.