Spatial prediction of spring locations in data poor region of Central Himalayas

This research explores methods for understanding the distribution and occurrence of groundwater springs using Geographic Information System (GIS) and machine learning techniques in data-poor areas of the Central Himalayas. The objectives of this study are to analyse the distribution of natural springs, evaluate three random forest models for their predictive ability, and establish a model for predicting the occurrence of springs. The study also evaluates the primary causal factors for the occurrence of springs. The data consist of 20 parameters based on topography, geology, lithology, hydrology and land use as causal factors, while 621 spring locations with discharge (n = 621) measured during 2014–2016 and 815 non-spring locations (generated by a GIS tool) are used as supporting evidence to train (80%) and test (20%) the prediction model. Results show that the Bootstrap Forest method is comparatively more reliable (92% accuracy) than the Boosted tree (64% accuracy) and Decision tree (74% accuracy) methods for classifying and predicting the occurrence of springs in the watershed. Bootstrap Forest shows high prediction rates for True Positives (82% of actual springs predicted as springs) and True Negatives (89% of actual non-springs predicted as non-springs), and the model appears consistent in both responses. The model was then applied to an independent dataset and predicted spring locations with 75% accuracy. Spatial statistical methods therefore prove efficient at predicting spring occurrence in data-poor regions.


INTRODUCTION
Springs are the primary source of water in mountainous and hilly areas of the Himalaya. The distribution and condition of springs determine the livelihood opportunities of the community, including agriculture and livestock farming, as well as the provision of clean water for drinking, sanitation and hygiene (Pariyar ). Groundwater in the form of mountain springs ensures water security for the majority of the rural population, yet springs are mostly overlooked in favour of studies at the basin and sub-basin scale (Rasul ).

The recent drying up of such springs has caused severe problems for local mountain communities (Rasul ; Rawat ). Spatial prediction of groundwater has been studied using GIS and remote sensing (Ozdemir a); weight of evidence and artificial neural networks (Corsini et al. ); a bivariate statistical model (Moghaddam et al. ); the binary logistic regression method (Ozdemir b); and multicriteria data analysis (Chenini et al. ). Studies show that groundwater occurrence is controlled by lithology, structures and landforms, for which GIS and remote sensing prove to be powerful tools (Solomon & Quiel ). A study on groundwater potential modelling considered lineaments, drainage density, topographic wetness index, relief and convergence index as determining factors (Liu et al. ). Statistical maps depict the relative probability of occurrence without considering the time factor (Catani et al. ).
Decision trees can efficiently discover new and unexpected patterns, trends and relationships compared to other spatial techniques. They are easy to build and interpret and can automatically handle interactions between continuous and categorical variables. Random Forests (RF) are a combination of tree predictors (Breiman ), essentially a machine-learning algorithm for decision-making (Catani et al. ). Random forests have recently emerged as one of the most commonly applied nonparametric statistical methods in various scientific areas (Shih ; Genuer et al. ); they exhibit high accuracy and robustness against over-fitting the training data (Puissant et al. ) and reduce the effect of noise (Breiman ).
The studied springs are located between 1,000 m and 3,000 m elevation, with discharge ranging from 0.01 litres per second (lps) to 5 lps and a mean of 0.36 lps, as recorded during the dry period (March-May) of the year. The distribution is highly skewed (skewness > 1), with high-discharge springs being less frequent. Most springs (67%) are scattered around 1,000-2,000 m altitude, and 37% of springs are located on aspects of 180-270 degrees (south and south-west).
Discharge data from 11 representative springs, measured every 15 days for 1 year (Figure 2), suggest that average spring discharge, measured in litres per second, starts to increase from August (mean 0.25 ± SD 0.15) up to October (mean 0.66 ± SD 0.35) and gradually decreases until February (mean 0.22 ± SD 0.12). From March onwards the discharge falls further, with some sources drying up, and reaches its lowest in June (mean 0.08 ± SD 0.07); it slowly starts to rise again from July onwards. This pattern is typical for springs that depend on the monsoon precipitation received throughout the country from June to September. The discharge behaviour of these springs suggests that all springs are geologically identical (Bryan ) and are recharged in a similar pattern during the monsoon, as winter precipitation is insignificant.

GIS datasets
The independent variables taken as causal factors for the study are generated from the Digital Elevation Model (DEM) [...]
Although the DEM-derived parameters represent distinct terrain properties and processes, their interrelationship may lead to multicollinearity. However, for springs mapping, [...]
Out of 621 known spring points, three were excluded as outliers, so the final dataset included 618 spring points and 815 non-spring points, a total of 1,433 records. The data for the study area (43% springs and 57% non-springs) consist of 1,433 rows, each with 20 columns. The data were randomly divided into training (80%) and validation (20%) datasets.
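The random 80/20 partition described here can be sketched as follows. This is a minimal illustration rather than the authors' workflow: the record counts are taken from the text, while the seed, the variable names and the use of Python's standard `random` module are assumptions.

```python
import random

N_SPRINGS = 618       # spring points retained after outlier removal (from the text)
N_NON_SPRINGS = 815   # GIS-generated non-spring points (from the text)
TRAIN_FRACTION = 0.8  # 80% training, 20% validation (from the text)

# Label each record: 1 = spring, 0 = non-spring.
labels = [1] * N_SPRINGS + [0] * N_NON_SPRINGS

# Shuffle record indices reproducibly; the seed value is an arbitrary assumption.
indices = list(range(len(labels)))
random.Random(42).shuffle(indices)

# Split the shuffled indices into training and validation sets.
n_train = int(TRAIN_FRACTION * len(indices))
train_idx, valid_idx = indices[:n_train], indices[n_train:]

print(len(train_idx), len(valid_idx))
```

With 1,433 records this yields 1,146 training and 287 validation rows. A purely random split does not preserve the class ratio exactly in each subset; a stratified split would be one alternative if that matters.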

Model evaluation
In statistical classification models, a receiver operating [...] (Tables 3 and 4), where the Decision tree resulted in 64% accuracy and the Boosted tree in 74% accuracy in predicting the validation data.
The null hypothesis that the AUCs produced by the three models are equal was rejected (Table 5), and the differences between AUCs (Table 6) were also observed to be significant.
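The AUC compared across the models can be read as the probability that a randomly chosen spring receives a higher predicted score than a randomly chosen non-spring. The sketch below computes it by direct pairwise comparison. The scores and labels are hypothetical and do not come from the study, and the function name `auc` is illustrative only.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # spring ranked above non-spring
            elif p == q:
                wins += 0.5   # tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical predicted spring probabilities and true classes (1 = spring).
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [1, 0, 1, 0]
print(auc(example_scores, example_labels))  # 0.75
```

The pairwise form is quadratic in the number of cases, which is acceptable for a validation set of a few hundred points; a rank-based formulation would be preferable for larger data.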

Sub-watershed comparison
Comparison between sub-watersheds is considered a reliable way to compare the results of the model, based on the evaluation of causal factors within a watershed.
The spring and non-spring data were further divided into sub-watersheds in this study (Figure 3), and seven sub-watersheds were selected on the basis of an adequate number of spring data points (N > 30). The Bootstrap Forest method could establish prediction models with accuracy ranging from 58% to 100%.
In this case, the small watershed size and insufficient validation data affect the accuracy, but this provides comparative [...]

Spring occurrence prediction model
The Random Forest (Bootstrap) method with 20 causal factors generated 500 trees for classification, and voting produced [...] Additionally, the model showed a regression limitation for the elevation parameter, which resulted in predicting no springs above an altitude of 3,000 m. This is a common limitation of the method: it cannot extrapolate beyond the range of the training data.
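The extrapolation limit noted here follows from how tree ensembles are built: every split threshold lies inside the range of the training values, so all inputs beyond that range fall into the same leaves. The sketch below illustrates this with bootstrap-aggregated one-split trees ("stumps") on a single elevation feature. It is a deliberately simplified stand-in for a full random forest, and the elevations, labels, seed and function names are all hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split tree on one feature; returns (threshold, left, right)."""
    majority = int(2 * sum(ys) >= len(ys))
    best = (float("inf"), majority, majority)  # fallback: constant prediction
    best_err = sum(y != majority for y in ys)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between two observed values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        l_lab = int(2 * sum(left) >= len(left))
        r_lab = int(2 * sum(right) >= len(right))
        err = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if err < best_err:
            best, best_err = (t, l_lab, r_lab), err
    return best


def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


def fit_bagged(xs, ys, n_trees=500, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def bagged_predict(stumps, x):
    """Majority vote over all stumps (1 = spring, 0 = non-spring)."""
    votes = sum(predict_stump(s, x) for s in stumps)
    return int(2 * votes >= len(stumps))


# Hypothetical training data: elevation in metres and spring presence.
elevation_m = [1000, 1300, 1600, 1900, 2200, 2500, 2800, 3000]
is_spring = [1, 1, 1, 1, 0, 1, 0, 0]

forest = fit_bagged(elevation_m, is_spring)
for x in (3000, 3500, 8000):
    print(x, bagged_predict(forest, x))
```

Whatever class the ensemble assigns at the highest training elevation is repeated for every higher elevation, which is the behaviour the text describes above 3,000 m.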
To address this, the model was re-run with 19 parameters, excluding elevation, which resulted in an improved prediction model. Though elevation was excluded, relief, hypsometric interval and curvatures are topographic parameters that account for the role of elevation-related [...] The fit details report of the model in Table 11 shows that the classification accuracy for the training data is (1 − 0.0403) × 100%, i.e. 96%, and the prediction accuracy for the validation data is (1 − 0.2877) × 100%, i.e. 71%. The confusion matrix in Table 12 shows how the cases in the data table were classified and predicted by the current model.
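The conversion from misclassification rate to accuracy used in this paragraph can be checked with a few lines. The two rates are those quoted from Table 11; the function and variable names are illustrative assumptions.

```python
def accuracy_percent(misclassification_rate):
    """Accuracy (%) is the complement of the misclassification rate."""
    return (1.0 - misclassification_rate) * 100.0


training_accuracy = accuracy_percent(0.0403)    # training rate quoted from Table 11
validation_accuracy = accuracy_percent(0.2877)  # validation rate quoted from Table 11
print(f"{training_accuracy:.1f}% {validation_accuracy:.1f}%")
```

The gap between the training and validation values is the quantity one would normally inspect when judging how well the model generalises.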
Another important aspect of the Random Forest (Bootstrap) method is that it provides estimates of variable importance, shown as column contributions. These show which variables best help to classify the data for the obtained accuracy. Column contributions sorted in descending order of generalized R square (R²) in Table 13 show performance [...]
Generalized R²: 0.6243, 0.1032, −203.4; definition: $\left(1 - \left(L(0)/L(\mathrm{model})\right)^{2/n}\right) / \left(1 - L(0)^{2/n}\right)$
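The generalized R² definition quoted here can be evaluated from predicted class probabilities by working with log-likelihoods, which avoids numerical underflow. The sketch below assumes that L(0) is the likelihood of an intercept-only model predicting the overall spring proportion for every case. That reading, the function name and the example data are assumptions, not values from the study.

```python
import math


def generalized_r2(y_true, p_model):
    """(1 - (L(0)/L(model))^(2/n)) / (1 - L(0)^(2/n)), via log-likelihoods."""
    n = len(y_true)
    p0 = sum(y_true) / n  # null model: overall spring proportion (must be in (0, 1))
    ll_null = sum(math.log(p0 if y else 1.0 - p0) for y in y_true)
    ll_model = sum(math.log(p if y else 1.0 - p) for y, p in zip(y_true, p_model))
    numerator = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
    denominator = 1.0 - math.exp((2.0 / n) * ll_null)
    return numerator / denominator


# Hypothetical classes (1 = spring) and predicted spring probabilities.
y_example = [1, 1, 0, 0, 1, 0]
p_example = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]
print(round(generalized_r2(y_example, p_example), 3))
```

A model that predicts the overall proportion for every case scores 0 under this definition, and the value approaches 1 as predicted probabilities approach the true classes.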

DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.