## Abstract

Mine water inrush is a major type of disaster in coal mine production in China. It causes heavy casualties and serious economic losses and threatens coal mine safety. To identify the source of mine water inrush quickly and accurately, this paper systematically analyzes the hydraulic connection of the aquifers in the main coal mining areas before and after mining activities, based on the hydrochemical characteristics of different aquifers in the Donghuantuo mining area. Four types of hydrochemical data were collected: No. 5 coal seam roof water, No. 8 coal seam roof water, No. 122 coal seam floor water, and No. 1214 coal seam aquifer water. Based on these data, the parameters of LightGBM were optimized by Particle Swarm Optimization (PSO), and a PSO-LightGBM water inrush source identification model was constructed. The recognition accuracy of the PSO-LightGBM model was compared with that of the LightGBM model, the classification and regression tree (CART) model, and the random forest (RF) model. The results showed that coal mining activities had a significant impact on the water quality characteristics of the roof sandstone fissure water of the No. 5 coal seam, and that mining activities affected the accuracy of the identification model to some extent. Among the four recognition models, the PSO-LightGBM model had the highest recognition accuracy, 97.22%. This indicates that the model has high accuracy, stability, and generalization ability, and that it provides an important reference for the identification of mine water inrush sources.

## HIGHLIGHTS

The mine water environment changed significantly before and after the mining in the study area.

Changes in water quality will affect the identification of water inrush sources to a certain extent.

Establishment of PSO-LightGBM mine water source identification model.

Comparison and analysis of PSO-LightGBM, LightGBM, RF and CART.

The article analyzes the main reasons for the misjudgment of the model.


## INTRODUCTION

The vigorous development of the coal industry is an important factor for the healthy and stable development of the economy (LaMoreaux *et al.* 2014). In recent years, with the continuous expansion of coal mining depth and width, the harm of mine water inrush has become more and more serious. Therefore, it is urgent to prevent and control mine water inrush. Quickly identifying the source of mine water inrush is a powerful tool to solve the problem (Li 2018; Dong *et al.* 2020).

During the formation of coal mine water, the water contained in the lithosphere, hydrosphere, atmosphere, biosphere, and stratum is subject to various physical and chemical actions. Therefore, hydrochemical types are important information sources for characterizing water-filled aquifers. The traditional method for judging the water source in a coal mine is to classify samples according to the concentrations of seven ions in groundwater, either directly or in combination with an identification model (Wang *et al.* 2020; Yang *et al.* 2021). In addition, some scholars have analyzed the mine water source based on changes in mine water level and temperature (Wu *et al.* 2019), some have identified the water source based on the variation of mine water isotopes or trace elements (Singh *et al.* 2018; Guan *et al.* 2019), and others have determined the source of mine water according to its fluorescence spectrum (Yang *et al.* 2018; Hu *et al.* 2019).

Nowadays, the combination of traditional techniques and computer technology is becoming more and more mature. Therefore, the identification methods of mine water inrush sources based on different water characteristics are constantly updated. Traditional methods of water temperature and water level discrimination or direct analysis of the hydrochemical data are being replaced by semi-quantitative analysis methods. The most widely used methods are combined with mathematical theory analysis, such as the cluster analysis method (Panagopoulos *et al.* 2016; Zhang *et al.* 2019), fuzzy mathematics method (Tiantian *et al.* 2019), grey theory method (Ju & Hu 2021), Bayesian discriminant method (Bogardi *et al.* 1982; Wu *et al.* 2016), Fisher feature extraction algorithm (Wang *et al.* 2021), GIS theoretical method (Donglin *et al.* 2012), SVM algorithm (Ma *et al.* 2018), and neural network algorithm (Chen *et al.* 2022; Yan *et al.* 2021). These identification models are generally fast and effective, and among them machine learning methods show better applicability and advantages in identifying water inrush sources.

In the application of machine learning algorithms, Baudron *et al.* (2013) established a model for identifying water sources using the Random Forest algorithm; Saghebian *et al.* (2014) established a water source classification model for a region in Iran using the Decision Tree method; and Gan *et al.* (2021) used the LightGBM model to predict the downstream water level of a river.

To sum up, research on the identification of mine water inrush sources has accumulated a rich theoretical basis and many application results. However, with the development of information technology and the comprehensive application of multidisciplinary means, many new methods and theories have emerged. Among them, machine learning algorithms have been applied to the prediction of water quality and water level, but their application to mine water source identification still leaves room for research. Therefore, starting from a qualitative analysis of hydrochemical data in the mining area, we first made a comparative analysis of the ion concentration characteristics and hydrochemical types of mine water in the Donghuantuo area. Then, we constructed a LightGBM model optimized by particle swarm optimization (PSO) to identify mine water sources. Finally, the PSO-LightGBM model was compared with the LightGBM model, the Classification and Regression Tree (CART) model, and the Random Forest (RF) model to determine the most suitable model for identifying mine water inrush sources. The results verify the application value of machine learning algorithms in mine water inrush source identification, provide theoretical support for the rapid identification of water inrush sources, and provide technical support for mine water hazard prevention and control in coalfields with similar geology.

## STUDY AREA

### Geological and hydrogeological conditions

[…]^{2+}Mg^{2+}Na^{+}; (2) Carboniferous and Permian sandstone fissure water, which includes the No.5 coal seam strong aquifer group, the No.5 – No.12 coal seam weak water-bearing group, and the No.1214 coal seam strong aquifers, and whose water quality type is -Ca^{2+}Na^{+} or -Ca^{2+}Mg^{2+}; (3) Ordovician limestone water, in which the top fissures and karst caves are mostly filled with sand, gravel, and clay. The type of water quality is -Ca^{2+}Mg^{2+}.

Figure 2 shows the stratigraphic system of the Donghuantuo mine area, with the corresponding coal seams, stratigraphic column, lithology, and aquifers. An investigation of the water inrush accidents in the Donghuantuo mine found that floor water inrush occurred many times in the 121 and 122 coal seams during roadway excavation, and that the water inrush source was the No. 1214 coal seam strong aquifer. Among these events, on July 18th, 2002, floor water inrush occurred during the mining of No. 122 coal, and the maximum water volume was 1.94 m^{3}/min. In 2013 and 2015, water inrush disasters occurred during the mining of No.8 coal. In both cases the water inrush source was the roof aquifer of the No.5 coal seam, and the maximum water volume reached 1.5 and 2.6 m^{3}/min, respectively. Therefore, the main research objects were the sandstone fissure water in the roof of the No. 5 coal seam, the No.8 coal roof water, the No.122 coal seam floor sandstone fissure water, and the No. 1214 coal aquifer. The approximate locations of the water samples are shown as blue dots in Figure 2.

### Data collection

Since the 1990s, 195 groups of hydrochemical data have been collected in Donghuantuo mine area. It includes No.5 coal seam roof sandstone fissure water (129 groups), No.8 coal roof water (12 groups), No.122 coal seam floor sandstone fissure water (36 groups), and No.1214 coal aquifers (18 groups). In this study, the original data samples mainly included cations (Ca^{2+}, Mg^{2+}, Na^{+}), anions (, , Cl^{−}), pH value, and total hardness (TH).

## METHODS

### LightGBM algorithm model

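LightGBM is a gradient boosting decision tree framework (Ke *et al.* 2017), which has the advantages of good training effect and low computational complexity. It has been gradually applied to different types of data analysis tasks such as classification, regression, and ranking. Assuming a supervised dataset $X = \{(x_i, y_i)\}_{i=1}^{n}$, the purpose of the LightGBM algorithm is to find an approximation $\hat{f}(x)$ of a function $f^{*}(x)$, such that the function minimizes the expected value of the specified fitness function $L(y, f(x))$. The fitness function is used to judge how well the model fits the data. The optimization function can be expressed as:

$$\hat{f} = \arg\min_{f} E_{y,X}\, L\big(y, f(x)\big)$$

The model is built as a sum of $T$ regression trees, $F_T(x) = \sum_{t=1}^{T} f_t(x)$, where each tree can be written as $f_t(x) = w_{q(x)}$ with $q(x) \in \{1, 2, \ldots, J\}$. Here *w* is the vector of leaf node sample weights, *q* is the regression tree structure, and *J* is the number of leaves in the tree. When the *t*-th tree is obtained, all the information of the previous (*t* − 1) trees needs to be used. Therefore, the objective function of the *t*-th iteration of the algorithm is as follows:

$$\Gamma_t = \sum_{i=1}^{n} L\big(y_i, F_{t-1}(x_i) + f_t(x_i)\big) + \Omega(f_t)$$

where $\Omega(f_t)$ is the regularization term to prevent the model from overfitting the training data. In the optimization of the objective function, the objective function after the second-order Taylor expansion (with the constant term dropped) is expressed as:

$$\Gamma_t \cong \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)$$

where $g_i$ and $h_i$ are the first-order and second-order gradient statistics of the fitness function, respectively. Since the regression tree has been defined above, the complexity of a tree is:

$$\Omega(f_t) = \gamma J + \frac{1}{2} \lambda \sum_{j=1}^{J} w_j^{2}$$

where $J$ is the number of leaf nodes, $\gamma$ is the leaf node coefficient, and $\lambda$ is the regularization coefficient. Assuming that $I_j$ is the sample set divided into leaf node $j$, the objective function can be changed to:

$$\Gamma_t = \sum_{j=1}^{J} \left[ \Big( \sum_{i \in I_j} g_i \Big) w_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) w_j^{2} \right] + \gamma J$$

LightGBM uses a histogram-based algorithm that discretizes continuous feature values into *K* integers, which can effectively reduce the computational cost and storage cost (Figure 3(a)). At the same time, the Leaf-wise leaf growth strategy (Figure 3(b) and 3(c)) is adopted, which can achieve better accuracy, significantly reduce algorithm complexity, and greatly reduce training time consumption, thereby improving training efficiency and prediction accuracy. Combining the above characteristics, the training process of the LightGBM model is shown in Figure 3(d).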
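As a rough illustration of the histogram idea, the following sketch maps continuous feature values to a small number of integer bins and counts the samples per bin. The function names `bin_feature` and `histogram`, the equal-width binning rule, and the sample concentrations are assumptions made for this example; the bin construction inside LightGBM itself is more elaborate.

```python
def bin_feature(values, k):
    """Map each continuous value to an integer bin index in [0, k - 1].

    Equal-width bins are an assumption made for this sketch.
    """
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: every sample falls in bin 0
        return [0 for _ in values]
    width = (hi - lo) / k
    bins = []
    for v in values:
        idx = int((v - lo) / width)
        bins.append(min(idx, k - 1))  # the maximum value belongs to the last bin
    return bins


def histogram(bins, k):
    """Count how many samples fall in each of the k bins."""
    counts = [0] * k
    for b in bins:
        counts[b] += 1
    return counts


# Hypothetical Ca2+ concentrations (mg/L), for illustration only.
ca = [12.0, 15.5, 40.2, 41.0, 78.9, 80.0]
ca_bins = bin_feature(ca, k=4)
ca_hist = histogram(ca_bins, k=4)
print(ca_bins, ca_hist)
```

Once the values are binned, candidate split points only need to be evaluated at the *K* bin boundaries instead of at every distinct feature value, which is where the savings in computation and storage come from.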

### Particle swarm optimization algorithm model

Particle Swarm Optimization (PSO) is an evolutionary computing technique derived from the study of the predation behavior of birds, first proposed by Kennedy & Eberhart (1995). PSO is an iterative optimization tool. It first initializes a set of random solutions in the system, treats each individual as a particle without weight or volume in the n-dimensional space, and then searches for the optimal value through iteration, so that the particles in the solution space move toward the optimal particle. The algorithm has a rapid searching speed and good initial convergence, so it is widely used in many fields.
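The source does not write out the update rule, so the following is only the commonly used form of PSO with an inertia weight, given here for orientation. The notation is an assumption: $x_i^{k}$ and $v_i^{k}$ are the position and velocity of particle $i$ at iteration $k$, $p_i$ is the best position found so far by particle $i$, $g$ is the best position found by the whole swarm, $\omega$ is the weighting (inertia) coefficient, $c_1$ and $c_2$ are the learning factors, and $r_1, r_2$ are random numbers drawn uniformly from $[0, 1]$.

$$v_i^{k+1} = \omega\, v_i^{k} + c_1 r_1 \big(p_i - x_i^{k}\big) + c_2 r_2 \big(g - x_i^{k}\big)$$

$$x_i^{k+1} = x_i^{k} + v_i^{k+1}$$

The first term keeps part of the previous motion, the second pulls the particle toward its own best position, and the third pulls it toward the best position of the swarm.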

### PSO-LightGBM algorithm optimization model

Step 1: Initialize particle swarm parameters, including the number of particles, learning rate, weighting coefficient, and the maximum number of iterations.

Step 2: Train the LightGBM model. The parameters to be optimized change as the positions of the particles change.

Step 3: Calculate and evaluate the fitness value. The fitness value is the negative of the training accuracy score output by the LightGBM model and is used to evaluate the performance of each particle. The smaller the fitness value, the better the performance.

Step 4: Determine the stop state. When the maximum number of iterations is reached, the iterative process is terminated and the optimal parameters of the LightGBM model are obtained. Otherwise, the iterative calculation continues.

Step 5: Validate the classification results of the model. The LightGBM model established by the optimization results is used to output the water inrush water source identification results.
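The five steps above can be sketched as a short loop. To keep the example self-contained, LightGBM training is replaced by a hypothetical stand-in, `negative_accuracy`, which pretends that accuracy peaks at particular parameter values; in a real run this function would train LightGBM with the candidate parameters and return the negative accuracy score. `learning_rate` and `num_leaves` are used as example parameters to tune, and the search bounds, swarm size, and coefficients are all assumptions rather than the settings used in the study.

```python
import random


def negative_accuracy(params):
    """Hypothetical stand-in for Steps 2-3.

    A real run would train LightGBM with `params` and return the negative
    accuracy score. Here a smooth surrogate is used whose minimum lies at
    learning_rate = 0.1 and num_leaves = 31 (values chosen arbitrarily).
    """
    lr, leaves = params
    accuracy = 1.0 - (lr - 0.1) ** 2 - ((leaves - 31.0) / 100.0) ** 2
    return -accuracy


def pso(fitness, bounds, n_particles=20, n_iter=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    dim = len(bounds)
    # Step 1: initialize particle positions, velocities, and best records.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    # Step 4: stop when the maximum number of iterations is reached.
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Keep the particle inside the search bounds.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            # Steps 2-3: evaluate the model for the new position.
            fit = fitness(pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Assumed search ranges: learning_rate in [0.01, 0.3], num_leaves in [8, 64].
bounds = [(0.01, 0.3), (8.0, 64.0)]
best_params, best_fit = pso(negative_accuracy, bounds)
print(best_params, best_fit)
```

Because the fitness is a negative accuracy, minimizing it is equivalent to maximizing accuracy. An integer parameter such as `num_leaves` would need to be rounded before being passed to the model, and Step 5 would then retrain with `best_params` and report the classification results.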

## RESULTS AND DISCUSSION

### Hydrochemical characteristics

Figure 5(a) and (b) shows the Piper's trilinear diagram and Durov diagram of No.5 coal roof water; (c) and (d) are Piper's trilinear diagram and Durov diagram of No.8 coal roof water; (e) and (f) are Piper's trilinear diagram and Durov diagram of No.122 coal floor water; (g) and (h) are Piper's trilinear diagram and Durov diagram of the No.1214 coal aquifer.

It can be seen from Figure 5(a) and 5(b) that the concentrations of Ca^{2+} and Mg^{2+} in the cations have decreased significantly after 2016, while the Na^{+} content has increased significantly, and the content of is always predominant in the anion. The hydrochemical type changed from -Ca^{2+} (71.8%) to -Na^{+} (100%) after 2016. The TDS increased significantly, indicating that the groundwater environment changed significantly before and after coal mining, and the runoff conditions became worse.

From the Piper trilinear diagrams and Durov diagrams of No.8 coal roof water (Figure 5(c) and 5(d)), No.12-2 coal floor water (Figure 5(e) and 5(f)), and the No.12-14 coal aquifer (Figure 5(g) and 5(h)), it can be seen that Ca^{2+} was always dominant among the cations, […] was always dominant among the anions, and the hydrochemical type did not change before and after 2016, remaining Ca^{2+} water (100%) throughout. TDS was less than 420 mg/L before and after 2016, with no significant change. Compared with the No.5 coal roof water, mining activities have had less impact on No.8 coal roof water, No.12-2 coal floor water, and the No.12-14 coal aquifer.

For all data samples, the cations (Ca^{2+}, Mg^{2+}, Na^{+}), anions ([…], […], Cl^{−}), pH value, and total hardness (TH) were taken as the original discriminant indicators and processed with the principal component analysis (PCA) method. The correlation coefficient matrix between the water source components is shown in Table 1.

| | Ca^{2+} | Mg^{2+} | Na^{+} | | | Cl^{−} | pH | TH |
|---|---|---|---|---|---|---|---|---|
| Ca^{2+} | 1.000 | | | | | | | |
| Mg^{2+} | 0.893 | 1.000 | | | | | | |
| Na^{+} | −0.855 | −0.790 | 1.000 | | | | | |
| | −0.728 | −0.649 | 0.967 | 1.000 | | | | |
| | −0.175 | −0.185 | 0.403 | 0.389 | 1.000 | | | |
| Cl^{−} | −0.043 | 0.213 | 0.038 | 0.081 | −0.144 | 1.000 | | |
| pH | −0.595 | −0.498 | 0.490 | 0.402 | −0.148 | 0.113 | 1.000 | |
| TH | 0.536 | 0.319 | −0.394 | −0.344 | 0.081 | −0.628 | −0.349 | 1.000 |


It can be seen from Table 1 that the correlation coefficient value of Mg^{2+} and Ca^{2+} is 0.893, and the correlation coefficient value of and Na^{+} is 0.967, indicating that there is a strong correlation between some variables. The correlation coefficient values of and Na^{+}, are 0.403 and 0.389, respectively, and the correlation coefficient values of Cl^{−} and Mg^{2+} is 0.213, indicating that some variables had moderate correlations. According to different PCA dimensions, the cumulative contribution rate of extracted principal components is shown in Table 2.

| Element | Initial eigenvalues: total | Initial eigenvalues: variance (%) | Initial eigenvalues: cumulative (%) | Extraction sums of squared loadings: total | Extraction sums of squared loadings: variance (%) | Extraction sums of squared loadings: cumulative (%) |
|---|---|---|---|---|---|---|
| Ca^{2+} | 4.123 | 51.532 | 51.532 | 3.864 | 48.301 | 48.301 |
| Mg^{2+} | 1.685 | 21.061 | 72.592 | 1.694 | 21.169 | 69.470 |
| Na^{+} | 1.092 | 13.649 | 86.242 | 1.342 | 16.772 | 86.242 |
| | 0.486 | 6.074 | 92.316 | | | |
| | 0.377 | 4.708 | 97.024 | | | |
| Cl^{−} | 0.171 | 2.138 | 99.163 | | | |
| pH | 0.062 | 0.781 | 99.943 | | | |
| TH | 0.005 | 0.057 | 100.000 | | | |


As shown in Table 2, the variance eigenvalues of Ca^{2+}, Mg^{2+}, and Na^{+} are all greater than 1, indicating that they are of great significance for distinguishing different types of water sources, and their cumulative contribution rates are 51.532, 72.592, and 86.242% respectively. This showed that the first three principal components already carry most of the information of the original data and can accurately distinguish different types of water sources.

[…] Ca^{2+}, Mg^{2+}, and Na^{+} were located on the steep slope, while the characteristic values of the last five scatter points were all less than 1.
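The contribution rates in Table 2 can be checked directly from the listed eigenvalues. The sketch below divides each initial eigenvalue by the sum of all eigenvalues and accumulates the result, then applies the eigenvalue-greater-than-1 rule (often called the Kaiser criterion) to decide how many components to keep. Because the tabulated eigenvalues are rounded to three decimals, the recomputed percentages may differ from the table in the last digits.

```python
# Initial eigenvalues as listed in Table 2.
eigenvalues = [4.123, 1.685, 1.092, 0.486, 0.377, 0.171, 0.062, 0.005]

total = sum(eigenvalues)
# Contribution rate of each component, in percent of total variance.
percent = [100.0 * ev / total for ev in eigenvalues]

# Cumulative contribution rate.
cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(running)

# Retain the components whose eigenvalue exceeds 1.
retained = [i + 1 for i, ev in enumerate(eigenvalues) if ev > 1.0]

for i, (ev, p, c) in enumerate(zip(eigenvalues, percent, cumulative), start=1):
    print(f"component {i}: eigenvalue={ev:.3f}, variance={p:.2f}%, cumulative={c:.2f}%")
print("retained components:", retained)
```

Under this rule the first three components are kept, and together they account for roughly 86% of the total variance, in line with the table.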

To sum up, the groundwater environment changed significantly before and after coal mining. Therefore, to examine whether the change in hydrochemical data affects the water source identification results, we put forward two cases for comparative analysis. In both cases, the data of No.8 coal roof water (marked as B), No.122 coal floor water (marked as C), and the No.1214 coal aquifer (marked as D) remain unchanged: (1) water source identification based on all data of No. 5 coal (marked as A1); (2) water source identification based only on the No. 5 coal data collected after 2016, that is, after the change (marked as A2).

### Model identification results

The accuracy rate (A), the precision rate (P), the recall rate (R), and the F1 value were selected for the LightGBM model, PSO-LightGBM model, Classification and Regression Tree (CART) model, Random Forest (RF) model to compare and analyze their classification performance.

The multi-class problem is divided into multiple two-class problems for evaluation. There are four cases where the predicted results of the classifier are combined with the actual results on the dataset, as shown in Table 3.

| Prediction result | Actual result: 1 | Actual result: 0 |
|---|---|---|
| 1 | TP | FP |
| 0 | FN | TN |


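Among them, T represents True, F represents False, P represents Positive, and N represents Negative. TP means a positive sample is correctly predicted as positive, FP means a negative sample is incorrectly predicted as positive, FN means a positive sample is incorrectly predicted as negative, and TN means a negative sample is correctly predicted as negative.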
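The four evaluation indices can be computed from these counts using their standard definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), F1 is the harmonic mean of precision and recall, and accuracy is the share of samples classified correctly. The sketch below treats one water source at a time as the positive class (one-vs-rest). The function names and the label lists are hypothetical and are not taken from the study data.

```python
def one_vs_rest_counts(y_true, y_pred, label):
    """Count TP, FP, FN, TN when `label` is treated as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == label and t == label:
            tp += 1
        elif p == label:
            fp += 1
        elif t == label:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Standard definitions; a zero denominator yields 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical true and predicted water-source labels, for illustration only.
y_true = ["A", "A", "A", "B", "C", "C", "D", "D"]
y_pred = ["A", "A", "B", "B", "C", "D", "D", "D"]

counts_a = one_vs_rest_counts(y_true, y_pred, "A")
p_a, r_a, f1_a = precision_recall_f1(*counts_a[:3])
acc = accuracy(y_true, y_pred)
print(counts_a, p_a, r_a, f1_a, acc)
```

Repeating the per-class calculation for each of the four water sources and averaging the results gives a single precision, recall, and F1 value per model; which averaging scheme the study used is not stated here.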

Figure 7 shows the classification accuracy, precision, recall, and F1 value of the LightGBM, PSO-LightGBM, CART, and RF models. In classifying water source types, PSO-LightGBM is slightly better than RF and better than LightGBM, and all three of these ensemble models are better than CART.

When the water sample data monitored after coal mining are used as the input to the identification model, the identification accuracy improves to a certain extent.

To sum up, from the comparison between the classification value and the real value and the analysis of the error results of each model, it can be seen that the classification value of PSO-LightGBM is closer to the real value, and the classification performance is better. It can be concluded that the data changes after mining have a certain influence on the results of the recognition model, and the optimized recognition model has higher accuracy.

## CONCLUSIONS

The 195 groups of hydrochemical data monitored in the Donghuantuo mining area over the past 30 years were analyzed in depth using traditional hydrochemical analysis methods, and a PSO-LightGBM water inrush source identification model was established based on these data. The main research conclusions are as follows:

- (1) We optimized the LightGBM model with particle swarm optimization (PSO) and established the PSO-LightGBM mine water source identification model. The model is simple to operate and has high identification accuracy. Furthermore, the identification accuracy of four identification models, PSO-LightGBM, LightGBM, RF, and CART, was compared and analyzed. The identification accuracy of PSO-LightGBM is the highest, reaching 97.22%, and the recognition accuracy of the CART model is relatively low.

- (2) The mine water environment changed significantly before and after the mining of the No.5 coal seam in the study area. After 2016, its hydrochemical type changed from Ca^{2+} type water (71.8%) to Na^{+} type water (100%). Moreover, there has been a significant increase in TDS, and this change will affect the identification of water inrush sources to a certain extent.

- (3) The analysis of the identification factors shows that the main reason for the misjudgment of the model is that the water quality of adjacent aquifers is relatively similar, or that the established model identification interval is not accurate enough. In particular, after coal mining the mine water environment usually changes significantly, so it is necessary to analyze the changes in water samples in time, thereby improving the reliability of the water sample data and ultimately strengthening the accuracy of the water inrush source identification results.

## ACKNOWLEDGEMENTS

This study was financially supported by the National Natural Science Foundation (41972255), the National Natural Science Foundation (U171020056), and the Ministry of Science and Technology of China (2017YFC0804104). In addition, Y. Ji is supported by the China Scholarship Council.

## DATA AVAILABILITY STATEMENT

All relevant data are included in the paper or its Supplementary Information.

## CONFLICT OF INTEREST

The authors declare there is no conflict.