Abstract
Membrane bioreactor (MBR) technology is a sewage treatment process that combines membrane separation with biological treatment, and it offers considerable advantages in sewage treatment. Membrane fouling, however, hinders the development of the MBR process. Studies have shown that the degree of membrane fouling can be judged from the membrane flux. In this study, principal component analysis was used to extract the main factors affecting membrane fouling, and the random forest algorithm on the Hadoop big data platform was then used to establish and test an MBR membrane flux prediction model. To verify the model's effectiveness, BP neural network and support vector machine (SVM) models were established using the same experimental data. Comparison of the results from the different models showed that the random forest algorithm gave the best MBR membrane flux predictions.
Highlights
We used principal component analysis to identify the main factors influencing MBR membrane flux.
We used the random forest algorithm on the Hadoop big data platform to establish a simulation model that predicts MBR membrane flux.
Compared with BP neural network and SVM models built on the same data, the random forest model gave more accurate predictions.
INTRODUCTION
Humans depend on water resources but, with the development of society, such resources have suffered serious pollution, and water shortages have become a reality to be faced. Water pollution reduces product quality, hinders industrial development, affects the ecological environment, and harms human health. For a long time, traditional aerobic biological sewage treatment technology (for example, the activated sludge process) has played an important role in industrial and domestic sewage treatment. Because the traditional activated sludge process has various disadvantages, improved technologies have been developed, such as membrane bioreactor (MBR) technology, a high-efficiency process combining membrane separation and biological treatment (Chang et al. 2011; Li et al. 2019). An MBR separates sludge and water efficiently through the membrane module, and the effluent can be used directly. It has the advantages of a small footprint, easy automation, and good effluent quality. With the development of wastewater treatment technology, the membrane bioreactor market has boomed (Braak et al. 2011; Wang et al. 2014; Meng et al. 2017).
Membrane fouling hinders the development of MBR wastewater treatment and is a direct cause of the decline in membrane flux, the magnitude of which is a measure of the degree of fouling (Zhang et al. 2006; Alibardi et al. 2014). To help improve the MBR process, intelligent computer simulation models can be used to predict membrane flux. At present, MBR flux prediction relies mostly on mathematical models, BP (back propagation) neural network models and similar methods (Lee et al. 2002; Barello et al. 2014). Mathematical simulation helps to improve the MBR process: Kapumbe et al. (2019) used the ASM3 model to simulate MBR wastewater treatment, for instance, while Khan et al. (2009) established a mathematical model to predict the degree of membrane fouling. These methods have been useful but do not predict membrane flux itself. Jun et al. (2019) established a regression model, based on Colin Maclaurin's mathematical principles, to predict flux through a ceramic membrane treating wastewater. Their model does not reflect the mechanism of membrane fouling well and the physical meaning of its parameters is imprecise, so it does not predict fouling accurately. Li et al. (2014) established a BP neural network optimized by a genetic algorithm to predict MBR membrane flux. BP neural networks learn by gradient descent, are prone to overfitting, and require a large amount of data to ensure prediction accuracy.
The random forest algorithm is a machine learning algorithm based on classification trees. It is fast, has few parameters to tune, and can effectively avoid overfitting while processing large, multi-dimensional data sets efficiently (Strobl et al. 2008; Lee et al. 2010; Ross & Allen 2014). The algorithm has achieved good results in medical treatment (Chen & Liu 2005; Lin et al. 2011), fire protection (Oliveira et al. 2012), fault diagnosis (Cerrada et al. 2016), agriculture (Naidoo et al. 2012; Wang et al. 2016) and other fields. Hadoop is an open source distributed computing platform developed by the Apache Software Foundation. Its core consists of the Hadoop Distributed File System (HDFS) and the MapReduce computing framework (Ghazi & Gangodkar 2015). The random forest algorithm running on the Hadoop platform has been used extensively in applied research. Wu et al. (2019) proposed an improved random forest algorithm combined with the MapReduce computing framework. Masarat et al. (2016) combined random forest with the Hadoop platform to design an intrusion detection system. Fan et al. (2018) used the same combination to build a big data analysis platform for highway travel time prediction. To achieve better membrane flux prediction, this study likewise runs the random forest algorithm on the Hadoop platform, with the main influencing factors as the model inputs and the membrane flux as the output.
ANALYSIS OF FACTORS INFLUENCING MEMBRANE FOULING
An MBR is a complex system. Any parameter involved in sewage treatment can contribute to membrane fouling and, as operating methods and conditions change, the factors affecting the MBR change constantly. Microbial, inorganic and organic fouling interact with each other and together constitute the main types of MBR membrane fouling. Membrane flux is therefore affected by a combination of factors, mainly mixed liquor suspended solids (MLSS), temperature, operating pressure, total resistance, COD, etc. (Kulesha et al. 2018). Because the MBR process involves numerous operating conditions and many factors that affect membrane flux, principal component analysis (PCA) is used to simplify the MBR membrane flux prediction model and improve the effectiveness of the prediction.
PCA is a statistical analysis method that uses dimensionality reduction to convert multiple indicators into a few comprehensive indicators. It combines multiple correlated original variables linearly into uncorrelated principal components (Liu et al. 2016). It converges fast, does not require a basis function, and can handle irregularly distributed data within a limited region. PCA is used widely in communication technology, statistical analysis, and image processing. Its use in selecting membrane fouling factors includes several steps:
- (1) Determine the main variables affecting MBR membrane flux and construct a matrix of influencing factors: MLSS, total resistance, operating pressure, COD, pH, and temperature. Denote this matrix X.
- (2) Transform X (for example, by centering or standardizing each column) to obtain the matrix A.
- (3) Compute the covariance matrix S of A, then solve for the eigenvector matrix and the eigenvalue matrix V of S and sort the eigenvalues (characteristic roots).
- (4) Decompose the matrix and obtain the principal component matrix.
- (5) Analyze the eigenvalue matrix V, determine the correlation between the indicators, and obtain three principal components.
- (6) Take the three principal factors (MLSS, total resistance, and operating pressure) as the inputs to the random forest algorithm.
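The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the plant data are not available, so `make_synthetic_X` generates hypothetical values; step (2) is assumed to be column standardization; and the eigenpairs are found by power iteration with deflation, chosen here only to keep the example dependency-free, not because the study used it.

```python
import math
import random

FACTORS = ["MLSS", "total_resistance", "operating_pressure", "COD", "pH", "temperature"]

def make_synthetic_X(n_rows=90, seed=0):
    """Hypothetical stand-in data: three latent drivers plus noise (not plant data)."""
    rng = random.Random(seed)
    X = []
    for _ in range(n_rows):
        a, b, c = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        base = [a, a + b, b, c, 0.5 * c, a - c]
        X.append([v + rng.gauss(0, 0.3) for v in base])
    return X

def standardize(X):
    """Assumed step (2): center each column and scale it to unit sample variance."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / (n - 1))
            for j in range(p)]
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in X]

def covariance(A):
    """Step (3): sample covariance matrix S of the centered matrix A."""
    n, p = len(A), len(A[0])
    return [[sum(row[i] * row[j] for row in A) / (n - 1) for j in range(p)]
            for i in range(p)]

def top_eigenpairs(S, k, iters=500):
    """Leading k (eigenvalue, eigenvector) pairs via power iteration with deflation."""
    p = len(S)
    S = [row[:] for row in S]  # work on a copy
    pairs = []
    for _ in range(k):
        v = [1.0 + i for i in range(p)]  # arbitrary non-zero starting vector
        for _ in range(iters):
            w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(S[i][j] * v[j] for j in range(p)) for i in range(p))
        pairs.append((lam, v))
        # Deflate: remove the component just found before searching for the next one.
        S = [[S[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return pairs

X = make_synthetic_X()
A = standardize(X)
S = covariance(A)
pairs = top_eigenpairs(S, k=3)
total_variance = sum(S[i][i] for i in range(len(S)))
explained = [lam / total_variance for lam, _ in pairs]
# Step (4): project A onto the retained eigenvectors to get principal component scores.
scores = [[sum(a * w for a, w in zip(row, vec)) for _, vec in pairs] for row in A]
print("explained variance ratios:", [round(r, 3) for r in explained])
```

The loadings in each eigenvector show how strongly each original factor contributes to a component, which is the kind of information step (5) relies on when ranking the factors.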
HADOOP AND RANDOM FOREST ALGORITHM
Hadoop platform
The core of the Hadoop architecture is HDFS and MapReduce. HDFS implements the underlying distributed file storage, and the MapReduce framework implements distributed parallel computing. An HDFS cluster has a master-slave structure and usually consists of one NameNode and multiple DataNodes. The NameNode is the master server, which coordinates client access to files and manages the file system. The DataNodes are the slaves, which store and manage the data (Li et al. 2016). Users can store files directly on HDFS; each file is divided into multiple blocks, which are stored on a set of DataNodes. The NameNode coordinates the DataNodes' work and performs file system operations such as opening, closing, and renaming files. Figure 1 is a Hadoop cluster deployment diagram.
MapReduce, a software framework in the Hadoop architecture, uses a parallel programming model. Software developers can write parallel distributed programs based on it and distribute them for execution to a cluster containing thousands of machines, while the framework ensures that large data sets are processed correctly. The MapReduce framework consists of a JobTracker, which runs independently on the master node, and TaskTrackers, which run on the slave nodes. After receiving a job, the master node distributes its tasks to the slave nodes and coordinates their execution. Figure 2 is a MapReduce work flow chart.
Random forest algorithm
Mahout
Mahout is an open source project built on the Hadoop platform. It implements many classic machine learning algorithms based on the MapReduce model, mainly classification, clustering, and recommendation engine algorithms (Bagchi 2015). Because the random forest algorithm can be run in parallel, construction of the decision trees can be assigned to different Map tasks, and the trained model is output to HDFS through the Reduce process, thereby completing the random forest model's construction.
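The division of labour described above can be imitated in plain Python to make the data flow concrete. This is a hedged sketch, not Mahout's API: `map_task`, `reduce_task` and `fit_stump` are hypothetical names, the "trees" are one-split regression stumps standing in for full decision trees, the tasks run sequentially rather than on a cluster, and the data are invented.

```python
import random
from statistics import mean

def fit_stump(rows):
    """Fit a one-split regression stump on (x, y) rows; a stand-in for a full tree."""
    xs = sorted({x for x, _ in rows})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        ml, mr = mean(left), mean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # all x values identical: predict the overall mean
        m = mean(y for _, y in rows)
        return (xs[0], m, m)
    return best[1:]

def map_task(task_id, rows, trees_per_task):
    """Map: each task grows its own trees on bootstrap samples and emits them."""
    rng = random.Random(task_id)
    for _ in range(trees_per_task):
        sample = [rng.choice(rows) for _ in rows]
        yield "forest", fit_stump(sample)

def reduce_task(key, trees):
    """Reduce: collect every emitted tree under one key to form the forest."""
    return key, list(trees)

def predict(forest, x):
    """Average the individual tree predictions."""
    return mean(left if x <= t else right for t, left, right in forest)

# Invented one-feature data: y grows linearly with x.
rows = [(float(x), 2.0 * x + 1.0) for x in range(20)]
emitted = [pair for task_id in range(4) for pair in map_task(task_id, rows, 5)]
_, forest = reduce_task("forest", (tree for _, tree in emitted))
print(len(forest), predict(forest, 0.0), predict(forest, 19.0))
```

Because each Map task needs only its own bootstrap sample, the tasks share no state, which is what makes tree construction straightforward to distribute.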
EXPERIMENTAL MODEL ESTABLISHMENT
Random forest model
Five computers were used to build a Hadoop cluster with one NameNode and four DataNodes, and four Map and Reduce tasks were configured on each node. The HDFS block size was set to 64 MB. Figure 3 shows the membrane flux prediction model.
The experimental data for MBR process operation came from a municipal sewage treatment plant. There were 90 data sets. MLSS, total membrane resistance, and operating pressure were used as the model's inputs, and its output was the MBR membrane flux. Two very important random forest parameters are the number of candidate variables considered at each tree node (m_try) and the number of trees in the forest (n_tree). On the basis of experiments, values of 2 and 300 were selected for m_try and n_tree, respectively. Of the 90 data sets, 84 were selected randomly as training samples, and a nonlinear relationship between the influencing factors and membrane flux was established by training. The algorithm steps were:
- (1) The training set for each tree was generated by resampling with the Bootstrap method.
- (2) From the M features in the training set, m were selected (m < M). At each tree node, the best of the m features for splitting was chosen based on the Gini coefficient.
- (3) Steps (1) and (2) were repeated to generate multiple decision trees from the training set, which together form the random forest. The final result came from voting across the decision trees.
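The two sources of randomness in these steps can be sketched with the parameter values reported in the text (84 training samples, n_tree = 300, m_try = 2, three input features). The function names are hypothetical, the feature subset is drawn once per tree purely for illustration (in a real forest it is redrawn at every node), and aggregation by averaging is an assumption suited to a continuous target such as flux, whereas the text describes the aggregation as voting.

```python
import random
from statistics import mean

FEATURES = ["MLSS", "total_resistance", "operating_pressure"]  # M = 3 model inputs
N_TRAIN, N_TREE, M_TRY = 84, 300, 2  # values reported in the text

def bootstrap_indices(rng, n):
    """Step (1): draw n row indices with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def plan_forest(seed=0):
    """Steps (1)-(3): record the random draws that would drive each tree."""
    rng = random.Random(seed)
    plan = []
    for _ in range(N_TREE):
        in_bag = bootstrap_indices(rng, N_TRAIN)
        oob = sorted(set(range(N_TRAIN)) - set(in_bag))  # rows this tree never sees
        node_features = rng.sample(FEATURES, M_TRY)      # candidates at one node
        plan.append({"in_bag": in_bag, "oob": oob, "node_features": node_features})
    return plan

def aggregate(tree_predictions):
    """Combine per-tree outputs; the mean is assumed here for a regression target."""
    return mean(tree_predictions)

plan = plan_forest()
mean_oob_fraction = mean(len(tree["oob"]) / N_TRAIN for tree in plan)
print(round(mean_oob_fraction, 3), plan[0]["node_features"])
```

Recording the out-of-bag rows for each tree is useful because they provide a built-in validation set for that tree at no extra cost.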
Random forest selects the training set for each tree from the original data using the bagging method (Strobl et al. 2007). If the original data set contains n records and n records are drawn with replacement, the probability that a given record is not selected is $(1 - 1/n)^n$; the records that are not selected are called out-of-bag (OOB) records. As n becomes large, $(1 - 1/n)^n$ converges to $1/e \approx 0.368$; that is, about 36.8% of the data is left out of each bootstrap sample, which helps to ensure sample diversity. As the number of decision trees in the forest increases, the prediction model's error decreases gradually and then stabilizes. The model error analysis diagram is shown in Figure 4.
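The limit quoted above is easy to check numerically. The short sketch below evaluates the out-of-bag probability for a few sample sizes, including n = 84 (the training set size used here), and compares it with 1/e; the function name is illustrative.

```python
import math

def oob_probability(n):
    """Probability that one record is missed by n draws with replacement from n records."""
    return (1 - 1 / n) ** n

for n in (10, 84, 1000, 10**6):
    print(n, round(oob_probability(n), 4))

limit = 1 / math.e  # approximately 0.3679
```

For n = 84 the value is already about 0.366, so the asymptotic figure of 36.8% is a close approximation even for a data set of this size.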
Once the random forest prediction model had been trained, the six remaining data sets were used as test data, and the predictions were compared with the experimental values. The experimental and predicted results are shown in Figure 5 and an analysis of the experimental results in Table 1. The mean relative prediction error was 4.56%, indicating good agreement.
Experimental value (L/(m²·h)) | 21.7 | 17.3 | 15.9 | 19.2 | 30.4 | 26.3 |
Predicted value (L/(m²·h)) | 22.3 | 16.7 | 14.9 | 18.5 | 31.6 | 28.2 |
Relative error | 0.0276 | 0.0347 | 0.0629 | 0.0365 | 0.0395 | 0.0722 |
Mean relative error | 0.0456 |
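The error rows of the table follow directly from the first two rows. As a quick check, the sketch below recomputes each relative error as |predicted − experimental| / experimental and then averages them; the variable names are illustrative.

```python
experimental = [21.7, 17.3, 15.9, 19.2, 30.4, 26.3]  # L/(m²·h), from the table
predicted = [22.3, 16.7, 14.9, 18.5, 31.6, 28.2]     # L/(m²·h), from the table

relative_errors = [abs(p - e) / e for e, p in zip(experimental, predicted)]
mean_relative_error = sum(relative_errors) / len(relative_errors)

print([round(r, 4) for r in relative_errors])
# → [0.0276, 0.0347, 0.0629, 0.0365, 0.0395, 0.0722]
print(round(mean_relative_error, 4))
# → 0.0456
```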
Comparative analysis of experimental results
In order to verify the random forest model's effectiveness, a BP neural network model and an SVM (support vector machine) model were established using the same samples. The multi-model comparison is shown in Figure 6.
Method | MAE | RMSE | R² |
---|---|---|---|
BP neural network | 1.4500 | 1.6568 | 0.8945 |
Support vector machine | 1.2667 | 1.4119 | 0.9234 |
Random forest | 1.0000 | 1.1000 | 0.9535 |
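The random forest row can be reproduced from the six experimental and predicted flux values reported for the test samples. The sketch below assumes the usual definitions (mean absolute error, root mean squared error, and R² = 1 − SS_res/SS_tot); the source does not state its formulas, but these definitions give values consistent with the table. The per-sample predictions of the BP and SVM models are not reported, so their rows cannot be checked in the same way.

```python
import math

experimental = [21.7, 17.3, 15.9, 19.2, 30.4, 26.3]  # L/(m²·h)
predicted = [22.3, 16.7, 14.9, 18.5, 31.6, 28.2]     # random forest, L/(m²·h)

residuals = [p - e for e, p in zip(experimental, predicted)]
n = len(residuals)

mae = sum(abs(r) for r in residuals) / n
rmse = math.sqrt(sum(r * r for r in residuals) / n)

mean_exp = sum(experimental) / n
ss_res = sum(r * r for r in residuals)
ss_tot = sum((e - mean_exp) ** 2 for e in experimental)
r2 = 1 - ss_res / ss_tot

print(round(mae, 4), round(rmse, 4), round(r2, 4))
# → 1.0 1.1 0.9535
```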
CONCLUDING REMARKS
Membrane fouling has always hindered the widespread use of MBRs. In this study, PCA was used to reduce the dimensions of the MBR membrane fouling factor set, which led to the selection of three indicators (MLSS, total resistance, and operating pressure) as the main factors affecting MBR membrane flux. Big data and cloud computing are developing rapidly, and this study used the Hadoop platform and its two core components, HDFS and MapReduce, to build an actual distributed cluster environment. Given the shortcomings of common mathematical models and traditional simulation algorithms in MBR membrane flux prediction, and taking both prediction accuracy and prediction time into account, the random forest algorithm was applied to MBR membrane flux prediction on the big data platform. The predicted values were compared with the experimental values, giving a mean relative error of 4.56%. To verify the model's validity, it was also compared with BP neural network and SVM models, and was shown to have higher prediction accuracy than either. The random forest model is therefore effective for simulating and predicting MBR membrane fouling.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.