Abstract
Palm oil mill effluent (POME) treatment is an anthropogenic activity that contributes to global warming through methane emission. Failure to address this issue would bring about the catastrophic impacts of global temperatures exceeding 2 °C, as predicted by the Intergovernmental Panel on Climate Change (IPCC) in 2015. Little research and development exists on monitoring GHGs and methane emissions in POME treatment facilities, in contrast to research on improving biogas production. A methane emission prediction tool based on machine learning models can address this gap and consequently facilitate the development of efficient carbon neutrality approaches in POME treatment plants. In this study, six regression models were explored alongside their kernels using eight predictors linked to methane emission volume. The best model found was the support vector machine (SVM), with R2 and RMSE values of 0.45 and 0.749, respectively.
HIGHLIGHTS
Selection and utilisation of a suitable machine learning algorithm for the prediction of CH4 emissions.
Obtaining raw data on POME treatment for database development and understanding the nature and quality of the dataset using available data from pre-processing methods.
Integrating the developed database into the CH4 emission tool.
Comparing factors influencing CH4 emissions.
Determining the highest influencing parameters.
INTRODUCTION
Palm oil, a vegetable oil extracted from the fruit of the oil palm tree (Elaeis guineensis) (CABI 2019), is regarded as the most widely used vegetable oil on the planet. Based on recent data from Statista (2022b), global palm oil production currently stands at 75 million metric tonnes, equivalent to about 20% of global vegetable oil production. Statista (2022a) also reports that worldwide consumption of palm oil currently stands at 73 million metric tonnes, dominating the world vegetable oil market with a 35% market share. Palm oil achieves these figures mainly because the oil palm is a very efficient crop, capable of producing enormous volumes of oil over relatively small areas of land throughout the year (Abd Ghani 2021). In addition, the global palm oil market reached a value of 50.6 billion USD in 2021, and IMARC Group expects it to reach 65.5 billion USD by 2027, at a CAGR of 4.3% between 2022 and 2027 (IMARC Group 2021). Malaysia, the world's second largest producer of palm oil after Indonesia, is responsible for 26% of global palm oil production and 34.3% of global palm oil exports (MPOC 2020). In terms of economic potential and growth, palm oil will undeniably continue to benefit Malaysia both now and in the future. However, despite these benefits and positive projections, palm oil has a significant effect on the environment, as the processing of oil palm fruit into palm oil produces a non-toxic wastewater known as palm oil mill effluent (POME) (Zafar 2022).
POME is the liquid waste formed in palm oil mills during the last phases of palm oil processing (Bachrun & Baskara 2023). POME contains 95–96% water, 0.6–0.7% oil, and 4–5% total solids, of which 2–4% are suspended solids (Hassan & Abd-Aziz 2012). In terms of concentration, POME typically has a total solids concentration of ∼40,500 mg/L, an oil and grease concentration of ∼4,000 mg/L, a high chemical oxygen demand (COD) of ∼50,000 mg/L, and a high biological oxygen demand (BOD) of ∼25,000 mg/L (Hassan & Abd-Aziz 2012). At these concentrations, POME can kill aquatic life if channelled directly into receiving streams, as its COD and BOD concentrations are 100 times higher than those of municipal sewage (Kamyab et al. 2018). High COD and BOD values translate to low oxygen availability in water, which causes aquatic organisms to suffocate and die (EPA 2012). According to Lai et al. (2016) in the book ‘Palm Oil’, roughly 2.5 tonnes of POME are generated for every tonne of crude palm oil recovered from milling. As Malaysia is a prominent palm oil producer, its palm oil industry generates approximately 50 million tonnes of POME per annum (Akhbari et al. 2020). Therefore, proper treatment of POME is essential to prevent water pollution such as oxygen depletion and eutrophication (Ng 2017).
Malaysia has acted to tackle pollution caused by POME by imposing strict discharge limits on industrial wastewater effluents through the Environmental Quality Act 1974 (Department of Environment Malaysia 1979), later strengthened in the Environmental Quality (Sewage) Regulations 2009 (Department of Environment 2010). Under these regulations, the limits set by the Department of Environment for BOD5, COD, suspended solids, and oil and grease are 50, 200, 100, and 20 mg/L, respectively. As a result, Malaysia has adopted several treatment methods to ensure that treated POME meets the discharge limits, mainly anaerobic digestion (AD), membrane treatment (MD), and the evaporation method (EM). According to Kamyab et al. (2018), the most extensively used technique for treating POME is the biological approach, based on anaerobic and aerobic ponding systems using bacteria. In Malaysia, AD in the form of open ponding is usually selected as the primary treatment approach because of its cost-effectiveness (Appels et al. 2008), with low capital and operating costs (Sarbatly 2020). Besides anaerobic ponding, AD is also applied in equipment such as fluidised bed reactors, closed anaerobic digesters, the up-flow anaerobic sludge fixed film reactor (UASFF), the ultrasonic membrane anaerobic system (UMAS), and many more to treat POME.
A common misconception among operators working in palm oil mills is that POME is the most expensive and challenging waste to manage (Madaki & Seng 2013). Rather than being viewed as waste, POME has very high potential to become a resource if subjected to the right treatment methods, such as AD. According to Ward et al. (2008), POME can produce a carbon neutral energy source in the form of biogas. Furthermore, according to Tambone et al. (2010), the final residue after the AD process is nutrient-dense and can be used as fertiliser in agriculture. These findings show that POME can generate various resources if used appropriately.
As noted above, POME can yield biogas through AD. Loh et al. (2017) reported that biogas from AD via open ponding or open digester tanks comprises 60–70% methane (CH4), 30–35% carbon dioxide (CO2), and trace amounts of hydrogen sulphide (H2S). This is due to bacterial activity breaking down the organic matter present in POME in the absence of oxygen (EPA 2022). Despite the obvious benefits of operating under oxygen-free conditions, the biggest downside is the release of greenhouse gases (GHGs) into the atmosphere. Methane is the second most abundant manmade GHG after carbon dioxide, accounting for roughly 20% of global emissions (EPA 2021). Methane is 25 times more effective than carbon dioxide at trapping heat in the atmosphere (United States Environmental Protection Agency 2021). It is therefore crucial to capture the biogas generated from POME treatment, as the World Biogas Association (2018) reports that capturing biogas would enable the reduction of global emissions by 18–20%, in line with the Paris Agreement to address climate change (UNFCCC 2015). Malaysia, as a signatory to the Paris Agreement, has committed to tackling climate change by implementing more biogas capture in the palm oil industry, as the industry contributes about 23.7% of total methane emissions in Malaysia (Ministry of Environment & Water 2020).
Turning to the state of the global climate, the rate of global warming has rapidly intensified, leading 2,089 jurisdictions in 38 countries to declare a climate emergency as of March 2022 (climateemergencydeclaration.org 2022). According to the Intergovernmental Panel on Climate Change (IPCC), global warming levels of 1.5 and 2 °C will be exceeded in the 21st century if no significant and more intensive measures are taken to combat climate change (Masson-Delmotte et al. 2021). As a result, Malaysia vowed to cut its greenhouse gas emission intensity across the economy by 45% based on GDP in 2030 at the United Nations Framework Convention on Climate Change (UNFCCC) COP 26, held from 31 October to 12 November 2021 in Glasgow (Dalm 2021). One way to achieve this ambition is to focus on installing methane capturing facilities in new and existing palm oil mills (Ministry of Energy 2017).
Despite the benefits and potential of POME, Malaysia is still far from fully capturing its methane emissions and using them as biogas, as most palm oil mills operate ponding systems. In addition, according to Khairul et al. (2019), only 50 palm oil mills are currently in the testing phase of biogas recovery under the Clean Development Mechanism (CDM) in Malaysia. Statista (2020b) reported that 457 Malaysian palm oil mills were in operation as of 2020. This means that only about 11% (50 of 457) of the Malaysian palm oil industry has begun the transition to biogas capture, which is far from ideal.
To ensure the process of capturing methane runs smoothly, identifying the exact amounts of biogas emissions from POME treatment facilities accurately should be regarded as a top priority for Malaysia. This can be achieved through means of prediction or real-time monitoring. An accurate identification method can serve as a design basis for engineers and building operators to design the capture systems accurately and in accordance with the size of the associated palm oil mills. However, as easy as it may seem, there are a few challenges associated with the creation of a prediction tool or programme to quantify methane emissions.
Firstly, the numerous process parameters produce many instabilities in the anaerobic digestion process, which makes the system's methane generation difficult to predict. According to studies by Lam & Lee (2011) and Abdurahman et al. (2013), methane emission from anaerobic digestion is influenced by a number of parameters, including operating temperature and pH, which affect the survivability of methanogens in the organic matter (Zinder et al. 1984; Choong et al. 2018). In addition, BOD, COD, and suspended solids affect the growth of microorganisms such as methanogens, which influences the digestion of POME and in turn causes variation in the methane produced (Utami et al. 2016; Putro 2022). Hydraulic retention time (HRT) and organic loading rate (OLR) affect the efficiency of POME digestion, making them crucial parameters in methane production (Zainal et al. 2022). Furthermore, according to Abdurahman et al. (2013), the technology used to pre-process and prepare the POME for the AD process can alter the amount of methane emitted, ranging between 36 and 71.9%. When it comes to estimating and monitoring CH4 emissions, these uncertainties, along with the lack of research on this topic, make it a significant challenge to obtain accurate measurements of CH4 from POME.
Secondly, methane monitoring has conventionally been done through offline grab-sampling, which involves physically visiting the treatment sites and conducting experiments and tests to verify the amount of methane emitted during treatment. In 2020, the arrival of the COVID-19 pandemic halted physical activities, as reported in a recent study by Diffenbaugh et al. (2020). These activities include the Malaysian government's development of and transition to renewable energy (RE), such as methane monitoring and capturing in palm oil mills. As a result, reliance on fossil fuels as the primary source of energy re-emerged, since energy demand kept rising despite the restrictions imposed by the movement control orders (MCOs), resulting in an increase in GHG emissions (Vaka et al. 2020). The transition to RE has therefore never been more challenging, as Malaysia now generates only 8% of its energy from renewable sources. This is small compared with the ambitious goal of generating 20% of energy from renewable resources by 2025 (Joshi 2019). Moreover, this shows the need for a tool to predict and monitor GHGs from POME, which can help operators and personnel working remotely to determine a plant's methane emissions and facilitate CH4 capture. In addition, according to NST Business (2021), some palm oil mills have been transitioning to automation under Industrial Revolution 4.0 by implementing the Palm Oil Mill Integrated System (POMIS) to facilitate daily operations, so incorporating a monitoring or prediction tool into their Internet of Things (IoT) infrastructure would drive these mills towards sustainability at a faster rate.
Machine learning (ML) has advanced in recent years, and it may be employed as a predictive tool in microbial ecology and systems biology studies (Kazemi Yazdi & Scholz 2010; Witten 2011). ML algorithms such as linear regression (LR), artificial neural network (ANN), support vector machine (SVM), and Gaussian process regression (GPR) have been incorporated into these studies and have successfully predicted various outputs, including biogas production from AD (Zaied et al. 2020; Asadi & McPhedran 2021). However, most of these studies focus on optimising a palm oil mill's biogas production and AD processes. Very little has been developed towards a specific programme that palm oil mills can incorporate within their IoT and use to remotely monitor and predict methane emissions. Therefore, the main purpose of this research is to develop a CH4 prediction tool based on an ML algorithm to aid in the monitoring and control of CH4 emissions in POME treatment plants. To achieve this, a correlation must be built between the critical parameters that affect CH4 emission, based on collected databases and using several ML tools.
METHODOLOGY
Data collection
To develop the CH4 emission prediction tool, a dataset is required as a foundation for the research. As stated by Dekker (2006), without a dataset there can be no programme development to fulfil the aims and objectives of the study. Therefore, datasets were obtained from four different palm oil mills across Malaysia. Each dataset focuses primarily on POME treatment and comprises 24 monthly data points spanning 2019 to 2021. The datasets contain the main process parameters affecting the quality of POME, namely COD, BOD5, TS, SS, pH, temperature, OLR, and HRT, together with the final quantity of biogas produced at the end of each month. All the parameters obtained from the palm oil mills were used as input variables to the models, as they have a varying but crucial influence on biogas production. As mentioned by Loh et al. (2017), the biogas obtained consists predominantly of CH4, so the scope of this research is to predict the amount of CH4 that will be released at the end of POME treatment. A summary of the combined datasets is provided in Table 1.
Table 1. Summary of obtained dataset
Parameter | Unit | Range |
---|---|---|
COD | mg/L | 53,450–92,844 |
BOD5 | mg/L | 22,500–47,520 |
TS | mg/L | 20,148–56,420 |
SS | mg/L | 12,300–57,650 |
pH | – | 4.6–5.2 |
Temperature | °C | 46.8–62.4 |
OLR | kg CODin/m3 day | 0.9–1.8 |
HRT | days | 34–88 |
CH4 | Nm3 | 11,340–295,480 |
Data pre-processing and preparation
Data transformation is the process of changing data from one form to another. The most common data transformations convert raw data into a clean, usable format. Throughout the data transformation process, an analyst retrieves data from the original source, uses exploratory data analysis to analyse the behaviour of the data, performs the transformation, and lastly saves the data in the proper database. The methods involved include normalisation and standardisation, among others.
Normalisation




As mentioned, the dataset shown in Table 1 was transformed using min–max normalisation via the normalize(x, 'range', [0 1]) syntax provided in MATLAB® 2022a (MathWorks 2022b). This syntax applies Equation (3), and the normalised data were then passed to the regression learner toolbox provided by MATLAB® 2022a. The performance of each model in the regression learner is tabulated in the Results and Discussion section.
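As an illustration of what min–max normalisation to [0, 1] does, the sketch below rescales one column in plain Python rather than MATLAB. It is not the authors' code: the function name `min_max_normalise` and the sample COD readings are hypothetical, and it assumes the usual definition in which each value has the column minimum subtracted and is then divided by the column range.

```python
def min_max_normalise(values, lower=0.0, upper=1.0):
    """Linearly rescale a list of numbers so that they span [lower, upper]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot rescale a constant column")
    span = v_max - v_min
    return [lower + (v - v_min) * (upper - lower) / span for v in values]


# Hypothetical monthly COD readings (mg/L), chosen inside the Table 1 range.
cod = [53450.0, 61200.0, 75800.0, 92844.0]
cod_scaled = min_max_normalise(cod)
print(cod_scaled)  # smallest reading maps to 0.0, largest to 1.0
```

Because the minimum and maximum define the scale, a single extreme reading compresses every other value into a narrow band, which is why this method is regarded as sensitive to outliers.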
Standardisation




Upon applying Equation (2), the dataset is converted into a single, standardised format in which the new mean and standard deviation are 0 and 1, respectively (Liu 2020). Unlike min–max normalisation, the standardised values of each variable in a dataset can be positive or negative (Elen & Avuçlu 2021). This is because the absolute value of z is defined as the distance, in standard deviation units, between the raw score x and the mean X. When the raw score is below the mean, z is negative, and vice versa. An advantage of standardisation as a method of data transformation is that it can minimise the effects of outliers present in a dataset (Jayalakshmi & Santhakumaran 2011).
Following normalisation, the same dataset in Table 1 was also transformed using standardisation via the normalize(x) syntax provided in MATLAB® 2022a (MathWorks 2022b). This syntax applies Equation (2), and the standardised data were then passed to the regression learner toolbox provided by MATLAB® 2022a. The performance of each model in the regression learner is tabulated in the Results and Discussion section. Depending on the performance of the models obtained from the regression learner, the prediction capabilities of the tool can be further improved by revisiting the pre-processing and preparation of the dataset.
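The z-score transformation described above can be sketched in a few lines of Python. This is an illustrative stand-in for the MATLAB call, not the authors' implementation: the function name `standardise` and the pH readings are hypothetical, and the sample standard deviation (N − 1 denominator) is assumed.

```python
import statistics


def standardise(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (N - 1)
    if sd == 0:
        raise ValueError("cannot standardise a constant column")
    return [(v - mean) / sd for v in values]


# Hypothetical pH readings, chosen inside the Table 1 range.
ph = [4.6, 4.8, 4.9, 5.0, 5.2]
ph_z = standardise(ph)
print(ph_z)  # readings below the mean are negative, those above are positive
```

After the transformation the column has mean 0 and standard deviation 1, and the sign of each z-score shows on which side of the mean the raw reading lies.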
Exploratory data analysis (EDA)
Tukey (1977) defines EDA as numerical detective work. In terms of engineering statistics, EDA refers to the crucial process of doing preliminary data analysis utilising summary statistics and visualisations to find trends. EDA's major goal is to assist data analysts before making any assumptions. EDA is capable of detecting obvious errors, providing a clearer understanding of data patterns, the detection of outliers, and the discovery of variable correlations (IBM Cloud Education 2020). The commonly used methods to carry out EDA on datasets are Quantile-Quantile plots (QQ plot), normality tests, and skewness tests (Seltman 2018).
Shapiro–Wilk test (normality test)



Skewness test


In statistics, there are a few ways to identify types of skewness. The rule given by Bulmer (1967) states that skewness values between 0 and 0.5 indicate a generally symmetrical distribution, values between 0.5 and 1 indicate a moderately skewed distribution, and values greater than 1 indicate a strongly skewed distribution. The same rule applies to negative skewness values, using their magnitude. According to Orcan (2020), skewness can be used with this rule to assess the normality of a dataset. Furthermore, IBM (2020) and Lee (2020) note that certain transformations may be used to prepare datasets for ML according to the type of data skewness.
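The rule above can be written as a small classifier. The sketch below is a hypothetical illustration in Python: `sample_skewness` uses the plain moment-based coefficient (third central moment divided by the second central moment to the power 1.5), whereas statistical packages commonly report a small-sample-adjusted variant, so values may differ slightly from software output. The function names and the treatment of the exact boundaries 0.5 and 1 are assumptions.

```python
def sample_skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample adjustment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def skewness_category(skew):
    """Classify skewness by magnitude; the sign only gives the direction."""
    magnitude = abs(skew)
    if magnitude < 0.5:
        return "generally symmetrical"
    if magnitude <= 1.0:
        return "moderately skewed"
    return "strongly skewed"


# Hypothetical right-skewed sample: one large value stretches the upper tail.
sample = [1.0, 1.0, 1.0, 2.0, 10.0]
print(sample_skewness(sample), skewness_category(sample_skewness(sample)))
```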
Shapiro–Wilk tests for normality and skewness tests were carried out on each parameter in the dataset using IBM® SPSS® Statistics v26. These tests determine whether all the parameters should be normalised or standardised for the tool, or whether different transformation methods are required for specific parameters. The results of the statistical tests are provided in the Results and Discussion section.
Quantile-Quantile plot (QQ plot)
The University of Virginia Library (2022) defines the QQ plot as a graphical tool for determining whether a set of data follows a theoretical distribution, such as the normal or exponential distribution. In modern data analysis, if a dataset contains a large number of samples (N ≥ 30), assuming that the dataset follows a normal distribution is considered a safe and good procedure (Toby Mordkoff 2000). QQ plots are useful because they allow data analysts to rapidly assess whether this assumption is reasonable and, if not, how the assumption is violated and which distribution should be used instead. QQ plots can also guide data analysts towards appropriate transformations according to the distribution observed (Samuels et al. 2021). In modern engineering statistics, software such as MATLAB® and SPSS® is frequently used to produce QQ plots.
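To make the idea concrete, the following Python sketch builds the points of a normal QQ plot and summarises how straight they are with a correlation coefficient. It is only an illustration under stated assumptions: the plotting positions (i − 0.5)/n, the helper names, and the synthetic lognormal-shaped sample are all hypothetical choices, and a lognormal check is done here by applying the normal QQ plot to the logarithm of the data.

```python
import math
from statistics import NormalDist


def normal_qq_points(values):
    """Pair standard-normal theoretical quantiles with the sorted sample."""
    n = len(values)
    ordered = sorted(values)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def qq_correlation(points):
    """Pearson correlation of the QQ points; values near 1 mean a straight line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Synthetic, lognormal-shaped sample: the exponential of evenly spread normal quantiles.
n = 20
z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
sample = [math.exp(10.0 + 0.8 * q) for q in z]

r_raw = qq_correlation(normal_qq_points(sample))
r_log = qq_correlation(normal_qq_points([math.log(v) for v in sample]))
print(r_raw, r_log)  # the log-transformed data lie closer to a straight line
```

A sample whose logarithm gives a straighter normal QQ plot than the raw values is behaving like lognormal data, which is the kind of evidence that points to a logarithmic transformation.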
Based on the results of the statistical tests, parameters that did not follow a normal distribution were addressed by referring to their skewness values, which were used to decide the type of transformation. Parameters with generally symmetrical skewness values were standardised, following the skewness rule applied under Results and Discussion, and QQ plots were used to verify that standardisation was a valid transformation for them. Meanwhile, parameters found to be moderately or strongly skewed under the same rule were subjected to QQ plots to determine the type of distribution they followed. The QQ plot distributions used were the normal, lognormal, Weibull, logistic, and Gamma distributions.
Following the QQ plots shown under Results and Discussion, one variable strongly follows the lognormal distribution, while the remaining two variables follow the logistic distribution. The parameter that followed the lognormal distribution was transformed using the natural logarithm via the log(x) syntax provided in MATLAB® 2022a (MathWorks 2022a). For the variables that followed a logistic distribution, common transformations such as the square root, exponential, and inverse were applied and checked using the Shapiro–Wilk test. As the Shapiro–Wilk test indicated that these transformations were not suitable for the remaining two variables, the Box-Cox transformation was used instead, via the boxcox(x) syntax in MATLAB® 2022a.
After these transformations, the parameters that had been subjected to transformations other than standardisation were standardised once more before being used for training in the regression learner. The performance of each model is provided in Results and Discussion.
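The Box-Cox and natural logarithm transformations mentioned above can be sketched as follows. This Python example is a simplified illustration, not a reproduction of MATLAB's routine: it assumes the standard one-parameter Box-Cox form, (x^λ − 1)/λ for λ ≠ 0 and ln x for λ = 0, and picks λ by a coarse grid search over the profile log-likelihood. The function names, the grid limits, and the temperature readings are hypothetical.

```python
import math


def box_cox(values, lam):
    """One-parameter Box-Cox transform; lam = 0 reduces to the natural log."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def box_cox_log_likelihood(values, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    n = len(values)
    y = box_cox(values, lam)
    mean_y = sum(y) / n
    var_y = sum((t - mean_y) ** 2 for t in y) / n
    return -0.5 * n * math.log(var_y) + (lam - 1.0) * sum(math.log(v) for v in values)


def fit_lambda(values):
    """Grid search over lam in [-2, 2] in steps of 0.01 (a simplification)."""
    grid = [i / 100 for i in range(-200, 201)]
    return max(grid, key=lambda lam: box_cox_log_likelihood(values, lam))


# Hypothetical left-skewed temperature readings (°C) inside the Table 1 range.
temp = [46.8, 52.0, 55.5, 57.0, 58.2, 59.0, 60.1, 61.0, 61.8, 62.4]
best_lam = fit_lambda(temp)
temp_bc = box_cox(temp, best_lam)
print(best_lam)
```

In the workflow described in this section, the transformed column would then be standardised again before model training.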
RESULTS AND DISCUSSION
Normalisation and standardisation
After the entire dataset was normalised and used in the regression learner for training, the best performance for each regression technique and kernel was recorded and tabulated in Table 2.
Table 2. Regression learner result using normalisation
Model | RMSE | R2 | MSE | MAE |
---|---|---|---|---|
SVM | 0.157 | 0.38 | 2.45 × 10−2 | 0.115 |
Ensemble of trees | 0.164 | 0.32 | 2.68 × 10−2 | 0.128 |
Linear regression | 0.173 | 0.24 | 3.00 × 10−2 | 0.136 |
Regression tree | 0.195 | 0.04 | 3.81 × 10−2 | 0.149 |
Neural network | 0.276 | −0.93 | 7.60 × 10−2 | 0.217 |
Next, the performance of the best models with kernels found in the regression learner was recorded based on the standardised dataset. Table 3 displays the recorded performance.
Table 3. Regression learner result using standardisation
Model | RMSE | R2 | MSE | MAE |
---|---|---|---|---|
SVM | 0.758 | 0.43 | 5.74 × 10−1 | 0.561 |
Ensemble of trees | 0.823 | 0.32 | 6.78 × 10−1 | 0.659 |
Linear regression | 0.866 | 0.25 | 7.50 × 10−1 | 0.683 |
Regression tree | 0.974 | 0.06 | 9.49 × 10−1 | 0.780 |
Neural network | 1.254 | −0.57 | 1.57 | 0.933 |
Despite the effort to equalise the range and achieve a consistent trend for each feature using both normalisation and standardisation, the prediction capabilities of the ML techniques were inadequate and needed to be revisited, as seen in Tables 2 and 3. The ML models trained on standardised data slightly outperform those trained on normalised data in terms of R2. This is consistent with a study by Singh & Singh (2020) on different normalisation methods, which states that standardisation using the mean and standard deviation is better suited as a first choice than min–max normalisation for prediction purposes. Secondly, when comparing the RMSE, MAE, and MSE values for both transformations, normalisation produces markedly lower error values than standardisation. This does not necessarily mean that normalisation produces less error than standardisation. The error values differ because the data were transformed differently, so the errors are expressed on different scales. Normalisation confines the dataset to the range [0, 1], whereas standardisation rescales the data according to the mean and standard deviation of the dataset. The key difference between the two transformations is that standardisation scales the features of the dataset into a common, flexible form without distorting the differences in ranges of values, while normalisation distorts the data by restricting the scale. In support of this, standardisation can address outliers and noise, unlike normalisation, which is sensitive to outliers and can introduce distortion in the dataset (Chanal et al. 2022). To avoid confusing the ML algorithms and to achieve the overall goal of developing a CH4 prediction tool, standardisation should be the primary choice for scaling data in developing the model, as it represents the data better.
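The point that error metrics depend on the scale of the transformed target can be checked numerically. The Python sketch below uses hypothetical CH4 volumes and hypothetical predictions, rescales both with each method, and compares the resulting RMSE and R2. It assumes one fixed set of predictions, so R2 is identical across scalings; in the study the models were retrained on each transformed dataset, which is why the reported R2 values differ between Tables 2 and 3.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_a = statistics.fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical monthly CH4 volumes (Nm3) and hypothetical model predictions.
ch4 = [11340.0, 80000.0, 150000.0, 220000.0, 295480.0]
pred = [40000.0, 70000.0, 170000.0, 200000.0, 260000.0]

lo, hi = min(ch4), max(ch4)
mean, sd = statistics.fmean(ch4), statistics.stdev(ch4)


def to_unit(v):
    return (v - lo) / (hi - lo)  # min-max scaling of the target


def to_z(v):
    return (v - mean) / sd  # z-score scaling of the target


rmse_raw = rmse(ch4, pred)
rmse_unit = rmse([to_unit(v) for v in ch4], [to_unit(v) for v in pred])
rmse_z = rmse([to_z(v) for v in ch4], [to_z(v) for v in pred])
r2_raw = r_squared(ch4, pred)
r2_unit = r_squared([to_unit(v) for v in ch4], [to_unit(v) for v in pred])
r2_z = r_squared([to_z(v) for v in ch4], [to_z(v) for v in pred])
print(rmse_unit, rmse_z, r2_raw)
```

Multiplying the min–max RMSE by the target range, or the z-score RMSE by the target standard deviation, recovers the same error in original units. For example, if the CH4 target was scaled using the Table 1 range (an assumption), an RMSE of 0.157 corresponds to roughly 0.157 × (295,480 − 11,340) ≈ 44,600 Nm3.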
Regardless of which transformation is compared, none of the ML models can be classified as high performing, as mentioned in the previous paragraph. This is primarily because the parameters exhibit different behaviours and properties that cannot be addressed using normalisation or standardisation alone (Singh & Singh 2020). In addition, a single transformation may put some parameters in the dataset into a form that eases training for the ML models, while the same transformation applied to the remaining parameters may hinder performance instead. As a result, the performance metrics return a low R2 and a high RMSE. Therefore, as a solution, Singh & Singh (2020) proposed that the transformations applied to different parameters in a dataset should address their individual properties, rather than applying a single transformation method in every scenario.
Exploratory data analysis and transformations
As mentioned in earlier sections, the Shapiro–Wilk and skewness tests were carried out using IBM® SPSS® Statistics v26 to understand the behaviour of each parameter, following the findings from the normalisation and standardisation transformations. The results for each parameter are tabulated in Table 4.
Based on Table 4, only three parameters, pH, OLR, and HRT, are normally distributed, as they returned a value of ‘0’ in the Shapiro–Wilk test. The outcome of the Shapiro–Wilk test is decided by comparing the p-value associated with the test statistic W against the significance level, α. In data science and hypothesis testing, the significance level commonly used by default in statistical software is 5% (α = 0.05). This means that there is a 5% risk of rejecting the null hypothesis when it is in fact true (Frost 2020). For the Shapiro–Wilk test, the null hypothesis is that the data follow a normal distribution, so an α of 0.05 means that strong evidence (a p-value below 0.05) is required to reject normality. The p-value is therefore crucial for understanding the behaviour of each parameter, as it reflects the compatibility of the data with normality (Di Leo & Sardanelli 2020).
Table 4. Shapiro–Wilk and skewness test result
Parameter | SW | Skewness | p-value |
---|---|---|---|
COD IN | 1 | −0.065 | 0.001 |
BOD5 IN | 1 | −0.029 | 0.000 |
pH | 0 | 0.382 | 0.370 |
TEMP | 1 | −0.576 | 0.000 |
SS IN | 1 | 1.374 | 0.000 |
TS IN | 1 | 0.092 | 0.000 |
OLR | 0 | 0.070 | 0.506 |
HRT | 0 | −0.160 | 0.151 |
CH4 | 1 | −0.773 | 0.004 |
From the results in Table 4, the p-values of the three variables pH, OLR, and HRT are greater than 0.05, so the null hypothesis that the data are normally distributed is not rejected (Ramachandran & Tsokos 2021). Therefore, standardisation was selected to transform these three parameters before they were used in the regression learner. The remaining six parameters failed to exhibit normality, as they returned a value of ‘1’ in the Shapiro–Wilk test. This is because the p-value of each of these parameters is lower than 0.05, indicating that they are not normally distributed (King & Eckersley 2019).
To address the remaining six non-normal variables, the skewness values from Table 4 were assessed against the skewness rule described under EDA to determine the type of skewness of each parameter. Of the six, three parameters, COD IN, BOD5 IN, and TS IN, can be classified as generally symmetrical because their skewness values are close to zero. According to Brown (2022), skewness values close to zero indicate that these three parameters are close to being normally distributed. Therefore, standardisation can be used to transform them.
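The selection logic described in this section can be summarised as a small decision routine. The Python sketch below applies it to the skewness and p-values reported in Table 4; the function name `choose_route`, the route labels, and the use of 0.5 as the symmetry cut-off from the skewness rule are illustrative assumptions rather than the authors' code.

```python
ALPHA = 0.05  # significance level used for the Shapiro–Wilk test

# (skewness, p-value) for each parameter, as reported in Table 4.
table4 = {
    "COD IN": (-0.065, 0.001),
    "BOD5 IN": (-0.029, 0.000),
    "pH": (0.382, 0.370),
    "TEMP": (-0.576, 0.000),
    "SS IN": (1.374, 0.000),
    "TS IN": (0.092, 0.000),
    "OLR": (0.070, 0.506),
    "HRT": (-0.160, 0.151),
    "CH4": (-0.773, 0.004),
}


def choose_route(skew, p_value, alpha=ALPHA):
    """Pick a pre-processing route from the normality test and skewness."""
    if p_value > alpha:
        return "standardise (normality not rejected)"
    if abs(skew) < 0.5:
        return "standardise (generally symmetrical)"
    return "inspect QQ plots"


routes = {name: choose_route(skew, p) for name, (skew, p) in table4.items()}
for name, route in routes.items():
    print(f"{name}: {route}")
```

Under these assumptions, pH, OLR, and HRT are standardised because normality is not rejected, COD IN, BOD5 IN, and TS IN are standardised because they are generally symmetrical, and TEMP, SS IN, and CH4 are passed on to the QQ plot step.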
Table 5. Distribution of parameter
Parameter | QQ plot distribution |
---|---|
TEMP | Logistic |
SS IN | Lognormal |
CH4 | Logistic |
As observed from Table 5, both moderately skewed parameters, TEMP and CH4, followed a logistic distribution. As MATLAB® 2022a provides no transformation that specifically addresses this distribution type, the Box-Cox transformation was chosen, since power-based transformations can reduce left skewness (Watthanacheewakul 2021). In addition, Box-Cox transformations can improve the forecasting ability of time series-based datasets, which is an advantage when considering methods to improve the performance of the ML model.
Lastly, for the greatly skewed parameter SS IN, the QQ plot in Figure 2 shows that the data closely follow a lognormal distribution. This is supported by the skewness value of 1.374 for SS IN, which indicates that the data are skewed to the right, as expected for a lognormal distribution. Therefore, the transformation best suited to this distribution is the natural logarithm (Watthanacheewakul 2021), which reduces the right skewness and brings the data closer to a normal distribution.
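The effect of the natural logarithm on right-skewed data can be illustrated with a short sketch. The helper names and the synthetic lognormal-shaped sample are assumptions for illustration only; they are not the SS IN measurements, and the moment-based skewness estimator used here may differ from the one behind Table 4.

```python
import math
from statistics import NormalDist


def skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """Natural-log transform; only defined for strictly positive values."""
    if min(values) <= 0:
        raise ValueError("log transform requires strictly positive data")
    return [math.log(v) for v in values]


# Hypothetical lognormal-shaped sample (NOT the SS IN data from the study).
z = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
ss_like = [math.exp(zi) for zi in z]

before = skewness(ss_like)
after = skewness(log_transform(ss_like))
print(f"skewness before: {before:.3f}, after log: {after:.3f}")
```

Because the logarithm of a lognormal variable is normal by definition, the transformed sample is symmetric and its skewness falls to approximately zero.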
Following these transformations, the transformed dataset was passed to the regression learner, and the performance metrics of the best models and kernels are tabulated in Table 6.
Compared with Tables 2 and 3, Table 6 shows minimal improvement in R2 for the SVM-based models, even after different forms of data analysis, statistical tests, and transformations had been carefully applied. Although the MSE and RMSE improved for the SVM model using the Medium Gaussian kernel, the differences are not significant, and the model still requires further improvement before it can be considered a high-performing model for the prediction tool. The main reason for the inability to produce a high-performing model is the limited number of data points for each parameter in the dataset. As obtaining new datasets by visiting palm oil mills and carrying out tests was unlikely owing to COVID-19 restrictions, data augmentation was seen as a promising alternative. Having more data can help the ML model capture patterns better and thereby improve its ability to predict CH4. Data augmentation makes this possible by generating new, modified, synthetic data from the limited observed data obtained from the POME plants, while minimising the occurrence of overfitting (Maharana et al. 2022).
Regression learner result using EDA
Model | RMSE | R2 | MSE | MAE |
---|---|---|---|---|
SVM | 0.749 | 0.45 | 5.61 × 10⁻¹ | 0.566 |
Ensemble of trees | 0.858 | 0.27 | 7.35 × 10⁻¹ | 0.667 |
Linear regression | 0.877 | 0.24 | 7.69 × 10⁻¹ | 0.683 |
Regression tree | 0.996 | 0.02 | 9.92 × 10⁻¹ | 0.773 |
Neural network | 1.183 | −0.38 | 1.40 | 0.909 |
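For reference, the four metrics reported in Table 6 can be computed as in the sketch below, which uses their standard definitions. The function name `regression_metrics` and the toy arrays are assumptions for illustration; the RMSE and MSE pairs in `reported` are copied from Table 6 and are used only to check that RMSE is the square root of MSE.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, R2, MSE and MAE for paired observations and predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "RMSE": math.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "MAE": sum(abs(r) for r in residuals) / n,
    }


# Toy values (hypothetical): predicting the mean everywhere gives R2 = 0,
# and predictions worse than the mean give a negative R2.
y_true = [1.0, 2.0, 3.0, 4.0]
mean_model = regression_metrics(y_true, [2.5, 2.5, 2.5, 2.5])
poor_model = regression_metrics(y_true, [4.0, 3.0, 2.0, 1.0])

# (RMSE, MSE) pairs as reported in Table 6.
reported = {
    "SVM": (0.749, 0.561),
    "Ensemble of trees": (0.858, 0.735),
    "Linear regression": (0.877, 0.769),
    "Regression tree": (0.996, 0.992),
    "Neural network": (1.183, 1.40),
}
for model, (rmse, mse) in reported.items():
    print(f"{model}: RMSE^2 = {rmse ** 2:.3f}, reported MSE = {mse}")
```

Under these definitions, a negative R2, such as the value reported for the neural network, means that the model's squared error exceeds that of a constant prediction equal to the mean of the observed values.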
This study is the first in a series of parallel studies currently being conducted by the authors on developing an ML-assisted prediction tool to measure and monitor methane emissions from POME treatment facilities. The follow-up studies focus on (1) validating the final methane emission prediction model, with special emphasis on its potential in biogas capture and recovery, and (2) addressing the different emission species generated and released by anaerobic digestion processes in POME treatment. The findings of these studies, together with the outcomes of this research, will facilitate the development of an advanced emissions prediction tool for accurately measuring the total carbon footprint of the POME treatment process.
CONCLUSION
In this research, a prediction tool was developed to help the palm oil industry monitor CH4 emissions and to provide a potential pathway towards CH4 capture. The ML models explored were SVM, ensemble of trees, LR, neural network, and regression tree. Standardisation, normalisation, and data transformation were used as pre-processing techniques. The models gave their best performance when exploratory data analysis and transformation were applied. SVM was the best-performing model, with an R2 and RMSE of 0.45 and 0.75, respectively. Further studies should focus on exploring data augmentation to increase the amount of data available for training and testing the prediction model. In addition, other ML models such as GPR could be explored to improve model performance.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.