Support vector machines for oil classification link with polyaromatic hydrocarbon contamination in the environment

The main focus of this study is to explore the spatial distribution of polyaromatic hydrocarbon links between oil spills in the environment using Support Vector Machines with a Kernel-Radial Basis Function (RBF) approach for high-precision classification of oil spill type from sample fingerprinting in Peninsular Malaysia. The results show the highest concentrations of Σ Alkylated PAHs and Σ EPA PAHs in the Σ TAH concentration in diesel from the oil samples PP3_liquid and GP6_Jetty, with 100% classification output, corresponding to a coherent decision boundary and projective subspace estimation. The high-dimensional nature of this approach led to perfect separability of the oil type classification from four clustered oil type components, i.e. diesel, bunker C, Mixture Oil (MO), lube oil and Waste Oil (WO), with slack variables ξ ≠ 0. Of the four clusters, only the SVs of two are correctly predicted, namely diesel and MO. The kernel-RBF approach provides efficient and reliable oil sample classification, enabling the classification to be performed optimally within a relatively short execution time, including for datasets where the slack variables ξ are nonzero.


INTRODUCTION
Millions of marine species inhabiting the oceans can be endangered if fuel oil is released into the sea (Gaganis & Pasadakis ). Even when oil spills have no direct deleterious health consequences for the human population, they can still damage the coastal environment, including fish stocks and coral reefs. Moreover, wind and currents can push the oil strands towards the shoreline, contaminating areas inhabited by humans. This occurs particularly when the physical and chemical properties of the oil degrade during severe environmental weathering (Ramsey et al. ). In Malaysia, many fuel oil spill incidents pollute Malaysian waters and land. These oil spills can be caused by petroleum refinery plants and waste oil, among other small- or large-scale oil spills (Ismail et al. ; Juahir et al. ). Small-scale oil spills involve a small volume of oil (e.g. in fishermen's areas), whereas large-scale oil spills involve a large amount of oil (e.g. from collisions of ships or tankers). For example, the deadly collision of oil tankers in the shipping waterway of the Straits of Malacca on August 19, 2009 resulted in 58,000 tonnes of naphtha oil spilling into the ocean (Ahmad ). Oil spills can also originate from cars, marine engines and deliberate oil discharge into the environment. The oil spills that have occurred in the Straits of Malacca, the Johor Straits and the South China Sea have various causes, including accidental oil spills, intentional discharges from fishing boats, and maintenance and cleaning activities on cargo vessels.
To implement an effective clean-up project, one needs to determine the oil type based on its polyaromatic hydrocarbon components. This requires thorough oil spill sampling at the seawater surface along the coastlines of Peninsular Malaysia, since these compounds tend to be rather recalcitrant in the environment (Ramsey et al. ). Sampling large amounts of petroleum hydrocarbon oil spills in oceans, seawater, rivers, drains, water bodies and soil requires several strategic oil type classification methods that are very expensive and time consuming. The high complexity of the data requires an in-depth study on restructuring the data classification into the desired categories in a way that is competitive on a global scale (Alamdar et al. ). Methods such as gas chromatography-flame ionization detection (GC-FID) and gas chromatography-mass spectrometry (GC-MS) enable oil spill fingerprinting to be used for source identification, oil spill characterization and environmental forensics. By applying analytical laboratory tools to highly complex samples, one can determine the major petroleum-specific target PAHs that need to be chemically characterized, enabling identification of the origin of the oil spill in a sample (Ismail et al. ). Furthermore, using GC-FID and GC-MS in oil spill fingerprinting enables comparative analysis of the data against unknown sources of oil spills. However, relying on laboratory analysis alone to characterize and identify complex petroleum hydrocarbon oil spills, from sampling through to result interpretation, might lead to measurement errors.
In this study, we employed the Support Vector Machines (SVM) method, whose fundamental approach offers accurate classification and data prediction and is conceptually similar to artificial neural networks (ANNs) (Amjady et al. ). The kernel function of the SVM builds substantial non-linear classification boundaries, providing an alternative measure in pattern recognition (Dufrenois & Noyer ). This learning machine improves on the output of GC-FID and GC-MS to achieve the best oil type classification, with greater benefit and without unstable patterns (overfitting). Unlike ANNs, the computational complexity of SVMs is independent of the dimensionality of the input space, and SVMs often outperform ANNs in practice. Furthermore, SVMs are less prone to overfitting when dealing with large, complex oil spill datasets.
The adoption of SVM has provided strong decision-making support for electricity market players seeking the best results (Silva et al. ). There is, however, no established method for applying SVMs to oil spill fingerprinting in Malaysia. In general, SVMs have brought reliable improvements in pattern recognition for handling large or complex datasets in clustering and regression. This paper presents SVMs based on the Kernel-Radial Basis Function (RBF) algorithm, applied as a development approach for oil spill type classification using GC-FID and GC-MS data. The objective of introducing SVM RBF in oil spill fingerprinting is to provide statistically based problem-solving options for classifying oil spill type with a level of precision that can recognize the most significant petroleum hydrocarbon compounds. Hence, the application of SVM RBF as a decision-making function is an innovative approach to obtaining highly reliable output within a short execution time. Finding the maximal-margin hyperplane enables linear separation to be learned, at manageable computational complexity, for datasets that are not separable in the input space. In practice, SVMs are applied in many fields, such as computer-based programming. The SVM RBF approach has the advantage of requiring no implicit assumption about how the data were generated, which helps ensure that the optimal classification boundary is achieved and the associated risks are minimized (Dufrenois & Noyer ).
The function of SVM RBF in oil classification from oil spill fingerprinting is to improve the accuracy of boundary classification for complex mixtures of datasets and to discriminate the most significant petroleum oil type categories. In this study, we explored a range of applications of the RBF-SVM model as a decision function classifier to obtain high-precision oil type classification from oil spill fingerprints. In addition, this study provides an alternative approach for achieving high classification accuracy for the potential sources of oil that lead to water pollution within Peninsular Malaysia.

Sample collection
The oil samples were collected from the water surface in the form of oil films or sheens, as oil-contaminated water (liquid-liquid) and on Teflon mesh fabric nets (net-oil) (10 cm × 10 cm). The samples were collected from ports, jetties and fishermen's areas, as illustrated in Figure 1 and listed in Table S1. The number of sampling points was based on site observation of the spill areas or the areas covered with oil; more sampling points were used for larger spill areas. The samples were collected from February 2016 until September 2016. All oil samples were poured into 100-250 ml thick-walled, wide-neck borosilicate glass bottles with an inner neck diameter of 30 mm. The bottles were properly sealed and placed in a cool box. The temperature was maintained at +4 °C during delivery to the laboratory, where all samples were kept at a constant temperature of 4 °C until further analysis.

Sample preparation procedure
Each oil sample was mixed with 60 ml of dichloromethane (DCM) and shaken in a flask for 20 minutes for complete oil extraction. The procedure was performed in triplicate. The oil sample was then passed through a funnel containing anhydrous sodium sulfate (5.0 cm thick), and a round-bottom flask was used to collect the dried extract at the outlet of the separating funnel. A further 60 ml of DCM was added to the separating funnel to extract the remaining oil from the water. The dried extract was finally concentrated to 1.0 ml in a rotary evaporator (30 °C) with solvent exchange to hexane.

Column cleanup
A pre-cleaned glass column (30 cm × 10.5 mm I.D.) was used for the column cleanup. It contained about 6.0 g of pre-cleaned silica gel (100-200 mesh, Davisil grade 923) with 0.5 cm of pre-cleaned sodium sulfate, conditioned with 20 ml of hexane. Initially, the extract was spiked with surrogates (a mix of four PAH compounds and o-terphenyl). Hexane (3.0 ml) was repeatedly added to transfer the oil onto the column, followed by 12.0 ml of hexane to elute the aliphatic compounds. The aliphatic fraction (F1), consisting of n-alkanes and biomarkers, was used for the analysis of saturates and biomarkers. Secondly, 15 ml of 50% DCM in hexane (50:50 dichloromethane:hexane) was used to elute the aromatic compounds (F2). A gentle stream of nitrogen (N2) (1.0 ml/s) was used to concentrate both F1 and F2 to a volume of less than 0.5 ml. F1 was then spiked with 1.0 ppm of C30 17β(H),21β(H)-hopane and 20 ppm of 5α-androstane, and F2 was spiked with 1 ppm of d14-terphenyl. Fraction F3 was formed by combining the remaining halves of F1 and F2 to determine the total GC-detectable TPH, the GC-resolved peaks and the GC-unresolved complex mixture of hydrocarbons (UCM). Prior to analysis, all three fractions underwent the concentration process to reach a final injection volume of 1.0 ml.

GC-FID and GC-MS analysis of the oil spill fingerprints
Laboratory analysis by GC-FID and GC-MS is important for identifying the source, type and distribution patterns of oil spills. The analyses of petroleum biomarkers, n-alkanes and PAHs in complex hydrocarbon mixture oil samples were performed based on previously elaborated methods (Wang ; Wang et al. ; Khelifa et al. ). Subsequent analyses of n-alkane distributions and total petroleum hydrocarbons (TPHs) for clustering or separation of the spilled oil were performed on a Perkin Elmer Clarus 680 equipped with a PE AutoSystem GC with built-in autosampler and flame ionization detector (FID), using a 30 m × 0.25 mm ID DB-5 column. An Agilent 7890A GC system equipped with a mass-selective detector and a CTC PAL ALS autosampler was used to identify the target PAH compounds (including five alkylated PAH homologous groups and other EPA priority PAHs) and high-molecular-weight biomarkers such as terpanes and steranes. A 30 m × 0.25 mm HP-5MS fused silica column was used for identifying the target PAH compounds. Following common laboratory practice, each analysis was repeated three times to ensure the reliability of the results. The repeated analyses gave very similar concentrations, and the reported measurement was calculated as the mean. All oil type clusters obtained from the GC-FID and GC-MS analyses of the collected samples were then passed to the support vector machine for confirmation. The tiered approach applied in this study is illustrated in Figure 2.
Support vector machines kernel-based radial basis function (RBF) approaches in oil spill classification

Classification support vector machines (CSVM) kernel-radial basis function (RBF)
In this study, the CSVM Kernel-RBF approach was employed to validate the results obtained from the laboratory analyses by gas chromatography-flame ionization detector (GC-FID) and gas chromatography-mass spectrometry (GC-MS). Applied to oil spill fingerprints, CSVM learns the oil spill datasets by finding the optimal-margin hyperplane that separates, or classifies, a large dataset according to different physicochemical characteristics (oil spill types, e.g. diesel). The CSVM analysis was performed using Statistica 13 (Dell, 2015). The datasets were partitioned for three procedures: testing, training and predicting. In general, the SVM classifier trains on the datasets and builds the hyperplane through the Cutting Plane Algorithm (CPA), giving decision boundary functions that separate the finite datasets with the margin maximized. Training on the oil spill datasets allows the resulting hyperplanes to be used to estimate which oil type a given data point belongs to. The discriminative power between groups gained during training provides perfect separation of the complex oil spill datasets into their respective groups. The validation was also accomplished using the SVM Kernel-Radial Basis Function (RBF). The CSVM Kernel-RBF was run with Classification Type 1 and the kernel-based approach, which is reliable for non-linear boundaries (Lavine et al. ; Dufrenois & Noyer ).
In this study, the forty-seven datasets were randomly trained and validated. In the Kernel-RBF approach, the hidden layer provides a set of functions that form an arbitrary basis for the decision boundary of the oil spill compounds. The input layer receives the input vectors $(x_1, x_2, \ldots, x_{m_0})$, which are expanded through the hidden layer of kernel inner products with the support vectors ($x_i$); the linear output layer then maps the input vectors (training vectors) to the optimal classification as the output ($y$). The kernel inner product constructs the optimal hyperplane in the feature space, playing a role similar to that of the neurons in an Artificial Neural Network (ANN). The SVM architecture functioning as the decision function is illustrated in Figure 3. During training, the weights and support vectors (independent variables) are computed, and only the support vectors with non-zero $\alpha$ coefficients are considered in the calculation (Haykin ; Singh et al. ).
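The layered architecture described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard SVM decision function f(x) = Σ α_i y_i K(x_i, x) + b with an RBF kernel; the support vectors, α coefficients, bias and γ used here are hypothetical values chosen for illustration and are not taken from the study.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, labels, alphas, bias, gamma):
    """Linear output layer: weighted sum of kernel responses plus bias."""
    return bias + sum(
        alpha * y * rbf_kernel(sv, x, gamma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )

def classify(x, support_vectors, labels, alphas, bias, gamma):
    """Assign the class (+1 or -1) from the sign of the decision value."""
    value = decision_value(x, support_vectors, labels, alphas, bias, gamma)
    return 1 if value >= 0 else -1

# Hypothetical two-support-vector model (illustrative values only).
support_vectors = [(0.0, 0.0), (2.0, 0.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]  # only non-zero alpha coefficients enter the sum
bias = 0.0
gamma = 0.5

print(classify((1.8, 0.1), support_vectors, labels, alphas, bias, gamma))
```

A point close to the second support vector is assigned its label because that kernel response dominates the sum.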

(ii) Classification and Regression Models
(a) Data Pre-processing
The complete dataset of 47 primary data and 32 secondary data, drawn from input data comprising 94 oil spill compounds, was modeled in this study. Equation (1) was used to normalize the input dataset (Adib et al. ) to the bounded region $y_i \in (+1, -1)$, to prevent truncation errors arising from the large amount of numerical oil spill data. In Equation (1), $X_i^n$ is the scaled input or output oil spill data, $X_i$ is the actual input/output oil spill data, $X_{\min}$ is the minimum value of the observed dataset, and $X_{\max}$ is the maximum value of the observed dataset. Since the input dataset $\{X_n\}_{n=1}^{N}$ was normalized using Equation (1), the model output $y_i$ was also normalized. The oil spill input dataset was mapped into the normalized feature space according to Equation (1) before running the analysis.
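The normalization step can be illustrated with a short sketch. It assumes that Equation (1) is a conventional min-max scaling onto the interval from -1 to +1, which is an assumption based on the symbol definitions above rather than the authors' exact formula; the function name and sample values are illustrative only.

```python
def min_max_scale(values, lower=-1.0, upper=1.0):
    """Linearly rescale values so that the minimum maps to `lower`
    and the maximum maps to `upper` (assumed form of the scaling)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # A constant feature carries no information; map it to the midpoint.
        return [(lower + upper) / 2.0 for _ in values]
    span = x_max - x_min
    return [lower + (upper - lower) * (x - x_min) / span for x in values]

# Illustrative concentrations in μg/ml (not the full study dataset).
concentrations = [0.14, 3.25, 10.48, 11.2]
scaled = min_max_scale(concentrations)
print(scaled)
```

Scaling every feature to a common range keeps compounds with large concentrations from dominating the kernel distance.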
The normalized dataset $\{X_n\}_{n=1}^{N}$ was randomly divided into three subsets, namely training, optimization and testing (Adib et al. ). In this analysis, 80% of the data was used for training, 10% for optimization and the remaining 10% for testing. The details of the oil spill datasets used for the three subsets are elaborated in the next section.
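A random 80/10/10 partition of this kind can be sketched as follows. The function name, the rounding rule and the fixed seed are assumptions made for illustration; the study does not state how fractional sample counts were handled.

```python
import random

def split_indices(n_samples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle sample indices and cut them into training, optimization
    and testing subsets according to the given fractions."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_samples)
    n_opt = round(fractions[1] * n_samples)
    train = indices[:n_train]
    optimization = indices[n_train:n_train + n_opt]
    testing = indices[n_train + n_opt:]  # whatever remains
    return train, optimization, testing

# 47 samples, as in the study.
train, optimization, testing = split_indices(47)
print(len(train), len(optimization), len(testing))
```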

(b) Data Training Process
Phase 1: The training dataset is given as $\{X_n\}_{n=1}^{N}$, where $N$ is the number of oil spill samples, $X \in \mathbb{R}^n$ is the input vector of $n$ input features, and $\{Y_n\}_{n=1}^{N}$ is the output feature with class labels in the bounded region $y_i \in (+1, -1)$ as finite values. In binary classification, $y_i \in (+1, -1)$; here, however, $y_i \in (1, 2, 3, 4, 5, 6, 7, 8, 9, 0)$ (Rodriguez-galiano et al. ). Prior to training, the optimum values of the key parameters $(\gamma, \epsilon, \sigma^2)$ were selected, where $\gamma$ controls the trade-off between minimizing the fitting error and the smoothness of the estimated function, $\epsilon$ is the precision threshold and $\sigma^2$ determines the efficiency of the SVM model. The slack variable $\xi$ was also used to quantify the output features for the positive and negative classes (Adib et al. ) (Equation (2)). The $\epsilon$ in Equation (2) is the sensitivity of the optimal misclassification error, and $\{\xi_n, \xi_n^*\}_{n=1}^{N}$ are the slack variables that quantify the output features for the positive and negative classes. The SVM model seeks the maximum-margin hyperplane, where the separating hyperplane can be written as Equation (3), in which $x_i$ $(x_1, x_2, x_3, \ldots, x_n)$ denotes the real-valued input vector, $w$ is the real normal vector to the hyperplane in the high-dimensional feature space, and $b$ is the hyperplane bias, i.e. the offset of the hyperplane from the origin. In the linearly separable case, two hyperplanes, given by Equations (4) and (5), can be chosen so that no data point falls between them (García Nieto et al. ).

Phase 2: The SVM was then applied to find the functional form of classification and regression to deal with non-linear separation, subject to minimization of a misclassification error function under constraints. Training the SVM model involves sequential optimization of the error function. Two SVM models were selected to solve the oil spill compound problems.
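The geometry of the separating hyperplane and its two margin hyperplanes can be made concrete with a small sketch. It assumes the common textbook convention w·x + b = 0 for the separating hyperplane and w·x + b = ±1 for the margins; the sign convention for b, the weight vector and the sample points are assumptions for illustration and may differ from Equations (3) to (5) of the study.

```python
import math

def decision_value(w, x, b):
    """Signed score w·x + b of a point relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def satisfies_margin(w, x, b, y):
    """True when the point lies on or outside its class margin,
    i.e. y * (w·x + b) >= 1, so it does not fall between the hyperplanes."""
    return y * decision_value(w, x, b) >= 1

def margin_width(w):
    """Distance 2 / ||w|| between the two margin hyperplanes."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane in two dimensions.
w, b = (3.0, 4.0), -5.0

print(decision_value(w, (2.0, 0.0), b))   # a point on the positive margin
print(decision_value(w, (0.0, 1.0), b))   # a point on the negative margin
print(margin_width(w))
```

Maximizing the margin corresponds to minimizing the norm of w while every training point keeps satisfying its margin constraint.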
Subject to the constraints; where $C$ is the capacity constant, $w$ is the coefficient vector, $b$ is a constant, $\xi$ represents the parameters for handling non-separable input data, $y \in \pm 1$ is the class label, $x_i$ represents the independent variables, and the kernel $\phi$ transforms the data from the input (independent) space to the feature space. A larger value of $C$ penalizes misclassification errors more heavily, so the value of $C$ was chosen carefully to avoid overfitting (Table 1).
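The role of the capacity constant and the slack variables can be illustrated numerically. The sketch below assumes the standard soft-margin objective, one half of the squared norm of w plus C times the sum of the slacks, with each slack equal to max(0, 1 - y f(x)) and a linear score f(x) = w·x + b; this is the usual textbook form of a Type 1 classification SVM and is assumed here, since the objective itself is not reproduced in the text. The sample points are hypothetical.

```python
def hinge_slacks(w, b, samples, labels):
    """Slack per sample: zero outside the margin, positive when the
    margin constraint y * (w·x + b) >= 1 is violated."""
    slacks = []
    for x, y in zip(samples, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        slacks.append(max(0.0, 1.0 - y * score))
    return slacks

def soft_margin_objective(w, b, samples, labels, C):
    """0.5 * ||w||^2 + C * sum(slacks): margin term plus error penalty."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    return regularizer + C * sum(hinge_slacks(w, b, samples, labels))

# Hypothetical data: the third point sits inside the margin.
samples = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
labels = [1, -1, 1]
w, b = (1.0, 0.0), 0.0

slacks = hinge_slacks(w, b, samples, labels)
objective = soft_margin_objective(w, b, samples, labels, C=10.0)
print(slacks, objective)
```

With a larger C the same margin violation costs more, which pushes the optimizer towards fitting the training points more tightly.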
(ii) Regression SVM (also known as epsilon-SVM regression)
The SVM was also applied to regression, where the functional dependence of the dependent variable $y_i$ on a set of independent variables $x$ was estimated (Cristianini subject to;

Radial Basis Function (RBF)
Phase 3: The Kernel-RBF (Radial Basis Function) trick was chosen, in a format similar to that of other kernel-based learning methods. This is the most preferred and reliable execution type among kernel-based learning methods, with case-wise as the selected case. Furthermore, the Kernel-RBF method employs a specific function for non-linear classification around the centers of the subsets, for which two tuning parameters ($\gamma$, $\alpha$) were added, with $\alpha$ as the kernel parameter and $\gamma$ as the regularization parameter. These functions are at the heart of the statistical inference analysis. The value of the $\gamma$ constant was set to 0.185 and the training cost capacity constant $C$ was set to 10 for non-linear classification and regression. The kernel trick thus acts as the problem solver for regression with data that are non-linearly separable in the input space.
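A minimal sketch of an RBF kernel follows. It assumes the widely used form K(x, z) = exp(-γ‖x − z‖²) and treats the reported value 0.185 as the kernel coefficient γ; both are assumptions, since the kernel expression is not reproduced in the text, and the feature vectors are hypothetical.

```python
import math

def rbf_kernel(x, z, gamma=0.185):
    """Similarity between two feature vectors: 1 for identical vectors,
    decaying towards 0 as their squared Euclidean distance grows."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def gram_matrix(samples, gamma=0.185):
    """Pairwise kernel values for a list of samples."""
    return [[rbf_kernel(x, z, gamma) for z in samples] for x in samples]

# Hypothetical normalized feature vectors for three samples.
samples = [(-1.0, 0.2), (-0.9, 0.1), (1.0, -0.5)]
K = gram_matrix(samples)
for row in K:
    print([round(value, 3) for value in row])
```

Because the kernel depends only on distances, samples with similar fingerprints receive values near 1 and dissimilar samples values near 0, which is what allows a linear separation in the induced feature space.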
In Kernel-RBF, the hidden layer provides a set of functions that form an arbitrary basis for the decision boundary of the oil spill compounds. The input layer receives the input vectors of oil spill compounds (SAS141A, SAS141B, BAS303A, BAS304B, …, $x_{m_0}$). These are expanded through the hidden layer of kernel inner products with the support vectors ($x_i$), and the linear output layer maps the input vectors (training vectors) to the optimal discrimination as the output ($y$). As in the SVM method, the kernel inner product constructs the optimal hyperplane in the feature space, playing a role similar to that of the neurons in an ANN. During training, the weights and support vectors (independent variables) were computed, and only those support vectors with non-zero $\alpha$ coefficients were considered in the calculation (Haykin ). The Lagrange multiplier method was required to resolve the optimal weight ($w$) in the high-dimensional space based on the following expression (Haykin ): where $\alpha_i$ represents the Lagrange multiplier coefficients.

Phase 4: The kernel functions as the algorithm for assessing patterns in diverse datasets by mapping the data into the high-dimensional feature space. The Kernel-Radial Basis Function algorithm was performed using the following equation (Haykin ): Subject to the optimal weight solution in Equation (9) (Haykin );

Phase 5: Cross-validation was selected for predictive model validation, assessing the optimal values of the different variables to determine the mean squared error (MSE), the coefficient of determination ($R^2$) as a fitness factor and the correlation coefficient, using a 10-fold cross-validation procedure.
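The 10-fold procedure of Phase 5 can be sketched as index bookkeeping plus an error measure. The fold assignment, the seed and the function names below are assumptions for illustration; they show the general technique, not the software routine used in the study.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists; every sample is used
    for validation exactly once across the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# 47 samples split into 10 folds.
splits = list(k_fold_indices(47, k=10))
for training, validation in splits:
    print(len(training), len(validation))
```

In practice a model would be fitted on each training list and scored on the matching validation list, and the fold scores averaged.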

RESULTS AND DISCUSSION
The Support Vector Machines (SVMs) have recently become one of the most popular machine learning tools for data mining and pattern recognition, used to solve problems in data clustering (Vapnik ; Singh et al. ; Ni & Zhai ). The outstanding performance of SVMs in classification and regression in many applications has been demonstrated by Cristianini & Shawe-Taylor (2000). Handling large quantities of values with accurate predictability requires optimal problem-solving mechanisms, which many organizations seek in order to achieve their market objectives (Meeus et al. ). An SVM model, which can be used for extrapolation, interpolation and thorough evaluation of data, is advantageous in this context and is needed to cross-validate, or re-confirm, the results obtained from GC-FID and GC-MS. The SVM approach markedly improves the exploitation of complex patterns in the oil spill dataset obtained from GC-FID and GC-MS, allowing large mapped input datasets to be clustered, predicted and structurally classified. The classification function is a hyperplane in the feature-rich space of the input oil spill datasets that clusters or separates that space (Alamdar et al. ). In this study, SVM classification characterizes the behavior of the input data with a decision boundary that separates the oil spill compounds (unseen variables) by a hyperplane in a high-dimensional space in which the non-linearly separable data become separable. This allows the oil spill variables to be separated, or classified, into one or two classes or categories with similar homogeneity. In oil spill fingerprinting applications, the SVM approach separates the datasets in a high-dimensional space, keeping a maximal-margin hyperplane between the classes and accommodating the underlying data distributions. Moreover, the efficiency of the approach for clustering high-dimensional datasets was demonstrated by Moraes & Faria in 2016 (Moraes & Faria ).

Polycyclic aromatic hydrocarbon characterization in oil spill samples
PAHs are commonly known to be highly toxic to the environment, and their presence can be detrimental to people and other living beings. PAHs in petroleum oil can exist in both low-molecular-weight and pyrogenic high-molecular-weight forms. The aromatic structures mostly contain paraffinic chains and naphthenic and aromatic rings (one to four rings), with the rings arranged side by side. In this study, the lowest concentration of PAHs was detected in the lube oil and the highest in the diesel-based spill. In general, used lube oil contains both low- and high-molecular-weight PAHs resulting from the incomplete combustion of fuel oil and residues (Bishop ; Kaplan et al. ; Matar & Hatch ). Table 2 summarizes the PAH quantitation results for 11 selected oil spill samples. In the overall quantitation of PAHs, diesel from the oil sample PP3_liquid showed the highest concentrations of ∑ Alkylated PAHs (2,221.17 μg/ml) and ∑ EPA PAHs (10.48 μg/ml). Diesel from the oil sample GP6_Jetty was the second highest, with ∑ Alkylated PAHs of 715.04 μg/ml and ∑ EPA PAHs of 11.2 μg/ml. Of the 47 oil samples overall, only 11 had significant detectable concentrations, and these were selected for further elaboration in this section.
The values of the alkylated PAHs and the EPA Priority PAHs can determine the category of an oil spill, such as diesel, Waste Oil (a mixture of many kinds of oils), lube oil and Bunker C. The alkylated homologue PAHs comprise hydrocarbon compounds of napthanic (C1-N - In this study, on the other hand, diesel revealed the highest concentration of aromatic hydrocarbons (∑alkylated PAHs) in the oil samples from Pasir Gudang (PP3-liq), Gelang Patah (GP6-Jetty) and Kuala Perlis (KP3), with quantitated values of 2,221.17 μg/ml, 715.04 μg/ml and 67.33 μg/ml, respectively. This large ∑alkylated PAHs contribution could have resulted from combustion in marine engines (Yang et al. ). Lube oil from Tanjung Gemok (TGG 03) was the second highest, but with a very low total ∑alkylated PAHs concentration of 3.254 μg/ml. In general, this lube oil is very low not only in the concentration of the GC-detectable alkane fraction but also in the concentration of aromatic hydrocarbons. Diesel from the oil sample Pasir Gudang (PP3-liq) showed a significantly high content of 2-to-6-ring alkylated homologous aromatic hydrocarbons (PAHs), in particular naphthalene, phenanthrene, dibenzothiophene and fluorene, with concentrations of 1,185.197 μg/ml, 408.33 μg/ml, 101.59 μg/ml and 3.88 μg/ml, respectively. However, a relatively low concentration of US EPA Priority PAHs was found in the diesel oil sample (PP3-liq), with a quantitated value of 10.48 μg/ml, dominated by fluoranthene (1.92 μg/ml) and pyrene (8.56 μg/ml).
The reason for the relatively low concentration of US EPA Priority PAHs in the diesel oil sample might be the presence of trace amounts of polycyclic sulfur aromatic hydrocarbons (PSAH), such as dibenzothiophene, in diesel instead of alkylated PAHs. Compounds such as benzo[b]naphtho[1,2-d]thiophene are commonly used to trace diesel-based oil emissions in the environment (Daisey et al. ). Meanwhile, the lube oil from TGG03 (Tanjung Gemok, Rompin) is the exact opposite of diesel in terms of the amounts of detectable alkylated PAHs and US EPA PAHs, at 3.25 μg/ml and 0.14 μg/ml, respectively. The suspected Bunker C from the oil sample MERSING 4 (Mersing, Johor) had the lowest detectable concentrations of alkylated PAHs and US EPA Priority PAHs, with quantitated values of 1.22 μg/ml and 0.01 μg/ml, respectively. In general, the lube oil has unique characteristics as it  Table 2, the lube oil (used) has characteristics similar to those of the diesel-based oil; however, its concentration of EPA Priority PAHs is significantly distinctive owing to the additional pyrogenic PAH constituents. For instance, two constituents of EPA Priority PAHs were detected in the diesel sample from Pasir Gudang (PP3-liq), namely Anthracene (An) and Fluoranthene (Fl), with concentrations of 1.92 μg/ml and 8.56 μg/ml, respectively. The used lube oil has significantly retained the pyrogenic PAHs, ranging from Acenaphthylene (Acl), Acenaphthene (Ace), Anthracene (An), Fluoranthene (Fl), Pyrene (Py) and Benzo[b]fluoranthene (BbF) to Indeno(1,2,3-cd)pyrene. The chromatogram signatures indicated that the used lube oil of TGG 03 could be a mixture of lube oil, unburned diesel and combustion exhaust of oils from marine use (Yang et al. ).
Furthermore, the waste oil from ambiguous mixtures of many types of oil, for example the oil samples Port Klang (KLG-Net), Kukup (PKK1), Kuala Kedah (KK1) and Prai Port (PP4), was found to contain both alkylated PAHs and EPA Priority PAHs, all at low concentrations (Table 2). Thus, it was determined that these mixtures of many oil types could have been blended with petrogenic and pyrogenic PAH contributions (from closed-system combustion in diesel engines), as the oil samples contained both low- and high-molecular-weight (HMW) (2-6 ringed) PAHs. However, only the oil samples Port Klang (KLG-Net), Prai Port (PP4) and Kukup (PKK1) were characteristically similar in their low- and high-molecular-weight alkylated PAH and EPA Priority PAH fragmentations. In the other oil sample, Kuala Kedah (KK1), the alkylated PAHs were dominated only by naphthalene and phenanthrene, while Anthracene (An) and Fluoranthene (Fl) dominated the EPA Priority PAHs. In addition, the ambiguous characterization of the suspected Bunker C oil type (MERSING 4) suggests that it probably contained low concentrations of both petrogenic and pyrogenic PAHs (Table 2). In general, this type of residual oil from heavy fuel oil is derived from widely used marine diesel and industrial power generators (Wang & Fingas ). The pyrogenic PAH contribution was dominated by Anthracene (An), Fluoranthene (Fl), Pyrene (Py) and Benz[a]anthracene (BaA), with values of 0.000314 μg/ml, 0.000971 μg/ml, 0.002641 μg/ml and 0.00444 μg/ml, respectively. The petrogenic PAH contribution in the Bunker C was dominated primarily by naphthalene, phenanthrene, dibenzothiophene and fluorene. The ∑petrogenic PAHs concentration was only 1.22 μg/ml, while the EPA Priority PAHs accounted for just 0.01 μg/ml.
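The relative contribution of EPA Priority PAHs can be compared across samples with a simple ratio. The sketch below uses only the ∑ alkylated and ∑ EPA Priority PAH concentrations quoted in this section; the dictionary layout and function name are illustrative, and the ratio is dimensionless because both quantities are in μg/ml.

```python
# Concentrations in μg/ml as quoted in the text:
# (sum of alkylated PAHs, sum of EPA Priority PAHs).
samples = {
    "PP3-liq (diesel)": (2221.17, 10.48),
    "GP6-Jetty (diesel)": (715.04, 11.2),
    "TGG 03 (lube oil)": (3.25, 0.14),
    "MERSING 4 (suspected Bunker C)": (1.22, 0.01),
}

def epa_to_alkylated_ratio(alkylated, epa):
    """Dimensionless ratio of EPA Priority PAHs to alkylated PAHs."""
    return epa / alkylated

ratios = {
    name: epa_to_alkylated_ratio(alkylated, epa)
    for name, (alkylated, epa) in samples.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

All four ratios are small, which is consistent with the observation that alkylated (petrogenic) PAHs dominate these samples.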
The small ratio of ∑EPA Priority PAHs to ∑alkylated homologous PAHs, about 0.01, is a significant indication that the contribution of the most poisonous compounds (EPA Priority PAHs) in this oil sample is very small. It could be a result of a different chemical composition in the suspected Bunker C type fuel compared with conventional bunker type fuel. The uncommon profile reflects the preferential loss of typical bunker-type fuel properties, since bunker fuel these days can be a mixture of residual oil and diesel fuel or lighter oil types (middle-range distillates), depending on the marine use (Wang & Fingas ; Stout & Wang ). Figure S1 shows the GC-MS profiles of the eleven (11) selected oil samples. In particular, the PAH distribution pattern from the GC-MS of sample (a) PP2 MAMPU reached its maximum peak at carbon nC6, at 55.97 μg/ml and 31.13 μg/ml. Wider resolved hydrocarbons are observed in the carbon ranges nC9 to nC34 and nC12 to nC30. However, the PAH concentration diminishes after carbons nC34 and nC30, from minutes 31 and 34, respectively. For (b) PP3-liquid, a small hump dominates the carbon range nC14-nC30, at minutes 12 to 26. Among the peak distributions of the oil spills, the oil sample (c) GP 6-Jetty is dominated by wider resolved hydrocarbons ranging from nC13 to nC28. The GC-detectable fraction in sample (d) PKK 1 reveals unexpected or irregular chemical fingerprints, typically attributable to a mixture of many kinds of oils (including vegetable oils and other petroleum by-products with complex mixture constituents) that enter the seawater in the absence of any appreciable amount of diesel or other related hydrocarbon fuel type.
Identifying the source of an oil spill can become problematic when such uncommon chemical fingerprints are obtained. The highest detectable PAH concentration is 12.12 μg/ml at nC19. The highest concentrations in sample (d) PKK 1 appear at minutes 24 and 21, respectively. The oil sample (e) KLG7-net shows an uncommon hydrocarbon fragmentation pattern in the absence of petroleum-based hydrocarbon oil. The hydrocarbon profile of oil sample (f) KK1 shows almost no alkane or polyaromatic hydrocarbon distribution, with a very minimal unresolved complex mixture. The absence of the unresolved complex mixture extends from carbon nC26 to nC40, at minutes 26 to 40. The highest PAH concentration in this sample is 2.845 μg/ml. For oil sample (g) PP4, the carbon range nC18 to nC19 is the most abundant, with detected concentrations of 1.073 μg/ml, 1.092 μg/ml, 1.054 μg/ml, 0.739 μg/ml and 1.193 μg/ml at minute 18. These PAHs are suspected to come from petrogenic contributions of marine fuel oil, vegetable oil from the ships' restaurants, or the diesel-based fuels widely used by many ships at Prai port. A small hump dominates (h) PP3, ranging from nC16 to nC28. The highest concentration detected is 22.642 μg/ml at nC22 for Mersing 4. This indicates the unique character of these chromatograms, since the samples contain waste or used lube oil.
The small hump of a typical unresolved complex mixture (UCM) is observed in (i) MERSING 4 over the broader range of nC29 to nC40, in the absence of alkane and aromatic hydrocarbon concentrations (Yang et al. ), from minutes 26 to 40. The oil sample (j) TGG 03 is dark in colour, and its chromatogram profile exhibits a broad, large UCM in the carbon range nC26 to nC40 at minutes 21 to 34. This profile could derive from a mixture of lube oil type product and diesel from the engines of ferries carrying passengers to Tioman Island, or from boating activities in the area, which fall under the category of waste oil (WO). The UCM indicates a pyrogenic contribution (incomplete fuel combustion) and a chemical composition containing higher-molecular-weight pyrogenic PAHs.

The trained SVM classifier is non-linear, and all the results obtained from the SVM Kernel-RBF classifier clearly demonstrate highly accurate classification of the oil spill types (Figure 5). On the training set of the petroleum-oil compound datasets, the classifier achieved 100% classification accuracy (Vapnik ). This classification output corresponds closely to a coherent decision boundary and projective subspace estimation. The support vectors output by training, which serve as the input to the Kernel-RBF classifier for reliable classification, with C = 10, γ = 0.011 and n = 33, are shown in Table 4. Table 4 presents the confusion matrix of the Support Vector Machines, giving the cluster for the overall oil spill samples as Observed (rows) × Predicted (columns). This is the confusion matrix of the Kernel-RBF classification on the training set.
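To make the role of γ concrete, the following is a minimal sketch of the Gaussian RBF kernel in its standard form, K(x, z) = exp(-γ‖x - z‖²), using the γ = 0.011 reported above as the default. The function name `rbf_kernel` and the two-feature example vectors are hypothetical and are not taken from the study's GC-MS data.

```python
import math

def rbf_kernel(x, z, gamma=0.011):
    """Standard Gaussian RBF kernel: exp(-gamma * squared Euclidean distance).

    gamma defaults to the value reported in the text; x and z are
    equal-length sequences of numeric fingerprint features.
    """
    if len(x) != len(z):
        raise ValueError("vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Hypothetical feature vectors, for illustration only (not study data).
sample_a = [1.0, 2.0]
sample_b = [4.0, 6.0]

similarity_self = rbf_kernel(sample_a, sample_a)   # identical inputs
similarity_pair = rbf_kernel(sample_a, sample_b)   # squared distance = 25
print(similarity_self, similarity_pair)
```

With a small γ such as 0.011 the kernel value decays slowly with distance, so samples that are fairly far apart in feature space still influence one another's classification.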
The high-dimensional nature of this approach leads to perfect separability in the oil-type classification, where the computation reveals a slack variable ξ ≠ 0 (Gaonkar et al. ), and the C value is considerably high, with a confidence level of 0.95 and a lower error value.
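For reference, the slack variables ξ and the penalty C enter the standard soft-margin (C-SVM) training problem as written below. This is the textbook binary formulation, given here as an aid to reading rather than as the exact problem solved in this study; the symbols w, b, φ, y_i and N (number of training samples) are notation introduced here, and a multi-class problem such as this one is usually handled by combining several binary classifiers.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i \\
\text{subject to}\quad & y_i\left(w^{\top}\phi(x_i) + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, N
\end{aligned}
```

A larger C penalizes non-zero slack more heavily, so fewer training samples are allowed to fall inside the margin or on the wrong side of the boundary.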
It can be clearly seen that, from the 47 oil spill samples used as input vectors, the decision boundary hyperplane of the SVM classifier is determined by 33 significant support vectors (SVs), which yielded a total of 4 non-linearly separable oil-type classes across the observations (bunker C: 4, diesel: 15, MO: 22 and WO: 2). The SVs are the subset of training input vectors that generate non-zero Lagrange multiplier coefficients (αi); here there are 33 of them.
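The way the support vectors and their Lagrange multipliers determine the decision boundary can be written, in the standard binary form with an RBF kernel, as below. The set S of support-vector indices, the labels y_i and the bias b are notation introduced here for illustration; only training samples with α_i ≠ 0 contribute to the sum.

```latex
f(x) = \operatorname{sign}\!\left( \sum_{i \in S} \alpha_i\, y_i\, K(x_i, x) + b \right),
\qquad
K(x_i, x) = \exp\!\left(-\gamma \lVert x_i - x \rVert^{2}\right),
\qquad
0 < \alpha_i \le C \ \text{ for } i \in S
```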
Overall, all four (4) samples in the prior bunker C cluster were identified on the basis of chromatogram profiling. Unfortunately, the bunker C identified in the prior cluster was determined to be 100% incorrect: the SVs predicted the bunker C cluster as non-bunker C product derivatives, classifying it as MO. All twenty-two (22) MO samples were 100% correctly classified. The WO cluster obtained 100% incorrect results, as the SVs predicted it as MO instead (Figure 5). The four (4) samples in the prior lube oil cluster were predicted as diesel-type product. Consequently, the lube oil cluster does not appear in the classification summary table of the support vector machines (Table 4). High exposure of the environmental samples to adulteration and contamination has altered the physical and chemical characteristics of the WO and lube oil.
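The per-class reading of a confusion matrix described above can be sketched as follows, with observed classes as rows and predicted classes as columns. The helper names and the toy labels "A", "B" and "C" are hypothetical; they are deliberately not the study's oil types or counts, which are reported in Table 4.

```python
from collections import Counter

def confusion_matrix(observed, predicted, labels):
    """Return a matrix with observed classes as rows, predicted as columns."""
    counts = Counter(zip(observed, predicted))
    return [[counts[(obs, pred)] for pred in labels] for obs in labels]

def per_class_recall(matrix):
    """Fraction of each observed class that was predicted correctly."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        # None marks a class with no observed samples.
        recalls.append(row[i] / total if total else None)
    return recalls

# Toy labels for illustration only (not the study's data).
labels = ["A", "B", "C"]
observed = ["A", "A", "B", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "A", "B"]

matrix = confusion_matrix(observed, predicted, labels)
recalls = per_class_recall(matrix)
for label, row, recall in zip(labels, matrix, recalls):
    print(label, row, recall)
```

In this toy case class "C" is never predicted, so its column is all zeros and its recall is 0, which is the same pattern as a prior cluster that is entirely absorbed into another class.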
It is important to note how the prediction capability is captured: one petroleum oil spill sample is listed at 'Dependent' under the category MOLFO, but the SVM classifier predicted this MOLFO sample as HFO, whilst the remaining oil samples were correctly classified, as they fit the clusters to which they actually belong (Figure 6). The SVM results indicate that the Kernel-RBF classifier's prediction error by defect was based on the actual value according to the oil spill sample background. As this analysis shows, Kernel-RBF has improved the computation time for subspace estimation.

CONCLUSION
The study of seawater PAHs along the Peninsular Malaysia coastlines has revealed the presence of aromatic compound contaminants at most of the sites. The differing PAH concentrations result from spatially varying contamination from the many activities at the sites. Exposure of the spilled oil to the environment has led to a high potential for weathering or degradation, yielding chemically and physically changed contaminants. This study shows that the ∑TAH concentration in diesel from the oil sample PP3_liquid has the highest concentrations of ∑ Alkylated PAHs (2,221.17 μg/ml) and ∑ EPA PAHs (10.48 μg/ml). Diesel from the oil sample GP6_Jetty is the second highest, with concentrations of ∑ Alkylated PAHs (715.04 μg/ml) and ∑ EPA PAHs (11.2 μg/ml). The classification SVM (C-SVM) Kernel-RBF method is used in this study as a decision boundary approach to classification problems. This method maximizes the classification of the input vectors by mapping them into the higher-dimensional feature space of the inner kernel hidden layer, where the error value is minimized to achieve the most reliable oil spill type classification. The C-SVM Kernel-RBF approach to oil spill classification, which is a different prediction technique from Artificial Neural Networks (ANNs), completes the training process within a short execution time for large quantities of fingerprint data from the GC-FID and GC-MS techniques. The results obtained from this approach are highly promising. It has become evident that the SVM classifier (C-SVM) achieves accurate classification prediction (not more than 10% error) from oil spill fingerprinting in Peninsular Malaysia. The confusion matrix output gives the classification accuracy for four (4) discriminated oil types, namely diesel, WO, MO and bunker C. Only diesel and MO are found to be correctly predicted by the SVs as the oil types in the environmental samples.
Most importantly, the kernel-RBF approach allows the oil classification to be performed optimally within a relatively short period of time, with faster dataset classification, slack variables of ξ ≠ 0 and a coherent decision boundary.

DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.