Abstract
Data availability is key for modeling of wastewater treatment processes. However, process data are characterized by missing values and outliers. This study applied a self-organizing map (SOM) to fill in missing values and replace outliers in wastewater treatment data from Kauma Sewage Treatment Plant in Lilongwe, Malawi. We used primary and secondary wastewater data and executed the SOM algorithm to fill missing values and replace outliers in effluent pH, biochemical oxygen demand (BOD), and dissolved oxygen (DO). The results suggest that the SOM algorithm is reliable in filling gaps in wastewater time series data with less than 50% missing values, giving correlation coefficient (R) values of >0.90. The SOM algorithm failed to reliably fill gaps and replace outliers in time series data with >50% missing values. For instance, high mean square error (MSE) values of 3,655.57, 10.62, and 2,153.34 for pH, DO, and BOD, respectively, were registered in datasets with more than 50% missing values, while very small MSE values (MSE ≈ 0) were associated with data with less than 50% missing values. Practitioners can use this approach to improve the planning and management of wastewater treatment facilities where available data records are riddled with missing observations.
HIGHLIGHTS
Missing data impinge on the efficiency of wastewater treatment plant processes.
The advancement of information technology and artificial intelligence enables the infilling of missing data.
We propose infilling missing data and replacing outliers using a multivariate model called the self-organizing map.
Missing data and outliers are replaced with reasonable estimates.
The approach has provided long series data for modelling the behavior of the wastewater treatment process.
INTRODUCTION
The wastewater treatment (WWT) process aims to achieve effluent and sludge quality that is environmentally safe for disposal or reuse (Larsen et al. 2013). Optimal wastewater treatment plant (WWTP) operation and control can be achieved by developing a robust mathematical tool that predicts the quality of treated effluent from past observations of key parameters (Hassen & Asmare 2018). However, numerous challenges arise when modeling wastewater treatment processes, including a lack of reliable wastewater quality data. The lack of good-quality data is attributed to the unavailability of equipment and the high cost of procedures for measuring wastewater quality characteristics (Hassen & Asmare 2018). Equipment malfunction and human error also lead to gaps and outliers in data records, which pose a challenge in model identification, calibration, and verification (Rustum 2009). One common solution is simply to remove records containing missing values and outliers, which are considered unreliable for modeling purposes. Considering the time limitations, cost of data collection, and scarcity of available data, removing such records is not a viable option (Mwale et al. 2012). As a result, significant data pre-processing is required to fill the gaps or identify outliers in the data record (Rustum 2009). This is particularly important in developing countries like Malawi, where the availability of data without missing values is a challenge.
The existing methods for dealing with missing values include replacing missing values with the mean, median, or regression models for the data (Rustum & Adeloye 2011). Outliers can be resolved by using trimmed means, and other scale estimators besides standard deviation such as Median of Absolute Deviation (MAD) and Winsorization (Rustum 2009). However, using these models to predict missing values or outliers in a long time series is difficult and frequently unreliable (Zhu et al. 2018). This is particularly challenging when the number of values to be filled is relatively high in comparison to the total record length.
These challenges can be resolved by using computing techniques such as artificial intelligence (AI). The most promising approaches in this class of techniques include Artificial Neural Networks (ANNs), Fuzzy Logic (FL), and Genetic Algorithms (GAs). The application of AI in data cleaning and management is well established in both water resources and hydrology (Rustum 2009; Nkiaka et al. 2016). ANNs are the most popular algorithms among AI classes as they use the same available data to learn about the behaviour of a time series. Furthermore, ANNs are capable of modeling complex nonlinear systems as they do not require prior knowledge of the system process(es) under study. In addition, ANNs have proven robust even in the presence of missing observations in time series (Mwale et al. 2012).
The Multilayer Perceptron (MLP) is one of the most widely used algorithms within the ANN family. Despite its robustness in filling gaps in time series, the MLP demands a long time series for training (Rustum 2009; Mwale et al. 2012). Accordingly, additional pre-processing of the time series is required to provide estimates in the input space before training can begin (Rustum 2009). This is especially important when significant portions of the time series to be used for training have incomplete data, or when the time series is too short to facilitate training (Nkiaka et al. 2016). Dealing with time series data with many missing values is also computationally intensive as it requires additional storage memory (Kalteh & Berndtsson 2011).
Kohonen Self-Organizing Maps (KSOMs), simply called Self-Organizing Maps (SOMs), are another member of the ANN class. SOMs are competitive, unsupervised ANNs (Rustum 2009). SOMs are becoming popular for filling missing values in time series and have proven more effective than ANN-MLP (Mwale et al. 2012; Nkiaka et al. 2016). Many studies have applied SOMs to fill gaps in time series with satisfactory results (e.g. Mwale et al. 2012; Nkiaka et al. 2016; Kumar et al. 2021a). Despite this widespread use around the world, the use of SOMs in Malawi, particularly in wastewater treatment processes where data quality is a concern, has been limited. This study tested the reliability of the SOM in predicting missing values and replacing outliers in time series data for the Kauma Sewage Treatment Plant situated in Lilongwe, Malawi. This work formed part of an ongoing research project that intended to apply AI algorithms in modeling the performance of the Kauma Sewage Treatment Plant.
MATERIALS AND METHODS
Description of SOMs
Description of the study area
A schematic of Kauma Sewage Treatment plant. Not drawn to scale. (Adapted from Mtethiwa et al. 2008; Ravina et al. 2021).
Sampling and data collection procedures
The study used both secondary and primary data. In this study, only domestic sewage was of interest. Secondary data were obtained from document reviews of Kauma sewage treatment plant data. Primary data were collected twice (morning and evening) daily for 30 days from 11 February 2022 to 17 March 2022. Wastewater samples were collected from the influent raw wastewater. The composite samples of influent raw wastewater were then analyzed for characteristics including pH, chemical oxygen demand (COD), total dissolved solids (TDS), total suspended solids (TSS), electrical conductivity (EC), and dissolved oxygen (DO). The analyses were conducted following the standard methods for measuring characteristics of water and wastewater as highlighted by APHA (2017). Only COD and biochemical oxygen demand (BOD) were obtained from samples that were collected from the septage lagoon and effluent-treated wastewater. At the septage lagoon, samples were taken during the emptying of the sludge; to ensure that the samples were not industrial sewage, a deliberate effort was made to trace the source of the sludge. For each truck, four samples of 2 L each were collected: one at the beginning of emptying, two in the middle of emptying, and one at the end of the emptying exercise. The samples were then mixed, also regarded as double sampling, to obtain the most homogeneous characteristics. Double sampling was considered for quality assurance.
Implementation of the SOM algorithm
Prediction of missing components of the input vector using the KSOM (BMU, Best Matching Unit) (Source:Rustum & Adeloye 2011).


Setting of SOM algorithm parameters
Bearing in mind that the learning process involved in the computation of a feature map is stochastic, the accuracy of the map depends on the number of iterations executed by the SOM algorithm during the initialization phase of the algorithm (Gabrielsson & Gabrielsson 2006). These authors recommend that for good statistical accuracy, the number of iterations is at least 500 times the number of network nodes. The default SOM software parameters for map size and lattice (rows and columns) were used, which were the same as using Equations (9) and (10).
The infilling process was completed through the following steps:
Step 1: Data collection and normalization: The data to be filled (i.e., wastewater quality data) were gathered and standardized; these were the depleted input vectors.
Step 2: Training: To form the SOM, the depleted input vector (data matrix) was introduced into the iterative training procedure. Weight vectors were initialized at the start of training using both a random and a linear initialization method. The comparison and adjustment processes were repeated until the optimal number of iterations was reached or the specified error criteria were met.
Step 3: Extraction of data from the trained SOM: All minimum Euclidean distances were examined; this was followed by identifying the SOM's BMU for the depleted input vector (i.e. with missing values and outliers). Because the BMU identified in this step was a trained SOM node, it was assumed to have the complete complement of missing values.
Step 4: Missing value replacement: At this stage, missing values and outliers of the depleted input vector were replaced with the corresponding values in the BMU identified in the above step.
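The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the SOM Toolbox implementation used in the study: the function names (`train_som`, `find_bmu`, `som_impute`), the small default map size, the random initialization, and the linearly shrinking Gaussian neighbourhood are all hypothetical choices. The essential idea is that distances to map nodes are computed only over the observed components of each depleted vector, and gaps are then filled from the BMU.

```python
import numpy as np

def find_bmu(Xz, mask, W):
    """Best matching unit per sample, using only the observed components."""
    diff = (Xz[:, None, :] - W[None, :, :]) * mask[:, None, :]
    return (diff ** 2).sum(axis=-1).argmin(axis=1)

def train_som(Z, rows, cols, n_iter=50, seed=0):
    """Batch-train a small SOM on standardized data Z that may contain NaN."""
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(Z)                       # True where a value was observed
    Xz = np.where(mask, Z, 0.0)               # gaps contribute nothing to sums
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = rng.normal(size=(rows * cols, Z.shape[1]))   # random initialization
    sigma0 = max(rows, cols) / 4.0            # assumed initial radius
    for t in range(n_iter):
        sigma = max(sigma0 * (1.0 - t / n_iter), 0.5)
        bmu = find_bmu(Xz, mask, W)
        # Gaussian neighbourhood between every node and each sample's BMU.
        d2 = ((grid[:, None, :] - grid[bmu][None, :, :]) ** 2).sum(axis=-1)
        H = np.exp(-d2 / (2.0 * sigma ** 2))  # shape (nodes, samples)
        num = H @ Xz                          # weighted sum of observed values
        den = H @ mask.astype(float)          # total weight of observed values
        W = np.where(den > 0, num / np.maximum(den, 1e-12), W)
    return W

def som_impute(X, rows=4, cols=3, n_iter=50, seed=0):
    """Steps 1-4: standardize, train, find BMUs, fill gaps from the BMU."""
    X = np.asarray(X, dtype=float)
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0)
    sd = np.where(sd > 0, sd, 1.0)
    Z = (X - mu) / sd                                   # Step 1
    W = train_som(Z, rows, cols, n_iter, seed)          # Step 2
    mask = ~np.isnan(Z)
    bmu = find_bmu(np.where(mask, Z, 0.0), mask, W)     # Step 3
    Z_filled = np.where(mask, Z, W[bmu])                # Step 4
    return Z_filled * sd + mu                           # back to original units
```

Observed entries pass through unchanged; only entries coded as NaN are replaced by the corresponding component of the BMU.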
The SOM model was developed and validated using data from 2015 to 2022. Initially, the SOM toolbox was used to train the model with default values of learning rate (α0 = 0.5) and neighborhood radius (σ0 = max(l1, l2)/4), where l1 and l2 are the dimensions of the map computed using Equation (10). The toolbox computes the size (number of units or neurons) of the map using Equation (9), but the final number of units on the map (M) was adjusted to equal the product of l1 and l2. The map size of the SOM model was M = 126 units, with dimensions of 14 × 9. The final quantization and topographic errors were 0.955 and 0.247, respectively.
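As a rough arithmetic check on the reported map size, the sketch below assumes the widely used heuristic of about 5√n map units for n training records. Equation (9) is not reproduced in this text, so treating it as this heuristic is an assumption. The record count (616) and the 14 × 9 map are taken from the text, and the iteration count applies the rule of thumb of 500 iterations per node quoted earlier.

```python
import math

n_samples = 616        # records reported for the Kauma dataset
rows, cols = 14, 9     # final map dimensions reported in the text

# Assumed heuristic (not confirmed by the source): units ~ 5 * sqrt(n), rounded up.
heuristic_units = math.ceil(5 * math.sqrt(n_samples))

# Final unit count is the product of the two map dimensions.
final_units = rows * cols

# Rule of thumb quoted in the text: at least 500 iterations per map node.
min_iterations = 500 * final_units

# Default initial neighbourhood radius: max(l1, l2) / 4.
initial_radius = max(rows, cols) / 4

print(heuristic_units, final_units, min_iterations, initial_radius)
```

Under these assumptions the heuristic gives 125 units, which is close to the 126 units of the 14 × 9 map once the count is adjusted to a product of two integers.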
Application of SOM
The SOM toolbox version 2.1, developed at Helsinki University of Technology, Finland (www.cis.hut.fi/projects/somtoolbox/), was used in the MATLAB® 2021a environment to apply the SOM algorithm for infilling missing data and outliers (The MathWorks, Inc. 2021). In this analysis, a batch training algorithm was used because its implementation in MATLAB is considerably more efficient than that of the sequential training algorithm: it requires less time for training and produces lower quantization and topographic errors. The data were arranged in columns, with each column representing a wastewater quality parameter. To meet MATLAB® data entry requirements, entries without data and outliers were recorded as NaN (Not a Number). To train all of the data in a single simulation, the data entries should overlap so that no single day or month lacks entries for all of the wastewater quality parameters.
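The overlap requirement described above can be checked before training. The following Python sketch (the study itself used MATLAB) flags any record in which every parameter is NaN; the function name `rows_without_data` and the three-column example matrix are hypothetical.

```python
import numpy as np

def rows_without_data(data):
    """Return indices of records in which every parameter is NaN."""
    data = np.asarray(data, dtype=float)
    return np.flatnonzero(np.isnan(data).all(axis=1))

# Hypothetical three-parameter record; NaN marks a gap or a deleted outlier.
records = np.array([
    [7.1, np.nan, 22.0],
    [np.nan, np.nan, np.nan],   # no parameter observed on this date
    [6.9, 1.2, np.nan],
])

empty_rows = rows_without_data(records)
print(empty_rows)
```

Records flagged in this way carry no information from which a BMU could be located, so they would need to be dropped or supplemented before a single training run.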
Ethical consideration
The study sought clearance from the Mzuzu University Research Ethics Committee (MZUNIREC) Ref No: MZUNIREC/DOR/21/62. Permission was also obtained from Lilongwe City Council to engage laboratory technicians during data collection processes. Informed consent was also obtained from the laboratory technicians.
RESULTS AND DISCUSSIONS
Descriptive statistics of wastewater quality parameters for Kauma Sewage Treatment Plant
Table 1 presents a summary of the descriptive statistics of the Kauma Sewage Treatment Plant data. A total of 616 data samples were obtained, covering a period of 6 years (2015–2021). The standard deviation (SD) is a measure of how dispersed the data are in relation to the mean: a low SD means the data are clustered around the mean, and a high SD indicates the data are more spread out. The highest standard deviation (SD = 2,798.18) was observed for influent COD from the septage lagoon, while influent pH had the lowest (SD = 0.46). As shown in Table 2, there were large numbers of missing values that could not be thrown away. Assuming that the data have a normal distribution, the mean and standard deviation of the entire dataset are used to obtain the Z-score of each data point, while in the modified Z-score, the median absolute deviation about the median (MAD) is used instead of the standard deviation (Rustum & Adeloye 2007). In the present study, the modified Z-score method identified more outliers than either the visual inspection or the Z-score method. All outliers identified by the modified Z-score method were deleted and treated as missing values to be estimated. Finally, influent pH, influent DO, effluent BOD, and effluent COD were found to have more than 50% missing data and outliers.
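The difference between the two scoring methods can be illustrated with a short Python sketch. The scaling constant 0.6745 and the cut-offs of 3 and 3.5 are commonly used conventions and are assumptions here, since the text does not state the thresholds applied in the study; the pH-like sample values are hypothetical.

```python
import numpy as np

def z_scores(x):
    """Classical Z-score based on the mean and sample standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x, ddof=1)

def modified_z_scores(x):
    """Modified Z-score based on the median and the MAD (assumed constant 0.6745)."""
    x = np.asarray(x, dtype=float)
    med = np.nanmedian(x)
    mad = np.nanmedian(np.abs(x - med))
    return 0.6745 * (x - med) / mad

# Hypothetical pH-like readings with one suspicious value at the end.
sample = np.array([7.0, 7.1, 6.9, 7.2, 7.0, 6.8, 7.1, 12.0])

z_flags = np.abs(z_scores(sample)) > 3.0             # assumed cut-off
mz_flags = np.abs(modified_z_scores(sample)) > 3.5   # assumed cut-off
print(z_flags, mz_flags)
```

In this small example the extreme value inflates the mean and standard deviation enough to escape the classical Z-score, while the median-based score still flags it. This is one plausible reason why a modified Z-score can identify more outliers than the classical Z-score.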
Computed descriptive statistics of Kauma Sewage Treatment Plant data
Parameter | Unit | Mean | SD | SE | Max | Min | UB | LB |
---|---|---|---|---|---|---|---|---|
pHinf | – | 7.01 | 0.46 | 0.02 | 8.00 | 5.40 | 7.05 | 6.97 |
Tempinf | °C | 24.73 | 1.83 | 0.07 | 29.00 | 20.40 | 24.88 | 24.58 |
BOD5inf | mg/l | 228.47 | 41.33 | 2.08 | 450.00 | 74.00 | 232.57 | 224.38 |
CODinf | mg/l | 358.34 | 88.49 | 4.56 | 552.70 | 182.00 | 367.31 | 349.37 |
BODinf SL | mg/l | 821.67 | 542.71 | 84.76 | 2,329.5 | 109 | 992.96 | 650.36 |
CODinf SL | mg/l | 2,615.11 | 2,798.18 | 437.00 | 14,090.88 | 826.56 | 3,498.32 | 1,731.89 |
TDSinf | mg/l | 465.15 | 86.07 | 3.71 | 739.00 | 230.00 | 472.44 | 457.87 |
TSSinf | mg/l | 173.28 | 11.87 | 0.73 | 199.00 | 146.00 | 174.72 | 171.85 |
ECinf | μS/cm | 783.83 | 118.96 | 4.86 | 1,070.00 | 441.00 | 793.38 | 774.28 |
TURBinf | NTU | 9.649 | 0.581 | 0.023 | 11 | 8 | 9.695 | 9.603 |
DOinf | mg/l | 1.12 | 0.99 | 0.04 | 3.21 | 0.07 | 1.21 | 1.03 |
BOD5eff | mg/l | 22.06 | 7.16 | 0.36 | 70.00 | 5.00 | 22.76 | 21.36 |
CODeff | mg/l | 40.41 | 12.46 | 0.63 | 58.20 | 20.00 | 41.65 | 39.17 |
SD, standard deviation; SE, standard error; UB, upper bound of 95% confidence interval for the mean; LB, lower bound of 95% confidence interval for the mean; BOD, biochemical oxygen demand; COD, chemical oxygen demand; TDS, total dissolved solids; TSS, total suspended solids; EC, electrical conductivity; DO, dissolved oxygen; Temp, temperature; TURB, turbidity. Suffixes: inf, influent; eff, effluent; SL, Septage Lagoons.
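The UB and LB columns are consistent with a normal-approximation interval of the form mean ± 1.96 × SE. The formula is not stated in the text, so this is an assumption; the small Python check below uses the pHinf row of Table 1, and the helper name `ci95` is hypothetical.

```python
def ci95(mean, se, z=1.96):
    """Normal-approximation 95% confidence interval for a mean (assumed formula)."""
    return mean - z * se, mean + z * se

# pHinf row of Table 1: mean = 7.01, SE = 0.02.
lb, ub = ci95(7.01, 0.02)
print(round(lb, 2), round(ub, 2))
```

For other rows the bounds computed this way differ slightly in the second decimal place from the tabulated values, presumably because the tabulated SE values are rounded.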
Proportion of missing data and outliers
Parameter | Unit | Number of missing values | Number of outliers (visual inspection) | Number of outliers (Z-score) | Number of outliers (modified Z-score) | Proportion of missing data and outliers (%) |
---|---|---|---|---|---|---|
pHinf | – | 298 | 8 | 22 | 76 | 60.71 |
Tempinf | °C | 81 | 45 | 29 | 76 | 25.40 |
BOD5inf | mg/l | 221 | 4 | 88 | 83 | 49.43 |
CODinf | mg/l | 239 | 0 | 0 | 44 | 46.02 |
BODinf SL | mg/l | 44 | 3 | 1 | 81 | 20.33 |
CODinf SL | mg/l | 114 | 1 | 0 | 30 | 23.41 |
TDSinf | mg/l | 77 | 2 | 2 | 74 | 24.55 |
TSSinf | mg/l | 17 | 0 | 11 | 73 | 14.63 |
ECinf | μS/cm | 209 | 4 | 2 | 82 | 47.32 |
Turbinf | NTU | 28 | 22 | 25 | 49 | 12.50 |
DOinf | mg/l | 298 | 15 | 31 | 66 | 59.09 |
BOD5eff | mg/l | 351 | 2 | 0 | 28 | 61.63 |
CODeff | mg/l | 226 | 0 | 0 | 101 | 53.17 |
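The last column of Table 2 appears to combine the missing values with the outliers flagged by the modified Z-score, expressed as a percentage of the 616 records. This reading is an assumption: it reproduces the pHinf and DOinf rows exactly, whereas several other rows differ by about 0.1 percentage points, so the calculation used in the study may differ in detail. The helper name `gap_proportion` is hypothetical.

```python
def gap_proportion(n_missing, n_outliers, n_records=616):
    """Percentage of records that are missing or flagged as outliers (assumed formula)."""
    return 100.0 * (n_missing + n_outliers) / n_records

ph_inf = gap_proportion(298, 76)   # pHinf row of Table 2
do_inf = gap_proportion(298, 66)   # DOinf row of Table 2
print(round(ph_inf, 2), round(do_inf, 2))
```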
SOM component planes
Visual analysis of the component planes shows that the color (or gray) gradient of the plane of BOD5inf is parallel to the gradient of CODinf, with high values of BOD5inf being correlated with high values of CODinf and vice versa. Similarly, high values of BOD5eff were correlated with high values of CODeff and vice versa. The component planes also confirm the negative correlation between pH and BOD5inf, CODinf, and DOinf, with low values of pH associated with high values of BOD5inf, CODinf, and DOinf. The positive correlation between BOD and COD was expected, as COD values are typically higher than BOD values and the ratio between them varies depending on the characteristics of the wastewater (Rai et al. 2019). Table 3 displays the complete correlation matrix for all 13 variables of the prototype vectors. Although this is a simple tool for examining the linear relationship between variables, its results appear to agree with the indications of cross-correlation provided by the much more complex SOM analysis that resulted in the component planes.
Correlation matrix for variables in code vectors
Variable | pHinf | Tinf | BOD5inf | CODinf | BODinf SL | CODinf SL | TDSinf | TSSinf | ECinf | TURBinf | DOinf | BOD5eff | CODeff |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
pHinf | 1 | ||||||||||||
Tinf | 0.161 | 1 | |||||||||||
BOD5inf | −0.461 | −0.606* | 1 | ||||||||||
CODinf | −0.615* | −0.510 | 0.922** | 1 | |||||||||
BODinf SL | −0.251 | 0.167 | 0.043 | 0.061 | 1 | ||||||||
CODinf SL | −0.307 | 0.153 | 0.144 | 0.215 | 0.834** | 1 | |||||||
TDSinf | 0.368 | −0.125 | 0.131 | 0.088 | −0.620* | −0.599* | 1 | ||||||
TSSinf | −0.199 | −0.319 | 0.339 | 0.275 | 0.249 | 0.035 | −0.196 | 1 | |||||
ECinf | −0.135 | 0.210 | 0.193 | 0.411 | 0.041 | 0.130 | 0.086 | 0.409 | 1 | ||||
TURBinf | 0.167 | 0.067 | −0.009 | −0.048 | −0.048 | 0.125 | 0.162 | −0.386 | −0.320 | 1 | |||
DOinf | −0.587* | 0.204 | 0.098 | 0.269 | 0.110 | 0.301 | 0.003 | 0.024 | 0.234 | 0.163 | 1 | ||
BOD5eff | −0.013 | 0.154 | 0.033 | 0.066 | −0.309 | −0.157 | −0.008 | 0.003 | 0.252 | −0.237 | 0.196 | 1 | |
CODeff | 0.014 | 0.166 | −0.131 | −0.052 | −0.285 | 0.050 | 0.114 | −0.441 | −0.100 | 0.380 | 0.344 | 0.625* | 1 |
*Correlation is significant at the 0.05 level (two-tailed).
**Correlation is significant at the 0.01 level (two-tailed).
SD, standard deviations; se, standard error; UB, upper bound of 95% confidence interval for the mean; LB, lower bound of 95% confidence interval for the mean; pH, power of hydrogen; BOD, biochemical oxygen demand; COD, chemical oxygen demand; TDS, total dissolved solids; TSS, total suspended solids; EC, electrical conductivity; DO, dissolved oxygen. SUFFIXES: inf, influent; eff, effluent; SL, Septage Lagoon.
Performance indices of SOM
Parameter | Unit | MSE | R |
---|---|---|---|
pHinf | – | 3.655 × 10³ | 0.978 |
Tempinf | °C | 7.65 × 10⁻³ | 0.974 |
BOD5inf | mg/l | 6.702 × 10⁻³ | 0.997 |
CODinf | mg/l | 6.262 × 10⁻⁴ | 0.998 |
BODinf SL | mg/l | 4.950 × 10⁻⁵ | 0.981 |
CODinf SL | mg/l | 2.301 × 10⁻³ | 0.991 |
TDSinf | mg/l | 1.038 × 10⁻² | 0.999 |
TSSinf | mg/l | 3.320 × 10⁻¹ | 0.938 |
ECinf | μS/cm | 7.192 × 10⁻³ | 0.999 |
Turbinf | NTU | 3.61 × 10⁻⁴ | 0.995 |
DOinf | mg/l | 1.062 × 10² | 0.973 |
BOD5eff | mg/l | 2.153 × 10³ | 0.986 |
CODeff | mg/l | 1.159 × 10⁻³ | 0.999 |
MSE, mean square error; R, correlation coefficient.
Model evaluations
Time series plots for predicted and observed values (a) for effluent BOD5, (b) for effluent COD.
The performance of the SOM in predicting the various characteristics is depicted in Figure 7 and Table 4. The associated correlation coefficient values are all greater than 0.90, indicating that the overall performance is good. Variables with a high proportion of missing values, on the other hand, had high MSE values. The MSE values of influent pH, influent DO, and effluent BOD, for example, were 3.655 × 10³, 1.062 × 10³, and 2.153 × 10³, respectively. The MSE is the average of the squared errors of the model predictions. When there is no error in a model, the MSE is zero; as model error increases, so does the MSE.
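The two performance indices can be computed as in the Python sketch below. The function names and the small example vectors are hypothetical and are not data from the study. The second example illustrates a general property that may help in reading Table 4: R measures only linear association, so predictions that track the pattern of the observations but are wrongly scaled can give R close to 1 together with a large MSE.

```python
import numpy as np

def mse(observed, predicted):
    """Mean of the squared prediction errors."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((observed - predicted) ** 2))

def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observations and predictions."""
    return float(np.corrcoef(observed, predicted)[0, 1])

# Hypothetical observations and two sets of predictions.
obs = np.array([1.0, 2.0, 3.0, 4.0])
close_pred = np.array([1.1, 1.9, 3.2, 3.8])   # small errors
scaled_pred = 10.0 * obs                      # right pattern, wrong scale

print(mse(obs, close_pred), pearson_r(obs, close_pred))
print(mse(obs, scaled_pred), pearson_r(obs, scaled_pred))
```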

Under the null hypothesis that the skew coefficient is zero, residuals whose skew coefficient fell outside the 95% confidence interval (CI) were considered not normally distributed. Table 5 displays the results of this hypothesis testing. Based on these findings, the residuals associated with the majority of characteristics are normally distributed. The only exceptions are influent pH, influent DO, and effluent BOD, whose skew coefficients fall outside the 95% confidence interval.
Approximate normality test for residuals
Variable | No. | Lower limit | Upper limit | Skew coefficient | Normal (Y/N) (skewed) |
---|---|---|---|---|---|
pHinf | 572 | −0.009 | 0.009 | −1.054 | N |
Tempinf | 459 | −0.6273 | 0.00265 | −0.312 | Y |
BOD5inf | 395 | −0.182 | 0.182 | −0.164 | Y |
CODinf | 377 | −3.079 | 3.079 | 0.806 | Y |
BOD5inf SL | 242 | −10.8931 | 10.89306 | −0.95502 | Y |
COD5inf SL | 252 | −19.6372 | 19.63723 | −0.10421 | Y |
TDSinf | 539 | −0.791 | 0.791 | 0.134 | Y |
TSSinf | 265 | −8.83 | 8.83 | −0.64 | Y |
ECinf | 599 | −1.839 | 1.839 | −0.077 | Y |
Turbinf | 529 | −0.775 | 1.193 | 0.209 | Y |
DOinf | 502 | −0.021 | 0.021 | 3.072 | N |
BOD5eff | 407 | −0.450 | 0.450 | −1.411 | N |
CODeff | 390 | −0.408 | 0.408 | 0.325 | Y |
These results indicate that, although the SOM algorithm is quite robust for infilling gaps in wastewater time series, it cannot be relied upon for time series with a high proportion of missing data, owing to the reduced model performance observed in time series with more than 50% missing data. This could be explained by the insufficient data from which the model is expected to learn, so that it cannot correctly replicate the pattern in the data. For example, measured influent pH and DO, with 60.71 and 59.09% missing data, produced MSE values of 3.655 × 10³ and 1.062 × 10³, respectively. Similarly, effluent BOD, with 61.63% missing data, produced an MSE value of 2.153 × 10³. This implies that time series with extended periods of missing observations should not be used, as the model may infill the missing observations but still fail to replicate the pattern in the data. In the context of rainfall–runoff modeling, Mwale et al. (2012) propose that such inconsistencies can be resolved by training time series data with data from the same spatial zones. However, this can be a very challenging task in the context of wastewater systems due to the complex processes involved. The correlation coefficient values of more than 0.9 obtained in this study are comparable to those reported by other researchers, summarized in Table 6. Similarly, the quantification errors that were very close to zero are also comparable to those reported by the aforementioned authors. This similarity in the results could largely be explained by similarity in the implementation procedure of the SOM algorithm in MATLAB.
Research in engineering-related problem optimization using SOM algorithm
Authors | Location | Optimisation problem | Parameters | Software | Fitness functions | Major findings |
---|---|---|---|---|---|---|
Nijim & Rustum (2022) | United Kingdom (Seafield wastewater treatment plant data in Edinburgh) | Apply SOM algorithm as alternative model to verify the accuracy of the Multivariate Imputation by Chained Equations (MICE) | DO | Not specified | R, MSE, AAE | Performance of MICE Model was excellent with less proportion of missing values and poor when proportion was high. Results were similar to those produced by the SOM Model in previous case studies. |
Juboori et al. (2022) | Desk Reviews | Analyze reinforced concrete structures employing Self-organizing maps | Elements of Reinforced Cement Concrete (RCC) | MATLAB | AAE, RAAE, NRMSE, MSE, R, CE | The Self-Organizing Map (KSOM) is an attractive tool for modeling reinforced concrete structures. Moreover, this technique offers a magnificent tool for high-dimensional data visualization. |
Kumar et al. (2021b) | India (National Institute of Technology, Hamirpur) | Develop self-organizing map (SOM), feed-forward neural network (FFNN), and multiple linear regression (MLR) models for estimating the well-watered canopy temperature (Tc-ww) using air temperature and relative humidity as input predictor variables | Relative humidity, air temperature, well-watered canopy temperature | MATLAB | MBE, MAE, MSE, PE, R | The findings indicated that the SOM-modeled values presented a better agreement with the measured values in comparison to MLR- and FFNN-based estimates, with R² values of 0.978, 0.924, and 0.923 for KSOM, MLR, and FFNN, respectively, during model validation. |
Kumar et al. (2021a) | India (National Institute of Technology, Hamirpur) | Develop a self-organizing map (SOM) based model to predict the Crop Water Stress Index (CWSI) using microclimatic variables, namely air temperature, canopy temperature and relative humidity | CWSI, air temperature, solar radiation, wind speed, and relative humidity | MATLAB | NSE, BE, AE, R | The SOM predicted CWSI presented a good agreement with the baseline computed CWSI values during model training (R = 0.98, NSE = 0.97, AE = 0.018, BE = 0.0004) and testing (R = 0.98, NSE = 0.98, AE = 0.018, BE = 0.002). |
Ramachandran et al. (2019) | N/A | Use SOM to predict anaerobic digestion system behavior, study correlation between various process parameters, and extract knowledge | Glucose, biogas flowrate, methane gas, pH | MATLAB (Synthetic MATLAB–Simulink–Excel model) | R, AAE, MSE, RMSE | The model accurately predicted the variations in methane and total gas output with respect to changes in input parameters, as the correlation was more than 90% for most of the parameters. |
Rizvi & Rustum (2018) | California (wastewater treatment plant in San Diego) | Use SOM to study the effects of precipitation on the performance of wastewater treatment plant | Precipitation, Q, SS, BOD, COD, NH4, NO3, PO4, Temp, pH, | MATLAB | N/A | The results of the case study showcased SOM as a tool which was able to recognize the relationship among different parameters with rain in the wastewater treatment system. |
Nkiaka et al. (2016) | Cameroon (Logone catchment, Lake Chad basin) | Use SOM to infill missing data in hydro-meteorological time series | Rainfall data, River discharge data | MATLAB | R | SOMs are a robust and efficient method for infilling missing gaps in hydro-meteorological time series, as indicated by coefficient of determination values which were all above 0.75 and 0.65 for rainfall and river discharge time series, respectively. |
Mwale et al. (2014) | Malawi (Shire River Basin) | Use SOM to extract features from the raw data, which then formed the basis of infilling the gap-riddled data to provide more complete and much longer records that enhanced predictions | Rainfall data, River discharge data | MATLAB | NSE, R, MSRE | SOM is quite robust for infilling missing data and can therefore be used to infill large gaps, something that would be impossible with traditional infilling methods, thus providing the relatively long data series needed for hydrological modeling. |
Adeloye & Rustum (2012) | Nigeria (Osun basin) | Use SOM to model rainfall–runoff relationship | Rainfall data, River discharge data | MATLAB | R | The study demonstrated the successful use of emerging tools to overcome practical problems in sparsely gauged basins. |
Rustum & Adeloye (2012) | United Kingdom (Seafield ASP in Edinburgh) | Use SOM to enhance the performance of multi-layered perceptron, feed-forward back-propagation artificial neural networks | Flow rate, COD, SS, Ammonia, Blanket depth | MATLAB | R, MSE, AAE | The study clearly demonstrated the usefulness of the clustering power of the SOM in helping to reduce noise in observed data to achieve better modeling and prediction of environmental systems behavior. |
Rustum & Adeloye (2007) | United Kingdom (Seafield ASP in Edinburgh) | Use SOM to replace outliers and missing values in activated sludge plant data | Flow rate, BOD, SS, WAS, MLSS, RAS, SSVI, sludge age, F/M | MATLAB | R, MSE | Results demonstrated that the SOM is an excellent tool for replacing outliers and missing values in a high-dimensional dataset. |
R, coefficient of determination; MSE, Mean Square Error; AAE, Average Absolute Error; MBE, Mean Bias Error; MAE, Mean Absolute Error; PE, Percent Error; NSE, Nash–Sutcliffe Efficiency; BE, Bias Error; AE, Absolute Error; RAAE, Relative Average Absolute Error; NRMSE, Normalized Root Mean Square Error; CE, Classification Error; MSRE, Mean Squared Relative Error; SS, Suspended Solids; WAS, Waste Activated Sludge; MLSS, Mixed Liquor Suspended Solids; RAS MLSS, Return Activated Sludge Mixed Liquor Suspended Solids; SSVI, Stirred Sludge Volume Index; F/M, Food to Microorganisms Ratio.
CONCLUSIONS
This study applied a SOM algorithm to fill missing values and replace outliers in wastewater data from the Kauma Sewage Treatment Plant. The results showed that the SOM algorithm is reliable for infilling gaps and replacing outliers in wastewater time series data with less than 50% missing values. SOM performance deteriorated when more than 50% of the values in a time series were missing. The overall R values of >0.90 obtained in this study are within the range of performance reported in the literature. Practitioners can use this approach to enhance the planning and management of wastewater treatment facilities where available records are riddled with missing observations. We recommend further research to ascertain the accuracy of the SOM algorithm in filling gaps and replacing outliers in extended data records measured at different time scales, such as hourly and/or daily data.
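As an illustration of the infilling idea summarized above, the following is a minimal Python sketch. It is not the MATLAB SOM Toolbox workflow used in this study. It assumes synthetic, strongly correlated data in place of the Kauma records, and the function names (`train_som`, `impute`), map size, learning-rate schedule, and neighbourhood schedule are all hypothetical choices made for the example. Each gappy record is matched to its best-matching unit (BMU) using only the observed variables, and the missing entry is read from that unit's prototype vector; R and MSE are then computed on the deliberately removed values.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=40, seed=0):
    """Fit SOM prototypes to standardized data; NaN marks a missing entry."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    observed = ~np.isnan(data)
    weights = 0.1 * rng.standard_normal((rows * cols, d))  # small random start
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac) + 0.01                      # decaying learning rate
        sigma = 0.5 * max(rows, cols) * (1.0 - frac) + 0.5  # shrinking neighbourhood
        for i in rng.permutation(n):
            m = observed[i]
            if not m.any():
                continue
            x = data[i, m]
            # BMU search uses only the observed components of the record
            bmu = np.argmin(((weights[:, m] - x) ** 2).sum(axis=1))
            # Gaussian neighbourhood measured on the map grid
            h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2.0 * sigma ** 2))
            weights[:, m] += lr * h[:, None] * (x - weights[:, m])
    return weights

def impute(data, weights):
    """Fill NaN entries with the matching components of each record's BMU."""
    observed = ~np.isnan(data)
    out = data.copy()
    for i in range(data.shape[0]):
        m = observed[i]
        if m.all() or not m.any():
            continue  # nothing to fill, or nothing to match on
        bmu = np.argmin(((weights[:, m] - data[i, m]) ** 2).sum(axis=1))
        out[i, ~m] = weights[bmu, ~m]
    return out

# Synthetic stand-in data (an assumption): three variables driven by one latent signal
rng = np.random.default_rng(42)
t = rng.standard_normal(300)
complete = np.column_stack([t, 0.8 * t, -1.2 * t]) + 0.1 * rng.standard_normal((300, 3))

# Remove one value from each of 60 randomly chosen records
gappy = complete.copy()
rows_hit = rng.choice(300, size=60, replace=False)
cols_hit = rng.integers(0, 3, size=60)
gappy[rows_hit, cols_hit] = np.nan

# Standardize with statistics of the observed values, train, impute, back-transform
mean = np.nanmean(gappy, axis=0)
std = np.nanstd(gappy, axis=0)
z = (gappy - mean) / std
weights = train_som(z)
imputed = impute(z, weights) * std + mean

# Compare the infilled values against the values that were removed
truth = complete[rows_hit, cols_hit]
estimate = imputed[rows_hit, cols_hit]
r = np.corrcoef(truth, estimate)[0, 1]
mse = np.mean((truth - estimate) ** 2)
print(f"R = {r:.3f}, MSE = {mse:.4f}")
```

Standardizing before training keeps any one variable from dominating the distance used in the BMU search. Repeating the experiment with a larger share of removed values is one way to explore how infilling quality changes as the proportion of missing data grows.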
ACKNOWLEDGEMENTS
The structure of the methodology and results sections of this work follows that of Rustum & Adeloye (2007). The authors are grateful to Dr Linda Strande and the entire team from Eawag (Swiss Federal Institute of Aquatic Science and Technology) for technical guidance and material support. The authors also thank Lilongwe City Council for the support rendered during data collection, particularly Mr Obvious Nyirenda, Mr Phyllis Mkwezalamba, Mr John Thyoka, Mr Chimango Mweso, and Mr Orymo Nyirenda.
FUNDING
This work received funding from the National Commission for Science and Technology (NCST) under the NCST Small Grants Scheme. It also received funding from the Malawi Ministry of Education, Science and Technology under the Higher Education Research and Development for Young Researchers (postgraduate) scheme (Ref. No. EDU/HE/21/74).
SOFTWARE AVAILABILITY STATEMENT
The SOM Toolbox (Version 2.2) for MATLAB used in this study is freely available for download from GitHub (https://github.com/ilarinieminen/SOM-Toolbox).
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.