Abstract
Cholera is a serious disease that affects a large number of people, especially in underdeveloped nations, and is particularly prevalent in Africa and southern Asia. This study aimed to determine cholera incidence trends and patterns in West Africa and to develop a statistical model for cholera incidence. The outcomes of this study were occurrence, coded as 1 if a case occurred and 0 otherwise, and incidence rate. Logistic regression was used to model occurrence, while log-linear regression was used to model incidence after excluding the records with zero cases. The trend of the cholera incidence rate was approximately constant for the Democratic Republic of Congo, whereas rates varied substantially throughout the study period in other countries. A confidence interval plot shows that cholera incidence was higher in September and October, lower in 2015–2017, higher in Guinea, Niger, and Congo (west), and lower in Cote d'Ivoire, Cameroon, the Democratic Republic of Congo, the Central African Republic, Togo, and Guinea Bissau. These two models fit the data quite well. As a result, the method used in this study may be considered an alternative to the traditional Poisson regression and negative binomial regression models.
HIGHLIGHTS
The analysis of disease cases can be separated into two steps: occurrence and incidence.
Logistic regression can be used to predict the occurrence of disease.
A log-linear regression model provides a good fit to the incidence rate when the data are strongly right-skewed.
Mali and Niger have both high occurrence and high incidence of cholera.
INTRODUCTION
Cholera is an acute diarrheal disease caused by ingesting the bacterium Vibrio cholerae in food or water. It remains a global threat to public health wherever social development is lacking. Researchers have estimated that there are 1.3–4.0 million cholera cases worldwide each year, and 21,000–143,000 deaths annually from the disease (Mengel et al. 2014). The burden of cholera is large, particularly in developing countries, and is highest in Africa and southern Asia (WHO 2017).
Many studies have modeled various aspects of cholera, including transmission, the impacts of interventions, and predictions of the number of cholera cases from mathematical and statistical models. The study by Chao et al. (2014) described a variety of approaches to modeling cholera outbreaks, of both epidemic and endemic cholera. However, the modeler has to make an appropriate choice of the initial conditions, other parametrization, and possibly of model structure. Ezeagu et al. (2019) presented a mathematical model for the dynamic transmission of cholera and showed that effective quarantine, vaccination, and proper sanitation can reduce the disease contact rates and help eliminate the spread of cholera. The study by Koepke et al. (2016) developed a predictive model of cholera outbreaks in Bangladesh based on environmental factors. Their method simultaneously accounts for disease dynamics and environmental variables in a susceptible-infected-recovered-susceptible (SIRS) model. They also applied a Bayesian framework and Markov chain Monte Carlo methods to sample from the posterior distribution. They found that the model can successfully predict an increase in the number of infected individuals in the population weeks before the observed number of cholera cases increases.
Instead of attempting to identify the role of climate through simple correlations with disease incidence, some studies have applied mathematical models and non-linear statistical time series analysis to identify the effects of climate variability on cholera disease. The models indicate that climate plays a pivotal role in modulating the sizes of outbreaks (Koelle 2009). The study by Daisy et al. (2020) used a seasonal-auto-regressive-integrated-moving-average (SARIMA) model for time-series analysis. Both a single-variable (SVM) and a multi-variable SARIMA model (MVM) were developed, compared, and tested to evaluate the relationships of environmental factors to cholera incidence. The results revealed that MVM (AIC = 15, BIC = 36) showed a better performance than SVM (AIC = 21, BIC = 39).
Studies of cholera have been conducted from a variety of perspectives, including epidemiology of the disease and model creation, to determine the relationships of candidate predictive factors with cholera incidence. However, only a few studies have provided a simple but effective approach. Therefore, the aim of this study is to propose an alternative statistical modeling approach for disease incidence that combines traditional logistic regression and simple linear regression.
METHODS
Data source
Data management
The candidate predictors in this analysis were weeks, years, and countries. There are 5,304 combinations of these factor levels (52 weeks × 6 years × 17 countries), but 20 are missing because Liberia had no data for weeks 1–9 in 2015, Sierra Leone had no data for weeks 49–52 in 2012, and Togo had no data for weeks 48–52 in 2012. These 20 missing values were omitted from the analysis.
Statistical analysis
In general, when dealing with count data, researchers use the Poisson model to determine the relationship between the independent and dependent variables. When the Poisson model fails to fit the data due to excess variation, the negative binomial model is used with an over-dispersion parameter θ, where smaller values of θ correspond to greater dispersion (Venables & Ripley 2002). When there are many zeros in the data, fitting the negative binomial model with a very small value of θ tends to produce a poor fit. As a result, researchers frequently employ zero-inflated or hurdle models.
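To make the role of θ concrete, the following minimal Python sketch assumes the negative binomial parameterization in which the variance is μ + μ²/θ, so that a small θ inflates both the variance and the probability of a zero count. The mean and the θ values are illustrative only and are not estimates from this study's data.

```python
def nb_variance(mu, theta):
    """Variance of a negative binomial count with mean mu and dispersion theta."""
    return mu + mu ** 2 / theta


def nb_zero_probability(mu, theta):
    """P(Y = 0) for a negative binomial count with mean mu and dispersion theta."""
    return (theta / (theta + mu)) ** theta


mu = 5.0  # illustrative mean weekly count (assumption, not from the study)
for theta in (100.0, 1.0, 0.1):
    # As theta shrinks, the variance and the share of zeros both grow.
    print(theta, nb_variance(mu, theta), nb_zero_probability(mu, theta))
```

With a large θ the model behaves almost like a Poisson model, while a very small θ is needed to accommodate many zeros, which is the situation in which hurdle-type approaches are usually preferred.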




A graph of confidence intervals was used to visualize the model's results, with sum contrasts applied (Tongkumchum & McNeil 2009). This method computes an estimate and the 95% confidence interval of the occurrence and incidence rate for each predictor level in the model. Each 95% confidence interval is compared to the overall mean or another value. Thematic maps were used to classify countries based on whether their cholera occurrence is above or below the overall mean, whereas thematic maps for cholera incidence were created to classify countries based on the overall median.
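The idea behind sum contrasts can be illustrated with a simplified, single-factor Python sketch. It is not the exact procedure of Tongkumchum & McNeil (2009): it assumes an ordinary least-squares fit, a normal approximation for the 95% interval, and a hypothetical function name and toy data. With sum-to-zero coding, the intercept is the unweighted mean of the level means, and each level effect is a deviation from it.

```python
import numpy as np


def sum_contrast_effects(y, levels):
    """OLS with sum-to-zero coding.

    Returns the intercept, the deviation of each level from it,
    and the half-width of an approximate 95% interval per level.
    """
    y = np.asarray(y, dtype=float)
    names = sorted(set(levels))
    k = len(names)
    index = {name: i for i, name in enumerate(names)}
    # Intercept plus k-1 sum-coded columns; the last level is coded -1 throughout.
    X = np.zeros((len(y), k))
    X[:, 0] = 1.0
    for row, lev in enumerate(levels):
        i = index[lev]
        if i < k - 1:
            X[row, 1 + i] = 1.0
        else:
            X[row, 1:] = -1.0
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - k)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    # Expand the k-1 fitted coefficients into k effects that sum to zero.
    C = np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])
    effects = C @ beta[1:]
    se = np.sqrt(np.diag(C @ cov[1:, 1:] @ C.T))
    half_width = 1.96 * se  # normal approximation (assumption)
    return beta[0], dict(zip(names, effects)), dict(zip(names, half_width))


# Toy data: three hypothetical levels with two observations each.
mean, effects, half_width = sum_contrast_effects(
    [1, 3, 4, 6, 8, 10], ["A", "A", "B", "B", "C", "C"]
)
```

A level whose interval (effect ± half-width) lies entirely above or below zero would be read as above or below the overall mean, which is the comparison the confidence interval graph makes visually.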
The receiver operating characteristic (ROC) curve from logistic regression was used to assess the accuracy of model prediction, with the area under the ROC curve (AUC) summarizing model performance. Linear regression models assume that errors are normally distributed, and a quantile–quantile (Q–Q) plot of studentized residuals was used to examine this assumption. Model results are displayed as confidence interval plots and thematic maps. The R program version 3.4.4 (R Core Team 2018) was used for all statistical analysis and graphical displays.
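For readers unfamiliar with the AUC, the short Python sketch below computes it directly from its rank interpretation: the probability that a randomly chosen record with a case receives a higher predicted probability than a randomly chosen record without one, with ties counted as one half. The function name and the toy scores are hypothetical; the study itself used R.

```python
import numpy as np


def auc_from_scores(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))


# Toy example: three of the four positive-negative pairs are ordered correctly.
print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of records with and without cases.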
Ethics approval
Ethical approval for the study was obtained from the Ethical Review Committee for Human Research, Prince of Songkla University, Pattani Campus, No. PSU.PN.1-006/63, March 31, 2020.
RESULTS
Trends and patterns of cholera incidence in West Africa: 2012–2017
Seasonal patterns of weekly cholera incidence in West Africa: 2012–2017. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wh.2023.241.
Statistical modeling of cholera incidence rate using Poisson and negative binomial
Residuals Quantile–Quantile plots for Poisson regression model (a) and negative binomial model (b).
Statistical modeling of cholera occurrence and incidence using logistic regression and log-linear regression
The previous section showed that both Poisson and negative binomial regression approaches gave poor fits to cholera incidence rates. This study proposes an alternative data analysis approach that replaces the incidence rate with a bivariate outcome variable comprising occurrence and incidence. The occurrence is set to 1 when cases occurred and to 0 otherwise. The incidence represents the severity of the disease and is calculated as the incidence rate among records in which cases occurred, i.e. only when occurrence is 1.
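The two-part idea described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' R code: the case counts, the population size, the single binary predictor `rainy`, and the scaling per 100,000 population are all hypothetical, and the logistic model is fitted with a plain Newton-Raphson loop rather than a statistics package.

```python
import numpy as np


def split_two_part(cases, population, per=100_000):
    """Return the 0/1 occurrence, the mask of positive records, and log incidence."""
    cases = np.asarray(cases, dtype=float)
    population = np.asarray(population, dtype=float)
    positive = cases > 0
    occurrence = positive.astype(int)
    # Incidence is only defined for records with at least one case.
    log_incidence = np.log(cases[positive] / population[positive] * per)
    return occurrence, positive, log_incidence


def fit_logistic(X, y, n_iter=25):
    """Logistic regression coefficients by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        weights = p * (1.0 - p)
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * weights[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


def fit_ols(X, y):
    """Least-squares coefficients for the log-linear part."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Hypothetical weekly records: four dry-season and four rainy-season weeks.
cases = [0, 3, 0, 0, 12, 40, 0, 7]
population = [1_000_000] * 8
rainy = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones(8), rainy])  # intercept + season indicator

occurrence, positive, log_incidence = split_two_part(cases, population)
beta_occurrence = fit_logistic(X, occurrence)       # part 1: all records
beta_incidence = fit_ols(X[positive], log_incidence)  # part 2: positive records only
```

The first model uses every record, whereas the second uses only the records with at least one case, so the zeros that troubled the Poisson and negative binomial fits never enter the log-linear part.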
Confidence intervals for cholera incidence (a) and Quantile–Quantile plot (b). Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wh.2023.241.
The graph shows that cholera rates were high in September and October, and comparatively low in 2015–2017. Incidence rates were higher in Guinea, Niger, and Congo (west), and lower in Cote d'Ivoire, Cameroon, the Democratic Republic of Congo, the Central African Republic, Togo, and Guinea Bissau.
Cholera occurrence and incidence map
Spatial distribution of cholera occurrence (a), incidence (b), and combination for occurrence-incidence (c).
DISCUSSION
Since the outcome of interest is the incidence rate of disease, most studies use Poisson and negative binomial models to assess the association between the factors and the incidence rate. However, this study found that those models did not fit the data well. Therefore, we suggest an alternative approach that divides the outcome into two variables, occurrence and incidence, which were modeled with logistic regression and log-linear regression, respectively. Based on the predictive accuracy and AUC for the logistic regression model and the Q–Q plot of residuals for the log-linear regression, both models fit the data quite well. These two models provide similar results for both seasonal patterns and time series trends. Thus, we conclude that separately fitting a logistic model for disease occurrence and a log-linear regression model for disease incidence (after excluding the non-cases or zero incidences) can be an effective alternative to traditional Poisson regression or negative binomial regression.
This study corroborates many earlier ones that have applied logistic regression to the associations between risk factors and cholera cases. The study by Nsenga (2020) showed that a logistic regression model fits data well in predicting cholera incidence: the model's overall classification accuracy was 78.0%, and the area under the ROC curve was 0.903. Similarly, Rajendran et al. (2007) stated that the multinomial logistic regression model was an effective approach for predicting the count of acute diarrheal patients infected with Vibrio cholerae, compared with discriminant function analysis and log-linear models. Moreover, the study by Musa & Olayemi (2020) also confirmed that logistic regression can identify the main risk factors having statistically significant associations with the responses of patients to a cholera treatment.
Based on the models, cholera rates were high in September and October, and comparatively low in 2015–2017. The Republic of Congo had the highest number of cases, whereas Nigeria, Cameroon, Niger, Mali, Benin, and Sierra Leone had cholera occurrence greater than the overall mean. Moreover, when considering both occurrence and incidence, the countries with low occurrence and high incidence were Ghana, Cameroon, and the Central African Republic (RCA). This finding is in line with several studies. According to Asadgol et al. (2020), climatic factors, particularly rainfall, temperature, sea surface temperature (SST), and El Nino Southern Oscillation (ENSO), have a significant impact on cholera incidence. According to research by Ngwa et al. (2016), all of the climatic subzones in Cameroon had substantial seasonal changes in disease patterns. In the northern Sudano-Sahelian subzone, most cases appear to have occurred in the wet season (July–September). The southern Equatorial Monsoon subzone recorded cases throughout the year, with the fewest during peak rainfall (July–September). Another aspect of the cholera epidemic was the lack of a steady supply of safe drinking water and of sanitary conditions. Similarly, Alkassoum et al. (2019) found that cholera incidence peaked during the rainy season. According to research by Manzo et al. (2017), the Niger Republic's case fatality rate increased from 2.0 to 7.8% between 2012 and 2015 as a result of fewer people having access to clean water and adequate sanitation. Most assessments also noted that cholera epidemics first emerged in low- or middle-income nations. The annual incidence rate for Nigeria was 3.1 cases per 100,000 people. The lack of information about cholera, food instability, climate change, urbanization, overcrowding, and inadequate water and sanitation systems may be related to this. This also includes the absence of effective waste management systems, high-quality housing, and suitable medical services (Ebob 2019; Charnley et al. 2021).
This study has some limitations: the available dataset does not provide information on other factors that could be associated with the number of cholera cases, such as individual characteristics, behavior, access to clean water, adequate sanitation, or level of contamination. However, the results still identify high-burden locations, where both occurrence and incidence are high. This may help decision-makers in charge of planning and taking preventive measures for both environmental and behavioral issues. Furthermore, the methods used in this analysis can be applied to other datasets.
CONCLUSIONS
This study used statistical modeling to examine the occurrence and incidence rate of cholera in West Africa from 2012 to 2017. The study demonstrates that the models fit the data well. When the models' results were displayed as a thematic map, two northern countries (Mali and Niger) showed both high occurrence and high incidence, whereas three coastal countries (Guinea Bissau, Cote d'Ivoire, and Togo) showed both low occurrence and low incidence. These findings may help health professionals and decision-makers develop prevention programs for those areas.
ACKNOWLEDGEMENTS
This research received financial support from Thailand's Education Hub for ASEAN Countries (THE-AC), Prince of Songkla University. We are grateful to Emeritus Professor Don McNeil and Associate Professor Seppo Karrila for their guidance and to the referees for their helpful comments.
DATA AVAILABILITY STATEMENT
All relevant data are included in the paper or its Supplementary Information.
CONFLICT OF INTEREST
The authors declare there is no conflict.