Cholera is a serious disease that affects a large number of people, especially in developing countries, and is particularly prevalent in Africa and southern Asia. This study aimed to determine cholera incidence trends and patterns in West Africa and to develop a statistical model for cholera incidence. The outcomes were occurrence, coded 1 if a case occurred and 0 otherwise, and incidence rate. Logistic regression was used to model occurrence, while log-linear regression was used to model incidence after excluding the records with zero cases. The trend in cholera incidence rate was approximately constant for the Democratic Republic of the Congo, whereas rates varied substantially throughout the study period in the other countries. Confidence interval plots show that cholera incidence was higher in September and October, lower in 2015–2017, higher in Guinea, Niger, and Congo (west), and lower in Côte d'Ivoire, Cameroon, the Democratic Republic of the Congo, the Central African Republic, Togo, and Guinea Bissau. Both models fit the data quite well. As a result, the method used in this study may be considered an alternative to the traditional Poisson and negative binomial regression models.

  • The analysis of disease cases can be separated into two steps: occurrence and incidence.

  • Logistic regression can be used to predict the occurrence of disease.

  • A log-linear regression model provides a good fit to the incidence rate when the data are strongly right-skewed.

  • Mali and Niger have both high occurrence and high incidence of cholera.

Graphical Abstract


Cholera is an acute diarrheal infection caused by ingesting food or water contaminated with the bacterium Vibrio cholerae. It remains a global threat to public health where social development is lacking. Researchers have estimated that there are 1.3–4.0 million cholera cases and 21,000–143,000 deaths from the disease worldwide each year (Mengel et al. 2014). The burden of cholera is large, particularly in developing countries, and is highest in Africa and southern Asia (WHO 2017).

Many studies have modeled various aspects of cholera, including transmission, the impacts of interventions, and predictions of the number of cholera cases from mathematical and statistical models. Chao et al. (2014) described a variety of approaches to modeling outbreaks of both epidemic and endemic cholera. However, the modeler has to make an appropriate choice of the initial conditions, other parametrization, and possibly the model structure. Ezeagu et al. (2019) presented a mathematical model for the dynamic transmission of cholera and showed that effective quarantine, vaccination, and proper sanitation can reduce the disease contact rates and help eliminate the spread of cholera. Koepke et al. (2016) developed a predictive model of cholera outbreaks in Bangladesh based on environmental factors. Their method simultaneously accounts for disease dynamics and environmental variables in a susceptible-infected-recovered-susceptible (SIRS) model. They also applied a Bayesian framework and Markov chain Monte Carlo methods to sample from the posterior distribution. They found that the model can successfully predict an increase in the number of infected individuals in the population weeks before the observed number of cholera cases increases.

Instead of attempting to identify the role of climate through simple correlations with disease incidence, some studies have applied mathematical models and non-linear statistical time series analysis to identify the effects of climate variability on cholera. The models indicate that climate plays a pivotal role in modulating the sizes of outbreaks (Koelle 2009). Daisy et al. (2020) used a seasonal autoregressive integrated moving average (SARIMA) model for time-series analysis. Both a single-variable model (SVM) and a multi-variable SARIMA model (MVM) were developed, compared, and tested to evaluate the relationships of environmental factors to cholera incidence. The MVM (AIC = 15, BIC = 36) performed better than the SVM (AIC = 21, BIC = 39), as indicated by its lower AIC and BIC values.

Cholera has been studied from a variety of perspectives, including the epidemiology of the disease and the construction of models to determine the relationships of candidate predictive factors with cholera incidence. However, only a few studies have provided a simple but effective approach. Therefore, the aim of this study is to propose an alternative statistical modeling approach for disease incidence that combines traditional logistic regression with simple linear regression on log-transformed incidence rates.

Data source

The data used in this study were obtained from the Humanitarian Data Exchange (HDX), supported by US Census Bureau (2019) world population data, and are available at https://centre.humdata.org. These data provide the country code, country name, and number of affected cases by week and year. The dataset covers 6 years from 2012 to 2017 and includes data from 17 West African countries: Nigeria, Guinea, Guinea Bissau, Burkina Faso, Mali, Côte d'Ivoire, Chad, Niger, Togo, Ghana, Benin, the Central African Republic, Cameroon, Congo, the Democratic Republic of the Congo, Sierra Leone, and Liberia, as shown in Figure 1.
Figure 1. Geographical location of the study area.

Data management

The candidate predictors in this analysis were week, year, and country. There are 5,304 combinations of these factor levels (52 weeks × 6 years × 17 countries), but 20 are missing because Liberia had no data for weeks 1–9 in 2015, Sierra Leone had no data for weeks 49–52 in 2012, and Togo had no data for weeks 48–52 in 2012. These 20 missing values were omitted from the analysis.

Statistical analysis

In general, when dealing with count data, researchers use the Poisson model to determine the relationship between the independent and dependent variables. When the Poisson model fails to fit the data because of excess variation, the negative binomial model is used with an over-dispersion parameter θ, where smaller values of θ correspond to greater dispersion (Venables & Ripley 2002). When the data contain many zeros, fitting the negative binomial model with a very small value of θ produces a poor model. As a result, researchers frequently employ zero-inflated or hurdle models.
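One quick way to see why excess zeros strain a Poisson model is to compare the share of zero counts the model would expect, exp(−mean), with the share actually observed. The sketch below is illustrative only: the function names and the weekly counts are hypothetical and are not taken from the study data.

```python
import math

def poisson_zero_fraction(mean_count):
    """Expected share of zero counts under a Poisson model with this mean."""
    return math.exp(-mean_count)

def observed_zero_fraction(counts):
    """Observed share of records with zero cases."""
    return sum(1 for c in counts if c == 0) / len(counts)

# Hypothetical weekly case counts for one country (not from the study data).
weekly_cases = [0, 0, 0, 0, 0, 0, 12, 0, 0, 45, 0, 0, 3, 0, 0, 0]

mean_cases = sum(weekly_cases) / len(weekly_cases)
expected_zeros = poisson_zero_fraction(mean_cases)
observed_zeros = observed_zero_fraction(weekly_cases)

print(f"mean cases per week: {mean_cases:.2f}")
print(f"zero share expected under Poisson: {expected_zeros:.4f}")
print(f"zero share observed: {observed_zeros:.4f}")
```

When the observed share of zeros is far above the Poisson expectation, as in this made-up series, a single count model is unlikely to describe both the zeros and the large outbreaks.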

This study proposes a different approach to handling zero-inflated data by dividing the outcome into two variables. The first outcome is occurrence, a binary variable coded 1 if at least one cholera case occurred and 0 otherwise. The occurrence model was fitted using logistic regression with week, year, and country as predictors. The model can be written as:
$$\ln\left(\frac{p_{ijk}}{1 - p_{ijk}}\right) = \mu + \alpha_i + \beta_j + \gamma_k$$
where $p_{ijk}$ represents the outcome probability for the combination of week $i$, year $j$, and country $k$, $\mu$ is a constant, and the terms $\alpha_i$, $\beta_j$, and $\gamma_k$ are the effects of each predictor. The outcome probability can be expressed as follows:
$$p_{ijk} = \frac{1}{1 + \exp\left(-(\mu + \alpha_i + \beta_j + \gamma_k)\right)}$$
The second outcome variable is the incidence rate, calculated as the number of cases divided by the population, given that at least one case occurred, and expressed per 100,000 population as in Figure 2. The following formula can be applied:
$$r_{ijk} = \frac{n_{ijk}}{P_k} \times 100{,}000, \quad n_{ijk} > 0$$
To model the incidence rate, we use a log-linear regression model so that the transformed data are approximately normally distributed and meet the assumptions of a linear model. If $P_k$ represents the population in country $k$, and $n_{ijk}$ is the corresponding number of cholera cases in country $k$, week $i$, and year $j$, the log-linear regression model can be written as:
$$\ln\left(\frac{n_{ijk}}{P_k} \times 100{,}000\right) = \mu + \alpha_i + \beta_j + \gamma_k + \varepsilon_{ijk}$$
where the constant and effects are estimated separately from those of the occurrence model and $\varepsilon_{ijk}$ is a normally distributed error term.
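The two outcomes described above can be derived from each record with a few lines of code. This is a minimal sketch, not the authors' R code: the function names, the per-100,000 scaling (taken from the units used in Figure 2), and the example records are assumptions made for illustration.

```python
import math

def two_part_outcomes(cases, population, per=100_000):
    """Return (occurrence, log_rate) for one country-week record.

    occurrence is 1 if any case was reported, else 0.
    log_rate is the natural log of cases per `per` population,
    or None when there are no cases (such records are excluded
    from the incidence model).
    """
    occurrence = 1 if cases > 0 else 0
    if occurrence == 0:
        return occurrence, None
    rate = cases / population * per
    return occurrence, math.log(rate)

def inverse_logit(eta):
    """Convert a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical records as (cases, population); values are illustrative only.
records = [(0, 5_000_000), (25, 5_000_000), (200, 20_000_000)]
outcomes = [two_part_outcomes(c, p) for c, p in records]

for (cases, population), (occurrence, log_rate) in zip(records, outcomes):
    print(cases, population, occurrence, log_rate)
```

The first record contributes only to the occurrence model, while the other two contribute to both models.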

Confidence interval plots were used to visualize the model results, with sum contrasts applied (Tongkumchum & McNeil 2009). This method computes an estimate and 95% confidence interval of the occurrence and incidence rate for each predictor level in the model, so that each interval can be compared with the overall mean or another reference value. Thematic maps were used to classify countries based on whether their cholera occurrence is above or below the overall mean, whereas thematic maps for cholera incidence classify countries relative to the overall median.
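For readers unfamiliar with sum contrasts, the sketch below shows the standard sum-to-zero coding of a factor, in which level effects are measured as deviations from an overall mean rather than from a baseline level. It does not reproduce the confidence interval procedure of Tongkumchum & McNeil (2009); the function names and the coefficient values are hypothetical.

```python
def sum_contrast_matrix(n_levels):
    """Sum-to-zero coding: n_levels rows, n_levels - 1 columns.

    Level i (for i < n_levels - 1) gets 1 in column i; the last level
    gets -1 in every column, so the level effects sum to zero.
    """
    if n_levels < 2:
        raise ValueError("need at least two levels")
    rows = []
    for level in range(n_levels):
        if level < n_levels - 1:
            rows.append([1 if col == level else 0 for col in range(n_levels - 1)])
        else:
            rows.append([-1] * (n_levels - 1))
    return rows

def level_effects(coefficients):
    """Recover all level effects from the n_levels - 1 fitted coefficients."""
    return list(coefficients) + [-sum(coefficients)]

year_coding = sum_contrast_matrix(6)  # six study years, 2012-2017
effects = level_effects([0.4, 0.1, -0.2, 0.3, -0.5])  # hypothetical estimates

for row in year_coding:
    print(row)
print(effects)
```

Because the effects are constrained to sum to zero, each one can be read directly as a departure from the overall mean on the model scale.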

The receiver operating characteristic (ROC) curve from the logistic regression was used to assess the accuracy of model prediction, with the area under the ROC curve (AUC) summarizing model performance. Linear regression models assume that errors are normally distributed, and a quantile–quantile (Q–Q) plot of studentized residuals is a standard way to examine this assumption. Model results are displayed as confidence interval plots and thematic maps. R version 3.4.4 (R Core Team 2018) was used for all statistical analyses and graphical displays.
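As an illustration of the two summary measures, the AUC can be computed as the proportion of positive-negative pairs in which the positive record receives the higher fitted probability, and predictive accuracy as the share of records classified correctly at a cut-off. The sketch below assumes a cut-off of 0.5, which the paper does not state, and uses hypothetical outcomes and fitted probabilities.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need both outcome classes")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def accuracy(labels, scores, threshold=0.5):
    """Share of records whose thresholded prediction matches the outcome."""
    hits = sum(1 for y, s in zip(labels, scores) if (s >= threshold) == (y == 1))
    return hits / len(labels)

# Hypothetical occurrence outcomes and fitted probabilities.
y = [0, 0, 1, 1, 0, 1]
p_hat = [0.2, 0.6, 0.7, 0.4, 0.3, 0.9]

auc_value = roc_auc(y, p_hat)
acc_value = accuracy(y, p_hat)
print(f"AUC = {auc_value:.3f}, accuracy = {acc_value:.3f}")
```

The pairwise definition is quadratic in the number of records, which is acceptable for a dataset of a few thousand rows but would be replaced by a rank-based formula for larger data.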

Ethics approval

Ethical approval for the study was obtained from the Ethical Review Committee for Human Research, Prince of Songkla University, Pattani Campus, No. PSU.PN.1-006/63, March 31, 2020.

Trends and patterns of cholera incidence in West Africa: 2012–2017

For the 17 West African countries considered, Figure 2 shows rates of cholera cases per 100,000 population in successive weeks for the years 2012–2017 inclusive. The trend is approximately constant in the Democratic Republic of Congo, whereas the rates vary substantially over this period in the other countries.
Figure 2. Trends in weekly cholera incidences in West Africa: 2012–2017.

Figure 3 shows the seasonal patterns in the cholera rates. Cholera rates tend to be higher after July, when monsoon rains arrive, with more pronounced peaks in Sierra Leone, Guinea Bissau, Guinea, and Ghana. The results from factor analysis show two slightly different patterns, shown in the bottom right panel. The red line represents countries whose cases mostly peak in August, whereas the blue line represents countries whose cases peak in or around October.
Figure 3. Seasonal patterns of weekly cholera incidence in West Africa: 2012–2017. Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wh.2023.241.

Statistical modeling of cholera incidence rate using Poisson and negative binomial

Since the incidence rate is defined as the number of new cases of a disease within a time period, we first fitted a Poisson regression model to these data. This model fits very poorly, as the Q–Q plot of residuals in Figure 4(a) makes clear. Therefore, negative binomial regression with a dispersion parameter equal to 0.26 was applied to the data. While this reduced the error, the fit remained poor, as seen in Figure 4(b).
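To give a sense of how much extra variation a dispersion parameter of 0.26 implies, the sketch below compares the Poisson variance with the negative binomial variance μ + μ²/θ. It assumes that the reported 0.26 is θ in this parameterization (the one used by Venables & Ripley 2002); the function names and the chosen means are illustrative.

```python
def poisson_variance(mu):
    """Variance of a Poisson count with mean mu."""
    return mu

def negative_binomial_variance(mu, theta):
    """Variance mu + mu^2 / theta; smaller theta means more dispersion."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return mu + mu * mu / theta

theta_reported = 0.26  # assumed to be theta in the mu + mu^2/theta form

variance_ratios = {}
for mu in (1.0, 10.0, 100.0):
    ratio = negative_binomial_variance(mu, theta_reported) / poisson_variance(mu)
    variance_ratios[mu] = ratio
    print(f"mean {mu:6.1f}: NB variance is {ratio:.1f} times the Poisson variance")
```

Under this assumption the variance ratio equals 1 + μ/θ, so it grows quickly with the mean count.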
Figure 4. Residuals Quantile–Quantile plots for Poisson regression model (a) and negative binomial model (b).

Statistical modeling of cholera occurrence and incidence using logistic regression and log-linear regression

The previous section showed that both Poisson and negative binomial regression gave poor fits to cholera incidence rates. This study proposes an alternative approach that replaces the incidence rate with a two-part outcome comprising occurrence and incidence. Occurrence is set to 1 when cases occurred and to 0 otherwise. Incidence represents the severity of the disease and is calculated as the incidence rate only for records in which cases occurred, i.e. only when occurrence is 1.

The logistic regression model estimates the association between cholera occurrence and each level of each risk factor. However, the confidence intervals for weeks are relatively wide compared with those for year or country. Therefore, we use 4-week periods of the year instead of the 52 individual weeks to estimate the seasonal pattern, as shown in Figure 5.
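Grouping weeks into 4-week periods can be sketched as follows. The paper does not say exactly how the weeks were grouped, so this example assumes consecutive blocks (weeks 1–4, 5–8, and so on), which gives 13 periods; the function name is hypothetical.

```python
def four_week_period(week):
    """Map a week number 1..52 to a consecutive 4-week period 1..13."""
    if not 1 <= week <= 52:
        raise ValueError("week must be between 1 and 52")
    return (week - 1) // 4 + 1

periods = [four_week_period(w) for w in range(1, 53)]
print(periods)
```

Pooling four weeks into one level gives each seasonal level roughly four times as many records, which is what narrows the confidence intervals.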
Figure 5. Confidence intervals of cholera occurrence.

The graph demonstrates that, in each year, the period from September to November had statistically more occurrences of cholera than the overall mean. Across years, occurrence tended to decline from 2012 to 2017. The Republic of Congo had the highest number of cases, while Nigeria, Cameroon, Niger, Mali, Benin, and Sierra Leone had cholera occurrence greater than the overall mean. The model was evaluated using a ROC curve, which indicated that it fits the data quite well, with an AUC of 66.93% and a predictive accuracy of 73.91%, as shown in Figure 6.
Figure 6. ROC curve from the logistic regression.

For cholera incidence, a Q–Q plot of residuals shows that the log-linear model also fits the data well (Figure 7(b)). Figure 7(a) displays the relationship between each level of each predictor (4-week period of the year, year, and country) and incidence. The green dots denote crude incidence rates, computed directly from the data for each predictor level. The black dots denote incidence rates computed from the model. The two sets of rates differ because of log-transformation bias: back-transforming a mean of log rates gives a geometric rather than an arithmetic mean.
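The gap between crude and model-based rates can be illustrated with a small numerical example. The sketch below compares the arithmetic mean of some right-skewed rates with the back-transformed mean of their logarithms (the geometric mean). The rates are hypothetical, and the crude rates in the paper may have been computed differently, for example from pooled cases and populations.

```python
import math

def arithmetic_mean(values):
    """Ordinary mean on the original scale."""
    return sum(values) / len(values)

def geometric_mean(values):
    """Back-transformed mean of logs; requires strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("values must be strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical right-skewed incidence rates per 100,000 (non-zero records only).
rates = [0.1, 0.2, 0.5, 1.0, 20.0]

crude = arithmetic_mean(rates)
back_transformed = geometric_mean(rates)
print(f"arithmetic mean: {crude:.3f}")
print(f"back-transformed mean of logs: {back_transformed:.3f}")
```

With strongly right-skewed data the back-transformed value sits well below the arithmetic mean, so a systematic offset between the two sets of dots is expected rather than a sign of poor fit.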
Figure 7. Confidence intervals for cholera incidence (a) and Quantile–Quantile plot (b). Please refer to the online version of this paper to see this figure in colour: http://dx.doi.org/10.2166/wh.2023.241.

The graph shows that cholera rates were high in September and October, and comparatively low in 2015–2017. Incidence rates were higher in Guinea, Niger, and Congo (west), and lower in Côte d'Ivoire, Cameroon, the Democratic Republic of the Congo, the Central African Republic, Togo, and Guinea Bissau.

Cholera occurrence and incidence map

Confidence interval plots can be used to divide the countries into three groups, depending on whether each interval lies completely above, around, or completely below a specified level. Thematic maps show that occurrence and incidence have different patterns. For example, the Democratic Republic of the Congo shows high occurrence and low incidence, whereas the reverse is true for Ghana, Cameroon, and the Central African Republic, as shown in Figure 8. These maps indicate that some areas may need more attention in preventing the disease.
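The three-group rule described here can be written as a small function. This is a minimal sketch under stated assumptions: the function name, the country labels A to C, the intervals, and the reference levels are all hypothetical and do not come from the study results.

```python
def classify_interval(lower, upper, reference):
    """Place a confidence interval relative to a reference level."""
    if lower > upper:
        raise ValueError("lower bound exceeds upper bound")
    if lower > reference:
        return "above"
    if upper < reference:
        return "below"
    return "around"

# Hypothetical intervals; country labels and all numbers are illustrative.
occurrence_ci = {"A": (0.55, 0.70), "B": (0.20, 0.35), "C": (0.35, 0.50)}
incidence_ci = {"A": (0.2, 0.6), "B": (1.5, 3.0), "C": (0.8, 1.4)}
occurrence_mean = 0.40   # assumed overall mean occurrence
incidence_median = 1.0   # assumed overall median incidence per 100,000

groups = {
    country: (
        classify_interval(*occurrence_ci[country], occurrence_mean),
        classify_interval(*incidence_ci[country], incidence_median),
    )
    for country in occurrence_ci
}

for country, (occ_group, inc_group) in groups.items():
    print(f"{country}: occurrence {occ_group}, incidence {inc_group}")
```

Pairing the two labels per country gives the combined occurrence-incidence categories that a map such as Figure 8(c) would display.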
Figure 8. Spatial distribution of cholera occurrence (a), incidence (b), and combination for occurrence-incidence (c).

Since the outcome of interest is the incidence rate of disease, most studies use Poisson and negative binomial models to assess the association between risk factors and incidence rate. In this study, however, those models did not fit the data well. We therefore suggest an alternative approach that divides the outcome into two variables, with logistic regression used to model occurrence and log-linear regression used to model incidence. Based on the predictive accuracy and AUC for the logistic regression model and the Q–Q plot of residuals for the log-linear regression, both models fit the data quite well. The two models give similar results for both seasonal patterns and time trends. Thus, we conclude that separately fitting a logistic model for disease occurrence and a log-linear regression model for disease incidence (after excluding the non-cases or zero incidences) can be an effective alternative to traditional Poisson or negative binomial regression. This study corroborates many earlier ones that have applied logistic regression to the associations between risk factors and cholera cases. Nsenga (2020) showed that a logistic regression model fits data well in predicting cholera incidence; the model's overall classification accuracy was 78.0%, and the area under the ROC curve was 0.903. Similarly, Rajendran et al. (2007) stated that the multinomial logistic regression model was an effective approach for predicting the count of acute diarrheal patients infected with Vibrio cholerae, compared with discriminant function analysis and log-linear models. Moreover, Musa & Olayemi (2020) confirmed that logistic regression can identify the main risk factors having statistically significant associations with the responses of patients to a cholera treatment.

Based on the models, cholera rates were high in September and October, and comparatively low in 2015–2017. The Republic of Congo had the highest number of cases, while Nigeria, Cameroon, Niger, Mali, Benin, and Sierra Leone had cholera occurrence greater than the overall mean. Moreover, when both occurrence and incidence are considered, the countries with low occurrence and high incidence are Ghana, Cameroon, and the Central African Republic. These findings are in line with several studies. According to Asadgol et al. (2020), climatic factors, particularly rainfall, temperature, sea surface temperature (SST), and the El Niño Southern Oscillation (ENSO), have a significant impact on cholera incidence. Ngwa et al. (2016) found that all of the climatic subzones in Cameroon had substantial seasonal changes in disease patterns. In the northern Sudano-Sahelian subzone, most cases occurred during the wet season (July–September). The southern Equatorial Monsoon subzone recorded cases throughout the year, with the fewest during peak rainfall (July–September). Another factor in the cholera epidemics was the lack of a steady supply of safe drinking water and of sanitary conditions. Similarly, Alkassoum et al. (2019) reported that cholera incidence peaked during the rainy season. According to Manzo et al. (2017), the case fatality rate in the Niger Republic increased from 2.0 to 7.8% between 2012 and 2015 as a result of fewer people having access to clean water and adequate sanitation. The majority of assessments also stated that cholera epidemics first emerged in low- or middle-income nations. The annual incidence rate for Nigeria was 3.1 cases per 100,000 people. This may be related to the lack of information about cholera, food instability, climate change, urbanization, overcrowding, and inadequate water and sanitation systems, as well as the absence of effective waste management systems, high-quality housing, and suitable medical services (Ebob 2019; Charnley et al. 2021).

This study has some limitations because the available dataset does not provide information on other factors that could be associated with the number of cholera cases, such as individual characteristics, behavior, access to clean water, adequate sanitation, or level of contamination. However, the results still identify high-burden locations, where both occurrence and incidence are high. This may help decision-makers in charge of planning and taking preventative measures for both environmental and behavioral issues. Furthermore, the methods used in this analysis can be applied to other datasets.

This study used statistical modeling to examine the occurrence and incidence rate of cholera in West Africa from 2012 to 2017. The study demonstrates that the models fit the data well. When the model results were displayed as thematic maps, two northern countries (Mali and Niger) were found to have both high occurrence and high incidence, whereas three coastal countries (Guinea Bissau, Côte d'Ivoire, and Togo) have both low occurrence and low incidence. These findings may help health professionals or decision-makers develop prevention programs for those areas.

This research received financial support from Thailand's Education Hub for ASEAN Countries (THE-AC), Prince of Songkla University. We are grateful to Emeritus Professor Don McNeil and Associate Professor Seppo Karrila for their guidance and to the referees for their helpful comments.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Alkassoum S. I., Djibo I., Amadou H., Bohari A., Issoufou H., Aka J. & Mamadou S. 2019 The global burden of cholera outbreaks in Niger: an analysis of the national surveillance data, 2003–2015. Transactions of The Royal Society of Tropical Medicine and Hygiene 113 (5), 273–280. https://doi.org/10.1093/trstmh/try145.
Asadgol Z., Badirzadeh A., Niazi S., Mokhayeri Y., Kermani M., Mohammadi H. & Gholami M. 2020 How climate change can affect cholera incidence and prevalence? A systematic review. Environmental Science and Pollution Research 27, 34906–34926. https://doi.org/10.1007/s11356-020-09992-7.
Chao D. L., Longini I. M. & Morris J. G. 2014 Modeling cholera outbreaks. Current Topics in Microbiology and Immunology 379, 195–209.
Charnley G. E. C., Kelman I., Green N., Hinsley W., Gaythorpe K. A. M. & Murray K. A. 2021 Exploring relationships between drought and epidemic cholera in Africa using generalised linear models. BMC Infectious Diseases 21, 1177. https://doi.org/10.1186/s12879-021-06856-4.
Daisy S. S., Saiful Islam A. K. M., Akanda A. S., Faruque A. S. G., Amin N. & Jensen P. K. M. 2020 Developing a forecasting model for cholera incidence in Dhaka megacity through time series climate data. Journal of Water and Health 18 (2), 207–223.
Ebob T. J. 2019 An overview of cholera epidemiology: a focus on Africa; with a keen interest on Nigeria. International Journal of Tropical Disease & Health 40 (3), 1–17. doi: 10.9734/IJTDH/2019/v40i330229.
Ezeagu N. J., Togbenon H. A. & Moyo E. 2019 Modeling and analysis of cholera dynamics with vaccination. American Journal of Applied Mathematics and Statistics 7 (1), 1–8. doi: 10.12691/ajams-7-1-1.
Koelle K. 2009 The impact of climate on the disease dynamics of cholera. Clinical Microbiology and Infection 15 (Suppl 1), 29–31.
Koepke A. A., Longini I. M. Jr., Halloran M. E., Wakefield J. & Minin V. N. 2016 Predictive modeling of cholera outbreak in Bangladesh. Annals of Applied Statistics 10 (2), 575–595. doi: 10.1214/16-AOAS908.
Manzo L. M., Moumouni A., Issa I., Amadou A., Zanguina J., Ibrahim D. D., Mainassara H. B. & Ousmane S. 2017 Cholera in Niger republic: an analysis of national surveillance data, 1991–2015. International Journal of Infection 4 (3), e15591. doi: 10.5812/iji.15591.
Mengel M. A., Delrieu I., Heyerdahl L. & Gessner B. D. 2014 Cholera outbreaks in Africa. Current Topics in Microbiology and Immunology 379, 117–144. doi:10.1007/82_2014_369.
Musa B. T. & Olayemi O. S. 2020 Application of logistic regression models for the evaluation of cholera outbreak in Adamawa state Nigeria. International Journal of Mathematics and Statistics Studies 8 (1), 32–54.
Ngwa M. C., Liang S., Kracalik I. T., Morris L., Blackburn J. K., Mbam L. M., Pouth S. F. B. B., Teboh A., Yang Y., Arabi M., Sugimoto M. A. & Morris J. G. 2016 Cholera in Cameroon, 2000–2012: spatial and temporal analysis at the operational (health district) and sub climate levels. PLoS Neglected Tropical Diseases 10 (11), e0005105. https://doi.org/10.1371/journal.pntd.0005105.
Nsenga N. 2020 Population-Level Determinants of Cholera Incidence in African Countries. Doctoral Thesis, Walden University.
Rajendran K., Ramamurthy T. & Sur D. 2007 Multinomial logistic regression model for inferential risk age groups for infection caused by Vibrio cholera in Kolkata, India. Journal of Modern Applied Statistical Methods 6 (1), 324–330.
R Core Team 2018 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. Available from: http://www.R-project.org/ (accessed 9 May 2018).
Tongkumchum P. & McNeil D. 2009 Confidence intervals using contrasts for regression model. Songklanakarin Journal of Science and Technology 31, 151–156.
US Census Bureau 2019 World Population, 2012–2018. Available from: https://www.census.gov./ (accessed 26 June 2019).
Venables W. N. & Ripley B. D. 2002 Modern Applied Statistics with S. Springer, New York.
World Health Organization 2017 Cholera vaccines. WHO positions paper. Weekly Epidemiological Record, 76, 117–240.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY-NC-ND 4.0), which permits copying and redistribution for non-commercial purposes with no derivatives, provided the original work is properly cited (http://creativecommons.org/licenses/by-nc-nd/4.0/).