Abstract
This study has utilised National Oceanic and Atmospheric Administration (NOAA) NCEP/NCAR Reanalysis 1 project meteorological data and the HYSPLIT model to extract the air parcel trajectories for selected historical extreme rainfall events in South Africa. The k-means unsupervised machine learning algorithm has been used to cluster the resulting trajectories, and from this, the spatial origin of moisture for each of the rainfall events has been determined. It has been demonstrated that rainfall events on the east coast with moisture originating from the Indian Ocean have distinctly larger average maximum daily rainfall magnitudes (279 mm) compared to those that occur on the west coast with Atlantic Ocean influences (149 mm) and those events occurring in the central plateau (150 mm) where moisture has been continentally recirculated. Further, this study has suggested new metrics by which the HYSPLIT trajectories may be assessed and demonstrated the applicability of trajectory clustering in a region not previously studied. This insight may in future facilitate improved early warning systems based on monitoring of atmospheric systems, and an understanding of rainfall magnitudes and origins can be used to improve the prediction of design floods for infrastructure design.
HIGHLIGHTS
Clustering extreme rainfall events based on origins and storm track.
Observed differences in magnitude between event clusters.
Depth of rainfall events (mm/day) from weather systems originating in the Indian Ocean is larger than other events.
INTRODUCTION
Extreme rainfall events, and the resulting flooding, are a cause of particular concern to countries across Africa where increasing flood-induced economic impacts are predicted to be driven by climate change (di Baldassarre et al. 2010; Winsemius et al. 2016). For example, Aon (2020) reported that 39 of the 45 Global Disasters recorded in Africa in 2020 were attributed to flooding, and in the past decade, flooding has surpassed drought as the natural disaster affecting the greatest number of people on the African continent (Lumbroso 2020). Aside from the direct damage to human life and infrastructure, the impact of flooding on public health can be considerable. As well as disrupting medical services, flooding can cause a drastic increase in vector-borne diseases (Ahern et al. 2005), and natural disasters resulting from extreme rainfall can exacerbate existing vulnerabilities in underserved, informally established, flood-prone peri-urban communities (Khandlhela & May 2006).
As with many countries across the continent, deadly floods due to heavy rainfall have occurred across South Africa in recent years (le Maitre et al. 2019). In South Africa, the rate of urbanization has steadily increased (World Bank 2021), resulting in amplified pressures on existing services and infrastructure, negatively affecting infrastructure resilience while potentially exacerbating the impacts of any flood event; for example, as a result of the 77 flood events listed between 1980 and 2010, over 1,000 people were thought to have died (Zuma et al. 2012). Recent floods caused by heavy rainfall in 2011 alone killed more than 40 people and caused $51 million in damages nationwide (Mabuse 2021), while the 2019 Durban floods are thought to have killed 70 people and caused $45 million in damages (UNOOSA 2019).
South Africa is a semi-arid country, which experiences an uneven spatial distribution of rainfall, and there is a notable range in the total annual rainfall amount and the seasonal distribution of rainfall (Roffe et al. 2019), from approximately 250 mm in the west, at approximately 20° longitude, to over 1,000 mm in the east, at approximately 30° longitude (International Food Policy Research Institute (IFPRI) 2014). However, recent research carried out by the South African Weather Service (SAWS 2019) found that precipitation from the most intense rainfall events is increasing across the nation, while there was a decreasing trend in annual rainfall across most regions and an increase in annual rainfall in the southern interior. This highly variable rainfall is due in part to South Africa's location, being influenced by both the South Atlantic and the cold Benguela current on the west coast (Hahn et al. 2017), as well as the South Indian Ocean and the warm Agulhas current on the east coast (Jury 2015). In addition, the latitude of the southern tip of South Africa exposes the South Western Cape area to mid-latitude cyclones causing winter rainfall events to dominate in this region, in comparison to the majority of South Africa, which predominantly receives summer rainfall (Odoulami et al. 2020).
Published research has identified that the frequency and magnitude of extreme rainfall events have increased in some parts of South Africa (Ziervogel et al. 2014), and future rainfall conditions are projected to further intensify and become more extreme for many regions in the country (du Plessis & Burger 2015; de Waal et al. 2017). Increased frequency and magnitude of extreme events amplify potential flood risks. Therefore, understanding the climate drivers and origins of extreme rainfalls is becoming increasingly important to researchers and practitioners in updating design strategies for hydraulic infrastructure in South Africa (Schulze & Schütte 2019). This is critical for water resources planning and design of future infrastructure, maintaining and upgrading existing hydraulic structures, and mitigating risks to current urban stormwater drainage systems in a rapidly developing country.
Recognising the importance of developing an understanding of extreme rainfall events in South Africa, this study focuses on investigating the distinct regions of moisture origin for extreme rainfall in South Africa. This has been achieved by utilising the HYSPLIT (Hybrid Single Particle Lagrangian Integrated Trajectory) model (Stein et al. 2015) to extract the storm trajectories, defined here as the pathways that moisture followed through the atmosphere, for a series of historical maximum magnitude rainfall events recorded at weather monitoring stations across South Africa in the years 1950–2010. These trajectories have then been clustered using unsupervised machine learning techniques to determine the regions of moisture origin and the differences in event rainfall magnitude that they cause, in different regions of South Africa. The HYSPLIT model has been used extensively in other studies as a tool for tracing atmospheric moisture, for example, in studies of atmospheric chemistry origins (Xia et al. 2020; Ma et al. 2021), simulation of dust storms (McGowan & Clark 2008; Ashrafi et al. 2014), and volcanic ash (Hurst & Davis 2017). Notably, several researchers have used the HYSPLIT system when defining weather patterns and their synoptic conditions, focusing primarily on the classification of extreme events (Santos et al. 2018; Barnes et al. 2020; Karozis et al. 2021).
Described in this article are the data used and the techniques used to validate this data, followed by the methodology applied and the results of the clustering. These results are discussed, and conclusions are presented.
DATA
Each event is characterised by the longitude and latitude positions of the monitoring station, the maximum recorded rainfall (mm), the date of occurrence, and the first and last year of continuous records that exist and have been considered for each station. The data were made available from a study updating the probable maximum precipitation (PMP – the theoretical upper limit for rainfall used for engineering design purposes) values for South Africa (Johnson & Smithers 2020), which considered 1,629 rainfall monitoring stations with at least 40 years of record available and selected a spatially representative sample of extreme rainfall events that consisted of continuously recorded data. PMP events are typically required for the engineering design of critical infrastructure where failure is catastrophic, such as large reservoirs and nuclear installations (Wang et al. 2019). The database of events used for this study is therefore assessed to be suitable for the study of extreme rainfall events.
In addition to the date, longitude and latitude coordinates required as inputs to HYSPLIT, the model also requires altitude, time of day, and length of extraction values for each trajectory calculation. For each of the 376 events, the trajectories were extracted for four times evenly spaced throughout the day (00:00, 08:00, 16:00, and 24:00), and these trajectory calculations were initiated at six altitudes (10, 410, 810, 1,210, 1,610, and 2,010 m), which correspond to the altitude range in which moisture is expected to be found in the atmosphere (Wallace & Hobbs 2006). Thus, each rainfall day is covered by 24 trajectories. A length of 5 days was initially selected for the duration of trajectory back-calculation to ensure that sufficient data would be gathered for the k-means clustering process. The output of a HYSPLIT analysis can be simplified to an (n × 2) matrix where each row contains the latitude and longitude positions of the air parcel at that interval and the number of rows n is the number of hours for which the back-calculation has been initialised – 5-day trajectories can therefore be shortened during analysis.
The 4 times and 6 altitudes selected yielded 24 trajectories representing each of the 376 individual events resulting in a total of 24 × 376 = 9,024 trajectories. Due to a limitation of the HYSPLIT model, events that occurred on the last day of the month were not able to generate the trajectory specified for 24:00 at each of the six altitudes (HYSPLIT is unable to ‘roll over’ to the next day when the last day of a month is selected as the day of initiation as it requires accessing two different meteorological data files during the same calculation). This has resulted in six trajectories for each of the 15 events that occurred on the final day of the month not being generated, and therefore, a total of 8,934 trajectories being generated. As this represents less than 1% of the total trajectories being lost and the affected events still have trajectories generated for all altitudes, they have been included in the clustering process, and it is not likely that this will cause any significant impact to the results.
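The trajectory counts quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch using only the numbers given in the text; the variable names are illustrative and are not taken from the study's own code.

```python
from itertools import product

# Values quoted in the text; names are illustrative assumptions.
START_TIMES = ["00:00", "08:00", "16:00", "24:00"]
ALTITUDES_M = [10, 410, 810, 1210, 1610, 2010]
N_EVENTS = 376
N_MONTH_END_EVENTS = 15  # events that fell on the last day of a month

# Each (time, altitude) pair is one trajectory start for a rainfall day.
starts_per_event = list(product(START_TIMES, ALTITUDES_M))

# Month-end events lose the 24:00 start at each of the six altitudes.
lost_per_month_end_event = sum(1 for t, _ in starts_per_event if t == "24:00")

expected_total = len(starts_per_event) * N_EVENTS
lost_total = lost_per_month_end_event * N_MONTH_END_EVENTS
generated_total = expected_total - lost_total
lost_fraction = lost_total / expected_total

print(len(starts_per_event), expected_total, lost_total, generated_total)
print(f"{lost_fraction:.3%}")
```

The result reproduces the 24 trajectories per event, the 9,024 expected and 8,934 generated trajectories, and a loss of just under 1%.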
METHODOLOGY
This section outlines the methodology used to obtain the storm trajectories and perform the clustering analysis which form the basis of the subsequent meteorological interpretation.
Event storm trajectories
Samples of trajectories with 1-, 2-, 3-, and 4-day lengths were plotted to visually inspect the pathways of a random sample of trajectories and ensure that they followed pathways that are considered credible from a meteorological perspective. This was also done to gauge the length of trajectories that should be considered when defining distinct event types through clustering of the trajectories. Considering too short a trajectory may lead to clusters that are not spatially distinct, as there will necessarily be very little spatial difference between the locations of events (as they are evenly spread throughout South Africa). Considering too long a trajectory would be nonsensical for this study of moisture origin, as it would entail considering trajectories that have been influenced by multiple atmospheric systems and would likely not be a good representation of the relevant moisture pathways. In addition, the larger errors inherent in longer trajectories would potentially be introduced. It is important to note that HYSPLIT exports air parcel trajectories, and the distance that the moisture in an air parcel has travelled prior to being deposited as rainfall during the event in question must be inferred. For example, if a trajectory was generated from an event in South Africa and showed that 5 days previously, the air parcel had passed over South America, this does not necessarily mean that the moisture deposited as rainfall during this event was carried over South America. In this case, it is judged to be far more likely that the moisture was taken into the atmosphere over the South Atlantic.
Figure 2(a) and 2(b) show that 1-day and 2-day trajectories are centred around the African continent and will likely not give the spatial difference required to perform an effective cluster analysis and will therefore be unlikely to give an indication of the spatial origin of moisture for the extreme rainfall events considered. Figure 2(d) also demonstrates that by considering trajectories of lengths 4 days and above, the clustering process will be considering elements of trajectories that are being influenced by other atmospheric systems than those generating the extreme rainfall event under consideration.
Three-day trajectory lengths (Figure 2(c)) were therefore chosen for the analysis. The longitude and latitude coordinates of each trajectory, for each hourly interval within the 72 h previous to the rainfall event, were extracted and stored in individual vectors for ease of plotting. These vectors were then concatenated into a single matrix containing all position data for all trajectories to be used as input for the k-means clustering.
Trajectory clustering
Grouping the extracted trajectories into visually distinct clusters was an exploratory analysis, which called for an unsupervised machine learning technique that was efficient and appropriate for the task. Many unsupervised clustering techniques exist, but this study adopted the k-means algorithm (Hartigan & Wong 1979), a Euclidean distance minimisation algorithm, due to its inherent simplicity and efficiency when considering clusters (Cui et al. 2021). The k-means algorithm was also adopted for use in other storm trajectory classification studies, notably by both Santos et al. (2018) and Barnes et al. (2019). The k-means algorithm also benefits from being simple to understand, which reduces the ‘black-box’ effect and associated uncertainty (Evans et al. 2019) when using the algorithm, furthering confidence in the results. The basic procedure that k-means carries out is summarised below, adapted from Hartigan & Wong (1979):
1. Obtain a matrix of M points in N dimensions.
2. Select K initial cluster centres.
3. Assign all M points to the cluster centre closest to them in Euclidean space.
4. Redefine cluster centres to be the average of the points contained within the cluster.
5. Re-allocate points to the nearest adjacent cluster centre.
6. Repeat steps 4 and 5 until all points remain in the same cluster.
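The six steps above can be sketched in plain Python. This is an illustrative implementation of the listed procedure only: it is not the Hartigan & Wong routine itself, nor the code used in this study. The function names, the toy 2D points, and the choice of passing the initial centres in explicitly (step 2) are all assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_centre(point, centres):
    """Index of the centre closest to `point` (first index wins on ties)."""
    return min(range(len(centres)), key=lambda k: squared_distance(point, centres[k]))

def mean_point(points):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(points, initial_centres, max_iter=100):
    centres = [list(c) for c in initial_centres]             # step 2 (chosen by caller)
    labels = [nearest_centre(p, centres) for p in points]    # step 3
    for _ in range(max_iter):
        for c in range(len(centres)):                        # step 4
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centre
                centres[c] = mean_point(members)
        new_labels = [nearest_centre(p, centres) for p in points]  # step 5
        if new_labels == labels:                             # step 6: no point moved
            break
        labels = new_labels
    return labels, centres

# Step 1: a toy matrix of M = 9 points in N = 2 dimensions (hypothetical values).
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
          [20.0, 0.0], [20.0, 1.0], [21.0, 0.0]]
# For a reproducible demo, one starting centre is taken from each visible group.
labels, centres = k_means(points, [points[0], points[3], points[6]])
print(labels)
```

In practice the initial centres are usually chosen at random or by a seeding heuristic, and different starts can converge to different solutions, which is why the starting centres are fixed here.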
In the context of HYSPLIT trajectories, each of the M points is a single trajectory made up of a series of n pairs of longitude and latitude coordinates representing the position of the air parcel at each 1-h interval. As each trajectory contains n = 73 pairs of coordinates, which were flattened to form a vector, N = 73 × 2 = 146 and the resulting matrix used for clustering is of dimensions (N × M). When considering 2D data, the cluster centre can be easily visualised as the mean point of all the points within the cluster, whereas when considering trajectories, the cluster centre becomes the ‘average’ trajectory of a cluster.
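The flattening step can be illustrated as follows. This is a sketch under stated assumptions: the straight-line toy trajectories and all names are hypothetical, n = 73 is taken to be the 72 hourly steps plus the starting position, and the matrix is stored with one trajectory per row (the transpose of the N × M layout quoted above).

```python
N_HOURS_BACK = 72
N_POSITIONS = N_HOURS_BACK + 1  # assumed: 72 hourly steps plus the start position = 73

def make_toy_trajectory(start_lon, start_lat, dlon, dlat):
    """Hypothetical straight-line back-trajectory: one (lon, lat) pair per hour."""
    return [(start_lon + h * dlon, start_lat + h * dlat) for h in range(N_POSITIONS)]

def flatten(trajectory):
    """Turn [(lon0, lat0), (lon1, lat1), ...] into [lon0, lat0, lon1, lat1, ...]."""
    return [value for pair in trajectory for value in pair]

trajectories = [
    make_toy_trajectory(31.0, -29.9, 0.20, -0.05),   # made-up east-coast start
    make_toy_trajectory(18.4, -33.9, -0.30, -0.02),  # made-up west-coast start
]

# One flattened trajectory per row: M rows, each of length N = 146.
matrix = [flatten(t) for t in trajectories]
print(len(matrix), len(matrix[0]))
```

Because every trajectory has the same number of hourly positions, each flattened vector has the same length and Euclidean distances between trajectories compare positions hour by hour.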
The trajectories generated by the HYSPLIT model contain a number of parameters that can be analysed using the k-means algorithm. From the HYSPLIT output, the longitude, latitude, and altitude are listed at hourly intervals for each trajectory. In this study, only longitude and latitude have been considered for clustering, despite the availability of altitude data and the capacity of k-means to cluster data with different units. This is primarily because the altitude of trajectories is to an extent arbitrary, being chosen during this study. This is necessary as it is not possible to determine the altitude at which the actual precipitation is formed using HYSPLIT. Further to this, the previous work (Barnes et al. 2020) has indicated that altitude is a poor variable to consider when clustering as it adds dimensionality to the data and thus reduces the efficiency of the clustering algorithm when compared to the simpler case of using only longitude and latitude. Furthermore, the clustering of longitude and latitude is sufficient to produce visually different, spatially coherent trajectory clusters as demonstrated in this project and previous work (Tan et al. 2018).
When determining the optimum number of clusters of trajectories, the silhouette score was calculated, which considers both within-cluster and out-of-cluster error (Rousseeuw 1987). The silhouette score measures the proximity of a trajectory to its allocated cluster centre (to be minimised) and its proximity to the other cluster centres (to be maximised). It takes values in the range [−1, 1], with scores closer to 1 indicating more distinct clusters and scores close to 0 indicating overlapping clusters (SciKit Learn 2020).
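For illustration, the mean silhouette score can be computed directly from its standard definition (Rousseeuw 1987), in which the within-cluster term a is the mean distance to the other members of a point's own cluster and the between-cluster term b is the smallest mean distance to the members of any other cluster. The sketch below is a plain-Python version written for this example, with hypothetical toy points; it only handles two or more clusters and is not the implementation used in the study.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("this sketch needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        # a: mean distance to the other members (the self-distance is zero).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Two tight, well-separated groups of hypothetical 2D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_score = silhouette_score(points, [0, 0, 0, 1, 1, 1])  # labels match the groups
bad_score = silhouette_score(points, [0, 1, 0, 1, 0, 1])   # labels ignore the groups
print(round(good_score, 2), round(bad_score, 2))
```

The labelling that matches the visible groups scores close to 1, while the scrambled labelling scores far lower, which is the behaviour used to compare candidate numbers of clusters.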
From Figure 3, an optimal number of clusters of three can be determined. Whilst both K = 1 and K = 2 yield higher silhouette scores, these are to be disregarded. Clustering into a single cluster will by definition yield a maximum silhouette score, and the fact that the rate of change of the silhouette score changes little between K = 1 and K = 3 demonstrates that while K = 2 would be a mathematically efficient solution, some insight may be lost. Choosing K = 3, after which there is a dramatic change in the rate of change of the silhouette score, will yield only marginally less mathematically optimal results but give a greater number of clusters, which will allow for more insight into the underlying weather event generating systems.
In addition, as discussed earlier, rainfall in South Africa is dominated by systems originating in three different regions: the Atlantic Ocean, the Indian Ocean, and continental recirculation. Therefore, this study has opted to use three clusters for the primary investigation of extreme rainfall events as this is supported by both the data analysis and the meteorological considerations.
RESULTS
The results of the trajectory clustering process are detailed in this section. First, the three distinct clusters are described and the events from which the trajectories were initially generated are allocated to a cluster. The rainfall magnitude distributions of each cluster are then analysed, and the spatial distribution of events allocated to each cluster is investigated.
Clustering process
From Figure 4(a)–4(c), it can be observed that the clustering process has resulted in visually distinct groupings of trajectories that appear to originate from three distinct geographical regions:
Cluster 0: Indian Ocean
Cluster 1: Atlantic Ocean
Cluster 2: Continental Recirculation
Event allocation
The K-means clustering technique has been shown to be efficient at allocating events to clusters. However, as 24 (the number of trajectories generated per event) is both even and divisible by 3 (the number of clusters created), there is a possibility that an event has an equal number of trajectories in two or more clusters and is therefore erroneously allocated to more than one cluster. This can occur, for example, if eight trajectories from the same event are allocated to each of the three clusters, or if 12 trajectories from a single event are allocated to each of two of the three clusters. Only a few events were not adequately allocated by this method. In total, eight events (2.1%) were found to contain an equal number of maximum trajectories in more than one cluster. Of the events allocated to more than one cluster, three were found to be events that were initially allocated to those clusters with less than 50% of their trajectories assigned. Overall, the clustering process resulted in three visually distinct clusters representing the spatial origin of moisture into which 368 of 376 events (98.1%) can be allocated based on the allocation of the greatest number of their trajectories.
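The allocation rule described above, in which an event goes to the cluster holding the greatest number of its trajectories and tied events are left unallocated, can be sketched as follows. The function name and the example events are hypothetical and serve only to illustrate the rule.

```python
from collections import Counter

def allocate_event(trajectory_labels):
    """Return the cluster holding most of an event's trajectories,
    or None when two or more clusters share the highest count."""
    counts = Counter(trajectory_labels).most_common()
    top_count = counts[0][1]
    leaders = [cluster for cluster, count in counts if count == top_count]
    return leaders[0] if len(leaders) == 1 else None

# Hypothetical events, each with 24 trajectory cluster labels.
events = {
    "event_A": [2] * 20 + [1] * 4,            # clear majority in cluster 2
    "event_B": [0] * 12 + [1] * 12,           # two-way tie (12/12)
    "event_C": [0] * 8 + [1] * 8 + [2] * 8,   # three-way tie (8/8/8)
    "event_D": [0] * 10 + [1] * 8 + [2] * 6,  # largest share is below 50%
}

allocation = {name: allocate_event(labels) for name, labels in events.items()}
unallocated = [name for name, cluster in allocation.items() if cluster is None]
print(allocation)
print(unallocated)
```

Here event_D is still allocated to cluster 0 even though fewer than half of its trajectories fall there, which mirrors the distinction drawn in the text between tied events and events allocated with less than 50% of their trajectories.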
The allocation of each of the rainfall events to a cluster is synonymous with allocating an event to a particular causal weather system. This allows for a study of the spatial trends of the extreme rainfall events considered, and the dominance of each cluster as an extreme rainfall-generating mechanism has been determined, as detailed in Table 1.
Cluster | Number of trajectories clustered (%)ᵃ | Number of events represented (%)ᵇ | Number of events allocated (%)ᶜ |
---|---|---|---|
0 – Indian Ocean | 1,663 (19) | 123 (33) | 64 (18) |
1 – Atlantic Ocean | 2,319 (26) | 170 (45) | 91 (25) |
2 – Continental | 4,952 (55) | 278 (74) | 213 (58) |
ᵃ All percentages of the total rounded to the nearest whole number.
ᵇ Percentages in this column will not add up to 100% as each event can be represented in one, two, or all three clusters.
ᶜ Not including the eight events that could not be allocated.
Table 1 shows that for all clusters, approximately the same percentage of trajectories are allocated to each cluster as events (19 and 18% for cluster 0, 26 and 25% for cluster 1, 55 and 58% for cluster 2, respectively). This indicates that both the clustering process and the event allocation process are efficient and provide further confidence that the clusters are accurate representations of the true extreme rainfall-generating processes. If one or more clusters were allocated a significantly higher percentage of trajectories than events, this would indicate that the cluster contained a small number of trajectories from each of a larger number of events. This would likely not represent a rainfall-generating process but an amalgamation of the trajectories from two or more generating processes and may be an indication that a non-optimal K value had been used when clustering. Further evidence of the efficiency of the clustering process can be attributed to the fact that the variation between the number of events represented and the number of events allocated remains consistent across all clusters.
Table 1 demonstrates that the continentally originating moisture dominates extreme rainfall events with 58% of recorded events being attributed, followed by the Atlantic Ocean contributing 25% and the Indian Ocean contributing 18%. The differences in the spatial origin of these events are stark and provide evidence that extreme rainfall events in South Africa can have very different origins.
Variation in event magnitude between clusters
Furthermore, the interquartile range of cluster 1 (Atlantic Ocean) rainfall is greater than that of cluster 2 (continental), indicating marginally greater variation in rainfall magnitude. However, when considering outliers (defined as values with a magnitude greater than the 75th percentile + 1.5 × interquartile range), it is cluster 2 (continental) that has produced the largest rainfall event on record.
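The outlier definition quoted above can be written as a short calculation. The sketch below uses Python's `statistics.quantiles` with its default settings. Quartile conventions differ between software packages, so the exact threshold may not match the one used for the figures in this study, and the rainfall values are made up for illustration.

```python
import statistics

def upper_outlier_threshold(values):
    """75th percentile + 1.5 x interquartile range, using the default
    quartile method of statistics.quantiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 + 1.5 * (q3 - q1)

# Hypothetical daily rainfall maxima (mm) for one cluster.
rainfall_mm = [100, 110, 120, 130, 140, 150, 160, 170, 600]

threshold = upper_outlier_threshold(rainfall_mm)
outliers = [value for value in rainfall_mm if value > threshold]
print(threshold, outliers)
```

Only the single very large value exceeds the threshold in this toy sample, so it is the only one flagged as an outlier.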
Table 1 and Figure 6 indicate that of the events considered, cluster 0 (Indian Ocean) events generally contribute greater levels of precipitation during rainfall events, but only represent 18% of all recorded events, whereas the remaining events attributed to cluster 1 (Atlantic Ocean) and cluster 2 (continental) average far lower levels of precipitation but represent the majority of events.
Spatial distribution of clusters
Figure 7(d) suggests that South Africa can be split roughly into three regions when considering the origin of moisture causing extreme rainfall events. The interior is dominated by continental recirculation, and the coasts are dominated by oceanic influences as would be expected. Furthermore, when considering K = 3, the coastline can be split at approximately 25° longitude – the centre of the country – to delineate between coastal regions influenced by the Atlantic Ocean to the west and the Indian Ocean to the east. Whilst this is not an exact boundary, it does provide a rough guide to the origins of extreme rainfall in different regions.
When also considering that the oceanic originating events are confined to the coastal regions, at the base of the plateau that delineates central South Africa, Figure 7(d) indicates that the dominant cause of extreme rainfall in a region of South Africa can be effectively determined by two factors: whether the location is east or west of 25° longitude and whether the location sits on the plateau in the interior or on the sides of, or at the base of, the escarpment that delineates the plateau. Given the large area of the plateau in which rainfall is dominated by continental systems compared to the coastal areas in which oceanic influences dominate it is unsurprising that Table 1 demonstrates that the majority of extreme rainfall events considered are continental in origin.
CONCLUSIONS
This study sought to investigate the origins and pathways of atmospheric moisture causing extreme rainfall events in South Africa. To achieve this aim, a process was developed for extracting storm trajectories using data from the HYSPLIT model and clustering these trajectories using unsupervised machine learning techniques. A spatially representative database of observed historical maximum magnitude rainfall events has been considered combined with meteorological data supplied by the National Oceanic and Atmospheric Administration (NOAA) Air Resources Laboratory NCEP/NCAR Reanalysis 1 project.
The new clustering process developed in this study has revealed three distinct regions in which moisture originates when considering extreme rainfall events in South Africa – South Atlantic Ocean, South Indian Ocean and continentally – and that there are clear differences in the spatial and temporal distributions of these events. The coastal regions of South Africa are predominantly influenced by the respective adjacent oceans with cluster 0 events originating from the South Indian Ocean and dominating the east coast, whereas the west coast is predominantly influenced by moisture originating in the South Atlantic Ocean (cluster 1). Furthermore, the central region of South Africa is dominated by continentally originating moisture. Clear differences in rainfall magnitude have been identified with cluster 0 (Indian Ocean) accounting for 18% of events with an average magnitude of 279 mm, cluster 1 (Atlantic Ocean) accounting for 25% of events with an average magnitude of 149.3 mm, and cluster 2 (continental) accounting for the majority of events (58%) with an average magnitude of 150 mm. When considering South Africa as a whole, it appears as though the least frequent events are also the ones that carry the largest magnitude; however, the clustering process has identified that these events are predominantly found in the eastern region, indicating that it is more appropriate to consider extreme rainfall on a regional and local level when designing infrastructure.
The regions of influence of the three regions of moisture origin have been found to be demarcated approximately by the line of 25° longitude and the escarpment that delineates the central plateau. The approximate demarcation at 25° longitude is most likely due to this being approximately the boundary between the western arid zones (influenced by cluster 1, Atlantic Ocean) and the eastern temperate zones (influenced by cluster 0, Indian Ocean). Cluster 2 (continental) events are spread approximately evenly across the temperate and arid zones due to this cluster likely being dominated by the altitude of the plateau as a causal rainfall mechanism, rather than oceanic influences, and the approximately even area of the central plateau in both the temperate and arid regions.
This study has demonstrated that the combination of the HYSPLIT model and unsupervised clustering techniques is capable of developing insights into the spatial origin, dominance, and magnitude of extreme rainfall. These findings open up the possibility of further studies into how climate change might affect the processes governing extreme rainfall in the region and the potential consequences for the safety of critical infrastructure.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the NOAA Air Resources Laboratory (ARL) for the provision of the HYSPLIT transport and dispersion model. The five anonymous reviewers all provided helpful feedback on earlier versions of the manuscript, and their contributions are acknowledged.
STATEMENTS AND DECLARATIONS
The authors have no relevant financial or non-financial interests to disclose.
AUTHOR CONTRIBUTIONS
Conceptualization: Thomas Rodding Kjeldsen and Rhys Phillips; Data curation: Katelyn Johnson; Methodology: Thomas Kjeldsen, Andrew Barnes, Katelyn Johnson, and Rhys Phillips; Formal analysis and investigation: Rhys Phillips; Writing – original draft preparation: Rhys Phillips; Writing – review and editing: Rhys Phillips, Thomas Rodding Kjeldsen, Andrew Barnes, and Katelyn Johnson.
DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.
CONFLICT OF INTEREST
The authors declare there is no conflict.