Geoscientists are continuously confronted by the difficulties of handling a variety of data formats. Because data are typically configured in either the time or the space domain, spatio-temporal analysis requires multiple stand-alone software packages, which is a time-consuming approach. In this paper, the concept of cellular time series (CTS) and three types of meta data are introduced to improve the handling of CTS in spatio-temporal analysis. The data structure was implemented in the Python programming language; however, it could also be implemented in other languages (e.g., R and MATLAB). We used this concept in the hydro-meteorological discipline. In our application, a CTS of monthly precipitation was generated from the data of 102 stations across Iran. The non-parametric Mann–Kendall trend test and change point detection techniques, including Pettitt's test, the standard normal homogeneity test, and the Buishand range test, were applied to the generated CTS. Results revealed a negative annual trend in the eastern parts of the country and, sporadically, in the southern and western parts. Furthermore, the year 1998 was detected as a significant change year in the eastern and southern regions of Iran. The proposed structure may be used by geoscientists and data providers for straightforward simultaneous spatio-temporal analysis.
Big data has expanded dramatically in information and communications technology (ICT) since 2010, whereas it is still at an early stage in water and climate sciences (Chen & Han 2016). The term ‘big data’ appears simple, but its technical meaning is somewhat ambiguous. It is commonly used to refer to massive datasets, as well as to complexity beyond the capacity of conventional computing tools (Snijder et al. 2012).
The big data characteristics include ‘high volume’ (e.g., data volume, collection, retention), ‘high velocity’ dealing with speed of data collection and processing (e.g., data caches, point-to-point data routing), and ‘high variety’ dealing with categories (e.g., data formats and meta data) (Laney 2009). Discovery of knowledge from a high volume dataset is a challenging systematic issue (e.g., utilizing the existing hardware, data formats, models, and methods) (Council 2013).
Some disciplines, such as environmental and earth sciences, natural resources, meteorology, and hydrometeorology, involve dynamic variables (e.g., Simonovic 2009; Abuzied & Mansour 2018; Nodoushan & Shakibaeinia 2018). In this regard, the spatio-temporal analysis of well-structured data remains a challenging issue (Harpham & Danovaro 2015). Furthermore, large volumes of hydrometeorological data have been recorded and released in recent decades (e.g., data from sensors, remote sensing (RS), Earth observation, and the internet of things), which has pushed the water and climate sectors inevitably into the big data field. Accordingly, the analysis of big data needs a structure that can accommodate the space dimension in addition to the time dimension (Liu et al. 2016).
In recent years, global RS data have been stored by a number of structured data models, such as advanced self-describing network Common Data Form (netCDF), Hierarchical Data Format (HDF), and Gridded Binary (GRIB) that handle time and space dimensions (Chen & Han 2016). Recent generations of these data models (e.g., netCDF-4 and HDF5) can store more complex datasets (high volume and different variety) (Rew et al. 2006). netCDF and HDF are commonly used for storing big gridded data (GD), but to extract the values of variables, there is the need for additional time-consuming processing (decoding). In addition, the main problem (in data handling) arises when the spatio-temporal analysis of geoscience data is commonly performed by the end-users who are not trained ICT scientists (Ward et al. 2018).
RS data assessment leads to a new generation of challenges. Data providers publish modern geoscience datasets (e.g., satellite products) in high-compressed formats (e.g., HDF). For the RS product assessment (e.g., precipitation), there is a need to compare them with ground-based measures. On the other hand, historical ground-based data have been recorded and published since much earlier times and are usually accessible in traditional formats (e.g., txt, CSV, xls, and xlsx). Thus, there is a big gap (in terms of published formats) between structured data (e.g., RS and global reanalyzed datasets shared in netCDF or HDF formats) and ground-based data (Figure 1(a)). Consequently, synchronization and sorting of multiple data formats has become a crucial and inevitable time-consuming step in the process of RS data assessment. Hence, there is a need for a common flexible solution which covers the variety of traditional and modern data formats.
Figure 1(b) addresses another issue. Traditional spatio-temporal approaches, which combine stand-alone software packages and transfer results among them, are not sufficient for fast and accurate analysis of huge datasets. Geographic information systems (GIS) software (e.g., ArcGIS) is mainly based on the space domain (GD structure), whereas time series (TS) analysis software (e.g., SPSS) assists researchers in investigating the dynamics of data along the time dimension. Nevertheless, there is no clear concept or guideline for how to combine these software packages for spatio-temporal analysis (Figure 1(b)). To improve the procedure of spatio-temporal analysis, a new structure may be required. One of the main objectives of this paper is to provide a data structure/concept that improves the speed of spatio-temporal analysis. In this study, we focus on ground-based data.
Several studies have addressed data structures. Kruger et al. (2011) studied the utility of meta data in the query of NEXRAD data. They showed that this is an efficient approach for selecting data subsets in time series. It is based on the Hydro-NEXRAD system, which contains a vast archive of online radar-centric data. However, their study was limited to a specific online dataset and is not independently customizable by end-users. Liu et al. (2016) proposed the SciDB server-based data query structure. SciDB was developed for the NetCDF-4 format. It is implemented on an online platform and assists the data query procedure. SciDB is a server-based structure that was not developed for offline data sources and does not cover traditional formats. Seo et al. (2019) studied the NEXRAD dataset and introduced a pilot infrastructure for data management. It is based on meta data queries and facilitates searching/converting procedures in precipitation datasets. This study is specific to NEXRAD data and was not developed for all types of geo-datasets (e.g., offline and user-defined).
Analyses of climatic and meteorological fluctuations have been conducted across different regions around the world (e.g., Yao et al. 2014; Pascale et al. 2015; Shirvani 2015; Abidi et al. 2017; Gundogdu 2017; Aminyavari et al. 2018; Kolachian & Saghafian 2019). As climatic variables fluctuate over time and in space, numerous studies have explored the spatio-temporal nature of climate (e.g., Toros 2012; Sen Roy & Rouault 2013; Delavau et al. 2015; Luković et al. 2015). In addition, climate variables affect hydrology and water resources. With the development of better computer-based algorithms and methods (such as machine learning, data mining, and artificial intelligence), many studies have been performed to simulate and forecast hydro-climatic phenomena (e.g., Chuntian & Chau 2002; Wu & Chau 2011; Chau 2017; Ali Ghorbani et al. 2018; Moazenzadeh et al. 2018; Nabaei et al. 2019; Yaseen et al. 2019).
Trend and change point tests are common temporal TS analyses, prevalently conducted on point data (e.g., Shi et al. 2014; Khalili et al. 2016). However, the study of trend and change point across space enhances our understanding of how TS behave spatially. At a glance, spatio-temporal climate data studies could be categorized into four groups based on the type of input data and the representation of results. The first group, with a huge number of reported studies, deals with in-situ (point) data. Inputs are typically provided by meteorological stations (e.g., Duhan & Pandey 2013; Santos & Fragoso 2013; de Luis et al. 2014; Goyal 2014; Hosseinzadeh Talaee et al. 2014; Shi et al. 2015; Singh et al. 2015; Ghasemi 2015; Li et al. 2015; Longobardi & Mautone 2015; Powell & Keim 2015; Javari 2017; Roushangar et al. 2018). The results of such studies lack a spatial dimension and are specific to a certain location.
In the second study group, station-based TS analysis is transposed into a spatially continuous domain. This is usually carried out via interpolation techniques applied to output statistics (e.g., Dinpashoh 2006; Fathian et al. 2014; Hosseinzadeh Talaee et al. 2014; Abolverdi et al. 2016; Javari 2016; Minaei & Irannezhad 2016). The third group regionalizes a study area by inputting main (e.g., precipitation) or auxiliary (e.g., outputs of statistical analysis) variables into different classification techniques such as clustering (e.g., Dinpashoh et al. 2004; Modarres 2006; Raziei et al. 2008; Modarres & Sarhadi 2011; Fazel et al. 2017). The last study group deploys RS (e.g., radar and imagery; Khodadoust Siuki et al. 2017; Sharifi et al. 2018) or GD for spatio-temporal analysis (e.g., Sarmadi & Shokoohi 2015; Fallah et al. 2017). This procedure is more recent and its outcomes are presented in GD form.
Efficient management of geoscience data is one of the major concerns of data scientists. As mentioned previously, most of the studies were limited to a particular format (e.g., NetCDF) or developed for a specific platform (e.g., Hydro-NEXRAD or online services). These restrictions hinder the development of a common data structure for a wide range of applications and formats. To overcome these obstacles, a new data concept with the following specifications could be advantageous: (1) simple enough to be applied by a wide range of researchers; (2) flexible enough to cover multi-format datasets (i.e., filling the gap between in-situ and RS data); (3) customizable (i.e., extendable) for generating and managing datasets; (4) implementable on a single platform that integrates all the procedures of spatio-temporal analysis (without relying on multiple stand-alone software packages).
In this study, we used Python 2.7, which is compatible with the GNU General Public License (GPL). Python is a high-level language and is well-suited for geoscience disciplines. Other languages (e.g., R, MATLAB, Java, and Scala) could also be used to generate and apply this concept. To cover the desired goals, we introduce the structure/concept of cellular time series (CTS), spatial meta data (SMD), temporal meta data (TMD), and meta CTS (mCTS) in this paper. Herein, CTS is defined as a set of connected cells over space, each of which carries a time series.
First, the generation procedure of CTS is outlined. Next, three types of meta data are proposed for managing (querying) CTS information. The well-known statistical analyses that we applied across Iran are then reviewed. The details of the MK trend test are described, followed by three different change point detection methods: Pettitt's test, SNHT, and the Buishand range test.
Generation of CTS
This paper proposes a conceptual structure for spatio-temporal applications where one needs to convert in-situ time series into CTS. Figure 2(c) shows the three dimensions of the CTS: X (longitude), Y (latitude), and T (time). Numerous data generation/prediction methods (e.g., IDW, Bayesian kriging, co-kriging, spline, and also numerical weather generation models) could be applied to create the spatial (X-Y) layouts (Figure 2(a)). The two-dimensional (2D) matrices in space are then merged to form a three-dimensional (3D) CTS (Figure 2(b) and 2(c)). In this study, ground-based data were the basis for the CTS, as such data are presumed to be a reference data source for assessing/bias correcting other global datasets (e.g., RS and reanalyzed data). Similarly, RS and reanalyzed data (usually in NetCDF, GRIB, and HDF formats) could also be used to generate CTS.
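The generation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `idw` and `build_cts`, the toy stations, and the `[t][y][x]` nesting order are assumptions, plain nested lists stand in for a matrix library to keep the sketch dependency-free, and distances are computed on planar coordinates for simplicity.

```python
import math

def idw(stations, grid_x, grid_y, power=2.0):
    """Interpolate station values onto a regular grid by inverse distance weighting.

    stations: list of (x, y, value) tuples.
    Returns a 2D list indexed [row (y)][column (x)].
    """
    layer = []
    for gy in grid_y:
        row = []
        for gx in grid_x:
            num = den = 0.0
            exact = None
            for sx, sy, val in stations:
                d = math.hypot(gx - sx, gy - sy)
                if d == 0.0:
                    exact = val  # grid node coincides with a station
                    break
                w = 1.0 / d ** power
                num += w * val
                den += w
            row.append(exact if exact is not None else num / den)
        layer.append(row)
    return layer

def build_cts(station_xy, series, grid_x, grid_y, power=2.0):
    """Stack one interpolated 2D layer per time step into a 3D CTS [t][y][x]."""
    n_steps = len(series[0])
    cts = []
    for t in range(n_steps):
        stations = [(x, y, s[t]) for (x, y), s in zip(station_xy, series)]
        cts.append(idw(stations, grid_x, grid_y, power))
    return cts

# Hypothetical toy data: two stations, three time steps, a 3 x 2 grid.
station_xy = [(0.0, 0.0), (2.0, 0.0)]
series = [[10.0, 20.0, 30.0], [30.0, 40.0, 50.0]]
grid_x = [0.0, 1.0, 2.0]
grid_y = [0.0, 1.0]
cts = build_cts(station_xy, series, grid_x, grid_y)
```

Each element `cts[t]` is one spatial layer, and the list `[layer[i][j] for layer in cts]` is the time series carried by cell `(i, j)`.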
Spatial meta data
This is a 2D matrix in space dimension (X and Y). In this case, the SMD values are constant over time and attributed to all the temporal slices involved in the CTS. For example, digital elevation model, coordinates (longitude and latitude), the surface of a region (e.g., states, catchments, and climate zones), and land use can be categorized as SMD. SMD can help to perform spatial analysis on the CTS, for example, the query of the CTS values in a particular area. Afterwards, any spatial calculations (e.g., sum, average, standard deviation, and entropy) can be performed via SMD (Figure 2(d) and 2(g)).
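As a rough sketch of an SMD query, the example below selects the cells that share a region label and averages the CTS over them at each time step. The function name `spatial_mean`, the region labels, and the toy values are hypothetical, and the CTS is assumed to be nested as `[t][y][x]`.

```python
def spatial_mean(cts, smd, region_id):
    """Average the CTS over cells whose SMD label equals region_id, per time step."""
    cells = [(i, j)
             for i, row in enumerate(smd)
             for j, label in enumerate(row)
             if label == region_id]
    if not cells:
        raise ValueError("no cell carries this SMD label")
    return [sum(layer[i][j] for i, j in cells) / len(cells) for layer in cts]

# Hypothetical toy CTS: two time steps on a 2 x 2 grid.
cts = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
# Hypothetical SMD: one region label per cell, constant over time.
smd_region = [[1, 1], [2, 1]]

region1_mean = spatial_mean(cts, smd_region, 1)
region2_mean = spatial_mean(cts, smd_region, 2)
```

Replacing the average with another reduction (sum, standard deviation, entropy) leaves the query part unchanged, which is the point of keeping the SMD separate from the CTS values.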
Temporal meta data
This is a vector (along the time dimension). The TMD values are constant over the space and attributed to all the spatial slices involved in the CTS. For example, year, month, day, hour, minute, seasons, and different calendars (e.g., solar, Julian, and lunar) may be categorized as TMD. TMD can help to perform temporal analysis on the CTS. For example, extracting the values of specific days and converting daily precipitation to monthly could be performed via TMD (Figure 2(e) and 2(g)).
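A TMD-driven aggregation might look like the sketch below, which groups time slices by a TMD label and reduces every cell along time. The function name `aggregate_by_label` and the toy data are assumptions; the same call with a day-to-month label vector would convert daily layers to monthly ones.

```python
def aggregate_by_label(cts, tmd, func=sum):
    """Group time slices by their TMD label and reduce each cell along time."""
    if len(cts) != len(tmd):
        raise ValueError("TMD must hold one label per time slice")
    groups = {}
    for label, layer in zip(tmd, cts):
        groups.setdefault(label, []).append(layer)
    result = {}
    for label, layers in groups.items():
        n_rows, n_cols = len(layers[0]), len(layers[0][0])
        result[label] = [
            [func([lyr[i][j] for lyr in layers]) for j in range(n_cols)]
            for i in range(n_rows)
        ]
    return result

# Hypothetical toy CTS: four time steps on a 1 x 2 grid.
cts = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
# Hypothetical TMD: the year of each time slice.
tmd_year = [2000, 2000, 2001, 2001]

annual = aggregate_by_label(cts, tmd_year)
```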
The mCTS is a 3D matrix (X, Y, and T) and could contain SMD and TMD features. The mCTS values can vary in both time and space. Any variable that is dynamic in time and space and that could be used to query CTS information is regarded as mCTS. Different variables (e.g., temperature and humidity) and drought indices (e.g., SPI, SPEI, Palmer) may be treated as mCTS. For example, the query of drought events (based on a specific index threshold) could be performed by mCTS (Figure 2(f) and 2(g)).
Figure 2(g) illustrates the general procedure (interactions) of CTS and meta data application. The procedure may be applied in most programming languages. It is based on querying the meta data, extracting a new CTS, and applying functions (Fn) along the desired dimension.
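The following sketch illustrates one way an mCTS query could work: a second 3D array with the same shape as the CTS is tested cell by cell, and only the matching CTS values are kept. The function name `query_by_mcts`, the index values, and the threshold of -1.0 are hypothetical.

```python
def query_by_mcts(cts, mcts, predicate):
    """Keep CTS values where predicate(mCTS value) holds; use None elsewhere."""
    return [
        [
            [v if predicate(m) else None for v, m in zip(v_row, m_row)]
            for v_row, m_row in zip(v_layer, m_layer)
        ]
        for v_layer, m_layer in zip(cts, mcts)
    ]

# Hypothetical toy CTS of precipitation: two time steps on a 1 x 2 grid.
cts = [[[10.0, 20.0]], [[30.0, 40.0]]]
# Hypothetical mCTS: a drought index that varies in both time and space.
mcts = [[[-1.5, 0.2]], [[0.0, -2.0]]]

# Select precipitation only where the index is at or below the threshold.
dry = query_by_mcts(cts, mcts, lambda m: m <= -1.0)
```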
MK trend test
In the above equation, n, g, and ti are the number of samples, number of tied groups, and number of data values in the ith group, respectively (see Hirsch et al. 1982).
A two-tailed test with the null hypothesis H0 (no trend) and the alternative hypothesis HA (a trend exists) is then applied. The standard normal cumulative values are determined for the α = 5% and 10% significance levels. Because the test is two-tailed, H0 is rejected and the trend is significant if Z is greater than Z1−α/2 or smaller than −Z1−α/2.
If p is smaller than the significance level, Kt is significant and the corresponding t is detected as a change point. In this study, the 5% significance level is selected for the two-tailed hypothesis test.
The variable k refers to the candidate change point and n is the total sample size. The maximum value of Tk marks a candidate for a significant change point and should be checked by a hypothesis test. The statistic is computed from the series standardized by its mean and standard deviation.
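For a single cell's time series, the test can be sketched as below. The sketch follows the commonly used formulation of the MK statistic S, its tie-corrected variance, and the continuity-corrected Z; it is assumed, not confirmed, that this matches the equations used in the paper. The function name `mann_kendall` and the toy series are hypothetical.

```python
import math
from collections import Counter
from statistics import NormalDist

def mann_kendall(x, alpha=0.05):
    """Return (S, Z, significant) for a two-tailed Mann-Kendall trend test."""
    n = len(x)
    # S counts concordant minus discordant pairs.
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = x[j] - x[i]
            s += (diff > 0) - (diff < 0)
    # Variance with tie correction; groups of size 1 contribute zero.
    tie_sizes = Counter(x).values()
    var_s = (n * (n - 1) * (2 * n + 5)
             - sum(t * (t - 1) * (2 * t + 5) for t in tie_sizes)) / 18.0
    # Continuity-corrected standard normal statistic.
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return s, z, abs(z) > z_crit

# Hypothetical toy series.
s_up, z_up, sig_up = mann_kendall([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
s_flat, z_flat, sig_flat = mann_kendall([5, 5, 5, 5])
```

Applied to a CTS, the function would be called once per cell on that cell's time series, producing one Z map per time scale.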
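A minimal sketch of Pettitt's test is given below, assuming the usual rank-based statistic and the approximate p-value p ≈ 2 exp(−6K²/(n³ + n²)); whether the paper uses exactly this approximation is not confirmed. The function name `pettitt` and the toy series are hypothetical.

```python
import math

def pettitt(x, alpha=0.05):
    """Return (t, K, p, significant); t is the length of the first segment."""
    n = len(x)
    best_k, best_t = 0, None
    for t in range(1, n):
        # U_t compares every value before the split with every value after it.
        u = 0
        for i in range(t):
            for j in range(t, n):
                d = x[i] - x[j]
                u += (d > 0) - (d < 0)
        if abs(u) > best_k:
            best_k, best_t = abs(u), t
    # Approximate two-sided p-value, capped at 1.
    p = min(1.0, 2.0 * math.exp(-6.0 * best_k ** 2 / (n ** 3 + n ** 2)))
    return best_t, best_k, p, p < alpha

# Hypothetical toy series with a shift after the tenth value.
t_cp, k_stat, p_value, is_sig = pettitt([1] * 10 + [5] * 10)
```

The triple loop is cubic in the series length, which is acceptable for short annual series but would need a rank-based shortcut for long records.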
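The SNHT statistic can be sketched as follows, assuming the single-shift form T_k = k·z̄₁² + (n − k)·z̄₂², where z̄₁ and z̄₂ are the means of the standardized series before and after position k. The use of the sample standard deviation is an assumption, critical values (which depend on n) are not included, and the function name `snht` is hypothetical.

```python
from statistics import mean, stdev

def snht(x):
    """Return (k, T0): the split position maximizing T_k and the maximum itself."""
    n = len(x)
    m, s = mean(x), stdev(x)  # sample standard deviation (assumption)
    if s == 0:
        raise ValueError("constant series has no change point")
    z = [(v - m) / s for v in x]
    best_t, best_k = float("-inf"), None
    for k in range(1, n):
        z1 = sum(z[:k]) / k          # mean of standardized values up to k
        z2 = sum(z[k:]) / (n - k)    # mean of standardized values after k
        t_k = k * z1 ** 2 + (n - k) * z2 ** 2
        if t_k > best_t:
            best_t, best_k = t_k, k
    return best_k, best_t

# Hypothetical toy series with a shift after the tenth value.
k_snht, t0 = snht([1.0] * 10 + [5.0] * 10)
```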
Buishand's range test
The change point was evaluated by comparing the test statistic with the critical value at the chosen significance level.
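As an illustration, the sketch below computes the rescaled adjusted range R/√n from the adjusted partial sums of the series, which is the usual form of the Buishand range statistic. Critical values come from published tables and are not reproduced here. The function name `buishand_range`, the use of the population standard deviation, and the toy series are assumptions.

```python
from statistics import mean, pstdev

def buishand_range(x):
    """Return (k, R / sqrt(n)) from the adjusted partial sums of the series."""
    n = len(x)
    m, s = mean(x), pstdev(x)  # population standard deviation (assumption)
    if s == 0:
        raise ValueError("constant series has no change point")
    # Partial sums of deviations from the mean, starting at S_0 = 0.
    partial = [0.0]
    for v in x:
        partial.append(partial[-1] + (v - m))
    adjusted = [p / s for p in partial]
    r = max(adjusted) - min(adjusted)
    # Candidate change point: where the partial sum departs most from zero.
    k = max(range(1, n), key=lambda i: abs(partial[i]))
    return k, r / n ** 0.5

# Hypothetical toy series with a shift after the tenth value.
k_b, stat_b = buishand_range([1.0] * 10 + [5.0] * 10)
```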
APPLICATION OF CTS
Iran is a country in western Asia between 44°00′E–63°25′E longitude and 25°00′N–38°40′N latitude with an area of about 1,648,000 km2. The Caspian Sea in the north and the Persian Gulf in the south border some 2,000 km of coastline in total (Figure 3). Zagros and Alborz are the two main mountain ranges in the country. Also, two vast central deserts are known as the Dasht-e Lut and the Dasht-e Kavir. The climate varies greatly across the country, depending on topography among other things.
Monthly precipitation records of 102 synoptic stations were obtained from the Iranian Meteorological Organization. The location of the stations is shown in Figure 3. The study period covers 1987–2014, the common period of quality precipitation data. To provide a basis for seasonal analysis, December is set as the first month; accordingly, the seasons are set as follows: winter = DJF (Dec, Jan, and Feb), spring = MAM (Mar, Apr, and May), summer = JJA (Jun, Jul, and Aug), and autumn = SON (Sep, Oct, and Nov). The precipitation CTS was generated from 336 monthly precipitation layers at 0.5° × 0.5° spatial resolution. Precipitation maps were generated via the inverse distance weighted (IDW) interpolation method with an exponent of 2. CTS is not limited to IDW interpolation and could be generated with many other techniques, even optimization tools. Afterward, the spatio-temporal trend and change point analyses were performed on the generated CTS.
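The temporal labels for this setup could be built as in the sketch below, which produces one TMD record per monthly layer. Assigning December to the winter of the following year is an assumption about how "December is set as the first month" is handled, and the names `SEASON_OF_MONTH` and `build_tmd` are hypothetical.

```python
SEASON_OF_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

def build_tmd(first_year, last_year):
    """Return one (year, month, season, season_year) record per monthly layer."""
    tmd = []
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            # December opens the winter that continues into the next year.
            season_year = year + 1 if month == 12 else year
            tmd.append((year, month, SEASON_OF_MONTH[month], season_year))
    return tmd

tmd = build_tmd(1987, 2014)
```

The 28 years of the study period give 28 × 12 = 336 records, matching the number of monthly layers in the CTS.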
RESULTS AND DISCUSSION
In this section, trend and change point analyses were applied to the CTS at different time scales. Results, extracted in cellular form, are illustrated at different significance levels.
|Precipitation (mm)||Season||Precipitation (mm)||Contribution (%)||Month||Precipitation (mm)||Contribution per season (%)||Contribution per year (%)|
As shown in the maps, northern and western Iran (along Alborz and Zagros mountain chains) receive the highest precipitation depths (Figure 4). The average annual precipitation varies from 20 mm (in Dasht-e Lut desert) to 1,410 mm (along the northern coastline). The overall average values closely match those reported by previous studies (e.g., Dinpashoh et al. 2004; Modarres & Sarhadi 2011; Raziei et al. 2014).
Summer rain mainly occurs along the Caspian Sea in the north where over 80% of the total summer precipitation in Iran is received (Figure 4). Average annual precipitation is 242.9 mm over the study period, 75% of which occurs in winter and spring (Table 1). Total precipitation from November to April constitutes about 83% of annual precipitation (Figure 5). Clearly, central and southeastern regions receive the lowest precipitation (Figure 4) and are classified as arid and semi-arid climates (Modarres & Sarhadi 2009).
Hypothesis tests were applied to detect significant trends at the 5% and 10% levels. The corresponding two-tailed standard normal critical values are Z0.975 and Z0.95, respectively. These critical values were used as thresholds for the MK Z statistic to detect cells with a significant trend. The spatial distribution of significant trends at annual and seasonal time scales is shown in Figure 6. The MK test was also applied at the monthly time scale, as shown in Figure 7.
Annual precipitation shows a decreasing trend in the east and in small portions of the west and south (Figure 6(a)). Decomposition of the annual trend revealed that winter and spring fluctuations are the main sources of the downward trend (Figure 6(b) and 6(c)). The overall annual trend is balanced by the insignificant contribution of summer precipitation (about 4%) and by an upward trend in autumn precipitation, which appears in central regions and sporadically in eastern and southern regions (Figure 6(e)).
The monthly precipitation trend maps (Figure 7) show that the downward winter trend occurred in December (mostly in the southeast) and February (mostly in the west). Moreover, spring is affected by a downward March precipitation trend. More detailed analysis of March precipitation revealed that areas with a trend significant at the 5% level are concentrated in the west of Iran. There are also areas in the north, centre, and east with a decreasing trend. The increasing autumn trend is mainly affected by November precipitation (Figure 7). Trends are observed in most regions at the 5% significance level, except in parts of the northwest and southeast.
Figure 8 summarizes the trend analysis, in terms of the percentage of the country affected, at the 5% and 10% significance levels. About 22.4% of the area has a negative annual trend, out of which 7.9% is significant at the 5% level. Areas subject to winter, spring, summer, and autumn precipitation trends cover 26.7% (decreasing), 23.7% (decreasing), nearly zero, and 23.4% (increasing) of the total study area, respectively. The downward precipitation trend in March covers 49.3% of Iran, out of which 27.7% is significant at the 5% level. The dominant increasing trend occurs in November and covers 66.8% of Iran (Figure 8). These two observations should be read in light of the fact that March and November receive 16.7% and 10.5% of total annual precipitation, respectively (Table 1).
Other researchers have also analyzed the precipitation trend in Iran. Talaee (2014) studied the trend in precipitation at seven rain gauges in Hamedan (a western province of Iran) and found that monthly trend analysis (from 1969 to 2009) via the MK test revealed decreasing and increasing trends in March and November, respectively, which conforms with Figure 7 for Hamedan province. However, the positive MK statistics found in Hamedan in February contrast with our results shown in Figure 7 (western regions). This inconsistency is mainly due to the difference in the length of the study periods. In other words, incorporation of more recent data in our study resulted in a downward trend in February.
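The thresholding step can be sketched with the standard library as follows: the two-tailed critical values Z0.975 and Z0.95 are obtained from the standard normal distribution, and each cell's MK Z statistic is labeled with the strictest level it passes. The function name `classify_trend` and the toy Z map are hypothetical.

```python
from statistics import NormalDist

def classify_trend(z, alphas=(0.05, 0.10)):
    """Return the smallest alpha at which |z| is significant (two-tailed), else None."""
    for alpha in sorted(alphas):
        if abs(z) > NormalDist().inv_cdf(1 - alpha / 2):
            return alpha
    return None

# Hypothetical 2 x 2 map of MK Z statistics, one value per cell.
z_map = [[-2.3, -1.7], [0.4, 1.99]]
sig_map = [[classify_trend(z) for z in row] for row in z_map]
```

The sign of Z is kept in `z_map`, so the direction of a significant trend can still be read from the original statistic.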
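A percentage of affected area could be derived from a cellular result map as in the sketch below. The paper does not state how cell areas were accounted for; weighting each row by the cosine of its latitude is an assumption made here because 0.5° cells shrink toward the poles. The function name `percent_area`, the use of `None` for cells outside the study area, and the toy map are hypothetical.

```python
import math

def percent_area(flags, lats):
    """Percentage of the study area flagged True, weighting rows by cos(latitude).

    flags: 2D list of True/False, with None marking cells outside the study area.
    lats: latitude in degrees of each grid row.
    """
    total = flagged = 0.0
    for row, lat in zip(flags, lats):
        w = math.cos(math.radians(lat))
        for f in row:
            if f is None:
                continue  # cell lies outside the study area
            total += w
            flagged += w * bool(f)
    return 100.0 * flagged / total if total else 0.0

# Hypothetical 2 x 3 map: True marks cells with a significant trend.
flags = [[True, False, None], [True, False, None]]
lats = [30.25, 30.75]
share = percent_area(flags, lats)
```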
Some'e et al. (2012) analyzed in-situ precipitation data in Iran. They applied the MK test on 28 synoptic stations in the period of 1967–2006. Three northwest stations (Khoy, Urmia, and Tabriz) were found to have significant negative trend at annual scale. Seasonal trend analysis revealed downward spring trend in three of the studied stations (in central, eastern, and southeast regions) and upward summer trend in the northeast. No significant trend was found in autumn while negative winter trend was detected in the northwest and along the coasts of the Caspian Sea and in one station in the southeast region. Comparison of their results with our study indicates some similarity in spring (eastern) and winter (northwest) trends. The incompatibility of the trend in some areas could be due to differences in the study period as well as different spatial scales.
Tabari et al. (2011) investigated annual precipitation trend in 13 stations over the 1966–2005 period along the Zagros mountain chain, stretching in western Iran towards the southeast. Results showed a significant decreasing trend (at 95% level) in one southern and two western stations. This is in general agreement with this study although the study periods were not quite the same and that study focused on point analysis.
In another study, Modarres & Sarhadi (2009) investigated the annual precipitation trend in eight regions of Iran (Figure 9). They showed increasing and decreasing annual precipitation trends in the G2 and G8 regions, respectively, for the 1951–2000 study period. In our study, the MK statistic (Z) and P-value were calculated for all precipitation classes (Table 2) over the 1987–2014 period. However, no significant trend was detected in annual precipitation (2nd column) in the G2 and G8 regions. Statistics were also studied at the seasonal scale, which indicated a significant downward trend in winter in the G5 and G7 regions (Table 2). The contrast between some of our results and those of Modarres & Sarhadi (2009) stems from the difference in the study periods, as more recent data could have been affected by climate change. Our study added 14 more recent years in comparison with the study of Modarres & Sarhadi (2009).
*Significant at α = 10%.
Significant change point
The change point was detected by the Pettitt, SNHT, and Buishand tests with a two-tailed hypothesis at the 5% significance level (Figure 10). The year 1998 was detected as the predominant change point in the east and south of Iran by almost all tests. Results showed a higher sensitivity of Pettitt's test in comparison with the SNHT and Buishand tests. Pettitt's test revealed a broad range of change years (1997–2005), although the year 1998 was the predominant change point in areas where detection was positive. Interestingly, a similar spatial pattern could be observed in the detection of the decreasing trend in annual precipitation (Figure 6(a)) and in the change point tests (Figure 10).
The traditional TS structure focuses only on the time dimension. On the other hand, ‘GD’, a term that has become increasingly common since the launch of satellite sensors in the RS era, deals with regular spatial data (mesh) in a snapshot. In the current paper, the concepts of CTS, SMD, TMD, and mCTS are introduced. We believe that CTS can bridge the traditional geo-datasets (especially ground-based data; e.g., binary, txt, ascii, csv, excel) to the real-world dimensions (time and space) in order to facilitate data processing (in scripting environments) for geoscientists.
In this study, we applied trend and change point detection methods on the CTS, which is a well-suited data structure to perform any time series analysis and probability-based analysis (e.g., copula) or any other methods applicable to in-situ TS (e.g., wavelet). Furthermore, the CTS is perfectly adaptable for spatial analysis (e.g., entropy and homogeneity) or different machine learning approaches (e.g., principal component analysis, clustering, bagging, and boosting). We chose the trend and change point methods because they are well-practiced and provide pre-requisite information for climate change studies.
Numerous studies have focused on trend and change point occurrence within precipitation gauge-based data series. However, due to the lack of explicit spatial dimension in such studies, spatial inferences of the results are not straightforward. Accordingly, this study focused on the spatial distribution of precipitation trend and change point in Iran in the context of CTS database. The precipitation data of some 102 synoptic stations were employed to generate 25 CTS-based maps at annual, seasonal, and monthly scales via Python programming. Moreover, temporally averaged precipitation maps at different time scales were constructed across Iran as the basis for further analysis.
Trend analyses were applied to the CTS at various time scales. The MK test revealed significant trends at the 5% and 10% levels. A downward trend was identified in the annual, winter, and spring maps, while an upward trend was detected in autumn precipitation. Furthermore, monthly precipitation maps showed negative and positive trends in March and November precipitation, respectively. The cellular trend results were mostly in agreement with similar studies at point scale, except where the data periods differed.
Change point analysis conducted by Pettitt's test, SNHT, and the Buishand test revealed that the year 1998 was a significant change point in precipitation. This entirely coincides with the annual downward precipitation trend in the east and south of Iran.
All in all, the proposed integrated CTS, SMD, TMD, and mCTS concept is well-suited for loading into a matrix-based programming language (e.g., Python, R, MATLAB). Thus, there is good potential to define and use different numerical and statistical functions (in this study: MK, Pettitt, SNHT, and Buishand) in scripting environments. CTS enhances the functionality of these languages by enabling spatio-temporal analysis in a single environment. This concept could be used to link time series with geographic data and to support managers in spatial decision-making.
Suggestions and future directions
Different hydro-climatological variables (e.g., temperature, solar radiation, humidity, and wind speed) could be produced in the CTS format. Also, different interpolation methods (e.g., kriging, co-kriging, and spline) could be studied and implemented. CTS and meta data are flexible joint structures that facilitate scripting for spatio-temporal analysis. Different cell sizes (i.e., resampling) could be generated by spatial query (via SMD). 3D analysis (time and space simultaneously) could be applied to CTS to investigate the dynamics of climatic variables. Data providers could offer more online options, including CTS, when publishing in-situ, RS, and reanalyzed data.