Autoregressive integrated moving average (ARIMA) models are used worldwide for modeling and predicting time series, but how many of the available recorded observations should be used in modeling to achieve the best results is debatable. The length of the data can significantly affect the results of ARIMA models. This article investigates the effect of different data lengths on prediction accuracy. For this purpose, 732 monthly streamflow observations from the Kortian gauging station in the Kortian Stream watershed were used. To study the impact of the type of data (monthly or seasonal observations) on the accuracy of the modeling results, the monthly data were converted into seasonal data and the results of monthly and seasonal modeling were compared. Multiplicative ARIMA models were fitted for both the monthly and the seasonal modeling. The monthly modeling presented more precise results than the seasonal modeling: the sums of squared errors of the monthly and seasonal modeling were 0.9408 and 2.5, respectively. For the monthly modeling, five different lengths of data were used. The C1 model used the last 60 observations, C2 the last 120, C3 the last 240, C4 the last 480, and C5 the last 708. To test the precision of the models, 24 observations were set aside. Among the C1 to C5 models, the C4 model presented the best results in predicting 2 years ahead and C1 had the worst results.

  • ARIMA modeling and multiplicative ARIMA modeling have proved effective in water resources management.

  • Due to the lack of observational data and recent droughts in Iran, accurate prediction of catchment discharge is of great importance.

  • The length of the observational data used for modeling is important for precise prediction.

AIC: Akaike information criterion

MS: Mean square error

ACF: Autocorrelation function

PACF: Partial autocorrelation function

ARIMA: Autoregressive integrated moving average

SARIMA: Seasonal autoregressive integrated moving average

LSTM: Long short-term memory

ADF: Augmented Dickey–Fuller

A time series is a series of values of some magnitude obtained at consecutive times, often at equal intervals. The goal of time series analysis is to determine the regularity and identify the behavior of the variables in order to predict the future (Bowerman et al. 1979; Salas 1996; AliAhmadi et al. 2021). Methods used to formulate time series models and forecast future data are generally divided into two categories: quantitative methods such as Box–Jenkins models, moving average models, simple exponential smoothing, and autoregressive integrated moving average models, and qualitative methods such as brainstorming, Delphi, and the nominal group technique (Azar & Momeni 2015; AliAhmadi et al. 2021).

Katimon et al. (2018) applied the autoregressive integrated moving average (ARIMA) model for modeling water quality and hydrological variables of the Johor River, Malaysia. Abdoli et al. (2020) compared the ARIMA model and the long short-term memory (LSTM) model for time series with permanent fluctuation. They illustrated that the LSTM model outperformed the ARIMA model in producing predictions. In short-term forecasting, both models worked well; however, when the prediction horizon increased, the precision of both models decreased. Asakereh & Yousefizadeh (2015) used ARIMA and seasonal autoregressive integrated moving average (SARIMA) models to simulate average monthly and annual temperature changes. AliAhmadi et al. (2021) used the SARIMA time series technique to predict the mass flow rate of the Hirmand River. They concluded that by increasing the non-seasonal moving average order, the model's ability to estimate the monthly flow rate decreases. The effect of the length of past recorded observations on forecasting in non-seasonal ARIMA models was examined by Mwenda et al. (2015). They suggested that, to obtain the best ARIMA model, both the models recommended by the Box–Jenkins methodology and models developed from different observation lengths, always beginning with the most recent data values, should be considered. Mohan & Arumugam (1995) used a seasonal ARIMA model and a Winters' model to predict weekly reference crop evapotranspiration series. They showed that the performance of both models was satisfactory and that the resulting prediction errors were negligible. Thus, the models could be used widely in practice. Suhartono (2011) investigated two monthly data sets, namely the international airline passenger time series and tourist arrivals to Bali, Indonesia. For this purpose, subset, additive, and multiplicative SARIMA models were used. The results showed that the subset SARIMA model gave better predictions for the airline passenger time series and the additive SARIMA model gave more precise predictions for the tourist arrivals time series.

Time series ARIMA models are popular in modeling and forecasting time series (Khairuddin et al. 2019). However, there are still some limitations (Wang et al. 2020). One is the length of data required to formulate the ARIMA model and produce highly accurate predictions (Katimon et al. 2018). The precision with which ARIMA models predict future data depends on the time series length (Box et al. 2015). At least 40 or 50 observations for modeling and forecasting were recommended by Box & Jenkins (1976). Using the last 45–60 years of observations was recommended by Chen (2008) to achieve more accurate predictions in ARIMA models. It was suggested by Hyndman & Kostenko (2007) that the number of data used in statistical models depends on how many parameters are estimated and how much random variation exists in the data. Mwenda et al. (2015) examined 287 observations of weekly solid waste production to test the effect of data length on the precision of predictions. They illustrated that for predicting one week ahead or 9–12 weeks ahead, using the 120 past observations gave the best results, whilst for forecasting 2–8 weeks ahead, the 260 past observations gave better results. They concluded that using too few past observations reduces the accuracy of the predicted values. However, increasing the length of the past data when formulating the models does not necessarily lead to the best results. Qin et al. (2019) investigated the past 30 years of undergraduate student enrollment data across the top 10 historically black colleges and universities in the US for a simulation study and 35 years of the fall term enrollment data of the undergraduate students at Howard University for an empirical study. They illustrated that in the simulation case, using the past 5 years had the lowest accuracy and the past 20 years provided the highest precision in predicting 10 years later. They also concluded that using the past 20 years had the greatest precision in predicting 10 years later.

Given that observational data are available for different numbers of years, how much past data should be used for modeling and predicting future data in order to obtain the greatest effectiveness and accuracy has always been a matter of debate among researchers. The more accurate the forecasts, the better the performance in management, utilization, and future planning. This article examines the effect of different monthly data lengths on the modeling and forecasting accuracy of monthly streamflow values recorded at the Kortian gauging station, with the aim of determining classifications based on the number of available observations and the number of recent data required to obtain the best modeling performance. The paper explores how much past observational data should be included in the model to provide the most accurate results. It also investigates whether modeling based on monthly data or modeling based on seasonal data is more accurate. To this end, the monthly data were converted into seasonal data to examine whether the results might become more accurate.

Case study

The Kortian Stream watershed is located in the center of Razavi Khorasan province, which is situated in the north-east of Iran. Monthly streamflow values (m³/s) recorded at the Kortian gauging station from 1953 to 2013 were gathered for use in this article. The data were validated and analyzed and were imported to MINITAB v.2021. The Kortian Stream watershed covers approximately 4,200 hectares. It lies at an altitude of 1,232 meters above sea level and is bounded by 36° 6′ and 36° 13′ N latitude and 59° 17′ and 59° 37′ E longitude (Javanbakht et al. 2008; Dastorani et al. 2012). The location of the Kortian Stream watershed, the distribution of rivers and gauging stations, and the elevation conditions are presented in Figure 1.
Figure 1

Kortian Stream watershed. (a) Location of Razavi Khorasan province in Iran, (b) distribution of the Kortian Stream watershed and gauging stations in Razavi Khorasan province, (c) distribution of rivers around the study area, and (d) elevation of the Kortian Stream watershed.


It should be noted that in all the modeling in this article, the last 2 years of recorded observations were withheld from model fitting; after the modeling and the 2-year forecasting were completed, the predicted data were compared with these true data.

Materials

The data of the Kortian gauging station located in the Kortian Stream watershed were collected from 1953 to 2013. Then, the data were processed and sorted in the form of an Excel file, and the data were verified and validated. Finally, the data were entered into MINITAB v.2021 to perform the modeling process.

The ARIMA modeling application and procedure

The ARIMA model introduced by Box et al. (1994), often referred to as the Box–Jenkins model, consists of three sections: an autoregressive section, an integrated section, and a moving average section. The autoregressive section describes the relationship between past and present observations, the integrated section gives the number of times the time series is differenced, and the moving average section describes the autocorrelation structure of the error (Katimon et al. 2018; Wei 2006; Salas et al. 1980). The ARIMA method studies and predicts univariate and multivariate time series data. The past values of the time series, the past errors, and the past and present values are the three main elements that an ARIMA model uses as a linear combination to predict a value in the time series. When several time series are used as input variables for the ARIMA model, the model can be called the ARIMAX model. Time series used for ARIMA modeling must be non-periodic. A non-seasonal ARIMA model is written as ARIMA (p,d,q). The p and q terms are determined from the autocorrelation function (ACF) and the partial autocorrelation function (PACF). The general form of ARIMA (p,d,q) can be written as:
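$$\varphi(B)\,(1-B)^{d}\,Z_t = \theta(B)\,a_t \tag{1}$$
where $\varphi(B) = 1 - \varphi_1 B - \dots - \varphi_p B^p$ contains the autoregressive coefficients $\varphi$, $\theta(B) = 1 - \theta_1 B - \dots - \theta_q B^q$ contains the moving average coefficients $\theta$, $B$ is the backward operator ($B Z_t = Z_{t-1}$), $Z_t$ is the observed time series, and $a_t$ is the random error term.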

To eliminate within-the-year periodicity in hydrological time series, a seasonal differencing approach can be applied. Kavvas & Delleur (1975) showed that seasonal differencing of monthly hydrological time series eliminates the periodic behavior. The basic property of a seasonal time series with period ω is that observations separated by ω time intervals behave similarly. Hence, the backward operator plays an important role in seasonal time series analysis. In observational data with periodic behavior, two time intervals have to be analyzed carefully: the interval between consecutive observations and the interval between observations one period apart (Stage & Statements 2014).
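As a minimal sketch of why seasonal differencing removes within-the-year periodicity, the Python fragment below applies the operator (1 − B^ω) to a synthetic, purely periodic monthly series. The series, the function name `seasonal_difference`, and the choice ω = 12 are illustrative assumptions, not data or code from this study.

```python
import math


def seasonal_difference(series, period):
    """Apply (1 - B**period): subtract the value one full period earlier."""
    return [series[t] - series[t - period] for t in range(period, len(series))]


# Hypothetical monthly series: one annual cycle repeated for 3 years.
cycle = [2.0 + math.sin(2 * math.pi * m / 12) for m in range(12)]
series = cycle * 3

diffed = seasonal_difference(series, period=12)
print(len(diffed))                   # 36 - 12 = 24 values remain
print(max(abs(v) for v in diffed))   # 0.0 for this idealized series
```

Because values one full period apart are identical in this idealized series, every seasonally differenced value is zero. With real streamflow data, a non-periodic remainder is left for the ARIMA model to describe.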

A multiplicative ARIMA model, written as ARIMA (p,d,q) (P,D,Q)ω, consists of non-seasonal and seasonal parts. The P term (seasonal autoregressive term) is the number of lags from the previous season that are significantly correlated with the current season. The D and Q terms are the seasonal difference term and seasonal moving average term, respectively. The general form of ARIMA (p,d,q) (P,D,Q)ω can be written as:
$$\varphi(B)\,\Phi(B^{\omega})\,(1-B)^{d}\,(1-B^{\omega})^{D}\,Z_t = \theta(B)\,\Theta(B^{\omega})\,a_t \tag{2}$$
where $\varphi(B)$ and $\theta(B)$ are the non-seasonal autoregressive and moving average polynomials of orders p and q, $\Phi(B^{\omega})$ and $\Theta(B^{\omega})$ are the seasonal autoregressive and moving average polynomials of orders P and Q, $B$ is the backward operator, ω is the period of the data, $Z_t$ is the observed time series, and $a_t$ is the random error term. Generally, three steps should be taken to build ARIMA models: identification of the model; estimation of the model parameters together with diagnostic checking of whether the model meets the assumptions of the analysis; and projection. In the identification step, the type of model and its order are determined for use in the ARIMA model. The type and order of the model are determined based on the ACF and PACF plots. In the estimation and diagnostic checking step, the values of the parameters of the model are estimated. Parameter estimation is based on the ACF and PACF plots. Diagnostic checking is used to check the correctness and accuracy of the estimated parameters. It includes the residual autocorrelation test, the portmanteau lack-of-fit test, the model inadequacy test, and so on (Yurekli & Kurunc 2005). In the projection step, based on the selected model and the validated model parameters, the time series is predicted up to the desired number of time steps (Shamshirband et al. 2020). In this article, the type of model was identified by time series plots and the augmented Dickey–Fuller (ADF) test (Fathian et al. 2015). The ACF and PACF plots were used to estimate the parameters p, q, P, and Q. The most-matched model was selected based on the Akaike information criterion (AIC), the mean square error (MS), and the Ljung–Box chi-square statistics. In order to eliminate non-stationarity, the Box–Cox transformation method and the mean differencing method were used consecutively in this article.
It should be noted that, in this article, modeling was first performed on the monthly data; then the data of every three consecutive months were summed to prepare the seasonal data, and seasonal modeling was performed on them. Finally, the results of all the modeling were compared. For all modeling and forecasting, MINITAB v.2021 was used. The general process of modeling time series data is presented in Figure 2.
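The term multiplicative refers to the non-seasonal and seasonal operators in B being multiplied together. As a rough illustration, the Python sketch below expands the product of a non-seasonal AR(1) operator and a seasonal AR(1) operator with ω = 12. The coefficient values 0.5 and 0.8 and the helper names are hypothetical, chosen only to show the structure; they are not estimates from this study.

```python
def multiply_lag_polynomials(a, b):
    """Multiply two polynomials in B; index k holds the coefficient of B**k."""
    product = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            product[i + j] += ai * bj
    return product


def ar_polynomial(coeffs, step=1):
    """Build 1 - c1*B**step - c2*B**(2*step) - ... as a coefficient list."""
    poly = [0.0] * (len(coeffs) * step + 1)
    poly[0] = 1.0
    for k, c in enumerate(coeffs, start=1):
        poly[k * step] = -c
    return poly


phi = 0.5    # hypothetical non-seasonal AR coefficient
Phi = 0.8    # hypothetical seasonal AR coefficient
omega = 12   # monthly data with an annual period

combined = multiply_lag_polynomials(ar_polynomial([phi]),
                                    ar_polynomial([Phi], step=omega))
nonzero = {lag: c for lag, c in enumerate(combined) if c != 0.0}
print(nonzero)   # non-zero coefficients appear at lags 0, 1, 12 and 13
```

The product contains a cross term at lag 13 (= 1 + 12) with coefficient φΦ, which an additive combination of the same two operators would not contain.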
Figure 2

The process of Box–Jenkins modeling time series.


As shown in Figure 2, generally, the actions that need to be done in the Box–Jenkins modeling process include drawing time series plots, drawing ACF and PACF plots of time series data and the residuals, applying periodic and seasonal differencing, carrying out various tests and validations for testing the parameters of the model and determining the type of model, and so on.
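The residual checks mentioned above rest on the sample autocorrelations of the residuals. The following Python sketch computes the sample ACF and the Ljung–Box statistic Q = n(n+2) Σ r_k²/(n−k) in plain Python. It is an illustration under stated assumptions, not the MINITAB procedure used in this article; the function names and the alternating residual series are hypothetical.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of the sequence x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]


def ljung_box_q(residuals, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum over k of r_k**2 / (n - k)."""
    n = len(residuals)
    r = sample_acf(residuals, max_lag)
    return n * (n + 2) * sum(rk * rk / (n - k)
                             for k, rk in enumerate(r, start=1))


# Hypothetical residuals that alternate in sign, i.e. strongly autocorrelated.
residuals = [1.0, -1.0] * 10
r = sample_acf(residuals, max_lag=1)
q = ljung_box_q(residuals, max_lag=1)
print(round(r[0], 2), round(q, 2))   # r_1 = -0.95, Q is about 20.9
```

In practice Q is compared with a chi-square quantile whose degrees of freedom are the number of lags minus the number of estimated ARMA parameters; a large Q, as in this contrived example, indicates that the residuals are not white noise and the model should be rejected.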

Monthly modeling

For monthly modeling and forecasting, five conditions (C1, C2, C3, C4, and C5) based on the time series length were considered. C1 used the last 5 years of data, C2 the last 10 years, C3 the last 20 years, C4 the last 40 years, and C5 all 59 available years of observations.
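The five conditions can be sketched as slices of the record once the last 24 months have been withheld for testing. The Python fragment below assumes that each window is taken from the end of the retained (training) part of the 732-month series, which is consistent with the stated lengths of 60 to 708 observations; the function name and the placeholder values are hypothetical.

```python
def split_windows(series, horizon=24, window_years=(5, 10, 20, 40, 59)):
    """Withhold the last `horizon` months, then take the most recent years."""
    train, test = series[:-horizon], series[-horizon:]
    windows = {f"C{i}": train[-12 * years:]
               for i, years in enumerate(window_years, start=1)}
    return windows, test


# Placeholder values standing in for the 732 monthly streamflow records.
monthly_flow = list(range(732))

windows, test = split_windows(monthly_flow)
print({name: len(w) for name, w in windows.items()}, len(test))
# {'C1': 60, 'C2': 120, 'C3': 240, 'C4': 480, 'C5': 708} 24
```

Every window ends at the same month, so all five models forecast the same 24 withheld observations and their errors are directly comparable.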

The time series plot for C1 is shown in Figure 3. The time series plot did not exhibit any trend. According to the ADF test, ACF plot, and PACF plot, the C1 series was non-stationary and had periodic behavior. Consequently, the Box–Cox transformation method, monthly differencing, and seasonal differencing were performed. Table 1 presents the suitable models for the C1 model.
Table 1

Suitable multiplicative ARIMA models for the C1 model

| Model number | d | D | n | Model | μ | S² | σ̂ε² | Mean square error | Akaike information criterion |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 48 | ARIMA (1,0,1) (1,1,1) | 0.18762 | 0.742 | 0.19831 | 0.214 | −71.66 |
| 2 | 0 | 2 | 36 | ARIMA (1,0,1) (1,2,1) | 0.11334 | 2.68441 | 0.36746 | 0.402 | −30.04 |
| 3 | 1 | 1 | 47 | ARIMA (3,1,1) (1,1,1) |  | 0.3348 | 0.19148 | 0.215 | −67.69 |
| 4 | 1 | 2 | 35 | ARIMA (1,1,1) (1,2,1) |  | 2.4945 | 0.35035 | 0.385 | −30.71 |
Figure 3

Time series plot for the C1 model.

As can be seen in Table 1, model number 1 has the lowest AIC and MS values. Accordingly, the most-matched model for the C1 model is ARIMA (1,0,1) (1,1,1)12. Using this model, a 24-month forecast was performed, and Figure 4 shows a comparison between the predicted data and the true data.
Figure 4

The comparison between true data and projection data for the C1 model.

The time series plot for the C2 model is presented in Figure 5. The time series plot exhibits no trend. According to the ADF test, ACF plot, and PACF plot, the C2 series was non-stationary and had seasonal movements. Thus, the Box–Cox transformation method and monthly and seasonal differencing were performed successively. Table 2 lists the most suitable models for the C2 model.
Figure 5

Time series plot for the C2 model.

According to Table 2, model number 1 presented the lowest AIC and MS values and was regarded as the best-fitted model. Figure 6 illustrates the comparison between the 24-month data predicted by the model and the actual recorded data.
Table 2

Suitable multiplicative ARIMA models for the C2 model

| Model number | d | D | n | Model | μ | S² | Mean square error | σ̂ε² | Akaike information criterion |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 108 | ARIMA (1,0,1) (2,1,2) | 0.05232 | 0.65059 | 0.176 | 0.16583 | −188.05 |
| 2 | 0 | 2 | 96 | ARIMA (1,0,1) (2,2,2) | 0.10559 | 2.34584 | 0.282 | 0.26286 | −122.27 |
| 3 | 1 | 1 | 107 | ARIMA (3,1,1) (1,1,2) |  | 0.58083 | 0.182 | 0.16728 | −181.32 |
| 4 | 1 | 2 | 95 | ARIMA (3,1,1) (2,2,2) |  | 2.06869 | 0.289 | 0.26473 | −116.26 |
| 5 | 2 | 2 | 94 | ARIMA (5,2,1) (2,2,1) | 0.01443 | 4.74145 | 0.47 | 0.42431 | −66.58 |
Figure 6

The comparison between the true data and predicted data for the C2 model.

The same procedure was applied to the C3, C4, and C5 models. The time series plots for the C3, C4, and C5 models are presented in Figure 7. No trend was apparent, and based on the ADF tests, ACF plots, and PACF plots, these series were non-stationary and had periodic behavior. Therefore, the Box–Cox transformation method and monthly and seasonal differencing were performed. Table 3 presents suitable models for the C3 model, Table 4 appropriate models for the C4 model, and Table 5 acceptable models for the C5 model.
Table 3

Suitable multiplicative ARIMA models for the C3 model

| Model number | d | D | n | Model | μ | S² | Mean square error | σ̂ε² | Akaike information criterion |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 228 | ARIMA (1,0,3) (4,1,1) | 0.097 | 0.584 | 0.217 | 0.1997 | −357.3 |
| 2 | 0 | 2 | 216 | ARIMA (1,0,2) (4,2,2) | 0.017 | 1.757 | 0.2535 | 0.24412 | −296.6 |
Table 4

Appropriate multiplicative ARIMA models for the C4 model

| Model number | d | D | n | Model | μ | S² | Mean square error | σ̂ε² | Akaike information criterion |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 468 | ARIMA (1,0,1) (3,1,3) | 0.03133 | 0.554 | 0.177 | 0.17327 | −814.36 |
| 2 | 0 | 2 | 456 | ARIMA (1,0,4) (4,2,1) | 0.0085 | 1.6 | 0.206 | 0.20174 | −717.95 |
| 3 | 1 | 1 | 467 | ARIMA (3,1,1) (4,1,1) | 0.00277 | 0.415 | 0.186 | 0.17768 | −796.87 |
| 4 | 1 | 2 | 455 | ARIMA (3,1,1) (4,2,2) | −0.0057 | 1.237 | 0.215 | 0.21075 | −698.5 |
| 5 | 2 | 2 | 454 | ARIMA (4,2,1) (4,2,1) | 0.0025 | 3.043 | 0.327 | 0.32017 | −505.1 |
| 6 | 1 | 1 | 467 | ARIMA (4,1,1) (4,1,1) | 0.00277 | 0.415 | 0.195 | 0.19084 | −761.5 |
Table 5

Acceptable multiplicative ARIMA models for the C5 model

| Model number | d | D | n | Model | μ | S² | Mean square error | σ̂ε² | Akaike information criterion |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 696 | ARIMA (1,0,4) (4,1,1) | 0.012 | 0.54 | 0.179 | 0.17603 | −1197.1 |
| 2 | 0 | 2 | 684 | ARIMA (1,0,4) (4,2,1) | 0.016 | 1.488 | 0.2096 | 0.20667 | −1066.4 |
| 3 | 1 | 1 | 695 | ARIMA (3,1,3) (4,1,1) |  | 0.4 | 0.1827 | 0.1802 | −1177.1 |
| 4 | 1 | 2 | 683 | ARIMA (3,1,1) (4,2,2) |  | 1.13 | 0.232 | 0.22917 | −996.3 |
| 5 | 2 | 2 | 682 | ARIMA (4,2,1) (4,2,1) | 0.0016 | 2.8 | 0.31 | 0.30551 | −796.7 |
| 6 | 2 | 1 | 694 | ARIMA (4,2,1) (4,1,1) | 0.0018 | 0.963 | 0.252 | 0.24847 | −954.4 |
Figure 7

The time series plots for C3, C4, and C5 models.


According to Tables 4 and 5, model number 1 had the lowest AIC and MS values in both cases. Thus, the most-matched model for the C4 model is ARIMA (1,0,1) (3,1,3)12 and the most-matched model for the C5 model is ARIMA (1,0,4) (4,1,1)12. For the C3 model, although model number 1 presented the lowest AIC and MS values, it was rejected on the basis of the Ljung–Box chi-square statistics. As a result, model number 2 was accepted.
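The selection logic described here (lowest AIC and MS among the candidates that pass the Ljung–Box check) can be sketched as follows, using the two C3 candidates from Table 3. The `Candidate` type, the `select_model` helper, the tie-breaking order (AIC first, then MS), and the flag marking model 2 as passing the check are assumptions made for illustration; the article states only that model 1 was rejected and model 2 accepted.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    number: int
    order: str
    aic: float
    ms: float
    passes_ljung_box: bool   # outcome of the residual check (assumed flag)


def select_model(candidates):
    """Return the admissible candidate with the lowest (AIC, MS), or None."""
    admissible = [c for c in candidates if c.passes_ljung_box]
    if not admissible:
        return None
    return min(admissible, key=lambda c: (c.aic, c.ms))


# AIC and MS values copied from Table 3.
c3_candidates = [
    Candidate(1, "ARIMA (1,0,3) (4,1,1)", -357.3, 0.217, False),
    Candidate(2, "ARIMA (1,0,2) (4,2,2)", -296.6, 0.2535, True),
]

chosen = select_model(c3_candidates)
print(chosen.number)   # 2
```

Filtering on the residual check before ranking reflects the point made above: a lower AIC does not help if the residuals of the model are still autocorrelated.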

The comparison between the real recorded observations and predicted data for the C3 model is presented in Figure 8. Figure 9 illustrates the comparison between the true data and predicted data for the C4 model; the comparison between the actual data and forecast data for the C5 model is displayed in Figure 10.
Figure 8

The comparison between the real data and predicted data for the C3 model.

Figure 9

The comparison between the true data and forecast data for the C4 model.

Figure 10

The comparison between the actual recorded observations and predicted data for the C5 model.


According to Table 6, using the last 40 years of recorded data for modeling and forecasting gave the best results.

Table 6

The sum of squares error for C1 to C5 models

| Model | Number of past years of data used in the model | Best-fitted model | Sum of squares error between the 2-year actual data and the predicted data |
|---|---|---|---|
| C1 | 5 | ARIMA (1,0,1) (1,1,1) | 6.25 |
| C2 | 10 | ARIMA (1,0,1) (2,1,2) | 2.068 |
| C3 | 20 | ARIMA (1,0,0) (4,2,1) | 3.15 |
| C4 | 40 | ARIMA (1,0,1) (3,1,3) | 0.9408 |
| C5 | 59 | ARIMA (1,0,4) (4,1,1) | 1.46 |
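The comparison criterion in Table 6 is the sum of squared errors between the withheld observations and the forecasts. A minimal Python sketch of that calculation is given below; the function name and the three-value example are hypothetical, while the dictionary reproduces the values reported in Table 6.

```python
def sum_of_squared_errors(actual, predicted):
    """Sum of (actual - predicted)**2 over paired observations."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Hypothetical three-month example, not data from this study.
actual = [1.0, 2.0, 4.0]
predicted = [1.5, 2.0, 3.0]
print(sum_of_squared_errors(actual, predicted))   # 1.25

# Values reported in Table 6; the smallest error identifies the best length.
table6_sse = {"C1": 6.25, "C2": 2.068, "C3": 3.15, "C4": 0.9408, "C5": 1.46}
best = min(table6_sse, key=table6_sse.get)
worst = max(table6_sse, key=table6_sse.get)
print(best, worst)   # C4 C1
```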

Seasonal modeling

The monthly data were summed over every 3 consecutive months to obtain seasonal data. Because the last 40 years of data provided the best results in the monthly modeling (Table 6), only this length was examined in the seasonal modeling. Figure 11 presents the time series plot of the last 40 years of seasonal data.
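The conversion from monthly to seasonal data can be sketched in a few lines of Python. The function name is hypothetical, and two details are assumptions because the article does not state them: the first month of the series is taken as the start of a season, and an incomplete trailing block of fewer than 3 months is dropped.

```python
def monthly_to_seasonal(monthly):
    """Sum each block of 3 consecutive months; drop an incomplete last block."""
    usable = len(monthly) - len(monthly) % 3
    return [sum(monthly[i:i + 3]) for i in range(0, usable, 3)]


# Hypothetical one-year example: 12 monthly values give 4 seasonal values.
monthly = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
print(monthly_to_seasonal(monthly))   # [6, 15, 24, 33]
```

Aggregating in this way reduces the number of observations by a factor of three and changes the seasonal period from ω = 12 to ω = 4.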
Figure 11

The time series plot of seasonal data.


The data exhibited no trend, and based on the ADF test, ACF plot, and PACF plot, the data were non-stationary and showed periodic behavior. Thus, the Box–Cox transformation method, mean differencing, and seasonal differencing with the time period ω = 4 were performed. It should be noted that, as in the monthly modeling, the last 2 years of recorded observations were withheld to test the precision of the accepted models.
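For reference, the Box–Cox transformation applied before differencing has the standard form (x^λ − 1)/λ, with the natural logarithm as the limiting case λ = 0. The Python sketch below implements the transform and its inverse. The value λ = 0.5 and the flow values are hypothetical, since the article does not report the λ selected in MINITAB.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform of strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def inverse_box_cox(values, lam):
    """Map transformed values (for example forecasts) back to original units."""
    if lam == 0:
        return [math.exp(v) for v in values]
    return [(lam * v + 1) ** (1 / lam) for v in values]


flows = [0.5, 1.0, 4.0]   # hypothetical seasonal flows
lam = 0.5                 # hypothetical transformation parameter

transformed = box_cox(flows, lam)
restored = inverse_box_cox(transformed, lam)
print([round(v, 4) for v in transformed])   # [-0.5858, 0.0, 2.0]
print([round(v, 4) for v in restored])      # [0.5, 1.0, 4.0]
```

The inverse matters in practice because forecasts produced on the transformed scale have to be converted back before they are compared with the withheld observations.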

Table 7 presents the appropriate models chosen for the seasonal modeling. As the table illustrates, model number 1 had the lowest AIC and MS values. Therefore, the most-matched model for the seasonal modeling is ARIMA (1,0,2) (4,1,1)4. The comparison between the last 2 years predicted data and true recorded observations is presented in Figure 12. The sum of squares error of the true data and predicted data was 2.5.
Table 7

Appropriate multiplicative ARIMA models for the seasonal modeling

| Model number | d | D | n | Model | μ | S² | Mean square error | σ̂ε² | Akaike information criterion |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 232 | ARIMA (1,0,2) (4,1,1) | −0.01596 | 0.6413 | 0.2505 | 0.24166 | −321.492 |
| 2 | 0 | 2 | 228 | ARIMA (2,0,1) (4,2,1) | −0.01627 | 1.831 | 0.2865 | 0.27759 | −284.207 |
| 3 | 1 | 1 | 231 | ARIMA (1,1,1) (4,1,1) | −0.0068 | 0.602 | 0.2627 | 0.25118 | −313.146 |
| 4 | 1 | 2 | 227 | ARIMA (1,1,1) (4,2,1) | 0.0069 | 1.805 | 0.2925 | 0.28384 | −279.871 |
| 5 | 2 | 1 | 230 | ARIMA (2,2,1) (3,1,1) | 0.0086 | 1.368 | 0.609 | 0.59306 | −112.166 |
Figure 12

The comparison between the true data and predicted data.


In general, since the conversion of monthly data to seasonal data reduced the accuracy of modeling, it seems that the shorter the time step of the observation data, the better the results in ARIMA modeling.

Regarding the monthly modeling, the case study contains a very large number of observations with small values (less than 1,000 L/s). Increasing the length of the data used for modeling increases the number of these small values, so the mean of the whole sample becomes smaller and approaches them. As a result, the mean moves further from the peak values (more than 6,000 L/s), which contributes to errors in ARIMA modeling and makes the model perform unsatisfactorily at the peak points.

According to Figure 7, when a shorter sample is selected (the C3 model), a peak point (a value greater than 6,000 L/s) appears only once in the sample. Considering that the ARIMA model is a type of long-term memory model, the accuracy of the model results decreases at the peak points because of the lack of peak-value points in C3. On the other hand, the C4 model includes more peak-value points (values more than 6,000 L/s); in fact, it includes all the peak points of the available data. For this reason, the effectiveness of the model improved. Like the C4 model, the C5 model includes all the peak points, but the number of low-value points (values less than 1,000 L/s) is larger than in the C4 model. This reduces the mean in the C5 model, and the modeling at peak-value points is associated with errors. Therefore, given that the ARIMA model is of the long-term memory type, it seems that the more peak points are used in the modeling without, at the same time, too many low-value points, the better the modeling results.

In this paper, the effect of the length of the recorded observations used in ARIMA modeling on prediction accuracy was investigated. Of the monthly and seasonal modeling, the monthly modeling gave the more accurate results. It can be concluded that converting monthly data into seasonal data does not improve the accuracy of predictions. Regarding the monthly modeling, C1 was the worst model. In predicting 2 years ahead, C4 presented the best results, followed by C5. Thus, better-fitting predictions were obtained when using the last 40 years of streamflow observations from the Kortian gauging station. Therefore, it can be concluded that increasing the amount of data used for modeling and forecasting does not necessarily increase the accuracy of the forecasting results. The results of this article are in line with the results of other articles that have studied the effect of data length on forecasting accuracy for other watersheds. Also, based on Figures 4, 6, 8, 9, 10, and 12, it can be deduced that the ARIMA model does not perform accurately in predicting peaks and cannot be reliably used to predict peak values. However, as the number of data used for modeling increased, the accuracy of predicting peak values increased as well. In future work, to arrive at a classification that gives the number of data required for the most accurate multiplicative ARIMA modeling as a function of the number of available observations, it is necessary to examine different models over different ranges of observational data.

All relevant data are included in the paper or its Supplementary Information.

The authors declare there is no conflict.

Abdoli, G., MehrAra, M. & Ardalani, M. E. 2020 Comparing the prediction accuracy of LSTM and ARIMA models for time-series with permanent fluctuation. Periódico do Núcleo de Estudos e Pesquisas sobre Gênero e Direito, Centro de Ciências Jurídicas, Universidade Federal da Paraíba 9, 314–339.
AliAhmadi, N., Moradi, E., Hoseini, S. M. & Shahraki, A. S. 2021 Prediction of the mass flow rate of the Hirmand River: The application of the SARIMA time-series technique. Journal of Irrigation and Water Engineering 45, 172–191. (in Persian).
Asakereh, H. & Yousefizadeh, R. 2015 Simulation of average monthly and annual temperature changes using time series approach. The First Scientific Research Congress on the Development and Promotion of Agricultural Sciences, Natural Resources and Environment. Tehran, Iran.
Azar, A. & Momeni, M. 2015 Application of Statistics in Management, 1st edn. The Organization for Researching and Composing University Textbooks, p. 350. (in Persian).
Box, G. E. P. & Jenkins, G. M. 1976 Time Series Analysis: Forecasting and Control, 1st edn. Holden-Day, San Francisco, p. 575, ISBN: 0816211043.
Box, G. E. P., Jenkins, G. M. & Reinsel, G. C. 1994 Time Series Analysis: Forecasting and Control, 3rd edn. Prentice Hall, New Jersey.
Box, G. E. P., Jenkins, G. M., Reinsel, G. C. & Ljung, G. M. 2015 Time Series Analysis: Forecasting and Control, 5th edn. John Wiley & Sons, New Jersey.
Bowerman, B. L. & O'Connell, R. T. 1979 Time Series and Forecasting. PWS Publishing, North Scituate, MA.
Chen, C. K. 2008 An integrated enrollment forecasting model. IR Applications 15, 1–18.
Dastorani, M. T., Talebi, A., Heidari, A. & Poormohammadi, S. 2012 Investigating the effect of climate change on the amount of precipitation in the Torogh dam basin in Mashhad. 5th National Conference on Watershed Management and Soil and Water Resources Management. Kerman, Iran. (in Persian)
Fathian, F., Fakheri Fard, A., Dinpashoh, Y. & Mousavi Nadoshani, S. S. 2015 Testing for stationarity and nonlinearity of daily streamflow time series based on different statistical tests (Case study: Upstream basin rivers of Zarrineh Roud Dam). Journal of Water and Soil (Agricultural Sciences and Technology) 30, 1009–1024. (in Persian).
Hyndman, R. J. & Kostenko, A. V. 2007 Minimum sample size requirements for seasonal forecasting models. Foresight 6 (Spring), 12–15.
Javanbakht, M., Mousavi Harami, R., Torshizian, H., Sharifi, E. & Soukhtanlou, H. 2008 Sediment yield and study of fining trend in Torogh Dam watershed with emphasize on Moghan-Kortian sub basin. Journal of Geotechnical Geology 4, 97–107.
Katimon, A., Shahid, S. & Mohsenipour, M. 2018 Modeling water quality and hydrological variables using ARIMA: a case study of Johor River, Malaysia. Sustainable Water Resources Management 4, 991–998.
Kavvas, M. L. & Delleur, J. W. 1975 Removal of periodicities by differencing and monthly mean subtraction. Journal of Hydrology 26, 335–353.
Khairuddin, N., Aris, A. Z., Elshafie, A., Sheikhy Narany, T., Ishak, M. Y. & Isa, N. M. 2019 Efficient forecasting model technique for river stream flow in tropical environment. Urban Water Journal 16 (3), 183–192.
Mohan, S. & Arumugam, N. 1995 Forecasting weekly reference crop evapotranspiration series. Hydrological Sciences Journal 40, 689–702.
Mwenda, A., Kuznetsov, D. & Mirau, S. 2015 Analyzing the impact of historical data length in non-seasonal ARIMA models forecasting. Mathematical Theory and Modeling 5, 77–85.
Qin, L., Shanks, K., Phillips, G. A. & Bernard, D. 2019 The impact of lengths of time series on the accuracy of the ARIMA forecasting. International Research in Higher Education 4, 58–68.
Salas 1996 Applied Time Series in Hydrology. McGraw-Hill, Highlands Ranch, CO.
Salas, J. D., Delleur, J. W., Yevjevich, V. & Lane, W. L. 1980 Applied Modeling of Hydrologic Time Series. Water Resources Publications, Colorado.
Shamshirband, S., Hashemi, S., Salimi, H., Samadianfard, S., Asadi, E., Shadkani, S. & Chau, K. W. 2020 Predicting standardized streamflow index for hydrological drought using machine learning models. Engineering Applications of Computational Fluid Mechanics 14 (1), 339–350.
Stage, F. & Statements, U. A. P. 2014 The ARIMA Procedure. SAS Institute Inc., Cary, NC.
Suhartono, S. 2011 Time series forecasting by using seasonal autoregressive integrated moving average: subset, multiplicative or additive model. Journal of Mathematics and Statistics 7, 20–27.
Wang, Z., Fathollahzadeh Attar, N., Khalili, K., Behmanesh, J., Band, S. S., Mosavi, A. & Chau, K. W. 2020 Monthly streamflow prediction using a hybrid stochastic-deterministic approach for parsimonious non-linear time series modeling. Engineering Applications of Computational Fluid Mechanics 14 (1), 1351–1372.
Wei, W. W. 2006 Time Series Analysis: Univariate and Multivariate Methods, 2nd edn. Pearson Education, New York, NY.
Yurekli, K. & Kurunc, A. 2005 Testing the residuals of an ARIMA model on the Cekerek stream watershed in Turkey. Turkish Journal of Engineering and Environmental Sciences 29 (2), 61–74.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).