Time series forecasting with data mining models has been well documented across a wide variety of domains. In this work, time series of water level data recorded every hour at Cristobal Bay in Panama during the years 1909–1980 are employed to construct models suitable for predicting changes in sea level patterns. Four time lag assemblages of variable combinations of the time series are explored with a data mining tool to identify the optimal combinations for the dataset. The results, based on the assessment of the Cristobal data, show that in general the use of cross-validation with a longer time lag period led to more accurate forecasting than a shorter lag period. The study also suggests that data mining techniques using cross-validation, aided by an attribute evaluator, can be effectively used to model time series of sea level change in coastal areas, and changes in ecosystems that are by nature nonlinear and exhibit chaotic climatic changes in their physical behavior.

INTRODUCTION

Nowadays, as global warming is of great concern, knowledge about future variations in sea level is of great importance for the protection of coastal areas and low-lying residential areas, and for monitoring and predicting changes in complex marine ecosystems. It is also of high concern for the planning and construction of coastal structures, and for the development and implementation of ocean energy generation technologies (O'Rourke et al. 2010).

The prediction of sea level variations has always been a subject of intense interest to mankind, not only from a human point of view but also from an economic one. Among the most famous examples of flooding are those in the Venice Lagoon during November 1997, November 2000 and November 2001, when surge events reached heights between 100 and 118 cm. These have been the object of intense study using hydrodynamic models (Umgiesser et al. 2004).

Instantaneous measurements and time-dependent measurements of mean sea level (MSL) are not uniform seasonally, spatially or temporally. They vary due to the synergy of tidal changes, temperature, salinity, atmospheric pressure and large-scale sea currents (Chen et al. 2000; Douglas et al. 2009), and sometimes result in tidal waves and flooding. The transition from deterministic models of global ocean circulation, which have a coarse computational grid (tens of kilometres) and include sea levels among other parameters, to models at local fine scales is a difficult task that demands enormous computational resources (Carretero et al. 2000; Monbaliu et al. 2000).

To address the prediction of changes in sea level at coastal areas, several alternative data mining methods have been used that do not require computational grids, such as fuzzy logic (Long & Meesad 2014), artificial neural networks (ANNs) (Pashova & Popova 2011) and genetic algorithms (Peralta et al. 2010). For example, ANNs can approximate any nonlinear function, and can produce an approximate solution of a complex system without prior knowledge of the internal relationships among its components (Haykin 1999).

In this work we have followed nonlinear time series forecasting (TSF) analysis in an effort to understand the nonlinear dynamic behavior of the system in question, and to provide a prediction tool that is feasible for data with irregular patterns: series whose observed values are not lengthy, that show no cyclic or seasonal trends, or whose generating processes can differ greatly in complexity.

Since there are no data beyond the year 1980, to predict beyond this year is not straightforward. If we had had data beyond this time span, say up to 2010, we could then perform a prediction, as the amount of available data would be significant. On the other hand, climate has changed a lot since then, and the morphological conditions of the study site have changed through time. River discharge patterns may have changed as contributors of forcing functions, and the oscillating patterns of the water height, precipitation frequencies and the ongoing global sea level rise all influence the geophysical characteristics of the site.

As these points are all relevant to the study, we are also interested in comparing our results with those of studies in which cross-validation has recently been applied to time series datasets (Bergmeir & Benítez 2012), specifically because, as mentioned, our time series is not considerably lengthy and no data are available beyond 1980. Cross-validation uses the whole dataset for both training and testing, so for this experiment we combine cross-validation with attribute selection algorithms alongside the learning classifier, as cross-validation is the technique of choice for assessing prediction performance when data are limited.

RELATED WORK

Other than employing conventional statistical models for TSF (Makridakis et al. 2008), various authors in the scientific and engineering community have addressed the forecasting task with a wide range of TSF applications, from financial and economic series to natural physical phenomena. Many types of neural network architecture have been employed, such as recurrent neural networks, radial basis function networks, Bayesian neural networks, neuro-fuzzy networks, generalized regression neural networks, extreme learning machines and beta basis function neural networks, to mention only a few of an extensive list. The earliest and most popular type of network is the feed-forward multilayer perceptron (MLP) (Peralta et al. 2010; Donate et al. 2011).

More recently, some authors have used direct encoding schemes and others indirect encoding schemes. In recent years, a substantial literature has evolved on evolutionary artificial neural networks, and many forecasting applications have been carried out.

BACKGROUND ON DATA MINING AND TSF

Data mining, also called Knowledge Discovery in Databases, is the field of discovering novel and potentially useful information from large amounts of data (Rushing et al. 2005).

Contrary to standard statistical methods, data mining paradigms search for appealing information without the need for a priori hypotheses; the patterns that can be discovered depend upon the data mining tasks used. For this purpose, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data and predictive data mining tasks that attempt to perform predictions based on inference from available data. These techniques are often more powerful, flexible, and efficient for exploratory analysis than the statistical techniques (Bregman & Mackenthun 2006). The most commonly used techniques in data mining are: ANNs, rule induction, the nearest neighbor method, memory-based reasoning, logistic regression, discriminant analysis and decision trees. Application of these techniques would depend on the type of problems we are trying to solve.

For TSF, the most widely employed ANN algorithms are the MLPs (Zhang 2007). These are characterized by the feed-forward architecture of an input layer, one or more hidden layers, and an output layer. The nodes in each layer are connected to those in the immediate next layer by acyclic links. In practical applications, it is enough to consider a single hidden layer structure (Kamruzzaman et al. 2006).
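
As an illustration of such a single-hidden-layer structure, the following sketch fits a one-hidden-layer MLP regressor. It uses scikit-learn rather than the Weka toolkit employed in this study, and all data, layer sizes and parameter values are illustrative only.

```python
# Illustrative single-hidden-layer MLP regressor (scikit-learn, not Weka).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))        # three lagged inputs (toy data)
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] ** 2       # a simple nonlinear target

# One hidden layer of 10 logistic nodes, feed-forward, as described above.
mlp = MLPRegressor(hidden_layer_sizes=(10,), activation="logistic",
                   solver="lbfgs", max_iter=2000, random_state=0)
mlp.fit(X, y)
print(round(mlp.score(X, y), 2))             # coefficient of determination R^2
```

In practice a single hidden layer, as noted above, is usually sufficient for such regression tasks.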

DATA SOURCE AND STUDY AREA

The time series dataset used for this experiment was extracted from tide gauge data (Figure 1) (Holgate et al. 2013), which are available to the public through the Permanent Service for Mean Sea Level (2015) database. This service is responsible for the collection, publication, analysis and interpretation of sea level data from the global network of tide gauges. It is based at the National Oceanography Centre in Liverpool, a component of the UK Natural Environment Research Council (2015).

The case data cover a period of 860 months, that is, January 1909 to August 1980. The following should be noted about the dataset: the Cristobal record was considered to be affected by problems with automatic tide recorders in the early 1970s, and several years of MSL values look high compared to nearby stations, e.g. Cartagena. Therefore, revised local reference (RLR) data are used. Table 1 shows a brief statistical description of the dataset.

Table 1

Descriptive statistics for the Cristobal water level data

Variable         Water level
Unit             m
No. instances    860
Median           6.967
Mean             6.968
Std Dev.         0.051
Variance         0.003
Minimum          6.832
Maximum          7.217

EXPERIMENTAL SETUP

Figure 2 shows the monthly water level of Cristobal Bay between 1909 and 1980, as recorded by the tide gauge sampling every hour. The outlier seen in the graph can be attributed to problems with the gauge during the early 1970s, as mentioned earlier. High water levels, i.e. no less than 6.97 m, are also shown. The high water phenomenon has a characteristic behavior during the year: it is most pronounced in September, October, November and December, whereas during the summer it has rarely been observed, except in January and February.
Figure 2

Water level at the Cristobal Bay from 1909 to 1980.


All experiments in this paper are implemented after applying the process of data retrieval, screening and preparing input format.

Although there is no specific guideline for splitting of the data in the literature, it is generally agreed that most data points should be used for model building.

The regression tree learner M5P implements the M5 algorithm (Quinlan 1992) in the Weka software. The criterion employed by M5P, as a regression classifier, to carry out the partitions is straightforward. M5P combines a conventional decision tree with linear regression functions at the nodes. First, a decision-tree induction algorithm is employed to build a tree, but instead of maximizing the information gain at each inner node, a splitting criterion is used that minimizes the intra-subset variation in the water level values down each branch. In M5P, the splitting procedure stops if the water level values of all instances that reach a node vary only slightly, or if only a few instances remain.

The second step is pruning of the tree back from each leaf. When pruning, an inner node is turned into a leaf with a regression plane. Thirdly, in order to avoid sharp discontinuities between the subtrees, a smoothing procedure is applied that combines the leaf model prediction with each node along the path back to the root, smoothing it at each of these nodes by combining it with the value predicted by the linear model for that node.
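
The splitting criterion just described, minimizing intra-subset variation in the water level values, can be sketched as a standard deviation reduction (SDR) computation: a split is good when the weighted standard deviations of the two subsets are much smaller than that of the parent node. This is a toy illustration under that standard formulation, not the authors' code.

```python
# Sketch of the M5-style splitting criterion: standard deviation reduction.
import numpy as np

def sdr(y, mask):
    """Standard deviation reduction achieved by splitting targets y with mask."""
    left, right = y[mask], y[~mask]
    if len(left) == 0 or len(right) == 0:
        return 0.0          # degenerate split reduces nothing
    return (y.std()
            - (len(left) / len(y)) * left.std()
            - (len(right) / len(y)) * right.std())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([6.90, 6.91, 6.92, 7.10, 7.11, 7.12])   # water-level-like values
# Splitting between 3 and 4 separates the two regimes almost perfectly,
# so its SDR is much larger than that of a poor split after the first point.
print(round(sdr(y, x <= 3.0), 4), round(sdr(y, x <= 1.0), 4))
```

The tree builder would evaluate every candidate split this way and keep the one with the largest reduction.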

One advantage of decision tree classifiers is that very descriptive rules can be inferred from the generated trees, helping users to understand their data, whereas this is not the case for purely linear models; M5P combines both, placing linear models at the leaves of a tree. Weka software can generate both decision trees and decision tree rules depending on the selected options. Trees and model rules were generated using 10-fold cross-validation, and the results with the best value of the correlation coefficient (R) and the lowest percentage relative absolute error (%RAE) on the test dataset were selected.

We decided to use the complete dataset to create the scenarios, or experiments, for finding a predictive model, which is expressed mathematically as:

x(k) = F(x(k − 1), x(k − 2), …, x(k − i))   (1)

where k is the time variable, i is the variable for the time step and F is some general function of the water level time series.

It is known that TSF models are best when the datasets are considerably large; this is also true for data mining processes, which require a large amount of data to be split for training and testing purposes. Given this, and having only 860 instances available with no data beyond the final year of the series, we modeled the time series with a four-lag approach, inferring from the characteristics of the series to relate its present values to past values. The approaches applied were as follows:

  • modeling data with lag t = 10 months:
    x(t) = F(x(t − 1), x(t − 2), …, x(t − 10))   (2)
  • modeling data with lag t = 50 months:
    x(t) = F(x(t − 1), x(t − 2), …, x(t − 50))   (3)
  • modeling data with lag t = 60 months:
    x(t) = F(x(t − 1), x(t − 2), …, x(t − 60))   (4)
  • modeling data with lag t = 120 months:
    x(t) = F(x(t − 1), x(t − 2), …, x(t − 120))   (5)
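
One way to build the lagged datasets used in these approaches is to pair the water level at month t with its previous `lag` values. This is a sketch; the study itself prepared such inputs for the Weka tool.

```python
# Build a lagged dataset: each row holds [x(t-lag), ..., x(t-1)], target x(t).
import numpy as np

def make_lagged(series, lag):
    """Return (X, y) where X[k] = series[k:k+lag] and y[k] = series[k+lag]."""
    X = np.array([series[t - lag:t] for t in range(lag, len(series))])
    y = np.array(series[lag:])
    return X, y

series = list(range(860))          # stand-in for the 860 monthly water levels
X, y = make_lagged(series, lag=10)
print(X.shape, y.shape)            # → (850, 10) (850,)
```

With 860 instances and lag t = 10, this yields 850 usable rows; larger lags, such as t = 120, leave correspondingly fewer rows for model building.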

The experimental models are built with the usual cross-validation approach. Cross-validation is known from the literature to be one of the most important tools in the evaluation of regression and classification methods (Kunst & Jumah 2004; Arlot & Celisse 2010), and it justifies assessing predictive performance on limited data. As generalization is of high importance in time series modeling, many different techniques exist, of which repeated cross-validation is probably the method of choice in most practical situations of data scarcity, when the amount of data for training and testing is limited.
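
A sketch of such a 10-fold cross-validation run on lagged data follows, with scikit-learn's DecisionTreeRegressor standing in for Weka's M5P (which has no direct scikit-learn equivalent); the synthetic series and all parameter values are illustrative.

```python
# Illustrative 10-fold cross-validation on a lagged time series.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
series = np.sin(np.arange(300) / 5) + 0.05 * rng.standard_normal(300)

lag = 10                                      # as in approach (2) above
X = np.array([series[t - lag:t] for t in range(lag, len(series))])
y = series[lag:]

scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0),
                         scoring="r2")
print(round(scores.mean(), 2))                # mean R^2 across the 10 folds
```

Every instance is used for both training and testing across the folds, which is exactly what makes the technique attractive when, as here, no data exist beyond the end of the series.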

For each lag time approach, (2) to (5) above, we experimented with the following classifier schemes: a rule, a function, and a regression model tree, in order to find which of these paradigms was suitable for the dataset and could therefore perform as a feasible model predictor. Table 2 shows that the best performance was given by the regression model tree.

Table 2

Statistical test results for the experiment

Dataset                          Classifiers
                                 tree.M5P   rule     function
Cristobal water level (100)      0.79       0.00*    0.76*

As shown in the literature, all of these statistics compare true values to their estimates, but do so in slightly different ways. They all tell us 'how far away' our estimated values are from the true value θ. Sometimes squared differences are used and sometimes absolute values; with squared differences, extreme values have more influence on the result. In RAE and root relative squared error (RRSE), those differences are divided by the corresponding variation of θ, giving a scale from 0 to 1; multiplying by 100 expresses the same value on a 0–100 (percentage) scale. The quantities (θ − θi)² or |θ − θi| measure how much θ differs from its estimates, relative to how much θ differs from its own mean. For this reason the measures are called 'relative': they give results related to the scale of θ. In this sense, the performance of the three classifier schemes is judged by the commonly used error measure, the RRSE. The best scheme was chosen based on the statistical results; in this case the third model, M5P, was chosen as the best classifier scheme.
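
Written out explicitly, the relative measures divide the model's total error by the error of the trivial predictor that always outputs the mean. The following is a sketch, not the Weka implementation; the sample values are arbitrary.

```python
# MAE, relative absolute error (%RAE) and root relative squared error (%RRSE).
import numpy as np

def mae(actual, pred):
    return np.mean(np.abs(actual - pred))

def rae_percent(actual, pred):
    # total absolute error relative to always predicting the mean
    return 100 * np.sum(np.abs(actual - pred)) / np.sum(np.abs(actual - actual.mean()))

def rrse_percent(actual, pred):
    # same idea with squared errors, so extremes weigh more heavily
    return 100 * np.sqrt(np.sum((actual - pred) ** 2)
                         / np.sum((actual - actual.mean()) ** 2))

actual = np.array([6.95, 6.96, 6.98, 7.00])   # water-level-like values
pred = np.array([6.96, 6.96, 6.97, 6.99])
print(round(mae(actual, pred), 4),
      round(rae_percent(actual, pred), 1),
      round(rrse_percent(actual, pred), 1))
```

A %RAE or %RRSE below 100 means the model beats the mean predictor; lower is better, which is how the schemes in Tables 3–6 are compared.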

After completing the experiments, we selected the time series grouped into the four modeling scenarios of 10, 50, 60 and 120 past values. We then ran and analyzed each data group with the previously selected M5P algorithm, using 10-fold cross-validation as the testing method. The performance metrics used for the runs were the correlation coefficient (R), the mean absolute error (MAE), and the percentage relative absolute error (%RAE). As the modeling scenarios ran, interesting patterns representing knowledge were identified. So, as these patterns appeared, we ran each scenario without the aid of an attribute selection method, and again with attribute selection, using both the Correlation-based Feature Selection (CFS) subset evaluator and the WrapperSubsetEval (which uses a classifier plus cross-validation) (Kohavi & John 1997) on all four selected lag approaches. This enabled comparison of performance between the scenarios and identification of the attribute selection scheme best at finding the adequate number of variables among the selected lags, determining the relevance of attributes and hence those giving the best accuracy in prediction.
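
The wrapper idea can be sketched as a greedy forward search in which a candidate lag is kept only if including it improves the cross-validated score of the learner itself. This toy example uses scikit-learn and synthetic data, and is only in the spirit of Weka's WrapperSubsetEval, not a reimplementation of it.

```python
# Greedy forward wrapper-style attribute selection with cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6))                  # six candidate lag attributes
y = 0.8 * X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(200)

selected, remaining, best = [], list(range(6)), -np.inf
while remaining:
    # score every one-attribute extension of the current subset
    trial = {f: cross_val_score(LinearRegression(),
                                X[:, selected + [f]], y, cv=10).mean()
             for f in remaining}
    f, score = max(trial.items(), key=lambda kv: kv[1])
    if score <= best:
        break                                      # no attribute improves the score
    selected.append(f)
    remaining.remove(f)
    best = score
print(sorted(selected))
```

Because the classifier's own cross-validated accuracy drives the search, the wrapper tends to keep only attributes the learner can actually exploit, at the cost of extra computation, which matches its behavior in the results below.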

EXPERIMENTAL RESULTS AND ANALYSIS

Data from the Cristobal Bay water level time series were used, with implementation of time lag scenarios t = 10, t = 50, t = 60 and t = 120 months, respectively. Regression with 10-fold cross-validation was applied to the different scenarios. Three runs were executed for each scenario, the first applying the M5P classifier without submitting the dataset to an attribute selection algorithm. To determine the relevance of attributes, the second and third runs used the cfs subset evaluator and the wrapper subset evaluator, respectively. The cfs subset evaluator assesses the predictive ability of each attribute individually along with the degree of redundancy among them, preferring sets of attributes that are highly correlated with the water level but have low inter-correlation. The wrapper subset evaluator also uses a classifier to evaluate attribute sets, but employs cross-validation to estimate the accuracy of the learning scheme for each set.

Then, for each time lag scenario implemented, the data were submitted to classification with each of the attribute selection learners, and the results and performance of each were compared.

We report results for the runs to reveal the effectiveness of the model implementation. It is not surprising that across Tables 3–6, model fitting and forecasting performance (Figures 5–8) generally worsens for subsample time lags in which the number of selected attributes is higher. This is due to high computational time, as data dimensionality increases and overfitting issues arise (see Loughrey & Cunningham 2005). However, we show that where attribute selection schemes are applied and the number of lags is higher, model performance improves. Results also indicate that the wrapper evaluator scheme performed best of the three approaches implemented.

Table 3

Statistical summary of results and generated linear models with runs for time lag t = 10

Run no.   R        MAE     %RAE     Attributes selected   Attribute scheme
1         0.7864   0.024   60.204   5                     without attribute selection
2         0.7865   0.024   60.407   3                     cfs evaluator
3         0.7856   0.024   60.301   2                     wrapper evaluator

Generated linear models:
Run 1: Water level = 0.2304*t-10 − 0.0681*t-5 + 0.082*t-3 + 0.0568*t-2 + 0.5985*t-1 + 0.6999
Run 2: Water level = 0.223*t-10 + 0.0829*t-2 + 0.6014*t-1 + 0.6461
Run 3: Water level = 0.2296*t-10 + 0.6617*t-1 + 0.7576
Table 4

Statistical summary of results and generated linear models with runs for time lag t = 50

Run no.   R        MAE     %RAE     Attributes selected   Attribute scheme
1         0.8174   0.023   56.091   18                    without attribute selection
2         0.8064   0.023   57.226   4                     cfs evaluator
3         0.8219   0.022   55.519   11                    wrapper evaluator

Generated linear models:
Run 1: Water level = 0.0652*t-48 − 0.0933*t-38 + 0.0535*t-36 + 0.0599*t-35 + 0.0598*t-33 − 0.0979*t-29 − 0.081*t-27 + 0.0507*t-25 + 0.0825*t-23 + 0.0534*t-18 − 0.1423*t-16 + 0.1036*t-12 + 0.1425*t-11 − 0.0562*t-9 + 0.1285*t-4 + 0.1075*t-3 + 0.0755*t-2 + 0.4451*t-1 + 0.3013
Run 2: Water level = 0.1068*t-48 + 0.267*t-11 + 0.0826*t-2 + 0.5155*t-1 + 0.1961
Run 3: Water level = 0.0728*t-48 − 0.0688*t-38 + 0.0956*t-36 − 0.0873*t-27 + 0.1111*t-23 − 0.1466*t-16 + 0.1158*t-12 + 0.1403*t-11 + 0.1101*t-4 + 0.1292*t-3 + 0.4817*t-1 + 0.3222
Table 5

Statistical summary of results and generated linear models with runs for time lag t = 60

Run no.   R        MAE     %RAE     Attributes selected   Attribute scheme
1         0.8118   0.023   56.448   19                    without attribute selection
2         0.8024   0.023   57.544   3                     cfs evaluator
3         0.8222   0.022   55.573   9                     wrapper evaluator

Generated linear models:
Run 1: Water level = 0.0896*t-59 − 0.097*t-52 + 0.0792*t-48 − 0.068*t-38 + 0.0659*t-36 + 0.0576*t-33 − 0.0925*t-29 + 0.06*t-28 − 0.0754*t-27 + 0.0753*t-23 + 0.0592*t-18 − 0.1378*t-16 + 0.1056*t-12 + 0.1373*t-11 − 0.0592*t-9 + 0.1371*t-4 + 0.1057*t-3 + 0.0767*t-2 + 0.4426*t-1 + 0.2658
Run 2: Water level = 0.3001*t-11 + 0.0906*t-2 + 0.5373*t-1 + 0.5013
Run 3: Water level = 0.1186*t-59 − 0.1372*t-52 + 0.1081*t-48 − 0.1277*t-16 + 0.1096*t-12 + 0.1698*t-11 + 0.1223*t-4 + 0.1195*t-3 + 0.4661*t-1 + 0.355
Table 6

Statistical summary of results and generated linear models with runs for time lag t = 120

Run no.   R        MAE     %RAE     Attributes selected   Attribute scheme
1         0.8070   0.023   58.066   30                    without attribute selection
2         0.8024   0.023   57.544   3                     cfs evaluator
3         0.8227   0.022   55.595   9                     wrapper evaluator

Generated linear models:
Run 1: Water level = 0.0737*t-120 − 0.0752*t-94 − 0.0691*t-92 + 0.061*t-91 − 0.0496*t-88 + 0.0867*t-83 + 0.0678*t-78 − 0.0919*t-76 + 0.1022*t-72 − 0.0816*t-67 + 0.0611*t-64 + 0.0551*t-59 − 0.0906*t-52 + 0.0671*t-48 − 0.0641*t-38 + 0.0912*t-33 − 0.0716*t-29 + 0.0669*t-28 − 0.0845*t-27 + 0.0604*t-23 + 0.0537*t-20 − 0.0989*t-16 − 0.0532*t-13 + 0.1138*t-12 + 0.1411*t-11 − 0.0541*t-9 + 0.1532*t-4 + 0.118*t-3 + 0.0747*t-2 + 0.4207*t-1 + 0.1141
Run 2: Water level = 0.3001*t-11 + 0.0906*t-2 + 0.5373*t-1 + 0.5013
Run 3: Water level = −0.1009*t-88 + 0.1362*t-83 − 0.1097*t-75 + 0.1463*t-72 − 0.0875*t-16 + 0.2136*t-11 + 0.124*t-4 + 0.1263*t-3 + 0.4955*t-1 + 0.3916

It can be noted that the variations in the MAE measures for the three schemes applied are not significantly different, yet the %RAE did differ, and was significantly lower for the results obtained with the wrapper evaluator scheme, indicating that this model implementation is adequate for the dataset in question.

Scenario with time lag t = 10

As described in the experimental setup section, the first experiment with the dataset was done with a time lag t = 10 months.

Table 3 reports the summary statistics from each run implemented with the M5P classifier, with 10-fold cross-validation and the linear models generated (Figures 3 and 4) by each of the runs applied, respectively. It also shows the R, MAE, %RAE and the number of relevant attributes obtained with the two attribute evaluators employed versus a non-attribute selection approach.
Figure 3

A decision tree for the water level forecasting at time lag t-10.


Figure 4

M5 pruned decision tree model for the water level forecasting at time lag t-10.


The optimum feature subsets found are summarized in Table 3. For the first run (without attribute selection) the best feature set comprised five of the ten lags (t-10, t-5, t-3, t-2 and t-1, respectively). For the second run (with the cfs algorithm), three features were selected as relevant (t-10, t-2 and t-1). When wrapper feature selection was applied to the same dataset, the classifier found two features relevant (t-10 and t-1), both of which were also found without attribute selection and with the cfs algorithm. For the dataset with time lag t = 10, the relevant inputs identified without an attribute selection scheme gave the best accuracy, with R = 0.7864 and also the lowest %RAE = 60.204. In spite of the small statistical differences between the schemes, it can be inferred that the models observed similar trends during the first and last months of the series, that is, t-1 and t-10 respectively.
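
The wrapper-selected linear model for time lag t = 10 (the two-attribute model in Table 3) can be written directly as a function of past water levels, which also makes its behavior easy to check:

```python
# The two-attribute wrapper model for lag t = 10, taken from Table 3.
def water_level_t10(x_lag1, x_lag10):
    """Predicted water level (m) from the values 1 and 10 months back."""
    return 0.2296 * x_lag10 + 0.6617 * x_lag1 + 0.7576

# Feeding in the series mean (6.968 m, Table 1) returns roughly the mean,
# as expected for a model fitted to a near-stationary series.
print(round(water_level_t10(6.968, 6.968), 3))   # → 6.968
```

The lag coefficients sum to about 0.89, with the intercept absorbing the rest of the level, so the model behaves as a damped weighting of the two selected past values.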

Scenario with time lag t = 50

On the other hand, results with time lag t = 50 showed that the model executed without an attribute selection scheme selected 18 of the 50 attributes as relevant (Table 4). For the runs with the cfs subset evaluator and the wrapper subset evaluator, a total of four and 11 attributes were selected as relevant, respectively.

Moreover, the experimental analysis for this case showed that the model run without an attribute selection scheme generated a linear model with 18 attributes, represented as time lags (t-48, t-38, t-36, t-35, t-33, t-29, t-27, t-25, t-23, t-18, t-16, t-12, t-11, t-9, t-4, t-3, t-2 and t-1, respectively). What can be inferred initially is that the model probably observed similar trends during the first 4 months of the series. It then discretized, skipping 4 months back in time (time lags t-5, t-6, t-7 and t-8) in search of patterns similar to the initial trend, arriving at time lag t-9. This agrees with the fact that stationarity in a series is a very important factor: if non-stationarity cannot be removed by conventional time series methods, the model building procedure may require a processing step that determines which parts of the series to include in the modeling, as proposed by Deco et al. (1997), or prediction of the series might even be an impossible task (Kim et al. 2004). We also observed increases and decreases in the number of consecutive months the model skipped to find similar patterns in the time series. In this case, the frequency of discretization observed by the model was 4, 1, 3, 1, 4, 1, 1, 1, 3, 1, 1 and 9 months back in time, at each step arriving at the next month or series of months in which it could observe patterns similar to the trend of the first 4 months of the series.
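
These skip counts can be verified mechanically: taking the 18 lags of the generated linear model and computing the nonzero gaps between consecutive lags reproduces the quoted sequence. A small sketch:

```python
# Gaps (skipped months) between consecutive lags selected by the
# no-attribute-selection model for t = 50 (Table 4).
lags = [1, 2, 3, 4, 9, 11, 12, 16, 18, 23, 25, 27, 29, 33, 35, 36, 38, 48]
gaps = [b - a - 1 for a, b in zip(lags, lags[1:])]
skips = [g for g in gaps if g > 0]      # keep only the actual jumps
print(skips)                            # → [4, 1, 3, 1, 4, 1, 1, 1, 3, 1, 1, 9]
```

The same computation applied to the lag sets of the other schemes yields their respective discretization sequences.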

For model runs with the cfs subset evaluator, the generated linear model selected four attributes, that is, four lags out of 50 back in time (t-48, t-11, t-2 and t-1, respectively). This is 14 fewer than were selected by the first scheme (no attribute selection). Although the number of attributes selected is smaller, the performance of the previous model is better in statistical terms, as it yielded a lower RAE (56.091%) and a better fit (R = 0.8174). For the first 2 months, this model observed similar patterns of the trend, just as with the previous scheme. During the discretization process, however, this algorithm skipped 8 months back in time before arriving at the following month (t-11) where it identified similar patterns, and then skipped 36 more lags back in time before identifying similar patterns again at lag t-48.

The experimental analysis also showed that the wrapper algorithm selected a total of 11 time lags as relevant attributes (t-48, t-38, t-36, t-27, t-23, t-16, t-12, t-11, t-4, t-3 and t-1, respectively). A quick inspection revealed that these are almost the same attributes as were selected under the non-attribute selection scheme; the wrapper omitted seven of the lags that scheme had selected, and retained three of the first 4 months it had chosen. The discretization performance showed that the model skipped 1, 6, 3, 6, 3, 8, 1 and 9 months back in time until it found the preceding month or months with similar behavior in the patterns.

This shows the importance of stationarity and of cross-validation in time series, and the reliance of a model on past values to predict future values. In general, for this scenario, the wrapper scheme outperformed the other two schemes (Table 4).

Scenario with time lag t = 60

Under this scenario, an increase in the errors for the three runs can be observed, as well as a decrease in the coefficient of fit (R), except that this value increased for the runs with the wrapper subset evaluator, as can be seen in Table 5. Results also indicated that for the runs carried out with the non-attribute selection scheme, 19 attributes were selected by the classifier as relevant, whereas for time lag t-50, as seen above, only 18 attributes were selected; thus one more variable was included in the search as the lags increased. For model runs with the wrapper scheme, nine attributes were selected as relevant. The cfs subset evaluator scheme selected only three attributes, yet its coefficient of fit was the lowest of the three schemes.

For a time lag of t-60, the experimental analysis showed important improvements for the runs employing the wrapper subset evaluator, which obtained better forecasting results than both the non-attribute selection and the cfs subset evaluator schemes. Notably, even the non-attribute selection scheme performed better than the cfs evaluator. As with the previous runs, interesting patterns can be observed in the selection of relevant attributes.

For the case without an attribute selection scheme, the classifier selected almost the same first ten time lag attributes (t-23, t-18, t-16, t-12, t-11, t-9, t-4, t-3, t-2 and t-1, respectively) as it did when running the same scheme on the smaller dataset (t-50). Of the remaining nine variables (t-59, t-52, t-48, t-38, t-36, t-33, t-29, t-28 and t-27) selected under t-60, most were also selected at time lag t-50, with the exceptions that lags t-35 and t-25 were not chosen under t-60, and lags t-59, t-52 and t-28 were not selected under t-50. The search sequence of this model required looking 4, 1, 3, 1, 4, 3, 3, 2, 1, 9, 3 and 6 months back to find repeating patterns in previous months; the first eight steps of this sequence match what we observed for time lag t-50. When these two models are contrasted, neither their coefficients of fit nor their errors differ significantly. This is probably related to the amount of data in each dataset: the t-60 set had 10 more lags added to it, but this did not lead to any significant difference over the previous model with t-50.

The cfs subset evaluator selected only three time lag attributes (t-11, t-2 and t-1), finding similar patterns in the first 2 months of the series and then skipping 9 months back to reach t-11. As can be seen in Table 5, it performed worse than the non-attribute selection scheme. Once again, the wrapper subset evaluator outperformed the other two schemes, with a higher correlation (R = 0.8222) and a lower error (RAE = 55.573%). It selected nine attributes with similar patterns in the time series (t-59, t-52, t-48, t-16, t-12, t-11, t-4, t-3 and t-1, respectively): ten fewer than the non-attribute selection scheme and six more than the cfs evaluator. Its search sequence moved 1, 6, 3, 31, 3 and 6 months back in time to find patterns similar to those present in the first 3 months of the series.
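The wrapper scheme used throughout scores each candidate subset of lags with the learner itself under cross-validation. A greedy forward variant can be sketched as follows; `cv_error` is a hypothetical callback standing in for a 10-fold cross-validated run of the base learner (M5P in the paper), and the toy error surface below is purely illustrative:

```python
def wrapper_forward_select(candidate_lags, cv_error):
    """Greedy forward wrapper: on each pass, add the lag whose
    inclusion most reduces the cross-validated error; stop when no
    addition helps. `cv_error(lags)` is assumed to train the base
    learner under 10-fold CV and return its %RAE for that subset."""
    selected = []
    best = cv_error(selected)
    while True:
        best_lag, best_err = None, best
        for lag in candidate_lags:
            if lag in selected:
                continue
            err = cv_error(selected + [lag])
            if err < best_err:
                best_lag, best_err = lag, err
        if best_lag is None:  # no candidate improved the error
            return selected, best
        selected.append(best_lag)
        best = best_err

# Toy error surface: lags 1 and 11 are informative, others add noise.
useful = {1, 11}
def toy_cv_error(lags):
    return 100 - 10 * len(useful & set(lags)) + len(set(lags) - useful)

chosen, err = wrapper_forward_select([1, 2, 3, 11], toy_cv_error)
```

Because every subset evaluation retrains the learner, this search is far more expensive than a filter such as cfs, which is consistent with it finding smaller, better-performing subsets in these experiments.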

Scenario with time lag t = 120

Table 6 presents the results for a lag time of t-120 modeled with the three schemes applied so far. The wrapper scheme clearly outperformed the other two, with a fit of R = 0.8227, although there is a noticeable increase in the %RAE (55.595%), amounting to increases of 0.076% and 0.022% over the errors in scenarios t-50 and t-60, respectively. This model selected nine attributes, represented by time lags t-88, t-83, t-75, t-72, t-16, t-11, t-4, t-3 and t-1. Its search sequence moved 1, 6, 4, 55, 2, 7 and 4 months back in time until it found similar patterns in each of the time lags selected as relevant.

The scheme run without an attribute selection paradigm achieved a better fit (R = 0.8070) than the cfs algorithm, but a higher error. This model selected 30 attributes as relevant (see Table 6), and its search sequence moved 4, 1, 2, 3, 2, 3, 3, 3, 9, 3, 6, 4, 4, 3, 1, 4, 4, 2, 1 and 25 months back in time until it found the initial patterns displayed in the first 4 months of the series.

The cfs subset evaluator had a lower error than the non-attribute scheme but also a lower goodness of fit. It selected three attributes (t-11, t-2 and t-1, respectively), and its search continued until it found the next time lag with patterns similar to the first two, 8 months back in time. As has been demonstrated, the proposed use of a cross-validation strategy combined with an attribute selection algorithm is simple and effective.
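The %RAE figures quoted throughout compare the model's total absolute error to that of a naive predictor that always outputs the mean of the actual values. A small sketch of the metric (function name is illustrative):

```python
def relative_absolute_error(actual, predicted):
    """%RAE: the model's total absolute error as a percentage of the
    error made by always predicting the mean of the actual values;
    values below 100% beat the mean predictor."""
    mean = sum(actual) / len(actual)
    model_err = sum(abs(a - p) for a, p in zip(actual, predicted))
    naive_err = sum(abs(a - mean) for a in actual)
    return 100.0 * model_err / naive_err
```

On this scale, the wrapper results around 55-56% mean the models roughly halve the error of simply predicting the mean water level.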

CONCLUSIONS AND RECOMMENDATIONS

In this experiment we explored three approaches to forecast time series using cross-validation and attribute selection algorithms.

The M5P model tree classification algorithm was used to generate model trees and linear model rules for predicting water level changes. The data, for Cristobal Bay, Panama, from 1909 to 1980, were obtained from the Permanent Service for Mean Sea Level (PSMSL) database. The series extends only to 1980. Because the Revised Local Reference (RLR) data are reduced to a common datum, and the PSMSL recommends the RLR rather than the metric dataset for time series analysis, we used the RLR dataset for forecasting with data mining techniques.

The approach used to analyze the Cristobal data included a review of traditional forecasting, as depicted by the results in Figures 5-8, and of data mining techniques for time series. We performed a comprehensive empirical study, implementing four sets of lagged times, t-10, t-50, t-60 and t-120, respectively, as time dependencies for the dataset.
Figure 5

Graphical output for algorithm performance comparison with time lag t-10.

Figure 6

Graphical output for algorithm performance comparison with time lag t-50.

Figure 7

Graphical output for algorithm performance comparison with time lag t-60.

Figure 8

Graphical output for algorithm performance comparison with time lag t-120.

Using standard 10-fold cross-validation, the forecasting reliabilities of these models were evaluated by computing a number of statistical measures and by applying attribute selection schemes in combination with the time lag scenarios implemented. Once the best set of attributes had been determined, the selection process was repeated on this reduced dataset, and so on.
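The repeated-reduction procedure described above can be sketched as a simple fixed-point loop; `select` is a hypothetical stand-in for one pass of an evaluator (cfs or wrapper) under cross-validation:

```python
def iterate_selection(attributes, select):
    """Re-run attribute selection on its own output until the set of
    attributes stops shrinking. `select(attrs)` stands in for one
    pass of an evaluator (cfs or wrapper) under cross-validation."""
    while True:
        reduced = select(attributes)
        if len(reduced) == len(attributes):
            return attributes
        attributes = reduced

# Toy selector: keep only even-numbered lags; the set is stable
# after the first reduction.
keep_even = lambda attrs: [a for a in attrs if a % 2 == 0]
stable = iterate_selection([1, 2, 3, 4, 6], keep_even)
```

The loop terminates because the attribute set can only shrink or stay the same on each pass.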

The results of the experiments indicate that using cross-validation and attribute selection schemes improves prediction proficiency. The wrapper subset evaluator outperformed the other two schemes in all scenarios executed. The best prediction was obtained for time lag t-120, which suggests that the length of the time series was essential for the model and that performance can be improved significantly with an attribute selection evaluator, as was the case with the wrapper subset evaluator. However, as model complexity increases, the model becomes more prone to overfitting and errors may increase, as discussed in the experimental results and analysis section. We also noticed that a certain amount of lag is necessary, as the data contain dependencies characterized by the nature of the time series (e.g. stationarity, seasonality and irregularities), and time-evolving environmental effects may occur. The cfs subset evaluator, although not performing badly, did not excel over the non-attribute selection scheme in statistical terms when forecasting the water level.

The shortcomings of the methods commonly used in time series forecasting (TSF) are well documented, and various other methods have been proposed in the literature. For this reason, and to pursue better results, a larger dataset comprising data collected over many decades will be needed.

We can conclude that this work is important for climate change studies because variations in weather conditions, in terms of water level rise in the study area, which lies within the boundaries of the Panama Canal operations, are not well documented and can be studied using these data mining techniques.

ACKNOWLEDGEMENTS

The authors would like to thank the staff of the Vicerectoría de Investigación (University of Panama) for supporting this research project.

REFERENCES

Bergmeir, C. & Benítez, J. M. 2012 On the use of cross-validation for time series predictor evaluation. Information Sciences 191, 192-213.

Bregman, J. I. & Mackenthun, K. M. 2006 Environmental Impact Statements. Lewis Publication, Chelsea, MI.

Carretero, J. C., Alvarez, E., Gomez, M., Perez, B. & Rodríguez, I. 2000 Ocean forecasting in narrow shelf seas: application to the Spanish coasts. Coastal Engineering 41 (1-3), 269-293.

Chen, J. L., Shum, C. K., Wilson, C. R., Chambers, D. P. & Tapley, B. D. 2000 Seasonal sea level change from TOPEX/Poseidon observation and thermal contribution. Journal of Geodesy 73, 638-647.

Deco, G., Neuneier, R. & Schürmann, B. 1997 Non-parametric data selection for neural learning in non-stationary time series. Neural Networks 10 (3), 401-407.

Douglas, B. C., Kearney, M. S. & Leatherman, S. P. 2009 Sea level rise history and consequences. International Geophysics Series 75, 97-119.

Haykin, S. 1999 Neural Networks: A Comprehensive Foundation. Prentice-Hall, Upper Saddle River, NJ, p. 842.

Holgate, S. J., Matthews, A., Woodworth, P. L., Rickards, L. J., Tamisiea, M. E., Bradshaw, E., Foden, P. R., Gordon, K. M., Jevrejeva, S. & Pugh, J. 2013 New data systems and products at the Permanent Service for Mean Sea Level. Journal of Coastal Research 29 (3), 493-504.

Kamruzzaman, J., Begg, R. & Sarker, R. 2006 Artificial Neural Networks in Finance and Manufacturing. Idea Group Publishing, Hershey, PA.

Kim, T. Y., Oh, K. J., Kim, C. & Do, J. D. 2004 Artificial neural networks for non-stationary time series. Neurocomputing 61 (1-4), 439-447.

Kohavi, R. & John, G. H. 1997 Wrappers for feature subset selection. Artificial Intelligence 97 (1-2), 273-324.

Kunst, R. M. & Jumah, A. 2004 Toward a theory of evaluating predictive accuracy. Economics Series, vol. 162, Institute for Advanced Studies.

Long, N. C. & Meesad, P. 2014 An optimal design for type-2 fuzzy logic system using hybrid of chaos firefly algorithm and genetic algorithm and its application to sea level prediction. Journal of Intelligent and Fuzzy Systems 27 (3), 1335-1346.

Loughrey, J. & Cunningham, P. 2005 Using Early-Stopping to Avoid Overfitting in Wrapper-Based Feature Selection Employing Stochastic Search. Technical Report TCD-CS-2005-37, Department of Computer Science, Trinity College Dublin, Dublin, Ireland.

Makridakis, S., Wheelwright, S. & Hyndman, R. 2008 Forecasting Methods and Applications, 2nd edn. Wiley, New York, USA.

Monbaliu, J., Padilla-Hernández, R., Hargreaves, J. C., Carretero Albiach, J. C., Luo, W., Sclavo, M. & Günther, H. 2000 The spectral wave model, WAM, adapted for applications with high spatial resolution. Coastal Engineering 41 (1-3), 41-62.

Natural Environment Research Council 2015 http://www.nerc.ac.uk/ (accessed December 2015).

O'Rourke, F., Boyle, F. & Reynolds, A. 2010 Tidal energy update 2009. Applied Energy 87 (2), 398-409.

Pashova, L. & Popova, S. 2011 Daily sea level forecast at tide gauge Burgas, Bulgaria using artificial neural networks. Journal of Research 66 (2), 154-161.

Peralta, J., Li, X., Gutierrez, G. & Sanchis, A. 2010 Time series forecasting by evolving artificial neural networks using genetic algorithms and differential evolution. In: Proceedings of the 2010 WCCI Conference, IJCNN-WCCI, 3999-4006.

Permanent Service for Mean Sea Level 2015 http://www.psmsl.org/ (accessed December 2015).

Quinlan, J. R. 1992 Induction of decision trees. Machine Learning 1, 181-186.

Rushing, J., Ramachandran, R., Nair, U., Graves, S., Welch, R. & Lin, H. 2005 A data mining toolkit for scientists and engineers. Computers & Geosciences 31, 607-618.

Umgiesser, G., Canu, D. M., Cucco, A. & Solidoro, C. 2004 A finite element model for the Venice Lagoon. Development, set up, calibration and validation. Journal of Marine Systems 51 (1), 123-145.