3. Test Dataset & Raw Datasets
TEST DATASETS
Generated 5 test datasets (20 observations each) using an ARMA(p, q) model.
ARMA(1,1): Z_t = a_t + φ·Z_{t-1} − θ·a_{t-1}, where a_0 = 0, Z_0 = 0, t = 1, …, 20
• Series 1: φ = −0.8, θ = 0.1
• Series 2: φ = 0.8, θ = −0.1
• Series 3: φ = 0.85, θ = −0.15
• Series 4: φ = −0.8, θ = 0.1, shifted, with 21 observations
• Series 5: φ = −0.85, θ = 0.15, with 21 observations
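The generating recursion above can be sketched in Python. This is a minimal illustration of the test-data setup, not the project's actual generation code; the helper name and the Gaussian white-noise assumption for a_t are mine.

```python
import random

def simulate_arma11(phi, theta, n=20, seed=1):
    """Simulate Z_t = a_t + phi*Z_{t-1} - theta*a_{t-1}, with a_0 = Z_0 = 0."""
    rng = random.Random(seed)      # fixed seed so a test series is reproducible
    z_prev, a_prev = 0.0, 0.0      # Z_0 = 0, a_0 = 0
    series = []
    for _ in range(n):             # t = 1, ..., n
        a = rng.gauss(0.0, 1.0)    # white-noise shock a_t (assumed Gaussian)
        z = a + phi * z_prev - theta * a_prev
        series.append(z)
        z_prev, a_prev = z, a
    return series

# e.g. Series 1 and Series 2 from the slide
series1 = simulate_arma11(phi=-0.8, theta=0.1)
series2 = simulate_arma11(phi=0.8, theta=-0.1)
```

Series 4 and 5 would be generated the same way with n=21, with Series 4 compared against Series 1 after a one-step shift.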
RAW DATASETS
Raw_itd1 & Raw_itd2: commodities datasets
• 119 series in Raw_itd1
• 120 series in Raw_itd2
• 222 monthly observations, from January 1997 to June 2015
Group                      Series
Similar Group              Series 2&3
Dissimilar Group           Series 1&2
Identical Shifted Group    Series 1&4
Similar Shifted Group      Series 1&5
4. Approach 1 Correlation Table
Test dataset:
Advantages:
• Able to detect period movements/shifts, but unstable
Disadvantages:
• Only captures linear correlation between series
• Sensitive to outliers
Table 1 Correlation Coefficients
Series      S1         S2         S3         S4         S5
S1          1          0.18454    0.14729   -0.9221     0.98857
S2          0.18454    1          0.99359    0.0392     0.15459
S3          0.14729    0.99359    1          0.05083    0.12369
S4         -0.9221     0.0392     0.05083    1         -0.95089
S5          0.98857    0.15459    0.12369   -0.95089    1
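Building the table amounts to computing the Pearson correlation for every pair of series. A minimal Python sketch (the project did this in SAS; function names here are illustrative):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def correlation_table(series):
    """All pairwise correlations, organized as a square table."""
    return [[pearson(s, t) for t in series] for s in series]
```

The diagonal is always 1, and values near ±1 flag linearly related pairs, exactly as read off Table 1.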
6. Approach 2 SAS Proc Similarity Method
• Use a distance matrix to compute a similarity measure for a pair of series
Data
Input series:  X1, X2, X3, …, Xn
Target series: Y1, Y2, Y3, …, Ym
Distance Matrix
               Input Series
           X1    X2    X3    …    Xn
Target Y1  D11   D12   D13   …    D1n
Series Y2  …     …     …     …    …
       Y3  …     …     …     …    …
       …   …     …     …     …    …
       Ym  Dm1   …     …     …    Dmn

Dij = input series value − target series value, e.g. D11 = X1 − Y1
The series are normalized, and the input is rescaled to the target; the procedure then computes all possible paths to traverse the matrix.
7. Approach 2 SAS Proc Similarity Method
Output:
1. Similarity Measure: Absolute Deviation
• MEASURE=ABSDEV / absolute deviation
• the total distance of the minimum path that traverses the distance matrix
2. Cost Statistics: statistics associated with minimum path
3. Path Statistics: indicating percentages of direct path (diagonal movement),
compression (vertical movement) and expansion (horizontal movement)
The Smaller the Absolute Deviation, the More Similar the Two Series Are
8. More on Proc Similarity in SAS
Basic Structure (option values shown are examples):
proc similarity data=data out=outsim;
   input S1 / normalize=absolute scale=absolute;
   target S2 / slide=index normalize=absolute measure=absdev
               compress=(localabs=0) expand=(localabs=0);
run;
Output:
Similarity Measures
Path & Cost Measures
Transformed Input & Target Series
Input & Target Path Index
Distance Metric Transformation
Normalization: absolute or standard
Scaling: absolute or standard
User-defined transformations: FCMPOPT statement & options
Measures
1. SQRDEV / ABSDEV: squared or absolute deviation
2. MSQRDEV / MABSDEV: mean squared or absolute deviation, taken relative to the length of the input or target sequence, or relative to the minimum or maximum valid path length
3. User-defined measures
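Along a diagonal path on equal-length series, the built-in measures reduce to simple sums and means. A hedged Python sketch of that special case (the general PROC SIMILARITY measures are computed over the minimum path, not shown here):

```python
def sqrdev(x, y):
    """Squared deviation along a diagonal path."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def absdev(x, y):
    """Absolute deviation along a diagonal path."""
    return sum(abs(a - b) for a, b in zip(x, y))

def msqrdev(x, y):
    """Mean squared deviation, here taken relative to the sequence length."""
    return sqrdev(x, y) / len(x)

def mabsdev(x, y):
    """Mean absolute deviation, here taken relative to the sequence length."""
    return absdev(x, y) / len(x)
```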
9. More on Proc Similarity in SAS
Path & Cost Statistics Plots
10. Approach 2 SAS Proc Similarity Method
Results of Raw_itd1
Advantages:
Higher accuracy rate in detecting similar pairs
Normalizes and rescales the series
Computes pair-wise similarity measures using a DO loop
Performs well on totally dissimilar series that cross each other
Disadvantages:
Bad at detecting similar and shifted series
Sensitive to outliers
Series Pairs                     Proc Similarity Measure
RW_CMACDG391 & RW_CMACDP553       44.00063414
RW_CMACDG183 & RW_CMACDG274       49.51170024
RW_CMACDG274 & RW_CMACDG391       52.46527532
RW_CMACDG274 & RW_CMACDG474       53.32933511
RW_CMACDG274 & RW_CMACDP553       54.0065316
RW_CMACDG274 & RW_CMACDG221      374.47813762
Table 2 Proc Similarity Measures
Series                                Absolute Deviation
Similar Group Series 2&3               1.92422
Dissimilar Group Series 1&2           20.25579
Identical Shifted Group Series 1&4     1.20447
Similar Shifted Group Series 1&5      33.96824
11. Approach 2 SAS Proc Similarity Method
Plots of an identical pair, a very similar pair and a similar pair, with similarity measures 2.78E−14, 14.11 and 51.98 respectively.
12. Approach 3 SIM Coefficient
SIM coefficient is calculated by the following. First take relative differences of each series:

y_t^(i) = (x_t^(i) − x_{t−1}^(i)) / x_{t−1}^(i),  for t = 2, …, T

then

Sim(y^(1), y^(2)) = (1/(T−1)) · Σ_{t=2}^{T} |y_t^(1) − y_t^(2)| / max(|y_t^(1)|, |y_t^(2)|)
The Closer the Value to Zero, the More Similar the Series Are.
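The formula above translates directly into code. A Python sketch (the project computed this in SAS with a DO loop; function names and the zero-denominator handling are my assumptions):

```python
def relative_diff(x):
    """y_t = (x_t - x_{t-1}) / x_{t-1} for t = 2, ..., T (assumes x_{t-1} != 0)."""
    return [(x[t] - x[t - 1]) / x[t - 1] for t in range(1, len(x))]

def sim_coefficient(x1, x2):
    """SIM coefficient of two equal-length series; closer to 0 => more similar."""
    y1, y2 = relative_diff(x1), relative_diff(x2)
    T = len(x1)
    total = 0.0
    for a, b in zip(y1, y2):
        denom = max(abs(a), abs(b))
        if denom > 0:              # both relative differences zero: contributes 0
            total += abs(a - b) / denom
    return total / (T - 1)
```

For identical series every term vanishes, so the coefficient is exactly 0, matching the rule that smaller is more similar.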
13. Approach 3 SIM Coefficient
Series Pairs SIM_final
RW_CMACDG291 & RW_CMACDG472 0.504446
RW_CMACDG291 & RW_CMACDG473 0.515302
RW_CMACDG284 & RW_CMACDG472 0.544256
RW_CMACDG221 & RW_CMACDG282 0.546508
RW_CMACDG221 & RW_CMACDG291 0.548653
Results of Raw_itd1
Advantages:
Computes pair-wise similarity measures using a DO loop
Performs well on totally dissimilar series that cross each other
Best at detecting both identical & shifted and similar & shifted series
Disadvantages:
Accuracy rate is lower than the Proc Similarity method
Cut-off: a pair is considered similar when the coefficient is below 0.7
Table 3 SIM Measures
Series                                SIM Coefficient
Similar Group Series 2&3               0.36887
Dissimilar Group Series 1&2            0.98743
Identical Shifted Group Series 1&4     0.23527
Similar Shifted Group Series 4&5       0.23978
14. Approach 4 Derivatives Comparison Method
Step 1: Use a spline function to represent each series
Step 2: Compute first derivatives (slopes) at each knot
Step 3: Compute second derivatives (rates of change) at each knot
Step 4: Compute the differences between the two series' first & second derivatives
The Smaller the Difference, the More Similar the Series Are.
Basic Idea:
spline function on each series + calculation of first & second derivatives
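The four steps can be sketched in Python. As an illustration only: the project fits splines and differentiates at the knots, whereas this sketch uses finite differences at every point as a stand-in for the spline derivatives.

```python
def first_derivatives(x):
    """Slope between consecutive points (unit spacing) -- a finite-difference
    stand-in for the spline's first derivative at each knot."""
    return [x[t + 1] - x[t] for t in range(len(x) - 1)]

def second_derivatives(x):
    """Rate of change of the slope (central second difference)."""
    return [x[t + 1] - 2 * x[t] + x[t - 1] for t in range(1, len(x) - 1)]

def derivative_differences(x, y):
    """Mean absolute difference of first and of second derivatives between two
    series; the smaller both numbers, the more similar the series."""
    d1x, d1y = first_derivatives(x), first_derivatives(y)
    d2x, d2y = second_derivatives(x), second_derivatives(y)
    diff1 = sum(abs(a - b) for a, b in zip(d1x, d1y)) / len(d1x)
    diff2 = sum(abs(a - b) for a, b in zip(d2x, d2y)) / len(d2x)
    return diff1, diff2
```

Because the comparison points are fixed, a one-step shift between otherwise identical series produces large derivative differences, which is exactly the weakness reported for the shifted groups.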
16. Quantitative Measures
Approach 4 Derivatives Comparison Method
Table 4 Comparison of Derivatives
Group                     Difference between 1st Derivatives   Difference between 2nd Derivatives
Similar Group              0.81508                              0.16769
Dissimilar Group          20.4207                               4.21562
Identical Shifted Group   24.2524                               5.17868
Similar Shifted Group     34.8150                               7.75581
Disadvantages:
Slow in processing time
Bad at detecting both identical & shifted and similar & shifted series
# of knots < # of observations
17. Approach 5 Spectral Analysis Method
SAS: proc spectra
Plot of frequency against the phase spectrum (in radians) of X and Y
Time Domain (e.g. ARIMA model):
• auto-covariance
• auto-correlation
Frequency Domain:
• spectral density function
Time series representations (plots): similar pair vs. dissimilar pair
18. Conclusion
Approach 1 Correlation Table:
Easy to Interpret & Fastest in Computing Pair-Wise Measures
Approach 2 SAS Proc Similarity Method:
More Functionality & Highest Accuracy Rate
Approach 3 SIM Coefficient:
Straightforward Formula & Adds More Accuracy
Approach 4 Derivatives Comparison Method:
Confirmation Mechanism
Approach 5 Spectral Analysis Method:
Different Perspective & Measures
At the beginning of the term I was introduced to the different seasonal adjustment options that the X-12 program can produce; similar series tend to share similar or even identical options. If we can determine similarity beforehand and obtain a quantitative measure of it, we can avoid redundant seasonal adjustment processing on similar series and be better prepared to explain similar or identical options for different series. What we were trying to quantify is the similarity of month-to-month movement. In this way we can discover relationships between series that we did not know about beforehand.
We came up with 5 approaches, and all of them were tested on both the test datasets and the raw/real datasets.
The test datasets are generated from an ARMA model with different values of phi and theta. With pre-determined similarity or dissimilarity between the datasets, I can verify each method by testing it on each dataset. I identified 4 types of groups (each series with 20 observations): a similar group containing Series 2 & 3; a dissimilar group (Series 1 & 2); an identical and shifted group, Series 1 & 4 (as you can see, they have the same phi and theta, just shifted by 1 time point); and a similar and shifted pair (Series 4 & 5, whose parameters are off by 0.05, each with 21 observations).
The raw datasets contain commodities information from 1997 to 2015. Dataset 1 is customs based and dataset 2 is based on balance of payment.
The first approach to quantifying similarity between series is to construct a correlation table: correlation coefficients for all combinations of series are computed and organized into a table. I first tested this approach on the test dataset. As the table shows, a correlation closer to 1 or −1 suggests strong linear correlation between the series, and the approach correctly identified the similar and dissimilar groups. It can also detect period movement, i.e. the shift in the data. However, in this case that could be a coincidence: since I shifted the series by 1 time unit, when one goes up and then down at certain time points, the shifted one goes down and then up, resulting in a negative correlation. Another major problem with this approach is that it can only capture linear correlation between series, while in reality many non-linear relationships also exist between series, and this approach will fail to detect them. It is also very sensitive to outliers: a single outlier can severely distort the correlation coefficients, which results in wrong identification of similar or dissimilar pairs.
The correlation coefficients in Table 1 confirm the similarity between df1 & df4, df1 & df5, df2 & df3 and df4 & df5 with their absolute values being close to 1. The dissimilar series are also correctly identified with correlation coefficients being close to 0.
It captures only linear correlation between series: any non-linear relationship between 2 series cannot be detected by this approach.
For the raw datasets, a correlation matrix was also calculated, and it identified numerous pairs of similar series. One way to visualize them is to plot one series against the other. As we see from the graphs, for a similar pair identified by linear correlation the points scatter around the diagonal, while for a dissimilar pair the points scatter off the diagonal. For the dissimilar plot, a pattern appears in the graph: most of the points are scattered around 2 regions, yet the correlation approach considers these series to have no linear correlation. This approach is the easiest to implement and can be used as a preliminary examination of the series. Next I explain the more sophisticated approaches.
When plotting one series against the other:
Similar pair: the points scatter around the diagonal
Dissimilar pair: the points scatter off the diagonal
The second approach is a SAS procedure called proc similarity, which computes similarity measures for time-stamped data, time series, and other sequentially ordered numeric data. The basic idea is to use a distance matrix to compute a quantitative measure for a pair of series. Initially there are 2 series, X and Y, with n and m observations; one is referred to as the input series, the other as the target series. These sequences are then normalized and rescaled, which can easily be specified and implemented within the procedure. A distance matrix is constructed by calculating the difference between each pair of data points, specifically input data point minus target data point, as in the table on the right: D11 is the value of series X at time 1 minus that of Y at time 1, and so on. The next step is to compute all possible paths that traverse the matrix from the left side to the right side. The procedure assigns a path index to each path to indicate the number of movements associated with it, where moving from one cell to the next counts as 1 step. For example, a path that takes 11 steps to complete has a path index of 11.
Input and target series are normalized and input sequence is scaled to the target sequence before constructing distance matrix
It computes all possible paths to traverse the matrix and assigns a path index to each path indicating the number of movements associated with that path.
Compression and Expansion
Next, we can choose the similarity measures and other statistics to produce. The first is the absolute deviation, which is the total distance of the minimum path that traverses the matrix. For this project we limit the path to the diagonal only, so that we have comparability across all combinations of series in the raw datasets. Cost statistics can also be produced, containing basic descriptive statistics of the minimum path. Path statistics are the proportions of direct path (the diagonal movement), compression (vertical movement) and expansion (horizontal movement). As a general rule, the smaller the distance associated with the direct path, the more similar the series are.
Since measuring similarity for my specific time series data (commodities data) uses only a small portion of the functionality of proc similarity, I will talk a bit more about the procedure and illustrate its flexibility.
This is the basic structure of the procedure: 2 series are coded as an input series and a target series. The procedure includes a transformation mechanism that can be easily specified; for this project, both normalization and rescaling were used. The reason is that when I ran the procedure without any rescaling, the similarity measure (the absolute deviation) could be larger for a similar pair whose values are far apart than for a dissimilar pair moving in opposite directions that even cross each other at certain time points. When I constructed the distance matrix without any transformation, the differences between input and target values for the far-apart similar pair were large, so the absolute deviation computed from that matrix was of course large. When I constructed the distance matrix of a dissimilar pair that crosses, the differences near the crossing section were very small and led to a smaller value of the similarity measure.
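The transformations described above can be sketched as follows. This is one plausible reading of the normalize/scale options (standardize each series; map the input's range onto the target's range), not the exact SAS definitions; function names are mine.

```python
import math

def standardize(x):
    """Standard normalization: zero mean, unit standard deviation."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]

def scale_to_target(x, target):
    """Map the input's range linearly onto the target's range, so that two
    series far apart in level become directly comparable."""
    lo, hi = min(x), max(x)
    tlo, thi = min(target), max(target)
    return [tlo + (v - lo) * (thi - tlo) / (hi - lo) for v in x]
```

After a transformation like this, the distance matrix reflects shape differences rather than level differences, which is why the rescaled measure no longer rewards dissimilar pairs that happen to cross.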
In terms of the similarity measure, there are also various options to choose from. They fall into 2 major groups: squared or absolute deviation, and mean squared or absolute deviation. The means can be calculated relative to the length of the series, or to the minimum or maximum valid path length, to suit the needs of a specific analysis.
In terms of output, not only can the procedure produce various tables containing the statistics mentioned before, it can also produce various plots. The table here is just an example of one output: it lists the transformed input and target series, along with the path index for each series and the corresponding distance.
On the left is the path and cost statistics output. This is a screenshot from analyzing the commodity series, so only the diagonal path is used; but you can also specify expansion and compression limits to go off the diagonal. On the right is a series of graphs produced by the procedure: the left one is the original plot of the 2 series, which cross each other; the right one shows the rescaled and normalized series. The path plot indicates the minimum path that traverses the distance matrix. The distance of each path is also plotted, as well as the distribution of distances over all possible paths.
Plot and distribution of path relative distance
path relative distance = path distance / corresponding target sequence value
This is a snapshot of the top 5 most similar pairs identified by this approach, and on the right is a plot of the original data for one of those pairs. We can see from the graph that the 2 series are very similar in terms of month-to-month movement. One advantage of this approach is that it has the highest accuracy rate in detecting similar pairs compared to the other approaches. The series can be easily transformed in the code, and I was able to use a DO loop to compute the similarity measure for all pair-wise combinations, just like in the correlation matrix. Since the procedure involves normalization and rescaling, it performs very well when examining dissimilar series that cross each other.
One disadvantage of this method is that when dealing with series that are similar but shifted by some time period, its measurement is not very accurate. From the results on the test dataset, we can see it correctly detects the similar, dissimilar and identical & shifted pairs, but not the similar & shifted pair. Remember the rule is the smaller the better, but for test series that are similar but shifted by 1 time unit, its value is very large. It is also very sensitive to outliers when building the distance matrix: the distance at an outlier time point will be very large, which inflates the absolute deviation similarity measure.
Higher accuracy rate of detecting similar pairs compared to the other methods
Without rescaling, it performs badly on totally dissimilar series that cross each other, because of the intersection of the 2 series; but the trend is only a single consideration during seasonal adjustment, and the series can be de-trended.
Without rescaling, the proc similarity measure of these 2 series is 196.4936, which is smaller due to the intersection of the 2 series: at the intersection the distance between the 2 series is very small, which decreases the sum of absolute deviations.
These are a few plots of series with their corresponding similarity measures from this approach. As you can see, the series get increasingly similar as the number gets smaller; for the perfectly identical pair, the measure is essentially zero. The last graph is a result of combining the customs-based and balance-of-payments commodity datasets, so it could mean they are the same product.
When I combined 2 raw datasets, this approach performs very well in detecting and distinguishing similar (or identical) and dissimilar pairs.
What I want to point out about these plots is that even though the first pair is relatively closer together than the 2nd pair, the similarity measure is still able to distinguish the levels of similarity.
The third approach is called the SIM coefficient method: use the formula above to calculate a similarity measure between 2 series and evaluate them based on it. The first step is to take the relative differences of both series; call them the input and target series. The SIM coefficient is then calculated by the formula: take the absolute value of the difference between the 2 series at each time point, divide by the maximum absolute value at that time point, sum these up, and divide by the total number of observations minus 1. The general rule for this coefficient is the closer the value to zero, the more similar the series are. If both series are identical, then y_t^(1) minus y_t^(2) is zero; if y_t^(1) and y_t^(2) are always very close at every time point, the value will also be close to 0.
These are the top 5 most similar pairs identified by this approach, and you can see a plot of the most similar pair. Since the formula for this similarity measure is fairly straightforward, I was also able to quickly compute all pair-wise combinations. This approach also performs pretty well at determining dissimilar pairs that cross each other, since the measure is calculated after taking relative differences. One disadvantage is that its accuracy rate is lower than proc similarity, and it is slower and less flexible. On the other hand, this approach performs relatively well at detecting identical & shifted and similar & shifted pairs, as you can see from this table of results on the test datasets.
The next approach is the derivative comparison approach. The basic idea is that if 2 series are moving in the same direction at the same rate throughout the observation period, they should be similar in terms of month-to-month movements. I fit a spline function to each series to obtain a mathematical equation for the data, then compare the first and second derivatives between the 2 series. The red line is the slope and the green line is the rate of change at each knot.
This approach fits a spline function to each series and calculates first and second derivatives to evaluate the similarity between the 2 series.
A spline function with a specified number of knots is used to represent each series.
As you can see from the plots of a dissimilar pair, the red line, which is the slope, is very different from one series to the other. A more quantitative way to measure this is to take the differences between the series' first and second derivatives.
The results in this table are based on the test datasets: the similar group indeed has lower differences between 1st and 2nd derivatives, and the dissimilar group has larger values. But the method fails to distinguish the identical & shifted and similar & shifted groups. The reason is that since the knots of the spline function are fixed, derivatives are only calculated at each knot, which in the shifted case are off by 1 time point; I specified a knot at every other time point. Since the number of knots is smaller than the number of observations, this can be both a good thing and a bad thing: with one choice of knots you might skip over certain outliers, while with another you might include them. One way to fix this is to alternate the positions of the knots, construct multiple spline functions for each series, and repeat the derivative comparison to get the final results; this may accommodate the disadvantage of fixed knots and the method's inability to detect shifted pairs. Another drawback of this approach is that it runs pretty slowly, at least with the code I have written.
From the graphs, the first and second derivatives of the similar series (df2 & df3) are also similar, while the dissimilar ones (df1 & df2) show obvious distinctions. The magnitudes of the differences of derivatives between series are summarized in the following table.
Because the positions of the knots are fixed, the method is bad at detecting shifted pairs.
Finally, there is another method I looked into only briefly compared to the others: the spectral analysis approach. I wasn't familiar with the term, so I did quite a bit of research online. From what I understand, spectral analysis represents a time series using cyclical components of different frequencies, in contrast to the ARIMA model from the time series course, which uses previous realizations and white noise. Just as we study the auto-covariance and autocorrelation functions of a stationary time series in the time domain, we can study the spectral density function, or spectrum, as a function of frequency in the frequency domain.
Cross-spectral analysis is an extension of these techniques that enables 2 series to be analyzed simultaneously. The extent to which any frequency component in one series is correlated with the corresponding frequency component in another series can be estimated as the coherence; if you plot coherence against frequency, you can identify the pattern of correlation between pairs of components. The SAS procedure proc spectra can produce all of these statistics, including the squared coherence and the sine and cosine transforms. One statistic, the phase spectrum, which is a parameter of the cross-spectrum formula, is plotted against frequency: as you can see on the right, the similar pair has fewer peaks than the dissimilar pair. This approach explains time series data from a different perspective and can serve as another approach to look into for quantifying the similarity between 2 series.
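The phase spectrum of the cross-spectrum can be illustrated with a small standard-library sketch. This is not proc spectra's implementation: it uses a naive O(n²) DFT, and the function names are mine.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (O(n^2)), standard library only."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def phase_spectrum(x, y):
    """Phase (in radians) of the cross-spectrum X(f) * conj(Y(f))."""
    cross = [a * b.conjugate() for a, b in zip(dft(x), dft(y))]
    return [cmath.phase(c) for c in cross]
```

For identical series the cross-spectrum is the real, non-negative power spectrum, so the phase is zero at every frequency; shifts and disagreements between two series show up as non-zero phase, consistent with the similar pair having fewer peaks in the phase plot.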
In conclusion, each approach has its pros and cons. In short: the correlation matrix approach is easy to interpret and is the fastest at computing all pair-wise combinations. The proc similarity approach has the highest accuracy rate, with more functionality and flexibility in manipulating the data. The SIM coefficient approach has a straightforward formula for quantifying similarity and can add more accuracy on top of proc similarity. The derivative comparison approach can, in my opinion, serve as a confirmation mechanism to further verify the similarity results. And the spectral analysis approach looks at the series from a frequency-based perspective and can provide more quantitative measures of similarity. Note also that all of the methods are sensitive to outliers, so one way to improve the outcome would be to use outlier-treated datasets. In my opinion, the most recommended approach for quantifying similarity between time series is the SAS proc similarity procedure; if more accuracy is needed, the SIM coefficient method can serve as a second filter to further remove inaccurate or dissimilar pairs.