Similarity Measures for Time Series
SOPHIA HE
Outline
• Why find similarity?
• Test & Raw datasets
• Approach 1 Correlation Table
• Approach 2 SAS Proc Similarity Method
• Approach 3 SIM Coefficient Method
• Approach 4 Derivatives Comparison Method
• Approach 5 Spectral Analysis Method
• Conclusion
Test Dataset & Raw Datasets
TEST DATASETS
Generated 5 test datasets (20 observations each)
using ARMA(p, q) model
ARMA(1,1)
\[ Z_t = a_t + \phi Z_{t-1} - \theta a_{t-1} \]
where a_0 = 0, Z_0 = 0, t = 1, …, 20
• Series 1: φ = -0.8, θ = 0.1
• Series 2: φ = 0.8, θ = -0.1
• Series 3: φ = 0.85, θ = -0.15
• Series 4: φ = -0.8, θ = 0.1, shifted, with 21 observations
• Series 5: φ = -0.85, θ = 0.15, with 21 observations
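The ARMA(1,1) recursion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_arma11`, the standard normal innovations a_t, and the fixed seed are not from the slides, which do not say how the noise was drawn or whether the series shared it.

```python
import random


def simulate_arma11(phi, theta, n=20, seed=0):
    """Simulate Z_t = a_t + phi*Z_{t-1} - theta*a_{t-1} with a_0 = 0 and Z_0 = 0."""
    rng = random.Random(seed)
    z_prev, a_prev = 0.0, 0.0  # starting values Z_0 and a_0 from the slide
    series = []
    for _ in range(n):
        a_t = rng.gauss(0.0, 1.0)  # assumption: standard normal white noise
        z_t = a_t + phi * z_prev - theta * a_prev
        series.append(z_t)
        z_prev, a_prev = z_t, a_t
    return series


# Parameters of Series 2 and Series 3; sharing one seed (an assumption)
# gives both series the same innovations.
series2 = simulate_arma11(0.8, -0.1)
series3 = simulate_arma11(0.85, -0.15)
```

With a shared seed the two series differ only through their parameters, which is one plausible way to obtain a "similar" pair.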
RAW DATASETS
Raw_itd1 & Raw_itd2: commodities datasets
• 119 series in Raw_itd1
• 120 series in Raw_itd2
• 222 monthly observations from January 1997 to
June 2015
Group                     Series
Similar Group             Series 2&3
Dissimilar Group          Series 1&2
Identical Shifted Group   Series 1&4
Similar Shifted Group     Series 1&5
Approach 1 Correlation Table
Test dataset:
Advantages:
• Able to detect period movements/shifts, but unstable
Disadvantages:
• Only captures linear correlation between series
• Sensitive to outliers
Table 1 Correlation Coefficients
Series   S1        S2        S3        S4         S5
S1       1         0.18454   0.14729   -0.9221    0.98857
S2       0.18454   1         0.99359   0.0392     0.15459
S3       0.14729   0.99359   1         0.05083    0.12369
S4       -0.9221   0.0392    0.05083   1          -0.95089
S5       0.98857   0.15459   0.12369   -0.95089   1
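A correlation table of this kind can be computed with a plain Pearson coefficient for every pair of series. The sketch below uses only the Python standard library; the helper names `pearson` and `correlation_table` and the toy data are illustrative assumptions, not the SAS code behind the slide.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("series must have the same length")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_table(series):
    """Map every ordered pair of series names to its correlation."""
    names = list(series)
    return {(p, q): pearson(series[p], series[q]) for p in names for q in names}


# Toy data (hypothetical): B is a multiple of A, C runs opposite to A.
demo = {"A": [1.0, 2.0, 3.0, 4.0], "B": [2.0, 4.0, 6.0, 8.0], "C": [4.0, 3.0, 2.0, 1.0]}
table = correlation_table(demo)
```

Because the coefficient needs equal-length inputs, a 21-observation series would have to be trimmed or aligned with a 20-observation one first; the slides do not say how that was handled.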
Approach 1 Correlation Table
[Figure: scatter plots of one series against the other for a similar pair and a dissimilar pair]
Approach 2 SAS Proc Similarity Method
• Use a distance matrix to compute a similarity measure for a pair of series
Data
Input series:  X1 X2 X3 … Xn
Target series: Y1 Y2 Y3 … Ym
Distance Matrix (columns: input series, rows: target series)
       X1    X2    X3    …    Xn
Y1     D11   D12   D13   …    D1n
Y2     …     …     …     …    …
Y3     …     …     …     …    …
…      …     …     …     …    …
Ym     Dm1   …     …     …    Dmn
Dij = Input Series - Target Series
e.g. D11 = X1 - Y1
The series are normalized and rescaled
Computes all possible paths to traverse the matrix
Approach 2 SAS Proc Similarity Method
Output:
1. Similarity Measures: Absolute Deviation
• Measure=ABSDEV / Absolute Deviation
• Total distance of the minimum path that traverses the distance matrix
2. Cost Statistics: statistics associated with minimum path
3. Path Statistics: indicating percentages of direct path (diagonal movement),
compression (vertical movement) and expansion (horizontal movement)
The Less the Amount of Absolute Deviation, the More Similar the Two Series Are
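The idea of a minimum path through the distance matrix can be illustrated with a textbook dynamic-programming recursion in Python. This is a simplified sketch, not the PROC SIMILARITY algorithm itself: it skips normalization and scaling, places no limits on compression or expansion, and the function name `min_path_absdev` is hypothetical.

```python
def min_path_absdev(x, y):
    """Total absolute deviation along the cheapest path through the distance matrix.

    x is the input series (columns), y is the target series (rows).
    Each step moves diagonally (direct), vertically, or horizontally.
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = cheapest total distance of a path ending at (Y_i, X_j)
    cost = [[inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(x[j - 1] - y[i - 1])  # |X_j - Y_i|
            cost[i][j] = d + min(cost[i - 1][j - 1],  # diagonal step
                                 cost[i - 1][j],      # vertical step
                                 cost[i][j - 1])      # horizontal step
    return cost[m][n]


print(min_path_absdev([1, 2, 3], [1, 2, 2, 3]))  # a repeated value is absorbed by warping
```

A smaller total means the two series can be matched more cheaply, which agrees with the rule that less absolute deviation means more similar.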
More on Proc Similarity in SAS
Basic Structure
proc similarity data=data out= ;
input S1 / normalize = scale=;
target S2 / slide = normalize= measure=
compress=(localabs=0) expand=(localabs=0);
run;
Output:
Similarity Measures
Path & Cost Measures
Transformed Input & Target Series
Input & Target Path Index
Distance Metric
Transformation
Normalization:
• Absolute:
• Standard:
Scale:
• Absolute:
• Standard:
User-defined:
• FCMPOPT Statement & Options
Measures
1. SQRDEV/ABSDEV: squared or absolute deviation
2. MSQRDEV/MABSDEV: mean squared or absolute deviation
• relative to the length of the input or target sequence
• relative to the minimum or maximum valid path length
3. User-defined Measures
More on Proc Similarity in SAS
Path & Cost Statistics Plots
Approach 2 SAS Proc Similarity Method
Advantages:
• Higher accuracy rate in detecting similar pairs
• Normalizes and rescales the series
• Computes pair-wise similarity measures using a DO loop
• Performs well on totally dissimilar series that cross each other
Disadvantages:
• Bad at detecting similar and shifted series
• Sensitive to outliers
Results of Raw_itd1
Series Pairs                   Proc Similarity Measure
RW_CMACDG391 & RW_CMACDP553    44.00063414
RW_CMACDG183 & RW_CMACDG274    49.51170024
RW_CMACDG274 & RW_CMACDG391    52.46527532
RW_CMACDG274 & RW_CMACDG474    53.32933511
RW_CMACDG274 & RW_CMACDP553    54.0065316
RW_CMACDG274 & RW_CMACDG221    374.47813762
Table 2 Proc Similarity Measures
Group                     Series       Absolute Deviation
Similar Group             Series 2&3   1.92422
Dissimilar Group          Series 1&2   20.25579
Identical Shifted Group   Series 1&4   1.20447
Similar Shifted Group     Series 1&5   33.96824
Approach 2 SAS Proc Similarity Method
[Figure: three example pairs with their similarity measures]
Similar Pair: Similarity Measure = 51.98
Identical Pair: Similarity Measure = 2.78E-14
Very Similar Pair: Similarity Measure = 14.11
Approach 3 SIM Coefficient
SIM coefficient is calculated by the following:
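\[ y_t^{(1)} = \frac{x_t^{(1)} - x_{t-1}^{(1)}}{x_{t-1}^{(1)}} \quad \text{for } t = 2, \dots, T \]
\[ \mathrm{Sim}\left(y^{(1)}, y^{(2)}\right) = \sum_{t=2}^{T} \left[ \frac{\left| y_t^{(1)} - y_t^{(2)} \right|}{\max\left( \left| y_t^{(1)} \right|, \left| y_t^{(2)} \right| \right) (T-1)} \right] \]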
The Closer the Value to Zero, the More Similar the Series Are.
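The SIM coefficient can be written directly from its definition: convert each series to period-to-period relative changes, then average the scaled absolute differences over the T-1 changes. The Python sketch below assumes equal-length series with no zero values; treating a term with a zero denominator as 0 is an assumption, since the slide does not cover that case.

```python
def relative_changes(x):
    """y_t = (x_t - x_{t-1}) / x_{t-1} for t = 2, ..., T."""
    return [(x[t] - x[t - 1]) / x[t - 1] for t in range(1, len(x))]


def sim_coefficient(x1, x2):
    """SIM coefficient of two equal-length series; values near 0 mean similar."""
    y1, y2 = relative_changes(x1), relative_changes(x2)
    if len(y1) != len(y2):
        raise ValueError("series must have the same length")
    total = 0.0
    for a, b in zip(y1, y2):
        denom = max(abs(a), abs(b))
        if denom > 0:  # assumption: a 0/0 term contributes nothing
            total += abs(a - b) / denom
    return total / len(y1)  # len(y1) equals T - 1


print(sim_coefficient([1.0, 2.0, 4.0], [1.0, 2.0, 4.0]))  # identical series
```

Each term lies between 0 and 2, so the coefficient does too; it is 0 when the relative changes coincide and reaches 2 when they are equal in size but opposite in sign.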
Approach 3 SIM Coefficient
Results of Raw_itd1
Series Pairs                   SIM_final
RW_CMACDG291 & RW_CMACDG472    0.504446
RW_CMACDG291 & RW_CMACDG473    0.515302
RW_CMACDG284 & RW_CMACDG472    0.544256
RW_CMACDG221 & RW_CMACDG282    0.546508
RW_CMACDG221 & RW_CMACDG291    0.548653
Cut-off: considered similar when below 0.7
Advantages:
• Computes pair-wise similarity measures using a DO loop
• Performs well on totally dissimilar series that cross each other
• Best at detecting both identical & shifted series and similar & shifted series
Disadvantages:
• Accuracy rate is lower than that of the Proc Similarity method
Table 3 SIM Measures
Group                     Series       SIM Coefficient
Similar Group             Series 2&3   0.36887
Dissimilar Group          Series 1&2   0.98743
Identical Shifted Group   Series 1&4   0.23527
Similar Shifted Group     Series 4&5   0.23978
Approach 4 Derivatives Comparison Method
Step 1: Use spline function to represent series
Step 2: Compute first derivatives/slopes at each knot
Step 3: Compute second derivatives/rate of change at
each knot
Step 4: Compute the differences between the two series' first derivatives
and between their second derivatives
The Smaller the Difference, the More Similar the Series Are.
Basic Idea:
spline function on series + calculation of first & second derivatives
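The four steps can be approximated without a spline fit. The Python sketch below swaps the spline derivatives for simple finite differences and sums the absolute gaps; both the substitution and the choice of aggregation are assumptions made to keep the example short, not the method used for the slides.

```python
def first_differences(x):
    """Finite-difference stand-in for the slope at each point."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]


def derivative_gap(x1, x2):
    """Return (gap in first derivatives, gap in second derivatives).

    Each gap is the sum of absolute differences between the two series'
    derivative estimates (an assumed aggregation).
    """
    d1_a, d1_b = first_differences(x1), first_differences(x2)
    d2_a, d2_b = first_differences(d1_a), first_differences(d1_b)
    gap1 = sum(abs(a - b) for a, b in zip(d1_a, d1_b))
    gap2 = sum(abs(a - b) for a, b in zip(d2_a, d2_b))
    return gap1, gap2


print(derivative_gap([0, 1, 2, 3], [0, 2, 4, 6]))  # slopes differ, curvature does not
```

Two series that move in parallel give gaps of zero even if their levels differ, which matches the intent of comparing movements rather than values.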
Approach 4 Derivatives Comparison Method
Dissimilar Series: Series 1&2
Quantitative Measures
Approach 4 Derivatives Comparison Method
Table 4 Comparison of Derivatives
Group                     Difference between 1st Derivative   Difference between 2nd Derivative
Similar Group             0.81508                             0.16769
Dissimilar Group          20.4207                             4.21562
Identical Shifted Group   24.2524                             5.17868
Similar Shifted Group     34.8150                             7.75581
Disadvantages:
• Slow in processing time
• Bad at detecting both identical & shifted series and similar & shifted series
• # of knots < # of observations
Approach 5 Spectral Analysis Method
Time Series Representation
Time Domain: e.g. ARIMA Model
• Auto-covariance
• Auto-correlation
Frequency Domain:
• Spectral density function
SAS: proc spectra
Plot frequency against phase spectrum in radians of X and Y
Phase spectrum:
[Figures: phase spectrum plots for a Similar Pair and a Dissimilar Pair]
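A raw phase spectrum of two series can be sketched with a plain discrete Fourier transform: the phase at each frequency is the angle of the cross-periodogram. The Python code below is an unsmoothed illustration with hypothetical function names; PROC SPECTRA applies its own smoothing and options, and sign conventions for the phase differ between tools.

```python
import cmath
import math


def dft(x):
    """Discrete Fourier transform by direct summation (fine for short series)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def phase_spectrum(x, y):
    """Return (frequency in radians, phase in radians) for k = 1, ..., n // 2."""
    n = len(x)
    fx, fy = dft(x), dft(y)
    result = []
    for k in range(1, n // 2 + 1):
        cross = fx[k] * fy[k].conjugate()  # raw cross-periodogram ordinate
        result.append((2 * math.pi * k / n, cmath.phase(cross)))
    return result


# Toy pair (hypothetical): y is x delayed by a quarter cycle.
x = [math.cos(2 * math.pi * t / 8) for t in range(8)]
y = [math.sin(2 * math.pi * t / 8) for t in range(8)]
spectrum = phase_spectrum(x, y)
```

At the frequency where both toy series carry their power, the phase is a quarter cycle; at frequencies with almost no power, the raw phase is numerically unstable, which is one reason smoothed estimates are used in practice.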
Conclusion
Approach 1 Correlation Table:
• Easy to Interpret & Fastest in Computing Pair-Wise Measures
Approach 2 SAS Proc Similarity Method:
• More Functionalities & Highest Accuracy Rate
Approach 3 SIM Coefficient:
• Straightforward Formula & Adds More Accuracy
Approach 4 Derivatives Comparison Method:
• Confirmation Mechanism
Approach 5 Spectral Analysis Method:
• Different Perspective & Measures
Thank You
SOPHIA HE

Editor's Notes

  1. At the beginning of the term I was introduced to the different seasonal adjustment options that the X12 program can produce; similar series will share similar or even identical options. If we can determine similarity beforehand and obtain a quantitative measure of it, we can avoid redundant seasonal adjustment processing on similar series and be better prepared to explain similar or identical options for different series. So what we were trying to quantify is the similarity of month-to-month movements. In this way, we can find relationships that we did not know about beforehand and see how the series are related to each other. We came up with 5 approaches, and all of them were tested on both the test datasets and the raw (real) datasets.
  2. The test datasets are generated from an ARMA model with different values of phi and theta. Because the similarity or dissimilarity between the datasets is pre-determined, I can verify each method by testing it on each dataset. I identified 4 types of groups (each with 20 observations): a similar group, which contains Series 2&3; a dissimilar group (Series 1&2); an identical and shifted group, which is Series 1&4 (as you can see, they have the same phi and theta and are just shifted by 1 time point); and a similar and shifted pair (Series 4&5, so the parameters are off by 0.05, with 21 observations). The raw datasets contain commodities information from 1997 to 2015. Dataset 1 is customs based and dataset 2 is based on balance of payments. Add ARIMA model formula
  3. The first approach to quantifying similarity between series is to construct a correlation table: the correlation coefficients of all combinations of series are computed and organized into a table. I first tested this approach on the test datasets. A correlation closer to 1 or -1 suggests a strong linear correlation between the series. As you can see from the table, it correctly identified the similar and dissimilar groups. It can also detect period movement, which is the shift in the data. However, in this case that could be a coincidence: since I shifted the series by 1 time unit, when the original goes up and then down at certain time points, the shifted one goes down and then up, resulting in this negative correlation. Another major problem with this approach is that it can only capture linear correlation between series, while in reality many non-linear relationships also exist between series, and this approach will fail to detect them. It is also very sensitive to outliers: a single outlier can severely affect the correlation coefficient, which results in wrong identification of similar or dissimilar pairs. The correlation coefficients in Table 1 confirm the similarity between df1 & df4, df1 & df5, df2 & df3 and df4 & df5, with absolute values close to 1. The dissimilar series are also correctly identified, with correlation coefficients close to 0.
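The correlation table itself is simple to reproduce. The sketch below, in plain Python, computes pairwise Pearson coefficients and shows how a single outlier can flip the sign of the coefficient. The helper names `pearson` and `correlation_table` and the toy series are hypothetical and are not the data from the slides.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_table(series_by_name):
    """Nested dict holding the coefficient for every ordered pair of series."""
    names = list(series_by_name)
    return {a: {b: pearson(series_by_name[a], series_by_name[b]) for b in names}
            for a in names}


# Toy series (illustrative only).
demo = {
    "up": [1.0, 2.0, 3.0, 4.0, 5.0],
    "up_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "down": [5.0, 4.0, 3.0, 2.0, 1.0],
    "up_with_outlier": [1.0, 2.0, 3.0, 4.0, -20.0],
}
table = correlation_table(demo)
```

Here `table["up"]["up_with_outlier"]` comes out negative even though four of the five points rise together, which is the outlier sensitivity described in the note.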
  4. For the raw datasets, a correlation matrix was also calculated, and it identified numerous pairs of similar series. One way to visualize this is to plot one series against the other. As we see from the graphs, a similar pair identified by linear correlation has points scattered around the diagonal, while a dissimilar pair has points scattered off the diagonal. In the dissimilar plot a pattern does appear: most of the points are scattered around 2 regions. However, the correlation approach treats the pair as having no linear correlation. This approach is the easiest to implement and can be used as a preliminary examination of the series. Next I will explain the more sophisticated approaches.
  5. The second approach is a SAS procedure called proc similarity. It computes similarity measures for time-stamped data, time series, and other sequentially ordered numeric data. The basic idea is to use a distance matrix to compute a quantitative measure for a pair of series. Initially there are 2 series, X and Y, with n and m observations. One is referred to as the input series, the other as the target series. The input and target series are then normalized, and the input sequence is scaled to the target sequence, which can be easily specified and implemented within the procedure. A distance matrix is then constructed by calculating the difference between data points, more specifically the input data point minus the target data point, as in the table on the right. So D11 is the value of series X at time 1 minus that of Y at time 1, and so on. The next step is to compute all possible paths that traverse the matrix from the left side to the right side. For example, this is one of the paths and this is another. The procedure assigns a path index to each path to indicate the number of movements associated with it, where moving from one cell to the next counts as 1 step. For example, if a path takes 11 steps to complete, its path index is 11. Paths can also involve compression and expansion.
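To make the distance matrix and path idea concrete, here is a small dynamic-programming sketch in Python. It is not the SAS implementation: the function names, the tie-breaking rule, and the toy series are assumptions, and it only mirrors the description above (D = input minus target, steps that are direct, vertical or horizontal, and a path cost equal to the sum of absolute distances).

```python
def distance_matrix(x, y):
    """D[i][j] = x[j] - y[i]: rows follow the target series, columns the input series."""
    return [[xj - yi for xj in x] for yi in y]


def minimum_path(dist):
    """Return (total absolute deviation, move counts) for the cheapest path
    from the top-left cell to the bottom-right cell of the distance matrix."""
    m, n = len(dist), len(dist[0])
    cost = [[0.0] * n for _ in range(m)]
    move = [[None] * n for _ in range(m)]
    cost[0][0] = abs(dist[0][0])
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], "direct"))   # diagonal step
            if i > 0:
                candidates.append((cost[i - 1][j], "compression"))  # vertical step
            if j > 0:
                candidates.append((cost[i][j - 1], "expansion"))    # horizontal step
            # On ties the first candidate wins, so direct steps are preferred.
            best, kind = min(candidates, key=lambda c: c[0])
            cost[i][j] = best + abs(dist[i][j])
            move[i][j] = kind
    # Walk back from the last cell to count each kind of step on the best path.
    counts = {"direct": 0, "compression": 0, "expansion": 0}
    i, j = m - 1, n - 1
    while move[i][j] is not None:
        kind = move[i][j]
        counts[kind] += 1
        if kind != "expansion":
            i -= 1
        if kind != "compression":
            j -= 1
    return cost[m - 1][n - 1], counts


# Toy input and target: the same ramp, with the target delayed by one time point.
x = [0.0, 1.0, 2.0, 3.0, 3.0]
y = [0.0, 0.0, 1.0, 2.0, 3.0]
total, moves = minimum_path(distance_matrix(x, y))
path_index = sum(moves.values())  # number of steps taken along the best path
```

In this toy case the diagonal-only path has a positive cost, while allowing one compression and one expansion step lets the path line the two ramps up exactly.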
  6. Next, we can choose the similarity measures and other statistics we want to produce. The first one is the absolute deviation, which is the total distance of the minimum path that traverses the matrix. For this project, we limited the path option to the diagonal only, so that the measure is comparable across all combinations of series in the raw datasets. Cost statistics can also be produced; they contain basic descriptive statistics of the minimum path. Path statistics are basically the proportions of direct path, which is the diagonal movement, compression, which is the vertical movement, and expansion, which is the horizontal movement. As a general rule, the smaller the distance associated with a direct path, the more similar the series are.
  7. Measuring similarity between my specific time series, the commodities data, uses only a small portion of the functionality of proc similarity in SAS, so I will talk a bit more about the procedure to illustrate its flexibility. This is the basic structure of the procedure. The 2 series are coded as an input series and a target series. The procedure includes a transformation mechanism that can be easily specified. For this project, both normalization and rescaling were used. The reason is that when I ran the procedure without any rescaling, the similarity measure, which is the absolute deviation, could be larger for a similar pair whose values are far apart than for a dissimilar pair moving in opposite directions and even crossing each other at certain time points. Without any transformation, the differences between input and target values in the distance matrix are larger for 2 series that are far apart than for 2 series that move close to each other, so the absolute deviation calculated from that matrix is of course larger. For a dissimilar pair that crosses each other, the differences around the crossing can be very small, which leads to a smaller value of the similarity measure. In terms of the similarity measure, there are also various options to choose from. They fall into 2 major groups: squared or absolute deviation, and mean squared or mean absolute deviation. The means can be calculated relative to the length of the series or to the minimum or maximum of the path, to suit the needs of a specific analysis. In terms of output, the procedure can produce not only various tables containing the statistics I mentioned before, but also various plots. This table here is just an example of one of its outputs. It lists all the transformed input and target series, along with the path index for each series and the corresponding distance.
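The effect of rescaling described in this note can be reproduced with a few lines of Python. In this sketch, z-score standardization stands in for the procedure's normalization and scaling options (an assumption; the exact SAS transformations are not reproduced here), and the diagonal-only absolute deviation is used as the similarity measure. The function names and toy series are hypothetical.

```python
from math import sqrt


def standardize(series):
    """Z-score a series; a stand-in for the normalization and rescaling step."""
    n = len(series)
    mean = sum(series) / n
    sd = sqrt(sum((v - mean) ** 2 for v in series) / (n - 1))
    return [(v - mean) / sd for v in series]


def diagonal_absolute_deviation(x, y):
    """Sum of |x_t - y_t| along the diagonal of the distance matrix."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Toy series (illustrative only).
base = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0]
far_but_similar = [v + 100.0 for v in base]   # same movement, far away in level
crossing = [5.0, 3.0, 4.0, 2.0, 3.0, 1.0]     # opposite movement, crosses base

raw_similar = diagonal_absolute_deviation(base, far_but_similar)
raw_crossing = diagonal_absolute_deviation(base, crossing)
std_similar = diagonal_absolute_deviation(standardize(base), standardize(far_but_similar))
std_crossing = diagonal_absolute_deviation(standardize(base), standardize(crossing))
```

Without the transformation the far-apart similar pair gets the larger (worse) measure; after standardization the ordering reverses and the similar pair scores essentially zero.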
  8. On the left is the path and cost statistics output. This is a screenshot from analyzing the commodity series, so only the diagonal path is used. You can also specify expansion and compression limits so that the path can go off the diagonal. On the right is a series of graphs produced by the procedure. As you can see, the left one is the original plot of the 2 series, which cross each other, and the right one shows the rescaled and normalized series. This is the path plot, which shows the minimum path traversing the distance matrix. The distance along each path is also plotted, as well as the distribution of distance over all possible paths. There is also a plot and distribution of path relative distance, where path relative distance = path distance / corresponding target sequence value.
  9. This is a snapshot of the top 5 most similar pairs identified by this approach, and on the right is a plot of the original data for one of these pairs. We can see from the graph that the 2 series are very similar in terms of month-to-month movement. One advantage of this approach is that it has the highest accuracy rate in detecting similar pairs compared to the other approaches. The series can be easily transformed in the code, and I was able to use a DO loop to compute the similarity measure for all pair-wise combinations, just like in the correlation matrix. Since the procedure involves normalization and rescaling, it performs very well when examining dissimilar series that cross each other. One disadvantage is that its measurement is not very accurate for series that are similar but shifted by a certain time period. From the results on the test datasets, we can see that it correctly detects the similar, dissimilar and identical & shifted pairs, but not the similar & shifted pair. Remember the rule is the smaller the better, but for the test series that are similar but shifted by 1 time unit, the value is very large. It is also very sensitive to outliers when building the distance matrix: the distance at the outlier time point will be very large, which affects the absolute deviation. Additional notes: without rescaling, the approach performs badly on totally dissimilar series that cross each other, because of the intersection of the 2 series; the trend, however, is only a single consideration during seasonal adjustment and can be removed by de-trending. Without rescaling, the proc similarity measure of these 2 series is 196.4936, which is smaller because of the intersection: there the distance between the 2 series is very small, which decreases the sum of absolute deviations.
  10. These are a few plots of series with their corresponding similarity measures from this approach. As you can see, the series become increasingly similar as the number gets smaller. For the perfectly identical pair, the measure is basically zero. The last graph comes from combining the customs based and balance of payments based commodity datasets, so it could mean the 2 series are the same product. When I combined the 2 raw datasets, this approach performed very well in detecting and distinguishing similar (or identical) and dissimilar pairs. What I want to point out in these plots is that even though the series in the first pair are closer to each other than those in the 2nd pair, the similarity measure is still able to distinguish the level of similarity.
  11. The third approach is called the SIM coefficient method. It uses this formula to calculate a similarity measure between 2 series, which are then evaluated on that basis. The first step is to take the relative difference of both series; let's call them the input and target series. The SIM coefficient is then calculated as follows: take the absolute value of the difference between the 2 series, divide it by the maximum value at that time point, sum these up, and divide by the total number of observations minus 1. The general rule for this coefficient is that the closer it is to zero, the more similar the series are. If both series are identical, then yt1 minus yt2 is zero at every time point. If yt1 and yt2 are always very close to each other, the value will also be close to 0.
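The verbal description of the SIM coefficient can be turned into a short Python sketch. Several details are assumptions, because the slide formula is not reproduced in these notes: the relative difference is taken as (y_t - y_{t-1}) / y_{t-1}, the "maximum value at that time point" is read as the larger of the two absolute relative differences, and a time point where both are zero contributes nothing. The function names are hypothetical.

```python
def relative_difference(series):
    """Period-to-period relative change; assumes no zero values in the series."""
    return [(cur - prev) / prev for prev, cur in zip(series, series[1:])]


def sim_coefficient(y1, y2):
    """Average scaled absolute gap between the relative differences of two
    equal-length series (one reading of the SIM formula described above)."""
    r1, r2 = relative_difference(y1), relative_difference(y2)
    total = 0.0
    for a, b in zip(r1, r2):
        scale = max(abs(a), abs(b))   # assumed meaning of "maximum value"
        if scale > 0:                 # both changes zero: contributes nothing
            total += abs(a - b) / scale
    return total / (len(y1) - 1)      # number of observations minus 1


# Toy series (illustrative only).
rising = [100.0, 110.0, 121.0]
falling = [100.0, 90.0, 81.0]
sim_identical = sim_coefficient(rising, rising)
sim_opposite = sim_coefficient(rising, falling)
```

Under these assumptions the coefficient lies between 0 and 2: identical series give 0, and series that move by the same amount in opposite directions give the maximum.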
  12. These are the top 5 most similar pairs identified by this approach, and you can see the plot of the most similar pair. Since the formula for this similarity measure is fairly straightforward, I was also able to quickly compute all the pair-wise combinations. This approach also performs pretty well in determining dissimilar pairs that cross each other, since the measure is calculated after the relative differencing. One disadvantage is that its accuracy rate is lower than that of proc similarity, and it is slower and less flexible. This approach also performs relatively well in detecting identical & shifted and similar & shifted pairs, as you can see from this table of results on the test datasets.
  13. The next approach is called the derivative comparison approach. The basic idea is that if 2 series move in the same direction at the same rate throughout the observation period, they should be similar in terms of month-to-month movement. So I fit a spline function with a specified number of knots to each series to obtain a mathematical equation for the data, then compared the first and second derivatives of the 2 series. The red line is the slope and the green line is the rate of change at each knot.
  14. As you can see from the plots of a dissimilar pair, the red line, which is the slope of each series, is very different between the two. A more quantitative way to measure this is to take the difference between the 2 series in their first and second derivatives.
  15. The results in this table are based on the test datasets. The similar group indeed has a lower difference in the 1st and 2nd derivatives, and the dissimilar group a larger one. But the approach fails to distinguish the identical & shifted and similar & shifted groups. The reason is that the knots of the spline function are fixed, so the derivatives are only calculated at each knot, and in the shifted case the knots are off by 1 time point. I specified the knots to be every other time point. Having fewer knots than observations can be both a good thing and a bad thing: with one choice of knots you might skip over certain outliers, while with another you might include them. One way to fix that is to alternate the position of the knots, construct multiple spline functions for the series, and repeat the derivative comparison to get the final results. This may compensate for the disadvantage of fixed knots and the inability to detect shifted pairs. Another drawback of this approach is that it runs pretty slowly, at least with the code I have written. From the graphs, the first and second derivatives of the similar series (df2 & df3) are also similar, while those of the dissimilar ones (df1 & df2) show obvious distinctions. The magnitude of the difference in derivatives between series is also summarized in the following table.
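The derivative comparison can be illustrated with a simplified Python sketch. Note the swap: instead of fitting a spline, it estimates the first and second derivatives with central finite differences at "knots" placed every other time point. This is a stand-in for the spline-based method, not the original code, and the function names, the `offset` parameter and the toy series are all hypothetical. Changing `offset` mimics the suggested fix of alternating the knot positions.

```python
def derivatives_at_knots(series, knot_step=2, offset=1):
    """Central-difference first and second derivatives at every knot.
    offset must be at least 1 so each knot has a neighbour on both sides."""
    knots = range(offset, len(series) - 1, knot_step)
    first = [(series[t + 1] - series[t - 1]) / 2.0 for t in knots]
    second = [series[t + 1] - 2.0 * series[t] + series[t - 1] for t in knots]
    return first, second


def derivative_difference(x, y, knot_step=2, offset=1):
    """Mean absolute gap between two series in their first and second derivatives."""
    fx, sx = derivatives_at_knots(x, knot_step, offset)
    fy, sy = derivatives_at_knots(y, knot_step, offset)
    d1 = sum(abs(a - b) for a, b in zip(fx, fy)) / len(fx)
    d2 = sum(abs(a - b) for a, b in zip(sx, sy)) / len(sx)
    return d1, d2


# Toy series (illustrative only).
x = [1.0, 2.0, 4.0, 7.0, 6.0, 4.0, 3.0, 5.0, 8.0]
similar = [1.1, 2.1, 4.2, 6.8, 6.1, 4.1, 2.9, 5.2, 8.1]
dissimilar = list(reversed(x))

d1_sim, d2_sim = derivative_difference(x, similar)
d1_dis, d2_dis = derivative_difference(x, dissimilar)
# Alternate knot positions (offset 2 instead of 1) for a second comparison.
d1_alt, d2_alt = derivative_difference(x, similar, offset=2)
```

Averaging the results over several offsets is one way to reduce the dependence on a single fixed set of knots.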
  16. Finally, there is another method that I looked into more briefly than the others: the spectral analysis approach. I wasn't familiar with the term, so I did quite a bit of research online. From what I understand, spectral analysis represents a time series using cyclical components of different frequencies, in contrast to the ARIMA model, which I learned a lot about in the time series course and which uses previous realizations and white noise. Just as we study the auto-covariance and autocorrelation functions of a stationary time series in the time domain, we can study its spectral density function, or spectrum, as a function of frequency in the frequency domain. Cross spectral analysis is an extension of these techniques that enables 2 series to be analyzed simultaneously. The extent to which a frequency component in one series is correlated with the corresponding frequency component in the other series can be estimated as the coherence. If you plot coherence against frequency, you can identify the pattern of correlation between pairs of components. The SAS procedure proc spectra can produce all of these statistics, including the squared coherence and the sine and cosine transforms. One statistic, the phase spectrum, which is one of the parameters of the cross spectrum formula, is plotted against frequency. As you can see on the right, the similar pair has fewer peaks than the dissimilar pair. This approach explains time series data from a different perspective, and it is one we can look into further to quantify the similarity between 2 series.
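For readers who want to see what a phase spectrum is numerically, here is a minimal Python sketch that computes the phase of the raw cross-periodogram of two series with a plain discrete Fourier transform. It is only an illustration of the concept: it applies no smoothing, the sign convention of the phase may differ from the one SAS uses, and the function names and the cosine test series are assumptions.

```python
import cmath
import math


def dft(series):
    """Discrete Fourier transform of the mean-centered series for k = 0..n//2."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    return [sum(c * cmath.exp(-2j * math.pi * k * t / n) for t, c in enumerate(centered))
            for k in range(n // 2 + 1)]


def phase_spectrum(x, y):
    """List of (frequency in radians, phase in radians) from the raw
    cross-periodogram of two equal-length series."""
    fx, fy = dft(x), dft(y)
    n = len(x)
    result = []
    for k in range(1, len(fx)):                  # skip the zero frequency
        cross = fx[k] * fy[k].conjugate()        # cross-periodogram ordinate
        result.append((2.0 * math.pi * k / n, cmath.phase(cross)))
    return result


# Toy series: a cosine and the same cosine delayed by a quarter of its cycle.
n_obs = 16
wave = [math.cos(2.0 * math.pi * 2 * t / n_obs) for t in range(n_obs)]
delayed = [math.cos(2.0 * math.pi * 2 * t / n_obs - math.pi / 2) for t in range(n_obs)]
spectrum = phase_spectrum(wave, delayed)
frequency, phase = spectrum[1]                   # second entry is the k = 2 frequency
```

At the frequency the two cosines share, the phase equals the quarter-cycle delay (pi/2 under this sign convention); at the other frequencies the series carry essentially no power, so the raw phase values there are numerical noise.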
  17. In conclusion, each approach has its pros and cons. In short, the correlation matrix approach is easy to interpret and is the fastest at computing all the pair-wise combinations. The proc similarity approach has the highest accuracy rate, with more functionality and flexibility in manipulating the data. The SIM coefficient approach has a straightforward formula for quantifying similarity and can add accuracy on top of the proc similarity approach. The derivative comparison approach, in my opinion, can serve as a confirmation mechanism to further verify the similarity results. And the spectral analysis approach looks at the series from a frequency-based perspective and can provide more quantitative measurements of similarity. Another thing to note is that all of the methods are sensitive to outliers, so one way to improve the outcome is to use outlier-treated datasets. So in my opinion, the most recommended approach to quantifying similarity between time series is the SAS proc similarity procedure. If more accuracy is needed, the SIM coefficient method can serve as a second filter to screen out inaccurate or dissimilar pairs.