SlideShare a Scribd company logo
1 of 11
Improving SVM classification on imbalanced
time series data sets with ghost points
Presenter: Shang-Tse Chen
Authors: Suzan Köknar-Tezel, Longin Jan Latecki
Introduction
● Imbalanced dataset is a challenge for data mining
○ always predict majority class -> high accuracy
○ often, rare events are more interesting
● Common Technique:
○ Up / Down sampling
○ SMOTE (adding synthetic points in feature space)
● This paper
○ adding synthetic points in distance space
Research Question
● For time series data
○ not intuitive to represent as features in Rn
○ distance between two sequence is non-metric
○ Cannot use SMOTE
● In many applications, pair-wise distance is more relevant
○ many classifier only need pair-wise distances,
■ eg. SVM, knn
○ many good algorithms to compute distance in time
series data, e.g. DTW, OSB, …, etc.
Research Question
● Can we add synthetic data in distance space?
● Does it improve the performance?
Methodology
● Given any two points a, b in a distance space X, we can define a
ghost point e = μ(a,b).
● For every x ∈ X, the distance from x to e, d(x, μ(a,b)) is as follows:
○ case 1: {x, a, b} is a metric, then
■ d(x, µ(a, b))2 = ½ d(x, a)2 + ½ d(x, b)2 - ¼ d(a, b)2
○ case 2: If d(a, b) > d(x, a) + d(x, b), then
■ d(x, µ(a, b)) = ½ d(a, b) - d(x, b)
○ case 3a: If d(x, a) > d(x, b) + d(a, b), then
■ d(x, µ(a, b))2 = d(x, b)2 + ¼ d(a, b)2
○ case 3b: If d(x, b) > d(x, a) + d(a, b), then
■ d(x, µ(a, b))2 = d(x, a)2 + ¼ d(a, b)2
Data Collection, Processing
● UCR Time series datasets
○ Use 17 datasets from various domains
○ number of classes range from 2 to 50
● MPEG-7
○ 1400 binary images consisting of 70 object classes
○ within each class there are 20 shapes
○ each shape is represented with 100 equidistant sample points on the contour
○ these points are converted into sequences by calculating the curvature of each point with
respect to its five neighbors on each side.
○ this yields 1400 sequences, each of length 100
○ this transformation is invariant to rotation and scale
Key Results
● UCR Data Sets and OSB
● Shaded results indicate best
performers
● the darker the shade,
the larger the difference
Key Results
● UCR Data Sets and DTW
Key Results
● MPEG-7 dataset
Summary
● Proposed a new approach for over-sampling the minority
class of imbalanced data
● Unlike other feature based methods, the ghost points
are added in distance space.
● Ghost points can be added to non-metric distance space
○ Can be used with DTW, OSB, and many more.
● Empirical results show significant improvement
Critique of work
● For large-scale data, over-sampling is time consuming
● Introduce another parameters, i.e. the number of
ghost points that we should add
● May not perform well in highly noisy data

More Related Content

What's hot

Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
ArchiLab 7
 
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Computer Science Club
 
Using Principal Component Analysis to Remove Correlated Signal from Astronomi...
Using Principal Component Analysis to Remove Correlated Signal from Astronomi...Using Principal Component Analysis to Remove Correlated Signal from Astronomi...
Using Principal Component Analysis to Remove Correlated Signal from Astronomi...
CvilleDataScience
 

What's hot (20)

Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...Drobics, m. 2001:  datamining using synergiesbetween self-organising maps and...
Drobics, m. 2001: datamining using synergiesbetween self-organising maps and...
 
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
Andrew Goldberg. Highway Dimension and Provably Efficient Shortest Path Algor...
 
Discrete time signals on MATLAB
Discrete time signals on MATLABDiscrete time signals on MATLAB
Discrete time signals on MATLAB
 
Assignment 1
Assignment 1Assignment 1
Assignment 1
 
Generalized Notions of Data Depth
Generalized Notions of Data DepthGeneralized Notions of Data Depth
Generalized Notions of Data Depth
 
Application of Dijkstra Algorithm in Robot path planning
Application of Dijkstra Algorithm in Robot path planningApplication of Dijkstra Algorithm in Robot path planning
Application of Dijkstra Algorithm in Robot path planning
 
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
DASH: A C++ PGAS Library for Distributed Data Structures and Parallel Algorit...
 
Using Principal Component Analysis to Remove Correlated Signal from Astronomi...
Using Principal Component Analysis to Remove Correlated Signal from Astronomi...Using Principal Component Analysis to Remove Correlated Signal from Astronomi...
Using Principal Component Analysis to Remove Correlated Signal from Astronomi...
 
Graph Evolution Models
Graph Evolution ModelsGraph Evolution Models
Graph Evolution Models
 
Minicourse on Network Science
Minicourse on Network ScienceMinicourse on Network Science
Minicourse on Network Science
 
Line Detection
Line DetectionLine Detection
Line Detection
 
Representation
RepresentationRepresentation
Representation
 
Quiz 2
Quiz 2Quiz 2
Quiz 2
 
Deep single view 3 d object reconstruction with visual hull
Deep single view 3 d object reconstruction with visual hullDeep single view 3 d object reconstruction with visual hull
Deep single view 3 d object reconstruction with visual hull
 
R programmingmilano
R programmingmilanoR programmingmilano
R programmingmilano
 
Modelling the Clustering Coefficient of a Random graph
Modelling the Clustering Coefficient of a Random graphModelling the Clustering Coefficient of a Random graph
Modelling the Clustering Coefficient of a Random graph
 
Hubba Deep Learning
Hubba Deep LearningHubba Deep Learning
Hubba Deep Learning
 
Deep Learning meetup
Deep Learning meetupDeep Learning meetup
Deep Learning meetup
 
15 chapter9 graph_algorithms_mst
15 chapter9 graph_algorithms_mst15 chapter9 graph_algorithms_mst
15 chapter9 graph_algorithms_mst
 
Multidimension Scaling and Isomap
Multidimension Scaling and IsomapMultidimension Scaling and Isomap
Multidimension Scaling and Isomap
 

Viewers also liked (10)

Probability cheatsheet
Probability cheatsheetProbability cheatsheet
Probability cheatsheet
 
Smoking soujanya
Smoking soujanyaSmoking soujanya
Smoking soujanya
 
Revista digital pdf jorge pinzon
Revista digital pdf jorge pinzonRevista digital pdf jorge pinzon
Revista digital pdf jorge pinzon
 
Presentation yamin
Presentation yaminPresentation yamin
Presentation yamin
 
Sound shredding moustafa
Sound shredding moustafaSound shredding moustafa
Sound shredding moustafa
 
2014.chi.structured labeling to facilitate concept evolution in machine learning
2014.chi.structured labeling to facilitate concept evolution in machine learning2014.chi.structured labeling to facilitate concept evolution in machine learning
2014.chi.structured labeling to facilitate concept evolution in machine learning
 
banheiras
banheirasbanheiras
banheiras
 
SAVE_Memb_Certificate.jpg
SAVE_Memb_Certificate.jpgSAVE_Memb_Certificate.jpg
SAVE_Memb_Certificate.jpg
 
Business Resume
Business ResumeBusiness Resume
Business Resume
 
Determinar la ecuación general de la circunferencia que pasa por el punto A(-...
Determinar la ecuación general de la circunferencia que pasa por el punto A(-...Determinar la ecuación general de la circunferencia que pasa por el punto A(-...
Determinar la ecuación general de la circunferencia que pasa por el punto A(-...
 

Similar to Md2k 0219 shang

On clusteredsteinertree slide-ver 1.1
On clusteredsteinertree slide-ver 1.1On clusteredsteinertree slide-ver 1.1
On clusteredsteinertree slide-ver 1.1
VitAnhNguyn94
 

Similar to Md2k 0219 shang (20)

Making BIG DATA smaller
Making BIG DATA smallerMaking BIG DATA smaller
Making BIG DATA smaller
 
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks, arXiv e-...
 
Vectorise all the things
Vectorise all the thingsVectorise all the things
Vectorise all the things
 
Neural Network Approximation.pdf
Neural Network Approximation.pdfNeural Network Approximation.pdf
Neural Network Approximation.pdf
 
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reductionAaa ped-17-Unsupervised Learning: Dimensionality reduction
Aaa ped-17-Unsupervised Learning: Dimensionality reduction
 
Semantic Segmentation on Satellite Imagery
Semantic Segmentation on Satellite ImagerySemantic Segmentation on Satellite Imagery
Semantic Segmentation on Satellite Imagery
 
CS 354 More Graphics Pipeline
CS 354 More Graphics PipelineCS 354 More Graphics Pipeline
CS 354 More Graphics Pipeline
 
Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)Learning multifractal structure in large networks (Purdue ML Seminar)
Learning multifractal structure in large networks (Purdue ML Seminar)
 
Design and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture NotesDesign and Analysis of Algorithms Lecture Notes
Design and Analysis of Algorithms Lecture Notes
 
MLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic trackMLHEP Lectures - day 1, basic track
MLHEP Lectures - day 1, basic track
 
Survey on optical flow estimation with DL
Survey on optical flow estimation with DLSurvey on optical flow estimation with DL
Survey on optical flow estimation with DL
 
Knn Algorithm presentation
Knn Algorithm presentationKnn Algorithm presentation
Knn Algorithm presentation
 
Vectorise all the things - long version.pptx
Vectorise all the things - long version.pptxVectorise all the things - long version.pptx
Vectorise all the things - long version.pptx
 
Noisy optimization --- (theory oriented) Survey
Noisy optimization --- (theory oriented) SurveyNoisy optimization --- (theory oriented) Survey
Noisy optimization --- (theory oriented) Survey
 
DBSCAN
DBSCANDBSCAN
DBSCAN
 
Chromatic Sparse Learning
Chromatic Sparse LearningChromatic Sparse Learning
Chromatic Sparse Learning
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Deep Learning for Search
Deep Learning for SearchDeep Learning for Search
Deep Learning for Search
 
Image segmentation
Image segmentationImage segmentation
Image segmentation
 
On clusteredsteinertree slide-ver 1.1
On clusteredsteinertree slide-ver 1.1On clusteredsteinertree slide-ver 1.1
On clusteredsteinertree slide-ver 1.1
 

More from BBKuhn (9)

Md2 k 04_19_2015
Md2 k 04_19_2015Md2 k 04_19_2015
Md2 k 04_19_2015
 
March19 tun
March19 tunMarch19 tun
March19 tun
 
March12 rahman
March12 rahmanMarch12 rahman
March12 rahman
 
March12 natarajan
March12 natarajanMarch12 natarajan
March12 natarajan
 
March12 chatterjee
March12 chatterjeeMarch12 chatterjee
March12 chatterjee
 
March12 alzantot
March12 alzantotMarch12 alzantot
March12 alzantot
 
March5 gao
March5 gaoMarch5 gao
March5 gao
 
March5 bargar
March5 bargarMarch5 bargar
March5 bargar
 
MD2K Presentation to Stanford Mobilize (1/22/15)
MD2K Presentation to Stanford Mobilize (1/22/15)MD2K Presentation to Stanford Mobilize (1/22/15)
MD2K Presentation to Stanford Mobilize (1/22/15)
 

Recently uploaded

Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
Silpa
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
ANSARKHAN96
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
Silpa
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Silpa
 

Recently uploaded (20)

Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.Atp synthase , Atp synthase complex 1 to 4.
Atp synthase , Atp synthase complex 1 to 4.
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRingsTransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
TransientOffsetin14CAftertheCarringtonEventRecordedbyPolarTreeRings
 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
 
CYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptxCYTOGENETIC MAP................ ppt.pptx
CYTOGENETIC MAP................ ppt.pptx
 
FAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical ScienceFAIRSpectra - Enabling the FAIRification of Analytical Science
FAIRSpectra - Enabling the FAIRification of Analytical Science
 
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICEPATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
PATNA CALL GIRLS 8617370543 LOW PRICE ESCORT SERVICE
 
Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.Selaginella: features, morphology ,anatomy and reproduction.
Selaginella: features, morphology ,anatomy and reproduction.
 
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptxTHE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
THE ROLE OF BIOTECHNOLOGY IN THE ECONOMIC UPLIFT.pptx
 
Human genetics..........................pptx
Human genetics..........................pptxHuman genetics..........................pptx
Human genetics..........................pptx
 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
Clean In Place(CIP).pptx .
Clean In Place(CIP).pptx                 .Clean In Place(CIP).pptx                 .
Clean In Place(CIP).pptx .
 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.Cyathodium bryophyte: morphology, anatomy, reproduction etc.
Cyathodium bryophyte: morphology, anatomy, reproduction etc.
 

Md2k 0219 shang

  • 1. Improving SVM classification on imbalanced time series data sets with ghost points Presenter: Shang-Tse Chen Authors: Suzan Köknar-Tezel, Longin Jan Latecki
  • 2. Introduction ● Imbalanced dataset is a challenge for data mining ○ always predict majority class -> high accuracy ○ often, rare events are more interesting ● Common Technique: ○ Up / Down sampling ○ SMOTE (adding synthetic points in feature space) ● This paper ○ adding synthetic points in distance space
  • 3. Research Question ● For time series data ○ not intuitive to represent as features in Rn ○ distance between two sequence is non-metric ○ Cannot use SMOTE ● In many applications, pair-wise distance is more relevant ○ many classifier only need pair-wise distances, ■ eg. SVM, knn ○ many good algorithms to compute distance in time series data, e.g. DTW, OSB, …, etc.
  • 4. Research Question ● Can we add synthetic data in distance space? ● Does it improve the performance?
  • 5. Methodology ● Given any two points a, b in a distance space X, we can define a ghost point e = μ(a,b). ● For every x ∈ X, the distance from x to e, d(x, μ(a,b)) is as follows: ○ case 1: {x, a, b} is a metric, then ■ d(x, µ(a, b))2 = ½ d(x, a)2 + ½ d(x, b)2 - ¼ d(a, b)2 ○ case 2: If d(a, b) > d(x, a) + d(x, b), then ■ d(x, µ(a, b)) = ½ d(a, b) - d(x, b) ○ case 3a: If d(x, a) > d(x, b) + d(a, b), then ■ d(x, µ(a, b))2 = d(x, b)2 + ¼ d(a, b)2 ○ case 3b: If d(x, b) > d(x, a) + d(a, b), then ■ d(x, µ(a, b))2 = d(x, a)2 + ¼ d(a, b)2
  • 6. Data Collection, Processing ● UCR Time series datasets ○ Use 17 datasets from various domains ○ number of classes range from 2 to 50 ● MPEG-7 ○ 1400 binary images consisting of 70 object classes ○ within each class there are 20 shapes ○ each shape is represented with 100 equidistant sample points on the contour ○ these points are converted into sequences by calculating the curvature of each point with respect to its five neighbors on each side. ○ this yields 1400 sequences, each of length 100 ○ this transformation is invariant to rotation and scale
  • 7. Key Results ● UCR Data Sets and OSB ● Shaded results indicate best performers ● the darker the shade, the larger the difference
  • 8. Key Results ● UCR Data Sets and DTW
  • 10. Summary ● Proposed a new approach for over-sampling the minority class of imbalanced data ● Unlike other feature based methods, the ghost points are added in distance space. ● Ghost points can be added to non-metric distance space ○ Can be used with DTW, OSB, and many more. ● Empirical results show significant improvement
  • 11. Critique of work ● For large-scale data, over-sampling is time consuming ● Introduce another parameters, i.e. the number of ghost points that we should add ● May not perform well in highly noisy data