iSAX 2.0 INDEXING & MINING
 ONE BILLION TIME SERIES

          Paper By:
      Alessandro Camerra
       Themis Palpanas
           Jin Shieh
        Eamonn Keogh

       Presented by:
         Vasu Jain
Contents
1. Introduction

2. Preliminaries
    a. The SAX Representation
    b. The iSAX Representation
    c. Indexing iSAX

3. iSAX 2.0 Index
    a. Bulk Loading
    b. Node Splitting Policy

4. Experimental Evaluations

5. Conclusion & Future work

6. References



Time Series

Sequence of data points measured at successive time instants spaced
at uniform time intervals.

Time Series T = t1, t2, t3, ..., tn
    n real-valued observations at equal intervals of time

e.g. Stock Prices, Acid precipitation data at a single location




Time Series Examples
•   Sensors
•   DNA sequences
•   Historical archives
•   Motion data
•   Stocks data
•   ECG
Introduction
•   Indexing and mining time series is an active area: there is a pressing need
    to index and mine time series data.

•   Time series data on the order of hundreds of millions to billions

“…we have about a million samples per minute coming in from 1000 gas
turbines around the world… we need to be able to do similarity search for...”
Lane Desborough, GE.
 “…an archival rate of 3.6 billion points a day, how can we (do similarity
search) in this data?” Josh Patterson, TVA.

•   Two bottlenecks while mining these massive Time Series
     • Time complexity of building the index
         (6 days to index 100 million objects)
     • Time to retrieve data from the disk



Introduction
•   iSAX 2.0
     • Addresses these problems
     • A data structure that extends iSAX
     • Suitable for indexing and mining very large time series collections.

•   Novel mechanism for scalable indexing of time series: Bulk loading
    algorithm, node splitting policies.

•   Results:
     •   Index building time reduced by 72% with bulk loading scheme
     •   Index size reduced by 27%
     •   Number of disk page accesses reduced by 50%
     •   The scalability achieved makes it possible to consider data mining
         problems that were previously untenable.
     •   First approach experimentally validated to scale to time series
         collections of up to 1 billion objects.

Preliminaries : The SAX representation
     1. Represent a time series T of length n in w-dimensional space using PAA,
        where the ith element is:

        \bar{t}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1) + 1}^{\frac{n}{w} i} t_j

     2. Then discretize into a vector of symbols
          Breakpoints partition the value range into regions that map to a small alphabet a of symbols
          Each segment is assigned the bitwise code of the region its PAA value falls in
     [Figure: three panels over time 0 to 16 and values -3 to 3. Left: a time series T.
     Middle: PAA(T,4). Right: iSAX(T,4,4), with the value regions labeled by the
     2-bit codes 00, 01, 10, 11.]


     i) A time series T, of length 16. ii) A PAA approximation of T, with 4 segments.
     iii) A time series T converted into SAX words of cardinality 4
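
     The two steps above can be sketched in a few lines of Python. This is a minimal
     illustration, not the authors' implementation. It assumes that w divides n, that
     the breakpoints are the equiprobable quantiles of a standard normal distribution,
     and that symbol 0 denotes the lowest region. The series T below is made up.

```python
from bisect import bisect_right
from statistics import NormalDist

def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes that w divides n")
    seg = n // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]

def breakpoints(cardinality):
    """Assumed breakpoints: equiprobable quantiles of a standard normal."""
    return [NormalDist().inv_cdf(k / cardinality) for k in range(1, cardinality)]

def sax(series, w, cardinality):
    """Replace each PAA value by the index of its region (0 = lowest region)."""
    bps = breakpoints(cardinality)
    return [bisect_right(bps, value) for value in paa(series, w)]

# Hypothetical series of length 16 that starts high and ends low.
T = [1.5, 1.4, 1.6, 1.5, 0.9, 1.1, 1.0, 0.8,
     -0.2, -0.3, -0.1, -0.2, -1.4, -1.5, -1.6, -1.3]

print(sax(T, 4, 4))  # [3, 3, 1, 0], i.e. {11, 11, 01, 00} in binary
print(sax(T, 4, 2))  # [1, 1, 0, 0]
```

     With these assumptions the made-up series yields the word {3, 3, 1, 0} at
     cardinality 4, the same shape of result as the example word on the slides.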


Preliminaries : The SAX representation

More About SAX Representation

   1. The SAX Representation supports arbitrary breakpoints.
   2. A SAX word is simply a vector of discrete numbers.
        e.g. SAX word can be written as {3, 3, 1, 0} or in binary as {11, 11, 01, 00}.
   3. The word above is denoted by T4 and is produced by the function SAX(T,4,4).
   4. SAX(T, w, c) = Tc, where w = word length and c = cardinality of the symbols
                      SAX(T,4,2) = T2 = {1, 1, 0, 0}
                      SAX(T,4,4) = T4 = {11, 11, 01, 00}
   5. A SAX word can be converted to a lower cardinality by ignoring trailing bits.
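
   Dropping trailing bits amounts to a right shift. The sketch below assumes
   cardinalities that are powers of two; the function name is illustrative only.

```python
def reduce_cardinality(symbol, from_card, to_card):
    """Lower a symbol's cardinality by discarding its trailing bits."""
    from_bits = from_card.bit_length() - 1   # e.g. cardinality 8 -> 3 bits
    to_bits = to_card.bit_length() - 1
    if to_bits > from_bits:
        raise ValueError("cannot increase cardinality by dropping bits")
    return symbol >> (from_bits - to_bits)

word8 = [0b110, 0b110, 0b011, 0b000]                   # {6, 6, 3, 0}
word4 = [reduce_cardinality(s, 8, 4) for s in word8]
word2 = [reduce_cardinality(s, 8, 2) for s in word8]
print(word4)  # [3, 3, 1, 0]
print(word2)  # [1, 1, 0, 0]
```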




Preliminaries : iSAX representation
   1. Writing binary strings in the SAX representation is tedious.
   2. iSAX writes each symbol as an integer with its cardinality as a subscript
                      iSAX(T,4,8) = T8 = {6_8, 6_8, 3_8, 0_8}
   3. Key property: two iSAX words of different cardinalities can be compared.



   4. Promote the lower cardinality representation into the cardinality of the larger
            Promote S2 as

   5. Fill in the missing bits with the values closest in SAX space to the
   corresponding symbol in T8. The result is not exact, but it is an admissible value for the word S.


   6. We can also compare iSAX words in which each symbol has its own cardinality
                        {111, 11, 101, 0} = {7_8, 3_4, 5_8, 0_2}
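
   The promotion step can be sketched per symbol. The rule used here is an assumption
   about how "closest in SAX space" is realized: if the low-cardinality bits are a
   prefix of the target symbol, copy the target; if they are smaller, pad with 1s;
   otherwise pad with 0s. The word S is hypothetical, since the slide's own example
   for S is not recoverable from the text.

```python
def bit_count(card):
    """Number of bits used by a power-of-two cardinality."""
    return card.bit_length() - 1

def promote(symbol, card, target_symbol, target_card):
    """Promote `symbol` to `target_card`, staying as close as possible to `target_symbol`."""
    extra = bit_count(target_card) - bit_count(card)
    if extra < 0:
        raise ValueError("target cardinality must not be lower")
    if extra == 0:
        return symbol
    prefix = target_symbol >> extra
    if symbol == prefix:
        return target_symbol                              # same region: copy the bits
    if symbol < prefix:
        return (symbol << extra) | ((1 << extra) - 1)     # below: pad with 1s
    return symbol << extra                                # above: pad with 0s

T8 = [0b110, 0b110, 0b011, 0b000]   # {6_8, 6_8, 3_8, 0_8}
S2 = [0, 0, 1, 1]                   # hypothetical word at cardinality 2
S8 = [promote(s, 2, t, 8) for s, t in zip(S2, T8)]
print([format(s, "03b") for s in S8])  # ['011', '011', '100', '100']
```

   Because the function works symbol by symbol, the same sketch handles words in
   which each symbol has a different cardinality.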


Preliminaries : Indexing iSAX
   1. Using the iSAX representation, we can produce a set of b^w different
      mutually exclusive iSAX words. (b = cardinality, w = word length)

   2. Words can be represented by files
       e.g. the word {6_8, 6_8, 3_8, 0_8} can be mapped to 6.8_6.8_3.8_0.8.txt

   3. A threshold th defines the maximum number of time series a file can hold.

   4. For overflow, split the file. To split choose one iSAX symbol, examine an
      additional bit, and use its value to create two new files.

                              {2_4, 3_4, 3_4, 2_4}
                              (Original file)


            {4_8, 3_4, 3_4, 2_4}                {5_8, 3_4, 3_4, 2_4}
            (child file 1)                       (child file 2)
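
   The split shown above can be reproduced with a short sketch. It represents a word
   as a list of (value, cardinality) pairs, which is an illustrative choice, and it
   reuses the file naming pattern from the example on this slide.

```python
def split_word(word, segment):
    """Split an iSAX word on one segment by examining one additional bit.

    `word` is a list of (value, cardinality) pairs, one pair per segment.
    """
    value, card = word[segment]
    children = []
    for bit in (0, 1):
        child = list(word)
        child[segment] = (2 * value + bit, 2 * card)   # append the new bit
        children.append(child)
    return children

def file_name(word):
    """Name a file after its word, e.g. 6.8_6.8_3.8_0.8.txt."""
    return "_".join(f"{value}.{card}" for value, card in word) + ".txt"

original = [(2, 4), (3, 4), (3, 4), (2, 4)]
for child in split_word(original, 0):
    print(file_name(child))
# 4.8_3.4_3.4_2.4.txt
# 5.8_3.4_3.4_2.4.txt
```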

Preliminaries : Indexing iSAX
   1. File splitting produces hierarchical, but unbalanced, index structure that
      contains non-overlapping regions

   2. Three classes of nodes:

       1. Root Node: Representative of the complete iSAX space. Contains no
          SAX representation, but only pointers to the children nodes.
       2. Leaf Node: Contains a pointer to an index file on disk with the raw time
          series entries. Stores the highest cardinality iSAX word for each time
          series.
       3. Internal Node: Created when the number of time series contained by a
          leaf node exceeds threshold th.

   3. iSAX employs binary splits along a single dimension
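
   The node hierarchy can be sketched as a small in-memory tree. This is a
   simplification under stated assumptions: leaves keep their series in a list
   instead of an index file on disk, every series is given as its word at the
   maximum cardinality (max_bits bits per symbol), and the split always refines the
   first segment that still has bits left, which is a naive policy chosen only for
   brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    word: tuple                                    # ((value, bits), ...) per segment
    series: list = field(default_factory=list)     # payload while the node is a leaf
    children: list = field(default_factory=list)   # non-empty once it is internal

def covers(node, full_word, max_bits):
    """True if the node's word is a bitwise prefix of the full-cardinality word."""
    return all(s >> (max_bits - b) == v for s, (v, b) in zip(full_word, node.word))

def split(node, max_bits):
    """Turn a leaf into an internal node with two children; return its old series."""
    seg = next(i for i, (_, b) in enumerate(node.word) if b < max_bits)
    v, b = node.word[seg]
    for bit in (0, 1):
        child_word = node.word[:seg] + ((2 * v + bit, b + 1),) + node.word[seg + 1:]
        node.children.append(Node(child_word))
    pending, node.series = node.series, []
    return pending

def insert(node, full_word, th, max_bits):
    if node.children:                              # internal node: route downwards
        child = next(c for c in node.children if covers(c, full_word, max_bits))
        insert(child, full_word, th, max_bits)
        return
    node.series.append(full_word)
    if len(node.series) > th and any(b < max_bits for _, b in node.word):
        for w in split(node, max_bits):            # leaf overflow: binary split
            insert(node, w, th, max_bits)

top = Node(((1, 1), (1, 1)))                       # a direct child of the root
for w in [(6, 6), (7, 6), (4, 5)]:
    insert(top, w, th=2, max_bits=3)
print([c.word for c in top.children])
print([c.series for c in top.children])
```

   After the third insert the leaf exceeds th = 2, becomes an internal node, and its
   series are redistributed to the two children.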




THE iSAX 2.0 INDEX
  Disadvantages of iSAX
  1. Takes too long to build the index for large datasets.
      Indexing a dataset with 500 M time series takes 20 days to complete

  2. Because:
      • Naïve node splitting policy leading to ineffective splits and additional
         disk I/O
      • No bulk loading strategy, does not use available main memory to
         reduce disk I/O

  3. To overcome these challenges:
      • New algorithm for Time series bulk loading that reduces disk I/O
      • New Node splitting policy resulting in more compact index, and hence
         further reducing the I/O cost.

  4.   We will refer to this improved iSAX index as iSAX 2.0.


THE iSAX 2.0 INDEX
  Bulk Loading
  Design Principles
      • Minimize the number of disk I/O operations
      • Maximize sequential disk access
      • Take advantage of available Main memory

  Approach for Proposed solution
      • Instead of building entire index at once, build distinct sub trees of index
        one at a time

  Assumptions
      • Limited main memory i.e. less than necessary to fit entire dataset and
        index




THE iSAX 2.0 INDEX
  Bulk Loading: Algorithm basics
  Uses two main memory buffer layers

      First Buffer Layer (FBL) :
      • Corresponds to first level of iSAX 2.0 nodes
      • Cluster together the Time series that will end up in same iSAX 2.0 sub
         tree rooted in one of the direct children of the root.
      • No size restriction: the buffers grow until they occupy the available main memory.

      Last Buffer Layer (LBL) :
      • Corresponds to leaf nodes
      • Gather all Time series of leaf nodes and flush them to disk
      • Same size as the leaf nodes on disk




THE iSAX 2.0 INDEX
  Bulk Loading: Algorithm Description
  Algorithm operates in two phases which alternate until entire dataset is indexed

      Phase 1:
      • Time series are inserted into the corresponding FBL buffers until main
        memory is almost full.
      • At the end of Phase 1, we have time series collected in the FBL buffers.
      • Corresponding (leaf) nodes L1, L2, L3, of the index are not yet created.




THE iSAX 2.0 INDEX
  Bulk Loading: Algorithm Description
     Phase 2:
     • The time series contained in each FBL buffer are moved to the appropriate
       LBL buffers.

     • Sequentially for each FBL buffer, the algorithm reads the time series and
       creates the entire sub tree (with Internal and leaf iSAX 2.0 nodes) rooted
       at the node corresponding to that FBL buffer.

     • By emptying the right-most FBL buffer, we create the sub tree rooted at
       internal node I1. The algorithm also creates for each leaf node a
       corresponding LBL buffer

     • After all time series of an FBL buffer have moved to the corresponding LBL
       buffers, these LBL buffers are flushed to disk, freeing memory for reuse.



THE iSAX 2.0 INDEX
  Bulk Loading: Algorithm Description
     Phase 2:
     • At the end of Phase 2, all the time series from the FBL buffers have
       moved down the tree to the appropriate leaf nodes and LBL buffers, and
       then from the LBL buffers to the disk.
     • Now both FBL and LBL are empty, and we are ready for next iteration of
       our algorithm going back to Phase 1.
     • This process continues until the entire dataset has been indexed.
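
     The alternation of the two phases can be sketched as follows. This is a heavily
     simplified illustration of the buffer flow only, not the authors' algorithm:
     the "disk" is a dictionary, each series is given as its word at the maximum
     cardinality, FBL buffers are keyed by the first bit of every symbol, and leaves
     are assumed to sit at a fixed depth of two bits per symbol instead of being
     created by node splits. The names bulk_load and memory_budget are hypothetical.

```python
from collections import defaultdict

def prefix(word, bits_total, keep):
    """Keep the `keep` most significant bits of every symbol of the word."""
    return tuple(s >> (bits_total - keep) for s in word)

def bulk_load(words, bits_total, memory_budget):
    disk = defaultdict(list)    # leaf id -> series written so far (stands in for files)
    fbl = defaultdict(list)     # first buffer layer, one buffer per child of the root
    state = {"buffered": 0, "flushes": 0}

    def phase2():
        for batch in fbl.values():                  # one subtree at a time
            lbl = defaultdict(list)                 # last buffer layer for this subtree
            for word in batch:
                lbl[prefix(word, bits_total, 2)].append(word)
            for leaf, items in lbl.items():         # flush every leaf buffer to disk
                disk[leaf].extend(items)
                state["flushes"] += 1
        fbl.clear()
        state["buffered"] = 0

    for word in words:                              # Phase 1: fill the FBL buffers
        fbl[prefix(word, bits_total, 1)].append(word)
        state["buffered"] += 1
        if state["buffered"] >= memory_budget:      # memory "almost full"
            phase2()
    if state["buffered"]:
        phase2()                                    # final partial batch
    return disk, state["flushes"]

words = [(6, 6), (7, 6), (0, 1), (6, 7), (1, 0), (3, 4)]
disk, flushes = bulk_load(words, bits_total=3, memory_budget=3)
print(dict(disk), flushes)
```

     Grouping by subtree before writing is what lets each flush touch one leaf with
     a batch of series rather than one series at a time.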




THE iSAX 2.0 INDEX
  Node Splitting Policy
  Design Principles
      • Keep the index as small as possible
      • Avoid poor utilization of leaf nodes, thus reducing the size of the index
      • Make splits that distribute time series evenly among the child nodes

  Approach for Proposed solution
      • Examine for each segment the distributions of the highest cardinality
        symbols across the relevant time series.
      • Split the segment for which the highest cardinality iSAX symbols lie on
        both sides of the breakpoint

  Solution
      • Split the segment whose breakpoint lies within the range μ ± 3σ and is closest to μ,
         where μ is the mean and σ the standard deviation of the symbols of that segment



THE iSAX 2.0 INDEX
  Node Splitting Policy
  Example
     • Consider iSAX words with four segments and a node of cardinality 2 (for
        all segments) that must be split. Compute μ ± 3σ for each segment.

      • The range of segment 1 lies entirely below the lowest breakpoint of
        cardinality 4 (i.e., the cardinality of the two new nodes after the split) ∴ No Split

      • The ranges of segments 2 and 3 cross a breakpoint of cardinality 4.
        Split on segment 3, because its μ value lies closer to a breakpoint than
        that of segment 2.

      • This means that some of the time series in the node to be split will end up in
        the new node representing the area above the breakpoint, while the rest
        will move to the second new node, thus achieving a balanced split.
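
      The selection rule can be sketched as below. The sketch works on per-segment
      values rather than on symbols, takes the breakpoint that a split would
      introduce on each segment as an input, and uses made-up data for a node with
      four segments; all three points are assumptions made for illustration.

```python
from statistics import mean, pstdev

def choose_split_segment(node_values, candidate_breakpoints):
    """Return the index of the segment to split, or None if no segment qualifies.

    node_values: one list of per-segment values for every series in the node.
    candidate_breakpoints[s]: breakpoint that a split on segment s would add.
    """
    best, best_dist = None, None
    for s, bp in enumerate(candidate_breakpoints):
        column = [values[s] for values in node_values]
        mu, sigma = mean(column), pstdev(column)
        if not (mu - 3 * sigma <= bp <= mu + 3 * sigma):
            continue                       # the split would send all series one way
        dist = abs(bp - mu)
        if best is None or dist < best_dist:
            best, best_dist = s, dist      # keep the breakpoint closest to the mean
    return best

# Hypothetical node: segment 1 is far below its new breakpoint, segment 4 far above.
node_values = [
    [-1.6, 0.2, 0.6, 1.5],
    [-1.5, 0.4, 0.7, 1.6],
    [-1.7, 0.3, 0.8, 1.4],
    [-1.6, 0.5, 0.5, 1.7],
]
candidates = [-0.6745, 0.6745, 0.6745, 0.6745]
print(choose_split_segment(node_values, candidates))  # 2, i.e. the third segment
```

      On this data segments 2 and 3 both qualify, and segment 3 wins because its
      mean (0.65) is closer to the breakpoint than that of segment 2 (0.35).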



THE iSAX 2.0 INDEX
  Pseudocode

  Bulk Loading Algorithm

  Node splitting Algorithm




Experimental Evaluation
Experimental Framework

• Intel Xeon E5504 , 24 GB Memory, 2TB disk, Windows Vista SP2
• AMD Athlon 64 X2, 3 GB Memory, 400 GB disk, Windows XP SP2
• Implemented in C#/.Net 3.5

Other Algorithms used for comparison
• iSAX 2.0 compared with iSAX and iSAX Buffer Tree

Datasets
•   Up to 1 Billion random time series of size 256 (~ 2TB)
•   20 Million subsequences of length 176
•   22 Million DNA subsequences of size 640
•   70 Million web image color histograms of size 256 (~133 GB)




Experimental Evaluation
Splitting Policy




• On average, 34% fewer nodes than the iSAX index
• On average, 30% less time to build an index
• Leaf nodes with, on average, over 54% higher occupancy



Experimental Evaluation
Bulk Loading Evaluation

• Index build time for 1 Billion time
  series reduced by 72%
  From 2 months to 16 days

• New strategy leads to 35% less
  disk usage

• More than 99.5% sequential disk
  page accesses




Experimental Evaluation
Case Study in Entomology

• Early detection of harmful insect behavior
• 20 Million Time series of size 176
• 6 Hours to index (~ 27 GB disk space), 0.5 second to answer 1-NN query

Mining Massive DNA Sequences
• Investigating genetic relationships
• 22 Million Time series of size 640
• 9 Hours to index (~ 115 GB disk space), 0.6 second to answer 10-NN query

Mining Massive Image collections
• Finding similar images
• 70 Million web image color histograms of size 256 (~133 GB)
• 12 Hours to index (~ 133 GB disk space), 0.9 second to answer 1-NN query



Conclusions and future work
Conclusions

Proposed iSAX 2.0
    • indexing for billion size time series collections

Experimentally validated proposed approach
    • First published experiment with analysis over 1 Billion time series

Case studies in diverse domains show the usefulness of this approach
   • Analysis of vast collections of data in fields like entomology, web
       mining, DNA sequencing etc.

Future Work
Index and mine larger time series of order 10 Billion or more
    • e.g. Functional magnetic resonance imaging (fMRI)
    • Single experiment (1 subject, 1 test) produces 60,000 TS of length
       3000 leading to 12GB


REFERENCES
• Alessandro Camerra, Themis Palpanas, Jin Shieh and Eamonn Keogh (2010).
  iSAX 2.0: Indexing and Mining One Billion Time Series. ICDM 2010.

• Microsoft Research, iSAX 2.0: Indexing and Mining One Billion Time Series




THANK YOU

   Q/A





 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Indexing and Mining a Billion Time series using iSAX 2.0

  • 1. iSAX 2.0 INDEXING & MINING ONE BILLION TIME SERIES Paper By: Alessandro Camerra Themis Palpanas Jin Shieh Eamonn Keogh
  • 2. iSAX 2.0 INDEXING & MINING ONE BILLION TIME SERIES Presented by: Vasu Jain
  • 3. Contents 1. Introduction 2. Preliminaries a. The SAX Representation b. The iSAX Representation c. Indexing iSAX 3. iSAX 2 Index a. Bulk Loading b. Node Splitting Policy 4. Experimental Evaluations 5. Conclusion & Future work 6. References
  • 4. Time Series: a sequence of data points measured at successive time instants spaced at uniform time intervals. Time series T = t1, t2, t3, t4, …, tn: n real-valued observations at equal intervals of time, e.g. stock prices, acid precipitation data at a single location.
  • 5. Time Series Examples: historical archives, sensors, DNA sequences, motion data, stocks data, ECG.
  • 6. Introduction • Indexing and mining time series is a hot topic: there is a pressing need to index and mine time series data. • Collections are on the order of hundreds of millions to billions of time series. “…we have about a million samples per minute coming in from 1000 gas turbines around the world… we need to be able to do similarity search for...” Lane Desborough, GE. “…an archival rate of 3.6 billion points a day, how can we (do similarity search) in this data?” Josh Patterson, TVA. • Two bottlenecks while mining these massive time series collections: • Time complexity of building the index (6 days to index 100 million objects) • Time to retrieve data from the disk
  • 7. Introduction • iSAX 2.0 • Addresses these problems • A data structure that extends iSAX • Suitable for indexing and mining very large time series collections • Novel mechanisms for scalable indexing of time series: a bulk loading algorithm and a node splitting policy • Results: • Index building time reduced by 72% with the bulk loading scheme • Index size reduced by 27% • Number of disk page accesses reduced by 50% • The scalability achieved makes it possible to consider data mining problems that were previously untenable. • First approach experimentally validated to scale to time series collections with up to 1 billion objects.
  • 8. Preliminaries: The SAX representation 1. Represent a time series T of length n in w-dimensional space using PAA, where the ith element is $\bar{t}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} t_j$. 2. Then discretize into a vector of symbols: breakpoints map to a small alphabet a of symbols, and for every segment a bitwise representation is assigned to code that region. [Figure: i) A time series T, of length 16. ii) A PAA approximation of T, with 4 segments. iii) A time series T converted into SAX words of cardinality 4.]
  • 9. Preliminaries: The SAX representation
  • 10. Preliminaries: The SAX representation More about the SAX representation 1. The SAX representation supports arbitrary breakpoints. 2. A SAX word is simply a vector of discrete numbers, e.g. a SAX word can be written as {3, 3, 1, 0} or in binary as {11, 11, 01, 00}. 3. This word is denoted by T^4 and produced by the function SAX(T,4,4). 4. SAX(T, w, c) = T^c, where w = word length and c = cardinality of symbols: SAX(T,4,2) = T^2 = {1, 1, 0, 0}; SAX(T,4,4) = T^4 = {11, 11, 01, 00}. 5. A SAX word is converted to a reduced cardinality by ignoring trailing bits.
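The two SAX steps from the slides above (PAA, then discretization) can be sketched in Python. This is an illustrative sketch, not the authors' implementation: it assumes a z-normalized input, equiprobable Gaussian breakpoints (the slides note that arbitrary breakpoints are supported), a word length that divides the series length, and that larger values map to larger symbols. The names `paa`, `breakpoints` and `sax`, and the example series, are hypothetical.

```python
from statistics import NormalDist


def paa(series, w):
    """Piecewise Aggregate Approximation: the mean of each of w equal segments."""
    n = len(series)
    if n % w != 0:
        raise ValueError("this sketch assumes that w divides the series length")
    seg = n // w
    return [sum(series[i * seg:(i + 1) * seg]) / seg for i in range(w)]


def breakpoints(cardinality):
    """Assumed breakpoints: cut the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / cardinality) for k in range(1, cardinality)]


def sax(series, w, cardinality):
    """SAX word: for each PAA value, count how many breakpoints lie below it."""
    bps = breakpoints(cardinality)
    return [sum(1 for b in bps if value > b) for value in paa(series, w)]


# Made-up series of length 16 whose four segment means fall from high to low.
T = [1.5] * 4 + [1.0] * 4 + [-0.3] * 4 + [-1.2] * 4
word4 = sax(T, 4, 4)                          # [3, 3, 1, 0], i.e. {11, 11, 01, 00}
word2 = sax(T, 4, 2)                          # [1, 1, 0, 0]
reduced = [symbol >> 1 for symbol in word4]   # drop the trailing bit
```

With breakpoints nested in this way, dropping the trailing bit of each cardinality-4 symbol gives the cardinality-2 word, which is the property item 5 relies on.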
  • 11. Preliminaries: The iSAX representation 1. Writing binary strings in the SAX representation is tedious. 2. iSAX writes each symbol as an integer with its cardinality as a subscript: iSAX(T,4,8) = T^8 = {6_8, 6_8, 3_8, 0_8}. 3. Key property: the ability to compare two iSAX words of different cardinalities. 4. To compare them, promote the lower-cardinality representation to the cardinality of the larger one, e.g. promote S^2 to cardinality 8. 5. Obtain the missing bit values by choosing, for each symbol, the value closest in SAX space to the corresponding value in T^8. The result is not exact, but it is an admissible value for the word S. 6. We can also compare iSAX words in which each symbol has its own cardinality: {111, 11, 101, 0} = {7_8, 3_4, 5_8, 0_2}.
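A hedged Python sketch of the promotion step: each symbol is an integer with a known bit width, and a low-cardinality symbol is extended with the bits that bring it closest to the corresponding high-cardinality symbol. The functions `promote` and `parse_mixed` are illustrative names, and the word S^2 = {0, 0, 1, 1} is a made-up example, since the slide does not give its value; T^8 = {6, 6, 3, 0} is the word from the slide.

```python
def promote(symbol, bits, target_symbol, target_bits):
    """Extend `symbol` (bits wide) to target_bits, staying closest to target_symbol."""
    extra = target_bits - bits
    prefix = target_symbol >> extra       # leading bits of the target symbol
    if symbol == prefix:
        return target_symbol              # same region: copy the remaining bits
    if symbol < prefix:
        # symbol lies below the target: take the top of its region (fill with 1s)
        return (symbol << extra) | ((1 << extra) - 1)
    # symbol lies above the target: take the bottom of its region (fill with 0s)
    return symbol << extra


def parse_mixed(bit_strings):
    """Turn binary strings into (value, cardinality) pairs, e.g. '101' -> (5, 8)."""
    return [(int(b, 2), 1 << len(b)) for b in bit_strings]


T8 = [6, 6, 3, 0]       # iSAX(T,4,8) from the slide: {110, 110, 011, 000}
S2 = [0, 0, 1, 1]       # hypothetical word of cardinality 2
S8 = [promote(s, 1, t, 3) for s, t in zip(S2, T8)]   # [3, 3, 4, 4]

mixed = parse_mixed(["111", "11", "101", "0"])       # {7_8, 3_4, 5_8, 0_2}
```

The promoted word is only a stand-in for the unknown true value of S at cardinality 8, which is why the slide calls it admissible rather than exact.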
  • 12. Preliminaries: Indexing iSAX 1. Using the iSAX representation, we can produce a set of b^w different, mutually exclusive iSAX words (b = cardinality, w = word length). 2. Each word can be represented by a file, e.g. the word {6_8, 6_8, 3_8, 0_8} can be mapped to 6.8_6.8_3.8_0.8.txt. 3. A threshold th defines the maximum number of time series a file can hold. 4. On overflow, split the file: choose one iSAX symbol, examine an additional bit, and use its value to create two new files. Example: {2_4, 3_4, 3_4, 2_4} (original file) splits into {4_8, 3_4, 3_4, 2_4} (child file 1) and {5_8, 3_4, 3_4, 2_4} (child file 2).
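The file naming and the binary split described on this slide can be sketched as follows. This is a minimal illustration, assuming an iSAX word is stored as a tuple of (symbol, cardinality) pairs; `word_to_filename` and `split_node` are hypothetical helper names, and how the segment to split is chosen is left to the caller.

```python
def word_to_filename(word):
    """Map an iSAX word to a file name of the form symbol.cardinality_..._.txt."""
    return "_".join(f"{symbol}.{cardinality}" for symbol, cardinality in word) + ".txt"


def split_node(word, segment):
    """Split on one segment by examining one additional bit of its symbol."""
    symbol, cardinality = word[segment]
    children = []
    for bit in (0, 1):
        child = list(word)
        # Appending a bit doubles the cardinality of that one symbol only.
        child[segment] = (symbol * 2 + bit, cardinality * 2)
        children.append(tuple(child))
    return children


parent = ((2, 4), (3, 4), (3, 4), (2, 4))
child1, child2 = split_node(parent, 0)
name = word_to_filename(((6, 8), (6, 8), (3, 8), (0, 8)))
```

Splitting the first segment of {2_4, 3_4, 3_4, 2_4} turns the symbol 10 into 100 and 101, which gives the two child words shown on the slide.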
  • 13. Preliminaries: Indexing iSAX 1. File splitting produces a hierarchical, but unbalanced, index structure that contains non-overlapping regions. 2. Three classes of nodes: (a) Root node: representative of the complete iSAX space; contains no SAX representation, only pointers to the children nodes. (b) Leaf node: contains a pointer to an index file on disk with the raw time series entries; stores the highest-cardinality iSAX word for each time series. (c) Internal node: created when the number of time series contained by a leaf node exceeds the threshold th. 3. iSAX employs binary splits along a single dimension.
  • 14. THE iSAX 2.0 INDEX Disadvantages of iSAX 1. It takes too long to build the index for large datasets: indexing a dataset with 500 million time series takes 20 days. 2. Causes: • A naïve node splitting policy that leads to ineffective splits and additional disk I/O • No bulk loading strategy, so available main memory is not used to reduce disk I/O 3. To overcome these challenges: • A new algorithm for time series bulk loading that reduces disk I/O • A new node splitting policy that results in a more compact index, further reducing the I/O cost 4. We refer to this improved iSAX index as iSAX 2.0.
  • 15. THE iSAX 2.0 INDEX Bulk Loading Design principles • Minimize the number of disk I/O operations • Maximize sequential disk access • Take advantage of available main memory Approach of the proposed solution • Instead of building the entire index at once, build distinct subtrees of the index one at a time Assumptions • Main memory is limited, i.e. less than what is needed to fit the entire dataset and index
  • 16. THE iSAX 2.0 INDEX Bulk Loading: Algorithm basics Uses two layers of main memory buffers First Buffer Layer (FBL): • Corresponds to the first level of iSAX 2.0 nodes • Clusters together the time series that will end up in the same iSAX 2.0 subtree, rooted in one of the direct children of the root • No size restriction: the buffers grow until they occupy the available main memory Last Buffer Layer (LBL): • Corresponds to the leaf nodes • Gathers all time series of a leaf node and flushes them to disk • Same size as the leaf nodes on disk
  • 17. THE iSAX 2.0 INDEX Bulk Loading: Algorithm Description The algorithm operates in two phases, which alternate until the entire dataset is indexed. Phase 1: • Time series are inserted in the corresponding FBL buffer until main memory is almost full. • At the end of Phase 1, the time series are collected in the FBL buffers. • The corresponding (leaf) nodes L1, L2, L3 of the index are not yet created.
  • 18. THE iSAX 2.0 INDEX Bulk Loading: Algorithm Description Phase 2: • The time series contained in each FBL buffer are moved to the appropriate LBL buffers. • Sequentially, for each FBL buffer, the algorithm reads the time series and creates the entire subtree (with internal and leaf iSAX 2.0 nodes) rooted at the node corresponding to that FBL buffer. • By emptying the right-most FBL buffer, we create the subtree rooted at internal node I1. The algorithm also creates a corresponding LBL buffer for each leaf node. • After all time series of an FBL buffer have moved to the corresponding LBL buffers, these LBL buffers are flushed to disk, making memory available for use.
  • 19. THE iSAX 2.0 INDEX Bulk Loading: Algorithm Description Phase 2: • At the end of Phase 2, all the time series from the FBL buffers have moved down the tree to the appropriate leaf nodes and LBL buffers, and then from the LBL buffers to the disk. • Both FBL and LBL are now empty, and the algorithm is ready for its next iteration, going back to Phase 1. • This process continues until the entire dataset has been indexed.
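The alternation between the two phases can be illustrated with a small in-memory simulation. This is a heavily simplified sketch, not the paper's algorithm: it assumes each time series is already reduced to an iSAX word of cardinality 4, uses a fixed two-level tree (root children at cardinality 2, leaves at cardinality 4) with no adaptive node splitting, counts memory in number of series, and stands in for the disk with a dictionary. All names (`bulk_load`, `flush_all`, `disk`, `write_log`) are hypothetical.

```python
from collections import defaultdict


def bulk_load(words, memory_budget, disk, write_log):
    """Alternate Phase 1 (fill FBL buffers) and Phase 2 (drain them via LBL)."""
    fbl = defaultdict(list)       # one buffer per direct child of the root
    buffered = 0
    for word in words:
        # Phase 1: the root child is the word at cardinality 2 (leading bits).
        root_child = tuple(symbol >> 1 for symbol in word)
        fbl[root_child].append(word)
        buffered += 1
        if buffered >= memory_budget:      # "main memory is almost full"
            flush_all(fbl, disk, write_log)
            buffered = 0
    flush_all(fbl, disk, write_log)        # drain whatever is left at the end


def flush_all(fbl, disk, write_log):
    """Phase 2: process one FBL buffer (one subtree) at a time."""
    for root_child in list(fbl):
        lbl = defaultdict(list)            # one buffer per leaf of this subtree
        for word in fbl.pop(root_child):
            lbl[word].append(word)         # here the leaf is the full word
        for leaf, batch in lbl.items():
            disk[leaf].extend(batch)       # one batched write per leaf
            write_log.append((leaf, len(batch)))


disk, write_log = defaultdict(list), []
words = [(3, 3, 1, 0), (3, 2, 1, 0), (0, 0, 2, 3), (3, 3, 1, 0), (0, 0, 2, 3)]
bulk_load(words, memory_budget=3, disk=disk, write_log=write_log)
```

The point of the structure is that each leaf receives its series in batches, one subtree at a time, rather than one disk access per inserted series.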
  • 20. THE iSAX 2.0 INDEX Node Splitting Policy Design principles • Keep the index as small as possible • Avoid poor utilization of leaf nodes, thus reducing the length of the index • Make splits that distribute the time series equally among the children nodes Approach of the proposed solution • For each segment, examine the distribution of the highest-cardinality symbols across the relevant time series. • Split the segment for which the highest-cardinality iSAX symbols lie on both sides of the breakpoint. Solution • Split the segment for which the breakpoint is within the range μ ± 3σ and closest to μ, where μ is the mean and σ is the standard deviation of the symbols
  • 21. THE iSAX 2.0 INDEX Node Splitting Policy Example • For an iSAX word of length four (segments), split a node with cardinality 2 (for all segments). Compute the μ ± 3σ range for each segment. • Segment 1 lies entirely below the lowest breakpoint of cardinality 4 (i.e., the cardinality of the two new nodes after the split) ∴ no split. • The ranges of segments 2 and 3 cross some breakpoint of cardinality 4. Split on segment 3, because its μ value lies closer to a breakpoint than that of segment 2. • This indicates that some of the time series in the node to be split will end up in the new node representing the area above the breakpoint, while the rest will move to the second new node, thus achieving a balanced split.
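A rough Python sketch of this selection rule, under stated assumptions: breakpoints are the equiprobable Gaussian ones, each segment's distribution is summarized from numeric per-series values on the z-normalized scale (a stand-in for the highest-cardinality symbols the slides mention), and a node word is a list of (symbol, cardinality) pairs. The names `new_breakpoint` and `choose_split_segment` and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def new_breakpoint(symbol, cardinality):
    """Breakpoint that appears inside a symbol's region when its cardinality doubles."""
    return NormalDist().inv_cdf((2 * symbol + 1) / (2 * cardinality))


def choose_split_segment(node_word, segment_values):
    """Pick the segment whose new breakpoint is within mu +/- 3 sigma and closest to mu.

    Returns the 0-based segment index, or None if no segment qualifies.
    """
    best, best_gap = None, None
    for i, ((symbol, cardinality), values) in enumerate(zip(node_word, segment_values)):
        mu, sigma = mean(values), pstdev(values)
        gap = abs(new_breakpoint(symbol, cardinality) - mu)
        if gap > 3 * sigma:
            continue                    # breakpoint outside the range: poor split
        if best_gap is None or gap < best_gap:
            best, best_gap = i, gap
    return best


# Four segments, all at cardinality 2, mirroring the shape of the slide's example.
node = [(0, 2), (1, 2), (1, 2), (0, 2)]
values = [
    [-1.6, -1.5, -1.4],      # segment 1: entirely below the breakpoint
    [0.2, 0.4, 0.6],         # segment 2: range crosses the breakpoint
    [0.5, 0.7, 0.9],         # segment 3: crosses it, with the mean closer to it
    [-0.1, -0.05, -0.02],    # segment 4: too narrow to reach a breakpoint
]
chosen = choose_split_segment(node, values)   # index 2, i.e. segment 3
```

Choosing the segment whose mean sits nearest the new breakpoint is what makes the two children receive roughly equal shares of the series.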
  • 22. THE iSAX 2.0 INDEX Node Splitting Policy
  • 23. THE iSAX 2.0 INDEX Pseudocode: bulk loading algorithm, node splitting algorithm
  • 24. Experimental Evaluation Experimental framework • Intel Xeon E5504, 24 GB memory, 2 TB disk, Windows Vista SP2 • AMD Athlon 64 X2, 3 GB memory, 400 GB disk, Windows XP SP2 • Implemented in C#/.NET 3.5 Other algorithms used for comparison • iSAX 2.0 compared with iSAX and iSAX Buffer Tree Datasets • Up to 1 billion random time series of size 256 (~2 TB) • 20 million subsequences of length 176 • 22 million DNA subsequences of size 640 • 70 million web image color histograms of size 256 (~133 GB)
  • 25. Experimental Evaluation Splitting Policy • On average, 34% fewer nodes than the iSAX index • On average, 30% less time to build an index • Leaf nodes with, on average, over 54% more occupancy
  • 26. Experimental Evaluation Bulk Loading Evaluation • Index build time for 1 billion time series reduced by 72%, from 2 months to 16 days • The new strategy leads to 35% less disk usage • More than 99.5% of disk page accesses are sequential
  • 27. Experimental Evaluation Case Study in Entomology • Early detection of harmful insect behavior • 20 million time series of size 176 • 6 hours to index (~27 GB disk space), 0.5 second to answer a 1-NN query Mining Massive DNA Sequences • Investigating genetic relationships • 22 million time series of size 640 • 9 hours to index (~115 GB disk space), 0.6 second to answer a 10-NN query Mining Massive Image Collections • Finding similar images • 70 million web image color histograms of size 256 (~133 GB) • 12 hours to index (~133 GB disk space), 0.9 second to answer a 1-NN query
  • 28. Experimental Evaluation Case Study in Entomology Mining Massive DNA Sequences Mining Massive Image collections
  • 29. Conclusions and future work Conclusions Proposed iSAX 2.0 • Indexing for billion-size time series collections Experimentally validated the proposed approach • First published experiment with analysis over 1 billion time series Case studies in diverse domains show the usefulness of this approach • Analysis of vast collections of data in fields like entomology, web mining, DNA sequencing, etc. Future work Index and mine larger time series collections, on the order of 10 billion or more • e.g. functional magnetic resonance imaging (fMRI) • A single experiment (1 subject, 1 test) produces 60,000 TS of length 3000, leading to 12 GB
  • 30. REFERENCES • Alessandro Camerra, Themis Palpanas, Jin Shieh and Eamonn Keogh. iSAX 2.0: Indexing and Mining One Billion Time Series. ICDM 2010. • Microsoft Research. iSAX 2.0: Indexing and Mining One Billion Time Series.
  • 31. THANK YOU Q/A