VAST-Tree: A Vector-Advanced and
Compressed Structure
for Massive Data Tree Traversal

         EDBT 2012, March 27th-29th, 2012
         Humboldt University, Berlin




Outline
• Background & Motivation
  – Modern HW and HW-aware algorithms

• Prerequisite Knowledge
  – Search keys with SIMD instructions

• Proposed Technique
  – Branch compression for high parallelization

• Experiments
  – Twitter Public Timeline as a real data set
  – Compression ratio and throughput

Background: Modern Hardware
• Fast and highly functional hardware
  – Multi-/many-core CPUs
    Intel Ivy Bridge/Haswell/Skylake/Knights Ferry
  – GPUs for general-purpose computing
  – ...

• New algorithms advanced by this hardware
  – Sorts, searches, compression, and DB kernels

         Today's topic: tree searches on multi-core CPUs

Background: Multi-core CPUs
• Highly advanced instructions
  – 128/256/512-bit SIMD, Transactional Memory, ...

• Branch prediction
  – Processes “if-then” paths efficiently
  – High penalties for branch misses

• Parallelism & memory
  – Many cores on a single processor
  – Limited by memory accesses [5][14][15]

Background: Tree Traversal
• Search for a key in a sequence of values

• Fundamental operation
     – Used everywhere, and well known

  Ex.) Binary tree: 48 at the root; 12 and 68 on the second level; 7 and 20 on the third

  Code snippet:
      /* one step of the descent toward search_key */
      if (search_key > node->compare_key)
          node = node->right;
      else
          node = node->left;

Background: Tree Traversal
• But legacy algorithms are too inefficient
  – Actual execution time: 20-40%

  [Figure: ratio of execution time (complete instructions, stall time, branch penalties; left axis, 0% to 100%) and # of instructions (right axis, 0.0E+00 to 6.0E+03) against log2(# of keys); x-axis labels: 22(0.161), 24(0.167), 26(0.206), 28(0.319)]

Background: Existing Algorithms
• Cache-conscious B+Tree [4][10][11][19][20]
  – Realigning, prefetching, and buffering nodes


• FAST [14]
  – Cache-conscious and branch-free techniques
  – SIMD instructions used for branch-free searches


• PALM [24]
  – Support incremental updates for FAST

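The "branch-free" idea can be sketched on a plain binary tree before SIMD enters the picture. The example below is a minimal illustration, not FAST itself: it assumes a complete binary tree stored in breadth-first order (children of slot i at 2*i+1 and 2*i+2), and the name bfs_descend is hypothetical. The result of the comparison is used as a number to pick the child, so the if/else of the earlier snippet disappears; whether the compiled code is truly branch-free still depends on the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Descend a complete binary tree stored in breadth-first order, where the
 * children of slot i are slots 2*i+1 and 2*i+2. Returns the rank of the
 * leaf gap that was reached, i.e. how many keys are smaller than search_key. */
size_t bfs_descend(const uint32_t *keys, size_t n, uint32_t search_key) {
    assert(((n + 1) & n) == 0);  /* n must be 2^h - 1 (complete tree) */
    size_t i = 0;
    while (i < n) {
        /* The comparison yields 0 or 1 and is used as arithmetic, so no
         * data-dependent if/else is needed to pick the child. */
        i = 2 * i + 1 + (size_t)(search_key > keys[i]);
    }
    return i - n;
}
```

With the keys {48, 12, 68, 7, 20, 60, 75} in breadth-first order, searching for 20 returns 2, because two keys (7 and 12) are smaller.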
Prerequisite Knowledge:
Tree traversal with SIMD instructions




Prerequisite Knowledge: Searches with SIMD
• Process multiple data elements with one SIMD instruction
  – Most x86 processors support 128-bit SIMD
  – Each element returns 1 or 0 according to an inequality test

• FAST compares 3 keys simultaneously (32-bit keys in a 128-bit register)

               Register A:   34      78      91      x    (keys)
               Register B:   79      79      79      x    (search key)
               Register C:    1       1       0      x    (result)

               x: unused element

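The register picture above can be written with SSE2 intrinsics roughly as follows. This is a sketch under assumptions: the function name compare3 is hypothetical, the keys are treated as signed 32-bit values, and the test is taken to be "search key > key" as in the binary-tree snippet. A scalar fallback is included only so the example also builds where SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Compare three 32-bit keys against search_key in one step.
 * Bit i of the result is 1 when search_key > keys[i]. */
unsigned compare3(const int32_t keys[3], int32_t search_key) {
    assert(keys != NULL);
#if defined(__SSE2__)
    /* Register A: the three keys; the fourth lane is unused padding. */
    __m128i a = _mm_set_epi32(0, keys[2], keys[1], keys[0]);
    /* Register B: the search key replicated into every lane. */
    __m128i b = _mm_set1_epi32(search_key);
    /* Register C: a lane is all ones where b > a, all zeros otherwise. */
    __m128i c = _mm_cmpgt_epi32(b, a);
    /* Collect the top bit of each lane and drop the unused fourth lane. */
    return (unsigned)_mm_movemask_ps(_mm_castsi128_ps(c)) & 0x7u;
#else
    /* Portable fallback (assumption: used only when SSE2 is unavailable). */
    unsigned mask = 0;
    for (int i = 0; i < 3; i++)
        mask |= (unsigned)(search_key > keys[i]) << i;
    return mask;
#endif
}
```

For the keys 34, 78, 91 and the search key 79, the function returns 3 (binary 011): the first two lanes are 1 and the third is 0, matching Register C.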
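Logical Example: Searches with SIMD
  Legend: highlighted SIMD blocks are compared simultaneously; 79 is the search key

  – Compare keys with SIMD
  – Look up the returned values in a lookup table:

       Returned values    Offset (blocks)
       ...                ...
       1 1 0 x            3
       ...                ...

  [Figure: tree of SIMD blocks, with blocks labeled 1, 2, 3, 4]
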
Logical Example: Searches with SIMD
  – Move to the next SIMD block

Physical Example: Searches with SIMD
 • Arrange SIMD blocks in breadth-first order in
   physically contiguous memory

  [34, 78, 91], [2, 11, 23], [35, 39, 49], [80, 87, 88], ...
                               (toward higher memory addresses)

  – Each SIMD block is 12 B (three 4-byte keys)
  – For the search key 79, the offset of 3 blocks is a 36 B jump, to [80, 87, 88]

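The lookup-table step and the 36 B jump can be tied together in a few lines. The sketch below makes several assumptions that the slides do not state: keys inside a block are ascending (as in the example blocks), blocks are numbered in breadth-first order so that the children of block b are blocks 4*b+1 to 4*b+4, and the names CHILD_OF_MASK, next_block and block_byte_offset are hypothetical. A scalar loop stands in for the SIMD comparison to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { KEYS_PER_BLOCK = 3, FANOUT = 4 };

/* Lookup table: comparison mask (bit i set when search_key > key i) -> child
 * number. With ascending keys only 000, 001, 011 and 111 can occur. */
static const int CHILD_OF_MASK[8] = {0, 1, -1, 2, -1, -1, -1, 3};

/* Scalar stand-in for the SIMD comparison of one block. */
static unsigned block_mask(const uint32_t *block, uint32_t search_key) {
    unsigned mask = 0;
    for (int i = 0; i < KEYS_PER_BLOCK; i++)
        mask |= (unsigned)(search_key > block[i]) << i;
    return mask;
}

/* Index of the block to visit after `block`, counted in blocks from the
 * start of the breadth-first array. */
size_t next_block(const uint32_t *tree, size_t block, uint32_t search_key) {
    int child = CHILD_OF_MASK[block_mask(tree + block * KEYS_PER_BLOCK, search_key)];
    assert(child >= 0);  /* would fail only if the keys were not ascending */
    return FANOUT * block + 1 + (size_t)child;
}

/* Byte offset of a block: 3 keys * 4 bytes = 12 bytes per block. */
size_t block_byte_offset(size_t block) {
    return block * KEYS_PER_BLOCK * sizeof(uint32_t);
}
```

Starting at block 0 = [34, 78, 91] with the search key 79, next_block returns 3 and block_byte_offset(3) is 36, which is the jump to [80, 87, 88] shown above.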
Issue: Number of Comparison Keys
• Compare more keys simultaneously!
  – SIMD also supports 1-byte and 2-byte elements

  [Figure: SIMD registers split into 16 elements of 1 byte each, and into 8 elements of 2 bytes each]

A proposed technique:
Branch compression for high parallelization




VAST-Tree: Designing Data Structure
• Classify branches into 3 layers
  – Apply FAST to P32, and compress keys in P16 and P8

  – SBs: SIMD blocks (H32)
  – CBs: compression blocks
      2-byte keys, 7 keys compared simultaneously (H16)
      1-byte keys, 15 keys compared simultaneously (H8)

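The key counts 3, 7 and 15 follow a simple pattern. One reading, which is an assumption and not stated in the slides, is that each block holds a complete binary subtree, so a register with a given number of elements carries the largest 2^d - 1 keys that fit and leaves one element unused. The helper names below are hypothetical.

```c
#include <assert.h>

/* Number of elements (lanes) of key_bits each in one SIMD register. */
unsigned lanes_per_register(unsigned register_bits, unsigned key_bits) {
    assert(key_bits > 0 && register_bits % key_bits == 0);
    return register_bits / key_bits;
}

/* Largest 2^d - 1 that fits in `lanes`: the key count of a complete
 * binary subtree of depth d held in one block. */
unsigned keys_per_block(unsigned lanes) {
    unsigned keys = 0;
    while (2 * keys + 1 <= lanes)
        keys = 2 * keys + 1;
    return keys;
}
```

For a 128-bit register this gives 4, 8 and 16 elements for 32-, 16- and 8-bit keys, and therefore 3, 7 and 15 keys per block.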
Detailed Outline: VAST-Tree
• Branch Lossy Compression
  – Comparison Errors
• SIMD Aligned Layouts
• Other topics ...




Proposed: Branch Lossy Compression
• Apply to each compression block
   – Prefix and suffix bit truncation

• Transform search keys in the same way for comparison
   – The extracted bit location is stored in the header of each CB

  [Figure: keys in a CB, in ascending order; the highlighted bits of each key are extracted and the lower bits are removed, leaving 1-byte keys]

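Prefix and suffix truncation can be sketched as choosing a bit window per block and applying it to every key. How VAST-Tree actually picks the window is not given in the slides; the version below assumes the window starts at the highest bit in which the first and last key of the block differ. The names choose_shift and extract_bits are hypothetical, and the returned shift plays the role of the "extracted bit location" kept in the CB header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Choose how many low bits to drop so that `width` bits, starting at the
 * highest bit in which the block's keys differ, are kept. `keys` must be
 * ascending, so the first and last key bound every key in the block. */
unsigned choose_shift(const uint32_t *keys, size_t n, unsigned width) {
    assert(n > 0 && width > 0 && width < 32);
    uint32_t diff = keys[0] ^ keys[n - 1];
    unsigned top = 0;  /* number of bits up to the highest differing bit */
    while (diff != 0) {
        top++;
        diff >>= 1;
    }
    return top > width ? top - width : 0;
}

/* Drop the shared prefix and the low `shift` bits, keeping `width` bits. */
uint32_t extract_bits(uint32_t key, unsigned shift, unsigned width) {
    return (key >> shift) & ((UINT32_C(1) << width) - 1);
}
```

For the block {1000, 2048, 3219, 3220} and 8-bit keys, the shift is 4 and the compressed keys are 62, 128, 201 and 201. The order is kept, but the last two keys are no longer distinguishable.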
Penalty: Comparison Errors
• But some lossy keys are compared incorrectly
   Example)
        value1 - 3220 (1100 1001 0100₂ → 1100 1001₂ = 201₁₀)
        value2 - 3219 (1100 1001 0011₂ → 1100 1001₂ = 201₁₀)

        Original values:   3220 > 3219 --> Return 0
        Compressed values: 201 ≤ 201   --> Return 1   (an error happens!)

• Check and correct errors after tree traversal
  – Scan leaf nodes sequentially

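The error in the example, and one way to repair it, can be sketched as follows. The comparison functions reproduce the 3220 versus 3219 case with the low 4 bits dropped. The repair step is an assumption about what "scan leaf nodes sequentially" could look like: starting from the position where the lossy traversal landed, it walks the sorted leaf keys until it reaches the first key that is not smaller than the search key. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Exact comparison on the original keys: 1 when node_key <= search_key. */
int le_exact(uint32_t node_key, uint32_t search_key) {
    return node_key <= search_key;
}

/* The same test after dropping the low `shift` bits of both keys. */
int le_lossy(uint32_t node_key, uint32_t search_key, unsigned shift) {
    assert(shift < 32);
    return (node_key >> shift) <= (search_key >> shift);
}

/* Repair a possibly wrong landing position by scanning the sorted leaf keys
 * sequentially. Returns the first index whose key is >= search_key. */
size_t correct_position(const uint32_t *leaf, size_t n, size_t guess,
                        uint32_t search_key) {
    size_t pos = guess < n ? guess : n;
    while (pos > 0 && leaf[pos - 1] >= search_key)
        pos--;  /* the guess was too far right */
    while (pos < n && leaf[pos] < search_key)
        pos++;  /* the guess was too far left */
    return pos;
}
```

Here le_exact(3220, 3219) is 0 while le_lossy(3220, 3219, 4) is 1, which is the error on the slide. In the leaf {3216, 3217, 3219, 3220, 3230}, a search for 3219 ends at index 2 whether the lossy traversal landed on index 0 or index 4.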
Proposed: SIMD-Aligned Layouts
• Load data efficiently into SIMD registers

• Only a few padding spaces between blocks
  – Page alignment in FAST leaves many blanks

  [Figure: SBs followed by CBs in memory; each block is aligned to the SIMD length, with padding spaces between blocks]

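The padding needed for SIMD-length alignment is a small rounding computation. The sketch below assumes that "SIMD length" means 16 bytes (a 128-bit register); the slides do not give the exact block sizes including headers, so the numbers in the note are only illustrative. The names align_up and padding_for are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round `offset` up to the next multiple of `alignment` (a power of two). */
size_t align_up(size_t offset, size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

/* Padding bytes needed after a block of `block_bytes` bytes. */
size_t padding_for(size_t block_bytes, size_t alignment) {
    return align_up(block_bytes, alignment) - block_bytes;
}
```

A 12-byte block of three 4-byte keys needs 4 bytes of padding to reach a 16-byte boundary, and a block of fifteen 1-byte keys needs 1 byte.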
Proposed: Other Topics
• Linear search optimization
  – Remove bottom SBs

• Apply P4Delta to leaf nodes
  – A lossless compression method
  – Compress a fixed number k of keys into each chunk

  [Figure: keys in leaf nodes, grouped into single chunks]

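The slides do not describe the P4Delta format. Assuming it belongs to the PForDelta family, it builds on delta coding of ascending keys, and only that step is sketched here: a chunk is stored as its first key plus the gaps between neighbours, and the gaps need far fewer bits than the keys. Bit packing and the handling of unusually large gaps are left out, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace ascending keys by the first key (base) and the gaps between
 * neighbours. Writes n - 1 gaps and returns the base. */
uint32_t delta_encode(const uint32_t *keys, size_t n, uint32_t *gaps) {
    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        assert(keys[i] >= keys[i - 1]);  /* keys must be ascending */
        gaps[i - 1] = keys[i] - keys[i - 1];
    }
    return keys[0];
}

/* Bits needed to store the largest gap of a chunk. */
unsigned bits_needed(const uint32_t *gaps, size_t count) {
    uint32_t all = 0;
    for (size_t i = 0; i < count; i++)
        all |= gaps[i];
    unsigned bits = 0;
    while (all != 0) {
        bits++;
        all >>= 1;
    }
    return bits;
}

/* Rebuild the n original keys from the base and the n - 1 gaps. */
void delta_decode(uint32_t base, const uint32_t *gaps, size_t n, uint32_t *keys) {
    assert(n > 0);
    keys[0] = base;
    for (size_t i = 1; i < n; i++)
        keys[i] = keys[i - 1] + gaps[i - 1];
}
```

For a chunk of k = 4 keys {100, 103, 104, 110}, the base is 100, the gaps are {3, 1, 6}, and 3 bits per gap are enough. Decoding restores the original keys exactly, which is what makes the method lossless.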
Experimental Results




Setup: Synthetic and Realistic Data Sets
• Twitter Public Timeline data
  – May 2010 to Apr. 2011
  – Twitter IDs and timestamps
  – 36,068,948 entries (nearly equal to 2^25)

• Synthetic data
  – Follows a Poisson distribution

  [Figure: ratio (0.0 to 1.0) against d-gaps (0 to 10) for Twitter IDs and Twitter timestamps]

Results: Compression Ratio – Branch Nodes

  [Figure: compression ratio of branch nodes for the VAST-Tree parameters (H32, H16); the best setting is marked]

Results: Compression Ratio – Leaf Nodes


  [Figure: compression ratio of leaf nodes; annotations: "Minimize Error Penalty" and "Improve Compression"]

Results: Throughput – Synthetic Data
  [Figure: throughput (0.0E+00 to 1.0E+08) against log2(# of keys) (22, 24, 26, 28) for VAST-Tree w/o P4Delta, VAST-Tree w/ P4Delta, FAST, and binary trees; the "Better" direction is marked]

Results: Throughput – Twitter Data

  [Figure: throughput on the Twitter data, with "Better" and "Worse" annotations]

Results: Error Ratio
  [Figure: error ratio (0.0 to 1.0) against ⊿w (bins 0-, 10-, 100-, 1000-, 10000-) for 1/λ = 16, 1/λ = 64, 1/λ = 256, Twitter IDs, and Twitter timestamps; the "Better" direction is marked]

Summary & Future Work
• Proposed lossy compression for high parallelization
   – Linear search optimization, leaf compression, and others

• Experimental Evaluation
   – Compresses branch nodes dynamically
   – Improves throughput and compression ratio
   – Throughput worsened by leaf compression

• Future Work
   – Update support, and larger numbers of keys
