OSCON 2013: Apache Drill Workshop > Execution & ValueVectors
Discussion of Drill execution strategies

OSCON 2013: Apache Drill Workshop > Execution & ValueVectors Presentation Transcript

  • 1. 1 Apache Drill: Execution Jacques Nadeau, OSCON July 23, 2013 jacques@apache.org | @intjesus
  • 2. 2 Drill is… –Optimistic & Pipelined –Columnar & Late materialized –Vectorized –Language Agnostic –MPP Query Engine
  • 3. 3 Optimistic Execution  Optimistic Recovery  Pipelined Scheduling  Pipelined Communication
  • 4. 4 Optimistic Recovery  Assume Failures  Don’t overbuild for them – The shorter the queries, the less work lost on failure  Graceful management of node failure at a system level – Individual queries must be rerun  Avoid the overhead of persistence and barriers.
  • 5. 5 Pipelined Operators  Pipelining – push data along as soon as it is available – Cross-operator and cross-node  Straightforward for simple operators like filter and project  Also possible with less common operations like sort and radix hash join – External Sort: merge only what is needed to push the first part of the data down the pipeline  Destination buffering rather than source buffering
  • 6. 6 Full pipelining requires query-at-once scheduling  Query at Once  Schedule the entire query at once  Pros: – Fastest data movement – Less herd effect  Cons: – Poorer workload distribution – Failure checkpoints are hard  Task by Task  Schedule each task when all previous tasks are completed  Pros: – Potentially better workload distribution – Failure checkpoints are straightforward  Cons: – Slower data movement – Poorer routing decisions
  • 7. 7 Comparison with MapReduce  Barriers – Map completion required before shuffle/reduce commencement – All maps must complete before reduce can start – In chained jobs, one job must finish entirely before the next one can start  Persistence and Recoverability – Data is persisted to disk between each barrier – Serialization and deserialization are required between execution phases
  • 8. 8 Record versus Columnar Representation  [Diagram: record layout versus columnar layout]
  • 9. 9 Data Format Example

    Donut                         Price  Icing
    Bacon Maple Bar               2.19   [Maple Frosting, Bacon]
    Portland Cream                1.79   [Chocolate]
    The Loop                      2.29   [Vanilla, Fruitloops]
    Triple Chocolate Penetration  2.79   [Chocolate, Cocoa Puffs]

    Record Encoding:
    Bacon Maple Bar, 2.19, Maple Frosting, Bacon, Portland Cream, 1.79, Chocolate, The Loop, 2.29, Vanilla, Fruitloops, Triple Chocolate Penetration, 2.79, Chocolate, Cocoa Puffs

    Columnar Encoding:
    Bacon Maple Bar, Portland Cream, The Loop, Triple Chocolate Penetration
    2.19, 1.79, 2.29, 2.79
    Maple Frosting, Bacon, Chocolate, Vanilla, Fruitloops, Chocolate, Cocoa Puffs
  • 10. 10 Places to Apply Columnar  Columnar Storage (on disk) – Improved compression when similar data is co-located – Alternative compression techniques: dictionary, RLE, delta – Avoid column reads when not needed  Columnar Execution (in memory) – Improved cache locality – Improved CPU pipelining (especially with things like null checks) – Can reduce memory copies – Maintain unusual encoding schemes for direct relational operator use
  • 11. 11 Columnar Execution: When to materialize  Users want rows  Data is Columnar  When do you transform? –On read into memory –On return to user –Somewhere in between  Later is generally better –Not always :)
  • 12. 12 Late Decompression  Don’t necessarily materialize each value  Reduce memory consumption  Reduce CPU cost  Examples: RLE, Bit Dictionary
  • 13. 13 Example: RLE and Sum  Dataset (value, run-length pairs) – 2, 4 – 8, 10  Goal – Sum all the records  Normal Work – Decompress & store: 2, 2, 2, 2, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8 – Add: 2 + 2 + 2 + 2 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8 + 8  Optimized Work – 2 * 4 + 8 * 10 – Less memory, fewer operations
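A minimal Java sketch of the arithmetic above, assuming the dataset is stored as (value, run-length) pairs; the class and variable names are illustrative, not Drill's RLE implementation:

```java
// Summing run-length-encoded data without decompressing it first.
// Mirrors the slide's dataset: 2 repeated 4 times, 8 repeated 10 times.
public class RleSum {
    public static void main(String[] args) {
        int[] values = {2, 8};
        int[] runLengths = {4, 10};

        // Naive approach: materialize every value, then add one by one.
        long naiveSum = 0;
        for (int i = 0; i < values.length; i++) {
            for (int j = 0; j < runLengths[i]; j++) {
                naiveSum += values[i];                // 14 additions
            }
        }

        // Late-decompression approach: one multiply-add per run.
        long optimizedSum = 0;
        for (int i = 0; i < values.length; i++) {
            optimizedSum += (long) values[i] * runLengths[i]; // 2 multiply-adds
        }

        System.out.println(naiveSum + " == " + optimizedSum); // 88 == 88
    }
}
```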
  • 14. 14 Example: Bitpacked Dictionary VarChar Sort  Dataset: – Dictionary: [Rupert, Bill, Larry] – Values: [1,0,1,2,1,2,1,0]  Normal Work: – Decompress & store: Bill, Rupert, Bill, Larry, Bill, Larry, Bill, Rupert – Sort: ~24 comparisons of variable-width strings (requiring length lookup and check during comparisons)  Optimized Work – Sort Dictionary: {Bill: 1, Larry: 2, Rupert: 0} – Sort bitpacked values – Work: max 3 string comparisons, ~24 comparisons of fixed-width dictionary bits – Data in 16 bits as opposed to 368/736 for UTF8/16
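A hedged Java sketch of the same idea: sort the small dictionary once, then sort only the fixed-width codes. The remapping scheme and names below are ours, not Drill's:

```java
import java.util.Arrays;

// Sort dictionary-encoded strings: a handful of string comparisons on the
// dictionary, then pure integer comparisons on the codes.
public class DictionarySort {
    public static void main(String[] args) {
        String[] dictionary = {"Rupert", "Bill", "Larry"};
        int[] codes = {1, 0, 1, 2, 1, 2, 1, 0};

        // Sort the dictionary (at most 3 string comparisons here) and
        // record where each old code lands in the sorted order.
        Integer[] order = {0, 1, 2};
        Arrays.sort(order, (a, b) -> dictionary[a].compareTo(dictionary[b]));
        int[] rank = new int[dictionary.length];
        for (int newCode = 0; newCode < order.length; newCode++) {
            rank[order[newCode]] = newCode;  // Bill -> 0, Larry -> 1, Rupert -> 2
        }

        // Remap and sort the codes: fixed-width comparisons, no string work.
        int[] remapped = Arrays.stream(codes).map(c -> rank[c]).toArray();
        Arrays.sort(remapped);

        for (int c : remapped) {
            System.out.println(dictionary[order[c]]);
        }
        // Bill, Bill, Bill, Bill, Larry, Larry, Rupert, Rupert
    }
}
```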
  • 15. 15 Storage versus Relational Operators  How do you write operator implementations for many different data representations? – If you're trying to inline, you have to avoid abstractions too complex for the JVM to simplify  Push optimizations to the storage layer for things like RLE – Rare that data is exactly in the desired format beyond the simplest queries  Define a primary in-memory representation for columnar data – Support alternative randomly accessible compression schemes in all operators (such as Dictionary/Bitpacked)
  • 16. 16 Vectorization  Operating on more than one record at the same time – Old school: use word-sized manipulations when records are stored smaller than word size – New school: SIMD (single instruction, multiple data) instructions • GCC, LLVM and the JVM all do various optimizations automatically • More can be had by manually coding algorithms – Logical vectorization: • Using general record characteristics to reduce CPU cycles per collection of records  Alternative meaning – Avoiding branching to speed the CPU pipeline, working on large cache-local data in process
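For illustration, a sketch of what JIT-friendly, vectorizable Java tends to look like; the method names are hypothetical:

```java
// Tight, branch-free loops over primitive arrays are the kind of code
// HotSpot's JIT can auto-vectorize into SIMD instructions.
public class SimdFriendly {
    // One add per element, no branches or object indirection: the JIT can
    // unroll this and emit packed SIMD adds.
    static void add(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    // "Old school" word-sized manipulation: when records are stored smaller
    // than word size (here, one bit each), process 64 of them per operation.
    static int countSet(long[] bitmap) {
        int count = 0;
        for (long word : bitmap) {
            count += Long.bitCount(word);
        }
        return count;
    }
}
```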
  • 17. 17 Drill Columnar Approach  A RecordBatch contains one or more ValueVectors corresponding to each Field within a BatchSchema  Operators can operate directly against a ValueVector or work with an alternative view of the data by leveraging a SelectionVector  Leverage simple vectorization and trust the JIT to optimize to SIMD by generating simple buffer-based operations and loops – Explore performance impact of advanced SIMD in C for specific operators
  • 18. 18 Record Batch  Unit of work for the query system – Operators always work on a batch of records  All values associated with a particular collection of records  Each record batch must have a single defined schema – Possibly includes fields that have embedded types if you have a heterogeneous field  Record batches are pipelined between operators and nodes  No more than 65k records  Target single L2 cache (~256k)  Operator reconfiguration is done at RecordBatch boundaries  [Diagram: a stream of RecordBatches, each containing one ValueVector (VV) per field]
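As a rough structural sketch of the relationships described above (field names and shapes are ours; Drill's actual interfaces differ):

```java
import java.util.List;

// A batch carries one ValueVector per field of a single, fixed schema.
class BatchSchema {
    List<String> fields;          // one entry per column in the batch
}

class RecordBatch {
    BatchSchema schema;           // single defined schema per batch
    List<ValueVector> vectors;    // one ValueVector per field
    int recordCount;              // bounded: no more than 65k records
}

abstract class ValueVector {
    // columnar values for one field; detailed in the following slides
}
```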
  • 19. 19 SelectionVector  Includes particular records for consideration by record batch index  Avoids early copying of records after applying filtering – Maintains random accessibility  All operators need to support SelectionVector access

    Donut                         Price  Icing
    Bacon Maple Bar               2.19   [Maple Frosting, Bacon]
    Portland Cream                1.79   [Chocolate]
    The Loop                      2.29   [Vanilla, Fruitloops]
    Triple Chocolate Penetration  2.79   [Chocolate, Cocoa Puffs]

    Selection Vector: [0, 3]
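A toy Java sketch of the idea, using the donut table above; the names are illustrative and this is not Drill's SelectionVector API:

```java
// Filtering through a selection vector: instead of copying the surviving
// rows, keep the original columns and record which row indices passed.
public class SelectionVectorDemo {
    public static void main(String[] args) {
        String[] donut = {"Bacon Maple Bar", "Portland Cream", "The Loop",
                          "Triple Chocolate Penetration"};
        double[] price = {2.19, 1.79, 2.29, 2.79};

        // Suppose a filter keeps rows 0 and 3, as in the slide's example.
        int[] selection = {0, 3};

        // Downstream operators iterate the selection vector, not the data,
        // so the columns stay untouched and remain randomly accessible.
        for (int row : selection) {
            System.out.println(donut[row] + " @ " + price[row]);
        }
    }
}
```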
  • 20. 20 ValueVector  One or more contiguous buffers of data containing values – Stored in native order – In-memory representation fully specified for cross-language portability  Associated with a single field – Synonymous with a column in traditional flat tables  Nested fields are separate ValueVectors  Randomly accessible  Defined for each system datatype  Each has an Accessor and a Mutator – Primitives and simple primitive "structs" are access interfaces
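A minimal sketch of such a fixed-width vector, assuming a direct ByteBuffer stands in for Drill's off-heap buffers; the class is illustrative, not Drill's ValueVector code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// One contiguous off-heap buffer in native byte order, randomly accessible,
// with separate accessor (get) and mutator (set) entry points.
public class IntVector {
    private final ByteBuffer data;

    public IntVector(int valueCount) {
        // Off-heap allocation, 4 bytes per value, native order.
        this.data = ByteBuffer.allocateDirect(valueCount * 4)
                              .order(ByteOrder.nativeOrder());
    }

    // Mutator: write a value at a record index.
    public void set(int index, int value) {
        data.putInt(index * 4, value);
    }

    // Accessor: random-access read by record index.
    public int get(int index) {
        return data.getInt(index * 4);
    }
}
```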
  • 21. 21 Drill DataTypes  MajorType = MinorType + DataMode + (Width|Scale)?  MinorType – Describes width and nature of data: smallint, bigint, uint32, varchar4 (utf8), var16char4 (utf16)  DataMode: – Optional (nullable) – Required (non-nullable) – Repeated (a list/array of values)
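A sketch of that composition in Java; the enum members are the examples named on the slide, not Drill's full type list:

```java
// MajorType = MinorType + DataMode + optional (Width|Scale) component.
enum MinorType { SMALLINT, BIGINT, UINT32, VARCHAR4, VAR16CHAR4 }
enum DataMode { OPTIONAL, REQUIRED, REPEATED }

class MajorType {
    final MinorType minorType;    // width and nature of the data
    final DataMode mode;          // nullable / non-nullable / repeated
    final Integer widthOrScale;   // the optional (Width|Scale)? part, or null

    MajorType(MinorType minorType, DataMode mode, Integer widthOrScale) {
        this.minorType = minorType;
        this.mode = mode;
        this.widthOrScale = widthOrScale;
    }
}
```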
  • 22. 22 Traditional 3 value semantics & Drill 4 value  SQL’s 3-Valued Semantics –True –False –Unknown  Drill adds fourth –Repeated
  • 23. 23 Fixed Value Vectors
  • 24. 24 Nullable Values
  • 25. 25 Repeated Values
  • 26. 26 Variable Width
  • 27. 27 Repeated Map
  • 28. 28 Strengths of RecordBatch + ValueVectors  RecordBatch separates high performance/low performance space – Record-by-record, avoid method invocation – Batch-by-batch, trust JVM  Avoid serialization/deserialization  Off-heap means large memory footprint without GC woes  Full specification combined with off-heap and batch-level execution allows C/C++ operators as necessary  Random access: sort without restructuring
  • 29. 29 Code Play Time  Get the latest Drill:
    git clone git://git.apache.org/incubator-drill.git
    cd incubator-drill/sandbox/prototype
    git checkout 9f69ed0
    mvn clean install
    Download the OSCON Drill examples:
    git clone https://github.com/jacques-n/oscon-drill.git
    cd oscon-drill
    mvn install
    cd vectors
    http://bit.ly/19goc7R
  • 30. 30 Vectors Exercise  Goals – RPC implementation that minimizes data copies and supports keeping all data off-heap – Basic benchmark analysis comparing ValueVectors and straight protobuf encoding  Logic – C = A + B – Assume two lists of fixed-width four-byte integers (list a and list b) – Send them to a remote node – The remote node decodes them, adds the two numbers together for each record, then returns the list (list c) – The first node sums all returned numbers and verifies the expected result
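Stripped of the RPC plumbing, the per-record logic amounts to the following sketch; the method names are ours, not the sample project's:

```java
// The exercise's arithmetic: the remote node computes c[i] = a[i] + b[i],
// and the first node sums the returned list to verify the result.
public class VectorAddLogic {
    // What the remote node does with the two decoded integer lists.
    static int[] add(int[] a, int[] b) {
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = a[i] + b[i];
        }
        return c;
    }

    // What the first node does with the returned list.
    static long sum(int[] c) {
        long total = 0;
        for (int v : c) {
            total += v;
        }
        return total;
    }
}
```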
  • 31. 31 Vectors Exercise
    ├── pom.xml
    └── src
        ├── main
        │   ├── java/org/apache/drill/oscon/rpc
        │   │   ├── ClientConnectFuture.java
        │   │   ├── ExampleClient.java
        │   │   ├── ExampleConfig.java
        │   │   └── ExampleServer.java
        │   └── protobuf
        │       └── Example.proto
        └── test/java/org/apache/drill/oscon/rpc
            └── TestRpc.java