
The columnar roadmap: Apache Parquet and Apache Arrow


The Hadoop ecosystem has standardized on columnar formats: Apache Parquet for on-disk storage and Apache Arrow for in-memory data. With this trend, deep integration with columnar formats is a key differentiator for big data technologies. Vertical integration from storage to execution greatly improves the latency of accessing data by pushing projections and filters down to the storage layer, reducing both the time spent in I/O reading from disk and the CPU time spent decompressing and decoding. Standards like Arrow and Parquet make this integration even more valuable, as data can now cross system boundaries without incurring costly translation. Cross-system programming using languages such as Spark, Python, or SQL can become as fast as native internal processing.

In this talk we’ll explain how Parquet is improving at the storage level, with metadata and statistics that will enable more optimizations in query engines in the future. We’ll detail how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstraction layers, and cover several planned improvements. We will also discuss how standard Arrow-based APIs pave the way to breaking down the silos of big data. One example is Arrow-based universal function libraries that can be written in any language (Java, Scala, C++, Python, R, ...) and used in any big data system (Spark, Impala, Presto, Drill). Another is a standard data access API with projection and predicate pushdown, which will greatly simplify data access optimizations across the board.

Speaker
Julien Le Dem, Principal Engineer, WeWork

The columnar roadmap: Apache Parquet and Apache Arrow

  1. The columnar roadmap: Apache Parquet and Apache Arrow. Julien Le Dem, VP Apache Parquet, Apache Arrow PMC, Principal Engineer, WeWork. @J_ julien.ledem@wework.com. June 2018
  2. Julien Le Dem, @J_, Principal Engineer, Data Platform
     • Author of Parquet
     • Apache member
     • Apache PMCs: Arrow, Kudu, Heron, Incubator, Pig, Parquet, Tez
     • Used Hadoop first at Yahoo in 2007
     • Formerly Twitter Data Platform and Dremio
  3. Agenda
     • Community Driven Standard
     • Benefits of Columnar representation
     • Vertical integration: Parquet to Arrow
     • Arrow based communication
  4. Community Driven Standard
  5. An open source standard
     • Parquet: common need for on-disk columnar. Arrow: common need for in-memory columnar.
     • Arrow is building on the success of Parquet.
     • Top-level Apache project
     • Standard from the start:
       – Members from 13+ major open source projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
     • Benefits:
       – Share the effort
       – Create an ecosystem
  6. Interoperability and Ecosystem
     Before Arrow:
     • Each system has its own internal memory format
     • 70-80% of CPU is wasted on serialization and deserialization
     • Functionality duplication and unnecessary conversions
     With Arrow:
     • All systems utilize the same memory format
     • No overhead for cross-system communication
     • Projects can share functionality (e.g. a Parquet-to-Arrow reader)
  7. Benefits of Columnar representation
  8. Columnar layout: logical table representation, row layout vs. column layout (@EmrgencyKittens)
  9. On Disk and in Memory
     • Different trade-offs
       – On disk: Storage.
         • Accessed by multiple queries.
         • Priority to I/O reduction (but still needs good CPU throughput).
         • Mostly streaming access.
       – In memory: Transient.
         • Specific to one query execution.
         • Priority to CPU throughput (but still needs good I/O).
         • Streaming and random access.
  10. Parquet on disk columnar format
  11. Parquet on disk columnar format
     • Nested data structures
     • Compact format:
       – type-aware encodings
       – better compression
     • Optimized I/O:
       – Projection push down (column pruning)
       – Predicate push down (filters based on stats)
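
  A minimal sketch of what these push downs look like from the reader side, using pyarrow and a made-up file name (the filters= argument of read_table is available in recent pyarrow releases; other engines expose the same idea through their own APIs):

      import pyarrow.parquet as pq

      # Projection push down: only the column chunks for "a" and "b" are read from disk.
      table = pq.read_table("events.parquet", columns=["a", "b"])

      # Predicate push down: row groups whose min/max statistics rule out the
      # predicate are skipped before any decompression or decoding happens.
      recent = pq.read_table("events.parquet",
                             columns=["a", "b"],
                             filters=[("a", ">", 100)])
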
  12. Parquet nested representation: the Document schema (DocId; Links with Backward and Forward; Name with Language (Code, Country) and Url) is stored as the columns docid, links.backward, links.forward, name.language.code, name.language.country, name.url. Borrowed from the Google Dremel paper. https://blog.twitter.com/2013/dremel-made-simple-with-parquet
  13. Access only the data you need: columnar layout plus per-column statistics mean you read only the data you need.
  14. Parquet file layout
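
  To make the layout concrete, here is a small, hedged sketch of walking the footer metadata (row groups, column chunks, per-column statistics) with pyarrow; the file name is made up:

      import pyarrow.parquet as pq

      pf = pq.ParquetFile("events.parquet")
      meta = pf.metadata                       # footer: schema plus row group index
      print(meta.num_rows, meta.num_row_groups, meta.num_columns)

      # Each row group holds one column chunk per column; its statistics
      # are what drive the predicate push down described above.
      chunk = meta.row_group(0).column(0)
      print(chunk.path_in_schema, chunk.compression, chunk.total_compressed_size)
      print(chunk.statistics.min, chunk.statistics.max, chunk.statistics.null_count)
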
  15. Arrow in memory columnar format
  16. Arrow goals
     • Well-documented and cross language compatible
     • Designed to take advantage of modern CPUs
     • Embeddable: in execution engines, storage layers, etc.
     • Interoperable
  17. Arrow in memory columnar format
     • Nested data structures
     • Maximize CPU throughput:
       – Pipelining
       – SIMD
       – Cache locality
     • Scatter/gather I/O
  18. CPU pipeline
  19. Columnar data
      persons = [{ name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] },
                 { name: 'Jack', age: 37, phones: ['555-333-3333'] }]
  20. Record Batch Construction: after schema negotiation and an optional dictionary batch, record batches are sent. For { name: 'Joe', age: 18, phones: ['555-111-1111', '555-222-2222'] } the batch holds one vector per buffer: name (bitmap, offset, data), age (bitmap, data), phones (bitmap, list offset, offset, data), plus a data header that describes the offsets into the data. Each box (vector) is contiguous memory, and the entire record batch is contiguous on the wire.
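
  A hedged sketch of building the persons records above as a single Arrow record batch with pyarrow (type inference gives name: string, age: int64, phones: list of string):

      import pyarrow as pa

      persons = [
          {"name": "Joe",  "age": 18, "phones": ["555-111-1111", "555-222-2222"]},
          {"name": "Jack", "age": 37, "phones": ["555-333-3333"]},
      ]

      struct = pa.array(persons)                    # StructArray backed by contiguous buffers
      batch = pa.RecordBatch.from_arrays(struct.flatten(),
                                         names=["name", "age", "phones"])
      print(batch.schema)
      print(batch.num_rows)                         # 2 rows; each column is one contiguous vector
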
  21. Vertical integration: Parquet to Arrow
  22. Representation comparison for flat schema
  23. Naïve conversion
  24. Naïve conversion: data dependent branch
  25. Peeling away abstraction layers: vectorized read (they have layers)
  26. Bit packing case
  27. Bit packing case: no branch
  28. Run length encoding case
  29. Run length encoding case: block of defined values
  30. Run length encoding case: block of null values
  31. Predicate pushdown
  32. Example: filter and projection
  33. Naive filter and projection implementation
  34. Peeling away abstractions
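
  The deck shows this inside the engine's vectorized reader; as a rough stand-in, the same filter-and-projection pattern on in-memory Arrow data looks like this with pyarrow.compute (recent pyarrow releases; the column names are made up):

      import pyarrow as pa
      import pyarrow.compute as pc

      table = pa.table({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]})

      mask = pc.greater(table["a"], 2)            # vectorized predicate, no per-row branching in Python
      result = table.filter(mask).select(["b"])   # keep matching rows, project column b
      print(result)
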
  35. Arrow based communication
  36. Universal high performance UDFs: a user defined function running in a Python process sits between SQL Operator 1 and SQL Operator 2 of the SQL engine; the UDF reads Operator 1's output and Operator 2 reads the UDF's output, both as Arrow data.
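
  One shipping instance of this pattern is Spark's Arrow-backed pandas UDFs (Spark 2.3+), where data crosses the JVM/Python boundary as Arrow record batches; a minimal sketch with a made-up column name:

      import pandas as pd
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import pandas_udf, PandasUDFType

      spark = SparkSession.builder.getOrCreate()
      df = spark.range(10).withColumnRenamed("id", "x")

      @pandas_udf("long", PandasUDFType.SCALAR)   # receives and returns pandas Series built from Arrow batches
      def plus_one(x):
          return x + 1

      df.select(plus_one("x")).show()
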
  37. Arrow RPC/REST API
     • Generic way to retrieve data in Arrow format
     • Generic way to serve data in Arrow format
     • Simplify integrations across the ecosystem
     • An Arrow based pipe
  38. RPC: Arrow based storage interchange. The memory representation is sent over the wire; there is no serialization overhead. During SQL execution, scanners push projection/predicate push down to the storage layer (memory or disk) and hand Arrow batches to the operators.
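
  "The memory representation is sent over the wire" is what the Arrow IPC stream format provides; a minimal sketch with pyarrow, using an in-memory buffer where a real system would use a socket:

      import pyarrow as pa

      batch = pa.RecordBatch.from_arrays(
          [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])], names=["id", "name"])

      sink = pa.BufferOutputStream()
      with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
          writer.write_batch(batch)               # writes schema + batches, no row-by-row encoding

      # The receiver maps the same bytes straight back into Arrow vectors.
      reader = pa.RecordBatchStreamReader(pa.BufferReader(sink.getvalue()))
      print(reader.read_all().num_rows)
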
  39. RPC: Arrow based cache. The memory representation is sent over the wire; there is no serialization overhead. Operators push projection push down to an Arrow-based cache and read Arrow batches directly.
  40. RPC: single system execution. The memory representation is sent over the wire; there is no serialization overhead. Scanners read only columns a and b from Parquet files (projection push down), partial aggregations feed a shuffle of Arrow batches, and the final aggregations produce the result.
  41. Results
     • PySpark integration: 53x speedup (IBM Spark work on SPARK-13534) http://s.apache.org/arrowresult1
     • Streaming Arrow performance: 7.75 GB/s data movement http://s.apache.org/arrowresult2
     • Arrow Parquet C++ integration: 4 GB/s reads http://s.apache.org/arrowresult3
     • Pandas integration: 9.71 GB/s http://s.apache.org/arrowresult4
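
  The PySpark result above (SPARK-13534) comes from moving DataFrame columns to pandas as Arrow record batches instead of pickled rows; a minimal sketch of turning it on (Spark 2.3+, where the setting was named spark.sql.execution.arrow.enabled):

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      spark.conf.set("spark.sql.execution.arrow.enabled", "true")   # Arrow path for toPandas()

      pdf = spark.range(1000000).toPandas()       # columns are transferred as Arrow batches
      print(len(pdf))
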
  42. Language Bindings
     Parquet
     • Target languages: Java, C++, Python & Pandas
     • Engine integrations: many!
     Arrow
     • Target languages: Java, C++, Python, R (underway), C, Ruby, JavaScript
     • Engine integrations: Drill, Pandas, R, Spark (underway)
  43. Arrow Releases (days / changes)
     • 0.1.0 – October 10, 2016 – 237 days – 178 changes
     • 0.2.0 – February 18, 2017 – 131 days – 195 changes
     • 0.3.0 – May 5, 2017 – 76 days – 311 changes
     • 0.4.0 – May 22, 2017 – 17 days – 89 changes
     • 0.5.0 – July 23, 2017 – 62 days – 138 changes
     • 0.6.0 – August 14, 2017 – 22 days – 93 changes
     • 0.7.0 – September 17, 2017 – 34 days – 137 changes
  44. Current activity:
     • Spark Integration (SPARK-13534)
     • Pages index in Parquet footer (PARQUET-922)
     • Arrow REST API (ARROW-1077)
     • Bindings: C, Ruby (ARROW-631), JavaScript (ARROW-541)
  45. Get Involved
     • Join the community
       – dev@{arrow,parquet}.apache.org
       – http://{arrow,parquet}.apache.org
       – Follow @Apache{Parquet,Arrow}
  46. THANK YOU! julien.ledem@wework.com Julien Le Dem @J_ June 2018
  47. We’re hiring! Contact: julien.ledem@wework.com
  48. We’re growing. We’re hiring!
     • WeWork doubled in size last year and the year before
     • We expect to double in size this year as well
     • Broadened scope: WeWork Labs, Powered by We, Flatiron School, Meetup, WeLive, …
     WeWork in numbers
     • WeWork has 283 locations, in 75 cities internationally
     • Over 253,000 members worldwide
     • More than 40,000 member companies
     Technology jobs / Locations
     • San Francisco: Platform teams
     • New York
     • Tel Aviv
     • Singapore
     Contact: julien.ledem@wework.com
  49. Questions? Julien Le Dem @J_ julien.ledem@wework.com June 2018
