Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

PyData London 2017 – Efficient and portable DataFrame storage with Apache Parquet

980 views

Published on

Slides for my presentation at PyData London 2017: https://pydata.org/london2017/schedule/presentation/54/

Published in: Data & Analytics
  • Be the first to comment

PyData London 2017 – Efficient and portable DataFrame storage with Apache Parquet

  1. 1. 1 Efficient and portable DataFrame storage with Apache Parquet Uwe L. Korn, PyData London 2017
  2. 2. 2 • Data Scientist at Blue Yonder (@BlueYonderTech) • Apache {Arrow, Parquet} PMC • Work in Python, Cython, C++11 and SQL • Heavy Pandas User About me xhochy uwe@apache.org
  3. 3. 3 Agenda • History of Apache Parquet • The format in detail • Use it in Python
  4. 4. 4 About Parquet 1. Columnar on-disk storage format 2. Started in fall 2012 by Cloudera & Twitter 3. July 2013: 1.0 release 4. top-level Apache project 5. Fall 2016: Python & C++ support 6. State of the art format in the Hadoop ecosystem • often used as the default I/O option
  5. 5. 5 Why use Parquet? 1. Columnar format
 —> vectorized operations 2. Efficient encodings and compressions
 —> small size without the need for a fat CPU 3. Query push-down
 —> bring computation to the I/O layer 4. Language independent format
 —> libs in Java / Scala / C++ / Python /…
  6. 6. 6 Who uses Parquet? • Query Engines • Hive • Impala • Drill • Presto • … • Frameworks • Spark • MapReduce • … • Pandas • Dask
  7. 7. File Structure File RowGroup Column Chunks Page Statistics
  8. 8. Encodings • Know the data • Exploit the knowledge • Cheaper than universal compression • Example dataset: • NYC TLC Trip Record data for January 2016 • 1629 MiB as CSV • columns: bool(1), datetime(2), float(12), int(4) • Source: http://www.nyc.gov/html/tlc/html/about/ trip_record_data.shtml
  9. 9. Encodings — PLAIN • Simply write the binary representation to disk • Simple to read & write • Performance limited by I/O throughput • —> 1499 MiB
  10. 10. Encodings — RLE & Bit Packing • bit-packing: only use the necessary bit • RunLengthEncoding: 378 times „12“ • hybrid: dynamically choose the best • Used for Definition & Repetition levels
  11. 11. Encodings — Dictionary • PLAIN_DICTIONARY / RLE_DICTIONARY • every value is assigned a code • Dictionary: store a map of code —> value • Data: store only codes, use RLE on that • —> 329 MiB (22%)
  12. 12. Compression 1. Shrink data size independent of its content 2. More CPU intensive than encoding 3. encoding+compression performs better than compression alone with less CPU cost 4. LZO, Snappy, GZIP, Brotli
 —> If in doubt: use Snappy 5. GZIP: 174 MiB (11%)
 Snappy: 216 MiB (14 %)
  13. 13. Query pushdown 1. Only load used data 1. skip columns that are not needed 2. skip (chunks of) rows that not relevant 2. saves I/O load as the data is not transferred 3. saves CPU as the data is not decoded
  14. 14. Benchmarks (size)
  15. 15. Benchmarks (time)
  16. 16. Benchmarks (size vs time)
  17. 17. Read & Write Parquet 17 https://arrow.apache.org/docs/python/parquet.html Alternative Implementation: https://fastparquet.readthedocs.io/en/latest/
  18. 18. 18 Apache Arrow? • Specification for in-memory columnar data layout • No overhead for cross-system communication • Designed for efficiency (exploit SIMD, cache locality, ..) • Exchange data without conversion between Python, C++, C(glib), Ruby, Lua, R and the JVM • This brought Parquet to Pandas without any Python code in parquet-cpp Just released 0.3
  19. 19. Cross language DataFrame library • Website: https://arrow.apache.org/ • ML: dev@arrow.apache.org • Issues & Tasks: https://issues.apache.org/jira/ browse/ARROW • Slack: https:// apachearrowslackin.herokuapp.com/ • Github mirror: https://github.com/apache/ arrow Apache Arrow Apache Parquet Famous columnar file format • Website: https://parquet.apache.org/ • ML: dev@parquet.apache.org • Issues & Tasks: https://issues.apache.org/jira/ browse/PARQUET • Slack: https://parquet-slack- invite.herokuapp.com/ • C++ Github mirror: https://github.com/ apache/parquet-cpp 19 Get Involved!
  20. 20. Blue Yonder GmbH Ohiostraße 8 76149 Karlsruhe Germany +49 721 383117 0 Blue Yonder Software Limited 19 Eastbourne Terrace London, W2 6LG United Kingdom +44 20 3626 0360 Blue Yonder Best decisions, delivered daily Blue Yonder Analytics, Inc. 5048 Tennyson Parkway Suite 250 Plano, Texas 75024 USA 20

×