
PyConDE / PyData Karlsruhe 2017 – Connecting PyData to other Big Data Landscapes using Arrow and Parquet

How Apache Arrow and Apache Parquet are helpful technologies to connect the Python Data ecosystem to other landscapes such as the Java/Scala based Big Data ecosystem.

  1. Connecting PyData to other Big Data Landscapes using Arrow and Parquet (Uwe L. Korn, PyCon.DE 2017)
  2. About me: Data Scientist & Architect at Blue Yonder (@BlueYonderTech); Apache {Arrow, Parquet} PMC; works in Python, Cython, C++11 and SQL; heavy Pandas user. GitHub: xhochy, uwe@apache.org
  3. Python is a good companion for a Data Scientist… but there are other ecosystems out there.
  4. Why do I care? A large set of files on a distributed filesystem (not in Python), with a non-uniform schema; you want to execute a query, and only a subset of the data is interesting.
  5. These ecosystems are all amazing, but how do I get my data out of Python and back in again? Use Parquet! (Though two years ago there was no fast Parquet access from Python.)
  6. A general problem: great interoperability inside ecosystems, often based on a common backend (e.g. NumPy), but poor integration with other systems, where CSV is your only resort. "We need to talk!" A memory copy runs at about 10 GiB/s, and (de-)serialisation comes on top.
  7. Columnar Data. Image source: https://arrow.apache.org/img/simd.png (https://arrow.apache.org/)
  8. Apache Parquet
  9. About Parquet: a columnar on-disk storage format, started in fall 2012 by Cloudera & Twitter; 1.0 release in July 2013; now a top-level Apache project; Python & C++ support since fall 2016. It is the state-of-the-art format in the Hadoop ecosystem and often the default I/O option.
  10. Why use Parquet? (1) Columnar format → vectorized operations; (2) efficient encodings and compressions → small size without the need for a fat CPU; (3) predicate push-down → bring computation to the I/O layer; (4) language-independent format → libs in Java / Scala / C++ / Python / …
  11. Compression: shrinks data size independent of its content, but is more CPU-intensive than encoding; encoding plus compression performs better than compression alone, at less CPU cost. Available codecs: LZO, Snappy, GZIP, Brotli → if in doubt, use Snappy. Example sizes: GZIP 174 MiB (11 %), Snappy 216 MiB (14 %). (A writing sketch follows below.)
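A minimal sketch of how the codec is chosen when writing with pyarrow; the file names and the example table are made up for illustration:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Example table; contents are made up for illustration.
table = pa.table({"product": ["a", "b", "c"] * 1000,
                  "sold": list(range(3000))})

# Snappy: fast with a decent ratio -- the usual default.
pq.write_table(table, "sales_snappy.parquet", compression="snappy")

# GZIP: smaller files, but more CPU on both write and read.
pq.write_table(table, "sales_gzip.parquet", compression="gzip")
```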
  12. Predicate pushdown: only load the data that is used, i.e. skip columns that are not needed and skip (chunks of) rows that are not relevant. This saves I/O load, as the data is not transferred, and saves CPU, as the data is not decoded. Example question: which products are sold in $? (See the sketch below.)
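A hedged sketch of both kinds of pushdown with pyarrow; the file name, column names, and filter value are assumptions, and the filters= argument belongs to the dataset API of more recent pyarrow releases:

```python
import pyarrow.parquet as pq

# Column pruning: only the listed columns are read from disk.
table = pq.read_table("sales.parquet", columns=["product", "currency"])

# Row filtering: row groups whose statistics rule out the predicate
# are skipped entirely (supported in newer pyarrow releases).
dollar_sales = pq.read_table(
    "sales.parquet", filters=[("currency", "=", "USD")]
).to_pandas()
```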
  13. File structure: File → RowGroup → Column Chunks → Page, with Statistics kept in the metadata. (A metadata-inspection sketch follows below.)
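A small sketch that makes this structure visible through pyarrow's metadata objects; the file name is a placeholder:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("sales.parquet")   # File
meta = pf.metadata
print(meta.num_row_groups)             # RowGroups in this file

rg = meta.row_group(0)                 # first RowGroup
col = rg.column(0)                     # first Column Chunk
print(col.statistics)                  # min/max/null-count Statistics
```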
  14. Read & Write Parquet: https://arrow.apache.org/docs/python/parquet.html. Alternative implementation: https://fastparquet.readthedocs.io/en/latest/. (A round-trip sketch follows below.)
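A minimal read/write round trip with pyarrow, following the linked documentation; the file name is a placeholder:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Pandas -> Arrow -> Parquet on disk
pq.write_table(pa.Table.from_pandas(df), "example.parquet")

# Parquet on disk -> Arrow -> Pandas
df_roundtrip = pq.read_table("example.parquet").to_pandas()
```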
  15. Read & Write Parquet: Pandas 0.21 will bring pd.read_parquet(…) and df.to_parquet(…). http://pandas.pydata.org/pandas-docs/version/0.21/io.html#io-parquet
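With pandas 0.21 the round trip shrinks to two calls; the engine selection is shown explicitly and the file name is a placeholder:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
df.to_parquet("example.parquet", engine="pyarrow")
df2 = pd.read_parquet("example.parquet", engine="pyarrow")
```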
  16. Save in one ecosystem, load in another… but you always have to persist the intermediate.
  17. Zero-Copy DataFrames
  18. Converting 1 million longs (8 MiB) from Spark to PySpark: 2.57 s.
  19. Apache Arrow: a specification for an in-memory columnar data layout with no overhead for cross-system communication. Designed for efficiency (exploits SIMD, cache locality, …); exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript and the JVM. This brought Parquet to Pandas without any Python code in parquet-cpp. (A small usage sketch follows below.)
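A small sketch of the zero-copy idea from Python: for primitive types without nulls, converting an Arrow array back to NumPy hands out a view on the same buffer rather than copying:

```python
import numpy as np
import pyarrow as pa

# Build an Arrow array from a NumPy array of longs.
arr = pa.array(np.arange(1_000_000, dtype=np.int64))

# For primitive types without nulls, to_numpy() is zero-copy:
# it defaults to zero_copy_only=True and views the same buffer.
back = arr.to_numpy()
```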
  20. Dissecting Arrow C++ (1): general zero-copy memory management with jemalloc as the base allocator; the columnar memory format & metadata, i.e. Schema & DataType, Columns & Table. (See the sketch below.)
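A sketch of the Schema/DataType and Table objects as exposed to Python, using today's pyarrow API; the field names and values are illustrative:

```python
import pyarrow as pa

# Schema & DataType
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("name", pa.string()),
])

# Columns & Table
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]},
                 schema=schema)
print(table.schema)
print(table.column("name"))
```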
  21. Dissecting Arrow C++ (2): structured data IPC (inter-process communication), used in Spark for JVM <-> Python; future extensions include a gRPC backend, shared-memory communication, …; columnar in-memory analytics, intended to be the backbone of Pandas 2.0. (An IPC sketch follows below.)
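A minimal sketch of the structured-data IPC path using pyarrow's stream format (the modern pa.ipc API); an in-memory buffer stands in here for a socket or shared memory:

```python
import pyarrow as pa

batch = pa.RecordBatch.from_pydict({"x": [1, 2, 3]})

# Write the batch in the Arrow IPC stream format.
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
buf = sink.getvalue()

# Read it back -- no per-value (de)serialisation is involved.
reader = pa.ipc.open_stream(buf)
table = reader.read_all()
```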
  22. Converting 1 million longs from Spark to PySpark with Arrow: 0.05 s. https://github.com/apache/spark/pull/15821#issuecomment-282175163 (see the sketch below).
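The linked pull request grew into the Arrow-backed conversion that shipped with Spark 2.3; a sketch of how it is switched on (the configuration key is the Spark 2.3/2.4 name):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark 2.3/2.4 flag; later versions renamed it to
# spark.sql.execution.arrow.pyspark.enabled.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1_000_000)   # 1 million longs
pdf = df.toPandas()           # transferred columnar via Arrow
```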
  23. Apache Arrow – real-life improvement. Real-life example: retrieve a dataset from an MPP database and analyze it in Pandas. (1) Run a query in the DB; (2) pass the result in columnar form to the DB driver; (3) the ODBC layer transforms it into row-wise form; (4) Pandas makes it columnar again. Ugly real-life workaround: export as CSV and bypass ODBC.
  24. Apache Arrow – real-life improvement. Better solution: Turbodbc with Arrow support. (1) Retrieve columnar results; (2) pass them in a columnar fashion to Pandas. More systems to come in the future (without the ODBC overhead). See also Michael’s talk tomorrow: “Turbodbc: Turbocharged database access for data scientists”. (A fetch sketch follows below.)
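A hedged sketch of the columnar fetch path with turbodbc; the DSN and the query are placeholders:

```python
from turbodbc import connect

connection = connect(dsn="my_mpp_database")  # placeholder DSN
cursor = connection.cursor()
cursor.execute("SELECT product, currency FROM sales")

# fetchallarrow() returns a pyarrow.Table: the results stay columnar
# all the way into Pandas, with no row-wise ODBC detour.
table = cursor.fetchallarrow()
df = table.to_pandas()
```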
  25. Ray
  26. GPU Open Analytics Initiative: https://blogs.nvidia.com/blog/2017/09/22/gpu-data-frame/
  27. Get involved! Apache Arrow (cross-language DataFrame library): website https://arrow.apache.org/, ML dev@arrow.apache.org, issues & tasks https://issues.apache.org/jira/browse/ARROW, Slack https://apachearrowslackin.herokuapp.com/, GitHub https://github.com/apache/arrow. Apache Parquet (famous columnar file format): website https://parquet.apache.org/, ML dev@parquet.apache.org, issues & tasks https://issues.apache.org/jira/browse/PARQUET, Slack https://parquet-slack-invite.herokuapp.com/, GitHub https://github.com/apache/parquet-cpp.
  28. Blue Yonder – Best decisions, delivered daily. Blue Yonder GmbH, Ohiostraße 8, 76149 Karlsruhe, Germany, +49 721 383117 0. Blue Yonder Software Limited, 19 Eastbourne Terrace, London, W2 6LG, United Kingdom, +44 20 3626 0360. Blue Yonder Analytics, Inc., 5048 Tennyson Parkway, Suite 250, Plano, Texas 75024, USA.
