
Extending Pandas using Apache Arrow and Numba


Recent releases of Pandas (0.23+) introduced the ability to extend it with custom dtypes. Using Apache Arrow as the in-memory storage and Numba for fast, vectorized computations on these memory regions, it is possible to extend Pandas in pure Python while achieving the same performance as the built-in types. In the talk we implement a native string type as an example.

Published in: Data & Analytics


  1. Extending Pandas using Apache Arrow and Numba
       PyData Berlin 2018, Uwe L. Korn
  2. Extending Pandas using Apache Arrow and Numba
  3. Strings, Strings, please give me Strings!
  4. About me
       • Senior Data Scientist at Blue Yonder (@BlueYonderTech)
       • Apache {Arrow, Parquet} PMC
       • Data Engineer and Architect with heavy focus around Pandas
       xhochy • mail@uwekorn.com
  5. Agenda
       1. Shortcomings of Pandas
       2. ExtensionArrays
       3. Arrow for storage
       4. Numba for compute
       5. All the stuff
  6. Pandas Series
       • Payload stored in a numpy.ndarray
       • Index for data alignment
       • Rich analytical API
       • Accessors like .dt or .str
  7. Shortcomings
       • Limited to NumPy data types, otherwise object
       • NumPy’s focus is numerical data and tensors
       • Pandas performs well when NumPy performs well
       • Most popular:
           • no native variable-length strings
           • integers are non-nullable
  8. What’s the problem?
  9. What’s the problem?
  10. Why are objects bad?
       Python Data Science Handbook, Jake VanderPlas; O’Reilly Media, Nov 2016
       https://jakevdp.github.io/PythonDataScienceHandbook/02.01-understanding-data-types.html
  11. Extending Pandas (0.23+)
       • Two new interfaces:
           • ExtensionDtype
               • What type of scalars?
           • ExtensionArray
               • Implement basic array ops
       • Pandas provides algorithms on top
  12. 10x !!1
  13. Extending Pandas (0.23+)
       • _from_sequence
       • _from_factorized
       • __getitem__
       • __len__
       • dtype
       • nbytes
       • isna
       • copy
       • _concat_same_type
       https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.extensions.ExtensionArray.html
  14. Apache Arrow
       • Specification for in-memory columnar data layout
       • No overhead for cross-system communication
       • Designed for efficiency (exploit SIMD, cache locality, ...)
       • Exchange data without conversion between Python, C++, C (GLib), Ruby, Lua, R, JavaScript, Go, Rust, Matlab and the JVM
       • Brought Parquet to Pandas and made PySpark fast (@pandas_udf)
  15. Nice properties
       • More native datatypes: string, date, nullable int, list of X, …
       • Everything is nullable
       • Memory can be chunked
       • Zero-copy to other ecosystems like Java / R
       • Highly efficient I/O
  16. Not so nice properties
       • Still a young project
       • Not much analytics on top (yet!)
       • Core is in modern C++
       • Extremely fast but hard to extend in Python
  17. Writing algorithms in Python is easy! But slow.
  18. Photo by Matthew Brodeur on Unsplash
  19. Fast for-loops with Numba
  20. Anatomy of an Arrow StringArray
       • 3 memory buffers
           • bitmap to indicate valid (non-null) entries
           • uint32 array of offsets: "where does the string start"
           • uint8 array of characters (UTF-8 encoded)
       • int64 offset
           • allows zero-copy slicing
  21. Numba @jitclass
  22. Numba @jitclass
  23. Photo by Niklas Tidbury on Unsplash
  24. Fletcher
       https://github.com/xhochy/fletcher
       • Implements Extension{Array,Dtype} with Apache Arrow as storage
       • Uses Numba to implement the necessary analytics on top
  25. Demo
  26. Fletcher Demo
  27. Fletcher Demo
  28. Fletcher Demo
  29. Fletcher Demo
  30. ExtensionArray Implementations
       • https://github.com/ContinuumIO/cyberpandas: IPArray (PR)
       • https://github.com/geopandas/geopandas: GeometryArray (WIP)
       • https://github.com/xhochy/fletcher: Apache Arrow + Numba backed Arrays
  31. pip install fletcher
       Photo by Israel Sundseth on Unsplash
  32. Karlsruhe
       • 24.-26. October + 2 days of sprints (27./28.10.)
       • ZKM Karlsruhe, DE
       • Call for Participation opens next week.
       Photos by JOEXX (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
  33. Thank you!
       I’m Uwe Korn • Twitter: @xhochy • https://github.com/xhochy
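
The buffer layout listed on slide 20 can be sketched with the Python standard library alone. This is an illustration only, not how Arrow or Fletcher implement it: the helper names `build_string_buffers` and `get_item` are hypothetical, and the array typecode "I" is assumed to be 32 bits wide, as it is on common platforms.

```python
from array import array


def build_string_buffers(values):
    """Encode a list of str-or-None into (validity, offsets, data) buffers."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per entry
    offsets = array("I", [0])  # entry i spans data[offsets[i]:offsets[i + 1]]
    data = bytearray()  # all strings concatenated, UTF-8 encoded
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)  # mark entry i as valid
            data += value.encode("utf-8")
        offsets.append(len(data))  # a null entry has zero length
    return validity, offsets, data


def get_item(buffers, i, offset=0):
    """Read logical entry i of a view that starts at `offset`."""
    validity, offsets, data = buffers
    j = offset + i
    if not (validity[j // 8] >> (j % 8)) & 1:
        return None
    return bytes(data[offsets[j]:offsets[j + 1]]).decode("utf-8")


buffers = build_string_buffers(["pandas", None, "Zürich", ""])
print([get_item(buffers, i) for i in range(4)])
# A slice is just a new offset over the same buffers; nothing is copied.
print([get_item(buffers, i, offset=2) for i in range(2)])
```

Because a slice only changes the offset, two views can share the same three buffers, which is the zero-copy slicing the slide refers to.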
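
Slides 17 to 22 argue that algorithms over these buffers are easy to write as plain Python loops and that Numba makes such loops fast. The sketch below shows two loops of that kind, an `isna` check and a string length that counts code points without decoding. The buffer contents and function names are assumptions made for illustration, and Numba is deliberately not imported so the code runs on the standard library; whether each line compiles unchanged under Numba has not been verified here.

```python
from array import array

# Hypothetical buffers for the four entries ["pandas", None, "Zürich", ""].
validity = bytes([0b1101])  # bit i set means entry i is not null
offsets = array("I", [0, 6, 6, 13, 13])  # byte range of each entry in `data`
data = "pandasZürich".encode("utf-8")  # concatenated UTF-8 bytes


def isna(validity, length):
    """Return one bool per entry, True where the validity bit is unset."""
    out = [False] * length
    for i in range(length):
        out[i] = ((validity[i >> 3] >> (i & 7)) & 1) == 0
    return out


def str_len(validity, offsets, data, length):
    """Count code points per entry without decoding; -1 marks a null."""
    out = [-1] * length
    for i in range(length):
        if (validity[i >> 3] >> (i & 7)) & 1:
            count = 0
            for pos in range(offsets[i], offsets[i + 1]):
                # UTF-8 continuation bytes look like 0b10xxxxxx; skip them.
                if (data[pos] & 0xC0) != 0x80:
                    count += 1
            out[i] = count
    return out


print(isna(validity, 4))
print(str_len(validity, offsets, data, 4))
```

Both functions touch only integers and byte buffers, with no Python string objects created per entry, which is the property that makes this loop style a candidate for compilation.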
