Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Apache Arrow -- Cross-language development platform for in-memory data

2,493 views

Published on

Slides from NYC R Conference, April 20, 2018

Published in: Technology
  • I have always found it hard to meet the requirements of being a student. Ever since my years of high school, I really have no idea what professors are looking for to give good grades. After some google searching, I found this service ⇒ www.HelpWriting.net ⇐ who helped me write my research paper.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • Hello! I can recommend a site that has helped me. It's called ⇒ www.WritePaper.info ⇐ They helped me for writing my quality research paper.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • If u need a hand in making your writing assignments - visit ⇒ www.HelpWriting.net ⇐ for mire detailed information.
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here
  • DOWNLOAD THAT BOOKS INTO AVAILABLE FORMAT (2019 Update) ......................................................................................................................... ......................................................................................................................... Download Full PDF EBOOK here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download Full EPUB Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download Full doc Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download PDF EBOOK here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download EPUB Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... Download doc Ebook here { http://bit.ly/2m6jJ5M } ......................................................................................................................... ......................................................................................................................... ................................................................................................................................... eBook is an electronic version of a traditional print book that can be read by using a personal computer or by using an eBook reader. (An eBook reader can be a software application for use on a computer such as Microsoft's free Reader application, or a book-sized computer that is used solely as a reading device such as Nuvomedia's Rocket eBook.) Users can purchase an eBook on diskette or CD, but the most popular method of getting an eBook is to purchase a downloadable file of the eBook (or other reading material) from a Web site (such as Barnes and Noble) to be read from the user's computer or reading device. Generally, an eBook can be downloaded in five minutes or less ......................................................................................................................... .............. Browse by Genre Available eBooks .............................................................................................................................. Art, Biography, Business, Chick Lit, Children's, Christian, Classics, Comics, Contemporary, Cookbooks, Manga, Memoir, Music, Mystery, Non Fiction, Paranormal, Philosophy, Poetry, Psychology, Religion, Romance, Science, Science Fiction, Self Help, Suspense, Spirituality, Sports, Thriller, Travel, Young Adult, Crime, Ebooks, Fantasy, Fiction, Graphic Novels, Historical Fiction, History, Horror, Humor And Comedy, ......................................................................................................................... ......................................................................................................................... .....BEST SELLER FOR EBOOK RECOMMEND............................................................. ......................................................................................................................... Blowout: Corrupted Democracy, Rogue State Russia, and the Richest, Most Destructive Industry on Earth,-- The Ride of a Lifetime: Lessons Learned from 15 Years as CEO of the Walt Disney Company,-- Call Sign Chaos: Learning to Lead,-- StrengthsFinder 2.0,-- Stillness Is the Key,-- She Said: Breaking the Sexual Harassment Story That Helped Ignite a Movement,-- Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones,-- Everything Is Figureoutable,-- What It Takes: Lessons in the Pursuit of Excellence,-- Rich Dad Poor Dad: What the Rich Teach Their Kids About Money That the Poor and Middle Class Do Not!,-- The Total Money Makeover: Classic Edition: A Proven Plan for Financial Fitness,-- Shut Up and Listen!: Hard Business Truths that Will Help You Succeed, ......................................................................................................................... .........................................................................................................................
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Apache Arrow -- Cross-language development platform for in-memory data

  1. 1. Wes McKinney Apache Arrow Cross-Language Development Platform for In-Memory Analytics NYC R Conference- 20 April 2018
  2. 2. Wes McKinney • Created Python pandas project (~2008), lead developer/maintainer until 2013 • PMC Apache Arrow, Apache Parquet, ASF Member • Wrote Python for Data Analysis (1e 2012, 2e 2017) • Formerly Co-founder / CEO of DataPad (acquired by Cloudera in 2014) • Other OSS work: Ibis, Feather, Apache Kudu, statsmodels
  3. 3. ● Raise money to support full-time open source developers ● Grow Apache Arrow ecosystem ● Build cross-language, portable computational libraries for data science ● Build relationships across industry https://ursalabs.org
  4. 4. People
  5. 5. Initial Sponsors and Partners Prospective sponsors / partners, please reach out: info@ursalabs.org
  6. 6. Apache Arrow • https://github.com/apache/arrow • Open source community initiative started in 2016 • Backed by ~13 major OSS projects at start, significantly more now • Shared standards and systems for memory interoperability and computation • Cross-language libraries
  7. 7. Defragmenting Data Access
  8. 8. “Portable” Data Frames pandas R JVM Non-Portable Data Frames Arrow Portable Data Frames … Share data and algorithms at ~zero cost
  9. 9. Some Arrow Use Cases • Runtime in-memory format for analytical query engines • Zero-copy (no deserialization) interchange via shared memory • Low-overhead streaming messaging / RPC • Serialization format implementation • Zero-copy random access to on-disk data • Example: Feather files • Data ingest / data access
  10. 10. Arrow’s Columnar Memory Format • Runtime memory format for analytical query processing • Companion to serialization tech like Apache {Parquet, ORC} • “Fully shredded” columnar, supports flat and nested schemas • Organized for cache-efficient access on CPUs/GPUs • Optimized for data locality, SIMD, parallel processing • Accommodates both random access and scan workloads
  11. 11. Arrow Implementations and Bindings Upcoming: Rust (native), R (binding), Julia (native)
  12. 12. Example use: Ray ML framework from Berkeley RISELab March 20, 2017All Rights Reserved 12 Source: https://arxiv.org/abs/1703.03924 • Shared memory-based object store • Zero-copy tensor reads using Arrow libraries
  13. 13. Some Industry Contributors in Apache Arrow ClearCode
  14. 14. Arrow Project Growth • 138 Contributors on GitHub • > 1900 Resolved JIRAs • > 100K binary package downloads per month JIRA Burndown since Project Inception
  15. 15. Current Project Status • 0.9.0 Release: March 21, 2018 • Some focus areas • Columnar format stability / forward compatibility • Streaming messaging / RPC procedure • Language implementations / interop • Data access (e.g. Parquet input/output, ORC) • Downstream integrations (Apache Spark, Python/pandas, …)
  16. 16. Upcoming Roadmap • Software development lifecycle improvements • Data ingest / access / export • Computational libraries (CPU + GPU) • Expanded language support • Richer RPC / messaging • More system integrations
  17. 17. The current data science stack’s computational foundation is severely dated, rooted in 1980s / 1990s FORTRAN-style semantics Single-core / single-threaded algorithms Naïve execution model, eager evaluation Primitive memory management, expensive data access Fragmented language ecosystems, “Proprietary” memory models …
  18. 18. Data scientists working with “small” data have not experienced great pain Small Data (< ~10GB) Medium Data (~10 - ~100GB) Big Data (> ~100GB-1TB) Current Python/R stack begins to “fail” around this point Users doing fine here
  19. 19. We can do so much better through modern systems techniques Multi-core algorithms, GPU acceleration, Code generation (LLVM) Lazy evaluation, “query” optimization Sophisticated memory management, Efficient access to huge data sets Interoperable memory models, zero-copy interchange between system components Note 1 Moore’s Law (and small data) enabled us to get by for a long time without confronting some of these challenges Note 2 Most of these methods have already been widely employed in analytic databases. Limited “novel” research needed
  20. 20. Computational libraries • “Kernel functions” performing vectorized analytics on Arrow memory format • Select CPU or GPU variant based on data location • Operator graphs (compose multiple operators) • Subgraph compiler (using LLVM) • Runtime engine: execute operator graphs
  21. 21. Data Access / Ingest • Apache Avro • Apache Parquet nested data support • Apache ORC • CSV • JSON • ODBC / JDBC • … and likely other data access points
  22. 22. Arrow-powered Data Science Systems • Portable runtime libraries, usable from multiple programming languages • Decoupled front ends • Companion to distributed systems like Dask, Ray
  23. 23. Getting involved • Join dev@arrow.apache.org • PRs to https://github.com/apache/arrow • Learn more about the Ursa Labs vision for Arrow-powered data science: https://ursalabs.org/tech/

×