Ursa Labs and Apache Arrow in 2019

An update on the Apache Arrow project and the not-for-profit Ursa Labs organization (https://ursalabs.org/) for 2019, covering active projects and development objectives.

Published in: Technology

  1. 1. Ursa Labs and Apache Arrow in 2019: Infrastructure for Next-generation Data Science ● Wes McKinney ● PyData Miami ● 2019-01-11
  2. 2. Ursa Labs Mission (https://ursalabs.org) ● Funding and employment for full-time open source developers ● Grow Apache Arrow ecosystem ● Build cross-language, portable computational libraries for data science ● Not-for-profit, funded by multiple corporations
  3. 3. Led by key figures from R and Python worlds
  4. 4. Team • 5 full-time remote engineers (US, Canada, Europe) • Contributions from the RStudio tidyverse team • We’re hiring: senior computational systems engineer; build / test / packaging automation engineer
  5. 5. Ursa Labs Sponsors Main sponsor and administrative partner
  6. 6. Sponsors help in many ways
  7. 7. Much of the data science stack’s computational foundation is severely dated, rooted in 1980s / 1990s FORTRAN-style semantics ● Single-core / single-threaded algorithms ● Naïve execution model, eager evaluation ● Primitive memory management, expensive data access ● Fragmented language ecosystems, “proprietary” memory models ● …
  8. 8. We can do so much better through modern systems techniques ● Multi-core algorithms, GPU acceleration, code generation (LLVM) ● Lazy evaluation, “query” optimization ● Sophisticated memory management, efficient access to huge data sets ● Interoperable memory models, zero-copy interchange between system components ● Note 1: Moore’s Law (and small data) enabled us to get by for a long time without confronting some of these challenges ● Note 2: Most of these methods have already been widely employed in analytic databases; limited “novel” research is needed
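As a rough illustration of the lazy evaluation idea on the slide above, the following sketch defers element-wise operations and applies them in a single pass when the result is requested. It uses only plain Python; `LazyColumn`, `map`, and `collect` are hypothetical names invented for this example and do not correspond to any Arrow API.

```python
class LazyColumn:
    """Records element-wise operations and applies them in one pass on demand."""

    def __init__(self, values, ops=()):
        self._values = values
        self._ops = tuple(ops)

    def map(self, fn):
        # No work happens here: the operation is only appended to the plan.
        return LazyColumn(self._values, self._ops + (fn,))

    def collect(self):
        # Single pass over the data, with no intermediate list per operation.
        out = []
        for v in self._values:
            for fn in self._ops:
                v = fn(v)
            out.append(v)
        return out


# Building the plan is cheap; the data is only touched in collect().
plan = LazyColumn([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
result = plan.collect()
```

An eager version would materialize one full intermediate list per operation; deferring the work is what gives an engine room to fuse or reorder steps before touching the data.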
  9. 9. Apache Arrow • Open source project founded in 2016 by key developers of 13 major open source data projects • Key ideas: a language-agnostic, open standard in-memory format for columnar data (aka “data frames”); bring together database and data science communities to collaborate on shared computational technology • 3 years old, over 200 unique contributors, > 1 million monthly installs
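To make the columnar idea on this slide concrete, here is a minimal sketch using only Python's standard library. It is not Arrow's actual format or API; the sample data and the names `rows`, `columns`, and `column_sum` are invented for illustration.

```python
from array import array

# Row-oriented layout: one Python tuple per record (hypothetical sample data).
rows = [(1, 2.5), (2, 4.0), (3, 1.5)]

# Column-oriented layout: one contiguous typed buffer per field.
columns = {
    "id": array("q", (r[0] for r in rows)),     # 64-bit signed integers
    "price": array("d", (r[1] for r in rows)),  # 64-bit floats
}


def column_sum(table, name):
    """Sum one column by scanning a single contiguous buffer."""
    return sum(table[name])


total = column_sum(columns, "price")
```

Because each column is one typed buffer, an aggregate over `price` never touches the `id` values, which is the access pattern analytic workloads tend to favor.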
  10. 10. Defragmenting Data Access ● Life without Arrow: up to 80-90% of CPU cycles spent on de/serialization ● Life with Arrow: no de/serialization
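The contrast on this slide can be sketched by analogy in plain Python: passing data through a serialize/deserialize round trip produces a detached copy, while passing a view shares the producer's memory. This uses `pickle` and `memoryview` from the standard library as stand-ins and is not how Arrow itself is implemented.

```python
import pickle
from array import array

values = array("d", [1.0, 2.0, 3.0, 4.0])

# "Life without Arrow": hand data over by serializing and deserializing,
# which produces an independent copy on the receiving side.
copied = pickle.loads(pickle.dumps(values))

# "Life with Arrow" (by analogy): hand over a view of the same memory.
shared = memoryview(values)

values[0] = 99.0                        # mutate the producer's buffer
copy_sees_change = copied[0] == 99.0    # the copy is detached from the source
view_sees_change = shared[0] == 99.0    # the view aliases the source buffer
```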
  11. 11. The Arrow Development Platform • Open source library stack offering some level of support for 11 different programming languages • Focus: reuse of runtime data and algorithms without copying or serialization; fast data access (storage systems, file formats); efficient data interchange (IPC, RPC); accelerated in-memory computing • Foundation of new systems, while accelerating existing ones
  12. 12. Worse Patterns → Better Patterns ● Custom data structures → (Open) standard data structures ● Copy and convert → Zero copy ● Custom algorithms → Standard algorithms ● Custom file formats → Standard file formats ● Custom wire protocols → Standard wire protocols
  13. 13. Analytic database architecture [layers: Front end API, Computation Engine, In-memory storage, IO and Deserialization] ● Vertically integrated / “Black Box” ● Internal components do not have a public API ● Users interact with front end
  14. 14. Analytic database, deconstructed [layers: Front end API, Computation Engine, In-memory storage, IO and Deserialization] ● Components have public APIs ● Use what you need ● Different front ends can be developed
  15. 15. Analytic database, deconstructed [layers: Front end API, Computation Engine, In-memory storage, IO and Deserialization] ● Arrow is front end agnostic
  16. 16. Arrow Use Cases ● Data access ○ Read and write widely used storage formats ○ Interact with database protocols, other data sources ● Data movement ○ Zero-copy interprocess communication ○ Efficient RPC / client-server communications ● Computation libraries ○ Efficient in-memory / out-of-core data frame-type analytics ○ LLVM-compilation for vectorized expression evaluation
  17. 17. Some problems relevant to pandas users • Memory-mapping large on-disk datasets • Efficient string processing • Chunked / non-contiguous tables • Native nested types (structs, arrays, unions) • Efficient interchange with other systems
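The first item on this slide, memory-mapping a large on-disk dataset, can be sketched with the standard library alone. The file below is a hypothetical temporary file holding raw float64 values, not an Arrow or Parquet file, and the variable names are invented for illustration.

```python
import mmap
import os
import tempfile
from array import array

# Write a hypothetical column of 1,000 float64 values to a temporary file.
data = array("d", (float(i) for i in range(1000)))
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    data.tofile(f)

# Map the file and view it as float64 without loading it into Python objects.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    base = memoryview(mapped)
    column = base.cast("d")
    window = column[100:105]           # a slice is another view, not a copy
    window_values = window.tolist()    # only these five values are materialized
    n_values = len(column)
    # Release all views before closing the map; close() fails while
    # exported buffers still exist.
    window.release()
    column.release()
    base.release()
    mapped.close()

os.remove(path)
```

The operating system pages in only the parts of the file that are actually read, so slicing a small window out of a very large file stays cheap.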
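  18. 18. Example: Arrow-accelerated Python + Apache Spark ● Joint work with Li Jin from Two Sigma, Bryan Cutler from IBM ● Vectorized user-defined functions, fast data export to pandas
        import pandas as pd
        from scipy import stats
        from pyspark.sql.functions import pandas_udf
        # Vectorized UDF: receives a pandas Series per batch and returns a Series
        @pandas_udf('double')
        def cdf(v):
            return pd.Series(stats.norm.cdf(v))
        # df is assumed to be an existing Spark DataFrame with a numeric column v
        df.withColumn('cumulative_probability', cdf(df.v))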
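  19. 19. Example: Arrow-accelerated Python + Apache Spark [Diagram: Spark SQL converts data to Arrow and sends an Arrow columnar stream (input) to the PySpark worker, zero copy via socket; the worker converts from Arrow to pandas and back to Arrow, and Spark SQL reads the Arrow columnar stream (output) from Arrow]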
  20. 20. Example: NVIDIA RAPIDS libraries
  21. 21. Some Industry Contributors to Apache Arrow ClearCode
  22. 22. 2019 Ursa Labs Development Agenda ● File format ingest/export ● Arrow RPC: “Flight” Framework ● Gandiva: LLVM-based expression compiler ● In-memory Columnar Query Engine ● Language interop: Python and R ● Cloud file-system support
  23. 23. 2018 Accomplishments • 3 major releases (0.9, 0.10, 0.11) • 1600+ resolved JIRA issues • 7 codebase donations • Major engineering efforts • Improved CI / CD tooling; packaging automation for releases • Bootstrap R library development • C++ CSV reader • Combine Arrow and Parquet C++ codebases • GPU support library
  24. 24. File format support ● CSV ● JSON ● Avro ● Parquet ● ORC
  25. 25. Arrow Flight RPC Framework • Key idea: standardized high performance data transport • A gRPC-based framework for defining custom data services that send and receive Arrow columnar data natively • Uses Protocol Buffers v3 for client protocol • Pluggable command execution layer, authentication • Low-level gRPC optimizations (~ 10x faster than comparables) • Write Arrow memory directly onto outgoing gRPC buffer • Avoid any copying or deserialization
  26. 26. Arrow Flight - Efficient gRPC transport [Diagram: Client issues DoGet to a Data Node, which streams back FlightData messages, each carrying a Row Batch] ● Data transported in a Protocol Buffer, but reads can be made zero-copy by writing a custom gRPC “deserializer”
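The idea of reading streamed batches without copying their payloads can be sketched in plain Python. This is not Flight's wire protocol, Protocol Buffers, or gRPC: the length-prefixed framing and the names `encode_batches` and `iter_batches` are assumptions made up for this example.

```python
import struct
from array import array


def encode_batches(batches):
    """Frame each batch as a 4-byte little-endian length followed by its bytes."""
    out = bytearray()
    for batch in batches:
        payload = batch.tobytes()
        out += struct.pack("<I", len(payload))
        out += payload
    return bytes(out)


def iter_batches(message):
    """Yield float64 views into `message` without copying the payload bytes."""
    view = memoryview(message)
    offset = 0
    while offset < len(view):
        (size,) = struct.unpack_from("<I", view, offset)
        offset += 4
        # Slicing and casting a memoryview creates new views, not copies.
        yield view[offset:offset + size].cast("d")
        offset += size


wire = encode_batches([array("d", [1.0, 2.0]), array("d", [3.0])])
received = [batch.tolist() for batch in iter_batches(wire)]
```

Only the 4-byte length headers are decoded; each batch the reader yields is a window onto the received message buffer.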
  27. 27. Gandiva, LLVM-powered expression compiler • Initially developed by Dremio, donated to Apache Arrow • Efficient evaluation of projections, filters, and aggregates • Uses LLVM for runtime code generation • Dremio uses it to accelerate a Java-based distributed SQL engine
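Runtime code generation for expressions can be illustrated by loose analogy in pure Python: build source text for a specialized function, compile it once, then run it over columns. This is not Gandiva's API and involves no LLVM; `compile_projection` and the column names are hypothetical, and running `exec` on generated source is only safe with trusted expressions.

```python
def compile_projection(expression, column_names):
    """Generate and compile a function that evaluates `expression` per row.

    The function source is produced at runtime, loosely analogous to an
    expression compiler emitting specialized code instead of interpreting
    an expression tree for every row.
    """
    if not all(name.isidentifier() for name in column_names):
        raise ValueError("column names must be valid identifiers")
    args = ", ".join(column_names)
    source = (
        "def projection(columns):\n"
        f"    return [{expression} for ({args},) in zip(*columns)]\n"
    )
    namespace = {}
    # Only pass trusted expressions: the generated source is executed as code.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["projection"]


# Compile once, then evaluate over column data (hypothetical sample values).
project = compile_projection("price * quantity", ["price", "quantity"])
totals = project([[2.0, 3.0], [4, 5]])
```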
  28. 28. Using Gandiva from Java with zero-copy ● Query: SELECT year(timestamp), month(timestamp), … FROM table ... [Diagram: Input Table Fragment in Arrow Java → JNI (zero-copy) → Evaluate Gandiva LLVM Function in Arrow C++ → Result Table Fragment]
  29. 29. Cloud Service Support ● Support data engineering workflows on AWS, GCP, Azure ● Optimized IO for cloud blob storage (S3, GCS, etc.) ● In C++, so can be used in Python, R, Ruby, etc.
  30. 30. Looking ahead • 2019 likely to be year of rapid growth for Apache Arrow • Grow community, diversity of language and scope • Join us: https://github.com/apache/arrow
