Shared Infrastructure for Data Science

Director of Ursa Labs, Open Source Developer at Ursa Labs
Oct. 11, 2017

  1. SHARED INFRASTRUCTURE FOR DATA SCIENCE Wes McKinney @wesmckinn Rice Data Science Conference | October 2017
  2. ME
  3. IMPORTANT LEGAL INFORMATION • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved.
  4. THINKING ON THE LAST 10 YEARS 2007 2017
  5. CLOSED SOURCE OPEN SOURCE
  6. Shared front-ends for data science
  7. THE NEXT 10 YEARS AND BEYOND 2017 2027 …
  8. THE AI ARMS RACE
  9. CHANGING HARDWARE LANDSCAPE DISK PROCESSING MEMORY
  10. DATA SCIENCE LANGUAGE “SILOS” FRONT-END PYTHON R JVM JULIA …
  11. WHAT’S IN A SILO? STORAGE / DATA ACCESS DATA STRUCTURES / IN-MEMORY FORMATS GENERAL COMPUTE ENGINE(S) ADVANCED ANALYTICS
  12. WHAT’S IN A SILO? STORAGE / DATA ACCESS DATA STRUCTURES / IN-MEMORY FORMATS GENERAL COMPUTE ENGINE(S) ADVANCED ANALYTICS pandas NumPy pandas NumPy pandas scikit-learn
  13. RENOVATING PANDAS
  14. 27
  15. MAKING THE SILOS “SMALLER” FRONT-END PYTHON R JVM JULIA ? …
  16. PROGRAMMING LANGUAGES AS USER INTERFACES
  17. GRAPHIC: Iceberg under sea (only top part visible to naked eye)
  18. SAME ANALYSIS, DIFFERENT IMPLEMENTATION R: df <- read_csv(…) df %>% group_by(…) %>% summarise(…) PYTHON: df = read_csv(…) df.groupby(…).aggregate(…)
  19. A SHARED RUNTIME FOR DATA SCIENCE FRONT-END PYTHON R JVM JULIA SHARED DATA SCIENCE RUNTIME …
  20. FROM IDEA TO ACTION
  21. PART 1: STANDARD IN-MEMORY FORMAT R PYTHON JVM PORTABLE DATA FRAME Non-Portable Data Frames …
  22. PART 2: ZERO COPY INTERCHANGE R PYTHON JVM SHARED MEMORY + STANDARD MEMORY FORMATS …
  23. PART 3: HIGH PERFORMANCE DATA ACCESS BINARY COLUMNAR CSV SQL PORTABLE DATA FRAME Storage Formats/ Databases …
  24. PART 4: FLEXIBLE COMPUTATION ENGINE • Zero-overhead User-defined Functions • Portable Operator “Graphs” • “Embeddable” in Larger Systems
  25. APACHE ARROW Language-agnostic Data Frame Format Zero-Copy Interchange
  26. Simple, fast data interchange Without Arrow With Arrow
  27. Big picture Arrow goals • Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access • Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats • Complex schema support: Flat and nested data types • Main implementations in C++ and Java: with integration tests • Bindings / implementations for C, Python, Ruby, JavaScript in various stages of development
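
The goals on slide 27 can be illustrated with a toy column built only from the Python standard library. This is a minimal sketch, not the Arrow format or the pyarrow API: the `Int64Column` class and its `view` method are hypothetical names, and the validity-bitmap layout is a simplification chosen for illustration. It shows a contiguous values buffer, O(1) random access with nulls, and a zero-copy window onto the same memory.

```python
from array import array


class Int64Column:
    """Toy column (hypothetical): contiguous int64 values plus a validity bitmap."""

    def __init__(self, values):
        # Nulls occupy a slot in the values buffer (stored as 0) and are
        # flagged as missing in the bitmap, one bit per value.
        self.values = array("q", (0 if v is None else v for v in values))
        self.validity = bytearray((len(values) + 7) // 8)
        for i, v in enumerate(values):
            if v is not None:
                self.validity[i // 8] |= 1 << (i % 8)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, i):
        # O(1) random access: one bit test plus one fixed-offset read.
        if not (self.validity[i // 8] >> (i % 8)) & 1:
            return None
        return self.values[i]

    def view(self, start, stop):
        # Zero-copy window onto the values buffer: no elements are copied.
        return memoryview(self.values)[start:stop]


col = Int64Column([10, None, 30, 40])
window = col.view(2, 4)
col.values[2] = 99  # mutate the underlying buffer in place
print(col[1], col[3], window[0])  # the view observes the change
```

Because `window` is a `memoryview` rather than a copy, the write to `col.values[2]` is visible through it, which is the property that makes sharing one buffer between consumers cheap.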
  28. BUILDING THE ARROW FORMAT • “Superset” of representations supported by R, pandas, SQL engines • Optimized for CPU cache affinity • ASF Governance: Open + Transparent Community Project
  29. FEATHER: MINIMALIST ARROW ON DISK
  30. Some Arrow OSS Users Feather Format Ray Project
  31. FROM ARROW TO PANDAS2
  32. Logical Operator Graphs (a + b).log() Log Add a b
  33. Terminology • Kernel functions: atomic units of computation • Operator nodes: input/output types, operator parallelism properties
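
Slides 32 and 33 can be sketched in plain Python. The code below is a hedged illustration, not the pandas2 design: the class names `Node`, `Column`, and `Op` and the two kernels are hypothetical, the kernels work on Python lists, and the operator nodes omit the type and parallelism metadata the slide mentions. It shows how `(a + b).log()` can build a `Log(Add(a, b))` graph first and run kernels only when evaluated.

```python
import math


# Kernel functions: atomic, element-wise units of computation over lists.
def add_kernel(x, y):
    return [p + q for p, q in zip(x, y)]


def log_kernel(x):
    return [math.log(p) for p in x]


class Node:
    """Base class for graph nodes; operators build new nodes lazily."""

    def __add__(self, other):
        return Op("Add", add_kernel, (self, other))

    def log(self):
        return Op("Log", log_kernel, (self,))


class Column(Node):
    """Leaf node: a named input column."""

    def __init__(self, name):
        self.name = name

    def evaluate(self, data):
        return data[self.name]

    def __repr__(self):
        return self.name


class Op(Node):
    """Operator node: a kernel plus the input nodes it consumes."""

    def __init__(self, name, kernel, inputs):
        self.name, self.kernel, self.inputs = name, kernel, inputs

    def evaluate(self, data):
        # Evaluate the inputs first, then apply this node's kernel.
        args = [node.evaluate(data) for node in self.inputs]
        return self.kernel(*args)

    def __repr__(self):
        return f"{self.name}({', '.join(map(repr, self.inputs))})"


a, b = Column("a"), Column("b")
expr = (a + b).log()  # builds the graph; nothing is computed yet
print(expr)           # Log(Add(a, b))
result = expr.evaluate({"a": [1.0, 2.0], "b": [0.0, 2.0]})
```

Keeping the graph separate from its evaluation is what leaves room for a scheduler to reorder, parallelize, or fuse nodes before any data is touched; this sketch simply evaluates sequentially.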
  34. Parallel Execution of Operator Graphs a b ADD LOG tmp out
  35. Some Optimization strategies • Multicore scheduling • Elimination of temporaries • Operator fusion / pipelining
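
Two of the strategies on slide 35, elimination of temporaries and operator fusion, can be shown on the ADD then LOG pipeline from slide 34. This is a minimal sketch under stated assumptions: the functions `unfused` and `fused` are hypothetical, they operate on Python lists, and multicore scheduling is not covered here.

```python
import math


def unfused(a, b):
    # Two passes: ADD materializes a full temporary, then LOG reads it back.
    tmp = [x + y for x, y in zip(a, b)]
    return [math.log(t) for t in tmp]


def fused(a, b):
    # One pass: ADD and LOG are applied per element, so no temporary list
    # is ever materialized.
    return [math.log(x + y) for x, y in zip(a, b)]


a = [1.0, 2.0, 3.0]
b = [0.0, 2.0, 5.0]
result = fused(a, b)
```

Both functions return the same values; the fused version touches each input element once and skips the intermediate `tmp` buffer, which matters more as columns grow past cache size.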
  36. Arrow-optimized data connectors Arrow in-memory format Logical Data Frame Expression Graphs Parallel Dataflow Execution Engine Python user API, DataFrame semantics, User-defined functions pandas2 Apache Arrow
  37. BUILDING THE FUTURE
  38. THANK YOU Wes McKinney @wesmckinn Apache Arrow: http://arrow.apache.org