Shared Infrastructure for Data Science

Talk at the Rice Data Science Conference, October 9, 2017. Extended version of a talk from JupyterCon 2017.


  1. SHARED INFRASTRUCTURE FOR DATA SCIENCE. Wes McKinney (@wesmckinn). Rice Data Science Conference | October 2017
  2. ME
  3. IMPORTANT LEGAL INFORMATION • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved.
  4. THINKING ON THE LAST 10 YEARS: 2007 to 2017
  5. CLOSED SOURCE, OPEN SOURCE
  6. Shared front-ends for data science
  7. THE NEXT 10 YEARS AND BEYOND: 2017, 2027, …
  8. THE AI ARMS RACE
  9. CHANGING HARDWARE LANDSCAPE: DISK, PROCESSING, MEMORY
  10. DATA SCIENCE LANGUAGE “SILOS”. FRONT-END: PYTHON, R, JVM, JULIA, …
  11. WHAT’S IN A SILO? STORAGE / DATA ACCESS; DATA STRUCTURES / IN-MEMORY FORMATS; GENERAL COMPUTE ENGINE(S); ADVANCED ANALYTICS
  12. WHAT’S IN A SILO? STORAGE / DATA ACCESS; DATA STRUCTURES / IN-MEMORY FORMATS; GENERAL COMPUTE ENGINE(S); ADVANCED ANALYTICS. Libraries shown on the slide, in order: pandas, NumPy, pandas, NumPy, pandas, scikit-learn
  13. RENOVATING PANDAS
  14. [No slide text extracted]
  15. MAKING THE SILOS “SMALLER”. FRONT-END: PYTHON, R, JVM, JULIA, ?, …
  16. PROGRAMMING LANGUAGES AS USER INTERFACES
  17. GRAPHIC: Iceberg under sea (only top part visible to naked eye)
  18. SAME ANALYSIS, DIFFERENT IMPLEMENTATION. R: df <- read_csv(…); df %>% group_by(…) %>% summarise(…). PYTHON: df = read_csv(…); df.groupby(…).aggregate(…)
  19. A SHARED RUNTIME FOR DATA SCIENCE. FRONT-END: PYTHON, R, JVM, JULIA, …, all on top of a SHARED DATA SCIENCE RUNTIME
  20. FROM IDEA TO ACTION
  21. PART 1: STANDARD IN-MEMORY FORMAT. R, PYTHON, JVM, …: Non-Portable Data Frames; PORTABLE DATA FRAME
  22. PART 2: ZERO COPY INTERCHANGE. R, PYTHON, JVM, …: SHARED MEMORY + STANDARD MEMORY FORMATS
  23. PART 3: HIGH PERFORMANCE DATA ACCESS. Storage Formats / Databases (BINARY, COLUMNAR, CSV, SQL, …) and the PORTABLE DATA FRAME
  24. PART 4: FLEXIBLE COMPUTATION ENGINE • Zero-overhead User-defined Functions • Portable Operator “Graphs” • “Embeddable” in Larger Systems
  25. APACHE ARROW: Language-agnostic Data Frame Format, Zero-Copy Interchange
  26. Simple, fast data interchange: Without Arrow vs. With Arrow
  27. Big picture Arrow goals • Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access • Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats • Complex schema support: Flat and nested data types • Main implementations in C++ and Java: with integration tests • Bindings / implementations for C, Python, Ruby, JavaScript in various stages of development
  28. BUILDING THE ARROW FORMAT • “Superset” of representations supported by R, pandas, SQL engines • Optimized for CPU cache affinity • ASF Governance: Open + Transparent Community Project
  29. FEATHER: MINIMALIST ARROW ON DISK
  30. Some Arrow OSS Users: Feather Format, Ray Project
  31. FROM ARROW TO PANDAS2
  32. Logical Operator Graphs: (a + b).log(), drawn as a Log node over an Add node with inputs a and b
  33. Terminology • Kernel functions: atomic units of computation • Operator nodes: input/output types, operator parallelism properties
  34. Parallel Execution of Operator Graphs: a, b → ADD → tmp → LOG → out
  35. Some Optimization strategies • Multicore scheduling • Elimination of temporaries • Operator fusion / pipelining
  36. Diagram layers: Arrow-optimized data connectors; Arrow in-memory format; Logical Data Frame Expression Graphs; Parallel Dataflow Execution Engine; Python user API, DataFrame semantics, User-defined functions. Labels: pandas2, Apache Arrow
  37. BUILDING THE FUTURE
  38. THANK YOU. Wes McKinney (@wesmckinn). Apache Arrow: http://arrow.apache.org
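
The "Logical Operator Graphs" and "Some Optimization strategies" slides describe the expression (a + b).log() as a graph of operator nodes built from kernel functions, which can be run with or without a temporary column. The following is a minimal sketch of that idea in plain Python. It is not pandas2 or Arrow code: the names Input, Op, add, log, evaluate_unfused and evaluate_fused are hypothetical, and Python lists stand in for columns.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Column = List[float]


@dataclass(frozen=True)
class Input:
    """Leaf node: refers to a named input column (hypothetical name)."""
    name: str


@dataclass(frozen=True)
class Op:
    """Operator node: a scalar kernel function plus its child nodes."""
    name: str
    kernel: Callable[..., float]
    args: Tuple["Node", ...]


Node = Union[Input, Op]


def add(x: Node, y: Node) -> Op:
    return Op("Add", lambda u, v: u + v, (x, y))


def log(x: Node) -> Op:
    return Op("Log", math.log, (x,))


def evaluate_unfused(node: Node, columns: Dict[str, Column]) -> Column:
    """Run one operator at a time; every Op materializes a full column.

    For (a + b).log() this builds the temporary column `tmp = a + b`
    before applying Log to it.
    """
    if isinstance(node, Input):
        return columns[node.name]
    child_columns = [evaluate_unfused(arg, columns) for arg in node.args]
    return [node.kernel(*row) for row in zip(*child_columns)]


def evaluate_fused(node: Node, columns: Dict[str, Column]) -> Column:
    """Evaluate the whole graph row by row, so no temporary column exists."""
    def scalar(n: Node, i: int) -> float:
        if isinstance(n, Input):
            return columns[n.name][i]
        return n.kernel(*(scalar(arg, i) for arg in n.args))

    n_rows = len(next(iter(columns.values())))
    return [scalar(node, i) for i in range(n_rows)]


# Build the graph for (a + b).log() and evaluate it both ways.
expr = log(add(Input("a"), Input("b")))
columns = {"a": [1.0, 2.0, 3.0], "b": [0.0, 1.0, 4.0]}
unfused = evaluate_unfused(expr, columns)
fused = evaluate_fused(expr, columns)
```

Both evaluators return the same values; the fused one only illustrates "elimination of temporaries". A real engine would presumably fuse over chunks of rows and schedule chunks across cores rather than recurse per value.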

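The "Big picture Arrow goals" and "ZERO COPY INTERCHANGE" slides rest on two ideas: each column lives in one contiguous typed buffer, and consumers read that buffer in place instead of copying it. The sketch below illustrates both with Python's standard `array` and `memoryview` types only. It assumes nothing about Arrow's actual API or memory layout, and the sample rows are made up for illustration.

```python
from array import array

# Row-oriented records (made-up sample data).
rows = [(1, 10.5), (2, 20.25), (3, 30.0)]

# Column-oriented layout: one contiguous, fixed-width buffer per column.
ids = array("q", (r[0] for r in rows))     # signed 64-bit integers
values = array("d", (r[1] for r in rows))  # 64-bit floats

# A memoryview exposes the column's buffer without copying it.
view = memoryview(values)

# Slicing a memoryview also shares memory with the original buffer.
window = view[1:]

# Fixed-width values make random access an O(1) offset computation.
third = view[2]

# Writing through the owner is visible through the view: same memory.
values[2] = 99.0
shared = window[1]
```

After the assignment, `shared` is 99.0 because `window` never held its own copy of the data. Sharing across processes or languages additionally needs shared memory and an agreed-upon format, which is the part the slides attribute to Arrow.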