Data Science Without Borders (JupyterCon 2017)

4,987 views

A talk about building shared, language-agnostic computational infrastructure for data science. It covers the motivation for this work and what the Apache Arrow project (http://arrow.apache.org) is doing toward it.

Published in: Technology

  1. DATA SCIENCE WITHOUT BORDERS. WES MCKINNEY @WESMCKINN. JupyterCon | August 2017
  2. ME
  3. IMPORTANT LEGAL INFORMATION • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
  4. THINKING ON THE LAST 10 YEARS: 2007 to 2017
  5. CLOSED SOURCE | OPEN SOURCE
  6. A shared front-end for data science
  7. THE NEXT 10 YEARS AND BEYOND: 2017 to 2027 …
  8. THE AI ARMS RACE
  9. CHANGING HARDWARE LANDSCAPE: DISK | PROCESSING | MEMORY
  10. DATA SCIENCE LANGUAGE “SILOS”. FRONT-END: PYTHON | R | JVM | JULIA | …
  11. WHAT’S IN A SILO? STORAGE / DATA ACCESS | DATA STRUCTURES / IN-MEMORY FORMATS | GENERAL COMPUTE ENGINE(S) | ADVANCED ANALYTICS
  12. WHAT’S IN A SILO? STORAGE / DATA ACCESS | DATA STRUCTURES / IN-MEMORY FORMATS | GENERAL COMPUTE ENGINE(S) | ADVANCED ANALYTICS. Libraries shown against the layers: pandas, NumPy, pandas, NumPy, pandas, scikit-learn
  13. RENOVATING PANDAS
  14. MAKING THE SILOS “SMALLER”. FRONT-END: PYTHON | R | JVM | JULIA | ? | …
  15. PROGRAMMING LANGUAGES AS USER INTERFACES
  16. GRAPHIC: Iceberg under sea (only top part visible to naked eye)
  17. SAME ANALYSIS, DIFFERENT IMPLEMENTATION. R: df <- read_csv(…); df %>% group_by(…) %>% summarise(…). PYTHON: df = read_csv(…); df.groupby(…).aggregate(…)
  18. A SHARED RUNTIME FOR DATA SCIENCE. FRONT-END: PYTHON | R | JVM | JULIA | …, all on top of a SHARED DATA SCIENCE RUNTIME
  19. FROM IDEA TO ACTION
  20. PART 1: STANDARD IN-MEMORY FORMAT. R | PYTHON | JVM | …. PORTABLE DATA FRAME. Non-Portable Data Frames
  21. PART 2: ZERO COPY INTERCHANGE. R | PYTHON | JVM | …. SHARED MEMORY + STANDARD MEMORY FORMATS
  22. PART 3: HIGH PERFORMANCE DATA ACCESS. Storage Formats / Databases: BINARY COLUMNAR | CSV | SQL | …. PORTABLE DATA FRAME
  23. PART 4: FLEXIBLE COMPUTATION ENGINE • Zero-overhead User-defined Functions • Portable Operator “Graphs” • “Embeddable” in Larger Systems
  24. APACHE ARROW: Language-agnostic Data Frame Format. Zero-Copy Interchange
  25. BUILDING THE ARROW FORMAT • “Superset” of representations supported by R, pandas, SQL engines • Optimized for CPU cache affinity • ASF Governance: Open + Transparent Community Project
  26. FEATHER: MINIMALIST ARROW ON
  27. Some Arrow OSS Users: Feather Format, Ray Project
  28. BUILDING THE FUTURE
  29. THANK YOU. WES MCKINNEY @WESMCKINN. Apache Arrow: http://arrow.apache.org
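
The "standard in-memory format" and "zero copy interchange" ideas from Parts 1 and 2 can be illustrated with a minimal sketch that uses only the Python standard library. This is not Arrow's API or memory layout; the `table` dictionary and the `share_column` helper are hypothetical names made up for this example. It assumes each column is stored as one contiguous typed buffer, and shows a second consumer reading and writing that buffer through a view rather than a copy.

```python
from array import array

# A toy "columnar" table: each column is one contiguous, typed buffer.
# (Hypothetical layout for illustration only; this is not Arrow's format.)
table = {
    "price": array("d", [10.0, 12.5, 9.75, 11.0]),   # 64-bit floats
    "volume": array("q", [100, 250, 75, 300]),       # 64-bit integers
}


def share_column(columns, name):
    """Return a zero-copy view of a column's underlying buffer."""
    return memoryview(columns[name])


# A second "consumer" receives a view instead of a copy.
price_view = share_column(table, "price")

# The view reads the producer's memory directly.
total = sum(price_view)

# Writes through the view are visible to the producer: same memory, no copy.
price_view[0] = 20.0
producer_sees = table["price"][0]

# Serialising to bytes, by contrast, copies the data,
# so later writes through the view do not reach the copy.
copied = price_view.tobytes()
price_view[1] = 0.0
copy_is_stale = copied != price_view.tobytes()
```

Here both parties agree on the buffer's element type and layout in advance, which is what lets the view be handed over without conversion. That agreement is the role a shared, language-agnostic format is meant to play across R, Python, and the JVM.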
