Data Science Without Borders (JupyterCon 2017)

A talk about building shared, language-agnostic computational infrastructure for data science. It covers the motivation for that infrastructure and the work under way in the Apache Arrow project (http://arrow.apache.org) to help build it.

  1. DATA SCIENCE WITHOUT BORDERS. Wes McKinney, @wesmckinn. JupyterCon | August 2017
  2. ME
  3. IMPORTANT LEGAL INFORMATION • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved.
  4. THINKING ON THE LAST 10 YEARS (2007 to 2017)
  5. CLOSED SOURCE | OPEN SOURCE
  6. A shared front-end for data science
  7. THE NEXT 10 YEARS AND BEYOND (2017 to 2027 …)
  8. THE AI ARMS RACE
  9. CHANGING HARDWARE LANDSCAPE: DISK, PROCESSING, MEMORY
  10. DATA SCIENCE “LANGUAGE SILOS”. FRONT-END: PYTHON, R, JVM, JULIA, …
  11. WHAT’S IN A SILO? STORAGE / DATA ACCESS; DATA STRUCTURES / IN-MEMORY FORMATS; GENERAL COMPUTE ENGINE(S); ADVANCED ANALYTICS
  12. WHAT’S IN A SILO? STORAGE / DATA ACCESS; DATA STRUCTURES / IN-MEMORY FORMATS; GENERAL COMPUTE ENGINE(S); ADVANCED ANALYTICS (slide labels, in order: pandas, NumPy, pandas, NumPy, pandas, scikit-learn)
  13. RENOVATING PANDAS
  14. MAKING THE SILOS “SMALLER”. FRONT-END: PYTHON, R, JVM, JULIA, ?, …
  15. PROGRAMMING LANGUAGES AS USER INTERFACES
  16. GRAPHIC: Iceberg under sea (only top part visible to naked eye)
  17. SAME ANALYSIS, DIFFERENT IMPLEMENTATION. R: df <- read_csv(…); df %>% group_by(…) %>% summarise(…). PYTHON: df = read_csv(…); df.groupby(…).aggregate(…)
  18. A SHARED RUNTIME FOR DATA SCIENCE. FRONT-END: PYTHON, R, JVM, JULIA, …; SHARED DATA SCIENCE RUNTIME
  19. FROM IDEA TO ACTION
  20. PART 1: STANDARD IN-MEMORY FORMAT. Non-Portable Data Frames (R, PYTHON, JVM, …); PORTABLE DATA FRAME
  21. PART 2: ZERO COPY INTERCHANGE. R, PYTHON, JVM, …; SHARED MEMORY + STANDARD MEMORY FORMATS
  22. PART 3: HIGH PERFORMANCE DATA ACCESS. Storage Formats / Databases (BINARY COLUMNAR, CSV, SQL, …); PORTABLE DATA FRAME
  23. PART 4: FLEXIBLE COMPUTATION ENGINE • Zero-overhead User-defined Functions • Portable Operator “Graphs” • “Embeddable” in Larger Systems
  24. APACHE ARROW: Language-agnostic Data Frame Format; Zero-Copy Interchange
  25. BUILDING THE ARROW FORMAT • “Superset” of representations supported by R, pandas, SQL engines • Optimized for CPU cache affinity • ASF Governance: Open + Transparent Community Project
  26. FEATHER: MINIMALIST ARROW ON DISK
  27. Some Arrow OSS Users: Feather Format, Ray Project
  28. BUILDING THE FUTURE
  29. THANK YOU. Wes McKinney, @wesmckinn. Apache Arrow: http://arrow.apache.org
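
The ideas on slides 20, 21 and 24 (a column-oriented in-memory format whose buffers several consumers can read without copying) can be illustrated with a small sketch that uses only the Python standard library. This is not Arrow's actual format or API: `make_frame` and the column names are hypothetical, and `array` plus `memoryview` merely stand in for a standardized memory layout shared between runtimes.

```python
from array import array

# Hypothetical toy "portable data frame": each column is one contiguous,
# typed buffer (column-oriented), rather than a list of row objects.
def make_frame():
    return {
        "id": array("q", [1, 2, 3, 4]),              # 64-bit signed integers
        "price": array("d", [9.5, 3.0, 7.25, 1.0]),  # 64-bit floats
    }

frame = make_frame()

# A second "consumer" receives a memoryview over the same bytes.
# No data is copied; the view only records where the buffer lives.
price_view = memoryview(frame["price"])

# Slicing a memoryview is zero-copy as well.
tail_view = price_view[2:]

# A write through the producer's column is visible through both views,
# which shows that all three names refer to the same memory.
frame["price"][3] = 2.0

total = sum(price_view)
```

Because both sides agree on the element type and layout of each column, the consumer can aggregate or slice the data in place. Agreeing on such a layout across languages, rather than within one process, is the role the slides assign to a standard memory format.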
