Raising the Tides: Open Source Analytics for Data Science

Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Wes McKinney @wesmckinn
NEWSWEEK AI & DATA SCIENCE CONFERENCE – CAPITAL MARKETS
2 MARCH 2017
Me
Important Legal Information
• The information presented here is offered for informational purposes only and
should not be used for any other purpose (including, without limitation, the
making of investment decisions). Examples provided herein are for illustrative
purposes only and are not necessarily based on actual data. Nothing herein
constitutes: an offer to sell or the solicitation of any offer to buy any security or
other interest; tax advice; or investment advice. This presentation shall remain
the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma
reserves the right to require the return of this presentation at any time.
• Some of the images, logos or other material used herein may be protected by
copyright and/or trademark. If so, such copyrights and/or trademarks are most
likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or
trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two
Sigma, nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
In the next 20 minutes
∞ Important trends in the industry
∞ Two Sigma involvement in open source
∞ Growing the community
WHAT I’M SEEING TODAY
Industry giants open source core AI
and machine learning technology
Open source “disruption” in data science
languages and supporting technologies
Observation #1:
User Mindshare is a Key Asset
Observation #2:
Tools may be less important than human
capital and data
Two Sigma
Building a state-of-the-art, collaborative
data science platform
Scaling data science in many dimensions
∞ Access to diverse data sets
∞ Enhancing individual productivity
∞ Computational capabilities: larger and more
complex data sets
∞ Collaboration within and across teams
TOOLS AND THE
“DATA SCIENTIST SHORTAGE”
WHY WE PARTICIPATE
IN OPEN SOURCE
Why we participate in Open Source
1. Drive progress and innovation in foundational
technologies
2. Increase the overall value, interoperability, and
sustainability of our closed source systems
3. Raise awareness of problems faced at scale on real
world data
4. Benefit sooner from open source innovations
5. Attract and retain the best engineering talent
Where we are investing
• Collaboration and Publishing
• Cluster Resource Management
• Scalable / Distributed Computing
• High Performance Data Processing
Core data infrastructure technologies
Apache Arrow
• Efficient columnar in-memory data processing
• High-speed, interoperable data messaging for Java, C++, Python
Apache Parquet
• Industry-standard columnar file format for distributed storage
• Efficient IO for Spark, Python, etc.
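To illustrate what "columnar" means in the bullets above, here is a minimal sketch in plain Python. It does not use the Arrow or Parquet APIs. The sample records, the field names, and the helper to_columns are hypothetical and exist only for illustration. The sketch pivots row-oriented records into one list per field, so that an aggregate over a single field reads only that field's values.

```python
# Hypothetical sample data: three trade records in row-oriented form.
rows = [
    {"symbol": "AAA", "price": 10.0, "volume": 100},
    {"symbol": "BBB", "price": 20.5, "volume": 250},
    {"symbol": "AAA", "price": 10.5, "volume": 150},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict mapping field name to a list of values."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field now scans a single list
# instead of visiting every field of every record.
total_volume = sum(columns["volume"])
mean_price = sum(columns["price"]) / len(columns["price"])
```

Real columnar systems add typed, contiguous buffers and metadata on top of this idea. The sketch only shows the change in layout.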
Open source in-memory and distributed analytics
• Popular Python analytics library
• Powerful and easy-to-use data cleaning, analytics, and time series processing
• Flint: scalable time series analytics for Spark
• Enhanced Python integration
Cluster resource management
• Scalable cluster resource manager
• Native container support
Cook
• Fair job scheduler for Mesos
• Managing multi-tenant Spark clusters
Collaboration and publishing
• Notebook “kernels” for polyglot research and development
• Inter-language data exchange
• Leading web notebook & reproducible research development platform
• Interactive widgets framework
TOWARD HIGH TIDE:
Preserving competitive advantage and
building common knowledge
Thank you
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
ShapeBlue161 views
Why and How CloudStack at weSystems - Stephan Bienek - weSystems by ShapeBlue
Why and How CloudStack at weSystems - Stephan Bienek - weSystemsWhy and How CloudStack at weSystems - Stephan Bienek - weSystems
Why and How CloudStack at weSystems - Stephan Bienek - weSystems
ShapeBlue238 views
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ... by ShapeBlue
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
How to Re-use Old Hardware with CloudStack. Saving Money and the Environment ...
ShapeBlue166 views
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit... by ShapeBlue
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
Transitioning from VMware vCloud to Apache CloudStack: A Path to Profitabilit...
ShapeBlue159 views
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R... by ShapeBlue
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
Setting Up Your First CloudStack Environment with Beginners Challenges - MD R...
ShapeBlue173 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue297 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue221 views

Raising the Tides: Open Source Analytics for Data Science

  • 1. Raising the Tides: Open Source Analytics for Data Science Wes McKinney @wesmckinn Newsweek AI & Data Science Conference – Capital Markets, 2 March 2017
  • 3. Important Legal Information • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
  • 4. In the next 20 minutes ∞ Important trends in the industry ∞ Two Sigma involvement in open source ∞ Growing the community
  • 6. Industry giants open source core AI and machine learning technology
  • 7. Open source “disruption” in data science languages and supporting technologies
  • 8. Observation #1: User Mindshare is a Key Asset
  • 9. Observation #2: Tools may be less important than human capital and data
  • 10. Two Sigma Building a state-of-the-art, collaborative data science platform
  • 11. Scaling data science in many dimensions ∞ Access to diverse data sets
  • 12. Scaling data science in many dimensions ∞ Access to diverse data sets ∞ Enhancing individual productivity
  • 13. Scaling data science in many dimensions ∞ Access to diverse data sets ∞ Enhancing individual productivity ∞ Computational capabilities: larger and more complex data sets
  • 14. Scaling data science in many dimensions ∞ Access to diverse data sets ∞ Enhancing individual productivity ∞ Computational capabilities: larger and more complex data sets ∞ Collaboration within and across teams
  • 15. TOOLS AND THE “DATA SCIENTIST SHORTAGE”
  • 16. WHY WE PARTICIPATE IN OPEN SOURCE
  • 17. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies
  • 18. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems
  • 19. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems 3. Raise awareness of problems faced at scale on real world data
  • 20. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems 3. Raise awareness of problems faced at scale on real world data 4. Benefit sooner from open source innovations
  • 21. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems 3. Raise awareness of problems faced at scale on real world data 4. Benefit sooner from open source innovations 5. Attract and retain the best engineering talent
  • 22. Where we are investing: Collaboration and Publishing; Cluster Resource Management; Scalable / Distributed Computing; High Performance Data Processing
  • 23. Core data infrastructure technologies. Apache Arrow: • Efficient columnar in-memory data processing • High-speed, interoperable data messaging for Java, C++, Python. Apache Parquet: • Industry-standard columnar file format for distributed storage • Efficient IO for Spark, Python, etc.
  • 24. Open source in-memory and distributed analytics • Popular Python analytics library • Powerful and easy-to-use data cleaning, analytics, and time series processing • Flint: scalable time series analytics for Spark • Enhanced Python integration
  • 25. Cluster resource management • Scalable cluster resource manager • Native container support • Cook: fair job scheduler for Mesos • Managing multi-tenant Spark clusters
  • 26. Collaboration and publishing • Notebook “kernels” for polyglot research and development • Inter-language data exchange • Leading web notebook & reproducible research development platform • Interactive widgets framework
  • 27. TOWARD HIGH TIDE: Preserving competitive advantage and building common knowledge

Editor's Notes

  1. Who am I? Software Architect at Two Sigma Investments. Creator of the Python pandas project and contributor to many other open source tools related to the field of data science.
  2. For Nick: title could be “In the next 18 minutes”
  3. Nick: Title change to: New news in open source
  4. Design: logo wall? Background image + logos? Open source data science disruption, trends. Industry giants releasing core machine learning / AI technology: Google / Facebook / Microsoft / Amazon / Baidu
  5. Design: as above
  6. Design: Full bleed image - By virtue of developers
  7. Design: Full bleed image - By virtue of developers
  8. Design: our building full bleed?
  9. Design: Maybe split into 4 pages?
  10. Design: Maybe split into 4 pages?
  11. Design: Maybe split into 4 pages?
  12. Design: Maybe split into 4 pages?
  13. Design: Full bleed - By virtue of developers
  14. Design: Full bleed - By virtue of developers
  15. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  16. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  17. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  18. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  19. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  20. Design: perhaps section page. Areas of focus: in-memory analytics, collaboration, distributed computing, cluster resource management
  21. Design: for discussion! Areas of focus: in-memory analytics, collaboration, distributed computing, cluster resource management
  22. Areas of focus: in-memory analytics, collaboration, distributed computing, cluster resource management
  23. Areas of focus: in-memory analytics, collaboration, distributed computing, cluster resource management
  24. Areas of focus: in-memory analytics, collaboration, distributed computing, cluster resource management
  25. Design: Full bleed - By virtue of developers
  26. Design: Close on logo? Nick: End slide to link back to title? Preserve competitive advantage AND build common progress = raising the tides.