Raising the Tides:
Open Source Analytics for
Data Science
Wes McKinney @wesmckinn
NEWSWEEK AI & DATA SCIENCE CONFERENCE – CAPITAL MARKETS
2 MARCH 2017
Me
Important Legal Information
• The information presented here is offered for informational purposes only and
should not be used for any other purpose (including, without limitation, the
making of investment decisions). Examples provided herein are for illustrative
purposes only and are not necessarily based on actual data. Nothing herein
constitutes: an offer to sell or the solicitation of any offer to buy any security or
other interest; tax advice; or investment advice. This presentation shall remain
the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma
reserves the right to require the return of this presentation at any time.
• Some of the images, logos or other material used herein may be protected by
copyright and/or trademark. If so, such copyrights and/or trademarks are most
likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or
trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two
Sigma, nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
In the next 20 minutes
∞ Important trends in the industry
∞ Two Sigma involvement in open source
∞ Growing the community
WHAT I’M SEEING TODAY
Industry giants open source core AI
and machine learning technology
Open source “disruption” in data science
languages and supporting technologies
Observation #1:
User Mindshare is a Key Asset
Observation #2:
Tools may be less important than human
capital and data
Two Sigma
Building a state-of-the-art, collaborative
data science platform
Scaling data science in many dimensions
∞ Access to diverse data sets
∞ Enhancing individual productivity
∞ Computational capabilities: larger and more
complex data sets
∞ Collaboration within and across teams
TOOLS AND THE
“DATA SCIENTIST SHORTAGE”
WHY WE PARTICIPATE
IN OPEN SOURCE
Why we participate in Open Source
1. Drive progress and innovation in foundational
technologies
2. Increase the overall value, interoperability, and
sustainability of our closed source systems
3. Raise awareness of problems faced at scale on real
world data
4. Benefit sooner from open source innovations
5. Attract and retain the best engineering talent
Where we are investing
• Collaboration and Publishing
• Cluster Resource Management
• Scalable / Distributed Computing
• High Performance Data Processing
Core data infrastructure technologies
Apache Arrow
• Efficient columnar in-memory data processing
• High-speed, interoperable data messaging for Java, C++, Python
Apache Parquet
• Industry-standard columnar file format for distributed storage
• Efficient IO for Spark, Python, etc.
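
To make "columnar" concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's or Parquet's actual format or API; the record fields and the `to_columns` helper are hypothetical and exist only to show the idea of storing each field in its own contiguous, typed buffer.

```python
from array import array

# Hypothetical sample records, stored row by row (one dict per record).
rows = [
    {"symbol": "AAA", "price": 10.0, "volume": 100},
    {"symbol": "BBB", "price": 20.5, "volume": 250},
    {"symbol": "CCC", "price": 30.25, "volume": 50},
]


def to_columns(records):
    """Pivot row-oriented records into one sequence per field."""
    return {
        "symbol": [r["symbol"] for r in records],
        "price": array("d", (r["price"] for r in records)),    # packed float64
        "volume": array("q", (r["volume"] for r in records)),  # packed int64
    }


columns = to_columns(rows)

# A scan over a single field touches only that column's buffer.
total_volume = sum(columns["volume"])
mean_price = sum(columns["price"]) / len(columns["price"])
```

An aggregate over one field reads one buffer and skips the rest of each record, which is broadly why columnar layouts suit analytic scans.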
Open source in-memory and distributed
analytics
• Popular Python analytics library
• Powerful and easy-to-use data cleaning, analytics, and time series processing
• Flint: scalable time series analytics for Spark
• Enhanced Python integration
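
As one small illustration of time series data cleaning, the sketch below forward-fills missing observations using only the Python standard library. It is an assumption-laden toy, not code from any of the libraries on this slide: the `forward_fill` helper and the price data are hypothetical. (pandas offers a comparable operation through its `ffill` method.)

```python
from datetime import date


def forward_fill(series):
    """Replace None with the most recent earlier observation, if any."""
    filled, last = [], None
    for stamp, value in series:
        if value is not None:
            last = value
        filled.append((stamp, last))
    return filled


# Hypothetical daily closing prices with two missing observations.
prices = [
    (date(2017, 2, 27), 101.5),
    (date(2017, 2, 28), None),
    (date(2017, 3, 1), 102.0),
    (date(2017, 3, 2), None),
]
clean = forward_fill(prices)
```

A leading gap stays `None` because there is no earlier value to carry forward.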
Cluster resource management
• Scalable cluster resource manager
• Native container support
Cook
• Fair job scheduler for Mesos
• Managing multi-tenant Spark clusters
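
The idea of fair job scheduling can be sketched in a few lines of standard-library Python. This is a generic toy, not Cook's or Mesos's actual algorithm or API: the `fair_order` function, the user names, and the job costs are all hypothetical. It always dispatches the next job from whichever user currently has the lowest usage.

```python
from collections import deque


def fair_order(pending, running=None):
    """Return a dispatch order that always serves the user with the
    lowest current usage next (ties broken by user name)."""
    usage = dict(running or {})
    queues = {user: deque(jobs) for user, jobs in pending.items()}
    order = []
    while any(queues.values()):
        # Only users that still have pending jobs are candidates.
        user = min(
            (u for u, q in queues.items() if q),
            key=lambda u: (usage.get(u, 0), u),
        )
        job, cost = queues[user].popleft()
        usage[user] = usage.get(user, 0) + cost
        order.append((user, job))
    return order


# Hypothetical pending jobs as (name, cost) pairs per user.
pending = {
    "alice": [("a1", 4), ("a2", 4), ("a3", 4)],
    "bob": [("b1", 2), ("b2", 2)],
}
schedule = fair_order(pending)
```

Passing a `running` mapping such as `{"alice": 10}` makes the sketch favor other users until their usage catches up, which is the multi-tenant behavior a fair scheduler aims for.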
Collaboration and publishing
• Notebook “kernels” for polyglot research and development
• Inter-language data exchange
• Leading web notebook & reproducible research development platform
• Interactive widgets framework
TOWARD HIGH TIDE:
Preserving competitive advantage and
building common knowledge
Thank you

Editor's Notes

  • #3 Who am I? Software Architect at Two Sigma Investments. Creator of the Python pandas project and contributor to many other open source tools related to the field of data science.
  • #5 For Nick: title could be “In the next 18 minutes”
  • #6 Nick: Title change to: New news in open source
  • #7 Design: logo wall? Background image + logos? Open Source Data Science disruption, trends. Industry giants releasing core machine learning / AI technology: Google / Facebook / Microsoft / Amazon / Baidu
  • #8 Design: as above
  • #9 Design: Full bleed image - By virtue of developers
  • #10 Design: Full bleed image - By virtue of developers
  • #11 Design: our building full bleed?
  • #12 Design: Maybe split into 4 pages?
  • #13 Design: Maybe split into 4 pages?
  • #14 Design: Maybe split into 4 pages?
  • #15 Design: Maybe split into 4 pages?
  • #16 Design: Full bleed - By virtue of developers
  • #17 Design: Full bleed - By virtue of developers
  • #18 Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  • #19 Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  • #20 Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  • #21 Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  • #22 Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  • #23 Design: perhaps section page. Areas of focus: In-memory analytics, Collaboration, Distributed computing, Cluster resource management
  • #24 Design: for discussion! Areas of focus: In-memory analytics, Collaboration, Distributed computing, Cluster resource management
  • #25 Areas of focus: In-memory analytics, Collaboration, Distributed computing, Cluster resource management
  • #26 Areas of focus: In-memory analytics, Collaboration, Distributed computing, Cluster resource management
  • #27 Areas of focus: In-memory analytics, Collaboration, Distributed computing, Cluster resource management
  • #28 Design: Full bleed - By virtue of developers
  • #29 Design: Close on logo? Nick: End slide to link back to title?? Preserve comp advantage AND build common progress = raising the tides.