Raising the Tides: Open Source Analytics for Data Science

Description

Given March 2, 2017 at Newsweek AI/Data Science in Capital Markets conference

Transcript

  1. Raising the Tides: Open Source Analytics for Data Science Wes McKinney @wesmckinn NEWSWEEK AI & DATA SCIENCE CONFERENCE – CAPITAL MARKETS, 2 MARCH 2017
  2. Me
  3. Important Legal Information • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
  4. In the next 20 minutes • Important trends in the industry • Two Sigma involvement in open source • Growing the community
  5. WHAT I’M SEEING TODAY
  6. Industry giants open source core AI and machine learning technology
  7. Open source “disruption” in data science languages and supporting technologies
  8. Observation #1: User Mindshare is a Key Asset
  9. Observation #2: Tools may be less important than human capital and data
  10. Two Sigma: Building a state-of-the-art, collaborative data science platform
  11. Scaling data science in many dimensions • Access to diverse data sets
  12. Scaling data science in many dimensions • Access to diverse data sets • Enhancing individual productivity
  13. Scaling data science in many dimensions • Access to diverse data sets • Enhancing individual productivity • Computational capabilities: larger and more complex data sets
  14. Scaling data science in many dimensions • Access to diverse data sets • Enhancing individual productivity • Computational capabilities: larger and more complex data sets • Collaboration within and across teams
  15. TOOLS AND THE “DATA SCIENTIST SHORTAGE”
  16. WHY WE PARTICIPATE IN OPEN SOURCE
  17. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies
  18. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems
  19. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems 3. Raise awareness of problems faced at scale on real world data
  20. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems 3. Raise awareness of problems faced at scale on real world data 4. Benefit sooner from open source innovations
  21. Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems 3. Raise awareness of problems faced at scale on real world data 4. Benefit sooner from open source innovations 5. Attract and retain the best engineering talent
  22. Where we are investing • Collaboration and Publishing • Cluster Resource Management • Scalable / Distributed Computing • High Performance Data Processing
  23. Core data infrastructure technologies • Apache Arrow: Efficient columnar in-memory data processing; high-speed, interoperable data messaging for Java, C++, Python • Apache Parquet: Industry-standard columnar file format for distributed storage; efficient IO for Spark, Python, etc.
  24. Open source in-memory and distributed analytics • Popular Python analytics library • Powerful and easy-to-use data cleaning, analytics, and time series processing • Flint: scalable time series analytics for Spark • Enhanced Python integration
  25. Cluster resource management • Scalable cluster resource manager • Native container support • Cook: Fair job scheduler for Mesos • Managing multi-tenant Spark clusters
  26. Collaboration and publishing • Notebook “kernels” for polyglot research and development • Inter-language data exchange • Leading web notebook & reproducible research development platform • Interactive widgets framework
  27. TOWARD HIGH TIDE: Preserving competitive advantage and building common knowledge
  28. Thank you Wes McKinney @wesmckinn

Editor's Notes

  • Who am I?
    Software Architect at Two Sigma Investments
    Creator of Python pandas project and contributor to many other open source tools related to the field of data science

  • For Nick: title could be “In the next 18 minutes”
  • Nick: Title change to: New news in open source
  • Design: logo wall? Background image + logos?

    Open Source Data Science disruption, trends
    Industry giants releasing core machine learning / AI technology
    Google / Facebook / Microsoft / Amazon / Baidu

  • Design: as above
  • Design: Full bleed image

    - By virtue of developers
  • Design: Full bleed image


    - By virtue of developers
  • Design: our building full bleed?
  • Design: Maybe split into 4 pages?
  • Design: Maybe split into 4 pages?
  • Design: Maybe split into 4 pages?
  • Design: Maybe split into 4 pages?
  • Design: Full bleed

    - By virtue of developers
  • Design: Full bleed

    - By virtue of developers
  • Design: split into 2: 1. Why we participate in open source (section divider)
    2. “1. Drive progress…” (full bleed image)
  • Design: split into 2: 1. Why we participate in open source (section divider)
    2. “1. Drive progress…” (full bleed image)
  • Design: split into 2: 1. Why we participate in open source (section divider)
    2. “1. Drive progress…” (full bleed image)
  • Design: split into 2: 1. Why we participate in open source (section divider)
    2. “1. Drive progress…” (full bleed image)
  • Design: split into 2: 1. Why we participate in open source (section divider)
    2. “1. Drive progress…” (full bleed image)
  • Design: perhaps section page

    Areas of focus
    In-memory analytics
    Collaboration
    Distributed computing
    Cluster resource management
  • Design: for discussion!

    Areas of focus
    In-memory analytics
    Collaboration
    Distributed computing
    Cluster resource management
  • Areas of focus
    In-memory analytics
    Collaboration
    Distributed computing
    Cluster resource management
  • Areas of focus
    In-memory analytics
    Collaboration
    Distributed computing
    Cluster resource management
  • Areas of focus
    In-memory analytics
    Collaboration
    Distributed computing
    Cluster resource management
  • Design: Full bleed

    - By virtue of developers
  • Design: Close on logo?

    Nick: End slide to link back to title?? Preserve comp advantage AND build common progress = raising the tides.