Raising the Tides: Open Source Analytics for Data Science
Wes McKinney @wesmckinn
Newsweek AI & Data Science Conference – Capital Markets
2 March 2017
14. Scaling data science in many dimensions
• Access to diverse data sets
• Enhancing individual productivity
• Computational capabilities: larger and more complex data sets
• Collaboration within and across teams
21. Why we participate in Open Source
1. Drive progress and innovation in foundational technologies
2. Increase the overall value, interoperability, and sustainability of our closed source systems
3. Raise awareness of problems faced at scale on real-world data
4. Benefit sooner from open source innovations
5. Attract and retain the best engineering talent
22. Where we are investing
• Collaboration and Publishing
• Cluster Resource Management
• Scalable / Distributed Computing
• High Performance Data Processing
23. Core data infrastructure technologies
Apache Arrow
• Efficient columnar in-memory data processing
• High-speed, interoperable data messaging for Java, C++, Python
Apache Parquet
• Industry-standard columnar file format for distributed storage
• Efficient IO for Spark, Python, etc.
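The "columnar" idea on this slide can be illustrated with a minimal sketch. It uses only the Python standard library and is not the Arrow or Parquet API; the trade records, field names, and helper functions are hypothetical and exist only for illustration. It contrasts a row-by-row layout with a layout that keeps each field in one contiguous, typed buffer.

```python
from array import array

# Hypothetical trade records stored row by row (one dict per record).
rows = [
    {"symbol": "AAA", "price": 10.0, "size": 100},
    {"symbol": "BBB", "price": 20.5, "size": 200},
    {"symbol": "AAA", "price": 11.0, "size": 300},
]

# The same records stored column by column: each numeric field lives in one
# contiguous, homogeneously typed buffer.
columns = {
    "symbol": [r["symbol"] for r in rows],
    "price": array("d", (r["price"] for r in rows)),  # 64-bit floats
    "size": array("q", (r["size"] for r in rows)),    # 64-bit signed integers
}


def total_size_rowwise(records):
    """Sum one field by visiting every record object in turn."""
    return sum(r["size"] for r in records)


def total_size_columnar(cols):
    """Sum one field by scanning a single contiguous buffer."""
    return sum(cols["size"])


row_total = total_size_rowwise(rows)
col_total = total_size_columnar(columns)
```

Both layouts give the same answer; the columnar one only has to read the `size` buffer, which is the property that makes column-oriented analytics and file formats efficient for scans and aggregations.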
24. Open source in-memory and distributed analytics
pandas
• Popular Python analytics library
• Powerful and easy-to-use data cleaning, analytics, and time series processing
Apache Spark
• Flint: scalable time series analytics for Spark
• Enhanced Python integration
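As a rough sketch of the "data cleaning, analytics, and time series processing" bullet, assuming the Python analytics library referred to is pandas, the snippet below fills a missing observation and then aggregates to daily averages. The prices and timestamps are made up for illustration and do not come from the talk.

```python
import pandas as pd

# Hypothetical intraday prices with one missing observation.
prices = pd.Series(
    [100.0, None, 102.0, 101.0],
    index=pd.to_datetime(
        [
            "2017-03-01 09:30",
            "2017-03-01 12:00",
            "2017-03-02 09:30",
            "2017-03-02 12:00",
        ]
    ),
    name="price",
)

# Data cleaning: carry the last known price forward over the gap.
cleaned = prices.ffill()

# Time series processing: aggregate the cleaned series to one mean per day.
daily_mean = cleaned.resample("D").mean()
```

Forward-filling is only one possible cleaning choice; dropping the missing row or interpolating would be equally valid depending on the analysis.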
26. Collaboration and publishing
• Notebook “kernels” for polyglot research and development
• Inter-language data exchange
• Leading web notebook & reproducible research development platform
• Interactive widgets framework
Who am I?
Software Architect at Two Sigma Investments
Creator of the Python pandas project and contributor to many other open source data science tools
Open Source Data Science disruption, trends
Industry giants are releasing core machine learning / AI technology: Google, Facebook, Microsoft, Amazon, Baidu
- By virtue of developers
Areas of focus
In-memory analytics
Collaboration
Distributed computing
Cluster resource management
Closing, linking back to the title: preserve competitive advantage and build common progress = raising the tides.