SlideShare a Scribd company logo
1 of 28
Raising the Tides:
Open Source Analytics for
Data Science
Wes McKinney @wesmckinn
N E W S W E E K A I & D A T A S C I E N C E
C O N F E R E N C E – C A P I T A L M A R K E T S
2 M A R C H 2 0 1 7
Wes McKinney @wesmckinn
Me
Wes McKinney @wesmckinn
Important Legal Information
• The information presented here is offered for informational purposes only and
should not be used for any other purpose (including, without limitation, the
making of investment decisions). Examples provided herein are for illustrative
purposes only and are not necessarily based on actual data. Nothing herein
constitutes: an offer to sell or the solicitation of any offer to buy any security or
other interest; tax advice; or investment advice. This presentation shall remain
the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma
reserves the right to require the return of this presentation at any time.
• Some of the images, logos or other material used herein may be protected by
copyright and/or trademark. If so, such copyrights and/or trademarks are most
likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or
trademark laws. Use of such image, copyright or trademark does not imply any
association with such organization (or endorsement of such organization) by Two
Sigma, nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
Wes McKinney @wesmckinn
In the next 20 minutes
∞ Important trends in the industry
∞ Two Sigma involvement in open source
∞ Growing the community
WHAT I’M SEEING TODAY
Wes McKinney @wesmckinn
Industry giants open source core AI
and machine learning technology
Wes McKinney @wesmckinn
Open source “disruption” in data science
languages and supporting technologies
Wes McKinney @wesmckinn
Observation #1:
User Mindshare is a Key Asset
Wes McKinney @wesmckinn
Observation #2:
Tools may be less important than human
capital and data
Wes McKinney @wesmckinn
Two Sigma
Building a state-of-the-art, collaborative
data science platform
Wes McKinney @wesmckinn
Scaling data science in many dimensions
∞ Access to diverse data sets
Wes McKinney @wesmckinn
Scaling data science in many dimensions
∞ Access to diverse data sets
∞ Enhancing individual productivity
Wes McKinney @wesmckinn
Scaling data science in many dimensions
∞ Access to diverse data sets
∞ Enhancing individual productivity
∞ Computational capabilities: larger and more
complex data sets
Wes McKinney @wesmckinn
Scaling data science in many dimensions
∞ Access to diverse data sets
∞ Enhancing individual productivity
∞ Computational capabilities: larger and more
complex data sets
∞ Collaboration within and across teams
TOOLS AND THE
“DATA SCIENTIST SHORTAGE”
WHY WE PARTICIPATE
IN OPEN SOURCE
Wes McKinney @wesmckinn
Why we participate in Open Source
1. Drive progress and innovation in foundational
technologies
Wes McKinney @wesmckinn
Why we participate in Open Source
1. Drive progress and innovation in foundational
technologies
2. Increase the overall value, interoperability, and
sustainability of our closed source systems
Wes McKinney @wesmckinn
Why we participate in Open Source
1. Drive progress and innovation in foundational
technologies
2. Increase the overall value, interoperability, and
sustainability of our closed source systems
3. Raise awareness of problems faced at scale on real
world data
Wes McKinney @wesmckinn
Why we participate in Open Source
1. Drive progress and innovation in foundational
technologies
2. Increase the overall value, interoperability, and
sustainability of our closed source systems
3. Raise awareness of problems faced at scale on real
world data
4. Benefit sooner from open source innovations
Wes McKinney @wesmckinn
Why we participate in Open Source
1. Drive progress and innovation in foundational
technologies
2. Increase the overall value, interoperability, and
sustainability of our closed source systems
3. Raise awareness of problems faced at scale on real
world data
4. Benefit sooner from open source innovations
5. Attract and retain the best engineering talent
Wes McKinney @wesmckinn
Where we are investing
Collaboration
and Publishing
Cluster Resource
Management Scalable / Distributed
Computing
High Performance
Data Processing
Wes McKinney @wesmckinn
Core data infrastructure technologies
Apache
Arrow
Apache
Parquet
• Efficient columnar in-
memory data processing
• High-speed, interoperable
data messaging for Java,
C++, Python
• Industry-standard columnar
file format for distributed
storage
• Efficient IO for Spark, Python,
etc.
Wes McKinney @wesmckinn
Open source in-memory and distributed
analytics
• Popular Python analytics
library
• Powerful and easy-to-use
data cleaning, analytics, and
time series processing
• Flint: scalable time series
analytics for Spark
• Enhanced Python
integration
Wes McKinney @wesmckinn
Cluster resource management
• Scalable cluster resource
manager
• Native container support
• Fair job scheduler for Mesos
• Managing multi-tenant Spark
clusters
cook
Wes McKinney @wesmckinn
Collaboration and publishing
• Notebook “kernels” for
polyglot research and
development
• Inter-language data exchange
• Leading web notebook &
reproducible research
development platform
• Interactive widgets framework
TOWARD HIGH TIDE:
Preserving competitive advantage and
building common knowledge
Thank you
Wes McKinney @wesmckinn

More Related Content

What's hot

Big, small or just complex data?
Big, small or just complex data?Big, small or just complex data?
Big, small or just complex data?panoratio
 
Data Science Popup Austin: Data Meet Product
Data Science Popup Austin: Data Meet Product Data Science Popup Austin: Data Meet Product
Data Science Popup Austin: Data Meet Product Domino Data Lab
 
Big Data Maturity and its Evolution
Big Data Maturity and its EvolutionBig Data Maturity and its Evolution
Big Data Maturity and its EvolutionSriram Murali K J
 
DMTI Spatial Location Hub Analytics: big data, analytics, visualization
DMTI Spatial Location Hub Analytics: big data, analytics, visualizationDMTI Spatial Location Hub Analytics: big data, analytics, visualization
DMTI Spatial Location Hub Analytics: big data, analytics, visualizationDMTI Spatial
 
Big data high performance computing commenting
Big data   high performance computing commentingBig data   high performance computing commenting
Big data high performance computing commentingIntel IT Center
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIMC Institute
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance Qubole
 
High Performance Computing and Big Data: The coming wave
High Performance Computing and Big Data: The coming waveHigh Performance Computing and Big Data: The coming wave
High Performance Computing and Big Data: The coming waveIntel IT Center
 
Hans Henseler - Intelligent data analysis for improving public security - Da...
Hans Henseler - Intelligent data analysis for improving public security -  Da...Hans Henseler - Intelligent data analysis for improving public security -  Da...
Hans Henseler - Intelligent data analysis for improving public security - Da...DataValueTalk
 
7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome ThemQubole
 
6 levels of big data analytics applications
6 levels of big data analytics applications6 levels of big data analytics applications
6 levels of big data analytics applicationspanoratio
 
Webinar: How to Design Primary Storage for GDPR
Webinar: How to Design Primary Storage for GDPRWebinar: How to Design Primary Storage for GDPR
Webinar: How to Design Primary Storage for GDPRStorage Switzerland
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)Prof. Dr. Diego Kuonen
 
Big data landscape v 3.0 - Matt Turck (FirstMark)
Big data landscape v 3.0 - Matt Turck (FirstMark) Big data landscape v 3.0 - Matt Turck (FirstMark)
Big data landscape v 3.0 - Matt Turck (FirstMark) Matt Turck
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaR A Akerkar
 
Strategic Planning For Government Data Centers Presentation
Strategic Planning For Government Data Centers PresentationStrategic Planning For Government Data Centers Presentation
Strategic Planning For Government Data Centers Presentationgjohnsonatitmg
 
Logical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsLogical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsDenodo
 
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.Jari Koister
 

What's hot (20)

Big, small or just complex data?
Big, small or just complex data?Big, small or just complex data?
Big, small or just complex data?
 
Data Science Popup Austin: Data Meet Product
Data Science Popup Austin: Data Meet Product Data Science Popup Austin: Data Meet Product
Data Science Popup Austin: Data Meet Product
 
Big Data Maturity and its Evolution
Big Data Maturity and its EvolutionBig Data Maturity and its Evolution
Big Data Maturity and its Evolution
 
Study: #Big Data in #Austria
Study: #Big Data in #AustriaStudy: #Big Data in #Austria
Study: #Big Data in #Austria
 
DMTI Spatial Location Hub Analytics: big data, analytics, visualization
DMTI Spatial Location Hub Analytics: big data, analytics, visualizationDMTI Spatial Location Hub Analytics: big data, analytics, visualization
DMTI Spatial Location Hub Analytics: big data, analytics, visualization
 
Big data high performance computing commenting
Big data   high performance computing commentingBig data   high performance computing commenting
Big data high performance computing commenting
 
Introduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data ScienceIntroduction to Data Mining, Business Intelligence and Data Science
Introduction to Data Mining, Business Intelligence and Data Science
 
5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance 5 Factors Impacting Your Big Data Project's Performance
5 Factors Impacting Your Big Data Project's Performance
 
High Performance Computing and Big Data: The coming wave
High Performance Computing and Big Data: The coming waveHigh Performance Computing and Big Data: The coming wave
High Performance Computing and Big Data: The coming wave
 
Hans Henseler - Intelligent data analysis for improving public security - Da...
Hans Henseler - Intelligent data analysis for improving public security -  Da...Hans Henseler - Intelligent data analysis for improving public security -  Da...
Hans Henseler - Intelligent data analysis for improving public security - Da...
 
7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them7 Big Data Challenges and How to Overcome Them
7 Big Data Challenges and How to Overcome Them
 
Using Big Data Analytics
Using Big Data AnalyticsUsing Big Data Analytics
Using Big Data Analytics
 
6 levels of big data analytics applications
6 levels of big data analytics applications6 levels of big data analytics applications
6 levels of big data analytics applications
 
Webinar: How to Design Primary Storage for GDPR
Webinar: How to Design Primary Storage for GDPRWebinar: How to Design Primary Storage for GDPR
Webinar: How to Design Primary Storage for GDPR
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Big data landscape v 3.0 - Matt Turck (FirstMark)
Big data landscape v 3.0 - Matt Turck (FirstMark) Big data landscape v 3.0 - Matt Turck (FirstMark)
Big data landscape v 3.0 - Matt Turck (FirstMark)
 
Big Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social MediaBig Data and Harvesting Data from Social Media
Big Data and Harvesting Data from Social Media
 
Strategic Planning For Government Data Centers Presentation
Strategic Planning For Government Data Centers PresentationStrategic Planning For Government Data Centers Presentation
Strategic Planning For Government Data Centers Presentation
 
Logical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsLogical Data Fabric: Architectural Components
Logical Data Fabric: Architectural Components
 
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
Talk at IEEE Big Data/Cloud conference in Santa Clara, June 28th, 2013.
 

Viewers also liked

Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 KeynoteWes McKinney
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latestWes McKinney
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FutureWes McKinney
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyWes McKinney
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for QuantsWes McKinney
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)Wes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Wes McKinney
 
Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageWes McKinney
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 

Viewers also liked (20)

Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 
DataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
 
pandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
Productive Data Tools for Quants
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
 
My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Data Tools and the Data Scientist Shortage
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
 
DataFrames: The Extended Cut
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended Cut
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 

Similar to Raising the Tides: Open Source Analytics for Data Science

Elastic's recommendation on keeping services up and running with real-time vi...
Elastic's recommendation on keeping services up and running with real-time vi...Elastic's recommendation on keeping services up and running with real-time vi...
Elastic's recommendation on keeping services up and running with real-time vi...FaithWestdorp
 
PPT-How to Build Your PA on a Strong Data Foundation
PPT-How to Build Your PA on a Strong Data Foundation PPT-How to Build Your PA on a Strong Data Foundation
PPT-How to Build Your PA on a Strong Data Foundation Natasha Ramdial - Roopnarine
 
Understanding What’s Possible: Getting Business Value from Big Data Quickly
Understanding What’s Possible: Getting Business Value from Big Data QuicklyUnderstanding What’s Possible: Getting Business Value from Big Data Quickly
Understanding What’s Possible: Getting Business Value from Big Data QuicklyInside Analysis
 
Environmental Big Data: Business Perspective
Environmental Big Data: Business PerspectiveEnvironmental Big Data: Business Perspective
Environmental Big Data: Business PerspectiveCLEEN_Ltd
 
Big Data Innovation
Big Data InnovationBig Data Innovation
Big Data Innovationpaul.hawking
 
Big Data Brown Bag
Big Data Brown BagBig Data Brown Bag
Big Data Brown Bagusmanqureshi
 
Virtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government Insights
Virtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government InsightsVirtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government Insights
Virtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government InsightsSplunk
 
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"MDS ap
 
An Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data AnalyticsAn Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data AnalyticsAudrey Britton
 
Data Virtualization - Enabling Next Generation Analytics
Data Virtualization - Enabling Next Generation AnalyticsData Virtualization - Enabling Next Generation Analytics
Data Virtualization - Enabling Next Generation AnalyticsDenodo
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?DATAVERSITY
 
How I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataHow I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataDomino Data Lab
 
NTXISSACSC3 - Security at the Point of Storage by Todd Barton
NTXISSACSC3 - Security at the Point of Storage by Todd BartonNTXISSACSC3 - Security at the Point of Storage by Todd Barton
NTXISSACSC3 - Security at the Point of Storage by Todd BartonNorth Texas Chapter of the ISSA
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science TeamsEMC
 
Is Your Company's Data Secure? Shelley Vinson Helfer
Is Your Company's Data Secure? Shelley Vinson HelferIs Your Company's Data Secure? Shelley Vinson Helfer
Is Your Company's Data Secure? Shelley Vinson HelferMAX Technical Training
 
Accrete.AI Product Flyer 1.26.18
Accrete.AI Product Flyer 1.26.18Accrete.AI Product Flyer 1.26.18
Accrete.AI Product Flyer 1.26.18Prashant Bhuyan
 
A Dynamic Data Catalog for Autonomy and Self-Service
A Dynamic Data Catalog for Autonomy and Self-ServiceA Dynamic Data Catalog for Autonomy and Self-Service
A Dynamic Data Catalog for Autonomy and Self-ServiceDenodo
 
Noise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in DataNoise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in DataDATAVERSITY
 
RWDG Slides: Data and Metadata Will Not Govern Themselves
RWDG Slides: Data and Metadata Will Not Govern ThemselvesRWDG Slides: Data and Metadata Will Not Govern Themselves
RWDG Slides: Data and Metadata Will Not Govern ThemselvesDATAVERSITY
 
Big Data LDN 2017: Become an Information-driven Organisation With Cognitive S...
Big Data LDN 2017: Become an Information-driven Organisation With Cognitive S...Big Data LDN 2017: Become an Information-driven Organisation With Cognitive S...
Big Data LDN 2017: Become an Information-driven Organisation With Cognitive S...Matt Stubbs
 

Similar to Raising the Tides: Open Source Analytics for Data Science (20)

Elastic's recommendation on keeping services up and running with real-time vi...
Elastic's recommendation on keeping services up and running with real-time vi...Elastic's recommendation on keeping services up and running with real-time vi...
Elastic's recommendation on keeping services up and running with real-time vi...
 
PPT-How to Build Your PA on a Strong Data Foundation
PPT-How to Build Your PA on a Strong Data Foundation PPT-How to Build Your PA on a Strong Data Foundation
PPT-How to Build Your PA on a Strong Data Foundation
 
Understanding What’s Possible: Getting Business Value from Big Data Quickly
Understanding What’s Possible: Getting Business Value from Big Data QuicklyUnderstanding What’s Possible: Getting Business Value from Big Data Quickly
Understanding What’s Possible: Getting Business Value from Big Data Quickly
 
Environmental Big Data: Business Perspective
Environmental Big Data: Business PerspectiveEnvironmental Big Data: Business Perspective
Environmental Big Data: Business Perspective
 
Big Data Innovation
Big Data InnovationBig Data Innovation
Big Data Innovation
 
Big Data Brown Bag
Big Data Brown BagBig Data Brown Bag
Big Data Brown Bag
 
Virtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government Insights
Virtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government InsightsVirtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government Insights
Virtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government Insights
 
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
SAP Forum Ankara 2017 - "Verinin Merkezine Seyahat"
 
An Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data AnalyticsAn Encyclopedic Overview Of Big Data Analytics
An Encyclopedic Overview Of Big Data Analytics
 
Data Virtualization - Enabling Next Generation Analytics
Data Virtualization - Enabling Next Generation AnalyticsData Virtualization - Enabling Next Generation Analytics
Data Virtualization - Enabling Next Generation Analytics
 
Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?Data Catalogs Are the Answer – What is the Question?
Data Catalogs Are the Answer – What is the Question?
 
How I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked DataHow I Learned to Stop Worrying and Love Linked Data
How I Learned to Stop Worrying and Love Linked Data
 
NTXISSACSC3 - Security at the Point of Storage by Todd Barton
NTXISSACSC3 - Security at the Point of Storage by Todd BartonNTXISSACSC3 - Security at the Point of Storage by Todd Barton
NTXISSACSC3 - Security at the Point of Storage by Todd Barton
 
Building Data Science Teams
Building Data Science TeamsBuilding Data Science Teams
Building Data Science Teams
 
Is Your Company's Data Secure? Shelley Vinson Helfer
Is Your Company's Data Secure? Shelley Vinson HelferIs Your Company's Data Secure? Shelley Vinson Helfer
Is Your Company's Data Secure? Shelley Vinson Helfer
 
Accrete.AI Product Flyer 1.26.18
Accrete.AI Product Flyer 1.26.18Accrete.AI Product Flyer 1.26.18
Accrete.AI Product Flyer 1.26.18
 
A Dynamic Data Catalog for Autonomy and Self-Service
A Dynamic Data Catalog for Autonomy and Self-ServiceA Dynamic Data Catalog for Autonomy and Self-Service
A Dynamic Data Catalog for Autonomy and Self-Service
 
Noise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in DataNoise to Signal - The Biggest Problem in Data
Noise to Signal - The Biggest Problem in Data
 
RWDG Slides: Data and Metadata Will Not Govern Themselves
RWDG Slides: Data and Metadata Will Not Govern ThemselvesRWDG Slides: Data and Metadata Will Not Govern Themselves
RWDG Slides: Data and Metadata Will Not Govern Themselves
 
Big Data LDN 2017: Become an Information-driven Organisation With Cognitive S...
Big Data LDN 2017: Become an Information-driven Organisation With Cognitive S...Big Data LDN 2017: Become an Information-driven Organisation With Cognitive S...
Big Data LDN 2017: Become an Information-driven Organisation With Cognitive S...
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 

More from Wes McKinney (17)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Raising the Tides: Open Source Analytics for Data Science

  • 1. Raising the Tides: Open Source Analytics for Data Science Wes McKinney @wesmckinn N E W S W E E K A I & D A T A S C I E N C E C O N F E R E N C E – C A P I T A L M A R K E T S 2 M A R C H 2 0 1 7
  • 3. Wes McKinney @wesmckinn Important Legal Information • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
  • 4. Wes McKinney @wesmckinn In the next 20 minutes ∞ Important trends in the industry ∞ Two Sigma involvement in open source ∞ Growing the community
  • 6. Wes McKinney @wesmckinn Industry giants open source core AI and machine learning technology
  • 7. Wes McKinney @wesmckinn Open source “disruption” in data science languages and supporting technologies
  • 8. Wes McKinney @wesmckinn Observation #1: User Mindshare is a Key Asset
  • 9. Wes McKinney @wesmckinn Observation #2: Tools may be less important than human capital and data
  • 10. Wes McKinney @wesmckinn Two Sigma Building a state-of-the-art, collaborative data science platform
  • 11. Wes McKinney @wesmckinn Scaling data science in many dimensions ∞ Access to diverse data sets
  • 12. Wes McKinney @wesmckinn Scaling data science in many dimensions ∞ Access to diverse data sets ∞ Enhancing individual productivity
  • 13. Wes McKinney @wesmckinn Scaling data science in many dimensions ∞ Access to diverse data sets ∞ Enhancing individual productivity ∞ Computational capabilities: larger and more complex data sets
  • 14. Wes McKinney @wesmckinn Scaling data science in many dimensions ∞ Access to diverse data sets ∞ Enhancing individual productivity ∞ Computational capabilities: larger and more complex data sets ∞ Collaboration within and across teams
  • 15. TOOLS AND THE “DATA SCIENTIST SHORTAGE”
  • 16. WHY WE PARTICIPATE IN OPEN SOURCE
  • 17. Wes McKinney @wesmckinn Why we participate in Open Source 1. Drive progress and innovation in foundational technologies
  • 18. Wes McKinney @wesmckinn Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems
  • 19. Wes McKinney @wesmckinn Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems 3. Raise awareness of problems faced at scale on real world data
  • 20. Wes McKinney @wesmckinn Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems 3. Raise awareness of problems faced at scale on real world data 4. Benefit sooner from open source innovations
  • 21. Wes McKinney @wesmckinn Why we participate in Open Source 1. Drive progress and innovation in foundational technologies 2. Increase the overall value, interoperability, and sustainability of our closed source systems 3. Raise awareness of problems faced at scale on real world data 4. Benefit sooner from open source innovations 5. Attract and retain the best engineering talent
  • 22. Wes McKinney @wesmckinn Where we are investing Collaboration and Publishing Cluster Resource Management Scalable / Distributed Computing High Performance Data Processing
  • 23. Wes McKinney @wesmckinn Core data infrastructure technologies Apache Arrow Apache Parquet • Efficient columnar in- memory data processing • High-speed, interoperable data messaging for Java, C++, Python • Industry-standard columnar file format for distributed storage • Efficient IO for Spark, Python, etc.
  • 24. Wes McKinney @wesmckinn Open source in-memory and distributed analytics • Popular Python analytics library • Powerful and easy-to-use data cleaning, analytics, and time series processing • Flint: scalable time series analytics for Spark • Enhanced Python integration
  • 25. Wes McKinney @wesmckinn Cluster resource management • Scalable cluster resource manager • Native container support • Fair job scheduler for Mesos • Managing multi-tenant Spark clusters cook
  • 26. Wes McKinney @wesmckinn Collaboration and publishing • Notebook “kernels” for polyglot research and development • Inter-language data exchange • Leading web notebook & reproducible research development platform • Interactive widgets framework
  • 27. TOWARD HIGH TIDE: Preserving competitive advantage and building common knowledge

Editor's Notes

  1. Who am I? Software Architect at Two Sigma Investments Creator of Python pandas project and contributor to many other open source tools related to the field of data science
  2. For Nick: title could be “In the next 18 minutes”
  3. Nick: Title change to: New news in open source
  4. Design: logo wall? Background image + logos? Open Source Data Science disruption, trends Industry giants releasing core machine learning / AI technology Google / Facebook / Microsoft / Amazon / Baidu
  5. Design: as above
  6. Design: Full bleed image - By virtue of developers
  7. Design: Full bleed image - By virtue of developers
  8. Design: our building full bleed?
  9. Design: Maybe split into 4 pages?
  10. Design: Maybe split into 4 pages?
  11. Design: Maybe split into 4 pages?
  12. Design: Maybe split into 4 pages?
  13. Design: Full bleed - By virtue of developers
  14. Design: Full bleed - By virtue of developers
  15. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  16. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  17. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  18. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  19. Design: split into 2: 1. Why we participate in open source (section divider) 2. “1. Drive progress…” (full bleed image)
  20. Design: perhaps section page Areas of focus In-memory analytics Collaboration Distributed computing Cluster resource management
  21. Design: for discussion! Areas of focus In-memory analytics Collaboration Distributed computing Cluster resource management
  22. Areas of focus In-memory analytics Collaboration Distributed computing Cluster resource management
  23. Areas of focus In-memory analytics Collaboration Distributed computing Cluster resource management
  24. Areas of focus In-memory analytics Collaboration Distributed computing Cluster resource management
  25. Design: Full bleed - By virtue of developers
  26. Design: Close on logo? Nick: End slide to link back to title?? Preserve comp advantage AND build common progress = raising the tides.