Data Tools and the Data Scientist Shortage

Wes McKinney
Wes McKinneyDirector of Ursa Labs, Open Source Developer at Ursa Labs
1	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  Tools	
  and	
  the	
  Data	
  
Scien;st	
  Shortage	
  
Wes	
  McKinney	
  @wesmckinn	
  
Data	
  Summit	
  @	
  Web	
  Summit	
  2015-­‐11-­‐04	
  
2	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Me	
  
3	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Career	
  theme:	
  Serial	
  creator	
  of	
  data	
  tools	
  
4	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
hMps://hbr.org/2012/10/data-­‐scien;st-­‐the-­‐sexiest-­‐job-­‐of-­‐the-­‐21st-­‐century/	
  
5	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
hMp://www.bloomberg.com/news/ar;cles/2015-­‐06-­‐04/help-­‐wanted-­‐black-­‐belts-­‐in-­‐data	
  
6	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
“The	
  United	
  States	
  alone	
  faces	
  a	
  shortage	
  of	
  140,000	
  
to	
  190,000	
  people	
  with	
  analy;cal	
  exper;se	
  and	
  1.5	
  
million	
  managers	
  and	
  analysts	
  with	
  the	
  skills	
  to	
  
understand	
  and	
  make	
  decisions	
  based	
  on	
  the	
  analysis	
  
of	
  big	
  data.”	
  
	
  
McKinsey	
  &	
  Co	
  
hMp://www.mckinsey.com/features/big_data	
  
7	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Source:	
  Drew	
  Conway,	
  “The	
  Data	
  Science	
  Venn	
  Diagram”	
  
Tradi;onal	
  view	
  of	
  Data	
  Science	
  
8	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Analyzing	
  the	
  Analyzers,	
  Harris,	
  Murphy,	
  Vaisman	
  
Many	
  Kinds	
  of	
  “Data	
  People”	
  
9	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Analyzing	
  the	
  Analyzers,	
  Harris,	
  Murphy,	
  Vaisman	
  
Many	
  Kinds	
  of	
  “Data	
  People”	
  
10	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Addressing	
  the	
  analy;cal	
  shortage	
  
Educa;on	
   Culture	
   Tools	
  
11	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  process	
  
12	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
The	
  “Great	
  Decoupling”	
  for	
  Industry	
  Analy;cs	
  
UI
ComputeStorage
13	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
The	
  “Great	
  Decoupling”	
  for	
  Industry	
  Analy;cs	
  
UI
ComputeStorage
Accumula;on	
  of	
  user	
  ;me	
  
Legacy	
  technology:	
  
ver;cally-­‐integrated	
  
solu;ons	
  
14	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Ubiquitous	
  Real-­‐Time	
  Storage	
  and	
  
Compute:	
  A	
  view	
  from	
  2040	
  
15	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Data	
  analysis	
  hierarchy	
  of	
  needs	
  
Data Storage / Access
Clean Data
Analysis and Visualization
Productivity tools / UI
16	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Some	
  data	
  tooling	
  UI	
  innova;ons	
  
17	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Rejec;ng	
  the	
  “Highlander	
  Fallacy”	
  
18	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
SQL	
  Programming:	
  the	
  “mainframe	
  
punch	
  cards”	
  of	
  our	
  ;me	
  
19	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Many	
  SQL	
  engines	
  
…	
  and	
  more	
  
20	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Execu;ng	
  data	
  science	
  languages	
  in	
  the	
  compute	
  layer	
  
UI
Ibis, SQL, Spark API, …
Compute
Analytic SQL, Spark, MapReduce
Storage
HDFS, Kudu, HBase
Python,
R, Julia, …?
21	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
22	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Thank	
  you	
  
Wes	
  McKinney	
  @wesmckinn	
  
Views	
  are	
  my	
  own	
  
1 of 22

Recommended

Ibis: Scaling Python Analytics on Hadoop and Impala by
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and ImpalaWes McKinney
7.6K views33 slides
PyData: The Next Generation by
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next GenerationWes McKinney
22.2K views31 slides
Hadoop Essentials -- The What, Why and How to Meet Agency Objectives by
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesHadoop Essentials -- The What, Why and How to Meet Agency Objectives
Hadoop Essentials -- The What, Why and How to Meet Agency ObjectivesCloudera, Inc.
915 views36 slides
How to Build Continuous Ingestion for the Internet of Things by
How to Build Continuous Ingestion for the Internet of ThingsHow to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsCloudera, Inc.
3.7K views24 slides
DataFrames: The Good, Bad, and Ugly by
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and UglyWes McKinney
12.9K views24 slides
Introduction To Big Data Analytics On Hadoop - SpringPeople by
Introduction To Big Data Analytics On Hadoop - SpringPeopleIntroduction To Big Data Analytics On Hadoop - SpringPeople
Introduction To Big Data Analytics On Hadoop - SpringPeopleSpringPeople
6.9K views15 slides

More Related Content

What's hot

Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs... by
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...Databricks
731 views12 slides
Managed Cluster Services by
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
190 views29 slides
2016 Cybersecurity Analytics State of the Union by
2016 Cybersecurity Analytics State of the Union2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the UnionCloudera, Inc.
1.1K views29 slides
PyCon Singapore 2013 Keynote by
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 KeynoteWes McKinney
94.6K views19 slides
Managing the Dewey Decimal System by
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
1K views8 slides
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa... by
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...Dremio Corporation
1.1K views37 slides

What's hot(19)

Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs... by Databricks
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...
Risk Management Framework Using Intel FPGA, Apache Spark, and Persistent RDDs...
Databricks731 views
Managed Cluster Services by Adam Doyle
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
Adam Doyle190 views
2016 Cybersecurity Analytics State of the Union by Cloudera, Inc.
2016 Cybersecurity Analytics State of the Union2016 Cybersecurity Analytics State of the Union
2016 Cybersecurity Analytics State of the Union
Cloudera, Inc.1.1K views
PyCon Singapore 2013 Keynote by Wes McKinney
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
Wes McKinney94.6K views
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa... by Dremio Corporation
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation1.1K views
Webinar: Proofpoint, a pioneer in security-as-a-service protects people, info... by DataStax
Webinar: Proofpoint, a pioneer in security-as-a-service protects people, info...Webinar: Proofpoint, a pioneer in security-as-a-service protects people, info...
Webinar: Proofpoint, a pioneer in security-as-a-service protects people, info...
DataStax644 views
Deep Learning with Cloudera by Cloudera, Inc.
Deep Learning with ClouderaDeep Learning with Cloudera
Deep Learning with Cloudera
Cloudera, Inc.2.7K views
Govern This! Data Discovery and the application of data governance with new s... by Cloudera, Inc.
Govern This! Data Discovery and the application of data governance with new s...Govern This! Data Discovery and the application of data governance with new s...
Govern This! Data Discovery and the application of data governance with new s...
Cloudera, Inc.2.8K views
How to boost your datamanagement with Dremio ? by Vincent Terrasi
How to boost your datamanagement with Dremio ?How to boost your datamanagement with Dremio ?
How to boost your datamanagement with Dremio ?
Vincent Terrasi1.4K views
Introduction of Big data and Hadoop by Arohi Khandelwal
Introduction of Big data and Hadoop Introduction of Big data and Hadoop
Introduction of Big data and Hadoop
Arohi Khandelwal163 views
Seeking Cybersecurity--Strategies to Protect the Data by Cloudera, Inc.
Seeking Cybersecurity--Strategies to Protect the DataSeeking Cybersecurity--Strategies to Protect the Data
Seeking Cybersecurity--Strategies to Protect the Data
Cloudera, Inc.715 views
Available platforms for Big Data 2.0 by Petr Novotný
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0
Petr Novotný46 views
Big Data and Hadoop - key drivers, ecosystem and use cases by Jeff Kelly
Big Data and Hadoop - key drivers, ecosystem and use casesBig Data and Hadoop - key drivers, ecosystem and use cases
Big Data and Hadoop - key drivers, ecosystem and use cases
Jeff Kelly1.4K views
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis... by Spark Summit
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Spark Summit861 views
NOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science Workbench by NOVA DATASCIENCE
NOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science WorkbenchNOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science Workbench
NOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science Workbench
NOVA DATASCIENCE171 views

Similar to Data Tools and the Data Scientist Shortage

Next-Gen ML/AI Platform by
Next-Gen ML/AI PlatformNext-Gen ML/AI Platform
Next-Gen ML/AI PlatformJosh Yeh
58.1K views22 slides
The Vision & Challenge of Applied Machine Learning by
The Vision & Challenge of Applied Machine LearningThe Vision & Challenge of Applied Machine Learning
The Vision & Challenge of Applied Machine LearningCloudera, Inc.
635 views42 slides
Analytics, Everywhere. Keys to Effective Analytics and Data Discovery by
Analytics, Everywhere. Keys to Effective Analytics and Data DiscoveryAnalytics, Everywhere. Keys to Effective Analytics and Data Discovery
Analytics, Everywhere. Keys to Effective Analytics and Data DiscoveryDLT Solutions
363 views40 slides
Introducing Cloudera DataFlow (CDF) 2.13.19 by
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
4.9K views31 slides
Build a modern platform for anti-money laundering 9.19.18 by
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
1K views24 slides
Keynote: The Journey to Pervasive Analytics by
Keynote: The Journey to Pervasive AnalyticsKeynote: The Journey to Pervasive Analytics
Keynote: The Journey to Pervasive AnalyticsCloudera, Inc.
2.6K views30 slides

Similar to Data Tools and the Data Scientist Shortage(20)

Next-Gen ML/AI Platform by Josh Yeh
Next-Gen ML/AI PlatformNext-Gen ML/AI Platform
Next-Gen ML/AI Platform
Josh Yeh58.1K views
The Vision & Challenge of Applied Machine Learning by Cloudera, Inc.
The Vision & Challenge of Applied Machine LearningThe Vision & Challenge of Applied Machine Learning
The Vision & Challenge of Applied Machine Learning
Cloudera, Inc.635 views
Analytics, Everywhere. Keys to Effective Analytics and Data Discovery by DLT Solutions
Analytics, Everywhere. Keys to Effective Analytics and Data DiscoveryAnalytics, Everywhere. Keys to Effective Analytics and Data Discovery
Analytics, Everywhere. Keys to Effective Analytics and Data Discovery
DLT Solutions363 views
Introducing Cloudera DataFlow (CDF) 2.13.19 by Cloudera, Inc.
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.4.9K views
Build a modern platform for anti-money laundering 9.19.18 by Cloudera, Inc.
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.1K views
Keynote: The Journey to Pervasive Analytics by Cloudera, Inc.
Keynote: The Journey to Pervasive AnalyticsKeynote: The Journey to Pervasive Analytics
Keynote: The Journey to Pervasive Analytics
Cloudera, Inc.2.6K views
Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An... by CA Technologies
Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...
Technology Primer: Hey IT—Your Big Data Infrastructure Can’t Sit in a Silo An...
CA Technologies1.2K views
151116 Sedania Cloudera BDA Profile by Zarul Zaabah
151116 Sedania Cloudera BDA Profile151116 Sedania Cloudera BDA Profile
151116 Sedania Cloudera BDA Profile
Zarul Zaabah253 views
Edc event vienna presentation 1 oct 2019 by Cloudera, Inc.
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.4.5K views
Building Data Science Teams: A Moneyball Approach by joshwills
Building Data Science Teams: A Moneyball ApproachBuilding Data Science Teams: A Moneyball Approach
Building Data Science Teams: A Moneyball Approach
joshwills1.6K views
Big data oracle_introduccion by Fran Navarro
Big data oracle_introduccionBig data oracle_introduccion
Big data oracle_introduccion
Fran Navarro1.6K views
巨量資料入門 The evolution of data architecture by Wei-Chiu Chuang
巨量資料入門 The evolution of data architecture巨量資料入門 The evolution of data architecture
巨量資料入門 The evolution of data architecture
Wei-Chiu Chuang248 views
Stl meetup cloudera platform - january 2020 by Adam Doyle
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
Adam Doyle721 views
The 5 Biggest Data Myths in Telco: Exposed by Cloudera, Inc.
The 5 Biggest Data Myths in Telco: ExposedThe 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: Exposed
Cloudera, Inc.333 views
Optimize your cloud strategy for machine learning and analytics by Cloudera, Inc.
Optimize your cloud strategy for machine learning and analyticsOptimize your cloud strategy for machine learning and analytics
Optimize your cloud strategy for machine learning and analytics
Cloudera, Inc.867 views
Cloudera 助力台灣大數據產業的發展 by Etu Solution
Cloudera 助力台灣大數據產業的發展Cloudera 助力台灣大數據產業的發展
Cloudera 助力台灣大數據產業的發展
Etu Solution3.2K views
Creating your Center of Excellence (CoE) for data driven use cases by Frank Vullers
Creating your Center of Excellence (CoE) for data driven use casesCreating your Center of Excellence (CoE) for data driven use cases
Creating your Center of Excellence (CoE) for data driven use cases
Frank Vullers764 views
Addressing Challenges with IoT Edge Management by DataWorks Summit
Addressing Challenges with IoT Edge ManagementAddressing Challenges with IoT Edge Management
Addressing Challenges with IoT Edge Management
DataWorks Summit348 views
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera) by Spark Summit
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark in the Hadoop Ecosystem-(Mike Olson, Cloudera)
Spark Summit2.3K views
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J... by Data Con LA
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
Big Data Day LA 2016/ Use Case Driven track - How to Use Design Thinking to J...
Data Con LA924 views

More from Wes McKinney

Solving Enterprise Data Challenges with Apache Arrow by
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
1.1K views31 slides
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity by
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
1.1K views26 slides
Apache Arrow: High Performance Columnar Data Framework by
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
1.5K views53 slides
New Directions for Apache Arrow by
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
1.9K views27 slides
Apache Arrow Flight: A New Gold Standard for Data Transport by
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
2.2K views31 slides
ACM TechTalks : Apache Arrow and the Future of Data Frames by
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
2K views47 slides

More from Wes McKinney(20)

Solving Enterprise Data Challenges with Apache Arrow by Wes McKinney
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney1.1K views
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity by Wes McKinney
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney1.1K views
Apache Arrow: High Performance Columnar Data Framework by Wes McKinney
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney1.5K views
New Directions for Apache Arrow by Wes McKinney
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney1.9K views
Apache Arrow Flight: A New Gold Standard for Data Transport by Wes McKinney
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney2.2K views
ACM TechTalks : Apache Arrow and the Future of Data Frames by Wes McKinney
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney2K views
Apache Arrow: Present and Future @ ScaledML 2020 by Wes McKinney
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney970 views
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future by Wes McKinney
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney2.1K views
Apache Arrow: Leveling Up the Analytics Stack by Wes McKinney
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney1.4K views
Apache Arrow Workshop at VLDB 2019 / BOSS Session by Wes McKinney
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney2.5K views
Apache Arrow: Leveling Up the Data Science Stack by Wes McKinney
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney3.5K views
Ursa Labs and Apache Arrow in 2019 by Wes McKinney
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney4.2K views
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward" by Wes McKinney
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney1.1K views
Apache Arrow at DataEngConf Barcelona 2018 by Wes McKinney
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney2K views
Apache Arrow: Cross-language Development Platform for In-memory Data by Wes McKinney
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney6.6K views
Apache Arrow -- Cross-language development platform for in-memory data by Wes McKinney
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney2.9K views
Shared Infrastructure for Data Science by Wes McKinney
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney8.5K views
Data Science Without Borders (JupyterCon 2017) by Wes McKinney
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney6.2K views
Memory Interoperability in Analytics and Machine Learning by Wes McKinney
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney5.6K views
Raising the Tides: Open Source Analytics for Data Science by Wes McKinney
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney3.2K views

Recently uploaded

Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericShapeBlue
58 views9 slides
Digital Personal Data Protection (DPDP) Practical Approach For CISOs by
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOsPriyanka Aash
103 views59 slides
Future of AR - Facebook Presentation by
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook PresentationRob McCarty
54 views27 slides
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...ShapeBlue
59 views13 slides
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc
130 views29 slides
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineShapeBlue
154 views19 slides

Recently uploaded(20)

Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ by ShapeBlue
Confidence in CloudStack - Aron Wagner, Nathan Gleason - AmericConfidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
Confidence in CloudStack - Aron Wagner, Nathan Gleason - Americ
ShapeBlue58 views
Digital Personal Data Protection (DPDP) Practical Approach For CISOs by Priyanka Aash
Digital Personal Data Protection (DPDP) Practical Approach For CISOsDigital Personal Data Protection (DPDP) Practical Approach For CISOs
Digital Personal Data Protection (DPDP) Practical Approach For CISOs
Priyanka Aash103 views
Future of AR - Facebook Presentation by Rob McCarty
Future of AR - Facebook PresentationFuture of AR - Facebook Presentation
Future of AR - Facebook Presentation
Rob McCarty54 views
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by ShapeBlue
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
ShapeBlue59 views
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc130 views
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online by ShapeBlue
KVM Security Groups Under the Hood - Wido den Hollander - Your.OnlineKVM Security Groups Under the Hood - Wido den Hollander - Your.Online
KVM Security Groups Under the Hood - Wido den Hollander - Your.Online
ShapeBlue154 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue134 views
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava... by ShapeBlue
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
Centralized Logging Feature in CloudStack using ELK and Grafana - Kiran Chava...
ShapeBlue74 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10110 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software373 views
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue by ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlueCloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
CloudStack Managed User Data and Demo - Harikrishna Patnala - ShapeBlue
ShapeBlue68 views
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by ShapeBlue
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
ShapeBlue81 views
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P... by ShapeBlue
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
Developments to CloudStack’s SDN ecosystem: Integration with VMWare NSX 4 - P...
ShapeBlue120 views
State of the Union - Rohit Yadav - Apache CloudStack by ShapeBlue
State of the Union - Rohit Yadav - Apache CloudStackState of the Union - Rohit Yadav - Apache CloudStack
State of the Union - Rohit Yadav - Apache CloudStack
ShapeBlue218 views
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda... by ShapeBlue
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
Hypervisor Agnostic DRS in CloudStack - Brief overview & demo - Vishesh Jinda...
ShapeBlue93 views
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT by ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue138 views
"Surviving highload with Node.js", Andrii Shumada by Fwdays
"Surviving highload with Node.js", Andrii Shumada "Surviving highload with Node.js", Andrii Shumada
"Surviving highload with Node.js", Andrii Shumada
Fwdays49 views
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ... by ShapeBlue
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
Import Export Virtual Machine for KVM Hypervisor - Ayush Pandey - University ...
ShapeBlue48 views

Data Tools and the Data Scientist Shortage

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   Data  Tools  and  the  Data   Scien;st  Shortage   Wes  McKinney  @wesmckinn   Data  Summit  @  Web  Summit  2015-­‐11-­‐04  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   Me  
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   Career  theme:  Serial  creator  of  data  tools  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   hMps://hbr.org/2012/10/data-­‐scien;st-­‐the-­‐sexiest-­‐job-­‐of-­‐the-­‐21st-­‐century/  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   hMp://www.bloomberg.com/news/ar;cles/2015-­‐06-­‐04/help-­‐wanted-­‐black-­‐belts-­‐in-­‐data  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   “The  United  States  alone  faces  a  shortage  of  140,000   to  190,000  people  with  analy;cal  exper;se  and  1.5   million  managers  and  analysts  with  the  skills  to   understand  and  make  decisions  based  on  the  analysis   of  big  data.”     McKinsey  &  Co   hMp://www.mckinsey.com/features/big_data  
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   Source:  Drew  Conway,  “The  Data  Science  Venn  Diagram”   Tradi;onal  view  of  Data  Science  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.   Analyzing  the  Analyzers,  Harris,  Murphy,  Vaisman   Many  Kinds  of  “Data  People”  
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   Analyzing  the  Analyzers,  Harris,  Murphy,  Vaisman   Many  Kinds  of  “Data  People”  
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   Addressing  the  analy;cal  shortage   Educa;on   Culture   Tools  
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Data  process  
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.   The  “Great  Decoupling”  for  Industry  Analy;cs   UI ComputeStorage
  • 13. 13  ©  Cloudera,  Inc.  All  rights  reserved.   The  “Great  Decoupling”  for  Industry  Analy;cs   UI ComputeStorage Accumula;on  of  user  ;me   Legacy  technology:   ver;cally-­‐integrated   solu;ons  
  • 14. 14  ©  Cloudera,  Inc.  All  rights  reserved.   Ubiquitous  Real-­‐Time  Storage  and   Compute:  A  view  from  2040  
  • 15. 15  ©  Cloudera,  Inc.  All  rights  reserved.   Data  analysis  hierarchy  of  needs   Data Storage / Access Clean Data Analysis and Visualization Productivity tools / UI
  • 16. 16  ©  Cloudera,  Inc.  All  rights  reserved.   Some  data  tooling  UI  innova;ons  
  • 17. 17  ©  Cloudera,  Inc.  All  rights  reserved.   Rejec;ng  the  “Highlander  Fallacy”  
  • 18. 18  ©  Cloudera,  Inc.  All  rights  reserved.   SQL  Programming:  the  “mainframe   punch  cards”  of  our  ;me  
  • 19. 19  ©  Cloudera,  Inc.  All  rights  reserved.   Many  SQL  engines   …  and  more  
  • 20. 20  ©  Cloudera,  Inc.  All  rights  reserved.   Execu;ng  data  science  languages  in  the  compute  layer   UI Ibis, SQL, Spark API, … Compute Analytic SQL, Spark, MapReduce Storage HDFS, Kudu, HBase Python, R, Julia, …?
  • 21. 21  ©  Cloudera,  Inc.  All  rights  reserved.  
  • 22. 22  ©  Cloudera,  Inc.  All  rights  reserved.   Thank  you   Wes  McKinney  @wesmckinn   Views  are  my  own