© Kalido | Kalido Confidential | May 14, 2013
Data Scientist: Your Must-Have
Business Investment NOW
Today’s Speakers #DataScienceNow
• Gregory Piatetsky: Editor, KDnuggets; co-founder, KDD and ACM SIGKDD
• David Smith: Data Scientist, Revolution Analytics
• Carla Gentry: Data Scientist, Analytical Solution
• Darren Peirce: CTO, Kalido
• Eric Kavanagh: DM Radio Host, Information Management Magazine’s DM Radio
Revolution Confidential
3
© Dov Harrington, CC By-2.0
http://www.flickr.com/photos/idovermani/4110546683/
Attribute | Statistician | Data Scientist
Image | Baseball (Cricket) | HBR Sexiest Job of 21st Century
Mode | Reactive | Consultative
Works | Solo | In a team
Inputs | Data File, Hypothesis | A Business Problem
Data | Pre-prepared, clean | Distributed, messy, unstructured
Data Size | Kilobytes | Gigabytes
Tools | SAS, Mainframe | R, Python, awk, Hadoop, Linux, …
Nouns | Tables | Data Visualizations
Focus | Inference (why) | Prediction (what)
Output | Report | Data App / Data Product
Latency | Weeks | Seconds
Stars | G.E.P. Box, Trevor Hastie | Hilary Mason, Nate Silver
http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ 4
Three Essential Skills of Data Scientists
6
Drew Conway
http://www.dataists.com/2010/09/the-data-science-venn-diagram/
[Diagram labels: Data Integration, Mashups, Applications, Models, Visualization, Predictions, Uncertainty, Problems, Data Sources, Credibility, Effective Data Applications]
Data Science to the Rescue!
Attribute | Business Intelligence | Data Science
Perspective | Looking backwards | Looking forwards
Actions | Slice and Dice | Interact
Expertise | Business User | Data Scientist
Data | Warehoused, Siloed | Distributed, real-time
Scope | Unlimited | Specific business question
Questions | What happened? | What will happen? What if?
Output | Table | Answer
Applicability | Historic, possible confounding factors | Future, correcting for influences
Tools | SAP, Cognos, Microstrategy, SAS | Revolution R Enterprise, QlikView, Tableau, Jaspersoft
Hot or not? | So 1997 | Transformational
8
What is Data Science?
By Carla Gentry
Data Scientist
Analytical-Solution
Data Science is…
• The term "data science" has existed for over fifty years: Peter Naur first mentioned it in 1960, but it has gained a lot of attention more recently.
Data Science can be broken down into
4 main areas of expertise.
• Data knowledge
– design & structure
• Programming
– SAS, R, SQL, NoSQL
• Analytics
– Insight
• Communication
– Tell the story
Data Knowledge: Part analyst - part IT
• What kind of servers do you own?
- Servers vs. Mainframe
• What kind of load can the server handle?
- Iterations matter
– Why ask this?
Programming – Pick a language and
use it wisely
• Efficiency is KING!
- Why?
• Number of iterations & complex algorithms or scripts. Snowflake vs. star schema?
- Design is important, but why?
• Key things: normalize, index. There is more to Data Science than just analytics.
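To make the design point concrete, here is a minimal sketch of a star schema with an indexed join key, using Python's built-in sqlite3 module. The table names, column names (dim_product, fact_sales, and so on), and sample rows are hypothetical and do not come from the presentation.

```python
import sqlite3

# In-memory database; all table names, column names, and rows are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per product. A snowflake schema would split
# category out into its own table instead of repeating it here.
cur.execute("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        category   TEXT NOT NULL
    )
""")

# Fact table: one row per sale, pointing at the dimension by key.
cur.execute("""
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        amount     REAL NOT NULL
    )
""")

# Index the join key so lookups by product need not scan the whole fact table.
cur.execute("CREATE INDEX idx_sales_product ON fact_sales(product_id)")

cur.executemany(
    "INSERT INTO dim_product VALUES (?, ?, ?)",
    [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware"), (3, "Manual", "Books")],
)
cur.executemany(
    "INSERT INTO fact_sales (product_id, amount) VALUES (?, ?)",
    [(1, 10.0), (1, 15.0), (2, 20.0), (3, 5.0)],
)

# Typical star-schema query: join the fact table to a dimension and aggregate.
cur.execute("""
    SELECT p.category, SUM(s.amount)
    FROM fact_sales AS s
    JOIN dim_product AS p ON p.product_id = s.product_id
    GROUP BY p.category
    ORDER BY p.category
""")
totals = cur.fetchall()
print(totals)
```

With only a handful of rows the index makes no measurable difference; it is included to show where indexing fits in the design, since its effect grows with the number of rows and iterations.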
How can I learn about Data Science?
• For those who want to invest their time and
talent there are resources.
• College Courses
• Online
• Webinars
• Blogs
Data Science and Data Scientists
Now
Gregory Piatetsky, @kdnuggets
Analytics, Big Data,
Data Mining, and Data Science Resources
© KDnuggets 2013
• Statistics, 1830-
• Data mining, 1980-
• Knowledge Discovery in
Data (KDD), 1989-
• Business Analytics, 1997-
• Predictive Analytics, 2002-
• Data Analytics, 2011-
• Data Science, 2011-
• …?
Same Core Idea: Finding Useful Patterns in Data
Different Emphasis
Trends from Google Ngrams (1800-2008) and Google Trends (2005-2013)
Big Data > Data Mining > Business Analytics > Predictive Analytics > Data Science
[Chart: Google Trends search, Jan 2008 - Apr 2013, Worldwide; labeled series: Big Data, Data mining]
Data Scientist: "sexiest job of the 21st century" (???),
say Thomas H. Davenport and D.J. Patil (HBR, Oct 2012)
“Data Scientist”
Fastest growing term on
www.kdnuggets.com/jobs
1% of jobs in 2010
4% of jobs in 2011
19% of jobs in 2012
23% of jobs in 2013
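As a rough illustration of how fast the term grew, the year-over-year ratios can be computed directly from the percentages quoted above. This is a minimal sketch; the variable names are illustrative, and the only data used are the four figures on the slide.

```python
# Share of www.kdnuggets.com/jobs postings using the term "Data Scientist",
# in percent, as quoted on the slide.
share_by_year = {2010: 1, 2011: 4, 2012: 19, 2013: 23}

years = sorted(share_by_year)

# Ratio of each year's share to the previous year's share.
growth = {
    year: share_by_year[year] / share_by_year[prev]
    for prev, year in zip(years, years[1:])
}

for year, ratio in growth.items():
    print(f"{year}: {ratio:.2f}x the previous year's share")
```

Since the talk was given in May 2013, the 2013 figure presumably covers only part of the year, so the last ratio is not directly comparable to the earlier ones.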
[Chart series: Data Mining, Big Data, Data Scientist, Statistician]
“Data mining” jobs are more common, but “Big Data” jobs are surging much faster than “Data Scientist”
“Statistician” jobs are steady, but not growing
• Big Data can produce better predictions, but expect limited improvement
• Example: the Netflix Prize took 3 years to improve the error (RMSE) of predicted movie ratings from 0.95 stars to 0.86
• Inherent randomness in human behavior
• Data Science should help separate hype from reality
• Biggest effects from Big Data are from new platforms, like Google, Facebook, LinkedIn; personalized medicine
• However, Big Data makes privacy online almost impossible
Gregory Piatetsky-Shapiro, Big Data Hype and Reality, Harvard Business Review blog, Oct 2012
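The Netflix example above can also be put in relative terms. This is a minimal sketch, assuming the 0.95 and 0.86 figures are prediction errors measured in stars (lower is better) and using the slide's rounded values rather than the exact contest scores.

```python
# Rounded error figures quoted on the slide, in stars (lower is better).
error_before = 0.95
error_after = 0.86

absolute_gain = error_before - error_after
relative_gain_pct = 100 * absolute_gain / error_before

print(f"Absolute improvement: {absolute_gain:.2f} stars")
print(f"Relative improvement: {relative_gain_pct:.1f}%")
```

With these rounded inputs, three years of work bought an improvement of under a tenth of a star, which is the "limited improvement" the slide warns about.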
21
Gartner Hype Cycle: Big Data
Gartner VP says Big Data is Falling into the Trough of Disillusionment, Jan 2013
Q&A
• Gregory Piatetsky: Editor, KDnuggets; co-founder, KDD and ACM SIGKDD (@kdnuggets)
• David Smith: Data Scientist, Revolution Analytics (@revodavid)
• Carla Gentry: Data Scientist, Analytical Solution (@data_nerd)
• Darren Peirce: CTO, Kalido (@DarrenPeirce)
• Eric Kavanagh: DM Radio Host, Information Management Magazine’s DM Radio (@eric_kavanagh)
Summer Sessions: Two Tracks For YOU

Agile Information Foundation for the Data Scientist
Series Kickoff
May 14: Data Scientist: Your must-have business investment now.
(30 Minute Learning Sessions)
May 28: Rapid Data Integration tools and methods
June 4: Harmonizing Data for the Warehouse
June 11: Rapid Iteration Methodology Using Information Models

TCO: Find data warehouse costs before they find you.
Series Kickoff
June 25: Find your data warehouse’s hidden costs before they find you.
(30 Minute Learning Sessions)
July 2: The real cost per release cycle
July 9: Automate to reduce operating costs
July 16: Reduce tool cost
July 23: Scale drives cost reductions

Visit get.kalido.com/summer-series to register


Editor's Notes

  • #4 Most data scientists wear hipster glasses and T-shirts with ironic, geeky quotes.
  • #5 http://www.hilarymason.com/media_and_press/im-in-glamour-magazine/
  • #6 http://www.hilarymason.com/media_and_press/im-in-glamour-magazine/ Ivan Fellegi, Chief Statistician of Canada and SSC President for 1981: http://www.flickr.com/photos/ssc_liaison/431047111/
  • #21 Churn: the best algorithms for predicting churn have a lift of 5-7, i.e. 5-7 times better than random. Behavioral advertising: 2-3% CTR, i.e. 10 times better than random.
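
The lift figures in note #21 can be read as a simple ratio: the rate observed in the group a model selects, divided by the rate in the population as a whole. The sketch below shows that arithmetic under stated assumptions. The function name and the churn rates (2% overall, 12% in the targeted group) are hypothetical, chosen only to land inside the 5-7 range the note mentions.

```python
def lift(targeted_rate: float, baseline_rate: float) -> float:
    """How many times higher the targeted group's rate is than the overall rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return targeted_rate / baseline_rate


# Hypothetical churn example: 2% of all customers churn, but 12% of the
# customers the model ranks as most at risk do.
churn_lift = lift(targeted_rate=0.12, baseline_rate=0.02)
print(f"Churn lift: {churn_lift:.1f}")

# Behavioral advertising: a 2-3% CTR described as 10 times better than random
# implies a random-targeting CTR of roughly 0.2-0.3%.
implied_baseline_ctr = [ctr / 10 for ctr in (0.02, 0.03)]
print([f"{100 * ctr:.1f}%" for ctr in implied_baseline_ctr])
```

A lift of 1 means the model does no better than picking customers at random, which is why even the best figures in the note still leave most predictions uncertain.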