THE ART OF DATA SCIENCE
Josh Wills
OPEN DATA SCIENCE CONFERENCE, BOSTON 2015
@opendatasci
© Cloudera, Inc. All rights reserved.
Or, A Less Pretentious Title
Josh Wills | Senior Director of Data Science
The Art of Data Science
The Data Science State of the Union
Data Scientists At Work
Data Scientists at Home
Data Scientists…Everywhere?
Creating Some Definition
The Mismeasure of Data Scientists
The Tremendous Promise of Big Data
The Unfortunate Reality of Big Data
Like Urban Planning, but for Data
Data Modeling for Data Science
Event Series Analytics
A Simple Star Schema for Spell Correction
A Supernova Schema for Search
Spell Correction in SQL
The Operational/Analytical Impedance Mismatch
Exhibit: http://github.com/jwills/exhibit
Pushing Beyond The Limits of Our Tools
Thanks!
jwills@cloudera.com

Editor's Notes

  • #5 Data cleansing, preparation, feature engineering. The dirty work we do all day at the insight factory.
  • #7 Does everyone have to be a data scientist? Are there tools that can make anyone into a data scientist?
  • #8 http://www.quora.com/What-is-the-difference-between-a-data-scientist-and-a-statistician
  • #9 Data scientists can do two things better than data analysts: ask great questions and answer them faster than other people would think possible.
  • #10 Everyone gets a Ferrari!
  • #11 Oh no! Everyone has a Ferrari! Induced demand: as you increase the supply of something, the demand for it increases as well.
  • #12 We need the equivalent of public transit infrastructure for analytic queries: it has a low marginal cost for asking one more question, goes to the places most people need to go, and removes load from the roadways.
  • #13 The spell correction example as a model for what the public transit infrastructure should look like.
  • #14 Can we create a data model that makes this kind of powerful analysis available to people who only know SQL?
  • #18 Even better, can a common data model enable us to seamlessly move models from the offline, analytical world to the online, operational world? It plausibly can, because the supernova data model is essentially the HBase/Cassandra/Mongo/etc. data model.
  • #19 http://github.com/jwills/exhibit
  • #20 No tool can make you a data scientist, because it’s the ability to push beyond the limits of your tools that makes you a data scientist.
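
Notes #13 and #14 mention a spell correction example built on event series data, but the slides do not spell out the data or the query. The sketch below is one plausible reading, not the method from the talk. It assumes a log of search events per user and treats a query with no click, followed shortly by a different query with a click, as a candidate correction. The field names (`user`, `ts`, `query`, `clicked`), the 30-second window, and the sample events are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical search log: one record per query event.
# "ts" is seconds since some arbitrary start; "clicked" says whether
# the user clicked any result for that query.
EVENTS = [
    {"user": "u1", "ts": 0,   "query": "databse",  "clicked": False},
    {"user": "u1", "ts": 10,  "query": "database", "clicked": True},
    {"user": "u2", "ts": 5,   "query": "databse",  "clicked": False},
    {"user": "u2", "ts": 20,  "query": "database", "clicked": True},
    {"user": "u3", "ts": 0,   "query": "hadoop",   "clicked": True},
    {"user": "u3", "ts": 500, "query": "spark",    "clicked": True},
    {"user": "u4", "ts": 0,   "query": "databse",  "clicked": False},
    {"user": "u4", "ts": 100, "query": "database", "clicked": True},
]


def find_rewrite_pairs(events, max_gap_seconds=30):
    """Count (original, rewritten) query pairs that look like corrections.

    Assumed heuristic: within one user's time-ordered events, a query
    with no click followed within max_gap_seconds by a different query
    that did get a click.
    """
    # Group events into one series per user.
    by_user = defaultdict(list)
    for event in events:
        by_user[event["user"]].append(event)

    pairs = Counter()
    for user_events in by_user.values():
        ordered = sorted(user_events, key=lambda e: e["ts"])
        # Compare each event with the one immediately after it.
        for first, second in zip(ordered, ordered[1:]):
            gap = second["ts"] - first["ts"]
            if (not first["clicked"]
                    and second["clicked"]
                    and first["query"] != second["query"]
                    and gap <= max_gap_seconds):
                pairs[(first["query"], second["query"])] += 1
    return pairs


if __name__ == "__main__":
    for (original, rewritten), count in find_rewrite_pairs(EVENTS).most_common():
        print(original, "->", rewritten, count)
```

Grouping each user's events into a single ordered series is what makes the pairwise comparison cheap. That is roughly the property note #18 attributes to wide-row stores such as HBase or Cassandra, although how the talk's supernova schema is actually laid out is not shown here.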