Airbnb Tech Talk: Josh Wills on The Life of a Data Scientist

This is the accompanying presentation for a tech talk given at Airbnb.
Video of the talk here:
http://www.youtube.com/watch?v=h9vQIPfe2uU
Other tech talks:
https://www.airbnb.com/tech_talks

Published in: Education, Technology
  • It helps to be an actual scientist first.
  • In the talk I mention “Baseball By the Numbers” instead of this book, the one I really meant. ;)

    1. The Life of a Data Scientist, May 23, 2012
    2. About Me
       • jwills@cloudera.com and @josh_wills
       • Formerly of Google (2008 – 2011)
          • Worked on the ad auction
          • Led the team that built the data infrastructure for Google+
       • Before that: a bunch of startups
          • Sometimes as a software engineer, sometimes as a statistician
       • Math degree from Duke and a half-finished PhD from The University of Texas at Austin
       • Now: Director of Data Science at Cloudera
    3. @josh_wills, #hacker vs. @josh_wills, #ThoughtLeader
       Copyright 2012 Cloudera Inc. All rights reserved
    4. What is a Data Scientist?
    5. One Definition…
    6. … versus Another
    7. Why Is Everyone Talking About Them?
    8. Because They Make Things Fun.
    9. Data Scientists Power The Products You Love
    10. The Job Isn’t New. The Impact Is.
    11. How Do I Become One?
    12. The Standard Reply
    13. Personality Trait #1: Relentless, but in a Lazy Way
    14. Personality Trait #2: (Acquired) Humility
    15. Step 1: Study Math
    16. But… I didn’t study math.
    17. Alternate Step 1: Study (Computer) Science
    18. Things People Don’t Know About Computer Science
    19. Things Scientists Don’t Know About Statistics
    20. Problem Solving In Context
    21. Phase 2: Stuff You Still Don’t Know
    22. Statisticians: How to Work on an Engineering Team
        • Modular software design
        • Unit tests
        • Code reviews
        • Automated build and test infrastructure
        • Source code management
    23. Software Engineers: How to Carry Out an Analysis
    24. Industrial Machine Learning
    25. Data Scientists and Hadoop
    26. Data Analyst: “If my tools and data can’t answer a question, then the question doesn’t get answered.”
    27. Data Scientist: “If my tools and data can’t answer a question, then I go get better tools and data.”
    28. Incredibly Common Question: “When should I use Hadoop instead of a relational database?”
    29. The Unit of Analysis Problem: Three Symptoms
    30. First Symptom: COUNT DISTINCT
    31. Second Symptom: Cursors
    32. Third Symptom: ALTER TABLE OF_DOOM
    33. The Unit of Analysis Problem
        • Data warehouses are optimized to analyze transactions
           • Awesome for finance and ERP
           • Not ideal for product and marketing
        • A function of what databases are good at
    34. What Are You Trying to Analyze?
        Simple Entities
        • Static attributes
        • Flat data structure
        • Transient
        • Examples
           • SKUs
           • Line items from an invoice
           • Log messages
        Complex Entities
        • Evolving attributes
        • Hierarchical data structure
        • Persistent
        • Examples
           • Customers
           • Suppliers
           • Website visitors
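
The simple-versus-complex distinction on slide 34 can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the class names (`LineItem`, `Visit`, `Visitor`), their fields, and the sample log rows are all hypothetical, and the rows are assumed to arrive ordered by day for each visitor.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


# Simple entity (hypothetical fields): flat, static, one record per event.
@dataclass(frozen=True)
class LineItem:
    invoice_id: int
    sku: str
    quantity: int


# Complex entity (hypothetical fields): persistent and hierarchical.
# A visitor owns a growing list of visits, and each visit owns its page views.
@dataclass
class Visit:
    day: str
    pages: List[str]


@dataclass
class Visitor:
    visitor_id: str
    visits: List[Visit] = field(default_factory=list)


def group_into_visitors(log_rows: List[Tuple[str, str, str]]) -> Dict[str, Visitor]:
    """Fold flat (visitor_id, day, page) rows into one Visitor per id.

    Assumes rows are ordered by day within each visitor.
    """
    visitors: Dict[str, Visitor] = {}
    for visitor_id, day, page in log_rows:
        visitor = visitors.setdefault(visitor_id, Visitor(visitor_id))
        # Start a new visit whenever the day changes for this visitor.
        if not visitor.visits or visitor.visits[-1].day != day:
            visitor.visits.append(Visit(day, []))
        visitor.visits[-1].pages.append(page)
    return visitors


# Made-up sample data: flat log messages, the "simple entity" view.
log_rows = [
    ("a", "2012-05-22", "/home"),
    ("a", "2012-05-22", "/search"),
    ("b", "2012-05-22", "/home"),
    ("a", "2012-05-23", "/listing"),
]
visitors = group_into_visitors(log_rows)

# Once the visitor is the unit of analysis, the question
# "how many visitors came back on a second day?" is a plain count.
returning = sum(1 for v in visitors.values() if len(v.visits) > 1)
```

With the visitor as the record, the question reduces to a single pass over the objects rather than a query that regroups flat rows.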
    35. Choosing Our Own Data Format
        • We get to structure our data in the way that works best for the problem we are solving
        • Flexible
        • Evolvable
        • Compact
        • Fast serialization/deserialization
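
One way to picture the "evolvable" property on slide 35 is a reader that tolerates records written under older or newer versions of a schema. The sketch below is an assumption-laden illustration: the slide names no specific format, so standard-library JSON stands in here (a binary format would typically be more compact), and the `Profile` record, its fields, and the sample payloads are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass
class Profile:
    user_id: str
    # Hypothetical field added in a later schema version; the default
    # lets records written before it existed still load.
    country: str = "unknown"


def serialize(profile: Profile) -> bytes:
    # Compact separators drop the whitespace json.dumps inserts by default.
    return json.dumps(asdict(profile), separators=(",", ":")).encode("utf-8")


def deserialize(payload: bytes) -> Profile:
    raw = json.loads(payload.decode("utf-8"))
    known = {f.name for f in fields(Profile)}
    # Fields this reader does not know are ignored; missing fields
    # fall back to the dataclass defaults.
    return Profile(**{k: v for k, v in raw.items() if k in known})


# Made-up payloads: one written before `country` existed, one written by a
# newer producer that also emits a `plan` field this reader has never seen.
old_record = b'{"user_id":"u1"}'
newer_record = b'{"user_id":"u2","country":"US","plan":"pro"}'

old_profile = deserialize(old_record)
newer_profile = deserialize(newer_record)
round_trip = deserialize(serialize(Profile("u3", "FR")))
```

Defaults for missing fields and silent skipping of unknown ones are the two rules that let writers and readers upgrade independently of each other.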
    36. Spell Correction: The Drosophila of Data Science
    37. Simple Counts on Complex Objects
    38. The Uncanny Valley for Statisticians on Hadoop
    39. The Business of Data Science
    40. Where You Should Work: The Two Options
    41. A Startup
    42. Close to the Money
    43. Dealing for Data
    44. Education and Growth
    45. Questions? @josh_wills
