Airbnb Tech Talk: Josh Wills on The Life of a Data Scientist

This is the accompanying presentation for a tech talk given at Airbnb.
Video of the talk here:
http://www.youtube.com/watch?v=h9vQIPfe2uU
Other tech talks:
https://www.airbnb.com/tech_talks

Published in: Education, Technology

  • It helps to be an actual scientist first.
  • In the talk I mention “Baseball By the Numbers” instead of this book, the one I really meant. ;)
  • Airbnb Tech Talk: Josh Wills on The Life of a Data Scientist

    1. The Life of a Data Scientist (May 23, 2012)
    2. About Me
       • jwills@cloudera.com and @josh_wills
       • Formerly of Google (2008 – 2011)
         • Worked on the ad auction
         • Led the team that built the data infrastructure for Google+
       • Before that: a bunch of startups
         • Sometimes as a software engineer, sometimes as a statistician
       • Math degree from Duke and a half-finished PhD from The University of Texas at Austin
       • Now: Director of Data Science at Cloudera
       Copyright 2011 Cloudera Inc. All rights reserved
    3. @josh_wills, #hacker vs. @josh_wills, #ThoughtLeader Copyright 2012 Cloudera Inc. All rights reserved
    4. What is a Data Scientist?
    5. One Definition…
    6. … versus Another
    7. Why Is Everyone Talking About Them?
    8. Because They Make Things Fun.
    9. Data Scientists Power The Products You Love
    10. The Job Isn’t New. The Impact Is.
    11. How Do I Become One?
    12. The Standard Reply
    13. Personality Trait #1: Relentless, but in a Lazy Way
    14. Personality Trait #2: (Acquired) Humility
    15. Step 1: Study Math
    16. But… I didn’t study math.
    17. Alternate Step 1: Study (Computer) Science
    18. Things People Don’t Know About Computer Science
    19. Things Scientists Don’t Know About Statistics
    20. Problem Solving In Context
    21. Phase 2: Stuff You Still Don’t Know
    22. Statisticians: How to Work on an Engineering Team
        • Modular software design
        • Unit tests
        • Code reviews
        • Automated build and test infrastructure
        • Source code management
    23. Software Engineers: How to Carry Out an Analysis
    24. Industrial Machine Learning
    25. Data Scientists and Hadoop
    26. Data Analyst: “If my tools and data can’t answer a question, then the question doesn’t get answered.”
    27. Data Scientist: “If my tools and data can’t answer a question, then I go get better tools and data.”
    28. Incredibly Common Question: “When should I use Hadoop instead of a relational database?”
    29. The Unit of Analysis Problem: Three Symptoms
    30. First Symptom: COUNT DISTINCT
    31. Second Symptom: Cursors
    32. Third Symptom: ALTER TABLE OF_DOOM
    33. The Unit of Analysis Problem
        • Data warehouses are optimized to analyze transactions
        • Awesome for finance and ERP
        • Not ideal for product and marketing
        • A function of what databases are good at
    34. What Are You Trying to Analyze?
        Simple Entities
        • Static attributes
        • Flat data structure
        • Transient
        • Examples
          • SKUs
          • Line items from an invoice
          • Log messages
        Complex Entities
        • Evolving attributes
        • Hierarchical data structure
        • Persistent
        • Examples
          • Customers
          • Suppliers
          • Website visitors
    35. Choosing Our Own Data Format
        • We get to structure our data in the way that works best for the problem we are solving
        • Flexible
        • Evolvable
        • Compact
        • Fast serialization/deserialization
    36. Spell Correction: The Drosophila of Data Science
    37. Simple Counts on Complex Objects
    38. The Uncanny Valley for Statisticians on Hadoop
    39. The Business of Data Science
    40. Where You Should Work: The Two Options
    41. A Startup
    42. Close to the Money
    43. Dealing for Data
    44. Education and Growth
    45. Questions? @josh_wills
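
Slides 29 to 37 survive here only as titles and bullets, so the following is a rough Python sketch of one way to read the "unit of analysis" idea, not code from the talk. It assumes hypothetical log rows of the form (visitor_id, day, page), folds those flat, transient records into one nested record per website visitor (a complex entity in the sense of slide 34), and then answers questions with simple counts over those objects. The names `LOG_ROWS`, `Visitor`, `build_visitors`, and `count_where` are invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical log rows (visitor_id, day, page): simple entities that are
# flat, static, and transient. The data is made up for this sketch.
LOG_ROWS = [
    ("v1", "2012-05-21", "/search"),
    ("v1", "2012-05-21", "/listing/42"),
    ("v2", "2012-05-21", "/search"),
    ("v1", "2012-05-22", "/book"),
    ("v3", "2012-05-22", "/search"),
    ("v2", "2012-05-22", "/listing/7"),
]


@dataclass
class Visitor:
    """A complex entity: persistent, hierarchical, and growing over time."""
    visitor_id: str
    pages_by_day: dict = field(default_factory=lambda: defaultdict(list))


def build_visitors(rows):
    """Fold flat log rows into one nested record per visitor."""
    visitors = {}
    for visitor_id, day, page in rows:
        if visitor_id not in visitors:
            visitors[visitor_id] = Visitor(visitor_id)
        visitors[visitor_id].pages_by_day[day].append(page)
    return visitors


def count_where(visitors, predicate):
    """A simple count over complex objects: one pass, one test per visitor."""
    return sum(1 for v in visitors.values() if predicate(v))


def saw_page(visitor, page):
    """True if the visitor viewed the given page on any day."""
    return any(page in pages for pages in visitor.pages_by_day.values())


visitors = build_visitors(LOG_ROWS)
total_visitors = len(visitors)
returning = count_where(visitors, lambda v: len(v.pages_by_day) > 1)
searched_and_booked = count_where(
    visitors, lambda v: saw_page(v, "/search") and saw_page(v, "/book")
)
print(total_visitors, returning, searched_and_booked)
```

On this toy data the script prints `3 2 1`. Once the visitor is the unit of analysis, each question is a single pass with one predicate per object, instead of a distinct count or a self-join over transaction rows.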
