Winning with Big Data: Secrets of the Successful Data Scientist

A new class of professionals, called data scientists, has emerged to address the Big Data revolution. In this talk, I discuss nine skills for munging, modeling, and visualizing Big Data. Then I present a case study of using these skills: the analysis of billions of call records to predict customer churn at a North American telecom.

http://en.oreilly.com/datascience/public/schedule/detail/15316

Published in: Technology

  1. WINNING WITH BIG DATA. Secrets of the Successful Data Scientist. Making Data Work, June 9, 2010. Michael Driscoll, @dataspora
  2. WHY DATA MATTERS
  3. THE INDUSTRIAL AGE OF DATA
  4. WHAT IS BIG DATA? Data that is distributed.
  5. WHAT IS DATA SCIENCE?
  6. NINE WAYS TO WIN
  7. 1. CHOOSE THE RIGHT TOOL. You don’t need a chainsaw to cut butter.
  8. 2. COMPRESS EVERYTHING. mysqldump -u myuser -pmypass sourceDB | gzip | ssh mike@dataspora.com "cat - | gunzip | mysql -u myuser -pmypass targetDB" The world is IO-bound.
  9. 3. SPLIT UP YOUR DATA. Split, apply, combine. See Hadley Wickham’s paper at http://had.co.nz/plyr/plyr-intro-090510.pdf
  10. 4. WORK WITH SAMPLES. perl -ne "print if (rand() < 0.01)" data.csv > sample.csv Big Data is heavy, samples are light.
  11. 5. USE STATISTICS
  12. 6. COPY FROM OTHERS. git clone git://github.com/kevinweil/hadoop-lzo Use open source.
  13. 7. ESCAPE CHART TYPOLOGIES. Charts are compositions, not containers.
  14. 8. USE COLOR WISELY. Color can enhance or insult.
  15. 9. TELL A STORY. People are listening.
  16. ONE SUCCESS STORY
  17. WHY DO TELCO CUSTOMERS LEAVE? Sign up. Leave. Goal: “less churn.”
  18. DATA: BILLIONS OF CALLS … and millions of callers.
  19. DOES CALL QUALITY MATTER? … a difference, but not significant.
  20. WHAT ABOUT SOCIAL NETWORKS? Hmmm...
  21. BUILD THE CALL GRAPH … but is it predictive?
  22. EVOLUTION OF A CALL GRAPH: April
  23. EVOLUTION OF A CALL GRAPH: May
  24. EVOLUTION OF A CALL GRAPH: June
  25. EVOLUTION OF A CALL GRAPH: July
  26. 700% INCREASE IN CHURN when a cancellation occurs in a call network.
  27. THANKS! QUESTIONS? Michael Driscoll, twitter @dataspora, http://www.dataspora.com/blog. Making Data Work, June 9, 2010
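
Slide 8 argues for compressing data before moving it, because IO is usually the bottleneck. As a rough illustration only, the sketch below gzips a made-up, highly repetitive block of call records in memory and reports the size before and after. The sample rows are invented for the example; real data will compress less dramatically.

```python
import gzip

# Made-up, highly repetitive call records (an assumption for illustration).
raw = b"2010-06-09,alice,bob,120\n" * 10_000

# Compress before sending over the wire, decompress on the other side.
packed = gzip.compress(raw)
restored = gzip.decompress(packed)

ratio = len(packed) / len(raw)
print(f"{len(raw)} bytes -> {len(packed)} bytes ({ratio:.1%} of original)")
```

The round trip is lossless, so the only cost is CPU time, which is typically cheaper than network or disk bandwidth.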
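
Slide 9 names the split-apply-combine pattern. A minimal sketch of that pattern in plain Python might look like the following; the helper name `split_apply_combine` and the toy call records are assumptions made for this example, not something taken from the talk or from plyr.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def split_apply_combine(
    records: Iterable[T],
    key: Callable[[T], Hashable],
    apply: Callable[[List[T]], R],
) -> Dict[Hashable, R]:
    """Group records by key (split), run `apply` per group, return one dict (combine)."""
    groups: Dict[Hashable, List[T]] = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return {k: apply(group) for k, group in groups.items()}


# Toy data, invented for illustration.
calls = [
    {"caller": "alice", "minutes": 3.0},
    {"caller": "bob", "minutes": 10.0},
    {"caller": "alice", "minutes": 5.0},
]

totals = split_apply_combine(
    calls,
    key=lambda r: r["caller"],
    apply=lambda group: sum(r["minutes"] for r in group),
)
print(totals)
```

Because each group is processed independently, the apply step is the natural place to parallelize or distribute the work.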
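
The Perl one-liner on slide 10 keeps each line with probability 0.01. The same idea can be sketched as a streaming Python generator. The function name `sample_lines` and the `keep_header` and `seed` parameters are additions for this example; the one-liner itself treats a CSV header like any other row.

```python
import random
from typing import Iterable, Iterator, Optional


def sample_lines(
    lines: Iterable[str],
    rate: float = 0.01,
    keep_header: bool = False,
    seed: Optional[int] = None,
) -> Iterator[str]:
    """Yield each line independently with probability `rate` (Bernoulli sampling)."""
    rng = random.Random(seed)
    it = iter(lines)
    if keep_header:
        # Always pass the first line through so the sample stays a valid CSV.
        header = next(it, None)
        if header is not None:
            yield header
    for line in it:
        if rng.random() < rate:
            yield line


if __name__ == "__main__":
    # Synthetic rows stand in for data.csv.
    rows = (f"{i},value\n" for i in range(100_000))
    sample = list(sample_lines(rows, rate=0.01, seed=42))
    print(len(sample), "of 100000 rows kept")
```

The generator never holds more than one line in memory, so it works on files far larger than RAM. The sample size is random rather than fixed, which is the trade-off of this approach.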
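
Slides 21 to 26 ask whether a cancellation in someone's call network predicts their own churn. The slides do not show how this was computed, so the following is only a guess at the general shape of such an analysis: build an undirected call graph, then compare churn rates for subscribers with and without a neighbor who cancelled earlier. All names and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_call_graph(calls: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected graph: two subscribers are neighbors if either called the other."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for caller, callee in calls:
        if caller != callee:
            graph[caller].add(callee)
            graph[callee].add(caller)
    return dict(graph)


def churn_rate_by_exposure(
    graph: Dict[str, Set[str]],
    earlier_cancellers: Set[str],
    later_churners: Set[str],
) -> Tuple[float, float]:
    """Return (churn rate with a cancelled neighbor, churn rate without one)."""
    exposed = exposed_churn = unexposed = unexposed_churn = 0
    for person, neighbors in graph.items():
        if person in earlier_cancellers:
            continue  # already gone, so not at risk in the later period
        if neighbors & earlier_cancellers:
            exposed += 1
            exposed_churn += person in later_churners
        else:
            unexposed += 1
            unexposed_churn += person in later_churners
    exposed_rate = exposed_churn / exposed if exposed else 0.0
    unexposed_rate = unexposed_churn / unexposed if unexposed else 0.0
    return exposed_rate, unexposed_rate


# Toy data, invented for illustration; not the telecom data from the talk.
calls = [("a", "b"), ("a", "c"), ("d", "e"), ("e", "f"), ("f", "g"), ("g", "h")]
graph = build_call_graph(calls)
rates = churn_rate_by_exposure(graph, earlier_cancellers={"a"}, later_churners={"b", "h"})
print(rates)
```

Separating earlier cancellations from later churn matters: without that ordering, the comparison would only show that churners cluster, not that one cancellation precedes another.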
