Winning With Big Data: Secrets of the Successful Data Scientist

The world is experiencing an Industrial Revolution of Data. In any given minute the machines around us are tracking billions of mouse clicks, credit card swipes, and GPS coordinates. And increasingly this data is being saved, aggregated, and analyzed. These massive data flows present big challenges to firms, but also new opportunities for deriving insights.

Presented at the June 2010 gathering of the Bay Area's Business Intelligence Special Interest Group.

Published in: Technology

  1. WINNING WITH BIG DATA: Secrets of the Successful Data Scientist. SDForum BI SIG, June 15, 2010. Michael Driscoll, @dataspora
  2. WHY DATA MATTERS NOW
  3. THE INDUSTRIAL AGE OF DATA
  4. WHAT IS BIG DATA? Data that is distributed.
  5. WHAT IS DATA SCIENCE?
  6. WHY DATA SCIENCE IS SEXY
  7. “The sexy job in the next ten years will be statisticians…” - Hal Varian
  8.
  9. Data: 1000 bytes. Model: 2 bytes.
  10. 9 WAYS TO WIN WITH DATA
  11. 1. CHOOSE THE RIGHT TOOL: You don’t need a chainsaw to cut butter.
  12. 2. COMPRESS EVERYTHING
      mysqldump -u myuser -pmypass sourceDB | gzip | ssh mike@dataspora.com "cat - | gunzip | mysql -u myuser -pmypass targetDB"
      The world is IO-bound.
  13. 3. SPLIT UP YOUR DATA: Split, apply, combine.
  14. 4. WORK WITH SAMPLES
      perl -ne "print if (rand() < 0.01)" data.csv > sample.csv
      Big Data is heavy, samples are light.
  15. 5. USE STATISTICS
  16. 6. COPY FROM OTHERS
      git clone git://github.com/kevinweil/hadoop-lzo
      Use open source.
  17. 7. ESCHEW CHART TYPOLOGIES: Charts are compositions, not containers.
  18. 8. COLOR WITH CARE: Color can enhance or insult.
  19. 9. TELL A STORY: People are listening.
  20. ONE SUCCESS STORY
  21. WHY DO TELCO CUSTOMERS LEAVE? Sign up. Leave. Goal: “less churn.”
  22. DATA: BILLIONS OF CALLS … and millions of callers.
  23. DOES CALL QUALITY MATTER? … a difference, but not significant.
  24. WHAT ABOUT SOCIAL NETWORKS? Hmmm...
  25. BUILD THE CALL GRAPH … but is it predictive?
  26. EVOLUTION OF A CALL GRAPH: April
  27. EVOLUTION OF A CALL GRAPH: May
  28. EVOLUTION OF A CALL GRAPH: June
  29. EVOLUTION OF A CALL GRAPH: July
  30. 700% INCREASE IN CHURN when a cancellation occurs in a call network.
  31. FINAL THOUGHTS
  32. THE BIG DATA STACK: Actions / Data Products (Content Filters, Rec Engines) / Analytics (R, SPSS, SAS, SAP) / Insights / Big Data / Dedicated RDBMS / Data
  33. THANKS! QUESTIONS? Michael Driscoll; med@dataspora.com; @dataspora on Twitter; http://www.dataspora.com/blog; SDForum BI SIG, June 15, 2010
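
Slides 13 and 14 name two techniques without showing them in a general-purpose language: "split, apply, combine" and working with a roughly 1% random sample. The sketch below illustrates both using only the Python standard library. It is not from the talk: the function names, the `caller` and `minutes` fields, and the toy records are all hypothetical, chosen to echo the telco example.

```python
import random
from collections import defaultdict
from statistics import mean


def sample_rows(rows, rate=0.01, seed=None):
    """Keep each row independently with probability `rate`.

    Mirrors the perl one-liner on slide 14: the sample size is
    approximately (not exactly) rate * len(rows).
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]


def split_apply_combine(rows, key, value, func):
    """Split rows by `key`, apply `func` to each group's `value`s,
    and combine the per-group results into one dict."""
    groups = defaultdict(list)
    for row in rows:                      # split
        groups[row[key]].append(row[value])
    return {k: func(v) for k, v in groups.items()}  # apply + combine


# Hypothetical call records (illustrative only, not data from the talk).
calls = [
    {"caller": "a", "minutes": 2.0},
    {"caller": "b", "minutes": 6.0},
    {"caller": "a", "minutes": 4.0},
    {"caller": "b", "minutes": 2.0},
    {"caller": "c", "minutes": 3.0},
]

# Mean call length per caller.
mean_minutes = split_apply_combine(calls, "caller", "minutes", mean)

# On a large dataset, sample first and explore the light version.
light = sample_rows(calls, rate=0.5, seed=42)
```

Because each group is processed independently, the apply step is also the natural unit to distribute across machines, which fits the deck's definition of big data as "data that is distributed."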
