Winning with Big Data: Secrets of the Successful Data Scientist

A new class of professionals, called data scientists, has emerged to address the Big Data revolution. In this talk, I discuss nine skills for munging, modeling, and visualizing Big Data. Then I present a case study of using these skills: the analysis of billions of call records to predict customer churn at a North American telecom.

http://en.oreilly.com/datascience/public/schedule/detail/15316

Published in: Technology

  • If you had to put your finger on the beginning of the information age, it might be the creation of the first telegraph in 1792, in France, by a pair of brothers. It was the first time that man-made information traveled at the speed of light, over long distances. Cars, cash registers, subway turnstiles, gene chips, TiVos, and cell phones are streaming billions of data points. Prof. Joe Hellerstein of Berkeley has dubbed it “The Industrial Revolution of Data”, where machines, not people, are the dominant producers of data.
  • In this talk I’m also going to be talking about tools for medium data, because these translate well into the Big Data space.
  • I’m defining data science as applying tools to data to answer questions. It is at the intersection of these tools. And it is a growing field, because data is getting bigger, and our tools are getting better. (Suffice it to say, the questions we ask have been around since time immemorial: who ...) Another word for questions is hypotheses.
  • Do you really need Hadoop for that job? Think twice about it. Can you do everything on one machine? Escalate only as necessary… don’t solve problems that don’t yet exist. At the same time, optimize for scalability, not performance. Cleverness is usually punished in the long run.
  • Compressing gives you a 6-8x bump immediately in network and disk IO, out of the gate. This example also illustrates another piece: avoid hitting disk at all costs. If you’re working on the cloud,
  • This is the essence of parallelism: find some independent dimension on which to split your data.
      * Even if your data isn’t in a database, split it up the old-fashioned way (one file per hour, day, or month, depending on its size); these often form natural samples to work from.
      * Learn & understand how to partition, shard, or otherwise distribute your data in a database.
      * Parallel load is your friend: several databases have parallel load features; Hadoop has distcp.
  • Do you want to be moving GBs and TBs around? Sometimes you want to visualize and work on the data locally… so sample!
      * Reservoir sampling is a fixed-memory algorithm for achieving a defined-size sample.
      * The above illustrates how to get a basic 1% uniform sample in a perl one-liner.
  • When we compare two real-valued measures, they will almost always be different. The critical question is: How confident are we in the difference? Is it significant?
  • Don’t reinvent the wheel, steal someone else’s wheels of 1s and 0s. Statistics is hard, so go ahead & use someone else’s stuff. Go ahead. It’s there. That’s what’s great about R: 2000 statistical libraries written by professors.
  • Not machines, people.
  • Okay, now I want you to try and forget everything you just heard about base graphics. ggplot2 is a new visualization package formally released in 2009, developed by Professor Hadley Wickham. It is based on a different perspective of developing graphics, and has its own set of functions and parameters.
  • Most telcos lose 1-2% of their customers every month. It’s 7x more expensive to acquire a customer than to retain one.
  • Not machines, people.
  • This illustrates what we said earlier: statistics matters. We needed to rule this out. (If anything, the correlation occurs opposite of what we expected.)
  • “A Survey of R Graphics”, presented to the LA R Users Group, June 18, 2009. Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I’ll also discuss some methods for visualizing large data sets. I’ll end with an overview of Rapache, a tool for embedding R in web applications. For questions beyond this talk, I can be contacted at: Michael E Driscoll, http://www.dataspora.com, mike@dataspora.com.
  • Windowing functions in Greenplum, which is a modified Postgres distributed database.
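The sampling note above names reservoir sampling without showing it. A minimal sketch in Python, assuming the classic single-pass variant (often called Algorithm R); the function name `reservoir_sample` and its parameters are illustrative and do not come from the talk:

```python
import random


def reservoir_sample(items, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Works in a single pass and keeps only k items in memory, so the
    input can be a file or stream of unknown length.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives only if the
            # slot falls inside the reservoir (probability k / (i + 1)).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Hypothetical usage: sample 5 values from a stream of 1,000.
    print(reservoir_sample(range(1000), 5))
```

Unlike the perl one-liner, which keeps each line independently with probability 0.01 and therefore returns a sample whose size varies, this sketch returns exactly `k` items (or every item, if the input has fewer than `k`).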
  • Transcript of "Winning with Big Data: Secrets of the Successful Data Scientist"

    1. WINNING WITH BIG DATA: Secrets of the Successful Data Scientist. Making Data Work, June 9, 2010. Michael Driscoll, @dataspora
    2. WHY DATA MATTERS
    3. THE INDUSTRIAL AGE OF DATA
    4. WHAT IS BIG DATA? Data that is distributed.
    5. WHAT IS DATA SCIENCE?
    6. NINE WAYS TO WIN
    7. 1. CHOOSE THE RIGHT TOOL: You don’t need a chainsaw to cut butter.
    8. 2. COMPRESS EVERYTHING: mysqldump -u myuser -p mypass sourceDB | gzip | ssh mike@dataspora.com "cat - | gunzip | mysql -u myuser -p mypass targetDB" The world is IO-bound.
    9. 3. SPLIT UP YOUR DATA: Split, apply, combine. See Hadley Wickham’s paper at http://had.co.nz/plyr/plyr-intro-090510.pdf
    10. 4. WORK WITH SAMPLES: perl -ne "print if (rand() < 0.01)" data.csv > sample.csv Big Data is heavy, samples are light.
    11. 5. USE STATISTICS
    12. 6. COPY FROM OTHERS: git clone git://github.com/kevinweil/hadoop-lzo Use open source.
    13. 7. ESCAPE CHART TYPOLOGIES: Charts are compositions, not containers.
    14. 8. USE COLOR WISELY: Color can enhance or insult.
    15. 9. TELL A STORY: People are listening.
    16. ONE SUCCESS STORY
    17. WHY DO TELCO CUSTOMERS LEAVE? Sign up. Leave. Goal: “less churn.”
    18. DATA: BILLIONS OF CALLS … and millions of callers.
    19. DOES CALL QUALITY MATTER? … a difference, but not significant.
    20. WHAT ABOUT SOCIAL NETWORKS? Hmmm...
    21. BUILD THE CALL GRAPH … but is it predictive?
    22. EVOLUTION OF A CALL GRAPH: April
    23. EVOLUTION OF A CALL GRAPH: May
    24. EVOLUTION OF A CALL GRAPH: June
    25. EVOLUTION OF A CALL GRAPH: July
    26. 700% INCREASE IN CHURN when a cancellation occurs in a call network.
    27. THANKS! QUESTIONS? Michael Driscoll, twitter @dataspora, http://www.dataspora.com/blog. Making Data Work, June 9, 2010
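
Slide 19 reports a difference in call quality that was not significant, but the talk does not say which test was used. One simple way to ask how confident we are in a difference between two groups is a permutation test. The sketch below is an illustration under that assumption, not the analysis from the talk; the function name, parameters, and any data passed to it are hypothetical, and it uses only the Python standard library:

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm=10_000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Repeatedly shuffles the pooled values, re-splits them into groups of
    the original sizes, and counts how often the shuffled difference is
    at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value suggests the observed difference would be unusual if group labels did not matter; a large one is the "a difference, but not significant" situation the slide describes.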
