Winning With Big Data: Secrets of the Successful Data Scientist

The world is experiencing an Industrial Revolution of Data. In any given minute the machines around us are tracking billions of mouse clicks, credit card swipes, and GPS coordinates. And increasingly this data is being saved, aggregated, and analyzed. These massive data flows present big challenges to firms, but also new opportunities for deriving insights.

Presented at the June 2010 gathering of the Bay Area's Business Intelligence Special Interest Group.

  • I’m Mike Driscoll, founder of Dataspora LLC, a boutique analytics firm based in San Francisco. Before coming out to the Bay Area, I worked on the Human Genome Project and got a doctorate in Computational Biology. Today I’m going to talk about Big Data, Data Science, and some tips for the Data Scientist.
  • If you had to put your finger on the beginning of the information age, it might be the creation of the first telegraph in 1792, in France, by a pair of brothers. It was the first time that man-made information began traveling at the speed of light over long distances. Cars, cash registers, subway turnstiles, gene chips, TiVos, and cell phones are streaming billions of data points. We live in a world exploding with data. In any given minute, databases somewhere are tracking mouse clicks on web sites, point-of-sale purchases, rider swipes through subway turnstiles, physician prescriptions, digital video recorder rewinds, and the location of every GPS-enabled car and phone on the planet. Prof. Joe Hellerstein of Berkeley has dubbed it “The Industrial Revolution of Data” – where machines, not people, are the dominant producers of data. So the world is streaming billions of data points per minute. This is Big Data – capital B, capital D. Ben Lorica of O’Reilly Media has said Big Data is “data that you have to think about” when storing, analyzing, or otherwise grappling with it. But capturing data isn’t enough. We need tools to make sense of it. At Facebook, they call their data analysts “data scientists”. I like this term, because it captures the point of collecting this data: testing hypotheses about the world. And to test hypotheses using Big Data, we need statistics.
  • In this talk I’m also going to be talking about tools for medium data, because these translate well into the Big Data space.
  • I’m defining Data Science as applying tools to data to answer questions. It is at the intersection of these tools. And it is a growing field, because data is getting bigger and our tools are getting better. (Suffice it to say, the questions we ask have been around since time immemorial: who…) Another word for questions is hypotheses. I’ll talk about tools for munging; the answers to these questions are…
  • Do you really need Hadoop for that job? Think twice about it. Can you do everything on one machine? Escalate only as necessary… don’t solve problems that don’t yet exist. At the same time, optimize for scalability, not performance. Cleverness is usually punished in the long run.
  • Compressing gives you an immediate 6-8x bump in network and disk IO, right out of the gate. This example also illustrates another point: avoid hitting disk at all costs. If you’re working on the cloud, …
  • This is the essence of parallelism, and in fact, of big data: the key is to find some independent dimension on which to split your data. Otherwise everything sits together, in a monolithic file system, database, or data store -- which often spells disaster. * Even if your data isn’t in a database, split it up the old-fashioned way – one file per hour, day, or month, depending on its size – these often form natural samples to work from. * Learn and understand how to partition, shard, or otherwise distribute your data in a database. * Parallel load is your friend: several databases have parallel load features; Hadoop has distcp.
  • Do you want to be moving GBs and TBs around? Sometimes you want to visualize and work on the data locally… so sample! * Reservoir sampling is a fixed-memory algorithm for drawing a sample of a defined size. * The above illustrates how to get a basic 1% uniform sample in a Perl one-liner.
  • When we compare two real-valued measures, they will almost always be different. The critical question is: how confident are we in the difference? Is it significant? There’s also something to be said for differences that are significant but so small in magnitude as to be meaningless. (I once sat through a heart drug presentation, which showed a significant but inconsequential difference versus aspirin. The price differential was not inconsequential, however.)
  • Don’t reinvent the wheel; steal someone else’s wheels of 1s and 0s. Statistics is hard, so go ahead and use someone else’s stuff. It’s there. Just today I cribbed code from StackOverflow to make a heatmap in R. That’s what’s great about R: 2,000 statistical libraries written by professors.
  • Not machines, people.
  • Okay, now I want you to try and forget everything you just heard about base graphics. ggplot2 is a new visualization package formally released in 2009, developed by Professor Hadley Wickham. It is based on a different perspective of developing graphics, and has its own set of functions and parameters.
  • Most telcos lose 1-2% of their customers every month. It’s 7x more expensive to acquire a customer than to retain one.
  • This illustrates what we said earlier: statistics matters. We needed to rule this out. (If anything, the correlation occurs opposite of what we expected.)
  • “A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009. Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I’ll also discuss some methods for visualizing large data sets. I’ll end with an overview of Rapache, a tool for embedding R in web applications. For questions beyond this talk, I can be contacted at: Michael E Driscoll, http://www.dataspora.com, mike@dataspora.com.
  • Windowing functions in Greenplum, which is a modified Postgres distributed database.
  • The stack is loosely coupled: right tool for the right job. No one firm can do it all. There aren’t, not yet at least, out-of-the-box solutions for getting through this: the data scientists occupy the middle. Big Data is disrupting this entire stack: at the bottom, new DB firms like Aster; in the middle, the same revo… You know who sits on the top of that stack? We do. That’s why storytelling is such an important skill.
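
The reservoir sampling mentioned in the notes on working with samples can be sketched as follows. This is a minimal Python illustration of the classic single-pass approach (often called Algorithm R), not code from the talk; the function name `reservoir_sample` and its parameters are assumptions made for this example. Unlike the Perl one-liner, which keeps roughly 1% of lines, it returns exactly `k` items while holding only `k` items in memory.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration: memory use is O(k) no matter
    how long the stream is, and the stream is read exactly once.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 1,000 values from a stream of 100,000 without materializing the stream.
    sample = reservoir_sample(range(100_000), k=1000, seed=42)
    print(len(sample))
```

For a file too large to load, the same function could be fed an open file handle, since iterating a file yields one line at a time.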

Winning With Big Data: Secrets of the Successful Data Scientist

    1. WINNING WITH BIG DATA: Secrets of the Successful Data Scientist. SDForum BI SIG, June 15, 2010. Michael Driscoll, @dataspora
    2. WHY DATA MATTERS NOW
    3. THE INDUSTRIAL AGE OF DATA
    4. WHAT IS BIG DATA? Data that is distributed.
    5. WHAT IS DATA SCIENCE?
    6. WHY DATA SCIENCE IS SEXY
    7. “The sexy job in the next ten years will be statisticians…” - Hal Varian
    8.
    9. data (1000 bytes), model (2 bytes)
    10. 9 WAYS TO WIN WITH DATA
    11. 1. CHOOSE THE RIGHT TOOL. You don’t need a chainsaw to cut butter.
    12. 2. COMPRESS EVERYTHING
        mysqldump -u myuser -pmypass sourceDB | gzip | ssh mike@dataspora.com "cat - | gunzip | mysql -u myuser -pmypass targetDB"
        The world is IO-bound.
    13. 3. SPLIT UP YOUR DATA. Split, apply, combine.
    14. 4. WORK WITH SAMPLES
        perl -ne "print if (rand() < 0.01)" data.csv > sample.csv
        Big Data is heavy, samples are light.
    15. 5. USE STATISTICS
    16. 6. COPY FROM OTHERS
        git clone git://github.com/kevinweil/hadoop-lzo
        Use open source.
    17. 7. ESCHEW CHART TYPOLOGIES. Charts are compositions, not containers.
    18. 8. COLOR WITH CARE. Color can enhance or insult.
    19. 9. TELL A STORY. People are listening.
    20. ONE SUCCESS STORY
    21. WHY DO TELCO CUSTOMERS LEAVE? Sign up. Leave. Goal: “less churn.”
    22. DATA: BILLIONS OF CALLS … and millions of callers.
    23. DOES CALL QUALITY MATTER? … a difference, but not significant.
    24. WHAT ABOUT SOCIAL NETWORKS? Hmmm...
    25. BUILD THE CALL GRAPH … but is it predictive?
    26. EVOLUTION OF A CALL GRAPH: April
    27. EVOLUTION OF A CALL GRAPH: May
    28. EVOLUTION OF A CALL GRAPH: June
    29. EVOLUTION OF A CALL GRAPH: July
    30. 700% INCREASE IN CHURN when a cancellation occurs in a call network.
    31. FINAL THOUGHTS
    32. THE BIG DATA STACK: Actions; Data Products (Content Filters, Rec Engines); Analytics (R, SPSS, SAS, SAP); Insights; Big Data; Dedicated RDBMS; Data
    33. THANKS! QUESTIONS? Michael Driscoll, med@dataspora.com, @dataspora on Twitter, http://www.dataspora.com/blog. SDForum BI SIG, June 15, 2010
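
Tip 3, "Split, apply, combine," might look roughly like this in Python. It is only a sketch under stated assumptions: the call records, the `(day, seconds)` layout, and the helper names are all made up for illustration and do not come from the talk. The day serves as the independent dimension, so each partition can be processed without looking at any other.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Call = Tuple[str, int]  # (day, call length in seconds); hypothetical record layout


def split_by_key(records: Iterable[Call], key_fn: Callable[[Call], Hashable]) -> Dict[Hashable, List[Call]]:
    """Split: group records into independent partitions by key."""
    parts: Dict[Hashable, List[Call]] = defaultdict(list)
    for record in records:
        parts[key_fn(record)].append(record)
    return dict(parts)


def total_seconds(calls: List[Call]) -> int:
    """Apply: a per-partition computation that needs no other partition."""
    return sum(seconds for _day, seconds in calls)


def split_apply_combine(records, key_fn, apply_fn, workers: int = 4):
    """Combine: run apply_fn on every partition and collect results by key."""
    parts = split_by_key(records, key_fn)
    keys = list(parts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as its inputs.
        results = list(pool.map(apply_fn, (parts[k] for k in keys)))
    return dict(zip(keys, results))


if __name__ == "__main__":
    calls = [("2010-06-14", 60), ("2010-06-15", 30), ("2010-06-14", 15)]
    print(split_apply_combine(calls, key_fn=lambda c: c[0], apply_fn=total_seconds))
```

Threads are used here only to keep the sketch short; for CPU-heavy work the partitions could be handed to separate processes or machines, which is where one file per day pays off.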

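
Tip 5 and the call-quality slide both turn on whether an observed difference is significant. One simple way to check, sketched here in Python, is a permutation test on the difference of two group means. This is a swapped-in illustration, not the method used in the talk; the function name `permutation_p_value`, its parameters, and the toy numbers are assumptions.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(a: Sequence[float], b: Sequence[float],
                        n_permutations: int = 10_000, seed: int = 0) -> float:
    """Estimate a two-sided p-value for the difference in means of a and b.

    Hypothetical helper: shuffles the pooled values many times and counts
    how often a random split is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Toy, made-up numbers: two clearly separated groups.
    group_a = list(range(10, 20))
    group_b = list(range(0, 10))
    print(permutation_p_value(group_a, group_b, n_permutations=2000))
```

A small p-value only says the difference is unlikely to be chance. Whether the difference is large enough to matter, as in the aspirin anecdote, still has to be judged from its magnitude.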