Winning with Big Data: Secrets of the Successful Data Scientist


A new class of professionals, called data scientists, has emerged to address the Big Data revolution. In this talk, I discuss nine skills for munging, modeling, and visualizing Big Data. Then I present a case study that applies these skills: the analysis of billions of call records to predict customer churn at a North American telecom.

http://en.oreilly.com/datascience/public/schedule/detail/15316

  • If you had to put your finger on the beginning of the information age, it might be the creation of the first telegraph in 1792, in France, by a pair of brothers. It was the first time that man-made information traveled at the speed of light over long distances. Today cars, cash registers, subway turnstiles, gene chips, TiVos, and cell phones are streaming billions of data points. Prof. Joe Hellerstein of Berkeley has dubbed it “The Industrial Revolution of Data” – where machines, not people, are the dominant producers of data.
  • In this talk I’m also going to be talking about tools for medium data, because these translate well into the Big Data space.
  • I’m defining data science as applying tools to data to answer questions. It is at the intersection of these tools. And it is a growing field, because data is getting bigger and our tools are getting better. (Suffice it to say, the questions we ask have been around since time immemorial: who ...) Another word for questions is hypotheses.
  • Do you really need Hadoop for that job? Think twice about it. Can you do everything on one machine? Escalate only as necessary… don’t solve problems that don’t yet exist. At the same time, optimize for scalability, not performance. Cleverness is usually punished in the long run.
  • Compressing gives you a 6-8x bump immediately in network and disk IO, out of the gate. This example also illustrates another piece: avoid hitting disk at all costs. If you’re working on the cloud,
  • This is the essence of parallelism: find some independent dimension on which to split your data.
    * Even if your data isn’t in a database, split it up the old-fashioned way – one file per hour, day, or month, depending on its size – these often form natural samples to work from.
    * Learn and understand how to partition, shard, or otherwise distribute your data in a database.
    * Parallel load is your friend: several databases have parallel load features; Hadoop has distcp.
  • Do you want to be moving GBs and TBs around? Sometimes you want to visualize and work on the data locally… so sample!
    * Reservoir sampling is a fixed-memory algorithm for drawing a sample of a defined size.
    * The Perl one-liner on the slide shows how to draw a basic 1% uniform sample.
  • When we compare two real-valued measures, they will almost always be different. The critical question is: How confident are we in the difference? Is it significant?
  • Don’t reinvent the wheel, steal someone else’s wheels of 1s and 0s. Statistics is hard – so go ahead and use someone else’s stuff. Go ahead. It’s there. That’s what’s great about R: 2000 statistical libraries written by professors.
  • Not machines, people.
  • Okay, now I want you to try and forget everything you just heard about base graphics. ggplot2 is a new visualization package formally released in 2009, developed by Professor Hadley Wickham. It is based on a different perspective of developing graphics, and has its own set of functions and parameters.
  • Most telcos lose 1-2% of their customers every month. It’s 7x more expensive to acquire a customer than to retain one.
  • Not machines, people.
  • This illustrates what we said earlier: statistics matters. We needed to rule this out. (If anything, the correlation occurs opposite of what we expected.)
  • “A Survey of R Graphics” – presented to the LA R Users Group, June 18, 2009. Today I’m going to go through a survey of data visualization functions and packages in R. In particular, I’ll discuss three approaches for data visualization in R: (i) the built-in base graphics functions, (ii) the ggplot2 package, and (iii) the lattice package. I’ll also discuss some methods for visualizing large data sets. I’ll end with an overview of Rapache, a tool for embedding R in web applications. For questions beyond this talk, I can be contacted at: Michael E Driscoll, http://www.dataspora.com, mike@dataspora.com.
  • Windowing functions in Greenplum, which is a modified Postgres distributed database.

Winning with Big Data: Secrets of the Successful Data Scientist (Presentation Transcript)

  • WINNING
    WITH
    BIG
    DATA
    Secrets of the Successful
    Data Scientist
    Making Data Work
    June 9, 2010
    Michael Driscoll
    @dataspora
  • WHY DATA
    MATTERS
  • THE INDUSTRIAL
    AGE
    OF
    DATA
  • WHAT IS
    BIG DATA?
    Data that is distributed.
  • WHAT IS
    DATA
    SCIENCE?
  • NINE WAYS
    TO WIN
  • 1. CHOOSE THE
    RIGHT TOOL
    You don’t need a chainsaw to cut butter.
  • 2. COMPRESS EVERYTHING
    mysqldump -u myuser -pmypass sourceDB |
    gzip | ssh mike@dataspora.com "cat - |
    gunzip | mysql -u myuser -pmypass targetDB"
    The world is IO-bound.
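
To get a feel for how much compression can save before data crosses the network, here is a minimal Python sketch that measures a gzip ratio. The call-record rows are hypothetical and highly repetitive; real ratios depend entirely on the data, so the number printed here says nothing about the 6-8x figure from the talk.

```python
import gzip


def compression_ratio(raw: bytes) -> float:
    """Return uncompressed size divided by gzip-compressed size."""
    return len(raw) / len(gzip.compress(raw))


# Hypothetical call-record rows (date, caller, callee, seconds).
# These are made up for illustration and are not from the talk.
rows = [
    f"2010-06-09,555-01{i % 100:02d},555-02{(i * 7) % 100:02d},{i % 600}\n"
    for i in range(10_000)
]
raw = "".join(rows).encode("utf-8")

ratio = compression_ratio(raw)
print(f"{len(raw)} bytes raw, compressed about {ratio:.1f}x smaller")
```

Running the same function on a sample of your own export is a quick way to decide whether compressing in the pipeline is worth the CPU time.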
  • 3. SPLIT UP
    YOUR DATA
    Split, apply, combine.
    See Hadley Wickham’s paper at http://had.co.nz/plyr/plyr-intro-090510.pdf
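
The split, apply, combine pattern can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the plyr API; the record fields (`day`, `minutes`) and function names are hypothetical.

```python
from collections import defaultdict


def split_by_key(records, key):
    """Split: group records by the value of one independent dimension."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    return groups


def total_minutes(group):
    """Apply: compute one summary per group."""
    return sum(rec["minutes"] for rec in group)


def split_apply_combine(records, key, func):
    """Combine: collect the per-group results into one dict."""
    return {k: func(g) for k, g in split_by_key(records, key).items()}


# Hypothetical call records, one dict per call.
calls = [
    {"day": "2010-06-01", "minutes": 3},
    {"day": "2010-06-01", "minutes": 5},
    {"day": "2010-06-02", "minutes": 2},
]
totals = split_apply_combine(calls, "day", total_minutes)
print(totals)
```

Because each group is processed independently, the apply step is the part that can be handed to separate processes or machines.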
  • 4. WORK
    WITH SAMPLES
    perl -ne "print if (rand() < 0.01)"
    data.csv > sample.csv
    Big Data is heavy,
    samples are light.
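
The Perl one-liner yields roughly 1% of the rows, so the sample size grows with the data. When a fixed-size sample is wanted instead, reservoir sampling does it in one pass with fixed memory. Below is a minimal Python sketch of the classic version (often called Algorithm R); the function name and the seed are arbitrary choices for illustration.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable.

    Only k items are held in memory, and the stream is read once.
    If the stream has fewer than k items, all of them are returned.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(100_000), 100, random.Random(42))
print(len(sample))
```

In practice the stream would be the lines of a file, for example `reservoir_sample(open("data.csv"), 1000)`.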
  • 5. USE
    STATISTICS
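
One simple way to ask whether a difference between two groups is significant is a permutation test on the difference in means. The sketch below is a standard-library illustration, not the method used in the talk; the function name, the number of permutations, and the sample values are all assumptions.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in means.

    Labels are shuffled repeatedly; the p-value is the share of
    shuffles whose difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up measurements for two groups of customers.
group_a = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 0.9]
group_b = [5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 5.1, 4.9]
p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

A small p-value says the observed difference would be rare if the group labels did not matter; a large one says the difference could easily be noise.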
  • 6. COPY
    FROM OTHERS
    git clone git://github.com/kevinweil/hadoop-lzo
    Use open source.
  • 7. ESCAPE
    CHART TYPOLOGIES
    Charts are compositions,
    not containers.
  • 8. USE COLOR
    WISELY
    Color can enhance
    or insult.
  • 9. TELL A STORY
    People are listening.
  • ONE
    SUCCESS
    STORY
  • WHY DO TELCO CUSTOMERS LEAVE?
    Sign up
    Leave
    Goal: “less churn.”
  • DATA:
    BILLIONS
    OF CALLS
    … and millions of callers.
  • DOES CALL
    QUALITY
    MATTER?
    … a difference,
    but not significant.
  • WHAT ABOUT
    SOCIAL
    NETWORKS?
    Hmmm...
  • BUILD THE
    CALL GRAPH
    … but is it predictive?
  • EVOLUTION OF A CALL GRAPH
    April
  • EVOLUTION OF A CALL GRAPH
    May
  • EVOLUTION OF A CALL GRAPH
    June
  • EVOLUTION OF A CALL GRAPH
    July
  • 700% INCREASE
    IN CHURN
    when a cancellation
    occurs in a call network.
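
For reference, a 700% increase means the churn rate in affected call networks is 8 times the baseline rate (the baseline plus seven times the baseline). The comparison behind such a figure can be sketched as follows, assuming a call graph given as pairs of customers and two sets of churners from consecutive periods. The function, the customer IDs, and the toy data are hypothetical, and the resulting rates carry no meaning beyond showing the calculation.

```python
from collections import defaultdict


def churn_rates_by_exposure(edges, earlier_churners, later_churners):
    """Compare later churn for customers with and without a churned contact.

    edges: iterable of (customer, customer) pairs from the call graph.
    earlier_churners, later_churners: sets of customer IDs.
    Returns (rate among exposed customers, rate among unexposed customers).
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    exposed, unexposed = [], []
    for customer, contacts in neighbors.items():
        if customer in earlier_churners:
            continue  # already gone, so not at risk in the later period
        if contacts & earlier_churners:
            exposed.append(customer)
        else:
            unexposed.append(customer)

    def rate(group):
        if not group:
            return 0.0
        return sum(c in later_churners for c in group) / len(group)

    return rate(exposed), rate(unexposed)


# Toy graph: a chain of seven customers, invented for illustration.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("f", "g")]
rates = churn_rates_by_exposure(edges, {"a"}, {"b", "e"})
print(rates)
```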
  • THANKS!
    QUESTIONS?
    Michael Driscoll
    twitter @dataspora
    http://www.dataspora.com/blog
    Making Data Work
    June 9, 2010