Analysis Is Painless

    Analysis Is Painless - Presentation Transcript

    1. Analysis is Painless: Data Analysis 101 [email_address]
    2. Why is Data Analysis Scary?
      • Data is overwhelming
      • Lots of setup for little return
      • Resources are scarce
      • Takes a lot of time
      • Other reasons?
    3. Get over the hump…
      • Pick a tool – doesn’t matter what it is
        • Programmer? Use a framework like Hadoop
        • Database guy? Use analytic DBMS like Vertica
      • Run a sample or demo first
      • Grab a little data – start with 1GB
        • Goal is to develop a pattern that you can brute force
        • Functionality first, optimization later
    4. Hadoop
      • http://wiki.apache.org/hadoop
      • You write just *three* functions:
        • Map, Reduce and Main
      • Map: parse data into key, value pairs
        • Value can be complex formats too
      • Reduce: optionally combine matching keys
        • WordCount example sums counts for same words
      • Main
        • Orchestrate the whole process
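The division of labor among Map, Reduce and Main can be sketched without a cluster. The following is a minimal in-memory illustration in plain Java, not Hadoop code: the class name `WordCountSketch` and its method names are hypothetical, and the grouping loop in `wordCount` stands in for the shuffle that the framework would normally perform between the two phases.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; it does not use the Hadoop API.
class WordCountSketch {

    // Map: parse one line of input into (word, 1) key/value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce: combine all values that were emitted for the same key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Main: map every line, group the pairs by key, then reduce each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(wordCount(List.of("hello world", "hello hadoop")));
    }
}
```

Once a brute-force pattern like this works on a small sample, the same map and reduce logic is what would move into the framework's Mapper and Reducer classes.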
    5. What does MAIN do?
      • Start with a JobConf and give it a name
      • JobConf conf = new JobConf(WordCount.class);
      • conf.setJobName("wordcount");
      • Set the key and value classes that the Mapper will output and Reducer will input
      • conf.setOutputKeyClass(Text.class);
      • conf.setOutputValueClass(IntWritable.class);
      • Tell it the classes (Combiner is an optional special-case reducer)
      • conf.setMapperClass(Map.class);
      • conf.setCombinerClass(Reduce.class);
      • conf.setReducerClass(Reduce.class);
      • Set the format classes for input and output
      • conf.setInputFormat(TextInputFormat.class);
      • conf.setOutputFormat(TextOutputFormat.class);
      • Format-specific initialization
      • FileInputFormat.setInputPaths(conf, new Path(args[0]));
      • FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      • Run!
      • JobClient.runJob(conf);
    6. Vertica
      • http://www.vertica.com
      • SQL RDBMS designed for Analytics
      • Easy to tune, scales by adding nodes
      • Develop a schema (CREATE TABLE)
      • Copy in ASCII files (or ETL them)
      • Run queries (SELECT COUNT(*) FROM…)
    7. Basic DBMS
      • Define a schema
      • CREATE TABLE words
      • ( word VARCHAR(64), num INT);
      • Load some data
      • COPY words FROM '/home/user/words.csv'
      • DELIMITER ',' NULL '';
      • or
      • INSERT INTO words VALUES('barcamp', 29081234);
      • Query it
      • SELECT word, SUM(num) AS total FROM words GROUP BY word ORDER BY total LIMIT 10;
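To make explicit what the query on the last slide computes, here is a rough in-memory equivalent in plain Java. It is only a sketch under stated assumptions: `WordsQuerySketch`, `Row` and `lowestTotals` are hypothetical names, the sample rows other than the 'barcamp' row are invented for illustration, and a real DBMS would of course execute this very differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the GROUP BY / ORDER BY / LIMIT query over in-memory rows.
class WordsQuerySketch {

    // One row of the words table: (word VARCHAR(64), num INT).
    static final class Row {
        final String word;
        final long num;

        Row(String word, long num) {
            this.word = word;
            this.num = num;
        }
    }

    static Map<String, Long> lowestTotals(List<Row> rows, int limit) {
        // GROUP BY word with SUM(num) AS total
        Map<String, Long> totals = new TreeMap<>();
        for (Row row : rows) {
            totals.merge(row.word, row.num, Long::sum);
        }

        // ORDER BY total: ascending, which is the SQL default when DESC is not given
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort(Map.Entry.comparingByValue());

        // LIMIT n: keep only the first n rows, preserving the sort order
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : sorted) {
            if (result.size() >= limit) {
                break;
            }
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("barcamp", 29081234L),
            new Row("hadoop", 5L),   // illustrative sample row
            new Row("hadoop", 7L),   // illustrative sample row
            new Row("vertica", 3L)); // illustrative sample row
        // prints {vertica=3, hadoop=12}
        System.out.println(lowestTotals(rows, 2));
    }
}
```

Because ORDER BY sorts ascending by default, the query as written returns the ten words with the smallest totals; adding DESC after `total` would return the ten largest instead.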

    + otrajmanotrajman, 8 months ago

    © All Rights Reserved
