Analysis Is Painless
Published in: Technology

0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
563
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
3
Comments
0
Likes
0
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide

Transcript

  • 1. Analysis is Painless: Data Analysis 101 [email_address]
  • 2. Why is Data Analysis Scary?
    • Data is overwhelming
    • Lots of setup for little return
    • Resources are scarce
    • Takes a lot of time
    • Other reasons?
  • 3. Get over the hump…
    • Pick a tool – doesn’t matter what it is
      • Programmer? Use a framework like Hadoop
      • Database guy? Use an analytic DBMS like Vertica
    • Run a sample or demo first
    • Grab a little data – start with 1GB
      • Goal is to develop a pattern that you can brute force
      • Functionality first, optimization later
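
One simple way to "grab a little data" is to copy whole lines from the start of a large file until a byte budget is reached. The sketch below is plain Java and only an illustration: the class name `SampleHead`, the method `copyHead`, and the command-line paths are hypothetical, and taking the head of a file is an assumption (the first gigabyte may not be representative of the whole data set).

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: copies whole lines from the head of a text file
// until roughly maxBytes have been written.
final class SampleHead {

    // Returns the number of bytes written (line content plus one '\n' per line).
    static long copyHead(Path in, Path out, long maxBytes) throws IOException {
        long written = 0;
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                long size = line.getBytes(StandardCharsets.UTF_8).length + 1L;
                if (written + size > maxBytes) {
                    break; // stop before exceeding the budget; never split a line
                }
                writer.write(line);
                writer.write('\n');
                written += size;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java SampleHead <input> <output>");
            return;
        }
        long oneGb = 1024L * 1024L * 1024L; // the 1GB starting point from the slide
        long n = copyHead(Paths.get(args[0]), Paths.get(args[1]), oneGb);
        System.out.println("Wrote " + n + " bytes");
    }
}
```

Once a pattern works on the sample, the same job can be pointed at the full data set.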
  • 4. Hadoop
    • http://wiki.apache.org/hadoop
    • You write just *three* functions:
      • Map, Reduce and Main
    • Map: parse data into key/value pairs
      • Values can be complex formats too
    • Reduce: optionally combine the values of matching keys
      • The WordCount example sums the counts for each word
    • Main
      • Orchestrate the whole process
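
The three roles can be imitated in plain Java without Hadoop, which may help before touching the real API. This is a sketch under stated assumptions, not Hadoop code: `WordCountSketch`, `map`, `reduce`, and `run` are hypothetical names, everything runs in memory on one machine, and the grouping step stands in for what the framework does between map and reduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of Map, Reduce and Main; it does not use Hadoop.
final class WordCountSketch {

    // Map: parse one line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: combine all values that share a key, here by summing them.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // "Main": run map over every line, group values by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello hadoop", "hello world")));
    }
}
```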
  • 5. What does MAIN do?
    • Start with a JobConf and give it a name
      • JobConf conf = new JobConf(WordCount.class);
      • conf.setJobName("wordcount");
    • Set the key and value classes of the job output; by default the Mapper output (and so the Reducer input) uses the same classes
      • conf.setOutputKeyClass(Text.class);
      • conf.setOutputValueClass(IntWritable.class);
    • Tell it the classes (the Combiner is an optional, special-case Reducer)
      • conf.setMapperClass(Map.class);
      • conf.setCombinerClass(Reduce.class);
      • conf.setReducerClass(Reduce.class);
    • Set the format classes for input and output
      • conf.setInputFormat(TextInputFormat.class);
      • conf.setOutputFormat(TextOutputFormat.class);
    • Format-specific initialization
      • FileInputFormat.setInputPaths(conf, new Path(args[0]));
      • FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    • Run!
      • JobClient.runJob(conf);
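
The main above registers `Reduce` as both the combiner and the reducer. That is safe for WordCount because addition is associative and commutative, so pre-summing each map task's output locally does not change the final totals. The plain-Java sketch below illustrates the idea without Hadoop; `CombinerSketch` and its methods are hypothetical names, and the "splits" are just strings standing in for input chunks.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration (no Hadoop) of a summing reducer doubling as a combiner.
final class CombinerSketch {

    // Map: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: sum the values of matching keys. Input and output are both
    // (word, count) pairs, which is what lets the same logic act as a combiner.
    static List<Map.Entry<String, Integer>> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(totals.entrySet());
    }

    // Run the job over several splits, optionally pre-summing each split's output.
    static List<Map.Entry<String, Integer>> run(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapped = map(split);
            shuffled.addAll(useCombiner ? reduce(mapped) : mapped);
        }
        return reduce(shuffled);
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("a b a a", "b a");
        System.out.println(run(splits, false)); // prints [a=4, b=2]
        System.out.println(run(splits, true));  // prints [a=4, b=2]
    }
}
```

The combiner only reduces how many pairs travel between the map and reduce steps; a reducer whose operation is not associative and commutative (an average, for example) cannot be reused this way without changes.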
  • 6. Vertica
    • http://www.vertica.com
    • SQL RDBMS designed for Analytics
    • Easy to tune, scales by adding nodes
    • Develop a schema (CREATE TABLE)
    • Copy in ASCII files (or ETL them)
    • Run queries (SELECT COUNT(*) FROM…)
  • 7. Basic DBMS
    • Define a schema
      • CREATE TABLE words (word VARCHAR(64), num INT);
    • Load some data
      • COPY words FROM '/home/user/words.csv' DELIMITER ',' NULL '';
      • or
      • INSERT INTO words VALUES ('barcamp', 29081234);
    • Query it
      • SELECT word, SUM(num) AS total FROM words GROUP BY word ORDER BY total LIMIT 10;
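
For readers coming from the programming side, the final query reads as: group the rows by `word`, sum `num` within each group, sort by that sum, and keep ten rows. The plain-Java sketch below performs the same computation in memory. It is an illustration only: `GroupBySketch` and `Row` are hypothetical names, and every sample row except ('barcamp', 29081234) is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory equivalent of the GROUP BY query; no database involved.
final class GroupBySketch {

    // One row of the words table: (word VARCHAR, num INT).
    static final class Row {
        final String word;
        final long num;

        Row(String word, long num) {
            this.word = word;
            this.num = num;
        }
    }

    // Mirrors: SELECT word, SUM(num) AS total FROM words
    //          GROUP BY word ORDER BY total LIMIT <limit>;
    static Map<String, Long> query(List<Row> words, int limit) {
        // GROUP BY word, SUM(num)
        Map<String, Long> totals = new TreeMap<>();
        for (Row row : words) {
            totals.merge(row.word, row.num, Long::sum);
        }
        // ORDER BY total (ascending, the SQL default)
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort(Map.Entry.comparingByValue());
        // LIMIT
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : sorted.subList(0, Math.min(limit, sorted.size()))) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> words = Arrays.asList(
            new Row("barcamp", 29081234L), // row from the slide
            new Row("hadoop", 5L),         // made-up sample rows
            new Row("hadoop", 7L),
            new Row("vertica", 3L));
        // prints {vertica=3, hadoop=12, barcamp=29081234}
        System.out.println(query(words, 10));
    }
}
```

Because `ORDER BY total` sorts ascending by default, the query as written returns the ten smallest totals; `ORDER BY total DESC` would return the ten largest instead.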