Analysis Is Painless

Transcript

  • 1. Analysis is Painless: Data Analysis 101
  • 2. Why is Data Analysis Scary?
    • Data is overwhelming
    • Lots of setup for little return
    • Resources are scarce
    • Takes a lot of time
    • Other reasons?
  • 3. Get over the hump…
    • Pick a tool – doesn’t matter what it is
      • Programmer? Use a framework like Hadoop
      • Database guy? Use an analytic DBMS like Vertica
    • Run a sample or demo first
    • Grab a little data – start with 1GB
      • Goal is to develop a pattern that you can brute force
      • Functionality first, optimization later
  • 4. Hadoop
    • http://wiki.apache.org/hadoop
    • You write just *three* functions:
      • Map, Reduce and Main
    • Map: parse data into key, value pairs
      • Value can be complex formats too
    • Reduce: optionally combine matching keys
      • WordCount example sums counts for same words
    • Main
      • Orchestrate the whole process
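
To make the division of labor concrete, here is a minimal plain-Java sketch of the word-count pattern. It deliberately uses no Hadoop classes: the `run` method is a hypothetical stand-in for what the framework does between the two steps (grouping values by key), and the class and method names are illustrative assumptions, not part of any Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map and reduce steps; not Hadoop code.
class WordCountSketch {

    // Map step: parse one line of text into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Reduce step: sum the counts that share the same key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Hypothetical stand-in for the framework: map every line,
    // group the emitted values by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            totals.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Prints {analysis=1, data=2, is=2, painless=1}
        System.out.println(run(List.of("data is data", "analysis is painless")));
    }
}
```

In a real Hadoop job the grouping and the distribution across machines are handled by the framework, so only the map and reduce logic plus the job setup need to be written.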
  • 5. What does MAIN do?
    • Start with a JobConf and give it a name
    • JobConf conf = new JobConf(WordCount.class);
    • conf.setJobName("wordcount");
    • Set the key and value classes of the job's output (the Mapper's output uses the same classes unless set separately)
    • conf.setOutputKeyClass(Text.class);
    • conf.setOutputValueClass(IntWritable.class);
    • Tell it the Mapper, Combiner and Reducer classes (the Combiner is an optional, special-case reducer)
    • conf.setMapperClass(Map.class);
    • conf.setCombinerClass(Reduce.class);
    • conf.setReducerClass(Reduce.class);
    • Set the format classes for input and output
    • conf.setInputFormat(TextInputFormat.class);
    • conf.setOutputFormat(TextOutputFormat.class);
    • Format-specific initialization
    • FileInputFormat.setInputPaths(conf, new Path(args[0]));
    • FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    • Run!
    • JobClient.runJob(conf);
  • 6. Vertica
    • http://www.vertica.com
    • SQL RDBMS designed for Analytics
    • Easy to tune, scales by adding nodes
    • Develop a schema (CREATE TABLE)
    • Copy in ASCII files (or ETL them)
    • Run queries (SELECT COUNT(*) FROM…)
  • 7. Basic DBMS
    • Define a schema
    • CREATE TABLE words
    • ( word VARCHAR(64), num INT);
    • Load some data
    • COPY words FROM '/home/user/words.csv'
    • DELIMITER ',' NULL '';
    • or
    • INSERT INTO words VALUES('barcamp', 29081234);
    • Query it
    • SELECT word, SUM(num) AS total FROM words GROUP BY word ORDER BY total LIMIT 10;
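
For readers who think in code rather than SQL, the query above can be read as a group, sum, sort, limit pipeline. The following is a rough in-memory Java sketch of that logic. The `Row` class, the sample rows and the tie-break on the word are illustrative assumptions and have nothing to do with how a DBMS such as Vertica actually executes the query.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// In-memory illustration of the GROUP BY / SUM / ORDER BY / LIMIT query.
class WordsQuerySketch {

    // One row of the words table: (word, num). Hypothetical helper type.
    static final class Row {
        private final String word;
        private final long num;

        Row(String word, long num) {
            this.word = word;
            this.num = num;
        }

        String word() { return word; }
        long num() { return num; }
    }

    // SELECT word, SUM(num) AS total FROM words
    // GROUP BY word ORDER BY total LIMIT 10;
    static Map<String, Long> query(List<Row> words) {
        return words.stream()
                // GROUP BY word, SUM(num)
                .collect(Collectors.groupingBy(Row::word, Collectors.summingLong(Row::num)))
                .entrySet().stream()
                // ORDER BY total (ascending, the SQL default); the tie-break
                // on the word is an addition to make the output deterministic
                .sorted(Map.Entry.<String, Long>comparingByValue()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                // LIMIT 10
                .limit(10)
                // LinkedHashMap keeps the sorted order
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<Row> words = List.of(
                new Row("barcamp", 5), new Row("data", 2),
                new Row("barcamp", 1), new Row("hadoop", 3));
        // Prints {data=2, hadoop=3, barcamp=6}
        System.out.println(query(words));
    }
}
```

Note that `ORDER BY total` sorts ascending by default, so the query as written returns the ten smallest totals. If the ten most frequent words are wanted, `ORDER BY total DESC` would be the usual form.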