Analysis Is Painless

  • 1. Analysis is Painless: Data Analysis 101
  • 2. Why is Data Analysis Scary?
    • Data is overwhelming
    • Lots of setup for little return
    • Resources are scarce
    • Takes a lot of time
    • Other reasons?
  • 3. Get over the hump…
    • Pick a tool – doesn’t matter what it is
      • Programmer? Use a framework like Hadoop
      • Database guy? Use an analytic DBMS like Vertica
    • Run a sample or demo first
    • Grab a little data – start with 1GB
      • Goal is to develop a pattern that you can brute force
      • Functionality first, optimization later
  • 4. Hadoop
    • You write just *three* functions:
      • Map, Reduce and Main
    • Map: parse data into key, value pairs
      • Value can be complex formats too
    • Reduce: combine the values for matching keys (optional)
      • WordCount example sums counts for same words
    • Main
      • Orchestrate the whole process
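The three roles above can be sketched in plain Java, with no Hadoop dependency, to show how the pieces fit together. This is only an illustration of the pattern, not Hadoop's API: the class name `WordCountSketch` and the methods `map`, `reduce` and `run` are hypothetical, and the in-memory grouping step stands in for the shuffle that the framework would normally do between Map and Reduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, framework-free sketch of the Map / Reduce / Main pattern.
class WordCountSketch {

    // Map: parse one line of input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce: combine all values that share a key; here, sum the counts.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Main's job: orchestrate map -> group by key -> reduce.
    static Map<String, Integer> run(List<String> lines) {
        // Group the mapped values by key (the framework's shuffle step).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        // Reduce each key's values to a single total.
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            totals.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be or not to be", "to analyze is painless");
        // prints {analyze=1, be=2, is=1, not=1, or=1, painless=1, to=3}
        System.out.println(run(lines));
    }
}
```

Running everything in one process like this is the "brute force" version; the point of a framework such as Hadoop is that the same two functions can then be spread across many machines.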
  • 5. What does MAIN do?
    • Start with a JobConf and give it a name
    • JobConf conf = new JobConf(WordCount.class);
    • conf.setJobName("wordcount");
    • Set the key and value classes for the job's output (the Mapper's output uses the same classes unless overridden)
    • conf.setOutputKeyClass(Text.class);
    • conf.setOutputValueClass(IntWritable.class);
    • Tell it the classes (Combiner is an optional special case reducer)
    • conf.setMapperClass(Map.class);
    • conf.setCombinerClass(Reduce.class);
    • conf.setReducerClass(Reduce.class);
    • Set the format classes for input and output
    • conf.setInputFormat(TextInputFormat.class);
    • conf.setOutputFormat(TextOutputFormat.class);
    • Format specific initialization
    • FileInputFormat.setInputPaths(conf, new Path(args[0]));
    • FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    • Run!
    • JobClient.runJob(conf);
  • 6. Vertica
    • SQL RDBMS designed for Analytics
    • Easy to tune, scales by adding nodes
    • Develop a schema (CREATE TABLE)
    • Copy in ASCII files (or ETL them)
    • Run queries (SELECT COUNT(*) FROM…)
  • 7. Basic DBMS
    • Define a schema
    • CREATE TABLE words
    • ( word VARCHAR(64), num INT);
    • Load some data
    • COPY words FROM '/home/user/words.csv'
    • DELIMITER ',' NULL '';
    • or
    • INSERT INTO words VALUES('barcamp', 29081234);
    • Query it
    • SELECT word, SUM(num) AS total FROM words GROUP BY word ORDER BY total DESC LIMIT 10;
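For readers coming from the programming side, the query on this slide can be read as three steps: group rows by word, sum `num` per group, then sort by the total and keep a limited number of rows. The plain-Java sketch below mirrors those steps in memory. It assumes the intent is the largest totals first, so it sorts in descending order. The class `TopWordsSketch`, the `Row` type, the method `topWords` and all sample rows except the `barcamp` one are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory equivalent of the GROUP BY / SUM / ORDER BY / LIMIT query.
class TopWordsSketch {

    // One row of the words table: (word VARCHAR(64), num INT).
    static final class Row {
        final String word;
        final long num;

        Row(String word, long num) {
            this.word = word;
            this.num = num;
        }
    }

    static Map<String, Long> topWords(List<Row> rows, int limit) {
        // GROUP BY word, SUM(num) AS total
        Map<String, Long> totals = new TreeMap<>();
        for (Row r : rows) {
            totals.merge(r.word, r.num, Long::sum);
        }
        // ORDER BY total, largest first (assumed intent)
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        // LIMIT: keep at most `limit` rows, preserving the sorted order
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : sorted.subList(0, Math.min(limit, sorted.size()))) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("barcamp", 29081234L), // row from the INSERT example
                new Row("hadoop", 120L),       // made-up sample rows
                new Row("vertica", 75L),
                new Row("hadoop", 30L));
        // prints {barcamp=29081234, hadoop=150}
        System.out.println(topWords(rows, 2));
    }
}
```

The database does the same work declaratively, which is why the SQL version fits on one line.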