Analysis is Painless
Data Analysis 101
[email_address]

Why is Data Analysis Scary?
  • Data is overwhelming
  • Lots of setup for little return
  • Resources are scarce
  • Takes a lot of time
  • Other reasons?

Get over the hump…
  • Pick a tool – doesn’t matter what it is
      • Programmer? Use a framework like Hadoop
      • Database guy? Use an analytic DBMS like Vertica
  • Run a sample or demo first
  • Grab a little data – start with 1GB
  • Goal is to develop a pattern that you can brute force
  • Functionality first, optimization later

Hadoop
http://wiki.apache.org/hadoop
  • You write just *three* functions: Map, Reduce and Main
  • Map: parse data into key, value pairs
      • Value can be complex formats too
  • Reduce: optionally combine matching keys
      • WordCount example sums counts for same words
  • Main: orchestrate the whole process

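To make the Map and Reduce roles concrete, here is a minimal sketch of the WordCount idea in plain Java, with no Hadoop dependency. The class name `WordCountSketch` and the `run` method are hypothetical; `run` only stands in for the grouping that the framework does between the two steps, and it is not how Hadoop itself is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a single-process imitation of map -> group by key -> reduce.
public class WordCountSketch {

    // Map step: parse one line of input into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Reduce step: combine all values that share a key by summing them.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Hypothetical stand-in for the framework: group mapped pairs by key,
    // then call reduce once per key.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            counts.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {data=1, hadoop=1, hello=2}
        System.out.println(run(List.of("hello hadoop", "hello data")));
    }
}
```

In a real Hadoop job the grouping in `run` is done by the framework across machines, so you only supply the map and reduce logic plus the job setup.
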
What does MAIN do?
  • Start with a JobConf and give it a name
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
  • Set the key and value classes that the Mapper will output and Reducer will input
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
  • Tell it the classes (Combiner is an optional special case reducer)
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
  • Set the format classes for input and output
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
  • Format specific initialization
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  • Run!
        JobClient.runJob(conf);

Vertica
http://www.vertica.com
  • SQL RDBMS designed for Analytics
  • Easy to tune, scales by adding nodes
  • Develop a schema (CREATE TABLE)
  • Copy in ASCII files (or ETL them)
  • Run queries (SELECT COUNT(*) FROM…)

Basic DBMS
  • Define a schema
        CREATE TABLE words (word VARCHAR(64), num INT);
  • Load some data
        COPY words FROM '/home/user/words.csv' DELIMITER ',' NULL '';
    or
        INSERT INTO words VALUES('barcamp', 29081234);
  • Query it
        SELECT word, SUM(num) AS total FROM words GROUP BY word ORDER BY total LIMIT 10;
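
For readers coming from the programming side, the query above can be read as "group rows by word, sum num per group, sort by the sum, keep the first few". The sketch below spells that out in plain Java. The names `WordTotalsSketch`, `Row` and `smallestTotals` are hypothetical, and breaking ties alphabetically is an assumption added here so the result is deterministic; SQL does not guarantee any order among equal totals.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustration only: an in-memory equivalent of
//   SELECT word, SUM(num) AS total FROM words GROUP BY word ORDER BY total LIMIT n;
public class WordTotalsSketch {

    // One row of the words table (word, num).
    public static final class Row {
        final String word;
        final long num;

        public Row(String word, long num) {
            this.word = word;
            this.num = num;
        }

        String word() { return word; }
        long num() { return num; }
    }

    public static Map<String, Long> smallestTotals(List<Row> words, int n) {
        // GROUP BY word with SUM(num)
        Map<String, Long> totals = words.stream()
            .collect(Collectors.groupingBy(Row::word, Collectors.summingLong(Row::num)));

        // ORDER BY total (ascending, as in the query), then LIMIT n.
        // Assumption: ties are broken by word so the output is stable.
        return totals.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue()
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .limit(n)
            .collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }
}
```

Because the query sorts ascending, both the SQL and this sketch return the words with the smallest totals first.
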
