Analysis Is Painless

Published in: Technology

Transcript of "Analysis Is Painless"

  1. Analysis is Painless: Data Analysis 101 [email_address]
  2. Why is Data Analysis Scary?
       • Data is overwhelming
       • Lots of setup for little return
       • Resources are scarce
       • Takes a lot of time
       • Other reasons?
  3. Get over the hump…
       • Pick a tool – doesn't matter what it is
           • Programmer? Use a framework like Hadoop
           • Database guy? Use analytic DBMS like Vertica
       • Run a sample or demo first
       • Grab a little data – start with 1GB
           • Goal is to develop a pattern that you can brute force
           • Functionality first, optimization later
  4. Hadoop
       • http://wiki.apache.org/hadoop
       • You write just *three* functions:
           • Map, Reduce and Main
       • Map: parse data into key, value pairs
           • Value can be complex formats too
       • Reduce: optionally combine matching keys
           • WordCount example sums counts for same words
       • Main
           • Orchestrate the whole process
  5. What does MAIN do?
       • Start with a JobConf and give it a name
           JobConf conf = new JobConf(WordCount.class);
           conf.setJobName("wordcount");
       • Set the key and value classes that the Mapper will output and Reducer will input
           conf.setOutputKeyClass(Text.class);
           conf.setOutputValueClass(IntWritable.class);
       • Tell it the classes (Combiner is an optional special case reducer)
           conf.setMapperClass(Map.class);
           conf.setCombinerClass(Reduce.class);
           conf.setReducerClass(Reduce.class);
       • Set the format classes for input and output
           conf.setInputFormat(TextInputFormat.class);
           conf.setOutputFormat(TextOutputFormat.class);
       • Format specific initialization
           FileInputFormat.setInputPaths(conf, new Path(args[0]));
           FileOutputFormat.setOutputPath(conf, new Path(args[1]));
       • Run!
           JobClient.runJob(conf);
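
The Map, Reduce and Main roles from the two Hadoop slides above can be tried out without a cluster. What follows is a minimal sketch in plain Java, not Hadoop code: the class name `WordCountSketch` and its methods `map`, `reduce` and `run` are hypothetical names chosen for illustration, and the in-memory grouping stands in for the shuffle that the framework would perform between the two steps.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, cluster-free illustration of the WordCount pattern.
public class WordCountSketch {

    // "Map" step: parse one line of input into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: combine all values seen for one key by summing them.
    public static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // "Main" step: map every line, group the pairs by key, reduce each group.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            totals.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Prints {hadoop=1, hello=2, world=1}
        System.out.println(run(Arrays.asList("hello world", "hello hadoop")));
    }
}
```

This follows the "functionality first, optimization later" advice from slide 3: once the pattern works on a small sample, the same map and reduce logic can be moved into a real framework.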
  6. Vertica
       • http://www.vertica.com
       • SQL RDBMS designed for Analytics
       • Easy to tune, scales by adding nodes
       • Develop a schema (CREATE TABLE)
       • Copy in ascii files (or ETL them)
       • Run queries (SELECT COUNT(*) FROM…)
  7. Basic DBMS
       • Define a schema
           CREATE TABLE words
           ( word VARCHAR(64), num INT);
       • Load some data
           COPY words FROM '/home/user/words.csv'
           DELIMITER ',' NULL '';
         or
           INSERT INTO words VALUES('barcamp', 29081234);
       • Query it
           SELECT word, SUM(num) AS total FROM words GROUP BY word ORDER BY total LIMIT 10;
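
For readers coming from the programmer side, it may help to see what the final query on slide 7 computes. The sketch below reproduces its GROUP BY, SUM, ORDER BY and LIMIT steps over an in-memory list in plain Java. The names `GroupBySketch`, `Row` and `sumByWord` are hypothetical, and this only illustrates the query's logic, not how Vertica or any other DBMS executes it. Note that, as written on the slide, the query sorts ascending, so it returns the words with the smallest totals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of:
//   SELECT word, SUM(num) AS total FROM words
//   GROUP BY word ORDER BY total LIMIT n;
public class GroupBySketch {

    // One row of the "words" table: (word, num).
    public static final class Row {
        final String word;
        final long num;

        public Row(String word, long num) {
            this.word = word;
            this.num = num;
        }
    }

    public static Map<String, Long> sumByWord(List<Row> rows, int limit) {
        // GROUP BY word, SUM(num): accumulate one running total per word.
        Map<String, Long> totals = new TreeMap<>();
        for (Row r : rows) {
            totals.merge(r.word, r.num, Long::sum);
        }

        // ORDER BY total: ascending, matching the query as written.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort(Map.Entry.comparingByValue());

        // LIMIT n: keep the first n groups, preserving the sorted order.
        Map<String, Long> result = new LinkedHashMap<>();
        int n = Math.min(limit, sorted.size());
        for (Map.Entry<String, Long> e : sorted.subList(0, n)) {
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("barcamp", 5), new Row("hadoop", 2),
                new Row("barcamp", 1), new Row("vertica", 3));
        // Prints {hadoop=2, vertica=3}
        System.out.println(sumByWord(rows, 2));
    }
}
```

The SQL version expresses the same thing in one statement, which is much of the appeal of the DBMS route for this kind of question.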