Analysis Is Painless
Published in: Technology
Transcript

  • 1. Analysis is Painless: Data Analysis 101 [email_address]
  • 2. Why is Data Analysis Scary?
      • Data is overwhelming
      • Lots of setup for little return
      • Resources are scarce
      • Takes a lot of time
      • Other reasons?
  • 3. Get over the hump…
      • Pick a tool – doesn't matter what it is
          • Programmer? Use a framework like Hadoop
          • Database guy? Use an analytic DBMS like Vertica
      • Run a sample or demo first
      • Grab a little data – start with 1GB
          • Goal is to develop a pattern that you can brute force
          • Functionality first, optimization later
  • 4. Hadoop
      • http://wiki.apache.org/hadoop
      • You write just *three* functions:
          • Map, Reduce and Main
      • Map: parse data into key, value pairs
          • Value can be complex formats too
      • Reduce: optionally combine matching keys
          • WordCount example sums counts for same words
      • Main
          • Orchestrate the whole process
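To make the Map and Reduce roles on this slide concrete, here is a minimal sketch in plain Java. It is not the Hadoop API: the class name `WordCountSketch` and the methods `map`, `reduce` and `run` are hypothetical, and `run` is only a stand-in for the grouping that a framework would do between the two steps. It assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the map and reduce steps; this is NOT the Hadoop API.
class WordCountSketch {

    // "Map" step: parse one line of input into (word, 1) key/value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce" step: sum all counts that arrived for the same word.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Stand-in for the framework: group mapped values by key,
    // then call reduce once per key.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            totals.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(run(List.of("hello world", "hello hadoop")));
    }
}
```

In a real Hadoop job the grouping and the distribution across machines are handled by the framework, so only the bodies of `map` and `reduce` would carry over.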
  • 5. What does MAIN do?
      • Start with a JobConf and give it a name
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
      • Set the key and value classes that the Mapper will output and Reducer will input
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
      • Tell it the classes (Combiner is an optional special case reducer)
          conf.setMapperClass(Map.class);
          conf.setCombinerClass(Reduce.class);
          conf.setReducerClass(Reduce.class);
      • Set the format classes for input and output
          conf.setInputFormat(TextInputFormat.class);
          conf.setOutputFormat(TextOutputFormat.class);
      • Format specific initialization
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      • Run!
          JobClient.runJob(conf);
  • 6. Vertica
      • http://www.vertica.com
      • SQL RDBMS designed for analytics
      • Easy to tune, scales by adding nodes
      • Develop a schema (CREATE TABLE)
      • Copy in ASCII files (or ETL them)
      • Run queries (SELECT COUNT(*) FROM…)
  • 7. Basic DBMS
      • Define a schema
          CREATE TABLE words
          ( word VARCHAR(64), num INT);
      • Load some data
          COPY words FROM '/home/user/words.csv'
          DELIMITER ',' NULL '';
        or
          INSERT INTO words VALUES('barcamp', 29081234);
      • Query it
          SELECT word, SUM(num) AS total FROM words GROUP BY word ORDER BY total LIMIT 10;
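For readers coming from the programming side, the query on this slide can be read as three steps: group rows by word, sum `num` per group, then sort by the sum and keep the first rows. The sketch below mirrors those steps in plain Java over an in-memory list. The class `GroupBySketch`, the `Row` type and the `query` method are hypothetical names for illustration, no database is involved, and it assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of GROUP BY / SUM / ORDER BY / LIMIT; no database involved.
class GroupBySketch {

    // One row of a hypothetical in-memory "words" table: (word, num).
    static final class Row {
        final String word;
        final long num;

        Row(String word, long num) {
            this.word = word;
            this.num = num;
        }
    }

    // Mirrors: SELECT word, SUM(num) AS total FROM words
    //          GROUP BY word ORDER BY total LIMIT <limit>;
    static List<Map.Entry<String, Long>> query(List<Row> rows, int limit) {
        // GROUP BY word with SUM(num): accumulate one total per distinct word.
        Map<String, Long> totals = new TreeMap<>();
        for (Row r : rows) {
            totals.merge(r.word, r.num, Long::sum);
        }
        // ORDER BY total: ascending, which is the SQL default.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort(Map.Entry.comparingByValue());
        // LIMIT: keep at most the first <limit> rows.
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("barcamp", 5), new Row("hadoop", 2),
                new Row("barcamp", 3), new Row("vertica", 4));
        // prints [hadoop=2, vertica=4, barcamp=8]
        System.out.println(query(rows, 10));
    }
}
```

Because the sort is ascending, the smallest totals come first, as they would with a plain `ORDER BY total`. Reversing the comparator would correspond to `ORDER BY total DESC`, which returns the largest totals instead.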
