Another Intro To Hadoop

Introduction to Hadoop:
What are Hadoop, MapReduce, and the Hadoop Distributed File System?
Who uses Hadoop?
How to run Hadoop?
What are Pig, Hive, and Mahout?


1. Another Intro to Hadoop
   [email_address]
   Context Optional
   April 2, 2010
   By Adeel Ahmad
2. About Me
   - Follow me on Twitter @_adeel
   - The AI Show podcast: www.aishow.org (artificial intelligence news every week)
   - Senior App Genius at Context Optional
   - We're hiring Ruby developers. Contact me!
3. Too much data
   - User-generated content, social networks, logging and tracking
   - Google, Yahoo and others need to index the entire internet and return search results in milliseconds
   - NYSE generates 1 TB of data per day
   - Facebook has 400 TB of stored data and ingests 20 TB of new data per day; it hosts approx. 10 billion photos, about 1 PB (2009)
4. Can't scale
   - Challenge to both store and analyze datasets
   - Slow to process
   - Unreliable machines (CPUs and disks can go down)
   - Not affordable (faster, more reliable machines are expensive)
5. Solve it through software
   - Split up the data
   - Run jobs in parallel
   - Sort and combine to get the answer
   - Schedule across arbitrarily-sized clusters
   - Handle fault tolerance
   - Since even the best systems break down, use cheap commodity computers
6. Enter Hadoop
   - Open-source Apache project written in Java
   - MapReduce implementation for parallelizing applications
   - Distributed filesystem for redundant data
   - Many other sub-projects
   - Meant for cheap, heterogeneous hardware
   - Scale up by simply adding more cheap hardware
7. History
   - Open-source Apache project
   - Grew out of Apache Nutch, an open-source search engine
   - Two Google papers
     - MapReduce (2004): programming model for parallel processing
     - Google File System (2003): fault-tolerant storage of large amounts of data
8. MapReduce
   - Operates exclusively on <key, value> pairs
   - Split the input data into independent chunks
   - Processed by the map tasks in parallel
   - Sort the outputs of the maps
   - Send to the reduce tasks
   - Write to output files
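   That flow can be simulated locally in a few lines of Ruby. This is a sketch of the data flow only, not Hadoop's actual API, and the word-count input is illustrative:

   # Sketch: simulate map -> sort/group -> reduce on <key, value> pairs.
   input = ["a b a", "b c"]

   # Map: each input chunk emits <word, 1> pairs (in parallel on a real cluster).
   pairs = input.flat_map { |line| line.split.map { |word| [word, 1] } }

   # Shuffle/sort: the framework groups the map output by key.
   grouped = pairs.group_by { |word, _| word }

   # Reduce: combine each key's values; results go to the output files.
   grouped.sort.each do |word, ps|
     puts "#{word} #{ps.map { |_, c| c }.reduce(:+)}"
   end
   # => a 2, b 2, c 1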
9. MapReduce (diagram slide)
10. MapReduce (diagram slide)
11. HDFS
   - Hadoop Distributed File System
   - Files split into large blocks
   - Designed for streaming reads and appending writes, not random access
   - 3 replicas of each piece of data by default
   - Data can be stored in encoded/archived formats
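   For a sense of scale, a back-of-the-envelope sketch in Ruby, assuming the 64 MB default block size of that era (the block size is configurable; the file size is illustrative):

   file_size_gb  = 100
   block_size_mb = 64      # HDFS default at the time; configurable per file
   replication   = 3       # default replicas per block, as noted above

   blocks = (file_size_gb * 1024.0 / block_size_mb).ceil
   puts "#{file_size_gb} GB file: #{blocks} blocks, #{blocks * replication} block replicas stored"
   # => 100 GB file: 1600 blocks, 4800 block replicas stored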
12. Self-managing and self-healing
   - Brings the computation as physically close to the data as possible for best bandwidth, instead of copying data around
   - Tries the same node, then the same rack, then the same data center
   - Auto-replicates data if replicas are lost
   - Auto-kills and restarts tasks on another node if they take too long or the node is flaky
13. Hadoop Streaming
   - Don't need to write mappers and reducers in Java
   - Text-based API that exposes stdin and stdout
   - Use any language
   - Ruby gems: Wukong, Mandy
14. Example: Word count

   # mapper.rb
   STDIN.each_line do |line|
     word_count = {}
     line.split.each do |word|
       word_count[word] ||= 0
       word_count[word] += 1
     end
     word_count.each do |k, v|
       puts "#{k} #{v}"
     end
   end

   # reducer.rb
   # Streaming delivers the map output sorted by key, so identical words
   # arrive on consecutive lines.
   word = nil
   count = 0
   STDIN.each_line do |line|
     wordx, countx = line.strip.split
     if wordx != word
       puts "#{word} #{count}" unless word.nil?
       word = wordx
       count = 0
     end
     count += countx.to_i
   end
   puts "#{word} #{count}" unless word.nil?
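   Because streaming only uses stdin and stdout, the pair can be tested locally with a shell pipeline, where sort stands in for Hadoop's shuffle phase (input.txt is a placeholder):

   cat input.txt | ruby mapper.rb | sort | ruby reducer.rb

   On a cluster, the same scripts are submitted with the streaming jar; the jar path and the input/output paths below are illustrative and vary by version:

   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming*.jar \
     -input my_input -output my_output \
     -mapper mapper.rb -reducer reducer.rb \
     -file mapper.rb -file reducer.rb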
15. Who Uses Hadoop?
   - Yahoo
   - Facebook
   - Netflix
   - eHarmony
   - LinkedIn
   - NY Times
   - Digg
   - Flightcaster
   - RapLeaf
   - Trulia
   - Last.fm
   - Ning
   - CNET
   - Lots more...
16. Developing With Hadoop
   - Don't need a whole cluster to start
   - Standalone
     - Non-distributed
     - Single Java process
   - Pseudo-distributed
     - Just like fully distributed
     - Components run in separate processes
   - Fully distributed
     - Now you need a real cluster
17. How to Run Hadoop
   - Linux, OS X, Windows, Solaris
   - Just need Java and SSH access to the nodes
   - XML config files
   - Download core Hadoop
     - Can do everything we mentioned
     - Still requires the user to edit config files and write scripts
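   As a sketch of what those XML config files look like, a minimal pseudo-distributed setup circa Hadoop 0.20; the host, port, and values here are illustrative, not required:

   <!-- conf/core-site.xml -->
   <configuration>
     <property>
       <name>fs.default.name</name>
       <value>hdfs://localhost:9000</value>
     </property>
   </configuration>

   <!-- conf/hdfs-site.xml: a single node only needs one replica -->
   <configuration>
     <property>
       <name>dfs.replication</name>
       <value>1</value>
     </property>
   </configuration>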
18. How to Run Hadoop
   - Cloudera Inc. provides its own distribution of Hadoop, plus enterprise support and training
     - Core Hadoop plus patches
     - Bundled with command-line scripts, Hive, Pig
     - Publishes an AMI and scripts for EC2
     - Best option for running your own cluster
19. How to Run Hadoop
   - Amazon Elastic MapReduce (EMR)
     - GUI or command-line cluster management
     - Supports Streaming, Hive, Pig
     - Grabs data and MapReduce code from S3 buckets and puts it into HDFS
     - Auto-shuts down EC2 instances
     - Cloudera now has scripts for EMR
     - Easiest option
20. Pig
   - High-level scripting language developed by Yahoo
   - Describes multi-step jobs
   - Translated into MapReduce tasks
   - Grunt command-line interface
   - Ex: Find the top 5 most-visited pages by users aged 18 to 25

   Users = LOAD 'users' AS (name, age);
   Filtered = FILTER Users BY age >= 18 AND age <= 25;
   Pages = LOAD 'pages' AS (user, url);
   Joined = JOIN Filtered BY name, Pages BY user;
   Grouped = GROUP Joined BY url;
   Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
   Sorted = ORDER Summed BY clicks DESC;
   Top5 = LIMIT Sorted 5;
21. Hive
   - High-level interface created by Facebook
   - Gives db-like structure to data
   - HiveQL declarative language for querying
   - Queries get turned into MapReduce jobs
   - Command-line interface
   - Ex:

   CREATE TABLE raw_daily_stats_table (dates STRING, ..., pageviews STRING);
   LOAD DATA INPATH 'finaloutput' INTO TABLE raw_daily_stats_table;
   SELECT … FROM … JOIN ...
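   To make the elided query concrete, a minimal HiveQL sketch against the table above; the column choice and the cast are illustrative assumptions, since pageviews was declared as a STRING:

   SELECT dates, SUM(CAST(pageviews AS INT)) AS total_views
   FROM raw_daily_stats_table
   GROUP BY dates;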
22. Mahout
   - Machine-learning libraries for Hadoop
     - Collaborative filtering
     - Clustering
     - Frequent pattern recognition
     - Genetic algorithms
   - Applications
     - Product/friend recommendation
     - Classifying content into defined groups
     - Finding associations, patterns, behaviors
     - Identifying important topics in conversations
23. More stuff
   - HBase: database based on Google's Bigtable
   - Sqoop: database import tool
   - ZooKeeper: coordination service for distributed apps to keep track of servers, like a filesystem
   - Avro: data serialization system
   - Scribe: logging system developed by Facebook
