1. WHAT STARTS HERE CHANGES THE WORLD
Hadoop and MapReduce
Hemanth Kumar Mantri
Graduate Student
UT-Austin
November 9th 2011
Agenda
• What is Hadoop?
• Where is MapReduce used?
• HDFS and MapReduce
• Amazon Web Services
• Map Reduce Demo on Hadoop
What is Hadoop?
• Inspired by Google's published designs for the Google File System (GFS) and MapReduce
• Supports data-intensive distributed applications
• Scales to thousands of nodes and petabytes of data
• Open-source Apache project
• Implemented in Java
• Yahoo! is the largest contributor
Who Uses MapReduce and Hadoop?
• At Google (its own MapReduce implementation, not Hadoop):
– Index construction for Google Search
– Popular Passages in Google Books
– Article clustering for Google News
• At Yahoo!:
– “Web map” powering Yahoo! Search
– Spam detection for Yahoo! Mail
– Hadoop runs on more than 100,000 CPUs across more than 36,000 computers
• At Facebook:
– Reporting/analytics and machine learning (data mining, spam detection)
– Storage engine for logs
– 1,100-machine cluster with 8,800 cores and about 12 PB of raw storage
Yelp!
• Uses Amazon S3 to store daily logs and photos
– Generates around 100 GB of logs per day
• Uses Amazon Elastic MapReduce for:
– "People Who Viewed This Also Viewed"
– Review highlights
– Autocomplete as you type in search
– Search spelling suggestions
– Top searches
– Ads
• Runs approximately 200 Elastic MapReduce jobs, processing 3 TB of data per day
Hadoop Components
• Distributed file system (HDFS)
– Single namespace for entire cluster
– Design closely modeled on GFS
– Replicates each data block 3x by default for fault tolerance
• MapReduce framework
– Executes user jobs specified as “map” and “reduce” functions
– Manages work distribution and fault tolerance
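
To make the “map” and “reduce” roles concrete, here is a minimal single-process sketch of a word-count job. It is plain Python rather than the Hadoop Java API, and the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, chosen only for illustration. In a real cluster the framework, not user code, performs the grouping ("shuffle") step and spreads the work across nodes.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog", "The end"]))
```

The user supplies only `map_fn` and `reduce_fn`. Everything inside `run_job` corresponds to what the slide calls work distribution, which the framework handles.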
Amazon Web Services
• Collection of services, billed pay-as-you-use
– S3 (Simple Storage Service)
Storage in the cloud ($0.140/GB/month)
Key-value store (like a big HashMap)
– EC2 (Elastic Compute Cloud)
Compute in the cloud ($0.085 to $2.60 per instance-hour)
– Elastic MapReduce
Runs Hadoop jobs on EC2 using data stored in S3
– Email service (SES)
– ... and many more
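
The pay-as-you-use model can be turned into a rough cost estimate. The sketch below uses only the prices quoted on this slide (2011 figures; current AWS prices differ). It ignores data transfer, request fees, and any billing rounding. The function names and the example workload are hypothetical.

```python
# Prices as quoted on the slide (2011); treat them as illustrative only.
S3_USD_PER_GB_MONTH = 0.140
EC2_USD_PER_HOUR_MIN = 0.085  # cheapest rate quoted on the slide
EC2_USD_PER_HOUR_MAX = 2.60   # most expensive rate quoted on the slide


def s3_monthly_cost(stored_gb):
    """Monthly storage cost for `stored_gb` gigabytes kept in S3."""
    return stored_gb * S3_USD_PER_GB_MONTH


def ec2_cost(nodes, hours, usd_per_hour=EC2_USD_PER_HOUR_MIN):
    """Cost of running `nodes` instances for `hours` hours each."""
    return nodes * hours * usd_per_hour


if __name__ == "__main__":
    # Hypothetical workload: 100 GB stored for a month, 10 small nodes for 2 hours.
    print(f"S3:  ${s3_monthly_cost(100):.2f} per month")
    print(f"EC2: ${ec2_cost(10, 2):.2f} for the run")
```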
Map Reduce on EC2 Cluster
• Create an AWS account and get the access keys for authentication
• Go to src/contrib/ec2 in the Hadoop directory
• Launch a cluster on EC2
– % bin/hadoop-ec2 launch-cluster <cluster-name> <#nodes>
• Log in to the cluster
– % bin/hadoop-ec2 login test-cluster
• Start the computation (the bundled pi estimator, here with 10 maps and 10,000,000 samples per map)
– # cd /usr/local/hadoop-*
– # bin/hadoop jar hadoop-*-examples.jar pi 10 10000000
• Terminate the cluster after use (running instances keep accruing charges)
– % bin/hadoop-ec2 terminate-cluster test-cluster
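
The `pi` example job splits a sampling task across map tasks and combines the partial counts in a reduce step. The sketch below mirrors that structure in plain Python on one machine. It is an approximation under stated assumptions: it uses ordinary pseudo-random sampling, whereas Hadoop's bundled example uses a quasi-Monte Carlo method. The names `map_task`, `reduce_task`, and `estimate_pi` are hypothetical.

```python
import random


def map_task(task_id, samples, seed=0):
    """Map: sample `samples` points in the unit square and count how many
    fall inside the quarter circle of radius 1."""
    rng = random.Random(seed + task_id)  # separate seed per task
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples


def reduce_task(partials):
    """Reduce: sum the per-task counts and convert the ratio into pi."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


def estimate_pi(num_maps, samples_per_map, seed=0):
    """Run all map tasks sequentially, then reduce their results."""
    partials = [map_task(i, samples_per_map, seed) for i in range(num_maps)]
    return reduce_task(partials)


if __name__ == "__main__":
    # Far fewer samples than the slide's 10 x 10,000,000, to keep this fast.
    print(estimate_pi(10, 20_000))
```

Because each map task depends only on its own inputs, the tasks can run on different nodes. Only the small (inside, total) pairs need to reach the reducer.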
References
• Hadoop Project Page:
– http://hadoop.apache.org/
• Amazon Web Services:
– http://aws.amazon.com/