Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also brings to light the Hadoop ecosystem and real business use cases that revolve around Hadoop and the ecosystem.
Falsehood #1: Machines can be reliable… (Image: MadMan the Mighty, CC BY-NC-SA)
Hadoop separates distributed-system fault-tolerance code from application logic. (Diagram labels: Unicorns, Systems Programmers, Statisticians, Programmers)
Falsehood #2: Machines deserve identities... (Image: Laughing Squid, CC BY-NC-SA)
Hadoop lets you interact with a cluster, not a bunch of machines. (Image: Yahoo! Hadoop cluster, OSCON '07)
Falsehood #3: Your analysis fits on one machine… (Image: Matthew J. Stinson, CC BY-NC)
Hadoop scales linearly with data size or analysis complexity. Data-parallel or compute-parallel. For example:
• Extensive machine learning on <100GB of image data
• Simple SQL queries on >100TB of clickstream data
Hadoop works for both applications!
Hadoop sounds like magic. Coincidentally, today is Houdini's birthday, though he was not a Hadoop committer. How is it possible?
A Typical Look...
• 5-4000 commodity servers (8-core, 24GB RAM, 4-12 TB, gig-E)
• 2-level network architecture: 20-40 nodes per rack
You specify map() and reduce() functions. The framework does the rest.
map(): K₁,V₁ → list(K₂,V₂)
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326”
→ Key: userimage, Value: 2326 bytes
The map function runs on the same node as the data was stored!
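The map step on this slide can be sketched in plain Java. This is a hedged sketch, not Hadoop API code: in a real job this logic would live in a `Mapper`'s `map()` method, and the class name and field-parsing details here are assumptions based on the sample log line above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Sketch of the map() step: take (byteOffset, logLine) and emit
// (page, responseBytes). Plain Java, no Hadoop dependency.
public class LogMapSketch {
    // Field positions are assumptions based on the slide's sample
    // Common Log Format line.
    static Map.Entry<String, Long> map(long byteOffset, String line) {
        // The request is the quoted section: "GET /userimage/123 HTTP/1.0"
        int q1 = line.indexOf('"');
        int q2 = line.indexOf('"', q1 + 1);
        String request = line.substring(q1 + 1, q2);
        String path = request.split(" ")[1];   // /userimage/123
        String page = path.split("/")[1];      // userimage
        // The last field is the response size in bytes
        String[] tail = line.substring(q2 + 1).trim().split(" ");
        long bytes = Long.parseLong(tail[tail.length - 1]);
        return new SimpleEntry<>(page, bytes);
    }

    public static void main(String[] args) {
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /userimage/123 HTTP/1.0\" 200 2326";
        Map.Entry<String, Long> out = map(193284L, line);
        System.out.println(out.getKey() + "\t" + out.getValue());
        // userimage	2326
    }
}
```

Note the input key (byte offset) is ignored here, just as many real mappers ignore it.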
Input Format
• Wait! HDFS is not a Key-Value store!
• InputFormat interprets bytes as a Key and Value
Input bytes: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326
Key: byte offset 193284
Value: “127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /userimage/123 HTTP/1.0" 200 2326”
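What an InputFormat does here can be sketched as a tiny line-record reader: HDFS stores only bytes, and something must turn them into (byte offset, line) pairs. This is a simplified sketch (the class name is made up, and it ignores split boundaries and compression, which Hadoop's real TextInputFormat must handle):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a TextInputFormat-style record reader: turn a raw byte
// stream into (byteOffset, line) key/value pairs for the mapper.
public class LineRecordSketch {
    static List<Map.Entry<Long, String>> records(byte[] data) {
        List<Map.Entry<Long, String>> out = new ArrayList<>();
        long offset = 0;
        StringBuilder line = new StringBuilder();
        for (byte b : data) {
            if (b == '\n') {
                out.add(new SimpleEntry<>(offset, line.toString()));
                offset += line.length() + 1;  // +1 for the newline itself
                line.setLength(0);
            } else {
                line.append((char) b);
            }
        }
        // Emit a trailing line with no final newline, if any
        if (line.length() > 0) out.add(new SimpleEntry<>(offset, line.toString()));
        return out;
    }

    public static void main(String[] args) {
        byte[] log = "GET /a\nGET /b\n".getBytes();
        for (Map.Entry<Long, String> r : records(log))
            System.out.println(r.getKey() + " -> " + r.getValue());
        // 0 -> GET /a
        // 7 -> GET /b
    }
}
```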
The Shuffle
Each map output is assigned to a “reducer” based on its key.
Map output is grouped and sorted by key.
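The shuffle's two jobs — partitioning by key, then grouping and sorting each reducer's input — can be sketched in a few lines of plain Java. The hash-mod partitioning rule matches Hadoop's default HashPartitioner; everything else (class name, in-memory collections) is an illustration, since the real shuffle streams data over the network:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the shuffle: assign each map output pair to a reducer by
// its key, then group and sort each reducer's input by key.
public class ShuffleSketch {
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput, int numReducers) {
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) reducers.add(new TreeMap<>());
        for (Map.Entry<String, Integer> kv : mapOutput) {
            // Same rule as Hadoop's default HashPartitioner:
            // (hash(key) & MAX_VALUE) % numReducers
            int r = (kv.getKey().hashCode() & Integer.MAX_VALUE) % numReducers;
            reducers.get(r)
                    .computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                    .add(kv.getValue());
        }
        return reducers;  // TreeMap keeps each reducer's keys sorted
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOut = List.of(
                Map.entry("userimage", 2326), Map.entry("index", 512),
                Map.entry("userimage", 1801));
        for (TreeMap<String, List<Integer>> r : shuffle(mapOut, 2))
            System.out.println(r);
    }
}
```

The key property: every value for "userimage" lands at the same reducer, grouped into one list — which is exactly what lets reduce() see all values for a key at once.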
Hadoop in the Wild (yes, it's used in production)
• Yahoo! Hadoop Clusters: >82PB, >25k machines (Eric14, HadoopWorld NYC '09)
• Facebook: 15TB new data per day; 1200 machines, 21PB in one cluster
• Twitter: >1TB per day, ~120 nodes
• Lots of 5-40 node clusters at companies without petabytes of data (web, retail, finance, telecom, research)
Product Recommendations
• Naïve approach: Users who bought toothpaste bought toothbrushes.
• Hadoop approach:
  • What products did a user browse, hover over, rate, add to cart (but not buy), etc., in the last 2 months?
  • What are the attributes of the user?
  • What are our margins, promotions, inventory, etc.?
Product Recommendations (cont'd)
• A lot of data!
  • Activity: ~20GB/day × ~60 days = 1.2TB
  • User Data: 2GB
  • Purchase Data: ~5GB
• Pre-aggregating loses fidelity for individual users.
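The "pre-aggregating loses fidelity" point can be made concrete with a toy example. Once raw activity is rolled up into site-wide counts, per-user questions become unanswerable — which is why the Hadoop approach keeps the raw events. The event fields and user names here are hypothetical:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration: pre-aggregated counts vs. raw per-user activity.
public class AggregationSketch {
    // Hypothetical activity event: which user viewed which product
    record Event(String user, String product) {}

    public static void main(String[] args) {
        List<Event> raw = List.of(
                new Event("frank", "toothpaste"),
                new Event("frank", "floss"),
                new Event("alice", "toothpaste"));

        // Pre-aggregated view: product -> total views. "frank" is gone,
        // so "what did frank look at?" can no longer be answered.
        Map<String, Long> agg = new HashMap<>();
        for (Event e : raw) agg.merge(e.product(), 1L, Long::sum);
        System.out.println(agg);

        // The raw view still supports per-user queries.
        List<String> franks = raw.stream()
                .filter(e -> e.user().equals("frank"))
                .map(Event::product)
                .toList();
        System.out.println(franks);  // [toothpaste, floss]
    }
}
```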
Hadoop and Java (the good)
• Integration, integration, integration!
• Tooling: IDEs, JCarder, AspectJ, Maven/Ivy
• Developer accessibility
Hadoop and Java (the bad)
• Java is great for applications. Hadoop is systems programming.
• JNI is our hammer: compression, security, FS access
• C++ wrapper for setuid task execution
Hadoop and Java (the ugly)
• JVM bugs!
• Garbage Collection pauses on 50GB heaps
• WORA is a giant lie for systems: worst of both worlds?
Ok, fine, what next?
• Get Hadoop! CDH - Cloudera's Distribution including Apache Hadoop
  http://cloudera.com/
  http://hadoop.apache.org/
• Try it out! (Locally, VM, or EC2)
• Watch free training videos on http://cloudera.com/