Hadoop: A Hands-on Introduction

An introduction to Hadoop. This seminar was intended not for IT engineers but for NLP specialists and cognitive scientists.

See the blog post for more information on this presentation.

  1. Hadoop: A Hands-on Introduction
     Claudio Martella, Elia Bruni
     9 November 2011
  2. Outline
     • What is Hadoop
     • Why Hadoop
     • How Hadoop works
     • Hadoop & Python
     • Some NLP code
     • A more complicated problem: Eva
  3. A bit of Context
     • 2003: first MapReduce library @ Google
     • 2003: GFS paper
     • 2004: MapReduce paper
     • 2005: Apache Nutch uses MapReduce
     • 2006: Hadoop is born
     • 2007: first 1000-node cluster at Y!
  4. An Ecosystem
     • HDFS & MapReduce
     • Zookeeper
     • HBase
     • Pig & Hive
     • Mahout
     • Giraph
     • Nutch
  5. Traditional way
     • Design a high-level Schema
     • You store data in an RDBMS
     • Which has very poor write throughput
     • And doesn't scale very well
     • When you talk about terabytes of data
     • Expensive Data Warehouse
  6. BigData & NoSQL
     • Store first, think later
     • Schema-less storage
     • Analytics
     • Petabyte scale
     • Offline processing
  7. Vertical Scalability
     • Extremely expensive
     • Requires expertise in distributed systems and concurrent programming
     • Lacks real fault-tolerance
  8. Horizontal Scalability
     • Built on top of commodity hardware
     • Easy-to-use programming paradigms
     • Fault-tolerance through replication
  9. 1st Assumptions
     • Data to process does not fit on one node.
     • Each node is commodity hardware.
     • Failure happens.
     Spread your data among your nodes and replicate it.
  10. 2nd Assumptions
     • Moving computation is cheap.
     • Moving data is expensive.
     • Distributed computing is hard.
     Move computation to the data, with a simple paradigm.
  11. 3rd Assumptions
     • Systems run on spinning hard disks.
     • Disk seek >> disk scan.
     • Many small files are expensive.
     Base the paradigm on scanning large files.
  12. Typical Problem
     • Collect and iterate over many records
     • Filter and extract something from each
     • Shuffle & sort these intermediate results
     • Group-by and aggregate them
     • Produce final output set
  13. Typical Problem
     • Collect and iterate over many records      (MAP)
     • Filter and extract something from each     (MAP)
     • Shuffle & sort these intermediate results
     • Group-by and aggregate them                (REDUCE)
     • Produce final output set                   (REDUCE)
  14. Quick example
     127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
     • (frank, index.html)
     • (index.html, 10/Oct/2000)
     • (index.html, http://www.example.com/start.html)
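     A streaming mapper could emit exactly these pairs. The sketch below is not from the slides: the regular expression and the choice of emitted fields are illustrative assumptions about the Apache log format shown above.

        #!/usr/bin/env python
        # Hadoop Streaming mapper (sketch): reads Apache log lines on stdin
        # and emits tab-separated key/value pairs like the ones above.
        import re
        import sys

        LOG_RE = re.compile(
            r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<date>[^\]]+)\] '
            r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d+) \S+ '
            r'"(?P<referrer>[^"]*)"')

        for line in sys.stdin:
            m = LOG_RE.match(line)
            if m is None:
                continue  # skip malformed lines
            page = m.group('path').lstrip('/')       # e.g. index.html
            day = m.group('date').split(':')[0]      # e.g. 10/Oct/2000
            print('%s\t%s' % (m.group('user'), page))
            print('%s\t%s' % (page, day))
            print('%s\t%s' % (page, m.group('referrer')))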
  15. MapReduce
     • Programmers define two functions:
       ★ map (key, value) → (key', value')*
       ★ reduce (key', [value'+]) → (key'', value'')*
     • Can also define:
       ★ combine (key, value) → (key', value')*
       ★ partitioner: key' → partition
  16. [Figure: MapReduce data flow — mappers turn input pairs (k1,v1)…(k6,v6) into intermediate pairs such as (a,1), (b,2), (c,3), (c,6), (a,5), (c,2), (b,7), (c,9); "Shuffle and Sort: aggregate values by keys" groups them into a→[1,5], b→[2,7], c→[2,3,6,9]; reducers emit the final pairs (r1,s1), (r2,s2), (r3,s3).]
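     The slides contain no code for this flow; as a rough illustration of the same pattern with Hadoop Streaming and Python (cf. the "Hadoop & Python" item in the outline), a word count could be split into the two scripts below. The script names are made up.

        #!/usr/bin/env python
        # wordcount_mapper.py (sketch): emit (word, 1) for every word on stdin.
        import sys

        for line in sys.stdin:
            for word in line.split():
                print('%s\t%d' % (word, 1))

        #!/usr/bin/env python
        # wordcount_reducer.py (sketch): sum the counts per word.
        # Streaming delivers the mapper output sorted by key, so all
        # counts for one word arrive consecutively.
        import sys

        current_word, current_count = None, 0
        for line in sys.stdin:
            if not line.strip():
                continue
            word, count = line.rstrip('\n').split('\t', 1)
            if word == current_word:
                current_count += int(count)
            else:
                if current_word is not None:
                    print('%s\t%d' % (current_word, current_count))
                current_word, current_count = word, int(count)
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))

     A run would look roughly like: hadoop jar .../hadoop-streaming-*.jar -input in/ -output out/ -mapper wordcount_mapper.py -reducer wordcount_reducer.py -file wordcount_mapper.py -file wordcount_reducer.py (the streaming jar's location depends on the Hadoop version).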
  17. MapReduce daemons
     • JobTracker: the Master; it schedules the jobs, assigns tasks to nodes, collects heart-beats from the workers, and reschedules tasks for fault-tolerance.
     • TaskTracker: the Worker; it runs on each slave and executes (multiple) Mappers and Reducers, each in its own JVM.
  18. [Figure: MapReduce execution overview, redrawn from Dean and Ghemawat, OSDI 2004 — (1) the user program forks a master and workers; (2) the master assigns map and reduce tasks; (3) map workers read the input splits; (4) they write intermediate files to local disk; (5) reduce workers read them remotely; (6) reduce workers write the output files.]
  19. HDFS daemons
     • NameNode: the Master; it keeps the filesystem metadata in memory (the file-block-node mapping), decides replication and block placement, and collects heart-beats from the nodes.
     • DataNode: the Slave; it stores the blocks (64 MB) of the files and serves reads and writes directly.
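     A back-of-the-envelope view of what those blocks mean (illustrative only: the function is made up, and the replication factor of 3 is the HDFS default rather than something stated on the slide):

        import math

        def hdfs_layout(file_size_bytes, block_size=64 * 1024 * 1024, replication=3):
            # Block entries the NameNode must track in memory for this file.
            blocks = int(math.ceil(file_size_bytes / float(block_size)))
            # Raw disk consumed across the DataNodes (blocks are not padded).
            raw_storage = file_size_bytes * replication
            return blocks, raw_storage

        blocks, raw = hdfs_layout(1024 ** 4)      # a 1 TB file
        print(blocks)                             # 16384 block entries
        print(raw)                                # ~3 TB of raw disk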
  20. [Figure: GFS architecture, redrawn from Ghemawat et al., SOSP 2003 — the application's GFS client sends (file name, chunk index) to the GFS master, which holds the file namespace (e.g. /foo/bar → chunk 2ef0) and replies with (chunk handle, chunk locations); the client then requests (chunk handle, byte range) directly from a GFS chunkserver and receives the chunk data; chunkservers store chunks on their local Linux file systems, while the master exchanges instructions and state with them.]
  21. Transparent to the user
     • Workers-to-data assignment
     • Map / Reduce assignment to nodes
     • Management of synchronization
     • Management of communication
     • Fault-tolerance and restarts
  22. Take-home recipe
     • Scan-based computation (no random I/O)
     • Big datasets
     • Divide-and-conquer class algorithms
     • No communication between tasks
  23. Not good for
     • Real-time / stream processing
     • Graph processing
     • Computation without locality
     • Small datasets
  24. Questions?
  25. Baseline solution
  26. What we attacked
     • You don't want to parse the file many times
     • You don't want to re-calculate the norm
     • You don't want to calculate 0*n
  27. Our solution
     [Figure: each row of the sparse matrix is reduced to its non-zero entries — e.g. the dense row 0 1.3 0 0 7.1 1.1 is stored as 1.3 7.1 1.1 together with the column indices and the row's norm.]
     line format: <string> <norm> [<col> <value>]*
     for example: cat 12.1313 0 5.1 3 4.6 5 10
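     The slides give no code for this, but a minimal sketch of how such lines could be consumed follows (the function names are made up). Storing the norm in the line avoids recomputing it, and keeping only the non-zero (column, value) pairs avoids the 0*n products attacked on the previous slide.

        def parse_line(line):
            # <string> <norm> [<col> <value>]*
            parts = line.split()
            label, norm = parts[0], float(parts[1])
            vector = {int(parts[i]): float(parts[i + 1])
                      for i in range(2, len(parts), 2)}
            return label, norm, vector

        def cosine(norm_a, vec_a, norm_b, vec_b):
            # Only columns present in both sparse vectors contribute.
            dot = sum(v * vec_b[c] for c, v in vec_a.items() if c in vec_b)
            return dot / (norm_a * norm_b)

        label, norm, vec = parse_line("cat 12.1313 0 5.1 3 4.6 5 10")
        # label == 'cat', norm == 12.1313, vec == {0: 5.1, 3: 4.6, 5: 10.0}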
  28. Benchmarking
     • serial python (single-core): 7 minutes
     • java+hadoop (single-core): 2 minutes
     • serial python (big file): 18 days
     • java+hadoop (parallel, big file): 8 hours
     • it makes sense: 18 days / 3.5 ≈ 5.14 days, and 5.14 days / 14 ≈ 8.8 hours, close to the measured 8
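     A quick sanity check of that last bullet (assuming 3.5 is the single-core java/python ratio, 7 min / 2 min, and 14 is the number of parallel workers; the latter is not stated explicitly on the slide):

        serial_python_days = 18
        java_speedup = 7 / 2.0            # single-core: 7 min python vs 2 min java
        parallel_workers = 14             # assumed from the 18d / 3.5 / 14 figure
        hours = serial_python_days * 24 / java_speedup / parallel_workers
        print(hours)                      # ~8.8 hours, close to the measured 8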