Hadoop: A Hands-on Introduction

An introduction to Hadoop. This seminar was aimed not at IT engineers but at NLP specialists and cognitive scientists.

See the blog post for more information on this presentation.


Transcript

  • 1. Hadoop: A Hands-on Introduction. Claudio Martella, Elia Bruni. 9 November 2011
  • 2. Outline • What is Hadoop • Why is Hadoop • How is Hadoop • Hadoop & Python • Some NLP code • A more complicated problem: Eva
  • 3. A bit of Context • 2003: first MapReduce library @ Google • 2003: GFS paper • 2004: MapReduce paper • 2005: Apache Nutch uses MapReduce • 2006: Hadoop was born • 2007: first 1,000-node cluster at Yahoo!
  • 4. An Ecosystem • HDFS & MapReduce • ZooKeeper • HBase • Pig & Hive • Mahout • Giraph • Nutch
  • 5. Traditional way • Design a high-level schema • Store the data in an RDBMS • Which has very poor write throughput • And doesn't scale very far • When you talk about terabytes of data • Expensive data warehouse
  • 6. BigData & NoSQL • Store first, think later • Schema-less storage • Analytics • Petabyte scale • Offline processing
  • 7. Vertical Scalability • Extremely expensive • Requires expertise in distributed systems and concurrent programming • Lacks real fault-tolerance
  • 8. Horizontal Scalability • Built on top of commodity hardware • Easy-to-use programming paradigms • Fault-tolerance through replication
  • 9. 1st Assumptions • Data to process does not fit on one node. • Each node is commodity hardware. • Failure happens. Spread your data among your nodes and replicate it.
  • 10. 2nd Assumptions • Moving computation is cheap. • Moving data is expensive. • Distributed computing is hard. Move computation to the data, with a simple paradigm.
  • 11. 3rd Assumptions • Systems run on spinning hard disks. • Disk seek >> disk scan. • Many small files are expensive. Base the paradigm on scanning large files.
  • 12. Typical Problem • Collect and iterate over many records • Filter and extract something from each • Shuffle & sort these intermediate results • Group-by and aggregate them • Produce final output set
  • 13. Typical Problem (same steps, with the MapReduce phases overlaid) • Map: collect and iterate over many records, filter and extract something from each • Shuffle & sort these intermediate results • Reduce: group-by and aggregate them, produce the final output set
  • 14. Quick example • Input log line: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)" • Extracted pairs: (frank, index.html) • (index.html, 10/Oct/2000) • (index.html, http://www.example.com/start.html)
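As an illustration of the slide above, here is a minimal, hypothetical Python map function that emits the three (key, value) pairs shown. The regular expression, the field names, and the Hadoop Streaming-style stdin/stdout handling are assumptions made for this sketch, not code from the talk.

```python
import re
import sys

# Illustrative regex for the combined log format shown on the slide (an assumption).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<day>[^:]+):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d+) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def map_log_line(line):
    """Emit the (key, value) pairs from the slide for a single log line."""
    match = LOG_RE.match(line)
    if match is None:
        return                                   # skip malformed lines
    page = match.group('path').lstrip('/')       # "/index.html" -> "index.html"
    yield (match.group('user'), page)            # (frank, index.html)
    yield (page, match.group('day'))             # (index.html, 10/Oct/2000)
    yield (page, match.group('referer'))         # (index.html, http://www.example.com/start.html)

if __name__ == '__main__':
    # Streaming-style usage: one input record per stdin line, tab-separated output pairs.
    for line in sys.stdin:
        for key, value in map_log_line(line):
            print(key + '\t' + value)
```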
  • 15. MapReduce • Programmers define two functions: ★ map (key, value) → (key', value')* ★ reduce (key', [value'+]) → (key'', value'')* • Can also define: ★ combine (key, value) → (key', value')* ★ partitioner: key' → partition
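Since the talk covers Hadoop & Python, the usual way to plug such functions into Hadoop from Python is Hadoop Streaming, where the mapper and reducer are plain scripts that read stdin and write tab-separated key/value lines. The word-count task below is not from the slides; it is only a minimal sketch of the map/reduce signatures listed above, with hypothetical file names mapper.py and reducer.py.

```python
#!/usr/bin/env python
# mapper.py (hypothetical name): map(key, value) -> (key', value')*
# Word count is used here only as a minimal illustration; it is not the talk's example.
import sys

for line in sys.stdin:                  # Streaming feeds input records on stdin
    for word in line.split():
        print(word.lower() + '\t1')     # emit (key', value') as a tab-separated line
```

```python
#!/usr/bin/env python
# reducer.py (hypothetical name): reduce(key', [value'+]) -> (key'', value'')*
import sys

current_key, total = None, 0
for line in sys.stdin:                  # Streaming delivers map output sorted by key
    key, value = line.rstrip('\n').split('\t', 1)
    if key != current_key:
        if current_key is not None:
            print(current_key + '\t' + str(total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:             # flush the last key
    print(current_key + '\t' + str(total))
```

With Streaming, the framework performs the shuffle & sort between the two scripts, so the reducer sees all values for one key on consecutive lines; the optional combiner pre-aggregates map output locally, and the partitioner decides which reducer receives each key'.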
  • 16. [Figure: MapReduce data flow. Map tasks turn the input pairs (k1, v1) … (k6, v6) into intermediate pairs, e.g. (a, 1), (b, 2), (c, 3), (c, 6), (a, 5), (c, 2), (b, 7), (c, 9); shuffle and sort aggregates values by key, giving a → [1, 5], b → [2, 7], c → [2, 3, 6, 9]; reduce tasks then produce the output pairs (r1, s1), (r2, s2), (r3, s3).]
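To make the figure concrete, here is a small single-process simulation of the shuffle & sort step using the exact pairs from the figure. This is only an in-memory illustration, not how Hadoop implements the step.

```python
from collections import defaultdict

# Map output taken from the figure: (key, value) pairs emitted by the four map tasks.
mapped = [('a', 1), ('b', 2), ('c', 3), ('c', 6), ('a', 5), ('c', 2), ('b', 7), ('c', 9)]

# Shuffle & sort: aggregate values by key, as the framework does between map and reduce.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

print({k: groups[k] for k in sorted(groups)})   # {'a': [1, 5], 'b': [2, 7], 'c': [2, 3, 6, 9]}

# Each reduce task then processes one or more groups, e.g. reduce('c', [2, 3, 6, 9]).
```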
  • 17. MapReduce daemons • JobTracker: the Master; it schedules jobs, assigns tasks to nodes, collects heart-beats from workers, and reschedules tasks for fault-tolerance. • TaskTracker: the Worker; it runs on each slave and runs (multiple) Mappers and Reducers, each in its own JVM.
  • 18. [Figure: MapReduce execution overview, redrawn from Dean and Ghemawat (OSDI 2004). The user program forks a master and workers; the master assigns map and reduce tasks; map workers read the input splits and write intermediate files to local disk; reduce workers read those files remotely and write the output files.]
  • 19. HDFS daemons • NameNode: the Master; it keeps the filesystem metadata in memory and the file-block-node mapping, decides replication and block placement, and collects heart-beats from the nodes. • DataNode: the Slave; it stores the blocks (64 MB) of the files and serves reads and writes directly.
  • 20. [Figure: GFS architecture, redrawn from Ghemawat et al. (SOSP 2003). The application's GFS client sends (file name, chunk index) to the GFS master, which keeps the file namespace and replies with (chunk handle, chunk location); the master exchanges instructions and state with the chunkservers; the client then reads and writes chunk data directly with the GFS chunkservers, which store chunks on their local Linux file systems.]
  • 21. Transparent to the programmer • Assignment of workers to data • Assignment of Map / Reduce tasks to nodes • Management of synchronization • Management of communication • Fault-tolerance and restarts
  • 22. Take-home recipe • Scan-based computation (no random I/O) • Big datasets • Divide-and-conquer class algorithms • No communication between tasks
  • 23. Not good for • Real-time / stream processing • Graph processing • Computation without locality • Small datasets
  • 24. Questions?
  • 25. Baseline solution
  • 26. What we attacked • You don't want to parse the file many times • You don't want to re-calculate the norm • You don't want to calculate 0*n
  • 27. Our solution • [Figure: each row of the matrix is stored sparsely, keeping only its non-zero values, e.g. the dense row 5.1 0 0 4.6 0 10 becomes 5.1 4.6 10 together with their column indices.] • Line format: <string> <norm> [<col> <value>]* • For example: cat 12.1313 0 5.1 3 4.6 5 10
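A minimal single-machine sketch of how a line in this format could be parsed and used. The cosine function is an assumption: storing a per-row norm and skipping zero entries is consistent with computing a normalized similarity such as cosine over only the non-zero columns, but the talk's actual code is not reproduced here.

```python
import math

def parse_line(line):
    """Parse the slide's line format: <string> <norm> [<col> <value>]*"""
    parts = line.split()
    word, norm = parts[0], float(parts[1])
    # The remaining fields alternate column index / value; only non-zero entries are stored.
    vector = {int(parts[i]): float(parts[i + 1]) for i in range(2, len(parts), 2)}
    return word, norm, vector

def cosine(norm_a, vec_a, norm_b, vec_b):
    """Cosine similarity reusing the pre-computed norms and touching only non-zero columns."""
    if len(vec_b) < len(vec_a):                          # iterate over the shorter vector
        vec_a, vec_b = vec_b, vec_a
    dot = sum(v * vec_b.get(col, 0.0) for col, v in vec_a.items())
    return dot / (norm_a * norm_b)

word, norm, vec = parse_line("cat 12.1313 0 5.1 3 4.6 5 10")
print(word, norm, vec)                                   # cat 12.1313 {0: 5.1, 3: 4.6, 5: 10.0}
print(cosine(norm, vec, norm, vec))                      # ~1.0 for a vector compared with itself
print(math.sqrt(sum(v * v for v in vec.values())))       # ~12.1313: the stored norm checks out
```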
  • 28. Benchmarking • serial Python (single core): 7 minutes • Java+Hadoop (single core): 2 minutes • serial Python (big file): 18 days • Java+Hadoop (parallel, big file): 8 hours • it makes sense: 18 days ÷ 3.5 ≈ 5.14 days, and 5.14 days ÷ 14 ≈ 8.8 hours, close to the measured 8 hours
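A quick check of the slide's back-of-the-envelope estimate; the factor 14 is taken from the slide and is assumed here to count parallel task slots.

```python
# Back-of-the-envelope check of the slide's estimate.
serial_python_days = 18
single_core_speedup = 7 / 2        # serial Python: 7 min vs. Java+Hadoop: 2 min on one core
parallel_slots = 14                # assumption: the slide's factor 14 counts parallel task slots

estimated_hours = serial_python_days * 24 / single_core_speedup / parallel_slots
print(round(estimated_hours, 1))   # ~8.8 hours, close to the measured 8 hours
```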