An introduction to Hadoop. This seminar was intended for non-IT engineers, mainly NLP specialists and cognitive scientists.
See the blog post for more information on this presentation.
Hadoop: A Hands-on Introduction
1. Hadoop
A Hands-on Introduction
Claudio Martella
Elia Bruni
9 November 2011
Tuesday, November 8, 11
2. Outline
• What is Hadoop
• Why is Hadoop
• How is Hadoop
• Hadoop & Python
• Some NLP code
• A more complicated problem: Eva
3. A bit of Context
• 2003: first MapReduce library @ Google
• 2003: GFS paper
• 2004: MapReduce paper
• 2005: Apache Nutch uses MapReduce
• 2006: Hadoop was born
• 2007: first 1000 nodes cluster at Y!
5. Traditional way
• Design a high-level schema
• Store the data in an RDBMS
• Which has very poor write throughput
• And doesn’t scale well
• Once you reach terabytes of data
• Expensive data warehouses
6. BigData & NoSQL
• Store first, think later
• Schema-less storage
• Analytics
• Petabyte scale
• Offline processing
7. Vertical Scalability
• Extremely expensive
• Requires expertise in distributed systems
and concurrent programming
• Lacks real fault-tolerance
8. Horizontal Scalability
• Built on top of commodity hardware
• Easy to use programming paradigms
• Fault-tolerance through replication
9. 1st Assumptions
• Data to process does not fit on one node.
• Each node is commodity hardware.
• Failure happens.
Spread your data among your nodes
and replicate it.
10. 2nd Assumptions
• Moving computation is cheap.
• Moving data is expensive.
• Distributed computing is hard.
Move computation to data,
with simple paradigm.
11. 3rd Assumptions
• Systems run on spinning hard disks.
• Disk seek >> disk scan.
• Many small files are expensive.
Base the paradigm on scanning large files.
12. Typical Problem
• Collect and iterate over many records
• Filter and extract something from each
• Shuffle & sort these intermediate results
• Group-by and aggregate them
• Produce final output set
13. Typical Problem
• Collect and iterate over many records ← MAP
• Filter and extract something from each ← MAP
• Shuffle & sort these intermediate results
• Group-by and aggregate them ← REDUCE
• Produce final output set ← REDUCE
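The five steps above map directly onto code. A minimal in-memory sketch in Python (the word-count payload and all function names are my own illustration, not from the slides):

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    # Filter and extract: emit a (word, 1) pair for every word in the record
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Group-by and aggregate: sum the counts collected for each word
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: iterate over all records
    intermediate = [kv for i, rec in enumerate(records) for kv in map_fn(i, rec)]
    # Shuffle & sort: order intermediate pairs by key
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce call per distinct key
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, (v for _, v in group)))
    return dict(output)

print(run_mapreduce(["the quick fox", "the lazy dog"], map_fn, reduce_fn))
# → {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

On a cluster, each phase runs in parallel on many nodes; the single sort here stands in for the distributed shuffle.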
14. Quick example
127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/
1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en]
(Win98; I ;Nav)"
• (frank, index.html)
• (index.html, 10/Oct/2000)
• (index.html, http://www.example.com/start.html)
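A sketch of how the three pairs above might be extracted from such a log line in Python (the regex and all names are my own, assuming the Apache combined log format):

```python
import re

# Parser for an Apache combined-format access-log line; this regex is a
# sketch of my own, not part of the original slides.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) '
    r'\[(?P<date>[^:]+):[^\]]+\] '
    r'"\S+ /(?P<page>\S+) [^"]*" '
    r'\d+ \S+ "(?P<referrer>[^"]*)"'
)

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" '
        '"Mozilla/4.08 [en] (Win98; I ;Nav)"')

m = LOG_RE.match(line)
# The three (key, value) pairs a mapper could emit from this record:
pairs = [
    (m.group('user'), m.group('page')),       # who requested which page
    (m.group('page'), m.group('date')),       # when the page was hit
    (m.group('page'), m.group('referrer')),   # where the visit came from
]
print(pairs)
```

Which pair you emit determines what the reduce phase can aggregate: hits per user, hits per day, or top referrers per page.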
15. MapReduce
• Programmers define two functions:
★ map (key, value) → (key’, value’)*
★ reduce (key’, [value’+]) → (key”, value”)*
• Can also define:
★ combine (key, value) → (key’, value’)*
★ partitioner: k’ → partition
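With Hadoop Streaming (one way to drive Hadoop from Python, as the outline suggests), the mapper and reducer are plain programs that read lines on stdin and write tab-separated key/value lines on stdout; the framework sorts the mapper output by key before the reduce step. A minimal word-count sketch, with structure and names of my own choosing:

```python
from itertools import groupby

def mapper(lines):
    # map (key, value) → (key’, value’)*: one (word, 1) pair per word,
    # written as tab-separated text, Streaming's default format
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Streaming hands the reducer its input sorted by key, so runs of
    # identical words are consecutive and can be summed with groupby
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# In a real job each function would be the body of its own script,
# reading sys.stdin and printing its output, with the framework doing
# the shuffle & sort in between. Simulated locally:
shuffled = sorted(mapper(["b a", "a c"]))
print(list(reducer(shuffled)))   # ['a\t2', 'b\t1', 'c\t1']
```

On a cluster you would pass the two scripts to the streaming jar via its `-mapper` and `-reducer` options; here both roles live in one file for brevity.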
16. [Diagram: MapReduce data flow — six input pairs (k1,v1)…(k6,v6) feed four map tasks, which emit intermediate pairs (a,1) (b,2) (c,3) (c,6) (a,5) (c,2) (b,7) (c,9); shuffle and sort aggregates values by key into (a,[1,5]), (b,[2,7]), (c,[2,3,6,9]); three reduce tasks then produce (r1,s1), (r2,s2), (r3,s3).]
17. MapReduce daemons
• JobTracker: it’s the Master; it schedules jobs, assigns tasks to nodes, collects heart-beats from workers, and reschedules tasks for fault-tolerance.
• TaskTracker: it’s the Worker; it runs on each slave and executes (multiple) Mappers and Reducers, each in its own JVM.
18. [Diagram: MapReduce execution overview — (1) the user program forks a Master and workers; (2) the Master assigns map and reduce tasks; (3) map workers read the input splits (split 0–4); (4) they write intermediate files to local disk; (5) reduce workers remote-read those files; (6) they write the output files (file 0, file 1). Phases: input files → map phase → intermediate files (on local disk) → reduce phase → output files.]
Redrawn from (Dean and Ghemawat, OSDI 2004)
19. HDFS daemons
• NameNode: it’s the Master; it keeps the filesystem metadata in memory, holds the file–block–node mapping, decides replication and block placement, and collects heart-beats from the nodes.
• DataNode: it’s the Slave; it stores the blocks (64 MB) of the files and serves reads and writes directly.
20. [Diagram: GFS architecture — the application sends (file name, chunk index) to the GFS master through the GFS client; the master keeps the file namespace (e.g. /foo/bar → chunk 2ef0) and replies with (chunk handle, chunk locations); the client then sends (chunk handle, byte range) to a GFS chunkserver and receives the chunk data, served from the chunkserver’s Linux file system; the master exchanges instructions and chunkserver state with the chunkservers.]
Redrawn from (Ghemawat et al., SOSP 2003)
21. Transparent to the user
• Assignment of workers to data
• Assignment of Map / Reduce tasks to nodes
• Management of synchronization
• Management of communication
• Fault-tolerance and restarts
22. Take home recipe
• Scan-based computation (no random I/O)
• Big datasets
• Divide-and-conquer class algorithms
• No communication between tasks
23. Not good for
• Real-time / Stream processing
• Graph processing
• Computation without locality
• Small datasets
26. What we attacked
• You don’t want to parse the file many times
• You don’t want to re-calculate the norm
• You don’t want to calculate 0*n