3. Big Data
• Sources: server logs, clickstream data, machine and sensor data, social media…
• Use cases: batch, interactive, and real-time processing
4. What is needed from a Distributed Platform?
Scalable
o Petabytes of data
Economical
o Use commodity hardware
o Share clusters among many applications
Reliable
o Failure is common when you run thousands of machines; handle it well in the software layer
Simple programming model
o Applications must be simple to write and maintain
5. What is Hadoop?
Hadoop is a petabyte-scale distributed data storage and data processing infrastructure
Based on Google's GFS and MapReduce papers
Contributed to mostly by Yahoo! in the initial years; it now has a more widespread developer and user base
1000s of nodes, PBs of data in storage
6. Hadoop Basics
• Cheap JBODs (just a bunch of disks) for storage
• Move processing to where the data is
  Location awareness (topology)
• Assume hardware failures to be the norm
• Map & Reduce primitives are fairly simple yet powerful
  Most set operations can be performed using these primitives
• Isolation
8. HDFS
Goals:
Fault-tolerant, scalable, distributed storage system
Designed to reliably store very large files across machines in a large cluster
Assumptions:
Files are written once and read several times
Applications perform large sequential streaming reads
Not a Unix-like, POSIX file system
Access via command line or Java API
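As a taste of the Java API mentioned above, here is a minimal sketch of reading a file from HDFS; the cluster address and file path are illustrative assumptions, not from the slides.

    // Minimal sketch of reading an HDFS file through the Java API.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // assumed address
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path("/data/sample.txt");          // hypothetical file
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(fs.open(path)))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }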
9. HDFS – Data Model
• Data is organized into files and directories
• Files are divided into uniform-sized blocks and distributed across cluster nodes
• Blocks are replicated to handle hardware failure
• The filesystem keeps checksums of data for corruption detection and recovery
• HDFS exposes block placement so that computation can be migrated to the data
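Because HDFS exposes block placement, a client can ask where a file's blocks live. A small sketch using the public FileSystem API (the path is hypothetical):

    // Sketch: asking HDFS where a file's blocks live, so computation
    // can be scheduled near the data.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                // Each block reports the hosts holding a replica.
                System.out.println(block.getOffset() + " -> "
                    + String.join(",", block.getHosts()));
            }
        }
    }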
11. Namenode
• The Namenode is a single point of failure (SPOF); HA for the NN is now available in 2.0 Alpha
• Responsible for managing the list of all active datanodes and the FS namespace (files, directories, blocks and their locations)
• Block placement policy
• Ensuring adequate replicas
• Writing edit logs durably
12. Datanode
• Service to allow data to be streamed in and out
• The block is the unit of data that a datanode understands
• Sends block reports to the Namenode periodically
• Checksum checks and disk usage stats are managed by the datanode
• Clients talk to datanodes for the actual data
• As long as there is at least one datanode available to serve a file's blocks, datanode failures can be tolerated, albeit at lower performance
13. HDFS – Write pipeline
[Diagram: a DFS Client, the Namenode, and Data nodes 1–3 across Rack 1 and Rack 2. (1) Create file / get block locations; (2) the Namenode returns DN 1, 2 & 3; (3–5) the file is streamed through the datanode pipeline; (3a–5a) acks flow back up the pipeline; (3b) the client completes the file with the Namenode.]
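From the client's side, the pipeline above is hidden behind an ordinary output stream. A minimal write sketch (path and payload are illustrative):

    // Sketch of a client-side write: fs.create() contacts the Namenode for
    // block allocations, and the returned stream pushes data down the
    // datanode pipeline shown in the diagram.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) {
                out.writeBytes("hello hdfs\n"); // streamed through DN1 -> DN2 -> DN3
            } // close() completes the file at the Namenode
        }
    }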
14. HDFS – Block placement
• Default is 3 replicas, but configurable
• Blocks are placed (writes are pipelined):
  On the same node
  On a different rack
  On the other rack
• Clients read from the closest replica
• If the replication for a block drops below target, it is automatically re-replicated
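Replication is a per-file setting. A hedged sketch of configuring the default and raising it for one file (property value and path are examples):

    // Sketch: the default replication factor is a config knob, and it can
    // also be changed per file after the fact.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class Replication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", 3);   // cluster-wide default
            FileSystem fs = FileSystem.get(conf);
            // Raise replication for one hot file (hypothetical path);
            // HDFS re-replicates in the background.
            fs.setReplication(new Path("/data/hot.txt"), (short) 5);
        }
    }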
15. HDFS – Data correctness
• Data is checked with CRC32
• File creation
‣ Client computes a checksum per block
‣ DataNode stores the checksum
• File access
‣ Client retrieves the data and checksum from the DataNode
‣ If validation fails, the client tries other replicas
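HDFS's checksum plumbing is internal, but the idea is a plain CRC32 compare-on-read. A toy illustration in Java (not the actual HDFS code path):

    import java.util.zip.CRC32;

    public class CrcCheck {
        public static void main(String[] args) {
            byte[] block = "some block data".getBytes();

            CRC32 writer = new CRC32();      // computed by the client on write
            writer.update(block);
            long stored = writer.getValue(); // stored alongside the block

            CRC32 reader = new CRC32();      // recomputed on read
            reader.update(block);
            if (reader.getValue() != stored) {
                // Mismatch: corruption detected, try another replica.
                System.err.println("checksum mismatch, trying another replica");
            }
        }
    }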
18. Map-Reduce Application
Say we have 100s of machines available to us. How do we write applications on them?
As an example, consider the problem of creating an index for search:
‣ Input: Hundreds of documents
‣ Output: A mapping of word to document IDs
‣ Resources: A few machines
19. The problem: Inverted Index
Input (documents):
  Doc 1: "Farmer1 has the following animals: bees, cows, goats. Some other animals …"
  …
Output (word → document IDs):
  Animals: 1, 2, 3, 4, 12
  Bees: 1, 2, 23, 34
  Dog: 3, 9
  Farmer1: 1, 7
  …
21. This is Map-Reduce
In our example:
‣ Map: (doc-num, text) ➝ [(word, doc-num)]
‣ Reduce: (word, [doc1, doc3, ...]) ➝ [(word, "doc1, doc3, …")]
General form:
‣ Two functions: Map and Reduce
‣ Operate on key-value pairs
‣ Map: (K1, V1) ➝ list(K2, V2)
‣ Reduce: (K2, list(V2)) ➝ (K3, V3)
‣ These primitives are present in Lisp and other functional languages
The same principle extends to distributed computing:
‣ Map and Reduce tasks run on distributed sets of machines
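A sketch of what this Map and Reduce could look like in Java with the org.apache.hadoop.mapreduce API; carrying the doc ID in the input file name is an illustrative simplification, not something the slide specifies.

    import java.io.IOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class InvertedIndex {
        // Map: (doc, text) -> [(word, doc)]
        public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String doc = ((FileSplit) context.getInputSplit())
                                 .getPath().getName();
                for (String word : value.toString().toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) {
                        context.write(new Text(word), new Text(doc));
                    }
                }
            }
        }

        // Reduce: (word, [doc1, doc3, ...]) -> (word, "doc1, doc3, ...")
        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text word, Iterable<Text> docs, Context context)
                    throws IOException, InterruptedException {
                Set<String> unique = new HashSet<>();
                for (Text doc : docs) {
                    unique.add(doc.toString());
                }
                context.write(word, new Text(String.join(", ", unique)));
            }
        }
    }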
22. Map-Reduce Framework
Abstracts functionality common to all Map-Reduce applications:
‣ Distributes tasks to multiple machines
‣ Sorts, transfers and merges intermediate data from all machines from the Map phase to the Reduce phase
‣ Monitors task progress
‣ Handles faulty machines and faulty tasks transparently
Provides pluggable APIs and configuration mechanisms for writing applications:
‣ Map and Reduce functions
‣ Input formats and splits
‣ Number of tasks, data types, etc.
Provides status about jobs to users
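These pluggable points show up as setters on the Job object. A sketch of a driver wiring up the InvertedIndex classes from the sketch above (paths and task count are examples):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SubmitJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "inverted index");
            job.setJarByClass(SubmitJob.class);
            job.setMapperClass(InvertedIndex.IndexMapper.class);
            job.setReducerClass(InvertedIndex.IndexReducer.class);
            job.setInputFormatClass(TextInputFormat.class); // pluggable input format
            job.setNumReduceTasks(4);                        // tunable parallelism
            job.setOutputKeyClass(Text.class);               // data types
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path("/docs"));
            FileOutputFormat.setOutputPath(job, new Path("/index"));
            System.exit(job.waitForCompletion(true) ? 0 : 1); // reports progress
        }
    }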
23. MR – Architecture
[Diagram: a Job Client submits a job to the Job Tracker and polls for progress. Task Trackers send heartbeats to the Job Tracker and receive task assignments; intermediate data is shuffled between Task Trackers. Both the Job Client and the Task Trackers access HDFS through DFS Clients.]
24. Map-Reduce
• All user code runs in an isolated JVM
• The client computes the input splits
• The JT just schedules these splits (one mapper per split)
• The Mapper, Reducer, Partitioner, Combiner and any custom Input/OutputFormat run in the user JVM
• Idempotence
25. Hadoop HDFS + MR cluster
[Diagram: machines running both Datanodes (D) and Tasktrackers (T), alongside the JobTracker and Namenode. A client submits jobs to the JobTracker, gets block locations from the Namenode, and monitors progress through an HTTP monitoring UI.]
26. WordCount: Hello World of Hadoop
• Input: A bunch of large text files
• Desired output: Frequencies of words
28. Word Count Example
Mapper
‣ Input: value: lines of input text
‣ Output: key: word, value: 1
Reducer
‣ Input: key: word, value: set of counts
‣ Output: key: word, value: sum
Launching program
‣ Defines the job
‣ Submits the job to the cluster
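For completeness, the canonical WordCount, following the Mapper/Reducer/driver split described above (essentially the example that ships with Hadoop):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Mapper: emit (word, 1) for every word in the line.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the counts for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: defines the job and submits it to the cluster.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }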