2. Why Data?
Get insights to offer a better product
“More data usually beats better algorithms”
Get insights to make better decisions
Avoid “guesstimates”
3. What Is Challenging?
Store data reliably
Analyze data quickly
Do it in a cost-effective way
Use an expressive, high-level language
4. Fundamental Ideas
A big system of machines, not a big machine
Failures will happen
Move computation to data, not data to computation
Write complex code only once, but right
5. Apache Hadoop
Open-source Java software
Stores and processes very large data sets
Runs on clusters of commodity machines
Provides a simple programming model
6. Apache Hadoop
Two main components:
HDFS – a distributed file system
MapReduce – a distributed processing layer
7. HDFS
The Purpose Of HDFS
● Store large datasets in a distributed, scalable and fault-tolerant way
● High throughput
● Very large files
● Streaming reads and writes (no edits)
8. HDFS Mis-Usage
Do NOT use HDFS if you have:
Low-latency requests
Random reads and writes
Lots of small files
Then it is better to consider an RDBMS, …
9. Splitting Files And Replicating Blocks
Split a very large file into smaller (but still large) blocks
Store them redundantly on a set of machines
10. Splitting Files Into Blocks
● The default block size is 64 MB
● Minimizes the overhead of a disk seek operation (less than 1%)
● A file is just “sliced” into chunks after each 64 MB (or so)
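As a rough illustration (a simplified sketch, not Hadoop's actual implementation), slicing a file into fixed-size blocks with the 64 MB default from the slide looks like this:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size from the slide: 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file.
    Only the last block may be smaller than block_size."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file yields three full 64 MB blocks plus one 8 MB tail block.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))                     # 4
print(blocks[-1][1] // (1024 * 1024))  # 8
```

Each (offset, length) pair would then be stored redundantly across DataNodes, as the previous slide describes.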
12. Master And Slaves
The Master node keeps and manages all metadata information
The Slave nodes store blocks of data and serve them to the client
Master node (called NameNode)
Slave nodes (called DataNodes)
13. Classical* HDFS Cluster
*no NameNode HA, no HDFS Federation
Manages metadata
Does some “house-keeping” operations for NameNode
Stores and retrieves blocks of data
14. HDFS NameNode
Performs all the metadata-related operations
Keeps information in RAM (for fast lookup):
The file system tree
Metadata for all files/directories (e.g. ownership, permissions)
Names and locations of blocks
15. HDFS DataNode
Stores and retrieves blocks of data
Data is stored as regular files on a local filesystem (e.g. ext4)
e.g. blk_-992391354910561645 (+ checksums in a separate file)
A block itself does not know which file it belongs to!
Sends a heartbeat message to the NameNode to say that it is still alive
Sends a block report to the NameNode periodically
16. HDFS Secondary NameNode
NOT a failover NameNode
Periodically merges a prior snapshot (fsimage) and edit log(s) (edits):
Fetches the current fsimage and edits files from the NameNode
Applies edits to fsimage to create the up-to-date fsimage
Then sends the up-to-date fsimage back to the NameNode
17. Reading A File From HDFS
Block data is never sent through the NameNode
The NameNode redirects a client to an appropriate DataNode
The NameNode chooses a DataNode that is as “close” as possible
Block locations come from the NameNode; lots of data comes from DataNodes to the client
$ hadoop fs -cat /toplist/2013-05-15/poland.txt
18. HDFS And Local File System
● Runs on top of a native file system (e.g. ext3, ext4, xfs)
● HDFS is simply a Java application that uses a native file system
19. HDFS Data Integrity
HDFS detects corrupted blocks
● When writing:
The client computes the checksums for each block
The client sends checksums to a DataNode together with the data
● When reading:
The client verifies the checksums against the received data
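The write-then-verify idea above can be sketched as follows. This is a simplified illustration, not HDFS's actual on-the-wire checksum format (HDFS computes a CRC-32 checksum per chunk of data; the 512-byte chunk size below matches the `io.bytes.per.checksum` default):

```python
import zlib

CHUNK = 512  # HDFS checksums each 512-byte chunk by default

def checksums(data, chunk=CHUNK):
    """Writer side: CRC-32 per chunk, computed before sending to a DataNode."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def verify(data, expected, chunk=CHUNK):
    """Reader side: recompute checksums and compare with the stored ones."""
    return checksums(data, chunk) == expected

block = b"some block data" * 100
stored = checksums(block)             # stored alongside the block

print(verify(block, stored))          # True: a clean read passes
corrupted = b"X" + block[1:]          # flip the first byte
print(verify(corrupted, stored))      # False: corruption is detected
```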
20. HDFS NameNode Scalability
Stats based on Yahoo! clusters:
● An average file ≈ 1.5 blocks (block size = 128 MB)
● An average file ≈ 600 bytes in RAM (1 file and 2 block objects)
● 100M files ≈ 60 GB of metadata
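The last figure follows from the second one; a quick sanity check of the arithmetic using the slide's numbers:

```python
# Numbers from the slide: ~600 bytes of NameNode RAM per average file
bytes_per_file = 600
files = 100_000_000  # 100M files

metadata_bytes = files * bytes_per_file
print(metadata_bytes)                      # 60_000_000_000, i.e. 60 GB (decimal)
print(round(metadata_bytes / 1024**3, 1))  # about 55.9 GiB in binary units
```

Since all of this metadata must fit in the NameNode's RAM (slide 14), the number of files, not their total size, is what limits a classical HDFS cluster.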
26. MapReduce Job
Input data is divided into splits and converted into <key, value> pairs
Invokes map() function multiple times
Keys are sorted, values are not (but could be)
Invokes reduce() function multiple times
27. MapReduce Example: ArtistCount
Input: Artist, Song, Timestamp, User
The key is the offset of the line from the beginning of the file
We could specify which artist goes to which reducer (HashPartitioner is the default one)
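The default partitioning idea can be sketched like this; a simplified stand-in for Hadoop's HashPartitioner, which assigns a key to a reducer by hashing it modulo the number of reducers (CRC-32 is used here just as a stable hash for the sketch):

```python
import zlib

def partition(key, num_reducers):
    """Simplified stand-in for Hadoop's default HashPartitioner:
    hash the key modulo the number of reducers, so every record
    with the same artist lands on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# All records for one artist end up in the same partition:
print(partition("Coldplay", 3) == partition("Coldplay", 3))  # True
```

A custom partitioner would replace this function, e.g. to route specific artists to specific reducers.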
28. MapReduce Example: ArtistCount
map(Integer key, EndSong value, Context context):
    context.write(value.artist, 1)

reduce(String key, Iterator<Integer> values, Context context):
    int count = 0
    for each v in values:
        count += v
    context.write(key, count)
Pseudo-code in a non-existing language ;)
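A runnable sketch of the same job in Python (the comma-separated record layout is an assumption based on the slide's Artist, Song, Timestamp, User fields); this simulates the map → shuffle/sort → reduce flow on a single machine, not an actual Hadoop run:

```python
from collections import defaultdict

# Input lines in the slide's assumed format: Artist,Song,Timestamp,User
records = [
    "Coldplay,Fix You,1368576000,user1",
    "Adele,Hello,1368576100,user2",
    "Coldplay,Yellow,1368576200,user1",
]

def map_fn(offset, line):
    """Like the slide's map(): emit (artist, 1) for each record."""
    artist = line.split(",")[0]
    yield (artist, 1)

def reduce_fn(artist, counts):
    """Like the slide's reduce(): sum the 1s for one artist."""
    yield (artist, sum(counts))

# Shuffle/sort phase: group mapper output by key (keys end up sorted).
groups = defaultdict(list)
for offset, line in enumerate(records):
    for key, value in map_fn(offset, line):
        groups[key].append(value)

result = {}
for key in sorted(groups):
    for artist, count in reduce_fn(key, groups[key]):
        result[artist] = count

print(result)  # {'Adele': 1, 'Coldplay': 2}
```

The framework does the grouping and sorting for you; the programmer only writes the two functions, which is the "simple programming model" from slide 5.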
30. MapReduce Implementation
● Batch processing system
● Automatic parallelization and distribution of computation
● Fault-tolerance
● Deals with all the messy details related to distributed processing
● Relatively easy to use for programmers
32. TaskTracker Responsibilities
● Runs map and reduce tasks
● Reports to the JobTracker:
Heartbeats saying that it is still alive
Number of free map and reduce slots
Task progress,