4000 nodes: 14PB storage
HDFS – Assumptions and Goals
• Hardware Failure: Hundreds or thousands of machines; failures are expected, not exceptional.
• Streaming Data Access: Designed for batch processing; high throughput matters more than low latency.
• Large Data Sets: Files are terabytes in size; HDFS scales across a cluster and supports millions of files in a single instance.
• Simple Coherency Model: Write-once-read-many (create, read, close, no
changes) simplifies coherency and enables high throughput; a perfect fit for Map/Reduce.
• Moving Computation instead of Moving Data: Moving computation is far
cheaper than moving huge data sets and minimizes network traffic. HDFS moves the computation close to the data.
• Software and Hardware Portability: Easily portable across heterogeneous platforms.
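The simple coherency model above can be sketched in a few lines. This is an illustrative toy class (not Hadoop's real API): a file is written, closed, and afterwards can only be read or appended to, never modified in place.

```python
# Toy sketch of HDFS's write-once-read-many coherency model.
# Class and method names are illustrative assumptions, not Hadoop's API.
class WriteOnceFile:
    def __init__(self):
        self._chunks = []      # data written so far
        self._closed = False   # True once the initial write is finished

    def write(self, data: bytes):
        if self._closed:
            raise PermissionError("file is closed: use append() only")
        self._chunks.append(data)

    def close(self):
        self._closed = True

    def append(self, data: bytes):
        if not self._closed:
            raise ValueError("close the file before appending")
        self._chunks.append(data)

    def read(self) -> bytes:
        return b"".join(self._chunks)
```

Because a closed file can never change in place, readers need no locking or cache invalidation, which is what makes high-throughput parallel reads cheap.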
• Very large distributed FS
• 10k nodes, 100M files, 10PB
• Works with commodity hardware
• File replication
• Detect and recover from failures
• Optimized for batch processing
• Files are broken into 128 MB blocks
• Blocks are replicated on N DataNodes
• Data Coherency
• Write Once, Read Many
• Existing files can only be appended to, never modified
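The block model in the bullets above can be sketched as follows. This is a simplified illustration under assumed parameters (128 MB blocks, round-robin replica placement); real HDFS placement is rack-aware and more involved.

```python
# Sketch of HDFS-style block splitting and replication (illustrative only).
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, as quoted on the slide

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return the size in bytes of each block a file needs."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks: int, datanodes: list, replication: int = 3):
    """Round-robin each block's replicas over distinct DataNodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 300 MB file splits into three blocks: 128 MB + 128 MB + 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
```

Splitting files into fixed-size blocks is what lets a single huge file spread over many DataNodes, and replication is what lets the system recover when one of them fails.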
Map/Reduce: Local Read
[Figure: file blocks spread across Node 0, Node 1, Node 2, Node 3]
• Local read: no network copy needed
• Data is read from many disks in parallel
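The local-read idea boils down to locality-aware scheduling: run each map task on a node that already holds a replica of its input block. A minimal sketch, with assumed node names and a simplified greedy policy:

```python
# Sketch of data-locality scheduling behind Map/Reduce local reads.
# Node names and the greedy policy are illustrative assumptions.
def schedule_tasks(block_locations: dict, free_nodes: set):
    """block_locations: block id -> list of nodes holding a replica."""
    assignment = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if n in free_nodes]
        # Prefer a node with a local replica; fall back to any free node.
        node = local[0] if local else next(iter(free_nodes))
        assignment[block] = node
        free_nodes.discard(node)
    return assignment

locations = {0: ["node0", "node2"], 1: ["node1", "node3"]}
plan = schedule_tasks(locations, {"node0", "node1", "node2", "node3"})
```

When every task lands on a node holding its block, each read comes straight off a local disk, so the tasks read from many disks in parallel with no network copy.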
Map/Reduce: The Magic!
Single hard drive: reads 75 MB/second
12 hard drives per node:
12 × 75 MB/second × 4,000 nodes ≈
3.4 TB/second