HDFS attributes Distributed, reliable, scalable; optimized for streaming reads of very large data sets. Assumes write-once, read-many access. No local caching, because files are large and reads are streaming. High data replication. Fits logically with MapReduce. Synchronized access to metadata --> NameNode. Metadata (edit log, FsImage) is stored in the NameNode's local OS file system.
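The edit-log-plus-FsImage pattern above can be sketched as a toy model: every namespace mutation is first appended to an edit log, and a checkpoint folds the in-memory namespace into a new image and truncates the log. All class and method names here are illustrative, not the real NameNode API.

```python
import json

class ToyNameNode:
    """Toy model of NameNode metadata: an in-memory namespace plus
    an append-only edit log and periodic FsImage-style checkpoints."""

    def __init__(self):
        self.namespace = {}   # path -> list of block ids
        self.edit_log = []    # mutations not yet checkpointed

    def create_file(self, path, blocks):
        # Every mutation is recorded in the edit log first,
        # then applied to the in-memory namespace.
        self.edit_log.append(("create", path, blocks))
        self.namespace[path] = list(blocks)

    def checkpoint(self):
        # Serialize the namespace into a new image and truncate the log,
        # mirroring how checkpointing keeps the edit log bounded.
        fsimage = json.dumps(self.namespace, sort_keys=True)
        self.edit_log.clear()
        return fsimage
```

On restart, the real NameNode rebuilds its namespace by loading the latest FsImage and replaying the remaining edit log.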
HDFS architecture ensures that a read request is served from the replica nearest to the reader (replication).
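Nearest-replica selection can be sketched with the conventional HDFS-style network distance (0 for the same node, 2 for the same rack, 4 for off-rack). Representing a node as a `(rack, host)` pair is an assumption of this sketch, not the real API.

```python
def replica_distance(reader, replica):
    # Conventional HDFS-style distance: same node < same rack < off-rack.
    if reader == replica:
        return 0
    if reader[0] == replica[0]:   # same rack id
        return 2
    return 4

def pick_nearest(reader, replicas):
    """Choose the replica with the smallest network distance to the reader."""
    return min(replicas, key=lambda r: replica_distance(reader, r))
```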
MapReduce framework ensures that operations are executed on the node nearest to the data --> moving computation is cheaper than moving data.
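The locality preference can be sketched as a tiny scheduling rule: prefer a free node that already stores a replica of the task's input block, and fall back to any free node otherwise. The function and parameter names are illustrative.

```python
def schedule_task(block_replicas, free_nodes):
    """Prefer a data-local node (one holding a replica of the input block);
    otherwise accept a remote node and pay the network cost."""
    local = [n for n in free_nodes if n in block_replicas]
    if local:
        return local[0], "data-local"
    return free_nodes[0], "remote"
```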
Optimized operations at every stage --> combiner, pipelined data replication (a node buffers and forwards data to the next replica in parallel), ...
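The combiner optimization can be sketched with word count: the combiner is a local reduce run on each mapper's output, so far fewer `(word, count)` pairs cross the network to the reducers.

```python
from collections import Counter

def map_phase(line):
    # Mapper emits one (word, 1) pair per word.
    return [(w, 1) for w in line.split()]

def combine(pairs):
    # Combiner: local reduce on the mapper side, shrinking shuffle traffic.
    c = Counter()
    for w, n in pairs:
        c[w] += n
    return list(c.items())

def reduce_phase(all_pairs):
    # Reducer merges the (already partially aggregated) pairs.
    c = Counter()
    for w, n in all_pairs:
        c[w] += n
    return dict(c)
```

Note that a combiner is only safe for operations like sums and counts, where applying the reduce function twice gives the same result.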
Scalability goal Flat scalability --> adding or removing a node is fairly straightforward.
Sub-projects
ZooKeeper for small shared information (useful for synchronization, locks, leader election, and many other sharing problems in distributed systems).
HBase for semi-structured data (provides an implementation of Google's Bigtable design).
Hive for ad hoc query analysis (currently supports insertion into multiple tables, GROUP BY, and multi-table SELECT; ORDER BY is under construction).
Avro for data serialization, applicable to MapReduce.
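ZooKeeper's leader-election recipe can be sketched without a live ensemble: each client creates a sequential ephemeral node, the client holding the lowest sequence number is the leader, and when that node disappears (session ends) leadership passes to the next lowest. This is a pure-Python toy model of the recipe, not the ZooKeeper client API.

```python
import itertools

class ToyCoordinator:
    """Toy model of ZooKeeper's sequential-ephemeral-node election:
    each client gets a monotonically increasing sequence number,
    and the lowest live number is the leader."""

    def __init__(self):
        self._seq = itertools.count()
        self.nodes = {}  # client -> sequence number

    def join(self, client):
        # Models creating a sequential ephemeral znode.
        self.nodes[client] = next(self._seq)

    def leave(self, client):
        # Models the ephemeral node vanishing when the session ends.
        self.nodes.pop(client, None)

    def leader(self):
        return min(self.nodes, key=self.nodes.get) if self.nodes else None
```

In the real recipe each client watches only the node just below its own, so a leader failure wakes exactly one successor instead of every participant.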