One of the key benefits of Hadoop is the ability to dump any type of data into Hadoop then the input record readers will abstract it out as if it was structured (i.e. schema on read vs on write) Open Source Software allows for innovation by partners and customers. It also enables third-party inspection of source code which provides assurances on security and product quality. 1 HDD = 75 MB/sec, 1000 HDDs = 75 GB/sec, the “head of fileserver” bottleneck is eliminated. The system is self-healing in the sense that it automatically routes around failure. If a node fails then its workload and data are transparently shifted some where else. The system is intelligent in the sense that the MapReduce scheduler optimizes for the processing to happen on the same node storing the associated data (or co-located on the same leaf Ethernet switch), it also speculatively executes redundant tasks if certain nodes are detected to be slow.
Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. Example here shows what happens with a replication factor of 3, each data block is present in at least 3 separate data nodes. Typical Hadoop node is eight cores with 24GB ram and four 1TB SATA disks. Default data block size is 64MB, though most folks now set it to 128MB or even higher
Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is similar to the RDBMs which executes the queries, and SQL which is the language for the queries. MapReduce can run on top of HDFS or a selection of other storage systems Intelligent scheduling algorithms for locality, sharing, and resource optimization.
The Power of Hadoop in Cloud ComputingJoey Echeverria, Solutions Architectjoey@cloudera.com, @fwiffo