Rabindra Nath Nandi
Software Engineer (Big Data), IPvision Canada Inc.
Outline
● A brief history of Hadoop
● Why Hadoop
● Hadoop Fundamentals
● HDFS
● MapReduce
● A MapReduce Program - WordCount Problem
● Installation
● Resources
A brief history of Hadoop
The genesis of Hadoop was the Google File System paper (2003).
That paper was followed by another research paper from Google, "MapReduce: Simplified Data Processing on Large Clusters" (2004).
The Hadoop project was spun out of the Apache Nutch project (2006).
Doug Cutting, then at Yahoo, led the project initially.
Yahoo was the main early contributor.
Why Hadoop
Every day, millions of pieces of content are uploaded to or generated on services such as Facebook and Google
This data needs to be stored and processed on demand
Today's hardware offers ample capacity, so raw storage space is no longer the main constraint
What is needed is fast, low-cost data storage and processing
Why Hadoop
Ability to store and process huge amounts of any kind of data, quickly
Computing power
Fault tolerance
Flexibility
Low cost
Scalability
Hadoop Fundamentals
● Hadoop provides both data storage and data processing
● HDFS: Hadoop Distributed File System, the storage layer
● MapReduce: a distributed data processing engine
HDFS Fundamentals
● File systems that manage storage across a network of machines are called distributed file systems
Two Types of Nodes
● NameNode (master): holds the file system metadata and keeps track of which DataNodes store each block
● DataNode (slave): stores and retrieves data blocks. DataNodes periodically report to the NameNode the list of blocks they are storing
● Each file is split into 128 MB blocks (default)
● Each block is replicated to 3 DataNodes (default)
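The block and replication defaults above can be made concrete with a small calculation. The following is a minimal plain-Python sketch, not HDFS code: the function names, the DataNode names (dn1 to dn4) and the round-robin placement are illustrative assumptions. Real HDFS placement is rack-aware and decided by the NameNode.

```python
BLOCK_SIZE_MB = 128   # default block size from the slide
REPLICATION = 3       # default replication factor from the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file is split into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement (an assumption, not the HDFS policy).

    Returns a block map like the one the NameNode keeps:
    block id -> list of DataNodes holding a replica.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block_id: [
            datanodes[(block_id + i) % len(datanodes)]
            for i in range(replication)
        ]
        for block_id in range(num_blocks)
    }


# A hypothetical 300 MB file on a hypothetical 4-node cluster.
blocks = split_into_blocks(300)
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
raw_storage_mb = sum(blocks) * REPLICATION

print(blocks)          # block sizes in MB
print(block_map)       # which DataNodes hold each block
print(raw_storage_mb)  # total disk space used across the cluster
```

In this sketch a 300 MB file becomes two full blocks plus one partial block, and every block is stored three times, so the cluster uses three times the file size in raw disk space.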
MapReduce Fundamentals
Map() function:
Processes a key/value pair to generate a set of intermediate key/value pairs
Reduce() function:
Merges all intermediate values associated with the same intermediate key
A MapReduce Program - WordCount Problem
Mapper Example
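A minimal sketch of a WordCount mapper in plain Python, not the Hadoop Java API. The function name map_wordcount and the sample line are illustrative assumptions. The input key (the line's offset) is ignored, as is usual for WordCount; the value is one line of text.

```python
def map_wordcount(offset, line):
    """Map step: emit an intermediate (word, 1) pair for every word in the line."""
    for word in line.split():
        yield (word, 1)


# One hypothetical input record: key = offset of the line, value = the line.
pairs = list(map_wordcount(0, "Hello Hadoop Hello HDFS"))
print(pairs)
```

The mapper does no counting itself. It only emits a 1 for each occurrence, so a repeated word appears as several separate pairs.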
Reducer Example
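A minimal sketch of a WordCount reducer in plain Python, not the Hadoop Java API. The function name reduce_wordcount is an illustrative assumption. The reducer receives one word together with all the counts emitted for it and merges them into a single total.

```python
def reduce_wordcount(word, counts):
    """Reduce step: merge all intermediate counts for one word into a total."""
    return (word, sum(counts))


# The framework groups intermediate pairs by key before calling the reducer,
# so "Hello" arrives once, with every 1 that was emitted for it.
result = reduce_wordcount("Hello", [1, 1])
print(result)
```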
Driver Class
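In Hadoop the driver configures the job and submits it to the cluster. The sketch below only imitates that flow in a single Python process, under stated assumptions: run_job, map_wordcount and reduce_wordcount are hypothetical names, the grouping step stands in for the framework's shuffle, and the line index is used as the input key in place of a byte offset.

```python
from collections import defaultdict


def map_wordcount(offset, line):
    """Map step: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word, 1)


def reduce_wordcount(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return (word, sum(counts))


def run_job(lines):
    """Toy driver: run map, group by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for offset, line in enumerate(lines):
        for word, count in map_wordcount(offset, line):
            grouped[word].append(count)  # shuffle: collect values per key
    # Reduce each key in sorted order, as a real job sorts keys before reducing.
    return dict(
        reduce_wordcount(word, counts)
        for word, counts in sorted(grouped.items())
    )


counts = run_job(["hello hadoop", "hello hdfs"])
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the grouped values are moved across the network to the reducers.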
Installation
http://www.tutorialspoint.com/hadoop/hadoop_enviornment_setup.htm
https://www.digitalocean.com/community/tutorials/how-to-install-hadoop-on-ubuntu-13-10
Resources
https://hadoop.apache.org
Hadoop: The Definitive Guide, 3rd Edition: Storage and Analysis at Internet Scale, by Tom White

Hadoop introduction