1. HADOOP FOUNDATION FOR ANALYTICS
BY
B.MONICA
II M.SC COMPUTER SCIENCE
BON SECOURS COLLEGE FOR WOMEN
2. HADOOP
Hadoop is an open-source software framework
licensed under the Apache License 2.0.
It includes:
– MapReduce: offline (batch) computing engine
– HDFS: Hadoop Distributed File System
3. HADOOP GOALS
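Scalable: It can reliably store and process petabytes of data.
Economical: It distributes the data and processing across
clusters of commonly available computers.
Efficient: By distributing the data, it can process it in parallel
on the nodes where the data is located.
Reliable: It automatically maintains multiple copies of data
and redeploys computing tasks after failures.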
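The reliability goal can be illustrated with a small sketch. This is not HDFS code: the round-robin placement, the node and block names, and the helper functions are all hypothetical, and real HDFS placement is rack-aware. The replication factor of 3 matches the HDFS default.

```python
def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for index, block in enumerate(blocks):
        placement[block] = [
            nodes[(index + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have a copy on a live node."""
    return [
        block
        for block, hosts in placement.items()
        if any(host not in failed_nodes for host in hosts)
    ]


# Hypothetical four-node cluster holding three blocks.
nodes = ["node-1", "node-2", "node-3", "node-4"]
placement = place_replicas(["blk-0", "blk-1", "blk-2"], nodes)

# Two nodes fail at once; every block still has a live copy.
survivors = readable_blocks(placement, failed_nodes={"node-1", "node-2"})
print(placement)
print(survivors)
```

With three copies per block, any two nodes in this toy cluster can fail without losing data.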
4. USES FOR HADOOP
Data-intensive text processing
Assembly of large genomes
Graph mining
Machine learning and data mining
Large scale social network analysis
5. HADOOP: ASSUMPTIONS
Hardware will fail.
Applications need a write-once-read-many access model.
EXAMPLE
Facebook:
- Uses Hadoop to store copies of internal log and dimension
data sources
- Uses it as a source for reporting/analytics and
machine learning
- Runs a 320-machine cluster with 2,560 cores and
about 1.3 PB of raw storage
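The cluster figures above can be broken down per machine with simple arithmetic. The replication factor of 3 is an assumption (it is the HDFS default; the slide does not say what Facebook used), and 1 PB is taken as 1000 TB.

```python
MACHINES = 320
CORES = 2560
RAW_STORAGE_PB = 1.3
REPLICATION = 3  # assumption: HDFS default replication factor

cores_per_machine = CORES / MACHINES
raw_tb_per_machine = RAW_STORAGE_PB * 1000 / MACHINES  # 1 PB = 1000 TB
unique_data_pb = RAW_STORAGE_PB / REPLICATION

print(f"cores per machine: {cores_per_machine:.0f}")
print(f"raw storage per machine: {raw_tb_per_machine:.2f} TB")
print(f"unique data at {REPLICATION}x replication: {unique_data_pb:.2f} PB")
```

That works out to 8 cores and roughly 4 TB of raw disk per machine, which fits the commodity hardware Hadoop targets.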
7. HISTORY OF HADOOP
Hadoop was started by Doug Cutting to support
two of his other well-known projects, Lucene and
Nutch.
Hadoop was inspired by the Google File
System (GFS), which was described in a paper
released by Google in 2003.
Hadoop, originally called the Nutch Distributed File
System (NDFS), split from Nutch in 2006 to
become a subproject of Lucene. At this point it
was renamed Hadoop.
8. EXAMPLE
Google search engine
2013 - Hadoop 1.1.2 and Hadoop 2.0.3 alpha are released.
- Ambari, Cassandra, and Mahout have been
added
9. Hadoop is in use at many organizations that
handle big data:
o Yahoo!
o Facebook
o Amazon
o Netflix
10. APACHE MAP REDUCE
A software framework for distributed
processing of large data sets.
The framework takes care of scheduling tasks,
monitoring them, and re-executing any failed
tasks.
It splits the input data set into independent
chunks, which the map tasks process in parallel.
The framework sorts the outputs of
the maps, which are then input to the reduce
tasks.
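The map, sort, and reduce steps described above can be sketched as a word count in plain Python. This is only an in-memory illustration of the model, not the Hadoop API: the function names and sample chunks are hypothetical, and everything runs in one process.

```python
from itertools import groupby


def map_words(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_counts(word, counts):
    """Reduce task: sum all counts emitted for one word."""
    return word, sum(counts)


def run_word_count(chunks):
    # Each chunk is independent, so these map calls could run in parallel.
    map_output = []
    for chunk in chunks:
        map_output.extend(map_words(chunk))

    # The framework sorts map output by key before the reduce phase.
    map_output.sort(key=lambda pair: pair[0])

    # Each group of equal keys becomes the input of one reduce call.
    results = {}
    for word, group in groupby(map_output, key=lambda pair: pair[0]):
        key, total = reduce_counts(word, (count for _, count in group))
        results[key] = total
    return results


chunks = ["hadoop stores data", "hadoop processes data in parallel"]
word_counts = run_word_count(chunks)
print(word_counts)
```

Sorting by key is what lets each reduce call see all values for one word together.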
12. MAP REDUCE DATAFLOW
An input reader
A Map function
A partition function
A compare function
A Reduce function
An output writer
EXAMPLE:
JOB TRACKER
TASK TRACKER
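The six dataflow components listed on this slide can be wired together in a minimal sketch. All function names are hypothetical and the job runs in a single process. The CRC32-based partitioner is an assumption chosen because it is stable across runs; it is not Hadoop's own partitioner.

```python
import zlib
from functools import cmp_to_key


def input_reader(text):
    """Input reader: split raw input into records (here, lines)."""
    return text.splitlines()


def map_fn(record):
    """Map function: emit (word, 1) pairs for one record."""
    return [(word, 1) for word in record.split()]


def partition_fn(key, num_reducers):
    """Partition function: pick the reducer for a key with a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def compare_fn(a, b):
    """Compare function: defines the key order seen by each reducer."""
    return (a > b) - (a < b)


def reduce_fn(key, values):
    """Reduce function: combine all values for one key."""
    return key, sum(values)


def output_writer(pairs):
    """Output writer: format reducer output as tab-separated lines."""
    return "\n".join(f"{key}\t{value}" for key, value in pairs)


def run_job(text, num_reducers=2):
    partitions = [dict() for _ in range(num_reducers)]
    for record in input_reader(text):
        for key, value in map_fn(record):
            target = partitions[partition_fn(key, num_reducers)]
            target.setdefault(key, []).append(value)
    outputs = []
    for partition in partitions:
        keys = sorted(partition, key=cmp_to_key(compare_fn))
        outputs.append(output_writer(reduce_fn(k, partition[k]) for k in keys))
    return outputs


job_outputs = run_job("a b a\nc b a")
for reducer_id, output in enumerate(job_outputs):
    print(f"reducer {reducer_id}:\n{output}")
```

Because the partition function depends only on the key, every value for a given key reaches the same reducer.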
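13. MAP REDUCE: FAULT TOLERANCE
Worker failure: The master pings every worker
periodically. If a worker does not respond within a set time,
the master marks it as failed and reschedules its tasks.
Master failure: The master can write
periodic checkpoints of its data structures, so a new
copy can be started from the last checkpointed state.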
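Both failure cases can be sketched with a toy master. The `Master` class, its method names, the timeout value, and the JSON checkpoint format are all hypothetical. Times are passed in explicitly so the example is deterministic.

```python
import json


class Master:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_reply = {}   # worker -> time of its last ping reply
        self.assignments = {}  # task -> worker currently running it
        self.pending = []      # tasks waiting to be re-executed

    def record_reply(self, worker, now):
        """Note that a worker answered a ping at time `now`."""
        self.last_reply[worker] = now

    def assign(self, task, worker):
        self.assignments[task] = worker

    def check_workers(self, now):
        """Mark silent workers as failed and queue their tasks again."""
        failed = [w for w, t in self.last_reply.items() if now - t > self.timeout]
        for worker in failed:
            del self.last_reply[worker]
            for task, owner in list(self.assignments.items()):
                if owner == worker:
                    del self.assignments[task]
                    self.pending.append(task)
        return failed

    def checkpoint(self):
        """Serialize the master data structures."""
        return json.dumps(self.__dict__, sort_keys=True)

    @classmethod
    def restore(cls, snapshot):
        """Start a new master from the last checkpoint."""
        state = json.loads(snapshot)
        master = cls(state["timeout"])
        master.__dict__.update(state)
        return master


master = Master(timeout=10)
master.record_reply("worker-1", now=0)
master.record_reply("worker-2", now=0)
master.assign("map-0", "worker-1")
master.assign("map-1", "worker-2")

master.record_reply("worker-1", now=12)   # worker-2 stays silent
failed = master.check_workers(now=15)

recovered = Master.restore(master.checkpoint())
print(failed, recovered.pending, recovered.assignments)
```

The restored master carries the same pending queue, so the failed worker's task is not lost if the master itself restarts.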
14. JOB TRACKER
Tracks MapReduce jobs in Hadoop.
The JobTracker performs the following actions:
Accepts MapReduce jobs from client applications
Talks to the NameNode to determine the location of the data
Locates available TaskTracker nodes, preferring those at or near the data
Submits the work to the chosen TaskTracker nodes
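The steps above can be sketched as a simple locality-aware scheduler. This is not the JobTracker's actual algorithm: the function names are hypothetical, the `block_locations` dictionary stands in for the answer from the NameNode, and `free_slots` stands in for the slots reported by the TaskTrackers.

```python
def choose_task_tracker(block_hosts, free_slots):
    """Pick a node for one task, preferring a node that holds the data."""
    for host in block_hosts:
        if free_slots.get(host, 0) > 0:
            return host
    # No data-local node has a free slot: fall back to any free node.
    for host, slots in free_slots.items():
        if slots > 0:
            return host
    return None


def schedule(job_blocks, block_locations, free_slots):
    """Assign one task per input block and return a block -> node plan."""
    free = dict(free_slots)  # work on a copy so the caller's view is unchanged
    plan = {}
    for block in job_blocks:
        host = choose_task_tracker(block_locations[block], free)
        plan[block] = host
        if host is not None:
            free[host] -= 1
    return plan


# Hypothetical NameNode answer: which nodes hold each block.
block_locations = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-a", "node-c"],
    "blk_3": ["node-a"],
}
free_slots = {"node-a": 1, "node-b": 1, "node-c": 1}

plan = schedule(["blk_1", "blk_2", "blk_3"], block_locations, free_slots)
print(plan)
```

Here two tasks run where their data lives, and the third falls back to a remote node because its only data-local node is already busy.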
15. OTHER TOOLS
Hive: Hadoop processing with SQL
Pig: Hadoop processing with scripting
Cascading: Pipe-and-filter processing model
HBase: Database model built on top of Hadoop
Flume: Designed for large-scale data movement