Hadoop

GANDHI INSTITUTE FOR TECHNOLOGICAL
ADVANCEMENT, BHUBANESWAR
TECHNICAL SEMINAR ON
HADOOP
GUIDED BY-
PROF. KUNDAN CHANDRA PATRA
PROF. SWOGAT KUMAR JENA
PROF. SAROJ KUMAR MOHANTY
PRESENTED BY-
NAME- ABHIJEET RAJ
BRANCH- CSE(1)
REG NO.- 1301287529

CONTENTS -
1. INTRODUCTION TO HADOOP
2. HADOOP-HISTORY AND ORIGIN
3. BIG DATA ANALYTICS AND CHALLENGES
4. HADOOP ECOSYSTEM
5. HDFS ARCHITECTURE
6. HADOOP VS RDBMS
7. MAP REDUCE
8. PIG AND HIVE
9. CONCLUSION
Abhijeet raj,131001

INTRODUCTION-
• What is Hadoop?
• Apache Hadoop is an open-source software
framework for distributed storage and
processing of large data sets
• Written in Java
• Its file system (HDFS) is modeled on the Google File System (GFS)

Continued...
• It is designed to scale up from a single server to
thousands of machines, each offering local
computation and storage.
• The Hadoop framework consists of two main layers:
• HDFS (storage)
• MapReduce (processing)

History and Origin
• Doug Cutting was trying to build an open-source
search engine in 2003
• Google released papers on its distributed
systems, MapReduce and the Google File
System (GFS), which powered the Google search
engine.

Continued...
• Doug Cutting took these ideas and started to
work on an open-source implementation
• In 2006 he joined Yahoo!, and the distributed
system was named Hadoop
• Yahoo! open-sourced it through the Apache
Software Foundation

Organizations using Hadoop
• Amazon
• Adobe
• Cloudspace
• eBay
• Facebook
• Google
• IBM
• LinkedIn
• Yahoo!

Big data analytics and challenges
• Big Data is commonly taken to start at data sizes of
about 1 terabyte.
• The 4 V’s of Big Data:
1. VOLUME- The scale of data
2. VARIETY- Different forms of data
3. VELOCITY- Analysis of streaming data
4. VERACITY- Uncertainty of data

Challenges for Big Data processing
• Meeting the need for speed
• Scale
• Continuous Availability
• Displaying meaningful results
• Workload diversity
• Data security
• Cost
• Manageability

Hadoop vs traditional RDBMS
Factors                  Hadoop                              RDBMS
Size of data             Petabytes                           Gigabytes
Integrity of data        Low                                 High
Data schema              Dynamic                             Static
Access method            Batch                               Interactive and batch
Scaling                  Linear                              Non-linear
Data structure           Unstructured/structured             Structured
Normalization of data    Not required                        Required
Query response time      Has latency (due to batch process)  Can be near immediate

Hadoop Ecosystem

HDFS (Hadoop Distributed File System)
• A distributed file system designed to run on
commodity hardware
• It is suitable for distributed storage and
processing.
• The built-in servers of the Namenode and
Datanode help users easily check the
status of the cluster.
• HDFS provides file permissions and
authentication.

Continued...
Namenode
• The Namenode is the node that stores the filesystem
metadata, i.e. which file maps to which blocks
and which blocks are stored on which
Datanode.
Datanode
• The Datanode is where the actual data resides.
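
A minimal sketch of the metadata described above, written in Python rather than Hadoop's own Java. The function names, the tiny block size, and the round-robin placement are illustrative assumptions, not HDFS APIs (real HDFS placement is rack-aware). It only shows the two mappings the Namenode keeps: file to blocks, and block to Datanodes.

```python
def split_into_blocks(file_name, file_size, block_size):
    """Return block ids for a file; the last block may be only partly full."""
    n_blocks = max(1, -(-file_size // block_size))  # ceiling division
    return [f"{file_name}#blk{i}" for i in range(n_blocks)]


def place_blocks(block_ids, datanodes, replication):
    """Assign each block to `replication` distinct datanodes, round-robin.

    Assumption: simple rotation instead of HDFS's rack-aware policy.
    """
    if replication > len(datanodes):
        raise ValueError("replication exceeds number of datanodes")
    placement = {}
    for start, blk in enumerate(block_ids):
        placement[blk] = [
            datanodes[(start + k) % len(datanodes)] for k in range(replication)
        ]
    return placement


# Namenode-style metadata (hypothetical file and node names)
datanodes = ["dn1", "dn2", "dn3"]
file_to_blocks = {
    "logs.txt": split_into_blocks("logs.txt", file_size=300, block_size=128)
}
block_to_nodes = place_blocks(file_to_blocks["logs.txt"], datanodes, replication=2)

for blk, nodes in block_to_nodes.items():
    print(blk, "->", nodes)
```

The Datanodes would hold the block contents themselves; the dictionaries here stand in only for what the Namenode tracks.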

Continued...
Job tracker
• The primary functions of the job tracker are resource
management, tracking resource availability, and
task life cycle management.
Task tracker
• Follows the orders of the job tracker and
updates the job tracker with its progress status
periodically.

Goals of HDFS
• Fault detection and recovery
• Support for huge datasets
• Reduced network traffic
• Increased throughput
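
The first goal, fault detection and recovery, can be sketched as follows. This is a simplified Python model, not HDFS code: Datanodes do report to the Namenode with heartbeats, but the timeout value, the node names, and the helper functions `live_datanodes` and `recover` are assumptions made for illustration.

```python
def live_datanodes(last_heartbeat, now, timeout):
    """Fault detection: a node is live if its last heartbeat is recent enough."""
    return {node for node, t in last_heartbeat.items() if now - t <= timeout}


def recover(block_to_nodes, live_nodes, replication):
    """Fault recovery: re-replicate blocks that lost replicas on dead nodes."""
    ordered_live = sorted(live_nodes)
    repaired = {}
    for blk, nodes in block_to_nodes.items():
        survivors = [n for n in nodes if n in live_nodes]
        if not survivors:
            repaired[blk] = []  # no copy left to replicate from
            continue
        candidates = [n for n in ordered_live if n not in survivors]
        needed = max(0, replication - len(survivors))
        repaired[blk] = survivors + candidates[:needed]
    return repaired


# Hypothetical cluster state: dn2 has been silent for too long
last_heartbeat = {"dn1": 98, "dn2": 40, "dn3": 99, "dn4": 97}
block_to_nodes = {
    "blk0": ["dn1", "dn2"],
    "blk1": ["dn2", "dn3"],
    "blk2": ["dn3", "dn1"],
}

live = live_datanodes(last_heartbeat, now=100, timeout=30)
repaired = recover(block_to_nodes, live, replication=2)
print(repaired)
```

Every block that had a replica on the failed node ends up with two live replicas again, which is the behaviour the goal describes.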

Map Reduce
• MapReduce is a processing technique and a
programming model for distributed computing,
implemented in Java
• Map: input data is broken into tuples
(key/value pairs)
• Reduce: combines those tuples into a smaller
set of tuples
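
The two steps can be illustrated with a word count. This is a single-process Python sketch of the model, not the Hadoop Java API; the function names and the sample lines are assumptions for illustration, and the `shuffle` step stands in for the grouping the framework performs between map and reduce.

```python
from collections import defaultdict


def map_phase(line):
    """Map: break one input line into (word, 1) tuples."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all tuples for one key into a single tuple."""
    return key, sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

In a real cluster each `map_phase` call would run on the node holding that part of the input, and the reducers would run in parallel over different keys.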

Advantages of Map Reduce
• Easy to scale data processing over multiple
computing nodes.
• Parallel processing.
• Fast.
• Simple model of programming

HBASE
• Developed by the Apache Software Foundation
• Database for Hadoop.
• Open source
• Non-relational

Continued...
• Distributed
• Written in Java
• Native access is through the Java client API;
JDBC connectivity is available through add-on
layers such as Apache Phoenix

YARN
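• Yet Another Resource Negotiator
• In YARN, the job tracker's responsibilities are split
between a global Resource Manager and a
per-application Application Master
• A Node Manager on each worker node takes the
place of the task tracker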

YARN ARCHITECTURE

PIG
• A platform for analyzing large data sets that
consists of a high-level language (Pig Latin) for
expressing data analysis programs
• The structure of Pig programs is amenable to
substantial parallelization

Continued...
• Ease of programming
• Optimization opportunities
• Extensibility

HIVE
• Data warehouse software that facilitates querying
and managing large datasets
• Allows traditional map/reduce programmers
to plug in their custom mappers and
reducers

PIG VS HIVE
                      PIG                      HIVE
TYPE OF LANGUAGE      PROCEDURAL LANGUAGE      DECLARATIVE LANGUAGE
EASE OF USE           COMPLEX                  EASY
NATURE OF USAGE       EFFICIENCY IN COMPUTING  ANALYTICS AREA
TYPE OF DATA          VARIABLES                TABLES
DEBUGGING FACILITY    DEBUGGED LOCALLY         COMPLEX
MAINTENANCE           MORE                     LESS
DEVELOPMENT TIME      MORE                     LESS
HANDLING BIG DATA     HANDLES MORE DATA        MEMORY OVERFLOW

Conclusion
• Hadoop has been a very effective solution for
companies dealing with data in petabytes,
i.e. big data.
• It has overcome the limitations of traditional
data storage.
• Being open source, it is widely accepted.
