APACHE HADOOP
JERRIN JOSEPH
CSU ID#2578741
CONTENTS
 Hadoop
 Hadoop Distributed File System (HDFS)
 Hadoop MapReduce
 Apache Hive
 Apache HBase
 ZooKeeper
 Hortonworks Data Platform
 Cloudera Hadoop Solution
ABSTRACT
 Hadoop is an efficient tool for handling Big Data.
 It reduces data processing time from days to hours.
 The Hadoop Distributed File System (HDFS) is the data storage unit of Hadoop.
 Hadoop MapReduce is the data processing unit, which works on the distributed processing principle.
INTRODUCTION
 What is Big Data?
 Bulk amounts of data
 Largely unstructured
 Many applications need to handle huge amounts of data (on the order of 500+ TB per day)
 If a regular machine needs to transmit 1 TB of data through 4 channels, it takes about 43 minutes.
 What about 500 TB?
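As a rough back-of-the-envelope check (a sketch assuming transfer time scales linearly with data volume, per the 1 TB figure above):

\[ t_{500\,\mathrm{TB}} \approx 500 \times 43\ \mathrm{min} = 21500\ \mathrm{min} \approx 358\ \mathrm{h} \approx 15\ \mathrm{days} \]

A single machine becomes the bottleneck; distributing storage and processing across a cluster is Hadoop's answer.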
HADOOP
 “The Apache Hadoop software library is a
framework that allows for the distributed processing
of large data sets across clusters of computers
using simple programming models”[1]
 Core components:
 HDFS: stores large data sets across clusters of computers.
 Hadoop MapReduce: performs distributed processing using simple programming models.
HADOOP : KEY FEATURES
 High Scalability
 Highly Tolerant to Software & Hardware Failures
 High Throughput
 Best for a small number of large files
 Performs fast, parallel execution of jobs
 Provides Streaming access to data
 Can be built out of commodity hardware
HADOOP: DRAWBACKS
 Not good for low-latency data access
 Not good for a large number of small files
 Not good for files with multiple writers
 No encryption at the storage or network level
 Has a highly complex security model
 Hadoop is not a database: hence a file cannot be altered in place.
HADOOP ARCHITECTURE
HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
 Storage unit of Hadoop
 Relies on the principles of a distributed file system.
 HDFS has a master-slave architecture
 Main components:
 Name Node : Master
 Data Node : Slave
 3+ replicas of each block (the default replication factor is 3)
 Default block size : 64 MB
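A minimal sketch of reading these defaults through the HDFS Java client API (assuming Hadoop client libraries on the classpath and cluster settings in core-site.xml/hdfs-site.xml; the class name and path are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDefaults {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/");  // defaults are reported per path
        System.out.println("Block size  : " + fs.getDefaultBlockSize(p));
        System.out.println("Replication : " + fs.getDefaultReplication(p));
        fs.close();
    }
}
```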
HDFS: KEY FEATURES
 Highly fault tolerant (automatic failure recovery)
 High throughput
 Designed for very large files (sizes in the TB range) that are few in number.
 Provides streaming access to file system data. Especially good for write-once, read-many files (for example, log files).
 Can be built out of commodity hardware; HDFS doesn't need highly expensive storage devices.
HDFS ARCHITECTURE
NAME NODE
 Master of HDFS
 Maintains and manages the data on the Data Nodes
 Highly reliable machine (can even use RAID)
 Expensive hardware
 Stores NO data; just holds metadata!
 Secondary Name Node:
 Periodically reads the metadata from the Name Node's RAM and stores it to hard disk.
 Active & passive Name Nodes from Gen2 Hadoop
DATA NODES
 Slaves in HDFS
 Provide data storage
 Deployed on independent machines
 Responsible for serving read/write requests from clients.
 Data processing is done on the Data Nodes.
HDFS OPERATION
 Client makes a write request to the Name Node
 Name Node responds with information about the available Data Nodes and where the data should be written.
 Client writes the data to the addressed Data Node.
 Replicas of all blocks are automatically created by the data pipeline.
 If a write fails, the Data Node notifies the Client, which obtains a new location to write to.
 If the write completes successfully, an acknowledgement is given to the Client.
 Hadoop uses non-posted writes (each write waits for an acknowledgement).
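A hedged sketch of this write path using the FileSystem API: create() first contacts the Name Node, the stream then pipelines bytes to the Data Nodes, and close() waits for the final acknowledgement (the file path and content are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() asks the Name Node for block locations on Data Nodes
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/log.txt"))) {
            out.writeBytes("first log line\n"); // buffered, then pipelined to Data Nodes
        }                                       // close() waits for the final ack (non-posted write)
        fs.close();
    }
}
```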
HDFS: FILE WRITE
HDFS: FILE READ
HADOOP MAPREDUCE
 Simple programming model
 Hadoop's processing unit
 MapReduce also has a master-slave architecture
 Main components:
 Job Tracker : Master
 Task Tracker : Slave
 Based on Google's MapReduce
 Does not fetch data to the master node; data is processed at the slave nodes and the output is returned to the master
 Implemented using Maps and Reduces
 Input is split by FileInputFormat
 Maps
 Implemented by extending the Mapper class
 Produce (key, value) pairs as intermediate results from the data.
 Reduces
 Implemented by extending the Reducer class
 Produce the required output from the intermediate results of the Maps.
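To make the Mapper/Reducer split concrete, here is the classic word-count job sketched with the org.apache.hadoop.mapreduce API. A minimal sketch, not a tuned production job; input and output paths are supplied as arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // intermediate (key, value) pair
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);     // final output
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // split by FileInputFormat
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```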
JOB TRACKER
 Master in MapReduce
 Receives the job request from Client
 Governs execution of jobs
 Makes task scheduling decisions
TASK TRACKER
 Slave in MapReduce
 Governs execution of Tasks
 Periodically reports the progress of tasks
MAPREDUCE ARCHITECTURE
MAPREDUCE OPERATIONS
APACHE HIVE
HIVE
 Built on top of Hadoop
 Supports an SQL-like query language: HiveQL
 Data in Hive is organized into tables
 Provides structure for unstructured Big Data
 Works with data inside HDFS
 Data : a file or group of files in HDFS
 Schema : metadata stored in a relational database
 Data and schema are separated
 Schema applies only to existing data
 Supports primitive column types and nestable collection types (Array and Map)
HIVE QUERY LANGUAGE
 SQL-like language
 DDL : create tables with specific serialization formats
 DML : load data from external sources and insert query results into Hive tables
 Does not support updating or deleting rows in existing tables
 Supports multi-table insert
 Supports custom map-reduce scripts written in any language
 Can be extended with custom functions:
 User-Defined Function (UDF)
 User-Defined Table-generating Function (UDTF)
 User-Defined Aggregation Function (UDAF)
 External interfaces:
 Web UI : management
 Hive CLI : run queries, browse tables, etc.
 API : JDBC, ODBC
 Metastore :
 System catalog containing metadata about Hive tables
 Driver :
 Manages the life cycle of a HiveQL statement through compilation, optimization and execution
 Compiler :
 Translates a HiveQL statement into a plan consisting of a DAG of map-reduce jobs
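A hedged sketch of the JDBC interface listed above (assuming a HiveServer2 instance at localhost:10000 and the Hive JDBC driver on the classpath; the logs table and file path are illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // register the driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = con.createStatement()) {

            // DDL: a table backed by files in HDFS
            stmt.execute("CREATE TABLE IF NOT EXISTS logs (line STRING)");

            // DML: load data from HDFS into the table
            stmt.execute("LOAD DATA INPATH '/user/demo/log.txt' INTO TABLE logs");

            // HiveQL query: compiled into a DAG of MapReduce jobs
            try (ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM logs")) {
                while (rs.next()) System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}
```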
HIVE ARCHITECTURE
HIVE ACHIEVEMENTS & FUTURE PLANS
 First step toward a warehousing layer for Hadoop (a web-based MapReduce data processing system)
 Accepts only a subset of SQL; working to subsume more SQL syntax
 Uses a rule-based optimizer; plans to build a cost-based optimizer
 Enhancing the JDBC and ODBC drivers to improve interaction with commercial BI tools
 Ongoing work on performance
APACHE HBASE
HBASE
 Distributed, column-oriented database on top of Hadoop/HDFS
 Provides low-latency access to single rows from billions of records
 Column oriented:
 Suited for OLAP
 Best for aggregation
 High compression rate when columns have few distinct values
 No fixed schema or data types
 Built for wide tables : millions of columns, billions of rows
 Denormalized data
 Master-slave architecture
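A hedged sketch of the low-latency single-row access described above, using the HBase Java client (assuming a cluster reachable through the configuration in hbase-site.xml; the events table, d column family and row key are illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRowSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {

            // Write one cell: (row key, column family, qualifier) -> value
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("msg"), Bytes.toBytes("hello"));
            table.put(put);

            // Low-latency read of a single row out of potentially billions
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("msg"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```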
HBASE ARCHITECTURE
HMASTER SERVER
 Like the Name Node in HDFS
 Manages and monitors HBase cluster operations
 Assigns regions to Region Servers
 Handles load balancing and region splitting
REGION SERVER
 Like the Data Node in HDFS
 Highly scalable
 Handles read/write requests
 Communicates directly with clients
INTERNAL ARCHITECTURE
 Tables → Regions
 Region → Stores (one per column family)
 Store → MemStore + StoreFiles → Blocks
APACHE ZOOKEEPER
ZOOKEEPER
 Challenges for distributed applications:
 Coordination
 Race conditions
 Deadlocks
 Partial failure
 Inconsistency
 What is ZooKeeper?
 A distributed coordination service for distributed applications
 Acts like a centralized repository
 ZooKeeper goals:
 Serialization
 Atomicity
 Reliability
 Simple API
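A hedged sketch of coordination through a shared znode, using the ZooKeeper Java client (assuming an ensemble at localhost:2181; the znode path and payload are illustrative):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSketch {
    public static void main(String[] args) throws Exception {
        // Session with a 5 s timeout; the watcher ignores events for brevity
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> { });

        // Create a znode all cooperating processes can see (atomic, ordered)
        if (zk.exists("/demo-config", false) == null) {
            zk.create("/demo-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back; a watch could be set here to react to changes
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```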
ZOOKEEPER ARCHITECTURE
HORTONWORKS DATA PLATFORM
1. Governance and Integration
2. Data Access
3. Data Management
4. Security
5. Operations
 YARN : Data Operating System between Data
Storage and Data Access.
HDP: YARN
 A data operating system on Hadoop
 Enables data to be processed simultaneously in multiple ways
 Provides resource management and a pluggable architecture.
 The data processing engines work with YARN.
HDP: GOVERNANCE AND INTEGRATION
 Provides a reliable, repeatable, and simple framework for managing the flow of data into and out of Hadoop
 Falcon: Framework for simplifying data
management and pipeline processing in Hadoop.
 Sqoop: Tool used to transfer data between Hadoop
and Relational Databases.
 Flume: Service for effectively collecting,
aggregating, and moving large amounts of
streaming data into Hadoop.
HDP: DATA ACCESS
 Batch: MapReduce
 Script: Pig
 Pig Latin defines a set of transformations on data, such as aggregate, join and sort.
 SQL: Hive
 NoSQL: HBase
 Stream: Storm
 Distributed real-time computation system for processing fast, large streams of data.
 Search: Solr
 Advanced full-text search and near-real-time indexing.
HDP: SECURITY
 Critical features for authentication, authorization, accountability and data protection.
HDP: OPERATIONS
 Deploy, monitor and manage a Hadoop cluster within the enterprise data ecosystem.
 Oozie: Java web application used to schedule Hadoop jobs.
CLOUDERA HADOOP
CLOUDERA HADOOP ARCHITECTURE
 Cloudera solution high-level architecture
 Cloudera solution taxonomy
DEPLOYMENT
 Crowbar:
 A complete, automated operations platform.
 Provisions hardware, configures it, and installs Red Hat Enterprise Linux and Cloudera Manager.
 Designed to deploy layers of infrastructure on bare-metal servers, all the way up the stack.
MANAGEMENT
 Ganglia:
 Gathers metrics and tracks them over time.
 Designed to scale to thousands of nodes.
 Nagios:
 Powerful monitoring system.
 Enables organizations to identify and resolve IT infrastructure problems.
 Covers monitoring, alerting, response, reporting, maintenance, and planning.
CDH : COMPONENTS
 Avro: Serialization system.
 Crunch: Java library for more easily writing, testing,
and running MR pipelines.
 DataFu: library of UDFs for data mining and
statistics in Apache Pig.
 Cloudera Impala: Interactive SQL query engine for data in HDFS and HBase.
 Kite SDK: APIs, examples, and docs for building
apps on top of Hadoop.
 Cloudera Search: Offers free-text, Google-style
search of Hadoop data for business users.
CLOUDERA: SOFTWARE LOCATIONS
 JobTracker : Master Name Node
 TaskTracker : Data Node(x)
 NameNode : Master Name Node
 Secondary NameNode : Secondary Name Node
 Operating System Provisioning : Admin Node
 Chef : Admin Node
 Yum Repositories : Admin Node
 Cloudera Manager : Edge Node(x)
 ZooKeeper : Data Node(x)
 HMaster : Master Name Node
 RegionServer : Data Node(x)
 Crowbar Admin : Admin Node
 Journal : Master Name Node, Secondary Name Node, HA Node
CLOUDERA VS HORTONWORKS
 Cloudera uses Crowbar and Cloudera Manager to set up hardware and connect the different tools to the cluster; HDP deploys the YARN data operating system with resource management and a pluggable architecture.
 Cloudera focuses on business solutions, while Hortonworks focuses on the research stream.
 Both support almost all the tools on a Hadoop cluster.
 Cloudera Search in CDH corresponds to Solr search in HDP.
 Cloudera Hadoop defines some extra tools for working on a Hadoop cluster beyond the common Apache Hadoop tools.
CONCLUSION
 Hadoop is a successful solution for handling Big Data
 Hadoop has expanded from a simple project to a full platform
 The projects and tools built on Hadoop are proof of its success.
REFERENCES
[1] "Apache Hadoop", http://hadoop.apache.org/
[2] “Apache Hive”, http://hive.apache.org/
[3] “Apache HBase”, https://hbase.apache.org/
[4] “Apache ZooKeeper”, http://zookeeper.apache.org/
[5] Jason Venner, "Pro Hadoop", Apress Books, 2009
[6] "Hadoop Wiki", http://wiki.apache.org/hadoop/
[7] Jiong Xie, Shu Yin, Xiaojun Ruan, Zhiyang Ding, Yun Tian, James Majors, Adam Manzanares, Xiao Qin, "Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters", 19th International Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010
[8] Dhruba Borthakur, The Hadoop Distributed File
System: Architecture and Design, The Apache
Software Foundation 2007.
[9] "Apache Hadoop",
http://en.wikipedia.org/wiki/Apache_Hadoop
[10] "Hadoop Overview",
http://www.revelytix.com/?q=content/hadoop-
overview
[11] Konstantin Shvachko, Hairong Kuang, Sanjay
Radia, Robert Chansler, The Hadoop Distributed
File System, Yahoo!, Sunnyvale, California USA,
Published in: Mass Storage Systems and
Technologies (MSST), 2010 IEEE 26th Symposium.
[12] Vinod Kumar Vavilapalli, Arun C Murthy, Chris
Douglas, Sharad Agarwal, Mahadev Konar, Robert
Evans, Thomas Graves, Jason Lowe, Hitesh Shah,
Siddharth Seth, Bikas Saha, Carlo Curino, Owen
O’Malley, Sanjay Radia, Benjamin Reed, Eric
Baldeschwieler, Apache Hadoop YARN: Yet Another
Resource Negotiator, ACM Symposium on Cloud
Computing 2013, Santa Clara, California.
[13] Raja Appuswamy, Christos Gkantsidis, Dushyanth
Narayanan, Orion Hodson, and Antony Rowstron,
Scale-up vs Scale-out for Hadoop: Time to rethink?,
Microsoft Research, ACM Symposium on Cloud
Computing 2013, Santa Clara, California.
[14] “Hortonworks Data Platform”,
http://www.hortonworks.com/hdp/
[15] Dell | Cloudera Solution Reference Architecture
v2.1.0, A Dell Reference Architecture Guide, Nov
2012
[16] “Cloudera”, http://www.cloudera.com/