HADOOP
HIGH AVAILABILITY DISTRIBUTED OBJECT ORIENTED PLATFORM
Latest version: Hadoop 3.3.6
Release Date: 23 June, 2023
Written in Java
Developer: Apache Software Foundation
Presented by:
Asmita Raj (CUSB2202312004)
Medha Madhvi
(CUSB2202312016)
COMPONENTS OF HADOOP
1. HDFS
2. MAPREDUCE
3. YARN
4. HADOOP COMMON OR COMMON UTILITIES
MAPREDUCE ARCHITECTURE
The complete execution process (both the Map and the Reduce tasks) is controlled by two types of entities:
• Job Tracker: acts as the master, responsible for the complete execution of a submitted job.
• Multiple Task Trackers: act as slaves, each of them performing part of the job.
• A job is divided into multiple tasks, which are then run on multiple data nodes in the cluster.
• It is the responsibility of the Job Tracker to coordinate this activity by scheduling tasks to run on different data nodes.
• Execution of each individual task is then looked after by a Task Tracker, which resides on every data node executing part
of the job.
• The Task Tracker's responsibility is to send progress reports to the Job Tracker.
• In addition, the Task Tracker periodically sends a 'heartbeat' signal to the Job Tracker to notify it of the current
state of the system.
• Thus, the Job Tracker keeps track of the overall progress of each job. In the event of a task failure, the Job Tracker can
reschedule it on a different Task Tracker.
WORKFLOW OF MAPREDUCE
THE DATA GOES THROUGH THE FOLLOWING PHASES OF MAPREDUCE IN
BIG DATA
• Input: The input data is distributed across nodes as input splits.
• Input Splits: The input to a MapReduce job in Big Data is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is
consumed by a single map task.
• RecordReader: The RecordReader communicates with the input split and converts the data into key-value pairs suitable to be read by the mapper.
• Mapper: The mapper works on the key-value pairs and gives an intermediate output, which goes for further processing.
• Intermediate Keys: The key-value pairs generated by the mapper are known as intermediate keys.
• Combiner: This kind of local reducer helps group similar data generated from the map phase into identifiable sets. It is an optional part of the
MapReduce algorithm.
• Partitioner: The partitioner decides how outputs from the combiners are sent to the reducers.
• Shuffle and Sort: This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase output;
the output of the partitioner is shuffled and sorted.
• Reducer: The reducer combines all the intermediate values for an intermediate key into a list of tuples.
• RecordWriter: The RecordWriter writes the output key-value pairs from the reducer to the output files.
• Final output: In the output phase, an output formatter translates the final key-value pairs from the Reducer function and writes them
onto a file using a record writer.
WORD COUNT EXAMPLE THROUGH MAPREDUCE
India is a peninsular country in Asia.
India is the seventh largest country in the world.
India has the second largest population in the world.
India is a country having different religions.
India is a collection of states and union territories.
New Delhi is the capital of India.
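The word count above maps naturally onto the Hadoop MapReduce Java API. The following is a minimal sketch (class name and input/output paths are illustrative): the mapper emits (word, 1) pairs and the reducer, also used as the combiner, sums them.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in an input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // intermediate key-value pair
      }
    }
  }

  // Reducer (also usable as the combiner): sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);          // final (word, count) pair
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input/india.txt
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output/wordcount
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a jar, this would be launched with `hadoop jar wordcount.jar WordCount <input> <output>`; the output would contain counts such as ('India', …) and ('is', …), with the exact numbers depending on how punctuation is tokenized.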
YARN
• Stands for "Yet Another Resource Negotiator".
• YARN opens up Hadoop by allowing data stored in HDFS to be processed and run through batch
processing, stream processing, interactive processing and graph processing engines.
WHY YARN?
• The problem in Hadoop 1.0 was a single master for everything, resulting in a bottleneck.
• Computational resource utilization was inefficient, so scalability became an issue with this version of
Hadoop. This issue is resolved by YARN, a vital core component of its successor, Hadoop 2.0.
• In Hadoop 2.0, the concepts of Application Master and Resource Manager were introduced by YARN. Across the
Hadoop cluster, the utilization of resources is monitored by the Resource Manager.
MapReduce vs YARN
YARN ARCHITECTURE
COMPONENTS OF YARN
• Client: submits MapReduce jobs.
• Resource Manager: the master daemon of YARN, responsible for resource assignment and management among all
the applications. It has two major components:
• Scheduler: performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it
does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. There are mainly three
types of schedulers in Hadoop:
1. FIFO (First In First Out) Scheduler.
2. Capacity Scheduler.
3. Fair Scheduler.
• Application Manager: responsible for accepting the application and negotiating the first container from the
Resource Manager. It also restarts the Application Master container if a task fails.
COMPONENTS OF YARN
• Node Manager:
• It registers with the Resource Manager and sends heartbeats with the health status of the node.
• It monitors resource usage, performs log management and also kills a container based on directions from the Resource Manager.
• It is also responsible for creating the container process and starting it at the request of the Application Master.
• Application Master:
• The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring
the progress of a single application.
• The Application Master asks the Node Manager to launch a container by sending it a Container Launch Context (CLC), which includes
everything the application needs to run.
• Once the application is started, it sends health reports to the Resource Manager from time to time.
• Container:
• A container is a collection of physical resources such as RAM, CPU cores and disk on a single node.
• Containers are invoked by a Container Launch Context (CLC), which is a record that contains information such as environment
variables, security tokens, dependencies, etc.
WORKFLOW OF YARN
• The client submits an application.
• The Resource Manager allocates a container to start the Application Master.
• The Application Master registers itself with the Resource Manager.
• The Application Master asks the Resource Manager for containers.
• The Application Master notifies the Node Manager to launch the containers.
• The application code is executed in the containers.
• The client contacts the Resource Manager/Application Master to monitor the
application's status.
• The Application Master unregisters with the Resource Manager.
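The client side of this flow can also be driven programmatically through the YARN client API. Below is a minimal sketch that only connects to the Resource Manager and lists the applications it currently knows about; it assumes a valid yarn-site.xml is available on the classpath.

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();   // reads yarn-site.xml from the classpath
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Ask the Resource Manager for the applications it currently tracks.
    List<ApplicationReport> apps = yarnClient.getApplications();
    for (ApplicationReport app : apps) {
      System.out.println(app.getApplicationId() + "  "
          + app.getName() + "  " + app.getYarnApplicationState());
    }

    yarnClient.stop();
  }
}
```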
HADOOP COMMON OR COMMON UTILITIES
• Hadoop Common (common utilities) is the set of Java libraries, files and scripts needed by all the other components present in a
Hadoop cluster.
• These utilities are used by HDFS, YARN and MapReduce for running the cluster.
• Hadoop Common works on the assumption that hardware failure in a Hadoop cluster is common,
so failures need to be handled automatically in software by the Hadoop framework.
THANK YOU
HADOOP DISTRIBUTED FILE SYSTEM
 HDFS OVERVIEW
 ARCHITECTURE OF HDFS
 HDFS Data blocks
 CONCEPT OF READING AND WRITING DATA IN HDFS
 ADVANTAGES AND DISADVANTAGES OF HDFS
 FEATURES OF HDFS
 BASIC COMMANDS IN HDFS
 GOALS OF HDFS
ARCHITECTURE OF HADOOP
HDFS OVERVIEW
• Hadoop comes with a distributed file system called HDFS.
• Data is stored on multiple nodes and is also replicated.
• It follows a write-once, read-many (WORM) access pattern.
• It is cost-effective as it uses commodity hardware.
• It involves the concepts of blocks, DataNodes and the NameNode (master
node).
• When one DataNode is down, the data can still be accessed through
any other DataNode that contains a replica.
• Data is stored in blocks with a default size of 128 MB.
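For the write-once, read-many pattern above, a client talks to HDFS through the Java FileSystem API. The following is a minimal sketch; the NameNode URI (hdfs://namenode:9000) and the file path are assumptions for illustration and would normally come from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; normally taken from fs.defaultFS in core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write once: HDFS splits the file into blocks (128 MB by default) and replicates them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: the client reads each block from the nearest DataNode holding a replica.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    fs.close();
  }
}
```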
Hadoop Distributed File System
Some definitions
• A Rack is a collection of nodes that are physically stored close together and are
all on the same network.
• A Cluster is a collection of racks.
• NameNode: Manages the file system namespace and regulates access by
clients. There is a single NameNode for a cluster.
• The Secondary NameNode works concurrently with the primary NameNode as a
helper daemon.
• DataNode: Serves read and write requests, and performs block creation, deletion,
and replication upon instruction from the NameNode.
• A file is split into one or more blocks, and the set of blocks is stored in DataNodes.
• The default Hadoop block size is 128 MB.
DATA BLOCKS
 A block is the minimum amount of data that can be read or written in one operation.
 The default size of an HDFS block is 128 MB. The block size is
configurable.
 Data is divided into blocks and stored across the cluster.
 When the data is smaller than the block size, it will not
occupy the whole block.
NameNode(Master node)
• It is also known as the master node or primary node.
• It stores the metadata and other information about the DataNodes.
• It manages the file-system namespace.
• It controls the access of different clients to the data blocks.
• It periodically checks the availability of the DataNodes.
• It also takes care of the replication factor of the data blocks.
Secondary Namenode
• The Secondary NameNode is used for taking hourly backups of the
metadata. In case the Hadoop cluster fails or crashes, the Secondary
NameNode takes the hourly backups, or checkpoints, of that data
and stores them in a file named fsimage. This file can then be
transferred to a new system.
Function of Secondary NameNode
Active and Standby NameNode
Rack Awareness
• To reduce network traffic during file reads/writes, the NameNode
chooses the DataNode closest to the client for serving the read/write
request.
• The NameNode maintains the rack ID of each DataNode to achieve this rack
information.
• This concept of choosing the closest DataNode based on rack
information is known as Rack Awareness.
Why Rack Awareness?
• To reduce network traffic during file reads/writes, which improves
cluster performance.
• To achieve fault tolerance even when a rack goes down.
• To achieve high availability of data, so that data is available even in
unfavorable conditions.
• To reduce latency, that is, to make file read/write operations
complete with lower delay.
Rack Awareness and Replication
Block of Replication
Fault Tolerance
Erasure Coding and Parity Block
Comparison
FEATURES OF HDFS
• Highly scalable - HDFS is highly scalable as it can scale to hundreds of
nodes in a single cluster.
• Replication - Due to some unfavorable conditions, the node
containing the data may be lost. To overcome such problems, HDFS
always maintains a copy of the data on a different machine.
• Fault tolerance - In HDFS, fault tolerance signifies the robustness
of the system in the event of failure. HDFS is so highly fault-tolerant
that if any machine fails, another machine containing a copy of that
data automatically becomes active.
• Distributed data storage - This is one of the most important features
of HDFS that makes Hadoop very powerful. Here, data is divided into
multiple blocks and stored across nodes.
• Portable - HDFS is designed in such a way that it can easily be ported
from one platform to another.
ADVANTAGES OF HDFS
• We can use HDFS to store very large data sets, ranging from gigabytes to petabytes
and beyond.
• It is designed for a write-once, read-many access pattern.
• It provides streaming data access.
• It works on low-cost commodity hardware, so it is a cost-effective model.
Disadvantage/Limitation of HDFS
1. Issues with Small Files
2. Slow Processing Speed
3. Support for Batch Processing only
4. No Real-time Processing
5. Latency
6. Iterative Processing
7. Security Issue
8. Lengthy Code
9. No Caching
GOALS OF HDFS
• HANDLING HARDWARE FAILURE:
HDFS runs on many server machines. If any
machine fails, the goal of HDFS is to recover from that failure quickly.
• STREAMING DATA ACCESS:
HDFS is built for streaming access to data that is continuously generated by different sources.
• HIGH THROUGHPUT:
HDFS favours high aggregate throughput over low-latency access to individual records.
Department of Computer Science
Program Name: M. Sc. (Computer Science)
Hadoop Component
Components In Hadoop Ecosystem
Presented By: Pratyush Pritam
Enrollment no. - CUSB2202312021
Semester -3 , Session – 2022-24
The Components in the Hadoop Ecosystem are classified into:
General Purpose Execution Engines
Database Management Tools
Data Abstraction Engines
Real-Time Data Streaming Tools
Machine Learning Engines
Cluster Management
Graph Processing Engines
Data Storage
Apache Hive:
Functionality:
 Apache Hive is a data warehouse system built on top of Hadoop and is used for analyzing structured and semi-structured data.
 Hive abstracts the complexity of Hadoop MapReduce.
 Apache Hive supports Data Definition Language (DDL), Data Manipulation Language (DML) and User Defined Functions
(UDF).
 Extensible and scalable to cope with the growing volume and variety of data, without affecting the performance of the system.
 It serves as an efficient ETL (Extract, Transform, Load) tool.
Hive Architecture and its Components
 Hive Clients: Hive supports applications written in many
languages like Java, C++, Python, etc., using JDBC, Thrift
and ODBC drivers.
 Hive Services: Apache Hive provides various services like the
CLI, Web Interface, etc. to perform queries.
 Processing Framework and Resource
Management: Internally, Hive uses the Hadoop MapReduce
framework as the de facto engine to execute the queries.
 Distributed Storage: As Hive is installed on top of Hadoop,
it uses the underlying HDFS for distributed storage.
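As an illustration of the JDBC client path mentioned above, here is a minimal Java sketch against HiveServer2; the host, port, credentials and the word_table table are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port and database are illustrative.
    String url = "jdbc:hive2://hiveserver:10000/default";
    Class.forName("org.apache.hive.jdbc.HiveDriver");   // usually optional with modern JDBC

    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {

      // HiveQL is translated by Hive into jobs on the underlying execution engine.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT word, COUNT(*) AS cnt FROM word_table GROUP BY word")) {
        while (rs.next()) {
          System.out.println(rs.getString("word") + " : " + rs.getLong("cnt"));
        }
      }
    }
  }
}
```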
Data Abstraction Engines- Pig
Functionality:
• Apache Pig is a high-level platform and scripting language built on top of Hadoop.
• It simplifies the development of complex data processing tasks using a scripting language called Pig Latin.
• Apache Pig is a convenient tool developed by Yahoo for analysing huge data sets efficiently and easily.
• It provides a high-level data-flow language, Pig Latin, that is optimized, extensible and easy to use.
• The most outstanding feature of Pig programs is that their structure is open to considerable parallelization, making it easy
to handle large data sets.
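Pig Latin scripts can also be embedded in Java through the PigServer class. The sketch below runs a small word-count data flow in local mode; the file paths and the exact script are illustrative assumptions.

```java
import org.apache.pig.PigServer;

public class PigWordCount {
  public static void main(String[] args) throws Exception {
    // "local" runs Pig against the local file system; "mapreduce" would run on the cluster.
    PigServer pig = new PigServer("local");

    // Each registerQuery call adds one Pig Latin statement to the logical plan.
    pig.registerQuery("lines  = LOAD 'input/india.txt' AS (line:chararray);");
    pig.registerQuery("words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("groups = GROUP words BY word;");
    pig.registerQuery("counts = FOREACH groups GENERATE group, COUNT(words);");

    // store() triggers execution of the data flow and writes the result.
    pig.store("counts", "output/wordcount");
    pig.shutdown();
  }
}
```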
Apache Pig Architecture
Sqoop
Functionality:
• Apache Sqoop is a tool for efficiently transferring bulk data between Hadoop and structured datastores such as
relational databases.
• It supports importing data from external sources into Hadoop and exporting data from Hadoop into external
sources.
• It imports data from external datastores into HDFS, Hive, and HBase.
Sqoop Features
Sqoop has several features which make it helpful in the Big Data world:
Parallel Import/Export
Sqoop uses the YARN framework to import and export data. This provides fault tolerance on top of parallelism.
Import Results of an SQL Query
Sqoop enables us to import the results returned from an SQL query into HDFS.
Connectors For All Major RDBMS Databases
Sqoop provides connectors for multiple RDBMSs, such as MySQL and Microsoft SQL Server.
Kerberos Security Integration
Sqoop supports the Kerberos computer network authentication protocol, which enables nodes communicating over an insecure
network to authenticate users securely.
Provides Full and Incremental Load
Sqoop can load the entire table or parts of the table with a single command.
Sqoop Architecture & Working
Real Time Streaming Tools- Flume
 Flume is another data collection and ingestion tool: a distributed service for collecting, aggregating, and moving large
amounts of log data.
 It ingests online streaming data from social media, log files and web servers into HDFS.
 The Flume component is used to gather and aggregate large amounts of data.
 Apache Flume is used for collecting data from its origin and sending it to its resting location (HDFS).
 Flume accomplishes this by outlining data flows that consist of three primary structures: sources, channels and sinks.
 The processes that run the data flow in Flume are known as agents, and the bits of data that flow through Flume are known as
events.
Flume Architecture
 Data collected from external sources goes through the source, channel, and sink.
 The sink feature ensures that everything is in sync with the requirements.
 Finally, the data is dumped into HDFS.
Kafka
 Apache Kafka is an open-source distributed streaming platform that is designed for building real-time data pipelines
and streaming applications.
 Originally developed by LinkedIn, it was later open-sourced and became part of the Apache Software Foundation.
 Kafka is known for its high throughput, fault tolerance, and durability, making it a popular choice for handling
large volumes of real-time data.
 The Kafka cluster can handle failures of masters and databases.
 Kafka provides high throughput for both publishing and subscribing to messages, even when many terabytes of messages are stored.
Kafka Stream Features
Example
 Spotify uses Kafka as part of its log collection pipeline.
 Airbnb uses Kafka in its event pipeline and exception tracking.
 At Foursquare, Kafka powers online-online and online-offline messaging.
 Kafka powers MailChimp's data pipeline.
As seen below, we have the sender, the message queue, and the receiver involved in data
transfer.
Publish-Subscribe Messaging System:
Kafka operates on a publish-subscribe model, where data producers (publishers) send messages
to topics, and data consumers (subscribers) subscribe to those topics to receive the messages.
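A minimal Java sketch of the publisher side of this model using the standard Kafka producer client; the broker address (broker1:9092), topic name and message content are assumptions.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaPublishExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");          // illustrative broker address
    props.put("key.serializer", StringSerializer.class.getName());
    props.put("value.serializer", StringSerializer.class.getName());

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Publish a message to a topic; subscribers of "web-logs" will receive it.
      producer.send(new ProducerRecord<>("web-logs", "host-1", "GET /index.html 200"));
    }
  }
}
```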
Reference:
1. https://www.edureka.co/blog/a-deep-dive-into-pig/
2. https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
3. https://www.projectpro.io/article/hadoop-ecosystem-components-and-its-architecture/114#toc-11
Thank You
Components In Hadoop Ecosystem
Presented By: Md Asif Faizi
Enrollment no. - CUSB2202312015
3rd Semester, 2022-2024
Oozie
 Apache Oozie is a scheduler system to manage and execute Hadoop jobs in a distributed environment.
 We can create a desired pipeline by combining different kinds of tasks.
 A task can be a Hive, Pig, Sqoop or MapReduce task.
 Using Apache Oozie you can also schedule your jobs.
 Within a sequence of tasks, two or more jobs can also be programmed to run in parallel with each other.
 It is a scalable, reliable and extensible system.
 Oozie is an open-source Java web application, which is responsible for triggering the workflow actions.
 It, in turn, uses the Hadoop execution engine to execute the tasks.
There are three types of jobs in Apache Oozie:
 Oozie Workflow Jobs − Directed Acyclic Graphs (DAGs) which specify a sequence of actions to be executed.
 Oozie Coordinator Jobs − workflow jobs triggered by time and data availability.
 Oozie Bundles − packages of multiple coordinator and workflow jobs.
ZooKeeper:
 ZooKeeper is the king of coordination and provides simple, fast, reliable and ordered operational services for a Hadoop
cluster.
 ZooKeeper is responsible for the synchronization service, the distributed configuration service and for providing a naming registry
for distributed systems.
 ZooKeeper is essentially a centralized service that gives distributed systems a hierarchical key-value store.
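As a small illustration of that hierarchical key-value store, the sketch below uses the ZooKeeper Java client to write and read one configuration znode; the ensemble address (zk1:2181) and the znode path are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);

    // Connect to the ZooKeeper ensemble (address is illustrative).
    ZooKeeper zk = new ZooKeeper("zk1:2181", 5000, event -> {
      if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Store a configuration value in the hierarchical key-value store.
    String path = "/demo-config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "replication=3".getBytes(StandardCharsets.UTF_8),
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Read it back.
    byte[] data = zk.getData(path, false, null);
    System.out.println(new String(data, StandardCharsets.UTF_8));

    zk.close();
  }
}
```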
General Purpose Execution Engines - Apache Spark
 Apache Spark is a powerful open-source, fast and general-purpose cluster computing system
for big data processing.
 In-memory data processing for high performance.
 Iterative algorithms for machine learning.
 Real-time stream processing.
 Spark code is reusable in a batch-processing environment.
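A minimal Java sketch of Spark's in-memory processing style with the RDD API, run in local mode; the application name and input values are illustrative.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkSquareSum {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("square-sum").setMaster("local[*]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {

      // Data is held in memory across these transformations, which is what
      // makes iterative and interactive workloads fast in Spark.
      JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
      int sumOfSquares = numbers.map(x -> x * x)
                                .reduce(Integer::sum);

      System.out.println("Sum of squares = " + sumOfSquares);
    }
  }
}
```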
Graph Processing Engines
GraphX
 GraphX is Apache Spark’s API for graphs and graph-parallel computation.
 GraphX unifies the ETL (Extract, Transform & Load) process, exploratory analysis and iterative graph computation within a
single system.
 It offers performance comparable to the fastest specialized graph processing systems.
 GraphX provides a distributed graph computation framework, allowing you to represent and manipulate graphs efficiently.
 It can handle both directed and undirected graphs.
Machine Learning Engines – Mahout
 A mahout is one who drives an elephant as its master.
 The name comes from the project's close association with Apache Hadoop, which uses an elephant as its logo.
 Mahout is used to create scalable and distributed machine learning algorithms such as clustering, linear regression,
classification, and so on.
 Mahout was developed to implement distributed machine learning algorithms.
 It is capable of storing and processing big data in a distributed environment across a cluster using simple programming models.
 Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of data.
 Mahout lets applications analyze large sets of data effectively and quickly.
Applications of Mahout
 Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout internally.
 Foursquare helps you in finding out places, food, and entertainment available in a particular area.
 It uses the recommender engine of Mahout.
 Twitter uses Mahout for user interest modelling.
 Yahoo! uses Mahout for pattern mining.
 It implements popular machine learning techniques such as:
 Recommendation
 Classification
 Clustering
Thank You
Components In Hadoop Ecosystem
Presented By: Sandeep Kumar Chauhan
Enrollment no. - CUSB2202312025
3rd Semester, 2022-24
Database Management Tools - Spark SQL
 Spark SQL is a module for structured and semi-structured data processing.
 It acts as a distributed query engine.
 It provides programming abstractions for DataFrames and is mainly used for importing data from RDDs, Hive,
and Parquet files.
How does Spark SQL work?
The architecture of Spark SQL consists of three main layers:
1. Language API:
 The Language API is the top layer of the Spark SQL architecture; it reflects the compatibility of Spark SQL with different
languages such as Python, Scala, Java, HiveQL, etc.
2. Schema RDD:
 This is the middle layer of the Spark SQL architecture, responsible for tables, records, and schemas.
 A Schema RDD can be used as a temporary table and is also called a DataFrame.
3. Data Sources:
 Data Sources are the last layer of the architecture; the data sources are usually text files, databases, tables, etc.
 Spark SQL supports different data sources such as JSON documents, Hive tables, Parquet files, and the Cassandra database.
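The three layers above can be seen in a minimal Java Spark SQL sketch; the JSON file path and the column names (name, age) are assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("spark-sql-demo")
        .master("local[*]")                       // illustrative; omitted on a real cluster
        .getOrCreate();

    // Data source layer: load a JSON file into a DataFrame (Dataset<Row>).
    Dataset<Row> people = spark.read().json("data/people.json");

    // Schema RDD / DataFrame layer: expose it as a temporary table.
    people.createOrReplaceTempView("people");

    // Language API layer: query it with SQL.
    Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
    adults.show();

    spark.stop();
  }
}
```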
Features of Spark SQL
 Spark Integration:
 Spark SQL queries can be integrated easily with Spark programs.
 You can also query the structured data in these programs using SQL or the DataFrame API.
 Performance:
 Spark SQL has higher performance than Hadoop MapReduce, and it performs better as the number of iterations over a dataset
increases, thanks to its in-memory processing.
 Scalability:
 Spark SQL can use a cost-based optimizer, columnar storage and code generation, which keep most queries agile
while the computation is distributed across nodes through the Spark engine.
 This makes scaling easier and uses extra information to read data from multiple sources.
Apache Drill
 Apache Drill is a low-latency distributed query engine.
 Its major objective is to combine a variety of data stores with just a single query.
 It can support many different NoSQL databases.
 It scales to thousands of users across thousands of nodes with high performance.
 It offers all the usual SQL analytics functionality.
 It provides end-to-end security by default, with industry-standard authentication mechanisms.
Query Execution Diagram
The following image shows a Drillbit query execution diagram.
HBase
 HBase is a distributed column-oriented database built on top of the Hadoop file system.
 It is an open-source project and is horizontally scalable.
 HBase is a data model similar to Google’s Bigtable, designed to provide quick random access to huge
amounts of structured data.
 It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
 One can store data in HDFS either directly or through HBase.
 Data consumers read/access the data in HDFS randomly using HBase.
 HBase sits on top of the Hadoop file system and provides read and write access.
HBase History
Storage Mechanism in HBase
 HBase is a column-oriented database and the tables in it are sorted by row.
 The table schema defines only column families, which are the key-value pairs.
 A table can have multiple column families, and each column family can have any number of columns.
 Subsequent column values are stored contiguously on disk.
 Each cell value of the table has a timestamp. In short, in HBase:
 a table is a collection of rows;
 a row is a collection of column families;
 a column family is a collection of columns;
 a column is a collection of key-value pairs.
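A minimal Java sketch of a put and a random get against this storage model using the HBase client API; the table name (users), column family (info), column and row key are assumptions, and the table is expected to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetExample {
  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath

    try (Connection connection = ConnectionFactory.createConnection(config);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write: row key -> column family 'info', column 'city'.
      Put put = new Put(Bytes.toBytes("row1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Gaya"));
      table.put(put);

      // Random read by row key.
      Get get = new Get(Bytes.toBytes("row1"));
      Result result = table.get(get);
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"));
      System.out.println("city = " + Bytes.toString(value));
    }
  }
}
```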
Features of HBase
 HBase is linearly scalable.
 It has automatic failure support.
 It provides consistent reads and writes.
 It integrates with Hadoop, both as a source and a destination.
 It has an easy Java API for clients.
 It provides data replication across clusters.
Applications of HBase
 It is used whenever there is a need for write-heavy applications.
 HBase is used whenever we need to provide fast random access to available data.
 Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
Thank You