Big data Introduction
Agenda
● What are the characteristics of Big data
● Generic concepts related to Data warehouse platforms
● Why traditional Data warehouse platforms are not considered suitable for Big
data analytics
● Understanding the vast world of Big data platforms
● Benefits of the Hadoop file system and other cloud storage
● Mastering the concepts of parallel distributed computing frameworks
● Component failures and recovery
● Demystifying MapReduce
What is Big Data?
● Big data is a phrase used to describe a massive volume of both structured
and unstructured data that is so large that it is difficult to store, process and
access using traditional techniques
● Big data exhibits the following characteristics:
● Massive volume
● Lots of unstructured data
● Not possible to store and process it using traditional RDBMS and ETL tools
Three Vs of Big Data
Volume : Terabytes and petabytes (and even exabytes) of data
Velocity : Data flows into an organization at increasing rates
Variety : Any type of structured, semi-structured and unstructured data
Why is Volume increasing?
• Volume is increasing exponentially across different media formats
• Lots of sensor, IoT and social media data
Why is Velocity increasing?
• The rate at which data is coming in
• The rate at which data needs to be analyzed
• The speed of processing should be faster than the arrival of new data
2.7 billion LIKES/day 400 million tweets/day
Why is Variety increasing?
• A variety of data needs to be collected to perform advanced analytics
• This data can come from various sources such as:
• RDBMS
• Application-generated data and log files
• XML, JSON, CSV, TSV, text files and binary data stored on file systems
Structured data has a predefined schema, such as field names and data types
• CSV files with a header
• Records from databases
Unstructured data doesn't have a predefined schema
• Free-form text files
• Audio and video
• Images
Common Types of Big Data?
• The advent of social media, IoT devices and business-to-business
transactions is introducing new file types and formats
Big Data Analytics
● The process of collecting, organizing and analyzing large amounts of data
to gain insights
● Information and Data as Assets
● Improve decision making
● Enhance performance
● To be competitive
● To be two steps ahead of the competition
● Derive insights and drive growth
What is Data Warehouse?
• A Data Warehouse is a central location where consolidated data from
multiple sources is stored.
• End users access it whenever they need some information.
• A Data Warehouse is not loaded every time new data is generated.
• There are timelines determined by the business as to when a Data
Warehouse needs to be loaded – daily, monthly, once a quarter, etc.
Dimension Table?
• The objects of the subject are called dimensions.
• The tables that describe the dimensions involved are called
dimension tables.
• Dividing a data warehouse project into dimensions provides
structured information for reporting purposes.
Example: for an e-commerce company (the subject), the dimensions and their
attributes might be:
• Customer: ID, Name, Address
• Product: ID, Name, Type
• Date: Order date, Shipment date, Delivery date
What are Facts?
• A fact is a measure that can be summed, averaged or otherwise manipulated. If a
fact is manipulated, it has to be a measure that makes business sense.
• A dimension is linked to a fact.
• A fact table contains 2 kinds of data – dimension keys and measures.
Example: the Product dimension links to the fact table through the dimension
key Product_ID, and "Number of units sold" is a measure in the fact table.
Star Schema
FACT: Dealer_ID, Model_ID, Branch_ID, Date_ID, Units_Sold, Revenue
Dealer Dim: Dealer_ID, Location_ID, Country_ID, Dealer_NM, Dealer_CNTCT
Branch Dim: Branch_ID, Name, Address, Country
Date Dim: Date_ID, Year, Month, Quarter, Date
Product Dim: Product_ID, Product_Name, Model_ID, Variant_ID
• A star schema is the simplest of all the data warehouse schemas.
• The dimensions are not normalized; the fact table has keys from every
dimension.
• Performance of the star schema is good, but the cost of storage is high
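The star-schema idea can be illustrated with a small Python sketch (the rows and dealer names here are hypothetical, not tied to any real dataset): the fact table carries dimension keys plus measures, and a query joins to a dimension table to aggregate a measure by an attribute.

```python
# Minimal star-schema sketch: the fact table holds dimension keys and
# measures; the dimension table holds descriptive attributes.
fact = [
    {"Dealer_ID": 1, "Date_ID": 10, "Units_Sold": 5, "Revenue": 500},
    {"Dealer_ID": 2, "Date_ID": 10, "Units_Sold": 3, "Revenue": 270},
    {"Dealer_ID": 1, "Date_ID": 11, "Units_Sold": 2, "Revenue": 180},
]
dealer_dim = {1: {"Dealer_NM": "North"}, 2: {"Dealer_NM": "South"}}

# Aggregate a measure by a dimension attribute: revenue per dealer name.
revenue_by_dealer = {}
for row in fact:
    name = dealer_dim[row["Dealer_ID"]]["Dealer_NM"]
    revenue_by_dealer[name] = revenue_by_dealer.get(name, 0) + row["Revenue"]

print(revenue_by_dealer)  # {'North': 680, 'South': 270}
```

A single join from the fact to each dimension is all a star-schema query needs, which is why its query performance is good.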
Snowflake Schema?
FACT: Dealer_ID, Product_ID, Date_ID, Units_Sold, Revenue
Dealer: Dealer_ID, Location_ID, Country_ID, Dealer_NM, Dealer_CNTCT
Location: Location_ID, Region
Country: Country_ID, Country_Name
Product: Product_ID, Product_Name, Variant_ID
Variant: Variant_ID, Variant_Type
Date Dim: Date_ID, Year, Month, Quarter, Date
• Snowflaking is where we normalize the dimension tables of a star
schema.
• In a Snowflake schema, dimensions are normalized
• Query performance is not as good because more joins are required
Limitations of traditional data warehouse
● Not possible to store huge data (Big Data)
● No flexible schema
● Not suitable for storing unstructured data
● Limited scalability
● Takes a long time to perform ETL and load data into the warehouse
● Not suitable for advanced machine learning and AI
● Very expensive to develop, deploy and maintain
What is Data Lake?
● A Data lake is a centralized repository that allows you to store all kinds of
data at any scale.
● In a Data lake, data can be stored as-is.
● Data lakes are usually configured on a cluster of inexpensive and scalable
commodity hardware.
● A data lake works on a principle called schema-on-read.
● Data scientists can access, prepare, and analyze data faster and with more
accuracy using data lakes.
● This massive pool of data available in various non-traditional formats
provides the opportunity to access the data for a variety of use cases like
sentiment analysis or fraud detection.
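Schema-on-read can be sketched in a few lines of Python (the file contents and column types here are hypothetical): the raw data is stored untouched, and a schema is applied only at analysis time, so different consumers can read the same raw data with different schemas.

```python
import csv
import io

# Schema-on-read sketch: raw data lands in the lake as-is.
raw = "id,name,amount\n1,alice,10.5\n2,bob,3.0\n"  # stored untouched

# At analysis time, the consumer applies its own schema/typing.
schema = {"id": int, "name": str, "amount": float}
rows = [
    {col: schema[col](val) for col, val in rec.items()}
    for rec in csv.DictReader(io.StringIO(raw))
]

print(rows[0])  # {'id': 1, 'name': 'alice', 'amount': 10.5}
```

By contrast, a warehouse would reject or transform the data at load time (schema-on-write), before anyone queries it.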
Data Lake Vs Data Warehouse
Data
  Data Warehouse: Relational data from transactional systems, operational databases, and line-of-business applications
  Data Lake: Non-relational and relational data from IoT devices, web sites, mobile apps, social media, and corporate applications
Schema
  Data Warehouse: Designed prior to the DW implementation (schema-on-write)
  Data Lake: Written at the time of analysis (schema-on-read)
Price/Performance
  Data Warehouse: Fastest query results using higher-cost storage
  Data Lake: Query results getting faster using low-cost commodity storage
Data Quality
  Data Warehouse: Highly curated data
  Data Lake: Any data that may or may not be curated (i.e. raw data)
Users
  Data Warehouse: Business analysts
  Data Lake: Data scientists, data developers, and business analysts (using curated data)
Analytics
  Data Warehouse: Batch reporting, BI and visualizations
  Data Lake: Machine learning, predictive analytics, data discovery and profiling
Big Data can be stored only on distributed platforms such as
✔Distributed file systems
✔Distributed databases
✔Distributed Messaging queue or PubSub platform
✔Distributed Search Engines
Not only for storage – even for processing we need distributed and parallel
computing frameworks to process such data
✔Map Reduce
✔Spark
Storing Big data in file systems
● Big data can be stored in the following file system platforms
● Distributed file systems such as HDFS, AWS S3, Azure Data Lake Storage
(ADLS Gen2) and GCS
● These platforms store big data for analytics use cases such as ETL
processing, reporting, machine learning and AI
● However, these platforms are not suitable if you want to perform
CRUD (Create, Retrieve, Update and Delete) operations frequently
Benefits of storing Big data in file systems
● Big block sizes of 64 or 128 MB
● Bigger blocks enable reading more data in fewer disk I/O operations
● These file systems replicate the data and store 3 copies of every block
● Replication helps these platforms provide read scalability and data
availability
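The block and replication arithmetic above can be sketched as a back-of-the-envelope calculation, assuming a 128 MB block size and replication factor 3:

```python
import math

def hdfs_footprint(file_mb, block_mb=128, replication=3):
    """Return (number of blocks, total raw storage in MB) for one file."""
    blocks = math.ceil(file_mb / block_mb)  # last block may be partial
    return blocks, file_mb * replication

# A 1 GB (1024 MB) file: 8 blocks of 128 MB, 3 GB of raw storage.
print(hdfs_footprint(1024))  # (8, 3072)
```

The 3x raw-storage cost is the price paid for read scalability and availability: any of the 3 replica holders can serve a read, and losing one node loses no data.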
● Apache Hadoop's MapReduce and HDFS components were
inspired by Google's papers on MapReduce and the Google File
System (the Google File System paper was published in October 2003)
● Doug Cutting and Mike Cafarella created Hadoop based on these
papers
What is HDFS?
• Distributed
• Highly fault tolerant
• Runs on low-cost commodity hardware
HDFS Architecture
NameNode
• Master of HDFS
• Runs the NameNode daemon
• Coordinates the data storage function
• Does not store any data
DataNodes (Node 1 to Node 10 in the diagram)
• Each worker node runs a DataNode daemon and stores the blocks of
files (e.g., a 1 GB file is split into blocks spread across the DataNodes)
HDFS Write Path
Component Failure & Recovery in HDFS
• NameNode Failure and Recovery
• If the NameNode process goes down, HDFS metadata will not be available
• In the absence of metadata, clients will be unable to read HDFS data
• NameNode High Availability (HA) configuration is designed to handle this
• In a NameNode HA configuration, one of the NameNodes is configured as standby
• For HA configuration, ZooKeeper is required
• ZooKeeper promotes the standby NameNode to active if the active NameNode is not
available
• DataNode Failure and Recovery
• If a DataNode process goes down, some HDFS blocks will not be available
• If the replication factor is set to 3, then 2 more DataNodes will have the same block stored
• But what if all 3 DataNodes go down at once?
• Rack awareness, if configured, will help in this scenario
Rack Awareness
• Once rack awareness is configured, the NameNode will make sure blocks
get replicated across racks
• In the diagram below, two copies of blocks A and B are stored on rack RC1 and
one copy on rack RC2
DataNodes 1–3: rack RC1
DataNodes 4–6: rack RC2
Replicas of blocks A and B are spread across both racks
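The idea behind rack-aware placement can be sketched in Python (this is an illustration of the principle, not the actual HDFS placement policy): spread the replicas over at least two racks, so losing an entire rack never loses every copy of a block.

```python
def place_replicas(racks, replication=3):
    """racks: {rack_name: [node, ...]}; returns [(rack, node), ...]."""
    rack_names = list(racks)
    first, second = rack_names[0], rack_names[1]
    # One replica on a node of the first rack...
    placement = [(first, racks[first][0])]
    # ...and the remaining replicas on distinct nodes of a second rack.
    for i in range(replication - 1):
        placement.append((second, racks[second][i]))
    return placement

racks = {"RC1": ["dn1", "dn2", "dn3"], "RC2": ["dn4", "dn5", "dn6"]}
print(place_replicas(racks))  # [('RC1', 'dn1'), ('RC2', 'dn4'), ('RC2', 'dn5')]
```

With this layout, even if all of RC1 goes down, RC2 still holds copies of every block, so the data remains available.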
What is YARN
• YARN allocates resources to run any application inside a Hadoop
cluster
• Its full form is Yet Another Resource Negotiator
• It is the operating system of the Hadoop cluster
• Part of core Hadoop
• A multi-node cluster has an aggregate pool of compute resources
• CPU and memory – the kind of resources an operating system would
manage
YARN Bird’s Eye View
• ResourceManager (master)
• Application management
• Scheduling
• Security
• NodeManager (worker)
• Provides resources to lease
• Memory, cpu
• Container is unit of processing
• ApplicationMaster manages jobs
• JobHistory Server/Application Timeline Server
• Job history preserved in HDFS
YARN Architecture
The ResourceManager (with its Scheduler) coordinates the NodeManagers.
Each application gets its own ApplicationMaster (AM1, AM2), which manages
that application's containers (e.g., Container 1.1–1.3 for application 1 and
Container 2.1–2.4 for application 2) spread across the NodeManagers.
5 Key Benefits of YARN
• Scale
• Improved Programming models and Services
• Agility
• Improved Cluster Utilization
• Beyond JAVA
Components of YARN
• ResourceManager
• Global resource scheduler
• Hierarchical queues
• NodeManager
• Per-machine agent
• Manages the life-cycle of container
• Container resource monitoring
• ApplicationMaster
• Per-application
• Manages application scheduling and task execution
• E.g. MapReduce Application Master
The ResourceManager (with its job queues and scheduler) grants containers
on NodeManagers; the ApplicationMaster itself runs in one of those containers.
NodeManager
• Daemon process on the same machine as the DataNode process
• Container management
• Driven by authorized ApplicationMaster requests
• Start/stop/cleanup containers
• Container sizes are determined by Application Masters
• Node-level resource management
• Memory and CPU as resources
Resource Manager
• Applications Manager
• ApplicationMaster launch
• Publishing resource capabilities to ApplicationMaster
• Scheduling
• Capacity Scheduler (default)
• Hierarchical
• Liveliness Monitoring
• NodeManager & Application Masters
• Expects a heartbeat from a NodeManager within 10 minutes
• Heartbeats sent by NodeManagers every 1 second by default
• Expects a heartbeat from an ApplicationMaster within 10 minutes
Application Master
• There will be a single ApplicationMaster per YARN application
• The App Master is the parent of all the containers of the application (job)
• The App Master is initiated by the ResourceManager
• Every App Master has a unique application ID
• The App Master runs on a DataNode machine
Component Failure & Recovery in YARN (1/2)
• Container Failure
• Exceptions are propagated back to the ApplicationMaster
• ApplicationMaster and NodeManager are responsible for
restart or reschedule
• ApplicationMaster will request new container(s) from
ResourceManager
• ApplicationMaster Failure
• The ResourceManager will start a new instance of the master
running in a new container managed by a NodeManager
• The new instance can recover the state of the running and
completed containers from the failed ApplicationMaster
Component Failure Recovery in YARN(2/2)
• NodeManager Failure
• Removed from Resource Manager’s pool of available nodes
• ApplicationMaster and containers running on the failed
NodeManager can be recovered as described above
• Application Masters can blacklist Node Managers
• ResourceManager Failure
• Failover to standby ResourceManager or manually brought up
by an administrator
• Recovers from saved state stored by ZooKeeper
• Cannot recover resource requests
Types of nodes in Hadoop Cluster
• Master Nodes:
• The master components of HDFS and YARN are installed on these nodes
• The NameNode and ResourceManager processes run on master nodes
• Management Nodes:
• Mostly management software is installed on these nodes
• Ambari/Cloudera Manager, WebHCat Server, ZooKeeper and JobHistory Server
• Worker Nodes:
• On worker nodes, the DataNode, NodeManager and other processes run
• Client Nodes:
• Software such as the Hive, Spark and MapReduce clients are installed
• Edge Nodes:
• Data ingestion tools such as Flume, Sqoop and Kafka are installed
Hadoop Cluster Layout
Master-1: NameNode (Active)
Master-2: NameNode (Standby)
Master-3: ResourceManager (Active)
Master-4: ResourceManager (Standby)
Management-1: Ambari/Cloudera Manager
Management-2: HMaster, Oozie
Worker-1 to Worker-4: DataNode, NodeManager, RegionServer
Client Node: Hive, MR client, Spark client
Edge Node-1 and Edge Node-2: Sqoop, Flume, Kafka
Hadoop As Data Lake
Data sources (databases, files, social media) → Data Ingestion Layer →
HDFS (raw data) → MapReduce/Spark → processed data → Query Layer
(Hive, Impala/Presto)
Data Ingestion Layer
Routes into HDFS inside the Hadoop platform:
• Oracle → Sqoop (JDBC import)
• FTP server → MiNiFi → Flume/NiFi
• Social media → Flume/NiFi
• Applications → Kafka → Kafka Connect
• WebHDFS
HDFS: flat files, SequenceFile, ORC, Parquet, Avro; compression (LZO,
Snappy, BZip2); encryption
MR: Java
Hive: runs on MR, Spark or Tez; Metastore in MySQL/Postgres
Spark: MR++, RDDs (transformations/actions), DataFrame/Dataset, Spark SQL, ML/DL, GraphX
HBase, Phoenix, Oozie, Ambari
MPP engines (Tez, Impala, Drill, Presto): interactive/ad-hoc queries
Roles of Hadoop Ecosystem
Components
• Storage
• HDFS which is file system of Hadoop
• HBase, a NoSQL database of Hadoop
• Kafka, a streaming platform for storing messages
• Processing
• MapReduce
• Apache Spark
• MPP Query engine
• Impala
• Presto
• Tez
• Data Ingestion
• Flume
• Sqoop
• NIFI
Data Ingestion : Apache Sqoop
• Apache Sqoop is a tool designed to support bulk export and import of
data into HDFS from RDBMS, enterprise data warehouses, and NoSQL
systems.
• Sqoop has connectors for working with a range of popular relational
databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2.
• Each of these connectors knows how to interact with its associated
RDBMS.
Sqoop Features
• Import a single table
• Import the complete database
• Import selected tables
• Import selected columns from a particular table
• Filter out certain rows from certain tables, etc., by using the query option
• Sqoop uses a map-only MapReduce job to fetch data from an RDBMS and
store it on HDFS.
Data Ingestion : Apache Flume
• Apache Flume is a data ingestion tool to ingest data from remote servers
• It collects, aggregates and directs streams into HDFS
• Flume works with different data sources to process and send data to defined
destinations, including HDFS
Flume pipeline: Source → Channel → Sink → Destination
Hadoop 3.0 new features
• More than two NameNodes supported
• Support for erasure coding in HDFS
• Support for Opportunistic Containers and Distributed
Scheduling
• Intra-DataNode balancer
• Filesystem connector support
• JDK 8 is the minimum Java version supported by Hadoop 3.x
Let us perform some hands-on
✔ Understanding HDFS commands
✔ Executing MapReduce Job
✔ Verifying the results
✔ Exploring MapReduce Job history page
Execution Flow of a YARN Application
MR Execution Flow (1/4)
• The job object created in the client JVM submits the request to the
ResourceManager. Note: the client JVM is created on a client node of the
Hadoop cluster
• The ApplicationsManager of the ResourceManager then checks whether the
application is a valid YARN application; if it is, it generates a unique
YARN application ID
• The job object inside the driver process, using this unique ID, creates a
staging folder (e.g., /tmp/application_2731) and copies all the JARs, scripts,
.sql files and dependencies
• The job object inside the driver process then submits the request to the
ResourceManager again
MR Execution Flow (2/4)
• The ResourceManager selects one of the DataNodes (based on the availability
of resources, e.g., 1 CPU core, 1 GB RAM) where the ApplicationMaster
image will be copied
• The ResourceManager then instructs the NodeManager of that node (server) to
start the image. Once the image is started, it is called the App Master
container
• The App Master wants to break the job into n tasks; for this it connects to
the NameNode and gets metadata (blocks, locations) for the input folder
specified while running the job
MR Execution Flow (3/4)
• To create N tasks, the App Master first needs to find out the number of
input splits with the help of InputFormat classes. These classes consider the data
split boundaries and make records logically complete in order to decide the number of splits
• The ApplicationMaster decides the task assignment and requests the
ResourceManager to allocate the containers
• Once resources are allocated, the ApplicationMaster requests the respective
NodeManagers to start containers using those resources. Once a container is started,
a mapper process starts in that container
MR Execution Flow (4/4)
• The mapper task runs the map method implemented in the Mapper Java class
• Once the map phase is over, sorting and shuffling take place. As part of this
phase, hash partitioning writes the data (key-value pairs) into n part-m-* files,
where n depends on the number of reducers set in the application
• These part files are copied to the respective reducers, which perform a merge
sort to combine the multiple values of a given key into a list of values
• The list of values and its key are fed into the reduce method of the Reducer class
• The output of the reducer is stored in HDFS
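The hash-partitioning step can be sketched in Python (using a simple character-sum hash purely as an illustrative stand-in for Java's key.hashCode()): every occurrence of a key maps to the same partition, which is what guarantees each reducer sees all values for its keys.

```python
def partition(key, num_reducers):
    # Stable stand-in for HashPartitioner: hash(key) mod numReduceTasks.
    return sum(ord(c) for c in key) % num_reducers

keys = ["King", "Queen", "Minister", "Soldier"]
parts = {k: partition(k, 2) for k in keys}
print(parts)  # {'King': 1, 'Queen': 0, 'Minister': 1, 'Soldier': 0}
```

With 2 reducers, "King" and "Minister" always land in part file 1 and "Queen" and "Soldier" in part file 0, no matter which mapper emitted them.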
Lifecycle of a YARN Application
1. The client submits an application request to the ResourceManager
(ApplicationMasterService)
2. The ResourceManager responds with an ApplicationID
3. The client submits the Application Submission Context (ApplicationId,
queue, application name, application type, AM resource, priority) together
with the Container Launch Context (launch commands, LocalResources,
delegation tokens)
4. The ApplicationsManager starts the ApplicationMaster (e.g., an MR job's
AM) in a container on a NodeManager with free capacity
5. The ApplicationMaster gets the cluster's capabilities
6. The ApplicationMaster requests and receives containers from the Scheduler
7. The ApplicationMaster sends container launch requests to the
NodeManagers; the job's tasks (e.g., Map1–Map10 and Reducer1–Reducer2)
run in those containers
MapReduce Introduction
How does MapReduce work?
• Mapping phase: the Mapper transforms the Map Input List into the Map
Output List
• Reducing phase: the Reducer transforms the Reduce Input List into the
Reduce Output List
Hadoop MapReduce
Word count example on the input:
  King Queen King
  Minister King Soldier
  Queen Soldier King
Input → Splitting: <1, King Queen King>, <2, Minister King Soldier>,
<3, Queen Soldier King>
Map: each line produces <word, 1> pairs, e.g. <King, 1> <Queen, 1> <King, 1> ...
Shuffling: map output is grouped by key: <King, (1,1,1,1)> <Minister, (1)>
<Queen, (1,1)> <Soldier, (1,1)>
Reduce: each list of values is summed: <King, 4> <Minister, 1> <Queen, 2>
<Soldier, 2>
Result: King 4, Minister 1, Queen 2, Soldier 2
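The word-count flow above can be reproduced with a small pure-Python simulation of the three phases (a sketch of the data flow, not Hadoop API code):

```python
from collections import defaultdict

lines = ["King Queen King", "Minister King Soldier", "Queen Soldier King"]

# Map: each line emits <word, 1> pairs.
map_output = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values by key, as the framework does between phases.
groups = defaultdict(list)
for word, one in map_output:
    groups[word].append(one)

# Reduce: sum the list of values for each key.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'King': 4, 'Queen': 2, 'Minister': 1, 'Soldier': 2}
```

In real Hadoop, only the map and reduce logic is written by the user; splitting, record creation, and shuffling are done by the framework.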
Hadoop MapReduce – Roles: User vs. Framework
Using the same word count example, the responsibilities split as follows:
User:
• Load data into HDFS
• Specify the input path and Input Format
• User-defined logic: the map and reduce methods
Framework:
• Create input splits
• Create individual records
• Shuffle and sort the map output before the reduce phase
Specify Path &
Input Format
Create ‘Input Splits’
Create individual
Records
User Defined Logic User Defined Logic Specify Path & Output
format
Hadoop MapReduce – Roles: User vs. Framework
MapReduce
<1, King
Queen King>
<King, 1>
<Queen, 1>
<King, 1>
<2, Minister
King Soldier>
<3, Queen
Soldier King>
<Minister, 1>
<King, 1>
<Soldier, 1>
<Queen, 1>
<Soldier, 1>
<King, 1>
<King, 1>
<King, 1>
<King, 1>
<King, 1>
<Minister, 1>
<Queen, 1>
<Queen, 1>
<Soldier,1>
<Soldier,1>
<King,
(1,1,1,1)>
<Minister,
1>
<Queen,
(1,1)>
<Soldier,
(1,1)>
<King, 4>
<Minister, 1>
King Queen King
Minister King
Soldier
Queen Soldier
King
Input Splitting Map Shuffling Reduce Result
<Queen, 2>
<Soldier, 2>
Map Output
Load data into HDFS
Specify Path &
Input Format
Create ‘Input Splits’
Create individual
Records
User Defined Logic User Defined Logic Specify Path & Output
format
Hadoop MapReduce – Roles: User vs. Framework
MapReduce
<1, King
Queen King>
<King, 1>
<Queen, 1>
<King, 1>
<2, Minister
King Soldier>
<3, Queen
Soldier King>
<Minister, 1>
<King, 1>
<Soldier, 1>
<Queen, 1>
<Soldier, 1>
<King, 1>
<King, 1>
<King, 1>
<King, 1>
<King, 1>
<Minister, 1>
<Queen, 1>
<Queen, 1>
<Soldier,1>
<Soldier,1>
<King,
(1,1,1,1)>
<Minister,
1>
<Queen,
(1,1)>
<Soldier,
(1,1)>
<King, 4>
<Minister, 1>
King Queen King
Minister King
Soldier
Queen Soldier
King
Input Splitting Map Shuffling Reduce Result
<Queen, 2>
<Soldier, 2>
Map Output
Load data into HDFS
Specify Path &
Input Format
Create ‘Input Splits’
Create individual
Records
User Defined Logic User Defined Logic Specify Path & Output
format
Replication, Rack
Awareness etc.
Hadoop MapReduce – Roles: User vs. Framework
MapReduce
<1, King
Queen King>
<King, 1>
<Queen, 1>
<King, 1>
<2, Minister
King Soldier>
<3, Queen
Soldier King>
<Minister, 1>
<King, 1>
<Soldier, 1>
<Queen, 1>
<Soldier, 1>
<King, 1>
<King, 1>
<King, 1>
<King, 1>
<King, 1>
<Minister, 1>
<Queen, 1>
<Queen, 1>
<Soldier,1>
<Soldier,1>
<King,
(1,1,1,1)>
<Minister,
1>
<Queen,
(1,1)>
<Soldier,
(1,1)>
<King, 4>
<Minister, 1>
King Queen King
Minister King
Soldier
Queen Soldier
King
Input Splitting Map Shuffling Reduce Result
<Queen, 2>
<Soldier, 2>
Map Output
Load data into HDFS
Specify Path &
Input Format
Create ‘Input Splits’
Create individual
Records
User Defined Logic User Defined Logic Specify Path & Output
format
Replication, Rack
Awareness etc.
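The pipeline above can be sketched in plain Python. This is only an illustrative simulation of the slide's word-count example, not Hadoop code: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names standing in for the user-defined Mapper/Reducer logic and the framework's shuffle.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line_no, line):
    """Map: emit a <word, 1> pair for every word in the record."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: sum the grouped counts for one word."""
    return (word, sum(counts))

def run_job(records):
    # Map phase: apply the mapper to every <line_no, line> record.
    intermediate = []
    for line_no, line in records:
        intermediate.extend(map_fn(line_no, line))
    # Shuffle phase: sort by key and group values, as the framework does.
    intermediate.sort(key=itemgetter(0))
    grouped = ((k, [v for _, v in g])
               for k, g in groupby(intermediate, key=itemgetter(0)))
    # Reduce phase: apply the reducer to every <word, (1,1,...)> group.
    return dict(reduce_fn(k, vs) for k, vs in grouped)

records = [(1, "King Queen King"),
           (2, "Minister King Soldier"),
           (3, "Queen Soldier King")]
print(run_job(records))  # {'King': 4, 'Minister': 1, 'Queen': 2, 'Soldier': 2}
```

Note that the user only writes `map_fn` and `reduce_fn`; everything in `run_job` (splitting, sorting, grouping) is what the framework provides.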
▪ An MR job can contain both mappers and reducers, or just mappers
▪ The number of mappers in a job depends on the number of input splits
▪ The number of reducers in a job must be set by the developer
▪ By default, a job has just one reducer
▪ A commonly recommended value is about ¼ of the number of mapper tasks
▪ Each Map and Reduce task runs in its own JVM container
▪ By default, each Map and Reduce JVM gets 1 GB of memory and 1 CPU core
▪ The Driver is the process that runs inside the Application Master JVM
▪ Each Mapper and Reducer task is launched as a separate JVM by the Node
Manager
▪ Each task can be attempted at most 4 times
▪ Each task is assigned by the Application Master to a specific Node
Manager
▪ InputFormat is the Java class that has the intelligence to assemble
complete records
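Because the developer fixes the reducer count, the framework must decide which reducer receives each key. Hadoop's default `HashPartitioner` uses `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch below re-implements that formula in Python (the helper names are our own) so the routing can be seen deterministically.

```python
def java_string_hashcode(s):
    """Deterministic re-implementation of Java's String.hashCode()."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret as a signed 32-bit int, as Java does.
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key, num_reducers):
    # Same formula as Hadoop's default HashPartitioner:
    # (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_reducers

for word in ["King", "Queen", "Minister", "Soldier"]:
    print(word, "->", "reducer", partition(word, 2))
```

Every occurrence of a key maps to the same reducer, which is why the shuffle can guarantee that all values for one key arrive at one Reduce task.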
MapReduce Execution Framework

Input HDFS file: inputFile.txt, stored as Block A, Block B, and Block C.

▪ The Driver defines the InputFormat, the Mapper, the Reducer, and the
OutputFormat
▪ The InputFormat calculates the input splits (Input Split 1 … Input Split 4)
from the HDFS blocks
▪ In each Mapper process, a Record Reader reads its input split and passes
<K,V> pairs to the Mapper
▪ The Mapper emits intermediate <K,V> pairs, which are partitioned, sorted,
and shuffled across to the Reduce processes
▪ Each Reducer receives its grouped <K,V> pairs and passes its results to a
Writer, defined by the OutputFormat
▪ The Writer writes the final Output Data back to HDFS
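The split-calculation step the InputFormat performs can be illustrated with a small sketch. This is a simplification under the assumption that splits are cut at a fixed size; real `FileInputFormat` also accounts for configured min/max split sizes and block locations.

```python
def compute_splits(file_size, split_size):
    """Carve a file of file_size bytes into (offset, length) input splits,
    each at most split_size bytes, with a possibly smaller tail split."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# e.g. a 300 MB file with a 128 MB split size -> two full splits and a tail
print(compute_splits(300, 128))  # [(0, 128), (128, 128), (256, 44)]
```

The number of splits returned here is exactly the number of Mapper tasks the job will launch, which is why mapper count is not something the developer sets directly.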
Thank You
 
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
UNIT III ANALOG ELECTRONICS (BASIC ELECTRONICS)
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
Call Us ≽ 8377877756 ≼ Call Girls In Shastri Nagar (Delhi)
 
Risk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdfRisk Assessment For Installation of Drainage Pipes.pdf
Risk Assessment For Installation of Drainage Pipes.pdf
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.Oxy acetylene welding presentation note.
Oxy acetylene welding presentation note.
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 

Data Science Machine Lerning Bigdat.pptx

  • 2. Agenda ● What are the characteristics of Big data ● Generic concepts related to Data warehouse platforms ● Why traditional Data warehouse platforms are not considered suitable for Big data analytics ● Understanding the vast world of Big data platforms ● Benefits of the Hadoop file system and other cloud storage ● Mastering the concepts of parallel distributed computing frameworks ● Component failures and recoveries ● Demystifying MapReduce
  • 3. What is Big Data? ● Big data is a phrase used to describe a massive volume of both structured and unstructured data that is so large that it is difficult to store, process and access using traditional techniques ● Big data exhibits the following characteristics ● Massive volume ● Lots of unstructured data ● Not possible to store and process it using traditional RDBMS and ETL tools
  • 4. Three Vs of Big Data Volume : Terabytes and petabytes (and even exabytes) of data Velocity : Data flows into an organization at increasing rates Variety : Any type of structured, semi-structured and unstructured data
  • 5. Why is Volume increasing? • Volume is increasing exponentially across different media formats • Lots of sensor, IoT and social media data
  • 6. Why is Velocity increasing? • Rate at which data is coming in • Rate at which data needs to be analyzed • Speed of processing should be faster than the arrival of new data 2.7 billion LIKES/day 400 million tweets/day
  • 7. Why is Variety increasing? • A variety of data needs to be collected to perform advanced analytics • These data can come from various sources such as RDBMS • Application-generated data and log files • XML, JSON, CSV, TSV, text files and binary data stored on file systems
  • 8. Structured data has a predefined schema, such as field names and data types • CSV file with a header • Records from databases Unstructured data has no predefined schema • Free-form text files • Audio and video • Images
  • 9. Common Types of Big Data • The advent of social media, IoT devices and business-to-business transactions is introducing new file types and formats
  • 10. Big Data Analytics ● Process of collecting, organizing and analyzing large amounts of data to gain insight ● Information and Data as Assets ● Improve decision making ● Enhance performance ● To be competitive ● To be two steps ahead of the competition ● Derive insights and drive growth
  • 11. What is a Data Warehouse? • A Data Warehouse is a central location where consolidated data from multiple locations is stored. • End users access it whenever they need information. • A Data Warehouse is not loaded every time new data is generated. • The business determines timelines for when a Data Warehouse needs to be loaded – daily, monthly, once a quarter, etc.
  • 12. Dimension Tables • The objects of the subject are called Dimensions. • The tables that describe the dimensions involved are called dimension tables. • Dividing a data warehouse project into dimensions provides structured information for reporting purposes. e.g. Subject: e-commerce Company; Dimensions: Customer, Product, Date; Attributes: ID, Name, Address / ID, Name, Type / Order date, Shipment date, Delivery date
  • 13. What are Facts? • A fact is a measure that can be summed, averaged or otherwise manipulated. If a fact is manipulated, the result has to be a measure that makes business sense. • A dimension is linked to a fact. • A fact table contains 2 kinds of data – dimension keys and measures. e.g. Dimension key: Product_ID; Measure: Number of units sold
  • 15. Star Schema [diagram: FACT table (Dealer_ID, Model_ID, Branch_ID, Date_ID, Units_Sold, Revenue) joined to dimension tables Dealer (Dealer_ID, Location_ID, Country_ID, Dealer_NM, Dealer_CNTCT), Branch (Branch_ID, Name, Address, Country), Date (Date_ID, Year, Month, Quarter, Date) and Product (Product_ID, Product_Name, Model_ID, Variant_ID)] • A star schema is the simplest of all the data warehouse schemas. • The dimensions are not normalized; the fact table has keys from every dimension. • Performance of the star schema is good, but the cost of storage is high
  • 17. Limitations of traditional data warehouses ● Not possible to store huge data (Big Data) ● No flexible schema ● Not suitable for storing unstructured data ● Limited scalability ● Takes a long time to perform ETL and load data into the warehouse ● Not suitable for advanced machine learning and AI ● Very expensive to develop, deploy and maintain
  • 18. What is a Data Lake? ● A Data lake is a centralized repository that allows you to store all kinds of data at any scale. ● In a data lake, you can store your data as-is. ● Data lakes are usually configured on a cluster of inexpensive and scalable commodity hardware. ● A data lake works on a principle called schema-on-read. ● Data scientists can access, prepare, and analyze data faster and with more accuracy using data lakes. ● This massive pool of data, available in various non-traditional formats, provides the opportunity to access the data for a variety of use cases like sentiment analysis or fraud detection.
  • 19. Data Lake vs Data Warehouse – Data: a warehouse holds relational data from transactional systems, operational databases and line-of-business applications; a lake holds non-relational and relational data from IoT devices, web sites, mobile apps, social media and corporate applications – Schema: a warehouse schema is designed prior to implementation (schema-on-write); a lake's schema is written at the time of analysis (schema-on-read) – Price/Performance: a warehouse gives the fastest query results using higher-cost storage; lake queries are getting faster using low-cost commodity storage – Data Quality: a warehouse holds highly curated data; a lake holds any data that may or may not be curated (i.e. raw data) – Users: a warehouse serves business analysts; a lake serves data scientists, data developers and business analysts (using curated data) – Analytics: a warehouse supports batch reporting, BI and visualizations; a lake supports machine learning, predictive analytics, data discovery and profiling
  • 20. Big Data can be stored only on distributed platforms such as ✔Distributed file systems ✔Distributed databases ✔Distributed messaging queue or Pub/Sub platforms ✔Distributed search engines Not only for storage; even for processing we need distributed and parallel computing frameworks such as ✔MapReduce ✔Spark
  • 21. Storing Big data in file systems ● Big data can be stored on distributed file system platforms such as HDFS, AWS S3, Azure Data Lake Storage (ADLS Gen2) and GCS ● These platforms store big data for analytics use cases such as ETL processing, reporting, machine learning and AI ● However, these platforms will not be suitable if you want to perform CRUD (Create, Read, Update and Delete) operations frequently
  • 22. Benefits of storing Big data in file systems ● Big block size of 64 or 128 MB ● Bigger blocks enable reading more data in fewer disk IO operations ● These file systems replicate the data and store 3 copies of every block ● Replication helps these platforms provide read scalability and data availability
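As a quick worked example (a local sketch, not tied to any particular cluster), the following shows how a 1 GB file breaks down under a 128 MB block size and a replication factor of 3:

```python
import math

# Hypothetical file stored with common HDFS defaults
file_size_mb = 1024      # 1 GB file
block_size_mb = 128      # common HDFS default (older versions used 64 MB)
replication = 3          # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)   # logical blocks tracked by the NameNode
replicas = blocks * replication                    # physical block copies across DataNodes
raw_storage_mb = file_size_mb * replication        # raw disk consumed cluster-wide

print(blocks, replicas, raw_storage_mb)  # 8 24 3072
```

Eight large blocks means a full scan needs only eight sequential reads, instead of thousands of seeks with small blocks; the 24 replicas are what gives read scalability and availability.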
  • 23. ● Apache Hadoop's MapReduce and HDFS components were inspired by Google's papers on the Google File System (published in October 2003) and MapReduce (2004) ● Doug Cutting and Mike Cafarella created Hadoop based on the ideas in these papers
  • 24. What is HDFS? • Distributed • Highly fault tolerant • Runs on low-cost commodity hardware
  • 25. HDFS Architecture [diagram: a NameNode coordinating ten DataNodes storing a 1 GB file] • Master of HDFS • Runs the NameNode daemon • Coordinates the data storage function • Does not store any data
  • 28. Component Failure & Recovery in HDFS • NameNode Failure and Recovery • If the NameNode process goes down, HDFS metadata will not be available • In the absence of metadata, clients will be unable to read HDFS data • The NameNode High Availability (HA) configuration is designed to handle this • In an HA configuration, one NameNode is configured as standby • ZooKeeper is required for the HA configuration • ZooKeeper promotes the standby NameNode to active if the active NameNode becomes unavailable • DataNode Failure and Recovery • If a DataNode process goes down, some HDFS blocks will not be available • If the replication factor is set to 3, two more DataNodes will have the same blocks stored • But what if all 3 DataNodes go down in one shot? • Rack awareness, if configured, will help in this scenario
  • 29. Rack Awareness • Once rack awareness is configured, the NameNode will make sure blocks get replicated across racks • In the image below, two copies of blocks A and B are stored on rack RC1 and one copy on rack RC2 [diagram: DataNodes 1–3 on rack RC1 and DataNodes 4–6 on rack RC2, with blocks A and B replicated across both racks]
  • 30. What is YARN • YARN allocates resources to run any application inside a Hadoop cluster • Its full form is Yet Another Resource Negotiator • It is the Operating System of the Hadoop Cluster • Part of core Hadoop • A multi-node cluster has an aggregate pool of compute resources • CPU and memory – the kinds of resources an operating system would manage
  • 31. YARN Bird’s Eye View • ResourceManager (master) • Application management • Scheduling • Security • NodeManager (worker) • Provides resources to lease • Memory, cpu • Container is unit of processing • ApplicationMaster manages jobs • JobHistory Server/Application Timeline Server • Job history preserved in HDFS
  • 32. YARN Architecture [diagram: a ResourceManager with its Scheduler coordinating twelve NodeManagers; ApplicationMaster AM1 with containers 1.1–1.3 and AM2 with containers 2.1–2.4 spread across the nodes]
  • 33. 5 Key Benefits of YARN • Scale • Improved programming models and services • Agility • Improved cluster utilization • Beyond Java
  • 34. Components of YARN • ResourceManager • Global resource scheduler • Hierarchical queues • NodeManager • Per-machine agent • Manages the life-cycle of containers • Container resource monitoring • ApplicationMaster • Per-application • Manages application scheduling and task execution • e.g. the MapReduce ApplicationMaster [diagram: ResourceManager with job queues and scheduler; NodeManager hosting containers and an ApplicationMaster]
  • 35. NodeManager • Daemon process on same machine as DataNode process • Container management • Driven by authorized ApplicationMaster requests • Start/stop/cleanup containers • Container sizes are determined by Application Masters • Node-level resource management • Memory and CPU as resources
  • 36. Resource Manager • Applications Manager • ApplicationMaster launch • Publishing resource capabilities to the ApplicationMaster • Scheduling • Capacity Scheduler (default) • Hierarchical • Liveness Monitoring • NodeManagers & ApplicationMasters • Expects a heartbeat from a NodeManager within 10 minutes • Heartbeats are sent by NodeManagers every 1 second by default • Expects a heartbeat from an ApplicationMaster within 10 minutes
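The heartbeat and liveness intervals above correspond to `yarn-site.xml` properties; a sketch with the stock defaults (values in milliseconds) looks like this:

```xml
<!-- yarn-site.xml fragment: default heartbeat/liveness settings -->
<configuration>
  <!-- NodeManagers heartbeat to the ResourceManager every 1 second -->
  <property>
    <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
    <value>1000</value>
  </property>
  <!-- A NodeManager is considered dead after 10 minutes without a heartbeat -->
  <property>
    <name>yarn.nm.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
  <!-- An ApplicationMaster is considered dead after 10 minutes without a heartbeat -->
  <property>
    <name>yarn.am.liveness-monitor.expiry-interval-ms</name>
    <value>600000</value>
  </property>
</configuration>
```

These are the defaults, so the fragment is illustrative only; you would normally set them only to change them.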
  • 37. Application Master • There will be a single ApplicationMaster per YARN application • The App Master is the parent of all the containers of the application (job) • The App Master is initiated by the Resource Manager • Every App Master has a unique application ID • The App Master runs on a DataNode machine
  • 38. Component Failure & Recovery in YARN (1/2) • Container Failure • Exceptions are propagated back to the ApplicationMaster • The ApplicationMaster and NodeManager are responsible for restart or reschedule • The ApplicationMaster will request new container(s) from the ResourceManager • ApplicationMaster Failure • The ResourceManager will start a new instance of the master in a new container managed by a NodeManager • The new instance can recover the state of the running and completed containers from the failed ApplicationMaster
  • 39. Component Failure & Recovery in YARN (2/2) • NodeManager Failure • Removed from the ResourceManager's pool of available nodes • The ApplicationMaster and containers running on the failed NodeManager can be recovered as described above • ApplicationMasters can blacklist NodeManagers • ResourceManager Failure • Failover to a standby ResourceManager, or manually brought up by an administrator • Recovers from saved state stored in ZooKeeper • Cannot recover resource requests
  • 40. Types of nodes in a Hadoop Cluster • Master Nodes: • The master components of HDFS and YARN are installed on these nodes • The NameNode and ResourceManager processes run on master nodes • Management Nodes: • Mostly management software is installed on these nodes • Ambari/Cloudera Manager, WebHCat Server, ZooKeeper and the JobHistory Server • Worker Nodes: • The DataNode, NodeManager and other worker processes run on these nodes • Client Nodes: • Software such as the Hive, Spark and MapReduce clients are installed • Edge Nodes: • Data ingestion tools such as Flume, Sqoop and Kafka are installed
  • 41. Hadoop Cluster Layout Master-1: NameNode (Active) | Master-2: NameNode (Standby) | Master-3: ResourceManager (Active) | Master-4: ResourceManager (Standby) | Management-1: Ambari/Cloudera Manager | Management-2: HMaster, Oozie | Worker-1 to Worker-4: DataNode, NodeManager, RegionServer | Client Node: Hive, MRClient, SparkClient | Edge Node-1 and Edge Node-2: Sqoop, Flume, Kafka
  • 42. Hadoop as a Data Lake [diagram: data sources (databases, files, social media) → data ingestion layer → raw data in HDFS → MapReduce/Spark → processed data → query layer (Hive, Impala/Presto)]
  • 43. Data Ingestion Layer [diagram: Oracle → Sqoop (JDBC import) → HDFS | FTP Server → MiNiFi | Social media → Flume/NiFi | Kafka → Kafka Connect | APP → WebHDFS]
  • 44. Inside the Hadoop Platform HDFS: flat files, SequenceFile, ORC, Parquet, AVRO; compression (LZO, Snappy, BZip2); encryption | MR: Java | Hive: runs on MR, Spark or Tez; MetaStore on MySQL/Postgres | Spark: successor to MR – RDDs (transformations/actions), DataFrames/Datasets, Spark SQL, ML/DL, GraphX | HBase with Phoenix | Oozie | Ambari | MPP engines: Tez, Impala, Drill, Presto for interactive/ad-hoc queries
  • 45. Roles of Hadoop Ecosystem Components • Storage • HDFS, the file system of Hadoop • HBase, a NoSQL database on Hadoop • Kafka, a streaming platform for storing messages • Processing • MapReduce • Apache Spark • MPP query engines • Impala • Presto • Tez • Data Ingestion • Flume • Sqoop • NiFi
  • 46. Data Ingestion : Apache Sqoop • Apache SQOOP is a tool designed to support bulk export and import of data into HDFS from RDBMS, enterprise data warehouses, and NoSQL systems. • Sqoop has connectors for working with a range of popular relational databases, including MySQL, PostgreSQL, Oracle, SQL Server, and DB2. • Each of these connectors knows how to interact with its associated RDBMS.
  • 47. Sqoop Features • Import a single table • Import the complete database • Import selected tables • Import selected columns from a particular table • Filter out certain rows from certain tables using the query option • Sqoop uses a map-only MapReduce job to fetch data from an RDBMS and store it on HDFS.
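As an illustration of these features, a typical import might look like the following (the host, database, credentials, table and column names are hypothetical; only the Sqoop flags themselves are standard):

```shell
# Import a whole table into HDFS with 4 parallel map tasks
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  -m 4

# Import selected columns, filtering rows with a free-form query
# ($CONDITIONS is required so Sqoop can split the work across mappers)
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --query 'SELECT order_id, amount FROM orders WHERE status = "SHIPPED" AND $CONDITIONS' \
  --split-by order_id \
  --target-dir /data/raw/shipped_orders
```

These commands require a running Sqoop installation and reachable database, so they are shown as a sketch rather than something runnable as-is.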
  • 48. Data Ingestion : Apache Flume • Apache Flume is a data ingestion tool that ingests data from remote servers • It collects, aggregates and directs streams of data into HDFS • Flume works with different data sources to process and send data to defined destinations, including HDFS Source → Channel → Sink → Destination
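The source → channel → sink pipeline maps directly onto a Flume agent configuration. A minimal sketch follows (the agent and component names a1/r1/c1/k1 and the log path are hypothetical):

```properties
# Single Flume agent: tail an application log into HDFS via a memory channel
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: run `tail -F` on a log file and emit each line as an event
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write events into date-partitioned HDFS directories as plain text
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /data/raw/applogs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```

The channel decouples the source from the sink, so a slow HDFS write does not block log collection until the channel's capacity is exhausted.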
  • 49. Hadoop 3.0 new features • More than two NameNodes supported • Support for erasure coding in HDFS • Support for opportunistic containers and distributed scheduling • Intra-DataNode balancer • Filesystem connector support • JDK 8 is the minimum Java version supported by Hadoop 3.x
  • 50. Let us perform some hands-on ✔ Understanding HDFS commands ✔ Executing MapReduce Job ✔ Verifying the results ✔ Exploring MapReduce Job history page
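For the hands-on, the basic command sequence could look like this (paths and file names are illustrative; the examples JAR ships with standard Hadoop distributions):

```shell
# Create a directory in HDFS and load a local file into it
hdfs dfs -mkdir -p /user/train/input
hdfs dfs -put words.txt /user/train/input
hdfs dfs -ls /user/train/input

# Run the bundled WordCount MapReduce job, then inspect the result
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  wordcount /user/train/input /user/train/output
hdfs dfs -cat /user/train/output/part-r-00000
```

The job's progress and counters can then be verified on the MapReduce JobHistory web page. These commands require a running Hadoop cluster, so treat them as a sketch.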
  • 51. Execution Flow of YARN Application
  • 52. MR Execution Flow (1/4) • The job object created in the client JVM submits the request to the ResourceManager. Note: the client JVM is created on a client node of the Hadoop cluster • The Applications Manager of the ResourceManager then checks whether the application is a valid YARN application, and if it is, generates a unique YARN application ID • Using this unique ID, the job object inside the driver process creates a staging folder such as /tmp/application_2731 and copies all the JARs, scripts, .sql files and dependencies into it • The job object inside the driver process then re-submits the request to the ResourceManager
  • 53. MR Execution Flow (2/4) • The ResourceManager selects one of the data nodes (based on the availability of resources – 1 CPU core, 1 GB RAM) where the ApplicationMaster image will be copied • The ResourceManager then instructs the NodeManager of the respective node (server) to start the image; once started, it is called the App Master container • To break the job into n tasks, the App Master connects to the NameNode and gets the metadata (blocks, locations) for the input folder specified when running the job
  • 54. MR Execution Flow (3/4) • To create n tasks, the App Master first finds the number of input splits with the help of the InputFormat classes. These classes look at each data split and make the records logically complete in order to decide the number of splits • The ApplicationMaster decides the task assignment and requests the ResourceManager to allocate containers • Once resources are allocated, the ApplicationMaster requests the respective NodeManagers to start containers using those resources. Once a container is started, a mapper process starts in that container
  • 55. MR Execution Flow (4/4) • Each mapper task runs the map method implemented in the Mapper Java class • Once the mappers are done, sort and shuffle takes place. As part of this phase, hash partitioning writes the data (key-value pairs) into n part-m-* files, where n depends on the number of reducers set for the application • These part files are copied to the respective reducers, which perform a merge sort to combine the multiple values of a given key into a list of values • The key and its list of values are fed into the reduce method of the Reducer class • The output of the reducers is stored in HDFS
  • 56. Lifecycle of a YARN Application [diagram: 1 The client submits an application request to the ResourceManager's ApplicationMasterService 2 The ResourceManager responds with an ApplicationID 3 The client sends the Application Submission Context (ApplicationId, queue, application name, application type, AM resource, priority) 4 plus the Container Launch Context (launch commands, LocalResources, delegation tokens) 5 The ResourceManager has a NodeManager with free capacity start the ApplicationMaster, which gets the cluster's capabilities 6 The ApplicationMaster requests and receives containers from the Scheduler 7 Container launch requests go to the NodeManagers, which start the job's map and reducer containers]
  • 59. How does MapReduce work? Map Input List → [Mapper] → Map Output List (Mapping Phase); Reduce Input List → [Reducer] → Reduce Output List (Reducing Phase)
  • 68. Hadoop MapReduce – word count example. Input: "King Queen King / Minister King Soldier / Queen Soldier King". Splitting: <1, King Queen King> <2, Minister King Soldier> <3, Queen Soldier King>. Map output: <King, 1> <Queen, 1> <King, 1> | <Minister, 1> <King, 1> <Soldier, 1> | <Queen, 1> <Soldier, 1> <King, 1>. Shuffling: <King, (1,1,1,1)> <Minister, 1> <Queen, (1,1)> <Soldier, (1,1)>. Reduce result: <King, 4> <Minister, 1> <Queen, 2> <Soldier, 2>
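The King/Queen walkthrough above can be simulated in a few lines of plain Python — the mapper emits (word, 1) pairs, the shuffle groups values by key, and the reducer sums each group. This is only a local sketch of the semantics, not Hadoop itself:

```python
from collections import defaultdict

lines = ["King Queen King", "Minister King Soldier", "Queen Soldier King"]

# Map phase: each input record -> list of (word, 1) pairs
def mapper(line):
    return [(word, 1) for word in line.split()]

map_output = [pair for line in lines for pair in mapper(line)]

# Shuffle phase: group all values belonging to the same key
shuffled = defaultdict(list)
for key, value in map_output:
    shuffled[key].append(value)

# Reduce phase: (key, [values]) -> (key, sum of values)
def reducer(key, values):
    return key, sum(values)

result = dict(reducer(k, v) for k, v in shuffled.items())
print(result)  # {'King': 4, 'Queen': 2, 'Minister': 1, 'Soldier': 2}
```

In real Hadoop the mapper and reducer run in separate containers and the shuffle moves part files across the network, but the data transformation is exactly this.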
  • 78–93. Hadoop MapReduce – Roles: User vs. Framework
  Using the same word-count flow (Input → Splitting → Map → Shuffling → Reduce → Result), the work divides as follows:
  User responsibilities:
  - Load data into HDFS
  - Specify input path & InputFormat
  - Map logic (user-defined)
  - Reduce logic (user-defined)
  - Specify output path & OutputFormat
  Framework responsibilities:
  - Create 'Input Splits'
  - Create individual records from each split
  - Shuffle and sort the intermediate <key, value> pairs
  - Replication, rack awareness etc.
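The division of labour above can be sketched as a toy framework: the user supplies only the map and reduce callables, while the (simulated) framework owns record feeding, shuffling, and sorting. `run_job` is a made-up name for illustration, not a Hadoop API.

```python
from collections import defaultdict

def run_job(records, user_map, user_reduce):
    """Toy MapReduce driver: the framework owns splitting, shuffling
    and sorting; the user owns only the map and reduce logic."""
    # Framework: feed each record to the user's map function
    intermediate = []
    for key, value in records:
        intermediate.extend(user_map(key, value))
    # Framework: shuffle - group values by key, then sort the keys
    groups = defaultdict(list)
    for k, v in intermediate:
        groups[k].append(v)
    # Framework: call the user's reduce once per key
    return {k: user_reduce(k, vs) for k, vs in sorted(groups.items())}

# User-defined logic: the only code a MapReduce developer writes
word_map = lambda _, line: [(w, 1) for w in line.split()]
count_reduce = lambda _, counts: sum(counts)

records = [(1, "King Queen King"), (2, "Minister King Soldier"), (3, "Queen Soldier King")]
print(run_job(records, word_map, count_reduce))
# {'King': 4, 'Minister': 1, 'Queen': 2, 'Soldier': 2}
```

Swapping in a different `user_map`/`user_reduce` pair (e.g. max temperature per year) changes the job without touching the framework code, which is the point of the user-vs-framework split.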
  • 94. ▪ An MR job can contain both map and reduce tasks, or just mappers ▪ The number of mappers in a job is determined by the number of input splits ▪ The number of reducers in a job needs to be set by the developer ▪ By default, a job has just one reducer ▪ Otherwise, a commonly recommended value is about ¼ as many reducers as mapper tasks ▪ Each map and reduce task runs in a separate JVM container ▪ By default, each map and reduce JVM gets 1 GB of memory and 1 CPU core
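The sizing rules above work out as a small calculation. This helper is purely illustrative (not a Hadoop API); the 128 MB default split size assumes the HDFS default block size of Hadoop 2+.

```python
import math

def plan_tasks(input_bytes, split_bytes=128 * 1024 * 1024):
    """Illustrative sizing: mappers = number of input splits;
    reducers default to 1, with ~1/4 of the mapper count as a
    common starting point. Hypothetical helper, not a Hadoop API."""
    mappers = math.ceil(input_bytes / split_bytes)     # one mapper per split
    default_reducers = 1                               # Hadoop's default
    recommended_reducers = max(1, mappers // 4)        # rule of thumb from the slide
    return mappers, default_reducers, recommended_reducers

# A 10 GB input with 128 MB splits -> 80 mappers, 1 reducer by default, ~20 recommended
print(plan_tasks(10 * 1024**3))  # (80, 1, 20)
```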
  • 95. ▪ The Driver is the process that runs inside the Application Master JVM ▪ Each mapper and reducer task is run as a separate JVM launched by a Node Manager ▪ A failed task is retried up to a maximum of 4 times ▪ The Application Master assigns each task to a specific Node Manager ▪ InputFormat is the Java class that has the intelligence to reconstruct complete records from the input splits
  • 99–131. MapReduce Execution Framework
  Driver
  - Defines the InputFormat, Mapper, Reducer, OutputFormat and Writer used by the job
  Mapper Process (one per input split)
  - InputFormat: calculates the input splits (Input Split 1 … Input Split 4) from the HDFS blocks (Block A, Block B, Block C) of the input file inputFile.txt
  - Record Reader: reads its input split and passes <K, V> pairs to the Mapper
  - Mapper: user logic; emits intermediate <K, V> pairs
  - Partition: decides which reduce process each intermediate key is sent to
  Shuffle & Sort
  - Intermediate <K, V> pairs are shuffled across to the reduce processes and sorted by key
  Reduce Process
  - Reducer: user logic; receives the grouped <K, V> pairs and passes its output <K, V> pairs onward
  - OutputFormat / Writer: define how and where the output data is written
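The Partition step above can be illustrated with the default scheme: Hadoop's `HashPartitioner` routes each key with `key.hashCode() % numReducers`, so every occurrence of a key lands on the same reducer. This Python sketch mimics the idea with `crc32` (chosen only because Python's built-in `hash` is salted per process and would not be deterministic).

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Mimics Hadoop's default HashPartitioner: a stable hash of the
    key modulo the number of reducers (crc32 used here so the result
    is deterministic across runs)."""
    return zlib.crc32(key.encode()) % num_reducers

intermediate = [("King", 1), ("Queen", 1), ("King", 1), ("Minister", 1),
                ("King", 1), ("Soldier", 1), ("Queen", 1), ("Soldier", 1), ("King", 1)]

num_reducers = 2
buckets = defaultdict(list)
for key, value in intermediate:
    buckets[partition(key, num_reducers)].append((key, value))

# Every occurrence of the same key lands in the same reducer's bucket
for r in sorted(buckets):
    print(f"reducer {r}: {sorted(buckets[r])}")
```

Because partitioning is by key, each reducer can sum its keys independently; this is why the shuffle can route data before the reduce phase starts.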