Classification: Restricted
Hadoop Developer Training
Session 01 – Introduction to Hadoop & Big Data
Agenda
• What is Big Data?
• What is Hadoop?
• Overview of Hadoop Ecosystem
• Hadoop Distributed File System or HDFS
• Hadoop Cluster Modes
• YARN
• MapReduce
• Hive
• Pig
• ZooKeeper
• Flume
• Sqoop
What is Big Data?
Big data can be characterized by the 3 Vs:
• The extreme volume of data.
• The velocity at which the data must be processed.
• The wide variety of types of data.
In more detail:
• Volume: the size, amount, or quantity of data.
• Velocity: the speed of data.
– The speed at which data must be stored.
– The speed at which data must be processed.
• Variety: the type of data to be stored or processed.
– Structured data
– Unstructured data
– Semi-structured data
Characterization of Big Data
Volume, Velocity, Variety (the 3 Vs)
What Is Hadoop?
• A framework for storing and processing data using commodity hardware and storage.
We need a system that supports:
• Distributed parallel processing
• Built-in backup and failover mechanisms
• Easy and economical scaling
• Efficiency and reliability
So we need Hadoop.
Overview of the Hadoop Ecosystem
[Figure: Hadoop ecosystem components]
The Hadoop Distributed File System (HDFS)
• HDFS is the storage system for a Hadoop cluster.
• When data arrives at the cluster, the HDFS software breaks it into pieces and distributes those pieces among the different servers participating in the cluster.
• Each server stores just a small fragment of the complete data set.
• Each piece of data is replicated on more than one server.
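The splitting and replication described above can be sketched in a few lines of Python. This is a toy model only: the function names, the tiny block size, and the round-robin placement are assumptions made for illustration. Real HDFS uses far larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Break the incoming data into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, servers: list, replication: int) -> dict:
    """Assign each block to `replication` distinct servers, round-robin (assumed policy)."""
    if replication > len(servers):
        raise ValueError("replication factor exceeds number of servers")
    return {
        block: [servers[(block + r) % len(servers)] for r in range(replication)]
        for block in range(num_blocks)
    }


blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["s1", "s2", "s3"], replication=2)
print(len(blocks), placement)
```

Each server ends up holding only some of the blocks, and every block lives on two servers, so losing one server does not lose data.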
Hadoop Cluster Modes
Hadoop can run in three modes:
• Standalone mode
• Pseudo-distributed mode (single-node cluster)
• Fully distributed mode (multi-node cluster)
Standalone Mode
• The default mode of Hadoop.
• HDFS is not used in this mode.
• The local file system is used for input and output.
• No custom configuration is required in the three Hadoop configuration files:
– mapred-site.xml
– core-site.xml
– hdfs-site.xml
• Standalone mode is much faster than pseudo-distributed mode.
Pseudo-Distributed Mode (Single-Node Cluster)
• Configuration is required in the same three files (mapred-site.xml, core-site.xml, hdfs-site.xml) for this mode.
• The replication factor is one for HDFS.
• A single node serves as master node, data node, JobTracker, and TaskTracker.
• Used to test real code against HDFS.
• A pseudo-distributed cluster is a cluster where all daemons run on one node.
Fully Distributed Mode (Multi-Node Cluster)
• This is the production mode.
• Data is distributed across many nodes.
• Different nodes serve as master node, data node, JobTracker, and TaskTracker.
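As a rough illustration of the configuration that pseudo-distributed mode needs, the Python sketch below generates the contents of two of the site files. The helper name build_site_xml is hypothetical. dfs.replication is the property named in this deck; fs.defaultFS and the value hdfs://localhost:9000 are assumed example settings (older releases call this property fs.default.name).

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties: dict) -> str:
    """Return a Hadoop-style <configuration> document as a string."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# hdfs-site.xml: a single node can hold only one copy of each block.
hdfs_site = build_site_xml({"dfs.replication": 1})
# core-site.xml: assumed example address for the local file system daemon.
core_site = build_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
print(hdfs_site)
print(core_site)
```

In standalone mode none of this is needed; in fully distributed mode the same files would carry real host names and a higher replication factor.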
Hadoop Cluster – Core Components
A Hadoop cluster has three core components:
• Client
• Master
• Slave
The role of each component is shown in the image below.
[Figure: roles of the client, master, and slave nodes]
Client
The client is neither a master nor a slave. Its role is to load data into the cluster, submit MapReduce jobs describing how the data should be processed, and then retrieve the results after the job completes.
Masters
The masters consist of three components:
• NameNode
• Secondary NameNode
• JobTracker
Slaves
Slave nodes make up the majority of machines in a Hadoop cluster and are responsible for:
• Storing the data
• Running the computations
Each slave runs both a DataNode and a TaskTracker daemon, which communicate with their respective masters. The TaskTracker daemon is a slave to the JobTracker, and the DataNode daemon is a slave to the NameNode.
The HDFS replication factor is set by the dfs.replication parameter in the file hdfs-site.xml.
Equip the NameNode with a highly redundant, enterprise-class server configuration: dual power supplies, hot-swappable fans, redundant NIC connections, etc.
YARN
YARN stands for Yet Another Resource Negotiator. It is also called MapReduce 2 (MRv2). The two major functions of the JobTracker in MRv1, resource management and job scheduling/monitoring, are split into separate daemons:
• ResourceManager
• NodeManager
• ApplicationMaster
Features:
• Better resource management
• Scalability
• Dynamic allocation of cluster resources
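The division of labour between the three daemons can be pictured with a toy Python model. The class names mirror the daemons above, but everything else (memory-only resources, the "most free memory" placement rule, the method names) is an assumption for illustration and not the YARN API.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Tracks the free memory and running containers of one machine."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Cluster-wide resource management: decides where a container may run."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, mem_mb):
        # Assumed policy: pick the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_mb)
        if node.free_mb < mem_mb:
            return None  # no node can host the container right now
        node.free_mb -= mem_mb
        node.containers.append((app_id, mem_mb))
        return node.name


class ApplicationMaster:
    """Per-application scheduling: asks the ResourceManager for containers."""

    def __init__(self, app_id, rm):
        self.app_id, self.rm = app_id, rm

    def request(self, count, mem_mb):
        return [self.rm.allocate(self.app_id, mem_mb) for _ in range(count)]


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
am = ApplicationMaster("app-1", rm)
placements = am.request(count=3, mem_mb=2048)
print(placements)
```

The point of the split is visible even in the toy: the ResourceManager knows nothing about the job itself, and the ApplicationMaster knows nothing about other applications.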
MapReduce
• Parallel job-processing framework
• Written in Java
• Close integration with HDFS
• Provides:
– Automatic partitioning of a job into subtasks
– Automatic retry on failure
– Locality of task execution
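The programming model behind the framework can be shown with a word count written in plain Python. This is a single-process sketch with hypothetical function names; it imitates the map, shuffle, and reduce steps but does none of the partitioning, retry, or locality work that Hadoop provides.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Each string stands in for one input split handled by a separate map task.
splits = ["big data needs hadoop", "hadoop stores big data"]
mapped = [pair for line in splits for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```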
Hive
• Apache Hive in a few words: “A data warehouse infrastructure built on top of Apache Hadoop”
• Used for:
– Ad hoc querying and analysis of large data sets without having to learn MapReduce
• Main features:
– SQL-like query language called HQL
– Built-in user-defined functions (UDFs) for manipulating dates and strings, plus other data-mining tools
– Support for different storage types such as plain text, HBase, and others
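To give a feel for what "SQL-like" means, the Python sketch below runs an ordinary aggregate query against an in-memory SQLite database. SQLite is only a stand-in here, not Hive, and the page_views table is invented for the example; a SELECT with GROUP BY of this shape can also be written in HQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "home"), ("u2", "home"), ("u1", "cart")],
)

# The analyst writes a declarative query instead of a MapReduce program.
query = "SELECT page, COUNT(*) AS views FROM page_views GROUP BY page ORDER BY page"
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

In Hive, a query like this would be compiled into jobs that run over data stored in the cluster rather than executed locally.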
Pig
Data access: Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators, and programmers can also develop their own functions for reading, writing, and processing data.
To analyze data using Apache Pig, programmers write scripts in Pig Latin. These scripts are converted internally into Map and Reduce tasks: Apache Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts them into MapReduce jobs.
Salient features of Pig:
• Ease of programming
• Optimization opportunities
• Extensibility
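The "data flow" idea can be imitated in plain Python, one step per line. The comments name the Pig Latin operators that play a similar role (LOAD, FILTER, GROUP, FOREACH ... GENERATE); the records and field layout are invented for the example, and this is not how the Pig Engine executes a script.

```python
from itertools import groupby
from operator import itemgetter

# LOAD: (user, page, seconds_on_page) records, hard-coded here for illustration.
records = [
    ("alice", "home", 3),
    ("bob", "cart", 4),
    ("alice", "cart", 2),
    ("bob", "home", 0),
]

# FILTER: keep only visits with a positive duration.
filtered = [r for r in records if r[2] > 0]

# GROUP: collect records by user (groupby needs its input sorted by the key).
grouped = groupby(sorted(filtered, key=itemgetter(0)), key=itemgetter(0))

# FOREACH ... GENERATE: produce one total per group.
totals = {user: sum(r[2] for r in rows) for user, rows in grouped}
print(totals)
```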
ZooKeeper
"ZooKeeper allows distributed processes to coordinate with each other through a shared hierarchical name space of data registers"
• Configuration management: machines read their config from a centralized source, which facilitates simpler deployment and provisioning
• Leader election: a common problem in distributed coordination
• Centralized and highly reliable (simple) data registry
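Leader election over a shared hierarchical namespace can be sketched with an in-memory toy. The ToyRegistry class and its methods are hypothetical and are not the ZooKeeper client API; the rule used (each member registers a sequentially numbered entry and the lowest number leads) follows the commonly described election recipe.

```python
class ToyRegistry:
    """In-memory stand-in for a hierarchical namespace of data registers."""

    def __init__(self):
        self.nodes = {}   # path -> data
        self.counter = 0  # source of increasing sequence numbers

    def create_sequential(self, prefix, data):
        path = f"{prefix}{self.counter:010d}"
        self.counter += 1
        self.nodes[path] = data
        return path

    def delete(self, path):
        self.nodes.pop(path, None)

    def children(self, parent):
        return sorted(p for p in self.nodes if p.startswith(parent + "/"))


def current_leader(registry, parent="/election"):
    """The member holding the lowest sequence number is the leader."""
    members = registry.children(parent)
    return registry.nodes[members[0]] if members else None


reg = ToyRegistry()
a = reg.create_sequential("/election/member-", "worker-a")
b = reg.create_sequential("/election/member-", "worker-b")
first = current_leader(reg)
reg.delete(a)  # the leader goes away
second = current_leader(reg)
print(first, second)
```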
Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Features:
• Robust
• Fault tolerant
• Simple and flexible architecture based on streaming data flows
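A streaming data flow can be pictured as a source that feeds events into a buffer and a sink that drains it. The Python sketch below uses the source, channel, and sink terms from Flume's documentation, which this deck does not cover; the functions are hypothetical and run in one process, so they show only the shape of the flow.

```python
from queue import Queue


def source(lines, channel):
    """Source: wrap each incoming log line as an event and buffer it."""
    for line in lines:
        channel.put({"body": line})


def sink(channel, store):
    """Sink: drain buffered events and deliver them to the destination."""
    while not channel.empty():
        store.append(channel.get()["body"])


channel = Queue()  # buffer that decouples the source from the sink
collected = []     # stand-in for the final destination, such as HDFS
source(["GET /index.html 200", "GET /missing 404"], channel)
sink(channel, collected)
print(collected)
```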
Sqoop
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data in Hadoop MapReduce, and then export the data back into an RDBMS. Key features of Sqoop include:
• Bulk import: Sqoop can import individual tables or entire databases into HDFS. The data is stored in the native directories and files of the HDFS file system.
• Data export: Sqoop can export data directly from HDFS into a relational database, using a target table definition based on the specifics of the target database.
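The import and export directions map onto two Sqoop command-line invocations. The Python sketch below only assembles the argument lists and does not run anything. The helper names, the JDBC URL, the table names, and the HDFS paths are all made up for the example; the flags shown (--connect, --table, --target-dir, --export-dir, --username) are standard Sqoop options.

```python
def sqoop_import_args(jdbc_url, table, target_dir, username=None):
    """Arguments for copying one RDBMS table into an HDFS directory."""
    args = ["sqoop", "import", "--connect", jdbc_url,
            "--table", table, "--target-dir", target_dir]
    if username:
        args += ["--username", username]
    return args


def sqoop_export_args(jdbc_url, table, export_dir):
    """Arguments for copying an HDFS directory back into an RDBMS table."""
    return ["sqoop", "export", "--connect", jdbc_url,
            "--table", table, "--export-dir", export_dir]


url = "jdbc:mysql://dbhost/shop"  # hypothetical database
import_cmd = sqoop_import_args(url, "orders", "/data/orders", username="etl")
export_cmd = sqoop_export_args(url, "order_totals", "/data/order_totals")
print(" ".join(import_cmd))
print(" ".join(export_cmd))
```

On a machine with Sqoop installed, lists like these could be handed to subprocess.run; the transformation step in between would be a MapReduce, Hive, or Pig job.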
Any Questions?
Thank You!
