This document provides an overview of Big Data and Hadoop. It discusses what Big Data is, why existing data analytics approaches have limitations, and how Hadoop addresses these issues. Hadoop uses a master-slave architecture with the NameNode as master and DataNodes as slaves. It stores data in HDFS as blocks across DataNodes and allows distributed processing via MapReduce. The document covers Hadoop 1.0 and 2.0 components as well as challenges of Hadoop 1.x like single point of failure and lack of high availability of the NameNode.
There's a big shift at both the architecture and API level from Hadoop 1 to Hadoop 2, particularly YARN, and we held our first meetup to discuss this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.
5 Scenarios: When To Use & When Not to Use Hadoop | Edureka!
Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved into a must-know technology and has opened up better career, salary and job opportunities for many professionals.
Join Cloudera’s founder and Chief Scientist, Jeff Hammerbacher, as he describes ten common problems that are being solved with Apache Hadoop.
A replay of the webinar can be viewed here:
https://www1.gotomeeting.com/register/719074008
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka!
YouTube Link: https://youtu.be/ll_O9JsjwT4
** Big Data Hadoop Certification Training - https://www.edureka.co/big-data-hadoop-training-certification **
This Edureka PPT on "Hadoop Components" will provide you with detailed knowledge of the top Hadoop components and help you understand their different categories. This PPT covers the following topics:
What is Hadoop?
Core Components of Hadoop
Hadoop Architecture
Hadoop EcoSystem
Hadoop Components in Data Storage
General Purpose Execution Engines
Hadoop Components in Database Management
Hadoop Components in Data Abstraction
Hadoop Components in Real-time Data Streaming
Hadoop Components in Graph Processing
Hadoop Components in Machine Learning
Hadoop Cluster Management tools
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
What is HDFS | Hadoop Distributed File System | Edureka!
( Hadoop Training: https://www.edureka.co/hadoop )
This What is HDFS PPT will help you understand the Hadoop Distributed File System and its features, along with a practical demonstration. In this PPT, we will cover:
1. What is DFS and Why Do We Need It?
2. What is HDFS?
3. HDFS Architecture
4. HDFS Replication Factor
5. HDFS Commands Demonstration on a Production Hadoop Cluster
Check our complete Hadoop playlist here: https://goo.gl/hzUO0m
Slides from May 2018 St. Louis Big Data Innovations, Data Engineering, and Analytics User Group meeting. The presentation focused on Data Modeling in Hive.
This Hadoop presentation will help you understand the different tools present in the Hadoop ecosystem. It takes you through an overview of the important tools of the Hadoop ecosystem, which include Hadoop HDFS, Hadoop Pig, Hadoop YARN, Hadoop Hive, Apache Spark, Mahout, Apache Kafka, Storm, Sqoop, Apache Ranger and Oozie, and discusses the architecture of these tools. It covers the different tasks of Hadoop such as data storage, data processing, cluster resource management, data ingestion, machine learning, streaming and more. Now, let us get started and understand each of these tools in detail.
The topics below are explained in this Hadoop ecosystem presentation:
1. What is Hadoop ecosystem?
1. Pig (Scripting)
2. Hive (SQL queries)
3. Apache Spark (Real-time data analysis)
4. Mahout (Machine learning)
5. Apache Ambari (Management and monitoring)
6. Kafka & Storm
7. Apache Ranger & Apache Knox (Security)
8. Oozie (Workflow system)
9. Hadoop MapReduce (Data processing)
10. Hadoop Yarn (Cluster resource management)
11. Hadoop HDFS (Data storage)
12. Sqoop & Flume (Data collection and ingestion)
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create databases and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, its architecture, sources, sinks, channels, and configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Learn Spark SQL: creating, transforming, and querying DataFrames
14. Understand the common use-cases of Spark and the various interactive algorithms
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training.
This slide deck is used as an introduction to the internals of Hadoop MapReduce, as part of the Distributed Systems and Cloud Computing course I hold at Eurecom.
Course website:
http://michiard.github.io/DISC-CLOUD-COURSE/
Sources available here:
https://github.com/michiard/DISC-CLOUD-COURSE
A presentation on Big Data, from the workshop "The Era of Big Data: Why and How?" at the 22nd Computer Society of Iran Computer Conference (csicc2017.ir).
Vahid Amiri
vahidamiry.ir
datastack.ir
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Top Hadoop Big Data Interview Questions and Answers for Freshers | JanBask Training
The data management industry has matured over the last three decades, primarily based on relational database management system (RDBMS) technology. As the volume, variety and velocity of data collected and analyzed in enterprises has increased severalfold, organizations have started struggling with the architectural limitations of traditional RDBMS. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of "Big Data". In this paper we trace the origin of one such system, Hadoop, built to handle Big Data.
Asserting that Big Data is vital to business is an understatement. Organizations have generated more and more data for years, but struggle to use it effectively. Clearly Big Data has more important uses than ensuring compliance with regulatory requirements. In addition, data is being generated with greater velocity, due to the advent of new pervasive devices (e.g., smartphones and tablets), social Web sites (e.g., Facebook, Twitter, LinkedIn) and other sources like GPS, Google Maps, and heat/pressure sensors.
Big Data raises challenges about how to process such a vast pool of raw data and how to derive value from it. To address these demands, an ecosystem of tools named Hadoop was conceived.
Module 01 - Understanding Big Data and Hadoop 1.x,2.x
1. Naveen P.N
Trainer
NPN Training. "Training is the essence of success and we are committed to it."
www.npntraining.com
Module 01 - Understanding Big Data and Hadoop
Includes (Hadoop 1.x & 2.x Architecture)
2. Topics for the Module
What is Big Data
OLTP VS OLAP
Limitation of existing Data Analytics
Moving Data into Code
Moving Code into Data
Hadoop 1.0 / 2.0 Core Components
Hadoop 2.0 Core Components
Hadoop Master Slave Architecture
After completing the module, you will be able to understand:
File Blocks
Rack Awareness
Anatomy of File Read and Write
Hadoop 1.x Challenges
Scala REPL
Scala Variable Types
3. What is Big Data
Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using traditional data-processing applications.
www.npntraining.com/masters-program/big-data-architect-training.php
4. Where Is This "Big Data" Coming From?
12+ TB of tweet data every day
25+ TB of log data every day
? TB of data every day
2+ billion people on the Web by end of 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009; 200M by 2014
5. About RDBMS
Why do I need an RDBMS?
For quick response times.
It enables relationships between data elements to be defined and managed.
It enables one database to be utilized for all applications.
If data is already stored in an RDBMS, then what is the problem? Why did the Big Data problem arise?
6. OLTP vs OLAP
We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general, OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.
7. Big Data spans three dimensions (3Vs)
"Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
8. Limitation of Existing Data Analytics Architecture
Raw data lands in a storage-only grid (SAN); an ETL compute grid loads aggregated data into an RDBMS.
Problems:
Can't explore the original high-fidelity raw data.
Premature data death: 90% of the data is archived, and a meagre 10% is available for BI.
9. Solution: A Combined Storage + Compute Layer
Hadoop provides both the storage and the compute grid together, feeding BI reports and interactive apps, with the RDBMS still holding aggregated data.
Scalable throughput for ETL & aggregation
Data exploration & advanced analytics
No data archiving: keep data alive forever
The entire data set is available for processing
10. Processing Data in the Enterprise: Traditional Approach
Processing 1 TB of data: 1 machine, 4 I/O channels, each channel at 100 MB/s, takes about 45 minutes.
Limitation: the job is bounded by a single machine's I/O bandwidth.
11. Processing Data in a DFS: Hadoop Approach
Processing 1 TB of data: 10 machines, 4 I/O channels each, each channel at 100 MB/s, takes about 4.3 minutes.
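The arithmetic behind these two figures can be sanity-checked with a short sketch (illustrative only: it assumes 1 TB = 1,000,000 MB, perfectly parallel channels, and no coordination overhead, which is why it lands slightly under the deck's 45 and 4.3 minute figures):

```python
def read_time_minutes(data_mb, machines, channels_per_machine=4, channel_mbps=100):
    """Time to scan data spread evenly across machines, all channels reading in parallel."""
    throughput = machines * channels_per_machine * channel_mbps  # aggregate MB/s
    return data_mb / throughput / 60

one_tb = 1_000_000  # MB
print(round(read_time_minutes(one_tb, machines=1), 1))   # ~41.7 minutes on 1 machine
print(round(read_time_minutes(one_tb, machines=10), 1))  # ~4.2 minutes on 10 machines
```

Adding machines multiplies aggregate I/O bandwidth, which is exactly the point of distributing both storage and compute.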
12. What is Apache Hadoop
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
To solve the Big Data problem, a new framework evolved: Hadoop. Hadoop provides:
Commodity hardware
Big clusters
MapReduce
Failover
Data distribution
Moving code to data
Heterogeneous hardware
Scalability
Hadoop is based on work done by Google in the early 2000s: the Google File System (GFS) paper was published in 2003 and the MapReduce paper in 2004.
It is an architecture that can scale with the huge volume, variety and speed requirements of Big Data by distributing the work across dozens, hundreds, or even thousands of commodity servers that process the data in parallel.
13. Moving data into code (contd.)
A user with a terabyte of data wants to analyze it.
In the traditional data processing architecture, nodes are broken up into separate processing and storage nodes connected by a high-capacity link.
Many data-intensive applications are CPU-demanding, causing bottlenecks in the network.
There is latency in transferring the data.
14. Moving code to data
A user wants to analyze the data; the client writes MapReduce jobs, which are shipped to the nodes.
Hadoop takes a radically new approach to the problem of distributed computing:
Distribute the data to multiple nodes.
Distribute the program for computation to those same nodes.
Individual nodes then work on the data residing on them.
No data transfer over the network is required for the initial processing.
Additional nodes can be added for scalability.
15. Distribution Vendors
Cloudera Distribution for Hadoop (CDH)
MapR Distribution
Hortonworks Data Platform
Apache Bigtop Distribution
16. Hadoop 1.0 Core Components
Hadoop has two main components:
1. HDFS – Hadoop Distributed File System (storage): responsible for storing the data in chunks, by splitting files into blocks of 64 MB each.
2. MapReduce (processing): processes the data in a massively parallel manner.
Daemons:
HDFS (storage): NameNode (master), DataNode (slave), Secondary NameNode
MapReduce (processing): JobTracker (master), TaskTracker (slave)
17. Hadoop 2.0 Core Components
Hadoop 2.0 has two main components:
1. HDFS – Hadoop Distributed File System (storage): responsible for storing the data in chunks, by splitting files into blocks of 128 MB each.
2. YARN/MRv2 (processing): processes the data in a massively parallel manner.
Daemons:
HDFS (storage): NameNode (master), DataNode (slave), Secondary NameNode
YARN/MRv2 (processing): ResourceManager (master), NodeManager (slave)
18. HDFS – Hadoop Distributed File System
HDFS is a distributed and scalable file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and a number of DataNodes (slave nodes).
In a conventional file system, each file is divided into small blocks (on the order of 512-byte sectors); HDFS, by contrast, uses much larger blocks.
19. File Blocks
By default, the block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x.
Why is the block size so large?
The main reason for having large HDFS blocks is the cost of seek time: with a large block, the time spent transferring data dominates the time spent seeking to it.
The large block size also accounts for proper usage of storage space while respecting the limit on the NameNode's memory, which must track every block.
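The seek-time argument can be made concrete with a small sketch (the 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions, not figures from the deck):

```python
import math

def seek_overhead(block_mb, seek_ms=10.0, transfer_mbps=100.0):
    """Fraction of a block's read time spent seeking rather than transferring."""
    seek_s = seek_ms / 1000.0
    transfer_s = block_mb / transfer_mbps
    return seek_s / (seek_s + transfer_s)

def blocks_to_track(file_gb, block_mb):
    """Number of block objects the NameNode must hold in RAM for one file."""
    return math.ceil(file_gb * 1024 / block_mb)

print(f"{seek_overhead(4):.1%}")    # seek overhead with small 4 MB blocks
print(f"{seek_overhead(128):.2%}")  # seek overhead with HDFS-sized 128 MB blocks
print(blocks_to_track(1, 4), "vs", blocks_to_track(1, 128))  # block objects per GB
```

Larger blocks both cut the seek overhead by two orders of magnitude and shrink the per-file metadata the NameNode keeps in memory.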
20. Hadoop 1.0 Master-Slave Architecture – Simple cluster setup with Hadoop daemons
Master: NameNode (single box), JobTracker (single box), Secondary NameNode (single box); optionally the master daemons are split across two boxes.
Slaves (in separate boxes, many): each slave node (Slave1, Slave2, Slave3, ...) runs a TaskTracker and a DataNode.
website : www.npntraining.com
21. Hadoop 2.0 Master-Slave Architecture – Simple cluster setup with Hadoop daemons
Master: Active NameNode (single box), Standby NameNode (single box), ResourceManager (single box), Secondary NameNode (single box); optionally the master daemons are split across two boxes.
Slaves (in separate boxes, many): each slave node (Slave1, Slave2, Slave3, ...) runs a NodeManager and a DataNode.
23. Hadoop Cluster: A Typical Use Case
24. File Blocks in HDFS
A client wants to save 400 MB of data into the cluster/HDFS. It communicates with the master node (NameNode), which decides which nodes to write the data to. The first copy is always stored on nodes in close proximity to the client.
In HDFS the data is broken into blocks of the configured block size; with 128 MB blocks, the 400 MB file becomes blocks of 128 MB, 128 MB, 128 MB and 16 MB.
Hadoop creates 3 replicas of each block by default (this is configurable), and thereby achieves fault tolerance.
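The 400 MB example above can be sketched as a simple block-splitting function (a toy model, not HDFS client code):

```python
def split_into_blocks(file_mb, block_mb=128):
    """Split a file into HDFS-style blocks: full blocks plus one remainder block."""
    blocks = [block_mb] * (file_mb // block_mb)
    if file_mb % block_mb:
        blocks.append(file_mb % block_mb)
    return blocks

print(split_into_blocks(400))           # [128, 128, 128, 16]
print(len(split_into_blocks(400)) * 3)  # 12 block replicas with replication factor 3
```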
27. NameNode
The NameNode does not store the files themselves, only the files' metadata.
The NameNode keeps track of all file-system-related information (metadata), such as:
Block locations
Information about file permissions and ownership
Last access time for each file
User permissions, i.e. which users have access to a file
The NameNode oversees the health of the DataNodes and coordinates access to the data stored in them.
The entire metadata is held in main memory.
28. NameNode Metadata
The entire metadata is held in main memory; there is no demand paging of file-system metadata.
The NameNode maintains two files:
1. fsimage
2. edit log
The fsimage is a file that represents a point-in-time snapshot of the file system's metadata. While the fsimage format is very efficient to read, it is unsuitable for making small incremental updates, like renaming a single file. Thus, rather than writing a new fsimage every time the namespace is modified, the NameNode records each modifying operation in the edit log for durability.
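The fsimage/edit-log design is the classic snapshot-plus-journal pattern. A toy sketch of the idea (invented names for illustration, not NameNode code):

```python
class TinyNamespace:
    """Toy snapshot-plus-journal store mimicking fsimage + edit log."""
    def __init__(self):
        self.fsimage = {}   # point-in-time snapshot of the metadata
        self.edit_log = []  # incremental modifications since the snapshot

    def record(self, op, path, value=None):
        # Cheap append; in real HDFS this is what makes each change durable
        self.edit_log.append((op, path, value))

    def checkpoint(self):
        # Replay the edit log onto the snapshot (the Secondary NameNode's job)
        for op, path, value in self.edit_log:
            if op == "create":
                self.fsimage[path] = value
            elif op == "delete":
                self.fsimage.pop(path, None)
        self.edit_log = []

ns = TinyNamespace()
ns.record("create", "/logs/a.txt", {"blocks": 2})
ns.record("delete", "/logs/a.txt")
ns.record("create", "/logs/b.txt", {"blocks": 1})
ns.checkpoint()
print(ns.fsimage)  # {'/logs/b.txt': {'blocks': 1}}
```

Appending to the journal is O(1) per operation, while rewriting the full snapshot on every change would not be; checkpointing amortizes that cost.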
29. Secondary NameNode (CheckPoint Node)
The Secondary NameNode is not a hot standby for the NameNode.
It connects to the NameNode every hour and pulls its metadata (the fsimage and edit logs).
It performs housekeeping and keeps a backup of the NameNode metadata.
The saved metadata can be used to rebuild a failed NameNode.
30. Hadoop Components (contd.)
The master node runs the NameNode (and the ResourceManager); each slave node runs a DataNode (and a NodeManager). A Secondary NameNode assists the master.
NameNode:
Maintains and manages the blocks present on the slave nodes.
Periodically receives a heartbeat and a block report from each of the DataNodes in the cluster. A heartbeat, sent once every 3 seconds, implies that the DataNode is functioning properly.
The HDFS architecture is built in such a way that user data is never stored on the NameNode; it stores only metadata.
It records the metadata of all the files stored in the cluster, e.g. their location, size, permissions, hierarchy, etc.
DataNodes:
Perform the low-level read and write requests from the file system's clients.
Are responsible for creating blocks, deleting blocks and replicating them based on decisions taken by the NameNode.
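The heartbeat bookkeeping can be sketched as follows (a toy model; the 630-second timeout mirrors HDFS's default dead-node interval of 2 × recheck-interval + 10 × heartbeat-interval, which is an assumption about default configuration, not something stated in the deck):

```python
def dead_datanodes(last_heartbeat, now, timeout_s=630):
    """Return DataNodes whose last heartbeat is older than the timeout.

    last_heartbeat: mapping of DataNode name -> timestamp of its last heartbeat.
    """
    return sorted(dn for dn, t in last_heartbeat.items() if now - t > timeout_s)

beats = {"dn1": 1000.0, "dn2": 400.0, "dn3": 995.0}
print(dead_datanodes(beats, now=1040.0))  # ['dn2'] - no heartbeat for 640 s
```

Once a node is declared dead, the NameNode schedules re-replication of its blocks on the surviving DataNodes.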
31. Anatomy of a File Write – High Level
A user wants to write data to Hadoop, e.g.: hdfs dfs -put 2016-apache-logs.txt /
The client "cuts" the input file into chunks of the block size.
The client then contacts the NameNode to request the write operation, sending the number of blocks and the replication factor.
The NameNode responds with a pipeline of DataNodes for each block's replicas.
The client reaches out to the first DataNode in each pipeline and performs the write.
No actual data transfer takes place through the NameNode.
The client writes the blocks in parallel: all the blocks are written at a time, not one by one.
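The NameNode's side of these steps can be sketched as a toy block-placement planner (hypothetical names; real HDFS also applies rack awareness and client proximity when choosing replicas):

```python
import itertools

def plan_write(file_mb, datanodes, block_mb=128, replication=3):
    """NameNode-side sketch: assign a pipeline of DataNodes to each block."""
    n_blocks = -(-file_mb // block_mb)  # ceiling division
    cycle = itertools.cycle(datanodes)
    plan = {}
    for i in range(n_blocks):
        # Pick `replication` nodes for this block's write pipeline
        plan[f"blk_{i:03d}"] = [next(cycle) for _ in range(replication)]
    return plan

nodes = [f"DN{i}" for i in range(1, 10)]
for blk, pipeline in plan_write(300, nodes).items():
    print(blk, "->", pipeline)
# blk_000 -> ['DN1', 'DN2', 'DN3'], then blk_001, blk_002 on the remaining nodes.
# The client streams each block to the first node in its pipeline, which
# forwards the data to the second, and so on down the chain.
```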
32. Anatomy of a File Write – Full Example
The Hadoop client writes 2016-apache-logs.txt with a replication factor of 3. With a 128 MB block size, the file is split into three blocks of 128 MB, 128 MB and 44 MB.
The client issues a -put request to the NameNode, which responds with a write pipeline for each block:
blk_000 to DN1, DN5, DN6
blk_001 to DN4, DN8, DN9
blk_002 to DN7, DN3, DN3
The cluster consists of nine DataNodes (DataNode1 to DataNode9) grouped into three racks of three.
website : www.npntraining.com
33. Anatomy of a File Read – Full Example
The Hadoop client reads 2016-apache-logs.txt (three blocks of 128 MB, 128 MB and 44 MB, replication factor 3).
The client issues a -get request to the NameNode, which responds with a location for each block:
blk_000 from DN1
blk_001 from DN4
blk_002 from DN7
The cluster consists of nine DataNodes (DataNode1 to DataNode9) grouped into three racks of three.
34. In HDFS, blocks of a file are written in parallel; however, replication of the blocks is done sequentially.
a) True
b) False
Hadoop is a framework that allows for the distributed processing of:
a) Small data sets
b) Large data sets
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
a) It can read up to the last block that was successfully written.
b) It can read up to the last bit successfully written.
c) It will throw an exception.
d) It cannot see that file until copying has finished.
36. What could be the limitations of Hadoop 1 / Gen 1?
Can a Hadoop 1.x cluster have multiple HDFS namespaces?
Which of the following are significant disadvantages of Hadoop 1.0?
a) Single point of failure on the NameNode
b) Too much burden on the JobTracker
Can you use anything other than MapReduce for processing in Hadoop 1.x?
37. Hadoop 1.x - Challenges
NameNode – No Horizontal Scalability
A single NameNode and a single namespace, limited by NameNode RAM.
NameNode – No High Availability (HA)
The NameNode is a Single Point of Failure; manual recovery using the Secondary NameNode is needed in case of failure.
JobTracker – Overburdened
Spends a significant portion of time and effort managing the life cycle of applications.
MRv1 – Only Map & Reduce tasks
Humongous data stored in HDFS remains unutilized and cannot be used for other workloads
such as graph processing.
38. A single NameNode runs and manages a single namespace, maintaining metadata in RAM.
100 slaves / 1000 slaves --> managed by a single NameNode
Max tested --> ~4000 servers --> single NameNode --> single namespace
Let's assume the /VOICE directory has too many files and folders; we can configure a separate
NameNode for this directory:
/VOICE/... NameNode01
/SMS/... NameNode02
/Data/... NameNode03
So based on the directory structure we can configure NameNodes. In Hadoop 2, 10,000 servers
can be configured, because each NameNode separately manages its own directory structure;
that is why we call it Federation.
Limitation 1 – No Horizontal Scalability
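A federated setup along these lines might be sketched in hdfs-site.xml as follows; the nameservice names and hostnames here are placeholders for illustration, not values from the slides.

```xml
<!-- hdfs-site.xml: two independent NameNodes, each owning its own namespace -->
<property>
  <name>dfs.nameservices</name>
  <value>ns-voice,ns-sms</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-voice</name>
  <value>namenode01.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.ns-sms</name>
  <value>namenode02.example.com:8020</value>
</property>
```

Since the NameNodes are independent, DataNodes register with all of them, but no NameNode needs to coordinate with another.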
40. How does HDFS Federation help HDFS scale horizontally?
It reduces the load on any single NameNode by using
multiple, independent NameNodes to manage individual
parts of the file system namespace.
You have configured two NameNodes to manage
/voice and /sms respectively. What will happen if you try to
put a file in the /lte directory?
The put will fail. No namespace will manage the file,
and you will get an IOException with a "no such file or
directory" error.
41. If you lose the NameNode you lose the cluster details. Manual intervention is needed to
start a new NameNode and copy the backup from the Secondary NameNode.
Problem
10:00 am --> backup to SNN
10:45 am --> NameNode breaks down --> you can recover data only up to 10:00 am from the SNN (problem in Gen 1)
Solution
==========
High Availability: Active and Standby NameNodes manage the same data at any given point of time.
--> In case the Active NameNode fails, the Standby NameNode becomes Active and serves requests.
Limitation 2 – No High Availability
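A minimal sketch of an HA configuration in hdfs-site.xml, assuming shared edits over NFS; the nameservice name, hostnames and the mount path are placeholders for illustration.

```xml
<!-- hdfs-site.xml: one logical nameservice backed by an Active and a Standby NameNode -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/filer1/hdfs-ha</value>
</property>
```

Because both NameNodes see the same shared edit log, the Standby can take over with an up-to-date view of the namespace.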
42. Hadoop 2.x Architecture - HA
https://hadoop.apache.org/docs/r2.5.2/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
43. HDFS HA was developed to overcome which of the following
disadvantages in Hadoop 1.0?
a) Single Point of Failure of the NameNode
b) Only one version can be run in classic MapReduce
c) Too much burden on the JobTracker
45. YARN – Yet Another Resource Negotiator
YARN is the core component of Hadoop 2 and was added to improve performance in Hadoop.
Hadoop 1.x: MapReduce handles both Cluster Resource Management & Data Processing, on top of HDFS (File Storage).
Hadoop 2.x: YARN handles Cluster Resource Management, with MapReduce (Data Processing) and others (Data Processing) running on top of it, over HDFS (File Storage).
It is the next-generation computing platform, which offers various advantages when compared to
classic MapReduce.
It is a layer that separates the resource management layer and the processing components layer.
MapReduce 2 moves resource management (the infrastructure to monitor nodes, allocate
resources and schedule jobs) into YARN.
48. YARN Components
YARN consists of 3 components:
1. ResourceManager
i. Scheduler
ii. ApplicationsManager
2. NodeManager
3. ApplicationMaster
49. YARN Architecture
[Slide diagram: a ResourceManager coordinating NodeManagers running on DN1–DN3; the Client submits a job, and an ApplicationMaster and Containers are launched on the nodes.]
1. The Client submits the job to the ResourceManager.
2. The RM contacts any of the NodeManagers.
3. That NodeManager creates a daemon named ApplicationMaster on the same node; there is one per job.
4. The AM communicates with the ResourceManager to find where the data is.
50. A single JobTracker managed thousands of jobs.
Problem
The JobTracker was overburdened.
Solution
==========
YARN, with multiple daemons: ResourceManager, NodeManager, ApplicationMaster (one per
application).
Container --> variable resources allocated per task (on the slave machine) --> CPU, memory, disk, network
1. ResourceManager --> entire cluster level
2. NodeManager --> per node/slave/machine/server
3. ApplicationMaster --> life cycle of a job (one ApplicationMaster per job)
Limitation 3 – JobTracker Overburdened
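The division of labour above can be mimicked with a toy model; the class and method names are purely illustrative and not the real YARN API.

```python
class NodeManager:
    """Per-node daemon: launches containers on its machine."""
    def __init__(self, node):
        self.node = node
    def launch_container(self, task):
        return f"{task} ran in a container on {self.node}"

class ApplicationMaster:
    """One per job: negotiates resources and drives the tasks."""
    def __init__(self, rm):
        self.rm = rm
    def run(self, tasks):
        results = []
        for task in tasks:
            nm = self.rm.allocate()          # ask the global RM for a node
            results.append(nm.launch_container(task))
        return results

class ResourceManager:
    """Cluster-level daemon: tracks NodeManagers and hands out nodes."""
    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.next_nm = 0
    def allocate(self):
        nm = self.node_managers[self.next_nm % len(self.node_managers)]
        self.next_nm += 1
        return nm
    def submit(self, tasks):
        # A per-job ApplicationMaster is created to manage the job's life cycle
        am = ApplicationMaster(self)
        return am.run(tasks)

rm = ResourceManager([NodeManager("DN1"), NodeManager("DN2"), NodeManager("DN3")])
for line in rm.submit(["map-0", "map-1", "reduce-0"]):
    print(line)
```

The point of the split is visible even in the toy: the ResourceManager only hands out resources, while each job's life cycle lives in its own ApplicationMaster.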
51. YARN (Yet Another Resource Negotiator) is a new component added in Hadoop 2.0
Hadoop 1.x: MapReduce (Cluster Resource Management & Data Processing) over HDFS (File Storage)
Hadoop 2.x: YARN (Cluster Resource Management), with MapReduce (Data Processing) and others (Data Processing) on top, over HDFS (File Storage)
Introduction to the new YARN layer in Hadoop 2.0
FB, G+, LinkedIn and Twitter generate huge volumes of data every day.
Facebook recently unveiled some statistics on the amount of data its system processes and stores. According to Facebook, its data system processes 2.5 billion pieces of content each day, amounting to 500+ terabytes of data daily. Facebook generates 2.7 billion Like actions per day, and 300 million new photos are uploaded daily.
Presently the data is stored in RDBMS, so why the problem of Big Data?
What is the limitation of RDBMS / why do I need RDBMS?
We go online and get a response immediately; that is the concept of a DBMS or OLTP application.
IBM’s Definition – Big Data Characteristics
Velocity : CDR (Call Detail Records), used to understand customer churn, i.e. a customer leaving the service provider.
The rate at which data is generated.
Variety : Images, MRI scans
A shared-nothing architecture (SN) is a distributed computing architecture in which each node is independent and self-sufficient, and there is no single point of contention across the system. More specifically, none of the nodes share memory or disk storage.
The advantage of a shared-nothing architecture is that it can scale easily, simply by adding another node.
Latency in transferring data
Processing coupled with data : in Hadoop we send the jobs towards the data.
There are programs which manage the Hadoop components; these programs are known as daemons.
Daemons take care of the components in Hadoop.
HDFS is a block-structured file system designed to store very large files, where each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or several commodity hardware nodes.
See : http://wiki.apache.org/hadoop/PoweredBy
Fault Tolerance : Hadoop will not fail even if one or more slaves fail.
Rack : a group of servers placed in a single place.
Hadoop writes one replica in one rack and another replica in a different rack; the administrator can even change this.
Hadoop also provides rack-level fault tolerance.
This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage
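The fsimage-plus-edit-log recovery described here can be sketched as a toy replay; `recover_namespace` and the sample paths are made up for illustration and bear no relation to the real NameNode implementation.

```python
def recover_namespace(fsimage, edit_log):
    """Load the fsimage checkpoint, then replay the edit log to reach
    the latest namespace state (a toy model of NameNode recovery)."""
    namespace = dict(fsimage)            # start from the checkpointed state
    for op, path in edit_log:            # replay every logged transaction
        if op == "create":
            namespace[path] = []
        elif op == "delete":
            namespace.pop(path, None)
    return namespace

fsimage = {"/voice/a.txt": [], "/sms/b.txt": []}
edits = [("create", "/voice/c.txt"), ("delete", "/sms/b.txt")]
print(sorted(recover_namespace(fsimage, edits)))  # ['/voice/a.txt', '/voice/c.txt']
```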
The network between the client and the cluster will be slower compared to the network within the cluster.
In HDFS, blocks of a file are written in parallel; however, replication of the blocks is done sequentially.
Answer : True. A file is divided into blocks; these blocks are written in parallel, but the block replication happens in sequence.
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a client tries to access that file?
Answer : (a) Can read up to the last block that was successfully written.
There are lots of self-standing software packages built on top of the Hadoop framework, each addressing a lot of problems.
Software built on top of the Hadoop framework is called the Hadoop Eco-System.
Flume is used to stream data from non-HDFS sources to HDFS,
e.g. Twitter.
Each NameNode need not coordinate with the others, which is why it is called Federated.
HDFS HA was developed to overcome the following disadvantage in Hadoop 1.0.
Answer : (a)
Let’s say a client submits a program; the program communicates with the JobTracker. In Hadoop terminology the program is considered a Job, and in the job we will have mentioned which data to process. The JobTracker communicates with the NameNode to get the DataNodes which have the data.
In a nutshell, the responsibilities of the JobTracker:
The JT accepts the job
Figures out where the data is
Invokes all the TTs and assigns them the job
Monitors all the tasks (TT crashes); it monitors the life cycle
The JT becomes overburdened because in production thousands of jobs are running; after a certain point the JT becomes slow.
In Hadoop 1.x, MapReduce is the only programming model to process the data stored in HDFS.
In MapReduce, work is divided into 2 phases:
Map phase
Reduce phase
Each Map task takes 1 GB of resources for processing.
In Hadoop 2.x, the processing is taken care of by YARN; the minimum memory allocation for a Map task is 1 GB.
http://sivansasidharan.me/blog/Hadoop_YARN/
Whenever a job is submitted, it communicates with the ResourceManager.
The ResourceManager will then contact any NodeManager (not necessarily the NodeManager which has the data) and say there is a job.
The NodeManager launches a daemon called ApplicationMaster on the same node.
There is one ApplicationMaster per job.
It is the responsibility of the ApplicationMaster to run the job.
The ApplicationMaster can contact the NodeManagers as well as the ResourceManager; by contacting the ResourceManager, the ApplicationMaster comes to know where the data is, and it contacts that node and launches something called a Container.
Containers are nothing but simple Java processes (JVMs), and the actual program gets executed inside the container.
The advantage of such an architecture is that if DN2 requires more resources for processing, the ApplicationMaster can contact the ResourceManager to allocate more resources; so the RM is a global entity which manages resources.
On the other hand, the entire life cycle of the application (creating, monitoring, etc.) is managed by the ApplicationMaster.
The NodeManager keeps track of the resources present on its DataNode and updates the ResourceManager.
In this architecture the resources are not managed by the DataNode; if any machine has more resources, the RM can communicate with its NodeManager, and the NM will create a Container, the data will be copied, and it will execute.