 Module 1
 Understanding Big Data
 Hadoop Architecture
 Module 2
 Hadoop Cluster Configuration
 Data loading Techniques
 Hadoop Project Environment
 Module 3
 Hadoop MapReduce framework
 Programming in Map Reduce
 Module 4
 Advanced MapReduce
 MRUnit testing framework
 Module 5
 Analytics using Pig
 Understanding Pig Latin
 Module 6
 Analytics using Hive
 Understanding HiveQL
 Module 7
 Advanced Hive
 NoSQL Databases and HBase
 Module 8
 Advanced HBase
 Zookeeper Service
 Module 9
 Hadoop 2.0 – New Features
 Programming in MRv2
 Module 10
 Apache Oozie
 Real world Datasets and Analysis
 Project Discussion
Course Topics
 What is Big Data?
 Limitations of the existing solutions
 Solving the problem with Hadoop
 Introduction to Hadoop
 Hadoop Eco-System
 Hadoop Core Components
 HDFS Architecture
 MapReduce Job execution
 Anatomy of a File Write and Read
 Hadoop 2.0 (YARN or MRv2) Architecture
Topics for Today
 Lots of Data (Terabytes or Petabytes)
 Big Data is the term for a collection of data sets so large and complex that it becomes difficult to
process them using on-hand database management tools or traditional data processing applications. The
challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
 Systems and enterprises generate huge amounts of data, ranging from terabytes to even petabytes of
information.
NYSE generates about one terabyte of new trade data per day, which is used to perform stock-trading
analytics to determine trends for optimal trades.
What Is Big Data?
Un-Structured Data is Exploding
 IBM’s Definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
The three Vs: Volume, Velocity, and Variety, illustrated with data such as web logs, images, videos, audio, and sensor data.
IBM’s Definition
Hello there! My name is Annie. I love quizzes and puzzles, and I am here to make you guys think and answer my questions.
Annie’s Introduction
Annie’s Question
Map the following to the corresponding data type:
- XML Files
- Word Docs, PDF files, Text files
- E-Mail body
- Data from Enterprise systems (ERP, CRM etc.)
XML Files -> Semi-structured data
Word Docs, PDF files, Text files -> Unstructured Data
E-Mail body -> Unstructured Data
Data from Enterprise systems (ERP, CRM etc.) -> Structured Data
Annie’s Answer
 More on Big Data
http://www.edureka.in/blog/the-hype-behind-big-data/
 Why Hadoop
http://www.edureka.in/blog/why-hadoop/
 Opportunities in Hadoop
http://www.edureka.in/blog/jobs-in-hadoop/
 Big Data
http://en.wikipedia.org/wiki/Big_Data
 IBM’s definition – Big Data Characteristics
http://www-01.ibm.com/software/data/bigdata/
Further Reading
 Web and e-tailing
 Recommendation Engines
 Ad Targeting
 Search Quality
 Abuse and Click Fraud Detection
 Telecommunications
 Customer Churn Prevention
 Network Performance Optimization
 Call Detail Record (CDR) Analysis
 Analyzing Networks to Predict Failure
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios
 Government
 Fraud Detection And Cyber Security
 Welfare schemes
 Justice
 Healthcare & Life Sciences
 Health information exchange
 Gene sequencing
 Serialization
 Healthcare service quality improvements
 Drug Safety
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
 Banks and Financial services
 Modeling True Risk
 Threat Analysis
 Fraud Detection
 Trade Surveillance
 Credit Scoring And Analysis
 Retail
 Point-of-sale Transaction Analysis
 Customer Churn Analysis
 Sentiment Analysis
http://wiki.apache.org/hadoop/PoweredBy
Common Big Data Customer Scenarios (Contd.)
 Insight into data can provide business advantage.
 Some key early indicators can mean fortunes to a business.
 More precise analysis is possible with more data.
Case Study: Sears Holding Corporation
*Sears was using traditional systems such as Oracle Exadata, Teradata, and SAS to store and process
customer activity and sales data.
Hidden Treasure
http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038?
In Sears' pre-Hadoop architecture, data flowed from instrumentation and collection into a storage-only grid holding the original raw data (mostly append), through an ETL compute grid, into an RDBMS holding only aggregated data for BI reports and interactive apps. About 90% of the ~2 PB of data was archived, leaving a meagre 10% available for BI. The limitations: (1) the original high-fidelity raw data could not be explored, (2) moving data to the compute tier did not scale, and (3) data died prematurely.
Limitations of Existing Data Analytics Architecture
*Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than
the meagre 10% available with the existing non-Hadoop solutions.
In the new architecture, data flows from instrumentation and collection (mostly append) into Hadoop, which acts as a combined storage and compute grid, with an RDBMS holding aggregated data for BI reports and interactive apps. There is no data archiving: the entire ~2 PB of data is available for processing. Benefits: (1) data exploration and advanced analytics, (2) scalable throughput for ETL and aggregation, and (3) data is kept alive forever.
Solution: A Combined Storage and Compute Layer
Hadoop's differentiating factors: Accessible, Robust, Simple, and Scalable.
Hadoop Differentiating Factors
RDBMS / EDW / MPP / NoSQL vs. Hadoop:
 Data Types: structured vs. multi-structured and unstructured
 Processing: limited or no data processing vs. processing coupled with data
 Governance: standards and structured vs. loosely structured
 Schema: required on write vs. required on read
 Speed: reads are fast vs. writes are fast
 Cost: software license vs. support only
 Resources: known entity vs. growing, complex, and wide
 Best Fit Use: interactive OLAP analytics, complex ACID transactions, operational data store vs. data discovery, processing unstructured data, massive storage/processing
Hadoop – It’s about Scale And Structure
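To make the "schema required on write vs. required on read" row concrete, here is a small illustrative Python sketch (not Hadoop code; the table, sample records, and field names are invented for the example): an RDBMS validates records against a schema as they are loaded, while a Hadoop-style store keeps the raw bytes and applies a schema only when the data is read.

    # Toy illustration (not Hadoop code): schema-on-write vs. schema-on-read.
    import csv, io, sqlite3

    raw_records = "101,alice,23.5\n102,bob,notanumber\n"

    # Schema-on-write (RDBMS style): the schema is enforced while loading,
    # so malformed rows must be rejected or cleaned up front.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (id INTEGER, name TEXT, amount REAL)")
    for row in csv.reader(io.StringIO(raw_records)):
        try:
            db.execute("INSERT INTO sales VALUES (?, ?, ?)",
                       (int(row[0]), row[1], float(row[2])))
        except ValueError:
            pass  # bad row rejected at write time

    # Schema-on-read (Hadoop style): raw bytes are stored as-is and a schema
    # is applied only when a job reads and interprets them.
    def parse(line):
        fields = line.split(",")
        try:
            return {"id": int(fields[0]), "name": fields[1], "amount": float(fields[2])}
        except ValueError:
            return None  # bad row handled at read time

    print([r for r in (parse(l) for l in raw_records.splitlines()) if r])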
Reading 1 TB of data:
 1 machine with 4 I/O channels, each channel at 100 MB/s: about 45 minutes
 10 machines with 4 I/O channels each, each channel at 100 MB/s: about 4.5 minutes
Why DFS?
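The 45-minute and 4.5-minute figures follow from simple channel arithmetic; a quick sketch of the calculation (the slide rounds the results up):

    # Back-of-the-envelope arithmetic behind the "Why DFS?" slide.
    TB = 1000**4            # 1 TB in bytes (decimal units)
    channel_mb_per_s = 100  # MB/s per I/O channel
    channels_per_machine = 4

    def read_minutes(machines):
        throughput = machines * channels_per_machine * channel_mb_per_s * 1000**2  # bytes/s
        return TB / throughput / 60

    print(round(read_minutes(1), 1))   # ~41.7 minutes (rounded to 45 on the slide)
    print(round(read_minutes(10), 1))  # ~4.2 minutes (rounded to 4.5 on the slide)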
 Apache Hadoop is a framework that allows for the distributed processing of large data sets
across clusters of commodity computers using a simple programming model.
 It is an open-source data management framework with scale-out storage and distributed processing.
What Is Hadoop?
Hadoop's key features: Reliable, Economical, Flexible, and Scalable.
Hadoop Key Characteristics
Hadoop is a framework that allows for the distributed processing of:
- Small Data Sets
- Large Data Sets
Annie’s Question
Annie’s Answer
Large data sets. Hadoop can also process small data sets, but to experience its true power you need
data in the terabyte range: that is where an RDBMS takes hours or fails outright, whereas Hadoop does
the same work in a couple of minutes.
The Hadoop eco-system: HDFS (Hadoop Distributed File System) for storage, with the MapReduce framework and HBase on top of it, and higher-level tools such as Pig (Pig Latin, data analysis), Hive (data-warehouse system), and Mahout (machine learning). Apache Oozie provides workflow management. Flume imports unstructured or semi-structured data, while Sqoop imports and exports structured data.
Hadoop Eco-System
LinkedIn Recommendations: Hadoop and MapReduce magic in action. You can write intelligent applications using Apache Mahout.
https://cwiki.apache.org/confluence/display/MAHOUT/Powered+By+Mahout
Machine Learning with Mahout
Hadoop is a system for large scale data processing.
It has two main components:
 HDFS – Hadoop Distributed File System (Storage)
 Distributed across “nodes”
 Natively redundant
 NameNode tracks locations.
 MapReduce (Processing)
 Splits a task across processors, "near" the data, and assembles the results
 Self-Healing, High Bandwidth
 Clustered storage
 JobTracker manages the TaskTrackers
Hadoop Core Components
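To give the MapReduce component a concrete shape, below is a minimal word-count mapper and reducer written for Hadoop Streaming, which pipes each input split through the mapper and feeds the sorted map output to the reducer. The file names mapper.py and reducer.py are our own choice for this example.

    #!/usr/bin/env python
    # mapper.py (hypothetical file name): emits "word<TAB>1" for every word in its input split.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py (hypothetical file name): map output arrives sorted by key,
    # so counts can be accumulated one word at a time.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))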
In a Hadoop cluster, the admin node runs the NameNode (for the HDFS cluster) and the JobTracker (the MapReduce engine), while each worker node runs a DataNode paired with a TaskTracker.
Hadoop Core Components (Contd.)
The NameNode holds the metadata (name, replica count, block list; e.g. /home/foo/data, 3, ...). Clients issue metadata ops to the NameNode, then read and write blocks directly to the DataNodes, which are spread across racks (Rack 1, Rack 2). The NameNode sends block ops to the DataNodes, and the DataNodes replicate blocks among themselves.
HDFS Architecture
 NameNode:
 master of the system
 maintains and manages the blocks which are present on the
DataNodes
 DataNodes:
 slaves which are deployed on each machine and provide the
actual storage
 responsible for serving read and write requests for the clients
Main Components Of HDFS
 Meta-data in Memory
 The entire metadata is in main memory
 No demand paging of FS meta-data
 Types of Metadata
 List of files
 List of Blocks for each file
 List of DataNodes for each block
 File attributes, e.g. access time, replication factor
 A Transaction Log
 Records file creations, file deletions, etc.
The NameNode stores metadata only: it keeps track of the overall file directory structure and the placement of data blocks, e.g.
/user/doug/hinfo -> 1 3 5
/user/doug/pdetail -> 4 2
NameNode Metadata
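As a purely illustrative toy model (not the NameNode's real data structures), the metadata kinds listed above can be pictured like this, reusing the slide's example mappings; the DataNode names are invented:

    # Illustrative only: a toy in-memory picture of the metadata kinds listed above,
    # using the slide's example file-to-block mappings.
    namenode_metadata = {
        "/user/doug/hinfo":   {"blocks": [1, 3, 5], "replication": 3},
        "/user/doug/pdetail": {"blocks": [4, 2],    "replication": 3},
    }
    # For each block, the NameNode also remembers which DataNodes hold a replica.
    block_locations = {1: ["datanode-a", "datanode-b", "datanode-c"]}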
 Secondary NameNode:
 Not a hot standby for the NameNode
 Connects to the NameNode every hour*
 Performs housekeeping and keeps a backup of the NameNode metadata
 The saved metadata can be used to rebuild a failed NameNode
The NameNode is a single point of failure; it hands its metadata to the Secondary NameNode every hour for safekeeping ("You give me metadata every hour, I will make it secure").
Secondary Name Node
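Conceptually, a checkpoint is just the old FSImage with the edit log replayed on top of it; a toy sketch of that idea (not the real implementation, and the edit operations shown are simplified):

    # Toy model of a checkpoint: the Secondary NameNode periodically fetches the
    # fsimage and edit log, replays the edits, and ships back a new fsimage.
    def apply_edit(namespace, edit):
        op, path = edit
        if op == "create":
            namespace[path] = {"blocks": []}
        elif op == "delete":
            namespace.pop(path, None)
        return namespace

    def checkpoint(fsimage, edits):
        merged = dict(fsimage)
        for edit in edits:
            apply_edit(merged, edit)
        return merged  # becomes the new fsimage; the edit log can then be truncated

    new_fsimage = checkpoint({"/a": {"blocks": [1]}}, [("create", "/b"), ("delete", "/a")])
    print(new_fsimage)  # {'/b': {'blocks': []}}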
The NameNode:
a) is the “Single Point of Failure” in a cluster
b) runs on ‘Enterprise-class’ hardware
c) stores meta-data
d) All of the above
Annie’s Question
Annie’s Answer
All of the above. The NameNode stores metadata and runs on reliable, high-quality hardware because it is
a single point of failure in a Hadoop cluster.
Annie’s Question
When the NameNode fails, the Secondary NameNode takes over instantly and prevents cluster failure:
a) TRUE
b) FALSE
Annie’s Answer
False. The Secondary NameNode is used for creating NameNode checkpoints. The NameNode can be manually
recovered using the 'edits' log and 'FSImage' stored on the Secondary NameNode.
Job submission flow: (1) the user copies the input files to DFS and (2) submits the job to the client; (3) the client gets the input files' info from DFS, (4) creates the input splits, (5) uploads the job information (Job.xml, Job.jar) to DFS, and (6) submits the job to the JobTracker.
JobTracker
(6) After the job is submitted, (7) the JobTracker initializes it in the job queue, (8) reads the job files from DFS, and (9) creates the map and reduce tasks: as many map tasks as there are input splits.
JobTracker (Contd.)
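Continuing the word-count sketch from the Core Components slide, such a job would typically be submitted through the Hadoop Streaming jar; the jar path and the HDFS input/output paths below are examples only and vary by installation:

    # Sketch only: submit mapper.py/reducer.py as a Hadoop Streaming job.
    # The streaming jar location and the HDFS paths are placeholders.
    import subprocess

    subprocess.check_call([
        "hadoop", "jar", "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar",
        "-input", "/user/demo/input",
        "-output", "/user/demo/output",
        "-mapper", "mapper.py",
        "-reducer", "reducer.py",
        "-file", "mapper.py",
        "-file", "reducer.py",
    ])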
(10) TaskTrackers on the worker hosts (H1, H2, H3, H4) send periodic heartbeats to the JobTracker; (11) the JobTracker picks tasks from the job queue, choosing data-local tasks if possible, and (12) assigns them to the TaskTrackers.
JobTracker (Contd.)
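A toy sketch of the "data local if possible" choice made on each heartbeat (purely illustrative; this is not Hadoop's scheduler code, and the host and task names are invented):

    # Illustrative only: prefer a task whose input split lives on the heartbeating host.
    def pick_task(heartbeat_host, pending_tasks):
        """pending_tasks: list of (task_id, hosts_holding_the_split)."""
        for task_id, hosts in pending_tasks:
            if heartbeat_host in hosts:          # data-local assignment
                return task_id
        return pending_tasks[0][0] if pending_tasks else None  # otherwise any task

    queue = [("map_0001", ["H3", "H4"]), ("map_0002", ["H1", "H2"])]
    print(pick_task("H2", queue))  # -> map_0002 (its split is local to H2)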
Which of the following daemons does the Hadoop framework pick for scheduling a task?
a) namenode
b) datanode
c) task tracker
d) job tracker
Annie’s Question
Annie’s Answer
The JobTracker takes care of all job scheduling and assigns tasks to the TaskTrackers.
1. The HDFS client calls create on the DistributedFileSystem.
2. The DistributedFileSystem asks the NameNode to create the file.
3. The client writes the data.
4. The data is written as packets along a pipeline of DataNodes.
5. Acknowledgement packets flow back along the pipeline.
6. The client closes the stream.
7. The DistributedFileSystem signals the NameNode that the write is complete.
Anatomy of A File Write
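From the client's point of view, all of those steps hide behind a single copy command; a minimal sketch using the standard hadoop fs CLI from Python (the paths are examples):

    # Minimal sketch: copy a local file into HDFS with the standard "hadoop fs" CLI.
    # The pipelined block write described above happens inside this one command.
    import subprocess

    subprocess.check_call(["hadoop", "fs", "-put", "localfile.txt", "/user/demo/infile.txt"])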
1. The HDFS client (in the client JVM on the client node) calls open on the DistributedFileSystem.
2. The DistributedFileSystem gets the block locations from the NameNode.
3. The client reads through an FSDataInputStream.
4. The stream reads the blocks from the closest DataNodes.
5. Reading continues from the DataNodes holding the next blocks.
6. The client closes the stream.
Anatomy of A File Read
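And the matching read path, again behind one command (same example path as above):

    # Minimal sketch: read the file back; HDFS streams the blocks from the DataNodes.
    import subprocess

    data = subprocess.check_output(["hadoop", "fs", "-cat", "/user/demo/infile.txt"])
    print(data.decode("utf-8", errors="replace"))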
Replication and Rack Awareness
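The replication factor that drives this block placement can be changed per file with hadoop fs -setrep; a minimal sketch (example path):

    # Minimal sketch: raise the replication factor of an existing file to 3 and
    # wait (-w) until the DataNodes finish re-replicating.
    import subprocess

    subprocess.check_call(["hadoop", "fs", "-setrep", "-w", "3", "/user/demo/infile.txt"])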
In HDFS, the blocks of a file are written in parallel; however, the replication of each block is done sequentially:
a) TRUE
b) FALSE
Annie’s Question
Annie’s Answer
True. A file is divided into blocks; the blocks are written in parallel, but the replication of each
block happens in sequence.
Annie’s Question
A file of 400 MB is being copied to HDFS. The system has finished copying 250 MB. What happens if a
client tries to access that file?
a) The client can read up to the last block that was successfully written.
b) The client can read up to the last bit successfully written.
c) It will throw an exception.
d) The client cannot see the file until copying has finished.
Annie’s Answer
The client can read up to the last successfully written data block, so the answer is (a).
 Apache Hadoop and HDFS
http://www.edureka.in/blog/introduction-to-apache-hadoop-hdfs/
 Apache Hadoop HDFS Architecture
http://www.edureka.in/blog/apache-hadoop-hdfs-architecture/
Further Reading
 Set up the Hadoop development environment using the documents present in the LMS.
 Hadoop Installation – Setup Cloudera CDH3 Demo VM
 Execute Linux Basic Commands
 Execute HDFS Hands On commands
 Attempt the Module-1 Assignments present in the LMS.
Module-2 Pre-work
What’s within the LMS
Callouts on the LMS screenshot: one section gives you an insight into the Big Data and Hadoop course; one gives some prerequisites on Java to understand the course better; old batch recordings are kept handy for you; and one gives an insight into the Hadoop installation.
What’s within the LMS
Click here to expand and view all the elements of this module.
What’s within the LMS
Recording of the Class
Presentation
Hands-on Guides
What’s within the LMS
Quiz
Assignment
Thank You
See You in Class Next Week