
Hadoop Administration with Latest Release (2.0)


The Hadoop Cluster Administration course at Edureka starts with the fundamental concepts of Apache Hadoop and the Hadoop cluster. It covers how to deploy, manage, monitor, and secure a Hadoop cluster. You will learn to configure backup options and to diagnose and recover from node failures. The course also covers HBase administration. It includes many challenging, practical, and focused hands-on exercises. Software professionals new to Hadoop can quickly learn cluster administration through technical sessions and hands-on labs. By the end of this six-week training, you will be prepared to understand and solve real-world problems that you may come across while working on a Hadoop cluster.

Published in: Education, Technology
Notes
  • NameNode: Single Point of Failure
  • Any production cluster larger than 20-30 nodes requires a full-time admin. This admin is responsible for the performance and availability of the cluster, the data it contains, and the jobs that run there. That covers deployment, upgrades, troubleshooting, configuration, tuning, job management, installing tools, architecting processes, monitoring, backups, recovery, and so on. Organizations running a production Hadoop cluster generally have a full-time admin. The fact that Cloudera offers a Hadoop Administrator certification and that O’Reilly publishes a book called “Hadoop Operations” demonstrates the importance of the Hadoop Cluster Administrator role in industry.

    1. www.edureka.in/hadoop-admin
    2. How It Works…
        • LIVE classes
        • Class recordings
        • Module-wise quizzes and practical assignments
        • 24x7 on-demand technical support
        • Deployment of different clusters
        • Online certification exam
        • Lifetime access to the Learning Management System
    3. Course Topics
        • Week 1: Understanding Big Data; Hadoop Components; Introduction to Hadoop 2.0
        • Week 2: Hadoop 2.0; Hadoop Configuration; Hadoop Cluster Architecture
        • Week 3: Different Hadoop Server Roles; Data Processing Flow; Cluster Network Configuration
        • Week 4: Job Scheduling; Fair Scheduler; Monitoring a Hadoop Cluster
        • Week 5: Securing Your Hadoop Cluster; Kerberos and HDFS Federation; Backup and Recovery
        • Week 6: Oozie and Hive Administration; HBase Architecture; HBase Administration
    4. Topics for Today
        • What is Big Data?
        • Limitations of the existing solutions
        • Solving the problem with Hadoop
        • Introduction to Hadoop
        • Hadoop Eco-System
        • Hadoop Core Components
        • MapReduce software framework
        • Hadoop Cluster Administrator: Roles and Responsibilities
        • Introduction to Hadoop 2.0
    5. What Is Big Data?
        • Lots of data (terabytes or petabytes).
        • Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information.
        • An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.
        • The NYSE generates about one terabyte of new trade data per day, which is used to perform stock trading analytics and determine trends for optimal trades.
    6. IBM’s Definition
        • IBM’s definition of Big Data characteristics: http://www-01.ibm.com/software/data/bigdata/
        • Volume: 12 terabytes of tweets created each day
        • Velocity: scrutinizing 5 million trade events created each day to identify potential fraud
        • Variety: sensor data, audio, video, click streams, log files, and more
    7. Data Volume Is Growing Exponentially
        • Estimated global data volume: 1.8 ZB in 2011; 7.9 ZB in 2015
        • The world's information doubles every two years
        • Over the next 10 years:
            • The number of servers worldwide will grow by 10x
            • The amount of information managed by enterprise data centers will grow by 50x
            • The number of “files” enterprise data centers handle will grow by 75x
        • Source: http://www.emc.com/leadership/programs/digital-universe.htm, which was based on the 2011 IDC Digital Universe Study
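As a quick arithmetic check on the figures in slide 7, the following Python sketch compares the two volume estimates with the "doubles every two years" rule of thumb. The function name `projected_volume` is introduced here purely for illustration; the only inputs taken from the slide are 1.8 ZB (2011), 7.9 ZB (2015), and the two-year doubling period.

```python
def projected_volume(start_zb: float, years: float, doubling_period: float = 2.0) -> float:
    """Volume after `years` if it doubles every `doubling_period` years."""
    return start_zb * 2 ** (years / doubling_period)


observed_2011 = 1.8   # ZB, from the slide
estimated_2015 = 7.9  # ZB, from the slide

# Four years is two doubling periods, so the rule predicts a 4x increase.
projected_2015 = projected_volume(observed_2011, 2015 - 2011)
growth_factor = estimated_2015 / observed_2011

print(f"doubling rule gives {projected_2015:.1f} ZB for 2015")
print(f"slide's estimate implies {growth_factor:.2f}x growth over four years")
```

The two figures are of the same order, which suggests the 2015 estimate assumes growth slightly faster than one doubling every two years.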
    8. What Big Companies Have To Say…
        • McKinsey: “Analyzing Big Data sets will become a key basis for competition.” “Leaders in every sector will have to grapple with the implications of Big Data.”
        • Gartner: “Big Data analytics are rapidly emerging as the preferred solution to business and technology trends that are disrupting.” “Enterprises should not delay implementation of Big Data Analytics.”
        • Forrester Research: “Use Hadoop to gain a competitive advantage over more risk-averse enterprises.” “Prioritize Big Data projects that might benefit from Hadoop.”
    9. Some Of the Hadoop Users
    10. Hadoop Users – In Detail
        • http://wiki.apache.org/hadoop/PoweredBy
    11. What Is Hadoop?
        • Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
        • It is an open-source data management system with scale-out storage and distributed processing.
    12. Hadoop Key Characteristics
        • Reliable
        • Flexible
        • Economical
        • Scalable
    13. Hadoop History (timeline, 2002 to 2009)
        • Doug Cutting & Mike Cafarella started working on Nutch
        • Google publishes GFS & MapReduce papers
        • Doug Cutting adds DFS & MapReduce support to Nutch
        • Yahoo! hires Cutting, Hadoop spins out of Nutch
        • NY Times converts 4 TB of image archives over 100 EC2s
        • Fastest sort of a TB, 3.5 mins over 910 nodes
        • Facebook launches Hive: SQL support for Hadoop
        • Cloudera founded
        • Doug Cutting joins Cloudera
        • Fastest sort of a TB, 62 secs over 1,460 nodes
        • Sorted a PB in 16.25 hours over 3,658 nodes
        • Hadoop Summit 2009, 750 attendees
    14. Hadoop 1.x Eco-System
        • Apache Oozie (Workflow)
        • Hive (DW System)
        • Pig Latin (Data Analysis)
        • Mahout (Machine Learning)
        • MapReduce Framework
        • HBase
        • HDFS (Hadoop Distributed File System)
        • Flume and Sqoop (import or export): Flume for unstructured or semi-structured data, Sqoop for structured data
    15. Hadoop 1.x Core Components
        Hadoop is a system for large-scale data processing. It has two main components:
        • HDFS: Hadoop Distributed File System (Storage)
            • Distributed across “nodes”
            • Natively redundant
            • NameNode tracks locations
        • Additional administration tools:
            • Filesystem utilities
            • Job scheduling and monitoring
            • Web UI
        • MapReduce (Processing)
            • Splits a task across processors “near” the data and assembles results
            • Self-healing, high-bandwidth clustered storage
            • JobTracker manages the TaskTrackers
    16. Hadoop 1.x Core Components (Contd.)
        • MapReduce Engine: Job Tracker and four Task Trackers
        • HDFS Cluster: Name Node and four Data Nodes
        • Admin Node
    17. Name Node and Data Nodes
        • NameNode:
            • master of the system
            • maintains and manages the blocks which are present on the DataNodes
        • DataNodes:
            • slaves which are deployed on each machine and provide the actual storage
            • responsible for serving read and write requests for the clients
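The division of labor in slide 17 can be sketched with a toy model in Python. This is an illustration under stated assumptions, not HDFS code: the class and method names (`NameNode.allocate`, `DataNode.write`, `DataNode.read`), the replication factor of 3, and the least-loaded placement rule are all hypothetical simplifications. The point it shows is that the master holds only metadata while the slaves hold the data.

```python
from collections import defaultdict


class DataNode:
    """Toy slave: stores the actual block bytes and serves reads and writes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy master: holds only metadata (which DataNodes store each block)."""

    def __init__(self, replication=3):
        self.replication = replication
        self.block_locations = defaultdict(list)  # block id -> DataNode names

    def allocate(self, block_id, datanodes):
        # Assumption: pick the least-loaded DataNodes, ties broken by name.
        ranked = sorted(datanodes, key=lambda d: (len(d.blocks), d.name))
        targets = ranked[: self.replication]
        self.block_locations[block_id] = [d.name for d in targets]
        return targets


datanodes = [DataNode(f"dn{i}") for i in range(1, 5)]
namenode = NameNode(replication=3)

for block_id, data in [("blk_1", b"first block"), ("blk_2", b"second block")]:
    for dn in namenode.allocate(block_id, datanodes):
        dn.write(block_id, data)

# A client asks the NameNode where a block lives, then reads from a DataNode.
holder_name = namenode.block_locations["blk_2"][0]
holder = next(d for d in datanodes if d.name == holder_name)
print(holder.name, holder.read("blk_2"))
```

Note that the block bytes never pass through the `NameNode` object; it only answers the question of where they are.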
    18. Secondary Name Node
        • NameNode: Single Point of Failure
        • Secondary NameNode:
            • Not a hot standby for the NameNode
            • Connects to the NameNode every hour*
            • Housekeeping, backup of NameNode metadata
            • Saved metadata can be used to rebuild a failed NameNode
        • Secondary NameNode to NameNode: “You give me metadata every hour, I will make it secure”
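The housekeeping role in slide 18 can be illustrated with a small Python sketch. It assumes the usual HDFS terms `fsimage` (namespace snapshot) and edit log, which the slide itself does not name, and the helper `apply_edits` and the two-operation log format are hypothetical. The sketch shows a periodic checkpoint being merged and later used to seed a replacement NameNode.

```python
def apply_edits(fsimage: dict, edits: list) -> dict:
    """Replay an edit log on top of a namespace snapshot; return the new snapshot."""
    image = dict(fsimage)
    for op, path in edits:
        if op == "create":
            image[path] = "file"
        elif op == "delete":
            image.pop(path, None)
    return image


# NameNode state: the last checkpointed snapshot plus the edits made since then.
fsimage = {"/logs/a.log": "file"}
edit_log = [
    ("create", "/logs/b.log"),
    ("delete", "/logs/a.log"),
    ("create", "/data/c.csv"),
]

# Periodic checkpoint: the Secondary NameNode merges the two and keeps a copy.
checkpoint = apply_edits(fsimage, edit_log)
edit_log = []  # the NameNode can start a fresh, short edit log

# After a NameNode failure, the saved checkpoint seeds the replacement.
recovered_namespace = dict(checkpoint)
print(sorted(recovered_namespace))
```

Any edits made after the last checkpoint would be missing from `recovered_namespace`, which is why the slide stresses that this is not a hot standby.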
    19. What Is MapReduce?
        • MapReduce is a programming model
        • It is neither platform- nor language-specific
        • Record-oriented data processing (key and value)
        • Task distributed across multiple nodes
        • Where possible, each node processes data stored on that node
        • Consists of two phases:
            • Map
            • Reduce
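The key/value model and the two phases from slide 19 can be sketched in plain Python with a word count. This is a single-process illustration, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample records are assumptions, and the grouping step in between stands in for what the framework does on the cluster.

```python
from collections import defaultdict


def map_phase(record: str):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


# Each inner list stands in for the records stored on one node.
node_splits = [["big data big"], ["data hadoop"]]

mapped = [
    pair
    for split in node_splits
    for record in split
    for pair in map_phase(record)
]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because `map_phase` looks at one record at a time, each split could be processed on the node that stores it, which is the data-locality point the slide makes.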
    20. What Is MapReduce? (Contd.)
        The process can be considered similar to a Unix pipeline:
        cat /my/log | grep '.html' | sort | uniq -c > /my/outfile
        • MAP: grep '.html'
        • SORT: sort
        • REDUCE: uniq -c
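To make the pipeline analogy in slide 20 concrete, here is a rough Python equivalent of its three stages. The sample log lines are invented for illustration, and the filter uses a plain substring test rather than grep's regular expression, so it is a simplification of the shell command rather than an exact port.

```python
from itertools import groupby

log_lines = [
    "GET /index.html 200",
    "GET /logo.png 200",
    "GET /about.html 200",
    "GET /index.html 200",
]

# MAP: keep only matching lines (the grep stage). A substring test is used
# here instead of grep's regex matching, which is a deliberate simplification.
mapped = [line for line in log_lines if ".html" in line]

# SORT: bring identical lines next to each other (the sort stage).
mapped.sort()

# REDUCE: count each run of identical lines (the uniq -c stage).
reduced = [(len(list(group)), line) for line, group in groupby(mapped)]

for count, line in reduced:
    print(count, line)
```

As with `uniq -c`, the counting step only works because the sort stage has already placed identical keys side by side.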
    21. Hadoop 1.x – In Summary
        • Client
        • HDFS: Name Node, Secondary Name Node, Data Nodes (data blocks)
        • MapReduce: Job Tracker, Task Trackers (each running Map and Reduce tasks)
    22. Poll Questions
    23. Hadoop Cluster Administrator: Roles and Responsibilities
        • Deploying the cluster
        • Performance and availability of the cluster
        • Job scheduling and management
        • Upgrades
        • Backup and recovery
        • Monitoring the cluster
        • Troubleshooting
    24. Hadoop 1.0 Vs. Hadoop 2.0
        • NameNodes: 1 in Hadoop 1.x; many in Hadoop 2.x
        • High Availability: not present in Hadoop 1.x; highly available in Hadoop 2.x
        • Processing Control: JobTracker and Task Tracker in Hadoop 1.x; Resource Manager, Node Manager, and App Master in Hadoop 2.x
    25. MRv1 Vs. MRv2
        • Hadoop 1.0
            • MapReduce (data processing), coordinated by the Job Tracker
            • HDFS (data storage)
            • Problems with resource utilization
            • Slots only for Map and Reduce
        • Hadoop 2.0
            • MapReduce (data processing) and others (data processing)
            • YARN (cluster resource management): Scheduler, Applications Manager (AsM)
            • HDFS (data storage)
            • Provides a cluster-level Resource Manager
            • Application-level resource management (handled by a per-application App Master)
            • Provides resources for jobs other than Map and Reduce
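The resource-utilization point in slide 25 can be illustrated with a toy scheduler comparison in Python. Both functions are hypothetical sketches: `mrv1_schedule` models typed map and reduce slots, and `yarn_schedule` models generic containers. Real YARN sizes containers by resources such as memory, which this sketch ignores.

```python
def mrv1_schedule(tasks, map_slots, reduce_slots):
    """MRv1-style: each slot is typed, so an idle reduce slot cannot run a map task."""
    free = {"map": map_slots, "reduce": reduce_slots}
    running, waiting = [], []
    for name, kind in tasks:
        if free.get(kind, 0) > 0:
            free[kind] -= 1
            running.append(name)
        else:
            waiting.append(name)
    return running, waiting


def yarn_schedule(tasks, containers):
    """YARN-style: containers are generic, so any kind of task can use a free one."""
    running = [name for name, _ in tasks[:containers]]
    waiting = [name for name, _ in tasks[containers:]]
    return running, waiting


# Three map tasks plus one task from a non-MapReduce framework ("graph").
tasks = [("m1", "map"), ("m2", "map"), ("m3", "map"), ("g1", "graph")]

print(mrv1_schedule(tasks, map_slots=2, reduce_slots=2))
print(yarn_schedule(tasks, containers=4))
```

With the same total capacity of four, the slot-based scheduler leaves two tasks waiting while two reduce slots sit idle, whereas the container-based one runs all four.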
    26. Hadoop 2.0 - Architecture
        • HDFS
            • Active NameNode: all namespace edits are logged to shared NFS storage; single writer (fencing)
            • Standby NameNode: reads the shared edit logs and applies them to its own namespace
            • Secondary Name Node
            • Data Nodes
        • YARN
            • Resource Manager
            • Node Managers, each hosting Containers and App Masters
        • Client
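The active/standby arrangement in slide 26 can be sketched as a toy model in Python. Everything named here (`SharedEditLog`, `HANameNode`, `fence`, `catch_up`) is a hypothetical illustration rather than Hadoop's API, and the in-memory list stands in for the shared NFS storage. It shows the two rules from the slide: only one writer may append to the shared edit log, and the standby keeps up by replaying that log into its own namespace.

```python
class SharedEditLog:
    """Toy shared edit log that accepts writes from one NameNode at a time."""

    def __init__(self):
        self.entries = []
        self.writer = None

    def fence(self, new_writer):
        # Fencing: revoke the previous writer before granting access to the new one.
        self.writer = new_writer

    def append(self, writer, edit):
        if writer != self.writer:
            raise PermissionError(f"{writer} is fenced off from the edit log")
        self.entries.append(edit)


class HANameNode:
    def __init__(self, name, log):
        self.name, self.log = name, log
        self.namespace = set()
        self.applied = 0  # number of log entries already applied locally

    def create(self, path):
        """Active role: log the edit first, then update the in-memory namespace."""
        self.log.append(self.name, ("create", path))
        self.catch_up()

    def catch_up(self):
        """Standby role: read new edits and apply them to its own namespace."""
        for op, path in self.log.entries[self.applied:]:
            if op == "create":
                self.namespace.add(path)
        self.applied = len(self.log.entries)


log = SharedEditLog()
nn1, nn2 = HANameNode("nn1", log), HANameNode("nn2", log)

log.fence("nn1")      # nn1 is active
nn1.create("/a")
nn2.catch_up()        # the standby tails the shared log

log.fence("nn2")      # failover: nn2 becomes active, nn1 is fenced
nn2.create("/b")
try:
    nn1.create("/c")  # the old active can no longer write
except PermissionError as err:
    print(err)
print(sorted(nn2.namespace))
```

Because the standby has already applied every logged edit, it can take over without rebuilding the namespace from a checkpoint, which is the contrast with the Secondary NameNode of Hadoop 1.x.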
    27. Assignments
        Attempt the following assignments using the documents present in the LMS:
        • Apache Hadoop 1.0 installation on Ubuntu in pseudo-distributed mode
        • Execute basic Linux commands
        • Execute HDFS hands-on
        • Cloudera CDH3 and CDH4 Quick VM installation on your local machine
    28. Thank You
        See you in class next week
