
Hadoop Administration with Latest Release (2.0)


The Hadoop Cluster Administration course at Edureka starts with the fundamental concepts of Apache Hadoop and the Hadoop Cluster. It covers topics on deploying, managing, monitoring, and securing a Hadoop Cluster. You will learn to configure backup options and to diagnose and recover from node failures in a Hadoop Cluster. The course also covers HBase Administration. There will be many challenging, practical, and focused hands-on exercises for the learners. Software professionals new to Hadoop can quickly learn cluster administration through technical sessions and hands-on labs. By the end of this six-week Hadoop Cluster Administration training, you will be prepared to understand and solve real-world problems that you may come across while working on a Hadoop Cluster.

Published in: Education, Technology


  1. 1. Hadoop Administration with Latest Release (2.0)
  2. 2. How It Works…  LIVE classes  Class recordings  Module wise Quizzes and Practical Assignments  24x7 on-demand technical support  Deployment of different clusters  Online certification exam  Lifetime access to the Learning Management System
  3. 3. Course Topics  Week 1 – Understanding Big Data, Hadoop Components, Introduction to Hadoop 2.0  Week 2 – Hadoop 2.0, Hadoop Configuration, Hadoop Cluster Architecture  Week 3 – Different Hadoop Server Roles, Data Processing Flow, Cluster Network Configuration  Week 4 – Job Scheduling, Fair Scheduler, Monitoring a Hadoop Cluster  Week 5 – Securing your Hadoop Cluster, Kerberos and HDFS Federation, Backup and Recovery  Week 6 – Oozie and Hive Administration, HBase Architecture, HBase Administration
  4. 4. Topics for Today  What is Big Data?  Limitations of the existing solutions  Solving the problem with Hadoop  Introduction to Hadoop  Hadoop Eco-System  Hadoop Core Components  MapReduce software framework  Hadoop Cluster Administrator: Roles and Responsibilities  Introduction to Hadoop 2.0
  5. 5. What Is Big Data?  Lots of data (Terabytes or Petabytes).  Systems and enterprises generate huge amounts of data, from Terabytes to even Petabytes of information.  An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time.  NYSE generates about one terabyte of new trade data per day, used to perform stock trading analytics to determine trends for optimal trades.
  6. 6. IBM’s Definition – Characteristics of Big Data  Volume: 12 Terabytes of tweets created each day  Velocity: 5 million trade events scrutinized each day to identify potential fraud  Variety: sensor data, audio, video, click streams, log files and more
  7. 7. Data Volume Is Growing Exponentially  Estimated Global Data Volume:  2011: 1.8 ZB  2015: 7.9 ZB  The world's information doubles every two years  Over the next 10 years:  The number of servers worldwide will grow by 10x  The amount of information managed by enterprise data centers will grow by 50x  The number of “files” enterprise data centers handle will grow by 75x  Source: 2011 IDC Digital Universe Study
  8. 8. What Big Companies Have To Say…  McKinsey: “Analyzing Big Data sets will become a key basis for competition.” “Leaders in every sector will have to grapple with the implications of Big Data.”  Gartner: “Big Data analytics are rapidly emerging as the preferred solution to business and technology trends that are disruptive.” “Enterprises should not delay implementation of Big Data Analytics.”  Forrester Research: “Use Hadoop to gain a competitive advantage over more risk-averse enterprises.” “Prioritize Big Data projects that might benefit from Hadoop.”
  9. 9. Some Of the Hadoop Users
  10. 10. Hadoop Users – In Detail
  11. 11. What Is Hadoop?  Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.  It is an open-source data management framework with scale-out storage and distributed processing.
  12. 12. Hadoop Key Characteristics Reliable Flexible Hadoop Features Economical Scalable
  13. 13. Hadoop History  2002: Doug Cutting & Mike Cafarella start working on Nutch  2003–2004: Google publishes the GFS & MapReduce papers  2005: Doug Cutting adds DFS & MapReduce support to Nutch  2006: Yahoo! hires Cutting; Hadoop spins out of Nutch  2007: NY Times converts 4TB of image archives over 100 EC2 instances  2008: Fastest sort of a TB, 3.5 minutes over 910 nodes; Cloudera founded; Facebook launches Hive (SQL support for Hadoop)  2009: Fastest sort of a TB, 62 seconds over 1,460 nodes; a PB sorted in 16.25 hours over 3,658 nodes; Doug Cutting joins Cloudera; Hadoop Summit 2009, 750 attendees
  14. 14. Hadoop 1.x Eco-System Apache Oozie (Workflow) Hive Pig Latin DW System Data Analysis Mahout Machine Learning MapReduce Framework HBase HDFS (Hadoop Distributed File System) Flume Sqoop Import Or Export Unstructured or Semi-Structured data Structured Data
  15. 15. Hadoop 1.x Core Components  Hadoop is a system for large-scale data processing. It has two main components:  HDFS – Hadoop Distributed File System (Storage)  Distributed across “nodes”  Natively redundant  Self-healing, high-bandwidth clustered storage  NameNode tracks block locations  MapReduce (Processing)  Splits a task across processors, “near” the data, and assembles results  JobTracker manages the TaskTrackers  Additional administration tools:  Filesystem utilities  Job scheduling and monitoring  Web UI
  16. 16. Hadoop 1.x Core Components (Contd.) MapReduce Engine Job Tracker Task Tracker Task Tracker Task Tracker Task Tracker HDFS Cluster Admin Node Name node Data Node Data Node Data Node Data Node
  17. 17. Name Node and Data Nodes  NameNode:  master of the system  maintains and manages the blocks which are present on the DataNodes  DataNodes:  slaves which are deployed on each machine and provide the actual storage  responsible for serving read and write requests for the clients
  18. 18. Secondary NameNode  The NameNode is a single point of failure  The Secondary NameNode is not a hot standby for the NameNode  It connects to the NameNode every hour*  It performs housekeeping and keeps a backup of the NameNode metadata  The saved metadata can be used to rebuild a failed NameNode
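The hourly checkpoint mentioned above is driven by a configurable interval. A minimal sketch for hdfs-site.xml in Hadoop 2.x (in 1.x the equivalent property is fs.checkpoint.period in core-site.xml); 3600 seconds is the default:

```xml
<property>
  <!-- Seconds between Secondary NameNode checkpoints: the "every hour" above -->
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value>
</property>
```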
  19. 19. What Is MapReduce?  MapReduce is a programming model  It is neither platform- nor language-specific  Record-oriented data processing (key and value pairs)  Tasks are distributed across multiple nodes  Where possible, each node processes data stored on that node  Consists of two phases:  Map  Reduce
  20. 20. What Is MapReduce? (Contd.)  The process can be considered similar to a Unix pipeline:  cat /my/log | grep '.html' | sort | uniq -c > /my/outfile  MAP → SORT → REDUCE
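The pipeline analogy above can be tried locally; a minimal sketch, using a hypothetical /tmp/access.log as the input (file names and contents are illustrative):

```shell
# Build a tiny sample "log", then run the slide's pipeline:
# map (grep filters records) -> sort (the "shuffle") -> reduce (uniq -c counts)
printf 'a.html\nb.css\na.html\nc.html\n' > /tmp/access.log
grep '\.html' /tmp/access.log | sort | uniq -c > /tmp/outfile
cat /tmp/outfile
```

Here a.html appears twice and c.html once, while b.css is filtered out by the "map" step, mirroring how MapReduce groups identical keys before reducing them.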
  21. 21. Hadoop 1.x – In Summary Client HDFS Name Node Data Node Data Node Map Reduce Secondary Name Node Job Tracker Task Tracker Map Reduce Task Tracker Map Reduce …. Data Blocks
  22. 22. Poll Questions
  23. 23. Hadoop Cluster Administrator Roles and Responsibilities  Deploying the cluster  Performance and availability of the cluster  Job scheduling and Management  Upgrades  Backup and Recovery  Monitoring the cluster  Troubleshooting
  24. 24. Hadoop 1.0 Vs. Hadoop 2.0  NameNodes: 1 in Hadoop 1.x → many in Hadoop 2.x  High Availability: not present in 1.x → highly available in 2.x  Processing Control: JobTracker and TaskTracker in 1.x → Resource Manager, Node Manager, and App Master in 2.x
  25. 25. MRv1 Vs. MRv2  Hadoop 1.0: MapReduce (data processing, via the JobTracker) runs directly on HDFS (data storage)  Problems with resource utilization  Slots only for Map and Reduce  Hadoop 2.0: MapReduce and other data-processing frameworks run on YARN (cluster resource management: Scheduler and Applications Manager) on top of HDFS (data storage)  Provides a cluster-level Resource Manager  Application-level resource management via a per-application App Master, with a Node Manager on each node  Provides slots for jobs other than Map and Reduce
  26. 26. Hadoop 2.0 - Architecture  HDFS HA: an Active NameNode and a Standby NameNode; all namespace edits are logged to shared NFS storage, with a single writer enforced by fencing; the Standby NameNode reads the shared edit logs and applies them to its own namespace  YARN: a cluster-wide Resource Manager; a Node Manager on each DataNode hosting Containers; a per-application App Master running in a container  Clients talk to the Active NameNode for HDFS and to the Resource Manager for YARN
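The shared-edit-log HA setup above is configured in hdfs-site.xml. A minimal sketch for NFS-based shared storage, where the nameservice id "mycluster", the NameNode ids "nn1"/"nn2", and the mount path are illustrative placeholders:

```xml
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <!-- NFS mount visible to both NameNodes; single writer enforced by fencing -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>file:///mnt/nfs/shared-edits</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
```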
  27. 27. Assignments  Attempt the following Assignments using the documents present in the LMS:  Apache Hadoop 1.0 Installation on Ubuntu in Pseudo-Distributed Mode  Execute Linux Basic Commands  Execute HDFS Hands On  Cloudera CDH3 and CDH4 Quick VM installation on your local machine
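For the Linux-basics and HDFS hands-on assignments, a minimal warm-up sketch; the hdfs dfs commands are shown as comments because they require a running cluster, and all paths and names are illustrative:

```shell
# Linux basics used throughout the labs (runnable on any machine):
mkdir -p /tmp/hadoop_lab
echo 'hello hadoop' > /tmp/hadoop_lab/sample.txt
ls -l /tmp/hadoop_lab
cat /tmp/hadoop_lab/sample.txt

# The HDFS hands-on mirrors these with 'hdfs dfs' subcommands,
# which need a running cluster (shown as comments only):
#   hdfs dfs -mkdir -p /user/$USER/lab
#   hdfs dfs -put /tmp/hadoop_lab/sample.txt /user/$USER/lab/
#   hdfs dfs -ls /user/$USER/lab
#   hdfs dfs -cat /user/$USER/lab/sample.txt
```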
  28. 28. Thank You See You in Class Next Week