
Hadoop Administration pdf



Published in: Education, Technology


  1. 1.
  2. 2. How It Works: LIVE classes; class recordings; module-wise quizzes and practical assignments; 24x7 on-demand technical support; deployment of different clusters; online certification exam; lifetime access to the Learning Management System.
  3. 3. Course Topics. Week 1: Understanding Big Data, Hadoop Components, Introduction to Hadoop 2.0. Week 2: Hadoop 2.0, Hadoop Configuration, Hadoop Cluster Architecture. Week 3: Different Hadoop Server Roles, Data Processing Flow, Cluster Network Configuration. Week 4: Job Scheduling, Fair Scheduler, Monitoring a Hadoop Cluster. Week 5: Securing Your Hadoop Cluster, Kerberos and HDFS Federation, Backup and Recovery. Week 6: Oozie and Hive Administration, HBase Architecture, HBase Administration.
  4. 4. Topics for Today: What is Big Data?; limitations of the existing solutions; solving the problem with Hadoop; introduction to Hadoop; Hadoop Eco-System; Hadoop core components; MapReduce software framework; Hadoop Cluster Administrator: roles and responsibilities; introduction to Hadoop 2.0.
  5. 5. What Is Big Data? Lots of data (terabytes or petabytes). Systems and enterprises generate huge amounts of data, from terabytes to even petabytes of information. An airline jet collects 10 terabytes of sensor data for every 30 minutes of flying time. The NYSE generates about one terabyte of new trade data per day, which is used to perform stock trading analytics to determine trends for optimal trades.
  6. 6. IBM’s Definition: Characteristics of Big Data. Volume: 12 terabytes of tweets created each day. Velocity: scrutinizes 5 million trade events created each day to identify potential fraud. Variety: sensor data, audio, video, click streams, log files and more.
  7. 7. Data Volume Is Growing Exponentially. Estimated global data volume: 1.8 ZB in 2011; 7.9 ZB in 2015. The world's information doubles every two years. Over the next 10 years: the number of servers worldwide will grow by 10x; the amount of information managed by enterprise data centers will grow by 50x; the number of “files” enterprise data centers handle will grow by 75x. Source: based on the 2011 IDC Digital Universe Study.
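As a quick arithmetic check on the slide above, the sketch below compares the two quoted volumes (1.8 ZB in 2011 and 7.9 ZB in 2015) with the "doubles every two years" rule of thumb. The variable names are illustrative only; the figures are the ones quoted on the slide.

```python
import math

# Figures quoted on the slide, in zettabytes.
volume_2011_zb = 1.8
volume_2015_zb = 7.9
years_elapsed = 2015 - 2011

# Projection if volume doubles exactly every two years.
projected_2015_zb = volume_2011_zb * 2 ** (years_elapsed / 2)

# Doubling time implied by the two quoted data points.
implied_doubling_years = (
    years_elapsed * math.log(2) / math.log(volume_2015_zb / volume_2011_zb)
)

print(f"Projected 2015 volume at 2-year doubling: {projected_2015_zb:.1f} ZB")
print(f"Doubling time implied by the quoted figures: {implied_doubling_years:.2f} years")
```

The quoted 2015 figure is somewhat above the two-year-doubling projection, so the two data points imply a doubling time slightly under two years.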
  8. 8. What Big Companies Have To Say… “Analyzing Big Data sets will become a key basis for competition.” “Leaders in every sector will have to grapple with the implications of Big Data.” McKinsey Gartner Forrester Research “Big Data analytics are rapidly emerging as the preferred solution to business and technology trends that are disrupting.” “Enterprises should not delay implementation of Big Data Analytics.” “Use Hadoop to gain a competitive advantage over more risk-averse enterprises.” “Prioritize Big Data projects that might benefit from Hadoop.”
  9. 9. Some Of the Hadoop Users
  10. 10. Hadoop Users – In Detail
  11. 11. What Is Hadoop? Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is an open-source data management framework with scale-out storage and distributed processing.
  12. 12. Hadoop Key Characteristics (Hadoop Features): Reliable; Economical; Scalable; Flexible.
  13. 13. Hadoop History: Doug Cutting & Mike Cafarella started working on Nutch; NY Times converts 4 TB of image archives over 100 EC2s; fastest sort of a TB, 62 secs over 1,460 nodes; sorted a PB in 16.25 hours over 3,658 nodes; fastest sort of a TB, 3.5 mins over 910 nodes; Doug Cutting adds DFS & MapReduce support to Nutch; Google publishes GFS & MapReduce papers; Yahoo! hires Cutting, Hadoop spins out of Nutch; Facebook launches Hive: SQL support for Hadoop; Doug Cutting joins Cloudera; Hadoop Summit 2009, 750 attendees. Timeline labels: Founded, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009.
  14. 14. Hadoop 1.x Eco-System: Apache Oozie (workflow); Pig Latin (data analysis); Mahout (machine learning); Hive (DW system); HBase; MapReduce framework; HDFS (Hadoop Distributed File System); Flume (unstructured or semi-structured data); Sqoop (import or export of structured data).
  15. 15. Hadoop 1.x Core Components. Hadoop is a system for large-scale data processing. It has two main components. HDFS, the Hadoop Distributed File System (storage): self-healing, high-bandwidth clustered storage; distributed across “nodes”; natively redundant; NameNode tracks locations. MapReduce (processing): splits a task across processors “near” the data and assembles results; JobTracker manages the TaskTrackers. Additional administration tools: filesystem utilities; job scheduling and monitoring; web UI.
  16. 16. Hadoop 1.x Core Components (Contd.) Admin node: NameNode (HDFS cluster) and JobTracker (MapReduce engine). Four worker nodes, each running a DataNode and a TaskTracker.
  17. 17. NameNode and DataNodes. NameNode: master of the system; maintains and manages the blocks which are present on the DataNodes. DataNodes: slaves which are deployed on each machine and provide the actual storage; responsible for serving read and write requests for the clients.
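The division of labour on the slide above (the NameNode tracks which blocks live where, the DataNodes hold the data) can be sketched with a toy model. This is only an illustration, not Hadoop code: `place_blocks` is a hypothetical helper, the 64 MB block size and replication factor of 3 are assumed Hadoop 1.x defaults, and the round-robin placement is a simplification (real HDFS placement is rack-aware).

```python
import itertools
import math

BLOCK_SIZE_MB = 64  # assumption: Hadoop 1.x default block size
REPLICATION = 3     # assumption: default HDFS replication factor


def place_blocks(file_size_mb, datanodes):
    """Return a NameNode-style map: block index -> DataNodes holding a replica."""
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = min(REPLICATION, len(datanodes))
    ring = itertools.cycle(datanodes)
    # Toy round-robin placement; consecutive ring entries are distinct nodes.
    return {
        block: [next(ring) for _ in range(replicas)]
        for block in range(num_blocks)
    }


# A 200 MB file is split into four blocks, each stored on three DataNodes.
block_map = place_blocks(200, ["dn1", "dn2", "dn3", "dn4"])
for block, holders in block_map.items():
    print(f"block {block}: {holders}")
```

In this model only `block_map` corresponds to NameNode metadata; the block contents themselves would sit on the DataNodes named in each list.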
  18. 18. Secondary NameNode: not a hot standby for the NameNode; connects to the NameNode every hour*; housekeeping and backup of NameNode metadata; saved metadata can be used to rebuild a failed NameNode. Diagram: the NameNode (a single point of failure) passes metadata to the Secondary NameNode, captioned “You give me metadata every hour, I will make it secure.”
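The "housekeeping" on the slide above is a periodic checkpoint: the Secondary NameNode merges the NameNode's edit log into a fresh copy of the namespace image. The sketch below is a toy model of that merge using plain dictionaries; `checkpoint`, the operation names, and the paths are hypothetical and do not reflect Hadoop's on-disk formats.

```python
def checkpoint(fsimage, edit_log):
    """Merge logged namespace edits into a copy of the namespace image."""
    merged = dict(fsimage)  # leave the original image untouched
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged


# Namespace image from the previous checkpoint: file path -> block IDs.
fsimage = {"/data/a.log": ["blk_1"]}

# Edits recorded by the NameNode since that checkpoint.
edit_log = [
    ("create", "/data/b.log", ["blk_2", "blk_3"]),
    ("delete", "/data/a.log", None),
]

new_fsimage = checkpoint(fsimage, edit_log)
edit_log.clear()  # the edits are now captured in the new image
print(new_fsimage)
```

The saved image is what makes it possible to rebuild a failed NameNode, although any edits made after the last checkpoint would not be in it.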
  19. 19. What Is MapReduce? MapReduce is a programming model; it is neither platform- nor language-specific; record-oriented data processing (key and value); tasks are distributed across multiple nodes; where possible, each node processes data stored on that node; consists of two phases: Map and Reduce.
  20. 20. What Is MapReduce? (Contd.) The process can be considered similar to a Unix pipeline: cat /my/log | grep '.html' | sort | uniq -c > /my/outfile, where grep is the MAP step, sort is the SORT step, and uniq -c is the REDUCE step.
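The pipeline analogy on the slide above can be sketched in plain Python, with one function per stage. This is a single-process illustration of the key and value flow, not Hadoop's API; the function names and sample log lines are made up, and the map step treats '.html' as a literal substring, which is a simplification of grep's regular-expression matching.

```python
from itertools import groupby


def map_phase(lines):
    """MAP (like grep): emit a (key, 1) pair for each line containing '.html'."""
    return [(line, 1) for line in lines if ".html" in line]


def sort_phase(pairs):
    """SORT (like sort): order the pairs by key so equal keys are adjacent."""
    return sorted(pairs, key=lambda pair: pair[0])


def reduce_phase(sorted_pairs):
    """REDUCE (like uniq -c): sum the values for each run of equal keys."""
    return [
        (key, sum(value for _, value in group))
        for key, group in groupby(sorted_pairs, key=lambda pair: pair[0])
    ]


log_lines = [
    "GET /index.html",
    "GET /logo.png",
    "GET /about.html",
    "GET /index.html",
]

counts = reduce_phase(sort_phase(map_phase(log_lines)))
for key, count in counts:
    print(count, key)
```

In a real cluster the map and reduce functions run on many nodes in parallel and the framework performs the sort between them; here everything runs in one process to keep the data flow visible.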
  21. 21. Hadoop 1.x – In Summary. A client works with HDFS (NameNode, Secondary NameNode, and DataNodes holding data blocks) and with MapReduce (JobTracker, and TaskTrackers running Map and Reduce tasks alongside the DataNodes) ….
  22. 22. Poll Questions
  23. 23. Hadoop Cluster Administrator: Roles and Responsibilities. Deploying the cluster; performance and availability of the cluster; job scheduling and management; upgrades; backup and recovery; monitoring the cluster; troubleshooting.
  24. 24. Hadoop 1.0 Vs. Hadoop 2.0 (Property: Hadoop 1.x / Hadoop 2.x). NameNodes: 1 / many. High availability: not present / highly available. Processing control: JobTracker, TaskTracker / Resource Manager, Node Manager, App Master.
  25. 25. MRv1 Vs. MRv2. Hadoop 1.0: HDFS (data storage) with MapReduce (data processing) on top, controlled by the JobTracker; problems with resource utilization; slots only for Map and Reduce. Hadoop 2.0: HDFS (data storage), YARN (cluster resource management) with a Scheduler and an Applications Manager (AsM), and MapReduce plus other frameworks (data processing) on top; provides a cluster-level resource manager; application-level resource management (Node Manager??); provides slots for jobs other than Map and Reduce.
  26. 26. Hadoop 2.0 - Architecture. Client. HDFS: Active NameNode and Standby NameNode with shared edit logs (all namespace edits are logged to shared NFS storage by a single writer, with fencing; the standby reads the edit logs and applies them to its own namespace); Secondary NameNode; four DataNodes. YARN: Resource Manager and four Node Managers, each hosting a Container and an App Master.
  27. 27. Assignments. Attempt the following assignments using the documents present in the LMS: single-node Apache Hadoop 1.0 installation on Ubuntu; execute basic Linux commands; execute the HDFS hands-on; Cloudera CDH3 and CDH4 Quick VM installation on your local machine.
  28. 28. Thank You. See You in Class Next Week.