Introduction to Hadoop Administration


  1. Introduction to Hadoop Administration. View Hadoop Administration course details at www.edureka.co/hadoop-admin
  2. How it Works?
     » LIVE online classes
     » Class recordings in the LMS
     » 24/7 post-class support
     » Module-wise quizzes
     » Project work on a large database
     » Verifiable certificate
  3. Objectives of this Session. At the end of this module, you will be able to:
     » Understand how Hadoop overcomes the limitations of traditional technologies
     » Understand the key responsibilities of a Hadoop Administrator
     » Understand Hadoop Federation and High Availability
     » Understand Hadoop cluster modes
     » Set up a Hadoop cluster
     » Commission and decommission a DataNode
  4. What is Big Data? Lots of data (terabytes or even petabytes).
     » Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
     » The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.
     » Systems and enterprises generate huge amounts of data, from terabytes up to petabytes of information.
     » Example: the stock market generates about one terabyte of new trade data per day, which is analysed to determine trends for optimal trades.
  5. IBM's Definition – Big Data Characteristics
     » The three Vs: Volume, Velocity, and Variety
     » Variety spans sources such as web logs, images, videos, audio, and sensor data
     » http://www-01.ibm.com/software/data/bigdata/
  6. Limitations of Existing Data Analytics Architecture
     » Typical pipeline: instrumentation and collection feed a storage-only grid that keeps the original raw data (mostly append); an ETL compute grid loads only aggregated data into an RDBMS, which serves BI reports and interactive apps.
     » 1. The original high-fidelity raw data cannot be explored.
     » 2. Moving data to the compute layer doesn't scale.
     » 3. Premature data death: roughly 90% of the ~2 PB is archived, so a meagre 10% of the data is available for BI.
     » http://www.informationweek.com/it-leadership/why-sears-is-going-all-in-on-hadoop/d/d-id/1107038
  7. Solution: A Combined Storage and Compute Layer
     » Hadoop provides both the storage and the compute grid, so the entire ~2 PB of data stays available for processing (mostly append) and still feeds the RDBMS, BI reports and interactive apps.
     » 1. Data exploration and advanced analytics
     » 2. Scalable throughput for ETL and aggregation
     » 3. No data archiving: data can be kept alive forever
     » *Sears moved to a 300-node Hadoop cluster to keep 100% of its data available for processing, rather than the meagre 10% available with its previous non-Hadoop solutions.
  8. Why Hadoop? How to solve the challenges posed by Big Data?
  9. Why Hadoop? The Hadoop platform is designed to solve problems posed by Big Data: both the size of the data and the variety of the data.
  10. What is Hadoop? Apache Hadoop is a framework that allows distributed processing of large data sets across clusters of commodity computers using a simple programming model. It is open-source data management software with scale-out storage and distributed processing. A minimal MapReduce example is sketched below.
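To make the "simple programming model" concrete, here is a minimal word-count job written against the standard Hadoop MapReduce Java API. It is a sketch rather than part of the original deck: the class names are arbitrary, and the input and output paths are taken from the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a JAR, a job like this would typically be launched with `hadoop jar wordcount.jar WordCount <input> <output>`; the framework handles splitting the input, scheduling the work near the data, and assembling the results.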
  11. Hadoop Key Characteristics: Reliable, Economical, Flexible, and Scalable.
  12. Some of the Hadoop Users
  13. Job Market
  14. Skills Required
     » General operational expertise, such as good troubleshooting skills and an understanding of capacity planning.
     » Hadoop ecosystem skills such as HBase, Hive, Pig, and Mahout; the ability to deploy a Hadoop cluster and to monitor and scale its critical parts.
     » Good knowledge of Linux, as Hadoop runs on Linux.
     » Familiarity with open-source configuration management and deployment tools such as Puppet or Chef, and with Linux scripting.
     » Knowledge of troubleshooting core Java applications is a plus.
  15. Hadoop Admin Responsibilities
     » Implementation and administration of the Hadoop infrastructure.
     » Testing HDFS, Hive, Pig, and MapReduce access for applications.
     » Cluster maintenance tasks such as backup, recovery, upgrades, and patching.
     » Performance tuning and capacity planning for clusters.
     » Monitoring the Hadoop cluster and deploying security.
  16. Hadoop 1.x and Hadoop 2.x Ecosystem
     » Hadoop 1.x: HDFS (Hadoop Distributed File System) at the bottom; the MapReduce framework handles both processing and cluster resource management; Pig Latin (data analysis), Hive (DW system) and HBase sit on top; Apache Oozie provides workflow.
     » Hadoop 2.x: HDFS at the bottom; YARN handles cluster resource management; the MapReduce framework, other YARN frameworks (MPI, Giraph) and HBase run on YARN; Pig Latin (data analysis) and Hive (DW system) sit on top; Apache Oozie provides workflow.
     » Pig and Hive target unstructured/semi-structured data, while HBase targets structured data.
  17. Hadoop 1.x Core Components. Hadoop is a system for large-scale data processing, with two main core components:
     » Storage – HDFS: self-healing, high-bandwidth clustered storage, distributed across "nodes" and natively redundant; the NameNode tracks block locations.
     » Processing – MapReduce: splits a task across processors "near" the data and assembles the results; the JobTracker manages the TaskTrackers.
     » Additional administration tools: filesystem utilities, job scheduling and monitoring, and a web UI.
  18. Hadoop 2.x Core Components. Hadoop is a system for large-scale data processing, with two main core components:
     » Storage – HDFS: highly available, clustered storage distributed across "nodes"; the NameNode tracks block locations.
     » Processing – MapReduce NextGen / YARN / MRv2: resource management and job scheduling/monitoring; splits a task across processors "near" the data and assembles the results; individual applications can utilise cluster resources in a shared, secure and multi-tenant manner; maintains API compatibility with previous stable releases of Hadoop. (A small client-side sketch follows.)
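As a small illustration of YARN as the resource-management layer, the following hedged sketch uses the YARN client API to list the NodeManagers currently registered with the ResourceManager. It assumes a reachable Hadoop 2.x cluster whose yarn-site.xml is on the classpath; nothing in it comes from the original slides.

```java
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnNodes {
  public static void main(String[] args) throws Exception {
    // YarnConfiguration loads yarn-site.xml (plus core-site.xml) from the classpath.
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // Ask the ResourceManager for the NodeManagers it currently manages.
    List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId()
          + " containers=" + node.getNumContainers()
          + " capability=" + node.getCapability());   // memory and vcores offered by the node
    }
    yarn.stop();
  }
}
```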
  19. Main Components of HDFS
     » NameNode: the master of the system; maintains and manages the blocks present on the DataNodes.
     » DataNodes: slaves deployed on each machine that provide the actual storage; responsible for serving read and write requests from clients.
  20. NameNode Metadata (the NameNode stores metadata only)
     » Metadata in memory: the entire metadata is held in main memory; there is no demand paging of filesystem metadata.
     » Types of metadata: the list of files, the list of blocks for each file, the list of DataNodes for each block, and file attributes such as access time and replication factor.
     » A transaction log records file creations, file deletions, etc.
     » Example metadata: /user/doug/hinfo -> blocks 1, 3, 5; /user/doug/pdetail -> blocks 4, 2.
     » The NameNode keeps track of the overall file directory structure and the placement of data blocks; a client-side view of this metadata is sketched below.
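The metadata described above can be observed from a client. The sketch below reuses the slide's /user/doug/hinfo path as a placeholder and asks the NameNode for a file's attributes and for the DataNodes holding each of its blocks; it assumes a working HDFS configuration on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowMetadata {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();    // reads core-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);        // talks to the NameNode
    Path file = new Path("/user/doug/hinfo");    // placeholder path from the slide

    // File attributes kept by the NameNode: size, replication factor, block size, times.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("length=" + status.getLen()
        + " replication=" + status.getReplication()
        + " blockSize=" + status.getBlockSize()
        + " modified=" + status.getModificationTime());

    // Block-to-DataNode mapping: which DataNodes hold each block of the file.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("offset=" + block.getOffset()
          + " hosts=" + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```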
  21. Secondary NameNode
     » Not a hot standby for the NameNode.
     » Connects to the NameNode every hour* and performs housekeeping, keeping a backup copy of the NameNode metadata.
     » The saved metadata can be used to rebuild a failed NameNode.
     » The NameNode is a single point of failure only in Hadoop 1.x, not in Hadoop 2.x.
     » *The interval is configurable, as shown in the sketch below.
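As a hedged aside, the "every hour" figure is simply the default checkpoint period. The property names below are the standard Hadoop 1.x and 2.x names; the snippet just prints whatever your hdfs-site.xml sets, falling back to the 3600-second default.

```java
import org.apache.hadoop.conf.Configuration;

public class CheckpointInterval {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("hdfs-site.xml");   // make sure the HDFS settings are loaded

    // Hadoop 1.x property name (default 3600 seconds).
    System.out.println("fs.checkpoint.period = "
        + conf.getLong("fs.checkpoint.period", 3600));

    // Hadoop 2.x property name (default 3600 seconds).
    System.out.println("dfs.namenode.checkpoint.period = "
        + conf.getLong("dfs.namenode.checkpoint.period", 3600));
  }
}
```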
  22. Hadoop 2.x – In Summary
     » Clients talk to HDFS (distributed data storage) and YARN (distributed data processing).
     » NameNode High Availability: on the master side, an Active NameNode and a Standby NameNode exchange namespace changes through shared edit logs or JournalNodes; a Secondary NameNode is shown for comparison.
     » Next-Generation MapReduce: a YARN ResourceManager (Scheduler plus Applications Manager, AsM) on the master side.
     » Slaves: each slave machine runs a DataNode and a NodeManager hosting Containers and per-application App Masters.
  23. Hadoop 2.x Cluster Architecture – Federation
     » Hadoop 1.0: a single NameNode handles both the namespace and block management on top of the cluster's storage.
     » Hadoop 2.0: HDFS Federation splits the namespace across multiple NameNodes (NN-1 … NN-k … NN-n), each managing its own namespace (NS1 … NSk … NSn) and its own block pool (Pool 1 … Pool k … Pool n).
     » The DataNodes (Datanode 1 … Datanode m) form common storage and hold blocks from all block pools.
     » http://hadoop.apache.org/docs/stable2/hadoop-project-dist/hadoop-hdfs/Federation.html
  24. Hadoop 2.x – High Availability
     » HDFS high availability: an Active NameNode and a Standby NameNode; all namespace edits are logged to shared NFS storage, with a single writer enforced by fencing; the Standby reads the edit logs and applies them to its own namespace.
     » *With an HA pair it is not necessary to configure a Secondary NameNode.
     » Next-Generation MapReduce: the YARN ResourceManager works with NodeManagers running Containers and per-application App Masters alongside the DataNodes; clients talk to HDFS and YARN through the masters.
     » http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
  25. Hadoop 2.x – Resource Management
     » The same cluster picture as the previous slide, with the focus on YARN: the ResourceManager performs resource management and job scheduling/monitoring, while each NodeManager hosts Containers and a per-application App Master that negotiates resources for its job. (The HA settings behind both slides are sketched below.)
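The HA setup on the two previous slides boils down to a handful of HDFS settings. The sketch below uses the standard property keys from the HDFS HA documentation; the nameservice ID, NameNode IDs, hostnames and the NFS path are placeholders, and in a real cluster these values live in hdfs-site.xml and core-site.xml rather than in client code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // One logical nameservice instead of a single NameNode host.
    conf.set("fs.defaultFS", "hdfs://mycluster");
    conf.set("dfs.nameservices", "mycluster");

    // Two NameNodes (active + standby) behind that nameservice.
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

    // NameNode-side setting: shared edit log location (NFS variant shown on the slide);
    // Quorum Journal Manager deployments use a qjournal:// URI here instead.
    conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/nfs/hadoop/ha-edits");

    // Tells the client how to find the currently active NameNode.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // The client only ever sees the logical URI; failover is transparent to it.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Connected, root exists: " + fs.exists(new Path("/")));
    fs.close();
  }
}
```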
  26. Hadoop Cluster: A Typical Use Case
     » NameNode: 64 GB RAM, 1 TB hard disk, 8-core Xeon processor, 3 x 10 Gb/s Ethernet, 64-bit CentOS, redundant power supply.
     » Secondary NameNode: 32 GB RAM, 1 TB hard disk, 4-core Xeon processor, 3 x 10 Gb/s Ethernet, 64-bit CentOS, redundant power supply.
     » DataNodes (each): 16 GB RAM, 6 x 2 TB hard disks, 2-core Xeon processor, 3 x 10 Gb/s Ethernet, 64-bit CentOS.
  27. Hadoop 1.x Configuration Files – Apache Hadoop
     » core-site.xml – Core
     » hdfs-site.xml – HDFS
     » mapred-site.xml – MapReduce
  28. Hadoop 2.x Configuration Files – Apache Hadoop
     » core-site.xml – Core
     » hdfs-site.xml – HDFS
     » mapred-site.xml – MapReduce
     » yarn-site.xml – YARN
     (A sketch of how a client picks up these files follows.)
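A hedged sketch of how these files are consumed: a plain Configuration object picks up core-site.xml from the classpath, and the remaining files are added explicitly here so the example does not depend on which Hadoop classes happen to be loaded. The property names are standard (fs.defaultFS is the Hadoop 2.x name for what Hadoop 1.x calls fs.default.name); the printed values depend entirely on your cluster.

```java
import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();   // loads core-default.xml + core-site.xml
    conf.addResource("hdfs-site.xml");
    conf.addResource("mapred-site.xml");
    conf.addResource("yarn-site.xml");          // Hadoop 2.x only

    System.out.println("core-site.xml    fs.defaultFS                  = " + conf.get("fs.defaultFS"));
    System.out.println("hdfs-site.xml    dfs.replication               = " + conf.get("dfs.replication"));
    System.out.println("mapred-site.xml  mapreduce.framework.name      = " + conf.get("mapreduce.framework.name"));
    System.out.println("yarn-site.xml    yarn.resourcemanager.hostname = " + conf.get("yarn.resourcemanager.hostname"));
  }
}
```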
  29. Replication and Rack Awareness
  30. Hadoop Cluster Modes. Hadoop can run in any of the following three modes:
     » Standalone (or local) mode: no daemons, everything runs in a single JVM; has no DFS; suitable for running MapReduce programs during development.
     » Pseudo-distributed mode: all Hadoop daemons run on the local machine.
     » Fully-distributed mode: the Hadoop daemons run on a cluster of machines.
     The key configuration differences between the modes are sketched below.
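In configuration terms, the three modes differ mainly in where the filesystem and the MapReduce framework live. The sketch below is illustrative only: the hostnames and ports are placeholders, and in practice these values would sit in the XML configuration files rather than be set in code.

```java
import org.apache.hadoop.conf.Configuration;

public class ClusterModes {
  // Standalone (local) mode: no daemons, no DFS, everything in one JVM.
  static Configuration standalone() {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");              // local filesystem, no HDFS
    conf.set("mapreduce.framework.name", "local");     // run MapReduce inside the client JVM
    return conf;
  }

  // Pseudo-distributed mode: all daemons on one machine, talking over localhost.
  static Configuration pseudoDistributed() {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000");
    conf.set("mapreduce.framework.name", "yarn");
    conf.set("yarn.resourcemanager.hostname", "localhost");
    conf.setInt("dfs.replication", 1);                 // only one DataNode is available
    return conf;
  }

  // Fully-distributed mode: daemons spread across a cluster of machines.
  static Configuration fullyDistributed() {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    conf.set("mapreduce.framework.name", "yarn");
    conf.set("yarn.resourcemanager.hostname", "resourcemanager.example.com");
    conf.setInt("dfs.replication", 3);                 // usual default replication factor
    return conf;
  }

  public static void main(String[] args) {
    System.out.println("standalone fs: " + standalone().get("fs.defaultFS"));
    System.out.println("pseudo     fs: " + pseudoDistributed().get("fs.defaultFS"));
    System.out.println("full       fs: " + fullyDistributed().get("fs.defaultFS"));
  }
}
```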
  31. DEMO: Hadoop Cluster Setup
  32. DEMO: Hadoop Rack Awareness
  33. DEMO: Secondary NameNode
  34. Commissioning and Decommissioning of DataNodes. Commissioning adds new DataNodes to the cluster under the master node; decommissioning gracefully retires existing DataNodes from it. A sketch of the decommissioning steps follows.
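A hedged sketch of the decommissioning half of the process, assuming dfs.hosts.exclude in hdfs-site.xml points at the excludes file used below (the file path and hostname are placeholders): add the DataNode to the excludes file, then ask the NameNode to re-read its host lists, which is what `hdfs dfsadmin -refreshNodes` does on the command line. Commissioning is the mirror image: add the new node to the includes/slaves file, refresh, and start the DataNode daemon on it.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.tools.DFSAdmin;
import org.apache.hadoop.util.ToolRunner;

public class DecommissionNode {
  public static void main(String[] args) throws Exception {
    // 1. Add the DataNode to the excludes file named by dfs.hosts.exclude in hdfs-site.xml.
    Files.write(Paths.get("/etc/hadoop/conf/dfs.exclude"),          // placeholder path
        "datanode7.example.com\n".getBytes(StandardCharsets.UTF_8), // placeholder host
        StandardOpenOption.CREATE, StandardOpenOption.APPEND);

    // 2. Tell the NameNode to re-read dfs.hosts / dfs.hosts.exclude; decommissioning starts,
    //    and the node's blocks are re-replicated before it is marked Decommissioned.
    int rc = ToolRunner.run(new Configuration(), new DFSAdmin(), new String[] {"-refreshNodes"});
    System.exit(rc);
  }
}
```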
  35. Questions
  36. Course Topics
     » Module 1: Understanding Big Data; Hadoop Components
     » Module 2: Different Hadoop Server Roles; Hadoop Cluster Configuration
     » Module 3: Hadoop Cluster Planning; Job Scheduling
     » Module 4: Securing your Hadoop Cluster; Backup and Recovery
     » Module 5: Hadoop 2.0 New Features; HDFS High Availability
     » Module 6: Quorum Journal Manager (QJM); Hadoop 2.0 - YARN
     » Module 7: Oozie Workflow Scheduler; Hive and HBase Administration
     » Module 8: Hadoop Cluster Case Study; Hadoop Implementation
