This document outlines the objectives, content, and structure of a 28-30 hour training course on Hadoop and its ecosystem. The course will provide both theoretical and hands-on instruction on topics such as Hadoop architecture and components, HDFS, MapReduce, YARN, Hive, Pig, Sqoop, HBase, and Oozie. Participants will learn how to install Hadoop clusters, develop MapReduce programs, integrate Hadoop components, and apply best practices for Hadoop development, administration, and management. The goal is for attendees to gain the skills needed to architect Hadoop projects and leverage its ecosystem for data analysis and storage.
Hadoop Architect Training - Big Data, MapReduce, HBase
Big Data Architect - Hadoop & its Eco-system Training
Objective
The participants will learn to install a Hadoop cluster, understand the basic and
advanced concepts of MapReduce, and apply the best practices for Apache Hadoop
development as experienced by the developers, architects, and data analysts of
core Apache Hadoop. They will also learn the following during the course:
1. Hadoop Ecosystem
2. Best programming practices for MapReduce
3. System administration issues with other Hadoop projects such as Hive,
Pig, and Sqoop
4. Configuring the MapReduce environment with the Eclipse IDE
5. Running MRUnit tests on MapReduce code
6. Advanced MapReduce algorithms and techniques
7. Working with Pig and Hive
8. Working with NoSQL, with emphasis on HBase
Note: The course will have 40% theoretical discussion and 60% hands-on
practice.
Duration:
28 ~ 30 hours
Audience
This course is designed for anyone who is:
1. Wanting to architect a project using Hadoop and its ecosystem
components
2. Wanting to develop MapReduce programs
3. A business analyst or data warehousing professional looking for an
alternative approach to data analysis and storage
Pre-Requisites
1. The participants should have at least basic knowledge of Java.
2. Any experience of Linux environment will be very helpful.
Course Outline
1. What is Big Data & Why Hadoop?
• Big Data characteristics, challenges with traditional systems
2. Hadoop Overview & Ecosystem
• Anatomy of Hadoop Cluster, Installing and Configuring Hadoop
• Hands-On Exercise
3. Hadoop Architecture
• Components in Hadoop
• Interaction between different Components
• Basic Understanding of each component
4. HDFS – Hadoop Distributed File System
• NameNodes and DataNodes
• Hands-On Exercise
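The NameNode/DataNode split can be sketched in miniature: a hypothetical "NameNode-style" planner divides a file into fixed-size blocks and spreads each block's replicas across DataNodes. This is a toy round-robin policy for illustration only, not HDFS's actual rack-aware placement, and all names here are made up.

```python
# Illustrative sketch (NOT the HDFS API): splitting a file into blocks
# and assigning replicas of each block to several DataNodes.

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB
REPLICATION = 3                 # HDFS default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the (offset, length) byte range of each block."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement of each block's replicas across DataNodes."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [datanodes[(i + r) % len(datanodes)]
                            for r in range(min(replication, len(datanodes)))]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                              # 3 blocks: 128 + 128 + 44 MB
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(placement[blocks[0]])                     # 3 DataNodes holding block 0
```

The point of the sketch: clients never store whole files on one machine; the NameNode only tracks which DataNodes hold which blocks, and losing one DataNode still leaves two replicas of every block.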
5. MapReduce Anatomy
• How MapReduce works
• The Mapper & Reducer, InputFormats & OutputFormats, data types
• Hands-On Exercise
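The map → shuffle → reduce flow can be illustrated with a small pure-Python word count. This mirrors the programming model only; real Hadoop jobs subclass Mapper and Reducer in Java and run on the cluster.

```python
# Pure-Python sketch of the MapReduce model: map -> shuffle/sort -> reduce.
from collections import defaultdict

def mapper(line):
    # Like a WordCount Mapper: emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def shuffle(mapped_pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Like a WordCount Reducer: sum all the counts emitted for one key.
    return key, sum(values)

lines = ["Hadoop is a framework", "Hadoop runs MapReduce"]
mapped = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(result["hadoop"])  # 2
```

Each phase is independent, which is exactly what lets Hadoop run many mappers and reducers in parallel across the cluster.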
6. Understanding Cloudera Distribution
• What is CDH?
• Components in CDH
• Hands on Exercise
7. Understanding HortonWorks Distribution
• What is HDP?
• Components in HDP
• Hands on Exercise
8. Pseudo-Distributed Cluster with Vanilla Hadoop
• Hadoop Extraction and Installation
• Configuration / XML Files
• Hands on Exercise
9. YARN
• Need for YARN
• Architecture of YARN
10. Installation of YARN in Ubuntu
• Configuration Settings
• Difference between Gen1 and Gen2
11. Developing MapReduce Programs
• Setting up the Eclipse development environment, creating MapReduce
projects, debugging MapReduce code
• Hands-On Exercise
13. Advanced Tips & Techniques
• Determining the optimal number of reducers, skipping bad records
• Partitioning into multiple output files & passing parameters to tasks
• Optimizing the Hadoop cluster & performance tuning
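Partitioning into multiple output files hinges on the partitioner, which decides which reducer (and therefore which output file) receives each key. The sketch below shows the default hash-partitioning idea, hash(key) mod number-of-reducers, using Python's zlib.crc32 as a stand-in for Java's hashCode.

```python
# Sketch of hash partitioning, analogous in spirit to Hadoop's default
# HashPartitioner. zlib.crc32 is used here only as a stable hash function.
import zlib

def partition(key, num_reducers):
    """Map a key to a reducer index in [0, num_reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

num_reducers = 4
keys = ["alpha", "beta", "gamma", "delta"]
assignment = {k: partition(k, num_reducers) for k in keys}
# Every key lands on exactly one reducer, and the same key always goes
# to the same reducer -- the property the shuffle relies on.
print(assignment)
```

A custom partitioner follows the same contract: any deterministic function from key to reducer index works, and choosing one that balances load across reducers is part of the tuning discussed above.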
14. Monitoring & Management of Hadoop
• Managing HDFS with Tools
• Using HDFS & Job Tracker Web UI
• Routine Administration Procedures
• Commissioning and decommissioning of nodes
• Hands-On Exercise
15. Using Hive
• Hive as a Data Warehouse
• Creating External & Internal Tables and Loading Data
• Writing HiveQL Queries for Data Retrieval
• Creating Partitions and Querying Data
16. Using Pig
• Why Pig and its benefits
• Loading data into PigStorage
• Querying data from PigStorage
• Hands-On Exercise
17. Sqoop
• Importing and exporting data between an RDBMS and Hadoop
• Hands-On Exercise
18. Understanding the Other SQL options in Hadoop
• Intro to Stinger
• Intro to Impala
19. Hadoop Best Practices and Use Cases
20. NoSQL Introduction
• What is NoSQL?
• Variations of NoSQL databases
• Advantages of columnar databases
21. HBase
• HBase Overview and Architecture
• HBase vs. RDBMS
• HBase Table Design
• Column Families and Regions
• HBase Java API Code
• Hands-On Exercise
• HBase Installation
• HBase Shell Commands
• Java Administration API
• Performance Tuning
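HBase's data model — row key → column family → column qualifier → value, with rows kept sorted by key — can be mimicked with a toy in-memory table. The class and names below are illustrative only, not the HBase client API.

```python
# Toy model of HBase's table layout. Column families are fixed when the
# table is created; rows are stored sorted by key, which is why row-key
# design matters for scan performance in real HBase.

class ToyHTable:
    def __init__(self, families):
        self.families = set(families)  # column families fixed at creation
        self.rows = {}                 # row key -> {family: {qualifier: value}}

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise ValueError("unknown column family: " + family)
        self.rows.setdefault(row_key, {}).setdefault(family, {})[qualifier] = value

    def get(self, row_key, family, qualifier):
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)

    def scan(self, start_row, stop_row):
        # Like an HBase Scan: rows in [start_row, stop_row), in key order.
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key, self.rows[key]

users = ToyHTable(families=["info", "stats"])
users.put("user#001", "info", "name", "Ada")
users.put("user#002", "info", "name", "Alan")
users.put("user#001", "stats", "logins", "42")
print(users.get("user#001", "info", "name"))               # Ada
print([k for k, _ in users.scan("user#001", "user#002")])  # ['user#001']
```

Note the contrast with an RDBMS: there is no schema for qualifiers (any row can carry any columns within its families), and lookups are by row key rather than by arbitrary SQL predicates.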
23. Oozie – Workflow Scheduler
• Why Workflow in Hadoop
• Understanding Configuration in Oozie
Take Away from the Course
1. Understanding of the what and why of Hadoop and its Eco-System
components
2. Ability to write MapReduce programs for a given scenario
3. Ability to correctly architect and implement the best practices in Hadoop
development
4. Ability to manage and monitor Hadoop
5. Ability to manage the interactions between the different Hadoop
components