2016
Big Data Technologies
Hadoop and Analytics
Course Guide
Venue:
Indian Institute of Corporate Affairs (IICA)
(Under Ministry of Corporate Affairs)
Plot No. 6,7,8 Sector 5
IMT Manesar, Gurgaon
Haryana
Big Data Technologies • HADOOP • Analytics IICA
Centre for e-Governance • Indian Institute of Corporate Affairs 2
Hands on with Big Data Technologies and Analytics
Centre for e-Governance
Indian Institute of Corporate Affairs
(Under Ministry of Corporate Affairs)
Website: http://www.iica.in
Updated Dec 2016
Table of Contents
Module 1 - Introduction to Linux
- Linux as a prerequisite for Big Data and Hadoop
- Overview of Linux Operating System
- Understanding the Linux command line
- Linux Commands and Shell Scripts
- Working with Linux GUI
- Exercises
Module 2 - Understanding Big Data
- Introduction to Big Data Technologies
- The 3 Vs of Big Data (Volume, Variety and Velocity)
- Structured and Unstructured Data
- Centralized vs. Distributed computing
- Applications and use cases of Big Data
- Opportunities and challenges of Big Data
Module 3 - Getting started with Hadoop
- What is Hadoop, and why is it popular
- Overview of Apache Bigtop and Hadoop installation
- Hadoop configuration files
- Overview of Hadoop Vendor Distributions
- Distributed File Systems (DFS)
- Various types of DFS
- Getting familiar with Hadoop Virtual Machine Environment
- Hadoop Ecosystem Tools and Components
- Hadoop Command line (CLI) and Graphical interface (GUI)
- Exercises
Module 4 - Understanding the Hadoop Architecture
- Name Node and Data Nodes
- Difference between Hadoop 1.x and 2.x
- Hadoop Distributed File System (HDFS)
- HDFS Overview and Architecture
- HDFS Data Flows (Read and Write)
- HDFS Interfaces - Command Line Interface, File System, Administrative and Web Interface
- Copying data into HDFS, and working with data in HDFS
- Advanced HDFS features, such as Data replication, Rack awareness, Fuse-DFS
- Overview of HDFS Federation, High Availability, DistCp and Hadoop Archives
- Exercises
Module 5 - YARN and MapReduce
- Functional Programming paradigms
- What is MapReduce
- Shuffling and Sorting
- YARN Resource Manager UI
- Standalone, Pseudo distributed, and Fully distributed mode
- MapReduce v1 compared to YARN and MapReduce v2
- Examples of MapReduce programs
- Exercises
Module 6 - Data Ingestion in HDFS
- Importing data to HDFS
- Introduction to Sqoop
- Sqoop configuration
- Ingesting data in HDFS using Sqoop
- Exporting data to RDBMS
- Introduction to Flume
- Flume configuration
- Capturing data in real-time using Flume
- Exercises
Module 7 - Working with Hive
- Introduction to Hive and its Architecture
- Different Modes of executing Hive queries
- HiveQL (DDL & DML Operations)
- External vs. Managed Tables
- Hive vs. Impala
- User-Defined Functions (UDFs)
- Exercises
Module 8 - Working with Pig
- Different Modes of executing Pig
- Pig Data Types
- Pig Latin language Constructs (LOAD, STORE, DUMP, SPLIT, etc.)
- User-Defined Functions (UDFs)
- Developing and deploying Pig programs
- Exercises
Module 9 - Getting familiar with Apache Hadoop Ecosystem Tools
- Introduction to Oozie workflows, designs and deployments
- Apache Mahout, and Building a Recommender using Mahout
- Introduction to Avro, Kafka, Storm, and ZooKeeper
- Exercises
Module 10 - Introduction to NoSQL Databases
- Review of RDBMS
- Need for NoSQL
- Brewer's CAP Theorem
- ACID vs. BASE
- Schema on Read vs. Schema on Write
- Different levels of consistency
- Different types of NoSQL databases
- Exercises
Module 11 - Working with NoSQL Databases
- Document stores - Couchbase, MongoDB
- Graph databases - Neo4j
- Key-value stores - Riak
- Column Family - Cassandra, HBase
- Overview of Hybrid NoSQL Databases
- Exercises
Module 12 - Working with Apache Spark
- Understanding Spark Architecture
- Comparing Hadoop and Spark
- Introduction to RDD
- Spark SQL
- Sample programs in Spark
- Exercises
Module 13 - Introduction to Data Analytics
- Difference between Data Analysis and Analytics
- Types of Analytics
- Big Data Analytics
- Business Analytics
- Predictive Analytics
- Real-Time Analytics
- Web Analytics
- Customized Analytics Solutions
- Exercises
Module 14 - Big Data Proof of Concepts and Use Cases
- Text Mining
- Traditional case of Watson
- Sentiment Analysis
- Weather Data Analysis
- Trending Topics and Conclusion
- Exercises
