A customized course covering Linux usage; Big Data fundamentals, including distributed file systems such as GlusterFS, Ceph, and Lustre; Hadoop, Hive, and Pig; NoSQL databases; Spark; Business, Predictive, Real-Time, Web, and Big Data analytics; and proof-of-concept solutions and use cases.
2016
Big Data Technologies
Hadoop and Analytics
Course Guide
Venue:
Indian Institute of Corporate Affairs (IICA)
(Under Ministry of Corporate Affairs)
Plot No. 6,7,8 Sector 5
IMT Manesar, Gurgaon
Haryana
Big Data Technologies • HADOOP • Analytics IICA
Centre for e-Governance • Indian Institute of Corporate Affairs
Big Data Technologies
Hadoop and Analytics
Hands-on with Big Data Technologies and Analytics
Centre for e-Governance
Indian Institute of Corporate Affairs
(Under Ministry of Corporate Affairs)
Plot No. 6,7,8 Sector 5
IMT Manesar, Gurgaon
Haryana
Website: http://www.iica.in
Updated Dec 2016
Table of Contents
Module 1 - Introduction to Linux
- Linux as a prerequisite for Big Data and Hadoop
- Overview of Linux Operating System
- Understanding the Linux command line
- Linux Commands and Shell Scripts
- Working with Linux GUI
- Exercises
Module 2 - Understanding Big Data
- Introduction to Big Data Technologies
- The 3 Vs of Big Data (Volume, Variety and Velocity)
- Structured and Unstructured Data
- Centralized vs. Distributed computing
- Applications and use cases of Big Data
- Opportunities and challenges of Big Data
Module 3 - Getting started with Hadoop
- What is Hadoop, and why is it popular
- Overview of Apache Bigtop and Hadoop installation
- Hadoop configuration files
- Overview of Hadoop Vendor Distributions
- Distributed File Systems (DFS)
- Various types of DFS
- Getting familiar with Hadoop Virtual Machine Environment
- Hadoop Ecosystem Tools and Components
- Hadoop command-line interface (CLI) and graphical user interface (GUI)
- Exercises
Module 4 - Understanding the Hadoop Architecture
- Name Node and Data Nodes
- Difference between Hadoop 1.x and 2.x
- Hadoop Distributed File System (HDFS)
- HDFS Overview and Architecture
- HDFS Data Flows (Read and Write)
- HDFS Interfaces - Command Line Interface, File System, Administrative and Web Interface
- Copying data into HDFS, and working with data in HDFS
- Advanced HDFS features such as data replication, rack awareness, and Fuse-DFS
- Overview of HDFS Federation, High Availability, DistCp and Hadoop Archives
- Exercises
Module 5 - YARN and MapReduce
- Functional Programming paradigms
- What is MapReduce
- Shuffling and Sorting
- YARN Resource Manager UI
- Standalone, pseudo-distributed, and fully distributed modes
- MapReduce v1 compared to YARN and MapReduce v2
- Examples of MapReduce programs
- Exercises
Module 6 - Data Ingestion in HDFS
- Importing data to HDFS
- Introduction to Sqoop
- Sqoop configuration
- Ingesting data into HDFS using Sqoop
- Exporting data to RDBMS
- Introduction to Flume
- Flume configuration
- Capturing data in real-time using Flume
- Exercises
Module 7 - Working with Hive
- Introduction to Hive and its Architecture
- Different Modes of executing Hive queries
- HiveQL (DDL & DML Operations)
- External vs. Managed Tables
- Hive vs. Impala
- User-Defined Functions (UDFs)
- Exercises
Module 8 - Working with Pig
- Different Modes of executing Pig
- Pig Data Types
- Pig Latin language constructs (LOAD, STORE, DUMP, SPLIT, etc.)
- User-Defined Functions (UDFs)
- Developing and deploying Pig programs
- Exercises
Module 9 - Getting familiar with Apache Hadoop Ecosystem Tools
- Introduction to Oozie workflows, designs and deployments
- Apache Mahout, and Building a Recommender using Mahout
- Introduction to Avro, Kafka, Storm, and Zookeeper
- Exercises
Module 10 - Introduction to NoSQL Databases
- Review of RDBMS
- Need for NoSQL
- Brewer's CAP Theorem
- ACID vs. BASE
- Schema on Read vs. Schema on Write
- Different levels of consistency
- Different types of NoSQL databases
- Exercises
Module 11 - Working with NoSQL Databases
- Document stores - Couchbase, MongoDB
- Graph databases - Neo4j
- Key-value stores - Riak
- Column Family - Cassandra, HBase
- Overview of Hybrid NoSQL Databases
- Exercises
Module 12 - Working with Apache Spark
- Understanding Spark Architecture
- Comparing Hadoop and Spark
- Introduction to RDD
- Spark SQL
- Sample programs in Spark
- Exercises
Module 13 - Introduction to Data Analytics
- Difference between Data Analysis and Analytics
- Types of Analytics
- Big Data Analytics
- Business Analytics
- Predictive Analytics
- Real-Time Analytics
- Web Analytics
- Customized Analytics Solutions
- Exercises
Module 14 - Big Data Proof of Concepts and Use Cases
- Text Mining
- Traditional case of Watson
- Sentiment Analysis
- Weather Data Analysis
- Trending Topics and Conclusion
- Exercises