This document outlines the topics covered in an integrated Big Data course, including installing and configuring Hadoop, Spark, Kafka and NoSQL environments. Key areas include HDFS, MapReduce, Hive, Spark, RDDs, Spark SQL, Spark MLlib, Apache Kafka, and MongoDB. Students get hands-on access to a virtual machine with all required software pre-installed to complete exercises and assignments on each topic.
Big Data – Integrated Course
SETUP – Hadoop, Spark, Kafka and NoSQL Environment
• Install and configure Virtual Box
• Load and configure RHEL based Virtual Machine
• Install/Configure VM with basic software
• User setup and Database account creation
• Configure SSH and check port availability
Hadoop - HDFS (Hadoop Distributed File System)
• Hadoop Distributed File System: background and GFS (Google File System)
• HDFS config files – core-site.xml, hdfs-site.xml, mapred-site.xml
• Data Replication – Static and Dynamic configuration
• Data Storage – Block Size details
• HDFS - DFS shell commands
• HDFS - Admin commands and data recovery
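The block size and replication topics above come down to simple arithmetic, which the following minimal sketch illustrates. It assumes a 128 MB block size and a replication factor of 3 (the usual Hadoop 2.x defaults for `dfs.blocksize` and `dfs.replication`; the course VM may be configured differently). The function name `hdfs_footprint` is hypothetical and is not part of any Hadoop API.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    Assumed defaults: 128 MB blocks, replication factor 3.
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication < 1:
        raise ValueError("sizes must be non-negative and replication >= 1")
    # Ceiling division: a partially filled last block still counts as a block.
    blocks = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # The last block only occupies the bytes it actually holds, so raw
    # storage is the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes


blocks, raw = hdfs_footprint(1000 * MB)
print(blocks, raw // MB)  # prints: 8 3000
```

A 1000 MB file therefore occupies 8 blocks and about 3000 MB of raw disk under these assumptions.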
Hadoop - MapReduce Framework
• MapReduce Introduction
• Writing MapReduce Programs
• Mappers and Reducers details
• Running MR jobs
• Configure custom Map and Reduce jobs
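As a warm-up for the mapper and reducer topics above, the map, shuffle and reduce phases can be imitated in plain Python. This is only an in-memory sketch of the idea, not Hadoop code: the names `mapper`, `shuffle`, `reducer` and `word_count` are hypothetical, and a real job would run on the cluster through the Hadoop MapReduce API.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    # Sorting by key mimics the sorted order in which reducers receive keys.
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


counts = word_count(["big data big ideas", "data moves fast"])
print(counts)
# prints: {'big': 2, 'data': 2, 'fast': 1, 'ideas': 1, 'moves': 1}
```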
Hadoop - Apache Hive
• Hive Installation and Metastore setup
• Hive Shell commands
• HiveQL Basics
• Hive data load in Local and MR modes
• Working with Tables, Databases etc.
• Hands on Exercises and Assignments
Spark - Spark Installation and Introduction
• Apache Spark Installation (version 2.x)
• Spark shell and PySpark shell setup
• Spark Executor cores and Executors setup
• Spark configurations for logs
• Writing UDFs (user-defined functions)
Spark - Scala Installation and Introduction
• Scala Installation (version 2.x)
• Scala setup for Spark environment
• Scala based Spark exercise
Spark - Resilient Distributed Datasets (RDD)
• Working with RDDs in Spark
• Creating RDDs from scratch
• Creating RDD from preexisting data
• Accumulators and Broadcast variables
• RDD – Transformation commands
• RDD – Action commands
• RDD complex exercises
Spark - Spark SQL and DataFrames
• Spark SQL and the SQLContext
• Creating DataFrames from raw datasets
• Transforming and Querying DataFrames
• Using csv files and mapping schema
• Using case structures and user defined data types
Spark - Spark MLlib (Machine Learning)
• Basic Principles of Machine Learning
• Supervised and Unsupervised Learning
• Setup Machine Learning for Spark
• Transformations, Correlation Algorithm
• Exercises for Regression and Correlation
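Before running the correlation exercise in Spark, it can help to see the statistic itself. The sketch below computes the Pearson correlation coefficient in plain Python. It is an illustration only and not the Spark MLlib API; the function name `pearson` is hypothetical, and the course may use a different correlation method.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variance numerators).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# A perfectly linear relationship gives a coefficient of (approximately) 1.
print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little or none.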
Kafka - Apache Kafka
• Introduction to Apache Kafka
• Identifying the major Kafka components
• Determining what data is appropriate for use with Kafka
• Developing with Kafka producers, consumers, and brokers
Kafka - Installation and Labs
• Kafka features and terminology
• High-level Kafka architecture
• Kafka Installation on Linux/Windows
• Install ZooKeeper for Kafka
• Install Kafka Server
Kafka - Consumer, Producer and Topics
• Writing Kafka Consumer Labs
• Create Kafka Messages
• Create Kafka Topics
• Message structure and topic configuration
• Write Kafka Producer
• Configure Producer and Kafka Server
• Kafka Multi Broker Configuration
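The relationship between topics, partitions, keys and offsets in the list above can be pictured with a toy in-memory model. This is a sketch under stated assumptions, not Kafka client code: `ToyTopic` and its methods are hypothetical, and `zlib.crc32` stands in for the hash used by Kafka's real partitioner. What carries over is the idea that messages with the same key land in the same partition and keep their order there.

```python
import zlib


class ToyTopic:
    """In-memory stand-in for a Kafka topic with a fixed number of partitions."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; a list index plays the role
        # of the message offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Assumption: crc32 is a stand-in hash, not Kafka's own partitioner.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def send(self, key, value):
        """Producer side: append a message, return (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def poll(self, partition, offset):
        """Consumer side: read all messages in a partition from an offset on."""
        return self.partitions[partition][offset:]


topic = ToyTopic("orders", num_partitions=3)
first = topic.send("user-1", "created")
second = topic.send("user-1", "paid")
# Same key, same partition; the second offset follows the first.
print(first[0] == second[0], second[1] - first[1])  # prints: True 1
```

A consumer that remembers the last offset it read can call `poll` again later and receive only newer messages, which mirrors how offsets let Kafka consumers resume.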
NoSQL - Introduction and Details
• NoSQL databases introduction
• Types of NoSQL databases – MongoDB, Cassandra, CouchDB
• Use cases for NoSQL databases
• Document DB types
• Comparison with RDBMS
NoSQL - MongoDB
• MongoDB installation on Linux/Windows box
• Mongo daemon threads
• Mongo Shell configuration
• Mongo collection creation
• Mongo data load in collections
NoSQL - Mongo Query Language
• MongoDB query language
• Mongo create(), update() and delete() queries
• Mongo find() query
Study Materials and Labs
1) A complete virtual machine is shared with students. It has Java, Oracle DB, Mozilla Firefox and other components pre-installed.
2) The VM can be used even after the training is done. Please note that it is not a remote-lab environment: you keep the VM and all labs after the training is completed.