Introduction to Hadoop & SparkReachUs@CloudxLab.com
Welcome to
Big Data
with
Hadoop & Spark
Please introduce yourself while others are
joining
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Session 1 - Big Data with Hadoop & Spark
Duration: 3 hours
Agenda:
• Introduction to Big Data
• 10 mins. break
• Spark & Hadoop Architecture
Notes:
• Please introduce yourself using chat window while others are joining
• Session is being recorded & Recording & presentation will be shared
• This is Session 1 out of 18 sessions on Big Data with Hadoop & Spark specialization.
• It suffices as an introduction to Big Data Technology Stack.
Asking Questions?
• Every one except Instructor is muted
• Please ask questions by typing in Q&A Window
• Instructor will read out the questions before answering
• To get better answers, keep your messages short and avoid chat language
About CloudxLab
Videos Quizzes Hands-On Projects Case Studies
Real Life Use Cases
Making learning fun and for life
Automated Hands-on Assessments
Learn by doing
Automated Hands-on Assessments
Problem Statement Hands On
Cloud based Lab
Assessment
Automated Hands-on Assessments
Problem
Statement
Evaluation
Automated Hands-on Assessments
Python Assessment Jupyter Notebook
Automated Hands-on Assessments
Python Assessment Jupyter Notebook
Questions?
https://discuss.cloudxlab.com
reachus@cloudxlab.com
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Course Instructor
Sandeep Giri
Worked On Large Scale Computing
Graduated from IIT Roorkee
Software Engineer
Loves Explaining Technologies
Founder
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Learn To Process
Big Data
With
Hadoop, Spark
&
Related Technologies
Course Objective
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Data Variety
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Data Variety
ETL
Extract Transform Load
Introduction to Hadoop & SparkReachUs@CloudxLab.com
1.Groups of networked computers
2.Interact with each other
3.To achieve a common goal.
Distributed Systems
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Question
How Many Bytes in One Petabyte?
1.1259x10^15
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Question
How Much Data Facebook Stores in
One Day?
600 TB
Introduction to Hadoop & SparkReachUs@CloudxLab.com
What is Big Data?
• Simply: Data of Very Big Size
• Can’t process with usual tools
• Distributed Architecture
Needed
• Structured / Unstructured
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Characteristics of Big Data
Problems Involving the
handling of data coming at
fast rate.
e.g. Number of requests
being received by Facebook,
Youtube streaming, Google
Analytics
Problems involving
complex data structures
e.g. Maps, Social Graphs,
Recommendations
VOLUME VELOCITY VARIETY
Data At Rest Data In Motion Data in Many Forms
Problems related to storage
of huge data reliably.
e.g. Storage of Logs of a
website, Storage of data by
gmail.
FB: 300 PB. 600TB/ day
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Characteristics of Big Data - Variety
Problems involving complex data structures
e.g. Maps, Social Graphs, Recommendations
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Question
Time taken to read 1 TB from HDD?
Around 6 hours
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Is One PetaByte Big Data?
If you have to count just vowels in 1 Petabyte
data everyday, do you need distributed
system?
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Is One PetaByte Big Data?
Yes.
Most of the existing systems can’t handle it.
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Why Big Data?
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Devices:
Smart Phones
4.6 billion mobile-phones.
1 - 2 billion people accessing the internet.
Application
Social Networks
Internet of Things
The devices became cheaper, faster and smaller.
The connectivity improved. Result: Many Applications
Connectivity
Wifi, 4G, NFC, GPS
X =>
Why is It Important Now?
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Computing Components
To process & store data
we need
3. HDD or SSD
Disk Size + Speed
2. RAM - Speed & Size
1. CPU Speed
4. Network
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Which Components Impact the Speed of
Computing?
A. CPU
B. Memory Size
C. Memory Read Speed
D. Disk Speed
E. Disk Size
F. Network Speed
G. All of Above
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Which Components Impact the Speed of
Computing?
A. CPU
B. Memory Size
C. Memory Read Speed
D. Disk Speed
E. Disk Size
F. Network Speed
G. All of Above
Introduction to Hadoop & SparkReachUs@CloudxLab.com
1. Ecommerce - Recommendations
Example Big Data Customers
Introduction to Hadoop & SparkReachUs@CloudxLab.com
1. Ecommerce - Recommendations
Example Big Data Customers
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Example Big Data Problems
Recommendations - How?
USER ID MOVIE ID RATING
KUMAR matrix 4.0
KUMAR Ice age 3.5
GIRI
apocalypse
now
3.6
GIRI Ice age 3.5
USER ID MOVIE ID RATING
KUMAR apocalypse now 3.6
GIRI matrix 4.0
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Example Big Data Customers
2. Ecommerce - A/B Testing
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Big Data Customers
Telecommunications
1.Customer Churn Prevention
2.Network Performance Optimization
3.Calling Data Record (CDR) Analysis
4.Analyzing Network to Predict Failure
Government
1.Fraud Detection
2.Cyber Security Welfare
3.Justice
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Example Big Data Customers
Healthcare & Life Sciences
1.Health information exchange
2.Gene sequencing
3.Healthcare improvements
4.Drug Safety
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Big Data Solutions
1.Apache Hadoop
○ Apache Spark
2.Cassandra
3.MongoDB
4.Google Compute Engine
5.AWS
Introduction to Hadoop & SparkReachUs@CloudxLab.com
What is Hadoop?
A. Created by Doug Cutting (of Yahoo)
B. Built for Nutch search engine project
C. Joined by Mike Cafarella
D. Based on GFS, GMR & Google Big Table
E. Named after Toy Elephant
F. Open Source - Apache
G. Powerful, Popular & Supported
H. Framework to handle Big Data
I. For distributed, scalable and reliable computing
J. Written in Java
Introduction to Hadoop & SparkReachUs@CloudxLab.com
SQL like interface
SQL Interface
WorkFlow
Machine
learning
/ STATS
NoSQL Datastore
Compute Engine
File Storage
Components
Resource Manager
Spark
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Apache
• Really fast MapReduce
• 100x faster than Hadoop MapReduce in memory,
• 10x faster on disk.
• Builds on similar paradigms as MapReduce
• Integrated with Hadoop
Spark Core - A fast and general engine for large-scale
data processing.
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Spark Architecture
Spark Core
StandaloneAmazon EC2
Hadoop
YARN
Apache Mesos
HDFS
HBase
Hive
Tachyon
Cassandra
SQL
Streaming MLLib GraphX
SparkR Java Python Scala
Libraries
Languages
Data Sources
Resource/cluster managers
Dataframes
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Thank you. For the full course please enroll at
https://cloudxlab.com/
Introduction to Hadoop & SparkReachUs@CloudxLab.com
For the full course please enroll at https://cloudxlab.com/
Introduction to Hadoop & SparkReachUs@CloudxLab.com
My Courses
Introduction to Hadoop & SparkReachUs@CloudxLab.com
My Course List
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Topics or PlayLists
Introduction to Hadoop & SparkReachUs@CloudxLab.com
Learning Item

Introduction to Big data with Hadoop & Spark | Big Data Hadoop Spark Tutorial | CloudxLab

  • 1.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Welcome to Big Data with Hadoop & Spark Please introduce yourself while others are joining
  • 2.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Session 1 - Big Data with Hadoop & Spark Duration: 3 hours Agenda: • Introduction to Big Data • 10 mins. break • Spark & Hadoop Architecture Notes: • Please introduce yourself using chat window while others are joining • Session is being recorded & Recording & presentation will be shared • This is Session 1 out of 18 sessions on Big Data with Hadoop & Spark specialization. • It suffices as an introduction to Big Data Technology Stack. Asking Questions? • Every one except Instructor is muted • Please ask questions by typing in Q&A Window • Instructor will read out the questions before answering • To get better answers, keep your messages short and avoid chat language
  • 3.
    About CloudxLab Videos QuizzesHands-On Projects Case Studies Real Life Use Cases Making learning fun and for life
  • 4.
  • 5.
    Automated Hands-on Assessments ProblemStatement Hands On Cloud based Lab Assessment
  • 6.
  • 7.
    Automated Hands-on Assessments PythonAssessment Jupyter Notebook
  • 8.
    Automated Hands-on Assessments PythonAssessment Jupyter Notebook
  • 9.
  • 10.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Course Instructor Sandeep Giri Worked On Large Scale Computing Graduated from IIT Roorkee Software Engineer Loves Explaining Technologies Founder
  • 11.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Learn To Process Big Data With Hadoop, Spark & Related Technologies Course Objective
  • 12.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Data Variety
  • 13.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Data Variety ETL Extract Transform Load
  • 14.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com 1.Groups of networked computers 2.Interact with each other 3.To achieve a common goal. Distributed Systems
  • 15.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Question How Many Bytes in One Petabyte? 1.1259x10^15
  • 16.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Question How Much Data Facebook Stores in One Day? 600 TB
  • 17.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com What is Big Data? • Simply: Data of Very Big Size • Can’t process with usual tools • Distributed Architecture Needed • Structured / Unstructured
  • 18.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Characteristics of Big Data Problems Involving the handling of data coming at fast rate. e.g. Number of requests being received by Facebook, Youtube streaming, Google Analytics Problems involving complex data structures e.g. Maps, Social Graphs, Recommendations VOLUME VELOCITY VARIETY Data At Rest Data In Motion Data in Many Forms Problems related to storage of huge data reliably. e.g. Storage of Logs of a website, Storage of data by gmail. FB: 300 PB. 600TB/ day
  • 19.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Characteristics of Big Data - Variety Problems involving complex data structures e.g. Maps, Social Graphs, Recommendations
  • 20.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Question Time taken to read 1 TB from HDD? Around 6 hours
  • 21.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Is One PetaByte Big Data? If you have to count just vowels in 1 Petabyte data everyday, do you need distributed system?
  • 22.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Is One PetaByte Big Data? Yes. Most of the existing systems can’t handle it.
  • 23.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Why Big Data?
  • 24.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Devices: Smart Phones 4.6 billion mobile-phones. 1 - 2 billion people accessing the internet. Application Social Networks Internet of Things The devices became cheaper, faster and smaller. The connectivity improved. Result: Many Applications Connectivity Wifi, 4G, NFC, GPS X => Why is It Important Now?
  • 25.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Computing Components To process & store data we need 3. HDD or SSD Disk Size + Speed 2. RAM - Speed & Size 1. CPU Speed 4. Network
  • 26.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Which Components Impact the Speed of Computing? A. CPU B. Memory Size C. Memory Read Speed D. Disk Speed E. Disk Size F. Network Speed G. All of Above
  • 27.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Which Components Impact the Speed of Computing? A. CPU B. Memory Size C. Memory Read Speed D. Disk Speed E. Disk Size F. Network Speed G. All of Above
  • 28.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com 1. Ecommerce - Recommendations Example Big Data Customers
  • 29.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com 1. Ecommerce - Recommendations Example Big Data Customers
  • 30.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Example Big Data Problems Recommendations - How? USER ID MOVIE ID RATING KUMAR matrix 4.0 KUMAR Ice age 3.5 GIRI apocalypse now 3.6 GIRI Ice age 3.5 USER ID MOVIE ID RATING KUMAR apocalypse now 3.6 GIRI matrix 4.0
  • 31.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Example Big Data Customers 2. Ecommerce - A/B Testing
  • 32.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Big Data Customers Telecommunications 1.Customer Churn Prevention 2.Network Performance Optimization 3.Calling Data Record (CDR) Analysis 4.Analyzing Network to Predict Failure Government 1.Fraud Detection 2.Cyber Security Welfare 3.Justice
  • 33.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Example Big Data Customers Healthcare & Life Sciences 1.Health information exchange 2.Gene sequencing 3.Healthcare improvements 4.Drug Safety
  • 34.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Big Data Solutions 1.Apache Hadoop ○ Apache Spark 2.Cassandra 3.MongoDB 4.Google Compute Engine 5.AWS
  • 35.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com What is Hadoop? A. Created by Doug Cutting (of Yahoo) B. Built for Nutch search engine project C. Joined by Mike Cafarella D. Based on GFS, GMR & Google Big Table E. Named after Toy Elephant F. Open Source - Apache G. Powerful, Popular & Supported H. Framework to handle Big Data I. For distributed, scalable and reliable computing J. Written in Java
  • 36.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com SQL like interface SQL Interface WorkFlow Machine learning / STATS NoSQL Datastore Compute Engine File Storage Components Resource Manager Spark
  • 37.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Apache • Really fast MapReduce • 100x faster than Hadoop MapReduce in memory, • 10x faster on disk. • Builds on similar paradigms as MapReduce • Integrated with Hadoop Spark Core - A fast and general engine for large-scale data processing.
  • 38.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Spark Architecture Spark Core StandaloneAmazon EC2 Hadoop YARN Apache Mesos HDFS HBase Hive Tachyon Cassandra SQL Streaming MLLib GraphX SparkR Java Python Scala Libraries Languages Data Sources Resource/cluster managers Dataframes
  • 39.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Thank you. For the full course please enroll at https://cloudxlab.com/
  • 40.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com For the full course please enroll at https://cloudxlab.com/
  • 41.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com My Courses
  • 42.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com My Course List
  • 43.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Topics or PlayLists
  • 44.
    Introduction to Hadoop& SparkReachUs@CloudxLab.com Learning Item