Module - Big Data
Overview
Understanding Hadoop
Why Cloud Computing
Data is getting bigger; the world is getting smaller
Nothing fits in-memory on a single machine
Big Data, Small World
Supercomputer vs. a cluster of generic computers
Big Data, Small World
Monolithic vs. Distributed
Lots of cheap hardware
Replication & fault tolerance
Distributed computing
HDFS
YARN
MapReduce
“Clusters”
“Nodes”
“Server Farms”
All of these servers need to be
co-ordinated by a single piece
of software
Single Co-ordinating Software
• Partition data
• Co-ordinate computing tasks
• Handle fault tolerance and recovery
• Allocate capacity to processes
Google developed proprietary
software to run on these
distributed systems
Single Co-ordinating Software
[Diagram: a large grid of <raw data> records spread across many machines]
First: store millions of records on multiple machines
Single Co-ordinating Software
Second: run processes on all these machines to crunch data
Single Co-ordinating Software
Google File System
MapReduce
To solve distributed
storage
To solve distributed
computing
Single Co-ordinating Software
Apache developed open source
versions of these technologies
Single Co-ordinating Software
Google File System → HDFS
MapReduce → MapReduce
Single Co-ordinating Software
Google File System
MapReduce
Hadoop
A file system to manage
the storage of data
A framework to process
data across multiple servers
HDFS MapReduce
In 2013, Apache released Hadoop 2.0
Hadoop
HDFS MapReduce
MapReduce was broken into two separate parts
Hadoop
HDFS MapReduce
A framework to define
a data processing task
A framework to run
the data processing task
Hadoop
HDFS MapReduce YARN
Each of these components has corresponding
configuration files
Hadoop
HDFS MapReduce YARN
HDFS
MapReduce
YARN
User defines map and
reduce tasks using the
MapReduce API
Co-ordination Between Hadoop Blocks
A job is triggered on
the cluster
HDFS
MapReduce
YARN
Co-ordination Between Hadoop Blocks
YARN figures out where
and how to run the job, and
stores the result in HDFS
HDFS
MapReduce
YARN
Co-ordination Between Hadoop Blocks
Hadoop
An ecosystem of tools has sprung up around this
core piece of software
Hadoop Ecosystem
Hadoop Ecosystem
Hadoop
Hive HBase
Oozie
Kafka
Pig
Spark
Hive HBase
Oozie
Kafka
Pig
Spark
Hadoop Ecosystem
Provides an SQL interface to Hadoop
The bridge to Hadoop for folks who don’t
have exposure to OOP in Java
Hive
A database management system on
top of Hadoop
Integrates with your application just
like a traditional database
HBase
A data manipulation language
Transforms unstructured data into a
structured format
Query this structured data using
interfaces like Hive
Pig
A distributed computing engine used
along with Hadoop
Interactive shell to quickly process
datasets
Has a bunch of built in libraries for
machine learning, stream processing,
graph processing etc.
Spark
A tool to schedule workflows on all
the Hadoop ecosystem technologies
Oozie
A distributed messaging platform for
stream processing of unbounded datasets
Kafka
Hadoop
Hive HBase
Oozie
Kafka
Pig
Spark
Hadoop Ecosystem
Understanding HDFS
Hadoop Distributed File System
HDFS
Built on commodity hardware
Highly fault tolerant, hardware failure is the norm
Suited to batch processing - data access has high
throughput rather than low latency
Supports very large data sets
HDFS
Manage file storage across multiple disks
HDFS
Each disk on a different machine in a cluster
HDFS
A cluster of machines
HDFS
A node in the cluster
HDFS
1 node is the
master node
HDFS
HDFS
Name node
Data nodes
Name node
HDFS
Data nodes
Name node
HDFS
If the data in the distributed file system is a book
HDFS
The name node is the
table of contents
Name node
The data nodes hold the
actual text in each page
Data nodes
Data nodes
Name node
HDFS
Manages the overall file system
Stores
• The directory structure
• Metadata of the files
Name node
Physically stores the data in the files
Data nodes
A large text file
Storing a File in HDFS
Block 1
Block 2
Block 3
Block 4
Block 6
Block 5
Block 7
Block 8
Break the data into
blocks
Storing a File in HDFS
Files of different lengths are treated the
same way
Storage is simplified
Unit for replication and fault
tolerance
Break the data into blocks
Block 1
Block 2
Block 3
Block 4
Block 6
Block 5
Block 7
Block 8
Storing a File in HDFS
The blocks are of size
128 MB
Block 1
Block 2
Block 3
Block 4
Block 6
Block 5
Block 7
Block 8
Storing a File in HDFS
Block 1
Block 2
Block 3
Block 4
Block 6
Block 5
Block 7
Block size is a trade-off
size 128 MB
• Larger blocks reduce
parallelism
• Smaller blocks increase
overhead
Storing a File in HDFS
Block 1
Block 2
Block 3
Block 4
Block 6
Block 5
Block 7
Block 8
This size helps minimize the
time taken to seek to the
block on the disk
Storing a File in HDFS
size 128 MB
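The arithmetic of splitting a file into 128 MB blocks can be sketched in a few lines of Python (a simulation of the idea, not the HDFS API):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # HDFS default block size: 128 MB

def num_blocks(file_size_bytes):
    # A file is split into ceil(file_size / block_size) blocks;
    # the last block may be smaller than 128 MB.
    return -(-file_size_bytes // BLOCK_SIZE)  # ceiling division

one_gb = 1024 * 1024 * 1024
print(num_blocks(one_gb))      # a 1 GB file -> 8 blocks
print(num_blocks(one_gb + 1))  # one extra byte -> a 9th (tiny) block
```

Note that a block smaller than 128 MB occupies only its actual size on disk; the block size is an upper bound, not an allocation unit.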
Store the blocks
across the data nodes
Block 1
Block 2
Block 3
Block 4
Block 6
Block 5
Block 7
Block 8
Storing a File in HDFS
Block 1 Block 2
DN 1
Block 3 Block 4
DN 2
Block 6
Block 5
DN 3
Block 7 Block 8
DN 4
Each node contains a partition or a split of data
Storing a File in HDFS
How do we know where the splits of a particular file are?
Block 1 Block 2
Block 3 Block 4
Block 6
Block 5
Block 7 Block 8
Storing a File in HDFS
DN 1
DN 2
DN 3
DN 4
Name node
File 1 Block 1 DN 1
File 1 Block 2 DN 1
File 1 Block 3 DN 2
File 1 Block 4 DN 2
File 1 Block 5 DN 3
Storing a File in HDFS
Block 1 Block 2
Block 3 Block 4
Block 6
Block 5
Block 7 Block 8
DN 1
DN 2
DN 3
DN 4
1. Use metadata in the name node to look up block
locations
2. Read the blocks from respective locations
Reading a File in HDFS
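The two-step read can be simulated in Python with plain dictionaries (hypothetical file, block, and node names; the real name node protocol is more involved):

```python
# Step 1's lookup table: the name node maps (file, block) -> data node.
name_node = {
    ("file1", 1): "DN1", ("file1", 2): "DN1",
    ("file1", 3): "DN2", ("file1", 4): "DN2",
    ("file1", 5): "DN3",
}

# The data nodes physically hold the block contents.
data_nodes = {
    "DN1": {1: b"block-1 ", 2: b"block-2 "},
    "DN2": {3: b"block-3 ", 4: b"block-4 "},
    "DN3": {5: b"block-5"},
}

def read_file(name, num_blocks):
    contents = []
    for block in range(1, num_blocks + 1):
        dn = name_node[(name, block)]           # 1. ask the name node
        contents.append(data_nodes[dn][block])  # 2. read from that data node
    return b"".join(contents)

print(read_file("file1", 5))
```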
Reading a File in HDFS
Name node
File 1 Block 1 DN 1
File 1 Block 2 DN 1
File 1 Block 3 DN 2
File 1 Block 4 DN 2
File 1 Block 5 DN 3
Block 1 Block 2
Block 3 Block 4
Block 6
Block 5
Block 7 Block 8
DN 1
DN 2
DN 3
DN 4
File 1 Block 1 DN 1
File 1 Block 2 DN 1
File 1 Block 3 DN 2
File 1 Block 4 DN 2
File 1 Block 5 DN 3
Request to the name node
Reading a File in HDFS
Block 1 Block 2
DN 1
Name node
Read actual contents from the block
Reading a File in HDFS
File 1 Block 1 DN 1
File 1 Block 2 DN 1
File 1 Block 3 DN 2
File 1 Block 4 DN 2
File 1 Block 5 DN 3
Name node
Block 1 Block 2
DN 1
Read actual contents from the block
Reading a File in HDFS
Block 2
File 1 Block 1 DN 1
File 1 Block 2 DN 1
File 1 Block 3 DN 2
File 1 Block 4 DN 2
File 1 Block 5 DN 3
Name node
Block 1 Block 2
DN 1
A File Stored in HDFS
File 1 Block 1 DN 1
File 1 Block 2 DN 1
File 1 Block 3 DN 2
File 1 Block 4 DN 2
File 1 Block 5 DN 3
Name node
Block 1 Block 2
Block 3 Block 4
Block 6
Block 5
Block 7 Block 8
DN 1
DN 2
DN 3
DN 4
Challenges of Distributed Storage
Failure management in the data
nodes
Failure management for the name
node
Block 1 Block 2
Block 3 Block 4 Block 7 Block 8
Name node
DN 1
DN 2 DN 4
A File Stored in HDFS
What if one of the blocks gets corrupted?
Block 6
Block 5
DN 3
Block 6
DN 3
Block 5
Block 1 Block 2
Block 3 Block 4 Block 7 Block 8
Name node
DN 1
DN 2 DN 4
What if a data node containing
some blocks crashes?
A File Stored in HDFS
Define a replication
factor
Managing Failures in Data Nodes
A File Stored in HDFS
Block 1 Block 2
Block 3 Block 4
Block 6
Block 5
Block 7 Block 8
DN 1
DN 2
DN 3
DN 4
File 1 Block 1 DN 1
File 1 Block 2 DN 1
File 1 Block 3 DN 2
File 1 Block 4 DN 2
File 1 Block 5 DN 3
Name node
Block 3 Block 4
DN 2
1. Replicate blocks based on the
replication factor
2. Store replicas in different locations
Block 1 Block 2
Replication
Block 1 Block 2
DN 1
The replica locations
are also stored in the
name node
File 1 Block 1 DN 1
File 1 Block 2 DN 1
File 1 Block 3 DN 2
File 1 Block 4 DN 2
File 1 Block 5 DN 3
File 1 Block 1 DN 2
File 1 Block 2 DN 2
Replication
Name node
Choosing Replica Locations
Maximize
redundancy
Minimize write
bandwidth
Rack 1 Rack 2 Rack 3
Servers in a data center
Maximize
redundancy
Store replicas “far away” i.e. on
different nodes
Rack 1 Rack 2 Rack 3
Maximize
redundancy
Choosing Replica Locations
Maximize
redundancy
Minimize write
bandwidth
This requires that replicas be
stored close to each other
Minimize write
bandwidth
Rack 1 Rack 2 Rack 3
Node that runs the replication
pipeline
Minimize write
bandwidth
This node chooses the
location for the first replica
Rack 1 Rack 2 Rack 3
Minimize write
bandwidth
Data is forwarded from here
to the next replica location
Rack 1 Rack 2 Rack 3
Minimize write
bandwidth
Forwarded further to the
next replica location
Rack 1 Rack 2 Rack 3
Minimize write
bandwidth
Forwarding requires a
large amount of bandwidth
Rack 1 Rack 2 Rack 3
Minimize write
bandwidth
Increases the cost of
write operations
Rack 1 Rack 2 Rack 3
Minimize write
bandwidth
Maximize
redundancy
Default Hadoop Replication Strategy
Minimize write
bandwidth
Balancing both needs
Rack 1 Rack 2 Rack 3
First location chosen at
random
Default Hadoop Replication Strategy
Second location has to be on
a different rack (if possible)
Default Hadoop Replication Strategy
Rack 1 Rack 2 Rack 3
Third replica is on the same
rack as the second, but on a
different node
Default Hadoop Replication Strategy
Rack 1 Rack 2 Rack 3
Reduces inter-rack
traffic and improves
write performance
Default Hadoop Replication Strategy
Rack 1 Rack 2 Rack 3
Read operations are
sent to the rack
closest to the client
Default Hadoop Replication Strategy
Rack 1 Rack 2 Rack 3
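The default placement policy above can be sketched in Python (hypothetical rack and node names; the real HDFS policy also considers node load and network topology):

```python
import random

def place_replicas(racks, rng=random):
    # racks: dict of rack name -> list of node names
    # First replica: a random node (HDFS prefers the writer's node).
    first_rack = rng.choice(list(racks))
    first = rng.choice(racks[first_rack])
    # Second replica: a different rack, to maximize redundancy.
    second_rack = rng.choice([r for r in racks if r != first_rack])
    second = rng.choice(racks[second_rack])
    # Third replica: same rack as the second, different node,
    # which keeps inter-rack traffic (write bandwidth) low.
    third = rng.choice([n for n in racks[second_rack] if n != second])
    return first, second, third

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
print(place_replicas(racks))
```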
Understanding MapReduce
MapReduce
Processing huge amounts of data
Requires running processes on many machines
MapReduce
A distributed system
MapReduce
MapReduce is a programming paradigm
MapReduce
Takes advantage of the inherent parallelism in data
processing
MapReduce
[Diagram: a large grid of <raw data> records]
Modern systems generate millions of records of
raw data
MapReduce
A task of this scale is processed in two stages
map reduce
MapReduce
[Diagram: partitions of raw data are fed to map tasks, whose outputs flow into reduce tasks]
map reduce
The programmer defines these 2
functions
Hadoop does the rest - behind the
scenes
MapReduce
An operation performed in parallel, on small
portions of the dataset
map
One Record
Key-Value Output
map
reduce
An operation to combine the
results of the map step
[Diagram: the outputs of many map tasks are combined by reduce into the final output]
A step that can be performed in parallel
A step to combine the intermediate results
map
reduce
Word Frequency
above 14
are 20
how 21
star 22
twinkle 32
… ..
Consider a large text file
How?
Twinkle twinkle little star
How I wonder what you are
Up above the world so high
Like a diamond in the sky
Twinkle twinkle little star
How I wonder what you are
…..
Counting Word Frequencies
The raw data is really large (potentially
petabytes)
It’s distributed across many machines in a cluster
Each machine holds a partition of data
Twinkle twinkle little star
How I wonder what you are
Up above the world so high
Like a diamond in the sky
Twinkle twinkle little star
How I wonder what you are
…..
MapReduce Flow
Each partition is given to a different
process i.e. to mappers
Twinkle twinkle little star
How I wonder what you are
Up above the world so high
Like a diamond in the sky
Twinkle twinkle little star
How I wonder what you are
MapReduce Flow
Each mapper works
in parallel
M
M
M
Twinkle twinkle little star
How I wonder what you are
Up above the world so high
Like a diamond in the sky
Twinkle twinkle little star
How I wonder what you are
MapReduce Flow
Within each mapper, the rows are
processed serially
M
Twinkle twinkle little star
How I wonder what you are
Map Flow
Each row emits {key, value} pairs
{twinkle, 1}
{twinkle, 1}
Word # Count
{little, 1}
{star, 1}
Twinkle twinkle little star
How I wonder what you are
M
Map Flow
{twinkle, 1}
{twinkle, 1}
{up, 1}
{above, 1}
{twinkle, 1}
{twinkle, 1}
The results are passed on to
another process i.e. a reducer
M
M
M
Reduce Flow
….
….
….
{twinkle, 1}
{twinkle, 1}
The reducer combines the
values with the same key
R
{twinkle, 4}
{up, 2}
{up, 1}
{above, 1}
{twinkle, 1}
{twinkle, 1}
M
M
M
….
….
….
Reduce Flow
Key Insight Behind MapReduce
M R
<K,V>
<K,V> <K,V>
Many data processing tasks can be expressed in this
form
1. What {key, value} pairs should be emitted in the map step?
2. How should values with the same key be combined?
?
?
Answer Two Questions
M R
?
?
word 1
Sum
Twinkle twinkle little star
How I wonder what you are
Up above the world so high
Like a diamond in the sky
Word Count
twinkle 2
little 1
… …
… …
… …
… …
For each word in each
line
{twinkle, 1}
{twinkle, 1}
{little, 1}
{star, 1}
…
..
Counting Word Frequencies
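The two answers for word count can be simulated in plain Python (the real map and reduce tasks run on separate machines, with Hadoop doing the grouping in between):

```python
from collections import defaultdict

def map_fn(line):
    # Q1: for each word in the line, emit {word, 1}.
    for word in line.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Q2: values with the same key are combined by summing.
    return (word, sum(counts))

lines = ["Twinkle twinkle little star",
         "How I wonder what you are"]

# Shuffle phase (done by Hadoop): group map output by key.
groups = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        groups[word].append(count)

result = dict(reduce_fn(w, c) for w, c in groups.items())
print(result["twinkle"])  # 2
```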
M R
Answer these to parallelize any
task :)
?
?
M R
Map Reduce Main
A class where the map
logic is implemented
A class where the reduce
logic is implemented
A driver program that
sets up the job
Implementing in Java
A class where the map
logic is implemented
A class where the reduce
logic is implemented
A driver program that
sets up the job
Map Reduce Main
Implementing in Java
Map Class
Mapper Class
The map logic is implemented in
a class that extends the
Mapper Class
Map Step
This is a generic class,
with 4 type parameters
<input key type,
input value type, output key
type, output value type>
Map Step
Map Class
Mapper Class
A class where the map
logic is implemented
A class where the reduce
logic is implemented
A driver program that
sets up the job
Map Reduce Main
Implementing in Java
Map Reduce Main
A class where the map
logic is implemented
A class where the reduce
logic is implemented
A driver program that
sets up the job
Implementing in Java
Reduce Class
Reducer Class
The reduce logic is implemented
in a class that extends the
Reducer Class
Reduce Step
Reducer Class
This is also a generic class,
with 4 type parameters
<input key type,
input value type, output key
type, output value type>
Reduce Step
Reduce Class
Reduce Class
Reducer Class
The output types of the Mapper should match the
input types of the Reducer
<input key type,
input value type, output key type, output
value type>
Map Class
Mapper Class
<input key type,
input value type, output key type,
output value type>
Matching Data Types
Map Reduce Main
A class where the map
logic is implemented
A class where the reduce
logic is implemented
A driver program that
sets up the job
Implementing in Java
Map Reduce Main
A class where the map
logic is implemented
A class where the reduce
logic is implemented
A driver program that
sets up the job
Implementing in Java
Main Class
Job Object
The Mapper and Reducer
classes are used by a Job
that is configured in the
Main Class
Setting up the Job
Main Class
Job Object
The Job has a bunch of
properties that need to
be configured
Mapper class
Reducer class
Output data types
Input filepath
Output filepath
Setting up the Job
Yet Another Resource Negotiator
YARN
YARN
Co-ordinates tasks running on the cluster
Assigns new nodes in case of failure
Resource Manager Node Manager
Runs on a single master node
Schedules tasks across nodes
Run on all other nodes
Manages tasks on the individual node
YARN
ResourceManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
YARN
Already running
tasks
YARN
ResourceManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
Job
Find a
NodeManager with
free capacity
Submitting a Job
ResourceManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
Container
All processes on a
node are run within a
container
Application Master Process
This is the logical unit for the
resources the process needs:
memory, CPU, etc.
NodeManager
Container
Application Master Process
A container executes a specific
application
One NodeManager can have
multiple containers
NodeManager
Container
Application Master Process
The ResourceManager
starts off the Application
Master within the
Container
Application Master Process
Application Master Process
NodeManager
Container
Performs the computation required
for the task
If additional resources are required,
the Application Master makes the
request
Application Master Process
Application Master Process
NodeManager
Container
Requests containers for
mappers and reducers
Request includes CPU, memory
requirement
Application Processes
ResourceManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
Assigns additional nodes
Notifies the original Application
Master which made the request
Application Processes
ResourceManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
NodeManager
The Application Master on the
original node starts off the
Application Masters on the newly
assigned nodes
Application Processes
NodeManager
ResourceManager
NodeManager
NodeManager
NodeManager
NodeManager
Try to minimize data movement over the network
The Location Constraint
i.e. assign a process to the same node
where the data to be processed lives
If CPU or memory is not available,
wait!
The Location Constraint
Scheduling Policies
FIFO Scheduler
Capacity
Scheduler
Fair Scheduler
Introducing Hive
Hadoop
File system to manage the
storage of data
Framework to process data
across multiple servers
Framework to run and manage
the data processing tasks
MapReduce
HDFS YARN
Hive on Hadoop
HIVE
Hive runs on top of the Hadoop distributed computing
framework
MapReduce
HDFS YARN
Hive stores its data in HDFS
HIVE
MapReduce
HDFS YARN
Hive on Hadoop
Data is stored as files - text files, binary files
Partitioned across machines in the cluster
Replicated for fault tolerance
Processing tasks parallelized across multiple machines
Hadoop Distributed File
System
HDFS
Hive runs all processes in the form of MapReduce jobs
under the hood
HIVE
MapReduce
HDFS YARN
Hive on Hadoop
A parallel programming model
Defines the logic to process data on multiple machines
Batch processing operations on files in HDFS
Usually written in Java using the Hadoop MapReduce
library
MapReduce
MapReduce
Do we have to write MapReduce code to
work with Hive? No
HIVE
MapReduce
HDFS YARN
Hive on Hadoop
HiveQL
Hive Query Language
A SQL-like interface to the underlying data
Modeled on the Structured Query Language (SQL)
Familiar to analysts and engineers
Simple query constructs
- select
- group by
- join
HiveQL
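A HiveQL query using these constructs looks almost exactly like SQL. As a runnable sketch, the query below is executed against sqlite3, which shares the same basic `select`/`group by` syntax (the table and values are made up for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (store TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Seattle", 7), ("Kent", 20), ("Seattle", 9)])

# select + group by, exactly as you would write it in HiveQL
rows = conn.execute(
    "SELECT store, SUM(revenue) FROM orders GROUP BY store ORDER BY store"
).fetchall()
print(rows)  # [('Kent', 20), ('Seattle', 16)]
```

In Hive, the same query would be compiled into a MapReduce job over files in HDFS rather than executed against a local database file.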
Hive exposes files in HDFS in the form of tables to the
user
HiveQL
Write SQL-like query in HiveQL and submit it to Hive
HiveQL
Hive will translate the query to MapReduce tasks and run
them on Hadoop
HiveQL
MapReduce will process files on HDFS and return
results to Hive
HiveQL
Hive abstracts away the details of the
underlying MapReduce jobs
Work with Hive almost exactly like
you would with a traditional
database
A Hive user sees data as if it were stored in tables
The Hive Metastore
HIVE
MapReduce
HDFS YARN
Exposes the file-based storage of HDFS in the form of
tables
Metastore
The Hive Metastore
HIVE
MapReduce
HDFS YARN
The Hive Metastore
Metastore
The bridge between data stored in files and the tables
exposed to users
Stores metadata for all the tables in Hive
Maps the files and directories in Hive to tables
Holds table definitions and the schema for each table
Has information on converting files to table
representations
The Hive Metastore
Any database with a JDBC driver
can be used as a metastore
The Hive Metastore
Development environments use the built-in Derby
database
Embedded metastore
The Hive Metastore
Same Java process as Hive itself
One Hive session to connect to the database
The Hive Metastore
Production environments
Local metastore: Allows multiple sessions to connect to
Hive
Remote metastore: Separate processes for Hive and the
metastore
The Hive Metastore
Hive vs. RDBMS
Data size • Computation • Latency
Operations • ACID compliance • Query language
Hive RDBMS
Hive vs. RDBMS
Large datasets
Parallel computations
High latency
Read operations
Not ACID compliant by default
HiveQL
Small datasets
Serial computations
Low latency
Read/write operations
ACID compliant
SQL
Hive vs. RDBMS
Hive RDBMS
Large Datasets Small Datasets
Gigabytes or petabytes Megabytes or gigabytes
Hive vs. RDBMS
Calculating trends Accessing and updating individual
records
Hive vs. RDBMS
Large Datasets Small Datasets
Hive vs. RDBMS
Hive RDBMS
Parallel Computations Serial Computations
Distributed system with
multiple machines
Single computer with
backup
Hive vs. RDBMS
Semi-structured data files
partitioned across machines
Structured data in tables on
one machine
Hive vs. RDBMS
Parallel Computations Serial Computations
Disk space cheap, can add
space by adding machines
Disk space expensive on a
single machine
Hive vs. RDBMS
Parallel Computations Serial Computations
Hive vs. RDBMS
Hive RDBMS
High Latency Low Latency
Records not indexed,
cannot be accessed quickly
Records indexed, can be
accessed and updated fast
Hive vs. RDBMS
Fetching a row runs a
MapReduce job that might take
minutes
Queries can be answered in
milliseconds or microseconds
Hive vs. RDBMS
High Latency Low Latency
Hive vs. RDBMS
Hive RDBMS
Read Operations Read/Write Operations
Not the owner of data
Hive vs. RDBMS
Hive stores files in HDFS
Hive files can be read and written by many
technologies
- Hadoop, Pig, Spark
Hive database schema cannot be enforced on these
files
No Data Ownership
Not the owner of data Sole gatekeeper for data
Read Operations Read/Write Operations
Hive vs. RDBMS
Schema-on-read
Read Operations Read/Write Operations
Hive vs. RDBMS
Number of columns, column types, constraints specified at table
creation
Hive tries to impose this schema when data is read
It may not succeed, may pad data with nulls
Schema-on-read
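Schema-on-read can be sketched in Python: the schema is applied only at read time, and rows missing columns are padded with `None` (Hive pads with NULL). The column names here are hypothetical:

```python
SCHEMA = ["id", "name", "state"]

def read_row(line):
    # The underlying file is just delimited text; the schema is
    # imposed only now, when the line is read.
    fields = line.rstrip("\n").split(",")
    fields += [None] * (len(SCHEMA) - len(fields))  # pad missing columns
    return dict(zip(SCHEMA, fields))

print(read_row("1,Alice,WA"))  # {'id': '1', 'name': 'Alice', 'state': 'WA'}
print(read_row("2,Bob"))       # {'id': '2', 'name': 'Bob', 'state': None}
```

Contrast this with schema-on-write, where a row that does not satisfy the schema is rejected before it is ever stored.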
Schema-on-read Schema-on-write
Read Operations Read/Write Operations
Hive vs. RDBMS
Large datasets
Parallel computations
High latency
Read operations
Not ACID compliant by default
HiveQL
Small datasets
Serial computations
Low latency
Read/write operations
ACID compliant
SQL
Hive RDBMS
Hive vs. RDBMS
Not ACID Compliant ACID Compliant
Data can be dumped into Hive
tables from any source
Only data which satisfies constraints
are stored in the database
Hive vs. RDBMS
Hive vs. RDBMS
Hive RDBMS
Schema on read, no constraints enforced
Minimal index support
Row level updates, deletes as a special case
Many more built-in functions
Only equi-joins allowed
Restricted subqueries
Schema on write: keys, not null, unique constraints
all enforced
Indexes allowed
Row level operations allowed in general
Basic built-in functions
No restriction on joins
Whole range of subqueries
Hive vs. RDBMS
Hive RDBMS
OLAP Capabilities
Partitioning and Bucketing
Partitioning Bucketing
Splits data into smaller, manageable parts
Enables performance optimizations
Partitioning and Bucketing
Partitioning Bucketing
Partitioning
Data may be naturally split into logical units
CA
OR
WA
GA
NY
CT
Customers in the US
Partitioning
Each of these units will be stored in a different directory
CA
OR
WA
GA
NY
CT
Partitioning
State specific queries will run only on data in one directory
CA
OR
WA
GA
NY
CT
Partitioning
Splits may not be of the same size
CA
OR
WA
GA
NY
CT
Partitioning and Bucketing
Partitioning Bucketing
Bucketing
Size of each split should be the same
Bucket 2
Bucket 1
Bucket 4
Bucket 3
Customers in the US
Bucketing
Bucket 2
Bucket 1
Bucket 4
Bucket 3
Customers in the US
Hash of a column value - address, name, timestamp, anything
Bucketing
Each bucket is a separate file
Bucket 2
Bucket 1
Bucket 4
Bucket 3
Bucketing
Makes sampling and joining data more efficient
Bucket 2
Bucket 1
Bucket 4
Bucket 3
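Bucketing can be sketched in Python: hash the chosen column and take the result modulo the number of buckets, so rows spread roughly evenly regardless of the data's natural skew (the rows here are made up; Hive uses its own hash function):

```python
import hashlib

NUM_BUCKETS = 4

def bucket_for(value):
    # A stable hash (md5) so the same value always lands in the
    # same bucket, across runs and machines.
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

customers = ["Alice,WA", "Bob,CA", "Carol,CA", "Dave,NY"]
buckets = {b: [] for b in range(NUM_BUCKETS)}
for row in customers:
    name = row.split(",")[0]          # bucket on the name column
    buckets[bucket_for(name)].append(row)
```

Because the mapping is deterministic, two tables bucketed the same way on the join key can be joined bucket-by-bucket, which is what makes joins and sampling efficient.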
Partitioning and Bucketing
Partitioning Bucketing
Partitioning and Bucketing
Rather than running queries on entire large tables
Partitioning and Bucketing
Run queries on subsets that are
manageable
Queries on Big Data
Partitioning and Bucketing of Tables
Join Optimizations Window Functions
Join Optimizations
Join operations are MapReduce jobs under
the hood
Join Optimizations
Optimize joins by reducing the amount of
data held in memory
Join Optimizations
Or by structuring joins as a map-only operation
Reducing Data Held in Memory
A 500GB table joined with a 5MB table
Reducing Data Held in Memory
The large table will be split across multiple
machines in the cluster
Reducing Data Held in Memory
One table is held in memory while the other is read
from disk
Reducing Data Held in Memory
For better performance the smaller table
should be held in memory
Joins as Map-only Operations
MapReduce operations have 2 phases of
processing
M R
Joins as Map-only Operations
Certain queries can be structured
to have no reduce phase
M R
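A map-side (broadcast) join can be sketched in Python: the small table is loaded into memory as a dict, and each mapper joins its own rows against it, so no reduce phase is needed. The table contents below are made up for the example:

```python
# The 5 MB side: small enough to hold entirely in each mapper's memory.
small = {"CA": "California", "WA": "Washington"}

# The 500 GB side: streamed row by row from disk, one split per mapper.
large = [("o1", "WA", 7), ("o2", "CA", 20), ("o3", "OR", 10)]

# Each mapper emits joined rows directly; rows with no match are
# dropped (an inner join).
joined = [(order_id, small[state], revenue)
          for order_id, state, revenue in large
          if state in small]
print(joined)  # [('o1', 'Washington', 7), ('o2', 'California', 20)]
```

Skipping the shuffle and reduce phases is what makes this faster than a regular join, but it only works when one side fits in memory.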
Queries on Big Data
Partitioning and Bucketing of Tables
Join Optimizations Window Functions
Window Functions
A suite of functions which are syntactic sugar
for complex queries
Window Functions
Make complex operations simple without needing
many intermediate calculations
Window Functions
Reduces the need for intermediate tables to store
temporary data
What were the top selling N products in last
week’s sale?
Window Functions
Window = one week
Operation = ranking product sales
What revenue percentile did this supplier fall into
for this quarter?
Window Functions
Window = one quarter
Operation = percentiles on revenues
What revenue percentile did this supplier fall into
for this quarter?
Window Functions
What were the top selling N products in last
week’s sale?
Can be expressed in a
single query
Running Total of Revenue
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-01 7
o2 Kent Apples 2017-01-02 20
o3 Bellevue Flowers 2017-01-02 10
o4 Redmond Meat 2017-01-03 40
o5 Seattle Potatoes 2017-01-04 9
o6 Bellevue Bread 2017-01-04 5
o7 Redmond Bread 2017-01-05 5
o8 Issaquah Onion 2017-01-05 4
Orders in a grocery store
Running Total of Revenue
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-01 7
o2 Kent Apples 2017-01-02 20
o3 Bellevue Flowers 2017-01-02 10
o4 Redmond Meat 2017-01-03 40
o5 Seattle Potatoes 2017-01-04 9
o6 Bellevue Bread 2017-01-04 5
o7 Redmond Bread 2017-01-05 5
o8 Issaquah Onion 2017-01-05 4
Sorted by order ID
Running Total of Revenue
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-01 7
o2 Kent Apples 2017-01-02 20
o3 Bellevue Flowers 2017-01-02 10
o4 Redmond Meat 2017-01-03 40
o5 Seattle Potatoes 2017-01-04 9
o6 Bellevue Bread 2017-01-04 5
o7 Redmond Bread 2017-01-05 5
o8 Issaquah Onion 2017-01-05 4
Running Total
7
27
37
77
86
91
96
100
Running Total of Revenue
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-01 7
o2 Kent Apples 2017-01-02 20
o3 Bellevue Flowers 2017-01-02 10
o4 Redmond Meat 2017-01-03 40
o5 Seattle Potatoes 2017-01-04 9
o6 Bellevue Bread 2017-01-04 5
o7 Redmond Bread 2017-01-05 5
o8 Issaquah Onion 2017-01-05 4
For each row:
Calculate sum over all rows till the current row
Running Total
7
27
37
77
86
91
96
100
Running Total of Revenue
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-01 7
o2 Kent Apples 2017-01-02 20
o3 Bellevue Flowers 2017-01-02 10
o4 Redmond Meat 2017-01-03 40
o5 Seattle Potatoes 2017-01-04 9
o6 Bellevue Bread 2017-01-04 5
o7 Redmond Bread 2017-01-05 5
o8 Issaquah Onion 2017-01-05 4
Operation = sum
Running Total
7
27
37
77
86
91
96
100
Running Total of Revenue
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-01 7
o2 Kent Apples 2017-01-02 20
o3 Bellevue Flowers 2017-01-02 10
o4 Redmond Meat 2017-01-03 40
o5 Seattle Potatoes 2017-01-04 9
o6 Bellevue Bread 2017-01-04 5
o7 Redmond Bread 2017-01-05 5
o8 Issaquah Onion 2017-01-05 4
Window is all rows from the top to the current row
Running Total
7
27
37
77
86
91
96
100
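This running total can be written as one window-function query. A runnable sketch using SQLite (window functions require SQLite 3.25+, bundled with recent Python builds) over the slide's orders; the HiveQL would be essentially the same:

```python
# Running total of revenue as a single window-function query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT, store TEXT, product TEXT, day TEXT, revenue INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", [
    ("o1", "Seattle",  "Bananas",  "2017-01-01", 7),
    ("o2", "Kent",     "Apples",   "2017-01-02", 20),
    ("o3", "Bellevue", "Flowers",  "2017-01-02", 10),
    ("o4", "Redmond",  "Meat",     "2017-01-03", 40),
    ("o5", "Seattle",  "Potatoes", "2017-01-04", 9),
    ("o6", "Bellevue", "Bread",    "2017-01-04", 5),
    ("o7", "Redmond",  "Bread",    "2017-01-05", 5),
    ("o8", "Issaquah", "Onion",    "2017-01-05", 4),
])

# Window = all rows from the top to the current row; operation = sum
rows = conn.execute("""
    SELECT id, revenue,
           SUM(revenue) OVER (ORDER BY id) AS running_total
    FROM orders
""").fetchall()
print(rows)
```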
Per Day Revenue Totals
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-11 7
o2 Kent Apples 2017-01-11 20
o3 Bellevue Flowers 2017-01-11 10
o4 Redmond Meat 2017-01-12 40
o5 Seattle Potatoes 2017-01-12 9
o6 Bellevue Bread 2017-01-12 5
o7 Redmond Bread 2017-01-13 5
o8 Issaquah Onion 2017-01-13 4
A revenue running total for each day
Per Day Revenue Totals
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-11 7
o2 Kent Apples 2017-01-11 20
o3 Bellevue Flowers 2017-01-11 10
o4 Redmond Meat 2017-01-12 40
o5 Seattle Potatoes 2017-01-12 9
o6 Bellevue Bread 2017-01-12 5
o7 Redmond Bread 2017-01-13 5
o8 Issaquah Onion 2017-01-13 4
Reset to 0 when a new day begins
Per Day Revenue Totals
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-11 7
o2 Kent Apples 2017-01-11 20
o3 Bellevue Flowers 2017-01-11 10
o4 Redmond Meat 2017-01-12 40
o5 Seattle Potatoes 2017-01-12 9
o6 Bellevue Bread 2017-01-12 5
o7 Redmond Bread 2017-01-13 5
o8 Issaquah Onion 2017-01-13 4
Operate on blocks of one day
Moving Averages
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-11 7
o2 Kent Apples 2017-01-11 20
o3 Bellevue Flowers 2017-01-11 10
o4 Redmond Meat 2017-01-12 40
o5 Seattle Potatoes 2017-01-12 9
o6 Bellevue Bread 2017-01-12 5
o7 Redmond Bread 2017-01-13 5
o8 Issaquah Onion 2017-01-13 4
Calculate the average for the last 3 items sold
Moving Averages
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-11 7
o2 Kent Apples 2017-01-11 20
o3 Bellevue Flowers 2017-01-11 10
o4 Redmond Meat 2017-01-12 40
o5 Seattle Potatoes 2017-01-12 9
o6 Bellevue Bread 2017-01-12 5
o7 Redmond Bread 2017-01-13 5
o8 Issaquah Onion 2017-01-13 4
3 previous rows from the current row
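The same moving average, sketched in plain Python. In HiveQL this would be roughly `AVG(revenue) OVER (ORDER BY id ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)`, i.e. the current row plus up to 3 previous rows:

```python
# Moving average over the current row and up to 3 previous rows,
# using the order revenues from the slide.
revenues = [7, 20, 10, 40, 9, 5, 5, 4]

moving_avgs = []
for i in range(len(revenues)):
    window = revenues[max(0, i - 3): i + 1]   # current row + up to 3 previous
    moving_avgs.append(sum(window) / len(window))

print(moving_avgs)
```

At the top of the table the window is smaller, because fewer than 3 previous rows exist yet.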
Percentage of the Total per Day
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-11 7
o2 Kent Apples 2017-01-11 20
o3 Bellevue Flowers 2017-01-11 10
o4 Redmond Meat 2017-01-12 40
o5 Seattle Potatoes 2017-01-12 9
o6 Bellevue Bread 2017-01-12 5
o7 Redmond Bread 2017-01-13 5
o8 Issaquah Onion 2017-01-13 4
Orders for one day, total revenue = 37
Percentage of the Total per Day
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-11 7
o2 Kent Apples 2017-01-11 20
o3 Bellevue Flowers 2017-01-11 10
o4 Redmond Meat 2017-01-12 40
o5 Seattle Potatoes 2017-01-12 9
o6 Bellevue Bread 2017-01-12 5
o7 Redmond Bread 2017-01-13 5
o8 Issaquah Onion 2017-01-13 4
% contribution of o1 = 7/37 * 100 = 19%
Percentage of the Total per Day
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-11 7
o2 Kent Apples 2017-01-11 20
o3 Bellevue Flowers 2017-01-11 10
o4 Redmond Meat 2017-01-12 40
o5 Seattle Potatoes 2017-01-12 9
o6 Bellevue Bread 2017-01-12 5
o7 Redmond Bread 2017-01-13 5
o8 Issaquah Onion 2017-01-13 4
% contribution of o2 = 20 /37 * 100 = 54%
Percentage of the Total per Day
ID Store Product Date Revenue
o1 Seattle Bananas 2017-01-11 7
o2 Kent Apples 2017-01-11 20
o3 Bellevue Flowers 2017-01-11 10
o4 Redmond Meat 2017-01-12 40
o5 Seattle Potatoes 2017-01-12 9
o6 Bellevue Bread 2017-01-12 5
o7 Redmond Bread 2017-01-13 5
o8 Issaquah Onion 2017-01-13 4
% contribution of o3 = 10 /37 * 100 = 27%
Percentage of the Total per Day
% contribution = (Revenue × 100) / Total Revenue
Percentage of the Total per Day
% contribution = (Revenue × 100) / sum(revenue) over (partition by day)
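The formula maps directly onto a window function. A runnable sketch using SQLite (3.25+ for window-function support) over the slide's per-day orders:

```python
# Percentage contribution of each order to its day's total revenue,
# computed with SUM(...) OVER (PARTITION BY day).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT, day TEXT, revenue INT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    ("o1", "2017-01-11", 7), ("o2", "2017-01-11", 20), ("o3", "2017-01-11", 10),
    ("o4", "2017-01-12", 40), ("o5", "2017-01-12", 9), ("o6", "2017-01-12", 5),
    ("o7", "2017-01-13", 5), ("o8", "2017-01-13", 4),
])

# % contribution = revenue * 100 / sum(revenue) over (partition by day)
rows = conn.execute("""
    SELECT id,
           ROUND(revenue * 100.0 / SUM(revenue) OVER (PARTITION BY day)) AS pct
    FROM orders
    ORDER BY id
""").fetchall()
print(rows)
```

For 2017-01-11 the day total is 37, so o1 contributes about 19%, o2 about 54% and o3 about 27%, matching the slide.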
Introducing Pig
Hive on Hadoop
HIVE
MapReduce
HDFS YARN
Hive runs all processes in the form of MapReduce jobs
under the hood
Allows processing of huge datasets
Hive on Hadoop
Errr… okay….
Where Does Data Come From?
Where Does Data Come From?
Mobile phones
GPS location coordinates
Self driving car sensors
Where Is This Data Stored?
Files on some storage system
Characteristics of Data
Unknown schema
Incomplete data
Inconsistent records
Apache Pig to the Rescue
Apache Pig to the Rescue
A high level scripting language to work
with data with unknown or inconsistent
schema
Pig
Part of the Hadoop eco-system
Works well with unstructured, incomplete data
Can work directly on files in HDFS
Used to get data into a data
warehouse
Pig Complements Hive
ETL Analysis
Extract, Transform, Load
Pull unstructured, inconsistent data from source,
clean it and place it in another database where it can
be analyzed
Apache Pig
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253
64.242.88.10 - - [07/Mar/2004:16:23:12 -0800] "GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore&param1=1.12&param2=1.12 HTTP/1.1" 200 11382
64.242.88.10 - - [07/Mar/2004:16:24:16 -0800] "GET /twiki/bin/view/Main/PeterThoeny HTTP/1.1" 200 4924
64.242.88.10 - - [07/Mar/2004:16:29:16 -0800] "GET /twiki/bin/edit/Main/Header_checks?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12851
64.242.88.10 - - [07/Mar/2004:16:30:29 -0800] "GET /twiki/bin/attach/Main/OfficeLocations HTTP/1.1" 401 12851
64.242.88.10 - - [07/Mar/2004:16:31:48 -0800] "GET /twiki/bin/view/TWiki/WebTopicEditTemplate HTTP/1.1" 200 3732
64.242.88.10 - - [07/Mar/2004:16:32:50 -0800] "GET /twiki/bin/view/Main/WebChanges HTTP/1.1" 200 40520
64.242.88.10 - - [07/Mar/2004:16:33:53 -0800] "GET /twiki/bin/edit/Main/Smtpd_etrn_restrictions?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12851
64.242.88.10 - - [07/Mar/2004:16:35:19 -0800] "GET /mailman/listinfo/business HTTP/1.1" 200 6379
64.242.88.10 - - [07/Mar/2004:16:36:22 -0800] "GET /twiki/bin/rdiff/Main/WebIndex?rev1=1.2&rev2=1.1 HTTP/1.1" 200 46373
64.242.88.10 - - [07/Mar/2004:16:37:27 -0800] "GET /twiki/bin/view/TWiki/DontNotify HTTP/1.1" 200 4140
64.242.88.10 - - [07/Mar/2004:16:39:24 -0800] "GET /twiki/bin/view/Main/TokyoOffice HTTP/1.1" 200 3853
64.242.88.10 - - [07/Mar/2004:16:43:54 -0800] "GET /twiki/bin/view/Main/MikeMannix HTTP/1.1" 200 3686
64.242.88.10 - - [07/Mar/2004:16:45:56 -0800] "GET /twiki/bin/attach/Main/PostfixCommands HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:16:47:12 -0800] "GET /robots.txt HTTP/1.1" 200 68
64.242.88.10 - - [07/Mar/2004:16:47:46 -0800] "GET /twiki/bin/rdiff/Know/ReadmeFirst?rev1=1.5&rev2=1.4 HTTP/1.1" 200 5724
64.242.88.10 - - [07/Mar/2004:16:49:04 -0800] "GET /twiki/bin/view/Main/TWikiGroups?rev=1.2 HTTP/1.1" 200 5162
64.242.88.10 - - [07/Mar/2004:16:50:54 -0800] "GET /twiki/bin/rdiff/Main/ConfigurationVariables HTTP/1.1" 200 59679
64.242.88.10 - - [07/Mar/2004:16:52:35 -0800] "GET /twiki/bin/edit/Main/Flush_service_name?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12851
Apache Pig
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/
Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
Server IP Address
Apache Pig
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/
Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
Date and Time
Apache Pig
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/
Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
Request Type
Apache Pig
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/
Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
URL
Apache Pig
64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/
Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
IP Date Time Request Type URL
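Pulling these fields out of raw log lines is exactly the kind of ETL work Pig is built for. As a plain-Python illustration (the regular expression below is an assumption for the demo, not Pig syntax), one line splits into the fields above:

```python
# Split one Apache access-log line into IP, timestamp, request type,
# URL, status and size with an illustrative regular expression.
import re

line = ('64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] '
        '"GET /twiki/bin/edit/Main/Double_bounce_sender'
        '?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846')

pattern = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) \S+" (\d+) (\d+)')
ip, timestamp, method, url, status, size = pattern.match(line).groups()
print(ip, timestamp, method, status)
```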
Pig Latin
A procedural, data flow language to extract, transform
and load data
Series of well-defined steps to perform
operations
No if statements or for loops
Procedural
Focused on transformations applied to the data
Written with a series of data operations in mind
Data Flow
Pig Latin
Data from one or more sources can be read, processed
and stored in parallel
Pig Latin
Cleans data, precomputes common aggregates before
storing in a data warehouse
Pig vs. SQL
SQL:
select sum(revenue) from revenues group by dept
Pig:
foreach (group revenues by dept) generate sum(revenue)
Pig vs. SQL
Pig SQL
A data flow language, transforms data to store in a
warehouse
Specifies exactly how data is to be modified at every
step
Purpose of processing is to store in a queryable
format
Used to clean data with inconsistent or incomplete
schema
A query language, is used for retrieving results
Abstracts away how queries are executed
Purpose of data extraction is analysis
Extract insights, generate reports, drive decisions
Hadoop
HDFS
File system to manage the
storage of data
Framework to process
data across multiple
servers
MapReduce YARN
Framework to run and
manage the data processing
tasks
Pig on Hadoop
HDFS MapReduce YARN
PIG
Pig runs on top of the Hadoop distributed computing
framework
HDFS MapReduce YARN
Reads files from HDFS, stores intermediate records in HDFS and
writes its final output to HDFS
PIG
Pig on Hadoop
HDFS MapReduce YARN
Decomposes operations into MapReduce jobs which run in parallel
PIG
Pig on Hadoop
HDFS MapReduce YARN
Provides non-trivial, built-in implementations of standard data
operations, which are very efficient
PIG
Pig on Hadoop
HDFS MapReduce YARN
Pig optimizes operations before MapReduce jobs are run, to
speed operations up
PIG
Pig on Hadoop
Pig on Hadoop
HDFS MapReduce YARN
PIG
Pig on Other Technologies
Apache Tez
PIG
Apache Spark
Apache Tez
PIG
Apache Spark
Tez is an extensible framework which improves on MapReduce by
making its operations faster
Pig on Other Technologies
Apache Tez
PIG
Apache Spark
Spark is another distributed computing technology which is
scalable, flexible and fast
Pig on Other Technologies
Pig vs. Hive
Pig Hive
Used to extract, transform and load
data into a data warehouse
Used by developers to bring together
useful data in one place
Uses Pig Latin, a procedural, data flow
language
Used to query data from a data
warehouse to generate reports
Used by analysts to retrieve business
information from data
Uses HiveQL, a structured query
language
Introducing Spark
Spark
General Purpose Interactive
Distributed Computing
An engine for data processing and analysis
General Purpose
Exploring
Cleaning and Preparing
Applying Machine Learning
Building Data Applications
Spark
General Purpose Interactive
Distributed Computing
An engine for data processing and analysis
Interactive
len([1,2,5])
3
REPL
REPL
Interactive
Read-Evaluate-Print-Loop
Interactive environments
Fast feedback
Spark
General Purpose Interactive
Distributed Computing
An engine for data processing and analysis
Distributed Computing
Integrate with Hadoop
Read data from HDFS
Process data across a cluster of
machines
Scala
Python
Java
Spark APIs
Almost all data is processed using
Resilient Distributed Datasets
RDDs are the main programming abstraction in Spark
Resilient Distributed Datasets
RDDs are in-memory collections of objects
In-memory,
yet resilient!
Resilient Distributed Datasets
With RDDs, you can interact and play with
billions of rows of data
…without caring about any of the
complexities
Resilient Distributed Datasets
Spark is made up of a few
different components
The basic functionality of Spark
RDDs
Spark Core
Spark Core is just a computing engine
It needs two additional components
Spark Core
A Storage System that
stores the data to be
processed
A Cluster Manager to help
Spark run tasks across a
cluster of machines
Spark Core
Storage
System
Cluster
Manager
Both of these are plug and play
components
Spark Core
Local file
system
HDFS
Storage System
Storage
System
Cluster
Manager
Spark Core
Built-in Cluster
Manager
YARN
Cluster Manager
Storage
System
Cluster
Manager
Plug and Play makes it easy to
integrate with Hadoop
Spark Core
For storage
For managing the
cluster
For computing
HDFS YARN MapReduce
A Hadoop Cluster
HDFS YARN MapReduce
Spark Core
Use Spark as an alternate/
additional compute engine
A Hadoop Cluster
Java 7 or above
Scala
IPython (Anaconda)
Prerequisites
Installing Spark
Download Spark binaries
Update environment variables
Configure iPython Notebook for Spark
Installing Spark
SPARK_HOME
Spark Environment Variables
PATH
Point to the folder where Spark has been
extracted
$SPARK_HOME/bin
%SPARK_HOME%\bin
Linux/Mac OS X
Windows
PYSPARK_DRIVER_PYTHON
Configuring IPython
PYSPARK_DRIVER_PYTHON_OPTS
ipython
“notebook”
Launch PySpark
> pyspark
This is just like a Python shell
Launch PySpark
Use Python functions, dicts, lists, etc.
Launch PySpark
You can import and use any installed Python
modules
Launch PySpark
Launches by default in a local non-distributed
mode
Launch PySpark
When the shell is launched it initializes a
SparkContext
SparkContext
The SparkContext represents a
connection to the Spark Cluster
SparkContext
Used to load data into memory from a
specified source
SparkContext
The data gets loaded into an RDD
SparkContext
Resilient
Distributed
Datasets
Partitions
Read-only
Lineage
RDDs represent
data in-memory
Partitions
1 Swetha 30 Bangalore
2 Vitthal 35 New Delhi
3 Navdeep 25 Mumbai
4 Janani 35 New Delhi
5 Navdeep 25 Mumbai
6 Janani 35 New Delhi
Data is divided into
partitions
Distributed to
multiple machines,
called nodes
Node 1
Node 2
Node 3
Nodes process
data in parallel
Resilient
Distributed
Datasets
Partitions
Read-only
Lineage
RDDs are immutable
Read-only
Only Two Types of Operations
Transformation Action
Transform into another
RDD
Request a result
Transformation
A data set
loaded into an
RDD
1 Swetha 30 Bangalore
2 Vitthal 35 New Delhi
3 Navdeep 25 Mumbai
4 Janani 35 New Delhi
5 Navdeep 25 Mumbai
6 Janani 35 New Delhi
The user may define a
chain of transformations
on the dataset
Transformation
1. Load data
2. Pick only the 3rd column
3. Sort the values
Transformation
Wait until a result is requested
before executing any of these
transformations
Transformation
Only Two Types of Operations
Transformation Action
Transform into another
RDD
Request a result
Action
Request a
result using an
action
1. The first 10 rows
2. A count
3. A sum
Data is processed only when the user requests
a result
The chain of transformations defined earlier is
executed
Action
Lazy Evaluation
Spark keeps a record of
the series of
transformations requested
by the user
Lazy Evaluation
It groups the transformations
in an efficient way when an
Action is requested
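Lazy evaluation can be sketched in a few lines of plain Python: transformations only record a plan (the lineage), and nothing executes until an action such as collect() is called. This is an illustration of the idea, not Spark's implementation:

```python
# Minimal sketch of lazy evaluation in the spirit of Spark RDDs:
# transformations build up a recorded plan; only an action runs it.
class LazyDataset:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []            # recorded transformations (the lineage)

    def map(self, fn):                     # transformation: returns a new dataset
        return LazyDataset(self._data, self._plan + [("map", fn)])

    def filter(self, pred):                # transformation: nothing runs yet
        return LazyDataset(self._data, self._plan + [("filter", pred)])

    def collect(self):                     # action: the plan executes only now
        rows = iter(self._data)
        for kind, fn in self._plan:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

ds = LazyDataset([1, 2, 3, 4, 5]).map(lambda x: x * 10).filter(lambda x: x > 20)
print(ds.collect())  # [30, 40, 50]
```

Defining `ds` above does no work at all; the whole chain runs in one pass only when `collect()` is called.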
Resilient
Distributed
Datasets
Partitions
Read-only
Lineage
When created, an RDD just holds
metadata
1. A transformation
2. Its parent RDD
Lineage
Every RDD knows
where it came from RDD 1
Transformation 1
RDD 2
Lineage
Lineage can be traced
back all the way to the
source
Lineage
RDD 1
Transformation 1
RDD 2
RDD 1
Transformation 1
data.csv
Load data
Lineage can be traced
back all the way to the
source
RDD 2
Lineage
When an action is
requested on an RDD
All its parent RDDs are
materialized
RDD 1
Transformation 1
data.csv
Load data
RDD 2
Lineage
Resilience
Lazy Evaluation
In-built fault tolerance
If something goes wrong, reconstruct
from source
Materialize only when
necessary
Lineage
Introducing Flink
Bounded datasets are processed in
batches
Unbounded datasets are processed as
streams
Batch vs. Stream Processing
Batch Stream
Bounded, finite datasets
Slow pipeline from data ingestion to analysis
Periodic updates as jobs complete
Order of data received unimportant
Single global state of the world at any point
in time
Unbounded, infinite datasets
Processing immediate, as data is received
Continuous updates as jobs run constantly
Order important, out of order arrival tracked
No global state, only history of events
received
Data is received as a stream
- Log messages
- Tweets
- Climate sensor data
Stream Processing
Process the data one entity at a time
- Filter error messages
- Find references to the latest movies
- Track weather patterns
Stream Processing
Store, display, act on filtered messages
- Trigger an alert
- Show trending graphs
- Warn of sudden squalls
Stream Processing
Streaming data
Stream Processing
Stream processing
Stream Processing
Stream Processing
Traditional Systems
Files Databases
Reliable storage as the source of
truth
Stream-first Architecture
Message transport
Stream processing
Files
Databases
Stream
Stream-first Architecture
Files
Databases
Stream
The stream as the source of truth
Buffer for event data
Performant and persistent
Decoupling multiple sources
from processing
Kafka, MapR streams
Message Transport
Stream-first Architecture
Message transport
Stream processing
Files
Databases
Stream
High throughput, low latency
Fault tolerance with low
overhead
Manage out of order events
Easy to use, maintainable
Replay streams
Streaming Spark, Storm, Flink
Stream Processing
A good approximation to stream
processing is the use of micro-
batches
Mimicking Streams Using Micro-batches
A stream of integers
Grouped into batches
Mimicking Streams Using Micro-batches
If the batches are small enough…
It approximates real-time stream
processing
Mimicking Streams Using Micro-batches
Exactly once semantics, replay micro-batches
Latency-throughput trade-off based on batch sizes
Spark Streaming, Storm Trident
Mimicking Streams Using Micro-batches
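Micro-batching can be sketched in a few lines: chunk the stream into fixed-size batches and process each batch as a unit (the stream and batch size below are illustrative):

```python
# Sketch of micro-batching: group an unbounded stream into fixed-size
# batches and process each batch as one unit (here, sum it).
from itertools import islice

def micro_batches(stream, batch_size):
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))  # take the next batch_size items
        if not batch:
            return
        yield batch

stream = [4, 3, 1, 6, 9, 0, 8, 7, 3, 4]
sums = [sum(b) for b in micro_batches(stream, 3)]
print(sums)   # one result per micro-batch
```

Shrinking `batch_size` moves the trade-off toward lower latency; growing it moves it toward higher throughput.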
Types of Windows
Types of Windows
Tumbling Window Sliding Window Session Window
Types of Windows
A stream of data
Tumbling Window
Fixed window size
Non-overlapping time
Number of entities differ within
a window
Tumbling Window
The window tumbles over the data, in a non-
overlapping manner
Tumbling Window
Apply the sum() operation on each window
Sliding Window
Fixed window size
Overlapping time - sliding
interval
Number of entities differ within
a window
Sliding Window
Apply the sum() operation on each window
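Count-based tumbling and sliding windows can be sketched as list slices; the window size and slide interval below are illustrative choices, not the slide's exact grouping:

```python
# Count-based tumbling vs. sliding windows over the same stream.
stream = [5, 6, 3, 1, 4, 3, 7, 2, 4, 3, 2, 7, 6, 8]
size, slide = 3, 1

# Tumbling: fixed-size windows that never overlap
tumbling = [stream[i:i + size] for i in range(0, len(stream), size)]

# Sliding: fixed-size windows advancing by a smaller interval, so they overlap
sliding = [stream[i:i + size] for i in range(0, len(stream) - size + 1, slide)]

print([sum(w) for w in tumbling])      # one sum per tumbling window
print([sum(w) for w in sliding][:3])   # first few sliding-window sums
```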
Session Window
Changing window size based on
session data
No overlapping time
Number of entities differ within a
window
Session gap determines window size
Session Window
This gap is not large enough to start a
new window
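A sketch of session windowing: a new window starts only when the gap between consecutive event timestamps exceeds the session gap (the timestamps and gap below are illustrative):

```python
# Sketch of session windows: consecutive events closer together than
# the session gap belong to the same window.
SESSION_GAP = 5  # seconds

def session_windows(timestamps, gap=SESSION_GAP):
    windows, current = [], [timestamps[0]]
    for t in timestamps[1:]:
        if t - current[-1] > gap:   # gap large enough -> start a new window
            windows.append(current)
            current = [t]
        else:                       # gap too small -> same session continues
            current.append(t)
    windows.append(current)
    return windows

events = [1, 2, 3, 12, 13, 30]
print(session_windows(events))   # [[1, 2, 3], [12, 13], [30]]
```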
Apache Flink is an open source,
distributed system built using the
stream-first architecture
The stream is the source of truth
Streaming execution model
Processing is continuous, one event at a
time
Apache Flink
Apache Flink
Everything is a stream
Batch processing with bounded datasets
are a special case of the unbounded
dataset
Same engine for all
Streaming and batch APIs both use the same
underlying architecture
Apache Flink
Handles out of order or late arriving data
Exactly once processing for stateful computations
Flexible windowing based on time, sessions, count,
etc.
Lightweight fault tolerance and checkpointing
Distributed, runs in large scale clusters
Stream Processing with Flink
Apache Flink
Runtime
Local
Single JVM
Cluster
Standalone, YARN
Cloud
GCE, EC2
DataStream API
Stream Processing
DataSet API
Batch Processing
CEP
Event Processing
Table
Relational
Flink ML
Machine Learning
Gelly
Graph
Table
Relational
DataStream API
Stream Processing
Apache Flink
Flink Programming Model
Data Source Data Sink
Transformations
Flink Programming Model
Transformations
Data
Source
Data
Sink
Transformations
A directed-acyclic graph
Data
Source
Data
Sink
BigQuery
BigQuery
- Enterprise Data Warehouse
- SQL queries - with Google storage underneath
- Fully-managed - no server, no resources deployed
- Access through: Web UI, REST API, clients
- Third party tools
Data Model
- Dataset = set of tables and views
- Table must belong to a dataset
- Dataset must belong to a project
- Tables contain records with rows and columns (fields)
- Nested and repeated fields are OK too
Table Schema
- Can be specified at creation time
- Can also specify schema during initial load
- Can update schema later too
Table Types
- Native tables: BigQuery storage
- External tables
- Views
- Bigtable
- Cloud Storage
- Google Drive
Schema Auto-Detection
- Available while:
- Loading data
- Querying external data
- BigQuery selects a random file in the data source and scans
up to 100 rows of data to use as a representative sample
- Then examines each field and attempts to assign a data
type to that field based on the values in the sample
Loading Data
- Batch loads: CSV, JSON (newline delimited), Avro,
GCP Datastore backups
- Streaming loads: high volume event tracking logs,
realtime dashboards
Other Google
Sources
- Cloud storage
- Analytics 360
- Datastore
- Dataflow
- Cloud storage logs
Data Formats
- CSV
- JSON (newline delimited)
- Avro (open source data format that bundles serialized data with
the data's schema in the same file)
- Cloud Datastore backups (BigQuery converts data from each
entity in Cloud Datastore backup files to BigQuery's data types)
Alternatives to
Loading
- Public datasets
- Shared datasets
- Stackdriver log files (needs export - but direct)
Querying and
Viewing
- Interactive queries
- Batch queries
- Views
- Partitioned tables
Interactive
Queries
- Default mode (executed as soon as possible)
- Count towards limits on
- Daily usage
- Concurrent usage
Batch Queries
- BigQuery will schedule these to run whenever possible (idle resources)
- Don’t count towards limit on concurrent usage
- If not started within 24 hours, BigQuery makes them interactive
Views
- BigQuery views are logical - not materialised
- Underlying query will execute each time view is accessed
- Billing will happen accordingly
- Can’t assign access control - based on user running view
- Can create authorised view: share query results with
groups without giving read access to underlying data
- Can give row-level permissions to different users within
same view
Views
- Can’t export data from a view
- Can’t use JSON API to retrieve data
- Can’t mix standard and legacy SQL
- e.g. a standard SQL query can’t access a legacy-SQL view
- No user-defined functions allowed
- No wildcard table references allowed
- Limit of 1000 authorised views per dataset
Partitioned
Tables
- Special table where data is partitioned for you
- No need to create partitions manually or programmatically
- Manual partitioning (table sharding) degrades performance
- Limit of 1000 tables per query does not apply
Partitioned
Tables
- Date-partitioned tables offered by BigQuery
- Need to declare table as partitioned at creation time
- No need to specify schema (can do while loading data)
- BigQuery automatically creates date partitions
Query Plan
Explanation - In the web UI, click on “Explanation”
Slots
- Unit of computational capacity needed to run queries
- BigQuery calculates on basis of query size, complexity
- Usually default slots sufficient
- Might need to be expanded for very large, complex queries
- Slots are subject to quota policies
- Can use StackDriver Monitoring to track slot usage
Dataflow
DataFlow
- Transform data
- Loosely semantically equivalent to Apache Spark
- Based on Apache Beam
- Older version of Dataflow (1.x) was not based on Beam
Flink Programming Model
Data Source Data Sink
Transformations
Recall
Flink Programming Model
Transformations
Data
Source
Data
Sink
Recall
Transformations
A directed-acyclic graph
Data
Source
Data
Sink
Recall
Idea of a pipeline of transformations
modelled as DAG is powerful
DAG
- Flink
- Apache Beam
- TensorFlow
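The shared idea can be sketched as a tiny pipeline: a source feeds a chain of transforms that ends in a sink. All names below are made up for the illustration; real DAG engines also support branching and merging, which this linear sketch omits:

```python
# Minimal sketch of the pipeline-as-DAG idea shared by Flink, Beam
# and TensorFlow: source -> transforms -> sink.
def pipeline(source, transforms, sink):
    data = source()
    for transform in transforms:          # each edge of this (linear) DAG
        data = transform(data)
    return sink(data)

result = pipeline(
    source=lambda: [1, 2, 3, 4],                  # read the input data
    transforms=[
        lambda xs: [x * x for x in xs],           # transform 1: square
        lambda xs: [x for x in xs if x > 2],      # transform 2: filter
    ],
    sink=sum,                                     # "write" by reducing to a total
)
print(result)  # 4 + 9 + 16 = 29
```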
Apache Beam Architecture
A directed-acyclic graph
Data
Source
Data
Sink
Apache Beam Architecture
Pipeline: Entire set of computations
Data
Source
Data
Sink
Apache Beam Architecture
PCollection: Data set in the pipeline
Data
Source
Data
Sink
Apache Beam Architecture
Transform: Data processing step in pipeline
Data
Source
Data
Sink
Apache Beam Architecture
I/O Source and Sink
Data
Source
Data
Sink
Apache Beam
- Pipeline: single, potentially repeatable job, from start to
finish, in Dataflow
- PCollection: specialized container classes that can
represent data sets of virtually unlimited size
- Transform: takes one or more PCollections as input,
performs a processing function that you provide on the
elements of that PCollection, and produces an output
PCollection.
- Source & Sink: different data storage formats, such as
files in Google Cloud Storage, BigQuery tables
Apache Beam Architecture
Pipeline: Entire set of computations
Data
Source
Data
Sink
Pipeline
- Pipeline: single, potentially repeatable job, from start to
finish, in Dataflow
- Encapsulates series of computations that accepts some
input data from external sources, transforms data to
provide some useful intelligence, and produce output
- Defined by driver program
- The actual pipeline computations run on a backend,
abstracted in the driver by a runner
Driver and
Runner
- Driver defines computation DAG (pipeline)
- Runner executes DAG on a backend
- Beam supports multiple backends
- Apache Spark
- Apache Flink
- Google Cloud Dataflow
- Beam Model
Typical Beam
Driver Program
- Create a Pipeline object
- Create an initial PCollection for pipeline data
- Source API to read data from an external source
- Create transform to build a PCollection from in-memory data.
- Define the Pipeline transforms to change, filter, group, analyse
PCollections
- transforms do not change input collection
- Output the final, transformed PCollection(s), typically using the
Sink API to write data to an external sink.
- Run the pipeline using the designated Pipeline Runner.
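The driver-program steps above can be sketched in miniature. This is a pure-Python toy model: the `Pipeline` and `PCollection` classes here are illustrative stand-ins, not the real Apache Beam SDK (there you would use `beam.Pipeline`, `beam.Create`, `beam.Map` and a runner).

```python
class PCollection:
    """A data set flowing through the pipeline; transforms never mutate it."""
    def __init__(self, elements):
        self.elements = list(elements)

class Pipeline:
    """Records the chain of transforms; a runner would execute it on a backend."""
    def __init__(self):
        self.steps = []

    def create(self, data):
        # Source step: build an initial PCollection from in-memory data
        return PCollection(data)

    def apply(self, pcoll, fn):
        # Transform step: produce a NEW PCollection, leaving the input intact
        self.steps.append(fn.__name__)
        return PCollection(fn(e) for e in pcoll.elements)

# Driver program: create a Pipeline, an initial PCollection, then transforms
p = Pipeline()
lines = p.create(["hello beam", "hello dataflow"])
upper = p.apply(lines, str.upper)
```

In the real Beam Python SDK the same shape is written with the pipe operator, e.g. `p | beam.Create([...]) | beam.Map(str.upper)`, and nothing executes until the designated runner runs the DAG.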
PCollection
- PCollection: specialized container classes that can
represent data sets of virtually unlimited size
- Fixed size: text file or BigQuery table
- Unbounded: Pub/Sub subscription
Apache Beam Architecture
Side Inputs: Inject additional data into some PCollection
Transforms
- Transform takes one or more PCollections as input,
performs a processing function that you provide on the
elements of that PCollection, and produces an output
PCollection
- “What/Where/When/How”
Sources and
Sinks
- Source & Sink: different data storage formats, such as
files in Google Cloud Storage, BigQuery tables
- Custom sources and sinks possible too
Dataproc
Dataproc
- Managed Hadoop + Spark
- Includes: Hadoop, Spark, Hive and Pig
- “No-ops”: create cluster, use it, turn it off
- Use Google Cloud Storage, not HDFS - else billing will hurt
- Ideal for moving existing code to GCP
Cluster Machine
Types
- Built using Compute Engine VM instances
- Cluster - Need at least 1 master and 2 workers
- Preemptible instances - OK if used with care
Preemptible VMs
- Used for processing only - not data storage
- No preemptible-only clusters - at least 2 non-preemptible
workers will be added automatically
- Minimum persistent disk size - smaller of 100GB and primary
worker boot disk size (for local caching - not part of HDFS)
Initialisation
Actions
- Can specify scripts to be run from GitHub or Cloud Storage
- Can do so via GCP console, gcloud CLI or programmatically
- Run as root (i.e. no sudo required)
- Use absolute paths
- Use shebang line to indicate script interpreter
Scaling Clusters
- Can scale even when jobs are running
- Operations for scaling are:
- Add workers
- Remove workers
- Add HDFS storage
High Availability
(Beta)
- While creating Dataproc cluster, specify “High Availability”
configuration
- Will run 3 masters rather than 1
- The 3 masters will run in an Apache Zookeeper cluster for
automatic failover
Single Node
Clusters (Beta)
- Just 1 node - master as well as worker
- Can’t be a preemptible VM
- Use for prototyping, educational use etc
Restartable Jobs
(Beta)
- By default, Dataproc jobs do NOT restart on failure
- Can optionally change this - useful for long-running and
streaming jobs (eg Spark Streaming)
- Mitigates out-of-memory, unscheduled reboots
Connectors
- BigQuery
- Bigtable
- Cloud Storage
Datalab
Datalab
- Interactive code execution in notebooks
- Basically, Jupyter/IPython notebooks for running code
- Packaged as container running on a VM instance
- Need to connect from browser to Datalab container on GCP
Notebooks
- Better than text files for code
- Include code, documentation (markdown), results
- Results can be text, images, HTML/Javascript
- Notebooks can be in a Cloud Source Repository (git repository)
Persistent Disk
- Notebook can be cloned from the Cloud Source Repository to VM
persistent disk
- This clone => workspace => add/remove/modify files
- Notebooks auto-saved, but if persistent disk deleted and code
not committed to git, changes will be lost
Service Account
- Code can access BigQuery, Cloud ML etc using a service
account
- Service account needs to be authorised accordingly
Kernel
- Opening a notebook => backend kernel process manages
session and variables
- Each notebook has 1 python kernel - if N notebooks, then N
processes
- Kernels are single-threaded
- Memory usage is heavy - execution is slow - pick machine type
accordingly
Connecting to
Datalab
- gcloud SDK: running the Docker container in a Compute
Engine VM and connecting to it through an SSH tunnel
- Cloud Shell: running the Docker container in a Compute
Engine VM and connecting to it through Cloud Shell
- Local machine: running the Docker container on your local
machine (must have Docker installed)
Pub/Sub
Pub/Sub
- Messaging “middleware”
- Many-to-many asynchronous messaging
- Decouple sender and receiver
Message Life
A publisher application creates a topic in the Google
Cloud Pub/Sub service and sends messages to the
topic. A message contains a payload and optional
attributes that describe the payload content.
Message Life
Messages are persisted in a message store until
they are delivered and acknowledged by
subscribers.
Message Life
The Pub/Sub service forwards messages
from a topic to all of its subscriptions,
individually.
Each subscription receives messages either
by Pub/Sub pushing them to the subscriber's
chosen endpoint, or by the subscriber pulling
them from the service
Message Life
The subscriber receives pending messages
from its subscription and acknowledges each
one to the Pub/Sub service
Message Life
When a message is acknowledged by the
subscriber, it is removed from the
subscription's queue of messages
Basics
- Publisher apps create and send messages on a Topic
- Subscriber apps subscribe to a topic to receive messages
- Subscription is a queue (message stream) to a subscriber
- Message = data + attributes sent by publisher to a topic
- Message Attributes = key-value pairs sent by publisher with
message
Basics
- Publisher apps create and send messages on a Topic
- Subscriber apps subscribe to a topic to receive messages
- Once acknowledged by subscriber, message deleted from queue
- Messages persisted in a message store until delivered/
acknowledged
- One queue per subscription
- Push - WebHook endpoint
- Pull - HTTPS request to endpoint
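The delivery semantics above (one queue per subscription, messages persisted until acknowledged, fan-out to every subscription) can be modelled in a few lines. This is a toy sketch: the `Topic`/`Subscription` classes are hypothetical stand-ins, not the google-cloud-pubsub client API.

```python
from collections import deque

class Topic:
    def __init__(self, name):
        self.name = name
        self.subscriptions = []

    def subscribe(self, name):
        sub = Subscription(name)
        self.subscriptions.append(sub)
        return sub

    def publish(self, data, **attributes):
        message = {"data": data, "attributes": attributes}
        for sub in self.subscriptions:     # fan-out: every subscription
            sub.queue.append(message)      # gets its own copy

class Subscription:
    def __init__(self, name):
        self.name = name
        self.queue = deque()               # message store for this subscription

    def pull(self):
        return self.queue[0] if self.queue else None

    def ack(self):
        self.queue.popleft()               # only an ack removes a message

topic = Topic("orders")
sub_a = topic.subscribe("billing")
sub_b = topic.subscribe("shipping")
topic.publish("order-42", region="eu")

msg = sub_a.pull()   # billing receives the pending message
sub_a.ack()          # removed from billing's queue only; shipping keeps its copy
```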
Use-cases
- Balancing workloads in network clusters
- Asynchronous order processing
- Distributing event notifications
- Refreshing distributed caches
- Logging to multiple systems simultaneously
- Data streaming
- Reliability improvement
Architecture
- Data plane, which handles moving messages between publishers
and subscribers
- Control plane, which handles the assignment of publishers and
subscribers to servers on the data plane
- The servers in the data plane are called forwarders, and the
servers in the control plane are called routers.
Message Life
- A publisher sends a message
Message Life
- The message is written to storage.
Message Life
- Google Cloud Pub/Sub sends an acknowledgement to
the publisher that it has received the message and
guarantees its delivery to all attached subscriptions
Message Life
- At the same time as writing the
message to storage, Google Cloud
Pub/Sub delivers it to subscribers
Message Life
- Subscribers send an acknowledgement to
Google Cloud Pub/Sub that they have
processed the message.
Message Life
- Once at least one subscriber for each subscription has
acknowledged the message, Google Cloud Pub/Sub
deletes the message from storage
Publishers
- Any application that can make HTTPS requests to
googleapis.com
- App Engine app
- App running on Compute Engine instance
- App running on third party network
- Any mobile or desktop app
- Even a browser
Subscribers
- Pull subscribers - any app that can make HTTPS requests to
googleapis.com
- Push subscribers - must be a WebHook endpoint that can accept
POST requests over HTTPS

Single Co-ordinating Software
- First: store millions of records on multiple machines
- Second: run processes on all these machines to crunch data
- Google File System - to solve distributed storage
- MapReduce - to solve distributed computing
- Apache developed open-source versions of these technologies
Hadoop
- HDFS: a file system to manage the storage of data
- MapReduce: a framework to process data across multiple servers
- In 2013, Apache released Hadoop 2.0
- MapReduce was broken into two separate parts:
- MapReduce: a framework to define a data processing task
- YARN: a framework to run the data processing task
- Each of these components has corresponding configuration files
Co-ordination Between Hadoop Blocks
- User defines map and reduce tasks using the MapReduce API
- A job is triggered on the cluster
- YARN figures out where and how to run the job, and stores the result in HDFS
Hadoop Ecosystem
- An ecosystem of tools has sprung up around this core piece of software
Hive
- Provides an SQL interface to Hadoop
- The bridge to Hadoop for folks who don't have exposure to OOP in Java
HBase
- A database management system on top of Hadoop
- Integrates with your application just like a traditional database
Pig
- A data manipulation language
- Transforms unstructured data into a structured format
- Query this structured data using interfaces like Hive
Spark
- A distributed computing engine used along with Hadoop
- Interactive shell to quickly process datasets
- Has a bunch of built-in libraries for machine learning, stream processing, graph processing etc.
Oozie
- A tool to schedule workflows on all the Hadoop ecosystem technologies
Kafka
- Stream processing for unbounded datasets
HDFS
- Built on commodity hardware
- Highly fault tolerant - hardware failure is the norm
- Suited to batch processing - data access has high throughput rather than low latency
- Supports very large data sets
- Manages file storage across multiple disks
- Each disk on a different machine in a cluster
- A cluster of machines, each machine a node
- 1 node is the master node
Name Node and Data Nodes
- If the data in the distributed file system is a book:
- The name node is the table of contents
- The data nodes hold the actual text in each page
- Name node: manages the overall file system; stores the directory structure and the metadata of the files
- Data nodes: physically store the data in the files
Storing a File in HDFS
- Break a large text file into blocks (Block 1 ... Block 8)
- Different length files are treated the same way
- Storage is simplified
- The block is the unit for replication and fault tolerance
- The blocks are of size 128 MB
- Block size is a trade-off: a larger size reduces parallelism, a smaller size increases overhead
- This size helps minimize the time taken to seek to the block on the disk
- Store the blocks across the data nodes - each node contains a partition or split of the data
- How do we know where the splits of a particular file are?
- The name node stores the mapping, e.g. File 1 Block 1 -> DN 1, File 1 Block 3 -> DN 2
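The block-splitting and name-node bookkeeping above can be sketched as a toy model. Node names are made up, `BLOCK_SIZE` stands in for the 128 MB default, and simple round-robin placement stands in for HDFS's real policy.

```python
BLOCK_SIZE = 4
DATA_NODES = ["DN1", "DN2", "DN3", "DN4"]

def store_file(name, data):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    name_node = {}                                # (file, block#) -> data node
    data_nodes = {dn: {} for dn in DATA_NODES}    # actual block contents
    for i, block in enumerate(blocks):
        dn = DATA_NODES[i % len(DATA_NODES)]
        name_node[(name, i)] = dn                 # the "table of contents"
        data_nodes[dn][(name, i)] = block         # the "text in each page"
    return name_node, data_nodes

def read_file(name, name_node, data_nodes):
    # 1. Look up block locations in the name node; 2. read from those nodes
    out, i = [], 0
    while (name, i) in name_node:
        out.append(data_nodes[name_node[(name, i)]][(name, i)])
        i += 1
    return "".join(out)

name_node, data_nodes = store_file("file1", "twinkle twinkle little star")
```

Reading a file follows the same two steps the next slides describe: consult the name node's metadata, then fetch block contents from the respective data nodes.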
Reading a File in HDFS
- 1. Use metadata in the name node to look up block locations
- 2. Read the blocks from the respective locations
- Request goes to the name node, which returns the block-to-node mapping (e.g. Block 1, Block 2 -> DN 1)
- Then read the actual contents of the blocks from those data nodes
Challenges of Distributed Storage
- Failure management in the data nodes
- Failure management for the name node
- What if one of the blocks gets corrupted?
- What if a data node containing some blocks crashes?
Replication
- 1. Replicate blocks based on the replication factor
- 2. Store replicas in different locations
- The replica locations are also stored in the name node
Maximize Redundancy
- Servers in a data center are arranged in racks (Rack 1, Rack 2, Rack 3)
- Store replicas "far away" i.e. on different nodes
Minimize Write Bandwidth
- This requires that replicas be stored close to each other
- One node runs the replication pipeline and chooses the location for the first replica
- Data is forwarded from there to the next replica location, then forwarded further to the next
- Forwarding requires a large amount of bandwidth, which increases the cost of write operations
Default Hadoop Replication Strategy
- Balances both needs: maximize redundancy and minimize write bandwidth
- First location chosen at random
- Second location has to be on a different rack (if possible)
- Third replica is on the same rack as the second, but on a different node
- Reduces inter-rack traffic and improves write performance
- Read operations are sent to the rack closest to the client
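The default placement strategy above can be sketched as a small function: first replica at random, second on a different rack, third on the second replica's rack but a different node. Rack and node names are made up for illustration.

```python
import random

RACKS = {"rack1": ["r1n1", "r1n2"],
         "rack2": ["r2n1", "r2n2"],
         "rack3": ["r3n1", "r3n2"]}

def place_replicas(racks, rng=random):
    rack_a = rng.choice(sorted(racks))             # first: random rack and node
    first = rng.choice(racks[rack_a])
    rack_b = rng.choice([r for r in sorted(racks) if r != rack_a])  # different rack
    second = rng.choice(racks[rack_b])
    third = rng.choice([n for n in racks[rack_b] if n != second])   # same rack, other node
    return [first, second, third]
```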
MapReduce
- Requires running processes on many machines
- MapReduce is a programming paradigm
- Takes advantage of the inherent parallelism in data processing
- Modern systems generate millions of records of raw data
- A task of this scale is processed in two stages: map and reduce
- map: an operation performed in parallel, on small portions of the dataset
- reduce: an operation to combine the intermediate results of the map step
- The programmer defines these 2 functions - Hadoop does the rest, behind the scenes
Counting Word Frequencies
- Consider a large text file ("Twinkle twinkle little star / How I wonder what you are / ...")
- Goal: word frequencies, e.g. above 14, are 20, how 21, star 22, twinkle 32 ... How?
MapReduce Flow
- The raw data is really large (potentially in petabytes)
- It's distributed across many machines in a cluster; each machine holds a partition of the data
- Each partition is given to a different process, i.e. to mappers
- Each mapper works in parallel; within each mapper, the rows are processed serially
- Each row emits {key, value} pairs, e.g. {twinkle, 1} {twinkle, 1} {little, 1} {star, 1}
- The results are passed on to another process, i.e. a reducer
- The reducer combines the values with the same key, e.g. four {twinkle, 1} pairs -> {twinkle, 4}
Key Insight Behind MapReduce
- Many data processing tasks can be expressed in this form: map -> <K,V> pairs -> reduce
Answer Two Questions
- 1. What {key, value} pairs should be emitted in the map step?
- 2. How should values with the same key be combined?
- For word count: emit {word, 1} for each word in each line; combine by summing
- Answer these to parallelize any task :)
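The two questions above, answered for word count in a pure-Python sketch. A real Hadoop job would express this with the Java MapReduce API; this only illustrates the shape of the computation.

```python
from collections import defaultdict

def map_step(line):
    # Q1: emit a {word, 1} pair for each word in each line
    return [(word.lower(), 1) for word in line.split()]

def reduce_step(pairs):
    # Q2: combine values that share a key by summing them
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

lines = ["Twinkle twinkle little star", "How I wonder what you are"]
emitted = [pair for line in lines for pair in map_step(line)]  # shuffle omitted
counts = reduce_step(emitted)   # e.g. counts["twinkle"] == 2
```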
Implementing in Java
- Map: a class where the map logic is implemented
- Reduce: a class where the reduce logic is implemented
- Main: a driver program that sets up the job
Map Step
- The map logic is implemented in a class that extends the Mapper class
- Mapper is a generic class with 4 type parameters: <input key type, input value type, output key type, output value type>
Reduce Step
- The reduce logic is implemented in a class that extends the Reducer class
- Reducer is also a generic class with 4 type parameters: <input key type, input value type, output key type, output value type>
Matching Data Types
- The output types of the Mapper should match the input types of the Reducer
Setting up the Job
- The Mapper and Reducer classes are used by a Job that is configured in the Main class
- The Job has a bunch of properties that need to be configured: Mapper class, Reducer class, output data types, input filepath, output filepath
YARN
- Yet Another Resource Negotiator
- Co-ordinates tasks running on the cluster
- Assigns new nodes in case of failure
- ResourceManager: runs on a single master node; schedules tasks across nodes
- NodeManager: runs on all other nodes; manages tasks on the individual node
Submitting a Job
- The ResourceManager finds a NodeManager with free capacity
- All processes on a node are run within a container
- A container is the logical unit for the resources the process needs - memory, CPU etc.
- A container executes a specific application; 1 NodeManager can have multiple containers
- The ResourceManager starts off the Application Master within the container
- The Application Master performs the computation required for the task
- If additional resources are required, the Application Master makes the request
- It requests containers for mappers and reducers; the request includes CPU and memory requirements
- The ResourceManager assigns additional nodes and notifies the original Application Master which made the request
- The Application Master on the original node starts off the Application Masters on the newly assigned nodes
The Location Constraint
- Try to minimize write bandwidth, i.e. assign a process to the same node where the data to be processed lives
- If CPU and memory are not available there - wait!
  • 151.
  • 152.
    Hadoop: HDFS is the file system to manage the storage of data; MapReduce is the framework to process data across multiple servers; YARN is the framework to run and manage the data processing tasks.
  • 153.
    Hive on Hadoop: Hive runs on top of the Hadoop distributed computing framework (MapReduce, HDFS, YARN).
  • 154.
    Hive on Hadoop: Hive stores its data in HDFS.
  • 155.
    Hadoop Distributed File System (HDFS): data is stored as files (text files, binary files), partitioned across machines in the cluster and replicated for fault tolerance; processing tasks are parallelized across multiple machines.
  • 156.
    Hive on Hadoop: Hive runs all processes in the form of MapReduce jobs under the hood.
  • 157.
    MapReduce: a parallel programming model; defines the logic to process data on multiple machines; batch processing operations on files in HDFS; usually written in Java using the Hadoop MapReduce library.
  • 158.
    Do we have to write MapReduce code to work with Hive? No.
  • 159.
    HiveQL (Hive Query Language): a SQL-like interface to the underlying data.
  • 160.
    HiveQL: modeled on the Structured Query Language (SQL); familiar to analysts and engineers; simple query constructs - select, group by, join.
  • 161.
    HiveQL: Hive exposes files in HDFS in the form of tables to the user.
  • 162.
    HiveQL: write a SQL-like query in HiveQL and submit it to Hive.
  • 163.
    HiveQL: Hive will translate the query to MapReduce tasks and run them on Hadoop.
  • 164.
    HiveQL: MapReduce will process files on HDFS and return results to Hive.
  • 165.
    Hive abstracts away the details of the underlying MapReduce jobs; work with Hive almost exactly like you would with a traditional database.
  • 166.
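The kind of work Hive hides can be pictured with a small sketch. The HiveQL query in the comment and the `orders` rows below are hypothetical, and plain Python stands in for the MapReduce jobs Hive would actually generate:

```python
# Hypothetical HiveQL query (illustrative only):
#   SELECT store, SUM(revenue) FROM orders GROUP BY store;
# The same group-by aggregation in plain Python, to show the kind of
# computation Hive translates into MapReduce jobs under the hood.

def group_by_sum(rows, key, value):
    """Aggregate `value` per distinct `key`, like SELECT key, SUM(value) ... GROUP BY key."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals

orders = [
    {"store": "Seattle", "revenue": 7},
    {"store": "Kent", "revenue": 20},
    {"store": "Seattle", "revenue": 9},
]
print(group_by_sum(orders, "store", "revenue"))  # {'Seattle': 16, 'Kent': 20}
```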
    The Hive Metastore: a Hive user sees data as if it were stored in tables.
  • 167.
    The Hive Metastore exposes the file-based storage of HDFS in the form of tables.
  • 168.
    The Hive Metastore: the bridge between data stored in files and the tables exposed to users.
  • 169.
    The Hive Metastore stores metadata for all the tables in Hive; maps the files and directories in Hive to tables; holds table definitions and the schema for each table; has information on converting files to table representations.
  • 170.
    Any database with a JDBC driver can be used as a metastore.
  • 171.
    The embedded metastore: development environments use the built-in Derby database.
  • 172.
    The embedded metastore runs in the same Java process as Hive itself, and allows one Hive session to connect to the database.
  • 173.
    Production environments - Local metastore: allows multiple sessions to connect to Hive. Remote metastore: separate processes for Hive and the metastore.
  • 174.
    Hive vs. RDBMS: computation, data size, latency, operations, ACID compliance, query language.
  • 175.
    Hive vs. RDBMS - Hive: large datasets, parallel computations, high latency, read operations, not ACID compliant by default, HiveQL. RDBMS: small datasets, serial computations, low latency, read/write operations, ACID compliant, SQL.
  • 176.
    Large datasets vs. small datasets: gigabytes or petabytes vs. megabytes or gigabytes.
  • 178.
    Large datasets vs. small datasets: calculating trends vs. accessing and updating individual records.
  • 179.
    Parallel computations vs. serial computations: a distributed system with multiple machines vs. a single computer with backup.
  • 181.
    Parallel computations vs. serial computations: semi-structured data files partitioned across machines vs. structured data in tables on one machine.
  • 182.
    Parallel computations vs. serial computations: disk space is cheap and can be added by adding machines vs. disk space is expensive on a single machine.
  • 183.
    High latency vs. low latency: records not indexed, cannot be accessed quickly vs. records indexed, can be accessed and updated fast.
  • 185.
    High latency vs. low latency: fetching a row will run a MapReduce job that might take minutes vs. queries answered in milliseconds or microseconds.
  • 186.
    Read operations vs. read/write operations: Hive is not the owner of its data.
  • 188.
    No Data Ownership: Hive stores files in HDFS; Hive files can be read and written by many technologies - Hadoop, Pig, Spark; the Hive database schema cannot be enforced on these files.
  • 189.
    Read operations vs. read/write operations: not the owner of data vs. sole gatekeeper for data.
  • 190.
  • 191.
    Schema-on-read: the number of columns, column types and constraints are specified at table creation; Hive tries to impose this schema when data is read; it may not succeed, and may pad data with nulls.
  • 192.
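One reading of schema-on-read, sketched in plain Python with a hypothetical schema: the schema is applied only at read time, and records that are too short are padded with None (playing the role of Hive's NULL) rather than rejected.

```python
# A sketch of schema-on-read (column names are assumptions for illustration):
# the schema is imposed when data is read; short records are padded with None.

SCHEMA = ["id", "store", "revenue"]  # hypothetical table schema

def read_with_schema(line, schema=SCHEMA):
    fields = line.strip().split(",")
    # Pad missing trailing fields with None rather than rejecting the record
    fields += [None] * (len(schema) - len(fields))
    return dict(zip(schema, fields))

print(read_with_schema("o1,Seattle,7"))  # all columns present
print(read_with_schema("o2,Kent"))       # 'revenue' padded with None
```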
    Read operations vs. read/write operations: schema-on-read vs. schema-on-write.
  • 193.
    Not ACID compliant vs. ACID compliant: data can be dumped into Hive tables from any source vs. only data which satisfies constraints is stored in the database.
  • 195.
    Hive vs. RDBMS - Hive: schema on read, no constraints enforced; minimal index support; row-level updates and deletes only as a special case; many more built-in functions; only equi-joins allowed; restricted subqueries. RDBMS: schema on write (keys, not null, unique all enforced); indexes allowed; row-level operations allowed in general; basic built-in functions; no restriction on joins; whole range of subqueries.
  • 197.
  • 198.
    Partitioning and Bucketing: both split data into smaller, manageable parts.
  • 199.
    Partitioning and Bucketing: both enable performance optimizations.
  • 200.
    Partitioning: data may be naturally split into logical units, e.g. customers in the US by state - CA, OR, WA, GA, NY, CT.
  • 201.
    Partitioning: each of these units will be stored in a different directory.
  • 202.
    Partitioning: state-specific queries will run only on data in one directory.
  • 203.
    Partitioning: splits may not be of the same size.
  • 204.
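A minimal sketch of the partitioning idea in Python. The customer records and the `/customers/state=.../` directory layout are assumptions for illustration; the point is that each logical unit maps to its own directory, and splits can end up very different in size:

```python
# Partitioning sketch: group records by a partition column (state) and map
# each group to its own directory. Record values are hypothetical.
from collections import defaultdict

customers = [("c1", "CA"), ("c2", "WA"), ("c3", "CA"), ("c4", "NY"), ("c5", "CA")]

partitions = defaultdict(list)
for cust_id, state in customers:
    partitions[state].append(cust_id)

# One directory per partition, e.g. /customers/state=CA/
layout = {f"/customers/state={s}/": rows for s, rows in partitions.items()}
print(layout)  # note the uneven split sizes: CA holds three records
```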
  • 205.
    Bucketing: the size of each split should be the same, e.g. customers in the US divided into Bucket 1, Bucket 2, Bucket 3, Bucket 4.
  • 206.
    Bucketing: records are assigned to buckets by the hash of a column value - address, name, timestamp, anything.
  • 207.
    Bucketing: each bucket is a separate file.
  • 208.
    Bucketing makes sampling and joining data more efficient.
  • 209.
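The hash-assignment step can be sketched in a few lines. The names and the bucket count are hypothetical, and a CRC32 hash stands in for whatever hash function the engine actually uses; the point is that the hash modulo the bucket count deterministically places each record:

```python
# Bucketing sketch: hash a chosen column value and take it modulo the
# number of buckets. zlib.crc32 is used only because it is stable across
# runs, so the assignment is reproducible.
import zlib

NUM_BUCKETS = 4

def bucket_for(value, num_buckets=NUM_BUCKETS):
    return zlib.crc32(value.encode()) % num_buckets

names = ["Swetha", "Vitthal", "Navdeep", "Janani"]
assignments = {name: bucket_for(name) for name in names}
print(assignments)  # each name lands deterministically in one of 4 buckets
```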
  • 210.
    Partitioning and Bucketing: rather than running queries on such tables as a whole.
  • 211.
    Partitioning and Bucketing: run queries on subsets that are manageable.
  • 212.
    Queries on Big Data: Partitioning and Bucketing of Tables, Join Optimizations, Window Functions.
  • 213.
    Join Optimizations: join operations are MapReduce jobs under the hood.
  • 214.
    Join Optimizations: optimize joins by reducing the amount of data held in memory.
  • 215.
    Join Optimizations: or by structuring joins as a map-only operation.
  • 216.
    Reducing Data Held in Memory: a 500GB table joined with a 5MB table.
  • 217.
    Reducing Data Held in Memory: the large table will be split across multiple machines in the cluster.
  • 218.
    Reducing Data Held in Memory: one table is held in memory while the other is read from disk.
  • 219.
    Reducing Data Held in Memory: for better performance, the smaller table should be held in memory.
  • 220.
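The idea behind holding the small table in memory can be sketched as a hash join: load the small side into a dict once, then stream the large side past it. The table contents below are hypothetical stand-ins for the 5MB and 500GB sides:

```python
# Map-side join sketch: build an in-memory dict from the small table, then
# stream the large table and probe the dict, with no reduce phase needed.

small_table = {"s1": "Seattle", "s2": "Kent"}  # store_id -> city (the 5MB side)
large_table = [("o1", "s1", 7), ("o2", "s2", 20), ("o3", "s1", 9)]  # the 500GB side

joined = [
    (order_id, small_table[store_id], revenue)
    for order_id, store_id, revenue in large_table
    if store_id in small_table  # stream the big side, probe the small side
]
print(joined)  # [('o1', 'Seattle', 7), ('o2', 'Kent', 20), ('o3', 'Seattle', 9)]
```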
    Joins as Map-only Operations: MapReduce operations have two phases of processing, Map and Reduce.
  • 221.
    Joins as Map-only Operations: certain queries can be structured to have no reduce phase.
  • 222.
    Window Functions: a suite of functions which are syntactic sugar for complex queries.
  • 224.
    Window Functions: make complex operations simple without needing many intermediate calculations.
  • 225.
    Window Functions: reduce the need for intermediate tables to store temporary data.
  • 226.
    What were the top selling N products in last week's sale? Window = one week; operation = ranking product sales.
  • 227.
    What revenue percentile did this supplier fall into for this quarter? Window = one quarter; operation = percentiles on revenues.
  • 228.
    Window Functions: "What were the top selling N products in last week's sale?" and "What revenue percentile did this supplier fall into for this quarter?" can each be expressed in a single query.
  • 234.
    Running Total of Revenue - orders in a grocery store (ID, Store, Product, Date, Revenue): o1 Seattle Bananas 2017-01-01 7; o2 Kent Apples 2017-01-02 20; o3 Bellevue Flowers 2017-01-02 10; o4 Redmond Meat 2017-01-03 40; o5 Seattle Potatoes 2017-01-04 9; o6 Bellevue Bread 2017-01-04 5; o7 Redmond Bread 2017-01-05 5; o8 Issaquah Onion 2017-01-05 4.
  • 235.
    Running Total of Revenue: the table is sorted by order ID.
  • 236.
    Running Total of Revenue: the running totals for o1..o8 are 7, 27, 37, 77, 86, 91, 96, 100.
  • 237.
    Running Total of Revenue: for each row, calculate the sum over all rows up to and including the current row, giving 7, 27, 37, 77, 86, 91, 96, 100.
  • 242.
    Running Total of Revenue: the operation is sum.
  • 243.
    Running Total of Revenue: the window is all rows from the top to the current row.
  • 244.
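The running total above can be computed directly to check the figures; each row's value is the sum of all revenues up to and including that row:

```python
# Running total over the grocery-store revenues, orders o1..o8 sorted by ID.
revenues = [7, 20, 10, 40, 9, 5, 5, 4]

running_total = []
total = 0
for r in revenues:
    total += r                   # the window is every row up to this one
    running_total.append(total)

print(running_total)  # [7, 27, 37, 77, 86, 91, 96, 100]
```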
    Per-Day Revenue Totals: a revenue running total for each day (the same orders o1..o8, now dated 2017-01-11 to 2017-01-13).
  • 245.
    Per-Day Revenue Totals: the total resets to 0 when a new day begins.
  • 246.
    Per-Day Revenue Totals: operate on blocks of one day.
  • 247.
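The per-day reset behaves roughly like a windowed sum partitioned by date. A sketch over the same orders, assuming the data arrives grouped by day:

```python
# Per-day running totals: the sum resets whenever a new day begins.
orders = [  # (date, revenue) for orders o1..o8
    ("2017-01-11", 7), ("2017-01-11", 20), ("2017-01-11", 10),
    ("2017-01-12", 40), ("2017-01-12", 9), ("2017-01-12", 5),
    ("2017-01-13", 5), ("2017-01-13", 4),
]

totals, current_day, total = [], None, 0
for day, revenue in orders:
    if day != current_day:       # a new day begins: reset to 0
        current_day, total = day, 0
    total += revenue
    totals.append(total)

print(totals)  # [7, 27, 37, 40, 49, 54, 5, 9]
```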
    Moving Averages: calculate the average for the last 3 items sold.
  • 248.
    Moving Averages: the window is the 3 previous rows from the current row.
  • 249.
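One reading of that window, sketched over the same revenues: each row averages its own value with up to the three rows before it (the exact window bounds are an assumption here):

```python
# Moving-average sketch: average the current row with up to the 3 preceding
# rows. At the start of the table the window is simply shorter.
revenues = [7, 20, 10, 40, 9, 5, 5, 4]

moving_avg = []
for i, _ in enumerate(revenues):
    window = revenues[max(0, i - 3): i + 1]  # current row plus up to 3 preceding
    moving_avg.append(sum(window) / len(window))

print([round(m, 2) for m in moving_avg])
```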
    Percentage of the Total per Day: orders for one day (o1 = 7, o2 = 20, o3 = 10), total revenue = 37.
  • 252.
    Percentage of the Total per Day: % contribution of o1 = 7/37 * 100 = 19%.
  • 253.
    Percentage of the Total per Day: % contribution of o2 = 20/37 * 100 = 54%.
  • 254.
    Percentage of the Total per Day: % contribution of o3 = 10/37 * 100 = 27%.
  • 255.
    Percentage of the Total per Day: % contribution = (Revenue / Total Revenue) x 100.
  • 256.
    Percentage of the Total per Day: % contribution = Revenue x 100 / sum(revenue) over (partition by day).
  • 257.
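The worked percentages above can be checked directly for the first day (orders of 7, 20 and 10, day total 37):

```python
# Percent-of-day-total for the first day's orders o1, o2, o3.
day_revenues = [7, 20, 10]     # 2017-01-11
day_total = sum(day_revenues)  # 37

contributions = [round(r * 100 / day_total) for r in day_revenues]
print(contributions)  # [19, 54, 27]
```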
  • 258.
    Hive on Hadoop (HIVE, MapReduce, HDFS, YARN): Hive runs all processes in the form of MapReduce jobs under the hood.
  • 259.
    Hive on Hadoop: allows processing of huge datasets. Errr… okay….
  • 260.
    Where Does Data Come From?
  • 261.
    Where Does Data Come From? Mobile phones, GPS location coordinates, self-driving car sensors.
  • 262.
    Where Is This Data Stored? Files on some storage system.
  • 263.
    Characteristics of Data: unknown schema, incomplete data, inconsistent records.
  • 264.
    Apache Pig to the Rescue
  • 265.
    Apache Pig to the Rescue: a high level scripting language to work with data with unknown or inconsistent schema.
  • 266.
    Pig: part of the Hadoop eco-system; works well with unstructured, incomplete data; can work directly on files in HDFS; used to get data into a data warehouse.
  • 267.
  • 268.
    Extract, Transform, Load: pull unstructured, inconsistent data from source, clean it and place it in another database where it can be analyzed.
  • 269.
    Apache Pig 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846 64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523 64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291 64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352 64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253 64.242.88.10 - - [07/Mar/2004:16:23:12 -0800] "GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore&param1=1.12&param2=1.12 HTTP/1.1" 200 11382 64.242.88.10 - - [07/Mar/2004:16:24:16 -0800] "GET /twiki/bin/view/Main/PeterThoeny HTTP/1.1" 200 4924 64.242.88.10 - - [07/Mar/2004:16:29:16 -0800] "GET /twiki/bin/edit/Main/Header_checks?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12851 64.242.88.10 - - [07/Mar/2004:16:30:29 -0800] "GET /twiki/bin/attach/Main/OfficeLocations HTTP/1.1" 401 12851 64.242.88.10 - - [07/Mar/2004:16:31:48 -0800] "GET /twiki/bin/view/TWiki/WebTopicEditTemplate HTTP/1.1" 200 3732 64.242.88.10 - - [07/Mar/2004:16:32:50 -0800] "GET /twiki/bin/view/Main/WebChanges HTTP/1.1" 200 40520 64.242.88.10 - - [07/Mar/2004:16:33:53 -0800] "GET /twiki/bin/edit/Main/Smtpd_etrn_restrictions?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12851 64.242.88.10 - - [07/Mar/2004:16:35:19 -0800] "GET /mailman/listinfo/business HTTP/1.1" 200 6379 64.242.88.10 - - [07/Mar/2004:16:36:22 -0800] "GET /twiki/bin/rdiff/Main/WebIndex?rev1=1.2&rev2=1.1 HTTP/1.1" 200 46373 64.242.88.10 - - [07/Mar/2004:16:37:27 -0800] "GET /twiki/bin/view/TWiki/DontNotify HTTP/1.1" 200 4140 64.242.88.10 - - [07/Mar/2004:16:39:24 -0800] "GET /twiki/bin/view/Main/TokyoOffice HTTP/1.1" 200 3853 64.242.88.10 - - [07/Mar/2004:16:43:54 -0800] "GET /twiki/bin/view/Main/MikeMannix HTTP/1.1" 200 
3686 64.242.88.10 - - [07/Mar/2004:16:45:56 -0800] "GET /twiki/bin/attach/Main/PostfixCommands HTTP/1.1" 401 12846 64.242.88.10 - - [07/Mar/2004:16:47:12 -0800] "GET /robots.txt HTTP/1.1" 200 68 64.242.88.10 - - [07/Mar/2004:16:47:46 -0800] "GET /twiki/bin/rdiff/Know/ReadmeFirst?rev1=1.5&rev2=1.4 HTTP/1.1" 200 5724 64.242.88.10 - - [07/Mar/2004:16:49:04 -0800] "GET /twiki/bin/view/Main/TWikiGroups?rev=1.2 HTTP/1.1" 200 5162 64.242.88.10 - - [07/Mar/2004:16:50:54 -0800] "GET /twiki/bin/rdiff/Main/ConfigurationVariables HTTP/1.1" 200 59679 64.242.88.10 - - [07/Mar/2004:16:52:35 -0800] "GET /twiki/bin/edit/Main/Flush_service_name?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12851
  • 273.
    Apache Pig: 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846 (Client IP Address)
  • 274.
    Apache Pig: 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846 (Date and Time)
  • 275.
    Apache Pig: 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846 (Request Type)
  • 276.
    Apache Pig: 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846 (URL)
  • 277.
    Apache Pig: 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846 (fields: IP, Date, Time, Request Type, URL)
  • 278.
    Pig Latin: a procedural, data flow language to extract, transform and load data.
  • 279.
    Procedural: a series of well-defined steps to perform operations; no if statements or for loops.
  • 280.
    Data Flow: focused on transformations applied to the data; written with a series of data operations in mind.
  • 281.
    Pig Latin: data from one or more sources can be read, processed and stored in parallel.
  • 282.
    Pig Latin: cleans data, precomputes common aggregates before storing in a data warehouse.
  • 283.
    Pig vs. SQL - SQL: select sum(revenue) from revenues group by dept. Pig: foreach (group revenues by dept) generate sum(revenue).
  • 284.
    Pig vs. SQL - Pig: a data flow language, transforms data to store in a warehouse; specifies exactly how data is to be modified at every step; the purpose of processing is to store data in a queryable format; used to clean data with inconsistent or incomplete schema. SQL: a query language, used for retrieving results; abstracts away how queries are executed; the purpose of data extraction is analysis; extract insights, generate reports, drive decisions.
  • 285.
    Hadoop: HDFS is the file system to manage the storage of data; MapReduce is the framework to process data across multiple servers; YARN is the framework to run and manage the data processing tasks.
  • 286.
    Pig on Hadoop (HDFS, MapReduce, YARN, PIG): Pig runs on top of the Hadoop distributed computing framework.
  • 287.
    Pig on Hadoop: reads files from HDFS, stores intermediate records in HDFS and writes its final output to HDFS.
  • 288.
    Pig on Hadoop: decomposes operations into MapReduce jobs which run in parallel.
  • 289.
    Pig on Hadoop: provides non-trivial, built-in implementations of standard data operations, which are very efficient.
  • 290.
    Pig on Hadoop: Pig optimizes operations before MapReduce jobs are run, to speed operations up.
  • 291.
    Pig on Hadoop (HDFS, MapReduce, YARN, PIG)
  • 292.
    Pig on Other Technologies: Apache Tez, Apache Spark.
  • 293.
    Pig on Other Technologies: Tez is an extensible framework which improves on MapReduce by making its operations faster.
  • 294.
    Pig on Other Technologies: Spark is another distributed computing technology which is scalable, flexible and fast.
  • 295.
    Pig vs. Hive - Pig: used to extract, transform and load data into a data warehouse; used by developers to bring together useful data in one place; uses Pig Latin, a procedural, data flow language. Hive: used to query data from a data warehouse to generate reports; used by analysts to retrieve business information from data; uses HiveQL, a structured query language.
  • 296.
  • 297.
    Spark: a general purpose, interactive, distributed computing engine for data processing and analysis.
  • 298.
    General Purpose: exploring, cleaning and preparing, applying machine learning, building data applications.
  • 299.
  • 301.
  • 302.
    Distributed Computing: integrates with Hadoop, reads data from HDFS, processes data across a cluster of machines.
  • 304.
  • 305.
    Almost all data is processed using Resilient Distributed Datasets.
  • 306.
    Resilient Distributed Datasets: RDDs are the main programming abstraction in Spark.
  • 307.
    Resilient Distributed Datasets: RDDs are in-memory collections of objects. In-memory, yet resilient!
  • 308.
    Resilient Distributed Datasets: with RDDs, you can interact and play with billions of rows of data, without caring about any of the complexities.
  • 309.
    Spark is made up of a few different components.
  • 310.
    Spark Core: the basic functionality of Spark RDDs.
  • 311.
    Spark Core is just a computing engine; it needs two additional components.
  • 312.
    Spark Core needs a Storage System that stores the data to be processed, and a Cluster Manager to help Spark run tasks across a cluster of machines.
  • 313.
    Storage System and Cluster Manager: both of these are plug-and-play components.
  • 314.
  • 315.
  • 316.
  • 317.
    Storage System and Cluster Manager: plug-and-play makes it easy to integrate with Hadoop.
  • 318.
    A Hadoop Cluster: HDFS for storage, YARN for managing the cluster, MapReduce for computing.
  • 319.
    A Hadoop Cluster (HDFS, YARN, MapReduce) with Spark Core: use Spark as an alternate/additional compute engine.
  • 320.
    Installing Spark - prerequisites: Java 7 or above, Scala, IPython (Anaconda).
  • 321.
    Installing Spark: download the Spark binaries, update environment variables, configure the IPython Notebook for Spark.
  • 322.
    Spark Environment Variables: SPARK_HOME points to the folder where Spark has been extracted; add $SPARK_HOME/bin (Linux/Mac OS X) or %SPARK_HOME%\bin (Windows) to PATH.
  • 323.
  • 324.
  • 325.
    Launch PySpark: this is just like a Python shell.
  • 326.
    Launch PySpark: use Python functions, dicts, lists, etc.
  • 327.
    Launch PySpark: you can import and use any installed Python modules.
  • 328.
    Launch PySpark: launches by default in a local, non-distributed mode.
  • 329.
    SparkContext: when the shell is launched it initializes a SparkContext.
  • 330.
    SparkContext: the SparkContext represents a connection to the Spark cluster.
  • 331.
    SparkContext: used to load data into memory from a specified source.
  • 332.
    SparkContext: the data gets loaded into an RDD.
  • 333.
  • 334.
    RDDs represent data in-memory, e.g. the rows: 1 Swetha 30 Bangalore; 2 Vitthal 35 New Delhi; 3 Navdeep 25 Mumbai; 4 Janani 35 New Delhi; 5 Navdeep 25 Mumbai; 6 Janani 35 New Delhi.
  • 335.
    Partitions: data is divided into partitions and distributed to multiple machines.
  • 336.
    Partitions: distributed to multiple machines, called nodes.
  • 339.
  • 340.
    Partitions: nodes process data in parallel (Node 1, Node 2, Node 3).
  • 341.
  • 342.
  • 343.
    Only Two Types of Operations - Transformation: transform into another RDD. Action: request a result.
  • 344.
    Transformation: a data set loaded into an RDD.
  • 345.
    Transformation: the user may define a chain of transformations on the dataset.
  • 346.
    Transformation: 1. Load data. 2. Pick only the 3rd column. 3. Sort the values.
  • 347.
    Transformation: wait until a result is requested before executing any of these transformations.
  • 349.
    Action: request a result using an action - 1. The first 10 rows. 2. A count. 3. A sum.
  • 351.
    Action: data is processed only when the user requests a result; the chain of transformations defined earlier is executed.
  • 352.
    Lazy Evaluation: Spark keeps a record of the series of transformations requested by the user.
  • 353.
    Lazy Evaluation: it groups the transformations in an efficient way when an Action is requested.
  • 354.
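A rough analogy for lazy evaluation, using plain Python generators rather than Spark itself: the "transformations" below only build a recipe, and nothing runs until the "action" asks for a result. The sample rows are the same hypothetical ones used earlier:

```python
# Lazy-evaluation analogy with generators: the pipeline is only a recipe
# until an action (here, sum) pulls results through it.
data = [["1", "Swetha", "30"], ["2", "Vitthal", "35"], ["3", "Navdeep", "25"]]

# Transformations: nothing is computed yet
ages = (int(row[2]) for row in data)    # pick the 3rd column
adults = (a for a in ages if a >= 30)   # filter

# Action: the whole chain executes only now
result = sum(adults)
print(result)  # 65
```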
  • 355.
    Lineage: when created, an RDD just holds metadata - 1. A transformation. 2. Its parent RDD.
  • 356.
    Lineage: every RDD knows where it came from (RDD 1, Transformation 1, RDD 2).
  • 357.
    Lineage can be traced back all the way to the source.
  • 358.
    Lineage can be traced back all the way to the source: data.csv, load data into RDD 1, Transformation 1, RDD 2.
  • 359.
Lineage: when an action is requested on an RDD, all its parent RDDs are materialized (data.csv → Load data → RDD 1 → Transformation 1 → RDD 2)
  • 360.
Resilience: Lineage + Lazy Evaluation = in-built fault tolerance. If something goes wrong, reconstruct from source; materialize only when necessary
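The lineage idea, where every RDD records its parent and the transformation that produced it, can be sketched as follows. Again a toy model under invented names, not the Spark implementation:

```python
# Sketch of lineage: every dataset records its parent and the transformation
# that produced it, so it can always be rebuilt from the original source.
class RDD:
    def __init__(self, parent=None, transform=None, source=None):
        self.parent, self.transform, self.source = parent, transform, source

    def apply(self, fn):
        # Child RDD holds only metadata: its parent and one transformation.
        return RDD(parent=self, transform=fn)

    def materialize(self):
        if self.parent is None:
            return list(self.source)      # base RDD: reload from the source
        # Recursively materialize the parent, then apply this step.
        return self.transform(self.parent.materialize())

source = RDD(source=[3, 1, 2])
child = source.apply(lambda d: [x * 10 for x in d]).apply(sorted)
print(child.materialize())  # [10, 20, 30]
# If a partition is lost, the same lineage chain reconstructs it from source.
```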
  • 361.
  • 362.
Bounded datasets are processed in batches; unbounded datasets are processed as streams
  • 363.
Batch vs. Stream Processing. Batch: bounded, finite datasets; slow pipeline from data ingestion to analysis; periodic updates as jobs complete; order of data received unimportant; single global state of the world at any point in time. Stream: unbounded, infinite datasets; processing immediate, as data is received; continuous updates as jobs run constantly; order important, out-of-order arrival tracked; no global state, only history of events received
  • 364.
Stream Processing: data is received as a stream - log messages - tweets - climate sensor data
  • 365.
Stream Processing: process the data one entity at a time - filter error messages - find references to the latest movies - track weather patterns
  • 366.
Stream Processing: store, display, act on filtered messages - trigger an alert - show trending graphs - warn of sudden squalls
  • 367.
  • 368.
  • 369.
  • 370.
Traditional Systems: files and databases as reliable storage, the source of truth
  • 371.
Stream-first Architecture: Stream → Message transport → Stream processing → Files, Databases
  • 372.
  • 373.
Message Transport: buffer for event data; performant and persistent; decouples multiple sources from processing; e.g. Kafka, MapR Streams
  • 374.
Stream-first Architecture: Stream → Message transport → Stream processing → Files, Databases
  • 375.
Stream Processing: high throughput, low latency; fault tolerance with low overhead; manage out-of-order events; easy to use, maintainable; replay streams; e.g. Spark, Storm, Flink
  • 376.
A good approximation to stream processing is the use of micro-batches
  • 377.
Mimicking Streams Using Micro-batches: a stream of integers 4 3 1 6 9 0 8 7 3 4
  • 378.
Mimicking Streams Using Micro-batches: grouped into batches, e.g. 4 3 1 | 6 9 0 | 8 7 3 | 4
  • 379.
Mimicking Streams Using Micro-batches: if the batches are small enough, it approximates real-time stream processing
  • 380.
Mimicking Streams Using Micro-batches: exactly once semantics, replay micro-batches; latency-throughput trade-off based on batch sizes; e.g. Spark Streaming, Storm Trident
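The micro-batching idea can be sketched in a few lines of plain Python (the batch size of 3 and the `sum` per batch are illustrative choices, not anything from Spark Streaming or Storm Trident):

```python
# Sketch: approximating a stream by grouping incoming integers into
# fixed-size micro-batches and processing each batch as a unit.
def micro_batches(stream, batch_size):
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # flush the final, possibly smaller batch
        yield batch

stream = [4, 3, 1, 6, 9, 0, 8, 7, 3, 4]
sums = [sum(b) for b in micro_batches(stream, 3)]
print(sums)  # [8, 15, 18, 4]
```

Smaller batches lower latency but raise per-batch overhead, which is the latency-throughput trade-off the slide mentions.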
  • 381.
  • 382.
Types of Windows: Tumbling Window, Sliding Window, Session Window
  • 383.
Types of Windows: a stream of data
  • 384.
Tumbling Window: fixed window size; non-overlapping time; number of entities differs within a window
  • 385.
Tumbling Window: the window tumbles over the data in a non-overlapping manner
  • 386.
Tumbling Window: apply the sum() operation on each window, e.g. 5 6 3 | 1 4 2 | 7 4 3 2 3 | 7 6 8 gives 14, 7, 19, 21
  • 387.
Sliding Window: fixed window size; overlapping time (sliding interval); number of entities differs within a window
  • 388.
Sliding Window: fixed window size; overlapping time (sliding interval); number of entities differs within a window
  • 389.
Sliding Window: apply the sum() operation on each overlapping window of the stream, yielding sums such as 14, 8, 13, 15, 19, 14
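Tumbling and sliding windows can be sketched over a finite list in plain Python. Real engines usually window by time; count-based windows (size 3, slide 2 here, both illustrative) keep the example self-contained:

```python
# Sketch of count-based windows over a finite stream.
def tumbling(stream, size):
    # Non-overlapping windows: each element belongs to exactly one window.
    return [stream[i:i + size] for i in range(0, len(stream), size)]

def sliding(stream, size, slide):
    # Overlapping windows: a new window starts every `slide` elements.
    return [stream[i:i + size] for i in range(0, len(stream) - size + 1, slide)]

stream = [5, 6, 3, 1, 4, 3, 7, 2, 4]
print([sum(w) for w in tumbling(stream, 3)])     # [14, 8, 13]
print([sum(w) for w in sliding(stream, 3, 2)])   # [14, 8, 14, 13]
```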
  • 390.
Session Window: changing window size based on session data; no overlapping time; number of entities differs within a window; session gap determines window size
  • 391.
Session Window: changing window size based on session data; no overlapping time; number of entities differs within a window; session gap determines window size
  • 392.
Session Window: changing window size based on session data; no overlapping time; number of entities differs within a window; session gap determines window size
  • 393.
Session Window: this gap is not large enough to start a new window
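The session-gap rule can be sketched in plain Python: a new window starts only when the gap since the previous event exceeds the session gap. The timestamps, gap of 5, and event values are all invented for illustration:

```python
# Sketch of session windows: events carry timestamps, and a new window
# starts only when the gap since the previous event exceeds session_gap.
def session_windows(events, session_gap):
    """events: list of (timestamp, value) pairs sorted by timestamp."""
    windows, current = [], []
    last_ts = None
    for ts, value in events:
        if last_ts is not None and ts - last_ts > session_gap:
            windows.append(current)   # gap large enough: close the window
            current = []
        current.append(value)
        last_ts = ts
    if current:
        windows.append(current)
    return windows

events = [(0, "a"), (1, "b"), (2, "c"), (10, "d"), (11, "e"), (30, "f")]
print(session_windows(events, session_gap=5))
# [['a', 'b', 'c'], ['d', 'e'], ['f']]
```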
  • 394.
Apache Flink is an open source, distributed system built using the stream-first architecture. The stream is the source of truth
  • 395.
Apache Flink: streaming execution model; processing is continuous, one event at a time
  • 396.
Apache Flink: everything is a stream; batch processing of bounded datasets is a special case of the unbounded dataset
  • 397.
Apache Flink: same engine for all; streaming and batch APIs both use the same underlying architecture
  • 398.
Stream Processing with Flink: handles out-of-order or late-arriving data; exactly once processing for stateful computations; flexible windowing based on time, sessions, count, etc.; lightweight fault tolerance and checkpointing; distributed, runs in large scale clusters
  • 399.
Apache Flink Runtime: Local (single JVM), Cluster (Standalone, YARN), Cloud (GCE, EC2). APIs and libraries: DataStream API (stream processing), DataSet API (batch processing), CEP (event processing), Table (relational), Flink ML (machine learning), Gelly (graph)
  • 400.
  • 401.
Flink Programming Model: Data Source → Transformations → Data Sink
  • 402.
  • 403.
  • 404.
  • 405.
BigQuery - Enterprise Data Warehouse - SQL queries - with Google storage underneath - Fully-managed - no server, no resources deployed - Access through: Web UI, REST API, clients - Third party tools
  • 406.
Data Model - Dataset = set of tables and views - Table must belong to a dataset - Dataset must belong to a project - Tables contain records with rows and columns (fields) - Nested and repeated fields are OK too
  • 407.
Table Schema - Can be specified at creation time - Can also specify schema during initial load - Can update schema later too
  • 408.
Table Types - Native tables: BigQuery storage - External tables: BigTable, Cloud Storage, Google Drive - Views
  • 409.
Schema Auto-Detection - Available while loading data and querying external data - BigQuery selects a random file in the data source and scans up to 100 rows of data to use as a representative sample - Then examines each field and attempts to assign a data type to that field based on the values in the sample
  • 410.
Loading Data - Batch loads: CSV, JSON (newline delimited), Avro, GCP Datastore backups - Streaming loads: high volume event tracking logs, realtime dashboards
  • 411.
Other Google Sources - Cloud Storage - Analytics 360 - Datastore - Dataflow - Cloud storage logs
  • 412.
Data Formats - CSV - JSON (newline delimited) - Avro (open source data format that bundles serialized data with the data's schema in the same file) - Cloud Datastore backups (BigQuery converts data from each entity in Cloud Datastore backup files to BigQuery's data types)
  • 413.
Alternatives to Loading - Public datasets - Shared datasets - Stackdriver log files (needs export - but direct)
  • 414.
Querying and Viewing - Interactive queries - Batch queries - Views - Partitioned tables
  • 415.
Interactive Queries - Default mode (executed as soon as possible) - Count towards limits on daily usage and concurrent usage
  • 416.
Batch Queries - BigQuery will schedule these to run whenever possible (idle resources) - Don’t count towards limit on concurrent usage - If not started within 24 hours, BigQuery makes them interactive
  • 417.
Views - BigQuery views are logical - not materialised - Underlying query will execute each time the view is accessed - Billing will happen accordingly - Can’t assign access control - based on user running view - Can create authorised view: share query results with groups without giving read access to underlying data - Can give row-level permissions to different users within same view
  • 418.
Views - Can’t export data from a view - Can’t use JSON API to retrieve data - Can’t mix standard and legacy SQL - e.g. standard SQL query can’t access legacy-SQL view - No user-defined functions allowed - No wildcard table references allowed - Limit of 1000 authorised views per dataset
  • 419.
Partitioned Tables - Special table where data is partitioned for you - No need to create partitions manually or programmatically - Manual partitions - performance degrades - Limit of 1000 tables per query does not apply
  • 420.
Partitioned Tables - Date-partitioned tables offered by BigQuery - Need to declare table as partitioned at creation time - No need to specify schema (can do while loading data) - BigQuery automatically creates date partitions
  • 421.
Query Plan Explanation - In the web UI, click on “Explanation”
  • 422.
Slots - Unit of computational capacity needed to run queries - BigQuery calculates on basis of query size, complexity - Usually default slots sufficient - Might need to be expanded for very large, complex queries - Slots are subject to quota policies - Can use Stackdriver Monitoring to track slot usage
  • 423.
  • 424.
Dataflow - Transform data - Loosely semantically equivalent to Apache Spark - Based on Apache Beam - Older version of Dataflow (1.x) was not based on Beam
  • 425.
Recall the Flink Programming Model: Data Source → Transformations → Data Sink
  • 426.
  • 427.
  • 428.
Recall: transformations form a directed-acyclic graph from Data Source to Data Sink. The idea of a pipeline of transformations modelled as a DAG is powerful
  • 429.
DAG - Flink - Apache Beam - TensorFlow
  • 430.
Apache Beam Architecture: a directed-acyclic graph from Data Source to Data Sink
  • 431.
Apache Beam Architecture - Pipeline: entire set of computations, from Data Source to Data Sink
  • 432.
Apache Beam Architecture - PCollection: data set in the pipeline
  • 433.
Apache Beam Architecture - Transform: data processing step in the pipeline
  • 434.
Apache Beam Architecture - I/O: Source and Sink
  • 435.
Apache Beam - Pipeline: single, potentially repeatable job, from start to finish, in Dataflow - PCollection: specialized container classes that can represent data sets of virtually unlimited size - Transform: takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection - Source & Sink: different data storage formats, such as files in Google Cloud Storage, BigQuery tables
  • 436.
Apache Beam Architecture - Pipeline: entire set of computations, from Data Source to Data Sink
  • 437.
Pipeline - Pipeline: single, potentially repeatable job, from start to finish, in Dataflow - Encapsulates a series of computations that accepts some input data from external sources, transforms data to provide some useful intelligence, and produces output - Defined by a driver program - The actual pipeline computations run on a backend, abstracted in the driver by a runner
  • 438.
Driver and Runner - Driver defines the computation DAG (pipeline) - Runner executes the DAG on a backend - Beam supports multiple backends (the Beam Model) - Apache Spark - Apache Flink - Google Cloud Dataflow
  • 439.
Typical Beam Driver Program - Create a Pipeline object - Create an initial PCollection for pipeline data - Source API to read data from an external source - Create transform to build a PCollection from in-memory data - Define the Pipeline transforms to change, filter, group, analyse PCollections - transforms do not change input collection - Output the final, transformed PCollection(s), typically using the Sink API to write data to an external source - Run the pipeline using the designated Pipeline Runner
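The shape of that driver program can be sketched in plain Python. This is a toy model of the read → transform → write → run pattern, not the real Apache Beam API; all class and method names here are invented:

```python
# Pure-Python sketch of a Beam-style driver: build a pipeline, read a
# source, chain transforms (which never mutate their input), write a sink,
# then hand execution to a "runner".
class Pipeline:
    def __init__(self):
        self.steps = []

    def read(self, source):
        self.source = source             # initial PCollection-like input
        return self

    def apply(self, transform):
        self.steps.append(transform)     # record the step, don't execute
        return self

    def write(self, sink):
        self.sink = sink
        return self

    def run(self):                       # the "runner" executes the DAG
        pcollection = list(self.source)  # immutable snapshot of the input
        for step in self.steps:
            pcollection = [step(e) for e in pcollection]
        self.sink.extend(pcollection)

sink = []
(Pipeline()
 .read([1, 2, 3])
 .apply(lambda x: x * x)
 .apply(lambda x: x + 1)
 .write(sink)
 .run())
print(sink)  # [2, 5, 10]
```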
  • 440.
Apache Beam Architecture - Pipeline: entire set of computations, from Data Source to Data Sink
  • 441.
Apache Beam Architecture - PCollection: data set in the pipeline
  • 442.
PCollection - PCollection: specialized container classes that can represent data sets of virtually unlimited size - Fixed size: text file or BigQuery table - Unbounded: Pub/Sub subscription
  • 443.
Apache Beam Architecture: a directed-acyclic graph from Data Source to Data Sink
  • 444.
Apache Beam Architecture - Pipeline: entire set of computations, from Data Source to Data Sink
  • 445.
Apache Beam Architecture - PCollection: data set in the pipeline
  • 446.
Apache Beam Architecture - Side Inputs: inject additional data into some PCollection
  • 447.
Transforms - A Transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection - “What/Where/When/How”
  • 448.
  • 449.
  • 450.
  • 451.
  • 452.
Apache Beam Architecture: a directed-acyclic graph from Data Source to Data Sink
  • 453.
Apache Beam Architecture - I/O: Source and Sink
  • 454.
Sources and Sinks - Source & Sink: different data storage formats, such as files in Google Cloud Storage, BigQuery tables - Custom sources and sinks possible too
  • 455.
Typical Beam Driver Program - Create a Pipeline object - Create an initial PCollection for pipeline data - Source API to read data from an external source - Create transform to build a PCollection from in-memory data - Define the Pipeline transforms to change, filter, group, analyse PCollections - transforms do not change input collection - Output the final, transformed PCollection(s), typically using the Sink API to write data to an external source - Run the pipeline using the designated Pipeline Runner
  • 456.
  • 458.
Dataproc - Managed Hadoop + Spark - Includes: Hadoop, Spark, Hive and Pig - “No-ops”: create cluster, use it, turn it off - Use Google Cloud Storage, not HDFS - else billing will hurt - Ideal for moving existing code to GCP
  • 459.
Cluster Machine Types - Built using Compute Engine VM instances - Cluster needs at least 1 master and 2 workers - Preemptible instances - OK if used with care
  • 460.
Preemptible VMs - Used for processing only - not data storage - No preemptible-only clusters - at least 2 non-preemptible workers will be added automatically - Minimum persistent disk size - smaller of 100GB and primary worker boot disk size (for local caching - not part of HDFS)
  • 461.
Initialisation Actions - Can specify scripts to be run from GitHub or Cloud Storage - Can do so via GCP console, gcloud CLI or programmatically - Run as root (i.e. no sudo required) - Use absolute paths - Use shebang line to indicate script interpreter
  • 462.
Scaling Clusters - Can scale even when jobs are running - Operations for scaling are: - Add workers - Remove workers - Add HDFS storage
  • 463.
High Availability (Beta) - While creating a Dataproc cluster, specify the “High Availability” configuration - Will run 3 masters rather than 1 - The 3 masters will run in an Apache Zookeeper cluster for automatic failover
  • 464.
Single Node Clusters (Beta) - Just 1 node - master as well as worker - Can’t be a preemptible VM - Use for prototyping, educational use etc
  • 465.
Restartable Jobs (Beta) - By default, Dataproc jobs do NOT restart on failure - Can optionally change this - useful for long-running and streaming jobs (e.g. Spark Streaming) - Mitigates out-of-memory errors, unscheduled reboots
  • 466.
  • 467.
  • 468.
Datalab - Interactive code execution for notebooks - Basically, Jupyter or IPython for running notebook code - Packaged as a container running on a VM instance - Need to connect from browser to the Datalab container on GCP
  • 469.
Notebooks - Better than text files for code - Include code, documentation (markdown), results - Results can be text, images, HTML/Javascript - Notebooks can be in a Cloud Storage Repository (git repository)
  • 470.
Persistent Disk - Notebook can be cloned from a Cloud Storage repository to VM persistent disk - This clone => workspace => add/remove/modify files - Notebooks auto-saved, but if persistent disk deleted and code not committed to git, changes will be lost
  • 471.
Service Account - Code can access BigQuery, Cloud ML etc using a service account - Service account needs to be authorised accordingly
  • 472.
Kernel - Opening a notebook => backend kernel process manages session and variables - Each notebook has 1 Python kernel - if N notebooks, then N processes - Kernels are single-threaded - Memory usage is heavy - execution is slow - pick machine type accordingly
  • 473.
Connecting to Datalab - gcloud SDK: running the Docker container in a Compute Engine VM and connecting to it through an SSH tunnel - Cloud Shell: running the Docker container in a Compute Engine VM and connecting to it through Cloud Shell - Local machine: running the Docker container on your local machine (must have Docker installed)
  • 474.
  • 475.
Pub/Sub - Messaging “middleware” - Many-to-many asynchronous messaging - Decouples sender and receiver
  • 476.
Message Life: a publisher application creates a topic in the Google Cloud Pub/Sub service and sends messages to the topic. A message contains a payload and optional attributes that describe the payload content
  • 477.
Message Life: messages are persisted in a message store until they are delivered and acknowledged by subscribers
  • 478.
Message Life: the Pub/Sub service forwards messages from a topic to all of its subscriptions, individually. Each subscription receives messages either by Pub/Sub pushing them to the subscriber's chosen endpoint, or by the subscriber pulling them from the service
  • 479.
Message Life: the subscriber receives pending messages from its subscription and acknowledges each one to the Pub/Sub service
  • 480.
Message Life: when a message is acknowledged by the subscriber, it is removed from the subscription's queue of messages
  • 481.
Basics - Publisher apps create and send messages on a Topic - Subscriber apps subscribe to a topic to receive messages - Subscription is a queue (message stream) to a subscriber - Message = data + attributes sent by publisher to a topic - Message Attributes = key-value pairs sent by publisher with message
  • 482.
Basics - Publisher apps create and send messages on a Topic - Subscriber apps subscribe to a topic to receive messages - Once acknowledged by subscriber, message deleted from queue - Messages persisted in a message store until delivered/acknowledged - One queue per subscription - Push - WebHook endpoint - Pull - HTTPS request to endpoint
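The topic/subscription/ack model above can be sketched as a tiny in-memory pub/sub in plain Python. This is an illustration of the semantics only, not the Cloud Pub/Sub API; the class, subscription names, and message are invented:

```python
from collections import deque

# In-memory sketch of the Pub/Sub model: one queue per subscription,
# publish fans out to every subscription, and a message leaves a
# subscription's queue only when it is acknowledged.
class Topic:
    def __init__(self):
        self.subscriptions = {}

    def subscribe(self, name):
        self.subscriptions[name] = deque()

    def publish(self, message):
        for queue in self.subscriptions.values():
            queue.append(message)        # fan out: every subscription gets a copy

    def pull(self, name):
        queue = self.subscriptions[name]
        return queue[0] if queue else None   # delivered but NOT yet removed

    def ack(self, name):
        self.subscriptions[name].popleft()   # ack removes it from the queue

topic = Topic()
topic.subscribe("billing")
topic.subscribe("audit")
topic.publish({"order": 42})
msg = topic.pull("billing")
topic.ack("billing")                # billing's copy is gone...
print(topic.pull("billing"))        # None
print(topic.pull("audit"))          # {'order': 42}  ...audit's copy remains
```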
  • 483.
Use-cases - Balancing workloads in network clusters - Asynchronous order processing - Distributing event notifications - Refreshing distributed caches - Logging to multiple systems simultaneously - Data streaming - Reliability improvement
  • 484.
Architecture - Data plane, which handles moving messages between publishers and subscribers - Control plane, which handles the assignment of publishers and subscribers to servers on the data plane - The servers in the data plane are called forwarders, and the servers in the control plane are called routers
  • 485.
Message Life - A publisher sends a message
  • 486.
Message Life - The message is written to storage
  • 487.
Message Life - Google Cloud Pub/Sub sends an acknowledgement to the publisher that it has received the message and guarantees its delivery to all attached subscriptions
  • 488.
Message Life - At the same time as writing the message to storage, Google Cloud Pub/Sub delivers it to subscribers
  • 489.
Message Life - Subscribers send an acknowledgement to Google Cloud Pub/Sub that they have processed the message
  • 490.
Message Life - Once at least one subscriber for each subscription has acknowledged the message, Google Cloud Pub/Sub deletes the message from storage
  • 491.
Publishers - Any application that can make HTTPS requests to googleapis.com - App Engine app - App running on Compute Engine instance - App running on third party network - Any mobile or desktop app - Even a browser
  • 492.
Subscribers - Pull subscribers - any app that can make HTTPS requests to googleapis.com - Push subscribers - must be a WebHook endpoint that can accept POST requests over HTTPS