Applying Stratosphere for
Big Data Analytics
1
Presented by:
JVPS Avinash (A20344397)
Ashok Deshpande (A20334764)
Points of Discussion
• Big Data and Hadoop
• Map-Reduce Framework
• Stratosphere and its Components
• Stratosphere and its Architecture
• Stratosphere and its Operators
• Stratosphere vs Map-Reduce
• Execution and Analysis
2
Big Data
• Big Data refers to collections of data sets so large and complex that they become difficult to process using on-hand database management tools. The challenges include capture, storage, search, sharing, analysis and visualization.
• Problems:
1) Large-Scale Data Storage
2) Large-Scale Data Analysis
• Solution:
Hadoop – HDFS – MapReduce
3
Hadoop Approach
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Large datasets → terabytes or petabytes of data
Large clusters → hundreds or thousands of nodes
• Hadoop is based on a simple programming model called MapReduce.
• Hadoop = HDFS + MapReduce infrastructure.
• The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
• An HDFS data block is usually 64 MB or 128 MB. Each block is replicated multiple times (default: 3) and stored on different data nodes.
• MapReduce is a programming model for parallel data processing. Hadoop can run MapReduce programs written in multiple languages, such as Java, Python, Ruby and C++.
4
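The block-splitting and replication scheme above can be sketched in a few lines of Python. This is a toy model, not HDFS code: `plan_blocks`, the fixed block size, and the round-robin placement are our simplifications (real HDFS placement is rack-aware).

```python
# Sketch (not HDFS code): split a file into fixed-size blocks and
# replicate each block across data nodes. Block size and replication
# factor mirror the HDFS defaults mentioned above.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, a common HDFS default
REPLICATION = 3                 # default replication factor

def plan_blocks(file_size_bytes, nodes):
    """Return (num_blocks, placements): which nodes hold each block."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    placements = []
    for b in range(num_blocks):
        # Round-robin placement over the cluster; real HDFS is rack-aware.
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        placements.append(replicas)
    return num_blocks, placements

blocks, where = plan_blocks(200 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
# A 200 MB file needs 4 blocks of 64 MB; each block lives on 3 distinct nodes.
```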
Map Reduce - Example
5
MAP FUNCTION:
- Operates on a set of (key, value) pairs.
- Map is applied in parallel to the input data set, producing, for each output key, a list of values depending on the functionality.
- Mapper output is partitioned, one partition per reducer (the number of partitions equals the number of reduce tasks for that job).

REDUCE FUNCTION:
- Operates on the set of (key, value) pairs emitted by the mapper.
- Reduce is then applied in parallel to each group, again producing a collection of (key, value) pairs.
- The number of reduce tasks is configured per job; the number of map tasks is determined by the input splits, not set directly by the user.
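The map/reduce flow in the table above can be sketched in plain Python. This is a toy model of the semantics, not Hadoop code; `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative names of our own. It runs the classic word count, with mapper output partitioned into one partition per reduce task.

```python
# Pure-Python sketch of the map/reduce flow: map emits (key, value)
# pairs, pairs are partitioned by key (one partition per reduce task),
# and reduce aggregates each key's value list.

from collections import defaultdict

def map_fn(line):
    # Emit (word, 1) for every word: the classic word-count mapper.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Aggregate the value list of one key.
    yield (key, sum(values))

def run_mapreduce(lines, num_reducers=2):
    # Partition mapper output: partitions = number of reduce tasks.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for k, v in map_fn(line):
            partitions[hash(k) % num_reducers][k].append(v)
    # Each "reducer" processes its own partition's groups.
    out = {}
    for part in partitions:
        for k, vs in part.items():
            for rk, rv in reduce_fn(k, vs):
                out[rk] = rv
    return out

counts = run_mapreduce(["big data big", "data"])
# counts == {"big": 2, "data": 2}
```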
StratoSphere
• A massively parallel data processing system.
• Extends MapReduce with more operators.
• Support for advanced data flow graphs.
• Compiler/optimizer, Java/Scala interfaces, YARN.
• Data flow composition.
6
StratoSphere – Components
7
• Meteor: end users specify data analysis tasks by writing Meteor queries.
• Sopremo: the query is parsed into a Sopremo plan, a DAG (directed acyclic graph) of interconnected data processing operators.
• PACT: a generalization of the MapReduce programming paradigm.
• Nephele: interprets data flow graphs and distributes tasks to the computation nodes.
StratoSphere – Architecture
(1) Users formulate a query that is parsed into a Sopremo plan.
(2) Imports the packages.
(3) Registers the discovered operators and predefined functions.
(4) Validates the script and translates it into a Sopremo plan.
(5) The plan is analyzed by the schema inferencer to obtain a global schema.
(6) Creates a consistent PACT plan.
8
StratoSphere - Operators
9
MAP (one input, record-at-a-time)
- Accepts a single record as input; emits any number of records.
- Applications: filters / transformations.

REDUCE (one input, group-at-a-time)
- Groups the records of its input on the record key.
- Accepts a list of records as input; emits any number of records.
- Applications: aggregations.

JOIN (two inputs, record-at-a-time)
- Joins both inputs on their record keys; non-matching records are discarded.
- Accepts one record of each input; emits any number of records.
- Applications: equi-joins.
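The one- and two-input operators above can be sketched over plain Python lists of (key, value) records. These helper names (`map_op`, `reduce_op`, `join_op`) are ours, not the Stratosphere API; the sketch only illustrates the semantics.

```python
# Toy sketches of the PACT-style operators, operating on lists of
# (key, value) tuples.

def map_op(records, fn):
    # One input record in, any number of records out.
    return [out for rec in records for out in fn(rec)]

def reduce_op(records, fn):
    # Group the input records by key, then apply fn to each group.
    groups = {}
    for k, v in records:
        groups.setdefault(k, []).append(v)
    return [out for k, vs in groups.items() for out in fn(k, vs)]

def join_op(left, right, fn):
    # Equi-join on the record key; non-matching records are discarded.
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [out for k, lv in left if k in index
                for rv in index[k] for out in fn(k, lv, rv)]

pairs = join_op([("a", 1), ("b", 2)], [("a", 10), ("c", 30)],
                lambda k, l, r: [(k, l + r)])
# pairs == [("a", 11)] — "b" and "c" have no partner and are dropped
```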
StratoSphere - Operators
10
CROSS (two inputs, record-at-a-time)
- Cartesian product of the records of both inputs.
- Accepts one record of each input; emits any number of records.
- Very expensive operation.

COGROUP (two inputs, group-at-a-time)
- Groups the records of each input on the record key.
- Accepts one list of records for each input; emits any number of records.

UNION (two inputs, record-at-a-time)
- Merges two or more input data sets into a single output data set.
- Follows bag semantics: duplicates are not removed.
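The two-input operators on this slide can be sketched the same way, in plain Python rather than the Stratosphere API (`cross_op`, `cogroup_op`, `union_op` are illustrative names): Cross builds the full Cartesian product, CoGroup pairs the per-key value groups of both inputs (including keys present on only one side), and Union keeps duplicates, per bag semantics.

```python
# Toy sketches of the remaining two-input operators.

def cross_op(left, right):
    # Cartesian product of the records of both inputs: very expensive.
    return [(l, r) for l in left for r in right]

def cogroup_op(left, right, fn):
    # For every key, hand fn the group of values from each input.
    keys = {k for k, _ in left} | {k for k, _ in right}
    out = []
    for k in sorted(keys):
        lv = [v for key, v in left if key == k]
        rv = [v for key, v in right if key == k]
        out.extend(fn(k, lv, rv))
    return out

def union_op(left, right):
    # Bag semantics: duplicates survive.
    return left + right

prod = cross_op([1, 2], ["x"])        # [(1, "x"), (2, "x")]
u = union_op([("a", 1)], [("a", 1)])  # the duplicate record is kept
```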
Stratosphere vs Map-Reduce
11
Conclusions:
1) Most tasks do not fit the MapReduce model.
2) Very expensive: always goes to disk and HDFS.
3) Tedious to implement.
Stratosphere vs Map-Reduce
12
Conclusions:
1) Joins do not fit the MapReduce model.
2) Time-consuming to implement.
3) Hard optimization is necessary.
Stratosphere vs Map-Reduce
13
MapReduce: loop is outside the system
• Hard to program
• Very poor performance
Stratosphere: loop is inside the system
• Easy to program
• Huge performance gains
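The in-system iteration idea can be sketched with a toy fixpoint computation in Python (our own example, not Stratosphere code): each round only the elements whose label changed are touched again, as in a delta iteration, and the loop terminates when the workset is empty. Propagating minimum labels along edges is the core step of connected components.

```python
# Toy delta-style iteration: repeat a superstep until nothing changes.

def step(labels, edges):
    # One superstep: push the smaller label across every edge.
    changed = {}
    for a, b in edges:
        if labels[b] > labels[a]:
            changed[b] = labels[a]
        if labels[a] > labels[b]:
            changed[a] = labels[b]
    return changed  # the delta: only the labels that improved

def iterate_to_fixpoint(labels, edges):
    rounds = 0
    while True:
        delta = step(labels, edges)
        if not delta:            # empty workset -> converged
            return labels, rounds
        labels.update(delta)     # apply only the changed elements
        rounds += 1

labels, rounds = iterate_to_fixpoint({1: 1, 2: 2, 3: 3}, [(1, 2), (2, 3)])
# every node ends up with the minimum label, 1
```

A driver loop outside the system (as with plain MapReduce) would re-read and rewrite the full state via disk and HDFS on every round; keeping the loop inside the engine lets it track only the delta.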
Summary: Feature Matrix

Feature       | Map Reduce         | StratoSphere
--------------|--------------------|------------------------------------------------
Operators     | Map, Reduce        | Map, Reduce (multiple sort keys), Cross, Join,
              |                    | CoGroup, Union, Iterate, Iterate Delta
Composition   | Only MapReduce     | Arbitrary data flows
Data Exchange | Batch through disk | Pipelined, in-memory (automatic spilling to disk)
14
Stratosphere - Web Log Analysis
15
Stratosphere Query Interface
Web Log Analysis – continued
16
Optimizer Query Plan
Web Log Analysis – continued
17
Job Submission
Web Log Analysis – continued
18
Dashboard – Running Jobs
19
Web Log Analysis – continued
Dashboard – Running Jobs
20
Dashboard – Job Plan
Web Log Analysis – continued
