Applying Stratosphere for
Big Data Analytics
1
Presented by:
JVPS Avinash (A20344397)
Ashok Deshpande (A20334764)
Points of Discussion
• Big Data and Hadoop
• Map-Reduce Framework
• Stratosphere and its Components
• Stratosphere and its Architecture
• Stratosphere and its Operators
• Stratosphere vs Map-Reduce
• Execution and Analysis
2
Big Data
• Big Data refers to collections of data sets so large and complex that they become difficult to process using on-hand database management tools. The challenges include capture, storage, search, sharing, analysis and visualization.
• Problems:
1) Large-Scale Data Storage
2) Large-Scale Data Analysis
• Solution:
Hadoop – HDFS – MapReduce
3
Hadoop Approach
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Large datasets → terabytes or petabytes of data
Large clusters → hundreds or thousands of nodes
• Hadoop is based on a simple programming model called MapReduce.
• Hadoop = HDFS + MapReduce infrastructure.
• The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
• HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
• An HDFS data block is usually 64 MB or 128 MB. Each block is replicated multiple times (default: 3) and stored on different data nodes.
• MapReduce is a programming model for parallel data processing. Hadoop can run MapReduce programs written in multiple languages, such as Java, Python, Ruby and C++.
4
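The block-splitting and replication scheme above can be sketched in a few lines of Python. This is a toy model, not HDFS code: `plan_blocks`, the fixed block size, and the round-robin placement are our simplifications (real HDFS placement is rack-aware).

```python
# Sketch (not HDFS code): split a file into fixed-size blocks and
# replicate each block across data nodes. Block size and replication
# factor mirror the HDFS defaults mentioned above.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, a common HDFS default
REPLICATION = 3                 # default replication factor

def plan_blocks(file_size_bytes, nodes):
    """Return (num_blocks, placements): which nodes hold each block."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    placements = []
    for b in range(num_blocks):
        # Round-robin placement over the cluster; real HDFS is rack-aware.
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        placements.append(replicas)
    return num_blocks, placements

blocks, where = plan_blocks(200 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
# A 200 MB file needs 4 blocks of 64 MB; each block lives on 3 distinct nodes.
```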
Map Reduce - Example
5
MAP FUNCTION:
- Operates on a set of (key, value) pairs.
- Map is applied in parallel to the input data set, producing, for each output key, a list of values depending on the functionality.
- Mapper output is partitioned, one partition per reducer (the number of partitions equals the number of reduce tasks for that job).

REDUCE FUNCTION:
- Operates on the set of (key, value) pairs emitted by the mapper.
- Reduce is then applied in parallel to each group, again producing a collection of (key, value) pairs.
- The number of reduce tasks is configured per job; the number of map tasks is determined by the input splits, not set directly by the user.
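The map/reduce flow in the table above can be sketched in plain Python. This is a toy model of the semantics, not Hadoop code; `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative names of our own. It runs the classic word count, with mapper output partitioned into one partition per reduce task.

```python
# Pure-Python sketch of the map/reduce flow: map emits (key, value)
# pairs, pairs are partitioned by key (one partition per reduce task),
# and reduce aggregates each key's value list.

from collections import defaultdict

def map_fn(line):
    # Emit (word, 1) for every word: the classic word-count mapper.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Aggregate the value list of one key.
    yield (key, sum(values))

def run_mapreduce(lines, num_reducers=2):
    # Partition mapper output: partitions = number of reduce tasks.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for k, v in map_fn(line):
            partitions[hash(k) % num_reducers][k].append(v)
    # Each "reducer" processes its own partition's groups.
    out = {}
    for part in partitions:
        for k, vs in part.items():
            for rk, rv in reduce_fn(k, vs):
                out[rk] = rv
    return out

counts = run_mapreduce(["big data big", "data"])
# counts == {"big": 2, "data": 2}
```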
StratoSphere
• A massively parallel data processing system.
• Extends MapReduce with more operators.
• Support for advanced data flow graphs.
• Compiler/optimizer, Java/Scala interfaces, YARN.
• Data flow composition.
6
StratoSphere – Components
7
• Meteor: end users specify data analysis tasks by writing Meteor queries.
• Sopremo: the query is parsed into a Sopremo plan, a DAG (directed acyclic graph) of interconnected data processing operators.
• PACT: a generalization of the MapReduce programming paradigm.
• Nephele: interprets data flow graphs and distributes tasks to the computation nodes.
StratoSphere – Architecture
(1) Users formulate a query that is parsed into a Sopremo plan.
(2) Imports the packages.
(3) Registers the discovered operators and predefined functions.
(4) Validates the script and translates it into a Sopremo plan.
(5) The plan is analyzed by the schema inferencer to obtain a global schema.
(6) Creates a consistent PACT plan.
8
StratoSphere - Operators
9
MAP (one input, record-at-a-time)
- Accepts a single record as input; emits any number of records.
- Applications: filters / transformations.

REDUCE (one input, group-at-a-time)
- Groups the records of its input on the record key.
- Accepts a list of records as input; emits any number of records.
- Applications: aggregations.

JOIN (two inputs, record-at-a-time)
- Joins both inputs on their record keys; non-matching records are discarded.
- Accepts one record of each input; emits any number of records.
- Applications: equi-joins.
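The one- and two-input operators above can be sketched over plain Python lists of (key, value) records. These helper names (`map_op`, `reduce_op`, `join_op`) are ours, not the Stratosphere API; the sketch only illustrates the semantics.

```python
# Toy sketches of the PACT-style operators, operating on lists of
# (key, value) tuples.

def map_op(records, fn):
    # One input record in, any number of records out.
    return [out for rec in records for out in fn(rec)]

def reduce_op(records, fn):
    # Group the input records by key, then apply fn to each group.
    groups = {}
    for k, v in records:
        groups.setdefault(k, []).append(v)
    return [out for k, vs in groups.items() for out in fn(k, vs)]

def join_op(left, right, fn):
    # Equi-join on the record key; non-matching records are discarded.
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [out for k, lv in left if k in index
                for rv in index[k] for out in fn(k, lv, rv)]

pairs = join_op([("a", 1), ("b", 2)], [("a", 10), ("c", 30)],
                lambda k, l, r: [(k, l + r)])
# pairs == [("a", 11)] — "b" and "c" have no partner and are dropped
```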
StratoSphere - Operators
10
CROSS (two inputs, record-at-a-time)
- Cartesian product of the records of both inputs.
- Accepts one record of each input; emits any number of records.
- Very expensive operation.

COGROUP (two inputs, group-at-a-time)
- Groups the records of each input on the record key.
- Accepts one list of records for each input; emits any number of records.

UNION (two inputs, record-at-a-time)
- Merges two or more input data sets into a single output data set.
- Follows bag semantics: duplicates are not removed.
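The two-input operators on this slide can be sketched the same way, in plain Python rather than the Stratosphere API (`cross_op`, `cogroup_op`, `union_op` are illustrative names): Cross builds the full Cartesian product, CoGroup pairs the per-key value groups of both inputs (including keys present on only one side), and Union keeps duplicates, per bag semantics.

```python
# Toy sketches of the remaining two-input operators.

def cross_op(left, right):
    # Cartesian product of the records of both inputs: very expensive.
    return [(l, r) for l in left for r in right]

def cogroup_op(left, right, fn):
    # For every key, hand fn the group of values from each input.
    keys = {k for k, _ in left} | {k for k, _ in right}
    out = []
    for k in sorted(keys):
        lv = [v for key, v in left if key == k]
        rv = [v for key, v in right if key == k]
        out.extend(fn(k, lv, rv))
    return out

def union_op(left, right):
    # Bag semantics: duplicates survive.
    return left + right

prod = cross_op([1, 2], ["x"])        # [(1, "x"), (2, "x")]
u = union_op([("a", 1)], [("a", 1)])  # the duplicate record is kept
```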
Stratosphere vs Map-Reduce
11
Conclusions:
1) Most tasks do not fit the MapReduce model.
2) Very expensive: always goes to disk and HDFS.
3) Tedious to implement.
Stratosphere vs Map-Reduce
12
Conclusions:
1) Joins do not fit the MapReduce model.
2) Time-consuming to implement.
3) Hard optimization is necessary.
Stratosphere vs Map-Reduce
13
MapReduce: loop is outside the system
• Hard to program
• Very poor performance
Stratosphere: loop is inside the system
• Easy to program
• Huge performance gains
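The in-system iteration idea can be sketched with a toy fixpoint computation in Python (our own example, not Stratosphere code): each round only the elements whose label changed are touched again, as in a delta iteration, and the loop terminates when the workset is empty. Propagating minimum labels along edges is the core step of connected components.

```python
# Toy delta-style iteration: repeat a superstep until nothing changes.

def step(labels, edges):
    # One superstep: push the smaller label across every edge.
    changed = {}
    for a, b in edges:
        if labels[b] > labels[a]:
            changed[b] = labels[a]
        if labels[a] > labels[b]:
            changed[a] = labels[b]
    return changed  # the delta: only the labels that improved

def iterate_to_fixpoint(labels, edges):
    rounds = 0
    while True:
        delta = step(labels, edges)
        if not delta:            # empty workset -> converged
            return labels, rounds
        labels.update(delta)     # apply only the changed elements
        rounds += 1

labels, rounds = iterate_to_fixpoint({1: 1, 2: 2, 3: 3}, [(1, 2), (2, 3)])
# every node ends up with the minimum label, 1
```

A driver loop outside the system (as with plain MapReduce) would re-read and rewrite the full state via disk and HDFS on every round; keeping the loop inside the engine lets it track only the delta.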
Summary: Feature Matrix

Feature       | Map Reduce         | StratoSphere
--------------|--------------------|------------------------------------------------
Operators     | Map, Reduce        | Map, Reduce (multiple sort keys), Cross, Join,
              |                    | CoGroup, Union, Iterate, Iterate Delta
Composition   | Only MapReduce     | Arbitrary data flows
Data Exchange | Batch through disk | Pipelined, in-memory (automatic spilling to disk)
14
Stratosphere - Web Log Analysis
15
Stratosphere Query Interface
Web Log Analysis – continued
16
Optimizer Query Plan
Web Log Analysis – continued
17
Job Submission
Web Log Analysis – continued
18
Dashboard – Running Jobs
19
Web Log Analysis – continued
Dashboard – Running Jobs
20
Dashboard – Job Plan
Web Log Analysis – continued
