Applying Stratosphere for Big Data Analytics
Presented by: J V P S Avinash (A20344397), Ashok Deshpande (A20334764)
Points of Discussion
• Big Data and Hadoop
• Map-Reduce Framework
• Stratosphere and its Components
• Stratosphere and its Architecture
• Stratosphere and its Operators
• Stratosphere vs Map-Reduce
• Execution and Analysis
Big Data
• Big Data refers to data sets so large and complex that they become difficult to process with on-hand database management tools. The challenges include capture, storage, search, sharing, analysis and visualization.
• Problems:
1) Large-scale data storage
2) Large-scale data analysis
• Solution: Hadoop (HDFS + MapReduce)
Hadoop Approach
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
  Large datasets: terabytes or petabytes of data
  Large clusters: hundreds or thousands of nodes
• Hadoop is based on a simple programming model called MapReduce.
• Hadoop = HDFS + MapReduce infrastructure.
• The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
• HDFS creates multiple replicas of data blocks and distributes them across the nodes of a cluster to enable reliable, extremely rapid computations.
• An HDFS data block is usually 64 MB or 128 MB. Each block is replicated multiple times (default = 3) and the replicas are stored on different data nodes.
• MapReduce is a programming model for parallel data processing. Hadoop can run MapReduce programs written in multiple languages, such as Java, Python, Ruby and C++.
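The block-size and replication figures above can be turned into a quick back-of-the-envelope calculation. The sketch below is only illustrative: the function name `hdfs_footprint` and the file sizes are assumptions, and it relies on the fact that HDFS does not pad the last block of a file.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw storage in MB) for a single file.

    Hypothetical helper for illustration; the defaults follow the slide
    (128 MB blocks, replication factor 3).
    """
    # A file is split into fixed-size blocks; the last one may be smaller.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The last block is not padded, so raw usage is file size times replication.
    raw_mb = file_size_mb * replication
    return blocks, raw_mb

if __name__ == "__main__":
    print(hdfs_footprint(1024))      # 1 GB file with 128 MB blocks
    print(hdfs_footprint(1024, 64))  # same file with 64 MB blocks
```

A smaller block size raises the number of blocks (and therefore the number of map tasks that can run in parallel) but does not change the raw storage consumed.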
Map Reduce - Example

MAP FUNCTION
• Operates on a set of (key, value) pairs.
• Map is applied in parallel to the input data set. It produces output keys and a list of values for each key, depending on the functionality.
• Mapper output is partitioned per reducer; the number of partitions equals the number of reduce tasks for that job.

REDUCE FUNCTION
• Operates on the set of (key, value) pairs produced by the mapper.
• Reduce is then applied in parallel to each group, again producing a collection of (key, value) pairs.
• The number of reducers is configured per job (in Hadoop it can be set by the user; the number of mappers follows from the input splits).
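As a minimal, single-process sketch of the map, partition and reduce steps described above, the classic word count can be written as follows. This is not Hadoop code: `run_mapreduce`, `partition` and the sample lines are assumptions made for illustration, and an in-memory dictionary stands in for the shuffle.

```python
import zlib
from collections import defaultdict

def map_fn(_offset, line):
    """Map: one (key, value) pair in, any number of (word, 1) pairs out."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: one key with all of its values in, (word, total) out."""
    yield word, sum(counts)

def partition(key, num_reducers):
    """Deterministic stand-in for a hash partitioner (hypothetical)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_mapreduce(records, map_fn, reduce_fn, num_reducers=2):
    # One bucket of {key: [values]} per reduce task.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            partitions[partition(out_key, num_reducers)][out_key].append(out_value)
    result = {}
    # Each reduce task sees its keys in sorted order, one group at a time.
    for bucket in partitions:
        for key in sorted(bucket):
            for out_key, out_value in reduce_fn(key, bucket[key]):
                result[out_key] = out_value
    return result

if __name__ == "__main__":
    lines = [(0, "to be or not to be"), (1, "to see or not")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
```

Changing `num_reducers` changes how keys are spread over reduce tasks but not the final counts, because every value for a given key always lands in the same partition.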
Stratosphere
• A massively parallel data processing system.
• Extends MapReduce with more operators.
• Supports advanced data flow graphs.
• Compiler/Optimizer, Java/Scala interface, YARN.
• Data flow composition.
Stratosphere - Components
• Sopremo: the query is parsed into a Sopremo plan, which is a DAG (Directed Acyclic Graph) of interconnected data processing operators.
• Meteor: end users specify data analysis tasks by writing Meteor queries.
• PACT: a generalization of the MapReduce programming paradigm.
• Nephele: interprets data flow graphs and distributes tasks to the computation nodes.
Stratosphere - Architecture
(1) Users formulate a query, which is parsed into a Sopremo plan.
(2) Imports the packages.
(3) Registers the discovered operators and predefined functions.
(4) Validates the script and translates it into a Sopremo plan.
(5) The plan is analyzed by the schema inferencer to obtain a global schema.
(6) Creates a consistent PACT plan.
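To make the idea of a plan as a DAG of operators concrete, the sketch below models a plan as a plain dictionary and derives one valid execution order from it. It does not use Meteor or Sopremo; the operator names and the `execution_order` helper are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the real planner.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: each operator maps to the operators whose output it reads.
PLAN = {
    "read_logs": [],
    "read_users": [],
    "filter": ["read_logs"],
    "join": ["filter", "read_users"],
    "aggregate": ["join"],
    "write": ["aggregate"],
}

def execution_order(plan):
    """Return one order in which every operator runs after its inputs."""
    return list(TopologicalSorter(plan).static_order())

if __name__ == "__main__":
    print(execution_order(PLAN))
```

Because the graph is acyclic, such an order always exists; independent branches (here the two readers) may be scheduled in either order or in parallel.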
Stratosphere - Operators

MAP (record-at-a-time, one input)
• Accepts a single record as input.
• Emits any number of records.
• Applications: filters, transformations.

REDUCE (group-at-a-time, one input)
• Groups the records of its input on the record key.
• Accepts a list of records as input.
• Emits any number of records.
• Applications: aggregations.

JOIN (record-at-a-time, two inputs)
• Joins both inputs on their record keys; non-matching records are discarded.
• Accepts one record of each input.
• Emits any number of records.
• Applications: equi-joins.
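The semantics of these three operators can be sketched as ordinary higher-order functions. The code below is not the Stratosphere API: the names `map_op`, `reduce_op` and `join_op`, the key-extractor arguments and the sample data are assumptions chosen to mirror the descriptions above.

```python
from collections import defaultdict

def map_op(records, udf):
    """Record-at-a-time: udf sees one record and emits any number of records."""
    for rec in records:
        yield from udf(rec)

def reduce_op(records, key, udf):
    """Group-at-a-time: udf sees a key and the list of all records with that key."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    for k, group in groups.items():
        yield from udf(k, group)

def join_op(left, right, left_key, right_key, udf):
    """Equi-join: udf sees one record of each input; unmatched records are dropped."""
    index = defaultdict(list)
    for r in right:
        index[right_key(r)].append(r)
    for l in left:
        for r in index.get(left_key(l), []):
            yield from udf(l, r)

if __name__ == "__main__":
    visits = [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart"), ("carol", "/home")]
    users = [("alice", "DE"), ("bob", "US")]

    # Filter with Map: keep only visits to /cart.
    print(list(map_op(visits, lambda v: [v] if v[1] == "/cart" else [])))
    # Aggregate with Reduce: count visits per user.
    print(list(reduce_op(visits, lambda v: v[0], lambda k, g: [(k, len(g))])))
    # Equi-join on the user name; carol has no match and is discarded.
    print(list(join_op(visits, users, lambda v: v[0], lambda u: u[0],
                       lambda v, u: [(v[0], v[1], u[1])])))
```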
Stratosphere - Operators (continued)

CROSS (record-at-a-time, two inputs)
• Builds the Cartesian product of the records of both inputs.
• Accepts one record of each input.
• Emits any number of records.
• Very expensive operation.

CO-GROUP (group-at-a-time, two inputs)
• Groups the records of both inputs on their record keys.
• Accepts one list of records for each input.
• Emits any number of records.

UNION (record-at-a-time, two inputs)
• Merges two or more input data sets into a single output data set.
• Follows bag semantics: duplicates are not removed.
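A similar sketch covers Cross, CoGroup and Union. Again, these are illustrative functions rather than the Stratosphere API; the names and sample inputs are assumptions.

```python
from collections import defaultdict
from itertools import chain, product

def cross_op(left, right, udf):
    """Cartesian product: udf is called once per (left, right) pair."""
    for l, r in product(left, right):
        yield from udf(l, r)

def cogroup_op(left, right, left_key, right_key, udf):
    """udf sees a key plus the list of matching records from each input.

    Unlike a join, a key that appears in only one input is still passed on,
    with an empty list for the other side.
    """
    groups = defaultdict(lambda: ([], []))
    for l in left:
        groups[left_key(l)][0].append(l)
    for r in right:
        groups[right_key(r)][1].append(r)
    for k, (ls, rs) in groups.items():
        yield from udf(k, ls, rs)

def union_op(*inputs):
    """Bag union: concatenates the inputs and keeps duplicates."""
    return chain(*inputs)

if __name__ == "__main__":
    print(list(cross_op([1, 2], ["a", "b"], lambda l, r: [(l, r)])))
    left = [("a", 1), ("b", 2)]
    right = [("a", 10), ("c", 30)]
    first = lambda rec: rec[0]
    print(list(cogroup_op(left, right, first, first,
                          lambda k, ls, rs: [(k, len(ls), len(rs))])))
    print(list(union_op([1, 2], [2, 3])))
```

The Cross example also shows why the operator is expensive: the number of calls to the user function is the product of the two input sizes.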
Stratosphere vs Map-Reduce
Conclusions:
1) Most tasks do not fit the MapReduce model.
2) Very expensive: data always goes to disk and HDFS.
3) Tedious to implement.

Stratosphere vs Map-Reduce (joins)
Conclusions:
1) Joins do not fit the MapReduce model.
2) Time-consuming to implement.
3) Hard optimization is necessary.
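To illustrate why joins are awkward in plain MapReduce, here is a minimal sketch of a reduce-side join: the mapper has to tag every record with its source, and the reducer has to separate the tagged records again before pairing them. The function names, the tags and the sample data are assumptions, and a dictionary stands in for the shuffle.

```python
from collections import defaultdict

def tag_map(source, record):
    """Map: emit the join key and the record, tagged with the data set it came from."""
    yield record[0], (source, record)

def join_reduce(key, tagged_values):
    """Reduce: split the group by tag, then pair every user with every visit."""
    users = [rec for tag, rec in tagged_values if tag == "users"]
    visits = [rec for tag, rec in tagged_values if tag == "visits"]
    for u in users:
        for v in visits:
            yield key, u[1], v[1]

def reduce_side_join(users, visits):
    # Shuffle: collect all tagged values per join key (in-memory stand-in).
    shuffle = defaultdict(list)
    for source, dataset in (("users", users), ("visits", visits)):
        for record in dataset:
            for k, v in tag_map(source, record):
                shuffle[k].append(v)
    out = []
    for k in sorted(shuffle):
        out.extend(join_reduce(k, shuffle[k]))
    return out

if __name__ == "__main__":
    users = [("alice", "DE"), ("bob", "US")]
    visits = [("alice", "/home"), ("bob", "/cart"), ("alice", "/cart"), ("carol", "/home")]
    print(reduce_side_join(users, visits))
```

All of the tagging and untagging is bookkeeping the programmer writes by hand, and the join strategy is fixed in the code; a system with a native Join operator can leave that choice to its optimizer.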
Stratosphere vs Map-Reduce (iterations)

Map-Reduce: the loop is outside the system
• Hard to program
• Very poor performance

Stratosphere: the loop is inside the system
• Easy to program
• Huge performance gains
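One reason an in-system loop can be faster is that the system can track which records changed and reprocess only those. The sketch below shows that idea for connected components, using a solution set and a workset. It illustrates the general delta-iteration pattern only, not Stratosphere's Iterate Delta API; the function name and the sample graph are assumptions.

```python
from collections import defaultdict

def connected_components(edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Solution set: current component id of every vertex (initially its own id).
    component = {v: v for v in neighbors}
    # Workset: vertices whose id changed in the previous round.
    workset = set(component)
    rounds = 0
    while workset:
        rounds += 1
        updates = {}
        # Only vertices in the workset propagate their id to their neighbors.
        for v in workset:
            for n in neighbors[v]:
                if component[v] < updates.get(n, component[n]):
                    updates[n] = component[v]
        component.update(updates)
        workset = set(updates)  # the next round touches only what changed
    return component, rounds

if __name__ == "__main__":
    labels, rounds = connected_components([(1, 2), (2, 3), (4, 5)])
    print(labels, rounds)
```

With the loop driven from outside, as in chained MapReduce jobs, every round would typically reread and rewrite the full data set even when only a few vertices changed.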
Summary: Feature Matrix

Operators
• Map Reduce: Map, Reduce
• Stratosphere: Map, Reduce (multiple sort keys), Cross, Join, CoGroup, Union, Iterate, Iterate Delta

Composition
• Map Reduce: only MapReduce
• Stratosphere: arbitrary data flows

Data Exchange
• Map Reduce: batch through disk
• Stratosphere: pipelined, in-memory (automatic spilling to disk)
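The difference between batch and pipelined data exchange can be illustrated with Python generators. In this sketch a fully built list stands in for an intermediate result written to disk, while chained generators pass each record straight to the next operator. The operator names and the log are hypothetical and exist only to make the execution order visible.

```python
def source(n, log):
    """Produce the records 0..n-1, logging each read."""
    for i in range(n):
        log.append(f"read {i}")
        yield i

def square(records, log):
    """Square every record, logging each call."""
    for r in records:
        log.append(f"square {r}")
        yield r * r

def batch(n):
    """Materialize each stage completely before the next one starts."""
    log = []
    stage1 = list(source(n, log))       # stand-in for writing to disk
    stage2 = list(square(stage1, log))
    return stage2, log

def pipelined(n):
    """Each record flows through both operators before the next is read."""
    log = []
    result = list(square(source(n, log), log))
    return result, log

if __name__ == "__main__":
    print(batch(2)[1])
    print(pipelined(2)[1])
```

Both variants compute the same result; only the order of work differs, which is what lets a pipelined engine avoid holding or writing the whole intermediate data set.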
Stratosphere - Web Log Analysis
Screens shown, in order:
• Stratosphere query interface
• Optimizer query plan
• Job submission
• Dashboard: running jobs (two slides)
• Dashboard: job plan