SPARK ARCHITECTURE
PRESENTED BY: GAURAV BISWAS, BIT MESRA
SPARK COMPONENTS
• Spark Core is complemented by a set of powerful, higher-level libraries:
  • Spark SQL
  • MLlib (for machine learning)
  • GraphX (for graph processing)
  • Spark Streaming
• The core abstraction underneath all of them is the RDD (Resilient Distributed Dataset).
Spark SQL Introduction
• Part of the core distribution since Spark 1.0 (2014).
• Integrated with the rest of the Spark stack.
• Supports querying data either via SQL or via the Hive Query Language.
• Grew out of Shark, a port of Apache Hive that ran on top of Spark (in place of MapReduce).
• Can weave SQL queries together with code transformations.
• Can expose Spark datasets over a JDBC API, so traditional BI and visualization tools can run SQL-like queries on Spark data.
• Bindings in Python, Scala, and Java.
SQL Execution Plans
• A query has a logical plan and a physical plan.
  • Both are trees representing query evaluation.
  • Internal nodes are operators over the data.
  • The logical plan is higher-level and algebraic.
  • The physical plan is lower-level and operational.
• Logical plan operators describe conceptually what operation needs to be performed.
• Physical plan operators correspond to implemented access methods.
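The split between the two plans can be sketched with a toy model. This is not Spark's optimizer or its API: the operator classes `Scan`, `Filter`, and `Project`, the helpers `explain` and `to_physical`, and the sample table are all hypothetical names chosen for illustration. The logical tree only states what to compute, and `to_physical` binds each operator to one concrete way of computing it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]

# --- Logical plan: a tree of algebraic operators that say WHAT to compute ---
@dataclass
class Scan:
    table: str

@dataclass
class Filter:
    child: object
    column: str
    value: object

@dataclass
class Project:
    child: object
    columns: List[str]

def explain(node, depth: int = 0) -> str:
    """Render the logical plan tree, one operator per line."""
    pad = "  " * depth
    if isinstance(node, Scan):
        return f"{pad}Scan({node.table})"
    if isinstance(node, Filter):
        head = f"{pad}Filter({node.column} == {node.value!r})"
        return head + "\n" + explain(node.child, depth + 1)
    if isinstance(node, Project):
        head = f"{pad}Project({', '.join(node.columns)})"
        return head + "\n" + explain(node.child, depth + 1)
    raise TypeError(node)

# --- Physical plan: each operator is bound to a concrete access method ---
def to_physical(node, tables: Dict[str, List[Row]]) -> Callable[[], Iterator[Row]]:
    if isinstance(node, Scan):
        return lambda: iter(tables[node.table])  # full table scan
    if isinstance(node, Filter):
        child = to_physical(node.child, tables)
        return lambda: (r for r in child() if r[node.column] == node.value)
    if isinstance(node, Project):
        child = to_physical(node.child, tables)
        return lambda: ({c: r[c] for c in node.columns} for r in child())
    raise TypeError(node)

tables = {"people": [
    {"name": "Ada", "city": "London", "age": 36},
    {"name": "Linus", "city": "Helsinki", "age": 28},
    {"name": "Alan", "city": "London", "age": 41},
]}

# Roughly: SELECT name FROM people WHERE city = 'London'
logical = Project(Filter(Scan("people"), "city", "London"), ["name"])
plan_text = explain(logical)
result = list(to_physical(logical, tables)())
print(plan_text)
print(result)
```

A real engine would typically consider several physical alternatives for the same logical operator (for example an index lookup instead of a full scan); this sketch hard-codes one choice per operator.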
Key Features of MLlib
• Library built on top of Spark Core
• Built-in data analysis workflow
• Performance gains inherited from improvements to Spark Core
• Scalable
• Python, Scala, and Java APIs
• Broad coverage of applications and algorithms
• Rapid improvements in speed and robustness
• Easy to use
• Integrated workflow
MLlib
• MLlib is a machine learning library that provides various algorithms designed to scale out on a cluster: classification, regression, clustering, collaborative filtering, and so on.
• Some of these algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering (with more on the way).
• Apache Mahout (a machine learning library for Hadoop) has already turned away from MapReduce and moved its new development onto Spark.
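To make the streaming linear regression bullet concrete, here is a minimal single-machine sketch of ordinary least squares that is updated one mini-batch at a time. It is not MLlib's implementation or API: the class `StreamingOLS` and its fields are hypothetical, and it handles only one feature. The idea it shows is that the fit depends only on a few running sums, so each new batch can be folded in without revisiting old data.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

@dataclass
class StreamingOLS:
    """Fit y = slope * x + intercept by ordinary least squares,
    using running sums that are updated batch by batch."""
    n: int = 0
    sx: float = 0.0    # sum of x
    sy: float = 0.0    # sum of y
    sxx: float = 0.0   # sum of x * x
    sxy: float = 0.0   # sum of x * y

    def update(self, batch: Iterable[Tuple[float, float]]) -> None:
        # Fold one mini-batch into the sufficient statistics.
        for x, y in batch:
            self.n += 1
            self.sx += x
            self.sy += y
            self.sxx += x * x
            self.sxy += x * y

    def coefficients(self) -> Tuple[float, float]:
        # Closed-form OLS solution for a single feature.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept

model = StreamingOLS()
# Three mini-batches drawn from the line y = 2x + 1.
for batch in ([(0, 1), (1, 3)], [(2, 5), (3, 7)], [(4, 9)]):
    model.update(batch)

slope, intercept = model.coefficients()
print(slope, intercept)
```

Because the statistics are plain sums, partial sums computed on separate machines could simply be added together, which is one reason this kind of algorithm scales out well.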
GraphX
• GraphX is an API for graphs and graph-parallel execution.
• It is a network graph analytics engine.
• GraphX is a library that performs graph-parallel computation and manipulates graphs.
• It extends the Spark RDD API, so it can create directed graphs with arbitrary properties attached to each vertex and edge.
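A directed graph with properties on its vertices and edges can be pictured as two ordinary collections. The sketch below is plain Python, not the GraphX API; the `Edge` tuple, the sample data, and the helper names are assumptions made for illustration. The `triplets` helper loosely mirrors the idea behind GraphX's triplet view, where each edge is joined with the properties of both endpoints.

```python
from typing import Dict, Iterator, List, NamedTuple, Tuple

class Edge(NamedTuple):
    src: int
    dst: int
    attr: str

# Vertex collection: id -> arbitrary property (here a name and a role).
vertices: Dict[int, dict] = {
    1: {"name": "alice", "role": "student"},
    2: {"name": "bob", "role": "professor"},
    3: {"name": "carol", "role": "postdoc"},
}

# Edge collection: directed edges, each carrying its own property.
edges: List[Edge] = [
    Edge(1, 2, "advisee-of"),
    Edge(3, 2, "colleague-of"),
    Edge(2, 3, "colleague-of"),
]

def triplets(vs: Dict[int, dict], es: List[Edge]) -> Iterator[Tuple[dict, str, dict]]:
    """Join each edge with the properties of its two endpoints."""
    for e in es:
        yield vs[e.src], e.attr, vs[e.dst]

def in_degrees(es: List[Edge]) -> Dict[int, int]:
    """Count incoming edges per vertex using only the edge collection."""
    counts: Dict[int, int] = {}
    for e in es:
        counts[e.dst] = counts.get(e.dst, 0) + 1
    return counts

facts = [f"{s['name']} is {rel} {d['name']}" for s, rel, d in triplets(vertices, edges)]
degrees = in_degrees(edges)
print(facts)
print(degrees)
```

Keeping vertices and edges as separate collections is what lets the same data be treated either as a graph (triplets, degrees) or as plain records to filter and join.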
GraphX
• GraphX also provides various operators and algorithms to manipulate graphs.
• Clustering, classification, traversal, searching, and pathfinding are all possible in GraphX.
Spark GraphX Features
• Flexibility:
  • Works with both graphs and collections.
  • Unifies ETL (Extract, Transform, and Load), exploratory analysis, and iterative graph computation within a single system.
  • We can view the same data as both graphs and collections, transform and join graphs with RDDs efficiently, and write custom iterative graph algorithms.
• Speed:
  • Provides performance comparable to the fastest specialized graph processing systems while retaining Spark's flexibility, fault tolerance, and ease of use.
Spark GraphX Features
• Growing algorithm library:
  • We can choose from a growing library of graph algorithms.
  • Popular algorithms include PageRank, connected components, label propagation, strongly connected components, and triangle count.
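Two of the algorithms named above are small enough to sketch in plain Python. These are single-machine illustrations, not GraphX code; the function names and the sample graph are hypothetical. Connected components is written as repeated propagation of the smallest vertex id along edges, which is the usual iterative formulation, and triangle count treats the edges as undirected.

```python
from typing import Dict, Iterable, List, Set, Tuple

def connected_components(vertex_ids: Iterable[int],
                         edges: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Label every vertex with the smallest vertex id in its component."""
    edge_list: List[Tuple[int, int]] = list(edges)
    label = {v: v for v in vertex_ids}      # start: every vertex is its own component
    changed = True
    while changed:                          # repeat until no label moves
        changed = False
        for a, b in edge_list:              # edges are treated as undirected
            low = min(label[a], label[b])
            for v in (a, b):
                if label[v] > low:
                    label[v] = low
                    changed = True
    return label

def triangle_count(edges: Iterable[Tuple[int, int]]) -> int:
    """Count triangles in the undirected graph formed by the edges."""
    nbrs: Dict[int, Set[int]] = {}
    for a, b in edges:
        if a != b:
            nbrs.setdefault(a, set()).add(b)
            nbrs.setdefault(b, set()).add(a)
    total = 0
    for a in nbrs:
        for b in nbrs[a]:
            if b > a:
                # Count each triangle once, with its corners in order a < b < c.
                total += sum(1 for c in nbrs[b] if c > b and c in nbrs[a])
    return total

sample_vertices = [1, 2, 3, 4, 5, 6]
sample_edges = [(1, 2), (2, 3), (3, 1), (4, 5)]   # vertex 6 is isolated

components = connected_components(sample_vertices, sample_edges)
triangles = triangle_count(sample_edges)
print(components)
print(triangles)
```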
Spark Core
• Home of the API that defines RDDs, the backbone of Spark.
• The basic functionality of Spark lives in Spark Core:
  • memory management
  • fault recovery
  • interaction with storage systems
  • task scheduling and dispatching, and basic I/O
Resilient Distributed Dataset (RDD)
• Spark introduces the concept of an RDD: an immutable, fault-tolerant, distributed collection of objects that can be operated on in parallel.
• An RDD can contain any type of object and is created by loading an external dataset or by distributing a collection from the driver program.
RDD Operations
• RDDs support two types of operations:
  • Transformations (such as map, filter, join, and union) create a new dataset from an existing one: they are performed on an RDD and yield a new RDD containing the result.
  • Actions (such as reduce, count, first, collect, and save) require that the computation be performed: they run it on the RDD and return a value to the driver program or write the result to a file.
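The difference between the two kinds of operations can be shown with a toy, single-machine stand-in. `ToyRDD` below is a hypothetical class written for this sketch and is not Spark's RDD; it only borrows the method names from the slide. Transformations return a new object holding a recipe, and nothing executes until an action is called.

```python
from functools import reduce as fold
from typing import Callable, Iterable, Iterator, List

class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD; not Spark's class."""

    def __init__(self, compute: Callable[[], Iterator]):
        self._compute = compute             # recipe for producing the elements

    @staticmethod
    def parallelize(data: Iterable) -> "ToyRDD":
        items = list(data)
        return ToyRDD(lambda: iter(items))

    # --- Transformations: build a new ToyRDD, execute nothing ---
    def map(self, f) -> "ToyRDD":
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def union(self, other: "ToyRDD") -> "ToyRDD":
        return ToyRDD(lambda: (x for part in (self, other) for x in part._compute()))

    # --- Actions: run the recipe and hand a value back to the caller ---
    def collect(self) -> List:
        return list(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return fold(f, self._compute())

calls = []                                  # records when the map function really runs

def square(x):
    calls.append(x)
    return x * x

numbers = ToyRDD.parallelize(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
ran_before_action = len(calls)              # nothing has executed yet
result = evens_squared.collect()            # the action triggers the whole chain
total = numbers.union(evens_squared).reduce(lambda a, b: a + b)
print(ran_before_action, result, total)
```

Note that `numbers` is never modified: every transformation produces a new object, which matches the immutability property of RDDs.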
Properties of RDDs
• Immutability
• Cacheable: an RDD can be persisted, and lost partitions can be rebuilt from its lineage
• Lazy evaluation: defining a transformation is different from executing it
• Type-inferred
• Two ways to create RDDs:
  • parallelizing an existing collection in your driver program,
  • referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, Cassandra, or any data source offering a Hadoop InputFormat.
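How persistence and lineage work together can be sketched with a small model. Everything here is hypothetical and runs on one machine: the class `PartitionedData`, the `lose_partition` helper that simulates a node failure, and the round-robin split used by `parallelize` are assumptions for illustration, not how Spark itself slices or stores data.

```python
from typing import Callable, Dict, List, Optional

class PartitionedData:
    """Hypothetical model of a dataset split into partitions with lineage."""

    def __init__(self, num_partitions: int, compute: Callable[[int], List],
                 lineage: List[str]):
        self.num_partitions = num_partitions
        self._compute = compute              # how to (re)build partition i
        self.lineage = lineage               # readable chain of steps that made it
        self._cache: Optional[Dict[int, List]] = None

    @staticmethod
    def parallelize(data, num_partitions: int) -> "PartitionedData":
        items = list(data)
        # Round-robin split; assumed here for illustration only.
        parts = [items[i::num_partitions] for i in range(num_partitions)]
        return PartitionedData(num_partitions, lambda i: list(parts[i]),
                               ["parallelize"])

    def map(self, f, name: str) -> "PartitionedData":
        parent = self
        return PartitionedData(self.num_partitions,
                               lambda i: [f(x) for x in parent.partition(i)],
                               self.lineage + [f"map({name})"])

    def persist(self) -> "PartitionedData":
        self._cache = {}                     # start keeping computed partitions
        return self

    def partition(self, i: int) -> List:
        if self._cache is None:
            return self._compute(i)
        if i not in self._cache:             # cache miss: rebuild from lineage
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i: int) -> None:
        """Simulate a failed node by dropping one cached partition."""
        if self._cache is not None:
            self._cache.pop(i, None)

    def collect(self) -> List:
        return [x for i in range(self.num_partitions) for x in self.partition(i)]

base = PartitionedData.parallelize(range(6), num_partitions=3)
doubled = base.map(lambda x: x * 2, "double").persist()
first = doubled.collect()        # fills the cache
doubled.lose_partition(1)        # simulated node failure
second = doubled.collect()       # partition 1 is recomputed, the rest come from cache
print(doubled.lineage, first, second)
```

Only the lost partition is recomputed; because the data is immutable and the lineage is known, the rebuilt partition is identical to the one that was lost.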
Spark Streaming
• Spark Streaming is the Spark component used to process real-time streaming data.
• It enables high-throughput, fault-tolerant processing of live data streams.
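Spark Streaming handles a live stream as a sequence of small batches, each processed like an ordinary batch job. The sketch below imitates that idea in plain Python and is not the Spark Streaming API: the function names, the one-second interval, and the sample events are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]   # (timestamp in seconds, text payload)

def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[str]]:
    """Group events into fixed-length batch windows keyed by batch number."""
    batches: Dict[int, List[str]] = {}
    for ts, payload in events:
        batches.setdefault(int(ts // interval), []).append(payload)
    return batches

def word_counts_per_batch(events: Iterable[Event],
                          interval: float) -> List[Tuple[int, Dict[str, int]]]:
    """Run the same batch job (a word count) on every micro-batch, in time order."""
    out = []
    for batch_id, lines in sorted(micro_batches(events, interval).items()):
        counts = Counter(word for line in lines for word in line.split())
        out.append((batch_id, dict(counts)))
    return out

events = [
    (0.2, "spark streaming"),
    (0.9, "spark"),
    (1.1, "live data"),
    (2.5, "data data"),
]
results = word_counts_per_batch(events, interval=1.0)
for batch_id, counts in results:
    print(batch_id, counts)
```

A shorter interval lowers latency but produces more, smaller batches; that trade-off is the main tuning knob in a micro-batch design.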
END!
