Unit II Introducing Real-Time Processing Tool
Agenda
• Apache Spark, Why Apache Spark, Evolution of Apache Spark
• Architecture of Apache Spark, Features of Apache Spark
• Spark Deployment: Standalone, Hadoop YARN, Spark in MapReduce (SIMR)
• Components of Apache Spark: Spark Core, Spark SQL, Spark Streaming, Spark Machine Learning, Spark GraphX, Spark Shell
• Resilient Distributed Dataset (RDD) Basics, Spark Context, RDD Transformations, Creating RDDs, RDD Operations, Programming with RDD
• Transformations, Actions, Lazy Evaluation, Converting between RDD Types
Apache Spark
• Apache Spark is a fast cluster computing framework designed for large-scale data processing, including near-real-time stream processing.
• Spark is an open-source project of the Apache Software Foundation.
• Spark overcomes limitations of Hadoop MapReduce and extends the MapReduce model to support more kinds of computation efficiently, such as interactive queries and stream processing.
• Spark is one of the most widely used engines for big data processing.
• It is used across organizations for many kinds of workloads.
• For some workloads, Spark can run up to 100 times faster than Hadoop MapReduce in memory and up to 10 times faster on disk.
Why Apache Spark
• Many technology companies across the globe have moved toward Apache Spark.
• They were quick to recognize the value of Spark's capabilities, such as machine learning and interactive querying.
• Industry leaders such as Amazon, Huawei, and IBM have already adopted Apache Spark.
• The firms that were initially based on Hadoop, such as Hortonworks, Cloudera, and MapR, have also
moved to Apache Spark.
• Big Data Hadoop professionals need to learn Apache Spark, since it is the next major technology in Hadoop data processing.
• ETL professionals, SQL professionals, and project managers can benefit greatly from mastering Apache Spark.
• Data scientists also need in-depth knowledge of Spark to excel in their careers.
• Spark is widely deployed in machine learning scenarios.
Evolution of Apache Spark
• Before Spark, MapReduce was the standard processing framework.
• Spark began as a research project at UC Berkeley's AMPLab in 2009.
• It was open-sourced in 2010.
• After its release, Spark grew and moved to the Apache Software Foundation in 2013.
• Many organizations across the world now use Apache Spark to power their Big Data applications.
Architecture of Apache Spark
Features of Apache Spark
• Apache Spark has many features:
• Fault tolerance: designed to handle worker node failures using the DAG and RDDs.
• Dynamic in nature: offers more than 80 high-level operators for building parallel applications.
• Lazy evaluation: transformations are evaluated lazily; they are added to the DAG, and results are computed only when an action is called.
• Real-time stream processing: provides a language-integrated API for stream processing.
• Speed: runs workloads on Hadoop up to 100x faster in memory and 10x faster on disk by minimizing disk reads and writes for intermediate results.
• Reusability: the same Spark code can be used for batch processing, for joining streaming data, and for ad hoc queries on streaming state.
• Advanced analytics: a de facto standard for big data processing and data science across multiple industries, with libraries for machine learning and graph processing.
• In-memory computing: processes tasks in memory without having to write intermediate results back to disk; it can cache intermediate results for reuse in the next iteration and share a common dataset across multiple tasks.
• Multiple language support: APIs are available in Java, Scala, Python, and R; advanced data analytics features are available through R, and SQL queries through Spark SQL.
• Integrated with Hadoop: works well with the Hadoop file system (HDFS) and supports multiple file formats such as Parquet, JSON, CSV, ORC, and Avro.
• Cost efficient: open-source software with no licensing fee.
Spark Deployment
• Apache Spark can run on a Hadoop cluster, with or without Hadoop YARN.
• It can be deployed on Hadoop in three ways:
• Standalone: Spark is allocated all or a subset of the resources in a Hadoop cluster and runs in parallel with Hadoop MapReduce.
• YARN: Spark runs on YARN without any pre-installation; with the Hadoop configuration files in place, it can read from and write to HDFS and connect to the YARN Resource Manager.
• SIMR (Spark in MapReduce): launches Spark jobs inside MapReduce, which helps users start experimenting with Spark.
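As a rough sketch of how the deployment choice shows up in code, the master URL passed to Spark selects the cluster manager. The object name DeployConf, the mode labels, and the host name master-host are hypothetical placeholders introduced here for illustration; 7077 is the usual default port of a standalone master.

```scala
import org.apache.spark.SparkConf

object DeployConf {
  // Maps a deployment mode label (hypothetical names) to a Spark master URL.
  // host is a placeholder for the machine running a standalone master.
  def masterUrl(mode: String, host: String = "master-host", port: Int = 7077): String =
    mode match {
      case "local"      => "local[*]"            // in-process, one thread per core
      case "standalone" => s"spark://$host:$port" // Spark's own cluster manager
      case "yarn"       => "yarn"                // cluster located via Hadoop config files
      case other        => throw new IllegalArgumentException(s"unknown mode: $other")
    }

  // Builds a configuration for the chosen mode.
  def conf(mode: String): SparkConf =
    new SparkConf().setAppName("DeployConf").setMaster(masterUrl(mode))
}
```

The same value can instead be supplied on the command line through the --master option of spark-submit, which keeps the application code independent of the cluster.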
Components of Spark
• The following image gives a clear picture of the different Spark components.
Apache Spark Core-
general execution engine for the Spark platform, on which all other functionality is built; provides in-memory computing and references datasets stored in external storage systems.
lets you write code quickly with the help of a rich set of operators.
programs take fewer lines when written in Scala.
Spark SQL-
introduces a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3).
SchemaRDD provides support for both structured and semi-structured data.
MLlib (Machine Learning Library)-
contains a wide array of machine learning algorithms, including classification, clustering, and collaborative filtering.
GraphX-
library for manipulating graphs and performing graph computations.
extends the Spark RDD API with a directed graph abstraction.
provides numerous operators for manipulating graphs, along with a set of graph algorithms.
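To make the Spark SQL component more concrete, here is a minimal sketch that runs a SQL query over structured data. It assumes a recent Spark release, where the structured-data abstraction is the DataFrame and the entry point is SparkSession; the object name SqlSketch and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  // Returns the names of people aged 18 or over, sorted alphabetically.
  def adultNames(spark: SparkSession): Array[String] = {
    import spark.implicits._ // enables toDF and the String encoder
    // Hypothetical sample rows; a real job would read Parquet, JSON, CSV, etc.
    val people = Seq(("ann", 34), ("bob", 17), ("cy", 52)).toDF("name", "age")
    // Register the DataFrame so it can be queried by name in SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age >= 18 ORDER BY name")
      .as[String]
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark in-process with two threads (a demo assumption).
    val spark = SparkSession.builder()
      .appName("SqlSketch")
      .master("local[2]")
      .getOrCreate()
    try println(adultNames(spark).mkString(", "))
    finally spark.stop()
  }
}
```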
Resilient Distributed Dataset (RDD) Basic
RDDs are the main logical data units in Spark.
They are a distributed collection of objects, which are stored in memory or on disks of different machines of a cluster.
A single RDD can be divided into multiple logical partitions so that these partitions can be stored and processed on different
machines of a cluster.
RDDs are immutable (read-only) in nature.
You cannot change an original RDD, but you can create new RDDs by performing coarse-grained operations, such as transformations, on an existing RDD.
An RDD in Spark can be cached and reused by later transformations, which avoids recomputing it.
RDDs are lazily evaluated, i.e., evaluation is delayed until a result is actually needed.
This saves time and improves efficiency.
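The points above about partitioning, immutability, and caching can be sketched as follows. This is an illustrative example rather than part of the original slides; the object name RddBasics, the local master setting, and the choice of four partitions are assumptions made for the demo.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  // Builds a small RDD, derives a new one from it, and returns both as local arrays.
  def run(sc: SparkContext): (Array[Int], Array[Int]) = {
    // parallelize splits a local collection into 4 logical partitions.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)
    // map does not modify numbers; it returns a new RDD.
    // cache marks the new RDD to be kept in memory once it has been computed.
    val squares = numbers.map(n => n * n).cache()
    (numbers.collect(), squares.collect())
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark in-process with two threads (a demo assumption).
    val conf = new SparkConf().setAppName("RddBasics").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (original, derived) = run(sc)
      println(original.mkString(","))
      println(derived.mkString(","))
    } finally sc.stop()
  }
}
```

The original RDD still holds 1 to 10 after the map, which is what immutability means in practice.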
Features of an RDD in Spark
• Here are some features of RDD in Spark:
• Resilience: RDDs track data lineage information to recover lost data, automatically on failure. It is also
called fault tolerance.
• Distributed: Data present in an RDD resides on multiple nodes. It is distributed across different nodes of
a cluster.
• Lazy evaluation: Data is not loaded into an RDD when you define it. Transformations are computed only when you call an action, such as count or collect, or save the output to a file system.
• Immutability: Data stored in an RDD is read-only; you cannot edit the data in an RDD, but you can create new RDDs by performing transformations on existing RDDs.
• In-memory computation: An RDD keeps intermediate data in memory (RAM) rather than on disk, which provides faster access.
• Partitioning: An RDD is divided into logical partitions, which are immutable like the RDD itself. Applying transformations to existing partitions produces new partitions.
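Lazy evaluation can be observed directly. The sketch below counts how often a map function runs by using an accumulator, which is one possible way to make the effect visible; the object name LazySketch and the accumulator name mapCalls are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazySketch {
  // Returns how many times the map function had run before and after the action.
  def run(sc: SparkContext): (Long, Long) = {
    val calls = sc.longAccumulator("mapCalls")
    val doubled = sc.parallelize(Seq(1, 2, 3, 4, 5)).map { n =>
      calls.add(1) // side effect used only to observe when map executes
      n * 2
    }
    val before = calls.value.longValue() // transformation defined, nothing computed yet
    doubled.count()                      // action: triggers the actual computation
    val after = calls.value.longValue()
    (before, after)
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark in-process with two threads (a demo assumption).
    val conf = new SparkConf().setAppName("LazySketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) // prints (0,5)
    finally sc.stop()
  }
}
```

Because doubled is not cached, calling a second action on it would run the map function again for every element.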
RDD abstraction
• Resilient Distributed Datasets
• partitioned collection of records
• spread across the cluster
• read-only
• caching dataset in memory
– different storage levels available
– fallback to disk possible
RDD operations
• transformations to build RDDs through
deterministic operations on other RDDs
– transformations include map, filter, join
– lazy operation
• actions to return value or export data
– actions include count, collect, save
– triggers execution
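A small sketch of the transformations and actions listed above: join, filter, and map build new RDDs lazily, then count and collect trigger execution. The object name OpsSketch and the sample user and order data are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object OpsSketch {
  // Returns the number of orders above 6.0 and the sorted names of their users.
  def run(sc: SparkContext): (Long, Array[String]) = {
    // Hypothetical key-value data: (userId, name) and (userId, orderAmount).
    val users = sc.parallelize(Seq((1, "ann"), (2, "bob"), (3, "cy")))
    val orders = sc.parallelize(Seq((1, 20.0), (1, 5.0), (3, 7.5), (4, 1.0)))
    // Transformations: each returns a new RDD and computes nothing yet.
    val names = users.join(orders)                      // inner join on userId
      .filter { case (_, (_, amount)) => amount > 6.0 } // keep large orders
      .map { case (_, (name, _)) => name }              // project the user name
    // Actions: each one triggers a job.
    (names.count(), names.collect().sorted)
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark in-process with two threads (a demo assumption).
    val conf = new SparkConf().setAppName("OpsSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (n, who) = run(sc)
      println(s"$n large orders from: ${who.mkString(", ")}")
    } finally sc.stop()
  }
}
```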
Job example
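// Load the log file from HDFS as an RDD of lines (path elided in the original).
val log = sc.textFile("hdfs://...")
// Keep only the error lines; this is a lazy transformation.
val errors = log.filter(_.contains("ERROR"))
// Mark errors for in-memory caching so both counts below reuse it.
errors.cache()
errors.filter(_.contains("I/O")).count()
errors.filter(_.contains("timeout")).count()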
[Figure: a driver and three workers; each worker holds one block of the input (Block1, Block2, Block3) and a cache; the action triggers execution.]
RDD partition-level view
Dataset-level view:
log: HadoopRDD, path = hdfs://...
errors: FilteredRDD, func = _.contains(…), shouldCache = true
Partition-level view:
Task 1, Task 2, ...
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
Job scheduling
rdd1.join(rdd2).groupBy(…).filter(…)
• RDD Objects: build operator DAG and pass the DAG to the DAGScheduler
• DAGScheduler: split graph into stages of tasks; submit each stage as ready, as a TaskSet, to the TaskScheduler
• TaskScheduler: launch tasks via cluster manager; retry failed or straggling tasks
• Worker: execute tasks in threads; store and serve blocks through the block manager
source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
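One way to inspect what the scheduler will work with is the toDebugString method of an RDD, which prints its lineage; a shuffle operation such as reduceByKey marks a stage boundary. The sketch below is illustrative only: the object name LineageSketch and the sample words are assumptions, and the exact text of the lineage varies between Spark versions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Returns the lineage description of a word count and its result.
  def run(sc: SparkContext): (String, Map[String, Int]) = {
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 2)
    // map stays within one stage; reduceByKey needs a shuffle, so a second stage follows.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    // toDebugString only describes the DAG; collect is the action that runs it.
    (counts.toDebugString, counts.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" runs Spark in-process with two threads (a demo assumption).
    val conf = new SparkConf().setAppName("LineageSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (lineage, result) = run(sc)
      println(lineage)
      println(result)
    } finally sc.stop()
  }
}
```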
Available APIs
• You can write in Java, Scala or Python
• interactive interpreter: Scala & Python only
• standalone applications: any
• performance: Java & Scala are faster thanks to
static typing
Hands-on: interpreter
• script: http://cern.ch/kacper/spark.txt
• run scala spark interpreter: $ spark-shell
• or python interpreter: $ pyspark
Hands-on: build and submission
• download and unpack source code: wget http://cern.ch/kacper/GvaWeather.tar.gz; tar -xzf GvaWeather.tar.gz
• build definition in: GvaWeather/gvaweather.sbt
• source code: GvaWeather/src/main/scala/GvaWeather.scala
• building: cd GvaWeather; sbt package
• job submission: spark-submit --master local --class GvaWeather target/scala-2.10/gva-weather_2.10-1.0.jar
Summary
• concept not limited to single-pass map-reduce
• avoid storing intermediate results on disk or HDFS
• speed up computations when reusing datasets

Unit II Real Time Data Processing tools.pptx

  • 1. Unit II Introducing Real-Time Processing Tool
  • 2. Agenda • Apache Spark, Why Apache Spark, Evolution of Apache Spark, • Architecture Apache Spark, Features of Apache Spark, • Spark Deployment, Standalone, Hadoop YARN, Spark MapReduce, • Components of Apache Spark, Spark core, Spark SQL, Spark Streaming, Spark Machine Learning, Spark GraphX, Spark Shell, • Resilient Distributed Dataset (RDD) Basic, Spark Context, RDD Transformations, Creating RDDs, RDD Operations, Programming with RDD, • Transformations, Actions, Lazy Evaluation, Converting between RDD Types
  • 3. Apache Spark • Apache Spark is a lightning-fast cluster computing framework designed for real-time processing. • Spark is an open-source project from Apache Software Foundation. • Spark overcomes the limitations of Hadoop MapReduce, and it extends the MapReduce model to be efficiently used for data processing. • Spark is a market leader for big data processing. • It is widely used across organizations in many ways. • It has surpassed Hadoop by running 100 times faster in memory and 10 times faster on disks.
  • 4. Why Apache Spark • Most of the technology-based companies across the globe have moved toward Apache Spark. • They were quick enough to understand the real value possessed by Sparks such as Machine Learning and interactive querying. • Industry leaders such as Amazon, Huawei, and IBM have already adopted Apache Spark. • The firms that were initially based on Hadoop, such as Hortonworks, Cloudera, and MapR, have also moved to Apache Spark. • Big Data Hadoop professionals surely need to learn Apache Spark since it is the next most important technology in Hadoop data processing. • ETL professionals, SQL professionals, and Project Managers can gain immensely if they master Apache Spark. • Data Scientists also need to gain in-depth knowledge of Spark to excel in their careers. • Spark can be extensively deployed in Machine Learning scenarios.
  • 5. Evolution of Apache Spark • Before Spark, there was MapReduce that was used as a processing framework. • Then, Spark got initiated as one of the research projects in 2009 at UC Berkeley AMPLab. • It was later open-sourced in 2010. • After its release in the market, Spark grew and moved to Apache Software Foundation in 2013 • Most organizations across the world have incorporated Apache Spark for empowering their Big Data applications.
  • 7. Feature of Apache Spark • Apache Spark has many features- • Fault tolerance- design to handle worker node failure using DAG and RDD. • Dynamic In Nature- offer 80 high-level operators to build parallel apps • Lazy Evaluation- transformation lazily evaluated, added to DAG and results obtained after action called. • Real-Time Stream Processing- language –integrated API to stream processing • Speed- run on Hadoop up to 100x faster in memory and 10x faster on disk, minimize disk read/write operation for intermediate results.
  • 8. Features of Apache Spark • Reusability: the same Spark code can be used for batch processing, for joining streaming data against historical data, and for ad hoc queries on streaming state. • Advanced analytics: widely used for big data processing and data science across multiple industries, with machine learning and graph processing libraries. • In-memory computing: tasks are processed in memory, so intermediate results need not be written back to disk; intermediate results can be cached for reuse in the next iteration, and a common dataset can be shared across multiple tasks. • Multiple languages: APIs are available in Java, Scala, Python, and R, plus Spark SQL; the R API offers advanced features for data analytics. • Integration with Hadoop: works well with the Hadoop file system (HDFS) and supports multiple file formats such as Parquet, JSON, CSV, ORC, and Avro. • Cost efficiency: Spark is open-source software, so it has no licensing fee.
  • 9. Spark Deployment • Apache Spark can be used together with Hadoop and Hadoop YARN. • It can be deployed on Hadoop in three ways: • Standalone: resources are statically allocated to Spark on all or a subset of the machines in a Hadoop cluster, and Spark runs side by side with Hadoop MapReduce. • YARN: Spark runs on YARN without any pre-installation; using the Hadoop configuration files, it can read from and write to HDFS and connect to the YARN ResourceManager. • SIMR (Spark in MapReduce): launches Spark jobs inside MapReduce, which makes it easy to start experimenting with Spark.
  • 10. Components of Spark • The following image gives you a clear picture of the different Spark components.
  • 11. Components of Spark • Apache Spark Core: the general execution engine of the Spark platform, on which all other functionality is built. It provides in-memory computing and can reference datasets stored in external storage systems. Its rich set of operators lets you write code quickly, and programs take fewer lines when written in Scala. • Spark SQL: introduces a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which supports both structured and semi-structured data. • MLlib (Machine Learning Library): contains a wide array of machine learning algorithms, including classification, clustering, and collaborative filtering. • GraphX: a library for manipulating graphs and performing graph computations. It extends the Spark RDD API with a directed graph abstraction and provides numerous operators for manipulating graphs, along with graph algorithms.
  • 12. Resilient Distributed Dataset (RDD) Basic RDDs are the main logical data units in Spark. An RDD is a distributed collection of objects stored in memory or on the disks of different machines in a cluster. A single RDD can be divided into multiple logical partitions so that the partitions can be stored and processed on different machines of the cluster. RDDs are immutable (read-only). You cannot change an existing RDD, but you can create new RDDs by applying coarse-grained operations, such as transformations, to it. An RDD can be cached and reused in later transformations, which avoids recomputing it. RDDs are lazily evaluated: computation is delayed until a result is actually needed, which avoids unnecessary work.
  • 13. Features of an RDD in Spark • Here are some features of RDD in Spark: • Resilience: RDDs track data lineage information so that lost data can be recovered automatically after a failure. This is also called fault tolerance. • Distributed: the data in an RDD resides on multiple nodes of a cluster. • Lazy evaluation: defining an RDD does not load any data. Transformations are computed only when you call an action, such as count or collect, or save the output to a file system.
  • 14. Features of an RDD in Spark • Immutability: data stored in an RDD is read-only; you cannot edit the data in an RDD, but you can create new RDDs by performing transformations on existing RDDs. • In-memory computation: an RDD keeps intermediate data in memory (RAM) rather than on disk, which provides faster access. • Partitioning: any existing RDD can be divided into logical partitions. Because RDDs are immutable, changing the partitioning means applying a transformation that produces a new RDD.
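The partitioning idea can be illustrated with a small pure-Python sketch. It is a toy model, not Spark's API: the function name `partition_by_key`, the use of integer keys, and the `key % num_partitions` placement rule are assumptions made for illustration. It shows records being split into logical partitions, with each call producing a new set of partitions and leaving the input untouched.

```python
def partition_by_key(records, num_partitions):
    """Return a new list of partitions; the input records are left untouched.

    Hypothetical helper: assumes every record is a (key, value) pair with an
    integer key, and places a record in partition `key % num_partitions`.
    """
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[key % num_partitions].append((key, value))
    # Freeze each partition as a tuple to mirror the read-only nature of RDD data.
    return [tuple(p) for p in partitions]


records = ((1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))

two_parts = partition_by_key(records, 2)
three_parts = partition_by_key(records, 3)   # "repartitioning" builds a new result

print(two_parts)    # [((2, 'b'), (4, 'd')), ((1, 'a'), (3, 'c'), (5, 'e'))]
print(three_parts)  # [((3, 'c'),), ((1, 'a'), (4, 'd')), ((2, 'b'), (5, 'e'))]
```

In a real cluster each partition would live on, and be processed by, a different machine; here they are simply separate tuples in one process.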
  • 15. RDD abstraction • Resilient Distributed Datasets • partitioned collection of records • spread across the cluster • read-only • caching dataset in memory – different storage levels available – fallback to disk possible
  • 16. RDD operations • transformations to build RDDs through deterministic operations on other RDDs – transformations include map, filter, join – lazy operation • actions to return value or export data – actions include count, collect, save – triggers execution
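The split between lazy transformations and execution-triggering actions can be sketched in plain Python. This is a single-process toy, not Spark's implementation: the class `ToyRDD`, its fields, and the `load` function are hypothetical names introduced for illustration. Transformations only record a step and return a new object; an action replays the recorded steps over the source data.

```python
class ToyRDD:
    """Toy, single-process stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, lineage=()):
        self._source = source      # zero-argument function that produces the base data
        self._lineage = lineage    # tuple of (kind, function) steps; never mutated

    def map(self, fn):
        # Transformation: record the step and return a NEW ToyRDD.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step and return a NEW ToyRDD.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: read the source and replay every recorded step.
        data = list(self._source())
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    def count(self):
        # Action: number of elements after all recorded steps.
        return len(self.collect())


calls = []

def load():
    calls.append("load")       # note every time the source is actually read
    return range(1, 11)

numbers = ToyRDD(load)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(calls)                   # []  (no action yet, so nothing has run)
print(even_squares.collect())  # [4, 16, 36, 64, 100]
print(calls)                   # ['load']
```

Because each object keeps the full list of steps that produced it, a result can always be rebuilt from the source, which is the same idea that lineage-based recovery relies on.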
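  • 17. Job example
      val log = sc.textFile("hdfs://...")               // lazily reference the log file
      val errors = log.filter(_.contains("ERROR"))      // transformation: keep error lines
      errors.cache()                                    // mark errors for in-memory caching
      errors.filter(_.contains("I/O")).count()          // action: computes and caches errors
      errors.filter(_.contains("timeout")).count()      // action: reuses the cached errors
    [Figure: a driver and three workers; the workers hold data blocks (Block1, Block2, Block3) and caches; count() is marked as the action.]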
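The effect of caching in the job above can be mimicked with a small pure-Python sketch. Everything here is a toy under stated assumptions: `CountingSource`, `ToyErrors`, and the sample log lines are hypothetical and do not come from Spark or from the slides. The sketch counts how often the source is scanned when two actions run, with and without caching.

```python
class CountingSource:
    """Hypothetical data source that counts how often it is scanned."""

    def __init__(self, lines):
        self._lines = list(lines)
        self.scans = 0

    def read(self):
        self.scans += 1
        return list(self._lines)


class ToyErrors:
    """Toy stand-in for the filtered `errors` dataset, with optional caching."""

    def __init__(self, source):
        self._source = source
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; nothing is computed until an action runs.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached
        rows = [line for line in self._source.read() if "ERROR" in line]
        if self._use_cache:
            self._cached = rows
        return rows

    def count_containing(self, text):
        # Action: forces the computation and counts matching lines.
        return sum(1 for line in self._compute() if text in line)


LOG_LINES = ["ERROR I/O failure", "INFO started", "ERROR timeout", "ERROR I/O timeout"]

plain_source = CountingSource(LOG_LINES)
plain = ToyErrors(plain_source)
plain.count_containing("I/O")
plain.count_containing("timeout")
print(plain_source.scans)                        # 2: every action rescans the source

cached_source = CountingSource(LOG_LINES)
cached = ToyErrors(cached_source).cache()
io_errors = cached.count_containing("I/O")
timeouts = cached.count_containing("timeout")
print(io_errors, timeouts, cached_source.scans)  # 2 2 1
```

The first action on the cached dataset still has to scan the source; only later actions benefit, which matches the order of the two count() calls in the job example.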
  • 18. RDD partition-level view • Dataset-level view: log is a HadoopRDD (path = hdfs://...); errors is a FilteredRDD (func = _.contains(…), shouldCache = true) • Partition-level view: each partition is processed by its own task (Task 1, Task 2, ...) • source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
  • 19. Job scheduling • RDD Objects: build the operator DAG, e.g. rdd1.join(rdd2).groupBy(…).filter(…), and pass the DAG to the DAGScheduler • DAGScheduler: split the graph into stages of tasks; submit each stage as ready, passing a TaskSet to the TaskScheduler • TaskScheduler: launch tasks via the cluster manager; retry failed or straggling tasks • Worker: execute tasks in threads; the block manager stores and serves blocks • source: https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
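How a chain of operations might be split into stages can be sketched as follows. This is a simplified toy, not the DAGScheduler's algorithm: it assumes a purely linear chain, and it assumes (beyond what the slide states) that a new stage begins at every operation that needs data shuffled between partitions. The names `WIDE_OPS` and `split_into_stages` are hypothetical.

```python
# Assumption: operations that need a shuffle start a new stage.
# This set is an illustrative subset, not a complete list.
WIDE_OPS = {"join", "groupBy"}


def split_into_stages(ops):
    """Group consecutive operations, starting a new stage at each wide operation."""
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


stages = split_into_stages(["map", "join", "groupBy", "filter"])
for index, stage in enumerate(stages):
    print(f"stage {index}: {stage}")
# stage 0: ['map']
# stage 1: ['join']
# stage 2: ['groupBy', 'filter']
```

Operations that stay within a stage, such as the filter after groupBy here, can be run together by the same task on each partition.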
  • 20. Available APIs • You can write in Java, Scala or Python • interactive interpreter: Scala & Python only • standalone applications: any • performance: Java & Scala are faster thanks to static typing
  • 21. Hands-on: interpreter • script: http://cern.ch/kacper/spark.txt • run the Scala Spark interpreter: $ spark-shell • or the Python interpreter: $ pyspark
  • 22. Hands-on: build and submission • download and unpack source code: wget http://cern.ch/kacper/GvaWeather.tar.gz; tar -xzf GvaWeather.tar.gz • build definition in: GvaWeather/gvaweather.sbt • source code: GvaWeather/src/main/scala/GvaWeather.scala • building: cd GvaWeather; sbt package • job submission: spark-submit --master local --class GvaWeather target/scala-2.10/gva-weather_2.10-1.0.jar
  • 23. Summary • the concept is not limited to single-pass map-reduce • avoids storing intermediate results on disk or HDFS • speeds up computations when reusing datasets