Siddharth Singh
Topics
 Why SPARK
 Evolution of SPARK
 SPARK Components
 SPARK vs Map Reduce
 Replacing Hadoop?
 Execution Modes
 RDD and its Properties
 Practice Sessions
Need of SPARK
Why was there a need for a new SYSTEM ?
Map Reduce was used extensively to analyze large data sets in batch
processing, but the latency of Map Reduce jobs was high. The industry needed
a system that could run batch jobs faster than Map Reduce.
It also needed a single framework that natively supported batch, interactive,
SQL, graph, streaming and machine learning processing engines.
Map Reduce supported only batch processing; it could not do the interactive or
real-time processing that is sometimes needed for quick analysis and fast
analytics.
YARN View
SPARK Components
This is a SINGLE SPARK framework with all these components
supported natively.
Contd..
Apache Spark Core
Spark Core is the underlying general execution engine for the Spark platform
that all other functionality is built upon. It provides in-memory computing and
can reference datasets in external storage systems. We can also run batch
jobs on top of Core using the SPARK APIs.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data
abstraction called SchemaRDD, which provides support for structured and semi-
structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD (Resilient
Distributed Datasets) transformations on those mini-batches of data.
Contd..
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark that
benefits from Spark's distributed, memory-based architecture. Spark MLlib is
nine times as fast as the Hadoop disk-based version of Apache Mahout.
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It
provides an API for expressing graph computations that can model user-
defined graphs using the Pregel abstraction API. It also provides an optimized
runtime for this abstraction.
Why SPARK
 Apache Spark™ is a fast and general engine for large-scale data
processing in memory. The key is in-memory processing.
 SPARK is 10X to 100X faster than Map Reduce, depending on whether the
operation runs in memory or on disk.
 It provides high-level APIs for JAVA, Python, Scala and R, so you can
work in any of these languages in SPARK.
 SPARK provides a single, unified framework for batch, interactive,
SQL, graph, streaming and machine learning processing engines natively.
In HADOOP, these functionalities were distributed across specialized tools
built on top of HADOOP.
Why SPARK
 It provides a real-time stream processing system.
 Faster decision making thanks to interactive shell processing.
 It is a general-purpose computing engine that supports most types of
computation in a single framework.
Map reduce vs SPARK
Replacing Hadoop ?
SPARK is not a REPLACEMENT for Hadoop, and you can learn the SPARK
framework without learning HADOOP.
What does that mean ?
HADOOP provides both STORAGE and PROCESSING, whereas SPARK does not have
its own storage system. SPARK can use the storage of any file system,
including HDFS, and can be deployed on top of HADOOP/YARN like any other
YARN-supported tool. SPARK can also use Amazon S3, HBASE, CASSANDRA or
the local file system.
That said, it works BEST with HDFS, because it inherits HDFS properties
such as replication and fault tolerance.
You may compare Map Reduce with SPARK's batch processing system, but not
with HADOOP as a whole; they are separate frameworks for different
purposes, though there is some overlap.
Evolution of SPARK
Spark started in 2009 as one of Hadoop's sub-projects, developed in UC
Berkeley's AMPLab by Matei Zaharia. It was open sourced in 2010 under a BSD
license, donated to the Apache Software Foundation in 2013, and became a
top-level Apache project in February 2014. Spark is written mostly in Scala
(with roughly 2-3% in Python) but provides APIs for JAVA, Scala, Python and R.
Research and development on SPARK is still very active, and upgraded
versions come out regularly.
Modes of running SPARK
Contd..
Standalone − In a Spark Standalone deployment, Spark occupies the place
on top of HDFS (Hadoop Distributed File System), and space is allocated for
HDFS explicitly. Here, Spark and Map Reduce run side by side to cover
all Spark jobs on the cluster. This mode runs on a cluster.
Hadoop Yarn − In a Hadoop Yarn deployment, Spark simply runs on Yarn.
This integrates Spark into the Hadoop ecosystem or Hadoop stack and allows
other components to run on top of the stack. It can run in client mode or
cluster mode: client mode is interactive, while cluster mode means
submitting the jar to the Hadoop cluster. This mode runs on a cluster.
Local Mode − Spark runs on your local system, and all SPARK processes run
in a single JVM process on your client machine. This mode is used to test
your SPARK jobs. It cannot run on a cluster; it is for a single client
machine.
Standalone mode
Architecture - Standalone
Architecture - YARN
RDD (Resilient Distributed Dataset)
RDD is a collection of data with these four properties:
1. Immutability
2. Lazy evaluation
3. Type inference
4. Cacheability
Let's discuss each.
What is Immutability
 Immutability means that once something is created, it never changes.
 Big Data is immutable by default, as it provides streaming access.
 Immutability helps with:
Parallelism
Caching
Because the underlying data never changes, you can parallelize and cache
it easily. Immutability is about the value/object, not about the reference:
String name = "Siddharth";
name = "Siddharth " + "Singh"; // new String object; the reference is rebound
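The point above can be checked with a small runnable sketch (plain Scala, to match the rest of the deck's examples; the helper function is illustrative, not Spark API): the original string object survives the "change", only the variable is rebound.

```scala
// Immutability demo: "modifying" a String really builds a new object
// and rebinds the variable; any older reference still sees the old value.
def appendSurname(first: String): (String, String) = {
  var name = first
  val original = name        // second reference to the current object
  name = name + " Singh"     // creates a NEW String; `original` is untouched
  (original, name)
}

val (before, after) = appendSurname("Siddharth")
println(before)  // Siddharth
println(after)   // Siddharth Singh
```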
What is Immutability
Immutable programming:
Here you create a new object any time you want to perform a transformation.
For example,
val collection = List(1, 2, 3, 4)
val newCollection = collection.map(value => value + 1)
The two collections are distinct because of immutability, and each further
transformation would require one more copy, and so on.
What is Immutability
Drawbacks:
This is good for parallelism but not for space: multiple transformations
cause multiple copies of the data.
Even small transformations create another copy, and we pass through the
whole dataset for each transformation. This can cause poor performance in
the BIG data world.
BUT
This is overcome by the lazy feature.
What is Laziness
Laziness means: do not compute or transform until needed, i.e. until an
action is called.
Laziness evaluates the statements but does not execute them.
It separates evaluation from execution.
Nothing happens until some action is called.
What is Laziness
val collection = List(1, 2, 3, 4)
val c1 = collection.map(value => value + 1)
val c2 = c1.map(value => value + 2)
print(c2) // action
Since it is lazy, the engine can combine both maps into one pass:
val c2 = collection.map(value => (value + 1) + 2)
Since no one asked for c1 itself, its object never needs to be materialized.
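The same deferred, one-pass behavior can be observed in plain Scala (no Spark needed) with a collection `view`: transformations are recorded but nothing touches the data until a terminal action forces evaluation.

```scala
// Laziness demo with a Scala view: map calls are deferred until an
// action (here, toList) forces a single fused pass over the data.
var touches = 0
val collection = List(1, 2, 3, 4)
val pipeline = collection.view
  .map { v => touches += 1; v + 1 }
  .map { v => touches += 1; v + 2 }

val before = touches          // still 0: nothing has executed yet
val result = pipeline.toList  // the "action": one pass over the data
println(before)   // 0
println(result)   // List(4, 5, 6, 7)
println(touches)  // 8 (each of the 4 elements hit both maps once)
```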
What is Laziness
 You can only be lazy when you are immutable; otherwise laziness would
have problems combining and transforming values.
 You cannot combine transformations if there are data-type errors in
them.
 So laziness and immutability together give us parallelism and faster
processing in a distributed environment.
 MapReduce has immutability but lacks laziness.
Laziness Challenges
Laziness can be a problem for data-type conversion errors. We do not want to
run a job and have it fail after an hour due to a type-casting error. (PIG is
also a lazy language.)
Lazy languages can also be difficult to debug, because nothing executes until
an action is called.
In a lazy language, a job may fail due to semantic issues if casting is not
proper, and we want to avoid that. In MapReduce, for example, a job can fail
at runtime when casting is done improperly and JAVA did not catch the cast
during compilation. In SPARK, you will never get a data-type casting error at
run time.
What we want ?
We want a programming language that is type-inferred, meaning the program
identifies the data types of variables and expressions and reports semantic
errors at evaluation/compile time, not at run time.
Type Inferred
It means the compiler decides the data type of a value/expression without the user declaring it.
E.g.,
val collection = List(1, 2, 3, 4)
val c1 = collection.map(value => value + 1)
Here 'c1' will always be a collection, because map always returns a collection when applied to
one.
val c2 = c1.size // inferred as Int
c2 is now inferred as an integer and cannot change its data type later in the program. This is
fixed by the return type of c1.size.
val c3 = c2.map(value => value + 1)
This gives a compile error, because you cannot apply map to an integer. This is called static typing.
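A minimal runnable sketch of the inference above (note: for plain Scala collections the parameterless length operation is `size`; Spark RDDs use `count`):

```scala
// Type inference demo: every type below is decided by the compiler.
val collection = List(1, 2, 3, 4)           // inferred List[Int]
val c1 = collection.map(value => value + 1) // still List[Int]
val c2 = c1.size                            // inferred Int

println(c1)  // List(2, 3, 4, 5)
println(c2)  // 4
// val c3 = c2.map(value => value + 1)  // would NOT compile: Int has no `map`
```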
What is Cacheable
Cacheable means the data can be cached in memory (RAM) easily.
Immutability helps here: if the underlying data is not changing, you can
cache it without any problem, since you know the data will stay the same.
Since it is lazy and immutable, the dataset can be rebuilt easily through
lineage, which means each transformation can be remembered and the data
recreated at any point in time.
Caching will of course improve the performance of any system.
This is the reason SPARK is written in SCALA, as these properties were not
available in JAVA. SCALA is a combination of functional and OOP programming
and runs on the JVM.
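Compute-once-reuse-forever is safe precisely because the data is immutable. A plain-Scala `lazy val` makes this concrete (as an analogy only, not the Spark `cache()` API): it evaluates on first use and serves the stored result afterwards.

```scala
// Caching demo: because the data is immutable, computing it once and
// reusing the result is always safe. `lazy val` evaluates on first
// access and then serves the cached value on every later access.
var computations = 0
lazy val cached: List[Int] = {
  computations += 1
  List(1, 2, 3, 4).map(_ + 1)
}

val first  = cached   // computed now
val second = cached   // served from memory, no recomputation
println(computations)     // 1
println(first eq second)  // true: the very same cached object
```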
RDD (Resilient Distributed Dataset)
What is RDD ? (It's the heart of SPARK and its main abstraction.)
Resilient Distributed Datasets (RDD) is the fundamental data structure of Spark.
It is an interface to the data, describing what the data will look like.
Each RDD is divided into logical partitions.
An RDD is a logical collection of data, but you can also cache the actual data
in memory.
It is fault tolerant through lineage: if any RDD or one of its partitions is
lost during transformations, it can be rebuilt.
RDDs support in-memory computation, which makes execution on a cluster faster.
Data Sharing in RDD
Data sharing is slow in Map Reduce due to replication, serialization, and disk
IO. Most Hadoop applications spend more than 90% of their time doing HDFS
read-write operations.
Recognizing this problem, researchers developed a specialized framework
called Apache Spark. The key idea of Spark is the Resilient Distributed
Dataset (RDD), which supports in-memory processing. Data sharing in
memory is 10 to 100 times faster than over the network or from disk.
