APACHE SPARK
S.Deepa
Assistant Professor (Sr.G)
Department of Computer Technology – PG
Kongu Engineering College
Introduction
• Apache Spark is a lightning-fast cluster computing technology designed for fast computation.
• It builds on the Hadoop MapReduce model and extends it to support more types of computations efficiently, including interactive queries and stream processing.
• The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
• Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.
• Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
Evolution of Apache Spark
• Spark began in 2009 as a research project developed in UC Berkeley’s AMPLab by Matei Zaharia.
• It was open-sourced in 2010 under a BSD license.
• It was donated to the Apache Software Foundation in 2013.
• Apache Spark became a top-level Apache project in February 2014.
Features of Apache Spark
• Speed
• Supports multiple languages
• Advanced Analytics
• Speed − Spark can run an application on a Hadoop cluster up to 100 times faster in memory, and up to 10 times faster when running on disk.
• This is achieved by reducing the number of read/write operations to disk.
• It stores the intermediate processing data in memory.
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with around 80 high-level operators for interactive querying.
• Advanced Analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
• Powerful Caching - A simple programming layer provides powerful caching and disk persistence capabilities.
• Deployment - It can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager.
• Real-Time - It offers real-time computation and low latency because of in-memory computation.
• Polyglot - Spark provides high-level APIs in Java, Scala, Python, and R, and Spark code can be written in any of these four languages. It also provides interactive shells in Scala and Python; a minimal PySpark sketch follows.
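As a minimal illustration of these high-level operators, here is a hedged PySpark word-count sketch. The input path and application name are placeholders, not taken from the slides:

# Illustrative PySpark word count using a few of Spark's high-level operators.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountExample").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///tmp/input.txt")          # assumed input path
counts = (lines.flatMap(lambda line: line.split())     # split lines into words
               .map(lambda word: (word, 1))            # pair each word with a count
               .reduceByKey(lambda a, b: a + b))       # sum counts per word
print(counts.take(10))                                 # action: triggers the computation
spark.stop()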
Spark Built on Hadoop
Ways of Spark deployment
• Standalone − In a standalone deployment, Spark occupies the place on top of HDFS (Hadoop Distributed File System), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN − In a Hadoop YARN deployment, Spark simply runs on YARN without any pre-installation or root access required. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to the standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access. (The deployment choice typically appears in application code, or on the spark-submit command line, as the master setting; see the sketch below.)
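As a hedged illustration only: the master URLs below are examples, and in practice the master is usually supplied via spark-submit rather than hard-coded in the application.

# Sketch of how the deployment mode shows up as the master setting.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("DeploymentExample")
         .master("local[*]")   # "yarn" for Hadoop YARN, "spark://host:7077" for standalone
         .getOrCreate())
print(spark.sparkContext.master)
spark.stop()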
Components of Spark
Components
• Apache Spark Core
-- Spark Core is the underlying general execution engine for the Spark platform upon which all other functionality is built.
-- It provides in-memory computing and the ability to reference datasets in external storage systems.
• Spark SQL
-- Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (which later evolved into the DataFrame API), providing support for structured and semi-structured data; a minimal sketch follows.
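A minimal, illustrative Spark SQL sketch (the table name, column names, and sample rows are invented); it registers a DataFrame, the modern successor of SchemaRDD, and queries it with SQL:

# Illustrative Spark SQL usage on a tiny in-memory dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 23)], ["name", "age"])   # structured data with a schema
df.createOrReplaceTempView("people")                  # expose the DataFrame to SQL

adults = spark.sql("SELECT name FROM people WHERE age > 30")
adults.show()
spark.stop()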
• Spark Streaming
-- Spark Streaming is an extension of the core Spark API that allows data engineers and
data scientists to process real-time data from various sources including (but not limited
to) Kafka, Flume, and Amazon Kinesis.
-- This processed data can be pushed out to file systems, databases, and live dashboards.
-- Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics.
-- It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data; a minimal word-count sketch follows.
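A hedged sketch of the classic DStream word count. The socket source localhost:9999 is a placeholder (it could be fed with a tool such as netcat), and newer applications typically use Structured Streaming instead of this older DStream API:

# Illustrative DStream word count over 5-second mini-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingExample")   # at least 2 threads: receiver + processing
ssc = StreamingContext(sc, batchDuration=5)         # 5-second mini-batches

lines = ssc.socketTextStream("localhost", 9999)     # assumed live source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # push each mini-batch result to the console

ssc.start()
ssc.awaitTermination()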
• MLlib (Machine Learning Library)
-- Built on top of Spark, MLlib is a scalable machine learning library consisting of common
learning algorithms and utilities, including classification, regression, clustering,
collaborative filtering, dimensionality reduction, and underlying optimization primitives.
-- MLlib is a distributed machine learning framework above Spark that takes advantage of the distributed, memory-based Spark architecture.
-- According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is as much as nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface). A small sketch follows.
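An illustrative sketch using the DataFrame-based MLlib API; the algorithm (KMeans) and the tiny dataset are made up for demonstration and are not from the slides:

# Illustrative MLlib example: KMeans clustering on a toy dataset.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(data)    # training runs as a distributed Spark job
print(model.clusterCenters())
spark.stop()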
• GraphX
• GraphX is a distributed graph-processing framework on top of Spark.
• It provides an API for expressing graph computation that can model user-defined graphs by using the Pregel abstraction API.
• It also provides an optimized runtime for this abstraction.
• You can view the same data as both graphs and collections, transform and join
graphs with RDDs efficiently, and write custom iterative graph algorithms using
the Pregel API.
Resilient Distributed Datasets
• Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark.
• It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster.
• RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
• Formally, an RDD is a read-only, partitioned collection of records.
• RDDs can be created through deterministic operations on either data on stable storage or other
RDDs.
• RDD is a fault-tolerant collection of elements that can be operated on in parallel.
• There are two ways to create RDDs − parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop Input Format (see the sketch below).
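A short sketch of both creation paths. The HDFS path is a placeholder; only the parallelized collection is acted on, so the sketch runs even without that file:

# Two ways to create RDDs: from a driver-side collection and from external storage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDCreation").getOrCreate()
sc = spark.sparkContext

rdd_from_collection = sc.parallelize([1, 2, 3, 4, 5])       # parallelize an existing collection
rdd_from_storage = sc.textFile("hdfs:///tmp/data.txt")       # reference external storage (assumed path)

print(rdd_from_collection.count())
spark.stop()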
• Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations. Let us first discuss how MapReduce operations take place and why they are not so efficient.
• Data Sharing is slow in MapReduce.
• MapReduce is widely adopted for processing and generating large datasets with a parallel,
distributed algorithm on a cluster.
• It allows users to write parallel computations, using a set of high-level operators, without having
to worry about work distribution and fault tolerance.
• Unfortunately, in most current frameworks, the only way to reuse data between computations (Ex
− between two MapReduce jobs) is to write it to an external stable storage system (Ex − HDFS).
• Although this framework provides numerous abstractions for accessing a cluster’s computational
resources, users still want more.
• Both Iterative and Interactive applications require faster data sharing across parallel jobs.
• Data sharing is slow in MapReduce due to replication, serialization, and disk IO.
• Regarding the storage system, most Hadoop applications spend more than 90% of their time doing HDFS read/write operations.
Iterative Operations on MapReduce
• Reuse intermediate results across multiple computations in multi-stage applications.
• The following illustration shows how the current framework works while performing iterative
operations on MapReduce.
• This incurs substantial overheads due to data replication, disk I/O, and serialization, which makes
the system slow.
Interactive Operations on MapReduce
• User runs ad-hoc queries on the same subset of data.
• Each query will do the disk I/O on the stable storage, which can dominate application execution
time.
• The following illustration shows how the current framework works while performing interactive
queries on MapReduce.
Data Sharing using Spark RDD
• Data sharing is slow in MapReduce due to replication, serialization, and disk I/O.
• Most Hadoop applications spend more than 90% of their time doing HDFS read/write operations.
• Recognizing this problem, researchers developed a specialized framework called Apache Spark.
• The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory processing.
• This means it stores the state of memory as an object across jobs, and the object is sharable between those jobs.
• Data sharing in memory is 10 to 100 times faster than over the network or from disk.
Spark RDD
• Iterative Operations on Spark RDD
Interactive Operations on Spark RDD
• This illustration shows interactive operations on Spark RDD.
• If different queries are run on the same set of data repeatedly, this particular data can be kept in
memory for better execution times.
• By default, each transformed RDD may be recomputed each time you run an action on it.
• However, you may also persist an RDD in memory, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it.
• There is also support for persisting RDDs on disk, or replicating them across multiple nodes.
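A brief sketch of persisting an RDD so that repeated actions reuse the in-memory data instead of recomputing it. The input path and the "ERROR" filter are illustrative:

# Persisting an RDD across multiple actions.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PersistExample").getOrCreate()
sc = spark.sparkContext

logs = sc.textFile("hdfs:///tmp/logs.txt")            # assumed input path
errors = logs.filter(lambda line: "ERROR" in line)

errors.persist(StorageLevel.MEMORY_AND_DISK)          # keep partitions around after first use
print(errors.count())    # first action: computes and caches the partitions
print(errors.count())    # second action: served from memory (spilling to disk if needed)
spark.stop()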
Spark Architecture
• Apache Spark has a well-defined layered architecture where all the spark components and layers are
loosely coupled.
• This architecture is further integrated with various extensions and libraries.
• Apache Spark Architecture is based on two main abstractions:
• Resilient Distributed Dataset (RDD)
• Directed Acyclic Graph (DAG)
Spark Eco-System
• The Spark ecosystem is composed of various components such as Spark SQL, Spark Streaming, MLlib, GraphX, and the Core API component.
Spark Core
• Spark Core is the base engine for large-scale parallel and distributed data processing.
• Additional libraries built on top of the core allow diverse workloads such as streaming, SQL, and machine learning.
• It is responsible for memory management and fault recovery, for scheduling, distributing, and monitoring jobs on a cluster, and for interacting with storage systems.
Spark Streaming
• Spark Streaming is the component of Spark which is used to process real-time streaming data.
• Thus, it is a useful addition to the core Spark API.
• It enables high-throughput and fault-tolerant stream processing of live data streams.
• Spark SQL
Spark SQL is a module in Spark that integrates relational processing with Spark’s functional programming API.
• It supports querying data either via SQL or via the Hive Query Language.
• For those familiar with RDBMS, Spark SQL is an easy transition from earlier tools, letting you extend the boundaries of traditional relational data processing.
• GraphX
GraphX is the Spark API for graphs and graph-parallel computation.
• Thus, it extends the Spark RDD with a Resilient Distributed Property Graph.
• At a high-level, GraphX extends the Spark RDD abstraction by introducing the Resilient Distributed
Property Graph (a directed multigraph with properties attached to each vertex and edge).
• MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in
Apache Spark.
• SparkR
• It is an R package that provides a distributed data frame implementation.
• It also supports operations like selection, filtering, and aggregation, but on large datasets.
Resilient Distributed Dataset(RDD)
RDDs are the building blocks of any Spark application. RDD stands for:
• Resilient: Fault tolerant and is capable of rebuilding data on failure
• Distributed: Distributed data among the multiple nodes in a cluster
• Dataset: Collection of partitioned data with values
• It is a layer of abstracted data over the distributed collection. It is immutable in
nature and follows lazy transformations.
• The data in an RDD is split into chunks based on a key.
• RDDs are highly resilient, i.e., they are able to recover quickly from any issues as the same data chunks are replicated across multiple executor nodes.
• Thus, even if one executor node fails, another will still process the data.
• This allows you to perform your functional calculations against your dataset very
quickly by harnessing the power of multiple nodes.
• Moreover, once you create an RDD it becomes immutable.
• Immutable means an object whose state cannot be modified after it is created, though it can be transformed into a new RDD.
• Talking about the distributed environment, each dataset in RDD is divided into
logical partitions, which may be computed on different nodes of the cluster.
• Due to this, you can perform transformations or actions on the complete data
parallelly.
• Also, you don’t have to worry about the distribution, because Spark takes care of
that.
• There are two ways to create RDDs − parallelizing an existing collection in your driver program, or
by referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase,
etc.
• With RDDs, you can perform two types of operations:
• Transformations: They are the operations that are applied to create a new RDD.
• Actions: They are applied on an RDD to instruct Apache Spark to apply the computation and pass the result back to the driver (see the sketch below).
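A small sketch contrasting the two operation types; the transformations below are lazy and nothing executes until an action is called:

# Transformations build new RDDs lazily; actions trigger computation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OpsExample").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(10))
evens   = numbers.filter(lambda x: x % 2 == 0)   # transformation: nothing runs yet
squares = evens.map(lambda x: x * x)             # transformation: still lazy

print(squares.collect())   # action: the job runs, results return to the driver
print(squares.count())     # action
spark.stop()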
Working of Spark Architecture
• On the master node, you have the driver program, which drives your application. The code you write behaves as the driver program; if you are using the interactive shell, the shell acts as the driver program.
• Inside the driver program, the first thing you do is create a Spark Context.
• Assume that the Spark context is a gateway to all the Spark functionalities.
• It is similar to your database connection.
• Any command you execute in your database goes through the database connection.
• Likewise, anything you do on Spark goes through Spark context.
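A minimal sketch of creating a Spark Context directly in a driver program. The application name and master URL are placeholders; many modern applications create a SparkSession instead and reach the SparkContext through it:

# Creating the Spark Context, the gateway to all Spark functionality.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("DriverProgram").setMaster("local[*]")
sc = SparkContext(conf=conf)            # gateway to all Spark functionality

distributed = sc.parallelize(["a", "b", "c"])
print(distributed.count())
sc.stop()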
• Now, this Spark context works with the cluster manager to manage various jobs.
• The driver program & Spark context takes care of the job execution within the
cluster.
• A job is split into multiple tasks, which are distributed over the worker nodes.
• Anytime an RDD is created in Spark context, it can be distributed across various
nodes and can be cached there.
• Worker nodes are the slave nodes whose job is to execute the tasks. The tasks are executed on the partitioned RDDs in the worker nodes, and the results are returned to the Spark Context.
• The Spark Context takes the job, breaks it into tasks, and distributes them to the worker nodes. These tasks work on the partitioned RDDs, perform operations, collect the results, and return them to the main Spark Context.
• If you increase the number of workers, you can divide jobs into more partitions and execute them in parallel over multiple systems, which is much faster.
• With an increase in the number of workers, the available memory also increases, so you can cache more data and execute jobs faster.
• To understand the workflow of the Spark architecture, walk through the steps below:
• STEP 1:
• The client submits spark user application code.
• When the application code is submitted, the driver implicitly converts the user code containing transformations and actions into a logical directed acyclic graph (DAG).
• At this stage, it also performs optimizations such as pipelining transformations.
STEP 2:
• After that, it converts the logical graph (the DAG) into a physical execution plan with multiple stages.
• After converting into a physical execution plan, it creates physical execution units called tasks
under each stage.
• Then the tasks are bundled and sent to the cluster.
• STEP 3:
• Now the driver talks to the cluster manager and negotiates the resources.
• Cluster manager launches executors in worker nodes on behalf of the driver.
• At this point, the driver will send the tasks to the executors based on data
placement.
• When executors start, they register themselves with the driver.
• So, the driver will have a complete view of executors that are executing the task.
• STEP 4: During the course of task execution, the driver program monitors the set of executors that are running. The driver node also schedules future tasks based on data placement.
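A small sketch tying these steps together: the transformations below only build the lineage (the logical DAG), and the final action is what makes the driver create stages and tasks and run them. The words in the example are arbitrary:

# Transformations build the lineage (DAG); the action triggers stage and task execution.
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setAppName("DAGExample").setMaster("local[*]"))

words  = sc.parallelize(["spark", "hadoop", "spark", "yarn"])
pairs  = words.map(lambda w: (w, 1))             # narrow transformation
counts = pairs.reduceByKey(lambda a, b: a + b)   # wide transformation (stage boundary)

lineage = counts.toDebugString()                 # describes the lineage / DAG before running
print(lineage.decode() if isinstance(lineage, bytes) else lineage)
print(counts.collect())                          # action: the driver schedules stages and tasks
sc.stop()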
