SlideShare a Scribd company logo
1 of 12
Apache Spark
Presented by Prashant Pranay
What is Apache Spark
Apache Spark is an open-source parallel processing framework that supports in-
memory processing to boost the performance of big-data analytic applications.
• Apache Spark was introduced in 2009 in the UC Berkeley R&D Lab, later it
becomes AMP Lab. It was open sourced in 2010 under BSD license. In 2013 spark
was donated to Apache Software Foundation where it became top-level Apache
project in 2014.
• Apache Spark is a general-purpose & lightning fast cluster computing system. It
provides high-level API. For example, Java, Scala, Python and R. Apache Spark is a
tool for Running Spark Applications. Spark is 100 times faster
than Bigdata Hadoop and 10 times faster than accessing data from disk.
• Spark is written in Scala but provides rich APIs in Scala, Java, Python, and R.
• It can be integrated with Hadoop and can process existing Hadoop HDFS data.
Need of Spark
In the industry, there was a need for general purpose cluster computing
tool as:
• Hadoop MapReduce can only perform batch processing.
• Apache Storm / S4 can only perform stream processing.
• Apache Impala / Apache Tez can only perform interactive processing
• Neo4j / Apache Giraph can only perform to graph processing
• Hence in the industry, there was a big demand for a powerful engine
that can process the data in real-time (streaming) as well as in batch
mode. There is a need for an engine that can respond in sub-second
and perform in-memory processing.
Apache Spark Components
• Spark Core: It is the kernel of Spark, which provides an
execution platform for all the Spark applications. It is a
generalized platform to support a wide array of applications.
• Spark SQL: It enables users to run SQL / HQL queries on the top
of Spark. Using Apache Spark SQL, we can process structured as
well as semi-structured data. It also provides an engine
for Hive to run unmodified queries up to 100 times faster on
existing deployments.
• Spark Streaming: Apache Spark Streaming enables powerful
interactive and data analytics application across live streaming
data. The live streams are converted into micro-batches which
are executed on top of spark core.
• Spark Mllib: It is the scalable machine learning library which
delivers both efficiencies as well as the high-quality
algorithm. Apache Spark MLlib is one of the hottest choices
for Data Scientist due to its capability of in-memory data
processing, which improves the performance of iterative
algorithm drastically.
• Spark GraphX: Apache Spark GraphX is the graph computation
engine built on top of spark that enables to process graph data
at scale.
Data in Apache Spark
• Resilient Distributed Dataset (RDD) : is the fundamental unit of data in Apache Spark, which is a distributed
collection of elements across cluster nodes and can perform parallel operations. Spark RDDs are immutable
but can generate new RDD by transforming existing RDD.
• Dataframe :Unlike an RDD, data organizes into named columns. For example a table in a relational
database. It is an immutable distributed collection of data. DataFrame in Spark allows developers to impose
a structure onto a distributed collection of data, allowing higher-level abstraction.
• Datasets :Datasets in Apache Spark are an extension of DataFrame API which provides type-safe, object-
oriented programming interface. Dataset takes advantage of Spark’s Catalyst optimizer by exposing
expressions and data fields to a query planner.
• Apache Spark RDDs support two types of operations:
Transformation – Creates a new RDD from the existing one. It passes the dataset to the function and
returns new dataset.
Action – Spark Action returns final result to driver program or write it to the external datastore.
Apache Spark Architecture
Apache Spark Components
a. Apache Spark Driver
• The main() method of the program runs in the driver. The driver is the process that runs the user code that creates RDDs, and performs transformation and action, and also
creates SparkContext. When the Spark Shell is launched, this signifies that we have created a driver program. On the termination of the driver, the application is finished.
• The driver program splits the Spark application into the task and schedules them to run on the executor. The task scheduler resides in the driver and distributes task among
workers. The two main key roles of drivers are:
Converting user program into the task.
Scheduling task on the executor.
• The structure of Spark program at a higher level is: RDDs are created from some input data, derive new RDD from existing using various transformations, and then after it
performs an action to compute data. In Spark Program, the DAG (directed acyclic graph) of operations are created implicitly. And when the driver runs, it converts that Spark
DAG into a physical execution plan.
b. Apache Spark Cluster Manager
• Spark relies on cluster manager to launch executors and in some cases, even the drivers are launched through it. It is a pluggable component in Spark. On the cluster manager,
jobs and action within a spark application are scheduled by Spark Scheduler in a FIFO fashion. Alternatively, the scheduling can also be done in Round Robin fashion. The
resources used by a Spark application can be dynamically adjusted based on the workload. Thus, the application can free unused resources and request them again when there is
a demand. This is available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.
c. Apache Spark Executors
• The individual task in the given Spark job runs in the Spark executors. Executors are launched once in the beginning of Spark Application and then they run for the entire lifetime
of an application. Even if the Spark executor fails, the Spark application can continue with ease. There are two main roles of the executors:
Runs the task that makes up the application and return the result to the driver.
Provide in-memory storage for RDDs that are cached by the user
How does Spark works?
How does Spark work?
• Using spark-submit, the user submits an application.
• In spark-submit, the main() method specified by the user is invoked. It also
launches the driver program.
• The driver program asks for the resources to the cluster manager that is
required to launch executors.
• The cluster manager launches executors on behalf of the driver program.
• The driver process runs with the help of user application. Based on the
actions and transformation on RDDs, the driver sends work to executors in
the form of tasks.
• The executors process the task and the obtained result is sent back to the
driver through cluster manager.
Common RDD functions
• RDD functions:- persist(),cache()
• RDD Transformation
map(func),flatMap(),filter(func),mapPartitions(func),mapPartitionWithI
ndex(),union(dataset),intersection(other-
dataset),distinct(),groupByKey(),reduceByKey(func,
[numTasks]),sortByKey(),join(),coalesce()
• RDD Action
count(),collect(),take(n),top(),countByValue(),reduce(),fold(),aggregat
e(),foreach()
How Does Microsoft Azure helps?
Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Apache Hadoop was the
original open-source framework for distributed processing and analysis of big data sets on clusters of computers. It provides 99.9% SLA for
service availability.
The following components and utilities are included on HDInsight clusters:+
• Ambari: Cluster provisioning, management, monitoring, and utilities.
• Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.
• Hive & HCatalog: SQL-like querying, and a table and storage management layer.
• Mahout: For scalable machine learning applications.
• MapReduce: Legacy framework for Hadoop distributed processing and resource management. See YARN.
• Oozie: Workflow management.
• Phoenix: Relational database layer over HBase.
• Pig: Simpler scripting for MapReduce transformations.
• Sqoop: Data import and export.
• Tez: Allows data-intensive processes to run efficiently at scale.
• YARN: Resource management that is part of the Hadoop core library.
• ZooKeeper: Coordination of processes in distributed systems.
Learn Spark
• https://community.cloud.databricks.com
• https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-
spark-overview
• https://www.tutorialspoint.com/apache_spark/
• https://www.pluralsight.com/

More Related Content

What's hot

Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Learning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a ClusterLearning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a Clusterphanleson
 
Learning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with SparkLearning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with Sparkphanleson
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to sparkHome
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversityAlex Zeltov
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for BeginnersAnirudh
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at ViadeoCepoi Eugen
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewMario Cartia
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkSamy Dindane
 

What's hot (20)

Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Learning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a ClusterLearning spark ch07 - Running on a Cluster
Learning spark ch07 - Running on a Cluster
 
Learning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with SparkLearning spark ch01 - Introduction to Data Analysis with Spark
Learning spark ch01 - Introduction to Data Analysis with Spark
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 
Apache Spark 101
Apache Spark 101Apache Spark 101
Apache Spark 101
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton UniversitySpark Advanced Analytics NJ Data Science Meetup - Princeton University
Spark Advanced Analytics NJ Data Science Meetup - Princeton University
 
SPARK ARCHITECTURE
SPARK ARCHITECTURESPARK ARCHITECTURE
SPARK ARCHITECTURE
 
Hadoop and Spark
Hadoop and SparkHadoop and Spark
Hadoop and Spark
 
Apache Spark for Beginners
Apache Spark for BeginnersApache Spark for Beginners
Apache Spark for Beginners
 
Apache Spark at Viadeo
Apache Spark at ViadeoApache Spark at Viadeo
Apache Spark at Viadeo
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Using pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 previewUsing pySpark with Google Colab & Spark 3.0 preview
Using pySpark with Google Colab & Spark 3.0 preview
 
Spark
SparkSpark
Spark
 
Module01
 Module01 Module01
Module01
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 

Similar to Apache spark

Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Jyotasana Bharti
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark exampleShidrokhGoudarzi1
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxRahul Borate
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache sparkJohn Godoi
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark Mostafa
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxAishg4
 

Similar to Apache spark (20)

Apache Spark
Apache SparkApache Spark
Apache Spark
 
Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)Apache spark architecture (Big Data and Analytics)
Apache spark architecture (Big Data and Analytics)
 
Spark core
Spark coreSpark core
Spark core
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
spark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark examplespark example spark example spark examplespark examplespark examplespark example
spark example spark example spark examplespark examplespark examplespark example
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Apache Spark PDF
Apache Spark PDFApache Spark PDF
Apache Spark PDF
 
Unit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptxUnit II Real Time Data Processing tools.pptx
Unit II Real Time Data Processing tools.pptx
 
Apache spark
Apache sparkApache spark
Apache spark
 
Introduction to apache spark
Introduction to apache sparkIntroduction to apache spark
Introduction to apache spark
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
OVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptxOVERVIEW ON SPARK.pptx
OVERVIEW ON SPARK.pptx
 

Recently uploaded

BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingrakeshbaidya232001
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGSIVASHANKAR N
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfKamal Acharya
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 

Recently uploaded (20)

BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Porous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writingPorous Ceramics seminar and technical writing
Porous Ceramics seminar and technical writing
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTINGMANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
MANUFACTURING PROCESS-II UNIT-1 THEORY OF METAL CUTTING
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 

Apache spark

  • 1. Apache Spark Presented by Prashant Pranay
  • 2. What is Apache Spark Apache Spark is an open-source parallel processing framework that supports in- memory processing to boost the performance of big-data analytic applications. • Apache Spark was introduced in 2009 in the UC Berkeley R&D Lab, later it becomes AMP Lab. It was open sourced in 2010 under BSD license. In 2013 spark was donated to Apache Software Foundation where it became top-level Apache project in 2014. • Apache Spark is a general-purpose & lightning fast cluster computing system. It provides high-level API. For example, Java, Scala, Python and R. Apache Spark is a tool for Running Spark Applications. Spark is 100 times faster than Bigdata Hadoop and 10 times faster than accessing data from disk. • Spark is written in Scala but provides rich APIs in Scala, Java, Python, and R. • It can be integrated with Hadoop and can process existing Hadoop HDFS data.
  • 3. Need of Spark In the industry, there was a need for general purpose cluster computing tool as: • Hadoop MapReduce can only perform batch processing. • Apache Storm / S4 can only perform stream processing. • Apache Impala / Apache Tez can only perform interactive processing • Neo4j / Apache Giraph can only perform to graph processing • Hence in the industry, there was a big demand for a powerful engine that can process the data in real-time (streaming) as well as in batch mode. There is a need for an engine that can respond in sub-second and perform in-memory processing.
  • 4. Apache Spark Components • Spark Core: It is the kernel of Spark, which provides an execution platform for all the Spark applications. It is a generalized platform to support a wide array of applications. • Spark SQL: It enables users to run SQL / HQL queries on the top of Spark. Using Apache Spark SQL, we can process structured as well as semi-structured data. It also provides an engine for Hive to run unmodified queries up to 100 times faster on existing deployments. • Spark Streaming: Apache Spark Streaming enables powerful interactive and data analytics application across live streaming data. The live streams are converted into micro-batches which are executed on top of spark core. • Spark Mllib: It is the scalable machine learning library which delivers both efficiencies as well as the high-quality algorithm. Apache Spark MLlib is one of the hottest choices for Data Scientist due to its capability of in-memory data processing, which improves the performance of iterative algorithm drastically. • Spark GraphX: Apache Spark GraphX is the graph computation engine built on top of spark that enables to process graph data at scale.
  • 5. Data in Apache Spark • Resilient Distributed Dataset (RDD) : is the fundamental unit of data in Apache Spark, which is a distributed collection of elements across cluster nodes and can perform parallel operations. Spark RDDs are immutable but can generate new RDD by transforming existing RDD. • Dataframe :Unlike an RDD, data organizes into named columns. For example a table in a relational database. It is an immutable distributed collection of data. DataFrame in Spark allows developers to impose a structure onto a distributed collection of data, allowing higher-level abstraction. • Datasets :Datasets in Apache Spark are an extension of DataFrame API which provides type-safe, object- oriented programming interface. Dataset takes advantage of Spark’s Catalyst optimizer by exposing expressions and data fields to a query planner. • Apache Spark RDDs support two types of operations: Transformation – Creates a new RDD from the existing one. It passes the dataset to the function and returns new dataset. Action – Spark Action returns final result to driver program or write it to the external datastore.
  • 7. Apache Spark Components a. Apache Spark Driver • The main() method of the program runs in the driver. The driver is the process that runs the user code that creates RDDs, and performs transformation and action, and also creates SparkContext. When the Spark Shell is launched, this signifies that we have created a driver program. On the termination of the driver, the application is finished. • The driver program splits the Spark application into the task and schedules them to run on the executor. The task scheduler resides in the driver and distributes task among workers. The two main key roles of drivers are: Converting user program into the task. Scheduling task on the executor. • The structure of Spark program at a higher level is: RDDs are created from some input data, derive new RDD from existing using various transformations, and then after it performs an action to compute data. In Spark Program, the DAG (directed acyclic graph) of operations are created implicitly. And when the driver runs, it converts that Spark DAG into a physical execution plan. b. Apache Spark Cluster Manager • Spark relies on cluster manager to launch executors and in some cases, even the drivers are launched through it. It is a pluggable component in Spark. On the cluster manager, jobs and action within a spark application are scheduled by Spark Scheduler in a FIFO fashion. Alternatively, the scheduling can also be done in Round Robin fashion. The resources used by a Spark application can be dynamically adjusted based on the workload. Thus, the application can free unused resources and request them again when there is a demand. This is available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode. c. Apache Spark Executors • The individual task in the given Spark job runs in the Spark executors. Executors are launched once in the beginning of Spark Application and then they run for the entire lifetime of an application. Even if the Spark executor fails, the Spark application can continue with ease. There are two main roles of the executors: Runs the task that makes up the application and return the result to the driver. Provide in-memory storage for RDDs that are cached by the user
  • 8. How does Spark works?
  • 9. How does Spark work? • Using spark-submit, the user submits an application. • In spark-submit, the main() method specified by the user is invoked. It also launches the driver program. • The driver program asks for the resources to the cluster manager that is required to launch executors. • The cluster manager launches executors on behalf of the driver program. • The driver process runs with the help of user application. Based on the actions and transformation on RDDs, the driver sends work to executors in the form of tasks. • The executors process the task and the obtained result is sent back to the driver through cluster manager.
  • 10. Common RDD functions • RDD functions:- persist(),cache() • RDD Transformation map(func),flatMap(),filter(func),mapPartitions(func),mapPartitionWithI ndex(),union(dataset),intersection(other- dataset),distinct(),groupByKey(),reduceByKey(func, [numTasks]),sortByKey(),join(),coalesce() • RDD Action count(),collect(),take(n),top(),countByValue(),reduce(),fold(),aggregat e(),foreach()
  • 11. How Does Microsoft Azure helps? Azure HDInsight is a cloud distribution of the Hadoop components from the Hortonworks Data Platform (HDP). Apache Hadoop was the original open-source framework for distributed processing and analysis of big data sets on clusters of computers. It provides 99.9% SLA for service availability. The following components and utilities are included on HDInsight clusters:+ • Ambari: Cluster provisioning, management, monitoring, and utilities. • Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment. • Hive & HCatalog: SQL-like querying, and a table and storage management layer. • Mahout: For scalable machine learning applications. • MapReduce: Legacy framework for Hadoop distributed processing and resource management. See YARN. • Oozie: Workflow management. • Phoenix: Relational database layer over HBase. • Pig: Simpler scripting for MapReduce transformations. • Sqoop: Data import and export. • Tez: Allows data-intensive processes to run efficiently at scale. • YARN: Resource management that is part of the Hadoop core library. • ZooKeeper: Coordination of processes in distributed systems.
  • 12. Learn Spark • https://community.cloud.databricks.com • https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache- spark-overview • https://www.tutorialspoint.com/apache_spark/ • https://www.pluralsight.com/