Introduction and Working
Available Distributed
Programming models
• MapReduce
• Storm
• Flink
• Spark
Why Spark ?
Major limitations with available
distributed models
1. Difficulty in programming directly in MapReduce.
2. No support for in-memory computation in MapReduce.
3. MapReduce uses batch processing, which does not fit every use case.
4. Flink is not yet ready for production-level projects.
5. Flink primarily works on streaming data.
6. Storm is slower than Spark.
Hadoop Ecosystem
Note: This is just an illustrative figure, not all components shown may be production ready
What is Spark?
• Spark is the open standard for flexible in-
memory data processing for batch, real-time,
and advanced analytics.
• Powerful open source processing engine built
around speed, ease of use, and sophisticated
analytics.
• First high-level programming framework for fast,
distributed data processing.
Some key points about Spark
• Handles batch, interactive, and real-time workloads
within a single framework
(unlike MR, which targets batch, and Flink, which primarily targets streaming)
• Native integration with Java, Python, Scala
and R
• More general: map/reduce is just one set of
supported constructs
How was Spark born?
How does Spark Work?
• Spark is often used in tandem with a distributed storage system
(to persist the data it processes) and a cluster manager
(to manage the distribution of the application across the cluster).
• Spark currently supports three kinds of cluster managers:
1. The manager included in Spark, called the Standalone Cluster Manager,
which requires Spark to be installed in each node of a cluster.
2. Apache Mesos
3. Hadoop YARN.
Spark Data processing ecosystem
Figure: Components of
Spark Architecture Model
Spark Cluster
Figure: Spark Cluster Mode Overview
Spark Ecosystem
Figure: The Spark ecosystem (Spark SQL, Streaming, MLlib, and GraphX on top of the
Spark Core API, with language bindings for R, SQL, Python, Scala, and Java)
Spark Core
• Spark Core is the main data processing framework in the Spark ecosystem and
the underlying general execution engine on which all other Spark
functionality is built.
• It provides in-memory computing capabilities to deliver speed, a
generalized execution model to support a wide variety of applications, and
Java, Scala, and Python APIs for ease of development.
• In addition to Spark Core, the Spark ecosystem includes a number of other
first-party components for more specific data processing tasks, including
Spark SQL, Spark MLlib, Spark ML, and GraphX.
• These components have many of the same generic performance
considerations as the core. However, some of them have unique
considerations, such as Spark SQL's different optimizer.
Spark Core- Word Count
MapReduce Word Count
MapReduce Word Count Cont.
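The word-count dataflow shown on these slides can be sketched in plain Python. This is an illustration only, not the Spark API: `flat_map` and `reduce_by_key` below are hypothetical helpers that mimic the RDD operations of the same names.

```python
from itertools import chain

# Hypothetical helpers mimicking Spark's flatMap and reduceByKey on plain lists.
def flat_map(f, xs):
    return list(chain.from_iterable(f(x) for x in xs))

def reduce_by_key(f, pairs):
    out = {}
    for k, v in pairs:
        out[k] = v if k not in out else f(out[k], v)
    return sorted(out.items())

lines = ["to be or not", "to be"]
words = flat_map(str.split, lines)                 # flatMap: line -> words
pairs = [(w, 1) for w in words]                    # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)  # reduceByKey: sum per word
print(counts)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In Spark itself the same pipeline is a chain of flatMap, map, and reduceByKey calls on an RDD, which is what makes the program so much shorter than the equivalent MapReduce job.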
Spark SQL
• Spark SQL is a Spark module for structured data processing.
• Provides a programming abstraction called DataFrames and can also act as a
distributed SQL query engine.
• Defines an interface for semi-structured data, called DataFrame,
and a typed version called Dataset.
• A very important component for Spark performance; almost everything that can
be accomplished with Spark Core can be applied to Spark SQL.
• The DataFrame and Dataset interfaces are the future of Spark performance,
with more efficient storage options, an advanced optimizer, and direct
operations on serialized data.
• Datasets were introduced in Spark 1.6, DataFrames in Spark 1.3, and the
SQL engine in Spark 1.0.
• Spark SQL supports structured queries in batch and streaming
modes (with the latter as a separate module of Spark SQL
called Structured Streaming).
• As of Spark 2.0, Spark SQL is the de facto primary, feature-rich
interface to Spark's underlying in-memory distributed platform
(hiding Spark Core's RDDs behind higher-level abstractions).
Spark SQL’s different APIs
• Dataset API (formerly DataFrame API) with a strongly-typed
LINQ-like Query DSL that Scala programmers will likely find
very appealing to use.
• Structured Streaming API (aka Streaming Datasets) for
continuous incremental execution of structured queries.
• Non-programmers will likely use SQL as their query language,
through direct integration with Hive.
• JDBC/ODBC fans can use JDBC interface (through Thrift
JDBC/ODBC Server) and connect their tools to Spark’s
distributed query engine.
Spark SQL- On Hive
Spark SQL - Dataset
Machine Learning
• Spark has two machine learning packages, ML and MLlib.
• Spark ML is still in the early stages, but since Spark 1.2 it provides a
higher-level API than MLlib that helps users create practical machine
learning pipelines more easily.
• Spark MLlib is built on top of RDDs, whereas ML is built on top of
Spark SQL DataFrames.
• The Spark community plans to move over to ML, deprecating MLlib.
• Spark ML and MLlib have some unique performance considerations,
especially when working with large data sizes and caching.
Spark Streaming
• Uses the scheduling of the Spark Core for streaming analytics on mini-
batches of data.
• Has a number of unique considerations such as the window sizes used for
batches.
• Running on top of Spark, it enables powerful interactive and analytical
applications across both streaming and historical data, while inheriting
Spark’s ease of use and fault tolerance characteristics.
• Readily integrates with a wide variety of popular data sources, including
HDFS, Flume, Kafka, and Twitter.
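The mini-batch idea can be illustrated in plain Python (a sketch of the concept, not the Spark Streaming API): a possibly unbounded stream is chopped into small fixed-size batches, and each batch is then processed with ordinary batch logic.

```python
from itertools import islice

def mini_batches(stream, batch_size):
    """Chop a (possibly unbounded) iterator into fixed-size batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Each mini-batch is handled with normal batch logic, analogous to how
# Spark Streaming reuses the Spark Core scheduler per batch.
events = range(7)
sums = [sum(b) for b in mini_batches(events, 3)]
print(sums)  # [3, 12, 6]  -> batches [0,1,2], [3,4,5], [6]
```

The batch size here plays the role of the batch interval mentioned above: it trades latency (smaller batches) against per-batch scheduling overhead (larger batches).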
Graph X
• GraphX is a graph computation engine built on top of Spark that enables
users to interactively build, transform and reason about graph structured
data at scale.
• Comes complete with a library of common algorithms.
• One of the least mature components of Spark.
• Typed graph functionality will start to be introduced on top of the Dataset
API in upcoming versions.
Spark Model of Parallel Computing:
RDDs
• Spark revolves around the concept of a resilient distributed dataset (RDD),
which is a fault-tolerant collection of elements partitioned across
machines, that can be operated on in parallel.
• Each RDD is split into multiple partitions, which may be computed on
different nodes of the cluster.
• RDDs are distributed data-sets that can stay in-memory or fall back to disk
gracefully.
• RDDs are resilient because they track their lineage: whenever there is a
failure in the system, they can recompute themselves using that
lineage information.
• RDDs are a representation of lazily evaluated statically typed distributed
collections.
• Spark stores data in RDDs on different partitions. They help with
rearranging the computations and optimizing the data processing.
• RDDs are immutable. We can modify an RDD with a transformation but
the transformation returns a new RDD whereas the original RDD remains
the same.
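A toy illustration of this immutability contract (plain Python; `ToyRDD` is a hypothetical stand-in, not the Spark API): a transformation returns a new dataset and leaves the original untouched.

```python
class ToyRDD:
    """Minimal stand-in for an immutable RDD (illustration only)."""
    def __init__(self, data):
        self._data = tuple(data)   # tuple: contents cannot be mutated

    def map(self, f):
        # Transformation: builds and returns a NEW ToyRDD.
        return ToyRDD(f(x) for x in self._data)

    def collect(self):
        return list(self._data)

original = ToyRDD([1, 2, 3])
doubled = original.map(lambda x: x * 2)
print(original.collect())  # [1, 2, 3]  -- original is unchanged
print(doubled.collect())   # [2, 4, 6]  -- transformation produced a new RDD
```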
RDD Operations
• RDD supports two types of operations:
– Transformation: Transformations do not return a single value;
they return a new RDD. Nothing gets evaluated when a
transformation function is called; it just takes an RDD and returns
a new RDD.
Some of the transformation functions are map, filter, flatMap,
groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce.
– Action: An action operation evaluates and returns a value.
When an action function is called on an RDD object, all the data
processing queries are computed at that time and the resulting
value is returned.
Some of the actions are reduce, collect, count, first, take,
countByKey, and foreach.
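The transformation/action split can be sketched with a toy lazy dataset in plain Python (`LazyRDD` is a hypothetical stand-in, not the Spark API): transformations only record work, and an action actually runs it.

```python
class LazyRDD:
    """Toy lazy dataset: transformations record work, actions run it."""
    def __init__(self, source, ops=()):
        self._source = source
        self._ops = ops            # recorded transformations (the lineage)

    # Transformations: return a new LazyRDD, compute nothing yet.
    def map(self, f):
        return LazyRDD(self._source, self._ops + (("map", f),))

    def filter(self, p):
        return LazyRDD(self._source, self._ops + (("filter", p),))

    # Actions: walk the recorded ops and actually evaluate.
    def collect(self):
        data = list(self._source)
        for kind, f in self._ops:
            data = [f(x) for x in data] if kind == "map" else [x for x in data if f(x)]
        return data

    def count(self):
        return len(self.collect())

rdd = LazyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; only the two-op lineage has been recorded.
print(rdd.count())    # 5  -> the even squares 0, 4, 16, 36, 64
```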
Lazy Evaluation
• Evaluation of RDDs is completely lazy.
• Spark does not begin computing the partitions until an
action is called.
• Actions trigger the scheduler, which builds a directed acyclic
graph (called the DAG), based on the dependencies between
RDD transformations.
PERFORMANCE & USABILITY
ADVANTAGES OF LAZY EVALUATION
• Allows Spark to chain together operations that don't
require communication with the driver, avoiding
multiple passes through the data.
• Since each partition of the data contains the dependency
information needed to recalculate it, Spark is
fault-tolerant.
• An RDD contains all the dependency information required to
recompute each of its partitions.
• In case of failure, when a partition is lost, the RDD has
enough information about its lineage to recompute it, and
that computation can be parallelized to make recovery
faster.
IN-MEMORY STORAGE & MEMORY MANAGEMENT
• Spark has the option of storing the data on slave nodes loaded into
memory, so its performance is very good for iterative computations
compared to MapReduce.
• Spark offers three options for memory management:
1. In memory as de-serialized Java objects: the fastest storage,
but not memory-efficient, as it requires the data to be stored as objects.
2. As serialized data: slower, since serialized data is more CPU-intensive to
read, but often more memory-efficient, since it allows the user to choose a
more efficient representation for the data than Java objects.
3. On disk: obviously slower for repeated computations, but more
fault-tolerant for long strings of transformations and may be the only
feasible option for enormous computations.
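The trade-off between options 1 and 2 can be illustrated in plain Python, using `pickle` as a stand-in for Spark's serializers (a sketch under that assumption, not Spark's actual memory manager): the serialized form is more compact, but every read pays a decoding cost.

```python
import pickle
import sys

data = list(range(1000))

# 1. "De-serialized" storage: live objects, fast to use, memory-heavy.
as_objects = data

# 2. "Serialized" storage: one compact byte blob; every read must decode it.
as_bytes = pickle.dumps(data)

blob_size = len(as_bytes)
object_size = sys.getsizeof(as_objects) + sum(sys.getsizeof(x) for x in as_objects)
print(blob_size < object_size)                # serialized form is more compact
print(pickle.loads(as_bytes) == as_objects)   # but reading requires CPU work
```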
IMMUTABILITY AND THE RDD INTERFACE
• Spark has an RDD interface whose properties are shared by
RDDs of every type.
• RDD properties include dependencies and information about
data locality that the execution engine needs to
compute that RDD.
• RDDs can be created in two ways:
(1) by transforming an existing RDD, or
(2) from a SparkContext (by passing a list or reading files)
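A minimal sketch of the two creation paths (plain Python with hypothetical toy helpers, not the real SparkContext API):

```python
# Toy stand-ins for the two RDD creation paths (not the SparkContext API).
def parallelize(collection):
    """(2) Create a dataset from a driver-side collection."""
    return list(collection)

def transform(dataset, f):
    """(1) Derive a new dataset by transforming an existing one."""
    return [f(x) for x in dataset]

base = parallelize([1, 2, 3])                # from a list of values
derived = transform(base, lambda x: x + 10)  # from an existing dataset
print(derived)  # [11, 12, 13]
```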
Example
What are the benefits of Spark?
• Speed: Engineered from the bottom up for performance, Spark can
be 100x faster than Hadoop for large-scale data processing by exploiting in-
memory computing and other optimizations. Spark is also fast when data
is stored on disk, and currently holds the world record for large-scale on-
disk sorting. Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Ease of Use: Spark has easy-to-use APIs for operating on large datasets.
This includes a collection of over 100 operators for transforming data and
familiar DataFrame APIs for manipulating semi-structured data. Write
applications quickly in Java, Scala, Python, or R.
• A Unified Engine: Spark comes packaged with higher-level libraries,
including support for SQL queries, streaming data, machine learning, and
graph processing. These standard libraries increase developer productivity
and can be seamlessly combined to create complex workflows.
When to use Spark?
• Faster Batch Applications: You can now deploy batch applications that run 10-
100x faster in production environments, with the added benefit of easy code
maintenance.
• Complex ETL Data Pipelines: You can leverage the complete Spark stack to
build complex ETL pipelines that merge streaming, machine learning, and
SQL operations in one program.
• Real-time Operational Analytics: You can leverage MapR-DB/HBase and/or
Spark Streaming functionality to build real-time operational dashboards or
time-series analytics over data ingested at high speeds.
Example:
• Credit Card Fraud Detection
• Network Security
• Genomic Sequencing
When Not to use Spark?
• Spark was not designed as a multi-user environment. Spark users are
required to know whether the memory they have access to is sufficient for
a dataset. Adding more users further complicates this, since the users will
have to coordinate memory usage to run projects concurrently. Because of
this, users may want to consider an alternate engine, such as Apache Hive,
for large batch projects.
Questions?
Thank You
Anirudh Menon(animenon@mail.com)
Aman Kaushik(amanthekaushik@gmail.com)

 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 

Apache Spark for Beginners

  • 2. Available Distributed Programming models • MapReduce • Storm • Flink • Spark
  • 4. Major limitations with available distributed models 1. Difficulty in programming directly in MapReduce 2. No support for in-memory computation in MapReduce 3. MR uses batch processing (does not fit every use case). 4. Flink is not ready for production-level projects. 5. Flink primarily works on streaming data. 6. Storm is slower than Spark.
  • 5. Hadoop Ecosystem Note: This is just an illustrative figure; not all components shown may be production ready.
  • 6. What is Spark? • Spark is the open standard for flexible in-memory data processing for batch, real-time, and advanced analytics. • A powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. • A high-level programming framework for fast, distributed data processing.
  • 7. Some key points about Spark • Handles batch, interactive, and real-time workloads within a single framework (versus MR for batch and Flink for streaming) • Native integration with Java, Python, Scala, and R • More general: map/reduce is just one set of supported constructs
  • 9. How does Spark Work? • Spark is often used in tandem with a distributed storage system to persist the data it processes and a cluster manager to manage the distribution of the application across the cluster. • Spark currently supports three kinds of cluster managers: 1. The manager included in Spark, called the Standalone Cluster Manager, which requires Spark to be installed on each node of the cluster. 2. Apache Mesos 3. Hadoop YARN
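The choice of cluster manager is expressed through the `--master` URL when submitting an application. A minimal sketch (the host names, ports, and `app.py` are placeholders, not from the slides):

```shell
# Standalone Cluster Manager (Spark's built-in master)
spark-submit --master spark://master-host:7077 app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.py

# Hadoop YARN (the cluster location is read from HADOOP_CONF_DIR)
spark-submit --master yarn --deploy-mode cluster app.py
```

The application code itself is unchanged across managers; only the submission configuration differs.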
  • 10. Spark Data processing eco system Figure: Components of Spark Architecture Model
  • 11. Spark Cluster Figure: Spark Cluster Mode Overview
  • 12. Spark Ecosystem Spark Core API R SQL Python Scala Java Spark SQL Streaming MLlib GraphX Programming languages used in Spark
  • 13. Spark Core • Spark Core is the main data processing framework in the Spark ecosystem. • Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. • It provides in-memory computing capabilities to deliver speed, a generalized execution model to support a wide variety of applications, and Java, Scala, and Python APIs for ease of development. • In addition to Spark Core, the Spark ecosystem includes a number of other first-party components for more specific data processing tasks, including Spark SQL, Spark MLlib, Spark ML, and GraphX. • These components have many of the same generic performance considerations as the core. However, some of them have unique considerations, such as Spark SQL's different optimizer.
  • 17. Spark SQL • Spark SQL is a Spark module for structured data processing. • Provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. • Defines an interface for semi-structured data, called DataFrames, and a typed version called Datasets. • A very important component for Spark performance; almost everything that can be accomplished with Spark Core can be applied to Spark SQL. • The DataFrame and Dataset interfaces are the future of Spark performance, with more efficient storage options, an advanced optimizer, and direct operations on serialized data. • Datasets were introduced in Spark 1.6, DataFrames in Spark 1.3, and the SQL engine in Spark 1.0.
  • 18. • Spark SQL supports structured queries in batch and streaming modes (the latter as a separate module of Spark SQL called Structured Streaming). • As of Spark 2.0, Spark SQL is the de facto primary and feature-rich interface to Spark's underlying in-memory distributed platform (hiding Spark Core's RDDs behind higher-level abstractions).
  • 19. Spark SQL’s different APIs • The Dataset API (formerly the DataFrame API) with a strongly typed, LINQ-like query DSL that Scala programmers will likely find very appealing to use. • The Structured Streaming API (aka Streaming Datasets) for continuous, incremental execution of structured queries. • Non-programmers will likely use SQL as their query language through direct integration with Hive. • JDBC/ODBC users can use the JDBC interface (through the Thrift JDBC/ODBC server) and connect their tools to Spark’s distributed query engine.
  • 21. Spark SQL - Dataset
  • 22. Machine Learning • Spark has two machine learning packages, ML and MLlib. • Spark ML is still in its early stages, but since Spark 1.2 it provides a higher-level API than MLlib that helps users create practical machine learning pipelines more easily. • Spark MLlib is built on top of RDDs, whereas ML is built on top of Spark SQL DataFrames. • The Spark community plans to move over to ML, deprecating MLlib. • Spark ML and MLlib have some unique performance considerations, especially when working with large data sizes and caching.
  • 23. Spark Streaming • Uses the scheduling of Spark Core for streaming analytics on mini-batches of data. • Has a number of unique considerations, such as the window sizes used for batches. • Running on top of Spark, it enables powerful interactive and analytical applications across both streaming and historical data, while inheriting Spark’s ease of use and fault tolerance characteristics. • Readily integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter.
  • 24. GraphX • GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform, and reason about graph-structured data at scale. • Comes complete with a library of common algorithms. • One of the least mature components of Spark. • Typed graph functionality will start to be introduced on top of the Dataset API in an upcoming version.
  • 25. Spark Model of Parallel Computing: RDDs • Spark revolves around the concept of a resilient distributed dataset (RDD), a fault-tolerant collection of elements partitioned across machines that can be operated on in parallel. • Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. • RDDs are distributed datasets that can stay in memory or fall back to disk gracefully. • RDDs are resilient because they have a lineage: whenever there is a failure in the system, they can recompute themselves using their lineage information. • RDDs are a representation of lazily evaluated, statically typed distributed collections.
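The partitioned structure described above can be illustrated in plain Python (this is an analogy, not the Spark API: the `partition` helper is hypothetical):

```python
def partition(data, n):
    """Split data into n roughly equal partitions."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

parts = partition(list(range(10)), 3)
# In Spark, each partition could live on a different node and be
# processed independently; here we just combine the per-partition results.
squared = [x * x for p in parts for x in p]
```

The key idea is that a computation on the whole dataset decomposes into independent per-partition computations whose results are combined.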
  • 26. • Spark stores data in RDDs on different partitions, which helps with rearranging the computations and optimizing the data processing. • RDDs are immutable. We can modify an RDD with a transformation, but the transformation returns a new RDD while the original RDD remains the same. • In addition to Spark Core, the Spark ecosystem includes a number of other first-party components for more specific data processing tasks, including Spark SQL, Spark MLlib, Spark ML, and GraphX.
  • 27. RDD Operations • RDDs support two types of operations: – Transformation: Transformations do not return a single value; they return a new RDD. Nothing gets evaluated when a transformation function is called; it just takes an RDD and returns a new RDD. A few of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce. – Action: An action operation evaluates and returns a value. When an action function is called on an RDD object, all the data processing queries are computed at that time and the resulting value is returned. A few of the actions are reduce, collect, count, first, take, countByKey, and foreach.
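To make one of the listed transformations concrete, here is what reduceByKey computes, expressed in plain Python (a sketch of the semantics only, not Spark's distributed implementation):

```python
def reduce_by_key(pairs, fn):
    """Merge all values that share a key using fn, like Spark's reduceByKey."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc

pairs = [("a", 1), ("b", 1), ("a", 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)
# counts == {"a": 2, "b": 1}
```

In Spark the merge function must be associative and commutative, because partial results are combined per partition before being merged across the cluster.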
  • 28. Lazy Evaluation • Evaluation of RDDs is completely lazy. • Spark does not begin computing the partitions until an action is called. • Actions trigger the scheduler, which builds a directed acyclic graph (the DAG) based on the dependencies between RDD transformations.
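Python generators give a small analogy for this behavior (not the Spark API): chaining transformations does no work until a terminal, action-like step consumes the chain.

```python
evaluated = []

def numbers():
    for i in range(5):
        evaluated.append(i)   # record that this element was actually computed
        yield i

# Build a "pipeline" of transformations; nothing runs yet.
pipeline = (x * 2 for x in numbers() if x % 2 == 0)
assert evaluated == []        # no elements computed so far

result = list(pipeline)       # the terminal step triggers evaluation
```

As with Spark's DAG scheduler, deferring execution lets the whole chain run in a single pass over the data.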
  • 29. PERFORMANCE & USABILITY ADVANTAGES OF LAZY EVALUATION • Allows Spark to chain together operations that don’t require communication with the driver, avoiding multiple passes through the data. • Because each partition of the data contains the dependency information needed to recalculate it, Spark is fault-tolerant. • An RDD contains all the dependency information required to recompute each of its partitions. • In case of failure, when a partition is lost, the RDD has enough information about its lineage to recompute it, and that computation can be parallelized to make recovery faster.
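Lineage-based recovery can be sketched in plain Python (an analogy, not Spark internals; the `Partition` class is hypothetical): a partition remembers where its data came from and how it was derived, so a lost partition is recomputed rather than restored from a replica.

```python
class Partition:
    def __init__(self, parent_data, fn):
        self.parent_data = parent_data   # lineage: the input this came from
        self.fn = fn                     # lineage: the transformation applied
        self.data = [fn(x) for x in parent_data]

    def recompute(self):
        # Recovery after a failure: rebuild the data from lineage alone.
        self.data = [self.fn(x) for x in self.parent_data]
        return self.data

p = Partition([1, 2, 3], lambda x: x * 10)
p.data = None                 # simulate losing the partition
recovered = p.recompute()
```

Because each partition recomputes independently, recovery of many lost partitions can proceed in parallel.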
  • 30. IN-MEMORY STORAGE & MEMORY MANAGEMENT • Spark has the option of keeping data on slave nodes loaded into memory, so its performance is very good for iterative computations compared to MapReduce. • Spark offers three options for memory management: 1. In memory as deserialized Java objects: the fastest form of storage but not memory-efficient, as the data must be stored as objects. 2. As serialized data: slower, since serialized data is more CPU-intensive to read, but often more memory-efficient, since it allows the user to choose a more compact representation for the data than Java objects. 3. On disk: obviously slower for repeated computations, but more fault-tolerant for long chains of transformations and may be the only feasible option for enormous computations.
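The deserialized-versus-serialized tradeoff can be illustrated in plain Python (an analogy using pickle, not Spark's storage levels): objects carry per-object overhead, while a serialized byte string is compact but must be decoded before use.

```python
import pickle
import sys

data = list(range(1000))

# "Deserialized" storage: live objects, each with interpreter overhead.
object_bytes = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# "Serialized" storage: one compact byte string; cheaper to hold in
# memory, but reading any element requires deserializing first.
serialized_bytes = len(pickle.dumps(data))
```

The serialized form is typically several times smaller here, which is the same memory-versus-CPU tradeoff Spark exposes through its storage levels.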
  • 31. IMMUTABILITY AND THE RDD INTERFACE • Spark has an RDD interface whose properties are shared by RDDs of every type. • RDD properties include dependencies and information about data locality that the execution engine needs to compute that RDD. • RDDs can be created in two ways: (1) by transforming an existing RDD or (2) from a SparkContext (by passing a list or reading files).
  • 36. What are the benefits of Spark? • Speed: Engineered from the bottom up for performance, Spark can be 100x faster than Hadoop for large-scale data processing by exploiting in-memory computing and other optimizations. Spark is also fast when data is stored on disk, and currently holds the world record for large-scale on-disk sorting. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. • Ease of Use: Spark has easy-to-use APIs for operating on large datasets, including a collection of over 100 operators for transforming data and familiar DataFrame APIs for manipulating semi-structured data. Write applications quickly in Java, Scala, Python, or R. • A Unified Engine: Spark comes packaged with higher-level libraries, including support for SQL queries, streaming data, machine learning, and graph processing. These standard libraries increase developer productivity and can be seamlessly combined to create complex workflows.
  • 37. When to use Spark? • Faster Batch Applications: You can deploy batch applications that run 10-100x faster in production environments, with the added benefit of easy code maintenance. • Complex ETL Data Pipelines: You can leverage the complete Spark stack to build complex ETL pipelines that merge streaming, machine learning, and SQL operations in one program. • Real-time Operational Analytics: You can leverage MapR-DB/HBase and/or Spark Streaming functionality to build real-time operational dashboards or time-series analytics over data ingested at high speeds. Examples: • Credit Card Fraud Detection • Network Security • Genomic Sequencing
  • 38. When Not to use Spark? • Spark was not designed as a multi-user environment. Spark users are required to know whether the memory they have access to is sufficient for a dataset, and adding more users complicates this further, since users must coordinate memory usage to run projects concurrently. Because of this, consider an alternate engine, such as Apache Hive, for large batch projects.