5 Reasons Spark in Demand - Low Latency, Streaming, ML, Graphs


Spark is in high demand for several reasons: it offers low-latency processing by keeping data in memory, and it supports streaming analytics, machine learning, and graph processing. It also introduces DataFrames for easier data analysis and integrates well with Hadoop for processing large datasets. In a 100 TB sort benchmark, Spark finished 3 times faster than Hadoop MapReduce while using 10 times fewer machines, which helps explain its popularity as a big data processing engine.


www.edureka.co/apache-spark-scala-training
5 Reasons why Spark is in demand!

Reasons why Spark is in Demand
• Reason #1: Low Latency
• Reason #2: Streaming Support
• Reason #3: Machine Learning and Graph
• Reason #4: Data Frame API introduction
• Reason #5: Spark integration with Hadoop

Spark Architecture
• Machine learning library (MLlib)
• Graph programming (GraphX)
• Spark interface for RDBMS lovers (Spark SQL)
• Utility for continuous ingestion of data (Spark Streaming)

Low Latency

Spark Cuts Down Read/Write I/O to Disk
Spark works well both for data that fits in memory and for data that does not.
Spark tries to keep data in the memory of its distributed workers, which allows significantly faster, lower-latency computation, whereas MapReduce writes intermediate results to disk and reads them back between steps.
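To make the contrast concrete, here is a minimal plain-Python sketch (not Spark or Hadoop code) that counts simulated disk operations for a three-stage job. The names `STAGES`, `run_disk_backed`, and `run_in_memory` are hypothetical and exist only for this illustration: the first function spills every intermediate result to a temporary file, roughly the pattern the slide attributes to MapReduce, while the second hands results from stage to stage in memory.

```python
import json
import os
import tempfile

# Three toy stages standing in for the steps of a multi-stage job.
STAGES = [
    lambda rows: [r * 2 for r in rows],
    lambda rows: [r + 1 for r in rows],
    lambda rows: [r for r in rows if r % 3 == 0],
]


def run_disk_backed(rows):
    """Write each stage's output to disk and read it back for the next stage."""
    disk_ops = 0
    with tempfile.TemporaryDirectory() as tmp:
        for i, stage in enumerate(STAGES):
            path = os.path.join(tmp, f"stage_{i}.json")
            with open(path, "w") as f:   # spill the intermediate result
                json.dump(stage(rows), f)
            disk_ops += 1
            with open(path) as f:        # reload it for the next stage
                rows = json.load(f)
            disk_ops += 1
    return rows, disk_ops


def run_in_memory(rows):
    """Keep every intermediate result in memory; no disk access at all."""
    for stage in STAGES:
        rows = stage(rows)
    return rows, 0


data = list(range(10))
disk_result, disk_ops = run_disk_backed(data)
mem_result, mem_ops = run_in_memory(data)
print(disk_result == mem_result, disk_ops, mem_ops)
```

Both paths produce the same answer; only the number of disk round trips differs, and that number grows with every extra stage in the disk-backed version.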

Time Taken to Sort 100 TB of Data
The previous world record was 72 minutes, set by Yahoo using a Hadoop MapReduce cluster of 2,100 nodes.
Using Spark on 206 EC2 nodes, the same sort took only 23 minutes.
Spark sorted the same data 3X faster using 10X fewer machines.
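The "3X" and "10X" figures follow directly from the numbers on the slide. The short calculation below checks them; the node-minutes ratio is an extra, rough proxy added here for illustration, and it assumes the two clusters' nodes are comparable, which the slide does not state.

```python
# Figures quoted on the slide.
mapreduce_minutes, mapreduce_nodes = 72, 2100
spark_minutes, spark_nodes = 23, 206

speedup = mapreduce_minutes / spark_minutes      # how many times faster
machine_ratio = mapreduce_nodes / spark_nodes    # how many times fewer machines

# Node-minutes: a rough proxy for total cluster time consumed (assumption:
# nodes in the two clusters are treated as equivalent).
node_minutes_ratio = (mapreduce_minutes * mapreduce_nodes) / (
    spark_minutes * spark_nodes
)

print(f"speedup: {speedup:.2f}x")
print(f"machines: {machine_ratio:.2f}x fewer")
print(f"node-minutes: {node_minutes_ratio:.1f}x fewer")
```

The speedup works out to about 3.1 and the machine ratio to about 10.2, which matches the rounded claims on the slide.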

Streaming Support

Event processing
Spark Streaming processes real-time streaming data.
It uses the DStream, a series of RDDs, to process continuous real-time data.
The Spark Streaming API closely matches that of Spark Core.
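The idea of a stream as a series of small batches can be sketched in plain Python (this is not the Spark Streaming API). The helper names `to_micro_batches` and `word_count` are hypothetical: each list produced by the first function stands in for one RDD, and the same batch-style function is then applied to every one of them.

```python
from collections import defaultdict


def to_micro_batches(events, batch_seconds):
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // batch_seconds)].append(value)
    # Return batches in time order; each list stands in for one RDD.
    return [batches[key] for key in sorted(batches)]


def word_count(batch):
    """An ordinary batch computation, reused unchanged for every micro-batch."""
    counts = defaultdict(int)
    for word in batch:
        counts[word] += 1
    return dict(counts)


events = [
    (0.2, "spark"), (0.9, "hadoop"),
    (1.1, "spark"), (1.7, "spark"),
    (2.4, "hadoop"),
]
stream = to_micro_batches(events, batch_seconds=1)
results = [word_count(batch) for batch in stream]
print(results)
```

Because each micro-batch is processed with ordinary batch logic, code written for batch jobs carries over to streams with little change. One simplification to note: this sketch skips intervals that contain no events.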

Machine Learning and Graph
Implementation with DAG

Machine Learning
MLlib, a machine learning library, covers:
• Classification
• Regression
• Clustering
• Collaborative filtering
Some of the algorithms also work with streaming data, such as linear regression using ordinary least squares or k-means clustering.
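How a clustering algorithm can keep learning from a stream is easiest to see in one dimension. The sketch below is plain Python, not MLlib code; `update_centers` is a hypothetical helper, and its decayed weighted-average rule is modeled on the update that the MLlib documentation describes for streaming k-means, so treat the details as assumptions rather than the library's exact behavior.

```python
def update_centers(centers, counts, batch, decay=1.0):
    """Apply one mini-batch update to 1-D cluster centers.

    centers: current center positions
    counts:  weight (number of points) each center has absorbed so far
    decay:   1.0 weighs all history equally; 0.0 forgets it entirely
    """
    # Assign each point in the batch to its nearest current center.
    assigned = [[] for _ in centers]
    for x in batch:
        nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        assigned[nearest].append(x)

    new_centers, new_counts = [], []
    for c, n, points in zip(centers, counts, assigned):
        m = len(points)
        if m == 0:
            # No new points for this cluster: keep the center, decay its weight.
            new_centers.append(c)
            new_counts.append(n * decay)
            continue
        batch_mean = sum(points) / m
        # Weighted average of the old center and the mean of the new points.
        new_centers.append((c * n * decay + batch_mean * m) / (n * decay + m))
        new_counts.append(n * decay + m)
    return new_centers, new_counts


centers, counts = [0.0, 10.0], [1, 1]
for batch in ([1.0, 2.0, 9.0], [0.0, 11.0, 12.0]):
    centers, counts = update_centers(centers, counts, batch)
print(centers, counts)
```

Each batch nudges the centers instead of retraining from scratch, which is what makes the approach usable on data that never stops arriving.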

Cyclic Data Flows
• All jobs in Spark comprise a series of operators and run on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
• The DAG is optimized by rearranging and combining operators where possible.
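The three points above can be illustrated with a toy plan in plain Python. This is not Spark's scheduler: `build_plan`, `fuse_maps`, and `execute` are hypothetical names, the "DAG" is a simple linear chain, and the only optimization shown is combining adjacent map operators so the data is traversed once instead of twice.

```python
def build_plan(*operators):
    """A linear chain is the simplest DAG: each operator feeds the next."""
    return list(operators)


def fuse_maps(plan):
    """Combine adjacent ("map", f) operators into a single map operator."""
    optimized = []
    for kind, fn in plan:
        if kind == "map" and optimized and optimized[-1][0] == "map":
            prev = optimized.pop()[1]
            # Bind prev and fn as defaults so each fused lambda keeps its own pair.
            optimized.append(("map", lambda x, f=prev, g=fn: g(f(x))))
        else:
            optimized.append((kind, fn))
    return optimized


def execute(plan, data):
    """Run the operators in order over an in-memory list."""
    for kind, fn in plan:
        if kind == "map":
            data = [fn(x) for x in data]
        elif kind == "filter":
            data = [x for x in data if fn(x)]
    return data


plan = build_plan(
    ("map", lambda x: x + 1),
    ("map", lambda x: x * 10),
    ("filter", lambda x: x > 20),
    ("map", lambda x: x - 5),
)
optimized = fuse_maps(plan)
print(len(plan), len(optimized))
print(execute(plan, [0, 1, 2, 3]) == execute(optimized, [0, 1, 2, 3]))
```

The optimized plan has fewer operators but returns the same result, which is the property any such rewrite has to preserve.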

GraphX
Component for graphs and graph-parallel computation.
Extends the Spark RDD by introducing a new Graph abstraction.
Graph algorithms:
• PageRank
• Connected Components
• Triangle Counting
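For readers unfamiliar with the algorithms named on this slide, here is a small single-machine sketch of two of them in plain Python. It is not GraphX code: the edge list, the helper names, and the convention of labeling each component by its smallest vertex id are choices made for this illustration.

```python
from itertools import combinations

# A small undirected graph: one triangle (1-2-3), a tail (3-4), and a pair (5-6).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)]

# Build an undirected adjacency map from the edge list.
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)


def connected_components(neighbors):
    """Label every vertex with the smallest vertex id in its component."""
    labels = {}
    for start in sorted(neighbors):
        if start in labels:
            continue
        stack = [start]
        while stack:
            v = stack.pop()
            if v not in labels:
                labels[v] = start
                stack.extend(n for n in neighbors[v] if n not in labels)
    return labels


def triangle_count(neighbors):
    """Count unordered vertex triples in which every pair is connected."""
    return sum(
        1
        for a, b, c in combinations(sorted(neighbors), 3)
        if b in neighbors[a] and c in neighbors[a] and c in neighbors[b]
    )


components = connected_components(neighbors)
triangles = triangle_count(neighbors)
print(components)
print(triangles)
```

A graph-parallel engine distributes this kind of work across a cluster; the sketch only shows what the algorithms compute, not how they are parallelized.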

Support for Data Frames

DataFrame
As Spark continues to grow, it aims to let wider audiences beyond “big data” engineers leverage the power of distributed processing.
• Inspired by data frames in R and Python (pandas).
• The DataFrame API is designed to make big data processing on tabular data easier.
• A DataFrame is a distributed collection of data organized into named columns.
• It provides operations to filter, group, or compute aggregates, and can be used with Spark SQL.
• It can be constructed from structured data files, existing RDDs, tables in Hive, or external databases.
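The operations named above (filter, group, aggregate over named columns) look like this on a handful of local rows. This is a plain-Python sketch of the idea, not the Spark DataFrame API; the column names and values are made up for the example, and a real DataFrame would apply the same logical steps to data partitioned across a cluster.

```python
from collections import defaultdict

# Rows with named columns, standing in for a tiny, local "DataFrame".
rows = [
    {"dept": "eng", "name": "Ana", "salary": 120},
    {"dept": "eng", "name": "Bo", "salary": 100},
    {"dept": "ops", "name": "Cy", "salary": 90},
    {"dept": "ops", "name": "Di", "salary": 40},
]

# filter: keep rows whose salary is at least 50
kept = [r for r in rows if r["salary"] >= 50]

# group by "dept", then aggregate: average salary per group
groups = defaultdict(list)
for r in kept:
    groups[r["dept"]].append(r["salary"])
avg_salary = {dept: sum(values) / len(values) for dept, values in groups.items()}

print(avg_salary)
```

Referring to columns by name, rather than by position in a tuple, is what makes this style of processing approachable for people used to R or pandas.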

DataFrame features
• Ability to scale from KBs to PBs
• APIs for Python, Java, Scala, and R (in development via SparkR)
• Seamless integration with all big data tooling and infrastructure via Spark
• State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer
• Support for a wide array of data formats and storage systems

Spark Integration with Hadoop

Spark Execution Platforms
• Spark can leverage YARN, the resource negotiator of the Hadoop framework.
• Spark workloads can make use of Symphony scheduling policies and execute via YARN.
• Spark execution modes: Standalone, Mesos, and Hadoop (YARN, with data stored in HDFS)

Spark Features/Modules in Demand

New Features In 2015
Data Frames
• Similar API to data frames in R and pandas
• Automatically optimised via Spark SQL
• Released in Spark 1.3
SparkR
• Released in Spark 1.4
• Exposes DataFrames, RDDs & ML library in R
Machine Learning Pipelines
• High Level API
• Featurization
• Evaluation
• Model Tuning
External Data Sources
• Platform API to plug Data-Sources into Spark
• Pushes logic into sources

Spark overview

Thank You
Questions/Queries/Feedback
Recording and presentation will be made available to you within 24 hours



