SPARK: THE STATE OF THE ART ENGINE FOR BIG DATA PROCESSING
Presented By:
Ramaninder Singh Jhajj
Seminar on Internet Technologies
AGENDA
• Problem
• Limitations of MapReduce
• Spark Computing Framework
• Resilient Distributed Datasets
• A Unified Stack
• Who uses Spark?
• Demo
PROBLEM
• Data is growing faster than processing speeds.
• MapReduce:
  • Restricts the programming interface so that the system can do more automatically.
  • Expresses jobs as high-level operators.
• MapReduce is efficient, but not in every case.
LIMITATIONS OF MAPREDUCE
• Works very well for one-pass computations but is inefficient for multi-pass algorithms.
Source: http://www.slideshare.net/aknahs/spark-16667619
SOLUTION: IN-MEMORY DATA SHARING
Source: http://www.slideshare.net/aknahs/spark-16667619
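The difference between re-reading data on every pass and sharing it in memory can be sketched with a small pure-Python toy. It is not MapReduce or Spark code: `load_from_storage`, `run_passes`, and the read counter are hypothetical names invented for this illustration, and the "storage" is just a list.

```python
# Toy model: count storage reads for a multi-pass job, with and without
# keeping the data in memory between passes. All names are hypothetical.

reads = {"count": 0}

def load_from_storage():
    """Pretend to read the dataset from a distributed file system."""
    reads["count"] += 1
    return [1, 2, 3, 4, 5]

def run_passes(num_passes, keep_in_memory):
    """Run several passes over the data; return (storage reads, result)."""
    reads["count"] = 0
    cached = load_from_storage() if keep_in_memory else None
    total = 0
    for _ in range(num_passes):
        # Without in-memory sharing, every pass goes back to storage.
        data = cached if keep_in_memory else load_from_storage()
        total += sum(data)  # stand-in for one pass of an iterative algorithm
    return reads["count"], total

reload_reads, reload_total = run_passes(10, keep_in_memory=False)
cached_reads, cached_total = run_passes(10, keep_in_memory=True)
print(reload_reads, cached_reads)
```

Both runs compute the same result; only the number of simulated storage reads differs.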
SPARK: IN A TWEET
"Spark ... is what you might call a Swiss Army knife of Big Data analytics tools"
- Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
SPARK COMPUTING FRAMEWORK
• Spark is a fast, general-purpose engine for large-scale data processing.
• Handles batch, interactive, iterative, and real-time application scenarios, and provides clean APIs in Java, Scala, and Python.
• "Here's an operation, run it on all the data": I don't care where it runs or how faults will be handled.
RESILIENT DISTRIBUTED DATASETS (RDD)
• Spark's primary memory abstraction.
• A read-only collection of objects, partitioned across the cluster, that can be rebuilt if a partition is lost.
• Can be cached explicitly in memory.
• Two kinds of operations: transformations and actions.
RDD OPERATIONS
[Diagram: transformations turn an RDD into a new RDD; an action on an RDD returns a value]

Transformations    Actions
map                reduce
filter             collect
flatMap            count
mapPartitions      first
groupByKey         take(n)
reduceByKey        saveAsTextFile
join               foreach
https://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
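The split between transformations and actions can be illustrated with a minimal pure-Python sketch. This is a toy model, not Spark's implementation: `ToyRDD` and its methods are hypothetical names that only mirror a few of the operations listed above. Transformations return a new object and do no work; an action runs the whole pipeline and returns a plain value.

```python
# Toy model of lazy transformations and eager actions.
# ToyRDD is a hypothetical class, not part of Spark.

class ToyRDD:
    """A lazily evaluated sequence; `compute` builds a fresh iterator on demand."""

    def __init__(self, compute):
        self._compute = compute

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: describe a new dataset, execute nothing yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: run the pipeline and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []

def square(x):
    calls.append(x)  # record that the function actually ran
    return x * x

numbers = ToyRDD.from_list([1, 2, 3, 4])
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)   # nothing has executed yet
result = evens_squared.collect()   # the action triggers the computation
calls_after_action = len(calls)
print(calls_before_action, result, calls_after_action)
```

In this toy, every action re-runs the pipeline from the start, which is why caching an intermediate dataset pays off when it is reused.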
RDD EXAMPLE
val lines = spark.textFile("hdfs://........ ")        // Base RDD from HDFS
val errors = lines.filter(_.startsWith("Error"))      // Transformed to new RDD
val messages = errors.map(_.split("\t")(2))
messages.cache()                                      // Stored in memory
messages.filter(_.contains(str)).count()              // Action
FAULT TOLERANCE IN RDDS
• Achieved through the notion of lineage.
• Each RDD keeps track of how it was derived from other RDDs, so a lost partition can be recomputed.
Ex: message = textFile(...).filter(_.contains("error")).map(_.split("\t")(2))
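Lineage-based recovery can be sketched in a few lines of pure Python. This is a toy model under simplifying assumptions, not Spark's implementation: `LineageRDD`, `derive`, `compute`, and `get` are hypothetical names, partitions are plain lists, and a "lost" partition is simulated by deleting a cache entry.

```python
# Toy model of lineage: each dataset remembers its parent and the function
# that derives it, so a lost partition can be rebuilt on demand.
# LineageRDD is a hypothetical class, not part of Spark.

class LineageRDD:
    def __init__(self, name, source=None, parent=None, op=None):
        self.name = name
        self.source = source   # list of partitions (base dataset only)
        self.parent = parent   # dataset this one was derived from
        self.op = op           # function applied to one parent partition
        self.cached = {}       # partition index -> materialized partition

    def derive(self, name, op):
        return LineageRDD(name, parent=self, op=op)

    def lineage(self):
        """Names of the operations from the base dataset to this one."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))

    def compute(self, i):
        """Rebuild partition i by replaying the lineage."""
        if self.parent is None:
            return list(self.source[i])
        return self.op(self.parent.compute(i))

    def get(self, i):
        if i not in self.cached:
            self.cached[i] = self.compute(i)
        return self.cached[i]


log = LineageRDD("textFile", source=[
    ["error\tdisk\tfull", "info\tboot\tok"],
    ["error\tnet\tdown"],
])
errors = log.derive("filter", lambda part: [l for l in part if "error" in l])
messages = errors.derive("map", lambda part: [l.split("\t")[2] for l in part])

before = messages.get(1)    # materialize and cache partition 1
del messages.cached[1]      # simulate losing that partition
after = messages.get(1)     # rebuilt from the lineage, not from a replica
print(messages.lineage(), before, after)
```

Only the lost partition is recomputed; partition 0 is never touched during the recovery.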
EXAMPLE: WORD COUNT
• Word Count in MapReduce: 50-70 lines of code.
• What about Spark?
[Code screenshots in the original slide: Spark word count in Python and Java]
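As a rough idea of how short the Spark version is, here is a pure-Python sketch of the same flatMap, map, reduceByKey pipeline. It runs without Spark: `flat_map` and `reduce_by_key` are hypothetical local helpers, and the input lines are made up for illustration. The comment at the end shows the corresponding PySpark chain, assuming an existing SparkContext named `sc`.

```python
# Pure-Python word count that mirrors a flatMap -> map -> reduceByKey pipeline.
# The helper functions are hypothetical stand-ins, not Spark APIs.

def flat_map(f, items):
    """Apply f to every item and flatten the results into one list."""
    return [y for x in items for y in f(x)]

def reduce_by_key(f, pairs):
    """Combine the values of equal keys with the binary function f."""
    out = {}
    for key, value in pairs:
        out[key] = f(out[key], value) if key in out else value
    return out

lines = ["to be or not to be", "that is the question"]

words = flat_map(str.split, lines)                   # one record per word
pairs = [(word, 1) for word in words]                # (word, 1) pairs
counts = reduce_by_key(lambda a, b: a + b, pairs)    # sum the ones per word
print(counts)

# With PySpark and an existing SparkContext `sc` (an assumption), the same
# pipeline is a single chained expression:
#   sc.textFile("hdfs://...") \
#     .flatMap(lambda line: line.split()) \
#     .map(lambda word: (word, 1)) \
#     .reduceByKey(lambda a, b: a + b)
```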
SPARK: A UNIFIED STACK
[Diagram: Spark Core (RDD APIs, fault tolerance, processing)]
A UNIFIED STACK : SPARK SQL
[Diagram: Spark SQL on top of Spark Core]
• Spark SQL unifies access to structured data.
• Seamlessly mix SQL queries with Spark programs
sqlCtx.jsonFile("s3n://... ").registerAsTable("json")
schema_rdd = sqlCtx.sql("""SELECT * FROM hiveTable JOIN json ...""")
A UNIFIED STACK : SPARK GRAPHX
[Diagram: Spark SQL and GraphX on top of Spark Core]
• GraphX is Apache Spark's API for graphs and graph-parallel computation.
• Seamlessly work with both graphs and collections.
A UNIFIED STACK : SPARK MLLIB
[Diagram: Spark SQL, GraphX, and MLlib on top of Spark Core]
• MLlib is Apache Spark's scalable machine learning library.
• High-quality algorithms, up to 100x faster than MapReduce.
points = spark.textFile("hdfs://... ").map(parsePoint)
model = KMeans.train(points, k=10)
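For readers unfamiliar with k-means, the idea behind the call above can be sketched on a single machine in pure Python. This is a plain version of Lloyd's algorithm under simplifying assumptions (2-D points, caller-supplied initial centroids, a fixed number of iterations); the function name `kmeans` and the sample points are hypothetical, and MLlib's distributed implementation differs in its details.

```python
# Single-machine sketch of k-means (Lloyd's algorithm) on 2-D points.
# Not MLlib code; names and data are made up for illustration.

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to the nearest centroid (squared distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if empty.
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids

points = [(0, 0), (0, 2), (10, 10), (10, 12)]
model = kmeans(points, centroids=[(0, 0), (10, 10)])
print(model)
```

Each iteration is a full pass over the data, which is the kind of multi-pass workload where keeping the points in memory helps.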
A UNIFIED STACK : SPARK STREAMING
[Diagram: Spark SQL, GraphX, MLlib, and Spark Streaming on top of Spark Core]
• Spark Streaming makes it easy to build scalable fault-tolerant
streaming applications.
WHO USES SPARK?
Source: Spark Wiki (https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark)
DEMO
CONCLUSION
• Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster.
• The same engine performs data extraction, model training, and interactive queries; there is no need for a separate framework for each function.
INTERESTED IN READING MORE?
• https://spark.apache.org/
• http://ampcamp.berkeley.edu/
• https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w
• https://spark.apache.org/documentation.html
• edX.org offers the course "Introduction to Big Data with Apache Spark"
ANY QUESTIONS?
Thank you for listening

Spark: The State of the Art Engine for Big Data Processing

  • 1. SPARK: THE STATE OF THE ART ENGINE FOR BIG DATA PROCESSING Presented By: Ramaninder Singh Jhajj Seminar on Internet Technologies
  • 2. AGENDA • Problem • Limitations of MapReduce • Spark Computing Framework • Resilient Distributed Datasets • A Unified Stack • Who uses Spark? • Demo
  • 3. PROBLEM • Data is growing faster than processing speeds. • MapReduce restricts the programming interface so that the system can do more automatically, and expresses jobs as high-level operators. • MapReduce is efficient, but not always.
  • 4. LIMITATIONS OF MAPREDUCE • Works very well for one-pass computation but is inefficient for multi-pass algorithms. Source: http://www.slideshare.net/aknahs/spark-16667619
  • 5. SOLUTION: IN-MEMORY DATA SHARING Source: http://www.slideshare.net/aknahs/spark-16667619
  • 6. SPARK: IN A TWEET • "SPARK ... IS WHAT YOU MIGHT CALL A SWISS ARMY KNIFE OF BIG DATA ANALYTICS TOOLS" - Reynold Xin (@rxin), Berkeley AMPLab Shark Development Lead
  • 7. SPARK COMPUTING FRAMEWORK • Spark is a fast and general engine for large-scale data processing. • Handles batch, interactive, iterative and real-time application scenarios and provides clean APIs in Java, Scala and Python. • "Here's an operation, run it on all the data": I don't care where it runs or how faults will be handled.
  • 8. RESILIENT DISTRIBUTED DATASETS (RDD) • Primary memory abstraction. • Read-only collection of objects partitioned across a cluster that can be rebuilt if a partition is lost. • Can be cached explicitly in memory. • Two kinds of operations: transformations and actions.
  • 9. RDD OPERATIONS • Diagram: RDD → (transformations) → RDD → (action) → Value • Transformations: map, filter, flatMap, mapPartitions, groupByKey, reduceByKey, join • Actions: reduce, collect, count, first, take(n), saveAsTextFile, foreach • Source: https://spark.apache.org/docs/latest/programming-guide.html#rdd-operations
  • 10. RDD EXAMPLE • lines = spark.textFile("hdfs://........ ") (base RDD from HDFS) • errors = lines.filter(_.startsWith("Error")) (transformed to a new RDD) • messages = errors.map(_.split("\t")(2)) • messages.cache() (stored in memory) • messages.filter(_.contains(str)).count() (action)
  • 11. FAULT TOLERANCE IN RDDS • Achieved through the notion of lineage. • Each RDD keeps track of how it was derived from other RDDs, so a lost partition can be recomputed. • Ex: message = textFile(...).filter(_.contains("error")).map(_.split("\t")(2))
  • 12. EXAMPLE: WORD COUNT • Word count in MapReduce: 50-70 lines of code. • What about Spark? (Python and Java versions; the slide's code is not captured in this transcript.)
  • 13. SPARK: A UNIFIED STACK • Spark Core: RDD, APIs, fault tolerance, processing
  • 14. A UNIFIED STACK: SPARK SQL • Stack: Spark Core, Spark SQL • Spark SQL unifies access to structured data. • Seamlessly mix SQL queries with Spark programs: sqlCtx.jsonFile("s3n://... ").registerAsTable("json"); schema_rdd = sqlCtx.sql(""" SELECT * FROM hiveTable JOIN json ... """)
  • 15. A UNIFIED STACK: SPARK GRAPHX • Stack: Spark Core, Spark SQL, GraphX • GraphX is Apache Spark's API for graphs and graph-parallel computation. • Seamlessly work with both graphs and collections.
  • 16. A UNIFIED STACK: SPARK MLLIB • Stack: Spark Core, Spark SQL, GraphX, MLlib • MLlib is Apache Spark's scalable machine learning library. • High-quality algorithms, up to 100x faster than MapReduce. • points = spark.textFile("hdfs://... ").map(parsePoint); model = KMeans.train(points, k=10)
  • 17. A UNIFIED STACK: SPARK STREAMING • Stack: Spark Core, Spark SQL, GraphX, MLlib, Spark Streaming • Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications.
  • 18. WHO USES SPARK? Source: Spark Wiki (https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark)
  • 19. DEMO
  • 20. CONCLUSION • Spark is the first system to allow an efficient, general-purpose programming language to be used interactively to process large datasets on a cluster. • The same engine performs data extraction, model training and interactive queries, so there is no need for a separate framework for each function.
  • 21. INTERESTED IN READING MORE? • https://spark.apache.org/ • http://ampcamp.berkeley.edu/ • https://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w • https://spark.apache.org/documentation.html • edx.org offers a course, "Introduction to Big Data with Apache Spark"
  • 22. ANY QUESTIONS? Thank you for listening
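
The RDD slides (8 to 11) describe lazy transformations, actions, explicit caching and lineage-based recovery. The block below is a minimal single-process sketch of those ideas in plain Python. `ToyRDD`, `read_log`, `LOG` and the sample log lines are hypothetical names made up for illustration; this is not Spark's API and says nothing about how Spark implements RDDs internally.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Toy single-process model of an RDD. Hypothetical class, not Spark's API."""

    def __init__(self, compute: Callable[[], Iterable], lineage: List[str]):
        self._compute = compute            # recipe that rebuilds the data on demand
        self.lineage = lineage             # how this dataset was derived
        self._cache_requested = False
        self._cached: Optional[list] = None

    # Transformations are lazy: they only record a new recipe and a lineage step.
    def filter(self, pred: Callable, label: str = "filter") -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)),
                      self.lineage + [label])

    def map(self, fn: Callable, label: str = "map") -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._iterate()),
                      self.lineage + [label])

    def cache(self) -> "ToyRDD":
        self._cache_requested = True       # materialized by the first action
        return self

    def drop_cache(self) -> None:
        self._cached = None                # simulate losing the in-memory copy

    # Actions trigger the computation and return a plain value.
    def count(self) -> int:
        return sum(1 for _ in self._iterate())

    def collect(self) -> list:
        return list(self._iterate())

    def _iterate(self) -> Iterable:
        if self._cached is not None:
            return iter(self._cached)
        data = self._compute()
        if self._cache_requested:
            self._cached = list(data)
            return iter(self._cached)
        return data


# Made-up sample data standing in for a log file on HDFS.
LOG = [
    "Error\tdb\ttimeout on node 3",
    "Info\tweb\tstarted",
    "Error\tweb\ttimeout on node 7",
    "Error\tdb\tdisk full",
]
source_reads = []                          # one entry per scan of the source


def read_log() -> Iterable:
    source_reads.append(1)
    return iter(LOG)


lines = ToyRDD(read_log, ["source(log)"])
errors = lines.filter(lambda l: l.startswith("Error"), "filter(startswith Error)")
messages = errors.map(lambda l: l.split("\t")[2], "map(field 2)").cache()
reads_before_action = len(source_reads)    # nothing has run yet

timeouts = messages.filter(lambda m: "timeout" in m).count()
disk_errors = messages.filter(lambda m: "disk" in m).count()
reads_with_cache = len(source_reads)       # both queries share one scan

messages.drop_cache()                      # lose the cached copy
recovered = messages.collect()             # rebuilt by replaying the lineage
reads_after_recovery = len(source_reads)
print(messages.lineage, timeouts, disk_errors, recovered)
```

In this sketch the source is scanned only when the first action runs, and the second query is served from the cached list. After `drop_cache()`, the data is recomputed from the recorded recipe, which is the role lineage plays for a lost partition.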

Editor's Notes

  1. MapReduce has no efficient primitives for data sharing: state between steps goes to the distributed file system, which is slow due to replication and disk storage. While MapReduce is simple, it spends almost 90% of its time in I/O operations. MapReduce does not compose well for all application use cases, so people built specialized systems as workarounds, such as Pregel, Dremel, Drill, Impala, Storm and S4. Complex apps and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing. In MapReduce, the only way to share data across jobs is stable storage, which is slow.
  2. Here comes Spark: it takes the concept of MapReduce to the next level with a higher-level API, low latency and in-memory data storage.
  3. Spark supports applications with working sets while providing scalability and fault tolerance similar to MapReduce. Spark outperforms Hadoop by 10x in iterative ML workloads and can scan a 39 GB dataset with sub-second latency. It extends a programming language with a distributed collection data structure: resilient distributed datasets (RDDs).
  4. Conceptually, RDDs can be roughly viewed as partitioned, locality-aware distributed vectors.
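
To make the "partitioned vector" view concrete, here is a small plain-Python sketch of a word count (the example from slide 12) over a partitioned collection. Each partition is counted independently, then the partial results are merged. The helper names `partition`, `count_words_in_partition` and `merge_counts`, the round-robin split and the sample sentences are all assumptions made for illustration. Real Spark distributes partitions across machines and schedules work close to the data, which this single-process sketch does not attempt.

```python
from collections import Counter
from functools import reduce
from typing import List


def partition(items: List[str], num_partitions: int) -> List[List[str]]:
    """Split items round-robin into chunks, standing in for blocks of a
    file spread across cluster nodes."""
    return [items[i::num_partitions] for i in range(num_partitions)]


def count_words_in_partition(lines: List[str]) -> Counter:
    """Per-partition step: needs only the data held in this partition."""
    return Counter(word for line in lines for word in line.split())


def merge_counts(a: Counter, b: Counter) -> Counter:
    """Combine step: add up the partial counts from two partitions."""
    return a + b


# Made-up sample input.
lines = ["spark is fast", "spark is general", "mapreduce is simple"]
partitions = partition(lines, 2)
partial_counts = [count_words_in_partition(p) for p in partitions]
word_counts = reduce(merge_counts, partial_counts, Counter())
print(word_counts.most_common(2))
```

The per-partition step touches only local data, so in a real cluster it could run wherever each partition lives. Only the small partial counts need to move between nodes.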