Introduction to Spark
Eric Eijkelenboom - UserReport - userreport.com
• What is Spark and why should I care?
• Architecture and programming model
• Examples
• Mini demo
• Related projects
RTFM
• A general-purpose computation framework that leverages distributed
memory
• More flexible than MapReduce (it supports general execution graphs)
• Linear scalability and fault tolerance
• It supports a rich set of higher-level tools including
• Shark (Hive on Spark) and Spark SQL
• MLlib for machine learning
• GraphX for graph processing
• Spark Streaming
Who cares?
Limitations of MapReduce
• Slow due to serialisation & replication
• Inefficient for iterative computing & interactive querying
[Diagram: a MapReduce job runs Map tasks, then Reduce tasks, between input and output. In an iterative workload, every iteration reads its input from HDFS and writes its result back to HDFS.]
Leveraging memory
[Diagram: two iterative pipelines compared. In the first, each iteration reads from HDFS and writes back to HDFS. In the second, the input is read once and the iterations run without the HDFS round trips.]
Not tied to 2-stage
MapReduce paradigm
1. Extract a working set
2. Cache it
3. Query it repeatedly
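The three steps above can be sketched in plain Python. This is a single-process illustration, not Spark code: the log lines and the helper names `load_lines` and `extract_working_set` are hypothetical, and an ordinary list held in memory stands in for a cached dataset.

```python
# Toy illustration of "extract, cache, query repeatedly" (not the Spark API).

def load_lines():
    """Stand-in for an expensive read from distributed storage (hypothetical data)."""
    return [
        "INFO start job",
        "ERROR disk full on node-3",
        "WARN slow task",
        "ERROR timeout on node-7",
        "ERROR disk full on node-9",
    ]

def extract_working_set(lines):
    """Step 1: keep only the records the later queries care about."""
    return [line for line in lines if line.startswith("ERROR")]

# Step 2: materialise the working set once and keep it in memory.
cached_errors = extract_working_set(load_lines())

# Step 3: run several queries against the cached data without re-reading the input.
disk_errors = sum(1 for line in cached_errors if "disk full" in line)
timeout_errors = sum(1 for line in cached_errors if "timeout" in line)

print(disk_errors, timeout_errors)
```

The point of the sketch is that `load_lines` runs once, while any number of queries can follow against `cached_errors`.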
So, Spark is…
• In-memory analytics, many times faster than
Hadoop/Hive
• Designed for running iterative algorithms &
interactive querying
• Highly compatible with Hadoop’s storage APIs
• Can run on your existing Hadoop cluster setup
• Programming in Scala, Python or Java
Spark stack
Architecture
[Diagram: a Spark Driver (Master) and a Cluster Manager sit above a row of Spark Workers; each worker is paired with an HDFS Datanode and holds a cache and data blocks.]
Cluster Manager options:
• YARN
• Mesos
• Standalone
Programming model
• Resilient Distributed Datasets (RDDs) are the basic building blocks
• Distributed collection of objects, cached in memory across
cluster nodes
• Automatically rebuilt on failure
• RDD operations
• Transformations: create new RDDs from existing ones
• Actions: return a value to the master node after running a
computation on the dataset
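The split between transformations and actions can be illustrated with a toy, single-process model in plain Python. `ToyRDD` is a hypothetical class written for this sketch and is not Spark's API: transformations only record a step in a lineage, and an action replays that lineage from the source. Replaying the lineage is also the idea behind rebuilding lost data after a failure.

```python
import builtins
import functools

class ToyRDD:
    """Hypothetical stand-in for an RDD: a data source plus a recorded lineage."""

    def __init__(self, source, lineage=()):
        self._source = source      # zero-argument function returning the base records
        self._lineage = lineage    # tuple of (kind, function) steps, applied lazily

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        """Replay the lineage from the source, step by step."""
        records = iter(self._source())
        for kind, fn in self._lineage:
            step = builtins.map if kind == "map" else builtins.filter
            records = step(fn, records)
        return records

    # Actions: run the computation and return a value to the caller.
    def collect(self):
        return list(self._compute())

    def reduce(self, fn):
        return functools.reduce(fn, self._compute())

computed = []

def square(x):
    computed.append(x)             # record that the function really ran
    return x * x

numbers = ToyRDD(lambda: range(1, 6))
evens = numbers.map(square).filter(lambda x: x % 2 == 0)   # transformations only
ran_before_action = len(computed)                           # nothing has executed yet
result = evens.collect()                                    # action: triggers the work
total = evens.reduce(lambda a, b: a + b)                    # action: recomputes from lineage
print(ran_before_action, result, total)
```

Because nothing is cached in this toy model, the second action recomputes everything from the source, which is the cost that in-memory caching is meant to avoid.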
As you know…
• … Hadoop is a distributed system for counting
words
• Here is how it’s done in Spark
Blue code: Spark operations
Red code: functions
(closures) that get passed
to the cluster automatically
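The code shown on the original slide is not preserved in this text. As a stand-in, word count on Spark is commonly expressed as a `flatMap`, `map`, `reduceByKey` pipeline, and the sketch below mimics those three steps in plain Python. The helpers `flat_map` and `reduce_by_key` and the input lines are hypothetical and written for this example; the lambdas play the role of the closures that would be shipped to the cluster.

```python
from itertools import chain

def flat_map(fn, records):
    """Apply fn to every record and flatten the resulting lists into one list."""
    return list(chain.from_iterable(fn(r) for r in records))

def reduce_by_key(fn, pairs):
    """Merge the values of equal keys with the binary function fn."""
    merged = {}
    for key, value in pairs:
        merged[key] = fn(merged[key], value) if key in merged else value
    return merged

lines = ["to be or not to be", "to do is to be"]   # hypothetical input

words = flat_map(lambda line: line.split(), lines)    # split lines into words
pairs = [(word, 1) for word in words]                 # map each word to (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)     # sum the ones per word

print(counts)
```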
Text search
In-memory text search:
• Caching the RDD keeps it in memory for faster reuse
Logistic regression
• 100 GB of data on a 100 node cluster
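Logistic regression is a typical iterative workload: every gradient step scans the same dataset, which is why keeping the data in memory pays off. A minimal sketch in plain Python, assuming a tiny hypothetical one-feature dataset, a fixed learning rate and batch gradient descent (the slide does not say which optimiser or settings were used):

```python
import math

# Hypothetical toy data: (feature, label), with label 1 for positive features.
points = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(w, b, data):
    """Gradient of the average log-loss with respect to weight w and bias b."""
    gw = gb = 0.0
    for x, y in data:
        error = sigmoid(w * x + b) - y
        gw += error * x
        gb += error
    n = len(data)
    return gw / n, gb / n

w, b = 0.0, 0.0
learning_rate = 0.5                 # assumed value, chosen for the toy data
for _ in range(200):                # each iteration rescans the same in-memory data
    gw, gb = gradient(w, b, points)
    w -= learning_rate * gw
    b -= learning_rate * gb

predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x, _ in points]
print(predictions)
```

In a disk-based pipeline, each of the 200 passes would start with a fresh read of the input; here the list `points` is simply reused.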
Easy unit testing
Spark shell
Mini demo
Hive on Spark = Shark
• A large scale data warehouse system just like Hive
• Highly compatible with Hive (HQL, metastore,
serialization formats, and UDFs)
• Built on top of Spark (thus a faster execution engine)
• Can create in-memory materialized tables (cached tables)
• Cached tables use columnar storage instead of
raw storage
Shark
Shark uses the existing Hive client and metastore
MLlib
• Machine learning library based on Spark
• Supports a range of machine learning algorithms,
including classification, regression, clustering,
collaborative filtering, dimensionality reduction, and
more
Spark Streaming
• Write streaming applications in the same way as
batch applications
• Reuse code between batch processing and
streaming
• Write more than analytics applications:
• Join streams against historical data
• Run ad-hoc queries on stream state
Spark Streaming
• Count tweets on a sliding window
• Find words with higher frequency than historic data
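Counting over a sliding window can be sketched without Spark: keep the sizes of the last few micro-batches and sum them each time a new batch arrives. The batches, the window length and the function name `sliding_window_counts` below are hypothetical, and this is not the Spark Streaming API.

```python
from collections import deque

def sliding_window_counts(batches, window_length):
    """Yield, after each batch, the item count over the last `window_length` batches."""
    window = deque(maxlen=window_length)   # the oldest batch size drops out automatically
    for batch in batches:
        window.append(len(batch))
        yield sum(window)

# Hypothetical micro-batches of tweets, one list per batch interval.
batches = [
    ["tweet a", "tweet b"],
    ["tweet c"],
    [],
    ["tweet d", "tweet e", "tweet f"],
]

counts = list(sliding_window_counts(batches, window_length=2))
print(counts)
```

Because the function accepts any iterable of batches, the same code works on a finished list or on a generator that keeps producing batches, which mirrors the idea of reusing code between batch and streaming.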
GraphX: graph computing
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 

Introduction to Apache Spark

  • 1. Introduction to Spark Eric Eijkelenboom - UserReport - userreport.com
  • 2. • What is Spark and why should I care? • Architecture and programming model • Examples • Mini demo • Related projects
  • 4. RTFM • A general-purpose computation framework that leverages distributed memory • More flexible than MapReduce (it supports general execution graphs) • Linear scalability and fault tolerance • It supports a rich set of higher-level tools, including: Shark (Hive on Spark) and Spark SQL; MLlib for machine learning; GraphX for graph processing; Spark Streaming
  • 6. Limitations of MapReduce • Slow due to serialisation & replication • Inefficient for iterative computing & interactive querying [Diagram: Input → Map → Reduce → Output; and Input → iter. 1 → iter. 2 → …, with an HDFS read and an HDFS write around each iteration]
  • 7-9. Leveraging memory • Not tied to the 2-stage MapReduce paradigm: 1. Extract a working set 2. Cache it 3. Query it repeatedly [Diagram: Input → iter. 1 → iter. 2 → …, shown once with an HDFS read and an HDFS write around each iteration and once without them]
  • 10. So, Spark is… • In-memory analytics, many times faster than Hadoop/Hive • Designed for running iterative algorithms & interactive querying • Highly compatible with Hadoop’s Storage APIs • Can run on your existing Hadoop Cluster Setup • Programming in Scala, Python or Java
  • 12-13. Architecture [Diagram: Spark Driver (Master), Cluster Manager, several Spark Workers each with a Cache, and HDFS Datanodes each holding a Block] • Cluster Manager options: YARN, Mesos, Standalone
  • 14. Programming model • Resilient Distributed Datasets (RDDs) are basic building blocks • Distributed collection of objects, cached in-memory across cluster nodes • Automatically rebuilt on failure • RDD operations • Transformations: create new RDDs from existing ones • Actions: return a value to the master node after running a computation on the dataset
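The split between transformations and actions can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's API: `ToyRDD` and its methods are hypothetical names that only mimic the idea that transformations record how to derive a dataset, while actions run the computation and return a value.

```python
class ToyRDD:
    """Toy stand-in for an RDD: it only remembers how to compute itself."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterator

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: return a new ToyRDD; nothing is computed yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    # Actions: run the computation and return a plain value to the caller.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


calls = []  # records every time square() actually runs


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD.from_list([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

ran_before_action = len(calls)   # still 0: transformations did no work
result = even_squares.collect()  # the action triggers the computation
```

Because each `ToyRDD` stores only the recipe for deriving its data, calling another action simply recomputes from the source. That is a loose analogy for how a lost partition can be rebuilt from its lineage.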
  • 15-16. As you know… • … Hadoop is a distributed system for counting words • Here is how it’s done in Spark [the slide's code is not included in this transcript] • Blue code: Spark operations • Red code: functions (closures) that get passed to the cluster automatically
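The slide's code was not captured, so the following is only a sketch of the usual word-count shape, written in plain Python rather than against Spark. The three steps correspond to the `flatMap`, `map` and `reduceByKey` operations a Spark version would typically chain; the input lines are made up for illustration.

```python
from collections import defaultdict

# Hypothetical input; a Spark job would read this from a file instead.
lines = ["to be or not to be", "that is the question"]

# flatMap step: split every line into words and flatten into one sequence.
words = [word for line in lines for word in line.split()]

# map step: pair each word with the count 1.
pairs = [(word, 1) for word in words]

# reduceByKey step: add up the counts that share the same word.
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

word_counts = dict(counts)
```

In the distributed version the small functions (splitting, pairing, adding) are the closures that get shipped to the workers, while the chaining of steps is the Spark part.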
  • 18. Text search • In-memory text search [the slide's code is not included in this transcript] • Slide annotation: caches the RDD in memory for faster reuse
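The "extract a working set, cache it, query it repeatedly" pattern behind in-memory text search can be sketched in plain Python. This is an illustration under assumptions, not Spark code: `load_log`, the log lines and the search terms are all hypothetical.

```python
load_calls = []  # one entry per (expensive) read of the source


def load_log():
    """Stand-in for an expensive read from distributed storage."""
    load_calls.append(1)
    return [
        "INFO service started",
        "ERROR mysql connection refused",
        "ERROR php timeout",
        "WARN disk almost full",
        "ERROR mysql deadlock",
    ]


# 1. Extract a working set: only the error lines.
# 2. Cache it: keep the filtered list in memory.
errors = [line for line in load_log() if line.startswith("ERROR")]


def search(term):
    # 3. Query it repeatedly: each search scans the cached list, not the source.
    return [line for line in errors if term in line]


mysql_hits = search("mysql")
php_hits = search("php")
```

Both searches run against the in-memory list, so the source is read only once however many queries follow.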
  • 19. Logistic regression • 100 GB of data on a 100-node cluster
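Logistic regression is a typical iterative workload: every gradient step scans the same dataset again, which is why keeping it in memory pays off. The following is a minimal single-machine sketch in plain Python, not the code from the slide; the dataset, learning rate and iteration count are assumed values.

```python
import math

# Tiny hypothetical dataset of (feature, label) pairs, held in memory.
points = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(w, b, data):
    """Sum the per-point log-loss gradients (one full pass over the data)."""
    gw = sum((sigmoid(w * x + b) - y) * x for x, y in data)
    gb = sum((sigmoid(w * x + b) - y) for x, y in data)
    return gw, gb


w, b = 0.0, 0.0
learning_rate = 0.5  # assumed value
for _ in range(100):  # every iteration rescans the same in-memory data
    gw, gb = gradient(w, b, points)
    w -= learning_rate * gw
    b -= learning_rate * gb

predictions = [1 if sigmoid(w * x + b) > 0.5 else 0 for x, _ in points]
```

In a distributed setting the per-point gradient would be computed on the workers and the sum combined centrally; only the loop structure is the same here.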
  • 24. Hive on Spark = Shark • A large-scale data warehouse system, just like Hive • Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs) • Built on top of Spark (thus a faster execution engine) • Can create in-memory materialized tables (cached tables) • Cached tables use columnar storage instead of raw storage
  • 25. Shark • Shark uses the existing Hive client and metastore
  • 26. MLlib • Machine learning library based on Spark • Supports a range of machine learning algorithms, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and more
  • 27. Spark Streaming • Write streaming applications in the same way as batch applications • Reuse code between batch processing and streaming • Write more than analytics applications: • Join streams against historical data • Run ad-hoc queries on stream state
  • 28. Spark Streaming • Count tweets on a sliding window • Find words with higher frequency than in historic data
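Both ideas on this slide, counting over a sliding window and comparing against historic frequencies, can be sketched in plain Python. This is not the Spark Streaming API; the batches, the window length and the historic baseline are hypothetical values chosen for illustration.

```python
from collections import Counter, deque

# Hypothetical micro-batches of hashtags, one list per batch interval.
batches = [
    ["#spark", "#hadoop"],
    ["#spark", "#spark", "#scala"],
    ["#hadoop"],
    ["#spark", "#scala", "#scala"],
]

WINDOW_BATCHES = 3  # window length, measured in batches (assumed value)

window = deque(maxlen=WINDOW_BATCHES)  # the oldest batch drops out automatically
window_counts = []
for batch in batches:
    window.append(Counter(batch))
    total = sum(window, Counter())  # merge the per-batch counts in the window
    window_counts.append(dict(total))

# Flag tags whose count in the latest window exceeds a historic baseline.
historic = {"#spark": 2, "#hadoop": 2, "#scala": 1}  # hypothetical baseline
latest = window_counts[-1]
trending = sorted(tag for tag, n in latest.items() if n > historic.get(tag, 0))
```

The window slides by one batch at a time, so each new batch both adds its counts and pushes the oldest batch out of the total.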