Intro to Apache Spark

Marius Soutier
Freelance Software Engineer
@mariussoutier
Clustered In-Memory Computation
Motivation
• Classical data architectures break down
• RDBMS can’t handle large amounts of data well
• Most RDBMS can’t handle multiple input formats
• Most NoSQLs don’t offer analytics
Problem: Running computations on BigData®
The 3 Vs of Big Data
• Volume: 100s of GB, TB, PB
• Variety: Structured, Unstructured, Semi-Structured
• Velocity: Sensors, Realtime, “Fast Data”
Hadoop (1)
• Hadoop is the de facto standard for running computations on large amounts of varied data
• Hadoop consists of
  • HDFS: a distributed, fault-tolerant file system
  • Map/Reduce: parallelizable computations, a model pioneered by Google
• Hadoop is typically run on a (large) cluster of non-virtualized commodity hardware
Hadoop (2)
• However, Map/Reduce jobs are batch jobs with high latency
• Not suitable for interactive queries, real-time analytics, or Machine Learning
• Pure Map/Reduce is hard to develop and maintain
Enter Spark
Spark is a framework for clustered in-memory data processing
Apache Spark (1)
• Developed at UC Berkeley, released in 2010
• Apache Top-Level Project since February 2014; current version is 1.2.1 / 1.3.0
• USP: Uses cluster-wide available memory to speed up computations
• Very active community
Apache Spark (2)
• Written in Scala (& Akka), APIs for Java and Python
• Programming model is a collection pipeline* instead of Map/Reduce
• Supports batch, streaming, interactive, or all combined using a unified API
* http://martinfowler.com/articles/collection-pipeline/
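To make the collection-pipeline idea concrete, here is a minimal sketch that uses only ordinary Scala collections, with no Spark involved. The object name `CollectionPipelineSketch` and the sample sentences are made up for illustration. The point is the shape of the code: a chain of small transformations, each returning a new collection, which is the style the RDD API mirrors on a cluster.

```scala
object CollectionPipelineSketch {
  // Count words with plain Scala collections; every step returns a new collection.
  def wordCount(lines: Seq[String]): Seq[(String, Long)] =
    lines
      .flatMap(line => line.toLowerCase.split("\\s+").toSeq) // one line becomes many words
      .filter(_.nonEmpty)                                     // drop empty tokens
      .groupBy(identity)                                      // word -> all of its occurrences
      .map { case (word, occurrences) => word -> occurrences.size.toLong }
      .toSeq
      .sortBy(_._1)                                           // sort alphabetically by word

  def main(args: Array[String]): Unit =
    wordCount(Seq("Spark is fast", "spark is in memory")).foreach(println)
}
```

The same steps written as hand-rolled Map/Reduce jobs would need separate mapper and reducer classes; in the pipeline style they stay a single expression.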
Spark Ecosystem
[Diagram: Spark Core; Spark SQL; Spark Hive; BlinkDB (approximate SQL); Spark Streaming; MLlib (machine learning); GraphX; SparkR; Tachyon. Three of the components are labeled ALPHA.]
Spark is a framework for clustered in-memory data processing
Spark is a platform for data-driven products.
RDD
• Base abstraction: Resilient Distributed Dataset (RDD)
• Essentially a distributed collection of objects
• Can be cached in memory or on disk
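The caching point can be illustrated with a small sketch. It assumes spark-core is on the classpath and runs against a local master. The object name `RddCachingSketch`, the app name, and the `local[2]` master are choices made for this example and are not from the talk. `persist` with a `StorageLevel` selects where cached partitions live.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddCachingSketch {
  // Returns (sum of squares, number of even squares) for 1 to 1000.
  def run(): (Long, Long) = {
    // Local master and app name are assumptions for a laptop-sized run.
    val conf = new SparkConf().setAppName("rdd-caching-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val numbers = sc.parallelize(1 to 1000)
      // Keep partitions in memory and spill those that do not fit to disk.
      val squares = numbers
        .map(n => n.toLong * n)
        .persist(StorageLevel.MEMORY_AND_DISK)
      // The first action computes and stores the partitions;
      // the second action reuses them instead of recomputing.
      val total = squares.reduce(_ + _)
      val evens = squares.filter(_ % 2 == 0).count()
      (total, evens)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

`cache()` is shorthand for `persist(StorageLevel.MEMORY_ONLY)`; `StorageLevel.DISK_ONLY` keeps the data on disk only.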
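RDD Word Count
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed before Spark 1.3
import org.apache.spark.rdd.RDD

// Master and app name are expected to come from spark-submit
val sc = new SparkContext()
val input: RDD[String] = sc.textFile("/tmp/word.txt")

// Split each line into lowercase words and pair every word with a count of 1
val words: RDD[(String, Long)] = input
  .flatMap(line => line.toLowerCase.split("\\s+"))
  .map(word => word -> 1L)
  .cache()

// Sum the counts per word, then sort by word
val wordCountsRdd: RDD[(String, Long)] = words
  .reduceByKey(_ + _)
  .sortByKey()

// Bring the result back to the driver
val wordCounts: Array[(String, Long)] = wordCountsRdd.collect()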
Cluster
[Diagram: the Driver (holding the SparkContext) connects to the Master; each Worker runs an Executor, which executes Tasks.]
• Spark app (driver) builds DAG from RDD operations
• DAG is split into tasks that are executed by workers
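The two bullets above can be observed in a small sketch, again assuming spark-core on the classpath and a local master. The object name `LazyDagSketch`, the app name, and the sample data are invented for this example. Transformations only record lineage on the driver, `toDebugString` prints that lineage, and nothing is executed until an action such as `count()` is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed before Spark 1.3

object LazyDagSketch {
  // Returns (textual lineage of the final RDD, number of distinct words).
  def run(): (String, Long) = {
    // Local master and app name are assumptions for a laptop-sized run.
    val conf = new SparkConf().setAppName("lazy-dag-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Transformations: the driver records them, but no task runs yet.
      val pairs = sc.parallelize(Seq("a b", "b c", "c a"), numSlices = 2)
        .flatMap(_.split(" "))
        .map(word => word -> 1L)
      // reduceByKey needs a shuffle, so the scheduler starts a new stage here.
      val counts = pairs.reduceByKey(_ + _)
      val lineage = counts.toDebugString
      // Action: the driver submits a job, which is cut into stages and tasks
      // that the executors run.
      val distinctWords = counts.count()
      (lineage, distinctWords)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = {
    val (lineage, distinctWords) = run()
    println(lineage)
    println(s"distinct words: $distinctWords")
  }
}
```

The exact text of the lineage varies between Spark versions, so the sketch prints it without relying on its format.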
Example Architecture
[Diagram: Input (HDFS, message queue) → Spark (Spark Streaming, Spark batch jobs, SparkSQL) → Output (real-time dashboard, interactive SQL, analytics and reports)]
Demo
Questions?