Apache Spark 
Buenos Aires High Scalability 
Buenos Aires, Argentina, Dec 2014 
Fernando Rodriguez Olivera 
@frodriguez
Fernando Rodriguez Olivera 
Professor at Universidad Austral (Distributed Systems, Compiler 
Design, Operating Systems, …) 
Creator of mvnrepository.com 
Organizer at Buenos Aires High Scalability Group, Professor at 
nosqlessentials.com 
Twitter: @frodriguez
Apache Spark 
Apache Spark is a Fast and General Engine 
for Large-Scale data processing 
In-Memory computing primitives 
Support for Batch, Interactive, Iterative and 
Stream processing with a Unified API
Apache Spark 
Unified API for multiple kinds of processing 
Batch (high throughput) 
Interactive (low latency) 
Stream (continuous processing) 
Iterative (results used immediately)
Daytona Gray Sort 100TB Benchmark 
                     Data Size  Time    Nodes  Cores 
Hadoop MR (2013)     102.5 TB   72 min  2,100  50,400 physical 
Apache Spark (2014)  100 TB     23 min  206    6,592 virtualized 
3X faster using 10X fewer machines 
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
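The "3X faster using 10X fewer machines" summary can be checked by dividing the table's figures. A small sketch, using only the numbers quoted in the table (the object and value names are illustrative, and the slightly different data sizes are ignored):

```scala
object SortBenchmark {
  // Figures as quoted in the benchmark table.
  val hadoopMinutes = 72.0
  val sparkMinutes  = 23.0
  val hadoopNodes   = 2100.0
  val sparkNodes    = 206.0

  // Ratio of run times: roughly 3.1
  val speedup = hadoopMinutes / sparkMinutes
  // Ratio of cluster sizes: roughly 10.2
  val nodeReduction = hadoopNodes / sparkNodes
}
```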
Hadoop vs Spark for Iterative Processing 
Logistic regression in Hadoop and Spark 
source: https://spark.apache.org/
Hadoop MR Limits 
Job Job Job 
Hadoop HDFS 
MapReduce designed for Batch Processing: 
- Communication between jobs through FS 
- Fault-Tolerance (between jobs) by Persistence to FS 
- Memory not managed (relies on OS caches) 
Compensated with: Storm, Samza, Giraph, Impala, Presto, etc
Apache Spark 
Spark SQL | Spark Streaming | MLlib | GraphX 
Apache Spark (Core) 
Powered by Scala and Akka 
APIs for Java, Scala, Python
Resilient Distributed Datasets (RDD) 
RDD of Strings: "Hello World", ..., "A New Line", ..., "hello", "The End", ... 
Immutable Collection of Objects 
Partitioned and Distributed 
Stored in Memory 
Partitions Recomputed on Failure
RDD Transformations and Actions 
RDD of Strings: "Hello World", ..., "A New Line", ..., "hello", "The End", ... 
Compute Function (transformation), e.g: apply function to count chars 
RDD of Ints (depends on the RDD of Strings): 11, ..., 10, ..., 5, ..., 7, ... 
Action: returns a plain value, e.g: Int N 
RDD Implementation: 
- Partitions 
- Compute Function 
- Dependencies 
- Preferred Compute Location (for each partition) 
- Partitioner
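The five parts of the RDD implementation listed above can be pictured as a small interface. The following is a heavily simplified, hypothetical model in plain Scala (`MiniRdd`, `fromPartitions` and every member name are made up for illustration and are not Spark's actual classes); it only shows how a transformation creates a new dataset that depends on its parent, and how an action runs the compute function.

```scala
object MiniRddSketch {
  // Hypothetical model of the five RDD properties; not Spark's real API.
  trait MiniRdd[T] {
    def partitions: Seq[Int]                          // partition indexes
    def compute(partition: Int): Iterator[T]          // compute function
    def dependencies: Seq[MiniRdd[_]]                 // parent datasets
    def preferredLocations(partition: Int): Seq[String] = Nil
    def partitioner: Option[Any => Int] = None

    // Transformation: nothing is computed, a child depending on this is returned.
    def map[U](f: T => U): MiniRdd[U] = {
      val parent = this
      new MiniRdd[U] {
        def partitions: Seq[Int] = parent.partitions
        def compute(p: Int): Iterator[U] = parent.compute(p).map(f)
        def dependencies: Seq[MiniRdd[_]] = Seq(parent)
      }
    }

    // Action: runs the compute function on every partition and returns a value.
    def count(): Long = partitions.map(p => compute(p).size.toLong).sum
  }

  // Source dataset backed by in-memory partitions.
  def fromPartitions[T](data: Seq[Seq[T]]): MiniRdd[T] = new MiniRdd[T] {
    def partitions: Seq[Int] = data.indices
    def compute(p: Int): Iterator[T] = data(p).iterator
    def dependencies: Seq[MiniRdd[_]] = Nil
  }

  val strings = fromPartitions(Seq(Seq("Hello World", "A New Line"), Seq("hello", "The End")))
  val lengths = strings.map(_.length)             // transformation: count chars
  val firstPartition = lengths.compute(0).toList  // List(11, 10)
  val n = lengths.count()                         // action: 4
}
```

Because a child only stores its parent and a function, any partition can be recomputed by calling `compute` again, which is the idea behind recomputing partitions on failure.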
Spark API 
Scala 
import org.apache.spark.SparkContext 
val spark = new SparkContext() 
val lines = spark.textFile("hdfs://docs/") // RDD[String] 
val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String] 
val count = nonEmpty.count 
Java 8 
import org.apache.spark.api.java.*; 
JavaSparkContext spark = new JavaSparkContext(); 
JavaRDD<String> lines = spark.textFile("hdfs://docs/"); 
JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0); 
long count = nonEmpty.count(); 
Python 
from pyspark import SparkContext 
spark = SparkContext() 
lines = spark.textFile("hdfs://docs/") 
nonEmpty = lines.filter(lambda line: len(line) > 0) 
count = nonEmpty.count()
RDD Operations 
Transformations      Actions 
map(func)            take(N) 
flatMap(func)        count() 
filter(func)         collect() 
groupByKey()         reduce(func) 
reduceByKey(func)    takeOrdered(N) 
mapValues(func)      top(N) 
…                    …
Text Processing Example 
Top Words by Frequency 
(Step by step)
Create RDD from External Data 
Apache Spark 
Hadoop FileSystem, I/O Formats, Codecs 
HDFS, S3, HBase, MongoDB, Cassandra, ElasticSearch, … 
Spark can read/write from any data source supported by Hadoop 
I/O via Hadoop is optional (e.g: the Cassandra connector bypasses Hadoop) 
// Step 1 - Create RDD from Hadoop Text File 
val docs = spark.textFile("/docs/")
Function map 
RDD[String]: "Hello World", "A New Line", "hello", ..., "The end" 
.map(line => line.toLowerCase), equivalent to .map(_.toLowerCase) 
RDD[String]: "hello world", "a new line", "hello", ..., "the end" 
// Step 2 - Convert lines to lower case 
val lower = docs.map(line => line.toLowerCase)
Functions map and flatMap 
RDD[String]: "hello world", "a new line", "hello", ..., "the end" 
.map(_.split("\\s+")) 
RDD[Array[String]]: [hello, world], [a, new, line], [hello], ..., [the, end] 
.flatten * 
RDD[String]: hello, world, a, new, line, ... 
.flatMap(line => line.split("\\s+")) goes from the first RDD[String] to the last one in a single step 
// Step 3 - Split lines into words 
val words = lower.flatMap(line => line.split("\\s+")) 
* Note: flatten() is not available in Spark, only flatMap
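The relationship between map, flatten and flatMap can be tried locally on ordinary Scala collections, which do offer `flatten`. This is only a sketch with a `Seq` standing in for the RDD; the sample lines are the ones used on the slide.

```scala
object FlatMapSketch {
  // Local Scala collection standing in for the RDD[String] of lower-cased lines.
  val lower: Seq[String] = Seq("hello world", "a new line", "hello", "the end")

  // map keeps one output element per input line: Seq[Array[String]]
  val nested: Seq[Array[String]] = lower.map(_.split("\\s+"))

  // flatten concatenates the inner arrays into a single Seq[String]
  val viaFlatten: Seq[String] = nested.flatten

  // flatMap performs both steps at once
  val words: Seq[String] = lower.flatMap(_.split("\\s+"))
}
```

Both routes give the same list of words; on an RDD only the `flatMap` route exists.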
Key-Value Pairs 
RDD[String]: hello, world, a, new, line, hello, ... 
.map(word => Tuple2(word, 1)), equivalent to .map(word => (word, 1)) 
RDD[(String, Int)], i.e. RDD[Tuple2[String, Int]] (a Pair RDD): (hello, 1), (world, 1), (a, 1), (new, 1), (line, 1), (hello, 1), ... 
// Step 4 - Map each word to a (word, 1) pair 
val counts = words.map(word => (word, 1))
Shuffling 
RDD[(String, Int)]: (hello, 1), (world, 1), (a, 1), (new, 1), (line, 1), (hello, 1) 
.groupByKey 
RDD[(String, Iterable[Int])]: (world, [1]), (a, [1]), (new, [1]), (line, [1]), (hello, [1, 1]) 
.mapValues(_.reduce((a, b) => a + b)) 
RDD[(String, Int)]: (world, 1), (a, 1), (new, 1), (line, 1), (hello, 2) 
.reduceByKey((a, b) => a + b) goes from the first RDD to the last one in a single step 
// Step 5 - Count all words 
val freq = counts.reduceByKey(_ + _)
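To see why reduceByKey is preferred over groupByKey followed by mapValues, here is a local sketch in plain Scala. The two inner `Seq`s are hypothetical partitions (the split of the slide's pairs across partitions is an assumption), and `sumByKey` is an illustrative helper, not a Spark function. reduceByKey combines values inside each partition before the shuffle, so fewer records have to move.

```scala
object ReduceByKeySketch {
  type Pair = (String, Int)

  // Two hypothetical partitions holding the (word, 1) pairs.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq(("hello", 1), ("world", 1), ("hello", 1)),
    Seq(("a", 1), ("new", 1), ("line", 1))
  )

  // Sums the values of each key within one collection of pairs.
  def sumByKey(pairs: Seq[Pair]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, group) => word -> group.map(_._2).sum }

  // groupByKey + mapValues style: all 6 pairs are moved, then reduced.
  val viaGroup: Map[String, Int] = sumByKey(partitions.flatten)

  // reduceByKey style: combine inside each partition first, then merge partial sums.
  val partial: Seq[Map[String, Int]] = partitions.map(sumByKey)
  val shuffledRecords: Int = partial.map(_.size).sum // 5 records instead of 6
  val viaReduce: Map[String, Int] = sumByKey(partial.flatMap(_.toSeq))
}
```

The saving is small here, but it grows with the number of repeated keys per partition.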
Top N (Prepare data) 
RDD[(String, Int)]: (world, 1), (a, 1), (new, 1), (line, 1), (hello, 2) 
.map(_.swap) 
RDD[(Int, String)]: (1, world), (1, a), (1, new), (1, line), (2, hello) 
// Step 6 - Swap tuples (partial code) 
freq.map(_.swap)
Top N (First Attempt) 
RDD[(Int, String)]: (1, world), (1, a), (1, new), (1, line), (2, hello) 
.sortByKey (sortByKey(false) for descending) 
RDD[(Int, String)]: (2, hello), (1, world), (1, a), (1, new), (1, line) 
.take(N) 
Array[(Int, String)]: (2, hello), (1, world)
Top N 
RDD[(Int, String)]: (1, world), (1, a), (1, new), (1, line), (2, hello) 
.top(N): local top N * on each partition, followed by a reduction of the local results 
Array[(Int, String)]: (2, hello), ... 
* local top N implemented by bounded priority queues 
// Step 6 - Swap tuples (complete code) 
val top = freq.map(_.swap).top(N)
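The "local top N, then reduction" idea can be sketched with a bounded priority queue from the Scala standard library. This is an illustration of the technique, not Spark's own code; `localTop`, `top` and the partition layout are assumptions made for the example.

```scala
import scala.collection.mutable

object TopNSketch {
  type Entry = (Int, String)

  // Local top N: a min-heap that never holds more than n elements.
  def localTop(items: Iterator[Entry], n: Int): Seq[Entry] = {
    val heap = mutable.PriorityQueue.empty[Entry](Ordering[Entry].reverse)
    items.foreach { item =>
      heap.enqueue(item)
      if (heap.size > n) heap.dequeue() // drop the current smallest entry
    }
    heap.toSeq
  }

  // Reduction: merge the per-partition results and keep the overall top N.
  def top(partitions: Seq[Seq[Entry]], n: Int): Seq[Entry] = {
    val merged = partitions.flatMap(p => localTop(p.iterator, n))
    localTop(merged.iterator, n).sorted(Ordering[Entry].reverse)
  }

  // Hypothetical split of the swapped tuples into two partitions.
  val partitions: Seq[Seq[Entry]] = Seq(
    Seq((1, "world"), (1, "a"), (1, "new")),
    Seq((1, "line"), (2, "hello"))
  )
  val top2: Seq[Entry] = top(partitions, 2)
}
```

Each partition keeps at most N entries in memory, so no full sort of the data is needed, unlike the sortByKey attempt.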
Top Words by Frequency (Full Code) 
import org.apache.spark.SparkContext 
import org.apache.spark.SparkContext._ // pair RDD functions such as reduceByKey (needed before Spark 1.3) 
val spark = new SparkContext() 
// RDD creation from external data source 
val docs = spark.textFile("hdfs://docs/") 
// Convert to lower case, split lines into words, pair each word with 1 
val lower = docs.map(line => line.toLowerCase) 
val words = lower.flatMap(line => line.split("\\s+")) 
val counts = words.map(word => (word, 1)) 
// Count all words (automatic combination) 
val freq = counts.reduceByKey(_ + _) 
// Swap tuples and get top results 
val top = freq.map(_.swap).top(N) 
top.foreach(println)
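For experimenting without a cluster, the same pipeline can be mirrored on local Scala collections. This is a sketch, not the Spark program: `groupBy` plus a sum stands in for reduceByKey, sorting plus `take` stands in for top, the sample lines are hypothetical input, and the extra `filter(_.nonEmpty)` is an addition that drops the empty token `split` produces for lines starting with whitespace.

```scala
object LocalWordCount {
  // Hypothetical sample input standing in for the lines read from the data source.
  val docs: Seq[String] = Seq("Hello World", "A New Line", "hello", "The End")

  def topWords(lines: Seq[String], n: Int): Seq[(Int, String)] = {
    val lower  = lines.map(_.toLowerCase)
    val words  = lower.flatMap(_.split("\\s+")).filter(_.nonEmpty)
    val counts = words.map(word => (word, 1))
    // Local stand-in for reduceByKey(_ + _)
    val freq = counts.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2).sum) }
    // Local stand-in for map(_.swap).top(n)
    freq.toSeq.map(_.swap).sorted(Ordering[(Int, String)].reverse).take(n)
  }

  val top: Seq[(Int, String)] = topWords(docs, 2)
}
```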
RDD Persistence (in-memory) 
.cache() (memory only) 
.persist() (memory only) 
.persist(storageLevel) 
StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, … 
(lazy persistence & caching)
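"Lazy persistence" means that calling cache() or persist() only marks the RDD; the data is stored the first time an action computes it. A minimal model of that behaviour in plain Scala follows; the `Dataset` class and the `computations` counter are hypothetical names used only for this illustration.

```scala
object LazyCacheSketch {
  // Counts how many times the expensive computation actually runs.
  var computations = 0

  // Hypothetical stand-in for an RDD: recomputed on every use unless cached.
  final class Dataset[T](compute: () => Seq[T]) {
    private var cached: Option[Seq[T]] = None
    private var persistRequested = false

    // Like cache(): only marks the dataset, nothing is computed yet.
    def cache(): this.type = { persistRequested = true; this }

    // Like an action: computes, and keeps the result if caching was requested.
    def collect(): Seq[T] = cached.getOrElse {
      val result = compute()
      if (persistRequested) cached = Some(result)
      result
    }
  }

  val words = new Dataset(() => { computations += 1; Seq("hello", "world") }).cache()
  val afterCache = computations      // still 0: cache() is lazy
  words.collect()
  words.collect()
  val afterTwoActions = computations // 1: the second action reads the stored copy
}
```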
RDD Lineage 
RDD Transformations 
val words = sc.textFile("hdfs://large/file/") // HadoopRDD 
  .map(_.toLowerCase) // MappedRDD 
  .flatMap(_.split(" ")) // FlatMappedRDD 
val nums = words.filter(_.matches("[0-9]+")) // FilteredRDD 
val alpha = words.filter(_.matches("[a-z]+")) // FilteredRDD 
Lineage (built on the driver by the transformations) 
alpha.count() // Action (run job on the cluster)
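The lineage on this slide is a graph in which every RDD points back to the RDD it was derived from. The sketch below models just that graph in plain Scala; `Node` and `derive` are hypothetical names, and no data is processed. On a real RDD, `toDebugString` prints a comparable description.

```scala
object LineageSketch {
  // Hypothetical lineage node: a name plus the parent it was derived from.
  final case class Node(name: String, parent: Option[Node]) {
    // Models a transformation: only a new node is recorded, nothing is executed.
    def derive(child: String): Node = Node(child, Some(this))

    // Walks back from this node to the source.
    def lineage: List[String] = name :: parent.map(_.lineage).getOrElse(Nil)
  }

  val source = Node("HadoopRDD", None)
  val words  = source.derive("MappedRDD").derive("FlatMappedRDD")
  val nums   = words.derive("FilteredRDD")
  val alpha  = words.derive("FilteredRDD")
}
```

`nums` and `alpha` share the same parent, so an action on `alpha` only needs the chain ending in `alpha`; `nums` is never computed.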
SchemaRDD & SQL 
SchemaRDD: RDD of Row + Column Metadata 
Queries with SQL 
Support for Reflection, JSON, Parquet, …
SchemaRDD & SQL 
topWords: SchemaRDD of Row 
// Assumes a Spark 1.x SQLContext whose members (sql, implicit conversions) are imported 
val sqlContext = new org.apache.spark.sql.SQLContext(spark) 
import sqlContext._ 
case class Word(text: String, n: Int) 
val wordsFreq = freq.map { 
  case (text, count) => Word(text, count) 
} // RDD[Word] 
wordsFreq.registerTempTable("wordsFreq") 
val topWords = sql("""select text, n 
  from wordsFreq 
  order by n desc 
  limit 20""") // RDD[Row] 
topWords.collect().foreach(println)
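As a reading aid for the query, here is what "order by n desc limit 20" computes, written against a local Scala collection instead of a SchemaRDD. The sample rows and the `topWords` helper are hypothetical; this is not how Spark SQL executes the query.

```scala
object TopWordsSketch {
  case class Word(text: String, n: Int)

  // Hypothetical sample rows standing in for the wordsFreq table.
  val wordsFreq: Seq[Word] = Seq(Word("world", 1), Word("hello", 2), Word("a", 1))

  // select text, n from wordsFreq order by n desc limit 20
  def topWords(rows: Seq[Word], limit: Int = 20): Seq[(String, Int)] =
    rows.sortBy(w => -w.n).take(limit).map(w => (w.text, w.n))

  val result: Seq[(String, Int)] = topWords(wordsFreq)
}
```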
Spark Streaming 
Receiver -> DStream: RDD RDD RDD RDD RDD RDD 
Data Collected, Buffered and Replicated by a Receiver (one per DStream), then Pushed to a stream as small RDDs 
Configurable Batch Intervals, e.g: 1 second, 5 seconds, 5 minutes 
Receiver sources, e.g: Kafka, Kinesis, Flume, Sockets, Akka, etc
DStream Transformations 
Receiver -> DStream: RDD RDD RDD RDD RDD RDD 
transform 
DStream: RDD RDD RDD RDD RDD RDD 
// Example 
val entries = stream.transform { rdd => rdd.map(Log.parse) } 
// Alternative 
val entries = stream.map(Log.parse)
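A DStream can be thought of as a sequence of small batches, one per batch interval, with every transformation applied batch by batch. The sketch below models that with nested Scala `Seq`s; `Event`, `toBatches` and `mapStream` are hypothetical names, and unlike Spark this model emits no batch for an interval without data.

```scala
object MicroBatchSketch {
  // Hypothetical event: arrival time in milliseconds plus a raw log line.
  final case class Event(timeMs: Long, line: String)

  // Groups events into consecutive batches of batchMs milliseconds (one "RDD" each).
  def toBatches(events: Seq[Event], batchMs: Long): Seq[Seq[String]] =
    events.groupBy(_.timeMs / batchMs).toSeq.sortBy(_._1).map(_._2.map(_.line))

  // Like stream.transform or stream.map: the same function is applied to every batch.
  def mapStream[A, B](stream: Seq[Seq[A]])(f: A => B): Seq[Seq[B]] =
    stream.map(batch => batch.map(f))

  val events  = Seq(Event(100, "GET /a"), Event(900, "GET /b"), Event(1500, "POST /c"))
  val stream  = toBatches(events, batchMs = 1000)
  // Keep only the HTTP method of each line, batch by batch.
  val methods = mapStream(stream)(_.takeWhile(_ != ' '))
}
```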
Parallelism with Multiple Receivers 
Receiver 1 -> DStream 1: RDD RDD RDD RDD RDD RDD 
Receiver 2 -> DStream 2: RDD RDD RDD RDD RDD RDD 
union of (stream1, stream2, …) 
Union can be used to manage multiple DStreams as 
a single logical stream
Sliding Windows 
Receiver -> DStream: RDD RDD RDD RDD RDD RDD 
Windowed DStream: … … … W3 W2 W1 
Window Length: 3, Sliding Interval: 1
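With a window length of 3 and a sliding interval of 1, each window covers the three most recent batches and a new window starts at every batch. Scala's `sliding` on a local collection gives the same grouping; in this sketch the inner `Seq`s are hypothetical batches, and only full windows are produced.

```scala
object SlidingWindowSketch {
  // Each inner Seq stands for the RDD produced in one batch interval.
  val batches: Seq[Seq[Int]] = Seq(Seq(1), Seq(2, 3), Seq(4), Seq(5), Seq(6))

  // Window length 3, sliding interval 1 (both counted in batches).
  val windows: Seq[Seq[Int]] = batches.sliding(3, 1).map(_.flatten).toSeq

  // A per-window aggregate, here the sum of all values in the window.
  val sums: Seq[Int] = windows.map(_.sum)
}
```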
Deployment with Hadoop 
HDFS: Name Node plus Data Nodes 1 to 4; /large/file is split into blocks A, B, C, D, replicated across the data nodes with RF 3 
Spark: Spark Master (allocates resources: cores and memory) plus a Spark Worker on each Data Node (DN + Spark) 
Client: Submit App (mode=cluster) 
Application: Driver and Executors running on the Spark Workers
Fernando Rodriguez Olivera 
twitter: @frodriguez

More Related Content

What's hot

Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Home
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
metsarin
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Simplilearn
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Sachin Aggarwal
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
Databricks
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
Jaemun Jung
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Harri Kauhanen
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
Russell Jurney
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
Whizlabs
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Edureka!
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
Spark Summit
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
Sohil Jain
 

What's hot (20)

Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
 
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
Apache Spark Architecture | Apache Spark Architecture Explained | Apache Spar...
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst OptimizerDeep Dive : Spark Data Frames, SQL and Catalyst Optimizer
Deep Dive : Spark Data Frames, SQL and Catalyst Optimizer
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
PySpark dataframe
PySpark dataframePySpark dataframe
PySpark dataframe
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Introduction to PySpark
Introduction to PySparkIntroduction to PySpark
Introduction to PySpark
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
NoSql
NoSqlNoSql
NoSql
 
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
Spark SQL Tutorial | Spark Tutorial for Beginners | Apache Spark Training | E...
 
Getting The Best Performance With PySpark
Getting The Best Performance With PySparkGetting The Best Performance With PySpark
Getting The Best Performance With PySpark
 
Spark introduction and architecture
Spark introduction and architectureSpark introduction and architecture
Spark introduction and architecture
 

Viewers also liked

Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
Fernando Rodriguez
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Chris Fregly
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
Data Con LA
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
Databricks
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
datamantra
 
Hadoop Lecture for Harvard's CS 264 -- October 19, 2009
Hadoop Lecture for Harvard's CS 264 -- October 19, 2009Hadoop Lecture for Harvard's CS 264 -- October 19, 2009
Hadoop Lecture for Harvard's CS 264 -- October 19, 2009Cloudera, Inc.
 
AWS Kinesis Streams
AWS Kinesis StreamsAWS Kinesis Streams
AWS Kinesis Streams
Fernando Rodriguez
 
Distributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
Distributed Computing Seminar - Lecture 2: MapReduce Theory and ImplementationDistributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
Distributed Computing Seminar - Lecture 2: MapReduce Theory and Implementationtugrulh
 
Preso spark leadership
Preso spark leadershipPreso spark leadership
Preso spark leadershipsjoerdluteyn
 
Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
Brian O'Neill
 
Spark, the new age of data scientist
Spark, the new age of data scientistSpark, the new age of data scientist
Spark, the new age of data scientist
Massimiliano Martella
 
Performance
PerformancePerformance
Performance
Christophe Marchal
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Data Con LA
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginnings
Daniel Leon
 
Spark Streaming Data Pipelines
Spark Streaming Data PipelinesSpark Streaming Data Pipelines
Spark Streaming Data Pipelines
MapR Technologies
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Chris Fregly
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinese
colorant
 
Spark the next top compute model
Spark   the next top compute modelSpark   the next top compute model
Spark the next top compute model
Dean Wampler
 

Viewers also liked (20)

Apache Spark with Scala
Apache Spark with ScalaApache Spark with Scala
Apache Spark with Scala
 
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
Global Big Data Conference Sept 2014 AWS Kinesis Spark Streaming Approximatio...
 
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of DatabricksDataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
DataFrame: Spark's new abstraction for data science by Reynold Xin of Databricks
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Hadoop Lecture for Harvard's CS 264 -- October 19, 2009
Hadoop Lecture for Harvard's CS 264 -- October 19, 2009Hadoop Lecture for Harvard's CS 264 -- October 19, 2009
Hadoop Lecture for Harvard's CS 264 -- October 19, 2009
 
AWS Kinesis Streams
AWS Kinesis StreamsAWS Kinesis Streams
AWS Kinesis Streams
 
Distributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
Distributed Computing Seminar - Lecture 2: MapReduce Theory and ImplementationDistributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
Distributed Computing Seminar - Lecture 2: MapReduce Theory and Implementation
 
Preso spark leadership
Preso spark leadershipPreso spark leadership
Preso spark leadership
 
Spark - Philly JUG
Spark  - Philly JUGSpark  - Philly JUG
Spark - Philly JUG
 
Spark, the new age of data scientist
Spark, the new age of data scientistSpark, the new age of data scientist
Spark, the new age of data scientist
 
Performance
PerformancePerformance
Performance
 
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of DatabricksBig Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
Big Data Day LA 2015 - Spark after Dark by Chris Fregly of Databricks
 
Spark - The beginnings
Spark -  The beginningsSpark -  The beginnings
Spark - The beginnings
 
Spark Streaming Data Pipelines
Spark Streaming Data PipelinesSpark Streaming Data Pipelines
Spark Streaming Data Pipelines
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Spark introduction - In Chinese
Spark introduction - In ChineseSpark introduction - In Chinese
Spark introduction - In Chinese
 
Spark the next top compute model
Spark   the next top compute modelSpark   the next top compute model
Spark the next top compute model
 

Similar to Apache Spark & Streaming

Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Databricks
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
CloudxLab
 
Apache Spark
Apache SparkApache Spark
Apache Spark
Uwe Printz
 
Distributed computing with spark
Distributed computing with sparkDistributed computing with spark
Distributed computing with spark
Javier Santos Paniego
 
Operations on rdd
Operations on rddOperations on rdd
Operations on rdd
sparrowAnalytics.com
 
Introduction to spark
Introduction to sparkIntroduction to spark
Introduction to spark
Duyhai Doan
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
HARIKRISHNANU13
 
Scala meetup - Intro to spark
Scala meetup - Intro to sparkScala meetup - Intro to spark
Scala meetup - Intro to spark
Javier Arrieta
 
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Developmen...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Developmen...Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Developmen...
Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Iterative Spark Developmen...
Data Con LA
 
Beneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek LaskowskiBeneath RDD in Apache Spark by Jacek Laskowski
Beneath RDD in Apache Spark by Jacek Laskowski
Spark Summit
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Databricks
 
04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence04 spark-pair rdd-rdd-persistence
04 spark-pair rdd-rdd-persistence
Venkat Datla
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON
Padma shree. T
 
Introduction to Spark with Scala
Introduction to Spark with ScalaIntroduction to Spark with Scala
Introduction to Spark with Scala
Himanshu Gupta
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for MobileBugSense
 
Spark workshop
Spark workshopSpark workshop
Spark workshop
Wojciech Pituła
 
Apache Spark: What? Why? When?
Apache Spark: What? Why? When?Apache Spark: What? Why? When?
Apache Spark: What? Why? When?
Massimo Schenone
 

Similar to Apache Spark & Streaming (20)

Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Distributed computing with spark
Distributed computing with sparkDistributed computing with spark
Distributed computing with spark
 
Operations on rdd
Operations on rddOperations on rdd
Operations on rdd
 
Introduction to spark
Introduction to sparkIntroduction to spark



Apache Spark & Streaming

  • 1. Apache Spark Buenos Aires High Scalability Buenos Aires, Argentina, Dec 2014 Fernando Rodriguez Olivera @frodriguez
  • 2. Fernando Rodriguez Olivera Professor at Universidad Austral (Distributed Systems, Compiler Design, Operating Systems, …) Creator of mvnrepository.com Organizer at Buenos Aires High Scalability Group, Professor at nosqlessentials.com Twitter: @frodriguez
  • 3. Apache Spark Apache Spark is a Fast and General Engine for Large-Scale data processing. In-Memory computing primitives. Support for Batch, Interactive, Iterative and Stream processing with a Unified API
  • 4. Apache Spark Unified API for multiple kinds of processing: Batch (high throughput), Interactive (low latency), Stream (continuous processing), Iterative (results used immediately)
  • 5. Daytona Gray Sort 100TB Benchmark
       System              | Data Size | Time   | Nodes | Cores
       Hadoop MR (2013)    | 102.5 TB  | 72 min | 2,100 | 50,400 physical
       Apache Spark (2014) | 100 TB    | 23 min | 206   | 6,592 virtualized
       source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  • 6. Daytona Gray Sort 100TB Benchmark (table as above): 3X faster using 10X fewer machines
       source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  • 7. Hadoop vs Spark for Iterative Proc Logistic regression in Hadoop and Spark source: https://spark.apache.org/
  • 8. Hadoop MR Limits (diagram: a chain of jobs, each reading from and writing to Hadoop HDFS)
       MapReduce was designed for batch processing:
       - Communication between jobs goes through the FS
       - Fault tolerance (between jobs) relies on persistence to the FS
       - Memory is not managed (relies on OS caches)
       Compensated with: Storm, Samza, Giraph, Impala, Presto, etc.
  • 9. Apache Spark Apache Spark (Core) with Spark SQL, Spark Streaming, MLlib and GraphX on top. Powered by Scala and Akka. APIs for Java, Scala, Python
  • 10. Resilient Distributed Datasets (RDD) RDD of Strings ("Hello World", "A New Line", "hello", "The End", ...): Immutable Collection of Objects
  • 11. Resilient Distributed Datasets (RDD) RDD of Strings ("Hello World", "A New Line", "hello", "The End", ...): Immutable Collection of Objects; Partitioned and Distributed
  • 12. Resilient Distributed Datasets (RDD) RDD of Strings ("Hello World", "A New Line", "hello", "The End", ...): Immutable Collection of Objects; Partitioned and Distributed; Stored in Memory
  • 13. Resilient Distributed Datasets (RDD) RDD of Strings ("Hello World", "A New Line", "hello", "The End", ...): Immutable Collection of Objects; Partitioned and Distributed; Stored in Memory; Partitions Recomputed on Failure
  • 14. RDD Transformations and Actions RDD of Strings ("Hello World", "A New Line", "hello", "The End", ...)
  • 15. RDD Transformations and Actions RDD of Strings: a Compute Function (transformation) is applied, e.g. a function to count chars
  • 16. RDD Transformations and Actions RDD of Strings: the Compute Function (transformation) yields an RDD of Ints (11, 10, 5, 7, ...)
  • 17. RDD Transformations and Actions The RDD of Ints depends on the RDD of Strings through the Compute Function (transformation)
  • 18. RDD Transformations and Actions An Action on the RDD of Ints produces a result (Int N)
  • 19. RDD Transformations and Actions RDD Implementation: Partitions; Compute Function; Dependencies; Preferred Compute Location (for each partition); Partitioner
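The transformation and action on these slides can be imitated with ordinary Scala collections. This is only a local sketch: a `Seq` stands in for the RDD (no cluster, no partitions), the object name `CharCountSketch` is made up for illustration, and summing the lengths is an assumed choice of action, since the slides only say that the action returns an Int.

```scala
object CharCountSketch {
  // Stand-in for the "RDD of Strings" on the slide (a local Seq, not a real RDD).
  val lines: Seq[String] = Seq("Hello World", "A New Line", "hello", "The End")

  // Transformation: derive a new collection of Ints from the Strings.
  val lengths: Seq[Int] = lines.map(_.length)

  // Action (assumed here to be a sum): collapse the Ints into a single value.
  val total: Int = lengths.reduce(_ + _)

  def main(args: Array[String]): Unit = {
    println(lengths.mkString(", "))
    println(total)
  }
}
```

The lengths match the values drawn on the slide (11, 10, 5, 7).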
  • 20. Spark API
       Scala:
         val spark = new SparkContext()
         val lines = spark.textFile("hdfs://docs/") // RDD[String]
         val nonEmpty = lines.filter(l => l.nonEmpty) // RDD[String]
         val count = nonEmpty.count
       Java 8:
         JavaSparkContext spark = new JavaSparkContext();
         JavaRDD<String> lines = spark.textFile("hdfs://docs/");
         JavaRDD<String> nonEmpty = lines.filter(l -> l.length() > 0);
         long count = nonEmpty.count();
       Python:
         spark = SparkContext()
         lines = spark.textFile("hdfs://docs/")
         nonEmpty = lines.filter(lambda line: len(line) > 0)
         count = nonEmpty.count()
  • 21. RDD Operations
       Transformations: map(func), flatMap(func), filter(func), groupByKey(), reduceByKey(func), mapValues(func), …
       Actions: take(N), count(), collect(), reduce(func), takeOrdered(N), top(N), …
  • 22. Text Processing Example Top Words by Frequency (Step by step)
  • 23. Create RDD from External Data Apache Spark reads through Hadoop FileSystem, I/O Formats, Codecs from HDFS, S3, HBase, MongoDB, Cassandra, ElasticSearch, …
       Spark can read/write from any data source supported by Hadoop. I/O via Hadoop is optional (e.g. the Cassandra connector bypasses Hadoop)
       // Step 1 - Create RDD from Hadoop Text File
       val docs = spark.textFile("/docs/")
  • 24. Function map RDD[String] ("Hello World", "A New Line", "hello", ..., "The end") .map(line => line.toLowerCase) yields RDD[String] ("hello world", "a new line", "hello", ..., "the end"); .map(line => line.toLowerCase) = .map(_.toLowerCase)
       // Step 2 - Convert lines to lower case
       val lower = docs.map(line => line.toLowerCase)
  • 25. Functions map and flatMap RDD[String]: "hello world", "a new line", "hello", ..., "the end"
  • 26. Functions map and flatMap .map(_.split("\\s+")) yields RDD[Array[String]]: [hello, world], [a, new, line], [hello], ..., [the, end]
  • 27. Functions map and flatMap .flatten * then yields RDD[String]: hello, world, a, new, line, ...
  • 28. Functions map and flatMap .flatMap(line => line.split("\\s+")) yields the same RDD[String] as .map(_.split("\\s+")) followed by .flatten *
  • 29. Functions map and flatMap
       // Step 3 - Split lines into words
       val words = lower.flatMap(line => line.split("\\s+"))
       * Note: flatten() is not available in Spark, only flatMap
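The difference between `map` and `flatMap` on these slides can be tried out on local Scala collections, which (unlike RDDs, per the note above) do offer `flatten`. This is a sketch under that assumption; `FlatMapSketch` and its sample lines are illustrative, and `.toSeq` is added only to keep the element types simple.

```scala
object FlatMapSketch {
  val lower: Seq[String] = Seq("hello world", "a new line", "hello", "the end")

  // map keeps one output element per input line: a Seq of word sequences.
  val nested: Seq[Seq[String]] = lower.map(_.split("\\s+").toSeq)

  // Local collections do have flatten; it concatenates the inner sequences.
  val flattened: Seq[String] = nested.flatten

  // flatMap performs the split and the concatenation in one step.
  val words: Seq[String] = lower.flatMap(_.split("\\s+").toSeq)
}
```

Here `flattened` and `words` hold the same eight words, while `nested` still has one entry per input line.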
  • 30. Key-Value Pairs RDD[String] (hello, world, a, new, line, hello, ...) .map(word => Tuple2(word, 1)) yields RDD[Tuple2[String, Int]] = RDD[(String, Int)], a Pair RDD: (hello, 1), (world, 1), (a, 1), (new, 1), (line, 1), (hello, 1), ...; .map(word => Tuple2(word, 1)) = .map(word => (word, 1))
       // Step 4 - Map each word to a (word, 1) pair
       val counts = words.map(word => (word, 1))
  • 31. Shuffling RDD[(String, Int)]: (hello, 1), (world, 1), (a, 1), (new, 1), (line, 1), (hello, 1)
  • 32. Shuffling .groupByKey yields RDD[(String, Iterable[Int])]: (world, [1]), (a, [1]), (new, [1]), (line, [1]), (hello, [1, 1])
  • 33. Shuffling .mapValues(_.reduce((a, b) => a + b)) then yields RDD[(String, Int)]: (world, 1), (a, 1), (new, 1), (line, 1), (hello, 2)
  • 34. Shuffling .reduceByKey((a, b) => a + b) yields the same RDD[(String, Int)] as .groupByKey followed by .mapValues(_.reduce((a, b) => a + b))
  • 35. Shuffling
       // Step 5 - Count all words
       val freq = counts.reduceByKey(_ + _)
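The two routes to the word counts can be compared on a local collection. This is a minimal sketch, assuming plain Scala `Seq` and `Map` in place of pair RDDs; `viaGrouping` and `viaReduction` are hypothetical names, and nothing here shuffles data between machines.

```scala
object WordCountSketch {
  val counts: Seq[(String, Int)] =
    Seq("hello", "world", "a", "new", "line", "hello").map(word => (word, 1))

  // groupByKey + mapValues style: collect every value per key, then reduce each group.
  def viaGrouping(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, group) => word -> group.map(_._2).reduce(_ + _) }

  // reduceByKey style: fold each value into a running total per key,
  // so the per-key groups are never materialized.
  def viaReduction(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (word, n)) =>
      acc.updated(word, acc.get(word).map(_ + n).getOrElse(n))
    }
}
```

Both functions return the same map. The second mirrors why the slides settle on `reduceByKey`: values are combined as they arrive instead of being gathered into per-key groups first.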
  • 36. Top N (Prepare data) RDD[(String, Int)] ((world, 1), (a, 1), (new, 1), (line, 1), (hello, 2)) .map(_.swap) yields RDD[(Int, String)]: (1, world), (1, a), (1, new), (1, line), (2, hello)
       // Step 6 - Swap tuples (partial code)
       freq.map(_.swap)
  • 37. Top N (First Attempt) RDD[(Int, String)]: (1, new), (1, world), (1, a), (1, line), (2, hello)
  • 38. Top N (First Attempt) .sortByKey yields a sorted RDD[(Int, String)]: (2, hello), (1, world), (1, a), (1, new), (1, line) (sortByKey(false) for descending)
  • 39. Top N (First Attempt) .sortByKey followed by .take(N) yields Array[(Int, String)]: (2, hello), (1, world) (sortByKey(false) for descending)
  • 40. Top N RDD[(Int, String)] .top(N) yields Array[(Int, String)]: each partition computes a local top N *, e.g. (1, world), (1, a) and (2, hello), (1, line); a reduction then merges them into (2, hello), (1, line)
       * local top N implemented by bounded priority queues
       // Step 6 - Swap tuples (complete code)
       val top = freq.map(_.swap).top(N)
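The "local top N, then reduction" idea can be sketched with a bounded priority queue from the Scala standard library. This is an illustrative sketch, not Spark's actual implementation: `TopNSketch`, `localTop` and the nested `Seq` standing in for partitions are all assumptions made for the example.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep only the n largest items seen so far, using a heap bounded to n elements.
  def localTop[T](items: Iterable[T], n: Int)(implicit ord: Ordering[T]): List[T] = {
    // Reversed ordering puts the smallest retained item at the head of the queue.
    val queue = mutable.PriorityQueue.empty[T](ord.reverse)
    items.foreach { item =>
      queue.enqueue(item)
      if (queue.size > n) queue.dequeue() // evict the current smallest
    }
    queue.toList.sorted(ord.reverse) // largest first
  }

  // Each partition computes its own top n; the partial results are then reduced.
  def top[T](partitions: Seq[Seq[T]], n: Int)(implicit ord: Ordering[T]): List[T] =
    localTop(partitions.flatMap(p => localTop(p, n)), n)
}
```

Because each queue never holds more than n + 1 items, no partition has to be fully sorted, which is the advantage over the `sortByKey` followed by `take(N)` attempt.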
  • 41. Top Words by Frequency (Full Code)
       val spark = new SparkContext()
       // RDD creation from external data source
       val docs = spark.textFile("hdfs://docs/")
       // Split lines into words
       val lower = docs.map(line => line.toLowerCase)
       val words = lower.flatMap(line => line.split("\\s+"))
       val counts = words.map(word => (word, 1))
       // Count all words (automatic combination)
       val freq = counts.reduceByKey(_ + _)
       // Swap tuples and get top results
       val top = freq.map(_.swap).top(N)
       top.foreach(println)
  • 42. RDD Persistence (in-memory) (lazy persistence & caching)
       .cache() (memory only)
       .persist() (memory only)
       .persist(storageLevel) with StorageLevel: MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK, MEMORY_AND_DISK_SER, DISK_ONLY, …
  • 43. RDD Lineage RDD Transformations build the Lineage (built on the driver by the transformations); an Action runs a job on the cluster
       words = sc.textFile("hdfs://large/file/")   // HadoopRDD
                 .map(_.toLowerCase)               // MappedRDD
                 .flatMap(_.split(" "))            // FlatMappedRDD
       nums = words.filter(_.matches("[0-9]+"))    // FilteredRDD
       alpha = words.filter(_.matches("[a-z]+"))   // FilteredRDD
       alpha.count()                               // Action (run job on the cluster)
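The split between describing a pipeline and running it can be imitated locally with a Scala `Iterator`, which is also lazy. This is only a loose analogy under stated assumptions: an iterator is single-pass and local, whereas an RDD can be recomputed from its lineage. `LazyLineageSketch` and the `evaluated` counter are hypothetical names used to make the evaluation visible.

```scala
object LazyLineageSketch {
  var evaluated = 0 // counts how many elements the map function has touched

  val words: Seq[String] = Seq("Spark", "2014", "RDD", "42")

  // Describing the pipeline runs nothing yet (loosely like building lineage).
  val alpha: Iterator[String] = words.iterator
    .map { w => evaluated += 1; w.toLowerCase }
    .filter(_.matches("[a-z]+"))

  val before: Int = evaluated // no element has been processed so far
  val count: Int = alpha.size // the "action": forces the whole pipeline to run
  val after: Int = evaluated  // every input element has now been mapped once
}
```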
  • 44. SchemaRDD & SQL SchemaRDD: RDD of Row + Column Metadata. Queries with SQL. Support for Reflection, JSON, Parquet, …
  • 45. SchemaRDD & SQL topWords is an RDD of Row
       case class Word(text: String, n: Int)
       val wordsFreq = freq.map { case (text, count) => Word(text, count) } // RDD[Word]
       wordsFreq.registerTempTable("wordsFreq")
       val topWords = sql("select text, n from wordsFreq order by n desc limit 20") // RDD[Row]
       topWords.collect().foreach(println)
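For readers less familiar with SQL, the query on this slide computes roughly what the following collection pipeline does. It is a local sketch with made-up sample rows. It does not use Spark SQL, and `TopWordsSketch` and `topWords` are illustrative names.

```scala
object TopWordsSketch {
  case class Word(text: String, n: Int)

  // Hypothetical word frequencies standing in for the wordsFreq table.
  val wordsFreq: Seq[Word] = Seq(Word("world", 1), Word("hello", 2), Word("a", 1))

  // select text, n from wordsFreq order by n desc limit 20
  def topWords(rows: Seq[Word], limit: Int = 20): Seq[(String, Int)] =
    rows.sortBy(w => -w.n).take(limit).map(w => (w.text, w.n))
}
```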
  • 46. Spark Streaming A DStream is a sequence of RDDs. Data Collected, Buffered and Replicated by a Receiver (one per DStream), then Pushed to a stream as small RDDs. Configurable Batch Intervals, e.g. 1 second, 5 seconds, 5 minutes. Receivers: e.g. Kafka, Kinesis, Flume, Sockets, Akka, etc.
  • 47. DStream Transformations transform maps each RDD of a DStream (fed by a Receiver) to an RDD of a new DStream
       // Example
       val entries = stream.transform { rdd => rdd.map(Log.parse) }
       // Alternative
       val entries = stream.map(Log.parse)
  • 48. Parallelism with Multiple Receivers DStream 1 (Receiver 1) and DStream 2 (Receiver 2) each produce their own RDDs; union of (stream1, stream2, …). Union can be used to manage multiple DStreams as a single logical stream
  • 49. Sliding Windows Windows W1, W2, W3, … are computed over the RDDs of a DStream (fed by a Receiver) and form a new DStream. Window Length: 3, Sliding Interval: 1
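A window of length 3 with sliding interval 1 can be pictured with the standard `sliding` method on local collections. This is a sketch under simplifying assumptions: each inner `Seq` stands in for the RDD of one batch, and both window parameters are counted in batches rather than expressed as time durations. `SlidingWindowSketch` and `windows` are hypothetical names.

```scala
object SlidingWindowSketch {
  // Each inner Seq stands in for the RDD produced by one batch interval.
  val batches: Seq[Seq[Int]] = Seq(Seq(1), Seq(2, 3), Seq(4), Seq(5, 6), Seq(7))

  // Group `length` consecutive batches, advancing by `slide` batches each time,
  // and merge every group into a single window.
  def windows[T](stream: Seq[Seq[T]], length: Int, slide: Int): List[Seq[T]] =
    stream.sliding(length, slide).map(_.flatten).toList
}
```

With five batches, `windows(batches, 3, 1)` produces three overlapping windows, each covering three consecutive batches.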
  • 50. Deployment with Hadoop (diagram: the Client submits the App (mode=cluster); the Spark Master allocates resources (cores and memory); Data Nodes 1 to 4 each also run a Spark Worker hosting the Driver or Executors for the Application; blocks A, B, C, D of /large/file are stored in HDFS across the Data Nodes with RF 3, tracked by the Name Node)
  • 51. Fernando Rodriguez Olivera twitter: @frodriguez