© 2014 MapR Technologies 1
An Overview of Apache Spark
© 2014 MapR Technologies 2
Agenda
• MapReduce Refresher
• What is Spark?
• The Difference with Spark
• Examples and Resources
© 2014 MapR Technologies 3
MapReduce Refresher
© 2014 MapR Technologies 4
MapReduce: A Programming Model
• MapReduce: Simplified Data Processing on Large Clusters (published 2004)
• Parallel and Distributed Algorithm:
• Data Locality
• Fault Tolerance
• Linear Scalability
© 2014 MapR Technologies 5
The Hadoop Strategy
http://developer.yahoo.com/hadoop/tutorial/module4.html
Distribute data
(share nothing)
Distribute computation
(parallelization without synchronization)
Tolerate failures
(no single point of failure)
[Diagram: mapping processes on Nodes 1-3 feed reducing processes on Nodes 1-3]
© 2014 MapR Technologies 6
Distribute Data: HDFS
HDFS splits large data files into chunks (64 MB); chunks are replicated across the cluster.
[Diagram: a user process gets location metadata from the NameNode, then accesses the physical data over the network on the DataNodes, which store & retrieve the data]
© 2014 MapR Technologies 7
Distribute Computation
[Diagram: a MapReduce program runs on the Hadoop cluster against its data sources and produces a result]
© 2014 MapR Technologies 8
MapReduce Basics
• Foundational model is based on a distributed file system
– Scalability and fault-tolerance
• Map
– Loading of the data and defining a set of keys
• Reduce
– Collects the organized key-based data to process and output
– Many use cases do not utilize a reduce task
• Performance can be tweaked based on known details of your
source files and cluster shape (size, total number)
© 2014 MapR Technologies 9
MapReduce Execution and Data Flow
On each node, files are loaded from HDFS stores: an InputFormat defines Splits, and RecordReaders (RR) turn each split into input (k, v) pairs. The map tasks emit intermediate (k, v) pairs, which a Partitioner assigns to reducers; the "shuffle" process exchanges intermediate (k, v) pairs between all nodes. After a sort, the reduce tasks produce final (k, v) pairs, and an OutputFormat writes each node's output file back to the local HDFS store.
© 2014 MapR Technologies 10
MapReduce Example: Word Count
Input:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
Map output: the, 1 / time, 1 / has, 1 / come, 1 / … / and, 1 / … / and, 1 / …
Shuffle and Sort: and, [1, 1, 1] / come, [1,1,1] / has, [1,1] / the, [1,1,1] / time, [1,1,1,1] / …
Reduce output: and, 12 / come, 6 / has, 8 / the, 4 / time, 14 / …
© 2014 MapR Technologies 11
Tolerate Failures
Failures on the Hadoop cluster are expected & managed gracefully:
DataNode fails -> the NameNode will locate a replica
MapReduce task fails -> the JobTracker will schedule another one
© 2014 MapR Technologies 12
MapReduce Processing Model
• Define mappers (see the sketch after this list)
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Use a higher level language or DSL that does this for you
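To make "define mappers" and "define reducers" concrete, here is a minimal Hadoop Streaming sketch (not from the original deck; the file names and the word-count task are illustrative). Each script reads stdin and writes tab-separated key/value pairs; the framework sorts and shuffles between the two.

# mapper.py: emit one (word, 1) pair per word seen on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# reducer.py: input arrives sorted by key, so accumulate per word
import sys
current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(value)
if current is not None:
    print("%s\t%d" % (current, count))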
© 2014 MapR Technologies 13
MapReduce Design Patterns
• Summarization
– Inverted index, counting
• Filtering
– Top ten, distinct
• Aggregation
• Data Organization
– Partitioning
• Join
– Join data sets
• Metapattern
– Job chaining
© 2014 MapR Technologies 14
Inverted Index Example
alice.txt: "The time has come," the Walrus said
macbeth.txt: tis time to do it
Map output: time, alice.txt / has, alice.txt / come, alice.txt / .. / tis, macbeth.txt / time, macbeth.txt / do, macbeth.txt / …
Final output: come, (alice.txt) / do, (macbeth.txt) / has, (alice.txt) / time, (alice.txt, macbeth.txt) / . . .
© 2014 MapR Technologies 15
MapReduce Example: Inverted Index
• Input: (filename, text) records
• Output: list of files containing each word
• Map:
def map(filename, text):
    for word in text.split():
        output(word, filename)
• Combine: uniquify filenames for each word
• Reduce:
def reduce(word, filenames):
    output(word, sorted(filenames))
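The phases are easy to simulate locally; this toy, single-process Python sketch (an illustration added here, not part of the original deck) makes the data flow concrete:

from collections import defaultdict

def inverted_index(docs):   # docs: dict of filename -> text
    # Map: emit (word, filename) pairs
    pairs = [(w, name) for name, text in docs.items() for w in text.split()]
    # Combine/shuffle: group filenames per word, uniquified
    grouped = defaultdict(set)
    for word, name in pairs:
        grouped[word].add(name)
    # Reduce: sorted file list per word
    return {word: sorted(names) for word, names in grouped.items()}

print(inverted_index({"alice.txt": "The time has come",
                      "macbeth.txt": "tis time to do it"}))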
© 2014 MapR Technologies 18
MapReduce: The Good
• Built in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple API (or is it? see the next slide)
© 2014 MapR Technologies 19
MapReduce: The Bad
• Optimized for disk IO
– Doesn’t leverage memory well
– Iterative algorithms go through disk IO path again and again
• Primitive API
– Simple key/value in/out abstraction
– Basic things like joins require extensive code
• Results are often many files that need to be combined appropriately
© 2014 MapR Technologies 20
Free Hadoop MapReduce On Demand Training
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
© 2014 MapR Technologies 21
What is Hive?
• Data Warehouse on top of Hadoop
– Gives ability to query without programming
– Used for analytical querying of data
• SQL like execution for Hadoop
• SQL evaluates to MapReduce code
– Submits jobs to your cluster
© 2014 MapR Technologies 22
Using HBase as a MapReduce/Hive Source
EXAMPLE: Data Warehouse for Analytical Processing queries
Hive runs a MapReduce application (Hive Select … Join) over the HBase database and Files (HDFS/MapR-FS), producing a Query Result File
© 2014 MapR Technologies 23
Using HBase as a MapReduce or Hive Sink
EXAMPLE: bulk load data into a table
Hive runs a MapReduce application (Hive Insert … Select) that bulk loads Files (HDFS/MapR-FS) into the HBase database
© 2014 MapR Technologies 24
Using HBase as a Source & Sink
EXAMPLE: calculate and store summaries (a pre-computed, materialized view)
Hive runs a MapReduce application (Hive Select … Join) that reads from and writes summaries back to the HBase database
© 2014 MapR Technologies 25
Hive architecture:
– Interfaces: Command Line Interface, Web Interface, and JDBC/ODBC via the Thrift Server
– Driver (Compiler, Optimizer, Executor)
– Metastore: the schema metadata is stored in the Hive metastore, e.g., a Hive table definition mapped to the HBase trades_tall table
– Hadoop (MapReduce + HDFS): JobTracker and NameNode, with a DataNode + TaskTracker on each worker
© 2014 MapR Technologies 26
Hive HBase
The Hive metastore either points to existing HBase tables (external) or to Hive-managed tables.
© 2014 MapR Technologies 27
Hive HBase – External Table
CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:price#b,cf1:vol#b")
TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall");
The Hive table definition (key string, price bigint, vol bigint) points to the external HBase table /usr/user1/trades_tall:
key | cf1:price | cf1:vol
AMZN_986186008 | 12.34 | 1000
AMZN_986186007 | 12.00 | 50
© 2014 MapR Technologies 28
Hive HBase – Hive Query
SQL evaluates to MapReduce code
SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
Queries run against the HBase tables through a Parser -> Planner -> Execution pipeline.
© 2014 MapR Technologies 29
Hive HBase – External Table
SQL evaluates to MapReduce code:
SELECT AVG(price) FROM trades WHERE key LIKE "AMZN%";
– Selection: WHERE key LIKE …
– Projection: SELECT price
– Aggregation: AVG(price)
key | cf1:price | cf1:vol
AMZN_986186008 | 12.34 | 1000
AMZN_986186007 | 12.00 | 50
© 2014 MapR Technologies 30
Hive Query Plan
• EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
Filter Operator
predicate: (key like 'GOOG%') (type: boolean)
Select Operator
Group By Operator
Reduce Operator Tree:
Group By Operator
Select Operator
File Output Operator
© 2014 MapR Technologies 31
Hive Query Plan – (2)
hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
The Trades table is scanned and filtered (key like 'GOOG%'), price is selected, and the rows are grouped with the aggregation avg(price), producing the output _col0. The scan/filter/select steps run as map() tasks; the group-by and aggregation run as reduce() tasks.
© 2014 MapR Technologies 32
Hive Map Reduce
Region Region Region
scan key, row
reduce()
shuffle
reduce()
reduce()Map() Map() Map()
Query Result File
HBase
Hive Select Join
Hive Query
result result result
© 2014 MapR Technologies 33
Some Hive Design Patterns
• Summarization
– Select min(delay), max(delay), count(*) from flights group by
carrier;
• Filtering
– SELECT * FROM trades WHERE key LIKE "GOOG%";
– SELECT price FROM trades ORDER BY price DESC LIMIT 10;
• Join
SELECT tableA.field1, tableB.field2 FROM tableA
JOIN tableB
ON tableA.field1 = tableB.field2;
© 2014 MapR Technologies 34
What is a Directed Acyclic Graph (DAG)?
• Graph
– vertices (points) and edges (lines)
• Directed
– Only in a single direction
• Acyclic
– No looping
• This supports fault-tolerance
[Diagram: vertices A and B connected by directed edges]
© 2014 MapR Technologies 35
Hive Query Plan Map Reduce Execution
The operator tree (table scans of t1, reduce sinks RS1-RS4, aggregations AGG1 and AGG2, JOIN1, and the file sink FS1) initially executes as three MapReduce jobs (Job 1-3); the optimizer collapses the same tree into a single job.
© 2014 MapR Technologies 36
Iteration: the bane of MapReduce
Slow!
© 2014 MapR Technologies 37
Typical MapReduce Workflows
Input to Job 1 -> Job 1 (Maps, Reduces) -> SequenceFile (Output from Job 1 = Input to Job 2) -> Job 2 (Maps, Reduces) -> SequenceFile -> … -> Input to last job -> Last Job (Maps, Reduces) -> Output from last job. Every intermediate result is written to and read back from HDFS.
© 2014 MapR Technologies 38
Iterations
Step -> Step -> Step -> Step -> Step
In-memory Caching
• Data partitions are read from RAM instead of disk (sketched below)
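A hedged PySpark sketch of why caching matters for iteration (the path and the parsing are illustrative): without cache(), every pass re-reads the file from disk; with it, every pass after the first reads partitions from RAM.

data = (sc.textFile("hdfs:///data/points.txt")        # illustrative path
          .map(lambda line: float(line.split()[0]))   # parse one value per line
          .cache())

total = 0.0
for i in range(10):
    # every pass after the first is served from the in-memory cache
    total += data.map(lambda x: x * x).reduce(lambda a, b: a + b)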
© 2014 MapR Technologies 39
Free HBase On Demand Training
(includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
© 2014 MapR Technologies 40
Lab – Query HBase airline data with Hive
Import mapping to Row Key and Columns:
Row key: Carrier-FlightNumber-Date-Origin-Destination (e.g., AA-1-2014-01-01-JFK-LAX)
Column families: delay, info, stats, timing
Columns: aircraftdelay, arrdelay, carrierdelay, cncl, cnclcode, tailnum, distance, elaptime, arrtime, deptime
Example row AA-1-2014-01-01-JFK-LAX: 13, 0, N7704, 2475, 385.00, 359, …
© 2014 MapR Technologies 41
Count number of cancellations by reason (code)
$ hive
hive> explain select count(*) as
cancellations, cnclcode from flighttable
where cncl=1 group by cnclcode order by
cancellations asc limit 100;
1 row
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
Filter Operator
Select Operator
Group By Operator
aggregations: count()
Reduce Output Operator
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
Select Operator
File Output Operator
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
Reduce Operator Tree:
Extract
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Limit
File Output Operator
Stage: Stage-0
Fetch Operator
limit: 100
© 2014 MapR Technologies 42
2 MapReduce jobs
$ hive
hive> select count(*) as cancellations, cnclcode from flighttable where
cncl=1 group by cnclcode order by cancellations asc limit 100;
1 row
Total jobs = 2
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 13.3 sec MAPRFS Read: 0
MAPRFS Write: 0 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 1.52 sec MAPRFS Read: 0
MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 820 msec
OK
4598 C
7146 A
© 2014 MapR Technologies 43
Find the longest airline delays
$ hive
hive> select arrdelay,key from flighttable where arrdelay > 1000 order by
arrdelay desc limit 10;
1 row
MapReduce Jobs Launched:
Map: 1 Reduce: 1
OK
1530.0 AA-385-2014-01-18-BNA-DFW
1504.0 AA-1202-2014-01-15-ONT-DFW
1473.0 AA-1265-2014-01-05-CMH-LAX
1448.0 AA-1243-2014-01-21-IAD-DFW
1390.0 AA-1198-2014-01-11-PSP-DFW
1335.0 AA-1680-2014-01-21-SLC-DFW
1296.0 AA-1277-2014-01-21-BWI-DFW
1294.0 MQ-2894-2014-01-02-CVG-DFW
1201.0 MQ-3756-2014-01-01-CLT-MIA
1184.0 DL-2478-2014-01-10-BOS-ATL
© 2014 MapR Technologies 44
Apache Spark
© 2014 MapR Technologies 45
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 in UC Berkeley's AMP Lab
• Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
© 2014 MapR Technologies 46
Spark: Fast Big Data
• Rich APIs in Java, Scala, Python
• Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
• 2-5× less code
© 2014 MapR Technologies 47
The Spark Community
© 2014 MapR Technologies 48
Spark is the Most Active Open Source Project in Big Data
[Chart: project contributors in the past year, comparing Spark with Giraph, Storm, and Tez on a 0-140 scale]
© 2014 MapR Technologies 49
Unified Platform
Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine, which runs on Mesos or Hadoop YARN over a distributed file system (HDFS, MapR-FS, S3, …).
© 2014 MapR Technologies 50
Spark Use Cases
• Iterative algorithms on large amounts of data
• Anomaly detection
• Classification
• Predictions
• Recommendations
© 2014 MapR Technologies 51
Why Iterative Algorithms
• Algorithms that need iterations
– Clustering (K-Means, Canopy, …)
– Gradient descent (e.g., Logistic Regression, Matrix Factorization)
– Graph algorithms (e.g., PageRank, Line-Rank, components, paths, reachability, centrality, …)
– Alternating Least Squares (ALS)
– Graph communities / dense sub-components
– Inference (belief propagation)
– …
© 2014 MapR Technologies 52
Example: Logistic Regression
• Goal: find the best line separating two sets of points
[Diagram: a random initial line converging to the target line]
© 2014 MapR Technologies 53
Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    # Iteration! Each pass reuses the cached points
    gradient = (data
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x)
        .reduce(lambda x, y: x + y))
    w -= gradient
print("Final w: %s" % w)
© 2014 MapR Technologies 54
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop InputFormat
• HBase
• other NoSQL data stores
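As a hedged illustration (the paths and bucket name are assumptions), each source is read the same way through the SparkContext:

logs = sc.textFile("file:///opt/httpd/logs/access_log")   # local file
events = sc.textFile("s3n://my-bucket/events/")           # S3 (illustrative bucket)
seq = sc.sequenceFile("hdfs:///data/seq")                 # Hadoop SequenceFile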
© 2014 MapR Technologies 55
How Spark Works
© 2014 MapR Technologies 56
Spark Programming Model
sc = new SparkContext
rDD = sc.textfile("hdfs://…")
rDD.map
The Driver Program creates the SparkContext, which connects to the cluster and runs Tasks on each Worker Node.
© 2014 MapR Technologies 57
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Fault-tolerant
• Read-only collection of elements
• Operated on in parallel
• Cached in memory, or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
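A small sketch of the in-memory vs. on-disk choice (the path is illustrative):

from pyspark import StorageLevel

lines = sc.textFile("hdfs:///data/input")
lines.cache()   # shorthand for persisting in memory only
# or, to allow spilling partitions to disk when memory is tight:
# lines.persist(StorageLevel.MEMORY_AND_DISK)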
© 2014 MapR Technologies 58
Working With RDDs
RDD
textFile = sc.textFile("SomeFile.txt")
© 2014 MapR Technologies 59
Working With RDDs
RDD
RDD
RDD
RDD
Transformations
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
textFile = sc.textFile("SomeFile.txt")
© 2014 MapR Technologies 60
Working With RDDs
RDD
RDD
RDD
RDD
Transformations
Action Value
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark
textFile = sc.textFile("SomeFile.txt")
© 2014 MapR Technologies 61
MapR Tutorial: Getting Started with Spark on MapR Sandbox
• https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial
© 2014 MapR Technologies 62
Example Spark Word Count in Java
Input:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
Pairs: the, 1 / time, 1 / and, 1 / and, 1 / …
Counts: the, 20 / time, 4 / and, 12 / …
JavaRDD<String> input = sc.textFile(inputFile);
// Split each line into words
JavaRDD<String> words = input.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String x) {
return Arrays.asList(x.split(" "));
}});
// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> word1s = words.mapToPair(
new PairFunction<String, String, Integer>(){
public Tuple2<String, Integer> call(String x){
return new Tuple2(x, 1);
}});
// reduceByKey: add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts =word1s.reduceByKey(
new Function2<Integer, Integer, Integer>(){
public Integer call(Integer x, Integer y){
return x + y;
}});
.........
© 2014 MapR Technologies 63
Example Spark Word Count in Scala
Input:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
Pairs: the, 1 / time, 1 / and, 1 / and, 1 / …
Counts: the, 20 / time, 4 / and, 12 / …
// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into pairs and count.
val counts = words
.map(word => (word, 1))
.reduceByKey{case (x, y) => x + y}
// Save the word count back out to a text file,
counts.saveAsTextFile(outputFile)
the, 20 time, 4 ….. and, 12
.........
© 2014 MapR Technologies 64
Example Spark Word Count in Scala
// Load input data.
val input = sc.textFile(inputFile)
textFile builds a HadoopRDD whose partitions are viewed through a MapPartitionsRDD.
© 2014 MapR Technologies 65
Example Spark Word Count in Scala
// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
Lineage so far: textFile -> flatMap: HadoopRDD -> MapPartitionsRDD -> MapPartitionsRDD
© 2014 MapR Technologies 66
FlatMap
flatMap
line => line.split(" ")
1-to-many mapping: "Ships and wax" becomes Ships / and / wax
JavaRDD<String> words
© 2014 MapR Technologies 67
Example Spark Word Count in Scala
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
// Transform into pairs
val counts = words.map(word => (word, 1))
Lineage: textFile -> flatMap -> map: HadoopRDD -> MapPartitionsRDD -> MapPartitionsRDD -> MapPartitionsRDD
© 2014 MapR Technologies 68
Map
map
word => (word, 1)
1-to-1 mapping: and becomes (and, 1)
JavaPairRDD<String, Integer> word1s
© 2014 MapR Technologies 69
Example Spark Word Count in Scala
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
.map(word => (word, 1))
.reduceByKey{case (x, y) => x + y}
Lineage: textFile -> flatMap -> map -> reduceByKey: HadoopRDD -> MapPartitionsRDD -> MapPartitionsRDD -> ShuffledRDD -> MapPartitionsRDD
© 2014 MapR Technologies 70
reduceByKey
case (x, y) => x + y
(and, 1), (and, 1), (wax, 1) becomes (and, 2), (wax, 1)
JavaPairRDD<String, Integer> counts
© 2014 MapR Technologies 71
Example Spark Word Count in Scala
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
.map(word => (word, 1))
.reduceByKey{case (x, y) => x + y}
val countArray = counts.collect()
Lineage: textFile -> flatMap -> map -> reduceByKey -> collect: HadoopRDD -> MapPartitionsRDD -> MapPartitionsRDD -> MapPartitionsRDD -> ShuffledRDD -> Array
© 2014 MapR Technologies 72
Components Of Execution
© 2014 MapR Technologies 73
MapR Blog: Getting Started with the Spark Web UI
• https://www.mapr.com/blog/getting-started-spark-web-ui
© 2014 MapR Technologies 74
Spark RDD DAG -> Physical Execution plan
RDD Graph: sc.textfile(…) -> HadoopRDD -> flatmap -> MapPartitionsRDD -> map -> MapPartitionsRDD -> reduceByKey -> ShuffledRDD -> MapPartitionsRDD -> collect
Physical Plan: Stage 1 (everything up to the shuffle) and Stage 2 (after it)
© 2014 MapR Technologies 75
Physical Execution plan -> Stages and Tasks
The DAG of the physical plan is split into Stages (Stage 1, Stage 2), and each stage into Tasks, one per partition. The Task Scheduler submits each Task Set to Executors on the Worker Nodes, where a task thread processes its partition, reading HDFS/HFile blocks from the local Data Node and using the block cache.
© 2014 MapR Technologies 76
Summary of Components
• Task: unit of execution
• Stage: group of tasks
– Based on the partitions of the RDD
– Tasks run in parallel
• DAG: logical graph of RDD operations
• RDD: parallel dataset with partitions
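A quick way to see the partition/task relationship from the shell (a sketch; the path and partition count are illustrative):

rdd = sc.textFile("hdfs:///data/input", minPartitions=4)
print(rdd.getNumPartitions())   # each stage runs one task per partition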
© 2014 MapR Technologies 77
How a Spark application runs on a Hadoop cluster
The Driver Program on the client node creates the SparkContext (sc = new SparkContext; rDD = sc.textfile("hdfs://…"); rDD.map). The SparkContext requests resources from the YARN Resource Manager (coordinated via ZooKeeper); a YARN Node Manager on each Worker Node launches an Executor, whose task threads process cached partitions of the HDFS/HFile blocks stored on the local HDFS Data Node.
© 2014 MapR Technologies 78
Deploying Spark – Cluster Manager Types
• Standalone mode
• Mesos
• YARN
• EC2
• GCE
© 2014 MapR Technologies 79
Example: Log Mining
© 2014 MapR Technologies 80
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
(based on slides from Pat McDonough)

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()          # Action

Step by step across the Driver and the three Workers:
1. The driver sends tasks to the workers.
2. Each worker reads its HDFS block (Block 1, 2, 3).
3. Each worker processes its partition and caches the data (Cache 1, 2, 3).
4. Each worker returns its results to the driver.

messages.filter(lambda s: "php" in s).count()

5. For the second search the driver again sends tasks, but each worker now processes from its cache and returns results without touching HDFS.

Cache your data -> Faster Results
Full-text search of Wikipedia
• 60GB on 20 EC2 machines
• 0.5 sec from cache vs. 20s for on-disk
© 2014 MapR Technologies 98
Transformations and Actions
© 2014 MapR Technologies 99
RDD Transformations and Actions
[Diagram: Transformations build RDD -> RDD -> RDD -> RDD; an Action then returns a Value]
Transformations
(define a new RDD)
map
filter
sample
union
groupByKey
reduceByKey
join
cache
…
Actions
(return a value)
reduce
collect
count
save
lookupKey
…
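A hedged two-line illustration of the split (the values are illustrative):

rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: 2 * x)          # transformation: defines a new RDD, lazy
print(doubled.reduce(lambda a, b: a + b))   # action: runs the job, returns 90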
© 2014 MapR Technologies 100
Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x*x) # => {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0) # => {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))
> # => {0, 0, 1, 0, 1, 2}
# range(x) is a sequence of numbers 0, 1, …, x-1
© 2014 MapR Technologies 101
Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect() # => [1, 2, 3]
# Return first K elements
> nums.take(2) # => [1, 2]
# Count number of elements
> nums.count() # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
© 2014 MapR Technologies 102
RDD Fault Recovery
• RDDs track lineage information
• can be used to efficiently recompute lost data
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
HDFS File -> filter (func = startswith(…)) -> Filtered RDD -> map (func = split(…)) -> Mapped RDD
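The lineage is easy to inspect from the shell; toDebugString() prints the chain Spark would recompute from (a small sketch, continuing the msgs example above):

print(msgs.toDebugString())   # shows the map/filter chain back to the HDFS file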
© 2014 MapR Technologies 103
Passing a function to Spark
• Spark is based on Scala's anonymous function syntax
– (x: Int) => x *x
• Which is a shorthand for
new Function1[Int,Int] {
def apply(x: Int) = x * x
}
© 2014 MapR Technologies 104
DataFrames
© 2014 MapR Technologies 105
DataFrame
Distributed collection of data organized into named
columns
// Create the DataFrame
val df = sqlContext.read.json("person.json")
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// |-- height: string (nullable = true)
// Select only the "name" column
df.select("name").show()
https://spark.apache.org/docs/latest/sql-programming-guide.html
© 2014 MapR Technologies 106
DataFrame RDD
• # data frame style
lineitems.groupby('customer').agg(Map(
'units' -> 'avg',
'totalPrice' -> 'std'
))
• # or SQL style
SELECT AVG(units), STD(totalPrice) FROM lineitems
GROUP BY customer
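A runnable PySpark sketch of the same aggregation (the lineitems DataFrame and its columns are the illustrative names above; stddev assumes a Spark version that ships it in pyspark.sql.functions):

from pyspark.sql import functions as F

summary = (lineitems.groupBy("customer")
           .agg(F.avg("units").alias("avg_units"),
                F.stddev("totalPrice").alias("std_price")))
summary.show()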
© 2014 MapR Technologies 107
Demo Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
• We have all wished we could do this with MapReduce
– Compile / save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
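A hedged sketch of that loop in the pyspark shell, where sc is predefined (the log path and patterns are illustrative):

lines = sc.textFile("hdfs:///logs/app.log")
errors = lines.filter(lambda s: "ERROR" in s).cache()
errors.count()                                    # first question materializes the cache
errors.filter(lambda s: "timeout" in s).count()   # follow-ups are served from memory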
© 2014 MapR Technologies 108
MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data
• https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data
© 2014 MapR Technologies 109
The physical plan for DataFrames
© 2014 MapR Technologies 110
DataFrame Execution plan
// Print the physical plan to the console
auction.select("auctionid").distinct.explain()
== Physical Plan ==
Distinct false
Exchange (HashPartitioning [auctionid#0], 200)
Distinct true
Project [auctionid#0]
PhysicalRDD
[auctionid#0,bid#1,bidtime#2,bidder#3,
bidderrate#4,openbid#5,price#6,item#7,daystolive#8],
MapPartitionsRDD[11] at mapPartitions at
ExistingRDD.scala:37
© 2014 MapR Technologies 111
There’s a lot more !
© 2014 MapR Technologies 112
Unified Platform
Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine, which runs on Mesos or Hadoop YARN over a distributed file system (HDFS, MapR-FS, S3, …).
© 2014 MapR Technologies 113
Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
– Movie Recommendations with Collaborative Filtering
– Spark Streaming
© 2014 MapR Technologies 114
Soon to Come
Blogs and Tutorials:
– Rewrite this Mahout example with Spark
© 2014 MapR Technologies 115
Examples and Resources
© 2014 MapR Technologies 116
Spark on MapR
• Certified Spark Distribution
• Fully supported and packaged by MapR in partnership with Databricks
– The mapr-spark package includes Spark, Shark, and Spark Streaming today
– Spark-python, GraphX, and MLlib soon
• YARN integration
– Spark can allocate resources from the cluster when needed
© 2014 MapR Technologies 117
References
• Spark web site: http://spark.apache.org/
• https://databricks.com/
• Spark on MapR:
– http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark - O'Reilly Book
• Apache Spark
© 2014 MapR Technologies 118
Q&A
@mapr maprtech
kbotzum@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Introduction to Spark on Hadoop

  • 14. © 2014 MapR Technologies 14 Inverted Index Example come, (alice.txt) do, (macbeth.txt) has, (alice.txt) time, (alice.txt, macbeth.txt) . . . "The time has come," the Walrus said alice.txt tis time to do it macbeth.txt time, alice.txt has, alice.txt come, alice.txt .. tis, macbeth.txt time, macbeth.txt do, macbeth.txt …
  • 15. © 2014 MapR Technologies 15 MapReduce Example: Inverted Index • Input: (filename, text) records • Output: list of files containing each word • Map: foreach word in text.split(): output(word, filename) • Combine: uniquify filenames for each word • Reduce: def reduce(word, filenames): output(word, sort(filenames))
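For comparison with the Spark material later in this deck, here is a hedged sketch of the same inverted index expressed as a few Spark transformations in Scala. The input directory and output path are placeholders; wholeTextFiles yields (filename, contents) pairs.

  import org.apache.spark.SparkContext._   // pair-RDD operations (needed on pre-1.3 Spark)

  val files = sc.wholeTextFiles("hdfs:///books")   // hypothetical input directory
  val index = files
    .flatMap { case (file, text) => text.split("\\s+").map(w => (w, file)) }
    .distinct()                       // the "combine: uniquify filenames" step
    .groupByKey()                     // word -> all files containing it
    .mapValues(_.toSeq.sorted)        // the "reduce: sort(filenames)" step
  index.saveAsTextFile("hdfs:///out/inverted-index")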
  • 16. © 2014 MapR Technologies 18 MapReduce: The Good • Built in fault tolerance • Optimized IO path • Scalable • Developer focuses on Map/Reduce, not infrastructure • simple? API
  • 17. © 2014 MapR Technologies 19 MapReduce: The Bad • Optimized for disk IO – Doesn’t leverage memory well – Iterative algorithms go through disk IO path again and again • Primitive API – simple abstraction – Key/Value in/out – basic things like join • require extensive code • Result often many files that need to be combined appropriately
  • 18. © 2014 MapR Technologies 20 Free Hadoop MapReduce On Demand Training • https://www.mapr.com/services/mapr-academy/big-data-hadoop- online-training
  • 19. © 2014 MapR Technologies 21 What is Hive? • Data Warehouse on top of Hadoop – Gives ability to query without programming – Used for analytical querying of data • SQL like execution for Hadoop • SQL evaluates to MapReduce code – Submits jobs to your cluster
  • 20. © 2014 MapR Technologies 22 Using HBase as a MapReduce/Hive Source EXAMPLE: Data Warehouse for Analytical Processing queries Hive runs MapReduce application Hive Select JoinHBase database Files (HDFS/MapR-FS) Query Result File
  • 21. © 2014 MapR Technologies 23 Using HBase as a MapReduce or Hive Sink EXAMPLE: bulk load data into a table Files (HDFS/MapR-FS) HBase databaseHive runs MapReduce application Hive Insert Select
  • 22. © 2014 MapR Technologies 24 Using HBase as a Source & Sink EXAMPLE: calculate and store summaries, Pre-Computed, Materialized View HBase database Hive Select Join Hive runs MapReduce application
  • 23. © 2014 MapR Technologies 25 Job Tracker Name Node HADOOP (MAP-REDUCE + HDFS) Data Node + Task Tracker Hive Metastore Driver (compiler, Optimizer, Executor) Command Line Interface Web Interface JDBC Thrift Server ODBC Metastore Hive The schema metadata is stored in the Hive metastore Hive Table definition HBase trades_tall Table
  • 24. © 2014 MapR Technologies 26 Hive HBase HBase Tables Hive metastore Points to Existing Hive Managed
  • 25. © 2014 MapR Technologies 27 Hive HBase – External Table CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping"= “:key,cf1:price#b,cf1:vol#b") TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall"); Points to External key string price bigint vol bigint key cf1:price cf1:vol AMZN_986186008 12.34 1000 AMZN_986186007 12.00 50 trades /usr/user1/trades_tall Hive Table definition HBaseTable
  • 26. © 2014 MapR Technologies 28 Hive HBase – Hive Query SQL evaluates to MapReduce code SELECT AVG(price) FROM trades WHERE key LIKE "GOOG” ; HBase Tables Queries Parser Planner Execution
  • 27. © 2014 MapR Technologies 29 Hive HBase – External Table key cf1:price cf1:vol AMZN_986186008 12.34 1000 AMZN_986186007 12.00 50 Selection WHERE key like SQL evaluates to MapReduce code SELECT AVG(price) FROM trades WHERE key LIKE “AMZN” ; Projection select price Aggregation Avg( price)
  • 28. © 2014 MapR Technologies 30 Hive Query Plan • EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Filter Operator predicate: (key like 'GOOG%') (type: boolean) Select Operator Group By Operator Reduce Operator Tree: Group By Operator Select Operator File Output Operator
  • 29. © 2014 MapR Technologies 31 Hive Query Plan – (2) output hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; col0 Trades table group aggregations: avg(price) scan filter Select key like 'GOOG% Select price Group by map() map() map() reduce() reduce()
  • 30. © 2014 MapR Technologies 32 Hive Map Reduce Region Region Region scan key, row reduce() shuffle reduce() reduce()Map() Map() Map() Query Result File HBase Hive Select Join Hive Query result result result
  • 31. © 2014 MapR Technologies 33 Some Hive Design Patterns • Summarization – Select min(delay), max(delay), count(*) from flights group by carrier; • Filtering – SELECT * FROM trades WHERE key LIKE "GOOG%"; – SELECT price FROM trades ORDER BY price DESC LIMIT 10; • Join SELECT tableA.field1, tableB.field2 FROM tableA JOIN tableB ON tableA.field1 = tableB.field2;
  • 32. © 2014 MapR Technologies 34 What is a Directed Acyclic Graph (DAG)? • Graph – vertices (points) and edges (lines) • Directed – Only in a single direction • Acyclic – No looping • This supports fault-tolerance
  • 33. © 2014 MapR Technologies 35 Hive Query Plan Map Reduce Execution FS1 AGG2 RS4 JOIN1 RS2 AGG1 RS1 t1 RS3 t1 Job 3 Job 2 FS1 AGG2 JOIN1 AGG1 RS1 t1 RS3Job 1 Job 1 Optimize
  • 34. © 2014 MapR Technologies 36 Slow! Iteration: the bane of MapReduce
  • 35. © 2014 MapR Technologies 37 Typical MapReduce Workflows Input to Job 1 SequenceFile Last Job Maps Reduces SequenceFile Job 1 Maps Reduces SequenceFile Job 2 Maps Reduces Output from Job 1 Output from Job 2 Input to last job Output from last job HDFS
  • 36. © 2014 MapR Technologies 38 Iterations Step Step Step Step Step In-memory Caching • Data Partitions read from RAM instead of disk
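To make the caching point concrete, a minimal spark-shell sketch (the path and the computation are placeholders): the input is scanned from HDFS once, and every later iteration reads the cached partitions from RAM.

  val points = sc.textFile("hdfs:///data/points.csv")   // hypothetical path
    .map(_.split(",").map(_.toDouble))
    .cache()                                            // keep partitions in memory

  var acc = 0.0
  for (i <- 1 to 10) {
    // after the first pass there is no disk IO: partitions come from the cache
    acc += points.map(p => p(0) * i).reduce(_ + _)
  }
  println(acc)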
  • 37. © 2014 MapR Technologies 39 Free HBase On Demand Training (includes Hive and MapReduce with HBase) • https://www.mapr.com/services/mapr-academy/big-data-hadoop- online-training
  • 38. © 2014 MapR Technologies 40 Lab – Query HBase airline data with Hive Import mapping to Row Key and Columns: Row-key Carrier- Flightnumber- Date- Origin- destination delay info stats timing Air Craft delay Arr delay Carrier delay cncl Cncl code tailnum distance elaptime arrtime Dep time AA-1-2014-01- 01-JFK-LAX 13 0 N7704 2475 385.00 359 …
  • 39. © 2014 MapR Technologies 41 Count number of cancellations by reason (code) $ hive hive> explain select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100; 1 row OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Filter Operator Select Operator Group By Operator aggregations: count() Reduce Output Operator Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) Select Operator File Output Operator Stage: Stage-2 Map Reduce Map Operator Tree: TableScan Reduce Output Operator Reduce Operator Tree: Extract Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Limit File Output Operator Stage: Stage-0 Fetch Operator limit: 100
  • 40. © 2014 MapR Technologies 42 2 MapReduce jobs $ hive hive> select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100; 1 row Total jobs = 2 MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 Cumulative CPU: 13.3 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS Job 1: Map: 1 Reduce: 1 Cumulative CPU: 1.52 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 14 seconds 820 msec OK 4598 C 7146 A
  • 41. © 2014 MapR Technologies 43 Find the longest airline delays $ hive hive> select arrdelay,key from flighttable where arrdelay > 1000 order by arrdelay desc limit 10; 1 row MapReduce Jobs Launched: Map: 1 Reduce: 1 OK 1530.0 AA-385-2014-01-18-BNA-DFW 1504.0 AA-1202-2014-01-15-ONT-DFW 1473.0 AA-1265-2014-01-05-CMH-LAX 1448.0 AA-1243-2014-01-21-IAD-DFW 1390.0 AA-1198-2014-01-11-PSP-DFW 1335.0 AA-1680-2014-01-21-SLC-DFW 1296.0 AA-1277-2014-01-21-BWI-DFW 1294.0 MQ-2894-2014-01-02-CVG-DFW 1201.0 MQ-3756-2014-01-01-CLT-MIA 1184.0 DL-2478-2014-01-10-BOS-ATL
  • 42. © 2014 MapR Technologies 44© 2014 MapR Technologies Apache Spark
  • 43. © 2014 MapR Technologies 45 Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
  • 44. © 2014 MapR Technologies 46 Spark: Fast Big Data – Rich APIs in Java, Scala, Python – Interactive shell • Fast to Run – General execution graphs – In-memory storage 2-5× less code
  • 45. © 2014 MapR Technologies 47 The Spark Community
  • 46. © 2014 MapR Technologies 48 Spark is the Most Active Open Source Project in Big Data [bar chart: project contributors in the past year, Spark vs. Giraph, Storm, Tez]
  • 47. © 2014 MapR Technologies 49 Spark SQL Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Mesos Distributed File System (HDFS, MapR-FS, S3, …) Hadoop YARN Unified Platform
  • 48. © 2014 MapR Technologies 50 Spark Use Cases • Iterative Algorithms on large amounts of data • Anomaly detection • Classification • Predictions • Recommendations
  • 49. © 2014 MapR Technologies 51 Why Iterative Algorithms • Algorithms that need iterations – Clustering (K-Means, Canopy, …) – Gradient descent (e.g., Logistic Regression, Matrix Factorization) – Graph Algorithms (e.g., PageRank, Line-Rank, components, paths, reachability, centrality, …) – Alternating Least Squares (ALS) – Graph communities / dense sub-components – Inference (belief propagation) – … 51
  • 50. © 2014 MapR Technologies 52 Example: Logistic Regression • Goal: find best line separating two sets of points target random initial line
  • 51. © 2014 MapR Technologies 53 data = spark.textFile(...).map(readPoint).cache() w = numpy.random.rand(D) for i in range(iterations): gradient = data .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) .reduce(lambda x, y: x + y) w -= gradient print “Final w: %s” % w Iteration! Logistic Regression
  • 52. © 2014 MapR Technologies 54 Data Sources • Local Files – file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem – Regular files, sequence files, any other Hadoop InputFormat • HBase • other NoSQL data stores
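A quick sketch of reading from the sources above (all paths are placeholders); sc.textFile accepts any URI scheme the underlying Hadoop client understands.

  val local = sc.textFile("file:///opt/httpd/logs/access_log")
  val dfs   = sc.textFile("hdfs:///user/alice/input.txt")
  val s3    = sc.textFile("s3n://my-bucket/input.txt")   // s3n was the usual 2014-era scheme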
  • 53. © 2014 MapR Technologies 55© 2014 MapR Technologies How Spark Works
  • 54. © 2014 MapR Technologies 56 Spark Programming Model sc=new SparkContext rDD=sc.textfile(“hdfs://…”) rDD.map Driver Program SparkContext cluster Worker Node Task Task Task Worker Node
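Spelled out as a complete driver program, the model on this slide looks roughly like the following; this is a sketch assuming a local master and a placeholder input path.

  import org.apache.spark.{SparkConf, SparkContext}

  object Driver {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("overview").setMaster("local[2]"))
      val rdd = sc.textFile("hdfs:///path/to/input")   // hypothetical path
      println(rdd.count())   // the action ships tasks to the worker nodes
      sc.stop()
    }
  }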
  • 55. © 2014 MapR Technologies 57 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • Fault-tolerant • read only collection of elements • operated on in parallel • Cached in memory • Or on disk http://www.cs.berkeley.edu/~matei/papers/ 2012/nsdi_spark.pdf
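"Cached in memory or on disk" maps onto RDD storage levels; a sketch with placeholder paths:

  import org.apache.spark.storage.StorageLevel

  val hot = sc.textFile("hdfs:///logs/today").cache()    // shorthand for MEMORY_ONLY
  val big = sc.textFile("hdfs:///logs/history")
    .persist(StorageLevel.MEMORY_AND_DISK)               // spill partitions that don't fit in RAM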
  • 56. © 2014 MapR Technologies 58 Working With RDDs RDD textFile = sc.textFile(”SomeFile.txt”)
  • 57. © 2014 MapR Technologies 59 Working With RDDs RDD RDD RDD RDD Transformations linesWithSpark = textFile.filter(lambda line: "Spark” in line) textFile = sc.textFile(”SomeFile.txt”)
  • 58. © 2014 MapR Technologies 60 Working With RDDs RDD RDD RDD RDD Transformations Action Value linesWithSpark = textFile.filter(lambda line: "Spark” in line) linesWithSpark.count() 74 linesWithSpark.first() # Apache Spark textFile = sc.textFile(”SomeFile.txt”)
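The same three steps in Scala, the form most of this deck uses (SomeFile.txt is a placeholder); nothing executes until the two actions at the end.

  val textFile = sc.textFile("SomeFile.txt")
  val linesWithSpark = textFile.filter(line => line.contains("Spark"))
  linesWithSpark.count()   // action: returns the number of matching lines
  linesWithSpark.first()   // action: returns the first matching line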
  • 59. © 2014 MapR Technologies 61 MapR Tutorial: Getting Started with Spark on MapR Sandbox • https://www.mapr.com/products/mapr-sandbox- hadoop/tutorials/spark-tutorial
  • 60. © 2014 MapR Technologies 62 Example Spark Word Count in Java ...the ... "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax andtime and the, 1 time, 1 and, 1 and, 1 and, 12time, 4 ...the, 20 JavaRDD<String> input = sc.textFile(inputFile); // Split each line into words JavaRDD<String> words = input.flatMap( new FlatMapFunction<String, String>() { public Iterable<String> call(String x) { return Arrays.asList(x.split(" ")); }}); // Turn the words into (word, 1) pairs JavaPairRDD<String, Integer> word1s = words.mapToPair( new PairFunction<String, String, Integer>(){ public Tuple2<String, Integer> call(String x){ return new Tuple2(x, 1); }}); // reduce: add the pairs by key to produce counts JavaPairRDD<String, Integer> counts = word1s.reduceByKey( new Function2<Integer, Integer, Integer>(){ public Integer call(Integer x, Integer y){ return x + y; }}); .........
  • 61. © 2014 MapR Technologies 63 Example Spark Word Count in Scala ...the ... "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax andtime and the, 1 time, 1 and, 1 and, 1 and, 12time, 4 ...the, 20 // Load our input data. val input = sc.textFile(inputFile) // Split it up into words. val words = input.flatMap(line => line.split(" ")) // Transform into pairs and count. val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} // Save the word count back out to a text file, counts.saveAsTextFile(outputFile) the, 20 time, 4 ….. and, 12 .........
  • 62. © 2014 MapR Technologies 64 Example Spark Word Count in Scala 64 HadoopRDD textFile // Load input data. val input = sc.textFile(inputFile) RDD partitions MapPartitionsRDD
  • 63. © 2014 MapR Technologies 65 Example Spark Word Count in Scala 65 // Load our input data. val input = sc.textFile(inputFile) // Split it up into words. val words = input.flatMap(line => line.split(" ")) HadoopRDD textFile flatmap MapPartitionsRDD MapPartitionsRDD
  • 64. © 2014 MapR Technologies 66 FlatMap flatMap line => line.split(" ")) 1 to many mapping ShipsShips and wax and wax JavaPairRDD<String> words
  • 65. © 2014 MapR Technologies 67 Example Spark Word Count in Scala 67 textFile flatmap map val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) // Transform into pairs val counts = words.map(word => (word, 1)) HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD
  • 66. © 2014 MapR Technologies 68 Map map word => (word, 1)) 1 to 1 mapping and and, 1 JavaPairRDD<String, Integer> word1s
  • 67. © 2014 MapR Technologies 69 Example Spark Word Count in Scala 69 textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} HadoopRDD MapPartitionsRDD MapPartitionsRDD ShuffledRDD MapPartitionsRDD
  • 68. © 2014 MapR Technologies 70 reduceByKey reduceByKey case (x, y) => x + y wax, 1 and, 1 and, 1 wax, 1 and, 2 JavaPairRDD<String, Integer> counts
  • 69. © 2014 MapR Technologies 71 Example Spark Word Count in Scala textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} val countArray = counts.collect() HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD collect ShuffledRDD Array
  • 70. © 2014 MapR Technologies 72© 2014 MapR Technologies Components Of Execution
  • 71. © 2014 MapR Technologies 73 MapR Blog: Getting Started with the Spark Web UI • https://www.mapr.com/blog/getting-started-spark-web-ui
  • 72. © 2014 MapR Technologies 74 Spark RDD DAG -> Physical Execution plan HadoopRDD sc.textfile(…) MapPartitionsRDD flatmap flatmap reduceByKey RDD Graph Physical Plan collect MapPartitionsRDD ShuffledRDD MapPartitionsRDD Stage 1 Stage 2
  • 73. © 2014 MapR Technologies 75 Physical Plan DAG Stage 1 Stage 2 Task Task Task Task Task Task Task Stage 1 Stage 2 Split into Tasks HFile HDFS Data Node Worker Node block cache partition Executor HFile block HFileHFile Task thread Task Set Task Scheduler Task Physical Execution plan -> Stages and Tasks
  • 74. © 2014 MapR Technologies 76 Summary of Components • Task : unit of execution • Stage: Group of Tasks – Base on partitions of RDD – Tasks run in parallel • DAG : Logical Graph of RDD operations • RDD : Parallel dataset with partitions 76
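A sketch of how these pieces line up on the word-count example: reduceByKey requires a shuffle, so the scheduler cuts the DAG there, producing two stages of tasks (the path is a placeholder).

  import org.apache.spark.SparkContext._             // pair-RDD operations (pre-1.3 Spark)

  val counts = sc.textFile("hdfs:///in/words.txt")   // hypothetical path
    .flatMap(_.split(" "))                           // Stage 1: narrow, pipelined
    .map(word => (word, 1))                          // Stage 1: still no shuffle
    .reduceByKey(_ + _)                              // shuffle boundary => Stage 2
  counts.collect()                                   // action: submits both stages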
  • 75. © 2014 MapR Technologies 77 How a Spark Application runs on a Hadoop cluster HFile HDFS Data Node Worker Node block cache partition task task Executor HFile block HFile HFile SparkContext zookeeper YARN Resource Manager HFile HDFS Data Node Worker Node block cache partition task task Executor HFile block HFile HFile Client node sc=new SparkContext rDD=sc.textfile(“hdfs://…”) rDD.map Driver Program Yarn Node Manager Yarn Node Manager
  • 76. © 2014 MapR Technologies 78 Deploying Spark – Cluster Manager Types • Standalone mode • Mesos • YARN • EC2 • GCE
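The cluster manager is chosen through the master URL handed to SparkConf; a sketch with placeholder host names:

  val conf = new org.apache.spark.SparkConf().setAppName("demo")
  conf.setMaster("local[4]")                // no cluster manager: 4 local threads
  // conf.setMaster("spark://master:7077")  // standalone mode
  // conf.setMaster("mesos://master:5050")  // Mesos
  // conf.setMaster("yarn-client")          // YARN (Spark 1.x-era syntax)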
  • 77. © 2014 MapR Technologies 79© 2014 MapR Technologies Example: Log Mining
  • 78. © 2014 MapR Technologies 80 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Based on slides from Pat McDonough at
  • 79. © 2014 MapR Technologies 81 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver
  • 80. © 2014 MapR Technologies 82 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile(“hdfs://...”)
  • 81. © 2014 MapR Technologies 83 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile(“hdfs://...”) Base RDD
  • 82. © 2014 MapR Technologies 84 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Worker Worker Driver
  • 83. © 2014 MapR Technologies 85 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Worker Worker Driver Transformed RDD
  • 84. © 2014 MapR Technologies 86 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count()
  • 85. © 2014 MapR Technologies 87 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count() Action
  • 86. © 2014 MapR Technologies 88 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3
  • 87. © 2014 MapR Technologies 89 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver tasks tasks tasks
  • 88. © 2014 MapR Technologies 90 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Read HDFS Block Read HDFS Block Read HDFS Block
  • 89. © 2014 MapR Technologies 91 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data
  • 90. © 2014 MapR Technologies 92 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 results results results
  • 91. © 2014 MapR Technologies 93 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count()
  • 92. © 2014 MapR Technologies 94 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() tasks tasks tasks Driver
  • 93. © 2014 MapR Technologies 95 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver Process from Cache Process from Cache Process from Cache
  • 94. © 2014 MapR Technologies 96 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver results results results
  • 95. © 2014 MapR Technologies 97 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver Cache your data -> Faster Results Full-text search of Wikipedia • 60GB on 20 EC2 machines • 0.5 sec from cache vs. 20s for on-disk
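Collapsed into one runnable spark-shell session in Scala (the path is a placeholder, and the log format, tab-separated with the message in the third field, is assumed):

  val lines    = sc.textFile("hdfs:///logs/app.log")
  val errors   = lines.filter(_.startsWith("ERROR"))
  val messages = errors.map(_.split("\t")(2))
  messages.cache()
  messages.filter(_.contains("mysql")).count()   // first action: reads HDFS, fills the cache
  messages.filter(_.contains("php")).count()     // second action: served from the cache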
  • 96. © 2014 MapR Technologies 98© 2014 MapR Technologies Transformations and Actions
  • 97. © 2014 MapR Technologies 99 RDD Transformations and Actions RDD RDD RDD RDDTransformations Action Value Transformations (define a new RDD) map filter sample union groupByKey reduceByKey join cache … Actions (return a value) reduce collect count save lookupKey …
  • 98. © 2014 MapR Technologies 100 Basic Transformations > nums = sc.parallelize([1, 2, 3]) # Pass each element through a function > squares = nums.map(lambda x: x*x) // {1, 4, 9} # Keep elements passing a predicate > even = squares.filter(lambda x: x % 2 == 0) // {4} # Map each element to zero or more others > nums.flatMap(lambda x: range(x)) > # => {0, 0, 1, 0, 1, 2} Range object (sequence of numbers 0, 1, …, x-1)
  • 99. © 2014 MapR Technologies 101 Basic Actions > nums = sc.parallelize([1, 2, 3]) # Retrieve RDD contents as a local collection > nums.collect() # => [1, 2, 3] # Return first K elements > nums.take(2) # => [1, 2] # Count number of elements > nums.count() # => 3 # Merge elements with an associative function > nums.reduce(lambda x, y: x + y) # => 6 # Write elements to a text file > nums.saveAsTextFile(“hdfs://file.txt”)
  • 100. © 2014 MapR Technologies 102 RDD Fault Recovery • RDDs track lineage information • can be used to efficiently recompute lost data msgs = textFile.filter(lambda s: s.startswith(“ERROR”)) .map(lambda s: s.split("\t")[2]) HDFS File Filtered RDD Mapped RDD filter (func = startswith(…)) map (func = split(...))
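The lineage Spark keeps can be printed with RDD.toDebugString; this dependency chain is what gets replayed to rebuild only the lost partitions (the path is a placeholder).

  val msgs = sc.textFile("hdfs:///logs/app.log")
    .filter(_.startsWith("ERROR"))
    .map(_.split("\t")(2))
  println(msgs.toDebugString)   // prints the chain: MapPartitionsRDD <- MapPartitionsRDD <- HadoopRDD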
  • 101. © 2014 MapR Technologies 103 Passing a function to Spark • Spark is based on anonymous function syntax – (x: Int) => x * x • Which is shorthand for new Function1[Int,Int] { def apply(x: Int) = x * x } 103
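Either form, or a named method, can be handed to a transformation; a small sketch:

  val nums = sc.parallelize(1 to 5)
  val a = nums.map(x => x * x)      // anonymous function literal
  def square(x: Int): Int = x * x
  val b = nums.map(square)          // a named method works too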
  • 102. © 2014 MapR Technologies 104© 2014 MapR Technologies Dataframes
  • 103. © 2014 MapR Technologies 105 DataFrame Distributed collection of data organized into named columns // Create the DataFrame val df = sqlContext.read.json("person.json") // Print the schema in a tree format df.printSchema() // root // |-- age: long (nullable = true) // |-- name: string (nullable = true) // |-- height: string (nullable = true) // Select only the "name" column df.select("name").show() https://spark.apache.org/docs/latest/sql-programming-guide.html
  • 104. © 2014 MapR Technologies 106 DataFrame RDD • # data frame style lineitems.groupBy("customer").agg(Map("units" -> "avg", "totalPrice" -> "std")) • # or SQL style SELECT AVG(units), STD(totalPrice) FROM lineitems GROUP BY customer
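The two styles meet through temporary tables (Spark 1.x API; lineitems is a hypothetical DataFrame): register it once, then query it with SQL.

  lineitems.registerTempTable("lineitems")
  val avgUnits = sqlContext.sql(
    "SELECT customer, AVG(units) FROM lineitems GROUP BY customer")
  avgUnits.show()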
  • 105. © 2014 MapR Technologies 107 Demo Interactive Shell • Iterative Development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
  • 106. © 2014 MapR Technologies 108 MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data • https://www.mapr.com/blog/using-apache-spark-dataframes- processing-tabular-data
  • 107. © 2014 MapR Technologies 109 The physical plan for DataFrames
  • 108. © 2014 MapR Technologies 110 DataFrame Execution plan // Print the physical plan to the console auction.select("auctionid").distinct.explain() == Physical Plan == Distinct false Exchange (HashPartitioning [auctionid#0], 200) Distinct true Project [auctionid#0] PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3, bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
  • 109. © 2014 MapR Technologies 111© 2014 MapR Technologies There’s a lot more !
  • 110. © 2014 MapR Technologies 112 Spark SQL Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Mesos Distributed File System (HDFS, MapR-FS, S3, …) Hadoop YARN Unified Platform
  • 111. © 2014 MapR Technologies 113 Soon to Come • Spark On Demand Training – https://www.mapr.com/services/mapr-academy/ • Blogs and Tutorials: – Movie Recommendations with Collaborative Filtering – Spark Streaming
  • 112. © 2014 MapR Technologies 114 Soon to Come Blogs and Tutorials: – Re-write this mahout example with spark
  • 113. © 2014 MapR Technologies 115© 2014 MapR Technologies Examples and Resources
  • 114. © 2014 MapR Technologies 116 Spark on MapR • Certified Spark Distribution • Fully supported and packaged by MapR in partnership with Databricks – mapr-spark package with Spark, Shark, Spark Streaming today – Spark-python, GraphX and MLLib soon • YARN integration – Spark can then allocate resources from cluster when needed
  • 115. © 2014 MapR Technologies 117 References • Spark web site: http://spark.apache.org/ • https://databricks.com/ • Spark on MapR: – http://www.mapr.com/products/apache-spark • Spark SQL and DataFrame Guide • Apache Spark vs. MapReduce – Whiteboard Walkthrough • Learning Spark - O'Reilly Book • Apache Spark
  • 116. © 2014 MapR Technologies 118 Q&A @mapr maprtech kbotzum@mapr.com Engage with us! MapR maprtech mapr-technologies