© 2014 MapR Technologies 1
An Overview of Apache Spark
© 2014 MapR Technologies 2
Agenda
• MapReduce Refresher
• What is Spark?
• The Difference with Spark
• Examples and Resources
© 2014 MapR Technologies 3
MapReduce Refresher
© 2014 MapR Technologies 4
MapReduce: A Programming Model
• MapReduce: Simplified Data Processing on Large Clusters (published 2004)
• Parallel and Distributed Algorithm:
• Data Locality
• Fault Tolerance
• Linear Scalability
© 2014 MapR Technologies 5
The Hadoop Strategy
http://developer.yahoo.com/hadoop/tutorial/module4.html
Distribute data
(share nothing)
Distribute computation
(parallelization without synchronization)
Tolerate failures
(no single point of failure)
[Diagram: mapping processes on Nodes 1-3 feed reducing processes on Nodes 1-3]
© 2014 MapR Technologies 6
Distribute Data: HDFS
HDFS splits large data files into chunks (64 MB); chunks are replicated across the cluster.
[Diagram: a user process gets location metadata from the NameNode, then accesses the physical data over the network on the DataNodes, which store & retrieve the data]
© 2014 MapR Technologies 7
Distribute Computation
[Diagram: a MapReduce program runs on the Hadoop cluster against its data sources and produces a result]
© 2014 MapR Technologies 8
MapReduce Basics
• Foundational model is based on a distributed file system
– Scalability and fault-tolerance
• Map
– Loading of the data and defining a set of keys
• Reduce
– Collects the organized key-based data to process and output
– Many use cases do not utilize a reduce task
• Performance can be tweaked based on known details of your
source files and cluster shape (size, total number)
© 2014 MapR Technologies 9
MapReduce Execution and Data Flow
On each node, files are loaded from HDFS stores: an InputFormat defines Splits, and RecordReaders (RR) turn each split into input (k, v) pairs. The map tasks emit intermediate (k, v) pairs, which a Partitioner assigns to reducers; the "shuffle" process exchanges intermediate (k, v) pairs between all nodes. After a sort, the reduce tasks produce final (k, v) pairs, and an OutputFormat writes each node's output file back to the local HDFS store.
© 2014 MapR Technologies 10
MapReduce Example: Word Count
Input:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
Map output: the, 1 / time, 1 / has, 1 / come, 1 / … / and, 1 / … / and, 1 / …
Shuffle and Sort: and, [1, 1, 1] / come, [1,1,1] / has, [1,1] / the, [1,1,1] / time, [1,1,1,1] / …
Reduce output: and, 12 / come, 6 / has, 8 / the, 4 / time, 14 / …
© 2014 MapR Technologies 11
Tolerate Failures
Failures on the Hadoop cluster are expected & managed gracefully:
DataNode fails -> the NameNode will locate a replica
MapReduce task fails -> the JobTracker will schedule another one
© 2014 MapR Technologies 12
MapReduce Processing Model
• Define mappers (see the sketch after this list)
• Shuffling is automatic
• Define reducers
• For complex work, chain jobs together
– Use a higher level language or DSL that does this for you
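To make "define mappers" and "define reducers" concrete, here is a minimal Hadoop Streaming sketch (not from the original deck; the file names and the word-count task are illustrative). Each script reads stdin and writes tab-separated key/value pairs; the framework sorts and shuffles between the two.

# mapper.py: emit one (word, 1) pair per word seen on stdin
import sys
for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

# reducer.py: input arrives sorted by key, so accumulate per word
import sys
current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(value)
if current is not None:
    print("%s\t%d" % (current, count))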
© 2014 MapR Technologies 13
MapReduce Design Patterns
• Summarization
– Inverted index, counting
• Filtering
– Top ten, distinct
• Aggregation
• Data Organization
– Partitioning
• Join
– Join data sets
• Metapattern
– Job chaining
© 2014 MapR Technologies 14
Inverted Index Example
alice.txt: "The time has come," the Walrus said
macbeth.txt: tis time to do it
Map output: time, alice.txt / has, alice.txt / come, alice.txt / .. / tis, macbeth.txt / time, macbeth.txt / do, macbeth.txt / …
Final output: come, (alice.txt) / do, (macbeth.txt) / has, (alice.txt) / time, (alice.txt, macbeth.txt) / . . .
© 2014 MapR Technologies 15
MapReduce Example: Inverted Index
• Input: (filename, text) records
• Output: list of files containing each word
• Map:
def map(filename, text):
    for word in text.split():
        output(word, filename)
• Combine: uniquify filenames for each word
• Reduce:
def reduce(word, filenames):
    output(word, sorted(filenames))
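The phases are easy to simulate locally; this toy, single-process Python sketch (an illustration added here, not part of the original deck) makes the data flow concrete:

from collections import defaultdict

def inverted_index(docs):   # docs: dict of filename -> text
    # Map: emit (word, filename) pairs
    pairs = [(w, name) for name, text in docs.items() for w in text.split()]
    # Combine/shuffle: group filenames per word, uniquified
    grouped = defaultdict(set)
    for word, name in pairs:
        grouped[word].add(name)
    # Reduce: sorted file list per word
    return {word: sorted(names) for word, names in grouped.items()}

print(inverted_index({"alice.txt": "The time has come",
                      "macbeth.txt": "tis time to do it"}))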
© 2014 MapR Technologies 18
MapReduce: The Good
• Built in fault tolerance
• Optimized IO path
• Scalable
• Developer focuses on Map/Reduce, not infrastructure
• Simple API (or is it? see the next slide)
© 2014 MapR Technologies 19
MapReduce: The Bad
• Optimized for disk IO
– Doesn’t leverage memory well
– Iterative algorithms go through disk IO path again and again
• Primitive API
– Simple key/value in/out abstraction
– Basic things like joins require extensive code
• Results are often many files that need to be combined appropriately
© 2014 MapR Technologies 20
Free Hadoop MapReduce On Demand Training
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
© 2014 MapR Technologies 21
What is Hive?
• Data Warehouse on top of Hadoop
– Gives ability to query without programming
– Used for analytical querying of data
• SQL like execution for Hadoop
• SQL evaluates to MapReduce code
– Submits jobs to your cluster
© 2014 MapR Technologies 22
Using HBase as a MapReduce/Hive Source
EXAMPLE: Data Warehouse for Analytical Processing queries
Hive runs a MapReduce application (Hive Select … Join) over the HBase database and Files (HDFS/MapR-FS), producing a Query Result File
© 2014 MapR Technologies 23
Using HBase as a MapReduce or Hive Sink
EXAMPLE: bulk load data into a table
Hive runs a MapReduce application (Hive Insert … Select) that bulk loads Files (HDFS/MapR-FS) into the HBase database
© 2014 MapR Technologies 24
Using HBase as a Source & Sink
EXAMPLE: calculate and store summaries (a pre-computed, materialized view)
Hive runs a MapReduce application (Hive Select … Join) that reads from and writes summaries back to the HBase database
© 2014 MapR Technologies 25
Hive architecture:
– Interfaces: Command Line Interface, Web Interface, and JDBC/ODBC via the Thrift Server
– Driver (Compiler, Optimizer, Executor)
– Metastore: the schema metadata is stored in the Hive metastore, e.g., a Hive table definition mapped to the HBase trades_tall table
– Hadoop (MapReduce + HDFS): JobTracker and NameNode, with a DataNode + TaskTracker on each worker
© 2014 MapR Technologies 26
Hive HBase
The Hive metastore either points to existing HBase tables (external) or to Hive-managed tables.
© 2014 MapR Technologies 27
Hive HBase – External Table
CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:price#b,cf1:vol#b")
TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall");
The Hive table definition (key string, price bigint, vol bigint) points to the external HBase table /usr/user1/trades_tall:
key | cf1:price | cf1:vol
AMZN_986186008 | 12.34 | 1000
AMZN_986186007 | 12.00 | 50
© 2014 MapR Technologies 28
Hive HBase – Hive Query
SQL evaluates to MapReduce code
SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
Queries run against the HBase tables through a Parser -> Planner -> Execution pipeline.
© 2014 MapR Technologies 29
Hive HBase – External Table
SQL evaluates to MapReduce code:
SELECT AVG(price) FROM trades WHERE key LIKE "AMZN%";
– Selection: WHERE key LIKE …
– Projection: SELECT price
– Aggregation: AVG(price)
key | cf1:price | cf1:vol
AMZN_986186008 | 12.34 | 1000
AMZN_986186007 | 12.00 | 50
© 2014 MapR Technologies 30
Hive Query Plan
• EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
Filter Operator
predicate: (key like 'GOOG%') (type: boolean)
Select Operator
Group By Operator
Reduce Operator Tree:
Group By Operator
Select Operator
File Output Operator
© 2014 MapR Technologies 31
Hive Query Plan – (2)
hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%";
The Trades table is scanned and filtered (key like 'GOOG%'), price is selected, and the rows are grouped with the aggregation avg(price), producing the output _col0. The scan/filter/select steps run as map() tasks; the group-by and aggregation run as reduce() tasks.
© 2014 MapR Technologies 32
Hive Map Reduce
Region Region Region
scan key, row
reduce()
shuffle
reduce()
reduce()Map() Map() Map()
Query Result File
HBase
Hive Select Join
Hive Query
result result result
© 2014 MapR Technologies 33
Some Hive Design Patterns
• Summarization
– Select min(delay), max(delay), count(*) from flights group by
carrier;
• Filtering
– SELECT * FROM trades WHERE key LIKE "GOOG%";
– SELECT price FROM trades ORDER BY price DESC LIMIT 10;
• Join
SELECT tableA.field1, tableB.field2 FROM tableA
JOIN tableB
ON tableA.field1 = tableB.field2;
© 2014 MapR Technologies 34
What is a Directed Acyclic Graph (DAG)?
• Graph
– vertices (points) and edges (lines)
• Directed
– Only in a single direction
• Acyclic
– No looping
• This supports fault-tolerance
[Diagram: vertices A and B connected by directed edges]
© 2014 MapR Technologies 35
Hive Query Plan Map Reduce Execution
The operator tree (table scans of t1, reduce sinks RS1-RS4, aggregations AGG1 and AGG2, JOIN1, and the file sink FS1) initially executes as three MapReduce jobs (Job 1-3); the optimizer collapses the same tree into a single job.
© 2014 MapR Technologies 36
Iteration: the bane of MapReduce
Slow!
© 2014 MapR Technologies 37
Typical MapReduce Workflows
Input to Job 1 -> Job 1 (Maps, Reduces) -> SequenceFile (Output from Job 1 = Input to Job 2) -> Job 2 (Maps, Reduces) -> SequenceFile -> … -> Input to last job -> Last Job (Maps, Reduces) -> Output from last job. Every intermediate result is written to and read back from HDFS.
© 2014 MapR Technologies 38
Iterations
Step -> Step -> Step -> Step -> Step
In-memory Caching
• Data partitions are read from RAM instead of disk (sketched below)
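A hedged PySpark sketch of why caching matters for iteration (the path and the parsing are illustrative): without cache(), every pass re-reads the file from disk; with it, every pass after the first reads partitions from RAM.

data = (sc.textFile("hdfs:///data/points.txt")        # illustrative path
          .map(lambda line: float(line.split()[0]))   # parse one value per line
          .cache())

total = 0.0
for i in range(10):
    # every pass after the first is served from the in-memory cache
    total += data.map(lambda x: x * x).reduce(lambda a, b: a + b)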
© 2014 MapR Technologies 39
Free HBase On Demand Training
(includes Hive and MapReduce with HBase)
• https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
© 2014 MapR Technologies 40
Lab – Query HBase airline data with Hive
Import mapping to Row Key and Columns:
Row key: Carrier-FlightNumber-Date-Origin-Destination (e.g., AA-1-2014-01-01-JFK-LAX)
Column families: delay, info, stats, timing
Columns: aircraftdelay, arrdelay, carrierdelay, cncl, cnclcode, tailnum, distance, elaptime, arrtime, deptime
Example row AA-1-2014-01-01-JFK-LAX: 13, 0, N7704, 2475, 385.00, 359, …
© 2014 MapR Technologies 41
Count number of cancellations by reason (code)
$ hive
hive> explain select count(*) as
cancellations, cnclcode from flighttable
where cncl=1 group by cnclcode order by
cancellations asc limit 100;
1 row
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
Filter Operator
Select Operator
Group By Operator
aggregations: count()
Reduce Output Operator
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
Select Operator
File Output Operator
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
Reduce Operator Tree:
Extract
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Limit
File Output Operator
Stage: Stage-0
Fetch Operator
limit: 100
© 2014 MapR Technologies 42
2 MapReduce jobs
$ hive
hive> select count(*) as cancellations, cnclcode from flighttable where
cncl=1 group by cnclcode order by cancellations asc limit 100;
1 row
Total jobs = 2
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 13.3 sec MAPRFS Read: 0
MAPRFS Write: 0 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 1.52 sec MAPRFS Read: 0
MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 820 msec
OK
4598 C
7146 A
© 2014 MapR Technologies 43
Find the longest airline delays
$ hive
hive> select arrdelay,key from flighttable where arrdelay > 1000 order by
arrdelay desc limit 10;
1 row
MapReduce Jobs Launched:
Map: 1 Reduce: 1
OK
1530.0 AA-385-2014-01-18-BNA-DFW
1504.0 AA-1202-2014-01-15-ONT-DFW
1473.0 AA-1265-2014-01-05-CMH-LAX
1448.0 AA-1243-2014-01-21-IAD-DFW
1390.0 AA-1198-2014-01-11-PSP-DFW
1335.0 AA-1680-2014-01-21-SLC-DFW
1296.0 AA-1277-2014-01-21-BWI-DFW
1294.0 MQ-2894-2014-01-02-CVG-DFW
1201.0 MQ-3756-2014-01-01-CLT-MIA
1184.0 DL-2478-2014-01-10-BOS-ATL
© 2014 MapR Technologies 44
Apache Spark
© 2014 MapR Technologies 45
Apache Spark
spark.apache.org
github.com/apache/spark
user@spark.apache.org
• Originally developed in 2009 in UC Berkeley's AMP Lab
• Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
© 2014 MapR Technologies 46
Spark: Fast Big Data
• Rich APIs in Java, Scala, Python
• Interactive shell
• Fast to Run
– General execution graphs
– In-memory storage
• 2-5× less code
© 2014 MapR Technologies 47
The Spark Community
© 2014 MapR Technologies 48
Spark is the Most Active Open Source Project in Big Data
[Chart: project contributors in the past year, comparing Spark with Giraph, Storm, and Tez on a 0-140 scale]
© 2014 MapR Technologies 49
Unified Platform
Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine, which runs on Mesos or Hadoop YARN over a distributed file system (HDFS, MapR-FS, S3, …).
© 2014 MapR Technologies 50
Spark Use Cases
• Iterative algorithms on large amounts of data
• Anomaly detection
• Classification
• Predictions
• Recommendations
© 2014 MapR Technologies 51
Why Iterative Algorithms
• Algorithms that need iterations
– Clustering (K-Means, Canopy, …)
– Gradient descent (e.g., Logistic Regression, Matrix Factorization)
– Graph algorithms (e.g., PageRank, Line-Rank, components, paths, reachability, centrality, …)
– Alternating Least Squares (ALS)
– Graph communities / dense sub-components
– Inference (belief propagation)
– …
© 2014 MapR Technologies 52
Example: Logistic Regression
• Goal: find the best line separating two sets of points
[Diagram: a random initial line converging to the target line]
© 2014 MapR Technologies 53
Logistic Regression
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    # Iteration! Each pass reuses the cached points
    gradient = (data
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x)
        .reduce(lambda x, y: x + y))
    w -= gradient
print("Final w: %s" % w)
© 2014 MapR Technologies 54
Data Sources
• Local Files
– file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed Filesystem
– Regular files, sequence files, any other Hadoop InputFormat
• HBase
• other NoSQL data stores
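As a hedged illustration (the paths and bucket name are assumptions), each source is read the same way through the SparkContext:

logs = sc.textFile("file:///opt/httpd/logs/access_log")   # local file
events = sc.textFile("s3n://my-bucket/events/")           # S3 (illustrative bucket)
seq = sc.sequenceFile("hdfs:///data/seq")                 # Hadoop SequenceFile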
© 2014 MapR Technologies 55
How Spark Works
© 2014 MapR Technologies 56
Spark Programming Model
sc = new SparkContext
rDD = sc.textfile("hdfs://…")
rDD.map
The Driver Program creates the SparkContext, which connects to the cluster and runs Tasks on each Worker Node.
© 2014 MapR Technologies 57
Resilient Distributed Datasets (RDD)
Spark revolves around RDDs
• Fault-tolerant
• Read-only collection of elements
• Operated on in parallel
• Cached in memory, or on disk
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
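A small sketch of the in-memory vs. on-disk choice (the path is illustrative):

from pyspark import StorageLevel

lines = sc.textFile("hdfs:///data/input")
lines.cache()   # shorthand for persisting in memory only
# or, to allow spilling partitions to disk when memory is tight:
# lines.persist(StorageLevel.MEMORY_AND_DISK)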
© 2014 MapR Technologies 58
Working With RDDs
RDD
textFile = sc.textFile("SomeFile.txt")
© 2014 MapR Technologies 59
Working With RDDs
RDD
RDD
RDD
RDD
Transformations
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
textFile = sc.textFile("SomeFile.txt")
© 2014 MapR Technologies 60
Working With RDDs
RDD
RDD
RDD
RDD
Transformations
Action Value
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()
74
linesWithSpark.first()
# Apache Spark
textFile = sc.textFile("SomeFile.txt")
© 2014 MapR Technologies 61
MapR Tutorial: Getting Started with Spark on MapR Sandbox
• https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial
© 2014 MapR Technologies 62
Example Spark Word Count in Java
Input:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
Pairs: the, 1 / time, 1 / and, 1 / and, 1 / …
Counts: the, 20 / time, 4 / and, 12 / …
JavaRDD<String> input = sc.textFile(inputFile);
// Split each line into words
JavaRDD<String> words = input.flatMap(
new FlatMapFunction<String, String>() {
public Iterable<String> call(String x) {
return Arrays.asList(x.split(" "));
}});
// Turn the words into (word, 1) pairs
JavaPairRDD<String, Integer> word1s = words.mapToPair(
new PairFunction<String, String, Integer>(){
public Tuple2<String, Integer> call(String x){
return new Tuple2(x, 1);
}});
// reduceByKey: add the pairs by key to produce counts
JavaPairRDD<String, Integer> counts =word1s.reduceByKey(
new Function2<Integer, Integer, Integer>(){
public Integer call(Integer x, Integer y){
return x + y;
}});
.........
© 2014 MapR Technologies 63
Example Spark Word Count in Scala
Input:
"The time has come," the Walrus said,
"To talk of many things:
Of shoes—and ships—and sealing-wax
Pairs: the, 1 / time, 1 / and, 1 / and, 1 / …
Counts: the, 20 / time, 4 / and, 12 / …
// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
// Transform into pairs and count.
val counts = words
.map(word => (word, 1))
.reduceByKey{case (x, y) => x + y}
// Save the word count back out to a text file,
counts.saveAsTextFile(outputFile)
the, 20 time, 4 ….. and, 12
.........
© 2014 MapR Technologies 64
Example Spark Word Count in Scala
// Load input data.
val input = sc.textFile(inputFile)
textFile builds a HadoopRDD whose partitions are viewed through a MapPartitionsRDD.
© 2014 MapR Technologies 65
Example Spark Word Count in Scala
// Load our input data.
val input = sc.textFile(inputFile)
// Split it up into words.
val words = input.flatMap(line => line.split(" "))
Lineage so far: textFile -> flatMap: HadoopRDD -> MapPartitionsRDD -> MapPartitionsRDD
© 2014 MapR Technologies 66
FlatMap
flatMap
line => line.split(" ")
1-to-many mapping: "Ships and wax" becomes Ships / and / wax
JavaRDD<String> words
© 2014 MapR Technologies 67
Example Spark Word Count in Scala
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
// Transform into pairs
val counts = words.map(word => (word, 1))
Lineage: textFile -> flatMap -> map: HadoopRDD -> MapPartitionsRDD -> MapPartitionsRDD -> MapPartitionsRDD
© 2014 MapR Technologies 68
Map
map
word => (word, 1)
1-to-1 mapping: and becomes (and, 1)
JavaPairRDD<String, Integer> word1s
© 2014 MapR Technologies 69
Example Spark Word Count in Scala
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
.map(word => (word, 1))
.reduceByKey{case (x, y) => x + y}
Lineage: textFile -> flatMap -> map -> reduceByKey: HadoopRDD -> MapPartitionsRDD -> MapPartitionsRDD -> ShuffledRDD -> MapPartitionsRDD
© 2014 MapR Technologies 70
reduceByKey
case (x, y) => x + y
(and, 1), (and, 1), (wax, 1) becomes (and, 2), (wax, 1)
JavaPairRDD<String, Integer> counts
© 2014 MapR Technologies 71
Example Spark Word Count in Scala
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words
.map(word => (word, 1))
.reduceByKey{case (x, y) => x + y}
val countArray = counts.collect()
Lineage: textFile -> flatMap -> map -> reduceByKey -> collect: HadoopRDD -> MapPartitionsRDD -> MapPartitionsRDD -> MapPartitionsRDD -> ShuffledRDD -> Array
© 2014 MapR Technologies 72
Components Of Execution
© 2014 MapR Technologies 73
MapR Blog: Getting Started with the Spark Web UI
• https://www.mapr.com/blog/getting-started-spark-web-ui
© 2014 MapR Technologies 74
Spark RDD DAG -> Physical Execution plan
RDD Graph: sc.textfile(…) -> HadoopRDD -> flatmap -> MapPartitionsRDD -> map -> MapPartitionsRDD -> reduceByKey -> ShuffledRDD -> MapPartitionsRDD -> collect
Physical Plan: Stage 1 (everything up to the shuffle) and Stage 2 (after it)
© 2014 MapR Technologies 75
Physical Execution plan -> Stages and Tasks
The DAG of the physical plan is split into Stages (Stage 1, Stage 2), and each stage into Tasks, one per partition. The Task Scheduler submits each Task Set to Executors on the Worker Nodes, where a task thread processes its partition, reading HDFS/HFile blocks from the local Data Node and using the block cache.
© 2014 MapR Technologies 76
Summary of Components
• Task: unit of execution
• Stage: group of tasks
– Based on the partitions of the RDD
– Tasks run in parallel
• DAG: logical graph of RDD operations
• RDD: parallel dataset with partitions
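A quick way to see the partition/task relationship from the shell (a sketch; the path and partition count are illustrative):

rdd = sc.textFile("hdfs:///data/input", minPartitions=4)
print(rdd.getNumPartitions())   # each stage runs one task per partition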
© 2014 MapR Technologies 77
How a Spark application runs on a Hadoop cluster
The Driver Program on the client node creates the SparkContext (sc = new SparkContext; rDD = sc.textfile("hdfs://…"); rDD.map). The SparkContext requests resources from the YARN Resource Manager (coordinated via ZooKeeper); a YARN Node Manager on each Worker Node launches an Executor, whose task threads process cached partitions of the HDFS/HFile blocks stored on the local HDFS Data Node.
© 2014 MapR Technologies 78
Deploying Spark – Cluster Manager Types
• Standalone mode
• Mesos
• YARN
• EC2
• GCE
© 2014 MapR Technologies 79
Example: Log Mining
© 2014 MapR Technologies 80
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
(based on slides from Pat McDonough)

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()          # Action

Step by step across the Driver and the three Workers:
1. The driver sends tasks to the workers.
2. Each worker reads its HDFS block (Block 1, 2, 3).
3. Each worker processes its partition and caches the data (Cache 1, 2, 3).
4. Each worker returns its results to the driver.

messages.filter(lambda s: "php" in s).count()

5. For the second search the driver again sends tasks, but each worker now processes from its cache and returns results without touching HDFS.

Cache your data -> Faster Results
Full-text search of Wikipedia
• 60GB on 20 EC2 machines
• 0.5 sec from cache vs. 20s for on-disk
© 2014 MapR Technologies 98
Transformations and Actions
© 2014 MapR Technologies 99
RDD Transformations and Actions
[Diagram: Transformations build RDD -> RDD -> RDD -> RDD; an Action then returns a Value]
Transformations
(define a new RDD)
map
filter
sample
union
groupByKey
reduceByKey
join
cache
…
Actions
(return a value)
reduce
collect
count
save
lookupKey
…
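A hedged two-line illustration of the split (the values are illustrative):

rdd = sc.parallelize(range(10))
doubled = rdd.map(lambda x: 2 * x)          # transformation: defines a new RDD, lazy
print(doubled.reduce(lambda a, b: a + b))   # action: runs the job, returns 90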
© 2014 MapR Technologies 100
Basic Transformations
> nums = sc.parallelize([1, 2, 3])
# Pass each element through a function
> squares = nums.map(lambda x: x*x) # => {1, 4, 9}
# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0) # => {4}
# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))
> # => {0, 0, 1, 0, 1, 2}
# range(x) is a sequence of numbers 0, 1, …, x-1
© 2014 MapR Technologies 101
Basic Actions
> nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local collection
> nums.collect() # => [1, 2, 3]
# Return first K elements
> nums.take(2) # => [1, 2]
# Count number of elements
> nums.count() # => 3
# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y) # => 6
# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")
© 2014 MapR Technologies 102
RDD Fault Recovery
• RDDs track lineage information
• can be used to efficiently recompute lost data
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
HDFS File -> filter (func = startswith(…)) -> Filtered RDD -> map (func = split(…)) -> Mapped RDD
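The lineage is easy to inspect from the shell; toDebugString() prints the chain Spark would recompute from (a small sketch, continuing the msgs example above):

print(msgs.toDebugString())   # shows the map/filter chain back to the HDFS file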
© 2014 MapR Technologies 103
Passing a function to Spark
• Spark is based on Scala's anonymous function syntax
– (x: Int) => x *x
• Which is a shorthand for
new Function1[Int,Int] {
def apply(x: Int) = x * x
}
© 2014 MapR Technologies 104
DataFrames
© 2014 MapR Technologies 105
DataFrame
Distributed collection of data organized into named
columns
// Create the DataFrame
val df = sqlContext.read.json("person.json")
// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// |-- height: string (nullable = true)
// Select only the "name" column
df.select("name").show()
https://spark.apache.org/docs/latest/sql-programming-guide.html
© 2014 MapR Technologies 106
DataFrame RDD
• # data frame style
lineitems.groupby('customer').agg(Map(
'units' -> 'avg',
'totalPrice' -> 'std'
))
• # or SQL style
SELECT AVG(units), STD(totalPrice) FROM lineitems
GROUP BY customer
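A runnable PySpark sketch of the same aggregation (the lineitems DataFrame and its columns are the illustrative names above; stddev assumes a Spark version that ships it in pyspark.sql.functions):

from pyspark.sql import functions as F

summary = (lineitems.groupBy("customer")
           .agg(F.avg("units").alias("avg_units"),
                F.stddev("totalPrice").alias("std_price")))
summary.show()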
© 2014 MapR Technologies 107
Demo Interactive Shell
• Iterative Development
– Cache those RDDs
– Open the shell and ask questions
• We have all wished we could do this with MapReduce
– Compile / save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
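A hedged sketch of that loop in the pyspark shell, where sc is predefined (the log path and patterns are illustrative):

lines = sc.textFile("hdfs:///logs/app.log")
errors = lines.filter(lambda s: "ERROR" in s).cache()
errors.count()                                    # first question materializes the cache
errors.filter(lambda s: "timeout" in s).count()   # follow-ups are served from memory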
© 2014 MapR Technologies 108
MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data
• https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data
© 2014 MapR Technologies 109
The physical plan for DataFrames
© 2014 MapR Technologies 110
DataFrame Execution plan
// Print the physical plan to the console
auction.select("auctionid").distinct.explain()
== Physical Plan ==
Distinct false
Exchange (HashPartitioning [auctionid#0], 200)
Distinct true
Project [auctionid#0]
PhysicalRDD
[auctionid#0,bid#1,bidtime#2,bidder#3,
bidderrate#4,openbid#5,price#6,item#7,daystolive#8],
MapPartitionsRDD[11] at mapPartitions at
ExistingRDD.scala:37
© 2014 MapR Technologies 111
There’s a lot more !
© 2014 MapR Technologies 112
Unified Platform
Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph computation) all run on Spark, the general execution engine, which runs on Mesos or Hadoop YARN over a distributed file system (HDFS, MapR-FS, S3, …).
© 2014 MapR Technologies 113
Soon to Come
• Spark On Demand Training
– https://www.mapr.com/services/mapr-academy/
• Blogs and Tutorials:
– Movie Recommendations with Collaborative Filtering
– Spark Streaming
© 2014 MapR Technologies 114
Soon to Come
Blogs and Tutorials:
– Rewrite this Mahout example with Spark
© 2014 MapR Technologies 115
Examples and Resources
© 2014 MapR Technologies 116
Spark on MapR
• Certified Spark Distribution
• Fully supported and packaged by MapR in partnership with Databricks
– The mapr-spark package includes Spark, Shark, and Spark Streaming today
– Spark-python, GraphX, and MLlib soon
• YARN integration
– Spark can allocate resources from the cluster when needed
© 2014 MapR Technologies 117
References
• Spark web site: http://spark.apache.org/
• https://databricks.com/
• Spark on MapR:
– http://www.mapr.com/products/apache-spark
• Spark SQL and DataFrame Guide
• Apache Spark vs. MapReduce – Whiteboard Walkthrough
• Learning Spark - O'Reilly Book
• Apache Spark
© 2014 MapR Technologies 118
Q&A
@mapr maprtech
kbotzum@mapr.com
Engage with us!
MapR
maprtech
mapr-technologies
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Predicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project PresentationPredicting Employee Churn: A Data-Driven Approach Project Presentation
Predicting Employee Churn: A Data-Driven Approach Project Presentation
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 

Introduction to Spark on Hadoop

  • 14. © 2014 MapR Technologies 14 Inverted Index Example come, (alice.txt) do, (macbeth.txt) has, (alice.txt) time, (alice.txt, macbeth.txt) . . . "The time has come," the Walrus said alice.txt tis time to do it macbeth.txt time, alice.txt has, alice.txt come, alice.txt .. tis, macbeth.txt time, macbeth.txt do, macbeth.txt …
  • 15. © 2014 MapR Technologies 15 MapReduce Example: Inverted Index • Input: (filename, text) records • Output: list of files containing each word • Map: foreach word in text.split(): output(word, filename) • Combine: uniquify filenames for each word • Reduce: def reduce(word, filenames): output(word, sort(filenames))
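For comparison with the Spark material later in this deck, here is a hedged sketch of the same inverted index expressed as a few Spark transformations in Scala. The input directory and output path are placeholders; wholeTextFiles yields (filename, contents) pairs.

  import org.apache.spark.SparkContext._   // pair-RDD operations (needed on pre-1.3 Spark)

  val files = sc.wholeTextFiles("hdfs:///books")   // hypothetical input directory
  val index = files
    .flatMap { case (file, text) => text.split("\\s+").map(w => (w, file)) }
    .distinct()                       // the "combine: uniquify filenames" step
    .groupByKey()                     // word -> all files containing it
    .mapValues(_.toSeq.sorted)        // the "reduce: sort(filenames)" step
  index.saveAsTextFile("hdfs:///out/inverted-index")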
  • 16. © 2014 MapR Technologies 18 MapReduce: The Good • Built in fault tolerance • Optimized IO path • Scalable • Developer focuses on Map/Reduce, not infrastructure • simple? API
  • 17. © 2014 MapR Technologies 19 MapReduce: The Bad • Optimized for disk IO – Doesn’t leverage memory well – Iterative algorithms go through disk IO path again and again • Primitive API – simple abstraction – Key/Value in/out – basic things like join • require extensive code • Result often many files that need to be combined appropriately
  • 18. © 2014 MapR Technologies 20 Free Hadoop MapReduce On Demand Training • https://www.mapr.com/services/mapr-academy/big-data-hadoop- online-training
  • 19. © 2014 MapR Technologies 21 What is Hive? • Data Warehouse on top of Hadoop – Gives ability to query without programming – Used for analytical querying of data • SQL like execution for Hadoop • SQL evaluates to MapReduce code – Submits jobs to your cluster
  • 20. © 2014 MapR Technologies 22 Using HBase as a MapReduce/Hive Source EXAMPLE: Data Warehouse for Analytical Processing queries Hive runs MapReduce application Hive Select JoinHBase database Files (HDFS/MapR-FS) Query Result File
  • 21. © 2014 MapR Technologies 23 Using HBase as a MapReduce or Hive Sink EXAMPLE: bulk load data into a table Files (HDFS/MapR-FS) HBase databaseHive runs MapReduce application Hive Insert Select
  • 22. © 2014 MapR Technologies 24 Using HBase as a Source & Sink EXAMPLE: calculate and store summaries, Pre-Computed, Materialized View HBase database Hive Select Join Hive runs MapReduce application
  • 23. © 2014 MapR Technologies 25 Job Tracker Name Node HADOOP (MAP-REDUCE + HDFS) Data Node + Task Tracker Hive Metastore Driver (compiler, Optimizer, Executor) Command Line Interface Web Interface JDBC Thrift Server ODBC Metastore Hive The schema metadata is stored in the Hive metastore Hive Table definition HBase trades_tall Table
  • 24. © 2014 MapR Technologies 26 Hive HBase HBase Tables Hive metastore Points to Existing Hive Managed
  • 25. © 2014 MapR Technologies 27 Hive HBase – External Table CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping"= “:key,cf1:price#b,cf1:vol#b") TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall"); Points to External key string price bigint vol bigint key cf1:price cf1:vol AMZN_986186008 12.34 1000 AMZN_986186007 12.00 50 trades /usr/user1/trades_tall Hive Table definition HBaseTable
  • 26. © 2014 MapR Technologies 28 Hive HBase – Hive Query SQL evaluates to MapReduce code SELECT AVG(price) FROM trades WHERE key LIKE "GOOG” ; HBase Tables Queries Parser Planner Execution
  • 27. © 2014 MapR Technologies 29 Hive HBase – External Table key cf1:price cf1:vol AMZN_986186008 12.34 1000 AMZN_986186007 12.00 50 Selection WHERE key like SQL evaluates to MapReduce code SELECT AVG(price) FROM trades WHERE key LIKE “AMZN” ; Projection select price Aggregation Avg( price)
  • 28. © 2014 MapR Technologies 30 Hive Query Plan • EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Filter Operator predicate: (key like 'GOOG%') (type: boolean) Select Operator Group By Operator Reduce Operator Tree: Group By Operator Select Operator File Output Operator
  • 29. © 2014 MapR Technologies 31 Hive Query Plan – (2) output hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; col0 Trades table group aggregations: avg(price) scan filter Select key like 'GOOG% Select price Group by map() map() map() reduce() reduce()
  • 30. © 2014 MapR Technologies 32 Hive Map Reduce Region Region Region scan key, row reduce() shuffle reduce() reduce()Map() Map() Map() Query Result File HBase Hive Select Join Hive Query result result result
  • 31. © 2014 MapR Technologies 33 Some Hive Design Patterns • Summarization – Select min(delay), max(delay), count(*) from flights group by carrier; • Filtering – SELECT * FROM trades WHERE key LIKE "GOOG%"; – SELECT price FROM trades ORDER BY price DESC LIMIT 10; • Join SELECT tableA.field1, tableB.field2 FROM tableA JOIN tableB ON tableA.field1 = tableB.field2;
  • 32. © 2014 MapR Technologies 34 What is a Directed Acyclic Graph (DAG)? • Graph – vertices (points) and edges (lines) • Directed – Only in a single direction • Acyclic – No looping • This supports fault-tolerance
  • 33. © 2014 MapR Technologies 35 Hive Query Plan Map Reduce Execution FS1 AGG2 RS4 JOIN1 RS2 AGG1 RS1 t1 RS3 t1 Job 3 Job 2 FS1 AGG2 JOIN1 AGG1 RS1 t1 RS3Job 1 Job 1 Optimize
  • 34. © 2014 MapR Technologies 36 Slow! Iteration: the bane of MapReduce
  • 35. © 2014 MapR Technologies 37 Typical MapReduce Workflows Input to Job 1 SequenceFile Last Job Maps Reduces SequenceFile Job 1 Maps Reduces SequenceFile Job 2 Maps Reduces Output from Job 1 Output from Job 2 Input to last job Output from last job HDFS
  • 36. © 2014 MapR Technologies 38 Iterations Step Step Step Step Step In-memory Caching • Data Partitions read from RAM instead of disk
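To make the caching point concrete, a minimal spark-shell sketch (the path and the computation are placeholders): the input is scanned from HDFS once, and every later iteration reads the cached partitions from RAM.

  val points = sc.textFile("hdfs:///data/points.csv")   // hypothetical path
    .map(_.split(",").map(_.toDouble))
    .cache()                                            // keep partitions in memory

  var acc = 0.0
  for (i <- 1 to 10) {
    // after the first pass there is no disk IO: partitions come from the cache
    acc += points.map(p => p(0) * i).reduce(_ + _)
  }
  println(acc)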
  • 37. © 2014 MapR Technologies 39 Free HBase On Demand Training (includes Hive and MapReduce with HBase) • https://www.mapr.com/services/mapr-academy/big-data-hadoop- online-training
  • 38. © 2014 MapR Technologies 40 Lab – Query HBase airline data with Hive Import mapping to Row Key and Columns: Row-key Carrier- Flightnumber- Date- Origin- destination delay info stats timing Air Craft delay Arr delay Carrier delay cncl Cncl code tailnum distance elaptime arrtime Dep time AA-1-2014-01- 01-JFK-LAX 13 0 N7704 2475 385.00 359 …
  • 39. © 2014 MapR Technologies 41 Count number of cancellations by reason (code) $ hive hive> explain select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100; 1 row OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Filter Operator Select Operator Group By Operator aggregations: count() Reduce Output Operator Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) Select Operator File Output Operator Stage: Stage-2 Map Reduce Map Operator Tree: TableScan Reduce Output Operator Reduce Operator Tree: Extract Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Limit File Output Operator Stage: Stage-0 Fetch Operator limit: 100
  • 40. © 2014 MapR Technologies 42 2 MapReduce jobs $ hive hive> select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100; 1 row Total jobs = 2 MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 Cumulative CPU: 13.3 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS Job 1: Map: 1 Reduce: 1 Cumulative CPU: 1.52 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 14 seconds 820 msec OK 4598 C 7146 A
  • 41. © 2014 MapR Technologies 43 Find the longest airline delays $ hive hive> select arrdelay,key from flighttable where arrdelay > 1000 order by arrdelay desc limit 10; 1 row MapReduce Jobs Launched: Map: 1 Reduce: 1 OK 1530.0 AA-385-2014-01-18-BNA-DFW 1504.0 AA-1202-2014-01-15-ONT-DFW 1473.0 AA-1265-2014-01-05-CMH-LAX 1448.0 AA-1243-2014-01-21-IAD-DFW 1390.0 AA-1198-2014-01-11-PSP-DFW 1335.0 AA-1680-2014-01-21-SLC-DFW 1296.0 AA-1277-2014-01-21-BWI-DFW 1294.0 MQ-2894-2014-01-02-CVG-DFW 1201.0 MQ-3756-2014-01-01-CLT-MIA 1184.0 DL-2478-2014-01-10-BOS-ATL
  • 42. © 2014 MapR Technologies 44© 2014 MapR Technologies Apache Spark
  • 43. © 2014 MapR Technologies 45 Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
  • 44. © 2014 MapR Technologies 46 Spark: Fast Big Data – Rich APIs in Java, Scala, Python – Interactive shell • Fast to Run – General execution graphs – In-memory storage 2-5× less code
  • 45. © 2014 MapR Technologies 47 The Spark Community
  • 46. © 2014 MapR Technologies 48 Spark is the Most Active Open Source Project in Big Data [bar chart: project contributors in the past year, Spark vs. Giraph, Storm, Tez]
  • 47. © 2014 MapR Technologies 49 Spark SQL Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Mesos Distributed File System (HDFS, MapR-FS, S3, …) Hadoop YARN Unified Platform
  • 48. © 2014 MapR Technologies 50 Spark Use Cases • Iterative Algorithms on large amounts of data • Anomaly detection • Classification • Predictions • Recommendations
  • 49. © 2014 MapR Technologies 51 Why Iterative Algorithms • Algorithms that need iterations – Clustering (K-Means, Canopy, …) – Gradient descent (e.g., Logistic Regression, Matrix Factorization) – Graph Algorithms (e.g., PageRank, Line-Rank, components, paths, reachability, centrality, …) – Alternating Least Squares (ALS) – Graph communities / dense sub-components – Inference (belief propagation) – … 51
  • 50. © 2014 MapR Technologies 52 Example: Logistic Regression • Goal: find best line separating two sets of points target random initial line
  • 51. © 2014 MapR Technologies 53 data = spark.textFile(...).map(readPoint).cache() w = numpy.random.rand(D) for i in range(iterations): gradient = data .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) .reduce(lambda x, y: x + y) w -= gradient print “Final w: %s” % w Iteration! Logistic Regression
  • 52. © 2014 MapR Technologies 54 Data Sources • Local Files – file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem – Regular files, sequence files, any other Hadoop InputFormat • HBase • other NoSQL data stores
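A quick sketch of reading from the sources above (all paths are placeholders); sc.textFile accepts any URI scheme the underlying Hadoop client understands.

  val local = sc.textFile("file:///opt/httpd/logs/access_log")
  val dfs   = sc.textFile("hdfs:///user/alice/input.txt")
  val s3    = sc.textFile("s3n://my-bucket/input.txt")   // s3n was the usual 2014-era scheme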
  • 53. © 2014 MapR Technologies 55© 2014 MapR Technologies How Spark Works
  • 54. © 2014 MapR Technologies 56 Spark Programming Model sc=new SparkContext rDD=sc.textfile(“hdfs://…”) rDD.map Driver Program SparkContext cluster Worker Node Task Task Task Worker Node
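Spelled out as a complete driver program, the model on this slide looks roughly like the following; this is a sketch assuming a local master and a placeholder input path.

  import org.apache.spark.{SparkConf, SparkContext}

  object Driver {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("overview").setMaster("local[2]"))
      val rdd = sc.textFile("hdfs:///path/to/input")   // hypothetical path
      println(rdd.count())   // the action ships tasks to the worker nodes
      sc.stop()
    }
  }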
  • 55. © 2014 MapR Technologies 57 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • Fault-tolerant • read only collection of elements • operated on in parallel • Cached in memory • Or on disk http://www.cs.berkeley.edu/~matei/papers/ 2012/nsdi_spark.pdf
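"Cached in memory or on disk" maps onto RDD storage levels; a sketch with placeholder paths:

  import org.apache.spark.storage.StorageLevel

  val hot = sc.textFile("hdfs:///logs/today").cache()    // shorthand for MEMORY_ONLY
  val big = sc.textFile("hdfs:///logs/history")
    .persist(StorageLevel.MEMORY_AND_DISK)               // spill partitions that don't fit in RAM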
  • 56. © 2014 MapR Technologies 58 Working With RDDs RDD textFile = sc.textFile(”SomeFile.txt”)
  • 57. © 2014 MapR Technologies 59 Working With RDDs RDD RDD RDD RDD Transformations linesWithSpark = textFile.filter(lambda line: "Spark” in line) textFile = sc.textFile(”SomeFile.txt”)
  • 58. © 2014 MapR Technologies 60 Working With RDDs RDD RDD RDD RDD Transformations Action Value linesWithSpark = textFile.filter(lambda line: "Spark” in line) linesWithSpark.count() 74 linesWithSpark.first() # Apache Spark textFile = sc.textFile(”SomeFile.txt”)
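The same three steps in Scala, the form most of this deck uses (SomeFile.txt is a placeholder); nothing executes until the two actions at the end.

  val textFile = sc.textFile("SomeFile.txt")
  val linesWithSpark = textFile.filter(line => line.contains("Spark"))
  linesWithSpark.count()   // action: returns the number of matching lines
  linesWithSpark.first()   // action: returns the first matching line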
  • 59. © 2014 MapR Technologies 61 MapR Tutorial: Getting Started with Spark on MapR Sandbox • https://www.mapr.com/products/mapr-sandbox- hadoop/tutorials/spark-tutorial
  • 60. © 2014 MapR Technologies 62 Example Spark Word Count in Java ...the ... "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax andtime and the, 1 time, 1 and, 1 and, 1 and, 12time, 4 ...the, 20 JavaRDD<String> input = sc.textFile(inputFile); // Split each line into words JavaRDD<String> words = input.flatMap( new FlatMapFunction<String, String>() { public Iterable<String> call(String x) { return Arrays.asList(x.split(" ")); }}); // Turn the words into (word, 1) pairs JavaPairRDD<String, Integer> word1s = words.mapToPair( new PairFunction<String, String, Integer>(){ public Tuple2<String, Integer> call(String x){ return new Tuple2(x, 1); }}); // reduce: add the pairs by key to produce counts JavaPairRDD<String, Integer> counts = word1s.reduceByKey( new Function2<Integer, Integer, Integer>(){ public Integer call(Integer x, Integer y){ return x + y; }}); .........
  • 61. © 2014 MapR Technologies 63 Example Spark Word Count in Scala ...the ... "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax andtime and the, 1 time, 1 and, 1 and, 1 and, 12time, 4 ...the, 20 // Load our input data. val input = sc.textFile(inputFile) // Split it up into words. val words = input.flatMap(line => line.split(" ")) // Transform into pairs and count. val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} // Save the word count back out to a text file, counts.saveAsTextFile(outputFile) the, 20 time, 4 ….. and, 12 .........
  • 62. © 2014 MapR Technologies 64 Example Spark Word Count in Scala 64 HadoopRDD textFile // Load input data. val input = sc.textFile(inputFile) RDD partitions MapPartitionsRDD
  • 63. © 2014 MapR Technologies 65 Example Spark Word Count in Scala 65 // Load our input data. val input = sc.textFile(inputFile) // Split it up into words. val words = input.flatMap(line => line.split(" ")) HadoopRDD textFile flatmap MapPartitionsRDD MapPartitionsRDD
  • 64. © 2014 MapR Technologies 66 FlatMap flatMap line => line.split(" ")) 1 to many mapping ShipsShips and wax and wax JavaPairRDD<String> words
  • 65. © 2014 MapR Technologies 67 Example Spark Word Count in Scala 67 textFile flatmap map val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) // Transform into pairs val counts = words.map(word => (word, 1)) HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD
  • 66. © 2014 MapR Technologies 68 Map map word => (word, 1)) 1 to 1 mapping and and, 1 JavaPairRDD<String, Integer> word1s
  • 67. © 2014 MapR Technologies 69 Example Spark Word Count in Scala 69 textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} HadoopRDD MapPartitionsRDD MapPartitionsRDD ShuffledRDD MapPartitionsRDD
  • 68. © 2014 MapR Technologies 70 reduceByKey reduceByKey case (x, y) => x + y wax, 1 and, 1 and, 1 wax, 1 and, 2 JavaPairRDD<String, Integer> counts
  • 69. © 2014 MapR Technologies 71 Example Spark Word Count in Scala textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} val countArray = counts.collect() HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD collect ShuffledRDD Array
  • 70. © 2014 MapR Technologies 72© 2014 MapR Technologies Components Of Execution
  • 71. © 2014 MapR Technologies 73 MapR Blog: Getting Started with the Spark Web UI • https://www.mapr.com/blog/getting-started-spark-web-ui
  • 72. © 2014 MapR Technologies 74 Spark RDD DAG -> Physical Execution plan HadoopRDD sc.textfile(…) MapPartitionsRDD flatmap flatmap reduceByKey RDD Graph Physical Plan collect MapPartitionsRDD ShuffledRDD MapPartitionsRDD Stage 1 Stage 2
  • 73. © 2014 MapR Technologies 75 Physical Plan DAG Stage 1 Stage 2 Task Task Task Task Task Task Task Stage 1 Stage 2 Split into Tasks HFile HDFS Data Node Worker Node block cache partition Executor HFile block HFileHFile Task thread Task Set Task Scheduler Task Physical Execution plan -> Stages and Tasks
  • 74. © 2014 MapR Technologies 76 Summary of Components • Task : unit of execution • Stage: Group of Tasks – Base on partitions of RDD – Tasks run in parallel • DAG : Logical Graph of RDD operations • RDD : Parallel dataset with partitions 76
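A sketch of how these pieces line up on the word-count example: reduceByKey requires a shuffle, so the scheduler cuts the DAG there, producing two stages of tasks (the path is a placeholder).

  import org.apache.spark.SparkContext._             // pair-RDD operations (pre-1.3 Spark)

  val counts = sc.textFile("hdfs:///in/words.txt")   // hypothetical path
    .flatMap(_.split(" "))                           // Stage 1: narrow, pipelined
    .map(word => (word, 1))                          // Stage 1: still no shuffle
    .reduceByKey(_ + _)                              // shuffle boundary => Stage 2
  counts.collect()                                   // action: submits both stages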
  • 75. © 2014 MapR Technologies 77 How a Spark Application runs on a Hadoop cluster HFile HDFS Data Node Worker Node block cache partition task task Executor HFile block HFile HFile SparkContext zookeeper YARN Resource Manager HFile HDFS Data Node Worker Node block cache partition task task Executor HFile block HFile HFile Client node sc=new SparkContext rDD=sc.textfile(“hdfs://…”) rDD.map Driver Program Yarn Node Manager Yarn Node Manager
  • 76. © 2014 MapR Technologies 78 Deploying Spark – Cluster Manager Types • Standalone mode • Mesos • YARN • EC2 • GCE
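The cluster manager is chosen through the master URL handed to SparkConf; a sketch with placeholder host names:

  val conf = new org.apache.spark.SparkConf().setAppName("demo")
  conf.setMaster("local[4]")                // no cluster manager: 4 local threads
  // conf.setMaster("spark://master:7077")  // standalone mode
  // conf.setMaster("mesos://master:5050")  // Mesos
  // conf.setMaster("yarn-client")          // YARN (Spark 1.x-era syntax)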
  • 77. © 2014 MapR Technologies 79© 2014 MapR Technologies Example: Log Mining
  • 78. © 2014 MapR Technologies 80 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Based on slides from Pat McDonough at
  • 79. © 2014 MapR Technologies 81 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver
  • 80. © 2014 MapR Technologies 82 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile(“hdfs://...”)
  • 81. © 2014 MapR Technologies 83 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile(“hdfs://...”) Base RDD
  • 82. © 2014 MapR Technologies 84 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Worker Worker Driver
  • 83. © 2014 MapR Technologies 85 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) Worker Worker Worker Driver Transformed RDD
  • 84. © 2014 MapR Technologies 86 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count()
  • 85. © 2014 MapR Technologies 87 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count() Action
  • 86. © 2014 MapR Technologies 88 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3
  • 87. © 2014 MapR Technologies 89 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver tasks tasks tasks
  • 88. © 2014 MapR Technologies 90 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Read HDFS Block Read HDFS Block Read HDFS Block
  • 89. © 2014 MapR Technologies 91 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data
  • 90. © 2014 MapR Technologies 92 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 results results results
  • 91. © 2014 MapR Technologies 93 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count()
  • 92. © 2014 MapR Technologies 94 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() tasks tasks tasks Driver
  • 93. © 2014 MapR Technologies 95 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver Process from Cache Process from Cache Process from Cache
  • 94. © 2014 MapR Technologies 96 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver results results results
  • 95. © 2014 MapR Technologies 97 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile(“hdfs://...”) errors = lines.filter(lambda s: s.startswith(“ERROR”)) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: “mysql” in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: “php” in s).count() Driver Cache your data -> Faster Results Full-text search of Wikipedia • 60GB on 20 EC2 machines • 0.5 sec from cache vs. 20s for on-disk
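Collapsed into one runnable spark-shell session in Scala (the path is a placeholder, and the log format, tab-separated with the message in the third field, is assumed):

  val lines    = sc.textFile("hdfs:///logs/app.log")
  val errors   = lines.filter(_.startsWith("ERROR"))
  val messages = errors.map(_.split("\t")(2))
  messages.cache()
  messages.filter(_.contains("mysql")).count()   // first action: reads HDFS, fills the cache
  messages.filter(_.contains("php")).count()     // second action: served from the cache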
  • 96. © 2014 MapR Technologies 98© 2014 MapR Technologies Transformations and Actions
  • 97. © 2014 MapR Technologies 99 RDD Transformations and Actions RDD RDD RDD RDDTransformations Action Value Transformations (define a new RDD) map filter sample union groupByKey reduceByKey join cache … Actions (return a value) reduce collect count save lookupKey …
  • 98. © 2014 MapR Technologies 100 Basic Transformations > nums = sc.parallelize([1, 2, 3]) # Pass each element through a function > squares = nums.map(lambda x: x*x) // {1, 4, 9} # Keep elements passing a predicate > even = squares.filter(lambda x: x % 2 == 0) // {4} # Map each element to zero or more others > nums.flatMap(lambda x: range(x)) > # => {0, 0, 1, 0, 1, 2} Range object (sequence of numbers 0, 1, …, x-1)
  • 99. © 2014 MapR Technologies 101 Basic Actions > nums = sc.parallelize([1, 2, 3]) # Retrieve RDD contents as a local collection > nums.collect() # => [1, 2, 3] # Return first K elements > nums.take(2) # => [1, 2] # Count number of elements > nums.count() # => 3 # Merge elements with an associative function > nums.reduce(lambda x, y: x + y) # => 6 # Write elements to a text file > nums.saveAsTextFile(“hdfs://file.txt”)
  • 100. © 2014 MapR Technologies 102 RDD Fault Recovery • RDDs track lineage information • can be used to efficiently recompute lost data msgs = textFile.filter(lambda s: s.startswith(“ERROR”)) .map(lambda s: s.split("\t")[2]) HDFS File Filtered RDD Mapped RDD filter (func = startswith(…)) map (func = split(...))
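The lineage Spark keeps can be printed with RDD.toDebugString; this dependency chain is what gets replayed to rebuild only the lost partitions (the path is a placeholder).

  val msgs = sc.textFile("hdfs:///logs/app.log")
    .filter(_.startsWith("ERROR"))
    .map(_.split("\t")(2))
  println(msgs.toDebugString)   // prints the chain: MapPartitionsRDD <- MapPartitionsRDD <- HadoopRDD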
  • 101. © 2014 MapR Technologies 103 Passing a function to Spark • Spark is based on anonymous function syntax – (x: Int) => x * x • Which is shorthand for new Function1[Int,Int] { def apply(x: Int) = x * x } 103
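Either form, or a named method, can be handed to a transformation; a small sketch:

  val nums = sc.parallelize(1 to 5)
  val a = nums.map(x => x * x)      // anonymous function literal
  def square(x: Int): Int = x * x
  val b = nums.map(square)          // a named method works too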
  • 102. © 2014 MapR Technologies 104© 2014 MapR Technologies Dataframes
  • 103. © 2014 MapR Technologies 105 DataFrame Distributed collection of data organized into named columns // Create the DataFrame val df = sqlContext.read.json("person.json") // Print the schema in a tree format df.printSchema() // root // |-- age: long (nullable = true) // |-- name: string (nullable = true) // |-- height: string (nullable = true) // Select only the "name" column df.select("name").show() https://spark.apache.org/docs/latest/sql-programming-guide.html
  • 104. © 2014 MapR Technologies 106 DataFrame RDD • # data frame style lineitems.groupBy("customer").agg(Map("units" -> "avg", "totalPrice" -> "std")) • # or SQL style SELECT AVG(units), STD(totalPrice) FROM lineitems GROUP BY customer
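The two styles meet through temporary tables (Spark 1.x API; lineitems is a hypothetical DataFrame): register it once, then query it with SQL.

  lineitems.registerTempTable("lineitems")
  val avgUnits = sqlContext.sql(
    "SELECT customer, AVG(units) FROM lineitems GROUP BY customer")
  avgUnits.show()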
  • 105. © 2014 MapR Technologies 107 Demo Interactive Shell • Iterative Development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
  • 106. © 2014 MapR Technologies 108 MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data • https://www.mapr.com/blog/using-apache-spark-dataframes- processing-tabular-data
  • 107. © 2014 MapR Technologies 109 The physical plan for DataFrames
  • 108. © 2014 MapR Technologies 110 DataFrame Execution plan // Print the physical plan to the console auction.select("auctionid").distinct.explain() == Physical Plan == Distinct false Exchange (HashPartitioning [auctionid#0], 200) Distinct true Project [auctionid#0] PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3, bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
  • 109. © 2014 MapR Technologies 111© 2014 MapR Technologies There’s a lot more !
  • 110. © 2014 MapR Technologies 112 Spark SQL Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Mesos Distributed File System (HDFS, MapR-FS, S3, …) Hadoop YARN Unified Platform
  • 111. © 2014 MapR Technologies 113 Soon to Come • Spark On Demand Training – https://www.mapr.com/services/mapr-academy/ • Blogs and Tutorials: – Movie Recommendations with Collaborative Filtering – Spark Streaming
  • 112. © 2014 MapR Technologies 114 Soon to Come Blogs and Tutorials: – Re-write this mahout example with spark
  • 113. © 2014 MapR Technologies 115© 2014 MapR Technologies Examples and Resources
  • 114. © 2014 MapR Technologies 116 Spark on MapR • Certified Spark Distribution • Fully supported and packaged by MapR in partnership with Databricks – mapr-spark package with Spark, Shark, Spark Streaming today – Spark-python, GraphX and MLLib soon • YARN integration – Spark can then allocate resources from cluster when needed
  • 115. © 2014 MapR Technologies 117 References • Spark web site: http://spark.apache.org/ • https://databricks.com/ • Spark on MapR: – http://www.mapr.com/products/apache-spark • Spark SQL and DataFrame Guide • Apache Spark vs. MapReduce – Whiteboard Walkthrough • Learning Spark - O'Reilly Book • Apache Spark
  • 116. © 2014 MapR Technologies 118 Q&A @mapr maprtech kbotzum@mapr.com Engage with us! MapR maprtech mapr-technologies