Introduction to Spark on Hadoop
  1. 1. © 2014 MapR Technologies 1© 2014 MapR Technologies An Overview of Apache Spark
  2. 2. © 2014 MapR Technologies 2 Agenda • MapReduce Refresher • What is Spark? • The Difference with Spark • Examples and Resources
  3. 3. © 2014 MapR Technologies 3© 2014 MapR Technologies MapReduce Refresher
  4. 4. © 2014 MapR Technologies 4 MapReduce: A Programming Model • MapReduce: Simplified Data Processing on Large Clusters (published 2004) • Parallel and Distributed Algorithm: • Data Locality • Fault Tolerance • Linear Scalability
  5. 5. © 2014 MapR Technologies 5 The Hadoop Strategy http://developer.yahoo.com/hadoop/tutorial/module4.html Distribute data (share nothing) Distribute computation (parallelization without synchronization) Tolerate failures (no single point of failure) Node 1 Mapping process Node 2 Mapping process Node 3 Mapping process Node 1 Reducing process Node 2 Reducing process Node 3 Reducing process
  6. 6. © 2014 MapR Technologies 6 Chunks are replicated across the cluster Distribute Data: HDFS User process NameNode . . . network HDFS splits large data files into chunks (64 MB) metadata access physical data access Location metadata DataNodes store & retrieve data data
  7. 7. © 2014 MapR Technologies 7 Distribute Computation MapReduce Program Data Sources Hadoop Cluster Result
  8. 8. © 2014 MapR Technologies 8 MapReduce Basics • Foundational model is based on a distributed file system – Scalability and fault-tolerance • Map – Loading of the data and defining a set of keys • Many use cases do not utilize a reduce task • Reduce – Collects the organized key-based data to process and output • Performance can be tweaked based on known details of your source files and cluster shape (size, total number)
  9. 9. © 2014 MapR Technologies 9 MapReduce Execution and Data Flow [Diagram: on each node, files are loaded from HDFS stores and divided into splits; RecordReaders turn each split into input (k, v) pairs for the map tasks; a Partitioner emits intermediate (k, v) pairs, which the "shuffle" process exchanges between all nodes; each reduce side sorts its pairs, runs reduce to produce the final (k, v) pairs, and an OutputFormat writes them back to the local HDFS store]
  10. 10. © 2014 MapR Technologies 10 MapReduce Example: Word Count Output "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax the, 1 time, 1 has, 1 come, 1 … and, 1 … and, 1 … and, [1, 1, 1] come, [1,1,1] has, [1,1] the, [1,1,1] time, [1,1,1,1] … and, 12 come, 6 has, 8 the, 4 time, 14 … Input Map Shuffle and Sort Reduce Output Reduce
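To make the three phases concrete, here is a minimal single-process simulation in plain Scala. This is an illustrative sketch, not Hadoop code: the sample lines come from the slide above, and everything runs in one JVM rather than across a cluster.

    // Map phase: emit (word, 1) for every word in every input line.
    val input = Seq(
      "\"The time has come,\" the Walrus said,",
      "\"To talk of many things:",
      "Of shoes—and ships—and sealing-wax")
    val mapped = input
      .flatMap(_.toLowerCase.split("[^\\p{L}]+"))
      .filter(_.nonEmpty)
      .map(w => (w, 1))

    // Shuffle and sort: group every (word, 1) pair by its key.
    val shuffled = mapped.groupBy(_._1)   // e.g. "the" -> Seq(("the", 1), ("the", 1), ...)

    // Reduce phase: sum the 1s for each word.
    val counts = shuffled.map { case (w, pairs) => (w, pairs.map(_._2).sum) }
    counts.toSeq.sortBy(-_._2).foreach { case (w, c) => println(s"$w, $c") }

In Hadoop the same three steps run on different nodes, with the shuffle moving the grouped pairs over the network.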
  11. 11. © 2014 MapR Technologies 11 Tolerate Failures Hadoop Cluster Failures are expected & managed gracefully DataNode fails -> name node will locate replica MapReduce task fails -> job tracker will schedule another one Data
  12. 12. © 2014 MapR Technologies 12 MapReduce Processing Model • Define mappers • Shuffling is automatic • Define reducers • For complex work, chain jobs together – Use a higher level language or DSL that does this for you
  13. 13. © 2014 MapR Technologies 13 MapReduce Design Patterns • Summarization – Inverted index, counting • Filtering – Top ten, distinct • Aggregation • Data Organization – partitioning • Join – Join data sets • Metapattern – Job chaining
  14. 14. © 2014 MapR Technologies 14 Inverted Index Example come, (alice.txt) do, (macbeth.txt) has, (alice.txt) time, (alice.txt, macbeth.txt) . . . "The time has come," the Walrus said alice.txt tis time to do it macbeth.txt time, alice.txt has, alice.txt come, alice.txt .. tis, macbeth.txt time, macbeth.txt do, macbeth.txt …
  15. 15. © 2014 MapR Technologies 15 MapReduce Example: Inverted Index • Input: (filename, text) records • Output: list of files containing each word • Map: foreach word in text.split(): output(word, filename) • Combine: uniquify filenames for each word • Reduce: def reduce(word, filenames): output(word, sort(filenames))
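For comparison with the Spark API introduced later in this deck, the same inverted index as a short Spark (Scala) sketch. It assumes an existing SparkContext `sc`, and the two file names are the hypothetical inputs from the earlier slide:

    // Build (word, filename) postings per file, then group by word.
    val files = Seq("alice.txt", "macbeth.txt")        // hypothetical inputs from the slide
    val postings = files.map { name =>
      sc.textFile(name)
        .flatMap(_.split("\\s+"))
        .distinct()                                    // the "combine: uniquify" step
        .map(word => (word, name))
    }.reduce(_ union _)
    val index = postings.groupByKey().mapValues(_.toSeq.sorted)  // reduce: sort(filenames)
    index.collect().foreach { case (w, fs) => println(s"$w, (${fs.mkString(", ")})") }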
  16. 16. © 2014 MapR Technologies 18 MapReduce: The Good • Built-in fault tolerance • Optimized IO path • Scalable • Developer focuses on Map/Reduce, not infrastructure • simple? API
  17. 17. © 2014 MapR Technologies 19 MapReduce: The Bad • Optimized for disk IO – Doesn't leverage memory well – Iterative algorithms go through the disk IO path again and again • Primitive API – simple abstraction – Key/Value in/out – basic things like join require extensive code • Result is often many files that need to be combined appropriately
  18. 18. © 2014 MapR Technologies 20 Free Hadoop MapReduce On Demand Training • https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
  19. 19. © 2014 MapR Technologies 21 What is Hive? • Data Warehouse on top of Hadoop – Gives ability to query without programming – Used for analytical querying of data • SQL-like execution for Hadoop • SQL evaluates to MapReduce code – Submits jobs to your cluster
  20. 20. © 2014 MapR Technologies 22 Using HBase as a MapReduce/Hive Source EXAMPLE: Data Warehouse for Analytical Processing queries Hive runs MapReduce application Hive Select JoinHBase database Files (HDFS/MapR-FS) Query Result File
  21. 21. © 2014 MapR Technologies 23 Using HBase as a MapReduce or Hive Sink EXAMPLE: bulk load data into a table Files (HDFS/MapR-FS) HBase databaseHive runs MapReduce application Hive Insert Select
  22. 22. © 2014 MapR Technologies 24 Using HBase as a Source & Sink EXAMPLE: calculate and store summaries, Pre-Computed, Materialized View HBase database Hive Select Join Hive runs MapReduce application
  23. 23. © 2014 MapR Technologies 25 Job Tracker Name Node HADOOP (MAP-REDUCE + HDFS) Data Node + Task Tracker Hive Metastore Driver (compiler, Optimizer, Executor) Command Line Interface Web Interface JDBC Thrift Server ODBC Metastore Hive The schema metadata is stored in the Hive metastore Hive Table definition HBase trades_tall Table
  24. 24. © 2014 MapR Technologies 26 Hive HBase HBase Tables Hive metastore Points to Existing Hive Managed
  25. 25. © 2014 MapR Technologies 27 Hive HBase – External Table CREATE EXTERNAL TABLE trades(key string, price bigint, vol bigint) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping"= “:key,cf1:price#b,cf1:vol#b") TBLPROPERTIES ("hbase.table.name" = "/usr/user1/trades_tall"); Points to External key string price bigint vol bigint key cf1:price cf1:vol AMZN_986186008 12.34 1000 AMZN_986186007 12.00 50 trades /usr/user1/trades_tall Hive Table definition HBaseTable
  26. 26. © 2014 MapR Technologies 28 Hive HBase – Hive Query SQL evaluates to MapReduce code SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; HBase Tables Queries Parser Planner Execution
  27. 27. © 2014 MapR Technologies 29 Hive HBase – External Table key cf1:price cf1:vol AMZN_986186008 12.34 1000 AMZN_986186007 12.00 50 Selection WHERE key like SQL evaluates to MapReduce code SELECT AVG(price) FROM trades WHERE key LIKE "AMZN%"; Projection select price Aggregation Avg(price)
  28. 28. © 2014 MapR Technologies 30 Hive Query Plan • EXPLAIN SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Filter Operator predicate: (key like 'GOOG%') (type: boolean) Select Operator Group By Operator Reduce Operator Tree: Group By Operator Select Operator File Output Operator
  29. 29. © 2014 MapR Technologies 31 Hive Query Plan – (2) output hive> SELECT AVG(price) FROM trades WHERE key LIKE "GOOG%"; col0 Trades table group aggregations: avg(price) scan filter Select key like 'GOOG%' Select price Group by map() map() map() reduce() reduce()
  30. 30. © 2014 MapR Technologies 32 Hive Map Reduce Region Region Region scan key, row reduce() shuffle reduce() reduce()Map() Map() Map() Query Result File HBase Hive Select Join Hive Query result result result
  31. 31. © 2014 MapR Technologies 33 Some Hive Design Patterns • Summarization – Select min(delay), max(delay), count(*) from flights group by carrier; • Filtering – SELECT * FROM trades WHERE key LIKE "GOOG%"; – SELECT price FROM trades DESC LIMIT 10 ; • Join SELECT tableA.field1, tableB.field2 FROM tableA JOIN tableB ON tableA.field1 = tableB.field2;
  32. 32. © 2014 MapR Technologies 34 What is a Directed Acyclic Graph (DAG)? • Graph – vertices (points) and edges (lines) • Directed – Only in a single direction • Acyclic – No looping • This supports fault-tolerance [Diagram: two vertices A and B joined by a directed edge A → B]
  33. 33. © 2014 MapR Technologies 35 Hive Query Plan Map Reduce Execution [Diagram: an operator DAG over table t1 (RS1, AGG1, RS2, JOIN1, RS3, RS4, AGG2, FS1) is split into MapReduce Job 1, Job 2, and Job 3; after the Optimize step the same operators collapse into a single Job 1]
  34. 34. © 2014 MapR Technologies 36 Slow ! Iteration: the bane of MapReduce
  35. 35. © 2014 MapR Technologies 37 Typical MapReduce Workflows Input to Job 1 SequenceFile Last Job Maps Reduces SequenceFile Job 1 Maps Reduces SequenceFile Job 2 Maps Reduces Output from Job 1 Output from Job 2 Input to last job Output from last job HDFS
  36. 36. © 2014 MapR Technologies 38 Iterations Step Step Step Step Step In-memory Caching • Data Partitions read from RAM instead of disk
  37. 37. © 2014 MapR Technologies 39 Free HBase On Demand Training (includes Hive and MapReduce with HBase) • https://www.mapr.com/services/mapr-academy/big-data-hadoop-online-training
  38. 38. © 2014 MapR Technologies 40 Lab – Query HBase airline data with Hive Import mapping to Row Key and Columns: Row key: Carrier-FlightNumber-Date-Origin-Destination (e.g., AA-1-2014-01-01-JFK-LAX); column families: delay, info, stats, timing; columns include aircraft delay, arr delay, carrier delay, cncl, cncl code, tailnum, distance, elaptime, arrtime, dep time; sample row: AA-1-2014-01-01-JFK-LAX → 13, 0, N7704, 2475, 385.00, 359, …
  39. 39. © 2014 MapR Technologies 41 Count number of cancellations by reason (code) $ hive hive> explain select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100; 1 row OK STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan Filter Operator Select Operator Group By Operator aggregations: count() Reduce Output Operator Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) Select Operator File Output Operator Stage: Stage-2 Map Reduce Map Operator Tree: TableScan Reduce Output Operator Reduce Operator Tree: Extract Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE Limit File Output Operator Stage: Stage-0 Fetch Operator limit: 100
  40. 40. © 2014 MapR Technologies 42 2 MapReduce jobs $ hive hive> select count(*) as cancellations, cnclcode from flighttable where cncl=1 group by cnclcode order by cancellations asc limit 100; 1 row Total jobs = 2 MapReduce Jobs Launched: Job 0: Map: 1 Reduce: 1 Cumulative CPU: 13.3 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS Job 1: Map: 1 Reduce: 1 Cumulative CPU: 1.52 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 14 seconds 820 msec OK 4598 C 7146 A
  41. 41. © 2014 MapR Technologies 43 Find the longest airline delays $ hive hive> select arrdelay,key from flighttable where arrdelay > 1000 order by arrdelay desc limit 10; 1 row MapReduce Jobs Launched: Map: 1 Reduce: 1 OK 1530.0 AA-385-2014-01-18-BNA-DFW 1504.0 AA-1202-2014-01-15-ONT-DFW 1473.0 AA-1265-2014-01-05-CMH-LAX 1448.0 AA-1243-2014-01-21-IAD-DFW 1390.0 AA-1198-2014-01-11-PSP-DFW 1335.0 AA-1680-2014-01-21-SLC-DFW 1296.0 AA-1277-2014-01-21-BWI-DFW 1294.0 MQ-2894-2014-01-02-CVG-DFW 1201.0 MQ-3756-2014-01-01-CLT-MIA 1184.0 DL-2478-2014-01-10-BOS-ATL
  42. 42. © 2014 MapR Technologies 44© 2014 MapR Technologies Apache Spark
  43. 43. © 2014 MapR Technologies 45 Apache Spark spark.apache.org github.com/apache/spark user@spark.apache.org • Originally developed in 2009 in UC Berkeley’s AMP Lab • Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation
  44. 44. © 2014 MapR Technologies 46 Spark: Fast Big Data • Rich APIs in Java, Scala, Python • Interactive shell • 2-5× less code • Fast to Run – General execution graphs – In-memory storage
  45. 45. © 2014 MapR Technologies 47 The Spark Community
  46. 46. © 2014 MapR Technologies 48 Spark is the Most Active Open Source Project in Big Data [Bar chart: project contributors in the past year for Spark, Giraph, Storm, and Tez, on a 0–140 scale]
  47. 47. © 2014 MapR Technologies 49 Spark SQL Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Mesos Distributed File System (HDFS, MapR-FS, S3, …) Hadoop YARN Unified Platform
  48. 48. © 2014 MapR Technologies 50 Spark Use Cases • Iterative Algorithms on large amounts of data • Anomaly detection • Classification • Predictions • Recommendations
  49. 49. © 2014 MapR Technologies 51 Why Iterative Algorithms • Algorithms that need iterations – Clustering (K-Means, Canopy, …) – Gradient descent (e.g., Logistic Regression, Matrix Factorization) – Graph Algorithms (e.g., PageRank, Line-Rank, components, paths, reachability, centrality, …) – Alternating Least Squares (ALS) – Graph communities / dense sub-components – Inference (belief propagation) – …
  50. 50. © 2014 MapR Technologies 52 Example: Logistic Regression • Goal: find best line separating two sets of points target random initial line
  51. 51. © 2014 MapR Technologies 53 Logistic Regression
      data = spark.textFile(...).map(readPoint).cache()
      w = numpy.random.rand(D)
      for i in range(iterations):
          gradient = data \
              .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x) \
              .reduce(lambda x, y: x + y)
          w -= gradient
      print "Final w: %s" % w
      (The "Iteration!" callout on the slide marks the loop body: caching `data` lets each pass read from memory instead of disk.)
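A rough Scala analogue of the Python loop above, for readers following the Scala examples. This is a sketch under stated assumptions: an existing SparkContext `sc`, a hypothetical input file of space-separated "label feature1 feature2 ..." lines, and the same per-point gradient expression as the slide:

    def dot(a: Array[Double], b: Array[Double]): Double =
      a.indices.foldLeft(0.0)((s, i) => s + a(i) * b(i))

    case class Point(x: Array[Double], y: Double)
    // Assumed line format: "label feature1 feature2 ..." (hypothetical)
    def readPoint(line: String): Point = {
      val t = line.split("\\s+").map(_.toDouble)
      Point(t.drop(1), t.head)
    }

    val iterations = 10
    val data = sc.textFile("points.txt").map(readPoint).cache() // cached: re-read from RAM each pass
    var w = Array.fill(2)(scala.util.Random.nextDouble())       // D = 2 features (assumed)

    for (_ <- 1 to iterations) {
      val gradient = data.map { p =>
        // Same per-point expression as the slide's lambda
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot(w, p.x)))) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.indices.map(i => a(i) + b(i)).toArray)
      w = w.indices.map(i => w(i) - gradient(i)).toArray
    }
    println("Final w: " + w.mkString(", "))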
  52. 52. © 2014 MapR Technologies 54 Data Sources • Local Files – file:///opt/httpd/logs/access_log • S3 • Hadoop Distributed Filesystem – Regular files, sequence files, any other Hadoop InputFormat • HBase • other NoSQL data stores
  53. 53. © 2014 MapR Technologies 55© 2014 MapR Technologies How Spark Works
  54. 54. © 2014 MapR Technologies 56 Spark Programming Model sc=new SparkContext rDD=sc.textfile(“hdfs://…”) rDD.map Driver Program SparkContext cluster Worker Node Task Task Task Worker Node
  55. 55. © 2014 MapR Technologies 57 Resilient Distributed Datasets (RDD) Spark revolves around RDDs • Fault-tolerant • read-only collection of elements • operated on in parallel • Cached in memory • Or on disk http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  56. 56. © 2014 MapR Technologies 58 Working With RDDs RDD textFile = sc.textFile(”SomeFile.txt”)
  57. 57. © 2014 MapR Technologies 59 Working With RDDs RDD RDD RDD RDD Transformations linesWithSpark = textFile.filter(lambda line: "Spark” in line) textFile = sc.textFile(”SomeFile.txt”)
  58. 58. © 2014 MapR Technologies 60 Working With RDDs RDD RDD RDD RDD Transformations Action Value linesWithSpark = textFile.filter(lambda line: "Spark” in line) linesWithSpark.count() 74 linesWithSpark.first() # Apache Spark textFile = sc.textFile(”SomeFile.txt”)
  59. 59. © 2014 MapR Technologies 61 MapR Tutorial: Getting Started with Spark on MapR Sandbox • https://www.mapr.com/products/mapr-sandbox-hadoop/tutorials/spark-tutorial
  60. 60. © 2014 MapR Technologies 62 Example Spark Word Count in Java
      Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax
      Pairs: the, 1  time, 1  and, 1  and, 1 …  Counts: and, 12  time, 4  the, 20 …
      JavaRDD<String> input = sc.textFile(inputFile);
      // Split each line into words
      JavaRDD<String> words = input.flatMap(
          new FlatMapFunction<String, String>() {
              public Iterable<String> call(String x) {
                  return Arrays.asList(x.split(" "));
              }});
      // Turn the words into (word, 1) pairs
      JavaPairRDD<String, Integer> word1s = words.mapToPair(
          new PairFunction<String, String, Integer>() {
              public Tuple2<String, Integer> call(String x) {
                  return new Tuple2(x, 1);
              }});
      // Reduce: add the pairs by key to produce counts
      JavaPairRDD<String, Integer> counts = word1s.reduceByKey(
          new Function2<Integer, Integer, Integer>() {
              public Integer call(Integer x, Integer y) {
                  return x + y;
              }});
  61. 61. © 2014 MapR Technologies 63 Example Spark Word Count in Scala
      Input: "The time has come," the Walrus said, "To talk of many things: Of shoes—and ships—and sealing-wax
      Pairs: the, 1  time, 1  and, 1  and, 1 …  Counts: and, 12  time, 4  the, 20 …
      // Load our input data.
      val input = sc.textFile(inputFile)
      // Split it up into words.
      val words = input.flatMap(line => line.split(" "))
      // Transform into pairs and count.
      val counts = words
        .map(word => (word, 1))
        .reduceByKey{case (x, y) => x + y}
      // Save the word count back out to a text file.
      counts.saveAsTextFile(outputFile)
  62. 62. © 2014 MapR Technologies 64 Example Spark Word Count in Scala 64 HadoopRDD textFile // Load input data. val input = sc.textFile(inputFile) RDD partitions MapPartitionsRDD
  63. 63. © 2014 MapR Technologies 65 Example Spark Word Count in Scala 65 // Load our input data. val input = sc.textFile(inputFile) // Split it up into words. val words = input.flatMap(line => line.split(" ")) HadoopRDD textFile flatmap MapPartitionsRDD MapPartitionsRDD
  64. 64. © 2014 MapR Technologies 66 FlatMap flatMap(line => line.split(" ")) 1-to-many mapping: "Ships and wax" → Ships, and, wax JavaRDD<String> words
  65. 65. © 2014 MapR Technologies 67 Example Spark Word Count in Scala 67 textFile flatmap map val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) // Transform into pairs val counts = words.map(word => (word, 1)) HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD
  66. 66. © 2014 MapR Technologies 68 Map map(word => (word, 1)) 1-to-1 mapping: and → (and, 1) JavaPairRDD<String, Integer> word1s
  67. 67. © 2014 MapR Technologies 69 Example Spark Word Count in Scala 69 textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} HadoopRDD MapPartitionsRDD MapPartitionsRDD ShuffledRDD MapPartitionsRDD
  68. 68. © 2014 MapR Technologies 70 reduceByKey reduceByKey{case (x, y) => x + y} merges the values for each key: (and, 1), (and, 1) → (and, 2); (wax, 1), (wax, 1) → (wax, 2) JavaPairRDD<String, Integer> counts
  69. 69. © 2014 MapR Technologies 71 Example Spark Word Count in Scala textFile flatmap map reduceByKey val input = sc.textFile(inputFile) val words = input.flatMap(line => line.split(" ")) val counts = words .map(word => (word, 1)) .reduceByKey{case (x, y) => x + y} val countArray = counts.collect() HadoopRDD MapPartitionsRDD MapPartitionsRDD MapPartitionsRDD collect ShuffledRDD Array
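The slides assume an existing SparkContext. For reference, a self-contained runnable version of the same word count (a sketch; the input path and app name are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        // Local mode for experimentation; on a cluster the master comes from spark-submit.
        val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))
        val counts = sc.textFile("input.txt")            // hypothetical input path
          .flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.collect().foreach(println)                // small results only; use saveAsTextFile otherwise
        sc.stop()
      }
    }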
  70. 70. © 2014 MapR Technologies 72© 2014 MapR Technologies Components Of Execution
  71. 71. © 2014 MapR Technologies 73 MapR Blog: Getting Started with the Spark Web UI • https://www.mapr.com/blog/getting-started-spark-web-ui
  72. 72. © 2014 MapR Technologies 74 Spark RDD DAG -> Physical Execution plan HadoopRDD sc.textfile(…) MapPartitionsRDD flatmap flatmap reduceByKey RDD Graph Physical Plan collect MapPartitionsRDD ShuffledRDD MapPartitionsRDD Stage 1 Stage 2
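One way to see this RDD graph from the shell is `toDebugString` (a standard RDD method); the indentation in its output marks the shuffle boundary between the two stages. A quick sketch, with a hypothetical path:

    val counts = sc.textFile("input.txt")   // hypothetical path
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    // Prints the lineage: the ShuffledRDD on top, then the indented
    // MapPartitionsRDD / HadoopRDD chain that forms the first stage.
    println(counts.toDebugString)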
  73. 73. © 2014 MapR Technologies 75 Physical Execution Plan -> Stages and Tasks [Diagram: the DAG's physical plan is split into Stage 1 and Stage 2, and each stage into a set of Tasks; the Task Scheduler sends task sets to Executors on worker nodes, where each task thread processes a cached partition backed by HFile blocks on the local HDFS Data Node]
  74. 74. © 2014 MapR Technologies 76 Summary of Components • Task : unit of execution • Stage: Group of Tasks – Base on partitions of RDD – Tasks run in parallel • DAG : Logical Graph of RDD operations • RDD : Parallel dataset with partitions 76
  75. 75. © 2014 MapR Technologies 77 How a Spark Application runs on a Hadoop cluster [Diagram: on the client node, the driver program creates a SparkContext (sc = new SparkContext; rDD = sc.textfile("hdfs://…"); rDD.map) and coordinates with the YARN Resource Manager and ZooKeeper; each worker node runs a YARN Node Manager and an Executor whose tasks operate on cached partitions, reading blocks (HFiles) from the local HDFS Data Node]
  76. 76. © 2014 MapR Technologies 78 Deploying Spark – Cluster Manager Types • Standalone mode • Mesos • YARN • EC2 • GCE
  77. 77. © 2014 MapR Technologies 79© 2014 MapR Technologies Example: Log Mining
  78. 78. © 2014 MapR Technologies 80 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Based on slides from Pat McDonough at Databricks
  79. 79. © 2014 MapR Technologies 81 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver
  80. 80. © 2014 MapR Technologies 82 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile("hdfs://...")
  81. 81. © 2014 MapR Technologies 83 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Worker Worker Worker Driver lines = spark.textFile("hdfs://...") Base RDD
  82. 82. © 2014 MapR Technologies 84 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) Worker Worker Worker Driver
  83. 83. © 2014 MapR Technologies 85 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) Worker Worker Worker Driver Transformed RDD
  84. 84. © 2014 MapR Technologies 86 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: "mysql" in s).count()
  85. 85. © 2014 MapR Technologies 87 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: "mysql" in s).count() Action
  86. 86. © 2014 MapR Technologies 88 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker Driver messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3
  87. 87. © 2014 MapR Technologies 89 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Driver tasks tasks tasks
  88. 88. © 2014 MapR Technologies 90 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Driver Read HDFS Block Read HDFS Block Read HDFS Block
  89. 89. © 2014 MapR Technologies 91 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data
  90. 90. © 2014 MapR Technologies 92 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 results results results
  91. 91. © 2014 MapR Technologies 93 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count()
  92. 92. © 2014 MapR Technologies 94 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count() tasks tasks tasks Driver
  93. 93. © 2014 MapR Technologies 95 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count() Driver Process from Cache Process from Cache Process from Cache
  94. 94. © 2014 MapR Technologies 96 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count() Driver results results results
  95. 95. © 2014 MapR Technologies 97 Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns lines = spark.textFile("hdfs://...") errors = lines.filter(lambda s: s.startswith("ERROR")) messages = errors.map(lambda s: s.split("\t")[2]) messages.cache() Worker Worker Worker messages.filter(lambda s: "mysql" in s).count() Block 1 Block 2 Block 3 Cache 1 Cache 2 Cache 3 messages.filter(lambda s: "php" in s).count() Driver Cache your data → Faster Results Full-text search of Wikipedia • 60GB on 20 EC2 machines • 0.5 sec from cache vs. 20s for on-disk
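For readers following the Scala examples, the whole pipeline from this build-up, consolidated. A sketch: the slides' Python uses an elided HDFS path, so a hypothetical one stands in here:

    val lines = sc.textFile("hdfs:///logs/app.log")        // hypothetical path
    val errors = lines.filter(_.startsWith("ERROR"))
    val messages = errors.map(_.split("\t")(2))
    messages.cache()                                        // parsed messages stay in executor RAM
    messages.filter(_.contains("mysql")).count()            // first action: reads HDFS, fills the cache
    messages.filter(_.contains("php")).count()              // later queries run from the cache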
  96. 96. © 2014 MapR Technologies 98© 2014 MapR Technologies Transformations and Actions
  97. 97. © 2014 MapR Technologies 99 RDD Transformations and Actions RDD RDD RDD RDDTransformations Action Value Transformations (define a new RDD) map filter sample union groupByKey reduceByKey join cache … Actions (return a value) reduce collect count save lookupKey …
  98. 98. © 2014 MapR Technologies 100 Basic Transformations
      > nums = sc.parallelize([1, 2, 3])
      # Pass each element through a function
      > squares = nums.map(lambda x: x*x) # => {1, 4, 9}
      # Keep elements passing a predicate
      > even = squares.filter(lambda x: x % 2 == 0) # => {4}
      # Map each element to zero or more others
      > nums.flatMap(lambda x: range(x)) # => {0, 0, 1, 0, 1, 2}
      # range(x) is a sequence of numbers 0, 1, …, x-1
  99. 99. © 2014 MapR Technologies 101 Basic Actions
      > nums = sc.parallelize([1, 2, 3])
      # Retrieve RDD contents as a local collection
      > nums.collect() # => [1, 2, 3]
      # Return first K elements
      > nums.take(2) # => [1, 2]
      # Count number of elements
      > nums.count() # => 3
      # Merge elements with an associative function
      > nums.reduce(lambda x, y: x + y) # => 6
      # Write elements to a text file
      > nums.saveAsTextFile("hdfs://file.txt")
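The same operations in the Scala shell, for readers following the Scala examples (a quick sketch; results shown as comments):

    val nums = sc.parallelize(Seq(1, 2, 3))
    nums.map(x => x * x).collect()                      // Array(1, 4, 9)
    nums.map(x => x * x).filter(_ % 2 == 0).collect()   // Array(4)
    nums.flatMap(x => 0 until x).collect()              // Array(0, 0, 1, 0, 1, 2)
    nums.take(2)                                        // Array(1, 2)
    nums.count()                                        // 3
    nums.reduce(_ + _)                                  // 6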
  100. 100. © 2014 MapR Technologies 102 RDD Fault Recovery • RDDs track lineage information • can be used to efficiently recompute lost data msgs = textFile.filter(lambda s: s.startswith("ERROR")) .map(lambda s: s.split("\t")[2]) HDFS File → Filtered RDD → Mapped RDD: filter (func = startswith(…)), map (func = split(...))
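The same lineage written in the Scala shell. If an executor is lost, Spark rebuilds only the lost partitions by re-running these two transformations on the corresponding input blocks (a sketch; path hypothetical):

    val textFile = sc.textFile("hdfs:///logs/app.log")   // hypothetical path
    val msgs = textFile
      .filter(_.startsWith("ERROR"))                     // Filtered RDD
      .map(_.split("\t")(2))                             // Mapped RDD
    msgs.cache()
    msgs.count()   // computes and caches; lost cache partitions are recomputed from lineage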
  101. 101. © 2014 MapR Technologies 103 Passing a function to Spark • Spark is based on anonymous function syntax – (x: Int) => x * x • Which is shorthand for new Function1[Int, Int] { def apply(x: Int) = x * x }
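A few equivalent ways to write that in the shell (a quick sketch; `sc` comes from spark-shell):

    val nums = sc.parallelize(1 to 5)
    nums.map((x: Int) => x * x)      // full anonymous-function syntax
    nums.map(x => x * x)             // parameter type inferred
    def square(x: Int): Int = x * x
    nums.map(square)                 // a named function can be passed too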
  102. 102. © 2014 MapR Technologies 104© 2014 MapR Technologies Dataframes
  103. 103. © 2014 MapR Technologies 105 DataFrame Distributed collection of data organized into named columns // Create the DataFrame val df = sqlContext.read.json("person.json") // Print the schema in a tree format df.printSchema() // root // |-- age: long (nullable = true) // |-- name: string (nullable = true) // |-- height: string (nullable = true) // Select only the "name" column df.select("name").show() https://spark.apache.org/docs/latest/sql-programming-guide.html
  104. 104. © 2014 MapR Technologies 106 DataFrame RDD • # data frame style lineitems.groupby('customer').agg(Map( 'units' -> 'avg', 'totalPrice' -> 'std' )) • # or SQL style SELECT AVG(units), STD(totalPrice) FROM lineitems GROUP BY customer
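A runnable Scala version of the two styles above. This is a sketch: the `lineitems` source and its columns (customer, units, totalPrice) are assumed from the slide, and the `stddev` function needs Spark 1.6 or later:

    import org.apache.spark.sql.functions.{avg, stddev}
    val lineitems = sqlContext.read.json("lineitems.json")   // hypothetical input
    // DataFrame style
    lineitems.groupBy("customer")
      .agg(avg("units"), stddev("totalPrice"))
      .show()
    // SQL style
    lineitems.registerTempTable("lineitems")
    sqlContext.sql(
      "SELECT customer, AVG(units), STDDEV(totalPrice) FROM lineitems GROUP BY customer").show()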
  105. 105. © 2014 MapR Technologies 107 Demo Interactive Shell • Iterative Development – Cache those RDDs – Open the shell and ask questions • We have all wished we could do this with MapReduce – Compile / save your code for scheduled jobs later • Scala – spark-shell • Python – pyspark
  106. 106. © 2014 MapR Technologies 108 MapR Blog: Using Apache Spark DataFrames for Processing of Tabular Data • https://www.mapr.com/blog/using-apache-spark-dataframes-processing-tabular-data
  107. 107. © 2014 MapR Technologies 109 The physical plan for DataFrames
  108. 108. © 2014 MapR Technologies 110 DataFrame Execution plan
      // Print the physical plan to the console
      auction.select("auctionid").distinct.explain()
      == Physical Plan ==
      Distinct false
       Exchange (HashPartitioning [auctionid#0], 200)
        Distinct true
         Project [auctionid#0]
          PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3, bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
  109. 109. © 2014 MapR Technologies 111© 2014 MapR Technologies There’s a lot more !
  110. 110. © 2014 MapR Technologies 112 Spark SQL Spark Streaming (Streaming) MLlib (Machine learning) Spark (General execution engine) GraphX (Graph computation) Mesos Distributed File System (HDFS, MapR-FS, S3, …) Hadoop YARN Unified Platform
  111. 111. © 2014 MapR Technologies 113 Soon to Come • Spark On Demand Training – https://www.mapr.com/services/mapr-academy/ • Blogs and Tutorials: – Movie Recommendations with Collaborative Filtering – Spark Streaming
  112. 112. © 2014 MapR Technologies 114 Soon to Come Blogs and Tutorials: – Re-write this mahout example with spark
  113. 113. © 2014 MapR Technologies 115© 2014 MapR Technologies Examples and Resources
  114. 114. © 2014 MapR Technologies 116 Spark on MapR • Certified Spark Distribution • Fully supported and packaged by MapR in partnership with Databricks – mapr-spark package with Spark, Shark, Spark Streaming today – Spark-python, GraphX and MLlib soon • YARN integration – Spark can then allocate resources from cluster when needed
  115. 115. © 2014 MapR Technologies 117 References • Spark web site: http://spark.apache.org/ • https://databricks.com/ • Spark on MapR: – http://www.mapr.com/products/apache-spark • Spark SQL and DataFrame Guide • Apache Spark vs. MapReduce – Whiteboard Walkthrough • Learning Spark - O'Reilly Book • Apache Spark
  116. 116. © 2014 MapR Technologies 118 Q&A @mapr maprtech kbotzum@mapr.com Engage with us! MapR maprtech mapr-technologies
