1 
An Introduction to Spark 
Jai Ranganathan, Senior Director Product Management, Cloudera 
Denny Lee, Senior Director Data Sciences Engineering, Concur
Agenda 
• Cloudera’s Enterprise Data Hub 
• Why Spark? 
• Spark Use Cases 
• Concur Case Study 
• Cloudera and Spark 
• Future of Spark 
2 ©2014 Cloudera, Inc. All rights reserved.
Cloudera’s Enterprise Data Hub 
3 ©2014 Cloudera, Inc. All rights reserved. 
[Diagram: the enterprise data hub platform – storage for any type of data: unified, elastic, resilient, secure]
• Batch processing: MapReduce, Spark
• Analytic SQL: Impala
• Search engine: Solr
• Machine learning: Spark, partners, Mahout, MLlib
• Stream processing: Spark
• Online NoSQL: HBase
• 3rd-party apps
• Workload management: YARN
• Filesystem: HDFS
• Data management: Cloudera Navigator
• System management: Cloudera Manager
• Security: Sentry
Spark: Easy and Fast Big Data 
Easy to Develop
• Rich APIs in Java, Scala, Python
• Interactive shell
• 2-5× less code
Fast to Run
• General execution graphs
• In-memory storage
• Up to 10× faster on disk, 100× in memory
4 ©2014 Cloudera, Inc. All rights reserved.
Easy: Expressive API 
• map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin
• reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip
• sample • take • first • partitionBy • mapWith • pipe • save ...
5 ©2014 Cloudera, Inc. All rights reserved.
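To make the API concrete, here is a minimal sketch (Scala, Spark 1.x RDD API; the input path and the tab-separated record layout are assumptions) that chains several of the operators listed above:

// Hypothetical input: one "userId<TAB>action" record per line.
val events = sc.textFile("hdfs:///data/events")
val pairs = events.map(_.split("\t"))
                  .filter(_.length >= 2)
                  .map(fields => (fields(0), fields(1)))   // (userId, action)

val actionCounts = pairs.map { case (user, _) => (user, 1) }
                        .reduceByKey(_ + _)                // actions per user

actionCounts.sample(withReplacement = false, 0.01)         // 1% sample
            .take(5)
            .foreach(println)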
Example: Logistic Regression

# Logistic regression in PySpark. `spark` is the SparkContext; `readPoint`
# parses a line into a point with label p.y and feature vector p.x; D and
# iterations are defined elsewhere. The input path was elided on the slide.
import numpy
from math import exp

data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)  # random initial weight vector

for i in range(iterations):
    # gradient of the logistic loss, summed over all points
    gradient = data \
        .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) \
        .reduce(lambda x, y: x + y)
    w -= gradient

print "Final w: %s" % w

6 ©2014 Cloudera, Inc. All rights reserved.
Spark Introduces the Concept of the RDD to Take Advantage of Memory
RDD = Resilient Distributed Dataset
• A memory caching layer that stores data in a distributed, fault-tolerant cache
• Created by parallel transformations on data in stable storage
Two observations (see the sketch below):
a. Can fall back to disk when the dataset does not fit in memory
b. Provides fault tolerance through the concept of lineage
7 ©2014 Cloudera, Inc. All rights reserved.
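A minimal sketch (Scala, Spark 1.x API; the path and record handling are assumptions) illustrating both observations – a storage level that spills to disk, and the lineage Spark keeps so lost partitions can be recomputed:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/events")   // data in stable storage
            .filter(_.nonEmpty)
            .map(_.toUpperCase)                // parallel transformations

rdd.persist(StorageLevel.MEMORY_AND_DISK)      // caches in RAM, spills to disk if it doesn't fit
println(rdd.toDebugString)                     // prints the lineage used to rebuild lost partitions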
Fast: Using RAM, Operator Graphs 
In-Memory Caching
• Data partitions are read from RAM instead of disk
Operator Graphs
• Scheduling optimizations
• Fault tolerance
8 ©2014 Cloudera, Inc. All rights reserved.
[Diagram: an operator graph – map, filter, groupBy, join, and take stages over RDDs A–F; shaded boxes mark cached partitions]
Easy: Out of the Box Functionality 
Hadoop Integration
• Standard Hadoop data formats
• Runs under YARN in mixed clusters
Libraries
• MLlib – machine learning toolkit (see the sketch below)
• GraphX (alpha) – graph analytics based on PowerGraph abstractions
• Spark Streaming – near-real-time analytics
• Spark SQL – a direct SQL interface in a Spark application
Language support:
• SparkR (upcoming)
• Java 8
• Schema support in Spark’s APIs
• SQL support in Spark Streaming (upcoming)
9 ©2014 Cloudera, Inc. All rights reserved.
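As a taste of the library support listed above, a minimal sketch (Scala, Spark 1.x MLlib API; the file path and CSV layout are assumptions) that trains a classifier out of the box:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical input: CSV rows of "label,feature1,feature2,..."
val points = sc.textFile("hdfs:///data/points.csv").map { line =>
  val cols = line.split(',').map(_.toDouble)
  LabeledPoint(cols.head, Vectors.dense(cols.tail))
}.cache()

val model = LogisticRegressionWithSGD.train(points, 20)   // 20 iterations
println(model.weights)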
Logistic Regression Performance 
(Data Fits in Memory) 
[Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30). Hadoop: ~110 s per iteration. Spark: ~80 s for the first iteration, ~1 s for each further iteration.]
10 ©2014 Cloudera, Inc. All rights reserved.
Spark Streaming 
What is it?
• Runs continuous processing of data using Spark’s core API, extending Spark concepts to fault-tolerant, transformable streams
• Adds “rolling window” operations, e.g. computing rolling averages or counts over the last five minutes of data
Why do you care?
• Same programming paradigm for streaming and batch – reuse knowledge and code in both contexts
• High-level API with automatic DAG generation – simplicity of development
• Excellent throughput – scales easily to very large volumes of data ingest
• Combine elements like MLlib and Oryx into a streaming application
Example use cases (see the sketch below):
• “On-the-fly” ETL as data is ingested into Hadoop/HDFS
• Detecting anomalous behavior and triggering alerts
• Continuous reporting of summary metrics for incoming data
11 ©2014 Cloudera, Inc. All rights reserved.
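A minimal sketch (Scala, Spark Streaming 1.x API; the socket source, port, and record layout are assumptions) of the rolling-window counting described above:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))            // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)

// Count records per key over the last five minutes, recomputed each batch.
val counts = lines.map(line => (line.split("\t")(0), 1L))
                  .reduceByKeyAndWindow(_ + _, Seconds(300))

counts.print()
ssc.start()
ssc.awaitTermination()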
Streaming Architectures with Spark 
[Diagram: data sources → integration/ingest layer (Flume, Kafka) → Spark stream processing (data prep, aggregation/scoring) → HDFS for Spark long-term analytics and model building, and HBase for real-time result serving]
12 ©2014 Cloudera, Inc. All rights reserved.
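For the Kafka leg of that pipeline, a minimal sketch using the receiver-based spark-streaming-kafka API of Spark 1.x (the ZooKeeper quorum, consumer group, and topic name are assumptions):

import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaStream = KafkaUtils.createStream(
  ssc,                          // the StreamingContext from the earlier sketch
  "zk1:2181",                   // ZooKeeper quorum
  "spark-consumer-group",       // Kafka consumer group id
  Map("transactions" -> 2))     // topic -> number of receiver threads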
Cloudera Customer Use Cases – Core Spark 
Sector: Financial Services
• Use cases: multiple use cases to calculate VaR for portfolio risk analysis – Monte Carlo simulations as well as Var-Covar methods; ETL pipeline speed-up; analyzing 20 years of stock data
• Replaces: home-grown applications

Sector: Genomics
• Use cases: two use cases to identify disease-causing genes in the full human genome
• Replaces: MySQL engine

Sector: Data services
• Use cases: trend analysis using statistical methods on large data sets; document classification (LDA); fraud analytics
• Replaces: Netezza; net new

Sector: Healthcare
• Use case: calculating Jaccard scores on health care data sets
• Replaces: net new

13 ©2014 Cloudera, Inc. All rights reserved.
Cloudera Customer Use Cases – Streaming 
Sector: Financial Services
• Use case: online fraud detection
• Replaces: net new

Sector: Many
• Use case: continuous ETL

Sector: Retail
• Use cases: online recommender systems; inventory management
• Replaces: custom apps

14 ©2014 Cloudera, Inc. All rights reserved.
15 
Spark at Concur
16 
About Concur 
What do we do?
• The world’s leading provider of spend management solutions and services (Travel, Invoice, TripIt, etc.)
• Global customer base of 20,000 clients and 25 million users
• Processing more than $50 billion in Travel & Expense (T&E) spend each year
17 
About the Speaker 
Who Am I?
• Long-time SQL Server BI guy (24TB Yahoo! cube)
• Project Isotope (Hadoop on Windows and Azure)
• At Concur, helping with Big Data and Data Sciences
18 
A long time ago… 
• We started using Hadoop because:
• It was free – i.e. we didn’t want to pay for a big data warehouse
• We could slowly extract from hundreds of relational data sources, consolidate it, and query it
• We were not thinking about advanced analytics
• We were thinking … “cheaper reporting”
• We had some hardware lying around … let’s cobble it together, and now we have reports
19 
Themes 
Consolidate Visualize Insight Recommend
20 
[Diagram: consolidated data sources – BTS, Travel, Weather, Invoice, Web Analytics, Expense]
Can quickly switch to map mode and determine where most itineraries are from in 2013 
21
22 
Or even quickly plot the airport locations on a map to see that Sun Moon Lake Airport is in the center of Taiwan
23 
Expense Categorization
• One of my receipts, OCRed (shown below)
• One of the issues we’re trying to solve is how to auto-categorize this – so how can we do it?
• The next two slides show a simplistic solution using WordCount
• Note: a real solution should involve machine learning algorithms

Starbucks Store #3313
601 108th Ave NE
Bellevue, WA (425) 646-9602
-------------------------------
Chk 713452
05/14/2014 11:04 AM
1961558 Drawer: 1 Reg: 1
-------------------------------
Bacon Art Brkfst 3.45
Warmed
T1 Latte 2.70
Triple 1.50
Soy 0.60
Gr Vanilla Mac 4.15
Reload Card 50.00
AMEX $50.00
XXXXXXXXXXXXXXXXXX1004
SBUX Card $13.56
SUBTOTAL $62.40
New Caffe Espresso
Frappuccino(R) Blended beverage
Our Signature
Frappuccino(R) roast coffee and
fresh milk, blended with ice.
Topped with our new espresso
whipped cream and new
Italian roast drizzle
24 
Spark assembly has been built with Hive, including Datanucleus jars on classpath 
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_45)
Type in expressions to have them evaluated.
Type :help for more information.
2014-09-07 22:31:21.064 java[1871:15527] Unable to load realm info from SCDynamicStore
14/09/07 22:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc. 
scala> val receipt = sc.textFile("/usr/local/Cellar/workspace/data/receipt/receipt.txt") 
receipt: org.apache.spark.rdd.RDD[String] = /usr/local/Cellar/workspace/data/receipt/receipt.txt MappedRDD[1] at textFile at <console>:12
scala> receipt.count 
res0: Long = 30
25 
scala> val words = receipt.flatMap(_.split(" ")) 
words: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[2] at flatMap at <console>:14 
scala> words.count 
res1: Long = 161 
scala> words.distinct.count 
res2: Long = 72 
scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case (x, y) => (y, x)}.sortByKey(false).map{case (i, j) => (j, i)}
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[12] at map at <console>:16
scala> wordCounts.take(12) 
res5: Array[(String, Int)] = Array(("",82), (with,2), (Card,2), (new,2), (-------------------------------,2), (Frappuccino(R),2), (roast,2), (1,2), (and,2), (New,1), (Topped,1), (Starbucks,1))
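As an aside, the swap-sort-swap pattern above can be written more directly with RDD.sortBy (available since Spark 1.0) – a sketch assuming the same words RDD:

val wordCounts = words.map(word => (word, 1))
                      .reduceByKey(_ + _)
                      .sortBy(_._2, ascending = false)   // sort by count, descending
wordCounts.take(12)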
26
[Image: quick view of Android vs. iOS mobile sessions – see editor’s note #27]
27 
What’s next…
• With Spark 1.1:
• Sort-based shuffling
• MLlib: correlations, sampling, feature extraction, decision trees (see the sketch below)
• GraphX: label propagation
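For instance, the new MLlib statistics support can be exercised as in this minimal sketch (Scala, Spark 1.1 MLlib API; the two columns of numbers are made up):

import org.apache.spark.mllib.stat.Statistics

val spend = sc.parallelize(Seq(12.0, 40.0, 7.5, 99.0))    // hypothetical expense amounts
val miles = sc.parallelize(Seq(3.0, 55.0, 1.0, 120.0))    // hypothetical trip distances
println(Statistics.corr(spend, miles, "pearson"))         // Pearson correlation coefficient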
28 
Using AtScale to build up a dimensional model based on the data that is stored within Impala / Hive
29 
Slice and filter the Impala model using Tableau
30 
Spark and Cloudera
Why Cloudera? 
Expertise
• Deep engineering investment – the only distribution vendor with engineering contributions to Spark and actual technical know-how
• Field team, support, training, and services with experience across many Spark use cases
• Driving the roadmap for Spark
Experience
• More customers running Spark than all other distributions put together
• Deployments range from a few nodes to 800+ nodes
• Longest field presence – the first vendor to support Spark, and still one of only two vendors with official support
Partnerships
• Intel partnership brings 15 Spark developers focused on Cloudera customer use cases
• Business relationship with Databricks for joint development on Spark
31 ©2014 Cloudera, Inc. All rights reserved.
Spark Takes Over From MapReduce 
Stage 1 
• Crunch on Spark 
• Search on Spark 
Stage 2 
• Hive on Spark 
• Pig on Spark 
Stage 3 
• MR equivalence 
• Sqoop on Spark 
A Cloudera-led multi-organization effort with MapR, Intel, Databricks, and IBM
32 ©2014 Cloudera, Inc. All rights reserved.
Spark is Great but… 
• Opaque API limitations 
• Debugging and troubleshooting 
• Complex configuration 
Cloudera University: Spark Training
33 ©2014 Cloudera, Inc. All rights reserved.
Questions & Next Steps 
Download Now – www.cloudera.com/download
Spark Training – www.cloudera.com/content/cloudera/en/training/courses/spark-training.html
34 ©2014 Cloudera, Inc. All rights reserved.
35 ©2014 Cloudera, Inc. All rights reserved. 
Thank You


Editor's Notes

  • #4 Cloudera’s enterprise data hub (powered by Hadoop) is a data management platform that provides a unique offering that’s unified, compliance-ready, accessible, and open. The enterprise data hub brings everything together in one unified layer – no copying of data, just a single transparent view that makes it easy to meet auditing and compliance goals. It offers a single, unified solution for storage & serialization, data ingest & egress, security & governance, metadata, and resource management. It is compliance-ready for security and governance (authentication, authorization, encryption, audit, RBAC, lineage) with a single interface and integrated controls. It is accessible through multiple frameworks and familiar tools and skills. And it is completely open: a 100% open-source, Apache-licensed platform, extensible to 3rd-party frameworks, with zero platform lock-in. As mentioned, the enterprise data hub has multiple query frameworks integrated into the platform; one of the newest and most exciting is Spark, an open-source, flexible data processing framework for machine learning and stream processing. Before we dive into Spark, we need to understand why Spark is necessary – and that requires an understanding of MapReduce.
  • #7 Key idea: add “variables” to the “functions” in functional programming
  • #11 This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  • #27 Quick view of Android vs. iOS mobile sessions