Real-Time Analytical Processing (RTAP)
Using the Spark Stack
Jason Dai

jason.dai@intel.com
Intel Software and Services Group
Project Overview
Research & open source projects initiated by AMPLab
in UC Berkeley

BDAS: Berkeley Data Analytics Stack
(Ref: https://amplab.cs.berkeley.edu/software/)

Intel closely collaborating with AMPLab & the community
on open source development
• Apache incubation since June 2013
• The “most active cluster data processing engine
after Hadoop MapReduce”

Intel partnering with several big websites
• Building next-gen big data analytics using the Spark stack

• Real-time analytical processing (RTAP)
• E.g., Alibaba , Baidu iQiyi, Youku, etc.

Next-Gen Big Data Analytics
Volume
• Massive scale & exponential growth

Variety
• Multi-structured, diverse sources & inconsistent schemas
Value

• Simple (SQL) – descriptive analytics
• Complex (non-SQL) – predictive analytics
Velocity
• Interactive – the speed of thought
• Streaming/online – drinking from the firehose


The Spark stack
Real-Time Analytical Processing (RTAP)
A vision for next-gen big data analytics
• Data captured & processed in a (semi) streaming/online fashion
• Real-time & history data combined and mined interactively and/or iteratively
– Complex OLAP / BI in interactive fashion
– Iterative, complex machine learning & graph analysis

• Predominantly memory-based computation

[Figure: RTAP architecture. Components: Events / Logs; Messaging / Queue; Stream Processing; In-Memory Store; Persistent Storage; NoSQL; Data Warehouse; Low Latency Processing Engine; Online Analysis / Dashboard; Ad-hoc, interactive OLAP / BI; Iterative, complex machine learning & graph analysis]
Real-World Use Case #1

(Semi) Real-Time Log Aggregation & Analysis
Logs continuously collected & streamed in
• Through queuing/messaging systems

Incoming logs processed in a (semi) streaming fashion
• Aggregations for different time periods, demographics, etc.
• Join logs and history tables when necessary
Aggregation results then consumed in a (semi) streaming fashion
• Monitoring, alerting, etc.
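
The aggregation step described above can be sketched in plain Python, without Spark, to show its shape: bucket each log record into a fixed time window, look up a demographic attribute in a history table, and count per group. The record layout, the `user_region` table, and the function name are assumptions made for illustration; they are not taken from the deployment described in these slides.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical log record: (unix_timestamp_seconds, user_id, event_name)
LogRecord = Tuple[int, str, str]


def aggregate_logs(
    logs: Iterable[LogRecord],
    user_region: Dict[str, str],
    window_seconds: int = 60,
) -> Counter:
    """Count events per (window start, region, event name)."""
    counts: Counter = Counter()
    for ts, user_id, event in logs:
        window_start = ts - ts % window_seconds       # bucket into fixed windows
        region = user_region.get(user_id, "unknown")  # join with the history table
        counts[(window_start, region, event)] += 1
    return counts


logs = [
    (120, "u1", "play"),
    (130, "u2", "play"),
    (185, "u1", "pause"),
    (190, "u3", "play"),
]
user_region = {"u1": "north", "u2": "north"}  # stand-in for a history table
result = aggregate_logs(logs, user_region)
```

In a cluster setting the same grouping key would drive a shuffle-based aggregation; here a `Counter` plays that role on a single machine.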

(Semi) Real-Time Log Aggregation: MiniBatch Jobs

[Figure: Log Collectors (each with a Kafka client) publishing to a group of servers, each labeled Kafka, Spark, and Mesos; results written to an RDBMS]

Architecture

• Logs collected and published to Kafka continuously
– Kafka cluster collocated with Spark cluster

• Independent Spark applications launched at regular intervals
– A series of mini-batch apps running in “fine-grained” mode on Mesos
– Process newly received data in Kafka (e.g., aggregation by keys) & write results out

In production in the user’s environment today
• Tens of seconds of latency


For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
(Semi) Real-Time Log Aggregation: MiniBatch Jobs
Design decisions
• Kafka and Spark servers collocated
– A distinct Spark task for each Kafka partition, usually reading from local file system cache

• Individual Spark applications launched periodically
– Apps logging current (Kafka partition) offsets in ZooKeeper
– Data of failed batch simply discarded (other strategies possible)

• Spark running in “fine-grained” mode on Mesos
– Allow the next batch to start even if the current one is late
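
The offset handling in these design decisions can be illustrated with a small single-process sketch. It is not the production code: each Kafka partition is modeled as an in-memory list, a plain dict stands in for the offsets kept in ZooKeeper, and `run_mini_batch` and `sink` are hypothetical names. The sketch follows the "discard on failure" strategy from the slide, so offsets advance even when a batch fails.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Stand-ins (assumptions): each partition is an in-memory list of (key, value)
# records, and the offset store is a dict playing the role of ZooKeeper.
Partition = List[Tuple[str, int]]


def run_mini_batch(
    partitions: Dict[int, Partition],
    offsets: Dict[int, int],
    sink: Callable[[Counter], None],
) -> bool:
    """Process records that arrived since the last batch; return True on success."""
    # Snapshot the end of each partition so the batch covers a fixed range.
    end_offsets = {p: len(records) for p, records in partitions.items()}
    try:
        totals: Counter = Counter()
        for p, records in partitions.items():  # one task per partition
            start = offsets.get(p, 0)
            for key, value in records[start:end_offsets[p]]:
                totals[key] += value           # aggregation by key
        sink(totals)                           # write results out
        succeeded = True
    except Exception:
        succeeded = False                      # data of the failed batch is discarded
    # Offsets advance either way, so a failed batch is never retried.
    offsets.update(end_offsets)
    return succeeded


partitions = {0: [("a", 1), ("b", 2)], 1: [("a", 3)]}
offsets: Dict[int, int] = {}
results: List[Counter] = []

ok = run_mini_batch(partitions, offsets, results.append)  # first batch
partitions[0].append(("b", 5))                            # new data arrives
run_mini_batch(partitions, offsets, results.append)       # second batch sees only the new record
```

A retry-based strategy would instead update the offsets only when `succeeded` is true, at the cost of a growing backlog when batches keep failing.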

Performance
• Input (log collection) bottlenecked by network bandwidth (1GbE)
• Processing (log aggregation) bottlenecked by CPUs
– Easily scaled to several seconds

(Semi) Real-Time Log Aggregation: Spark Streaming
[Figure: Log Collectors; Kafka Cluster; Spark Cluster; RDBMS]

Implications
• Better streaming framework support
– Complex (e.g., stateful) analysis, fault-tolerance, etc.

• Kafka & Spark not collocated
– DStream retrieves logs in background (over network) and caches blocks in memory

• Memory tuning to reduce GC is critical
– spark.cleaner.ttl (throughput * spark.cleaner.ttl < spark mem free size)
– Storage level (MEMORY_ONLY_SER_2)

• Lower latency (several seconds)
– No startup overhead (reusing SparkContext)
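
The memory constraint quoted above (throughput * spark.cleaner.ttl < free Spark memory) can be turned into a quick sizing calculation. The sketch below only rearranges that inequality; the function name and the figures used are hypothetical and should be replaced with measured values.

```python
def max_cleaner_ttl_seconds(free_memory_bytes: float, throughput_bytes_per_sec: float) -> float:
    """Upper bound on spark.cleaner.ttl implied by throughput * ttl < free memory."""
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return free_memory_bytes / throughput_bytes_per_sec


# Hypothetical figures: 8 GiB free for cached blocks, 20 MiB/s of incoming logs.
free_memory = 8 * 1024**3
throughput = 20 * 1024**2
bound = max_cleaner_ttl_seconds(free_memory, throughput)
```

The TTL actually configured has to stay strictly below this bound, and in practice some headroom is needed because serialized block sizes and input rates fluctuate.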

Real-World Use Case #2
Complex, Interactive OLAP / BI
Significant speedup of ad-hoc, complex OLAP / BI
• Spark/Shark cluster runs alongside Hadoop/Hive clusters

• Directly query on Hadoop/Hive data (interactively)
• No ETL, no storage overhead, etc.
In-memory, real-time queries
• Data of interest loaded into memory
create table XYZ tblproperties ("shark.cache" = "true") as select ...

• Typically frontended by lightweight UIs
– Data visualization and exploration (e.g., drill down / up)

Time Series Analysis: Unique Event Occurrence
Computing unique event occurrence across the time range
• Input time series
<TimeStamp, ObjectId, EventId, ...>

– E.g., watch of a particular video by a specific user
– E.g., transactions of a particular stock by a specific account

• Output:
<ObjectId, TimeRange, Unique Event#, Unique Event(≥2)#, …, Unique Event(≥n)#>

– E.g., accumulated unique event# for each day in a week (starting from Monday)

• Implementation
– 2-level aggregation using Spark
• Specialized partitioning, general execution graph, in-memory cached data

– Speedup from 20+ hours to several minutes
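
The two-level aggregation can be sketched in plain Python to make the data flow concrete. This is a single-machine illustration under assumptions, not the Spark job behind the quoted speedup: records are reduced to hypothetical (day, object_id, event_id) triples, and dictionaries stand in for the two shuffles.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record: (day, object_id, event_id), with day in 1..days
Record = Tuple[int, str, str]


def unique_event_counts(
    records: Iterable[Record], days: int = 7, n: int = 3
) -> Dict[Tuple[str, int], List[int]]:
    """Map (object_id, D) to [#events seen >=1 time, >=2 times, ..., >=n times] over days 1..D."""
    # Level 1: count occurrences per (object, event, day).
    daily: Dict[Tuple[str, str], List[int]] = defaultdict(lambda: [0] * (days + 1))
    for day, obj, event in records:
        daily[(obj, event)][day] += 1

    # Level 2: accumulate each (object, event) over [1, D], then sum the
    # threshold indicators (1 if count >= k, 0 otherwise) per object.
    result: Dict[Tuple[str, int], List[int]] = defaultdict(lambda: [0] * n)
    for (obj, _event), per_day in daily.items():
        accumulated = 0
        for d in range(1, days + 1):
            accumulated += per_day[d]
            for k in range(1, n + 1):
                if accumulated >= k:
                    result[(obj, d)][k - 1] += 1
    return dict(result)


records = [
    (1, "video1", "userA"),
    (1, "video1", "userA"),
    (2, "video1", "userB"),
    (3, "video1", "userA"),
]
counts = unique_event_counts(records)
```

Keying the first level by (object, event) and the second by object alone mirrors the partitioning described on the slides; the intermediate per-day counts are what could be cached in memory and reused across time ranges.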
Time Series Analysis: Unique Event Occurrence
[Figure: first-level aggregation]
Input time series: <TimeStamp, ObjectId, EventId, …>
Partitioned by (ObjectId, EventId)
<ObjectId, EventId, Day (1, 2, …, 7), 1>
Aggregation (shuffle)
<ObjectId, EventId, Day (1, 2, …, 7), Count> (potentially cached in mem)
Time Series Analysis: Unique Event Occurrence
[Figure: two-level aggregation]
Input time series: <TimeStamp, ObjectId, EventId, …>
Partitioned by (ObjectId, EventId)
<ObjectId, EventId, Day (1, 2, …, 7), 1>
Aggregation (shuffle)
<ObjectId, EventId, Day (1, 2, …, 7), Count> (potentially cached in mem)
<ObjectId, EventId, TimeRange [1, D], Accumulated Count>
<ObjectId, EventId, TimeRange [1, D], 1, 1 (if count ≥2) 0 otherwise, …, 1 (if count ≥n) 0 otherwise>
Partitioned by ObjectId
Aggregation (shuffle)
<ObjectId, TimeRange [1, D], Unique Event#, Unique Event(≥2)#, …, Unique Event(≥n)#>
Real-World Use Case #3

Complex Machine Learning & Graph Analysis
Algorithm: complex math operations
• Mostly matrix based
– Multiplication, factorization, etc.

• Sometimes graph-based
– E.g., sparse matrix

Iterative computations

• Matrix (graph) cached in memory across iterations

Graph Analysis: N-Degree Association
N-degree association in the graph
• Computing associations between two vertices that are n hops away
• E.g., friends of friends

$\mathrm{Weight}_1(u, v) = \mathrm{edge}(u, v) \in (0, 1)$
$\mathrm{Weight}_n(u, v) = \sum_{x \to v} \mathrm{Weight}_{n-1}(u, x) \cdot \mathrm{Weight}_1(x, v)$

Graph-parallel implementation
• Bagel (Pregel on Spark) and GraphX
– Memory optimizations for efficient graph caching critical

• Speedup from 20+ minutes to <2 minutes
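
The recurrence for n-degree weights can be sketched as a sequence of Pregel-style supersteps in plain Python. This is an illustrative single-machine version, not the Bagel or GraphX implementation: the edge-list layout, the function names, and the `top_k` default are assumptions. Because each vertex keeps only its top K weights after every step, the result approximates the exact recurrence whenever pruning actually drops entries.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (source, target, weight in (0, 1))


def _keep_top_k(state: Dict[str, Dict[str, float]], top_k: int) -> Dict[str, Dict[str, float]]:
    """Keep only the top_k largest weights in each vertex's state."""
    return {
        v: dict(sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k])
        for v, weights in state.items()
    }


def n_degree_weights(edges: List[Edge], n: int, top_k: int = 10) -> Dict[str, Dict[str, float]]:
    """Return state[v] = {x: Weight_n(x, v)}, pruned to top_k sources x per vertex."""
    # Step 1: Weight_1(u, v) = edge(u, v).
    first: Dict[str, Dict[str, float]] = defaultdict(dict)
    for u, v, w in edges:
        first[v][u] = w
    state = _keep_top_k(first, top_k)

    for _ in range(n - 1):
        # Each vertex v sends D(x, u) = Weight(x, v) * edge(v, u) along its out-edges;
        # the receiver u sums the messages that share the same source x.
        inbox: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
        for v, u, w in edges:
            for x, weight in state.get(v, {}).items():
                inbox[u][x] += weight * w
        state = _keep_top_k(inbox, top_k)
    return state


edges = [("a", "b", 0.5), ("b", "c", 0.4), ("a", "c", 0.2), ("c", "d", 0.5)]
two_hop = n_degree_weights(edges, n=2)
```

Here `two_hop["c"]["a"]` is the 2-hop association from a to c through b, and `two_hop["d"]` holds the 2-hop associations that reach d through c.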

Graph Analysis: N-Degree Association
[Figure: vertices u, v, and w, each holding its own state]
State[u] = list of Weight(x, u) (for current top K weights to vertex u)
State[v] = list of Weight(x, v) (for current top K weights to vertex v)
State[w] = list of Weight(x, w) (for current top K weights to vertex w)
Graph Analysis: N-Degree Association
[Figure: vertices v and w send messages to vertex u]
State[u] = list of Weight(x, u) (for current top K weights to vertex u)
State[v] = list of Weight(x, v) (for current top K weights to vertex v)
State[w] = list of Weight(x, w) (for current top K weights to vertex w)
Messages from v = {D(x, u) = Weight(x, v) * edge(v, u)} (for Weight(x, v) in State[v])
Messages from w = {D(x, u) = Weight(x, w) * edge(w, u)} (for Weight(x, w) in State[w])
Graph Analysis: N-Degree Association
[Figure: vertex u sums the messages received from v and w and updates its state]
State[u] = list of Weight(x, u) (for current top K weights to vertex u)
State[v] = list of Weight(x, v) (for current top K weights to vertex v)
State[w] = list of Weight(x, w) (for current top K weights to vertex w)
Messages from v = {D(x, u) = Weight(x, v) * edge(v, u)} (for Weight(x, v) in State[v])
Messages from w = {D(x, u) = Weight(x, w) * edge(w, u)} (for Weight(x, w) in State[w])
$\mathrm{Weight}(x, u) = \sum_{D(x, u) \in \mathrm{Messages}} D(x, u)$
State[u] = list of Weight(x, u) (for current top K weights to vertex u)
Contributions by Intel
Netty based shuffle for Spark
FairScheduler for Spark

Spark job log files
Metrics system for Spark
Configurations system for Spark
Spark shell on YARN
Spark (standalone mode) integration with secure Hadoop
Byte code generation for Shark
Co-partitioned join in Shark
...
Summary
The Spark stack: lightning-fast analytics over Hadoop data
• Active communities and early adopters evolving
– Apache incubator project

• A growing stack: SQL, streaming, graph-parallel, machine learning, …

Work with us on next-gen big data analytics using the Spark stack
• Interactive and in-memory OLAP / BI
• Complex machine learning & graph analysis
• Near real-time, streaming processing
• And many more!

Notices and Disclaimers
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL®
PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH
PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS
OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING
LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE,
MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL
PROPERTY RIGHT.
Intel may make changes to specifications, product descriptions, and plans at any time, without
notice. Designers must not rely on the absence or characteristics of any features or instructions
marked "reserved" or "undefined". Intel reserves these for future definition and shall have no
responsibility whatsoever for conflicts or incompatibilities arising from future changes to
them. The information here is subject to change without notice. Do not finalize a design with
this information.
The products described in this document may contain design defects or errors known as errata
which may cause the product to deviate from published specifications. Current characterized
errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and
before placing your product order.

All dates provided are subject to change without notice.
Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
*Other names and brands may be claimed as the property of others.
Copyright © 2013, Intel Corporation. All rights reserved.
22
Intel realtime analytics_spark

More Related Content

What's hot

Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Databricks
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot modePrakash Chockalingam
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkDatabricks
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Sparkjlacefie
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkAndy Petrella
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLFlink Forward
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...Spark Summit
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Tathagata Das
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...Spark Summit
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Milind Bhandarkar
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLSpark Summit
 
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSPDiscretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSPTathagata Das
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and DatasetKazuaki Ishizaki
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkDatabricks
 

What's hot (20)

Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S...
 
Databricks clusters in autopilot mode
Databricks clusters in autopilot modeDatabricks clusters in autopilot mode
Databricks clusters in autopilot mode
 
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache SparkArbitrary Stateful Aggregations using Structured Streaming in Apache Spark
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark
 
An Introduction to Spark
An Introduction to SparkAn Introduction to Spark
An Introduction to Spark
 
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSLSebastian Schelter – Distributed Machine Learing with the Samsara DSL
Sebastian Schelter – Distributed Machine Learing with the Samsara DSL
 
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
A Scalable Hierarchical Clustering Algorithm Using Spark: Spark Summit East t...
 
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
Guest Lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms ...
 
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
A More Scaleable Way of Making Recommendations with MLlib-(Xiangrui Meng, Dat...
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011Modeling with Hadoop kdd2011
Modeling with Hadoop kdd2011
 
Optimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone MLOptimizing Terascale Machine Learning Pipelines with Keystone ML
Optimizing Terascale Machine Learning Pipelines with Keystone ML
 
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSPDiscretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
Discretized Stream - Fault-Tolerant Streaming Computation at Scale - SOSP
 
Demystifying DataFrame and Dataset
Demystifying DataFrame and DatasetDemystifying DataFrame and Dataset
Demystifying DataFrame and Dataset
 
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
Accumulo Summit 2016: Introducing Accumulo Collections: A Practical Accumulo ...
 
TensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache SparkTensorFrames: Google Tensorflow on Apache Spark
TensorFrames: Google Tensorflow on Apache Spark
 

Viewers also liked

Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...Josef A. Habdank
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Spark Summit
 
Tutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtimeTutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtimeSocialmetrix
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Spark Summit
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Spark Summit
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 

Viewers also liked (7)

Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
Airfare prediction using Machine Learning with Apache Spark on 1 billion obse...
 
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Micros...
 
Tutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtimeTutorial en Apache Spark - Clasificando tweets en realtime
Tutorial en Apache Spark - Clasificando tweets en realtime
 
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
Data Storage Tips for Optimal Spark Performance-(Vida Ha, Databricks)
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 

Similar to Intel realtime analytics_spark

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架hdhappy001
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's comingDatabricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Intro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupIntro to Spark - for Denver Big Data Meetup
Intro to Spark - for Denver Big Data MeetupGwen (Chen) Shapira
 
Spark Study Notes
Spark Study NotesSpark Study Notes
Spark Study NotesRichard Kuo
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014OrientDB - Time Series and Event Sequences - Codemotion Milan 2014
OrientDB - Time Series and Event Sequences - Codemotion Milan 2014Luigi Dell'Aquila
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingDatabricks
 
SparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsSparkSQL: A Compiler from Queries to RDDs
SparkSQL: A Compiler from Queries to RDDsDatabricks
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkC4Media
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...
Il tempo vola: rappresentare e manipolare sequenze di eventi e time series co...Codemotion
 

Similar to Intel realtime analytics_spark (20)

夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架夏俊鸾:Spark——基于内存的下一代大数据分析框架
夏俊鸾:Spark——基于内存的下一代大数据分析框架
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Spark what's new what's coming
Intel realtime analytics_spark

  • 1. Real-Time Analytical Processing (RTAP) Using the Spark Stack Jason Dai jason.dai@intel.com Intel Software and Services Group
  • 2. Project Overview Research & open source projects initiated by AMPLab at UC Berkeley BDAS: Berkeley Data Analytics Stack (Ref: https://amplab.cs.berkeley.edu/software/) Intel closely collaborating with AMPLab & the community on open source development • Apache incubation since June 2013 • The “most active cluster data processing engine after Hadoop MapReduce” Intel partnering with several big websites • Building next-gen big data analytics using the Spark stack • Real-time analytical processing (RTAP) • E.g., Alibaba, Baidu iQiyi, Youku, etc.
  • 3. Next-Gen Big Data Analytics Volume • Massive scale & exponential growth Variety • Multi-structured, diverse sources & inconsistent schemas Value • Simple (SQL) – descriptive analytics • Complex (non-SQL) – predictive analytics Velocity • Interactive – the speed of thought • Streaming/online – drinking from the firehose
  • 4. Next-Gen Big Data Analytics Volume • Massive scale & exponential growth Variety • Multi-structured, diverse sources & inconsistent schemas Value • Simple (SQL) – descriptive analytics • Complex (non-SQL) – predictive analytics Velocity • Interactive – the speed of thought • Streaming/online – drinking from the firehose [Overlay: The Spark stack]
  • 5. Real-Time Analytical Processing (RTAP) A vision for next-gen big data analytics • Data captured & processed in a (semi) streaming/online fashion • Real-time & historical data combined and mined interactively and/or iteratively – Complex OLAP / BI in interactive fashion – Iterative, complex machine learning & graph analysis • Predominantly memory-based computation [Diagram components: Events / Logs; Messaging / Queue; Stream Processing; In-Memory Store; Persistent Storage (NoSQL, Data Warehouse); Online Analysis / Dashboard; Low Latency Processing Engine; Ad-hoc, interactive OLAP / BI; Iterative, complex machine learning & graph analysis]
  • 6. Real-World Use Case #1 (Semi) Real-Time Log Aggregation & Analysis Logs continuously collected & streamed in • Through queuing/messaging systems Incoming logs processed in a (semi) streaming fashion • Aggregations for different time periods, demographics, etc. • Join logs and history tables when necessary Aggregation results then consumed in a (semi) streaming fashion • Monitoring, alerting, etc.
  • 7. (Semi) Real-Time Log Aggregation: MiniBatch Jobs [Diagram components: Log Collectors (Kafka clients); servers each running Kafka, Spark, and Mesos; RDBMS] Architecture • Logs collected and published to Kafka continuously – Kafka cluster collocated with Spark cluster • Independent Spark applications launched at regular intervals – A series of mini-batch apps running in “fine-grained” mode on Mesos – Process newly received data in Kafka (e.g., aggregation by keys) & write results out In production in the user’s environment today • 10s of seconds latency For more complete information about performance and benchmark results, visit www.intel.com/benchmarks.
  • 8. (Semi) Real-Time Log Aggregation: MiniBatch Jobs Design decisions • Kafka and Spark servers collocated – A distinct Spark task for each Kafka partition, usually reading from local file system cache • Individual Spark applications launched periodically – Apps logging current (Kafka partition) offsets in ZooKeeper – Data of failed batch simply discarded (other strategies possible) • Spark running in “fine-grained” mode on Mesos – Allows the next batch to start even if the current one is late Performance • Input (log collection) bottlenecked by network bandwidth (1GbE) • Processing (log aggregation) bottlenecked by CPUs – Easily scaled to several seconds of latency
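The mini-batch design on slide 8 (read only the data past each partition's stored offset, aggregate by key, record the new offsets, and simply drop a failed batch) can be illustrated with a rough sketch. This is plain Python, not the Kafka, Spark, or ZooKeeper API: the `partitions` dict stands in for Kafka partition logs, the `offsets` dict stands in for the offsets kept in ZooKeeper, and `run_mini_batch` and `sink` are hypothetical names introduced here for illustration.

```python
from collections import Counter


def run_mini_batch(partitions, offsets, sink):
    """Run one mini-batch over the records that arrived since the last batch.

    partitions: dict partition_id -> list of (key, count) records
                (an in-memory stand-in for a Kafka partition log)
    offsets:    dict partition_id -> next offset to read
                (a stand-in for the offsets logged in ZooKeeper)
    sink:       callable receiving the aggregated {key: total} dict
    Returns the offsets to store for the next batch.
    """
    # Snapshot the end of each partition so the batch has fixed boundaries.
    new_offsets = {pid: len(log) for pid, log in partitions.items()}
    try:
        totals = Counter()
        for pid, log in partitions.items():
            start = offsets.get(pid, 0)
            # One logical task per partition: aggregate its new records by key.
            for key, count in log[start:new_offsets[pid]]:
                totals[key] += count
        sink(dict(totals))
    except Exception:
        # Assumption mirroring the slide: data of a failed batch is discarded,
        # so the offsets still advance and the next batch starts fresh.
        pass
    return new_offsets
```

Advancing the offsets even on failure trades completeness for simplicity; the slide notes that other strategies are possible, for example returning the old offsets so the next batch retries the same range.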
  • 9. (Semi) Real-Time Log Aggregation: Spark Streaming [Diagram: Log Collectors → Kafka Cluster → Spark Cluster → RDBMS] Implications • Better streaming framework support – Complex (e.g., stateful) analysis, fault-tolerance, etc. • Kafka & Spark not collocated – DStream retrieves logs in background (over network) and caches blocks in memory • Memory tuning to reduce GC is critical – spark.cleaner.ttl (throughput * spark.cleaner.ttl < spark mem free size) – Storage level (MEMORY_ONLY_SER_2) • Lower latency (several seconds) – No startup overhead (reusing SparkContext)
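The memory-tuning rule on slide 9 (throughput * spark.cleaner.ttl < free Spark memory) is simple arithmetic, and a small helper makes the bound concrete. The function names and the example figures below are assumptions for illustration only; they are not part of Spark.

```python
def max_cleaner_ttl_seconds(free_memory_bytes, throughput_bytes_per_sec):
    """Upper bound on the TTL: data retained for `ttl` seconds must fit in free memory."""
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return free_memory_bytes / throughput_bytes_per_sec


def ttl_fits(ttl_seconds, free_memory_bytes, throughput_bytes_per_sec):
    """Check the slide's inequality: throughput * ttl < free memory."""
    return throughput_bytes_per_sec * ttl_seconds < free_memory_bytes


# Hypothetical figures: 8 GiB of free memory, 50 MiB/s of incoming logs.
FREE_MEMORY = 8 * 2**30
THROUGHPUT = 50 * 2**20
bound = max_cleaner_ttl_seconds(FREE_MEMORY, THROUGHPUT)
```

With these made-up numbers the bound is roughly 164 seconds, so a TTL of 120 seconds would satisfy the inequality while 180 seconds would not. Replicated storage levels hold more than one copy of each block, so the usable headroom in practice may be smaller than this estimate suggests.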
  • 10. Real-World Use Case #2 Complex, Interactive OLAP / BI Significant speedup of ad-hoc, complex OLAP / BI • Spark/Shark cluster runs alongside Hadoop/Hive clusters • Directly query Hadoop/Hive data (interactively) • No ETL, no storage overhead, etc. In-memory, real-time queries • Data of interest loaded into memory: create table XYZ tblproperties ("shark.cache" = "true") as select ... • Typically frontended by lightweight UIs – Data visualization and exploration (e.g., drill down / up)
  • 11. Time Series Analysis: Unique Event Occurrence Computing unique event occurrence across the time range • Input time series <TimeStamp, ObjectId, EventId, ...> – E.g., watch of a particular video by a specific user – E.g., transactions of a particular stock by a specific account • Output: <ObjectId, TimeRange, Unique Event#, Unique Event(≥2)#, …, Unique Event(≥n)#> – E.g., accumulated unique event# for each day in a week (starting from Monday): days 1, 2, …, 7 • Implementation – 2-level aggregation using Spark • Specialized partitioning, general execution graph, in-memory cached data – Speedup from 20+ hours to several minutes
  • 12. Time Series Analysis: Unique Event Occurrence [Diagram: first-level aggregation] Input time series <TimeStamp, ObjectId, EventId, …>, partitioned by (ObjectId, EventId) → <ObjectId, EventId, Day (1, 2, …, 7), 1> → aggregation (shuffle) → <ObjectId, EventId, Day (1, 2, …, 7), Count> (potentially cached in memory)
  • 13. Time Series Analysis: Unique Event Occurrence [Diagram: two-level aggregation] Input time series <TimeStamp, ObjectId, EventId, …>, partitioned by (ObjectId, EventId) → <ObjectId, EventId, Day (1, 2, …, 7), 1> → aggregation (shuffle) → <ObjectId, EventId, Day (1, 2, …, 7), Count> (potentially cached in memory) → <ObjectId, EventId, TimeRange [1, D], Accumulated Count> → <ObjectId, EventId, TimeRange [1, D], 1, 1 (if count ≥ 2) 0 otherwise, …, 1 (if count ≥ n) 0 otherwise> → aggregation (shuffle), partitioned by ObjectId → <ObjectId, TimeRange [1, D], Unique Event#, Unique Event(≥2)#, …, Unique Event(≥n)#>
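The two-level aggregation described on slides 11 to 13 can be sketched in plain Python (no Spark) to show the data flow: first count events per (ObjectId, EventId) per day, then accumulate over the range [1, D] and count, per ObjectId, how many distinct events reached each threshold. The function name, the tuple layout `(day, object_id, event_id)`, and the defaults are assumptions for illustration; the slides do not give the actual implementation.

```python
from collections import defaultdict
from itertools import accumulate


def unique_event_occurrence(records, num_days=7, max_n=3):
    """Return {object_id: rows}, where rows[d - 1][n - 1] is the number of
    distinct events seen at least n times for that object in days [1, d].

    records: iterable of (day, object_id, event_id) with day in 1..num_days.
    """
    # Level 1: keyed by (ObjectId, EventId), count occurrences per day.
    daily = defaultdict(lambda: [0] * num_days)
    for day, object_id, event_id in records:
        daily[(object_id, event_id)][day - 1] += 1

    # Level 2: keyed by ObjectId, turn accumulated counts into 0/1
    # indicators per threshold and sum them across events.
    result = defaultdict(lambda: [[0] * max_n for _ in range(num_days)])
    for (object_id, _event_id), counts in daily.items():
        for d, accumulated in enumerate(accumulate(counts)):
            for n in range(1, max_n + 1):
                if accumulated >= n:
                    result[object_id][d][n - 1] += 1
    return dict(result)
```

In a distributed setting each level would correspond to one shuffle, with the first-level result being the natural candidate for in-memory caching since it can serve many time ranges.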
  • 14. Real-World Use Case #3 Complex Machine Learning & Graph Analysis Algorithm: complex math operations • Mostly matrix based – Multiplication, factorization, etc. • Sometimes graph-based – E.g., sparse matrix Iterative computations • Matrix (graph) cached in memory across iterations
  • 15. Graph Analysis: N-Degree Association N-degree association in the graph • Computing associations between two vertices that are n hops away • E.g., friends of friends Graph-parallel implementation: $\mathrm{Weight}_1(u, v) = \mathrm{edge}(u, v) \in (0, 1)$, $\mathrm{Weight}_n(u, v) = \sum_{x \to v} \mathrm{Weight}_{n-1}(u, x) \cdot \mathrm{Weight}_1(x, v)$ • Bagel (Pregel on Spark) and GraphX – Memory optimizations for efficient graph caching critical • Speedup from 20+ minutes to <2 minutes
  • 16. Graph Analysis: N-Degree Association [Diagram: vertices u, v, w, each holding its state] State[u] = list of Weight(x, u) (for current top K weights to vertex u) State[v] = list of Weight(x, v) (for current top K weights to vertex v) State[w] = list of Weight(x, w) (for current top K weights to vertex w)
  • 17. Graph Analysis: N-Degree Association [Diagram: vertices v and w send messages to u] State[u] = list of Weight(x, u) (for current top K weights to vertex u) State[v] = list of Weight(x, v) (for current top K weights to vertex v) State[w] = list of Weight(x, w) (for current top K weights to vertex w) Messages (v to u) = {D(x, u) = Weight(x, v) * edge(v, u)} (for Weight(x, v) in State[v]) Messages (w to u) = {D(x, u) = Weight(x, w) * edge(w, u)} (for Weight(x, w) in State[w])
  • 18. Graph Analysis: N-Degree Association [Diagram: vertices v and w send messages to u; u combines them] State[u] = list of Weight(x, u) (for current top K weights to vertex u) State[v] = list of Weight(x, v) (for current top K weights to vertex v) State[w] = list of Weight(x, w) (for current top K weights to vertex w) Messages (v to u) = {D(x, u) = Weight(x, v) * edge(v, u)} (for Weight(x, v) in State[v]) Messages (w to u) = {D(x, u) = Weight(x, w) * edge(w, u)} (for Weight(x, w) in State[w]) $\mathrm{Weight}(u, x) = \sum_{D(x, u) \in \mathrm{Messages}} D(x, u)$ State[u] = list of Weight(x, u) (for current top K weights to vertex u)
  • 19. Graph Analysis: N-Degree Association [Diagram: vertices v and w send messages to u; u combines them] State[u] = list of Weight(x, u) (for current top K weights to vertex u) State[v] = list of Weight(x, v) (for current top K weights to vertex v) State[w] = list of Weight(x, w) (for current top K weights to vertex w) Messages (v to u) = {D(x, u) = Weight(x, v) * edge(v, u)} (for Weight(x, v) in State[v]) Messages (w to u) = {D(x, u) = Weight(x, w) * edge(w, u)} (for Weight(x, w) in State[w]) $\mathrm{Weight}(u, x) = \sum_{D(x, u) \in \mathrm{Messages}} D(x, u)$ State[u] = list of Weight(x, u) (for current top K weights to vertex u)
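The message-passing scheme on slides 15 to 19 can be sketched as a small Pregel-style loop in plain Python. This is only an illustration of the recurrence, not the Bagel or GraphX API: `n_degree_association`, the edge dictionary layout, and the choice to apply the top-K cut to the initial one-hop state are all assumptions introduced here.

```python
import heapq
from collections import defaultdict


def top_k(weights, k):
    """Keep the k largest entries of a {source_vertex: weight} dict."""
    return dict(heapq.nlargest(k, weights.items(), key=lambda kv: kv[1]))


def n_degree_association(edges, n, k):
    """Approximate n-hop association weights, keeping top K sources per vertex.

    edges: dict (src, dst) -> weight in (0, 1)
    Returns {v: {x: Weight_n(x, v)}} for vertices v reachable in exactly n hops.
    """
    out_edges = defaultdict(list)
    state = defaultdict(dict)
    for (src, dst), weight in edges.items():
        out_edges[src].append((dst, weight))
        state[dst][src] = weight  # Weight_1(src, dst) = edge(src, dst)
    state = {v: top_k(ws, k) for v, ws in state.items()}

    for _ in range(n - 1):
        inbox = defaultdict(lambda: defaultdict(float))
        for v, weights in state.items():
            for u, edge_vu in out_edges.get(v, []):
                # Message from v to u: D(x, u) = Weight(x, v) * edge(v, u)
                for x, weight_xv in weights.items():
                    inbox[u][x] += weight_xv * edge_vu
        # Each vertex sums its messages per source x and keeps the top K.
        state = {u: top_k(msgs, k) for u, msgs in inbox.items()}
    return state
```

Truncating each state to K entries bounds memory per vertex at the cost of exactness: a source dropped at hop n-1 contributes nothing at hop n, so the result is an approximation whenever K is smaller than the number of sources reaching a vertex.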
  • 20. Contributions by Intel • Netty based shuffle for Spark • FairScheduler for Spark • Spark job log files • Metrics system for Spark • Configurations system for Spark • Spark shell on YARN • Spark (standalone mode) integration with secure Hadoop • Byte code generation for Shark • Co-partitioned join in Shark • ...
  • 21. Summary The Spark stack: lightning-fast analytics over Hadoop data • Active communities and early adopters evolving – Apache incubator project • A growing stack: SQL, streaming, graph-parallel, machine learning, … Work with us on next-gen big data analytics using the Spark stack • Interactive and in-memory OLAP / BI • Complex machine learning & graph analysis • Near real-time, streaming processing • And many more!
  • 22. Notices and Disclaimers INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL® PRODUCTS. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY RELATING TO SALE AND/OR USE OF INTEL PRODUCTS, INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT, OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel may make changes to specifications, product descriptions, and plans at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. All dates provided are subject to change without notice. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others. Copyright © 2013, Intel Corporation. All rights reserved.