Spark: High-Speed Big
Data Analysis Framework
Intel
Andrew Xia
Weibo:Andrew-Xia
Agenda
•
•
•
•

Intel contributions to Spark
Collaboration
Real world cases
Summary
Spark Overview
• Open source projects initiated by
AMPLab in UC Berkeley
• Apache incubation since June 2013
UC BERKELEY

...
Contributions by Intel
• Netty based shuffle for Spark
Intel China
• 3 committers
• FairScheduler for Spark
• 7 contributo...
Agenda
•
•
•
•

Intel contributions to Spark
Collaboration
Real world cases
Summary
Collaboration Partners
• Intel partnering with several big websites
– Building next-gen big data analytics using the
Spark...
Big Data in Partners
• Advertising
– Operation analysis
– Effect analysis
– Directional optimization

• Analysis
– Website...
FAQs
FAQ #1: Poor Performance
– machine learning and graph computation
– OLAP for Tabular data, interactive query
Hadoop Data Sharing
iter. 1

. . .

iter. 2

Input

query 1

result 1

query 2

HDFS
read

result 2

query 3

result 3

Sl...
Spark Data Sharing
iter. 1
Input

query 1
one-time
processing

Input

iter. 2

query 2
query 3

. . .

10-100× faster than...
FAQs
FAQ #2: Too many big data systems
Big Data Systems Today

MapReduce
…
General batch
processing

Specialized systems
(iterative, interactive and
streaming ap...
Vision of Spark Ecosystem

One stack to rule them all!
Spark Ecosystem
Spark
Streaming

Graphx
Graphparallel

MLBase
Machine
learning

Spark
Tachyon
HDFS/Hadoop Storage
Mesos

Y...
FAQs
FAQ #3: Study cost
Code Size
140000
120000
100000

GraphX

80000

Shark*

60000

Streaming

40000

20000
0
Hadoop
Storm
MapReduce (Streaming)...
FAQs
FAQ #4: Is Spark Stable?
Spark Status
• Spark 0.8 has been released
• Spark 0.9 will be release in Jan 2013
FAQs

FAQ #5: Not enough memory to cache
Not Enough Memory
• Graceful degradation
• Scheduler takes care of this
• Other options
– MEMORY_ONLY
– MEMORY_ONLY_SER
– ...
FAQs

FAQ #6: How to recover when failure?
How to Failover?
iter. 1
Input

query 1
one-time
processing

Input

iter. 2

query 2
query 3

. . .

. . .
How to Failover?
• Lineage: track the graph of transformations
that built RDD
• Checkpoint: lineage graphs get large
FAQs

FAQ #7: Is Spark compatible with
Hadoop ecosystem?
FAQs

FAQ #8:Need port to Spark?
FAQs

FAQ #9: Any cons about Spark?
Agenda
•
•
•
•

Intel contributions to Spark
Collaboration
Real world cases
Summary
Case1#:Real-Time Log Aggregation
• Logs continuously collected & streamed in
– Through queuing/messaging systems

• Incomi...
Real-Time Log Aggregation: Spark Streaming
Log
Collectors

Kafka
Cluster

Spark
Cluster

RDBMS

• Implications
– Better st...
Case #2:Machine Learning & Graph Analysis
• Algorithm: complex match operations
– Mostly matrix based
• Multiplication, fa...
Graph Analysis: N-Degree Association
• N-degree association in the
graph
– Computing associations between two
vertices tha...
Graph Analysis: N-Degree Association
u

State[u] = list of Weight(x, u)
(for current top K weights to vertex u)

v

w

Sta...
Agenda
•
•
•
•

Intel contributions to Spark
Collaboration
Real world cases
Summary
Summary
• Memory is King!
• One stack to rule them all!
• Contribute to community!
夏俊鸾:Spark——基于内存的下一代大数据分析框架
Upcoming SlideShare
Loading in...5
×

夏俊鸾:Spark——基于内存的下一代大数据分析框架

1,070

Published on

BDTC 2013 Beijing China

Published in: Technology, Education
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,070
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
58
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

夏俊鸾:Spark——基于内存的下一代大数据分析框架

  1. 1. Spark: High-Speed Big Data Analysis Framework Intel Andrew Xia Weibo:Andrew-Xia
  2. 2. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  3. 3. Spark Overview • Open source projects initiated by AMPLab in UC Berkeley • Apache incubation since June 2013 UC BERKELEY • Intel closely collaborating with AMPLab & the community on open source development
  4. 4. Contributions by Intel • Netty based shuffle for Spark Intel China • 3 committers • FairScheduler for Spark • 7 contributors • Spark job log files • 50+ patches • Metrics system for Spark • Spark shell on YARN • Spark (standalone mode) integration with security Hadoop • Byte code generation for Shark • Co-partitioned join in Shark ...
  5. 5. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  6. 6. Collaboration Partners • Intel partnering with several big websites – Building next-gen big data analytics using the Spark stack – E.g., Alibaba , Baidu iQiyi, Youku, etc.
  7. 7. Big Data in Partners • Advertising – Operation analysis – Effect analysis – Directional optimization • Analysis – Website report – Platform report – Monitor system • Recommendation – Ranking list – Personal recommendation – Hot-click analysis
  8. 8. FAQs FAQ #1: Poor Performance – machine learning and graph computation – OLAP for Tabular data, interactive query
  9. 9. Hadoop Data Sharing iter. 1 . . . iter. 2 Input query 1 result 1 query 2 HDFS read result 2 query 3 result 3 Slow due to replication, serialization, and disk IO
  10. 10. Spark Data Sharing iter. 1 Input query 1 one-time processing Input iter. 2 query 2 query 3 . . . 10-100× faster than network and disk . . .
  11. 11. FAQs FAQ #2: Too many big data systems
  12. 12. Big Data Systems Today MapReduce … General batch processing Specialized systems (iterative, interactive and streaming apps)
  13. 13. Vision of Spark Ecosystem One stack to rule them all!
  14. 14. Spark Ecosystem Spark Streaming Graphx Graphparallel MLBase Machine learning Spark Tachyon HDFS/Hadoop Storage Mesos YARN MPI…… MapReduce Shark SQL
  15. 15. FAQs FAQ #3: Study cost
  16. 16. Code Size 140000 120000 100000 GraphX 80000 Shark* 60000 Streaming 40000 20000 0 Hadoop Storm MapReduce (Streaming) non-test, non-example source lines Impala (SQL) Giraph (Graph) Spark * also calls into Hive
  17. 17. FAQs FAQ #4: Is Spark Stable?
  18. 18. Spark Status • Spark 0.8 has been released • Spark 0.9 will be release in Jan 2013
  19. 19. FAQs FAQ #5: Not enough memory to cache
  20. 20. Not Enough Memory • Graceful degradation • Scheduler takes care of this • Other options – MEMORY_ONLY – MEMORY_ONLY_SER – MEMORY_AND_DISK – DISK_ONLY
  21. 21. FAQs FAQ #6: How to recover when failure?
  22. 22. How to Failover? iter. 1 Input query 1 one-time processing Input iter. 2 query 2 query 3 . . . . . .
  23. 23. How to Failover? • Lineage: track the graph of transformations that built RDD • Checkpoint: lineage graphs get large
  24. 24. FAQs FAQ #7: Is Spark compatible with Hadoop ecosystem?
  25. 25. FAQs FAQ #8:Need port to Spark?
  26. 26. FAQs FAQ #9: Any cons about Spark?
  27. 27. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  28. 28. Case1#:Real-Time Log Aggregation • Logs continuously collected & streamed in – Through queuing/messaging systems • Incoming logs processed in a (semi) streaming fashion – Aggregations for different time periods, demographics, etc. – Join logs and history tables when necessary • Aggregation results then consumed in a (semi) streaming fashion – Monitoring, alerting, etc.
  29. 29. Real-Time Log Aggregation: Spark Streaming Log Collectors Kafka Cluster Spark Cluster RDBMS • Implications – Better streaming framework support • Complex (e.g., statful) analysis, fault-tolerance, etc. – Kafka & Spark not collocated • DStream retrieves logs in background (over network) and caches blocks in memory – Memory tuning to reduce GC is critical • spark.cleaner.ttl (throughput * spark.cleaner.ttl < spark mem free size) • Storage level (MEMORY_ONLY_SER2) – Lower latency (several seconds) • No startup overhead (reusing SparkContext)
  30. 30. Case #2:Machine Learning & Graph Analysis • Algorithm: complex match operations – Mostly matrix based • Multiplication, factorization, etc. – Sometime graph-based • E.g., sparse matrix • Iterative computations – Matrix (graph) cached in memory across iterations
  31. 31. Graph Analysis: N-Degree Association • N-degree association in the graph – Computing associations between two vertices that are n-hop away – E.g., friends of friend • Graph-parallel implementation – Bagel (Pregel on Spark) and GraphX • Memory optimizations for efficient graph caching critical – Speedup from 20+ minutes to <2 minutes
  32. 32. Graph Analysis: N-Degree Association u State[u] = list of Weight(x, u) (for current top K weights to vertex u) v w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) u u Messages = {D(x, u) = Weight(x, v) * edge(w, u)} (for weight(x, v) in State[v]) v Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for weight(x, w) in State[w]) v w w
  33. 33. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  34. 34. Summary • Memory is King! • One stack to rule them all! • Contribute to community!
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×