夏俊鸾:Spark——基于内存的下一代大数据分析框架

1,642 views

Published on

BDTC 2013 Beijing China

Published in: Technology, Education
0 Comments
2 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
1,642
On SlideShare
0
From Embeds
0
Number of Embeds
9
Actions
Shares
0
Downloads
61
Comments
0
Likes
2
Embeds 0
No embeds

No notes for slide

夏俊鸾:Spark——基于内存的下一代大数据分析框架

  1. 1. Spark: High-Speed Big Data Analysis Framework Intel Andrew Xia Weibo:Andrew-Xia
  2. 2. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  3. 3. Spark Overview • Open source projects initiated by AMPLab in UC Berkeley • Apache incubation since June 2013 UC BERKELEY • Intel closely collaborating with AMPLab & the community on open source development
  4. 4. Contributions by Intel • Netty based shuffle for Spark Intel China • 3 committers • FairScheduler for Spark • 7 contributors • Spark job log files • 50+ patches • Metrics system for Spark • Spark shell on YARN • Spark (standalone mode) integration with security Hadoop • Byte code generation for Shark • Co-partitioned join in Shark ...
  5. 5. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  6. 6. Collaboration Partners • Intel partnering with several big websites – Building next-gen big data analytics using the Spark stack – E.g., Alibaba , Baidu iQiyi, Youku, etc.
  7. 7. Big Data in Partners • Advertising – Operation analysis – Effect analysis – Directional optimization • Analysis – Website report – Platform report – Monitor system • Recommendation – Ranking list – Personal recommendation – Hot-click analysis
  8. 8. FAQs FAQ #1: Poor Performance – machine learning and graph computation – OLAP for Tabular data, interactive query
  9. 9. Hadoop Data Sharing iter. 1 . . . iter. 2 Input query 1 result 1 query 2 HDFS read result 2 query 3 result 3 Slow due to replication, serialization, and disk IO
  10. 10. Spark Data Sharing iter. 1 Input query 1 one-time processing Input iter. 2 query 2 query 3 . . . 10-100× faster than network and disk . . .
  11. 11. FAQs FAQ #2: Too many big data systems
  12. 12. Big Data Systems Today MapReduce … General batch processing Specialized systems (iterative, interactive and streaming apps)
  13. 13. Vision of Spark Ecosystem One stack to rule them all!
  14. 14. Spark Ecosystem Spark Streaming Graphx Graphparallel MLBase Machine learning Spark Tachyon HDFS/Hadoop Storage Mesos YARN MPI…… MapReduce Shark SQL
  15. 15. FAQs FAQ #3: Study cost
  16. 16. Code Size 140000 120000 100000 GraphX 80000 Shark* 60000 Streaming 40000 20000 0 Hadoop Storm MapReduce (Streaming) non-test, non-example source lines Impala (SQL) Giraph (Graph) Spark * also calls into Hive
  17. 17. FAQs FAQ #4: Is Spark Stable?
  18. 18. Spark Status • Spark 0.8 has been released • Spark 0.9 will be release in Jan 2013
  19. 19. FAQs FAQ #5: Not enough memory to cache
  20. 20. Not Enough Memory • Graceful degradation • Scheduler takes care of this • Other options – MEMORY_ONLY – MEMORY_ONLY_SER – MEMORY_AND_DISK – DISK_ONLY
  21. 21. FAQs FAQ #6: How to recover when failure?
  22. 22. How to Failover? iter. 1 Input query 1 one-time processing Input iter. 2 query 2 query 3 . . . . . .
  23. 23. How to Failover? • Lineage: track the graph of transformations that built RDD • Checkpoint: lineage graphs get large
  24. 24. FAQs FAQ #7: Is Spark compatible with Hadoop ecosystem?
  25. 25. FAQs FAQ #8:Need port to Spark?
  26. 26. FAQs FAQ #9: Any cons about Spark?
  27. 27. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  28. 28. Case1#:Real-Time Log Aggregation • Logs continuously collected & streamed in – Through queuing/messaging systems • Incoming logs processed in a (semi) streaming fashion – Aggregations for different time periods, demographics, etc. – Join logs and history tables when necessary • Aggregation results then consumed in a (semi) streaming fashion – Monitoring, alerting, etc.
  29. 29. Real-Time Log Aggregation: Spark Streaming Log Collectors Kafka Cluster Spark Cluster RDBMS • Implications – Better streaming framework support • Complex (e.g., statful) analysis, fault-tolerance, etc. – Kafka & Spark not collocated • DStream retrieves logs in background (over network) and caches blocks in memory – Memory tuning to reduce GC is critical • spark.cleaner.ttl (throughput * spark.cleaner.ttl < spark mem free size) • Storage level (MEMORY_ONLY_SER2) – Lower latency (several seconds) • No startup overhead (reusing SparkContext)
  30. 30. Case #2:Machine Learning & Graph Analysis • Algorithm: complex match operations – Mostly matrix based • Multiplication, factorization, etc. – Sometime graph-based • E.g., sparse matrix • Iterative computations – Matrix (graph) cached in memory across iterations
  31. 31. Graph Analysis: N-Degree Association • N-degree association in the graph – Computing associations between two vertices that are n-hop away – E.g., friends of friend • Graph-parallel implementation – Bagel (Pregel on Spark) and GraphX • Memory optimizations for efficient graph caching critical – Speedup from 20+ minutes to <2 minutes
  32. 32. Graph Analysis: N-Degree Association u State[u] = list of Weight(x, u) (for current top K weights to vertex u) v w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) u u Messages = {D(x, u) = Weight(x, v) * edge(w, u)} (for weight(x, v) in State[v]) v Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for weight(x, w) in State[w]) v w w
  33. 33. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  34. 34. Summary • Memory is King! • One stack to rule them all! • Contribute to community!

×