Track A-2: Spark-Based Data Analytics (基於 Spark 的數據分析)

Speaker: Etu CTO 陳昭宇 (James Chen)



  1. 1. 1 Spark Drives Big Data Analytics Applications (Spark-Based Data Analytics) James Chen, Etu CTO, June 16, 2015
  2. 2. 2 • Spark Brief • What Cloudera is doing on Spark • Spark Use Cases • Cloudera’s Position on Spark • Etu and Cloudera Agenda
  3. 3. 3 A Brief Review of MapReduce. Key advances by MapReduce: • Data Locality: automatic split computation and launch of mappers appropriately • Fault Tolerance: writing out intermediate results and restartable mappers meant the ability to run on commodity hardware • Linear Scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions to problems. [Diagram: many parallel map tasks feeding a smaller set of reduce tasks]
  4. 4. 4 MapReduce: Good The Good: • Built in fault tolerance • Optimized IO path • Scalable • Developer focuses on Map/Reduce, not infrastructure • Simple? API
  5. 5. 5 MapReduce: Bad. The Bad: • Optimized for disk IO – doesn't leverage memory – iterative algorithms go through the disk IO path again and again • Primitive API – developers have to build on a very simple abstraction – key/value in and out – even basic things like join require extensive code • The result is often many files that need to be combined appropriately
  6. 6. 6 Spark is a general purpose computational framework with more flexibility than MapReduce Key properties: • Leverages distributed memory • Full Directed Graph expressions for data parallel computations • Improved developer experience Yet retains: Linear scalability, Fault-tolerance, and Data Locality based computations Reference: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf What is Spark?
  7. 7. 7 Spark: Easy and Fast Big Data. Easy to develop – highly productive language support – clean and expressive APIs – interactive shell – out-of-the-box functionality. Fast to run – general execution graphs – in-memory storage. 2-5× less code; up to 10× faster on disk, 100× in memory
  8. 8. 8 Spark Easy: Example – Word Count. Hadoop MapReduce: public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } Spark (Scala): val spark = new SparkContext(master, appName, [sparkHome], [jars]) val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  9. 9. 9 Hadoop Integration • Works with Hadoop Data • Runs With YARN Libraries • MLlib • Spark Streaming • GraphX (alpha) Out-of-the-Box Functionality Language support: • Improved Python support • SparkR • Java 8 • Schema support in Spark’s APIs
  10. 10. 10 Example: Logistic Regression. data = spark.textFile(...).map(readPoint).cache() w = numpy.random.rand(D) for i in range(iterations): gradient = data .map(lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x) .reduce(lambda x, y: x + y) w -= gradient print "Final w: %s" % w
  11. 11. 11 Memory Management Leads to Greater Performance • A Hadoop cluster with 100 nodes contains 10+ TB of RAM today and will double next year • 1 GB RAM ~ $10-$20 • Trends: half price every 18 months; 2x bandwidth every 3 years. [Diagram: a node with 64-128 GB RAM, 16 cores, 50 GB per sec] Memory can be an enabler for high-performance big data applications
  12. 12. 12 Fast: Using RAM, Operator Graphs. In-memory caching • data partitions read from RAM instead of disk. Operator graphs • scheduling optimizations • fault tolerance. [Diagram: an RDD operator DAG with map, join, filter, and groupBy stages; cached partitions are reused across stages]
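To make the caching point concrete, here is a minimal Scala sketch (the path and names are hypothetical): the first action materializes the filtered RDD and populates the cache, so the second action reads partitions from RAM rather than re-reading HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CachingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("caching-sketch"))

    // Transformations only build the operator graph; nothing runs yet.
    val errors = sc.textFile("hdfs:///data/events")  // hypothetical path
      .filter(_.contains("ERROR"))
      .cache()                                       // keep partitions in RAM after first use

    val total = errors.count()  // first action: reads HDFS, fills the cache
    val head  = errors.take(5)  // second action: served from cached partitions

    println(s"total=$total, first=${head.mkString(", ")}")
    sc.stop()
  }
}
```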
  13. 13. 13 Expressiveness of Programming Model. [Diagram: efficient group-by aggregations and other analytics, pipelined MapReduce jobs, and iterative jobs (machine learning), all expressed as chains of map and reduce stages]
  14. 14. 14 Logistic Regression Performance (Data Fits in Memory). [Chart: running time (s) vs. number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark; Hadoop takes about 110 s per iteration, while Spark's first iteration takes 80 s and further iterations about 1 s each]
  15. 15. 15 • Spark Brief • What Cloudera is doing on Spark • Spark Use Cases • Cloudera’s Position on Spark • Etu and Cloudera Agenda
  16. 16. 16 Spark Engineering in Cloudera • Cloudera embraced Spark in early 2014 • Engineering with Intel to broaden Spark ecosystem – Hive-on-Spark – Pig-on-Spark – Spark-over-YARN – Spark Streaming Reliability – General Spark Optimization
  17. 17. 17 Hive on Spark • Technology – Hive: “standard” SQL tool in Hadoop – Spark: next-gen distributed processing framework – Hive + Spark • Performance • Minimum feature gap • Industry – A lot of customers heavily invest in Hive – Want to leverage the Spark engine
  18. 18. 18 Design Principles • No or limited impact on Hive’s existing code path • Maximize code reuse • Minimum feature customization • Low future maintenance cost
  19. 19. 19 Class Hierarchy. [Diagram: TaskCompiler, specialized by MapRedCompiler, TezCompiler, and SparkCompiler, generates a Task (MapRedTask, TezTask, SparkTask); each Task is described by a Work (MapRedWork, TezWork, SparkWork)]
  20. 20. 20 Work – Metadata for Task • MapReduceWork contains one MapWork and a possible ReduceWork • SparkWork contains a graph of MapWorks and ReduceWorks. Query: select name, sum(value) as v from dec group by name order by v; [Diagram: the query compiles to two MR jobs (MapWork1 + ReduceWork1, MapWork2 + ReduceWork2) but to a single Spark job chaining MapWork1 → ReduceWork1 → ReduceWork2]
  21. 21. 21 Data Processing via Spark • Treat Table as HadoopRDD (input RDD) • Apply the function that wraps MR’s map-side processing • Shuffle map output using Spark’s transformations (groupByKey, sortByKey, etc) • Apply the function that wraps MR’s reduce-side processing
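As a rough illustration of this wrapping pattern, here is a hedged Scala sketch; the function names and the comma-separated row format are assumptions for illustration, not Hive's actual operator code:

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object HiveStyleJob {
  // Stand-in for Hive's map-side operator tree: parse rows into key/value pairs.
  def mapSide(rows: Iterator[String]): Iterator[(String, Long)] =
    rows.map { row =>
      val cols = row.split(",")
      (cols(0), cols(1).toLong)
    }

  // Stand-in for Hive's reduce-side operator tree: aggregate each key's values.
  def reduceSide(groups: Iterator[(String, Iterable[Long])]): Iterator[(String, Long)] =
    groups.map { case (key, values) => (key, values.sum) }

  def run(table: RDD[String]): RDD[(String, Long)] =
    table
      .mapPartitions(mapSide)      // wrap MR's map-side processing
      .groupByKey()                // shuffle via a Spark transformation
      .mapPartitions(reduceSide)   // wrap MR's reduce-side processing
}
```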
  22. 22. 22 Spark Plan • MapInput – encapsulate a table • MapTran – map-side processing • ShuffleTran – shuffling • ReduceTran – reduce-side processing Query: Select name, sum(value) as v from dec group by name order by v;
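For intuition, the plan for this query corresponds roughly to the RDD pipeline below; this is a sketch of what the MapInput → MapTran → ShuffleTran → ReduceTran chain computes, not the actual SparkTask implementation:

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object QueryPlanSketch {
  // rows: (name, value) pairs already read from the 'dec' table (the MapInput step).
  def plan(rows: RDD[(String, Long)]): RDD[(String, Long)] =
    rows
      .reduceByKey(_ + _)                    // group by name, sum(value): first shuffle
      .map { case (name, v) => (v, name) }   // re-key by the aggregate
      .sortByKey()                           // order by v: second shuffle
      .map { case (v, name) => (name, v) }
}
```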
  23. 23. 23 Current Status • All functionality in Hive is implemented • First round of optimization is completed – map join, SMB – split generation and grouping – CBO, vectorization • More optimization and benchmarking coming • Beta in CDH – http://archive-primary.cloudera.com/cloudera-labs/hive-on-spark/ – http://www.cloudera.com/content/cloudera/en/documentation/hive-spark/latest/PDF/hive-spark-get-started.pdf
  24. 24. 24 • Spark Brief • What Cloudera is doing on Spark • Spark Use Cases • Cloudera’s Position on Spark Agenda
  25. 25. 25 Sample Use Cases

| User | Use Case | Spark's Value |
| --- | --- | --- |
| Conviva | Optimize end users' online video experience through real-time traffic analysis and finer-grained traffic control | Rapid prototyping; business logic shared between offline and online computation; open-source machine-learning algorithms |
| Yahoo! | Speed up model-training cycles for ad serving, with 3x faster feature extraction; content recommendation via collaborative filtering | Lower data-pipeline latency; iterative machine learning; efficient P2P broadcast |
| Anonymous (Large Tech Company) | Near-real-time log aggregation and analysis for monitoring and alerting | Low-latency, high-frequency "mini" batch jobs over the latest data |
| Technicolor | Real-time analytics for (telecom) customers; stream processing and real-time query | Simple deployment, requiring only Spark and Spark Streaming; ad hoc queries over live data |
  26. 26. 26 Large Tech Company – Spark is used for new machine learning investigations for search personalization Financial Services – Process millions of stock positions and future scenarios in 4hrs with Spark (compared with 1 week using MapReduce) University – Genomics research using Spark pipelines Video – Spark and Spark Streaming for video streaming and analysis Hospital – Spark for predictive modeling of disease conditions Cloudera Use Cases in Verticals
  27. 27. 27 Cloudera Use Cases with Different Components • Run ETL on Spark using Pig – to achieve very tight SLAs – Accenture Smart Water Application • Spark analytics over HBase – patients' physiological data, experiment and user data – serving researchers • Traffic analysis using MLlib/clustering at Dylan • Annotated variants analysis on Spark – using the Spark/Java framework at Duke • Sepsis detection with Spark Streaming
  28. 28. 28 Near real-time dashboard by Edmunds.com • A car-shopping website where people from all across the nation come to read reviews, compare prices, and in general get help in all car-related matters. • The goal was to build a near real-time dashboard providing both unique-visitor and page-view counts per make and make/model, and to engineer it in a couple of weeks. • In the past, these updates had been restricted to hourly granularity with an additional hour of delay. • Furthermore, as this data was not available in an easy-to-use dashboard, manual processing was needed to visualize it.
  29. 29. 29 Prototype Architecture
  30. 30. 30 Page View Per Minute
  31. 31. 31 Unique Visitor Per Minute
  32. 32. 32 Total UV by Maker/Model
  33. 33. 33 Case Study: Etu Insight • Problem domain: – analyze user behavior from website interaction logs – analyze user behavior from existing offline data – aggregate the data, grouping by time and user • Approach: – ETL from the web logs into Hive structured data – import existing database data – define and implement the aggregation function in Spark (with Scala) – output the calculation results to HBase (see the sketch after the next slide)
  34. 34. 34 Architecture & Flow: web logs and user data → Hive (structured data) → Spark → HBase
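A hedged Scala sketch of the aggregation and HBase output steps in this flow; the log layout, paths, HBase table name, and column family are all assumptions for illustration, not Etu Insight's actual code:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object InsightAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()

    // Hypothetical layout: tab-separated rows of (userId, timestamp, url)
    // produced by the ETL step into the Hive warehouse directory.
    val log = sc.textFile("hdfs:///user/hive/warehouse/weblog")

    // Aggregate page views per (user, hour).
    val counts = log.map(_.split("\t"))
      .map(f => ((f(0), f(1).take(13)), 1L))  // key: (userId, "yyyy-MM-dd HH" prefix)
      .reduceByKey(_ + _)

    // Write the result to HBase via the classic TableOutputFormat pattern.
    val conf = new JobConf(HBaseConfiguration.create())
    conf.setOutputFormat(classOf[TableOutputFormat])
    conf.set(TableOutputFormat.OUTPUT_TABLE, "insight_counts")  // hypothetical table

    counts.map { case ((user, hour), n) =>
      val put = new Put(Bytes.toBytes(s"$user#$hour"))
      put.add(Bytes.toBytes("d"), Bytes.toBytes("views"), Bytes.toBytes(n))
      (new ImmutableBytesWritable, put)
    }.saveAsHadoopDataset(conf)

    sc.stop()
  }
}
```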
  35. 35. 35 Etu Insight Dashboard
  36. 36. 36 Advanced Analytics with Spark • Written by Cloudera's data science team – the first book bridging ML with the Hadoop ecosystem – focusing on use cases and examples rather than being a manual – targeted at data scientists solving real-world analysis problems – generally available in May 2015
  37. 37. 37 Analyzing Big Data • Building a model to detect credit card fraud using thousands of features and billions of transactions • Intelligently recommend millions of products to millions of users • Estimate financial risk through simulations of portfolios including millions of instruments • Easily manipulate data from thousands of human genomes to detect genetic associations with disease
  38. 38. 38 • Spark Brief • What Cloudera is doing on Spark • Spark Use Cases • Cloudera’s Position on Spark • Etu and Cloudera Agenda
  39. 39. 39 Spark is a fully integrated and supported part of Cloudera’s enterprise data hub • First vendor to ship and support Spark – Invested early to make it a cohesive part of the platform – Complemented by Intel’s early investment – Developed and supported in collaboration with Databricks to ensure success • Only vendor with Spark committers on staff • Several Spark use cases in production • Well-trained support staff and external Training Courses Cloudera’s Investment in Spark
  40. 40. 40 Hadoop in the Spark World. [Diagram: Spark, Spark Streaming, GraphX, MLlib, and SparkSQL run on YARN alongside Hive, Pig, Impala, MapReduce2, and Search, all over HDFS and HBase. Legend: core Hadoop; supported Spark components; unsupported add-ons]
  41. 41. 41 Focusing on Open Standards, not just Open Source. Open standards are just as important as open source. Why does it matter? • Diverse engineering is more sustainable. • Broad support ensures vendor portability. • Project utility depends on ecosystem compatibility, which depends on standards. Cloudera leads in defining the de facto open standards adopted by the market. Vendor support by component:

| Component (Founder) | Cloudera | Pivotal | MapR | Amazon | IBM | Hortonworks |
| --- | --- | --- | --- | --- | --- | --- |
| Spark (UC Berkeley) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Impala (Cloudera) | ✔ | ✖ | ✔ | ✔ | ✖ | ✖ |
| Hue (Cloudera) | ✔ | ✖ | ✔ | ✔ | ✖ | ✔ |
| Sentry (Cloudera) | ✔ | ✔ | ✔ | ✖ | ✔ | ✖ |
| Flume (Cloudera) | ✔ | ✔ | ✔ | ✖ | ✔ | ✔ |
| Parquet (Cloudera/Twitter) | ✔ | ✔ | ✔ | ✔ | ✔ | ✖ |
| Sqoop (Cloudera) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Falcon | ✖ | ✖ | ✖ | ✖ | ✖ | ✔ |
| Knox | ✖ | ✖ | ✖ | ✖ | ✖ | ✔ |
| Tez | ✖ | ✖ | ✔ | ✖ | ✖ | ✔ |
| Ranger | ✖ | ✖ | ✖ | ✖ | ✖ | ✔ |
  42. 42. 42 Cloudera is a member of, and aligned with, the broader Spark community Spark: • Will replace MapReduce as the general purpose Hadoop framework – Broad community and vendor adoption – Hadoop ecosystem integration (native & 3rd party) • Goes beyond data science/machine learning – Cloudera working on Spark Core, Streaming, Security, YARN, and MLlib • Does not replace special purpose frameworks – One size does not fit all for SQL, Search, Graph, Stream Cloudera’s Position on Spark
  43. 43. 43 • Spark Brief • What Cloudera is doing on Spark • Spark Use Cases • Cloudera’s Position on Spark • Etu and Cloudera Agenda
  44. 44. 44 Cloudera Partners with Etu
  45. 45. 45 Etu's Positioning and Value in Enterprise Hadoop. [Diagram: the enterprise Hadoop journey runs from talent recruiting and team building, through application development, data architecture and mining design, deployment and tuning, and operations and management, to the application platform and capturing the market; Etu's core value is resource allocation plus standardization and automation, reducing the complexity of deploying and operating a Hadoop platform] • Save effort: on-site installation and tuning, plus project-based technical services • Save time: consulting and training help teams get up to speed quickly • Peace of mind: local technical support lowers adoption risk • Know-how: years of shared experience to clear the key hurdles. Offerings: Etu Manager, Etu Professional Service, Etu Consulting, Etu Training, Etu Services, Etu Support
  46. 46. 46 Etu Big Data Software Platform and Services. [Diagram: Etu Manager (with Cloudera Manager inside) at the core, surrounded by Etu Services: Etu Support, Etu Professional Service, Etu Consulting, and Etu Training, backed by Cloudera Support]
  47. 47. 47 Etu Manager Makes Hadoop Easier: a fully automated, high-performance, easy-to-manage big data processing platform • mainstream x86 commodity servers • performance optimization • full-cluster management • automated bare-metal deployment • the only local Hadoop professional service
  48. 48. 48 Etu Services • Etu Technical Support, 8x5 (billed annually): Etu Manager feature-module updates; technical consulting for HDFS / MapReduce / HBase / Pig / Hive / Impala / Spark (by email); upgrade and update packages in step with CDH; customer issue management • Etu Professional Services (billed per person-day): Hadoop cluster planning and design; Hadoop software architecture and data-model design; Hadoop system installation and deployment (on-site); Hadoop data processing and application development; Hadoop cluster maintenance, inspection, and tuning (on-site); Hadoop data migration • Etu Consulting (billed per person-day): cluster planning and network architecture design / advisory; application architecture design / advisory • Etu Training (billed per seat): standard courses following the Hadoop learning roadmap, covering hands-on big data skills for different roles; dedicated corporate classes
  49. 49. 49 Learn More: Booth 4: Etu Data Lake; Booth 5: Cloudera
  50. 50. 50 Appendix Concepts
  51. 51. 51 • Driver & Workers • RDD – Resilient Distributed Dataset • Transformations • Actions • Caching Spark Concepts - Overview
  52. 52. 52 Drivers and Workers. [Diagram: the driver sends tasks to multiple workers; each worker holds data partitions in RAM and returns results to the driver]
  53. 53. 53 • Read-only partitioned collection of records • Created through: – Transformation of data in storage – Transformation of RDDs • Contains lineage to compute from storage • Lazy materialization • Users control persistence and partitioning RDD – Resilient Distributed Dataset
  54. 54. 54 Operations. Transformations: map, filter, sample, join. Actions: reduce, count, first, take, saveAs
  55. 55. 55 • Transformations create new RDD from an existing one • Actions run computation on RDD and return a value • Transformations are lazy • Actions materialize RDDs by computing transformations • RDDs can be cached to avoid re-computing Operations
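For example, in the Spark shell (where sc is predefined), nothing below executes until the reduce action at the end:

```scala
val nums = sc.parallelize(1 to 1000000)

// Transformations: only lineage is recorded here, no computation happens.
val evens   = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Action: triggers computation of the whole chain and returns a value.
val total = squares.reduce(_ + _)
```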
  56. 56. 56 Fault-Tolerance • RDDs contain lineage • Lineage – source location and list of transformations • Lost partitions can be re-computed from source data. msgs = textFile.filter(lambda s: s.startswith("ERROR")) .map(lambda s: s.split("\t")[2]) [Diagram: lineage chain from HDFS file to filtered RDD (filter, func = startswith(...)) to mapped RDD (map, func = split(...))]
  57. 57. 57 Caching • persist() and cache() mark data for caching • An RDD is cached after the first action • Fault-tolerant – lost partitions will be re-computed • If there is not enough memory, some partitions will not be cached • Future actions are performed on cached partitions, so they are much faster. Use caching for iterative algorithms
  58. 58. 58 • MEMORY_ONLY • MEMORY_AND_DISK • MEMORY_ONLY_SER • MEMORY_AND_DISK_SER • DISK_ONLY • MEMORY_ONLY_2, MEMORY_AND_DISK_2… Caching – Storage Levels
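A short shell sketch of choosing a level (the path is hypothetical): cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), while MEMORY_AND_DISK spills partitions that do not fit in RAM instead of dropping them:

```scala
import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/big")  // hypothetical path

val hot  = lines.filter(_.nonEmpty).cache()  // same as persist(MEMORY_ONLY)
val warm = lines.map(_.toUpperCase).persist(StorageLevel.MEMORY_AND_DISK)

hot.count()  // first action computes and caches the partitions
hot.count()  // second action is served from memory
```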
  59. 59. 59 • map • filter • groupBy • sort • union • join • leftOuterJoin • rightOuterJoin Easy: Expressive API • reduce • count • fold • reduceByKey • groupByKey • cogroup • cross • zip • sample • take • first • partitionBy • mapWith • pipe • save ...
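A small shell example exercising a few of these operators; the data and names are made up purely for illustration:

```scala
val users  = sc.parallelize(Seq((1, "amy"), (2, "bob"), (3, "carl")))
val orders = sc.parallelize(Seq((1, 25.0), (1, 40.0), (3, 10.0)))

val spend  = orders.reduceByKey(_ + _)   // per-user order totals
val joined = users.leftOuterJoin(spend)  // keeps users with no orders

val report = joined
  .map { case (_, (name, total)) => (name, total.getOrElse(0.0)) }
  .sortBy(_._2, ascending = false)       // biggest spenders first

report.collect().foreach(println)
```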
