Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305

Speaker notes:
  • Add “variables” to the “functions” in functional programming
  • Note that the dataset is reused on each gradient computation
  • Key idea: add “variables” to the “functions” in functional programming (a minimal sketch follows these notes)
  • This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
  • Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join
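A minimal sketch of the key idea in these notes: a distributed dataset is bound to an ordinary variable, cached, and then reused by several functional operations, so repeated computations (such as the gradient steps mentioned above) read the data from memory instead of re-scanning it. Everything here is illustrative; fuller examples accompany slides 15 and 18 below.

    import org.apache.spark.{SparkConf, SparkContext}

    // Local SparkContext purely for illustration.
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // The "variable": a cached distributed dataset.
    val nums = sc.parallelize(1 to 1000000).cache()

    // Two independent "functions" over the same cached data; only the first
    // action pays the cost of materialising it.
    val sumOfSquares = nums.map(x => x.toLong * x).reduce(_ + _)
    val evens        = nums.filter(_ % 2 == 0).count()

    println(s"sum of squares = $sumOfSquares, evens = $evens")
    sc.stop()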

    1. UC BERKELEY
    2. It’s All Happening On-line. User-generated data (web, social & mobile): every click, ad impression, billing event, fast-forward/pause, friend request, transaction, network message, fault, … plus the Internet of Things / M2M and scientific computing.
    3. Volume: petabytes+. Variety: unstructured. Velocity: real-time. Our view: more data should mean better answers, but cost, time, and answer quality must be balanced.
    4. (image-only slide)
    5. UC Berkeley: Algorithms (machine learning and analytics), Machines (cloud computing), and People (crowdsourcing & human computation), applied to massive and diverse data.
    6. …throughout the entire analytics lifecycle.
    7. Organized for collaboration: Alex Bayen (mobile sensing), Anthony Joseph (security/privacy), Ken Goldberg (crowdsourcing), Randy Katz (systems), *Michael Franklin (databases), Dave Patterson (systems), Armando Fox (systems), *Ion Stoica (systems), *Mike Jordan (machine learning), Scott Shenker (networking).
    8. (image-only slide)
    9. > 450,000 downloads
    10. Big Data meets genomics: sequencing costs have fallen ~150X (chart: $K per genome, 2001-2014). UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel cluster @ TCGA: 5 PB = 20 cancers x 1000 genomes. See Dave Patterson’s talk (Thursday 3-4, BDT205) and David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011.
    11. Software stack (a mix of released and in-progress AMPLab components plus 3rd-party systems): MLBase (declarative machine learning), BlinkDB (approximate query processing), Shark (SQL) + Streaming, Spark Streaming, Spark with shared RDDs (distributed memory), Hadoop MR / MPI / GraphLab etc., Mesos (cluster resource manager), HDFS.
    12. (image-only slide)
    13. (image-only slide)
    14. Lightning-Fast Cluster Computing
    15. Base RDD, transformed RDDs, cached RDDs, and actions (the driver ships tasks to workers, each of which caches a data block in memory):
        lines = spark.textFile("hdfs://...")            // base RDD
        errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
        messages = errors.map(_.split('\t')(2))
        cachedMsgs = messages.cache()
        cachedMsgs.filter(_.contains("foo")).count      // action
        cachedMsgs.filter(_.contains("bar")).count
        Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
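A self-contained version of the snippet on slide 15, as a sketch: it uses the modern org.apache.spark package names rather than the 2012-era API, the input path is a placeholder, and it assumes error lines carry at least three tab-separated fields.

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        // Local context for illustration; a real cluster would use a different master URL.
        val sc = new SparkContext(new SparkConf().setAppName("log-mining").setMaster("local[*]"))

        val lines      = sc.textFile("hdfs://namenode/logs/*")   // base RDD; path is a placeholder
        val errors     = lines.filter(_.startsWith("ERROR"))     // transformed RDD
        val messages   = errors.map(_.split('\t')(2))            // pull out the message field
        val cachedMsgs = messages.cache()                        // keep in memory after first use

        // The first action materialises and caches the messages; later queries hit the cache.
        println(cachedMsgs.filter(_.contains("foo")).count())
        println(cachedMsgs.filter(_.contains("bar")).count())

        sc.stop()
      }
    }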
    16. messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
        Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
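To inspect the lineage slide 16 describes, a sketch like the one below (reusing the `sc` from the previous example) prints an RDD's dependency chain with `toDebugString`; note that recent Spark versions report generic MapPartitionsRDD nodes rather than the FilteredRDD/MappedRDD names on the slide.

    // Build the same chain as the slide and ask Spark how it recorded it.
    val messages = sc.textFile("hdfs://namenode/logs/*")   // placeholder path
      .filter(_.contains("error"))
      .map(_.split('\t')(2))

    // Prints the chain of parent RDDs (the lineage) used to recompute lost partitions.
    println(messages.toDebugString)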
    17. (Diagram: a random initial line converging toward the target separator.)
    18. Logistic regression in Spark:
        - Load the data in memory once: map(readPoint).cache()
        - Start from an initial parameter vector w
        - Repeated map/reduce steps perform gradient descent:
          map(p => (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x).reduce(_ + _)
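A self-contained sketch of the gradient-descent loop on slide 18, assuming space-separated "label feature…" input; the file name, dimensionality D, iteration count, and the use of plain arrays instead of a vector class are illustrative choices, not the original demo code.

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.math.exp
    import scala.util.Random

    object LogisticRegressionSketch {
      // One labelled example: y is +1 or -1, x is the feature vector.
      case class Point(x: Array[Double], y: Double)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("lr-sketch").setMaster("local[*]"))
        val D = 10            // number of features (assumed)
        val iterations = 20   // number of gradient steps (assumed)

        // Load the data in memory once; the cached dataset is reused on every
        // gradient computation (see the speaker notes).
        val points = sc.textFile("points.txt").map { line =>    // file name is a placeholder
          val parts = line.split(" ").map(_.toDouble)
          Point(parts.tail, parts.head)
        }.cache()

        // Random initial parameter vector.
        var w = Array.fill(D)(2 * Random.nextDouble() - 1)

        // Repeated map/reduce steps implementing the gradient from the slide.
        for (_ <- 1 to iterations) {
          val gradient = points.map { p =>
            val margin = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
            val scale  = (1.0 / (1.0 + exp(-p.y * margin)) - 1.0) * p.y
            p.x.map(_ * scale)
          }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
          w = w.zip(gradient).map { case (wi, gi) => wi - gi }
        }

        println("Final w: " + w.mkString(", "))
        sc.stop()
      }
    }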
    19. (Chart: running time in minutes vs. number of iterations, Hadoop vs. Spark. Hadoop: 110 s / iteration; Spark: 80 s for the first iteration, 1 s for further iterations.)
    20. Java API (out now):
        JavaRDD<String> lines = sc.textFile(...);
        lines.filter(new Function<String, Boolean>() {
          Boolean call(String s) { return s.contains("error"); }
        }).count();
        PySpark (coming soon):
        lines = sc.textFile(...)
        lines.filter(lambda x: "error" in x).count()
    21. (Chart: time in hours — Hive: 20, Spark: 0.5.)
    22. Hive architecture: Client (CLI, JDBC) → Driver (meta store; SQL query parser → optimizer → physical plan → execution) → MapReduce → HDFS.
    23. Shark architecture: Client (CLI, JDBC) → Driver (meta store, cache manager; SQL query parser → optimizer → physical plan → execution) → Spark → HDFS.
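For flavor, a sketch of driving Shark from Scala. The sql2rdd call follows the Shark project's own examples, but the initialisation helper, table name, and column are assumptions from memory rather than a verified API, and Shark's convention of caching tables whose names end in _cached is likewise a recollection; treat the whole block as a sketch.

    import shark.{SharkContext, SharkEnv}

    // Assumed initialisation helper from the Shark 0.x docs (not verified).
    val sc: SharkContext = SharkEnv.initWithSharkContext("shark-sketch")

    // HiveQL is parsed and optimised by the driver shown above but executes on Spark;
    // sql2rdd returns the result as an RDD that ordinary Spark code can keep processing.
    val errorLogs = sc.sql2rdd("SELECT msg FROM logs_cached WHERE level = 'ERROR'")  // hypothetical table
    println(errorLogs.count())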
    24. Row storage vs. column storage:
        Row storage:          Column storage:
        1  john   4.1         1     2     3
        2  mike   3.5         john  mike  sally
        3  sally  6.4         4.1   3.5   6.4
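A toy sketch (not Shark's actual implementation) of why the columnar layout on slide 24 helps: each column lives in its own dense array, so a query that needs one field scans one array instead of touching every row object.

    // Field names mirror the slide's sample table.
    case class RowUser(id: Int, name: String, score: Double)

    // Row storage: one object per record.
    val rows = Seq(RowUser(1, "john", 4.1), RowUser(2, "mike", 3.5), RowUser(3, "sally", 6.4))

    // Column storage: one array per field.
    val ids    = Array(1, 2, 3)
    val names  = Array("john", "mike", "sally")
    val scores = Array(4.1, 3.5, 6.4)

    val avgFromRows    = rows.map(_.score).sum / rows.size      // touches whole rows
    val avgFromColumns = scores.sum / scores.length             // touches a single column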
    25. (Chart: Selection query, Shark vs. Shark (disk) vs. Hive; 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al). Shark: 1.1.)
    26. (Chart: Group By query, Shark vs. Shark (disk) vs. Hive; 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al). Shark: 32.)
    27. (Chart: Join query, Shark (copartitioned) vs. Shark vs. Shark (disk) vs. Hive; 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al). Shark (copartitioned): 105.)
    28. (Charts: Query 1, Query 2, and Query 3 on the Conviva dataset, Shark vs. Shark (disk) vs. Hive; 100 m2.4xlarge nodes, 1.7 TB. Shark: 1.0, 0.8, 0.7.)
    29. spark-project.org | amplab.cs.berkeley.edu | UC BERKELEY
    30. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
