
Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305


  1. UC BERKELEY
  2. It’s All Happening Online: user-generated data (web, social & mobile) from every click, ad impression, billing event, fast-forward/pause, friend request, transaction, network message, fault, and more; plus Internet of Things / M2M and scientific computing.
  3. Volume: petabytes and beyond. Variety: unstructured. Velocity: real time. Our view: more data should mean better answers, but we must balance cost, time, and answer quality.
  4. (image-only slide; no text)
  5. UC Berkeley: Algorithms (machine learning and analytics), Machines (cloud computing), and People (crowdsourcing & human computation) applied to massive and diverse data.
  6. Throughout the entire analytics lifecycle.
  7. Organized for collaboration: Alex Bayen (mobile sensing), Anthony Joseph (security/privacy), Ken Goldberg (crowdsourcing), Randy Katz (systems), *Michael Franklin (databases), Dave Patterson (systems), Armando Fox (systems), *Ion Stoica (systems), *Mike Jordan (machine learning), Scott Shenker (networking).
  8. (image-only slide; no text)
  9. > 450,000 downloads.
  10. Big Data: sequencing costs down ~150x (chart: $K per genome, 2001-2014). UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel cluster. TCGA: 5 PB = 20 cancers x 1000 genomes. See Dave Patterson’s talk, Thursday 3-4, BDT205, and David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011.
  11. The AMPLab software stack: MLBase (declarative machine learning), BlinkDB (approximate query processing), Shark (SQL) + Streaming, Spark Streaming, and Spark over shared RDDs (distributed memory), running on Mesos (cluster resource manager) and HDFS; third-party frameworks such as Hadoop MapReduce, MPI, and GraphLab run alongside. Components are labeled 3rd party, AMPLab (released), or AMPLab (in progress).
  12. (image-only slide; no text)
  13. (image-only slide; no text)
  14. Lightning-Fast Cluster Computing
  15. Example: log mining with cached RDDs (a runnable Scala sketch appears after the slide list):
        lines = spark.textFile("hdfs://...")             // base RDD
        errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
        messages = errors.map(_.split('\t')(2))
        cachedMsgs = messages.cache()
        cachedMsgs.filter(_.contains("foo")).count       // action
        cachedMsgs.filter(_.contains("bar")).count
      The driver ships tasks to workers; each worker reads a block of the file (Block 1-3 in the diagram) and keeps its cached partition in memory (Cache 1-3). Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data), scaling to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
  16. RDD lineage: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2)) produces the chain HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...)). A lineage-inspection sketch appears after the slide list.
  17. (illustration: a “random initial line” and the “target” separator for the classification example)
  18. Logistic regression: load the data in memory once (map readPoint, then cache), start from an initial parameter vector w, and repeat map/reduce steps to do gradient descent:
        map:    p => (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
        reduce: _ + _
      (An expanded Scala sketch appears after the slide list.)
  19. (chart: running time in minutes vs. number of iterations, Hadoop vs. Spark; Hadoop takes ~110 s per iteration, while Spark takes 80 s for the first iteration and about 1 s for further iterations)
  20. Java API (out now):
        JavaRDD<String> lines = sc.textFile(...);
        lines.filter(new Function<String, Boolean>() {
          Boolean call(String s) { return s.contains("error"); }
        }).count();
      PySpark (coming soon):
        lines = sc.textFile(...)
        lines.filter(lambda x: "error" in x).count()
  21. (chart: time in hours; Hive ~20 hours vs. Spark 0.5 hours)
  22. Hive architecture: Client (CLI, JDBC); Driver with Meta store, SQL parser, query optimizer, and physical plan; execution via MapReduce on HDFS.
  23. Shark architecture: the same Client (CLI, JDBC) and Driver with Meta store, SQL parser, query optimizer, and physical plan, plus a Cache Manager; execution via Spark on HDFS instead of MapReduce. (An illustrative sketch of a query expressed as Spark operations appears after the slide list.)
  24. Row storage vs. column storage for a table with rows (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4): row storage lays out each record together (1 john 4.1 | 2 mike 3.5 | 3 sally 6.4), while column storage lays out each column together (1 2 3 | john mike sally | 4.1 3.5 6.4). (A sketch contrasting the two layouts appears after the slide list.)
  25. (chart: Selection query, 2.1 TB benchmark from Pavlo et al. on 100 m2.4xlarge nodes; Shark 1.1 vs. much longer for Shark (disk) and Hive, y-axis 0-100)
  26. (chart: Group By query, same 2.1 TB benchmark on 100 m2.4xlarge nodes; Shark 32 vs. much longer for Shark (disk) and Hive, y-axis 0-600)
  27. (chart: Join query, same 2.1 TB benchmark on 100 m2.4xlarge nodes; Shark (copartitioned) 105 vs. much longer for Shark, Shark (disk), and Hive, y-axis 0-1800)
  28. (charts: three queries on a 1.7 TB Conviva dataset, 100 m2.4xlarge nodes; Shark answers Query 1, 2, and 3 in 1.0, 0.8, and 0.7 respectively, vs. much longer for Shark (disk) and Hive)
  29. spark-project.org | amplab.cs.berkeley.edu | UC BERKELEY
  30. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
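
The sketches below expand on a few of the slides above; they are illustrative reconstructions, not code from the talk. First, slide 15's log-mining example as a self-contained Scala program. Assumptions: the current org.apache.spark package names (the 2012 talk predates them), a local master, and the placeholder HDFS path left exactly as on the slide.

    import org.apache.spark.{SparkConf, SparkContext}

    object LogMining {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LogMining").setMaster("local[4]"))

        val lines = sc.textFile("hdfs://...")               // base RDD (path is a placeholder)
        val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
        val messages = errors.map(_.split('\t')(2))         // pull out the message field
        val cachedMsgs = messages.cache()                    // keep the filtered data in cluster memory

        // The first action materializes and caches the RDD; later actions hit the cache.
        println(cachedMsgs.filter(_.contains("foo")).count())
        println(cachedMsgs.filter(_.contains("bar")).count())

        sc.stop()
      }
    }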
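For slide 16, a hedged sketch of inspecting an RDD's lineage with toDebugString. It assumes an existing SparkContext sc (for example, the one above or spark-shell); modern Spark reports generic MapPartitionsRDDs rather than the FilteredRDD/MappedRDD names shown on the slide.

    val messages = sc.textFile("hdfs://...")    // HadoopRDD: path = hdfs://... (placeholder)
      .filter(_.contains("error"))              // filtered child: func = _.contains(...)
      .map(_.split('\t')(2))                    // mapped child:   func = _.split(...)

    // Prints the chain of parent RDDs Spark would replay to rebuild lost partitions.
    println(messages.toDebugString)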
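For slide 18, the gradient-descent loop expanded into a runnable sketch. Only the update formula and the load-once/cache pattern come from the slide; the input format (space-separated label then features), the iteration count, and the helper names Point, parsePoint, and dot are assumptions, and sc is again an existing SparkContext.

    import scala.math.exp
    import scala.util.Random

    case class Point(x: Array[Double], y: Double)            // y is the label (+1 / -1)

    def parsePoint(line: String): Point = {                   // assumed format: "label f1 f2 ..."
      val nums = line.split(' ').map(_.toDouble)
      Point(nums.tail, nums.head)
    }

    def dot(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (u, v) => u * v }.sum

    val points = sc.textFile("hdfs://...").map(parsePoint).cache()   // load data in memory once
    val d = points.first().x.length
    var w = Array.fill(d)(2 * Random.nextDouble() - 1)               // random initial parameter vector

    for (_ <- 1 to 10) {                                             // repeated map/reduce steps
      val gradient = points
        .map { p =>
          val s = (1 / (1 + exp(-p.y * dot(w, p.x))) - 1) * p.y      // scalar factor from the slide
          p.x.map(_ * s)                                             // ... times p.x
        }
        .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })     // sum the per-point gradients
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }           // gradient-descent update
    }
    println(w.mkString(" "))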
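For slide 23, an illustrative sketch of what "execution on Spark instead of MapReduce" means for a simple GROUP BY. This is not Shark's planner output; the query, the table layout (tab-separated lines with the grouped column at index 1), and the path are made up, and sc is once more an existing SparkContext.

    // Roughly: SELECT col1, COUNT(*) FROM logs GROUP BY col1
    val logs = sc.textFile("hdfs://.../logs")           // scan, in place of a MapReduce input read
    val counts = logs
      .map(line => (line.split('\t')(1), 1L))           // project the grouping column, emit (key, 1)
      .reduceByKey(_ + _)                                // per-partition combine + shuffle, in memory
    counts.collect().foreach { case (key, n) => println(s"$key\t$n") }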
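For slide 24, a small Scala sketch contrasting the two layouts using the slide's three example rows. The class names are invented for illustration; the point is that column storage keeps each column's values contiguous (and primitives unboxed), so a scan over one column never touches the others.

    // Row storage: one object per record.
    case class UserRow(id: Int, name: String, score: Double)
    val rowStore: Array[UserRow] = Array(
      UserRow(1, "john", 4.1),
      UserRow(2, "mike", 3.5),
      UserRow(3, "sally", 6.4)
    )

    // Column storage: one array per column.
    case class UserColumns(ids: Array[Int], names: Array[String], scores: Array[Double])
    val columnStore = UserColumns(
      ids    = Array(1, 2, 3),
      names  = Array("john", "mike", "sally"),
      scores = Array(4.1, 3.5, 6.4)
    )

    // Touches only the scores column; ids and names are never read.
    val avgScore = columnStore.scores.sum / columnStore.scores.length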
