BDT305 Transforming Big Data with Spark and Shark - AWS re:Invent 2012

The Berkeley AMPLab is developing a new open source data analysis software stack by deeply integrating machine learning and data analytics at scale (Algorithms), cloud and cluster computing (Machines) and crowdsourcing (People) to make sense of massive data. Current application efforts focus on cancer genomics, real-time traffic prediction, and collaborative analytics for mobile devices. In this talk, we present an overview of this stack and demonstrate key components: Spark and Shark.

Transcript

  • 1. UC BERKELEY
  • 2. It’s All Happening On-line. User Generated (Web, Social & Mobile); Internet of Things / M2M; Scientific Computing. Every: click, ad impression, billing event, fast forward, pause, …, friend request, transaction, network message, fault, …
  • 3. Volume: Petabytes+. Variety: Unstructured. Velocity: Real-Time. Our view: More data should mean better answers. Must balance Cost, Time, and Answer Quality.
  • 4. (no text transcribed)
  • 5. UC BERKELEY. Massive and Diverse Data. Algorithms: Machine Learning and Analytics. Machines: Cloud Computing. People: CrowdSourcing & Human Computation.
  • 6. throughout the entire analytics lifecycle
  • 7. Organized for Collaboration: Alex Bayen (Mobile Sensing), Anthony Joseph (Sec./Privacy), Ken Goldberg (Crowdsourcing), Randy Katz (Systems), *Michael Franklin (Databases), Dave Patterson (Systems), Armando Fox (Systems), *Ion Stoica (Systems), *Mike Jordan (Machine Learning), Scott Shenker (Networking)
  • 8. (no text transcribed)
  • 9. > 450,000 downloads
  • 10. Sequencing costs (150X), Big Data. Chart: $K per genome, axis from $0.1 to $100,000.0, 2001 - 2014. UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster @TCGA: 5 PB = 20 cancers x 1000 genomes. See Dave Patterson’s Talk: Thursday 3-4, BDT205. David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011
  • 11. Stack diagram: MLBase (Declarative Machine Learning); BlinkDB (approx QP); Shark (SQL) + Streaming; Spark Streaming; Hadoop MR, MPI, Graphlab, etc.; Shared RDDs (distributed memory); Mesos (cluster resource manager); HDFS. Legend: 3rd party, AMPLab (released), AMPLab (in progress)
  • 12. (no text transcribed)
  • 13. (no text transcribed)
  • 14. Lightning-Fast Cluster Computing
  • 15. Diagram: the Driver sends tasks to three Workers and collects results; each Worker is shown with one block of the file (Block 1, 2, 3) and its own cache (Cache 1, 2, 3).
      lines = spark.textFile("hdfs://...")                  // Base RDD
      errors = lines.filter(_.startsWith("ERROR"))          // Transformed RDD
      messages = errors.map(_.split('\t')(2))
      cachedMsgs = messages.cache()
      cachedMsgs.filter(_.contains("foo")).count            // Action
      cachedMsgs.filter(_.contains("bar")).count
      Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
      Result: scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
  • 16. messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
      HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))
  • 17. Figure labels: random initial line; target
  • 18. map(readPoint), cache: load data in memory once. Initial parameter vector. map(p => (1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x), reduce(_ + _): repeated MapReduce steps to do gradient descent.
  • 19. Chart: Running Time (min) vs Number of Iterations (1, 10, 20, 30). Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.
  • 20. Java API (out now):
      JavaRDD<String> lines = sc.textFile(...);
      lines.filter(new Function<String, Boolean>() {
        public Boolean call(String s) { return s.contains("error"); }
      }).count();
      PySpark (coming soon):
      lines = sc.textFile(...)
      lines.filter(lambda x: "error" in x).count()
  • 21. Chart: Time (hours). Hive: 20. Spark: 0.5.
  • 22. Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution); Meta store; MapReduce; HDFS
  • 23. Client (CLI, JDBC); Driver (SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.); Meta store; Spark; HDFS
  • 24. Row Storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4). Column Storage: (1, 2, 3), (john, mike, sally), (4.1, 3.5, 6.4).
  • 25. Selection. Chart series: Shark, Shark (disk), Hive; labeled value: 1.1. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).
  • 26. Group By. Chart series: Shark, Shark (disk), Hive; labeled value: 32. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).
  • 27. Join. Chart series: Shark (copartitioned), Shark, Shark (disk), Hive; labeled value: 105. 100 m2.4xlarge nodes, 2.1 TB benchmark (Pavlo et al).
  • 28. Query 1, Query 2, Query 3. Chart series: Shark, Shark (disk), Hive; labeled values: 0.8, 0.7, 1.0. 100 m2.4xlarge nodes, 1.7 TB Conviva dataset.
  • 29. spark-project.org, amplab.cs.berkeley.edu, UC BERKELEY
  • 30. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
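
Slide 16 shows each RDD recording its parent and the function that derives it (HadoopRDD, then FilteredRDD, then MappedRDD). The following is a toy, plain-Python sketch of that idea, not Spark's implementation: the `ToyRDD` class, its methods, the helper functions, and the sample records are all hypothetical names made up for illustration.

```python
class ToyRDD:
    """Toy lineage node (hypothetical, not a Spark class)."""

    def __init__(self, kind, parent, compute):
        self.kind = kind          # label such as "FilteredRDD"
        self.parent = parent      # upstream ToyRDD, or None for the base dataset
        self._compute = compute   # callable: parent records -> this node's records

    @classmethod
    def from_records(cls, records):
        # Stands in for reading a file; the base node has no parent.
        return cls("HadoopRDD", None, lambda _unused: iter(records))

    def filter(self, pred):
        return ToyRDD("FilteredRDD", self,
                      lambda rows: (r for r in rows if pred(r)))

    def map(self, fn):
        return ToyRDD("MappedRDD", self,
                      lambda rows: (fn(r) for r in rows))

    def collect(self):
        # Recompute from the base dataset by walking the parent chain.
        upstream = self.parent.collect() if self.parent is not None else None
        return list(self._compute(upstream))

    def lineage(self):
        chain, node = [], self
        while node is not None:
            chain.append(node.kind)
            node = node.parent
        return list(reversed(chain))


def has_error(line):
    return "error" in line


def third_field(line):
    return line.split("\t")[2]


# Made-up tab-separated records for the example.
base = ToyRDD.from_records(["a\tb\terror one", "a\tb\tfine", "x\ty\terror two"])
messages = base.filter(has_error).map(third_field)
print(messages.lineage())
print(messages.collect())
```

Because every node keeps only its parent and a function, nothing is computed until `collect()` is called, and the result can be rebuilt from the base records at any time.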
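
Slide 18 gives the per-point gradient term for logistic regression and sums it with a reduce. Below is a minimal single-machine sketch of that loop in plain Python, using the same expression. The training points, the helper names, the seed, and the update `w = w - gradient` with an implicit step size of 1 are assumptions for illustration; the transcript does not show the update step.

```python
import functools
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def gradient_term(w, x, y):
    # Same expression as on the slide: (1 / (1 + exp(-y * (w dot x))) - 1) * y * x
    scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
    return [scale * xi for xi in x]


def train(points, iterations, seed=0):
    rng = random.Random(seed)
    dims = len(points[0][0])
    w = [rng.uniform(-1.0, 1.0) for _ in range(dims)]   # initial parameter vector
    for _ in range(iterations):
        # "map" each point to its gradient term, then "reduce" by summing.
        grad = functools.reduce(add, (gradient_term(w, x, y) for x, y in points))
        w = [wi - gi for wi, gi in zip(w, grad)]        # assumed unit step size
    return w


# Made-up, linearly separable points: (features, label in {+1, -1}).
points = [([1.0, 2.0], 1), ([2.0, 1.5], 1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
w = train(points, iterations=10)
print(w)
```

The list `points` plays the role of the cached dataset: it is loaded once and reused on every iteration, which is the access pattern the slide highlights.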
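
The per-iteration figures on slide 19 can be turned into total running times with simple arithmetic. The sketch below assumes cost grows linearly with the number of iterations, which is an extrapolation from the quoted figures rather than something stated in the transcript; the function and constant names are made up.

```python
HADOOP_S_PER_ITER = 110      # from slide 19
SPARK_FIRST_ITER_S = 80      # from slide 19
SPARK_LATER_ITER_S = 1       # from slide 19


def hadoop_seconds(iterations):
    return HADOOP_S_PER_ITER * iterations


def spark_seconds(iterations):
    if iterations <= 0:
        return 0
    return SPARK_FIRST_ITER_S + SPARK_LATER_ITER_S * (iterations - 1)


# Iteration counts taken from the chart's x-axis.
for n in (1, 10, 20, 30):
    print(n, hadoop_seconds(n) / 60, spark_seconds(n) / 60)
```

Under this assumption, 30 iterations take 3300 s (55 min) on Hadoop and 109 s on Spark, which is consistent with a chart axis that tops out at 60 minutes.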
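
Slide 24 contrasts row storage with column storage using three records. A small plain-Python sketch of the same transposition follows; it uses the slide's values, while the variable names and the choice of averaging the third column are assumptions for illustration (the slide does not name the columns).

```python
# Row storage: one tuple per record, values taken from slide 24.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column storage: one list per field, built by transposing the rows.
columns = [list(col) for col in zip(*rows)]

# A query over a single field only needs to touch that one list.
third_column = columns[2]
average = sum(third_column) / len(third_column)

# The original rows can be rebuilt by transposing back.
rebuilt_rows = list(zip(*columns))
print(columns)
```

Keeping each field in its own list is what lets a scan over one column skip the other fields entirely.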
