Transforming Big Data with Spark and Shark - AWS Re:Invent 2012 BDT 305
Speaker notes:
  • Key idea: add “variables” to the “functions” in functional programming.
  • Note that the dataset is reused on each gradient computation.
  • The logistic regression benchmark is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).
  • Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join.
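The note above about adding “variables” to the “functions” of functional programming describes closures: a function ships together with the values it captures, much as Spark sends tasks that reference the driver's variables out to workers. A minimal plain-Python illustration (the names here are invented, not Spark API):

```python
# A closure "adds a variable to a function": the returned lambda carries
# the captured value of w with it, just as a shipped Spark task carries
# the variables its function references.
def scaled_by(w):
    return lambda x: w * x          # w travels with the function

triple = scaled_by(3.0)             # capture w = 3.0
print([triple(x) for x in [1.0, 2.0]])  # each call sees the captured w
```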

    1. UC BERKELEY
    2. It’s All Happening On-line. User Generated (Web, Social & Mobile), every: click, ad impression, billing event, fast forward/pause, friend request, transaction, network message, fault, … plus Internet of Things / M2M and Scientific Computing.
    3. Volume: Petabytes+. Variety: Unstructured. Velocity: Real-Time. Our view: more data should mean better answers, and we must balance Cost, Time, and Answer Quality.
    4. (image-only slide)
    5. UC BERKELEY. Algorithms: Machine Learning and Analytics. Machines: Cloud Computing. People: CrowdSourcing & Human Computation. All applied to Massive and Diverse Data.
    6. … throughout the entire analytics lifecycle.
    7. Organized for Collaboration: Alex Bayen (Mobile Sensing), Anthony Joseph (Security/Privacy), Ken Goldberg (Crowdsourcing), Randy Katz (Systems), *Michael Franklin (Databases), Dave Patterson (Systems), Armando Fox (Systems), *Ion Stoica (Systems), *Mike Jordan (Machine Learning), Scott Shenker (Networking).
    8. (image-only slide)
    9. > 450,000 downloads
    10. Sequencing costs have dropped ~150x (chart: $K per genome, 2001-2014). UCSF cancer researchers + UCSC cancer genetic database + AMP Lab + Intel Cluster. At TCGA: 5 PB = 20 cancers x 1000 genomes. See Dave Patterson’s talk, Thursday 3-4, BDT205, and David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/2011.
    11. The analytics stack: MLBase (Declarative Machine Learning); Hadoop MR, MPI, BlinkDB (approx QP), GraphLab, Shark (SQL), Spark Streaming, etc.; Spark; Shared RDDs (distributed memory); Mesos (cluster resource manager); HDFS. Components are marked as 3rd party, AMPLab (released), or AMPLab (in progress).
    12. (image-only slide)
    13. (image-only slide)
    14. Lightning-Fast Cluster Computing
    15. Loading error messages into memory and querying them interactively (Scala):

            lines = spark.textFile("hdfs://...")          // Base RDD
            errors = lines.filter(_.startsWith("ERROR"))  // Transformed RDD
            messages = errors.map(_.split("\t")(2))
            cachedMsgs = messages.cache()                 // cache in cluster memory
            cachedMsgs.filter(_.contains("foo")).count    // Action
            cachedMsgs.filter(_.contains("bar")).count

        The driver ships tasks to workers, each of which caches its block of the data. Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data); scaled to 1 TB of data in 5-7 sec (vs 170 sec for on-disk data).
    16. Each transformation records its lineage:

            messages = textFile(...).filter(_.contains("error"))
                                    .map(_.split("\t")(2))

        HadoopRDD (path = hdfs://...) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(...))
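The lineage chain above can be mimicked in plain Python: each node remembers its parent and the function applied, so a lost partition can be recomputed from the source. This is an illustrative toy, not Spark's implementation, and all class names here are invented:

```python
# Toy lineage graph: each RDD-like node records its parent and the
# function applied, so compute() can replay the whole chain from source.
class SourceRDD:
    def __init__(self, records):
        self.records = records
    def compute(self):
        return list(self.records)

class FilteredRDD:
    def __init__(self, parent, pred):
        self.parent, self.pred = parent, pred   # recorded lineage
    def compute(self):
        return [r for r in self.parent.compute() if self.pred(r)]

class MappedRDD:
    def __init__(self, parent, func):
        self.parent, self.func = parent, func   # recorded lineage
    def compute(self):
        return [self.func(r) for r in self.parent.compute()]

lines = SourceRDD(["INFO ok", "error\tdisk\tfull", "error\tnet\tdown"])
messages = MappedRDD(FilteredRDD(lines, lambda s: "error" in s),
                     lambda s: s.split("\t")[2])
print(messages.compute())  # replayed from the source on every call
```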
    17. (figure: logistic regression, a random initial separating line iteratively moving toward the target)
    18. Logistic regression: load the data in memory once, then run repeated MapReduce steps to do gradient descent (Scala):

            val points = spark.textFile(...).map(readPoint).cache()  // load data in memory once
            var w = Vector.random(D)                                 // initial parameter vector
            for (i <- 1 to ITERATIONS) {
              val gradient = points.map(p =>
                (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
              ).reduce(_ + _)
              w -= gradient
            }
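The map/reduce structure of the gradient step can be checked in plain single-machine Python. `gradient_step` below is an invented helper that applies the slide's per-point formula (the map) and sums the contributions (the reduce):

```python
import math

# One gradient-descent step for logistic regression, mirroring the
# slide's map (per-point gradient) and reduce (_ + _) structure,
# for 1-D features so w and x are plain floats.
def gradient_step(points, w, lr=1.0):
    # map: each point's gradient contribution
    grads = [(1.0 / (1.0 + math.exp(-p_y * (w * p_x))) - 1.0) * p_y * p_x
             for p_x, p_y in points]
    # reduce: sum the contributions
    return w - lr * sum(grads)

points = [(1.0, 1.0), (2.0, 1.0), (-1.0, -1.0)]  # (x, y) pairs, y in {-1, +1}
w = 0.0
for _ in range(10):
    w = gradient_step(points, w)
print(w > 0)  # the separator moves toward the positive class
```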
    19. (chart: running time in minutes vs. number of iterations, 1 to 30) Hadoop: 110 s per iteration. Spark: 80 s for the first iteration, 1 s for each further iteration.
    20. Java API (out now):

            JavaRDD<String> lines = sc.textFile(...);
            lines.filter(new Function<String, Boolean>() {
              Boolean call(String s) { return s.contains("error"); }
            }).count();

        PySpark (coming soon):

            lines = sc.textFile(...)
            lines.filter(lambda x: "error" in x).count()

    21. (chart) Hive: 20 hours; Spark: 0.5 hours.
    22. Hive architecture: Client (CLI, JDBC) -> Driver (SQL Parser, Query Optimizer, Physical Plan, Execution) backed by a Meta store -> MapReduce -> HDFS.
    23. Shark architecture: Client (CLI, JDBC) -> Driver (SQL Parser, Query Optimizer, Physical Plan, Execution) backed by a Cache Manager and Meta store -> Spark -> HDFS.
    24. Row storage vs. column storage of the same table:

            Row storage        Column storage
            1  john   4.1      1     2     3
            2  mike   3.5      john  mike  sally
            3  sally  6.4      4.1   3.5   6.4
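A small Python sketch of the two layouts shows why columnar storage helps analytic scans: reading one attribute touches a single contiguous array instead of every record. The table contents mirror the slide; the variable names are illustrative:

```python
# Row layout: one tuple per record.
rows = [(1, "john", 4.1), (2, "mike", 3.5), (3, "sally", 6.4)]

# Column layout: one list per attribute (the same data, transposed).
cols = {
    "id":    [1, 2, 3],
    "name":  ["john", "mike", "sally"],
    "score": [4.1, 3.5, 6.4],
}

# A row scan must visit every record to read one attribute...
row_total = sum(r[2] for r in rows)
# ...while a column scan reads only the relevant array.
col_total = sum(cols["score"])
print(round(row_total, 1), round(col_total, 1))
```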
    25. (chart: Selection query, 100 m2.4xlarge nodes, 2.1 TB benchmark from Pavlo et al.) Shark (in memory): 1.1 s; Shark (disk) and Hive take far longer (y-axis up to 100 s).
    26. (chart: Group By, same setup) Shark (in memory): 32 s; y-axis up to 600 s.
    27. (chart: Join, same setup) Shark (copartitioned): 105 s; Shark, Shark (disk), and Hive take longer (y-axis up to 1800 s).
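The “copartitioned” bar refers to joining two tables that are hash-partitioned on the same key, so matching rows already sit in the same partition and the join needs no shuffle. A toy single-process sketch (all names invented):

```python
# Hash-partition two tables on the join key; copartitioned tables can
# then be joined partition-by-partition with no data movement (shuffle).
NUM_PARTITIONS = 2

def partition(table, key_index):
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for row in table:
        parts[hash(row[key_index]) % NUM_PARTITIONS].append(row)
    return parts

users  = [(1, "john"), (2, "mike"), (3, "sally")]   # (id, name)
visits = [(1, "home"), (3, "cart"), (1, "buy")]     # (id, page)

u_parts = partition(users, 0)
v_parts = partition(visits, 0)

# Local join inside each partition: rows whose keys hash together meet here.
joined = []
for up, vp in zip(u_parts, v_parts):
    index = {uid: name for uid, name in up}
    joined += [(uid, index[uid], page) for uid, page in vp if uid in index]

print(sorted(joined))
```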
    28. (charts: three queries on a 1.7 TB Conviva dataset, 100 m2.4xlarge nodes) Shark (in memory): 1.0 s, 0.8 s, and 0.7 s for Queries 1-3; Shark (disk) and Hive take tens of seconds (y-axes up to 70, 70, and 100 s).
    29. spark-project.org | amplab.cs.berkeley.edu | UC BERKELEY
    30. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.