Spark!

An overview of Spark, a new paradigm for processing Big Data with astounding performance and conciseness.

Slides from DataKRK meetup.

Published in: Technology


  1. 1. By @przemur from
  2. 2. HTTP://ABOUT.ME/PRZEMEK.MACIOLEK/
     • Data Scientist, Hadoop user since 2009
     • Did research in academia, mined data for the oil & gas exploration industry, cofounded a Data Science startup, built the Big Data team at Base CRM, …
     • Used a lot of different tools along the way (Mahout, HBase, Cassandra, Redis, Pig, Storm, …)
     • Dreaming about something powerful and concise for Big Data…
     • AD 2014: Head of Analytics & Data @ Toptal - researching new ways of doing Big Data Analytics, rediscovered Storm.
     P.S. Ever considered doing Analytics & Data Science for a very cool startup? Drop me a note at: prze@toptal.com
  3. 3. HADOOP IS COOL…
  4. 4. HADOOP IS COOL (BUT SOMETIMES IT’S NOT)
     • High latency (interactive, anyone?)
     • Challenging expressibility of business logic
     • Iterative algorithms? (think: PageRank)
  5. 5. SOLUTION? [Diagram: general batch processing vs. specialized systems, naming Giraph, MapReduce, Pig, S4, Hive, Pregel, Storm, Drill, Impala, …]
  6. 6. [Diagram: Data blocks flowing through Map and Reduce]
  7. 7. [Diagram: many Data blocks]
  8. 8. MAYBE MAP REDUCE IS NOT ALWAYS THE BEST SOLUTION?
  9. 9. GENERALIZE FTW! [Diagram labels: Spark; Task DAG and Data Sharing; MapReduce, …; Batch processing; Specialized systems]
  10. 10. RESILIENT DISTRIBUTED DATASET (RDD)
      • A collection of elements that can be operated on in parallel
      • Parallel Collection, e.g. sc.parallelize(Array(1,2,3))
      • Hadoop Dataset
      • Lazily evaluated, able to rebuild lost data at any time
      • Can be stored in memory without replication
  11. 11. TRANSFORMATIONS
      • Create a new dataset from an existing one
      • Lazily evaluated
      • Recomputed each time an action runs on them, but might be persisted (in memory or on disk)
      ACTIONS
      • Return a value to the driver after the computation finishes
      • Run all required transformations
      Broadcast Variables and Accumulators for cluster-level sharing
  12. 12. Scala, Java, Python!
  13. 13. HOW TO USE IT?
      scala> val textFile = sc.textFile("README.md")
      textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
      scala> textFile.count() // Number of items in this RDD
      res0: Long = 74
      scala> textFile.first() // First item in this RDD
      res1: String = # Apache Spark
      scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b)) // How many words are in the longest line
      res2: Int = 16
      scala> textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b).collect
      res3: Array[(java.lang.String, Int)] = Array((need,2), ("",43), (Extra,3), (using,1), (passed,1), (etc.,1), (its,1), (`/usr/local/lib/libmesos.so`,1), (`SCALA_HOME`,1), (option,1), (these,1), (#,1), (`PATH`,,2), (200,1), (To,3),...
  14. 14. WHAT HAPPENS UNDERNEATH?
  15. 15. [Diagram: RDD Objects → DAG Scheduler → (TaskSet) → Task Scheduler → (Task) → Worker]
      • RDD Objects: rdd.filter().map(…).groupBy(…).filter(…)
      • DAG Scheduler: Split graph into stages of tasks. Submit each one when ready.
      • Task Scheduler: Launch tasks via cluster manager. Retry.
      • Worker: Execute tasks. Store and serve blocks.
  16. 16. NARROW DEPENDENCIES / WIDE (SHUFFLE) DEPENDENCIES
      • Narrow: map, filter; union
      • Wide (shuffle): groupByKey; join (inputs not co-partitioned)
  17. 17. * http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  18. 18. How much code is needed to implement Big Data PageRank?
  19. 19. * http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
  20. 20. * http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
  21. 21. * http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
  22. 22. * http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
  23. 23. BERKELEY DATA ANALYTICS STACK * https://amplab.cs.berkeley.edu/software/
  24. 24. SPARK LIVE
  25. 25. REFERENCES
      • http://spark.incubator.apache.org/
      • https://amplab.cs.berkeley.edu/software/
      • http://ampcamp.berkeley.edu/3/exercises/index.html
      • http://www.mlbase.org/
      • https://amplab.cs.berkeley.edu/benchmark/
      • http://files.meetup.com/3138542/dev-meetup-dec-2012.pptx
      • http://spark-summit.org/wp-content/uploads/2013/10/Tully-SparkSummit4.pdf
      • http://spark-summit.org/wp-content/uploads/2013/10/Kay_Sparrow_Spark_Summit.pdf
      • http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf
      • http://spark-summit.org/wp-content/uploads/2013/10/Wendell-Spark-Performance.pdf
      • http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
      • http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
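
The split between lazy transformations and actions (slides 10 and 11), along with broadcast variables and accumulators, can be sketched as a small standalone Scala program. This is an illustration under assumptions, not code from the talk. It assumes Spark 2.0 or later on the classpath (for `longAccumulator`) and a `local[2]` master so it runs without a cluster. The names `LazyRddSketch`, `demo`, `mapCalls` and `factor` are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyRddSketch {

  /** Returns (map calls before any action, result of the action, map calls after the action). */
  def demo(sc: SparkContext): (Long, Int, Long) = {
    // Accumulator: tasks add to it, the driver reads it ("mapCalls" is a hypothetical name).
    // Note: updates made inside transformations can be re-applied if a task is retried.
    val mapCalls = sc.longAccumulator("mapCalls")

    // Broadcast variable: a read-only value shipped to the workers.
    val factor = sc.broadcast(10)

    // Parallel collection turned into an RDD.
    val numbers = sc.parallelize(Seq(1, 2, 3, 4))

    // Transformation: only recorded in the lineage, nothing is computed yet.
    val scaledSquares = numbers.map { n =>
      mapCalls.add(1)
      n * n * factor.value
    }
    val callsBefore = mapCalls.sum

    // Action: runs the required transformations and returns a value to the driver.
    val total = scaledSquares.reduce(_ + _)
    val callsAfter = mapCalls.sum

    (callsBefore, total, callsAfter)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master with two threads, so no cluster manager is needed.
    val conf = new SparkConf().setAppName("lazy-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val (before, total, after) = demo(sc)
      println(s"map calls before the action: $before")
      println(s"sum of scaled squares: $total")
      println(s"map calls after the action: $after")
    } finally {
      sc.stop()
    }
  }
}
```

In a normal local run, the map function should not execute until `reduce` is called. A second action on `scaledSquares` would rerun the map step unless the RDD had been marked with `persist()` beforehand.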
