
Spark as a distributed Scala


In this presentation I look at Spark as a kind of evolution of Scala: we review basic Scala features and then map some of them onto their Spark counterparts.
www.fruzenshtein.com


Spark as a distributed Scala

  1. Spark As A Distributed Scala
  2. Who is who? Alexey Zvolinskiy (@Fruzenshtein)
     ~4 years of Scala experience
     Writes a lot to his blog www.Fruzenshtein.com
     Currently interested in Scala, Akka, Spark…
     Passing through the Functional Programming in Scala Specialization on Coursera
  3. Why Scala?
  4. What makes Scala so great?
     1. Functional programming language*
     2. Immutability
     3. Type system
     4. Collections API
     5. Pattern matching
     6. Implicits
  5. Functional programming language
     1. Functions are first-class citizens
     2. Totality
     3. Determinism
     4. Purity
     [Diagram: a function A => B maps inputs A1, A2, …, An to outputs B1, B2, …, Bn;
      each Ai passed through A => B yields its Bi]
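     A minimal sketch of these properties (the function names are illustrative,
     not from the slides):

     // Pure, total, deterministic: every Int input maps to exactly one output,
     // the same output every time, with no side effects.
     def double(a: Int): Int = a * 2

     // Not pure: the result depends on hidden mutable state.
     var calls = 0
     def impureDouble(a: Int): Int = { calls += 1; a * 2 + calls }

     // Not total: fails for some inputs instead of returning a value.
     def reciprocal(a: Int): Double =
       if (a == 0) throw new ArithmeticException("undefined for 0")
       else 1.0 / a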
  6. Immutability
     1. Makes code more predictable
     2. Reduces the effort needed to understand code
     3. Key to thread-safety
     Books: Java Concurrency in Practice; Effective Java, 2nd Edition
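     A small illustration of the thread-safety point (the values are made up):

     // An immutable List can be shared across threads freely,
     // because no thread can modify it in place.
     val scores = List(65, 87, 98)

     // "Updates" build a new collection; scores itself never changes.
     val curved = scores.map(_ + 5)   // List(70, 92, 103)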
  7. Type system
     1. Static typing
     2. Type inference
     3. Bounds and variance
     Map[K, V]   List[T1 <: T2]   Set[+T]
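     A sketch of these points (Animal and Dog are illustrative types,
     not from the slides):

     // Type inference: the compiler infers Map[String, Int] here.
     val ages = Map("Alex" -> 30, "Kate" -> 25)

     // Upper bound: T must be a subtype of Animal.
     class Animal
     class Dog extends Animal
     def herdSize[T <: Animal](xs: List[T]): Int = xs.size

     // Covariance (+A in List's declaration): a List[Dog] can be used
     // wherever a List[Animal] is expected.
     val animals: List[Animal] = List(new Dog)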
  8. Collections API

     val numbers = List(1,2,3,4,5,6,7,8,9,10)
     numbers.filter(_ % 2 == 0)   // filter(p: Int => Boolean), here (n => n % 2 == 0)
            .map(_ * 10)          // map(f: Int => Int), here (n => n * 10)
     // List(20, 40, 60, 80, 100)
  9. Collections API

     val groupsOfStudents = List(
       List(("Alex", 65), ("Kate", 87), ("Sam", 98)),
       List(("Peter", 84), ("Bob", 79), ("Samanta", 71)),
       List(("Rob", 82), ("Jack", 55), ("Ann", 90))
     )
     groupsOfStudents.flatMap(students => students)
                     .groupBy(student => student._2 > 75)
                     .get(true).get
     // List((Kate,87), (Sam,98), (Peter,84), (Bob,79), (Rob,82), (Ann,90))
  10. And what?! =Parallelism=
  11. Idea of parallelism
      How do we divide a problem into subproblems?
      How do we use the hardware optimally?
  12. Parallelism background
  13. Scala parallel collections

      val from0to100000: Range = 0 until 100000
      val list = from0to100000.toList
      val parList = list.par   // scala.collection.parallel.immutable.ParSeq[Int]
  14. Some benchmarks

      def isPrime(n: Int): Boolean = !((2 until n - 1) exists (n % _ == 0))

      val list = from0to100000.toList
      for (i <- 1 to 10) {
        val t0 = System.currentTimeMillis()
        list.filter(isPrime(_))
        println(System.currentTimeMillis - t0)
      }
      // 7106 6467 6315 6275 6478 8732 6543 6296 6299 6286  (ms, sequential)

      val parList = list.par
      for (i <- 1 to 10) {
        val t1 = System.currentTimeMillis()
        parList.filter(isPrime(_))
        println(System.currentTimeMillis - t1)
      }
      // 5130 5106 4649 4568 4580 4446 4447 4437 4290 4476  (ms, parallel)
  15. Ok, but what about Spark?!
  16. Why distributed computations?
      Parallel collections (Scala): single machine (shared memory)
      RDDs (Spark): multiple nodes (network)
      Almost the same API
  17. RDD example

      val tweets: RDD[Tweet] = …
      tweets.filter(_.contains("bigdata"))

      [Diagram: the RDD's partitions are spread across several Spark nodes]
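      The Tweet type and the RDD's source are elided on the slide; here is a
      self-contained sketch of the same pattern, assuming plain String tweets
      and a local master:

      import org.apache.spark.{SparkConf, SparkContext}

      object RddExample {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
          val sc = new SparkContext(conf)

          // The same filter we would run on a local collection, but the
          // RDD's partitions can live on many nodes.
          val tweets = sc.parallelize(Seq("spark and bigdata", "just scala", "more bigdata"))
          val hits = tweets.filter(_.contains("bigdata")).collect()

          hits.foreach(println)   // prints the two matching "tweets"
          sc.stop()
        }
      }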
  18. Latency
      [Chart: latency numbers from Jeff Dean, http://research.google.com/people/jeff/
       https://gist.github.com/2841832; graph and scale by Thomas Lee]
  19. Computation model
      memory:  seconds - days
      disk:    weeks - months
      network: weeks - years
  20. Spark transformations & actions
      1. Transformations are lazy: map, filter, flatMap, …
      2. Actions are eager: reduce, collect, count, …

      val tweets: RDD[Tweet] = …
      tweets.filter(_.contains("bigdata"))
            .map(t => (t.author, t.body))   // lazy: nothing is computed yet

      tweets.filter(_.contains("bigdata"))
            .map(t => (t.author, t.body))
            .collect()                      // the action triggers the computation
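      A sketch of the lazy/eager split, reusing the String-based tweets RDD
      from the sketch above:

      // Transformations only record a lineage; nothing runs here.
      val pipeline = tweets.filter(_.contains("bigdata"))
                           .map(t => (t.length, t))

      // The action is what actually triggers the distributed computation.
      val results = pipeline.collect()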
  21. Rules of thumb
      1. Cache
      2. Apply efficiently
      3. Avoid shuffling

      val tweets: RDD[Tweet] = …
      val cachedTweets = tweets.cache()

      // efficient: filter first, then map only the survivors
      cachedTweets.filter(_.contains("USA"))
                  .map(t => (t.author, t.body))

      // less efficient: maps every tweet before filtering
      cachedTweets.map(t => (t.author, t.body))
                  .filter(_._2.contains("USA"))
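      Why cache() matters, in the same String-based sketch setting: without it,
      every action recomputes the RDD from its source.

      val cachedTweets = tweets.cache()

      val usaCount = cachedTweets.filter(_.contains("USA")).count()  // computes and caches
      val totalLen = cachedTweets.map(_.length).reduce(_ + _)        // reuses the cached data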
  22. Shuffling
      Transaction(id: Int, amount: Int): we want to know how much money each client spent.

      (1, 240) (2, 500) (2, 105) (3, 100) (1, 200) (1, 500) (1, 450) (3, 100) (3, 100)
        groupByKey()
      (1, [240, 200, 500, 450]) (2, [500, 105]) (3, [100, 100, 100])
  23. Reduce before group

      (1, 240) (2, 500) (2, 105) (3, 100) (1, 200) (1, 500) (1, 450) (3, 100) (3, 100)
        reduceByKey(…)   // partial reduce within each partition
      (1, 240) (2, 605) (3, 100) (1, 700) (1, 450) (3, 200)
        groupByKey()
      (1, [240, 700, 450]) (2, [605]) (3, [100, 200])
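      The per-client sum written both ways, as a sketch over (clientId, amount)
      pairs matching the slide's data:

      val transactions = sc.parallelize(Seq(
        (1, 240), (2, 500), (2, 105), (3, 100), (1, 200),
        (1, 500), (1, 450), (3, 100), (3, 100)))

      // groupByKey shuffles every single (id, amount) pair across the network.
      val viaGroup = transactions.groupByKey().mapValues(_.sum)

      // reduceByKey sums within each partition first, so far less data moves.
      val viaReduce = transactions.reduceByKey(_ + _)

      viaReduce.collect()   // e.g. Array((1,1390), (2,605), (3,300))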
  24. Thanks :) @Fruzenshtein
