Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Why functional why scala


Published on

Why every data engineer should know something about functional programming and Scala.

Properly formatted slides at

Published in: Software
  • Be the first to comment

Why functional why scala

  1. 1. Jul 2014 WHY FUNCTIONAL? WHY SCALA? Neville Li @sinisa_lyh
  2. 2. MONOID! Actuallyit'sa semigroup,monoid just soundsmore interesting :) A Little Teaser Crunch:CombineFns are used to representthe associative operations... PGroupedTable<K,V>::combineValues(CombineFn<K,V>combineFn, CombineFn<K,V>reduceFn) Scalding:reduce with fn which mustbe associative and commutative KeyedList[K,T]::reduce(fn:(T,T)=>T) Spark:Merge the values for each key using an associative reduce function PairRDDFunctions[K,V]::reduceByKey(fn:(V,V)=>V) All ofthem work on both mapper and reducer side 0
  3. 3. MY STORY Before Mostly Python/C++ (and PHP...) No Java experience at all Started using Scala early 2013 Now Discovery's* Java backend/riemann guy The Scalding/Spark/Storm guy Contributor to Spark, chill, cascading.avro *Spotify'smachine learning and recommendation team
  4. 4. WHY THIS TALK? Not a tutorial Discovery's experience Why FP matters Why Scala matters Common misconceptions
  5. 5. WHAT WE ALREADY USE Kafka Scalding Spark / MLLib Stratosphere Storm / Riemann (Clojure)
  6. 6. WHAT WE WANT TO INVESTIGATE Summingbird (Scala for Storm + Hadoop) Spark Streaming Shark / SparkSQL GraphX (Spark) BIDMach (GPU ML with GPU)
  7. 7. DISCOVERY Mid 2013: 100+ Python jobs 10+ hires since (half since new year) Few with Java experience, none with Scala As of May 2014: ~100 Scalding jobs & 90 tests More uncommited ad-hoc jobs 12+ commiters, 4+ using Spark
  8. 8. DISCOVERY rec-sys-scalding.git
  10. 10. WHY FUNCTIONAL Immutable data Copy and transform Not mutate in place HDFS with M/R jobs Storm tuples, Riemann streams
  11. 11. WHY FUNCTIONAL Higher order functions Expressions, not statements Focus on problem solving Not solving programming problems
  12. 12. WHY FUNCTIONAL Word count in Python lyrics=["WeallliveinAmerika","Amerikaistwunderbar"] wc=defaultdict(int) forlinlyrics: forwinl.split(): wc[w]+=1 Screen too small for the Java version
  13. 13. WHY FUNCTIONAL Map and reduce are key concepts in FP vallyrics=List("WeallliveinAmerika","Amerikaistwunderbar") lyrics.flatMap(_.split("")) //map .groupBy(identity) //shuffle .map{case(k,g)=>(k,g.size)} //reduce (deflyrics["WeallliveinAmerika""Amerikaistwunderbar"]) (->>lyrics(mapcat#(clojure.string/split%#"s")) (group-byidentity) (map(fn[[kg]][k(countg)]))) importControl.Arrow importData.List letlyrics=["WeallliveinAmerika","Amerikaistwunderbar"] mapwords>>>concat >>>sort>>>group >>>map(x->(headx,lengthx))$lyrics
  14. 14. WHY FUNCTIONAL Linear equation in ALS matrixfactorization = ( Y + ( − I)Y p(u)xu Y T Y T C u ) −1 Y T C u{case(id,vec)=>(id,vec*vec.T)} //YtY .map(_._2).reduce(_+_) ratings.keyBy(fixedKey).join(outerProducts) //YtCuIY .map{case(_,(r,op))=>(solveKey(r),op*(r.rating*alpha))} .reduceByKey(_+_) ratings.keyBy(fixedKey).join(vectors) //YtCupu .map{case(_,(r,vec))=> valCui=r.rating*alpha+1 valpui=if(Cui>0.0)1.0else0.0 (solveKey(r),vec*(Cui*pui)) }.reduceByKey(_+_)
  15. 15. WHY SCALA JVM - libraries and tools Pythonesque syntax Static typing with inference Transition from imperative to FP
  16. 16. WHY SCALA Performance vs. agility
  17. 17. WHY SCALA Type inference classComplexDecorationService{ publicList<ListenableFuture<Map<String,Metadata>>> lookupMetadata(List<String>keys){/*...*/} } valdata=service.lookupMetadata(keys) typeDF=List[ListenableFuture[Map[String,Track]]] defprocess(data:DF)={/*...*/}
  18. 18. WHY SCALA Higher order functions List<Integer>list=Lists.newArrayList(1,2,3); Lists.transform(list,newFunction<Integer,Integer>(){ @Override publicIntegerapply(Integerinput){ returninput+1; } }); vallist=List(1,2,3) //List(2,3,4) And then imagine ifyou have to chain or nested functions
  19. 19. WHY SCALA Collections API vall=List(1,2,3,4,5) //List(2,3,4,5,6) l.filter(_>3) //45"a","b","c")).toMap //Map(1->a,2->b,3->c) l.partition(_%2==0) //(List(2,4),List(1,3,5)) List(l,*2)).flatten //List(1,2,3,4,5,2,4,6,8,10) l.reduce(_+_) //15 l.fold(100)(_+_) //115 "WeallliveinAmerika".split("").groupBy(_.size) //Map(2->Array(We,in),4->Array(live), // 7->Array(Amerika),3->Array(all))
  20. 20. WHY SCALA Scalding field based word count TextLine(path)) .flatMap('line->'word){line:String=>line.split("""W+""")} .groupBy('word){_.size} Scalding type-safe word count TextLine(path).read.toTypedPipe[String](Fields.ALL) .flatMap(_.split(""W+"")) .groupBy(identity).size Scrunch word count read(from.textFile(file)) .flatMap(_.split("""W+""") .count
  21. 21. WHY SCALA Summingbird word count source .flatMap{line:String=>line.split("""W+""").map((_,1))} .sumByKey(store) Spark word count sc.textFile(path) .flatMap(_.split("""W+""")) .map(word=>(word,1)) .reduceByKey(_+_) Stratosphere word count TextFile(textInput) .flatMap(_.split("""W+""")) .map(word=>(word,1)) .groupBy(_._1) .reduce{(w1,w2)=>(w1._1,w1._2+w2._2)}
  22. 22. WHY SCALA Many patterns also common in Java Java 8 lambdas and streams Guava, Crunch, etc. Optional, Predicate Collection transformations ListenableFuture and transform parallelDo, DoFn, MapFn, CombineFn
  23. 23. COMMON MISCONCEPTIONS It's complex True for language features Not from user's perspective We only use 20% features Not more than needed in Java
  24. 24. COMMON MISCONCEPTIONS It's slow No slower than Python Depend on how pure FP Trade off with productivity Drop down to Java or native libraries
  25. 25. COMMON MISCONCEPTIONS I don't want to learn a new language How about flatMap, reduce, fold, etc.? Unnecessary overhead interfacing with Python or Java You've used monoids, monads, or higher order functions already