Why Functional? Why Scala?

Why every data engineer should know something about functional programming and Scala.

Properly formatted slides at http://www.lyh.me/slides/pitch.html

Published in: Technology

Transcript

  • 1. WHY FUNCTIONAL? WHY SCALA?
    Neville Li @sinisa_lyh
    Jul 2014
  • 2. MONOID!
    Actually it's a semigroup, monoid just sounds more interesting :)
    A Little Teaser
    Crunch: CombineFns are used to represent the associative operations...
      PGroupedTable<K, V>::combineValues(CombineFn<K, V> combineFn, CombineFn<K, V> reduceFn)
    Scalding: reduce with fn which must be associative and commutative
      KeyedList[K, T]::reduce(fn: (T, T) => T)
    Spark: Merge the values for each key using an associative reduce function
      PairRDDFunctions[K, V]::reduceByKey(fn: (V, V) => V)
    All of them work on both mapper and reducer side
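Why associativity lets the same function run on both the mapper and the reducer side can be sketched in plain Scala. This is only an illustration: `Semigroup`, `reduceAll`, and `reduceInChunks` are hypothetical names, not part of the Crunch, Scalding, or Spark APIs, and chunks of a local sequence stand in for mapper partitions.

```scala
object SemigroupSketch {
  // Hypothetical minimal type class: a single associative binary operation.
  trait Semigroup[T] { def plus(a: T, b: T): T }

  // Integer addition is associative (and commutative).
  val intSum: Semigroup[Int] = new Semigroup[Int] {
    def plus(a: Int, b: Int): Int = a + b
  }

  // Reduce a non-empty sequence in a single pass, as one reducer would.
  def reduceAll[T](xs: Seq[T])(sg: Semigroup[T]): T =
    xs.reduce((a, b) => sg.plus(a, b))

  // Reduce each chunk first (the "mapper side" combine), then merge the
  // partial results (the "reducer side"). Assumes xs is non-empty.
  def reduceInChunks[T](xs: Seq[T], chunkSize: Int)(sg: Semigroup[T]): T =
    xs.grouped(chunkSize)
      .map(chunk => chunk.reduce((a, b) => sg.plus(a, b)))
      .reduce((a, b) => sg.plus(a, b))

  val values: Seq[Int] = Seq(1, 2, 3, 4, 5, 6, 7)
  val onePass: Int = reduceAll(values)(intSum)
  val twoPass: Int = reduceInChunks(values, 3)(intSum)
}
```

Because `plus` is associative, regrouping the inputs into chunks does not change the result, which is what makes a map-side combine safe.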
  • 3. MY STORY
    Before
      Mostly Python/C++ (and PHP...)
      No Java experience at all
      Started using Scala early 2013
    Now
      Discovery's* Java backend/Riemann guy
      The Scalding/Spark/Storm guy
      Contributor to Spark, chill, cascading.avro
    *Spotify's machine learning and recommendation team
  • 4. WHY THIS TALK?
    Not a tutorial
    Discovery's experience
    Why FP matters
    Why Scala matters
    Common misconceptions
  • 5. WHAT WE ALREADY USE
    Kafka
    Scalding
    Spark / MLlib
    Stratosphere
    Storm / Riemann (Clojure)
  • 6. WHAT WE WANT TO INVESTIGATE
    Summingbird (Scala for Storm + Hadoop)
    Spark Streaming
    Shark / SparkSQL
    GraphX (Spark)
    BIDMach (ML with GPU)
  • 7. DISCOVERY
    Mid 2013: 100+ Python jobs
    10+ hires since (half since new year)
    Few with Java experience, none with Scala
    As of May 2014: ~100 Scalding jobs & 90 tests
    More uncommitted ad-hoc jobs
    12+ committers, 4+ using Spark
  • 8. DISCOVERY
    rec-sys-scalding.git
  • 9. DISCOVERY
    GUESS HOW MANY JOBS WRITTEN BY YOURS TRULY?
    3
  • 10. WHY FUNCTIONAL
    Immutable data
      Copy and transform
      Not mutate in place
      HDFS with M/R jobs
      Storm tuples, Riemann streams
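"Copy and transform, not mutate in place" can be illustrated with an immutable case class and list. This is a minimal sketch; the `Track` type and its fields are hypothetical and chosen only for the example.

```scala
object ImmutableSketch {
  // Hypothetical record type; case class fields are immutable by default.
  final case class Track(id: String, plays: Int)

  val original: List[Track] = List(Track("a", 1), Track("b", 2))

  // Copy and transform: build a new list of new records.
  // The original list and its elements are never modified.
  val doubled: List[Track] = original.map(t => t.copy(plays = t.plays * 2))
}
```

The same idea applies at larger scale: a M/R job reads one HDFS dataset and writes a new one rather than updating records in place.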
  • 11. WHY FUNCTIONAL
    Higher order functions
    Expressions, not statements
    Focus on problem solving
    Not solving programming problems
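A small sketch of the two language features named on this slide. The function names `label` and `applyTwice` are made up for illustration.

```scala
object ExpressionSketch {
  // `if` is an expression: it yields a value, so no mutable
  // variable or early return is needed.
  def label(n: Int): String = if (n % 2 == 0) "even" else "odd"

  // A higher-order function: it takes another function as an argument.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

  // Passing a function to `map` is the everyday form of the same idea.
  val labels: List[String] = List(1, 2, 3).map(label)
  val result: Int = applyTwice(_ + 3, 10)
}
```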
  • 12. WHY FUNCTIONAL
    Word count in Python
      from collections import defaultdict

      lyrics = ["We all live in Amerika", "Amerika ist wunderbar"]
      wc = defaultdict(int)
      for l in lyrics:
          for w in l.split():
              wc[w] += 1
    Screen too small for the Java version
  • 13. WHY FUNCTIONAL
    Map and reduce are key concepts in FP
    Scala
      val lyrics = List("We all live in Amerika", "Amerika ist wunderbar")
      lyrics.flatMap(_.split(" "))           // map
        .groupBy(identity)                   // shuffle
        .map { case (k, g) => (k, g.size) }  // reduce
    Clojure
      (def lyrics ["We all live in Amerika" "Amerika ist wunderbar"])
      (->> lyrics (mapcat #(clojure.string/split % #"\s"))
           (group-by identity)
           (map (fn [[k g]] [k (count g)])))
    Haskell
      import Control.Arrow
      import Data.List
      let lyrics = ["We all live in Amerika", "Amerika ist wunderbar"]
      map words >>> concat
        >>> sort >>> group
        >>> map (\x -> (head x, length x)) $ lyrics
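A self-contained version of the Scala word count on this slide, runnable on local collections with no cluster. Wrapping it in an object and splitting on the regex `\\s+` (one or more whitespace characters) are choices made for this sketch.

```scala
object WordCountSketch {
  val lyrics: List[String] =
    List("We all live in Amerika", "Amerika ist wunderbar")

  val counts: Map[String, Int] =
    lyrics
      .flatMap(_.split("\\s+"))                          // map: one record per word
      .groupBy(identity)                                 // shuffle: gather equal words
      .map { case (word, group) => (word, group.size) }  // reduce: count each group
}
```

Each stage corresponds to a phase of a M/R job, which is why the same three calls reappear almost unchanged in the distributed frameworks.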
  • 14. WHY FUNCTIONAL
    Linear equation in ALS matrix factorization
      $x_u = (Y^T Y + Y^T (C^u - I) Y)^{-1} Y^T C^u p(u)$
      vectors.map { case (id, vec) => (id, vec * vec.T) }  // YtY
        .map(_._2).reduce(_ + _)
      ratings.keyBy(fixedKey).join(outerProducts)  // YtCuIY
        .map { case (_, (r, op)) => (solveKey(r), op * (r.rating * alpha)) }
        .reduceByKey(_ + _)
      ratings.keyBy(fixedKey).join(vectors)  // YtCupu
        .map { case (_, (r, vec)) =>
          val Cui = r.rating * alpha + 1
          val pui = if (Cui > 0.0) 1.0 else 0.0
          (solveKey(r), vec * (Cui * pui))
        }.reduceByKey(_ + _)
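The `YtY` step above relies on the identity that Y^T Y equals the sum of the outer products of the rows of Y, so it can be computed with a map followed by an associative reduce. The sketch below shows that on plain Scala collections. It is not the talk's implementation: `Vec`, `Mat`, `outer`, `add`, and the sample vectors are assumptions, and nested `Vector`s stand in for a linear algebra library.

```scala
object OuterProductSketch {
  type Vec = Vector[Double]
  type Mat = Vector[Vector[Double]]

  // Outer product y * y^T of a single item vector.
  def outer(y: Vec): Mat = y.map(a => y.map(b => a * b))

  // Element-wise matrix sum; associative, so it is safe to use with reduce.
  def add(m1: Mat, m2: Mat): Mat =
    m1.zip(m2).map { case (r1, r2) => r1.zip(r2).map { case (a, b) => a + b } }

  // Hypothetical item vectors keyed by id, mirroring the slide's `vectors`.
  val vectors: Seq[(Int, Vec)] = Seq(
    1 -> Vector(1.0, 2.0),
    2 -> Vector(3.0, 4.0)
  )

  // Y^T Y as the sum of per-item outer products (map, then reduce).
  val yty: Mat =
    vectors.map { case (_, vec) => outer(vec) }.reduce((a, b) => add(a, b))
}
```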
  • 15. WHY SCALA
    JVM - libraries and tools
    Pythonesque syntax
    Static typing with inference
    Transition from imperative to FP
  • 16. WHY SCALA
    Performance vs. agility
    http://nicholassterling.wordpress.com/2012/11/16/scala-performance/
  • 17. WHY SCALA
    Type inference
    Java
      class ComplexDecorationService {
        public List<ListenableFuture<Map<String, Metadata>>>
        lookupMetadata(List<String> keys) { /* ... */ }
      }
    Scala
      val data = service.lookupMetadata(keys)
      type DF = List[ListenableFuture[Map[String, Track]]]
      def process(data: DF) = { /* ... */ }
  • 18. WHY SCALA
    Higher order functions
    Java
      List<Integer> list = Lists.newArrayList(1, 2, 3);
      Lists.transform(list, new Function<Integer, Integer>() {
        @Override
        public Integer apply(Integer input) {
          return input + 1;
        }
      });
    Scala
      val list = List(1, 2, 3)
      list.map(_ + 1)  // List(2, 3, 4)
    And then imagine if you have to chain or nest functions
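To make the closing remark about chaining and nesting concrete, here is a short sketch in Scala using only the standard library. The value names are illustrative.

```scala
object ChainingSketch {
  val increment: Int => Int = _ + 1
  val double: Int => Int = _ * 2

  // andThen composes left to right: increment first, then double.
  val incrementThenDouble: Int => Int = increment.andThen(double)

  // Chaining: one composed function passed to a single map.
  val chained: List[Int] = List(1, 2, 3).map(incrementThenDouble)

  // Nesting: a map inside a map, still a single expression.
  val nested: List[List[Int]] =
    List(List(1, 2), List(3)).map(_.map(increment))
}
```

Each of these would need its own anonymous `Function` class in the pre-Java 8 Guava style.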
  • 19. WHY SCALA
    Collections API
      val l = List(1, 2, 3, 4, 5)
      l.map(_ + 1)                      // List(2, 3, 4, 5, 6)
      l.filter(_ > 3)                   // List(4, 5)
      l.zip(List("a", "b", "c")).toMap  // Map(1 -> a, 2 -> b, 3 -> c)
      l.partition(_ % 2 == 0)           // (List(2, 4), List(1, 3, 5))
      List(l, l.map(_ * 2)).flatten     // List(1, 2, 3, 4, 5, 2, 4, 6, 8, 10)
      l.reduce(_ + _)                   // 15
      l.fold(100)(_ + _)                // 115
      "We all live in Amerika".split(" ").groupBy(_.size)
      // Map(2 -> Array(We, in), 4 -> Array(live),
      //     7 -> Array(Amerika), 3 -> Array(all))
  • 20. WHY SCALA
    Scalding field-based word count
      TextLine(path)
        .flatMap('line -> 'word) { line: String => line.split("""\W+""") }
        .groupBy('word) { _.size }
    Scalding type-safe word count
      TextLine(path).read.toTypedPipe[String](Fields.ALL)
        .flatMap(_.split("""\W+"""))
        .groupBy(identity).size
    Scrunch word count
      read(from.textFile(file))
        .flatMap(_.split("""\W+"""))
        .count
  • 21. WHY SCALA
    Summingbird word count
      source
        .flatMap { line: String => line.split("""\W+""").map((_, 1)) }
        .sumByKey(store)
    Spark word count
      sc.textFile(path)
        .flatMap(_.split("""\W+"""))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
    Stratosphere word count
      TextFile(textInput)
        .flatMap(_.split("""\W+"""))
        .map(word => (word, 1))
        .groupBy(_._1)
        .reduce { (w1, w2) => (w1._1, w1._2 + w2._2) }
  • 22. WHY SCALA
    Many patterns also common in Java
    Java 8 lambdas and streams
    Guava, Crunch, etc.
      Optional, Predicate
      Collection transformations
      ListenableFuture and transform
      parallelDo, DoFn, MapFn, CombineFn
  • 23. COMMON MISCONCEPTIONS
    It's complex
      True for language features
      Not from user's perspective
      We only use 20% of the features
      Not more than needed in Java
  • 24. COMMON MISCONCEPTIONS
    It's slow
      No slower than Python
      Depends on how pure the FP is
      Trade-off with productivity
      Drop down to Java or native libraries
  • 25. COMMON MISCONCEPTIONS
    I don't want to learn a new language
      How about flatMap, reduce, fold, etc.?
      Unnecessary overhead interfacing with Python or Java
      You've used monoids, monads, or higher order functions already
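As a sketch of how familiar these ideas already are, the example below uses only the Scala standard library. The lookup tables and the `artistFor` name are hypothetical. `Option.flatMap` is an everyday monad operation, and `fold` with a zero element and an associative operation is a monoid in practice.

```scala
object FamiliarSketch {
  // Hypothetical lookup tables for illustration.
  val userToTrack: Map[String, String] = Map("alice" -> "t1")
  val trackToArtist: Map[String, String] = Map("t1" -> "artist1")

  // Option.flatMap chains lookups that may fail, replacing nested null checks.
  def artistFor(user: String): Option[String] =
    userToTrack.get(user).flatMap(track => trackToArtist.get(track))

  // fold with zero element 0 and associative operation +.
  val total: Int = List(1, 2, 3).fold(0)(_ + _)
}
```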
  • 26. THE END
    THANK YOU
