Scalable and Flexible Machine Learning With Scala @ LinkedIn


The presentation given by Chris Severs and myself at the Bay Area Scala Enthusiasts meetup.

  • I have been in this room on 3 special occasions. Just this last Friday, some of the most influential people in our industry, like Jeff Weiner and Reid Hoffman, judged our internal mini startup contest called Incubator. When Bryan Stevenson, the person who got the longest standing ovation at TED, was giving a private talk to LinkedIn employees. And when we celebrated our successful year by everyone getting new iPads. And let me tell you, I have never seen this room so full. So thank you all for coming, and I promise you that this talk won’t be nearly as exciting as those occasions.
  • There are a lot of ways to develop MR flows; when I came to LinkedIn I saw 3 different patterns …
  • Simple here is as opposed to complex (see Rich Hickey’s talk “Simple Made Easy”).
  • Simple is the key message – if you had to take away a single point from the entire talk, this would be it.
  • ----- Meeting Notes (3/11/13 17:16) ----- This is the Facebook user object, well, part of it. Avro schemas can get ridiculously big.

    1. Scalable and Flexible Machine Learning With Scala
    2. Who are we? @BigDataSc @ccsevers
    3. Stuff you will see today …
        Different types of data scientists – Comparison of different approaches to develop machine learning flows
        Code
        The five tool tool – Why Scala (and its ecosystem) is the best tool to develop machine learning flows (Hint: MapReduce is functional)
        Some more code
        Machine Learning examples – Real life (well … almost) examples of different machine learning problems
        Even more code
    4. “Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation is not something that gets in the way of solving the problem – it is the problem!” DJ Patil – Founding member of the LinkedIn data science team
    5. The data funnel
        Real data is an awful, terrible mess.
        Cleaning often is a process of operating on data, excluding some data, bucketing data and calculating aggregates about the data:
        Generate: map, flatMap, for
        Exclude: filter
        Bucket: group, groupBy, groupWith
        Aggregate: sum, reduce, foldLeft
        These blocks form the basis of most data flows.
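
The four building blocks on the data funnel slide map directly onto the standard Scala collections. The following is a minimal sketch, assuming a made-up comma-separated click log; the `Click` record, the sample lines and the field names are hypothetical and do not come from the talk.

```scala
// Hypothetical record type, used only for this illustration.
final case class Click(user: String, category: String, seconds: Int)

object DataFunnel {
  // Made-up raw input, including one line that does not parse.
  val raw: Seq[String] = Seq(
    "alice,shoes,12",
    "bob,phones,0",
    "alice,phones,30",
    "carol,shoes,7",
    "malformed line"
  )

  // Generate: flatMap turns each raw line into zero or one parsed record.
  val clicks: Seq[Click] = raw.flatMap { line =>
    line.split(",") match {
      case Array(u, c, s) if s.nonEmpty && s.forall(_.isDigit) => Some(Click(u, c, s.toInt))
      case _ => None
    }
  }

  // Exclude: filter drops records we do not want to count.
  val engaged: Seq[Click] = clicks.filter(_.seconds > 0)

  // Bucket: groupBy collects the records of each category.
  val byCategory: Map[String, Seq[Click]] = engaged.groupBy(_.category)

  // Aggregate: sum the time spent per category.
  val secondsPerCategory: Map[String, Int] =
    byCategory.map { case (cat, cs) => (cat, cs.map(_.seconds).sum) }
}
```

Each stage is an ordinary immutable value, so every intermediate result can be inspected in the REPL before the next stage is written.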
    6. There are many ways to develop data flows
    7. The mixer
    8. The Mixer Word Count
            from org.apache.pig.scripting import *
            @outputSchema("b: bag{ w: chararray}")
            def tokenize(words):
                return words.split(" ")
            script = """
            A = load './input.txt';
            B = foreach A generate flatten(tokenize((chararray)$0)) as word;
            C = group B by word;
            D = foreach C generate group, COUNT(B);
            store D into './wordcount' using AvroStorage("schema");
            """
            Pig.compile(script).bind().runSingle()
            {"schema": {
                "type": "record",
                "name": "WordCount",
                "fields": [
                    { "name": "word", "type": "string" },
                    { "name": "count", "type": "int" }
                ]}}
    9. The Mixer Data Scientist
        Too many occurrences of code inside strings
        Three different languages inside a single file
        User Defined Functions (UDFs) vs. Language Support
        Not real Python, but Jython (which is missing some libraries)
        This is just a simple word count!
    10. The Mixer Data Scientist
        Pig is great at extract, transform, load (ETL) …
        as long as you want to use a function that is already part of the included library …
        or you get someone else to write it for you (hello, DataFu!)
        Realistically you will need to maintain a Pig code base and a code base in some language which can run on the JVM
        Pig Latin is a bit funky, missing a lot of core programming language features
        Pig Latin is interpreted so you get (limited) type and syntax checking only at runtime
    11. The Expert
    12. The Expert Word Count
            hadoop fs -get input.txt input.txt
            cp /mnt/hadoop/input.txt ~/MyProjects/WordCount/input.txt
            #!/usr/bin/perl
            use strict;
            use warnings;
            my %count_of;
            while (my $line = <>) { # read from file or STDIN
              foreach my $word (split /\s+/, $line) {
                $count_of{$word}++;
              }
            }
            print "All words and their counts: \n";
            for my $word (sort keys %count_of) {
              print "$word: $count_of{$word}\n";
            }
            __END__
    13. The Scalable Expert – Hadoop Streaming
        Lets you use any language you want.
        Same issues as Java MapReduce with regards to multiple passes, complicated joins, etc.
        Always reading from stdin and writing to stdout.
        Easy to test out on local data – cat myfile.txt | | sort |
        Actual data may not be as nice.
        No type checking on input or output can lead to problems.
        The main reason to do this is so you can use a nice interpreted language to do your processing.
    14. The craftsman
    15. The Craftsman Word Count
            package org.myorg;
            import java.io.IOException;
            import java.util.*;
            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.conf.*;
            import org.apache.hadoop.io.*;
            import org.apache.hadoop.mapreduce.*;
            import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
            import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
            import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
            import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
            public class WordCount {
              public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
                private final static IntWritable one = new IntWritable(1);
                private Text word = new Text();
                public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                  String line = value.toString();
                  StringTokenizer tokenizer = new StringTokenizer(line);
                  while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                  }
                }
              }
              public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
                public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable val : values) {
                    sum += val.get();
                  }
                  context.write(key, new IntWritable(sum));
                }
              }
              public static void main(String[] args) throws Exception {
                Configuration conf = new Configuration();
                Job job = new Job(conf, "wordcount");
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setMapperClass(Map.class);
                job.setReducerClass(Reduce.class);
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                job.waitForCompletion(true);
              }
            }
    16. The Craftsman Data Scientist
        If you like Java it works fine … until you want to do more than one pass, a complicated join or anything fancy.
        Cascading solves many of these problems for you but it is still very verbose.
    17. We need a better tool: a five tool tool!
    18. The Pragmatic Data Scientist
        Agile – Iterates quickly
        Productive – Uses the right tool for the right job
        Correct – Tests as much as he can before the job is even submitted
        Scalable – Can handle real world problems
        Simple – Single language to represent Operations, UDFs and Data
    19. The Pragmatic Data Scientist
        Agile – Iterates quickly
        Productive – Uses the right tool for the right job
        Correct – Tests as much as he can before the job is even submitted
        Scalable – Can handle real world problems
        Simple – Single language to represent Operations, UDFs and Data
    20. Agility – Data is complex
    21. Agility – Try before you buy
            scala> 1 to 10
            res0: Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
            scala> 1 until 10
            res1: Range = Range(1, 2, 3, 4, 5, 6, 7, 8, 9)
            scala> res0.slice(3, 5)
            res3: scala.collection.immutable.IndexedSeq[Int] = Vector(4, 5)
            scala> res0.groupBy(_ % 2)
            res4: Map[Int, IndexedSeq[Int]] = Map(1 -> Vector(1, 3, 5, 7, 9), 0 -> Vector(2, 4, 6, 8, 10))
    22. Productivity – Don’t reinvent the wheel
    23. Productivity – Have the work done for you
        Python Collections Operators: map, reduce, filter, sum, min/max
        Scala Collections Operators: foreach, map, flatMap, collect, find, takeWhile, dropWhile, filter, withFilter, filterNot, splitAt, span, partition, groupBy, forall, exists, count, fold, reduce, sum, product, min/max
    24. Correctness – how to keep your sanity
    25. Scalability – works on more than your machine
        Integrates with Hadoop (more than just streaming)
        Has the support of scalable libraries
        Parallel by design – not just for M/R flows
    26. Simplicity
        Paco Nathan, Evil Mad Scientist, Concurrent Inc., @pacoid, says:
        – “[Scalding] code is compact, simple to understand”
        – “nearly 1:1 between elements of conceptual flow diagram and function calls”
        – “Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process”
        Scala is a functional tool for a fundamentally functional job
    27. Hadoop basics: Let’s count some words
    28. Let’s count some words
        This is the “Hello, World!” of anything tangentially related to Hadoop.
        Let’s try it in Scala first without any Hadoop stuff.
            val myLines : Seq[String] = ... // get some stuff
            val myWords = myLines.flatMap(w => w.split("\\s+"))
            val myWordsGrouped = myWords.groupBy(identity)
            val countedWords = myWordsGrouped.mapValues(x=>x.size)
        Now write out the words somehow
            val countedWords = myLines.flatMap(_.split("\\s+"))
                                      .groupBy(identity)
                                      .mapValues(_.size)
    29. Let’s count a lot of words
        I’ve gone to the trouble of rewriting this example to run in Hadoop. Here it is:
            val myLines : TypedPipe[String] = TextLine(args("input"))
            val myWords = myLines.flatMap(w => w.split("\\s+"))
            val myWordsGrouped = myWords.groupBy(identity)
            val countedWords = myWordsGrouped.mapValueStream(x => Iterator(x.size))
        We can make this even better.
            val countedWords = myWordsGrouped.size
            countedWords.write(TypedTsv[(String,Long)](output))
    30. Something for nothing
        Other people have already done the hard work to make the previous example run.
        The previous example is using Scalding, a Scala library to write (mainly) Hadoop MapReduce jobs.
        It even has its own Twitter account, @scalding
        Created by:
        – Avi Bryant @avibryant
        – Oscar Boykin @posco
        – Argyris Zymnis @argyris
        Tweet them now and tell them how awesome it is … I’ll wait
    31. Side by side comparison of local and Hadoop
        Local:
            val myWords = myLines.flatMap(w => w.split("\\s+"))
            val myWordsGrouped = myWords.groupBy(identity)
            val countedWords = myWordsGrouped.mapValues(x=>x.size)
        Hadoop:
            val myWords = myLines.flatMap(w => w.split("\\s+"))
            val myWordsGrouped = myWords.groupBy(identity)
            val countedWords = myWordsGrouped.size
        There are some small differences, mainly due to how the underlying Hadoop process needs to happen.
    32. Why does this work?
        Scala has support for embedded domain specific languages (DSLs).
        Scalding includes a couple DSLs for specifying Cascading (and by extension Hadoop) workflows.
        Info about Cascading:
        One of the Scalding DSLs, the Typed one, is designed to be very close to the standard Scala collections API.
        It’s not a perfect mapping due to how Cascading and Hadoop work, but in general it is very easy to write your code locally, change a couple small bits, and have it run on a Hadoop cluster.
        Scalding also has a local mode if you want the syntactic sugar without fussing with Hadoop.
    33. DSLs for everyone!
        We’re showing you Scalding in this talk, but there are others that are similar.
        – Scoobi:
        – Scrunch:
        All three attempt to make code written against Scala collections work (almost) seamlessly in Hadoop.
        More on DSLs:
        Some guts:
    34. Fields based DSL
        From com.twitter.scalding.Dsl
            /**
             * This object has all the implicit functions and values that are used
             * to make the scalding DSL.
             *
             * Its useful to import Dsl._ when you are writing scalding code outside
             * of a Job.
             */
            object Dsl extends FieldConversions with TupleConversions with GeneratedTupleAdders with java.io.Serializable {
              implicit def pipeToRichPipe(pipe : Pipe) : RichPipe = new RichPipe(pipe)
              implicit def richPipeToPipe(rp : RichPipe) : Pipe = rp.pipe
            }
    35. Typed DSL
        From com.twitter.scalding.TDsl
            /** implicits for the type-safe DSL
             * import TDsl._ to get the implicit conversions from Grouping/CoGrouping to Pipe,
             * to get the .toTypedPipe method on standard cascading Pipes.
             * to get automatic conversion of Mappable[T] to TypedPipe[T]
             */
            object TDsl extends Serializable with GeneratedTupleAdders {
              implicit def pipeTExtensions(pipe : Pipe) : PipeTExtensions = new PipeTExtensions(pipe)
              implicit def mappableToTypedPipe[T](mappable : Mappable[T])
                (implicit flowDef : FlowDef, mode : Mode, conv : TupleConverter[T]) : TypedPipe[T] = {
                TypedPipe.from(mappable)(flowDef, mode, conv)
              }
            }
    36. Algebird – It’s like algebra and a bird
        We did something fancy in the previous example:
            val countedWords = myGroupedWords.size
            val countedWords = myGroupedWords.mapValues(x => 1L).sum
            val countedWords = myGroupedWords.mapValues(x => 1L).reduce(implicit mon: Monoid[Long])((l,r) => mon.plus(l,r))
        Scalding uses Algebird extensively to make your life easier.
        Algebird can also be used outside of Scalding with no trouble.
        Algebird has your favorite things like monoids, monads, bloom filters, count-min sketches, hyperloglogs, etc.
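
The three equivalent ways of counting on the Algebird slide all rest on one idea: a type with a zero and an associative `plus` can be summed in any grouping, which is what makes the operation safe to distribute. The sketch below illustrates that idea with a hand-rolled `SimpleMonoid` trait. It is a stand-in written for this example and is not Algebird's actual `Monoid` API; the names `SimpleMonoid`, `sumAll` and `MonoidDemo` are assumptions.

```scala
// A minimal stand-in for the monoid idea; NOT the Algebird API.
trait SimpleMonoid[T] {
  def zero: T
  def plus(l: T, r: T): T
}

object SimpleMonoid {
  // Longs form a monoid under addition.
  implicit val longAddition: SimpleMonoid[Long] = new SimpleMonoid[Long] {
    val zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Maps merge key by key, combining values with the value monoid.
  implicit def mapMonoid[K, V](implicit v: SimpleMonoid[V]): SimpleMonoid[Map[K, V]] =
    new SimpleMonoid[Map[K, V]] {
      val zero: Map[K, V] = Map.empty
      def plus(l: Map[K, V], r: Map[K, V]): Map[K, V] =
        r.foldLeft(l) { case (acc, (k, rv)) =>
          acc.updated(k, v.plus(acc.getOrElse(k, v.zero), rv))
        }
    }

  // Sum any collection whose element type has a monoid in scope.
  def sumAll[T](xs: Iterable[T])(implicit m: SimpleMonoid[T]): T =
    xs.foldLeft(m.zero)(m.plus)
}

object MonoidDemo {
  import SimpleMonoid._

  val words: Seq[String] = Seq("a", "b", "a", "c", "a")

  // "size" expressed as: map every element to 1L, then sum.
  val countOfA: Long = sumAll(words.filter(_ == "a").map(_ => 1L))

  // Word count expressed as a sum of single-entry maps.
  val counts: Map[String, Long] = sumAll(words.map(w => Map(w -> 1L)))
}
```

The map instance is the same trick the later entropy example relies on when it sums `Map(cat -> 1L)` values per word.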
    37. Counting words with some extra information
        Sometimes we want to know some information about the contexts that words occurred in.
        At eBay, this is often the category that a term appeared in.
        Let’s count words and calculate the entropy of the category distribution for each word.
        – If you’re unfamiliar with this type of entropy just think of it as a measure of how concentrated the distribution is.
        – If you really like formulas it is: $H = -\sum_i p(x_i) \log p(x_i)$
    38. More code
            case class MyAvroOutput(word: String, count: Long, entropy: Double) extends AvroRecord
            TypedTsv[(String,Int)]
              .flatMap{ case(line,cat) => line.split("\\s+").map(x => (x,Map(cat->1L))) }
              .group
              .sum
              .map{ case(word, dist) =>
                val total: Double = dist.values.sum
                val entropy = (-1)*dist.values.map{ count => (count/total)*math.log(count/total) }.sum
                MyAvroOutput(word,total.toLong,entropy)
              }
              .write(PackedAvroSource[MyAvroOutput](output))
        Math is great
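
The same calculation can be tried locally on plain Scala collections before it goes anywhere near a cluster. This is a rough sketch under stated assumptions: the input pairs are made up, categories are plain `Int` ids, and the natural logarithm is used as in the slide's `math.log`. It does not use Scalding or Avro.

```scala
object CategoryEntropy {
  // (line of text, category id) pairs; the data is invented for illustration.
  val input: Seq[(String, Int)] = Seq(
    ("red shoes", 1),
    ("red phone", 2),
    ("red shoes", 1),
    ("blue phone", 2)
  )

  // word -> (category -> number of occurrences in that category)
  val distributions: Map[String, Map[Int, Long]] =
    input
      .flatMap { case (line, cat) => line.split("\\s+").map(word => (word, cat)) }
      .groupBy(_._1)
      .map { case (word, pairs) =>
        (word, pairs.groupBy(_._2).map { case (cat, ps) => (cat, ps.size.toLong) })
      }

  // H = -sum_i p_i * ln(p_i), where p_i is the share of category i.
  def entropy(dist: Map[Int, Long]): Double = {
    val total = dist.values.sum.toDouble
    dist.values.map { count =>
      val p = count / total
      -p * math.log(p)
    }.sum
  }

  // word -> (total count, entropy of its category distribution)
  val result: Map[String, (Long, Double)] =
    distributions.map { case (word, dist) => (word, (dist.values.sum, entropy(dist))) }
}
```

A word seen in only one category gets entropy 0, while a word spread over several categories gets a larger value, which matches the "how concentrated is the distribution" reading above.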
    39. Machine Learning Examples: The reason why you are here
    40. Classification case study: How much should we charge for a Titanic insurance?
    41. Titanic II case study
        We want to sell life insurance to passengers of Titanic II.
        All we have is data from Titanic I.
        We have to be able to explain why we charge the prices we do (damn regulators!)
    42. Titanic I Data
        Cabin class – e.g. 1st, 2nd, 3rd ..
        Name – String
        Age – Integer
        Embark place – String
        Destination – String
        Room – Integer
        Ticket – Integer
        Gender – Male or Female
    43. Titanic Model
    44. Classifier code
            object Titanic {
              def main(args: Array[String]) = {
                // parse data
                val reader = new CSVReader(new FileReader("src/main/data/titanic.csv"))
                val passengers =
                val instances =
                // build tree
                val treeBuilder = new TreeBuilder
                val tree = treeBuilder.buildTree(instances)
                // print tree
                tree.dump(System.out)
              }
            }
    45. Titanic Model
    46. Clustering case study: Let’s cluster some eBay keywords.
    47. Motivation
        eBay, like any large site, has a massive number of unique queries every day.
        Identifying groups of queries based on user behavior might help us to understand the individual queries better.
        For queries we are unsure of we can even try and match them into a cluster that contains queries we know a lot about.
        We can use behavioral things like:
        – number of searches
        – number of clicks
        – number of subsequent bids, buys
        – number of exits
        – etc
    48. Let’s use Mahout
        Apache Mahout, @ApacheMahout, is a powerful machine learning and data mining library that works with Hadoop.
        It has a ton of great stuff in it, but many of the drawbacks of using Java MapReduce apply.
        It uses some proprietary data formats (is your data in VectorWritable SequenceFiles?)
        Luckily for us, there are some nice things that work as standalone pieces.
        Coming in release 0.8, there is an excellent single pass k-means clustering algorithm we can use.
    49. Let’s use Mahout, inside Scalding
            lazy val clust = new StreamingKMeans(new FastProjectionSearch(new EuclideanDistanceMeasure,5,10),
              args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])
            var count = 0;
            val sloppyClusters = TextLine(args("input"))
              .map{ str =>
                val vec = str.split("\t").map(_.toDouble)
                val cent = new Centroid(count, new DenseVector(vec))
                count += 1
                cent
              }
              .toPipe('centroids)
              // This won’t work with the current build, coming soon though
              .unorderedFoldTo[StreamingKMeans,Centroid]('centroids->'clusters)(clust){(cl,cent) => cl.cluster(cent); cl}
              .toTypedPipe[StreamingKMeans](Dsl.intFields(Seq(0)))
              .flatMap(c => c.iterator.asScala.toIterable)
    50. Let’s use Mahout, inside Scalding
            val finalClusters = sloppyClusters.groupAll
              .mapValueStream{centList =>
                lazy val bclusterer = new BallKMeans(new BruteSearch(new EuclideanDistanceMeasure),
                  args("numclusters").toInt, 100)
                bclusterer.cluster(centList.toList.asJava)
                bclusterer.iterator.asScala
              }
              .values
    51. Results
        These are primarily eBay head queries. Remember that the clustering algorithm knows nothing about the text in the query.
        Sample groups:
        – chanel, tory burch, diamond ring, kathy van zeeland handbags, ...
        – ipad 4th generation, samsung galaxy s iii, iphone 4 s, nexus 4, ipad mini, ...
        – kohls coupons, lowes coupons
        – jcrew, cole haan, diesel, banana republic, gucci, burberry, brooks brothers, …
        – ferrari, utility trailer, polaris ranger, porsche 911, dump truck, bmw m3, chainsaw, rv, chevelle, vw bus, dodge charger, ...
        – paypal account, apple touch icon precomposed.png, paypal, undefined, ps3%2520games, michael%2520kors
    52. Clustering Takeaway
        There are some excellent libraries that exist, and even fit the functional model.
        Scala and Scalding will help you work around the rough edges and integrate them into your data flow, rather than having to create new data flows.
        Being able to prototype locally and in the Scala REPL saves massive amounts of developer time.
    53. Matrix API case study: Using LinkedIn endorsement data to rank Scala experts
    54. LinkedIn Endorsements
    55. Page Rank Algorithm
    56. Prepare Data
            def prepareData = {
              // read endorsements and transform to edges
              val ends = readFile[Endorsement]("endorsements")
                .filter(_.skill == "Scala")
                .map(e => (e.sender, e.recipient, 1))
                .write(TSV("edges"))
            }
            def getDominantEigenVector = { … } // outputs to “ranks” (memberId, rank)
            def getMembers = {
              // get Bay Area members
              val members = readLatest[Member]("members")
                .filter(_.getRegionCode == 84)
                .groupBy(_.getMemberId.toLong)
              // join ranks and members
              readFile[Ranks]("ranks").withReducers(10).join(members).toTypedPipe
                .map{ case (id, ((_, rank), m)) =>
                  (rank, m.getMemberId, m.getFirstName, m.getLastName, m.getHeadline) }
                .groupAll.sortBy(_._1).reverse.values
                .write(TextLine("talk/scalaRanks"))
            }
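
The body of `getDominantEigenVector` is elided on the slide. As a rough idea of what such a step computes, here is an in-memory sketch of textbook PageRank by power iteration with a damping factor. It is not the deck's implementation and does not use the Scalding Matrix API; the edge list, the `damping` and `iterations` defaults, and all names are assumptions made for this example.

```scala
object EndorsementRank {
  // Hypothetical endorsement edges: (sender, recipient).
  val edges: Seq[(Long, Long)] = Seq((1L, 2L), (3L, 2L), (2L, 3L), (4L, 2L))

  def pageRank(
      edges: Seq[(Long, Long)],
      damping: Double = 0.85,
      iterations: Int = 50
  ): Map[Long, Double] = {
    val nodes: Seq[Long] = edges.flatMap { case (s, r) => Seq(s, r) }.distinct
    val n = nodes.size.toDouble
    val outLinks: Map[Long, Seq[Long]] =
      edges.groupBy(_._1).map { case (s, es) => (s, es.map(_._2)) }

    // Start from a uniform distribution over all members.
    var current: Map[Long, Double] = nodes.map(id => (id, 1.0 / n)).toMap
    for (_ <- 1 to iterations) {
      // Rank held by members who endorsed nobody is spread evenly over everyone.
      val dangling = nodes.filterNot(outLinks.contains).map(current).sum
      // Each sender splits its current rank evenly among the members it endorsed.
      val received: Map[Long, Double] =
        edges
          .map { case (s, r) => (r, current(s) / outLinks(s).size) }
          .groupBy(_._1)
          .map { case (r, shares) => (r, shares.map(_._2).sum) }
      current = nodes.map { id =>
        (id, (1 - damping) / n + damping * (received.getOrElse(id, 0.0) + dangling / n))
      }.toMap
    }
    current
  }

  val ranks: Map[Long, Double] = pageRank(edges)

  // Highest rank first, mirroring the sortBy(...).reverse step on the slide.
  val ranked: Seq[(Long, Double)] = ranks.toSeq.sortBy(-_._2)
}
```

In this toy graph member 2 is endorsed by three others and ends up on top; the ranks always sum to 1 because every iteration redistributes the full probability mass.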
    57. Matrix API
            mat.mapValues( func ) : Matrix
            mat.filterValues( func ) : Matrix
            mat.getRow( ind ) : RowVector
            mat.reduceRowVectors{ f } : RowVector
            mat.sumRowVectors : RowVector
            mat.mapRows{ func } : Matrix
            mat.topRowElems( k ) : Matrix
            mat.rowL1Normalize : Matrix
            mat.rowL2Normalize : Matrix
            rowMeanCentering : Matrix
            rowSizeAveStdev : Matrix
            matrix1 * matrix2 : Matrix
            matrix / scalar(Scalar) : Matrix
            elemWiseOp( mat2 ){ func }
            mat1.hProd( matrix2 ) : Matrix
            mat2/r/c ) : Matrix
            matrix.nonZerosWith( sclr )
            matrix.trace : Scalar
            matrix.sum : Scalar
            matrix.transpose : Matrix
            mat.diagonal : DiagonalMatrix
    58. Endorsements Page Rank: Time for Results!
    61. Only one slide left! Summary
    62. Stuff you have seen today …
        There are many ways to develop machine learning programs; none of them is perfect.
        Scala, which reflects the years of evolution since Java’s invention, and Scalding, which is the same for vanilla MapReduce, are a much better alternative.
        Machine learning is fun and not necessarily complicated.