Scalable and Flexible Machine Learning With Scala
Who are we?




     @BigDataSc   @ccsevers
                              2
Stuff you will see today …

 Different types of data scientists – Comparison of different
  approaches to develop machine learning flows
 Code
 The five tool tool – Why Scala (and its ecosystem) is the best tool to
  develop machine learning flows (Hint: MapReduce is functional)
 Some more code
 Machine Learning examples – Real life (well … almost) examples
  of different machine learning problems
 Even more code




                                                                           3
“Good data scientists understand, in a deep
   way, that the heavy lifting of cleanup and
 preparation is not something that gets in the
way of solving the problem – it is the problem!”
         DJ Patil – Founding member of the LinkedIn data science team




                                                                        4
The data funnel

 Real data is an awful, terrible mess
 Cleaning often is a process of operating on data, excluding some
  data, bucketing data and calculating aggregates about the data

        Generate                   map, flatMap, for
        Exclude                    filter
        Bucket                     group, groupBy, groupWith
        Aggregate                  sum, reduce, foldLeft


 These blocks form the basis of most data flows
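
The four stages above map directly onto Scala collection methods. Below is a minimal sketch, assuming a hypothetical Event record and comma-separated input lines (neither comes from the talk):

```scala
object DataFunnel {
  // Hypothetical record type, used only for illustration.
  final case class Event(user: String, category: String, amount: Double)

  // Parse one "user,category,amount" line; malformed lines yield None.
  def parse(line: String): Option[Event] =
    line.split(",").map(_.trim) match {
      case Array(user, category, amount) =>
        scala.util.Try(amount.toDouble).toOption.map(Event(user, category, _))
      case _ => None
    }

  // Generate: turn raw lines into events (flatMap drops unparseable lines).
  def generate(lines: Seq[String]): Seq[Event] =
    lines.flatMap(line => parse(line).toList)

  // Exclude: keep only events with a positive amount.
  def exclude(events: Seq[Event]): Seq[Event] = events.filter(_.amount > 0)

  // Bucket: group events by category.
  def bucket(events: Seq[Event]): Map[String, Seq[Event]] = events.groupBy(_.category)

  // Aggregate: total amount per category.
  def aggregate(buckets: Map[String, Seq[Event]]): Map[String, Double] =
    buckets.map { case (category, es) => category -> es.map(_.amount).sum }

  def run(lines: Seq[String]): Map[String, Double] =
    aggregate(bucket(exclude(generate(lines))))
}
```

Each stage is an ordinary function, so it can be tried in the REPL on a handful of lines before any cluster is involved.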




                                                                     5
There are many ways to develop data flows




                                            6
The mixer




            7
The Mixer Word Count

#wordcount.py
from org.apache.pig.scripting import *

@outputSchema("b: bag{ w: chararray}")
def tokenize(words):
 return words.split(" ")

script = """
A = load './input.txt';
B = foreach A generate flatten(tokenize((chararray)$0)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into './wordcount' using AvroStorage("schema");
"""
Pig.compile(script).bind().runSingle()

{"schema": {
  "type": "record",
  "name": "WordCount",
  "fields": [
    {
      "name": "word",
      "type": "string"
    },
    {
      "name": "count",
      "type": "int"
    }]}}



                                                                             8
The Mixer Data Scientist

   Too many occurrences of code inside strings
   Three different languages inside a single file
   User Defined Functions (UDFs) vs. Language Support
   Not real Python, but Jython (which is missing some libraries)
   This is just a simple word count!




                                                                 9
The Mixer Data Scientist

 Pig is great at extract, transform, load (ETL)
 … as long as you want to use a function that is already part of the
  included library
 … or you get someone else to write it for you (hello, DataFu!)
 Realistically you will need to maintain a Pig code base and a code
  base in some language which can run on the JVM
 Pig Latin is a bit funky, missing a lot of core programming language
  features
 Pig Latin is interpreted so you get (limited) type and syntax
  checking only at runtime




                                                                         10
The Expert




             11
The Expert Word Count

hadoop fs -get input.txt input.txt
cp /mnt/hadoop/input.txt ~/MyProjects/WordCount/input.txt

#!/usr/bin/perl
use strict;
use warnings;

my %count_of;
while (my $line = <>) { #read from file or STDIN
  foreach my $word (split /\s+/, $line) {
    $count_of{$word}++;
  }
}
print "All words and their counts: \n";
for my $word (sort keys %count_of) {
  print "'$word': $count_of{$word}\n";
}
__END__


                                                            12
The Scalable Expert – Hadoop Streaming

 Lets you use any language you want.
 Same issues as Java MapReduce with regards to multiple passes,
  complicated joins, etc.
 Always reading from stdin and writing to stdout.
 Easy to test out on local data
    – cat myfile.txt | mymapper.sh | sort | myreducer.sh
 Actual data may not be as nice. No type checking on input or output
  can lead to problems.
 The main reason to do this is so you can use a nice interpreted
  language to do your processing.




                                                                        13
The craftsman




                14
The Craftsman Word Count
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}



                                                                                                                                           15
The Craftsman Data Scientist

 If you like Java it works fine
 … until you want to do more than one pass, a complicated join or
  anything fancy.
 Cascading solves many of these problems for you but it is still very
  verbose




                                                                         16
We need a better tool




   A five tool tool!




http://en.wikipedia.org/wiki/Five-tool_player
http://en.wikipedia.org/wiki/Willie_Mays
                                                17
The Pragmatic Data Scientist

   Agile – Iterates quickly
   Productive - Uses the right tool for the right job
   Correct - Tests as much as he can before the job is even submitted
   Scalable – Can handle real world problems
   Simple - Single language to represent Operations, UDFs and Data




                                                                         18
Agility – Data is complex




                            20
Agility – Try before you buy

scala> 1 to 10
res0: Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> 1 until 10
res1: Range = Range(1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> res0.slice(3, 5)
res3: scala.collection.immutable.IndexedSeq[Int] = Vector(4, 5)

scala> res0.groupBy(_ % 2)
res4: Map[Int, IndexedSeq[Int]] =
Map(1 -> Vector(1, 3, 5, 7, 9), 0 -> Vector(2, 4, 6, 8, 10))



                                                                  21
Productivity – Don't reinvent the wheel




                                          22
Productivity – Have the work done for you


Python Collections Operators   Scala Collections Operators
 map                           foreach       span
 reduce                        map           partition
 filter                        flatMap       groupBy
 sum                           collect       forall
 min/max                       find          exists
                               takeWhile     count
                               dropWhile     fold
                               filter        reduce
                               withFilter    sum
                               filterNot     product
                               splitAt       min/max
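
A few of the less familiar Scala operators from the list, shown on a small list of integers. This is only an illustrative sketch; the object and value names are made up for the example.

```scala
object CollectionOps {
  val xs: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

  // partition splits the whole collection by a predicate.
  val (evens, odds) = xs.partition(_ % 2 == 0)

  // span splits at the first element that fails the predicate.
  val (small, rest) = xs.span(_ < 4)

  // collect is filter and map in one step, driven by a partial function.
  val squaresOfEvens: List[Int] = xs.collect { case x if x % 2 == 0 => x * x }

  // takeWhile only looks at the leading run of matching elements.
  val prefix: List[Int] = xs.takeWhile(_ < 4)

  // foldLeft is the general form behind sum, product, count and friends.
  val total: Int = xs.foldLeft(0)(_ + _)
}
```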


                                                             23
Correctness – how to keep your sanity




                                        24
Scalability – works on more than your machine

 Integrates with Hadoop (more than just streaming)
 Has the support of scalable libraries
 Parallel by design – not just for M/R flows




                                                      25
Simplicity

 Paco Nathan, Evil Mad Scientist, Concurrent Inc., @pacoid, says:
   – “[Scalding] code is compact, simple to understand”
   – “nearly 1:1 between elements of conceptual flow diagram and function
      calls”
   – “Cascalog and Scalding DSLs leverage the functional aspects of
      MapReduce, helping to limit complexity in process”
 Scala is a functional tool for a fundamentally functional job




                                                                            26
Hadoop basics

Let’s count some words




                         27
Let’s count some words
 This is the “Hello, World!” of anything tangentially related to
  Hadoop.
 Let’s try it in Scala first without any Hadoop stuff.

   val myLines : Seq[String] = ... // get some stuff
   val myWords = myLines.flatMap(w => w.split("\\s+"))
   val myWordsGrouped = myWords.groupBy(identity)
   val countedWords = myWordsGrouped.mapValues(x=>x.size)
   Now write out the words somehow

val countedWords = myLines.flatMap(_.split("\\s+"))
 .groupBy(identity)
 .mapValues(_.size)
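
For trying this in the REPL, here is a self-contained variant of the local word count. It is a sketch: the object and method names are invented for the example, and it uses map over the grouped pairs instead of mapValues so that the result is a plain Map on every Scala version.

```scala
object LocalWordCount {
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty) // leading whitespace makes split emit an empty token
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Calling LocalWordCount.countWords(Seq("to be or", "not to be")) gives a map in which "to" and "be" have count 2 and the other words have count 1.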



                                                                    28
Let’s count a lot of words

 I’ve gone to the trouble of rewriting this example to run in Hadoop.
 Here it is:
 val myLines : TypedPipe[String] = TextLine(args("input"))
 val myWords = myLines.flatMap(w => w.split("\\s+"))
 val myWordsGrouped = myWords.groupBy(identity)
 val countedWords = myWordsGrouped.mapValueStream(x =>
  Iterator(x.size))
 We can make this even better.
 val countedWords = myWordsGrouped.size
 countedWords.write(TypedTsv[(String,Long)](output))




                                                                         29
Something for nothing

 Other people have already done the hard work to make the
  previous example run
 The previous example is using Scalding, a Scala library to write
  (mainly) Hadoop MapReduce jobs.
 https://github.com/twitter/scalding
 It even has its own Twitter account, @scalding
 Created by:
    – Avi Bryant @avibryant
    – Oscar Boykin @posco
    – Argyris Zymnis @argyris
 Tweet them now and tell them how awesome it is
 … I’ll wait




                                                                     30
Side by side comparison of local and Hadoop


Local (Scala collections)        Hadoop (Scalding)

val myWords =                    val myWords =
  myLines.flatMap(w =>             myLines.flatMap(w =>
    w.split("\\s+"))                 w.split("\\s+"))
val myWordsGrouped =             val myWordsGrouped =
  myWords.groupBy(identity)        myWords.groupBy(identity)
val countedWords =               val countedWords =
  myWordsGrouped.                  myWordsGrouped.
    mapValues(x=>x.size)             size

  There are some small differences, mainly due to how the
    underlying Hadoop process needs to happen.




                                                             31
Why does this work?

 Scala has support for embedded domain specific languages (DSLs)
 Scalding includes a couple DSLs for specifying Cascading (and by
  extension Hadoop) workflows.
 Info about Cascading: http://www.cascading.org/
 One of the Scalding DSLs, the Typed one, is designed to be very
  close to the standard Scala collections API
 It’s not a perfect mapping due to how Cascading and Hadoop work,
  but in general it is very easy to write your code locally, change a
  couple small bits, and have it run on a Hadoop cluster
 Scalding also has a local mode if you want the syntactic sugar
  without fussing with Hadoop




                                                                        32
DSLs for everyone!

 We’re showing you Scalding in this talk, but there are others that
  are similar.
    – Scoobi: https://github.com/NICTA/scoobi
    – Scrunch: https://github.com/cloudera/crunch/tree/master/scrunch
 All three attempt to make code written against Scala collections
  work (almost) seamlessly in Hadoop.
 More on DSLs: http://www.scala-lang.org/node/1403
 Some guts:




                                                                         33
Fields based DSL

From com.twitter.scalding.Dsl
/**
 * This object has all the implicit functions and values that are used
 * to make the scalding DSL.
 *
 * It's useful to import Dsl._ when you are writing scalding code outside
 * of a Job.
 */
object Dsl extends FieldConversions with TupleConversions with
GeneratedTupleAdders with java.io.Serializable {
  implicit def pipeToRichPipe(pipe : Pipe) : RichPipe = new
    RichPipe(pipe)
  implicit def richPipeToPipe(rp : RichPipe) : Pipe = rp.pipe
}
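
The same trick can be tried without Scalding or Cascading on the classpath. In the sketch below, Pipe and RichPipe are tiny stand-ins written for this example (they are not the real Cascading or Scalding classes), and mapRows and filterRows are made-up method names. The point is only to show how a pair of implicit conversions lets you call new methods on a class you do not own.

```scala
import scala.language.implicitConversions

object MiniDsl {
  // Stand-in for a third-party class we cannot modify.
  final class Pipe(val rows: List[String])

  // Wrapper that adds collection-style methods to Pipe.
  final class RichPipe(val pipe: Pipe) {
    def mapRows(f: String => String): Pipe = new Pipe(pipe.rows.map(f))
    def filterRows(p: String => Boolean): Pipe = new Pipe(pipe.rows.filter(p))
  }

  // With these in scope the compiler wraps and unwraps as needed.
  implicit def pipeToRichPipe(pipe: Pipe): RichPipe = new RichPipe(pipe)
  implicit def richPipeToPipe(rp: RichPipe): Pipe = rp.pipe

  def example: List[String] = {
    val in = new Pipe(List("a", "bb", "ccc"))
    // Pipe itself has no filterRows or mapRows; the conversion supplies them.
    in.filterRows(_.length > 1).mapRows(_.toUpperCase).rows
  }
}
```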
                                                                            34
Typed DSL

From com.twitter.scalding.TDsl
/** implicits for the type-safe DSL
 * import TDsl._ to get the implicit conversions from
Grouping/CoGrouping to Pipe,
 * to get the .toTypedPipe method on standard cascading Pipes.
 * to get automatic conversion of Mappable[T] to TypedPipe[T]
 */
object TDsl extends Serializable with GeneratedTupleAdders {
  implicit def pipeTExtensions(pipe : Pipe) : PipeTExtensions = new
    PipeTExtensions(pipe)
  implicit def mappableToTypedPipe[T](mappable : Mappable[T])
    (implicit flowDef : FlowDef, mode : Mode, conv :
    TupleConverter[T]) : TypedPipe[T] = {
      TypedPipe.from(mappable)(flowDef, mode, conv)
    }
}


                                                                      35
Algebird – It’s like algebra and a bird

 We did something fancy in the previous example. These three lines do the same thing:
 val countedWords = myWordsGrouped.size
 val countedWords = myWordsGrouped.mapValues(x =>
  1L).sum
 val countedWords = myWordsGrouped.mapValues(x =>
  1L).reduce((l, r) => implicitly[Monoid[Long]].plus(l, r))



 Scalding uses Algebird extensively to make your life easier.
 Algebird can also be used outside of Scalding with no trouble.
 Algebird has your favorite things like monoids, monads, bloom
  filters, count-min sketches, hyperloglogs, etc.
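
To make the monoid idea concrete without pulling in Algebird, here is a minimal sketch. The Monoid trait below is a simplified stand-in written for this example, not Algebird's actual definition, and sumAll is a made-up helper name.

```scala
object MonoidSketch {
  // Simplified stand-in: an identity element plus an associative combine.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Maps form a monoid when their values do: merge keys, combine values on collision.
  implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      val zero: Map[K, V] = Map.empty[K, V]
      def plus(l: Map[K, V], r: Map[K, V]): Map[K, V] =
        r.foldLeft(l) { case (acc, (k, v)) =>
          acc.updated(k, vm.plus(acc.getOrElse(k, vm.zero), v))
        }
    }

  // One generic sum for every type that has a Monoid in scope.
  def sumAll[T](xs: Seq[T])(implicit m: Monoid[T]): T = xs.foldLeft(m.zero)(m.plus)

  def longTotal: Long = sumAll(Seq(1L, 2L, 3L))

  def categoryTotals: Map[String, Long] =
    sumAll(Seq(Map("toys" -> 1L), Map("books" -> 1L), Map("toys" -> 1L)))
}
```

Because the combine step is associative, partial sums can be computed on different machines and merged in any grouping, which is what makes the pattern a good fit for MapReduce.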



                                                                   36
Counting words with some extra information

 Sometimes we want to know some information about the contexts
  that words occurred in. At eBay, this is often the category that a
  term appeared in.
 Let’s count words and calculate the entropy of the category
  distribution for each word.
    – If you’re unfamiliar with this type of entropy just think of it as a
      measure of how concentrated the distribution is.
    – If you really like formulas it is: H = -Σ_i p(x_i) log p(x_i)
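
A small sketch of that entropy in plain Scala, working from raw counts per category. The object and function names are made up for this example, and it uses the natural logarithm.

```scala
object EntropySketch {
  // Shannon entropy (natural log) of a distribution given as raw counts.
  def entropy(counts: Iterable[Long]): Double = {
    val total = counts.sum.toDouble
    counts.toSeq
      .filter(_ > 0) // zero counts contribute nothing and would hit log(0)
      .map { c =>
        val p = c / total
        -p * math.log(p)
      }
      .sum
  }
}
```

A word seen in only one category has entropy 0; a word spread evenly over two categories has entropy log 2; the more lopsided the distribution, the closer the value gets to 0.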




                                     http://en.wikipedia.org/wiki/Entropy_%28information_theory%29



                                                                                                     37
More code

case class MyAvroOutput(word: String, count: Long,
 entropy: Double) extends AvroRecord
TypedTsv[(String,Int)](input)
 .flatMap{case(line,cat) => line.split("\\s+").map(x =>
    (x,Map(cat->1L)))}
 .group
 .sum
 .map{ case(word, dist) =>
     val total: Double = dist.values.sum
     val entropy = (-1)*dist.values.map{ count =>
          (count/total)*math.log(count/total)}.sum
     MyAvroOutput(word,total.toLong,entropy)
 }
 .write(PackedAvroSource[MyAvroOutput](output))
 Math is great



                                                          38
Machine Learning Examples

The reason why you are here




                              39
Classification case study

How much should we charge for
Titanic insurance?




                                  40
Titanic II case study

 We want to sell life insurance to passengers of Titanic II
 All we have is data from Titanic I
 We have to be able to explain why we charge the prices we do
  (damn regulators!)




                                           http://commons.wikimedia.org

                                                                     41
Titanic I Data

   Cabin class – e.g. 1st, 2nd, 3rd ..
   Name – String
   Age – Integer
   Embark place – String
   Destination – String
   Room – Integer
   Ticket – Integer
   Gender – Male or Female




                                          42
Titanic Model




                http://www.dtreg.com/

                                   43
Classifier code

object Titanic {
  def main(args: Array[String]) = {
   // parse data
   val reader = new CSVReader(new FileReader(
      "src/main/data/titanic.csv"))
   val passengers = reader.readAll.tail.map(Passenger(_))
   val instances = passengers.map(_.getInstance).toSet
   // build tree
   val treeBuilder = new TreeBuilder
   val tree = treeBuilder.buildTree(instances)
   // print tree
   tree.dump(System.out)
   }
}


                                                            44
Titanic Model




                http://www.dtreg.com/

                                   45
Clustering case study

Let’s cluster some eBay
keywords.




                          46
Motivation

 eBay, like any large site, has a massive number of unique queries
  every day
 Identifying groups of queries based on user behavior might help us
  to understand the individual queries better
 For queries we are unsure of we can even try and match them into
  a cluster that contains queries we know a lot about.
 We can use behavioral things like:
    –   number of searches
    –   number of clicks
    –   number of subsequent bids, buys
    –   number of exits
    –   etc




                                                                       47
Let’s use Mahout

 Apache Mahout, http://mahout.apache.org/, @ApacheMahout, is a
  powerful machine learning and data mining library that works with
  Hadoop.
 It has a ton of great stuff in it, but many of the drawbacks of using
  Java MapReduce apply.
 It uses some proprietary data formats (is your data in
  VectorWritable SequenceFiles?)
 Luckily for us, there are some nice things that work as standalone
  pieces.
 Coming in release 0.8, there is an excellent single pass k-means
  clustering algorithm we can use.




                                                                          48
Let’s use Mahout, inside Scalding
lazy val clust = new StreamingKMeans(new FastProjectionSearch(new
EuclideanDistanceMeasure,5,10),
  args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])

var count = 0;
val sloppyClusters =
 TextLine(args("input"))
 .map{ str =>
   val vec = str.split("\t").map(_.toDouble)
   val cent = new Centroid(count, new DenseVector(vec))
   count += 1
   cent
 }
 .toPipe('centroids)
  // This won't work with the current build, coming soon though
 .unorderedFoldTo[StreamingKMeans,Centroid]('centroids->'clusters)(clust){(cl,cent) =>
     cl.cluster(cent); cl}
  .toTypedPipe[StreamingKMeans](Dsl.intFields(Seq(0)))
  .flatMap(c => c.iterator.asScala.toIterable)

                                                                                         49
Let’s use Mahout, inside Scalding

val finalClusters = sloppyClusters.groupAll
 .mapValueStream{centList =>
   lazy val bclusterer = new BallKMeans(new BruteSearch(
    new EuclideanDistanceMeasure),
    args("numclusters").toInt, 100)
   bclusterer.cluster(centList.toList.asJava)
   bclusterer.iterator.asScala
 }
 .values
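
For readers who have not met k-means before, the sketch below shows the basic idea in plain Scala: assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat. This is the textbook Lloyd iteration, not Mahout's StreamingKMeans or BallKMeans, and all names in it are made up for the example.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Coordinate-wise mean of a non-empty set of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / col.size).toVector

  // One iteration: assign points to the nearest centroid, then recompute centroids.
  def step(points: Seq[Point], centroids: Seq[Point]): Seq[Point] = {
    val assigned = points.groupBy(p => centroids.minBy(c => squaredDistance(p, c)))
    // A centroid that attracted no points keeps its old position.
    centroids.map(c => assigned.get(c).map(mean).getOrElse(c))
  }

  def cluster(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(initial)((cs, _) => step(points, cs))
}
```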




                                                           50
Results

 These are primarily eBay head queries. Remember that the
  clustering algorithm knows nothing about the text in the query.
 Sample groups:
    – chanel, tory burch, diamond ring, kathy van zeeland handbags, ...
    – ipad 4th generation, samsung galaxy s iii, iphone 4 s, nexus 4, ipad
      mini, ...
    – kohls coupons, lowes coupons
    – jcrew, cole haan, diesel, banana republic, gucci, burberry, brooks
      brothers, …
    – ferrari, utility trailer, polaris ranger, porsche 911, dump truck, bmw
      m3, chainsaw, rv, chevelle, vw bus, dodge charger, ...
    – paypal account, ebay.com, apple touch icon precomposed.png,
      paypal, undefined, ps3%2520games, michael%2520kors



                                                                               51
Clustering Takeaway

 There are some excellent libraries that exist, and even fit the
  functional model
 Scala and Scalding will help you work around the rough edges and
  integrate them into your data flow, rather than having to create new
  data flows
 Being able to prototype locally and in the Scala REPL saves
  massive amounts of developer time




                                                                         52
Matrix API case study

Using LinkedIn endorsement data
to rank Scala experts




                                  53
LinkedIn Endorsements




                        54
Page Rank Algorithm




                      http://commons.wikimedia.org

                                                55
Prepare Data
def prepareData = {
  // read endorsements and transform to edges
  val ends = readFile[Endorsement]("endorsements")
    .filter(_.skill == "Scala")
    .map(e => (e.sender, e.recipient, 1))
    .write(TSV("edges"))
}
def getDominantEigenVector = { … } // outputs to "ranks" (memberId, rank)
def getMembers = {
  // get Bay Area members
  val members = readLatest[Member]("members")
    .filter(_.getRegionCode == 84)
    .groupBy(_.getMemberId.toLong)
  // join ranks and members
  readFile[Ranks]("ranks").withReducers(10).join(members).toTypedPipe
    .map{ case (id, ((_, rank), m)) =>
      (rank, m.getMemberId, m.getFirstName, m.getLastName, m.getHeadline) }
    .groupAll.sortBy(_._1).reverse.values
    .write(TextLine("talk/scalaRanks"))
}
                                                                             56
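
The body of getDominantEigenVector is not shown in the talk. As a rough idea of what such a step computes, here is a small in-memory PageRank by power iteration over (sender, recipient) edges. It is a sketch under assumptions: the function name, the damping factor of 0.85 and the handling of nodes without outgoing edges are choices made for this example, and it does not use the Scalding Matrix API.

```scala
object PageRankSketch {
  def pageRank(edges: Seq[(Int, Int)], iterations: Int, damping: Double = 0.85): Map[Int, Double] = {
    val nodes = edges.flatMap { case (s, r) => Seq(s, r) }.distinct
    val n = nodes.size
    val outLinks: Map[Int, Seq[Int]] =
      edges.groupBy(_._1).map { case (s, es) => s -> es.map(_._2) }
    // Start from a uniform distribution over all nodes.
    val initial: Map[Int, Double] = nodes.map(node => node -> 1.0 / n).toMap

    (1 to iterations).foldLeft(initial) { (ranks, _) =>
      // Rank held by nodes with no outgoing edges is spread evenly over all nodes.
      val dangling = nodes.filterNot(outLinks.contains).map(ranks).sum
      // Every sender splits its current rank equally among its recipients.
      val received: Map[Int, Double] = edges
        .map { case (s, r) => r -> ranks(s) / outLinks(s).size }
        .groupBy(_._1)
        .map { case (r, shares) => r -> shares.map(_._2).sum }
      nodes.map { node =>
        node -> ((1 - damping) / n + damping * (received.getOrElse(node, 0.0) + dangling / n))
      }.toMap
    }
  }
}
```

With edges Seq((1, 3), (2, 3)), member 3 receives both endorsements and ends up with the highest rank.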
Matrix API


 mat.mapValues( func ) : Matrix         rowMeanCentering : Matrix
 mat.filterValues( func ) : Matrix      rowSizeAveStdev : Matrix
 mat.getRow( ind ) : RowVector          matrix1 * matrix2 : Matrix
 mat.reduceRowVectors{ f } : RowVector  matrix / scalar(Scalar) : Matrix
 mat.sumRowVectors : RowVector          elemWiseOp( mat2 ){ func }
 mat.mapRows{ func } : Matrix           mat1.hProd( matrix2 ) : Matrix
 mat.topRowElems( k ) : Matrix          mat1.zip( mat2/r/c ) : Matrix
 mat.rowL1Normalize : Matrix            matrix.nonZerosWith( sclr )
 mat.rowL2Normalize : Matrix            matrix.trace : Scalar
                                         matrix.sum : Scalar
                                         matrix.transpose : Matrix
                                         mat.diagonal : DiagonalMatrix


                                                                         57
Endorsements Page Rank

Time for Results!




                         58
59
60
Only one slide left!

Summary




                       61
Stuff you have seen today …

 There are many ways to develop machine learning programs, but none
  of them is perfect
 Scala, which reflects the years of evolution since Java's
  invention, and Scalding, which does the same for vanilla MapReduce,
  are a much better alternative
 Machine learning is fun and not necessarily complicated




                                                                     62
63

More Related Content

What's hot

Js set timeout & setinterval
Js set timeout & setintervalJs set timeout & setinterval
Js set timeout & setinterval
ARIF MAHMUD RANA
 
The Three Basic Selection Structures in C++ Programming Concepts
The Three Basic Selection Structures in C++ Programming ConceptsThe Three Basic Selection Structures in C++ Programming Concepts
The Three Basic Selection Structures in C++ Programming Concepts
Tech
 
FUNCTIONS IN PYTHON[RANDOM FUNCTION]
FUNCTIONS IN PYTHON[RANDOM FUNCTION]FUNCTIONS IN PYTHON[RANDOM FUNCTION]
FUNCTIONS IN PYTHON[RANDOM FUNCTION]
vikram mahendra
 
TP C++ : Correction
TP C++ : CorrectionTP C++ : Correction
Operators and expressions in c language
Operators and expressions in c languageOperators and expressions in c language
Operators and expressions in c language
tanmaymodi4
 
Pointers in C
Pointers in CPointers in C
Pointers in C
Prabhu Govind
 
project report in C++ programming and SQL
project report in C++ programming and SQLproject report in C++ programming and SQL
project report in C++ programming and SQL
vikram mahendra
 
Data structures using C
Data structures using CData structures using C
Data structures using C
Pdr Patnaik
 
16717 functions in C++
16717 functions in C++16717 functions in C++
16717 functions in C++
LPU
 
Basic c programming and explanation PPT1
Basic c programming and explanation PPT1Basic c programming and explanation PPT1
Basic c programming and explanation PPT1
Rumman Ansari
 
Python Advanced – Building on the foundation
Python Advanced – Building on the foundationPython Advanced – Building on the foundation
Python Advanced – Building on the foundation
Kevlin Henney
 
Functional Design Patterns (DevTernity 2018)
Functional Design Patterns (DevTernity 2018)Functional Design Patterns (DevTernity 2018)
Functional Design Patterns (DevTernity 2018)
Scott Wlaschin
 
Functions in c++
Functions in c++Functions in c++
Functions in c++
Maaz Hasan
 
Functional Programming Patterns (NDC London 2014)
Functional Programming Patterns (NDC London 2014)Functional Programming Patterns (NDC London 2014)
Functional Programming Patterns (NDC London 2014)
Scott Wlaschin
 
Handling of character strings C programming
Handling of character strings C programmingHandling of character strings C programming
Handling of character strings C programming
Appili Vamsi Krishna
 
Unit 5. Control Statement
Unit 5. Control StatementUnit 5. Control Statement
Unit 5. Control Statement
Ashim Lamichhane
 
Operator Overloading & Type Conversions
Operator Overloading & Type ConversionsOperator Overloading & Type Conversions
Operator Overloading & Type Conversions
Rokonuzzaman Rony
 
Function Pointer
Function PointerFunction Pointer
Function Pointer
Dr-Dipali Meher
 
The Functional Programming Toolkit (NDC Oslo 2019)
The Functional Programming Toolkit (NDC Oslo 2019)The Functional Programming Toolkit (NDC Oslo 2019)
The Functional Programming Toolkit (NDC Oslo 2019)
Scott Wlaschin
 
PHP slides
PHP slidesPHP slides
PHP slides
Farzad Wadia
 

What's hot (20)

Js set timeout & setinterval
Js set timeout & setintervalJs set timeout & setinterval
Js set timeout & setinterval
 
The Three Basic Selection Structures in C++ Programming Concepts
The Three Basic Selection Structures in C++ Programming ConceptsThe Three Basic Selection Structures in C++ Programming Concepts
The Three Basic Selection Structures in C++ Programming Concepts
 
FUNCTIONS IN PYTHON[RANDOM FUNCTION]
FUNCTIONS IN PYTHON[RANDOM FUNCTION]FUNCTIONS IN PYTHON[RANDOM FUNCTION]
FUNCTIONS IN PYTHON[RANDOM FUNCTION]
 
TP C++ : Correction
TP C++ : CorrectionTP C++ : Correction
TP C++ : Correction
 
Operators and expressions in c language
Scalable and Flexible Machine Learning With Scala @ LinkedIn

  • 1. Scalable and Flexible Machine Learning With Scala
  • 2. Who are we? @BigDataSc, @ccsevers
  • 3. Stuff you will see today …
      - Different types of data scientists - Comparison of different approaches to develop machine learning flows
      - Code
      - The five tool tool - Why Scala (and its ecosystem) is the best tool to develop machine learning flows (Hint: MapReduce is functional)
      - Some more code
      - Machine Learning examples - Real life (well … almost) examples of different machine learning problems
      - Even more code
  • 4. “Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation is not something that gets in the way of solving the problem – it is the problem!” DJ Patil, founding member of the LinkedIn data science team
  • 5. The data funnel
      - Real data is an awful, terrible mess
      - Cleaning often is a process of operating on data, excluding some data, bucketing data and calculating aggregates about the data
          Generate     map, flatMap, for
          Exclude      filter
          Bucket       group, groupBy, groupWith
          Aggregate    sum, reduce, foldLeft
      - These blocks form the basis of most data flows
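The four funnel stages map directly onto standard Scala collection methods. The following is a minimal sketch, not code from the talk: the `raw` records, the `parse` helper, and the threshold of 5 are all made up for illustration.

```scala
object DataFunnel {
  // Hypothetical raw records: "user,amount" strings, some of them malformed.
  val raw: Seq[String] = Seq("alice,10", "bob,oops", "alice,5", "carol,7", "bob,3")

  // Parse one line; malformed lines become None and drop out of the flow.
  def parse(line: String): Option[(String, Int)] =
    line.split(",") match {
      case Array(user, amount) if amount.nonEmpty && amount.forall(_.isDigit) =>
        Some((user, amount.toInt))
      case _ => None
    }

  val totals: Map[String, Int] =
    raw.flatMap(parse(_))                                        // generate
      .filter { case (_, amount) => amount >= 5 }                // exclude
      .groupBy(_._1)                                             // bucket
      .map { case (user, rows) => user -> rows.map(_._2).sum }   // aggregate
}
```

Each stage is an ordinary method call on an immutable collection, so the whole flow can be tried in the REPL on a handful of records before it goes anywhere near a cluster.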
  • 6. There are many ways to develop data flows
  • 8. The Mixer Word Count
        #wordcount.py
        from org.apache.pig.scripting import *

        @outputSchema("b: bag{ w: chararray}")
        def tokenize(words):
            return words.split(" ")

        script = """
        A = load './input.txt';
        B = foreach A generate flatten(tokenize((chararray)$0)) as word;
        C = group B by word;
        D = foreach C generate group, COUNT(B);
        store D into './wordcount' using AvroStorage("schema");
        """
        Pig.compile(script).bind().runSingle()

      Avro schema shown alongside the script:
        {"schema": {
          "type": "record",
          "name": "WordCount",
          "fields": [
            { "name": "word", "type": "string" },
            { "name": "count", "type": "int" }
          ]}}
  • 9. The Mixer Data Scientist
      - Too many occurrences of code inside strings
      - Three different languages inside a single file
      - User Defined Functions (UDFs) vs. language support
      - Not real Python, but Jython (which is missing some libraries)
      - This is just a simple word count!
  • 10. The Mixer Data Scientist
      - Pig is great at extract, transform, load (ETL)
      - … as long as you want to use a function that is already part of the included library
      - … or you get someone else to write it for you (hello, DataFu!)
      - Realistically you will need to maintain a Pig code base and a code base in some language which can run on the JVM
      - Pig Latin is a bit funky and is missing a lot of core programming language features
      - Pig Latin is interpreted, so you get (limited) type and syntax checking only at runtime
  • 12. The Expert Word Count
        hadoop fs -get input.txt input.txt
        cp /mnt/hadoop/input.txt ~/MyProjects/WordCount/input.txt

        #!/usr/bin/perl
        use strict;
        use warnings;

        my %count_of;
        while (my $line = <>) {  # read from file or STDIN
          foreach my $word (split /\s+/, $line) {
            $count_of{$word}++;
          }
        }
        print "All words and their counts: \n";
        for my $word (sort keys %count_of) {
          print "'$word': $count_of{$word}\n";
        }
        __END__
  • 13. The Scalable Expert - Hadoop Streaming
      - Lets you use any language you want.
      - Same issues as Java MapReduce with regard to multiple passes, complicated joins, etc.
      - Always reading from stdin and writing to stdout.
      - Easy to test out on local data:
          cat myfile.txt | mymapper.sh | sort | myreducer.sh
      - Actual data may not be as nice. No type checking on input or output can lead to problems.
      - The main reason to do this is so you can use a nice interpreted language to do your processing.
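The streaming contract (lines in, tab-separated key/value lines out, with a sort in between) can be mimicked locally with plain functions. This is only a sketch under that assumption: `mapper`, `reducer`, and `run` are hypothetical names, and a real streaming reducer would walk the sorted input sequentially from stdin rather than group it in memory.

```scala
object StreamingWordCount {
  // Mapper: one input line in, zero or more "word<TAB>1" lines out.
  def mapper(line: String): Seq[String] =
    line.trim.split("\\s+").filter(_.nonEmpty).map(w => s"$w\t1").toSeq

  // Reducer: takes "word<TAB>count" lines and sums the counts per word.
  // Lines that do not have exactly two tab-separated fields are skipped.
  def reducer(sortedLines: Seq[String]): Seq[String] =
    sortedLines
      .map(_.split("\t"))
      .collect { case Array(word, count) => (word, count.toInt) }
      .groupBy(_._1)
      .toSeq
      .sortBy(_._1)
      .map { case (word, pairs) => s"$word\t${pairs.map(_._2).sum}" }

  // Local stand-in for: cat file | mapper | sort | reducer
  def run(lines: Seq[String]): Seq[String] =
    reducer(lines.flatMap(mapper).sorted)
}
```

Note that everything crossing the mapper/reducer boundary is an untyped string, which is exactly the weakness the slide points out.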
  • 15. The Craftsman Word Count
        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }

          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
          }
        }
  • 16. The Craftsman Data Scientist
      - If you like Java it works fine
      - … until you want to do more than one pass, a complicated join or anything fancy.
      - Cascading solves many of these problems for you but it is still very verbose
  • 17. We need a better tool: a five tool tool!
      - http://en.wikipedia.org/wiki/Five-tool_player
      - http://en.wikipedia.org/wiki/Willie_Mays
  • 18. The Pragmatic Data Scientist
      - Agile - Iterates quickly
      - Productive - Uses the right tool for the right job
      - Correct - Tests as much as he can before the job is even submitted
      - Scalable - Can handle real world problems
      - Simple - Single language to represent Operations, UDFs and Data
  • 20. Agility - Data is complex
  • 21. Agility - Try before you buy
        scala> 1 to 10
        res0: Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

        scala> 1 until 10
        res1: Range = Range(1, 2, 3, 4, 5, 6, 7, 8, 9)

        scala> res0.slice(3, 5)
        res3: scala.collection.immutable.IndexedSeq[Int] = Vector(4, 5)

        scala> res0.groupBy(_ % 2)
        res4: Map[Int, IndexedSeq[Int]] = Map(1 -> Vector(1, 3, 5, 7, 9), 0 -> Vector(2, 4, 6, 8, 10))
  • 22. Productivity - Don't reinvent the wheel
  • 23. Productivity - Have the work done for you
      - Python Collections Operators: map, reduce, filter, sum, min/max
      - Scala Collections Operators: foreach, map, flatMap, collect, find, takeWhile, dropWhile, filter, withFilter, filterNot, splitAt, span, partition, groupBy, forall, exists, count, fold, reduce, sum, product, min/max
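A few of the less familiar Scala operators from the list are easier to grasp on a concrete value. The sketch below uses only standard library methods; the list `xs` is an arbitrary example.

```scala
object CollectionOps {
  val xs: List[Int] = List(1, 2, 3, 10, 4, 11)

  // span: longest prefix satisfying the predicate, then everything after it.
  val (small, rest) = xs.span(_ < 5)

  // partition: split by predicate over the whole list, preserving order.
  val (evens, odds) = xs.partition(_ % 2 == 0)

  // splitAt: split at a position.
  val (firstTwo, others) = xs.splitAt(2)

  // collect: filter and map in one pass using a partial function.
  val doubledBig: List[Int] = xs.collect { case x if x >= 10 => x * 2 }

  val allPositive: Boolean = xs.forall(_ > 0)
  val hasBig: Boolean = xs.exists(_ > 10)
  val product: Int = xs.product
}
```

The difference between `span` and `partition` is the one that most often surprises: `span` stops at the first element that fails the predicate, so the 4 after the 10 stays in the second half.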
  • 24. Correctness - how to keep your sanity
  • 25. Scalability - works on more than your machine
      - Integrates with Hadoop (more than just streaming)
      - Has the support of scalable libraries
      - Parallel by design - not just for M/R flows
  • 26. Simplicity
      - Paco Nathan, Evil Mad Scientist, Concurrent Inc., @pacoid, says:
          - “[Scalding] code is compact, simple to understand”
          - “nearly 1:1 between elements of conceptual flow diagram and function calls”
          - “Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process”
      - Scala is a functional tool for a fundamentally functional job
  • 28. Let’s count some words
      - This is the “Hello, World!” of anything tangentially related to Hadoop.
      - Let’s try it in Scala first without any Hadoop stuff.
          val myLines : Seq[String] = ... // get some stuff
          val myWords = myLines.flatMap(w => w.split("\\s+"))
          val myWordsGrouped = myWords.groupBy(identity)
          val countedWords = myWordsGrouped.mapValues(x=>x.size)
      - Now write out the words somehow
      - The same thing as a single chained expression:
          val countedWords = myLines.flatMap(_.split("\\s+"))
                                    .groupBy(identity)
                                    .mapValues(_.size)
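For readers who want to paste the local word count into a REPL, here is a self-contained variant. It is a sketch rather than the slide's exact code: the `LocalWordCount` object and `countWords` name are invented, empty tokens are filtered out, and `map` replaces `mapValues` because `mapValues` returns a lazy view (and is deprecated) from Scala 2.13 onward.

```scala
object LocalWordCount {
  // Split every line on whitespace, group identical words, count each group.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)          // a leading space would otherwise yield an empty token
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

Calling `LocalWordCount.countWords(Seq("to be or", "not to be"))` gives a map with `to` and `be` counted twice and `or` and `not` once.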
  • 29. Let’s count a lot of words
      - I’ve gone to the trouble of rewriting this example to run in Hadoop.
      - Here it is:
          val myLines : TypedPipe[String] = TextLine(args("input"))
          val myWords = myLines.flatMap(w => w.split("\\s+"))
          val myWordsGrouped = myWords.groupBy(identity)
          val countedWords = myWordsGrouped.mapValueStream(x => Iterator(x.size))
      - We can make this even better.
          val countedWords = myWordsGrouped.size
          countedWords.write(TypedTsv[(String,Long)](output))
  • 30. Something for nothing
      - Other people have already done the hard work to make the previous example run
      - The previous example is using Scalding, a Scala library to write (mainly) Hadoop MapReduce jobs.
      - https://github.com/twitter/scalding
      - It even has its own Twitter account, @scalding
      - Created by:
          - Avi Bryant @avibryant
          - Oscar Boykin @posco
          - Argyris Zymnis @argyris
      - Tweet them now and tell them how awesome it is
      - … I’ll wait
  • 31. Side by side comparison of local and Hadoop
      Hadoop (Scalding):
          val myWords = myLines.flatMap(w => w.split("\\s+"))
          val myWordsGrouped = myWords.groupBy(identity)
          val countedWords = myWordsGrouped.size
      Local (Scala collections):
          val myWords = myLines.flatMap(w => w.split("\\s+"))
          val myWordsGrouped = myWords.groupBy(identity)
          val countedWords = myWordsGrouped.mapValues(x=>x.size)
      There are some small differences, mainly due to how the underlying Hadoop process needs to happen.
  • 32. Why does this work?
      - Scala has support for embedded domain specific languages (DSLs)
      - Scalding includes a couple of DSLs for specifying Cascading (and by extension Hadoop) workflows.
      - Info about Cascading: http://www.cascading.org/
      - One of the Scalding DSLs, the Typed one, is designed to be very close to the standard Scala collections API
      - It’s not a perfect mapping due to how Cascading and Hadoop work, but in general it is very easy to write your code locally, change a couple of small bits, and have it run on a Hadoop cluster
      - Scalding also has a local mode if you want the syntactic sugar without fussing with Hadoop
  • 33. DSLs for everyone!
      - We’re showing you Scalding in this talk, but there are others that are similar.
          - Scoobi: https://github.com/NICTA/scoobi
          - Scrunch: https://github.com/cloudera/crunch/tree/master/scrunch
      - All three attempt to make code written against Scala collections work (almost) seamlessly in Hadoop.
      - More on DSLs: http://www.scala-lang.org/node/1403
      - Some guts:
  • 34. Fields based DSL
      From com.twitter.scalding.Dsl
        /**
         * This object has all the implicit functions and values that are used
         * to make the scalding DSL.
         *
         * It's useful to import Dsl._ when you are writing scalding code outside
         * of a Job.
         */
        object Dsl extends FieldConversions with TupleConversions
            with GeneratedTupleAdders with java.io.Serializable {
          implicit def pipeToRichPipe(pipe : Pipe) : RichPipe = new RichPipe(pipe)
          implicit def richPipeToPipe(rp : RichPipe) : Pipe = rp.pipe
        }
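The two implicit conversions above are an instance of the general "enrich an existing class" pattern. The sketch below reproduces the pattern with stand-in classes: this `Pipe` and `RichPipe` are toy types invented for illustration and share only their names with the Cascading and Scalding classes.

```scala
object EnrichSketch {
  import scala.language.implicitConversions

  // Stand-in for a library class we cannot modify.
  final class Pipe(val name: String)

  // Wrapper that carries the extra, DSL-style methods.
  final class RichPipe(val pipe: Pipe) {
    def rename(newName: String): Pipe = new Pipe(newName)
    def describe: String = s"Pipe(${pipe.name})"
  }

  implicit def pipeToRichPipe(pipe: Pipe): RichPipe = new RichPipe(pipe)
  implicit def richPipeToPipe(rp: RichPipe): Pipe = rp.pipe

  // Neither rename nor describe exists on Pipe; the compiler inserts
  // pipeToRichPipe at each call, so the chain reads as if they did.
  def demo: String = new Pipe("words").rename("tokens").describe
}
```

Because each rich method hands back a plain `Pipe`, the result can still be passed to any code that expects the original library type, which is what lets a DSL sit on top of an existing Java API.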
  • 35. Typed DSL
      From com.twitter.scalding.TDsl
        /** implicits for the type-safe DSL
         * import TDsl._ to get the implicit conversions from Grouping/CoGrouping to Pipe,
         * to get the .toTypedPipe method on standard cascading Pipes.
         * to get automatic conversion of Mappable[T] to TypedPipe[T]
         */
        object TDsl extends Serializable with GeneratedTupleAdders {
          implicit def pipeTExtensions(pipe : Pipe) : PipeTExtensions = new PipeTExtensions(pipe)
          implicit def mappableToTypedPipe[T](mappable : Mappable[T])
              (implicit flowDef : FlowDef, mode : Mode, conv : TupleConverter[T]) : TypedPipe[T] = {
            TypedPipe.from(mappable)(flowDef, mode, conv)
          }
        }
  • 36. Algebird - It’s like algebra and a bird
      - We did something fancy in the previous example:
          val countedWords = myWordsGrouped.size
          val countedWords = myWordsGrouped.mapValues(x => 1L).sum
          val countedWords = myWordsGrouped.mapValues(x => 1L).reduce(implicit mon: Monoid[Long])((l,r) => mon.plus(l,r))
      - Scalding uses Algebird extensively to make your life easier.
      - Algebird can also be used outside of Scalding with no trouble.
      - Algebird has your favorite things like monoids, monads, bloom filters, count-min sketches, hyperloglogs, etc.
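The idea behind the three equivalent lines is that counting is just summing a `1L` per element, and summing works for any type with a zero and an associative plus. The following sketch hand-rolls a tiny `Monoid` type class to show this; it is a hypothetical stand-in and not Algebird's actual `Monoid` definition.

```scala
object MonoidSketch {
  // Hypothetical minimal type class: an identity element and an associative plus.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Maps form a monoid whenever their values do: merge keys, add values.
  implicit def mapMonoid[K, V](implicit vm: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      def zero: Map[K, V] = Map.empty
      def plus(l: Map[K, V], r: Map[K, V]): Map[K, V] =
        r.foldLeft(l) { case (acc, (k, v)) =>
          acc.updated(k, vm.plus(acc.getOrElse(k, vm.zero), v))
        }
    }

  def sumAll[T](xs: Seq[T])(implicit m: Monoid[T]): T =
    xs.foldLeft(m.zero)(m.plus)

  // Counting is summing a 1L per element.
  def size[T](xs: Seq[T]): Long = sumAll(xs.map(_ => 1L))

  // Summing maps merges them key by key.
  val merged: Map[String, Long] =
    sumAll(Seq(Map("x" -> 1L), Map("x" -> 2L, "y" -> 5L)))
}
```

Associativity is what matters for MapReduce: partial sums can be computed on the map side and combined in any grouping on the reduce side without changing the result.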
  • 37. Counting words with some extra information
      - Sometimes we want to know some information about the contexts that words occurred in. At eBay, this is often the category that a term appeared in.
      - Let’s count words and calculate the entropy of the category distribution for each word.
          - If you’re unfamiliar with this type of entropy just think of it as a measure of how concentrated the distribution is.
          - If you really like formulas it is: $H(X) = -\sum_i p(x_i) \log p(x_i)$
      - http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
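The entropy formula is short enough to write as a plain function over raw counts. This is a minimal sketch with an invented `Entropy` object; it uses the natural logarithm and skips zero counts so that `log(0)` is never evaluated.

```scala
object Entropy {
  // Shannon entropy (natural log) of a distribution given as raw counts.
  def entropy(counts: Iterable[Long]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p)
    }.sum
  }
}
```

A word seen in a single category has entropy 0 (fully concentrated), while a word spread evenly over two categories has entropy ln 2; the more categories a word is spread across, the higher the value.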
  • 38. More code
        case class MyAvroOutput(word: String, count: Long, entropy: Double) extends AvroRecord

        TypedTsv[(String,Int)]  // source path not shown on the slide
          .flatMap{ case (line, cat) => line.split("\\s+").map(x => (x, Map(cat -> 1L))) }
          .group
          .sum
          .map{ case (word, dist) =>
            val total: Double = dist.values.sum
            val entropy = (-1) * dist.values.map{ count => (count/total) * math.log(count/total) }.sum
            MyAvroOutput(word, total.toLong, entropy)
          }
          .write(PackedAvroSource[MyAvroOutput](output))
      - Math is great
  • 39. Machine Learning Examples: the reason why you are here
  • 40. Classification case study: how much should we charge for Titanic insurance?
  • 41. Titanic II case study
      - We want to sell life insurance to passengers of Titanic II
      - All we have is data from Titanic I
      - We have to be able to explain why we charge the prices we do (damn regulators!)
      - Image: http://commons.wikimedia.org
  • 42. Titanic I Data
      - Cabin class - e.g. 1st, 2nd, 3rd ..
      - Name - String
      - Age - Integer
      - Embark place - String
      - Destination - String
      - Room - Integer
      - Ticket - Integer
      - Gender - Male or Female
  • 43. Titanic Model (image: http://www.dtreg.com/)
  • 44. Classifier code
        object Titanic {
          def main(args: Array[String]) = {
            // parse data
            val reader = new CSVReader(new FileReader("src/main/data/titanic.csv"))
            val passengers = reader.readAll.tail.map(Passenger(_))
            val instances = passengers.map(_.getInstance).toSet
            // build tree
            val treeBuilder = new TreeBuilder
            val tree = treeBuilder.buildTree(instances)
            // print tree
            tree.dump(System.out)
          }
        }
  • 45. Titanic Model (image: http://www.dtreg.com/)
  • 46. Clustering case study: let’s cluster some eBay keywords.
  • 47. Motivation
      - eBay, like any large site, has a massive number of unique queries every day
      - Identifying groups of queries based on user behavior might help us to understand the individual queries better
      - For queries we are unsure of we can even try and match them into a cluster that contains queries we know a lot about.
      - We can use behavioral things like:
          - number of searches
          - number of clicks
          - number of subsequent bids, buys
          - number of exits
          - etc
  • 48. Let’s use Mahout
      - Apache Mahout, http://mahout.apache.org/, @ApacheMahout, is a powerful machine learning and data mining library that works with Hadoop.
      - It has a ton of great stuff in it, but many of the drawbacks of using Java MapReduce apply.
      - It uses some proprietary data formats (is your data in VectorWritable SequenceFiles?)
      - Luckily for us, there are some nice things that work as standalone pieces.
      - Coming in release 0.8, there is an excellent single pass k-means clustering algorithm we can use.
  • 49. Let’s use Mahout, inside Scalding
        lazy val clust = new StreamingKMeans(
          new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
          args("sloppyclusters").toInt,
          (10e-6).asInstanceOf[Float])
        var count = 0;
        val sloppyClusters = TextLine(args("input"))
          .map{ str =>
            val vec = str.split("\t").map(_.toDouble)
            val cent = new Centroid(count, new DenseVector(vec))
            count += 1
            cent
          }
          .toPipe('centroids)
          // This won't work with the current build, coming soon though
          .unorderedFoldTo[StreamingKMeans,Centroid]('centroids->'clusters)(clust){(cl,cent) => cl.cluster(cent); cl}
          .toTypedPipe[StreamingKMeans](Dsl.intFields(Seq(0)))
          .flatMap(c => c.iterator.asScala.toIterable)
  • 50. Let’s use Mahout, inside Scalding
        val finalClusters = sloppyClusters.groupAll
          .mapValueStream{ centList =>
            lazy val bclusterer = new BallKMeans(
              new BruteSearch(new EuclideanDistanceMeasure),
              args("numclusters").toInt,
              100)
            bclusterer.cluster(centList.toList.asJava)
            bclusterer.iterator.asScala
          }
          .values
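For readers new to k-means, the core loop is small enough to prototype on local collections. The sketch below is plain Lloyd's algorithm in standard Scala, swapped in for illustration only: it is not Mahout's `StreamingKMeans` or `BallKMeans`, and every name in it is invented.

```scala
object KMeansSketch {
  type Point = Vector[Double]

  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / points.size).toVector

  // One Lloyd iteration: assign each point to its nearest centroid,
  // then move each centroid to the mean of the points assigned to it.
  // A centroid that attracts no points stays where it is.
  def step(points: Seq[Point], centroids: Seq[Point]): Seq[Point] = {
    val assigned = points.groupBy { p =>
      centroids.indices.minBy(i => squaredDistance(p, centroids(i)))
    }
    centroids.indices.map(i => assigned.get(i).map(mean).getOrElse(centroids(i)))
  }

  def run(points: Seq[Point], init: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(init)((cs, _) => step(points, cs))
}
```

The assignment step is a `groupBy` and the update step is an aggregate per group, so the algorithm has the same bucket-then-aggregate shape as the word count.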
  • 51. Results
      - These are primarily eBay head queries. Remember that the clustering algorithm knows nothing about the text in the query.
      - Sample groups:
          - chanel, tory burch, diamond ring, kathy van zeeland handbags, ...
          - ipad 4th generation, samsung galaxy s iii, iphone 4 s, nexus 4, ipad mini, ...
          - kohls coupons, lowes coupons
          - jcrew, cole haan, diesel, banana republic, gucci, burberry, brooks brothers, …
          - ferrari, utility trailer, polaris ranger, porsche 911, dump truck, bmw m3, chainsaw, rv, chevelle, vw bus, dodge charger, ...
          - paypal account, ebay.com, apple touch icon precomposed.png, paypal, undefined, ps3%2520games, michael%2520kors
  • 52. Clustering Takeaway  There are some excellent libraries that exist, and even fit the functional model  Scala and Scalding will help you work around the rough edges and integrate them into your data flow, rather than having to create new data flows  Being able to prototype locally and in the Scala REPL saves massive amounts of developer time 52
  • 53. Matrix API case study: Using LinkedIn endorsement data to rank Scala experts
  • 55. Page Rank Algorithm http://commons.wikimedia.org
  • 56. Prepare Data
      def prepareData = {
        // read endorsements and transform to edges
        val ends = readFile[Endorsement]("endorsements")
          .filter(_.skill == "Scala")
          .map(e => (e.sender, e.recipient, 1))
          .write(TSV("edges"))
      }
      def getDominantEigenVector = { … } // outputs to "ranks" (memberId, rank)
      def getMembers = {
        // get Bay Area members
        val members = readLatest[Member]("members")
          .filter(_.getRegionCode == 84)
          .groupBy(_.getMemberId.toLong)
        // join ranks and members
        readFile[Ranks]("ranks").withReducers(10).join(members).toTypedPipe
          .map { case (id, ((_, rank), m)) =>
            (rank, m.getMemberId, m.getFirstName, m.getLastName, m.getHeadline) }
          .groupAll.sortBy(_._1).reverse.values
          .write(TextLine("talk/scalaRanks"))
      }
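
The slide leaves the body of `getDominantEigenVector` elided, and this note does not try to reconstruct it. As a rough illustration of what the ranking step computes, here is a plain-Scala power iteration for PageRank over an in-memory edge list of (sender, recipient) pairs. It is a sketch under stated assumptions, not the authors' Scalding Matrix API code: the object name, the `damping` and `iterations` parameters, and the even redistribution of rank from nodes without outgoing edges are all choices made here.

```scala
object PageRankSketch {
  // edges: (sender, recipient) pairs; an endorsement points from sender to recipient.
  def pageRank(edges: Seq[(Long, Long)],
               damping: Double = 0.85,
               iterations: Int = 50): Map[Long, Double] = {
    val nodes = edges.flatMap { case (s, r) => Seq(s, r) }.distinct
    val n = nodes.size
    // Outgoing targets per sender.
    val outLinks: Map[Long, Seq[Long]] =
      edges.groupBy(_._1).map { case (s, es) => s -> es.map(_._2) }

    // Start from the uniform distribution.
    val initial: Map[Long, Double] = nodes.map(node => node -> 1.0 / n).toMap

    (1 to iterations).foldLeft(initial) { (ranks, _) =>
      // Rank held by nodes with no outgoing edges is spread evenly over all nodes.
      val danglingMass = nodes.filterNot(node => outLinks.contains(node)).map(node => ranks(node)).sum
      // Every sender passes its rank in equal shares along its outgoing edges.
      val received: Map[Long, Double] = outLinks.toSeq
        .flatMap { case (s, targets) => targets.map(t => t -> ranks(s) / targets.size) }
        .groupBy(_._1)
        .map { case (t, shares) => t -> shares.map(_._2).sum }
      nodes.map { node =>
        node -> ((1 - damping) / n + damping * (received.getOrElse(node, 0.0) + danglingMass / n))
      }.toMap
    }
  }
}
```

For example, `PageRankSketch.pageRank(Seq((1L, 3L), (2L, 3L), (3L, 1L)))` ranks member 3 highest, since both other members endorse it, and member 2 lowest, since nobody endorses it.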
  • 57. Matrix API
      - mat.mapValues( func ) : Matrix
      - mat.filterValues( func ) : Matrix
      - mat.getRow( ind ) : RowVector
      - mat.reduceRowVectors{ f } : RowVector
      - mat.sumRowVectors : RowVector
      - mat.mapRows{ func } : Matrix
      - mat.topRowElems( k ) : Matrix
      - mat.rowL1Normalize : Matrix
      - mat.rowL2Normalize : Matrix
      - rowMeanCentering : Matrix
      - rowSizeAveStdev : Matrix
      - matrix1 * matrix2 : Matrix
      - matrix / scalar(Scalar) : Matrix
      - elemWiseOp( mat2 ){ func }
      - mat1.hProd( matrix2 ) : Matrix
      - mat1.zip( mat2/r/c ) : Matrix
      - matrix.nonZerosWith( sclr )
      - matrix.trace : Scalar
      - matrix.sum : Scalar
      - matrix.transpose : Matrix
      - mat.diagonal : DiagonalMatrix
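
To make a few of these operations concrete, the sketch below mimics `rowL1Normalize`, `transpose` and `sum` on a small in-memory sparse matrix. It is plain Scala and does not use the Scalding Matrix API; the `Sparse` type alias and the object name are assumptions for illustration, and the semantics shown for row L1 normalization (dividing each row by the sum of its absolute values) are the usual definition rather than something stated on the slide.

```scala
object SparseMatrixSketch {
  // Sparse matrix as (row, col) -> value; absent entries are treated as zero.
  type Sparse = Map[(Int, Int), Double]

  // Divide every entry by the L1 norm (sum of absolute values) of its row.
  // Rows whose norm is zero are dropped.
  def rowL1Normalize(m: Sparse): Sparse = {
    val rowNorms: Map[Int, Double] = m
      .groupBy { case ((r, _), _) => r }
      .map { case (r, entries) => r -> entries.values.map(v => math.abs(v)).sum }
    m.collect { case ((r, c), v) if rowNorms(r) != 0.0 => (r, c) -> v / rowNorms(r) }
  }

  // Swap row and column indices.
  def transpose(m: Sparse): Sparse = m.map { case ((r, c), v) => (c, r) -> v }

  // Sum of all stored entries.
  def sum(m: Sparse): Double = m.values.sum
}
```

Row L1 normalization is the step that would turn an endorsement adjacency matrix into a row-stochastic transition matrix, which is why it is a natural building block for a Page Rank computation.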
  • 58. Endorsements Page Rank: Time for Results!
  • 61. Only one slide left! Summary
  • 62. Stuff you have seen today …
      - There are many ways to develop machine learning programs, and none of them is perfect
      - Scala, which reflects the years of language evolution since Java's invention, and Scalding, which does the same for vanilla MapReduce, are a much better alternative
      - Machine learning is fun and not necessarily complicated

Editor's Notes

  1. I have been in this room on 3 special occasions: just this last Friday, when some of the most influential people in our industry, like Jeff Weiner and Reid Hoffman, judged our internal mini startup contest called Incubator; when Bryan Stevenson, the person who got the longest standing ovation at TED, gave a private talk to LinkedIn employees; and when we celebrated our successful year with everyone getting new iPads. And let me tell you, I have never seen this room so full. So thank you all for coming, and I promise you that this talk won’t be nearly as exciting as those occasions.
  2. There are a lot of ways to develop MR flows; when I came to LinkedIn I saw 3 different patterns …
  3. Simple here is as opposed to complex (see Rich Hickey’s talk “Simple Made Easy”).
  4. Simple is the key message: if you had to take away a single point from the entire talk, this would be it.
  5. ----- Meeting Notes (3/11/13 17:16) ----- This is the Facebook user object, well, part of it. Avro schemas can get ridiculously big.