Scala and Hadoop @ eBay

What we will cover
• Polymorphic Function Values
• Higher Kinded/Recursive Types
• Cokleisli Star Operators
• Scala Macros

I have no clue what those things are

What we will ACTUALLY cover
• Why Scala
• Why Hadoop
• How we use Scala with Hadoop
• Lots of CODE!

Why Scala?
• JVM
• **Functional**
• Expressive
• How to convince your boss?

Someone on Hacker News said
Scala sucks
• Compile Times
• You changed List again?
• Complicated
• Leads to Madness

Madness?
trait Lazy[+T, P] {
  var creationParameters: P = None.asInstanceOf[P]
  lazy val lazyThing: Either[Throwable, T] =
    try { Right(create(creationParameters)) }
    catch { case e: Throwable => Left(e) }
  def get(createParams: P): Either[Throwable, T] = {
    creationParameters = createParams
    lazyThing
  }
  def create(params: P): T
}

Madness?
def getSingleInstance[T, P](params: P)(implicit lazyCreator: Lazy[T, P]): T = {
  lazyCreator.get(params) match {
    case Right(successValue) => successValue
    case Left(exception) => throw new StackException(exception)
  }
}

This is used by ONE client class
• Show some self-restraint
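
For comparison, a minimal sketch of what a single client class might use instead: a plain `lazy val` wrapped in `scala.util.Try`. The names `ExpensiveThing` and `Client` are hypothetical stand-ins, not eBay code, and `Try` only captures non-fatal exceptions, whereas the trait above catches every `Throwable`.

```scala
import scala.util.Try

object RestraintSketch {
  // Hypothetical stand-in for whatever the one client class needs to build.
  final case class ExpensiveThing(config: String)

  // Hypothetical client: the creation parameter is a constructor argument,
  // so no mutable field and no implicit lookup are needed.
  class Client(config: String) {
    // Evaluated at most once, on first access; a failure is captured, not thrown.
    lazy val thing: Try[ExpensiveThing] = Try {
      require(config.nonEmpty, "config must not be empty")
      ExpensiveThing(config)
    }
  }
}
```

The caller decides what to do with a failure, for example `client.thing.get` to rethrow or `client.thing.toOption` to ignore it.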

Hadoop
• void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
• void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
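
Those two signatures are the whole contract: map emits key/value pairs, the framework groups them by key, and reduce folds each group. A rough in-memory sketch on plain Scala collections follows; it uses no Hadoop classes, and `MiniMapReduce` plus the sample purchase records are made up for illustration.

```scala
object MiniMapReduce {
  // Runs map, then a shuffle (group by key), then reduce, all in memory.
  def run[In, K, V, Out](input: Seq[In])(
      mapper: In => Seq[(K, V)])(
      reducer: (K, Seq[V]) => Out): Map[K, Out] =
    input
      .flatMap(mapper) // map phase: emit (key, value) pairs
      .groupBy(_._1)   // shuffle: collect all pairs that share a key
      .map { case (key, pairs) => key -> reducer(key, pairs.map(_._2)) } // reduce phase
}

object MiniMapReduceDemo {
  // Hypothetical input: (userId, itemPrice) purchase records.
  val purchases: Seq[(String, Long)] = Seq(("alice", 10L), ("bob", 5L), ("alice", 7L))

  // Total spend per user: the mapper re-emits each record, the reducer sums.
  val spendPerUser: Map[String, Long] =
    MiniMapReduce.run(purchases)(p => Seq(p._1 -> p._2))((_, prices) => prices.sum)
}
```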

BIG NUMBERS
• Petabytes of data
• 1k+ node Hadoop cluster
• Multi-billion dollar merchandising business
• Lots of users and items 

How should I use Map Reduce?
• Raw map reduce 
• Pig
• Hive
• Cascading
• Scoobi
• Scalding 

Decision Time
• “And every one that heareth these sayings of
mine (great software engineers of the past),
and doeth them not, shall be likened unto a
foolish man, which built his house upon the
sand.”
• “And the rain descended, and the floods
came, and the winds blew, and beat upon that
house; and it fell: and great was the fall of it.”

I believe!
• Scalding combines the best of Pig and
Cascading

Good Pig
A = LOAD 'input' AS (x, y, z);
B = FILTER A BY x > 5;
DUMP B;
C = FOREACH B GENERATE y, z;
STORE C INTO 'output';
-- joins and GROUP BY are supported too
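
For readers who do not know Pig, roughly the same filter-and-project flow on an in-memory Scala collection. The `Row` type and the sample values are assumptions for illustration; the Pig script itself declares no types.

```scala
object PigFlowSketch {
  // Hypothetical row shape for the (x, y, z) relation.
  final case class Row(x: Int, y: String, z: String)

  // A = LOAD 'input' AS (x, y, z);  (here: a hard-coded sample)
  val a: Seq[Row] = Seq(Row(3, "y1", "z1"), Row(8, "y2", "z2"), Row(6, "y3", "z3"))

  // B = FILTER A BY x > 5;
  val b: Seq[Row] = a.filter(_.x > 5)

  // C = FOREACH B GENERATE y, z;
  val c: Seq[(String, String)] = b.map(r => (r.y, r.z))
}
```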

Bad Pig
DEFINE NV_terms `perl nv_terms2.pl`
ship('$scripts/nv_terms2.pl');
i5 = stream i4 through NV_terms as (leafcat:chararray,
name:chararray, name1:chararray);
i7 = foreach i5 generate leafcat,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as
name,
com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as
name1;

Other Pig Issues
• Scheduling and DAG creation

Cascading Rocks!
• What is it?
• Supports large workflows and reusable
components
– DAG generation
– Parallel Executions

Cascading code in Scala
masterPipe = new FilterURLEncodedStrings(masterPipe, "sqr")
masterPipe = new FilterInappropriateQueries(masterPipe, "sqr")
masterPipe = new GroupBy(masterPipe,
  CFields("user_id", "epoch_ts", "sqr"), sortFields)

Someone should really code review this

Cascading Issues
This page intentionally left blank

Scalding Time
import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split a piece of text into individual words.
  def tokenize(text: String): Array[String] = {
    // Lowercase each word and remove punctuation.
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
  }
}
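
To see what the job computes without a cluster, here is a sketch of the same tokenize-then-count logic on plain Scala collections. It does not use Scalding; `WordCountSketch` and any sample lines fed to it are illustrative only.

```scala
object WordCountSketch {
  // Same approach as the job's tokenize: lowercase, strip punctuation,
  // then split on runs of whitespace.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")

  // In-memory analogue of flatMap('line -> 'word) followed by groupBy('word) { _.size }.
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => tokenize(line).toSeq)
      .filter(_.nonEmpty) // split can yield an empty token for blank or space-led lines
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

For example, `WordCountSketch.count(Seq("Hello, World!", "hello Hadoop"))` counts `hello` twice and `world` and `hadoop` once each.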

Scalding @ eBay
• Boilerplate reduction
• Extensibility
• New hires

Practical Scalding Use
• Pimp my pimp
• Code generated boilerplate
• Cascades
• Traps
• Testing!

class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate {
  implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe)

  class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with CommonFunctions

  trait CommonFunctions {
    import Dsl._
    import RichPipe.assignName
    def pipe: Pipe
    def reallyComplexFunction(field: Fields, param: Long) = {
      // mind blowing code here
    }
  }
}
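
The implicit conversion in this job is the enrich-my-library pattern: wrap a type you do not own so that shared helpers read like native methods. A self-contained sketch on a plain `Seq` follows, with made-up names (`Interaction`, `RichInteractions`, and an invented body for `countUserInteractions`) standing in for pipes and eBay's helpers.

```scala
object EnrichSketch {
  // Hypothetical record; a real job works on Cascading pipes and fields.
  final case class Interaction(userId: String, itemId: String)

  // Implicit class: the compiler wraps any Seq[Interaction] on demand,
  // much as pipe2eBayRichPipe wraps a Pipe.
  implicit class RichInteractions(val rows: Seq[Interaction]) {
    // Shared helper, written once and reused by every caller.
    def countUserInteractions: Map[String, Int] =
      rows.groupBy(_.userId).map { case (user, hits) => user -> hits.size }
  }

  val sample: Seq[Interaction] =
    Seq(Interaction("u1", "i1"), Interaction("u1", "i2"), Interaction("u2", "i1"))

  // Reads as if Seq had the method all along.
  val counts: Map[String, Int] = sample.countUserInteractions
}
```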

CheckoutTransactionsPipe(/* default path logic */)
  .project(/* fields I need */)
  .countUserInteractions(/* params */)
  .doScoreCalculation(/* params */)
  .doConfidenceCalculation(/* params */)
Seems a bit too readable for Scala

Collaborative Filtering
• Typically hard to run on large datasets

Structured Data Importance
• Do people shop by brand?
[Chart: Supply, Handbags and Purses (axis range 0 to 1.2)]

Markov Chains
• Investigation of buying patterns in ~50 lines of
code
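// x is defined elsewhere (presumably one user's purchase history)
val purchases = "firsttime" :: x.take(500).toList
val pairs = purchases zip purchases.tail
val grouped = pairs.groupBy(p => p._1.toString + "-" + p._2.toString)
val sizes = grouped.map { case (transition, occurrences) =>
  transition -> occurrences.size
}.toList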
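
A runnable sketch of the same idea, taken one step further to transition probabilities. The function names are invented for illustration, and the slide does not show where its `x` comes from, so the input here is simply a sequence of purchased category names.

```scala
object MarkovSketch {
  // Counts each (from, to) transition in one user's purchase sequence,
  // starting from a synthetic "firsttime" state as on the slide.
  def transitionCounts(history: Seq[String]): Map[(String, String), Int] = {
    val purchases = "firsttime" +: history
    purchases.zip(purchases.tail)
      .groupBy(identity)
      .map { case (transition, occurrences) => transition -> occurrences.size }
  }

  // Normalises counts per source state, giving P(to | from).
  def transitionProbabilities(
      counts: Map[(String, String), Int]): Map[(String, String), Double] = {
    val totals: Map[String, Int] = counts
      .groupBy { case ((from, _), _) => from }
      .map { case (from, group) => from -> group.values.sum }
    counts.map { case ((from, to), n) => (from, to) -> (n.toDouble / totals(from)) }
  }
}
```

With the history `Seq("shoes", "bags", "shoes", "hats")`, the state `shoes` is followed once by `bags` and once by `hats`, so each of those transitions gets probability 0.5.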

Mining Search Queries
• 20+ billion user queries - give me the top ones
per user
[Diagram: De-Dupe, Rank, Validate, Sample Data]
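
A small in-memory sketch of the de-dupe and rank steps behind "top queries per user". The `(userId, query)` record shape, the normalisation rule used for de-duping, and the name `topPerUser` are all assumptions; the production job runs on Hadoop, not on a `Seq`.

```scala
object TopQueriesSketch {
  // Returns each user's n most frequent queries, most frequent first.
  // Ties are broken alphabetically so the result is deterministic.
  def topPerUser(log: Seq[(String, String)], n: Int): Map[String, Seq[String]] =
    log
      // De-dupe (assumed rule): treat case and surrounding-space variants as one query.
      .map { case (user, query) => (user, query.trim.toLowerCase) }
      .groupBy(_._1)
      .map { case (user, rows) =>
        val ranked = rows.map(_._2)
          .groupBy(identity)
          .toSeq
          .map { case (query, hits) => (query, hits.size) }
          .sortBy { case (query, count) => (-count, query) } // rank by frequency
          .take(n)
          .map(_._1)
        user -> ranked
      }
}
```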

Automation
[Diagram: Hadoop Proxy, Batch Database Load, Machines, Cassandra, Jenkins, MySQL, Mongo]
