Scalding: Reaching Efficient MapReduce

Efficient
MapReduce
using Scalding
Neta Barkay | Data Scientist, LivePerson | December

Outline

Scalding - Scala library that makes it easy
to write MapReduce jobs in Hadoop.

We will talk about:
• MapReduce paradigm
• Writing Scalding jobs
• Improving jobs performance
• Typed API, testing

Getting a glimpse of some Scalding code

class TopKJob(args : Args) extends Job(args){
val exclusions = Tsv(args("exclusions"), 'exVisitorId)
Tsv(args("input"), visitScheme)
.filter('country){country : String => country == "Israel"}
.leftJoinWithTiny('visitorId -> exVisitorId, exclusions)
.filter('exVisitorId){isEx : String => isEx
null}
.groupBy('section){_.sortWithTake(visitScheme -> 'top,
)(biggerSale)}
.flattenTo[visitType]{'top -> visitScheme}
.write(Tsv(args("output"), visitScheme))

}

Asking big data questions

Which questions will you ask?
What analysis will you do?
A possible approach:

Use the outliers to improve your product
• Most popular products on your site
• Visits that ended with the highest sale value

Asking big data questions

Which questions will you ask?
What analysis will you do?
A possible approach:

Use the outliers to improve your product
• Most popular products on your site
• Visits that ended with the highest sale value

That is the problem of finding the top elements in the data.

Data analysis problem

Top elements problem
Input
•

Data – arranged in records

•

K – number of top elements or p – percentage of top
elements to output

•

Order function – some ordering on the records

Output
•

K top records of our data or top p percentage according to
the order function

Algorithm flow

Read input records

Top K elements
problem

Input =
13, 55, 8, 2, 34, 89, 21, 8
K=5

Sort records, take top K

Output top records

Output =
89, 55, 34, 21, 13

Algorithm flow

Read input records

Top K elements
problem

Input =
13, 55, 8, 2, 34, 89, 21, 8
K=5


Output top records

Output =
89, 55, 34, 21, 13

Scalding code
Tsv(args("input"), 'item)
.groupAll{_.sortWithTake('item -> 'top,
(a : Int, b : Int) => a > b}}
.write(Tsv(args("output"), 'top))

){

Algorithm flow

Read input records

Top K elements
problem


Output top records

Algorithm flow

Read input records

Top K elements
problem

Filter records that fit
target population


Output top records

Algorithm flow

Read input records

Top K elements
problem

target population

Divide to groups by site
section

Sort
records, tak
e top K

Sort
records, tak
e top K

Output top
records

Output top
records

Algorithm flow

Read input records

Top K elements
problem

Read exclusion list from
external source

target population

Filter out the visits from
the exclusion list
according to visitor id

section

Sort
records,
take top K

Sort
records,
take top K

Output top
records

Output top
records

MapReduce on Hadoop

Block

Mapper
(k,v) 
(k’1,v’1),(k’2,v’2)…

HDFS

Block n

Mapper n
(k,v) 
(k’1,v’1),(k’2,v’2)…

Output
file

Reducer
(k', iterator(v'))
v’’1, v’’2…

Block

Mapper
(k,v) 
(k’1,v’1),(k’2,v’2)…

Reducer
(k', iterator(v'))
v’’1, v’’2…

Output
file

MapReduce on Hadoop

Big bottleneck
Block

Mapper
(k,v) 
(k’1,v’1),(k’2,v’2)…

HDFS

Block n

Mapper n
(k,v) 
(k’1,v’1),(k’2,v’2)…

Output
file

Reducer
(k', iterator(v'))
v’’1, v’’2…

Block

Mapper
(k,v) 
(k’1,v’1),(k’2,v’2)…

Reducer
(k', iterator(v'))
v’’1, v’’2…

Output
file

Efficient MapReduce

Which tool
should we
use?

Have built-in
performanceoriginated features

Efficient
Execution

Easy to alter

And easy
maintenance

Full
Functionality

Fast
Code Writing

About Scalding

Scalding is a Scala library that makes it easy to write

MapReduce jobs in Hadoop. It's similar to other
MapReduce platforms like Pig and Hive, but offers a
higher level of abstraction by leveraging the full power of
Scala and the JVM
–Twitter

Simple Scalding job

val visitScheme = ('visitorId, 'country, 'section, 'saleValue)
type visitType = (String, String, String, Double)
def biggerSale(v1 : visitType, v2: visitType) = v1._4 > v2._4

Simple Scalding job

val visitScheme = ('visitorId, 'country, 'section, 'saleValue)
type visitType = (String, String, String, Double)
def biggerSale(v1 : visitType, v2: visitType) = v1._4 > v2._4

import com.twitter.scalding._
)(biggerSale)}

}

MapReduce joins

We like to filter out the visits that appear in
the exclusion list:
visitorId
1
2

country
Israel
Israel

section
…
…

saleValue
…
…

3

Israel

…

…

exVisitorId
3

1

MapReduce joins

the exclusion list:
visitorId
1
2

country
Israel
Israel

section
…
…

saleValue
…
…

3

Israel

…

exVisitorId
3

…

visitorId

country

section

saleValue

1

exVisitorId

MapReduce joins

the exclusion list:
visitorId
1
2

country
Israel
Israel

section
…
…

saleValue
…
…

3

Israel

…

exVisitorId
3

…

1

visitorId
1

country
Israel

section
…

saleValue
…

exVisitorId
1

2
3

Israel
Israel

…
…

…
…

null
3

MapReduce joins

the exclusion list:
visitorId
1
2

country
Israel
Israel

section
…
…

saleValue
…
…

3

Israel

…

exVisitorId
3

…

1

visitorId
1

country
Israel

section
…

saleValue
…

exVisitorId
1

2
3

Israel
Israel

…
…

…
…

null
3

visitorId

country

section

saleValue

exVisitorId

2

Israel

…

…

null

Simple Scalding job

Filtering using JoinWithTiny:
val exclusions = Tsv(args("exclusions"), 'exVisitorId)
.leftJoinWithTiny('visitorId -> exVisitorId, exclusions)
.filter('exVisitorId){isEx : String => isEx
null}
)(biggerSale)}
}

Simple Scalding job

Functionality is complete

What's next

Efficient MapReduce

Functionality is complete

What's next

Efficient
Execution

Full
Functionality

Fast

Code Writing

Efficient MapReduce

MapReduce performance issues:

1. Traffic bottleneck between the mappers and the reduces.
2. Inefficient order of map and reduce steps.

Efficient MapReduce

MapReduce performance issues:

1. Traffic bottleneck between the mappers and the reduces.

The traffic bottleneck is when we take the top K elements.

•
•

We like to output from each mapper the top elements of its
input.
How is sortWithTake implemented?

Efficient performance using Algebird

sortWithTake uses:
class PriorityQueueMonoid[T](max : Int)(implicit
ord : Ordering[T]) extends Monoid[PriorityQueue[T]]

Defined in:
Algebird (Twitter): Abstract algebra for Scala, targeted at
building aggregation systems.


sortWithTake uses:
class PriorityQueueMonoid[T](max : Int)(implicit
ord : Ordering[T]) extends Monoid[PriorityQueue[T]]

PriorityQueue case:
Empty PriorityQueue
Two PriorityQueues can be added:
K=5
Q1: values = 55, 34, 21, 13, 8
Q2: values = 100, 80, 60, 40, 20
Q1 plus Q2: values: 100, 80, 60, 55, 40

Associative and commutative


All Monoid aggregations can start in Map phase, then
finish in Reduce phase. This decreases the amount
of traffic from the mappers to the reducers.
Performed implicitly when using Scalding built-in
aggregation functions:
average
sum
sizeAveStdev
histogram
approximateUniqueCount
sortWithTake

Improving performance

Our second performance issue:

What about the performance due to
inefficient order of the map and reduce
steps?

Top elements problem revisited

New problem definition:

Output the percentage p of top elements
instead of the fixed K top elements.

What is K?
K = p * count

Top %p of elements algorithm flow

Read input records

What is K?
K = p * count

…

section

Count the
number of
records

Count the
number of
records

Sort
records
take top p

Sort
records
take top p

Output top
records

Output top
records

Top %p of elements scalding job
class TopPJob(args : Args) extends Job(args){
// visitScheme after join with exclusion list
val visits : RichPipe = …
val counts = visits
.groupBy('section){_.size('sectionSize)}
.map('sectionSize -> 'sectionK){size : Int => {size *
// taking top %p of elements
visits.joinWithTiny('section -> 'section, counts)
…
}

}.toInt}

Flow graph

How will this flow be executed on Hadoop?
•

How many MapReduce steps will be performed?

•

What will be the input to each step?

•

What logic will each contain?

Flow graph

How will this flow be executed on Hadoop?
•

How many MapReduce steps will be performed?

•

What will be the input to each step?

•

What logic will each contain?

Run with --tool.graph!

Flow graph

Full flow in
Cascading
terminology

Flow graph

Split to
counting

Full flow in
Cascading
terminology

Reading input,
join with
exclusion list

Counting and
calculating K

Join with
counting
result
Joining with K
and sorting

Flow graph

And another graph:

Flow graph

And another graph:

source

source

Step number
Records input
Exclusion list
group

Step number
Records input
Exclusion list
group
Output file
sink

First
step

Second
step

Flow graph

Changing joining with exclusion list to
be performed only once:
val visits : RichPipe =
…
.project(visitScheme)
.forceToDisk

Only a single
line is added!

val counts = visits
.groupBy('section){_.size('sectionSize)}
…
visits.joinWithTiny('section -> 'section, counts)
…

Flow graph

The new map reduce steps:

source

Step number
Records input
Exclusion list

Step number
group

sink

Step number
group
Output file

First
step

Second
step

Third
step

Improving performance

We saw how:
• Writing Scalding jobs is simple, intuitive and fast.
• We can use external resources to improve the

performance of our algorithms. Scalding performs
some of this job implicitly for us.
• We can use Cascading library Scalding built on to

understand what are the exact steps that will run.

Additional features

Some other features in Scalding
• Typed API
TypedTsv[visitType](args("input"))
.filter(_._2 == "Israel")
.toPipe(visitScheme)
.toTypedPipe[visitType](visitScheme)

// TypedPipe[visitType]
// TypedPipe[visitType]

• Testing using JobTest

Give the input and get the output as Lists
• Matrix API

Useful for running graph algorithms such as PageRank

Scalding in LivePerson

How do we use
Scalding in LivePerson?

• The main tool in the Data Science team
• Both for quick data exploration, and in production jobs

LivePerson Developers

developer.liveperson.com
apps.liveperson.com

YouTube.com/LivePersonDev
Twitter.com/LivePersonDev
Facebook.com/LivePersonDev

Thank You!
Contact info:
netab@liveperson.com
netabarkay@gmail.com

We are hiring!

Scalding: Reaching Efficient MapReduce

Scalding: Reaching Efficient MapReduce

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Scalding: Reaching Efficient MapReduce

Similar to Scalding: Reaching Efficient MapReduce (20)

More from LivePerson

More from LivePerson (20)

Recently uploaded

Recently uploaded (20)

Scalding: Reaching Efficient MapReduce