Scalding: Reaching Efficient MapReduce

Published on

My name is Neta Barkay, and I'm a data scientist at LivePerson.
I'd like to share with you a talk I presented at the Underscore Scala community on "Efficient MapReduce using Scalding".
In this talk I reviewed why Scalding fits big data analysis and how it enables writing quick, intuitive code with the full functionality of vanilla MapReduce, without compromising on efficient execution on the Hadoop cluster. I also presented some example Scalding jobs which can be used to get you started, and discussed Scalding's ecosystem, which includes Cascading and the monoids from the Algebird library.
Read more & Video: https://connect.liveperson.com/community/developers/blog/2014/02/25/scalding-reaching-efficient-mapreduce

Published in: Education

Transcript of "Scalding: Reaching Efficient MapReduce"

  1. Efficient MapReduce using Scalding | Neta Barkay, Data Scientist, LivePerson | December
  2. Outline: Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. We will talk about: • The MapReduce paradigm • Writing Scalding jobs • Improving job performance • Typed API, testing
  3. Getting a glimpse of some Scalding code:
     class TopKJob(args : Args) extends Job(args){
       val exclusions = Tsv(args("exclusions"), 'exVisitorId)
       Tsv(args("input"), visitScheme)
         .filter('country){country : String => country == "Israel"}
         .leftJoinWithTiny('visitorId -> 'exVisitorId, exclusions)
         .filter('exVisitorId){isEx : String => isEx == null}
         .groupBy('section){_.sortWithTake(visitScheme -> 'top, 5)(biggerSale)}
         .flattenTo[visitType]{'top -> visitScheme}
         .write(Tsv(args("output"), visitScheme))
     }
  4. Asking big data questions: Which questions will you ask? What analysis will you do? A possible approach: use the outliers to improve your product. • Most popular products on your site • Visits that ended with the highest sale value
  5. Asking big data questions (cont.): That is the problem of finding the top elements in the data.
  6. Data analysis problem: the top elements problem. Input: • Data – arranged in records • K – number of top elements, or p – percentage of top elements to output • Order function – some ordering on the records. Output: the K top records of our data, or the top p percentage, according to the order function.
  7. Algorithm flow (top K elements problem): Read input records → Sort records, take top K → Output top records. Input = 13, 55, 8, 2, 34, 89, 21, 8; K = 5; Output = 89, 55, 34, 21, 13
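The flow on this slide can be sketched in plain Scala on local collections (no Hadoop needed); the input and K = 5 are taken from the slide:

```scala
// Top-K flow from the slide: read records, sort descending, take the top K.
val input = List(13, 55, 8, 2, 34, 89, 21, 8)
val k = 5

val topK = input.sorted(Ordering[Int].reverse).take(k)
// topK == List(89, 55, 34, 21, 13), matching the slide's output
```

The Scalding job on the next slide expresses exactly this, only distributed across the cluster.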
  8. Algorithm flow (top K elements problem), Scalding code:
     Tsv(args("input"), 'item)
       .groupAll{_.sortWithTake('item -> 'top, 5){(a : Int, b : Int) => a > b}}
       .write(Tsv(args("output"), 'top))
  9. Algorithm flow (top K elements problem): Read input records → Sort records, take top K → Output top records
  10. Algorithm flow: Read input records → Filter records that fit target population → Sort records, take top K → Output top records
  11. Algorithm flow: Read input records → Filter records that fit target population → Divide to groups by site section → Sort records, take top K (per group) → Output top records
  12. Algorithm flow: Read input records → Read exclusion list from external source → Filter records that fit target population → Filter out the visits from the exclusion list according to visitor id → Divide to groups by site section → Sort records, take top K (per group) → Output top records
  13. MapReduce on Hadoop: HDFS blocks feed the mappers. Block 1 → Mapper 1: (k,v) → (k’1,v’1),(k’2,v’2)… ; Block n → Mapper n: (k,v) → (k’1,v’1),(k’2,v’2)… ; Reducers: (k', iterator(v')) → v’’1, v’’2… → Output files
  14. MapReduce on Hadoop: the shuffle of mapper output to the reducers is the big bottleneck.
  15. Efficient MapReduce: which tool should we use? We want one with built-in performance-oriented features (Efficient Execution), that is easy to alter and maintain (Full Functionality), and that allows Fast Code Writing.
  16. About Scalding: "Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. It's similar to other MapReduce platforms like Pig and Hive, but offers a higher level of abstraction by leveraging the full power of Scala and the JVM" –Twitter
  17. Algorithm flow (top K elements problem): Read input records → Read exclusion list from external source → Filter records that fit target population → Filter out the visits from the exclusion list according to visitor id → Divide to groups by site section → Sort records, take top K (per group) → Output top records
  18. Simple Scalding job:
     val visitScheme = ('visitorId, 'country, 'section, 'saleValue)
     type visitType = (String, String, String, Double)
     def biggerSale(v1 : visitType, v2 : visitType) = v1._4 > v2._4
  19. Simple Scalding job (using the definitions above):
     import com.twitter.scalding._
     class TopKJob(args : Args) extends Job(args){
       Tsv(args("input"), visitScheme)
         .filter('country){country : String => country == "Israel"}
         .groupBy('section){_.sortWithTake(visitScheme -> 'top, 5)(biggerSale)}
         .flattenTo[visitType]{'top -> visitScheme}
         .write(Tsv(args("output"), visitScheme))
     }
  24. MapReduce joins: we'd like to filter out the visits that appear in the exclusion list.
     visits:                                   exclusions:
     visitorId  country  section  saleValue    exVisitorId
     1          Israel   …        …            3
     2          Israel   …        …            1
     3          Israel   …        …
  25.–26. Left join on 'visitorId -> 'exVisitorId:
     visitorId  country  section  saleValue   exVisitorId
     1          Israel   …        …           1
     2          Israel   …        …           null
     3          Israel   …        …           3
  27. Keep only the rows where exVisitorId is null:
     visitorId  country  section  saleValue   exVisitorId
     2          Israel   …        …           null
  28. Simple Scalding job, filtering using leftJoinWithTiny:
     class TopKJob(args : Args) extends Job(args){
       val exclusions = Tsv(args("exclusions"), 'exVisitorId)
       Tsv(args("input"), visitScheme)
         .filter('country){country : String => country == "Israel"}
         .leftJoinWithTiny('visitorId -> 'exVisitorId, exclusions)
         .filter('exVisitorId){isEx : String => isEx == null}
         .groupBy('section){_.sortWithTake(visitScheme -> 'top, 5)(biggerSale)}
         .flattenTo[visitType]{'top -> visitScheme}
         .write(Tsv(args("output"), visitScheme))
     }
  30. Simple Scalding job: functionality is complete. What's next?
  31. Efficient MapReduce: functionality is complete (Full Functionality, Fast Code Writing). What's next? Efficient Execution.
  32. Efficient MapReduce. MapReduce performance issues: 1. Traffic bottleneck between the mappers and the reducers. 2. Inefficient order of map and reduce steps.
  33. Efficient MapReduce. First performance issue: the traffic bottleneck between the mappers and the reducers. In our job, the bottleneck is when we take the top K elements. • We'd like to output from each mapper only the top elements of its input. • How is sortWithTake implemented?
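A local-collections sketch of this optimization (the block contents here are illustrative, not from the slides): each mapper emits only the local top K of its block, so at most K records per block cross the network to the reducer, which merges them.

```scala
// Map-side top-K: each mapper keeps only the local top K of its block,
// so at most K records per block are shuffled to the reducer.
val k = 5
val blocks = List(
  List(13, 55, 8, 2),   // block 1 (illustrative contents)
  List(34, 89, 21, 8)   // block 2
)

// "Map" phase: local top K per block.
val mapperOutputs = blocks.map(_.sorted(Ordering[Int].reverse).take(k))

// "Reduce" phase: merge the small per-mapper lists, take the global top K.
val topK = mapperOutputs.flatten.sorted(Ordering[Int].reverse).take(k)
// topK == List(89, 55, 34, 21, 13), the same as sorting all records at once
```

The correctness of this trick is exactly what the monoid structure on the next slides guarantees.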
  34. Efficient performance using Algebird. sortWithTake uses:
     class PriorityQueueMonoid[T](max : Int)(implicit ord : Ordering[T]) extends Monoid[PriorityQueue[T]]
     Defined in Algebird (Twitter): abstract algebra for Scala, targeted at building aggregation systems.
  35. Efficient performance using Algebird. The PriorityQueue case: the empty PriorityQueue is the identity, and two PriorityQueues can be added. With K = 5: Q1 values = 55, 34, 21, 13, 8; Q2 values = 100, 80, 60, 40, 20; Q1 plus Q2 values = 100, 80, 60, 55, 40. The operation is associative and commutative.
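The addition on this slide can be imitated with the standard-library PriorityQueue; this is a sketch of the idea, not Algebird's actual implementation (which keeps the queue bounded as elements arrive rather than re-bounding after a merge):

```scala
import scala.collection.mutable.PriorityQueue

// Bounded-queue "plus": merge two queues and keep only the `max` largest
// elements. The operation is associative and commutative, with the empty
// queue as identity, which is what makes it a Monoid.
def plus(max: Int)(q1: PriorityQueue[Int], q2: PriorityQueue[Int]): PriorityQueue[Int] =
  PriorityQueue((q1.toList ++ q2.toList).sorted(Ordering[Int].reverse).take(max): _*)

// The slide's example with K = 5:
val q1 = PriorityQueue(55, 34, 21, 13, 8)
val q2 = PriorityQueue(100, 80, 60, 40, 20)
val sum = plus(5)(q1, q2)
// sum contains 100, 80, 60, 55, 40
```

Because the merge is associative and commutative, Hadoop may combine partial queues in any order across mappers and still produce the same top K.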
  36. Efficient performance using Algebird. All Monoid aggregations can start in the Map phase and finish in the Reduce phase. This decreases the amount of traffic from the mappers to the reducers. It is performed implicitly when using Scalding's built-in aggregation functions: average, sum, sizeAveStdev, histogram, approximateUniqueCount, sortWithTake.
  37. Improving performance. Our second performance issue: what about the performance lost due to an inefficient order of the map and reduce steps?
  38. Top elements problem revisited. New problem definition: output the top percentage p of elements instead of the fixed top K elements. What is K? K = p * count
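On local collections, the revised definition looks like this (p = 0.25 is just an example value, not from the slides):

```scala
// Top p% instead of top K: K is derived from the record count, so the
// flow needs a counting step before it can take the top elements.
val input = List(13, 55, 8, 2, 34, 89, 21, 8)
val p = 0.25                       // example percentage
val k = (p * input.size).toInt     // K = p * count = 2
val topP = input.sorted(Ordering[Int].reverse).take(k)
// topP == List(89, 55)
```

The dependency of K on the count is what forces the extra counting step (and the join with its result) in the flow on the next slide.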
  39. Top %p of elements algorithm flow: Read input records → Divide to groups by site section → Count the number of records (per group) → What is K? K = p * count → Sort records, take top K (per group) → Output top records
  40. Top %p of elements scalding job:
     class TopPJob(args : Args) extends Job(args){
       // visitScheme after join with exclusion list
       val visits : RichPipe = …
       val counts = visits
         .groupBy('section){_.size('sectionSize)}
         .map('sectionSize -> 'sectionK){size : Int => {size * p}.toInt}
       // taking top %p of elements
       visits.joinWithTiny('section -> 'section, counts)
       …
     }
  41. Flow graph. How will this flow be executed on Hadoop? • How many MapReduce steps will be performed? • What will be the input to each step? • What logic will each contain?
  42. Flow graph. How will this flow be executed on Hadoop? Run with --tool.graph!
  43. Flow graph: the full flow in Cascading terminology.
  44. Flow graph, full flow in Cascading terminology: reading input, join with exclusion list; split to counting; counting and calculating K; join with counting result; joining with K and sorting.
  45. Flow graph. And another graph:
  46. Flow graph. And another graph: First step: source (records input), source (exclusion list) → group. Second step: source (records input), source (exclusion list) → group → Output file (sink).
  47. Flow graph. Changing joining with exclusion list to be performed only once; only a single line is added!
     val visits : RichPipe = …
       .project(visitScheme)
       .forceToDisk
     val counts = visits
       .groupBy('section){_.size('sectionSize)}
       …
     visits.joinWithTiny('section -> 'section, counts)
     …
  48. Flow graph. The new MapReduce steps: First step: source (records input), exclusion list → group. Second step: group. Third step: group → Output file (sink).
  49. Improving performance. We saw how: • Writing Scalding jobs is simple, intuitive and fast. • We can use external resources to improve the performance of our algorithms; Scalding performs some of this work implicitly for us. • We can use the Cascading library that Scalding is built on to understand the exact steps that will run.
  50. Additional features. Some other features in Scalding:
     • Typed API
       TypedTsv[visitType](args("input"))
         .filter(_._2 == "Israel")              // TypedPipe[visitType]
         .toPipe(visitScheme)
         .toTypedPipe[visitType](visitScheme)   // TypedPipe[visitType]
     • Testing using JobTest: give the input and get the output as Lists
     • Matrix API: useful for running graph algorithms such as PageRank
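JobTest feeds sources and reads sinks as plain Lists, so the typed pipeline above behaves like an ordinary collection transformation. A local sketch of that filter, with hypothetical sample rows (not from the slides):

```scala
// The typed filter from the slide, run on a plain List instead of a
// TypedPipe; JobTest exercises jobs against in-memory data like this.
type VisitType = (String, String, String, Double)
val visits: List[VisitType] = List(
  ("1", "Israel", "books", 10.0),   // hypothetical sample rows
  ("2", "France", "music", 25.0)
)
val israeliVisits = visits.filter(_._2 == "Israel")
// israeliVisits == List(("1", "Israel", "books", 10.0))
```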
  51. Scalding in LivePerson. How do we use Scalding in LivePerson? • The main tool in the Data Science team • Used both for quick data exploration and in production jobs
  52. LivePerson Developers: developer.liveperson.com | apps.liveperson.com | YouTube.com/LivePersonDev | Twitter.com/LivePersonDev | Facebook.com/LivePersonDev
  53. Thank You! Contact info: netab@liveperson.com | netabarkay@gmail.com. We are hiring!
