Scalding: Reaching Efficient MapReduce
My name is Neta Barkay, and I'm a data scientist at LivePerson.
I'd like to share a talk I presented at the Underscore Scala community on "Efficient MapReduce using Scalding".
In this talk I reviewed why Scalding fits big data analysis and how it enables writing quick, intuitive code with the full functionality of vanilla MapReduce, without compromising on efficient execution on the Hadoop cluster. In addition, I presented some examples of Scalding jobs that can get you started, and talked about how you can use Scalding's ecosystem, which includes Cascading and the monoids from the Algebird library.
Read more & Video: https://connect.liveperson.com/community/developers/blog/2014/02/25/scalding-reaching-efficient-mapreduce

Presentation Transcript

  • Efficient MapReduce using Scalding. Neta Barkay | Data Scientist, LivePerson | December
  • Outline. Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. We will talk about: the MapReduce paradigm, writing Scalding jobs, improving job performance, and the Typed API and testing.
  • Getting a glimpse of some Scalding code:
    class TopKJob(args : Args) extends Job(args) {
      val exclusions = Tsv(args("exclusions"), 'exVisitorId)
      Tsv(args("input"), visitScheme)
        .filter('country){country : String => country == "Israel"}
        .leftJoinWithTiny('visitorId -> 'exVisitorId, exclusions)
        .filter('exVisitorId){isEx : String => isEx == null}
        .groupBy('section){_.sortWithTake(visitScheme -> 'top, 5)(biggerSale)}
        .flattenTo[visitType]{'top -> visitScheme}
        .write(Tsv(args("output"), visitScheme))
    }
  • Asking big data questions. Which questions will you ask? What analysis will you do? A possible approach: use the outliers to improve your product. For example: the most popular products on your site, or the visits that ended with the highest sale value. That is the problem of finding the top elements in the data.
  • Data analysis problem: the top elements problem. Input: the data, arranged in records; K, the number of top elements (or p, the percentage of top elements to output); and an order function, some ordering on the records. Output: the K top records of the data (or the top p percent) according to the order function.
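    For reference, here is the same problem on an in-memory collection, as a minimal plain-Scala sketch (not from the original deck; on big data this full sort is exactly what we want to avoid):

    // Hypothetical helper, for intuition only: top K on a local Seq.
    def topK[T](k: Int, xs: Seq[T])(implicit ord: Ordering[T]): Seq[T] =
      xs.sorted(ord.reverse).take(k)

    topK(5, Seq(13, 55, 8, 2, 34, 89, 21, 8))  // Seq(89, 55, 34, 21, 13)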
  • Algorithm flow, top K elements problem: read input records (Input = 13, 55, 8, 2, 34, 89, 21, 8; K = 5); sort records and take the top K; output the top records (Output = 89, 55, 34, 21, 13). Scalding code:
    Tsv(args("input"), 'item)
      .groupAll{_.sortWithTake('item -> 'top, 5){(a : Int, b : Int) => a > b}}
      .write(Tsv(args("output"), 'top))
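    A job like this is typically launched through Scalding's Tool runner; a sketch with hypothetical jar, class, and file names (swap --local for --hdfs to run on the cluster):

    hadoop jar myjob-assembly.jar com.twitter.scalding.Tool \
      com.example.TopKJob --local --input items.tsv --output top.tsv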
  • Algorithm flow, top K elements problem: read input records; read the exclusion list from an external source; filter records that fit the target population; filter out the visits from the exclusion list according to visitor id; divide into groups by site section; in each group, sort records and take the top K; output the top records of each group.
  • MapReduce on Hadoop, and the big bottleneck. [Diagram: HDFS blocks 1..n feed mappers 1..n, each mapping (k, v) to (k'1, v'1), (k'2, v'2), ...; the shuffle delivers (k', iterator(v')) to reducers, which emit v''1, v''2, ... to output files. The big bottleneck is the traffic between the mappers and the reducers.]
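    The contract behind the diagram, written as Scala type shapes (a sketch for orientation, not Hadoop's actual Java API):

    // Shapes only: what each phase consumes and produces.
    object MapReduceShapes {
      type Mapper[K, V, K2, V2] = (K, V) => TraversableOnce[(K2, V2)]
      type Reducer[K2, V2, V3]  = (K2, Iterator[V2]) => TraversableOnce[V3]
    }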
  • Efficient MapReduce: which tool should we use? We want all three at once: Efficient Execution (built-in performance-oriented features), Full Functionality (easy to alter, easy maintenance), and Fast Code Writing.
  • About Scalding: "Scalding is a Scala library that makes it easy to write MapReduce jobs in Hadoop. It's similar to other MapReduce platforms like Pig and Hive, but offers a higher level of abstraction by leveraging the full power of Scala and the JVM." – Twitter
  • Algorithm flow, top K elements problem (recap): the same flow as above, which we now implement in Scalding, starting from the simpler version without the exclusion list.
  • Simple Scalding job:
    val visitScheme = ('visitorId, 'country, 'section, 'saleValue)
    type visitType = (String, String, String, Double)
    def biggerSale(v1 : visitType, v2 : visitType) = v1._4 > v2._4

    import com.twitter.scalding._

    class TopKJob(args : Args) extends Job(args) {
      Tsv(args("input"), visitScheme)
        .filter('country){country : String => country == "Israel"}
        .groupBy('section){_.sortWithTake(visitScheme -> 'top, 5)(biggerSale)}
        .flattenTo[visitType]{'top -> visitScheme}
        .write(Tsv(args("output"), visitScheme))
    }
  • MapReduce joins. We would like to filter out the visits that appear in the exclusion list:

    visits:                                         exclusions:
    visitorId | country | section | saleValue       exVisitorId
    1         | Israel  | ...     | ...             3
    2         | Israel  | ...     | ...             1
    3         | Israel  | ...     | ...

    After a left join on visitorId = exVisitorId:

    visitorId | country | section | saleValue | exVisitorId
    1         | Israel  | ...     | ...       | 1
    2         | Israel  | ...     | ...       | null
    3         | Israel  | ...     | ...       | 3

    Keeping only the rows where exVisitorId is null:

    visitorId | country | section | saleValue | exVisitorId
    2         | Israel  | ...     | ...       | null
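    The same semantics on in-memory collections, as an illustrative sketch (data and names are hypothetical):

    // What "left join with the exclusion list, then keep the nulls" computes:
    // a visit survives iff its visitorId has no match in the exclusion list.
    val visits = List(("1", "Israel", "home", 10.0),
                      ("2", "Israel", "sale", 20.0),
                      ("3", "Israel", "home", 30.0))
    val exclusions = Set("1", "3")
    val kept = visits.filterNot { case (visitorId, _, _, _) => exclusions(visitorId) }
    // kept == List(("2", "Israel", "sale", 20.0))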
  • Simple Scalding job, filtering using leftJoinWithTiny:
    class TopKJob(args : Args) extends Job(args) {
      val exclusions = Tsv(args("exclusions"), 'exVisitorId)
      Tsv(args("input"), visitScheme)
        .filter('country){country : String => country == "Israel"}
        .leftJoinWithTiny('visitorId -> 'exVisitorId, exclusions)
        .filter('exVisitorId){isEx : String => isEx == null}
        .groupBy('section){_.sortWithTake(visitScheme -> 'top, 5)(biggerSale)}
        .flattenTo[visitType]{'top -> visitScheme}
        .write(Tsv(args("output"), visitScheme))
    }
  • Efficient MapReduce: functionality is complete. What's next? Of the three goals (Efficient Execution, Full Functionality, Fast Code Writing), efficient execution is the one that remains.
  • Efficient MapReduce. MapReduce performance issues: 1. Traffic bottleneck between the mappers and the reducers. 2. Inefficient order of the map and reduce steps.
  • Efficient MapReduce. Performance issue 1: the traffic bottleneck between the mappers and the reducers. The bottleneck appears when we take the top K elements. We would like each mapper to output only the top elements of its own input. How is sortWithTake implemented?
  • Efficient performance using Algebird. sortWithTake uses:
    class PriorityQueueMonoid[T](max : Int)(implicit ord : Ordering[T]) extends Monoid[PriorityQueue[T]]
    defined in Algebird (Twitter): abstract algebra for Scala, targeted at building aggregation systems. In the PriorityQueue case, the monoid's zero is the empty PriorityQueue, and two PriorityQueues can be added. With K = 5: Q1 values = 55, 34, 21, 13, 8; Q2 values = 100, 80, 60, 40, 20; Q1 plus Q2 values = 100, 80, 60, 55, 40. The plus operation is associative and commutative.
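    A minimal sketch of the idea behind that monoid, using sorted lists instead of priority queues (mergeTopK is a hypothetical stand-in; PriorityQueueMonoid does the same with bounded heaps, avoiding the full sort):

    // Keep the K largest elements; zero is the empty list, and plus is
    // associative and commutative, so mappers can pre-aggregate safely.
    def mergeTopK(k: Int)(a: List[Int], b: List[Int]): List[Int] =
      (a ++ b).sorted(Ordering.Int.reverse).take(k)

    val q1 = List(55, 34, 21, 13, 8)     // partial top 5 from one mapper
    val q2 = List(100, 80, 60, 40, 20)   // partial top 5 from another
    mergeTopK(5)(q1, q2)                 // List(100, 80, 60, 55, 40)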
  • Efficient performance using Algebird. All monoid aggregations can start in the map phase and finish in the reduce phase, which decreases the amount of traffic from the mappers to the reducers. This is performed implicitly when using Scalding's built-in aggregation functions: average, sum, sizeAveStdev, histogram, approximateUniqueCount, sortWithTake.
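    For example, a per-group sum with one of those built-ins (a sketch reusing the deck's visitScheme; the exact GroupBuilder signature may vary across Scalding versions):

    // Total sale value per section; since sum is a monoid aggregation,
    // Scalding pre-aggregates on the map side before the shuffle.
    Tsv(args("input"), visitScheme)
      .groupBy('section){_.sum[Double]('saleValue -> 'totalSale)}
      .write(Tsv(args("output"), ('section, 'totalSale)))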
  • Improving performance. Our second performance issue: what about the cost of an inefficient order of the map and reduce steps?
  • Top elements problem revisited. New problem definition: output the top p percent of elements instead of a fixed K top elements. What is K then? K = p * count.
  • Top p% of elements, algorithm flow: read input records; divide into groups by site section; in each group, count the number of records to get K (K = p * count); in each group, sort records and take the top p percent; output the top records of each group.
  • Top p% of elements Scalding job:
    class TopPJob(args : Args) extends Job(args) {
      // visitScheme after join with exclusion list
      val visits : RichPipe = …
      val counts = visits
        .groupBy('section){_.size('sectionSize)}
        .map('sectionSize -> 'sectionK){size : Int => (size * p).toInt}
      // taking top p% of elements
      visits.joinWithTiny('section -> 'section, counts)
      …
    }
  • Flow graph. How will this flow be executed on Hadoop? How many MapReduce steps will be performed? What will be the input to each step? What logic will each contain? Run with --tool.graph!
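    Concretely, adding --tool.graph to the usual invocation makes Scalding write the planned flow as DOT files instead of running the job (jar, class, and paths here are hypothetical):

    hadoop jar myjob-assembly.jar com.twitter.scalding.Tool \
      com.example.TopPJob --hdfs --input visits.tsv --output top.tsv --tool.graph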
  • Flow graph. [Diagram: the full flow in Cascading terminology: reading the input and joining with the exclusion list; a split to a counting branch that counts and calculates K; a join with the counting result; and finally joining with K and sorting.]
  • Flow graph. And another graph, of the MapReduce steps: [Diagram: two steps; each step reads the records input and the exclusion list as sources into a group, so the join with the exclusion list is performed twice; the second step writes the output file sink.]
  • Flow graph. Changing the join with the exclusion list to be performed only once. Only a single line is added:
    val visits : RichPipe = …
      .project(visitScheme)
      .forceToDisk
    val counts = visits
      .groupBy('section){_.size('sectionSize)}
      …
    visits.joinWithTiny('section -> 'section, counts)
    …
    forceToDisk checkpoints the joined pipe to disk, so both branches read the checkpoint instead of recomputing the join.
  • Flow graph. The new MapReduce steps: [Diagram: three steps; only the first reads the records input and the exclusion list, so the join is performed once; the second and third steps each read the previous step's output into a group, and the third writes the output file sink.]
  • Improving performance. We saw how: writing Scalding jobs is simple, intuitive, and fast; we can use external resources to improve the performance of our algorithms, and Scalding performs some of this work implicitly for us; and we can use Cascading, the library Scalding is built on, to understand the exact steps that will run.
  • Additional features. Some other features in Scalding:
    Typed API:
    TypedTsv[visitType](args("input"))      // TypedPipe[visitType]
      .filter(_._2 == "Israel")             // TypedPipe[visitType]
      .toPipe(visitScheme)                  // back to a fields-based Pipe
    .toTypedPipe[visitType](visitScheme)    // and from a Pipe back to a TypedPipe[visitType]
    Testing using JobTest: give the input and get the output as Lists.
    Matrix API: useful for running graph algorithms such as PageRank.
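    A sketch of what testing TopKJob with JobTest might look like (row data and the assertion are hypothetical; exact source/sink signatures may vary slightly between Scalding versions):

    import com.twitter.scalding._

    // Feed in-memory rows to the job's sources and inspect its sink.
    JobTest(new TopKJob(_))
      .arg("input", "fakeInput")
      .arg("exclusions", "fakeExclusions")
      .arg("output", "fakeOutput")
      .source(Tsv("fakeInput", visitScheme),
              List(("v1", "Israel", "home", 10.0), ("v2", "Israel", "home", 25.0)))
      .source(Tsv("fakeExclusions", 'exVisitorId), List(Tuple1("v1")))
      .sink[visitType](Tsv("fakeOutput", visitScheme)) { out =>
        assert(out.contains(("v2", "Israel", "home", 25.0)))
      }
      .run
      .finish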
  • Scalding in LivePerson. How do we use Scalding at LivePerson? It is the main tool in the Data Science team, used both for quick data exploration and in production jobs.
  • LivePerson Developers developer.liveperson.com apps.liveperson.com YouTube.com/LivePersonDev Twitter.com/LivePersonDev Facebook.com/LivePersonDev
  • Thank You! Contact info: netab@liveperson.com netabarkay@gmail.com We are hiring!