Hourglass: a Library for Incremental Processing on Hadoop

Slides from my talk at IEEE BigData 2013 presenting our paper "Hourglass: a Library for Incremental Processing on Hadoop".

Abstract:
Hadoop enables processing of large data sets through its relatively easy-to-use semantics. However, jobs for tasks that could be computed incrementally are often written inefficiently, because managing incremental state is burdensome for the programmer. This paper introduces Hourglass, a library for developing incremental monoid computations on Hadoop. It runs on unmodified Hadoop and provides an accumulator-based interface for programmers to store and use state across successive runs; the framework ensures that only the necessary subcomputations are performed. It is successfully used at LinkedIn, one of the largest online social networks, for many use cases in dashboarding and machine learning. Hourglass is open source and freely available.

    Presentation Transcript

    • Hourglass: a Library for Incremental Processing on Hadoop
      IEEE BigData 2013, October 9th
      Matthew Hayes
    • Matthew Hayes, Staff Software Engineer
      www.linkedin.com/in/matthewterencehayes/
      – 3+ years on the Applied Data Team at LinkedIn
      – Skills, Endorsements, DataFu, White Elephant
    • Agenda
      – Motivation
      – Design
      – Experiments
      – Q&A
    • Motivation
    • Event Collection in an Online System
      – Online websites typically have instrumented services that collect events
      – Events are stored in an offline system (such as Hadoop) for later analysis
      – Using events, we can build dashboards with metrics such as:
        # of page views over the last month
        # of active users over the last month
      – Metrics derived from events can also be useful in recommendation pipelines,
        e.g. impression discounting
    • Event Storage
      – Events can be categorized into topics, for example: page view, user login,
        ad impression/click
      – Store events by topic and by day:
        /data/page_view/daily/2013/10/08
        /data/page_view/daily/2013/10/09
        ...
        /data/ad_click/daily/2013/10/08
      – We can now perform computation over specific time windows
    • Computation Over Time Windows
      – In practice, many of our computations over time windows use either a
        fixed-start window (the start day is fixed while the end day advances)
        or a fixed-length window (a sliding window over the last n days)
    • Recognizing Inefficiencies
      – But jobs typically compute these windows daily
      – From one day to the next, the input changes little
      – A fixed-start window includes just one new day
    • Recognizing Inefficiencies
      – A fixed-length window includes one new day and drops the oldest day
    • Recognizing Inefficiencies
      – We are repeatedly processing the same input data
      – This wastes cluster resources
      – It would be better to process only the new data
      – How can we do better?
    • Hourglass Design
    • Design Goals
      – Address our use cases: fixed-start and fixed-length window computations
        over daily partitioned data
      – Reduce resource usage
      – Reduce wall clock time
      – Run on standard Hadoop
    • Improving Fixed-Start Computations
      – Suppose we must compute page view counts per member
      – The job consumes all days of available input, producing one output;
        we call this a partition-collapsing job
      – But if the job runs tomorrow, it has to reprocess the same data
    • Improving Fixed-Start Computations
      – Solution: merge the new data with the previous output
      – We can do this because the aggregation is an arithmetic operation,
        as sketched below
      – Hourglass provides a partition-collapsing job that supports output reuse
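The merge itself is plain per-key arithmetic. A minimal sketch in Java, assuming the per-member counts are materialized as maps (this simplification stands in for the job's key/value records; it is not the Hourglass API):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch: combine the previous fixed-start output (days 1..n-1)
// with the aggregate for the newly arrived day to cover days 1..n.
public class FixedStartMerge {
    static Map<Long, Long> merge(Map<Long, Long> previousOutput,
                                 Map<Long, Long> newDay) {
        Map<Long, Long> merged = new HashMap<>(previousOutput);
        // Addition is associative and commutative, so merging one new day
        // into the previous output equals reprocessing all days from scratch.
        newDay.forEach((member, count) -> merged.merge(member, count, Long::sum));
        return merged;
    }
}
```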
    • Partition-Collapsing Job Architecture (Fixed-Start)
      – Diagram: the job applied to a fixed-start window computation
    • Improving Fixed-Length Computations
      – For a fixed-length job, output can be reused with a similar trick:
        add the new day to the previous output, then subtract the oldest day
        from the result (see the sketch below)
      – We can subtract the old day because the aggregation is arithmetic
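Continuing the sketch above (again illustrative, not the library's API), sliding a fixed-length window is one addition and one subtraction per key; the subtraction is only possible because counts under addition are invertible:

```java
import java.util.HashMap;
import java.util.Map;

// Continues the FixedStartMerge sketch: slide the fixed-length window by
// one day by adding the newest day's counts and removing the oldest day's.
public class FixedLengthUpdate {
    static Map<Long, Long> slideWindow(Map<Long, Long> previousOutput,
                                       Map<Long, Long> newestDay,
                                       Map<Long, Long> oldestDay) {
        Map<Long, Long> result = new HashMap<>(previousOutput);
        newestDay.forEach((member, count) -> result.merge(member, count, Long::sum));
        // Counts form a group under addition, so the oldest day's contribution
        // can be removed exactly instead of recomputing the whole window.
        oldestDay.forEach((member, count) -> result.merge(member, -count, Long::sum));
        result.values().removeIf(count -> count == 0L); // drop members with no views left
        return result;
    }
}
```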
    • Partition-Collapsing Job Architecture (Fixed-Length)
      – Diagram: the job applied to a fixed-length window computation
    • Improving Fixed-Length Computations
      – But for some operations, such as max() and min(), old data cannot
        be subtracted
      – We cannot reuse the previous output, so how do we reduce computation?
      – Solution: a partition-preserving job, with partitioned input data and
        partitioned output data
      – Essentially: aggregate the data in advance (see the sketch below)
      – Aggregating in advance can be useful even when output can be reused
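A sketch of what aggregating in advance buys for a non-invertible operation like max(): the partition-preserving pass keeps one small per-day aggregate per input day, so sliding the window only recombines n small aggregates instead of re-reading the raw events. The Java below is illustrative, not the library's API:

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionPreservingMax {
    // Partition-preserving pass: one small aggregate per input day,
    // mapping memberId -> that member's max value for the day.
    static Map<Long, Long> maxForDay(Iterable<long[]> events /* {memberId, value} */) {
        Map<Long, Long> maxes = new HashMap<>();
        for (long[] e : events)
            maxes.merge(e[0], e[1], Math::max);
        return maxes;
    }

    // Collapsing pass: recombine the per-day aggregates in the window.
    // max() cannot be "unmerged", but each day's aggregate is tiny compared
    // to the raw events, so recombining every day in the window is cheap.
    static Map<Long, Long> collapseWindow(Iterable<Map<Long, Long>> daysInWindow) {
        Map<Long, Long> result = new HashMap<>();
        for (Map<Long, Long> day : daysInWindow)
            day.forEach((member, max) -> result.merge(member, max, Math::max));
        return result;
    }
}
```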
    • Partition-Preserving Job Architecture
      – Architecture diagram
    • MapReduce in Hourglass
      – MapReduce is a fairly general programming model
      – Hourglass requires that reduce() output a (key, value) pair, produce
        at most one value per key, and be implemented by an accumulator
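A minimal sketch of what reduce-as-accumulator looks like, modeled on the constraints on this slide; the interface and method names are assumptions for illustration, and the actual Hourglass API may differ:

```java
// Modeled on the slide's constraints: values are fed in one at a time and
// at most one output is produced per key. Names are illustrative only.
public interface Accumulator<V, R> {
    void accumulate(V value); // called once per value for the current key
    R getFinal();             // at most one output value per key
    void cleanup();           // reset state before the next key
}

// Example: summing page view counts for one member.
class CountAccumulator implements Accumulator<Long, Long> {
    private long sum;
    @Override public void accumulate(Long value) { sum += value; }
    @Override public Long getFinal() { return sum; }
    @Override public void cleanup() { sum = 0L; }
}
```

Restricting reduce() this way is what lets the framework re-drive the same logic over either raw input days or previously aggregated partitions.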
    • Building Blocks
      – Two types of jobs:
        partition-preserving: consume partitioned input data, produce partitioned output data
        partition-collapsing: consume partitioned input data, produce a single output
      – Must provide to jobs: input and output paths, desired time range
      – Must implement: map(), accumulate()
      – May implement if necessary: merge(), unmerge()
        (a toy end-to-end sketch of how these fit together follows)
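To make the division of labor concrete, here is a toy in-memory rendition of the two job types chained together for the page-view count. Everything here (data layout, method names) is invented for illustration and is not Hourglass code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory rendition of the two Hourglass job types; illustrative only.
public class ToyPipeline {
    // Partition-preserving pass: aggregate each day separately, keeping
    // one output partition per input partition.
    static Map<Long, Long> preserveDay(List<Long> memberIdsSeen) {
        Map<Long, Long> dayCounts = new HashMap<>();
        for (Long member : memberIdsSeen)
            dayCounts.merge(member, 1L, Long::sum); // map() emits (memberId, 1)
        return dayCounts;
    }

    // Partition-collapsing pass: combine the per-day aggregates in the
    // window into a single output (the accumulate() role).
    static Map<Long, Long> collapse(List<Map<Long, Long>> window) {
        Map<Long, Long> out = new HashMap<>();
        for (Map<Long, Long> day : window)
            day.forEach((member, count) -> out.merge(member, count, Long::sum));
        return out;
    }

    public static void main(String[] args) {
        List<Long> day1 = List.of(1L, 2L, 1L);
        List<Long> day2 = List.of(1L);
        // Chain: preserve each day once, then collapse the window.
        System.out.println(collapse(List.of(preserveDay(day1), preserveDay(day2))));
        // prints {1=3, 2=1} (map order may vary)
    }
}
```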
    • Experiments
    • Metrics for Evaluation
      – Wall clock time: the time that elapses until the job completes
      – Total task time: the sum of execution times across all tasks;
        represents usage of cluster resources
      – Compare each against a baseline non-incremental job
    • Experiment: Page Views per Member
      – Goal: count page views per member over the last n days
      – Chain the partition-preserving and partition-collapsing jobs
      – The previous output can be reused
    • Experiment: Page Views per Member
      – Results charts
    • Member Count Estimation
      – Goal: estimate the number of members visiting the site over the past n days
      – Use HyperLogLog cardinality estimation (trading space for accuracy)
      – Output cannot be reused, but with a partition-preserving job we can save
        per-day state, as sketched below
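Why per-day HyperLogLog state is enough: HLL sketches merge by taking the element-wise max of their registers, so the partition-preserving job can store one small sketch per day and the collapsing pass unions the n sketches in the window. A minimal sketch of just the union step (register updates and the cardinality estimate are omitted; this is illustrative, not the implementation used in the paper):

```java
// Union of two HyperLogLog sketches with the same register count:
// element-wise max of the registers. Unioning per-day sketches over the
// window gives the same estimate as sketching all the raw events at once.
public class HllUnion {
    static byte[] union(byte[] registersA, byte[] registersB) {
        byte[] merged = new byte[registersA.length];
        for (int i = 0; i < registersA.length; i++)
            merged[i] = (byte) Math.max(registersA[i], registersB[i]);
        return merged;
    }
}
```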
    • Member Count Estimation: Results
    • Conclusion
      – Computations over sliding windows are quite common
      – Implementations are typically inefficient
      – Incrementalizing Hadoop jobs can in some cases yield 95-98% reductions
        in total task time and 20-40% reductions in wall clock time
    • Learning More: datafu.org