Example Generation for Data Flow Programs

    Notes on slide 1

    remove Y! logo, slide for related work

    say what canonicalize does, filter like having clause in SQL

    cite surajit, motwani

    this is what someone would write by hand, or when teaching a class

    skip

    input or output records?, give rule for UNION

    mention that this is only a high-level description

    call out which filter every time you say it

    nice to point out that real ones can be pruned too

    say 8 programs, sampling of workload at yahoo, going to show one easy one and 1 hard one

    say downstream run with 10000 initial samples, same as our algo

    investigate completeness with downstream

    put up y! logo, pig logo


    Example Generation for Data Flow Programs - Presentation Transcript

    1. Generating Example Data For Dataflow Programs
      Chris Olston, Shubham Chopra, Utkarsh Srivastava
      Research
    2. Data Processing Renaissance
      • Lots of data (TBs/day at Yahoo!)
      • Lots of queries and programs to analyze that data
      • New data flow languages
      • Map-Reduce, Pig Latin, Dryad
      • Other data flow systems
      • Aurora, Tioga, River
    3. Example Dataflow Program
      LOAD
      (user, url)
      LOAD
      (url, pagerank)
      JOIN
      on url
      Find users that tend to visit high-pagerank pages
      GROUP
      on user
      TRANSFORM
      user, canonicalize(url)
      TRANSFORM
      user, AVG(pagerank)
      FILTER
      avgPR> 0.5
    4. Iterative Process
      LOAD
      (user, url)
      LOAD
      (url, pagerank)
      Joining on right attribute?
      JOIN
      on url
      GROUP
      on user
      TRANSFORM
      user, canonicalize(url)
      TRANSFORM
      user, AVG(pagerank)
      Bug in UDF
      canonicalize?
      Everything being filtered out?
      FILTER
      avgPR> 0.5
      No Output 
    5. How to do test runs?
      Run with real data
      Too inefficient (TBs of data)
      Create smaller data sets (e.g., by sampling)
      Empty results due to joins [Chaudhuri et al. 99] and selective filters
      Biased sampling for joins
      Indexes not always present
    6. Examples to Illustrate Program
      (www.cnn.com, 0.9)
      (www.frogs.com, 0.3)
      (www.snails.com, 0.4)
      LOAD
      (user, url)
      LOAD
      (url, pagerank)
      (Amy, cnn.com)
      (Amy, http://www.frogs.com)
      (Fred, www.snails.com/index.html)
      JOIN
      on url
      (Amy, www.cnn.com, 0.9)
      (Amy, www.frogs.com, 0.3)
      (Fred, www.snails.com, 0.4)
      GROUP
      on user
      TRANSFORM
      user, canonicalize(url)
      (Amy, { (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3) })
      (Fred, { (Fred, www.snails.com, 0.4) })
      TRANSFORM
      user, AVG(pagerank)
      (Amy, www.cnn.com)
      (Amy, www.frogs.com)
      (Fred, www.snails.com)
      (Amy, 0.6)
      (Fred, 0.4)
      FILTER
      avgPR> 0.5
      (Amy, 0.6)
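The example above can be traced in a few lines of Python. This is only an illustrative sketch, not the Pig implementation: `canonicalize` is a hypothetical stand-in for the slide's UDF (it is assumed to drop the scheme and path and to force a `www.` prefix), and `run_program` is a name made up for this sketch.

```python
from collections import defaultdict


def canonicalize(url):
    """Hypothetical stand-in for the UDF: drop scheme and path, force a www. prefix."""
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host if host.startswith("www.") else "www." + host


def run_program(visits, pageranks, threshold=0.5):
    # TRANSFORM user, canonicalize(url)
    canonical = [(user, canonicalize(url)) for user, url in visits]
    # JOIN on url (assumes each url appears once in the pagerank input)
    rank_by_url = dict(pageranks)
    joined = [(user, url, rank_by_url[url])
              for user, url in canonical if url in rank_by_url]
    # GROUP on user
    groups = defaultdict(list)
    for user, _url, pagerank in joined:
        groups[user].append(pagerank)
    # TRANSFORM user, AVG(pagerank)
    averages = [(user, sum(prs) / len(prs)) for user, prs in groups.items()]
    # FILTER avgPR > threshold
    return [(user, avg) for user, avg in averages if avg > threshold]


visits = [("Amy", "cnn.com"),
          ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]
result = run_program(visits, pageranks)
print(result)
```

With the slide's records, only Amy survives the final filter, since Fred's average pagerank of 0.4 is below the threshold.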
    7. Value Addition From Examples
      Examples can be used for
      Debugging
      Understanding a program written by someone else
      Learning a new operator, or language
    8. Outline
      Formalization of good examples
      Example Generation Algorithm
      Performance Evaluation
    9. Good Examples: Consistency
      LOAD
      (user, url)
      LOAD
      (url, pagerank)
      (Amy, cnn.com)
      (Amy, http://www.frogs.com)
      (Fred, www.snails.com/index.html)
      JOIN
      on url
      GROUP
      on user
      0. Consistency
      TRANSFORM
      user, canonicalize(url)
      TRANSFORM
      user, AVG(pagerank)
      output example
      =
      operator applied on input example
      (Amy, www.cnn.com)
      (Amy, www.frogs.com)
      (Fred, www.snails.com)
      FILTER
      avgPR> 0.5
    10. Good Examples: Realism
      LOAD
      (user, url)
      LOAD
      (url, pagerank)
      (Amy, cnn.com)
      (Amy, http://www.frogs.com)
      (Fred, www.snails.com/index.html)
      JOIN
      on url
      GROUP
      on user
      1. Realism
      TRANSFORM
      user, canonicalize(url)
      TRANSFORM
      user, AVG(pagerank)
      (Amy, www.cnn.com)
      (Amy, www.frogs.com)
      (Fred, www.snails.com)
      Formalization: Fraction of examples that are real or are derived from real records
      FILTER
      avgPR> 0.5
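The realism formalization on this slide can be read as a simple ratio. A minimal sketch follows, assuming each example record carries a flag saying whether it is real (or derived from real records). The `(record, is_real)` pair representation, the placeholder record, and the choice to score an empty example set as 0.0 are assumptions of this sketch.

```python
def realism(examples):
    """Fraction of examples that are real or derived from real records.

    `examples` is a list of (record, is_real) pairs; the pair layout is an
    assumption made for this sketch.
    """
    if not examples:
        return 0.0  # assumed convention for an empty example set
    return sum(1 for _record, is_real in examples if is_real) / len(examples)


examples = [
    (("Amy", "www.cnn.com"), True),
    (("Amy", "www.frogs.com"), True),
    (("Fred", "www.snails.com"), True),
    (("Zed", "www.example.com"), False),  # made-up synthetic record
]
score = realism(examples)
print(score)
```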
    11. Good Examples: Completeness
      LOAD
      (user, url)
      LOAD
      (url, pagerank)
      2. Completeness
      JOIN
      on url
      Demonstrate the salient properties of each operator, e.g., FILTER
      GROUP
      on user
      TRANSFORM
      user, canonicalize(url)
      TRANSFORM
      user, AVG(pagerank)
      (Amy, 0.6)
      (Fred, 0.4)
      FILTER
      avgPR> 0.5
      (Amy, 0.6)
    12. Good Examples: Completeness
      (www.cnn.com, 0.9)
      (www.frogs.com, 0.3)
      (www.snails.com, 0.4)
      LOAD
      (user, url)
      LOAD
      (url, pagerank)
      JOIN
      on url
      (Amy, www.cnn.com, 0.9)
      (Amy, www.frogs.com, 0.3)
      (Fred, www.snails.com, 0.4)
      GROUP
      on user
      TRANSFORM
      user, canonicalize(url)
      2. Completeness
      TRANSFORM
      user, AVG(pagerank)
      (Amy, www.cnn.com)
      (Amy, www.frogs.com)
      (Fred, www.snails.com)
      Demonstrate the salient properties of each operator, e.g., JOIN
      FILTER
      avgPR> 0.5
    13. Formalizing Completeness
      • For any operator, classify input/output example records into equivalence classes.
      • Each equivalence class demonstrates one property of the operator.
      • Try to have at least one example from each class
    14. Equivalence Class Examples
      FILTER
      E0: All input records that pass the filter
      E1: All input records that fail the filter
      JOIN
      E0: All output records
      UNION
      E0: All records belonging to first input
      E1: All records belonging to second input
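The three classifications listed above can be sketched as small Python helpers. The function names and the dictionary layout (class label mapped to a list of records) are assumptions made for illustration; only the class definitions themselves come from the slide.

```python
def filter_classes(input_records, predicate):
    """FILTER: E0 holds inputs that pass the predicate, E1 those that fail."""
    classes = {"E0": [], "E1": []}
    for record in input_records:
        classes["E0" if predicate(record) else "E1"].append(record)
    return classes


def join_classes(output_records):
    """JOIN: a single class E0 containing every output record."""
    return {"E0": list(output_records)}


def union_classes(first_input, second_input):
    """UNION: E0 holds records from the first input, E1 from the second."""
    return {"E0": list(first_input), "E1": list(second_input)}


averages = [("Amy", 0.6), ("Fred", 0.4)]
classes = filter_classes(averages, lambda record: record[1] > 0.5)
print(classes)
```

For the `avgPR > 0.5` filter, Amy's record lands in E0 and Fred's in E1, so both properties of the filter are demonstrated.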
    15. Formalizing Completeness
      Operator Completeness:
      Fraction of equivalence classes that have at least one example record.
      Overall Completeness:
      Average of per-operator completeness.
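Both definitions translate directly into code. The sketch below assumes each operator's examples are already grouped into equivalence classes, represented as a dictionary from class label to a list of records. That layout, the function names, and the 0.0 score for empty inputs are assumptions of this sketch.

```python
def operator_completeness(classes):
    """Fraction of equivalence classes holding at least one example record."""
    if not classes:
        return 0.0  # assumed convention when an operator has no classes
    covered = sum(1 for records in classes.values() if records)
    return covered / len(classes)


def overall_completeness(program):
    """Average of per-operator completeness over all operators."""
    scores = [operator_completeness(classes) for classes in program.values()]
    return sum(scores) / len(scores) if scores else 0.0


program = {
    # Every example passes the filter, so the "fails" class E1 is empty.
    "FILTER": {"E0": [("Amy", 20)], "E1": []},
    "JOIN": {"E0": [("Amy", "www.cnn.com", 0.9)]},
}
print(operator_completeness(program["FILTER"]))
print(overall_completeness(program))
```

Here the FILTER scores 0.5 and the JOIN scores 1.0, giving an overall completeness of 0.75.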
    16. Good Examples: Conciseness
      LOAD
      (user, url)
      LOAD
      (url, pagerank)
      3. Conciseness
      (Amy, cnn.com)
      (Amy, http://www.frogs.com)
      (Fred, www.snails.com/index.html)
      JOIN
      on url
      Operator Conciseness:
      (# equivalence classes) / (# example records)
      GROUP
      on user
      TRANSFORM
      user, canonicalize(url)
      Overall Conciseness:
      Average of per-operator conciseness
      TRANSFORM
      user, AVG(pagerank)
      (Amy, www.cnn.com)
      (Amy, www.frogs.com)
      (Fred, www.snails.com)
      FILTER
      avgPR> 0.5
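The conciseness ratio can be computed the same way as completeness. A minimal sketch, assuming each operator is described by its number of equivalence classes and its number of example records. The function names, the tuple layout, and the 0.0 score when an operator has no records are assumptions, and the slide does not say whether the ratio is capped, so none is applied here.

```python
def operator_conciseness(num_classes, num_records):
    """Ratio of equivalence classes to example records for one operator."""
    if num_records == 0:
        return 0.0  # assumed convention when an operator has no examples
    return num_classes / num_records


def overall_conciseness(operators):
    """Average of per-operator conciseness.

    `operators` maps an operator name to (num_classes, num_records).
    """
    scores = [operator_conciseness(c, r) for c, r in operators.values()]
    return sum(scores) / len(scores) if scores else 0.0


operators = {
    "FILTER": (2, 4),  # two classes illustrated with four records
    "JOIN": (1, 1),    # one class illustrated with one record
}
print(overall_conciseness(operators))
```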
    17. Outline
      Formalization of good examples
      Example Generation Algorithm
      Performance Evaluation
    18. Related Work
      Related Areas:
      Reverse Query Processing
      Database Testing
      Software and Hardware Verification
      Differences
      Realism not a concern
      Notion of conciseness is different
      Intermediate result size is immaterial
    19. Strawman I: Downstream Propagation
      Take some portion of input data and run the program over it.
      1. Realism
      2. Completeness
      3. Conciseness
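A rough sketch of downstream propagation shows why it can fall short on completeness. It takes a prefix of the input as "some portion" and pushes it through each operator in turn. The `downstream` helper, the prefix-based choice of portion, and the sample records are assumptions of this sketch.

```python
def downstream(records, portion, operators):
    """Run the first `portion` records through (name, function) operators.

    Returns the intermediate result after every step so that each operator's
    examples can be inspected.
    """
    data = list(records[:portion])
    trace = [("LOAD", data)]
    for name, apply_operator in operators:
        data = apply_operator(data)
        trace.append((name, data))
    return trace


users = [("Amy", 20), ("Fred", 25), ("Jack", 30), ("Bill", 17)]
operators = [("FILTER age > 18", lambda rows: [r for r in rows if r[1] > 18])]

trace = downstream(users, 3, operators)
loaded, filtered = trace[0][1], trace[1][1]
failing = [record for record in loaded if record not in filtered]
print(filtered)
print(failing)
```

Every record in the chosen portion passes the filter, so no example illustrates a record being filtered out, even though such a record exists in the full input.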
    20. Strawman II: Upstream Propagation
      Start from what output is desired, and work backwards
      1. Realism
      2. Completeness
      3. Conciseness
    21. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
    22. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Take a subset of input and propagate through the program.
      (Jack, 30)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Fred, 25)
      (Jack, 30)
      (Amy, 20)
      (Fred, 25)
      (Jack, 30)
      (Amy, 20)
      (Fred, 25)
      (Amy, 20)
      (Fred, 25)
    23. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Prune redundant examples, i.e., improve conciseness without hurting completeness.
      (Jack, 30)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Fred, 25)
      (Jack, 30)
      (Amy, 20)
      (Fred, 25)
      (Jack, 30)
      (Amy, 20)
      (Fred, 25)
      (Amy, 20)
      (Fred, 25)
    24. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Prune redundant examples, i.e., improve conciseness without hurting completeness.
      (Jack, 30)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Fred, 25)
      (Jack, 30)
      (Amy, 20)
      (Fred, 25)
      (Jack, 30)
      (Amy, 20)
      (Fred, 25)
      (Amy, 20)
      (Fred, 25)
    25. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Prune redundant examples, i.e., improve conciseness without hurting completeness.
      (Jack, 30)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Jack, 30)
      (Amy, 20)
      (Jack, 30)
      (Amy, 20)
      (Amy, 20)
    26. Formalization of Pruning
      Example records ↔ elements
      Equivalence classes ↔ sets
      Pick the minimum number of records that covers every equivalence class: the Set-Cover Problem
      More involved because completeness of other operators must be maintained; details in paper
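One common way to approximate a minimum cover is the greedy heuristic: repeatedly keep the record that demonstrates the most still-uncovered classes. The sketch below uses that heuristic for illustration only. The slides do not say which method the authors use, and this version ignores the cross-operator constraints mentioned above. The class labels assigned to each record are assumed for the example.

```python
def greedy_prune(record_classes):
    """Keep a small set of records that still covers every equivalence class.

    `record_classes` maps each record to the set of class labels it
    demonstrates. Ties are broken by insertion order.
    """
    uncovered = set().union(*record_classes.values())
    kept = []
    while uncovered:
        best = max(record_classes,
                   key=lambda record: len(record_classes[record] & uncovered))
        kept.append(best)
        uncovered -= record_classes[best]
    return kept


# Assumed class labels: Jack arrives on one branch, Amy and Fred on the other.
record_classes = {
    ("Jack", 30): {"UNION:E0", "FILTER age>18:E0"},
    ("Amy", 20): {"UNION:E1", "FILTER udf:E0"},
    ("Fred", 25): {"UNION:E1", "FILTER udf:E0"},
}
kept = greedy_prune(record_classes)
print(kept)
```

Under these assumed labels Fred demonstrates nothing that Amy does not, so he is pruned and only Jack and Amy remain.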
    27. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Enhance completeness by inserting constraint records (best effort; details in paper)
      (Jack, 30)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Jack, 30)
      (Amy, 20)
      (Jack, 30)
      (Amy, 20)
      (Amy, 20)
    28. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Enhance completeness by inserting constraint records (best effort; details in paper)
      (Jack, 30)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Jack, 30)
      (Amy, 20)
      (Jack, 30)
      (--, 17)
      (Amy, 20)
      (Amy, 20)
    29. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Enhance completeness by inserting constraint records (best effort; details in paper)
      (Jack, 30)
      (--, 17)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Jack, 30)
      (Amy, 20)
      (Jack, 30)
      (--, 17)
      (Amy, 20)
      (--, 17)
      (Amy, 20)
      (Bill, 17)
    30. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Enhance completeness by inserting constraint records (best effort; details in paper)
      (Jack, 30)
      (Bob, 17)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Jack, 30)
      (Amy, 20)
      (Jack, 30)
      (Bill, 17)
      (Bob, 17)
      (Amy, 20)
      (Bill, 17)
      (Amy, 20)
      (Bill, 17)
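The upstream step for a simple comparison filter can be sketched as two moves: create a constraint record that fixes only the filtered field, then fill the unconstrained fields. This is a simplified illustration for `age > 18` with integer ages, not the paper's procedure. The helper names and the source of the filler values are assumptions.

```python
def failing_constraint(threshold):
    """Constraint record that fails `age > threshold`; user is unconstrained.

    None plays the role of the "--" placeholder shown on the slides.
    Assumes integer ages, so threshold - 1 fails the filter.
    """
    return (None, threshold - 1)


def fill_unconstrained(record, defaults):
    """Replace unconstrained (None) fields with values from `defaults`."""
    return tuple(default if value is None else value
                 for value, default in zip(record, defaults))


constraint = failing_constraint(18)
synthetic = fill_unconstrained(constraint, ("Bill", 0))
print(constraint)
print(synthetic)
```

The synthetic record fails the filter by construction, which gives the filter's "fails" class an example when no real record in the sample supplies one.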
    31. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.
      (Jack, 30)
      (Bob, 17)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Jack, 30)
      (Amy, 20)
      (Jack, 30)
      (Bill, 17)
      (Bob, 17)
      (Amy, 20)
      (Bill, 17)
      (Amy, 20)
      (Bill, 17)
    32. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.
      (Jack, 30)
      (Bob, 17)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Amy, 20)
      (Jack, 30)
      (Amy, 20)
      (Jack, 30)
      (Bill, 17)
      (Bob, 17)
      (Amy, 20)
      (Bill, 17)
      (Amy, 20)
      (Bill, 17)
    33. Our Algorithm
      Algorithm Passes
      Downstream
      Pruning
      Upstream
      Pruning
      Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.
      (Jack, 30)
      LOAD
      (user, age)
      UNION
      FILTER
      age>18
      LOAD
      (user, age)
      FILTER
      udf(user)
      (Jack, 30)
      (Jack, 30)
      (Bill, 17)
      (Bill, 17)
      (Bill, 17)
    34. Implementation Status
      Available as ILLUSTRATE command in open-source release of Pig
      Available as Eclipse Plugin (PigPen)
    35. PigPen Snapshot
    36. Performance Evaluation
      Program I: (Web Search Result Viewing Statistics)
      LOAD
      FILTER by compound arithmetic expression
      GROUP
      TRANSFORM using built-in aggregate function
    37. Performance on Program I
    38. Performance Evaluation
      Program II: (Web Advertising Activity)
      LOAD table A
      FILTER A by compound logical expression
      JOIN with table B (highly selective)
      TRANSFORM using 4 string manipulation UDFs (non-invertible)
    39. Performance on Program II
    40. Running Time
    41. Conclusions
      • Writing dataflow programs is an iterative process.
      • Actual dataset too large for test runs.
      • Our algorithm can automatically generate examples that illustrate the program through:
      • Realism
      • Conciseness
      • Completeness

    meetutkarsh, 4 months ago


    © All Rights Reserved
