Example Generation for Data Flow Programs

Published in: Technology, Business
Notes
  • remove Y! logo, slide for related work
  • say what canonicalize does, filter like having clause in SQL
  • cite Surajit, Motwani
  • this is what someone would write by hand, or when teaching a class
  • skip
  • input or output records?, give rule for UNION
  • mention that only high-level description
  • call out which filter every time you say it
  • nice to point out that real ones can be pruned too
  • say 8 programs, sampling of workload at Yahoo!, going to show one easy one and one hard one
  • say downstream run with 10000 initial samples, same as our algo
  • investigate completeness with downstream
  • put up Y! logo, Pig logo

    1. 1. Generating Example Data For Dataflow Programs<br />Chris Olston Shubham Chopra<br />Utkarsh Srivastava<br />Research<br />
    2. 2. Data Processing Renaissance<br />Lots of data (TBs/day at Yahoo!)<br />Lots of queries and programs to analyze that data<br />New data flow languages: Map-Reduce, Pig Latin, Dryad<br />Other data flow systems: Aurora, Tioga, River<br />
    7. 7. Example Dataflow Program<br />Find users that tend to visit high-pagerank pages<br />LOAD<br />(user, url)<br />TRANSFORM<br />user, canonicalize(url)<br />LOAD<br />(url, pagerank)<br />JOIN<br />on url<br />GROUP<br />on user<br />TRANSFORM<br />user, AVG(pagerank)<br />FILTER<br />avgPR &gt; 0.5<br />
    8. 8. Iterative Process<br />LOAD<br />(user, url)<br />TRANSFORM<br />user, canonicalize(url)<br />Bug in UDF canonicalize?<br />LOAD<br />(url, pagerank)<br />JOIN<br />on url<br />Joining on right attribute?<br />GROUP<br />on user<br />TRANSFORM<br />user, AVG(pagerank)<br />FILTER<br />avgPR &gt; 0.5<br />Everything being filtered out?<br />No Output<br />
    9. 9. How to do test runs?<br />Run with real data<br />Too inefficient (TBs of data)<br />Create smaller data sets (e.g., by sampling)<br />Empty results due to joins [Chaudhuri et al. 99] and selective filters<br />Biased sampling for joins<br />Indexes not always present<br />
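The empty-result problem with independent sampling can be seen in a small simulation. This is only a sketch under assumed conditions: two hypothetical tables that share 10,000 url keys one-to-one, and a 1% uniform sample of each. The table contents, sizes, and helper names are illustrative and do not come from the talk.

```python
import random


def sample(records, rate, rng):
    """Keep each record independently with probability `rate`."""
    return [r for r in records if rng.random() < rate]


def join_on_url(visits, pageranks):
    """Equi-join (user, url) with (url, pagerank) on url."""
    rank_by_url = dict(pageranks)
    return [(user, url, rank_by_url[url])
            for user, url in visits if url in rank_by_url]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000              # assumed table size, for illustration only
visits = [(f"user{i}", f"www.site{i}.com") for i in range(n)]
pageranks = [(f"www.site{i}.com", (i % 100) / 100) for i in range(n)]

full_join = join_on_url(visits, pageranks)
sampled_join = join_on_url(sample(visits, 0.01, rng),
                           sample(pageranks, 0.01, rng))

# A joined pair survives only if both of its sides were sampled, so on
# average about rate * rate of the full join remains.
print(len(full_join), len(sampled_join))
```

With a 1% sample on each side, only about 0.01% of the joined pairs are expected to survive, which is why a downstream filter often sees no input at all.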
    10. 10. Examples to Illustrate Program<br />LOAD<br />(user, url)<br />(Amy, cnn.com)<br />(Amy, http://www.frogs.com)<br />(Fred, www.snails.com/index.html)<br />TRANSFORM<br />user, canonicalize(url)<br />(Amy, www.cnn.com)<br />(Amy, www.frogs.com)<br />(Fred, www.snails.com)<br />LOAD<br />(url, pagerank)<br />(www.cnn.com, 0.9)<br />(www.frogs.com, 0.3)<br />(www.snails.com, 0.4)<br />JOIN<br />on url<br />(Amy, www.cnn.com, 0.9)<br />(Amy, www.frogs.com, 0.3)<br />(Fred, www.snails.com, 0.4)<br />GROUP<br />on user<br />(Amy, {(Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3)})<br />(Fred, {(Fred, www.snails.com, 0.4)})<br />TRANSFORM<br />user, AVG(pagerank)<br />(Amy, 0.6)<br />(Fred, 0.4)<br />FILTER<br />avgPR &gt; 0.5<br />(Amy, 0.6)<br />
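The example records on this slide can be traced by hand with a few lines of Python. This is a sketch in plain Python rather than Pig Latin, and the `canonicalize` function here is a hypothetical stand-in for the UDF named on the slide: it is assumed to drop the scheme and path and to add a `www.` prefix, which is enough to reproduce the records shown.

```python
from collections import defaultdict


def canonicalize(url):
    """Hypothetical stand-in for the canonicalize UDF (assumed behavior)."""
    host = url.split("://", 1)[-1].split("/", 1)[0]  # drop scheme and path
    return host if host.startswith("www.") else "www." + host


# LOAD (user, url) and LOAD (url, pagerank): example records from the slide
visits = [("Amy", "cnn.com"),
          ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

# TRANSFORM user, canonicalize(url)
canonical = [(user, canonicalize(url)) for user, url in visits]

# JOIN on url
rank_by_url = dict(pageranks)
joined = [(user, url, rank_by_url[url])
          for user, url in canonical if url in rank_by_url]

# GROUP on user
groups = defaultdict(list)
for user, url, rank in joined:
    groups[user].append(rank)

# TRANSFORM user, AVG(pagerank)
averages = [(user, sum(ranks) / len(ranks)) for user, ranks in groups.items()]

# FILTER avgPR > 0.5
result = [(user, avg) for user, avg in averages if avg > 0.5]
print(result)
```

Printing the intermediate lists (`canonical`, `joined`, `averages`) gives the same per-operator view that the slide shows.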
    11. 11. Value Addition From Examples<br />Examples can be used for<br />Debugging<br />Understanding a program written by someone else<br />Learning a new operator, or language<br />
    12. 12. Outline<br />Formalization of good examples<br />Example Generation Algorithm<br />Performance Evaluation<br />
    13. 13. Good Examples: Consistency<br />0. Consistency<br />output example = operator applied on input example<br />LOAD<br />(user, url)<br />(Amy, cnn.com)<br />(Amy, http://www.frogs.com)<br />(Fred, www.snails.com/index.html)<br />TRANSFORM<br />user, canonicalize(url)<br />(Amy, www.cnn.com)<br />(Amy, www.frogs.com)<br />(Fred, www.snails.com)<br />LOAD<br />(url, pagerank)<br />JOIN<br />on url<br />GROUP<br />on user<br />TRANSFORM<br />user, AVG(pagerank)<br />FILTER<br />avgPR &gt; 0.5<br />
    14. 14. Good Examples: Realism<br />1. Realism<br />Formalization: Fraction of examples that are real or are derived from real records<br />LOAD<br />(user, url)<br />(Amy, cnn.com)<br />(Amy, http://www.frogs.com)<br />(Fred, www.snails.com/index.html)<br />TRANSFORM<br />user, canonicalize(url)<br />(Amy, www.cnn.com)<br />(Amy, www.frogs.com)<br />(Fred, www.snails.com)<br />LOAD<br />(url, pagerank)<br />JOIN<br />on url<br />GROUP<br />on user<br />TRANSFORM<br />user, AVG(pagerank)<br />FILTER<br />avgPR &gt; 0.5<br />
    15. 15. Good Examples: Completeness<br />2. Completeness<br />Demonstrate the salient properties of each operator, e.g., FILTER<br />LOAD<br />(user, url)<br />TRANSFORM<br />user, canonicalize(url)<br />LOAD<br />(url, pagerank)<br />JOIN<br />on url<br />GROUP<br />on user<br />TRANSFORM<br />user, AVG(pagerank)<br />(Amy, 0.6)<br />(Fred, 0.4)<br />FILTER<br />avgPR &gt; 0.5<br />(Amy, 0.6)<br />
    16. 16. Good Examples: Completeness<br />2. Completeness<br />Demonstrate the salient properties of each operator, e.g., JOIN<br />LOAD<br />(user, url)<br />TRANSFORM<br />user, canonicalize(url)<br />(Amy, www.cnn.com)<br />(Amy, www.frogs.com)<br />(Fred, www.snails.com)<br />LOAD<br />(url, pagerank)<br />(www.cnn.com, 0.9)<br />(www.frogs.com, 0.3)<br />(www.snails.com, 0.4)<br />JOIN<br />on url<br />(Amy, www.cnn.com, 0.9)<br />(Amy, www.frogs.com, 0.3)<br />(Fred, www.snails.com, 0.4)<br />GROUP<br />on user<br />TRANSFORM<br />user, AVG(pagerank)<br />FILTER<br />avgPR &gt; 0.5<br />
    17. 17. Formalizing Completeness<br />For any operator, classify input/output example records into equivalence classes.<br />Each equivalence class demonstrates one property of the operator.<br />Try to have at least one example from each class.<br />
    19. 19. Equivalence Class Examples<br />FILTER<br />E0: All input records that pass the filter<br />E1: All input records that fail the filter<br />JOIN<br />E0: All output records<br />UNION<br />E0: All records belonging to first input<br />E1: All records belonging to second input<br />
    20. 20. Formalizing Completeness<br />Operator Completeness: <br /> Fraction of equivalence classes that have at least one example record.<br />Overall Completeness: <br /> Average of per-operator completeness. <br />
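The two completeness definitions above can be written out directly. The sketch below uses the FILTER equivalence classes from the earlier slide (E0 passes, E1 fails); the function names and the dictionary representation of equivalence classes are assumptions made for illustration, not the paper's implementation.

```python
def filter_classes(records, predicate):
    """Equivalence classes for FILTER: E0 = inputs that pass, E1 = inputs that fail."""
    return {
        "E0_pass": [r for r in records if predicate(r)],
        "E1_fail": [r for r in records if not predicate(r)],
    }


def operator_completeness(classes):
    """Fraction of equivalence classes that hold at least one example record."""
    covered = sum(1 for members in classes.values() if members)
    return covered / len(classes)


def overall_completeness(per_operator):
    """Average of the per-operator completeness values."""
    return sum(per_operator) / len(per_operator)


def passes(record):
    """FILTER avgPR > 0.5 on (user, avgPR) records."""
    return record[1] > 0.5


both = filter_classes([("Amy", 0.6), ("Fred", 0.4)], passes)
only_pass = filter_classes([("Amy", 0.6)], passes)

print(operator_completeness(both))       # both classes have an example
print(operator_completeness(only_pass))  # the failing class is empty
```

With (Amy, 0.6) and (Fred, 0.4) as input, both FILTER classes are covered; with (Amy, 0.6) alone, only half of them are.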
    21. 21. Good Examples: Conciseness<br />3. Conciseness<br />Operator Conciseness: (# equivalence classes) / (# example records)<br />Overall Conciseness: Average of per-operator conciseness<br />LOAD<br />(user, url)<br />(Amy, cnn.com)<br />(Amy, http://www.frogs.com)<br />(Fred, www.snails.com/index.html)<br />TRANSFORM<br />user, canonicalize(url)<br />(Amy, www.cnn.com)<br />(Amy, www.frogs.com)<br />(Fred, www.snails.com)<br />LOAD<br />(url, pagerank)<br />JOIN<br />on url<br />GROUP<br />on user<br />TRANSFORM<br />user, AVG(pagerank)<br />FILTER<br />avgPR &gt; 0.5<br />
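As a small worked example of the conciseness ratio, the sketch below computes the per-operator value and the average. The slide does not say whether input or output records are counted, so the record count is left to the caller here; the function names are assumptions for illustration.

```python
def operator_conciseness(num_equivalence_classes, num_example_records):
    """Ratio of equivalence classes to example records for one operator."""
    if num_example_records == 0:
        raise ValueError("conciseness is undefined without example records")
    return num_equivalence_classes / num_example_records


def overall_conciseness(per_operator):
    """Average of the per-operator conciseness values."""
    return sum(per_operator) / len(per_operator)


# FILTER has two equivalence classes (pass, fail).
tight = operator_conciseness(2, 2)  # one example record per class
loose = operator_conciseness(2, 4)  # twice as many records as classes
print(tight, loose, overall_conciseness([tight, loose]))
```

Under this ratio, adding records that fall into an already covered class lowers conciseness without raising completeness, which is what the pruning passes try to avoid.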
    22. 22. Outline<br />Formalization of good examples<br />Example Generation Algorithm<br />Performance Evaluation<br />
    23. 23. Related Work<br />Related Areas:<br />Reverse Query Processing<br />Database Testing<br />Software and Hardware Verification<br />Differences<br />Realism not a concern<br />Notion of conciseness is different<br />Intermediate result size is immaterial<br />
    24. 24. Strawman I: Downstream Propagation<br />Take some portion of input data and run the program over it.<br />1. Realism<br />2. Completeness<br />3. Conciseness<br />
    25. 25. Strawman II: Upstream Propagation<br />Start from what output is desired, and work backwards<br />1. Realism<br />2. Completeness<br />3. Conciseness<br />
    26. 26. Our Algorithm<br />Algorithm Passes<br />Downstream <br />Pruning<br />Upstream<br />Pruning<br />
    27. 27. Our Algorithm<br />Algorithm Passes<br />Downstream<br />Pruning<br />Upstream<br />Pruning<br />Take a subset of input and propagate through the program.<br />(Jack, 30)<br />LOAD<br />(user, age)<br />UNION<br />FILTER<br />age&gt;18<br />LOAD<br />(user, age)<br />FILTER<br />udf(user)<br />(Amy, 20) <br />(Fred, 25)<br />(Jack, 30)<br />(Amy, 20) <br />(Fred, 25)<br />(Jack, 30)<br />(Amy, 20) <br />(Fred, 25)<br />(Amy, 20) <br />(Fred, 25)<br />
    28. 28. Our Algorithm<br />Algorithm Passes<br />Downstream<br />Pruning<br />Upstream<br />Pruning<br />Prune redundant examples, i.e., improve conciseness without hurting completeness. <br />(Jack, 30)<br />LOAD<br />(user, age)<br />UNION<br />FILTER<br />age&gt;18<br />LOAD<br />(user, age)<br />FILTER<br />udf(user)<br />(Amy, 20) <br />(Fred, 25)<br />(Jack, 30)<br />(Amy, 20) <br />(Fred, 25)<br />(Jack, 30)<br />(Amy, 20) <br />(Fred, 25)<br />(Amy, 20) <br />(Fred, 25)<br />
    30. 30. Our Algorithm<br />Algorithm Passes<br />Downstream<br />Pruning<br />Upstream<br />Pruning<br />Prune redundant examples, i.e., improve conciseness without hurting completeness. <br />(Jack, 30)<br />LOAD<br />(user, age)<br />UNION<br />FILTER<br />age&gt;18<br />LOAD<br />(user, age)<br />FILTER<br />udf(user)<br />(Amy, 20) <br />(Jack, 30)<br />(Amy, 20) <br />(Jack, 30)<br />(Amy, 20) <br />(Amy, 20) <br />
    31. 31. Formalization of Pruning<br />Example records correspond to elements<br />Equivalence classes correspond to sets<br />Pick minimum # records to cover every equivalence class<br />Set-Cover Problem<br />More involved because completeness of other operators must be maintained; details in paper<br />
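The covering idea on this slide can be sketched with the standard greedy heuristic for covering problems. This is not claimed to be the authors' pruning procedure: the slide notes that the real problem is more involved because the completeness of other operators must be maintained. The class names and the assignment of records to classes below are hypothetical, chosen to resemble the UNION and FILTER example in the talk.

```python
def greedy_cover(class_members):
    """Pick a small list of records that hits every non-empty equivalence class.

    `class_members` maps a class name to the set of example records in it.
    """
    uncovered = {name for name, members in class_members.items() if members}
    records = set().union(*class_members.values())
    chosen = []
    while uncovered:
        # Take the record that belongs to the most still-uncovered classes;
        # sorting first makes tie-breaking deterministic.
        best = max(sorted(records),
                   key=lambda r: sum(1 for c in uncovered if r in class_members[c]))
        chosen.append(best)
        uncovered -= {c for c in uncovered if best in class_members[c]}
    return chosen


# Hypothetical class membership for illustration only.
classes = {
    "union_first_input": {("Jack", 30)},
    "union_second_input": {("Amy", 20), ("Fred", 25)},
    "filter_pass": {("Amy", 20), ("Fred", 25), ("Jack", 30)},
}
kept = greedy_cover(classes)
print(kept)
```

In this toy input, one of the two records from the second UNION input is redundant, so the cover keeps a single record from each input.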
    32. 32. Our Algorithm<br />Algorithm Passes<br />Downstream<br />Pruning<br />Upstream<br />Pruning<br />Enhance completeness by inserting constraint records (best effort; details in paper)<br />(Jack, 30)<br />LOAD<br />(user, age)<br />UNION<br />FILTER<br />age&gt;18<br />LOAD<br />(user, age)<br />FILTER<br />udf(user)<br />(Amy, 20) <br />(Jack, 30)<br />(Amy, 20) <br />(Jack, 30)<br />(Amy, 20) <br />(Amy, 20) <br />
    33. 33. Our Algorithm<br />Algorithm Passes<br />Downstream<br />Pruning<br />Upstream<br />Pruning<br />Enhance completeness by inserting constraint records (best effort; details in paper)<br />(Jack, 30)<br />LOAD<br />(user, age)<br />UNION<br />FILTER<br />age&gt;18<br />LOAD<br />(user, age)<br />FILTER<br />udf(user)<br />(Amy, 20) <br />(Jack, 30)<br />(Amy, 20) <br />(Jack, 30)<br />(--, 17) <br />(Amy, 20) <br />(Amy, 20) <br />
    34. 34. Our Algorithm<br />Algorithm Passes<br />Downstream<br />Pruning<br />Upstream<br />Pruning<br />Enhance completeness by inserting constraint records (best effort; details in paper)<br />(Jack, 30)<br />(--, 17)<br />LOAD<br />(user, age)<br />UNION<br />FILTER<br />age&gt;18<br />LOAD<br />(user, age)<br />FILTER<br />udf(user)<br />(Amy, 20) <br />(Jack, 30)<br />(Amy, 20) <br />(Jack, 30)<br />(--, 17) <br />(Amy, 20)<br />(--, 17)<br />(Amy, 20)<br />(Bill, 17)<br />
    35. 35. Our Algorithm<br />Algorithm Passes<br />Downstream<br />Pruning<br />Upstream<br />Pruning<br />Enhance completeness by inserting constraint records (best effort; details in paper)<br />(Jack, 30)<br />(Bob, 17)<br />LOAD<br />(user, age)<br />UNION<br />FILTER<br />age&gt;18<br />LOAD<br />(user, age)<br />FILTER<br />udf(user)<br />(Amy, 20) <br />(Jack, 30)<br />(Amy, 20) <br />(Jack, 30)<br />(Bill, 17)<br />(Bob, 17) <br />(Amy, 20)<br />(Bill, 17)<br />(Amy, 20)<br />(Bill, 17) <br />
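The upstream step shown on these slides, where a record such as (--, 17) is inserted and later filled in as (Bill, 17), can be sketched for the simple case of a threshold filter. Everything here is an assumption made for illustration: integer ages, a filter of the form `age > threshold`, `None` standing in for the "--" don't-care field, and the helper names. The paper's best-effort handling of general operators is not reproduced.

```python
def missing_filter_classes(records, predicate):
    """Return which FILTER equivalence classes ('pass', 'fail') have no example."""
    missing = []
    if not any(predicate(r) for r in records):
        missing.append("pass")
    if not any(not predicate(r) for r in records):
        missing.append("fail")
    return missing


def constraint_record(threshold, want_pass):
    """Synthesize a (user, age) record for the filter `age > threshold`.

    None marks a field the filter does not constrain (shown as '--' on the slide).
    """
    age = threshold + 1 if want_pass else threshold - 1
    return (None, age)


def fill_unknowns(record, default_user):
    """Replace don't-care fields with a concrete value."""
    user, age = record
    return (default_user if user is None else user, age)


examples = [("Amy", 20), ("Jack", 30)]
missing = missing_filter_classes(examples, lambda r: r[1] > 18)
synthetic = [constraint_record(18, want_pass=(name == "pass")) for name in missing]
filled = [fill_unknowns(r, "Bill") for r in synthetic]
print(missing, synthetic, filled)
```

Synthetic records like these raise completeness at the cost of realism, which is why the final pruning pass favors real examples over synthetic ones.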
    36. 36. Our Algorithm<br />Algorithm Passes<br />Downstream<br />Pruning<br />Upstream<br />Pruning<br />Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.<br />(Jack, 30)<br />(Bob, 17)<br />LOAD<br />(user, age)<br />UNION<br />FILTER<br />age&gt;18<br />LOAD<br />(user, age)<br />FILTER<br />udf(user)<br />(Amy, 20) <br />(Jack, 30)<br />(Amy, 20) <br />(Jack, 30)<br />(Bill, 17)<br />(Bob, 17) <br />(Amy, 20)<br />(Bill, 17)<br />(Amy, 20)<br />(Bill, 17) <br />
    38. 38. Our Algorithm<br />Algorithm Passes<br />Downstream<br />Pruning<br />Upstream<br />Pruning<br />Prune redundant examples (as in Pass 2). Favor real examples over synthetic ones.<br />(Jack, 30)<br />LOAD<br />(user, age)<br />UNION<br />FILTER<br />age&gt;18<br />LOAD<br />(user, age)<br />FILTER<br />udf(user)<br />(Jack, 30)<br />(Jack, 30)<br />(Bill, 17)<br />(Bill, 17)<br />(Bill, 17) <br />
    39. 39. Implementation Status<br />Available as ILLUSTRATE command in open-source release of Pig<br />Available as Eclipse Plugin (PigPen)<br />
    40. 40. PigPen Snapshot<br />
    41. 41. Performance Evaluation<br />Program I: (Web Search Result Viewing Statistics)<br />LOAD<br />FILTER by compound arithmetic expression<br />GROUP<br />TRANSFORM using built-in aggregate function<br />
    42. 42. Performance on Program I<br />
    43. 43. Performance Evaluation<br />Program II: (Web Advertising Activity)<br />LOAD table A<br />FILTER A by compound logical expression<br />JOIN with table B (highly selective)<br />TRANSFORM using 4 string manipulation UDFs (non-invertible)<br />
    44. 44. Performance on Program II<br />
    45. 45. Running Time<br />
    46. 46. Conclusions<br />Writing dataflow programs is an iterative process.<br />Actual dataset too large for test runs.<br />Our algorithm can automatically generate examples that illustrate the program through:<br />Realism<br />Conciseness<br />Completeness<br />
