Example Generation for Data Flow Programs - Presentation Transcript
Generating Example Data for Dataflow Programs
Chris Olston, Shubham Chopra, Utkarsh Srivastava
Data Processing Renaissance
Lots of data (TBs/day at Yahoo!)
Lots of queries and programs to analyze that data
New data flow languages
Map-Reduce, Pig Latin, Dryad
Other data flow systems
Aurora, Tioga, River
Example Dataflow Program
Goal: find users that tend to visit high-pagerank pages
LOAD (user, url)
TRANSFORM user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
Iterative Process
LOAD (user, url)
TRANSFORM user, canonicalize(url): bug in UDF canonicalize?
LOAD (url, pagerank)
JOIN on url: joining on the right attribute?
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5: everything being filtered out?
Result: no output
How to do test runs?
Run with real data: too inefficient (TBs of data)
Create smaller data sets (e.g., by sampling): empty results due to joins [Chaudhuri et al. 99] and selective filters
Biased sampling for joins: indexes not always present
Value Addition From Examples
Examples can be used for:
Debugging
Understanding a program written by someone else
Learning a new operator or language
Outline
Formalization of good examples
Example Generation Algorithm
Performance Evaluation
Good Examples: Consistency
0. Consistency: output example = operator applied on input example
LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
LOAD (url, pagerank)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
Good Examples: Realism
1. Realism
Formalization: fraction of examples that are real or are derived from real records
LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
LOAD (url, pagerank)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator, e.g., FILTER
LOAD (user, url)
TRANSFORM user, canonicalize(url)
LOAD (url, pagerank)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank): (Amy, 0.6), (Fred, 0.4)
FILTER avgPR > 0.5: (Amy, 0.6)
Good Examples: Completeness
2. Completeness: demonstrate the salient properties of each operator, e.g., JOIN
LOAD (user, url)
TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
LOAD (url, pagerank): (www.cnn.com, 0.9), (www.frogs.com, 0.3), (www.snails.com, 0.4)
JOIN on url: (Amy, www.cnn.com, 0.9), (Amy, www.frogs.com, 0.3), (Fred, www.snails.com, 0.4)
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
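The example records on the slides can be traced through the whole program with a small sketch. This is only an illustration in Python, not Pig Latin and not the system from the talk. The body of canonicalize is a hypothetical stand-in for the UDF (the slides do not show its implementation), and the join assumes one pagerank per url.

```python
from collections import defaultdict
from urllib.parse import urlparse


def canonicalize(url):
    """Hypothetical stand-in for the canonicalize UDF: keep only 'www.<host>'."""
    # urlparse only recognizes the host when a scheme or '//' is present.
    parsed = urlparse(url if "//" in url else "//" + url)
    host = parsed.netloc.lower()
    return host if host.startswith("www.") else "www." + host


def run_program(visits, pageranks, threshold=0.5):
    """Run the dataflow and return every operator's output, keyed by name."""
    steps = {}
    steps["canonicalize"] = [(user, canonicalize(url)) for user, url in visits]
    rank_of = dict(pageranks)  # assumption: one pagerank per url
    steps["join"] = [(user, url, rank_of[url])
                     for user, url in steps["canonicalize"] if url in rank_of]
    groups = defaultdict(list)  # GROUP on user
    for user, _url, rank in steps["join"]:
        groups[user].append(rank)
    steps["avg"] = [(user, sum(r) / len(r)) for user, r in groups.items()]
    steps["filter"] = [(u, avg) for u, avg in steps["avg"] if avg > threshold]
    return steps


visits = [("Amy", "cnn.com"),
          ("Amy", "http://www.frogs.com"),
          ("Fred", "www.snails.com/index.html")]
pageranks = [("www.cnn.com", 0.9), ("www.frogs.com", 0.3), ("www.snails.com", 0.4)]

steps = run_program(visits, pageranks)
for name, records in steps.items():
    print(name, records)
```

With these inputs Amy averages about 0.6 and Fred 0.4, so only Amy passes the filter, which matches the records shown on the FILTER slide.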
Formalizing Completeness
For any operator, classify input/output example records into equivalence classes.
Each equivalence class demonstrates one property of the operator.
Try to have at least one example from each class
Equivalence Class Examples
FILTER
E0: All input records that pass the filter
E1: All input records that fail the filter
JOIN
E0: All output records
UNION
E0: All records belonging to first input
E1: All records belonging to second input
Formalizing Completeness
Operator Completeness: fraction of equivalence classes that have at least one example record.
Overall Completeness: average of per-operator completeness.
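The two definitions can be sketched in a few lines of Python. The function names and the dictionary layout (class name mapped to its example records) are assumptions made for this illustration; only the FILTER and JOIN classes themselves come from the slides.

```python
def filter_classes(records, predicate):
    """FILTER has two classes: E0 = inputs that pass, E1 = inputs that fail."""
    return {
        "E0": [r for r in records if predicate(r)],
        "E1": [r for r in records if not predicate(r)],
    }


def operator_completeness(classes):
    """Fraction of equivalence classes holding at least one example record."""
    return sum(1 for members in classes.values() if members) / len(classes)


def overall_completeness(per_operator_classes):
    """Average of the per-operator completeness values."""
    scores = [operator_completeness(c) for c in per_operator_classes]
    return sum(scores) / len(scores)


def passes(record):
    # The FILTER avgPR > 0.5 predicate from the example program.
    return record[1] > 0.5


good = filter_classes([("Amy", 0.6), ("Fred", 0.4)], passes)  # both classes hit
poor = filter_classes([("Amy", 0.6)], passes)                 # E1 is empty
join = {"E0": [("Amy", "www.cnn.com", 0.9)]}                  # JOIN has one class

print(operator_completeness(good))         # 1.0
print(operator_completeness(poor))         # 0.5
print(overall_completeness([poor, join]))  # 0.75
```

An example set in which every record passes the filter is only half complete for FILTER, because nothing shows the operator discarding a record.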
Good Examples: Conciseness
3. Conciseness
Operator Conciseness: (# equivalence classes) / (# example records)
Overall Conciseness: average of per-operator conciseness
LOAD (user, url): (Amy, cnn.com), (Amy, http://www.frogs.com), (Fred, www.snails.com/index.html)
TRANSFORM user, canonicalize(url): (Amy, www.cnn.com), (Amy, www.frogs.com), (Fred, www.snails.com)
LOAD (url, pagerank)
JOIN on url
GROUP on user
TRANSFORM user, AVG(pagerank)
FILTER avgPR > 0.5
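As a rough sketch, the conciseness ratio can be computed as follows. The function names are assumptions for this illustration, and the slide does not say how an operator with zero example records should be scored, so the sketch simply rejects that case.

```python
def operator_conciseness(num_classes, num_records):
    """Ratio of equivalence classes to example records for one operator."""
    if num_records <= 0:
        # Assumption: the ratio is left undefined without example records.
        raise ValueError("need at least one example record")
    return num_classes / num_records


def overall_conciseness(operators):
    """Average of per-operator conciseness over (classes, records) pairs."""
    scores = [operator_conciseness(c, n) for c, n in operators]
    return sum(scores) / len(scores)


# FILTER has 2 classes: two example records are ideal, four are redundant.
print(operator_conciseness(2, 2))             # 1.0
print(operator_conciseness(2, 4))             # 0.5
print(overall_conciseness([(2, 2), (2, 4)]))  # 0.75
```

The ratio drops as redundant records pile up in classes that are already covered, which is what pushes the example set toward being small.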
Outline
Formalization of good examples
Example Generation Algorithm
Performance Evaluation
Related Work
Related areas:
Reverse Query Processing
Database Testing
Software and Hardware Verification
Differences:
Realism not a concern
Notion of conciseness is different
Intermediate result size is immaterial
Strawman I: Downstream Propagation
Take some portion of input data and run the program over it.
1. Realism
2. Completeness
3. Conciseness
Strawman II: Upstream Propagation
Start from what output is desired, and work backwards.
1. Realism
2. Completeness
3. Conciseness
Formalization of Pruning
Example records correspond to elements
Equivalence classes correspond to sets
Pick minimum #records to cover every equivalence class: Set-Cover Problem
More involved because completeness of other operators must be maintained; details in paper
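The covering idea can be illustrated with the standard greedy heuristic for set cover. This is a simplified sketch and not the algorithm from the paper: it handles a single operator and ignores the extra constraint that the completeness of other operators must be preserved. The function name, the class labels, and the record (Bob, 0.7) are hypothetical.

```python
def greedy_prune(classes_of):
    """Pick few records that together touch every equivalence class.

    classes_of maps each example record to the set of equivalence
    classes it belongs to. Greedy choice: repeatedly keep the record
    that covers the most classes not yet covered.
    """
    uncovered = set().union(*classes_of.values())
    kept = []
    while uncovered:
        best = max(classes_of, key=lambda r: len(classes_of[r] & uncovered))
        kept.append(best)
        uncovered -= classes_of[best]
    return kept


# FILTER avgPR > 0.5: two records pass (E0), one fails (E1).
candidates = {
    ("Amy", 0.6): {"FILTER.E0"},
    ("Bob", 0.7): {"FILTER.E0"},   # hypothetical extra record
    ("Fred", 0.4): {"FILTER.E1"},
}
print(greedy_prune(candidates))  # [('Amy', 0.6), ('Fred', 0.4)]
```

The greedy choice is not guaranteed to find the minimum, but it drops the redundant passing record here while keeping one example for each class.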
Implementation Status
Available as ILLUSTRATE command in open-source release of Pig
Available as Eclipse Plugin (PigPen)
PigPen Snapshot
Performance Evaluation
Program I: (Web Search Result Viewing Statistics)
LOAD
FILTER by compound arithmetic expression
GROUP
TRANSFORM using built-in aggregate function
Performance on Program I
Performance Evaluation
Program II: (Web Advertising Activity)
LOAD table A
FILTER A by compound logical expression
JOIN with table B (highly selective)
TRANSFORM using 4 string manipulation UDFs (non-invertible)
Performance on Program II
Running Time
Conclusions
Writing dataflow programs is an iterative process.
Actual dataset too large for test runs.
Our algorithm can automatically generate examples that illustrate the program through: