Integrating Conflicting Data_PVERConf_May2011

May 2011 Personal Vali

  • While I expected online blogs and maybe some smaller papers to use the quote, I did not think it would have a major impact. I was wrong. Quality newspapers in England, India, America and as far away as Australia had my words in their reports of Jarre’s death. I was shocked that highly respected newspapers would use material from Wikipedia without first sourcing and referencing it properly.
  • On July 19, 2010, Shirley Sherrod was forced to resign from her position as Georgia State Director of Rural Development for the United States Department of Agriculture [1] after conservative blogger Andrew Breitbart posted video excerpts of Sherrod's address at a March 2010 NAACP event to his website. [2] The NAACP condemned her remarks, and U.S. government officials called on her to resign. However, upon review of the unedited video in context, the NAACP, White House officials, and Tom Vilsack, the Secretary of Agriculture, apologized and Sherrod was offered a new position. The first news outlet to report on the Breitbart video was FoxNews.com, which posted an article about the story on its website. [16] [14] The New York City affiliate for CBS also posted a report on its website later that afternoon. [14] The Atlanta Journal Constitution newspaper's related website also soon picked up the story. [17] In addition, the story was picked up and reported widely in the blogosphere. [14] According to Sherrod, that afternoon she received numerous demands from government officials to submit her resignation, demands that she characterized as harassment. [18] In response to a call from USDA deputy undersecretary Cheryl Cook, Sherrod submitted her resignation via email. Sherrod claims that Cook told her White House officials wanted her to quit immediately because the controversy was "going to be on Glenn Beck tonight", [18] a claim disputed by White House Press Secretary Robert Gibbs. [6]
  • (American short-story writer and novelist, Nobel Prize for Literature in 1949, 1897-1962)

Transcript

  • 1.
    • Xin Luna Dong (AT&T Labs—Research)
    • Laure Berti (Universite de Rennes 1, visiting AT&T)
    • Divesh Srivastava (AT&T Labs—Research)
  • 2. The WWW is Great
  • 3.  
  • 4. False Information on the Web (I)
    • Maurice Jarre (1924-2009), French conductor and composer
    • “One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die there will be a final waltz playing in my head and that only I can hear.”
    • 2:29, 30 March 2009
  • 5. False Information on the Web (II)
    • Posted by Andrew Breitbart in his blog …
  • 6.
    • We now live in this media culture where something goes up on YouTube or a blog and everybody scrambles. - Barack Obama
    • The Internet needs a way to help people separate rumor from real science. - Tim Berners-Lee
  • 7. Why is the Problem Hard? (A Well-Predicted Problem)
    • Facts and truth really don’t have much to do with each other.
    • — William Faulkner
                  S1      S2        S3
    Stonebraker   MIT     Berkeley  MIT
    Dewitt        MSR     MSR       UWisc
    Bernstein     MSR     MSR       MSR
    Carey         UCI     AT&T      BEA
    Halevy        Google  Google    UW
  • 8. Why is the Problem Hard? (A Well-Predicted Problem)
    • Facts and truth really don’t have much to do with each other.
    • — William Faulkner
    • Naïve voting works
                  S1      S2        S3
    Stonebraker   MIT     Berkeley  MIT
    Dewitt        MSR     MSR       UWisc
    Bernstein     MSR     MSR       MSR
    Carey         UCI     AT&T      BEA
    Halevy        Google  Google    UW
  • 9. Why is the Problem Hard? (A Well-Predicted Problem)
    • A lie told often enough becomes the truth. — Vladimir Lenin
    • Naïve voting works only if data sources are independent.
                  S1      S2        S3     S4     S5
    Stonebraker   MIT     Berkeley  MIT    MIT    MS
    Dewitt        MSR     MSR       UWisc  UWisc  UWisc
    Bernstein     MSR     MSR       MSR    MSR    MSR
    Carey         UCI     AT&T      BEA    BEA    BEA
    Halevy        Google  Google    UW     UW     UW
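The effect of near-duplicate sources on naïve voting can be checked directly on the table above. The following is a minimal sketch, not the authors' code: the names `CLAIMS`, `OBJECTS`, and `naive_vote` are hypothetical, and ties are reported as a list of all top-voted values rather than broken arbitrarily.

```python
from collections import Counter

# Values copied from the five-source table on the slide.
OBJECTS = ["Stonebraker", "Dewitt", "Bernstein", "Carey", "Halevy"]
CLAIMS = {
    "S1": ["MIT", "MSR", "MSR", "UCI", "Google"],
    "S2": ["Berkeley", "MSR", "MSR", "AT&T", "Google"],
    "S3": ["MIT", "UWisc", "MSR", "BEA", "UW"],
    "S4": ["MIT", "UWisc", "MSR", "BEA", "UW"],
    "S5": ["MS", "UWisc", "MSR", "BEA", "UW"],
}


def naive_vote(sources):
    """Return, per object, the sorted list of values with the most votes."""
    winners = {}
    for i, obj in enumerate(OBJECTS):
        counts = Counter(CLAIMS[s][i] for s in sources)
        top = max(counts.values())
        winners[obj] = sorted(v for v, n in counts.items() if n == top)
    return winners


three = naive_vote(["S1", "S2", "S3"])
five = naive_vote(["S1", "S2", "S3", "S4", "S5"])
for obj in OBJECTS:
    print(f"{obj:12s} 3 sources: {three[obj]}  5 sources: {five[obj]}")
```

With S1 to S3 only, Dewitt and Halevy resolve to MSR and Google; once S4 and S5 are added, which repeat S3's values almost verbatim, the winners flip to UWisc and UW.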
  • 10. Our Goal: Truth Discovery w. Awareness of Dependence Between Sources
    • You can fool some of the people all the time, and all of the people some of the time, but you cannot fool all of the people all the time. – Abraham Lincoln
    • Naïve voting works only if data sources are independent.
                  S1      S2        S3     S4     S5
    Stonebraker   MIT     Berkeley  MIT    MIT    MS
    Dewitt        MSR     MSR       UWisc  UWisc  UWisc
    Bernstein     MSR     MSR       MSR    MSR    MSR
    Carey         UCI     AT&T      BEA    BEA    BEA
    Halevy        Google  Google    UW     UW     UW
  • 11. Challenges in Dependence Discovery
    • 1. Sharing common data does not in itself imply copying.
    • 2. With only a snapshot it is hard to decide which source is a copier.
    • 3. A copier can also provide or verify some data by itself, so it is inappropriate to ignore all of its data.
                  S1      S2        S3     S4     S5
    Stonebraker   MIT     Berkeley  MIT    MIT    MS
    Dewitt        MSR     MSR       UWisc  UWisc  UWisc
    Bernstein     MSR     MSR       MSR    MSR    MSR
    Carey         UCI     AT&T      BEA    BEA    BEA
    Halevy        Google  Google    UW     UW     UW
  • 12. High-Level Intuitions for Dependence Detection
    • Intuition I: decide dependence (w/o direction)
      • Let D1, D2 be data from two sources. D1 and D2 are dependent if
      • Pr(D1, D2) ≠ Pr(D1) × Pr(D2).
  • 13. Dependence?
    • Source 1 on USA Presidents:
    • 1st: George Washington
    • 2nd: John Adams
    • 3rd: Thomas Jefferson
    • 4th: James Madison
    • 41st: George H.W. Bush
    • 42nd: William J. Clinton
    • 43rd: George W. Bush
    • 44th: Barack Obama
    • Source 2 on USA Presidents:
    • 1st: George Washington
    • 2nd: John Adams
    • 3rd: Thomas Jefferson
    • 4th: James Madison
    • 41st: George H.W. Bush
    • 42nd: William J. Clinton
    • 43rd: George W. Bush
    • 44th: Barack Obama
    • Are Source 1 and Source 2 dependent? Not necessarily.
  • 14. Dependence?
    • Source 1 on USA Presidents:
    • 1st: George Washington
    • 2nd: Benjamin Franklin
    • 3rd: Tom Jefferson
    • 4th: Abraham Lincoln
    • 41st: George W. Bush
    • 42nd: Hillary Clinton
    • 43rd: Mickey Mouse
    • 44th: Barack Obama
    • Source 2 on USA Presidents:
    • 1st: George Washington
    • 2nd: Benjamin Franklin
    • 3rd: Tom Jefferson
    • 4th: Abraham Lincoln
    • 41st: George W. Bush
    • 42nd: Hillary Clinton
    • 43rd: Mickey Mouse
    • 44th: John McCain
    • Are Source 1 and Source 2 dependent? Very likely (common errors).
  • 15. High-Level Intuitions for Dependence Detection
    • Intuition I: decide dependence (w/o direction)
      • Let D1, D2 be data from two sources. D1 and D2 are dependent if
      • Pr(D1, D2) ≠ Pr(D1) × Pr(D2).
    • Intuition II: decide copying direction
      • Let F be a property function of the data; e.g., accuracy of data. D1 is likely to be dependent on D2 if
      • |F(D1 ∩ D2) - F(D1 - D2)| > |F(D1 ∩ D2) - F(D2 - D1)|.
  • 16. Dependence?
    • Source 2 on USA Presidents:
    • 1st: George Washington
    • 2nd: Benjamin Franklin
    • 3rd: Tom Jefferson
    • 4th: Abraham Lincoln
    • 41st: George W. Bush
    • 42nd: Hillary Clinton
    • 43rd: Mickey Mouse
    • 44th: John McCain
    • Source 1 on USA Presidents:
    • 1st: George Washington
    • 2nd: John Adams
    • 3rd: Thomas Jefferson
    • 4th: Abraham Lincoln
    • 41st: George W. Bush
    • 42nd: Hillary Clinton
    • 43rd: George W. Bush
    • 44th: John McCain
    • Are Source 1 and Source 2 dependent? Different accuracy: S1 is more likely to be a copier.
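The copying-direction intuition can be worked through on this slide's two lists. This is a rough sketch under stated assumptions: F is taken to be accuracy, the reference list `TRUTH` is the correct list shown on the earlier "Dependence?" slide, values are compared by exact string match (so "Tom Jefferson" counts as wrong), and all names in the code are hypothetical. In practice the truth is unknown and accuracy would have to be estimated.

```python
# Reference values, taken from the correct list shown on an earlier slide.
TRUTH = {1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
         4: "James Madison", 41: "George H.W. Bush", 42: "William J. Clinton",
         43: "George W. Bush", 44: "Barack Obama"}
S1 = {1: "George Washington", 2: "John Adams", 3: "Thomas Jefferson",
      4: "Abraham Lincoln", 41: "George W. Bush", 42: "Hillary Clinton",
      43: "George W. Bush", 44: "John McCain"}
S2 = {1: "George Washington", 2: "Benjamin Franklin", 3: "Tom Jefferson",
      4: "Abraham Lincoln", 41: "George W. Bush", 42: "Hillary Clinton",
      43: "Mickey Mouse", 44: "John McCain"}


def accuracy(values):
    """F: fraction of the given values that match the reference list."""
    if not values:
        return 0.0
    return sum(v == TRUTH[k] for k, v in values.items()) / len(values)


# D1 ∩ D2: values both sources agree on; D1 - D2 and D2 - D1: the rest.
shared = {k: v for k, v in S1.items() if S2.get(k) == v}
only_s1 = {k: v for k, v in S1.items() if S2.get(k) != v}
only_s2 = {k: v for k, v in S2.items() if S1.get(k) != v}

gap_s1 = abs(accuracy(shared) - accuracy(only_s1))  # |F(D1∩D2) - F(D1-D2)|
gap_s2 = abs(accuracy(shared) - accuracy(only_s2))  # |F(D1∩D2) - F(D2-D1)|
likely_copier = "S1" if gap_s1 > gap_s2 else "S2"
print(gap_s1, gap_s2, likely_copier)
```

The shared values are 1/5 accurate, S1's own values 3/3, and S2's own values 0/3, so the gaps are about 0.8 for S1 and 0.2 for S2: the shared data looks like S2's data, which points to S1 as the copier.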
  • 17. Outline
    • Motivation and intuitions for solution
    • Techniques
      • Problem definition
      • Copying detection
      • Truth discovery
    • Experimental Results
    • Framework of the Solomon project
  • 18. Problem Definition
    • INPUT
      • Objects: an aspect of a real-world entity
        • E.g., director of a movie, author list of a book
        • Each associated with one true value
      • Sources: each providing values for a subset of objects
    • OUTPUT: the true value for each object
  • 19. Source Dependence
    • Source dependence: two sources S and T deriving the same part of data directly or transitively from a common source (can be one of S or T).
      • Independent source
      • Copier
        • copying part (or all) of data from other sources
        • may verify or revise some of the copied values
        • may add additional values
    • Assumptions
      • Independent values
      • Independent copying
      • No loop copying
  • 20. Models for a Static World
    • Core case
      • Conditions
        • Same source accuracy
        • Uniform false-value distribution
        • Categorical value
      • Proposition: With independent “good” sources, naïve voting selects the values with the highest probability of being true.
    • Models
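      • Depen: base model for the core case
      • Accu: remove Cond 1 (same source accuracy)
      • NonUni: remove Cond 2 (uniform false-value distribution)
      • Sim: remove Cond 3 (categorical value)
      • AccuPR: consider value probabilities in dependence analysis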
  • 21. Bayesian Analysis – Basic
    • Observation: Φ
    • Goal: Pr(S1⊥S2 | Φ), Pr(S1∼S2 | Φ) (sum up to 1), where ⊥ denotes independence and ∼ dependence
    • According to Bayes' rule, we need to know
    • Pr(Φ | S1⊥S2), Pr(Φ | S1∼S2)
    • Key: computing Pr(Φ_O.A | S1⊥S2), Pr(Φ_O.A | S1∼S2) for each O.A ∈ S1 ∩ S2
    • [Figure: objects in S1 ∩ S2 split into same values (TRUE: O.A_t, FALSE: O.A_f) and different values (O.A_d)]
  • 22. Bayesian Analysis – Probability Computation
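    • ε: error rate; n: number of wrong values; c: copy rate
    • [Table: Pr of each case (O.A_t, O.A_f, O.A_d) under Independence vs. Copying; the cell formulas are not preserved in this transcript]
    • [Figure: objects in S1 ∩ S2 split into same values (TRUE: O.A_t, FALSE: O.A_f) and different values (O.A_d)]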
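Since the per-case probabilities did not survive in the transcript, the sketch below fills them in with one plausible set, and that choice is an assumption rather than the slide's content: two independent sources agree on the true value with probability (1-ε)², agree on the same false value with probability ε²/n, and otherwise differ; a copier copies a value with probability c and otherwise behaves independently. The function name `copy_posterior`, the prior `alpha`, and all default parameter values are hypothetical.

```python
import math


def copy_posterior(kt, kf, kd, eps=0.2, n=100, c=0.8, alpha=0.5):
    """Posterior probability of copying, given counts of shared objects with
    the same true value (kt), the same false value (kf), different values (kd).

    Assumed per-object probabilities (not taken from the slide):
      independent: Pt = (1-eps)^2, Pf = eps^2/n, Pd = 1 - Pt - Pf
      copying:     Qt = (1-eps)*c + Pt*(1-c), Qf = eps*c + Pf*(1-c), Qd = Pd*(1-c)
    alpha is the assumed prior probability of copying.
    """
    p_t = (1 - eps) ** 2
    p_f = eps ** 2 / n
    p_d = 1 - p_t - p_f
    q_t = (1 - eps) * c + p_t * (1 - c)
    q_f = eps * c + p_f * (1 - c)
    q_d = p_d * (1 - c)
    # Work in log space so that many shared objects do not underflow.
    log_ind = (math.log(1 - alpha) + kt * math.log(p_t)
               + kf * math.log(p_f) + kd * math.log(p_d))
    log_dep = (math.log(alpha) + kt * math.log(q_t)
               + kf * math.log(q_f) + kd * math.log(q_d))
    top = max(log_ind, log_dep)
    w_ind = math.exp(log_ind - top)
    w_dep = math.exp(log_dep - top)
    return w_dep / (w_ind + w_dep)  # Bayes' rule; the two posteriors sum to 1


shared_truths = copy_posterior(kt=8, kf=0, kd=0)   # two fully correct lists
shared_errors = copy_posterior(kt=1, kf=6, kd=1)   # six identical mistakes
all_different = copy_posterior(kt=0, kf=0, kd=8)
print(shared_truths, shared_errors, all_different)
```

Under these assumptions, sharing eight true values is only moderate evidence of copying, sharing six identical false values is overwhelming evidence, and disagreeing everywhere is strong evidence of independence, which matches the two presidents examples.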
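  • 23. Considering Source Accuracy
    • [Table: Pr of each case (O.A_t, O.A_f, O.A_d) under Independence, S1 Copies S2, and S2 Copies S1; the two copying directions are marked as unequal (≠); the cell formulas are not preserved in this transcript]
    • [Figure: objects in S1 ∩ S2 split into same values (TRUE: O.A_t, FALSE: O.A_f) and different values (O.A_d)]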
  • 24. II. Finding the True Value
    • Consider dependence
  • 25. Solution on the Motivating Example
    • Round 1
    • Copying relationship: [Figure: graph over S1 to S5; edge probabilities in extraction order: .87, .2, .2, .99, .99, .99]
    • Truth discovery: discounted votes (1-.99*.8=.2), (.2^2); values shown: UCI, AT&T, BEA
    • [Table: the five-source table from slide 9]
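The annotation (1-.99*.8=.2) suggests that a likely copier's vote is discounted by the copy rate times the copying probability. A minimal sketch of such discounted voting follows, with several assumptions: which edge carries which probability is guessed (.87 between S1 and S2, .99 among S3, S4, S5), sources are processed in name order, and `discounted_votes`, `DEWITT`, and `COPY_PROB` are hypothetical names.

```python
from collections import defaultdict


def discounted_votes(claims, copy_prob, c=0.8):
    """Sum votes per value, discounting each source by (1 - c * Pr(copying))
    for every already-counted source that provides the same value."""
    votes = defaultdict(float)
    counted = defaultdict(list)  # value -> sources already counted for it
    for src in sorted(claims):   # assumption: process sources in name order
        value = claims[src]
        weight = 1.0
        for prev in counted[value]:
            weight *= 1 - c * copy_prob.get(frozenset((src, prev)), 0.0)
        votes[value] += weight
        counted[value].append(src)
    return dict(votes)


# Dewitt's row from the five-source table.
DEWITT = {"S1": "MSR", "S2": "MSR", "S3": "UWisc", "S4": "UWisc", "S5": "UWisc"}
# Assumed assignment of the round-1 probabilities to source pairs.
COPY_PROB = {
    frozenset(("S1", "S2")): 0.87,
    frozenset(("S3", "S4")): 0.99,
    frozenset(("S3", "S5")): 0.99,
    frozenset(("S4", "S5")): 0.99,
}

votes = discounted_votes(DEWITT, COPY_PROB)
winner = max(votes, key=votes.get)
print(votes, winner)
```

Under these assumptions S4 contributes about .2 and S5 about .2² to UWisc, so the three near-identical sources together weigh roughly 1.25, slightly less than the discounted 1.30 that S1 and S2 give to MSR.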
  • 26. Solution on the Motivating Example
    • Round 2
    • Copying relationship: [Figure: graph over S1 to S5; edge probabilities in extraction order: .14, .49, .49, .49, .08, .49, .49, .49]
    • Truth discovery: values shown: AT&T, BEA, UCI
    • [Table: the five-source table from slide 9]
  • 27. Solution on the Motivating Example
    • Round 3
    • Copying relationship: [Figure: graph over S1 to S5; edge probabilities in extraction order: .12, .49, .49, .49, .06, .49, .49, .49]
    • Truth discovery: values shown: AT&T, BEA, UCI
    • [Table: the five-source table from slide 9]
  • 28. Solution on the Motivating Example
    • Round 4
    • Copying relationship: [Figure: graph over S1 to S5; edge probabilities in extraction order: .10, .48, .49, .50, .05, .49, .48, .50]
    • Truth discovery: values shown: AT&T, BEA, UCI
    • [Table: the five-source table from slide 9]
  • 29. Solution on the Motivating Example
    • Round 5
    • Copying relationship: [Figure: graph over S1 to S5; edge probabilities in extraction order: .09, .47, .49, .51, .04, .49, .47, .51]
    • Truth discovery: values shown: AT&T, BEA, UCI
    • [Table: the five-source table from slide 9]
  • 30. Solution on the Motivating Example
    • Round 13
    • Copying relationship: [Figure: graph over S1 to S5; edge probabilities in extraction order: .55, .49, .55, .49, .44, .44]
    • Truth discovery: values shown: AT&T, BEA, UCI
    • [Table: the five-source table from slide 9]
  • 31. Combining Accuracy and Dependence
    • Theorem: without considering accuracy, the iteration converges
    • Observation: with accuracy, it converges when #objs >> #srcs
    • [Figure: iterative cycle of three steps (Step 1, Step 2, Step 3) linking dependence detection, truth discovery, and source-accuracy computation]
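Two of the three steps in this cycle, truth discovery and source-accuracy computation, can be sketched as a small fixed-point loop. This is a simplified illustration and not the authors' algorithm: dependence detection is left out, the vote weight ln(n·A/(1-A)) is an assumed form, value probabilities are normalized over observed values only, and `accu_vote` and its parameters are hypothetical names.

```python
import math
from collections import defaultdict


def accu_vote(claims, n=10, init_acc=0.8, max_rounds=50):
    """claims: {source: {object: value}}. Returns (truth, accuracy)."""
    acc = {s: init_acc for s in claims}
    truth = {}
    for _ in range(max_rounds):
        # Truth discovery: accuracy-weighted votes per object and value.
        scores = defaultdict(lambda: defaultdict(float))
        for s, values in claims.items():
            weight = math.log(n * acc[s] / (1 - acc[s]))  # assumed weight form
            for obj, v in values.items():
                scores[obj][v] += weight
        # Turn scores into probabilities (softmax over observed values only).
        probs = {}
        for obj, sc in scores.items():
            top = max(sc.values())
            expd = {v: math.exp(x - top) for v, x in sc.items()}
            total = sum(expd.values())
            probs[obj] = {v: e / total for v, e in expd.items()}
        new_truth = {obj: max(p, key=p.get) for obj, p in probs.items()}
        # Source-accuracy computation: mean probability of a source's values,
        # clamped away from 0 and 1 so the log weight stays finite.
        for s, values in claims.items():
            mean = sum(probs[o][v] for o, v in values.items()) / len(values)
            acc[s] = min(max(mean, 0.01), 0.99)
        if new_truth == truth:  # stop once the chosen values no longer change
            break
        truth = new_truth
    return truth, acc


# The three-source table from the earlier slides.
objs = ["Stonebraker", "Dewitt", "Bernstein", "Carey", "Halevy"]
table = {
    "S1": ["MIT", "MSR", "MSR", "UCI", "Google"],
    "S2": ["Berkeley", "MSR", "MSR", "AT&T", "Google"],
    "S3": ["MIT", "UWisc", "MSR", "BEA", "UW"],
}
claims = {s: dict(zip(objs, vals)) for s, vals in table.items()}
truth, acc = accu_vote(claims)
print(truth, acc)
```

On the three-source table the loop settles quickly: S1 ends up with the highest estimated accuracy, which also breaks the three-way tie on Carey in favor of S1's value.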
  • 32. Outline
    • Motivation and intuitions for solution
    • Techniques
      • Problem definition
      • Copying detection
      • Truth discovery
    • Experimental Results
    • Framework of the Solomon project
  • 33. Experimental Setup
    • Dataset: AbeBooks
      • 877 bookstores
      • 1263 CS books
      • 24364 listings, w. ISBN, author-list
      • After pre-cleaning, each book has on average 19 listings and 4 author lists (ranging from 1 to 23)
    • Gold standard: 100 random books
      • Manually check author list from book cover
    • Measure: Precision=#(Corr author lists)/#(All lists)
  • 34. Naïve Voting and Types of Errors
    • Naïve voting has precision .71
  • 35. Contributions of Various Components
    • Precision improves by 25.4% over Naïve
    • Considering dependence improves the results most
    • Reasonably fast
    Methods                  Prec  #Rnds  Time(s)
    Naïve                    .71   1      .2
    Only value similarity    .74   1      .2
    Only source accuracy     .79   23     1.1
    Only source dependence   .83   3      28.3
    Depen+accu               .87   22     185.8
    Depen+accu+sim           .89   18     197.5
  • 36. Outline
    • Motivation and intuitions for solution
    • Techniques
      • Problem definition
      • Copying detection
      • Truth discovery
    • Experimental Results
    • Framework of the Solomon project
  • 37. Data Integration Faces 3 Challenges
  • 38. Data Integration Faces 3 Challenges
  • 39. Data Integration Faces 3 Challenges
  • 40. Data Integration Faces 3 Challenges
  • 41. Data Integration Faces 3 Challenges
    • Schema matching
    • Model management
    • Query answering using views
    • Information extraction
    • String matching (edit distance, token-based, etc.)
    • Object matching (aka. record linkage, reference reconciliation, …)
    • Data fusion
    • Truth discovery
    Assume INDEPENDENCE of data sources
  • 42. Source Copying Adds A New Dimension to Data Integration
  • 43. Copying Can Be Large-Scale [VLDB’10a]
    • (Copying of AbeBooks Data)
    • Data collected from AbeBooks [Yin et al., 2007]
  • 44. Related Work
    • Data provenance [Buneman et al., PODS’08]
      • Focus on effective presentation and retrieval
      • Assume knowledge of provenance/lineage
    • Opinion pooling [Clemen&Winkler, 1985]
      • Combine probability distributions from multiple experts
      • Again, assume knowledge of dependence
    • Detect plagiarism of text, image/video, programs, etc. [Dong, Sigmod’11 tutorial]
  • 45.
    • http://www2.research.att.com/~yifanhu/SourceCopying/