Transcript of "Searching data with substance and style"

1. Searching Data with Substance and Style
   Amélie Marian
   Rutgers University
   http://www.cs.rutgers.edu/~amelie
2. Semi-structured Data Processing
   • Large amount of data online and in personal devices
     ▫ Structure (style)
     ▫ Text content (substance)
     ▫ Different sources (soul)
     ▫ Finding the data we need can be difficult
3. Semi-structured Data Processing at Rutgers: SPIDR Lab
   • Personal Information Search
     ▫ Semi-structured data
     ▫ Need for high-quality search tools
   • Structuring of User Web Posts
     ▫ Large amount of user-generated data untapped
     ▫ Text has inherent structure
     ▫ Use of text to guide search and analyze data
   • Data Corroboration
     ▫ Conflicting sources of data
     ▫ Need to identify true facts
4. Joint work with:
   Wei Wang
   Christopher Peery
   Thu Nguyen
   Computer Science, Rutgers University
5. Personal Information Search
   Web: search for relevant documents. Personal data: search for specific documents.
   Information that can be used for personal information search:
   • Content (keywords)
   • Metadata (file size, modification time, etc.)
   • Structure
     ▫ Directory (external)
     ▫ File structure (internal): XML, LaTeX tags, picture tags, etc.
     ▫ Partially known
6. PIMS Project Description
   (EDBT'08, ICDE'08 demo, DEB'09, EDBT'11, TKDE accepted)
   • Data and query models that unify content and structure
   • Scoring framework to rank unified search results
   • Query processing algorithms and index structures to score and rank answers efficiently
   • Evaluation of the quality and efficiency of the unified scoring
   NSF CAREER Award, July 2009-2014
7. Separate Structure and Content
   Target file: Halloween party pictures taken at home where someone wears a witch costume.
   (Diagram: the query is split at the file boundary into separate conditions.)
   Directory: //Home
   Keywords: Halloween, witch
8. Current Search Tools
   Current search tools (e.g., web, desktop, GDS) mostly rely on ranking and filtering:
   ▫ Ranking: content keywords
   ▫ Filtering: additional conditions (e.g., metadata, structure)
   Example: find a jpg file saved in directory /Desktop/Pictures/Home that contains the words "Halloween witch".
   This approach is often insufficient:
   ▫ Filtering forces a binary decision: gif files and files under directory /Archive/Pictures/Home are not returned.
   ▫ Structure and content are strictly separated: files under directory /Pictures/Halloween are not returned.
9. Unified Approach
   Goal: unify structure and content
   ▫ Develop a unified view of directory and file structure
   ▫ Allow a single query to contain both structure and content components and to be answered at once
   ▫ Return results even if queries are incomplete or contain mistakes
   Approach:
   ▫ Define a unified data model by ignoring file boundaries
   ▫ Define a unified query model
   ▫ Define relaxations to approximate unified queries
   ▫ Define a relevance score for unified queries
10. Unified Structure and Content
    Target file: Halloween party pictures taken at home where someone wears a witch costume.
    Query: //Home[.//"Halloween" and .//"witch"]
    (Diagram: unified data tree rooted above the file boundary, with a Home node and "Halloween" and "witch" content nodes.)
11. From Query to Answers
    Query → DAG relaxation → relaxed queries → matching → matches/answers → scoring (TA algorithm) → ranked answers returned to the user
12. Query Relaxations
    Target: IMG_1391.gif
    • Edge Generalization (missing terms)
      ▫ /Desktop/Home → /Desktop//Home
    • Path Extension (only remember a prefix)
      ▫ /Desktop/Pictures → /Desktop/Pictures//*
    • Node Generalization (misremembered structure/content)
      ▫ //Home//Halloween → //Home//{Halloween}
    • Node Inversion (misremembered order)
      ▫ /Desktop//Home//{Halloween} → /Desktop//(Home//{Halloween})
    • Node Deletion (extraneous terms)
      ▫ /Desktop/Backup/Pictures//Home → /Desktop//Pictures//Home
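To make these rewrites concrete, the sketch below generates a few relaxed variants of a simple path query in Python. The list-of-steps representation and the helper names are illustrative assumptions; the actual system works on a DAG of relaxed queries (next slide), not on strings.

```python
# Illustrative sketch: generating relaxed variants of a path query such as
# /Desktop/Pictures/Home. A query is modeled as a list of node labels plus
# the separators ("/" = child, "//" = descendant) that precede them.
# This toy representation is an assumption, not the system's implementation.

def edge_generalizations(seps, nodes):
    """Edge Generalization: turn one '/' edge into '//' (tolerates missing terms)."""
    for i, sep in enumerate(seps):
        if sep == "/" and i > 0:  # keep the leading '/' of an absolute query
            yield seps[:i] + ["//"] + seps[i + 1:], nodes

def path_extensions(seps, nodes):
    """Path Extension: append '//*' so only the prefix must match."""
    yield seps + ["//"], nodes + ["*"]

def node_deletions(seps, nodes):
    """Node Deletion: drop one node, generalizing the surrounding edges."""
    for i in range(len(nodes)):
        if i + 1 < len(nodes):
            # replace the two edges around the deleted node by one '//'
            yield seps[:i] + ["//"] + seps[i + 2:], nodes[:i] + nodes[i + 1:]
        else:
            yield seps[:i], nodes[:i]  # deleting the last node truncates the path

def render(seps, nodes):
    return "".join(s + n for s, n in zip(seps, nodes))

seps, nodes = ["/", "/", "/"], ["Desktop", "Pictures", "Home"]
for gen in (edge_generalizations, path_extensions, node_deletions):
    for s, n in gen(seps, nodes):
        print(render(s, n))
# e.g. /Desktop//Pictures/Home, /Desktop/Pictures/Home//*, /Desktop//Home, ...
```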
13. DAG Representation
    IDF score:
    ▫ A function of how many files match the query
    ▫ The DAG stores IDF scoring information
    (Diagram: relaxation DAG for the query /p/h, with p = Pictures and h = Home; it descends from the exact match /p/h through relaxations such as //p/h, /p//h, /(p/h), //p//h, /p/h//*, //p/h//*, //(p/h), //p//*, and //h//*, down to //*, which matches all files.)
14. Query Evaluation
    • Top-k query processing
      ▫ Branch-and-bound approach
    • Lazy evaluation of the relaxed DAG structure
      ▫ The DAG is query dependent and has to be generated at runtime
      ▫ We developed two algorithms to speed up query evaluation
         DAGJump skips unnecessary parts of the DAG (sorted accesses)
         RandomDAG zooms in on the relevant part of the DAG (random accesses)
    • Matching of answers using dedicated data structures
      ▫ We extended PathStack (Bruno et al., ICDE'02) to support permutations (NIPathStack)
15. Traditional Content TF·IDF Scoring
    • Consider files as a "bag of terms"
    • TF (Term Frequency)
      ▫ A file that mentions a query term more often is more relevant
      ▫ TF can be normalized by file length
    • IDF (Inverse Document Frequency)
      ▫ Terms that appear in too many files have little differentiation power in determining relevance
    • TF·IDF scoring
      ▫ Aggregate TF and IDF scores across all query terms:
        score(q, d) = \sum_{t \in q} tf_{t,d} \cdot idf_t
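As a reference point, here is a minimal Python implementation of this textbook bag-of-terms score; the per-file-length TF normalization shown is one common choice, not necessarily the exact variant used in the slides.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_terms, corpus):
    """score(q, d) = sum over t in q of tf(t, d) * idf(t).

    tf is length-normalized term frequency; idf = log(N / df(t)).
    corpus is a list of documents, each a list of terms.
    """
    counts = Counter(doc_terms)
    N = len(corpus)
    score = 0.0
    for t in query_terms:
        tf = counts[t] / len(doc_terms)        # normalized by file length
        df = sum(1 for d in corpus if t in d)  # number of files containing t
        if df:
            score += tf * math.log(N / df)
    return score

corpus = [["halloween", "witch", "party"], ["witch", "movie"], ["budget"]]
print(tf_idf_score(["halloween", "witch"], corpus[0], corpus))
```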
16. Unified IDF Score
    For a unified data tree T, a path query PQ, and a file F, we define the IDF score as

    score_{idf}(PQ) = \frac{\log(N / |matches(T, PQ)|)}{\log N}

    where N is the total number of files and matches(T, PQ) is the set of files that match PQ in T.
17. TF Score
    Example: path query //a//{b} evaluated against file F.
    ▫ Structure: match_struct = 1 out of nodes_struct = 4, normalized to 0.25
    ▫ Content: match_content = 2 out of nodes_content = 5 (content "b e f b f"), normalized to 0.4
    ▫ TF score = \sum f(x) = f(0.25) + f(0.4)
    (Plot: f(x) over x ∈ [0, 1], a concave logarithmic curve.)
    f is a logarithmic smoothing function, f(x) = \log(1 + x^{1/n}) with n = 2, 3, \ldots; n affects the relative impact of TF on unified scores.
18. Unified Score
    Aggregate IDF and TF scores across all relaxed queries:

    Relaxed query       idf   tf     idf·tf
    /a/b (exact match)  1.0   0.15   0.15
    //a/b               0.8   0.25   0.20
    /a//b               0.8   0.1    0.08
    ...                 ...   ...    ...

    Summing idf·tf over all relaxed queries yields the unified score, 0.875 in this example.
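The aggregation itself is a simple weighted sum; the sketch below reproduces the slide's example numbers in Python (the omitted relaxations that bring the total to 0.875 are left out).

```python
# Unified score: sum idf * tf over all relaxed queries matching a file.
# Values below are the slide's example; in the system, idf comes from the
# relaxation DAG and tf from the structure/content match fractions.
relaxed_query_scores = {
    "/a/b":  (1.0, 0.15),   # exact match
    "//a/b": (0.8, 0.25),
    "/a//b": (0.8, 0.10),
    # ... remaining relaxations omitted
}

unified = sum(idf * tf for idf, tf in relaxed_query_scores.values())
print(unified)  # 0.43 for the three queries shown; 0.875 with all relaxations
```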
19. Experimental Setup
    • Platform: PC with a 64-bit hyper-threaded 2.8 GHz Intel Xeon processor, 2 GB memory, a 10K RPM 70 GB SCSI disk, Linux 2.6.16 kernel, Sun Java 1.5.0 JVM
    • Data set
      ▫ Files and directories from the environment of a graduate student (15 GB)
      ▫ 95,172 files (documents 59%, email 34%) in 7,788 directories; average directory depth is 6.3, with the longest being 12
      ▫ 57M nodes in the unified data tree, with 49M (86%) leaf content nodes
20. Relevance Comparison
    • Use Lucene as a comparison basis
    • Content-only: use the standard Lucene content indexing and search
    • Content:Dir: create two Lucene indexes, one over content terms and one over terms from the directory pathnames (each treated as a small file)
    • Content+Dir: augment the content index with directory path terms
21. Case Study
    Search for a witch costume picture taken at home on Halloween.
    Target: IMG_1391.gif (tagged with "witch" and "Halloween")

    Query type  Query condition                        Comment                      Rank
    U           //home[.//"witch" and .//"halloween"]  Accurate condition           1
    U           //halloween/witch/"home"               Structure/content switched   1
    C           {witch, halloween}                     Accurate condition           20
    C:D         {witch, halloween} : {home}            Accurate condition           1
    C:D         {witch, home} : {halloween}            Structure/content switched   245-252
22. CDFs (Impact of Inaccuracies)
    (Figure: four CDF panels of the target file's rank for U, C:D, and C+D, under 50% and 100% error with 1 and 2 swaps; percentage of queries vs. rank from 1 to 100 on a log scale.)
23. Query Processing Performance
    (Figure: CDF of query processing time from 0 to 10 seconds for U and C:D.)
24. Personal Information Search: Contributions
    • A multi-dimensional search framework that supports fuzzy query conditions
    • Scoring techniques for fuzzy query conditions against a unified view of structure and content
       Improves search accuracy over content-based methods by leveraging both structure and content information, as well as relationships between the terms
       Shows improvements over existing techniques (GDS, TopX)
    • Efficient index structures and optimizations to process multi-dimensional and unified queries
       Significantly reduced the overall query processing time
    • Future work directions: user studies, twig matching, result granularity, context
25. Joint work with:
    Gayatree Ganu, Computer Science, Rutgers University
    Noémie Elhadad, Biomedical Informatics, Columbia University
    User Review Structure Analysis project (URSA)
    Patient Emotion and stRucture SEarch USer interface (PERSEUS)
26. URSA: User Review Structure Analysis — Project Description
    (WebDB'09)
    • Aim:
      ▫ Better understanding of user reviews
      ▫ Better search and access of user reviews
    • Tasks:
      ▫ Structure identification and analysis
      ▫ Text and structure search
      ▫ Similarity search in social networks
    Google Research Award, April 2008
27. Online Reviewing Systems: Citysearch
    Data in reviews:
    • Structured metadata
    • Textual review body
       Sentiment information
       Information on product-specific features
    Users are inconvenienced because:
    • A large number of reviews is available
    • It is hard to find relevant reviews
    • Information needs are vague or undefined
28. Data Description
    • Restaurant reviews extracted from Citysearch New York (http://newyork.citysearch.com)
    • The corpus contains:
      ▫ 5,531 restaurants, with associated structured information (name, location, cuisine type) and a set of reviews
      ▫ 52,264 reviews, of which 1,359 are editorial reviews, with structured information (star rating, username, date) and unstructured text (title, body, pros, cons)
      ▫ 32,284 distinct users, with distinct username information
    • Dataset accessible at http://www.research.rutgers.edu/~gganu/datasets/
29. Structure Identification
    • Classification of review sentences with topic and sentiment information
      ▫ Sentence topics: Food, Price, Service, Ambience, Anecdotes, Miscellaneous
      ▫ Sentence sentiment: Positive, Negative, Neutral, Conflict
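For illustration, a sentence classifier of this kind might be set up as below with scikit-learn; the toy training sentences and the TF-IDF plus logistic-regression pipeline are assumptions for the sketch, not the classifiers actually used in URSA.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled data; the real system was trained on annotated review sentences.
sentences = ["The pasta was amazing", "Way too expensive for the portions",
             "Our waiter was rude", "Lovely candle-lit room"]
topics = ["Food", "Price", "Service", "Ambience"]

# Unigram/bigram TF-IDF features feeding a multi-class classifier.
topic_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                          LogisticRegression())
topic_clf.fit(sentences, topics)

# Classify a new review sentence into one of the topic categories.
print(topic_clf.predict(["The steak was delicious"]))
```

A second classifier of the same shape, trained on sentiment labels (Positive, Negative, Neutral, Conflict), would complete the (category, sentiment) sentence typing described above.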
30. Text-Based Recommendation System: Evaluation Setting
    • For evaluation, we separated three non-overlapping test sets of about 260 reviews:
      ▫ Tests A and B: users who have reviewed at least two restaurants (so that the training set has at least one review)
      ▫ Test C: users with at least 5 reviews
    • For measuring prediction accuracy, we use the Root Mean Square Error (RMSE)
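RMSE is used in its standard form; a one-function Python reference:

```python
import math

def rmse(predicted, actual):
    """Root Mean Square Error between equal-length lists of ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(actual))

print(rmse([4.0, 3.5, 2.0], [5, 3, 2]))  # ~0.645
```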
31. Text-Based Recommendation System: Steps
    • Text-derived rating score
      ▫ Regression-based rating
    • Goals
      1. Predicting the metadata star rating
      2. Predicting the text-derived score
         • Only predicts the score, not the content of the reviews
         • Lower standard deviations: lower RMSE
    • Prediction strategies
      ▫ Average-based prediction
      ▫ Personalized prediction
32. Regression-Based Text Rating
    • Use the text of reviews to generate a rating
    • Different categories and sentiments should have different importance in the rating
    Method:
    • We use multivariate quadratic regression
    • Each normalized sentence type (category, sentiment) is a variable in the regression
    • The dependent variable is the metadata star rating
    • We used the training sets to learn the weights for each sentence type; the weights are used in computing the text-based score
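A minimal sketch of this regression step with scikit-learn follows; the 24-feature layout (6 categories × 4 sentiments) matches the slides, but the synthetic data and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Each review is a vector of normalized sentence-type fractions:
# 6 categories (Food, Price, Service, Ambience, Anecdotes, Miscellaneous)
# x 4 sentiments (Positive, Negative, Neutral, Conflict) = 24 features.
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(24), size=200)  # toy sentence-type proportions
stars = np.clip(3.7 + 2.6 * X[:, 0] - 4.2 * X[:, 9]
                + rng.normal(0, 0.3, 200), 1, 5)  # synthetic star ratings

# Multivariate quadratic regression: first- and second-order terms.
quad = PolynomialFeatures(degree=2, include_bias=False)
model = LinearRegression().fit(quad.fit_transform(X), stars)

# The learned weights then turn any review's text into a text-based score.
print(model.predict(quad.transform(X[:1])))
```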
33. Regression-Based Text Rating
    • Regression constant: 3.68
    • Food, and negative Price and Service, appear to be most important
    • Regression weights (first-order variables):

      First order    Positive  Negative  Neutral  Conflict
      Food            2.62     -2.65     -0.08    -0.69
      Price           0.39     -2.12     -1.27     0.93
      Service         0.85     -4.25     -1.83     0.36
      Ambience        0.75     -0.27      0.16     0.21
      Anecdotes       0.95     -1.75      0.06    -0.19
      Miscellaneous   1.30     -2.62     -0.30     0.36

    • Regression weights (second-order variables):

      Second order   Positive  Negative  Neutral  Conflict
      Food           -1.99      2.05     -0.14     0.67
      Price          -0.27      2.04      2.17    -1.01
      Service        -0.52      3.15      1.76     0.34
      Ambience       -0.44      0.81     -0.28    -0.61
      Anecdotes      -0.40      2.03     -0.03    -0.20
      Miscellaneous  -0.65      2.38      0.50    -0.10
34. Regression-Based Text Rating: Baseline Case
    Restaurant average-based prediction:
    • Prediction using the average rating given to a restaurant by all users (we also tried user-average and combined strategies)
    • RMSE errors: predicting using text does better than the popularly used star rating

    Predicting star ratings            TEST A  TEST B  TEST C
    Using star rating                  1.127   1.267   1.126
    Using sentiment-based text rating  1.126   1.224   1.046

    Predicting sentiment text rating   TEST A  TEST B  TEST C
    Using star rating                  0.703   0.718   0.758
    Using sentiment-based text rating  0.545   0.557   0.514
35. Clustering-Based Strategies for Recommendations
    • KNN based on a clustering over star ratings
      ▫ Little improvement over the baseline
      ▫ Does not take the textual information into account
      ▫ Sparse data
      ▫ Cold-start problem
      ▫ Hard clustering not appropriate
    • Soft clustering
      ▫ Partitions objects into clusters
      ▫ Each user has a membership probability for each cluster
36. Information Bottleneck Method
    • Foundations in rate distortion theory
    • Allows choosing the tradeoff between
      ▫ Compression (number of clusters T)
      ▫ Quality, estimated through the average distortion between cluster points and the cluster centroid (β parameter)
    • Shown to work well with sparse datasets (N. Slonim, SIGIR 2002)
37. Leveraging Text Content for Personalized Predictions
    • Use the sentence types (categories, sentiments) within the reviews as features
    • Users are clustered based on the type of information in their reviews
    • Predictions are made using the membership probabilities of clusters to find neighbors
38. Example: Clustering Using the iIB Algorithm
    Star-rating matrix (User3's rating of Restaurant2, marked ???, is the value to predict):

            Restaurant1  Restaurant2  Restaurant3
    User1   4            -            -
    User2   2            5            4
    User3   4            ???          3
    User4   5            2            -
    User5   -            -            1

    Input matrix to the iIB algorithm (before normalization); per restaurant, the columns are Food Positive (F+), Food Negative (F-), Price Positive (P+), and Price Negative (P-):

            Restaurant1               Restaurant2              Restaurant3
            F+   F-    P+    P-       F+   F-   P+   P-        F+   F-   P+   P-
    User1   0.6  0.2   0.2   -        -    -    -    -         -    -    -    -
    User2   0.3  0.6   0.1   -        0.9  -    0.1  -         0.6  0.1  0.2  0.1
    User3   0.7  0.1   0.15  0.05     -    -    -    -         0.2  0.8  -    -
    User4   0.9  0.05  0.05  -        0.3  0.4  0.2  0.1       -    -    -    -
    User5   -    -     -     -        -    -    -    -         -    0.7  0.3  -
39. Example: Soft-Clustering Prediction
    User ratings (star or text), with * the value to predict:

            Restaurant1  Restaurant2  Restaurant3
    User1   4            -            -
    User2   2            5            4
    User3   4            *            3
    User4   5            2            -
    User5   -            -            1

    Cluster membership probabilities:

            Cluster1  Cluster2  Cluster3
    User1   0.040     0.057     0.903
    User2   0.396     0.202     0.402
    User3   0.380     0.502     0.118
    User4   0.576     0.015     0.409
    User5   0.006     0.990     0.004

    • For each cluster, we compute the cluster contribution for the test restaurant as a weighted average of the ratings given to the restaurant: Contribution(c2, r2) = 4.793, Contribution(c3, r2) = 3.487
    • We compute the final prediction from the cluster contributions for the test restaurant and the test user's membership probabilities: 4.042
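The slide's arithmetic can be reproduced with a short Python sketch; the data literals are copied from the example above, and the function names are illustrative.

```python
# Soft-clustering prediction, reproducing the slide's running example.
# contribution(c, r): average of r's known ratings, weighted by each
# rater's membership probability in cluster c.
ratings = {  # (user, restaurant) -> rating
    ("u1", "r1"): 4, ("u2", "r1"): 2, ("u2", "r2"): 5, ("u2", "r3"): 4,
    ("u3", "r1"): 4, ("u3", "r3"): 3, ("u4", "r1"): 5, ("u4", "r2"): 2,
    ("u5", "r3"): 1,
}
membership = {  # user -> membership probabilities over 3 clusters
    "u1": (0.040, 0.057, 0.903), "u2": (0.396, 0.202, 0.402),
    "u3": (0.380, 0.502, 0.118), "u4": (0.576, 0.015, 0.409),
    "u5": (0.006, 0.990, 0.004),
}

def contribution(cluster, restaurant):
    pairs = [(membership[u][cluster], s)
             for (u, r), s in ratings.items() if r == restaurant]
    return sum(w * s for w, s in pairs) / sum(w for w, _ in pairs)

def predict(user, restaurant):
    # Final prediction: contributions weighted by the user's memberships.
    return sum(p * contribution(c, restaurant)
               for c, p in enumerate(membership[user]))

print(round(contribution(1, "r2"), 3))  # 4.793
print(round(contribution(2, "r2"), 3))  # 3.487
print(round(predict("u3", "r2"), 3))    # 4.042
```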
40. iIB Algorithm
    • We experimented with different values of β and T, and used β = 20, T = 100
    • RMSE errors and percentage improvement over the baseline:

    Predicting star ratings            TEST A         TEST B         TEST C
    Using star rating                  1.103 (2.13%)  1.242 (1.74%)  1.106 (1.78%)
    Using sentiment-based text rating  1.113 (1.15%)  1.211 (1.06%)  1.046 (0%)

    Predicting sentiment text rating   TEST A         TEST B         TEST C
    Using star rating                  0.692 (1.56%)  0.704 (1.95%)  0.742 (2.11%)
    Using sentiment-based text rating  0.544 (0.18%)  0.549 (1.44%)  0.514 (0%)

    • Using text features for clustering always improves the traditional goal of predicting star ratings
    • Even small improvements in RMSE are useful (Netflix, precision in top-k)
41. URSA: Qualitative Predictions
    • Predict the sentiment towards each topic
    • Cluster users along each dimension separately
    • Use a threshold to classify sentiment (actual and predicted)
    (Figure: prediction accuracy for positive ambience as a function of the threshold θ_act, from A-0 to A-1, shown in 20% accuracy bands.)
42. PERSEUS Project Description
    Patient Emotion and StRucture SEarch USer Interface
    ▫ Large amount of patient-produced data
      • Difficult to search and understand
      • Patients need help finding information
      • Health professionals could learn from the data
    ▫ Analyze and search patient forums, mailing lists, and blogs
      • Topical information
      • Specific language
      • Time sensitive
      • Emotionally charged
    Google Research Award, April 2010
    NSF CDI Type I, October 2010-2013
43. PERSEUS Project Description
    ▫ Automatically add structure to free text
      • Use of context information ("hair loss": side effect or symptom?)
      • Approximate structure
    ▫ Use structure to guide search
      • Need for high recall, but good precision
      • Find users with similar experiences
      • Various result granularities
        - Thread vs. sentence
        - Context dependent
      • Needs to take the approximation into account
44. Structuring and Searching Web Content: Contributions
    • Leveraged automatically generated structure to improve predictions
      ▫ Around 2% RMSE improvements
      ▫ Used inferred structure to group users using soft-clustering techniques
    • Qualitative predictions
      ▫ High accuracy
    • Future directions
      ▫ Extension to healthcare domains
      ▫ Use of inferred structure to guide search
      ▫ Use of user clusters in search
      ▫ Adaptation to various result granularities
      ▫ Taking classification inaccuracies into account
45. Joint work with:
    Minji Wu, Computer Science, Rutgers University
    Collaborators:
    Serge Abiteboul, Alban Galland, INRIA
    Pierre Senellart, Telecom ParisTech
    Magda Procopiuc, Divesh Srivastava, AT&T Research Labs
    Laure Berti-Equille, IRD
46. Motivations
    • Information on web sources is unreliable
      ▫ Erroneous
      ▫ Misleading
      ▫ Biased
      ▫ Outdated
    • Users need to check web sites to confirm the information
      ▫ Data corroboration
47. Example: What Is the Gas Mileage of My Honda Civic?
    Query: "honda civic 2007 gas mileage" on MSN Search
    • Is the top hit, the honda.com site, unbiased?
    • Is the autoweb.com web site trustworthy?
    • Are all these values referring to the correct model year?
    Users may check several web sites to get an answer.
48. Example: Identifying Good Business Listings
    • NYC restaurant information from 6 sources
      ▫ Yellowpages
      ▫ Menupages
      ▫ Yelp
      ▫ Foursquare
      ▫ OpenTable
      ▫ Mechanical Turk (check Street View)
    Which listings are correct?
49. Data Corroboration: Project Description
    (WebDB'07, WSDM'10, IS'11, DEB'11)
    • Trustworthy sources report true facts; true facts come from trustworthy sources
    • Sources have different
      ▫ Coverage
      ▫ Domain
      ▫ Dependencies
      ▫ Overlap
    • Conflict resolution with maximum coverage
    Microsoft Live Labs Search Award, May 2006
50. Top-k Join: Project Description
    (CleanDB'06, PVLDB'10)
    Integrate and aggregate information from several sources, given as weighted pairs:
    ("minji", "vldb10", 0.2), ("minji", "amélie", 1.0), ("amélie", "vldb10", 0.5),
    ("amélie", "SIN", 0.3), ("minji", "SIN", 0.1), ("SIN", "vldb10", 0.9)
51. Data Corroboration: Contributions
    • Probabilistic model for corroboration
      ▫ Fact uncertainty
      ▫ Source trustworthiness
      ▫ Source coverage
      ▫ Conflict between sources
    • Fixpoint techniques to compute truth values of facts and source quality estimates
    • Top-k query algorithms for computing corroborated answers
    • Open issues:
      ▫ Functional dependencies
      ▫ Time
      ▫ Social network
      ▫ Uncertain data
      ▫ Source dependence
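As a concrete illustration of the fixpoint idea, the sketch below alternates between estimating fact confidence from source trust and source trust from fact confidence. It is a simplified voting-style model in the spirit of corroboration, not the exact equations from the papers, and the example values (including the mileage figures) are made up.

```python
# Generic corroboration fixpoint sketch (simplified, illustrative):
# (1) fact confidence = trust-weighted vote share of each asserted value;
# (2) source trust = average confidence of the facts the source asserts.
claims = {  # source -> {question: asserted value}; all values invented
    "siteA": {"civic_mpg": 36, "capital_NJ": "Trenton"},
    "siteB": {"civic_mpg": 36, "capital_NJ": "Newark"},
    "siteC": {"civic_mpg": 51, "capital_NJ": "Trenton"},
}

trust = {s: 0.5 for s in claims}  # initial trust in each source
for _ in range(20):               # iterate until (approximately) a fixpoint
    conf, totals = {}, {}
    for s, facts in claims.items():          # trust-weighted votes per value
        for q, v in facts.items():
            conf[(q, v)] = conf.get((q, v), 0.0) + trust[s]
    for (q, v), c in conf.items():           # normalize per question
        totals[q] = totals.get(q, 0.0) + c
    conf = {(q, v): c / totals[q] for (q, v), c in conf.items()}
    trust = {s: sum(conf[(q, v)] for q, v in facts.items()) / len(facts)
             for s, facts in claims.items()}

for q in sorted(totals):                     # report the winning value per question
    value, c = max(((v, c) for (qq, v), c in conf.items() if qq == q),
                   key=lambda vc: vc[1])
    print(f"{q}: {value} (confidence {c:.2f})")
```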
52. Conclusions
    • New challenges in web data management
      ▫ Semi-structured data
         PIMS
         User reviews
      ▫ Multiple sources of data
         Conflicting information
         Low-quality data providers (Web 2.0)
    • The SPIDR lab at Rutgers focuses on helping users identify useful data in the wealth of information available
53. Amélie Marian - Rutgers University