Searching data with substance and style

1

Searching Data with Substance
and Style
Amélie Marian
Rutgers University

http://www.cs.rutgers.edu/~amelie

2

Semi-structured Data Processing
• Large amount of data online and in personal
devices
▫ Structure (style)
▫ Text content (substance)
▫ Different sources (soul)

▫ Finding the data we need can be difficult

Amélie Marian - Rutgers University

3

Semi-structured Data Processing at Rutgers
SPIDR Lab
• Personal Information Search
▫ Semi-structured data
▫ Need for high-quality search tools
• Structuring of User Web Posts
▫ Large amount of user-generated data untapped
▫ Text has inherent structure
▫ Use of text to guide search and data analysis
• Data Corroboration
▫ Conflicting sources of data
▫ Need to identify true facts

4

Joint work with:

Wei Wang
Christopher Peery
Thu Nguyen
Computer Science, Rutgers University


5

Personal Information Search
Web: search for relevant documents
Personal data: search for specific documents
Information that can be used for personal information search
• Content (keywords)
• Metadata (file size, modification time, etc.)
• Structure
▫ Directory (external)
▫ File structure (internal): XML, LaTeX tags, Picture tags, etc.
▫ Partially known


6

PIMS Project Description
Publications: EDBT’08, ICDE’08 (demo), DEB’09, EDBT’11, TKDE (accepted)

• Data and query models that unify content and structure

• Scoring framework to rank unified search results

• Query processing algorithms and index structures to
score and rank answers efficiently

• Evaluation of the quality and efficiency of the unified
scoring

NSF CAREER Award July 2009-2014

7

Target file: Halloween party pictures taken at home where someone
wears a witch costume

Separate Structure and Content

File
Boundary

Directory: //Home
Keywords: Halloween, witch


8

Current Search Tools
Current search tools (e.g., web, desktop, GDS) mostly rely on
ranking and filtering.
▫ Ranking content keywords
▫ Filtering additional conditions (e.g., metadata, structure)
Find a jpg file saved in directory /Desktop/Pictures/Home
that contains the words “Halloween witch”

This approach is often insufficient.
▫ Filtering forces a binary decision. Gif files and files under
directory /Archive/Pictures/Home are not returned.
▫ Structure and content are strictly separated. Files under
directory /Pictures/Halloween are not returned.


9

Unified Approach
Goal: Unify structure and content
▫ Develop a unified view of directory and file structure
▫ Allow for a single query to contain both structure and
content components and to be answered at once
▫ Return results even if queries are incomplete or contain
mistakes

Approach:
▫ Define a unified data model by ignoring file boundaries
▫ Define a unified query model
▫ Define relaxations to approximate unified queries
▫ Define relevance score for unified queries

10

Unified Structure and Content
Target file: Halloween party pictures taken at home where someone
wears a witch costume

//Home[.//“Halloween” and .//“witch”]
[Figure: query tree with root, Home, and the content nodes “Halloween” and “witch”; the file boundary is ignored]


11

From Query to Answers

[Figure: processing pipeline] User query → relaxation → DAG of relaxed queries → matching → matches / answers → scoring → ranked answers (TA algorithm)


12

Query Relaxations
Target: IMG_1391.gif
• Edge Generalization ── missing terms
▫ /Desktop/Home → /Desktop//Home
• Path Extension ── only remember prefix
▫ /Desktop/Pictures → /Desktop/Pictures//*
• Node Generalization ── misremember structure/content
▫ //Home//Halloween → //Home//{Halloween}
• Node Inversion ── misremember order
▫ /Desktop//Home//{Halloween} → /Desktop//(Home//{Halloween})
• Node Deletion ── extraneous terms
▫ /Desktop/Backup/Pictures//Home → /Desktop//Pictures//Home


13

DAG Representation
IDF score
p – Pictures ▫ Function of how many
h – Home files match the query
/p/h (exact match)
▫ DAG stores IDF scoring
information
//p/h /p//h /(p/h)
1

//p//h 2 3

//n

//p//* //h//*
1 - /p/h//*
2 - //p/h//* //* (match all)
3 - //(p/h)

14

Query Evaluation
• Top-k query processing
▫ Branch-and-bound approach
• Lazy evaluation of the relaxed DAG structure
▫ DAG is query dependent and has to be generated at runtime
▫ We developed two algorithms to speed up query evaluation
- DAGJump skips unnecessary parts of the DAG (sorted
accesses)
- RandomDAG zooms in on the relevant part of the DAG
(random accesses)
• Matching of answers using dedicated data structures
- We extended PathStack (Bruno et al. ICDE’02) to support
permutations (NIPathstack)


15

Traditional Content TF∙IDF Scoring
• Consider files as “bag of terms”
• TF (Term Frequency)
▫ A file that mentions a query term more often is more relevant
▫ TF could be normalized by file length

• IDF (Inverse Document Frequency)
▫ Terms that appear in too many files have little differentiation
power in determining relevance
• TF∙IDF Scoring
▫ Aggregate TF and IDF scores across all query terms

$$score(q, d) = \sum_{t \in q} tf_{t,d} \cdot idf_t$$
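A minimal sketch of this content-only scoring, treating each file as a bag of terms. The slide does not fix the exact variants, so two choices here are assumptions: TF is normalized by file length, and IDF is taken as log(N / df_t), one common form. The function name `tf_idf_scores` is illustrative.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs):
    """Score each document (a list of terms) against the query terms."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0 or not doc:
                continue  # a term that appears nowhere contributes nothing
            tf = counts[t] / len(doc)        # assumed: length-normalized TF
            idf = math.log(n_docs / df[t])   # assumed: idf = log(N / df)
            score += tf * idf
        scores.append(score)
    return scores


docs = [
    ["halloween", "witch", "party"],
    ["home", "party"],
    ["witch", "hat", "witch"],
]
scores = tf_idf_scores(["halloween", "witch"], docs)
```

With these toy documents, the first file scores highest because it contains both query terms, and the second scores zero because it contains neither.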


16

Unified IDF Score
For a unified data tree T, a path query PQ, and a file F, we define:
• IDF Score

$$score_{idf}(PQ) = \frac{\log \frac{N}{|matches(T, PQ)|}}{\log N}$$

where N is the total number of files, and matches(T, PQ) is the set of files that
match PQ in T.
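A small sketch of the IDF score, reading the slide's formula as log(N / |matches(T, PQ)|) / log N. The function name and its arguments are illustrative; the number of matching files is assumed to be computed elsewhere.

```python
import math


def idf_score(num_matching_files, total_files):
    """Normalized IDF: 1.0 for a query matching one file, 0.0 for one matching all."""
    if total_files <= 1:
        raise ValueError("need at least two files for a meaningful score")
    if not 0 < num_matching_files <= total_files:
        raise ValueError("the query must match between 1 and N files")
    return math.log(total_files / num_matching_files) / math.log(total_files)
```

Dividing by log N keeps the score in [0, 1]: a query that matches every file (such as //*) gets 0, and the score grows as the query becomes more selective.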


17

TF Score
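Path query: //a//{b}
File F: tree rooted at / with nodes a, c, b and d, and leaf contents “” and “b e f b f”
▫ Structure: matchstruct = 1, nodesstruct = 4, normalized value 1/4 = 0.25
▫ Content: matchcontent = 2, nodescontent = 5, normalized value 2/5 = 0.4
▫ TF score: ∑f(x) = f(0.25) + f(0.4)
[Figure: plot of f(x) against x, both axes from 0 to 1]
f(x) is defined in terms of log(1 + x) and 1/n, with n = 2, 3, …; n affects the relative impact of TF on the unified scores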


18

Unified Score
Aggregate IDF and TF scores across all relaxed queries

Relaxed query      | idf | tf   | tf*idf
/a/b (exact match) | 1.0 | 0.15 | 0.15
//a/b              | 0.8 | 0.25 | 0.2
/a//b              | 0.8 | 0.1  | 0.08
...                | ... | ...  | ...
Unified Score (sum of the tf*idf values): 0.875
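The aggregation can be sketched as a sum of tf*idf products over the relaxed queries. Only the three (idf, tf) pairs visible on the slide are used below, so the result is a partial sum; the slide's 0.875 presumably also includes the relaxed queries elided by "...". The function name is illustrative.

```python
def unified_score(relaxed_scores):
    """Sum idf * tf over the relaxed queries that a file matches."""
    return sum(idf * tf for idf, tf in relaxed_scores)


# (idf, tf) pairs for /a/b, //a/b and /a//b, as listed on the slide.
partial = unified_score([(1.0, 0.15), (0.8, 0.25), (0.8, 0.1)])
```

The three listed terms contribute 0.15 + 0.2 + 0.08 = 0.43.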


19

Experimental Setup
• Platform
PC with a 64-bit hyper-threaded 2.8GHz Intel Xeon
processor, 2GB memory, a 10K RPM 70GB SCSI disk,
Linux 2.6.16 kernel, Sun Java 1.5.0 JVM.
• Data Set
▫ Files and directories from the environment of a
graduate student (15Gb)
▫ 95,172 files (document 59%, email 34%) in 7,788
directories. Average directory depth is 6.3 with the
longest being 12.
▫ 57M nodes in the unified data tree, with 49M (86%)
leaf content nodes


20

Relevance Comparison
• Use Lucene as a comparison basis
• Content-only
Use the standard Lucene content indexing and
search
• Content:Dir
Create two Lucene indexes: content terms, and
terms from the directory pathnames (treated as a
small file)
• Content+Dir
Augment content index with directory path terms


21

Case Study
▫ Search for a witch costume picture taken at home on Halloween
Target: IMG_1391.gif (tagged with “witch” and “Halloween”)

Query Type | Query Condition | Comment | Rank
U | //home[.//“witch” and .//“halloween”] | Accurate condition | 1
U | //halloween/witch/“home” | Structure / content switched | 1
C | {witch, halloween} | Accurate condition | 20
C:D | {witch, halloween} : {home} | Accurate condition | 1
C:D | {witch, home} : {halloween} | Structure / content switched | 245-252


22

CDFs (Impact of Inaccuracies)
[Figure: CDFs of the percentage of queries (y-axis, 0% to 100%) by rank (x-axis, log scale from 1 to 100) for U, C:D and C+D. Four panels: 50% error, 1 swap; 100% error, 1 swap; 50% error, 2 swap; 100% error, 2 swap]

23

Query Processing Performance
[Figure: cumulative distribution (y-axis, 0% to 100%) of query processing time (x-axis, 0 to 10 sec) for U and C:D]


24

Personal Information Search
Contributions
• A multi-dimensional search framework that supports
fuzzy query conditions
• Scoring techniques for fuzzy query conditions against a
unified view of structure and content
- Improves search accuracy over content-based methods by leveraging
both structure and content information as well as relationships between
the terms
- Shows improvements over existing techniques (GDS, TopX)

• Index structures and optimizations to efficiently
process multi-dimensional and unified queries
- Significantly reduced the overall query processing time

• Future work directions:
- User studies, twig matching, result granularity, context

Joint work with:

Gayatree Ganu
Computer Science, Rutgers University
Noémie Elhadad
Biomedical Informatics, Columbia University
User Review Structure Analysis Project – URSA
Patient Emotion and stRucture SEarch USer interface - PERSEUS

26

URSA: User Review Structure Analysis
Project Description
Publication: WebDB’09

• Aim:
Better understanding of user reviews
Better search and access of user reviews
• Tasks:
Structure Identification and Analysis
Text and Structure Search
Similarity Search in Social Networks 

Google Research Award – April 2008

27

Online Reviewing Systems:
Citysearch

Data in Reviews
• Structured metadata
• Textual review body
- Sentiment information
- Information on product-specific
features

Users are inconvenienced
because:
• Large number of reviews
available
• Hard to find relevant reviews
• Vague or undefined
information needs

28

Data Description
• Restaurant reviews extracted from
Citysearch, New York
(http://newyork.citysearch.com)
• The corpus contains:
▫ 5531 restaurants
- associated structured information (name, location, cuisine type)
- a set of reviews
▫ 52264 reviews, of which 1359 are editorial reviews
- structured information (star rating, username, date)
- unstructured text (title, body, pros, cons)
▫ 32284 distinct users
- Distinct username information
• Dataset accessible at
http://www.research.rutgers.edu/~gganu/datasets/

29

Structure Identification
• Classification of review sentences with topic
and sentiment information
Sentence Topics Sentence Sentiment

Food Positive

Price Negative

Service Neutral

Ambience Conflict

Anecdotes

Miscellaneous


30

Text Based Recommendation
System: Evaluation Setting

• For evaluation, we separated three non-
overlapping test sets of about 260 reviews:
▫ Test A and B : Users who have reviewed at least two
restaurants (so that training set has at least one
review)
▫ Test C : Users with at least 5 reviews
• For measuring accuracy of prediction we use the
Root Mean Square Error (RMSE)
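For reference, RMSE over a test set can be computed as below. This is the standard definition; the function name and the toy ratings in the note are illustrative, not data from the study.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

For example, predicting 5 and 1 stars when the true ratings are 4 and 3 gives errors of 1 and 2, so the RMSE is sqrt((1 + 4) / 2).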


31

Text-Based Recommendation System:
Steps

• Text-derived rating score
▫ Regression-based rating
• Goals
1. Predicting the metadata star rating
2. Predicting the text-derived score
• Only predicts the score, not the content of the reviews
• Lower standard deviations: lower RMSE
• Prediction Strategies
▫ Average-based prediction
▫ Personalized prediction


32

Regression-based Text Rating
• Use text of reviews to generate a rating
• Different categories and sentiment should have different
importance in the rating

Method
• We use multivariate quadratic regression
• Each normalized sentence type [(category, sentiment)] is
a variable in the regression
• Dependent variable is metadata star-rating

• Used training sets to learn the weights for each sentence
type; weights are used in computing text-based score


Regression-based Text Rating
• Regression Constant: 3.68
• Regression Weights (First order variables)
Regression Weights | Positive | Negative | Neutral | Conflict
Food | 2.62 | -2.65 | -0.08 | -0.69
Price | 0.39 | -2.12 | -1.27 | 0.93
Service | 0.85 | -4.25 | -1.83 | 0.36
Ambience | 0.75 | -0.27 | 0.16 | 0.21
Anecdotes | 0.95 | -1.75 | 0.06 | -0.19
Miscellaneous | 1.30 | -2.62 | -0.30 | 0.36

• Regression Weights (Second order variables)
Regression Weights | Positive | Negative | Neutral | Conflict
Food | -1.99 | 2.05 | -0.14 | 0.67
Price | -0.27 | 2.04 | 2.17 | -1.01
Service | -0.52 | 3.15 | 1.76 | 0.34
Ambience | -0.44 | 0.81 | -0.28 | -0.61
Anecdotes | -0.40 | 2.03 | -0.03 | -0.20
Miscellaneous | -0.65 | 2.38 | 0.5 | -0.10

• Food, and Negative Price and Service, appear to be most important
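A minimal sketch of how a text-derived score could be computed from these weights. Two things are assumptions rather than facts from the slides: that the model has the form constant + Σ w1·x + Σ w2·x² with no cross terms, and that x is the fraction of a review's sentences of a given (category, sentiment) type. Only three sentence types are filled in for brevity, and the names are illustrative.

```python
REGRESSION_CONSTANT = 3.68

# (first-order weight, second-order weight) per sentence type, copied from the slide.
WEIGHTS = {
    ("Food", "Positive"): (2.62, -1.99),
    ("Food", "Negative"): (-2.65, 2.05),
    ("Service", "Negative"): (-4.25, 3.15),
}


def text_rating(sentence_fractions):
    """Assumed model: constant + sum of w1 * x + w2 * x^2 over sentence types."""
    score = REGRESSION_CONSTANT
    for sentence_type, x in sentence_fractions.items():
        w1, w2 = WEIGHTS[sentence_type]
        score += w1 * x + w2 * x * x
    return score
```

Under these assumptions, a review made up entirely of positive food sentences would score 3.68 + 2.62 - 1.99 = 4.31, and one where half the sentences are negative about service would score 3.68 - 2.125 + 0.7875 = 2.3425.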


Regression-Based Text Baseline

Rating Case

Restaurant Average-based Prediction
• Prediction using average rating given to a restaurant by all users
(we also tried user-average and combined)
• RMSE Errors:
Predicting Star Ratings | TEST A | TEST B | TEST C
Using Star Rating | 1.127 | 1.267 | 1.126
Using Sentiment-based text rating | 1.126 | 1.224 | 1.046

Predicting Sentiment Text Rating | TEST A | TEST B | TEST C
Using Star Rating | 0.703 | 0.718 | 0.758
Using Sentiment-based text rating | 0.545 | 0.557 | 0.514
• Predicting using text does better than the popularly used star rating


35

Clustering-based strategies for
recommendations
• KNN based on a clustering over star ratings
▫ Little improvement over baseline
▫ Does not take into account the textual information
▫ Sparse data
▫ Cold start problem
▫ Hard clustering not appropriate
• Soft clustering
▫ Partitions objects into clusters
▫ Each user has a membership probability to each
cluster


Information Bottleneck Method
• Foundations in Rate Distortion Theory
• Allows choosing tradeoff between
▫ Compression (number of clusters T)
▫ Quality estimated through the average distortion
between cluster points and cluster centroid (β
parameter)
• Shown to work well with sparse datasets

N. Slonim, SIGIR 2002

37

Leveraging text content for
personalized predictions
• Use the sentence types (categories, sentiments)
within the reviews as features
• Users clustered based on the type of information
in their reviews
• Predictions are made using membership
probabilities of clusters to find neighbors


38

Example: Clustering using iIB algorithm
      | Restaurant1 | Restaurant2 | Restaurant3
User1 | 4 | - | -
User2 | 2 | 5 | 4
User3 | 4 | ??? | 3
User4 | 5 | 2 | -
User5 | - | - | 1

Input matrix to the iIB algorithm (before normalization)
Columns per restaurant: Food Positive, Food Negative, Price Positive, Price Negative
      | Restaurant1 | Restaurant2 | Restaurant3
User1 | 0.6  0.2  0.2  - | -  -  -  - | -  -  -  -
User2 | 0.3  0.6  0.1  - | 0.9  -  0.1  - | 0.6  0.1  0.2  0.1
User3 | 0.7  0.1  0.15  0.05 | -  -  -  - | 0.2  0.8  -  -
User4 | 0.9  0.05  0.05  - | 0.3  0.4  0.2  0.1 | -  -  -  -
User5 | -  -  -  - | -  -  -  - | -  0.7  0.3  -

39

Example: Soft-clustering Prediction
User rating (star or text)
      | Restaurant1 | Restaurant2 | Restaurant3
User1 | 4 | - | -
User2 | 2 | 5 | 4
User3 | 4 | * | 3
User4 | 5 | 2 | -
User5 | - | - | 1

Cluster Membership Probabilities
      | Cluster1 | Cluster2 | Cluster3
User1 | 0.040 | 0.057 | 0.903
User2 | 0.396 | 0.202 | 0.402
User3 | 0.380 | 0.502 | 0.118
User4 | 0.576 | 0.015 | 0.409
User5 | 0.006 | 0.990 | 0.004

• For each cluster we compute the cluster contribution for the test
restaurant
• Weighted average of ratings given to the restaurant

Contribution(c2,r2) = 4.793,
Contribution(c3,r2) = 3.487

• We compute the final prediction based on the cluster contributions for
the test restaurant and the test user’s membership probabilities
Final prediction = 4.042
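The numbers on this slide can be reproduced with the sketch below. The formulas are a reconstruction that matches the slide's values, not a quotation of the original equations: each cluster contribution is taken as the membership-weighted average of the ratings given to Restaurant2, and the prediction for User3 weights those contributions by User3's membership probabilities. Function and variable names are illustrative.

```python
# Ratings given to the test restaurant (Restaurant2), from the slide.
ratings_r2 = {"User2": 5, "User4": 2}

# Cluster membership probabilities (Cluster1, Cluster2, Cluster3), from the slide.
membership = {
    "User2": (0.396, 0.202, 0.402),
    "User3": (0.380, 0.502, 0.118),  # test user
    "User4": (0.576, 0.015, 0.409),
}


def cluster_contribution(cluster, ratings, membership):
    """Membership-weighted average of the ratings given to the restaurant."""
    weighted = sum(membership[u][cluster] * r for u, r in ratings.items())
    total = sum(membership[u][cluster] for u in ratings)
    return weighted / total


def predict(user, ratings, membership):
    """Weight each cluster contribution by the test user's membership."""
    probs = membership[user]
    contribs = [cluster_contribution(c, ratings, membership)
                for c in range(len(probs))]
    return sum(p * c for p, c in zip(probs, contribs)) / sum(probs)


contributions = [cluster_contribution(c, ratings_r2, membership) for c in range(3)]
prediction = predict("User3", ratings_r2, membership)
```

Rounded to three decimals, this gives 4.793 for Cluster2, 3.487 for Cluster3, and a final prediction of 4.042, in line with the slide.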


iIB Algorithm
• Experimented with different values of β and T; used
β=20, T=100.
RMSE errors and percentage improvement over baseline:
Predicting Star Ratings | TEST A | TEST B | TEST C
Using Star Rating | 1.103 (2.13%) | 1.242 (1.74%) | 1.106 (1.78%)
Using Sentiment-based text rating | 1.113 (1.15%) | 1.211 (1.06%) | 1.046 (0%)

Predicting Sentiment Text Rating | TEST A | TEST B | TEST C
Using Star Rating | 0.692 (1.56%) | 0.704 (1.95%) | 0.742 (2.11%)
Using Sentiment-based text rating | 0.544 (0.18%) | 0.549 (1.44%) | 0.514 (0%)

• Using text features for clustering always helps with
the traditional goal of predicting star ratings
• Even small improvements in RMSE are useful (Netflix,
precision in top-k)

41

URSA: Qualitative Predictions
• Predict sentiment towards each topic
• Cluster users along each dimension separately
• Use threshold to classify sentiment (actual and
predicted)
[Figure: prediction accuracy for positive ambience, shown in bands from 0%-20% to 80%-100%, for threshold values θact from A-0 to A-1 in steps of 0.1]


42

PERSEUS Project Description
Patient Emotion and StRucture SEarch USer
Interface
▫ Large amount of patient-produced data
• Difficult to search and understand
• Patients need help finding information
• Health professionals could learn from the data
▫ Analyze and Search patient forums, mailing lists and blogs
• Topical information
• Specific Language
• Time sensitive
• Emotionally charged
Google Research Award – April 2010
NSF CDI Type I – October 2010-2013

43

PERSEUS Project Description
▫ Automatically add structure to free-text
• Use of context information
• e.g., is “hair loss” a side effect or a symptom?
• Approximate structure
▫ Use structure to guide search
• Need for high recall, but good precision
• Find users with similar experiences
• Various results granularities
• Thread vs. sentence
• Context dependent
• Needs to take approximation into account


44

Structuring and Searching Web Content
Contributions
• Leveraged automatically generated structure to improve
predictions
▫ Around 2% RMSE improvements
▫ Used inferred structure to group users using soft clustering
techniques
• Qualitative predictions
▫ High Accuracy
• Future directions
▫ Extension to healthcare domains
▫ Use of inferred structure to guide search
▫ Use user clusters in search
▫ Adapt to various result granularities
▫ Take classification inaccuracies into account


45

Joint work with:
Minji Wu Computer Science, Rutgers University

Collaborators:
Serge Abiteboul, Alban Galland INRIA
Pierre Senellart Telecom ParisTech
Magda Procopiuc, Divesh Srivastava AT&T Research Labs
Laure Berti-Equille IRD


46

Motivations
• Information from web sources is unreliable
▫ Erroneous
▫ Misleading
▫ Biased
▫ Outdated
• Users need to check web sites to confirm the
information
▫ Data corroboration

Minji Wu - Rutgers University

47

Example: What is the gas mileage of my
Honda Civic?
Query: “honda civic 2007
gas mileage” on MSN
Search
• Is the top hit, the
honda.com site,
unbiased?
• Is the autoweb.com web
site trustworthy?
• Are all these values
referring to the correct
year of the model?
Users may check several web
sites to get an answer


48

Example: Identifying good business
listings
• NYC restaurant information from 6 sources
▫ Yellowpages
▫ Menupages
▫ Yelp
▫ Foursquare
▫ OpenTable
▫ Mechanical Turk (check streetview)

Which listings are correct?


49
Data Corroboration Project Description
Publications: WebDB’07, WSDM’10, IS’11, DEB’11
Trustworthy sources report true facts
True facts come from trustworthy sources
• Sources have different
▫ Coverage
▫ Domain
▫ Dependencies
▫ Overlap
• Conflict resolution with maximum coverage

Microsoft Live Labs Search Award – May 2006


50

Top-k Join: Project Description
Publications: CleanDB’06, PVLDB’10
Integrate and aggregate information from several sources

(“minji”, “vldb10”, 0.2)

(“minji”, “amélie”, 1.0)

(“amélie”, “vldb10”, 0.5)

(“amélie”, “SIN”, 0.3)
(“minji”, “SIN”, 0.1)

(“SIN”, “vldb10”, 0.9)


51

Data Corroboration
Contributions
• Probabilistic model for corroboration
▫ Fact uncertainty
▫ Source trustworthiness
▫ Source coverage
▫ Conflict between sources
• Fixpoint techniques to compute truth values of facts and
source quality estimates
• Top-k query algorithms for computing corroborated answers
• Open Issues:
▫ Functional dependencies
▫ Time
▫ Social network
▫ Uncertain data
▫ Source dependence
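The fixpoint idea from the contributions above (trustworthy sources report true facts, and true facts come from trustworthy sources) can be illustrated with a generic iteration. This is a simplified sketch and not the project's probabilistic model: the update rules, the initial trust of 0.8, and the name `corroborate` are all assumptions made for illustration.

```python
def corroborate(votes, iterations=50, tol=1e-9):
    """votes: {source: {fact: True or False}}.

    Returns (belief per fact, trust per source) after iterating to a fixpoint.
    """
    trust = {s: 0.8 for s in votes}  # assumed initial trust for every source
    facts = {f for claims in votes.values() for f in claims}
    belief = {}
    for _ in range(iterations):
        # Belief in a fact: trust-weighted share of the sources asserting it.
        for f in facts:
            voters = [(s, claims[f]) for s, claims in votes.items() if f in claims]
            total = sum(trust[s] for s, _ in voters)
            yes = sum(trust[s] for s, v in voters if v)
            belief[f] = yes / total if total > 0 else 0.5
        # Trust in a source: average agreement of its votes with current beliefs.
        new_trust = {}
        for s, claims in votes.items():
            agree = [belief[f] if v else 1.0 - belief[f] for f, v in claims.items()]
            new_trust[s] = sum(agree) / len(agree) if agree else trust[s]
        converged = max(abs(new_trust[s] - trust[s]) for s in votes) < tol
        trust = new_trust
        if converged:
            break
    return belief, trust


votes = {
    "A": {"open": True, "closed_sunday": True},
    "B": {"open": True, "closed_sunday": True},
    "C": {"open": False, "closed_sunday": True},
}
belief, trust = corroborate(votes)
```

In this toy input, source C disagrees with A and B on one fact, so its trust drops below theirs and the belief that the listing is open rises above the initial two-thirds vote share.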


52

Conclusions
• New Challenges in web data management
▫ Semi-structured data
- PIMS
- User reviews
▫ Multiple sources of data
- Conflicting information
- Low quality data providers (Web 2.0)
• SPIDR lab at Rutgers focuses on helping users
identify useful data in the wealth of information
available

53

