Leveraging Collaborative Tagging for Web Item Design

Mahashweta Das, Gautam Das, Vagelis Hristidis

Presenter: Ajith C Ajjarani [1000-727269]
                                      1/15/2012
                                                  1
Outline: Organization of the Presentation
- Motivation & Problem Definition
- Naïve Bayes Classifier
- Tag Maximization: NP-Complete
  - Moderate instances: Exact Two-Tier Top-K Algorithm
  - Larger instances: Approximation Algorithm
- Experiments & Result Tabulation
Motivation

"Can I design a new camera which attracts & maximizes the tags?"
Let's define this opportunity as a problem!
Problem Construction (Training Data)

 Attributes are the product definition
 Tags are user-defined

Now, given a subset of subjective "desired" tags, predict a new item (a combination of attribute values).
Extend this to a "Top-K" version: find the k potential items with the highest expected number of desirable tags.
Problem Statement
• Given a database of tagged products, the task is to design k new products (attribute-value combinations) that are likely to attract the maximum number of desirable tags
   – tag desirability is just one aspect of product design consideration
• Applications
   – electronics, autos, apparel
   – musical artists, bloggers

(Camera attributes illustrated: Zoom? Flash? Resolution? Light sensitivity? Shooting mode?)
Tag Maximization
Technically challenging, as complex dependencies exist between tags and items.

It is difficult to determine a combination of attribute values that maximizes the expected number of desirable tags.

A "Naïve Bayes" classifier is used for tag prediction.
Even for this classifier (with its simplistic conditional-independence assumption), the tag maximization problem is NP-complete.

Rather than resorting to heuristics, the authors developed principled algorithms.
Proposed Solution

 Exact Two-Tier Top-K Algorithm (ETT): performs significantly better than the naïve brute-force algorithm (no need to compute all possible products)
 Applies Rank-Join and the TA top-k algorithm in a two-tier architecture
 In the worst case, may have exponential running time

 Approximation Algorithm: a Polynomial-Time Approximation Scheme (PTAS) with provable error bounds
 Its overall running time is exponential only in the (constant) size of the tag groups, so it can be reduced to polynomial time complexity
 Suitable for large datasets
Problem Framework (Boolean Dataset)

• D = {o1, o2, ..., on} — the items
• A = {A1, A2, ..., Am} — the Boolean attributes
• T = {T1, T2, ..., Tr} — the tags

 Each item is thus a Boolean vector of size (m + r)

• Such a dataset is used as a training set to build Naive Bayes Classifiers (NBC) and compute Pr(Tag | Attributes)
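The training step can be sketched as follows. This is a minimal Python illustration, not the paper's implementation — the function names (`train_nbc`, `pr_tag_given_attrs`) and the add-one (Laplace) smoothing choice are assumptions:

```python
def train_nbc(items, m, r):
    """Train per-tag Naive Bayes statistics from Boolean item vectors.

    items: list of 0/1 vectors of length m + r (m attributes, then r tags).
    Returns, for each tag j, the prior Pr(Tj) and the conditionals
    Pr(Ai = 1 | Tj), estimated with add-one (Laplace) smoothing.
    """
    n = len(items)
    priors, conds = [], []
    for j in range(r):
        with_tag = [o for o in items if o[m + j] == 1]
        nt = len(with_tag)
        priors.append((nt + 1) / (n + 2))
        # Pr(Ai = 1 | Tj) for each attribute i
        conds.append([(sum(o[i] for o in with_tag) + 1) / (nt + 2)
                      for i in range(m)])
    return priors, conds

def pr_tag_given_attrs(attrs, j, priors, conds):
    """Unnormalized Pr(Tj | attributes) under conditional independence
    (the Pr(o) normalizer is dropped, since it is the same for all tags)."""
    p = priors[j]
    for i, a in enumerate(attrs):
        pi = conds[j][i]
        p *= pi if a == 1 else (1 - pi)
    return p
```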

Derived Results
The probability that a new item o is annotated by the tag Tj (via Naive Bayes):
  Pr(Tj | o) ∝ Pr(Tj) · ∏i Pr(o.Ai | Tj)

The probability Pr(Tj' | o) of an item o not having tag Tj:
  Pr(Tj' | o) ∝ Pr(Tj') · ∏i Pr(o.Ai | Tj')
Derived Results
Derived: for convenience, define the ratio Rj = Pr(Tj | o) / Pr(Tj' | o), so that Pr(Tj | o) = Rj / (1 + Rj).

The expected number of desirable tags Td = {T1, . . . , Tz} ⊆ T that a new item o is annotated with:
  E = Σ (Tj ∈ Td) Pr(Tj | o)
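By linearity of expectation, the expected count is simply the sum of the per-tag probabilities. A small sketch (the function name and the score inputs are hypothetical):

```python
def expected_desirable_tags(pos_scores, neg_scores):
    """Expected number of desirable tags for an item o.

    pos_scores[j] ~ unnormalized Pr(Tj | o), neg_scores[j] ~ Pr(Tj' | o).
    Using the convenience ratio Rj = Pr(Tj|o) / Pr(Tj'|o), the normalized
    probability is Rj / (1 + Rj); the expectation is the sum over the
    desirable tags (linearity of expectation).
    """
    total = 0.0
    for p, q in zip(pos_scores, neg_scores):
        r_j = p / q                 # convenience ratio Rj
        total += r_j / (1.0 + r_j)  # = normalized Pr(Tj | o)
    return total
```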
Exact Algorithm
• Naïve brute-force
   – Consider all 2^m possible products and compute the expected number of desirable tags for each
   – Exponential complexity

• Exact two-tier top-k (ETT)
   – Applies Rank-Join and the TA top-k algorithm in a two-tier architecture
   – Does not need to compute all possible products
       • performs significantly better than naïve brute-force
   – Works well for moderate data instances, but does not scale to larger data
       • In the worst case, may have exponential running time
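For contrast, the brute-force baseline is easy to state. A sketch assuming some `score` function built from the NBC probabilities (the name is hypothetical):

```python
from itertools import product

def brute_force_top1(m, score):
    """Naive baseline: enumerate all 2^m candidate products and keep the
    one with the highest expected number of desirable tags.  `score` is
    any function mapping an attribute vector to its expected tag count
    (e.g., built from the NBC probabilities).  O(2^m) evaluations --
    exactly the work that ETT avoids.
    """
    return max(product((0, 1), repeat=m), key=score)
```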
ETT: Two-Tier Architecture

Tier 1: determine the "best" item for each tag (T1, T2, ..., Tz).
Tier 2: match these items to compute the global best product across all tags.

Here z is the number of desirable tags, and the m attributes are split into lists of m' = m / l attributes each.
ETT Algorithm (Exemplification)
• Database: attributes {A1, A2, A3, A4}, tags {T1, T2}, top-1
   – Partition the attributes into 2 groups {A1, A2} and {A3, A4} to form 2 lists of partial products
   – Run NBC and calculate the scores
   – Each list has 2^m' = 2^2 = 4 entries (partial products)
   – Compute a score for each partial product for each tag and sort in descending order
Tier 2: a buffer holds (product, complete score) pairs for the current top-k.
MUS: the sum of the last-seen scores across all lists (returned by GetNext()).
MPFS: the maximum possible future score of a join, compared against each actual/complete score.

Tier 1: sorted lists of (partial product, score) per tag —
  T1:  L11 (A1 A2): 10, 1.97 | 00, 0.84 | 11, 0.84 | 01, 0.36
       L12 (A3 A4): 10, 1.97 | 00, 0.84 | 11, 0.84 | 01, 0.36
  T2:  L21 (A1 A2): 11, 2.76 | 01, 1.18 | 10, 1.18 | 00, 0.51
       L22 (A3 A4): 11, 4.57 | 10, 2.53 | 01, 0.91 | 00, 0.51
Iteration 1
Tier 2 buffer: 1111 (1.75), 1010 (1.70). MinK (1.75) <= MUS (1.88), so return to Tier 1.
Tier 1: GetNext() = 1010 for T1 and GetNext() = 1111 for T2 (rank-join over L11/L12 and L21/L22).
  T1 join 1: product 1010, partial score 0.95 >= MPFS 0.95
  T2 join 1: product 1111, partial score 0.93 >= MPFS 0.93
Iteration 2
Tier 2 buffer: 1110 (1.77), 1011 (1.76). MinK (1.77) <= MUS (1.79), so return to Tier 1.
Tier 1: GetNext() = 1011 for T1 and GetNext() = 1110 for T2.
  T1 join 2: product 1011, partial score 0.92 >= MPFS 0.92
  T2 join 2: product 1110, partial score 0.88 >= MPFS 0.88
Iteration 3
Tier 2 buffer: 0111 (1.77), 0010 (1.76). Now MinK (1.77) > MUS (1.74), so ETT terminates.
Tier 1: GetNext() = 0010 for T1 and GetNext() = 0111 for T2.
  T1 join 3: product 0010, partial score 0.89 >= MPFS 0.89
  T2 join 3: product 0111, partial score 0.84 >= MPFS 0.84

Thus, ETT returns the best item (0111 or 1110) in just 6 item look-ups.
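The tier-2 stopping rule used in the walkthrough (continue while MinK <= MUS) can be sketched as follows; the names and the buffer representation are assumptions, not the paper's code:

```python
def tier2_should_continue(buffer_scores, last_seen, k):
    """TA-style stopping test for ETT's tier 2 (sketch).

    buffer_scores: complete scores of candidate products seen so far.
    last_seen: the last score returned by GetNext() for each tag's list;
    their sum is MUS, an upper bound on any unseen product's score.
    Continue while MinK (the k-th best complete score) <= MUS.
    """
    if len(buffer_scores) < k:
        return True                     # buffer not yet full: keep going
    min_k = sorted(buffer_scores, reverse=True)[k - 1]
    mus = sum(last_seen)
    return min_k <= mus
```

In the example, iteration 3 reaches MinK = 1.77 and MUS = 1.74, so this test fails and ETT stops.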
Approximation Algorithm

- The z desirable tags are partitioned into z/z' subgroups of z' tags each (e.g., {T1, T2, ..., Tz'}, ...)
- ε = 2σm, where σ is the compression factor
- Each subgroup is solved using a PTAS, in polynomial time, for approximation factor ε
- Top-k items are computed for each subgroup: O11, O12, ..., O1k; O21, O22, ..., O2k; ...
- These are combined into the overall top-k items O1, O2, ..., Ok
PTAS Algorithm Design
For k = 1, one subgroup, and ε > 0:

- z = z' tags: T1, T2, ..., Tz'
- Compute the top k = 1 item Oa for this subgroup

The PTAS should run in polynomial time and maintain the invariant:
  ExactScore(Oa) >= (1 − ε) · ExactScore(Og)
where Oa is the PTAS-returned item and Og is the optimal item.
PTAS Algorithm Design
A simple exponential-time exact top-1 algorithm for the sub-problem is created first and then reduced to a PTAS.
Given m Boolean attributes and z' tags, the exponential-time algorithm makes m iterations:

  Initial step: produce the set S0 consisting of the single item 0^m, along with its z' scores, one for each tag.
  First iteration: produce the set of two items S1 = {0^m, 10^(m−1)}, each accompanied by its z' scores, one for each tag.
  i-th iteration: produce the set of items Si = {0, 1}^i × 0^(m−i), along with their z' scores, one for each tag.
  The final set Sm contains all 2^m items with exact scores, from which the top-1 item can be returned.
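The iterative construction S0, S1, ..., Sm can be sketched as follows, assuming each tag's score combines multiplicatively over attributes (as with the Naive Bayes ratios). Here `factors[i][j]` — the multiplier attribute i contributes to tag j's score — and the function name are hypothetical:

```python
def exact_top1(m, factors, init_scores):
    """Exponential-time exact top-1 for one subgroup (sketch).

    Start from S0 = {0^m}.  At iteration i, every item in S(i-1) is kept
    and also extended by setting attribute i to 1, which multiplies each
    of its z' per-tag scores by factors[i][j].  After m iterations, Sm
    holds all 2^m items with exact scores.
    """
    S = [((0,) * m, list(init_scores))]            # S0
    for i in range(m):
        extended = []
        for item, scores in S:
            new_item = item[:i] + (1,) + item[i + 1:]
            new_scores = [s * factors[i][j] for j, s in enumerate(scores)]
            extended.append((new_item, new_scores))
        S = S + extended                           # Si: size doubles
    # top-1 by total (exact) score over the z' tags
    return max(S, key=lambda e: sum(e[1]))
```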

PTAS Algorithm Design
Consider the running example table with:
  z = z' = 2, σ = 0.5, m = 4, ε = 2σm = 4

  Og = {1110} with exact score 1.77 = 0.89 + 0.88
  Oa = {1111} with exact score 1.75 = 0.82 + 0.93
PTAS Algorithm Design
After compression, the retained cluster representative's exact (underlined) score should be close to each deleted item's exact score.
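The compression idea can be sketched as a geometric bucketing of the per-tag score vectors. The bucketing scheme below is a hypothetical illustration of clustering near-equal scores, not the paper's exact construction (it assumes all scores are positive):

```python
import math

def compress(S, sigma):
    """PTAS-style compression step (sketch, hypothetical bucketing).

    Items whose z' per-tag scores fall into the same geometric bucket
    (width controlled by the compression factor sigma) are clustered,
    and only one representative per cluster is kept.  The surviving
    representative's exact score then stays within a bounded factor of
    any deleted item's score, which is what yields the (1 - eps) bound.
    """
    buckets = {}
    for item, scores in S:
        key = tuple(math.floor(math.log(s) / math.log(1 + sigma))
                    for s in scores)
        buckets.setdefault(key, (item, scores))  # keep first representative
    return list(buckets.values())
```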




Experiment
Synthetic and real datasets are used for quantitative and qualitative analysis of the proposed algorithms.

Quantitative performance indicators:
 Efficiency of the proposed exact and approximation algorithms
 The approximation factor actually obtained by the approximation algorithm

Qualitative evaluation:
An Amazon Mechanical Turk user study assesses the results of the algorithms.
Experiment
Real camera dataset:
 Crawled a real dataset of 100 cameras listed at Amazon.
 Each listed camera contains technical details (attributes) and the tags customers associate with it.
 The tags are sanitized to remove synonyms and unintelligible or undesirable tags such as "Nikon coolpix", "quali", "bad", etc.

Synthetic dataset:
 A Boolean matrix of dimension 10,000 (items) × 100 (50 attributes + 50 tags).
 The 50 independently distributed attributes fall into 4 groups, where the value is set to 1 with probabilities 0.75, 0.15, 0.10, and 0.05.
 The 50 tags have predefined relations, built by randomly picking sets of correlated attributes.
Quantitative: Performance
Exact algorithm:
• Synthetic dataset with 1000 items, 16 attributes, and 8 tags (Naïve vs. ETT)




Quantitative: Performance
The figure reveals that ETT is extremely slow beyond m = 16 attributes.

PA, with an approximation factor of 0.5, continues to return guaranteed results in reasonable time as the number of attributes m increases.




Quantitative: Performance
Execution time and the obtained approximation factor on a synthetic dataset of 1000 items, 20 attributes, and 8 tags. The top-1 item is considered.




Qualitative: User Study
First part of the user study:

The PA algorithm, with an approximation factor of 0.5, was run on tag sets corresponding to compact cameras and SLR cameras respectively.

4 new cameras (2 digital compact & 2 digital SLR) built by the PA algorithm (ε = 0.5)
                    vs.
4 existing popular cameras

65% of users chose the new cameras.




Qualitative: User Study
Second part of the study:

6 new cameras were designed for three groups (2 potential new cameras per group):
1. young students
2. old retirees
3. professional photographers

When users were asked to assign at least five tags to each camera, the majority of users correctly classified the six cameras into the three groups.




Conclusion
 Defined the Tag Maximization problem and investigated its computational complexity.
 Proposed 2 novel algorithms and demonstrated their practicality.
 This work is a preliminary look at a novel area of research and promises exciting directions for future work.
 Future work: use decision-tree, SVM, and regression-tree classifiers and conduct the experiments with them.




References
http://crystal.uta.edu/~gdas/Courses/websitepages/fall10DBIR.html




Questions?

Thank You

Editor's Notes

  • #3 Tag maximization problem: how to decide the attribute values of new items (Ox) and return the top-k "best" items that are likely to attract the maximum number of desirable tags.