The Cultural Heritage domain has opened up to contributions from users on the web. These contributions are mainly tags that describe certain aspects of a cultural heritage object. With a wide range of users on the web, it becomes important to determine the quality of user-contributed content before it is published online. However, manually evaluating the quality of these user-generated contributions is resource-intensive for Cultural Heritage institutions. In this talk, I will describe methods that can semi-automatically predict the quality of tags. These methods address three research questions: How can we trust an online contributor? How can we assess the quality of the annotation process? How can we trust the contributed data?
1. Trusting user-contributed data in the
Cultural Heritage Domain
Archana Nottamkandath
(Work done with Davide Ceolin & Wan Fokkink)
VU University Amsterdam
COMMIT/SEALINC
1
2. Context
• COMMIT/SEALINC project
• Museums have collections that can be
annotated with (external) user-contributed
information to improve search through the
collections
Tulips
Butterfly
Portrait
3. Can we directly trust the user-provided
content?
4. Can we trust the user-provided
content directly? – Apparently not!
Stella is Gay
www.apartmentvermeer.com
7. Evaluation costs resources
• Requires expensive manual labor
• Costs a lot of time
• Requires adherence to museum policies
– Museum X: [Accept, Not sure, Reject]
– Museum Y: [Foreign, Judgmental, Strong reject,
Strong accept] …
8. Need for automated trust analysis
• Algorithms automatically or semi-automatically
evaluate annotations
[Example candidate tags for an image: (a) Flower (b) 19th century (c) Sunshine (d) Vermeer (e) Bronze]
9. Automated Trust analysis algorithms
• Requirements
– High accuracy (Accurately predict evaluations
most of the time)
– Minimum input from cultural heritage
professionals
– Scalable and efficient (w.r.t. resources and time)
– Works with different cultural heritage data
10. Definition
• Trustworthy annotation
– Relevant to image
– Enhances or reinstates existing knowledge
– Is acceptable under museum policies for
publication on their website
12. How to determine trust from users
contributing annotations to the
system?
[Diagram: user Jones contributed the tags Tulips, Roses, Night Sky, Van Gogh, Buddhist Portrait, Monument, Asian, War memorial via the Accurator Interface]
13. How to determine trust from the
Annotation Process?
[Diagram: user Jones contributed the tags Tulips, Roses, Night Sky, Van Gogh, Buddhist Portrait, Monument, Asian, War memorial via the Accurator Interface]
14. How to determine trust from
contributed data?
[Diagram: user Jones contributed the tags Tulips, Roses, Night Sky, Van Gogh, Buddhist Portrait, Monument, Asian, War memorial via the Accurator Interface]
15. How to determine trust from
users?[1]
• Evaluate subset of user tags
[Diagram: the museum evaluates a train set of Jones's contributed tags (Tulips, Van Gogh, Buddhist, Monument); the remaining tags form the test set (Roses, Night sky, Van Gogh, Asian, War memorial)]
16. How to determine trust from users?[1]
• A user who is an expert on one topic might,
with a certain probability, be an expert on
similar topics
[Diagram: expert on Tulips → possibly expert on Roses and Lilies; train set: Tulips, Van Gogh, Buddhist, Monument; test set: Roses, Night sky, Van Gogh, Asian, War memorial]
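The propagation idea above can be sketched in plain Python. This is a minimal illustration, not the talk's actual algorithm: it assumes trust in a topic is the smoothed acceptance rate of the museum-evaluated tags (the mean of a Beta distribution) and that pairwise topic-similarity scores are supplied from elsewhere; the function names and all numbers are hypothetical.

```python
def topic_trust(evaluations, prior_accept=1, prior_reject=1):
    """Smoothed acceptance rate per topic (mean of a Beta distribution).

    evaluations maps topic -> list of booleans: True if the museum
    accepted the tag, False if it rejected it.
    """
    trust = {}
    for topic, results in evaluations.items():
        accepted = sum(results)
        total = len(results) + prior_accept + prior_reject
        trust[topic] = (accepted + prior_accept) / total
    return trust


def propagate_trust(trust, similarity, new_topic):
    """Estimate trust on an unevaluated topic as a similarity-weighted
    average of trust on evaluated topics."""
    pairs = [(similarity.get((known, new_topic), 0.0), value)
             for known, value in trust.items()]
    weight = sum(w for w, _ in pairs)
    if weight == 0:
        return 0.5  # no related evidence: fall back to a neutral prior
    return sum(w * v for w, v in pairs) / weight


# Hypothetical evaluations of Jones's tags by the museum
trust = topic_trust({"Tulips": [True, True, True, True, False],
                     "Monument": [True, False]})
# Hypothetical similarity scores between topics
sim = {("Tulips", "Roses"): 0.8, ("Monument", "Roses"): 0.1}
rose_trust = propagate_trust(trust, sim, "Roses")
```

With these numbers, trust in the unevaluated topic Roses ends up between the neutral prior and the well-supported Tulips score, because Roses is highly similar to Tulips but only weakly similar to Monument.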
17. Determine trust from users[2]
• User profile: [Experience, education, country,
gender, income, museum visits…]
Steve.museum dataset
18. Determine trust from users[2]
• Predict user reputation using machine
learning
• [Feature1, Feature2, ..] -> Category of user
– [21 yrs, Female, Bachelors, Australia] -> Excellent
– [60 yrs, Male, PhD, America] -> Good
– [56 yrs, Female, Masters, Croatia] -> Bad
– [30 yrs, Male, Bachelors, Mexico] -> ?
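The slides do not name the classifier used for this prediction; as a hedged sketch, a 1-nearest-neighbour rule over the profile features shown above could look like this. The distance function and its weighting are invented for illustration.

```python
def profile_distance(a, b):
    """Distance between profiles (age, gender, education, country):
    scaled age gap plus one per mismatched categorical feature.
    This metric is hypothetical, chosen only for the example."""
    age_gap = abs(a[0] - b[0]) / 100.0
    mismatches = sum(1 for x, y in zip(a[1:], b[1:]) if x != y)
    return age_gap + mismatches


def predict_reputation(train, profile):
    """1-nearest-neighbour: return the category of the closest
    already-evaluated profile."""
    _, category = min(train, key=lambda item: profile_distance(item[0], profile))
    return category


# Evaluated profiles from the slide, paired with their categories
train = [
    ((21, "Female", "Bachelors", "Australia"), "Excellent"),
    ((60, "Male", "PhD", "America"), "Good"),
    ((56, "Female", "Masters", "Croatia"), "Bad"),
]
category = predict_reputation(train, (30, "Male", "Bachelors", "Mexico"))
```

A real pipeline would learn feature weights from the evaluated data rather than hard-coding them, but the shape of the problem, profile features in, reputation category out, is the same.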
19. How to determine trust from
Annotation process?
• Time of day, day of week, day of month, etc.
affect user quality
• Typing speed affects user quality
– Typing fast might indicate higher confidence
[Example tags: Tulips, Van Gogh, Buddhist, Monument, Rich Lady, Plant, Leonardo, Bronze plate]
20. How to determine trust from
Annotation process?
• Predict tag quality using machine learning
• [Feature1, Feature2, ....] -> Category of Tag
– [10:00, Monday, June, 3s] -> Excellent
– [12:00, Wednesday, 15s] -> Good
– [23:56, Friday, April, 80s] -> Bad
– [06:00, Thursday, March, 70s] -> ?
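Turning these process attributes into inputs for a classifier is a simple featurization step. A minimal sketch, with an assumed encoding (time of day as fractional hours, weekday as an index, typing time in raw seconds); the encoding is one plausible choice, not necessarily the one used in the talk:

```python
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def encode_process_features(clock_time, weekday, typing_seconds):
    """Encode annotation-process attributes as a numeric vector
    that a standard classifier can consume."""
    hour, minute = map(int, clock_time.split(":"))
    return [hour + minute / 60.0,     # time of day as fractional hours
            WEEKDAYS.index(weekday),  # day of week as an index
            typing_seconds]           # slow typing may signal lower confidence


vec = encode_process_features("10:00", "Monday", 3)
```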
21. How to determine trust from
Annotation process?
• Why is this important?
– Useful for anonymous users who did not fill in
profile information
22. How to determine trust from data?
• The contributed data itself has features; use
machine learning to predict tag quality
– Length
– Specificity
– Presence in vocabularies
– Times already contributed
– Whether the tag is a noun
[Example tags: Tulips, Van Gogh, Buddhist, Monument]
[6, specific, yes, English, 10, no, …] -> Good
[7, specific, yes, Dutch, 1, yes, …] -> Bad
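A sketch of how such tag-level features might be extracted, assuming a controlled vocabulary and a history of previously contributed tags are available as inputs. Specificity and part-of-speech checks would need a lexical resource (e.g. WordNet) and are omitted here; all names and data are illustrative.

```python
def tag_features(tag, vocabulary, previous_tags):
    """Feature dictionary for one tag. 'vocabulary' stands in for the
    controlled vocabularies a real system would consult, and
    'previous_tags' for the collection's tag history."""
    normalized = tag.lower()
    return {
        "length": len(tag),
        "in_vocabulary": normalized in vocabulary,
        "times_contributed": previous_tags.count(normalized),
        "word_count": len(tag.split()),  # multi-word tags are often more specific
    }


vocab = {"tulips", "van gogh", "monument"}
history = ["tulips", "tulips", "van gogh"]
features = tag_features("Tulips", vocab, history)
```

Each contributed tag then becomes a feature vector like the ones on the slide, and a classifier trained on museum-evaluated examples maps it to a quality category.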
23. Goals achieved
• Requirements
– High accuracy (Accurately predict evaluations
most of the time)
– Minimum input from cultural heritage
professionals
– Scalable and efficient
– Works with different cultural heritage data
25. Goal 1: High accuracy (accurately predict
evaluations most of the time)
• Predicted the quality of a tag based on the user
profile with accuracy from 68% to 72%
(Steve dataset results)
26. Goal 2: Minimum input from
Cultural Heritage Institutions
• Algorithms require a minimum of 5 evaluated
tags per user for predictions
• Working to minimize or eliminate this
requirement
27. Goal 3: Scalable and efficient
• Reduced computation time while maintaining
accuracy in Steve dataset
28. Goal 4: Works with different
cultural heritage data
• Steve Museum dataset
• Waisda? Dataset
– Video Tagging Game
• SEALINC Media experiments at CWI
29. Future Work
• Employ our experiences and algorithms to
analyze the data from Accurator
• Employ trust scores for ranking in search
• Identify techniques to visualize trust