Why are there more hotels in Tyrol than in Austria? Analyzing schema.org usage in the hotel domain
1. ENTER 2016 Research Track Slide Number 1
Why are there more hotels in Tyrol
than in Austria?
Analyzing schema.org usage in the hotel domain
Elias Kärle, Anna Fensel, Ioan Toma, and Dieter Fensel
Semantic Technology Institute (STI), Austria
firstname.lastname@sti2.at
http://www.sti2.at
@eliaska
#ENTER2016
2. ENTER 2016 Research Track Slide Number 2
Outline
1. Motivation
2. Data
3. Analysis
4. ENTER 2016 Research Track Slide Number 4
1. Motivation
• Dieter Fensel has a Wikipedia page
5. ENTER 2016 Research Track Slide Number 5
1. Motivation
• Italian swimmer vs. @cyberandy
• How did he do it?
6. ENTER 2016 Research Track Slide Number 6
1. Motivation
• Schema.org annotation
• Hotels and tourism
do they use annotations?
7. ENTER 2016 Research Track Slide Number 7
1. Motivation
1) How many hotels use schema.org?
2) How is schema.org used?
   a) Which classes?
   b) Which attributes?
   c) Is schema.org used correctly?
3) Who is using schema.org in tourism?
4) Does the use of schema.org increase/decrease?
8. ENTER 2016 Research Track Slide Number 8
2. Data
What is schema.org?
• Initiative founded in 2011
• Vocabulary for structuring data on websites
• Embedded into HTML as:
– Microdata
– RDFa
– JSON-LD
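To illustrate the JSON-LD variant, here is a minimal Python sketch that builds a schema.org/Hotel annotation and wraps it in the script tag a page would embed. The hotel name and address are made-up placeholder values, and the helper names `hotel_jsonld` and `as_script_tag` are hypothetical; none of this comes from the study's data.

```python
import json


def hotel_jsonld(name, street, postal_code, region, country):
    """Build a schema.org/Hotel annotation as a JSON-LD dictionary."""
    return {
        "@context": "http://schema.org",
        "@type": "Hotel",
        "name": name,
        "address": {
            "@type": "PostalAddress",
            "streetAddress": street,
            "postalCode": postal_code,
            "addressRegion": region,
            "addressCountry": country,
        },
    }


def as_script_tag(annotation):
    """Wrap the annotation in the script element used to embed JSON-LD in HTML."""
    body = json.dumps(annotation, indent=2, ensure_ascii=False)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Placeholder values for illustration only.
example = hotel_jsonld("Hotel Example", "Example Street 1", "6020", "Tyrol", "AT")
snippet = as_script_tag(example)
print(snippet)
```

Microdata and RDFa express the same classes and properties, but as attributes on the existing HTML elements instead of a separate script block.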
9. ENTER 2016 Research Track Slide Number 9
2. Data
Analysis of all websites:
Common Crawl:
• Founded in 2007
• Non-profit organisation
• Crawls the web 4 times (or more) per year
• Data dumps are openly available to the public
• November 2013: 2.3 billion web pages, 148 TB
• December 2014: 2.1 billion web pages, 160 TB
• 2015: 9 crawls; September: 1.3 billion web pages, 106 TB
10. ENTER 2016 Research Track Slide Number 10
2. Data
Survey of structured data only:
WebDataCommons:
• Started in 2012 by Freie Universität Berlin & KIT
• Currently at the University of Mannheim
• Operated by Chris Bizer
• Extracts structured data from the Common Crawl
– WebTables (July 2015): 233 million relational tables (1.8 billion pages)
– Hyperlink Graph (Spring 2014): 1.7 billion web pages, 64 billion links
– Semantically annotated data:
• November 2013: 44 TB, 2.2 billion URLs
• December 2014: 64 TB, 2 billion URLs
11. ENTER 2016 Research Track Slide Number 11
2. Data
• November 2013 corpus and December 2014 corpus
• Subset: schema.org/Hotel
– 35 GB / 42 GB
– 127 million / 148 million triples
• GraphDB-SE Repository
• SPARQL Queries
• Linux Debian 3.2, STI
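The study ran SPARQL queries against a GraphDB repository. As a rough stand-in that needs no triple store, the sketch below scans N-Quads lines (the format in which WebDataCommons publishes its extractions) and counts distinct subjects typed as schema.org/Hotel. The sample quads and the function name `count_hotels` are invented for illustration, and the whitespace split is only adequate for type statements, whose object is an IRI.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
HOTEL = "<http://schema.org/Hotel>"


def count_hotels(lines):
    """Count distinct subjects that have rdf:type schema.org/Hotel."""
    subjects = set()
    for line in lines:
        parts = line.split()
        # A type statement is: subject, rdf:type, class IRI, graph IRI, "."
        if len(parts) >= 4 and parts[1] == RDF_TYPE and parts[2] == HOTEL:
            subjects.add(parts[0])
    return len(subjects)


# Invented sample data, not taken from the corpus.
sample = [
    f"_:n1 {RDF_TYPE} {HOTEL} <http://example.org/a> .",
    '_:n1 <http://schema.org/name> "Hotel Example" <http://example.org/a> .',
    f"_:n2 {RDF_TYPE} {HOTEL} <http://example.org/b> .",
    f"_:n3 {RDF_TYPE} <http://schema.org/Restaurant> <http://example.org/b> .",
]

hotel_count = count_hotels(sample)
print(hotel_count)
```

For a corpus of this size, a triple store with SPARQL (as used by the authors) is the more practical choice; the sketch only shows what is being counted.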
12. ENTER 2016 Research Track Slide Number 12
3. Analysis
1) How many hotels are annotated with schema.org?
4,841,353
• Hotels are annotated several times
– own website
– booking websites
740,298
• Deduplicating by name loses all hotels that share a name
– Adler, Post, ...
Bind to address!
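The point about names versus addresses can be sketched in a few lines of Python. The records and the helper `distinct_by` are hypothetical: the same hotel appears twice (own website and booking site), and a different hotel shares its name.

```python
def distinct_by(records, *keys):
    """Count records that differ in at least one of the given keys."""
    return len({
        tuple(record.get(key, "").strip().lower() for key in keys)
        for record in records
    })


# Invented records for illustration.
records = [
    {"name": "Hotel Post", "street": "Hauptstrasse 1", "zip": "6020"},  # own website
    {"name": "Hotel Post", "street": "Hauptstrasse 1", "zip": "6020"},  # booking site
    {"name": "Hotel Post", "street": "Dorfplatz 3", "zip": "6100"},     # other hotel, same name
    {"name": "Hotel Adler", "street": "Kirchweg 5", "zip": "6330"},
]

by_name = distinct_by(records, "name")
by_name_and_address = distinct_by(records, "name", "street", "zip")
print(by_name, by_name_and_address)
```

Keying on the name alone merges the two different "Hotel Post" entries, while keying on name plus address removes only the true duplicate.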
13. ENTER 2016 Research Track Slide Number 13
3. Analysis
Hotel: 4,841,353
Address: 3,035,000
Country: 1,904,000
Name: 1,125,000
Region: 1,902,000
ZIP: 2,011,000
Street: 2,284,000
15. ENTER 2016 Research Track Slide Number 16
3. Analysis
What categories of hotels are annotated?
http://schema.org/Rating
17. ENTER 2016 Research Track Slide Number 18
3. Analysis
Hotel: 4,841,353
Address: 3,035,000
Country: 1,904,000
Name: 1,125,000
Region: 1,902,000
Rating: 2,377,000
RatingValue: 2,375,000
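To put these counts in proportion, the sketch below expresses each one as a share of the Hotel count. It assumes that each figure is the number of Hotel annotations carrying that property; the slide does not state this, so the percentages are only a reading aid.

```python
HOTELS = 4_841_353

# Counts as shown on the slide.
counts = {
    "Address": 3_035_000,
    "Country": 1_904_000,
    "Name": 1_125_000,
    "Region": 1_902_000,
    "Rating": 2_377_000,
    "RatingValue": 2_375_000,
}

# Assumption: each count is the number of Hotel annotations with that property.
coverage = {prop: count / HOTELS * 100 for prop, count in counts.items()}

for prop, share in coverage.items():
    print(f"{prop:12s} {share:5.1f} %")
```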
18. ENTER 2016 Research Track Slide Number 19
3. Analysis
What categories of hotels are annotated?
[Chart values; category labels not recoverable: 866,932; 651,606; 426,925; 176,800; 135,958; 35,079; 66,208; 15,476; 941]
19. ENTER 2016 Research Track Slide Number 20
3. Analysis
2) How is schema.org used?
20. ENTER 2016 Research Track Slide Number 21
3. Analysis
3) Who uses schema.org in tourism?
Hypothesis:
"Schema.org is mainly used by booking and rating websites,
barely by hotels themselves."
21. ENTER 2016 Research Track Slide Number 22
3. Analysis
Approach:
• Take hotels listed on booking and rating sites
– Search for annotations on the hotel's own website
• Countercheck with annotated hotel websites
– Do they appear multiple times in the data set?
• Done by example (top booking sites)
22. ENTER 2016 Research Track Slide Number 23
3. Analysis
Outcome:
• Main users of schema.org/Hotel: booking and rating sites
Errors:
• Incomplete annotations
• Wrong classes
• Wrong attributes
• Wrong datatypes
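The four error categories can be made concrete with a small checker. This is a sketch under stated assumptions, not the authors' method: `EXPECTED` is a tiny hand-picked subset of properties with simplified Python types, and `REQUIRED` is an assumption of this sketch (schema.org itself does not mandate any property).

```python
# Hypothetical, simplified expectations for a Hotel annotation.
EXPECTED = {"name": str, "address": dict, "telephone": str, "url": str}
REQUIRED = ("name", "address")  # assumption of this sketch only


def check_hotel(annotation):
    """Return a list of problems found in a JSON-LD-like Hotel annotation."""
    problems = []
    if annotation.get("@type") != "Hotel":
        problems.append(f"wrong class: {annotation.get('@type')}")
    for prop in REQUIRED:
        if prop not in annotation:
            problems.append(f"incomplete: missing {prop}")
    for prop, value in annotation.items():
        if prop.startswith("@"):
            continue  # JSON-LD keywords such as @type and @context
        if prop not in EXPECTED:
            problems.append(f"wrong attribute: {prop}")
        elif not isinstance(value, EXPECTED[prop]):
            problems.append(f"wrong datatype: {prop}")
    return problems


# Invented annotation with a misspelled property and a numeric phone number.
faulty = {"@type": "Hotel", "name": "Hotel Example",
          "adress": "Example Street 1", "telephone": 123}
problems = check_hotel(faulty)
print(problems)
```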
24. ENTER 2016 Research Track Slide Number 25
3. Analysis
Annotation "Hotel" is correct, but identical on every subpage!
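One way to spot this pattern automatically is to group pages by host and flag hosts where every page carries exactly the same annotation. The following is a minimal sketch; the URLs, the annotations and the function name `hosts_with_identical_annotations` are hypothetical.

```python
import json
from collections import defaultdict
from urllib.parse import urlparse


def hosts_with_identical_annotations(pages):
    """pages: iterable of (url, annotation dict).

    Return hosts with more than one page where all pages share one annotation.
    """
    by_host = defaultdict(list)
    for url, annotation in pages:
        # Serialise with sorted keys so equal annotations compare equal.
        by_host[urlparse(url).netloc].append(json.dumps(annotation, sort_keys=True))
    return sorted(
        host for host, annotations in by_host.items()
        if len(annotations) > 1 and len(set(annotations)) == 1
    )


# Invented pages for illustration.
hotel = {"@type": "Hotel", "name": "Hotel Example"}
pages = [
    ("http://hotel.example.org/", hotel),
    ("http://hotel.example.org/rooms", hotel),
    ("http://hotel.example.org/contact", hotel),
    ("http://other.example.com/", {"@type": "Hotel", "name": "Other"}),
    ("http://other.example.com/rooms", {"@type": "HotelRoom", "name": "Double room"}),
]

flagged = hosts_with_identical_annotations(pages)
print(flagged)
```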
25. ENTER 2016 Research Track Slide Number 26
3. Analysis
4) Does the use of schema.org increase/decrease?
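One rough indicator can be computed from the corpus figures given in the Data section alone: the schema.org/Hotel subset grew from 35 GB and 127 million triples (November 2013) to 42 GB and 148 million triples (December 2014). The sketch below only does that arithmetic; it says nothing about how many distinct hotels or websites are behind the growth.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Figures from the Data section (schema.org/Hotel subset).
triples_2013, triples_2014 = 127e6, 148e6
size_2013_gb, size_2014_gb = 35, 42

triple_growth = percent_change(triples_2013, triples_2014)
size_growth = percent_change(size_2013_gb, size_2014_gb)
print(f"triples: {triple_growth:+.1f} %, size: {size_growth:+.1f} %")
```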
26. ENTER 2016 Research Track Slide Number 27
Conclusion & Future Work
• Schema.org is being adopted, but mostly poorly
• Schema.org is mostly used by platforms, not hotels
• Working on a schema.org extension (better vocabulary for hotel amenities and more)
• Promote schema.org in tourism (cooperation with ÖHV)