Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems
Transcript

  • 1. Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems
    • Georg Rehm 1, Marina Santini 2, Alexander Mehler 3, Pavel Braslavski 4, Rüdiger Gleim 3, Andrea Stubbe 5, Svetlana Symonenko 6, Mirko Tavosanis 7, Vedrana Vidulin 8
    • Language Resources and Evaluation Conference – LREC 2008
    • Affiliations:
      • 1 University of Tübingen, Germany, SFB 441: Linguistic Data Structures
      • 2 KTH-Stockholm University, DSV, Sweden
      • 3 University of Bielefeld, Germany, Computational Linguistics Dept.
      • 4 Inst. of Engineering Science, RAS, Ekaterinburg, Russia
      • 5 conject AG, Munich, Germany
      • 6 Nitol, LLC, Moscow, Russia
      • 7 Università di Pisa, Italy, Dipartimento di Studi italianistici
      • 8 Jožef Stefan Institute, Ljubljana, Slovenia
    • Corresponding author: [email_address]
  • 2. Introduction
    • Genres are specific types of text.
    • Genres have, roughly speaking, three characteristic properties:
      • Content: topic
      • Form: layout, design, text structure, etc.
      • Function: communicative purpose, etc.
    • Genres are socially specified sets of rules and conventions.
    • Genres are recognised by particular discourse communities.
    • Genres usually have established names.
  • 3. Examples of Traditional Genres
  • 4. Scope of this Talk
    • There are not only hundreds (Dimter, 1981), but thousands (Adamzik, 1995) of genres:
      • Shopping list
      • Love letter
      • Flyer
      • Weather forecast
      • CV
      • PhD thesis
    • This talk is not about traditional, paper-based genres.
    • This talk is about web genres.
  • 5. Web Genres
    • Studies have shown that genres also exist on the web, e.g.:
      • Personal homepage
      • FAQ
      • Blog
      • Search engine
      • Encyclopedia
      • Web shop
    • Web genres are more complex than traditional genres:
      • The web is a hypertext system
      • Interactive features
      • Multimedia
  • 6. Automatic Web Genre Identification
    • If we were able to identify web genres automatically, we could exploit this information in search engines. Find:
      • textbook web pages that contain “language resource”
      • PhD thesis web pages that contain “RCG parsing”
    • About 20 different approaches have been published in this area (incl. the identification of traditional genres). They mainly use
      • Machine learning methods
      • Hand-crafted genre detection rules
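
As a rough illustration of the search scenario on this slide, the Python sketch below filters a tiny in-memory collection by genre label and by phrase. The `Page` class, the `search` function, the URLs, and the genre label strings are all made up for illustration; the sketch assumes that genre labels have already been assigned by some identification system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str    # hypothetical example URL
    genre: str  # genre label, assumed to be assigned beforehand
    text: str   # plain text of the page


def search(pages, genre, phrase):
    """Return the pages of the given genre whose text contains the phrase."""
    phrase = phrase.lower()
    return [p for p in pages if p.genre == genre and phrase in p.text.lower()]


# Toy collection; contents are invented for this example.
pages = [
    Page("http://example.org/a", "textbook", "A language resource is a corpus or lexicon."),
    Page("http://example.org/b", "phd thesis", "This thesis studies RCG parsing."),
    Page("http://example.org/c", "blog", "My favourite language resource this week."),
]

hits = search(pages, "textbook", "language resource")
print([p.url for p in hits])
```

The blog page also contains the phrase, but the genre filter excludes it, which is the benefit the slide describes.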
  • 7. Automatic Web Genre Identification
    • All approaches have some characteristics in common.
    • Nearly every group of researchers
      • has its own definition of “web genre”,
      • creates its own document collection,
      • creates its own set of web genre labels,
      • annotates its corpus with these web genre labels.
    • [Diagram: a web genre identification approach comprises a classification algorithm, a corpus (collection of web documents), and a tag set (genre categories); the label “DIY” appears three times.]
  • 8. Automatic Web Genre Identification
    • It’s impossible to compare such isolated approaches.
    • [Diagram: Approaches 1 to 5, each with its own algorithm, corpus, and tag set.]
  • 9. Towards a Reference Corpus of Web Genres
    • A Reference Corpus of Web Genres enables comparative evaluation.
    • [Diagram: Approaches 1 to 5, each with its own algorithm, around one Reference Corpus of Web Genres.]
  • 10. Towards a Reference Corpus of Web Genres
    • [Diagram: Approaches 1 to 5, each with its own algorithm, around a reference collection of web documents and a shared genre category set or sets.]
  • 11. Towards a Reference Corpus of Web Genres
    • [Diagram: same as slide 10.]
  • 12. Assigning Genre Labels to Web Pages
    • The construction of a genre corpus involves the task of assigning genre labels to web documents by a group of annotators.
    • Previous studies have shown that this is a very hard task.
    • [Diagram: web pages are tagged with a genre category taken from a set of genre categories.]
  • 13. Preliminary Study
    • We conducted a survey amongst the group of authors:
      • Goal: to measure the agreement of genre labels assigned to a random sample of 50 web documents by persons who are engaged in genre-related research.
      • Seven of the nine authors participated.
    • Result: the categories assigned by the participants contain a very high number of disparate terms at various levels of abstraction.
    • Conclusion: the task of assigning genre labels to web documents is – even for linguists who work on genres – very hard.
  • 14. Assigning Genre Labels to Web Pages
    • Consistency: High
    • Participant 1: News article
    • Participant 2: Article /commentary
    • Participant 3: Article
    • Participant 4: Feature
    • Participant 5: A newsletter article
    • Participant 6: News article
    • Participant 7: Journalistic
  • 15. Assigning Genre Labels to Web Pages
    • Consistency: Low
    • P1: Entry page of the website of a research journal
    • P2: Table of contents with snippets
    • P3: Portal, link collection
    • P4: Bibliography/List of Articles
    • P5: A homepage of a subscription-based academic journal
    • P6: Homepage
    • P7: Index, Content Delivery
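
The slides do not say how consistency was judged. As one crude way to put a number on it, the Python sketch below computes the fraction of annotator pairs that chose exactly the same label (after lower-casing and whitespace normalisation) for the two example pages above. The function names are illustrative and exact matching is an assumption of this sketch, not the authors' procedure.

```python
from itertools import combinations


def normalise(label):
    """Lower-case a label and collapse whitespace so trivial variants match."""
    return " ".join(label.lower().split())


def pairwise_agreement(labels):
    """Fraction of annotator pairs that chose exactly the same label."""
    pairs = list(combinations(labels, 2))
    same = sum(normalise(a) == normalise(b) for a, b in pairs)
    return same / len(pairs)


# Labels copied from the "Consistency: High" slide.
news_page = [
    "News article", "Article /commentary", "Article", "Feature",
    "A newsletter article", "News article", "Journalistic",
]

# Labels copied from the "Consistency: Low" slide.
journal_page = [
    "Entry page of the website of a research journal",
    "Table of contents with snippets",
    "Portal, link collection",
    "Bibliography/List of Articles",
    "A homepage of a subscription-based academic journal",
    "Homepage",
    "Index, Content Delivery",
]

print(round(pairwise_agreement(news_page), 3))
print(round(pairwise_agreement(journal_page), 3))
```

With free-text labels, exact matching scores low even for the news page, because related terms such as "Article" and "News article" count as disagreement. A fixed, shared category set would make such a measure far more meaningful.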
  • 16. Genre Category Sets in Previous Approaches
    • Almost all category sets used in previous approaches are
      • limited in size and scope and
      • contain categories that cannot be considered genres:
    • Lim et al. (2005): Personal homepages; Public homepages; Commercial homepages; Bulletin collections; Link collections; Image collections; Simple tables/lists; Input pages; Journalistic materials; Research reports; Official materials; Informative materials; FAQs; Discussions; Product specifications; Others
    • Vidulin et al. (2007): Blog; Children’s; Commercial/Promotional; Community; Content Delivery; Entertainment; Error Message; FAQ; Gateway; Index; Informative; Journalistic; Official; Personal; Poetry; Scientific; Shopping; User Input
  • 17. Shared Genre Category Sets
    • A set of genre categories is needed so that we can assign web genre labels to web documents.
    • Requirements for this shared category set:
      • It should be precise, scalable, as unambiguous as possible, and reflect the genre reality as it presents itself on the web.
      • The majority of researchers in this field should agree upon the category set or sets.
    • We used a wiki to come up with an initial proposal of 78 web genre categories.
  • 18. Our Proposal for a Shared Genre Category Set
    • 1. About Page
    • 2. Abstract
    • 3. Agenda (Schedule, Calendar)
    • 4. Announcement
    • 5. Application
    • 6. Bibliography
    • 7. Biography
    • 8. Chronicle
    • 9. Code Listings
    • 10. Column / Editorial / Lead Article
    • 11. Comic
    • 12. Contact Form
    • 13. Contract / Disclaimer / Terms and Conditions
    • 14. Corporate Blog
    • 15. Curriculum Vitae / CV / Resume
    • 16. Data / Statistics / Data Sheet
    • 17. Diary, Blog
    • 18. Dictionary
    • 19. Directory of Persons or Organisations
    • 20. Discussion Group / Newsgroup
    • 21. Download
    • 22. Drama / Play
    • 23. Encyclopedia
    • 24. Errata
    • 25. Error Message / Empty Page / Under Construction Page
    • 26. Essay
    • 27. Exercises (Problems)
    • 28. FAQ
    • 29. Feature Story / News Reportage
    • 30. Game (Quiz, Puzzle)
    • 31. Glossary
    • 32. Guestbook
    • 33. Homepage / Front Page / Entry Page
    • 34. Horoscope
    • 35. Index
    • 36. Instruction
    • 37. Interview
    • 38. Invitation
    • 39. Job Listing
    • 40. Joke
    • 41. Law / Regulation / Rule / Proclamation
    • 42. Letter / Mail / E-Mail
    • 43. Letter to the Editor
    • 44. Linkfarm
    • 45. Link Collection / Hotlist
    • 46. List of Products
    • 47. List of Projects
    • 48. Login Page
    • 49. Media (Images, videos, music, sound)
    • 50. Meeting minutes
    • 51. News Article
    • 52. News Collection / Newsletter / Digest
    • 53. Obituary
    • 54. Official Report
    • 55. Ordering Form / Booking Form
    • 56. Pamphlet
    • 57. Petition
    • 58. Promotional / Advertisement
    • 59. Poem / Poetry / Lyrics
    • 60. Pornographic
    • 61. Prose Fiction
    • 62. Quotation
    • 63. Reportage
    • 64. Research Report
    • 65. Review (Testimonial)
    • 66. Script (Manuscript)
    • 67. Search Form
    • 68. Sermon
    • 69. Shop
    • 70. Specification
    • 71. Speech
    • 72. Splash Page / Gateway / Welcome Page
    • 73. Strategic Plans
    • 74. Survey
    • 75. Table of contents / Sitemap / Navigation
    • 76. Thesis
    • 77. Travel Guide
    • 78. Tutorial
  • 19. Tagging HTML Documents with Genre Categories
    • 1) Tag HTML documents (the most common approach)
    • 2) Tag websites
    • 3) Tag page segments
  • 20. Towards a Reference Corpus of Web Genres
    • [Diagram: Approaches 1 to 5, each with its own algorithm, around a reference collection of web documents and a shared genre category set or sets.]
  • 21. Reference Collection of Web Documents
    • We plan to build the reference corpus in two stages:
      • First, we will apply our shared set of genre categories to existing collections as a proof of concept.
        • Initial step towards an objective evaluation and integrative compatibility of individual approaches.
      • Second, we will use a crawler to gather more recent as well as more diverse sets of documents.
  • 22. Reference Collection of Web Genres (Selection)
    • Web Corpus for English (Santini, 2007): editorial, biography, do-it-yourself guide, feature article (20 web pages each).
    • German corpus (Mehler et al., 2007, 2008): conference website (50 sites), personal academic homepage (68 sites), project website (52 sites), city website (180 sites).
    • Hierarchical Web Genre Collection (Stubbe and Ringlstetter, 2007): 32 genre classes, 40 HTML files per class, English.
    • Corpus of 400 blog posts, Italian (Tavosanis, 2007).
    • English (65,177 pages) and Russian (29,650 pages) corpora (Sharoff, 2007).
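
Applying a shared category set to an existing collection amounts to mapping each collection's own labels onto the shared ones. The Python sketch below shows one possible shape for this step. The mapping itself is a guess made for illustration (the slides do not specify how the labels of Santini's corpus correspond to the proposed categories), and `to_shared` is a hypothetical helper.

```python
# A few entries from the proposed shared category set (slide 18).
SHARED_CATEGORIES = {
    "Biography",
    "Column / Editorial / Lead Article",
    "Feature Story / News Reportage",
    "Instruction",
}

# ASSUMPTION: illustrative mapping from the labels of one existing corpus
# to shared categories; the real correspondence is not given in the slides.
LOCAL_TO_SHARED = {
    "editorial": "Column / Editorial / Lead Article",
    "biography": "Biography",
    "feature article": "Feature Story / News Reportage",
    "do-it-yourself guide": "Instruction",
}


def to_shared(local_label):
    """Map a corpus-specific label to a shared category, or None if unmapped."""
    shared = LOCAL_TO_SHARED.get(local_label.strip().lower())
    if shared is not None and shared not in SHARED_CATEGORIES:
        raise ValueError(f"mapping target {shared!r} is not a shared category")
    return shared


print(to_shared("Editorial"))
print(to_shared("city website"))
```

Returning `None` for unmapped labels makes the gaps explicit, so labels that need a manual decision can be listed rather than silently forced into a category.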
  • 23. Towards a Reference Corpus of Web Genres
    • [Diagram: Approaches 1 to 5, each with its own algorithm, around a reference collection of web documents and a shared genre category set or sets.]
  • 24. Corpus Management and Annotation Tools
    • Construction of the reference corpus requires tools that support
      • compiling a document collection and
      • annotating HTML documents.
    • We use the HyGraph toolbox:
      • Supports researchers in the process of corpus compilation, annotation and analysis
      • Annotate at various levels
      • Assign confidence values
      • Support for multiple tag sets and category systems
      • Uses stand-off annotation
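
To make the listed features concrete, the Python sketch below models a stand-off annotation record: the label lives in a separate object that points at the annotated unit, so the HTML itself is never modified. This is not HyGraph's data format or API. The class names, field names, tag set identifiers, and the 0 to 1 confidence range are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Annotation levels mentioned in the talk."""
    WEBSITE = "website"
    DOCUMENT = "document"
    SEGMENT = "segment"


@dataclass(frozen=True)
class Annotation:
    target: str        # URL or ID of the annotated unit (document stays untouched)
    level: Level
    tag_set: str       # hypothetical tag set identifier
    category: str      # genre category from that tag set
    confidence: float  # ASSUMPTION: value between 0.0 and 1.0
    annotator: str

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def labels_for(annotations, target, tag_set):
    """Collect the categories assigned to one target under one tag set."""
    return [a.category for a in annotations
            if a.target == target and a.tag_set == tag_set]


# The same page annotated under two different tag sets.
annotations = [
    Annotation("http://example.org/journal", Level.DOCUMENT, "shared-78",
               "Homepage / Front Page / Entry Page", 0.8, "P1"),
    Annotation("http://example.org/journal", Level.DOCUMENT, "vidulin-2007",
               "Index", 0.6, "P7"),
]

print(labels_for(annotations, "http://example.org/journal", "shared-78"))
```

Because each record names its tag set, several category systems can coexist over the same collection without interfering with one another.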
  • 25. Towards a Reference Corpus of Web Genres
    • [Diagram: the reference collection of web documents and the shared genre category set or sets together form the Reference Corpus of Web Genres.]
  • 26. Summary and Future Work
    • We are constructing a reference corpus of web genres.
    • It will provide a shared resource for researchers who work on web genre identification and on the evaluation of such systems.
    • Future work includes the further realisation of this resource:
      • Apply a set of genre categories to existing corpora.
      • Collect a large set of new documents that will be categorised based on annotation guidelines using HyGraph.
      • Assign genre labels to single web documents first and to page segments as well as complete websites later.
  • 27. Q/A
    • Thanks for your attention!
    • Please get in touch if you (plan to) work in the field of automatic web genre identification or a related area:
      • [email_address]
      • http://129.70.40.20/WebGenreWiki/
      • A mailing list will be available soon.