
Towards Contextualized Information: How Automatic Genre Identification Can Help



Genre is one of the textual dimensions that can be used to reconstruct the communicative context needed to assess the value of information with respect to a purpose (business, learning, finding, monitoring, predicting, etc.). When we know the genre of a text, we can infer the context in which the text was created and the purpose it serves. We can therefore decide more confidently whether a text contains the information we are looking for. For example, factual texts might have more credibility than opinionated texts. In this respect, genres such as press conferences, declarations or announcements by a White House spokesman might be more reliable than subjective genres, e.g. newspaper editorials or op-ed articles. On the other hand, if we want to take the pulse of public feeling about a product or a politician, we might give more weight to more emotional genres like blogs, forums or social network microposts.

In recent years, important steps forward have been taken in Automatic Genre Identification (AGI). AGI can be defined as a meta-discipline that draws on and spans Computational Linguistics, NLP, Corpus Linguistics, Information Retrieval, Information Extraction, Text Mining, Text Analytics, Sentiment Analysis and Library and Information Science (LIS), among others. Promising computational models have been proposed to automatically identify the genre(s) of a text, although no agreement has been reached on the definition of the concept of genre itself. AGI research has shown that genre classes such as blogs, online newspaper front pages, FAQs and DIYs can be automatically identified using a wide range of genre-revealing features -- from linguistic cues to character n-grams -- with a variety of classification algorithms.

In a world where information overload is still pervasive and where technology encourages massive text production through emailing, blogging, tweeting and social network communication, the concept of genre and AGI are likely to be useful for converting unclassified and unstructured textual data into more structured and contextualized information.

This talk presents a summary of the state of the art in AGI and discusses how genre-aware applications could help extract actionable information from raw textual data.

Published in: Technology, Education


  1. 1. Towards Contextualized Information: How Automatic Genre Identification Can Help. Marina Santini. Seminar Series, Laboratory for Cognition, Interaction and Language Technology (CILTLab), Linköping University, Tuesday 28 August 2012
  2. 2. Outline 1. Self-Presentation 2. Automatic Genre Identification (AGI) • The Beginning • State-of-the-Art 3. Why is Genre Important and Useful? 4. Viable Projects • DaisyKB • Contextify • WebRider • SearchInFocus Marina Santini © 2012.
  3. 3. SELF-PRESENTATION
  4. 4. Background • Technical Translation (degree, Rome) [worked with localization, multilinguality, terminology, unstructured knowledge bases used as translation memory, for more than 10 years] • History of the Italian Language (degree, Rome) [corpora, manually analysed for philological purposes] • NLP (MSc, Manchester) [corpus creation and natural language processing] • Computational Linguistics (PhD, Brighton) [language processing and automatic text classification of web documents] • Agile Web Development (KYHYh, Stockholm) [how web documents are conceived and implemented; what functionalities they can offer; how users interact with web documents; in which way functionalities affect content, etc.]
  5. 5. Current Activities • Managing a blog and a LinkedIn Group • Planning a book (Springer?): A Computational Theory of Genre (pre-study phase) • Designing web-based applications for contextualized language technology • etc.
  6. 6. Strong interest… apply computational linguistics to BIG UNSTRUCTURED TEXTUAL DATA to extract contextualized and actionable information. A possible approach… Automatic Genre Identification (but also domain, sublanguage, register, and other textual and situational dimensions). [Diagram labels: Raw/Unstructured Textual Data; Linguistic Preprocessing; Contextualized Information; Actionable Information] The reduced exploitation of big unstructured textual data is recognized as a major problem causing remarkable economic loss and ineffective decision-making.
  7. 7. AUTOMATIC GENRE IDENTIFICATION (AGI)
  8. 8. Genre Studies • Aristotle (4th cent. BC): drama, lyrics, epics – Drama: tragedy, comedy, satyr play • Literary theory and literary genres [e.g. sonnets, ballads, monologues, epistolary novels] • More recently, practical genres in academic contexts [e.g. academic papers, essays], in workplace and professional contexts [e.g. tax forms, medical referrals, progress reports, patents], in public contexts [e.g. popular magazines, public speeches], in pedagogy (teaching, writing, e-
  9. 9. Genres on the Web and in other Digital Environments • Emails • PowerPoint slides • Search Pages • FAQs • Personal Home Pages • Corporate Home Pages • Blogs • Facebook Microposts • Tweets • etc. – [Migration & Colonization: music genres (jazz, rock, etc.), film genres (thriller, drama, western etc.)]
  10. 10. Recent Genre Studies & Analyses: 2008-2010
  11. 11. AGI: 2007-2011 PhD Theses… 1. Santini M. (2007). Automatic Identification of Genre in Web Pages. University of Brighton (UK) 2. Meyer zu Eissen S. (2007). On Information Need and Categorizing Search. PhD thesis. University of Paderborn (Germany). 3. Freund L. (2008). Exploiting task-document relations in support of information retrieval in the workplace. University of Toronto (Canada) 4. Jebari C. (2008). Catégorisation Flexible et Incrémentale avec Raffinage de Pages Web par Genre. PhD thesis. Tunis El Manar University, Tunisia. In French. 5. Mason J. (2009). An n-gram based approach to the automatic classification of web pages by genre. Dalhousie University (Canada) 6. Gunnarsson M. (2011). Classification along Genre Dimension. Exploring a Multidisciplinary Problem. University of Borås (Sweden) 7. Clark Malcolm (PhD thesis in progress, UK) 8. Vidulin Vedrana (PhD thesis in progress, Croatia)
  12. 12. AGI 2007-2011 Workshops and Special Issues… • The WebGenreWiki (must find the time to update it) • Contact me: – Facebook page: Genres on the web – Email:
  13. 13. AGI: Companion Volumes
  14. 14. The concept of genre is intuitive… but difficult to pin down and to agree upon. In the book, we do not propose a single and unified definition of genre. Authors give their different views on genre.
  15. 15. Do we really need a definition when doing AGI? • After all… – … once we are convinced that genre is intuitive and useful, we could just say that: • genre is a classificatory principle based on a number of attributes.
  16. 16. What do we need for Automatic Genre Identification (AGI)? • We need: – a genre taxonomy (genre palette) – a corpus of different genres (genre collection) – measurable attributes (genre-revealing features) that can be extracted automatically – a general-purpose automatic classifier, i.e. off-the-shelf statistical software that builds a classification model for us
  17. 17. Vector representation & supervised machine learning algorithms (esp. SVM)
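
The ingredients listed on the two slides above (genre palette, genre collection, genre-revealing features, classifier) can be sketched in a few lines. This is only an illustration under stated assumptions: the two-text corpus is invented, character trigrams stand in for the genre-revealing features, and a nearest-centroid classifier with cosine similarity stands in for the SVM mentioned on the slide so that the example needs no external library.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def vectorize(text, vocabulary, n=3):
    """Represent a text as a vector of n-gram counts over a fixed vocabulary."""
    counts = Counter(char_ngrams(text, n))
    return [float(counts[gram]) for gram in vocabulary]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train_centroids(corpus, vocabulary):
    """Average the vectors of each genre class into one centroid per genre."""
    grouped = {}
    for text, genre in corpus:
        grouped.setdefault(genre, []).append(vectorize(text, vocabulary))
    return {
        genre: [sum(column) / len(vectors) for column in zip(*vectors)]
        for genre, vectors in grouped.items()
    }

def predict(text, centroids, vocabulary):
    """Return the genre whose centroid is most similar to the text vector."""
    vector = vectorize(text, vocabulary)
    return max(centroids, key=lambda genre: cosine(vector, centroids[genre]))

# Hypothetical two-genre palette and toy collection (invented for illustration).
CORPUS = [
    ("How do I reset my password? How do I change my email?", "faq"),
    ("Yesterday we went hiking and the weather was lovely", "blog"),
]
VOCABULARY = sorted({g for text, _ in CORPUS for g in char_ngrams(text)})
CENTROIDS = train_centroids(CORPUS, VOCABULARY)

print(predict("How do I reset", CENTROIDS, VOCABULARY))
print(predict("the weather", CENTROIDS, VOCABULARY))
```

A real experiment would replace the toy corpus with a genre collection and the centroid classifier with an off-the-shelf learner; the vector representation step stays the same.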
  18. 18. PIONEERS: DOUGLAS BIBER – JUSSI KARLGREN
  19. 19. Multi-Dimensional Analysis • Factor Analysis, Factor Scores (Biber, 1988) • Cluster Analysis (Biber, 1989) • Additional Statistical Tests (Biber, 2004a; 2004b, etc.) • 66 linguistic features • “I have used the term ‘genre’ (or ‘register’) for text varieties that are readily recognized and ‘named’ within a culture (e.g. letters, press editorials, sermon, conversation), while I have used the term ‘text type’ for varieties that are defined linguistically (rather than perceptually)” (Biber, 1993). • Factor 2 - Biber (1988) • Cluster Analysis - Biber (1989): 1. intimate interpersonal interaction 2. informational interaction 3. scientific exposition 4. learned exposition 5. imaginative narrative 6. general narrative exposition 7. situated reportage 8. involved persuasion
  20. 20. Karlgren and Cutting (1994): Recognizing Text Genres with Simple Metrics Using Discriminant Analysis • 20 features • Discriminant analysis • Brown corpus
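
To give a feel for what "simple metrics" can look like, here is a small sketch that computes a handful of surface measures from raw text. The four metrics and their names are assumptions chosen for illustration; they are not the 20 features of the original study, and the discriminant analysis step is not shown.

```python
import re

FIRST_PERSON = {"i", "me", "my"}

def simple_metrics(text):
    """Compute a few surface metrics that are cheap to extract from raw text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    n_words = max(len(words), 1)          # avoid division by zero on empty input
    n_sentences = max(len(sentences), 1)
    return {
        "words_per_sentence": len(words) / n_sentences,
        "chars_per_word": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        "first_person_count": sum(w in FIRST_PERSON for w in words),
    }

metrics = simple_metrics("I like my blog. I write daily.")
for name, value in metrics.items():
    print(name, round(value, 3))
```

Each text becomes one row of such numbers, which is the kind of input a statistical classifier can then work with.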
  21. 21. From Biber’s text types to genres of electronic corpora: Karlgren and Cutting (1994)
  22. 22. POSs & SUC
  23. 23. RECENT COMPUTATIONAL MODELS FOR AGI
  24. 24. AGI: Some Scenarios • Serge Sharoff • Kim & Ross • Santini • Stein et al.
  25. 25. Morphology & the Linguist: Serge Sharoff • Aim: Find a genre palette allowing comparison among corpora (Web As Corpus initiative) and across languages • A functional genre palette inspired by J. Sinclair • Many corpora: English and Russian • Classifier: SVM • Features: POS trigrams (577 for Russian; 593 for English), e.g. ADV ADJ NOUN
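
As a rough illustration of the feature type named above, the sketch below counts POS trigrams in a sequence of part-of-speech tags. It assumes the text has already been tagged by some tagger; the tag sequence is made up, and only the trigram ADV ADJ NOUN comes from the slide.

```python
from collections import Counter

def pos_trigrams(tags):
    """Count every window of three consecutive part-of-speech tags."""
    return Counter(" ".join(tags[i:i + 3]) for i in range(len(tags) - 2))

# Hypothetical tag sequence for a short text (invented for illustration).
TAGS = ["ADV", "ADJ", "NOUN", "VERB", "ADV", "ADJ", "NOUN"]
TRIGRAMS = pos_trigrams(TAGS)

for trigram, count in TRIGRAMS.most_common():
    print(trigram, count)
```

The counts for a fixed list of trigrams (577 or 593 in the setting described on the slide) would then form the feature vector passed to the classifier.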
  26. 26. The expert (the linguist) decides:
  27. 27. Results 1. It might be that the features are not ideal for these genres 2. It might be that there is a problem with the distribution of genre classes: the information genre is not well represented and the classifier does not learn how to discriminate it from the other genres 3. It might be that the genres of Sharoff’s palette are too broad and vague and thus confusing for the classifier
  28. 28. Harmonic Descriptor Representation (HDR), 2477 words: Kim & Ross • Aim: to apply genres for the classification of Digital Libraries • Features: HDR = FP, LP or AP (between 1 and T/(N x MP)) • Number of features: 7431 (2477*3) • Classifier: SVM • KRYS I + 7-webgenre collection (total: 31 genre classes, 3452 documents)
  29. 29. 31 genres
  30. 30. Accuracies • Inter-rater disagreement: people do not necessarily agree on genre labels to be assigned to a web document
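
One common way to quantify how much two annotators agree on genre labels is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The slide does not name a measure, so kappa is brought in here as an assumption, and the labels are invented.

```python
from collections import Counter

def observed_agreement(labels_a, labels_b):
    """Share of documents on which the two raters chose the same genre."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same genre
    # if each labels at random according to their own label distribution.
    p_e = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    if p_e == 1:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two raters for four web pages.
RATER_A = ["faq", "faq", "blog", "blog"]
RATER_B = ["faq", "blog", "blog", "blog"]

print(observed_agreement(RATER_A, RATER_B))
print(cohen_kappa(RATER_A, RATER_B))
```

Low kappa on a genre palette is a hint that the classes themselves are hard to tell apart, which also bounds the accuracy one can expect from a classifier trained on those labels.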
  31. 31. What about morphology & syntax? What about noise? Santini • Aim: automatically identify the genre of web pages • Collection: 7-webgenre collection + others • Features: 100 facets • Genre palette: 7 webgenres • Classifier: inferential model (subjective Bayesian method)
  32. 32. 7-webgenre collection • Balanced (200 web pages per genre class) • Genre palette • Not annotated manually • Built following 2 principles: – Objective sources – Consistent genre granularity
  33. 33. 100 Facets
  34. 34. Inferential model • It is a simple probabilistic model based on rules. • It allows some “reasoning” through the use of weights (closer to artificial intelligence than machine learning) • The goal is to identify the correct genre of 1400 web pages belonging to 7 genres in a noisy environment.
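
A rule-based probabilistic model with weights can be sketched with the odds-likelihood form of Bayes' rule, where each rule that fires multiplies the prior odds of a genre by a weight. This is an assumption about the general shape of such a model, not the actual inferential model from the talk: the facet names and weights in `RULES` are hypothetical, and only the uniform prior over 7 balanced genres follows from the slides.

```python
def posterior_probability(prior, likelihood_ratios):
    """Update a prior probability with the weights of the rules that fired."""
    odds = prior / (1 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio              # each fired rule scales the odds
    return odds / (1 + odds)

# Hypothetical rules: genre -> {facet: weight applied when the facet is observed}.
RULES = {
    "faq": {"question_marks": 4.0, "second_person": 1.5},
    "blog": {"first_person": 3.0, "date_stamp": 2.0},
}

def score_genres(observed_facets, rules, n_genres=7):
    """Return a posterior score per genre, starting from a uniform prior."""
    prior = 1 / n_genres
    return {
        genre: posterior_probability(
            prior, [w for facet, w in weights.items() if facet in observed_facets]
        )
        for genre, weights in rules.items()
    }

SCORES = score_genres({"first_person", "date_stamp"}, RULES)
print(max(SCORES, key=SCORES.get), SCORES)
```

Because the weights are set by hand, the behaviour of the model stays inspectable, which is one reason such a model can be described as closer to rule-based reasoning than to learned classification.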
  35. 35. Comparisons (I)
  36. 36. Comparisons (II)
  37. 37. Results • An effort should be made to control the effect of noise on classification results
  38. 38. Three experimental settings, three different genre needs… 1. Genre comparison across corpora 2. Efficient AGI for digital libraries 3. Taming the wild web, where everything is uncertain and noisy. WEGA prototype: a retrieval model for genre-enabled web search
  39. 39. Genre retrieval model: Stein, Meyer zu Eissen, Lipka • Aim: provide genre labels to the result set • Genre collection and palette: KI-04 corpus: 8 webgenres • Firefox add-on • Model: “lightweight GenreRich model” (linear discriminant analysis) • Features: HTML, link features, character features, vocabulary concentration features (< 100 features)
  40. 40. WEGA (WEb Genre Analysis)
  41. 41. KI-04 genre collection: 8 webgenres • How can we define the optimal genre palette for the task/purpose in focus?
  42. 42. Genre Classes & Human Recognition • How can we decide on the most representative genre classes? Let’s ask users… yes indeed, but how? • 1) questionnaires (Karlgren) • 2) card sorting (Rosso & Haas) • 3) task-oriented studies (Crowston et al.)
  43. 43. Questionnaires: “what genres are available on the internet?” • Ever evolving taxonomy
  44. 44. User Warrant: Rosso & Haas • Collecting genre terminology in the users’ own words (3 participants) – Make the users classify web pages and create piles • Users choose the best of the collected genre terminology (102 participants) • User validation of the genre palette (257 participants) • Genres’ usefulness for web search (32
  45. 45. Final Genre Palette: 18 Genres
  46. 46. Genres & Tasks: Crowston et al. • 3 groups of respondents: teachers, journalists, engineers • Respondents were asked to carry out a web search for a real task of their own choice – What is your search goal? – What type of web page would you call this? – What is it about the page that makes you call it that? – Was this page useful to you?
  47. 47. What type of web page would you call this? • 522 unique terms → about 300 • How can we find an optimal genre palette for the task/purpose in focus shareable by professionals?
  48. 48. Syracuse corpus & AGI. ACL 2010 (Uppsala): FINE-GRAINED GENRE CLASSIFICATION USING STRUCTURAL LEARNING ALGORITHMS, Zhili Wu, Katja Markert and Serge Sharoff • The whole corpus: 3027 annotated webpages divided into 292 genres. • Focussing on genres containing 15 or more examples, the corpus is of about 2293 examples and 52 genres. • How do we work out the optimal number of genres a classifier can digest with a useful performance?
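
The step of keeping only genres with at least 15 examples is a plain frequency filter over the genre labels. The sketch below shows that step on invented labels; it does not reproduce the Syracuse corpus counts.

```python
from collections import Counter

def frequent_genres(labels, min_examples=15):
    """Keep only the genres that have at least `min_examples` documents."""
    counts = Counter(labels)
    return {genre: n for genre, n in counts.items() if n >= min_examples}

# Hypothetical label list: two well-populated genres and one rare genre.
LABELS = ["faq"] * 20 + ["blog"] * 15 + ["recipe"] * 3
KEPT = frequent_genres(LABELS)

print(len(KEPT), "genres kept,", sum(KEPT.values()), "examples kept")
```

Raising or lowering `min_examples` trades the number of genre classes against the amount of training data per class, which is exactly the open question raised on the slide.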
  49. 49. In summary • AGI can be easily done. Results are promising. • AGI presents a fascinating classification problem where classes are cultural, social and evolving artifacts. • R&D of AGI systems will help understand complex classification problems in many other disciplines (e.g. social sciences, neuroscience, psychology, etc.).
  50. 50. Resumed: Do we need a genre definition when doing AGI? • Yes. We need a computational theory of genre that allows us to shed some light on the correlations and interaction of different factors. Without a theoretical definition or a characterization of the concept of genre, it is not clear how to make decisions about…
  51. 51. Current Open Issues • The distribution of genre classes in the learning set • The optimal genre-revealing features for the genres in focus • The generality of the genre classes • The creation of a genre taxonomy whose classes BOTH humans and automatic classifiers can easily discriminate • The inter-rater disagreement • An increasing corpus size (can we work out a critical mass for a genre corpus?) • Noise (noisy and evolving digital environments) • An ever evolving corpus of genre classes • Optimal genre palette for the task/purpose in focus • The optimal number of genres a classifier can digest with a useful performance
  52. 52. A computational theory of genre. Without a genre definition, all these experiments remain random, uncorrelated, fragmented…
  53. 53. WHY IS GENRE IMPORTANT AND USEFUL?
  54. 54. Why is genre important? • It is a context carrier: being based on recurrent conventions and predictable expectations, genre provides the communicative context and the communicative purpose for which a text has been produced. It is both a semantic and a pragmatic concept (meaning + context). Think of what happens in your mind when you come across a specific genre, e.g. FAQs, reviews, interviews, academic papers,
  55. 55. IIiX: The Information Interaction in Context conference (IIiX) explores the relationships between and within the contexts that affect information retrieval (IR) and information seeking, how these contexts impact information behavior, and how knowledge of information contexts and behaviors improves the design of interactive information systems. The fourth IIiX (Information Interaction in Context) symposium 2012 will be held in Nijmegen, the Netherlands from August 21 to 24 2012. IIiX 2012 is organized in cooperation with the ACM and ACM SIGIR
  56. 56. Major Benefits. Being a context carrier, genre contributes to: • Complexity reduction and predictivity: a text receives identity through belonging to a certain genre and this identity reduces the cognitive effort • Improved findability: genre helps find data that is more “relevant” to our information needs • Increased information understanding: genre competence increases self-protection against digital crimes (e.g. phishing, hoaxes, cyberbullying) because it can help spot genre anomalies and consequently malicious
  57. 57. Genre Conventions dominate Language from an early age • Mastering genres is an important factor in children’s linguistic development: – “As they grow older, [children] become more accomplished and learn how to use linguistic features that are specific to the genre, for example the appropriate use of present or past tense”. Source: Understanding Children’s Development, p. 419, 2011.
  58. 58. Genre is ubiquitous • Language does not exist in the abstract. • Language use changes with the situation, purpose, audience, emotional state, etc. • We might express the same meaning with different words according to different communicative contexts.
  59. 59. Benefits in Language Technology • Genre competence and AGI: – could improve many current NLP subfields, e.g. automatic summarization, machine translation, or Natural Language Generation (NLG) – Information Retrieval (IR) would benefit by identifying more relevant documents to the queries – Business Intelligence (BI) and Customer Experience Management (CEM) would find actionable information in the deluge of data
  60. 60. GENRE-AWARE Advanced Text Analytics: VIABLE PROJECTS
  61. 61. Action Projects • Resource: DaisyKB - Fine-grained multilingual knowledge base • Tool: Contextify - MetaDataTagger: creating metadata for genre, sublanguage, domain • Information System: WebRider - socially and emotionally intelligent web search • Actionable information: SearchInFocus. Pilot study: “Using query log analysis for BI and CEM”. Applying findability for Business Intelligence (BI) and Customer Experience Management (CEM). • More…
  62. 62. DAISYKB: Fine-grained multilingual knowledge base
  63. 63. The word “bank”
  64. 64. The 2 English senses are translated using 2 different Italian words. Senses and words across languages are linked together via the abstract representation…
  65. 65. The importance of the abstract representation • It enables cross-linguality and multilinguality • Consistency across all languages.
  66. 66. Single words are often not enough to understand the meaning… • DaisyKB can have fields with: – Collocations – Terminology – Companies – Domain – Frequent co-occurring words – Frequent queries containing the keyword – Sublanguage – Genre – Etc.
  67. 67. Object-Oriented Approach • Each entry is like an object • Each field is like a method in an object, you just call it when you need it • An object can be called at different levels of granularity
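
The object-oriented idea on the DaisyKB slides can be sketched as follows: an entry is an object, its senses point to a language-independent abstract representation, and fields are reached through methods. Everything here is hypothetical: the class names, the concept identifiers and the choice of "banca" and "riva" as the two Italian words are illustrative assumptions, since the slides do not show the actual entry.

```python
from dataclasses import dataclass, field

@dataclass
class Sense:
    concept_id: str                      # abstract, language-independent representation
    words: dict                          # language code -> word for this sense
    domain: str = ""
    collocations: list = field(default_factory=list)

@dataclass
class Entry:
    keyword: str
    language: str
    senses: list

    def translate(self, target_language):
        """Return one translation per sense, linked via the abstract concept."""
        return [s.words[target_language] for s in self.senses
                if target_language in s.words]

    def domains(self):
        """Return the domain field of every sense that has one."""
        return [s.domain for s in self.senses if s.domain]

# Hypothetical entry for the English word "bank".
BANK = Entry(
    keyword="bank",
    language="en",
    senses=[
        Sense("FINANCIAL_INSTITUTION", {"en": "bank", "it": "banca"},
              domain="finance", collocations=["bank account"]),
        Sense("RIVER_SIDE", {"en": "bank", "it": "riva"},
              domain="geography", collocations=["river bank"]),
    ],
)

print(BANK.translate("it"))
print(BANK.domains())
```

New fields such as genre or sublanguage would be added to `Sense` without touching existing callers, which is the flexibility the slides argue for.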
  68. 68. Pre-population… • Migrating existing resources to populate DaisyKB. • The dictionary structure must be well thought out in terms of flexibility and design in order to accommodate future needs.
  69. 69. Benefits • Standardization of scattered resources • Flexibility: it can be updated any time, systematically • Consistency • Coherence • Reduced management • Reduced idiosyncrasies and errors • Increased efficiency • Reusable for many products or activities • It can be open-source and collaborative • It can be built with XML and programmed with XSLT for quick updating and deletion or insertion of new fields.
  70. 70. CONTEXTIFY: Contextify is a metadata tagger: text classification according to domain, genre, sublanguage (register, style, sentiment, emotions, opinions…)
  71. 71. Context- and Content-revealing Metadata and Text-Internal Annotation • Context can be “reconstructed” if we know the genre, and more accurately if we know other textual dimensions such as the domain of a text, the sublanguage used in the text, the sentiment expressed in a text, etc.
  72. 72. Context is King • Context helps disambiguate words and assess the relevance/importance of texts • Context helps identify the most important information. • How can we capture context from a text? In this application, I would start with genre, sublanguage, and domain, i.e. three textual dimensions that say something about the communicative context in which a text or a document has been issued: – A “weird” word like “Spweet” is not a typo if it belongs to a Twitter micropost (genre and sublanguage: tweet spam) – A “normal” word like “mouse” is a specialized term if it belongs to the computer domain. – Figurative senses and metaphors: surfing (sport, internet communication), agile (ordinary word, software), sentence (law, grammar), appeal (ordinary language: “appeal for help” or legal sublanguage: to lodge an appeal, genre: newspaper, court act) etc.
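
The examples above suggest how metadata produced by a tagger could drive word sense selection. The sketch below looks up a sense by word and domain and falls back to the general sense when the metadata gives no domain. The sense table, the metadata keys and the glosses are all hypothetical.

```python
# Hypothetical sense inventory keyed by (word, domain).
SENSES = {
    ("mouse", "computing"): "pointing device",
    ("mouse", "general"): "small rodent",
    ("appeal", "law"): "request to a higher court",
    ("appeal", "general"): "request for help",
}

def disambiguate(word, metadata, default_domain="general"):
    """Pick the sense of a word using the domain recorded in the metadata."""
    domain = metadata.get("domain", default_domain)
    if (word, domain) in SENSES:
        return SENSES[(word, domain)]
    return SENSES.get((word, default_domain))   # fall back to the general sense

print(disambiguate("mouse", {"domain": "computing", "genre": "manual"}))
print(disambiguate("appeal", {"genre": "newspaper"}))
print(disambiguate("appeal", {"domain": "law", "genre": "court act"}))
```

A fuller version would also consult genre and sublanguage, for instance to accept a word like “Spweet” in a micropost instead of flagging it as a typo.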
  73. 73. Content Enrichment
  74. 74. Benefits: Contextify • Helps identify: – the most reliable/relevant documents – how language is used in different communicative contexts – linguistic conventions – textual conventions – etc. • Contextualized information can be exploited to improve automatic summarization, machine translation, terminology extraction, indexing, etc.
  75. 75. WEB RIDER: A genre-aware information system
  76. 76. Web Rider is… • “WebRider” is a metaphor to describe a “socially and emotionally intelligent” information system that helps web users make sense of information on the web by internalizing genre cues. • The claim: genre cues contribute to information understanding and decision making, thus assisting users to ride the web safely.
  77. 77. Inspirations • Social and emotional intelligence [Daniel Goleman, psychologist] is the competence that lies behind mutual understanding, group interaction, social behaviour and all kinds of social actions. • Genres are social actions [Genre as social action (Carolyn Miller, 1984)]
  78. 78. Simply put… • Social actions – the driving forces behind the web – are manifested through genres. • Genres -- e.g. FAQs, press releases, product descriptions, instructions, guides, etc. -- are recurring and recognized patterns of communication that can help contextualize information. • An information system with Genre Competence would make web users socially and emotionally intelligent because, through the deep understanding of genre conventions and expectations, they would be able to evaluate the genuineness, reliability, authenticity, and the actual purpose of information distributed on the web.
  79. 79. The Technology underlying WebRider • Under discussion…
  80. 80. Main Benefits • Increased information understanding. WebRider would promote more adequate social and emotional behaviours through its genre competence • [Increased Accuracy and Relevance: more accurate information systems & more relevant search results] • Ethics: Protection against phishing, scams, malicious behaviour, cyberbullying, etc. while searching
  81. 81. FINDABILITY FOR BI & CEM. Pilot study: SearchInFocus
  82. 82. If you think of BI and CEM in terms of searchability, findability and actionability… – “Merrill Lynch estimates that more than 85 percent of all business information exists as unstructured data, commonly appearing in e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and Web pages.” [DM Review Magazine, February 2003 Issue] – ECONOMIC LOSS!
  83. 83. Simple search is not enough… • Of course, it is possible to use simple search. But simple search is unrewarding, because it is based on single terms. – “a search is made on the term felony. In a simple search, the term felony is used, and everywhere there is a reference to felony, a hit to an unstructured document is made. But a simple search is crude. It does not find references to crime, arson, murder, embezzlement, vehicular homicide, and such, even though these crimes are types of felonies” [Source: Inmon, B. & A. Nesavich, "Unstructured Textual Data in the Organization" from "Managing Unstructured data in the organization", Prentice Hall 2008, pp. 1–13]
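
The limitation described in the quotation can be made concrete with a toy comparison between a single-term search and a search that first expands the query with narrower terms. The taxonomy, the documents and the function names are invented for illustration; substring matching is used only to keep the sketch short.

```python
# Hypothetical taxonomy mapping a term to narrower terms.
TAXONOMY = {
    "felony": ["arson", "murder", "embezzlement", "vehicular homicide"],
}

DOCUMENTS = [
    "The suspect was charged with a felony.",
    "Police are investigating an arson case.",
    "Quarterly sales report for the northern region.",
]

def expand_query(term, taxonomy):
    """Return the term together with its narrower terms, if any."""
    return [term] + taxonomy.get(term, [])

def search(term, documents, taxonomy=None):
    """Return indexes of documents containing the term or its expansions."""
    terms = expand_query(term, taxonomy) if taxonomy else [term]
    return [i for i, doc in enumerate(documents)
            if any(t in doc.lower() for t in terms)]

print(search("felony", DOCUMENTS))             # simple search
print(search("felony", DOCUMENTS, TAXONOMY))   # search with query expansion
```

The simple search misses the arson document because the word "felony" never appears in it; the expanded search finds it through the taxonomy.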
  84. 84. Text Analytics • A set of NLP techniques that provide some structure to textual documents. • Common components: – Tokenization – Morphological Analysis – Syntactic Analysis – Named Entity Recognition – Sentiment Analysis – Automatic Summarization – Etc.
  85. 85. Text Analytics • Commercial: – Attensity – Clarabridge – Temis – Lexalytics – Texify – SAS – IBM Cognos – etc. • Open Source: – GATE – NLTK – UIMA – RapidMiner – etc.
  86. 86. SearchInFocus: searching for actionable intelligence • Pilot study: “Using query log analysis for BI and CEM” [In collaboration with Findwise(?)]
  87. 87. WHERE TO START?
  88. 88. Strategy [Diagram labels: Big unstructured textual data; Use cases: what kind of actionable information do users need?; Wireframes; User test; BETA open-source web-based prototypes; User feedback & collaborative improvement]
  89. 89. Academia • Many students can work on it at the same time, with different languages • There are many linguistic, cognitive and technical aspects to be analysed • Creation of multilingual shareable resources • Open source algorithms enriched by a collaborative effort
  90. 90. Thank you for your attention. Questions?