This document discusses annotation of anaphora and coreference in corpora for computational linguistics. It covers several annotation schemes including MUC, which aimed to achieve high inter-annotator agreement by focusing on coreference between noun phrases. The NP4E corpus aimed to develop guidelines for annotating both noun phrase and event coreference in newspaper articles. Annotation is a time-consuming process that requires concentration to identify mentions and relations accurately. Guidelines must be clear and consistent to help annotators agree on how to mark up texts.
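Inter-annotator agreement of the kind mentioned above is often quantified with a chance-corrected measure. The sketch below computes Cohen's kappa for two annotators; kappa is used here only as one common choice, since the summary does not say which measure the schemes relied on, and the label names and sample decisions are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical decisions: does each candidate noun phrase pair corefer?
annotator_1 = ["coref", "coref", "none", "none", "coref", "none"]
annotator_2 = ["coref", "none", "none", "none", "coref", "none"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

A value near 1 indicates agreement well above chance, while a value near 0 suggests the guidelines leave too much room for interpretation.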
This document provides an overview of the Minimalist Program (MP) proposed by Chomsky in 1993. It discusses the redundant and necessary levels of representation, including Logical Form and Phonetic Form. Principles like economy of derivation and economy of representation are explained. The document also covers topics like phrase structure, movements, feature checking, and the Full Interpretation Principle in MP. The conclusion states that MP aims to minimize theoretical concepts in syntax to achieve universality of grammar.
Automatic text simplification evaluation aspects (iwan_rg)
The document discusses automatic text simplification (ATS) and methods for evaluating ATS systems. It provides an overview of common evaluation metrics like BLEU, SARI, FKGL, and SAMSA and compares their abilities to measure simplicity, meaning preservation, and grammaticality. The document also outlines a proposed project to build a corpus and develop a graded reading scale to guide the simplification of Arabic fiction works for educational purposes.
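Of the metrics listed, FKGL (Flesch-Kincaid Grade Level) is the simplest to show in code, because it depends only on sentence, word and syllable counts. The following is a minimal sketch for English text; the vowel-group syllable counter is a rough assumption made for illustration, not part of the formula, and the function names are hypothetical.

```python
import re

def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    """Flesch-Kincaid Grade Level of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    # Standard FKGL coefficients: longer sentences and longer words raise the grade.
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

simple = "The cat sat. The dog ran."
complex_text = "Computational linguists evaluate simplification automatically."
print(fkgl(simple) < fkgl(complex_text))
```

Because FKGL looks only at surface length, it says nothing about meaning preservation or grammaticality, which is why it is usually reported alongside other metrics.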
1) Artificial Intelligence research draws from many disciplines including formal logic, probability theory, linguistics and philosophy. Computational logic combines and improves upon traditional logic and decision theory.
2) The paper argues that the abductive logic programming (ALP) agent model is a powerful model of both descriptive and normative thinking. It includes production systems and is compatible with classical logic and decision theory.
3) The ALP agent model treats beliefs as describing the world and goals as describing how the world should be. Its semantics aim to generate actions and assumptions to make goals and observations true based on beliefs.
This document presents a project report on sarcasm analysis using machine learning techniques. It discusses how sarcasm detection is a challenging task in natural language processing due to the gap between the literal and intended meaning of sarcastic texts. The report outlines a methodology to detect sarcasm in tweets by extracting features like intensifiers and interjections and training machine learning classifiers. Naive Bayes, maximum entropy, and decision tree classifiers are tested, with decision trees achieving the highest accuracy of 63%. The conclusion discusses how accuracy could be improved by incorporating better features, and future work includes adding context and detecting sarcasm in other languages.
The document discusses opinion mining and sentiment analysis. It describes how opinion mining uses natural language processing techniques on user input from internet sources to understand opinions. Sentiment analysis is used to extract emotions, subjects, and the impact of opinions. The key modules of an opinion mining and sentiment analysis system include opinion retrieval, sentiment classification, and summary generation. Sentiment classification applies a semi-supervised naive Bayes classifier using linguistic features to determine the polarity of opinions. While current systems can effectively analyze sentiments, challenges remain in handling ambiguity and analyzing opinions in different languages.
This document provides an overview of the Minimalist Program (MP) in linguistics. It discusses the following key points:
1. The MP aims to develop a simple linguistic model with minimal components and operations. It builds on principles of economy of derivation and representation from earlier theories like Government and Binding Theory.
2. Core concepts in the MP framework include morphosyntactic features, uninterpretable features, interpretable features, phases of derivation, probes and goals. Derivations proceed through numeration, spell-out at phases, and interpretation at the interfaces of phonetic form and logical form.
3. A phase is a syntactic domain like CP or VP that structures the derivation. Probes are
The document provides 10 guidelines for creating simple product architecture, including removing seldom-used features to reduce complexity, shortening the number of steps to access features, providing fewer options to avoid overwhelming users, using intuitive icons, anticipating and addressing user fears, controlling where users' attention is directed, embracing constraints to drive creativity, guessing what features users will want, and making products self-teaching through contextual use and feedback. Following these guidelines can result in products that are faster to build, easier to modify, and simpler for both users and developers.
The document discusses how people often convince themselves that they will be happier once they achieve certain life goals such as getting married, having children, buying a nice car, retiring, etc. However, it notes that there is no better time to be happy than the present. It also discusses how happiness is a journey, not a destination. While we may remember famous or successful people, what really matters are the people in our lives who care for us and support us. The document concludes by relating a story about disabled athletes who stopped racing to help a fellow athlete, showing that what's most important in life is helping others.
The RCM analysis process consists of 7 steps to determine the best maintenance strategy for an asset. It involves gathering basic information, performing a failure modes and effects analysis, identifying risks of failure, determining maintenance actions to mitigate risks, analyzing costs and benefits, compiling results, and generating final reports. The process is supported by software that facilitates information sharing and automatic report generation.
Loving Hut is a vegan restaurant chain that serves plant-based meals without animal products or byproducts. Founded in Taiwan in 1992, Loving Hut now has locations around the world serving dishes made from tofu, vegetables, grains and fruits. Their mission is to promote compassion for all living beings and a cruelty-free lifestyle.
The song is about the importance of women and gratitude toward them. John Lennon expresses his conflicting feelings and his eternal debt to women for showing him the meaning of success and life. He asks the woman to hold him close to her heart and to understand that he never meant to cause her pain, and that he loves her now and forever.
Have a look at these 6 slides and see that historically, we justified the abuse of humans.
After becoming more civilised and enlightened we have moved on to animals.
The comparisons are sad
New Forodhani (Jubilee Garden) Being Rebuilt (Heena Modi)
The Forodhani (Jubilee Garden) in Zanzibar is being rebuilt, but it has reduced greenery and added more inner roads, replacing sturdy British benches with flimsy concrete ones and chopping down some trees. Photos of the original garden are now just memories of what used to be there.
Boost Conversions and Raise your Revenues with A/B testing in Rails (Daniel Pritchett)
Boost Conversions and Raise your Revenues with A/B testing in Rails:
Motivations:
* Products with highly relevant messaging find their way into the hands of people happy to pay for them.
* Gain insight into which features of your product resonate most with customers.
* Validate market appetites for different flavors of product messaging.
* Propose alternate layouts without compromising overall design goals.
* Automation skills harmonize well with product and marketing groups in this arena.
The document appears to be written in Catalan and contains the names of two people, Joan Antoja i Mas and Òscar Morales Díaz, with no further context or information provided.
This document lists various sights and landmarks found in the state of Kansas, including the state flag and seal, counties, a hand dug well in Greensburg, botanical gardens in Wichita, the Cosmosphere in Hutchinson, Fort Larned, Mount Sunflower, an Oregon Trail marker, Pawnee Rock, Fort Scott, and the Kansas Vietnam Memorial. It also mentions wagon trails.
The document expresses regret for any flaws or statements in its content that are not aligned with Jain teachings. It apologizes for any offense caused knowingly or unknowingly through its statements. The document can be shared privately with friends if deemed appropriate, but is only for non-commercial, private use.
Greens - a gorgeous setting with delicious vegan treats on the menu! (Heena Modi)
Greens is a restaurant located at 15 Marina Blvd in San Francisco, California. Its phone number is +1 415-771-6222 and its website is greensrestaurant.com. The document provides contact information for Greens restaurant.
The document summarizes the life cycle of a broiler hen from hatching to slaughter. Newly hatched chicks are quickly transported on conveyor belts and crammed into small spaces to grow rapidly to meet demand for meat. However, their legs cannot support their fast-growing bodies. Workers roughly grab the fully-grown hens and cram them into crates, causing injuries, before transporting them to slaughter where machines stun, decapitate, and scald them while still alive at times due to inaccuracies.
This document discusses using social media marketing for local businesses. It provides an overview of key social media platforms like Facebook and Twitter and their usage statistics. It emphasizes how local businesses can engage customers on these platforms through sharing content, photos and videos to build authentic relationships and drive traffic. The document also outlines specific social media marketing tactics and tools local businesses can use, from creating Facebook pages and profiles to using advertising, and stresses the importance of experimenting with different approaches.
The document discusses fashion designer Jean Fares and his fashion house Jean Fares Couture. It provides details about Fares' background and philosophy, describes Jean Fares Couture's collections and ready-to-wear lines. It also lists many famous Hollywood stars and celebrities who have worn Jean Fares Couture designs, praising the brand's innovative couture gowns.
ROI Conference 2013 - Your Social Success Story (Chris Treadaway)
This document discusses how social media and online platforms have changed marketing and business. It notes that most customers now find businesses online rather than through traditional advertising. It provides tips on using platforms like Google, Facebook, YouTube, and blogs to engage customers and drive business. The key is choosing the right outlets, consistently publishing content, and identifying influential customers to engage with online. While this takes work, listening to customers and participating in online conversations can help businesses adapt to changes in how customers interact with companies.
The role of linguistic information for shallow language processing (Constantin Orasan)
The document discusses shallow language processing and summarization. It argues that while deep language understanding is limited, shallow methods can be improved by adding linguistic information. As an example, it shows how term frequency, anaphora resolution, discourse cues and genetic algorithms can select extractive summaries that better match human abstracts, without requiring full text comprehension.
Determining the Types of Temporal Relations in Discourse (Leon Derczynski)
Working out when events in a text happen is difficult. Many have tried over the past decade but the state of the art has not advanced.
After introducing a few fundamental concepts for dealing with time in language, we work out what makes this task so difficult, and then identify two common causes of temporal ordering difficulty and describe how to overcome them.
Full document: http://derczynski.com/sheffield/papers/derczynski-phdthesis.pdf
This document summarizes Anabela Barreiro's PhD defense on using automated paraphrasing to improve machine translation, specifically for support verb constructions. It discusses how paraphrasing support verb constructions into semantically related verbs can simplify language and reduce ambiguity, improving machine translation quality. The thesis presents work formalizing support verb constructions and generating paraphrases, and experiments showing paraphrasing improved machine translation results by 21-31%. It suggests areas for future work expanding linguistic knowledge and paraphrasing capabilities.
This document summarizes research that has been done on computational morphology for the Odia language. It begins with an abstract that outlines how morphological analysis, generation, and parsing are important tools for natural language processing. The document then reviews different works that have developed morphological analyzers and generators for Odia. It describes various methods that have been used, including suffix stripping, finite state transducers, two-level morphology, corpus-based approaches, and paradigm-based approaches. Finally, it outlines several applications of morphology like machine translation, spelling checking, and part-of-speech tagging.
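Of the methods listed, suffix stripping is the easiest to illustrate. The sketch below tries the longest matching suffix first and accepts it only if the remaining stem is in a lexicon. The suffix table and lexicon are hypothetical and English-like, chosen purely for readability; they are not Odia data and not taken from any of the reviewed systems.

```python
# Hypothetical suffix table (English-like, for illustration only): suffix -> feature tag.
SUFFIXES = {"ing": "PROG", "ed": "PAST", "es": "PL", "s": "PL"}

def strip_suffix(word, lexicon):
    """Return (stem, tag) using the longest suffix whose stem is in the lexicon."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and stem in lexicon:
            return stem, SUFFIXES[suffix]
    # No suffix applied: the word is either a known root or unanalysable.
    return word, ("ROOT" if word in lexicon else "UNKNOWN")

lexicon = {"walk", "box"}
for w in ["walked", "boxes", "walk", "xyz"]:
    print(w, strip_suffix(w, lexicon))
```

Checking the stem against a lexicon is what keeps the stripper from over-segmenting words that merely happen to end in a suffix-like string.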
Shallow parser for Hindi language with an input from a transliterator (Shashank Shisodia)
This document summarizes a student project to develop a shallow parser for Hindi language with input from a transliterator. The plan is to create a transliterator to convert Roman script to Devanagari, generate a lexicon from corpus analysis, develop a morphological analyzer using finite state transducers, and implement a shallow parser using context free grammar. The system architecture and flow chart are presented. In conclusion, the document notes that shallow parsing is needed to build full parsers for Hindi and transliteration is important for translating names and terms across languages with different alphabets.
Natural language processing (NLP) involves analyzing and understanding human language to allow interaction between computers and humans. The document outlines key steps in NLP including morphological analysis, syntactic analysis, semantic analysis, and pragmatic analysis to convert text into structured representations. It also discusses statistical NLP and real-world applications such as machine translation, question answering, and speech recognition.
Natural language processing (NLP) is introduced, including its definition, common steps like morphological analysis and syntactic analysis, and applications like information extraction and machine translation. Statistical NLP aims to perform statistical inference for NLP tasks. Real-world applications of NLP are discussed, such as automatic summarization, information retrieval, question answering and speech recognition. A demo of a free NLP application is presented at the end.
This document provides an overview of natural language processing (NLP) tools and resources that can be used to build a machine learning classifier to identify the fame of people mentioned in news articles. It describes NLP tasks like tokenization, part-of-speech tagging, chunking, named entity recognition, parsing, and coreference resolution. It also introduces libraries like the Curator for accessing NLP tools, Edison for feature extraction, and Learning Based Java for building the classifier. Finally, it demonstrates connecting all the pieces to construct a system that can label famous people as politicians, athletes, or corporate moguls.
Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team.
August 30-31, 2007.
Patan Dhoka, Lalitpur, Nepal.
Speech recognition and removal of disfluencies (Ankit Sharma)
This document discusses techniques for automatic speech recognition, including detecting sentence boundaries and disfluencies. It covers:
1) The process of speech recognition including digitization, acoustic analysis, and linguistic interpretation of the speech signal.
2) Statistics-based approaches to speech recognition which use large speech corpora to train models to learn correspondences between speech and text.
3) Challenges in speech recognition including variability between individuals, detecting sentence boundaries and disfluencies, and current performance which still has room for improvement.
This document discusses research on automated text summarization. It defines a summary as a shorter text that retains the key information from the original text(s). There are typically three stages to automated summarization: topic identification to extract important units, interpretation to fuse concepts using external knowledge, and generation to produce coherent readable text. Various methods are reviewed for the topic identification stage, including analyzing positional, cue phrase, frequency-based, title overlap, and discourse structure criteria. Combining the scores from different methods improves performance over using a single method alone.
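The idea of combining several topic-identification criteria can be sketched as a weighted sum of per-sentence scores. The example below covers three of the criteria named above (position, word frequency and title overlap) and leaves out cue phrases and discourse structure. The stopword list, the equal weights and the function names are assumptions made for illustration, not values from the reviewed work.

```python
import re
from collections import Counter

# Small hypothetical stopword list, just enough for the example.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that"}

def tokens(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def score_sentences(title, sentences, weights=(1.0, 1.0, 1.0)):
    """Score each sentence by position, word frequency and title overlap."""
    doc_freq = Counter(w for s in sentences for w in tokens(s))
    max_freq = max(doc_freq.values(), default=1)
    title_words = set(tokens(title))
    w_pos, w_freq, w_title = weights
    scores = []
    for i, sentence in enumerate(sentences):
        words = tokens(sentence)
        position = 1.0 / (i + 1)  # earlier sentences score higher
        # Mean document frequency of the sentence's words, scaled to [0, 1].
        frequency = sum(doc_freq[w] for w in words) / max(1, len(words)) / max_freq
        overlap = len(title_words & set(words)) / max(1, len(title_words))
        scores.append(w_pos * position + w_freq * frequency + w_title * overlap)
    return scores

def summarize(title, sentences, k=1):
    """Return the k best-scoring sentences in their original order."""
    scores = score_sentences(title, sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]

doc = [
    "Summarization methods shorten a text.",
    "The weather was pleasant yesterday.",
    "Cats sleep often.",
]
print(summarize("Text summarization methods", doc))
```

In practice the weights would be tuned on documents with reference summaries rather than fixed by hand.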
Natural Language Processing_in semantic web.pptx (AlyaaMachi)
This document discusses natural language processing (NLP) techniques for extracting information from unstructured text for the semantic web. It describes common NLP tasks like named entity recognition, relation extraction, and how they fit into a processing pipeline. Rule-based and machine learning approaches are covered. Challenges with ambiguity and overlapping relations are also discussed. Knowledge bases can help relation extraction by defining relation types and arguments.
The recognition of a spoken word can be viewed as classifying an auditory stimulus into one "word form" category, chosen from many alternatives.
This process requires matching the spoken input against the mental representations associated with the word candidates and selecting one of the several candidates that are at least partially consistent with the input.
Spoken word recognition starts from a string of phonemes (Dahan & Magnuson, 2006), establishes how these phonemes should be grouped to form words, and passes these words on to the next level of processing.
Some theories, though, take a broader view and blur the distinction between speech perception, spoken word recognition, and sentence processing (Elman, 2004; Gaskell & Marslen, 1997; Klatt, 1979; McClelland, 1989).
Annotated text corpora are an important resource for natural language processing research and technologies. Corpora can be annotated with linguistic information like parts of speech, morphology, syntax, and semantics through a layered approach. This involves manually or automatically tagging words, sentences, and texts with linguistic metadata. Well-annotated corpora are essential for tasks like morphological analysis, part-of-speech tagging, parsing, and machine translation model training.
The document analyzes the pragmatic functions of discourse markers used by interpreters in simultaneously interpreting the 2012 Chinese Spring Festival Gala from Chinese to English. It categorizes discourse markers into 7 types, and discusses how markers help reduce cognitive load or enhance communication. The study aims to determine the distribution and purposes of different discourse marker categories through analyzing over 3 hours of interpreting data.
FCA-MERGE: Bottom-Up Merging of Ontologies (alemarrena)
The document describes a new bottom-up method called FCA-MERGE for merging ontologies. It extracts instances from documents for each ontology to generate formal contexts. It then merges the contexts and computes a concept lattice using techniques from Formal Concept Analysis. This lattice provides a structural description of the merging process. The final merged ontology is then generated from the lattice with human guidance. FCA-MERGE circumvents the problem of finding instances classified in both ontologies by extracting instances from relevant documents.
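The central Formal Concept Analysis step, going from a formal context to its concepts, can be illustrated with a brute-force sketch. A formal context maps objects (here, documents) to attributes (here, ontology concepts they instantiate), and a formal concept is a pair (extent, intent) in which each set determines the other. The code below enumerates every subset of objects, so it is only suitable for tiny contexts; it is not the algorithm used by FCA-MERGE, and the sample context is hypothetical.

```python
from itertools import combinations

def formal_concepts(context):
    """context: dict mapping object -> set of attributes.

    Returns the set of (extent, intent) pairs as frozensets.
    """
    objects = list(context)
    all_attrs = set().union(*context.values()) if context else set()

    def intent(objs):
        # Attributes shared by every object in objs (all attributes if objs is empty).
        if not objs:
            return frozenset(all_attrs)
        return frozenset(set.intersection(*(set(context[o]) for o in objs)))

    def extent(attrs):
        # Objects that carry every attribute in attrs.
        return frozenset(o for o in objects if attrs <= set(context[o]))

    concepts = set()
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            shared = intent(subset)
            concepts.add((extent(shared), shared))
    return concepts

# Hypothetical merged context: documents and the concepts they instantiate.
context = {"doc1": {"hotel", "event"}, "doc2": {"hotel"}}
for ext, itt in sorted(formal_concepts(context), key=lambda c: len(c[0])):
    print(sorted(ext), sorted(itt))
```

Ordering the resulting concepts by inclusion of their extents gives the concept lattice that the merging step then inspects.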
The document provides an overview of ontology and its various aspects. It discusses the origin of the term ontology, which derives from Greek words meaning "being" and "science," so ontology is the study of being. It distinguishes between scientific and philosophical ontologies. Social ontology examines social entities. Perspectives on ontology include philosophy, library and information science, artificial intelligence, linguistics, and the semantic web. The goal of ontology is to encode knowledge to make it understandable to both people and machines. It provides motivations for developing ontologies such as enabling information integration and knowledge management. The document also discusses ontology languages, uniqueness of ontologies, purposes of ontologies, and provides references.
This paper presents a rule based model of parts of speech (POS) tagset for Classical Tamil Texts (CTT). The noun forms are type pattern, verb forms are token pattern. This is based on form agreement method. This is a very efficient and novel approach because Tamil Language has a build-in system of agreement/concord of the sentence. Classical Tamil Tagset is divided into two basic classifications, noun morphology and verb morphology.
AUTOMATIC ARABIC NAMED ENTITY EXTRACTION AND CLASSIFICATION FOR INFORMATION R...kevig
This document describes a rule-based system for extracting and classifying Arabic named entities (NEs). It uses linguistic information like morphosyntactic analysis, semantic classification of lexical items, and syntactico-semantic rules. Trigger words and extensions are used to recognize NE structures. Trigger words are classified as specific or generic to determine whether the NE class is identified by the trigger word alone or through combinations with extensions. The system was evaluated on journalistic corpora and can contribute to information retrieval and extraction applications.
Similar to Annotation of anaphora and coreference for automatic processing (20)
Annotation of anaphora and coreference for automatic processing
1. Annotation of anaphora
and coreference for
automatic processing
Constantin Orasan
Research Group in Computational Linguistics
University of Wolverhampton, UK
http://www.wlv.ac.uk/~in6093/
2. Why use corpora in
anaphora/coreference resolution
In this talk, corpora are discussed for:
Training machine learning systems
Testing anaphora/coreference resolution
algorithms
Annotation:
Linguistically motivated: tries to capture certain
phenomena (usually focuses on anaphora)
Application motivated: limited relations are
encoded (usually focuses on coreference)
3. Structure
1. Background information
2. The MUC annotation for coreference
3. The NP4E corpus
4. Event coreference and NP coreference
5. Conclusions
4. Anaphora and anaphora
resolution
cohesion which points back to some previous item
(Halliday and Hasan, 1976)
the pointing back word is called an anaphor, the
entity to which it refers or for which it stands is its
antecedent (Mitkov, 2002)
The process of determining the antecedent of an
anaphor is called anaphora resolution (Mitkov,
2002)
Anaphora resolution can be seen as a process of
filling empty or almost empty expressions with
information from other expressions
5. Coreference and coreference
resolution
When the anaphor refers to an antecedent
and when both have the same referent in real
world they are termed coreferential (Mitkov,
2002)
The process of establishing which referential
NPs point to the same discourse entity is
called coreference resolution
6. Examples of anaphoric
expressions from Mitkov (2002)
Sophia Loren says she will always be grateful to
Bono. The actress revealed that the U2 singer helped
her calm down when she became scared by a
thunderstorm while travelling on a plane.
Coreferential chains:
{Sophia Loren, she, the actress, her, she},
{Bono, the U2 singer},
{a thunderstorm},
{a plane}
7. Examples of anaphoric
expressions from Mitkov (2002)
Indirect anaphora: Although the store had only just
opened, the food hall was busy and there were long
queues at the tills.
Identity-of-sense anaphora: The man who gave
his paycheck to his wife was wiser than the man who
gave it to his mistress
Verb and adverb anaphora: Stephanie sang, as
did Mike
Bound anaphora: Every man has his own agenda
Cataphora: The elevator opened for him on the 14th
floor, and Alec stepped out quickly.
8. Anaphora vs. coreference
There are many anaphoric expressions which are
not coreferential
Most of the coreferential expressions are anaphoric
(Sophia Loren, the actress)
Coreferential expressions that may be or may not be
anaphoric
(Sophia Loren, the actress Sophia Loren) – not anaphoric?
(the actress Sophia Loren, Sophia Loren) – anaphoric
Coreferential expressions which are not anaphoric
(Sophia Loren, Sophia Loren)
Cross-document coreference is not anaphora
9. Substitution test
To determine whether two entities are
coreferential, the substitution test is used
Sophia Loren says she will always be grateful to
Bono. → Sophia Loren says Sophia Loren will
always be grateful to Bono.
John has his own agenda. → John has John’s own
agenda.
Every man has his own agenda. → Every man has
every man’s own agenda. ??
10. Anaphora & coreference in
computational linguistics
are important preprocessing steps for a wide
range of applications such as machine
translation, information extraction, automatic
summarisation, etc.
From a linguistic perspective the expressions
processed are rather limited
11. Developing annotated corpora for
computational linguistics
A simple, reliable annotation task
Producing a CL-oriented resource
Capturing the most widespread and best-understood anaphoric
relation: identity-of-reference direct nominal anaphora
Identity-of-reference: elements corresponding to the same discourse entity
Direct: including identity, synonymy, generalisation and specialisation
Nominal anaphora: referring expressions (pronouns, definite NPs, or proper names)
that have non-pronominal NP antecedents in the preceding text / dialogue
12. Terminology
Entity = an object or set of objects in the world
Entities can have types (ACE requires annotating
only certain types, e.g. person, location,
organisation, etc.)
Mention = a textual reference to an entity (usually an
NP)
Direct anaphora = identity of head, generalisation,
specialisation or synonymy
Indirect anaphora = part-of, set-membership
13. Annotation of anaphora/
coreference
In general the process can be split into two
stages:
Identification and annotation of elements involved
in a relation (annotation of mentions)
Identification and annotation of relations between
mentions
The two stages can be done together or
separately
14. Annotation of mentions
Annotate everything?
Singletons should be annotated because they
influence evaluation measures (except MUC
score)
If everything is annotated it is easier if this
annotation is done in the first instance
Syntactic annotation can be useful
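The point about singletons and evaluation measures can be illustrated with a minimal sketch. It assumes the link-based MUC recall of Vilain et al. (1995) and, as a contrasting mention-based measure not named on the slide, B-cubed recall. The function names and the toy mention identifiers (a1, b1, c1, ...) are hypothetical.

```python
def muc_recall(key, response):
    """Link-based MUC recall: sum(|K| - |partition of K|) / sum(|K| - 1)."""
    num = den = 0
    for entity in key:
        parts = set()
        for mention in entity:
            # index of the response entity holding this mention, if any
            owner = next((i for i, r in enumerate(response) if mention in r), None)
            # a mention absent from the response forms its own partition
            parts.add(owner if owner is not None else ("unresolved", mention))
        num += len(entity) - len(parts)
        den += len(entity) - 1
    return num / den if den else 0.0


def b_cubed_recall(key, response):
    """Mention-based B-cubed recall, averaged over all key mentions."""
    total = count = 0
    for entity in key:
        for mention in entity:
            resp = next((r for r in response if mention in r), {mention})
            total += len(entity & resp) / len(entity)
            count += 1
    return total / count


# Toy data: two key entities, one of them split in the response.
key = [{"a1", "a2", "a3"}, {"b1", "b2"}]
response = [{"a1", "a2"}, {"a3"}, {"b1", "b2"}]

# Same data with one singleton entity added to both key and response.
key_s = key + [{"c1"}]
response_s = response + [{"c1"}]

print(muc_recall(key, response), muc_recall(key_s, response_s))
print(b_cubed_recall(key, response), b_cubed_recall(key_s, response_s))
```

In this sketch the singleton adds nothing to either the numerator or the denominator of the MUC recall, so the score stays the same, whereas the mention-based score changes once the singleton is counted.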
15. Annotation of relations
Each annotation scheme defines a set of
relations that should be covered
The relations normally happen between
mentions/markables
16. MUC annotation (Hirschman
1997)
Defined in the coreference resolution task at MUC
The criteria used to define the task were:
1. Support for the MUC information extraction tasks;
2. Ability to achieve good (ca. 95%) interannotator
agreement;
3. Ability to mark text up quickly (and therefore, cheaply);
4. Desire to create a corpus for research on coreference and
discourse phenomena, independent of the MUC extraction
task.
These criteria are not necessarily consistent with
each other
17. MUC annotation scheme
Marks only relations between noun phrases
Does not mark relations between verbs,
clauses, etc.
Marks only IDENTITY which defines
equivalence classes and is not directional
Values which are clearly distinct should not
be allowed to be in the same class e.g. the
stock price fell from $4.02 to $3.85
18. MUC annotation scheme (II)
SGML used
<COREF ID="100">Lawson Mardon Group Ltd.</COREF> said
<COREF ID="101" TYPE="IDENT" REF="100">it </COREF> ...
Attributes:
ID a unique identifier for a mention
REF indicates links between mentions
TYPE the type of link (only IDENT supported)
MIN the minimum span to be identified in order to be
considered correct in automatic evaluation
STATUS=“OPT” to indicate optional elements to be
resolved
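As a rough illustration of how the ID and REF attributes define equivalence classes, the sketch below reads COREF elements and groups mentions into chains. It is a minimal sketch, not the MUC scorer: the regular expression only handles non-nested COREF elements, the function names are hypothetical, and the continuation of the sample sentence after "it" is invented for the example.

```python
import re

# Sample text: the first two mentions come from the slide, the rest is invented.
SAMPLE = ('<COREF ID="100">Lawson Mardon Group Ltd.</COREF> said '
          '<COREF ID="101" TYPE="IDENT" REF="100">it </COREF> expects '
          '<COREF ID="102" TYPE="IDENT" REF="101">its</COREF> sales to rise.')


def read_mentions(sgml):
    """Return {ID: attributes} for every non-nested COREF element."""
    mentions = {}
    for m in re.finditer(r'<COREF\s+([^>]*)>(.*?)</COREF>', sgml, re.S):
        attrs = dict(re.findall(r'(\w+)="([^"]*)"', m.group(1)))
        attrs["TEXT"] = m.group(2).strip()
        mentions[attrs["ID"]] = attrs
    return mentions


def build_chains(mentions):
    """Group mentions into chains by following REF links (union-find)."""
    parent = {mid: mid for mid in mentions}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for mid, attrs in mentions.items():
        ref = attrs.get("REF")
        if ref in parent:  # ignore missing or dangling REF values
            parent[find(mid)] = find(ref)

    chains = {}
    for mid in mentions:
        chains.setdefault(find(mid), []).append(mentions[mid]["TEXT"])
    return list(chains.values())


chains = build_chains(read_mentions(SAMPLE))
print(chains)
```

Because IDENT is treated as non-directional, it does not matter whether a mention points to the first or to the nearest earlier mention: all linked mentions end up in the same class.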
19. MUC annotation scheme –
markables (III)
NPs (including dates, percentages and
currency expressions), personal and
demonstrative pronouns
Interrogative “wh-” NPs are not marked
(Which engine would you like to use?)
The extent of the markable is quite loosely
defined (it must include the head, but should
really include the maximal NP, with the MIN
attribute having the head as its value)
20. MUC annotation scheme –
relations
Basic coreference
Bound anaphors
Apposition
<COREF ID="1" MIN="Julius Caesar">Julius Caesar, <COREF
ID="2" REF="1" MIN="emperor" TYPE="IDENT"> the/a well-known
emperor,</COREF></COREF>
Predicate nominals
<COREF ID="1" MIN="Julius Caesar">Julius Caesar</COREF> is
<COREF ID="2" REF="1" MIN="emperor" TYPE="IDENT">the/a
well-known emperor</COREF> who …
For appositions and predicate nominals there needs
to be certainty (“is”, not “may be”)
21. MUC annotation - criticism
Van Deemter and Kibble (1999) criticised the
MUC scheme because it goes beyond
annotation of coreference as it is commonly
understood because:
It marks quantifying NPs (e.g. every man, most
people)
Marks indefinite NPs
Henry Higgins, who was formerly sales director of Sudsy
Soaps, became president of Dreamy Detergents.
and one can argue not in a consistent manner
the stock price fell from $4.02 to $3.85
22. MUC annotation & corpus
Despite criticism the MUC annotation provided a
starting point for standardising
anaphora/coreference annotation schemes
Designed to mark only a small set of expressions
and relations which can be tackled by computers
Was proposed in the context of a competition
comparison of results and backing of an
organisation
The corpus is available
23. Corpus of technical manuals
(Mitkov et al. 2000)
A corpus of technical manuals annotated with
a MUC-7 like annotation scheme
Annotates only identity of reference between
direct nominal referential expressions
Less interesting from a linguistic perspective,
but used to develop automatic methods
24. Corpus of technical manuals
(Mitkov et al. 2000)
Full coreferential chains are annotated
All the mentions are annotated regardless of
whether they are singletons or not
The relation of coreference is considered fully
transitive
The MUC annotation scheme was used but
the guidelines were not adapted completely
CLinkA (Orasan 2000) was used for
annotation
25. Annotation guidelines
The starting point was the MUC-7 annotation
guidelines, but:
Stricter about what counts as identity of meaning
(e.g. we do not consider indefinite appositional
phrases coreferential with the phrases that
contain them)
An indefinite NP cannot refer to anything
Gerunds are not considered mentions
26. Add missing phenomena:
V [NP1] as [NP2] – not coreferential
[use [a diagonal linear gradient] as [the map]] – is not
coreferential
[elect [John Prescott] as [Prime Minister]], – is not coreferential
…if [[ an NTSC Ø ]i or [ PAL monitor ]j]k is being used…[ The
NTSC monitor ]l… - not coreferential
…[[the pixels’ luminance]i or [Ø Ø saturation]j ]k is important…
[The pixels’ saturation]j - coreferential
27. Annotation guidelines – short version
Do:
(i) annotate identity-of-reference direct nominal anaphora
(ii) annotate definite descriptions which stand in any of the identity, synonymy, generalisation, specialisation, or copula relationships with an antecedent
(iii) annotate definite NPs in a copula relation as coreferential
(iv) annotate definite appositional and bracketed phrases as coreferential with the NP of which they are a part
(v) annotate NPs at all levels from base to complex and co-ordinated
(vi) familiarise yourself with the use of unfamiliar, highly specialised terminology by searching through the text
Do not:
(i) annotate indefinite predicate nominals that are linked to other elements by perception verbs as coreferential with those elements
(ii) annotate identity-of-sense anaphora
(iii) annotate indirect anaphora between markables
(iv) annotate cross-document coreference
(v) annotate indefinite NPs in copula relations with other NPs as coreferential
(vi) annotate non-permanent or “potential” coreference between markables
(vii) annotate bound anaphors
(viii) consider gerunds of any kind markable
(ix) annotate anaphora over disjoined antecedents
(x) consider first or second person pronouns markable
28. Speed of annotation (Mitkov et
al. 2000)
Speed of annotation in one hour:
At the beginning while the guidelines were being created:
assign 288 mentions to 220 entities covering on average
2051 words in text
After the annotators became used to the task and the
guidelines finalised: assign 315 mentions to 250 entities
covering on average 1411 words in text
Fast track annotation for pronoun resolution in one
hour: 113 pronouns, 944 candidates and 148
antecedents, covering 10862 words
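The hourly figures above can be turned into comparable rates with a few lines of arithmetic. This is only a sketch over the numbers quoted on the slide; the dictionary keys and the `rates` helper are hypothetical names chosen for the example.

```python
# Per-hour figures quoted on the slide.
sessions = {
    "guidelines in progress": {"mentions": 288, "entities": 220, "words": 2051},
    "guidelines finalised": {"mentions": 315, "entities": 250, "words": 1411},
}


def rates(session):
    """Derive mentions per entity and mentions per 100 words of text."""
    return {
        "mentions_per_entity": session["mentions"] / session["entities"],
        "mentions_per_100_words": 100 * session["mentions"] / session["words"],
    }


summary = {name: rates(s) for name, s in sessions.items()}
for name, r in summary.items():
    print(f"{name}: {r['mentions_per_entity']:.2f} mentions/entity, "
          f"{r['mentions_per_100_words']:.1f} mentions per 100 words")
```

Read this way, the later sessions cover fewer words per hour but mark more mentions per 100 words of text.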
29. Speed of annotation (II)
Most of the time during the annotation is
spent identifying the mentions
… existing annotation levels can prove very
beneficial
30. Reasons for disagreements
The process is tedious and requires high
levels of concentration
Two main reasons for disagreement:
Unsteady references – mentions which may
belong to different entities through the document
(e.g. image, the window) – the automatic
annotation option of the annotation tool may also
mislead
Specialised terminology
31. Improving annotation strategies
Unsteady reference: Pre-annotation stage to clarify
topic segments
Domain knowledge: Pre-annotation stage to
disambiguate unknown technical terminology
‘Master strategy’ combining individual
approaches:
Printing text prior to annotation - increases familiarity
Two step process
Taking note of troublesome cases to discuss later with others
Annotating intensively vs sporadically
32. NP4E (Hasler et al. 2006)
The goal was to develop annotation guidelines for
NP and event coreference in newspaper articles
about terrorism/security threats
A small corpus annotated with NP and event
coreference was produced
An attempt to produce a more refined annotated
resource than our previous corpus
5 clusters of related documents in the domain were
built, about 50,000 words
http://clg.wlv.ac.uk/projects/NP4E/
33. NP coreference annotation
Used the guidelines developed by Mitkov et al.
(2000) as the starting point,
but adapted them for our goals and texts
All the mentions need to be annotated, both definite and
indefinite NPs
Introduced coref and ucoref tags to be able to deal with
uncertainties
[The government] will argue that… [[McVeigh] and [Nichols]] were [the
masterminds of [the bombing plot]]
Types of relations between an NP and its antecedent:
identity, synonymy, generalisation, specialisation and
other, but we do not annotate indirect anaphora
34. NP coreference annotation (II)
Types of (coreference) relations we identify: NP, copular, apposition,
bracketed text, speech pronoun and other
Link to the first element of the chain in most of the cases for type NP
For copular, apposition, bracketed text and speech pronouns (pronouns
which occur in direct speech), the anaphor should be linked back to the
nearest mention of the antecedent in the text
Do not annotate different readings of an NP as
coreferential
[A jobless Taiwanese journalist who commandeered [a Taiwan airliner] to [China]]…
[China] ordered [[its] airports] to beef up [security]…
35. Annotation of NPs using PALinkA
[Screenshot of PALinkA with callouts:]
the plane is marked as coreferential with The aircraft
WordNet is consulted about the relation between the two NPs
The user can override WordNet’s decision
36. Issues arising during the NP
annotation
The antecedent of pronoun we in direct speech can
be linked to: the individual speaker, a group
represented by the speaker or nothing
General concepts such as violence, terror, terrorism,
police, etc are sometimes used in a general sense
so it is difficult to know whether to annotate and how
Sometimes difficult to decide the best indefinite NP
as an antecedent
…the man detained for hijacking [a Taiwanese airliner]… Liu
forced [a Far East Air Transport domestic plane]… Beijing
returned [the Boeing 757]…
37. Issues arising during the NP
annotation (II)
Mark relative pronouns/clauses and link them
to the nearest mention
Chinese officials were tightlipped whether [Liu Shan-chung,
45, [who] is in custody in China's southeastern city of
Xiamen], would be prosecuted or repatriated to Taiwan.
The type of relation is sometimes difficult to
establish without the help of WordNet (have
ident, non-ident)
38. Annotation of event coreference
Event = a thing that happens or takes place, a single
specific occurrence, either instantaneous or
ongoing.
Used the ACE annotation guidelines as starting
point
Events marked: ATTACK, DEFEND, INJURE, DIE,
CONTACT
Identify the trigger = the best word to represent the
event
Triggers: verbs, nouns, adjectives and pronouns
{The blast} {killed} 168 people…and {injured}
hundreds more… (ATTACK: noun, DIE: verb,
INJURE: verb)
39. Event triggers
ATTACK: attack events are physical actions which aim to cause harm
or damage to things or people: attack, bomb, shoot, blast, war, fighting,
clashes, throw, hit, hold, spent.
DEFEND: defend events are events where people or organisations
defend something, usually against someone or something else:
sheltering, reinforcing, running, prepared.
INJURE: injure events involve people experiencing physical harm:
injure, hurt, maim, paralyse, wounded, ailing.
DIE: die events happen when a person’s life ends: kill, dead, suicide,
fatal, assassinate, died, death.
CONTACT: contact events occur when two or more parties
communicate in order to try and resolve something, reach an
agreement or better relations between different sides etc. This category
includes demands, threats and promises made by parties during
negotiations: meeting, talks, summit, met, negotiations, conference,
called, talked, phoned, discussed, promised, threatened, agree, reject,
demand.
40. Annotation of event coreference
Two stage process: identify the triggers and then
link them
Link arguments of an event to NP annotated in the
previous stage
The arguments are event dependent (e.g.
ATTACKER, MEANS, VICTIM, CAUSE, AGENT,
TOPIC and MEDIUM)
The arguments should be linked to NPs from the
same sentence or nearby sentences if they are
necessary to disambiguate the event
Also TENSE, MODALITY and POLARITY need to
be indicated
41. Annotation of an attack event using PALinkA
[Screenshot of PALinkA; annotated event mention:]
the operation
TYPE: attack
TIME: Dec. 17
REF: stormed
TARGET: the Japanese ambassador's residence in Lima (FACILITY)
ATTACKER: MRTA rebels (PERSON)
PLACE: Lima (LOCATION)
42. Issues with event annotation
Very difficult annotation task
At times it is difficult to decide the tense of an event
in direct speech
Whether to include demands, promises or threats in
the CONTACT (or use them only as a signal of
modality)
Whether to make a distinction between
speaker/hearer in CONTACT events (especially in
the case of demands, promises or threats)
43. What do coreferential events indicate?
(Hasler and Orasan 2009)
Starting point – do coreferential events have
coreferential arguments?
We had a corpus of about 12,000 words
annotated with event coreference
344 unique event mentions
106 coreferential chains with 2 to 10 triggers
238 events referred to by only one trigger
44. Zaire planes bombs rebels as U.N. seeks war’s end.
a293 TRIGGER: bombs
ATTACKER: –
MEANS: Zaire planes: ID=0: CHAIN=0: VEHICLE
PLACE: –
TARGET: rebels: ID=1: CHAIN=1: PERSON
TIME: –
Zaire said on Monday its warplanes were bombing three key rebel-held towns in its eastern
border provinces and that the raids would increase in intensity.
a333 TRIGGER: bombing
ATTACKER: Zaire: ID=44: CHAIN=5: ORGANISATION
MEANS: its warplanes: ID=46: CHAIN=46: VEHICLE
PLACE: three key rebel-held towns in its eastern border provinces: ID=48:
CHAIN=14: LOCATION
TARGET: three key rebel-held towns in its eastern border provinces: ID=48:
CHAIN=14: LOCATION
TIME: Monday: ID=45: CHAIN=7
“Since this morning the FAZ (Zaire army) has been bombing Bukavu, Shabunda and
Walikale”, said a defence ministry statement in the capital Kinshasa.
a334 TRIGGER: bombing
ATTACKER: the FAZ (Zaire army): ID=53: CHAIN=53: ORGANISATION
MEANS: –
PLACE: Bukavu, Shabunda and Walikale: ID=55: CHAIN=14: LOCATION
TARGET: Bukavu, Shabunda and Walikale: ID=55: CHAIN=14: LOCATION
TIME: this morning: ID=52: CHAIN=52
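The question of whether coreferential events have coreferential arguments can be sketched by comparing the CHAIN values of the slots of two event mentions. The sketch below encodes a333 and a334 from this slide; the class and function names are hypothetical, and treating equal CHAIN values as "coreferential" is an assumption about how the annotation is read, not the procedure used in the original study.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Argument:
    text: str
    mention_id: int
    chain: int  # NP coreference chain the mention belongs to


@dataclass
class EventMention:
    event_id: str
    trigger: str
    args: Dict[str, Optional[Argument]] = field(default_factory=dict)


def compare_arguments(e1, e2):
    """Label each slot as coreferential, non-coreferential or missing."""
    result = {}
    for slot in sorted(set(e1.args) | set(e2.args)):
        a, b = e1.args.get(slot), e2.args.get(slot)
        if a is None or b is None:
            result[slot] = "missing"
        elif a.chain == b.chain:
            result[slot] = "coreferential"
        else:
            result[slot] = "non-coreferential"
    return result


towns = "three key rebel-held towns in its eastern border provinces"
a333 = EventMention("a333", "bombing", {
    "ATTACKER": Argument("Zaire", 44, 5),
    "MEANS": Argument("its warplanes", 46, 46),
    "PLACE": Argument(towns, 48, 14),
    "TARGET": Argument(towns, 48, 14),
    "TIME": Argument("Monday", 45, 7),
})
a334 = EventMention("a334", "bombing", {
    "ATTACKER": Argument("the FAZ (Zaire army)", 53, 53),
    "MEANS": None,  # slot left empty in the annotation
    "PLACE": Argument("Bukavu, Shabunda and Walikale", 55, 14),
    "TARGET": Argument("Bukavu, Shabunda and Walikale", 55, 14),
    "TIME": Argument("this morning", 52, 52),
})

comparison = compare_arguments(a333, a334)
for slot, label in comparison.items():
    print(f"{slot}: {label}")
```

For this pair the PLACE and TARGET slots share a chain while ATTACKER and TIME do not, which is the kind of mixed case discussed on the following slide.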
45. Referential relations between
arguments
104 chains considered:
22 (21.15%) contained only coreferential NPs
23 (22.12%) contained only non-coreferential NPs
9 chains ignored
50 (48.07%) contain a mixture of coreferential and
non-coreferential NPs
If indirect anaphora is not annotated, 70% of
chains are affected
46. ID | TRIGGER | ARGUMENT: AGENT(S)
c389 | an emergency summit | the leaders of both nations: ID=20: CHAIN=20: PERS
c397 | the two-hour closed meeting | they: ID=24: CHAIN=20: PERS
c408 | the summit | Fujimori: ID=60: CHAIN=32: PERS; Hashimoto: ID=58: CHAIN=40: PERS
c409 | the summit | Fujimori: ID=60: CHAIN=32: PERS; Hashimoto: ID=58: CHAIN=40: PERS
c418 | the summit | rebels: ID=110: CHAIN=14: PERS
c432 | the summit | he: ID=170: CHAIN=40: PERS
47. Identity of sense
There are cases where even though the strings are
the same we do not have identity of reference: at
least nine people and nine confirmed dead
Hundred, at least 500 people, the first group of at
least 500 people, but probably more than that and
the 500
It can be argued that events of INJURE, DIE and
DEFEND with such parameters are not
coreferential, but the ATTACK events that cause
them are.
48. at least nine people were killed and up to 37 wounded
i343 TRIGGER: wounded
AGENT: the FAZ (Zaire army): ID=53: CHAIN=53: ORG
VICTIM: up to 37: ID=66: CHAIN=66: PERSON
CAUSE: –
PLACE: Bukavu: ID=70: CHAIN=17: LOCATION
TIME: Monday: ID=69: CHAIN=7
there are nine confirmed dead and 37 wounded
i346 TRIGGER: wounded
AGENT: –
VICTIM: 37 wounded: ID=86: CHAIN=78: PERSON
CAUSE: –
PLACE: –
TIME: –
49. Missing slots
Coreference between events can be established
even if many slots are not filled in:
Peru’s Fujimori says hostage talks still young.
...the President said talks to free them were still in their
preliminary phase.
”We cannot predict how many more weeks these
discussions will take.”
”We are still at a preliminary stage in the conversations.”
Fujimori said he hoped Nestor Cerpa would personally take
part in the talks when they resume on Monday at 11am.
50. Contact events
Involve 2 or more parties
The parties are usually introduced bit by bit
and event coreference is necessary to
establish all the participants
Cross-document event coreference is
sometimes necessary to collect all the
participants
51. Conclusions
The guidelines should not be used directly and the
characteristics of the texts should be considered
For automatic processing, a MUC-like annotation may provide a good
trade-off between the linguistic detail encoded and the difficulty of
annotation
However, quite often this annotation is not enough for more
advanced processing
Have a more refined notion of “identity”
Coreference is a scalar relation holding between two (or more)
linguistic expressions that refer to DEs considered to be at the
same granularity level relevant to the pragmatic purpose.
(Recasens, Hovy and Marti, forthcoming)
53. References
van Deemter, K. and Kibble, R. (1999). What is coreference and what
should coreference annotation be? In Amit Bagga, Breck Baldwin, and S. Shelton
(eds.), Proceedings of the ACL Workshop on Coreference and Its Applications.
Maryland.
Halliday, M. A. K. and Hasan, R. (1976). Cohesion in English. London: Longman.
Hasler, L. and Orasan, C. (2009). Do coreferential arguments make event mentions
coreferential? Proceedings of the 7th Discourse Anaphora and Anaphor Resolution
Colloquium (DAARC 2009), Goa, India, 5-6 November 2009, 151-163.
Hasler, L., Orasan, C. and Naumann, K. (2006). NPs for Events: Experiments in
coreference annotation. In Proceedings of the 5th Language Resources and
Evaluation Conference (LREC2006). Genoa, Italy, 24-26 May, 1167-1172.
Hirschman, L. (1997). MUC-7 coreference task definition. Version 3.0.
Mitkov, R. (2002). Anaphora Resolution. Longman.
Mitkov, R., Evans, R., Orasan, C., Barbu, C., Jones, L. and Sotirova, V. (2000).
Coreference and anaphora: developing annotating tools, annotated resources and
annotation strategies. Proceedings of the Discourse Anaphora and Anaphora
Resolution Colloquium (DAARC'2000), 49-58. Lancaster, UK.