This summarizes a document describing a system used in the Aspect Based Sentiment Analysis (ABSA) task of SemEval 2016. The system uses maximum entropy classifiers for the aspect category detection and sentiment polarity subtasks, and conditional random fields for opinion target extraction. It achieved state-of-the-art results in 9 constrained and 2 unconstrained experiments. The system description covers the features used, such as semantic features from word embeddings, and the constrained/unconstrained feature sets for the different subtasks and languages.
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015 – RIILP
The document discusses using syntactic preordering models to delimit the morphosyntactic search space for machine translation of morphologically rich languages. It explores preordering dependency trees of the source language to reduce word order variations and predicting morphological attributes on the source side to inform target language word selection. Experimental results show that non-local features and jointly learning which attributes to predict can improve translation performance over baselines. The work aims to combine preordering and morphology prediction to better exploit interactions between syntactic structure and inflectional properties.
A general method applicable to the search for anglicisms in Russian social network texts – Ilia Karpov
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition, spell-checking, tagging, and other natural language processing software, loan words are not easily recognized and should be evaluated separately. In this paper we present a corpus-based approach to the automatic detection of anglicisms in Russian social network texts. The proposed method is based on the idea of simultaneous script, phonetic, and semantic similarity between the original Latin word and its Cyrillic analogue. We used a set of transliteration, phonetic transcription, and morphological analysis methods to generate candidate hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good intersection with a manually collected dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
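A minimal sketch of the candidate-generation idea from the abstract: transliterate a Cyrillic token into Latin script and test the result against an English wordlist. The transliteration table and wordlist below are toy stand-ins, not the paper's resources; the full method additionally uses phonetic transcription, morphological analysis, and distributional filtering.

```python
# Toy Cyrillic-to-Latin transliteration table (partial, illustrative only).
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f", "х": "h", "ч": "ch", "ш": "sh",
}

ENGLISH_WORDS = {"post", "like", "look", "hype"}  # toy English wordlist

def to_latin(cyrillic_token: str) -> str:
    """Greedy character-by-character transliteration."""
    return "".join(TRANSLIT.get(ch, ch) for ch in cyrillic_token.lower())

def looks_like_anglicism(token: str) -> bool:
    """A token is a candidate anglicism if its transliteration is an English word."""
    return to_latin(token) in ENGLISH_WORDS

print(looks_like_anglicism("пост"))  # True: "пост" -> "post"
print(looks_like_anglicism("стол"))  # False: "стол" -> "stol"
```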
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming – Guy De Pauw
This document discusses learning Amharic verb morphology using inductive logic programming (ILP). Amharic verbs are complex, conveying information about subject, object, tense, aspect, mood and more through affixation, reduplication and compounding. The authors apply ILP to learn morphological rules from a training set of 216 Amharic verbs. They achieve 86.9% accuracy on a test set of 1,784 verb forms. Key challenges include a lack of similar examples in the training data and learning inappropriate alternation rules. This work contributes to advancing the automatic learning of morphology for under-resourced languages like Amharic.
Pattern Mining To Unknown Word Extraction (10... – Jason Yang
The document discusses using pattern mining and machine learning techniques to identify and extract unknown words from Chinese text. It begins with an introduction on Chinese word segmentation challenges and types of unknown words. It then discusses related work using particular and general methods. The main techniques described are using continuity pattern mining to derive unknown word detection rules (Phase 1), and applying machine learning classification and sequential learning for extraction (Phase 2). Experimental results show improvements over previous work, with an F1-score of 0.614 for unknown word extraction.
This document presents machine learning techniques to identify verb subcategorization frames from Czech language corpora. It compares three statistical techniques and shows they can discover new subcategorization frames and label dependents as arguments or adjuncts with 88% accuracy. It describes the task, relevant properties of Czech including free word order and case marking, and how the techniques are applied without assuming a predefined frame set.
Neural machine translation of rare words with subword units – Tae Hwan Jung
This paper proposes using subword units generated by byte-pair encoding (BPE) to address the open-vocabulary problem in neural machine translation. The paper finds that BPE segmentation outperforms a back-off dictionary baseline on two translation tasks, improving BLEU by up to 1.1 and CHRF by up to 1.3. BPE learns a joint encoding between source and target languages which increases consistency in segmentation compared to language-specific encodings, further improving translation of rare and unseen words.
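The core of BPE segmentation is a simple merge-learning loop: count adjacent symbol pairs over a character-level vocabulary and repeatedly merge the most frequent pair. A minimal sketch, using the toy vocabulary from the original BPE paper:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs over a vocabulary of space-separated symbols."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words as character sequences plus an end-of-word marker, with corpus frequencies.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):  # learn 10 merge operations
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print(best)  # the learned merges, most frequent pair first
```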
MORPHOLOGICAL SEGMENTATION WITH LSTM NEURAL NETWORKS FOR TIGRINYA – ijnlc
Morphological segmentation is a fundamental task in language processing. Some languages, such as Arabic and Tigrinya, have words packed with very rich morphological information. Therefore, unpacking this information becomes a necessary task for many downstream natural language processing tasks. This paper presents the first morphological segmentation research for Tigrinya. We constructed a new morphologically segmented corpus with 45,127 manually segmented tokens. Conditional random fields (CRF) and window-based long short-term memory (LSTM) neural networks were employed separately to develop our boundary detection models. We applied language-independent character and substring features for the CRF and character embeddings for the LSTM networks. Experiments were performed with four variants of the Begin-Inside-Outside (BIO) chunk annotation scheme. We achieved a 94.67% F1 score using bidirectional LSTMs with a fixed-size window approach to morpheme boundary detection.
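As a sketch of how segmentation is cast as chunking under a BIO-style scheme, each character is tagged B at a morpheme start and I inside a morpheme; the CRF or LSTM then predicts one tag per character. The segmented word below is a hypothetical, transliterated example for illustration only.

```python
def to_bio(morphemes):
    """Label each character B (morpheme-initial) or I (morpheme-internal).

    Boundary detection as chunking: the model predicts one tag per
    character; segmentation is recovered by splitting before each B.
    """
    labels = []
    for m in morphemes:
        labels.extend(["B"] + ["I"] * (len(m) - 1))
    return labels

# Hypothetical word split into prefix, stem, and suffix morphemes:
print(to_bio(["y", "sebr", "u"]))
# ['B', 'B', 'I', 'I', 'I', 'B']
```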
The document discusses using pattern mining techniques to detect and extract unknown words from Chinese text. It begins with an introduction of the Chinese word segmentation problem and types of unknown words. It then discusses related work on particular and general unknown word extraction methods. The document proposes applying continuity pattern mining to detect unknown words, and using sequential supervised learning and machine learning algorithms to extract unknown words based on natural language and statistical information. Experimental results show the approach achieves better performance than rule-based methods.
This document summarizes the history and development of neural models for natural language processing from the 1940s to present day. It describes early models like the perceptron and how recurrent neural networks enabled modeling of longer sequences. Modern contextualized models like BERT are able to incorporate broader context using attention and bidirectional processing. However, neural models still struggle with tasks requiring complex reasoning and world knowledge.
This document discusses several issues in part-of-speech (POS) tagging for Tamil language texts. It identifies challenges such as resolving lexical ambiguity, normalization of corpora, and analyzing compound words and multi-word expressions. It also discusses issues like determining the fineness versus coarseness of linguistic analysis, distinguishing syntactic function from lexical category, and whether to use new tags or tags from standard tagsets. Specific examples are provided to illustrate POS ambiguity depending on context, like the word "paṭi" which can be a noun, verb, adjective, adverb, or particle.
Elena Bruches - The Hybrid Approach to Part-of-Speech Disambiguation – AIST
The document describes a hybrid approach to part-of-speech disambiguation that combines neural networks and manually crafted rules. The algorithm uses neural networks to generate a set of possible part-of-speech tags for each word, and rule-based tagging to generate another set. The final set of tags is the intersection of these two sets, or their union if the intersection is empty. The approach achieved 96.11% precision on one corpus and 86.39% precision on another larger corpus.
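The combination rule described above is compact enough to state directly in code; a sketch, with tag names illustrative:

```python
def combine_tags(nn_tags: set, rule_tags: set) -> set:
    """Final tag set: intersection of the neural and rule-based candidate
    sets, falling back to their union when the intersection is empty."""
    return (nn_tags & rule_tags) or (nn_tags | rule_tags)

# A wordform ambiguous between noun and verb readings:
print(combine_tags({"NOUN", "VERB"}, {"NOUN"}))  # {'NOUN'}
print(combine_tags({"ADJ"}, {"ADV"}))            # {'ADJ', 'ADV'} (empty intersection)
```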
Representing and Reasoning with Modular Ontologies – Jie Bao
The document discusses representing and reasoning with modular ontologies. It introduces the need for modularity in large ontologies to enable reuse and selective knowledge hiding. It presents package-based description logics (P-DL) as a formalism for representing and reasoning with modular ontologies through package extension and importing. P-DL defines local interpretations and model projection to provide unambiguous semantics for modular ontologies while supporting both inter-module subsumption and role relations. Scope limitation modifiers and concealable reasoning are discussed to enable selective knowledge hiding across module boundaries without compromising soundness.
Natural Language Processing: Parts of speech tagging, its classes, and how to ... – Rajnish Raj
Part of speech (POS) tagging is the process of assigning a part of speech tag like noun, verb, adjective to each word in a sentence. It involves determining the most likely tag sequence given the probabilities of tags occurring before or after other tags, and words occurring with certain tags. POS tagging is the first step in many NLP applications and helps determine the grammatical role of words. It involves calculating bigram and lexical probabilities from annotated corpora to find the tag sequence with the highest joint probability.
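A minimal sketch of the joint probability being maximized, with toy transition (tag-after-tag) and lexical (word-given-tag) probabilities standing in for counts estimated from an annotated corpus:

```python
def sequence_probability(words, tags, trans, emit, start="<s>"):
    """Joint probability of a tag sequence under the bigram model:
    P(tags, words) = prod over i of P(tag_i | tag_{i-1}) * P(word_i | tag_i)."""
    p, prev = 1.0, start
    for w, t in zip(words, tags):
        p *= trans.get((prev, t), 0.0) * emit.get((t, w), 0.0)
        prev = t
    return p

# Illustrative probabilities; a real tagger estimates these from a corpus
# and searches over all tag sequences (e.g., with Viterbi) for the maximum.
trans = {("<s>", "DET"): 0.6, ("DET", "NOUN"): 0.8, ("NOUN", "VERB"): 0.5}
emit = {("DET", "the"): 0.7, ("NOUN", "dog"): 0.01, ("VERB", "barks"): 0.02}

print(sequence_probability(["the", "dog", "barks"], ["DET", "NOUN", "VERB"], trans, emit))
```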
This document discusses natural language processing (NLP) from a developer's perspective. It provides an overview of common NLP tasks like spam detection, machine translation, question answering, and summarization. It then discusses some of the challenges in NLP like ambiguity and new forms of written language. The document goes on to explain probabilistic models and language models that are used to complete sentences and rearrange phrases based on probabilities. It also covers text processing techniques like tokenization, regular expressions, and more. Finally, it discusses spelling correction techniques using noisy channel models and confusion matrices.
Python questions in PDF for data science interviews: a question bank on Python for practice. On Reddit and Sanfoundry you will get random questions, but here they are in order, and the difficult questions are explained clearly.
The document proposes adapting OWL as a more modular ontology language by addressing weaknesses in its current modularity. Specifically, OWL lacks:
1) Semantic modularity as it only supports global semantics between imported ontologies.
2) Syntactic modularity as imports can lead to tangled definitions between modules.
The paper suggests approaches to enhance OWL's modularity while maintaining backwards compatibility, such as giving imports a localized semantics or defining explicit syntactic rules to avoid nested definitions across modules.
Divide and Conquer Semantic Web with Modular Ontologies – Jie Bao
This document provides a brief review of modular ontology language formalisms. It discusses the need for modular ontologies to address issues with large, monolithic ontologies. Several approaches to modular ontologies are summarized, including Distributed Description Logics (DDL), E-Connections, and Package-based Description Logics (P-DL). Key challenges with modular ontologies are also outlined, such as reasoning across modules and ensuring interoperability while preserving local semantics.
A Semantic Importing Approach to Knowledge Reuse from Multiple Ontologies – Jie Bao
This document summarizes a research paper that presents a semantic importing approach called Package-based Description Logics (P-DL) for knowledge reuse from multiple ontologies. P-DL allows ontologies (called packages) to selectively reuse subsets of definitions from other packages while preserving context and unsatisfiability. It supports transitive knowledge reuse through compositionally consistent domain relations between packages. The paper outlines P-DL's syntax, semantics and modeling abilities and compares it to existing modular ontology languages.
Unsupervised Software-Specific Morphological Forms Inference from Informal Discussions – Chunyang Chen
The paper was accepted at ICSE'17 and TSE'19. https://se-thesaurus.appspot.com/ https://pypi.org/project/DomainThesaurus/ Informal discussions on social platforms (e.g., Stack Overflow) accumulate a large body of programming knowledge in natural language text. Natural language processing (NLP) techniques can be exploited to harvest this knowledge base for software engineering tasks. To make effective use of NLP techniques, a consistent vocabulary is essential. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms in informal discussions, such as abbreviations, synonyms and misspellings. Existing techniques to deal with such morphological forms are either designed for general English or predominantly rely on domain-specific lexical rules. A thesaurus of software-specific terms and commonly used morphological forms is desirable for normalizing software engineering text, but very difficult to build manually. In this work, we propose an automatic approach to build such a thesaurus. Our approach identifies software-specific terms by contrasting software-specific and general corpora, and infers morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations, and graph analysis of morphological relations. We evaluate the coverage and accuracy of the resulting thesaurus against community-curated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our thesaurus in a case study of normalizing questions from Stack Overflow and CodeProject.
A Distributed Tableau Algorithm for Package-based Description Logics – Jie Bao
The document describes a distributed tableau algorithm for reasoning with modular ontologies expressed in Package-based Description Logics (P-DL). The algorithm uses multiple local reasoners, each maintaining a local tableau for a single ontology module. Local reasoners communicate by querying each other or reporting clashes to collectively construct a global tableau without fully integrating the modules. The algorithm is proven sound and complete for P-DL with acyclic module importing. It can support reasoning across modules to answer queries.
Deep Dependency Graph Conversion in English – Jinho Choi
This paper presents a method for the automatic conversion of constituency trees into deep dependency graphs consisting of primary, secondary, and semantic relations. Our work is distinguished from previous work on the generation of shallow dependency trees in two ways: it generates dependency graphs incorporating deep structures, in which relations stay consistent regardless of their surface positions, and it derives relations between out-of-domain arguments (caused by syntactic variations such as open clauses, relative clauses, or coordination) and their predicates, so that complete argument structures are represented for both verbal and non-verbal predicates. Our deep dependency graph conversion recovers important argument relations that would be missed by dependency tree conversion, and merges syntactic and semantic relations into one unified representation, which can reduce the burden of developing another layer of annotation dedicated to predicate argument structures. Our graph conversion method is applied to six corpora in English and generated over 4.6M dependency graphs covering 20 different genres.
Phrase tagset mapping for French and English treebanks and its application... – Lifeng (Aaron) Han
The document presents a method for unsupervised machine translation evaluation using universal phrase tags. It designs a mapping between phrase tags from different treebanks to 9 universal tags. An unsupervised metric called HPPR is introduced to measure similarity between the universal phrase sequences of the source and translated sentences. Experiments on French-English data show HPPR achieves promising correlations with human judgments without using reference translations.
This document summarizes an experiment comparing different character-level embedding approaches for Korean sentence classification tasks. Dense character-level embeddings using pre-trained fastText vectors outperformed sparse one-hot encodings. Character-level embeddings preserved local semantics around character boundaries better than Jamo-level encodings, which performed best with self-attention. While Jamo-level features may be useful for syntax-semantic tasks, character-level approaches had better performance and computation efficiency. These findings provide insights for character-rich languages beyond Korean.
This document discusses natural language processing and language models. It begins by explaining that natural language processing aims to give computers the ability to process human language in order to perform tasks like dialogue systems, machine translation, and question answering. It then discusses how language models assign probabilities to strings of text to determine if they are valid sentences. Specifically, it covers n-gram models which use the previous n words to predict the next, and how smoothing techniques are used to handle uncommon words. The document provides an overview of key concepts in natural language processing and language modeling.
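A minimal sketch of the bigram model with add-one (Laplace) smoothing described above; the toy corpus and the choice of smoothing are illustrative:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Collect bigram and context counts for an add-one-smoothed bigram model."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])            # contexts exclude the final </s>
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(set(unigrams))
    def prob(prev, word):
        # Laplace smoothing keeps unseen bigrams from zeroing out a sentence.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob

prob = train_bigram_lm(["the cat sat", "the dog sat"])
print(prob("the", "cat"))  # seen bigram
print(prob("cat", "dog"))  # unseen, but nonzero thanks to smoothing
```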
In this paper, a novel hierarchical Persian stemming approach based on the part-of-speech (POS) of the word in a sentence is presented. The implemented stemmer includes hash tables and several deterministic finite automata (DFA) at different levels of its hierarchy for removing the prefixes and suffixes of words. We had two intentions in using hash tables in our method. The first is that the DFAs do not support some special words, so a hash table can partly solve that problem. The second is to speed up the implemented stemmer by omitting the time the DFAs need. Because of its hierarchical organization, this method is fast and flexible. Our experiments on test sets from the Hamshahri Collection and security news from the ICTna.ir site show that our method has an average accuracy of 95.37%, which improves further when the method is applied to a test set with common topics.
This document summarizes a doctoral oral presentation on the development of an automatic diacritic restoration system for Standard Yoruba digital text. The presentation outlines the background and problem statement, objectives, methodology, and results of the research. Key points include:
1) The research aims to develop a system to automatically restore diacritics (tone marks and dots) to digitized Yoruba text where they are missing or partially present.
2) The methodology involves collecting Yoruba texts, analyzing features critical for restoration, formulating computational models using machine learning techniques, and implementing and evaluating a prototype system.
3) Results show that character-based restoration of dots using memory-based learning achieved over
2010 PACLIC - pay attention to categories – WarNik Chow
This document summarizes a research paper on a proposed method called Metadata Projection Matrix (MPM) for sentence modeling that allows controlling attention to certain syntactic categories. The method uses a projection matrix to incorporate syntactic category information when calculating attention weights. Experimental results on several datasets show MPM outperforms baselines on tasks where attention to specific categories is important, like detecting terms or irony, but is weaker on more context-dependent tasks. The method is best suited to applications where syntactic structure significantly informs predictions.
The document introduces a new word analogy corpus for the Czech language to evaluate word embedding models. It contains over 22,000 semantic and syntactic analogy questions across various categories. The authors experiment with Word2Vec (CBOW and Skip-gram) and GloVe models on the Czech Wikipedia corpus. Their results show that CBOW generally performs best, with accuracy improving as vector dimension and training epochs increase. Performance is better on syntactic tasks than semantic ones. The new Czech analogy corpus allows further exploration of how word embeddings represent semantics and syntax for highly inflected languages.
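Analogy questions of this kind are typically scored with the vector-offset (3CosAdd) method: for a question a : b :: c : ?, return the vocabulary word closest to b - a + c. A sketch with toy 2-d vectors; a real evaluation would use the trained CBOW, Skip-gram, or GloVe vectors:

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Return the word maximizing cosine similarity to b - a + c,
    excluding the three query words themselves (standard 3CosAdd)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy 2-d vectors for a Czech analogy (muž : žena :: král : ?):
vectors = {"muž": np.array([1.0, 0.0]), "žena": np.array([1.0, 1.0]),
           "král": np.array([2.0, 0.0]), "královna": np.array([2.0, 1.0])}
print(solve_analogy("muž", "žena", "král", vectors))  # 'královna'
```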
The group chose the song "Sweet Talker" by Jessie J for their music video project. They analyzed the lyrics and decided the post-chorus, which repeats "So talk to me", would feature a repeated scene. Though the song has a slow rhythm, it has an energetic party vibe. The group was inspired by music videos from artists like Beyonce, Rihanna, Eve, and Jennifer Lopez. They plan to film in various London locations like shopping centers, roads by the river, and their school TV studio. One main actress and extras will be cast, dressed casually or smart casually. The only ethical issue may be filming with a drone in some restricted London areas. The target audience is females aged
The document summarizes the evolution and outlook of the Peruvian economy. It notes that Peru is growing faster than Colombia and Chile. Although Peru's GDP per capita is among the lowest in Latin America, its unemployment rate is the lowest in the region. Peru also has the lowest country risk and rose two places in the global competitiveness index.
The magazine proposal is for a magazine called "Unknown" focused on the R&B music genre and targeting males and females aged 15-24. The publisher chosen is Prometheus Global Media due to their experience publishing music magazines. The magazine would have a pink, grey, and white color scheme. Content would include profiles of popular R&B artists like Rihanna and Kendrick Lamar as well as lesser known new artists. In addition to music content, the magazine aims to provide advice for readers navigating life challenges. Layout elements like the contents page and double page spreads would be modeled after successful music magazines.
This document describes research on unsupervised methods to improve aspect-based sentiment analysis in Czech. It presents experiments using labeled and unlabeled corpora for Czech and English to train models incorporating distributional semantics features. The models achieve new state-of-the-art results for aspect-based sentiment analysis in Czech. Additionally, the researchers created new labeled and unlabeled corpora for Czech to support this task.
Advisory Distributors - Meeting of Minds - Findings – Hannah Jackson
Following the Meeting of Minds Advisory Distributors held back in June, we are now publishing the outputs of the roundtable discussions to give you a sense of CEO concerns as we head into the autumn.
The document describes a system for semantic textual similarity (STS) that uses various techniques to estimate the semantic similarity between texts. The system combines lexical, syntactic, and semantic information sources using state-of-the-art algorithms. In SemEval 2016 tasks, the system achieved a mean Pearson correlation of 75.7% on the monolingual English task and 86.3% on the cross-lingual Spanish-English task, ranking first in the cross-lingual task. The system utilizes techniques such as word embeddings, paragraph vectors, tree-structured LSTMs, and word alignment to capture semantic similarity.
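The simplest of the lexical-semantic components such systems combine is cosine similarity between averaged word embeddings; a minimal sketch of that baseline (the full system described above adds word alignment, paragraph vectors, and tree-structured LSTMs on top):

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim=300):
    """Average the vectors of in-vocabulary tokens; a common STS baseline component."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    """Cosine similarity, guarding against zero vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# similarity = cosine(sentence_vector(s1_tokens, emb), sentence_vector(s2_tokens, emb))
```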
This document discusses cultural shock and reverse cultural shock experienced by international students. It defines cultural shock as emotional reactions to losing familiar cultural reinforcements and experiencing new cultural stimuli with little meaning. Reverse cultural shock refers to difficulties re-adjusting to one's home culture after living abroad. The document outlines four common behavioral patterns when facing cultural shock: trying to replicate home, idealizing the host country, open-mindedly experiencing differences, and withdrawing. It argues the most productive experience involves effectively coping with stresses to synthesize both cultures. Coping strategies for cultural and reverse cultural shock include understanding challenges, accessing social support, and maintaining an open mindset.
The document discusses plans for a music video to Jessie J's song "Sweet Talker." It will be in the genres of pop and R&B and feature a self-assured female protagonist acting out scenes of potential romance between two lovers, with choreography and lip syncing. The target audience is 16-27 year olds. Potential locations include a car park, theatre, and dance studio. Casting calls for one main female performer, backup dancers, and a male actor. Props include a microphone and handbags. Costumes range from casual to sophisticated wear. Inspirations include Miley Cyrus' "Party in the USA" and Beyoncé's "Diva."
Using Unicorn to synchronize Sitecore databases and resolve item deployment problems. Presentation by Radoslaw Kozlowski, Sitecore & .NET developer at Coders Center
The document discusses Sitecore Habitat, an open source framework for Sitecore projects. It is focused on increasing productivity and quality in Sitecore projects. Some key points about Sitecore Habitat include:
- It helps keep projects clean and is based on a modular architecture.
- It is free, open source, and recommended by Sitecore.
- It is based on three main rules: simplicity, flexibility, and extensibility.
- Useful links are provided for the GitHub repository, tutorials, and a demo page.
This document discusses using Gulp for automation tasks in Sitecore projects. It provides an overview of Gulp and how it compares to Grunt. Gulp can be used to automate publishing processes and compilations for Habitat, a modular Sitecore solution. Specific Gulp plugins like gulp-msbuild and gulp-watch are useful for ASP.NET development. The document demonstrates how Gulp works with an example.
The document summarizes the international and Peruvian economic situation. Internationally, South America is the only region with negative growth, while Asia shows the strongest growth. The Peruvian economy posted GDP growth of 3.9% in 2015, low inflation compared with other countries in the region, and private investment that has declined in recent years. For the coming periods, the exchange rate is projected to rise, and Peru is expected to maintain the second-best
Sitecore xDB - Architecture and Configuration – CodersCenter
Presentation about Sitecore xDB by Tomasz Juranek, Sitecore Developer/Architect at Coders Center. Certified Sitecore Developer since 2012; for the last 5 years he has worked on several Sitecore projects for big brands around Europe.
This document promotes LR-brand products for the Christmas holidays, including perfumes, scented candles, jewelry, and watches. It encourages readers to celebrate Christmas with their loved ones and offers gifts for men and women. It also announces that for every product sold bearing a special symbol, LR will donate 1 euro to a children's charity fund.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES – kevig
Distributed language representation has become the most widely used technique for language representation in various natural language processing tasks. Most natural language processing models based on deep learning techniques use already pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models. However, selecting the appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches for creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods. Their performance on capturing word similarities is analysed against existing benchmark datasets of word-pair similarities. The research in this paper conducts a correlation analysis between ground-truth word similarities and similarities obtained by different word embedding methods.
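The correlation analysis described here is typically run as follows: compute the model's cosine similarity for each benchmark word pair, then rank-correlate those scores with the human judgments. A sketch using Spearman correlation from SciPy:

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(pairs, human_scores, vectors):
    """Correlate model cosine similarities with human judgments on a
    word-pair benchmark (the intrinsic evaluation described above)."""
    model_scores = []
    for w1, w2 in pairs:
        u, v = vectors[w1], vectors[w2]
        model_scores.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    rho, _ = spearmanr(human_scores, model_scores)
    return rho

# rho = evaluate_similarity(benchmark_pairs, benchmark_scores, embedding_dict)
```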
The document discusses analyzing sentiment and classification using neural network approaches. It begins by introducing the concepts of machine learning models, training and evaluation data, and model training and evaluation. It then discusses applications of sentiment analysis and classification to movie reviews, including describing commonly used datasets and evaluation metrics. Finally, it outlines different neural network architectures that can be used for sentiment analysis and classification tasks, including convolutional and recurrent neural networks.
The document discusses two paradigms for natural language processing: knowledge engineering and machine learning. It provides examples of how each approach handles tasks like parsing, translation, and question formation. While knowledge engineering relies on hand-coded rules and representations, machine learning trains statistical models on large datasets. The document also notes Microsoft's interests in using NLP for applications like search and summarization.
[Paper Reading] Supervised Learning of Universal Sentence Representations from Natural Language Inference Data – Hiroki Shimanaka
This document summarizes the paper "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data". It discusses how the researchers trained sentence embeddings using supervised data from the Stanford Natural Language Inference dataset. They tested several sentence encoder architectures and found that a BiLSTM network with max pooling produced the best performing universal sentence representations, outperforming prior unsupervised methods on 12 transfer tasks. The sentence representations learned from the natural language inference data consistently achieved state-of-the-art performance across multiple downstream tasks.
Grammarly AI-NLP Club #6 - Sequence Tagging using Neural Networks - Artem Chernodub – Grammarly
Speaker: Artem Chernodub, Chief Scientist at Clikque Technology and Associate Professor at Ukrainian Catholic University
Summary: Sequence Tagging is an important NLP problem that has several applications, including Named Entity Recognition, Part-of-Speech Tagging, and Argument Component Detection. In our talk, we will focus on a BiLSTM+CNN+CRF model — one of the most popular and efficient neural network-based models for tagging. We will discuss task decomposition for this model, explore the internal design of its components, and provide the ablation study for them on the well-known NER 2003 shared task dataset.
This document discusses n-gram language models. It provides an introduction to language models and their role in applications like speech recognition. Simple n-gram models are described that estimate word probabilities based on prior context. Parameter estimation and smoothing techniques are covered to address data sparsity issues from rare word combinations. Evaluation of language models on held-out test data is also mentioned.
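Evaluation on held-out data is usually reported as perplexity, the inverse probability of the test set normalized by token count. A sketch, assuming a smoothed conditional probability function `prob(prev, word)` such as the add-one-smoothed bigram model sketched earlier in this listing:

```python
import math

def perplexity(test_sentences, prob):
    """Perplexity of a bigram model on held-out data:
    exp(-(1/N) * sum of log P(w_i | w_{i-1})) over all N test bigrams."""
    log_prob, n_tokens = 0.0, 0
    for s in test_sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log(prob(prev, word))  # smoothing keeps this finite
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# ppl = perplexity(["the cat sat"], prob)  # lower is better on held-out text
```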
Development of Lexicons Generation Tools for Arabic: Case of an Open Source Conjugator – ijnlc
Dictionary resources are very important for Natural Language Processing (NLP). Generating high-quality dictionary resources is a crucial step for the success and effectiveness of NLP applications. Linguistic information in a lexical database is complex, large, and varied (i.e., phonological, morphological, syntactic, semantic and pragmatic). Among such lexical database entries, we find conjugated verbs. To this end, we present in this paper the open source mobile application of our conjugator, which we developed on the Java platform under Android. This conjugator allowed us to generate a lexicon of more than 18,667 conjugated verbs. This lexicon will be used to generate textual words. The resulting lexicon can be used in various applications such as morphological analysis (the lexical approach), text indexing, etc.
Beyond Word2Vec: Embedding Words and Phrases in Same Vector Space – Vijay Prakash Dwivedi
Word embeddings are commonly used in NLP tasks but embedding phrases while maintaining semantic meaning has been challenging. The authors present a novel method using Siamese neural networks to embed words and multi-word units in the same vector space. The model learns to generate phrase representations based on their semantic similarity to single words. It is trained on a dataset to predict similarity between words and phrases and outperforms previous models on phrase similarity and composition tasks.
Chunker Based Sentiment Analysis and Tense Classification for Nepali Text – kevig
The article presents Sentiment Analysis (SA) and Tense Classification using a skip-gram model for word-to-vector encoding of the Nepali language. The SA experiment on positive-negative classification is carried out in two ways. In the first experiment, the vector representation of each sentence is generated by the skip-gram model followed by Multi-Layer Perceptron (MLP) classification; an F1 score of 0.6486 is achieved for positive-negative classification, with an overall accuracy of 68%. In the second experiment, verb chunks are extracted using a Nepali parser and the same experiment is carried out on the verb chunks; an F1 score of 0.6779 is observed for positive-negative classification, with an overall accuracy of 85%. Hence, chunker-based sentiment analysis proves better than sentence-based sentiment analysis. The paper also proposes using the skip-gram model to identify the tenses of Nepali sentences and verbs. In the third experiment, the vector representation of each verb chunk is generated by the skip-gram model followed by MLP classification, and the verb chunks yield a very low overall accuracy of 53%. The fourth experiment, conducted for tense classification using whole sentences, improves this to an overall accuracy of 89%; past tenses are identified and classified more accurately than other tenses. Hence, sentence-based tense classification proves better than verb-chunk-based tense classification.
This document summarizes Jessica Hullman's project modeling word sense disambiguation using support vector machines. She used a dataset from Senseval-2 and achieved an average accuracy of 87% at assigning word senses, with a standard deviation of 15% and median of 92%. The project involved modifying an existing implementation that used part-of-speech tags of neighboring words to classify word senses, training support vector machine classifiers on the Senseval-2 data.
The document describes the Columbia-GWU system submitted to the 2016 TAC KBP BeSt Evaluation. It discusses several approaches used for different languages and genres, including:
1) A sentiment system based on identifying the target only, adapted for English, Chinese, and Spanish.
2) An English sentiment system based on relation extraction, treating sentiment as a relation between source and target.
3) English and Chinese belief systems that combine high-precision word tagging with a high-recall default system.
4) A Spanish belief system based on weighted random choice of tags.
The document provides details on the data, approaches, and results for each language-specific system.
Word2vec on the italian language: first experimentsVincenzo Lomonaco
The word2vec model and its applications by Mikolov et al. have attracted a great deal of attention in recent years. The vector representations of words learned by word2vec models have been shown to carry semantic meaning and are useful in various NLP tasks. In this work I try to reproduce the previously obtained results for English and to explore the possibility of doing the same for Italian.
ACL-WMT2013.A Description of Tunable Machine Translation Evaluation Systems i...Lifeng (Aaron) Han
The document describes two machine translation evaluation systems, nLEPOR_baseline and LEPOR_v3.1, that were submitted to the WMT13 Metrics Task. nLEPOR_baseline is an n-gram based metric that considers modified sentence length penalty, position difference penalty, and n-gram precision and recall. LEPOR_v3.1 is an enhanced version that uses a harmonic mean to combine factors and includes part-of-speech information. Evaluation results showed LEPOR_v3.1 had the highest correlation of 0.86 with human judgments for English to other language pairs.
Continuous bag of words cbow word2vec word embedding work .pdfdevangmittal4
The continuous bag-of-words (CBOW) word2vec embedding works by predicting the
probability of a word given a context. A context may be a single word or a group of words; for
simplicity, I will take a single context word and try to predict a single target word.
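Below is a minimal numpy sketch of this single-context-word CBOW setup; the corpus, dimensions, and learning rate are illustrative:

import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, lr = len(vocab), 10, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input embeddings
W_out = rng.normal(scale=0.1, size=(D, V))  # output (context -> target) weights

# (context, target) pairs: each word predicts its right neighbour.
pairs = [(idx[corpus[i]], idx[corpus[i + 1]]) for i in range(len(corpus) - 1)]

for _ in range(200):
    for c, t in pairs:
        h = W_in[c]                      # hidden layer = context embedding
        scores = h @ W_out
        p = np.exp(scores - scores.max())
        p /= p.sum()                     # softmax over the vocabulary
        p[t] -= 1.0                      # gradient of cross-entropy wrt scores
        grad_h = W_out @ p
        W_out -= lr * np.outer(h, p)
        W_in[c] -= lr * grad_h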
The purpose of this question is to create a word embedding for the given data set.
Data set text:
In linguistics word embeddings were discussed in the research area of distributional semantics. It
aims to quantify and categorize semantic similarities between linguistic items based on their
distributional properties in large samples of language data. The underlying idea that "a word is
characterized by the company it keeps" was popularized by Firth.
The technique of representing words as vectors has roots in the 1960s with the development of
the vector space model for information retrieval. Reducing the number of dimensions using
singular value decomposition then led to the introduction of latent semantic analysis in the late
1980s. In 2000, Bengio et al. provided in a series of papers the "Neural probabilistic language
models" to reduce the high dimensionality of words representations in contexts by "learning a
distributed representation for words". (Bengio et al, 2003). Word embeddings come in two different
styles, one in which words are expressed as vectors of co-occurring words, and another in which
words are expressed as vectors of linguistic contexts in which the words occur; these different
styles are studied in (Lavelli et al., 2004). Roweis and Saul published in Science how to use
"locally linear embedding" (LLE) to discover representations of high dimensional data structures.
The area developed gradually and really took off after 2010, partly because important advances
had been made since then on the quality of vectors and the training speed of the model.
There are many branches and many research groups working on word embeddings. In 2013, a
team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit which can train
vector space models faster than the previous approaches. Most new word embedding techniques
rely on a neural network architecture instead of more traditional n-gram models and unsupervised
learning.
Limitations
One of the main limitations of word embeddings (word vector space models in general) is that
possible meanings of a word are conflated into a single representation (a single vector in the
semantic space). Sense embeddings are a solution to this problem: individual meanings of words
are represented as distinct vectors in the space.
For biological sequences: BioVectors
Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for
bioinformatics applications have been proposed by Asgari and Mofrad. Named bio-vectors
(BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins
(amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation
can be widely used in applications of machine learning in proteomics and genomics.
This document summarizes a paper on using simple lexical overlap features with support vector machines (SVMs) for Russian paraphrase identification. It introduces paraphrase identification and various paraphrase corpora. It then describes a knowledge-lean approach using only tokenization, lowercasing, and overlap features like union and intersection size as inputs to linear and RBF kernel SVMs. The method achieves competitive results on English, Turkish, and Russian paraphrase identification tasks.
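A minimal sketch of such a knowledge-lean setup, assuming scikit-learn; the sentence pairs and labels are invented toy data, not one of the corpora above:

from sklearn.svm import SVC

def overlap_features(s1, s2):
    # Tokenize, lowercase, and compute union/intersection overlap features.
    a, b = set(s1.lower().split()), set(s2.lower().split())
    inter, union = len(a & b), len(a | b)
    return [inter, union, inter / union if union else 0.0]

pairs = [("the cat sat", "a cat sat down"), ("he left early", "stocks fell")]
y = [1, 0]  # 1 = paraphrase
X = [overlap_features(a, b) for a, b in pairs]

clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict([overlap_features("the cat sat down", "a cat sat")]))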
Our project is about guessing the correct missing
word in a given sentence. To guess the missing word
we have two main methods: statistical language
modeling and neural language models.
Statistical language modeling depends on the frequency of the
relations between words, and here we use a Markov chain.
Neural language models use artificial neural networks trained
with deep learning; here we use BERT, the state-of-the-art
language model provided by Google.
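Hedged sketches of both routes, with toy data for the Markov chain and the Hugging Face transformers pipeline (assumed installed) for BERT:

from collections import Counter, defaultdict

# 1) Statistical: a bigram Markov chain picks the most frequent follower.
corpus = "the cat sat on the mat the cat ran".split()
follows = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    follows[w1][w2] += 1
print(follows["the"].most_common(1))  # best guess after "the"

# 2) Neural: BERT's masked-language-model head scores candidates directly.
from transformers import pipeline
fill = pipeline("fill-mask", model="bert-base-uncased")
for cand in fill("The cat sat on the [MASK].", top_k=3):
    print(cand["token_str"], round(cand["score"], 3))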
This document describes NAIST's system for the HOO 2012 Shared Task on grammatical error correction. It discusses the system's configuration for correcting spelling errors, preposition errors, and determiner errors. For spelling correction, it uses a spelling checker and language model ranking. For prepositions, it trains a maximum entropy model on two corpora to detect and correct errors. For determiners, it checks noun phrases and trains two parser models to correct errors. It analyzes the results of different system runs and configurations. The system achieved preliminary F-scores of 52.4%, 72.2%, and 60.7% for spelling correction, and aimed to improve correction of existing words, use richer verb knowledge, and add more determiner corrections.
UWB at SemEval-2016 Task 5: Aspect Based Sentiment Analysis

Tomáš Hercig†‡, Tomáš Brychcín‡, Lukáš Svoboda†‡ and Michal Konkol†‡

† Department of Computer Science and Engineering, Faculty of Applied Sciences,
University of West Bohemia, Univerzitní 8, 306 14 Plzeň, Czech Republic
‡ NTIS – New Technologies for the Information Society, Faculty of Applied Sciences,
University of West Bohemia, Technická 8, 306 14 Plzeň, Czech Republic

tigi@kiv.zcu.cz brychcin@kiv.zcu.cz svobikl@kiv.zcu.cz konkol@kiv.zcu.cz
Abstract
This paper describes our system used in the Aspect Based Sentiment Analysis (ABSA) task of SemEval 2016. Our system uses the Maximum Entropy classifier for the aspect category detection and for the sentiment polarity task. Conditional Random Fields (CRF) are used for opinion target extraction. We achieve state-of-the-art results in 9 experiments among the constrained systems and in 2 experiments among the unconstrained systems.
1 Introduction
The goal of Aspect Based Sentiment Analysis (ABSA) is to identify the aspects of a given target entity and estimate the sentiment polarity for each mentioned aspect. In recent years aspect-based sentiment analysis has undergone rapid development, mainly because of competitive tasks such as SemEval 2014 (Pontiki et al., 2014) and SemEval 2015 (Pontiki et al., 2015). The highest ranking participants are, for SemEval 2014, (Kiritchenko et al., 2014; Brun et al., 2014; Castellucci et al., 2014; Toh and Wang, 2014; Wagner et al., 2014; Brychcín et al., 2014) and, for SemEval 2015, (Saias, 2015; Toh and Su, 2015a; San Vicente et al., 2015; Zhang and Lan, 2015).

The current ABSA task, SemEval 2016 Task 5 (Pontiki et al., 2016), has attracted 29 participating teams competing in 40 different experiments among 8 languages. The task has three subtasks: Sentence-level (SB1), Text-level (SB2) and Out-of-domain ABSA (SB3). The subtasks are further divided into three slots:

• 1) Aspect Category Detection – the category consists of an entity and attribute (E#A) pair.
• 2) Opinion Target Expression (OTE)
• 3) Sentiment Polarity (positive, negative, neutral, and for SB2 also conflict)

In phase A we solved slots 1 and 2. In phase B we were given the results for slots 1 and 2 and solved slot 3. We participate in 19 experiments including Chinese, English, French, and Spanish.
2 System Description
Our approach to the ABSA task is based on supervised machine learning. A detailed description of each experiment can be found in Section 2.2 and Section 2.3.
2.1 Features
Our system combines a large number of features to achieve competitive results. In this section we will describe the features in detail.
2.1.1 Semantics Features
We use semantics models to derive word clusters from unlabeled datasets. Similarly to (Toh and Su, 2015b) we use the Amazon product reviews from (Blitzer et al., 2007) and the user reviews from the Yelp Phoenix Academic Dataset (https://www.yelp.com/dataset_challenge) to create semantic word clusters. Additionally, we use a review Opentable dataset (downloaded from http://opentable.com). In this paper we consider the following semantics models.
          Dimension   Window     Iterations
GloVe     300         10         100
CBOW      300         10         100
LDA       –           sentence   1000

Table 1: Model settings.
Global Vectors (GloVe) (Pennington et al., 2014) use the ratios of the word–word co-occurrence probabilities to encode the meanings of words.

Continuous Bag-of-Words (CBOW) (Mikolov et al., 2013) is a model based on the Neural Network Language Model (NNLM) that tries to predict the current word using a small context window around the word.

Latent Dirichlet Allocation (LDA) (Blei et al., 2003) discovers the hidden topics in text. We experiment with 50, 100, 200, 300, 400, and 500 topics.

The settings of the GloVe and CBOW models reflect the results of these methods in their original publications (Pennington et al., 2014; Mikolov et al., 2013). The detailed settings of all these methods are shown in Table 1.
We used the GloVe implementation provided on the official website (http://nlp.stanford.edu/projects/glove), the CBOW model uses the Word2Vec implementation (https://code.google.com/p/word2vec), and the LDA implementation comes from the MALLET (McCallum, 2002) software package.
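As a hedged sketch of training with the Table 1 settings, gensim can stand in for the Word2Vec and MALLET tools named above (gensim has no GloVe trainer, so only CBOW and LDA are shown; the corpus is a toy):

from gensim.models import Word2Vec, LdaModel
from gensim.corpora import Dictionary

sentences = [["good", "food"], ["bad", "service"], ["good", "service"]]

# CBOW: 300 dimensions, window 10, 100 iterations (sg=0 selects CBOW).
cbow = Word2Vec(sentences, vector_size=300, window=10, epochs=100, sg=0,
                min_count=1)

# LDA: sentence-sized "documents", 1000 iterations, e.g. 50 topics.
dictionary = Dictionary(sentences)
bow = [dictionary.doc2bow(s) for s in sentences]
lda = LdaModel(bow, id2word=dictionary, num_topics=50, iterations=1000)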
The CLUTO software package (Karypis, 2003) is used for word clustering with the k-means algorithm and the cosine similarity metric. All vector space models in this paper cluster the word vectors into four different numbers of clusters: 100, 500, 1000, and 5000.

The following features are based on the word clusters created using the semantic models.

Clusters (C) – The occurrence of a cluster at a given position.
Bag of Clusters (BoC) – The occurrence of a cluster in the context window.
Cluster Bigrams (CB) – The occurrence of a cluster bigram at a given position.
Bag of Cluster Bigrams (BoCB) – The occurrence of a cluster bigram in the context window.
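A sketch of the clustering step, under the common approximation that k-means over L2-normalized vectors behaves like k-means with cosine similarity; scikit-learn stands in for CLUTO and the vectors are random stand-ins:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
vectors = rng.normal(size=(2000, 50))       # stand-in word vectors
unit = normalize(vectors)                   # cosine ~ Euclidean on unit sphere
clusters = KMeans(n_clusters=100, n_init=4).fit_predict(unit)
word_to_cluster = dict(enumerate(clusters)) # vocabulary index -> cluster id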
2.1.2 Constrained Features
Affixes (A) – Affix (length 2–4 characters) of a word at a given position with a frequency > 5.
Aspect Category (AC) – Extracted aspect category. We use separately the entity, the attribute, and the E#A pair.
Aspect Target (AT) – Listed aspect target.
Bag of Words (BoW) – The occurrence of a word in the context window.
Bag of Words filtered by POS (BoW-POS) – The occurrence of a word in the context window filtered by POS tags.
Bag of Bigrams (BoB) – The occurrence of a bigram in the context window.
Bag of Words around Verb (5V) – Bag of 5 words before the verb and a bag of 5 words after the verb.
Bag of 5 Words at the Beginning of Sentence (5sS) – Bag of 5 words at the beginning of a sentence.
Bag of 5 Words at the End of Sentence (5eS) – Bag of 5 words at the end of a sentence.
Bag of Head Words (BoHW) – Bag of extracted head words from the sentence parse tree.
Emoticons (E) – We used a list of positive and negative emoticons (Montejo-Ráez et al., 2012). The feature captures the presence of an emoticon within the text.
Head Word (HW) – Extracted head word from the sentence parse tree.
Character N-gram (ChN) – The occurrence of a character n-gram at a given position.
Learned Target Dictionary (LTD) – Presence of a word from a dictionary of aspect terms learned from the training data.
Learned Target Dictionary by Category (LTD-C) – Presence of a word from the learned dictionary of aspect terms grouped by category.
N-gram (N) – The occurrence of an n-gram in the context window.
N-gram Shape (NSh) – The occurrence of a word shape n-gram in the context window. We consider unigrams with frequency > 5 and bigrams and trigrams with frequency > 20.
Paragraph Vectors (P2Vec) – An unsupervised method of learning text representation (Le and Mikolov, 2014). The resulting feature vector has a fixed dimension while the input text can be of any length. The model is trained on the One Billion Word Benchmark presented in (Chelba et al., 2013); the resulting vectors (dimension set to 300) are used as features for a sentence. We use the implementation by Řehůřek and Sojka (2010).
POS N-gram (POS-N) – The occurrence of a POS n-gram in the context window.
Punctuation (P) – The occurrence of a question mark, an exclamation mark, or at least two dots in the context window.
Skip-bigram (SkB) – Instead of using sequences of adjacent words (n-grams) we used skip-grams (Guthrie et al., 2006; Reyes et al., 2013), which skip over arbitrary gaps. We consider skip-bigrams with 2 to 5 word skips and remove skip-grams with a frequency ≤ 5 (see the sketch after this list).
Target Bag of Words (T-BoW) – BoW containing the parent, siblings, and children of the target from the sentence parse tree.
TF-IDF (TF-IDF) – Term frequency–inverse document frequency of a word computed from the training data.
Verb Bag of Tags (V-BoT) – Bag of syntactic dependency tags of the parent, siblings, and children of the verb from the sentence parse tree.
Verb Bag of Words (V-BoW) – Bag of words for the parent, siblings, and children of the verb from the sentence parse tree.
Word Shape (WSh) – We assign words into one of 24 classes (using edu.stanford.nlp.process.WordShapeClassifier with the WORDSHAPECHRIS1 setting), similar to the function specified in (Bikel et al., 1997).
Words (W) – The occurrence of a word at a given position (e.g. the previous word).
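The skip-bigram extraction mentioned in the list above can be sketched as follows (pure Python; the sentence and the frequency cut-off are illustrative):

from collections import Counter

def skip_bigrams(tokens, min_skip=2, max_skip=5):
    # Pairs of words separated by min_skip to max_skip intervening words.
    pairs = []
    for i, w in enumerate(tokens):
        for gap in range(min_skip, max_skip + 1):
            j = i + gap + 1
            if j < len(tokens):
                pairs.append((w, tokens[j]))
    return pairs

counts = Counter()
for sentence in [["the", "food", "was", "really", "very", "good"]]:
    counts.update(skip_bigrams(sentence))
frequent = {p for p, c in counts.items() if c > 5}  # keep frequency > 5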
2.1.3 Unconstrained Features
Dictionary (DL) – Presence of a word from a dictionary extracted from the Annotation Guidelines for Laptops.
Dictionary (DR) – Presence of a word from a dictionary extracted from the Annotation Guidelines for Restaurants.
Enchanted Dictionary (ED) – Presence of a word from a dictionary extracted from http://www.enchantedlearning.com/wordlist/.
Group of Words from ED (EDG) – Presence of any word from a group from the ED dictionary.
Dictionary of Negative Words (ND) – Presence of any negative word from the negative words list at http://dreference.blogspot.cz/2010/05/negative-ve-words-adjectives-list-for.html.
Sentiment (S) – A union of features dealing with sentiment. It consists of BoG features where the groups correspond to various sentiment lexicons. We used the following lexicon resources: the Affinity lexicon (Nielsen, 2011), Senticon (Cruz et al., 2014), dictionaries from (Steinberger et al., 2012), MICRO-WNOP (Cerini et al., 2007), and the list of positive or negative opinion words from (Liu et al., 2005). An additional feature includes the output of the Stanford CoreNLP (Manning et al., 2014) v3.6 sentiment analysis package by (Socher et al., 2013).
2.2 Phase A
Sentence-level Category (SB1, slot 1) We use the maximum entropy classifier for all classes. Then a threshold t is used to decide which categories will be assigned by the classifier.
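A sketch of this thresholded maximum entropy scheme, with scikit-learn's logistic regression standing in for the authors' Brainy implementation; the texts, labels, and threshold are illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["great pizza", "rude waiter", "great pizza but rude waiter"]
labels = [{"FOOD#QUALITY"}, {"SERVICE#GENERAL"},
          {"FOOD#QUALITY", "SERVICE#GENERAL"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)                 # multi-label indicator matrix
X = CountVectorizer().fit_transform(texts)    # stand-in for the real features

model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
t = 0.2                                       # assignment threshold
probs = model.predict_proba(X)
assigned = [[c for c, p in zip(mlb.classes_, row) if p >= t] for row in probs]
print(assigned)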
Chinese We used identical features for both domains (BoB, BoHW, BoW, ChN, N), where ChN ranges from unigram to 4-gram and ChN with frequency < 20 are removed, and N ranges from unigram to trigram and N with frequency < 10 are removed. The threshold was set to t = 0.1.
Spanish For Spanish we used the following features: 5V, 5eS, BoB, BoHW, BoW, BoW-POS, ChN, V-BoT, where 5V considers only adjectives, adverbs, and nouns, 5eS considers adjectives and adverbs with frequency > 5, ChN ranges from unigram to 4-gram and ChN with frequency < 20 are removed, BoW-POS is used separately for adverbs, nouns, verbs, and a union of adjectives, adverbs, nouns, and verbs, and V-BoT is used separately for adverbs, nouns, and a union of adjectives and adverbs while reducing the feature space by 50 occurrences. The threshold was set to t = 0.2.
English English features employ lemmatization. The threshold was set to t = 0.14. Common features for all experiments in this task are 5V, 5eS, BoB, BoHW, BoW, BoW-POS, P, TF-IDF, V-BoT, where 5V considers only adjectives, adverbs, and nouns, 5eS filters only adjectives and adverbs, BoW-POS contains adjectives, adverbs, nouns, and verbs, and V-BoT filters adjectives and adverbs with frequency > 20.

The unconstrained model for the Laptops domain additionally uses BoC, BoCB, DL, ED, and P2Vec. BoC and BoCB include the GloVe and CBOW models computed on the Amazon dataset.

The constrained model for the Restaurants domain additionally uses 5sS, ChN, LTD, LTD-C, and P2Vec, where 5sS filters only adjectives and adverbs and ChN in this case means character unigrams with frequency > 5. This model also considers separate BoW-POS features for groups of adverbs, nouns, and verbs.

The unconstrained model for the Restaurants domain uses BoC, BoCB, DR, LDA, ND, and NSh on top of the previously listed features for the constrained model. BoC and BoCB include the GloVe, CBOW, and LDA models computed on the Yelp dataset and the CBOW model computed on the Opentable dataset.
Sentence-level Target (SB1, slot 2) Similarly to (Brychcín et al., 2014), we have decided to use Conditional Random Fields (CRF) (Lafferty et al., 2001) to solve this subtask. The context for this task is defined as a five-word window centred at the currently processed word. English features for this subtask employ lemmatization.

The baseline feature set consists of A, BoB, BoW-POS, HW, LTD, LTD-C, N, POS-N, V-BoT, W, and WSh. BoW-POS contains adjectives, adverbs, nouns, verbs, and a union of adverbs and nouns. We consider POS-N with frequency > 10. V-BoT includes adverbs, nouns, and a union of adjectives, adverbs, nouns, and verbs.

In the unconstrained model, we extend this with the semantic features C and CB (created using the CBOW model computed on the Opentable dataset) and with the lexicons DR and EDG.
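A sketch of such a window-based CRF tagger, assuming the sklearn-crfsuite package as a stand-in for the authors' implementation; the features shown are a small illustrative subset of the baseline set above:

import sklearn_crfsuite

def token_features(tokens, i):
    # A few window features: the token, an affix, and neighbours in a
    # five-word window centred on position i.
    feats = {"w": tokens[i].lower(), "affix3": tokens[i][-3:]}
    for off in (-2, -1, 1, 2):
        j = i + off
        if 0 <= j < len(tokens):
            feats[f"w[{off}]"] = tokens[j].lower()
    return feats

sents = [["The", "sushi", "was", "great"]]
tags = [["O", "B-TARGET", "O", "O"]]  # BIO labels for opinion targets
X = [[token_features(s, i) for i in range(len(s))] for s in sents]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, tags)
print(crf.predict(X))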
Sentence-level Category & Target (SB1, slot 1&2) We first assign targets as described above and then combine them with the best five candidates for aspect category, using the same settings and approach as in the sentence-level category detection (SB1, slot 1). We also add aspect categories without a target. This produces too many combinations, thus we need to filter the unlikely opinions. We remove the opinions without a target in a sentence where the aspect category is already present with a target. When there is only one target and one aspect category in a sentence we combine them into a single opinion.

Text-level Category (SB2, slot 1) We used the baseline algorithm: the predicted sentence-level tuples (SB1, slot 1) are copied to text-level and duplicates are removed.
2.3 Phase B
Sentence-level Sentiment Polarity (SB1, slot 3) Our sentiment polarity detection is based on the Maximum Entropy classifier, which works very well in many NLP tasks, including document-level sentiment analysis (Habernal et al., 2014).
Chinese We used identical features for both domains (5V, 5eS, 5sS, AC, BoB, BoHW, BoW, BoW-POS, ChN, N, NSh, P, SkB, V-BoT, V-BoW), where 5V considers adjectives and adverbs with frequency > 5, 5eS and 5sS contain adjectives, adverbs, nouns, and verbs, BoW-POS is used separately for adjectives and adverbs, ChN ranges from unigram to 5-gram and ChN with frequency < 5 are removed, N ranges from unigram to 5-gram and N with frequency < 2 are removed, V-BoT is used separately for verbs and a union of adjectives and adverbs, and V-BoW is used separately for adjectives, adverbs, verbs, a union of adjectives and adverbs, and a union of adjectives, adverbs, nouns, and verbs, while reducing the feature space by 2 occurrences.
French We employ lemmatization for French. The first constrained model includes the following features: AC, BoB, BoHW, BoW, BoW-POS, ChN, LTD, LTD-C, N, NSh, P, SkB, V-BoT, where BoW-POS is used separately for adjectives, adverbs, and a union of adjectives, adverbs, nouns, and verbs, ChN ranges from unigram to 5-gram and ChN with frequency ≤ 5 are removed, N ranges from unigram to 5-gram and N with frequency < 2 are removed, and V-BoT is used separately for verbs and a union of adjectives and adverbs.

The second constrained model additionally uses 5V, 5eS, 5sS, AT, T-BoW, V-BoW, where 5V considers only adjectives and adverbs, 5eS and 5sS consider adjectives, adverbs, nouns, and verbs, T-BoW is used for adjectives, adverbs, nouns, and verbs, and V-BoW is used separately for adjectives, adverbs, verbs, a union of adjectives and adverbs, and a union of adjectives, adverbs, nouns, and verbs, while reducing the feature space by 2 occurrences.
Spanish We employ lemmatization for Spanish. We used the following features: 5V, 5eS, AC, BoB, BoHW, BoW, BoW-POS, E, ChN, LTD, LTD-C, N, NSh, P, SkB, T-BoW, V-BoT, V-BoW, where 5V considers only adjectives and adverbs, 5eS considers adjectives, adverbs, nouns, and verbs, BoW-POS is used separately for adjectives and adverbs, ChN ranges from unigram to 5-gram and ChN with frequency ≤ 5 are removed, N ranges from unigram to 5-gram and N with frequency < 2 are removed, T-BoW is used for adjectives, adverbs, nouns, and verbs, V-BoT is used separately for verbs and a union of adjectives and adverbs, and V-BoW is used separately for adjectives, adverbs, verbs, a union of adjectives and adverbs, and a union of adjectives, adverbs, nouns, and verbs, while reducing the feature space by 2 occurrences.
English We use lemmatization in this subtask. Common features for all experiments in this task are 5V, 5eS, 5sS, AC, AT, BoB, BoHW, BoW, BoW-POS, E, ChN, LTD, LTD-C, N, NSh, P, SkB, V-BoT, V-BoW, where 5V considers adjectives and adverbs, 5eS and 5sS consist of adjectives, adverbs, nouns, and verbs, BoW-POS contains adjectives and adverbs, N ranges from unigram to 5-gram and N with frequency < 2 are removed, V-BoT is used separately for verbs and a union of adjectives and adverbs, and V-BoW is used separately for adjectives, adverbs, verbs, a union of adjectives and adverbs, and a union of adjectives, adverbs, nouns, and verbs, while reducing the feature space by 2 occurrences.

The unconstrained model for the Laptops domain additionally uses BoC, BoCB, ED, and S. BoC and BoCB include the GloVe and CBOW models computed on the Amazon dataset.

The constrained model for the Restaurants domain additionally uses T-BoW and TF-IDF, where T-BoW is used for adjectives, adverbs, nouns, and verbs.

The unconstrained model for the Restaurants domain uses BoC, BoCB, ED, ND, and S on top of the previously listed features for the constrained model. BoC and BoCB include the GloVe and CBOW models computed on the Yelp dataset and the CBOW model computed on the Opentable dataset.
Text-level Sentiment Polarity (SB2, slot 3) The baseline algorithm traverses the predicted sentence-level tuples of the same category and counts the respective polarity labels (positive, negative or neutral). Finally, the polarity label with the highest frequency is assigned to the text-level category. If there are no sentence-level tuples of the same category, the polarity label is determined based on all tuples regardless of the category.

Our improved algorithm contains an additional step that assigns a polarity for cases (categories) with more than one sentence-level polarity label. The resulting polarity is determined by the following algorithm:

if (catPolarity == lastPolarity) {
    assign lastPolarity;
} else if (catPolarity == entPolarity) {
    assign entPolarity;
} else {
    assign CONFLICT;
}
where catPolarity is the polarity label with the highest frequency for the given category (E#A tuple), entPolarity is the polarity label with the highest frequency for the entity E, and lastPolarity is the last seen polarity label for the given category. This follows our belief that the last polarity tends to reflect the final sentiment (opinion) toward the aspect category.

                              Constrained                Unconstrained
                              Category     Sentiment     Category     Sentiment
Domain      Lang Subtask      Rank  F1[%]  Rank  ACC[%]  Rank  F1[%]  Rank  ACC[%]
Restaurants EN   SB1          3.    67.8   2.    81.8    8.    68.2   9.    81.7
Laptops     EN   SB1          1.    47.9   3.    73.8    7.    47.3   10.   73.8
Restaurants EN   SB2          1.    81.0   1.    80.9    3.    80.2   1.    81.9
Laptops     EN   SB2          1.    60.5   1.    74.5    2.    59.7   1.-2. 75.0
Restaurants FR   SB1          –     –      2.    75.3    –     –      –     –
Cameras     CH   SB1          1.    36.3   3.    77.8    –     –      –     –
Phones      CH   SB1          1.    22.5   3.    72.0    –     –      –     –
Restaurants SP   SB1          3.    62.0   2.    81.3    –     –      –     –
Restaurants SP   SB2          3.    73.7   1.    77.2    –     –      –     –

                              Target       Category & Target   Target       Category & Target
                              Rank  F1[%]  Rank  F1[%]         Rank  F1[%]  Rank  F1[%]
Restaurants EN   SB1          1.    66.9   4.    41.1          3.    67.1   6.    41.1

Table 2: Achieved ranks and results (in %) by UWB for all submitted systems.
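A sketch of the improved aggregation in plain Python; the data structures and example tuples are illustrative, and ties in the majority counts are broken arbitrarily here:

from collections import Counter

def text_level_polarity(tuples):
    # tuples: list of (category, polarity) in sentence order, where
    # category is an "ENTITY#ATTRIBUTE" string.
    by_cat, by_ent, last = {}, {}, {}
    for cat, pol in tuples:
        by_cat.setdefault(cat, Counter())[pol] += 1
        by_ent.setdefault(cat.split("#")[0], Counter())[pol] += 1
        last[cat] = pol
    result = {}
    for cat, counts in by_cat.items():
        cat_pol = counts.most_common(1)[0][0]   # majority for the category
        ent_pol = by_ent[cat.split("#")[0]].most_common(1)[0][0]
        if len(counts) == 1:
            result[cat] = cat_pol               # only one label seen
        elif cat_pol == last[cat]:
            result[cat] = last[cat]
        elif cat_pol == ent_pol:
            result[cat] = ent_pol
        else:
            result[cat] = "conflict"
    return result

print(text_level_polarity([("FOOD#QUALITY", "positive"),
                           ("FOOD#QUALITY", "negative"),
                           ("FOOD#QUALITY", "positive")]))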
2.4 System Settings
For all experiments we use the Brainy (Konkol, 2014) machine learning library. Data preprocessing includes lower-casing and in some cases lemmatization. We utilize parse trees, lemmatization, and POS tags from the Stanford CoreNLP (Manning et al., 2014) v3.6 framework. We chose it because it supports Chinese, English, French, and Spanish.
3 Results and Discussion
As shown in Table 2, we achieved very satisfactory results, especially for the constrained experiments.

In the English sentence-level Laptops domain our constrained method was slightly better than the unconstrained one (by 0.6%). We believe this is not a significant deviation.
The baseline algorithm for text-level category (SB2, slot 1), using the sentence-level gold test data, achieves an F1 score of 96.08% on the Laptops domain and 97.07% on the Restaurants domain for English. When we add the corresponding general class for the given domain (e.g. RESTAURANT#GENERAL and LAPTOP#GENERAL) the algorithm achieves an F1 score of 96.87% on the Laptops domain and 99.75% on the Restaurants domain for English.

The baseline algorithm for text-level sentiment polarity (SB2, slot 3), again using the sentence-level gold test data, achieves an accuracy of 86.8% on the Laptops domain and 89.6% on the Restaurants domain for English, while the improved algorithm achieves an accuracy of 94.5% on the Laptops domain and 97.3% on the Restaurants domain for English.
4 Conclusion
We competed in 19 constrained experiments and won 9 of them. In the other 10 cases we reached at worst 4th place. Our unconstrained systems participated in 10 experiments and in 5 of them achieved ranks ranging from 1st to 3rd place.

In the future we aim to explore the benefits of using neural networks for sentiment analysis.
Acknowledgements
This work was supported by project LO1506 of the Czech Ministry of Education, Youth and Sports and by Grant No. SGS-2016-018 Data and Software Engineering for Advanced Applications. Computational resources were provided by the CESNET LM2015042 and the CERIT Scientific Cloud LM2015085, provided under the programme "Projects of Large Research, Development, and Innovations Infrastructures".
References
Daniel M. Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. 1997. Nymble: a high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 194–201. Association for Computational Linguistics.
D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:2003.
John Blitzer, Mark Dredze, Fernando Pereira, et al. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL, volume 7, pages 440–447.
Caroline Brun, Diana Nicoleta Popa, and Claude Roux. 2014. XRCE: Hybrid classification for aspect-based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 838–842, Dublin, Ireland, August. Association for Computational Linguistics.
Tomáš Brychcín, Michal Konkol, and Josef Steinberger. 2014. UWB: Machine learning approach to aspect-based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 817–822. Association for Computational Linguistics.
Giuseppe Castellucci, Simone Filice, Danilo Croce, and Roberto Basili. 2014. UNITOR: Aspect based sentiment analysis with structured learning. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 761–767, Dublin, Ireland, August. Association for Computational Linguistics.
Sabrina Cerini, Valentina Compagnoni, Alice Demontis, Maicol Formentelli, and G. Gandini. 2007. Micro-WNOp: A gold standard for the evaluation of automatically compiled lexical resources for opinion mining. Language resources and linguistic theory: Typology, second language acquisition, English linguistics, pages 200–210.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Fermín L. Cruz, José A. Troyano, Beatriz Pontes, and F. Javier Ortega. 2014. Building layered, multilingual sentiment lexicons at synset and lemma levels. Expert Systems with Applications, 41(13):5984–5994.
David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006. A closer look at skip-gram modelling. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC-2006), pages 1–4.
Ivan Habernal, Tomáš Ptáček, and Josef Steinberger. 2014. Supervised sentiment analysis in Czech social media. Information Processing & Management, 50(5):693–707.
G. Karypis. 2003. CLUTO—A clustering toolkit.
Svetlana Kiritchenko, Xiaodan Zhu, Colin Cherry, and Saif Mohammad. 2014. NRC-Canada-2014: Detecting aspects and sentiment in customer reviews. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 437–442, Dublin, Ireland, August. Association for Computational Linguistics.
Michal Konkol. 2014. Brainy: A machine learning library. In Leszek Rutkowski, Marcin Korytkowski, Rafał Scherer, Ryszard Tadeusiewicz, Lotfi A. Zadeh, and Jacek M. Zurada, editors, Artificial Intelligence and Soft Computing, volume 8468 of Lecture Notes in Computer Science. Springer-Verlag, Berlin.
John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning.
Quoc V. Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pages 1188–1196.
Bing Liu, Minqing Hu, and Junsheng Cheng. 2005. Opinion observer: analyzing and comparing opinions on the web. In Proceedings of the 14th International Conference on World Wide Web, pages 342–351. ACM.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
Andrew Kachites McCallum. 2002. MALLET: A machine learning for language toolkit.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
A. Montejo-Ráez, E. Martínez-Cámara, M. T. Martín-Valdivia, and L. A. Ureña López. 2012. Random walk weighting over sentiwordnet for sentiment polarity detection on twitter. In Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis, WASSA '12, pages 3–10, Stroudsburg, PA, USA. Association for Computational Linguistics.
Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. In Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages, pages 93–98.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35, Dublin, Ireland, August. Association for Computational Linguistics and Dublin City University.
Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 Task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495, Denver, Colorado. Association for Computational Linguistics.
Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, Mohammad AL-Smadi, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, Véronique Hoste, Marianna Apidianaki, Xavier Tannier, Natalia Loukachevitch, Evgeny Kotelnikov, Nuria Bel, Salud María Jiménez-Zafra, and Gülşen Eryiğit. 2016. SemEval-2016 Task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval '16, San Diego, California, June. Association for Computational Linguistics.
Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA. http://is.muni.cz/publication/884893/en.
José Saias. 2015. Sentiue: Target and aspect based sentiment analysis in SemEval-2015 Task 12. Association for Computational Linguistics.
Iñaki San Vicente, Xabier Saralegi, Rodrigo Agerri, and Donostia-San Sebastián. 2015. EliXa: A modular and flexible ABSA platform. SemEval-2015, page 748.
Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), volume 1631, page 1642. Citeseer.
Josef Steinberger, Mohamed Ebrahim, Maud Ehrmann, Ali Hurriyetoglu, Mijail Kabadjov, Polina Lenkova, Ralf Steinberger, Hristo Tanev, Silvia Vázquez, and Vanni Zavarella. 2012. Creating sentiment dictionaries via triangulation. Decision Support Systems, 53(4):689–694.
Zhiqiang Toh and Jian Su. 2015a. NLANGP: Supervised machine learning system for aspect category classification and opinion target extraction.
Zhiqiang Toh and Jian Su. 2015b. NLANGP: Supervised machine learning system for aspect category classification and opinion target extraction. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 496–501, Denver, Colorado, June. Association for Computational Linguistics.
Zhiqiang Toh and Wenting Wang. 2014. DLIREC: Aspect term extraction and term polarity classification system. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 235–240, Dublin, Ireland, August. Association for Computational Linguistics.
Joachim Wagner, Piyush Arora, Santiago Cortes, Utsab Barman, Dasha Bogdanova, Jennifer Foster, and Lamia Tounsi. 2014. DCU: Aspect-based polarity classification for SemEval Task 4. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 223–229, Dublin, Ireland, August. Association for Computational Linguistics.
Zhihua Zhang and Man Lan. 2015. ECNU: Extracting effective features from multiple sequential sentences for target-dependent sentiment analysis in reviews. SemEval-2015, page 736.