An open-source finite state morphological transducer for Modern Standard Arabic called AraComLex was developed. It contains over 30,000 lemmas and is corpus-based rather than based on Classical Arabic. Machine learning was used to automatically extend the lexicon and predict morphological features. Testing showed it achieved comparable coverage to other tools with a lower rate of analyses per word.
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools Lifeng (Aaron) Han
Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation (MT). With the rapid development of MT, evaluation becomes more and more important for telling whether systems are making progress. Traditional human judgments are very time-consuming and expensive. On the other hand, existing automatic MT evaluation metrics have some weaknesses:
– they perform well on certain language pairs but poorly on others, which we call the language-bias problem;
– they consider either no linguistic information (leading to low correlation with human judgments) or too many linguistic features (making them hard to replicate), which we call the extremism problem;
– they are built on incomplete factors (e.g. precision only).
To address the existing problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
The experiments on ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. Evaluation is closely related to similarity measurement, so this work can be extended to other areas, such as information retrieval, question answering, search, etc.
A brief introduction to some of his other research will also be given, such as Chinese named entity recognition, word segmentation, and multilingual treebanks, which has been published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, as are opportunities for further cooperation.
Expressive Querying of Semantic Databases with Incremental Query Rewriting Alexandre Riazanov
This talk briefly introduces the Incremental Query Rewriting (IQR) method (see http://link.springer.com/chapter/10.1007%2F978-1-4419-7335-1_1 ) and presents an approach for extremely expressive querying of RDF triplestores, based on IQR.
The following are the questions I tried to answer in this presentation:
What is text summarization?
What is automatic text summarization?
How has it evolved over time?
What are the different methods?
How is deep learning used for text summarization?
What are the business applications?
The first few slides explain extractive summarization, with its pros and cons; the next section explains abstractive summarization.
The last section highlights the business applications of each.
A brief survey presentation about Arabic Question Answering (AQA), touching on the different Natural Language Processing and Information Retrieval approaches to question analysis, passage retrieval and answer extraction, in addition to listing the different NLP tools used in AQA and the challenges and future trends in this area.
If you want to cite this paper, you can download it here:
http://www.acit2k.org/ACIT/2012Proceedings/13106.pdf
Overview of the SPARQL-Generate language and latest developments Maxime Lefrançois
SPARQL-Generate is an extension of SPARQL 1.1 for querying not only RDF datasets but also documents in arbitrary formats. The solution bindings can then be used to output RDF (SPARQL-Generate) or text (SPARQL-Template).
Anyone familiar with SPARQL can easily learn SPARQL-Generate; learning SPARQL-Generate helps you learn SPARQL.
The open-source implementation (Apache 2 license) is based on Apache Jena and can be used to execute transformations from a combination of RDF and any kind of documents in XML, JSON, CSV, HTML, GeoJSON, CBOR, streams of messages using WebSocket or MQTT... (easily extensible)
Recent extensions and improvements include:
- heavy refactoring to support parallelization
- more expressive iterators and functions
- simple generation of RDF lists
- support of aggregates
- generation of HDT (thanks Ana for the use case)
- partial implementation of STTL for the generation of Text (https://ns.inria.fr/sparql-template/)
- partial implementation of LDScript (http://ns.inria.fr/sparql-extension/)
- integration of all these types of rules to decouple or compose queries, e.g.:
  - call a SPARQL-Generate query in the SPARQL FROM clause
  - plug a SPARQL-Generate or a SPARQL-Template query to the output of a SPARQL-Select function
- a Sublime Text package for local development
XMODEL: An XML-based Morphological Analyzer for Arabic Language Waqas Tariq
Morphological analysis is an essential stage in language engineering applications. For Arabic, this stage is not easy to develop because the language has particularities such as agglutination and a high degree of morphological ambiguity. These make the design of a morphological analyzer for Arabic difficult and require many other tools and treatments. The size of the lexicon is another major problem, which directly affects the analysis process. In this paper we present a morphological analyzer for Modern Standard Arabic based on the Arabic Morphological Automaton technique and using a new and innovative language (XMODEL) to represent Arabic morphological knowledge in an optimal way. Both the Arabic Morphological Analyzer and the Arabic Morphological Automaton are implemented in Java and use XML technology. The Buckwalter Arabic Morphological Analyzer and Xerox Arabic Finite State Morphology are two of the best-known morphological analyzers for Modern Standard Arabic, and both are available and documented. Our morphological analyzer can be exploited by Natural Language Processing (NLP) applications such as machine translation, orthographical correction, information retrieval, and both syntactic and semantic analyzers. Finally, we present an evaluation of the Xerox system and ours.
Usage of Linked Data: Introduction and Application Scenarios EUCLID project
This presentation introduces the main principles of Linked Data, the underlying technologies and background standards. It provides basic knowledge for how data can be published over the Web, how it can be queried, and what are the possible use cases and benefits. As an example, we use the development of a music portal (based on the MusicBrainz dataset), which facilitates access to a wide range of information and multimedia resources relating to music.
Using Semantic and Domain-based Information in CLIR Systems Mauro Dragoni
Cross-Language Information Retrieval (CLIR) systems extend classic information retrieval mechanisms for allowing users to query across languages, i.e., to retrieve documents written in languages different from the language used for query formulation.
In this paper, we present a CLIR system exploiting multilingual ontologies for enriching documents representation with multilingual semantic information during the indexing phase and for mapping query fragments to concepts during the retrieval phase.
This system has been applied on a domain-specific document collection and the contribution of the ontologies to the CLIR system has been evaluated in conjunction with the use of both Microsoft Bing and Google Translate translation services.
Results demonstrate that the use of domain-specific resources leads to a significant improvement of CLIR system performance.
SoDA v2 - Named Entity Recognition from streaming text Sujit Pal
Covers the services supported by SoDA v2. Includes some background on Named Entity Recognition and Resolution, popular approaches to Named Entity Recognition, hybrid approaches, scaling SoDA using Spark and Spark streaming, deployment strategies, etc.
1. An Open-Source Finite State
Morphological Transducer for Modern
Standard Arabic
Mohammed Attia, Pavel Pecina, Antonio Toral, Lamia Tounsi,
Josef van Genabith
National Centre for Language Technology (NCLT),
School of Computing, Dublin City University
Funded by:
Enterprise Ireland, the Irish Research Council for Science
Engineering and Technology (IRCSET), and
the EU projects PANACEA and META-NET
2. Contribution
• We develop a finite state morphological
transducer for Modern Standard Arabic
1. Open source, distributed under the GPLv3 license
2. Large scale, more than 30,000 lemmas
3. Corpus based, truly representative of Modern
Standard Arabic and not Classical Arabic.
4. Compatible with Foma, an open-source fst compiler
3. Short Tutorial
(1) Download Foma
http://foma.sourceforge.net
(2) Download AraComLex
http://aracomlex.sourceforge.net
(3) Build the transducer: README
5. Introduction
• Modern Standard Arabic vs. Classical
Arabic
• Current State of Arabic Lexicography
– Lexicons are not corpus-based
– Buckwalter Arabic Morphological Analyser
• Importance of Lexical Resources
7. Aim
• Building a finite-state morphological
transducer
• Constructing a lexical database of Modern
Standard Arabic
8. Methodology
• Using Open-Source Finite State
Technology
• Using statistics from a 1 billion word
corpus
– 90% from the LDC's Arabic Gigaword
– 10% collected from the Al-Jazeera website
• Using a medium-scale manually created
lexicon of 10,799 lemmas
9. Methodology
• Using Finite State Technology (XFST)
– Bidirectional: Suitable for analysis and generation
– Handles concatenative and non-concatenative
morphotactics
– Speed and efficiency in dealing with millions of
paths
– Handles separated dependencies.
– Handles phonological and orthographic changes
through alteration rules.
13. Methodology
Alteration Rules:
Alteration Rules are used for handling discrepancies
between surface forms and underlying representation or
lemmas. We have 130 replace rules.
a -> b || L _ R
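The rule schema above reads "rewrite a as b when it occurs between left context L and right context R". As a rough illustration only, the same idea can be approximated on plain strings with regular-expression lookarounds. This is a sketch, not how Foma or XFST compile rules (they build transducers), and the function name and sample strings are hypothetical:

```python
import re


def apply_replace_rule(text, a, b, left, right):
    """Approximate the rule  a -> b || left _ right  on a plain string.

    Assumption: left and right are literal strings, not regular
    expressions, so a fixed-width lookbehind is sufficient.
    """
    pattern = (
        f"(?<={re.escape(left)})"   # left context L, not consumed
        f"{re.escape(a)}"           # the symbol(s) to rewrite
        f"(?={re.escape(right)})"   # right context R, not consumed
    )
    return re.sub(pattern, b, text)


if __name__ == "__main__":
    # Only the "a" standing between "x" and "z" is rewritten.
    print(apply_replace_rule("xay yaz xaz", "a", "b", "x", "z"))
```

Because the contexts are matched with lookarounds, they are checked but left unchanged, which mirrors the role of L and R in the rule.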
14. Results to Date
• Start-off with a seed lexicon
– Four Lexical Databases, manually constructed
• 5,925 nominal lemmas
• 1,529 verb lemmas
• 490 patterns (456 for nominals and 34 for verbs)
• lemma-root look up database
15. Results to Date
• Automatically Extending the Lexical
Database: Lexical Enrichment
– Data-driven filtering technique
• 40,648 lemmas (in Buckwalter or SAMA 3.1)
• Statistics from three web search engines
• Statistics from the corpus annotated by MADA
• 29,627 lemmas (left after filtering)
16. Results to Date
Automatically Extending the Lexical
Database: Feature Enrichment
– Machine Learning
– Multilayer Perceptron classification algorithm
– Training Data: 4,816 nominals and 1,448 verbs
– Classes for nominals: continuation classes (or inflection
paths), the semantico-grammatical feature of humanness,
and POS (noun or adjective)
– Classes for verbs: transitivity, allowing the passive voice,
and allowing the imperative mood
– We feed these datasets with frequency statistics from the
corpus and build a vector grid.
17. Results to Date
• Extending the Lexical Database
– Feature enrichment using Machine Learning
18. Results to Date
• Extending the Lexical Database
– With Machine Learning we add:
18,000 new lemmas:
12,974 nominals
5,034 verbs
19. Results to Date
• Handling Broken Plurals
jAnib (side)
jawAnib (sides)
Poor handling of broken plurals in Buckwalter
(4) <lemmaID>jAnib_1</lemmaID> <voc>jAnib</voc>
<pos>jAnib/NOUN</pos> <gloss>side/aspect</gloss>
(5) <lemmaID>jAnib_1</lemmaID> <voc>jawAnib</voc>
<pos>jawAnib/NOUN</pos> <gloss>sides/aspects</gloss>
Two differences: voc and gloss
20. Results to Date
• Extracting Broken Plurals
<gloss>side/aspect</gloss>
<gloss>sides/aspects</gloss>
We use Levenshtein Distance which measures the difference
between two strings (here glosses having the same lemmaID).
distance of 2 / length of the first string = 0.15
(within the threshold 0.4)
We collect 2,266 candidates
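The gloss comparison can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and dividing by the length of the first argument is one reading of the slide. With the two glosses shown, the edit distance is 2; passing the 13-character plural gloss first gives 2 / 13, about 0.15, while passing the 11-character singular gloss first gives about 0.18. Both fall within the 0.4 threshold.

```python
def levenshtein(s, t):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(first, second):
    """Edit distance divided by the length of the first string (assumed reading)."""
    if not first:
        raise ValueError("first string must not be empty")
    return levenshtein(first, second) / len(first)


def is_plural_candidate(first, second, threshold=0.4):
    """Two different glosses of the same lemmaID that are close enough."""
    return first != second and normalized_distance(first, second) <= threshold


if __name__ == "__main__":
    print(levenshtein("side/aspect", "sides/aspects"))
    print(round(normalized_distance("sides/aspects", "side/aspect"), 2))
```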
21. Results to Date
• Validating Broken Plurals
<voc>jAnib</voc> singular
pattern is: fAEil
regex is: .A.i.
<voc>jawAnib</voc> plural
pattern is: fawAEil
regex is: .awA.i.
Pattern database: 135 singular patterns that choose from a
set of 82 broken plural patterns
2,266 candidates -> 1,965 are validated (87%)
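The validation step can be illustrated with a small sketch. It assumes that the root slots of a pattern are written with the letters f, E and l and that each slot becomes a regex wildcard, which reproduces the two regexes on this slide. The one-entry pattern table is a hypothetical excerpt, not the actual database of 135 singular and 82 plural patterns:

```python
import re

ROOT_SLOTS = "fEl"  # assumed placeholders for the three root consonants

# Hypothetical excerpt: singular pattern -> broken plural patterns it allows.
PLURALS_FOR = {
    "fAEil": ["fawAEil"],
}


def pattern_to_regex(pattern):
    """Turn a morphological pattern into a regex, e.g. fAEil -> .A.i."""
    return "".join("." if ch in ROOT_SLOTS else re.escape(ch) for ch in pattern)


def matches_pattern(voc, pattern):
    """True if the vocalized form fits the pattern as a whole string."""
    return re.fullmatch(pattern_to_regex(pattern), voc) is not None


def validate_broken_plural(singular, plural):
    """Accept a candidate pair if both forms fit a licensed pattern pair."""
    for sg_pattern, pl_patterns in PLURALS_FOR.items():
        if matches_pattern(singular, sg_pattern):
            if any(matches_pattern(plural, p) for p in pl_patterns):
                return True
    return False


if __name__ == "__main__":
    print(pattern_to_regex("fAEil"), pattern_to_regex("fawAEil"))
    print(validate_broken_plural("jAnib", "jawAnib"))
```

This sketch does not check that the singular and the plural share the same root consonants; a stricter variant could capture the wildcard positions and compare them.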
22. Results to Date
• Interesting statistics on Arabic plurals
Insights from the corpus:
5,570 lemmas have a feminine plural suffix
1,942 lemmas have a masculine plural suffix
2,730 lemmas have broken plural forms
24. Results to Date
• FST Morphology Coverage and RPW
Results
– a test corpus of 800,000 words, divided as
• 400,000 for Semi-Literary text
• 400,000 for General News texts.
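The two measures can be computed along the following lines. This is a sketch under stated assumptions: coverage is taken to be the share of test words that receive at least one analysis, and RPW is read as the average number of analyses per analysed word. The slides do not spell out either definition, and the function name and input format are hypothetical.

```python
def coverage_and_rate(analyses_per_word):
    """Compute (coverage, rate per word) from per-word analysis counts.

    analyses_per_word: one integer per test-corpus word, giving how many
    analyses the transducer returned for it (0 means unknown word).
    """
    total = len(analyses_per_word)
    if total == 0:
        raise ValueError("empty test corpus")
    analysed = [n for n in analyses_per_word if n > 0]
    coverage = len(analysed) / total
    # Assumption: the rate is averaged over analysed words only.
    rate = sum(analysed) / len(analysed) if analysed else 0.0
    return coverage, rate


if __name__ == "__main__":
    # Four words: one unknown, the others with 2, 1 and 3 analyses.
    print(coverage_and_rate([2, 0, 1, 3]))
```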
26. Conclusion
• Open-source finite state transducer for Modern
Standard Arabic (AraComLex) distributed under
the GPLv3 license.
• We successfully use machine learning to predict
morpho-syntactic features for newly acquired
words.
• Comparing our morphological transducer to
SAMA, we find that we achieve comparable
coverage and lower rate of analyses per word.