Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team.
August 30-31, 2007.
Patan Dhoka, Lalitpur, Nepal.
The document describes a Russian paraphrase corpus created by the authors. It contains over 8000 sentence pairs annotated as precise, loose, or non-paraphrases using crowdsourcing. The corpus was collected from news headlines and aims to capture the most important events. The authors evaluate different models for classifying sentence pairs and find that combining linguistic features improves performance over individual feature types. Graphs built from the corpus can reveal connected events more completely than human annotations alone.
4. The many uses of machine translation technology
Let Google and Microsoft run with it, or invest in your own translation technology? Why build your own machine translation system if Google and Microsoft offer such a great machine translation service?
Machine translation technology is a force multiplier and a catalyst for innovation. Learn more about how effective use of MT opens many new services and markets.
Panelists: Stéphane Domisse (John Deere), Olga Beregovaya (Welocalize), Diego Bartolome (tauyou), Dragos Munteanu (SDL), Tony O’Dowd (KantanMT), Sanna Piha (Moravia), Irene O'Riordan (Microsoft)
Concept hierarchy is the backbone of an ontology, and concept hierarchy acquisition has been a hot topic in the field of ontology learning. This paper proposes a hyponymy extraction method for domain ontology concepts based on cascaded conditional random fields (CCRFs) and hierarchical clustering. It takes free text as the extraction object and adopts CCRFs to identify the domain concepts. First the low layer of the CCRFs is used to identify simple domain concepts, then the results are passed to the high layer, in which nested concepts are recognized. Next, hierarchical clustering is used to identify the hyponymy relations between domain ontology concepts. The experimental results demonstrate that the proposed method is effective.
This document discusses contrastive analysis, which compares two languages to identify similarities and differences. It covers:
1. The basic assumptions of contrastive analysis, including that interference from a first language causes learning difficulties in a second language, and contrastive analysis can predict and address these issues.
2. The theoretical and applied levels of contrastive analysis - theoretical establishes frameworks for comparison, while applied uses these findings for language teaching.
3. Contrastive analysis can predict about one-third to one-half of learner errors caused by interference from the first language. It cannot predict other error types.
Modular Ontologies - A Formal Investigation of Semantics and Expressivity (Jie Bao)
The document discusses requirements for modular ontologies, including semantic soundness, expressivity, and localized semantics. It analyzes approaches like DDL, E-Connections, and P-DL against these requirements. DDL allows directional relations but lacks support for roles and transitive reusability. E-Connections ensures reasoning exactness but has limited expressivity. P-DL supports transitive reusability through compositionally consistent relations, but directionality does not always hold and decidability relies on the underlying description logic. The conclusions call for an expressive modular ontology language with both concept and role correspondences across modules, while relaxing disjointness assumptions to improve expressivity and reasoning capabilities.
This document discusses cognitive plausibility in learning algorithms, with a focus on natural language processing. It outlines the author's background and motivation, which is to model human learning and communication more accurately. Some key points made include: understanding language acquisition as discriminative learning rather than compositional; explaining features of human language through models like Rescorla-Wagner learning; and how naive discrimination learning can be applied to NLP tasks through an incremental learning algorithm. The document also provides an overview of available NLP tools and limitations in fully achieving language understanding.
Rethinking Critical Editions of Fragments by Ontologies (Matteo Romanello)
This document discusses rethinking the representation of fragmentary classical texts in digital editions through the use of ontologies. It addresses problems with current editions, such as duplication of text. The authors analyze the domain to identify concepts like fragments as interpretations linked to evidence. They design an ontology with classes for interpretations, textual passages, and linking fragments to witness texts. The benefits cited include a solid architecture separating texts from interpretations, formalization of the domain, and improved data interoperability.
Formal and Computational Representations
The Semantics of First-Order Logic
Event Representations
Description Logics & the Web Ontology Language
Compositionality
Lambda calculus
Corpus-based approaches:
Latent Semantic Analysis
Topic models
Distributional Semantics
[Paper Reading] Supervised Learning of Universal Sentence Representations from Natural Language Inference Data (Hiroki Shimanaka)
This document summarizes the paper "Supervised Learning of Universal Sentence Representations from Natural Language Inference Data". It discusses how the researchers trained sentence embeddings using supervised data from the Stanford Natural Language Inference dataset. They tested several sentence encoder architectures and found that a BiLSTM network with max pooling produced the best performing universal sentence representations, outperforming prior unsupervised methods on 12 transfer tasks. The sentence representations learned from the natural language inference data consistently achieved state-of-the-art performance across multiple downstream tasks.
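For intuition, here is a minimal PyTorch sketch of a BiLSTM-with-max-pooling sentence encoder like the one the summary describes. The layer sizes, batch shapes, and variable names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BiLSTMMaxPoolEncoder(nn.Module):
    """Encode token-embedding sequences into fixed-size sentence vectors
    by max-pooling bidirectional LSTM states over time."""
    def __init__(self, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, x):            # x: (batch, seq_len, embed_dim)
        out, _ = self.lstm(x)        # out: (batch, seq_len, 2*hidden_dim)
        sent, _ = out.max(dim=1)     # element-wise max over time steps
        return sent                  # (batch, 2*hidden_dim)

# Toy usage: 4 sentences of 10 tokens with 300-d embeddings.
emb = torch.randn(4, 10, 300)
print(BiLSTMMaxPoolEncoder()(emb).shape)  # torch.Size([4, 1024])
```

Max pooling over time is what makes the representation fixed-size regardless of sentence length.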
This document provides an overview of deep learning techniques for natural language processing. It begins with an introduction to distributed word representations like word2vec and GloVe. It then discusses methods for generating sentence embeddings, including paragraph vectors and recursive neural networks. Character-level models are presented as an alternative to word embeddings that can handle morphology and out-of-vocabulary words. Finally, some general deep learning approaches for NLP tasks like text generation and word sense disambiguation are briefly outlined.
A Distributional Semantics Approach for Selective Reasoning on Commonsense Gr... (Andre Freitas)
Tasks such as question answering and semantic search depend on the ability to query and reason over large-scale commonsense knowledge bases (KBs). However, dealing with commonsense data demands coping with problems such as increased schema complexity, semantic inconsistency, incompleteness, and scalability. This paper proposes a selective graph navigation mechanism based on a distributional relational semantic model which can be applied to querying and reasoning over heterogeneous KBs. The approach can be used for approximate reasoning, querying, and associational knowledge discovery. In this paper we focus on commonsense reasoning as the main motivational scenario for the approach. The approach addresses the following problems: (i) providing a semantic selection mechanism for facts which are relevant and meaningful in a specific reasoning and querying context, and (ii) coping with information incompleteness in large KBs. The approach is evaluated using ConceptNet as a commonsense KB, and achieved high selectivity, high scalability, and high accuracy in the selection of meaningful navigational paths. Distributional semantics is also used as a principled mechanism to cope with information incompleteness.
This document summarizes the internship of Ho Xuan Vinh at Kyoto Institute of Technology aimed at creating a bilingual annotated corpus of Vietnamese-English for machine learning purposes. Vinh experimented with several semantic tagsets, including WordNet, LLOCE, and UCREL, but faced challenges due to the lack of Vietnamese language resources. His goal was to find an effective method for annotating a bilingual corpus to provide training data for natural language processing tasks, but he was unable to validate his annotation approaches due to limitations in the available data and tools.
V. Malykh presents an approach for creating robust word vectors for the Russian language that does not rely on a predefined vocabulary or word co-occurrence matrices. The approach uses an LSTM neural network and BME representations of words at the character level to learn word embeddings. Experiments on Russian corpora for paraphrase identification and plagiarism detection show that the approach outperforms standard word2vec models, especially in noisy conditions with character substitutions, additions, and deletions.
This document summarizes a paper on using simple lexical overlap features with support vector machines (SVMs) for Russian paraphrase identification. It introduces paraphrase identification and various paraphrase corpora. It then describes a knowledge-lean approach using only tokenization, lowercasing, and overlap features like union and intersection size as inputs to linear and RBF kernel SVMs. The method achieves competitive results on English, Turkish, and Russian paraphrase identification tasks.
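To make the knowledge-lean pipeline concrete, here is a minimal sketch: tokenize, lowercase, compute union/intersection overlap features, and feed them to an SVM. The exact feature set, kernel settings, and the toy data are assumptions, not the paper's configuration.

```python
from sklearn.svm import SVC

def overlap_features(s1, s2):
    """Lexical overlap features from lowercased whitespace tokens."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    inter, union = len(a & b), len(a | b)
    return [inter, union, inter / union if union else 0.0]

# Tiny labeled sample: (sentence1, sentence2, is_paraphrase).
pairs = [("the cat sat", "the cat sat down", 1),
         ("stocks fell sharply", "the weather was sunny", 0)]
X = [overlap_features(s1, s2) for s1, s2, _ in pairs]
y = [label for _, _, label in pairs]

clf = SVC(kernel="rbf").fit(X, y)          # a linear kernel works too
print(clf.predict([overlap_features("a cat sat", "the cat sat")]))
```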
Context, Perspective, and Generalities in a Knowledge Ontology (Mike Bergman)
This presentation to the Ontolog Forum in Dec 2016 presents the knowledge graph (ontology) design for KBpedia, a system of six major knowledge bases and 20 minor ones for conducting knowledge-based artificial intelligence (KBAI). The talk emphasizes the roots of the system in the triadic logic of Charles Sanders Peirce. It also discusses the use of KBpedia for the more-or-less automatic ways it can help create training corpora, training sets, and reference standards for supervised, unsupervised, and deep machine learning. Uses of the system include entity and relation extraction and tagging, classification, clustering, sentiment analysis, and other AI tasks.
This document proposes an approach to automatically build term hierarchies from large patent datasets. It involves a three-stage process: term extraction, hierarchy building, and hierarchy enrichment. Terms are first extracted from patent titles, abstracts, and claims. The hierarchy is built by classifying terms into unigrams, bigrams, and trigrams to reflect different levels of generality. The hierarchy is then enriched using a word embedding model to add related terms. Results on sample patent subgroups show the approach can identify generic and specific terms, though human evaluation and more linguistic study on patents are needed.
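A toy sketch of the generality heuristic described above: shorter terms are treated as more generic, and each n-gram is attached under an (n-1)-gram it contains. The term list is invented for illustration, and real patent processing would need proper term extraction first.

```python
from collections import defaultdict

terms = ["engine", "combustion engine", "internal combustion engine",
         "valve", "exhaust valve"]

# Group terms by length in words: unigrams form the most generic level.
by_len = defaultdict(list)
for t in terms:
    by_len[len(t.split())].append(t)

# Attach each n-gram under a contained (n-1)-gram, if one exists.
parent = {}
for n in sorted(by_len):
    if n == 1:
        continue
    for t in by_len[n]:
        words = set(t.split())
        for cand in by_len[n - 1]:
            if set(cand.split()) <= words:
                parent[t] = cand
                break

for child, par in parent.items():
    print(f"{par}  <-  {child}")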
This document provides an overview of an automata theory course. The course will cover regular languages and their descriptors like finite automata and regular expressions. It will also cover context-free languages and their descriptors including context-free grammars and pushdown automata. Finally, the course examines recursive and recursively enumerable languages as well as intractable problems and the limits of computation.
Python questions in PDF for data science interviews: a question bank on Python for practice. On Reddit and Sanfoundry you will get random questions, but here they are in order, and the difficult questions are explained clearly.
The document discusses a novel approach to handling ellipsis, or omitted words, in domain-specific question answering systems. It classifies ellipsis into three types and proposes solutions for each. Type 1 is handled using prepositions, Type 2 uses grouping based on noun phrases, and Type 3 uses semantic relationships. The approach identifies complete queries, maps entities to the domain, analyzes queries to handle ellipsis in subsequent questions, and evaluates performance against other QA systems.
A General Method Applicable to the Search for Anglicisms in Russian Social Network Texts (Ilia Karpov)
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition systems, spell checking, tagging, and other natural language processing software, loan words are not easily recognized and should be handled separately. In this paper we present a corpus-based approach to the automatic detection of anglicisms in Russian social network texts. The proposed method is based on the idea of simultaneous scripting, phonetic, and semantic similarity between the original Latin word and its Cyrillic analogue. We used a set of transliteration, phonetic transcription, and morphological analysis methods to find possible hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good intersection with a manually collected dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
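A toy illustration of the core matching idea above: transliterate a Cyrillic candidate to Latin script and score its string similarity against an English word list. The tiny transliteration table, word list, and threshold are all assumptions; the actual paper also combines phonetic transcription, morphological analysis, and distributional filtering, none of which is reproduced here.

```python
from difflib import SequenceMatcher

# Tiny illustrative Cyrillic-to-Latin transliteration table (incomplete).
TRANSLIT = {"к": "k", "о": "o", "м": "m", "п": "p", "ь": "", "ю": "yu",
            "т": "t", "е": "e", "р": "r", "а": "a", "с": "s", "д": "d"}

def to_latin(word):
    return "".join(TRANSLIT.get(ch, ch) for ch in word.lower())

ENGLISH = ["computer", "trade", "transport"]   # stand-in dictionary

def anglicism_candidates(cyrillic_word, threshold=0.75):
    latin = to_latin(cyrillic_word)
    return [(en, round(SequenceMatcher(None, latin, en).ratio(), 2))
            for en in ENGLISH
            if SequenceMatcher(None, latin, en).ratio() >= threshold]

print(anglicism_candidates("компьютер"))  # matches "computer"
```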
[EMNLP] What is GloVe? Part I - Towards Data Science (Nikhil Jaiswal)
This document introduces GloVe (Global Vectors), a method for creating word embeddings that combines global matrix factorization and local context window models. It discusses how global matrix factorization uses singular value decomposition to reduce a term-frequency matrix to learn word vectors from global corpus statistics. It also explains how local context window models like skip-gram and CBOW learn word embeddings by predicting words from a fixed-size window of surrounding context words during training. GloVe aims to learn from both global co-occurrence patterns and local context to generate word vectors.
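For reference, the weighted least-squares objective from the published GloVe paper, which ties word vectors to the logarithm of the global co-occurrence counts X_{ij}:

\[
J = \sum_{i,j=1}^{V} f(X_{ij})\,\bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^{2}
\]

Here w_i and \tilde{w}_j are word and context vectors, b_i and \tilde{b}_j are biases, and f is a weighting function that caps the influence of very frequent pairs.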
This document presents two new approaches for aligning sentences in parallel English-Arabic corpora: mathematical regression (MR) and genetic algorithm (GA) classifiers. Feature vectors containing text features like length, punctuation score, and cognate score are extracted from sentence pairs and used to train the MR and GA models on manually aligned training data. The trained models are then tested on additional sentence pairs, achieving better results than a baseline length-based approach. The methods can be applied to any language pair by modifying the feature vector.
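A minimal sketch of the feature-extraction step described above (length, punctuation score, and cognate score per sentence pair). The scoring functions and toy data are simplified assumptions, and plain logistic regression stands in for the paper's MR and GA classifiers.

```python
import string
from sklearn.linear_model import LogisticRegression

def features(en, ar):
    """Feature vector for one English/Arabic sentence pair."""
    len_ratio = len(en) / max(len(ar), 1)                  # length feature
    punct = lambda s: sum(c in string.punctuation for c in s)
    punct_score = abs(punct(en) - punct(ar))               # punctuation feature
    # Cognate score: tokens shared across scripts (numbers, symbols, etc.).
    shared = len(set(en.split()) & set(ar.split()))
    return [len_ratio, punct_score, shared]

# Toy manually aligned training data: (en, ar, is_aligned).
train = [("He was born in 1970.", "ولد في عام 1970.", 1),
         ("He was born in 1970.", "الطقس جميل اليوم", 0)]
X = [features(en, ar) for en, ar, _ in train]
y = [lab for _, _, lab in train]
model = LogisticRegression().fit(X, y)
print(model.predict([features("The war ended in 1945.", "انتهت الحرب في 1945.")]))
```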
This document provides an overview of natural language processing (NLP) research trends presented at ACL 2020, including shifting away from large labeled datasets towards unsupervised and data augmentation techniques. It discusses the resurgence of retrieval models combined with language models, the focus on explainable NLP models, and reflections on current achievements and limitations in the field. Key papers on BERT and XLNet are summarized, outlining their main ideas and achievements in advancing the state-of-the-art on various NLP tasks.
French machine reading for question answering (Ali Kabbadj)
This paper proposes to remove the main barrier to machine reading and comprehension of French natural language texts. This opens the way for a machine to find, for a given question, a precise answer buried in a mass of unstructured French text, or to create a universal French chatbot. Deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering, and language translation. But to be effective, deep learning methods need very large training datasets, and until now these techniques could not be used for French question answering (Q&A) applications since there was no large French Q&A training dataset. We produced a large (100,000+) French training dataset for Q&A by translating and adapting the English SQuAD v1.1 dataset, along with GloVe French word and character embedding vectors from a French Wikipedia dump. We trained and evaluated three different Q&A neural network architectures in French and obtained French Q&A models with F1 scores around 70%.
Computational model, language and grammar: BNF (Taha Shakeel)
Computational models help answer whether tasks can be carried out by computers and how. Grammars are used to generate and recognize languages and provide models for natural and programming languages. BNF (Backus-Naur Form) is a metalanguage that describes syntax through rewrite rules and was developed by John Backus and Peter Naur to describe programming languages. It is used widely in compiler construction.
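A small worked example of the idea: a BNF grammar for signed integers (in the comments) and a hand-written recursive-descent recognizer for it. The grammar itself is an illustrative assumption, not taken from the document.

```python
# BNF for signed integers:
#   <integer> ::= <sign> <digits> | <digits>
#   <sign>    ::= "+" | "-"
#   <digits>  ::= <digit> | <digit> <digits>
#   <digit>   ::= "0" | "1" | ... | "9"

def is_integer(s):
    """Recursive-descent check that s derives from <integer>."""
    if s and s[0] in "+-":          # <sign> <digits>
        s = s[1:]
    return is_digits(s)

def is_digits(s):                   # <digit> | <digit> <digits>
    if not s or s[0] not in "0123456789":
        return False
    return len(s) == 1 or is_digits(s[1:])

for probe in ["42", "-7", "+", "3a"]:
    print(probe, is_integer(probe))
```

Each function corresponds to one nonterminal, which is exactly how BNF rules map onto a recursive-descent parser in compiler construction.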
Introduction to Ontology Concepts and Terminology (Steven Miller)
The document introduces an ontology tutorial that will cover basic concepts of the Semantic Web, Linked Data, and the Resource Description Framework data model as well as the ontology languages RDFS and OWL. The tutorial is intended for information professionals who want to gain an introductory understanding of ontologies, ontology concepts, and terminology. The tutorial will explain how to model and structure data as RDF triples and create basic RDFS ontologies.
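A minimal sketch of modeling data as RDF triples, using the Python rdflib package. The ex: namespace and the names in it are invented for illustration.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")   # hypothetical vocabulary
g = Graph()

# A tiny RDFS ontology: a class and a typed, labeled instance.
g.add((EX.Book, RDF.type, RDFS.Class))
g.add((EX.MobyDick, RDF.type, EX.Book))
g.add((EX.MobyDick, RDFS.label, Literal("Moby-Dick")))

# Every statement is a (subject, predicate, object) triple.
for s, p, o in g:
    print(s, p, o)
```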
ESR10 Joachim Daiber - EXPERT Summer School - Malaga 2015 (RIILP)
The document discusses using syntactic preordering models to delimit the morphosyntactic search space for machine translation of morphologically rich languages. It explores preordering dependency trees of the source language to reduce word order variations and predicting morphological attributes on the source side to inform target language word selection. Experimental results show that non-local features and jointly learning which attributes to predict can improve translation performance over baselines. The work aims to combine preordering and morphology prediction to better exploit interactions between syntactic structure and inflectional properties.
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange (Estelle Delpech)
Material presented at the TKE (Terminology and Knowledge Engineering) Conference 2010, Dublin, Ireland.
Download paper at http://hal.archives-ouvertes.fr/hal-00544403
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina.
Camilla López is a 22-year-old student at the Escuela de Comunicación Mónica Herrera. She enjoys design, photography, music, food, and travel. She likes finding creative solutions to difficult situations using her wit and ingenuity.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive functioning. Exercise has also been shown to increase gray matter volume in the brain and reduce risks for conditions like Alzheimer's disease and dementia.
The document describes the principles of fracture treatment, including the reduction, retention, and immobilization of stable fractures with bandages or casts, and the surgical treatment of unstable or irreducible fractures using external fixation, intramedullary nails, or osteosynthesis with plates and screws. It also covers the treatment of articular fractures, open fractures, and the different grades of open fractures.
My goals have been:
- focusing on several project areas where you can use JRuby successfully
- sharing the experience I have gained using Ruby in recent years
- proving that things can be done more easily than they are in typical Java projects
This document provides an introduction to networks. It discusses how networks are used to model relationships between entities in various domains, including social networks, protein interactions, and infrastructure networks. It also describes some key concepts in network analysis, such as degree distribution, shortest paths, centrality measures, and topological properties like the small-world and scale-free networks commonly seen in real-world systems.
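A short networkx sketch of the concepts listed above (degree, shortest paths, centrality, clustering) on a toy graph; the graph itself is invented for illustration.

```python
import networkx as nx

# Toy social network.
G = nx.Graph([("ann", "bob"), ("bob", "cat"), ("cat", "dan"),
              ("ann", "cat"), ("dan", "eve")])

print(dict(G.degree()))                     # degree of each node
print(nx.shortest_path(G, "ann", "eve"))    # a shortest path
print(nx.betweenness_centrality(G))         # a centrality measure
print(nx.average_clustering(G))             # an ingredient of small-world structure
```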
Rule-based approach to sentiment analysis at ROMIP 2011 (Dmitry Kan)
The document describes a rule-based approach to sentiment analysis of Russian language texts. It uses linguistic rules and dictionaries of positive and negative words to classify text segments as positive, negative, or neutral. The algorithm performs shallow parsing and applies rules about negation, conjunctions, and sentiment combinations. It achieved 90% precision on positive classifications for cases where annotators agreed, and was able to classify sentiment at the subclause, sentence, and full text levels. The approach ranked 14th out of 27 systems on a movie reviews dataset for binary classification and 14th out of 21 for 3-class classification.
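A toy sketch of the dictionary-plus-negation idea described above. The word lists and the single negation rule are stand-ins for the system's full Russian dictionaries and rule set, which are not reproduced here.

```python
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}
NEGATORS = {"not", "never", "no"}

def clause_sentiment(tokens):
    """Score a clause: +1/-1 per dictionary hit, flipped after a negator."""
    score, negated = 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negated = True
            continue
        polarity = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        score += -polarity if negated else polarity
        if polarity:
            negated = False          # negation consumed by the first hit
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(clause_sentiment("the movie was not good".split()))   # negative
print(clause_sentiment("an excellent film".split()))        # positive
```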
Poster: Method for an automatic generation of a semantic-level contextual translational dictionary (Dmitry Kan)
The document describes a method for automatically generating a semantic-level contextual translational dictionary as the key component of a machine translation system that combines rule-based and statistical approaches. The method uses parallel corpora and statistical word alignment to extract context examples and represent them with semantic formulas in dictionary entries. The machine translation system relies on these dictionary entries containing semantic attributes of words and is designed to be automatically extended through acquiring more corpora and applying the word alignment and semantic analysis method.
FiBAN gathers yearly statistics on Finnish business angel activity. In 2015, business angels invested over €37M in 322 companies, which is 15% of the total €253M invested in Finland.
Finland has one of the most active and largest angel networks in Europe, and FiBAN was named "European business angel network" of the year in both 2012 and 2015.
FiBAN’s board and office are glad to answer any additional questions about angel investing and to present current activity at events.
www.fiban.org/contact
Claes Mikko Nilsen
Network Manager, FiBAN
This document discusses leveraging Groovy for capturing business rules through domain-specific languages (DSLs). It begins with introductions to DSLs and Groovy, explaining their goals and advantages. Examples are provided of using Groovy to remove boilerplate code from Java programs and create internal DSLs. The document demonstrates how Groovy features like closures and meta-programming enable the creation of DSLs for expressing business rules in a natural, domain-focused way.
The document discusses several different machine learning approaches to plain text information extraction, including SRV, RAPIER, WHISK, AutoSlog, and CRYSTAL. These systems use both top-down and bottom-up approaches to induce rules or patterns for extracting structured information from unstructured text. The document compares the different systems and their rule representations, learning algorithms, experiments and performance on various information extraction tasks.
This document discusses research on automated text summarization. It defines a summary as a shorter text that retains the key information from the original text(s). There are typically three stages to automated summarization: topic identification to extract important units, interpretation to fuse concepts using external knowledge, and generation to produce coherent readable text. Various methods are reviewed for the topic identification stage, including analyzing positional, cue phrase, frequency-based, title overlap, and discourse structure criteria. Combining the scores from different methods improves performance over using a single method alone.
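A minimal sketch of combining two of the topic-identification criteria mentioned above (sentence position and word frequency) into one extraction score. The weights and the toy document are arbitrary assumptions.

```python
from collections import Counter

def summarize(sentences, k=1, w_pos=0.5, w_freq=0.5):
    """Rank sentences by a weighted sum of position and frequency scores."""
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)
    scores = []
    for i, s in enumerate(sentences):
        pos_score = 1.0 - i / len(sentences)            # earlier is better
        toks = s.lower().split()
        freq_score = sum(freq[w] for w in toks) / len(toks)
        scores.append(w_pos * pos_score + w_freq * freq_score)
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]

doc = ["The summit addressed climate policy.",
       "Delegates debated climate targets all week.",
       "Lunch was served at noon."]
print(summarize(doc))
```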
The document describes the Bondec system, a sentence boundary detection system with three applications: Rule-based, HMM, and Maximum Entropy. The Maximum Entropy model is the central part of the system and achieved an error rate of less than 2% on part of the Wall Street Journal corpus using only eight binary features. The document discusses related research on machine learning approaches for sentence boundary disambiguation and describes the authors' approach using Maximum Entropy modeling, which maximizes the conditional entropy of predictions while satisfying constraints from training data.
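A sketch in the spirit of that approach: binary features around each candidate period, fed to a maximum-entropy classifier (logistic regression here). The summary does not list Bondec's eight actual features, so the four below are assumptions.

```python
from sklearn.linear_model import LogisticRegression

def period_features(text, i):
    """Binary features for the candidate boundary at text[i] == '.'."""
    prev_tok = text[:i].split()[-1] if text[:i].split() else ""
    nxt = text[i + 1:].lstrip()
    return [
        int(nxt[:1].isupper()),          # next char is capitalized
        int(len(prev_tok) <= 2),         # previous token is short (abbreviation?)
        int(prev_tok[:1].isupper()),     # previous token is capitalized
        int(nxt[:1].isdigit()),          # followed by a digit
    ]

# Tiny training set: (text, index of '.', is_real_boundary).
samples = [("He left. She stayed.", 7, 1),
           ("Mr. Smith arrived.", 2, 0),
           ("It cost 3. 50 dollars", 9, 0),
           ("The end. A new start.", 7, 1)]
X = [period_features(t, i) for t, i, _ in samples]
y = [lab for _, _, lab in samples]
clf = LogisticRegression().fit(X, y)
print(clf.predict([period_features("She won. The crowd cheered.", 7)]))
```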
This document presents an algorithm for interactively learning monotone Boolean functions. The algorithm is based on Hansel's lemma, which states that algorithms based on finding maximal upper zeros and minimal lower units are optimal for learning monotone Boolean functions. The algorithm allows decreasing the number of queries needed to learn non-monotone functions that can be represented as combinations of monotone functions. The effectiveness of the approach is demonstrated through computational experiments in engineering and medical applications.
This document presents an algorithm for interactively learning monotone Boolean functions from examples. The algorithm is based on Hansel's lemma, which states that the optimal number of queries needed to learn a monotone Boolean function of n variables is O(n). The algorithm learns the target function by finding its maximal upper zeros and minimal lower units, representing the borders of the negative and positive patterns, respectively. The algorithm is optimal in the sense that it minimizes the maximum number of queries needed for any monotone Boolean function.
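A toy sketch of learning a monotone Boolean function through membership queries, recovering its minimal true points (the "minimal lower units" the summaries mention). It queries all inputs by brute force; Hansel's chain construction, which is what makes the real algorithm query-optimal, is not reproduced here.

```python
from itertools import product

def learn_minimal_units(f, n):
    """Find the minimal inputs on which the monotone function f is 1."""
    ones = [x for x in product((0, 1), repeat=n) if f(x)]   # membership queries
    def covers(a, b):        # a <= b componentwise, a != b
        return a != b and all(ai <= bi for ai, bi in zip(a, b))
    return [x for x in ones if not any(covers(y, x) for y in ones)]

# Target: the monotone function (x1 AND x2) OR x3, unknown to the learner.
target = lambda x: (x[0] and x[1]) or x[2]
print(learn_minimal_units(target, 3))   # [(0, 0, 1), (1, 1, 0)]
```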
Summarization in Computational Linguistics (Ahmad Mashhood)
The document discusses summarization of single documents, specifically technical articles. It defines summarization as presenting a significant portion of a text's information in a shorter, abridged form. The advancement of computer processing systems and natural language processing enabled automated summarization through tools that can produce abstractive summaries. Automated summarization aims to generate summaries similar to human summaries to help address the large amount of online information. Single document summarization of technical articles typically involves extracting sentences while reorganizing and modifying them to form a coherent summary.
The document discusses two NSF-funded research projects on intelligence and security informatics:
1. A project to filter and monitor message streams to detect "new events" and changes in topics or activity levels. It describes the technical challenges and components of automatic message processing.
2. A project called HITIQA to develop high-quality interactive question answering. It describes the team members and key research issues like question semantics, human-computer dialogue, and information quality metrics.
The document provides guidance on writing a scientific paper like a professional through a six-phase process. The phases include outlining the key results, methods, and implications; expanding the outlines into paragraphs for each section; refining the writing; polishing at the sentence and paragraph level for clarity; ensuring the paper presents a cohesive story; and completing elements like the abstract, conclusion, and references. Key advice includes thinking about the reader, using a logical structure, and focusing on presenting results in a clear and convincing manner.
Strict intersection types for the lambda calculus (unyil96)
This article discusses strict intersection types for the lambda calculus. It focuses on an essential intersection type assignment system (E) that is almost syntax directed. The system E is shown to satisfy all major properties of the Barendregt-Coppo-Dezani type system (BCD), including the approximation theorem, characterization of normalization, completeness of type assignment using filter semantics, strong normalization for cut-elimination, and the principal pair property. Some proofs of these properties for E are new. E is a true restriction of BCD and provides a less complicated approach than BCD while achieving the same results.
The document proposes a method for automatically generating questions from sentences by performing sentence simplification. It involves two main steps - identifying potential answer phrases in the sentence, and generating simplified versions of the sentence focused around each answer phrase. A classifier is trained to identify answer phrases using syntactic and semantic features. Sentence simplification is done by pruning dependencies from the sentence's parse tree in a way that preserves the identified answer phrases, resulting in multiple simplified statements from the original sentence that can be transformed into questions. Evaluation shows the classifier achieves over 70% accuracy in identifying answer phrases.
This chapter discusses models and research methods related to memory. It covers the main processes involved in memory including encoding, storage, and retrieval. It also describes different types of memory tests like recall and recognition tasks. Several influential models of memory are introduced, such as the multi-store model, levels of processing model, working memory model, and multiple memory systems model. Key research on sensory memory, short-term memory, long-term memory, and amnesia is summarized.
This chapter discusses models and research methods related to memory. It describes the processes of encoding, storage, and retrieval in memory. It also discusses different types of memory tests like recall and recognition tasks. The chapter reviews research on sensory memory, short-term memory, and long-term memory. It also summarizes several influential models of memory including Atkinson and Shiffrin's multi-store model, Craik and Lockhart's levels of processing model, Baddeley's working memory model, and Tulving's multiple memory systems model.
This chapter discusses models and research methods related to memory. It describes the processes of encoding, storage, and retrieval in memory. It also discusses different types of memory tests like recall and recognition tasks. The chapter reviews research on sensory memory, short-term memory, and long-term memory. It also summarizes several influential models of memory including Atkinson and Shiffrin's multi-store model, Craik and Lockhart's levels of processing model, Baddeley's working memory model, and Tulving's multiple memory systems model.
The document describes latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by distributions over words. It is a three-level hierarchical Bayesian model where documents are generated by first sampling a per-document topic distribution from a Dirichlet prior, then repeatedly sampling topics and words from these distributions. LDA addresses limitations of previous models by capturing statistical structure within and between documents through the hierarchical Bayesian formulation.
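For concreteness, the three-level generative process the summary refers to, in standard notation: each document draws a topic mixture from a Dirichlet prior, and each word draws a topic and then a word from that topic,

\[
\theta_d \sim \mathrm{Dir}(\alpha), \qquad z_{d,n} \sim \mathrm{Mult}(\theta_d), \qquad w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})
\]

where \alpha is the Dirichlet prior and \beta_k is topic k's distribution over words.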
Bland, Paul E.
Rings and Their Modules / by Paul E. Bland. (De Gruyter Textbook)
Includes bibliographical references and index.
ISBN 978-3-11-025022-0 (alk. paper)
1. Rings (Algebra) 2. Modules (Algebra) I. Title.
QA247.B545 2011
512'.4-dc22
2010034731
Concept and example of a semantic solution implemented with SQL views, which lets users cooperate on queries over structured data independently of database schema knowledge and technology.
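A minimal sqlite3 sketch of the idea: a view exposes a stable, semantic-level vocabulary so user queries survive the physical schema's details. The table and column names are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Physical schema with unfriendly names (hypothetical).
con.execute("CREATE TABLE t_emp (emp_nm TEXT, dpt_cd TEXT, sal_amt REAL)")
con.execute("INSERT INTO t_emp VALUES ('Ada', 'ENG', 95000)")

# Semantic layer: users query the view, never the physical table.
con.execute("""CREATE VIEW employee AS
               SELECT emp_nm AS name, dpt_cd AS department, sal_amt AS salary
               FROM t_emp""")

print(con.execute("SELECT name, salary FROM employee").fetchall())
```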
The document summarizes a study that used lexical frequency software to analyze and compare the writing styles of native English speakers and advanced French-speaking English learners. The software generated frequency profiles of word categories and individual words. The analysis found that learner writing overused determiners, pronouns, and adverbs, while underusing conjunctions, prepositions, and nouns compared to native writing. More detailed analysis revealed specific words that were significantly over- or underused, such as learners overusing the pronoun "I" and underusing subordinating conjunctions. The study aims to demonstrate how automatic profiling can reveal stylistic characteristics of learner language.
HYPONYMY EXTRACTION OF DOMAIN ONTOLOGY CONCEPT BASED ON CCRFS AND HIERARCHY CLUSTERING (dannyijwest)
The document describes a method for extracting hyponymy (hierarchical) relationships between domain concepts in an ontology. It uses Cascaded Conditional Random Fields (CCRFs) to identify concepts in text, then performs hierarchy clustering on the concepts to identify hyponymy relationships. CCRFs use a two-layer approach, with the first layer identifying simple concepts and the second layer identifying nested concepts. Hierarchy clustering represents concepts as vectors based on co-occurrence and calculates similarity to group concepts into a taxonomy. The method aims to automatically construct ontology hierarchies from free text.
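A minimal sketch of the clustering step as summarized above: concepts become co-occurrence vectors and agglomerative (hierarchical) clustering groups them by similarity. The vectors and parameters are toy assumptions, and the CCRF extraction step is not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

concepts = ["engine", "piston", "valve", "wheel", "tire"]
# Toy co-occurrence vectors (rows: concepts, columns: context terms).
X = np.array([[5, 4, 1, 0],    # engine
              [4, 5, 0, 1],    # piston
              [4, 3, 1, 1],    # valve
              [0, 1, 5, 4],    # wheel
              [1, 0, 4, 5]])   # tire

Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")
for c, lab in zip(concepts, labels):
    print(lab, c)
```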
The document discusses Lin Ma's PhD research on analyzing presuppositions in natural language requirements. Presuppositions are implicit commitments in language that simplify communication but can cause misunderstanding if not made explicit. The research aims to automatically detect presuppositions triggered by definite descriptions in requirements and identify which are not explicitly stated. It will use natural language processing techniques and knowledge sources to classify definite descriptions and analyze how presuppositions project in requirements texts.
Découverte du Traitement Automatique des Langues (Estelle Delpech)
Talk given at the "Toulouse Data Science" meet-up.
The talk is an introduction to the field of natural language processing (also known as TAL, NLP, text mining, or semantic analysis). It is aimed at a general audience (computer scientists, statisticians, linguists, managers, and the curious).
Corpus comparables et traduction assistée par ordinateur, contributions à la... (Estelle Delpech)
PhD thesis defense in Computer Science, specialty Natural Language Processing.
Defended on July 2, 2013, at the Université de Nantes.
Thesis manuscript available here: http://tel.archives-ouvertes.fr/tel-00905930
Identification de compatibilités sémantiques entre descripteurs de lieux (Estelle Delpech)
Presentation given at the 13th Conférence Francophone sur l'Extraction et la Gestion des Connaissances, on 31/12/2013, Toulouse, France.
Video: http://www.canalc2.tv/video.asp?idVideo=11682
Associated paper: http://hal.archives-ouvertes.fr/hal-00912332
Usage du TAL dans des applications industrielles : gestion des contenus multi... (Estelle Delpech)
Lecture given as part of the Master's program Ergonomie Cognitive et Ingénierie Linguistique (ECIL 2012), course UE 352, "Production, gestion et exploitation de documents textuels", Université de Toulouse Le Mirail, Toulouse, France.
Institution: Nomao
Nomao: local search and recommendation engine (Estelle Delpech)
Nomao is a local search engine that uses social data and personalized search results to recommend places to users. It aggregates information from multiple sources, processes the content using natural language processing and data mining, and generates summaries of places. Current features include collaborative filtering to recommend places liked by similar users, user profiling to suggest places based on interests, and place merging, term classification, and summary generation from content. The company aims to expand its user base through better integration with Facebook and early adopter targeting.
Extraction of domain-specific bilingual lexicon from comparable corpora: comp... (Estelle Delpech)
Material presented at the 24th International Conference on Computational Linguistics (COLING 2012), Mumbai, India.
Paper download at http://hal.archives-ouvertes.fr/hal-00743807.
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina, Gremuts.
Identification of Fertile Translations in Comparable Corpora: a Morpho-Compos... (Estelle Delpech)
Material presented at the Tenth Biennial Conference of the
Association for Machine Translation in the Americas (AMTA 2012), San Diego, CA.
Download paper at http://hal.archives-ouvertes.fr/hal-00730325.
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina, Gremuts
Applicative evaluation of bilingual terminologies (Estelle Delpech)
Material presented at the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011), Riga, Latvia.
Download paper: http://hal.archives-ouvertes.fr/hal-00585187
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina
Évaluation applicative des terminologies destinées à la traduction spécialisée (Estelle Delpech)
Presentation given at the 7th workshop "Qualité des données et des connaissances, évaluation des méthodes d'extraction de données" (2011), Brest, France.
Associated papers:
- http://hal.archives-ouvertes.fr/hal-00912320 (workshop proceedings)
- http://hal.archives-ouvertes.fr/hal-00605304 (RNTI journal)
Institutions: Laboratoire d'Informatique de Nantes Atlantique, Lingua et Machina
Material of the 4th Intensive Summer school and collaborative workshop on Natural Language Processing (NAIST Franco-Thai Workshop 2010).
Bangkok, Thailand.
Institution: Institut de Recherche en Informatique de Toulouse (IRIT), Lingua et Machina
Material of the 4th Intensive Summer school and collaborative workshop on Natural Language Processing (NAIST Franco-Thai Workshop 2010).
Bangkok, Thailand.
Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team.
August 30-31, 2007.
Patan Dhoka, Lalitpur, Nepal.
Text Processing for Procedural Question Answering (Estelle Delpech)
Material of the Natural Language Processing (NLP) Workshop with STIC-Asia representatives and the Nepal team.
August 30-31, 2007.
Institution: Institut de Recherche en Informatique de Toulouse (IRIT)
Patan Dhoka, Lalitpur, Nepal.
The Microsoft 365 Migration Tutorial For Beginner.pptx (operationspcvita)
This presentation will help you understand the power of Microsoft 365. It covers every productivity app included in Office 365, outlines common migration scenarios related to Office 365, and explains how we can help you.
You can also read: https://www.systoolsgroup.com/updates/office-365-tenant-to-tenant-migration-step-by-step-complete-guide/
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
inQuba Webinar: Mastering Customer Journey Management with Dr Graham Hill - LizaNolte
HERE IS YOUR WEBINAR CONTENT! 'Mastering Customer Journey Management with Dr. Graham Hill'. We hope you find the webinar recording both insightful and enjoyable.
In this webinar, we explored essential aspects of Customer Journey Management and personalization. Here’s a summary of the key insights and topics discussed:
Key Takeaways:
Understanding the Customer Journey: Dr. Hill emphasized the importance of mapping and understanding the complete customer journey to identify touchpoints and opportunities for improvement.
Personalization Strategies: We discussed how to leverage data and insights to create personalized experiences that resonate with customers.
Technology Integration: Insights were shared on how inQuba’s advanced technology can streamline customer interactions and drive operational efficiency.
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application... - Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol, built on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme; the difficulty is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Skybuffer SAM4U tool for SAP license adoption - Tatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, SAP's free software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
5th LF Energy Power Grid Model Meet-up Slides - DanBrown980551
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
Dandelion Hashtable: beyond billion requests per second on a commodity server - Antonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite optimization efforts that go as far as sacrificing core functionality, state-of-the-art hashtable designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state of the art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server with a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
AppSec PNW: Android and iOS Application Security with MobSF - Ajin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
In this talk we will discuss DDoS protection tools and best practices, network architectures, and what AWS has to offer. We will also look into one of the largest DDoS attacks on Ukrainian infrastructure, which happened in February 2022, and see what techniques helped keep web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on the Ukrainian experience.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor Ivaniuk
Robust rule-based parsing
1. Robust rule-based parsing
(quick overview)
I. Robustness
II. Three robust rule-based parsers of English
III. Common features
IV. Example: identification of subjects in Syntex
2. I. Robustness
(Aït-Mokhtar et al. 1997)
« the ability to provide useful analyses for real-world input text. By useful analyses, we mean analyses that are (at least partially) correct and usable in some automatic task or application »
This implies:
- one analysis (even partial) for any real-world input
- the ability to process irregular input and to overcome errors of analysis
- efficiency
3. I. Types of robust parsers
(Aït-Mokhtar et al. 1997)
- parsers based on traditional theoretical models, with rule-based and/or stochastic post-processing:
  Minipar (Lin 1995)
- stochastic parsers:
  Charniak's parser (2000)
- rule-based parsers:
  Non-Projective Dependency Parser (Järvinen & Tapanainen 1997)
  Syntex (Bourigault 2007)
  Cass (Abney 1990, 1995)
Most parsers are hybrid.
4. II.1 Non-Projective Dependency Parser
(Tapanainen & Järvinen 1997)
Pipeline: Tagged Text → Syntactic Labeling → Selection of syntactic links → Pruning → OUTPUT
- Syntactic Labeling: « all legitimate surface-syntactic labels are added to the set of morphological readings »
- Selection of syntactic links, using valency and subcategorization information: « syntactic rules discard contextually illegitimate alternatives or select legitimate ones »
- Pruning: general heuristics disambiguate the last of the syntactic links
5. II.1 Non-Projective Dependency Parser
(Tapanainen & Järvinen 1997)
Rules establish dependency links between words.
Rules are contextual:
SELECT (@SUBJ)
IF (1C AUXMOD HEAD);
If the preceding word is an unambiguous auxiliary, the current word is the subject of this auxiliary.
Example: in « How do you do ? », the first « do » is an unambiguous auxiliary (AUX), so « you » is labelled as its subject (SUBJ).
Rules use syntactic links established by preceding rules.
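To make the rule format concrete, here is a minimal Python sketch of how such a contextual subject rule could be applied to a tagged sentence. The Token class, the tag names, and the one-rule loop are illustrative assumptions, not the parser's actual formalism.

from dataclasses import dataclass

# Minimal sketch of a contextual dependency rule, in the spirit of
# SELECT (@SUBJ) IF (1C AUXMOD HEAD);
# Token, the tag names and the SUBJ label are illustrative, not the
# parser's own machinery.

@dataclass
class Token:
    form: str
    tags: set          # morphological readings; a single tag = unambiguous
    label: str = ""    # surface-syntactic label selected so far

def apply_subject_rule(tokens):
    # If the immediately preceding token is an unambiguous auxiliary,
    # select SUBJ for the current (pronoun) token.
    for i in range(1, len(tokens)):
        prev, cur = tokens[i - 1], tokens[i]
        if prev.tags == {"AUX"} and "PRON" in cur.tags:
            cur.label = "SUBJ"
    return tokens

# « How do you do ? » : the first "do" is an unambiguous auxiliary,
# so "you" is selected as its subject.
sent = [Token("How", {"ADV"}), Token("do", {"AUX"}),
        Token("you", {"PRON"}), Token("do", {"VERB", "AUX"}),
        Token("?", {"PUNCT"})]
for t in apply_subject_rule(sent):
    print(t.form, t.label or "-")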
6. II.2 Syntex
(Bourigault 2007)
Pipeline: Tagged Text → chunks → Object, Subject → Prepositional Attachment → OUTPUT
- Chunking builds verb chunks (« he will leave »), non-recursive NPs (« the man », « happy tree friends ») and non-recursive SPs (« from Paris »).
- Object and Subject identification uses endogenous and exogenous subcategorization information.
- Prepositional attachment also uses endogenous and exogenous subcategorization information: in « This is the man from Paris », it must decide whether « from Paris » attaches to « is » or to « the man ».
7. II.2 Syntex (Bourigault 2007)
One module per syntactic relation.
Each module processes the sentence from left to right.
Like the Non-Projective Dependency Parser, the rules:
- establish dependency relations between words
- are contextual
- use syntactic links established by preceding rules
The identification of a dependency link is formulated as a « path » to be followed through the existing links and grammatical categories, from governor to dependent or from dependent to governor.
Ambiguous relations: selection of potential governors + disambiguation with probabilities.
Example: « Those who think they are interested in water supply must vote »
8. II.3 Cass (Abney 1990, 1995)
Pipeline: Tagged Text → CHUNK FILTER (NP filter, Chunk filter) → CLAUSE FILTER (Raw Clause filter, Clause Repair filter) → PARSE FILTER → OUTPUT
- Chunk filter: builds non-recursive chunks whose internal structure remains ambiguous:
[NP the happy tree friends] [VP will leave] [SP from [NP the happy tree friends]]
- Raw Clause filter: finds the subject-predicate relation and the beginning and end of simplex clauses:
[SUBJ This] [PRED is] [NP the man] [SP from Paris]
- Clause Repair filter: uses subcategorization information; repairs if no subject-predicate relation is found.
- Parse filter: assembles recursive structures:
[[This] [is] [NP the man] [SP from Paris]]
9. II.3 Cass (Abney 1990, 1995)
Each filter uses transducers, e.g.:
PP → (Prep|To)+(NP|Vbg)
Use of repair (also used in Syntex and the NPDP, but less explicitly):
- each filter makes a decision (determinism), the safest one in case of ambiguity: « ambiguity is not propagated downstream »
- « repair consists in directly modifying erroneous structure without regard to the history of computation that produced the structure »
- « when errors become apparent downstream, the parser attempts to repair them »
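As an illustration of the transducer idea, the following minimal Python sketch runs one chunking pass as a regular expression over the tag sequence. The tag names, the pattern, and the span format are illustrative assumptions, not Abney's actual machinery.

import re

# Minimal sketch of one finite-state chunking pass, in the spirit of a
# Cass-style transducer such as PP -> (Prep|To)+(NP|Vbg).

def chunk_pp(tagged):
    # tagged: list of (word, tag) pairs over already-built chunks.
    tags = " ".join(tag for _, tag in tagged)
    # One or more prepositions followed by an NP chunk or a gerund.
    pattern = re.compile(r"(?:Prep |To )+(?:NP|Vbg)")
    spans = []
    for m in pattern.finditer(tags):
        # Map character offsets in the tag string back to token indices.
        start = tags[: m.start()].count(" ")
        end = tags[: m.end()].count(" ")
        spans.append((start, end))   # inclusive token span of the PP
    return spans

print(chunk_pp([("from", "Prep"), ("Paris", "NP")]))   # [(0, 1)]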
10. II.3 Cass (Abney 1990, 1995)
Example of repair:
« In South Australia beds of boulders were deposited … »
Erroneous structure output by the Chunk filter:
[SP In [NP South Australia beds]] [SP of [NP boulders]] [VP were deposited]
Raw Clause filter: no subject is found.
The Repair filter tries to find a subject by modifying the structure:
[SP In [NP South Australia]] [NP-SUBJ beds] [SP of [NP boulders]] [VP were deposited]
11. III. Common features: incrementality
The parsing task is divided into subtasks:
- reduces the overall complexity of the main task: « factoring the problem into a sequence of small, well defined questions » (Abney 1990)
The sentence is parsed in several phases, each phase producing an intermediate structure:
- allows each phase to use the syntactic information left by the preceding phase: « the level of abstraction produced during the 1st phase (...) facilitates the description of deeper syntactic relations » (Aït-Mokhtar et al. 1997)
- ease of maintenance
- problem of circularity: it is difficult to choose in which order the relations should be identified (Bourigault 2007)
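A minimal Python sketch of this incremental organisation, assuming a toy link format and toy phase rules (none of it is taken from the three parsers themselves): each phase reads the links left by earlier phases and adds its own.

# Each phase receives the sentence plus the links added so far and
# returns new links. Phase rules and the link format are illustrative.

def chunk_phase(tokens, links):
    # Toy rule: group adjacent Det+Noun pairs into NP chunks.
    return [("NP", i, i + 1)
            for i, (_, tag) in enumerate(tokens[:-1])
            if tag == "Det" and tokens[i + 1][1] == "Noun"]

def subject_phase(tokens, links):
    # Toy rule: the head noun of an NP chunk immediately before a verb
    # becomes the subject of that verb (reuses the NP links).
    return [("SUBJ", end, end + 1)
            for kind, start, end in links
            if kind == "NP" and end + 1 < len(tokens)
            and tokens[end + 1][1] == "Verb"]

def parse(tokens):
    links = []
    # The order of phases is fixed, which is where the circularity
    # problem noted above comes from.
    for phase in (chunk_phase, subject_phase):
        links.extend(phase(tokens, links))
    return links

print(parse([("the", "Det"), ("man", "Noun"), ("left", "Verb")]))
# [('NP', 0, 1), ('SUBJ', 1, 2)]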
12. III. Common features: determinism and repair
- Each parsing phase yields one solution.
- In case of ambiguity, the safest choice is made, even when higher-level information would be needed to decide: ambiguity is not propagated downstream.
- Most regular errors can be repaired later on.
- ≠ parallelism, backtracking
« The salient performance is not errors vs no errors, but the tradeoff between speed and error rate » (Abney 1990)
13. III. Common features: no syntactic theory
Difference between:
- the theoretical study of the syntactic structures of language
- the automatic identification of grammatical relations in real-world texts
Difficulties in automatic syntactic analysis:
- lack of knowledge (semantics/pragmatics for disambiguation)
- deviation from the norm of the language
- errors of preceding processing steps
Instead of a theory: use of common grammatical knowledge, and hours of corpus observation to find clues for automatic identification.
14. III. Common features: implicit grammatical knowledge
Bipartite architecture:
- lexical information
- recognition routines
No independent declaration of grammatical knowledge; it is difficult or impossible to set apart:
- grammatical knowledge
- non grammar-based heuristics
No linguist / computer-scientist job separation: both linguistic and programming know-how are needed. This is a condition for scalability and robustness.
15. IV. Example: the subject relation in Syntex
The identification of the subject relation is formulated as a « path » through the already identified grammatical relations:
- start from the tensed verb
- move to the left
- stop when you encounter an ungoverned noun
Example: « The cost of technology takes time to shrink ». Starting from the tensed verb « takes » and moving left, the path follows the already identified links (NOMPREP: « technology » governed by « of »; PREP: « of » governed by « cost »; DET: « the » governed by « cost »; OBJ: « time » governed by « takes ») until it reaches the ungoverned noun « cost », which is identified as the SUBJECT.
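A minimal Python sketch of this leftward search, assuming a toy representation of tokens and governor links; the names and the link encoding are illustrative, not Syntex's.

# From the tensed verb, walk left, jumping over governed words via their
# governors, and stop at the first ungoverned noun.

def find_subject(tokens, governors, verb_index):
    # tokens: list of (word, tag); governors: dependent index -> governor index.
    i = verb_index - 1
    while i >= 0:
        word, tag = tokens[i]
        if tag == "Noun" and i not in governors:
            return i                     # ungoverned noun: the subject
        i = governors.get(i, i - 1)      # follow the link if governed,
                                         # otherwise simply move left
    return None

# « The cost of technology takes time to shrink »
tokens = [("the", "Det"), ("cost", "Noun"), ("of", "Prep"),
          ("technology", "Noun"), ("takes", "Verb"),
          ("time", "Noun"), ("to", "Prep"), ("shrink", "Verb")]
governors = {0: 1, 2: 1, 3: 2, 5: 4}     # DET, PREP, NOMPREP, OBJ links
print(tokens[find_subject(tokens, governors, 4)][0])   # cost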
16. IV. Using existing links
The subject might be far from the tensed verb, and many configurations are possible:
- « Initiatives leading to cessation of smoking in workplaces are adopted » (gerund, PP, PP)
- « Those who think they are interested in water supply must vote. » (clause, clause, PP)
- « No reference to the war, or to the alliance, should remain » (PP, conjunction, PP)
Existing links form dependency islands (~ syntagms or isolated words).
Following up the islands until a reasonable subject is found makes it possible to identify subjects without describing all possible configurations or doing too much computing.
17. IV. Ambiguities
- « Many persons have died in Darfur since the conflict began »
- « A person sitting on the death row since the age of 16 is not the same as before. »
- « Many adults believe education equates intelligence. »
- « Those who think they are interested in water supply must vote. »
When to stop? When to follow up? When to repair?
18. IV. Path decomposition
At each island, a decision is made by a dedicated sub-module (one type of island = one sub-module):
- follow up to the island on the left
- stop and identify a subject (without repair, or with repair)
- change path direction (to the right, or to any other position in the sentence)
- call another module
- stop and return failure
Decisions are encoded as if-then rules that may test:
- local and non-local context: lemmas, morphosyntactic tags, links, presence of commas…
- specific information left by other modules: encountered tags, activated modules…
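To make the decomposition concrete, here is a minimal Python sketch of per-island-type dispatch with if-then decisions. The island types, sub-module behaviour, and return codes are illustrative assumptions, not Syntex's actual inventory.

# One sub-module per island type; each returns a decision code telling
# the path search what to do next. Everything here is illustrative.

FOLLOW, STOP, CALL, FAIL = "follow", "stop", "call", "fail"

def pp_module(island, context):
    # If-then rule: keep following leftward past a PP island.
    return (FOLLOW, None)

def clause_module(island, context):
    # If-then rule: testing the context (e.g. a relative pronoun)
    # decides whether to cross the clause island or stop inside it.
    if context.get("relative_pronoun"):
        return (FOLLOW, None)
    return (STOP, island)

DISPATCH = {"PP": pp_module, "Clause": clause_module}

def decide(island_type, island, context):
    module = DISPATCH.get(island_type)
    return module(island, context) if module else (FAIL, None)

print(decide("PP", "in water supply", {}))                        # ('follow', None)
print(decide("Clause", "who think", {"relative_pronoun": True}))  # ('follow', None)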
19. IV. Path example: following up
« Korea who we believe to have WMD is safe from us. »
Moving left from « is », the path crosses PP and clause islands (handled by the PP module and the Clause module) before « Korea » is identified as the SUBJ.
Clause module rule: _ RelPron [[SUBJ Pron] Verb]
20. IV. Path example: repair
« Many adults believe education equates intelligence. »
« education » is first analysed as the OBJ of « believe »; when the Clause module reaches the second verb « equates », it repairs the analysis and re-labels « education » as the SUBJ of « equates »:
before repair: ## [[SUBJ Many adults] believe [OBJ education]] equates …
after repair: ## [[SUBJ Many adults] believe [[SUBJ education] equates [OBJ intelligence]]]
21. IV. Path example: sub-module call
« On the walls were scarlet banners »
Moving left from « were », the path crosses a PP island (PP module) and reaches the sentence boundary (Wall module). The Wall module rule ## [PP] Verb matches, so the InvertedSubject module is called, which finds the subject NP « scarlet banners » to the right of the verb.
22. IV. Path example: change path direction
« On the contrary, war hysteria was continuous and deliberate, and acts such as looting, murdering, the slaughters of prisoners, were considered as normal. »
Moving left from « were considered », the path crosses PP islands, a conjunction, adjectives and nouns (PP module, Clause module); the Commas module follows the comma-delimited stretch so that the coordinated subject « acts » can be found.
Adding the Commas module: +2.6 recall, -0.07 precision.
Other configurations handled:
- « All three political Parties at the federal level, and certainly at the provincial level in different sections, have parity clauses. »
- « Although no directive was ever issued, it was known that the chief of the Department intended that within one week no reference to the war with Eurasia, or to the alliance, should remain »
23. IV. Evaluation on the Susanne Corpus

             Tensed verb      Subject identification   Subject identification
             identification   (if tensed verb          (correct tensed verb
             (TreeTagger)     correct)                 and correct subject)
precision    94.87            94.56                    89.51
recall       89.76            90.84                    81.53
f-measure    92.24            92.66                    85.33

Only shallow subjects are evaluated; SUBJECT relations such as the following are not identified or evaluated:
- « I've never seen the dog hiding his bones. »
- « She wants me to clean my shoes »
- « The book is read by the boy »
24. Bibliography
Abney (1990): « Rapid Incremental Parsing with Repair », Proceedings of the 6th New OED Conference, University of Waterloo, Waterloo, Ontario.
Abney (1995): « Partial Parsing with Finite-State Cascades », Natural Language Engineering, Cambridge University Press. www.sfs.uni-tuebingen.de/~abney/StevenAbney.html#cass
Aït-Mokhtar et al. (1997): « Incremental Finite-State Parsing », Proceedings of ANLP-97, Washington.
Bourigault (2007): Syntex, analyseur syntaxique opérationnel, Thèse d'Habilitation à Diriger les Recherches, Université Toulouse - Le Mirail. w3.univ-tlse2.fr/erss/textes/pagespersos/bourigault/syntex.html
Charniak (2000): « A Maximum-Entropy-Inspired Parser », Proceedings of the North American Chapter of the Association for Computational Linguistics, pp. 132-139. http://www.cfilt.iitb.ac.in/~anupama/charniak.php
Lin (1995): « Dependency-based Evaluation of Minipar », Proceedings of IJCAI. http://www.cs.ualberta.ca/~lindek/downloads.htm
Tapanainen & Järvinen (1997): « A Dependency Parser for English », Technical Reports, No. TR-1, Department of General Linguistics, University of Helsinki, March 1997. www.connexor.com
TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
Evaluation corpus: ftp://ftp.cs.umanitoba.ca/pub/lindek/depeval