This document describes the embedding of NomLex-BR, a dictionary of Portuguese nominalizations, into OpenWordNet-PT. NomLex-BR relates nominal terms to their corresponding verbs. It contains over 2,539 entries from various sources. The integration aims to facilitate linguistic research and information extraction by connecting deverbal nouns to their verbs. Some issues in OpenWordNet-PT were also identified in the process, such as linking the noun "aviltamento" to the correct verb "aviltar". Future work includes further improvements to coverage and applications to test the resource.
Embedding NomLex-BR nominalizations into OpenWordnet-PT
1. Embedding NomLex-BR nominalizations into OpenWordnet-PT
Livy Maria Real Coelho (1), Alexandre Rademaker (2, 5), Valeria de Paiva (3), Gerard de Melo (4)
(1) UFP, (2) IBM Research, (3) Nuance Comms., (4) Tsinghua University, (5) FGV/EMAp
February 1, 2014
3. NomLex (cont.)
A dictionary of English nominalizations, developed under Catherine Macleod.
It relates the nominal complements to the arguments of the corresponding verb.
1025 entries of several types of lexical nominalizations.
Example: “Alexander’s destruction of the city happened in 330 BC.”
First version released on January 15, 1999; latest version (October 2001) downloadable from http://bit.ly/1aZWQmh
4. NomLex (cont.)
    (nom :orth "promotion"
         :verb "promote"
         :nom-type ((verb-nom))
         :verb-subj ((n-n-mod) (det-poss))
         :verb-subc ((nom-np :object ((det-poss) (n-n-mod) (pp-of)))
                     (nom-np-as-np :object ((det-poss) (pp-of)))
                     (nom-possing :nom-subc ((p-possing :pval ("of"))))
                     (nom-np-pp :object ((det-poss) (n-n-mod) (pp-of))
                                :pval ("into" "from" "for" "to"))
                     (nom-np-pp-pp :object ((det-poss) (n-n-mod) (pp-of))
                                   :pval ("for" "into" "to") :pval2 ("from"))))
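An entry like the one on this slide is a Lisp-style keyword list, so it can be read with a few lines of code. The following is a minimal sketch in Python, not the NomLex project's own tooling: the helper names (`tokenize`, `parse`, `keyword_args`) are hypothetical, and the entry is shortened to two of the subcategorization frames shown on the slide.

```python
import re

# Shortened copy of the slide's entry (only two :verb-subc frames kept).
ENTRY = '''
(nom :orth "promotion"
     :verb "promote"
     :nom-type ((verb-nom))
     :verb-subj ((n-n-mod) (det-poss))
     :verb-subc ((nom-np :object ((det-poss) (n-n-mod) (pp-of)))
                 (nom-np-pp :object ((det-poss) (n-n-mod) (pp-of))
                            :pval ("into" "from" "for" "to"))))
'''


def tokenize(text):
    """Split the text into parentheses, quoted strings and bare symbols."""
    return re.findall(r'\(|\)|"[^"]*"|[^\s()"]+', text)


def parse(tokens):
    """Recursive-descent reader that turns tokens into nested lists."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return items
    if token.startswith('"'):
        return token.strip('"')  # quoted string: remove the quotes
    return token  # bare symbol or :keyword


def keyword_args(entry):
    """Pair each :keyword after the head symbol with the value following it."""
    body = entry[1:]
    return dict(zip(body[0::2], body[1::2]))


fields = keyword_args(parse(tokenize(ENTRY)))
print(fields[":orth"], "->", fields[":verb"])
print([frame[0] for frame in fields[":verb-subc"]])
```

Reading the entry into nested lists keeps the original structure intact, which makes it straightforward to look up the verb for a noun or to list the subcategorization frames.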
5. Related Works
Nominalizations have been studied for more than four decades (Chomsky, 1970).
NomLex-Plus (Meyers et al., 2004): an extension of NomLex with 7,050 nominalizations.
The NomBank Project (Meyers, 2007), http://bit.ly/1d5G7L9: “mark the sets of arguments that co-occur with nouns in the PropBank Corpus, just as PropBank records such information for verbs... firmly on the shoulders of NOMLEX...”
Berkeley FrameNet (https://framenet.icsi.berkeley.edu/): 11,600 lexical units based on frame semantics, supported by corpus evidence. Deverbal nominalizations are annotated as events (in the frame of the verb) or as entities/results (in a different semantic frame).
FrameNet-Brazil, http://www.ufjf.br/framenetbr/.
6. Main use for NLP (IE)
To write mappings between IE patterns for active clauses and IE patterns for nominalizations.
Active clause: “IBM appointed Alice Smith as vice president”.
Nominalizations: “IBM’s appointment of Alice Smith as vice president” and “Alice Smith’s appointment as vice president”.
7. Main use for NLP (IE) (cont.)
The Proteus Extraction System starts with:
np(C-company) vg(appoint) np(C-person) "as" np(C-position)
Meta rules produce the passive clause pattern:
np(C-person) vg-pass(appoint) "as" np(C-position) "by" np(C-company)
When a pattern matches the input, the pieces corresponding to its constituents are used to build a semantic representation of the pattern (e.g. a logical form).
vg = verb group (plus auxiliaries). vg-pass = passive verb group.
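The meta-rule idea can be illustrated with a small Python sketch. This is not Proteus code: the `Active` tuple, the function names, the one-entry lexicon and the `n(...)` and `"'s"` notation for nominal patterns are all assumptions made for illustration. Only the active and passive patterns come from the slide.

```python
from typing import NamedTuple


class Active(NamedTuple):
    """An active-clause pattern split into its constituents."""
    subject: str  # e.g. np(C-company)
    verb: str     # verb lemma inside vg(...)
    obj: str      # e.g. np(C-person)
    rest: tuple   # trailing constituents, e.g. ('"as"', 'np(C-position)')


# Hypothetical one-entry lexicon standing in for a NomLex-style resource.
NOMINALIZATIONS = {"appoint": "appointment"}


def passive(p):
    """Meta rule: object first, passive verb group, then a 'by' agent."""
    return [p.obj, f"vg-pass({p.verb})", *p.rest, '"by"', p.subject]


def nominal_patterns(p, lexicon=NOMINALIZATIONS):
    """Meta rule: build nominal patterns when the verb has a known noun."""
    noun = lexicon.get(p.verb)
    if noun is None:
        return []
    head = f"n({noun})"  # assumed notation for a noun head
    return [
        # "IBM's appointment of Alice Smith as vice president"
        [p.subject, '"\'s"', head, '"of"', p.obj, *p.rest],
        # "Alice Smith's appointment as vice president"
        [p.obj, '"\'s"', head, *p.rest],
    ]


ACTIVE = Active("np(C-company)", "appoint", "np(C-person)",
                ('"as"', "np(C-position)"))

print(" ".join(passive(ACTIVE)))
for pattern in nominal_patterns(ACTIVE):
    print(" ".join(pattern))
```

The point of the sketch is that a verb-to-noun lexicon is the only extra resource the nominal rule needs, which is the role a resource such as NomLex plays in this setting.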
8. Project Motivation: DHBB
The Brazilian Historical Biographic Dictionary (DHBB) has 7.5K entries.
Enrich the structure (semantics).
Uniform data treatment (standards and interlinks between collections).
NLP of DHBB entries: (1) word sense disambiguation with openWordnet-PT; and (2) named entity recognition to make links (133K proper names).
We need grammars, lexical resources, ontologies, KBs, automated theorem provers, etc. to reason about knowledge extracted from text. This will empower QA, KE, MT, personal assistants and other systems.
9. Nominalizations in Portuguese
Nominalizations are difficult to deal with in KR systems: it is harder to obtain the arguments of a nominal predicate.
The NOMLEX project (Macleod et al., 1998) provides a well-established, open-access baseline.
It covers nominalizations with the suffixes -ção/-ion, -mento/-ment and -or/-er, which work well in Portuguese.
E.g. construção (construction), adiamento (adjournment) and escritor (writer).
90% of the original resource was easily translated manually.
10. How we expanded it
We translate both the noun and the verb by looking them up in extractions from the EN and PT Wiktionary dumps, generating all combinations of noun/verb translations. A filter then compares the noun and verb translations to check whether they are similar enough to be morphologically related.
Other experiments with DHBB and openWordnet-PT.
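The translate-and-filter step described on this slide could be sketched as follows. The slides do not say which similarity measure the filter uses, so the shared-prefix test and its threshold are assumptions, and the two small dictionaries are made-up sample data standing in for the Wiktionary extractions.

```python
from itertools import product

# Made-up sample data standing in for the EN/PT Wiktionary extractions.
NOUN_TRANSLATIONS = {"construction": ["construção", "obra"]}
VERB_TRANSLATIONS = {"construct": ["construir", "edificar"]}


def common_prefix_len(a, b):
    """Length of the longest prefix shared by the two strings."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def candidate_pairs(noun_en, verb_en, min_prefix=4):
    """Combine every noun translation with every verb translation and keep
    the pairs that share a long enough prefix (assumed relatedness test)."""
    nouns = NOUN_TRANSLATIONS.get(noun_en, [])
    verbs = VERB_TRANSLATIONS.get(verb_en, [])
    return [
        (noun, verb)
        for noun, verb in product(nouns, verbs)
        if common_prefix_len(noun, verb) >= min_prefix
    ]


print(candidate_pairs("construction", "construct"))
```

A prefix test is only one cheap way to approximate morphological relatedness. It would miss pairs whose stems change, so a real filter would likely need a more tolerant comparison and manual review.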
11. NomLex-BR
A dictionary of Portuguese nominalizations.
Relates nominals to their corresponding verbs.
Over 2,539 entries of several types of lexical nominalizations.
First version of NOMLEX-BR in 2011, much expanded in 2013.
Freely available for download and embedded in openWordnet-PT.
An RDF vocabulary to describe nominalizations. Future extensions will cover more information from COMLEX and COMNOM (an extension from NomBank).
URI for the schema: http://arademaker.github.com/nomlex/schema/ (we need a better and stable URI).
Example: “Construção da rodovia Transamazônica, na década de 70, pelo governo Medici, uma das obras faraônicas da ditadura militar.” (“Construction of the Trans-Amazonian highway, in the 1970s, by the Medici government, one of the pharaonic works of the military dictatorship.”)
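To make the idea of an RDF description of a nominalization concrete, here is a minimal sketch in Python that writes one noun/verb pair as N-Triples lines. The slides do not list the vocabulary's terms, so `Nominalization`, `noun` and `verb` are placeholder names, and the `example.org` base URI is hypothetical. Only the schema URI is taken from the slide.

```python
from urllib.parse import quote

SCHEMA = "http://arademaker.github.com/nomlex/schema/"  # URI from the slide
BASE = "http://example.org/nomlex-br/"                  # hypothetical base URI
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def nominalization_triples(noun, verb):
    """Return N-Triples lines linking a deverbal noun to its verb.

    The class and property names appended to SCHEMA are placeholders,
    not confirmed terms of the NomLex-BR vocabulary.
    """
    # Percent-encode the lemmas so the subject URI stays plain ASCII.
    subject = f"<{BASE}{quote(verb)}-{quote(noun)}>"
    return [
        f"{subject} <{RDF_TYPE}> <{SCHEMA}Nominalization> .",
        f'{subject} <{SCHEMA}noun> "{noun}"@pt .',
        f'{subject} <{SCHEMA}verb> "{verb}"@pt .',
    ]


for line in nominalization_triples("construção", "construir"):
    print(line)
```

Giving each noun/verb pair its own resource, rather than linking the two words directly, leaves room for the extra information the slide mentions as future extensions.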
15. Results
The extension of OpenWN-PT aims at incorporating links that connect deverbal nouns with their corresponding verbs.
The integration into OpenWN-PT will facilitate their use for linguistic research as well as information extraction.
Incorporating NOMLEX-BR data into OpenWN-PT has proved useful in pinpointing some issues with the coherence and richness of OpenWN-PT.
For example, the word abasement corresponds in NOMLEX to the verb abase, and thus we would like a similar correspondence between the Portuguese noun “aviltamento” and the verb “aviltar” (suggested translations). OpenWN-PT simply has two synsets, “humilhar, abaixar” and “humilhar, rebaixar”. The more common verb humilhar is repeated, while the uncommon aviltar was left out.
16. Next Steps
Finish embedding NomLex-BR into OpenWN-PT (anchor floating words, http://bit.ly/1aQdpkr).
Work with Claudia Freitas and Hugo Gonçalvez on leveraging Linguateca’s PAPEL, Cartão, ACDC and Floresta Sintá(c)tica.
Lists from Linguateca’s resources complement NomLex-BR using corpora and make sure our resource is not simply a translation.
Adding the Portuguese terms that satisfy different relations? OpenVerbNet-PT? Glosses? Classification of nominalizations?
We are developing our own web interface for browsing and collaborative editing. Most important pending issue!
Use and test the accuracy of the resource! More applications!
17. Conclusion
We presented NomLex-BR, a lexicon of nominalizations in Brazilian Portuguese.
NomLex-BR is embedded into OpenWordNet-PT and shares its RDF representation.
Recent improvements include better coverage: newer suffixes and Nomage incorporation.
The work with NomLex-BR helped us to improve openWordnet-PT (new words, senses).
The data is freely available from http://github.com/arademaker/wordnet-br/ and through a SPARQL endpoint at http://logics.emap.fgv.br:10035.
18. Obrigado! (Thank you!)
Multilingual Wordnet 1.0
Synset 01146493-a
Danish: taknemmelig
English: thankful, grateful
Finnish: kiitollinen
French: reconnaissant
Galician: grato, agradecido
Indonesian: bersyukur, berterima kasih, tanda terima kasih, terhutang budi
Italian: grato, riconoscente
Japanese: 忝い, 有り難い, 感謝を感じた, 幸甚, ありがたい, 有難い, 感謝を表した
Bokmål: takknemlig
Portuguese: reconhecido, grato, agradecido
Thai: ซึ่งสำนึกในบุญคุณ
Malaysian: bersyukur, berterima kasih, tanda terima kasih, menampakkan tanda kesyukuran, memperlihatkan tanda kesyukuran, terhutang budi
Eng: feeling or showing gratitude; "a grateful heart"; "grateful for the tree's shade"; "a thankful smile";
Similar to: appreciative, glad