The document describes a two-stage approach to named entity recognition for Dutch text. In the first recognition stage, a classifier labels tokens as the beginning, inside, or outside of an entity based on features like surrounding words and capitalization. In the second classification stage, entity spans identified in the first stage are classified into types like person, location or organization using additional features about the entity text, context and capitalization patterns. The approach uses averaged perceptrons trained on custom feature sets at each stage to recognize and classify named entities in Dutch language documents.
Presentation at NLDB 2012
1. Two-stage Named Entity Recognition using averaged perceptrons
Lars Buitinck, Maarten Marx
Information and Language Processing Systems
Informatics Institute
University of Amsterdam
17th Int’l Conf. on Applications of NLP to Information Systems
Buitinck, Marx Two-stage NER
3. Named Entity Recognition
Find names in text and classify them as belonging to persons, locations, organizations, events, products or “miscellaneous”
Use machine learning
5. Named Entity Recognition for Dutch
State-of-the-art algorithm for Dutch by Desmet and Hoste (2011): voting classifiers, with a genetic algorithm (GA) to train the weights
Good training sets are just becoming available
Many practitioners retrain the Stanford CRF-NER tagger
8. Overview
Realize that NER is two problems in one: recognition and classification
Pipeline solution with two classifiers
Use custom feature sets for each
Do not use a precompiled list of names (“gazetteer”)
Work at the sentence level (because of how training sets are set up)
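The pipeline idea above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it assumes the first classifier emits one B, I or O label per token, and the names `bio_to_spans`, `two_stage_ner`, `recognize` and `classify` are hypothetical. Treating a stray I label as the start of an entity is also an assumption of this sketch.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def bio_to_spans(tags: Sequence[str]) -> List[Span]:
    """Collect runs of B/I labels into (start, end) entity spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new entity begins here; close the open one first, if any.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def two_stage_ner(
    tokens: Sequence[str],
    recognize: Callable[[Sequence[str]], Sequence[str]],
    classify: Callable[[Sequence[str], Span], str],
) -> List[Tuple[int, int, str]]:
    """Stage 1 finds spans for one sentence; stage 2 assigns a type to each span."""
    tags = recognize(tokens)
    return [(s, e, classify(tokens, (s, e))) for s, e in bio_to_spans(tags)]
```

Because the second classifier sees whole spans rather than single tokens, it can use features of the complete entity text, which a purely token-level tagger cannot do as directly.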
13. Recognition stage
Token-level task: is a token the Beginning of, Inside, or Outside any entity name?
Features:
Word window $w_{i-2}, \ldots, w_{i+2}$
POS tags for words in window
Conjunction of words and POS tags in window, e.g. $(w_{i-1}, p_{i-1})$
Capitalization of tokens in window
(Character) prefixes and suffixes of $w_i$ and $w_{i-1}$
Regular expressions (REs) for digits, Roman numerals and punctuation
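The feature list above could be turned into code roughly as follows. This is a hedged sketch, not the feature extractor used in the talk: the feature-name strings, the `<PAD>` marker for positions outside the sentence, the maximum affix length of 3 (the slides give no length for this stage) and the three regular expressions are all assumptions made for illustration.

```python
import re
from typing import List, Sequence

PAD = "<PAD>"  # hypothetical marker for positions outside the sentence

# Illustrative patterns; the slides do not give the actual expressions.
DIGITS_RE = re.compile(r"^\d+$")
ROMAN_RE = re.compile(
    r"^(?=[MDCLXVI])M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
PUNCT_RE = re.compile(r"^[^\w\s]+$")


def token_features(
    words: Sequence[str], pos_tags: Sequence[str], i: int, max_affix: int = 3
) -> List[str]:
    """Binary features for token i, named after the bullets on the slide."""
    feats: List[str] = []
    n = len(words)
    for off in range(-2, 3):  # window w[i-2] .. w[i+2]
        j = i + off
        inside = 0 <= j < n
        w = words[j] if inside else PAD
        p = pos_tags[j] if inside else PAD
        feats.append(f"w[{off}]={w}")         # word in window
        feats.append(f"p[{off}]={p}")         # POS tag in window
        feats.append(f"wp[{off}]={w}|{p}")    # word/POS conjunction
        if inside and w[:1].isupper():
            feats.append(f"cap[{off}]")       # capitalization in window
    for off in (-1, 0):  # character affixes of w[i-1] and w[i]
        j = i + off
        if 0 <= j < n:
            w = words[j]
            for k in range(1, min(max_affix, len(w)) + 1):
                feats.append(f"pre[{off}]={w[:k]}")
                feats.append(f"suf[{off}]={w[-k:]}")
    current = words[i]
    if DIGITS_RE.match(current):
        feats.append("is_digits")
    if ROMAN_RE.match(current):
        feats.append("is_roman")
    if PUNCT_RE.match(current):
        feats.append("is_punct")
    return feats
```

Each returned string stands for a binary feature that is switched on for the token, which suits a sparse linear classifier.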
21. Classification stage
Don’t do this at the token level; we know the entity spans!
Input is a list of tokens considered an entity by the recognition stage
Features:
The tokens we got from recognition
The four surrounding tokens
Their prefixes and suffixes up to length four
Capitalization pattern, as a string over the alphabet $(L|U|O)^*$
The occurrence of capitalized tokens, digits and dashes in the entire sentence
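A possible reading of these span-level features is sketched below. Several details are guesses, since the slides leave them open: the four surrounding tokens are taken to be two on each side of the span, affixes are computed for both the entity tokens and the context tokens, and the capitalization pattern uses one symbol per entity token (U for an uppercase initial, L for a lowercase initial, O for anything else). All function and feature names are hypothetical.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def token_shape(token: str) -> str:
    """One symbol per token: U(pper initial), L(ower initial) or O(ther)."""
    if token[:1].isupper():
        return "U"
    if token[:1].islower():
        return "L"
    return "O"


def affixes(token: str, max_len: int = 4) -> List[Tuple[str, str]]:
    """Character prefixes and suffixes of the token, up to max_len characters."""
    out: List[Tuple[str, str]] = []
    for k in range(1, min(max_len, len(token)) + 1):
        out.append(("pre", token[:k]))
        out.append(("suf", token[-k:]))
    return out


def span_features(words: Sequence[str], span: Span) -> List[str]:
    """Binary features for one entity span found by the recognition stage."""
    start, end = span
    entity = words[start:end]
    feats = [f"tok={w}" for w in entity]
    # Assumption: "four surrounding tokens" = two before and two after the span.
    offsets = ((-2, start - 2), (-1, start - 1), (1, end), (2, end + 1))
    context = [(off, words[j]) for off, j in offsets if 0 <= j < len(words)]
    feats += [f"ctx[{off}]={w}" for off, w in context]
    for w in entity:
        feats += [f"tok_{kind}={val}" for kind, val in affixes(w)]
    for off, w in context:
        feats += [f"ctx_{kind}[{off}]={val}" for kind, val in affixes(w)]
    feats.append("shape=" + "".join(token_shape(w) for w in entity))
    # Sentence-level indicators.
    if any(w[:1].isupper() for w in words):
        feats.append("sent_has_cap")
    if any(c.isdigit() for w in words for c in w):
        feats.append("sent_has_digit")
    if any("-" in w for w in words):
        feats.append("sent_has_dash")
    return feats
```

For the span covering "Universiteit van Amsterdam", this per-token reading gives the pattern `ULU`; a per-character reading of the pattern is equally compatible with the slide.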
29. Learning algorithm
Use averaged perceptron for both stages
Learns an approximation of the max-margin solution (linear
SVM)
40 iterations
Used the LBJ machine learning toolkit
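For readers unfamiliar with the averaged perceptron, a minimal multiclass sketch follows. It is not the LBJ implementation used in this work; the class name, the lazy averaging bookkeeping and the set-of-strings feature representation are assumptions made for illustration. Only the default of 40 iterations comes from the slide.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Multiclass perceptron over binary string features, with weight averaging."""

    def __init__(self, labels):
        self.labels = sorted(labels)
        self.weights = defaultdict(float)  # (feature, label) -> current weight
        self._totals = defaultdict(float)  # sum of each weight over all steps
        self._stamps = defaultdict(int)    # step at which a weight last changed
        self._step = 0

    def predict(self, feats):
        scores = {
            y: sum(self.weights.get((f, y), 0.0) for f in feats)
            for y in self.labels
        }
        # Ties are broken in favour of the first label in sorted order.
        return max(self.labels, key=lambda y: scores[y])

    def _update(self, key, delta):
        # Credit the old weight for the steps during which it was unchanged.
        self._totals[key] += (self._step - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self._step
        self.weights[key] += delta

    def fit(self, data, iterations=40):
        """data: list of (feature set, gold label) pairs."""
        for _ in range(iterations):
            for feats, gold in data:
                self._step += 1
                guess = self.predict(feats)
                if guess != gold:
                    for f in feats:
                        self._update((f, gold), 1.0)
                        self._update((f, guess), -1.0)
        # Replace each weight by its average over all training steps.
        if self._step:
            for key, w in self.weights.items():
                total = self._totals[key] + (self._step - self._stamps[key]) * w
                self.weights[key] = total / self._step
        return self
```

Averaging the weight vectors seen during training damps the influence of late updates, which is why the result behaves like an approximation of a large-margin linear classifier.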
33. Evaluation
Aim for F1 score, as defined in the CoNLL 2002 shared
task on NER
Two corpora: CoNLL 2002 and a subset of SoNaR
(courtesy Desmet and Hoste)
Compare against Stanford and Desmet and Hoste’s
algorithm
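The CoNLL 2002 shared task scores entities on exact match: a predicted entity counts as correct only if both its span and its type equal a gold entity. A minimal sketch of that metric, assuming entities are represented as hypothetical `(sentence_id, start, end, type)` tuples:

```python
def f1_score(gold, predicted):
    """Entity-level F1 with exact span and type match.

    gold, predicted: iterables of (sentence_id, start, end, type) tuples.
    Returns a value between 0 and 1; multiply by 100 for percentage scores.
    """
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this metric a correct span with the wrong type costs both precision and recall, so errors from either stage of a two-stage system count fully against the final score.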
36. Results on CoNLL 2002
309,686 tokens containing 19,901 names, four categories
65% training, 22% validation and 12% test sets
Stanford achieves F1 = 74.72; the "miscellaneous" category is
hard (F1 < 70)
We achieve F1 = 75.14; the "organization" category is hard
40. Results on SoNaR
New, large corpus with manual annotations
Used a 200k-token subset of a preliminary version,
with three-fold cross validation
State of the art is Desmet and Hoste (2011) with
F1 = 84.44
Best individual classifier from that paper (CRF) gets 83.77
Our system: 83.56
Here, “product” and “miscellaneous” categories are hard
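The three-fold cross validation mentioned above can be sketched as follows. The slides do not say how the folds were drawn, so the round-robin assignment of sentences to folds and the function name are assumptions.

```python
def k_fold_splits(sentences, k=3):
    """Yield (train, test) pairs; each sentence is in the test set exactly once."""
    # Round-robin assignment: sentence i goes to fold i mod k (an assumption).
    folds = [sentences[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test
```

The reported score would then be computed from the k test-fold results, each obtained with a model trained on the other k − 1 folds.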
46. Conclusion
Near-state-of-the-art performance from simple learners
with good feature sets
No gazetteers, so should be fairly reusable
(Side conclusion: SoNaR is more easily learnable than
CoNLL)
49. Future work
Being integrated in UvA’s xTAS text analysis pipeline
Used to find entities in the Dutch Hansard corpus
(forthcoming) and link entities to Wikipedia
Full SoNaR is now available; new evaluation needed