Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
[ RMLL 2013, Bruxelles – Thursday 11th
July 2013 ]
Presentation of OpenNLP
Presenter : Dr Ir Robert Viseur
2
What is OpenNLP ?
• Toolkit for the processing of natural language text.
• Project of the Apache Foundation.
• Developpe...
3
What are the features ?
• For common NLP tasks :
• tokenization,
• sentence segmentation,
• part-of-speech tagging,
• na...
4
What is the part-of-speech tagging ?
• Example :
• See more:
http://opennlp.apache.org/documentation/1.5.3
/manual/openn...
5
What is the named entity
extraction ?
• Example :
• See more:
http://opennlp.apache.org/documentation/1.5.3
/manual/open...
6
How does it work ? (1/2)
• The features are associated to pre-trained models.
• Each pre-trained model is created for on...
7
How does it work ? (2/2)
• Example (English vs Spanish languages) :
8
What are the criteria of choice ?
• Support of the product.
• License.
• Available languages.
• Precision / Recall.
• Sp...
9
Are there free (as freedom)
alternative tools ?
• Other light tools :
• Stanford Log-linear Part-Of-Speech Tagger (POST)...
10
Example:
tag cloud creation (1/6)
• Starting point: website.
• Example: www.adacore.com.
• What we want (from website c...
11
Example:
tag cloud creation (2/6)
• Cleaning:
• Remove the HTML tags and keep only the useful
content.
• Warnings:
• NL...
12
Example:
tag cloud creation (3/6)
• Named entities extraction.
• Standard in OpenNLP : OpenNLP adds tags in text.
• Her...
13
Example:
tag cloud creation (4/6)
• Process :
Raw HTML
document
---- --- -- ----.
--- -- -- -- ----
--- -- ----.
---- -...
14
Example:
tag cloud creation (5/6)
• Result: common tag cloud.
15
Example:
tag cloud creation (6/6)
• Result: circular tag cloud.
16
Thanks for your attention.
Any questions ?
17
Contact
Dr Ir Robert Viseur
Email (@CETIC) : robert.viseur@cetic.be
Email (@UMONS) : robert.viseur@umons.ac.be
Phone : ...
Upcoming SlideShare
Loading in …5
×

Presentation of OpenNLP

6,005 views

Published on

Published in: Technology

Presentation of OpenNLP

  1. 1. [ RMLL 2013, Bruxelles – Thursday 11th July 2013 ] Presentation of OpenNLP Presenter : Dr Ir Robert Viseur
  2. 2. 2 What is OpenNLP ? • Toolkit for the processing of natural language text. • Project of the Apache Foundation. • Developped in Java. • Under Apache License, Version 2. • Download and documentation: http://opennlp.apache.org/.
  3. 3. 3 What are the features ? • For common NLP tasks : • tokenization, • sentence segmentation, • part-of-speech tagging, • named entity extraction, • chuncking.
  4. 4. 4 What is the part-of-speech tagging ? • Example : • See more: http://opennlp.apache.org/documentation/1.5.3 /manual/opennlp.html.
  5. 5. 5 What is the named entity extraction ? • Example : • See more: http://opennlp.apache.org/documentation/1.5.3 /manual/opennlp.html.
  6. 6. 6 How does it work ? (1/2) • The features are associated to pre-trained models. • Each pre-trained model is created for one language and for one type of use. • Supported languages: da, de, en, es, nl, pt, se. • Warnings : – The functional coverage varies with languages. – The french language is not supported ! • See http://opennlp.sourceforge.net/models- 1.5/. • Use in command line or as a Java library. • Warning : loading time of models with CLI.
  7. 7. 7 How does it work ? (2/2) • Example (English vs Spanish languages) :
  8. 8. 8 What are the criteria of choice ? • Support of the product. • License. • Available languages. • Precision / Recall. • Speed of text processing.
  9. 9. 9 Are there free (as freedom) alternative tools ? • Other light tools : • Stanford Log-linear Part-Of-Speech Tagger (POST), • Stanford Named Entity Recognizer (NER), • TagEN, • Java Automatic Term Extraction toolkit. • Frameworks : • In Java : UIMA (Java), GATE (Java). • In other languages : NLTK (Python).
  10. 10. 10 Example: tag cloud creation (1/6) • Starting point: website. • Example: www.adacore.com. • What we want (from website content): • common tag cloud, • circular tag cloud. • Main steps : crawl, cleaning of HTML documents, named entities (person) and terminology extractions (+ merge) and display (tag cloud).
  11. 11. 11 Example: tag cloud creation (2/6) • Cleaning: • Remove the HTML tags and keep only the useful content. • Warnings: • NLP tools are sensitive to noise in raw data. • Pay attention to the language of the document. • Use of HTML boilerplate tool (HTML -> TXT). • Tool: Boilerpipe. • See http://code.google.com/p/boilerpipe/. • Next: normalization of the text.
  12. 12. 12 Example: tag cloud creation (3/6) • Named entities extraction. • Standard in OpenNLP : OpenNLP adds tags in text. • Here : extraction of Person NE. • Terminology extraction. • First : part-of-speech tagging (POST). • Next : identification et filtering (threshold) of : • collocations (i.e: Name_Name, Adjective_Name,...), • proper names (often: brands or people).
  13. 13. 13 Example: tag cloud creation (4/6) • Process : Raw HTML document ---- --- -- ----. --- -- -- -- ---- --- -- ----. ---- --- -- ----. --- -- -- -- ---- --- -- ----. _--- _-- _-- _ _---- _--. _--- _-- _-- _-- _____ _____ _____ Conversion to text Normalization POS tagging _____ _____ _____ Terminology extraction NE extraction Tag cloud (for a website) Website (Internet) Website (local) Crawl Tags Merge
  14. 14. 14 Example: tag cloud creation (5/6) • Result: common tag cloud.
  15. 15. 15 Example: tag cloud creation (6/6) • Result: circular tag cloud.
  16. 16. 16 Thanks for your attention. Any questions ?
  17. 17. 17 Contact Dr Ir Robert Viseur Email (@CETIC) : robert.viseur@cetic.be Email (@UMONS) : robert.viseur@umons.ac.be Phone : 0032 (0) 479 66 08 76 Website : www.robertviseur.be This presentation is covered by « CC-BY-ND » license.

×