Big Data Spain 2017 - Deriving Actionable Insights from High Volume Media Streams

Apache OpenNLP
Deriving Actionable Insights from
High Volume Media Streams
Jörn Kottmann
Peter Thygesen
Big Data Spain 2017,
Madrid
$WhoAreWe
Jörn Kottmann
● Senior Software Engineer, Sandstone SA, Luxembourg
● Member of Apache Software Foundation
● PMC Chair & Committer, Apache OpenNLP
● PMC and Committer, Apache UIMA
Peter Thygesen
● Senior Software Engineer & Partner, Paqle A/S, Denmark
● PMC and Committer, Apache OpenNLP
What is a Natural Language?
A natural language is any language that has evolved naturally in humans
through use and repetition, without conscious planning or premeditation.
(From Wikipedia)
What is NOT a Natural Language?
Characteristics of Natural Language
Unstructured
Ambiguous
Complex
Hidden semantic
Irony
Informal
Unpredictable
Rich
Most updated
Noise
Hard to search
Metaphors
and it holds most of human knowledge
As information overload grows
ever worse, computers may
become our only hope for
handling a growing deluge of
documents.
MIT Press - May 12, 2017
What is Natural Language Processing?
NLP is a field of computer science, artificial intelligence and
computational linguistics concerned with the interactions
between computers and human (natural) languages, and, in
particular, concerned with programming computers to fruitfully
process large natural language corpora.
(From Wikipedia)
???
How?
Guten Morgen
早上好
おはようございます
Hyvää huomenta
Καλημέρα
доброе утро
शुभ प्रभात
dzień dobry
¡Buenos días!
By solving small problems each time
A pipeline where an ambiguity type is solved, incrementally.
Language Detector
Sentence Detector
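Mr. Robert talk is today at room num. 7. Let's go?
❌ Mr. | Robert talk is today at room num. | 7. | Let's go? |
✅ Mr. Robert talk is today at room num. 7. | Let's go? |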
Tokenizer
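Mr. Robert talk is today at room num. 7. Let's go?
❌ | Mr | . | Robert | talk | is | today | at | room | num | . | 7 | . | Let | ' | s | go | ? |
✅ | Mr. | Robert | talk | is | today | at | room | num. | 7 | . | Let | 's | go | ? |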
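The sentence-boundary ambiguity in the example above can be reproduced in a few lines. This is only a minimal sketch and not how OpenNLP works (its sentence detector uses a trained model): the class name SentenceSplitSketch and the hard-coded abbreviation list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SentenceSplitSketch {

    // Naive rule: a sentence ends after every token that ends in '.', '?' or '!'.
    static List<String> splitNaive(String text) {
        return split(text, new HashSet<String>());
    }

    // Same rule, except that tokens listed in nonTerminal never end a sentence.
    static List<String> split(String text, Set<String> nonTerminal) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input produces a single empty token
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            char last = token.charAt(token.length() - 1);
            boolean terminator = last == '.' || last == '?' || last == '!';
            if (terminator && !nonTerminal.contains(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without a terminator
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Mr. Robert talk is today at room num. 7. Let's go?";
        // Hypothetical abbreviation list; a real system would learn these cases.
        Set<String> abbreviations = new HashSet<>(Arrays.asList("Mr.", "num."));
        System.out.println(splitNaive(text));
        System.out.println(split(text, abbreviations));
    }
}
```

The naive rule yields four fragments for the slide's sentence, while the abbreviation-aware variant yields the two intended sentences. A fixed list does not scale across languages and domains, which is the motivation for a trainable detector.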
By solving small problems each time
Each step of a pipeline solves one ambiguity problem.
Name Finder
<Person>Washington</Person> was the first president of the USA.
<Place>Washington</Place> is a state in the Pacific Northwest region
of the USA.
POS Tagger
Laura Keene brushed by him with the glass of water .
NNP   NNP   VBD     IN PRP IN   DT  NN    IN NN    .
By solving small problems each time
A pipeline can be long and resolve many ambiguities
Lemmatizer
He is better than many others
He be good   than many other
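A dictionary lookup is one simple way to picture the lemmatization step above. The sketch below is an assumption-laden illustration, not OpenNLP's lemmatizer: the class LemmaLookupSketch, its methods, and the Penn Treebank tags used in the example (VBZ, JJR, NNS) are chosen here for illustration and do not appear on the slide.

```java
import java.util.HashMap;
import java.util.Map;

class LemmaLookupSketch {

    // Maps "lowercased token <TAB> POS tag" to a lemma.
    private final Map<String, String> lemmas = new HashMap<>();

    void add(String token, String posTag, String lemma) {
        lemmas.put(key(token, posTag), lemma);
    }

    // Returns the dictionary lemma, or the token itself when no entry exists.
    String lemmatize(String token, String posTag) {
        String lemma = lemmas.get(key(token, posTag));
        return lemma != null ? lemma : token;
    }

    private static String key(String token, String posTag) {
        return token.toLowerCase() + "\t" + posTag;
    }

    public static void main(String[] args) {
        LemmaLookupSketch dictionary = new LemmaLookupSketch();
        dictionary.add("is", "VBZ", "be");
        dictionary.add("better", "JJR", "good");
        dictionary.add("others", "NNS", "other");

        String[] tokens = {"He", "is", "better", "than", "many", "others"};
        String[] tags = {"PRP", "VBZ", "JJR", "IN", "JJ", "NNS"};
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            out.append(dictionary.lemmatize(tokens[i], tags[i])).append(' ');
        }
        System.out.println(out.toString().trim());
    }
}
```

Keying on the POS tag as well as the token is the reason the lemmatizer sits after the POS tagger in the pipeline: the same surface form can have different lemmas depending on its part of speech.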
[Diagram: pipeline stages Language Detector, Sentence Detector, Tokenizer, POS Tagger, Lemmatizer, Name Finder, Chunker; one pipeline per language (Language 1, Language 2, ..., Language N); output to Index]
Apache OpenNLP
Mature project (> 10 years)
Actively developed
Machine learning
Java
Easy to train
Highly customizable
Fast
Language Detector
Sentence detector
Tokenizer
Part of Speech Tagger
Lemmatizer
Chunker
Parser
....
Language Detection
● Extract character n-grams as features; this works for any string in any language
● Use the n-grams as features for classification via Maximum Entropy, Perceptron or Naive Bayes
● Train on high-quality data such as the Leipzig corpus to classify text in more than 100 languages
Source: Language Detection Library for Java - Shuyo Nakatani
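The first bullet above, character n-gram extraction, can be sketched in plain Java. This is an illustration only: the class name CharNGramSketch and the n-gram sizes 1 to 3 used in main are assumptions, not necessarily what OpenNLP's language detector uses.

```java
import java.util.Map;
import java.util.TreeMap;

class CharNGramSketch {

    // Counts every character n-gram with minSize <= n <= maxSize in the text.
    // Works on UTF-16 code units, so characters outside the BMP may be split.
    static Map<String, Integer> countNGrams(String text, int minSize, int maxSize) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int n = minSize; n <= maxSize; n++) {
            for (int start = 0; start + n <= text.length(); start++) {
                counts.merge(text.substring(start, start + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The resulting counts would serve as features for a classifier.
        System.out.println(countNGrams("guten morgen", 1, 3));
        System.out.println(countNGrams("buenos días", 1, 3));
    }
}
```

No tokenizer or language-specific rule is needed for this step, which is why it can run first in the pipeline on any input string.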
Training Language Detection Model
Corpus - Leipzig (https://catalog.ldc.upenn.edu/ldc2013t19)
svn co https://svn.apache.org/repos/bigdata/opennlp/trunk opennlp-corpus   # roughly 25GB
bin/opennlp LanguageDetectorConverter leipzig -sentencesDir data -sentencesPerSample 5 \
  -samplesPerLanguage 2000 -encoding UTF-8 > ld-train.txt
bin/opennlp LanguageDetectorConverter leipzig -sentencesDir data -sentencesPerSample 5 \
  -samplesPerLanguage 2000 -samplesToSkip 2000 -encoding UTF-8 > ld-eval.txt
bin/opennlp LanguageDetectorTrainer -model lang.bin -params MAXENT_45_PARAMS.txt \
  -data ld-train.txt -encoding UTF-8
bin/opennlp LanguageDetectorEvaluator -model lang.bin -misclassified true \
  -reportOutputFile report.txt -data ld-eval.txt -encoding UTF-8
Training Models for English
Corpus - OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19)
bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir \
  ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-ner-ontonotes.bin
bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir \
  ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin
Training Models for Portuguese
Corpus - Amazonia (http://www.linguateca.pt/floresta/corpus.html)
bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin \
  -detokenizer lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1
bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin \
  -encoding ISO-8859-1 -includeFeatures false
bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin \
  -encoding ISO-8859-1
bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin \
  -encoding ISO-8859-1
Name Finder API - Detect Names
// Load the Portuguese NER model from the classpath.
TokenNameFinderModel model = new TokenNameFinderModel(
    OpenNLPMain.class.getResourceAsStream("/opennlp-models/por-ner.bin"));
NameFinderME nameFinder = new NameFinderME(model);
for (String[][] document : documents) {
    for (String[] sentence : document) {
        Span[] nameSpans = nameFinder.find(sentence);
        // do something with the names
    }
    // Reset document-level adaptive data before the next document.
    nameFinder.clearAdaptiveData();
}
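The spans returned by the name finder refer to token positions, not characters, so "doing something with the names" usually starts by mapping each span back to its tokens. The sketch below uses a stand-in class TokenSpan (hypothetical, defined here so the example runs without the OpenNLP jar) that mirrors the idea of a half-open token range [start, end) with an entity type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpanSketch {

    // Stand-in for a name span: token indices [start, end) plus an entity type.
    static final class TokenSpan {
        final int start;
        final int end;
        final String type;

        TokenSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Joins the tokens covered by each span into "type: covered text".
    static List<String> coveredText(TokenSpan[] spans, String[] tokens) {
        List<String> names = new ArrayList<>();
        for (TokenSpan span : spans) {
            String text = String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
            names.add(span.type + ": " + text);
        }
        return names;
    }

    public static void main(String[] args) {
        String[] sentence = {"Laura", "Keene", "brushed", "by", "him", "."};
        // Hypothetical finder output for the sentence above.
        TokenSpan[] spans = {new TokenSpan(0, 2, "person")};
        System.out.println(coveredText(spans, sentence));
    }
}
```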
Name Finder API - Train a model
// In OpenNLP 1.8.x, PlainTextByLineStream reads from an InputStreamFactory.
ObjectStream<String> lineStream =
    new PlainTextByLineStream(
        new MarkableFileInputStreamFactory(new File("en-ner-person.train")),
        StandardCharsets.UTF_8);
TokenNameFinderModel model;
try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream)) {
    model = NameFinderME.train("eng", "person", sampleStream,
        TrainingParameters.defaultParams(),
        new TokenNameFinderFactory());
}
model.serialize(modelFile);
Name Finder API - Evaluate a model
TokenNameFinderEvaluator evaluator =
new TokenNameFinderEvaluator(new NameFinderME(model));
evaluator.evaluate(sampleStream);
FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());
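The F-measure reported by the evaluator combines precision and recall over name spans. A minimal sketch of the arithmetic, using hypothetical counts and a class name (FMeasureSketch) invented for this illustration:

```java
class FMeasureSketch {

    // Fraction of predicted spans that are correct.
    static double precision(int correct, int predicted) {
        return predicted == 0 ? 0.0 : (double) correct / predicted;
    }

    // Fraction of reference (gold) spans that were found.
    static double recall(int correct, int reference) {
        return reference == 0 ? 0.0 : (double) correct / reference;
    }

    // Harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 10 predicted spans, 16 reference spans, 8 in common.
        double p = precision(8, 10);
        double r = recall(8, 16);
        System.out.printf("P=%.3f R=%.3f F1=%.3f%n", p, r, f1(p, r));
    }
}
```

Returning 0.0 when a denominator is zero is a choice made for this sketch; how a given library reports that edge case may differ.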
Name Finder API - Cross Evaluate a model
ObjectStream<String> lineStream = new PlainTextByLineStream(
    new MarkableFileInputStreamFactory(new File("en-ner-person.train")),
    StandardCharsets.UTF_8);
// The validator consumes NameSample objects, so wrap the line stream.
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
TokenNameFinderCrossValidator evaluator = new
    TokenNameFinderCrossValidator("eng", 100, 5);
evaluator.evaluate(sampleStream, 10);  // 10 folds
FMeasure result = evaluator.getFMeasure();
System.out.println(result.toString());
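Cross validation with n folds trains n models, each time holding out a different part of the data for testing. The sketch below shows one way to partition the samples; the modulo assignment and the class name KFoldSketch are assumptions made for illustration, and the slides do not show how OpenNLP partitions internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KFoldSketch {

    // Samples whose index i satisfies i % folds == fold are held out for testing.
    static <T> List<T> testPartition(List<T> samples, int folds, int fold) {
        List<T> held = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % folds == fold) {
                held.add(samples.get(i));
            }
        }
        return held;
    }

    // All remaining samples are used for training in that fold.
    static <T> List<T> trainPartition(List<T> samples, int folds, int fold) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % folds != fold) {
                train.add(samples.get(i));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("s0", "s1", "s2", "s3", "s4", "s5");
        int folds = 3;
        for (int fold = 0; fold < folds; fold++) {
            System.out.println("fold " + fold
                + " train=" + trainPartition(samples, folds, fold)
                + " test=" + testPartition(samples, folds, fold));
        }
    }
}
```

Every sample is tested exactly once across the folds, so the aggregated F-measure uses all the data without ever testing on a sample the corresponding model was trained on.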
Apache Flink
Mature project - 320+ contributors, > 11K commits
Very active project on GitHub
Java/Scala
Streaming first
Fault-Tolerant
Unified Batch and Streaming APIs
Stateful Stream Processing
Apache Flink - NLP Pipeline
final StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Annotation> rawStream = env.readFile(
    new AnnotationInputFormat(NewsArticleAnnotationFactory.getFactory()),
    parameterTool.getRequired("file"));
SplitStream<Annotation> articleStream = rawStream
    .map(new LanguageDetectorFunction())
    .split(new LanguageSelector(nlpLanguages));
Apache Flink - NLP Pipeline
articleStream.select("eng")
    .map(new SentenceDetectorFunction(engSentenceModel))
    .map(new TokenizerFunction(engTokenizerModel))
    .map(new POSTaggerFunction(engPosModel))
    .map(new ChunkerFunction(engChunkModel))
    .map(new NameFinderFunction(engNerPersonModel))
    .addSink(new ElasticsearchSink<>(config, transportAddresses,
        new ESSinkFunction()));
Apache Flink - NLP Pipeline
articleStream.select("por")
    .map(new SentenceDetectorFunction(porSentenceModel))
    .map(new TokenizerFunction(porTokenizerModel))
    .map(new POSTaggerFunction(porPosModel))
    .map(new ChunkerFunction(porChunkModel))
    .map(new NameFinderFunction(porNerPersonModel))
    .addSink(new ElasticsearchSink<>(config, transportAddresses,
        new ESSinkFunction()));
Apache Flink - NLP Pipeline
private static class LanguageSelector
    implements OutputSelector<Tuple2<String, String>> {
    @Override
    public Iterable<String> select(Tuple2<String, String> s) {
        List<String> list = new ArrayList<>();
        list.add(languageDetectorME.predictLanguage(s.f1).getLang());
        return list;
    }
}
Apache Flink - POS Tagger and NER
class POSTaggerMapFunction
    extends RichMapFunction<Tuple2<String, String[]>, POSSample> {
    …
    @Override
    public void open(Configuration parameters) throws Exception {
        // Create the tagger once per parallel task instance.
        posTagger = new POSTaggerME(model);
    }
    @Override
    public POSSample map(Tuple2<String, String[]> s) {
        String[] tags = posTagger.tag(s.f1);
        return new POSSample(s.f0, s.f1, tags);
    }
}
Apache Flink - POS Tagger and NER
class NameFinderMapFunction
    extends RichMapFunction<Tuple2<String, String[]>, NameSample> {
    …
    @Override
    public void open(Configuration parameters) throws Exception {
        // Create the name finder once per parallel task instance.
        nameFinder = new NameFinderME(model);
    }
    @Override
    public NameSample map(Tuple2<String, String[]> s) {
        Span[] names = nameFinder.find(s.f1);
        return new NameSample(s.f0, s.f1, names, null, true);
    }
}
TODO: Add Kibana preview screenshot
What’s Coming?
● Apache MXNet: mature project backed by Amazon, Apple, Intel, NVIDIA
● Modular: tensor library, reinforcement learning, ETL, ...
● Focused on integrating with the JVM ecosystem while supporting the state of the art, such as GPUs on large clusters
● Implements most neural nets you’d need for language
● Named Entity Recognition using MXNet with LSTMs
● Language Detection using MXNet, for short texts
● Possible: translation using bidirectional LSTMs with embeddings
● Computation graph architecture for more advanced use cases
Credits
Suneel Marthi (@suneelmarthi)
Tommaso Teofili (@tteofili)
William Colen (@wcolen)
Rodrigo Agerri (@ragerri)
Jörn Kottmann (@joernkottmann)
Peter Thygesen (in:thygesen)
Daniel Russ (in:daniel-russ-9541aa15)
Koji Sekiguchi (@kojisays)
Jeff Zemerick (in:jeffzemerick)
Bruno Kinoshita (@kinow)
Questions ???