Large Scale Processing of Text
Suneel Marthi
DataWorks Summit 2017,
San Jose, California
@suneelmarthi
Large Scale Text Processing with Apache OpenNLP and Apache Flink

Published in: Data & Analytics

Large Scale Text Processing

  1. Large Scale Processing of Text. Suneel Marthi, DataWorks Summit 2017, San Jose, California. @suneelmarthi
  2. $WhoAmI
     ● Principal Software Engineer in the Office of Technology, Red Hat
     ● Member of Apache Software Foundation
     ● Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams
  3. What is a Natural Language?
  4. What is a Natural Language? Any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation. (From Wikipedia)
  5. What is NOT a Natural Language?
  6. Characteristics of Natural Language: Unstructured, Ambiguous, Complex, Hidden semantic, Ironic, Informal, Unpredictable, Rich, Most updated, Noisy, Hard to search
  7. and it holds most of human knowledge
  8. and it holds most of human knowledge
  9. but it holds most of human knowledge
  10. "As information overload grows ever worse, computers may become our only hope for handling a growing deluge of documents." (MIT Press, May 12, 2017)
  11. What is Natural Language Processing? NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. (From Wikipedia)
  12. ???
  13. How?
  14. By solving small problems each time
      A pipeline in which one type of ambiguity is resolved at each step, incrementally.
      Sentence Detector: Mr. Robert talk is today at room num. 7. Let's go?
      [Diagram: two candidate sets of sentence boundaries for this text, one marked ❌ and one marked ✅]
      Tokenizer: Mr. Robert talk is today at room num. 7. Let's go?
      [Diagram: two candidate sets of token boundaries for this text, one marked ❌ and one marked ✅]
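The sentence-boundary ambiguity on this slide can be reproduced with a few lines of plain Java. The sketch below is not how Apache OpenNLP detects sentences (OpenNLP uses trained models); it is a rule-based illustration only. The class name `NaiveSentenceSplitter` and its abbreviation list are hypothetical, chosen to fit the slide's example sentence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class NaiveSentenceSplitter {

    // Hypothetical abbreviation list, for illustration only.
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Mr.", "Mrs.", "Dr.", "num."));

    /** Ends a sentence after every whitespace-separated token ending in '.', '?' or '!'. */
    public static List<String> splitNaively(String text) {
        return split(text, false);
    }

    /** Same rule, except that tokens found in ABBREVIATIONS never end a sentence. */
    public static List<String> splitWithAbbreviations(String text) {
        return split(text, true);
    }

    private static List<String> split(String text, boolean useAbbreviations) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            boolean endsSentence =
                    token.endsWith(".") || token.endsWith("?") || token.endsWith("!");
            boolean isAbbreviation = useAbbreviations && ABBREVIATIONS.contains(token);
            if (endsSentence && !isAbbreviation) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        // Keep any trailing text that had no closing punctuation.
        if (current.length() > 0) {
            sentences.add(current.toString());
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Mr. Robert talk is today at room num. 7. Let's go?";
        System.out.println(splitNaively(text).size());           // 4 fragments
        System.out.println(splitWithAbbreviations(text).size()); // 2 sentences
    }
}
```

The naive rule cuts after "Mr." and "num.", which is the ❌ case; the abbreviation-aware variant recovers the two intended sentences. A fixed list does not generalize, which is one motivation for a trainable sentence detector.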
  15. By solving small problems each time
      Each step of a pipeline solves one ambiguity problem.
      Name Finder:
        <Person>Washington</Person> was the first president of the USA.
        <Place>Washington</Place> is a state in the Pacific Northwest region of the USA.
      POS Tagger:
        Laura/NNP Keene/NNP brushed/VBD by/IN him/PRP with/IN the/DT glass/NN of/IN water/NN ./.
  16. By solving small problems each time
      A pipeline can be long and resolve many ambiguities.
      Lemmatizer:
        He is better than many others
        He be good   than many other
  17. Apache OpenNLP
  18. Apache OpenNLP
      Mature project (> 10 years), Actively developed, Machine learning, Java, Easy to train, Highly customizable, Fast
      Language Detector (soon), Sentence detector, Tokenizer, Part of Speech Tagger, Lemmatizer, Chunker, Parser, ....
  19. Training Models for English
      Corpus - OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19)

          bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-ontonotes.bin

          bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin
  20. Training Models for Portuguese
      Corpus - Amazonia (http://www.linguateca.pt/floresta/corpus.html)

          bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin -detokenizer lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1

          bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin -encoding ISO-8859-1 -includeFeatures false

          bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin -encoding ISO-8859-1

          bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin -encoding ISO-8859-1
  21. Name Finder API - Detect Names

          NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(
              OpenNLPMain.class.getResource("/opennlp-models/por-ner.bin")));

          for (String[][] document : documents) {
            for (String[] sentence : document) {
              Span[] nameSpans = nameFinder.find(sentence);
              // do something with the names
            }
            // reset document-level adaptive data before the next document
            nameFinder.clearAdaptiveData();
          }
  22. Name Finder API - Train a model

          ObjectStream<String> lineStream = new PlainTextByLineStream(
              new FileInputStream("en-ner-person.train"), StandardCharsets.UTF_8);

          TokenNameFinderModel model;
          try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream)) {
            model = NameFinderME.train("en", "person", sampleStream,
                TrainingParameters.defaultParams(), new TokenNameFinderFactory());
          }

          model.serialize(modelFile);
  23. Name Finder API - Evaluate a model

          TokenNameFinderEvaluator evaluator =
              new TokenNameFinderEvaluator(new NameFinderME(model));
          evaluator.evaluate(sampleStream);

          FMeasure result = evaluator.getFMeasure();
          System.out.println(result.toString());
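The evaluator on this slide reports an F-measure. As a rough sketch of what that number summarizes, the plain-Java class below accumulates precision, recall and F1 over predicted versus reference name spans. It is not OpenNLP's `FMeasure` class: `SpanFMeasure`, the `"start:end:type"` string encoding of a span, and the choice to return 0.0 when a denominator is zero are all assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SpanFMeasure {

    private long truePositives; // predicted spans that exactly match a reference span
    private long predicted;     // all predicted spans
    private long expected;      // all reference spans

    /** Adds one sentence's spans; each span is encoded as "start:end:type". */
    public void update(Set<String> reference, Set<String> prediction) {
        Set<String> correct = new HashSet<>(prediction);
        correct.retainAll(reference);
        truePositives += correct.size();
        predicted += prediction.size();
        expected += reference.size();
    }

    /** Fraction of predicted spans that are correct. */
    public double precision() {
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    /** Fraction of reference spans that were found. */
    public double recall() {
        return expected == 0 ? 0.0 : (double) truePositives / expected;
    }

    /** Harmonic mean of precision and recall. */
    public double f1() {
        double p = precision();
        double r = recall();
        return (p + r) == 0.0 ? 0.0 : 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        SpanFMeasure measure = new SpanFMeasure();
        // Sentence 1: one reference name, found, plus one spurious prediction.
        measure.update(
                new HashSet<>(Arrays.asList("0:2:person")),
                new HashSet<>(Arrays.asList("0:2:person", "5:6:person")));
        // Sentence 2: one reference name, missed entirely.
        measure.update(
                new HashSet<>(Arrays.asList("3:4:person")),
                Collections.<String>emptySet());
        System.out.println("precision=" + measure.precision()
                + " recall=" + measure.recall() + " f1=" + measure.f1());
    }
}
```

In the example, one of two predictions is correct and one of two reference names is found, so precision, recall and F1 are all 0.5.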
  24. Name Finder API - Cross Evaluate a model

          FileInputStream sampleDataIn = new FileInputStream("en-ner-person.train");
          // wrap the line stream so that it yields NameSample objects
          ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
              new PlainTextByLineStream(sampleDataIn.getChannel(), StandardCharsets.UTF_8));

          TokenNameFinderCrossValidator evaluator =
              new TokenNameFinderCrossValidator("en", 100, 5);
          evaluator.evaluate(sampleStream, 10);

          FMeasure result = evaluator.getFMeasure();
          System.out.println(result.toString());
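Cross-validation repeatedly holds out one part of the training data for testing and trains on the rest. The sketch below shows one simple way to form such partitions, by round-robin assignment of samples to folds. It makes no claim about how `TokenNameFinderCrossValidator` partitions its data internally; the class `KFoldSplit` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class KFoldSplit {

    /** Samples held out for testing in the given fold (round-robin assignment). */
    public static <T> List<T> testFold(List<T> samples, int folds, int foldIndex) {
        List<T> test = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % folds == foldIndex) {
                test.add(samples.get(i));
            }
        }
        return test;
    }

    /** Samples used for training in the given fold: everything not held out. */
    public static <T> List<T> trainingFolds(List<T> samples, int folds, int foldIndex) {
        List<T> train = new ArrayList<>();
        for (int i = 0; i < samples.size(); i++) {
            if (i % folds != foldIndex) {
                train.add(samples.get(i));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9");
        int folds = 5;
        for (int fold = 0; fold < folds; fold++) {
            // A real run would train on the training part and evaluate on the test part here.
            System.out.println("fold " + fold
                    + " train=" + trainingFolds(samples, folds, fold)
                    + " test=" + testFold(samples, folds, fold));
        }
    }
}
```

Every sample is used for testing exactly once across the folds, so the per-fold scores can be combined into a single estimate.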
  25. [Diagram: Language Detector, Sentence Detector, Tokenizer, POS Tagger, Lemmatizer, Name Finder, Chunker; Language 1, Language 2, ... Language N; Index]
  26. Apache Flink
  27. Apache Flink
      Mature project - 320+ contributors, > 11K commits
      Very active project on GitHub
      Java/Scala
      Streaming first
      Fault-tolerant
      Scalable - to 1000s of nodes and more
      High throughput, low latency
  28. Apache Flink - POS Tagger and NER

          final StreamExecutionEnvironment env =
              StreamExecutionEnvironment.getExecutionEnvironment();

          DataStream<String> portugueseText = env.readTextFile(
              OpenNLPMain.class.getResource("/input/por_newscrawl.txt").getFile());
          DataStream<String> engText = env.readTextFile(
              OpenNLPMain.class.getResource("/input/eng_news.txt").getFile());

          DataStream<String> mergedStream = engText.union(portugueseText);

          // The step that turns each String line into a Tuple2<String, String> is not shown on the slide.
          SplitStream<Tuple2<String, String>> langStream =
              mergedStream.split(new LanguageSelector());
  29. Apache Flink - POS Tagger and NER

          DataStream<Tuple2<String, String>> porNewsArticles = langStream.select("por");

          DataStream<Tuple2<String, String[]>> porNewsTokenized =
              porNewsArticles.map(new PorTokenizerMapFunction());

          DataStream<POSSample> porNewsPOS =
              porNewsTokenized.map(new PorPOSTaggerMapFunction());

          DataStream<NameSample> porNewsNamedEntities =
              porNewsTokenized.map(new PorNameFinderMapFunction());
  30. Apache Flink - POS Tagger and NER

          private static class LanguageSelector
              implements OutputSelector<Tuple2<String, String>> {
            public Iterable<String> select(Tuple2<String, String> s) {
              List<String> list = new ArrayList<>();
              list.add(languageDetectorME.predictLanguage(s.f1).getLang());
              return list;
            }
          }

          private static class PorTokenizerMapFunction
              implements MapFunction<Tuple2<String, String>, Tuple2<String, String[]>> {
            public Tuple2<String, String[]> map(Tuple2<String, String> s) {
              return new Tuple2<>(s.f0, porTokenizer.tokenize(s.f0));
            }
          }
  31. Apache Flink - POS Tagger and NER

          private static class PorPOSTaggerMapFunction
              implements MapFunction<Tuple2<String, String[]>, POSSample> {
            public POSSample map(Tuple2<String, String[]> s) {
              String[] tags = porPosTagger.tag(s.f1);
              return new POSSample(s.f0, s.f1, tags);
            }
          }

          private static class PorNameFinderMapFunction
              implements MapFunction<Tuple2<String, String[]>, NameSample> {
            public NameSample map(Tuple2<String, String[]> s) {
              // the slide calls engNameFinder here; a Portuguese name finder is presumably intended
              Span[] names = porNameFinder.find(s.f1);
              return new NameSample(s.f0, s.f1, names, null, true);
            }
          }
  32. What’s Coming ??
  33. What’s Coming ??
      ● DL4J: Mature Project: 114 contributors, ~8k commits
      ● Modular: Tensor library, reinforcement learning, ETL, ..
      ● Focused on integrating with the JVM ecosystem while supporting state of the art like GPUs on large clusters
      ● Implements most neural nets you’d need for language
      ● Named Entity Recognition using DL4J with LSTMs
      ● Language Detection using DL4J with LSTMs
      ● Possible: Translation using Bidirectional LSTMs with embeddings
      ● Computation graph architecture for more advanced use cases
  34. Credits
      Joern Kottmann - PMC Chair, Apache OpenNLP
      Tommaso Teofili - PMC, Apache Lucene, Apache OpenNLP
      William Colen - Head of Technology, Stilingue - Inteligência Artificial, Sao Paulo, Brazil; PMC, Apache OpenNLP
      Till Rohrmann - Engineering Lead, Data Artisans, Berlin, Germany; Committer and PMC, Apache Flink
      Fabian Hueske - Data Artisans; Committer and PMC on Apache Flink
  35. Questions ???
