Large Scale Text Processing

Large Scale Text Processing with Apache OpenNLP and Apache Flink

Large Scale Processing of Text
Suneel Marthi
DataWorks Summit 2017,
San Jose, California
@suneelmarthi

Large Scale Text Processing

  • 1. Large Scale Processing of Text Suneel Marthi DataWorks Summit 2017, San Jose, California @suneelmarthi
  • 2. $WhoAmI
       ● Principal Software Engineer in the Office of Technology, Red Hat
       ● Member of Apache Software Foundation
       ● Committer and PMC member on Apache Mahout, Apache OpenNLP, Apache Streams
  • 3. What is a Natural Language?
  • 4. What is a Natural Language? Any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation. (From Wikipedia)
  • 5. What is NOT a Natural Language?
  • 6. Characteristics of Natural Language: Unstructured, Ambiguous, Complex, Hidden semantic, Ironic, Informal, Unpredictable, Rich, Most updated, Noise, Hard to search
  • 9. But it holds most of human knowledge
  • 10. "As information overload grows ever worse, computers may become our only hope for handling a growing deluge of documents." (MIT Press, May 12, 2017)
  • 11. What is Natural Language Processing? NLP is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora. (From Wikipedia)
  • 12. ???
  • 14. How?
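  • 15. By solving small problems each time
       A pipeline in which one type of ambiguity is resolved at each step, incrementally.
       Sentence Detector: Mr. Robert talk is today at room num. 7. Let's go?
       [Figure: sentence boundary markers under the text, four candidate boundaries (❌) versus the correct two (✅)]
       Tokenizer: Mr. Robert talk is today at room num. 7. Let's go?
       [Figure: token boundary markers under the text, an incorrect tokenization (❌) versus the correct one (✅)]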
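The sentence-boundary ambiguity on slide 15 can be reproduced in a few lines of plain Java. The sketch below is not how OpenNLP's sentence detector works (that component is a trained model); it is a rule-based stand-in, and the class name `BoundarySketch` and the abbreviation set are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy rule-based sentence splitter; illustration only, not OpenNLP's algorithm. */
final class BoundarySketch {

    /**
     * Splits text into sentences. A whitespace-delimited token ending in
     * '.', '?' or '!' closes the sentence unless it is a known abbreviation.
     */
    static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // only happens for blank input
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(token);
            char last = token.charAt(token.length() - 1);
            boolean terminator = last == '.' || last == '?' || last == '!';
            if (terminator && !abbreviations.contains(token)) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            sentences.add(current.toString()); // trailing text without a terminator
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Mr. Robert talk is today at room num. 7. Let's go?";
        // No abbreviation knowledge: every period is treated as a boundary.
        System.out.println(split(text, Set.of()));
        // Hypothetical abbreviation list: "Mr." and "num." no longer end a sentence.
        System.out.println(split(text, Set.of("Mr.", "num.")));
    }
}
```

With an empty abbreviation set the example text breaks into four fragments; with the two abbreviations it yields the two intended sentences. A hand-written list like this does not generalize, which is the motivation for a trainable detector.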
  • 16. By solving small problems each time
       Each step of a pipeline solves one ambiguity problem.
       Name Finder:
         <Person>Washington</Person> was the first president of the USA.
         <Place>Washington</Place> is a state in the Pacific Northwest region of the USA.
       POS Tagger:
         Laura/NNP Keene/NNP brushed/VBD by/IN him/PRP with/IN the/DT glass/NN of/IN water/NN ./.
  • 17. By solving small problems each time
       A pipeline can be long and resolve many ambiguities.
       Lemmatizer:
         He is better than many others → He be good than many other
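The lemmatizer step on slide 17 can be pictured as a lookup from surface form to base form. The following is a minimal sketch under that assumption: the class `LemmaSketch` and its three-entry dictionary are hypothetical and cover only the slide's example. OpenNLP's own lemmatizers additionally take the POS tags as input, which this toy ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy dictionary lemmatizer; illustration only. */
final class LemmaSketch {

    // Hypothetical mini-dictionary covering only the slide's example sentence.
    private static final Map<String, String> LEMMAS = Map.of(
            "is", "be",
            "better", "good",
            "others", "other");

    /** Returns one lemma per token; unknown tokens are passed through unchanged. */
    static List<String> lemmatize(List<String> tokens) {
        List<String> lemmas = new ArrayList<>();
        for (String token : tokens) {
            lemmas.add(LEMMAS.getOrDefault(token.toLowerCase(Locale.ROOT), token));
        }
        return lemmas;
    }

    public static void main(String[] args) {
        System.out.println(lemmatize(List.of("He", "is", "better", "than", "many", "others")));
    }
}
```

The lemmatizer sits late in the pipeline because it depends on tokens (and, in a real system, tags) produced by the earlier steps.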
  • 19. Apache OpenNLP
       Mature project (> 10 years), actively developed, machine learning, Java, easy to train, highly customizable, fast
       Components: Language Detector (soon), Sentence Detector, Tokenizer, Part of Speech Tagger, Lemmatizer, Chunker, Parser, ...
  • 20. Training Models for English
       Corpus: OntoNotes (https://catalog.ldc.upenn.edu/ldc2013t19)
       bin/opennlp TokenNameFinderTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-ontonotes.bin
       bin/opennlp POSTaggerTrainer.ontonotes -lang eng -ontoNotesDir ~/opennlp-data-dir/ontonotes4/data/files/data/english/ -model en-pos-maxent.bin
  • 21. Training Models for Portuguese
       Corpus: Amazonia (http://www.linguateca.pt/floresta/corpus.html)
       bin/opennlp TokenizerTrainer.ad -lang por -data amazonia.ad -model por-tokenizer.bin -detokenizer lang/pt/tokenizer/pt-detokenizer.xml -encoding ISO-8859-1
       bin/opennlp POSTaggerTrainer.ad -lang por -data amazonia.ad -model por-pos.bin -encoding ISO-8859-1 -includeFeatures false
       bin/opennlp ChunkerTrainerME.ad -lang por -data amazonia.ad -model por-chunk.bin -encoding ISO-8859-1
       bin/opennlp TokenNameFinderTrainer.ad -lang por -data amazonia.ad -model por-ner.bin -encoding ISO-8859-1
  • 22. Name Finder API - Detect Names
       NameFinderME nameFinder = new NameFinderME(new TokenNameFinderModel(
           OpenNLPMain.class.getResource("/opennlp-models/por-ner.bin")));
       // documents is a String[][][]: document -> sentence -> tokens
       for (String[][] document : documents) {
         for (String[] sentence : document) {
           Span[] nameSpans = nameFinder.find(sentence);
           // do something with the names
         }
         // forget document-level context before moving to the next document
         nameFinder.clearAdaptiveData();
       }
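One way to fill in the `// do something with the names` placeholder is to turn each span back into text. The spans returned by `find` are token index ranges with an exclusive end. The sketch below uses a hypothetical stand-in class `TokenSpan` instead of OpenNLP's `Span` so that it runs without the library; the example tokens and the `person` type are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Shows how token-index spans map back to the covered text; illustration only. */
final class SpanSketch {

    /** Stand-in for a name span: token indices [start, end) plus an entity type. */
    static final class TokenSpan {
        final int start;
        final int end;
        final String type;

        TokenSpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Joins the tokens covered by each span and prefixes the entity type. */
    static List<String> coveredText(TokenSpan[] spans, String[] tokens) {
        List<String> names = new ArrayList<>();
        for (TokenSpan span : spans) {
            String text = String.join(" ", Arrays.copyOfRange(tokens, span.start, span.end));
            names.add(span.type + ": " + text);
        }
        return names;
    }

    public static void main(String[] args) {
        String[] tokens = {"Laura", "Keene", "brushed", "by", "him"};
        TokenSpan[] spans = {new TokenSpan(0, 2, "person")};
        System.out.println(coveredText(spans, tokens));
    }
}
```

With the real library, `Span.spansToStrings(nameSpans, sentence)` performs the same join on OpenNLP's own span type.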
  • 23. Name Finder API - Train a model
       // newer OpenNLP releases read training data through an InputStreamFactory
       ObjectStream<String> lineStream = new PlainTextByLineStream(
           new MarkableFileInputStreamFactory(new File("en-ner-person.train")), StandardCharsets.UTF_8);
       TokenNameFinderModel model;
       try (ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream)) {
         model = NameFinderME.train("en", "person", sampleStream,
             TrainingParameters.defaultParams(), new TokenNameFinderFactory());
       }
       model.serialize(modelFile);
  • 24. Name Finder API - Evaluate a model
       TokenNameFinderEvaluator evaluator = new TokenNameFinderEvaluator(new NameFinderME(model));
       evaluator.evaluate(sampleStream);
       FMeasure result = evaluator.getFMeasure();
       System.out.println(result.toString());
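The evaluator on slide 24 reports precision, recall and F-measure. As a rough sketch of what those numbers mean for name spans: precision is the share of predicted spans that are correct, recall is the share of reference spans that were found, and F1 is their harmonic mean. The class `FMeasureSketch`, the counts in `main`, and the choice to return 0 for empty denominators are assumptions for illustration, not a description of OpenNLP's `FMeasure` internals.

```java
/** Precision, recall and F1 from span counts; illustration only. */
final class FMeasureSketch {

    /** Correct predictions divided by all predictions (0 when nothing was predicted). */
    static double precision(int truePositives, int predicted) {
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    /** Correct predictions divided by all reference spans (0 when there are none). */
    static double recall(int truePositives, int reference) {
        return reference == 0 ? 0.0 : (double) truePositives / reference;
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 10 spans predicted, 8 of them correct, 16 in the reference data.
        double p = precision(8, 10);
        double r = recall(8, 16);
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n", p, r, f1(p, r));
    }
}
```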
  • 25. Name Finder API - Cross Evaluate a model
       FileInputStream sampleDataIn = new FileInputStream("en-ner-person.train");
       // wrap the line stream so that it yields NameSample objects rather than raw lines
       ObjectStream<NameSample> sampleStream = new NameSampleDataStream(
           new PlainTextByLineStream(sampleDataIn.getChannel(), StandardCharsets.UTF_8));
       TokenNameFinderCrossValidator evaluator = new TokenNameFinderCrossValidator("en", 100, 5);
       evaluator.evaluate(sampleStream, 10); // 10-fold cross validation
       FMeasure result = evaluator.getFMeasure();
       System.out.println(result.toString());
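The `10` passed to `evaluate` on slide 25 is the number of folds. The idea behind k-fold cross validation can be sketched as follows: the samples are divided into k folds, and each round trains on k-1 folds and tests on the one held out. The class `FoldSketch` and the round-robin assignment of samples to folds are assumptions for illustration; how OpenNLP partitions the sample stream internally is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Round-robin k-fold partitioning; illustration only. */
final class FoldSketch {

    /** Distributes samples over k folds; sample i lands in fold i % k. */
    static <T> List<List<T>> folds(List<T> samples, int k) {
        if (k < 2) {
            throw new IllegalArgumentException("k must be at least 2");
        }
        List<List<T>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < samples.size(); i++) {
            folds.get(i % k).add(samples.get(i));
        }
        return folds;
    }

    /** Training data for one round: every fold except the held-out one. */
    static <T> List<T> trainingSet(List<List<T>> folds, int heldOut) {
        List<T> training = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != heldOut) {
                training.addAll(folds.get(i));
            }
        }
        return training;
    }

    public static void main(String[] args) {
        List<Integer> samples = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<List<Integer>> folds = folds(samples, 5);
        for (int round = 0; round < folds.size(); round++) {
            System.out.println("test=" + folds.get(round) + " train=" + trainingSet(folds, round));
        }
    }
}
```

The scores of the individual rounds are then combined into one result, which is what `getFMeasure()` returns on the slide.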
  • 28. Apache Flink
       Mature project: 320+ contributors, > 11K commits
       Very active project on GitHub
       Java/Scala
       Streaming first
       Fault-tolerant
       Scalable to 1000s of nodes and more
       High throughput, low latency
  • 29. Apache Flink - Pos Tagger and NER
       final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
       DataStream<String> portugeseText = env.readTextFile(
           OpenNLPMain.class.getResource("/input/por_newscrawl.txt").getFile());
       DataStream<String> engText = env.readTextFile(
           OpenNLPMain.class.getResource("/input/eng_news.txt").getFile());
       // merge the English and Portuguese inputs into a single stream
       DataStream<String> mergedStream = engText.union(portugeseText);
       SplitStream<Tuple2<String, String>> langStream = mergedStream.split(new LanguageSelector());
  • 30. Apache Flink - Pos Tagger and NER
       DataStream<Tuple2<String, String>> porNewsArticles = langStream.select("por");
       DataStream<Tuple2<String, String[]>> porNewsTokenized =
           porNewsArticles.map(new PorTokenizerMapFunction());
       DataStream<POSSample> porNewsPOS =
           porNewsTokenized.map(new PorPOSTaggerMapFunction());
       DataStream<NameSample> porNewsNamedEntities =
           porNewsTokenized.map(new PorNameFinderMapFunction());
  • 31. Apache Flink - Pos Tagger and NER
       private static class LanguageSelector implements OutputSelector<Tuple2<String, String>> {
         public Iterable<String> select(Tuple2<String, String> s) {
           List<String> list = new ArrayList<>();
           // route each record to the output named after its detected language
           list.add(languageDetectorME.predictLanguage(s.f1).getLang());
           return list;
         }
       }
       private static class PorTokenizerMapFunction
           implements MapFunction<Tuple2<String, String>, Tuple2<String, String[]>> {
         public Tuple2<String, String[]> map(Tuple2<String, String> s) {
           // tokenize the text field (f1), the same field the language selector reads
           return new Tuple2<>(s.f0, porTokenizer.tokenize(s.f1));
         }
       }
  • 32. Apache Flink - Pos Tagger and NER
       private static class PorPOSTaggerMapFunction
           implements MapFunction<Tuple2<String, String[]>, POSSample> {
         public POSSample map(Tuple2<String, String[]> s) {
           String[] tags = porPosTagger.tag(s.f1);
           return new POSSample(s.f0, s.f1, tags);
         }
       }
       private static class PorNameFinderMapFunction
           implements MapFunction<Tuple2<String, String[]>, NameSample> {
         public NameSample map(Tuple2<String, String[]> s) {
           // Portuguese tokens need the Portuguese name finder
           // (porNameFinder is an assumed field name, by analogy with porTokenizer and porPosTagger)
           Span[] names = porNameFinder.find(s.f1);
           return new NameSample(s.f0, s.f1, names, null, true);
         }
       }
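The routing idea behind slides 29 to 32 (detect the language of each record, then send it to a language-specific pipeline) can be sketched without Flink. Everything below is a hypothetical stand-in: `RoutingSketch`, the keyword-based `detectLanguage`, and the sample sentences are assumptions, and a plain `Map` plays the role of the named outputs that `split`/`select` provide on the slides. Later Flink releases deprecated `split`/`select` in favor of side outputs, so the API to use depends on the Flink version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Routes documents to per-language buckets; illustration only. */
final class RoutingSketch {

    // Hypothetical hint words; a real detector would be a trained model.
    private static final Set<String> PORTUGUESE_HINTS = Set.of("de", "que", "para", "uma");

    /** Toy language detector returning "por" or "eng". */
    static String detectLanguage(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (PORTUGUESE_HINTS.contains(token)) {
                return "por";
            }
        }
        return "eng";
    }

    /** Groups documents by the label the detector assigns, preserving arrival order. */
    static Map<String, List<String>> route(List<String> docs, Function<String, String> detector) {
        Map<String, List<String>> byLanguage = new LinkedHashMap<>();
        for (String doc : docs) {
            byLanguage.computeIfAbsent(detector.apply(doc), key -> new ArrayList<>()).add(doc);
        }
        return byLanguage;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "The president spoke today",
                "O presidente chegou de Lisboa");
        // Each bucket would then go through its own tokenizer, POS tagger and name finder.
        System.out.println(route(docs, RoutingSketch::detectLanguage));
    }
}
```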
  • 34. What’s Coming ??
       ● DL4J: Mature Project: 114 contributors, ~8k commits
       ● Modular: Tensor library, reinforcement learning, ETL, ...
       ● Focused on integrating with the JVM ecosystem while supporting state of the art, such as GPUs on large clusters
       ● Implements most neural nets you’d need for language
       ● Named Entity Recognition using DL4J with LSTMs
       ● Language Detection using DL4J with LSTMs
       ● Possible: Translation using Bidirectional LSTMs with embeddings
       ● Computation graph architecture for more advanced use cases
  • 35. Credits
       Joern Kottmann: PMC Chair, Apache OpenNLP
       Tommaso Teofili: PMC, Apache Lucene, Apache OpenNLP
       William Colen: Head of Technology, Stilingue - Inteligência Artificial, Sao Paulo, Brazil; PMC, Apache OpenNLP
       Till Rohrmann: Engineering Lead, Data Artisans, Berlin, Germany; Committer and PMC, Apache Flink
       Fabian Hueske: Data Artisans; Committer and PMC on Apache Flink