Exploring Natural Language Processing in Ruby

Exploring Natural Language Processing in Ruby - Tokyo Rubyist Meetup (April 9th, 2015)

This presentation will cover 3 natural language processing gems I’ve released over the past year:

* Pragmatic Segmenter (a sentence boundary detection gem)
* Chat Correct (a gem for English teachers/students that provides error analysis when an incorrect sentence is diffed with a correct sentence)
* Word Count Analyzer (a gem that analyzes a string for potential “word count gray areas” which cause tools to report different word counts)

The talk will cover various aspects of building these gems, including working from first principles, testing edge cases, and getting comfortable with regular expressions. I’ll also introduce a project that is currently in progress: a new algorithm for parallel text alignment, along with some of the related challenges in building it.

Exploring Natural Language
Processing in Ruby
Kevin Dias
Tokyo Rubyist Meetup - April 9th, 2015
Let's explore the world of natural language processing with Ruby
Developer at
Twitter: @diasks2
GitHub: diasks2
Pragmatic Segmenter
Chat Correct
Word Count Analyzer
? ? ?
Pragmatic Segmenter
A rule-based sentence boundary
detection gem that works out-of-the-box
across many languages.
What is segmentation?
Segmentation is the process of splitting a text
into segments or sentences. In other words,
deciding where sentences begin and end.
text = "Hello Tokyo Rubyists. Let’s try segmentation."
segment #1: Hello Tokyo Rubyists.
segment #2: Let’s try segmentation.
Why care about segmentation?
Sentence segmentation is the foundation of many
common NLP tasks:
• Translation
• Machine translation
• Bitext alignment
• Summarization
• Part-of-speech tagging
• Grammar parsing
Errors in segmentation compound
into errors in these other NLP tasks

  • 7. Why reinvent the wheel?
      • Most segmentation libraries are built to support only English (or English plus a few other languages)
      • Current solutions do not handle ill-formatted content well
      • Some libraries perform really well when trained with data in a specific language and a specific domain, but what happens when your data could come from any language and/or domain?
  • 8. Sentence segmentation methods
      • Machine learning
      • Rule-based
      • Tokenize-first group-later (e.g. Stanford CoreNLP)
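To make the third approach more concrete, here is a minimal sketch of a tokenize-first, group-later splitter using only the core library. The method names `tokenize` and `group_into_sentences` and the terminator list are illustrative assumptions; this is not how Stanford CoreNLP is actually implemented.

```ruby
# Step 1: split the text into word and punctuation tokens.
# Words may contain an internal apostrophe (e.g. "Let's").
def tokenize(text)
  text.scan(/\w+(?:'\w+)*|[^\w\s]/)
end

# Step 2: group the token stream into sentences, closing a sentence
# whenever a terminator token is seen. The terminator list is an assumption.
def group_into_sentences(tokens, terminators: %w[. ! ?])
  tokens.slice_after { |token| terminators.include?(token) }.to_a
end

tokens = tokenize("Hello Tokyo Rubyists. Let's try segmentation.")
group_into_sentences(tokens).each { |sentence| puts sentence.inspect }
```

A grouping step this simple treats every period token as a boundary, so it would still need extra rules for abbreviations and similar cases.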
  • 9. How can we achieve the following in Ruby¹?
      string = "Hello world. Let’s try segmentation."
      Desired output: ["Hello world.", "Let’s try segmentation."]
      ¹ Using the core or standard library (no gems)
  • 10. Time to check your solutions
  • 11. Some potential answers
      • string.scan(/[^.]+[.]/).map(&:strip)
      • string.scan(/(?<=\s|\A)[^.]+[.]/)
      • string.split(/(?<=\.)\s*/)
      • string.split(/(?<=\.)/).map(&:strip)
      • string.split('.').map { |segment| segment.strip.insert(-1, '.') }
      • … your answer
  • 12. Let’s change the original string
      string = "Hello from Mt. Fuji. Let’s try segmentation."
      Desired output: ["Hello from Mt. Fuji.", "Let’s try segmentation."]
  • 13. Uh oh…
      string = "Hello from Mt. Fuji. Let’s try segmentation."
      string.scan(/[^.]+[.]/).map(&:strip)
      => ["Hello from Mt.", "Fuji.", "Let’s try segmentation."]
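The failure above can be reproduced, and partly patched, with the core library alone. The following is a minimal sketch, not how Pragmatic Segmenter works internally: the `ABBREVIATIONS` list and the `abbreviation_aware_segment` helper are hypothetical names invented for this illustration, and the list is deliberately tiny.

```ruby
# Naive approach from the slides: every period ends a sentence.
def naive_segment(text)
  text.scan(/[^.]+[.]/).map(&:strip)
end

# Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
ABBREVIATIONS = %w[mt mr mrs dr].freeze

# Split after a period followed by whitespace, then glue a chunk back onto the
# next one when the word before its final period is a known abbreviation.
def abbreviation_aware_segment(text)
  segments = []
  current = +""
  text.split(/(?<=\.)\s+/).each do |chunk|
    current << " " unless current.empty?
    current << chunk
    last_word = chunk[/(\w+)\.\z/, 1]
    next if last_word && ABBREVIATIONS.include?(last_word.downcase)

    segments << current
    current = +""
  end
  segments << current unless current.empty?
  segments
end

text = "Hello from Mt. Fuji. Let's try segmentation."
p naive_segment(text)
p abbreviation_aware_segment(text)
```

The second method keeps "Mt. Fuji." together, but it still breaks on most of the other edge cases (numbers, URLs, abbreviations that really do end a sentence), which is why a single regular expression or a short abbreviation list is not enough.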
  • 14. Let’s brainstorm other edge cases that will make our first solution fail
      • abbreviations
      • …
      • …
      • …
      • …
      • …
  • 15. Golden Rules. Currently 52 English Golden Rules covering edge cases such as:
      • abbreviations
      • abbreviations at the end of a sentence
      • numbers
      • parentheticals
      • email addresses
      • web addresses
      • quotations
      • lists
      • geo coordinates
      • ellipses
  • 16. Rubyists like to keep it DRY. Most researchers use either the WSJ corpus or the Brown corpus from the Penn Treebank to test their segmentation algorithm. There are limits to using these corpora:
      1. The corpora may be too expensive for some people ($1,700).
      2. The majority of the sentences in the corpora end with a regular word followed by a period, thus testing the same thing over and over again.
    "In the Brown Corpus 92% of potential sentence boundaries come after a regular word. The WSJ Corpus is richer with abbreviations and only 83% of sentences end with a regular word followed by a period." (Andrei Mikheev, Periods, Capitalized Words, etc.)
  • 17. A comparison of segmentation libraries
      Name | Language | License | Golden Rule Score (English) | Golden Rule Score (Other Languages) | Speed †
      Pragmatic Segmenter | Ruby | MIT | 98.08% | 100.00% | 3.84 s
      TactfulTokenizer | Ruby | GNU GPLv3 | 65.38% | 48.57% | 46.32 s
      Open NLP | Java | APLv2 | 59.62% | 45.71% | 1.27 s
      Stanford CoreNLP | Java | GNU GPLv3 | 59.62% | 31.43% | 0.92 s
      Splitta | Python | APLv2 | 55.77% | 37.14% | N/A
      Punkt | Python | APLv2 | 46.15% | 48.57% | 1.79 s
      SRX English | Ruby | GNU GPLv3 | 30.77% | 28.57% | 6.19 s
      Scapel | Ruby | GNU GPLv3 | 28.85% | 20.00% | 0.13 s
    † The performance test takes the 50 English Golden Rules combined into one string and runs it 100 times through each library. The number is an average of 10 runs.
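A Golden Rule score of this kind can be read as the percentage of input/expected-output pairs a segmenter gets exactly right. The sketch below assumes that reading; the three rules in `GOLDEN_RULES` are illustrative stand-ins, not the gem's actual spec list, and `golden_rule_score` is a hypothetical helper.

```ruby
# Each "golden rule" pairs an input with the expected segments.
# These three cases are illustrative only (an assumption for this sketch).
GOLDEN_RULES = [
  { input: "Hello world. Let's try segmentation.",
    expected: ["Hello world.", "Let's try segmentation."] },
  { input: "Hello from Mt. Fuji. Let's try segmentation.",
    expected: ["Hello from Mt. Fuji.", "Let's try segmentation."] },
  { input: "It costs 3.50 dollars. Cheap.",
    expected: ["It costs 3.50 dollars.", "Cheap."] }
].freeze

# Percentage of rules for which the block (the segmenter under test)
# returns exactly the expected segments.
def golden_rule_score(rules)
  passed = rules.count { |rule| yield(rule[:input]) == rule[:expected] }
  (100.0 * passed / rules.size).round(2)
end

# Score the naive "every period ends a sentence" segmenter.
puts golden_rule_score(GOLDEN_RULES) { |text| text.scan(/[^.]+[.]/).map(&:strip) }

# With 52 rules, 51 passing specs gives the 98.08% shown in the table.
puts((100.0 * 51 / 52).round(2))
```

Under this reading, the 98.08% in the table is consistent with 51 of the 52 English rules passing, and 65.38% with 34 of 52.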
  • 18. The Holy Grail. Golden Rule #18: A.M. / P.M. as non sentence boundary and sentence boundary
      Input: "At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store."
      Expected: ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."]
    All tested segmentation libraries failed this spec.
  • 19. Chat Correct A Ruby gem that shows the errors and error types when a correct English sentence is diffed with an incorrect English sentence.
  • 20. The problem. I was giving a weekly Skype English lesson and the student was focusing on writing practice for the TOEFL test. I would correct the student’s sentence, but it would often seem as if he was missing some of my corrections - even if I read it with a LOT OF STRESS!!
  • 21. The idea. A color coded way to PoInT OuT a student’s mistake(s)
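The slides do not describe how Chat Correct computes its diff, so the following is only a generic illustration of the idea: a word-level diff built on a longest-common-subsequence table, whose output could then be color coded. The names `lcs_table` and `word_diff` are hypothetical and are not part of the gem.

```ruby
# Longest-common-subsequence table over two word arrays.
# table[i][j] holds the LCS length of a[0...i] and b[0...j].
def lcs_table(a, b)
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  a.each_index do |i|
    b.each_index do |j|
      table[i + 1][j + 1] =
        a[i] == b[j] ? table[i][j] + 1 : [table[i][j + 1], table[i + 1][j]].max
    end
  end
  table
end

# Returns an array of [tag, word] pairs, where tag is :same, :removed
# (only in the incorrect sentence) or :added (only in the correct one).
def word_diff(incorrect, correct)
  a = incorrect.split
  b = correct.split
  table = lcs_table(a, b)
  i = a.size
  j = b.size
  ops = []
  # Walk the table backwards from the bottom-right corner.
  while i > 0 || j > 0
    if i > 0 && j > 0 && a[i - 1] == b[j - 1]
      ops << [:same, a[i - 1]]
      i -= 1
      j -= 1
    elsif j > 0 && (i.zero? || table[i][j - 1] >= table[i - 1][j])
      ops << [:added, b[j - 1]]
      j -= 1
    else
      ops << [:removed, a[i - 1]]
      i -= 1
    end
  end
  ops.reverse
end

word_diff("I go to store yesterday", "I went to the store yesterday").each do |tag, word|
  puts format("%-8s %s", tag, word)
end
```

A plain diff like this only says which words changed. Labelling the type of each error (verb tense, missing article, and so on) would need additional analysis on top of it.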
  • 23. Word Count Analyzer Analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used.
  • 24. The problem
      • Translation is typically billed on a per word basis
      • Different tools often report different word counts
    I wanted to understand what was causing these differences in word count.
  • 25. Word count gray areas. Common word count gray areas include:
      • Ellipses
      • Hyperlinks
      • Contractions
      • Hyphenated Words
      • Dates
      • Numbers
      • Numbered Lists
      • XML and HTML tags
      • Forward slashes and backslashes
      • Punctuation
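To see how such gray areas produce different totals, the sketch below counts the same string under three plausible definitions of "a word". The `COUNTERS` table and `word_counts` helper are hypothetical names for this illustration and are unrelated to the Word Count Analyzer API.

```ruby
# Three plausible definitions of "a word" (assumptions for this sketch).
COUNTERS = {
  whitespace:   ->(text) { text.split.size },                # tokens between whitespace
  word_chars:   ->(text) { text.scan(/\w+/).size },          # runs of letters, digits, underscore
  letters_only: ->(text) { text.scan(/[[:alpha:]]+/).size }  # runs of letters only
}.freeze

# Apply every counter to the same text.
def word_counts(text)
  COUNTERS.transform_values { |counter| counter.call(text) }
end

text = "Don't forget: the well-known deadline is 4/9/2015... see http://example.com"

word_counts(text).each do |name, count|
  puts format("%-13s %d", name, count)
end
```

The contraction, the hyphenated word, the date, the ellipsis and the hyperlink are each split differently by the three counters, so the same sentence is reported as 9, 15 or 12 words.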
  • 26. Visualize the gray areas
  • 27. ? ? ? A bitext alignment (aka parallel text alignment) tool with a focus on high accuracy
  • 28. What’s it used for?
      • Translation memory
      • Machine translation
  • 29. Bitext alignment. Current commercial state-of-the-art:
      • Gale-Church sentence-length information plus dictionary if available (e.g. hunalign)
  • 30. Areas for improvement
      • Early misalignment compounds into errors throughout
      • Accuracy may suffer for non-Roman languages unless the algorithm is properly tuned
      • Handles neither crossing alignments nor uneven alignments
  • 31. A method for higher accuracy
      • Machine translate A → B and B → A
      • Relative sentence length
      • Order or position in the document
    [Figure: grid with rows and columns labeled 0-5, with X marking aligned segment pairs]
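One way to picture how these three signals might be combined is a score matrix over all segment pairs. This is a rough sketch under stated assumptions, not the in-progress algorithm: it assumes the source segments have already been machine-translated into the target language (only one direction is used here), and the helper names, the Jaccard overlap measure and the weights are all invented for illustration.

```ruby
# Jaccard overlap between the sets of words in two sentences.
def word_overlap(x, y)
  a = x.downcase.scan(/\w+/).uniq
  b = y.downcase.scan(/\w+/).uniq
  return 0.0 if a.empty? && b.empty?

  (a & b).size.to_f / (a | b).size
end

# Ratio of the shorter length to the longer one (1.0 means equal length).
def length_ratio(x, y)
  lengths = [x.length, y.length]
  return 1.0 if lengths.max.zero?

  lengths.min.to_f / lengths.max
end

# 1.0 when both segments sit at the same relative position in their documents.
def position_closeness(i, n, j, m)
  1.0 - ((i.to_f / [n - 1, 1].max) - (j.to_f / [m - 1, 1].max)).abs
end

# source_mt: source segments already machine-translated into the target language.
# target:    the target-language segments.
# The weights are arbitrary values chosen for this sketch.
def score_matrix(source_mt, target, weights: { overlap: 0.6, length: 0.2, position: 0.2 })
  source_mt.each_with_index.map do |s, i|
    target.each_with_index.map do |t, j|
      weights[:overlap] * word_overlap(s, t) +
        weights[:length] * length_ratio(s, t) +
        weights[:position] * position_closeness(i, source_mt.size, j, target.size)
    end
  end
end

source_mt = ["The cat sleeps.", "It is raining today.", "I like tea."]
target    = ["The cat is sleeping.", "I like green tea.", "Today it is raining."]

score_matrix(source_mt, target).each_with_index do |row, i|
  best = row.index(row.max)
  puts "source #{i} -> target #{best} (#{row.max.round(2)})"
end
```

In this toy example the second and third segments are swapped in the target, and the highest score in each row still finds the crossing pair. Picking the best cell per row is not a full alignment: one-to-many matches would need a search over groups of cells.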
  • 32. The trade-offs
    Pros
      • better accuracy
      • can handle crossing alignments
      • can handle uneven segment matches (1 to 2, 2 to 1, 1 to 3, 3 to 1, 2 to 3, and 3 to 2)
    Cons
      • slower
      • potential data privacy issues (depending on method to obtain machine translation)
  • 33. Small framework for thinking about new problems
      Step 1: Use your ignorance as a weapon to think about a problem from first principles (you aren’t yet weighed down with any bias).
      Step 2: Do your research.
      Step 3: Diff your conceptual framework and your research. Look at where it diverges and try to understand why. Has tech changed/advanced? Were you missing something?