The pope is catholic.
language as an interface
language as data
introduction
(@spencermountain)
problem
4-gram: london in the rain
3-gram: london in the, in the rain
2-gram: london in, in the, the rain
1-gram: london, in, the, rain
4-words : 10 requests per keystroke
5-words : 15 6-words : 21 7-words : 28 8-words : 36
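The request counts on this slide match the number of contiguous n-grams in an n-word phrase, which is n(n+1)/2. A minimal sketch in JavaScript, assuming plain whitespace tokenization; the helper name `allNgrams` is hypothetical and not taken from the deck or any library:

```javascript
// Hypothetical helper: list every contiguous n-gram of a phrase.
function allNgrams(text) {
  const words = text.trim().split(/\s+/);
  const grams = [];
  // size runs from the full phrase down to single words
  for (let size = words.length; size >= 1; size--) {
    for (let start = 0; start + size <= words.length; start++) {
      grams.push(words.slice(start, start + size).join(" "));
    }
  }
  return grams;
}

const grams = allNgrams("london in the rain");
console.log(grams.length); // 10, i.e. 4 * 5 / 2
```

If every n-gram triggers one lookup, the cost per keystroke grows quadratically with the length of the phrase.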
problem
london in the rain
london in the, in the rain
london in, in the, the rain
London, in, the, rain
#1 Stopwords blacklist: in, the, in the
#2 Edge gram filter: london in the, in the rain, london in, the rain
#3 Redundancy check: “rain”, “london”
“london in the rain”
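One possible reading of the three filters on this slide, sketched in JavaScript. The stopword list, the function name `filterNgrams`, and the exact rule for each step are assumptions made for illustration; the deck does not show the actual implementation.

```javascript
// Assumed stopword list, taken only from the slide's example.
const STOPWORDS = new Set(["in", "the"]);

// Hypothetical filter: reduce a list of n-grams to the ones worth looking up.
function filterNgrams(grams) {
  const tokens = (g) => g.toLowerCase().split(" ");

  // #1 stopword blacklist: drop grams made up only of stopwords
  let kept = grams.filter((g) => !tokens(g).every((w) => STOPWORDS.has(w)));

  // #2 edge-gram filter: drop grams that start or end with a stopword
  kept = kept.filter((g) => {
    const t = tokens(g);
    return !STOPWORDS.has(t[0]) && !STOPWORDS.has(t[t.length - 1]);
  });

  // #3 redundancy check: drop grams contained in a longer surviving gram
  const padded = (g) => " " + g.toLowerCase() + " ";
  return kept.filter(
    (g) => !kept.some((other) => other.length > g.length && padded(other).includes(padded(g)))
  );
}

const candidates = [
  "london in the rain",
  "london in the", "in the rain",
  "london in", "in the", "the rain",
  "London", "in", "the", "rain",
];
console.log(filterNgrams(candidates)); // [ 'london in the rain' ]
```

Under these assumptions the ten candidate lookups for the example phrase collapse to one.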
problem
● NLTK - excellent, huge, Python
● Stanford parser - excellent, huge, Java
● Freeling - excellent, huge, C++
● Illinois tagger - excellent, huge, Java
Or an offsite API?
Alchemy, TextRazor, OpenCalais, Embedly, Zemanta
When all you’ve got is a jackhammer..
niche
Can it be hacked?
tldr: yes.
niche
How big is a language?
Shakespeare - 35,000
niche
Zipf's law
The top 10 words account for 25% of language.
The top 100 words account for 50% of language.
The top 50,000 words account for 95% of language.
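Coverage figures like these can be measured on any corpus by counting how many tokens the k most frequent words account for. A minimal sketch in JavaScript; the function name `coverage` and the toy sentence are illustrative assumptions, and the toy numbers say nothing about real English:

```javascript
// Fraction of all tokens covered by the k most frequent words.
function coverage(tokens, k) {
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
  // sort word frequencies from most to least common
  const sorted = [...counts.values()].sort((a, b) => b - a);
  const top = sorted.slice(0, k).reduce((sum, n) => sum + n, 0);
  return top / tokens.length;
}

const sample = "the cat and the dog and the bird".split(" ");
console.log(coverage(sample, 1)); // 0.375 (3 of 8 tokens are "the")
console.log(coverage(sample, 2)); // 0.625 ("the" and "and" together)
```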
niche
An average person will ever hear 50,000 different words
602 kb uncompressed
~16 lookups in binary search (log2 of 50,000)
niche
first, let’s kill the nouns (70%)
process
180 kb uncompressed
Noun: Tomato, Tomatoes, Toronto, Torontonian
Verb: Speak, Spoke, Speaking, will speak, have spoken, had spoken, ...
Adjective: nice, nicer, nicest
Adverb: quickly, quicklier, quickliest
“awesome” → “awesomeify”
improveify your vocabularies
“quickly” → “quick”
n/2.3
Each word
*not handsome, *not truly, *not is, *not economics
“tomatoey” → “tomato”
“speak” → “speaker”
“aggressive” → “aggressiveness”
“civil” → “civilize”
niche
then, let’s conjugate our verbs
process
110 kb uncompressed
process
jQuery - 256kb
d3js - 330kb
react - 653kb
lodash - 503kb
the whole english language - 110kb uncompressed
Ok, let’s roll our own POS tagger..
(what could go rong?)
process
1) ☑ Lexicon
2) ◻ Suffix regexes
3) ◻ Sentence
Suffix rules
process
1) ☑ Lexicon
2) ☑ Suffix regexes
3) ◻ Sentence
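The first two steps, plus the fallback to Noun mentioned in the deck, could look roughly like this in JavaScript. The lexicon entries, the suffix patterns, and the name `tagWord` are assumptions chosen for illustration, not the library's actual data or API:

```javascript
// Tiny illustrative lexicon and suffix rules; both are assumptions for this sketch.
const LEXICON = { she: "Pronoun", could: "Verb", the: "Det", walk: "Verb" };
const SUFFIX_RULES = [
  [/ly$/, "Adverb"],
  [/ing$/, "Verb"],
  [/(ous|ful|est)$/, "Adjective"],
];

// Hypothetical per-word tagger: lexicon first, then suffixes, then Noun.
function tagWord(word) {
  const w = word.toLowerCase();
  // 1) lexicon lookup for known words
  if (Object.prototype.hasOwnProperty.call(LEXICON, w)) return LEXICON[w];
  // 2) suffix regexes for unknown words
  for (const [pattern, tag] of SUFFIX_RULES) {
    if (pattern.test(w)) return tag;
  }
  // fallback: anything still unknown is treated as a noun
  return "Noun";
}

console.log(tagWord("the"));      // Det
console.log(tagWord("quickly"));  // Adverb
console.log(tagWord("Toronto"));  // Noun
```

Suffix rules like these misfire on exceptions, which is one reason a sentence-level pass follows as step 3.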
Grammar rules - markov
She could walk the walk .
before: Verb - Det - Verb
after: Verb - Det - Noun
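The correction in this example can be expressed as a rule that looks at the previous tag. A minimal sketch in JavaScript, assuming a single hand-written rule (a Verb directly after a Det becomes a Noun); the function name `applyGrammarRules` and the tag names are hypothetical:

```javascript
// Hypothetical correction rule: a Verb directly after a Det is retagged as Noun.
function applyGrammarRules(tagged) {
  return tagged.map((entry, i) => {
    const prev = tagged[i - 1];
    if (prev && prev.tag === "Det" && entry.tag === "Verb") {
      return { ...entry, tag: "Noun" };
    }
    return entry;
  });
}

// Word-by-word tags before the sentence-level pass
const before = [
  { word: "She", tag: "Pronoun" },
  { word: "could", tag: "Verb" },
  { word: "walk", tag: "Verb" },
  { word: "the", tag: "Det" },
  { word: "walk", tag: "Verb" },
];
const after = applyGrammarRules(before);
console.log(after.map((t) => t.tag).join(" - "));
// Pronoun - Verb - Verb - Det - Noun
```

The input array is left unchanged, so the tags before and after the pass can be compared directly.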
process
1) ☑ Lexicon
2) ☑ Suffix regexes
3) ☑ Sentence
“Unreasonable effectiveness” of rule-based taggers:
● a 1,000 word lexicon - 45% precision
● fallback to [Noun] - 70% precision
● a little regex - 74% precision
● a little grammar in it - 81% precision
process
t.text("keep on rocking in the free world")
t.negate()
// "don't keep on rocking in the free world."
outcome
t.text("it is a cool library")
t.toValleyGirl()
// "so, it is like, a cool library."
outcome
We gave the monkeys the bananas,
..because they were ripe.
..because they were hungry.
outcome
list of letters: We gave the monkeys the bananas
POS-tagging: [Pr] [Verb] [Dt] [Noun] [Dt] [Noun]
Dependency parser: We give [Noun] [Noun]
Knowledge engine: [act / transfer / voluntary] [genus / monkey] [plant / banana]
outcome
#TODOFML
● Mutable/Immutable API
● Speed, performance testing
● Romance-language verb conjugations
● ‘bl.ocks.org’ of demos and docs
outcome
npm install --wooyeah
@spencermountain
Slack group, mailing list, github, Toronto/coffee

nlp_compromise
