Industrial Strength
Natural Language Processing
I am Jeffrey Williams
I am here to provide meaning to unstructured text
I work @ Label Insight
You can find me at @jeffxor
Label Insight is Hiring!
https://www.labelinsight.com/careers/topic/engineering
Caveats
◇ I am not a linguistics specialist
◇ I am not a natural language specialist
◇ I am not a data scientist
◇ I am a software engineer
This talk is aimed at software engineers trying to tackle
text problems by extracting meaning or understanding
Agenda
◇ Natural Language Processing Concepts
◇ spacy.io Introduction
◇ Visualizations
◇ Applying spacy.io
◇ spacy.io Extensions
◇ Lessons Learnt
◇ Alternatives to spaCy.io
Let’s review some
NLP concepts
Sentence Boundary Detection
Sentence boundaries are often
marked by periods or other
punctuation marks, but these
same characters can serve other
purposes
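A minimal sketch of why this is hard, using only the standard library. The sample text and the splitting rule are assumptions made up for illustration: a naive rule that ends a sentence at every period followed by whitespace breaks on abbreviations.

```python
import re

text = "Dr. Smith paid $3.50 for coffee. He left at 5 p.m. on Friday."

# Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

# The text has two sentences, but the rule also splits after "Dr." and "p.m."
for sentence in naive_sentences:
    print(repr(sentence))
```

The decimal point in "$3.50" survives only because no whitespace follows it; the abbreviations do not.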
Tokens/Word Segmentation
Separate a chunk of continuous
text into separate words. Text
segmentation is a significant task
requiring knowledge of the
vocabulary and morphology.
Stemming/Lemmatization
reduce inflectional forms of a
word to a common base form
am, are, is -> be
car, cars, car's, cars' -> car
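The difference between the two approaches can be sketched in plain Python. Both functions below are toy illustrations, not how any library implements them: the lookup table is hypothetical and holds only the examples from this slide, and the stemmer just strips a few suffixes.

```python
# Hypothetical lookup table built from the examples on this slide.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}


def lemmatize(word):
    """Lemmatization: map a word to its dictionary base form via lookup."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


def crude_stem(word):
    """Stemming: chop off common suffixes without any vocabulary knowledge."""
    word = word.lower()
    for suffix in ("'s", "s'", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


for w in ["is", "cars", "car's", "cars'"]:
    print(w, "->", lemmatize(w), "(lemma),", crude_stem(w), "(stem)")
```

Both agree on the "car" forms, but the stemmer turns "is" into "i", while the lemmatizer needs vocabulary knowledge to reach "be".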
Named Entity Recognition
Given a stream of text, determine
which items in the text map to
proper names, such as people or
places, and what the type of each
such name is (e.g. person,
location, organization).
Parts of Speech Tagging
Given a sentence, determine the
part of speech for each word.
Many words, especially common
ones, can serve as multiple parts
of speech.
Word sense disambiguation
Many words have more than one
meaning; we have to select the
meaning which makes the most
sense in context.
spaCy.io Introduction
◇ Open-source library for advanced Natural Language Processing (NLP) in Python
◇ Opinionated NLP library (not an API/Service)
◇ Number of pretrained models for common
languages
◇ Great documentation and example code
◇ Helps build information extraction & natural
language understanding systems
spaCy.io is a very powerful library with many extension
points for training and pipeline configuration
spaCy.io Features
Lemmatization
Assigning the base forms of
words. For example, the lemma of
"was" is "be", and the lemma of
"rats" is "rat".
Rule-based Matching
Finding sequences of tokens
based on their texts and linguistic
annotations, similar to regular
expressions.
Similarity
Comparing words, text spans and
documents and how similar they
are to each other.
(POS) Part-of-speech Tagging
Assigning word types to tokens,
like verb or noun.
(NER) Named Entity Recognition
Labelling named "real-world"
objects, like persons, companies
or locations.
Dependency Parsing
Assigning syntactic dependency
labels, describing the relations
between individual tokens, like
subject or object.
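Of the features above, rule-based matching can be tried without downloading a trained model. The following is a minimal sketch, assuming the spaCy v3 API (`spacy.blank`, `Matcher.add(key, patterns)`); the pattern and the sample text are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline has a tokenizer but no trained components.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One dict per token: "hello", an optional punctuation token, then "world".
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world! I said hello world.")
matched_texts = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_texts)
```

Lemmatization, POS tagging, NER and dependency parsing need a trained pipeline, so they are not shown here.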
Language Support
spaCy v2.0 features new neural models for
tagging, parsing and entity recognition. The
models have been designed and implemented
from scratch specifically for spaCy, to give you
an unmatched balance of speed, size and
accuracy.
Model names combine the language (English), the training
data (web, news, etc.) and the size of the model (sm, md,
lg)
https://spacy.io/usage/models
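As a small illustration of that naming scheme, the sketch below splits a model name into its parts. It assumes the `lang_type_genre_size` convention described in the spaCy model documentation; the "type" part (for example `core`) is not mentioned on the slide, and `parse_model_name` is a hypothetical helper, not part of spaCy.

```python
from typing import NamedTuple


class ModelName(NamedTuple):
    lang: str   # language code, e.g. "en"
    kind: str   # model type, e.g. "core"
    genre: str  # training data genre, e.g. "web" or "news"
    size: str   # "sm", "md" or "lg"


def parse_model_name(name: str) -> ModelName:
    """Split a name such as 'en_core_web_sm' into its four parts."""
    lang, kind, genre, size = name.split("_")
    return ModelName(lang, kind, genre, size)


print(parse_model_name("en_core_web_sm"))
```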
Provided Named Entities
In my experience, the provided model is not as well trained
on locations as Google Cloud Natural Language
https://spacy.io/api/annotation#section-named-entities
Parts-of-Speech Tagging
Maps all language-specific part-of-speech tags
to a small, fixed set of word type tags following
the Universal Dependencies scheme.
https://spacy.io/api/annotation#section-pos-tagging
Visualizations
Super simple and super powerful for development iteration
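import spacy
from spacy import displacy
# 'en' was a v2 shortcut link; the full package name also works in later versions
nlp = spacy.load('en_core_web_sm')
doc = nlp('This is a sentence.')
# starts a local web server that renders the dependency parse
displacy.serve(doc, style='dep')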
Dependency Visualization
import spacy
from spacy import displacy
text = """But Google is starting from
behind. The company made a late push
into hardware, and Apple’s Siri,
available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot
devices, have clear leads in
consumer adoption."""
nlp = spacy.load('custom_ner_model')
doc = nlp(text)
displacy.serve(doc, style='ent')
Named Entity
Visualization
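When a web server is not wanted, `displacy.render` returns the markup as a string instead. The sketch below is an assumption-laden variant of the slide: it uses displaCy's manual mode with one hand-written entity offset, so it needs no trained model (the slide's `custom_ner_model` would supply real predictions).

```python
from spacy import displacy

text = "But Google is starting from behind."

# Hand-written entity offsets stand in for a trained model's predictions.
manual_doc = {
    "text": text,
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}

# manual=True accepts plain dicts; jupyter=False forces a returned string.
html = displacy.render(manual_doc, style="ent", manual=True, page=True, jupyter=False)
print(len(html), "characters of HTML")
```

The returned string can be written to a file and shared with people who do not run Python, which helps when iterating with subject matter experts.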
spaCy.io Code
Examples
Examples of applying spaCy.io’s building blocks to solve a
problem
Navigating Parse Trees
◇ Navigate the parse tree, including subtrees attached
to a word
◇ Noun chunks (a noun plus the words describing the
noun)
◇ The terms head and child describe the words
connected by a single arc
◇ The term dep is used for the arc label (the type of syntactic
relation)
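The head, child and dep terms above can be sketched as follows, assuming the spaCy v3 `Doc` constructor. The parse is annotated by hand so the example runs without a trained model; in practice a pipeline loaded with `spacy.load(...)` would predict the heads and labels. Noun chunks also need part-of-speech tags, so they are left out.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-annotated parse (an assumption for illustration only).
words = ["Autonomous", "cars", "shift", "insurance", "liability"]
heads = [1, 2, 2, 4, 2]  # index of each token's head; the root points to itself
deps = ["amod", "nsubj", "ROOT", "compound", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# Each arc connects a child to its head and carries a dep label.
arcs = [(tok.text, tok.dep_, tok.head.text) for tok in doc]

root = [tok for tok in doc if tok.dep_ == "ROOT"][0]
root_children = [child.text for child in root.children]

# The subtree of a word is the word plus everything attached below it.
liability_subtree = [tok.text for tok in doc[4].subtree]

print(arcs)
print(root_children, liability_subtree)
```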
Phrase Matcher
◇ efficiently match large terminology lists
◇ match sequences based on lists of token
descriptions
◇ accepts match patterns in the form of Doc objects
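A minimal sketch of the phrase matcher, assuming the spaCy v3 API. The terminology list and the sample text are hypothetical; `attr="LOWER"` makes the match case-insensitive, and the patterns are `Doc` objects created with the tokenizer only.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical terminology list; real lists can hold many thousands of terms.
terms = ["whole wheat flour", "cane sugar", "sea salt"]
patterns = [nlp.make_doc(term) for term in terms]  # tokenize only, no pipeline
matcher.add("INGREDIENT", patterns)

doc = nlp("Ingredients: Whole Wheat Flour, cane sugar, sea salt.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```

Using `nlp.make_doc` for the patterns avoids running pipeline components on every term, which matters for large lists.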
spaCy.io
Applied to the Real World
Walk through applying to a new problem domain
Training Data
Provide additional data to
either adjust an existing
model or build your own
model.
https://prodi.gy/
spaCy.io Extensions
Functionality
Number of extension points to
add customizations
◇ Adjust pipeline
◇ Add new pipeline features
◇ Add functionality to core
components
◇ Add callback functions into
pipeline processes
spaCy.io Pipeline
Disabling/Modifying
If you don't need a particular
component of the pipeline (for
example, the tagger or the parser),
you can disable loading it.
This can sometimes make a big
difference and improve loading
speed.
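A small sketch of disabling a component, assuming the spaCy v3 API (`select_pipes`; v2 used `disable_pipes`). A blank pipeline with a `sentencizer` stands in for a trained pipeline so that nothing has to be downloaded.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
names_before = list(nlp.pipe_names)

# Temporarily switch a component off for a block of work.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
    doc = nlp("One sentence. Another one.")

names_after = list(nlp.pipe_names)
print(names_before, names_inside, names_after)

# With a trained pipeline the same idea applies at load time, for example:
# nlp = spacy.load("en_core_web_sm", disable=["parser"])
```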
Custom Components
Custom components can be
added to the pipeline
You can add a component before or
after another one, tell spaCy to add it first or
last in the pipeline, or define a
custom name.
E.g. add spell checking (hunspell)
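A minimal sketch of a custom component, assuming the spaCy v3 registration style (`@Language.component` plus `add_pipe` with a string name). The component name `token_counter` and what it stores are hypothetical; a real component might wrap a spell checker instead.

```python
import spacy
from spacy.language import Language


@Language.component("token_counter")  # hypothetical component name
def token_counter(doc):
    """Record the number of tokens; a component receives and returns a Doc."""
    doc.user_data["n_tokens"] = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# Position the component relative to another one (or use first=True / last=True).
nlp.add_pipe("token_counter", before="sentencizer")

doc = nlp("One sentence. Another one.")
print(nlp.pipe_names, doc.user_data["n_tokens"])
```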
Extension Attributes
Extension attributes allow you to set any custom
attributes and methods on the
Doc, Span and Token.
Use them to attach additional information relevant to
your application, add new
features and functionality to
spaCy, and implement your own
models.
E.g. improve spaCy's sentence
boundary detection
https://spacy.io/usage/processing-pipelines
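Extension attributes can be sketched like this, assuming the `set_extension` API and the `._.` namespace. The attribute names `is_brand` and `brands` and the brand list are hypothetical; `force=True` only keeps the snippet re-runnable.

```python
import spacy
from spacy.tokens import Doc, Token

# A writable token attribute with a default, and a computed doc attribute.
Token.set_extension("is_brand", default=False, force=True)
Doc.set_extension(
    "brands",
    getter=lambda doc: [tok.text for tok in doc if tok._.is_brand],
    force=True,
)

KNOWN_BRANDS = {"uber", "spotify"}  # hypothetical application data

nlp = spacy.blank("en")
doc = nlp("Uber blew through $1 million a week")
for token in doc:
    if token.lower_ in KNOWN_BRANDS:
        token._.is_brand = True

print(doc._.brands)
```

Custom data lives under `._.` so it cannot clash with spaCy's built-in attributes.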
Processing Pipeline
The Language object coordinates
these components. It takes raw text
and sends it through the pipeline,
returning an annotated document. It
also orchestrates training and
serialization.
https://spacy.io/usage/processing-pipelines
Named Entity Extension
Adding Additional Entity Types
Need a few hundred labeled sentences
for a good start; mix in examples of other
entity types
Actual training is performed by looping
over the examples, making a prediction for each and comparing it
against the gold-standard annotations
# each entity is (start, end, label) in character offsets; end is exclusive
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 29, "GPE")]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    ("Google rebrands its business apps", [(0, 6, "ORG")]),
    ("look what i found on google! 😂", [(21, 27, "PRODUCT")]),
]
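Character offsets are easy to get wrong by one, so it can help to check them before training. The helper below is a hypothetical sanity check in plain Python, not part of spaCy; the third sample is deliberately misaligned to show what it reports.

```python
def check_offsets(text, spans):
    """Return a list of problems found in (start, end, label) character spans."""
    problems = []
    for start, end, label in spans:
        surface = text[start:end]
        if not 0 <= start < end <= len(text):
            problems.append(
                f"{label} ({start}, {end}) is outside the text (length {len(text)})"
            )
        elif surface != surface.strip():
            problems.append(f"{label} ({start}, {end}) includes whitespace: {surface!r}")
    return problems


samples = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
    # Deliberately misaligned end offset, to show what the check reports.
    ("Google rebrands its business apps", [(0, 7, "ORG")]),
]

for text, spans in samples:
    print(text[:20], "->", check_offsets(text, spans) or "ok")
```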
Update a pre-trained Model
Need to provide many examples to meaningfully
improve the system — a few hundred
https://spacy.io/usage/training#section-ner
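The training loop can be sketched as below, assuming the spaCy v3 API (`Example.from_dict`, `nlp.initialize`, `nlp.update`), which differs from the v2 API current when these slides were written. To stay self-contained it trains a blank pipeline on three sentences; updating a pre-trained pipeline would instead load it with `spacy.load(...)` and obtain the optimizer from `nlp.resume_training()`. The iteration count, dropout and the `train_ner` helper are arbitrary illustrative choices.

```python
import random

import spacy
from spacy.training import Example

# Tiny illustrative sample; real training needs a few hundred examples.
TRAIN_DATA = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google Maps launches location sharing", {"entities": [(0, 11, "PRODUCT")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]


def train_ner(train_data, n_iter=5, seed=0):
    """Train a blank English NER component and return (nlp, last losses)."""
    random.seed(seed)
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")
    for _, annotations in train_data:
        for _start, _end, label in annotations["entities"]:
            ner.add_label(label)

    optimizer = nlp.initialize()
    losses = {}
    for _ in range(n_iter):
        losses = {}
        # Loop over the examples in random order and update the model on each.
        for text, annotations in random.sample(train_data, len(train_data)):
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, drop=0.2, losses=losses)
    return nlp, losses


nlp, losses = train_ner(TRAIN_DATA)
print(losses)
```

With this little data the model will not generalize; the point is only the shape of the loop.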
Custom Semantics
◇ The parser can be trained to
predict any type of tree
structure over your input text
◇ Can be useful for
conversational applications
◇ Train spaCy's parser to label
intents and their targets, like
attributes, quality, time and
locations
https://spacy.io/usage/training#section-tagger-parser
An attempt to summarize my learning curve, both in
implementation and in getting business buy-in
spaCy.io Lessons
Learnt
Start Simple!
Define your key outcomes
Visualize the data
Experiment, iteration is key!
Educate
Engage your SMEs
Visualizations always help
Opt for easy/understandable
Measurement
System Metric
Operations Metric
Overall Business Metric
spaCy.io Alternatives
There are many alternatives available. They tend to fall into two
categories: alternative libraries and hosted solutions
◇ NLTK Natural Language Toolkit (Python)
◇ Stanford CoreNLP (Java)
◇ NLP4J (Java)
Libraries allow you to configure, extend and train for your
problem domain
Alternate Libraries
◇ Microsoft Azure Text Analytics
◇ Google Cloud Natural Language
Hosted solutions provide a generic solution
◇ Well trained models
◇ Basic/Generic Named Entities
◇ Unable to model/train for your domain (yet!)
Alternate Hosted
Solutions
Thanks!
Any questions?
You can find me at:
◇ @jeffxor
◇ jwilliams@labelinsight.com
◇ https://speakerrate.com/speakers/181771 (Feedback)
Useful Information
This presentation used the following resources:
◇ spacy.io
◇ spacy.io github
◇ explosion.ai/demos/
◇ Natural Language Processing Wikipedia
◇ Stanford CoreNLP
◇ Microsoft Azure Text Analytics
◇ Google Cloud Natural Language

Industrial Strength NLP with spaCy

  • 1. Industrial Strength Natural Language Processing I am Jeffrey Williams I am here to provide meaning to unstructured text I work @ Label Insight You can find me at @jeffxor Label Insight is Hiring! https://www.labelinsight.com/careers/topic/engineering
  • 2. Caveats ◇ I am not a linguist specialist ◇ I am not a natural language specialist ◇ I am not a data scientist ◇ I am a software engineer This talk is aimed at software engineers trying to tackle text problems by extract meaning or understanding
  • 3. Agenda ◇ Natural Language Processing Concepts ◇ spacy.io Introduction ◇ Visualizations ◇ Applying spacy.io ◇ spacy.io Extensions ◇ Lessons Learnt ◇ Alternatives to spaCy.io
  • 4. Let’s review some NLP concepts Sentence Boundary Detection Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes Tokens/Word Segmentation Separate a chunk of continuous text into separate words. Text segmentation is a significant task requiring knowledge of the vocabulary and morphology. Stemming/Lemmatization reduce inflectional forms of a word to a common base form am, are, is -> be car, cars, car's, cars' -> car Named Entity Recognition Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Parts of Speech Tagging Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. Word sense disambiguation Many words have more than one meaning; we have to select the meaning which makes the most sense in context.
  • 5. spaCy.io Introduction ◇ Open-source library for advanced (NLP) in Python ◇ Opinionated NLP library (not an API/Service) ◇ Number of pretrained models for common languages ◇ Great documentation and example code ◇ Helps build information extraction & natural language understanding systems spaCy.io is very powerful library that has many extension points allowing for training and pipeline configuration
  • 6. spaCy.io Features Lemmatization Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat". Rule-based Matching Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions. Similarity Comparing words, text spans and documents and how similar they are to each other. (POS) Part-of-speech Tagging Assigning word types to tokens, like verb or noun. (NER) Named Entity Recognition Labelling named "real-world" objects, like persons, companies or locations. Dependency Parsing Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
  • 7. Language Support spaCy v2.0 features new neural models for tagging, parsing and entity recognition. The models have been designed and implemented from scratch specifically for spaCy, to give you an unmatched balance of speed, size and accuracy. Models are named by a combination of language (English), training data (web, news, etc.) and model size (sm, md, lg). https://spacy.io/usage/models
  • 8. Provided Named Entities In my experience, spaCy's location entities are not as well trained as those in Google Cloud Natural Language. https://spacy.io/api/annotation#section-named-entities
  • 9. Parts-of-Speech Tagging Maps all language-specific part-of-speech tags to a small, fixed set of word type tags following the Universal Dependencies scheme. https://spacy.io/api/annotation#section-pos-tagging
  • 10. Visualizations Super simple and super powerful for development iteration
  • 11. Dependency Visualization
        import spacy
        from spacy import displacy

        nlp = spacy.load('en')
        doc = nlp(u'This is a sentence.')
        displacy.serve(doc, style='dep')
  • 12. Named Entity Visualization
        import spacy
        from spacy import displacy

        text = """But Google is starting from behind. The company made a late push into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa software, which runs on its Echo and Dot devices, have clear leads in consumer adoption."""

        nlp = spacy.load('custom_ner_model')
        doc = nlp(text)
        displacy.serve(doc, style='ent')
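When a blocking web server is inconvenient (for example in a script or a test), `displacy.render` returns the markup as a string instead. This is a minimal sketch assuming the spaCy v3 API. The entity is set by hand on a blank pipeline so no trained model is needed; with a real model the entities would come from `nlp(text)`.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # no trained NER here, so the entity is set manually
doc = nlp("But Google is starting from behind.")
doc.ents = [Span(doc, 1, 2, label="ORG")]  # token 1 is "Google"

# jupyter=False forces a returned string even inside a notebook;
# page=False returns only the markup fragment, not a full HTML page.
html = displacy.render(doc, style="ent", page=False, jupyter=False)
print(html[:80])
```

The returned string can be written to an `.html` file and shared with colleagues who do not run Python.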
  • 13. spaCy.io Code Examples Examples of applying spaCy.io’s building blocks to solve a problem
  • 14. Navigating Parse Trees ◇ Navigate the parse tree, including the subtrees attached to a word ◇ Noun chunks (a noun plus the words describing the noun) ◇ The terms head and child describe the words connected by a single arc ◇ The term dep is used for the arc label (the type of syntactic relation)
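The head, child, dep and subtree terms above map directly onto token attributes. The sketch below builds a small parse by hand so it runs without a trained model; it assumes the spaCy v3 `Doc(vocab, words=..., heads=..., deps=..., pos=...)` constructor, and the sentence and its annotations are illustrative, not model output.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written parse (assumption: stands in for what a trained parser would produce).
words = ["Autonomous", "cars", "shift", "insurance", "liability"]
heads = [1, 2, 2, 4, 2]  # absolute index of each token's head; the root points to itself
deps = ["amod", "nsubj", "ROOT", "compound", "dobj"]
pos = ["ADJ", "NOUN", "VERB", "NOUN", "NOUN"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps, pos=pos)

for token in doc:
    # dep_ is the arc label, head is the parent, children are the direct dependents.
    print(token.text, token.dep_, token.head.text, [child.text for child in token.children])

root_children = [child.text for child in doc[2].children]
liability_subtree = [t.text for t in doc[4].subtree]
chunks = [chunk.text for chunk in doc.noun_chunks]
print(root_children, liability_subtree, chunks)
```

With a trained pipeline the same attributes are available on any `doc = nlp(text)` without the manual annotations.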
  • 15. Phrase Matcher ◇ Efficiently match large terminology lists ◇ Match sequences based on lists of token descriptions ◇ Accepts match patterns in the form of Doc objects
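A minimal sketch of terminology matching with `PhraseMatcher`, assuming the spaCy v3 `add(key, docs)` signature. The term list, the rule name `PRODUCT_TERMS` and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# attr="LOWER" makes the comparison case-insensitive.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

terms = ["Google Maps", "Android Pay", "Echo Dot"]  # hypothetical terminology list
# Patterns are Doc objects; make_doc only tokenizes, which keeps this fast.
matcher.add("PRODUCT_TERMS", [nlp.make_doc(term) for term in terms])

doc = nlp("Google Maps launches location sharing, and android pay expands to Canada.")
found = sorted(doc[start:end].text for _match_id, start, end in matcher(doc))
print(found)
```

Because patterns are tokenized once up front, this approach scales to lists with many thousands of terms.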
  • 16. spaCy.io Applied to the Real World Walk through applying spaCy.io to a new problem domain
  • 17. Training Data Provide additional data to either adjust an existing model or build your own model. https://prodi.gy/ spaCy.io Extensions Functionality A number of extension points are available for adding customizations ◇ Adjust pipeline ◇ Add new pipeline features ◇ Add functionality to core components ◇ Add callback functions into pipeline processes
  • 18. spaCy.io Pipeline. Disabling/Modifying: If you don't need a particular component of the pipeline, for example the tagger or the parser, you can disable loading it. This can sometimes make a big difference and improve loading speed. Custom Components: Custom components can be added to the pipeline. You can add one before or after another component, tell spaCy to add it first or last in the pipeline, or define a custom name. E.g. add spell checking (hunspell). Extension Attributes: Allow you to set any custom attributes and methods on the Doc, Span and Token, so you can attach additional information relevant to your application, add new features and functionality to spaCy, and implement your own models. E.g. improve spaCy's sentence boundary detection. https://spacy.io/usage/processing-pipelines
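The three ideas on this slide can be sketched together in a few lines. This assumes the spaCy v3 API (`@Language.component`, string names in `add_pipe`, `select_pipes`), which differs from the v2.0 API the talk was based on. The component name `token_counter` and the attribute `token_count` are hypothetical names chosen for the example.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Extension attribute (hypothetical name), read and written as doc._.token_count.
Doc.set_extension("token_count", default=0, force=True)


@Language.component("token_counter")  # hypothetical component name
def token_counter(doc):
    # A custom component receives a Doc, may modify it, and must return it.
    doc._.token_count = len(doc)
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("token_counter", after="sentencizer")  # position is configurable

doc = nlp("This is a sentence. This is another one.")
print(nlp.pipe_names, doc._.token_count, len(list(doc.sents)))

# Temporarily disable a component; it is restored when the block exits.
with nlp.select_pipes(disable=["sentencizer"]):
    names_while_disabled = list(nlp.pipe_names)
print(names_while_disabled)
```

When loading a trained model, components can also be skipped up front, for example `spacy.load(name, disable=["parser"])`, which is where the loading-speed gain comes from.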
  • 19. Processing Pipeline The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization. https://spacy.io/usage/processing-pipelines
  • 20. Named Entity Extension. Adding Additional Entity Types: You need a few hundred labeled sentences for a good start; mix in examples of other entity types. Training loops over the examples, and for each one the model makes a prediction that is compared against the gold-standard annotations.
        # Each entity is (start_char, end_char, label); end_char is exclusive.
        train_data = [
            ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
            ("Android Pay expands to Canada", [(0, 11, "PRODUCT"), (23, 29, "GPE")]),
            ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
            ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
            ("Google rebrands its business apps", [(0, 6, "ORG")]),
            ("look what i found on google! 😂", [(21, 27, "PRODUCT")]),
        ]
    Update a Pre-trained Model: You need to provide many examples to meaningfully improve the system; a few hundred is a good start. https://spacy.io/usage/training#section-ner
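Character offsets in this format are easy to get wrong by one, and a span that does not line up with token boundaries is silently useless for training. The sketch below checks annotations with `Doc.char_span`, which returns `None` for misaligned offsets, and then wraps them in spaCy v3 `Example` objects. It assumes the v3 training API (v2.0 passed the tuples to `nlp.update` directly); the helper name `misaligned` is made up for illustration.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# (text, [(start_char, end_char, label)]) with an exclusive end offset.
train_data = [
    ("Uber blew through $1 million a week", [(0, 4, "ORG")]),
    ("Spotify steps up Asia expansion", [(0, 7, "ORG"), (17, 21, "LOC")]),
    ("Google Maps launches location sharing", [(0, 11, "PRODUCT")]),
]


def misaligned(text, entities):
    """Return the annotations whose offsets do not match token boundaries."""
    doc = nlp.make_doc(text)
    return [(s, e, label) for s, e, label in entities
            if doc.char_span(s, e, label=label) is None]


bad = [(text, misaligned(text, ents)) for text, ents in train_data
       if misaligned(text, ents)]

# An end offset of 8 would include the space after "Spotify" and is rejected.
off_by_one = misaligned("Spotify steps up Asia expansion", [(0, 8, "ORG")])

# Example pairs a fresh Doc with the gold-standard annotations.
examples = [Example.from_dict(nlp.make_doc(text), {"entities": ents})
            for text, ents in train_data]
gold = [[(ent.text, ent.label_) for ent in ex.reference.ents] for ex in examples]
print(bad, off_by_one, gold)
```

Running a check like this before every training run is cheap and catches annotation drift early.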
  • 21. Custom Semantics ◇ The parser can be trained to predict any type of tree structure over your input text ◇ Useful for conversational applications ◇ Train spaCy's parser to label intents and their targets, like attributes, quality, time and locations https://spacy.io/usage/training#section-tagger-parser
  • 22. spaCy.io Lessons Learnt An attempt to summarize my learning curve, from both implementation and business buy-in
  • 23. Start Simple! Define your key outcomes. Visualize the data. Experiment; iteration is key!
  • 24. Educate. Engage your SMEs. Visualizations always help. Opt for easy/understandable.
  • 26. spaCy.io Alternatives There are many alternatives available; they tend to fall into two categories: alternative libraries and hosted solutions
  • 27. Alternate Libraries ◇ NLTK Natural Language Toolkit (Python) ◇ Stanford CoreNLP (Java) ◇ NLP4J (Java) Libraries allow you to configure, extend and train for your problem domain
  • 28. Alternate Hosted Solutions ◇ Microsoft Azure Text Analytics ◇ Google Cloud Natural Language Hosted solutions provide a generic solution ◇ Well trained models ◇ Basic/Generic Named Entities ◇ Unable to model/train for your domain (yet!)
  • 29. Thanks! Any questions? You can find me at: ◇ @jeffxor ◇ jwilliams@labelinsight.com ◇ https://speakerrate.com/speakers/181771 (Feedback) Label Insight is Hiring! https://www.labelinsight.com/careers/topic/engineering
  • 30. Useful Information This presentation used the following resources: ◇ spacy.io ◇ spacy.io github ◇ explosion.ai/demos/ ◇ Natural Language Processing Wikipedia ◇ Stanford CoreNLP ◇ Microsoft Azure Text Analytics ◇ Google Cloud Natural Language