Introduction to Text Analytics
and Natural Language
Processing
Nick Grattan
Application Architecture Director, Dassault Systèmes
PhD Student, Insight Centre for Data Analytics, University College Cork
www.3ds.com insight-centre.org/
Cork AI Meetup 15th March 2018
www.meetup.com/Cork-AI/
@NickGrattan
“You shall know a word by the
company it keeps”
J.R. Firth (1957)
Agenda
• Introduction to Text Analytics.
• Overview of common techniques and the types of problems commonly
solved.
• Traditional “Frequentist” text analysis
• Bag-of-Words and Vector Space Models (High Dimensional, Low Density)
• Measuring document / text similarity with distance metrics and clustering
documents.
• Hands-On: Document Clustering with Python, NLTK and SciPy.
• Word Embeddings with word2vec for semantic term analysis
• Unsupervised semantic analysis using a corpus of words
• Hands-On: Creating a semantic space with a Neural Network in TensorFlow
Natural Language Processing and Text
Analytics
• Natural Language Processing (NLP)
• Area of AI concerned with interactions between computers and human natural
language, to process or “understand” natural language
• Common tasks: speech recognition, natural language understanding & generation,
automatic summarization, part-of-speech tagging, disambiguation, named entity
recognition …
• To fully understand and represent the meaning of language is a difficult goal
(AI-Complete) [1]
• Text Analytics (Text Mining):
• The process or practice of examining large collections of written resources in order
to generate new information (Oxford English Dictionary)
• Transforms text to data for information discovery, establishing relationships, often
using NLP
Text Preparation
• Extract text from documents
• E.g. use “BeautifulSoup” in Python to process HTML/XML documents
• Process terms (words) from text
• Tokenisation – breaks text into discrete terms
• Stop Words – remove common words (“the”, “and” etc.)
• Stemming – Reduce words to their root or base form ("fishing", "fished", and
"fisher" => "fish")
• E.g. “NLTK” (Natural Language Toolkit) in Python
• All, some, or none of these techniques may be used, depending on the
application
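The preparation steps above can be sketched in plain Python. This is only an illustration: the stop-word list and the suffix-stripping `toy_stem` function are assumptions made for this example, standing in for the NLTK tokeniser, stop-word corpus and stemmers that a real pipeline would more likely use.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a much fuller one).
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "was"}

def tokenise(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "er", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def prepare(text):
    """Tokenise, remove stop words, then stem."""
    return [toy_stem(t) for t in remove_stop_words(tokenise(text))]

print(prepare("The fisher fished and was fishing in the sea"))
# ['fish', 'fish', 'fish', 'sea']
```

A naive suffix stripper over-stems many words (for example it would cut "river" to "riv"), which is one reason an established stemmer is normally preferred.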
Bag-of-Words & Jaccard Similarity
• Bag-of-Words is the set of terms found
in a document, corpus etc.
• Jaccard Similarity between two Bags-of-Words,
A & B:
• Ratio of the size of the Intersection to the
size of the Union of the two sets
• ‘1’ – Identical, ‘0’ – No terms in common
• Simple & quick to calculate
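As a minimal sketch, Jaccard similarity can be computed directly with Python sets. With similarity defined as intersection size over union size, identical sets score 1 and sets with no terms in common score 0. The two example sentences are made up, and treating two empty sets as identical is a convention chosen for this sketch.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

doc_a = "the cat sat on the mat".split()
doc_b = "the cat lay on the rug".split()

# 3 shared terms (the, cat, on) out of 7 distinct terms overall.
print(jaccard_similarity(doc_a, doc_b))
```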
Term Frequencies (TF) & Vector Space Models
• Term Frequency (TF)
• Count of number of term
occurrences in a document
• Vector Space Model
• Dimension for each Term in Vocabulary
• Map documents into this space
Very high dimensionality, low density
For many documents, many dimensions
will be zero
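The mapping from documents to term-frequency vectors can be sketched as follows. The two toy documents are invented for illustration, and ordering the dimensions alphabetically is just one convenient choice.

```python
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

# One dimension per distinct term, in a fixed (sorted) order.
vocabulary = sorted({term for doc in docs for term in doc})

def tf_vector(tokens, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

vectors = [tf_vector(doc, vocabulary) for doc in docs]
print(vocabulary)  # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1, 1, 2], [0, 1, 1, 0, 1, 1, 2]]
```

Even in this tiny example some entries are zero; with a realistic vocabulary of tens of thousands of terms, almost all entries of each vector are zero, so sparse representations are usually preferred.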
Distance Measures
• Distance between two documents in a vector space model
• Two common measures: Euclidean and Cosine
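A minimal sketch of the two measures, using plain lists as document vectors. The example vectors are made up: the second has the same term proportions as the first but twice the length, which shows how the measures differ.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)

short_doc = [1, 2, 0]
long_doc = [2, 4, 0]  # same term proportions, twice the length

print(euclidean_distance(short_doc, long_doc))  # clearly non-zero
print(cosine_distance(short_doc, long_doc))     # approximately zero
```

Cosine distance ignores vector length, so it is often the more natural choice when documents of very different sizes are compared.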
Term Frequency / Inverse Document
Frequency (TF/IDF)
• Term Frequency / Inverse Document Frequency (TF/IDF)
• Reflects how important a word is to a document in a corpus
• Increases proportionally to the number of times a word appears in the
document
• Offset by the frequency of the word in the corpus
• Adjusts for words that appear more frequently in general
See: https://deeplearning4j.org/bagofwords-tf-idf
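A small worked sketch of the weighting. Several TF/IDF variants exist; this one assumes a raw term count for TF and log(N / document frequency) for IDF, which is a common textbook form and not necessarily the one used in the hands-on session. The three documents are invented.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

weights = tf_idf(docs)
print(weights[0])
```

Here "the" occurs in every document, so its weight is zero despite its high count, while "mat" occurs in only one document and receives the largest weight in the first document.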
Distance Matrices & Clustering
• Square, symmetrical matrix with pair-wise distances between
documents in a corpus
• Used for clustering documents, e.g.
• K-Means clustering
• Hierarchical clustering (Ward algorithm commonly used)
See: https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
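The following sketch builds a pairwise distance matrix and then clusters the documents bottom-up. To stay dependency-free it uses a naive average-linkage merge rather than the Ward algorithm mentioned above; in practice SciPy's `scipy.cluster.hierarchy.linkage` would be the more usual tool. The four document vectors are made up.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def distance_matrix(vectors):
    """Square, symmetric matrix of pair-wise distances."""
    n = len(vectors)
    return [[euclidean(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

def agglomerate(dist, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average distance over all pairs drawn from the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        p, q = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters

vectors = [[1, 0, 0], [2, 0, 0], [0, 5, 5], [0, 6, 5]]
dist = distance_matrix(vectors)
print(agglomerate(dist, 2))  # [[0, 1], [2, 3]]
```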
Edit Distance
• Number of inserts / deletes / substitutions to
transform one document to another
• Weights may be applied to different types of
edit
• E.g. terms that are semantically related may have
a lower weight
• Levenshtein Edit Distance may be computed using
Dynamic Programming
• Allows document alignments to be produced
• But: Expensive in time and space!
https://wordpress.com/read/feeds/71910664/posts/1718047915
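A minimal dynamic-programming sketch of Levenshtein distance with unit costs for every edit (weighted edits are not shown). It works on any sequence, so it can compare characters in a string or terms in a tokenised document.

```python
def levenshtein(source, target):
    """Edit distance with unit cost for insert, delete and substitute."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,             # delete s
                current[j - 1] + 1,          # insert t
                previous[j - 1] + (s != t),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("the cat sat".split(), "the dog sat down".split()))  # 2
```

This version keeps only two rows of the table, which reduces memory use but means the alignment cannot be traced back; producing an alignment requires the full table, which is where the space cost comes from.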
Document Retrieval with MinHash & Locality-Sensitive
Hashing (LSH)
• Problem: How to retrieve similar documents from a large corpus
• MinHash:
• “Document Fingerprint” with n hash values (n ≈ 200)
• Characteristic: Similar documents have similar hash values
• Use Jaccard similarity to measure MinHash similarity, and hence document similarity
• Independent of document size, small storage and retrieval costs
• Locality-Sensitive Hashing (LSH):
• For large numbers of documents
• Organizes documents represented by MinHash into buckets
• Documents within a bucket are likely to be similar
• Reduces retrieval time, good for duplicate / near-duplicate document detection
etc.
https://nickgrattandatascience.wordpress.com/2013/11/12/minhash-implementation-in-c/
https://nickgrattandatascience.wordpress.com/2017/12/31/lsh-for-finding-similar-documents-from-a-large-number-of-documents-in-c/
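A rough Python sketch of both ideas (the linked posts use C#). The seeded BLAKE2b hashing scheme, the choice of 50 bands and all function names are assumptions made for this example; documents are assumed to be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def hash_value(seed, term):
    """Deterministic 64-bit hash of a term under a given seed."""
    digest = hashlib.blake2b(f"{seed}:{term}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash_signature(terms, n_hashes=200):
    """For each seeded hash function, keep the minimum hash over all terms."""
    terms = set(terms)
    return [min(hash_value(seed, t) for t in terms) for seed in range(n_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two documents agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

def lsh_candidate_pairs(signatures, bands=50):
    """Bucket documents band by band; documents sharing a bucket become candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for doc_ids in buckets.values():
        pairs.update(combinations(sorted(doc_ids), 2))
    return pairs

a = "the quick brown fox jumps over the lazy dog".split()
b = "the quick brown fox leaps over the sleepy dog".split()
c = "an entirely unrelated sentence".split()

signatures = {name: minhash_signature(doc) for name, doc in [("a", a), ("b", b), ("c", c)]}
print(estimate_jaccard(signatures["a"], signatures["b"]))  # true Jaccard is 0.6
print(lsh_candidate_pairs(signatures))
```

The signature length is fixed regardless of document size, and only documents that share at least one band bucket need to be compared in full, which is what cuts retrieval time on large corpora.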
Natural Language Processing (NLP)
• Techniques described thus far are Text Analytical
• Numerical in nature, take little account of the meaning of
text
• Terms are numerically encoded symbols
• NLP attempts to understand text
• Semantics – The meaning of a word based on how / where
it’s used
• Part of Speech (POS) Tagging – Understanding the
construction of sentences, phrases etc.
• Word Relatedness & Concepts: WordNet -
https://wordnet.princeton.edu/
E.g. Homonym Problem:
Words with same spelling but
different meanings, depending
on how / where used
E.g. Disambiguation: “Like” as a verb (“Fruit
flies like to eat bananas”), “like” as a preposition
(“Fruit flies that look like a banana”)
Word Embeddings – word2vec
• Unsupervised semantic analysis from
corpus of terms
• Define number of dimensions for the
semantic space (e.g. 300)
• Window: Define number of words before
/ after the target word (e.g. 1, 2 or 5)
• Generate Training Samples
• For each word, create parameters that
map the word into the semantic space
• The “Word Vector Lookup Table”
See: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
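Generating training samples from a window can be sketched as below for the skip-gram case, where each sample pairs a target word with one of its neighbours. The function name and the example sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each target word with every word up to `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```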
Word Embeddings – word2vec
• Neural Network trained on samples
• Output layer discarded
• Keep the Hidden Layer Weight Matrix!
See: https://www.tensorflow.org/tutorials/word2vec
Also look at “gensim” for a word2vec implementation:
https://radimrehurek.com/gensim/models/word2vec.html
Model architectures:
1. Continuous Bag of Words (CBOW)
2. Skip-gram
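Why the hidden-layer weight matrix serves as the word vector lookup table can be shown with a toy example: multiplying a one-hot input by the matrix simply selects one row. The vocabulary and weight values below are made up for illustration, not taken from a trained model.

```python
vocabulary = ["cat", "dog", "fish"]

# Hypothetical trained hidden-layer weights: one row per word, 2 dimensions.
hidden_weights = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.7],
]

def one_hot(word):
    """Input vector with a 1 in the position of the given word."""
    return [1.0 if w == word else 0.0 for w in vocabulary]

def hidden_layer(one_hot_vector):
    """Multiply the one-hot input by the weight matrix (no activation)."""
    n_dims = len(hidden_weights[0])
    return [
        sum(x * row[d] for x, row in zip(one_hot_vector, hidden_weights))
        for d in range(n_dims)
    ]

def embedding(word):
    """Look the word vector up directly as a row of the weight matrix."""
    return hidden_weights[vocabulary.index(word)]

print(hidden_layer(one_hot("dog")))  # [0.8, 0.2]
print(embedding("dog"))              # [0.8, 0.2]
```

Once training is finished, only the row lookup is needed, which is why the output layer can be discarded.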
Word Embedding - Visualisation
• Use Principal Component Analysis (PCA) to create a 2D representation
of the semantic vector space
RNN and NLP
• Recurrent Neural Networks (RNN) may
be used for generative models
• Once trained, they can generate text
with the same structure, syntax and
semantics as the training set
• For a bit of fun, see “The Unreasonable
Effectiveness of Recurrent Neural
Networks”
• http://karpathy.github.io/2015/05/21/rnn-effectiveness/
C code generated by an RNN trained on the Linux
code base. While it does not execute (!) it is
largely syntactically correct. The model, for example, has
learnt to match “{” with “}” and to balance parentheses
Resources and References
[1] “Natural Language Processing with Deep Learning” – Christopher Manning et
al, Stanford University. https://www.youtube.com/watch?v=OQQ-W_63UgQ
• Lecture series including excellent description of backpropagation, word2vec and GloVe
[2] “Hands-On Machine Learning with Scikit-Learn & TensorFlow”, Aurélien
Géron (O'Reilly Media, 2017)
• Excellent introduction, Jupyter Notebooks available here: https://github.com/ageron
[3] “Speech and Language Processing”, Daniel Jurafsky & James H. Martin (2nd
Edition, Pearson Education 2009)
• In-depth introduction to NLP
[4] “Introduction to Information Retrieval”, Christopher Manning et al
(Cambridge University Press, 2008)
• Probabilistic models for text retrieval, TF/IDF, Vector Space, Support Vector Machines…

Cork AI Meetup Number 3
