Natural language processing and
machine learning
Nikola Milosevic
What is AI?
• Intelligence exhibited by a machine
• A flexible agent that interacts with its environment and
performs actions to maximize its chance of success at a given goal
Popular AI
What is machine learning?
• Subfield of computer science that explores
how machines can learn to perform a
task without being explicitly programmed
Data mining generally
Types of machine learning
• Supervised learning
• Semi-supervised learning
• Unsupervised learning
• Reinforcement learning
Machine learning problems
• Classification
• Clustering
• Regression
Testing the model
• Iteratively improve the model
• Test multiple algorithms – find the best one
• No free lunch theorem: no single algorithm is best for every problem
• Feedback loop for feature selection
• Confusion matrix
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
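The accuracy formula above can be checked on a small example. The following is a minimal sketch in plain Python, assuming a binary task where the label 1 is the positive class; the function names and the toy labels are illustrative, not taken from the slides.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts

def accuracy(counts):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = counts["TP"] + counts["TN"] + counts["FP"] + counts["FN"]
    return (counts["TP"] + counts["TN"]) / total if total else 0.0

# Toy gold labels and predictions (made up for illustration).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(dict(counts))      # TP=3, TN=3, FP=1, FN=1
print(accuracy(counts))  # 0.75
```

Accuracy alone can mislead on imbalanced data, which is one reason to look at the full confusion matrix.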
Examples of ML frameworks and
algorithms
• scikit-learn
– Python library
– Implementation of the most useful algorithms
– Naïve Bayes, SVM, Random forests, decision
trees…
• Keras
– Python library covering most building blocks
needed for neural networks
Text data
• By a commonly cited estimate, about 80% of data in organizations is in text
format
• Harder to analyse than structured data
• Huge number of textual documents
– In biomedicine alone, 2200 scientific papers are
published every day
• Growing exponentially
Main goals of text mining
• Make communication easier (e.g. translation)
• Automate some processes (e.g.
conversational agents/chatbots)
• Do data mining on textual and unstructured
data
Process overview
Challenges
• A man saw a woman with a telescope.
– Who has the telescope?
• Multiple senses, synonyms,
homonyms, irony
• Grammar and context can help
• Acronyms
Approaches
• Rule-based
– Human-defined rules to extract information
– Needs human experts who know how people express
certain things
– Is quite laborious
• Machine learning-based
– The machine learns what to extract, guided by
human-annotated examples
– Needs annotated corpora (usually fairly large)
• These are expensive and laborious to create
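To make the rule-based approach concrete, here is a minimal sketch of a single human-defined rule written as a regular expression. The rule (a number followed by a weight unit), the function name and the example sentence are hypothetical and only illustrate the idea.

```python
import re

# Hypothetical hand-written rule: a number followed by a weight unit.
WEIGHT_RULE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(kg|g|mg)\b", re.IGNORECASE)

def extract_weights(text):
    """Return (value, unit) pairs for every span matched by the rule."""
    return [(float(value), unit.lower())
            for value, unit in WEIGHT_RULE.findall(text)]

text = "The patient weighed 82.5 kg and received 500 mg of the drug."
print(extract_weights(text))  # [(82.5, 'kg'), (500.0, 'mg')]
```

Every new way of expressing a weight ("eighty kilos", "82,5 kg") needs another rule, which is where the laborious part comes from.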
Levels of analysis
• Lexical
– Analysis of words
• Syntactic
– Analysis of organization of words
(phrases, sentences)
• Semantic
– Analysis of meaning
• Sometimes pragmatic
– Analysis of how certain words and phrases are used
in context. Why did the author use them?
Steps
Lexical and syntactic processing
• Part of speech tagging
• Parsing
– Constituency
– Dependency
• Example tool: Stanford Parser
Semantic processing
• Text classification
– Sentiment analysis (positive/negative)
– Classification by topics (politics/sport/business)
– Authorship detection (Tolkien, Rowling, Shakespeare)
• Named entity recognition
• Topic modelling (unsupervised)
• Search
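As an illustration of text classification for sentiment analysis, the following is a minimal sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written in plain Python. The four training sentences are invented toy data, and the whitespace tokenizer is a simplifying assumption; in practice a library implementation and a much larger annotated corpus would be used.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log posterior score."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, made up for illustration.
training = [
    ("great film loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("boring film hated it", "negative"),
    ("terrible acting boring story", "negative"),
]
model = train_naive_bayes(training)
print(classify("great story", model))          # positive
print(classify("boring and terrible", model))  # negative
```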
Sequence modelling
• Machine learning technique useful for named
entity recognition
• Conditional random fields (CRF) or recurrent
neural networks (often LSTM)
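A sequence model for named entity recognition typically outputs one label per token. The sketch below does not train a CRF or an LSTM; it only shows how such per-token labels can be turned back into entities, assuming the common BIO tagging scheme (B- begins an entity, I- continues it, O is outside). The scheme, the tag names and the example sentence are assumptions, not taken from the slides.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)
        else:
            # "O" tag, or a stray I- tag that is treated as outside.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["Ada", "Lovelace", "was", "born", "in", "London", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O"]
print(bio_to_entities(tokens, tags))
# [('Ada Lovelace', 'PER'), ('London', 'LOC')]
```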
Feature engineering
• Selecting important features that help extract
information
• Features can be:
– Words, PoS tags, word shapes, vocabulary features,
etc.
• The choice may depend on the task and methodology
• Iterative process: select features, measure
performance, refine
• Some features may confuse the algorithm
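The feature types listed above can be sketched as a function that maps one token to a dictionary of features. This is a minimal illustration in plain Python: the feature names, the context window of one word and the small drug vocabulary are hypothetical, and PoS tags are left out because they would require a tagger.

```python
import re

def word_shape(word):
    """Map letters and digits to a coarse shape, e.g. 'Covid-19' -> 'Xxxxx-dd'."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)

def token_features(tokens, index, vocabulary):
    """Build a feature dictionary for the token at position index."""
    word = tokens[index]
    return {
        "word": word.lower(),
        "shape": word_shape(word),
        "is_capitalized": word[:1].isupper(),
        # Vocabulary feature: is the word in a domain word list?
        "in_vocabulary": word.lower() in vocabulary,
        "previous_word": tokens[index - 1].lower() if index > 0 else "<START>",
        "next_word": tokens[index + 1].lower() if index < len(tokens) - 1 else "<END>",
    }

# Hypothetical domain vocabulary and sentence.
drug_vocabulary = {"aspirin", "ibuprofen"}
tokens = ["Aspirin", "reduces", "fever"]
features = token_features(tokens, 0, drug_vocabulary)
print(features)
```

Dictionaries like this can be fed to a classifier or a CRF; adding or removing keys and re-measuring performance is the iterative loop described above.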
Search
• Finds documents that are the most relevant
for a given user query
• Common techniques include TF-IDF weighting
and cosine similarity
• May additionally use links pointing to the text,
positions of matched words and similar signals
to rank the retrieved documents
• Tools: Apache Lucene and Solr (Java); Python
libraries also exist
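TF-IDF weighting and cosine similarity can be sketched in a few lines of plain Python. This is a minimal illustration, not how Lucene or Solr score documents: it assumes whitespace tokenization and one particular IDF variant, log(N / df) + 1, and the three documents are toy data.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    n_docs = len(documents)
    # Assumed IDF variant: log(N / df) + 1 (several others are in use).
    idf = {word: math.log(n_docs / df) + 1.0 for word, df in doc_freq.items()}
    vectors = [{word: tf * idf[word] for word, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, documents):
    """Rank documents by cosine similarity to the query, best first."""
    vectors, idf = tf_idf_vectors(documents)
    query_tf = Counter(query.lower().split())
    query_vec = {w: tf * idf[w] for w, tf in query_tf.items() if w in idf}
    scores = [(cosine_similarity(query_vec, vec), doc)
              for vec, doc in zip(vectors, documents)]
    return sorted(scores, key=lambda pair: pair[0], reverse=True)

documents = [
    "machine learning for text classification",
    "cooking pasta with tomato sauce",
    "text mining and machine learning",
]
results = search("text mining", documents)
for score, doc in results:
    print(round(score, 3), doc)
```

The document containing both query words ranks first, the one sharing only "text" second, and the unrelated one gets a score of zero.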
Language models
• Used as features for classification and other
NLP tasks
• Contain some basic characteristics of language
• The most naïve (but also frequently used) is
called Bag of Words
• Neural networks use more advanced
models: word2vec, GloVe,
ELMo, BERT…
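The Bag of Words representation can be sketched as follows: each document becomes a vector of word counts over a fixed vocabulary, and word order is discarded. This is a minimal plain-Python illustration that assumes whitespace tokenization; the two sentences are toy data.

```python
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of all distinct lower-cased words in the corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

documents = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors[0])  # [1, 0, 1, 1, 1, 2]
print(vectors[1])  # [0, 1, 0, 0, 1, 1]
```

Unlike the embedding models named above, these vectors carry no notion of similarity between different words.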
Useful tools and libraries
• Apache OpenNLP – Java
• Apache Lucene – Java, C#
• Stanford CoreNLP – Java
• NLTK – Python
• GATE – GUI tool
• SharpNLP
• ...
• Weka – for machine learning (GUI)

Machine learning (ML) and natural language processing (NLP)
