Text-Mining:
Dealing with unstructured data in business intelligence
Prof dr ir Jan C. Scholtes
https://textmining.nu
https://www.linkedin.com/in/jscholtes/
Johannes (Jan) C. Scholtes
3
Text Mining
Text Mining: The next step in
Search Technology
Finding without knowing exactly what
you’re looking for, or finding what
apparently isn’t there.
4
Text-Mining
The study of text-mining focuses primarily on:
• extracting complex patterns from
unstructured electronic data sets, and
• applying machine learning for document
classification.
5
Language_Name English
CITY New Brunswick, WASHINGTON
COMPANY J&J, Johnson & Johnson
COUNTRY Greece, Poland, Romania, United Kingdom
CURRENCY .02 USD, 21400000 USD, 48600000 USD, 59.47 USD, 70000000 USD
DATE 04-08
DAY Fri, Friday
NOUN_GROUP
biotech drugs, bribery case, denying guilt, final growth frontier, foreign countries, giving gifts, holding corporations,
intense revenue pressure, meaningful credit, medical device kickbacks, medical devices, multiple businesses, next several
days, non-U.S. markets, only way, orthopedic hips, other countries, over-the-counter medicines, paid kickbacks, past
year, paying kickbacks, same time, several new positions, similar violations, travel gifts
ORGANIZATION Department of Justice, Justice Department, SEC, Securities and Exchange Commission, University of Michigan
PEOPLES Iraqi
PERSON Erik Gordon, Mythili Raman, William Weldon
PLACE_REGION Europe
PRODUCT Benadryl, Tylenol
PROP_MISC Band-Aids, Food Program, Foreign Corrupt Practices Act, United Nations Oil
STATE N.J.
TIME 1:32 pm ET
TIME_PERIOD 13 years, five years, six months, three years
YEAR 2007
Problem
"We went to the government to report improper payments and have taken full responsibility for these actions," said
William Weldon, Chairman and CEO of J&J., Last month federal health regulators took legal control of the plant where
millions of bottles of defective medication were produced., The charges against J&J were brought under the Foreign
Corrupt Practices Act, which bars publicly traded companies from bribing officials in other countries to get or retain
business., The company will pay $21.4 million in criminal penalties for improper payments and return $48.6 million in
illegal profits, according to the government., The SEC says J&J agents used fake contracts and sham companies to deliver
the bribes.
Sentiment
giving meaningful credit to companies that self-report, We are committed to holding corporations accountable for bribing
foreign officials, what is honest
Request make sure it complies with anti-bribery laws across its businesses
6
WHAT happened?
7
WHO: Community Detection
8
WHAT-WHEN: Topic Rivers
9
WHY & WHO: Emotion Detection
10
Anomaly Detection
Σ(Φ)
How does that work?
Search
Pattern
Recognition
Text-Mining
13
There are certain problems we need to tackle
First Bank of Chicago was the only ATM in the neighborhood.
Before visiting the Victoria and Albert Museum on Cromwell Rd, I
had to get some cash so I could take Prof. Joan V. Hhdrsat Jr.
and my son Mitchell with me to the museum. The last one was
not really interested, he hates musea.
• Abbreviations
• Attributes of entities
• Boundary Problem
• Conjunction Problem
• Co-references
• Negations
• Emotions & sentiments
14
Redaction aka Black-Lining
aka Anonymization
• Lack of precision leads
to noise: too many false
hits and too much work to
review, which yields a
high cost of review.
• Lack of recall leads to
missing relevant
documents, which yields
risk.
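The two measures above, and the F1 score that combines them, can be sketched as follows. This is a minimal illustration, not the tooling used in the talk; the document ids and the function name `precision_recall_f1` are made up for the example.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: which fraction of what we returned is actually relevant?
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: which fraction of the relevant documents did we return?
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical review set: 4 documents are truly relevant.
relevant_docs = {1, 2, 3, 4}
# The system returns 6 documents: 3 correct hits and 3 false hits.
retrieved_docs = {1, 2, 3, 7, 8, 9}

p, r, f1 = precision_recall_f1(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

In this toy case half of the returned documents are noise (review cost), and one relevant document is missed (risk).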
15
16
Human Performance
• When both precision and recall are
over 80%, human performance is
approached.
• This applies to the best humans.
• It can be argued that values over
80% are often subject to different
interpretations and discussions.
16
Teaching the computer what you
are looking for … text classification
18
What is document classification?
19
Machine Learning Algorithms are used
During the last decade, a generation of efficient and successful
algorithms has been developed using bag-of-words models to
represent document content and statistical and geometrical
machine learning algorithms such as Conditional Random Fields
(CRF) and Support Vector Machines (SVM).
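As a rough sketch of what a bag-of-words representation looks like before it is handed to a classifier such as an SVM: each document becomes a vector of term counts over a shared vocabulary, and word order is discarded. The tokenizer and the two example sentences are assumptions chosen for brevity.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count vector per document."""
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    vectors = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        # One dimension per vocabulary term; absent terms count as 0.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["the bank paid the fine", "the river bank flooded"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
for vec in vectors:
    print(vec)
```

With a realistic vocabulary the vectors have one dimension per distinct term, which is why document spaces quickly reach hundreds of thousands of dimensions.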
20
Decision Trees and Entropy Modeling
• A decision tree is a
decision support tool
that uses a tree-like
graph or model of
decisions and their
possible consequences,
including chance event
outcomes, resource
costs, and utility. It is
one way to display an
algorithm.
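One common way to decide which question a decision tree should ask first is to pick the feature with the highest information gain, that is, the largest reduction in entropy. The sketch below assumes this entropy-based criterion; the labels and the two boolean features are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy reduction obtained by splitting `labels` on a feature."""
    total = len(labels)
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical training set of four documents.
labels = ["relevant", "relevant", "other", "other"]
contains_bribe = [True, True, False, False]   # separates the classes perfectly
contains_the = [True, False, True, False]     # carries no information

print(information_gain(labels, contains_bribe))
print(information_gain(labels, contains_the))
```

A tree builder would split on the first feature here, because it removes all uncertainty about the class.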
21
Now imagine 1.2 million dimensional …
2-dimensional 3-dimensional
22
Text Representation Schemes we use
Statistical hand-picked
features
Bag-of-Words
23
Classifying Reuters Document Collection (RCV1)
24
Information Requests
eDiscovery
• FOIA (WOB)
• Audits
Internal Investigations
• Litigation
• Arbitration
• Regulatory Requests
• Subject Access Requests
25
• 3x more relevant documents than Boolean search
• No complex queries, just review documents
• 2x the total number of relevant documents is all that needs to be reviewed
• Accurately estimate the percentage of all relevant documents found at the end
Teach the computer what to look for …
26
Benefits and problems of the current approach
• These algorithms require relatively little training data and are
fast on modern hardware.
• Performance seems to top out at F1 values around 0.9.
• Not very good at transfer learning.
27
Deep Learning
Deep Learning has produced superior results on several NLP
tasks such as POS tagging, Named Entity Recognition, or
semantic role labeling, especially when using LSTMs.
Source: Recent Trends in Deep Learning Based Natural Language Processing. Tom Young, Devamanyu Hazarika,
Soujanya Poria, Erik Cambria, 2018. https://arxiv.org/pdf/1708.02709.pdf
28
General Challenges of Deep Learning
• Challenge: huge computational requirements.
  Today’s answer: dedicated hardware (GPUs).
• Challenge: need for very large training data sets.
  Today’s answer: semi-supervised data set creation, training data augmentation & transfer learning.
• Challenge: what architecture?
  Today’s answer: CNN, LSTM, …
29
Dedicated Hardware
30
How about Word Embeddings?
• Pre-trained model
• Understand context better
• Transfer learning: the model already understands
general aspects of language, so it subsequently
only needs fine-tuning for a specific NLP
task.
• No need for millions or billions of annotated
training examples (when using deep learning).
31
Word Embeddings: Document Representation
derived with and used for Deep Learning*
Word2Vec Doc2Vec
GloVe FastText
ELMo BERT …
Remember: with TF-IDF we create a vector for each document. How
can we do something similar for Deep Learning?
Idea behind Word Embeddings:
Use words from a vocabulary as input and embed them as vectors in a
lower-dimensional space, forcing the system to create similar
encodings for semantically related words and thereby capture context.
*) but can also be used for SVM or other non-deep learning models.
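For reference, the TF-IDF document vector mentioned above can be sketched in a few lines. This uses one simple variant (relative term frequency times the natural log of N over document frequency); real toolkits apply different smoothing and normalisation, so treat the exact formula as an assumption.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, vectors) using tf = count/len and idf = ln(N/df)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / doc_freq[t])
            for t in vocabulary
        ])
    return vocabulary, vectors


tfidf_vocab, tfidf_vectors = tf_idf(["bribery case", "bribery fine"])
print(tfidf_vocab)
print(tfidf_vectors)
```

Note how "bribery", which occurs in every document, gets weight 0: the vector is sparse, as long as the vocabulary, and says nothing about which words are related. Word embeddings address exactly that gap.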
32
Word2Vec …
Revolutionized the use of word
embeddings by using continuous
bag-of-words and skip-gram models to derive
high-quality word embeddings.
Why: an unexpected side effect was
compositionality: algebraic operations
on word vectors result in a vector that
is a semantic composite:
man + royal = king
man to king = woman to …
country1 to capital1 = country2 to …
See Gittens et al., Skip-Gram − Zipf + Uniform = Vector Additivity, 2017, for a theoretical justification of compositionality
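The analogies above can be tried out on toy vectors. The two-dimensional embeddings below are hand-made for illustration (axis 0 roughly "royalty", axis 1 roughly "gender"); they are not real Word2Vec output, and real models need hundreds of dimensions learned from a large corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(target, embeddings, exclude=()):
    """Word whose vector is most similar to `target`, skipping `exclude`."""
    candidates = {w: vec for w, vec in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# Invented toy embeddings (assumption, not trained).
toy = {
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "royal": [1.0, 0.0],
    "apple": [-1.0, 0.0],
}

# man + royal = ?
composite = [m + r for m, r in zip(toy["man"], toy["royal"])]
print(nearest(composite, toy, exclude={"man", "royal"}))

# man is to king as woman is to ?  ->  king - man + woman
analogy = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
print(nearest(analogy, toy, exclude={"king", "man", "woman"}))
```

Excluding the query words from the candidates is the usual convention for analogy lookups, since the input vectors themselves are often the closest matches.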
33
Source: http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/
Vector relations captured by GloVe
34
Limitations of Word2Vec & GloVe …
• Language Sensitivity
• Cannot deal with out-of-vocabulary words. Solution: preprocess and assign left-over
codes to these, at the cost of missing their semantic context.
• Word2Vec treats common word pairs or phrases as single “words”, which does not
result in compositionality for the individual terms.
Solution: also train the words of such pairs as individual tokens, but then it is harder
to deal with frequent multi-token words such as New York.
• The small window of surrounding words sometimes leads to problems for high-frequency
words: bad and good often occur in the same context, leading to similar encodings
and, through polarity confusion, to problems in sentiment and emotion mining.
Solution: preprocessing of negations and polarity.
• Word embeddings assign a single vector to each textually unique word, so a word
cannot have multiple meanings (polysemy). Remember a word like bank. Solution:
preprocess and replace homonyms with different tokens for each meaning: bank1,
bank2, bank3, bank4, …
35
Word2Vec & GloVe vs ELMo & BERT
• Word2Vec and GloVe word embeddings are context-free, or
context-independent: they output just one vector (embedding) for each
word, the same vector for different meanings of a word. “He took his
cell phone to call his brother who was in the prison cell on the next
floor. Both had blood samples taken to measure their white
blood cell count.” results in ONE unique vector for cell*.
• ELMo and BERT are context sensitive: they can generate different
embeddings for a word, each capturing the context of that word,
that is, its position in a sentence. So the above example would lead to 3
different encodings for the word cell. cell (prison cell case) would be
closer to words like incarceration, crime, etc.; cell (phone case) would
be closer to words like iPhone, android, galaxy, etc.; and cell (blood)
would be closer to biology, life science, etc.
*) This may be less of a problem than it seems, as our intuition of “close by” that holds for
low-dimensional spaces does not hold for high-dimensional spaces
36
Uni-Directional and Bi-Directional Context
“I accessed the bank of the river”
A unidirectional contextual model would represent “bank” based
on “I accessed the” but not “river.”
A bi-directional contextual model represents “bank” using both its
previous and next context: “I accessed the ... of the river”
Both ELMo and BERT are bi-directional. ELMo is shallowly
bi-directional, BERT deeply bi-directional.
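The difference in available context can be made concrete with a small sketch. It only shows which tokens each kind of model gets to see for the target word; it does not compute any embedding, and the helper names are made up for the example.

```python
def left_context(tokens, index):
    """Context visible to a left-to-right (unidirectional) model."""
    return tokens[:index]


def both_contexts(tokens, index):
    """Context visible to a bidirectional model: left and right of the target."""
    return tokens[:index], tokens[index + 1:]


tokens = "I accessed the bank of the river".split()
i = tokens.index("bank")

print(left_context(tokens, i))
print(both_contexts(tokens, i))
```

Only the bidirectional view contains "river", the word that disambiguates "bank" in this sentence.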
37
Differences Word Embeddings
Source: https://www.quora.com/What-are-the-main-differences-between-the-word-embeddings-of-ELMo-BERT-Word2vec-and-GloVe
*) BERT has deep contextual representations and can deal with out-of-vocabulary words thanks to its
fully connected bi-directional architecture and sub-word representation
**) All the above are still language dependent
38
Deep Learning Architectures:
Convolutional Neural Networks (CNN)
39
Convolution
Convolution acts as a filter.
Used for feature selection.
Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
40
What is Convolution in Text?
• Convolution makes perfect sense
when the input is an image, with
two spatial dimensions (height and
width). But text has only one
dimension, and it’s temporal, not
spatial.
• For all practical purposes, that
doesn’t matter. We just need to
think of our text as an image of
width n and height 1. TensorFlow
provides a conv1d function, but it
does not expose other
convolutional operations in a 1-d
version. You will have to add these
yourself.
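To make the "image of width n and height 1" idea concrete, here is a plain-Python sketch of a one-dimensional convolution as neural networks use it: a sliding dot product over the sequence, without padding and without flipping the kernel. It is not the TensorFlow function; the input numbers stand in for one feature channel of a token sequence and are invented.

```python
def conv1d(sequence, kernel):
    """Valid (no padding) 1-d convolution: sliding dot product over the sequence."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


# One hypothetical feature value per token position.
signal = [1.0, 2.0, 3.0, 4.0, 5.0]
# A width-3 filter that responds to the difference between left and right neighbour.
kernel = [1.0, 0.0, -1.0]

feature_map = conv1d(signal, kernel)
print(feature_map)
```

Each output value only sees three neighbouring positions, which is why a single layer captures local context only.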
41
Convolution in Text
• Context comes from the surroundings; filters with ratio 1:3 generally
work best. Larger ones hamper gradient descent during training.
• Different layers represent different levels of abstraction. Higher layers
give more semantic representations.
• Use dilated convolutions (AKA atrous convolutions, AKA convolutions
with holes) to capture longer-term relations without upsetting
gradient descent.
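A sketch of why dilation helps: the filter keeps the same number of weights, but its taps are spread out, so stacked layers cover a much longer stretch of text. The receptive-field formula below assumes stride 1 and the same kernel size in every layer; the layer configuration is an example, not one taken from the talk.

```python
def dilated_conv1d(sequence, kernel, dilation=1):
    """Valid 1-d convolution whose taps are `dilation` positions apart."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # positions covered by one filter
    return [
        sum(sequence[i + j * dilation] * kernel[j] for j in range(k))
        for i in range(len(sequence) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Input positions seen by one output of stacked stride-1 layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1], dilation=2))

# Four layers of width-3 filters, with and without dilation.
print(receptive_field(3, [1, 1, 1, 1]))
print(receptive_field(3, [1, 2, 4, 8]))
```

With the same four layers and the same number of weights, doubling the dilation per layer lets each output see 31 positions instead of 9.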
42
Conceptual Problems with CNN for NLP
• Any part of a sentence can influence the semantics of a word.
For that reason we want our network to see the entire input
at once.
• Getting that big a receptive field can make gradients vanish and
our networks fail.
• We can solve the vanishing gradient problem with DenseNets
or Dilated Convolutions (convolution filters with holes in it),
but still ...
43
Long Short-Term Memory (LSTM) networks are better at
capturing long-term relations as seen in NLP
• Can deal with input of
variable sizes.
• Better at learning the
meaning of the same
word in different
locations (which is hard
for CNN), e.g.: drink a
lot of beers / or like to
drink a lot
• Better at dealing with
long-term dependencies
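The properties above follow from how an LSTM cell works: it is applied one step at a time, so the input can have any length, and a separate cell state carries information across many steps. Below is a minimal scalar sketch of the standard LSTM equations; the weights are arbitrary illustrative numbers (not trained), and real networks use vectors and matrices instead of single floats.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    `params` maps each gate name to a (w_x, w_h, bias) triple.
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # the new information itself
    c = f * c_prev + i * g             # cell state: the long-term memory
    h = o * math.tanh(c)               # hidden state: the step's output
    return h, c


def run_sequence(inputs, params):
    """Feed a sequence of any length through the cell; return the last output."""
    h, c = 0.0, 0.0
    for x in inputs:
        h, c = lstm_step(x, h, c, params)
    return h


# Arbitrary example weights (assumption, for illustration only).
example_params = {
    "forget": (0.5, 0.1, 1.0),
    "input": (0.6, 0.2, 0.0),
    "output": (0.4, 0.3, 0.0),
    "candidate": (0.9, 0.5, 0.0),
}

print(run_sequence([1.0, 0.5], example_params))
print(run_sequence([1.0, 0.5, -0.3, 0.8, 0.2], example_params))
```

The same cell handles the two-step and the five-step sequence without any change, which is the variable-size property from the first bullet.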
44
2018: Long Tran: Co-reference resolution with LSTM
45
2019: Zina Wang: Extract Complex Relation Entities with LSTMs
46
2019: Zoe Gerolemou. Aspect Based Emotion &
Sentiment Mining
47
Business Intelligence: How can text-
mining help me?
• Collect & analyze open
source (textual) data:
social media, blogs,
websites, …
• Consumer opinion
analysis.
• Anonymize historical
data collections holding
personal information.
• …
Source: IDC Executive Summary, 2015
Thank you!
Time for Q&A
Prof dr ir Jan C. Scholtes
https://textmining.nu
https://www.linkedin.com/in/jscholtes/

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 

Text mining for Business Intelligence applications

  • 1. Text-Mining: Dealing with unstructured data in business intelligence Prof dr ir Jan C. Scholtes https://textmining.nu https://www.linkedin.com/in/jscholtes/
  • 2. Johannes (Jan) C. Scholtes
  • 3. Text Mining: The next step in Search Technology. Finding without knowing exactly what you’re looking for, or finding what apparently isn’t there.
  • 4. Text-Mining. The study of text mining focuses primarily on: • extracting complex patterns from unstructured electronic data sets, and • applying machine learning for document classification.
  • 5. Language_Name: English
      CITY: New Brunswick, WASHINGTON
      COMPANY: J&J, Johnson & Johnson
      COUNTRY: Greece, Poland, Romania, United Kingdom
      CURRENCY: .02 USD, 21400000 USD, 48600000 USD, 59.47 USD, 70000000 USD
      DATE: 04-08
      DAY: Fri, Friday
      NOUN_GROUP: biotech drugs, bribery case, denying guilt, final growth frontier, foreign countries, giving gifts, holding corporations, intense revenue pressure, meaningful credit, medical device kickbacks, medical devices, multiple businesses, next several days, non-U.S. markets, only way, orthopedic hips, other countries, over-the-counter medicines, paid kickbacks, past year, paying kickbacks, same time, several new positions, similar violations, travel gifts
      ORGANIZATION: Department of Justice, Justice Department, SEC, Securities and Exchange Commission, University of Michigan
      PEOPLES: Iraqi
      PERSON: Erik Gordon, Mythili Raman, William Weldon
      PLACE_REGION: Europe
      PRODUCT: Benadryl, Tylenol
      PROP_MISC: Band-Aids, Food Program, Foreign Corrupt Practices Act, United Nations Oil
      STATE: N.J.
      TIME: 1:32 pm ET
      TIME_PERIOD: 13 years, five years, six months, three years
      YEAR: 2007
      Problem:
        • "We went to the government to report improper payments and have taken full responsibility for these actions," said William Weldon, Chairman and CEO of J&J.
        • Last month federal health regulators took legal control of the plant where millions of bottles of defective medication were produced.
        • The charges against J&J were brought under the Foreign Corrupt Practices Act, which bars publicly traded companies from bribing officials in other countries to get or retain business.
        • The company will pay $21.4 million in criminal penalties for improper payments and return $48.6 million in illegal profits, according to the government.
        • The SEC says J&J agents used fake contracts and sham companies to deliver the bribes.
      Sentiment:
        • giving meaningful credit to companies that self-report
        • We are committed to holding corporations accountable for bribing foreign officials
        • what is honest
      Request:
        • make sure it complies with anti-bribery laws across its businesses
  • 9. WHY & WHO: Emotion Detection
  • 11.
  • 12. How does that work? Search Pattern Recognition Text-Mining
  • 13. There are certain problems we need to tackle. First Bank of Chicago was the only ATM in the neighborhood. Before visiting the Victoria and Albert Museum on Cromwell Rd, I had to get some cash so I could take Prof. Joan V. Hhdrsat Jr. and my son Mitchell with me to the museum. The last one was not really interested, he hates museums. • Abbreviations • Attributes of entities • Boundary Problem • Conjunction Problem • Co-references • Negations • Emotions & sentiments
  • 15. • Lack of precision leads to noise: too many false hits and too much work to review, which yields a high cost of review. • Lack of recall leads to missing relevant documents, which yields risk.
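As a rough illustration of the two measures on the slide above, the following minimal sketch computes precision and recall for a set of retrieved documents. The function name and the document ids are hypothetical and chosen only for this example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. truly relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: share of retrieved documents that are relevant (low value = noise).
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved (low value = risk).
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical review set: 4 documents returned, 5 documents actually relevant.
retrieved_ids = {"d1", "d2", "d3", "d4"}
relevant_ids = {"d1", "d2", "d5", "d6", "d7"}
precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case half of the returned documents are noise, and three relevant documents are missed.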
  • 16. Human Performance • When both precision and recall are over 80%, human performance is approached. • This applies to the best humans. • It can be argued that values over 80% are often subject to different interpretations and discussions.
  • 17. Teaching the computer what you are looking for … text classification
  • 18. What is document classification?
  • 19. Machine Learning Algorithms are used. During the last decade, a generation of efficient and successful algorithms has been developed that use bag-of-words models to represent document content, combined with statistical and geometrical machine learning algorithms such as Conditional Random Fields (CRF) and Support Vector Machines (SVM).
  • 20. Decision Trees and Entropy Modeling • A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.
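The slide title pairs decision trees with entropy. A common way to connect the two, assumed here rather than taken from the slides, is to pick the split that reduces Shannon entropy the most (information gain). The labels and the split below are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(labels, split_groups):
    """Entropy reduction obtained by partitioning `labels` into `split_groups`."""
    total = len(labels)
    remainder = sum(len(group) / total * entropy(group) for group in split_groups)
    return entropy(labels) - remainder


# Hypothetical training labels and a split that separates the classes perfectly.
labels = ["relevant", "relevant", "other", "other"]
perfect_split = [["relevant", "relevant"], ["other", "other"]]
print(entropy(labels), information_gain(labels, perfect_split))
```

A tree learner of this kind would evaluate every candidate feature test this way and keep the one with the highest gain.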
  • 21. Now imagine 1.2 million dimensions … (figure panels: 2-dimensional, 3-dimensional)
  • 22. Text Representation Schemes we use: • Statistical hand-picked features • Bag-of-Words
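To make the bag-of-words representation concrete, here is a minimal sketch in which every vocabulary term becomes one vector dimension, which is why real collections end up with very high-dimensional vectors. The tokenizer and the two sample sentences are simplifying assumptions, not the scheme used in the talk.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors): one term-count vector per document."""
    vocabulary = sorted({token for doc in documents for token in tokenize(doc)})
    vectors = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


# Two hypothetical mini-documents.
docs = ["The SEC fined the company", "The company paid the fine"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Word order is discarded: only the counts per term survive, and these vectors are what a classifier such as an SVM would receive.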
  • 23. Classifying the Reuters Document Collection (RCV1)
  • 24. Information Requests • eDiscovery • FOIA (WOB) • Audits • Internal Investigations • Litigation • Arbitration • Regulatory Requests • Subject Access Requests
  • 25. • 3x more relevant documents than Boolean search • No complex queries, just review documents • Reviewing 2x the total number of relevant documents is all that is needed • Accurately estimate the percentage of all relevant documents found at the end • Teach the computer what to look for …
  • 26. Benefits and problems of the current approach • These algorithms require relatively little training data and are fast on modern hardware. • Performance seems to be maximized at F1 values around 0.9. • Not very good at transfer learning.
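For readers unfamiliar with the F1 value mentioned above: it is the harmonic mean of precision and recall. The sketch below shows the standard formula; the input values are illustrative only and are not results from the talk.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values: balanced scores versus perfect precision with poor recall.
balanced = f1_score(0.9, 0.9)
lopsided = f1_score(1.0, 0.5)
print(round(balanced, 3), round(lopsided, 3))
```

Because the harmonic mean is pulled toward the lower of the two inputs, a classifier cannot reach a high F1 by maximizing precision or recall alone.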
  • 27. Deep Learning. Deep Learning has produced superior results on several NLP tasks such as POS tagging, Named Entity Recognition, or semantic role labeling, especially when using LSTMs. Source: Recent Trends in Deep Learning Based Natural Language Processing. Tom Young, Devamanyu Hazarika, Soujanya Poria, Erik Cambria, 2018. https://arxiv.org/pdf/1708.02709.pdf
  • 28. General Challenges Deep Learning. Challenge: • Huge computational requirements • Need for very large training data sets • What architecture? Today’s Answer: • Dedicated hardware (GPUs) • Semi-supervised data set creation, training data augmentation & transfer learning • CNN, LSTM, …
  • 30. How about Word Embeddings? • Pre-trained model • Understand context better • Transfer learning: the model already understands general aspects of language and subsequently only needs to be fine-tuned for a specific NLP task. • No need for millions or billions of annotated training examples (when using deep learning).
  • 31. Word Embeddings: Document Representation derived with and used for Deep Learning* • Word2Vec • Doc2Vec • GloVe • FastText • ELMo • BERT • … Remember: with TF-IDF we create a vector for each document. How can we do something similar for Deep Learning? Idea behind Word Embeddings: use words from a vocabulary as input and embed them as vectors into a lower-dimensional space, in order to force the system to create similar encodings for semantically related words and thereby include context. *) But can also be used for SVM or other non-deep-learning models.
  • 32. Word2Vec … Revolutionized the use of word embeddings by using continuous bag-of-words and skip-grams to derive high-quality word embeddings. Why: an unexpected side effect was compositionality: algebraic operations on word vectors result in a vector that is a semantic composite: man + royal = king; men to king = women to …; country1 to capital1 = country2 to … See Gittens et al., Skip-Gram - Zipf + Uniform = Vector Additivity, 2017, for a theoretical justification of compositionality.
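The compositionality described above can be imitated with a toy example. The three-dimensional vectors below are hand-made for illustration (the dimensions loosely stand for royalty, maleness and femaleness); they are not trained Word2Vec embeddings, and the helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest(target, vectors, exclude=()):
    """Word whose vector is most similar to `target`, skipping `exclude`."""
    candidates = {w: vec for w, vec in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# Hand-made toy vectors, NOT learned from data.
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "royal": [1.0, 0.0, 0.0],
}

# man + royal: element-wise addition of the two vectors.
composite = [m + r for m, r in zip(toy["man"], toy["royal"])]
composed_word = nearest(composite, toy, exclude={"man", "royal"})

# king - man + woman: the classic analogy query.
analogy = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
analogy_word = nearest(analogy, toy, exclude={"king", "man", "woman"})
print(composed_word, analogy_word)
```

With real embeddings the match is only approximate, which is why the query words themselves are usually excluded from the nearest-neighbour search.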
  • 34. Limitations of Word2Vec & GloVe … • Language sensitivity. • Cannot deal with out-of-vocabulary words. Solution: preprocess and assign left-over codes to these, but this misses the semantic context. • Combining high-frequency word combinations, as done in Word2Vec: treating common word pairs or phrases as single “words” does not result in compositionality for the individual terms. Solution: also train word pairs as individual tokens, but then it is harder to deal with frequent multi-token words such as New York. • The small window of surrounding words sometimes leads to problems for high-frequency words: bad and good often occur in the same context, leading to similar encodings and resulting in problems in sentiment and emotion mining due to polarity confusion. Solution: preprocessing of negations and polarity. • Word embeddings assign a unique vector to each textually unique word, so a word cannot have multiple meanings (polysemy). Remember a word like bank. Solution: preprocess and replace homonyms with different tokens for each meaning: bank1, bank2, bank3, bank4, …
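The slide above proposes preprocessing negations but does not say how. One widely used heuristic, assumed here purely for illustration, is to prefix every token that follows a negation word with a marker until the next punctuation mark, so that "good" and "not good" become different tokens. The negation list and the NOT_ prefix are choices made for this sketch.

```python
import re

NEGATIONS = {"not", "no", "never"}
PUNCTUATION = set(".,!?;")


def mark_negations(text):
    """Prefix tokens in the scope of a negation word with NOT_."""
    tokens = re.findall(r"[\w']+|[.,!?;]", text.lower())
    marked, negated = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negated = False  # punctuation closes the negation scope
            marked.append(token)
        elif token in NEGATIONS:
            negated = True  # following tokens are in negation scope
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked


marked = mark_negations("The movie was not good, but fun.")
print(marked)
```

After this step an embedding model sees "NOT_good" as its own vocabulary entry, which reduces the polarity confusion between words that share contexts.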
  • 35. Word2Vec & GloVe vs ELMo & BERT • Word2Vec and GloVe word embeddings are context-free or context-independent: they output just one vector (embedding) for each word, so different meanings of a word share the same vector. “He took his cell phone to call his brother who was in the prison cell on the next floor. Both had had blood samples taken to measure their white blood cell count.” results in ONE unique vector for cell*. • ELMo and BERT are context-sensitive: they can generate different word embeddings for a word, capturing its context, that is, its position in a sentence. So the above example would lead to 3 different encodings for the word cell: cell (prison cell) would be closer to words like incarceration, crime, etc.; cell (phone) would be closer to words like iPhone, Android, Galaxy, etc.; and cell (blood) would be closer to biology, life science, etc. *) This may be less of a problem than it seems, as our intuition of “close by” that holds for low-dimensional spaces does not hold for high-dimensional spaces.
  • 36. Uni-Directional and Bi-Directional Context. “I accessed the bank of the river”: a unidirectional contextual model would represent “bank” based on “I accessed the” but not “river”. A bi-directional contextual model represents “bank” using both its previous and next context: “I accessed the ... of the river”. Both ELMo and BERT are bi-directional. ELMo is shallowly bi-directional, BERT deeply bi-directional.
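The difference in available context can be shown with plain token lists. This sketch only illustrates which tokens each kind of model gets to see for the slide's example sentence; it does not build an actual language model, and the function names are hypothetical.

```python
def left_context(tokens, index):
    """Tokens a left-to-right (unidirectional) model sees before position `index`."""
    return tokens[:index]


def both_contexts(tokens, index):
    """Tokens a bi-directional model sees on either side of position `index`."""
    return tokens[:index], tokens[index + 1:]


tokens = "I accessed the bank of the river".split()
position = tokens.index("bank")
unidirectional = left_context(tokens, position)
bidirectional = both_contexts(tokens, position)
print(unidirectional)
print(bidirectional)
```

Only the bi-directional view contains "river", the word that disambiguates "bank" in this sentence.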
  • 37. Differences Word Embeddings. Source: https://www.quora.com/What-are-the-main-differences-between-the-word-embeddings-of-ELMo-BERT-Word2vec-and-GloVe *) BERT is deeply contextual and can deal with out-of-vocabulary words due to its fully connected bi-directional and sub-word representation. **) All of the above are still language dependent.
  • 39. Convolution. Convolution acts as a filter. Used for feature selection. Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
  • 40. What is Convolution in Text? • Convolution makes perfect sense when the input is an image, with two spatial dimensions (height and width). But text has only one dimension, and it’s temporal, not spatial. • For all practical purposes, that doesn’t matter. We just need to think of our text as an image of width n and height 1. TensorFlow provides a conv1d function, but it does not expose other convolutional operations in their 1D version. You will have to add these yourself.
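To show what a one-dimensional convolution over a token sequence does, here is a dependency-free sketch that slides a small filter along the sequence. It deliberately does not use TensorFlow; the input values and the filter are made up, with one number standing in for each token's feature.

```python
def conv1d(sequence, kernel):
    """Valid (no padding) 1-D cross-correlation, the operation used in CNN layers."""
    width = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(width))
        for i in range(len(sequence) - width + 1)
    ]


# Hypothetical sequence with one feature value per token ("image" of width 5, height 1).
signal = [1, 2, 3, 4, 5]
# A width-3 filter that responds to the difference between left and right neighbours.
edge_filter = [1, 0, -1]
feature_map = conv1d(signal, edge_filter)
print(feature_map)
```

In a real text CNN each token would be an embedding vector rather than a single number, and the filter weights would be learned rather than fixed.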
  • 41. Convolution in Text • Context from surroundings: filters with ratio 1:3 generally work best. Larger ones damage the gradient descent in training. • Different layers represent different abstractions; higher layers hold more semantic representations. • Use dilated convolutions (AKA atrous convolutions, AKA convolutions with holes) to capture longer-term relations without angering the gradient descent.
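A dilated (atrous) convolution keeps the number of filter weights fixed but skips positions between them, so the same small filter covers a longer stretch of text. The sketch below is a plain-Python illustration under that definition; the sequence and filter values are made up.

```python
def receptive_field(kernel_width, dilation):
    """Number of input positions spanned by one application of the filter."""
    return (kernel_width - 1) * dilation + 1


def dilated_conv1d(sequence, kernel, dilation=1):
    """Valid 1-D convolution whose filter taps are `dilation` positions apart."""
    span = receptive_field(len(kernel), dilation)
    return [
        sum(sequence[i + j * dilation] * kernel[j] for j in range(len(kernel)))
        for i in range(len(sequence) - span + 1)
    ]


# Hypothetical sequence with one feature value per token.
signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]
dense = dilated_conv1d(signal, kernel, dilation=1)
dilated = dilated_conv1d(signal, kernel, dilation=2)
print(dense, dilated, receptive_field(3, 2))
```

With dilation 2 the three-weight filter spans five tokens instead of three, without adding any parameters.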
  • 42. Conceptual Problems with CNN for NLP • Any part of a sentence can influence the semantics of a word. For that reason we want our network to see the entire input at once. • Getting that big a receptive field can make gradients vanish and our networks fail. • We can solve the vanishing gradient problem with DenseNets or Dilated Convolutions (convolution filters with holes in them), but still ...
  • 43. Long Short-Term Memory (LSTM) networks are better at capturing long-term relations as seen in NLP • Can deal with input of variable sizes. • Better at learning the meaning of the same word in different locations (which is hard for CNN), e.g.: drink a lot of beers / or like to drink a lot • Better at dealing with long-term dependencies
  • 44. 2018: Long Tran: Co-reference resolution with LSTM
  • 45. 2019: Zina Wang: Extract Complex Relation Entities with LSTMs
  • 46. 2019: Zoe Gerolemou: Aspect Based Emotion & Sentiment Mining
  • 47. Business Intelligence: How can text-mining help me? • Collect & analyze open source (textual) data: social media, blogs, websites, … • Consumer opinion analysis. • Anonymize historical data collections holding personal information. • … Source: IDC Executive Summary, 2015
  • 48. Thank you! Time for Q&A Prof dr ir Jan C. Scholtes https://textmining.nu https://www.linkedin.com/in/jscholtes/