Text Mining for Business Intelligence
For most Business Intelligence specialists, the field of data mining is more familiar than that of text mining. A good example of data mining is the analysis of transaction data stored in relational databases, such as competitors' revenue figures or customers' financial transactions. Text is often much harder to work with because of its varying formats, ambiguity, inconsistency and errors.
However, more and more information is unstructured information in the form of text; only a limited amount of information is stored in a structured format in a database. Think of social media, internet forums, websites, blogs or intranets (MS SharePoint sites). Searching or analyzing these with traditional database or data mining techniques is impossible, since those techniques only work on structured information.
The field of text mining therefore focuses on developing advanced mathematical, statistical, linguistic and pattern recognition techniques that make it possible to automatically structure and analyze unstructured information, to extract high-quality and relevant data from it, and thereby to make the text as a whole better searchable.
High quality here refers in particular to the combination of relevance (in other words: finding the needle in the haystack) and gaining new, interesting insights.
These new techniques have already had a major impact in the world of law enforcement and intelligence agencies, as well as in the legal and financial domains. This lecture explains how Business Intelligence applications can benefit from these techniques when gathering valuable insights from open-source information or when measuring sentiments and emotions about products, services or companies on social media and internet forums.
Text mining for Business Intelligence applications
1. Text-Mining: Dealing with unstructured data in business intelligence
Prof. dr. ir. Jan C. Scholtes
https://textmining.nu
https://www.linkedin.com/in/jscholtes/
3. Text Mining
Text Mining: the next step in search technology. Finding without knowing exactly what you're looking for, or finding what apparently isn't there.
4. Text-Mining
The study of text-mining focuses primarily on:
• extracting complex patterns from unstructured electronic data sets, and
• applying machine learning for document classification.
5. Example: automatically extracted entities and facts (news story on the J&J bribery settlement)
Language_Name: English
CITY: New Brunswick, WASHINGTON
COMPANY: J&J, Johnson & Johnson
COUNTRY: Greece, Poland, Romania, United Kingdom
CURRENCY: .02 USD, 21400000 USD, 48600000 USD, 59.47 USD, 70000000 USD
DATE: 04-08
DAY: Fri, Friday
NOUN_GROUP: biotech drugs, bribery case, denying guilt, final growth frontier, foreign countries, giving gifts, holding corporations, intense revenue pressure, meaningful credit, medical device kickbacks, medical devices, multiple businesses, next several days, non-U.S. markets, only way, orthopedic hips, other countries, over-the-counter medicines, paid kickbacks, past year, paying kickbacks, same time, several new positions, similar violations, travel gifts
ORGANIZATION: Department of Justice, Justice Department, SEC, Securities and Exchange Commission, University of Michigan
PEOPLES: Iraqi
PERSON: Erik Gordon, Mythili Raman, William Weldon
PLACE_REGION: Europe
PRODUCT: Benadryl, Tylenol
PROP_MISC: Band-Aids, Food Program, Foreign Corrupt Practices Act, United Nations Oil
STATE: N.J.
TIME: 1:32 pm ET
TIME_PERIOD: 13 years, five years, six months, three years
YEAR: 2007
Problem:
• "We went to the government to report improper payments and have taken full responsibility for these actions," said William Weldon, Chairman and CEO of J&J.
• Last month federal health regulators took legal control of the plant where millions of bottles of defective medication were produced.
• The charges against J&J were brought under the Foreign Corrupt Practices Act, which bars publicly traded companies from bribing officials in other countries to get or retain business.
• The company will pay $21.4 million in criminal penalties for improper payments and return $48.6 million in illegal profits, according to the government.
• The SEC says J&J agents used fake contracts and sham companies to deliver the bribes.
Sentiment:
• giving meaningful credit to companies that self-report
• We are committed to holding corporations accountable for bribing foreign officials
• what is honest
Request:
• make sure it complies with anti-bribery laws across its businesses
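As a point of reference, here is a minimal open-source sketch of the same idea using spaCy's pretrained English pipeline. The model name en_core_web_sm and its label set are assumptions of this sketch, not the (commercial) engine that produced the table above, whose labels are coarser or finer in places:

# Minimal NER sketch with spaCy (assumes: pip install spacy and
# python -m spacy download en_core_web_sm). spaCy's labels differ
# from the engine above (e.g. ORG / PERSON / GPE / MONEY / DATE).
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("The company will pay $21.4 million in criminal penalties for "
        "improper payments and return $48.6 million in illegal profits, "
        "according to the government.")

for ent in nlp(text).ents:
    print(ent.label_, "->", ent.text)   # e.g. MONEY -> $21.4 million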
12. How does that work?
Search + Pattern Recognition = Text-Mining
13. There are certain problems we need to tackle
First Bank of Chicago was the only ATM in the neighborhood. Before visiting the Victoria and Albert Museum on Cromwell Rd, I had to get some cash so I could take Prof. Joan V. Hhdrsat Jr. and my son Mitchell with me to the museum. The last one was not really interested, he hates museums.
• Abbreviations
• Attributes of entities
• Boundary problem (illustrated in the sketch after this list)
• Conjunction problem
• Co-references
• Negations
• Emotions & sentiments
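To make the boundary problem concrete, here is a small illustrative sketch in plain Python (no external libraries) showing how a naive sentence splitter stumbles over the abbreviations in the example text:

import re

text = ("Before visiting the Victoria and Albert Museum on Cromwell Rd, "
        "I had to get some cash so I could take Prof. Joan V. Hhdrsat Jr. "
        "and my son Mitchell with me to the museum.")

# Naive rule: a sentence ends at every period followed by whitespace.
fragments = re.split(r"\.\s+", text)
print(len(fragments))   # 4 fragments: "Prof.", "V." and "Jr." each
                        # trigger a bogus sentence boundary.

A real sentence tokenizer (spaCy, NLTK's punkt) uses abbreviation lists and corpus statistics to avoid exactly these false boundaries.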
15. Precision and Recall
• Lack of precision leads to noise: too many false hits and too much work to review, which yields a high cost of review.
• Lack of recall leads to missing relevant documents, which yields risk.
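For reference, the standard definitions behind these two bullets, with a tiny sketch (the counts are invented for illustration): precision = TP / (TP + FP), recall = TP / (TP + FN), and the F1 score used later in this deck is their harmonic mean.

def precision(tp, fp):   # fraction of retrieved documents that are relevant
    return tp / (tp + fp)

def recall(tp, fn):      # fraction of relevant documents that are retrieved
    return tp / (tp + fn)

def f1(p, r):            # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

p, r = precision(80, 20), recall(80, 40)             # hypothetical counts
print(round(p, 2), round(r, 2), round(f1(p, r), 2))  # 0.8 0.67 0.73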
16. Human Performance
• When both precision and recall are over 80%, human performance is approached.
• This applies to the best humans.
• It can be argued that values over 80% are often subject to different interpretations and discussions.
19. Machine Learning Algorithms are used
During the last decade, a generation of efficient and successful algorithms has been developed, using bag-of-words models to represent document content and statistical and geometrical machine learning algorithms such as Conditional Random Fields (CRF) and Support Vector Machines (SVM).
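A minimal sketch of such a bag-of-words + SVM document classification pipeline with scikit-learn; the four-document training set is invented for illustration:

# Bag-of-words document classification with a linear SVM (scikit-learn).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs   = ["invoice payment overdue", "meeting agenda attached",
          "payment received, thanks", "agenda for next meeting"]
labels = ["finance", "planning", "finance", "planning"]

model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(docs, labels)                    # learn word-frequency patterns
print(model.predict(["overdue invoice"]))  # -> ['finance']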
20. Decision Trees and Entropy Modeling
• A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.
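The entropy in the slide title is the impurity measure tree learners use to pick splits; a small sketch of Shannon entropy over a label set (the labels are invented for illustration):

from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H = -sum(p * log2(p)) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["spam"] * 5 + ["ham"] * 5))  # 1.0  (maximally impure node)
print(entropy(["spam"] * 10))               # -0.0 (a pure node: zero entropy)

A decision tree greedily chooses, at each node, the split that reduces this entropy the most (maximum information gain).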
25. Teach the computer what to look for …
• 3x more relevant documents found than with Boolean search.
• No complex queries: just review documents.
• Only 2x the total number of relevant documents needs to be reviewed.
• Accurately estimate the percentage of all relevant documents found at the end.
26. Benefits and problems of the current approach
• These algorithms require relatively little training data and are fast on modern hardware.
• Performance seems to be maximized at F1 values around 0.9.
• Not very good at transfer learning.
27. Deep Learning
Deep Learning has produced superior results on several NLP tasks such as POS tagging, Named Entity Recognition, and semantic role labeling, especially when using LSTMs.
Source: Recent Trends in Deep Learning Based Natural Language Processing. Tom Young, Devamanyu Hazarika, Soujanya Poria, Erik Cambria, 2018. https://arxiv.org/pdf/1708.02709.pdf
28. General Challenges of Deep Learning
Challenge → Today's answer:
• Huge computational requirements → dedicated hardware (GPUs)
• Need for very large training data sets → semi-supervised data set creation, training data augmentation & transfer learning
• Which architecture? → CNN, LSTM, …
30. How about Word Embeddings?
• Pre-trained models.
• Understand context better.
• Transfer learning: the model already understands general aspects of language, so subsequently it only needs to be fine-tuned for a specific NLP task.
• No need for millions or billions of annotated training examples (when using deep learning).
31. Word Embeddings: Document Representation derived with and used for Deep Learning*
Word2Vec, Doc2Vec, GloVe, FastText, ELMo, BERT, …
Remember: with TF-IDF we create a vector for each document. How can we do something similar for Deep Learning?
Idea behind word embeddings: use words from a vocabulary as input and embed them as vectors in a lower-dimensional space, forcing the system to create similar encodings for semantically related words, thereby including context.
*) But they can also be used for SVMs or other non-deep-learning models.
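For comparison with embeddings, the TF-IDF document vector mentioned above takes only a few lines with scikit-learn (toy documents, invented for illustration):

# TF-IDF: one sparse, vocabulary-sized vector per document.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the bank approved the loan", "the river bank flooded"]
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)   # (2, 6): 2 documents, 6 vocabulary terms

Each document becomes one sparse vector as long as the vocabulary; word embeddings instead map each word to a dense vector in a much lower-dimensional space.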
32. Word2Vec …
Revolutionized the use of word embeddings by using continuous bag-of-words and skip-gram models to derive high-quality word embeddings.
Why: an unexpected side effect was compositionality; algebraic operations on word vectors result in a vector that is a semantic composite:
man + royal = king
man is to king as woman is to …
country1 is to capital1 as country2 is to …
See Gittens et al., Skip-Gram − Zipf + Uniform = Vector Additivity, 2017, for a theoretical justification of compositionality.
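The analogy arithmetic can be reproduced with gensim's downloadable pretrained vectors. A sketch: the dataset name below is one of gensim's bundled downloads, and fetching it is a sizeable one-off:

# Word-vector arithmetic with gensim's pretrained GloVe vectors.
import gensim.downloader as api

vecs = api.load("glove-wiki-gigaword-100")         # ~130 MB download
print(vecs.most_similar(positive=["king", "woman"],
                        negative=["man"], topn=1)) # -> [('queen', ...)]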
34. Limitations of Word2Vec & GloVe …
• Language sensitivity.
• Cannot deal with out-of-vocabulary words. Solution: preprocess and assign left-over codes to these, but this misses the semantic context. (FastText's subword approach, sketched after this list, is another way out.)
• Word2Vec treats high-frequency word combinations (common word pairs or phrases) as single "words", which does not yield compositionality for the individual terms. Solution: also train the word pairs as individual tokens, but then it is harder to deal with frequent multi-token words such as New York.
• The small window of surrounding words sometimes leads to problems for high-frequency words: bad and good often occur in the same context, leading to similar encodings and hence to polarity confusion in sentiment and emotion mining. Solution: preprocess negations and polarity.
• Word embeddings assign a unique vector to each textually unique word, so a word cannot have multiple meanings (polysemy); remember a word like bank. Solution: preprocess and replace homonyms with a different token for each meaning: bank1, bank2, bank3, bank4, …
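The out-of-vocabulary limitation is what FastText's subword (character n-gram) vectors work around; a minimal gensim sketch on a toy corpus (invented for illustration):

# FastText builds vectors from character n-grams, so it can produce a
# vector even for a word never seen in training (unlike Word2Vec).
from gensim.models import FastText

sentences = [["the", "bank", "approved", "the", "loan"],
             ["the", "river", "bank", "flooded"]] * 50
model = FastText(sentences, vector_size=32, min_count=1, epochs=10)

print(model.wv["bankers"].shape)  # (32,): OOV word, vector via n-grams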
35. Word2Vec & GloVe vs ELMo & BERT
• Word2Vec and GloVe word embeddings are context-free or context-independent: they output just one vector (embedding) for each word, i.e. the same vector for different meanings of a word. "He took his cell phone to call his brother who was in the prison cell on the next floor. Both had blood samples taken to measure their white blood cell count." results in ONE unique vector for cell*.
• ELMo and BERT are context-sensitive: they can generate different word embeddings for a word, capturing the context of the word, that is, its position in a sentence. So the example above would lead to three different encodings for the word cell: cell (prison-cell case) would be closer to words like incarceration, crime, etc.; cell (phone case) would be closer to words like iPhone, Android, Galaxy, etc.; and cell (blood case) would be closer to biology, life science, etc.
*) This may be less of a problem than it seems, as our intuition of "close by", which holds for low-dimensional spaces, does not hold for high-dimensional spaces.
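A sketch of this contextual behaviour with the Hugging Face transformers library; the model name bert-base-uncased is assumed, and the exact similarity value will vary:

# Contextual embeddings: the vector for "cell" depends on its sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def cell_vector(sentence):
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (tokens, 768)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids("cell"))
    return hidden[idx]

a = cell_vector("he was locked in a prison cell")
b = cell_vector("she called me on her cell phone")
print(torch.cosine_similarity(a, b, dim=0))  # < 1.0: different contexts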
36. Uni-Directional and Bi-Directional Context
"I accessed the bank of the river"
A unidirectional contextual model would represent "bank" based on "I accessed the" but not on "river". A bi-directional contextual model represents "bank" using both its previous and next context: "I accessed the ... of the river".
Both ELMo and BERT are bi-directional; ELMo is shallowly bi-directional, BERT deeply bi-directional.
37. Differences between Word Embeddings
Source: https://www.quora.com/What-are-the-main-differences-between-the-word-embeddings-of-ELMo-BERT-Word2vec-and-GloVe
*) BERT is deeply contextual and can deal with out-of-vocabulary words thanks to its fully connected bi-directional architecture and subword representation.
**) All of the above are still language-dependent.
39. Convolution
Convolution acts as a filter; it is used for feature selection.
Source: http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
40. What is Convolution in Text?
• Convolution makes perfect sense when the input is an image, with two spatial dimensions (height and width). But text has only one dimension, and it is temporal rather than spatial.
• For all practical purposes, that doesn't matter: we just need to think of our text as an image of width n and height 1. TensorFlow provides a conv1d function, but it does not expose its other convolutional operations in a 1-D version; you will have to add these yourself.
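A minimal Keras sketch of such a 1-D convolution over a sequence of word embeddings (all shapes are illustrative):

# 1-D convolution over text: input is (sequence_length, embedding_dim),
# i.e. an "image" of height 1 whose channels are the embedding dimensions.
import tensorflow as tf

layer = tf.keras.layers.Conv1D(filters=64, kernel_size=3, activation="relu")
x = tf.random.normal((8, 100, 300))  # batch of 8 docs, 100 tokens, 300-d
print(layer(x).shape)                # (8, 98, 64): one feature map per filter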
41. Convolution in Text
• Context comes from the surroundings; filters of width 3 (a 1:3 ratio) generally work best, and larger ones hamper gradient descent during training.
• Different layers represent different levels of abstraction; higher layers hold more semantic representations.
• Use dilated convolutions, AKA atrous convolutions, AKA convolutions with holes (sketched below), to capture longer-term relations without upsetting gradient descent.
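Dilated convolutions are available through the same Keras layer's dilation_rate argument; a sketch showing how the same 3-wide filter now spans five input positions:

# Same 3-tap filter, but with holes: stacking such layers grows the
# receptive field exponentially without widening the filter itself.
import tensorflow as tf

x = tf.random.normal((8, 100, 300))
dilated = tf.keras.layers.Conv1D(filters=64, kernel_size=3, dilation_rate=2)
print(dilated(x).shape)  # (8, 96, 64): effective receptive field of 5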
42. Conceptual Problems with CNNs for NLP
• Any part of a sentence can influence the semantics of a word. For that reason we want our network to see the entire input at once.
• Getting that big a receptive field can make gradients vanish and our networks fail.
• We can solve the vanishing gradient problem with DenseNets or dilated convolutions (convolution filters with holes in them), but still …
43. Long Short-Term Memory (LSTM) networks are better at capturing the long-term relations seen in NLP
• Can deal with input of variable size.
• Better at learning the meaning of the same word in different locations (which is hard for a CNN), e.g.: drink a lot of beers / like to drink a lot.
• Better at dealing with long-term dependencies.
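To close, a minimal Keras sketch of an LSTM text classifier that accepts variable-length input via masking (vocabulary size and dimensions are illustrative):

# LSTM over token ids: masking lets one batch mix different lengths.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=128,
                              mask_zero=True),      # 0 = padding token
    tf.keras.layers.LSTM(64),                       # summarizes the sequence
    tf.keras.layers.Dense(1, activation="sigmoid")  # e.g. a sentiment score
])
model.compile(optimizer="adam", loss="binary_crossentropy")
print(model(tf.constant([[12, 5, 903, 0, 0]])).shape)  # (1, 1)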