Semi-Automatic Text Classification with Deep Neural Networks

© INFOMOTION GmbH 18. September 2018
Simon Pickert
Semi-Automatic Text Classification with Deep Neural Networks
with the aid of huge cluster computation power
An example on Deutsche Bahn sentiment and Spiegel Online topic classification
Frankfurt, 11.09.2019

"Forget about the meaning of words, forget about grammar, forget about syntax, forget even the very concept of a word. Now let the machine learn everything by itself."
François Petitjean, Senior researcher in machine learning and data mining at Monash University

Agenda
1. Introduction Text Classification
2. Introduction Machine Learning / Deep Learning
3. Example Use Cases: Spiegel Online Topic and DB Sentiment Classification
4. Conclusion

Types of applications for Text Classification
Sentiments / Emotions
› Sentiment analysis, e.g. for products / services in social media
Ratings
› Customer relationship analytics: decision making based on e-mail texts
› HR analytics: CV texts
› Review analysis for products
Intention
› Detect intentions, e.g. for chatbot conversations
› IT service-ticket assignment
› Fraud detection
Extract patterns
› Extract semantics from data (e.g. extract invoice data)
› Interpret text paragraphs (e.g. contract details)
› Text summarization

Drivers for Revival of Machine Learning and Text Classification
› Increasing amount and variety of data as part of digitalization
› Costs for data storage and computation power are dropping continually
› Rapid development in science and software-technological progress (Deep Learning, improved algorithms and parallelization support) → effort for implementation will decline further

Requirements for the application of Machine Learning / Deep Learning
Problem characteristics that call for Machine Learning / Deep Learning:
› Problem complexity
› Importance / scale
› Sufficient data
› Problem-relevant regularities / patterns in data

Supervised Learning algorithms that can be used for Text Classification
› Naive Bayes
› Random Forests
› Support Vector Machines
› Ensemble Methods
› (Deep) Neural Networks
Features x1 … xi and labels y → learning algorithms → class ŷ
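The flow from features and labels through a learning algorithm to a predicted class ŷ can be sketched with the simplest algorithm on the list. This is a minimal multinomial Naive Bayes on word counts, written with the standard library only; the four toy sentences and the function names are made up for illustration and are not taken from the talk's datasets.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(texts, labels):
    """Fit a multinomial Naive Bayes model on whitespace-tokenized texts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for text, label in zip(texts, labels):
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log posterior for one text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: features are the words, labels are the sentiments
texts = ["train was late again", "friendly staff and clean train",
         "late and crowded", "clean and on time"]
labels = ["negative", "positive", "negative", "positive"]
model = train_naive_bayes(texts, labels)
```

Calling `predict(model, "crowded and late")` returns the class whose training words best match the new text.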

What is Deep Learning?
› Learning based on biological principles
Applications
› Biggest successes of Deep Learning applications are in image, speech and text data
› Based on Artificial Neural Networks, which are used for time series forecasting, regression and classification
› Used as part of artificial intelligence, e.g. Google AlphaGo, chess AI

Neural Networks – Multilayer Perceptron (MLP)
[Figure: Input Layer → Hidden Layers → Output Layer]
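A forward pass through such a network can be sketched in a few lines. The layer sizes (4 inputs, two hidden layers of 5 units, 3 outputs), the sigmoid activation and the random weights are assumptions chosen for illustration; the slide does not specify them.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer: sigmoid(W x + b) for each output unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random weights in [-1, 1] and zero biases for one layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    """Propagate an input vector through all layers in order."""
    activation = x
    for weights, biases in layers:
        activation = dense(activation, weights, biases)
    return activation

rng = random.Random(0)
# 4 inputs -> two hidden layers with 5 units each -> 3 output units
layers = [init_layer(4, 5, rng), init_layer(5, 5, rng), init_layer(5, 3, rng)]
output = forward([0.2, 0.7, 0.1, 0.9], layers)
```

Each output unit yields a value between 0 and 1; with untrained random weights these values carry no meaning yet.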

Hello World of Neural Networks: Handwritten Digit Recognition
› Goal: recognize the digit the writer intended
› 10,000 examples of handwritten digit images together with their true digits
› Images of 28 x 28 pixels with grayscale values in the interval [0, 1]


How to do the magic of training?
J: error function
θ: weights
› All weights are initialized randomly
› Every row of the digits data is propagated forward through the network and the result is compared with the actual value (error function)
› The weights are then adjusted to minimize the error
Optimization by gradient descent
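The three steps (random initialization, forward calculation with error, weight adjustment) can be sketched on a much smaller problem than digit recognition. The example below fits a straight line with two weights by gradient descent; the model, the learning rate and the step count are assumptions chosen to keep the sketch short, not the setup used in the talk.

```python
import random

def mse(theta, data):
    """Error function J: mean squared error of the line y = theta0 + theta1 * x."""
    return sum((theta[0] + theta[1] * x - y) ** 2 for x, y in data) / len(data)

def gradient(theta, data):
    """Partial derivatives of J with respect to theta0 and theta1."""
    n = len(data)
    g0 = sum(2 * (theta[0] + theta[1] * x - y) for x, y in data) / n
    g1 = sum(2 * (theta[0] + theta[1] * x - y) * x for x, y in data) / n
    return g0, g1

def gradient_descent(data, learning_rate=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    theta = [rng.uniform(-1, 1), rng.uniform(-1, 1)]  # random initialization
    for _ in range(steps):
        g0, g1 = gradient(theta, data)  # forward calculation and error slope
        theta[0] -= learning_rate * g0  # adjust weights against the gradient
        theta[1] -= learning_rate * g1
    return theta

# Hypothetical training data generated from y = 1 + 2x
data = [(x / 10, 1.0 + 2.0 * x / 10) for x in range(11)]
theta = gradient_descent(data)
```

After training, `theta` is close to the generating weights (1, 2) and `mse(theta, data)` is close to zero. A neural network applies the same update rule, with the gradient obtained by backpropagation.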

Different Types of Deep Learning Networks
› Neural networks are considered "deep" if they have an optimization path length > 3
› Common types of neural networks for text classification:
  › Multilayer Perceptron (MLP) with 2 or more hidden layers
  › Recurrent Neural Networks (RNN) with sequence depth > 1
  › Gated Recurrent Unit networks (GRU)
  › Long Short-Term Memory networks (Bi-LSTM)
  › Convolutional Neural Networks (CNN)
  …

Research Timeline Neural Networks / Deep Learning

Timeline for IMDB Movie Rating Benchmark

Deep Neural Networks – Recurrent Neural Network (RNN)
› Recurrent Neural Networks are an extension of the Multilayer Perceptron (MLP) in which the hidden nodes have recurrent weights on their own previous activation
› d is a parameter that indicates how far back the recurrence looks (e.g. the number of words per text)
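The recurrence can be sketched as a simple (Elman-style) RNN step applied once per word vector. The tanh activation, the sizes and the random weights are assumptions for illustration; here the look-back depth d equals the length of the input sequence.

```python
import math
import random

def rnn_forward(sequence, w_in, w_rec, bias):
    """Run a simple RNN over a sequence and return the last hidden state.

    h_t = tanh(w_in @ x_t + w_rec @ h_{t-1} + bias)
    """
    hidden_size = len(bias)
    h = [0.0] * hidden_size  # initial hidden state
    for x in sequence:  # one step per word vector
        h = [
            math.tanh(
                sum(w_in[j][i] * x[i] for i in range(len(x)))
                + sum(w_rec[j][k] * h[k] for k in range(hidden_size))  # previous activation
                + bias[j]
            )
            for j in range(hidden_size)
        ]
    return h

rng = random.Random(1)
input_size, hidden_size = 3, 4
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)] for _ in range(hidden_size)]
w_rec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
bias = [0.0] * hidden_size

# Hypothetical text of three words, each already represented as a vector
sequence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
last_hidden = rnn_forward(sequence, w_in, w_rec, bias)
```

The last hidden state summarizes the whole sequence and would be passed to an output layer for classification.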

Example Case Deutsche Bahn: GermEval Task 2017
› 22,000 text messages from social media with statements about experiences with Deutsche Bahn
› Employees have labeled all the text messages with topics and sentiments
› The goal of the competition is to automatically classify the sentiment and category of new text messages from the social web or e-mails

Example Deutsche Bahn: GermEval Task 2017
› Example tweet:
Text (German original):
Wenn die Bahn so voll ist, dass man lieber noch 10 Minuten in der Kälte wartet, weil man keinen Bock hat in einer Sardinenbüchse zu stehen.
Translation:
When the train is so crowded that you would rather wait another 10 minutes in the cold because you don't feel like standing in a sardine can.
Label Topic: Load Factor / Overcrowding
Label Sentiment: negative

How to proceed with texts for analysis?
› Find a numerical representation of the texts
› Every word and syntactic element gets a unique identification number (tokenization)
› Every sentence is then represented by a vector of these token IDs
› Calculate special word representations such that semantically related words are close to each other in the numerical vector space (word embeddings)
› Embeddings can be trained with word2vec algorithms or taken from pretrained models (e.g. Facebook's models trained on Wikipedia corpora)
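The tokenization step can be sketched as follows. The regular expression, the reserved IDs for padding and unknown tokens, and the example sentences are assumptions for illustration; real pipelines typically use a library tokenizer.

```python
import re

PAD, UNK = 0, 1  # reserved IDs for padding and unknown tokens

def tokenize(text):
    """Split text into lowercase words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(texts):
    """Assign each distinct token a unique ID, in order of first appearance."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab) + 2)  # IDs 0 and 1 are reserved
    return vocab

def encode(text, vocab, max_len):
    """Map a text to a fixed-length list of token IDs (truncate or pad)."""
    ids = [vocab.get(token, UNK) for token in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Hypothetical training sentences
vocab = build_vocab(["Die Bahn ist voll.", "Die Bahn ist pünktlich!"])
encoded = encode("Die Bahn ist leer.", vocab, max_len=6)
```

Here `encoded` is `[2, 3, 4, 1, 6, 0]`: the unseen word "leer" maps to the unknown ID 1, and the sentence is padded with 0 up to the fixed length.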

Training of Word Embeddings with Word2Vec – 2-Layer Neural Network
› Input: the word "Bahn" as a 1-hot vector of length D, e.g. (0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …)
› Model: Word2vec (2-layer MLP), trained with skip-grams or continuous bag of words
› Output: word embedding representation of length L, e.g. (0.39, 0.11, 0.12, 0.33, 0.01, 0.91, 0.11, …)
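Two details of this setup can be sketched in code: multiplying a 1-hot vector of length D by a D x L weight matrix simply selects one row (the word's embedding), and the skip-gram variant trains on (center word, context word) pairs taken from a sliding window. The tiny matrix, the window size and the function names are assumptions for illustration; the sketch does not train anything.

```python
def one_hot(index, size):
    """1-hot vector of length D with a single 1 at the word's index."""
    return [1 if i == index else 0 for i in range(size)]

def embed(one_hot_vector, embedding_matrix):
    """Multiply a 1-hot vector (length D) by a D x L matrix.

    Because only one entry is 1, the result is one row of the matrix.
    """
    length = len(embedding_matrix[0])
    return [sum(one_hot_vector[d] * embedding_matrix[d][col]
                for d in range(len(one_hot_vector)))
            for col in range(length)]

def skip_gram_pairs(token_ids, window=1):
    """(center, context) training pairs as used by the skip-gram variant."""
    pairs = []
    for i, center in enumerate(token_ids):
        for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs

# Hypothetical vocabulary of D = 3 words and embeddings of length L = 2
embedding_matrix = [[0.39, 0.11], [0.12, 0.33], [0.01, 0.91]]
vector = embed(one_hot(1, 3), embedding_matrix)
pairs = skip_gram_pairs([0, 1, 2], window=1)
```

In practice the matrix multiplication is replaced by a direct row lookup, which is why embedding layers scale to large vocabularies.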

Deep Neural Networks for Word-Based Text Classification
› Input: document text as a sequence of word embeddings, e.g. Word 1 = (0.39, 0.11, 0.12, 0.33, 0.01, 0.91, 0.11, …), Word 2 = (0.19, 0.24, 0.52, 0.23, 0.11, 0.24, 0.83, …), …
› Model: deep neural network (RNN, LSTM, CNN, FastText, …)
› Output units (labels): sentiments Positive / Neutral / Negative, e.g. (0, 0, 1)
› N: hidden layer, W: maximum sentence length, N*W*L = weights count

Use Case Spiegel Online
› Data: all articles from 1968 from Spiegel Online (news)
› Classes: 8 possible categories in total: Sport, Politik, Kultur, Netzwelt, Wissenschaft, etc.
› Text length: up to 500 words
› Number of articles: 400,000

Datasets of the Use Cases
Attribute | Case Spiegel Online | Case Deutsche Bahn Sentiments
Number of rows | 500,000 | 22,000
Content / source | News articles of Spiegel Online | Media texts of Deutsche Bahn passengers
Text classification type | Topic / semantics | Sentiments / emotions
Output classes | 8 classes in total: Sport, Politik, Kultur, Netzwelt, Wissenschaft, etc. | 3 possible sentiment classes: positive, neutral, negative
Text length | 10 - 100 words | 100 - 500 words

INFOMOTION Toolbox for Text Classification
Pipeline: Texts + labels → Text preprocessing (word vectorization, bag of words) → Evaluation of different model approaches → Experiments + optimization of hyperparameters + model selection → Final optimized model
Model approaches:
› Classic text mining methods: bag of words + Naive Bayes, SVM
› Deep Learning: CNN, RNN, (Bi-)LSTM, FastText
Distributed computation of different model approaches and hyperparameters (monitoring, error handling) across many workers (Worker, Worker, Worker, …), optionally with GPU / TPU
Spark Distribution | AWS Distribution
Accuracy for each use case and model approach
Model approach | Spiegel Online (Accuracy / F1 Score) | Deutsche Bahn Sentiments (Accuracy / F1 Score)
FastText Framework (n-grams) | 74.4 % | 82.6 %
Bi-LSTM + word2vec | 81.1 % | 86.4 %
Convolutional Network + word2vec | 71.3 % | 81.2 %
TF-IDF + Naive Bayes | 61.8 % | 65.6 %
TF-IDF + SVM | 64.3 % | 68.1 %
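For reference, accuracy and F1 score can be computed as sketched below. The slides do not state which F1 averaging was used, so the macro average here is an assumption, and the labels are invented toy data rather than results from the talk.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_per_class(y_true, y_pred, label):
    """F1 score (harmonic mean of precision and recall) for one class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true))
    return sum(f1_per_class(y_true, y_pred, c) for c in labels) / len(labels)

# Hypothetical true and predicted sentiment labels
y_true = ["negative", "negative", "neutral", "positive", "positive", "neutral"]
y_pred = ["negative", "neutral", "neutral", "positive", "negative", "neutral"]
```

On imbalanced data the two metrics can diverge noticeably, since accuracy rewards getting the majority class right while macro F1 weights every class equally.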

Conclusion
› Deep Learning methods showed the best performance compared to classic approaches for both use cases (short and long texts)
› FastText is not far behind; classic methods score significantly lower
› It is recommended to try different state-of-the-art model approaches
› Tools for automating model selection for text classification are feasible because many state-of-the-art approaches can be standardized and parallelized
› Training and hyperparameter optimization are computationally very expensive, so distribution helps to run large experiment trials for optimization within a limited time span

All information is based on the current state of knowledge. Subject to change. This document of INFOMOTION GmbH is intended exclusively for the addressee or client. It remains the property of INFOMOTION GmbH until rights of use are expressly transferred. Any editing, exploitation, reproduction and/or commercial distribution of the work is permitted only with the consent of INFOMOTION GmbH.
INFOMOTION GmbH
Frankfurt Office
SIMON PICKERT
INFOMOTION GMBH
Ludwigstraße 33-37
60327 Frankfurt
Business Informatics (M.Sc.)
Data Scientist
www.infomotion.de
T +49 69 97460-700
F +49 69 97460-799
M +49 176 94247079
simon.pickert@infomotion.de
