Copyright © 2014 KNIME.com AG
Boston KNIME Users
Text Processing Applications
Kilian Thiel
KNIME
Agenda
• KNIME Crash Course
• Text Mining with KNIME: Mining Tripadvisor Data
• Text Mining with KNIME: Mining Amazon Reviews
(Anil Tarachandani)
• Networking Apero
Text Mining with KNIME: Mining Tripadvisor Data
Agenda
• The KNIME Textprocessing Extension
– Preliminaries
– Philosophy & Usage
• Classification of Tripadvisor Reviews
– Tripadvisor data
– Classification of reviews
Resources
http://tech.knime.org/knime-text-processing
• Documentation
• Examples
• Forum
• White Papers
Installation
[Figure: installation screenshots, steps 1.) and 2.)]
Requirements
Requirements to import and run demo workflows
• KNIME 2.10
• Textprocessing (labs)
• Distance Matrix (KNIME)
• Palladian (Community)
Tips
• Settings (knime.ini)
– Set maximum memory for KNIME
– -Xmx3G
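The memory setting above can also be applied programmatically. The following is a minimal sketch, assuming knime.ini is an Eclipse-style launcher file in which JVM arguments such as -Xmx appear one per line after -vmargs. The function name `set_max_heap` and the sample file content are hypothetical and only serve as illustration.

```python
import re


def set_max_heap(ini_text: str, heap: str = "3G") -> str:
    """Return ini_text with its -Xmx line set to the given heap size.

    If no -Xmx line exists, one is appended at the end. This assumes the
    -vmargs section is the last section of the file (an assumption).
    """
    lines = ini_text.splitlines()
    new_line = f"-Xmx{heap}"
    for i, line in enumerate(lines):
        # Match lines such as -Xmx1024m or -Xmx2G.
        if re.fullmatch(r"-Xmx\d+[kKmMgG]?", line.strip()):
            lines[i] = new_line
            break
    else:
        # No existing -Xmx line was found.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Illustrative file content, not copied from a real installation.
example = "-vmargs\n-Xmx1024m\n"
print(set_max_heap(example))
```

Editing the file by hand works just as well; the sketch only shows which line changes.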
Demo
Prepare KNIME
• Go to KNIME directory
• Change knime.ini file (optional)
– -Xmx3G
• Start KNIME
• Install Textprocessing Extension
– (or better have it already installed)
Philosophy
[Figure: tagged text ("… perhaps your name is Rumpelstiltskin[Person]? …") is turned into a binary document-term matrix that feeds visualization, clustering, and classification.]
Additional Data Types
• Document Cell
– Encapsulates a document
• Title, sentences, terms, words
• Authors, category, source
• Generic meta data (key, value pairs)
• Term Cell
– Encapsulates a term
• Words, tags
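To make the two data types more concrete, here is a rough Python sketch of what a Document Cell and a Term Cell encapsulate. The class and field names are assumptions chosen to mirror the bullets above; they are not the actual KNIME classes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Term:
    """A term: one or more words plus the tags assigned to it."""
    words: tuple[str, ...]
    tags: tuple[str, ...] = ()


@dataclass
class Document:
    """A document with its text structure and meta data."""
    title: str
    sentences: list[list[Term]]
    authors: list[str] = field(default_factory=list)
    category: str = ""
    source: str = ""
    meta: dict[str, str] = field(default_factory=dict)  # generic key, value pairs

    def terms(self) -> list[Term]:
        """Return all terms of all sentences in reading order."""
        return [term for sentence in self.sentences for term in sentence]


# A tagged term, as in "Rumpelstiltskin[Person]".
name = Term(("Rumpelstiltskin",), ("Person",))
doc = Document(
    title="Example",
    sentences=[[Term(("perhaps",)), Term(("your",)), Term(("name",)),
                Term(("is",)), name]],
)
print(doc.terms()[-1])
```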
Data Table Structures
• Document table
– List of documents
• Bag of words
– Tuples of documents and terms
• Document vectors
– Numerical representations of documents
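The three table structures can be illustrated with plain Python collections. This is only a sketch: the two review texts are made up, and binary (0/1) vector entries are one possible numerical representation among several.

```python
# Document table: one row per document (here: id and raw text).
document_table = [
    ("doc0", "great pasta great wine"),
    ("doc1", "tasty dumplings"),
]

# Bag of words: one row per distinct (document, term) pair.
bag_of_words = sorted(
    {(doc_id, term) for doc_id, text in document_table for term in text.split()}
)

# Document vectors: one row per document, one column per term.
vocabulary = sorted({term for _, term in bag_of_words})
document_vectors = {
    doc_id: [1 if (doc_id, term) in bag_of_words else 0 for term in vocabulary]
    for doc_id, _ in document_table
}

print(vocabulary)
print(document_vectors)
```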
Philosophy and Data Table Structures
[Figure: pipeline diagram with the stages Enrichment and Preprocessing and the table structures Documents, BoW, and Vectors]
Tripadvisor Data
Fields of a review:
• Title
• Author
• Rating
• Fulltext
Tripadvisor Data
Reviews about Italian and Chinese restaurants in Boston
• Chinese: 272
• Italian: 268
Tripadvisor Data
Goal:
• Build a classifier that distinguishes between Chinese and Italian restaurants based on their reviews.
Is a given review about an Italian or a Chinese restaurant?
1.) Reading
Read/Parse textual data
Demo
Reading
• Read Tripadvisor data (.table file)
• Filter rows with missing restaurant value
• Convert strings to documents
• Filter all but the document column
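Outside of KNIME, the reading steps above might look roughly like this. The rows, the column names, and the choice to store the restaurant type as the document category are assumptions made for the sake of the example.

```python
# Made-up rows standing in for the Tripadvisor table.
rows = [
    {"title": "Lovely evening", "fulltext": "Great pasta.", "restaurant": "italian"},
    {"title": "No label", "fulltext": "Nice place.", "restaurant": None},
    {"title": "Quick lunch", "fulltext": "Tasty dumplings.", "restaurant": "chinese"},
]

# Filter rows with a missing restaurant value.
labelled = [row for row in rows if row["restaurant"] is not None]

# Convert strings to documents (assumption: restaurant type becomes the category).
documents = [
    {"title": row["title"], "text": row["fulltext"], "category": row["restaurant"]}
    for row in labelled
]

# Filter all but the document column.
document_table = [{"document": doc} for doc in documents]

print(len(document_table))
```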
2.) Enrichment
Enrich documents with semantic information
Demo
Enrichment / Tagging
• Apply POS Tagger node
• Use Bag of Words node to inspect tagging result
3.) Preprocessing
Preprocess documents and filter words
Demo
Preprocessing
• Filter
– Numbers
– Punctuation marks
– Stop Words
• Convert to lower case
• Stemming
• Keep only nouns, verbs, adjectives
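The preprocessing chain above can be sketched on a list of (word, POS tag) pairs. Several things here are assumptions: the tiny stop word list, the use of Penn Treebank tag prefixes (NN, VB, JJ) to select nouns, verbs, and adjectives, and `crude_stem`, which is a toy suffix stripper standing in for a real stemming algorithm.

```python
import string

STOP_WORDS = {"the", "a", "an", "and", "was", "is", "were", "very"}  # illustrative only
KEPT_TAG_PREFIXES = ("NN", "VB", "JJ")  # nouns, verbs, adjectives


def crude_stem(word: str) -> str:
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Apply the filters in the order listed above and return (stem, tag) pairs."""
    result = []
    for word, tag in tagged_terms:
        if any(ch.isdigit() for ch in word):              # filter numbers
            continue
        if all(ch in string.punctuation for ch in word):  # filter punctuation marks
            continue
        if word.lower() in STOP_WORDS:                    # filter stop words
            continue
        stem = crude_stem(word.lower())                   # lower case, then stem
        if tag.startswith(KEPT_TAG_PREFIXES):             # keep nouns, verbs, adjectives
            result.append((stem, tag))
    return result


tagged = [("The", "DT"), ("noodles", "NNS"), ("were", "VBD"), ("amazing", "JJ"),
          (",", ","), ("2", "CD"), ("servings", "NNS"), ("quickly", "RB"), ("!", ".")]
print(preprocess(tagged))
```

Stems such as "noodl" are not dictionary words; that is expected, since stemming only needs to map related word forms to the same string.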
4.) Transformation
Creation of numerical representation of documents
Demo
Transformation
• Transform to bag of words
• Compute TF value for terms
• Transform to document vectors
• Extract category (class) value
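As a sketch of the transformation steps, the snippet below computes a TF value per document and term, builds document vectors from it, and extracts the class. It assumes relative term frequency (count divided by document length); an absolute count would work the same way. The two sample documents are made up.

```python
from collections import Counter

# Made-up preprocessed documents: (list of terms, category).
documents = {
    "doc0": (["pasta", "wine", "pasta"], "italian"),
    "doc1": (["dumpling", "tea"], "chinese"),
}

# Bag of words with a TF value: (document, term, tf) rows.
bag_of_words = []
for doc_id, (terms, _category) in documents.items():
    counts = Counter(terms)
    for term, count in sorted(counts.items()):
        bag_of_words.append((doc_id, term, count / len(terms)))

# Document vectors: one column per term, TF as the cell value, 0.0 if absent.
vocabulary = sorted({term for _, term, _ in bag_of_words})
tf_lookup = {(doc_id, term): tf for doc_id, term, tf in bag_of_words}
vectors = {
    doc_id: [tf_lookup.get((doc_id, term), 0.0) for term in vocabulary]
    for doc_id in documents
}

# Extract the category (class) value of each document.
classes = {doc_id: category for doc_id, (_terms, category) in documents.items()}

print(vocabulary)
print(vectors)
print(classes)
```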
5.) Classification
Training of a model (decision tree) and scoring
Demo
Classification
• Append color based on class
• Partition data into training and test set
• Train decision tree model on training data
• Apply decision tree model on test data
• Score model, measure accuracy
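The partition, train, apply, and score steps can be sketched with the standard library alone. Note the simplifications: a depth-1 decision tree (a decision stump) stands in for a full decision tree learner, a fixed every-fourth-row split stands in for the partitioning step, and the eight document vectors are made up. The coloring step is omitted because it only affects visualization.

```python
from collections import Counter

# Columns correspond to the terms: dumpling, pasta, tea, wine (made-up data).
data = [
    ([0, 1, 0, 1], "italian"), ([1, 0, 1, 0], "chinese"),
    ([0, 1, 0, 0], "italian"), ([1, 0, 0, 0], "chinese"),
    ([0, 1, 0, 1], "italian"), ([1, 0, 1, 0], "chinese"),
    ([0, 0, 0, 1], "italian"), ([0, 1, 0, 0], "italian"),
]


def majority(labels, default):
    """Most frequent label, or default if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_stump(rows):
    """Fit a depth-1 tree: pick the term whose presence best splits the classes."""
    overall = majority([label for _, label in rows], None)
    best = None
    for j in range(len(rows[0][0])):
        present = [label for vec, label in rows if vec[j] > 0]
        absent = [label for vec, label in rows if vec[j] <= 0]
        if_present = majority(present, overall)
        if_absent = majority(absent, overall)
        errors = (sum(label != if_present for label in present)
                  + sum(label != if_absent for label in absent))
        if best is None or errors < best[0]:
            best = (errors, j, if_present, if_absent)
    return best[1:]


def predict(stump, vec):
    j, if_present, if_absent = stump
    return if_present if vec[j] > 0 else if_absent


# Partition: every fourth row goes to the test set, the rest to training.
train = [row for i, row in enumerate(data) if i % 4 != 3]
test = [row for i, row in enumerate(data) if i % 4 == 3]

stump = train_stump(train)
accuracy = sum(predict(stump, vec) == label for vec, label in test) / len(test)
print(stump, accuracy)
```

On real review data the split would be random and the tree deeper; the perfect accuracy here only reflects how clean the toy data is.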
Additional Workflows
• Multi Word Tagging
– Detection of frequent n-grams
– Creation of dictionary from n-grams
– Applying Dictionary Tagger
• Classification with Multi Words
• Clustering of documents
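The multi word tagging idea can be sketched as follows: count n-grams, keep the frequent ones as a dictionary, then merge dictionary matches into single terms. The function names, the threshold of two occurrences, and the sample reviews are assumptions for illustration; this is not how the Dictionary Tagger node is implemented.

```python
from collections import Counter


def frequent_ngrams(token_lists, n=2, min_count=2):
    """Return the set of n-grams (joined by spaces) seen at least min_count times."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for tokens in token_lists
        for i in range(len(tokens) - n + 1)
    )
    return {" ".join(gram) for gram, count in counts.items() if count >= min_count}


def tag_multi_words(tokens, dictionary, n=2):
    """Merge dictionary n-grams into single terms, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        candidate = " ".join(tokens[i:i + n])
        if candidate in dictionary:
            out.append(candidate)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


# Made-up tokenized reviews.
reviews = [
    ["best", "fried", "rice", "in", "town"],
    ["fried", "rice", "was", "cold"],
    ["loved", "the", "fried", "rice"],
]
dictionary = frequent_ngrams(reviews)
print(dictionary)
print(tag_multi_words(reviews[0], dictionary))
```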
Thank You
Questions
• http://tech.knime.org/forum
• Kilian.Thiel@knime.com
Follow us
• Twitter: @KNIME
• LinkedIn: https://www.linkedin.com/groups?gid=2212172
• KNIME Blog: http://www.knime.org/blog
