1. Patent Data Mining and
Visualization Functionalities
A foray into the worlds of R & KNIME
2. Overview
Data Mining
What is Text Mining?
Text Mining Process
Text Transformation
Feature Selection – tf-idf
Feature Selection – Term Document Matrix
Feature Selection – Term-Term Matrix
Word Clouds and Clustering Examples
R and KNIME
Live Example - R Shiny
Visualizations SVG and D3
The ‘Big Data’, R and KNIME
KNIME Versus R
Conclusions
Document Vectorization
3. Data Mining
• Data Mining = Building Models
• Model (Regression, Decision Trees, Neural Networks) = Set of rules connecting a
collection of inputs to a particular target outcome
• A model can explain outcomes of particular interest in terms of the
available facts
• Data Mining Tasks
• Classification
• Estimation
• Prediction
• Affinity grouping
• Clustering
Directed – finding a particular target variable
Undirected – discovering structure in the data without
any target variable in mind
4. Why this Study?
Apply Data Mining Techniques
to understand fine structure of
published Patent Documents.
Features of Patent Documents
• Structured Component
• Patent Number, Filing Dates,
Assignees, Regional Coverage
• Unstructured Components
• Title, Claims, Abstract, Descriptions
Data Mining Visualizations
Outcome
• Augment manual interpretation of the results
• Address visualization limitations
• Provide collapsible layouts, interactive graphs, etc.
5. What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual
documents in various ways, including …discovery of patterns and trends in
data, associations among entities, predictive rules, etc.” (Grobelnik et al.,
2001)
“Another way to view text data mining is as a process of exploratory data
analysis that leads to heretofore unknown information, or to answers for
questions for which the answer is not currently known.” (Hearst, 1999)
References
M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37th Annual Meeting of the Association
for Computational Linguistics, 1999.
M. Grobelnik, D. Mladenic, and N. Milic-Frayling, “Text Mining as Integration of Several Related Research
Areas: Report on KDD’2000 Workshop on Text Mining,” 2000.
6. Text Mining Process
Preprocessing
• Data Import
• Text preprocessing
Text Transformation
• Stop word Removal
• Stemming
• Parts of Speech Tagging
• N-gram Generation
• Synonym Generalization
Feature Selection And Data Mining
• Term Document Matrix
• Term-Term Matrix
• Clustering or Classification
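The feature-selection step can be sketched in a few lines. The following is a minimal Python illustration (the deck itself works in R and KNIME) of a term-document matrix and the term-term matrix derived from it. The three toy documents are made up for the example and are not patent data.

```python
from collections import Counter

# Toy corpus: three made-up "documents" (illustrative only).
docs = [
    "laser beam guide",
    "light beam guide field",
    "laser light",
]

# Tokenize on whitespace; a real pipeline would first apply the
# text-transformation steps (stop words, stemming, and so on).
tokens = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokens for term in doc})

# Term-document matrix: one row per term, one column per document,
# each cell holding the raw count of the term in that document.
counts = [Counter(doc) for doc in tokens]
tdm = [[c[term] for c in counts] for term in vocab]

# Term-term matrix: the TDM multiplied by its own transpose, so that
# cell (i, j) sums count_i * count_j over all documents.
ttm = [
    [sum(a * b for a, b in zip(row_i, row_j)) for row_j in tdm]
    for row_i in tdm
]

for term, row in zip(vocab, tdm):
    print(f"{term:<6} {row}")
```

Each row of `tdm` is the vector representation of a term across documents, and large off-diagonal entries of `ttm` point to terms that tend to occur together, which is the input a clustering step would work from.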
7. Text Transformation
Stop Word Removal (stop words: "and", "for", "in", "is", "it", "not", "the", "to", "its")
• Input: "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
• Output: "Gulf Applied Technologies Inc said sold subsidiaries engaged"
Stemming
• Input: "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
• Output: "Gulf Appli Technolog Inc said it sold it subsidiari engag in pipelin"
Parts of Speech Tagging
• Input: "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
• Output: "Gulf/NNP Applied/NNP Technologies/NNPS Inc/NNP said/VBD it/PRP sold/VBD"
• e.g., NNP stands for proper noun, singular, and VBD stands for verb, past tense
N-grams
• Input: "Gulf Appli"
• Output: “Gulf Appli”
Synonyms (wordnet)
• Input: "Company", via synonyms("company")
• Output: "caller" "companionship" "company" "fellowship" …
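Two of the transformations above are simple enough to sketch directly. This is a minimal Python illustration, not the R code behind the slides. The helper names `remove_stop_words` and `ngrams` are hypothetical, and the stop list is the one shown on the slide. Because that list includes "its", the sketch drops that word as well.

```python
# Stop list copied from the slide.
STOP_WORDS = {"and", "for", "in", "is", "it", "not", "the", "to", "its"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every token whose lowercase form is in the stop list."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


def ngrams(text, n=2):
    """Return all runs of n consecutive whitespace-separated tokens."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "Gulf Applied Technologies Inc said it sold its subsidiaries engaged in"
cleaned = remove_stop_words(sentence)
print(cleaned)
print(ngrams(cleaned)[:3])
```

Stemming, part-of-speech tagging and synonym lookup need linguistic resources (a stemmer, a tagger model, WordNet) and are better left to a library.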
8. Text Transformation – Regular
Expressions (regex)
A regular expression (abbreviated regex or regexp) is a sequence of
characters that forms a search pattern, mainly for use in pattern
matching with strings, i.e. "find and replace"-like
operations. It is a standard feature of Unix text-processing utilities such as “grep”
and is now supported by most programming languages and text editors.
A simple regexp, ^[ \t]+|[ \t]+$, matches excess whitespace at the beginning or
end of a line.
An advanced regexp used to match any numeral is ^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$
One More Example
[cC]ollimat*
DAP.*
[gG]uid.*[fF]ield
[fF]ield.*[gG]uid
[lL]ight.*[bB]eam
[lL]aser.*[bB]eam
[bB]eam.*[lL]ight
[bB]eam.*[lL]aser
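The patterns above can be tried out with any regex engine. The sketch below uses Python's `re` module, with the escape sequences `\t`, `\d` and `\.` written out explicitly. The test strings are made up for illustration. Inside a character class such as `[lL]` no `|` is needed, since a class already means "any one of these characters".

```python
import re

# Whitespace at the start or end of a line.
TRIM = re.compile(r"^[ \t]+|[ \t]+$")

# Signed integer or decimal with an optional exponent.
NUMERAL = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")

# "light ... beam" or "beam ... light", first letter in either case.
LIGHT_BEAM = re.compile(r"[lL]ight.*[bB]eam|[bB]eam.*[lL]ight")

trimmed = TRIM.sub("", "  \tcollimated beam  ")
print(repr(trimmed))
print(bool(NUMERAL.match("-1.5e+10")), bool(NUMERAL.match("abc")))
print(bool(LIGHT_BEAM.search("A light source emits a collimated beam")))
```

Pairs such as `[lL]ight.*[bB]eam` and `[bB]eam.*[lL]ight` are needed because a regular expression matches left to right; listing both orders catches the two keywords wherever they appear relative to each other.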
9. Feature Selection – Term Frequency
Inverse Document Frequency (tf-idf)
tf–idf is a numerical statistic that reflects how important
a word is to a document in a collection or corpus.
It is often used as a weighting factor in information
retrieval and text mining.
The tf-idf value increases proportionally to the number
of times a word appears in the document, but is offset
by the frequency of the word in the corpus, which helps
to control for the fact that some words are generally
more common than others.
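As a worked illustration, the sketch below computes one common variant of the weight: the raw count of the term in the document multiplied by ln(N / df), where N is the number of documents and df is the number of documents containing the term. Other weightings exist, and the slides do not say which one was used, so this choice is an assumption. The three short documents are made up.

```python
import math
from collections import Counter

# Made-up miniature corpus (illustrative only).
docs = [
    "ultrasound probe with transducer array",
    "ultrasound imaging system",
    "laser beam collimator",
]
tokens = [d.split() for d in docs]
n_docs = len(tokens)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in tokens for term in set(doc))


def tf_idf(term, doc):
    """Raw count of `term` in `doc` times ln(N / df); 0.0 if absent."""
    tf = doc.count(term)
    return tf * math.log(n_docs / df[term]) if tf else 0.0


for term in ("ultrasound", "probe"):
    print(term, round(tf_idf(term, tokens[0]), 3))
```

In the first document "probe" scores higher than "ultrasound" even though both occur once, because "ultrasound" also appears in a second document and is therefore less distinctive.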
14. R and KNIME
R
• A software package especially suitable for data analysis and data (text)
mining, with rich visualization functionality
• Scripting interface
• Graphical user interface development via the “shiny” package
KNIME
• Supports modular, node-based workflows
• Core functionality required for data and text mining is
implemented via these nodes
• The functionality of nodes can be extended via R and Java code
snippets in the nodes
R Example
16. Live Example - R Shiny Package
Web Applications Using (Only) R
No Need for HTML or Javascript
Great for Communication and Visualization
http://www.rstudio.com/shiny/showcase/
http://rstudio.github.io/shiny/tutorial/
ui.R
Put all UI-related code here
server.R
Put all server-side (event-handling) code here
Socket
R Shiny Example
18. The ‘Big Data’, R and KNIME
pbdR is an academic initiative; it requires special
permission to access a cluster of computers
called Tara
All Revolution R Enterprise 7 editions are distributed
with Open Source R (version 3.0.2), are 100%
compatible with R scripts, functions and CRAN
packages, and include phone and online technical
support.
ParAccel Hadoop Analytics
19. KNIME Versus R
KNIME: Visual programming interface; intuitive, but some familiarity is required
R: Scripting interface; steep learning curve
KNIME: Workflows can be tailor-made
R: Workflows can be tailor-made, with a user interface built in R Shiny
KNIME: All text mining and data analytics tools are available from a single user interface. Classification problems (supervised learning) can be handled better here, as all the required libraries are present in one place and intermediate results can be viewed at the node output ports
R: Most of the libraries for text mining and data analytics are available, but they must be loaded before use
KNIME: The desktop version of KNIME is available for free, but the server version has special requirements
R: Server as well as desktop versions are available
KNIME: Requires a reasonably modern PC running Linux, Windows (XP and later), or Mac OS X. A multi-core system is a plus
R: Memory limitations can be overcome using packages like “ff” and “ffbase”
KNIME: Graphics output can be sent to SVG etc.
R: Graphics can be sent to SVG etc. Graphics can also be sent to DHTML using R Shiny
KNIME: R and Java code can be used at nodes for creating proprietary analyses and visualizations
KNIME: Robust big data extensions are available for distributed frameworks such as Hadoop
R: Programming with Big Data in R (pbdR) and distributed frameworks such as Hadoop
20. Conclusions
Starting from the reasons for doing this project, tools like R and KNIME were examined for their suitability for
text data mining and automatic classification.
Owing to the availability of several built-in libraries, R and KNIME are well suited to text data mining.
R and KNIME could be used in a “Big Data” setting, though this may require additional hardware and
the use of proprietary software.
KNIME scores over R in ease of use because of its node-based visual programming interface.
This study is exploratory in nature, and no serious attempt is made to solve problems related to
automatic document classification. Some of the text mining libraries that were explored are:
− tm library in R for generating the Term-Document Matrix and for removing stop words
and punctuation marks in text
− tm library also for N-gram tokenization (taking two words at a time)
− openNLP library for parts-of-speech tagging
− Snowball and Porter stemmers for stemming text
− Graphing capabilities of R and KNIME for visual depiction of text in the form of word
clouds
23. Text mining With R Regular Expressions
Parts of Speech Tagging (POS)
Tag  Meaning             Examples
ADJ  adjective           new, good, high, special, big, local
ADV  adverb              really, already, still, early, now
CNJ  conjunction         and, or, but, if, while, although
DET  determiner          the, a, some, most, every, no
EX   existential         there, there's
FW   foreign word        dolce, ersatz, esprit, quo, maitre
MOD  modal verb          will, can, would, may, must, should
N    noun                year, home, costs, time, education
NP   proper noun         Alison, Africa, April, Washington
NUM  number              twenty-four, fourth, 1991, 14:24
PRO  pronoun             he, their, her, its, my, I, us
P    preposition         on, of, at, with, by, into, under
TO   the word to         to
UH   interjection        ah, bang, ha, whee, hmpf, oops
V    verb                is, has, get, do, make, see, run
VD   past tense          said, took, told, made, asked
VG   present participle  making, going, playing, working
VN   past participle     given, taken, begun, sung
WH   wh determiner       who, which, when, what, where, how
24. Invocation of Shiny
runApp takes the name of the application directory; in this example it is
Test_Shiny01. This directory contains Test.csv as the data source
and two R files called “ui.R” and “server.R”. The ui.R file defines the
user interface, in this case an HTML page with tabs and a sidebar
panel (with user controls). The server.R file does all the event
handling after the user selects the “Test.csv” file. The present
implementation works only with the Test.csv file
25. Choosing the data source
Click the Browse button
and choose the file
“Test.csv”
Click “Update now”
28. Box Plots based on Value Score for
Top Five Players
Companies
29. R – Patent Informatics
Word Clouds and Cluster Dendrograms
• Word cloud based on IPC codes
• Bigram cloud (a bigram contains two words)
• Word cloud
• Cluster dendrogram – different technical aspects of ultrasound
that are associated with the ultrasound probe
30. Each individual patent is treated
as a file; these files are
generated using R code. For this
text mining example, the title,
abstract and claims data are used
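The one-file-per-patent step can be sketched as follows. The slides say the files were generated with R code that is not shown, so this Python version is only a stand-in. The function name `write_patent_files`, the column names and the two sample records are all hypothetical.

```python
import csv
import io
import tempfile
from pathlib import Path

# Hypothetical export; column names and records are made up.
SAMPLE_CSV = """patent_number,title,abstract,claims
P0001,Ultrasound probe,A probe with a transducer array.,1. A probe comprising an array.
P0002,Imaging system,A system for ultrasound imaging.,1. A system comprising a display.
"""


def write_patent_files(csv_text, out_dir):
    """Write one text file per patent holding title, abstract and claims."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = out_dir / f"{row['patent_number']}.txt"
        path.write_text(
            "\n".join([row["title"], row["abstract"], row["claims"]]),
            encoding="utf-8",
        )
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = write_patent_files(SAMPLE_CSV, tmp)
    names = [p.name for p in written]
    first_text = written[0].read_text(encoding="utf-8")

print(names)
```

Keeping one document per file lets a corpus reader treat a directory as the corpus, with the file name serving as the document identifier.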
31. Workflows In KNIME
Java Code Snippet
R Code Snippet