My talk on Patent Visualization at The 3rd IEEE Workshop on Interactive Visual Text Analytics. Primary focus is to introduce the Scalable Visual Analytics research that my team is working on. Workshop paper can be found at: http://vialab.science.uoit.ca/textvis2013/papers/Ankam-TextVis2013.pdf
Visually Exploring Patent Collections for Events and Patterns
1. Visually Exploring Patent Collections for Events and Patterns
Derek X. Wang
Associate Director of the Charlotte Visualization Center
Together with:
Wenwen Dou, Wlodek Zadrozny, Suraj Ankam, Debbie Strumsky, Terry Rabinowitz
7. Value
Businesses
• 800 patents: $1 billion worth of patents from AOL to Microsoft
• 1,100 patents from Kodak: $525 million to group license
• 17,000 patents: $12.5 billion, Motorola Mobility to Google
20. Value Goal
• Can we spot an emerging new technology?
• Text mining and visualization
• Can we spot novelty within a patent?
• How much do claims differ from class descriptions?
• How much do claims differ from claims in other similar patents?
• Can we list “all” patents relevant for some technology? (And
what does that mean?)
23. Value Goal
A Robust and Scalable Patent Analysis Infrastructure
Is Needed
Balanced Analytics Technology = Human + Computer
Visual Analytics Will Play a Key Role
26. Value Goal Challenge
Unstructured or semi-structured
Highly heterogeneous
Leading to highly heterogeneous models
Incomplete or with holes
With intrinsic uncertainty (and in some cases deception)
Inside and outside the enterprise
Containing detailed time and space information
33. Value Goal Challenge Research
Structuring the Unstructured:
Topic Modeling
• Latent Dirichlet Allocation (LDA)
• Reveals latent topics from a large text corpus
• Coherent sets of most likely words to describe topics
• Topics defined by keyword groups
• Topics in text collections can effectively be inferred
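To make the bullets above concrete, here is a minimal sketch of LDA inference using collapsed Gibbs sampling in plain Python. It is not the implementation used in this work: the function name `lda_gibbs`, the hyperparameter values, and the four toy documents are assumptions chosen only for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # words in doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                        # total words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    # Resample each assignment conditioned on all the others.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Each document is a mixture of topics; each topic is described by its likeliest words.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [[w for w, c in tw.most_common(5) if c > 0] for tw in topic_word]
    return theta, top_words


# Invented toy corpus, not patent data.
docs = [
    "antenna signal wireless antenna receiver".split(),
    "wireless signal receiver antenna".split(),
    "storage software memory storage file".split(),
    "software file memory storage".split(),
]
theta, top_words = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words):
    print("topic", k, words)
```

The two return values correspond to the bullets: `top_words` gives the keyword group for each topic, and `theta` gives the inferred topic mixture of each document.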
38. Value Goal Challenge Research
Structuring the Unstructured:
Investigative Element Extraction
• Recognition of entities including people, locations, buildings, and
organizations.
• Recognition of times and dates.
• Construction of a near-real-time analysis pipeline for entity
association
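As a rough illustration of the three steps above, the sketch below uses regular expressions as a stand-in for a real named-entity recognizer. The patterns, the function names `extract` and `associate`, and the sample sentences (with made-up organization names) are all assumptions; a production pipeline would use a trained recognizer rather than these heuristics.

```python
import re
from collections import Counter
from itertools import combinations

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Dates written like "March 5, 2013" or "2012-03-05".
DATE_RE = re.compile(
    rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b|\b\d{{4}}-\d{{2}}-\d{{2}}\b"
)
# Candidate entities: runs of two or more capitalized words (a crude heuristic).
ENTITY_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def extract(sentence):
    """Return (entities, dates) found in one sentence."""
    dates = DATE_RE.findall(sentence)
    without_dates = DATE_RE.sub(" ", sentence)  # keep month names out of the entity list
    entities = ENTITY_RE.findall(without_dates)
    return entities, dates


def associate(sentences):
    """Count how often two entities are mentioned in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        entities, _ = extract(sentence)
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs


# Invented sample sentences with hypothetical organization names.
sentences = [
    "Acme Labs and Example Corp filed jointly on 2012-03-05.",
    "Example Corp licensed the patent to Acme Labs on March 5, 2013.",
]
print(extract(sentences[0]))
print(associate(sentences))
```

Sentence-level co-occurrence is only one possible notion of entity association; it is used here because it is the simplest to show.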
43. Value Reality Challenge Research
Structuring the Unstructured:
Event Structuring
Events: Meaningful occurrences in space and time
Motivating Event
Particular Topic Stream
Narrative: a series of clustered (event-based) stories,
temporally linked based on content similarity.
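One way to read the narrative definition above is as a chaining procedure: walk through events in time order and attach each one to the most similar earlier event, starting a new narrative when nothing is similar enough. The sketch below follows that reading. The cosine measure over word counts, the threshold of 0.3, and the toy events are assumptions, not details taken from the talk.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_narratives(events, threshold=0.3):
    """events: list of (time, text). Returns narratives as lists of event indices."""
    order = sorted(range(len(events)), key=lambda i: events[i][0])
    vectors = {i: Counter(events[i][1].lower().split()) for i in order}
    narrative_of = {}
    narratives = []
    for pos, i in enumerate(order):
        best, best_sim = None, threshold
        for j in order[:pos]:  # only earlier events can be predecessors
            sim = cosine(vectors[i], vectors[j])
            if sim >= best_sim:
                best, best_sim = j, sim
        if best is None:
            narrative_of[i] = len(narratives)
            narratives.append([i])
        else:
            narrative_of[i] = narrative_of[best]
            narratives[narrative_of[i]].append(i)
    return narratives


# Invented toy events: (year, keywords).
events = [
    (2006, "touch screen handset display"),
    (2007, "touch screen handset software storage"),
    (2007, "antenna signal receiver"),
    (2008, "antenna signal amplifier"),
]
print(build_narratives(events))
```

On the toy input the first two events form one narrative and the last two form another, because they share most of their keywords.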
48. Value Reality Challenge Research Results
Can we spot an emerging new technology?
Data:
50,000 telecommunication patents from the past 10 years
Abstract text and patent meta-information
1.5 GB of raw patent documents
Methods: Topic modeling and visualization
Results:
We can see a significant change in the topic of “software
and storage” in communication around 2007
(corresponding to the Apple iPhone?)
49. Value Reality Challenge Research Results
Can we spot an emerging new technology?
**W. Dou et al., HierarchicalTopics: Visually Exploring Large Text Collections Using Topic Hierarchies, IEEE VAST 2013
50. Value Reality Challenge Research Results
Can we spot an emerging new technology?
Model:
• 100 topics
• Each topic a distribution on words
• Each abstract a combination of topics
Note: Width of the graph is proportional to the number of
patents and the number of words from a particular topic
(topic signal strength).
Number of class 455 patents grew from 2234 in 2005 to 7647 in 2012
**W. Dou et al., HierarchicalTopics: Visually Exploring Large Text Collections Using Topic Hierarchies, IEEE VAST 2013
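The note on topic signal strength can be sketched as a small aggregation. The version below assumes that each abstract contributes its topic proportion multiplied by its word count to the stream width for its year. That weighting is one plausible reading of the note, not a formula given in the talk, and the function name and input rows are invented.

```python
from collections import defaultdict


def topic_strength_by_year(abstracts, num_topics):
    """abstracts: iterable of (year, word_count, topic_mixture).

    Returns {year: [strength of topic 0, strength of topic 1, ...]}.
    """
    strength = defaultdict(lambda: [0.0] * num_topics)
    for year, word_count, mixture in abstracts:
        for k in range(num_topics):
            # Expected number of words in this abstract drawn from topic k.
            strength[year][k] += mixture[k] * word_count
    return dict(strength)


# Invented rows: (filing year, abstract length in words, topic mixture).
abstracts = [
    (2006, 100, [0.75, 0.25]),
    (2007, 100, [0.50, 0.50]),
    (2007, 200, [0.25, 0.75]),
]
widths = topic_strength_by_year(abstracts, num_topics=2)
for year in sorted(widths):
    print(year, widths[year])
```

Under this reading, a topic's stream widens in a year either because more patents mention it or because those patents devote more of their text to it.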
57. Value Reality Challenge Research Results
Can we spot novelty within an existing patent?
Data
Initially: A random sample of 40 patents in several
classes with focus on 455 (telecom).
Recently: Confirmed through automated analysis of
several subclasses of 455.
Method: Compare words in claims with words in class plus
subclass definition
Results: Large symmetric differences
Words(Claims) ÷ Words(Definition)
Words(Abstracts) ÷ Words(Definition)
58. Value Reality Challenge Research Results
Example
http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&p=1&u=%2Fnetahtml%2Fsearch-bool.html&r=2&f=G&l=50&d=pall&s1=449%2F8.CCLS.&OS=CCL/449/8&RS=CCL/449/8
Patent Title: Process for rearing bumblebee queens and process for rearing bumblebees
Main Classification: 449/1; 449/2; 449/8
Class 449 – Bee Culture / Subclass 1
Class 449 – Bee Culture / Subclass 8
59. Value Reality Challenge Research Results
We claim:
1. A process for rearing bumblebee queens (genus Bombus) comprising generating a colony with workers in the presence of fertilized eggs and/or larvae from at least one colony, in a room with a controlled climate provided with food, and allowing the colony to grow until bumblebee queens are produced, wherein subadult and/or adult workers that originate from at least one different colony are brought together with said fertilized eggs and/or larvae.
2. The process according to claim 1, wherein the workers that originate from said at least one different colony are brought together with a young colony in the eusocial phase, consisting of a fertilized queen, brood and the first born workers.
3. The process according to claim 1, wherein more than 100 workers are brought together.
4. The process according to claim 1, wherein rearing is carried out using a workers:fertilized eggs ratio of 0.5-4.
5. The process according to claim 1, wherein the workers originating from said at least one different colony are first kept in a room without any queen and without brood for one day.
6. The process according to claim 1, wherein brood and workers from different bumblebee species are brought together.
7. A process for rearing bumblebees (genus Bombus), comprising rearing bumblebee queens by generating a colony with workers in the presence of fertilized eggs and/or larvae from at least one colony, in a room with a controlled climate provided with food, and allowing the colony to grow, wherein subadult and/or adult workers that originate from at least one different colony are brought together with said fertilized eggs and/or larvae, and using said bumblebee queens for rearing bumblebees.
61. Value Reality Challenge Research Results
Subclass Nesting
Class 449
1 -> Class Definition
8 -> 7 -> 3 -> Class Definition
Class Name: Bee Culture
Class Definition: This class includes the methods of and structures for propagating, raising and caring for bees; as well as certain ancillary methods and structures.
62. Value Reality Challenge Research Results
Class 449 Subclass 1
Subclass Name: Method
Subclass Definition: This subclass is indented under the class definition. Process.
63. Value Reality Challenge Research Results
Class 449 Subclass 8
Subclass Name: Queen Raising
Subclass Definition: This subclass is indented under subclass 7. Structure with provision to encourage and care for the production of a bee larvae into a queen bee.
64. Value Reality Challenge Research Results
Words in class / subclass definitions found in patent claim
Word        Count
method      0
colony      11
process     7
culture     0
queen       6
propagate   0
raise       0
encourage   0
care        0
larvae      4
production  1
bee         7
multi       0
swarm       0
capture     0
house       0
hive        0
structure   0
65. Value Reality Challenge Research Results
Words in claim that were not in definitions
Word        Count
rearing     5
worker      10
egg         5
fertilize   6
climate     2
food        2
different   5
control     2
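The comparison illustrated by the bumblebee example can be sketched with ordinary set operations. The snippet below takes an excerpt of claim 1 and the subclass 8 definition, reduces each to a set of content words, and reports the shared words and the symmetric difference. The stopword list and the plural-stripping rule are simplifying assumptions, so this is not expected to reproduce the exact counts on the slides.

```python
import re

# Small illustrative stopword list (an assumption, not the one used in the study).
STOPWORDS = {"a", "an", "and", "or", "for", "in", "into", "of", "the", "to", "with"}


def normalize(word):
    """Very crude plural stripping: 'queens' -> 'queen', but 'process' is kept."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def content_words(text):
    words = re.findall(r"[a-z]+", text.lower())
    return {normalize(w) for w in words if w not in STOPWORDS}


def compare(claims, definition):
    c, d = content_words(claims), content_words(definition)
    return {
        "shared": c & d,
        "only_in_claims": c - d,
        "only_in_definition": d - c,
        "symmetric_difference": c ^ d,
    }


claim_excerpt = ("A process for rearing bumblebee queens comprising generating "
                 "a colony with workers in the presence of fertilized eggs "
                 "and/or larvae")
subclass_8 = ("Structure with provision to encourage and care for the "
              "production of a bee larvae into a queen bee.")

result = compare(claim_excerpt, subclass_8)
print("shared:", sorted(result["shared"]))
print("only in claims:", sorted(result["only_in_claims"]))
print("only in definition:", sorted(result["only_in_definition"]))
```

Even on this short excerpt only two words are shared, while sixteen appear on one side only, which is the kind of large symmetric difference the results describe.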
67. Value Reality Challenge Research Results
Can we spot novelty within an existing patent?
Observations
• Novelty is in words/relations that are not part of the definition (but appear in
the patent's claims or abstract)
• Some things can be left unsaid. Is there a boundary?
• Happens in all patents (but degree varies)
68. Value Reality Challenge Research Results
Can we spot novelty within an existing patent?
Next
• Opportunity to text mine these differences
– Are they random on a time scale?
– Would descriptions of emerging technologies emerge from these
patterns?
– Do combination patents have more of these?
75. Value Reality Challenge Research Results
Can we list “all” patents relevant for some technology?
– Data: Patents, Wikipedia
– Potential Data: Cell phone manuals or other descriptions
– Method: Text mining of patents in certain classes, text mining of filings
by certain market/technology players, text mining of other patents,
using Wikipedia and manuals as guidance on what to look for.
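One simple way to use an outside description as guidance is to weight its terms and score each patent by the guided terms it contains. The sketch below does this with a TF-IDF-style weight. The function `rank_patents`, the weighting formula, the guide sentence, and the three patent snippets are all illustrative assumptions rather than the method used in this work.

```python
import math
import re
from collections import Counter


def tokens(text):
    # Lowercase words longer than two letters; a crude filter for "a", "of", etc.
    return [w for w in re.findall(r"[a-z]+", text.lower()) if len(w) > 2]


def rank_patents(guide_text, patents):
    """patents: {patent_id: text}. Returns [(patent_id, score)] by descending score."""
    doc_tokens = {pid: set(tokens(text)) for pid, text in patents.items()}
    n = len(patents)
    weights = {}
    for term, tf in Counter(tokens(guide_text)).items():
        df = sum(term in toks for toks in doc_tokens.values())
        if df:
            # Frequent in the guide and rare across patents means a stronger hint.
            weights[term] = tf * math.log((1 + n) / df)
    scores = {
        pid: sum(weights.get(t, 0.0) for t in toks)
        for pid, toks in doc_tokens.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented guide text and patent snippets with hypothetical identifiers.
guide = "A touchscreen phone uses a touchscreen display and a battery"
patents = {
    "P1": "touchscreen display for a handheld phone",
    "P2": "method for rearing bumblebee queens",
    "P3": "battery charger for a phone",
}
ranking = rank_patents(guide, patents)
for pid, score in ranking:
    print(pid, round(score, 2))
```

A ranking like this does not answer what "all" relevant patents means. It only orders candidates, so a cutoff still has to be chosen by an analyst.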
81. Value Reality Challenge Research Results Scale
Scalable Computing Architecture for
Extracting Latent Topics and Events*
Distributed Data Storage and Pre-Processing Environment
MapReduce procedures for data-cleaning and pre-processing
A distributed storage solution (MongoDB) is used for data storage,
analysis and retrieval
**X. Wang et al., I-SI: Scalable Visual Analytics Architecture for Analyzing Latent Topical-Level Information From Social Media Data, Journal of Computer Graphics Forum, 2012
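The map and reduce roles in the pre-processing step can be sketched in a single process. The example below cleans raw documents in a map phase and builds a word-to-documents index in a reduce phase. It uses no Hadoop or MongoDB API, and the function names and sample documents are assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import chain


def map_clean(doc_id, raw_text):
    """Map phase: strip markup remnants, lowercase, emit (word, doc_id) pairs."""
    text = re.sub(r"<[^>]+>", " ", raw_text)
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, doc_id


def shuffle(pairs):
    """Group mapped values by key, as the framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(word, doc_ids):
    """Reduce phase: one sorted, de-duplicated posting list per word."""
    return word, sorted(set(doc_ids))


def run_job(documents):
    mapped = chain.from_iterable(map_clean(d, t) for d, t in documents.items())
    return dict(reduce_postings(w, ids) for w, ids in shuffle(mapped).items())


# Invented raw documents.
documents = {
    "d1": "<p>Wireless antenna</p>",
    "d2": "Antenna <b>array</b>",
}
index = run_job(documents)
print(index)
```

Because each map call depends only on its own document and each reduce call only on one key, both phases can be spread across machines, which is what makes the pattern attractive for a 1.5 GB corpus.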
82. Value Reality Challenge Research Results Scale
Scalable Computing Architecture for
Extracting Latent Topics and Events*
Distributed Data Storage and Pre-Processing Environment
MapReduce-based social media crawlers for Twitter, blogs and news articles:
Unstructured Contents: Textual Information, Image, Comments
Structured Contents: User Graph, Geo-tags, HashTag
**X. Wang et al., I-SI: Scalable Visual Analytics Architecture for Analyzing Latent Topical-Level Information From Social Media Data, Journal of Computer Graphics Forum, 2012
84. Value Reality Challenge Research Results Scale
Scalable Computing Architecture for
Extracting Latent Topics and Events*
Parallel Data Analytics Cluster
MPI-based Parallel-LDA implementation for Topic modeling with
Memory Sharing Optimization
**X. Wang et al., I-SI: Scalable Visual Analytics Architecture for Analyzing Latent Topical-Level Information From Social Media Data, Journal of Computer Graphics Forum, 2012
85. Value Reality Challenge Research Results Scale
Scalable Computing Architecture for
Extracting Latent Topics and Events*
Parallel Data Analytics Cluster
OpenNLP-based Parallel Implementation for Entity-Extraction
Customized PBS to schedule jobs for parallel computing environment
**X. Wang et al., I-SI: Scalable Visual Analytics Architecture for Analyzing Latent Topical-Level Information From Social Media Data, Journal of Computer Graphics Forum, 2012
87. Value Reality Challenge Research Results Scale
Resources we’d be happy to share
• Complete US patents and applications (until Q1 2013)
with a search engine (Lucene) interface
• Patent Classes
• Other text resources (Wikipedia, Wiktionary, etc.)
We’d be happy to prepare specialized extracts or
combinations for those who need them.
88. Value Reality Challenge Research Results Scale
Thank you!
Derek Xiaoyu Wang
xiaoyu.wang@uncc.edu