Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term Frequency and Proportional Document Frequency and Semantic Information

Khanh-Ly Nguyen1, Byung-Joo Shin2, Seong Joon Yoo3
Dept. of Computer Engineering
Sejong University
Seoul, Republic of Korea
khanhly4682@gmail.com1, {bjshin2, sjyoo3}@sejong.ac.kr
Abstract—This paper proposes a methodology for identifying
hot topics and tracking technology trends from the patent
domain. The methodology uses frequency information in
combination with the International Patent Classification (IPC) to
capture semantic information on word categorization, doing so in
a way that heretofore has not been employed for topic detection
and trend tracking. Term Frequency and Proportional Document
Frequency (TF*PDF) is employed as a means to detect hot topics
from patents, and IPCs are used to calculate semantic
importance of terms based on the IPCs where terms are
distributed. Aging Theory is also used to calculate the variation
of trends over time. Four types of trends including very stable
trends, stable trends, normal trends, and unstable trends are
defined and evaluated based on TF*PDF and TF*PDF combined
with Aging Theory. Experimental results show that, for very stable
trends, the combination of TF*PDF and Aging Theory achieves a
Precision of 0.976; for stable trends and all trends, TF*PDF
achieves a Precision of 0.959 and 0.84, respectively. By
applying TF*PDF in consideration of semantic information, we
also present a new criterion for weighting hot topics and tracking
technology trends.
Keywords—Technology forecast; Trend analysis; Patent
analysis; Topic detection; Hot term extraction
I. INTRODUCTION
The increasing number of patent applications and the
growing need to access patent information make patent
analysis vital in order to 1) analyze large amounts of
patent data, which is expensive when done by humans, 2) improve
the quality of the useful information generated, and 3) support
decision-making processes that eventually increase the quality
of patents. Patent intelligence is used to encourage the
development of innovative products, devise technology
strategies, and reveal legal/business insights amid the
technical transformation.
Various tools and techniques have been developed for use
in detecting trends and forecasting future developments from
news stories or patent documents. In the patent domain, these
techniques include keyword-based
approaches[1][2][3], Subject-Action-Object (SAO)-based
approaches[4][5], property-function-based
approaches[6][7][8], rule-based approaches[9][10][11][12],
and semantic analysis-based approaches[13][14][15].
Keyword-based approaches use frequencies and co-occurrences
among keywords; as a result, they fail to represent
relationships among technological concepts
and require expert knowledge to predefine
keywords. SAO-based approaches extract SAO structures,
which consist of a subject, action, and object and represent the
concepts of technology in a property/function format.
Usually, the SAO approaches are employed in TRIZ trend
analysis (TRIZ is a Russian acronym for the Theory of
Inventive Problem Solving). TRIZ trend analysis
analyzes and categorizes patents into known trend
phases, and the results have been used to identify the evolution
of technologies or to seek further improvements of a specific
product. However, the method depends on the expertise and
skills of TRIZ experts to identify specific trends and trend
phases manually, which may be expensive or infeasible.
For news stories collected from websites such as Google
News, Reuters, Yahoo, etc., news topics are detected by
techniques proposed under Topic Detection and Tracking
(TDT). TDT is intended to identify topics by exploring and
organizing the content of textual materials and grouping
pieces of information into manageable clusters, wherein
each cluster represents a single topic. References [16] and [17]
proposed methods for hot topic extraction based on TF*PDF
and an agglomerative clustering algorithm. Reference [18]
constructed a topic hierarchy by identifying burst periods
of features. Reference [19] detected topics by identifying
bursts of both aperiodic and periodic features.
Given the large number of news topics constantly being
created and updated, there is a concern about how to rank
those topics in terms of timeliness and importance. Generally,
topic ranking is determined by two factors: one is how
frequently and recently a topic is reported by websites; the
other is how much attention users pay to it[20]. The first factor
focuses on returning timely results, and the second
considers larger topics more important. Either factor involves
only one aspect of the ranking problem. Besides, other factors
must be taken into consideration, i.e., (1) every news story of a
topic contributes to its importance, while the contribution
decays along the timeline; and (2) topics that attract more
users' attention should be ranked higher.

3 Corresponding author: Seong Joon Yoo; E-mail address: sjyoo@sejong.ac.kr
978-1-4673-8796-5/16/$31.00 2016 IEEE BigComp 2016

Generally, current research on topic detection and trend
tracking is mainly based on timeline and frequency
features[21][22][23][24]. However, most of this research
does not represent the semantic relationships between terms.
Some studies[25][26][27][28] consider binary
relationships, such as relationships from patents and
relationships from a predefined trend database, which requires
the knowledge of domain experts to classify trends in advance.
In this paper, we propose a methodology for hot topic
detection and trend tracking using frequency information in
combination with the IPC to capture semantic information on
word categorization, which has not been employed in the state
of the art. To obtain semantically meaningful topics of interest, we
apply the TF*PDF algorithm, which allows for the generation
of hot topics over time. Moreover, we exploit the IPCs to
capture the importance of semantic information on word
categorization as presented in multiple IPCs, from which hot
topics are detected. Trends are identified as the normalized
weight of a topic over time. Four types of trends are defined:
very stable trends, stable trends, normal trends, and
unstable trends. Experimental results were compared with the
baseline TF*IDF for very stable trends, stable trends, and all
trends detected by the system.
II. RELATED WORKS
A. Patent Analysis
Because patents play an important role in intellectual
property protection, there has been a growing interest in
research into patent analysis, patent search, patent query
formulation[29], and trend identification[30][31] from patent
documents. Reference [29] extracted key terms for patent
query formation using semantic patterns and a keyword
dependency relation graph. Key terms are defined as
"Problem/Solution" and are extracted through semantic
patterns. Relationships among key terms are considered in a
term-weighting scheme and important terms are ranked on the
basis of weight. To identify emerging topics and transitions in
topics, many techniques using term frequency have been
applied. A timeline chart is created by textual analysis and the
evolutionary process of emerging technology is visualized as
an S-curve shape. In other research studies, the combination of
citation networks and text-mining techniques has improved
reliability in the detection of chronological changes, because
the co-citation networks provide closer connections among
patents. Timeline visualization of co-citation networks with
labels extracted by text-mining techniques has been used to
detect and trace emerging trends. Particularly, the co-citation
clusters have been used to build patent maps that could be
used to analyze the numbers of patents filed between company
competitors through the years and thereby to visualize the
evolutionary process of emerging technology, etc. Reference
[32] used SAO structures to generate patent maps for
identifying the technological competition trends. Semantic
similarity is measured on the basis of SAO-based semantic
similarities and a patent similarity matrix is constructed. The
output is visualized in the form of a dynamic patent map,
which is used to identify technological vacuums and
technological hotspots. Reference [33] identifies promising
patents for technology transfer and uses TRIZ evolution trends
to evaluate technologies in patents. A patent is considered to
be a high future value patent if it is relevant to future
important TRIZ trends. The patents are ranked based on the
similarity scores and are classified. However, the
classification of the TRIZ trends may not be applicable to all
technological domains, and revision of the classification by
domain experts with knowledge of TRIZ trends is required.
Moreover, [15] extracts
information related to properties and functions of a product by
identifying binary relationships in the form of "adjective +
noun" and "verb + noun". The Stanford dependency parser is
used to identify all binary relationships from titles and
abstracts in patents. A "reasons for jumps" rule base that
arranges trend-specific binary relationships for trend
identification is defined, whereupon the most likely trends and
trend phases are determined by measuring sentence semantic
similarity between the binary relationships from patents and
the binary relationships from the "reasons for jumps" rule base. If
two or more phases related to a trend are identified from a
patent, the currently developed logic of the trend map chooses
the more evolved one. The final output depicts the
evolution, which can be used as input for technology
forecasting based on TRIZ trends. Additionally, [31] proposed
Patent Trend Change Mining techniques as the means to
capture changes in patent trends through metadata analysis
without the need of specialist knowledge. The approach
includes a patent indicator calculator that determines the
patent values based on citation index, originality, generality,
and technology cycle time. Then, patent change trends are
determined by association rule mining to compute the
similarities and differences of patent trends between two
different times.
B. International Patent Classification
The International Patent Classification (IPC) is
administered by the World Intellectual Property Organization.
Patent documents which are relevant to a particular inventive
concept are organized through an examination process by the
examiner.
Each patent document is classified into IPCs based on the
technical field of the invention and can be assigned more
than one IPC code. Each IPC is divided into subclass, main
group, and subgroup. In this research we use the data under
the class H01M 04. Fig. 1 shows the hierarchy of H01M 04,
which contains patent documents belonging to the electrodes
category and includes six subgroups.
IPC code      Description
H01M 04       Electrodes (electrodes for electrolytic processes)
H01M 04/02    . Electrodes composed of, or comprising, active material
H01M 04/04    .. Processes of manufacture in general
H01M 04/06    .. Electrodes for primary cells
H01M 04/08    ... Processes of manufacture
H01M 04/10    ... of pressed electrodes with central core
H01M 04/12    .... of consumable metal or alloy electrodes (use of alloy compositions as active materials)
Fig. 1. The IPC H01M 04 hierarchy

The identification of technological trends is one of the
most important tasks in acquiring knowledge from patent
sources, both to quickly understand the latest
innovative technologies in high-tech industries and to acquire
technologies for future use. In the state of the art, however,
no research addresses the detection of patent
trends by utilizing the IPC. In our work we utilize IPC
information in a practical way to extract hot terms
from a patent-data corpus. Given that many technical trends
appear in patent documents, it is very difficult to identify
the major trends relative to each domain without IPC information.
For example, in the following sample sentence, "drain" and
"flash amperage" are ranked highly by general term
weightings such as TF*IDF, which are based on frequency information.
"A high DRAIN rate, primary alkaline cell comprising a
negative electrode...capable of providing a FLASH
AMPERAGE greater than an average of..."
However, by using IPC information we assign those terms
a higher weight based on the number of IPCs (e.g., H01M
04 or H01M 10) to which "drain" and "flash amperage" belong.
C. Hot Topic Detection and Trend Analysis
Automatic extraction of meaningful topics, which can help
detect topics of interest and facilitate the analysis of user
behavioral data, has been studied previously in various
contexts. Reference [18] used Latent Dirichlet Allocation
(LDA) to extract latent topics by modeling temporal trends on
Twitter over time. Reference [19] modeled topics from a text
corpus in order to determine whether a topic description is
well formed, through the use of LDA and a
selective Zipf distribution. Reference [34] proposed a
framework using probability inference for detecting
objectionable text content that has been shown to be harmful
to Web users. For a given sentence, the probability value,
which shows the likelihood of the sentence with respect to the
model, is calculated, and then a mapping function is used to
transform the probability value into a new indicator for
deciding the type of Web text content.
Reference [35] proposed a hierarchical topic extraction
algorithm based on topic grain computation. By considering
the distribution of word document frequency as a Gaussian
mixture, the topic grain is defined based on the mixture of
Gaussian parameters and feature words are selected for the
grain by employing an EM-like algorithm. A clustering
algorithm is used to generate a multiple-grain hierarchical
topic structure with different subtopic description. Reference
[36] incorporated topic transition in topic detection along with
tracking from Reuters and BBS websites. They employed a
topic representation based on the hidden Markov model and
applied fuzzy-kMeans clustering to find the most likely topic-
transition sequence. Reference [37] addressed the problem of
extracting significant words that are highly useful for
summarizing and presenting topics from a huge number of
news articles. To extract keywords, [37] used an unsupervised
keyword extraction technique called Table Term Frequency,
which includes several variants of the conventional TF-IDF
model and filtered keywords with cross-domain comparison.
Reference [20] introduced an automatic online news-topic
key-phrase extraction system. Topics are constructed and
updated online automatically with techniques to determine the
degree of burst of terms along with the aging theory. The
proposed system extracts keyword candidates from single
news stories, filters them with topic information, and then
combines them into phrase candidates using position
information. Finally, the phrases are ranked and the top ones
are selected as topic key phrases.
In TDT, a topic is defined as a seminal event or activity,
along with all the directly related events and activities. A hot
topic is defined as a topic that appears frequently over a period
of time[17]. The hotness of a topic depends on two factors:
how often hot terms appear in a document and the number of
documents that contain those terms. However, the hotness of
each topic evolves over a given period of time through the life
cycle of birth, growth, maturity, and death.
Some research studies have utilized the aging theory or
timeline analysis in TDT and hot-topic extraction[16][17];
topic hierarchy construction based on the identification of
burst periods of features[18]; topic sentence extraction along a
timeline given a query[38]; topic detection based on the
identification of both aperiodic and periodic features'
bursts[19]; finding top burst topics by identifying burst
words[30]; and so on. The previous approaches listed above
analyzed the characteristics of features from a fixed corpus on
the whole timeline. In order to identify topics in large sets of
documents, we have to determine the key terms that
sufficiently describe the topics. A term-weighting scheme is
used to capture important or representative terms that feature
in the content of a document, such as calculating the term
distribution level in a document or in a corpus[16][36][37].
The most common term-weighting scheme for processing
index terms is TF*IDF. Because the TF*IDF scheme
emphasizes the uniqueness of each term, it favors
terms that occur in only a few of the documents contained
in a corpus. For hot-topic extraction, however, terms that
appear in many of the documents in a corpus must be
identified. Therefore, a different term-weighting scheme,
Term Frequency and Proportional Document Frequency (TF*PDF),
assigns greater weights to terms that occur frequently in many
documents on many channels and lower weights to others, so
that important terms are not suppressed when they appear in
many documents. Although TF*PDF captures the basic concept of a
hot topic, its weakness is that it does not consider variations in
the popularity of a topic over time. Therefore, [17] combined
TF*PDF and aging theory to extract hot terms from a
data corpus in consideration of the term life cycle. Aging
theory is used to model a news-topic life span with the four
stages of birth, growth, decay, and death to reflect its
popularity over time. To track the life cycles of topics, they
used the concept of an energy function. The energy of an event
increases when the event becomes popular, and it decreases as
its popularity diminishes. Hence, the aging theory is suitable for
tracking the variations in the frequency of terms, which are
critical to success in hot-topic extraction.
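The contrast between the two document-frequency factors can be illustrated with a minimal sketch. It assumes the classic logarithmic form of IDF and the exponential form of the proportional document frequency; the function names and the corpus size of 100 documents are hypothetical and chosen only for illustration.

```python
import math


def idf(n_docs_with_term: int, n_docs: int) -> float:
    """Classic inverse document frequency: log(N / n)."""
    return math.log(n_docs / n_docs_with_term)


def pdf(n_docs_with_term: int, n_docs: int) -> float:
    """Proportional document frequency factor: exp(n / N)."""
    return math.exp(n_docs_with_term / n_docs)


# Hypothetical corpus of 100 documents: a rare, a common, and a ubiquitous term.
N_DOCS = 100
for n in (2, 50, 100):
    print(n, round(idf(n, N_DOCS), 3), round(pdf(n, N_DOCS), 3))
```

A term that occurs in every document receives an IDF of zero but the largest PDF factor, which is why a PDF-based weight suits pervasive hot terms.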
A technology life cycle is usually represented as an S-curve
containing the four stages of innovation, growth,
maturity, and decline. The innovation stage is when a
technology is born from a new technical method, or when
phrases appear in a small number of patents and slowly
increase. The growth stage is when a technology has been
recognized, thus gathering strength with the increasing
number of patents over time. The maturity stage is when the
recognition is high and stable with the rapid increase in the
number of patents. The decline stage is when activity in the
technology diminishes. We apply three functions from [17] in order to
calculate and update the energy of topics in every time slot:
getEnergy() calculates the nutrition that a topic receives from
a story; energyFunction() converts a topic's nutritional value
into an energy value; and energyDecay() carries out the energy
decrease in each time slot.
We have explored hot-topic detection from patent
documents by utilizing TF*PDF. Moreover, we seek to utilize
the importance of IPC, where the terms from a patent
document are categorized. In this research we apply the hot-
term detection algorithm (TF*PDF) and utilize IPC
information in order to identify hot topics in a patent data
corpus.
III. SYSTEM ARCHITECTURE
The overall procedure for hot-term detection and
technological trend identification from patent documents
consists of several steps, including data crawling, keyword
extraction, hot-term detection, and trend analysis as shown in
Fig. 2. First, raw patent data is collected from a U.S. patent
database and transformed into structured data. A POS tagger is
used for keyword extraction. Common stop words and
patent-specific stop words are removed.
Then, a candidate list of detected hot terms is compiled by the
TF*PDF algorithm and the aging theory. Finally, we analyze
patent trends based on the hot-term detection and evaluate the
results.
Fig. 2. System Architecture
A. Hot Term Detection
Two stop-word lists are used. One contains common stop
words and the other contains the words that are common in
patent documents and irrelevant to patent content.
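The two-list filtering step can be sketched as follows. The contents of both stop-word lists and the function name are hypothetical, since the actual lists are not given here, and POS tagging is omitted for brevity.

```python
# Hypothetical stop-word lists; the lists actually used are not specified.
COMMON_STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to"}
PATENT_STOP_WORDS = {"invention", "embodiment", "claim", "said", "wherein"}


def filter_terms(tokens):
    """Lower-case the tokens and drop any that appear in either list."""
    stop = COMMON_STOP_WORDS | PATENT_STOP_WORDS
    return [t.lower() for t in tokens if t.lower() not in stop]


print(filter_terms("A battery electrode of said invention".split()))
```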
In the hot-term extraction process, two characteristics of a
term are considered. One is the frequency of the term in the
patent collection (Definition 1) and the other is the term
variation over time (Definition 2). Term frequency is
measured by TF*PDF[17][39]. TF*PDF is considered more
suitable for topic detection than TF*IDF because the former
assigns greater weights to terms that occur frequently in many
documents. We adopt the TF*PDF scheme in this paper, and
the top m terms are chosen as final terms for trend analysis.
Phrases that do not include any of the final terms are
therefore excluded.
Definition 1 (TF*PDF). Given a term j in a document, the
TF*PDF weight of term j is calculated from [39] through the
following (1) and (2):

$$W_j = \sum_{c=1}^{|C|} \left| F_{jc} \right| \exp\left( \frac{n_{jc}}{N_c} \right) \tag{1}$$

where

$$\left| F_{jc} \right| = \frac{F_{jc}}{\sqrt{\sum_{k=1}^{K} F_{kc}^{2}}} \tag{2}$$

where $W_j$ is the TF*PDF value of term j, which is the
summation of term weights gained from each IPC c; $|C|$ is the
number of IPCs; $F_{jc}$ is the frequency of term j in the IPC c; K
is the total number of terms in the IPC c; $n_{jc}$ is the number of
documents that belong to the IPC c where term j occurs; and $N_c$ is
the total number of documents in the IPC c.
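A minimal sketch of the TF*PDF weight in Definition 1 is given below, with each IPC treated as one channel. The function name, the input layout (a mapping from IPC code to tokenized documents), and the toy corpus are assumptions made for illustration rather than the implementation used in the experiments.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_pdf(docs_by_ipc: Dict[str, List[List[str]]]) -> Dict[str, float]:
    """Return W_j = sum over IPCs c of |F_jc| * exp(n_jc / N_c)."""
    weights: Dict[str, float] = {}
    for docs in docs_by_ipc.values():
        if not docs:
            continue
        # F_jc: frequency of each term within this IPC.
        term_freq = Counter(tok for doc in docs for tok in doc)
        # n_jc: number of documents in this IPC that contain the term.
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        # Denominator of the normalized frequency |F_jc|.
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        n_docs = len(docs)  # N_c
        for term, freq in term_freq.items():
            w = (freq / norm) * math.exp(doc_freq[term] / n_docs)
            weights[term] = weights.get(term, 0.0) + w
    return weights


# Hypothetical toy corpus: two IPCs with tokenized abstracts.
corpus = {
    "H01M 04": [["electrode", "lithium"], ["electrode", "cathode"]],
    "H01M 10": [["electrode", "electrolyte"]],
}
weights = tf_pdf(corpus)
ranking = sorted(weights, key=weights.get, reverse=True)
print(ranking)
```

In this toy corpus "electrode" ranks first because it occurs in every document of both IPCs, so it accumulates the largest weight from each channel.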
Definition 2 (Term Life Cycle). Reference [17] defined a
term life cycle model in order to calculate the variation of each
term value from its cycle of birth, growth, decay, and death.
This step is suitable for tracking the variations in the
frequency of terms, which are critical to a successful hot-topic
extraction. We apply three functions from [17] to calculate the
energy of topics in each time slot, including getEnergy(),
energyFunction(), and getVariation().
getEnergy() calculates the energy that a term receives from
patents at a specific time slot. The energy $E_{t,s}$ of term t
measures the frequency of t appearing in a specified time slot s,
which is the accumulated value of term t from all patent IPCs,
as in (3). Therefore, hot terms are those that have high energy
in all IPCs.

$$E_{t,s} = \sum_{c \in C} X^2_{t,c} \tag{3}$$

where C is the set of IPCs, and $X^2_{t,c}$ is the association between
term t and the time slot s in IPC c, given by (4):

$$X^2 = \frac{(A+B+C+D)(AD-BC)^2}{(A+B)(C+D)(A+C)(B+D)} \tag{4}$$

For each term we calculate the contingency table, as
shown in Table I:
TABLE I. CONTINGENCY TABLE
          s    not s
t         A    B
not t     C    D
where A is the count of patents that contain term t in time
span s; B is the count of patents that contain term t in other
time spans; C is the count of patents that do not contain term t
in time span s; and D is the count of patents that do not contain
term t in other time spans.
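The chi-square association in (4) and the accumulated energy in (3) can be sketched as follows. The Python names are hypothetical analogues of getEnergy(), and returning 0 for a degenerate table (a zero row or column total) is an assumption, since that case is not discussed in the text.

```python
from typing import Dict, Tuple


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic of a 2x2 contingency table, as in (4)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Assumption: a table with an empty row or column carries no association.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def get_energy(tables_by_ipc: Dict[str, Tuple[int, int, int, int]]) -> float:
    """Energy E_{t,s}: sum of the chi-square values over all IPCs, as in (3)."""
    return sum(chi_square(*table) for table in tables_by_ipc.values())


# Hypothetical (A, B, C, D) counts for one term and one time slot in two IPCs.
tables = {"H01M 04": (10, 0, 0, 10), "H01M 10": (5, 5, 5, 5)}
print(get_energy(tables))
```

The first table shows perfect association between the term and the time slot, while the second shows none, so only the first contributes to the energy.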
energyFunction() converts a term energy value into a life
support value. The life support value $\mathit{lifeSupport}_{t,s}$ of t at time
slot s is calculated as the logarithm of the accumulated energy $E_{t,s}$,
as in (5).

$$\mathit{lifeSupport}_{t,s} = \ln(E_{t,s}) \tag{5}$$

getVariation() calculates the variation of the life support
values of term t over time in the patent collection, as in (6):

$$V_t = \frac{1}{N} \sum_{s \in I} \left( \mathit{lifeSupport}_{t,s} - \overline{\mathit{lifeSupport}_t} \right)^2 \tag{6}$$

where N is the number of time slots in the given interval I;
$\mathit{lifeSupport}_{t,s}$ is the life support value in each time slot;
$\overline{\mathit{lifeSupport}_t}$ is the average life support value; and $V_t$ is the
variation in the life support values of t during I.
The overall weight of term t is measured by combining the
TF*PDF and variation values, as in (7).

$$\mathit{weight}_t = \mathrm{TF{*}PDF} \times V_t \tag{7}$$

Finally, the terms in the candidate list are ranked by their
combined weight. The top-ranked k terms can be chosen
as hot terms that reflect the hot topics in the corpus.
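The life support, variation, and combined weight steps can be sketched as follows. The function names are hypothetical Python analogues of energyFunction() and getVariation(); the sketch assumes that every energy value is positive, so that the logarithm is defined, and that the combination in (7) is a product.

```python
import math
from typing import Sequence


def life_support(energy: float) -> float:
    """Life support value of (5): natural log of the accumulated energy.

    Assumes energy > 0; math.log raises ValueError otherwise.
    """
    return math.log(energy)


def get_variation(life_supports: Sequence[float]) -> float:
    """Variation of (6): mean squared deviation from the average value."""
    n = len(life_supports)
    mean = sum(life_supports) / n
    return sum((x - mean) ** 2 for x in life_supports) / n


def combined_weight(tf_pdf_weight: float, variation: float) -> float:
    """Overall weight of (7), assuming a multiplicative combination."""
    return tf_pdf_weight * variation


# Hypothetical energies of one term over four yearly time slots.
energies = [2.0, 8.0, 8.0, 2.0]
supports = [life_support(e) for e in energies]
print(combined_weight(1.5, get_variation(supports)))
```

Under the product assumption, a term whose life support values never change has zero variation and therefore zero combined weight, regardless of its TF*PDF value.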
B. Trend Detection
We divide the patent timeline into yearly time slots. In
each time slot s, a trend is represented by the normalized
weight of occurrences of term t from n documents.
IV. EXPERIMENTS
For hot topic detection, we present the results by
comparing topics detected by three algorithms: Chi-square,
TF*PDF, and the combination of TF*PDF and Aging
Theory, where IPC information is employed under TF*PDF.
For hot technology trend tracking, we define four types of
trends and eliminate insignificant ones. To evaluate the
significance of the trends we choose TF*IDF as the baseline,
since TF*IDF is the most common technique in the data mining
area.
A. Data Collection
The data collection contains 513 patent documents crawled
from the U.S. Patent and Trademark Office database. Patent
documents are selected from the domain of batteries, as
published from 1977 to the present. The documents belong to
IPC H01M 04, the electrodes category, which includes six
subgroups (H01M 04/02,
H01M 04/04, H01M 04/06, H01M 04/08, H01M 04/10, and
H01M 04/12). Fig. 3 shows a sample input patent document.
United States Patent 4,016,339 Gray et al. April 5, 1977
Abstract A battery electrode structure of flat configuration comprises a
cast mass of electrochemically active material, said mass having
contained therein and exposed opposite surfaces thereof an open-mesh
electrically conductive structure adapted for connection to a battery
terminal. An open-mesh electrically conductive support member in the
mass and in contact with the exposed electrically conductive structure
maintains electrical conductivity throughout discharge to ensure
maximum use of the active material.
Current International Class: H01M 04/06 (20060101); H01M
04/58 (20060101); H01M 06/34 (20060101); H01M 06/30 (20060101);
H01M 04/70 (20060101); H01M 004/02 ()
Fig. 3. Sample input patent document
B. Data Preparation
1) Data Extraction
A patent document from the USPTO contains sections
including "Title", "Abstract", "Claims", and "Description".
"Title" is too short to be suitable for our research.
"Claims" is insufficient because it may not contain as many
technology phrases as other sections, or its terms are
already included in "Abstract". "Description" is lengthy and includes
subfields such as "Field of the Invention", "Prior Art",
"Summary of the Invention", "Detailed Description of the
Preferred Embodiment", "Brief Description of the Drawings",
etc. "Description" contains meaningful terms for the
problem/solution extraction method, as shown by [29]. For
preliminary experiments we use only "Abstract", which is a
short, very precise summary of the invention.
2) IPC Extraction
From patent documents, we extract IPC information under
the "Current International Class" tag as shown in Fig. 3. The
IPC information is then extracted at the main-group level by using
a regular expression and stored in a list of IPCs, which is
later utilized in term weighting with the TF*PDF algorithm.
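The main-group extraction can be sketched with a regular expression as follows. The pattern, the function name, and the two-digit normalization of the main group (so that "004" becomes "04") are assumptions for illustration; the expression actually used is not given here.

```python
import re
from typing import List

# Assumed pattern: subclass (e.g. H01M), main group digits, "/", subgroup digits.
IPC_PATTERN = re.compile(r"\b([A-H]\d{2}[A-Z])\s*(\d{1,4})/(\d{2,6})")


def extract_main_groups(text: str) -> List[str]:
    """Return the distinct main groups (e.g. 'H01M 04') in order of appearance."""
    groups: List[str] = []
    for subclass, main, _subgroup in IPC_PATTERN.findall(text):
        code = f"{subclass} {int(main):02d}"
        if code not in groups:
            groups.append(code)
    return groups


sample = (
    "Current International Class: H01M 04/06 (20060101); H01M\n"
    "04/58 (20060101); H01M 06/34 (20060101); H01M 004/02 ()"
)
print(extract_main_groups(sample))
```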
3) Timestamps
To identify the technology life cycle, we split the whole
time span into one-year intervals. Patent documents were
crawled from 1976 to 2014, but no patent was filed in the
H01M 04 class in 2006. Therefore, the data cover a
total of 30 years, from 1976 to 2005.
C. Hot Topic Extraction
We compare hot terms extracted by the TF*PDF, Chi-
square, and the combined weight. This experiment validates
the effectiveness of each term-weighting methodology in hot-
term detection. Then we demonstrate how we can identify
genuine hot topics by ranking term based on its weight.

Tables II, III, and IV show the top 10 ranked hot topics obtained
by TF*PDF, Chi-square, and the combined weight,
respectively. The results show that the hot terms extracted by each
algorithm differ, since terms are weighted in
consideration of different factors such as pervasiveness,
topicality, or variation of the life cycle. Terms
detected by Chi-square do not appear frequently and cannot be
detected using frequency information. They are
significantly rare terms that would be suitable for detecting
innovative technology in the patent domain. In contrast,
TF*PDF and the combined weight mainly detect terms that are
pervasive and topical, that is, important terms that may be
considered the topics of patent documents.
TABLE II. TOP 10 HOT TOPICS DETECTED BY TF*PDF
1976 1980 1985 1988 1990 1994 2000 2003 2005
atmosphere atmosphere atmosphere resistance reactant areas increase liquid dry
refractory refractory shear layers accumulator coated increases ion free
telescopic conjugated tab retain makes coat fuel liquids spray
free axis reinforce retaining make uncoated iron shapes dryer
spray deterioration amalgamated strong end injecting ion wet inventive
disturbing size precipitation smooth severed nozzle polysaccharide floc condition
orous end establish web lugs passing dehydration disposition telescopic
electron wash force separating band fabrication period polytetrafluoroethylene disturbing
electronic reduce globular product negligible fabric corrosive dimensions flex
constant crystal acetate end permeates mounting pasting ions expectancy
TABLE III. TOP 10 HOT TOPICS DETECTED BY CHI-SQUARE
1976 1980 1985 1988 1990 1994 2000 2003 2005
web silver layer body grids hydrogen nickel electrode electrode
metal cathode graphite dy battery mprising cathode electro rod
emulsion athode battery powder lead comprising athode rod edge
catalytic group conductor electrode plastic rising hydroxide cell assembly
support active deposition lithium grooves copper electrochemical battery oil
mu vanadium metal electrolyte spaced lithium manganese mprising tab
fibers battery film portions space oil dioxide comprising coiled
heating connection material portion power foil improved rising ss
heat connect gel carbon lugs conductive mno end metallic
electrode addition deposit battery plates electrodes lithium single metal
TABLE IV. TOP 10 HOT TOPICS DETECTED BY THE COMBINED WEIGHT
1976 1980 1985 1988 1990 1994 2000 2003 2005
form material electrode material battery layer nickel electrode electrode
layer oxide electro powder lead hydrogen material cells electro
la active layer body grids mprising lithium cell rod
electrolyte cathode battery dy power comprising improved ce lithium
material athode material electrode electrode rising electrode el alloy
ia silver al electro electro material electrochemical battery electrolyte
battery battery ia lithium rod lithium chemical mprising anode
high vanadium er cathode layer copper electrolyte comprising electrochemical
anode anode graphite athode plastic battery cathode rising chemical
web lithium ph electrolyte high electrode athode plate battery
improved alloy includes Electro material electro ring material layer
D. Trend Tracking and Analysis
In our experiments we obtained a list of 3,211 topics. To
identify topics that could be candidates to represent technical
trends and to eliminate insignificant ones, we define four types
of trends by the following criteria,
where a decay value is a trend value of 0:
Very stable trends: trends that have no decay values over the
entire life cycle.
Stable trends: trends in which the number of decay values
is less than 3.
Normal trends: trends that have at least two continuous
time spans of three years or more, with the number of decay
values in the range of 3 to 7.
Unstable trends: trends that have only one or two
continuous time spans of at least three years, in which the
number of decay values is greater than seven.
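One possible reading of these criteria is sketched below. It assumes that a trend is a sequence of yearly values, that a decay is a year whose value is 0, and that a continuous time span is a maximal run of consecutive non-zero years; the function name and the "insignificant" label for trends matching none of the four types are hypothetical.

```python
from itertools import groupby
from typing import Sequence


def classify_trend(values: Sequence[float], min_run: int = 3) -> str:
    """Classify a yearly trend series into one of the four trend types."""
    decays = sum(1 for v in values if v == 0)
    # Lengths of maximal runs of consecutive non-zero years.
    runs = [
        len(list(group))
        for nonzero, group in groupby(values, key=lambda v: v != 0)
        if nonzero
    ]
    long_runs = sum(1 for r in runs if r >= min_run)
    if decays == 0:
        return "very stable"
    if decays < 3:
        return "stable"
    if 3 <= decays <= 7 and long_runs >= 2:
        return "normal"
    if decays > 7 and 1 <= long_runs <= 2:
        return "unstable"
    return "insignificant"


print(classify_trend([1, 1, 1, 0, 0, 0, 1, 1, 1]))
```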
1) Trends by Chi-square
Fig. 4 shows four very stable trends detected by the Chi-square
algorithm. As shown in Fig. 4, "ion" is the hottest
technology developed for secondary batteries from 1976 to
2005.
Fig. 4. Very-Stable Trends by Chi-square
All normal trends detected by Chi-square from 1976 to
2005 are shown in Table V.
TABLE V. NORMAL TRENDS BY CHI-SQUARE
hydrogen, portion, battery, include, anode, agent, high, electrodes,
electrochemical, electrolyte, mproved, gas, sheet, end, structure, electric,
surface, hydroxide, comprising, electrical, providing, current, salt, porous,
oxide, lithium, solution, substrate, including, alkaline, coat, cathode, mixing,
manufacturing, energy, active, alkali, density, face, plurality, powder, storage,
mixture, process, carbon, proper, ions, tab, contact, disc, compound, step,
method, area, chemical, matrix, part, ratio, cycle, article, composition,
charge, pen, igh, line, bind, fabric, improve, treat, car, polymer, dioxide,
sulfide, excellent, outer, voltage, element, forming, conductive, type,
improved, alloy, life, rate, nickel, solid, plate, mix, making, acid, orous,
athode
2) Trends by TF*PDF
The top 10 hottest very stable trends by TF*PDF are
shown in Fig. 5.
Fig. 5. Top 10 Very Stable Trends by TF*PDF
Fig. 6 shows the top 10 very stable trends detected by the
combined weight from 1976 to 2005. The term "method" is not a
technical term and is of little significance as a
technological trend. These trends decayed at least once but
fewer than three times and then grew continuously. "Material"
is the hottest trend, receiving high attention in
the electrode domain over the life cycle of 1976 to 2005. The
second very stable trend is "electrode". Although the trend of
"electrode" does not peak as high as "electro" or
"electrolyte", its values do not fall in any year, unlike
the others. The trend of "ba" falls nearly to the
bottom, although it reaches a very high peak one year before
decaying. However, two other trends, "electrolyte" and
"da", are growing; they are expected to be among the hottest
trends receiving attention in the patent domain in the next
few years. The term "method" also has a continuously high value
as a trend because it is a very frequent term used in almost all
patent documents, particularly to identify Problem and Solution
terms as in [29]. It will be excluded from our term list in
further experiments.
Fig. 6. Top 10 Very Stable Trends by combination of TF*PDF and Aging
Theory (1976~2005)
E. Evaluation
We evaluate the results for very stable trends, stable
trends, and all trends detected by the system. We compare
the TF*PDF algorithm and the combined weight with the
baseline. We do not show the evaluation for Chi-square
because Chi-square is not very effective in detecting trends
(Recall = 0.0851%).
As shown in Table 6, the combined weight algorithm
achieves the highest precision for very stable trends (0.976%);
however, its Recall is lower than that of the TF*PDF algorithm
by 0.064%. Overall, TF*PDF is more effective
than the combined weight in detecting trends. Thus, in the
patent domain, the hot trends extracted by means of
TF*PDF with IPC information are not affected by their
life cycles over time.
TABLE VI. PRECISION AND RECALL OF TREND DETECTION
Trend type          Algorithm        Precision  Recall
Very stable trends  TF*PDF           0.957%     0.936%
Very stable trends  Combined Weight  0.976%     0.872%
Stable trends       TF*PDF           0.959%     0.959%
All trends          TF*PDF           0.84%      0.601%
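As a reminder of how the two measures in Table 6 are defined, the following minimal sketch computes Precision and Recall for a set of detected trend terms against a baseline set. The function name and the example term sets are hypothetical and serve only to illustrate the arithmetic; they are not the data used in our experiments.

```python
def precision_recall(detected, baseline):
    """Return (precision, recall) of detected terms against baseline terms."""
    detected, baseline = set(detected), set(baseline)
    true_positives = len(detected & baseline)
    # Precision: share of detected terms that are in the baseline.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: share of baseline terms that were detected.
    recall = true_positives / len(baseline) if baseline else 0.0
    return precision, recall


# Hypothetical example: 3 of 4 detected terms are correct, 3 of 5 baseline terms are found.
example_detected = {"material", "electrode", "electrolyte", "method"}
example_baseline = {"material", "electrode", "electrolyte", "lithium", "cathode"}
example_precision, example_recall = precision_recall(example_detected, example_baseline)
```

In this illustration, Precision is 3/4 and Recall is 3/5.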
V. CONCLUSIONS
We have proposed a system for the extraction of hot topics
and the detection of hot trends from the patent domain within
a specific time period using TF*PDF and semantic
information on the IPCs in which terms are distributed. Our
work makes the following contributions:
• The automatic detection of hot technological topics
from the patent domain,
• The use of semantic information for hot-topic
detection, which has not heretofore been done in the
state of the art,
• The automatic tracking of hot trends from the patent
domain.
Our implementation is intended to detect hot topics and
track trends in terms of pervasiveness and topicality. We apply
the TF*PDF weighting algorithm to extract pervasive
terms. To determine a term’s topicality, we apply
Aging Theory to track the change in the term’s life cycle. The
combination of TF*PDF and Aging Theory has been shown to
improve the quality of hot-topic extraction in news
documents. However, our research with patent documents
shows that a term’s life cycle does not affect the topicality of
hot topics in the patent domain. This is because patent documents
contain technical terms distributed yearly or monthly, and
there are very rare terms that appear only a few times
in the entire corpus. In news documents, by contrast, terms
are distributed daily with a very high frequency of change, and
consequently variation plays an important role in the change
of a term’s life cycle.
By utilizing IPC information, in which documents are
manually classified into specific categories by patent experts,
we automatically detect technological trends from the patent
domain in a specific time period. The terms extracted from
those documents therefore belong to specific categories,
and the importance of each term is evaluated based on its
importance in specific IPCs. This allows a hot topic to be
identified based on its importance in each IPC as opposed to
being affected by the variation in its life cycle.
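The IPC-based weighting described above can be sketched with the TF*PDF formula of [39], in which a term's weight is the sum over channels of its normalized frequency in the channel multiplied by the exponential of the proportion of the channel's documents that contain it. The sketch below is illustrative only: treating each IPC as one channel is an assumption made for this example, and the function name, input layout, and IPC codes are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_pdf(docs):
    """Compute TF*PDF weights.

    docs: iterable of (channel, tokens) pairs, where channel is assumed to be
    the IPC of a patent and tokens is its list of terms.
    Returns a dict mapping each term to its weight.
    """
    term_freq = defaultdict(Counter)  # channel -> term -> occurrences
    doc_freq = defaultdict(Counter)   # channel -> term -> documents containing it
    n_docs = Counter()                # channel -> number of documents
    for channel, tokens in docs:
        n_docs[channel] += 1
        term_freq[channel].update(tokens)
        doc_freq[channel].update(set(tokens))

    weights = defaultdict(float)
    for channel, freqs in term_freq.items():
        # Normalize term frequencies by the Euclidean norm within the channel.
        norm = math.sqrt(sum(f * f for f in freqs.values()))
        for term, f in freqs.items():
            proportion = doc_freq[channel][term] / n_docs[channel]
            weights[term] += (f / norm) * math.exp(proportion)
    return dict(weights)
```

Because the exponential grows with the proportion of documents containing a term, a term that appears in most documents of several IPCs receives a higher weight than one concentrated in a few documents of a single IPC.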
For a huge number of patents with lengthy and difficult
technical terms, it is necessary to quickly identify the hottest
information about which technologies have received high
attention. The experiment results show that our approach
yields a sound methodology for hot-topic extraction and
technological trend detection from the patent domain. By
applying the hot-term detection algorithm using Term Frequency
and Proportional Document Frequency in consideration of IPC
information, we have shown an important new criterion for
weighting hot topics based on semantic categorization, which
has not previously been applied in the patent domain.
ACKNOWLEDGMENT
This research was supported by the MSIP (Ministry of
Science, ICT and Future Planning), Korea, under the Global IT
Talent support program (IITP-2014-H0905-14-1005) and the
Establishing IT Research Infrastructure Projects (I2221-14-1012)
supervised by the IITP (Institute for Information and
Communication Technology Promotion).
REFERENCES
[1] K. Borner, C. Chen, and K.W. Boyack, "Visualizing knowledge
domains," Annual Review of Information Science and Technology, vol.
37, pp. 179-255, 2003.
[2] Y. Ding, G.G. Chowdhury, and S. Foo, "Bibliometric cartography of
information retrieval research by using co-word analysis," Information
Processing and Management, vol. 37, no. 6, pp. 817-842, 2001.
[3] S. Lee, B. Yoon, and Y. Park, "An approach to discovering new
technology opportunities: keyword-based patent map approach,"
Technovation, vol. 29, no. 6-7, pp. 481-497, 2009.
[4] M. Moehrle, L. Walter, A. Geritz, and S. Muller, "Patent-based inventor
profiles as a basis for human resource decisions in research and
development," R&D Management, vol. 35, no. 5, pp. 513-524, 2005.
[5] J. Yoon and K. Kim, "Identifying rapidly evolving technological trends
for R&D planning using SAO-based semantic patent networks,"
Scientometrics, vol. 88, no. 1, pp. 313-331, 2013.
[6] S. Dewulf, "Directed variation: variation of properties for new or
improved function product DNA, a base for 'connect and develop',"
World Conference: TRIZ Future, 2006.
[7] D. Mann, Hands-on systematic innovation, Belgium: Creax press, 2002.
[8] J. Yoon and K. Kim, "An analysis of property-function based patent
networks for strategic R&D planning in fast-moving industries: The case
of silicon-based thin film solar cells," Expert Systems with Applications,
vol. 39, no. 9, pp. 7709-7717, 2012.
[9] M.L. Antonie and O.R. Zaiane, "Text document categorization by term
association," In Proceedings of the 2002 IEEE international conference
on data mining, 2002.
[10] X.Y. Chen, Y. Chen, L. Wang, and Y.F. Hu, "Text categorization based
on frequent patterns with term frequency," In Proceedings of 2004
international conference on machine learning and cybernetics, 2004.
[11] H. Han, E. Manavogulu, C. Giles, and H. Zha, "Rule-based word
clustering for text classification," In Proceedings of the 26th annual
international ACM SIGIR conference on research and development in
information retrieval, 2003.
[12] C. He and H.T. Loh, "Pattern-oriented associative rule-based patent
classification," Expert Systems with Applications, vol. 37, no. 3, 2010.
[13] I. Bergmann, D. Butzke, L. Walter, J.P. Fuerste, M.G. Moehrle, and V.A.
Erdmann, "Evaluating the risk of patent infringement by means of
semantic patent analysis: the case of DNA chips," R&D Management,
vol. 38, no. 5, pp.550-562, 2008.
[14] T. Magerman, B.V. Looy, and X. Song, "Exploring the feasibility and
accuracy of Latent Semantic Analysis based text mining techniques to
detect similarity between patent documents and scientific publications,"
Scientometrics, vol. 82, no. 2, pp. 289-306, 2010.
[15] J. Yoon and K. Kim, "An automated method for identifying TRIZ
evolution trends from patents," Expert Systems with Applications, vol.
38, no. 12, pp.15540-15548, 2011.
[16] C.C. Chen, Y.T. Chen, Y. Sun, and M.C. Chen, "Life cycle modeling of
news events using Aging Theory," In Proceedings of 14th European
Conference of Machine Learning, pp. 47-59, 2003.
[17] K.Y. Chen, L. Luesukprasert, and S.T. Chou, "Hot topic extraction
based on timeline analysis and multidimensional sentence modeling,"
IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 8,
2007.
[18] M.C. Yang and H.C. Rim, "Identifying interesting Twitter contents
using topical analysis," Expert Systems with Applications, vol. 41, no. 9,
pp. 4330-4336, 2014.
[19] J. Zeng, J. Duan, W. Cao, and C. Wu, "Topics modeling based on
selective Zipf distribution," Expert Systems with Applications, vol. 39,
no. 7, pp. 6541-6546, 2012.
[20] C. Wang, M. Zhang, L. Ru, and S. Ma, "An automatic online news topic
key phrase extraction system," IEEE/WIC/ACM International
Conference on Web Intelligence and Intelligent Agent Technology, 2008.
[21] Y. Chen, H. Amiri, Z. Li, and T. Chua, "Emerging topic detection for
organizations from microblogs," In Proceedings of the 36th international
ACM SIGIR conference on research and development in information
retrieval, pp. 43-52, 2013.
[22] L. Christiansen, T. Schimoler, R. Burke, and B. Mobasher, "Modeling
topic trends on the social web using temporal signatures," In
Proceedings of the twelfth international workshop on Web information
and data management, pp. 3-10, 2012.
[23] S. Lee, J. Lee, C. Park, and J. Lee, "Blog topic analysis using TF
smoothing and LDA," In Proceedings of the 7th International
Conference on Ubiquitous Information Management and
Communication, 2013.
[24] R. Long, H. Wang, Y. Chen, O. Jin, and Y. Yu, "Towards effective
event detection, tracking and summarization on Microblog data," Web-
Age Information Management (Lecture Notes in Computer Science),
6897, pp. 652-663, 2011.
[25] P. Erdi, K. Makovi, Z. Somogyvari, K. Strandburg, J. Tobochnik, P.
Volf, and L. Zalanyi, "Prediction of emerging technologies based on
analysis of the US patent citation network," Scientometrics, vol. 95, no.
1, pp. 225-242, 2013.
[26] C. Lee, B. Song, and Y. Park, "How to assess patent infringement risks:
a semantic patent claim analysis using dependency relationships,"
Technology Analysis & Strategic Management, vol. 25, no. 1, pp. 23-38,
2013.
[27] A.J. Trappey, C.V. Trappey, C. Wu, C.Y. Fan, and Y. Lin, "Intelligent
patent recommendation system for innovative design collaboration,"
Journal of Network and Computer Applications, vol. 36, no. 6, pp. 1441-
1450, 2013.
[28] J. Yoon and K. Kim, "TrendPerceptor: A property-function-based
technology intelligence system for identifying technological trends from
patents," Expert Systems with Applications, vol. 39, no. 3, pp. 2927-
2938, 2012.
[29] K.L. Nguyen and S.H. Myaeng, "Query enhancement for patent prior-art
search based on key-term dependency relationships and semantic tags,"
Lecture Notes in Computer Science, 7356, pp. 28-42, 2012.
[30] Y.G. Kim, J.H. Suh, and S.C. Park, "Visualization of patent analysis for
emerging technology," Expert Systems with Applications, vol. 34, no. 3,
pp. 1804-1812, 2008.
[31] M.J. Shih, D.R. Liu, and M.L. Hsu, "Discovering competitive
intelligence by mining changes in patent trends," Expert Systems with
Applications, vol. 37, no. 4, pp. 2882-2890, 2010.
[32] J. Yoon, H. Park, and K. Kim, "Identifying technological competition
trends for R&D planning using dynamic patent maps: SAO-based
content analysis," Scientometrics Journal, vol. 94, no. 1, pp. 313-331,
2013.
[33] H. Park, J.J. Ree, and K. Kim, "Identification of promising patents for
technology transfers using TRIZ evolution trends," Expert Systems with
[34] J. Duan and J. Zeng, "Web objectionable text content detection using
topic modeling technique," Expert Systems with Applications, vol. 40,
no. 15, pp. 6094-6104, 2013.
[35] J. Zeng, C. Wu, and W. Wang, "Multiple-grain hierarchical topic
extraction algorithm for text mining," Expert Systems with Applications,
vol. 37, no. 4, pp. 3202-3208, 2010.
[36] J.P. Zeng and S.Y. Zhang, "Incorporating topic transition in topic
detection and tracking algorithms," Expert Systems with Applications,
vol. 36, no. 1, pp. 227-232, 2009.
[37] S. Lee and H.J. Kim, "News Keyword Extraction for Topic Tracking,"
Networked Computing and Advanced Information Management NCM,
2008.
[38] S.Y. Chen, T.T. Tseng, H.E. Ke, and C.T. Sun, "Social trend tracking by
time series based social tagging clustering," Expert Systems with
[39] K.K. Bun and M. Ishizuka, "Topic Extraction from News Archive Using
TF*PDF Algorithm," In Proceedings of the 3rd International Conference
Web Information System Eng, pp. 73-82, 2002.