Hot Topic Detection and Technology Trend Tracking
for Patents utilizing Term Frequency and Proportional
Document Frequency and Semantic Information
Khanh-Ly Nguyen^1, Byung-Joo Shin^2, Seong Joon Yoo^3
Dept. of Computer Engineering
Sejong University
Seoul, Republic of Korea
khanhly4682@gmail.com^1, {bjshin^2, sjyoo^3}@sejong.ac.kr
Abstract—This paper proposes a methodology for identifying
hot topics and tracking technology trends from the patent
domain. The methodology uses frequency information in
combination with the International Patent Classification (IPC) to
capture semantic information on word categorization, doing so in
a way that heretofore has not been employed for topic detection
and trend tracking. Term Frequency and Proportional Document
Frequency (TF*PDF) is employed as a means to detect hot topics
from patents, and IPCs are used to calculate semantic
importance of terms based on the IPCs where terms are
distributed. Aging Theory is also used to calculate the variation
of trends over time. Four types of trends including very stable
trends, stable trends, normal trends, and unstable trends are
defined and evaluated based on TF*PDF and TF*PDF combined
with Aging Theory. Experiment results show that for very stable
trends, the combination of TF*PDF and Aging Theory achieves
a Precision of 0.976; for stable trends and all trends, TF*PDF
achieves a Precision of 0.959 and 0.84, respectively. By
applying TF*PDF in consideration of semantic information, we
also show a new criterion for weighting hot topics and
tracking technology trends.
Keywords—Technology forecast; Trend analysis; Patent
analysis; Topic detection; Hot term extraction
I. INTRODUCTION
The increasing number of patent applications and the
growing need to access patent information make patent
analysis vital in order to 1) analyze large amounts of patent
data, which is expensive when done by humans, 2) increase
the quality of the useful information generated, and 3) support
decision-making processes so as to eventually increase the
quality of the patents. Patent intelligence is used to encourage
the development of innovative products, devise technology
strategies, and reveal legal/business insights amid the
technical transformation.
Various tools and techniques have been developed for use
in detecting trends and forecasting future developments from
news stories or patent documents. In the patent domain, the
techniques include keyword-based
approaches[1][2][3], Subject-Action-Object (SAO)-based
approaches[4][5], property-function-based
approaches[6][7][8], rule-based approaches[9][10][11][12],
and semantic analysis-based approaches[13][14][15].
Keyword-based approaches use frequencies and
co-occurrences among keywords; as a result, they fail to
represent relationships among technological concepts
and require expert knowledge to predefine the
keywords. SAO-based approaches extract SAO structures,
which consist of a subject, action, and object and represent the
concepts of technology in the properties/functions format.
Usually, the SAO approaches are employed in TRIZ trend
analysis (TRIZ is a Russian acronym that means Theory of
Inventive Problem Solving). TRIZ trend analysis
analyzes and categorizes patents into known trend
phases, and the results have been used to identify the evolution
of technologies or to seek further improvements to a specific
product. However, the method depends on the expertise and
skills of TRIZ experts to identify specific trends and trend
phases manually, which may be expensive or infeasible.
For news stories collected from websites such as Google
News, Reuters, Yahoo, etc., news topics are detected by
techniques proposed under Topic Detection and Tracking
(TDT). TDT is intended to identify topics by exploring and
organizing the content of textual materials, making it possible
to group pieces of information into manageable clusters, wherein
each cluster represents a single topic. References [16] and [17]
proposed methods for hot topic extraction based on TF*PDF
and an agglomerative clustering algorithm. Reference [18]
constructed a topic hierarchy by identifying burst periods
of features. Reference [19] detected topics by identifying
the bursts of both aperiodic and periodic features.
Given the large amounts of news topics constantly being
created and updated, there is a concern about how to rank
those topics in terms of timeliness and importance. Generally,
topic ranking is determined by two factors; one is how
frequently and recently a topic is reported by websites; the
other is how much attention users pay to it[20]. The first rule
focuses on returning timely results and the second one
considers larger topics more important. Either rule involves
only one aspect of the ranking problem. Besides, other factors
must be taken into consideration, i.e., (1) every news story of a
topic contributes to its importance, while the contribution
decays along the timeline; and (2) topics that attract more
users' attention should be ranked higher.
^3 Corresponding author: Seong Joon Yoo; E-mail address: sjyoo@sejong.ac.kr
978-1-4673-8796-5/16/$31.00 ©2016 IEEE, BigComp 2016
Generally, current topic detection and trend tracking
research is mainly based on timeline and frequency
features[21][22][23][24]. However, most of this research
fails to represent the semantic relationships between terms.
Some studies[25][26][27][28] consider binary
relationships, such as relationships from patents and
relationships from a predefined trend database, which requires
the knowledge of domain experts to classify trends in advance.
In this paper, we propose a methodology for hot topic
detection and trend tracking using frequency information in
combination with the IPC to capture semantic information on
word categorization, which has not been employed in the state
of the art. To obtain semantically meaningful topics of interest, we
apply the TF*PDF algorithm, which allows for the generation
of hot topics over time. Moreover, we exploit the IPCs to
capture the importance of semantic information on word
categorization as presented in multiple IPCs, from which hot
topics are detected. Trends are identified as the normalized
weight of topic over time. Four types of trends are defined
including very stable trends, stable trends, normal trends and
unstable trends. Experimental results were compared with the
baseline TF*IDF for very stable trends, stable trends and all
trends detected by the system.
II. RELATED WORKS
A. Patent Analysis
Because patents play an important role in intellectual
property protection, there has been a growing interest in
research into patent analysis, patent search, patent query
formulation[29], and trend identification[30][31] from patent
documents. Reference [29] extracted key terms for patent
query formation using semantic patterns and a keyword
dependency relation graph. Key terms are defined as
"Problem/Solution" and are extracted through semantic
patterns. Relationships among key terms are considered in a
term-weighting scheme and important terms are ranked on the
basis of weight. To identify emerging topics and transitions in
topics, many techniques using term frequency have been
applied. A timeline chart is created by textual analysis and the
evolutionary process of emerging technology is visualized as
an S-curve shape. In other research studies, the combination of
citation networks and text-mining techniques has improved
reliability in the detection of chronological changes, since
co-citation networks provide closer connections among
patents. Timeline visualization of co-citation networks with
labels extracted by text-mining techniques has been used to
detect and trace emerging trends. Particularly, the co-citation
clusters have been used to build patent maps that could be
used to analyze the numbers of patents filed between company
competitors through the years and thereby to visualize the
evolutionary process of emerging technology, etc. Reference
[32] used SAO structures to generate patent maps for
identifying the technological competition trends. Semantic
similarity is measured on the basis of SAO-based semantic
similarities and a patent similarity matrix is constructed. The
output is visualized in the form of a dynamic patent map,
which is used to identify technological vacuums and
technological hotspots. Reference [33] identifies promising
patents for technology transfer and uses TRIZ evolution trends
to evaluate technologies in patents. A patent is considered to
be a high future value patent if it is relevant to future
important TRIZ trends. The patents are ranked based on the
similarity scores and are classified. However, the
disadvantages are that the classification of the TRIZ trends
may not be applicable to all technological domains and that
revision of the classification by domain experts with knowledge
of TRIZ trends is required. Moreover, [15] extracts
information related to properties and functions of a product by
identifying binary relationships in the form of "adjective +
noun" and "verb + noun". The Stanford dependency parser is
used to identify all binary relationships from titles and
abstracts in patents. A "reasons for jumps" rule base that
arranges trend-specific binary relationships for trend
identification is defined, whereupon the most likely trends and
trend phases are determined by measuring sentence semantic
similarity between the binary relationships from patents and
the binary relationships from the "reasons for jumps" rule base. If
two or more phases related to a trend are identified from a
patent, the currently developed logic of the trend map chooses
the more evolved one. The final output depicts the
evolution, which can be used as input for technology
forecasting based on TRIZ trends. Additionally, [31] proposed
Patent Trend Change Mining techniques as the means to
capture changes in patent trends through metadata analysis
without the need of specialist knowledge. The approach
includes a patent indicator calculator that determines the
patent values based on citation index, originality, generality,
and technology cycle time. Then, patent change trends are
determined by association rule mining to compute the
similarities and differences of patent trends between two
different times.
B. International Patent Classification
The International Patent Classification (IPC) is
administered by the World Intellectual Property Organization.
Patent documents which are relevant to a particular inventive
concept are organized through an examination process by the
examiner.
Each patent document is classified into IPCs based on the
technical field of the invention and can be assigned to more
than one IPC code. Each IPC is divided into subclass, main
group and subgroup. In this research we use the data under
the class H01M 04. Fig. 1 shows the hierarchy of H01M 04,
which contains patent documents belonging to the electrodes
category and includes six subgroups.
IPC code     Description
H01M 04      Electrodes (electrodes for electrolytic processes)
H01M 04/02   . Electrodes composed of, or comprising, active material
H01M 04/04   .. Processes of manufacture in general
H01M 04/06   .. Electrodes for primary cells
H01M 04/08   ... Processes of manufacture
H01M 04/10   ... of pressed electrodes with central core
H01M 04/12   .... of consumable metal or alloy electrodes (use of alloy compositions as active materials)
Fig. 1. The IPC H01M 04 hierarchy
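As an illustration of the hierarchy in Fig. 1, the following minimal sketch splits an IPC code written in the style used here (for example "H01M 04/06") into section, class, subclass, main group and subgroup. The names IpcCode and parse_ipc and the exact pattern are assumptions made for this example and are not part of the described system.

```python
import re
from typing import NamedTuple


class IpcCode(NamedTuple):
    section: str     # e.g. "H"
    ipc_class: str   # e.g. "01"
    subclass: str    # e.g. "M"
    main_group: str  # e.g. "04"
    subgroup: str    # e.g. "06", or "" when only the main group is given


# Hypothetical pattern for codes written as in Fig. 1,
# e.g. "H01M 04" or "H01M 04/06".
_IPC_RE = re.compile(r"^([A-H])(\d{2})([A-Z])\s+(\d{1,4})(?:/(\d{2,6}))?$")


def parse_ipc(code: str) -> IpcCode:
    """Split an IPC code string into its hierarchical parts."""
    match = _IPC_RE.match(code.strip())
    if match is None:
        raise ValueError(f"not an IPC code: {code!r}")
    section, ipc_class, subclass, main_group, subgroup = match.groups()
    # The optional subgroup is None when the code stops at the main group.
    return IpcCode(section, ipc_class, subclass, main_group, subgroup or "")
```

For instance, parse_ipc("H01M 04/06") yields section "H", class "01", subclass "M", main group "04" and subgroup "06".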
The identification of technological trends is one of the
most important tasks in acquiring knowledge from patent
sources so as to quickly understand the latest advanced,
innovative technologies in high-tech industries and to acquire
technologies for future use. In the state of the art, however,
no research has targeted the detection of patent
trends by utilizing IPC. In our work we utilize IPC
information in a practical way to support our target,
hot-term extraction within a patent-data corpus. Given the
assumption that many technical trends are described in
patent documents, it is very difficult to identify the major
trends relative to each domain without IPC information.
For example, in the following sample sentence, "drain" and
"flash amperage" are terms ranked highly by general term
weightings such as TF*IDF based on frequency information.
"A high DRAIN rate, primary alkaline cell comprising a
negative electrode...capable of providing a FLASH
AMPERAGE greater than an average of..."
However, by using IPC information we rank those terms
with higher weight based on the number of IPCs (e.g. H01M
04 or H01M 10) to which "drain" and "flash amperage"
belong.
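The idea of counting the IPCs to which a term belongs can be sketched as follows. This is only an illustration under the assumption that each patent is available as a pair of (set of IPC codes, list of terms); the function name ipc_spread is hypothetical.

```python
from collections import defaultdict


def ipc_spread(docs):
    """Map each term to the number of distinct IPCs it appears in.

    docs: iterable of (ipc_codes, terms) pairs, one pair per patent
    (an assumed input layout for this sketch).
    """
    ipcs_by_term = defaultdict(set)
    for ipc_codes, terms in docs:
        for term in set(terms):
            # A term belongs to every IPC assigned to a patent containing it.
            ipcs_by_term[term].update(ipc_codes)
    return {term: len(ipcs) for term, ipcs in ipcs_by_term.items()}
```

A term such as "drain" that occurs in patents assigned to several IPCs would receive a larger count than a term confined to a single IPC.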
C. Hot Topic Detection and Trend Analysis
Automatic extraction of meaningful topics, which can help
detect topics of interest and facilitate the analysis of user
behavioral data, has been studied previously in various
contexts. Reference [18] used Latent Dirichlet Allocation
(LDA) to extract latent topics by modeling temporal trends on
Twitter over time. Reference [19] modeled topics from a text
corpus in order to determine whether a topic description is
well formed, through the use of LDA and a
selective Zipf distribution. Reference [34] proposed a
framework using probability inference for detecting
objectionable text content that has been shown to be harmful
to Web users. For a given sentence, the probability value,
which shows the likelihood of the sentence with respect to the
model, is calculated and then a mapping function is used to
transform the probability value into a new indicator for
making a decision about the type of Web text content.
Reference [35] proposed a hierarchical topic extraction
algorithm based on topic grain computation. By considering
the distribution of word document frequency as a Gaussian
mixture, the topic grain is defined based on the mixture of
Gaussian parameters and feature words are selected for the
grain by employing an EM-like algorithm. A clustering
algorithm is used to generate a multiple-grain hierarchical
topic structure with different subtopic description. Reference
[36] incorporated topic transition in topic detection along with
tracking from Reuters and BBS websites. They employed a
topic representation based on the hidden Markov model and
applied fuzzy-kMeans clustering to find the most likely topic-
transition sequence. Reference [37] addressed the problem of
extracting significant words that are highly useful for
summarizing and presenting topics from a huge number of
news articles. To extract keywords, [37] used an unsupervised
keyword extraction technique called Table Term Frequency,
which includes several variants of the conventional TF-IDF
model and filtered keywords with cross-domain comparison.
Reference [20] introduced an automatic online news-topic
key-phrase extraction system. Topics are constructed and
updated online automatically with techniques to determine the
degree of burst of terms along with the aging theory. The
proposed system extracts keyword candidates from single
news stories, filters them with topic information, and then
combines them into phrase candidates using position
information. Finally, the phrases are ranked and the top ones
are selected as topic key phrases.
In TDT, a topic is defined as a seminal event or activity,
along with all the directly related events and activities. A hot
topic is defined as a topic that appears frequently over a period
of time[17]. The hotness of a topic depends on two factors:
how often hot terms appear in a document and the number of
documents that contain those terms. However, the hotness of
each topic evolves over a given period of time through the life
cycle of birth, growth, maturity, and death.
Some research studies have utilized the aging theory or
timeline analysis in TDT and hot-topic extraction[16][17];
topic hierarchy construction based on the identification of
burst periods of features[18]; topic sentence extraction along a
timeline given a query[38]; topic detection based on the
identification of both aperiodic and periodic features'
bursts[19]; finding top burst topics by identifying burst
words[30]; and so on. The previous approaches listed above
analyzed the characteristics of features from a fixed corpus on
the whole timeline. In order to identify topics in large sets of
documents, we have to determine the key terms that
sufficiently describe the topics. A term-weighting scheme is
used to capture important or representative terms that feature
in the content of a document, such as calculating the term
distribution level in a document or in a corpus[16][36][37].
The most common term-weighting scheme for processing
index terms is TF-IDF. Because the TF-IDF scheme
emphasizes the importance or uniqueness of each term, it
favors terms that occur in only a few of the documents contained
in a corpus. For hot-topic extraction, however, terms that
appear in many of the documents in a corpus must be
identified. Therefore, a different term-weighting scheme, Term
Frequency and Proportional Document Frequency (TF*PDF),
assigns greater
weights to terms that occur frequently in many documents on
many channels and lower weights to others, to avoid the
collapse of important terms when they appear in many
documents. Although TF*PDF captures the basic concept of a
hot topic, its weakness is that it does not consider variations in
the popularity of a topic over time. Therefore, [17] combined
TF*PDF and aging theory for the extraction of hot terms from a
data corpus in consideration of the term life cycle. Aging
theory is used to model a news-topic life span with the four
stages of birth, growth, decay, and death to reflect its
popularity over time. To track the life cycles of topics, they
used the concept of an energy function. The energy of an event
increases when the event becomes popular, and it decreases as
its popularity declines. Hence, the aging theory is suitable for
tracking the variations in the frequency of terms, which are
critical to success in hot-topic extraction.
A technology lifecycle is usually referred to as an S-curve,
containing the four stages of innovation, growth,
maturity and decline. The innovation stage is when a
technology is born from a new technical method or when
phrases appear in a small number of patents and slowly
increase. The growth stage is when a technology has been
recognized, thus gathering strength with the increasing
number of patents over time. The maturity stage is when
recognition is high and stable with the rapid increase in the
number of patents. The decline stage is when the technology
declines. We apply three functions from [17] in order to
calculate and update the energy of topics in every time slot:
getEnergy() calculates the nutrition that a topic receives from
a story; energyFunction() converts a topic's nutritional value
into an energy value; energyDecay() carries out the energy
decrease in each time slot.
We have explored hot-topic detection from patent
documents by utilizing TF*PDF. Moreover, we seek to utilize
the importance of IPC, where the terms from a patent
document are categorized. In this research we apply the hot-
term detection algorithm (TF*PDF) and utilize IPC
information in order to identify hot topics in a patent data
corpus.
III. SYSTEM ARCHITECTURE
The overall procedure for hot-term detection and
technological trend identification from patent documents
consists of several steps, including data crawling, keyword
extraction, hot-term detection, and trend analysis as shown in
Fig. 2. First, raw patent data is collected from a U.S. patent
database and transformed into structured data. A POS tagger is
used for keyword extraction. All common stop words are
removed in combination with the list of patent stop words.
Then, a candidate list of detected hot terms is compiled by
the TF*PDF algorithm and the aging theory. Finally, we analyze
patent trends based on the hot-term detection and evaluate the
results.
Fig. 2. System Architecture
A. Hot Term Detection
Two stop-word lists are used. One contains common stop
words and the other contains the words that are common in
patent documents and irrelevant to patent content.
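A minimal sketch of filtering with the two stop-word lists is given below. The word lists shown are small hypothetical samples chosen for illustration; the actual lists used in the system are not reproduced here.

```python
# Hypothetical sample lists; the real lists are much longer.
COMMON_STOP_WORDS = {"a", "an", "the", "of", "and", "than"}
PATENT_STOP_WORDS = {"comprising", "said", "wherein", "invention"}


def filter_terms(tokens):
    """Lower-case tokens and drop those found in either stop-word list."""
    stop_words = COMMON_STOP_WORDS | PATENT_STOP_WORDS
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stop_words]
```

Applied to the tokens of "A high DRAIN rate comprising a negative electrode", this keeps "high", "drain", "rate", "negative" and "electrode".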
In the hot-term extraction process, two characteristics of a
term are considered. One is the frequency of the term in the
patent collection (Definition 1) and the other is the term
variation over time (Definition 2). Term frequency is
measured by TF*PDF[17][39]. TF*PDF is considered more
suitable for topic detection than TF*IDF because the former
assigns greater weights to terms that occur frequently in many
documents. We adopt the TF*PDF scheme in this paper and
the top m terms are chosen as final terms for trend analysis.
Phrases that do not include any term among the final terms are
therefore excluded.
Definition 1 (TF*PDF). Given a term j in a document, the
TF*PDF weight of term j is calculated from [39] through the
following (1) and (2):
$$ W_j = \sum_{c=1}^{|C|} |F_{jc}| \exp\left(\frac{n_{jc}}{N_c}\right) \qquad (1) $$
where
$$ |F_{jc}| = \frac{F_{jc}}{\sqrt{\sum_{k=1}^{K} F_{kc}^{2}}} \qquad (2) $$
Here $W_j$ is the TF*PDF value of term j, which is the
summation of term weights gained from each IPC c; $|C|$ is the
number of IPCs; $F_{jc}$ is the frequency of term j in the IPC c
and $|F_{jc}|$ is its normalized value; K
is the total number of terms in the IPC c; $n_{jc}$ is the number of
documents that belong to the IPC c where term j occurs; and $N_c$ is
the total number of documents in the IPC c.
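The computation in (1) and (2) can be sketched as follows, with each IPC playing the role of a channel. This is an illustrative sketch, not the system's implementation: the input layout (a dictionary from IPC code to a list of tokenized documents) and the function name tf_pdf are assumptions.

```python
import math
from collections import Counter, defaultdict


def tf_pdf(docs_by_ipc):
    """Compute W_j for every term following (1) and (2).

    docs_by_ipc: dict mapping an IPC code to a list of documents,
    where each document is a list of terms (assumed layout).
    """
    weights = defaultdict(float)
    for docs in docs_by_ipc.values():
        term_freq = Counter()  # F_jc: frequency of term j in IPC c
        doc_freq = Counter()   # n_jc: documents in IPC c containing term j
        for doc in docs:
            term_freq.update(doc)
            doc_freq.update(set(doc))
        # Denominator of (2): Euclidean norm of all term frequencies in c.
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        if norm == 0:
            continue  # no terms in this IPC, nothing to add
        n_docs = len(docs)  # N_c
        for term, freq in term_freq.items():
            weights[term] += (freq / norm) * math.exp(doc_freq[term] / n_docs)
    return dict(weights)
```

Because the exponent grows with the share of documents containing the term, a term spread across most documents of many IPCs accumulates a larger weight than a term concentrated in a few documents.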
Definition 2 (Term Life Cycle). Reference [17] defined a
term life cycle model in order to calculate the variation of each
term value from its cycle of birth, growth, decay, and death.
This step is suitable for tracking the variations in the
frequency of terms, which are critical to a successful hot-topic
extraction. We apply three functions from [17] to calculate the
energy of topics in each time slot, including getEnergy(),
energyFunction(), and getVariation().
getEnergy() calculates the energy that a term receives from
patents at a specific time slot. The energy $E_{t,s}$ of term t
measures the frequency of t appearing in a specified time slot s,
which is the accumulated value of term t from all patent IPCs,
as in (3). Therefore, hot terms are those that have high energy
in all IPCs.
$$ E_{t,s} = \sum_{c \in C} X^{2}_{t,c} \qquad (3) $$
where C is the set of IPCs and $X^{2}_{t,c}$ is the association between
term t and the time slot s in IPC c, given by (4):
$$ X^{2} = \frac{(A+B+C+D)(AD-BC)^{2}}{(A+B)(C+D)(A+C)(B+D)} \qquad (4) $$
For each term we calculate the contingency table, as
shown in Table I:
TABLE I. CONTINGENCY TABLE
          s    not s
t         A    B
not t     C    D
where A is the count of patents that contain term t in time
span s; B is the count of patents that contain term t in other
time spans; C is the count of patents that do not contain term t
in time span s; and D is the count of patents that do not contain
term t in other time spans.
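The chi-square association in (4) and the accumulation over IPCs in (3) can be sketched as below. The function names and the input layout (one contingency tuple (A, B, C, D) per IPC) are assumptions for illustration; returning 0.0 for a degenerate table with a zero denominator is also an assumption, since the text does not say how that case is handled.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 contingency table, as in (4)."""
    total = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # assumption: treat a degenerate table as no association
    return total * (a * d - b * c) ** 2 / denominator


def get_energy(tables_by_ipc):
    """Energy of a term in one time slot: sum of chi-square over IPCs, as in (3).

    tables_by_ipc: dict mapping an IPC code to its (A, B, C, D) counts.
    """
    return sum(chi_square(*table) for table in tables_by_ipc.values())
```

A term that appears only in the patents of slot s and in none of the others gives the largest statistic for a fixed number of patents, while a term distributed evenly across slots gives zero.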
energyFunction() converts a term energy value into a life
support value. The life support value $\mathit{lifeSupport}_{t,s}$ of t at time
slot s is calculated as the logarithm of the accumulated energy $E_{t,s}$,
as in (5).
$$ \mathit{lifeSupport}_{t,s} = \ln(E_{t,s}) \qquad (5) $$
getVariation() calculates the variation of the life support
values of term t over time in the patent collection, as in (6):
$$ V_t = \frac{1}{N} \sum_{s \in I} \left( \mathit{lifeSupport}_{t,s} - \overline{\mathit{lifeSupport}}_t \right)^{2} \qquad (6) $$
where N is the number of time slots in the given interval I;
$\mathit{lifeSupport}_{t,s}$ is the life support value in each time slot;
$\overline{\mathit{lifeSupport}}_t$ is the average life support value; and $V_t$ is the
variation in the life support values of t during I.
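A minimal sketch of the life support value in (5) and the variation in (6) follows. It assumes that the variation is the mean squared deviation of the life support values over the N time slots, and that a slot with zero energy is given a life support value of 0.0 because the logarithm is undefined there; both points are assumptions of this example.

```python
import math


def life_support(energy):
    """Life support value as the natural logarithm of the energy, as in (5)."""
    # Assumption: a slot with no energy gets 0.0 instead of log(0).
    return math.log(energy) if energy > 0 else 0.0


def get_variation(energies):
    """Mean squared deviation of life support values over time, as in (6).

    energies: list of E_{t,s} values for one term, one per time slot.
    """
    if not energies:
        return 0.0
    supports = [life_support(energy) for energy in energies]
    mean = sum(supports) / len(supports)
    return sum((value - mean) ** 2 for value in supports) / len(supports)
```

A term whose energy is constant over the interval has zero variation, whereas a term with bursts of energy has a large one.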
The overall weight of term t is measured by combining
TF*PDF and Variation value together, as in (7).
$$ \mathit{weight}_t = \mathit{TF{*}PDF} \times V_t \qquad (7) $$
Finally, the terms in the candidate list are ranked by their
combined weight. The top-ranked k terms can be chosen
as hot terms that reflect the hot topics in the corpus.
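The final ranking step can be sketched as follows. The sketch assumes that the combination in (7) is a product of the TF*PDF weight and the variation value; the function name and the tie-breaking by alphabetical order are likewise assumptions of this example.

```python
def rank_hot_terms(tf_pdf_weights, variations, k):
    """Return the k terms with the highest combined weight.

    Assumption: the combined weight is the product of the TF*PDF
    weight and the variation value of the term.
    """
    combined = {
        term: weight * variations.get(term, 0.0)
        for term, weight in tf_pdf_weights.items()
    }
    # Sort by descending weight; ties are broken alphabetically.
    return sorted(combined, key=lambda term: (-combined[term], term))[:k]
```

Under this assumption, a pervasive term with almost no variation over time (for example a generic word used in nearly every patent) is pushed down the ranking.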
B. Trend Detection
We divide the patent timeline into yearly time slots. In
each time slot s, a trend is represented by the normalized
weight of occurrences of term t from n documents.
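One possible reading of this representation is sketched below: for each yearly slot, the occurrences of the term are divided by the number of documents n in that slot. The normalization by n and the input layout are assumptions of this example, since the exact normalization is not spelled out in the text.

```python
def trend_series(docs_by_year, term):
    """Normalized yearly weight of a term.

    docs_by_year: dict mapping a year to a list of documents, each a
    list of terms (assumed layout). The weight of a slot is the number
    of occurrences of the term divided by the number of documents.
    """
    series = {}
    for year in sorted(docs_by_year):
        docs = docs_by_year[year]
        occurrences = sum(doc.count(term) for doc in docs)
        series[year] = occurrences / len(docs) if docs else 0.0
    return series
```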
IV. EXPERIMENTS
For hot topic detection, we present the results by
comparing topics detected by three algorithms including Chi-
square, TF*PDF, and the combination of TF*PDF and Aging
Theory where IPC information is employed under TF*PDF.
For hot technology trend tracking, we define four types of
trends and eliminate insignificant ones. To evaluate the
significance of the trends we choose TF*IDF as the baseline
since TF*IDF is the most common technique in the data mining
area.
A. Data Collection
The data collection contains 513 patent documents crawled
from the U.S. Patent and Trademark Office database. Patent
documents are selected from the domain of batteries, as
published from 1977 to the present. Fig. 3 shows a sample
input patent document from the IPC H01M 04, which contains
patent documents belonging to the electrodes category and
includes six subgroups (H01M 04/02, H01M 04/04, H01M 04/06,
H01M 04/08, H01M 04/10, and H01M 04/12).
United States Patent 4,016,339 Gray et al. April 5, 1977
Abstract A battery electrode structure of flat configuration comprises a
cast mass of electrochemically active material, said mass having
contained therein and exposed opposite surfaces thereof an open-mesh
electrically conductive structure adapted for connection to a battery
terminal. An open-mesh electrically conductive support member in the
mass and in contact with the exposed electrically conductive structure
maintains electrical conductivity throughout discharge to ensure
maximum use of the active material.
Current International Class: H01M 04/06 (20060101); H01M
04/58 (20060101); H01M 06/34 (20060101); H01M 06/30 (20060101);
H01M 04/70 (20060101); H01M 004/02 ()
Fig. 3. Sample input patent document
B. Data Preparation
1) Data Extraction
A patent document from USPTO contains sections
including "Title", "Abstract", "Claims", and "Description".
"Title" is too short to be suitable for our research.
"Claims" is insufficient because it may not contain as many
technology phrases as other sections, or those terms are
already included in "Abstract". "Description" is lengthy and
includes subfields such as "Field of the Invention", "Prior Art",
"Summary of the Invention", "Detailed Description of the
Preferred Embodiment", "Brief Description of the Drawings",
etc. "Description" contains meaningful terms for the
problem/solution extraction method, as shown by [29]. For
preliminary experiments we use only "Abstract", which is a
short, very precise summary of the invention.
2) IPC Extraction
From patent documents, we extract IPC information under
the "Current International Class" tag as shown in Fig. 3. The
IPC information is then extracted from Main Groups by using
a regular expression and stored in a list of IPCs, which is
later utilized in term weighting using the TF*PDF algorithm.
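The extraction step can be sketched with a regular expression as below. The pattern, the function name extract_main_groups, and the zero-padding of main groups to two digits (so that "H01M 004/02" and "H01M 04/06" map to the same main group) are assumptions of this example; the actual expression used in the system is not given.

```python
import re

# Hypothetical pattern: subclass (e.g. "H01M"), main group, "/", subgroup.
_IPC_ENTRY = re.compile(r"([A-H]\d{2}[A-Z])\s*(\d{1,4})/\d+")


def extract_main_groups(field_text):
    """Return distinct IPC main groups in order of first appearance.

    field_text: the text of the "Current International Class" field.
    """
    main_groups = []
    for subclass, group in _IPC_ENTRY.findall(field_text):
        # Assumption: normalize the main group to two digits.
        main_group = f"{subclass} {int(group):02d}"
        if main_group not in main_groups:
            main_groups.append(main_group)
    return main_groups
```

For the sample document in Fig. 3, this yields the main groups "H01M 04" and "H01M 06".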
3) Timestamps
To identify the technology lifecycle, we split the whole
time span into one-year intervals. Patent documents were
crawled from 1976 to 2014, but there is no patent filed in the
H01M 04 class in 2006. Therefore, the range of data includes a
total of 30 years from 1976 to 2005.
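Splitting the collection into one-year slots can be sketched as follows, assuming each patent carries a date string in the form shown in Fig. 3 (for example "April 5, 1977") and an English locale for month names. The function name and input layout are assumptions of this example.

```python
from collections import defaultdict
from datetime import datetime


def group_by_year(patents):
    """Group tokenized patents into yearly time slots.

    patents: iterable of (date_string, terms) pairs, with dates
    written like "April 5, 1977" (assumed layout).
    """
    slots = defaultdict(list)
    for date_string, terms in patents:
        year = datetime.strptime(date_string, "%B %d, %Y").year
        slots[year].append(terms)
    return dict(slots)
```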
C. Hot Topic Extraction
We compare hot terms extracted by the TF*PDF, Chi-
square, and the combined weight. This experiment validates
the effectiveness of each term-weighting methodology in
hot-term detection. Then we demonstrate how we can identify
genuine hot topics by ranking terms based on their weights.
Tables 2, 3, and 4 show the top 10 ranked hot topics by
using TF*PDF, Chi-square, and the combined weight
respectively. The results show that hot terms extracted by each
algorithm are different since terms are weighted in
consideration of different factors such as pervasiveness,
topicality, or variation of the life cycle. Terms
detected by Chi-square do not appear frequently and cannot be
detected using frequency information. Those terms are
markedly rare terms that would be suitable for detecting
innovative technology from the patent domain. In contrast,
TF*PDF and the combined weight mainly detect terms that are
pervasive and topical. TF*PDF and the combined-weight
methodology detect important terms that may be considered as
the topic of patent documents.
TABLE II. TOP 10 HOT TOPICS DETECTED BY TF*PDF
1976 1980 1985 1988 1990 1994 2000 2003 2005
atmosphere atmosphere atmosphere resistance reactant areas increase liquid dry
refractory refractory shear layers accumulator coated increases ion free
telescopic conjugated tab retain makes coat fuel liquids spray
free axis reinforce retaining make uncoated iron shapes dryer
spray deterioration amalgamated strong end injecting ion wet inventive
disturbing size precipitation smooth severed nozzle polysaccharide floc condition
orous end establish web lugs passing dehydration disposition telescopic
electron wash force separating band fabrication period polytetrafluoroethylene disturbing
electronic reduce globular product negligible fabric corrosive dimensions flex
constant crystal acetate end permeates mounting pasting ions expectancy
TABLE III. TOP 10 HOT TOPICS DETECTED BY CHI-SQUARE
1976 1980 1985 1988 1990 1994 2000 2003 2005
web silver layer body grids hydrogen nickel electrode electrode
metal cathode graphite dy battery mprising cathode electro rod
emulsion athode battery powder lead comprising athode rod edge
catalytic group conductor electrode plastic rising hydroxide cell assembly
support active deposition lithium grooves copper electrochemical battery oil
mu vanadium metal electrolyte spaced lithium manganese mprising tab
fibers battery film portions space oil dioxide comprising coiled
heating connection material portion power foil improved rising ss
heat connect gel carbon lugs conductive mno end metallic
electrode addition deposit battery plates electrodes lithium single metal
TABLE IV. TOP 10 HOT TOPICS DETECTED BY THE COMBINED WEIGHT
1976 1980 1985 1988 1990 1994 2000 2003 2005
form material electrode material battery layer nickel electrode electrode
layer oxide electro powder lead hydrogen material cells electro
la active layer body grids mprising lithium cell rod
electrolyte cathode battery dy power comprising improved ce lithium
material athode material electrode electrode rising electrode el alloy
ia silver al electro electro material electrochemical battery electrolyte
battery battery ia lithium rod lithium chemical mprising anode
high vanadium er cathode layer copper electrolyte comprising electrochemical
anode anode graphite athode plastic battery cathode rising chemical
web lithium ph electrolyte high electrode athode plate battery
improved alloy includes Electro material electro ring material layer
D. Trend Tracking and Analysis
In our experiments we obtained a list of 3,211 topics. To
identify topics that could be candidates to represent technical
trends and eliminate insignificant ones, we define four types
of trends that satisfy the following criteria,
where a decay value is a value of 0:
Very stable trends: trends that have no decay values for the
entire life cycle.
Stable trends: trends in which the number of decay values
is less than 3.
Normal trends: trends that have at least two continuous
time spans of three years or more, with the number of decay
values in the range 3 to 7.
Unstable trends: trends that have only one or two
continuous time spans of at least three years and in which the
number of decay values is greater than seven.
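The four criteria can be expressed as a small classifier. This sketch rests on several assumptions that go beyond the text: a decay value is taken to be a yearly slot whose weight is 0, a continuous time span is taken to be a run of consecutive non-zero slots, and any series matching none of the four definitions is labeled "insignificant". The function names are hypothetical.

```python
def continuous_spans(series, min_length=3):
    """Count runs of at least min_length consecutive non-zero slots."""
    runs, current = 0, 0
    for value in series:
        if value > 0:
            current += 1
        else:
            if current >= min_length:
                runs += 1
            current = 0
    if current >= min_length:
        runs += 1  # close a run that reaches the end of the series
    return runs


def classify_trend(series):
    """Assign a yearly weight series to one of the four trend types."""
    decays = sum(1 for value in series if value == 0)
    spans = continuous_spans(series)
    if decays == 0:
        return "very stable"
    if decays < 3:
        return "stable"
    if 3 <= decays <= 7 and spans >= 2:
        return "normal"
    if decays > 7 and 1 <= spans <= 2:
        return "unstable"
    return "insignificant"  # assumption: everything else is eliminated
```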
1) Trends by Chi-square
Fig. 4 shows four very stable trends detected by the Chi-square
algorithm. As shown in Fig. 4, "ion" is the hottest
technology developed for secondary batteries from 1976 to
2005.
Fig. 4. Very-Stable Trends by Chi-square
All normal trends detected by Chi-square from 1976 to
2005 are shown in Table 5.
TABLE V. NORMAL TRENDS BY CHI-SQUARE
hydrogen, portion, battery, include, anode, agent, high, electrodes,
electrochemical, electrolyte, mproved, gas, sheet, end, structure, electric,
surface, hydroxide, comprising, electrical, providing, current, salt, porous,
oxide, lithium, solution, substrate, including, alkaline, coat, cathode, mixing,
manufacturing, energy, active, alkali, density, face, plurality, powder, storage,
mixture, process, carbon, proper, ions, tab, contact, disc, compound, step,
method, area, chemical, matrix, part, ratio, cycle, article, composition,
charge, pen, igh, line, bind, fabric, improve, treat, car, polymer, dioxide,
sulfide, excellent, outer, voltage, element, forming, conductive, type,
improved, alloy, life, rate, nickel, solid, plate, mix, making, acid, orous,
athode
2) Trends by TF*PDF
The top 10 hottest very stable trends by TF*PDF are
shown in Fig. 5.
Fig. 5. Top 10 Very Stable Trends by TF*PDF
Fig. 6 shows the top 10 very stable trends detected by the
combined weight from 1976 to 2005. These trends decayed at
least once but fewer than three times and then grew
continuously. "Material" is the hottest trend, receiving high
attention in the electrode domain over the life cycle of 1976
to 2005. The second very stable trend is "electrode". Although
the trend of "electrode" does not have a higher peak than
"electro" or "electrolyte", its values do not fall from year to
year as those of the others do. The trend of "ba" falls nearly
to the bottom, although it has a very high peak one year before
decaying. However, two other trends, "electrolyte" and "da",
are growing. They are considered to be among the hottest trends
that will receive high attention in the patent domain in the
next few years. The term "method" also keeps a continuously
high value as a trend, because it is a very frequent term used
in almost all patent documents, particularly to identify
Problem and Solution terms as in [29]. It is not a technical
term and is of little significance in relation to a
technological trend, so in further experiments it would be
excluded from our term list.
Fig. 6. Top 10 Very Stable Trends by combination of TF*PDF and Aging
Theory (1976~2005)
E. Evaluation
We evaluate the results for very stable trends, stable
trends, and all trends detected by the system. We compare
the TF*PDF algorithm and the combined weight with the
baseline. We do not show the evaluation for Chi-square
because Chi-square is not very effective in detecting trends
(Recall = 0.0851%).
As shown in Table VI, the combined weight algorithm
achieves the highest Precision for very stable trends (0.976%);
however, its Recall is lower than that of the TF*PDF algorithm
by 0.064%. It is also shown that TF*PDF is more effective
than the combined weight in detecting trends overall. Thus, in
the patent domain, the hot trends extracted by means of
TF*PDF utilizing IPC information are not affected by their
life cycles over time.
TABLE VI. PRECISION AND RECALL OF TF*PDF AND THE COMBINED WEIGHT
Trend type          Algorithm        Precision  Recall
Very stable trends  TF*PDF           0.957%     0.936%
Very stable trends  Combined Weight  0.976%     0.872%
Stable trends       TF*PDF           0.959%     0.959%
Stable trends       Combined Weight  0.864%     0.776%
All trends          TF*PDF           0.84%      0.601%
All trends          Combined Weight  0.665%     0.612%
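For readers who want to reproduce this kind of evaluation, the sketch below computes Precision and Recall with the standard set-based definitions. It is an assumption that the figures in Table VI were obtained this way; the text does not describe how the reference set of trends was built, and the terms used in the example are purely illustrative, not the paper's data.

```python
from typing import Set, Tuple


def precision_recall(detected: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of detected trends against a reference set."""
    true_positives = len(detected & reference)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Illustrative input only: four detected terms, three reference terms.
detected_trends = {"lithium", "electrode", "method", "cathode"}
reference_trends = {"lithium", "electrode", "anode"}
precision, recall = precision_recall(detected_trends, reference_trends)
```

In this toy case two of the four detected terms are in the reference set, so Precision is 0.5, and two of the three reference terms are found, so Recall is 2/3.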
V. CONCLUSIONS
We have proposed a system for the extraction of hot topics
and the detection of hot trends from the patent domain within
a specific time period, using TF*PDF and semantic
information on the IPCs in which terms are distributed. Our
work makes the following contributions:
• The automatic detection of hot technological topics
from the patent domain,
• The use of semantic information for hot-topic
detection, which has not previously been done in the
state of the art,
• The automatic tracking of hot trends from the patent
domain.
Our implementation is intended to detect hot topics and
track trends in terms of pervasiveness and topicality. We apply
the TF*PDF weighting algorithm to extract pervasive terms.
To determine a term's topicality, we apply Aging Theory to
track the change in the term's life cycle. The combination of
TF*PDF and Aging Theory has been shown to improve the
quality of hot-topic extraction in news documents. However,
our research with patent documents shows that a term's life
cycle does not affect the topicality of hot topics in the
patent domain. This is because patent documents contain
technical terms distributed yearly or monthly, and there are
very rare terms that appear only a few times in the entire
corpus. Meanwhile, in news documents, terms are distributed
daily and change very frequently, so variation plays an
important role in the change of a term's life cycle.
By utilizing IPC information, in which documents are
manually classified into specific categories by patent experts,
we automatically detect technological trends from the patent
domain in a specific time period. The terms extracted from
those documents therefore belong to specific categories, and
the importance of each term is evaluated based on its
importance in specific IPCs. This allows a hot topic to be
identified based on its importance in each IPC rather than on
the variation in its life cycle.
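A minimal sketch of category-based term weighting is given below. It follows the TF*PDF formulation of Bun and Ishizuka [39], in which a term's weight is the sum over channels of its normalized frequency multiplied by the exponential of its proportional document frequency. Treating each IPC category as a channel is an assumption made here for illustration; the exact IPC-based semantic weighting used in our system may differ. The function name and the sample IPC codes and tokens are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_pdf(docs_by_ipc: Dict[str, List[List[str]]]) -> Dict[str, float]:
    """Compute TF*PDF weights, treating each IPC category as one channel.

    docs_by_ipc maps an IPC code to its documents, each a list of tokens.
    """
    weights: Counter = Counter()
    for docs in docs_by_ipc.values():
        if not docs:
            continue
        # F_jc: frequency of each term within the category.
        term_freq = Counter(tok for doc in docs for tok in doc)
        # n_jc: number of documents in the category containing the term.
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        if norm == 0:
            continue
        n_docs = len(docs)  # N_c: total documents in the category
        for term, freq in term_freq.items():
            weights[term] += (freq / norm) * math.exp(doc_freq[term] / n_docs)
    return dict(weights)


# Illustrative input only: two IPC categories with tokenized patent texts.
sample = {
    "H01M": [["lithium", "battery"], ["lithium", "anode"]],
    "H01G": [["battery", "capacitor"]],
}
sample_weights = tf_pdf(sample)
```

Because the weight is accumulated per category, a term that is frequent and widespread inside several categories scores higher than one confined to a single document, independently of how its frequency varies over time.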
For a huge number of patents with lengthy and difficult
technical terms, it is necessary to quickly identify which
technologies were invented with high attention. The experiment
results show that our approach yields an effective methodology
for hot-topic extraction and technological trend detection
from the patent domain. By applying the hot-term detection
algorithm using Term Frequency and Proportional Document
Frequency in consideration of IPC information, we have shown
an important new criterion for weighting hot topics based on
semantic categorization, which has not previously been applied
in the patent domain.
ACKNOWLEDGMENT
This research was supported by the MSIP (Ministry of
Science, ICT and Future Planning), Korea, under the Global IT
Talent support program (IITP-2014-H0905-14-1005) and the
Establishing IT Research Infrastructure Projects (I2221-14-1012)
supervised by the IITP (Institute for Information and
Communication Technology Promotion).
REFERENCES
[1] K. Borner, C. Chen, and K.W. Boyack, "Visualizing knowledge
domains," Annual Review of Information Science and Technology, vol.
37, pp. 179-255, 2003.
[2] Y. Ding, G.G. Chowdhury, and S. Foo, "Bibliometric cartography of
information retrieval research by using co-word analysis," Information
Processing and Management, vol. 37, no. 6, pp. 817-842, 2001.
[3] S. Lee, B. Yoon, and Y. Park, "An approach to discovering new
technology opportunities: keyword-based patent map approach,"
Technovation, vol. 29, no. 6-7, pp. 481-497, 2009.
[4] M. Moehrle, L. Walter, A. Geritz, and S. Muller, "Patent-based inventor
profiles as a basis for human resource decisions in research and
development," R&D Management, vol. 35, no. 5, pp. 513-524, 2005.
[5] J. Yoon and K. Kim, "Identifying rapidly evolving technological trends
for R&D planning using SAO-based semantic patent networks,"
Scientometrics, vol. 88, no. 1, pp. 313-331, 2013.
[6] S. Dewulf, "Directed variation: variation of properties for new or
improved function product DNA, a base for 'connect and develop',"
World Conference: TRIZ Future, 2006.
[7] D. Mann, Hands-on systematic innovation, Belgium: Creax press, 2002.
[8] J. Yoon and K. Kim, "An analysis of property-function based patent
networks for strategic R&D planning in fast-moving industries: The case
of silicon-based thin film solar cells," Expert Systems with Applications,
vol. 39, no. 9, pp. 7709-7717, 2012.
[9] M.L. Antonie and O.R. Zaiane, "Text document categorization by term
association," In Proceedings of the 2002 IEEE international conference
on data mining, 2002.
[10] X.Y. Chen, Y. Chen, L. Wang, and Y.F. Hu, "Text categorization based
on frequent patterns with term frequency," In Proceedings of 2004
international conference on machine learning and cybernetics, 2004.
[11] H. Han, E. Manavogulu, C. Giles, and H. Zha, "Rule-based word
clustering for text classification," In Proceedings of the 26th annual
international ACM SIGIR conference on research and development in
information retrieval, 2003.
[12] C. He and H.T. Loh, "Pattern-oriented associative rule-based patent
classification," Expert Systems with Applications, vol. 37, no. 3, 2010.
[13] I. Bergmann, D. Butzke, L. Walter, J.P. Fuerste, M.G. Moehrle, and V.A.
Erdmann, "Evaluating the risk of patent infringement by means of
semantic patent analysis: the case of DNA chips," R&D Management,
vol. 38, no. 5, pp. 550-562, 2008.
[14] T. Magerman, B.V. Looy, and X. Song, "Exploring the feasibility and
accuracy of Latent Semantic Analysis based text mining techniques to
detect similarity between patent documents and scientific publications,"
Scientometrics, vol. 82, no. 2, pp. 289-306, 2010.
[15] J. Yoon and K. Kim, "An automated method for identifying TRIZ
evolution trends from patents," Expert Systems with Applications, vol.
38, no. 12, pp. 15540-15548, 2011.
[16] C.C. Chen, Y.T. Chen, Y. Sun, and M.C. Chen, "Life cycle modeling of
news events using Aging Theory," In Proceedings of 14th European
Conference of Machine Learning, pp. 47-59, 2003.
[17] K.Y. Chen, L. Luesukprasert, and S.T. Chou, "Hot topic extraction
based on timeline analysis and multidimensional sentence modeling,"
IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 8,
2007.
[18] M.C. Yang and H.C. Rim, "Identifying interesting Twitter contents
using topical analysis," Expert Systems with Applications, vol. 41, no. 9,
pp. 4330-4336, 2014.
[19] J. Zeng, J. Duan, W. Cao, and C. Wu, "Topics modeling based on
selective Zipf distribution," Expert Systems with Applications, vol. 39,
no. 7, pp. 6541-6546, 2012.
[20] C. Wang, M. Zhang, L. Ru, and S. Ma, "An automatic online news topic
key phrase extraction system," IEEE/WIC/ACM International
Conference on Web Intelligence and Intelligent Agent Technology, 2008.
[21] Y. Chen, H. Amiri, Z. Li, and T. Chua, "Emerging topic detection for
organizations from microblogs," In Proceedings of the 36th international
ACM SIGIR conference on research and development in information
retrieval, pp. 43-52, 2013.
[22] L. Christiansen, T. Schimoler, R. Burke, and B. Mobasher, "Modeling
topic trends on the social web using temporal signatures," In
Proceedings of the twelfth international workshop on Web information
and data management, pp. 3-10, 2012.
[23] S. Lee, J. Lee, C. Park, and J. Lee, "Blog topic analysis using TF
smoothing and LDA," In Proceedings of the 7th International
Conference on Ubiquitous Information Management and
Communication, 2013.
[24] R. Long, H. Wang, Y. Chen, O. Jin, and Y. Yu, "Towards effective
event detection, tracking and summarization on Microblog data,"
Web-Age Information Management (Lecture Notes in Computer Science),
6897, pp. 652-663, 2011.
[25] P. Erdi, K. Makovi, Z. Somogyvari, K. Strandburg, J. Tobochnik, P.
Volf, and L. Zalanyi, "Prediction of emerging technologies based on
analysis of the US patent citation network," Scientometrics, vol. 95, no.
1, pp. 225-242, 2013.
[26] C. Lee, B. Song, and Y. Park, "How to assess patent infringement risks:
a semantic patent claim analysis using dependency relationships,"
Technology Analysis & Strategic Management, vol. 25, no. 1, pp. 23-38,
2013.
[27] A.J. Trappey, C.V. Trappey, C. Wu, C.Y. Fan, and Y. Lin, "Intelligent
patent recommendation system for innovative design collaboration,"
Journal of Network and Computer Applications, vol. 36, no. 6, pp. 1441-
1450, 2013.
[28] J. Yoon and K. Kim, "TrendPerceptor: A property-function-based
technology intelligence system for identifying technological trends from
patents," Expert Systems with Applications, vol. 39, no. 3, pp. 2927-
2938, 2012.
[29] K.L. Nguyen and S.H. Myaeng, "Query enhancement for patent prior-art
search based on key-term dependency relationships and semantic tags,"
Lecture Notes in Computer Science, 7356, pp. 28-42, 2012.
[30] Y.G. Kim, J.H. Suh, and S.C. Park, "Visualization of patent analysis for
emerging technology," Expert Systems with Applications, vol. 34, no. 3,
pp. 1804-1812, 2008.
[31] M.J. Shih, D.R. Liu, and M.L. Hsu, "Discovering competitive
intelligence by mining changes in patent trends," Expert Systems with
Applications, vol. 37, no. 4, pp. 2882-2890, 2010.
[32] J. Yoon, H. Park, and K. Kim, "Identifying technological competition
trends for R&D planning using dynamic patent maps: SAO-based
content analysis," Scientometrics Journal, vol. 94, no. 1, pp. 313-331,
2013.
[33] H. Park, J.J. Ree, and K. Kim, "Identification of promising patents for
technology transfers using TRIZ evolution trends," Expert Systems with
Applications, vol. 40, no. 2, pp. 736-743, 2013.
[34] J. Duan and J. Zeng, "Web objectionable text content detection using
topic modeling technique," Expert Systems with Applications, vol. 40,
no. 15, pp. 6094-6104, 2013.
[35] J. Zeng, C. Wu, and W. Wang, "Multiple-grain hierarchical topic
extraction algorithm for text mining," Expert Systems with Applications,
vol. 37, no. 4, pp. 3202-3208, 2010.
[36] J.P. Zeng and S.Y. Zhang, "Incorporating topic transition in topic
detection and tracking algorithms," Expert Systems with Applications,
vol. 36, no. 1, pp. 227-232, 2009.
[37] S. Lee and H.J. Kim, "News Keyword Extraction for Topic Tracking,"
Networked Computing and Advanced Information Management NCM,
2008.
[38] S.Y. Chen, T.T. Tseng, H.E. Ke, and C.T. Sun, "Social trend tracking by
time series based social tagging clustering," Expert Systems with
Applications, vol. 38, no. 10, pp. 12807-12817, 2011.
[39] K.K. Bun and M. Ishizuka, "Topic Extraction from News Archive Using
TF*PDF Algorithm," In Proceedings of the 3rd International Conference
Web Information System Eng, pp. 73-82, 2002.

  • 1. Hot Topic Detection and Technology Trend Tracking for Patents utilizing Term Frequency and Proportional Document Frequency and Semantic Information Khanh-Ly Nguyen1 , Byung-Joo Shin2 , Seong Joon Yoo3 Dept. of Computer Engineering Sejong University Seoul, Republic of Korea khanhly4682@gmail.com1 , {bjshin2 , sjyoo3 }@sejong.ac.kr Abstract—This paper proposes a methodology for identifying hot topics and tracking technology trends from the patent domain. The methodology uses frequency information in combination with the International Patent Classification (IPC) to capture semantic information on word categorization, doing so in a way that heretofore has not been employed for topic detection and trend tracking. Term Frequency and Proportional Document Frequency (TF*PDF) is employed as a means to detect hot topics from patents, and IPCs are used to calculate semantic importance of terms based on the IPCs where terms are distributed. Aging Theory is also used to calculate the variation of trends over time. Four types of trends including very stable trends, stable trends, normal trends, and unstable trends are defined and evaluated based on TF*PDF and TF*PDF combined with Aging Theory. Experiment results show that for very stable trends, the combination of TF*PDF and Aging Theory achieves 0.976% in Precision; for stable trends and all trends, TF*PDF achieves 0.959% and 0.84% in Precision, respectively. By applying TF*PDF in consideration of semantic information, we also show a new criteria for weighting hot topics and technology trend tracking. Keywords—Technology forecast; Trend analysis; Patent analysis; Topic detection; Hot term extraction I. 
INTRODUCTION
The increasing number of patent applications and the growing need to access patent information make patent analysis vital in order to 1) analyze large amounts of patent data, which is expensive when done by humans, 2) improve the quality of the useful information generated, and 3) support decision-making processes and eventually increase the quality of patents. Patent intelligence is used to encourage the development of innovative products, devise technology strategies, and reveal legal and business insights amid technical transformation. Various tools and techniques have been developed for detecting trends and forecasting future developments from news stories or patent documents. In the patent domain, the techniques include keyword-based approaches[1][2][3], Subject-Action-Object (SAO)-based approaches[4][5], property-function-based approaches[6][7][8], rule-based approaches[9][10][11][12], and semantic analysis-based approaches[13][14][15]. Keyword-based approaches use frequencies and co-occurrences among keywords; as a result, they do not represent relationships among technological concepts, and they require expert knowledge to predefine keywords. SAO-based approaches extract SAO structures, which consist of a subject, an action, and an object and represent the concepts of a technology in a properties/functions format. Usually, the SAO approaches are employed in TRIZ trend analysis (TRIZ is a Russian acronym that means Theory of Inventive Problem Solving). TRIZ trend analysis analyzes and categorizes patents into known trend phases, and the results have been used to identify the evolution of technologies or to seek further improvements to a specific product. However, the method depends on the expertise and skills of TRIZ experts to identify specific trends and trend phases manually, which may be expensive or infeasible.
For news stories collected from websites such as Google News, Reuters, and Yahoo, news topics are detected by techniques proposed under Topic Detection and Tracking (TDT). TDT is intended to identify topics by exploring and organizing the content of textual materials and by grouping pieces of information into manageable clusters, where each cluster represents a single topic. References [16][17] proposed methods for hot topic extraction based on TF*PDF and an agglomerative clustering algorithm. Reference [18] constructed a topic hierarchy by identifying burst periods of features. Reference [19] detected topics by identifying bursts of both aperiodic and periodic features. Given the large number of news topics constantly being created and updated, there is a concern about how to rank those topics in terms of timeliness and importance. Generally, topic ranking is determined by two factors: one is how frequently and recently a topic is reported by websites; the other is how much attention users pay to it[20]. The first factor focuses on returning timely results, and the second considers larger topics more important. Each factor covers only one aspect of the ranking problem. Besides, other factors must be taken into consideration, i.e., (1) every news story of a topic contributes to its importance, while the contribution decays along the timeline; and (2) topics that attract more users' attention should be ranked higher.
3 Corresponding author: Seong Joon Yoo; E-mail address: sjyoo@sejong.ac.kr
978-1-4673-8796-5/16/$31.00 2016 IEEE, BigComp 2016
  • 2. Generally, current topic detection and trend tracking research is mainly based on timeline and frequency features[21][22][23][24]. However, most of this research does not represent the semantic relationships between terms. Some studies[25][26][27][28] consider binary relationships, such as relationships from patents and relationships from a predefined trend database, which requires the knowledge of domain experts to classify trends in advance. In this paper, we propose a methodology for hot topic detection and trend tracking that uses frequency information in combination with the IPC to capture semantic information on word categorization, which has not been employed in the state of the art. To obtain semantically meaningful topics of interest, we apply the TF*PDF algorithm, which allows for the generation of hot topics over time. Moreover, we exploit the IPCs to capture the importance of semantic information on word categorization as presented in multiple IPCs, from which hot topics are detected. Trends are identified as the normalized weight of a topic over time. Four types of trends are defined: very stable trends, stable trends, normal trends, and unstable trends. Experimental results are compared with the baseline TF*IDF for very stable trends, stable trends, and all trends detected by the system.
II. RELATED WORKS
A. Patent Analysis
Because patents play an important role in intellectual property protection, there has been a growing interest in research into patent analysis, patent search, patent query formulation[29], and trend identification[30][31] from patent documents. Reference [29] extracted key terms for patent query formulation using semantic patterns and a keyword dependency relation graph. Key terms are defined as "Problem/Solution" and are extracted through semantic patterns.
Relationships among key terms are considered in a term-weighting scheme, and important terms are ranked on the basis of weight. To identify emerging topics and transitions in topics, many techniques using term frequency have been applied. A timeline chart is created by textual analysis, and the evolutionary process of an emerging technology is visualized as an S-curve. In other research studies, the combination of citation networks and text-mining techniques has improved reliability in the detection of chronological changes, for which co-citation networks provide closer connections among patents. Timeline visualization of co-citation networks with labels extracted by text-mining techniques has been used to detect and trace emerging trends. In particular, co-citation clusters have been used to build patent maps that can be used to analyze the numbers of patents filed by competing companies through the years and thereby to visualize the evolutionary process of an emerging technology. Reference [32] used SAO structures to generate patent maps for identifying technological competition trends. Semantic similarity is measured on the basis of SAO-based semantic similarities, and a patent similarity matrix is constructed. The output is visualized in the form of a dynamic patent map, which is used to identify technological vacuums and technological hotspots. Reference [33] identifies promising patents for technology transfer and uses TRIZ evolution trends to evaluate technologies in patents. A patent is considered to be a high-future-value patent if it is relevant to important future TRIZ trends. The patents are ranked based on the similarity scores and are classified. However, the disadvantages are that the classification of the TRIZ trends may not be applicable to all technological domains and that revision of the classification by domain experts with knowledge of TRIZ trends is required.
Moreover, [15] extracts information related to the properties and functions of a product by identifying binary relationships in the form of "adjective + noun" and "verb + noun". The Stanford dependency parser is used to identify all binary relationships from titles and abstracts in patents. A "reasons for jumps" rule base that arranges trend-specific binary relationships for trend identification is defined, whereupon the most likely trends and trend phases are determined by measuring sentence semantic similarity between the binary relationships from patents and the binary relationships from the "reasons for jumps" rule base. If two or more phases related to a trend are identified from a patent, the currently developed logic of the trend map chooses the more evolved one. The final output depicts the evolution, which can be used as input for technology forecasting based on TRIZ trends. Additionally, [31] proposed Patent Trend Change Mining techniques as a means to capture changes in patent trends through metadata analysis without the need for specialist knowledge. The approach includes a patent indicator calculator that determines patent values based on citation index, originality, generality, and technology cycle time. Then, patent change trends are determined by association rule mining to compute the similarities and differences of patent trends between two different times.
B. International Patent Classification
The International Patent Classification (IPC) is administered by the World Intellectual Property Organization. Patent documents that are relevant to a particular inventive concept are organized through an examination process by the examiner. Each patent document is classified into IPCs based on the technical field of the invention and can be assigned to more than one IPC code. Each IPC is divided into subclass, main group, and subgroup. In this research we use the data under the class H01M 04. Fig. 1 shows the hierarchy of H01M 04, which contains patent documents belonging to the electrodes category and includes six subgroups.
             Description
H01M 04      Electrodes (electrodes for electrolytic processes)
H01M 04/02   . Electrodes composed of, or comprising, active material
H01M 04/04   .. Processes of manufacture in general
H01M 04/06   .. Electrodes for primary cells
H01M 04/08   ... Processes of manufacture
H01M 04/10   ... of pressed electrodes with central core
H01M 04/12   .... of consumable metal or alloy electrodes (use of alloy compositions as active materials)
Fig. 1. The IPC H01M 04 hierarchy
  • 3. The identification of technological trends is one of the most important tasks in acquiring knowledge from patent sources so as to quickly understand the latest advanced, innovative technologies in high-tech industries and to acquire technologies for future use. In the state of the art, however, no research targets the detection of patent trends by utilizing the IPC. In our work we utilize IPC information in a practical way to support our target, hot-term extraction within a patent-data corpus. Given the assumption that many technical trends are invented in patent documents, it is very difficult to identify the major trends relative to each domain without IPC information. For example, in the following sample sentence, "drain" and "flash amperage" are highly ranked terms under general term weightings such as TF*IDF, which are based on frequency information.
"A high DRAIN rate, primary alkaline cell comprising a negative electrode...capable of providing a FLASH AMPERAGE greater than an average of..."
By using IPC information, however, we rank those terms with a higher weight based on the number of IPCs (e.g., H01M 04 or H01M 10) to which "drain" and "flash amperage" belong.
C. Hot Topic Detection and Trend Analysis
Automatic extraction of meaningful topics, which can help detect topics of interest and facilitate the analysis of user behavioral data, has been studied previously in various contexts. Reference [18] used Latent Dirichlet Allocation (LDA) to extract latent topics by modeling temporal trends on Twitter over time. Reference [19] modeled topics from a text corpus in order to determine whether a topic description is well formed, doing so through the use of LDA and a selective Zipf distribution. Reference [34] proposed a framework using probability inference for detecting objectionable text content that has been shown to be harmful to Web users.
For a given sentence, the probability value, which shows the likelihood of the sentence with respect to the model, is calculated, and then a mapping function is used to transform the probability value into a new indicator for deciding the type of Web text content. Reference [35] proposed a hierarchical topic extraction algorithm based on topic grain computation. By considering the distribution of word document frequency as a Gaussian mixture, the topic grain is defined based on the parameters of the Gaussian mixture, and feature words are selected for the grain by employing an EM-like algorithm. A clustering algorithm is used to generate a multiple-grain hierarchical topic structure with different subtopic descriptions. Reference [36] incorporated topic transition in topic detection along with tracking from Reuters and BBS websites. They employed a topic representation based on the hidden Markov model and applied fuzzy k-means clustering to find the most likely topic-transition sequence. Reference [37] addressed the problem of extracting significant words that are highly useful for summarizing and presenting topics from a huge number of news articles. To extract keywords, [37] used an unsupervised keyword extraction technique called Table Term Frequency, which includes several variants of the conventional TF-IDF model, and filtered keywords with cross-domain comparison. Reference [20] introduced an automatic online news-topic key-phrase extraction system. Topics are constructed and updated online automatically with techniques that determine the degree of burst of terms along with the aging theory. The proposed system extracts keyword candidates from single news stories, filters them with topic information, and then combines them into phrase candidates using position information. Finally, the phrases are ranked and the top ones are selected as topic key phrases.
In TDT, a topic is defined as a seminal event or activity, along with all the directly related events and activities. A hot topic is defined as a topic that appears frequently over a period of time[17]. The hotness of a topic depends on two factors: how often hot terms appear in a document and the number of documents that contain those terms. However, the hotness of each topic evolves over a given period of time through the life cycle of birth, growth, maturity, and death. Some research studies have utilized the aging theory or timeline analysis in TDT and hot-topic extraction[16][17]; topic hierarchy construction based on the identification of burst periods of features[18]; topic sentence extraction along a timeline given a query[38]; topic detection based on the identification of bursts of both aperiodic and periodic features[19]; finding top burst topics by identifying burst words[30]; and so on. The previous approaches listed above analyzed the characteristics of features from a fixed corpus over the whole timeline. In order to identify topics in large sets of documents, we have to determine the key terms that sufficiently describe the topics. A term-weighting scheme is used to capture important or representative terms that feature in the content of a document, for example by calculating the term distribution level in a document or in a corpus[16][36][37]. The most common term-weighting scheme for processing index terms is TF-IDF. Because the TF-IDF scheme emphasizes the importance or uniqueness of each term, it only identifies terms that occur in a few of the documents contained in a corpus. For hot-topic extraction, however, terms that appear in many of the documents in a corpus must be identified.
Therefore, a different term-weighting scheme, TF-Proportional Document Frequency (TF*PDF), assigns greater weights to terms that occur frequently in many documents on many channels and lower weights to others, to avoid the collapse of important terms when they appear in many documents. Although TF*PDF captures the basic concept of a hot topic, its weakness is that it does not consider variations in the popularity of a topic over time. Therefore, [17] combined TF*PDF and the aging theory to extract hot terms from a data corpus in consideration of the term life cycle. The aging theory is used to model a news-topic life span with the four stages of birth, growth, decay, and death to reflect its popularity over time. To track the life cycles of topics, they used the concept of an energy function. The energy of an event increases when the event becomes popular, and it decreases as its popularity declines. Hence, the aging theory is suitable for tracking the variations in the frequency of terms, which are critical to success in hot-topic extraction. A technology lifecycle is usually referred to as an S-curve, containing the four stages of innovation, growth, maturity, and decline.
  • 4. The innovation stage is when a technology is born from a new technical method or when phrases appear in a small number of patents and slowly increase. The growth stage is when a technology has been recognized, thus gathering strength with the increasing number of patents over time. The maturity stage is when the recognition is high and stable with the rapid increase in the number of patents. The decline stage is when the technology declines. We apply three functions from [17] in order to calculate and update the energy of topics in every time slot: getEnergy() calculates the nutrition that a topic receives from a story; energyFunction() converts a topic's nutritional value into an energy value; energyDecay() carries out the energy decrease in each time slot. We have explored hot-topic detection from patent documents by utilizing TF*PDF. Moreover, we seek to utilize the importance of the IPC under which the terms from a patent document are categorized. In this research we apply the hot-term detection algorithm (TF*PDF) and utilize IPC information in order to identify hot topics in a patent data corpus.
III. SYSTEM ARCHITECTURE
The overall procedure for hot-term detection and technological trend identification from patent documents consists of several steps, including data crawling, keyword extraction, hot-term detection, and trend analysis, as shown in Fig. 2. First, raw patent data is collected from a U.S. patent database and transformed into structured data. A POS tagger is used for keyword extraction. All common stop words are removed, in combination with a list of patent stop words. Then, a candidate list of detected hot terms is compiled by the TF*PDF algorithm and the aging theory. Finally, we analyze patent trends based on the hot-term detection and evaluate the results.
Fig. 2. System Architecture
A. Hot Term Detection
Two stop-word lists are used.
One contains common stop words and the other contains words that are common in patent documents but irrelevant to patent content. In the hot-term extraction process, two characteristics of a term are considered. One is the frequency of the term in the patent collection (Definition 1) and the other is the variation of the term over time (Definition 2). Term frequency is measured by TF*PDF[17][39]. TF*PDF is considered more suitable for topic detection than TF*IDF because the former assigns greater weights to terms that occur frequently in many documents. We adopt the TF*PDF scheme in this paper, and the top m terms are chosen as final terms for trend analysis. Phrases that do not include any of the final terms are therefore excluded.
Definition 1 (TF*PDF). Given a term j in a document, the TF*PDF weight of term j is calculated from [39] through the following (1) and (2):
$$W_j = \sum_{c=1}^{|C|} |F_{jc}| \exp\left(\frac{n_{jc}}{N_c}\right) \tag{1}$$
where
$$|F_{jc}| = \frac{F_{jc}}{\sqrt{\sum_{k=1}^{K} F_{kc}^{2}}} \tag{2}$$
where $W_j$ is the TF*PDF value of term j, which is the summation of the term weights gained from each IPC c; $|C|$ is the number of IPCs; $F_{jc}$ is the frequency of term j in the IPC c; K is the total number of terms in the IPC c; $n_{jc}$ is the number of documents that belong to the IPC c and in which term j occurs; and $N_c$ is the total number of documents in the IPC c.
Definition 2 (Term Life Cycle). Reference [17] defined a term life cycle model in order to calculate the variation of each term value through its cycle of birth, growth, decay, and death. This step is suitable for tracking the variations in the frequency of terms, which are critical to successful hot-topic extraction. We apply three functions from [17] to calculate the energy of topics in each time slot: getEnergy(), energyFunction(), and getVariation(). getEnergy() calculates the energy that a term receives from patents at a specific time slot.
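Definition 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `tf_pdf` and the input layout (a dictionary mapping each IPC code to a list of tokenized abstracts) are assumptions made for the example, and the IPC plays the role of the "channel" in (1) and (2).

```python
import math
from collections import Counter

def tf_pdf(docs_by_ipc):
    """Return {term: W_j} following (1) and (2).

    docs_by_ipc is a hypothetical layout: {ipc_code: [token_list, ...]}.
    """
    weights = Counter()
    for docs in docs_by_ipc.values():
        if not docs:
            continue
        freq = Counter()      # F_jc: occurrences of term j in IPC c
        doc_freq = Counter()  # n_jc: documents in IPC c that contain term j
        for tokens in docs:
            freq.update(tokens)
            doc_freq.update(set(tokens))
        # Denominator of (2): Euclidean norm of the term frequencies in IPC c.
        norm = math.sqrt(sum(f * f for f in freq.values()))
        if norm == 0:
            continue
        n_docs = len(docs)    # N_c
        for term, f in freq.items():
            weights[term] += (f / norm) * math.exp(doc_freq[term] / n_docs)
    return dict(weights)

corpus = {
    "H01M 04": [["lithium", "anode"], ["lithium"]],
    "H01M 10": [["lithium", "cell"]],
}
w = tf_pdf(corpus)
```

Because the weight is summed over IPCs, a term such as "lithium" that is frequent in several IPCs accumulates a larger value than a term confined to one IPC.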
The energy $E_{t,s}$ of term t measures the frequency with which t appears in a specified time slot s; it is the accumulated value of term t over all patent IPCs, as in (3). Therefore, hot terms are those that have high energy in all IPCs.
$$E_{t,s} = \sum_{c \in C} X^{2}_{t,c} \tag{3}$$
where C is the set of IPCs and $X^{2}_{t,c}$ is the association between term t and the time slot s in IPC c, given by (4):
$$X^{2} = \frac{(A+B+C+D)(AD-CB)^{2}}{(A+B)(C+D)(A+C)(B+D)} \tag{4}$$
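As a rough sketch of (3) and (4), the chi-square association can be computed from the four contingency counts A, B, C, and D of Table I and summed over IPCs. The function names and the input layout (each IPC mapped to a list of `(year, set_of_terms)` pairs) are assumptions for illustration only; returning 0 when the denominator vanishes is also an assumption, since the source does not say how that case is handled.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of (4) from the contingency counts A, B, C, D."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # assumption: treat a degenerate table as no association
    return (a + b + c + d) * (a * d - c * b) ** 2 / denom

def energy(term, slot, patents_by_ipc):
    """E_{t,s} of (3): chi-square of term vs. time slot, summed over IPCs.

    patents_by_ipc is a hypothetical layout: {ipc: [(year, set_of_terms), ...]}.
    """
    total = 0.0
    for patents in patents_by_ipc.values():
        a = b = c = d = 0
        for year, terms in patents:
            has_term = term in terms
            in_slot = year == slot
            if has_term and in_slot:
                a += 1   # contains t, inside s
            elif has_term:
                b += 1   # contains t, outside s
            elif in_slot:
                c += 1   # lacks t, inside s
            else:
                d += 1   # lacks t, outside s
        total += chi_square(a, b, c, d)
    return total

sample = {
    "H01M 04": [
        (1990, {"lithium"}), (1990, {"lithium"}),
        (1991, {"zinc"}), (1991, {"zinc"}),
    ]
}
e = energy("lithium", 1990, sample)
```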
  • 5. For each term we calculate the contingency table, as shown in Table 1:
TABLE I. CONTINGENCY TABLE
        s     ¬s
t       A     B
¬t      C     D
where A is the count of patents that contain term t in time span s; B is the count of patents that contain term t in other time spans; C is the count of patents that do not contain term t in time span s; and D is the count of patents that do not contain term t in other time spans.
energyFunction() converts a term energy value into a life support value. The life support value $\mathit{lifeSupport}_{t,s}$ of t at time slot s is calculated as the logarithm of the accumulated energy $E_{t,s}$, as in (5).
$$\mathit{lifeSupport}_{t,s} = \ln(E_{t,s}) \tag{5}$$
getVariation() calculates the variation of the life support values of term t over time in the patent collection, which can be computed as in (6):
$$V_t = \frac{1}{N} \sum_{s} \left(\mathit{lifeSupport}_{t,s} - \overline{\mathit{lifeSupport}_{t}}\right)^{2} \tag{6}$$
where N is the number of time slots in the given interval I; $\mathit{lifeSupport}_{t,s}$ is the life support value in each time slot; $\overline{\mathit{lifeSupport}_{t}}$ is the average life support value; and $V_t$ is the variation in the life support values of t during I. The overall weight of term t is measured by combining the TF*PDF and variation values, as in (7).
$$\mathit{weight}_t = \mathrm{TF{*}PDF} \times V_t \tag{7}$$
Finally, the terms in the candidate list are ranked by the combined weight. The top-ranked k terms can be chosen as hot terms that reflect the hot topics in the corpus.
B. Trend Detection
We divide the patent timeline into yearly time slots. In each time slot s, a trend is represented by the normalized weight of occurrences of term t from n documents.
IV. EXPERIMENTS
For hot topic detection, we present the results by comparing topics detected by three algorithms: Chi-square, TF*PDF, and the combination of TF*PDF and Aging Theory, where IPC information is employed under TF*PDF. For hot technology trend tracking, we define four types of trends and eliminate insignificant ones.
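The life support value (5), the variation (6), and the combined weight (7) of the hot-term detection step can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a zero energy is mapped to a life support value of 0 because ln(0) is undefined and the source does not cover that case, and the combination in (7) is taken to be a product.

```python
import math

def life_support(energy_value):
    """Equation (5): natural logarithm of the accumulated energy."""
    # Assumption: zero (or negative) energy yields 0 instead of an error.
    return math.log(energy_value) if energy_value > 0 else 0.0

def variation(energies):
    """Equation (6): population variance of the life support values."""
    supports = [life_support(e) for e in energies]
    mean = sum(supports) / len(supports)
    return sum((s - mean) ** 2 for s in supports) / len(supports)

def combined_weight(tf_pdf_value, energies):
    """Equation (7), assuming the two values are combined by multiplication."""
    return tf_pdf_value * variation(energies)

def top_k(weights, k):
    """Rank candidate terms by combined weight and keep the k highest."""
    return sorted(weights, key=weights.get, reverse=True)[:k]

flat = variation([5.0, 5.0, 5.0])          # a term with constant energy
bursty = variation([math.e, math.e ** 3])  # a term whose energy jumps
ranked = top_k({"lithium": 2.0, "method": 0.1, "anode": 1.2}, 2)
```

A term with constant energy has zero variation, so under the product assumption its combined weight is zero regardless of its TF*PDF value.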
To evaluate the significance of the trends we choose TF*IDF as the baseline, since TF*IDF is the most common technique in the data mining area.
A. Data Collection
The data collection contains 513 patent documents crawled from the U.S. Patent and Trademark Office database. Patent documents are selected from the domain of batteries, as published from 1977 to the present. Fig. 3 shows the IPC H01M 04, which contains patent documents belonging to the electrodes category and includes six subgroups (H01M 04/02, H01M 04/04, H01M 04/06, H01M 04/08, H01M 04/10, and H01M 04/12).
United States Patent 4,016,339
Gray et al. April 5, 1977
Abstract
A battery electrode structure of flat configuration comprises a cast mass of electrochemically active material, said mass having contained therein and exposed opposite surfaces thereof an open-mesh electrically conductive structure adapted for connection to a battery terminal. An open-mesh electrically conductive support member in the mass and in contact with the exposed electrically conductive structure maintains electrical conductivity throughout discharge to ensure maximum use of the active material.
Current International Class: H01M 04/06 (20060101); H01M 04/58 (20060101); H01M 06/34 (20060101); H01M 06/30 (20060101); H01M 04/70 (20060101); H01M 004/02 ()
Fig. 3. Sample input patent document
B. Data Preparation
1) Data Extraction
A patent document from the USPTO contains several sections, including "Title", "Abstract", "Claims", and "Description". "Title" is too short to be suitable for our research. "Claims" is insufficient because it may not contain as many technology phrases as other sections, or because those terms are already included in "Abstract". "Description" is lengthy and includes subfields such as "Field of the Invention", "Prior Art", "Summary of the Invention", "Detailed Description of the Preferred Embodiment", "Brief Description of the Drawings", etc. "Description" contains meaningful terms for the problem/solution extraction method, as shown by [29]. For preliminary experiments we use only "Abstract", which is a short, very precise summary of the invention.
2) IPC Extraction
From patent documents, we extract IPC information under the "Current International Class" tag, as shown in Fig. 3. The IPC information is then extracted at the Main Group level by using a regular expression and stored in a list of IPCs, which is later utilized in term weighting with the TF*PDF algorithm.
3) Timestamps
To identify the technology lifecycle, we split the whole time span into one-year intervals. Patent documents were crawled from 1976 to 2014, but no patent was filed in the H01M 04 class in 2006. Therefore, the data cover a total of 30 years, from 1976 to 2005.
C. Hot Topic Extraction
We compare hot terms extracted by TF*PDF, Chi-square, and the combined weight. This experiment validates the effectiveness of each term-weighting methodology in hot-term detection. Then we demonstrate how we can identify genuine hot topics by ranking each term based on its weight.
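The IPC extraction and timestamp steps of Data Preparation could look roughly like the sketch below. The regular expression, the function names, and the normalization of the main group to two digits (so that "H01M 004/02" and "H01M 04/06" both map to "H01M 04") are assumptions for illustration; the source only states that a regular expression is applied to the "Current International Class" field.

```python
import re

# Subclass (e.g. H01M), main group digits, then "/subgroup".
IPC_PATTERN = re.compile(r"\b([A-H]\d{2}[A-Z])\s*(\d{1,4})/\d+")
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

def extract_main_groups(class_field):
    """Return the distinct IPC main groups in order of first appearance."""
    groups = []
    for subclass, group in IPC_PATTERN.findall(class_field):
        code = f"{subclass} {int(group):02d}"  # assumed two-digit form
        if code not in groups:
            groups.append(code)
    return groups

def year_slot(date_text):
    """Map a publication date such as 'April 5, 1977' to its yearly slot."""
    match = YEAR_PATTERN.search(date_text)
    return int(match.group(0)) if match else None

field = ("H01M 04/06 (20060101); H01M 04/58 (20060101); "
         "H01M 06/34 (20060101); H01M 06/30 (20060101); "
         "H01M 04/70 (20060101); H01M 004/02 ()")
ipcs = extract_main_groups(field)
slot = year_slot("April 5, 1977")
```

For the sample document of Fig. 3, this yields two main groups, so the terms of that abstract would contribute to the TF*PDF sums of both IPCs.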
  • 6. Tables 2, 3, and 4 show the top 10 ranked hot topics obtained by using TF*PDF, Chi-square, and the combined weight, respectively. The results show that the hot terms extracted by each algorithm are different, since terms are weighted in consideration of different factors such as pervasiveness, topicality, or variation of the life cycle. It is shown that terms detected by Chi-square do not appear frequently and cannot be detected using frequency information. Those terms are significantly rare terms that would be suitable for detecting innovative technology from the patent domain. In contrast, TF*PDF and the combined weight mainly detect terms that are pervasive and topical. TF*PDF and the combined-weight methodology detect important terms that may be considered the topics of patent documents.
TABLE II. TOP 10 HOT TOPICS DETECTED BY TF*PDF
1976 | 1980 | 1985 | 1988 | 1990 | 1994 | 2000 | 2003 | 2005
atmosphere | atmosphere | atmosphere | resistance | reactant | areas | increase | liquid | dry
refractory | refractory | shear | layers | accumulator | coated | increases | ion | free
telescopic | conjugated | tab | retain | makes | coat | fuel | liquids | spray
free | axis | reinforce | retaining | make | uncoated | iron | shapes | dryer
spray | deterioration | amalgamated | strong | end | injecting | ion | wet | inventive
disturbing | size | precipitation | smooth | severed | nozzle | polysaccharide | floc | condition
orous | end | establish | web | lugs | passing | dehydration | disposition | telescopic
electron | wash | force | separating | band | fabrication | period | polytetrafluoroethylene | disturbing
electronic | reduce | globular | product | negligible | fabric | corrosive | dimensions | flex
constant | crystal | acetate | end | permeates | mounting | pasting | ions | expectancy
TABLE III. TOP 10 HOT TOPICS DETECTED BY CHI-SQUARE
1976 | 1980 | 1985 | 1988 | 1990 | 1994 | 2000 | 2003 | 2005
web | silver | layer | body | grids | hydrogen | nickel | electrode | electrode
metal | cathode | graphite | dy | battery | mprising | cathode | electro | rod
emulsion | athode | battery | powder | lead | comprising | athode | rod | edge
catalytic | group | conductor | electrode | plastic | rising | hydroxide | cell | assembly
support | active | deposition | lithium | grooves | copper | electrochemical | battery | oil
mu | vanadium | metal | electrolyte | spaced | lithium | manganese | mprising | tab
fibers | battery | film | portions | space | oil | dioxide | comprising | coiled
heating | connection | material | portion | power | foil | improved | rising | ss
heat | connect | gel | carbon | lugs | conductive | mno | end | metallic
electrode | addition | deposit | battery | plates | electrodes | lithium | single | metal
TABLE IV. TOP 10 HOT TOPICS DETECTED BY THE COMBINED WEIGHT
1976 | 1980 | 1985 | 1988 | 1990 | 1994 | 2000 | 2003 | 2005
form | material | electrode | material | battery | layer | nickel | electrode | electrode
layer | oxide | electro | powder | lead | hydrogen | material | cells | electro
la | active | layer | body | grids | mprising | lithium | cell | rod
electrolyte | cathode | battery | dy | power | comprising | improved | ce | lithium
material | athode | material | electrode | electrode | rising | electrode | el | alloy
ia | silver | al | electro | electro | material | electrochemical | battery | electrolyte
battery | battery | ia | lithium | rod | lithium | chemical | mprising | anode
high | vanadium | er | cathode | layer | copper | electrolyte | comprising | electrochemical
anode | anode | graphite | athode | plastic | battery | cathode | rising | chemical
web | lithium | ph | electrolyte | high | electrode | athode | plate | battery
improved | alloy | includes | Electro | material | electro | ring | material | layer
D. Trend Tracking and Analysis
In our experiments we obtained a list of 3,211 topics. To identify topics that could be candidates to represent technical trends and to eliminate insignificant ones, we define four types of trends that satisfy the following criteria, considering a decay value to be a value of 0:
Very stable trends: trends that have no decay values for the entire life cycle.
Stable trends: trends in which the number of decay values is less than 3.
Normal trends: trends that have at least two continuous time spans of three years or more, with the number of decay values in the range of 3 to 7.
Unstable trends: trends that have only one or two continuous time spans of at least three years and in which the number of decay values is greater than seven.

1) Trends by Chi-square

Fig. 4 shows four very stable trends detected by the Chi-square algorithm. As shown in Fig. 4, "ion" is the hottest technology developed for secondary batteries from 1976 to 2005.

Fig. 4. Very-Stable Trends by Chi-square

All normal trends detected by Chi-square from 1976 to 2005 are shown in Table 5.

TABLE V. NORMAL TRENDS BY CHI-SQUARE

hydrogen, portion, battery, include, anode, agent, high, electrodes, electrochemical, electrolyte, mproved, gas, sheet, end, structure, electric, surface, hydroxide, comprising, electrical, providing, current, salt, porous, oxide, lithium, solution, substrate, including, alkaline, coat, cathode, mixing, manufacturing, energy, active, alkali, density, face, plurality, powder, storage, mixture, process, carbon, proper, ions, tab, contact, disc, compound, step, method, area, chemical, matrix, part, ratio, cycle, article, composition, charge, pen, igh, line, bind, fabric, improve, treat, car, polymer, dioxide, sulfide, excellent, outer, voltage, element, forming, conductive, type, improved, alloy, life, rate, nickel, solid, plate, mix, making, acid, orous, athode

2) Trends by TF*PDF

The top 10 hottest very stable trends by TF*PDF are shown in Fig. 5.

Fig. 5. Top 10 Very Stable Trends by TF*PDF

Fig. 6 shows the top 10 very stable trends detected by the combined weight from 1976 to 2005. The term "method" is not a technical term and is of little significance in relation to a technological trend. These trends decayed at least once but fewer than three times and then grew continuously.
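
The four trend types defined in Section D can be sketched as a small classifier. This is only an illustrative reading of the criteria, not the authors' implementation: it assumes each topic is represented by a list of yearly weights, that a decay value is a weight equal to 0, and that a "continuous time span" is a run of consecutive non-zero years. The function names `count_long_spans` and `classify_trend` are hypothetical, and returning `None` for topics that match no category is an assumption about how insignificant topics are eliminated.

```python
from itertools import groupby
from typing import Optional, Sequence


def count_long_spans(weights: Sequence[float], min_len: int = 3) -> int:
    """Count maximal runs of consecutive non-zero weights of at least min_len years."""
    run_lengths = (
        len(list(group))
        for is_nonzero, group in groupby(weights, key=lambda w: w != 0)
        if is_nonzero
    )
    return sum(1 for length in run_lengths if length >= min_len)


def classify_trend(weights: Sequence[float]) -> Optional[str]:
    """Assign one of the four trend types, or None if no criterion is met.

    Assumption: a decay value is a yearly weight equal to 0.
    """
    if not weights:
        return None
    decays = sum(1 for w in weights if w == 0)
    spans = count_long_spans(weights)
    if decays == 0:
        return "very stable"   # no decay over the entire life cycle
    if decays < 3:
        return "stable"        # one or two decay values
    if decays <= 7 and spans >= 2:
        return "normal"        # 3 to 7 decays, at least two spans of 3+ years
    if decays > 7 and spans in (1, 2):
        return "unstable"      # more than 7 decays, only one or two long spans
    return None                # treated here as an insignificant topic
```

For example, `classify_trend([0.4, 0.0, 0.3, 0.5])` returns `"stable"`, since the series contains a single decay value.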
The results show that "material" is the hottest trend, receiving high attention in the electrode domain over the life cycle from 1976 to 2005. The second very stable trend is "electrode". Although the trend of "electrode" does not have a higher peak than "electro" or "electrolyte", it has no falling values over the years, unlike the others. The trend of "ba" falls nearly to the bottom, although one year before decaying it has a very high peak. However, two other trends, "electrolyte" and "da", are growing. They are considered to be among the hottest trends that will receive high attention in the patent domain in the next few years. The term "method" also has a continuously high value as a trend because it is a very frequent term used in almost all patent documents, particularly to identify Problem and Solution terms as in [29]. In further experiments it will be excluded from our term list.

Fig. 6. Top 10 Very Stable Trends by combination of TF*PDF and Aging Theory (1976-2005)

E. Evaluation

We evaluate the results for very stable trends, stable trends, and all trends detected by the system. We compare the TF*PDF algorithm and the combined weight with the baseline. We do not show the evaluation for Chi-square because Chi-square is not very effective in detecting trends (Recall = 0.0851). As shown in Table 6, the combined weight algorithm achieves the highest Precision for very stable trends (0.976); however, its Recall is lower than that of the TF*PDF algorithm by 0.064. The results also show that TF*PDF is more effective than the combined weight in detecting trends. Thus, in the patent domain the hot trends extracted by means of TF*PDF utilizing IPC information are not affected by their life cycles over time.

TABLE VI. PRECISION AND RECALL OF TF*PDF AND THE COMBINED WEIGHT

| Trend type | Method | Precision | Recall |
|---|---|---|---|
| Very stable trends | TF*PDF | 0.957 | 0.936 |
| Very stable trends | Combined Weight | 0.976 | 0.872 |
| Stable trends | TF*PDF | 0.959 | 0.959 |
| Stable trends | Combined Weight | 0.864 | 0.776 |
| All trends | TF*PDF | 0.84 | 0.601 |
| All trends | Combined Weight | 0.665 | 0.612 |

V. CONCLUSIONS

We have proposed a system for the extraction of hot topics and the detection of hot trends from the patent domain within a specific time period using TF*PDF and semantic information on where terms are distributed (IPCs). Our work makes the following contributions:

• The automatic detection of hot technological topics from the patent domain,
• The use of semantic information for hot-topic detection, which has not heretofore been done in the state of the art,
• The automatic tracking of hot trends from the patent domain.

Our implementation is intended to detect hot topics and track trends in terms of pervasiveness and topicality. We apply the TF*PDF weighting algorithm to extract terms with pervasiveness. To determine a term's topicality, we apply the aging theory to track the change in the term's life cycle. The combination of TF*PDF and the aging theory has been shown to improve the quality of hot-topic extraction in news documents. However, our research with patent documents shows that a term's life cycle does not affect the topicality of hot topics in the patent domain. This is because patent documents contain technical terms distributed yearly or monthly, and there are very rare terms that appear only several times in the entire corpus. Meanwhile, for news documents, terms are distributed daily with a very high frequency of change, and consequently variation plays an important role in the change of a term's life cycle.

By utilizing IPC information, in which documents are manually classified into specific categories by patent experts, we automatically detect technological trends from the patent domain in a specific time period. The terms extracted from those documents therefore belong to specific categories, whereby the importance of a term is evaluated based on its importance in specific IPCs. That allows a hot topic to be identified based on its importance in each IPC as opposed to being affected by the variation in its life cycle.
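
As a rough illustration of how a term's weight can be accumulated over IPC categories, the sketch below follows the TF*PDF formulation of Bun and Ishizuka [39], with each IPC class playing the role of a channel: a term's weight is the sum, over channels, of its normalized frequency in the channel multiplied by the exponential of the proportion of the channel's documents that contain it. Treating IPCs as channels in exactly this way is an assumption, and the weighting actually used in our system may differ in detail. The function name `tf_pdf`, the input layout, and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_pdf(channels: Dict[str, List[List[str]]]) -> Dict[str, float]:
    """Compute TF*PDF weights.

    channels maps an IPC code (used here as the channel) to a list of
    tokenized documents. For term j and channel c:
        W_j = sum_c |F_jc| * exp(n_jc / N_c)
        |F_jc| = F_jc / sqrt(sum_k F_kc^2)
    where F_jc is the frequency of j in c, n_jc the number of documents
    in c containing j, and N_c the number of documents in c.
    """
    weights: Dict[str, float] = {}
    for docs in channels.values():
        if not docs:
            continue
        term_freq = Counter(tok for doc in docs for tok in doc)
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        if norm == 0:
            continue  # channel contains only empty documents
        for term, freq in term_freq.items():
            pdf = math.exp(doc_freq[term] / len(docs))
            weights[term] = weights.get(term, 0.0) + (freq / norm) * pdf
    return weights


# Illustrative input: two IPC classes with made-up tokenized documents.
sample = {
    "H01M": [["battery", "electrode"], ["battery"]],
    "C01B": [["electrode", "oxide"]],
}
ranked = sorted(tf_pdf(sample).items(), key=lambda kv: kv[1], reverse=True)
```

Because the exponential term grows with the share of documents in a category that mention the term, a term that is pervasive within several IPCs accumulates a high weight regardless of how its frequency varies from year to year.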
For a huge number of patents with lengthy and difficult technical terms, it is necessary to quickly identify the hottest information about which technologies were invented with high attention. The experiment results show that our approach yields a substantial methodology for hot-topic extraction and technological trend detection from the patent domain. By applying the hot-term detection algorithm using Term Frequency and Proportional Document Frequency in consideration of IPC information, we have shown an important new criterion for weighting hot topics based on semantic categorization, which has not previously been applied in the patent domain.

ACKNOWLEDGMENT

This research was supported by the MSIP (Ministry of Science, ICT and Future Planning), Korea, under the Global IT Talent support program (IITP-2014-H0905-14-1005) and the Establishing IT Research Infrastructure Projects (I2221-14-1012) supervised by the IITP (Institute for Information and Communication Technology Promotion).

REFERENCES

[1] K. Borner, C. Chen, and K.W. Boyack, "Visualizing knowledge domains," Annual Review of Information Science and Technology, vol. 37, pp. 179-255, 2003.
[2] Y. Ding, G.G. Chowdhury, and S. Foo, "Bibliometric cartography of information retrieval research by using co-word analysis," Information Processing and Management, vol. 37, no. 6, pp. 817-842, 2001.
[3] S. Lee, B. Yoon, and Y. Park, "An approach to discovering new technology opportunities: keyword-based patent map approach," Technovation, vol. 29, no. 6-7, pp. 481-497, 2009.
[4] M. Moehrle, L. Walter, A. Geritz, and S. Muller, "Patent-based inventor profiles as a basis for human resource decisions in research and development," R&D Management, vol. 35, no. 5, pp. 513-524, 2005.
[5] J. Yoon and K. Kim, "Identifying rapidly evolving technological trends for R&D planning using SAO-based semantic patent networks," Scientometrics, vol. 88, no. 1, pp. 313-331, 2013.
[6] S. Dewulf, "Directed variation: variation of properties for new or improved function product DNA, a base for 'connect and develop'," World Conference: TRIZ Future, 2006.
[7] D. Mann, Hands-on systematic innovation, Belgium: Creax press, 2002.
[8] J. Yoon and K. Kim, "An analysis of property-function based patent networks for strategic R&D planning in fast-moving industries: The case of silicon-based thin film solar cells," Expert Systems with Applications, vol. 39, no. 9, pp. 7709-7717, 2012.
[9] M.L. Antonie and O.R. Zaiane, "Text document categorization by term association," In Proceedings of the 2002 IEEE international conference on data mining, 2002.
[10] X.Y. Chen, Y. Chen, L. Wang, and Y.F. Hu, "Text categorization based on frequent patterns with term frequency," In Proceedings of 2004 international conference on machine learning and cybernetics, 2004.
[11] H. Han, E. Manavoglu, C. Giles, and H. Zha, "Rule-based word clustering for text classification," In Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval, 2003.
[12] C. He and H.T. Loh, "Pattern-oriented associative rule-based patent classification," Expert Systems with Applications, vol. 37, no. 3, 2010.
[13] I. Bergmann, D. Butzke, L. Walter, J.P. Fuerste, M.G. Moehrle, and V.A. Erdmann, "Evaluating the risk of patent infringement by means of semantic patent analysis: the case of DNA chips," R&D Management, vol. 38, no. 5, pp. 550-562, 2008.
[14] T. Magerman, B.V. Looy, and X. Song, "Exploring the feasibility and accuracy of Latent Semantic Analysis based text mining techniques to detect similarity between patent documents and scientific publications," Scientometrics, vol. 82, no. 2, pp. 289-306, 2010.
[15] J. Yoon and K. Kim, "An automated method for identifying TRIZ evolution trends from patents," Expert Systems with Applications, vol. 38, no. 12, pp. 15540-15548, 2011.
[16] C.C. Chen, Y.T. Chen, Y. Sun, and M.C. Chen, "Life cycle modeling of news events using Aging Theory," In Proceedings of 14th European Conference of Machine Learning, pp. 47-59, 2003.
[17] K.Y. Chen, L. Luesukprasert, and S.T. Chou, "Hot topic extraction based on timeline analysis and multidimensional sentence modeling," IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 8, 2007.
[18] M.C. Yang and H.C. Rim, "Identifying interesting Twitter contents using topical analysis," Expert Systems with Applications, vol. 41, no. 9, pp. 4330-4336, 2014.
[19] J. Zeng, J. Duan, W. Cao, and C. Wu, "Topics modeling based on selective Zipf distribution," Expert Systems with Applications, vol. 39, no. 7, pp. 6541-6546, 2012.
[20] C. Wang, M. Zhang, L. Ru, and S. Ma, "An automatic online news topic key phrase extraction system," IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2008.
[21] Y. Chen, H. Amiri, Z. Li, and T. Chua, "Emerging topic detection for organizations from microblogs," In Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval, pp. 43-52, 2013.
[22] L. Christiansen, T. Schimoler, R. Burke, and B. Mobasher, "Modeling topic trends on the social web using temporal signatures," In Proceedings of the twelfth international workshop on Web information and data management, pp. 3-10, 2012.
[23] S. Lee, J. Lee, C. Park, and J. Lee, "Blog topic analysis using TF smoothing and LDA," In Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, 2013.
[24] R. Long, H. Wang, Y. Chen, O. Jin, and Y. Yu, "Towards effective event detection, tracking and summarization on Microblog data," Web-Age Information Management (Lecture Notes in Computer Science), 6897, pp. 652-663, 2011.
[25] P. Erdi, K. Makovi, Z. Somogyvari, K. Strandburg, J. Tobochnik, P. Volf, and L. Zalanyi, "Prediction of emerging technologies based on analysis of the US patent citation network," Scientometrics, vol. 95, no. 1, pp. 225-242, 2013.
[26] C. Lee, B. Song, and Y. Park, "How to assess patent infringement risks: a semantic patent claim analysis using dependency relationships," Technology Analysis & Strategic Management, vol. 25, no. 1, pp. 23-38, 2013.
[27] A.J. Trappey, C.V. Trappey, C. Wu, C.Y. Fan, and Y. Lin, "Intelligent patent recommendation system for innovative design collaboration," Journal of Network and Computer Applications, vol. 36, no. 6, pp. 1441-1450, 2013.
[28] J. Yoon and K. Kim, "TrendPerceptor: A property-function-based technology intelligence system for identifying technological trends from patents," Expert Systems with Applications, vol. 39, no. 3, pp. 2927-2938, 2012.
[29] K.L. Nguyen and S.H. Myaeng, "Query enhancement for patent prior-art search based on key-term dependency relationships and semantic tags," Lecture Notes in Computer Science, 7356, pp. 28-42, 2012.
[30] Y.G. Kim, J.H. Suh, and S.C. Park, "Visualization of patent analysis for emerging technology," Expert Systems with Applications, vol. 34, no. 3, pp. 1804-1812, 2008.
[31] M.J. Shih, D.R. Liu, and M.L. Hsu, "Discovering competitive intelligence by mining changes in patent trends," Expert Systems with Applications, vol. 37, no. 4, pp. 2882-2890, 2010.
[32] J. Yoon, H. Park, and K. Kim, "Identifying technological competition trends for R&D planning using dynamic patent maps: SAO-based content analysis," Scientometrics Journal, vol. 94, no. 1, pp. 313-331, 2013.
[33] H. Park, J.J. Ree, and K. Kim, "Identification of promising patents for technology transfers using TRIZ evolution trends," Expert Systems with Applications, vol. 40, no. 2, pp. 736-743, 2013.
[34] J. Duan and J. Zeng, "Web objectionable text content detection using topic modeling technique," Expert Systems with Applications, vol. 40, no. 15, pp. 6094-6104, 2013.
[35] J. Zeng, C. Wu, and W. Wang, "Multiple-grain hierarchical topic extraction algorithm for text mining," Expert Systems with Applications, vol. 37, no. 4, pp. 3202-3208, 2010.
[36] J.P. Zeng and S.Y. Zhang, "Incorporating topic transition in topic detection and tracking algorithms," Expert Systems with Applications, vol. 36, no. 1, pp. 227-232, 2009.
[37] S. Lee and H.J. Kim, "News Keyword Extraction for Topic Tracking," Networked Computing and Advanced Information Management NCM, 2008.
[38] S.Y. Chen, T.T. Tseng, H.E. Ke, and C.T. Sun, "Social trend tracking by time series based social tagging clustering," Expert Systems with Applications, vol. 38, no. 10, pp. 12807-12817, 2011.
[39] K.K. Bun and M. Ishizuka, "Topic Extraction from News Archive Using TF*PDF Algorithm," In Proceedings of the 3rd International Conference Web Information System Eng, pp. 73-82, 2002.