Web/Database Search by Topics
Case Western Reserve University
Ling Yang, Kun Si
Dec. 11, 2000
Today, a large amount of information is available online or on CDs/DVDs.
These information resources are typically extremely large, and usually contain a hyper-linked
table of contents, indexed keyword search facilities, and some multimedia data such as
images and audio/video streams. When the information resources are used in the limited
ways they support, they can be very effective. However, for other types of information
extraction, it is not easy to get meaningful results, and using these facilities can be a
very frustrating experience for the user. A new technology is needed to search this information.
In this project, we explore the information at the website http://researcher.sirs.com, and
try to model its contents using topics and topic metalinks. There are two goals in doing this:
the first is to find new, reasonable metalinks; the second is to raise issues about relationships
that cannot be modeled with the current metalink model, discuss them, and try to improve the
current metalink model. A website has been developed to demonstrate our model.
Section 1 gives a brief introduction to the website we have been studying. Section 2
explains our database model. Section 3 is devoted to the metalinks we have defined and the
issues about the topic map. Section 4 describes our implementation of web searching on the
database. The last section gives the conclusions of our work and suggests further work.
1. Introduction to the study website: http://researcher.sirs.com
It is a child website of http://www.sirs.com, hosted by SIRS Mandarin, Inc. SIRS
Mandarin, Inc. is an established information and technology provider to more than 50,000
libraries and institutions worldwide. Among its products, the company provides a
powerful Web search interface that allows integrated access to full-text articles, Internet
sites, documents and graphics from the SIRS reference databases: SIRS Researcher, SIRS
Government Reporter, SIRS Renaissance and SIRS NetSelect. The website we are
studying, http://researcher.sirs.com, is such a search website based on the SIRS
Researcher database, a general reference database with thousands of full-text articles
exploring social, scientific, health, historic, economic, business, political and global issues.
The search website offers several methods for finding an article:
Quick Search, Advanced Search, and Topic Browse. Quick Search searches
by keywords or subject headings. Advanced Search narrows a search down
by author, title, text, or a combination of these. Topic Browse allows the user to browse
through the topic categories. There are 8 topic categories and 46 subtopic categories in the
Topic Browse section. We limited our study to the topic Business: Consumerism, since
there are so many topics on this search website that we could not possibly model all of
them. We took 100 articles, with about 390 topics related to those articles.
We reviewed the articles and the topics involved one by one, and tried to abstract the
relationships between topics. At this moment, we have defined 16 metalinks and 53
metalink instances. The next section discusses how these metalinks and their instances are
implemented in the database.
2. Database Model
We save the articles, topics, metalinks and their instances in a relational database. There
are six tables in total, defined as:
a. article (aid, title, path, published)
b. topic (topic)
c. metalink (metalink)
d. article_topic (aid, topic, degree) foreign key (aid) references article(aid), foreign
key (topic) references topic(topic)
e. metalink_instance (miid, topic1, topic2, metalink) foreign key (topic1) references
topic(topic), foreign key (topic2) references topic(topic), foreign key (metalink)
references metalink(metalink)
f. article_metalink_instance (aid, miid) foreign key (aid) references article(aid), foreign
key (miid) references metalink_instance(miid)
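The six table definitions above can be sketched as DDL. The following is only a minimal illustration, assuming Python's built-in sqlite3 module and an in-memory database in place of the Oracle database used in the project. The column types, the NOT NULL constraints, the UNIQUE constraint on metalink_instance and the primary key of article_metalink_instance are our assumptions; only the column names and the foreign keys come from the list above.

```python
import sqlite3

# Assumed column types; the original Oracle DDL is not reproduced here.
SCHEMA = """
CREATE TABLE article (
    aid       INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    path      TEXT NOT NULL,
    published TEXT
);
CREATE TABLE topic (
    topic TEXT PRIMARY KEY
);
CREATE TABLE metalink (
    metalink TEXT PRIMARY KEY
);
CREATE TABLE article_topic (
    aid    INTEGER NOT NULL REFERENCES article(aid),
    topic  TEXT    NOT NULL REFERENCES topic(topic),
    degree TEXT,
    PRIMARY KEY (aid, topic)
);
CREATE TABLE metalink_instance (
    miid     INTEGER PRIMARY KEY,
    topic1   TEXT NOT NULL REFERENCES topic(topic),
    topic2   TEXT NOT NULL REFERENCES topic(topic),
    metalink TEXT NOT NULL REFERENCES metalink(metalink),
    UNIQUE (topic1, topic2, metalink)  -- assumed: the triple identifies a record
);
CREATE TABLE article_metalink_instance (
    aid  INTEGER NOT NULL REFERENCES article(aid),
    miid INTEGER NOT NULL REFERENCES metalink_instance(miid),
    PRIMARY KEY (aid, miid)            -- assumed composite key
);
"""


def create_schema(conn):
    """Create the six tables on an open SQLite connection."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The surrogate key miid keeps article_metalink_instance down to two small columns, which matches the space argument made for that table later in this section.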
Figure 1 is the ER model, and Tables 1-6 show sample records. All six tables are in
BCNF. The SQL commands to create tables, insert records, and create indexes are in the
The first table, ‘article’, stores information about the articles we have reviewed.
Field ‘aid’ is the ID of each article and the primary key; ‘title’ is the title of
the article; ‘path’ tells us where to find the article, and can be a URL or a local file path
and name; ‘published’ is the date the article was published.
The second table, ‘topic’, stores all the distinct topics discussed in these 100 articles. Table
‘metalink’ defines the metalinks between these topics. The fourth table, ‘article_topic’,
lists which article involves which topics. Fields ‘aid’ and ‘topic’ together form
the primary key, and they are foreign keys referring to tables ‘article’ and ‘topic’,
respectively. The third field, ‘degree’, indicates how deeply a topic is discussed in
an article. An article may include several topics but focus on just one or two of them and
mention the others only briefly. The value of this field is one of the following three,
in order of increasing depth: ‘elementary’, ‘moderate’, ‘elaborate’.
Table ‘metalink_instance’, as its name suggests, stores the instances of the metalinks defined
in table ‘metalink’. All the metalinks we have defined in this project are binary, so
three fields are necessary in this table: one for the metalink and two for the two topics
that the metalink relates. Note that metalinks are not
necessarily binary; they can be unary or ternary too. Each metalink instance has a unique ID,
‘miid’, which is the primary key. Although the combination of topic1, topic2, and metalink
is enough to uniquely identify a record, we still add a unique ID to each record, since records
are referred to in other tables, e.g. table ‘article_metalink_instance’. This saves
some space.
Table ‘article_metalink_instance’ has only two ID fields: the article ID ‘aid’,
referring to table ‘article’, and the metalink instance ID ‘miid’ just
mentioned. It is very similar to table ‘article_topic’, but besides telling us which topics
an article involves, it also tells us how the two topics (for a binary metalink) are related.
This table is important to our model: it adds more dimensions and
changes the model from just a list of (article, topic) pairs to one that captures the inner
structure of each article. The table is built on the assumption that multiple relationships
(metalinks) may exist between the same pair of topics. However, this is not the case in our
project, at least not so far. We will come back to this in the web implementation section.
3. Metalink and Issues
By scrutinizing the 100 articles and the 390 topics, we came up with 16 metalinks. They
are:
BUILD_ON: topic A BUILD_ON topic B means that, in a knowledge structure, a reader who wants
to master topic A first needs to know topic B. In other situations, both topics can be
events, and event A is the result of event B.
COMPARE_WITH: topic A and topic B tend to be compared, e.g. the medicare systems in Canada
and in the U.S.A. have been studied together in many articles.
COMPETE: two topics, such as products, compete with each other, e.g. generic drugs and their
corresponding brand name products.
COMPLEMENT: the existence of one topic, e.g. a person, an occurrence, a tool, makes the other more
CONTROVERSAL_TOPIC_IN: one topic is a controversial topic in the field of the other topic,
e.g. whether patients with a terminal illness have the right to die is very controversial in
the field of medical ethics.
HAVE_POWER_OVER: a certain person, group, organization, industry, government, etc. has
influence in a certain field, e.g. the controlling power of the pharmaceutical industry over
the drug market.
IMPROVE: the existence of certain people, or of a policy or event, makes the other better,
e.g. studies show that certified midwives often make childbirth safer.
topic A is not physically in topic B; rather, this describes the situation of topic A in the
area of topic B, e.g. health care reform in California.
topic A has all the characteristics of topic B, e.g. a monkey is a mammal.
IS_IN: both topics are geographical or physical terms, e.g. Oregon is in the U.S.
the happening of topic A, an event, leads to the happening of topic B, another event, e.g.
brain injury may cause coma.
OF: topic A is one aspect of topic B, e.g. the side effects of drugs.
the existence of topic A, an event or policy, prevents the happening of topic B, e.g.
vaccines may prevent some communicable diseases.
a group of people or an organization is a special group in the field of another topic, e.g.
aged people are a special group in health insurance.
topic A is under topic B, e.g. correctional medicine is a subspecialty of medicine.
a medicine or medical operation treats a disease.
The difficulty in finding and defining a metalink is deciding how general the
metalink should be. It cannot be too detailed; otherwise its instances will not carry useful
information, but degrade to mere records in a database table. It cannot be too general either:
e.g. ‘RELATE_TO’ is a metalink, but since almost every topic can relate to every other
topic in some way, it loses its meaning. The criteria for defining a metalink
can hardly be stated explicitly in words; they come more from an instinctive sense of logic.
The instantiation of a metalink, in other words, finding a pair of topics that are
related by the metalink, raises issues too. We need to take into consideration the sources
(articles in our project) of these topics. For example, for metalink ‘IS_IN’, one instance is
‘an engine IS_IN a car’. Both topic engine and topic car have some sources associated with
them, and any source for topic engine should relate to any source for topic car by ‘IS_IN’.
However, not every such pairing of two sources is meaningful or useful. A source for
engine may be devoted to a specific engine model, e.g. model c01 or c02, and a source for car
may be devoted to a specific car model, e.g. the Honda Civic, and the Honda Civic uses only
engine model c01, not c02. So pairing a source for engine c02 with a source for the Honda
Civic does not fit metalink ‘IS_IN’. In the earlier report, we came up with a solution that adds
an applicable level to certain metalink instances. In this example, ‘an engine IS_IN a car’
applies only to sources about engines in general and cars in general. When both sources are
more detailed, e.g. a specific engine model and the Honda Civic, the metalink instance needs
to be more specific too, e.g. ‘a c01 engine IS_IN a Honda Civic car’, so that all the sources
about the c01 engine and the Honda Civic car can be paired up. However, we found that this
solution complicates things, and it is hard to decide which instances should have this
‘applicable level’ and which should not. So now, if an instance of a metalink has the above
issue, we do not include it as a legitimate instance of the metalink.
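The pairing problem, and the ‘applicable level’ idea from the earlier report, can be illustrated with a small sketch. It is written in Python purely for illustration and is not part of our implementation. The source names, the (topic, detail) representation and the function names are all hypothetical; only the engine, car, c01, c02 and Honda Civic example comes from the text above.

```python
from itertools import product

# Hypothetical sources, each tagged with (topic, detail).
# detail is None for a source about the topic in general.
sources = {
    "engine overview": ("engine", None),
    "c01 manual": ("engine", "c01"),
    "c02 manual": ("engine", "c02"),
    "car overview": ("car", None),
    "Honda Civic review": ("car", "Honda Civic"),
}


def naive_pairs(topic1, topic2):
    """Pair every source of topic1 with every source of topic2."""
    left = sorted(name for name, (topic, _) in sources.items() if topic == topic1)
    right = sorted(name for name, (topic, _) in sources.items() if topic == topic2)
    return list(product(left, right))


def sources_about(topic, detail):
    """Sources whose topic and level of detail match exactly."""
    return sorted(name for name, key in sources.items() if key == (topic, detail))


def pair_sources(instance):
    """Pair only sources at the level at which the metalink instance applies."""
    (topic1, detail1), (topic2, detail2) = instance
    return list(product(sources_about(topic1, detail1),
                        sources_about(topic2, detail2)))


# 'engine IS_IN car' at the general level, and a more specific instance.
general = (("engine", None), ("car", None))
specific = (("engine", "c01"), ("car", "Honda Civic"))

print(naive_pairs("engine", "car"))  # includes the c02 / Honda Civic mismatch
print(pair_sources(general))
print(pair_sources(specific))
```

With level-aware pairing, the c02 source is never matched with the Honda Civic source, but every instance now has to carry its own level. This is the extra bookkeeping that led us to drop the idea.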
4. Implementation of Web Searching
We have developed a web site http://vorlon.cwru.edu/~lxy21 to demonstrate how our
data model and topic metalink model can be used in web search.
First, the user types in ONE topic of interest and clicks on the search
button. Our web search engine first searches table ‘metalink_instance’ and finds the
metalink instances that contain this topic; the result is printed in a table as the suggested
directions for further search. The search engine also looks in table ‘article_topic’ to find
all the articles that contain this topic, and gets the detailed information about these articles
from table ‘article’; this search result is also printed on the web page (see Figure 2).
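The two lookups described above can be sketched as two queries. The sketch below assumes Python's sqlite3 module rather than the JSP/JDBC code of the actual site. The function name search_topic is hypothetical, and the table definitions are simplified; the sample rows are taken from Tables 1, 4 and 5.

```python
import sqlite3


def search_topic(conn, topic):
    """Return (suggested metalink instances, matching articles) for one topic."""
    suggestions = conn.execute(
        "SELECT topic1, metalink, topic2 FROM metalink_instance "
        "WHERE topic1 = ? OR topic2 = ? ORDER BY miid",
        (topic, topic)).fetchall()
    articles = conn.execute(
        "SELECT a.aid, a.title, a.published "
        "FROM article_topic AS t JOIN article AS a ON a.aid = t.aid "
        "WHERE t.topic = ? ORDER BY a.aid",
        (topic,)).fetchall()
    return suggestions, articles


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (aid INTEGER PRIMARY KEY, title TEXT, path TEXT, published TEXT);
CREATE TABLE article_topic (aid INTEGER, topic TEXT, degree TEXT);
CREATE TABLE metalink_instance (miid INTEGER PRIMARY KEY, topic1 TEXT, topic2 TEXT, metalink TEXT);
""")
conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", [
    (1, "The Price We Pay", "/article/1.txt", "16-Oct-00"),
    (2, "Drug Makers Maneuver to Keep Generics Off Market", "/article/2.txt", "17-Aug-00"),
])
conn.executemany("INSERT INTO article_topic VALUES (?, ?, ?)", [
    (1, "generic drugs", "defines"),
    (1, "medical care", "elaborates"),
    (2, "generic drugs", "mentions"),
    (2, "brand name products", "elaborates"),
])
conn.executemany("INSERT INTO metalink_instance VALUES (?, ?, ?, ?)", [
    (1, "nursing home care", "home care services", "COMPARE_WITH"),
    (2, "generic drugs", "brand name products", "COMPETE"),
])

suggestions, articles = search_topic(conn, "generic drugs")
print(suggestions)
print(articles)
```

The topic is matched against both topic1 and topic2, since it can appear on either side of a binary metalink instance.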
Topic ‘generic drugs’ shows up in three metalink instances in our database: ‘generic
drugs COMPETE brand name products’, ‘prices OF generic drugs’, and ‘side effects OF
generic drugs’, and eight articles contain topic ‘generic drugs’. The three metalink
instances are our suggested directions for further search with topic metalinks. If the user
clicks on topic ‘brand name products’, our search engine finds the articles that contain either
topic ‘generic drugs’ or topic ‘brand name products’ or both. Two articles, ‘Drug Makers
Maneuver to Keep Generics Off Market’ and ‘ARE GENERIC DRUGS AS GOOD AS
BRAND NAME?’, are at the top of the search results list. They both discuss the
competition between ‘brand name products’ and their corresponding ‘generic drugs’.
The publication dates of these articles are also displayed. Notice that ‘2 topics’ is shown
for the first two articles on the list and ‘1 topics’ for the others. This is the
number of topics chosen by the user that each article contains.
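The ordering by ‘2 topics’ and ‘1 topics’ can be reproduced with a grouped count. This is a minimal sketch in Python with sqlite3, not the site's actual JSP code. The function name rank_by_topic_count and the simplified table definitions are assumptions; the sample rows come from Tables 1 and 4.

```python
import sqlite3


def rank_by_topic_count(conn, topics):
    """Articles containing any chosen topic, those with the most chosen topics first."""
    placeholders = ", ".join("?" for _ in topics)
    sql = (
        "SELECT a.aid, a.title, COUNT(DISTINCT t.topic) AS n_topics "
        "FROM article_topic AS t JOIN article AS a ON a.aid = t.aid "
        f"WHERE t.topic IN ({placeholders}) "
        "GROUP BY a.aid, a.title "
        "ORDER BY n_topics DESC, a.aid"
    )
    return conn.execute(sql, list(topics)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (aid INTEGER PRIMARY KEY, title TEXT, path TEXT, published TEXT);
CREATE TABLE article_topic (aid INTEGER, topic TEXT, degree TEXT);
""")
conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", [
    (1, "The Price We Pay", "/article/1.txt", "16-Oct-00"),
    (2, "Drug Makers Maneuver to Keep Generics Off Market", "/article/2.txt", "17-Aug-00"),
    (3, "MEDICAL ECONOMICS: SEVEN WAYS TO CUT YOUR PILL BILL", "/article/3.txt", "1-Feb-00"),
    (4, "ARE GENERIC DRUGS AS GOOD AS BRAND NAME?", "/article/4.txt", "1-May-98"),
])
conn.executemany("INSERT INTO article_topic VALUES (?, ?, ?)", [
    (1, "generic drugs", "defines"),
    (2, "brand name products", "elaborates"),
    (2, "generic drugs", "mentions"),
    (3, "generic drugs", "elaborates"),
    (4, "brand name products", "mentions"),
    (4, "generic drugs", "mentions"),
])

ranked = rank_by_topic_count(conn, ["generic drugs", "brand name products"])
for aid, title, n_topics in ranked:
    print(n_topics, "topics:", title)
```

On this sample, articles 2 and 4 come first with two matching topics each, which agrees with the two articles reported at the top of the result list above.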
The user can also choose to start a new search by typing in a topic and clicking on the search
button.
The difference between our topic metalink search and the usual keyword search is
not only that we know the relationship between topics ‘generic drugs’ and ‘brand name
products’ is ‘COMPETE’, but also that the relationship between these two topics in our top two
search results is ‘COMPETE’. This is assured by table ‘article_metalink_instance’,
which should have records associating the metalink instance
(‘generic drugs’, ‘COMPETE’, ‘brand name products’) with the IDs of these two articles.
Our search engine should search table ‘article_metalink_instance’, find the articles
that have the required metalink, and display them as the top suggested articles. However,
due to the lack of multiple relationships between topics, as discussed in section 2, this
table is not really needed. In other words, since there is no other relationship between
topics ‘generic drugs’ and ‘brand name products’ in our project (in the real world, there
might be), if an article has both topics, the relationship can only be ‘COMPETE’, so
there is no need to check table ‘article_metalink_instance’. In the future, as our
database grows, the metalinks and their instances get more complicated, and
multiple relationships exist between two topics, our search engine will visit table
‘article_metalink_instance’ first, but for now it is skipped.
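The lookup that is currently skipped could look like the following sketch, again in Python with sqlite3 rather than JSP/JDBC. The function name articles_with_instance is hypothetical. The two article_metalink_instance rows are illustrative only: they encode the association the text says should exist for articles 2 and 4, since the contents of Table 6 are not shown.

```python
import sqlite3


def articles_with_instance(conn, topic1, metalink, topic2):
    """Articles recorded as containing the given metalink instance."""
    return conn.execute(
        "SELECT a.aid, a.title "
        "FROM metalink_instance AS mi "
        "JOIN article_metalink_instance AS ami ON ami.miid = mi.miid "
        "JOIN article AS a ON a.aid = ami.aid "
        "WHERE mi.topic1 = ? AND mi.metalink = ? AND mi.topic2 = ? "
        "ORDER BY a.aid",
        (topic1, metalink, topic2)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (aid INTEGER PRIMARY KEY, title TEXT, path TEXT, published TEXT);
CREATE TABLE metalink_instance (miid INTEGER PRIMARY KEY, topic1 TEXT, topic2 TEXT, metalink TEXT);
CREATE TABLE article_metalink_instance (aid INTEGER, miid INTEGER);
""")
conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", [
    (1, "The Price We Pay", "/article/1.txt", "16-Oct-00"),
    (2, "Drug Makers Maneuver to Keep Generics Off Market", "/article/2.txt", "17-Aug-00"),
    (4, "ARE GENERIC DRUGS AS GOOD AS BRAND NAME?", "/article/4.txt", "1-May-98"),
])
conn.execute(
    "INSERT INTO metalink_instance VALUES (2, 'generic drugs', 'brand name products', 'COMPETE')")
# Illustrative rows (assumed): articles 2 and 4 both discuss the COMPETE relationship.
conn.executemany("INSERT INTO article_metalink_instance VALUES (?, ?)", [(2, 2), (4, 2)])

top = articles_with_instance(conn, "generic drugs", "COMPETE", "brand name products")
print(top)
```

Once several metalinks can hold between the same pair of topics, the articles returned here would be listed first, ahead of the articles found through ‘article_topic’ alone.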
We chose Apache as the HTTP web server, and use JSP and JDBC to communicate with
the Oracle database; Tomcat is the JSP engine.
5. Conclusion and further work
In our project, we have carefully studied 100 articles under the topic Business:
Consumerism, and the topics they involve, and extracted 16 metalinks and 53 metalink
instances from them. We have saved all this information in a relational
database, and developed a web site to demonstrate how our model helps people search
with topic metalinks.
Much can be done to improve this project. If we continue to work on it, we will
rethink and organize the metalinks. Our web site can also be made friendlier, e.g., with a list
of all the possible topics, and a text index on all the articles.
Figure 1. Entity-Relationship Model of the Database
Figure 2. Search result of topic ‘generic drugs’
Figure 3. Search result of topics ‘generic drugs’ and ‘brand name products’
Table 1. article
AID  TITLE                                                PATH             PUBLISHED
1    The Price We Pay                                     /article/1.txt   16-Oct-00
2    Drug Makers Maneuver to Keep Generics Off Market     /article/2.txt   17-Aug-00
3    MEDICAL ECONOMICS: SEVEN WAYS TO CUT YOUR PILL BILL  /article/3.txt   1-Feb-00
4    ARE GENERIC DRUGS AS GOOD AS BRAND NAME?             /article/4.txt   1-May-98
5    FDA SAYS GENERIC DRUGS ARE AS GOOD                   /article/5.txt   3-Feb-98
6    PRICE WARS OVER NAME-DROPPING                        /article/6.txt   1-May-94
7    Health-Care Reform: Battling the High Cost of Drugs  /article/7.txt   1-Jul-93
8    WHEN GENERIC ISNT GENUINE                            /article/8.txt   9-Jul-89
9    FORGOTTEN PATIENTS: THE MENTALLY ILL                 /article/9.txt   1-Apr-00
10   WHATS WRONG WITH MANAGED CARE AND HOW TO FIX IT      /article/10.txt  1-Feb-00
11   TOBACCO TRIUMPHS AS COURT SAYS NO TO UNION LAWSUITS  /article/11.txt  11-Jan-00
12   IT CANT HAPPEN HERE                                  /article/12.txt  1-Dec-00
13   CANADA RETHINKS ITS MEDICARE                         /article/13.txt  14-Dec-99
14   HOSPITAL RANKINGS SHOW SAVINGS                       /article/14.txt  14-Dec-99
15   PLANNING CAN HELP COVER COSTS OF AGING               /article/15.txt  3-Oct-99
16   JUSTICE DEPT. SUES TOBACCO COS.                      /article/16.txt  22-Sep-99
17   SHOULD HMOs PAY FOR MAYBE?                           /article/17.txt  6-Jun-99
18   BIOETHICS: A MORAL VACUUM?                           /article/18.txt  1-May-99
19   DRUG BENEFIT NEWEST TWIST IN DEBATE OVER MEDICARE    /article/19.txt  28-Apr-99
20   THE NEW CONSUMER PARADIGM                            /article/20.txt  1-Apr-99
Table 2. topic
actions and defenses
african american women
aids (disease) and employment
aids (disease) education
american civil liberties union
americans with disabilities act (1990)
Table 3. metalink
Table 4. article_topic
AID  TOPIC                    DEGREE
1    cost                     mentions
1    generic drugs            defines
1    medical care             elaborates
1    prescription drugs       mentions
1    prescription pricing     defines
2    brand name products      elaborates
2    drugs                    mentions
2    generic drugs            mentions
2    law and legislation      defines
2    legal loopholes          elaborates
2    patents                  mentions
2    pharmaceutical industry  defines
3    generic drugs            elaborates
3    medical economics        mentions
3    prescription drugs       defines
3    prescription pricing     elaborates
4    brand name products      mentions
4    consumer education       defines
4    cost effectiveness       elaborates
4    generic drugs            mentions
4    pharmaceutical industry  defines
4    pharmacists              elaborates
4    safety                   elaborates
4    u.s. food and drug adm.  mentions
Table 5. metalink_instance
MIID  TOPIC1                        TOPIC2               METALINK
1     nursing home care             home care services   COMPARE_WITH
2     generic drugs                 brand name products  COMPETE
3     layoffs                       unemployment         COMPLEMENT
4     living wills                  medical ethics       CONTROVERSAL_TOPIC_IN
5     living wills                  medicaid             CONTROVERSAL_TOPIC_IN
6     living wills                  medicare             CONTROVERSAL_TOPIC_IN
7     right to die                  medical ethics       CONTROVERSAL_TOPIC_IN
8     right to die                  medicaid             CONTROVERSAL_TOPIC_IN
9     right to die                  medicare             CONTROVERSAL_TOPIC_IN
10    right to refuse treatment     medical ethics       CONTROVERSAL_TOPIC_IN
11    right to refuse treatment     medicaid             CONTROVERSAL_TOPIC_IN
12    right to refuse treatment     medicare             CONTROVERSAL_TOPIC_IN
13    substance abuse in pregnancy  medical ethics       CONTROVERSAL_TOPIC_IN
14    substance abuse in pregnancy  medicaid             CONTROVERSAL_TOPIC_IN
15    substance abuse in pregnancy  medicare             CONTROVERSAL_TOPIC_IN
16    terminal care                 medical ethics       CONTROVERSAL_TOPIC_IN
17    terminal care                 medicaid             CONTROVERSAL_TOPIC_IN
18    terminal care                 medicare             CONTROVERSAL_TOPIC_IN
19    insurance companies           medical policy       HAVE_POWER_OVER
20    family                        health               IMPROVE
Table 6. article_metalink_instance