Project Report (word).doc

Web/Database Search by Topics
Project Report
Case Western Reserve University, ECES 433
Ling Yang, Kun Si
lxy21@po.cwru.edu, kxs99@po.cwru.edu
Dec. 11, 2000

Today, a large amount of information is available online or on CDs and DVDs. These information resources are typically very large and usually offer a hyperlinked table of contents, indexed keyword search, and some multimedia data such as images and audio/video streams. Used in the limited ways they were designed for, they can be very effective. For other kinds of information extraction, however, it is hard to get meaningful results, and using these facilities can be a frustrating experience. A new technology is needed to search this information effectively.

In this project, we explore the information at the website http://researcher.sirs.com and try to model its contents using topics and topic metalinks. We have two goals: first, to find new, reasonable metalinks; second, to raise issues about relationships that cannot be modeled with the current metalink model, discuss them, and try to improve the model. A website has been developed to demonstrate our model.

Section 1 briefly introduces the website we studied. Section 2 explains our database model. Section 3 is devoted to the metalinks we have defined and the issues about topic maps. Section 4 describes our implementation of web searching on the database. Section 5 gives our conclusions and suggests further work.

1. Introduction to the study website: http://researcher.sirs.com

It is a child website of http://www.sirs.com, hosted by SIRS Mandarin, Inc., an established information and technology provider to more than 50,000 libraries and institutions worldwide. Among its products, the company provides a Web search interface that allows integrated access to full-text articles, Internet sites, documents and graphics from the SIRS reference databases SIRS Researcher, SIRS Government Reporter, SIRS Renaissance and SIRS NetSelect. The website we are studying, http://researcher.sirs.com, is a search site built on the SIRS Researcher database, a general reference database with thousands of full-text articles exploring social, scientific, health, historic, economic, business, political and global issues.

The site offers several ways to find an article: Quick Search, Advanced Search, and Topic Browse. Quick Search searches by keywords or subject headings. Advanced Search narrows the search by author, title, text, or a combination of these. Topic Browse lets the user browse through the topic categories; there are 8 topic categories and 46 subtopic categories. Because the site covers so many topics that we could not possibly model all of them, we limited our study to Business: Consumerism. We took 100 articles, with about 390 topics related to those articles.
We reviewed the articles and their topics one by one and tried to abstract the relationships between topics. So far we have defined 16 metalinks and 53 metalink instances. The next section discusses how these metalinks and their instances are implemented in the database.

2. Database Model

We save the articles, topics, metalinks and metalink instances in a relational database. There are six tables in total, defined as:

a. article (aid, title, path, published)
b. topic (topic)
c. metalink (metalink)
d. article_topic (aid, topic, degree)
   foreign key (aid) references article(aid), foreign key (topic) references topic(topic)
e. metalink_instance (miid, topic1, topic2, metalink)
   foreign key (topic1) references topic(topic), foreign key (topic2) references topic(topic), foreign key (metalink) references metalink(metalink)
f. article_metalink_instance (aid, miid)
   foreign key (aid) references article(aid), foreign key (miid) references metalink_instance(miid)

Figure 1 is the ER model, and Tables 1-6 show sample records. All six tables are in BCNF. The SQL commands to create the tables, insert records, and create indexes are in the attachment.

The first table, ‘article’, holds information about the articles we reviewed. Field ‘aid’ is the ID of each article and the primary key; ‘title’ is the title of the article; ‘path’ tells where to find the article and can be a URL or a local file path and name; ‘published’ is the date the article was published. The second table, ‘topic’, holds all the distinct topics discussed in the 100 articles. Table ‘metalink’ defines the metalinks between these topics.

The fourth table, ‘article_topic’, lists which article involves which topics. Fields ‘aid’ and ‘topic’ together form the primary key, and they are foreign keys referring to tables ‘article’ and ‘topic’, respectively. The third field, ‘degree’, indicates how deeply a topic is discussed in an article. An article may include several topics but focus on just one or two and mention the others only briefly. The value of this field is one of ‘elementary’, ‘moderate’, or ‘elaborate’, in order of increasing depth.

Table ‘metalink_instance’, as its name suggests, holds the instances of the metalinks defined in table ‘metalink’. All the metalinks defined in this project are binary, so the table needs three fields: one for the metalink and two for the topics it relates. Note that metalinks are not necessarily binary; they can be unary or ternary too. Each metalink instance has a unique ID ‘miid’, which is the primary key. Although the combination of topic1, topic2, and metalink is enough to identify a record uniquely, we still add an ID to each record because the records are referred to in other tables, e.g. table ‘article_metalink_instance’, and referring to them by ID saves some space.

Table ‘article_metalink_instance’ has only two ID fields: the article ID ‘aid’, referring to table ‘article’, and the metalink instance ID ‘miid’ just mentioned. It is very similar to table ‘article_topic’, but besides telling us which topics an article involves, it also tells us how the two topics (of a binary metalink) are related. This table is important to our model: it adds a dimension, changing the model from a plain list of (article, topic) pairs into one that captures the inner structure of each article. The table is built on the assumption that multiple relationships (metalinks) can exist between the same pair of topics. However, this is not the case in our project, at least so far. We will come back to this in the web implementation section.

3. Metalink and Issues

By scrutinizing the 100 articles and the 390 topics, we came up with 16 metalinks:

1). BUILD_ON: topic A BUILD_ON topic B means that, in a knowledge structure, a reader who wants to master topic A first needs to know topic B. In other situations, both topics can be events, and event A is the result of event B.
2). COMPARE_WITH: topic A and topic B tend to be compared, e.g. the medicare systems in Canada and the U.S.A. have been studied together in many articles.
3). COMPETE: two topics, e.g. products, compete with each other, e.g. generic drugs and their corresponding brand name products.
4). COMPLEMENT: the existence of one topic, e.g. a person, an occurrence, or a tool, makes the other more complete.
5). CONTROVERSIAL_TOPIC_IN: one topic is controversial in the field of the other topic, e.g. whether patients with terminal illness have the right to die is very controversial in the field of medical ethics.
6). HAVE_POWER_OVER: a person, group, organization, industry, government, etc. has influence in a certain field, e.g. the controlling power of the pharmaceutical industry over the drug market.
7). IMPROVE: the existence of certain people, policies, or events makes the other topic better, e.g. studies show that certified midwives often make childbirth safer.
8). IN: topic A is not physically in topic B; the metalink describes the situation of topic A in the area of topic B, e.g. health care reform in California.
9). IS_A: topic A has all the characteristics of topic B, e.g. a monkey is a mammal.
10). IS_IN: both topics are geographical or physical terms, e.g. Oregon is in the U.S.
11). LEAD_TO: the occurrence of topic A, an event, leads to the occurrence of topic B, another event, e.g. brain injury may cause coma.
12). OF: topic A is one aspect of topic B, e.g. the side effects of drugs.
13). PREVENT: the existence of topic A, an event or policy, prevents topic B from happening, e.g. vaccines may prevent some communicable diseases.
14). SPECIAL_GROUP_IN: a group of people or an organization is a special group in the field of another topic, e.g. the aged are a special group in health insurance.
15). SUBSPECIALTY_IN: topic A falls under topic B, e.g. correctional medicine is a subspecialty of medicine.
16). TREAT: a medicine or medical operation treats a disease.

The difficulty in finding and defining a metalink is deciding how general it should be. It cannot be too detailed; otherwise its instances carry no useful information and degrade to mere records in a database table. It cannot be too general either: ‘RELATE_TO’ is a metalink, but since almost every topic relates to every other topic in some way, it loses its meaning. The criteria for defining a metalink can hardly be stated explicitly; they come more from intuition.

Instantiating a metalink, in other words finding a pair of topics related by the metalink, raises issues too. We need to take into account the sources (articles, in our project) of these topics. For example, one instance of metalink ‘IS_IN’ is: an engine IS_IN a car. Both topic engine and topic car have sources associated with them, and any source for topic engine should relate to any source for topic car by ‘IS_IN’.
However, not every such pairing of two sources is meaningful or useful. A source on engines may be devoted to a specific engine model, e.g. model c01 or c02, and a source on cars may be devoted to a specific car model, e.g. the Honda Civic, which uses engine model c01 but not c02. So the pairing of a source on engine c02 with a source on the Honda Civic does not fit metalink ‘IS_IN’. In the earlier report, we came up with a solution that adds an applicable level to certain metalink instances. In this example, "an engine IS_IN a car" applies only to sources about engines and cars in general. When both sources get more detailed, e.g. engine model c01 and the Honda Civic, the metalink instance needs to be more specific too, e.g. "a c01 engine IS_IN a Honda Civic car", and all the sources about the c01 engine and the Honda Civic can be paired up. However, we found that this solution complicates things, and it is hard to decide which instances should have this ‘applicable level’ and which should not. So now, if an instance of a metalink has this issue, we do not include it as a legitimate instance of the metalink.

4. Implementation of Web Searching

We have developed a web site, http://vorlon.cwru.edu/~lxy21, to demonstrate how our data model and topic metalink model can be used in web search. First, the user types in ONE topic of interest and clicks the search button. Our search engine first searches table ‘metalink_instance’ for the metalink instances that contain this topic; the result is printed in a table as the suggested directions for further search. The search engine also looks in table ‘article_topic’ to find all the articles that contain this topic, and gets the details of these articles from table ‘article’; this search result is also printed on the web page (see Figure 2).

Topic ‘generic drugs’ shows up in three metalink instances in our database: ‘generic drugs COMPETE brand name products’, ‘prices OF generic drugs’, and ‘side effects OF generic drugs’, and eight articles contain topic ‘generic drugs’. The three metalink instances are our suggested directions for further search with topic metalinks. If the user clicks on topic ‘brand name products’, our search engine finds the articles that contain topic ‘generic drugs’, topic ‘brand name products’, or both.
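
The two lookups described above can be sketched in memory as follows. This is a minimal Java sketch under stated assumptions, not the authors' JSP code: the class name `TopicSearchSketch`, its methods, and the in-memory collections are hypothetical stand-ins for tables ‘metalink_instance’ and ‘article_topic’. The sample rows are a subset of Tables 4 and 5 plus the three ‘generic drugs’ instances named in the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the metalink_instance and article_topic tables.
final class TopicSearchSketch {

    // Sample metalink_instance rows as {topic1, metalink, topic2}.
    static List<String[]> sampleInstances() {
        return Arrays.asList(
            new String[] {"nursing home care", "COMPARE_WITH", "home care services"},
            new String[] {"generic drugs", "COMPETE", "brand name products"},
            new String[] {"prices", "OF", "generic drugs"},
            new String[] {"side effects", "OF", "generic drugs"});
    }

    // Sample article_topic rows (aid -> topics), a subset of Table 4.
    static Map<Integer, List<String>> sampleArticleTopics() {
        Map<Integer, List<String>> rows = new LinkedHashMap<>();
        rows.put(1, Arrays.asList("cost", "generic drugs", "medical care"));
        rows.put(2, Arrays.asList("brand name products", "drugs", "generic drugs"));
        rows.put(3, Arrays.asList("generic drugs", "medical economics"));
        rows.put(4, Arrays.asList("brand name products", "consumer education", "generic drugs"));
        return rows;
    }

    // Step 1: metalink instances that mention the topic on either side.
    static List<String[]> suggestions(List<String[]> instances, String topic) {
        List<String[]> hits = new ArrayList<>();
        for (String[] row : instances) {
            if (row[0].equals(topic) || row[2].equals(topic)) {
                hits.add(row);
            }
        }
        return hits;
    }

    // Step 2: article IDs ordered by how many of the chosen topics they contain.
    static List<Integer> rank(Map<Integer, List<String>> articleTopics, List<String> chosen) {
        final Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<String>> e : articleTopics.entrySet()) {
            int n = 0;
            for (String t : chosen) {
                if (e.getValue().contains(t)) n++;
            }
            if (n > 0) counts.put(e.getKey(), n);
        }
        List<Integer> ranked = new ArrayList<>(counts.keySet());
        // List.sort is stable, so articles with equal counts keep their original order.
        ranked.sort((a, b) -> Integer.compare(counts.get(b), counts.get(a)));
        return ranked;
    }

    public static void main(String[] args) {
        for (String[] s : suggestions(sampleInstances(), "generic drugs")) {
            System.out.println(s[0] + " " + s[1] + " " + s[2]);
        }
        System.out.println(rank(sampleArticleTopics(),
            Arrays.asList("generic drugs", "brand name products")));
    }
}
```

With these sample rows, articles 2 and 4 contain both chosen topics and are ranked ahead of articles 1 and 3, which contain only ‘generic drugs’.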
Two articles, ‘Drug Makers Maneuver to Keep Generics Off Market’ and ‘ARE GENERIC DRUGS AS GOOD AS BRAND NAME?’, are at the top of the search results list. They both discuss the competition between ‘brand name products’ and their corresponding ‘generic drugs’. The publication dates of these articles are also displayed. Notice the ‘2 topics’ shown with the first two articles on the list and the ‘1 topics’ shown with the others: this is the number of user-chosen topics that each article contains. The user can also start a new search by typing in a topic and clicking the search button.

The difference between our topic metalink search and the usual keyword search is not only that we know the relationship between topics ‘generic drugs’ and ‘brand name products’ is ‘COMPETE’, but also that the relationship between these two topics in our top two search results is ‘COMPETE’. This is assured by table ‘article_metalink_instance’, which should hold records associating the IDs of these two articles with the metalink instance (‘generic drugs’, ‘COMPETE’, ‘brand name products’). Our search engine should search table ‘article_metalink_instance’, find the articles that have the required metalink, and display them as the top suggested articles. However, because there are no multiple relationships between topics, as discussed in section 2, this table is not really needed. In other words, since there will not be another relationship between topics ‘generic drugs’ and ‘brand name products’ in our project (in the real world, there might be), if an article has both topics, the relationship can only be ‘COMPETE’, so there is no need to check table ‘article_metalink_instance’. In the future, as our database grows, the metalinks and their instances get more complicated, and multiple relationships exist between two topics, our search engine will visit table ‘article_metalink_instance’ first; for now this step is skipped.

We chose Apache as the HTTP server and use JSP and JDBC to communicate with the Oracle database; Tomcat is the JSP engine.

5. Conclusion and further work

In our project, we carefully studied 100 articles under the topic Business: Consumerism, together with their topics, and extracted 16 metalinks and 53 metalink instances from them. We built a relational database to hold all this information and developed a web site to demonstrate how our model helps people search with topic metalinks. Much can be done to improve this project. If we continue to work on it, we will rethink and organize the metalinks. Our web site can also be made friendlier, e.g. with a list of all the possible topics and a text index on all the articles.
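
Section 4 notes that the search engine would consult table ‘article_metalink_instance’ first once multiple relationships exist between two topics. A minimal Java sketch of that deferred step follows, assuming the topic-based ranking is already available as a list of article IDs. The class `MetalinkFirstSketch` and the method `promote` are hypothetical names; the (aid, miid) rows are the first four rows of Table 6.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: move articles tagged with the chosen metalink instance to the top.
final class MetalinkFirstSketch {

    // First four rows of Table 6 as {aid, miid}.
    static List<int[]> sampleRows() {
        return Arrays.asList(
            new int[] {1, 2},
            new int[] {4, 42},
            new int[] {5, 2},
            new int[] {6, 2});
    }

    // Articles in topicMatches that carry the given miid come first;
    // the remaining topic matches follow in their existing order.
    static List<Integer> promote(List<Integer> topicMatches, List<int[]> rows, int miid) {
        List<Integer> top = new ArrayList<>();
        for (int[] row : rows) {
            if (row[1] == miid && topicMatches.contains(row[0]) && !top.contains(row[0])) {
                top.add(row[0]);
            }
        }
        List<Integer> result = new ArrayList<>(top);
        for (Integer aid : topicMatches) {
            if (!top.contains(aid)) result.add(aid);
        }
        return result;
    }

    public static void main(String[] args) {
        // miid 2 is (generic drugs, brand name products, COMPETE) in Table 5.
        System.out.println(promote(Arrays.asList(2, 4, 1, 3), sampleRows(), 2));
    }
}
```

In the Table 6 sample, miid 2 is attached to articles 1, 5 and 6, so of the four topic matches only article 1 is promoted. This illustrates the mechanism only; it says nothing about how the full table was populated.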

[Figure 1. Entity_Relation Model of the Database. Entities: article (aid, title, path, published), topic (topic), metalink (metalink), metalink_instance (miid). Relationships: article_topic (degree), article_metalink_instance.]

Figure 2. Search result of topic ‘generic drugs’

Figure 3. Search result of topic ‘generic drugs’ and ‘brand name products’

Table 1. article

AID  TITLE                                                PATH             PUBLISHED
1    The Price We Pay                                     /article/1.txt   16-Oct-00
2    Drug Makers Maneuver to Keep Generics Off Market     /article/2.txt   17-Aug-00
3    MEDICAL ECONOMICS: SEVEN WAYS TO CUT YOUR PILL BILL  /article/3.txt   1-Feb-00
4    ARE GENERIC DRUGS AS GOOD AS BRAND NAME?             /article/4.txt   1-May-98
5    FDA SAYS GENERIC DRUGS ARE AS GOOD                   /article/5.txt   3-Feb-98
6    PRICE WARS OVER NAME-DROPPING                        /article/6.txt   1-May-94
7    Health-Care Reform: Battling the High Cost of Drugs  /article/7.txt   1-Jul-93
8    WHEN GENERIC ISNT GENUINE                            /article/8.txt   9-Jul-89
9    FORGOTTEN PATIENTS: THE MENTALLY ILL                 /article/9.txt   1-Apr-00
10   WHATS WRONG WITH MANAGED CARE AND HOW TO FIX IT      /article/10.txt  1-Feb-00
11   TOBACCO TRIUMPHS AS COURT SAYS NO TO UNION LAWSUITS  /article/11.txt  11-Jan-00
12   IT CANT HAPPEN HERE                                  /article/12.txt  1-Dec-00
13   CANADA RETHINKS ITS MEDICARE                         /article/13.txt  14-Dec-99
14   HOSPITAL RANKINGS SHOW SAVINGS                       /article/14.txt  14-Dec-99
15   PLANNING CAN HELP COVER COSTS OF AGING               /article/15.txt  3-Oct-99
16   JUSTICE DEPT. SUES TOBACCO COS.                      /article/16.txt  22-Sep-99
17   SHOULD HMOs PAY FOR MAYBE?                           /article/17.txt  6-Jun-99
18   BIOETHICS: A MORAL VACUUM?                           /article/18.txt  1-May-99
19   DRUG BENEFIT NEWEST TWIST IN DEBATE OVER MEDICARE    /article/19.txt  28-Apr-99
20   THE NEW CONSUMER PARADIGM                            /article/20.txt  1-Apr-99

Table 2. topic

TOPIC
abortion
abused women
accounting
actions and defenses
advertising
african american women
aged
aging
aids (disease)
aids (disease) and employment
aids (disease) education
alternative medicine
american civil liberties union
americans with disabilities act (1990)
amphetamines
antidepressants
antismoking movement
arrest
asylum

Table 3. metalink

METALINK
BUILD_ON
COMPARE_WITH
COMPETE
COMPLEMENT
CONTROVERSAL_TOPIC_IN
HAVE_POWER_OVER
IMPROVE
IN
IS_A
IS_IN
LEAD_TO
OF
PREVENT
SPECIAL_GROUP_IN
SUBSPECIALTY_IN
TREAT

Table 4. article_topic

AID  TOPIC                     DEGREE
1    cost                      mentions
1    generic drugs             defines
1    medical care              elaborates
1    prescription drugs        mentions
1    prescription pricing      defines
2    brand name products       elaborates
2    drugs                     mentions
2    generic drugs             mentions
2    law and legislation       defines
2    legal loopholes           elaborates
2    patents                   mentions
2    pharmaceutical industry   defines
3    generic drugs             elaborates
3    medical economics         mentions
3    prescription drugs        defines
3    prescription pricing      elaborates
4    brand name products       mentions
4    consumer education        defines
4    cost effectiveness        elaborates
4    generic drugs             mentions
4    pharmaceutical industry   defines
4    pharmacists               elaborates
4    safety                    elaborates
4    u.s. food and drug adm.   mentions

Table 5. metalink_instance

MIID  TOPIC1                        TOPIC2               METALINK
1     nursing home care             home care services   COMPARE_WITH
2     generic drugs                 brand name products  COMPETE
3     layoffs                       unemployment         COMPLEMENT
4     living wills                  medical ethics       CONTROVERSAL_TOPIC_IN
5     living wills                  medicaid             CONTROVERSAL_TOPIC_IN
6     living wills                  medicare             CONTROVERSAL_TOPIC_IN
7     right to die                  medical ethics       CONTROVERSAL_TOPIC_IN
8     right to die                  medicaid             CONTROVERSAL_TOPIC_IN
9     right to die                  medicare             CONTROVERSAL_TOPIC_IN
10    right to refuse treatment     medical ethics       CONTROVERSAL_TOPIC_IN
11    right to refuse treatment     medicaid             CONTROVERSAL_TOPIC_IN
12    right to refuse treatment     medicare             CONTROVERSAL_TOPIC_IN
13    substance abuse in pregnancy  medical ethics       CONTROVERSAL_TOPIC_IN
14    substance abuse in pregnancy  medicaid             CONTROVERSAL_TOPIC_IN
15    substance abuse in pregnancy  medicare             CONTROVERSAL_TOPIC_IN
16    terminal care                 medical ethics       CONTROVERSAL_TOPIC_IN
17    terminal care                 medicaid             CONTROVERSAL_TOPIC_IN
18    terminal care                 medicare             CONTROVERSAL_TOPIC_IN
19    insurance companies           medical policy       HAVE_POWER_OVER
20    family                        health               IMPROVE

Table 6. article_metalink_instance

AID  MIID
1    2
4    42
5    2
6    2
7    40
9    51
13   29
15   37
25   41
26   29
31   21
33   41
35   22
41   44
44   7
47   53
51   5
57   5
61   52
62   50

Attachment:

1. Default.html

    <html><head><title>Project Page of ECES423 CWRU----Ling Yang & Kun Si</title>
    <META HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">
    </head><body bgcolor=#ffffff text=#000000 link=#0000cc vlink=551a8b alink=#ff0000>
    <div align="center"><font color="#CC3300" size="4"><b><font color="#000000" face="Verdana, Arial, Helvetica, sans-serif">Project Page of ECES423</font></b></font> </div>
    <p align="center"><font size="3" color="#003333" face="Arial, Helvetica, sans-serif">----<i>LingYang &amp; Kun Si</i></font> <br> </p>
    <form action="topicsearch.jsp" method=get>
    <div align="center">
    <br><font face=arial,sans-serif size=-1>Type in one topic and start search </font>
    <br><input type=text name=topic value="generic drugs" framewidth=8 size=40 maxlength=256>
    <br><input type=submit value="Search"> <input type=hidden name=new value="yes">
    </div>
    </form>
    <div align="center"> <br>
    <font size=-1><a href="/ling/report.doc/">Project Report</a> - <a href="/ling/source.html/">Source Code </a> - <a href="/ling/command.txt">Tables and SQL commands</a></font>
    </div><p align="center"><font size="2">created date: Dec.09,2000</font>
    </body></html>

2. TopicSearch.jsp

    <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    <title>Topic Map Search Engine---Ling Yang & Kun Si</title>
    </head>
    <body bgcolor="#FFFFFF">
    <font size=+2><B>Topic Map Search Engine</B></font>
    <form action="topicsearch.jsp" method=get>
    <font face=arial,sans-serif size=-1>Type in one topic and start a new search </font>
    <br><input name="topic" type=text value="" framewidth=8 size=40 maxlength=256>
    <br><input type=submit value="Search"><input type=hidden name=new value="yes">
    </form>
    <%-- java.util.* is imported as well because the page uses Vector --%>
    <%@ page language="java" import="java.sql.*,java.util.*"%>
    <%
    Connection conn = null;
    Statement stmt_l = null;
    Statement stmt_r = null;
    Statement stmt = null;
    ResultSet rset_l = null;
    ResultSet rset_r = null;
    ResultSet rset = null;
    String newsearch=null;
    String query=null;
    String newtopic=null;
    Vector vTopic=null;

    // The topics chosen so far in this search are kept in the session.
    try{
        vTopic=(Vector) session.getValue("vTopic");
    } catch (Exception e){}
    if (vTopic==null) vTopic=new Vector();

    // A new search (hidden field 'new') starts with an empty topic list.
    newsearch=request.getParameter("new");
    if(newsearch!=null) vTopic.clear();

    // getParameter returns null when the parameter is absent, so test before trimming.
    newtopic=request.getParameter("topic");
    if(newtopic!=null) newtopic=newtopic.trim();
    if(newtopic!=null && vTopic.indexOf(newtopic)==-1) vTopic.add(newtopic);

    // Count, per article, how many of the chosen topics it contains; best matches first.
    query="select t.aid, count(t.aid) from article_topic t ";
    for (int i=0;i<vTopic.size();i++){
        if(i>0){
            query=query+" or t.topic='" +vTopic.elementAt(i).toString()+"' ";
        }else{
            query=query+ "where t.topic='"+vTopic.elementAt(i).toString()+"' ";
        }
    }
    query=query+" group by t.aid order by count(t.aid) desc";

    try {
        DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
        conn = DriverManager.getConnection("jdbc:oracle:thin:@ylnw:1521:oracledb","system","manager");
        // Metalink instances with the new topic on the left-hand side.
        stmt_l = conn.createStatement();
        rset_l=stmt_l.executeQuery("select topic2, metalink from metalink_instance where topic1='"+newtopic+"'");
    %>
    <br><font size=+1>click on the following topic to narrow your search </font>
    <br><table width="60%" border="1">
    <tr bgcolor="#EEEEFF"><th>topic1</th> <th>metalink</th> <th>topic2</th></tr>
    <%while(rset_l.next()) {%>
    <tr><td><%=newtopic%></td><td><%=rset_l.getString(2).toString().toUpperCase()%></td>
    <td><a href="topicsearch.jsp?topic=<%=rset_l.getString(1).trim()%>"><B><%=rset_l.getString(1)%></B></a></td> </tr>
    <%}
        // Metalink instances with the new topic on the right-hand side.
        stmt_r = conn.createStatement();
        rset_r=stmt_r.executeQuery("select topic1, metalink from metalink_instance where topic2='"+newtopic+"'");
        while(rset_r.next()) {%><tr>
    <td><a href="topicsearch.jsp?topic=<%=rset_r.getString(1).trim()%>"><B><%=rset_r.getString(1)%></B></a></td>
    <td><%=rset_r.getString(2).toString().toUpperCase()%></td><td><%=newtopic%></td> </tr>
    <%} %>
    </table><P>
    <hr>
    <%
        stmt = conn.createStatement();
        rset=stmt.executeQuery(query);
    %>
    <br><font size=+1><B>Search Results:</B></font>
    <%
        while (rset.next()){
            // Look up the details of each matching article.
            Statement instmt = conn.createStatement();
            ResultSet inner=instmt.executeQuery("select * from article where aid=" +rset.getInt(1));
            while (inner.next()) {
    %>
    <br><a href="<%=inner.getString(3).trim()%>"><%=inner.getString(2)%></a>
    &nbsp;&nbsp;&nbsp;&nbsp;last updated:<%=inner.getDate(4)%>&nbsp;&nbsp;(<%=rset.getInt(2)%> topics)
    <%}
            instmt.close();
            inner=null;
        }
        session.putValue("vTopic",vTopic);
    }catch (Exception exx){
        exx.printStackTrace();
    }finally{
        if (stmt_l != null) stmt_l.close();
        if (stmt_r != null) stmt_r.close();
        if (stmt != null) stmt.close();
        // Close the connection even when the last statement was never created.
        if (conn != null) conn.close();
    }
    %>
    </body>
    </html>
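
The page above builds its SQL by concatenating the user's topic into the query string, so a topic containing a quote character would break the statement. One possible alternative, shown here as a sketch and not as the authors' code, is to generate one `?` placeholder per topic and bind the values through a JDBC `PreparedStatement`. The class `TopicQueryBuilder` and its methods are hypothetical names, and the `in (...)` form is an assumed equivalent of the original chain of `or` conditions.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

// Hypothetical helper: the ranking query with bind placeholders instead of concatenated input.
final class TopicQueryBuilder {

    // Builds the SQL text with one '?' per chosen topic.
    static String build(int topicCount) {
        if (topicCount < 1) {
            throw new IllegalArgumentException("need at least one topic");
        }
        StringBuilder sql = new StringBuilder(
            "select t.aid, count(t.aid) from article_topic t where t.topic in (");
        for (int i = 0; i < topicCount; i++) {
            sql.append(i == 0 ? "?" : ", ?");
        }
        return sql.append(") group by t.aid order by count(t.aid) desc").toString();
    }

    // Prepares the statement and binds each topic to its placeholder (JDBC indexes start at 1).
    static PreparedStatement prepare(Connection conn, List<String> topics) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(build(topics.size()));
        for (int i = 0; i < topics.size(); i++) {
            ps.setString(i + 1, topics.get(i));
        }
        return ps;
    }

    public static void main(String[] args) {
        System.out.println(build(2));
    }
}
```

One caveat, stated as an assumption that has not been tested against this schema: the topic columns are declared `char(60)`, and Oracle may compare a value bound with `setString` without blank padding, so the bound topic might need to be padded to the column width, or the columns declared as `varchar2`, for the match to succeed.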

3. SQL commands

    create table topic(topic char(60), primary key(topic));

    create table metalink(metalink char(40), primary key (metalink));

    create table article(aid integer, title char(100), path char(60), published date,
        primary key (aid));

    create table article_topic (aid integer, topic char(60), degree char(20),
        primary key (aid, topic),
        foreign key (aid) references article(aid),
        foreign key(topic) references topic(topic));
    create index atopic_idx on article_topic(topic);

    create table metalink_instance(miid integer, topic1 char(60), topic2 char(60), metalink char(40),
        primary key(miid),
        foreign key(topic1) references topic(topic),
        foreign key (topic2) references topic(topic),
        foreign key(metalink) references metalink(metalink));
    create index t1_idx on metalink_instance(topic1);
    create index t2_idx on metalink_instance(topic2);
    create index tl_idx on metalink_instance(metalink);

    create table article_metalink_instance(aid integer, miid integer,
        primary key(aid,miid),
        foreign key (aid) references article(aid),
        foreign key(miid) references metalink_instance(miid));