International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 5, Issue 3, March (2014), pp. 107-114 © IAEME: www.iaeme.com/ijcet.asp. Journal Impact Factor (2014): 8.5328 (Calculated by GISI) www.jifactor.com

WEB INFORMATION RETRIEVAL USING AUTOMATIC MULTI-DOCUMENT SUMMARIZATION

Rahul Shankarrao Khokale, Department of Computer Science & Engineering, Priyadarshini Indira Gandhi College of Engineering, NAGPUR (India)
Mohammad Atique, Post Graduate Department of Computer Science, Sant Gadge Baba Amravati University, AMRAVATI (India)

ABSTRACT

Today, the internet has become the most important source of information. People are accustomed to using it to acquire the information they need. Often, however, the information seeker does not find relevant information easily because of the presence of non-relevant web pages. This paper addresses the problem of effective information retrieval from the web and presents the notion of Web Information Retrieval using Automatic Multi-document Summarization. The proposed work is a blend of Web technology and Natural Language Processing. When the user submits a query, the system fetches web pages from different web servers and indexes them in order of relevance. The degree of relevance is determined not by how many times the query keywords occur in the document, but by the semantic content of the document and the user query.

Keywords: Web Information Retrieval, Automatic Multi-Document Summarization, Web Technology, Information Retrieval and Natural Language Processing
I. INTRODUCTION

1.1 Information Retrieval

The Internet and the Web offer new opportunities and challenges to information retrieval researchers. With the information explosion and the never-ending increase in web pages and digital data, it is very hard to retrieve useful and reliable information from the Web. Material from millions of web pages belonging to organizations, institutions and individuals has been made publicly and electronically accessible to millions of interested users. The Web uses an addressing system called Uniform Resource Locators (URLs) to represent links to documents on web servers. These URLs provide location information. Like the titles of books in a traditional library, the URLs on the Web are too numerous for anyone to remember. Web search engines allow us to locate internet resources across thousands of Web pages. It is almost impossible to get the right information, as there is too much irrelevant and outdated information.

Information retrieval systems provide useful information to researchers in libraries. The Web can be viewed as a virtual library. Information retrieval is a major component of the Internet and the Web in the information age and should play an important role in knowledge discovery. General search engines such as Google, AltaVista and Excite are considered the most powerful search engines so far. Most current search engines are based on words, not concepts. When searching for certain information or knowledge with a search engine, one can only use a few keywords to narrow down the search. The result of the search is tens or maybe hundreds of links, relevant and irrelevant, to various Web pages.
In spite of the voluminous studies in the field of intelligent retrieval systems, effective retrieval of information has remained an important unsolved problem. The use of conceptual knowledge such as ontologies in the information retrieval process has been considered as a way to enhance the quality of results. However, the conceptual formalism supported by a typical ontology may not be sufficient to represent uncertain information, due to the lack of clear-cut boundaries between the concepts of a domain [1].

“Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information.” (Salton, 1968)

1.2 Automatic Text Summarization

Automatic text summarization is the process of extracting the most important content from a source document and representing it in condensed form. The process involves document preprocessing, feature extraction, sentence ranking and summary generation [3]. Preprocessing is accomplished by tokenization, sentence splitting, POS tagging, etc. Feature extraction includes word frequency extraction and sentence ordering. Sentence weighting produces the scores required for sentence ranking. Summary generation is the final phase of automatic text summarization.

1.3 Multi-Document Summarization

Multi-document text summarization deals with the retrieval of salient information about a topic from various sources. The task is to identify a set of sentences, phrases or generated, semantically correct language units that carry useful information. Significant sentences are then extracted from this set and re-organized to obtain the multi-document summary [1]. Let D = {D1, D2, ..., Dn} be a set of documents, and let S = {S1, S2, ..., Sn} be the set of summaries of the respective documents.
Figure 1: Multi-Document Summarization

II. RELATED WORK

In recent years, the research focus in natural language processing and information retrieval has shifted towards automatic document summarization. Automatic document summarization is of two types: abstractive and extractive. Research in this field began with term-frequency-based summarization, an approach used by G. Salton (1989), Jun'ichi Fukumoto (2004), You Ouyang (2009) and Vikrant Gupta (2012) [1]. Inderjeet Mani (1997), Rada Mihalcea (2004), Junlin Zhang (2005), Xiaojun Wan (2008) and Kokil Jaidka (2010) [1] carried out research on document summarization using a graph-based approach. Kathleen McKeown (1995) and Xiaojun Wan (2007) used a time-based method. A sentence correlation method was implemented by Shanmugasundaram Hariharan (2012) and Tiedan Zhu (2012). A clustering-based method was proposed by Jade Goldstein (2000).

Vikrant Gupta et al. [2] developed an auto-summarization tool using statistical techniques, which involve finding the frequency of words, scoring the sentences, ranking the sentences, etc. Yogan Jaya Kumar et al. [3] discussed automatic multi-document summarization approaches. Y. Surendranadha Reddy et al. [4] presented a summarization system that produces a summary for a given web document based on sentence importance measures such as sentence ranking. Tiedan Zhu et al. [5] proposed an improved approach to sentence ordering for multi-document summarization.

III. PROPOSED WORK

This paper deals with a framework for “Web Information Retrieval based on Multi-Document Summarization”. The proposed framework is shown in Figure 2.
In this paper, however, the emphasis is on multi-level document summarization. The basic purpose of the framework is to enhance the effectiveness of web information retrieval. Because the indexing and ranking of the retrieved documents are supported by an intelligent decision-making system based on fuzzy inference rules, the degree of relevance can be increased.

[Figure 1 components: Document (D1), Document (D2), ..., Document (Dn) -> Summary (S1), Summary (S2), ..., Summary (Sn) -> significant and common sentence extraction -> sentence re-ordering based on semantic content -> multi-document summary]
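The flow of Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`summarize`, `multi_document_summary`), the parameters `k` and `m`, the frequency-based sentence score, and the word-overlap measure of "significant and common" sentences are all assumptions chosen for illustration. In particular, the semantic re-ordering step is replaced here by a simple sort on word overlap.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lower-case alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def summarize(document, k=2):
    """Summary S_i of one document D_i: the k sentences with the highest summed word frequency."""
    sentences = split_sentences(document)
    freq = Counter(w for s in sentences for w in tokenize(s))
    ranked = sorted(sentences,
                    key=lambda s: sum(freq[w] for w in tokenize(s)),
                    reverse=True)
    return ranked[:k]


def multi_document_summary(documents, k=2, m=3):
    """Combine per-document summaries S_1..S_n into one summary of at most m sentences."""
    summaries = [summarize(d, k) for d in documents]
    # Count, for every word, in how many per-document summaries it occurs.
    df = Counter()
    for summary in summaries:
        df.update({w for s in summary for w in tokenize(s)})
    # Pool the summary sentences, dropping exact duplicates but keeping order.
    candidates = list(dict.fromkeys(s for summary in summaries for s in summary))

    def significance(sentence):
        # Number of distinct words that also occur in another document's summary.
        return sum(1 for w in set(tokenize(sentence)) if df[w] > 1)

    # Stand-in for semantic re-ordering: most "common" sentences first.
    # sorted() is stable, so ties keep their original document order.
    return sorted(candidates, key=significance, reverse=True)[:m]
```

For example, calling `multi_document_summary(["Tigers are wild animals. Tigers hunt deer.", "Lions are wild animals. Lions live together."], k=1, m=2)` keeps one sentence per document and returns the two sentences that share the words "are", "wild" and "animals".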
Figure 2: Framework for web information retrieval

3.1 Intelligent Query Processing

When a query is written by the user and submitted to the system, it must be manipulated and represented in a proper form. The Intelligent Query Processing (IQP) module helps the user to formulate the query. WordNet is used to identify synonyms and thesaurus terms. This intelligent unit tries to understand the user's need and accordingly classifies the query into one of three types: Informational, Navigational or Transactional.

3.2 Search Engine

The search engine uses a web crawler to traverse the World Wide Web and fetch matching URLs.

3.3 Multi-Document Summarization

The n links/URLs are the input to the Multi-Document Summarization (MDS) unit. Each of these inputs can be an HTML web page or a text/PDF file. MDS computes a summary of each document separately and finally combines them all into a single summary. This involves identifying significant sentences, re-ordering sentences, etc. The detailed algorithm is discussed in Section IV.

3.4 Indexing

Indexing is the process of generating inverted indices of the retrieved documents. It is also an important step in information retrieval.

[Figure 2 components: User Query -> Intelligent Query Processing -> Search Engine -> URL1, URL2, URL3, ..., URLn -> Multi-document Summarization -> Indexing -> Ranking -> Documents]
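The inverted index mentioned in Section 3.4 can be sketched in Python as follows. The paper does not describe its index structure, so this is only a generic illustration: `build_inverted_index` and `search` are hypothetical names, documents are keyed by placeholder identifiers such as "URL1", and queries are matched with a plain boolean AND rather than the fuzzy ranking of Section 3.5.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it.

    `documents` is a dict of {doc_id: text}; the ids are placeholders.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of documents that contain every query term (boolean AND)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings))
```

With `index = build_inverted_index({"URL1": "dynamic storage allocation", "URL2": "storage of records"})`, the entry `index["storage"]` lists both documents, while `search(index, "dynamic storage")` returns only `["URL1"]`.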
3.5 Ranking

Ranking means determining the weight or rank of each retrieved web page or web document. Our page rank strategy is based on the summary generated by the MDS unit. The page rank algorithm will use a fuzzy inference system to judge the relevance of the web page to the user query.

IV. AUTOMATIC DOCUMENT SUMMARIZATION MODEL

To summarize a document or documents, a reader has to understand the document(s), integrate information and make connections across sentences to form a coherent discourse representation. We designed and developed a new generic algorithm for automatic document summarization based on the analysis of human cognition and intelligence. In order to make this model applicable to summarization, we define the concept of an 'event'. An 'event' is a cognitive psychological concept and can be either a story or a sentence in microstructure. We have treated each sentence as an event in this paper. The cognitive model is shown in Figure 3.

[Figure 3: Cognitive Model. A document is decomposed into events: Event 1, ..., Event N]

As each event represents a sentence in the document, it is a combination of two parts: Subject and Predicate. Consider the following example of an event:

Event: Tiger (Subject) is a wild animal (Predicate)

This event can be represented in predicate form using FOPL syntax:

wild_animal(tiger)

The representation of events in predicate form is required to create knowledge about the document. In addition, we have defined inference rules to understand the connection of one sentence to others. This helps us to decide the significance of a sentence within the document and is used for sentence re-ordering.
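The mapping from an event to predicate form can be sketched in Python for the simplest case. This is an assumption-laden illustration, not the authors' parser: the hypothetical function `to_predicate` handles only sentences of the shape "<subject> is/are [a|an|the] <predicate>", whereas a real system would rely on the POS tagging step of the algorithm.

```python
import re

# Matches "<subject> is|are [a|an|the] <predicate>[.]" and nothing more elaborate.
COPULA = re.compile(
    r"^\s*(?P<subject>.+?)\s+(?:is|are)\s+(?:an?\s+|the\s+)?(?P<predicate>.+?)\s*\.?\s*$",
    re.IGNORECASE,
)


def _symbol(phrase):
    """Turn a phrase into a lower-case FOPL-style symbol, e.g. 'wild animal' -> 'wild_animal'."""
    return "_".join(re.findall(r"[a-z0-9]+", phrase.lower()))


def to_predicate(sentence):
    """Return 'predicate(subject)' for a simple copular sentence, or None otherwise."""
    match = COPULA.match(sentence)
    if match is None:
        return None
    return f"{_symbol(match['predicate'])}({_symbol(match['subject'])})"
```

Applied to the example above, `to_predicate("Tiger is a wild animal")` yields `wild_animal(tiger)`; a sentence without "is" or "are", such as "Tigers hunt deer.", yields `None`.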
The algorithm for automatic text/document summarization is shown in Figure 4.

[Figure 4: Algorithm for text summarization. Flowchart boxes: START, Read the document, Tokenization, Stemmer, POS Tagger, Cognitive Model, Summary]

V. EXPERIMENTAL RESULTS

We have implemented the multi-document summarization system in Java.

5.1 Test Data/Corpus

We have used the standard CACM test data with 3204 test samples. The following document was chosen from the CACM test data.

Document: CACM-0276.html

Program Organization and Record Keeping for Dynamic Storage Allocation

The material presented in this paper is part of the design plan of the core allocation portion of the ASCII-MATIC Programming System. Project ASCII-MATIC is concerned with the application of computer techniques to the activities of certain headquarters military intelligence operations of the U. Army.

CACM October, 1961
Holt, A. W.
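The first steps of Figure 4 (tokenization and stemming) can be sketched in Python as below. The paper does not say which tokenizer or stemmer its Java implementation uses, so the suffix-stripping `stem` function here is a deliberately crude, hypothetical stand-in for a real stemmer such as Porter's, and POS tagging is left out because it needs a trained tagger.

```python
import re

# Suffixes are tried in this order; a real stemmer applies far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split text into lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token):
    """Strip the first matching suffix, provided at least three letters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize the text and stem every token."""
    return [stem(token) for token in tokenize(text)]
```

On the title of the sample document, `preprocess("Record Keeping for Dynamic Storage Allocation")` gives `['record', 'keep', 'for', 'dynamic', 'storage', 'allocation']`.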
Summary:

Title: Program Organization and Record Keeping for Dynamic Storage Allocation
Author: Holt, A. W.
Publication & Year: CACM October, 1961

Part of the design plan of the core allocation portion of the ASCII-MATIC Programming System. Project ASCII-MATIC deals with the application of computer techniques to the activities of certain headquarters military intelligence operations of the U. Army.

5.2 Term Frequency Computation

The term frequency of each term in the sentence is calculated individually. The sum of all these frequencies gives T(Si), the term frequency of sentence Si:

$$T(S_i) = \sum_{t_i = 1}^{n} t \ast f(t_i)$$

where ti is the i-th term in the sentence and t*f(ti) is the term frequency of term ti. Therefore, T(Si) for document 1 is

T(Si) = (1+1+1+1+1+1+1+1+2+5+1+1+1+1+1+1+3+1+1+1+1+2+2+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1) = 50

5.3 TF-IDF or term-frequency inverse-document-frequency

TF-IDF, or term-frequency inverse-document-frequency, is computed as the ratio of the quantity of terms in the document to the frequency of the quantity of documents containing those terms. For the above document, the value is the number of distinct terms (41) divided by the total term frequency T(Si) = 50 from Section 5.2:

$$TF\_IDF = \frac{41}{50} = 0.82$$

VI. CONCLUSION

In this paper, we have proposed a framework for web information retrieval using multi-document summarization. The emphasis was on multi-level document summarization. The basic purpose of the framework is to enhance the effectiveness of web information retrieval. Because the indexing and ranking of the retrieved documents are supported by an intelligent decision-making system based on fuzzy inference rules, the degree of relevance can be increased. The experiments were carried out on the CACM test data, and it was found that information retrieval results can be improved after the multi-document summarization process is performed.
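The arithmetic of Sections 5.2 and 5.3 can be checked with a short Python sketch. The list `DOC1_FREQS` copies the 41 per-term frequencies printed in the T(Si) computation; the helper `term_counts` and the variable names are hypothetical, and the ratio reproduces the paper's own TF_IDF figure rather than the textbook TF-IDF weighting.

```python
import re
from collections import Counter


def term_counts(text):
    """Return (total term frequency, number of distinct terms) for a piece of text."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(counts.values()), len(counts)


# Per-term frequencies copied from the T(Si) computation for document 1.
DOC1_FREQS = [1] * 8 + [2, 5] + [1] * 6 + [3] + [1] * 4 + [2, 2] + [1] * 18

total_frequency = sum(DOC1_FREQS)            # T(Si): summed term frequency
distinct_terms = len(DOC1_FREQS)             # number of distinct terms
ratio = distinct_terms / total_frequency     # the paper's TF_IDF value

print(total_frequency, distinct_terms, round(ratio, 2))  # prints: 50 41 0.82
```

The same two quantities can be obtained for any text with `term_counts`; for instance, `term_counts("the cat and the hat")` returns `(5, 4)`.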
REFERENCES

[1] Md. Majharul Haque, Suraiya Pervin, and Zerina Begum, Literature Review of Automatic Multiple Documents Text Summarization, International Journal of Innovation and Applied Studies, Vol. 3, No. 1, May 2013, pp. 121-129.
[2] Vikrant Gupta, Priya Chauhan, Sohan Garg, Anita Borude, Shobha Krishnan, An Statistical Tool for Multi-Document summarization, International Journal of Scientific and Research Publications, Volume 2, Issue 5, May 2012.
[3] Yogan Jaya Kumar and Naomie Salim, Automatic Multi Document Summarization Approaches, Journal of Computer Science 8 (1): 133-140.
[4] Y. Surendranadha Reddy and A.P. Siva Kumar, An Efficient Approach for Web document summarization by Sentence Ranking, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 7, July 2012.
[5] Tiedan Zhu, Xinxin Zhao, An Improved Approach to Sentence Ordering For Multi-document Summarization, 2012 IACSIT Hong Kong Conferences, IPCSIT vol. 25 (2012), IACSIT Press, Singapore.
[6] Nikola Vlahovic, Information Retrieval and Information Extraction in Web 2.0 environment, International Journal of Computers, Issue 1, Volume 5, 2011.
[7] Yi Guo and George Stylios, An Intelligent Algorithm For Automatic Document Summarization.
[8] Prakasha S, Shashidhar HR and Dr. G T Raju, “A Survey on Various Architectures, Models and Methodologies for Information Retrieval”, International Journal of Computer Engineering & Technology (IJCET), Volume 4, Issue 1, 2013, pp. 182-194, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[9] Mousmi Chaurasia and Dr. Sushil Kumar, “Natural Language Processing Based Information Retrieval for the Purpose of Author Identification”, International Journal of Information Technology and Management Information Systems (IJITMIS), Volume 1, Issue 1, 2010, pp. 45-54, ISSN Print: 0976-6405, ISSN Online: 0976-6413.
