Business Intelligence Solution Using Search Engine
BUSINESS INTELLIGENCE SOLUTION USING SEARCH ENGINE Minor Project prepared for the partial fulfillment of the Bachelor of Engineering. Prepared by: Ankur Mukherjee, Prateek Barapatre, Shetanshu Parihar. Guided by: IshwarLal Deshmukh Sir.
BUSINESS INTELLIGENCE SOLUTION Business Intelligence (BI) solutions are designed to allow companies to easily turn the volumes of data they collect and store into meaningful information, so they can best manage their operations. When key information is readily accessible, you can make better and timelier business decisions.
“Business Intelligence” (BI) is a business management term that refers to the applications and technologies used to gather, provide access to, and analyze data and information about a company's operations. Business intelligence systems can give companies a more comprehensive knowledge of the factors affecting their business, such as metrics on sales, production, and internal operations, and can help them make better business decisions.
The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging for appropriate information challenging. Business Intelligence (BI) solutions have for many years been a hot topic among companies due to their optimization and decision-making capabilities in business processes.
WEB SEARCH ENGINES Index-based: search the Web, index Web pages, and build and store huge keyword-based indices. Help locate sets of Web pages containing certain keywords. Deficiencies: 1. A topic of any breadth may easily contain hundreds of thousands of documents. 2. Many documents that are highly relevant to a topic may not contain the keywords defining it (the synonymy problem).
WEB TEXT MINING The WWW is a huge, directed graph, with documents as nodes and hyperlinks as the directed edges. Apart from the text itself, this graph structure carries a lot of information about the “usefulness” of the nodes. For example: 10 random, average people on the street say Mr. XYZ is a good dentist, while 5 reputed doctors, including dentists, recommend Mr. ABC as a better dentist. Whom do you choose?
BI – TEXT MINING With the widespread inclusion of documents, especially text, in business systems, business executives cannot extract useful details from large collections of unstructured and semi-structured natural-language materials using traditional business intelligence systems. It is the right time to develop powerful tools that expand the scope of business intelligence, so businesses can gain more competitive advantages.
Data mining has been touted as the solution for business intelligence. Its good performance is illustrated by the classical example in which data mining scans a large volume of retail sales to find the money-making purchasing patterns of consumers, in order to decide which products should be placed close together on shelves.
Text mining is a variation of data mining and a relatively new discipline. Like many new research areas, it is hard to give a generally agreed-upon definition. Commonly, text mining is the discovery by computer of previously unknown knowledge in text, by automatically extracting information from different written resources. Text mining can represent flexible approaches to information management, research, and analysis. Thus text mining extends data mining with the ability to deal with textual materials.
MINING THE WORLD WIDE WEB The WWW is a huge, widely distributed, global information service centre for: 1. Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc. 2. Hyperlink information. 3. Access and usage information. The WWW provides rich sources for data mining. Challenges faced are: 1. Too huge for effective data warehousing and data mining. 2. Too complex and heterogeneous: no standards or structure. 3. The data is growing and changing rapidly.
CRAWLERS The crawlers are implemented as multi-threaded objects in Java. Each crawler has many (possibly hundreds) threads of execution sharing a single synchronized frontier that lists the unvisited URLs. Each thread of the crawler follows a crawling loop that involves picking the next URL to crawl from the frontier, fetching the page corresponding to the URL through HTTP, parsing the retrieved page, and finally adding the unvisited URLs to the frontier. Before the URLs are added to the frontier they may be assigned a score that represents the estimated benefit of visiting the page corresponding to the URL.
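The crawling loop described above can be sketched in Java. This is a minimal illustration, not the project's actual code: the shared frontier is a thread-safe queue, and the HTTP fetch plus parse step is replaced by a stand-in link map (`web`), an assumption made so the sketch stays self-contained.

```java
import java.util.*;
import java.util.concurrent.*;

public class CrawlerSketch {
    // Stand-in "Web": url -> outlinks. A real crawler would fetch each
    // page over HTTP and parse the retrieved HTML for links.
    private final Map<String, List<String>> web;
    // Shared, synchronized frontier of unvisited URLs.
    private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
    final Set<String> visited = ConcurrentHashMap.newKeySet();

    CrawlerSketch(Map<String, List<String>> web) { this.web = web; }

    // Each worker repeats the crawling loop: pick the next URL from the
    // frontier, "fetch" and parse it, then add its unvisited outlinks
    // back to the frontier.
    void crawl(String seed, int threads) throws InterruptedException {
        frontier.add(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String url;
                while ((url = frontier.poll()) != null) {
                    if (!visited.add(url)) continue;   // already crawled
                    for (String link : web.getOrDefault(url, List.of()))
                        if (!visited.contains(link)) frontier.add(link);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

In the real system, the score mentioned above would be computed just before the `frontier.add(link)` call, and a priority queue would replace the plain FIFO frontier.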
PROBLEM IDENTIFICATION Searching for URLs of related business entities is a type of business intelligence problem. The entities could be related through the area of competence, research thrust, comparable nature (like start-ups) or a combination of such features. We start by assuming that a short list of URLs of related business entities is already available. However, the list needs to be further expanded. The short list may have been generated manually with the help of search engines, business portals or Web directories. An analyst may face some hurdles in expanding the list of relevant URLs. Such hurdles could be due to lack of appropriate content in relevant pages, inadequate user queries, staleness of search engines' collections, or bias in search engines' ranking. Similar problems plague information discovery using Web directories or portals. The staleness of a search engine's collection is highlighted by the dynamic nature of the Web. Hence, it is reasonable to complement traditional techniques with topical crawlers to discover up-to-date business information.
METHODOLOGY With the ubiquity of the Internet and Web, search engines have been sprouting like mushrooms after a rainfall. However, innovative search engines and guided search capabilities have started appearing only in recent years. For instance, Google, one of the most popular search engines, supports Web Services that allow external applications to issue Web search queries, which are actually processed on a commodity cluster of around 15,000 PC nodes. The goal of these applications is to ease and guide the searching efforts of novice web users towards their desired objectives.
SYSTEM FEATURES A search engine has two important features that help it produce high-precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called Page Rank. Second, it utilizes link (anchor) text to improve search results.
PAGE RANK CALCULATION Counting citations or back links to a given page gives some approximation of a page's importance or quality. Page Rank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. Page Rank can be defined as follows: assume page A has pages T1 to Tn which point to it (i.e., cite it). The parameter d is a damping factor which can be set between 0 and 1; we usually set d to 0.85. C(A) is defined as the number of links going out of page A. The Page Rank of page A is then: PR(A) = (1-d) + d ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )
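The formula above can be evaluated iteratively: start every page at rank 1 and repeatedly apply PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) until the values settle. A minimal sketch (the `compute` helper and the two-page example are illustrative, not from the project):

```java
import java.util.*;

public class PageRank {
    // links[i] lists the pages that page i links OUT to.
    // Returns the Page Rank of every page after the given iterations.
    static double[] compute(int[][] links, double d, int iterations) {
        int n = links.length;
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0);                    // start every page at rank 1
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, 1.0 - d);          // the (1-d) term
            for (int i = 0; i < n; i++) {
                int out = links[i].length;       // C(Ti)
                if (out == 0) continue;          // dangling page contributes nothing
                double share = d * pr[i] / out;  // d * PR(Ti)/C(Ti)
                for (int j : links[i]) next[j] += share;
            }
            pr = next;
        }
        return pr;
    }

    public static void main(String[] args) {
        // Two pages linking to each other: both converge to rank 1.0
        int[][] links = { {1}, {0} };
        double[] pr = compute(links, 0.85, 50);
        System.out.printf("PR = [%.4f, %.4f]%n", pr[0], pr[1]);
        // prints PR = [1.0000, 1.0000]
    }
}
```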
REPOSITORY The repository contains the full HTML of every web page. In the repository, the documents are stored one after the other, each prefixed by its docID, length, and URL. The repository requires no other data structures in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.
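The record layout described above (docID, length, and URL prefixing each document) can be sketched with Java's data streams. The helper names are hypothetical; the point is that each record is self-describing, so the file can be scanned sequentially to rebuild every other structure.

```java
import java.io.*;

public class RepositorySketch {
    // Append one record: docID, page length, URL, then the raw HTML,
    // mirroring the "docID, length, URL" prefix described above.
    static void append(DataOutputStream out, int docId, String url,
                       String html) throws IOException {
        byte[] body = html.getBytes("UTF-8");
        out.writeInt(docId);
        out.writeInt(body.length);
        out.writeUTF(url);
        out.write(body);
    }

    // Read the next record in sequence; because the length is stored up
    // front, no external index is needed to walk the repository.
    static String readNext(DataInputStream in) throws IOException {
        int docId = in.readInt();
        int length = in.readInt();
        String url = in.readUTF();
        byte[] body = new byte[length];
        in.readFully(body);
        return docId + " " + url + " " + new String(body, "UTF-8");
    }
}
```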
INDEXER Parsing: any parser designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors. Developing a parser which runs at a reasonable speed and is very robust involved a fair amount of work. Sorting: in order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. The sorting phase is also parallelized across as many machines as are available, simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory, based on wordID and docID.
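The sort-by-wordID step can be illustrated in miniature. This sketch inverts a single in-memory forward barrel; the real system works barrel by barrel on disk, and the `Posting` type here is an assumption for illustration.

```java
import java.util.*;

public class InvertedIndexSketch {
    // A forward posting: word wordId occurred in document docId.
    record Posting(int wordId, int docId) {}

    // Turn a forward barrel (postings in crawl order) into an inverted
    // barrel: sort by wordID, then group each word's docIDs into a doclist.
    static Map<Integer, List<Integer>> invert(List<Posting> forwardBarrel) {
        List<Posting> sorted = new ArrayList<>(forwardBarrel);
        sorted.sort(Comparator.comparingInt(Posting::wordId)
                              .thenComparingInt(Posting::docId));
        Map<Integer, List<Integer>> inverted = new LinkedHashMap<>();
        for (Posting p : sorted)
            inverted.computeIfAbsent(p.wordId(), k -> new ArrayList<>())
                    .add(p.docId());
        return inverted;
    }
}
```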
SEARCHING The goal of searching is to provide quality search results efficiently. Steps involved are: 1. Parse the query. 2. Convert words into wordIDs. 3. Seek to the start of the doclist in the short barrel for every word. 4. Scan through the doclists until there is a document that matches all the search terms.
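The four steps above amount to a doclist intersection. A minimal sketch, where the maps standing in for the lexicon (word to wordID) and the short barrels (wordID to doclist) are illustrative assumptions:

```java
import java.util.*;

public class SearchSketch {
    // 1. parse the query; 2. map each word to its wordID; 3. look up each
    // wordID's doclist; 4. keep only documents matching ALL terms.
    static Set<Integer> search(String query,
                               Map<String, Integer> wordIds,
                               Map<Integer, Set<Integer>> doclists) {
        Set<Integer> result = null;
        for (String word : query.toLowerCase().split("\\s+")) { // 1. parse
            Integer wordId = wordIds.get(word);                 // 2. word -> wordID
            Set<Integer> docs = (wordId == null) ? Set.of()
                : doclists.getOrDefault(wordId, Set.of());      // 3. doclist
            if (result == null) result = new HashSet<>(docs);
            else result.retainAll(docs);                        // 4. intersect
        }
        return result == null ? Set.of() : result;
    }
}
```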
CRAWLTABLE It has three fields: Serial, which is just a serial number; URLAddress, which is the crawled URL address available on the server; and IsCrawled, which indicates whether the URL address has been crawled or not.
INDEXTABLE It contains three fields: Keyword, which is the meta text; URLAddress, which is the crawled URL address available on the server; and Frequency, which shows the number of hits to the particular URL.
RESULT AND DISCUSSION 1. Starting the Search Engine
CONCLUSION AND SCOPE OF FUTURE WORK We would like to implement phrase search: for example, if the user enters the string “search broser”, the search engine will respond: Did you mean: “search browser”? Future development can also include implementing filters in the search engine, just like the Google search engine. Future work also includes showing graphical results for the searched string.