1. Introduction
Transcript

  • 1. CIS 700B Master’s Project Proposal
    Information Retrieval on Web and Applying Data Mining Techniques to Pattern Discovery
    Submitted to the Department of Computer Science, College of Computing Sciences, New Jersey Institute of Technology, in Partial Fulfillment of the Requirements for the Degree of Master of Science
    by Xiaoliang Wang
    APPROVALS
    Proposal Number: ________________________________
    Agree to Advise: _________________________________ (Project Advisor)
    Date Submitted: __________________________________
    Approved by: ____________________________________ (MS in CS Committee)
    Date Approved: __________________________________
  • 2. 1. Introduction
    2. Information retrieval on web
    3. Data mining using relational database
    4. Put all together
    5. Technology involved
    6. Implementation
    7. Conclusion and future work
    8. References
  • 3. 1. Introduction
    The amount of information on the World Wide Web (Web) is growing at an astonishing speed. Search engines, directories, and browsers have become ubiquitous tools for accessing and finding specific sets of information on the Web. These sets of information provide many opportunities for data mining, which helps extract meaningful new patterns that we could not necessarily find ourselves.
    2. Information retrieval on web
    Search engines and directories are the most widely used services for finding information on the Web. Both share the same goal of helping users quickly locate web pages of interest. Internet directories, however, are manually constructed: only pages that have been reviewed and categorized are listed. Search engines, on the other hand, automatically scour the web, building a massive index of all the pages they find.
    A modern web search engine consists of four primary components: a crawler, an indexer, a ranker, and a retrieval engine, connected through a set of databases. The crawler’s job is to effectively wander the web, retrieving pages that are then indexed by the indexer. Once the crawler and indexer finish their jobs, the ranker precomputes a numerical score for each indexed page to determine its potential importance. Lastly, the retrieval engine acts as the mediator between the user and the indexer, performing lookups and presenting results.
    3. Data mining using relational database
    Informally speaking, data mining is the process of extracting information or knowledge from a data set for the purpose of decision making.
  • 4. Data mining has become a very active field in recent years because of the availability of large data sets. Based on the patterns in the data, there are the following basic ways to discover knowledge.
    a. Association rules correlate the presence of a set of items with another range of values for another set of variables.
    b. Classification is the process of learning a model that describes different classes of data. The classes are predetermined, so this type of activity is also called supervised learning.
    c. Clustering is also called segmentation. It is used to identify natural groupings of cases based on a set of attributes. Cases within the same group have more or less similar attribute values. No single attribute is used to guide the training process; all input attributes are treated equally. Most clustering algorithms build the model through a number of iterations and stop when the model converges, that is, when the boundaries of the segments have stabilized.
    4. Put all together
    Most of the web is semi-structured in nature, which makes patterns hard to find. Meanwhile, a relational database, with its structured and well-defined semantics, is ideal for data mining. We can simply submit a URL to the web and get the response back as HTML. For example, http://finance.yahoo.com/q?s=bay returns:
    BAYER AKTIENGES ADS (NYSE:BAY), delayed quote data
  • 5.
    Last Trade:     41.09           Day's Range:    40.75 - 41.58
    Trade Time:     Feb 10          52wk Range:     31.16 - 44.31
    Change:         0.14 (0.34%)    Volume:         122,300
    Prev Close:     40.95           Avg Vol (3m):   174,046
    Open:           41.52           Market Cap:     30.01B
    1d 5d 3m 6m 1y 2y 5y max
    Bid:            N/A             P/E (ttm):      15.74
    Ask:            N/A             EPS (ttm):      2.61
    1y Target Est:  37.00           Div & Yield:    0.73 (1.80%)
    The Five Dumbest Things on Wall Street This Week
    By excluding the predefined useless tags, we can store the data in our database.
    5. Technology involved
    Java, SOAP, Oracle, XML, CGI
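The extraction step described in section 4 (drop the unwanted tags, keep the labelled values, store them in a relational table) might look roughly like the sketch below. It is written in Python for brevity, although the proposal names Java and Oracle; `sqlite3` stands in for the database. The sample HTML, the `SKIP_TAGS` set, the `quote` table layout, and the function names are all assumptions made for illustration, since the real page markup and schema are not given here.

```python
import sqlite3
from html.parser import HTMLParser

# Tags whose contents are treated as noise (an assumed, illustrative list).
SKIP_TAGS = {"script", "style", "a"}

# Hypothetical markup; the real Yahoo page structure is not known here.
SAMPLE = """
<table>
<tr><td>Last Trade:</td><td><b>41.09</b></td></tr>
<tr><td colspan="2"><a href="#">The Five Dumbest Things on Wall Street This Week</a></td></tr>
<tr><td>Volume:</td><td>122,300</td></tr>
<tr><td>Day's Range:</td><td>40.75 - 41.58</td></tr>
</table>
"""


class CellExtractor(HTMLParser):
    """Collect the text of each <td> cell, ignoring text inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.cells = []        # text of every table cell, in page order
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "td":
            self._in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._skip_depth == 0:
            self.cells[-1] += data


def parse_quote(html):
    """Return {label: value}, assuming cells alternate 'Label:' then value."""
    parser = CellExtractor()
    parser.feed(html)
    parser.close()
    # Cells that held only skipped content are empty and are dropped.
    cells = [c.strip() for c in parser.cells if c.strip()]
    fields = {}
    for label, value in zip(cells[::2], cells[1::2]):
        if label.endswith(":"):
            fields[label[:-1]] = value
    return fields


def store_quote(conn, symbol, fields):
    """Write one row per (symbol, field, value) into an assumed 'quote' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quote (symbol TEXT, field TEXT, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO quote (symbol, field, value) VALUES (?, ?, ?)",
        [(symbol, name, value) for name, value in fields.items()],
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_quote(conn, "BAY", parse_quote(SAMPLE))
    for row in conn.execute("SELECT symbol, field, value FROM quote"):
        print(row)
```

Storing values as text in a generic field/value table keeps the sketch short; a real schema would more likely use typed columns per field so that the mining step can work on numbers directly.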
  • 6. 6. Implementation
    Step 1: Data Collection
    Step 2: Data Cleaning and Transformation
    Step 3: Model Building
    Step 4: Model Assessment
    Step 5: Reporting
    Step 6: Prediction
    7. Conclusion and future work
    8. References
    a. Mining the World Wide Web: An Information Search Approach, by George Chang, Marcus Healey, James A. M. McHugh, T. L. Wang
    b. Fundamentals of Database Systems, Fourth Edition, by Ramez Elmasri, Shamkant B. Navathe
    c. Data Mining with SQL Server 2005, by ZhaoHui Tang, Jamie MacLennan
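As a note on Step 3 (Model Building): the clustering approach described in section 3, which iterates until the segment boundaries stop changing, can be illustrated with a small k-means sketch. K-means is one common clustering algorithm and is chosen here as an assumption; the proposal does not name a specific algorithm. The function names, the naive initialization, and the sample points are likewise hypothetical and not taken from the project data.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments unchanged: model has converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


if __name__ == "__main__":
    # Made-up 2-D points forming two visibly separate groups.
    sample = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.2, 7.9), (7.8, 8.1)]
    centroids, labels = kmeans(sample, 2)
    print(centroids)
    print(labels)
```

All attributes are treated equally in the distance computation, matching the description of clustering in section 3. Taking the first k points as starting centroids keeps the sketch deterministic, but a practical implementation would usually pick them more carefully.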
