Cloud computing is Internet-based ("cloud") development and use of computer technology ("computing").
Cloud computing is a general concept that incorporates software as a service (SaaS), Web 2.0 and other recent, well-known technology trends, in which the common theme is reliance on the Internet for satisfying the computing needs of the users.
2. What is Cloud computing?
3. APPLICATION INTRODUCTION
Open source web-search software based on Lucene
Originally a sub-project of the Apache Lucene project
Aims to make Lucene easier to use
Lucene Java:
A very well-known open source search engine from Apache
3-1. What is Nutch?
Transparency.
Nutch is open source, so anyone can see how the ranking algorithms work.
Understanding.
Nutch has been built using ideas from academia and industry.
For instance, core parts of Nutch are currently being re-implemented to use the MapReduce distributed processing model.
Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
3-1. What is Nutch?
Extensibility.
Nutch is very flexible.
It can be customized and incorporated into your application.
For developers, Nutch is a great platform for adding search to heterogeneous collections of information, and being able to customize the search interface, or extend the out-of-the-box functionality through the plugin mechanism.
3-1. What is Nutch?
Nutch divides naturally into two pieces:
the crawler
the searcher
Crawl
Collects pages
Builds an index of the pages
The index serves as the bridge between Crawl and Search
Search
Finds and displays the information the user asks for
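The crawl, index, and search split above can be sketched in a few lines. This is a toy illustration only, not Nutch code: the in-memory `PAGES` dictionary stands in for pages fetched from the Web, and the names `crawl`, `build_index`, and `search` are hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory "web": URL -> page text (stands in for fetched pages).
PAGES = {
    "http://example.org/a": "nutch is an open source web search engine",
    "http://example.org/b": "lucene is a search library",
    "http://example.org/c": "hadoop runs mapreduce jobs",
}

def crawl(pages):
    """Crawl step: collect (url, text) pairs. A real crawler would fetch over HTTP."""
    return list(pages.items())

def build_index(documents):
    """Index step: map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in documents:
        for term in text.lower().split():
            index[term].add(url)
    return index

def search(index, query):
    """Search step: return URLs containing every query term, sorted for stable output."""
    terms = query.lower().split()
    if not terms:
        return []
    hits = set(index.get(terms[0], set()))
    for term in terms[1:]:
        hits &= index.get(term, set())
    return sorted(hits)

# The index is the only thing the searcher needs from the crawler.
index = build_index(crawl(PAGES))
print(search(index, "search"))
```

Note that `search` never touches `PAGES` directly; it reads only the index, which is the sense in which the index bridges the two halves.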
3-1. What is Nutch?
More detail about the crawler
The Nutch crawler system produces three key data structures:
The WebDB containing the web graph of pages and links.
A set of segments containing the raw data retrieved from the Web by the fetchers.
The merged index created by indexing and de-duplicating parsed data from the segments.
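To make the three structures concrete, here is a minimal sketch under stated assumptions. The class names `WebDB` and `Segment` mirror the terms on the slide, but their fields, the `build_merged_index` helper, and the use of a content hash for de-duplication are illustrative choices, not Nutch's actual on-disk formats or algorithms.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class WebDB:
    """Web graph: each known page URL maps to the set of URLs it links to."""
    links: dict = field(default_factory=dict)

    def add_page(self, url, outlinks):
        self.links.setdefault(url, set()).update(outlinks)
        for target in outlinks:
            # A link target is a newly discovered page with no known outlinks yet.
            self.links.setdefault(target, set())

@dataclass
class Segment:
    """Raw data retrieved in one fetch round: URL -> fetched content."""
    fetched: dict = field(default_factory=dict)

def build_merged_index(segments):
    """Index all segments, keeping only the first URL seen for each distinct content."""
    seen_hashes = set()
    index = {}
    for segment in segments:
        for url, content in segment.fetched.items():
            # Assumption: two pages are duplicates when their content hashes match.
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if digest in seen_hashes:
                continue
            seen_hashes.add(digest)
            for term in content.lower().split():
                index.setdefault(term, set()).add(url)
    return index

db = WebDB()
db.add_page("http://example.org/", {"http://example.org/a", "http://example.org/copy"})

seg1 = Segment({"http://example.org/a": "hello nutch"})
seg2 = Segment({
    "http://example.org/copy": "hello nutch",   # same content as /a, dropped
    "http://example.org/b": "hello lucene",
})
merged = build_merged_index([seg1, seg2])
print(sorted(merged["hello"]))
```

The duplicate page in the second segment never reaches the merged index, while the WebDB still records it as part of the link graph.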
3-1. What is Nutch?
More detail about the searcher
The searcher looks for its data in the index and segments subdirectories of the directory defined in the searcher.dir property.
The default value for searcher.dir is the current directory (.), which is where you started Tomcat.
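The lookup rule can be illustrated with a short sketch. Only the property name `searcher.dir`, its default of `.`, and the `index` and `segments` subdirectory names come from the text; the function `resolve_searcher_paths`, the dictionary of properties, and the example paths are hypothetical and do not reflect how Nutch reads its configuration.

```python
from pathlib import PurePosixPath

def resolve_searcher_paths(properties, start_dir):
    """Return the (index, segments) directories implied by searcher.dir.

    properties: mapping of configuration property names to values (hypothetical).
    start_dir:  directory the servlet container was started from.
    """
    # A relative searcher.dir is taken relative to start_dir;
    # an absolute one replaces it entirely.
    base = PurePosixPath(start_dir) / properties.get("searcher.dir", ".")
    return base / "index", base / "segments"

# Default: searcher.dir is ".", so the start directory itself is used.
print(resolve_searcher_paths({}, "/opt/tomcat"))

# Explicit setting pointing at a crawl directory (example path is an assumption).
print(resolve_searcher_paths({"searcher.dir": "/data/crawl"}, "/opt/tomcat"))
```

In practice this means that starting Tomcat from the wrong directory with the default setting leaves the searcher with no index to read, so setting the property to an absolute path is the less fragile choice.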