Supervised By
Dr. Mohamed A. El-Rashidy Eng. Ahmed Ghozia
Dept. of Computer Science & Engineering
Faculty of Electronic Engineering,
Menoufiya University.
• The main purpose of this project is to build our own search engine, one that meets our needs as a nation
• In this project we tried to add customized features to the search engine, such as building and developing a time-based search engine aimed at local and international news
• Question: What is a search engine?
• How do web search engines work?
• Web crawling, indexing, ranking
• Lucene, Nutch, Solr
• Who uses Solr?
• Setting up Nutch for web crawling
• Setting up Solr for search
• Running Nutch in Eclipse for development
• Experiments
• Answer: software that
• builds an index on text
• answers queries using that index
• A search engine offers
Scalability
Relevance ranking
Integration of different data sources (email,
web pages, files, databases, ...)
• A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Ranking
• A program or automated script that browses the
World Wide Web
• Used to create a copy of all the visited pages for later
processing by a search engine
• It starts with a list of URLs to visit, called the seeds
• URLs are visited recursively according to a set of policies:
• A selection policy
• A re-visit policy
• A politeness policy
• A parallelization policy
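The seed-driven, recursive traversal described above can be sketched in plain Java. This is a toy model, not Nutch's implementation: the link graph is hard-coded in place of real fetching and parsing, and only the selection policy (skip already-visited URLs, stop at a depth limit) is shown; re-visit, politeness, and parallelization policies are omitted. All URLs and the class name `ToyCrawler` are illustrative.

```java
import java.util.*;

// Toy crawl frontier: breadth-first traversal from seed URLs up to a
// depth limit. The link graph stands in for real fetch + link extraction.
public class ToyCrawler {
    static final Map<String, List<String>> LINK_GRAPH = Map.of(
        "http://a.example/",   List.of("http://a.example/p1", "http://b.example/"),
        "http://a.example/p1", List.of("http://b.example/"),
        "http://b.example/",   List.of("http://a.example/"));

    public static List<String> crawl(List<String> seeds, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> frontier = new ArrayDeque<>(seeds); // URLs at the current depth
        Queue<String> next = new ArrayDeque<>();          // URLs discovered for the next depth
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            while (!frontier.isEmpty()) {
                String url = frontier.poll();
                if (!visited.add(url)) continue;          // selection policy: never revisit
                next.addAll(LINK_GRAPH.getOrDefault(url, List.of()));
            }
            Queue<String> tmp = frontier; frontier = next; next = tmp;
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        System.out.println(crawl(List.of("http://a.example/"), 2));
    }
}
```

A real crawler would replace the map lookup with an HTTP fetch and an HTML link extractor, and would throttle requests per host (the politeness policy).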
• The indexing process covers how data is collected, parsed,
and stored to enable fast and accurate evaluation of
search queries.
• The process involves the following steps:
• Data collection
• Data traversal
• Indexing
• Indexing process:
• Convert the document
• Extract text and metadata
• Normalize the text (stop-word removal, stemming)
• Write the (inverted) index
• Example:
• Document 1: “Apache Lucene at Jazoon”
• Document 2: “Jazoon conference”
• Index:
• apache -> 1
• conference -> 2
• jazoon -> 1, 2
• lucene -> 1
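The normalization and inverted-index steps can be sketched in plain Java, using the two example documents above. This is a toy stand-in for what Lucene does internally; the one-word stop list containing only "at" is an assumption made so the output matches the index shown on the slide.

```java
import java.util.*;

// Build an inverted index: term -> sorted set of document ids.
// Normalization here is just lowercasing, tokenizing on non-word
// characters, and dropping stop words (no stemming).
public class InvertedIndex {
    static final Set<String> STOP_WORDS = Set.of("at"); // tiny list for the example

    public static Map<String, Set<Integer>> build(Map<Integer, String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet())
            for (String term : e.getValue().toLowerCase().split("\\W+"))
                if (!term.isEmpty() && !STOP_WORDS.contains(term))
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(
            1, "Apache Lucene at Jazoon",
            2, "Jazoon conference");
        System.out.println(build(docs));
        // {apache=[1], conference=[2], jazoon=[1, 2], lucene=[1]}
    }
}
```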
• The web search engine responds to the query a user
enters in order to satisfy his or her information need
• A high-performance, scalable information retrieval
(IR) library
• Lets you add search capabilities to your
applications
• A free, open-source project implemented in Java
• With Lucene, you can index and search email
messages, mailing-list archives, instant-messenger
chats, your wiki pages…the list goes on.
• Web search engine software
• Open-source web crawler
• Coded entirely in the Java programming language
• Advantages:
• Scalability
• Crawler politeness
• Crawler management
• Quality
• Open-source enterprise search platform based on the
Apache Lucene project
• Powerful full-text search, hit highlighting, faceted
search
• Database integration and rich-document (e.g.,
Word, PDF) handling
• Download a binary package (apache-nutch-bin.zip)
• cd apache-nutch-1.X/
• bin/nutch crawl urls -dir crawl -depth 3 -topN 5
• You should now see the following directories
created:
• crawl/crawldb
• crawl/linkdb
• crawl/segments
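The `urls` argument in the crawl command above names a directory of seed files, which must exist before the crawl starts. A typical way to prepare it, following the Nutch 1.x tutorial (the seed URL below is only an example; use your own sites):

```shell
# Create the seed directory that "bin/nutch crawl urls ..." reads from,
# with one URL per line in a seed file.
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt
cat urls/seed.txt
```

Which domains the crawler may follow from these seeds can additionally be restricted via the regex-urlfilter configuration in `conf/`.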
• If you already have a Solr core set up and wish to index
to it, use
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Then follow the Solr setup steps to set up your Solr instance
and index your crawl data.
• Download the binary package (apache-Solr-bin.zip)
• cd ${APACHE_SOLR_HOME}/example
• java -jar start.jar
• Once Solr has started, you should be able to
access the admin console at:
http://localhost:8983/solr/admin/
• Integrate Solr with Nutch:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
• Restart Solr with the command “java -jar start.jar”
under ${APACHE_SOLR_HOME}/example
• Run the Solr index command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
• Crawling the Egyptian universities
• Crawling the Arabic news websites
• Crawling the Arabic news websites
Mustafa Mohammed Ahmed Elkhiat
Email: melkhiat@gmail.com
A customized web search engine