Supervised By
Dr. Mohamed A. El-Rashidy Eng. Ahmed Ghozia
Dept. of Computer Science & Engineering
Faculty of Electronic Engineering,
Menoufiya University.
• The main purpose of this project is to build our own search engine, one that meets our needs as a nation
• In this project we tried to add customized features to the search engine, such as building and developing a time-based search engine aimed at local and international news
• Question: What is a search engine?
• How do web search engines work?
• Web crawling, indexing, ranking
• Lucene, Nutch, Solr
• Who uses Solr?
• Setting up Nutch for web crawling
• Setting up Solr for search
• Running Nutch in Eclipse for development
• Experiments
• Answer: software that
• builds an index on text
• answers queries using that index
• A search engine offers
Scalability
Relevance ranking
Integration of different data sources (email,
web pages, files, databases, ...)
• A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Ranking
• A program or automated script that browses the
World Wide Web
• Used to create a copy of all the visited pages for later
processing by a search engine
• It starts with a list of URLs to visit, called the seeds
• URLs are visited recursively according to a set of policies:
• A selection policy
• A re-visit policy
• A politeness policy
• A parallelization policy
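The seed-driven, recursive traversal described above can be sketched in plain Java. This is a toy model, not Nutch's implementation: the link graph is hard-coded in place of real fetching and parsing, and only the selection policy (skip already-visited URLs, stop at a depth limit) is shown; re-visit, politeness, and parallelization policies are omitted. All URLs and the class name `ToyCrawler` are illustrative.

```java
import java.util.*;

// Toy crawl frontier: breadth-first traversal from seed URLs up to a
// depth limit. The link graph stands in for real fetch + link extraction.
public class ToyCrawler {
    static final Map<String, List<String>> LINK_GRAPH = Map.of(
        "http://a.example/",   List.of("http://a.example/p1", "http://b.example/"),
        "http://a.example/p1", List.of("http://b.example/"),
        "http://b.example/",   List.of("http://a.example/"));

    public static List<String> crawl(List<String> seeds, int maxDepth) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> frontier = new ArrayDeque<>(seeds); // URLs at the current depth
        Queue<String> next = new ArrayDeque<>();          // URLs discovered for the next depth
        for (int depth = 0; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            while (!frontier.isEmpty()) {
                String url = frontier.poll();
                if (!visited.add(url)) continue;          // selection policy: never revisit
                next.addAll(LINK_GRAPH.getOrDefault(url, List.of()));
            }
            Queue<String> tmp = frontier; frontier = next; next = tmp;
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        System.out.println(crawl(List.of("http://a.example/"), 2));
    }
}
```

A real crawler would replace the map lookup with an HTTP fetch and an HTML link extractor, and would throttle requests per host (the politeness policy).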
• The indexing process covers how data is collected, parsed,
and stored to enable fast and accurate evaluation of
search queries.
• The process involves the following steps:
• Data collection
• Data traversal
• Indexing
• Indexing process:
• Convert the document
• Extract text and metadata
• Normalize the text (stop-word removal, stemming)
• Write the (inverted) index
• Example:
• Document 1: “Apache Lucene at Jazoon”
• Document 2: “Jazoon conference”
• Index:
• apache -> 1
• conference -> 2
• jazoon -> 1, 2
• lucene -> 1
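The normalization and inverted-index steps can be sketched in plain Java, using the two example documents above. This is a toy stand-in for what Lucene does internally; the one-word stop list containing only "at" is an assumption made so the output matches the index shown on the slide.

```java
import java.util.*;

// Build an inverted index: term -> sorted set of document ids.
// Normalization here is just lowercasing, tokenizing on non-word
// characters, and dropping stop words (no stemming).
public class InvertedIndex {
    static final Set<String> STOP_WORDS = Set.of("at"); // tiny list for the example

    public static Map<String, Set<Integer>> build(Map<Integer, String> docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> e : docs.entrySet())
            for (String term : e.getValue().toLowerCase().split("\\W+"))
                if (!term.isEmpty() && !STOP_WORDS.contains(term))
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(e.getKey());
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(
            1, "Apache Lucene at Jazoon",
            2, "Jazoon conference");
        System.out.println(build(docs));
        // {apache=[1], conference=[2], jazoon=[1, 2], lucene=[1]}
    }
}
```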
• The web search engine responds to the query a user
enters in order to satisfy his or her information need
• A high-performance, scalable information retrieval
(IR) library
• Lets you add search capabilities to your
applications
• A free, open-source project implemented in Java
• With Lucene, you can index and search email
messages, mailing-list archives, instant-messenger
chats, your wiki pages…the list goes on.
• Web search engine software
• Open-source web crawler
• Coded entirely in the Java programming language
• Advantages:
• Scalability
• Crawler politeness
• Crawler management
• Quality
• Open-source enterprise search platform based on the
Apache Lucene project
• Powerful full-text search, hit highlighting, faceted
search
• Database integration and rich-document (e.g.,
Word, PDF) handling
• Download a binary package (apache-nutch-bin.zip)
• cd apache-nutch-1.X/
• bin/nutch crawl urls -dir crawl -depth 3 -topN 5
• You should now see the following directories
created:
• crawl/crawldb
• crawl/linkdb
• crawl/segments
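The `urls` argument in the crawl command above names a directory of seed files, which must exist before the crawl starts. A typical way to prepare it, following the Nutch 1.x tutorial (the seed URL below is only an example; use your own sites):

```shell
# Create the seed directory that "bin/nutch crawl urls ..." reads from,
# with one URL per line in a seed file.
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt
cat urls/seed.txt
```

Which domains the crawler may follow from these seeds can additionally be restricted via the regex-urlfilter configuration in `conf/`.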
• If you already have a Solr core set up and wish to index
to it, use
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
Then follow the Solr setup steps to set up your Solr instance
and index your crawl data.
• Download the binary package (apache-Solr-bin.zip)
• cd ${APACHE_SOLR_HOME}/example
• java -jar start.jar
• Once Solr has started, you should be able to
access the admin console at:
http://localhost:8983/solr/admin/
• Integrate Solr with Nutch:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
• Restart Solr with the command “java -jar start.jar”
under ${APACHE_SOLR_HOME}/example
• Run the Solr index command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
• Crawling the Egyptian universities
• Crawling the Arabic news websites
• Crawling the Arabic news websites
Mustafa Mohammed Ahmed Elkhiat
Email: melkhiat@gmail.com
A customized web search engine