A customized web search engine [autosaved]


A customized web search engine is a graduation project. This presentation explains what a search engine is and the open-source software used in the project.

Published in: Technology

Transcript of "A customized web search engine [autosaved]"

  1. Supervised By Dr. Mohamed A. El-Rashidy, Eng. Ahmed Ghozia, Dept. of Computer Science & Engineering, Faculty of Electronic Engineering, Menoufiya University.
  2. • The main purpose of this project is to build our own search engine, one that should suffice for our needs as a nation. • In this project we have tried to add customized features to the search engine, such as building and developing a time-based search engine meant to deal with local and international news.
  3. • Question: What is a search engine? • How do web search engines work? • Web crawling, indexing, ranking • Lucene, Nutch, Solr • Who uses Solr? • Setting up Nutch for web crawling • Setting up Solr for search • Running Nutch in Eclipse for development • Experiments
  4. • Answer: software that • builds an index on text • answers queries using that index • A search engine offers • scalability • relevance ranking • integration of different data sources (email, web pages, files, databases, ...)
  5. • A search engine operates in the following order: 1. Web crawling 2. Indexing 3. Ranking
  6. • A web crawler is a program or automated script that browses the World Wide Web • used to create a copy of all visited pages for later processing by a search engine • it starts with a list of URLs to visit, called the seeds • URLs are visited recursively according to a set of policies: • a selection policy • a re-visit policy • a politeness policy • a parallelization policy
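The crawl loop this slide describes can be sketched in Java, the language Nutch itself is written in. This is a minimal illustration under stated assumptions, not Nutch's actual code: the link graph and host names are made-up stand-ins for real HTTP fetching, and the depth limit plays the role of a trivial selection policy (real crawlers also apply re-visit, politeness, and parallelization policies).

```java
import java.util.*;

// Minimal crawl-loop sketch: start from seed URLs, pop from a frontier
// queue, record each visited page, and enqueue its unseen out-links up
// to a depth limit. The link graph is a hard-coded stand-in for HTTP.
public class CrawlerSketch {
    static List<String> crawl(Map<String, List<String>> linkGraph,
                              List<String> seeds, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>(seeds);         // avoid re-enqueueing
        Deque<String[]> frontier = new ArrayDeque<>();   // entries: {url, depth}
        for (String seed : seeds) frontier.add(new String[]{seed, "0"});
        while (!frontier.isEmpty()) {
            String[] entry = frontier.poll();
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);
            visited.add(url); // a real crawler would fetch and store the page here
            if (depth >= maxDepth) continue; // selection policy: depth cutoff
            for (String link : linkGraph.getOrDefault(url, List.of())) {
                if (seen.add(link)) {
                    frontier.add(new String[]{link, String.valueOf(depth + 1)});
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
            "a.example", List.of("b.example", "c.example"),
            "b.example", List.of("c.example", "d.example"));
        // With maxDepth 1, only the seed and its direct out-links are visited.
        System.out.println(crawl(graph, List.of("a.example"), 1));
    }
}
```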
  7. • The indexing process entails how data is collected, parsed, and stored to facilitate fast and accurate search query evaluation. • The process involves the following steps: • data collection • data traversal • indexing
  8. • Indexing process: • convert document • extract text and metadata • normalize text (stop-word removal, stemming) • write (inverted) index • Example: • Document 1: “Apache Lucene at Jazoon” • Document 2: “Jazoon conference” • Index: • apache -> 1 • conference -> 2 • jazoon -> 1, 2 • lucene -> 1
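The indexing steps above can be sketched in Java as a toy inverted index over the slide's two example documents. This is an illustration, not Lucene's implementation: lowercasing and splitting on non-letters stand in for real tokenization, stop-word removal, and stemming, so the stop word "at" also lands in this toy index even though the slide's example omits it.

```java
import java.util.*;

// Toy inverted index: map each normalized term to the sorted set of
// document IDs that contain it, as in the slide's example.
public class InvertedIndexDemo {
    static Map<String, SortedSet<Integer>> buildIndex(Map<Integer, String> docs) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            // Normalize: lowercase and split on non-letters (a stand-in
            // for real stop-word removal and stemming).
            for (String term : doc.getValue().toLowerCase().split("[^a-z]+")) {
                if (term.isEmpty()) continue;
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new LinkedHashMap<>();
        docs.put(1, "Apache Lucene at Jazoon");
        docs.put(2, "Jazoon conference");
        // Matches the slide: jazoon -> 1, 2; lucene -> 1; conference -> 2; ...
        System.out.println(buildIndex(docs));
    }
}
```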
  9. • The web search engine responds to a query that the user enters, returning results intended to satisfy his or her information need.
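One simple way to answer a query against an inverted index like the slide 8 example is to intersect the posting lists of the query terms (a conjunctive, or AND, query). The sketch below illustrates that idea in Java with hypothetical class and method names; it is not how Lucene actually evaluates queries, and it ignores ranking entirely.

```java
import java.util.*;

// Conjunctive query evaluation: the answer set is the intersection of
// the posting lists of all query terms.
public class QueryDemo {
    static SortedSet<Integer> andQuery(Map<String, SortedSet<Integer>> index,
                                       String... terms) {
        SortedSet<Integer> result = null;
        for (String term : terms) {
            SortedSet<Integer> postings =
                index.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) result = new TreeSet<>(postings); // first term
            else result.retainAll(postings);                      // intersect
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        // The index from the slide 8 example.
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        index.put("apache", new TreeSet<>(Set.of(1)));
        index.put("conference", new TreeSet<>(Set.of(2)));
        index.put("jazoon", new TreeSet<>(Set.of(1, 2)));
        index.put("lucene", new TreeSet<>(Set.of(1)));
        // Only document 1 contains both "jazoon" and "lucene".
        System.out.println(andQuery(index, "jazoon", "lucene"));
    }
}
```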
  10. • a high-performance, scalable information retrieval (IR) library • lets you add searching capabilities to your applications • a free, open source project implemented in Java • With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your wiki pages… the list goes on.
  11. • Web search engine software • Open source web crawler • Coded entirely in the Java programming language • Advantages: • scalability • crawler politeness • crawler management • quality
  12. • Open source enterprise search platform based on the Apache Lucene project • Powerful full-text search, hit highlighting, faceted search • Database integration, and rich document (e.g., Word, PDF) handling
  13. • Download a binary package (apache-nutch-bin.zip) • cd apache-nutch-1.X/ • bin/nutch crawl urls -dir crawl -depth 3 -topN 5 • You should now see the following directories created: • crawl/crawldb • crawl/linkdb • crawl/segments
  14. • If you have a Solr core already set up and wish to index to it, you should use: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5 • The next slides show how to set up your Solr instance and index your crawl data.
  15. • Download the binary package (apache-Solr-bin.zip) • cd ${APACHE_SOLR_HOME}/example • java -jar start.jar • After Solr has started, you should be able to access the admin console at http://localhost:8983/solr/admin/ • Integrate Solr with Nutch: cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
  16. • Restart Solr with the command “java -jar start.jar” under ${APACHE_SOLR_HOME}/example • Run the Solr index command: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
  17. • Crawling the Egyptian Universities
  18. • Crawling the Arabic news websites
  19. • Crawling the Arabic news websites
  20. Mustafa Mohammed Ahmed Elkhiat, Email: melkhiat@gmail.com