A customized web search engine [autosaved]
 

A customized web search engine is a graduation project. This presentation explains what a search engine is and the open-source software used in the project.

    Presentation Transcript

    • Supervised by Dr. Mohamed A. El-Rashidy and Eng. Ahmed Ghozia, Dept. of Computer Science & Engineering, Faculty of Electronic Engineering, Menoufiya University.
    •  The main purpose of this project is to build our own search engine that can meet our needs as a nation.  In this project we have tried to add customized features, such as building and developing a time-based search engine meant to deal with local and international news.
    •  Question: What is a search engine?  How does a web search engine work?  Web crawler, indexing, ranking  Lucene, Nutch, Solr  Who uses Solr?  Setting up Nutch for web crawling  Setting up Solr for search  Running Nutch in Eclipse for development  Experiments
    •  Answer: a piece of software that  builds an index on text  answers queries using that index  A search engine offers scalability, relevance ranking, and integration of different data sources (email, web pages, files, databases, ...)
    •  A search engine operates in the following order: 1. Web crawling 2. Indexing 3. Ranking
    •  A web crawler is a program or automated script that browses the World Wide Web  It is used to create a copy of all visited pages for later processing by a search engine  It starts with a list of URLs to visit, called the seeds  URLs are then visited recursively according to a set of policies:  a selection policy  a re-visit policy  a politeness policy  a parallelization policy  (a minimal sketch of these ideas follows)
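    To make those policies concrete, here is a minimal, self-contained crawler sketch in Java. It is only an illustration, not the project's crawler (the project uses Nutch): the seed URL https://example.com/ is a placeholder, the regex only follows absolute links, and a fixed one-second sleep stands in for a real politeness policy.

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;
        import java.util.ArrayDeque;
        import java.util.Deque;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        public class TinyCrawler {
            // Only absolute http(s) links are followed in this sketch.
            private static final Pattern LINK = Pattern.compile("href=\"(https?://[^\"]+)\"");

            public static void main(String[] args) throws Exception {
                // Seed list (placeholder URL) -- the crawl frontier starts here.
                Deque<String> frontier = new ArrayDeque<>(List.of("https://example.com/"));
                Set<String> visited = new HashSet<>();
                HttpClient client = HttpClient.newHttpClient();

                while (!frontier.isEmpty() && visited.size() < 10) { // stop after 10 pages
                    String url = frontier.poll();
                    if (!visited.add(url)) continue;                 // selection policy: visit each URL once
                    HttpResponse<String> resp = client.send(
                            HttpRequest.newBuilder(URI.create(url)).build(),
                            HttpResponse.BodyHandlers.ofString());
                    System.out.println("Fetched " + url + " (" + resp.body().length() + " chars)");
                    Matcher m = LINK.matcher(resp.body());
                    while (m.find()) frontier.add(m.group(1));       // enqueue outlinks for recursive visiting
                    Thread.sleep(1000);                              // politeness policy: one request per second
                }
            }
        }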
    •  The indexing process entails how data is collected, parsed, and stored to facilitate fast and accurate query evaluation.  It involves the following steps:  data collection  data traversal  indexing
    •  Indexing process:  Convert the document  Extract text and metadata  Normalize the text (stop-word removal, stemming)  Write the (inverted) index  Example:  Document 1: “Apache Lucene at Jazoon“  Document 2: “Jazoon conference“  Index:  apache -> 1  conference -> 2  jazoon -> 1, 2  lucene -> 1
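    The inverted index in the example above can be reproduced in a few lines of Java. This is only an illustrative sketch: the three-word stop-word list is made up, and normalization here is plain lowercasing rather than real stemming.

        import java.util.Map;
        import java.util.Set;
        import java.util.SortedSet;
        import java.util.TreeMap;
        import java.util.TreeSet;

        public class InvertedIndexDemo {
            public static void main(String[] args) {
                // The two example documents from the slide.
                Map<Integer, String> docs = Map.of(
                        1, "Apache Lucene at Jazoon",
                        2, "Jazoon conference");
                // Toy normalization: lowercase everything and drop a tiny stop-word list.
                Set<String> stopWords = Set.of("at", "the", "a");
                // term -> sorted set of document ids containing the term
                TreeMap<String, SortedSet<Integer>> index = new TreeMap<>();

                for (Map.Entry<Integer, String> doc : docs.entrySet()) {
                    for (String token : doc.getValue().toLowerCase().split("\\s+")) {
                        if (stopWords.contains(token)) continue;
                        index.computeIfAbsent(token, t -> new TreeSet<>()).add(doc.getKey());
                    }
                }
                // Prints: apache -> [1], conference -> [2], jazoon -> [1, 2], lucene -> [1]
                index.forEach((term, ids) -> System.out.println(term + " -> " + ids));
            }
        }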
    •  The web search engine responds to a query that a user enters in order to satisfy his or her information need
    •  Lucene is a high-performance, scalable information retrieval (IR) library  It lets you add searching capabilities to your applications  It is a free, open-source project implemented in Java  With Lucene, you can index and search email messages, mailing-list archives, instant-messenger chats, wiki pages…the list goes on.
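    A minimal sketch of indexing and searching with Lucene, assuming lucene-core and lucene-queryparser are on the classpath. The class names follow the Lucene 8.x API (older releases, closer to this presentation's era, use e.g. RAMDirectory instead of ByteBuffersDirectory), and the field name "body" is just an illustration.

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.document.TextField;
        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.queryparser.classic.QueryParser;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.ScoreDoc;
        import org.apache.lucene.store.ByteBuffersDirectory;

        public class LuceneDemo {
            public static void main(String[] args) throws Exception {
                ByteBuffersDirectory dir = new ByteBuffersDirectory(); // in-memory index
                StandardAnalyzer analyzer = new StandardAnalyzer();

                // Index a single document with one stored, tokenized text field.
                try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                    Document doc = new Document();
                    doc.add(new TextField("body", "Apache Lucene at Jazoon", Field.Store.YES));
                    writer.addDocument(doc);
                }

                // Search the index with a parsed query and print each hit's stored text.
                try (DirectoryReader reader = DirectoryReader.open(dir)) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    for (ScoreDoc hit : searcher.search(
                            new QueryParser("body", analyzer).parse("lucene"), 10).scoreDocs) {
                        System.out.println(searcher.doc(hit.doc).get("body"));
                    }
                }
            }
        }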
    •  Nutch is web search engine software  an open-source web crawler  coded entirely in the Java programming language  Advantages:  scalability  crawler politeness  crawler management  quality
    •  Solr is an open-source enterprise search platform based on the Apache Lucene project  Powerful full-text search, hit highlighting, and faceted search  Database integration and rich-document (e.g., Word, PDF) handling
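    As a taste of those features, a single request to Solr's select handler can combine full-text search, hit highlighting, and faceting. The hl, hl.fl, facet, and facet.field query parameters are standard Solr; the field names content and site are assumptions based on the schema Nutch ships for Solr.

        http://localhost:8983/solr/select?q=news&hl=true&hl.fl=content&facet=true&facet.field=site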
    •  Download a binary package (apache-nutch-bin.zip)  cd apache-nutch-1.X/  bin/nutch crawl urls -dir crawl -depth 3 -topN 5  You should now see the following directories created:  crawl/crawldb  crawl/linkdb  crawl/segments
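    The crawl command above reads its start URLs from the urls directory, so a seed list has to exist first. A minimal setup, following the Apache Nutch tutorial (the seed URL is just an example):

        mkdir urls
        echo "http://nutch.apache.org/" > urls/seed.txt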
    •  If you already have a Solr core set up and wish to index into it, use bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5  See the Solr setup steps below for how to set up your Solr instance and index your crawl data.
    •  Download the binary package (apache-Solr-bin.zip)  cd ${APACHE_SOLR_HOME}/example  java -jar start.jar  Once Solr has started, the admin console should be reachable at http://localhost:8983/solr/admin/  Integrate Solr with Nutch: cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
    •  Restart Solr with the command “java -jar start.jar” under ${APACHE_SOLR_HOME}/example  Run the Solr index command: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
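    Once solrindex finishes, a quick sanity check is to query the select handler directly; q=*:* matches every indexed document, and rows just limits the result count:

        http://localhost:8983/solr/select?q=*:*&rows=5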
    •  Crawling the Egyptian Universities
    •  Crawling the Arabic news websites
    • Mustafa Mohammed Ahmed Elkhiat  Email: melkhiat@gmail.com