SlideShare a Scribd company logo
Intelligent Crawling and Indexing
using Lucene


                 By
           Shiva Thatipelli
      Mohammad Zubair (Advisor)

      Contents
    Searching
   Indexing
   Lucene
   Indexing with Lucene
   Indexing Static and Dynamic Pages
   Extracting and Indexing Dynamic Pages
   Implementation
   Screens
Searching
   Looking up words in an index
   Factors Affecting Search
   Precision – How well the system can
    filter
   Speed
   Single, Multiple Phase queries, Results
    ranking, Sorting, Wild card queries,
    Range queries support
Indexing
   Sequential Search is bad (Not Scalable)
   Index speeds up selection
   Index is a special data structure which
    allows rapid searching.
   Different Index Implementations
        - B Trees
        - Hash Map
Search Process

                      Query


Docs                                 Docs


       Indexing API
                              Hits
                      Index
Lucene

   High-performance, full-featured text
    search engine library
   Written 100% in pure java
   Easy to use yet powerful API
   Jakarta Apache Product. Strong open
    source community support.
Why Lucene?
   Open source (Not proprietary)
   Easy to use, good documentation
   Interoperable - ex: Index generated by java
    can be used by VB, asp, perl application
   Powerful And Highly Scalable
   Index Format
       Designed for interoperability
       Well Documented
       Resides on File System, RAM, custom store
Continued
   Algorithms
       Efficient, fast and optimized
•   Incremental Indexing
•   Boolean Query, Fuzzy Query, Range Query,
    Multi Phrase Query, Wild Card Query etc…
•   Content Tagging – Documents as Collection
    of terms
   Heterogeneous documents - Useful when
    different set of metadata present for different
    mime types
Indexing With Lucene
   What type of documents can be
    indexed?
       Any document from which text can be
        fetched and extracted over the net with a
        URL
   Uses Inverted Index
     - The index stores statistics about
    terms in order to make term-based
    search more efficient.
Indexing With Lucene Contd…
 HTML            XLS                 WORD            PDF


     extracted         extracted         extracted         extracted

 Parser          Parser              Parser           Parser




                          Analyzer




                          Index
Indexing Static and Dynamic
Pages
   Static Pages which are HTML, XLS, WORD, PDF
    documents on web which can be easily crawled and
    indexed by search engines like Google and Yahoo.
   Static Pages over the internet can be passed into
    Lucene and indexed and searched with direct URLs.
   Dynamic Pages which are generated due to result of
    parameters submitted; like search results pages,
    Database hidden pages cannot be indexed with direct
    URLs.
   To index Dynamic Pages we need the parameters
    submitted by users to generate those pages.
Extracting and Indexing Dynamic
Pages
   Extracting dynamic web pages which also can be
    called as database hidden pages needs some kind of
    input to generate the URLs
   To get the input parameters, we used of Apache
    Access logs which contain user request as URL.
   A sample entry in Apache access log is as follows:
    127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET
    /archon/servlet/search?
    formname=simple&fulltext=maly&group=subject&sor
    t=title HTTP/1.1" 200 9560
Extracting and Indexing Dynamic
Pages Contd...
   It contains all the information like IP-address of the computer
    accessing the information, date, time information accessed,
    Method called, Request URL, HTTP version, and HTTP code.
   The Request URL is the one which has all the input parameters,
    in this case formname=simple
fulltext=maly group=subject        sort=title
   Results page is dynamic and dependent upon the parameters
    passed.
   A full URL like
    http://archon.cs.odu.edu:8066/archon/servlet/searc
     Can be generated from Request URL by appending Website
    address.
Indexing Dynamic Pages…
          Apache Logs



                        Parse and generate URL



         Results page         Could be any file type




            Analyzer




              Index
Implementation
   The above flow chart describes the way
    Apache logs are parsed and URLs are
    generated
   It shows how the Results pages are
    fetched and extracted from the URLs
   The Results page is sent for analysis
    then Lucene generates the index which
    will be used for future searches.
Demo
   Results:
   Hardware Environment
   Dedicated machine for indexing: No, but nominal usage at time
    of indexing.
   CPU: Intel x86 P4 2.8Ghz
   RAM: 512 DDR
   Drive configuration: IDE 7200rpm
   Software environment
   Lucene Version: 1.4
   Java Version: 1..2
   OS Version: Windows 2000
   Apache Web server version 1.3 to 2.0
   Location of index: local
Create Index
IndexByLog.java file reads the access logs on local computer, generates
the URLs, fetches and extracts the results page from the URLs and
indexes them and stores in LuceneIndex folder.
Files extraction and Index
Creation
Searching at the prompt
Searching on the web
Results on the web
Conclusion
   It is very easy to implement efficient and
    powerful search engines using Lucene
   Lucene can be used to index dynamic pages
    and database hidden pages
   Web Server Access logs can be used to
    generate URLs and Java, Lucene API can be
    used to fetch and index database hidden
    pages.
   There are some security risks involved as we
    can reveal what users are doing what
    searches and other sensitive information .
Questions?

More Related Content

What's hot

Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
Adrien Grand
 
Introduction To Apache Lucene
Introduction To Apache LuceneIntroduction To Apache Lucene
Introduction To Apache Lucene
Mindfire Solutions
 
Lucene
LuceneLucene
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
lucenerevolution
 
Azure search
Azure searchAzure search
Azure search
Alexej Sommer
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
abial
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
Adrien Grand
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
lucenerevolution
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
Grant Ingersoll
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
OpenSource Connections
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrRahul Jain
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
Andy Jackson
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
Abdelrahman Othman Helal
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
Jeremy Coates
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muir
lucenerevolution
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
Asad Abbas
 
Solr Architecture
Solr ArchitectureSolr Architecture
Solr Architecture
Ramez Al-Fayez
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
lucenerevolution
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 

What's hot (20)

Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
Introduction To Apache Lucene
Introduction To Apache LuceneIntroduction To Apache Lucene
Introduction To Apache Lucene
 
Lucene
LuceneLucene
Lucene
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Azure search
Azure searchAzure search
Azure search
 
Munching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processingMunching & crunching - Lucene index post-processing
Munching & crunching - Lucene index post-processing
 
Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?Berlin Buzzwords 2013 - How does lucene store your data?
Berlin Buzzwords 2013 - How does lucene store your data?
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Introduction to Apache Lucene/Solr
Introduction to Apache Lucene/SolrIntroduction to Apache Lucene/Solr
Introduction to Apache Lucene/Solr
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
Building a Search Engine Using Lucene
Building a Search Engine Using LuceneBuilding a Search Engine Using Lucene
Building a Search Engine Using Lucene
 
Search Lucene
Search LuceneSearch Lucene
Search Lucene
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muir
 
Advanced full text searching techniques using Lucene
Advanced full text searching techniques using LuceneAdvanced full text searching techniques using Lucene
Advanced full text searching techniques using Lucene
 
Solr Architecture
Solr ArchitectureSolr Architecture
Solr Architecture
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 

Viewers also liked

Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
Josiane Gamgo
 
The search engine index
The search engine indexThe search engine index
The search engine index
CJ Jenkins
 
Search Engine Powerpoint
Search Engine PowerpointSearch Engine Powerpoint
Search Engine Powerpoint201014161
 
Perspectivas da web semântica para a biblioteconomia
Perspectivas da web semântica para a biblioteconomiaPerspectivas da web semântica para a biblioteconomia
Perspectivas da web semântica para a biblioteconomia
Naira Michelle Alves Pereira
 
Construcción de una ontología OWL con protégé 4
Construcción de una ontología OWL con protégé 4Construcción de una ontología OWL con protégé 4
Construcción de una ontología OWL con protégé 4
Taniana Rodriguez
 
Blogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media BootcampBlogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media Bootcamp
wesleyzhao
 
Ted Dunning - Whither Hadoop
Ted Dunning - Whither HadoopTed Dunning - Whither Hadoop
Ted Dunning - Whither HadoopEd Kohlwey
 
Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching
Laxman Kotte
 
Search Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media BootcampSearch Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media Bootcamp
wesleyzhao
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Carlos Castillo (ChaTo)
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache Lucene
Josiane Gamgo
 
Solr
SolrSolr
Solr
sortivo
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-Webinar
Edureka!
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scriptingTony Fabeen
 

Viewers also liked (15)

Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
The search engine index
The search engine indexThe search engine index
The search engine index
 
Search Engine Powerpoint
Search Engine PowerpointSearch Engine Powerpoint
Search Engine Powerpoint
 
Perspectivas da web semântica para a biblioteconomia
Perspectivas da web semântica para a biblioteconomiaPerspectivas da web semântica para a biblioteconomia
Perspectivas da web semântica para a biblioteconomia
 
Construcción de una ontología OWL con protégé 4
Construcción de una ontología OWL con protégé 4Construcción de una ontología OWL con protégé 4
Construcción de una ontología OWL con protégé 4
 
Blogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media BootcampBlogging With Word Press -Social Media Bootcamp
Blogging With Word Press -Social Media Bootcamp
 
Ted Dunning - Whither Hadoop
Ted Dunning - Whither HadoopTed Dunning - Whither Hadoop
Ted Dunning - Whither Hadoop
 
Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching Difference Between Crawling, Indexing and Caching
Difference Between Crawling, Indexing and Caching
 
Search Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media BootcampSearch Engine Optimization - Social Media Bootcamp
Search Engine Optimization - Social Media Bootcamp
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
 
Architecture and implementation of Apache Lucene
Architecture and implementation of Apache LuceneArchitecture and implementation of Apache Lucene
Architecture and implementation of Apache Lucene
 
Solr
SolrSolr
Solr
 
Apache Solr-Webinar
Apache Solr-WebinarApache Solr-Webinar
Apache Solr-Webinar
 
Devinsampa nginx-scripting
Devinsampa nginx-scriptingDevinsampa nginx-scripting
Devinsampa nginx-scripting
 
Index types
Index typesIndex types
Index types
 

Similar to Intelligent crawling and indexing using lucene

Solr中国6月21日企业搜索
Solr中国6月21日企业搜索Solr中国6月21日企业搜索
Solr中国6月21日企业搜索longkeyy
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
Francisco Gonçalves
 
SharePoint Developer Education Day Palo Alto
SharePoint  Developer Education Day  Palo  AltoSharePoint  Developer Education Day  Palo  Alto
SharePoint Developer Education Day Palo Alto
llangit
 
PyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in pythonPyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in pythonChetan Giridhar
 
Solr -
Solr - Solr -
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
gramana
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal Computer
IOSR Journals
 
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
EPC Group
 
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksIntroduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Lucidworks
 
A customized web search engine [autosaved]
A customized web search engine [autosaved]A customized web search engine [autosaved]
A customized web search engine [autosaved]
Mustafa Elkhiat
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Manish kumar
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
Alkuvoima
 
Design a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basicsDesign a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basics
Alexander Meijers
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdf
Abanti Aazmin
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
Alex Kursov
 
Enterprise Search in SharePoint 2010
Enterprise Search in SharePoint 2010Enterprise Search in SharePoint 2010
Enterprise Search in SharePoint 2010
bgerman
 
KnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseKnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseLaurent Alquier
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...
DESTIN-Informatique.com
 
Ikenstudiolive
IkenstudioliveIkenstudiolive
R01765113122
R01765113122R01765113122
R01765113122
IOSR Journals
 

Similar to Intelligent crawling and indexing using lucene (20)

Solr中国6月21日企业搜索
Solr中国6月21日企业搜索Solr中国6月21日企业搜索
Solr中国6月21日企业搜索
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
 
SharePoint Developer Education Day Palo Alto
SharePoint  Developer Education Day  Palo  AltoSharePoint  Developer Education Day  Palo  Alto
SharePoint Developer Education Day Palo Alto
 
PyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in pythonPyCon India 2012: Rapid development of website search in python
PyCon India 2012: Rapid development of website search in python
 
Solr -
Solr - Solr -
Solr -
 
Search Me: Using Lucene.Net
Search Me: Using Lucene.NetSearch Me: Using Lucene.Net
Search Me: Using Lucene.Net
 
Searching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal ComputerSearching and Analyzing Qualitative Data on Personal Computer
Searching and Analyzing Qualitative Data on Personal Computer
 
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
EPC Group - Comprehensive Overview of SharePoint 2010's Enterprise Search Cap...
 
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, LucidworksIntroduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
Introduction to Lucidworks Fusion - Alexander Kanarsky, Lucidworks
 
A customized web search engine [autosaved]
A customized web search engine [autosaved]A customized web search engine [autosaved]
A customized web search engine [autosaved]
 
Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)Search Engine Capabilities - Apache Solr(Lucene)
Search Engine Capabilities - Apache Solr(Lucene)
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
Design a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basicsDesign a share point 2013 architecture – the basics
Design a share point 2013 architecture – the basics
 
Apace Solr Web Development.pdf
Apace Solr Web Development.pdfApace Solr Web Development.pdf
Apace Solr Web Development.pdf
 
Wanna search? Piece of cake!
Wanna search? Piece of cake!Wanna search? Piece of cake!
Wanna search? Piece of cake!
 
Enterprise Search in SharePoint 2010
Enterprise Search in SharePoint 2010Enterprise Search in SharePoint 2010
Enterprise Search in SharePoint 2010
 
KnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge baseKnowIT, semantic informatics knowledge base
KnowIT, semantic informatics knowledge base
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...
 
Ikenstudiolive
IkenstudioliveIkenstudiolive
Ikenstudiolive
 
R01765113122
R01765113122R01765113122
R01765113122
 

Recently uploaded

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 

Recently uploaded (20)

Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 

Intelligent crawling and indexing using lucene

  • 1. Intelligent Crawling and Indexing using Lucene By Shiva Thatipelli Mohammad Zubair (Advisor)
  • 2. Contents Searching  Indexing  Lucene  Indexing with Lucene  Indexing Static and Dynamic Pages  Extracting and Indexing Dynamic Pages  Implementation  Screens
  • 3. Searching  Looking up words in an index  Factors Affecting Search  Precision – How well the system can filter  Speed  Single, Multiple Phase queries, Results ranking, Sorting, Wild card queries, Range queries support
  • 4. Indexing  Sequential Search is bad (Not Scalable)  Index speeds up selection  Index is a special data structure which allows rapid searching.  Different Index Implementations - B Trees - Hash Map
  • 5. Search Process Query Docs Docs Indexing API Hits Index
  • 6. Lucene  High-performance, full-featured text search engine library  Written 100% in pure java  Easy to use yet powerful API  Jakarta Apache Product. Strong open source community support.
  • 7. Why Lucene?  Open source (Not proprietary)  Easy to use, good documentation  Interoperable - ex: Index generated by java can be used by VB, asp, perl application  Powerful And Highly Scalable  Index Format  Designed for interoperability  Well Documented  Resides on File System, RAM, custom store
  • 8. Continued  Algorithms  Efficient, fast and optimized • Incremental Indexing • Boolean Query, Fuzzy Query, Range Query, Multi Phrase Query, Wild Card Query etc… • Content Tagging – Documents as Collection of terms  Heterogeneous documents - Useful when different set of metadata present for different mime types
  • 9. Indexing With Lucene  What type of documents can be indexed?  Any document from which text can be fetched and extracted over the net with a URL  Uses Inverted Index - The index stores statistics about terms in order to make term-based search more efficient.
  • 10. Indexing With Lucene Contd… HTML XLS WORD PDF extracted extracted extracted extracted Parser Parser Parser Parser Analyzer Index
  • 11. Indexing Static and Dynamic Pages  Static Pages which are HTML, XLS, WORD, PDF documents on web which can be easily crawled and indexed by search engines like Google and Yahoo.  Static Pages over the internet can be passed into Lucene and indexed and searched with direct URLs.  Dynamic Pages which are generated due to result of parameters submitted; like search results pages, Database hidden pages cannot be indexed with direct URLs.  To index Dynamic Pages we need the parameters submitted by users to generate those pages.
  • 12. Extracting and Indexing Dynamic Pages  Extracting dynamic web pages which also can be called as database hidden pages needs some kind of input to generate the URLs  To get the input parameters, we used of Apache Access logs which contain user request as URL.  A sample entry in Apache access log is as follows: 127.0.0.1 - - [31/Aug/2005:18:44:03 -0400] "GET /archon/servlet/search? formname=simple&fulltext=maly&group=subject&sor t=title HTTP/1.1" 200 9560
  • 13. Extracting and Indexing Dynamic Pages Contd...  It contains all the information like IP-address of the computer accessing the information, date, time information accessed, Method called, Request URL, HTTP version, and HTTP code.  The Request URL is the one which has all the input parameters, in this case formname=simple fulltext=maly group=subject sort=title  Results page is dynamic and dependent upon the parameters passed.  A full URL like http://archon.cs.odu.edu:8066/archon/servlet/searc Can be generated from Request URL by appending Website address.
  • 14. Indexing Dynamic Pages… Apache Logs Parse and generate URL Results page Could be any file type Analyzer Index
  • 15. Implementation  The above flow chart describes the way Apache logs are parsed and URLs are generated  It shows how the Results pages are fetched and extracted from the URLs  The Results page is sent for analysis then Lucene generates the index which will be used for future searches.
  • 16. Demo
  • 17. Results:  Hardware Environment  Dedicated machine for indexing: No, but nominal usage at time of indexing.  CPU: Intel x86 P4 2.8Ghz  RAM: 512 DDR  Drive configuration: IDE 7200rpm  Software environment  Lucene Version: 1.4  Java Version: 1..2  OS Version: Windows 2000  Apache Web server version 1.3 to 2.0  Location of index: local
  • 18. Create Index IndexByLog.java file reads the access logs on local computer, generates the URLs, fetches and extracts the results page from the URLs and indexes them and stores in LuceneIndex folder.
  • 19. Files extraction and Index Creation
  • 23. Conclusion  It is very easy to implement efficient and powerful search engines using Lucene  Lucene can be used to index dynamic pages and database hidden pages  Web Server Access logs can be used to generate URLs and Java, Lucene API can be used to fetch and index database hidden pages.  There are some security risks involved as we can reveal what users are doing what searches and other sensitive information .