Sandhan (CLIA) - Nutch and Lucene Framework
  Gaurav Arora
  IRLAB, DA-IICT

Outline
  Introduction
  Behavior of Nutch (Offline and Online)
  Lucene Features
  Sandhan Demo
  RJ Interface

Introduction
  Nutch is an open-source search engine
  Implemented in Java
  Nutch is composed of Lucene, Solr, Hadoop, etc.
  Lucene implements indexing and searching of the crawled data
  Both Nutch and Lucene are built on a plugin framework
    Easy to customize

Where do they fit in IR? [figure]

Nutch – complete search engine [figure]

Nutch – offline processing
  Crawling
    Starts with a set of seed URLs
    Goes deeper into the web and starts fetching the content
    Content needs to be analyzed before storing
    Storing the content
    Makes it suitable for searching
  Issues
    Time-consuming process
    Freshness of the crawl (How often should I crawl?)
    Coverage of content

Nutch – online processing
  Searching
    Analysis of the query
    Processing of a few words (tokens) in the query
    Query tokens matched against stored tokens (index)
  Fast and accurate
  Involves ordering the matching results
  Ranking affects the user's satisfaction directly
  Supports distributed searching

Nutch – Data structures
  Web Database or WebDB
    Mirrors the properties/structure of the web graph being crawled
  Segment
    Intermediate index
    Contains pages fetched in a single run
  Index
    Final inverted index obtained by "merging" segments (Lucene)

Nutch – Data
  Web Database or WebDB
    CrawlDB: contains information about every URL known to Nutch, including whether it was fetched.
    LinkDB: contains the list of known links to each URL, including both the source URL and the anchor text of the link.
  Index
    Inverted index: posting lists, mapping each word to the documents that contain it.

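The word-to-documents mapping described above can be illustrated with a minimal sketch in plain Java. This is not Lucene's on-disk format; the class name `InvertedIndexSketch`, the tokenization rule, and the sample documents are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {
    // term -> posting list: IDs of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Split the text into lowercase tokens and record docId once per term.
    public void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
            if (!list.contains(docId)) {
                list.add(docId);
            }
        }
    }

    // Return the posting list for a term, or an empty list if the term is unknown.
    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), List.<Integer>of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Nutch crawls the web");
        index.add(2, "Lucene indexes what Nutch crawls");
        System.out.println(index.lookup("nutch"));  // [1, 2]
        System.out.println(index.lookup("lucene")); // [2]
    }
}
```

A real index would also keep term frequencies and positions in each posting; this sketch keeps only document IDs.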
Nutch Data - Segment
  Each segment is a set of URLs that are fetched as a unit.
  A segment contains:
    crawl_generate: names a set of URLs to be fetched
    crawl_fetch: contains the status of fetching each URL
    content: contains the raw content retrieved from each URL
    parse_text: contains the parsed text of each URL
    parse_data: contains outlinks and metadata parsed from each URL
    crawl_parse: contains the outlink URLs, used to update the CrawlDB

Nutch – Crawling
  Inject: initial creation of CrawlDB
    Insert seed URLs
    Initial LinkDB is empty
  Generate new shard's fetchlist
  Fetch raw content
  Parse content (discovers outlinks)
  Update CrawlDB from shards
  Update LinkDB from shards
  Index shards

Wide Crawling vs. Focused Crawling
  Differences:
    Little technical difference in configuration
    Big difference in operations, maintenance and quality
  Wide crawling:
    (Almost) unlimited crawling frontier
    High risk of spamming and junk content
    "Politeness" a very important limiting factor
    Bandwidth & DNS considerations
  Focused (vertical or enterprise) crawling:
    Limited crawling frontier
    Bandwidth or politeness is often not an issue
    Low risk of spamming and junk content

Crawling Architecture
  Step 1: The Injector injects the list of seed URLs into the CrawlDB.
  Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms a fetch list, and adds a crawl_generate folder to the segments.
  Step 3: Fetchers use these fetch lists to fetch the raw content of the documents, which is then stored in segments.
  Step 4: The Parser is called to parse the content of the documents, and the parsed content is stored back in segments.
  Step 5: The links are inverted in the link graph and stored in the LinkDB.
  Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.
  Step 7: Information on the newly fetched documents is updated in the CrawlDB.

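The inject, generate, fetch, parse and update steps can be followed on a toy example. The sketch below is a single-process simulation in plain Java, not Nutch code: the four-page `WEB` map, the class name `CrawlCycleSketch`, and the use of in-memory maps for the CrawlDB and LinkDB are all assumptions made for illustration, and indexing is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class CrawlCycleSketch {
    // Hypothetical in-memory "web": URL -> outlinks found when the page is parsed.
    static final Map<String, List<String>> WEB = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("c"),
            "c", List.of("a", "d"),
            "d", List.<String>of());

    final Map<String, Boolean> crawlDb = new TreeMap<>();    // URL -> already fetched?
    final Map<String, Set<String>> linkDb = new TreeMap<>(); // URL -> pages linking to it

    // Inject: seed the CrawlDB with unfetched URLs.
    void inject(List<String> seeds) {
        for (String url : seeds) {
            crawlDb.putIfAbsent(url, false);
        }
    }

    // Generate: the fetch list is every known URL that has not been fetched yet.
    List<String> generate() {
        List<String> fetchList = new ArrayList<>();
        for (Map.Entry<String, Boolean> e : crawlDb.entrySet()) {
            if (!e.getValue()) {
                fetchList.add(e.getKey());
            }
        }
        return fetchList;
    }

    void crawl(int depth) {
        for (int round = 0; round < depth; round++) {
            // Fetch + parse: one segment holds the outlinks of this round's pages.
            Map<String, List<String>> segment = new TreeMap<>();
            for (String url : generate()) {
                segment.put(url, WEB.getOrDefault(url, List.<String>of()));
            }
            // Update the CrawlDB and invert the links into the LinkDB.
            for (Map.Entry<String, List<String>> e : segment.entrySet()) {
                crawlDb.put(e.getKey(), true);
                for (String out : e.getValue()) {
                    crawlDb.putIfAbsent(out, false);
                    linkDb.computeIfAbsent(out, k -> new TreeSet<>()).add(e.getKey());
                }
            }
        }
    }

    public static void main(String[] args) {
        CrawlCycleSketch crawler = new CrawlCycleSketch();
        crawler.inject(List.of("a"));
        crawler.crawl(2);
        System.out.println(crawler.crawlDb); // {a=true, b=true, c=true, d=false}
        System.out.println(crawler.linkDb);  // {a=[c], b=[a], c=[a, b], d=[c]}
    }
}
```

After two rounds, page "d" is known but not yet fetched, which is the state a third generate step would pick up.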
Crawling: 10 stage process
  bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
    1. admin db -create: Create a new WebDB.
    2. inject: Inject root URLs into the WebDB.
    3. generate: Generate a fetchlist from the WebDB in a new segment.
    4. fetch: Fetch content from URLs in the fetchlist.
    5. updatedb: Update the WebDB with links from fetched pages.
    6. Repeat steps 3-5 until the required depth is reached.
    7. updatesegs: Update segments with scores and links from the WebDB.
    8. index: Index the fetched pages.
    9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
    10. merge: Merge the indexes into a single index for searching.

De-duplication Algorithm
  Tuple recorded for each page:
    (MD5 hash, float score, int indexID, int docID, int urlLen)
  To eliminate URL duplicates from a segmentsDir:
    open a temporary file
    for each segment:
      for each document in its index:
        append a tuple for the document to the temporary file with hash=MD5(URL)
    close the temporary file

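The tuple-building pass above can be sketched in plain Java. The slide stops once the temporary file is closed, so the grouping step here is an assumption: tuples are sorted by hash and every tuple except the highest-scoring one in each group is reported for deletion. An in-memory list stands in for the temporary file, and the names `UrlDedupSketch`, `Entry` and `duplicates` are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

public class UrlDedupSketch {
    // One tuple per indexed page, mirroring the fields listed on the slide.
    static final class Entry {
        final String hash;
        final float score;
        final int indexId;
        final int docId;
        final int urlLen;

        Entry(String url, float score, int indexId, int docId) {
            this.hash = md5Hex(url); // hash = MD5(URL)
            this.score = score;
            this.indexId = indexId;
            this.docId = docId;
            this.urlLen = url.length();
        }
    }

    // Hex-encoded MD5 digest of the UTF-8 bytes of s.
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Assumed policy: sort by hash, then keep the best-scoring tuple of each group.
    static List<Entry> duplicates(List<Entry> entries) {
        List<Entry> sorted = new ArrayList<>(entries);
        sorted.sort((a, b) -> {
            int byHash = a.hash.compareTo(b.hash);
            return byHash != 0 ? byHash : Float.compare(b.score, a.score);
        });
        List<Entry> toDelete = new ArrayList<>();
        String previousHash = null;
        for (Entry e : sorted) {
            if (e.hash.equals(previousHash)) {
                toDelete.add(e); // same URL hash as a better-scoring tuple
            }
            previousHash = e.hash;
        }
        return toDelete;
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
                new Entry("http://a/", 1.0f, 0, 0),
                new Entry("http://b/", 0.5f, 0, 1),
                new Entry("http://a/", 2.0f, 1, 0));
        for (Entry e : duplicates(entries)) {
            System.out.println("delete index " + e.indexId + ", doc " + e.docId);
        }
    }
}
```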
URL Filtering
  URL filters (text file): conf/crawl-urlfilter.txt
    Regular expressions to filter URLs during crawling
    E.g.
      To ignore files with a certain suffix:
        -\.(gif|exe|zip|ico)$
      To accept hosts in a certain domain:
        +^http://([a-z0-9]*\.)*apache.org/

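How such a rule file behaves can be sketched in plain Java. This is an illustration, not Nutch's filter plugin. It assumes that rules are tried in file order, that the first matching rule decides, and that a URL matching no rule is rejected; the class name `UrlFilterSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class UrlFilterSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each non-blank, non-comment line is '+' or '-' followed by a regular expression.
    public UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    // First matching rule wins; no match means the URL is dropped (assumed default).
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(List.of(
                "# skip binary and image suffixes",
                "-\\.(gif|exe|zip|ico)$",
                "+^http://([a-z0-9]*\\.)*apache.org/"));
        System.out.println(filter.accepts("http://nutch.apache.org/index.html")); // true
        System.out.println(filter.accepts("http://nutch.apache.org/logo.gif"));   // false
        System.out.println(filter.accepts("http://example.com/"));                // false
    }
}
```

Rule order matters in this model: the suffix rule comes first so that an image on an accepted host is still skipped.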
A few APIs
  Site we would crawl: http://www.iitb.ac.in
    bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
  Analyze the database:
    bin/nutch readdb <db dir> -stats
    bin/nutch readdb <db dir> -dumppageurl
    bin/nutch readdb <db dir> -dumplinks
    s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s

Map-Reduce Function
  Works in a distributed environment
  map() and reduce() functions are implemented in most of the modules
  Both map() and reduce() functions use <key, value> pairs
  Useful when processing large data (e.g. indexing)
  Some applications need a sequence of map-reduce steps
    Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n

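The <key, value> flow through map() and reduce() can be shown with a term-count example. The sketch below runs in a single process and does not use Hadoop's API; the class name `MapReduceSketch` and the driver method `run` are hypothetical, and the grouping step stands in for the distributed shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // map(): emit a <term, 1> pair for every term in one document.
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String term : document.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                pairs.add(Map.entry(term, 1));
            }
        }
        return pairs;
    }

    // reduce(): combine all values emitted for one key into a single count.
    static int reduce(String term, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Driver: map every input, group the pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String document : documents) {
            for (Map.Entry<String, Integer> pair : map(document)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("nutch lucene", "lucene hadoop lucene")));
        // {hadoop=1, lucene=3, nutch=1}
    }
}
```

Chaining jobs, as in Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n, amounts to feeding one job's output pairs into the next job's map().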
Map-Reduce Architecture [figure]

Nutch – Map-Reduce Indexing
  map() just assembles all parts of documents
  reduce() performs text analysis + indexing:
    Adds to a local Lucene index
  Other possible MR indexing models:
    Hadoop contrib/indexing model:
      Analysis and indexing on map() side
      Index merging on reduce() side
    Modified Nutch model:
      Analysis on map() side
      Indexing on reduce() side

Nutch - Ranking
  Nutch Ranking
    queryNorm(): the normalization factor for the query
    coord(): how many of the query terms are present in the given document
    norm(): field-based normalization factor
    tf: term frequency; idf: inverse document frequency
    t.boost(): the importance of a term's occurrence in a particular field

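This slide appears to have held a formula that did not survive extraction. The factors listed above match Lucene's classic TF-IDF practical scoring function, which is commonly written as below. The symbols q (query), d (document) and t (term) are introduced here and are not taken from the slide.

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t \in d)\cdot \mathrm{idf}(t)^{2}\cdot
  t.\mathrm{boost}()\cdot \mathrm{norm}(t,d) \Big)
```

Each query term contributes one summand, so a document that matches more of the query terms gains both through the sum and through coord().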
Lucene - Features
  Field-based indexing and searching
  Different fields of a webpage are
    Title
    URL
    Anchor text
    Content, etc.
  Different boost factors to give importance to fields
  Uses an inverted index to store the content of crawled documents
  Open-source Apache project

Lucene - Index
  Concepts
    Index: sequence of documents (a.k.a. Directory)
    Document: sequence of fields
    Field: named sequence of terms
    Term: a text string (e.g., a word)
  Statistics
    Term frequencies and positions

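Writing to Index
  // Older Lucene API, in the style of Lucene in Action (1st ed.)
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.IndexWriter;

  // true = create a new index, overwriting any existing one
  IndexWriter writer = new IndexWriter(directory, analyzer, true);

  Document doc = new Document();
  // add fields to document (next slide)
  writer.addDocument(doc);
  writer.close();
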
Adding Fields
  import org.apache.lucene.document.DateField;
  import org.apache.lucene.document.Field;

  doc.add(Field.Keyword("isbn", isbn));
  doc.add(Field.Keyword("category", category));
  doc.add(Field.Text("title", title));
  doc.add(Field.Text("author", author));
  doc.add(Field.UnIndexed("url", url));
  doc.add(Field.UnStored("subjects", subjects, true));
  doc.add(Field.Keyword("pubmonth", pubmonth));
  doc.add(Field.UnStored("contents", author + " " + subjects));
  doc.add(Field.Keyword("modified",
      DateField.timeToString(file.lastModified())));

Fields Description
  Attributes
    Stored: original content retrievable
    Indexed: inverted, searchable
    Tokenized: analyzed, split into tokens
  Factory methods
    Keyword: stored and indexed as a single term
    Text: indexed, tokenized, and stored if String
    UnIndexed: stored
    UnStored: indexed, tokenized
  Terms are what matters for searching

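Searching an Index
  import org.apache.lucene.document.Document;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.Hits;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;

  IndexSearcher searcher = new IndexSearcher(directory);

  // Parse the query against the "contents" field
  Query query = QueryParser.parse(queryExpression, "contents", analyzer);

  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("title"));
  }
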
Analyzer
  Analysis occurs
    For each tokenized field during indexing
    For each term or phrase in QueryParser
  Several analyzers built-in
    Many more in the sandbox
    Straightforward to create your own
  Choosing the right analyzer is important!

WhiteSpace Analyzer
  The quick brown fox jumps over the lazy dog.

  [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]

Simple Analyzer
  The quick brown fox jumps over the lazy dog.

  [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]

Stop Analyzer
  The quick brown fox jumps over the lazy dog.

  [quick] [brown] [fox] [jumps] [over] [lazy] [dog]

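The three token streams shown so far (whitespace, simple, stop) can be reproduced with a small plain-Java sketch. It only mimics the behavior seen on the slides and does not use Lucene's analyzer classes; the class name `AnalyzerSketch` and the short `STOP_WORDS` list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerSketch {
    // Hypothetical stop list, kept short for the example.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the", "to");

    // Split on a regex and drop empty pieces.
    static List<String> split(String text, String regex) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split(regex)) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Whitespace style: split on whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        return split(text, "\\s+");
    }

    // Simple style: lowercase, then split at every non-letter character.
    static List<String> simple(String text) {
        return split(text.toLowerCase(Locale.ROOT), "[^\\p{L}]+");
    }

    // Stop style: simple tokenization, then remove stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog.";
        System.out.println(whitespace(text));
        // [The, quick, brown, fox, jumps, over, the, lazy, dog.]
        System.out.println(simple(text));
        // [the, quick, brown, fox, jumps, over, the, lazy, dog]
        System.out.println(stop(text));
        // [quick, brown, fox, jumps, over, lazy, dog]
    }
}
```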
Snowball Analyzer
  The quick brown fox jumps over the lazy dog.

  [the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]

Query Creation
  Searching by a term – TermQuery
  Searching within a range – RangeQuery
  Searching on a string prefix – PrefixQuery
  Combining queries – BooleanQuery
  Searching by phrase – PhraseQuery
  Searching by wildcard – WildcardQuery
  Searching for similar terms – FuzzyQuery

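What several of these query types select can be illustrated by matching against a small sorted term dictionary. The sketch below is plain Java and does not use Lucene's query classes. The class name `QueryMatchSketch` is hypothetical, fuzzy matching is approximated with an edit-distance limit (an assumption, not necessarily Lucene's exact criterion), and boolean and phrase queries are left out because they operate on documents rather than on single terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.regex.Pattern;

public class QueryMatchSketch {
    private final TreeSet<String> terms = new TreeSet<>(); // sorted term dictionary

    QueryMatchSketch(List<String> indexedTerms) {
        terms.addAll(indexedTerms);
    }

    private List<String> filter(Predicate<String> test) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (test.test(term)) {
                matches.add(term);
            }
        }
        return matches;
    }

    // Term query: exact match only.
    List<String> term(String t) { return filter(t::equals); }

    // Range query: terms between lo and hi, both inclusive.
    List<String> range(String lo, String hi) {
        return new ArrayList<>(terms.subSet(lo, true, hi, true));
    }

    // Prefix query: terms starting with the given string.
    List<String> prefix(String p) { return filter(term -> term.startsWith(p)); }

    // Wildcard query: '*' matches any run of characters, '?' exactly one.
    List<String> wildcard(String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        Pattern compiled = Pattern.compile(regex.toString());
        return filter(term -> compiled.matcher(term).matches());
    }

    // Fuzzy query: terms within maxEdits single-character edits of t.
    List<String> fuzzy(String t, int maxEdits) {
        return filter(term -> editDistance(t, term) <= maxEdits);
    }

    // Levenshtein distance computed row by row.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        QueryMatchSketch index = new QueryMatchSketch(
                List.of("box", "fix", "fox", "foxes", "quick", "quiet"));
        System.out.println(index.prefix("qui"));       // [quick, quiet]
        System.out.println(index.wildcard("f?x"));     // [fix, fox]
        System.out.println(index.fuzzy("fox", 1));     // [box, fix, fox]
        System.out.println(index.range("box", "fox")); // [box, fix, fox]
    }
}
```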
Lucene Queries [figure]

Conclusions
  Nutch as a starting point
  Crawling in Nutch
  Detailed map-reduce architecture
  Different query formats in Lucene
  Built-in analyzers in Lucene
  The same analyzer needs to be used for both indexing and searching

Resources Used
  Gospodnetic, Otis; Erik Hatcher (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. pp. 456. ISBN 978-1-932394-28-3.
  Nutch Wiki: http://wiki.apache.org/nutch/

Thanks
  Questions?

Apidays New York 2024 - The value of a flexible API Management solution for O...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

Nutch and Lucene Framework

  • 1. Sandhan (CLIA) - Nutch and Lucene Framework - Gaurav Arora, IRLAB, DA-IICT
  • 2. Outline
      • Introduction
      • Behavior of Nutch (Offline and Online)
      • Lucene Features
      • Sandhan Demo
      • RJ Interface
  • 3. Introduction
      • Nutch is an open-source search engine
      • Implemented in Java
      • Nutch is comprised of Lucene, Solr, Hadoop, etc.
      • Lucene is an implementation of indexing and searching crawled data
      • Both Nutch and Lucene are developed using a plugin framework
      • Easy to customize
  • 4. Where do they fit in IR?
  • 5. Nutch – complete search engine
  • 6. Nutch – offline processing
      • Crawling
          • Starts with a set of seed URLs
          • Goes deeper into the web and starts fetching the content
          • Content needs to be analyzed before storing
          • Storing the content
          • Makes it suitable for searching
      • Issues
          • Time-consuming process
          • Freshness of the crawl (How often should I crawl?)
          • Coverage of content
  • 7. Nutch – online processing
      • Searching
          • Analysis of the query
          • Processing of the few words (tokens) in the query
          • Query tokens are matched against stored tokens (the index)
      • Fast and accurate
      • Involves ordering the matching results
      • Ranking affects the user's satisfaction directly
      • Supports distributed searching
  • 8.
  • 9. Nutch – Data structures
      • Web Database or WebDB
          • Mirrors the properties/structure of the web graph being crawled
      • Segment
          • Intermediate index
          • Contains pages fetched in a single run
      • Index
          • Final inverted index obtained by "merging" segments (Lucene)
  • 10. Nutch – Data
      • Web Database or WebDB
          • CrawlDB: contains information about every URL known to Nutch, including whether it was fetched.
          • LinkDB: contains the list of known links to each URL, including both the source URL and the anchor text of the link.
      • Index
          • Inverted index: posting lists, i.e. a mapping from each word to the documents that contain it.
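
The posting-list idea on this slide can be illustrated with a minimal in-memory sketch. It is not Nutch or Lucene code: the class name `InvertedIndexSketch`, the lowercase-and-split tokenization, and the integer document ids are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; not part of Nutch or Lucene.
class InvertedIndexSketch {
    // term -> sorted set of document ids (the posting list for that term)
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Tokenizes the text (assumed rule: lowercase, split on non-alphanumerics) and records docId under each term. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when text starts with a separator
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the posting list for a term, or an empty set if the term was never indexed. */
    SortedSet<Integer> lookup(String term) {
        SortedSet<Integer> docs = postings.get(term.toLowerCase());
        return docs != null ? docs : new TreeSet<>();
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Nutch crawls the web");
        index.add(2, "Lucene indexes the crawl");
        System.out.println(index.lookup("the"));
    }
}
```

Looking up a word is a single map access, which is why query tokens can be matched against the index quickly at search time.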
  • 11. Nutch Data - Segment
      • Each segment is a set of URLs that are fetched as a unit.
      • A segment contains:
          • crawl_generate: names a set of URLs to be fetched
          • crawl_fetch: contains the status of fetching each URL
          • content: contains the raw content retrieved from each URL
          • parse_text: contains the parsed text of each URL
          • parse_data: contains outlinks and metadata parsed from each URL
          • crawl_parse: contains the outlink URLs, used to update the CrawlDB
  • 12. Nutch – Crawling
      • Inject: initial creation of CrawlDB
          • Insert seed URLs
          • Initial LinkDB is empty
      • Generate new shard's fetchlist
      • Fetch raw content
      • Parse content (discovers outlinks)
      • Update CrawlDB from shards
      • Update LinkDB from shards
      • Index shards
  • 13. Wide Crawling vs. Focused Crawling
      • Differences:
          • Little technical difference in configuration
          • Big difference in operations, maintenance and quality
      • Wide crawling:
          • (Almost) unlimited crawling frontier
          • High risk of spamming and junk content
          • "Politeness" is a very important limiting factor
          • Bandwidth & DNS considerations
      • Focused (vertical or enterprise) crawling:
          • Limited crawling frontier
          • Bandwidth or politeness is often not an issue
          • Low risk of spamming and junk content
  • 15. Step 1: The Injector injects the list of seed URLs into the CrawlDB.
  • 16. Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms a fetch list, and adds a crawl_generate folder to the segments.
  • 17. Step 3: These fetch lists are used by fetchers to fetch the raw content of the documents, which is then stored in segments.
  • 18. Step 4: The Parser is called to parse the content of the documents, and the parsed content is stored back in segments.
  • 19. Step 5: The links are inverted in the link graph and stored in the LinkDB.
  • 20. Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.
  • 21. Step 7: Information on the newly fetched documents is updated in the CrawlDB.
  • 22. Crawling: 10 stage process
      bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
      1. admin db -create: Create a new WebDB.
      2. inject: Inject root URLs into the WebDB.
      3. generate: Generate a fetchlist from the WebDB in a new segment.
      4. fetch: Fetch content from URLs in the fetchlist.
      5. updatedb: Update the WebDB with links from fetched pages.
      6. Repeat steps 3-5 until the required depth is reached.
      7. updatesegs: Update segments with scores and links from the WebDB.
      8. index: Index the fetched pages.
      9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
      10. merge: Merge the indexes into a single index for searching.
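
The generate, fetch and updatedb loop (steps 3-6) can be sketched against an in-memory "web". This is only an illustration of the control flow under stated assumptions: the class `CrawlLoopSketch`, the map standing in for the web, and the two sets standing in for the WebDB are hypothetical and are not Nutch APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; not part of Nutch.
class CrawlLoopSketch {
    /**
     * Runs generate -> fetch -> updatedb for at most `depth` rounds.
     * `web` maps a URL to its outlinks and stands in for the real network.
     * Returns the URLs fetched, in fetch order.
     */
    static Set<String> crawl(Map<String, List<String>> web, List<String> seeds, int depth) {
        Set<String> knownUrls = new LinkedHashSet<>(seeds); // inject: seed the database
        Set<String> fetched = new LinkedHashSet<>();
        for (int round = 0; round < depth; round++) {
            // generate: every known but unfetched URL goes into this round's fetchlist
            List<String> fetchList = new ArrayList<>();
            for (String url : knownUrls) {
                if (!fetched.contains(url)) {
                    fetchList.add(url);
                }
            }
            if (fetchList.isEmpty()) {
                break; // frontier exhausted before reaching the requested depth
            }
            // fetch (and parse): collect the outlinks of each fetched page
            List<String> outlinks = new ArrayList<>();
            for (String url : fetchList) {
                fetched.add(url);
                outlinks.addAll(web.getOrDefault(url, Collections.emptyList()));
            }
            // updatedb: newly discovered links become known URLs for the next round
            knownUrls.addAll(outlinks);
        }
        return fetched;
    }
}
```

Each round only fetches URLs discovered in earlier rounds, which is why the `-depth` value bounds how far from the seeds the crawl can reach.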
  • 23. De-duplication Algorithm
      Tuple kept for each page: (MD5 hash, float score, int indexID, int docID, int urlLen)
      To eliminate URL duplicates from a segmentsDir:
          open a temporary file
          for each segment:
              for each document in its index:
                  append a tuple for the document to the temporary file with hash=MD5(URL)
          close the temporary file
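
The slide stops after the tuples have been written. A minimal sketch of how such tuples could then be used is given below, assuming (this is not stated on the slide) that tuples are grouped by hash and that only the highest-scoring document in each group is kept. The class `UrlDedupSketch`, its `Tuple` type and the in-memory list replacing the temporary file are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not the Nutch implementation.
class UrlDedupSketch {
    /** One tuple per indexed page, mirroring the fields listed on the slide. */
    static final class Tuple {
        final String hash;
        final float score;
        final int indexId;
        final int docId;
        final int urlLen;

        Tuple(String url, float score, int indexId, int docId) {
            this.hash = md5Hex(url); // hash = MD5(URL), as on the slide
            this.score = score;
            this.indexId = indexId;
            this.docId = docId;
            this.urlLen = url.length();
        }
    }

    /** Hex-encoded MD5 digest of a string (UTF-8 bytes). */
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required on every Java platform", e);
        }
    }

    /** Assumed policy: within each same-hash group keep the highest score, delete the rest. */
    static List<Tuple> duplicatesToDelete(List<Tuple> tuples) {
        List<Tuple> sorted = new ArrayList<>(tuples);
        // Sort by hash so duplicates are adjacent; best score first inside each group.
        sorted.sort((a, b) -> {
            int byHash = a.hash.compareTo(b.hash);
            return byHash != 0 ? byHash : Float.compare(b.score, a.score);
        });
        List<Tuple> toDelete = new ArrayList<>();
        String previousHash = null;
        for (Tuple t : sorted) {
            if (t.hash.equals(previousHash)) {
                toDelete.add(t); // not the first tuple of its group
            }
            previousHash = t.hash;
        }
        return toDelete;
    }
}
```

The `indexId` and `docId` fields of each returned tuple identify which document would be removed from which index.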
  • 24. URL Filtering
      • URL filters (text file): conf/crawl-urlfilter.txt
      • Regular expressions to filter URLs during crawling
      • E.g.
          • To ignore files with certain suffixes:
              -\.(gif|exe|zip|ico)$
          • To accept hosts in a certain domain:
              +^http://([a-z0-9]*\.)*apache.org/
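
How such a rule file could be evaluated is sketched below with `java.util.regex`. The sketch assumes that rules are tried in file order, that the first matching pattern decides, and that a URL matching no rule is rejected; the class `UrlFilterSketch` is hypothetical and is not the Nutch filter plugin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration class; not the Nutch URL filter plugin.
class UrlFilterSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses lines of the form "+regex" (accept) or "-regex" (reject); blank lines and "#" comments are skipped. */
    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** Assumed semantics: first matching rule wins; no match means the URL is rejected. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }
}
```

With the two example rules from the slide, listing the suffix rule first means an image on an accepted host is still rejected, because the reject rule is reached before the accept rule.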
  • 25. A Few APIs
      • Site we would crawl: http://www.iitb.ac.in
      • bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
      • Analyze the database:
          • bin/nutch readdb <db dir> -stats
          • bin/nutch readdb <db dir> -dumppageurl
          • bin/nutch readdb <db dir> -dumplinks
          • s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s
  • 26. Map-Reduce Function
      • Works in a distributed environment
      • map() and reduce() functions are implemented in most of the modules
      • Both map() and reduce() functions use <key, value> pairs
      • Useful when processing large data (e.g. indexing)
      • Some applications need a sequence of map-reduce steps
          • Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
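
The <key, value> flow can be shown with a single-process word count. This sketch does not use Hadoop: the class `MapReduceSketch` and its `map`, `reduce` and `run` methods are hypothetical, and the grouping step that Hadoop performs between the two phases is simulated with an in-memory map.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration; not Hadoop code.
class MapReduceSketch {
    /** map(): emits one <word, 1> pair per token of the document. */
    static List<Map.Entry<String, Integer>> map(String document) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : document.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    /** reduce(): receives every value emitted for one key and sums them. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs map over all documents, groups the pairs by key, then reduces each group. */
    static Map<String, Integer> run(List<String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String document : documents) {
            for (Map.Entry<String, Integer> pair : map(document)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }
}
```

Because map() looks at one document at a time and reduce() at one key at a time, both phases can be spread across machines, which is what makes the model useful for large crawls.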
  • 27. Map-Reduce Architecture
  • 28. Nutch – Map-Reduce Indexing
      • map() just assembles all parts of documents
      • reduce() performs text analysis + indexing:
          • Adds to a local Lucene index
      • Other possible MR indexing models:
          • Hadoop contrib/indexing model:
              • Analysis and indexing on map() side
              • Index merging on reduce() side
          • Modified Nutch model:
              • Analysis on map() side
              • Indexing on reduce() side
  • 29. Nutch - Ranking
      • Nutch Ranking
          • queryNorm(): the normalization factor for the query
          • coord(): indicates how many query terms are present in the given document
          • norm(): score indicating the field-based normalization factor
          • tf: term frequency; idf: inverse document frequency
          • t.boost(): score indicating the importance of a term's occurrence in a particular field
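
The formula itself did not survive in the transcript. The factors listed match the classic Lucene TF-IDF scoring function, so, assuming the slide showed that function, it would read as follows for a query q and document d:

```latex
\mathrm{score}(q, d) =
  \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Bigl( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^{2}
  \cdot t.\mathrm{boost}() \cdot \mathrm{norm}(t, d) \Bigr)
```

The sum runs over the query terms t, so a document gains score for every query term it contains, weighted by how rare the term is and by the boost of the field it appears in.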
  • 30. Lucene - Features
      • Field-based indexing and searching
      • Different fields of a webpage are:
          • Title
          • URL
          • Anchor text
          • Content, etc.
      • Different boost factors to give importance to fields
      • Uses an inverted index to store the content of crawled documents
      • Open source Apache project
  • 31. Lucene - Index
      • Concepts
          • Index: sequence of documents (a.k.a. Directory)
          • Document: sequence of fields
          • Field: named sequence of terms
          • Term: a text string (e.g., a word)
      • Statistics
          • Term frequencies and positions
  • 32. Writing to Index
      IndexWriter writer = new IndexWriter(directory, analyzer, true);
      Document doc = new Document();
      // add fields to document (next slide)
      writer.addDocument(doc);
      writer.close();
  • 33. Adding Fields
      doc.add(Field.Keyword("isbn", isbn));
      doc.add(Field.Keyword("category", category));
      doc.add(Field.Text("title", title));
      doc.add(Field.Text("author", author));
      doc.add(Field.UnIndexed("url", url));
      doc.add(Field.UnStored("subjects", subjects, true));
      doc.add(Field.Keyword("pubmonth", pubmonth));
      doc.add(Field.UnStored("contents", author + " " + subjects));
      doc.add(Field.Keyword("modified", DateField.timeToString(file.lastModified())));
  • 34. Fields Description
      • Attributes
          • Stored: original content retrievable
          • Indexed: inverted, searchable
          • Tokenized: analyzed, split into tokens
      • Factory methods
          • Keyword: stored and indexed as a single term
          • Text: indexed, tokenized, and stored if String
          • UnIndexed: stored
          • UnStored: indexed, tokenized
      • Terms are what matters for searching
  • 35. Searching an Index
      IndexSearcher searcher = new IndexSearcher(directory);
      Query query = QueryParser.parse(queryExpression, "contents", analyzer);
      Hits hits = searcher.search(query);
      for (int i = 0; i < hits.length(); i++) {
          Document doc = hits.doc(i);
          System.out.println(doc.get("title"));
      }
  • 36. Analyzer
      • Analysis occurs
          • For each tokenized field during indexing
          • For each term or phrase in QueryParser
      • Several analyzers built-in
          • Many more in the sandbox
          • Straightforward to create your own
      • Choosing the right analyzer is important!
  • 37. WhiteSpace Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
  • 38. Simple Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
  • 39. Stop Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [quick] [brown] [fox] [jumps] [over] [lazy] [dog]
  • 40. Snowball Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
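
The first three token streams above can be reproduced with plain string handling. The sketch below only imitates the behaviour shown on the slides and does not call Lucene: the class `AnalyzerSketch`, its method names and the small stop-word list are assumptions, and Snowball stemming is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical imitation of analyzer behaviour; not Lucene code.
class AnalyzerSketch {
    // Assumed stop-word list for illustration; a real analyzer ships its own.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Whitespace-style: split on whitespace only; case and punctuation are kept. */
    static List<String> whitespace(String text) {
        return nonEmpty(text.split("\\s+"));
    }

    /** Simple-style: lowercase, then split at every non-letter character. */
    static List<String> simple(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"));
    }

    /** Stop-style: same as simple, then drop the stop words. */
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    /** Copies the tokens into a mutable list, skipping empty strings produced by split(). */
    private static List<String> nonEmpty(String[] parts) {
        List<String> tokens = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox jumps over the lazy dog.";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

The differences between the three outputs show why the choice matters: a query for "dog" would not match the whitespace tokens, because that stream still contains "dog." with its trailing period.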
  • 41. Query Creation
      • Searching by a term – TermQuery
      • Searching within a range – RangeQuery
      • Searching on a string – PrefixQuery
      • Combining queries – BooleanQuery
      • Searching by phrase – PhraseQuery
      • Searching by wildcard – WildcardQuery
      • Searching for similar terms – FuzzyQuery
  • 42. Lucene Queries
  • 43. Conclusions
      • Nutch as a starting point
      • Crawling in Nutch
      • Detailed map-reduce architecture
      • Different query formats in Lucene
      • Built-in analyzers in Lucene
      • The same analyzer needs to be used both while indexing and while searching
  • 44. Resources Used
      • Gospodnetic, Otis; Erik Hatcher (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. pp. 456. ISBN 978-1-932394-28-3.
      • Nutch Wiki: http://wiki.apache.org/nutch/
  • 45. Thanks
      • Questions?