Sandhan (CLIA) - Nutch and Lucene Framework
- Gaurav Arora, IRLAB, DA-IICT
Outline
- Introduction
- Behavior of Nutch (offline and online)
- Lucene Features
- Sandhan Demo
- RJ Interface
Introduction
- Nutch is an open-source search engine
- Implemented in Java
- Nutch comprises Lucene, Solr, Hadoop, etc.
- Lucene implements the indexing and searching of the crawled data
- Both Nutch and Lucene are developed using a plugin framework
  - Easy to customize
Where do they fit in IR?
Nutch – complete search engine
Nutch – offline processing
- Crawling
  - Starts with a set of seed URLs
  - Goes deeper into the web and starts fetching the content
  - Content needs to be analyzed before storing
- Storing the content
  - Makes it suitable for searching
- Issues
  - Time-consuming process
  - Freshness of the crawl (how often should I crawl?)
  - Coverage of content
Nutch – online processing
- Searching
  - Analysis of the query
  - Processing of the few words (tokens) in the query
  - Query tokens matched against the stored tokens (index)
  - Fast and accurate
- Involves ordering the matching results
  - Ranking affects the user's satisfaction directly
- Supports distributed searching
Nutch – Data structures
- Web Database or WebDB
  - Mirrors the properties/structure of the web graph being crawled
- Segment
  - Intermediate index
  - Contains pages fetched in a single run
- Index
  - Final inverted index obtained by "merging" segments (Lucene)
Nutch – Data
- Web Database or WebDB
  - Crawldb – contains information about every URL known to Nutch, including whether it was fetched.
  - Linkdb – contains the list of known links to each URL, including both the source URL and anchor text of each link.
- Index
  - Inverted index: posting lists mapping each word to the documents that contain it.
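The posting-list idea above can be sketched with plain Java collections. This is a toy model for illustration only, not Lucene's actual index format (which also stores term frequencies and positions):

```java
import java.util.*;

// Toy inverted index: maps each term to the sorted list of document IDs
// that contain it (its posting list).
public class TinyInvertedIndex {
    private final Map<String, List<Integer>> postings = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // avoid duplicate entries when a term repeats within one document
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    public List<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        TinyInvertedIndex idx = new TinyInvertedIndex();
        idx.addDocument(1, "nutch is a search engine");
        idx.addDocument(2, "lucene is an indexing library");
        System.out.println(idx.lookup("is"));     // term appears in both documents
        System.out.println(idx.lookup("lucene")); // term appears only in document 2
    }
}
```

Searching then reduces to looking up each query token's posting list and intersecting or merging the lists, which is what makes querying fast.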
Nutch Data – Segment
Each segment is a set of URLs that are fetched as a unit. A segment contains:
- crawl_generate – names the set of URLs to be fetched
- crawl_fetch – contains the status of fetching each URL
- content – contains the raw content retrieved from each URL
- parse_text – contains the parsed text of each URL
- parse_data – contains outlinks and metadata parsed from each URL
- crawl_parse – contains the outlink URLs, used to update the crawldb
Nutch – Crawling
- Inject: initial creation of the CrawlDB
  - Insert seed URLs
  - The initial LinkDB is empty
- Generate a new shard's fetchlist
- Fetch raw content
- Parse content (discovers outlinks)
- Update the CrawlDB from shards
- Update the LinkDB from shards
- Index shards
Wide Crawling vs. Focused Crawling
- Differences:
  - Little technical difference in configuration
  - Big difference in operations, maintenance, and quality
- Wide crawling:
  - (Almost) unlimited crawling frontier
  - High risk of spam and junk content
  - "Politeness" is a very important limiting factor
  - Bandwidth and DNS considerations
- Focused (vertical or enterprise) crawling:
  - Limited crawling frontier
  - Bandwidth or politeness is often not an issue
  - Low risk of spam and junk content
Crawling Architecture
Step 1: The Injector injects the list of seed URLs into the CrawlDB.
Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms the fetch list, and adds a crawl_generate folder to the segments.
Step 3: The fetch lists are used by the fetchers to fetch the raw content of each document, which is then stored in the segments.
Step 4: The Parser is called to parse the content of each document, and the parsed content is stored back in the segments.
Step 5: The links are inverted in the link graph and stored in the LinkDB.
Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.
Step 7: Information on the newly fetched documents is updated in the CrawlDB.
Crawling: a 10-stage process
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
1. admin db -create: Create a new WebDB.
2. inject: Inject root URLs into the WebDB.
3. generate: Generate a fetchlist from the WebDB in a new segment.
4. fetch: Fetch content from URLs in the fetchlist.
5. updatedb: Update the WebDB with links from fetched pages.
6. Repeat steps 3-5 until the required depth is reached.
7. updatesegs: Update segments with scores and links from the WebDB.
8. index: Index the fetched pages.
9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
10. merge: Merge the indexes into a single index for searching.
De-duplication Algorithm
Each page is represented by the tuple (MD5 hash, float score, int indexID, int docID, int urlLen).
To eliminate URL duplicates from a segmentsDir:
  open a temporary file
  for each segment:
    for each document in its index:
      append a tuple for the document to the temporary file, with hash = MD5(URL)
  close the temporary file
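The core of the scheme above is that two documents with the same MD5(URL) are duplicates. A minimal in-memory sketch of that idea (an illustrative helper, not Nutch's actual dedup tool, which works over temporary files and index tuples as described):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

// URL de-duplication by MD5 hash: keep only the first URL seen
// for each distinct hash value.
public class UrlDedup {
    static String md5(String url) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Returns the URLs that survive de-duplication, in input order.
    static List<String> dedup(List<String> urls) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String url : urls) {
            if (seen.add(md5(url))) unique.add(url); // add() is false for repeats
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://example.org/a", "http://example.org/b", "http://example.org/a");
        System.out.println(dedup(urls)); // the repeated URL is dropped
    }
}
```

Nutch additionally sorts the tuples and uses the score and URL length to decide which duplicate to keep; the sketch simply keeps the first occurrence.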
URL Filtering
- URL filters (text file: conf/crawl-urlfilter.txt)
  - Regular expressions to filter URLs during crawling
  - E.g.:
    - To ignore files with certain suffixes:
      -\.(gif|exe|zip|ico)$
    - To accept hosts in a certain domain:
      +^http://([a-z0-9]*\.)*apache.org/
Few APIs
- Site we would crawl: http://www.iitb.ac.in
  - bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
- Analyze the database:
  - bin/nutch readdb <db dir> -stats
  - bin/nutch readdb <db dir> -dumppageurl
  - bin/nutch readdb <db dir> -dumplinks
  - s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s
Map-Reduce Function
- Works in a distributed environment
- map() and reduce() functions are implemented in most of the modules
- Both map() and reduce() operate on <key, value> pairs
- Useful for processing large data (e.g. indexing)
- Some applications need a sequence of map-reduce jobs
  - Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
Map-Reduce Architecture
Nutch – Map-Reduce Indexing
- map() just assembles all parts of the documents
- reduce() performs text analysis + indexing:
  - Adds to a local Lucene index
- Other possible MR indexing models:
  - Hadoop contrib/indexing model:
    - Analysis and indexing on the map() side
    - Index merging on the reduce() side
  - Modified Nutch model:
    - Analysis on the map() side
    - Indexing on the reduce() side
Nutch – Ranking
- Nutch ranking uses Lucene's scoring factors:
  - queryNorm(): the normalization factor for the query
  - coord(): how many of the query terms are present in the given document
  - norm(): field-based normalization factor
  - tf: term frequency; idf: inverse document frequency
  - t.boost(): the importance of a term's occurrence in a particular field
Lucene – Features
- Field-based indexing and searching
- Different fields of a webpage:
  - Title
  - URL
  - Anchor text
  - Content, etc.
- Different boost factors to give importance to fields
- Uses an inverted index to store the content of crawled documents
- Open-source Apache project
Lucene – Index
- Concepts
  - Index: sequence of documents (a.k.a. Directory)
  - Document: sequence of fields
  - Field: named sequence of terms
  - Term: a text string (e.g., a word)
- Statistics
  - Term frequencies and positions
Writing to Index

IndexWriter writer = new IndexWriter(directory, analyzer, true);
Document doc = new Document();
// add fields to document (next slide)
writer.addDocument(doc);
writer.close();
Adding Fields

doc.add(Field.Keyword("isbn", isbn));
doc.add(Field.Keyword("category", category));
doc.add(Field.Text("title", title));
doc.add(Field.Text("author", author));
doc.add(Field.UnIndexed("url", url));
doc.add(Field.UnStored("subjects", subjects, true));
doc.add(Field.Keyword("pubmonth", pubmonth));
doc.add(Field.UnStored("contents", author + " " + subjects));
doc.add(Field.Keyword("modified", DateField.timeToString(file.lastModified())));
Fields Description
- Attributes
  - Stored: original content retrievable
  - Indexed: inverted, searchable
  - Tokenized: analyzed, split into tokens
- Factory methods
  - Keyword: stored and indexed as a single term
  - Text: indexed, tokenized, and stored if a String
  - UnIndexed: stored
  - UnStored: indexed, tokenized
- Terms are what matter for searching
Searching an Index

IndexSearcher searcher = new IndexSearcher(directory);
Query query = QueryParser.parse(queryExpression, "contents", analyzer);
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
  Document doc = hits.doc(i);
  System.out.println(doc.get("title"));
}
Analyzer
- Analysis occurs
  - For each tokenized field during indexing
  - For each term or phrase in QueryParser
- Several analyzers built in
  - Many more in the sandbox
  - Straightforward to create your own
- Choosing the right analyzer is important!
WhiteSpace Analyzer

The quick brown fox jumps over the lazy dog.

[The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
Simple Analyzer

The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
Stop Analyzer

The quick brown fox jumps over the lazy dog.

[quick] [brown] [fox] [jumps] [over] [lazy] [dog]
Snowball Analyzer

The quick brown fox jumps over the lazy dog.

[the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
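The analyzer outputs above differ only in their token pipelines. The Stop Analyzer's pipeline (letter-only tokenization, lowercasing, stop-word removal) can be approximated in a few lines of plain Java; this is an illustrative sketch with a tiny hand-picked stop list, not Lucene's StopAnalyzer or its full default stop set:

```java
import java.util.*;

// Approximation of a Stop Analyzer pipeline:
// split on non-letters, lowercase, drop stop words.
public class StopTokenizer {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^A-Za-z]+")) {   // letter-only tokens
            if (t.isEmpty()) continue;
            String lower = t.toLowerCase();            // lowercase filter
            if (!STOP_WORDS.contains(lower)) {         // stop-word filter
                tokens.add(lower);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The quick brown fox jumps over the lazy dog."));
        // [quick, brown, fox, jumps, over, lazy, dog]
    }
}
```

Swapping the final filter for a stemmer (e.g. Snowball) would instead turn "jumps" into "jump" while keeping the stop words, matching the Snowball Analyzer output above.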
Query Creation
- Searching by a term – TermQuery
- Searching within a range – RangeQuery
- Searching on a string prefix – PrefixQuery
- Combining queries – BooleanQuery
- Searching by phrase – PhraseQuery
- Searching by wildcard – WildcardQuery
- Searching for similar terms – FuzzyQuery
Lucene Queries
Conclusions
- Nutch as a starting point
- Crawling in Nutch
- Detailed map-reduce architecture
- Different query formats in Lucene
- Built-in analyzers in Lucene
- The same analyzer needs to be used both while indexing and searching
Resources Used
- Gospodnetic, Otis; Hatcher, Erik (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. pp. 456. ISBN 978-1-932394-28-3.
- Nutch Wiki: http://wiki.apache.org/nutch/
Thanks
- Questions??

More Related Content

What's hot

ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
ZFConf Conference
 
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAYPostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
Emanuel Calvo
 

What's hot (20)

Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Drill 1.0
Drill 1.0Drill 1.0
Drill 1.0
 
Introduction to hadoop ecosystem
Introduction to hadoop ecosystem Introduction to hadoop ecosystem
Introduction to hadoop ecosystem
 
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
Rank Your Results with PostgreSQL Full Text Search (from PGConf2015)
 
Hadoop Essential for Oracle Professionals
Hadoop Essential for Oracle ProfessionalsHadoop Essential for Oracle Professionals
Hadoop Essential for Oracle Professionals
 
The MATLAB Low-Level HDF5 Interface
The MATLAB Low-Level HDF5 InterfaceThe MATLAB Low-Level HDF5 Interface
The MATLAB Low-Level HDF5 Interface
 
NoSQL Couchbase Lite & BigData HPCC Systems
NoSQL Couchbase Lite & BigData HPCC SystemsNoSQL Couchbase Lite & BigData HPCC Systems
NoSQL Couchbase Lite & BigData HPCC Systems
 
HPCC Systems vs Hadoop
HPCC Systems vs HadoopHPCC Systems vs Hadoop
HPCC Systems vs Hadoop
 
Big Data - Load CSV File & Query the EZ way - HPCC Systems
Big Data - Load CSV File & Query the EZ way - HPCC SystemsBig Data - Load CSV File & Query the EZ way - HPCC Systems
Big Data - Load CSV File & Query the EZ way - HPCC Systems
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
1 technical-dns-workshop-day1
1 technical-dns-workshop-day11 technical-dns-workshop-day1
1 technical-dns-workshop-day1
 
AWS Hadoop and PIG and overview
AWS Hadoop and PIG and overviewAWS Hadoop and PIG and overview
AWS Hadoop and PIG and overview
 
Rbootcamp Day 1
Rbootcamp Day 1Rbootcamp Day 1
Rbootcamp Day 1
 
Introduction to DNS
Introduction to DNSIntroduction to DNS
Introduction to DNS
 
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
ZFConf 2011: Что такое Sphinx, зачем он вообще нужен и как его использовать с...
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Analysis of Air Pollution in Nova Scotia Presentation
Analysis of Air Pollution in Nova Scotia PresentationAnalysis of Air Pollution in Nova Scotia Presentation
Analysis of Air Pollution in Nova Scotia Presentation
 
LaTeX Tutorial
LaTeX TutorialLaTeX Tutorial
LaTeX Tutorial
 
Horizons doc
Horizons docHorizons doc
Horizons doc
 
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAYPostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
PostgreSQL FTS Solutions FOSDEM 2013 - PGDAY
 

Viewers also liked

บทที่51
 บทที่51 บทที่51
บทที่51
kik.nantanit
 
Bridge Outdoors - Spring 2011
Bridge Outdoors - Spring 2011Bridge Outdoors - Spring 2011
Bridge Outdoors - Spring 2011
Bridge Outdoors
 
บทที่51
 บทที่51 บทที่51
บทที่51
kik.nantanit
 
Bridge Outdoors Fall and Winter 2011 Catalog
Bridge Outdoors Fall and Winter 2011 CatalogBridge Outdoors Fall and Winter 2011 Catalog
Bridge Outdoors Fall and Winter 2011 Catalog
Bridge Outdoors
 
Zonificacion merged
Zonificacion mergedZonificacion merged
Zonificacion merged
Danger
 
Informática i
Informática iInformática i
Informática i
ricardo
 

Viewers also liked (20)

Approaches to text analysis
Approaches to text analysisApproaches to text analysis
Approaches to text analysis
 
The Velocity12 markets
The Velocity12 marketsThe Velocity12 markets
The Velocity12 markets
 
Chapter 3 rev
Chapter 3 revChapter 3 rev
Chapter 3 rev
 
Global warming
Global warmingGlobal warming
Global warming
 
Listing Presentation-ko2
Listing Presentation-ko2Listing Presentation-ko2
Listing Presentation-ko2
 
Social Media by Konceptika
Social Media by KonceptikaSocial Media by Konceptika
Social Media by Konceptika
 
บทที่51
 บทที่51 บทที่51
บทที่51
 
Bridge outdoors Spring & Summer 2012
Bridge outdoors Spring & Summer 2012Bridge outdoors Spring & Summer 2012
Bridge outdoors Spring & Summer 2012
 
The Quirindongo’S Wedding
The Quirindongo’S WeddingThe Quirindongo’S Wedding
The Quirindongo’S Wedding
 
Bridge Outdoors - Spring 2011
Bridge Outdoors - Spring 2011Bridge Outdoors - Spring 2011
Bridge Outdoors - Spring 2011
 
Bridge outdoors fall winter 2012
Bridge outdoors fall winter 2012Bridge outdoors fall winter 2012
Bridge outdoors fall winter 2012
 
0471251240
04712512400471251240
0471251240
 
DEV PVH 2015 MeetUP
DEV PVH 2015 MeetUPDEV PVH 2015 MeetUP
DEV PVH 2015 MeetUP
 
บทที่51
 บทที่51 บทที่51
บทที่51
 
Telephone
TelephoneTelephone
Telephone
 
Bridge Outdoors Fall and Winter 2011 Catalog
Bridge Outdoors Fall and Winter 2011 CatalogBridge Outdoors Fall and Winter 2011 Catalog
Bridge Outdoors Fall and Winter 2011 Catalog
 
Zonificacion merged
Zonificacion mergedZonificacion merged
Zonificacion merged
 
โปรแกรมเพื่อการศึกษา
โปรแกรมเพื่อการศึกษาโปรแกรมเพื่อการศึกษา
โปรแกรมเพื่อการศึกษา
 
Журналисты 2.0
Журналисты 2.0Журналисты 2.0
Журналисты 2.0
 
Informática i
Informática iInformática i
Informática i
 

Similar to Nutch and lucene_framework

Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
Paco Nathan
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
smile790243
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 

Similar to Nutch and lucene_framework (20)

How web searching engines work
How web searching engines workHow web searching engines work
How web searching engines work
 
Data Science
Data ScienceData Science
Data Science
 
Optimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the webOptimized index structures for querying rdf from the web
Optimized index structures for querying rdf from the web
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
RDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival dataRDF-Gen: Generating RDF from streaming and archival data
RDF-Gen: Generating RDF from streaming and archival data
 
Data Integration And Visualization
Data Integration And VisualizationData Integration And Visualization
Data Integration And Visualization
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Apache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - Phoenix
 
Introduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in RIntroduction to Data Mining with R and Data Import/Export in R
Introduction to Data Mining with R and Data Import/Export in R
 
R- Introduction
R- IntroductionR- Introduction
R- Introduction
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
Nzitf Velociraptor Workshop
Nzitf Velociraptor WorkshopNzitf Velociraptor Workshop
Nzitf Velociraptor Workshop
 
Apache Spark Overview part2 (20161117)
Apache Spark Overview part2 (20161117)Apache Spark Overview part2 (20161117)
Apache Spark Overview part2 (20161117)
 
Page 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docxPage 18Goal Implement a complete search engine. Milestones.docx
Page 18Goal Implement a complete search engine. Milestones.docx
 
Apache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLabApache Spark Introduction - CloudxLab
Apache Spark Introduction - CloudxLab
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
hadoop
hadoophadoop
hadoop
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Nutch and lucene_framework

  • 1. Sandhan(CLIA) - Nutch and Lucene Framework -Gaurav Arora IRLAB,DA-IICT
  • 2. N 2 u c Outline h a  Introduction n  Behavior d of Nutch (Offline and Online) L  Lucene Features u  Sandhan Demo c e  RJ Interface n e F r a m e w o r k
  • 3. 3 N u c Introduction h a  Nutch is an opensource search engine n  Implemented in Java d L  Nutch is comprised of Lucene, Solr, Hadoop u c etc.. e  Lucene is an implementation of indexing and n searching crawled data e F  Both Nutch and Lucene are developed using r plugin framework a  Easy to customize m e w o r k
  • 4. 4 N u c Where do they fit in IR? h a n d L u c e n e F r a m e w o r k
  • 5. 5 N u c Nutch – complete search engine h a n d L u c e n e F r a m e w o r k
  • 6. 6 N u c Nutch – offline processing h a  Crawling n  Starts with set of seed URLs d  Goes deeper in the web and starts fetching the L u content c  Content need to be analyzed before storing e  Storing the content n e  Makes suitable for searching F  Issues r a  Time consuming process m  Freshness of the crawl (How often should I crawl?) e  Coverage of content w o r k
  • 7. 7 N u c Nutch – online processing h a  Searching n  Analysis of the query d  Processing of few words(tokens) in the query L u  Query tokens matched against stored c tokens(index) e  Fast and Accurate n e  Involves ordering the matching results F  Ranking affects User’s satisfaction directly r a  Supports distributed searching m e w o r k
  • 8.
  • 9. 9 N u c Nutch – Data structures h a  Web Database or WebDB n  Mirrors the properties/structure of web graph being d L crawled u c e  Segment n  Intermediate index e  Contains pages fetched in a single run F r a  Index m  Final inverted index obtained by “merging” e w segments (Lucene) o r k
  • 10. Nutch – Data Web Database or WebDB Crawldb - This contains information about every URL known to Nutch, including whether it was fetched. Linkdb. - This contains the list of known links to each URL, including both the source URL and anchor text of the link. Index Invert index : Posting list ,Mapping from words to its documents.
  • 11. Nutch Data - Segment Each segment is a set of URLs that are fetched as a unit. segment contains:- a crawl_generate names a set of URLs to be fetched a crawl_fetch contains the status of fetching each URL a content contains the raw content retrieved from each URL a parse_text contains the parsed text of each URL a parse_data contains outlinks and metadata parsed from each URL a crawl_parse contains the outlink URLs, used to update the crawldb
  • 12. 12 oter> Nutch –Crawling  Inject: initial creation of CrawlDB  Insert seed URLs  Initial LinkDB is empty  Generate new shard's fetchlist  Fetch raw content  Parse content (discovers outlinks)  Update CrawlDB from shards  Update LinkDB from shards  Index shards
  • 13. 13 Wide Crawling vs. Focused Crawling  Differences:  Little technical difference in configuration  Big difference in operations, maintenance and quality  Wide crawling:  (Almost) Unlimited crawling frontier  High risk of spamming and junk content  “Politeness” a very important limiting factor  Bandwidth & DNS considerations  Focused (vertical or enterprise) crawling:  Limited crawling frontier  Bandwidth or politeness is often not an issue  Low risk of spamming and junk content
• 15. Step 1: The Injector injects the list of seed URLs into the CrawlDB.
• 16. Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms a fetchlist, and adds a crawl_generate folder to the segments.
• 17. Step 3: The fetchlists are used by fetchers to fetch the raw content of the documents, which is then stored in segments.
• 18. Step 4: The Parser is called to parse the content of each document, and the parsed content is stored back in the segments.
• 19. Step 5: The links are inverted in the link graph and stored in the LinkDB.
• 20. Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.
• 21. Step 7: Information on the newly fetched documents is updated in the CrawlDB.
• 22. Crawling: a 10-stage process
bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
  1. admin db -create: Create a new WebDB.
  2. inject: Inject root URLs into the WebDB.
  3. generate: Generate a fetchlist from the WebDB in a new segment.
  4. fetch: Fetch content from URLs in the fetchlist.
  5. updatedb: Update the WebDB with links from fetched pages.
  6. Repeat steps 3-5 until the required depth is reached.
  7. updatesegs: Update segments with scores and links from the WebDB.
  8. index: Index the fetched pages.
  9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
  10. merge: Merge the indexes into a single index for searching.
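The generate/fetch/update loop (stages 3-5, repeated to the required depth) can be sketched as a small in-memory simulation. This is a conceptual toy, not Nutch's implementation: the "web" is an injected function from URL to outlinks, and the CrawlDB is just a set:

```java
import java.util.*;
import java.util.function.Function;

// Simplified sketch of the crawl loop: at each depth we "generate" a
// fetchlist of not-yet-fetched URLs, "fetch" them (here: look up their
// outlinks in a stand-in for the web), and "update" the crawl database
// with the newly discovered links.
public class CrawlLoop {
    static Set<String> crawl(List<String> seeds,
                             Function<String, List<String>> fetchOutlinks,
                             int depth) {
        Set<String> crawlDb = new LinkedHashSet<>(seeds); // inject seeds
        Set<String> fetched = new HashSet<>();
        for (int d = 0; d < depth; d++) {
            List<String> fetchlist = new ArrayList<>();   // generate
            for (String url : crawlDb)
                if (!fetched.contains(url)) fetchlist.add(url);
            for (String url : fetchlist) {                // fetch + parse
                fetched.add(url);
                crawlDb.addAll(fetchOutlinks.apply(url)); // update CrawlDB
            }
        }
        return fetched;
    }
}
```

With a graph a -> b -> c and depth 2, only a and b are fetched; depth 3 also reaches c, which is why crawl depth directly controls coverage.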
• 23. De-duplication Algorithm
A tuple (MD5 hash, float score, int indexID, int docID, int urlLen) is kept for each page. To eliminate URL duplicates from a segmentsDir:
  - open a temporary file
  - for each segment:
    - for each document in its index: append a tuple for the document to the temporary file, with hash = MD5(URL)
  - close the temporary file
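The core idea of the slide, hashing each URL with MD5 and keeping only the first document per hash, can be sketched as follows. This is an assumed simplification for illustration (the real dedup stage also compares scores and content hashes across segments):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.*;

// Sketch of URL de-duplication: MD5-hash each URL and drop any URL
// whose hash has already been seen, preserving first-seen order.
public class Dedup {
    // Hex-encoded MD5 digest of a string.
    static String md5(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available in the JDK
        }
    }

    // Return the input URLs with exact duplicates removed.
    static List<String> dedup(List<String> urls) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String url : urls)
            if (seen.add(md5(url))) unique.add(url);
        return unique;
    }
}
```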
• 24. URL Filtering
  - URL filters are regular expressions kept in a text file (conf/crawl-urlfilter.txt) and applied to URLs during crawling.
  - Examples:
    - To ignore files with certain suffixes: -\.(gif|exe|zip|ico)$
    - To accept hosts in a certain domain: +^http://([a-z0-9]*\.)*apache.org/
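The rule semantics can be sketched with the JDK's regex support: each rule is a `+` (accept) or `-` (reject) sign followed by a pattern, the first matching rule decides, and unmatched URLs are rejected. This is a toy reimplementation of the behavior, not Nutch's RegexURLFilter itself:

```java
import java.util.*;
import java.util.regex.Pattern;

// Sketch of Nutch-style URL filtering: rules are checked in order and
// the first pattern that matches decides accept ('+') or reject ('-').
public class UrlFilter {
    private final List<String> rules;

    public UrlFilter(List<String> rules) { this.rules = rules; }

    public boolean accept(String url) {
        for (String rule : rules) {
            Pattern p = Pattern.compile(rule.substring(1));
            if (p.matcher(url).find()) return rule.charAt(0) == '+';
        }
        return false; // no rule matched: reject by default
    }
}
```

With the two rules from the slide, `http://lucene.apache.org/index.html` is accepted while `http://lucene.apache.org/logo.gif` is rejected by the suffix rule before the domain rule is consulted.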
• 25. A Few Commands
  - Site we would crawl: http://www.iitb.ac.in
  - bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
  - Analyze the database:
    - bin/nutch readdb <db dir> -stats
    - bin/nutch readdb <db dir> -dumppageurl
    - bin/nutch readdb <db dir> -dumplinks
    - s=`ls -d <segment dir>/* | head -1`; bin/nutch segread -dump $s
• 26. Map-Reduce Functions
  - Work in a distributed environment
  - map() and reduce() functions are implemented in most of the modules
  - Both map() and reduce() operate on <key, value> pairs
  - Useful for processing large data (e.g., indexing)
  - Some applications need a sequence of map-reduce jobs: Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
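The map/reduce pattern described above can be illustrated with a single-process word count (word counting stands in for indexing here; this is a conceptual sketch, not Hadoop's API):

```java
import java.util.*;
import java.util.AbstractMap.SimpleEntry;

// Minimal in-memory map-reduce: map() emits <key, value> pairs,
// the "shuffle" groups pairs by key, and reduce() folds each group.
public class MapReduceSketch {
    // map(): emit one <word, 1> pair per token in the document.
    static List<Map.Entry<String, Integer>> map(String doc) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : doc.toLowerCase().split("\\s+"))
            out.add(new SimpleEntry<>(w, 1));
        return out;
    }

    // reduce(): sum all values emitted for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    static Map<String, Integer> run(List<String> docs) {
        Map<String, List<Integer>> groups = new TreeMap<>(); // shuffle
        for (String doc : docs)
            for (Map.Entry<String, Integer> kv : map(doc))
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
        Map<String, Integer> counts = new TreeMap<>();       // reduce
        for (Map.Entry<String, List<Integer>> g : groups.entrySet())
            counts.put(g.getKey(), reduce(g.getValue()));
        return counts;
    }
}
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the <key, value> contract is what makes that distribution possible.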
• 27. Map-Reduce Architecture (diagram)
• 28. Nutch – Map-Reduce Indexing
  - map() just assembles all parts of the documents
  - reduce() performs text analysis plus indexing, adding to a local Lucene index
  - Other possible MR indexing models:
    - Hadoop contrib/indexing model: analysis and indexing on the map() side, index merging on the reduce() side
    - Modified Nutch model: analysis on the map() side, indexing on the reduce() side
• 29. Nutch – Ranking
Nutch uses Lucene's scoring formula (shown as an image on the slide), whose factors are:
  - queryNorm(): the normalization factor for the query
  - coord(): how many query terms are present in the given document
  - norm(): a field-based normalization factor
  - tf: term frequency, and idf: inverse document frequency
  - t.boost(): the importance of the term's occurrence in a particular field
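The tf-idf core of that formula can be sketched as follows. This is an illustration, not Lucene's implementation: coord(), queryNorm(), boosts, and field norms are omitted, keeping only sqrt(tf) * idf^2 per query term, with idf(t) = 1 + ln(N / (df(t) + 1)) in the style of classic Lucene:

```java
import java.util.*;

// Illustrative tf-idf scoring: rare terms (high idf) and terms that
// occur often in a document (high tf) both raise the score.
public class TfIdfSketch {
    static double idf(int numDocs, int docFreq) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Number of times a term occurs in a whitespace-tokenized document.
    static int tf(String term, String doc) {
        int n = 0;
        for (String w : doc.toLowerCase().split("\\s+"))
            if (w.equals(term)) n++;
        return n;
    }

    // Score one document against a multi-term query over a small corpus.
    static double score(List<String> query, String doc, List<String> corpus) {
        double s = 0.0;
        for (String t : query) {
            int df = 0;
            for (String d : corpus) if (tf(t, d) > 0) df++;
            double w = idf(corpus.size(), df);
            s += Math.sqrt(tf(t, doc)) * w * w;
        }
        return s;
    }
}
```

A document containing "fox" twice scores higher for the query "fox" than one containing it once, and a document without the term scores zero, which matches the intuition behind the full formula.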
• 30. Lucene – Features
  - Field-based indexing and searching; different fields of a webpage include:
    - Title
    - URL
    - Anchor text
    - Content, etc.
  - Different boost factors give importance to particular fields
  - Uses an inverted index to store the content of crawled documents
  - Open-source Apache project
• 31. Lucene – Index
  - Concepts:
    - Index: a sequence of documents (a.k.a. Directory)
    - Document: a sequence of fields
    - Field: a named sequence of terms
    - Term: a text string (e.g., a word)
  - Statistics:
    - Term frequencies and positions
• 32. Writing to an Index
    IndexWriter writer =
        new IndexWriter(directory, analyzer, true);

    Document doc = new Document();
    // add fields to the document (next slide)
    writer.addDocument(doc);
    writer.close();
• 33. Adding Fields
    doc.add(Field.Keyword("isbn", isbn));
    doc.add(Field.Keyword("category", category));
    doc.add(Field.Text("title", title));
    doc.add(Field.Text("author", author));
    doc.add(Field.UnIndexed("url", url));
    doc.add(Field.UnStored("subjects", subjects, true));
    doc.add(Field.Keyword("pubmonth", pubmonth));
    doc.add(Field.UnStored("contents", author + " " + subjects));
    doc.add(Field.Keyword("modified",
        DateField.timeToString(file.lastModified())));
• 34. Fields Description
  - Attributes:
    - Stored: original content is retrievable
    - Indexed: inverted, searchable
    - Tokenized: analyzed, split into tokens
  - Factory methods:
    - Keyword: stored and indexed as a single term
    - Text: indexed, tokenized, and stored if a String
    - UnIndexed: stored only
    - UnStored: indexed and tokenized only
  - Terms are what matter for searching
• 35. Searching an Index
    IndexSearcher searcher =
        new IndexSearcher(directory);

    Query query =
        QueryParser.parse(queryExpression, "contents", analyzer);

    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        System.out.println(doc.get("title"));
    }
• 36. Analyzer
  - Analysis occurs:
    - for each tokenized field during indexing
    - for each term or phrase in the QueryParser
  - Several analyzers are built in; many more are in the sandbox
  - It is straightforward to create your own
  - Choosing the right analyzer is important!
• 37. Whitespace Analyzer
    The quick brown fox jumps over the lazy dog.
    -> [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
• 38. Simple Analyzer
    The quick brown fox jumps over the lazy dog.
    -> [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
• 39. Stop Analyzer
    The quick brown fox jumps over the lazy dog.
    -> [quick] [brown] [fox] [jumps] [over] [lazy] [dog]
• 40. Snowball Analyzer
    The quick brown fox jumps over the lazy dog.
    -> [the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
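The first three tokenizations above can be reproduced with a few lines of plain Java. This is a sketch of the behavior, not Lucene's analyzer classes (Snowball stemming, which produced [jump] on the last slide, is not reimplemented here, and the stop-word set below is a small assumed sample):

```java
import java.util.*;

// Toy analyzers: whitespace keeps case and punctuation; simple
// lowercases and strips non-letters; stop additionally drops stop words.
public class AnalyzerSketch {
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "an", "and", "of"));

    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+"))
            if (!t.isEmpty()) out.add(t);
        return out;
    }

    static List<String> stop(String text) {
        List<String> out = new ArrayList<>();
        for (String t : simple(text))
            if (!STOP_WORDS.contains(t)) out.add(t);
        return out;
    }
}
```

Running the three methods on the example sentence shows why analyzer choice matters: the same text yields three different term streams, and only terms that survive analysis are searchable.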
• 41. Query Creation
  - Searching by a term: TermQuery
  - Searching within a range: RangeQuery
  - Searching on a string prefix: PrefixQuery
  - Combining queries: BooleanQuery
  - Searching by phrase: PhraseQuery
  - Searching by wildcard: WildcardQuery
  - Searching for similar terms: FuzzyQuery
• 42. Lucene Queries (query examples shown as an image)
• 43. Conclusions
  - Nutch as a starting point
  - Crawling in Nutch
  - Detailed map-reduce architecture
  - Different query formats in Lucene
  - Built-in analyzers in Lucene
  - The same analyzer must be used both while indexing and while searching
• 44. Resources Used
  - Gospodnetic, Otis; Hatcher, Erik (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. 456 pp. ISBN 978-1-932394-28-3.
  - Nutch Wiki: http://wiki.apache.org/nutch/
• 45. Thanks
  - Questions?