Nutch and Lucene Framework
Presentation Transcript

• Sandhan (CLIA) - Nutch and Lucene Framework - Gaurav Arora, IRLAB, DA-IICT
• Outline
  - Introduction
  - Behavior of Nutch (offline and online)
  - Lucene features
  - Sandhan demo
  - RJ interface
• Introduction
  - Nutch is an open-source search engine
  - Implemented in Java
  - Nutch is built from Lucene, Solr, Hadoop, etc.
  - Lucene provides the indexing and searching of crawled data
  - Both Nutch and Lucene are developed using a plugin framework
  - Easy to customize
• Where do they fit in IR? [diagram]
• Nutch - a complete search engine [diagram]
• Nutch - offline processing
  - Crawling
    - Starts with a set of seed URLs
    - Goes deeper into the web and starts fetching the content
    - Content needs to be analyzed before storing
  - Storing the content
    - Makes it suitable for searching
  - Issues
    - Time-consuming process
    - Freshness of the crawl (how often should I crawl?)
    - Coverage of content
• Nutch - online processing
  - Searching
    - Analysis of the query
    - Processing of the few words (tokens) in the query
    - Query tokens matched against stored tokens (the index)
    - Fast and accurate
  - Involves ordering the matching results
    - Ranking affects the user's satisfaction directly
  - Supports distributed searching
• Nutch - data structures
  - Web Database (WebDB)
    - Mirrors the properties/structure of the web graph being crawled
  - Segment
    - Intermediate index
    - Contains pages fetched in a single run
  - Index
    - Final inverted index obtained by "merging" segments (Lucene)
• Nutch data
  - Web Database (WebDB)
    - Crawldb: contains information about every URL known to Nutch, including whether it was fetched.
    - Linkdb: contains the list of known links to each URL, including both the source URL and the anchor text of the link.
  - Index
    - Inverted index: posting lists, mapping from words to the documents that contain them.
• Nutch data - Segment
  Each segment is a set of URLs that are fetched as a unit. A segment contains:
  - crawl_generate: names a set of URLs to be fetched
  - crawl_fetch: contains the status of fetching each URL
  - content: contains the raw content retrieved from each URL
  - parse_text: contains the parsed text of each URL
  - parse_data: contains outlinks and metadata parsed from each URL
  - crawl_parse: contains the outlink URLs, used to update the crawldb
• Nutch - crawling
  - Inject: initial creation of the CrawlDB
    - Insert seed URLs
    - Initial LinkDB is empty
  - Generate a new shard's fetchlist
  - Fetch raw content
  - Parse content (discovers outlinks)
  - Update CrawlDB from shards
  - Update LinkDB from shards
  - Index shards
• Wide crawling vs. focused crawling
  - Differences:
    - Little technical difference in configuration
    - Big difference in operations, maintenance and quality
  - Wide crawling:
    - (Almost) unlimited crawling frontier
    - High risk of spamming and junk content
    - "Politeness" is a very important limiting factor
    - Bandwidth and DNS considerations
  - Focused (vertical or enterprise) crawling:
    - Limited crawling frontier
    - Bandwidth or politeness is often not an issue
    - Low risk of spamming and junk content
• Crawling architecture [diagram]
• Step 1: The Injector injects the list of seed URLs into the CrawlDB.
• Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms a fetch list, and adds a crawl_generate folder into the segments.
• Step 3: These fetch lists are used by the fetchers to fetch the raw content of each document, which is then stored in the segments.
• Step 4: The Parser is called to parse the content of each document, and the parsed content is stored back in the segments.
• Step 5: The links are inverted into the link graph and stored in the LinkDB.
• Step 6: The terms present in the segments are indexed and the indices are updated in the segments.
• Step 7: Information on the newly fetched documents is updated in the CrawlDB.
• Crawling: a 10-stage process
  bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
  1. admin db -create: Create a new WebDB.
  2. inject: Inject root URLs into the WebDB.
  3. generate: Generate a fetchlist from the WebDB in a new segment.
  4. fetch: Fetch content from URLs in the fetchlist.
  5. updatedb: Update the WebDB with links from fetched pages.
  6. Repeat steps 3-5 until the required depth is reached.
  7. updatesegs: Update segments with scores and links from the WebDB.
  8. index: Index the fetched pages.
  9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
  10. merge: Merge the indexes into a single index for searching.
• De-duplication algorithm
  A tuple (MD5 hash, float score, int indexID, int docID, int urlLen) is kept for each page.
  To eliminate URL duplicates from a segmentsDir:
    open a temporary file
    for each segment:
      for each document in its index:
        append a tuple for the document to the temporary file, with hash = MD5(URL)
    close the temporary file
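
The pseudocode above keys each page by the MD5 hash of its URL so that duplicates can be detected cheaply. A minimal Java sketch of that idea, using an in-memory set instead of Nutch's temporary-file-and-sort approach (class and method names here are illustrative, not Nutch's own):

    import java.security.MessageDigest;
    import java.util.HashSet;
    import java.util.Set;

    public class UrlDedupSketch {
        // MD5 hashes of URLs we have already accepted
        private final Set<String> seenHashes = new HashSet<String>();

        /** Returns true the first time a URL is seen, false for duplicates. */
        public boolean accept(String url) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return seenHashes.add(hex.toString());
        }
    }
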
• URL filtering
  - URL filters are a plain text file (conf/crawl-urlfilter.txt)
  - Regular expressions to filter URLs during crawling
  - E.g.:
    - To ignore files with a certain suffix: -.(gif|exe|zip|ico)$
    - To accept hosts in a certain domain: +^http://([a-z0-9]*.)*apache.org/
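
A small sketch of how such "+/-" rules can be evaluated in plain Java - only an illustration of the filter semantics, not Nutch's actual RegexURLFilter; the pattern strings are adapted from the slide:

    import java.util.regex.Pattern;

    public class UrlFilterSketch {
        // "-" rules reject matching URLs, "+" rules accept them
        private static final Pattern REJECT =
            Pattern.compile("\\.(gif|exe|zip|ico)$");
        private static final Pattern ACCEPT =
            Pattern.compile("^http://([a-z0-9]*\\.)*apache\\.org/");

        public static boolean accept(String url) {
            if (REJECT.matcher(url).find()) return false;  // -.(gif|exe|zip|ico)$
            return ACCEPT.matcher(url).find();             // +^http://(...)*apache.org/
        }
    }
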
• A few APIs
  - Site we would crawl: http://www.iitb.ac.in
  - bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
  - Analyze the database:
    - bin/nutch readdb <db dir> -stats
    - bin/nutch readdb <db dir> -dumppageurl
    - bin/nutch readdb <db dir> -dumplinks
    - s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s
• Map-Reduce functions
  - Work in a distributed environment
  - map() and reduce() functions are implemented in most of the modules
  - Both map() and reduce() use <key, value> pairs
  - Useful for processing large data (e.g. indexing)
  - Some applications need a sequence of map-reduce jobs:
    Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
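
To make the <key, value> flow concrete, here is a hedged word-count-style example in the classic Hadoop "old" API (org.apache.hadoop.mapred) that Nutch of this era builds on; it is a generic illustration, not code taken from Nutch:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // map(): emit <word, 1> for every token in the input line
    public class TokenCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                output.collect(new Text(tokens.nextToken()), ONE);
            }
        }
    }

    // reduce(): sum all counts emitted for the same word
    class TokenCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
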
• Map-Reduce architecture [diagram]
• Nutch - Map-Reduce indexing
  - map() just assembles all parts of the documents
  - reduce() performs text analysis + indexing:
    - Adds to a local Lucene index
  - Other possible MR indexing models:
    - Hadoop contrib/indexing model:
      - Analysis and indexing on the map() side
      - Index merging on the reduce() side
    - Modified Nutch model:
      - Analysis on the map() side
      - Indexing on the reduce() side
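
Where per-shard Lucene indexes are built on the map() or reduce() side, the final index is produced by merging them. A hedged sketch of such a merge with the Lucene 1.x/2.x API used elsewhere in this deck (paths and the analyzer are illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeShards {
        public static void main(String[] args) throws Exception {
            // destination index; "true" creates it from scratch
            IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory("merged-index", true),
                new StandardAnalyzer(), true);

            // per-shard indexes produced by the MR jobs (illustrative paths)
            Directory[] shards = new Directory[] {
                FSDirectory.getDirectory("shard-0", false),
                FSDirectory.getDirectory("shard-1", false)
            };
            writer.addIndexes(shards);   // merge the shard indexes into one
            writer.optimize();
            writer.close();
        }
    }
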
• Nutch - ranking
  - Nutch ranking [Lucene scoring formula]
  - queryNorm(): the normalization factor for the query
  - coord(): indicates how many query terms are present in the given document
  - norm(): score indicating the field-based normalization factor
  - tf: term frequency; idf: inverse document frequency
  - t.boost(): score indicating the importance of a term's occurrence in a particular field
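
For reference, the factors listed above combine in the classic Lucene similarity formula (the standard textbook form of Lucene's DefaultSimilarity scoring, reconstructed here rather than copied from the slide):

    score(q,d) = coord(q,d) \cdot queryNorm(q) \cdot
                 \sum_{t \in q} ( tf(t,d) \cdot idf(t)^2 \cdot t.boost() \cdot norm(t,d) )
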
• Lucene - features
  - Field-based indexing and searching
  - The different fields of a webpage are:
    - Title
    - URL
    - Anchor text
    - Content, etc.
  - Different boost factors to give importance to fields
  - Uses an inverted index to store the content of crawled documents
  - Open-source Apache project
• Lucene - index
  - Concepts
    - Index: sequence of documents (a.k.a. Directory)
    - Document: sequence of fields
    - Field: named sequence of terms
    - Term: a text string (e.g., a word)
  - Statistics
    - Term frequencies and positions
• Writing to an index
  // open an IndexWriter over a Directory with the given Analyzer;
  // "true" means create a new index rather than appending to an existing one
  IndexWriter writer =
      new IndexWriter(directory, analyzer, true);

  Document doc = new Document();
  // add fields to the document (next slide)
  writer.addDocument(doc);
  writer.close();
• Adding fields
  doc.add(Field.Keyword("isbn", isbn));
  doc.add(Field.Keyword("category", category));
  doc.add(Field.Text("title", title));
  doc.add(Field.Text("author", author));
  doc.add(Field.UnIndexed("url", url));
  doc.add(Field.UnStored("subjects", subjects, true));
  doc.add(Field.Keyword("pubmonth", pubmonth));
  doc.add(Field.UnStored("contents", author + " " + subjects));
  doc.add(Field.Keyword("modified",
      DateField.timeToString(file.lastModified())));
• Field descriptions
  - Attributes
    - Stored: original content retrievable
    - Indexed: inverted, searchable
    - Tokenized: analyzed, split into tokens
  - Factory methods
    - Keyword: stored and indexed as a single term
    - Text: indexed, tokenized, and stored if a String
    - UnIndexed: stored, not indexed
    - UnStored: indexed, tokenized
  - Terms are what matter for searching
• Searching an index
  IndexSearcher searcher =
      new IndexSearcher(directory);

  // parse the user's query against the "contents" field
  Query query =
      QueryParser.parse(queryExpression, "contents", analyzer);

  Hits hits = searcher.search(query);
  for (int i = 0; i < hits.length(); i++) {
      Document doc = hits.doc(i);
      System.out.println(doc.get("title"));
  }
• Analyzer
  - Analysis occurs
    - For each tokenized field during indexing
    - For each term or phrase in QueryParser
  - Several analyzers are built in
  - Many more in the sandbox
  - Straightforward to create your own
  - Choosing the right analyzer is important!
• WhiteSpace Analyzer
  "The quick brown fox jumps over the lazy dog."
  -> [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
• Simple Analyzer
  "The quick brown fox jumps over the lazy dog."
  -> [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
• Stop Analyzer
  "The quick brown fox jumps over the lazy dog."
  -> [quick] [brown] [fox] [jumps] [over] [lazy] [dog]
• Snowball Analyzer
  "The quick brown fox jumps over the lazy dog."
  -> [the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
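
A hedged sketch of how token streams like the ones above can be produced programmatically, assuming the older Lucene 1.x/2.x TokenStream API that matches the other code in this deck (the field name and sample text are illustrative):

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;

    public class AnalyzerDemo {
        static void show(Analyzer analyzer, String text) throws Exception {
            TokenStream stream =
                analyzer.tokenStream("contents", new StringReader(text));
            Token token;
            while ((token = stream.next()) != null) {   // old-style iteration
                System.out.print("[" + token.termText() + "] ");
            }
            System.out.println();
        }

        public static void main(String[] args) throws Exception {
            String text = "The quick brown fox jumps over the lazy dog.";
            show(new WhitespaceAnalyzer(), text);
            show(new SimpleAnalyzer(), text);
            show(new StopAnalyzer(), text);
        }
    }
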
• Query creation
  - Searching by a term: TermQuery
  - Searching within a range: RangeQuery
  - Searching on a prefix string: PrefixQuery
  - Combining queries: BooleanQuery
  - Searching by phrase: PhraseQuery
  - Searching by wildcard: WildcardQuery
  - Searching for similar terms: FuzzyQuery
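
A short, hedged sketch constructing a few of these query types directly (Lucene 1.x/2.x classes; the field names and terms are only examples):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.WildcardQuery;

    public class QueryExamples {
        public static void main(String[] args) {
            // single term
            Query term = new TermQuery(new Term("contents", "nutch"));

            // terms sharing a prefix
            Query prefix = new PrefixQuery(new Term("url", "http://wiki.apache"));

            // exact phrase: terms must appear next to each other
            PhraseQuery phrase = new PhraseQuery();
            phrase.add(new Term("contents", "search"));
            phrase.add(new Term("contents", "engine"));

            // wildcard and "similar term" queries
            Query wildcard = new WildcardQuery(new Term("contents", "luc*"));
            Query fuzzy = new FuzzyQuery(new Term("contents", "nutsh"));

            System.out.println(term + " | " + prefix + " | "
                + phrase + " | " + wildcard + " | " + fuzzy);
        }
    }
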
• Lucene queries [examples]
• Conclusions
  - Nutch as a starting point
  - Crawling in Nutch
  - Detailed map-reduce architecture
  - Different query formats in Lucene
  - Built-in analyzers in Lucene
  - The same analyzer needs to be used both while indexing and while searching
• Resources used
  - Gospodnetic, Otis; Hatcher, Erik (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. 456 pp. ISBN 978-1-932394-28-3.
  - Nutch Wiki: http://wiki.apache.org/nutch/
• Thanks
  - Questions?