Nutch and lucene_framework

Transcript

  • 1. Sandhan (CLIA): Nutch and Lucene Framework. Gaurav Arora, IRLAB, DA-IICT
  • 2. Outline
      • Introduction
      • Behavior of Nutch (Offline and Online)
      • Lucene Features
      • Sandhan Demo
      • RJ Interface
  • 3. Introduction
      • Nutch is an open-source search engine
      • Implemented in Java
      • Nutch is built from Lucene, Solr, Hadoop, etc.
      • Lucene implements indexing and searching of the crawled data
      • Both Nutch and Lucene are developed using a plugin framework
      • Easy to customize
  • 4. Where do they fit in IR?
  • 5. Nutch – complete search engine
  • 6. Nutch – offline processing
      • Crawling
          • Starts with a set of seed URLs
          • Goes deeper into the web and starts fetching the content
          • Content needs to be analyzed before storing
      • Storing the content
          • Makes it suitable for searching
      • Issues
          • Time-consuming process
          • Freshness of the crawl (How often should I crawl?)
          • Coverage of content
  • 7. Nutch – online processing
      • Searching
          • Analysis of the query
          • Processing of the few words (tokens) in the query
          • Query tokens matched against stored tokens (index)
      • Fast and accurate
      • Involves ordering the matching results
      • Ranking affects the user's satisfaction directly
      • Supports distributed searching
  • 8. Nutch – Data structures
      • Web Database or WebDB
          • Mirrors the properties/structure of the web graph being crawled
      • Segment
          • Intermediate index
          • Contains pages fetched in a single run
      • Index
          • Final inverted index obtained by “merging” segments (Lucene)
  • 9. Nutch – Data
      • Web Database or WebDB
          • Crawldb: contains information about every URL known to Nutch, including whether it was fetched.
          • Linkdb: contains the list of known links to each URL, including both the source URL and anchor text of the link.
      • Index
          • Inverted index: posting lists, mapping from each word to the documents that contain it.
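The "mapping from words to documents" idea on this slide can be illustrated with a small, self-contained sketch. It is not Lucene's or Nutch's actual index format: the class name `InvertedIndexSketch`, the tokenization rule, and the assumption that documents are added in increasing ID order are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index (hypothetical): word -> posting list of document IDs. */
class InvertedIndexSketch {
    // Each posting list holds document IDs in ascending order without duplicates.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Lowercases the text, splits it on non-alphanumeric characters, and records docId per word. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty first token when the text starts with a delimiter
            }
            List<Integer> list = postings.computeIfAbsent(word, k -> new ArrayList<>());
            // Assumption: documents arrive in increasing docId order, so checking
            // only the last posting is enough to avoid duplicates.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    /** Returns the posting list for a word, or an empty list if the word was never indexed. */
    List<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), List.of());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Nutch crawls the web");
        index.add(2, "Lucene indexes the web");
        System.out.println("web -> " + index.lookup("web"));
    }
}
```

A query term is then answered by one map lookup instead of a scan over every document, which is why the final index is stored in inverted form.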
  • 10. Nutch Data - Segment
    Each segment is a set of URLs that are fetched as a unit. A segment contains:
      • crawl_generate names a set of URLs to be fetched
      • crawl_fetch contains the status of fetching each URL
      • content contains the raw content retrieved from each URL
      • parse_text contains the parsed text of each URL
      • parse_data contains outlinks and metadata parsed from each URL
      • crawl_parse contains the outlink URLs, used to update the crawldb
  • 11. Nutch – Crawling
      • Inject: initial creation of CrawlDB
          • Insert seed URLs
          • Initial LinkDB is empty
      • Generate new shard's fetchlist
      • Fetch raw content
      • Parse content (discovers outlinks)
      • Update CrawlDB from shards
      • Update LinkDB from shards
      • Index shards
  • 12. Wide Crawling vs. Focused Crawling
      • Differences:
          • Little technical difference in configuration
          • Big difference in operations, maintenance and quality
      • Wide crawling:
          • (Almost) unlimited crawling frontier
          • High risk of spamming and junk content
          • “Politeness” a very important limiting factor
          • Bandwidth & DNS considerations
      • Focused (vertical or enterprise) crawling:
          • Limited crawling frontier
          • Bandwidth or politeness is often not an issue
          • Low risk of spamming and junk content
  • 13. Crawling Architecture
  • 14. Step 1: Injector injects the list of seed URLs into the CrawlDB
  • 15. Step 2: Generator takes the list of seed URLs from CrawlDB, forms a fetch list, and adds a crawl_generate folder to the segments
  • 16. Step 3: These fetch lists are used by fetchers to fetch the raw content of the documents, which is then stored in segments
  • 17. Step 4: Parser is called to parse the content of the documents, and the parsed content is stored back in segments
  • 18. Step 5: The links are inverted in the link graph and stored in LinkDB
  • 19. Step 6: The terms present in segments are indexed, and the indices are updated in the segments
  • 20. Step 7: Information on the newly fetched documents is updated in the CrawlDB
  • 21. Crawling: 10 stage process
      bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
      1. admin db -create: Create a new WebDB.
      2. inject: Inject root URLs into the WebDB.
      3. generate: Generate a fetchlist from the WebDB in a new segment.
      4. fetch: Fetch content from URLs in the fetchlist.
      5. updatedb: Update the WebDB with links from fetched pages.
      6. Repeat steps 3-5 until the required depth is reached.
      7. updatesegs: Update segments with scores and links from the WebDB.
      8. index: Index the fetched pages.
      9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
      10. merge: Merge the indexes into a single index for searching.
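The inject / generate / fetch / updatedb loop (steps 2 to 6) can be sketched without any Nutch code. The example below is only a simulation under stated assumptions: the "web" is a hypothetical in-memory map from a URL to its outlinks, `CrawlLoopSketch` and `crawl` are made-up names, and `depth` is interpreted as the number of generate/fetch/update rounds.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Simulated crawl loop over an in-memory link graph (hypothetical, not Nutch code). */
class CrawlLoopSketch {

    /** Returns the URLs fetched after at most `depth` generate/fetch/update rounds. */
    static Set<String> crawl(Map<String, List<String>> web, List<String> seeds, int depth) {
        Set<String> known = new LinkedHashSet<>(seeds); // inject: seed URLs enter the database
        Set<String> fetched = new LinkedHashSet<>();

        for (int round = 0; round < depth; round++) {
            // generate: the fetchlist is every known URL that has not been fetched yet
            List<String> fetchList = new ArrayList<>();
            for (String url : known) {
                if (!fetched.contains(url)) {
                    fetchList.add(url);
                }
            }
            if (fetchList.isEmpty()) {
                break; // nothing new to crawl
            }
            for (String url : fetchList) {
                fetched.add(url);                            // fetch: "download" the page
                known.addAll(web.getOrDefault(url, List.of())); // updatedb: record its outlinks
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("d"),
                "d", List.of("e"));
        System.out.println(crawl(web, List.of("a"), 2));
    }
}
```

Each round only reaches pages one link further from the seeds, which is why the `-depth` setting trades crawl time against coverage.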
  • 22. De-duplication Algorithm
    Tuple for each page: (MD5 hash, float score, int indexID, int docID, int urlLen)
    To eliminate URL duplicates from a segmentsDir:
      open a temporary file
      for each segment:
          for each document in its index:
              append a tuple for the document to the temporary file with hash=MD5(URL)
      close the temporary file
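A minimal in-memory sketch of the tuple-based idea follows. The slide stops after the tuples are written, so the rule used here to pick a survivor (the entry with the higher score wins) is an assumption, as are the names `UrlDedupSketch`, `Entry` and `dedup`; a list stands in for the temporary file.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** URL de-duplication by MD5(URL), kept in memory (hypothetical sketch). */
class UrlDedupSketch {

    /** One tuple per indexed page, with the fields listed on the slide. */
    static final class Entry {
        final String hash;
        final float score;
        final int indexId;
        final int docId;
        final int urlLen;

        Entry(String url, float score, int indexId, int docId) {
            this.hash = md5Hex(url);
            this.score = score;
            this.indexId = indexId;
            this.docId = docId;
            this.urlLen = url.length();
        }
    }

    /** Hex-encoded MD5 digest of the UTF-8 bytes of the string. */
    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Keeps one entry per URL hash. Assumed policy: the higher score wins. */
    static List<Entry> dedup(List<Entry> entries) {
        Map<String, Entry> best = new LinkedHashMap<>();
        for (Entry e : entries) {
            Entry current = best.get(e.hash);
            if (current == null || e.score > current.score) {
                best.put(e.hash, e);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Entry> entries = List.of(
                new Entry("http://example.org/a", 1.0f, 0, 10),
                new Entry("http://example.org/a", 2.0f, 1, 3));
        System.out.println(dedup(entries).size() + " entry kept");
    }
}
```

The (indexID, docID) pair in each surviving tuple identifies which copy to keep; every other copy with the same hash can be deleted from its index.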
  • 23. URL Filtering
      • URL Filters (Text file) (conf/crawl-urlfilter.txt)
          • Regular expressions to filter URLs during crawling
          • E.g.
              • To ignore files with a certain suffix:
                -\.(gif|exe|zip|ico)$
              • To accept hosts in a certain domain:
                +^http://([a-z0-9]*\.)*apache.org/
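How such a rule file behaves can be approximated with plain `java.util.regex`. This is a sketch rather than Nutch's filter plugin: `UrlFilterSketch` is a made-up class, and it assumes that rules are tried in order, that the first matching rule decides, and that a URL matching no rule is rejected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Minimal +/- regex URL filter (hypothetical sketch, not the Nutch plugin). */
class UrlFilterSketch {

    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Each rule line is '+' (accept) or '-' (reject) followed by a regular expression. */
    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + trimmed);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** Assumed semantics: the first rule whose pattern is found in the URL decides. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // no rule matched: reject
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch(List.of(
                "-\\.(gif|exe|zip|ico)$",
                "+^http://([a-z0-9]*\\.)*apache.org/"));
        System.out.println(filter.accepts("http://nutch.apache.org/index.html"));
        System.out.println(filter.accepts("http://nutch.apache.org/logo.gif"));
    }
}
```

Because the first match wins under this assumption, the suffix exclusion has to appear before the domain rule, otherwise image URLs on the accepted domain would pass.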
  • 24. A Few APIs
      • Site we would crawl: http://www.iitb.ac.in
          • bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
      • Analyze the database:
          • bin/nutch readdb <db dir> -stats
          • bin/nutch readdb <db dir> -dumppageurl
          • bin/nutch readdb <db dir> -dumplinks
          • s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s
  • 25. Map-Reduce Function
      • Works in a distributed environment
      • map() and reduce() functions are implemented in most of the modules
      • Both map() and reduce() functions use <key, value> pairs
      • Useful when processing large data (e.g., indexing)
      • Some applications need a sequence of map-reduce jobs
          • Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
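The <key, value> contract can be shown with a single-process word count. This sketch does not use Hadoop; `MapReduceSketch`, `map`, `reduce` and `run` are hypothetical names, and the grouping step in `run` stands in for the shuffle that a real framework performs between the two phases.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Single-process illustration of map -> shuffle -> reduce (not Hadoop code). */
class MapReduceSketch {

    /** map(): emits a <word, 1> pair for every word in one document. */
    static List<Map.Entry<String, Integer>> map(String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    /** reduce(): receives one key with all of its values and sums them. */
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs map over every document, groups the pairs by key, then reduces each group. */
    static Map<String, Integer> run(Collection<String> documents) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String document : documents) {
            for (Map.Entry<String, Integer> pair : map(document)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("nutch lucene", "lucene hadoop lucene")));
    }
}
```

Chaining jobs, as in Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n, amounts to feeding the output pairs of one `run` into the map phase of the next.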
  • 26. Map-Reduce Architecture
  • 27. Nutch – Map-Reduce Indexing
      • Map() just assembles all parts of documents
      • Reduce() performs text analysis + indexing:
          • Adds to a local Lucene index
    Other possible MR indexing models:
      • Hadoop contrib/indexing model:
          • Analysis and indexing on map() side
          • Index merging on reduce() side
      • Modified Nutch model:
          • Analysis on map() side
          • Indexing on reduce() side
  • 28. Nutch - Ranking
      • queryNorm(): normalization factor for the query
      • coord(): indicates how many of the query terms are present in the given document
      • norm(): field-based normalization factor
      • tf: term frequency; idf: inverse document frequency
      • t.boost(): importance of a term's occurrence in a particular field
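The slide lists the factors but not how they combine. In Lucene's classic TF-IDF similarity they are multiplied together roughly as below; this form comes from that documentation rather than from the slide, and versions of Nutch that plug in their own scoring filters may differ.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Big(\mathrm{tf}(t,d)\cdot\mathrm{idf}(t)^{2}\cdot t.\mathrm{boost}()\cdot\mathrm{norm}(t,d)\Big)
\]
```

With the default similarity, tf is the square root of the term's frequency in the document and idf is 1 + ln(numDocs / (docFreq + 1)), so rare terms contribute more to the score and repeated occurrences give diminishing returns.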
  • 29. Lucene - Features
      • Field-based indexing and searching
      • Different fields of a webpage are
          • Title
          • URL
          • Anchor text
          • Content, etc.
      • Different boost factors to give importance to fields
      • Uses an inverted index to store the content of crawled documents
      • Open source Apache project
  • 30. Lucene - Index
      • Concepts
          • Index: sequence of documents (a.k.a. Directory)
          • Document: sequence of fields
          • Field: named sequence of terms
          • Term: a text string (e.g., a word)
      • Statistics
          • Term frequencies and positions
  • 31. Writing to Index
      // third argument true = create a new index
      IndexWriter writer =
          new IndexWriter(directory, analyzer, true);

      Document doc = new Document();
      // add fields to document (next slide)
      writer.addDocument(doc);
      writer.close();
  • 32. Adding Fields
      doc.add(Field.Keyword("isbn", isbn));
      doc.add(Field.Keyword("category", category));
      doc.add(Field.Text("title", title));
      doc.add(Field.Text("author", author));
      doc.add(Field.UnIndexed("url", url));
      doc.add(Field.UnStored("subjects", subjects, true));
      doc.add(Field.Keyword("pubmonth", pubmonth));
      doc.add(Field.UnStored("contents", author + " " + subjects));
      doc.add(Field.Keyword("modified",
          DateField.timeToString(file.lastModified())));
  • 33. Fields Description
      • Attributes
          • Stored: original content retrievable
          • Indexed: inverted, searchable
          • Tokenized: analyzed, split into tokens
      • Factory methods
          • Keyword: stored and indexed as a single term
          • Text: indexed, tokenized, and stored if String
          • UnIndexed: stored
          • UnStored: indexed, tokenized
      • Terms are what matters for searching
  • 34. Searching an Index
      IndexSearcher searcher =
          new IndexSearcher(directory);

      Query query =
          QueryParser.parse(queryExpression, "contents", analyzer);

      Hits hits = searcher.search(query);
      for (int i = 0; i < hits.length(); i++) {
          Document doc = hits.doc(i);
          System.out.println(doc.get("title"));
      }
  • 35. Analyzer
      • Analysis occurs
          • For each tokenized field during indexing
          • For each term or phrase in QueryParser
      • Several analyzers built-in
          • Many more in the sandbox
          • Straightforward to create your own
      • Choosing the right analyzer is important!
  • 36. WhiteSpace Analyzer
      The quick brown fox jumps over the lazy dog.
      [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
  • 37. Simple Analyzer
      The quick brown fox jumps over the lazy dog.
      [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
  • 38. Stop Analyzer
      The quick brown fox jumps over the lazy dog.
      [quick] [brown] [fox] [jumps] [over] [lazy] [dog]
  • 39. Snowball Analyzer
      The quick brown fox jumps over the lazy dog.
      [the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
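The difference between the first three analyzers can be reproduced with a few lines of plain Java. This is an approximation for illustration only, not Lucene's analyzer classes: `AnalyzerSketch` and its methods are made-up names, the stop-word set is a small assumed sample, and stemming (the Snowball case) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Approximations of whitespace, simple and stop analysis (hypothetical sketch). */
class AnalyzerSketch {

    // Assumed sample of English stop words; a real analyzer ships a longer list.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "of", "to", "in", "is");

    /** Splits on whitespace only; case and punctuation are kept. */
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Lowercases and splits on anything that is not a letter. */
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Same as simple(), then drops stop words. */
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : simple(text)) {
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "The quick brown fox jumps over the lazy dog.";
        System.out.println(whitespace(sentence));
        System.out.println(simple(sentence));
        System.out.println(stop(sentence));
    }
}
```

A query for "Dog" only finds the document if the query text goes through the same lowercasing and punctuation handling as the indexed text, which is why the analyzer used at search time should match the one used at index time.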
  • 40. Query Creation
      • Searching by a term – TermQuery
      • Searching within a range – RangeQuery
      • Searching on a string prefix – PrefixQuery
      • Combining queries – BooleanQuery
      • Searching by phrase – PhraseQuery
      • Searching by wildcard – WildcardQuery
      • Searching for similar terms – FuzzyQuery
  • 41. Lucene Queries
  • 42. Conclusions
      • Nutch as a starting point
      • Crawling in Nutch
      • Detailed map-reduce architecture
      • Different query formats in Lucene
      • Built-in analyzers in Lucene
      • The same analyzer needs to be used for both indexing and searching
  • 43. Resources Used
      • Gospodnetic, Otis; Erik Hatcher (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. pp. 456. ISBN 978-1-932394-28-3.
      • Nutch Wiki: http://wiki.apache.org/nutch/
  • 44. Thanks
      • Questions?