  1. Sandhan (CLIA): Nutch and Lucene Framework. Gaurav Arora, IRLAB, DA-IICT
  2. Outline
     - Introduction
     - Behavior of Nutch (offline and online)
     - Lucene features
     - Sandhan demo
     - RJ interface
  3. Introduction
     - Nutch is an open-source search engine
     - Implemented in Java
     - Nutch is built from several components: Lucene, Solr, Hadoop, etc.
     - Lucene implements indexing and searching of the crawled data
     - Both Nutch and Lucene are developed using a plugin framework
     - Easy to customize
  4. Where do they fit in IR?
  5. Nutch – complete search engine
  6. Nutch – offline processing
     - Crawling
       - Starts with a set of seed URLs
       - Goes deeper into the web and fetches the content
       - Content needs to be analyzed before it is stored
     - Storing the content
       - Makes it suitable for searching
     - Issues
       - Time-consuming process
       - Freshness of the crawl (how often should I crawl?)
       - Coverage of content
  7. Nutch – online processing
     - Searching
       - Analysis of the query
       - Processing of the few words (tokens) in the query
       - Query tokens are matched against the stored tokens (the index)
     - Fast and accurate
       - Involves ordering the matching results
       - Ranking directly affects user satisfaction
     - Supports distributed searching
  8. Nutch – Data structures
     - Web Database or WebDB
       - Mirrors the properties/structure of the web graph being crawled
     - Segment
       - Intermediate index
       - Contains pages fetched in a single run
     - Index
       - Final inverted index obtained by "merging" segments (Lucene)
  9. Nutch – Data
     - Web Database or WebDB
       - CrawlDB: contains information about every URL known to Nutch, including whether it was fetched.
       - LinkDB: contains the list of known links to each URL, including both the source URL and the anchor text of the link.
     - Index
       - Inverted index: posting lists, a mapping from each word to the documents that contain it.
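The "mapping from words to documents" idea can be sketched in a few lines of Java. This is a minimal in-memory illustration, not the on-disk structure Nutch or Lucene actually uses; the class name `InvertedIndexSketch`, the crude tokenizer, and the requirement that documents arrive in increasing ID order are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative in-memory index; not the on-disk format used by Nutch or Lucene.
class InvertedIndexSketch {
    // term -> posting list (IDs of the documents containing the term)
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Assumption: crude tokenization (lowercase, split on anything but a-z and 0-9).
    // Assumption: documents are added in increasing docId order.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> list = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Record each document once per term, even if the term repeats.
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    // Returns the posting list for a term, or an empty list if the term is unknown.
    List<Integer> lookup(String term) {
        List<Integer> list = postings.get(term.toLowerCase(Locale.ROOT));
        return list == null ? Collections.<Integer>emptyList() : list;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "The quick brown fox");
        index.add(2, "The lazy dog");
        System.out.println(index.lookup("the"));
    }
}
```

Answering a query then reduces to fetching and combining posting lists, which is why lookups stay fast even when the collection is large.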
  10. Nutch Data – Segment
      Each segment is a set of URLs that are fetched as a unit. A segment contains:
      - crawl_generate: names a set of URLs to be fetched
      - crawl_fetch: contains the status of fetching each URL
      - content: contains the raw content retrieved from each URL
      - parse_text: contains the parsed text of each URL
      - parse_data: contains outlinks and metadata parsed from each URL
      - crawl_parse: contains the outlink URLs, used to update the CrawlDB
  11. Nutch – Crawling
      - Inject: initial creation of CrawlDB
        - Insert seed URLs
        - Initial LinkDB is empty
      - Generate a new shard's fetchlist
      - Fetch raw content
      - Parse content (discovers outlinks)
      - Update CrawlDB from shards
      - Update LinkDB from shards
      - Index shards
  12. Wide Crawling vs. Focused Crawling
      - Differences:
        - Little technical difference in configuration
        - Big difference in operations, maintenance and quality
      - Wide crawling:
        - (Almost) unlimited crawling frontier
        - High risk of spamming and junk content
        - "Politeness" is a very important limiting factor
        - Bandwidth and DNS considerations
      - Focused (vertical or enterprise) crawling:
        - Limited crawling frontier
        - Bandwidth or politeness is often not an issue
        - Low risk of spamming and junk content
  13. Crawling Architecture
  14. Step 1: The Injector injects the list of seed URLs into the CrawlDB.
  15. Step 2: The Generator takes the list of seed URLs from the CrawlDB, forms a fetch list, and adds a crawl_generate folder to the segments.
  16. Step 3: Fetchers use these fetch lists to fetch the raw content of the documents, which is then stored in segments.
  17. Step 4: The Parser is called to parse the content of the documents, and the parsed content is stored back in segments.
  18. Step 5: The links are inverted in the link graph and stored in the LinkDB.
  19. Step 6: The terms present in the segments are indexed, and the indices are updated in the segments.
  20. Step 7: Information on the newly fetched documents is updated in the CrawlDB.
  21. Crawling: 10-stage process
      bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
      1. admin db -create: Create a new WebDB.
      2. inject: Inject root URLs into the WebDB.
      3. generate: Generate a fetchlist from the WebDB in a new segment.
      4. fetch: Fetch content from URLs in the fetchlist.
      5. updatedb: Update the WebDB with links from fetched pages.
      6. Repeat steps 3-5 until the required depth is reached.
      7. updatesegs: Update segments with scores and links from the WebDB.
      8. index: Index the fetched pages.
      9. dedup: Eliminate duplicate content (and duplicate URLs) from the indexes.
      10. merge: Merge the indexes into a single index for searching.
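The generate / fetch / updatedb loop (steps 3 to 6) can be sketched over a simulated web held in memory. Everything here is hypothetical scaffolding for illustration: the class `CrawlLoopSketch`, the `web` map standing in for real HTTP fetches and parsing, and the plain sets standing in for the WebDB. It shows only how `-depth` bounds the number of rounds, not how Nutch implements them.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlLoopSketch {
    // web: url -> outlinks found on that page (a stand-in for fetching and parsing).
    // Returns the URLs fetched after at most `depth` generate/fetch/update rounds.
    static Set<String> crawl(Map<String, List<String>> web, List<String> seeds, int depth) {
        Set<String> knownUrls = new LinkedHashSet<>(seeds); // inject: seed the URL database
        Set<String> fetched = new LinkedHashSet<>();

        for (int round = 0; round < depth; round++) {
            // generate: every known URL that has not been fetched yet
            List<String> fetchList = new ArrayList<>();
            for (String url : knownUrls) {
                if (!fetched.contains(url)) {
                    fetchList.add(url);
                }
            }
            if (fetchList.isEmpty()) {
                break; // nothing left to crawl
            }

            // fetch + parse: collect the outlinks of each fetched page
            List<String> discovered = new ArrayList<>();
            for (String url : fetchList) {
                fetched.add(url);
                List<String> outlinks = web.get(url);
                discovered.addAll(outlinks == null ? Collections.<String>emptyList() : outlinks);
            }

            // updatedb: newly discovered links become candidates for the next round
            knownUrls.addAll(discovered);
        }
        return fetched;
    }
}
```

With a seed page that links to two others, a depth of 1 fetches only the seed and a depth of 2 also fetches the two linked pages, which is the trade-off between coverage and crawl time noted earlier.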
  22. De-duplication Algorithm
      Tuple kept for each page: (MD5 hash, float score, int indexID, int docID, int urlLen)
      To eliminate URL duplicates from a segmentsDir:
          open a temporary file
          for each segment:
              for each document in its index:
                  append a tuple for the document to the temporary file with hash=MD5(URL)
          close the temporary file
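The pseudocode above stops once the tuples are written. A minimal sketch of the tuple and of one plausible way to use it follows. The grouping step (keep one tuple per hash, preferring the highest score) is an assumption made for illustration and uses an in-memory map rather than a temporary file; the names `DedupSketch`, `PageTuple`, `tupleFor`, and `keepBestPerHash` are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DedupSketch {
    // One tuple per indexed page, mirroring the fields listed on the slide.
    static final class PageTuple {
        final String hash;   // MD5 of the URL, as 32 hex characters
        final float score;
        final int indexId;
        final int docId;
        final int urlLen;

        PageTuple(String hash, float score, int indexId, int docId, int urlLen) {
            this.hash = hash;
            this.score = score;
            this.indexId = indexId;
            this.docId = docId;
            this.urlLen = urlLen;
        }
    }

    static String md5Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is unavailable on this JVM", e);
        }
    }

    static PageTuple tupleFor(String url, float score, int indexId, int docId) {
        return new PageTuple(md5Hex(url), score, indexId, docId, url.length());
    }

    // Assumed policy: among tuples with the same hash, keep the highest-scoring one.
    static List<PageTuple> keepBestPerHash(List<PageTuple> tuples) {
        Map<String, PageTuple> best = new LinkedHashMap<>();
        for (PageTuple t : tuples) {
            best.merge(t.hash, t, (kept, candidate) -> candidate.score > kept.score ? candidate : kept);
        }
        return new ArrayList<>(best.values());
    }
}
```

Hashing the URL removes URL duplicates; hashing the page content instead would, under the same scheme, catch identical content served from different URLs.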
  23. URL Filtering
      - URL filters (text file): conf/crawl-urlfilter.txt
      - Regular expressions filter URLs during crawling
      - E.g.
        - To ignore files with certain suffixes:
            -\.(gif|exe|zip|ico)$
        - To accept hosts in a certain domain:
            +^http://([a-z0-9]*\.)*
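How such a rule file behaves can be sketched with `java.util.regex`. The sketch assumes that rules are tried in order, that the first matching pattern decides (`+` accepts, `-` rejects), and that a URL matching no rule is rejected. The class `UrlFilterSketch` is hypothetical, and `example.org` in the usage note is a placeholder, since the accept pattern on the slide is cut off before the domain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilterSketch {
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    // Each non-empty, non-comment line is '+' or '-' followed by a regular expression.
    UrlFilterSketch(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    // Assumed semantics: the first rule whose pattern is found in the URL decides.
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // assumed default: reject URLs that match no rule
    }
}
```

With the two rules `-\.(gif|exe|zip|ico)$` and `+^http://([a-z0-9]*\.)*example\.org/`, the filter accepts `http://www.example.org/index.html` and rejects both `http://www.example.org/logo.gif` and any URL on another host. Rule order matters: putting the suffix rule first is what keeps images inside the accepted domain out of the crawl.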
  24. A few commands
      - Crawl a site:
          bin/nutch crawl <urlfile> -dir <dir> -depth <n> >& crawl.log
      - Analyze the database:
          bin/nutch readdb <db dir> -stats
          bin/nutch readdb <db dir> -dumppageurl
          bin/nutch readdb <db dir> -dumplinks
          s=`ls -d <segment dir>/* | head -1` ; bin/nutch segread -dump $s
  25. Map-Reduce Function
      - Works in a distributed environment
      - map() and reduce() functions are implemented in most of the modules
      - Both map() and reduce() use <key, value> pairs
      - Useful when processing large data (e.g., indexing)
      - Some applications need a sequence of map-reduce jobs
        - Map-1 -> Reduce-1 -> ... -> Map-n -> Reduce-n
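The <key, value> contract can be illustrated with a single-process word count. This is a sketch of the programming model only: it does not use Hadoop, the "shuffle" is a plain in-memory grouping, and the names `MapReduceSketch`, `map`, `reduce`, and `run` are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class MapReduceSketch {
    // map(): emit one <term, 1> pair for every token of one input document.
    static List<Map.Entry<String, Integer>> map(String docText) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : docText.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(new SimpleEntry<>(token, 1));
            }
        }
        return out;
    }

    // reduce(): combine all values that were emitted for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Driver: run map() on every input, group the pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> docs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String doc : docs) {
            for (Map.Entry<String, Integer> pair : map(doc)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }
}
```

Because each map() call sees one document and each reduce() call sees one key, both phases can be spread across machines; chaining jobs means feeding the output pairs of one run() into the map() of the next.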
  26. Map-Reduce Architecture
  27. Nutch – Map-Reduce Indexing
      - map() just assembles all parts of documents
      - reduce() performs text analysis + indexing:
        - Adds to a local Lucene index
      Other possible MR indexing models:
      - Hadoop contrib/indexing model:
        - Analysis and indexing on the map() side
        - Index merging on the reduce() side
      - Modified Nutch model:
        - Analysis on the map() side
        - Indexing on the reduce() side
  28. Nutch - Ranking
      - queryNorm(): the normalization factor for the query
      - coord(): how many of the query terms are present in the given document
      - norm(): field-based normalization factor
      - tf: term frequency; idf: inverse document frequency
      - t.boost(): the importance of a term's occurrence in a particular field
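These factors combine as in the scoring function documented for Lucene's classic TF-IDF similarity, written out below for a query q and a document d. Whether a given Nutch deployment uses exactly this form, and the default choices for tf and idf shown in the second line, are assumptions that depend on the Lucene version and the configured similarity.

```latex
\mathrm{score}(q, d) =
  \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot
  \sum_{t \in q} \Big( \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)^{2} \cdot
  \mathrm{boost}(t) \cdot \mathrm{norm}(t, d) \Big)

\mathrm{tf}(t, d) = \sqrt{\mathrm{freq}(t, d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\!\left( \frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1} \right)
```

A term that is rare in the collection (high idf) and frequent in the document (high tf) contributes most to the sum, while coord() rewards documents that match more of the query terms.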
  29. Lucene - Features
      - Field-based indexing and searching
      - Different fields of a webpage are
        - Title
        - URL
        - Anchor text
        - Content
      - Different boost factors give importance to fields
      - Uses an inverted index to store the content of crawled documents
      - Open-source Apache project
  30. Lucene - Index
      - Concepts
        - Index: sequence of documents (a.k.a. Directory)
        - Document: sequence of fields
        - Field: named sequence of terms
        - Term: a text string (e.g., a word)
      - Statistics
        - Term frequencies and positions
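  31. Writing to Index
      // Lucene 1.x-era API; directory and analyzer are assumed to be created elsewhere
      import org.apache.lucene.document.Document;
      import org.apache.lucene.index.IndexWriter;

      IndexWriter writer = new IndexWriter(directory, analyzer, true);  // true: create a new index
      Document doc = new Document();
      // add fields to document (next slide)
      writer.addDocument(doc);
      writer.close();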
  32. Adding Fields
      doc.add(Field.Keyword("isbn", isbn));
      doc.add(Field.Keyword("category", category));
      doc.add(Field.Text("title", title));
      doc.add(Field.Text("author", author));
      doc.add(Field.UnIndexed("url", url));
      doc.add(Field.UnStored("subjects", subjects, true));
      doc.add(Field.Keyword("pubmonth", pubmonth));
      doc.add(Field.UnStored("contents", author + " " + subjects));
      doc.add(Field.Keyword("modified", DateField.timeToString(file.lastModified())));
  33. Fields Description
      - Attributes
        - Stored: original content retrievable
        - Indexed: inverted, searchable
        - Tokenized: analyzed, split into tokens
      - Factory methods
        - Keyword: stored and indexed as a single term
        - Text: indexed, tokenized, and stored if given a String
        - UnIndexed: stored
        - UnStored: indexed, tokenized
      - Terms are what matter for searching
  34. Searching an Index
      IndexSearcher searcher = new IndexSearcher(directory);
      // parse the query text against the "contents" field
      Query query = QueryParser.parse(queryExpression, "contents", analyzer);
      Hits hits = searcher.search(query);
      for (int i = 0; i < hits.length(); i++) {
          Document doc = hits.doc(i);
          System.out.println(doc.get("title"));
      }
  35. Analyzer
      - Analysis occurs
        - For each tokenized field during indexing
        - For each term or phrase in QueryParser
      - Several analyzers are built in
        - Many more in the sandbox
        - Straightforward to create your own
      - Choosing the right analyzer is important!
  36. WhiteSpace Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [The] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog.]
  37. Simple Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [the] [quick] [brown] [fox] [jumps] [over] [the] [lazy] [dog]
  38. Stop Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [quick] [brown] [fox] [jumps] [over] [lazy] [dog]
  39. Snowball Analyzer
      Input:  The quick brown fox jumps over the lazy dog.
      Tokens: [the] [quick] [brown] [fox] [jump] [over] [the] [lazy] [dog]
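The first three behaviors can be imitated with plain string handling. This is a sketch of what each analyzer does to the example sentence, not Lucene's implementation: the class `AnalyzerSketch` is hypothetical, and the stop-word list is a small illustrative subset rather than Lucene's built-in set. Stemming is left out because a faithful Snowball stemmer is far more than a few lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class AnalyzerSketch {
    // Illustrative subset only; assumed for the example.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    // Whitespace style: split on whitespace, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Simple style: split on non-letters and lowercase every token.
    static List<String> simple(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Stop style: the simple tokens with stop words removed.
    static List<String> stop(String text) {
        List<String> out = simple(text);
        out.removeAll(STOP_WORDS);
        return out;
    }

    public static void main(String[] args) {
        String sentence = "The quick brown fox jumps over the lazy dog.";
        System.out.println(whitespace(sentence));
        System.out.println(simple(sentence));
        System.out.println(stop(sentence));
    }
}
```

The difference between `dog.` and `dog` shows why indexing and querying must use the same analysis: a query token produced by one style will not match index terms produced by another.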
  40. Query Creation
      - Searching by a term: TermQuery
      - Searching within a range: RangeQuery
      - Searching by string prefix: PrefixQuery
      - Combining queries: BooleanQuery
      - Searching by phrase: PhraseQuery
      - Searching by wildcard: WildcardQuery
      - Searching for similar terms: FuzzyQuery
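What "matching" means for several of these query types can be sketched as predicates over a single indexed term. This illustrates the matching ideas only and does not call Lucene: the class `QueryMatchSketch` and its method names are hypothetical, the range is assumed inclusive and lexicographic, and the fuzzy match uses plain Levenshtein distance with a caller-chosen limit, which may differ from the similarity measure a given Lucene version applies. BooleanQuery and PhraseQuery are omitted because they operate on whole documents rather than single terms.

```java
import java.util.regex.Pattern;

class QueryMatchSketch {
    // Term match: the indexed term must equal the query term exactly.
    static boolean term(String query, String indexed) {
        return indexed.equals(query);
    }

    // Prefix match: the indexed term must start with the query string.
    static boolean prefix(String query, String indexed) {
        return indexed.startsWith(query);
    }

    // Range match (assumed inclusive, lexicographic order).
    static boolean range(String lower, String upper, String indexed) {
        return indexed.compareTo(lower) >= 0 && indexed.compareTo(upper) <= 0;
    }

    // Wildcard match: '*' stands for any sequence, '?' for exactly one character.
    static boolean wildcard(String pattern, String indexed) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), indexed);
    }

    // Fuzzy match: accept terms within maxEdits single-character edits.
    static boolean fuzzy(String query, String indexed, int maxEdits) {
        return editDistance(query, indexed) <= maxEdits;
    }

    // Levenshtein distance computed with two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Prefix, wildcard, and fuzzy queries all expand to a set of indexed terms before any documents are looked up, which is why they tend to cost more than a plain term query.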
  41. Lucene Queries
  42. Conclusions
      - Nutch as a starting point
      - Crawling in Nutch
      - Detailed map-reduce architecture
      - Different query formats in Lucene
      - Built-in analyzers in Lucene
      - The same analyzer needs to be used for both indexing and searching
  43. Resources Used
      - Gospodnetic, Otis; Erik Hatcher (December 1, 2004). Lucene in Action (1st ed.). Manning Publications. pp. 456. ISBN 978-1-932394-28-3.
      - Nutch Wiki
  44. Thanks. Questions?