Harnessing the power of Nutch with Scala
 

    Presentation Transcript

    • Crawling the web, Nutch with Scala Vikas Hazrati @
    • about: CTO at Knoldus Software; Co-Founder at MyCellWasStolen.com; Community Editor at InfoQ.com; dabbling with Scala for the last 40 months; enterprise-grade implementations on Scala for 18 months
    • nutch: web search software; crawler, link-graph, parsing; solr; lucene
    • nutch – but we have google! transparent; understanding; extensible
    • nutch – basic architecture: crawler, searcher
    • nutch - architecture: recursive crawler; segments; fetchlists; crawl db; web database (pages, links)
    • nutch – crawl cycle (generate – fetch – update cycle): create crawldb; inject root URLs in crawldb; generate fetchlist; fetch content; update crawldb; repeat until depth reached; update segments; index fetched pages; deduplication; merge indexes for searching. Command: bin/nutch crawl urls -dir crawl -depth 3 -topN 5
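
The generate – fetch – update loop on the slide above can be modelled with a toy, in-memory sketch. This is not Nutch code: the object `CrawlCycleSketch`, the `Web` type and the `crawl` function are hypothetical names, and the "web" is just a map from a URL to its outlinks. The `depth` and `topN` parameters are meant to mirror the roles of the `-depth` and `-topN` flags in the command on the slide.

```scala
// A toy, in-memory model of the generate - fetch - update cycle.
// Nothing here is Nutch API; every name is hypothetical.
object CrawlCycleSketch {
  // Stand-in for the web: each URL maps to the links found on that page.
  type Web = Map[String, List[String]]

  /** Runs `depth` rounds and returns the set of URLs that were fetched. */
  def crawl(web: Web, seeds: List[String], depth: Int, topN: Int): Set[String] = {
    var crawlDb: Set[String] = seeds.toSet // inject root URLs
    var fetched: Set[String] = Set.empty

    for (_ <- 1 to depth) {
      // generate: pick at most topN known-but-unfetched URLs
      val fetchList = (crawlDb -- fetched).toList.sorted.take(topN)
      // fetch: "download" each page and collect its outlinks
      val outlinks = fetchList.flatMap(url => web.getOrElse(url, Nil))
      // update: record what was fetched and add newly discovered URLs
      fetched ++= fetchList
      crawlDb ++= outlinks
    }
    fetched
  }

  def main(args: Array[String]): Unit = {
    val web: Web = Map(
      "a" -> List("b", "c"),
      "b" -> List("d"),
      "c" -> List("d"),
      "d" -> Nil
    )
    println(crawl(web, List("a"), depth = 3, topN = 5))
  }
}
```

With a depth of 1 only the seed is fetched; each further round reaches pages one link further away, and `topN` caps how many pages a single round may fetch.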
    • nutch - plugins in the generate – fetch – update cycle. Cycle steps: create crawldb; inject root URLs in crawldb; generate fetchlist; fetch content; update crawldb. Plugin types: parser; HtmlParseFilter; URL filter; scoring filter
    • nutch – extension points
      plugin.xml  // tells Nutch about the plugin
      build.xml   // builds the plugin
      ivy.xml     // plugin dependencies
      src         // plugin source
    • nutch - example
      <plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
              version="1.0.0" provider-name="nutch.org">
        <runtime>
          <library name="kdaggregator.jar">
            <export name="*" />
          </library>
        </runtime>
        <requires>
          <import plugin="nutch-extensionpoints" />
        </requires>
        <extension id="org.apache.nutch.parse.headings"
                   name="NutchHeadings Parse Filter"
                   point="org.apache.nutch.parse.HtmlParseFilter">
          <implementation id="KDParseFilter"
                          class="com.knoldus.aggregator.server.plugins.DetailParserFilter"></implementation>
        </extension>
      </plugin>
    • public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        LOG.debug("Parsing URL: " + content.getUrl());
        Parse parse = parseResult.get(content.getUrl());
        Metadata metadata = parse.getData().getParseMeta();
        // tags and TAG_KEY are defined elsewhere in the plugin (not shown on the slide)
        for (String tag : tags) {
          metadata.add(TAG_KEY, tag);
        }
        return parseResult;
      }
    • scala – I have Java! concurrency; verbose; popular; strongly typed; jvm; OO; library
    • scala
      Java:
      class Person {
        private String firstName;
        private String lastName;
        private int age;
        public Person(String firstName, String lastName, int age) {
          this.firstName = firstName;
          this.lastName = lastName;
          this.age = age;
        }
        public void setFirstName(String firstName) { this.firstName = firstName; }
        public String getFirstName() { return this.firstName; }
        public void setLastName(String lastName) { this.lastName = lastName; }
        public String getLastName() { return this.lastName; }
        public void setAge(int age) { this.age = age; }
        public int getAge() { return this.age; }
      }
      Scala:
      class Person(var firstName: String, var lastName: String, var age: Int)
      Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
    • scala
      Java – everything is an object unless it is primitive
      Scala – everything is an object. period.
      Java – has operators (+, -, < ..) and methods
      Scala – operators are methods
      Java – statically typed – Thing thing = new Thing()
      Scala – statically typed but uses type inference – val thing = new Thing
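
The three Scala claims on the slide above can be tried out in a few lines. This is a minimal sketch: `ScalaVsJavaSketch` and the `Money` class are hypothetical and exist only for illustration.

```scala
object ScalaVsJavaSketch {
  // Hypothetical class, used only for illustration.
  class Money(val cents: Long) {
    // "+" is an ordinary method name in Scala
    def +(other: Money): Money = new Money(cents + other.cents)
  }

  def main(args: Array[String]): Unit = {
    // everything is an object: methods can be called on an Int value
    val three = (1).+(2) // the same call as 1 + 2
    val text  = 42.toString

    // operators are methods: both lines call Money.+
    val total = new Money(150) + new Money(250)
    val same  = new Money(150).+(new Money(250))

    // type inference: `thing` is statically typed as StringBuilder
    val thing = new StringBuilder
    thing.append("typed")

    println(three)
    println(text)
    println(total.cents)
    println(same.cents)
    println(thing)
  }
}
```

Because `thing` is inferred as `StringBuilder` at compile time, a call such as `thing.foo()` would be rejected by the compiler, just as it would be in Java with an explicit type.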
    • evolution
    • scala and concurrency: fine grained; coarse grained; actors
    • actors
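
The actor slide carries no text, so here is a rough sketch of the idea: an actor owns a mailbox and handles one message at a time, and callers send messages without waiting for a reply. The `scala.actors` library that was current when these slides were written has since been deprecated in favour of Akka, so this sketch swaps in a hand-rolled stand-in built on a `java.util.concurrent` queue and a single thread. `MiniActor`, `totalLength` and the poison-pill shutdown are assumptions made for illustration, not part of any actor library.

```scala
import java.util.concurrent.LinkedBlockingQueue

object MiniActorSketch {
  /** A hand-rolled stand-in for an actor: one mailbox, one worker thread. */
  class MiniActor[A](handler: A => Unit) {
    private val mailbox = new LinkedBlockingQueue[Option[A]]()
    private val worker = new Thread(() => {
      var running = true
      while (running) {
        mailbox.take() match {
          case Some(msg) => handler(msg)    // messages are handled one at a time
          case None      => running = false // poison pill: stop the loop
        }
      }
    })
    worker.setDaemon(true)
    worker.start()

    /** Fire-and-forget send, mirroring the `!` operator of actor libraries. */
    def !(msg: A): Unit = mailbox.put(Some(msg))

    /** Asks the worker to stop after the queued messages and waits for it. */
    def stop(): Unit = { mailbox.put(None); worker.join() }
  }

  /** Sends each word to an actor that sums the word lengths. */
  def totalLength(words: List[String]): Int = {
    var total = 0 // written only by the actor's thread until stop() returns
    val counter = new MiniActor[String](w => total += w.length)
    words.foreach(counter ! _)
    counter.stop() // join() makes the final value visible to this thread
    total
  }

  def main(args: Array[String]): Unit =
    println(totalLength(List("nutch", "scala")))
}
```

The state (`total`) is touched by a single thread only, which is the property that lets actor-style code avoid explicit locks.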
    • problem context: Aggregator; UGC
    • solution: Aggregator; Supplier 1, Supplier 2, Supplier 3
    • create crawldb; inject root URLs (supplier URLs) in crawldb; generate fetchlist; fetch content; update crawldb; plugins written in Scala
    • logic: crawl the supplier; parse; is URL interesting?; pass extraction to actor; seed database
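    • plugin - scala
      import org.apache.nutch.parse.{HTMLMetaTags, HtmlParseFilter, ParseResult}
      import org.apache.nutch.protocol.Content
      import org.w3c.dom.DocumentFragment
      class DetailParserFilter extends HtmlParseFilter {
        // Returns parseResult unchanged; detail pages are handed to an actor as a side effect
        def filter(content: Content, parseResult: ParseResult,
                   metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
          if (isDetailURL(content.getUrl)) {
            val rawHtml = content.getContent
            if (rawHtml.length > 0) processContent(rawHtml)
          }
          parseResult
        }
        private def isDetailURL(url: String): Boolean =
          url.matches(AggregatorConfiguration.regexEventDetailPages)
        // Starts a DetailProcessor actor and sends it the raw page bytes
        private def processContent(rawHtml: Array[Byte]) = {
          (new DetailProcessor).start ! rawHtml
        }
      }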
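
The "is URL interesting" check in the plugin relies on `String.matches`, which succeeds only when the whole URL matches the pattern. A minimal sketch of that check follows. The slides do not show the value of `AggregatorConfiguration.regexEventDetailPages`, so the pattern, the `DetailUrlSketch` object and the example URLs below are all made up for illustration.

```scala
object DetailUrlSketch {
  // Hypothetical pattern; the real configuration value is not shown in the slides.
  val detailPagePattern = """https?://[^/]+/events/\d+"""

  // String.matches is true only if the entire string matches the pattern
  def isDetailUrl(url: String): Boolean = url.matches(detailPagePattern)

  def main(args: Array[String]): Unit = {
    val urls = List(
      "http://supplier.example/events/42",     // detail page
      "http://supplier.example/events/",       // listing page, no id
      "http://supplier.example/events/42?x=1"  // trailing query string
    )
    urls.foreach(url => println(url + " -> " + isDetailUrl(url)))
  }
}
```

Because the match is anchored at both ends, a pattern intended to accept URLs with query strings or trailing slashes has to say so explicitly.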
    • result: 5 suppliers crawled; crawl cycles ran continuously for a few days; > 500K seed data collected; all with Nutch and 823 lines of Scala code
    • demo: in action …
    • resources: http://blog.knoldus.com http://wiki.apache.org/nutch/NutchTutorial http://www.scala-lang.org/ vikas@knoldus.com