Harnessing the power of Nutch with Scala


Introduction to N


Transcript

  • 1. Crawling the web, Nutch with Scala Vikas Hazrati @
  • 2. about: CTO at Knoldus Software; Co-Founder at MyCellWasStolen.com; Community Editor at InfoQ.com; dabbling with Scala for the last 40 months; enterprise-grade implementations on Scala for 18 months
  • 3. nutch: Web search, crawler, link-graph, parsing, software, solr, lucene
  • 4. nutch – but we have google! transparent, understanding, extensible
  • 5. nutch – basic architecture: crawler, searcher
  • 6. nutch - architecture: recursive; segments, crawler, links, web database, pages, fetchlists, crawl db
  • 7. nutch – crawl cycle (generate – fetch – update cycle)
      Create crawldb
      Inject root URLs in crawldb
      Generate fetchlist
      Fetch content
      Update crawldb (repeat until depth reached)
      Update segments
      Index fetched pages
      Deduplication
      Merge indexes for searching
      bin/nutch crawl urls -dir crawl -depth 3 -topN 5
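The generate – fetch – update loop on slide 7 can be sketched in a few lines of Scala. This is only an illustrative in-memory model, not Nutch code: `CrawlCycleSketch`, the `CrawlDb` map, and the `fetch` function passed in are all hypothetical stand-ins, and the fetchlist is chosen alphabetically here for determinism, whereas Nutch ranks candidates by score.

```scala
object CrawlCycleSketch {
  // Hypothetical stand-in for the crawldb: URL -> "already fetched?" flag.
  type CrawlDb = Map[String, Boolean]

  // Inject: add URLs as unfetched entries, leaving known URLs untouched.
  def inject(db: CrawlDb, urls: Seq[String]): CrawlDb =
    urls.foldLeft(db)((acc, url) => if (acc.contains(url)) acc else acc + (url -> false))

  // Generate: pick at most topN unfetched URLs (alphabetical order is an assumption).
  def generate(db: CrawlDb, topN: Int): Seq[String] =
    db.collect { case (url, false) => url }.toSeq.sorted.take(topN)

  // Update: mark fetched URLs and add their outlinks as new unfetched entries.
  def update(db: CrawlDb, fetched: Map[String, Seq[String]]): CrawlDb = {
    val marked = fetched.keys.foldLeft(db)((acc, url) => acc + (url -> true))
    inject(marked, fetched.values.flatten.toSeq)
  }

  // Repeat generate -> fetch -> update until the requested depth is reached.
  // `fetch` returns the outlinks found on a page.
  def crawl(seeds: Seq[String], depth: Int, topN: Int)(fetch: String => Seq[String]): CrawlDb = {
    var db = inject(Map.empty, seeds)
    for (_ <- 1 to depth) {
      val fetchlist = generate(db, topN)
      val fetched = fetchlist.map(url => url -> fetch(url)).toMap
      db = update(db, fetched)
    }
    db
  }
}
```

The `depth` and `topN` parameters play the same role as the `-depth` and `-topN` flags in the command on the slide: how many rounds to run, and how many URLs to fetch per round.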
  • 8. nutch - plugins (generate – fetch – update cycle): Create crawldb → Inject root URLs in crawldb → Generate fetchlist → Fetch content → Update crawldb. Plugin types: parser, HTMLParserFilter, URL Filter, scoring filter
  • 9. nutch – extension points
      plugin.xml   // tells Nutch about the plugin
      build.xml    // build the plugin
      ivy.xml      // plugin dependencies
      src          // plugin source
  • 10. nutch - example
      <plugin id="KnoldusAggregator" name="Knoldus Parse Filter" version="1.0.0" provider-name="nutch.org">
        <runtime>
          <library name="kdaggregator.jar">
            <export name="*" />
          </library>
        </runtime>
        <requires>
          <import plugin="nutch-extensionpoints" />
        </requires>
        <extension id="org.apache.nutch.parse.headings" name="NutchHeadings Parse Filter" point="org.apache.nutch.parse.HtmlParseFilter">
          <implementation id="KDParseFilter" class="com.knoldus.aggregator.server.plugins.DetailParserFilter"></implementation>
        </extension>
      </plugin>
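  • 11.
      // LOG, tags and TAG_KEY are assumed to be members of the enclosing filter class (not shown on the slide).
      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        LOG.debug("Parsing URL: " + content.getUrl());
        // Look up the parse for this URL and attach each tag to its metadata.
        Parse parse = parseResult.get(content.getUrl());
        Metadata metadata = parse.getData().getParseMeta();
        for (String tag : tags) {
          metadata.add(TAG_KEY, tag);
        }
        return parseResult;
      }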
  • 12. scala: I have Java! concurrency, verbose, popular, strongly typed, jvm, OO, library
  • 13. scala
      Java:
        class Person {
          private String firstName;
          private String lastName;
          private int age;
          public Person(String firstName, String lastName, int age) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.age = age;
          }
          public void setFirstName(String firstName) { this.firstName = firstName; }
          public String getFirstName() { return this.firstName; }
          public void setLastName(String lastName) { this.lastName = lastName; }
          public String getLastName() { return this.lastName; }
          public void setAge(int age) { this.age = age; }
          public int getAge() { return this.age; }
        }
      Scala:
        class Person(var firstName: String, var lastName: String, var age: Int)
      Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
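To see what the one-line Scala class on slide 13 provides, here is a small usage sketch. The wrapper object `PersonSketch` and the sample values are made up for illustration; the point is that each `var` constructor parameter becomes a field with a generated accessor and mutator, which is what the Java getters and setters spell out by hand.

```scala
object PersonSketch {
  // Same declaration as on the slide.
  class Person(var firstName: String, var lastName: String, var age: Int)

  // Sample values are invented for this example.
  val p = new Person("Ada", "Lovelace", 36)

  // Assignment goes through the generated mutator (the equivalent of setAge).
  p.age = 37

  // Reading goes through the generated accessor (the equivalent of getFirstName).
  val greeting = "Hello, " + p.firstName + " " + p.lastName
}
```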
  • 14. scala
      Java – everything is an object unless it is primitive
      Scala – everything is an object. period.
      Java – has operators (+, -, < ..) and methods
      Scala – operators are methods
      Java – statically typed – Thing thing = new Thing()
      Scala – statically typed but uses type inferencing – val thing = new Thing
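The three claims on slide 14 can be checked with a short sketch. The `Thing` class below borrows its name from the slide, but its `size` field and the two symbolic methods are assumptions added purely for illustration.

```scala
object ScalaBasicsSketch {
  // Hypothetical class; `size` and the methods below are not from the slides.
  class Thing(val size: Int) {
    // An "operator" is an ordinary method whose name happens to be a symbol.
    def +(other: Thing): Thing = new Thing(size + other.size)
    def <(other: Thing): Boolean = size < other.size
  }

  // Type inference: no annotation needed, the compiler infers Thing.
  val thing = new Thing(1)

  // Infix notation is sugar for the method call thing.+(new Thing(2)).
  val sum = thing + new Thing(2)

  // Numbers are objects too: + and toString are methods on Int.
  val explicit = (1).+(2)
  val text = 42.toString
}
```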
  • 15. evolution
  • 16. scala and concurrency: fine grained, coarse grained, Actors
  • 17. actors
  • 18.
  • 19. problem context: Aggregator, UGC
  • 20. solution: Aggregator; Supplier 1, Supplier 2, Supplier 3
  • 21. Create crawldb → Inject root URLs in crawldb (Supplier URLs) → Generate fetchlist → Fetch content → Update crawldb. Plugins written in Scala
  • 22. logic: Crawl the supplier → Parse → Is URL interesting? → Pass extraction to actor → seed database
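  • 23. plugin - scala
      import org.apache.nutch.parse.{HTMLMetaTags, HtmlParseFilter, ParseResult}
      import org.apache.nutch.protocol.Content
      import org.w3c.dom.DocumentFragment
      // AggregatorConfiguration and DetailProcessor (an actor) are project classes not shown on the slide.
      class DetailParserFilter extends HtmlParseFilter {
        // Called by Nutch for every parsed page; the parse result is passed through unchanged.
        def filter(content: Content, parseResult: ParseResult,
                   metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
          if (isDetailURL(content.getUrl)) {
            val rawHtml = content.getContent
            if (rawHtml.length > 0) processContent(rawHtml)
          }
          parseResult
        }
        // A URL is interesting when it matches the configured detail-page regex.
        private def isDetailURL(url: String): Boolean =
          url.matches(AggregatorConfiguration.regexEventDetailPages)
        // Hand the raw page to a freshly started actor and return immediately.
        private def processContent(rawHtml: Array[Byte]) = {
          (new DetailProcessor).start ! rawHtml
        }
      }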
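The hand-off on slide 23 (check the URL, then send the raw page to an actor with `!`) relies on the original `scala.actors` library, which is deprecated and no longer shipped with current Scala releases. The sketch below swaps in a different technique that runs on a plain JVM: a single-threaded executor acting as a mailbox. Everything in it is an assumption made for illustration, including the `detailPagePattern` regex, the `MailboxWorker` class, and the "extraction" step, which here just records the page length.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicReference

object ActorHandoffSketch {
  // Hypothetical pattern; the slide reads the real one from AggregatorConfiguration.
  val detailPagePattern = ".*/event/\\d+"

  def isDetailURL(url: String): Boolean = url.matches(detailPagePattern)

  // Minimal mailbox: one worker thread handles messages in arrival order.
  class MailboxWorker[A](handle: A => Unit) {
    private val mailbox = Executors.newSingleThreadExecutor()

    // Fire-and-forget send, named after the actor `!` operator.
    def !(msg: A): Unit = mailbox.execute(() => handle(msg))

    // Stop accepting messages and wait for queued ones to finish.
    def shutdownAndWait(): Unit = {
      mailbox.shutdown()
      mailbox.awaitTermination(5, TimeUnit.SECONDS)
      ()
    }
  }

  // Sends every non-empty detail page to the worker; returns the recorded page lengths.
  def run(pages: Seq[(String, Array[Byte])]): Seq[Int] = {
    val extracted = new AtomicReference[Vector[Int]](Vector.empty)
    val worker = new MailboxWorker[Array[Byte]](html => {
      extracted.updateAndGet(v => v :+ html.length)
      ()
    })
    for ((url, html) <- pages if isDetailURL(url) && html.length > 0) worker ! html
    worker.shutdownAndWait()
    extracted.get()
  }
}
```

As on the slide, the caller returns as soon as the message is queued, so slow extraction work does not hold up the crawl; the explicit shutdown exists only so the example can collect its results.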
  • 24. result: 5 suppliers crawled; crawl cycles run continuously for a few days; > 500K seed data collected; all with Nutch and 823 lines of Scala code
  • 25. demo: in action …
  • 26. resources: http://blog.knoldus.com http://wiki.apache.org/nutch/NutchTutorial http://www.scala-lang.org/ vikas@knoldus.com