Harnessing the power of Nutch with Scala
Introduction to N

Published in: Technology
Transcript

  • 1. Crawling the web, Nutch with Scala Vikas Hazrati @
  • 2. About: CTO at Knoldus Software; Co-Founder at MyCellWasStolen.com; Community Editor at InfoQ.com; dabbling with Scala for the last 40 months; enterprise-grade implementations on Scala for 18 months.
  • 3. Nutch: web search software. Keywords: crawler, link-graph, parsing, Solr, Lucene.
  • 4. Nutch: "but we have Google!" Keywords: transparent, understanding, extensible.
  • 5. Nutch, basic architecture: crawler, searcher.
  • 6. Nutch, architecture (diagram labels): recursive, segments, crawler, links, web database, pages, fetchlists, crawl db.
  • 7. Nutch, crawl cycle (generate, fetch, update). Crawl steps: create crawldb; inject root URLs in crawldb; generate fetchlist; fetch content; update crawldb; repeat until depth reached. Indexing steps: update segments; index fetched pages; deduplication; merge indexes for searching. Command: bin/nutch crawl urls -dir crawl -depth 3 -topN 5
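The generate, fetch, update loop on slide 7 can be pictured with a toy simulation. This is a minimal sketch and not Nutch code: the `web` link graph, the `CrawlCycleSketch` object and the `crawl` function are hypothetical names invented for illustration. The `depth` and `topN` parameters are named after the `-depth` and `-topN` flags of the command on the slide: `depth` bounds the number of rounds and `topN` bounds the URLs fetched per round. Nutch picks the top URLs by score; this sketch simply takes them in discovery order.

```scala
// Toy simulation of the generate / fetch / update cycle. Not Nutch code.
object CrawlCycleSketch {
  // Hypothetical in-memory "web": page URL -> outgoing links.
  val web: Map[String, List[String]] = Map(
    "a" -> List("b", "c"),
    "b" -> List("d"),
    "c" -> List("d", "e"),
    "d" -> Nil,
    "e" -> List("a")
  )

  /** Runs `depth` rounds; each round fetches at most `topN` unfetched URLs. */
  def crawl(seeds: List[String], depth: Int, topN: Int): Set[String] = {
    // inject: the crawl db starts with the root URLs (kept in discovery order)
    var crawlDb: Vector[String] = seeds.toVector.distinct
    var fetched = Set.empty[String]
    for (_ <- 1 to depth) {
      // generate: pick up to topN known but not yet fetched URLs
      val fetchList = crawlDb.filterNot(fetched).take(topN)
      // fetch: "download" each page and collect its outgoing links
      val outlinks = fetchList.flatMap(url => web.getOrElse(url, Nil))
      fetched ++= fetchList
      // update: add newly discovered URLs to the crawl db
      crawlDb = (crawlDb ++ outlinks).distinct
    }
    fetched
  }

  def main(args: Array[String]): Unit =
    println(crawl(List("a"), depth = 3, topN = 5))
}
```

With the seed `a`, three rounds are enough to reach every page in this small graph; lowering `topN` to 1 leaves part of the graph unfetched after the same number of rounds.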
  • 8. Nutch, plugins in the generate, fetch, update cycle: create crawldb; inject root URLs in crawldb; generate fetchlist; fetch content; update crawldb. Plugin types shown alongside the cycle: parser, HTMLParserFilter, URL filter, scoring filter.
  • 9. Nutch, extension points. Files in a plugin:
      plugin.xml   // tells Nutch about the plugin
      build.xml    // builds the plugin
      ivy.xml      // plugin dependencies
      src          // plugin source
  • 10. Nutch, example:
      <plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
              version="1.0.0" provider-name="nutch.org">
        <runtime>
          <library name="kdaggregator.jar">
            <export name="*" />
          </library>
        </runtime>
        <requires>
          <import plugin="nutch-extensionpoints" />
        </requires>
        <extension id="org.apache.nutch.parse.headings"
                   name="NutchHeadings Parse Filter"
                   point="org.apache.nutch.parse.HtmlParseFilter">
          <implementation id="KDParseFilter"
                          class="com.knoldus.aggregator.server.plugins.DetailParserFilter">
          </implementation>
        </extension>
      </plugin>
  • 11. Example, continued (Java):
      public ParseResult filter(Content content, ParseResult parseResult,
                                HTMLMetaTags metaTags, DocumentFragment doc) {
        LOG.debug("Parsing URL: " + content.getUrl());
        Parse parse = parseResult.get(content.getUrl());
        Metadata metadata = parse.getData().getParseMeta();
        // tags and TAG_KEY are defined outside this method (not shown on the slide)
        for (String tag : tags) {
          metadata.add(TAG_KEY, tag);
        }
        return parseResult;
      }
  • 12. Scala: "I have Java!" Keywords: concurrency, verbose, popular, strongly typed, JVM, OO, library.
  • 13. Scala
      Java:
      class Person {
        private String firstName;
        private String lastName;
        private int age;

        public Person(String firstName, String lastName, int age) {
          this.firstName = firstName;
          this.lastName = lastName;
          this.age = age;
        }

        public void setFirstName(String firstName) { this.firstName = firstName; }
        public String getFirstName() { return this.firstName; }
        public void setLastName(String lastName) { this.lastName = lastName; }
        public String getLastName() { return this.lastName; }
        public void setAge(int age) { this.age = age; }
        public int getAge() { return this.age; }
      }

      Scala:
      class Person(var firstName: String, var lastName: String, var age: Int)

      Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
  • 14. Scala
      Java: everything is an object unless it is primitive. Scala: everything is an object. Period.
      Java: has operators (+, -, < ..) and methods. Scala: operators are methods.
      Java: statically typed, e.g. Thing thing = new Thing(). Scala: statically typed but uses type inference, e.g. val thing = new Thing
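Two of the points on slide 14, "operators are methods" and type inference, are easy to see in a few lines. This is a minimal sketch; the `Money` class and the `OperatorsAreMethods` object are hypothetical names used only for illustration and do not come from the talk.

```scala
object OperatorsAreMethods {
  // Hypothetical value type, used only to illustrate the point.
  case class Money(cents: Long) {
    // `+` and `<` are ordinary method names in Scala.
    def +(other: Money): Money = Money(cents + other.cents)
    def <(other: Money): Boolean = cents < other.cents
  }

  def main(args: Array[String]): Unit = {
    val a = Money(150)        // type inferred as Money, no annotation needed
    val b = Money(275)
    val infix = a + b         // operator notation
    val explicit = a.+(b)     // the same call written as a plain method call
    println(infix == explicit) // case classes compare by value
    println((1).+(2))          // Int's + can be called as a method too
    println(a < b)
  }
}
```

The infix form `a + b` is only syntax for the method call `a.+(b)`, which is why user-defined types can take part in operator expressions without any special language support.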
  • 15. Evolution
  • 16. Scala and concurrency: fine grained, coarse grained, actors.
  • 17. Actors
  • 18. (no text)
  • 19. Problem context: aggregator, UGC.
  • 20. Solution (diagram labels): aggregator; Supplier 1, Supplier 2, Supplier 3.
  • 21. Create crawldb; inject root URLs in crawldb (supplier URLs); generate fetchlist; fetch content; update crawldb. Plugins written in Scala.
  • 22. Logic: crawl the supplier; parse; is the URL interesting?; pass extraction to actor; seed database.
  • 23. Plugin, Scala:
      class DetailParserFilter extends HtmlParseFilter {

        def filter(content: Content, parseResult: ParseResult,
                   metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
          // Only detail pages are interesting; everything else passes through untouched.
          if (isDetailURL(content.getUrl)) {
            val rawHtml = content.getContent
            if (rawHtml.length > 0) processContent(rawHtml)
          }
          parseResult
        }

        private def isDetailURL(url: String): Boolean =
          url.matches(AggregatorConfiguration.regexEventDetailPages)

        // Hand the raw page to an actor so that extraction does not block the crawl.
        private def processContent(rawHtml: Array[Byte]) = {
          (new DetailProcessor).start ! rawHtml
        }
      }
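The hand-off in slide 23 (check the URL, then pass the raw page to a background worker) can be tried outside Nutch. The sketch below is an assumption-laden stand-in, not the plugin from the talk: it swaps the `scala.actors` actor used on the slide, a library that has since been deprecated, for `scala.concurrent.Future`. The URL pattern, the `extract` function and the `DetailHandOffSketch` object are hypothetical; the real pattern lives in `AggregatorConfiguration.regexEventDetailPages`, which the slides do not show.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DetailHandOffSketch {
  // Hypothetical pattern; the talk's actual regex is not shown on the slides.
  val detailPagePattern = """https?://[^/]+/events/\d+"""

  // String.matches requires the whole URL to match the pattern.
  def isDetailUrl(url: String): Boolean = url.matches(detailPagePattern)

  // Stand-in for the real extraction work; here it only counts bytes.
  def extract(rawHtml: Array[Byte]): Int = rawHtml.length

  /** Returns None for uninteresting URLs or empty content; otherwise
    * starts the extraction asynchronously, off the calling thread. */
  def handOff(url: String, rawHtml: Array[Byte]): Option[Future[Int]] =
    if (isDetailUrl(url) && rawHtml.nonEmpty) Some(Future(extract(rawHtml)))
    else None

  def main(args: Array[String]): Unit = {
    val html = "<html><body>event</body></html>".getBytes("UTF-8")
    handOff("http://example.com/events/42", html).foreach { result =>
      // Blocking here is only for the demo; a crawler would not wait.
      println(Await.result(result, 5.seconds))
    }
  }
}
```

The design point is the same as on the slide: the parse filter returns immediately, and the expensive extraction runs elsewhere, so a slow page does not hold up the fetch cycle.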
  • 24. Result: 5 suppliers crawled; crawl cycles run continuously for a few days; > 500K seed data collected; all with Nutch and 823 lines of Scala code.
  • 25. Demo: in action …
  • 26. Resources: http://blog.knoldus.com ; http://wiki.apache.org/nutch/NutchTutorial ; http://www.scala-lang.org/ ; vikas@knoldus.com
