Harnessing the power of Nutch with Scala

Published in: Technology
  1. Crawling the web, Nutch with Scala. Vikas Hazrati @
  2. about: CTO at Knoldus Software; Co-Founder at MyCellWasStolen.com; Community Editor at InfoQ.com; dabbling with Scala for the last 40 months; enterprise-grade implementations on Scala for 18 months
  3. nutch: Web search, crawler, link-graph, parsing, software, solr, lucene
  4. nutch – but we have google! transparent, understanding, extensible
  5. nutch – basic architecture: crawler, searcher
  6. nutch - architecture: recursive, segments, crawler, links, web database, pages, fetchlists, crawl db
  7. nutch – crawl cycle (generate – fetch – update cycle): create crawldb; inject root URLs in crawldb; generate fetchlist; fetch content; update crawldb; repeat until depth reached; update segments; index fetched pages; deduplication; merge indexes for searching
        bin/nutch crawl urls -dir crawl -depth 3 -topN 5
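The generate – fetch – update loop on slide 7 can be illustrated with a toy in-memory model. This is only a sketch and not Nutch code: the `web` map, the `crawl` function, and the alphabetical ordering used to pick the top N URLs are all assumptions made for illustration (Nutch itself ranks candidates by score and stores its state on disk). The `depth` and `topN` parameters mirror the `-depth` and `-topN` options of the command on the slide.

```scala
// Toy model of the generate - fetch - update cycle. Not Nutch code.
object CrawlCycleSketch {
  // Hypothetical stand-in for the web: page URL -> outgoing links.
  val web: Map[String, List[String]] = Map(
    "a" -> List("b", "c"),
    "b" -> List("d"),
    "c" -> List("d", "e"),
    "d" -> Nil,
    "e" -> List("a")
  )

  /** Runs `depth` rounds, fetching at most `topN` pages per round. */
  def crawl(seeds: List[String], depth: Int, topN: Int): Set[String] = {
    var known = seeds.toSet            // "crawldb": every URL seen so far
    var fetched = Set.empty[String]    // URLs whose content was fetched
    for (_ <- 1 to depth) {
      // generate: pick unfetched URLs (alphabetical order is an assumption)
      val fetchlist = known.diff(fetched).toList.sorted.take(topN)
      // fetch: read each page and collect its outgoing links
      val outlinks = fetchlist.flatMap(url => web.getOrElse(url, Nil))
      // update: record what was fetched and which new URLs were discovered
      fetched ++= fetchlist
      known ++= outlinks
    }
    fetched
  }
}
```

Calling `CrawlCycleSketch.crawl(List("a"), depth = 3, topN = 5)` walks three rounds from the single seed; lowering `depth` or `topN` shrinks the set of fetched pages, which is the trade-off the two command-line options control.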
  8. nutch - plugins: the generate – fetch – update cycle (create crawldb; inject root URLs in crawldb; generate fetchlist; fetch content; update crawldb) with plugin hooks: parser, HTMLParserFilter, URL Filter, scoring filter
  9. nutch – extension points
        plugin.xml   // tells Nutch about the plugin
        build.xml    // build the plugin
        ivy.xml      // plugin dependencies
        src          // plugin source
  10. nutch - example
        <plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
                version="1.0.0" provider-name="nutch.org">
          <runtime>
            <library name="kdaggregator.jar">
              <export name="*" />
            </library>
          </runtime>
          <requires>
            <import plugin="nutch-extensionpoints" />
          </requires>
          <extension id="org.apache.nutch.parse.headings"
                     name="NutchHeadings Parse Filter"
                     point="org.apache.nutch.parse.HtmlParseFilter">
            <implementation id="KDParseFilter"
                class="com.knoldus.aggregator.server.plugins.DetailParserFilter"></implementation>
          </extension>
        </plugin>
  11. Java parse filter:
        public ParseResult filter(Content content, ParseResult parseResult,
            HTMLMetaTags metaTags, DocumentFragment doc) {
          LOG.debug("Parsing URL: " + content.getUrl());
          Parse parse = parseResult.get(content.getUrl());
          Metadata metadata = parse.getData().getParseMeta();
          // tags and TAG_KEY are defined elsewhere in the class (not shown on the slide)
          for (String tag : tags) {
            metadata.add(TAG_KEY, tag);
          }
          return parseResult;
        }
  12. scala – I have Java! concurrency, verbose, popular, strongly typed, jvm, OO, library
  13. scala
      Java:
        class Person {
          private String firstName;
          private String lastName;
          private int age;

          public Person(String firstName, String lastName, int age) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.age = age;
          }

          public void setFirstName(String firstName) { this.firstName = firstName; }
          public String getFirstName() { return this.firstName; }
          public void setLastName(String lastName) { this.lastName = lastName; }
          public String getLastName() { return this.lastName; }
          public void setAge(int age) { this.age = age; }
          public int getAge() { return this.age; }
        }
      Scala:
        class Person(var firstName: String, var lastName: String, var age: Int)
      Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i
  14. scala
      Java – everything is an object unless it is primitive
      Scala – everything is an object. period.
      Java – has operators (+, -, < ..) and methods
      Scala – operators are methods
      Java – statically typed – Thing thing = new Thing()
      Scala – statically typed but uses type inferencing – val thing = new Thing
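The three contrasts on slides 13 and 14 can be tried out in a few lines. The sketch below reuses the `Person` class from slide 13; the `Money` type and all the sample values are hypothetical and exist only to show a user-defined operator.

```scala
object ScalaVsJavaSketch {
  // One line replaces the Java class with its fields, constructor, getters and setters.
  class Person(var firstName: String, var lastName: String, var age: Int)

  // Hypothetical type: `+` is an ordinary method, so any class can define it.
  case class Money(cents: Long) {
    def +(other: Money): Money = Money(cents + other.cents)
  }

  // Type inference: no type annotation needed, `person` is still statically a Person.
  val person = new Person("Ada", "Lovelace", 36)
  person.age = 37                 // uses the setter generated for the `var` parameter

  // Everything is an object: even an Int literal has methods, and `+` is one of them.
  val sum = (1).+(2)              // same as 1 + 2

  val total = Money(150) + Money(250)
}
```

Assigning a `String` to `person.age` would be rejected at compile time, which is the point of the last contrast on slide 14: the types are inferred, not absent.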
  15. evolution
  16. scala and concurrency: fine grained, coarse grained, Actors
  17. actors
  18.
  19. problem context: Aggregator, UGC
  20. solution: Aggregator; Supplier 1, Supplier 2, Supplier 3
  21. Create crawldb; inject root URLs in crawldb (supplier URLs); generate fetchlist; fetch content; update crawldb; plugins written in Scala
  22. logic: crawl the supplier; parse; is URL interesting; pass extraction to actor; seed database
  23. plugin - scala
        class DetailParserFilter extends HtmlParseFilter {

          def filter(content: Content, parseResult: ParseResult,
                     metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = {
            // Only detail pages are of interest; everything else passes through untouched.
            if (isDetailURL(content.getUrl)) {
              val rawHtml = content.getContent
              if (rawHtml.length > 0) processContent(rawHtml)
            }
            parseResult
          }

          private def isDetailURL(url: String): Boolean = {
            val result = url.matches(AggregatorConfiguration.regexEventDetailPages)
            result
          }

          // Hand the raw page to an actor so extraction does not block the crawl.
          private def processContent(rawHtml: Array[Byte]) = {
            (new DetailProcessor).start ! rawHtml
          }
        }
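The decision flow of slides 22 and 23 (is the URL interesting, then hand extraction off asynchronously) can be exercised without Nutch on the classpath. The sketch below is an illustration under stated assumptions, not the deck's implementation: the URL pattern is invented, since the real one lives in `AggregatorConfiguration.regexEventDetailPages` and is not shown, and a `scala.concurrent.Future` is swapped in for the `DetailProcessor` actor because the `scala.actors` library used on the slide was later deprecated. The `handle` function and its return type are hypothetical.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object DetailHandoffSketch {
  // Hypothetical pattern; the deck's actual regex is not shown on the slides.
  val detailPagePattern = """https?://[^/]+/events/\d+"""

  // String.matches requires the whole URL to match the pattern.
  def isDetailURL(url: String): Boolean = url.matches(detailPagePattern)

  // Stand-in for the DetailProcessor actor: the work runs off the calling thread.
  // Here the "extraction" only measures the payload size.
  def processContent(rawHtml: Array[Byte]): Future[Int] = Future { rawHtml.length }

  // Mirrors the filter logic: only non-empty detail pages are handed off.
  def handle(url: String, rawHtml: Array[Byte]): Option[Future[Int]] =
    if (isDetailURL(url) && rawHtml.nonEmpty) Some(processContent(rawHtml)) else None
}
```

As in the slide, the caller does not wait for extraction to finish: `handle` returns immediately, and pages that fail the URL check produce `None` and cost nothing beyond the regex match.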
  24. result: 5 suppliers crawled; crawl cycles run continuously for a few days; > 500K seed data collected; all with Nutch and 823 lines of Scala code
  25. demo: in action ….
  26. resources: http://blog.knoldus.com, http://wiki.apache.org/nutch/NutchTutorial, http://www.scala-lang.org/, vikas@knoldus.com