SlideShare a Scribd company logo
Crawling the web, Nutch with Scala

    Vikas Hazrati @
about


  CTO at Knoldus Software

  Co-Founder at MyCellWasStolen.com

  Community Editor at InfoQ.com

  Dabbling with Scala – last 40 months

  Enterprise grade implementations on Scala – 18 months



                                                          2
nutch
Web search
                    crawler   link-graph   parsing
 software




             solr


 lucene


                                                         3
nutch – but we have google!


             transparent




            understanding




             extensible




                              4
nutch – basic architecture




crawler                 searcher




                                       5
nutch - architecture



          Recursive                segments

crawler




                                                     links




                                      web database   pages
                      fetchlists      Crawl db
                                                             6
nutch – crawl cycle
                                             generate – fetch – update cycle
Create crawldb

    Inject root URLs
        In crawldb
                                                 Update segments
        Generate fetchlist

                                                 Index fetched pages
          Fetch content      repeat until
                             depth reached          deduplication
         Update crawldb
                                                  Merge indexes for
                                                     searching




 bin/nutch crawl urls -dir crawl -depth 3 -topN 5
                                                                               7
nutch - plugins
                               generate – fetch – update cycle



Create crawldb               parser


    Inject root URLs
        In crawldb                     HTMLParserFilter

        Generate fetchlist


          Fetch content        URL Filter


         Update crawldb
                                      scoring filter




                                                                 8
nutch – extension points

plugin.xml            // tells Nutch about the plugin




               build.xml        // build the plugin




   ivy.xml           // plugin dependencies




                      // plugin source
         src

                                                        9
nutch - example
<plugin id="KnoldusAggregator" name="Knoldus Parse Filter"
version="1.0.0" provider-name="nutch.org">
    <runtime>
        <library name="kdaggregator.jar">
            <export name="*" />
        </library>
    </runtime>
    <requires>
        <import plugin="nutch-extensionpoints" />
    </requires>
    <extension id="org.apache.nutch.parse.headings" name="Nutch
Headings Parse Filter"
point="org.apache.nutch.parse.HtmlParseFilter">
        <implementation id="KDParseFilter"
class="com.knoldus.aggregator.server.plugins.DetailParserFilter
"></implementation>
    </extension>
</plugin>
                                                             10
public ParseResult filter(Content content, ParseResult
parseResult, HTMLMetaTags metaTags, DocumentFragment
doc) {

      LOG.debug("Parsing URL: " + content.getUrl());

      }
      Parse parse = parseResult.get(content.getUrl());
      Metadata metadata = parse.getData().getParseMeta();
      for (String tag : tags) {
        metadata.add(TAG_KEY, tag);
      }
      return parseResult;

  }
                                                            11
scala
                  I have Java !

concurrency       verbose




        popular             Strongly typed

                                             jvm

          OO                library


                                                           12
scala
Java:
class Person {
   private String firstName;
   private String lastName;
   private int age;

    public Person(String firstName, String lastName, int age) {
      this.firstName = firstName;
      this.lastName = lastName;
      this.age = age;
    }

    public   void   setFirstName(String firstName) { this.firstName = firstName; }
    public   void   String getFirstName() { return this.firstName; }
    public   void   setLastName(String lastName) { this.lastName = lastName; }
    public   void   String getLastName() { return this.lastName; }
    public   void   setAge(int age) { this.age = age; }
    public   void   int getAge() { return this.age; }
}


Scala:
class Person(var firstName: String, var lastName: String, var age: Int)




Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i      13
scala
Java – everything is an object unless it is primitive

Scala – everything is an object. period.



Java – has operators (+, -, < ..) and methods

Scala – operators are methods



Java – statically typed – Thing thing = new Thing()
Scala – statically typed but uses type inferencing
val thing = new Thing
                                                                14
evolution




            15
scala and concurrency



Fine grained        coarse grained




                         Actors


                                       16
actors




         17
18
problem context


Aggregator




             UGC



                                     19
solution

             Supplier 1
Aggregator

             Supplier 2



              Supplier 3




                                      20
Create crawldb

    Inject root URLs
        In crawldb           Supplier URLs

        Generate fetchlist


          Fetch content


         Update crawldb




                             plugins written in Scala

                                                        21
logic


Crawl the supplier


                                                     Parse
                            Is URL interesting




                                                 Pass extraction to
                                                       actor

                       seed
                     database



                                                                      22
plugin - scala
class DetailParserFilter extends HtmlParseFilter {

 def filter(content: Content, parseResult: ParseResult, metaTags: HTMLMetaTags, doc:
 DocumentFragment): ParseResult = {
   if (isDetailURL(content.getUrl)) {
     val rawHtml = content.getContent
     if (rawHtml.length > 0) processContent(rawHtml)
   }
   parseResult
 }

 private def isDetailURL(url: String): Boolean = {
   val result = url.matches(AggregatorConfiguration.regexEventDetailPages)
   result
 }

 private def processContent(rawHtml: Array[Byte]) = {
   (new DetailProcessor).start ! rawHtml
 }                                                                                     23
result

5 suppliers crawled

Crawl cycles run continuously for few days

> 500K seed data collected




All with Nutch and 823 lines of Scala code


                                                      24
demo




in action ….




                      25
resources

         http://blog.knoldus.com


http://wiki.apache.org/nutch/NutchTutorial


       http://www.scala-lang.org/


          vikas@knoldus.com



                                                 26

More Related Content

What's hot

A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm Crawler
Julien Nioche
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
Julien Nioche
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
Rahul Jain
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
Steve Watt
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
Fabio Fumarola
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
Cloudera, Inc.
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
lucenerevolution
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
nzhang
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFS
Saumitra Srivastav
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in Scala
Alex Payne
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Jimmy Angelakos
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
Rahul Jain
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
lucenerevolution
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Infrastructure Monitoring with Postgres
Infrastructure Monitoring with PostgresInfrastructure Monitoring with Postgres
Infrastructure Monitoring with Postgres
Steven Simpson
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
Evan Chan
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Mark Kerzner
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Uwe Printz
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 

What's hot (20)

A quick introduction to Storm Crawler
A quick introduction to Storm CrawlerA quick introduction to Storm Crawler
A quick introduction to Storm Crawler
 
StormCrawler in the wild
StormCrawler in the wildStormCrawler in the wild
StormCrawler in the wild
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Web Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache NutchWeb Crawling and Data Gathering with Apache Nutch
Web Crawling and Data Gathering with Apache Nutch
 
11. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/211. From Hadoop to Spark 2/2
11. From Hadoop to Spark 2/2
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
 
Large Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and FriendsLarge Scale Crawling with Apache Nutch and Friends
Large Scale Crawling with Apache Nutch and Friends
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
Friends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFSFriends of Solr - Nutch & HDFS
Friends of Solr - Nutch & HDFS
 
Building Distributed Systems in Scala
Building Distributed Systems in ScalaBuilding Distributed Systems in Scala
Building Distributed Systems in Scala
 
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph DatabaseBringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
Bringing the Semantic Web closer to reality: PostgreSQL as RDF Graph Database
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Infrastructure Monitoring with Postgres
Infrastructure Monitoring with PostgresInfrastructure Monitoring with Postgres
Infrastructure Monitoring with Postgres
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
Nutch + Hadoop scaled, for crawling protected web sites (hint: Selenium)
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 

Similar to Harnessing the power of Nutch with Scala

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
Julien Nioche
 
Gradle
GradleGradle
Java SE 8 & EE 7 Launch
Java SE 8 & EE 7 LaunchJava SE 8 & EE 7 Launch
Java SE 8 & EE 7 Launch
Digicomp Academy AG
 
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
Maarten Balliauw
 
Ratpack JVM_MX Meetup February 2016
Ratpack JVM_MX Meetup February 2016Ratpack JVM_MX Meetup February 2016
Ratpack JVM_MX Meetup February 2016
Domingo Suarez Torres
 
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and SearchNDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
Maarten Balliauw
 
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
Maarten Balliauw
 
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se....NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
Maarten Balliauw
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
Saltlux Inc.
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
ontopic
 
Using React with Grails 3
Using React with Grails 3Using React with Grails 3
Using React with Grails 3
Zachary Klein
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
Sadayuki Furuhashi
 
ExSchema
ExSchemaExSchema
ExSchema
jccastrejon
 
Writing Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason LeeWriting Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason Lee
jaxconf
 
Java 8 in Anger (QCon London)
Java 8 in Anger (QCon London)Java 8 in Anger (QCon London)
Java 8 in Anger (QCon London)
Trisha Gee
 
Java 8 in Anger, Devoxx France
Java 8 in Anger, Devoxx FranceJava 8 in Anger, Devoxx France
Java 8 in Anger, Devoxx France
Trisha Gee
 
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
Arun Gupta
 
Getting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQLGetting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQL
Morgan Dedmon
 
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek PiotrowskiJDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
PROIDEA
 
Spock
SpockSpock

Similar to Harnessing the power of Nutch with Scala (20)

Low latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache StormLow latency scalable web crawling on Apache Storm
Low latency scalable web crawling on Apache Storm
 
Gradle
GradleGradle
Gradle
 
Java SE 8 & EE 7 Launch
Java SE 8 & EE 7 LaunchJava SE 8 & EE 7 Launch
Java SE 8 & EE 7 Launch
 
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
CloudBurst 2019 - Indexing and searching NuGet.org with Azure Functions and S...
 
Ratpack JVM_MX Meetup February 2016
Ratpack JVM_MX Meetup February 2016Ratpack JVM_MX Meetup February 2016
Ratpack JVM_MX Meetup February 2016
 
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and SearchNDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
NDC Oslo 2019 - Indexing and searching NuGet.org with Azure Functions and Search
 
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
Indexing and searching NuGet.org with Azure Functions and Search - Cloud Deve...
 
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se....NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
.NET Conf 2019 - Indexing and searching NuGet.org with Azure Functions and Se...
 
Web Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC ProjectWeb Scale Reasoning and the LarKC Project
Web Scale Reasoning and the LarKC Project
 
Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015Storm crawler apachecon_na_2015
Storm crawler apachecon_na_2015
 
Using React with Grails 3
Using React with Grails 3Using React with Grails 3
Using React with Grails 3
 
Plugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGemsPlugin-based software design with Ruby and RubyGems
Plugin-based software design with Ruby and RubyGems
 
ExSchema
ExSchemaExSchema
ExSchema
 
Writing Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason LeeWriting Plugged-in Java EE Apps: Jason Lee
Writing Plugged-in Java EE Apps: Jason Lee
 
Java 8 in Anger (QCon London)
Java 8 in Anger (QCon London)Java 8 in Anger (QCon London)
Java 8 in Anger (QCon London)
 
Java 8 in Anger, Devoxx France
Java 8 in Anger, Devoxx FranceJava 8 in Anger, Devoxx France
Java 8 in Anger, Devoxx France
 
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
Servlets 3.0 - Asynchronous, Easy, Extensible @ Silicon Valley Code Camp 2010
 
Getting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQLGetting started with Apollo Client and GraphQL
Getting started with Apollo Client and GraphQL
 
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek PiotrowskiJDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
JDD2015: ClassIndex - szybka alternatywa dla skanowania klas - Sławek Piotrowski
 
Spock
SpockSpock
Spock
 

More from Knoldus Inc.

Terratest - Automation testing of infrastructure
Terratest - Automation testing of infrastructureTerratest - Automation testing of infrastructure
Terratest - Automation testing of infrastructure
Knoldus Inc.
 
Getting Started with Apache Spark (Scala)
Getting Started with Apache Spark (Scala)Getting Started with Apache Spark (Scala)
Getting Started with Apache Spark (Scala)
Knoldus Inc.
 
Secure practices with dot net services.pptx
Secure practices with dot net services.pptxSecure practices with dot net services.pptx
Secure practices with dot net services.pptx
Knoldus Inc.
 
Distributed Cache with dot microservices
Distributed Cache with dot microservicesDistributed Cache with dot microservices
Distributed Cache with dot microservices
Knoldus Inc.
 
Introduction to gRPC Presentation (Java)
Introduction to gRPC Presentation (Java)Introduction to gRPC Presentation (Java)
Introduction to gRPC Presentation (Java)
Knoldus Inc.
 
Using InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in JmeterUsing InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in Jmeter
Knoldus Inc.
 
Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)
Knoldus Inc.
 
Stakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) PresentationStakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) Presentation
Knoldus Inc.
 
Introduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) PresentationIntroduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) Presentation
Knoldus Inc.
 
Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)
Knoldus Inc.
 
Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)
Knoldus Inc.
 
Clean Code in Test Automation Differentiating Between the Good and the Bad
Clean Code in Test Automation  Differentiating Between the Good and the BadClean Code in Test Automation  Differentiating Between the Good and the Bad
Clean Code in Test Automation Differentiating Between the Good and the Bad
Knoldus Inc.
 
Integrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test AutomationIntegrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test Automation
Knoldus Inc.
 
State Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptxState Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptx
Knoldus Inc.
 
Authentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptxAuthentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptx
Knoldus Inc.
 
OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)
Knoldus Inc.
 
Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptx
Knoldus Inc.
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Knoldus Inc.
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On Introduction
Knoldus Inc.
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptx
Knoldus Inc.
 

More from Knoldus Inc. (20)

Terratest - Automation testing of infrastructure
Terratest - Automation testing of infrastructureTerratest - Automation testing of infrastructure
Terratest - Automation testing of infrastructure
 
Getting Started with Apache Spark (Scala)
Getting Started with Apache Spark (Scala)Getting Started with Apache Spark (Scala)
Getting Started with Apache Spark (Scala)
 
Secure practices with dot net services.pptx
Secure practices with dot net services.pptxSecure practices with dot net services.pptx
Secure practices with dot net services.pptx
 
Distributed Cache with dot microservices
Distributed Cache with dot microservicesDistributed Cache with dot microservices
Distributed Cache with dot microservices
 
Introduction to gRPC Presentation (Java)
Introduction to gRPC Presentation (Java)Introduction to gRPC Presentation (Java)
Introduction to gRPC Presentation (Java)
 
Using InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in JmeterUsing InfluxDB for real-time monitoring in Jmeter
Using InfluxDB for real-time monitoring in Jmeter
 
Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)Intoduction to KubeVela Presentation (DevOps)
Intoduction to KubeVela Presentation (DevOps)
 
Stakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) PresentationStakeholder Management (Project Management) Presentation
Stakeholder Management (Project Management) Presentation
 
Introduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) PresentationIntroduction To Kaniko (DevOps) Presentation
Introduction To Kaniko (DevOps) Presentation
 
Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)Efficient Test Environments with Infrastructure as Code (IaC)
Efficient Test Environments with Infrastructure as Code (IaC)
 
Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)Exploring Terramate DevOps (Presentation)
Exploring Terramate DevOps (Presentation)
 
Clean Code in Test Automation Differentiating Between the Good and the Bad
Clean Code in Test Automation  Differentiating Between the Good and the BadClean Code in Test Automation  Differentiating Between the Good and the Bad
Clean Code in Test Automation Differentiating Between the Good and the Bad
 
Integrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test AutomationIntegrating AI Capabilities in Test Automation
Integrating AI Capabilities in Test Automation
 
State Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptxState Management with NGXS in Angular.pptx
State Management with NGXS in Angular.pptx
 
Authentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptxAuthentication in Svelte using cookies.pptx
Authentication in Svelte using cookies.pptx
 
OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)OAuth2 Implementation Presentation (Java)
OAuth2 Implementation Presentation (Java)
 
Supply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptxSupply chain security with Kubeclarity.pptx
Supply chain security with Kubeclarity.pptx
 
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML ParsingMastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
Mastering Web Scraping with JSoup Unlocking the Secrets of HTML Parsing
 
Akka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On IntroductionAkka gRPC Essentials A Hands-On Introduction
Akka gRPC Essentials A Hands-On Introduction
 
Entity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptxEntity Core with Core Microservices.pptx
Entity Core with Core Microservices.pptx
 

Recently uploaded

Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
Claudio Di Ciccio
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
IndexBug
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfAI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
Techgropse Pvt.Ltd.
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
名前 です男
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 

Recently uploaded (20)

Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
CAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on BlockchainCAKE: Sharing Slices of Confidential Data on Blockchain
CAKE: Sharing Slices of Confidential Data on Blockchain
 
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceAI 101: An Introduction to the Basics and Impact of Artificial Intelligence
AI 101: An Introduction to the Basics and Impact of Artificial Intelligence
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdfAI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
AI-Powered Food Delivery Transforming App Development in Saudi Arabia.pdf
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
みなさんこんにちはこれ何文字まで入るの?40文字以下不可とか本当に意味わからないけどこれ限界文字数書いてないからマジでやばい文字数いけるんじゃないの?えこ...
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
 

Harnessing the power of Nutch with Scala

  • 1. Crawling the web, Nutch with Scala Vikas Hazrati @
  • 2. about CTO at Knoldus Software Co-Founder at MyCellWasStolen.com Community Editor at InfoQ.com Dabbling with Scala – last 40 months Enterprise grade implementations on Scala – 18 months 2
  • 3. nutch Web search crawler link-graph parsing software solr lucene 3
  • 4. nutch – but we have google! transparent understanding extensible 4
  • 5. nutch – basic architecture crawler searcher 5
  • 6. nutch - architecture Recursive segments crawler links web database pages fetchlists Crawl db 6
  • 7. nutch – crawl cycle generate – fetch – update cycle Create crawldb Inject root URLs In crawldb Update segments Generate fetchlist Index fetched pages Fetch content repeat until depth reached deduplication Update crawldb Merge indexes for searching bin/nutch crawl urls -dir crawl -depth 3 -topN 5 7
  • 8. nutch - plugins generate – fetch – update cycle Create crawldb parser Inject root URLs In crawldb HTMLParserFilter Generate fetchlist Fetch content URL Filter Update crawldb scoring filter 8
  • 9. nutch – extension points plugin.xml // tells Nutch about the plugin build.xml // build the plugin ivy.xml // plugin dependencies // plugin source src 9
  • 10. nutch - example <plugin id="KnoldusAggregator" name="Knoldus Parse Filter" version="1.0.0" provider-name="nutch.org"> <runtime> <library name="kdaggregator.jar"> <export name="*" /> </library> </runtime> <requires> <import plugin="nutch-extensionpoints" /> </requires> <extension id="org.apache.nutch.parse.headings" name="Nutch Headings Parse Filter" point="org.apache.nutch.parse.HtmlParseFilter"> <implementation id="KDParseFilter" class="com.knoldus.aggregator.server.plugins.DetailParserFilter "></implementation> </extension> </plugin> 10
  • 11. public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc) { LOG.debug("Parsing URL: " + content.getUrl()); } Parse parse = parseResult.get(content.getUrl()); Metadata metadata = parse.getData().getParseMeta(); for (String tag : tags) { metadata.add(TAG_KEY, tag); } return parseResult; } 11
  • 12. scala I have Java ! concurrency verbose popular Strongly typed jvm OO library 12
  • 13. scala Java: class Person { private String firstName; private String lastName; private int age; public Person(String firstName, String lastName, int age) { this.firstName = firstName; this.lastName = lastName; this.age = age; } public void setFirstName(String firstName) { this.firstName = firstName; } public void String getFirstName() { return this.firstName; } public void setLastName(String lastName) { this.lastName = lastName; } public void String getLastName() { return this.lastName; } public void setAge(int age) { this.age = age; } public void int getAge() { return this.age; } } Scala: class Person(var firstName: String, var lastName: String, var age: Int) Source: http://blog.objectmentor.com/articles/2008/08/03/the-seductions-of-scala-part-i 13
  • 14. scala Java – everything is an object unless it is primitive Scala – everything is an object. period. Java – has operators (+, -, < ..) and methods Scala – operators are methods Java – statically typed – Thing thing = new Thing() Scala – statically typed but uses type inferencing val thing = new Thing 14
  • 15. evolution 15
  • 16. scala and concurrency Fine grained coarse grained Actors 16
  • 17. actors 17
  • 18. 18
  • 20. solution Supplier 1 Aggregator Supplier 2 Supplier 3 20
  • 21. Create crawldb Inject root URLs In crawldb Supplier URLs Generate fetchlist Fetch content Update crawldb plugins written in Scala 21
  • 22. logic Crawl the supplier Parse Is URL interesting Pass extraction to actor seed database 22
  • 23. plugin - scala class DetailParserFilter extends HtmlParseFilter { def filter(content: Content, parseResult: ParseResult, metaTags: HTMLMetaTags, doc: DocumentFragment): ParseResult = { if (isDetailURL(content.getUrl)) { val rawHtml = content.getContent if (rawHtml.length > 0) processContent(rawHtml) } parseResult } private def isDetailURL(url: String): Boolean = { val result = url.matches(AggregatorConfiguration.regexEventDetailPages) result } private def processContent(rawHtml: Array[Byte]) = { (new DetailProcessor).start ! rawHtml } 23
  • 24. result 5 suppliers crawled Crawl cycles run continuously for few days > 500K seed data collected All with Nutch and 823 lines of Scala code 24
  • 26. resources http://blog.knoldus.com http://wiki.apache.org/nutch/NutchTutorial http://www.scala-lang.org/ vikas@knoldus.com 26