This document summarizes the history and development of the Nutch web search engine project. It discusses how Nutch evolved from its original version to incorporate Hadoop and become more modular by delegating functions like indexing and parsing to other Apache projects like Solr and Tika. The current version, Nutch 2.0, aims to have a slimmed down architecture where it acts as a delegator to these other frameworks rather than handling these functions itself. The document also reflects on lessons learned from earlier stages of the project around community engagement, maintenance, and configuration challenges.
This slideset presents the Nutch search engine (http://lucene.apache.org/nutch). A high-level architecture is described, as well as some challenges common in web-crawling and solutions implemented in Nutch. The presentation closes with a brief look into the Nutch future.
This talk will give an overview of Apache Nutch, its main components, how it fits with other Apache projects and its latest developments.
Apache Nutch was started exactly 10 years ago and was the starting point for what later became Apache Hadoop and also Apache Tika. Nutch is nowadays the tool of reference for large scale web crawling.
In this talk I will give an overview of Apache Nutch and describe its main components and how Nutch fits with other Apache projects such as Hadoop, SOLR or Tika.
The second part of the presentation will be focused on the latest developments in Nutch and the changes introduced by the 2.x branch with the use of Apache GORA as a front end to various NoSQL datastores.
Talk about Apache Nutch on ApacheCon Europe 2014:
http://sched.co/1nyYa7b
http://events.linuxfoundation.org/sites/events/files/slides/aceu2014-snagel-web-crawling-nutch.pdf
Storm-Crawler is a collection of resources for building low-latency, large scale web crawlers on Apache Storm. We will compare with similar projects like Apache Nutch and present several use cases where the storm-crawler is being used. In particular we will see how the Storm-crawler can be used with ElasticSearch and Kibana for crawling and indexing web pages.
Large Scale Crawling with Apache Nutch and Friends - Julien Nioche
This session will give an overview of Apache Nutch. I will describe its main components and how it fits with other Apache projects such as Hadoop, SOLR, Tika or HBase. The second part of the presentation will be focused on the latest developments in Nutch, the differences between the 1.x and 2.x branch and what we can expect to see in Nutch in the future. This session will cover many practical aspects and should be a good starting point to crawling on a large scale with Apache Nutch and SOLR.
StormCrawler presentation given at the Bristech meetup on 6/10/2016. Covers the main concepts and functionalities of Apache Storm, then describes StormCrawler with a step by step approach to building a scalable web crawler. Finally we saw 3 real users of StormCrawler, illustrating the versatility of the project.
Low latency scalable web crawling on Apache Storm - Julien Nioche
In this talk I will introduce Storm-Crawler https://github.com/DigitalPebble/storm-crawler, a collection of resources for building low-latency, large scale web crawlers on Apache Storm. We will compare with similar projects like Apache Nutch and present several use cases where the storm-crawler is being used. In particular we will see how the Storm-crawler can be used with ElasticSearch and Kibana for crawling and indexing web pages.
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv - larsgeorge
This talk is about showing the complexity in building a data pipeline in Hadoop, starting with the technology aspect, and the correlating to the skillsets of current Hadoop adopters.
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition) - Uwe Printz
Talk held at the IT-Stammtisch Darmstadt on 08.11.2013
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Building a Large Scale SEO/SEM Application with Apache Solr - Rahul Jain
Slides from my talk on "Building a Large Scale SEO/SEM Application with Apache Solr" in Lucene/Solr Revolution 2014 where I talk how we handle Indexing/Search of 40 billion records (documents)/month in Apache Solr with 4.6 TB compressed index data.
Abstract: We are working on building a SEO/SEM application where an end user search for a "keyword" or a "domain" and gets all the insights about these including Search engine ranking, CPC/CPM, search volume, No. of Ads, competitors details etc. in a couple of seconds. To have this intelligence, we get huge web data from various sources and after intensive processing it is 40 billion records/month in MySQL database with 4.6 TB compressed index data in Apache Solr.
Due to large volume, we faced several challenges while improving indexing performance, search latency and scaling the overall system. In this session, I will talk about our several design approaches to import data faster from MySQL, tricks & techniques to improve the indexing performance, Distributed Search, DocValues(life saver), Redis and the overall system architecture.
Building a Scalable Web Crawler with Hadoop by Ahad Rana from CommonCrawl
Ahad Rana, engineer at CommonCrawl, will go over CommonCrawl’s extensive use of Hadoop to fulfill their mission of building an open, and accessible Web-Scale crawl. He will discuss their Hadoop data processing pipeline, including their PageRank implementation, describe techniques they use to optimize Hadoop, discuss the design of their URL Metadata service, and conclude with details on how you can leverage the crawl (using Hadoop) today.
At Spotify we collect huge volumes of data for many purposes. Reporting to labels, powering our product features, and analyzing user growth are some of our most common ones. Additionally, we collect many operational metrics related to the responsiveness, utilization and capacity of our servers. To store and process this data, we use scalable and fault-tolerant multi-system infrastructure, and Apache Hadoop is a key part of it. Surprisingly or not, Apache Hadoop generates large amounts of data in the form of logs and metrics that describe its behaviour and performance. To process this data in a scalable and performant manner we use … also Hadoop! During this presentation, I will talk about how we analyze various logs generated by Apache Hadoop using custom scripts (written in Pig or Java/Python MapReduce) and available open-source tools to get data-driven answers to many questions related to the behaviour of our 690-node Hadoop cluster. At Spotify we frequently leverage these tools to learn how fast we are growing, when to buy new nodes, how to calculate the empirical retention policy for each dataset, optimize the scheduler, benchmark the cluster, find its biggest offenders (both people and datasets) and more.
Wengines, Workflows, and 2 years of advanced data processing in Apache OODT - Chris Mattmann
With the advent of OODT-215 and OODT-491, there has been a tremendous amount of work to port our next generation Workflow Management system (cutely dubbed "WEngine" for "workflow engine") from an isolated branch into the mainline trunk.
The WEngine system brings amazing advantages including explicit support for branch and bounds in workflow models; prioritized thread pooling and queueing on a per task, and per workflow level; global workflow level conditions (pre and post); condition and workflow timeouts, and an entirely new and more descriptive state model complete with failure codes, and with checkpointing.
WEngine is currently processing the NPOESS Preparatory Project (NPP) PEATE testbed and its thousands of jobs per day, and is being slowly introduced into processing of an entire snow and ice climatology for the Western US and Alaska for the U.S. National Climate Assessment (NCA), working with the world's best snow hydrologists and snow scientists.
With all of those new features, what's an Apache OODT user and fan to do? How can you use WEngine in your system? How does it work today? How will it work tomorrow? We'll answer those questions and more in this fly-by-the-seat-of-your-pants exciting super talk!
Data Science at Scale: Using Apache Spark for Data Science at Bitly - Sarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Building a reliable pipeline of data ingress, batch computation, and data egress with Hadoop can be a major challenge. Most folks start out with cron to manage workflows, but soon discover that doesn't scale past a handful of jobs. There are a number of open-source workflow engines with support for Hadoop, including Azkaban (from LinkedIn), Luigi (from Spotify), and Apache Oozie. Having deployed all three of these systems in production, Joe will talk about what features and qualities are important for a workflow system.
An overview of all the different content related technologies at the Apache Software Foundation
Talk from ApacheCon NA 2010 in Atlanta in November 2010
This slide deck that Mr. Minh Tran - KMS's Software Architect shared at "Java-Trends and Career Opportunities" seminar of Information Technology Center of HCMC University of Science.
If You Have The Content, Then Apache Has The Technology! - gagravarr
Within the ASF, there are a wide variety of projects with technologies to help you store, retrieve, host, transform and generate content. This talk will review the landscape of Apache content technologies, provide a quick introduction to the more common and more interesting projects, and flag up new and innovative features within them. It'll also highlight talks from the rest of the week on many of the projects covered, so that you'll know where and when to go to learn more about those projects and technologies which catch your eye!
Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond
1. Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond
Chris A. Mattmann
Senior Computer Scientist, NASA Jet Propulsion Laboratory
Adjunct Assistant Professor, Univ. of Southern California
Member, Apache Software Foundation
2. Roadmap
• What is Nutch?
• What are the current versions of Nutch?
• What can it do?
• What did we do right?
• What did we do wrong?
• Where is Nutch going?
3. And you are?
• Apache Member involved in
– Tika (VP, PMC), Nutch (PMC), Incubator (PMC), OODT (Mentor), SIS (Mentor), Lucy (Mentor) and Gora (Champion)
• Architect/Developer at NASA JPL in Pasadena, CA
• Software Architecture/Engineering Prof at USC
4. Nutch is…
• A project originally started by Doug Cutting
• Nutch builds upon the lower-level text indexing library and API called Lucene
• Nutch provides crawling services, protocol services, parsing services, and content management services on top of the indexing capability provided by Lucene
• Allows you to stand up a web-scale infrastructure
5. Community
• Mailing lists
– User: 972 peeps
– Dev: 520 peeps
• Committers/PMC
– 8 peeps
– All 8 active: SERIOUSLY
• Releases
– 11 releases so far
– Working on 2.0
Credit: svnsearch.org
6. What Currently Exists?
• Version 0.6.x
– First easily deployable version
• Version 0.7.x
– Added several new features, including several new parsers (MS Word, PowerPoint), the URLFilter extension point, the first Apache release after Incubation, and the mime-type system
• Version 0.8.x
– Completely new underlying architecture based on Hadoop
– Parse plugins framework, multi-valued metadata container
– Parser Factory enhancement
• Version 0.9.x
– Major bug fixes
– Hadoop and Lucene library upgrades
• Version 1.0
– Flexible filter framework
– Flexible scoring
– Initial integration with Tika
– Full search-engine functionality and capabilities, in production at large scale (Internet Archive)
7. What are the recent versions?
• Version 1.1: upgraded all Nutch library deps (Hadoop, Tika, etc.) and made the Fetcher faster
• Version 1.2: fixed some big-time bugs (NPE in distributed search), lots of feature upgrades
– You should be using this version
8. Some active dev areas
• Plenty!
• Bug fixes (> 200 issues in JIRA right now with no resolution)
• Nutch 2.0 architecture
– http://search-lucene.com/m/gbrBF1RMWk9
– Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM
9. Why Nutch?
• Observation: web search is a commodity
– Why can’t it be provided freely?
• Allows tweaking of typically “hidden” ranking algorithms
• Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities
10. Why Nutch?
• Value-added capabilities
– Improving fetching speed
– Parsing and handling of the hundreds of different content types available on the internet
– Handling different protocols for obtaining content
– Better ranking algorithms (OPIC, PageRank)
• More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework
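The link-based ranking mentioned above can be illustrated with a small, self-contained example. The sketch below is a generic PageRank power iteration in plain Java — not Nutch’s actual scoring-plugin code; the class name, the damping factor used in the demo, and the assumption that every page has at least one outlink are all illustrative.

```java
import java.util.Arrays;

/**
 * Minimal PageRank power-iteration sketch, illustrating the kind of
 * link-based scoring (OPIC, PageRank) that Nutch makes pluggable.
 * Hypothetical example code, not Nutch's implementation.
 */
public class PageRank {
    /**
     * graph[i] lists the pages that page i links to.
     * Assumes every page has at least one outlink (no dangling nodes).
     */
    public static double[] rank(int[][] graph, double damping, int iters) {
        int n = graph.length;
        double[] r = new double[n];
        Arrays.fill(r, 1.0 / n);                     // start from a uniform score
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n);  // random-jump term
            for (int i = 0; i < n; i++) {
                // each page spreads its damped score evenly over its outlinks
                double share = damping * r[i] / graph[i].length;
                for (int j : graph[i]) next[j] += share;
            }
            r = next;
        }
        return r;
    }

    public static void main(String[] args) {
        // Toy web: page 0 -> {1, 2}, page 1 -> {2}, page 2 -> {0}
        double[] r = rank(new int[][]{{1, 2}, {2}, {0}}, 0.85, 100);
        System.out.println(Arrays.toString(r));
    }
}
```

In Nutch itself, scoring is pluggable (e.g., the OPIC scoring plugin), so a computation like this lives behind an extension point rather than running standalone.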
13. Real-world application of Nutch
• I work at NASA’s Jet Propulsion Laboratory
• NASA’s Planetary Data System
– NASA’s archive for all planetary science data collected by missions over the past 30 years
– Collected 20 TB over the past 30 years
• Increasing to over 200 TB in the next 3 years!
– Built up a catalog of all data collected
• Where does Nutch fit in?
14. Where does Nutch fit into the PDS?
• The PDS Management Council decided they wanted “Google-like” search of the PDS catalog
• Our plan: use Nutch to implement the capability for the PDS
15. PDS Google-like Search Architecture
[Architecture diagram: a standard search-engine architecture (e.g. Nutch, Google) applied to the PDS. A Crawler pulls records from the existing PDS Catalog; PDS extract and parser components process the content; an Indexer builds a Lucene index; queries are served by the pds.war webapp running in a Tomcat web server, backed by the catalog metadata.]
Credit: D. Crichton, S. Hughes, P. Ramirez, R. Joyner, S. Hardman, C. Mattmann
16. Approach
• Export PDS catalog datasets in RDF format (flat files)
• Use Nutch to crawl the RDF files
– protocol-file plugin in Nutch
• Wrote our own parse-pds plugin
– Parse the RDF files, and then extract the metadata
• Wrote our own index-pds plugin
– Index the fields that we want from the parsed metadata
• Wrote our own query-pds plugin
– Search the index on the fields that we want
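For readers unfamiliar with Nutch’s plugin mechanism, a parse plugin such as parse-pds is registered through a plugin descriptor. The fragment below is a hedged sketch of what such a descriptor might look like — the plugin id, class name, jar name, and content type shown are hypothetical, though registering a parser at Nutch’s Parser extension point via a plugin.xml is how the mechanism works in the 1.x line.

```xml
<!-- plugin.xml for a hypothetical parse-pds plugin (illustrative sketch) -->
<plugin id="parse-pds" name="PDS RDF Parser" version="1.0.0" provider-name="example">
  <runtime>
    <!-- jar holding the plugin classes -->
    <library name="parse-pds.jar">
      <export name="*"/>
    </library>
  </runtime>
  <!-- hook the parser into Nutch's Parser extension point for RDF content -->
  <extension id="org.example.nutch.parse.pds" name="PDS Parser"
             point="org.apache.nutch.parse.Parser">
    <implementation id="org.example.nutch.parse.pds.PdsParser"
                    class="org.example.nutch.parse.pds.PdsParser">
      <parameter name="contentType" value="application/rdf+xml"/>
    </implementation>
  </extension>
</plugin>
```

The index-pds and query-pds plugins on this slide hang off analogous extension points in the same descriptor-driven way.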
19. Some Nutch History
• In the next few slides, we’ll go through some of Nutch’s history, including my involvement, the history of Nutch dev, and how we came to today
20. How I got involved
• In CS72: Seminar on Search Engines at USC
– Okay, well, it used to be called CS599, but you get the picture
• Started out by contributing an RSS parsing plugin
– My final project in 599
• Moved on from there to
– NUTCH-88, redesign of the parsing framework
– NUTCH-139, metadata container support
– NUTCH-210, web context application file
– And various other bug fixes and contributions here and there
– Mailing list support
– Wiki support
• Became a committer in October 2006
• Helped spin Nutch into an Apache TLP, March 2010; Nutch PMC member
21. The Big Yellow Elephant
• Before this guy was born
• Lots of folks interested in Nutch
Hadoop is born (January 2008)
Credit: svnsearch.org
22. Post Hadoop Life
• The Nutch project kind of withered
– Well, more than “kind of”: it did wither
– Went years in between releases
• 0.8 to 1.0 took a while
• The dev community went into maintenance mode
– Many committers simply went inactive
• The user community deteriorated
23. Some Observations
• It was pretty difficult to attract new committers
– Took too long to VOTE them in
– They were only interested in Hadoop-type stuff
– Not many organizations were doing web-scale search
• Existing active committers dwindled
• I was one of them!
24. Some Observations
• There wasn’t a plan for what to do next
– What features to work on?
– What bugs to fix?
– Many considered Nutch to be “production”-worthy in its current form, and there wasn’t a huge number of internet-scale users, so people just “put up” with its existing issues, e.g., difficult to configure
25. Hadoop wasn’t the only spinoff
• A lot of us interested in content detection and analysis, another major Nutch strength, went off to work on that in some other Apache project that I can’t remember the name of
26. How can Nutch reorganize?
• Strong feeling from the Nutch community that we should take whoever was left and think about what the “next generation” Nutch (Nutch2) would look like
• (Several cycles of) mailing threads started by Andrzej Bialecki, Dennis Kubes, Otis Gospodnetic
27. Initial Nutch2 fizzles
• Ended up being a lot of talk, but there wasn’t enough interest to pick up a shovel and help dig the hole
• But… there were interesting things going on
– Example: Nutchbase work from Dogacan and Enis
28. What was “Nutchbase”?
• Take the Apache implementation of Google’s “BigTable”
– Column-oriented storage, high scalability in columns and rows
• Store Nutch web page content
29. Lots of interest in Nutchbase
• But sadly maintained as a patch for a year or more
– NUTCH-650: HBase integration
• Brought about some interesting thoughts
– If storage can be abstracted, what about…?
• Messaging layer (JMS Nutch?)
• Parsing?
• Indexing (Solr, Lucene, you-name-it)
30. Post Nutch 1.0
• The Nutch 1.0 release was a true “1.”-oh!
– Included production features
– Those using it were happy, because they had bought into the model
– Usable, tuneable
• But how do we get to Nutch 2.0?
31. A few things happen in parallel
• 1.1 release?
– I had some free time and was willing to RM a Nutch 1.1 release to get things going
• Dogacan, Enis, Julien and Andrzej got interested in moving Nutchbase forward
– But took it to the next level… we’ll get back to this
• We elected a new committer: Julien Nioche
• Patches that had sat for years now got committed
32. Oh, and Nutch became TLP
• Grabbed folks that were active in the Nutch community
• Decided to move forward with Nutch/HBase as the de-facto platform
– No need to maintain home-grown storage formats
– And take it to the next level, to ORM-ness
• Decided to make Nutch a “delegator” rather than a workhorse
– In other words…
33. Nutch2: “Delegator”
• Indexing/querying?
– Solr has a lot of interest and does tons of work in this area: let’s use it instead of vanilla Lucene
• Parsing?
– Tika: ditto
• Storage?
– Let’s use the ORM layer that some of the Nutch committers were working on
34. Enter Gora: “that ORM technology”
• Initially baked up at GitHub
• Decided to move to the Incubator in Sept 2010
– I was contacted and asked to champion the effort
• What is Gora?
– Uses Apache Avro to specify objects and their schema
– ORM middleware takes the Avro specs and generates Java code – plugs for HBase, Cassandra, an in-memory SQL store, etc.
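To make the Avro step concrete: Gora-style tooling starts from a record schema and generates the corresponding Java persistence class, which the ORM layer then maps onto HBase, Cassandra, and so on. The record below is an illustrative sketch only — the namespace and field names are hypothetical and deliberately simplified, not Nutch 2.x’s actual webpage schema.

```json
{
  "type": "record",
  "name": "WebPage",
  "namespace": "org.example.nutch",
  "fields": [
    {"name": "baseUrl",   "type": ["null", "string"], "default": null},
    {"name": "status",    "type": "int",  "default": 0},
    {"name": "fetchTime", "type": "long", "default": 0},
    {"name": "content",   "type": ["null", "bytes"],  "default": null},
    {"name": "text",      "type": ["null", "string"], "default": null},
    {"name": "outlinks",  "type": {"type": "map", "values": "string"}, "default": {}}
  ]
}
```

Because the schema, not hand-written Writable code, is the source of truth, swapping the backing datastore becomes a configuration change rather than a rewrite.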
35. Nutch and Gora
• Throw out all code in Nutch that had to do with the Writable interface
– Now generated from the “WebPage” schema in Gora
– WebPage is the canonical Nutch object for storage
• Parse text, parse data, etc.
• No more web-db, crawl-db, etc.
36. Out with the old…
• Throw out the Nutch webapp
– Solr provides RESTful services to get at the metadata/index
– We’ll add the REST (pun) for storage/etc.
• Throw out the Lucene code
• Slowly trash existing Nutch parsers
37. In with the new
• Get rid of the webapp
– Nutch 2.x has seen contributions of REST web services for the full crawl cycle and storage I/F
• Delegate indexing to Solr
– Nutch 1.x: first appearance of SolrIndexer and the Nutch Solr schema
• Delegate parsing to Tika
– Nutch 1.1: first appearance of parse-tika
– Have been decommissioning existing parsers
• Suggested improvements to Tika during this process
39. Learning from our mistakes
• Maintenance
– Checking in jars made the Nutch checkout huge (even of just the “source”)
• Now using Ivy to manage dependencies
– Patches sitting?
• Not on my watch! Encouragement to find and commit patches that have been sitting for a while, or simply disposition them
– People want to use Nutch code as a “dep”
• The build now includes the ability for the RM to push to Maven Central
NOTE: CHRIS’S OPINION SLIDE
40. Learning from our mistakes
• Community
– Folks contributing patches?
• Make ’em a committer
– Folks providing good testing results?
• Make ’em a committer
– Folks making good documentation?
• Make ’em a committer
– It’s the sign of a healthy Apache project if new committers (and members) are being elected
NOTE: CHRIS’S OPINION SLIDE
41. Learning from our mistakes
• Configuration of Nutch is hard
– It still is
– Getting easier, though
– Anyone have any great ideas or patches to integrate with a DI framework?
– Things like Gora, Solr, etc., are making this easier
• Providing flexible service interfaces beyond Java APIs
– Existing work on NUTCH-932, NUTCH-931 and NUTCH-880 is just the beginning
42. Interesting work going on
• I taught a class on search engines this past summer
• Some neat projects that I’m working with my students to contribute back to Apache
– Implementation of Authority/Hub scoring
– Deduplication improvements
– Clustering plugin improvements
– Work to improve Nutch-Solr-Drupal integration
43. Wrapup
• Nutch has seen tremendous highs and lows over the years
– We’re still kicking
• The newest version of Nutch (2.0) will have a vastly slimmed-down footprint and will use existing successful frameworks for the heavy lifting
– Solr, Tika, Gora, Hadoop
• If you’re interested in our dev, check us out at http://nutch.apache.org
44. Alright, I’ll shut up now
• Any questions?
• THANK YOU!
– mattmann@apache.org
– @chrismattmann on Twitter