Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond


My ApacheconNA 2010 presentation on Nutch: its history, and l

  1. 1. Lessons Learned in the Development of a Web-scale Search Engine: Nutch2 and beyond Chris A. Mattmann Senior Computer Scientist, NASA Jet Propulsion Laboratory Adjunct Assistant Professor, Univ. of Southern California Member, Apache Software Foundation
  2. 2. Roadmap • What is Nutch? • What are the current versions of Nutch? • What can it do? • What did we do right? • What did we do wrong? • Where is Nutch going?
  3. 3. And you are? • Apache Member involved in – Tika (VP,PMC), Nutch (PMC), Incubator (PMC), OODT (Mentor), SIS (Mentor), Lucy (Mentor) and Gora (Champion) • Architect/Developer at NASA JPL in Pasadena, CA • Software Architecture/Engineering Prof at USC
  4. 4. Nutch is… • A project originally started by Doug Cutting • Nutch builds upon the lower level text indexing library and API called Lucene • Nutch provides crawling services, protocol services, parsing services, content management services on top of the indexing capability provided by Lucene • Allows you to stand up a web-scale infrastructure
  5. 5. Community • Mailing lists – User: 972 peeps – Dev: 520 peeps • Committers/PMC – 8 peeps – All 8 active: SERIOUSLY • Releases – 11 releases so far – Working on 2.0 Credit: svnsearch.org
  6. 6. What Currently Exists? • Version 0.6.x – First easily deployable version • Version 0.7.x – Added several new features including several new parsers (MS-WORD, PowerPoint), URLFilter extension point, first Apache release after Incubation, mime type system • Version 0.8.x – Completely new underlying architecture based on Hadoop – Parse plugins framework, multi-valued metadata container – Parser Factory enhancement • Version 0.9.x – Major bug fixes – Hadoop, and Lucene library upgrades • Version 1.0 – Flexible filter framework – Flexible scoring – Initial integration with Tika – Full Search Engine functionality and capabilities, in production at large scale (Internet Archive)
  7. 7. What are the recent versions? • Version 1.1, upgrade all Nutch library deps (Hadoop, Tika, etc.) and make Fetcher faster • Version 1.2, fix some big time bugs (NPE in distributed search), lots of feature upgrades – You should be using this version
  8. 8. Some active dev areas • Plenty! • Bug fixes (> 200 issues in JIRA right now with no resolution) • Nutch 2.0 architecture – http://search-lucene.com/m/gbrBF1RMWk9 – Refactored Nutch architecture, delegating to Solr, HBase, Tika, and ORM
  9. 9. Why Nutch? • Observation: Web Search is a commodity – Why can’t it be provided freely? • Allows tweaking of typically “hidden” ranking algorithms • Allows developers to focus less on the infrastructure (since Brin & Page’s paper, the infrastructure is well-known), and more on providing value-added capabilities
  10. 10. Why Nutch? • Value-added capabilities – Improving fetching speed – Parsing and handling of the hundreds of different content types available on the internet – Handling different protocols for obtaining content – Better ranking algorithms (OPIC, PageRank) • More or less, in Nutch, these capabilities all map to extension points available via Nutch’s plugin framework
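The slide above names PageRank as one of the ranking algorithms that can be swapped in. As a rough illustration of what such a scorer computes, here is a minimal power-iteration PageRank in Python. This is a sketch only: it is not Nutch's scoring plugin code, and the toy link graph, the damping factor of 0.85, and the function name `pagerank` are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Every page passes an equal share of its rank to each page it links to.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "d" has no inlinks, "c" is linked from everywhere.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
```

In this toy graph the page with the most inlinks ends up with the highest score and the page with no inlinks ends up with the lowest, which is the behaviour a link-based scorer is meant to capture.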
  11. 11. Nutch’s Architecture • Nutch Core facilities – Parsing – Indexing – Crawling – Content Acquisition – Querying – Plugin Framework • Nutch’s extension points – Scoring, Parsing, Indexing, Querying, URLFiltering
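To make the idea of extension points concrete, the following Python sketch shows a toy plugin registry in which several URL filters are registered under one extension point and applied in order. It is a conceptual illustration, not Nutch's Java plugin API: the class `PluginRegistry`, the function `accept_url`, and the two example filters are hypothetical names invented here, and the convention that a filter returns the URL to accept it or `None` to reject it is an assumption of the sketch.

```python
from collections import defaultdict


class PluginRegistry:
    """Toy registry mapping an extension point name to its registered plugins."""

    def __init__(self):
        self._extensions = defaultdict(list)

    def register(self, point, plugin):
        self._extensions[point].append(plugin)

    def extensions(self, point):
        return list(self._extensions[point])


def accept_url(registry, url):
    """Run every registered URL filter in order.

    A filter returns the URL (possibly rewritten) to accept it,
    or None to reject it; the first rejection wins.
    """
    for url_filter in registry.extensions("URLFilter"):
        url = url_filter(url)
        if url is None:
            return None
    return url


registry = PluginRegistry()
# Hypothetical filter 1: only crawl http(s) URLs.
registry.register("URLFilter", lambda u: u if u.startswith(("http://", "https://")) else None)
# Hypothetical filter 2: skip common image files.
registry.register("URLFilter", lambda u: None if u.endswith((".gif", ".jpg")) else u)
```

The core only knows the extension point name; adding a new filter means registering one more callable, without touching the crawl loop.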
  12. 12. Nutch’s Architecture Maps to Search engine architecture proposed by Brin & Page
  13. 13. Real world application of Nutch • I work at NASA’s Jet Propulsion Laboratory • NASA’s Planetary Data System – NASA’s archive for all planetary science data collected by missions over the past 30 years – Collected 20 TB over the past 30 years • Increasing to over 200 TB in the next 3 years! – Built up a catalog of all data collected • Where does Nutch fit in?
  14. 14. Where does Nutch fit into the PDS? • PDS Management Council decided they wanted “Google-like” search of the PDS catalog • Our plan: use Nutch to implement capability for PDS
  15. 15. PDS Google-like Search Architecture [Architecture diagram; component labels: Search Engine Architecture (e.g. Nutch, Google), existing PDS catalog (PDS-D), PDS extract, crawler, PDS parser, catalog metadata, indexer, Lucene index, query, pds.war on Tomcat web server] Credit: D. Crichton, S. Hughes, P. Ramirez, R. Joyner, S. Hardman, C. Mattmann
  16. 16. Approach • Export PDS catalog datasets in RDF format (flat files) • Use Nutch to crawl RDF files – protocol-file plugin in Nutch • Wrote our own parse-pds plugin – Parse the RDF files, and then extract the metadata • Wrote our own index-pds plugin – Index the fields that we want from the parsed metadata • Wrote our own query-pds plugin – Search the index on the fields that we want
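The parse, index, and query steps of this approach can be sketched end to end in a few lines of Python. This is a conceptual illustration, not the parse-pds, index-pds, or query-pds plugin code (those are Nutch plugins and are not shown in the slides). The `pds:` namespace, the field names `title` and `target`, the sample records, and the function names are all assumptions made for the example; the real PDS export schema may look quite different.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Hypothetical namespace and fields; the real PDS export schema is not shown in the slides.
PDS_NS = "http://example.org/pds#"

SAMPLE_RDF = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:pds="{PDS_NS}">
  <rdf:Description rdf:about="urn:example:dataset-1">
    <pds:title>Mars surface images</pds:title>
    <pds:target>Mars</pds:target>
  </rdf:Description>
  <rdf:Description rdf:about="urn:example:dataset-2">
    <pds:title>Saturn ring occultations</pds:title>
    <pds:target>Saturn</pds:target>
  </rdf:Description>
</rdf:RDF>"""


def parse(rdf_text):
    """Parse step: turn each rdf:Description into a flat metadata dict."""
    records = []
    for desc in ET.fromstring(rdf_text).iter(f"{{{RDF_NS}}}Description"):
        meta = {"id": desc.get(f"{{{RDF_NS}}}about")}
        for child in desc:
            field = child.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
            meta[field] = (child.text or "").strip()
        records.append(meta)
    return records


def index(records, fields):
    """Index step: build an inverted index over only the chosen fields."""
    inverted = defaultdict(set)
    for meta in records:
        for field in fields:
            for token in meta.get(field, "").lower().split():
                inverted[(field, token)].add(meta["id"])
    return inverted


def query(inverted, field, term):
    """Query step: look up a single term in a single field."""
    return sorted(inverted.get((field, term.lower()), set()))


idx = index(parse(SAMPLE_RDF), fields=["title", "target"])
```

For example, `query(idx, "target", "Mars")` returns the identifier of the first sample dataset. The point of the sketch is the division of labour: each step only needs the output of the previous one, which is what lets each be its own plugin.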
  17. 17. Search Interface
  18. 18. Results
  19. 19. Some Nutch History • In the next few slides, we’ll go through some of Nutch’s history, including my involvement, the history of Nutch dev, and how we came to today
  20. 20. How I got involved • In CS72: Seminar on Search Engines at USC – Okay well it used to be called CS599, but you get the picture • Started out by contributing RSS parsing plugin – My final project in 599 • Moved on from there to – NUTCH-88, redesign of the parsing framework – NUTCH-139, Metadata container support – NUTCH-210, Web Context application file – And various other bug fixes, and contributions here and there – Mailing list support – Wiki support • Became committer in October 2006 • Helped spin Nutch into Apache TLP, March 2010, Nutch PMC member
  21. 21. The Big Yellow Elephant • Before this guy was born • Lots of folks interested in Nutch [Chart annotation: Hadoop is born (January 2008)] Credit: svnsearch.org
  22. 22. Post Hadoop Life • Nutch project kind of withered – Well more than “kind of” it did wither – Went years between releases • 0.8 to 1.0 took a while • Dev Community went into maintenance mode – Many committers simply went inactive • User Community deteriorated
  23. 23. Some Observations • It was pretty difficult to attract new committers – Took too long to VOTE them in – They were only interested in Hadoop type stuff – Not many organizations were doing web-scale search • Existing active committers dwindled • I was one of them!
  24. 24. Some Observations • There wasn’t a plan for what to do next – What features to work on? – What bugs to fix? – Many considered Nutch to be “production” worthy in its current form, and there were not a huge number of internet-scale users, so people just “put up” with its existing issues, e.g., it being difficult to configure
  25. 25. Hadoop wasn’t the only spinoff • A lot of us interested in content detection and analysis, another major Nutch strength, went off to work on that in some other Apache project that I can’t remember the name of
  26. 26. How can Nutch reorganize? • Strong feeling from Nutch community that we should take whoever is left and think about what the “next generation” Nutch (Nutch2) would look like • (Several cycles of) Mailing threads started by Andrzej Bialecki, Dennis Kubes, Otis Gospodnetić
  27. 27. Initial Nutch2 fizzles • Ended up being a lot of talk, but there wasn’t enough interest to pick up a shovel and help dig the hole • But…there were interesting things going on – Example: Nutchbase work from Dogacan, and Enis
  28. 28. What was “Nutchbase”? • Take the Apache implementation of Google’s “BigTable” – Column-oriented storage, high scalability in columns and rows • Store Nutch Web page content
  29. 29. Lots of interest in Nutchbase • But, sadly maintained as a patch for a year or more – NUTCH-650 HBase integration • Brought about some interesting thoughts – If storage can be abstracted, what about? • Messaging layer (JMS Nutch?) • Parsing? • Indexing (Solr, Lucene, you-name-it)
  30. 30. Post Nutch 1.0 • Nutch 1.0 release was a true “1.”-oh! – Included production features – Those using it were happy, b/c they had bought into the model – Useable, tuneable • But, how do we get to Nutch 2.0?
  31. 31. A few things happen in parallel • 1.1 Release? – I had some free time and was willing to RM a Nutch 1.1 release to get things going • Dogacan, Enis, Julien and Andrzej got interested in moving Nutchbase forward – But took it to the next level…we’ll get back to this • We elected a new committer • Julien Nioche • Patches that had sat for years now got committed
  32. 32. Oh, and Nutch became TLP • Grabbed folks that were active in Nutch community • Decided to move forward with Nutch/HBase as the de-facto platform – No need to maintain home-grown storage formats – And, take it to the next level, to ORM-ness • Decided to make Nutch a “delegator” rather than a workhorse – In other words…
  33. 33. Nutch2: “Delegator” • Indexing/Querying? – Solr has a lot of interest and does tons of work in this area: let’s use it instead of vanilla Lucene • Parsing? – Tika: ditto • Storage – Let’s use the ORM layer that some of the Nutch committers were working on
  34. 34. Enter Gora: “that ORM technology” • Initially baked up at Github • Decided to move to the Incubator in Sept 2010 – I was contacted and asked to champion the effort • What is Gora? – Uses Apache Avro to specify objects and their schema – ORM middleware takes Avro specs, generates Java code – plugs for HBase, Cassandra, in-memory SQL store, etc.
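The schema-to-code idea behind Gora can be illustrated with a toy Python sketch: an Avro-style record schema is turned into a class, and a storage backend persists instances of that class without knowing their fields in advance. This is not Gora's compiler or API (Gora generates Java). The schema fields, the type mapping, `generate_class`, and `InMemoryStore` are all hypothetical and exist only for this example.

```python
from dataclasses import asdict, field, make_dataclass

# Avro-style record schema. The fields are illustrative only,
# not Gora's actual WebPage schema.
WEB_PAGE_SCHEMA = {
    "type": "record",
    "name": "WebPage",
    "fields": [
        {"name": "url", "type": "string"},
        {"name": "content", "type": "bytes"},
        {"name": "fetch_time", "type": "long"},
    ],
}

# Assumed mapping from a few Avro primitive types to Python types and defaults.
AVRO_TO_PYTHON = {"string": str, "bytes": bytes, "long": int}
DEFAULTS = {"string": "", "bytes": b"", "long": 0}


def generate_class(schema):
    """Build a dataclass from the record schema (stand-in for code generation)."""
    fields = [
        (f["name"], AVRO_TO_PYTHON[f["type"]], field(default=DEFAULTS[f["type"]]))
        for f in schema["fields"]
    ]
    return make_dataclass(schema["name"], fields)


class InMemoryStore:
    """Toy backend; a real deployment would plug in HBase, Cassandra, etc."""

    def __init__(self):
        self._rows = {}

    def put(self, key, obj):
        self._rows[key] = asdict(obj)

    def get(self, key, cls):
        row = self._rows.get(key)
        return cls(**row) if row is not None else None


WebPage = generate_class(WEB_PAGE_SCHEMA)
store = InMemoryStore()
page = WebPage(url="http://nutch.apache.org/", content=b"<html/>", fetch_time=1)
store.put(page.url, page)
```

Because the store only sees field dictionaries, swapping the in-memory backend for another one would not require changes to the generated class, which is the property the slide attributes to the ORM layer.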
  35. 35. Nutch and Gora • Throw out all code in Nutch that had to do with Writable interface – Generated now by “Web Page” schema in Gora – Web Page is canonical Nutch object for storage • Parse text, parse data, etc. • No more web-db, crawl-db, etc.
  36. 36. Out with the old… • Throw out Nutch webapp – Solr provides REST-ful services to get at metadata/index – We’ll add the REST (pun) for storage/etc. • Throw out Lucene code • Slowly trash existing Nutch parsers
  37. 37. In with the new • Get rid of webapp – Nutch 2.x has seen contributions of REST web services for full crawl cycle, storage I/F • Delegate indexing to Solr – Nutch 1.x first appearance of SolrIndexer and Nutch Solr schema • Delegate parsing to Tika – Nutch 1.1 first appearance of parse-tika – Have been decommissioning existing parsers • Suggested improvements to Tika during this process
  38. 38. Nutch2 Architecture
  39. 39. Learning from our mistakes • Maintenance – Checking in jars made the Nutch checkout huge (even of just the “source”) • Now using Ivy to manage dependencies – Patches sitting? • Not on my watch! Encouragement to find and commit patches that have been sitting for a while, or simply disposition them – People want to use Nutch code as “dep” • Build now includes ability for RM to push to Maven Central NOTE: CHRIS’S OPINION SLIDE
  40. 40. Learning from our mistakes • Community – Folks contributing patches? • Make ’em a committer – Folks providing good testing results? • Make ’em a committer – Folks making good documentation? • Make ’em a committer – It’s the sign of a healthy Apache project if new committers (and members) are being elected NOTE: CHRIS’S OPINION SLIDE
  41. 41. Learning from our mistakes • Configuration of Nutch is hard – It still is – Getting easier though – Anyone have any great ideas or patches to integrate with a DI framework? – Things like Gora, Solr, etc., are making this easier • Providing flexible service interfaces beyond Java APIs – Existing work on NUTCH-932, NUTCH-931 and NUTCH-880 is just the beginning
  42. 42. Interesting work going on • I taught a class on Search Engines this past summer • Some neat projects that I’m working with my students to contribute back to Apache – Implementation of Authority/Hub scoring – Deduplication improvements – Clustering plugin improvements – Work to improve Nutch-Solr-Drupal integration
  43. 43. Wrapup • Nutch has seen tremendous highs and lows over the years – We’re still kicking • The newest version of Nutch (2.0) will have a vastly slimmed down footprint, and will use existing successful frameworks for heavy lifting – Solr, Tika, Gora, Hadoop • If you’re interested in our dev, check us out at http://nutch.apache.org
  44. 44. Alright, I’ll shut up now • Any questions? • THANK YOU! – mattmann@apache.org – @chrismattmann on Twitter
  45. 45. Acknowledgements • Nutch team • Some material inspired from Andrzej Bialecki’s talks here • OODT team at JPL