OpenSearchLab and the Lucene Ecosystem

Keynote slides from http://opensearchlab.otago.ac.nz/FullProceedings.pdf

Published in: Technology

  1. OpenSearchLab and Lucene. Grant Ingersoll: Chief Scientist @LucidWorks; Member, Committer at Apache Software Foundation; Co-Founder, Apache Mahout
  2. Hats: I’m here as an individual who happens to contribute (and commit) to Lucene, Solr, Mahout and other open source projects. I don’t officially represent the ASF or even Lucene/Solr/Mahout.
  3. Topics: • Openness • What are some OpenSearchLab (OSL) needs? • The Lucene Ecosystem • Lucene for Research? • A Sample Architecture
  4. Putting the Open in OpenSearchLab: • Open Development >> Open Source • Open community • Open corpora • Open evaluations • Open Research – w/o being onerous. Image: http://www.facebook.com/photo.php?fbid=10151728075710181&set=a.10151045050120181.780469.68096845180&type=1&theater
  5. OSL Needs? • Community: Openness Model; Contributions (Who? Where? How?); Ownership/Legal (Code, Contributions, Infrastructure); Privacy; … • Code: Architecture (Flexible, Scalable); Experiment Mgmt; Content Acquisition; Analysis; Indexing; Querying; Downstream Tools (Faceting, highlighting, auto-suggest, spellchecking, etc.); Records Mgmt; Testing; … • Infrastructure: Hardware (Cloud or hosted?); Network/Bandwidth; Production/Staging/Dev; $$$$; Release Management; Devops; …
  6. What’s this have to do with Lucene?
  7. “An ecosystem is a community of living organisms in conjunction with the nonliving components of their environment interacting as a system.” – Wikipedia. Diagram labels: Code, Committers, Contributors, ASF, Users
  8. The ASF and ASL: • ASF == Apache Software Foundation – Volunteer-based, but many are paid to work on open source by their employer – Community Over Code (consensus-driven development) – Meritocracy (“Those who do, make the decisions”) – 100+ Top Level Projects – Infrastructure to support projects – “The Apache Way” • ASL == Apache Software License (v2) • ASL ≠ ASF
  9. Lucene Community: • In a nutshell: Large, Active Community • 30+ committers, many, many more contributors • (Tens of?) Thousands of Practitioners • Thousands of production instances – Twitter, Apple, IBM Watson, LinkedIn, Netflix, Commercial Search Engines, … – “… they frequently turn to real-time search: our system serves over two billion queries a day, with an average query latency of 50 ms. Usually, tweets are searchable within 10 seconds after creation.” (EarlyBird, Busch et al.)
  10. The Code Ecosystem: Solr, Tika, Hadoop, Lucene Core, Nutch, Mahout, OpenNLP
  11. Lucene Core: • Flagship Java library for building search applications – Indexing, Searching, Language Analysis • Powers apps large and small the world over • More in Apache Lucene 4 talk later • Fast, small footprint • Lots of useful related modules – Highlighting, Joins, Spatial, etc. • http://lucene.apache.org/core
  12. Solr: • Search server built using Lucene and HTTP • Faceting, highlighting, most Lucene features, easy admin • Highly Extensible • Scalable (query volume and index size) • Lucene Best Practices • http://lucene.apache.org/solr
  13. Hadoop: • Originally built for Nutch to solve large scale crawling problems • Distributed File System and Computation Model – HDFS and MapReduce, YARN coming • Common Use Cases: storage, log analysis, ETL • http://hadoop.apache.org
  14. Nutch: • Web-scale crawler and search built on Lucene/Solr and Hadoop • Link analysis (aka PageRank) • Plugin framework • Parsers for common document formats (PDF, Word, HTML, etc.) • http://nutch.apache.org
  15. Mahout: • Scalable machine learning – Utilize Hadoop where appropriate • Primary Focus: “The 3 C’s” – Clustering, classification, collaborative filtering • Others – Frequent pattern mining, topic extraction, statistically interesting phrases • http://mahout.apache.org
  16. Tika: • Toolkit for detecting and extracting content from MIME types • Support for many common file formats – Office, PDF, HTML, etc. • Intuitive API (think SAX parser) • Wraps best of breed open source extractors • Plug in your own • http://tika.apache.org
  17. OpenNLP: • Supports common NLP tasks – NER, POS tagging, Chunking, Parsing, CoRef resolution • MaxEnt and Perceptron based – Working to make the machine learning pluggable • Some Multilingual support • New life at the ASF • Related: cTAKES, Stanbol
  18. Other Useful Tools: • Apache ZooKeeper – Distributed Coordination • Apache Pig – Hadoop scripting w/o Java • Apache HBase/Accumulo/Cassandra – BigTable/Dynamo • Avro and Protobufs – Serialization frameworks • Netty – Server framework, easy to add protocols and to scale • Stanbol – Semantic Content Management using Solr, OpenNLP, others • UIMA – Unstructured Info Management
  19. LUCENE CAN HAS RESEARCH? • Dispelling a few misconceptions: – No such thing as Lucene OOTB – Lucene ≠ Solr • Researchers are welcome! – Large audience and many domains – http://wiki.apache.org/lucene-java/HowToContribute – Battle-tested code – Speed v. Quality tradeoffs. Image: http://1.bp.blogspot.com/_T2ki5Em5dnI/S8gxtImG7wI/AAAAAAAAAEs/N7aZKZ6g6g4/s1600/cat%2520typing.jpg
  20. Research/Contribution Areas: • Work with the community to do evaluations • Scoring – BM25, LM, IM, DFR, others already implemented – Easy to add your own • Codecs – Extensible compression/storage – Many already implemented approaches and more coming – SimpleText FTW! • Others: – Faceting, auto-suggest, spell-checking, highlighting, expansion and more – Different domains: machine generated data, mobile,
  21. Abstract OSL Architecture (diagram). Components: Clients; Access APIs; Personalization & Machine Learning; Search View (Shard 1, Shard 2, ... Shard n); Users/Admin/Other; Updates/Analysis (Batch/Real Time); Distributed, Scalable Storage (Docs, Users, Logs); Distributed Coordination and Messaging; Content Acquisition (Distributed Content Acquisition, ETL, Batch and Real Time); Data (Internet). Keys: • Service-Oriented Architecture • Stateless • Failover/Fault Tolerant • Glue is lightweight • Smart about updates
  22. Lucene Ecosystem Implementation (diagram). Components: Clients; Access APIs; Personalization & Machine Learning; Search View (Shard 1, Shard 2, ... Shard n); Users/Admin/Other; Updates/Analysis (Batch/Real Time); Distributed, Scalable Storage (Docs, Users, Logs); Distributed Coordination and Messaging; Content Acquisition (Distributed Content Acquisition, ETL, Batch and Real Time); Data (Internet). Keys: • Service-Oriented Architecture • Stateless • Failover/Fault Tolerant • Glue is lightweight • Smart about updates
  23. Takeaways: • Open Development >> Open Source >> Shared Source – Corollary: You never know where good ideas are coming from • ASF is a proven model for collaboration • Lucene ecosystem: extensive, production ready • Lucene 4 is viable for IR algorithms and data structure research • OSL (IMO) needs a services-based, pluggable architecture
  24. Resources: • Getting Started – {Lucene|Mahout|Hadoop} In Action – Taming Text • grant@lucidworks.com • @gsingers • http://www.lucidworks.com
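
Slide 11 names indexing, searching and language analysis as the core jobs of a search library. As a rough illustration of how those three pieces fit together, here is a minimal pure-Python sketch of an inverted index. It is not Lucene's API: `analyze`, `TinyIndex` and the sample documents are hypothetical names invented for this example, and the analysis step is just lowercasing plus splitting on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analysis chain (an assumption, not Lucene's analyzer):
    lowercase, then split on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


class TinyIndex:
    """Hypothetical in-memory inverted index: term -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        # Indexing: run the text through analysis and record each term's postings.
        self.docs[doc_id] = text
        for term in analyze(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        # Searching: boolean AND over the postings of every query term.
        terms = analyze(query)
        if not terms:
            return []
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


index = TinyIndex()
index.add(1, "Lucene is a Java library for search")
index.add(2, "Solr is a search server built using Lucene")
index.add(3, "Hadoop provides a distributed file system")
```

Calling `index.search("lucene search")` intersects the postings for both terms. A real engine adds ranking, compressed on-disk postings and much richer analysis on top of this basic shape.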
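
Slide 20 lists BM25 among the scoring models and says it is easy to add your own. To make concrete what such a scoring function computes, the sketch below implements a common BM25 formulation in plain Python. It is an illustration only, not Lucene's implementation: the function name `bm25_scores`, the parameter values `k1=1.2` and `b=0.75` (conventional defaults), the non-negative idf variant and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Per term: idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl)),
    with idf = ln(1 + (N - df + 0.5) / (df + 0.5)).
    Assumes `docs` is a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent from this document contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "lucene search library".split(),
    "solr search server built on lucene".split(),
    "hadoop distributed file system".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
```

The first two documents contain both query terms once each, so the shorter one scores higher through length normalization. The third contains neither term and scores zero. Swapping in a different model means replacing only the per-term expression inside the loop.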
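
The architecture on slide 21 shows a search view split across shards 1 to n, with statelessness listed among its keys. One way to read that is a scatter-gather pattern: every shard answers the query independently and a stateless front end merges the partial top-k lists. The sketch below shows the idea under loud assumptions: `search_shard`, `distributed_search`, the term-count score and the sample shards are all hypothetical and stand in for whatever the real shards would run.

```python
import heapq


def search_shard(shard, query, k):
    """Hypothetical per-shard search: score is the number of times the
    query term occurs in the document text (a stand-in for real ranking)."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(query.lower())
        if score > 0:
            hits.append((score, doc_id))
    # Each shard returns only its own top k, best first.
    return heapq.nlargest(k, hits)


def distributed_search(shards, query, k=3):
    """Scatter the query to every shard, then gather and merge partial results.
    The function keeps no state between calls, so any front end can run it."""
    partial = [search_shard(shard, query, k) for shard in shards]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [(doc_id, score) for score, doc_id in merged]


shards = [
    {"a1": "lucene lucene search", "a2": "hadoop storage"},
    {"b1": "lucene search server", "b2": "lucene lucene lucene codecs"},
]
results = distributed_search(shards, "lucene", k=2)
```

Because each shard returns at most k hits, the merge step handles at most n × k candidates however large the shards are. In a deployed system the list comprehension over shards would become parallel network calls with timeouts and failover.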
