OpenSearchLab and the Lucene Ecosystem

Keynote slides from http://opensearchlab.otago.ac.nz/FullProceedings.pdf

Published in: Technology

Notes
  • Shared source, visible source, and BDFL are not open source. Open DEVELOPMENT is far more powerful.
    Anyone can be a “researcher”, for example Jack Andraka: his study resulted in over 90 percent accuracy and showed his patent-pending sensor to be 28 times faster, 28 times less expensive and over 100 times more sensitive than current tests. Jack received the Gordon E. Moore Award, of $75,000, named in honor of Intel co-founder and retired chairman and CEO. You never know where the next good idea is coming from.
    Open corpora: anyone anywhere should be able to download and run evaluations. If Common Crawl can do it, why can’t we? iBiblio, ASF, others can likely help.
    How can we build, leverage and share an open evaluation framework? How do we leverage the Internet? Crowdsourcing? Dynamic nature of content, engines, community, users, etc.? Can we time slice experiments on a real system?
    Open Research: how do we encourage open methodology, open process, publications, etc. without being heavy-handed?
  • Community will be the single most important piece. Bottom up and top down needed to establish a community.
  • https://en.wikipedia.org/wiki/Ecosystem
    Most people have this Pyramid backwards.
  • The ASF has a well developed community model that has been proven out over time
  • Committers: many are paid to work on Lucene FT.
    Images: Commits: Ohloh, Traffic: lucene.markmail.org
  • A loose orbit around Lucene Core
  • Second bullet: deferred to 2nd talk
  • OpenSearchLab and the Lucene Ecosystem

    1. OpenSearchLab and Lucene
       Grant Ingersoll
       Chief Scientist @LucidWorks
       Member, Committer at Apache Software Foundation
       Co-Founder, Apache Mahout
    2. Hats
       I’m here as an individual who happens to contribute (and commit) to Lucene, Solr, Mahout and other open source projects. I don’t officially represent the ASF or even Lucene/Solr/Mahout.
    3. Topics
       • Openness
       • What are some OpenSearchLab (OSL) needs?
       • The Lucene Ecosystem
       • Lucene for Research?
       • A Sample Architecture
    4. Putting the Open in OpenSearchLab
       • Open Development >> Open Source
       • Open community
       • Open corpora
       • Open evaluations
       • Open Research
         – w/o being onerous
       http://www.facebook.com/photo.php?fbid=10151728075710181&set=a.10151045050120181.780469.68096845180&type=1&theater
    5. OSL Needs?
       Community
       • Openness Model
       • Contributions:
         – Who?
         – Where?
         – How?
       • Ownership/Legal:
         – Code
         – Contributions
         – Infrastructure
       • Privacy
       • …
       Code
       • Architecture
       • Flexible
       • Scalable
       • Experiment Mgmt
       • Content Acquisition
       • Analysis
       • Indexing
       • Querying
       • Downstream Tools
       • Faceting, highlighting, auto-suggest, spellchecking, etc.
       • Records Mgmt
       • Testing
       • …
       Infrastructure
       • Hardware
       • Cloud or hosted?
       • Network/Bandwidth
       • Production/Staging/Dev
       • $$$$
       • Release Management
       • Devops
       • …
    6. What’s this have to do with Lucene?
    7. “An ecosystem is a community of living organisms in conjunction with the nonliving components of their environment interacting as a system.” – Wikipedia
       (Pyramid diagram labels: Code, Committers, Contributors, ASF, Users)
    8. The ASF and ASL
       • ASF == Apache Software Foundation
         – Volunteer-based, but many are paid to work on open source by their employer
         – Community Over Code
           • Consensus-driven development
         – Meritocracy
           • “Those who do, make the decisions”
         – 100+ Top Level Projects
         – Infrastructure to support projects
         – “The Apache Way”
       • ASL == Apache Software License (v2)
       ASL ≠ ASF
    9. Lucene Community
       • In a nutshell: Large, Active Community
       • 30+ committers, many, many more contributors
       • (Tens of?) Thousands of Practitioners
       • Thousands of production instances
         – Twitter, Apple, IBM Watson, LinkedIn, Netflix, Commercial Search Engines, …
         – “… they frequently turn to real-time search: our system serves over two billion queries a day, with an average query latency of 50 ms. Usually, tweets are searchable within 10 seconds after creation.” -- EarlyBird, Busch et al.
    10. The Code Ecosystem
        (Diagram labels: Lucene Core, Solr, Tika, Hadoop, Nutch, Mahout, OpenNLP)
    11. Lucene Core
        • Flagship Java library for building search applications
          – Indexing, Searching, Language Analysis
        • Powers apps large and small the world over
        • More in Apache Lucene 4 talk later
        • Fast, small footprint
        • Lots of useful related modules
          – Highlighting, Joins, Spatial, etc.
        • http://lucene.apache.org/core
    12. Solr
        • Search server built using Lucene and HTTP
        • Faceting, highlighting, most Lucene features, easy admin
        • Highly Extensible
        • Scalable (query volume and index size)
        • Lucene Best Practices
        • http://lucene.apache.org/solr
    13. Hadoop
        • Originally built for Nutch to solve large scale crawling problems
        • Distributed File System and Computation Model
          – HDFS and MapReduce, YARN coming
        • Common Use Cases: storage, log analysis, ETL
        • http://hadoop.apache.org
    14. Nutch
        • Web-scale crawler and search built on Lucene/Solr and Hadoop
        • Link analysis (aka PageRank)
        • Plugin framework
        • Parsers for common document formats (PDF, Word, HTML, etc.)
        • http://nutch.apache.org
    15. Mahout
        • Scalable machine learning
          – Utilize Hadoop where appropriate
        • Primary Focus: “The 3 C’s”
          – Clustering, classification, collaborative filtering
        • Others
          – Frequent pattern mining, topic extraction, statistically interesting phrases
        • http://mahout.apache.org
    16. Tika
        • Toolkit for detecting and extracting content from MIME types
        • Support for many common file formats
          – Office, PDF, HTML, etc.
        • Intuitive API (think SAX parser)
        • Wraps best of breed open source extractors
        • Plug in your own
        • http://tika.apache.org
    17. OpenNLP
        • Supports common NLP tasks
          – NER, POS tagging, Chunking, Parsing, CoRef resolution
        • MaxEnt and Perceptron based
          – Working to make the machine learning pluggable
        • Some Multilingual support
        • New life at the ASF
        • Related: cTakes, Stanbol
    18. Other Useful Tools
        • Apache Zookeeper
          – Distrib. Coordination
        • Apache Pig
          – Hadoop scripting w/o Java
        • Apache HBase/Accumulo/Cassandra
          – BigTable/Dynamo
        • Avro and Protobufs
          – Serialization frameworks
        • Netty: Server framework
          – easy to add protocols and to scale
        • Stanbol
          – Semantic Content Management using Solr, OpenNLP, others
        • UIMA
          – Unstructured Info Management
    19. LUCENE CAN HAS RESEARCH?
        • Dispelling a few misconceptions:
          – No such thing as Lucene OOTB
          – Lucene ≠ Solr
        • Researchers are welcome!
          – Large audience and many domains
          – http://wiki.apache.org/lucene-java/HowToContribute
          – Battle-tested code
          – Speed v. Quality tradeoffs
        http://1.bp.blogspot.com/_T2ki5Em5dnI/S8gxtImG7wI/AAAAAAAAAEs/N7aZKZ6g6g4/s1600/cat%2520typing.jpg
    20. Research/Contribution Areas
        • Work with the community to do evaluations
        • Scoring
          – BM25, LM, IM, DFR, others already implemented
          – Easy to add your own
        • Codecs
          – Extensible compression/storage
          – Many already implemented approaches and more coming
          – SimpleText FTW!
        • Others:
          – Faceting, auto-suggest, spell-checking, highlighting, expansion and more
          – Different domains: machine generated data, mobile,
    21. Abstract OSL Architecture
        (Architecture diagram labels:)
        • Clients
        • Access APIs
        • Users/Admin/Other
        • Personalization & Machine Learning
        • Search View
        • Shard 1, Shard 2, ... Shard n
        • Updates/Analysis (Batch/Real Time)
        • Distributed, Scalable Storage (Docs, Users, Logs)
        • Distributed Coordination and Messaging
        • Distributed Content Acquisition
        • Content Acquisition ETL, Batch and Real Time
        • Data (Internet)
        Keys
        - Service-Oriented Architecture
        - Stateless
        - Failover/Fault Tolerant
        - Glue is lightweight
        - Smart about updates
    22. Lucene Ecosystem Implementation
        (Same architecture diagram as the previous slide; recoverable labels:)
        • Clients
        • Access APIs
        • Users/Admin/Other
        • Personalization & Machine Learning
        • Search View
        • Shard 1, Shard 2, ... Shard n
        • Updates/Analysis (Batch/Real Time)
        • Distributed, Scalable Storage (Docs, Users, Logs)
        • Distributed Coordination and Messaging
        • Distributed Content Acquisition
        • Content Acquisition ETL, Batch and Real Time
        • Data (Internet)
        Keys
        - Service-Oriented Architecture
        - Stateless
        - Failover/Fault Tolerant
        - Glue is lightweight
        - Smart about updates
    23. Takeaways
        • Open Development >> Open Source >> Shared Source
          – Corollary: You never know where good ideas are coming from
        • ASF is a proven model for collaboration
        • Lucene ecosystem: extensive, production ready
        • Lucene 4 is viable for IR algorithms and data structure research
        • OSL (IMO) needs a services-based, pluggable architecture
    24. Resources
        • Getting Started
          – {Lucene|Mahout|Hadoop} In Action
          – Taming Text
        • grant@lucidworks.com
        • @gsingers
        • http://www.lucidworks.com
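
The Solr slide describes a search server reached over HTTP, with faceting and highlighting. As a rough illustration of what that means for a client, the sketch below only builds a query URL. The host, port, and the collection and field names (`papers`, `year`, `abstract`) are hypothetical. The parameter names (`q`, `rows`, `wt`, `facet`, `facet.field`, `hl`, `hl.fl`) are standard Solr query parameters, but the exact handler path may differ by deployment.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, collection, query, facet_field=None,
                         highlight_field=None, rows=10):
    """Build a URL for a Solr select request (nothing is sent over the network)."""
    params = [("q", query), ("rows", rows), ("wt", "json")]
    if facet_field:
        # Ask Solr to count matching documents per value of this field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    if highlight_field:
        # Ask Solr to return highlighted snippets from this field.
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Hypothetical local instance and schema, for illustration only.
example_url = build_solr_query_url(
    "http://localhost:8983/solr", "papers", "title:lucene",
    facet_field="year", highlight_field="abstract",
)
```

Because the interface is plain HTTP, an evaluation harness in any language could drive experiments this way without linking against Java code.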
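
The Research/Contribution Areas slide lists BM25 among the scoring models and notes that adding your own is easy. For readers who have not seen the formula, here is a minimal standalone sketch of Okapi BM25 in plain Python. It is not Lucene's implementation: the whitespace tokenizer stands in for real language analysis, the `k1` and `b` values are commonly used defaults, and the IDF uses the non-negative `log(1 + ...)` variant.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document in `docs` for `query`."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

With equal term frequencies, the shorter document scores higher, which is the length-normalization behavior controlled by `b`.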
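
The architecture slides show a search view spread over shards 1 to n. One common way to query such a layout is scatter-gather: ask each shard for its local top k, then merge the partial results. The sketch below assumes that pattern (the slides do not spell it out), models each shard as a hypothetical in-memory dict of `doc_id -> text`, scores by raw term count, and uses a sequential loop where a real system would make parallel network calls.

```python
import heapq


def search_shard(shard, term, k):
    """Return this shard's top-k (score, doc_id) pairs for `term`."""
    hits = []
    for doc_id, text in shard.items():
        count = text.lower().split().count(term.lower())
        if count:
            hits.append((count, doc_id))
    return heapq.nlargest(k, hits)


def distributed_search(shards, term, k=3):
    """Scatter the query to every shard, then gather and merge the top k."""
    partial_results = [search_shard(shard, term, k) for shard in shards]
    merged = heapq.nlargest(
        k, (hit for hits in partial_results for hit in hits)
    )
    return [doc_id for _, doc_id in merged]
```

Each shard only needs to return k hits, because no document outside a shard's local top k can appear in the global top k.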
