OpenSearchLab and the Lucene Ecosystem
Keynote slides from http://opensearchlab.otago.ac.nz/FullProceedings.pdf

Published in: Technology
  • Shared source, visible source, and BDFL are not open source. Open DEVELOPMENT is far more powerful.
    Anyone can be a “researcher”. Jack Andraka: "His study resulted in over 90 percent accuracy and showed his patent-pending sensor to be 28 times faster, 28 times less expensive and over 100 times more sensitive than current tests. Jack received the Gordon E. Moore Award, of $75,000, named in honor of Intel co-founder and retired chairman and CEO." You never know where the next good idea is coming from.
    Open corpora: anyone anywhere should be able to download and run evaluations. If Common Crawl can do it, why can’t we? iBiblio, the ASF, and others can likely help.
    How can we build, leverage and share an open evaluation framework? How do we leverage the Internet? Crowdsourcing? The dynamic nature of content, engines, community, users, etc.? Can we time-slice experiments on a real system?
    Open Research: how do we encourage open methodology, open process, publications, etc. without being heavy-handed?
  • Community will be the single most important piece. Bottom-up and top-down efforts are both needed to establish a community.
  • https://en.wikipedia.org/wiki/Ecosystem. Most people have this pyramid backwards.
  • The ASF has a well developed community model that has been proven out over time
  • Committers: many are paid to work on Lucene full time. Images: commits from Ohloh, traffic from lucene.markmail.org.
  • A loose orbit around Lucene Core
  • Second bullet: deferred to 2nd talk
  • Transcript

    • 1. OpenSearchLab and Lucene
        • Grant Ingersoll
        • Chief Scientist @LucidWorks
        • Member, Committer at Apache Software Foundation
        • Co-Founder, Apache Mahout
    • 2. Hats
        • I’m here as an individual who happens to contribute (and commit) to Lucene, Solr, Mahout and other open source projects. I don’t officially represent the ASF or even Lucene/Solr/Mahout.
    • 3. Topics
        • Openness
        • What are some OpenSearchLab (OSL) needs?
        • The Lucene Ecosystem
        • Lucene for Research?
        • A Sample Architecture
    • 4. Putting the Open in OpenSearchLab
        • Open Development >> Open Source
        • Open community
        • Open corpora
        • Open evaluations
        • Open Research
            – w/o being onerous
        • Image: http://www.facebook.com/photo.php?fbid=10151728075710181&set=a.10151045050120181.780469.68096845180&type=1&theater
    • 5. OSL Needs?
        • Community
            – Openness Model
            – Contributions: Who? Where? How?
            – Ownership/Legal: Code, Contributions, Infrastructure
            – Privacy
            – …
        • Code
            – Architecture: Flexible, Scalable
            – Experiment Mgmt
            – Content Acquisition
            – Analysis
            – Indexing
            – Querying
            – Downstream Tools: faceting, highlighting, auto-suggest, spellchecking, etc.
            – Records Mgmt
            – Testing
            – …
        • Infrastructure
            – Hardware: Cloud or hosted?
            – Network/Bandwidth
            – Production/Staging/Dev
            – $$$$
            – Release Management
            – Devops
            – …
    • 6. What’s this have to do with Lucene?
    • 7. “An ecosystem is a community of living organisms in conjunction with the nonliving components of their environment interacting as a system.” – Wikipedia
        • Diagram labels: Code, Committers, Contributors, ASF, Users
    • 8. The ASF and ASL
        • ASF == Apache Software Foundation
            – Volunteer-based, but many are paid to work on open source by their employer
            – Community Over Code: consensus-driven development
            – Meritocracy: “Those who do, make the decisions”
            – 100+ Top Level Projects
            – Infrastructure to support projects
            – “The Apache Way”
        • ASL == Apache Software License (v2)
        • ASL ≠ ASF
    • 9. Lucene Community
        • In a nutshell: Large, Active Community
        • 30+ committers, many, many more contributors
        • (Tens of?) Thousands of Practitioners
        • Thousands of production instances
            – Twitter, Apple, IBM Watson, LinkedIn, Netflix, Commercial Search Engines, …
            – “… they frequently turn to real-time search: our system serves over two billion queries a day, with an average query latency of 50 ms. Usually, tweets are searchable within 10 seconds after creation.” -- EarlyBird, Busch et al.
    • 10. The Code Ecosystem
        • Diagram: Solr, Tika, Hadoop, Nutch, Mahout and OpenNLP arranged around Lucene Core
    • 11. Apache Lucene Core
        • Flagship Java library for building search applications
            – Indexing, Searching, Language Analysis
        • Powers apps large and small the world over
        • More in Apache Lucene 4 talk later
        • Fast, small footprint
        • Lots of useful related modules
            – Highlighting, Joins, Spatial, etc.
        • http://lucene.apache.org/core
    • 12. Apache Solr
        • Search server built using Lucene and HTTP
        • Faceting, highlighting, most Lucene features, easy admin
        • Highly Extensible
        • Scalable (query volume and index size)
        • Lucene Best Practices
        • http://lucene.apache.org/solr
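Because Solr is reached over HTTP, a faceted, highlighted search is just a URL with request parameters. The sketch below only builds such a URL and parses it back; it does not contact a server. The parameter names `q`, `rows`, `wt`, `facet`, `facet.field` and `hl` are standard Solr request parameters, while the host, port, collection name `collection1`, field name `author` and the helper `build_solr_query_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_solr_query_url(base_url, collection, text,
                         facet_field=None, highlight=False, rows=10):
    """Build a URL for Solr's /select handler (hypothetical helper)."""
    params = [("q", text), ("rows", str(rows)), ("wt", "json")]
    if facet_field:
        # Ask Solr to count matching documents per value of this field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    if highlight:
        # Ask Solr to return highlighted snippets for matches.
        params.append(("hl", "true"))
    return f"{base_url}/{collection}/select?{urlencode(params)}"


# Assumed local install and field names; adjust to your own setup.
url = build_solr_query_url("http://localhost:8983/solr", "collection1",
                           "open search lab",
                           facet_field="author", highlight=True)
parsed = parse_qs(urlsplit(url).query)
print(url)
```

Fetching the URL with any HTTP client would return the results, facet counts and highlights in one response, which is what makes Solr easy to script against from experiment code.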
    • 13. Apache Hadoop
        • Originally built for Nutch to solve large scale crawling problems
        • Distributed File System and Computation Model
            – HDFS and MapReduce, YARN coming
        • Common Use Cases: storage, log analysis, ETL
        • http://hadoop.apache.org
    • 14. Apache Nutch
        • Web-scale crawler and search built on Lucene/Solr and Hadoop
        • Link analysis (aka PageRank)
        • Plugin framework
        • Parsers for common document formats (PDF, Word, HTML, etc.)
        • http://nutch.apache.org
    • 15. Apache Mahout
        • Scalable machine learning
            – Utilize Hadoop where appropriate
        • Primary Focus: “The 3 C’s”
            – Clustering, classification, collaborative filtering
        • Others
            – Frequent pattern mining, topic extraction, statistically interesting phrases
        • http://mahout.apache.org
    • 16. Apache Tika
        • Toolkit for detecting and extracting content from MIME types
        • Support for many common file formats
            – Office, PDF, HTML, etc.
        • Intuitive API (think SAX parser)
        • Wraps best of breed open source extractors
        • Plug in your own
        • http://tika.apache.org
    • 17. Apache OpenNLP
        • Supports common NLP tasks
            – NER, POS tagging, Chunking, Parsing, CoRef resolution
        • MaxEnt and Perceptron based
            – Working to make the machine learning pluggable
        • Some Multilingual support
        • New life at the ASF
        • Related: cTakes, Stanbol
    • 18. Other Useful Tools
        • Apache Zookeeper – Distrib. Coordination
        • Apache Pig – Hadoop scripting w/o Java
        • Apache HBase/Accumulo/Cassandra – BigTable/Dynamo
        • Avro and Protobufs – Serialization frameworks
        • Netty: Server framework – easy to add protocols and to scale
        • Stanbol – Semantic Content Management using Solr, OpenNLP, others
        • UIMA – Unstructured Info Management
    • 19. LUCENE CAN HAS RESEARCH?
        • Dispelling a few misconceptions:
            – No such thing as Lucene OOTB
            – Lucene ≠ Solr
        • Researchers are welcome!
            – Large audience and many domains
            – http://wiki.apache.org/lucene-java/HowToContribute
            – Battle-tested code
            – Speed v. Quality tradeoffs
        • Image: http://1.bp.blogspot.com/_T2ki5Em5dnI/S8gxtImG7wI/AAAAAAAAAEs/N7aZKZ6g6g4/s1600/cat%2520typing.jpg
    • 20. Research/Contribution Areas
        • Work with the community to do evaluations
        • Scoring
            – BM25, LM, IM, DFR, others already implemented
            – Easy to add your own
        • Codecs
            – Extensible compression/storage
            – Many already implemented approaches and more coming
            – SimpleText FTW!
        • Others:
            – Faceting, auto-suggest, spell-checking, highlighting, expansion and more
            – Different domains: machine generated data, mobile,
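To make the scoring bullet on slide 20 concrete, here is a minimal sketch of the textbook Okapi BM25 formula over pre-tokenized documents. It is not Lucene's implementation and does not use Lucene's API; the function name `bm25_scores`, the toy documents, the defaults `k1=1.2` and `b=0.75`, and the non-negative idf variant `ln(1 + (N - n + 0.5) / (n + 0.5))` are all choices made for this illustration. It assumes a non-empty document list.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus (assumed for illustration only).
docs = [
    "lucene is a search library".split(),
    "solr is a search server built on lucene".split(),
    "hadoop is a distributed file system".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
print(scores)
```

The first two documents each contain both query terms once, so the shorter one scores higher; the third contains neither and scores zero. Swapping in a different formula for the inner loop is the kind of experiment the slide has in mind when it says it is easy to add your own.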
    • 21. Abstract OSL Architecture
        • Diagram components:
            – Clients
            – Access APIs
            – Users/Admin/Other
            – Personalization & Machine Learning
            – Search: Shard 1, Shard 2, ... Shard n
            – View
            – Updates/Analysis (Batch/Real Time)
            – Distributed, Scalable Storage (Docs, Users, Logs)
            – Distributed Coordination and Messaging
            – Keys
            – Content Acquisition: Distributed Content Acquisition; Content Acquisition ETL (Batch and Real Time)
            – Data (Internet)
        • Properties:
            – Service-Oriented Architecture
            – Stateless
            – Failover/Fault Tolerant
            – Glue is lightweight
            – Smart about updates
    • 22. Lucene Ecosystem Implementation
        • Diagram components:
            – Clients
            – Access APIs
            – Users/Admin/Other
            – Personalization & Machine Learning
            – Search: Shard 1, Shard 2, ... Shard n
            – View
            – Updates/Analysis (Batch/Real Time)
            – Distributed, Scalable Storage (Docs, Users, Logs)
            – Distributed Coordination and Messaging
            – Keys
            – Content Acquisition: Distributed Content Acquisition; Content Acquisition ETL (Batch and Real Time)
            – Data (Internet)
        • Properties:
            – Service-Oriented Architecture
            – Stateless
            – Failover/Fault Tolerant
            – Glue is lightweight
            – Smart about updates
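The "Shard 1, Shard 2, ... Shard n" search tier in the two architecture slides implies a scatter-gather step: each shard returns its local top hits and a coordinator merges them. The sketch below shows that merge with in-memory dictionaries standing in for shards. The shard contents, the term-count scoring and the function names `search_shard` and `search_all` are assumptions for illustration; a real deployment would query shards in parallel over the network and use scores that are comparable across shards.

```python
import heapq


def search_shard(shard, term, k):
    """Return this shard's local top-k hits as (score, doc_id) pairs."""
    hits = []
    for doc_id, text in shard.items():
        # Toy score: how many times the term occurs in the document.
        score = text.lower().split().count(term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search_all(shards, term, k=3):
    """Scatter the query to every shard, then gather and merge the results."""
    partial = []
    for shard in shards:
        # Each shard only needs to send back k hits, not its full result set.
        partial.extend(search_shard(shard, term, k))
    return heapq.nlargest(k, partial)


# Three tiny stand-in shards (assumed data).
shards = [
    {"a1": "lucene core search library", "a2": "hadoop storage"},
    {"b1": "search server search ui", "b2": "crawler"},
    {"c1": "search search search logs"},
]
top = search_all(shards, "search", k=2)
print(top)
```

Asking every shard for its own top k is enough to guarantee the global top k after merging, which keeps the glue between services lightweight.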
    • 23. Takeaways
        • Open Development >> Open Source >> Shared Source
            – Corollary: You never know where good ideas are coming from
        • ASF is a proven model for collaboration
        • Lucene ecosystem: extensive, production ready
        • Lucene 4 is viable for IR algorithms and data structure research
        • OSL (IMO) needs a services-based, pluggable architecture
    • 24. Resources
        • Getting Started
            – {Lucene|Mahout|Hadoop} In Action
            – Taming Text
        • grant@lucidworks.com
        • @gsingers
        • http://www.lucidworks.com
