OpenSearchLab and Lucene

Grant Ingersoll
Chief Scientist @LucidWorks
Member, Committer at Apache Software Foundation
Co-Founder, Apache Mahout
Hats




I’m here as an individual who happens to contribute (and commit) to Lucene, Solr, Mahout and other open source projects.
I don’t officially represent the ASF or even Lucene/Solr/Mahout.
Topics
• Openness

• What are some OpenSearchLab (OSL) needs?

• The Lucene Ecosystem

• Lucene for Research?

• A Sample Architecture
Putting the Open in OpenSearchLab
• Open Development >> Open Source
• Open community

• Open corpora

• Open evaluations

• Open Research
  •   w/o being onerous
  http://www.facebook.com/photo.php?fbid=10151728075710181&set=a.10151045050120181.780469.68096845180&type=1&theater
OSL Needs?
Community
• Openness Model
• Contributions:
  • Who?
  • Where?
  • How?
• Ownership/Legal:
  • Code
  • Contributions
  • Infrastructure
• Privacy
• …

Code
• Architecture
  • Flexible
  • Scalable
• Experiment Mgmt
• Content Acquisition
• Analysis
• Indexing
• Querying
• Downstream Tools
  • Faceting, highlighting, auto-suggest, spellchecking, etc.
• Records Mgmt
• Testing
• …

Infrastructure
• Hardware
  • Cloud or hosted?
  • Network/Bandwidth
  • Production/Staging/Dev
• $$$$
• Release Management
• Devops
• …
What’s this have to do with Lucene?
“An ecosystem is a community of living organisms in conjunction
with the nonliving components of their environment interacting
as a system.”
   – Wikipedia
• Code
• Committers
• Contributors
• ASF
• Users
The ASF and ASL
• ASF == Apache Software Foundation
   – Volunteer-based, but many are paid to work on open source by their
     employer

   – Community Over Code
       • Consensus-driven development
   – Meritocracy
       • “Those who do, make the decisions”
   – 100+ Top Level Projects
   – Infrastructure to support projects
   – “The Apache Way”

• ASL == Apache Software License (v2), formally the Apache License, Version 2.0
                              ASL ≠ ASF
Lucene Community
•   In a nutshell: Large, Active Community
•   30+ committers, many, many more contributors
•   (Tens of?) Thousands of Practitioners
•   Thousands of production instances
    – Twitter, Apple, IBM Watson, LinkedIn, Netflix,
      Commercial Search Engines, …
    – “… they frequently turn to real-time search: our
      system serves over two billion queries a day, with an
      average query latency of 50 ms. Usually, tweets are
      searchable within 10 seconds after creation.” --
      Earlybird, Busch et al.
The Code Ecosystem
• Lucene Core
• Solr
• Tika
• Hadoop
• Nutch
• Mahout
• OpenNLP
Apache Lucene Core
• Flagship Java library for building search applications
    – Indexing, Searching, Language Analysis

•   Powers apps large and small the world over
•   More in Apache Lucene 4 talk later
•   Fast, small footprint
•   Lots of useful related modules
    – Highlighting, Joins, Spatial, etc.

• http://lucene.apache.org/core
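The three pieces named on this slide (language analysis, indexing, searching) can be illustrated with a toy inverted index. This is only a sketch in plain Python and does not use or mirror Lucene's API; the names `analyze`, `build_index`, and `search` are hypothetical, and the "analysis" step is assumed to be nothing more than lowercasing and splitting on non-alphanumeric characters.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy language analysis: lowercase, then keep runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing: map each term to the sorted list of doc ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Searching: return doc ids that contain every term of the query."""
    terms = analyze(query)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)
```

A real engine adds term positions, scoring, and on-disk segment files on top of this basic term-to-documents mapping.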
Apache Solr
• Search server built using Lucene and HTTP
• Faceting, highlighting, most Lucene features,
  easy admin
• Highly Extensible
• Scalable (query volume and index size)

• Lucene Best Practices
• http://lucene.apache.org/solr
Apache Hadoop
• Originally built for Nutch to solve large-scale crawling problems

• Distributed File System and Computation Model
   – HDFS and MapReduce, YARN coming
• Common Use Cases: storage, log analysis, ETL

• http://hadoop.apache.org
Apache Nutch
• Web-scale crawler and search built on Lucene/Solr and Hadoop
• Link analysis (aka PageRank)
• Plugin framework
• Parsers for common document formats (PDF,
  Word, HTML, etc.)

• http://nutch.apache.org
Apache Mahout
• Scalable machine learning
  – Utilize Hadoop where appropriate
• Primary Focus: “The 3 C’s”
  – Clustering, classification, collaborative filtering
• Others
  – Frequent pattern mining, topic extraction,
    statistically interesting phrases

• http://mahout.apache.org
Apache Tika
• Toolkit for detecting MIME types and extracting content from documents
• Support for many common file formats
   – Office, PDF, HTML, etc.
• Intuitive API (think SAX parser)
• Wraps best of breed open source extractors
• Plug in your own

• http://tika.apache.org
Apache OpenNLP
• Supports common NLP tasks
  – NER, POS tagging, chunking, parsing, coreference resolution
• MaxEnt and Perceptron based
  – Working to make the machine learning pluggable
• Some Multilingual support
• New life at the ASF
• Related: cTAKES, Stanbol
Other Useful Tools
• Apache ZooKeeper – distributed coordination
• Apache Pig – Hadoop scripting without Java
• Apache HBase/Accumulo/Cassandra – BigTable/Dynamo-style stores
• Avro and Protobufs – serialization frameworks
• Netty – server framework; easy to add protocols and to scale
• Stanbol – semantic content management using Solr, OpenNLP, others
• UIMA – Unstructured Information Management
LUCENE CAN HAS RESEARCH?
• Dispelling a few misconceptions:
  – No such thing as Lucene OOTB (out of the box)
  – Lucene ≠ Solr
• Researchers are welcome!
  – Large audience and many domains
  – http://wiki.apache.org/lucene-java/HowToContribute
  – Battle-tested code
  – Speed v. Quality tradeoffs
  http://1.bp.blogspot.com/_T2ki5Em5dnI/S8gxtImG7wI/AAAAAAAAAEs/N7aZKZ6g6g4/s1600/cat%2520typing.jpg
Research/Contribution Areas
• Work with the community to do evaluations
• Scoring
   – BM25, LM, IM, DFR and others already implemented
   – Easy to add your own
• Codecs
   – Extensible compression/storage
   – Many already implemented approaches and more coming
   – SimpleText FTW!
• Others:
   – Faceting, auto-suggest, spell-checking, highlighting,
     expansion and more
   – Different domains: machine generated data, mobile,
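To give a feel for what a scoring model such as BM25 computes, here is a minimal sketch in plain Python. It is not Lucene's implementation or API. The function name `bm25_scores`, the pre-tokenized input format, the parameter values `k1=1.2` and `b=0.75`, and the particular IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    docs is a list of token lists; returns one score per document.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a larger weight (one common smoothed IDF form).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Swapping in a different model amounts to changing the per-term formula inside the loop, which is the kind of experiment a pluggable scoring layer is meant to make cheap.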
Abstract OSL Architecture
• Clients
• Access APIs
• Search View: Shard 1, Shard 2, ..., Shard n
• Personalization & Machine Learning
• Users/Admin/Other
• Updates/Analysis (Batch/Real Time)
• Distributed, Scalable Storage (Docs, Users, Logs)
• Distributed Coordination and Messaging
• Content Acquisition (Distributed Content Acquisition, ETL; Batch and Real Time)
• Data (Internet)
• Keys
  – Service-Oriented Architecture
  – Stateless
  – Failover/Fault Tolerant
  – Glue is lightweight
  – Smart about updates
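The "Shard 1 ... Shard n" search view implies a scatter-gather step: a query goes to every shard and the per-shard hits are merged into one ranked list. The following is a minimal sketch of the merge only, assuming each shard returns `(score, doc_id)` pairs with scores that are comparable across shards. The name `merge_top_k` is hypothetical and this is not how any particular search server implements distributed search.

```python
import heapq


def merge_top_k(shard_results, k):
    """Merge per-shard hit lists into a single global top-k list.

    shard_results: one list of (score, doc_id) tuples per shard.
    Returns at most k hits, highest score first.
    """
    all_hits = (hit for hits in shard_results for hit in hits)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[0])
```

Because the merge holds no state between requests, any node behind the access APIs could perform it, which fits the stateless, service-oriented keys listed on the slide.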
Lucene Ecosystem Implementation
• Clients
• Access APIs
• Search View: Shard 1, Shard 2, ..., Shard n
• Personalization & Machine Learning
• Users/Admin/Other
• Updates/Analysis (Batch/Real Time)
• Distributed, Scalable Storage (Docs, Users, Logs)
• Distributed Coordination and Messaging
• Content Acquisition (Distributed Content Acquisition, ETL; Batch and Real Time)
• Data (Internet)
• Keys
  – Service-Oriented Architecture
  – Stateless
  – Failover/Fault Tolerant
  – Glue is lightweight
  – Smart about updates
Takeaways
• Open Development >> Open Source >> Shared
  Source
  – Corollary: You never know where good ideas are
    coming from
• ASF is a proven model for collaboration
• Lucene ecosystem: extensive, production ready
• Lucene 4 is viable for IR algorithms and data
  structure research
• OSL (IMO) needs a services-based, pluggable
  architecture
Resources
• Getting Started
  – {Lucene|Mahout|Hadoop} In Action
  – Taming Text


• grant@lucidworks.com
• @gsingers
• http://www.lucidworks.com

More Related Content

What's hot

Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big featuresDavid Smiley
 
Hacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and ProfitHacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and Profitlucenerevolution
 
Nashville analytics summit aug9 no sql mike king dell v1.5
Nashville analytics summit aug9 no sql mike king dell v1.5Nashville analytics summit aug9 no sql mike king dell v1.5
Nashville analytics summit aug9 no sql mike king dell v1.5Mike King
 
NoSql - mayank singh
NoSql - mayank singhNoSql - mayank singh
NoSql - mayank singhMayank Singh
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkJake Mannix
 
Webinar: Search and Recommenders
Webinar: Search and RecommendersWebinar: Search and Recommenders
Webinar: Search and RecommendersLucidworks
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchRafał Kuć
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to ElasticsearchClifford James
 
Cloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger InsightsCloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger InsightsCloudera, Inc.
 
Developing a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with SparkDeveloping a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with SparkEdureka!
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache SolrAndy Jackson
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineDaniel N
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch BasicsShifa Khan
 

What's hot (20)

How Solr Search Works
How Solr Search WorksHow Solr Search Works
How Solr Search Works
 
Solr: 4 big features
Solr: 4 big featuresSolr: 4 big features
Solr: 4 big features
 
Hacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and ProfitHacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and Profit
 
Nashville analytics summit aug9 no sql mike king dell v1.5
Nashville analytics summit aug9 no sql mike king dell v1.5Nashville analytics summit aug9 no sql mike king dell v1.5
Nashville analytics summit aug9 no sql mike king dell v1.5
 
NoSql - mayank singh
NoSql - mayank singhNoSql - mayank singh
NoSql - mayank singh
 
Digital library software
Digital library softwareDigital library software
Digital library software
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Practical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and SparkPractical Machine Learning for Smarter Search with Solr and Spark
Practical Machine Learning for Smarter Search with Solr and Spark
 
Webinar: Search and Recommenders
Webinar: Search and RecommendersWebinar: Search and Recommenders
Webinar: Search and Recommenders
 
Battle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearchBattle of the giants: Apache Solr vs ElasticSearch
Battle of the giants: Apache Solr vs ElasticSearch
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Intro to Elasticsearch
Intro to ElasticsearchIntro to Elasticsearch
Intro to Elasticsearch
 
Threat hunting using notebook technologies
Threat hunting using notebook technologiesThreat hunting using notebook technologies
Threat hunting using notebook technologies
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
Cloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger InsightsCloudera Search Webinar: Big Data Search, Bigger Insights
Cloudera Search Webinar: Big Data Search, Bigger Insights
 
Developing a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with SparkDeveloping a Movie recommendation Engine with Spark
Developing a Movie recommendation Engine with Spark
 
Introduction to Apache Solr
Introduction to Apache SolrIntroduction to Apache Solr
Introduction to Apache Solr
 
elasticsearch
elasticsearchelasticsearch
elasticsearch
 
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search EngineElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
ElasticSearch: Distributed Multitenant NoSQL Datastore and Search Engine
 
Elasticsearch Basics
Elasticsearch BasicsElasticsearch Basics
Elasticsearch Basics
 

Viewers also liked

Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopGrant Ingersoll
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and MahoutGrant Ingersoll
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solrsagar chaturvedi
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Grant Ingersoll
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xGrant Ingersoll
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineGrant Ingersoll
 

Viewers also liked (9)

Intro to Search
Intro to SearchIntro to Search
Intro to Search
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Crowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and HadoopCrowd Sourced Reflected Intelligence for Solr and Hadoop
Crowd Sourced Reflected Intelligence for Solr and Hadoop
 
Leveraging Solr and Mahout
Leveraging Solr and MahoutLeveraging Solr and Mahout
Leveraging Solr and Mahout
 
Enterprise Search Using Apache Solr
Enterprise Search Using Apache SolrEnterprise Search Using Apache Solr
Enterprise Search Using Apache Solr
 
Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4Data IO: Next Generation Search with Lucene and Solr 4
Data IO: Next Generation Search with Lucene and Solr 4
 
What's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.xWhat's new in Lucene and Solr 4.x
What's new in Lucene and Solr 4.x
 
This Ain't Your Parent's Search Engine
This Ain't Your Parent's Search EngineThis Ain't Your Parent's Search Engine
This Ain't Your Parent's Search Engine
 
Solr for Data Science
Solr for Data ScienceSolr for Data Science
Solr for Data Science
 

Similar to OpenSearchLab and the Lucene Ecosystem

Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Cloudera, Inc.
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache StanbolAlkuvoima
 
Oct meetup open stack 101 clean
Oct meetup open stack 101   cleanOct meetup open stack 101   clean
Oct meetup open stack 101 cleanbenrodrigue
 
OpenStack 101 - All Things Open 2015
OpenStack 101 - All Things Open 2015OpenStack 101 - All Things Open 2015
OpenStack 101 - All Things Open 2015Mark Voelker
 
Sustaining ArchivesSpace
Sustaining ArchivesSpaceSustaining ArchivesSpace
Sustaining ArchivesSpaceDLFCLIR
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recallICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recallDr. Haxel Consult
 
Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011GlusterFS
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
Alex Wade, Digital Library Interoperability
Alex Wade, Digital Library InteroperabilityAlex Wade, Digital Library Interoperability
Alex Wade, Digital Library Interoperabilityparker01
 
Oss and libraries enabling arabic libraries and creating opportunities
Oss and libraries   enabling arabic libraries and creating opportunitiesOss and libraries   enabling arabic libraries and creating opportunities
Oss and libraries enabling arabic libraries and creating opportunitiesMassoud AlShareef
 
Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...David Wallom
 
Productionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons LearnedProductionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons LearnedCloudera, Inc.
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrDataWorks Summit
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic WebNuxeo
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologiesneeraj rathore
 

Similar to OpenSearchLab and the Lucene Ecosystem (20)

Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
Hadoop in the Enterprise - Dr. Amr Awadallah @ Microstrategy World 2011
 
Drupal and Apache Stanbol
Drupal and Apache StanbolDrupal and Apache Stanbol
Drupal and Apache Stanbol
 
Oct meetup open stack 101 clean
Oct meetup open stack 101   cleanOct meetup open stack 101   clean
Oct meetup open stack 101 clean
 
OpenStack 101
OpenStack 101OpenStack 101
OpenStack 101
 
OpenStack 101 - All Things Open 2015
OpenStack 101 - All Things Open 2015OpenStack 101 - All Things Open 2015
OpenStack 101 - All Things Open 2015
 
Sustaining ArchivesSpace
Sustaining ArchivesSpaceSustaining ArchivesSpace
Sustaining ArchivesSpace
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recallICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
ICIC 2013 Conference Proceedings Andreas Pesenhofer max.recall
 
Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011Intro to GlusterFS Webinar - August 2011
Intro to GlusterFS Webinar - August 2011
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Alex Wade, Digital Library Interoperability
Alex Wade, Digital Library InteroperabilityAlex Wade, Digital Library Interoperability
Alex Wade, Digital Library Interoperability
 
Amazon Deep Learning
Amazon Deep LearningAmazon Deep Learning
Amazon Deep Learning
 
963
963963
963
 
Oss and libraries enabling arabic libraries and creating opportunities
Oss and libraries   enabling arabic libraries and creating opportunitiesOss and libraries   enabling arabic libraries and creating opportunities
Oss and libraries enabling arabic libraries and creating opportunities
 
Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...Supporting Research through "Desktop as a Service" models of e-infrastructure...
Supporting Research through "Desktop as a Service" models of e-infrastructure...
 
Productionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons LearnedProductionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons Learned
 
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, SolrLarge-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
Large-Scale Search Discovery Analytics with Hadoop, Mahout, Solr
 
Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
 
Big Data Open Source Technologies
Big Data Open Source TechnologiesBig Data Open Source Technologies
Big Data Open Source Technologies
 
Elasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetupElasticsearch Introduction at BigData meetup
Elasticsearch Introduction at BigData meetup
 

More from Grant Ingersoll

Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopGrant Ingersoll
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionGrant Ingersoll
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrGrant Ingersoll
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Grant Ingersoll
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsGrant Ingersoll
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and SolrGrant Ingersoll
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantGrant Ingersoll
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsGrant Ingersoll
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopGrant Ingersoll
 

More from Grant Ingersoll (11)

Scalable Machine Learning with Hadoop
Scalable Machine Learning with HadoopScalable Machine Learning with Hadoop
Scalable Machine Learning with Hadoop
 
Large Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in ActionLarge Scale Search, Discovery and Analytics in Action
Large Scale Search, Discovery and Analytics in Action
 
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and SolrLarge Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
Large Scale Search, Discovery and Analytics with Hadoop, Mahout and Solr
 
Bet you didn't know Lucene can...
Bet you didn't know Lucene can...Bet you didn't know Lucene can...
Bet you didn't know Lucene can...
 
Starfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data AnalyticsStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 
Apache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow ElephantApache Mahout: Driving the Yellow Elephant
Apache Mahout: Driving the Yellow Elephant
 
Intelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and FriendsIntelligent Apps with Apache Lucene, Mahout and Friends
Intelligent Apps with Apache Lucene, Mahout and Friends
 
TriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr HadoopTriHUG: Lucene Solr Hadoop
TriHUG: Lucene Solr Hadoop
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 

Recently uploaded

From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
OpenSearchLab and the Lucene Ecosystem

  • 1. OpenSearchLab and Lucene Grant Ingersoll Chief Scientist @LucidWorks Member, Committer at Apache Soft. Found. Co-Founder, Apache Mahout
  • 2. Hats I’m here as an individual who happens to contribute (and commit) to Lucene, Solr, Mahout and other open source projects. I don’t officially represent the ASF or even Lucene/Solr/Mahout.
  • 3. Topics • Openness • What are some OpenSearchLab (OSL) needs? • The Lucene Ecosystem • Lucene for Research? • A Sample Architecture
  • 4. Putting the Open in OpenSearchLab • Open Development >> Open Source • Open community • Open corpora • Open evaluations • Open Research • w/o being onerous http://www.facebook.com/photo.php?fbid=10151728075710181&set=a.10151045050120181.780469.68096845180&type=1&theater
  • 5. OSL Needs? Community: • Openness Model • Contributions: Who? Where? How? • Ownership/Legal: Code, Contributions, Infrastructure • Privacy • … Code: • Architecture: Flexible, Scalable • Experiment Mgmt • Content Acquisition • Analysis • Indexing • Querying • Downstream Tools: Faceting, highlighting, auto-suggest, spellchecking, etc. • Records Mgmt • Testing • … Infrastructure: • Hardware: Cloud or hosted?, Network/Bandwidth, Production/Staging/Dev • $$$$ • Release Management • Devops • …
  • 6. What’s this have to do with Lucene?
  • 7. “An ecosystem is a community of living organisms in conjunction with the nonliving components of their environment interacting as a system.” – Wikipedia [Diagram: pyramid with the layers Code, Committers, Contributors, ASF, Users]
  • 8. The ASF and ASL • ASF == Apache Software Foundation – Volunteer-based, but many are paid to work on open source by their employer – Community Over Code • Consensus-driven development – Meritocracy • “Those who do, make the decisions” – 100+ Top Level Projects – Infrastructure to support projects – “The Apache Way” • ASL == Apache Software License (v2) ASL ≠ ASF
  • 9. Lucene Community • In a nutshell: Large, Active Community • 30+ committers, many, many more contributors • (Tens of?) Thousands of Practitioners • Thousands of production instances – Twitter, Apple, IBM Watson, LinkedIn, Netflix, Commercial Search Engines, … – “… they frequently turn to real-time search: our system serves over two billion queries a day, with an average query latency of 50 ms. Usually, tweets are searchable within 10 seconds after creation.” -- EarlyBird, Busch et al.
  • 10. The Code Ecosystem [Diagram: Lucene Core surrounded by Solr, Tika, Hadoop, Nutch, Mahout, OpenNLP]
  • 11. Lucene Core • Flagship Java library for building search applications – Indexing, Searching, Language Analysis • Powers apps large and small the world over • More in Apache Lucene 4 talk later • Fast, small footprint • Lots of useful related modules – Highlighting, Joins, Spatial, etc. • http://lucene.apache.org/core
  • 12. Solr • Search server built using Lucene and HTTP • Faceting, highlighting, most Lucene features, easy admin • Highly Extensible • Scalable (query volume and index size) • Lucene Best Practices • http://lucene.apache.org/solr
  • 13. Hadoop • Originally built for Nutch to solve large scale crawling problems • Distributed File System and Computation Model – HDFS and MapReduce, YARN coming • Common Use Cases: storage, log analysis, ETL • http://hadoop.apache.org
  • 14. Nutch • Web-scale crawler and search built on Lucene/Solr and Hadoop • Link analysis (aka PageRank) • Plugin framework • Parsers for common document formats (PDF, Word, HTML, etc.) • http://nutch.apache.org
  • 15. Mahout • Scalable machine learning – Utilize Hadoop where appropriate • Primary Focus: “The 3 C’s” – Clustering, classification, collaborative filtering • Others – Frequent pattern mining, topic extraction, statistically interesting phrases • http://mahout.apache.org
  • 16. Tika • Toolkit for detecting and extracting content from MIME types • Support for many common file formats – Office, PDF, HTML, etc. • Intuitive API (think SAX parser) • Wraps best of breed open source extractors • Plug in your own • http://tika.apache.org
  • 17. OpenNLP • Supports common NLP tasks – NER, POS tagging, Chunking, Parsing, CoRef resolution • MaxEnt and Perceptron based – Working to make the machine learning pluggable • Some Multilingual support • New life at the ASF • Related: cTakes, Stanbol
  • 18. Other Useful Tools • Apache Zookeeper – Distrib. Coordination • Apache Pig – Hadoop scripting w/o Java • Apache HBase/Accumulo/Cassandra – BigTable/Dynamo • Avro and Protobufs – Serialization frameworks • Netty: Server framework – easy to add protocols and to scale • Stanbol – Semantic Content Management using Solr, OpenNLP, others • UIMA – Unstructured Info Management
  • 19. LUCENE CAN HAS RESEARCH? • Dispelling a few misconceptions: – No such thing as Lucene OOTB – Lucene ≠ Solr • Researchers are welcome! – Large audience and many domains – http://wiki.apache.org/lucene-java/HowToContribute – Battle-tested code – Speed v. Quality tradeoffs http://1.bp.blogspot.com/_T2ki5Em5dnI/S8gxtImG7wI/AAAAAAAAAEs/N7aZKZ6g6g4/s1600/cat%2520typing.jpg
  • 20. Research/Contribution Areas • Work with the community to do evaluations • Scoring – BM25, LM, IM, DFR, others already implemented – Easy to add your own • Codecs – Extensible compression/storage – Many already implemented approaches and more coming – SimpleText FTW! • Others: – Faceting, auto-suggest, spell-checking, highlighting, expansion and more – Different domains: machine generated data, mobile,
  • 21. Abstract OSL Architecture [Diagram components: Clients; Users/Admin/Other; Access APIs; Personalization & Machine Learning; Search View; Shard 1, Shard 2, ... Shard n; Updates/Analysis (Batch/Real Time); Distributed, Scalable Storage (Docs, Users, Logs); Distributed Coordination and Messaging; Content Acquisition (Distributed Content Acquisition, ETL, Batch and Real Time); Data (Internet)] Keys: - Service-Oriented Architecture - Stateless - Failover/Fault Tolerant - Glue is lightweight - Smart about updates
  • 22. Lucene Ecosystem Implementation [Diagram components: Clients; Users/Admin/Other; Access APIs; Personalization & Machine Learning; Search View; Shard 1, Shard 2, ... Shard n; Updates/Analysis (Batch/Real Time); Distributed, Scalable Storage (Docs, Users, Logs); Distributed Coordination and Messaging; Content Acquisition (Distributed Content Acquisition, ETL, Batch and Real Time); Data (Internet)] Keys: - Service-Oriented Architecture - Stateless - Failover/Fault Tolerant - Glue is lightweight - Smart about updates
  • 23. Takeaways • Open Development >> Open Source >> Shared Source – Corollary: You never know where good ideas are coming from • ASF is a proven model for collaboration • Lucene ecosystem: extensive, production ready • Lucene 4 is viable for IR algorithms and data structure research • OSL (IMO) needs a services-based, pluggable architecture
  • 24. Resources • Getting Started – {Lucene|Mahout|Hadoop} In Action – Taming Text • grant@lucidworks.com • @gsingers • http://www.lucidworks.com
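
Slide 20 lists scoring as a research area where BM25 and other models are already implemented and it is "easy to add your own". As a rough illustration of what such a scoring function computes, here is a minimal, library-free sketch of textbook Okapi BM25 in Python. It does not use or reproduce Lucene's API; the function name `bm25_scores`, the pre-tokenized input format, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Illustrative sketch only: `docs` is a list of token lists, and the
    idf uses the log(1 + ...) form so that it never goes negative.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # a term absent from the document adds nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are damped.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["lucene", "search", "library"],
    ["solr", "search", "server", "search"],
    ["hadoop", "mapreduce"],
]
scores = bm25_scores(["search"], docs)
```

Swapping in a different model then amounts to replacing the body of the inner loop, which is the kind of experiment the slide has in mind.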

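The architecture on slides 21 and 22 shows a search view alongside shards 1 to n. One common way to serve a query over such a layout is scatter-gather: send the query to every shard, then merge the per-shard hits into a global top-k. The sketch below illustrates that idea under stated assumptions; `make_shard`, `search_shards`, and the term-count scoring are hypothetical stand-ins invented for this example, not part of Lucene, Solr, or the talk.

```python
import heapq


def make_shard(docs):
    """Build a toy shard from a dict of doc_id -> text (hypothetical helper)."""
    def search(query, k):
        # Toy relevance score: how often the query term occurs in the text.
        hits = [(text.split().count(query), doc_id) for doc_id, text in docs.items()]
        hits = [(score, doc_id) for score, doc_id in hits if score > 0]
        return heapq.nlargest(k, hits)
    return search


def search_shards(shards, query, k=10):
    """Query every shard and merge the partial results into a global top-k."""
    partial = []
    for shard_id, shard in enumerate(shards):
        # Each shard returns at most k (score, doc_id) pairs of its own.
        for score, doc_id in shard(query, k):
            partial.append((score, shard_id, doc_id))
    return heapq.nlargest(k, partial)


shards = [
    make_shard({"a": "lucene search search", "b": "hadoop"}),
    make_shard({"c": "search", "d": "search search search"}),
]
top = search_shards(shards, "search", k=2)
```

The merge is only meaningful if scores are comparable across shards; with a model such as BM25 that depends on collection statistics, that usually means sharing global statistics or accepting small per-shard differences.
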
Editor's Notes

  1. Shared source, visible source, and BDFL are not open source. Open DEVELOPMENT is far more powerful. Anyone can be a “researcher”, e.g. Jack Andraka: his study resulted in over 90 percent accuracy and showed his patent-pending sensor to be 28 times faster, 28 times less expensive and over 100 times more sensitive than current tests. Jack received the Gordon E. Moore Award of $75,000, named in honor of the Intel co-founder and retired chairman and CEO. You never know where the next good idea is coming from. Open corpora: anyone anywhere should be able to download and run evaluations. If Common Crawl can do it, why can’t we? iBiblio, ASF, others can likely help. How can we build, leverage and share an open evaluation framework? How do we leverage the Internet? Crowdsourcing? Dynamic nature of content, engines, community, users, etc.? Can we time slice experiments on a real system? Open Research: how do we encourage open methodology, open process, publications, etc. without being heavy-handed?
  2. Community will be the single most important piece. Bottom up and top down needed to establish a community.
  3. https://en.wikipedia.org/wiki/Ecosystem Most people have this pyramid backwards.
  4. The ASF has a well developed community model that has been proven out over time
  5. Committers: many are paid to work on Lucene FT. Images: Commits: Ohloh, Traffic: lucene.markmail.org
  6. A loose orbit around Lucene Core
  7. Second bullet: deferred to 2nd talk