SlideShare a Scribd company logo
Heavy Committing:
Flexible Indexing in Lucene 4.0
                   Uwe Schindler
 SD DataSolutions GmbH, uschindler@sd-datasolutions.de
My Background
• I am committer and PMC member of Apache Lucene and
  Solr. My main focus is on development of Lucene Java.
• Implemented fast numerical search and maintaining the new
  attribute-based text analysis API. Well known as Generics
  and Sophisticated Backwards Compatibility Policeman.
• Working as consultant and software architect for SD
  DataSolutions GmbH in Bremen, Germany. The main task is
  maintaining PANGAEA (Publishing Network for Geoscientific
  & Environmental Data) where I implemented the portal's
  geo-spatial retrieval functions with Lucene Java.
• Talks about Lucene at various international conferences like
  ApacheCon EU/US, Lucene Revolution, Lucene Eurocon,
  Berlin Buzzwords and various local meetups.
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           4
Lucene up to version 3
• Lucene started > 10 years ago
   • Lucene‟s vInt format is old and not as friendly as new
     compression algorithms to CPU‟s optimizers (exists since
     Lucene 1.0)


• It‟s hard to add additional statistics for scoring to the
  index
   • IR researchers don‟t use Lucene to try out new algorithms


• Small changes to index format are often huge patches
  covering tons of files
   • Example from days of Lucene 2.4: final omit-TFaP patch of
     LUCENE-1340 is ~70 KiB covering ~25 files

                                                                 5
Why Flexible Indexing?
• Many new compression approaches ready to
  be implemented for Lucene
• Separate index encoding from terms/postings
  enumeration API
• Make innovations to the postings format easier

 Targets to make Lucene extensible even on
              the lowest level



                                                   6
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           7
Quick Overview
• Will be Apache Lucene ≥ 4.0 only!

• Allows to
  • store new information into the index
  • change the way existing information is stored
• Under heavy development
  • almost stable API, may break suddenly
  • lots of classes still in refactoring
• Replaces a lot of existing classes and
  interfaces → Lucene 4.0 will not be backwards
  compatible (API wise)
                                                    8
Quick Overview

•   Pre-4.0 indexes are still readable, but
    < 3.0 is no longer supported

•   Index upgrade tool will be available as of
    Lucene 3.2
    •   Supports upgrade of pre-3.0 indexes in-place →
        two-step migration to 4.0 (LUCENE-3082)




                                                         9
Architecture




               10
New 4-dimensional
 Enumeration-API




                    11
Flex - Enum API Properties                                  (1)



•   Replaces the TermEnum / TermDocs /
    TermPositions

•   Unified iterator-like behavior: no longer strange
    do…while vs. while

•   Decoupling of term from field
    •   IndexReader returns enum of fields, each field has separate
        TermsEnum
    •   No interning of field names needed anymore (will be removed soon)


•   Improved RAM efficiency:
    •   efficient re-usage of byte buffer with the BytesRef class


                                                                            12
Flex - Enum API Properties                         (2)


• Terms are arbitrary byte slices, no 16bit UTF-16 chars
  anymore
• Term now using byte[] instead of char[]
   • compact representation of numeric terms
• Can be any encoding
   • Default is UTF-8 → changes sort order for supplementary
     characters
   • Compact representation with low memory footprint for western
     languages
   • Support for e.g. BOCU1 (LUCENE-1799) possible, but some
     queries rely on UTF-8 (especially MultiTermQuery)
• Indexer consumes BytesRefAttribute, which
  converts the good old char[] to UTF-8

                                                                    13
Flex - Enum API Properties                   (3)



•   All Flex Enums make use of AttributeSource
    •   Custom Attribute deserialization (planned)
    •   E.g., BoostAttribute for FuzzyQuery


•   TermsEnum supports seeking (enables fast
    automaton queries like FuzzyQuery)

•   DocsEnum / DocsAndPositionsEnum now
    support separate skipDocs property: Custom
    control of excluded documents (default
    implementation by IndexReader returns deleted
    docs)
                                                           14
FieldCache
• FieldCache consumes the flex APIs
• Terms (previously strings) field cache more
  RAM efficient, low GC load
  • Used with SortField.STRING
• Shared byte[] blocks instead of separate String
  instances
  • Terms remain as byte[] in few big byte[] slices
• PackedInts for ords & addresses (LUCENE-
  1990)
• RAM reduction ~ 40-60%

                                                      15
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           16
Extending Flex with Codecs

•   A Codec represents the primary Flex-API
    extension point
•   Directly passed to SegmentReader to
    decode the index format
•   Provides implementations of the enum
    classes to SegmentReader
•   Provides writer for index files to
    IndexWriter

                                              17
Architecture




               18
Components of a Codec
• Codec provides read/write for one segment
  • Unique name (String)
  • FieldsConsumer (for writing)
  • FieldsProducer is 4D enum API + close
• CodecProvider creates Codec instance
  • Passed to IndexWriter/IndexReader
• You can override merging
• Reusable building blocks
  • Terms dict + index, postings


                                              19
Build-In Codecs: PreFlex

•   Pre-Flex-Indexes will be “read-only” in
    Lucene 4.0
•   all additions to indexes will be done with
    CodecProvider„s default codec


Warning: PreFlexCodec needs to
reorder terms in term index on-the-fly:
„surrogates dance“
→ use index upgrader, LUCENE-3082


                                                 20
Default: StandardCodec             (1)




Default codec was moved out of
o.a.l.index:

  • lives now in its own package
    o.a.l.index.codecs.standard
  • is the default implementation
  • similar to the pre-flex index format
  • requires far less ram to load term index


                                               21
Default: StandardCodec                          (2)



• Terms index:
  VariableGapTermsIndexWriter/Reader
  • uses finite state transducers from new FST package
  • removes useless suffixes                Talk yesterday morning:
                                      “Finite State Automata in Lucene”
  • reusable by other codecs                    by Dawid Weiss

• Terms dict: BlockTermsWriter/Reader
  • more efficient metadata decode
  • faster TermsEnum
  • reusable by other codecs
• Postings: StandardPostingsWriter/Reader
  • Similar to pre-flex postings
                                                                      22
Build-In Codecs: PulsingCodec                (1)



• Inlines low doc-freq terms into terms dict
• Saves extra seek to get the postings
• Excellent match for primary key fields, but also
  “normal” field (Zipf‟s law)
• Wraps any other codec
• Likely default codec will use Pulsing

           See Mike McCandless’ blog:
            http://s.apache.org/XEw


                                                     23
Build-In Codecs: PulsingCodec   (2)




                                      24
Build-In Codecs: SimpleText
• All postings stored in _X.pst text file
• Read / write
• Not performant
   • Do not use in production!
• Fully functional
   • Passes all Lucene/Solr unit tests
     (slowly...)
• Useful / fun for debugging

      See Mike McCandless’ blog:
        http://s.apache.org/eh
                                            25
Other Codecs shipped with Lucene /
         Lucene Contrib

• PerFieldCodecWrapper (core)
  • Can be used to delegate specific fields to different
    codecs (e.g., use pulsing for primary key)


• AppendingCodec (contrib/misc)
  • Never rewinds a file pointer during write
  • Useful for Hadoop




                                                           26
Heavy Development: Block Codecs
• Separate branch for developing block codecs:
  • http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings
• All codecs use default terms index, but different
  encodings of postings
• Support for recent compression developments
  • FOR, PFORDelta, Simple64, BulkVInt
• Framework for blocks of encoded ints (using
  SepCodec)
  • Easy to plugin a new compression by providing a
    encoder/decoder between a block int[] ↔
    IndexOutput/IndexInput

                                                                       27
Block Codecs: Use Case
• Smaller index & better compression for certain
  index contents by replacing vInt
  • Frequent terms (stop words) have low doc deltas
  • Short fields (city names) have small position values
• TermQuery
  • High doc frequency terms
• MultiTermQuery
  • Lots of terms (possibly with low docFreq) in order
    request doc postings



                                                           28
Block Codecs: Disadvantages
• Crazy API for consumers, e.g. queries
• Removes encapsulation of DocsEnum
  • Consumer needs to do loading of blocks
  • Consumer needs to do doc delta
    decoding
• Lot„s of queries may need duplicate
  code
  • Impl for codecs that support „bulk
    postings“
  • Classical DocsEnum impl


                                             29
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           30
Flexible Indexing - Current State
• All tests pass with a pool of random codecs
  • If you find a bug, post the test output to mailing list!
  • Test output contains random seed to reproduce


            Talk yesterday evening:
“Using Lucene's Test Framework” by Robert Muir

• many more tests and documentation is needed
• community feedback is highly appreciated


                                                               31
Flexible Indexing - TODO
• Serialize custom attributes to the index
• Convert all remaining queries to use internally
  BytesRef-terms
• Remove interning of field names
• Die, Term (& Token) class, die! ☺
• Make block-/bulk-postings API more user-
  friendly and merge to SVN trunk
• Merge DocValues branch
      (Talk yesterday: “DocValues & Column Stride Fields in Lucene” by Simon Willnauer)

• More heavy refactoring!

                                                                                          32
Agenda

• Motivation

• API changes in general

• Codecs

• Future

• Wrap up

                           33
Summary
• New 4D postings enum APIs

• Pluggable codec lets you customize
  index format
  • Many codecs already available
• State-of-the-art postings formats are in
  progress

• Innovation is easier!
  • Exciting time for Lucene...
• Sizable performance gains, RAM/GC
                                             34
We need YOU!

 • Download nightly builds or check
   out SVN trunk
 • Run tests, run tests, run tests,…
 • Use in your own developments

 • Report back experience with
   developing own codecs



                                       35
Contact
                       Uwe Schindler
                 uschindler@sd-datasolutions.de
                       http://www.thetaphi.de
                               @thetaph1

Many thanks to Mike McCandless for slides & help on preparing this talk!




      SD DataSolutions GmbH
            Wätjenstr. 49
      28213 Bremen, Germany
         +49 421 40889785-0
   http://www.sd-datasolutions.de

                                                                           36

More Related Content

What's hot

Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
Lucky Sharma
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muir
lucenerevolution
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
Adrien Grand
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
farhan "Frank"​ mashraqi
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneSwapnil & Patil
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
dnaber
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
otisg
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
lucenerevolution
 
Apache lucene
Apache luceneApache lucene
Apache lucene
Dr. Abhiram Gandhe
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Provectus
 
ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learned
BeyondTrees
 
Introduction to apache lucene
Introduction to apache luceneIntroduction to apache lucene
Introduction to apache lucene
Shrikrishna Parab
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
OpenSource Connections
 
Lucene
LuceneLucene
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
Grant Ingersoll
 

What's hot (20)

Lucece Indexing
Lucece IndexingLucece Indexing
Lucece Indexing
 
Lucene indexing
Lucene indexingLucene indexing
Lucene indexing
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Improved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert MuirImproved Search with Lucene 4.0 - Robert Muir
Improved Search with Lucene 4.0 - Robert Muir
 
Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015Apache Lucene intro - Breizhcamp 2015
Apache Lucene intro - Breizhcamp 2015
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Intelligent crawling and indexing using lucene
Intelligent crawling and indexing using luceneIntelligent crawling and indexing using lucene
Intelligent crawling and indexing using lucene
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Apache Solr/Lucene Internals by Anatoliy Sokolenko
Apache Solr/Lucene Internals  by Anatoliy SokolenkoApache Solr/Lucene Internals  by Anatoliy Sokolenko
Apache Solr/Lucene Internals by Anatoliy Sokolenko
 
ElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learnedElasticSearch in Production: lessons learned
ElasticSearch in Production: lessons learned
 
Introduction to apache lucene
Introduction to apache luceneIntroduction to apache lucene
Introduction to apache lucene
 
Hacking Lucene for Custom Search Results
Hacking Lucene for Custom Search ResultsHacking Lucene for Custom Search Results
Hacking Lucene for Custom Search Results
 
Lucene
LuceneLucene
Lucene
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Intro to Apache Lucene and Solr
Intro to Apache Lucene and SolrIntro to Apache Lucene and Solr
Intro to Apache Lucene and Solr
 

Viewers also liked

Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
Josiane Gamgo
 
HTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコルHTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコル
彰 村地
 
C:\Fakepath\6620millardmodule3b
C:\Fakepath\6620millardmodule3bC:\Fakepath\6620millardmodule3b
C:\Fakepath\6620millardmodule3b
Donna Millard
 
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Marty Kaszubowski
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
Lucidworks (Archived)
 
Updated: Marketing your Technology
Updated: Marketing your TechnologyUpdated: Marketing your Technology
Updated: Marketing your Technology
Marty Kaszubowski
 
Spanish bombss
Spanish bombssSpanish bombss
Spanish bombsstanica
 
The mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the marketThe mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the marketPaul Williamson
 
Solr lucene search revolution
Solr lucene search revolutionSolr lucene search revolution
Solr lucene search revolution
Lucidworks (Archived)
 
Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4
Lucidworks (Archived)
 
Presentation
PresentationPresentation
Presentation
tarodnova
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Lucidworks (Archived)
 
Coterie 9 11
Coterie 9 11Coterie 9 11
Coterie 9 11
LaRue
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Lucidworks (Archived)
 
Mujer, pajaro y estrella
Mujer, pajaro y estrellaMujer, pajaro y estrella
Mujer, pajaro y estrellaguest986e5ae
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucidworks (Archived)
 
Speed Up Web 2012
Speed Up Web 2012Speed Up Web 2012
Speed Up Web 2012
彰 村地
 
All Data Big and Small
All Data Big and SmallAll Data Big and Small
All Data Big and Small
Lucidworks (Archived)
 

Viewers also liked (20)

Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
HTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコルHTML5 と次世代のネットワーク プロトコル
HTML5 と次世代のネットワーク プロトコル
 
C:\Fakepath\6620millardmodule3b
C:\Fakepath\6620millardmodule3bC:\Fakepath\6620millardmodule3b
C:\Fakepath\6620millardmodule3b
 
Web Design Course Overview
Web Design Course OverviewWeb Design Course Overview
Web Design Course Overview
 
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
Center for Enterprise Innovation (CEI) Summary for HREDA, 9-25-14
 
Indexing Text and HTML Files with Solr
Indexing Text and HTML Files with SolrIndexing Text and HTML Files with Solr
Indexing Text and HTML Files with Solr
 
Updated: Marketing your Technology
Updated: Marketing your TechnologyUpdated: Marketing your Technology
Updated: Marketing your Technology
 
Spanish bombss
Spanish bombssSpanish bombss
Spanish bombss
 
The mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the marketThe mobile as a health hub, and how bluetooth low energy enables the market
The mobile as a health hub, and how bluetooth low energy enables the market
 
Solr lucene search revolution
Solr lucene search revolutionSolr lucene search revolution
Solr lucene search revolution
 
Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4Overview of Searching in Solr 1.4
Overview of Searching in Solr 1.4
 
Presentation
PresentationPresentation
Presentation
 
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache SolrMinneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
Minneapolis Solr Meetup - May 28, 2014: eCommerce Search with Apache Solr
 
Coterie 9 11
Coterie 9 11Coterie 9 11
Coterie 9 11
 
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
Downtown SF Lucene/Solr Meetup - September 17: Thoth: Real-time Solr Monitori...
 
Mujer, pajaro y estrella
Mujer, pajaro y estrellaMujer, pajaro y estrella
Mujer, pajaro y estrella
 
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
Lucene/Solr Revolution 2013: Paul Doscher Opening Remarks
 
Speed Up Web 2012
Speed Up Web 2012Speed Up Web 2012
Speed Up Web 2012
 
Ob12 01st
Ob12 01stOb12 01st
Ob12 01st
 
All Data Big and Small
All Data Big and SmallAll Data Big and Small
All Data Big and Small
 

Similar to Flexible Indexing in Lucene 4.0

Fun with flexible indexing
Fun with flexible indexingFun with flexible indexing
Fun with flexible indexing
Lucidworks (Archived)
 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCD
Prashant Rane
 
Kafka overview v0.1
Kafka overview v0.1Kafka overview v0.1
Kafka overview v0.1
Mahendran Ponnusamy
 
Containers and HPC
Containers and HPCContainers and HPC
Containers and HPC
Olli-Pekka Lehto
 
What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?
Stephen Foskett
 
assem.ppt
assem.pptassem.ppt
assem.ppt
SonaliAjankar
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Lucidworks
 
Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...
All Things Open
 
Managing ejabberd Platforms with Docker - ejabberd Workshop #1
Managing ejabberd Platforms with Docker - ejabberd Workshop #1Managing ejabberd Platforms with Docker - ejabberd Workshop #1
Managing ejabberd Platforms with Docker - ejabberd Workshop #1
Mickaël Rémond
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
OpenEBS
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project Update
Linaro
 
Serverless design with Fn project
Serverless design with Fn projectServerless design with Fn project
Serverless design with Fn project
Siva Rama Krishna Chunduru
 
.NET Core Blimey! (dotnetsheff Jan 2016)
.NET Core Blimey! (dotnetsheff Jan 2016).NET Core Blimey! (dotnetsheff Jan 2016)
.NET Core Blimey! (dotnetsheff Jan 2016)
citizenmatt
 
ASP.NET Core Demos
ASP.NET Core DemosASP.NET Core Demos
ASP.NET Core Demos
Erik Noren
 
What's New in ASP.NET Core 2.0
What's New in ASP.NET Core 2.0What's New in ASP.NET Core 2.0
What's New in ASP.NET Core 2.0
Jon Galloway
 
Getting started with Docker
Getting started with DockerGetting started with Docker
Getting started with Docker
Ravindu Fernando
 
Solr 4
Solr 4Solr 4
Solr 4
Erik Hatcher
 
TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.
Real-Time Innovations (RTI)
 
Docker and the K computer
Docker and the K computerDocker and the K computer
Docker and the K computer
Peter Bryzgalov
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Christopher Curtin
 

Similar to Flexible Indexing in Lucene 4.0 (20)

Fun with flexible indexing
Fun with flexible indexingFun with flexible indexing
Fun with flexible indexing
 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCD
 
Kafka overview v0.1
Kafka overview v0.1Kafka overview v0.1
Kafka overview v0.1
 
Containers and HPC
Containers and HPCContainers and HPC
Containers and HPC
 
What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?
 
assem.ppt
assem.pptassem.ppt
assem.ppt
 
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, EvernoteSearch Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
Search Architecture at Evernote: Presented by Christian Kohlschütter, Evernote
 
Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...Containers in depth – Understanding how containers work to better work with c...
Containers in depth – Understanding how containers work to better work with c...
 
Managing ejabberd Platforms with Docker - ejabberd Workshop #1
Managing ejabberd Platforms with Docker - ejabberd Workshop #1Managing ejabberd Platforms with Docker - ejabberd Workshop #1
Managing ejabberd Platforms with Docker - ejabberd Workshop #1
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
LCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project UpdateLCA14: LCA14-209: ODP Project Update
LCA14: LCA14-209: ODP Project Update
 
Serverless design with Fn project
Serverless design with Fn projectServerless design with Fn project
Serverless design with Fn project
 
.NET Core Blimey! (dotnetsheff Jan 2016)
.NET Core Blimey! (dotnetsheff Jan 2016).NET Core Blimey! (dotnetsheff Jan 2016)
.NET Core Blimey! (dotnetsheff Jan 2016)
 
ASP.NET Core Demos
ASP.NET Core DemosASP.NET Core Demos
ASP.NET Core Demos
 
What's New in ASP.NET Core 2.0
What's New in ASP.NET Core 2.0What's New in ASP.NET Core 2.0
What's New in ASP.NET Core 2.0
 
Getting started with Docker
Getting started with DockerGetting started with Docker
Getting started with Docker
 
Solr 4
Solr 4Solr 4
Solr 4
 
TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.TechTalk: Connext DDS 5.2.
TechTalk: Connext DDS 5.2.
 
Docker and the K computer
Docker and the K computerDocker and the K computer
Docker and the K computer
 
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
Kafka 0.8.0 Presentation to Atlanta Java User's Group March 2013
 

More from Lucidworks (Archived)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
Lucidworks (Archived)
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
Lucidworks (Archived)
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
Lucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
Lucidworks (Archived)
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
Lucidworks (Archived)
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineLucidworks (Archived)
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchLucidworks (Archived)
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchLucidworks (Archived)
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Lucidworks (Archived)
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...Lucidworks (Archived)
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Lucidworks (Archived)
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCLucidworks (Archived)
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCLucidworks (Archived)
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCLucidworks (Archived)
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCLucidworks (Archived)
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCLucidworks (Archived)
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
Lucidworks (Archived)
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
Lucidworks (Archived)
 

More from Lucidworks (Archived) (20)

Integrating Hadoop & Solr
Integrating Hadoop & SolrIntegrating Hadoop & Solr
Integrating Hadoop & Solr
 
The Data-Driven Paradigm
The Data-Driven ParadigmThe Data-Driven Paradigm
The Data-Driven Paradigm
 
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
SFBay Area Solr Meetup - July 15th: Integrating Hadoop and Solr
 
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for BusinessSFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
SFBay Area Solr Meetup - June 18th: Box + Solr = Content Search for Business
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search EngineChicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
Chicago Solr Meetup - June 10th: This Ain't Your Parents' Search Engine
 
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with SearchChicago Solr Meetup - June 10th: Exploring Hadoop with Search
Chicago Solr Meetup - June 10th: Exploring Hadoop with Search
 
What's new in solr june 2014
What's new in solr june 2014What's new in solr june 2014
What's new in solr june 2014
 
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com SearchMinneapolis Solr Meetup - May 28, 2014: Target.com Search
Minneapolis Solr Meetup - May 28, 2014: Target.com Search
 
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
Exploration of multidimensional biomedical data in pub chem, Presented by Lia...
 
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...Unstructured   Or: How I Learned to Stop Worrying and Love the xml, Presented...
Unstructured Or: How I Learned to Stop Worrying and Love the xml, Presented...
 
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
Building a Lightweight Discovery Interface for Chinese Patents, Presented by ...
 
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DCBig Data Challenges, Presented by Wes Caldwell at SolrExchage DC
Big Data Challenges, Presented by Wes Caldwell at SolrExchage DC
 
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DCWhat's New  in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
What's New in Lucene/Solr Presented by Grant Ingersoll at SolrExchage DC
 
Solr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DCSolr At AOL, Presented by Sean Timm at SolrExchage DC
Solr At AOL, Presented by Sean Timm at SolrExchage DC
 
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DCIntro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
Intro to Solr Cloud, Presented by Tim Potter at SolrExchage DC
 
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DCTest Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
Test Driven Relevancy, Presented by Doug Turnbull at SolrExchage DC
 
Building a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLKBuilding a data driven search application with LucidWorks SiLK
Building a data driven search application with LucidWorks SiLK
 
Introducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinarIntroducing LucidWorks App for Splunk Enterprise webinar
Introducing LucidWorks App for Splunk Enterprise webinar
 
Solr4 nosql search_server_2013
Solr4 nosql search_server_2013Solr4 nosql search_server_2013
Solr4 nosql search_server_2013
 

Recently uploaded

GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
RinaMondal9
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
Kumud Singh
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 

Recently uploaded (20)

GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Free Complete Python - A step towards Data Science
Free Complete Python - A step towards Data ScienceFree Complete Python - A step towards Data Science
Free Complete Python - A step towards Data Science
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 

Flexible Indexing in Lucene 4.0

  • 1. Heavy Committing: Flexible Indexing in Lucene 4.0 Uwe Schindler SD DataSolutions GmbH, uschindler@sd-datasolutions.de
  • 2. My Background • I am committer and PMC member of Apache Lucene and Solr. My main focus is on development of Lucene Java. • Implemented fast numerical search and maintaining the new attribute-based text analysis API. Well known as Generics and Sophisticated Backwards Compatibility Policeman. • Working as consultant and software architect for SD DataSolutions GmbH in Bremen, Germany. The main task is maintaining PANGAEA (Publishing Network for Geoscientific & Environmental Data) where I implemented the portal's geo-spatial retrieval functions with Lucene Java. • Talks about Lucene at various international conferences like ApacheCon EU/US, Lucene Revolution, Lucene Eurocon, Berlin Buzzwords and various local meetups.
  • 3. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 4
  • 4. Lucene up to version 3 • Lucene started > 10 years ago • Lucene‟s vInt format is old and not as friendly as new compression algorithms to CPU‟s optimizers (exists since Lucene 1.0) • It‟s hard to add additional statistics for scoring to the index • IR researchers don‟t use Lucene to try out new algorithms • Small changes to index format are often huge patches covering tons of files • Example from days of Lucene 2.4: final omit-TFaP patch of LUCENE-1340 is ~70 KiB covering ~25 files 5
  • 5. Why Flexible Indexing? • Many new compression approaches ready to be implemented for Lucene • Separate index encoding from terms/postings enumeration API • Make innovations to the postings format easier Targets to make Lucene extensible even on the lowest level 6
  • 6. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 7
  • 7. Quick Overview • Will be Apache Lucene ≥ 4.0 only! • Allows to • store new information into the index • change the way existing information is stored • Under heavy development • almost stable API, may break suddenly • lots of classes still in refactoring • Replaces a lot of existing classes and interfaces → Lucene 4.0 will not be backwards compatible (API wise) 8
  • 8. Quick Overview • Pre-4.0 indexes are still readable, but < 3.0 is no longer supported • Index upgrade tool will be available as of Lucene 3.2 • Supports upgrade of pre-3.0 indexes in-place → two-step migration to 4.0 (LUCENE-3082) 9
  • 11. Flex - Enum API Properties (1) • Replaces the TermEnum / TermDocs / TermPositions • Unified iterator-like behavior: no longer strange do…while vs. while • Decoupling of term from field • IndexReader returns enum of fields, each field has separate TermsEnum • No interning of field names needed anymore (will be removed soon) • Improved RAM efficiency: • efficient re-usage of byte buffer with the BytesRef class 12
  • 12. Flex - Enum API Properties (2) • Terms are arbitrary byte slices, no 16bit UTF-16 chars anymore • Term now using byte[] instead of char[] • compact representation of numeric terms • Can be any encoding • Default is UTF-8 → changes sort order for supplementary characters • Compact representation with low memory footprint for western languages • Support for e.g. BOCU1 (LUCENE-1799) possible, but some queries rely on UTF-8 (especially MultiTermQuery) • Indexer consumes BytesRefAttribute, which converts the good old char[] to UTF-8 13
  • 13. Flex - Enum API Properties (3) • All Flex Enums make use of AttributeSource • Custom Attribute deserialization (planned) • E.g., BoostAttribute for FuzzyQuery • TermsEnum supports seeking (enables fast automaton queries like FuzzyQuery) • DocsEnum / DocsAndPositionsEnum now support separate skipDocs property: Custom control of excluded documents (default implementation by IndexReader returns deleted docs) 14
  • 14. FieldCache • FieldCache consumes the flex APIs • Terms (previously strings) field cache more RAM efficient, low GC load • Used with SortField.STRING • Shared byte[] blocks instead of separate String instances • Terms remain as byte[] in few big byte[] slices • PackedInts for ords & addresses (LUCENE- 1990) • RAM reduction ~ 40-60% 15
  • 15. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 16
  • 16. Extending Flex with Codecs • A Codec represents the primary Flex-API extension point • Directly passed to SegmentReader to decode the index format • Provides implementations of the enum classes to SegmentReader • Provides writer for index files to IndexWriter 17
  • 18. Components of a Codec • Codec provides read/write for one segment • Unique name (String) • FieldsConsumer (for writing) • FieldsProducer is 4D enum API + close • CodecProvider creates Codec instance • Passed to IndexWriter/IndexReader • You can override merging • Reusable building blocks • Terms dict + index, postings 19
  • 19. Build-In Codecs: PreFlex • Pre-Flex-Indexes will be “read-only” in Lucene 4.0 • all additions to indexes will be done with CodecProvider„s default codec Warning: PreFlexCodec needs to reorder terms in term index on-the-fly: „surrogates dance“ → use index upgrader, LUCENE-3082 20
  • 20. Default: StandardCodec (1) Default codec was moved out of o.a.l.index: • lives now in its own package o.a.l.index.codecs.standard • is the default implementation • similar to the pre-flex index format • requires far less ram to load term index 21
  • 21. Default: StandardCodec (2) • Terms index: VariableGapTermsIndexWriter/Reader • uses finite state transducers from new FST package • removes useless suffixes Talk yesterday morning: “Finite State Automata in Lucene” • reusable by other codecs by Dawid Weiss • Terms dict: BlockTermsWriter/Reader • more efficient metadata decode • faster TermsEnum • reusable by other codecs • Postings: StandardPostingsWriter/Reader • Similar to pre-flex postings 22
  • 22. Build-In Codecs: PulsingCodec (1) • Inlines low doc-freq terms into terms dict • Saves extra seek to get the postings • Excellent match for primary key fields, but also “normal” field (Zipf‟s law) • Wraps any other codec • Likely default codec will use Pulsing See Mike McCandless’ blog: http://s.apache.org/XEw 23
  • 24. Build-In Codecs: SimpleText • All postings stored in _X.pst text file • Read / write • Not performant • Do not use in production! • Fully functional • Passes all Lucene/Solr unit tests (slowly...) • Useful / fun for debugging See Mike McCandless’ blog: http://s.apache.org/eh 25
  • 25. Other Codecs shipped with Lucene / Lucene Contrib • PerFieldCodecWrapper (core) • Can be used to delegate specific fields to different codecs (e.g., use pulsing for primary key) • AppendingCodec (contrib/misc) • Never rewinds a file pointer during write • Useful for Hadoop 26
  • 26. Heavy Development: Block Codecs • Separate branch for developing block codecs: • http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings • All codecs use default terms index, but different encodings of postings • Support for recent compression developments • FOR, PFORDelta, Simple64, BulkVInt • Framework for blocks of encoded ints (using SepCodec) • Easy to plugin a new compression by providing a encoder/decoder between a block int[] ↔ IndexOutput/IndexInput 27
  • 27. Block Codecs: Use Case • Smaller index & better compression for certain index contents by replacing vInt • Frequent terms (stop words) have low doc deltas • Short fields (city names) have small position values • TermQuery • High doc frequency terms • MultiTermQuery • Lots of terms (possibly with low docFreq) in order request doc postings 28
  • 28. Block Codecs: Disadvantages • Crazy API for consumers, e.g. queries • Removes encapsulation of DocsEnum • Consumer needs to do loading of blocks • Consumer needs to do doc delta decoding • Lot„s of queries may need duplicate code • Impl for codecs that support „bulk postings“ • Classical DocsEnum impl 29
  • 29. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 30
  • 30. Flexible Indexing - Current State • All tests pass with a pool of random codecs • If you find a bug, post the test output to mailing list! • Test output contains random seed to reproduce Talk yesterday evening: “Using Lucene's Test Framework” by Robert Muir • many more tests and documentation is needed • community feedback is highly appreciated 31
  • 31. Flexible Indexing - TODO • Serialize custom attributes to the index • Convert all remaining queries to use internally BytesRef-terms • Remove interning of field names • Die, Term (& Token) class, die! ☺ • Make block-/bulk-postings API more user- friendly and merge to SVN trunk • Merge DocValues branch (Talk yesterday: “DocValues & Column Stride Fields in Lucene” by Simon Willnauer) • More heavy refactoring! 32
  • 32. Agenda • Motivation • API changes in general • Codecs • Future • Wrap up 33
  • 33. Summary • New 4D postings enum APIs • Pluggable codec lets you customize index format • Many codecs already available • State-of-the-art postings formats are in progress • Innovation is easier! • Exciting time for Lucene... • Sizable performance gains, RAM/GC 34
  • 34. We need YOU! • Download nightly builds or check out SVN trunk • Run tests, run tests, run tests,… • Use in your own developments • Report back experience with developing own codecs 35
  • 35. Contact Uwe Schindler uschindler@sd-datasolutions.de http://www.thetaphi.de @thetaph1 Many thanks to Mike McCandless for slides & help on preparing this talk! SD DataSolutions GmbH Wätjenstr. 49 28213 Bremen, Germany +49 421 40889785-0 http://www.sd-datasolutions.de 36