Portable Lucene index format
          Andrzej Białecki

           Oct 19, 2011
whoami
§  Started using Lucene in 2003 (1.2-dev...)
§  Created Luke – the Lucene Index Toolbox
§  Apache Nutch, Hadoop, Lucene/Solr committer,
    Lucene and Nutch PMC member
§  Developer at Lucid Imagination




Agenda
§    Data compatibility challenge
§    Portable format design & applications
§    Implementation details
§    Status & future




Lucene index format
§  Binary
   •  Not human-readable*!              NRMÿt|yzqq{x …
      * except for committer-humans ☺
§  Highly optimized
   •  Efficient random I/O access
§  Compressed
   •  Data-specific compression
§  Strict, well-defined
   •  Details specific to a particular version of Lucene
   •  Both minor and major differences between versions


Backward-compatibility
§  Compatibility of API
   •  Planned obsolescence across major releases
   •  Relatively easy (@deprecated APIs)
§  Compatibility of data
   •  Complex and costly
   •  Still required for upgrades to an N+1 release
   •  Caveats when re-using old data
      §  E.g. incompatible Analyzer-s, sub-optimal field types
§  Forward-compatibility in Lucene?
   •  Forget it – and for good reasons


Backup and archival
§  Do we need to keep old data?
   •  Typically index is not a System Of Record
§  But index backups still occasionally needed
   •    If costly to build (time / manual effort / CPU / IO …)
   •    As a fallback copy during upgrades
   •    To validate claims or to prove compliance
   •    To test performance / quality across updates
§  Archival data should use a long-term viable format
   •  Timespan longer than shelf-life of a software version
   •  Keep all tools, platforms and environments?


                                              Ouch…



MN
§  Lucene currently supports some
                                            2.9   3.x    4.x
    backward-compatibility
                                            +/+   +/-    -/-    2.9
   •  Only latest N <= 2 releases
                                            -/+   +/+?   +/-    3.x
   •  High maintenance and expertise
                                            -/-   -/+    +/+?   4.x
      costs
   •  Very fragile
§  Ever higher cost of broader back-
    compat
   •  M source formats * N target formats
   •  Forward-compat? Forget it


                            8
Conclusion
      If you need to read data from versions
      other than N or N-1, you will suffer greatly




      There should be a better solution
      for broader data compatibility …

Portable format idea
§  Can we come up with one simple,
    generic, flexible format that fits data
    from any version?
§  Who cares if it’s sub-optimal as long
    as it’s readable!



                      Use a format so simple
                      that it needs no
                      version-specific tools

Portable format: goals
§  Reduced costs of long-term back-compat
   •  M * portable
   •  Implement only one format right

              2.9   3.x   4.x
         P    +/+   +/+   +/+

§  Long-term readability
   •  Readable with simple tools (text editor, less)
   •  Easy to post-process with simple tools
§  Extensible
   •  Easy to add data and metadata on many levels
   •  Easy to add other index parts, or non-index data
§  Reducible
   •  Easy to ignore / skip unknown data (forward-compat) – see the sketch below
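
To make the "reducible" goal concrete, here is a minimal, hypothetical Java sketch of a forgiving reader: it consumes line-oriented, space-separated records and silently skips any record type it does not recognize. The record names and the sample data are made up for illustration only.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ForgivingRecordReader {
  // Record types this (hypothetical) reader understands; everything else is skipped.
  private static final Set<String> KNOWN =
      new HashSet<String>(Arrays.asList("field", "term", "freq"));

  public static void main(String[] args) throws IOException {
    String data = "field f1\nterm a\nfreq 2\nfancyNewRecord xyz\nterm doc\nfreq 1\n";
    BufferedReader in = new BufferedReader(new StringReader(data));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.trim().split("\\s+", 2);   // line-oriented, space-separated
      if (parts.length == 0 || !KNOWN.contains(parts[0])) {
        continue;                                      // unknown record type: ignore / skip
      }
      System.out.println("record=" + parts[0] + " value=" + (parts.length > 1 ? parts[1] : ""));
    }
  }
}

A reader built this way keeps working when a newer Lucene version adds record types it has never seen, which is exactly the forward-compatibility the binary format cannot offer.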
Portable format: design
§  Focus on function and not form / implementation
   •  Follow the logical model of a Lucene index, which
      is more or less constant across releases
§  Relaxed and forgiving
   •  Arbitrary number and type of entries
   •  Not thwarted by unknown data (ignore / skip)
   •  Figure out best defaults if data is missing
§  Focus on clarity and not minimal size
   •  De-normalize data if it increases portability
§  Simple metadata (per item, per section)

Index model: sections
§  Global index data
   •  User data
   •  Source format details (informational)
   •  List of commit points (and the current commit)
   •  List of index sections (segments, other?)
§  Segment sections
   •  Segment info in metadata
   •  List of fields and their types (FieldInfos)
   •  Per-segment data

   [Diagram: index = Global data + Segment 1, Segment 2, Segment 3]

Per-segment data
§  Logical model instead of implementation details
§  Per-segment sections
   •  Existing known sections
   •  Other future sections …
§  This is the standard model!
   •  Just be more relaxed
   •  And use plain-text formats
§  SimpleTextCodec?
   •  Not quite (yet?)

   [Diagram: Segment 1 and Segment 2, each containing Segment & Field meta, stored, terms,
    term freqs, postings, payloads, term vectors, doc values, deletes]

PortableCodec
§  The Codec API allows customizing on-disk formats
§  API is not complete yet …
  •  Term dict + postings, docvalues, segment infos
§  … but it’s coming: LUCENE-2621
  •  Custom stored fields storage (e.g. NoSQL)
  •  More efficient term freq. vectors storage
  •  Eventually will make PortableCodec possible
     §  Write and read data in portable format
     §  Reuse / extend SimpleTextCodec
§  In the meantime … use codec-independent
    export / import utilities
Pre-analyzed input
§  SOLR-1535 PreAnalyzedField
   •  Allows submitting pre-analyzed documents
   •  Too limited format
§  “Inverted document” in portable format
   •  Pre-analyzed, pre-inverted (e.g. with term vectors)
   •  Easy to construct using non-Lucene tools (see the sketch below)
§  Applications
   •  Integration with external pipelines
       §  Text analysis, segmentation, stemming, POS-tagging, other NLP –
           produces pre-analyzed input
   •  Other ideas
       §    Transaction log
       §    Fine-grained shard rebalancing (nano-sharding)
       §    Rolling updates of a cluster
       §    …
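
As a rough, purely illustrative sketch (not the SOLR-1535 syntax and not a committed format), a pre-analyzed, pre-inverted document could be expressed in the same plain-text style as the per-segment examples shown later in this deck: the stored value plus the terms, positions and offsets already produced by an external pipeline. Field name, tokens, positions and offsets below are invented for illustration.

doc 1
  field title
    str Portable Lucene index format
    term format    pos 3   offs 22-28
    term index     pos 2   offs 16-21
    term lucene    pos 1   offs 9-15
    term portable  pos 0   offs 0-8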
Serialization format: goals
§  Generic enough:
   •  Flexible enough to express all known index parts
§  Specific enough
   •  Predefined sections common to all versions
   •  Well-defined record boundaries (easy to ignore / skip)
   •  Well-defined behavior on missing sections / attributes
§  Support for arbitrary sections
§  Easily parsed, preferably with no external libs
§  Only reasonably efficient


Serialization format: design
§  Container - could be a ZIP
   •  Simple, well-known, single file, streamable, fast
      seek, compressible
   •  But limited to 4GB (ZIP-64 only in Java 7)
   •  Needs un-zipping to process with text utilities
§  Or could be just a Lucene Directory
   •  That is, a bunch of files with predefined names
      §  You can ZIP them yourself if needed, or TAR, or whatever
§  Simple plain-text formats
   •  Line-oriented records, space-separated fields
      §  Or Avro?
   •  Per-section metadata in java.util.Properties format (see the sketch below)
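
Because the per-section metadata is plain java.util.Properties, it can be read back with nothing but the JDK. A minimal sketch, assuming the index.meta file and keys shown in the examples later in this deck:

import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

public class ReadIndexMeta {
  public static void main(String[] args) throws IOException {
    Properties meta = new Properties();
    Reader in = new FileReader("index.meta");   // per-section metadata, Properties syntax
    try {
      meta.load(in);
    } finally {
      in.close();
    }
    System.out.println("numDocs    = " + meta.getProperty("numDocs"));
    System.out.println("numDeleted = " + meta.getProperty("numDeleted"));
    System.out.println("fieldNames = " + meta.getProperty("fieldNames"));
  }
}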
Implementation
§  See LUCENE-3491 for more details
§  PortableCodec proof-of-concept
   •  Uses extended Codec API in LUCENE-2621
   •  Reuses SimpleTextCodec
   •  Still some parts are not handled …
§  PortableIndexExporter tool
   •  Uses regular Lucene IndexReader API
   •  Traverses all data structures for a complete export (sketch below)
   •  Mimics the expected data from PortableCodec (to
      be completed in the future)
§  Serialization: plain-text files in Directory
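
For a flavor of what such an export looks like, here is a heavily simplified sketch written against the Lucene 3.x IndexReader API; it is not the actual PortableIndexExporter from LUCENE-3491, and its output format is abbreviated.

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class TinyExporter {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    try {
      // Terms dictionary + postings, dumped as simple line-oriented records
      TermEnum terms = reader.terms();
      while (terms.next()) {
        Term t = terms.term();
        System.out.println("term " + t.field() + " " + t.text() + " docFreq " + terms.docFreq());
        TermDocs docs = reader.termDocs(t);
        while (docs.next()) {
          System.out.println("  doc " + docs.doc() + " freq " + docs.freq());
        }
        docs.close();
      }
      terms.close();
      // Stored fields
      for (int i = 0; i < reader.maxDoc(); i++) {
        if (!reader.isDeleted(i)) {
          System.out.println("stored doc " + i + " " + reader.document(i));
        }
      }
    } finally {
      reader.close();
    }
  }
}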
Example data
§  Directory contents

    120  10 Oct 21:03  _0.docvals
      3  10 Oct 21:03  _0.f1.norms
      3  10 Oct 21:03  _0.f2.norms
    152  10 Oct 21:03  _0.fields
   1090  10 Oct 21:03  _0.postings
    244  10 Oct 21:03  _0.stored
    457  10 Oct 21:03  _0.terms
   1153  10 Oct 21:03  _0.vectors
    701  10 Oct 21:03  commits.1318273420000.meta
    342  10 Oct 21:03  index.meta
   1500  10 Oct 21:03  seginfos.meta

Example data
§  Index.meta

#Mon Oct 17 15:19:35 CEST 2011
#Created with PortableIndexExporter
fieldNames=[f1, f3, f2]
numDeleted=0
numDocs=2
readers=1
readerVersion=1318857574927

Example data
§  Seginfos.meta (SegmentInfos)
version=1318857445991
size=1
files=[_0.nrm, _0.fnm, _0.tvx, _0.docvals, _0.tvd, _0.fdx, _0.tvf …
0.name=_0
0.version=4.0
0.files=[_0.nrm, _0.fnm, _0.tvx, _0.docvals, _0.tvd, _0.tvf, _0.f …
0.fields=3
0.usesCompond=false
0.field.f1.number=0
0.field.f2.number=1
0.field.f3.number=2
0.field.f1.flags=IdfpVopN-----
0.field.f2.flags=IdfpVopN-Df64
0.field.f3.flags=-------N-Dvin
0.diag.java.vendor=Apple Inc.
0.hasProx=true
0.hasVectors=true
0.docCount=2
0.delCount=0

Example data
§  Commits.1318273420000.meta (IndexCommit)

segment=segments_1
generation=1
timestamp=1318857575000
version=1318857574927
deleted=false
optimized=true
files=[_0.nrm, _0.fnm, _0_1.skp, _0_1.tii, _0.tvd …

Example data
§  _0.fields (FieldInfos in Luke-like format)
field f1
  flags IdfpVopN-----
  codec Standard
field f2
  flags IdfpVopN-Df64
  codec MockRandom
field f3
  flags -------N-Dvin
  codec MockVariableIntBlock

Example data
   §  _0.stored (stored fields – LUCENE-2621)
doc 2
  field f1
    str this is a test
  field f2
    str another field with test string
doc 2
  field f1
    str doc two this is a test
  field f2
    str doc two another field with test string

Example data
§  _0.terms – terms dict (via SimpleTextCodec)
field f1
  term a
    freq 2
    totalFreq 2
  term doc
    freq 1
    totalFreq 1
  term is
    freq 2
    totalFreq 2
  term test
    freq 2
    totalFreq 2
  …
field f2
  term another
    freq 2
    totalFreq 2
  term doc
    freq 1
    totalFreq 1
  term field
    freq 2
    totalFreq 2
  term string
    freq 2
    totalFreq 2
  …

Example data
§  _0.postings (via SimpleTextCodec)

field f1
  term a
    doc 0
      freq 1
      pos 2
    doc 1
      freq 1
      pos 4
  term doc
    doc 1
      freq 1
      pos 0
  …
field f2
  term another
    doc 0
      freq 1
      pos 0
    doc 1
      freq 1
      pos 2
  term doc
    doc 1
      freq 1
      pos 0
  …

Example data
§  _0.vectors
doc 0
  field f1
    term a
      freq 1
      offs 8-9
      positions 2
    term is
      freq 1
      offs 5-7
      positions 1
    …
  field f2
    term another
      freq 1
      offs 0-7
      positions 0
    …
doc 1
  field f1
    term a
      freq 1
      offs 16-17
      positions 4
    term doc
      freq 1
      offs 0-3
      positions 0
    term is
      freq 1
      offs 13-15
      positions 3
    …

Example data
§  _0.docvals

field f2
  type FLOAT_64
0 0.0
1 0.0010
  docs 2
  end
field f3
  type VAR_INTS
0 0
1 1234567890
  docs 2
  end
END

Current status
§  PortableIndexExporter tool working
  •  Usable as an archival tool
  •  Exports complete index data
     §  Unlike XMLExporter in Luke
  •  No corresponding import tool yet …
§  Codec-based tools incomplete
  •  … because the Codec API is incomplete
§  More work needed to reach the cross-compat goals
§  Help is welcome! ☺



Summary
§  Portability of data is a challenge
   •  Binary back-compat difficult and fragile
§  Long-term archival is a challenge
§  Solution: use simple data formats
§  Portable index format:
   •    Uses plain text files
   •    Extensible – back-compat
   •    Reducible – forward-compat
   •    Enables “pre-analyzed input” pipelines
   •    PortableCodec
         §  for now PortableIndexExporter tool

Q & A

§  More details: LUCENE-3491
      http://issues.apache.org/jira/browse/LUCENE-3491


§  Contact:




                   Thank you!


More Related Content

What's hot

Nginx internals
Nginx internalsNginx internals
Nginx internalsliqiang xu
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with HelixAmy W. Tang
 
Enable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zunEnable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zunheut2008
 
FD.io VPP tap-inject with sample_plugins
FD.io VPP tap-inject with sample_pluginsFD.io VPP tap-inject with sample_plugins
FD.io VPP tap-inject with sample_pluginsNaoto MATSUMOTO
 
NGINX Back to Basics Part 3: Security (Japanese Version)
NGINX Back to Basics Part 3: Security (Japanese Version)NGINX Back to Basics Part 3: Security (Japanese Version)
NGINX Back to Basics Part 3: Security (Japanese Version)NGINX, Inc.
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013Jun Rao
 
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...ScyllaDB
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashCeph Community
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in RustAndrew Lamb
 
PostgreSQL Materialized Views with Active Record
PostgreSQL Materialized Views with Active RecordPostgreSQL Materialized Views with Active Record
PostgreSQL Materialized Views with Active RecordDavid Roberts
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...confluent
 
Kubernetes Networking with Cilium - Deep Dive
Kubernetes Networking with Cilium - Deep DiveKubernetes Networking with Cilium - Deep Dive
Kubernetes Networking with Cilium - Deep DiveMichal Rostecki
 
Espresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom QuiggleEspresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom Quiggleconfluent
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detailMIJIN AN
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
 
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)SolarWinds
 
PostgreSQL Performance Tuning
PostgreSQL Performance TuningPostgreSQL Performance Tuning
PostgreSQL Performance Tuningelliando dias
 

What's hot (20)

Nginx internals
Nginx internalsNginx internals
Nginx internals
 
Untangling Cluster Management with Helix
Untangling Cluster Management with HelixUntangling Cluster Management with Helix
Untangling Cluster Management with Helix
 
Enable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zunEnable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zun
 
FD.io VPP tap-inject with sample_plugins
FD.io VPP tap-inject with sample_pluginsFD.io VPP tap-inject with sample_plugins
FD.io VPP tap-inject with sample_plugins
 
NGINX Back to Basics Part 3: Security (Japanese Version)
NGINX Back to Basics Part 3: Security (Japanese Version)NGINX Back to Basics Part 3: Security (Japanese Version)
NGINX Back to Basics Part 3: Security (Japanese Version)
 
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Kafka replication apachecon_2013
Kafka replication apachecon_2013Kafka replication apachecon_2013
Kafka replication apachecon_2013
 
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
How We Reduced Performance Tuning Time by Orders of Magnitude with Database O...
 
AF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on FlashAF Ceph: Ceph Performance Analysis and Improvement on Flash
AF Ceph: Ceph Performance Analysis and Improvement on Flash
 
Introduction to DataFusion An Embeddable Query Engine Written in Rust
Introduction to DataFusion  An Embeddable Query Engine Written in RustIntroduction to DataFusion  An Embeddable Query Engine Written in Rust
Introduction to DataFusion An Embeddable Query Engine Written in Rust
 
PostgreSQL Materialized Views with Active Record
PostgreSQL Materialized Views with Active RecordPostgreSQL Materialized Views with Active Record
PostgreSQL Materialized Views with Active Record
 
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
Kafka Cluster Federation at Uber (Yupeng Fui & Xiaoman Dong, Uber) Kafka Summ...
 
Kubernetes Networking with Cilium - Deep Dive
Kubernetes Networking with Cilium - Deep DiveKubernetes Networking with Cilium - Deep Dive
Kubernetes Networking with Cilium - Deep Dive
 
Espresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom QuiggleEspresso Database Replication with Kafka, Tom Quiggle
Espresso Database Replication with Kafka, Tom Quiggle
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
 
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
Getting the most out of your Oracle 12.2 Optimizer (i.e. The Brain)
 
PostgreSQL Performance Tuning
PostgreSQL Performance TuningPostgreSQL Performance Tuning
PostgreSQL Performance Tuning
 

Viewers also liked

Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Luceneotisg
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1YI-CHING WU
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadooplucenerevolution
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBertrand Delacretaz
 
Lucandra
LucandraLucandra
Lucandraotisg
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simonlucenerevolution
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...Lucidworks
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMLucidworks
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr DevelopersErik Hatcher
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisJosiane Gamgo
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartLucidworks
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introductionotisg
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and SolrTommaso Teofili
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesRahul Jain
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 

Viewers also liked (20)

Finite State Queries In Lucene
Finite State Queries In LuceneFinite State Queries In Lucene
Finite State Queries In Lucene
 
Lucene
LuceneLucene
Lucene
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1Introduction to Lucene and Solr - 1
Introduction to Lucene and Solr - 1
 
Apache lucene
Apache luceneApache lucene
Apache lucene
 
Analytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoopAnalytics in olap with lucene & hadoop
Analytics in olap with lucene & hadoop
 
Beyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and SolrBeyond full-text searches with Lucene and Solr
Beyond full-text searches with Lucene and Solr
 
Lucene and MySQL
Lucene and MySQLLucene and MySQL
Lucene and MySQL
 
Lucandra
LucandraLucandra
Lucandra
 
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer SimonDocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
DocValues aka. Column Stride Fields in Lucene 4.0 - By Willnauer Simon
 
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by ...
 
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBMBuilding and Running Solr-as-a-Service: Presented by Shai Erera, IBM
Building and Running Solr-as-a-Service: Presented by Shai Erera, IBM
 
Lucene for Solr Developers
Lucene for Solr DevelopersLucene for Solr Developers
Lucene for Solr Developers
 
Architecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's ThesisArchitecture and Implementation of Apache Lucene: Marter's Thesis
Architecture and Implementation of Apache Lucene: Marter's Thesis
 
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, FlipkartNear Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
Near Real Time Indexing: Presented by Umesh Prasad & Thejus V M, Flipkart
 
Lucene Introduction
Lucene IntroductionLucene Introduction
Lucene Introduction
 
Text categorization with Lucene and Solr
Text categorization with Lucene and SolrText categorization with Lucene and Solr
Text categorization with Lucene and Solr
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
Lucene basics
Lucene basicsLucene basics
Lucene basics
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 

Similar to Portable Lucene Index Format & Applications - Andrzej Bialecki

Drill architecture 20120913
Drill architecture 20120913Drill architecture 20120913
Drill architecture 20120913jasonfrantz
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloudelliando dias
 
Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?lucenerevolution
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQLDon Demcsak
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetuprcmuir
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Perforce
 
What's brewing in the eZ Systems extensions kitchen
What's brewing in the eZ Systems extensions kitchenWhat's brewing in the eZ Systems extensions kitchen
What's brewing in the eZ Systems extensions kitchenPaul Borgermans
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInLinkedIn
 
What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?Stephen Foskett
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservicesBigstep
 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPrashant Rane
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.pptvijayapraba1
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on DockerDataWorks Summit
 

Similar to Portable Lucene Index Format & Applications - Andrzej Bialecki (20)

Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0Flexible Indexing in Lucene 4.0
Flexible Indexing in Lucene 4.0
 
Hadoop
HadoopHadoop
Hadoop
 
Drill architecture 20120913
Drill architecture 20120913Drill architecture 20120913
Drill architecture 20120913
 
Distributed Data processing in a Cloud
Distributed Data processing in a CloudDistributed Data processing in a Cloud
Distributed Data processing in a Cloud
 
Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?Is Your Index Reader Really Atomic or Maybe Slow?
Is Your Index Reader Really Atomic or Maybe Slow?
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr MeetupImproved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
Improved Search With Lucene 4.0 - NOVA Lucene/Solr Meetup
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale
 
What's brewing in the eZ Systems extensions kitchen
What's brewing in the eZ Systems extensions kitchenWhat's brewing in the eZ Systems extensions kitchen
What's brewing in the eZ Systems extensions kitchen
 
Fun with flexible indexing
Fun with flexible indexingFun with flexible indexing
Fun with flexible indexing
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 
Timesten Architecture
Timesten ArchitectureTimesten Architecture
Timesten Architecture
 
What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?What’s the Deal with Containers, Anyway?
What’s the Deal with Containers, Anyway?
 
Data Lake and the rise of the microservices
Data Lake and the rise of the microservicesData Lake and the rise of the microservices
Data Lake and the rise of the microservices
 
Pune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCDPune-Cocoa: Blocks and GCD
Pune-Cocoa: Blocks and GCD
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
From 0 to syncing
From 0 to syncingFrom 0 to syncing
From 0 to syncing
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
Map reducecloudtech
Map reducecloudtechMap reducecloudtech
Map reducecloudtech
 

More from lucenerevolution

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucenelucenerevolution
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! lucenerevolution
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solrlucenerevolution
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationslucenerevolution
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Recently uploaded

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 

Portable Lucene Index Format & Applications - Andrzej Bialecki

  • 1. Portable Lucene index format Andrzej Białecki Oct 19, 2011
  • 2. whoami §  Started using Lucene in 2003 (1.2-dev...) §  Created Luke – the Lucene Index Toolbox §  Apache Nutch, Hadoop, Lucene/Solr committer, Lucene and Nutch PMC member §  Developer at Lucid Imagination 3
  • 3. Agenda §  Data compatibility challenge §  Portable format design & applications §  Implementation details §  Status & future 4
  • 4. Lucene index format §  Binary •  Not human-readable*! NRMÿt|yzqq{x … * except for committer-humans  §  Highly optimized •  Efficient random I/O access §  Compressed •  Data-specific compression §  Strict, well-defined •  Details specific to a particular version of Lucene •  Both minor and major differences between versions 5
  • 5. Backward-compatibility §  Compatibility of API •  Planned obsolescence across major releases •  Relatively easy (@deprecated APIs) §  Compatibility of data •  Complex and costly •  Still required for upgrades to an N+1 release •  Caveats when re-using old data §  E.g. incompatible Analyzer-s, sub-optimal field types §  Forward-compatibility in Lucene? •  Forget it – and for good reasons 6
  • 6. Backup and archival §  Do we need to keep old data? •  Typically index is not a System Of Record §  But index backups still occasionally needed •  If costly to build (time / manual effort / CPU / IO …) •  As a fallback copy during upgrades •  To validate claims or to prove compliance •  To test performance / quality across updates §  Archival data should use a long-term viable format •  Timespan longer than shelf-life of a software version •  Keep all tools, platforms and environments? Ouch… 7
  • 7. MN §  Lucene currently supports some 2.9 3.x 4.x backward-compatibility +/+ +/- -/- 2.9 •  Only latest N <= 2 releases -/+ +/+? +/- 3.x •  High maintenance and expertise -/- -/+ +/+? 4.x costs •  Very fragile §  Ever higher cost of broader back- compat •  M source formats * N target formats •  Forward-compat? Forget it 8
  • 8. Conclusion If you need to read data from versions other than N or N-1 you will suffer greatly There should be a better solution for a broader data compatibility … 9
  • 9. Portable format idea §  Can we come up with one simple, generic, flexible format that fits data from any version? §  Who cares if it’s sub-optimal as long as it’s readable! Use a format so simple that it needs no version-specific tools 10
  • 10. Portable format: goals §  Reduced costs of long-term back-compat •  M * portable 2.9 3.x 4.x •  Implement only one format right +/+ +/+ +/+ P §  Long-term readability •  Readable with simple tools (text editor, less) •  Easy to post-process with simple tools §  Extensible •  Easy to add data and metadata on many levels •  Easy to add other index parts, or non-index data §  Reducible •  Easy to ignore / skip unknown data (forward-compat) 11
  • 11. Portable format: design
    §  Focus on function and not form / implementation
       •  Follow the logical model of a Lucene index, which is more or less constant across releases
    §  Relaxed and forgiving
       •  Arbitrary number and type of entries
       •  Not thwarted by unknown data (ignore / skip)
       •  Figure out best defaults if data is missing
    §  Focus on clarity and not minimal size
       •  De-normalize data if it increases portability
    §  Simple metadata (per item, per section)
  • 12. Index model: sections
    §  Global index data
       •  User data
       •  Source format details (informational)
       •  List of commit points (and the current commit)
       •  List of index sections (segments, other?)
    §  Segment sections
       •  Segment info in metadata
       •  List of fields and their types (FieldInfos)
       •  Per-segment data
    (diagram: index = global data followed by Segment 1, Segment 2, Segment 3 sections)
  • 13. Per-segment data
    §  Logical model instead of implementation details
    §  Per-segment sections
       •  Existing known sections
       •  Other future sections …
    §  This is the standard model!
       •  Just be more relaxed
       •  And use plain-text formats
    §  SimpleTextCodec?
       •  Not quite (yet?)
    (diagram: each segment holds segment & field metadata, stored fields, terms, term freqs, postings, payloads, term vectors, doc values, deletes)
  • 14. PortableCodec
    §  Codec API allows customizing on-disk formats
    §  API is not complete yet …
       •  Term dict + postings, docvalues, segment infos
    §  … but it’s coming: LUCENE-2621
       •  Custom stored fields storage (e.g. NoSQL)
       •  More efficient term freq. vectors storage
       •  Eventually will make PortableCodec possible
    §  Write and read data in portable format
    §  Reuse / extend SimpleTextCodec
    §  In the meantime … use codec-independent export / import utilities
  • 15. Pre-analyzed input
    §  SOLR-1535 PreAnalyzedField
       •  Allows submitting pre-analyzed documents
       •  Too limited format
    §  “Inverted document” in portable format
       •  Pre-analyzed, pre-inverted (e.g. with term vectors)
       •  Easy to construct using non-Lucene tools
    §  Applications
       •  Integration with external pipelines
          §  Text analysis, segmentation, stemming, POS-tagging, other NLP – produces pre-analyzed input
       •  Other ideas
          §  Transaction log
          §  Fine-grained shard rebalancing (nano-sharding)
          §  Rolling updates of a cluster
          §  …
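To make the idea concrete, here is a rough, purely illustrative sketch of what a pre-analyzed, pre-inverted document could look like in the portable plain-text style. The keywords (doc, field, str, term, freq, pos, offs) and the layout are extrapolated from the example data shown later in the deck; they are an assumption for illustration, not a format defined by SOLR-1535 or LUCENE-3491.

    doc
      field f1
        str this is a test
        term a
          freq 1
          pos 2
          offs 8-9
        term is
          freq 1
          pos 1
          offs 5-7
        …

An external pipeline (tokenizer, stemmer, NLP tagger) could emit records like this directly, and the indexer would only have to copy them into the corresponding segment sections.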
  • 16. Serialization format: goals
    §  Generic enough
       •  Flexible enough to express all known index parts
    §  Specific enough
       •  Predefined sections common to all versions
       •  Well-defined record boundaries (easy to ignore / skip)
       •  Well-defined behavior on missing sections / attributes
    §  Support for arbitrary sections
    §  Easily parsed, preferably with no external libs
    §  Only reasonably efficient
  • 17. Serialization format: design
    §  Container - could be a ZIP
       •  Simple, well-known, single file, streamable, fast seek, compressible
       •  But limited to 4GB (ZIP-64 only in Java 7)
       •  Needs un-zipping to process with text utilities
    §  Or could be just a Lucene Directory
       •  That is, a bunch of files with predefined names
          §  You can ZIP them yourself if needed, or TAR, or whatever
    §  Simple plain-text formats
       •  Line-oriented records, space-separated fields
          §  Or Avro?
       •  Per-section metadata in java.util.Properties format
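Because per-section metadata is plain java.util.Properties text, it can be read back with nothing but the JDK. A minimal sketch, assuming a file named index.meta with the keys shown in the example a few slides below (the file name and keys are taken from that example, not from a fixed specification):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Properties;

    public class ReadIndexMeta {
      public static void main(String[] args) throws IOException {
        Properties meta = new Properties();
        InputStream in = new FileInputStream("index.meta");
        try {
          meta.load(in); // standard key=value Properties syntax
        } finally {
          in.close();
        }
        // Unknown keys are simply ignored; missing keys fall back to defaults (forward-compat)
        int numDocs = Integer.parseInt(meta.getProperty("numDocs", "0"));
        int numDeleted = Integer.parseInt(meta.getProperty("numDeleted", "0"));
        System.out.println("docs=" + numDocs + ", deleted=" + numDeleted);
        System.out.println("fields=" + meta.getProperty("fieldNames", "[]"));
      }
    }

The same tolerance works in the other direction: a newer tool can write extra keys, and an older reader that only knows numDocs and numDeleted still gets what it needs.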
  • 18. Implementation
    §  See LUCENE-3491 for more details
    §  PortableCodec proof-of-concept
       •  Uses extended Codec API in LUCENE-2621
       •  Reuses SimpleTextCodec
       •  Still some parts are not handled …
    §  PortableIndexExporter tool
       •  Uses regular Lucene IndexReader API
       •  Traverses all data structures for a complete export
       •  Mimics the expected data from PortableCodec (to be completed in the future)
    §  Serialization: plain-text files in Directory
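The exporter idea can be sketched with nothing more than the public IndexReader API of Lucene 3.x. The class below is illustrative only, not the actual PortableIndexExporter from LUCENE-3491: it walks the terms dictionary and postings and prints them as line-oriented plain text loosely modeled on the _0.terms / _0.postings examples that follow; stored fields, vectors, doc values and deletes are omitted.

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.store.FSDirectory;

    // Illustrative sketch: export terms and postings as line-oriented text.
    public class TinyExporter {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        try {
          TermEnum terms = reader.terms();
          TermDocs docs = reader.termDocs();
          String currentField = null;
          while (terms.next()) {
            Term t = terms.term();
            if (!t.field().equals(currentField)) {
              currentField = t.field();
              System.out.println("field " + currentField);
            }
            System.out.println("  term " + t.text());
            docs.seek(t);
            while (docs.next()) {
              System.out.println("    doc " + docs.doc() + " freq " + docs.freq());
            }
          }
        } finally {
          reader.close();
        }
      }
    }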
  • 19. Example data
    §  Directory contents
        120  10 Oct 21:03  _0.docvals
          3  10 Oct 21:03  _0.f1.norms
          3  10 Oct 21:03  _0.f2.norms
        152  10 Oct 21:03  _0.fields
       1090  10 Oct 21:03  _0.postings
        244  10 Oct 21:03  _0.stored
        457  10 Oct 21:03  _0.terms
       1153  10 Oct 21:03  _0.vectors
        701  10 Oct 21:03  commits.1318273420000.meta
        342  10 Oct 21:03  index.meta
       1500  10 Oct 21:03  seginfos.meta
  • 20. Example data
    §  index.meta
       #Mon Oct 17 15:19:35 CEST 2011
       #Created with PortableIndexExporter
       fieldNames=[f1, f3, f2]
       numDeleted=0
       numDocs=2
       readers=1
       readerVersion=1318857574927
  • 21. Example data
    §  seginfos.meta (SegmentInfos)
       version=1318857445991
       size=1
       files=[_0.nrm, _0.fnm, _0.tvx, _0.docvals, _0.tvd, _0.fdx, _0.tvf
       0.name=_0
       0.version=4.0
       0.files=[_0.nrm, _0.fnm, _0.tvx, _0.docvals, _0.tvd, _0.tvf, _0.f
       0.fields=3
       0.usesCompond=false
       0.field.f1.number=0
       0.field.f2.number=1
       0.field.f3.number=2
       0.field.f1.flags=IdfpVopN-----
       0.field.f2.flags=IdfpVopN-Df64
       0.field.f3.flags=-------N-Dvin
       0.diag.java.vendor=Apple Inc.
       0.hasProx=true
       0.hasVectors=true
       0.docCount=2
       0.delCount=0
  • 22. Example data
    §  commits.1318273420000.meta (IndexCommit)
       segment=segments_1
       generation=1
       timestamp=1318857575000
       version=1318857574927
       deleted=false
       optimized=true
       files=[_0.nrm, _0.fnm, _0_1.skp, _0_1.tii, _0.tvd …
  • 23. Example data
    §  _0.fields (FieldInfos in Luke-like format)
       field f1
         flags IdfpVopN-----
         codec Standard
       field f2
         flags IdfpVopN-Df64
         codec MockRandom
       field f3
         flags -------N-Dvin
         codec MockVariableIntBlock
  • 24. Example data
    §  _0.stored (stored fields – LUCENE-2621)
       doc 2
         field f1
           str this is a test
         field f2
           str another field with test string
       doc 2
         field f1
           str doc two this is a test
         field f2
           str doc two another field with test string
  • 25. Example data
    §  _0.terms – terms dict (via SimpleTextCodec)
       field f1
         term a
           freq 2
           totalFreq 2
         term doc
           freq 1
           totalFreq 1
         term is
           freq 2
           totalFreq 2
         term test
           freq 2
           totalFreq 2
         …
       field f2
         term another
           freq 2
           totalFreq 2
         term doc
           freq 1
           totalFreq 1
         term field
           freq 2
           totalFreq 2
         term string
           freq 2
           totalFreq 2
         …
  • 26. Example data
    §  _0.postings (via SimpleTextCodec)
       field f1
         term a
           doc 0
             freq 1
             pos 2
           doc 1
             freq 1
             pos 4
         term doc
           doc 1
             freq 1
             pos 0
         …
       field f2
         term another
           doc 0
             freq 1
             pos 0
           doc 1
             freq 1
             pos 2
         term doc
           doc 1
             freq 1
             pos 0
         …
  • 27. Example data
    §  _0.vectors
       doc 0
         field f1
           term a
             freq 1
             offs 8-9
             positions 2
           term is
             freq 1
             offs 5-7
             positions 1
           …
         field f2
           term another
             freq 1
             offs 0-7
             positions 0
           …
       doc 1
         field f1
           term a
             freq 1
             offs 16-17
             positions 4
           term doc
             freq 1
             offs 0-3
             positions 0
           term is
             freq 1
             offs 13-15
             positions 3
           …
  • 28. Example data
    §  _0.docvals
       field f2
         type FLOAT_64
         0  0.0
         1  0.0010
         docs 2
         end
       field f3
         type VAR_INTS
         0  0
         1  1234567890
         docs 2
         end
       END
  • 29. Current status
    §  PortableIndexExporter tool working
       •  Usable as an archival tool
       •  Exports complete index data
          §  Unlike XMLExporter in Luke
       •  No corresponding import tool yet …
    §  Codec-based tools incomplete
       •  … because the Codec API is incomplete
    §  More work needed to reach the cross-compat goals
    §  Help is welcome! :)
  • 30. Summary
    §  Portability of data is a challenge
       •  Binary back-compat difficult and fragile
    §  Long-term archival is a challenge
    §  Solution: use simple data formats
    §  Portable index format:
       •  Uses plain text files
       •  Extensible – back-compat
       •  Reducible – forward-compat
       •  Enables “pre-analyzed input” pipelines
       •  PortableCodec
          §  for now PortableIndexExporter tool
  • 31. Q&A
    §  More details: LUCENE-3491
       http://issues.apache.org/jira/browse/LUCENE-3491
    §  Contact:
    Thank you!