
Querying rich text with XQuery



Presented by Michael Sokolov, Senior Architect, Safari Books Online

Solr and Lucene provide a powerful, scalable search server. XQuery provides a rich querying and programming model for use with marked-up text. This session will present Lux, a system that combines these into a powerful XML search engine, which is freely available under an open-source license. Query optimizers often mystify database users: sometimes queries run quickly and sometimes they don’t. An intuitive grasp of what will work well in an optimizer is often gained only after trial, error, inductive logic (i.e. educated guessing), and sometimes propitiatory sacrifice. This session will explain some of the mystery by describing work on Lux's optimizer. Lux optimizes queries by rewriting them as equivalent (but usually faster) indexed queries, so its results are easier for a user to understand than the abstract query plans produced by some optimizers. Lucene-based QName and path indexes prove useful in speeding up XQuery execution by Saxon. Finally, this session will describe the mechanisms Lux uses for extending Solr and Lucene, which include Solr UpdateProcessor, ResponseWriter, and QueryComponent plugins, dynamic Solr schema enhancement, custom XML-aware analyzers and tokenizers.

Published in: Technology, Education


  1. Querying Rich Text with Lucene XQuery (Michael Sokolov, Senior Architect, Safari Books Online)
  2. The plan for this talk
     • Overview of Lux
     • Why we need (or just want) a rich(er) query language
     • Implementation stories
     • Indexing tagged text
     • Storing documents in Lucene
     • Lazy searching
     • Demo
  3. What is Lux?
     • XQuery in Solr
     • Query optimizer
     • Efficient XML document format
     • XQuery function library
     • Available as a Java library (Lucene only)
     • Available as Solr plugins
     • Available as a standalone App Server
  4. Search: to find something
  5. Query: to get an answer
  6. XML is not cool
     • Maybe it was once, 10 years ago?
     • Legacy stuff: DTDs, namespaces, etc.
     • Arcane Java programming interfaces
     • Don’t we use JSON now?
     • So why do we care about it?
  7. But it still matters
     • There’s a huge amount of it out there
     • HTML is XML, or can be
     • Lux is about making it easy (and free) to deal with XML
  8. Why did we make it?
     • We make content-rich sites:
     • our own site:
     • our clients' sites: …
     • Publishers provide us with content
     • We debug content problems
     • We add new features nimbly
     • Piles of random data (XML, mostly)
  9. How can XQuery help?
     • Complex queries over semi-structured data, typically documents
     • You don’t need it for edismax-style “quick” search
     • or for highly structured data
     • XQuery comes with a rich function library:
     • rich string, numeric and date functions
     • extensions for HTTP, filesystem, zip
  10. How does Lux work?
      (Diagram of components: DispatchFilter, UpdateProcessor, XML Indexer, XML text fields, Tagged TokenStream, XPath fields, Tinybin storage, External Field Codec, QueryComponent, QParserPlugin, Evaluator, Saxon XQuery/XSLT Processor, XQuery Function Library, Lazy Searcher, ResponseWriter, Compiler, Optimizer, Tagged Highlighter)
  11. XML is text with context
      • “hamlet”
      • “hamlet” in //title
      • “hamlet” in //scene/title, //speaker, etc.
      • XQuery, but we need an index
      • DIH XPathEntityProcessor
      • But are XPath indexes enough?
  12. How do we index context?
      • In which speeches does Hamlet talk about poison?
      • +speaker:Hamlet +line:poison
      • Works great if we indexed speaker and line for each speech
      • What if we only indexed at the scene level?
      • What if we just indexed speech text as a field?
      • XPath indexes are precise and fine-grained
      • Great when you know exactly what you need
  13. Contextual Indexes
      Sample document:
        <play>
          <title>Hamlet</title>
          <act act="1">
            <scene act="1" scene="1">
              <title>SCENE I. Elsinore ... </title>

      Index         Values
      Tags          title, act, @act
      Tag Paths     /play, /play/title, /play/act, /play/act/@act
      Text          hamlet, scene, elsinore
      Tagged Text   play:hamlet, title:hamlet, @act:1
      XPath         user-defined XPath 2.0 expression, e.g. count(//line), replace(//title, 'SCENE|ACT S+', '')
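The index table on this slide can be illustrated with a small sketch. This is plain Python, not Lux's Java implementation; the function name `contextual_index`, the regex word splitter, and the completed sample document are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Sample document from the slide, closed off so that it parses.
SAMPLE = """<play>
  <title>Hamlet</title>
  <act act="1">
    <scene act="1" scene="1">
      <title>SCENE I. Elsinore</title>
    </scene>
  </act>
</play>"""


def contextual_index(xml_text):
    """Collect tag names, tag paths, plain text terms and tagged text terms."""
    tags, paths, text, tagged = set(), set(), set(), set()

    def visit(elem, ancestors):
        stack = ancestors + [elem.tag]
        path = "/" + "/".join(stack)
        tags.add(elem.tag)
        paths.add(path)
        for name, value in elem.attrib.items():
            tags.add("@" + name)
            paths.add(path + "/@" + name)
            tagged.add("@%s:%s" % (name, value.lower()))
        # Simplification: only text directly inside the element is read;
        # text that follows a child element (its "tail") is ignored.
        for word in re.findall(r"\w+", elem.text or ""):
            word = word.lower()
            text.add(word)
            for tag in stack:  # one tagged term per enclosing element
                tagged.add(tag + ":" + word)
        for child in elem:
            visit(child, stack)

    visit(ET.fromstring(xml_text), [])
    return {"tags": tags, "paths": paths, "text": text, "tagged_text": tagged}


index = contextual_index(SAMPLE)
print(sorted(index["paths"]))
```

Each set corresponds to one row of the table; the user-defined XPath row is left out because it needs a real XPath 2.0 engine.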
  14. General purpose indexes
      • Tagged Text, Path index
      • Imprecise, generic indexes, but more context than just full text
      • XQuery post-processing to patch over the gaps
      • Query optimizer applies indexes
      • For when you don’t want to sweat the details: ad hoc queries, content analysis and debugging
  15. TaggedTokenStream
      Sample input:
        <scene><speech>
          <speaker>Hamlet</speaker>
          <line>To be or not to be, … </line>
      (Diagram: tag stacks scene / speech / speaker over “Hamlet” and scene / speech / line over “To be”)
      Tokens emitted:
        Zext:scene:hamlet     pos=1
        Zext:speech:hamlet    pos=1
        Zext:speaker:hamlet   pos=1
        Zext:scene:to         pos=2
        Zext:speech:to        pos=2
        …
      • Wraps an existing Analyzer (for the text)
      • Responds to XML events (start element, etc.)
      • Maintains a tag name stack
      • Emits each token prefixed by enclosing tags
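The four behaviours listed on this slide (wrap an analyzer, respond to XML events, keep a tag stack, emit prefixed tokens) can be sketched as follows. This is a Python illustration built on `xml.sax`, not Lux's Lucene `TokenStream`; the class name `TaggedTokenHandler` and the `simple_analyzer` function are hypothetical stand-ins.

```python
import re
import xml.sax


def simple_analyzer(text):
    """Stand-in for the wrapped Analyzer: lowercase word tokens."""
    return [word.lower() for word in re.findall(r"\w+", text)]


class TaggedTokenHandler(xml.sax.ContentHandler):
    """Emits (tag:term, position) pairs for every enclosing element."""

    def __init__(self, analyze):
        super().__init__()
        self.analyze = analyze
        self.stack = []      # names of the currently open elements
        self.buffer = []     # character data since the last XML event
        self.position = 0    # shared position counter for all tagged forms
        self.tokens = []

    def startElement(self, name, attrs):
        self._flush()
        self.stack.append(name)

    def endElement(self, name):
        self._flush()
        self.stack.pop()

    def characters(self, content):
        # The parser may deliver text in several chunks, so buffer it.
        self.buffer.append(content)

    def _flush(self):
        text = "".join(self.buffer)
        self.buffer = []
        for term in self.analyze(text):
            self.position += 1
            for tag in self.stack:
                self.tokens.append((tag + ":" + term, self.position))


def tagged_tokens(xml_text, analyze=simple_analyzer):
    handler = TaggedTokenHandler(analyze)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tokens


doc = ("<scene><speech><speaker>Hamlet</speaker>"
       "<line>To be or not to be</line></speech></scene>")
for token, pos in tagged_tokens(doc)[:5]:
    print(token, "pos=%d" % pos)
```

All tagged forms of one word share a position, which is what allows phrase queries to span different tags later on.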
  16. TagQueryParser
      • XPath: //speech[speaker="Hamlet"][contains(.,"poison")]
      • “Optimized” XQuery: lux:search("+<speaker:Hamlet +<speech:poison")//speech[speaker="Hamlet"][contains(.,"poison")]
      • Lucene Query: tagged_text:(+speaker:Hamlet +speech:poison)
  17. Tagged token examples
      • Generic JSON index
      • Overlapping tags (part-of-speech, phrase-labeling, NLP)
      • Citation classification w/probabilistic labeling
      • One stored field for all the text makes highlighting easier
      • One Lucene field means you can use PhraseQuery, e.g. PhraseQuery(<speaker:hamlet <speech:to) finds all speeches by hamlet starting with “to”.
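The phrase query example on this slide relies on tagged terms sharing one field and one position sequence. The sketch below mimics that with an in-memory positional index in Python. The helper names and the three sample speeches are invented for illustration, and real Lucene `PhraseQuery` matching is more involved.

```python
from collections import defaultdict


def build_postings(docs):
    """docs: {doc_id: [(term, position), ...]} -> {term: {doc_id: {positions}}}"""
    postings = defaultdict(lambda: defaultdict(set))
    for doc_id, tokens in docs.items():
        for term, pos in tokens:
            postings[term][doc_id].add(pos)
    return postings


def phrase_match(postings, terms):
    """Return ids of documents where the terms occur at consecutive positions."""
    first, rest = terms[0], terms[1:]
    hits = []
    for doc_id, starts in postings.get(first, {}).items():
        for start in starts:
            if all(start + offset in postings.get(term, {}).get(doc_id, ())
                   for offset, term in enumerate(rest, 1)):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical tagged tokens, one "document" per speech.
SPEECHES = {
    "s1": [("speaker:hamlet", 1), ("speech:hamlet", 1),
           ("speech:to", 2), ("speech:be", 3)],
    "s2": [("speaker:hamlet", 1), ("speech:hamlet", 1),
           ("speech:not", 2), ("speech:to", 3)],
    "s3": [("speaker:horatio", 1), ("speech:horatio", 1),
           ("speech:to", 2), ("speech:arms", 3)],
}

postings = build_postings(SPEECHES)
print(phrase_match(postings, ["speaker:hamlet", "speech:to"]))
```

Only the speech in which "to" immediately follows the speaker name matches, because the speaker token and the first word of the speech sit at adjacent positions.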
  18. What’s the cost?
      • stored document = 100%
      • qnames = +1.3%
      • paths = +2.4%
      • text tokens = 18%
      • tagged text (opaque) = 18%
      • tagged text (all transparent) = 71%
  19. Query optimization
      Original query:
        subsequence(
          for $doc in collection()[.//SPEAKER="Hamlet"]
          order by $doc/lux:key("title")
          return $doc, 1000, 20)
      Rewritten query:
        subsequence(
          lux:search("<SPEAKER:Hamlet", "title", 1000)[.//SPEAKER="Hamlet"],
          1, 20)
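The general idea behind this kind of rewrite can be sketched in Python: an index supplies pre-sorted candidates, the exact predicate is re-checked on each one, and documents are loaded lazily until the requested page is full. The corpus, the `INDEX` mapping and all function names below are hypothetical; this is not Lux's optimizer or its `lux:search` semantics.

```python
from itertools import islice

# Hypothetical corpus: doc id -> (title, speakers in the document).
CORPUS = {
    1: ("Hamlet", ["Hamlet", "Horatio"]),
    2: ("Macbeth", ["Macbeth"]),
    3: ("Hamlet Q1", ["Hamlet"]),
    4: ("Rosencrantz", ["Hamlet", "Guildenstern"]),
}

# Stand-in for a search index: term -> doc ids already sorted by title.
INDEX = {"speaker:hamlet": [1, 3, 4]}

loads = []  # records every document fetch, to compare the two strategies


def load(doc_id):
    loads.append(doc_id)
    return CORPUS[doc_id]


def exact_match(doc, speaker):
    return speaker in doc[1]


def naive_page(speaker, start, count):
    """Load everything, filter, sort, then slice (start is 1-based)."""
    docs = [load(doc_id) for doc_id in CORPUS]
    hits = sorted((d for d in docs if exact_match(d, speaker)),
                  key=lambda d: d[0])
    return hits[start - 1:start - 1 + count]


def indexed_page(speaker, start, count):
    """Use the index for sorted candidates and verify them lazily."""
    candidates = INDEX.get("speaker:" + speaker.lower(), [])
    verified = (doc for doc in map(load, candidates)
                if exact_match(doc, speaker))
    return list(islice(verified, start - 1, start - 1 + count))


print(indexed_page("Hamlet", 1, 2), loads)
```

The exact predicate is kept after the index lookup because a generic index may return false positives; the lazy generator is what stops documents beyond the requested page from being loaded.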
  20. Document storage
      • Lux uses Lucene as its primary document store
      • Lux tinybin (based on Saxon TinyTree) storage format avoids XML parsing overhead
      • Experimental new codec stores fields as files
  21. “Big” binary stored fields
      • Problem: “big” stored fields
      • Text documents get stored for highlighting
      • Take time to copy when merging
      • Can we do better by storing as files, but managing w/Lucene?
  22. ExternalFieldCodec
      (Diagram labels: large stored fields, small stored fields)
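The idea of keeping large stored fields outside the main index data can be sketched as below. Everything here is an assumption made for illustration: the class name `ExternalFieldStore`, the size threshold, and the one-file-per-document layout are not taken from Lux's codec.

```python
import os
import tempfile


class ExternalFieldStore:
    """Toy store: small values stay inline, large ones go to separate files."""

    def __init__(self, directory, threshold=1024):
        self.directory = directory
        self.threshold = threshold  # assumed cut-off, in bytes
        self.records = {}           # doc_id -> ("inline", bytes) or ("file", path)

    def put(self, doc_id, value):
        if len(value) < self.threshold:
            self.records[doc_id] = ("inline", value)
            return
        path = os.path.join(self.directory, "%s.bin" % doc_id)
        with open(path, "wb") as handle:
            handle.write(value)
        self.records[doc_id] = ("file", path)

    def get(self, doc_id):
        kind, payload = self.records[doc_id]
        if kind == "inline":
            return payload
        with open(payload, "rb") as handle:
            return handle.read()


def demo():
    """Store one small and one large value; report where each ended up."""
    with tempfile.TemporaryDirectory() as directory:
        store = ExternalFieldStore(directory, threshold=16)
        store.put("small", b"short title")
        store.put("large", b"x" * 100)
        kinds = (store.records["small"][0], store.records["large"][0])
        round_trip = store.get("large") == b"x" * 100
        return kinds, round_trip


print(demo())
```

In a layout like this, a merge would only need to copy the small records and file references, which is the motivation stated on the preceding slide.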
  23. Deleting is complicated
      • Real-time deletes:
      • Track deletions when merging
      • Keep commits with IndexDeletionPolicy
      • Delete unmerged (empty) segments
      • Off-line deletes:
      • Cleanup tool traverses entire index
  24. Codec Performance (preliminary)
      • 2-3x write speedup for unindexed stored fields
      • A bit slower in the worst case
      • But text analysis can take most of the time
      • Net: useful if you are storing large binaries
  25. App Server
      • Custom DispatchFilter provides:
      • HTTP request/response handling in XQuery
      • File uploads, redirects
      • Ability to roll your own: cookies, authentication
      • Rapid prototyping, testing query performance, relevance, in an application setting
  26. Isn’t Solr enough?
      • Yes, but did you remember to index all the fields you need in advance?
      • Yes, but did you want to format the result into a nice report *using your query language*?
      • Yes, but did you want access to a complete XPath 2.0 implementation in your indexer?
  27. Example uses
      • Find some sample content with a new tag we need to support
      • Perform complex updates to patch broken content
      • Troubleshoot content
      • Explore unfamiliar content
      • Write prototypes and admin tools entirely in HTML, JS and XQuery
      • Demo: http://localhost:8080
  28. Thank You!
      • Downloads and Documentation at http://
      • Source code at http://
      • Freely available under OSS license (MPL 2)
      • Contributions welcome
      • Thank you, Safari Books!