Solr4 nosql search_server_2013



  1. Solr 4: The NoSQL Search Server
     Yonik Seeley, May 30, 2013
  2. NoSQL Databases
     • Wikipedia says: A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used.
     • Non-traditional data stores
     • Doesn't use / isn't designed around SQL
     • May not give full ACID guarantees
     • Offers other advantages, such as greater scalability, as a tradeoff
     • Distributed, fault-tolerant architecture
  3. Solr Cloud Design Goals
     • Automatic distributed indexing
     • HA for writes
     • Durable writes
     • Near real-time search
     • Real-time get
     • Optimistic concurrency
  4. Solr Cloud
     • Distributed indexing designed from the ground up to accommodate desired features
     • CAP Theorem
         • Consistency, Availability, Partition tolerance (saying goes "choose 2")
         • Reality: must handle P, so the real choice is a tradeoff between C and A
     • Ended up with a CP system (roughly)
         • Value consistency over availability
         • Eventual consistency is incompatible with optimistic concurrency
         • Closest to MongoDB in architecture
     • We still do well with availability
         • All N replicas of a shard must go down before we lose writability for that shard
         • For a network partition, the "big" partition remains active (i.e. availability isn't "on" or "off")
  5. Solr 4
  6. Solr 4 at a glance
     • Document-oriented NoSQL search server
     • Data-format agnostic (JSON, XML, CSV, binary)
     • Schema-less options (more coming soon)
     • Distributed
     • Multi-tenanted
     • Fault tolerant
         • HA + no single points of failure
     • Atomic updates
     • Optimistic concurrency
     • Near real-time search
     • Full-text search + hit highlighting
     • Tons of specialized queries: faceted search, grouping, pseudo-join, spatial search, functions
     The desire for these features drove some of the "SolrCloud" architecture.
  7. Quick Start
     1. Unzip the binary distribution (.ZIP file). Note: no "installation" required.
     2. Start Solr:
            $ cd example
            $ java -jar start.jar
     3. Go! Browse to http://localhost:8983/solr for the new admin interface.
  8. New admin UI
  9. Add and Retrieve Document
         $ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
         [
           { "id"     : "book1",
             "title"  : "American Gods",
             "author" : "Neil Gaiman"
           }
         ]'

         $ curl 'http://localhost:8983/solr/get?id=book1'
         {
           "doc": {
             "id" : "book1",
             "author": "Neil Gaiman",
             "title" : "American Gods",
             "_version_": 1410390803582287872
           }
         }
     Note: no type of "commit" is necessary to retrieve documents via /get (real-time get).
  10. Simplified JSON Delete Syntax
      • Single delete-by-id
            {"delete":"book1"}
      • Multiple delete-by-id
            {"delete":["book1","book2","book3"]}
      • Delete with optimistic concurrency
            {"delete":{"id":"book1", "_version_":123456789}}
      • Delete by query
            {"delete":{"query":"tag:category1"}}
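The four delete forms on the slide above can be generated rather than hand-written. The following is a minimal Python sketch; the helper names (`delete_by_id`, `delete_with_version`, `delete_by_query`) are hypothetical and not part of any Solr client library. It only builds the JSON bodies and does not contact a server.

```python
import json


def delete_by_id(*ids):
    """One id yields the single-id form; several ids yield the list form."""
    if not ids:
        raise ValueError("at least one id is required")
    return json.dumps({"delete": ids[0] if len(ids) == 1 else list(ids)})


def delete_with_version(doc_id, version):
    """Delete that only succeeds if the stored _version_ matches."""
    return json.dumps({"delete": {"id": doc_id, "_version_": version}})


def delete_by_query(query):
    """Delete every document matching a query string."""
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    print(delete_by_id("book1"))
    print(delete_by_id("book1", "book2", "book3"))
    print(delete_with_version("book1", 123456789))
    print(delete_by_query("tag:category1"))
```

Each returned string could be sent as the body of a POST to /update with Content-type:application/json, as in the curl examples elsewhere in the deck.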
  11. Atomic Updates
          $ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
          [
            {"id"        : "book1",
             "pubyear_i" : { "add" : 2001 },
             "ISBN_s"    : { "add" : "0-380-97365-1"}
            }
          ]'

          $ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
          [
            {"id"        : "book1",
             "copies_i"  : { "inc" : 1},
             "cat"       : { "add" : "fantasy"},
             "ISBN_s"    : { "set" : "0-380-97365-0"},
             "remove_s"  : { "set" : null }
            }
          ]'
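To make the effect of the `add`, `inc`, and `set` modifiers concrete, here is a toy in-memory model in Python. It is a sketch of the semantics suggested by the slide, not Solr's implementation: the function name `apply_atomic_update` is hypothetical, and the handling of missing fields (`inc` starts from 0, `add` on a missing field stores the bare value) is an assumption.

```python
import copy


def apply_atomic_update(stored, update):
    """Return a new document with the update's modifiers applied (toy model)."""
    doc = copy.deepcopy(stored)
    for field, value in update.items():
        if not isinstance(value, dict):
            doc[field] = value  # plain value, e.g. the "id" key
            continue
        for op, arg in value.items():
            if op == "set":
                if arg is None:
                    doc.pop(field, None)  # {"set": null} removes the field
                else:
                    doc[field] = arg
            elif op == "add":
                if field not in doc:
                    doc[field] = arg  # assumption: first value is stored bare
                else:
                    current = doc[field]
                    values = current if isinstance(current, list) else [current]
                    doc[field] = values + [arg]
            elif op == "inc":
                doc[field] = doc.get(field, 0) + arg  # assumption: missing counts as 0
            else:
                raise ValueError(f"unsupported modifier: {op}")
    return doc


if __name__ == "__main__":
    book = {"id": "book1", "title": "American Gods", "author": "Neil Gaiman"}
    book = apply_atomic_update(book, {"id": "book1", "pubyear_i": {"add": 2001}})
    book = apply_atomic_update(book, {"id": "book1", "copies_i": {"inc": 1}})
    print(book)
```

The point of the modifiers is that the client sends only the changed fields; the server merges them into the stored document.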
  12. Optimistic Concurrency
      • Conditional update based on document version
      [Diagram: request loop between client and Solr]
      1. /get document
      2. Modify document, retaining _version_
      3. /update resulting document
      4. Go back to step #1 if fail code=409
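The four-step loop above can be sketched in Python against a toy in-memory store. Everything here is illustrative: `ToyStore`, `VersionConflict` (standing in for an HTTP 409 response), and `update_with_retry` are hypothetical names, and the store assigns versions from a simple counter rather than the way Solr does.

```python
import itertools


class VersionConflict(Exception):
    """Stands in for Solr's HTTP 409 (Conflict) response."""


class ToyStore:
    """In-memory stand-in for /get and /update with version checking."""

    def __init__(self):
        self._docs = {}
        self._versions = itertools.count(1000)  # toy version numbers

    def get(self, doc_id):
        return dict(self._docs[doc_id])

    def update(self, doc):
        current = self._docs.get(doc["id"])
        supplied = doc.get("_version_", 0)
        # Only the "exact match" rule (supplied > 1) is modelled here.
        if supplied > 1 and (current is None or current["_version_"] != supplied):
            raise VersionConflict(doc["id"])
        stored = dict(doc, _version_=next(self._versions))
        self._docs[doc["id"]] = stored
        return stored["_version_"]


def update_with_retry(store, doc_id, modify, max_attempts=5):
    for _ in range(max_attempts):
        doc = store.get(doc_id)       # step 1: /get document
        modify(doc)                   # step 2: modify, retaining _version_
        try:
            return store.update(doc)  # step 3: /update resulting document
        except VersionConflict:
            continue                  # step 4: back to step 1 on a conflict
    raise RuntimeError(f"gave up on {doc_id} after {max_attempts} attempts")
```

Because the whole read-modify-write cycle is repeated after a conflict, the modification is always applied to the latest stored state, never to a stale copy.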
  13. Version semantics
      _version_   Update semantics
      > 1         Document version must exactly match supplied _version_
      1           Document must exist
      < 0         Document must not exist
      0           Don't care (normal overwrite if exists)
      • Specifying _version_ on any update invokes optimistic concurrency
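The table above translates directly into a small predicate. This is a Python sketch of the rules as stated on the slide; the function name `version_check_passes` is hypothetical, and `actual` is assumed to be `None` when the document does not exist.

```python
def version_check_passes(supplied, actual):
    """Apply the slide's _version_ rules.

    supplied: the _version_ sent with the update.
    actual:   the stored document's _version_, or None if it does not exist.
    """
    if supplied > 1:
        return actual == supplied  # must match exactly
    if supplied == 1:
        return actual is not None  # document must exist
    if supplied < 0:
        return actual is None      # document must not exist
    return True                    # 0: don't care, normal overwrite


if __name__ == "__main__":
    print(version_check_passes(123456789, 123456789))
    print(version_check_passes(-1, 123456789))
```

A failed check corresponds to the update being rejected with a 409 conflict.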
  14. Optimistic Concurrency Example
      Get the document:
          $ curl 'http://localhost:8983/solr/get?id=book2'
          { "doc" : {
              "id":"book2",
              "title":["Neuromancer"],
              "author":"William Gibson",
              "copiesIn_i":7,
              "copiesOut_i":3,
              "_version_":123456789 }}
      Modify and resubmit, using the same _version_:
          $ curl http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
          [
            {
              "id":"book2",
              "title":["Neuromancer"],
              "author":"William Gibson",
              "copiesIn_i":6,
              "copiesOut_i":4,
              "_version_":123456789 }
          ]'
      Alternately, specify the _version_ as a request parameter:
          $ curl 'http://localhost:8983/solr/update?_version_=123456789' -H 'Content-type:application/json' -d '[…]'
  15. Optimistic Concurrency Errors
      • HTTP code 409 (Conflict) is returned on version mismatch
          $ curl -i http://localhost:8983/solr/update -H 'Content-type:application/json' -d '
          [{"id":"book1", "author":"Mr Bean", "_version_":54321}]'
          HTTP/1.1 409 Conflict
          Content-Type: text/plain;charset=UTF-8
          Transfer-Encoding: chunked

          {
            "responseHeader":{
              "status":409,
              "QTime":1},
            "error":{
              "msg":"version conflict for book1 expected=12345 actual=1408814192853516288",
              "code":409}}
  16. Schema
  17. Schema REST API
      • Restlet is now integrated with Solr
      • Get a specific field
            $ curl http://localhost:8983/solr/schema/fields/price
            {"field":{
                "name":"price",
                "type":"float",
                "indexed":true,
                "stored":true }}
      • Get all fields
            $ curl http://localhost:8983/solr/schema/fields
      • Get entire schema!
            $ curl http://localhost:8983/solr/schema
  18. Dynamic Schema
      • Add a new field (Solr 4.4)
            $ curl -XPUT http://localhost:8983/solr/schema/fields/strength -d '
            {"type":"float", "indexed":"true"}'
      • Works in distributed (cloud) mode too!
      • Schema must be managed and mutable (not currently the default)
            <schemaFactory class="ManagedIndexSchemaFactory">
              <bool name="mutable">true</bool>
              <str name="managedSchemaResourceName">managed-schema</str>
            </schemaFactory>
  19. Schemaless
      • "Schemaless" really normally means that the client(s) have an implicit schema
      • "No schema" is impossible for anything based on Lucene
          • A field must be indexed the same way across documents
      • Dynamic fields: convention over configuration
          • Only pre-define types of fields, not fields themselves
          • No guessing. Any field name ending in _i is an integer
      • "Guessed schema" or "type guessing"
          • For previously unknown fields, guess using JSON type as a hint
          • Coming soon (4.4?) based on the Dynamic Schema work
      • Many disadvantages to guessing
          • Lose ability to catch field naming errors
          • Can't optimize based on types
          • Guessing incorrectly means having to start over
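The contrast between dynamic fields and type guessing can be illustrated with a short Python sketch. The slide confirms only the `_i` suffix; the other suffixes in `DYNAMIC_SUFFIXES`, the JSON-to-type mapping in `guess_type`, and the `GuessedSchema` class are assumptions made for illustration, not Solr's actual rules.

```python
# Suffix convention: only "_i" comes from the slide; the rest are assumed.
DYNAMIC_SUFFIXES = {"_i": "int", "_l": "long", "_f": "float", "_s": "string", "_b": "boolean"}


def type_from_name(field):
    """Dynamic fields: the type follows from the name, no guessing."""
    for suffix, field_type in DYNAMIC_SUFFIXES.items():
        if field.endswith(suffix):
            return field_type
    return None


def guess_type(value):
    """Type guessing: use the JSON value's type as a hint (assumed mapping)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"no guess for {type(value).__name__}")


class GuessedSchema:
    """The first value seen for a field fixes its type for all later documents."""

    def __init__(self):
        self.fields = {}

    def index(self, doc):
        for name, value in doc.items():
            guessed = guess_type(value)
            known = self.fields.setdefault(name, guessed)
            if known != guessed:
                raise TypeError(f"{name}: schema says {known}, document has {guessed}")
```

Indexing `{"price": 10}` and then `{"price": 10.5}` with `GuessedSchema` fails on the second document, which is the "guessing incorrectly means having to start over" problem in miniature.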
  20. Solr Cloud
  21. Solr Cloud
      [Diagram: a query to http://.../solr/collection1/query?q=awesome is sent as load-balanced sub-requests to the replicas (replica1, replica2, replica3) of shard1 and shard2. A ZooKeeper quorum of five ZK nodes holds the tree below.]
          /configs
              /myconf
                  solrconfig.xml
                  schema.xml
          /clusterstate.json
          /aliases.json
          /livenodes
              server1:8983/solr
              server2:8983/solr
          /collections
              /collection1
                  configName=myconf
                  /shards
                      /shard1
                          server1:8983/solr
                          server2:8983/solr
                      /shard2
                          server3:8983/solr
                          server4:8983/solr
      ZooKeeper holds cluster state
      • Nodes in the cluster
      • Collections in the cluster
      • Schema & config for each collection
      • Shards in each collection
      • Replicas in each shard
      • Collection aliases
  22. Distributed Indexing
      [Diagram: an update sent to http://.../solr/collection1/update is routed to shard1 or shard2.]
      • Update sent to any node
      • Solr determines what shard the document is on, and forwards to the shard leader
      • Shard leader versions the document and forwards to all other shard replicas
      • HA for updates (if one leader fails, another takes its place)
  23. Collections API
      • Create a new document collection
            http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3
      • Delete a collection
            http://localhost:8983/solr/admin/collections?action=DELETE&name=mycollection
      • Create an alias to a collection (or a group of collections)
            http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=tri_state&collections=NY,NJ,CT
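Since every Collections API call on the slide above is a URL with an `action` and a few parameters, the URLs can be assembled programmatically. A minimal Python sketch follows; the helper name `collections_url` and the `BASE` constant are hypothetical, and the code only builds the URL string without sending a request.

```python
from urllib.parse import urlencode

# Assumed base URL, taken from the slide's localhost examples.
BASE = "http://localhost:8983/solr/admin/collections"


def collections_url(action, **params):
    """Build a Collections API URL; commas are left unescaped for list values."""
    query = urlencode({"action": action, **params}, safe=",")
    return f"{BASE}?{query}"


if __name__ == "__main__":
    print(collections_url("CREATE", name="mycollection", numShards=4, replicationFactor=3))
    print(collections_url("DELETE", name="mycollection"))
    print(collections_url("CREATEALIAS", name="tri_state", collections="NY,NJ,CT"))
```

Keeping the action and parameters as keyword arguments avoids hand-concatenating `&` and `=` and ensures any special characters in names are escaped.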
  24. http://localhost:8983/solr/#/~cloud
  25. Distributed Query Requests
      • Distributed query across all shards in the collection
            http://localhost:8983/solr/collection1/query?q=foo
      • Explicitly specify node addresses to load-balance across
            shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
          • A list of equivalent nodes is separated by "|"
          • Different phases of the same distributed request use the same node
      • Specify logical shards to search across
            shards=NY,NJ,CT
      • Specify multiple collections to search across
            collection=collection1,collection2
      • public CloudSolrServer(String zkHost)
          • ZK-aware SolrJ Java client that load-balances across all nodes in the cluster
          • Calculates where a document belongs and sends it directly to the shard leader (new)
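The structure of the explicit `shards` parameter (commas between shards, "|" between equivalent nodes of one shard) can be shown with a short Python sketch. The function names `parse_shards` and `pick_nodes` are hypothetical, and the random choice is only a stand-in for whatever load-balancing policy Solr applies.

```python
import random


def parse_shards(param):
    """Split a shards value into one list of equivalent nodes per shard."""
    return [[node.strip() for node in group.split("|")] for group in param.split(",")]


def pick_nodes(param, rng=random):
    """Choose one node per shard; reuse the result for every phase of a request."""
    return [rng.choice(group) for group in parse_shards(param)]


if __name__ == "__main__":
    shards = "localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr"
    print(parse_shards(shards))
    print(pick_nodes(shards))
```

Picking the nodes once and reusing the list matches the slide's point that different phases of the same distributed request use the same node.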
  26. Durable Writes
      • Lucene flushes writes to disk on a "commit"
          • Uncommitted docs are lost on a crash (at the Lucene level)
      • Solr 4 maintains its own transaction log
          • Contains uncommitted documents
          • Services real-time get requests
          • Recovery (log replay on restart)
          • Supports distributed "peer sync"
      • Writes forwarded to multiple shard replicas
          • A replica can go away forever without collection data loss
          • A replica can do a fast "peer sync" if it's only slightly out of date
          • A replica can do a full index replication (copy) from a peer
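The role of the transaction log can be illustrated with a toy Python model. This is a sketch of the idea only: the class `ToyCore` and its attributes are hypothetical, the "durable" log is just a list assumed to survive a crash, and none of this reflects Solr's actual data structures.

```python
class ToyCore:
    """Toy model: a durable transaction log in front of a commit-based index."""

    def __init__(self):
        self.segments = {}  # committed documents, assumed to survive a crash
        self.buffer = {}    # uncommitted documents held in memory only
        self.tlog = []      # transaction log, assumed durable once appended

    def add(self, doc):
        self.tlog.append(dict(doc))         # log the update first
        self.buffer[doc["id"]] = dict(doc)  # then buffer it for the next commit

    def realtime_get(self, doc_id):
        # The newest uncommitted version wins; fall back to the committed index.
        for doc in reversed(self.tlog):
            if doc["id"] == doc_id:
                return doc
        return self.segments.get(doc_id)

    def search_ids(self):
        return sorted(self.segments)  # searches only see committed documents

    def commit(self):
        self.segments.update(self.buffer)
        self.buffer.clear()
        self.tlog.clear()  # logged updates are now safely in the index

    def crash(self):
        self.buffer = {}  # in-memory state is lost

    def restart(self):
        for doc in self.tlog:  # log replay restores uncommitted documents
            self.buffer[doc["id"]] = dict(doc)
```

In this model an added document is immediately retrievable by id, invisible to search until a commit, and recoverable after a crash because the log is replayed on restart.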
  27. Near Real Time (NRT) softCommit
      • softCommit opens a new view of the index without flushing + fsyncing files to disk
          • Decouples update visibility from update durability
      • commitWithin now implies a soft commit
      • Current autoCommit defaults from solrconfig.xml:
            <autoCommit>
              <maxTime>15000</maxTime>
              <openSearcher>false</openSearcher>
            </autoCommit>

            <!--
            <autoSoftCommit>
              <maxTime>5000</maxTime>
            </autoSoftCommit>
            -->
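  28. Document Routing
      [Diagram: hash ring for a collection with numShards=4 and router=compositeId. The 32-bit hash space is divided into four ranges, one per shard (shard1 to shard4): 00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff, c0000000-ffffffff.]
      • id = BigCo!doc5 hashes (MurmurHash3) to 9f27 3c71, which falls on shard1
      • q=my_query with shard.keys=BigCo! maps (hash) to the range 9f27 0000 to 9f27 ffff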
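The compositeId idea on the slide above, where the part before "!" fixes the high bits of the hash so that all of a tenant's documents land in one narrow range, can be sketched in Python. Two loud assumptions: `zlib.crc32` is used as a stand-in for MurmurHash3, so the values will not match Solr's, and the 16/16 bit split is inferred from the slide's 9f27 0000 to 9f27 ffff range. Only shard1's range is confirmed by the slide's example; the pairing of the other shard names with ranges is assumed.

```python
import zlib


def hash32(text):
    """Stand-in 32-bit hash. Solr uses MurmurHash3; CRC32 is NOT equivalent."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id):
    """High 16 bits from the shard key (before '!'), low 16 bits from the rest."""
    prefix, sep, rest = doc_id.partition("!")
    if not sep:
        return hash32(doc_id)
    return (hash32(prefix) & 0xFFFF0000) | (hash32(rest) & 0x0000FFFF)


def shard_key_range(shard_key):
    """Hash range covered by every id that starts with the given shard key."""
    top = hash32(shard_key.rstrip("!")) & 0xFFFF0000
    return top, top | 0x0000FFFF


SHARD_RANGES = [
    ("shard1", 0x80000000, 0xBFFFFFFF),  # matches the slide's BigCo!doc5 example
    ("shard2", 0xC0000000, 0xFFFFFFFF),  # assumed pairing
    ("shard3", 0x00000000, 0x3FFFFFFF),  # assumed pairing
    ("shard4", 0x40000000, 0x7FFFFFFF),  # assumed pairing
]


def shard_for(hash_value):
    for name, low, high in SHARD_RANGES:
        if low <= hash_value <= high:
            return name
    raise ValueError(f"hash out of 32-bit range: {hash_value:#x}")
```

A query restricted with a shard key only needs to visit the shard (or shards) overlapping `shard_key_range`, instead of fanning out to the whole collection.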
  29. Seamless Online Shard Splitting
      [Diagram: Shard1, Shard2, and Shard3, each with a leader and a replica. Updates arrive at Shard2, which is being split into sub-shards Shard2_0 and Shard2_1.]
      1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
      2. New sub-shards are created in "construction" state
      3. Leader starts forwarding applicable updates, which are buffered by the sub-shards
      4. Leader index is split and installed on the sub-shards
      5. Sub-shards apply buffered updates, then become "active" leaders, and the old shard becomes "inactive"
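Steps 3 to 5 above can be modelled in a few lines of Python. This is a toy sketch under stated assumptions: the parent's hash range is split at its midpoint, indexes are plain dicts keyed by document hash, and the names `split_range` and `split_shard` are hypothetical rather than anything from Solr.

```python
def split_range(low, high):
    """Split an inclusive hash range into two halves (assumed midpoint split)."""
    mid = low + (high - low) // 2
    return (low, mid), (mid + 1, high)


def split_shard(parent_index, low, high, buffered_updates):
    """Toy split of one shard into two sub-shards.

    parent_index:     dict mapping document hash -> document
    buffered_updates: list of (hash, document) pairs that arrived during the split
    """
    ranges = split_range(low, high)
    subs = [{}, {}]

    def target(hash_value):
        return subs[0] if hash_value <= ranges[0][1] else subs[1]

    # Step 4: the leader's index is split and installed on the sub-shards.
    for hash_value, doc in parent_index.items():
        target(hash_value)[hash_value] = doc
    # Step 5: buffered updates are applied afterwards, so they win over older data.
    for hash_value, doc in buffered_updates:
        target(hash_value)[hash_value] = doc
    return ranges, subs
```

Buffering updates during the split and applying them last is what lets indexing continue while the parent shard's data is being divided.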
  30. Stay in touch