SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

Presentation from my talk at the big data conference, Fifth Elephant 2013, Bangalore.
It covers how Solr 4 can be used as a data store, especially in cases where text searches need to be run on the data.

  • You can see the range of any shard in clusterstate.json. Hashing based on the “id” only has some advantages over hashing based on a different field: clients can be more generic and need not know or care what addressing scheme is being used when dealing with individual documents, because the “id” always fully defines where a document lives. This enables highly scalable multi-tenanted applications.

Transcript

  • 1. The Fifth Elephant 2013, Bangalore 12th July 2013 SolrCloud and NoSQL Anshum Gupta
  • 2. Who am I?
    • Anshum Gupta
    • Search and related stuff for around 8 years now
    • Apache Lucene since 2006, Solr since 2010
    • Currently:
    • Helped launch the first AWS search service, CloudSearch.
    • Places I've worked at:
  • 3. Big Data
    • Real Value = Process + Store + Search
    • Search
      - No longer expensive
      - Affordable
      - Necessity
      - Can get as complicated as you'd want it to get.
    • [Figure: many “Loads of Data” sources, with the labels “Data” and “Search”]
  • 4. NoSQL Databases
    • Wikipedia says: A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used.
    • Non-traditional data stores
    • Doesn't use / isn't designed around SQL
    • May not give full ACID guarantees
      - Offers other advantages such as greater scalability as a tradeoff
    • Distributed, fault-tolerant architecture
  • 5. DB Rankings: Overall (Source: http://db-engines.com/en/ranking)
  • 6. Search Engine Rankings (Source: http://db-engines.com/en/ranking/search+engine)
  • 7. MongoDB
    • Data Model: BSON
    • Distributed Model: Sharded, master-slave, async replication.
    • Consistency: Per table write lock.
    • Search:
      - Built-in full-text search, with large gaps compared to the 'search' players.
      - Alternate and popular solution: use another search solution along with MongoDB (Solr?), which brings consistency issues and more.
  • 8. Cassandra
    • Data Model: Column based data store.
    • Distributed Model: Uses consistent hashing for distributed updates.
    • Consistency: Timestamps for consistency.
    • Search
      - Lucandra: Lucene based search.
      - Solandra: Solr based search.
  • 9. Riak
    • Implements principles from the Amazon Dynamo paper.
    • Riak Search
      - Distributed index and full-text search engine.
      - Merge Index: storage backend used by Riak Search. It's a pure Erlang storage format and among other things uses the Apache Lucene file format.
      - Riak Solr: adds a subset of Apache Solr HTTP capabilities to Riak Search.
    • Yokozuna
      - “next generation of Riak Search that marries Riak with Apache Solr”.
      - Sits alongside Riak.
  • 10. The story so far…
    • Different approaches for:
      - Data Model
      - Distributed Update handling
      - Consistency management
    • Work reasonably well on different fronts as far as storage is concerned.
    • Search:
      - There's barely anything native and in the core.
      - (Almost) Everyone is trying to fuse together with Lucene/Solr.
  • 11. Adding Search to NoSQL
    • NoSQL stores weren't built for search to begin with
    • Compromises
    • Integration is the buzzword.
    • Lucandra, Solandra… No strong contender yet.
  • 12. Adding NoSQL to Search
    • Already store documents
    • With growing data, more intuitive for this to happen
    • More intuitive = makes more sense = easier (perhaps)
    • No key player as yet.
  • 13. [No text transcribed for this slide]
  • 14. Apache Solr 4 at a glance
    • Document Oriented NoSQL Search Server
      - Data-format agnostic (JSON, XML, CSV, binary)
      - Schema-less options (more coming soon)
    • Distributed
      - Multi-tenanted
    • Fault Tolerant
      - HA + No single points of failure
    • Atomic Updates
    • Optimistic Concurrency
    • Near Real-time Search
    • Full-Text search + Hit Highlighting
    • Tons of specialized queries: faceted search, grouping, pseudo-join, spatial search, functions
    • The desire for these features drove some of the “SolrCloud” architecture
  • 15. SolrCloud Design Goals
    • Automatic Distributed Indexing
    • HA for Writes
    • Durable Writes
    • Near Real-time Search
    • Real-time get
    • Optimistic Concurrency
  • 16. SolrCloud
    • Distributed indexing designed from the ground up to accommodate the desired features
    • CAP Theorem
      - Consistency, Availability, Partition Tolerance (the saying goes “choose 2”)
      - Reality: P must be handled, so the real choice is a tradeoff between C and A
    • Ended up with a CP system (roughly)
      - Values Consistency over Availability
      - Eventual consistency is incompatible with optimistic concurrency
      - Closest to MongoDB in architecture
    • We still do well with Availability
      - All N replicas of a shard must go down before we lose writability for that shard
      - For a network partition, the “big” partition remains active (i.e. Availability isn't “on” or “off”)
  • 17. SolrCloud
    • Diagram labels: request http://.../solr/collection1/query?q=awesome; load-balanced sub-request; shard1 (replica1, replica2, replica3); shard2 (replica1, replica2, replica3); ZooKeeper quorum of five ZK nodes
    • ZooKeeper tree shown on the slide:
        /configs
          /myconf
            solrconfig.xml
            schema.xml
        /clusterstate.json
        /aliases.json
        /livenodes
          server1:8983/solr
          server2:8983/solr
        /collections
          /collection1 configName=myconf
            /shards
              /shard1
                server1:8983/solr
                server2:8983/solr
              /shard2
                server3:8983/solr
                server4:8983/solr
    • ZooKeeper holds cluster state
      - Nodes in the cluster
      - Collections in the cluster
      - Schema & config for each collection
      - Shards in each collection
      - Replicas in each shard
      - Collection aliases
  • 18. Distributed Indexing
    • Diagram labels: Shard1 and Shard2; Replica1 to Replica4; document update; leader; non-leading replica
    • http://.../solr/collection1/update
    • Update sent to any node
    • Solr determines what shard the document is on, and forwards to shard leader
    • Shard leader versions the document and forwards it to all other shard replicas
    • HA for updates (if one leader fails, another takes its place)
  • 19. Optimistic Concurrency
    • Conditional update based on document version
    • Diagram between client and Solr (only steps 2 and 4 appear in the transcript):
      - 2. Modify document, retaining _version_
      - 4. Go back to step #1 if fail code=409
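The retry loop on this slide can be sketched without a running Solr instance. The following is a minimal simulation, not Solr client code: `FakeStore`, `VersionConflict` and `update_with_retry` are hypothetical names, the in-memory store stands in for the server, and the exception stands in for an HTTP 409 response. Steps 1 and 3 are not in the transcript; reading the document first and writing it back with its `_version_` are assumptions about what the diagram showed.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 response (assumption for this sketch)."""


class FakeStore:
    """In-memory stand-in for a server that versions every document."""

    def __init__(self):
        self._docs = {}
        self._clock = 0

    def get(self, doc_id):
        # Return a copy so callers cannot mutate stored state directly.
        return dict(self._docs[doc_id])

    def put(self, doc):
        current = self._docs.get(doc["id"])
        # This fake always demands a matching _version_ for existing docs.
        if current is not None and doc.get("_version_") != current["_version_"]:
            raise VersionConflict(doc["id"])
        self._clock += 1
        self._docs[doc["id"]] = dict(doc, _version_=self._clock)
        return self._clock


def update_with_retry(store, doc_id, modify, max_attempts=5):
    """Read, modify, conditionally write; start over on a version conflict."""
    for _ in range(max_attempts):
        doc = store.get(doc_id)      # step 1 (assumed): fetch doc and _version_
        modify(doc)                  # step 2: modify, retaining _version_
        try:
            return store.put(doc)    # step 3 (assumed): conditional update
        except VersionConflict:
            continue                 # step 4: back to step 1 on conflict
    raise RuntimeError(f"gave up on {doc_id} after {max_attempts} attempts")


store = FakeStore()
store.put({"id": "doc1", "count": 0})
calls = []


def bump(doc):
    if not calls:
        # Simulate a competing writer slipping in during the first attempt.
        other = store.get("doc1")
        other["count"] += 10
        store.put(other)
    calls.append(1)
    doc["count"] += 1


update_with_retry(store, "doc1", bump)
print(store.get("doc1")["count"], len(calls))
```

The first attempt loses to the competing writer and is retried against the fresh version, so neither increment is lost.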
  • 20. Distributed Query Requests
    • Distributed query across all shards in the collection
      - http://localhost:8983/solr/collection1/query?q=foo
    • Explicitly specify node addresses to load-balance across
      - shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
      - Equivalent nodes are separated by “|”
      - Different phases of the same distributed request use the same node
    • Specify logical shards to search across
      - shards=NY,NJ,CT
    • Specify multiple collections to search across
      - collection=collection1,collection2
    • public CloudSolrServer(String zkHost)
      - ZK-aware SolrJ Java client that load-balances across all nodes in the cluster
      - Calculates where a document belongs and sends it directly to the shard leader (new)
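As a rough illustration of the request variants on this slide, the sketch below only builds the URLs; it sends nothing. `query_url` is a hypothetical helper, and leaving `:`, `/`, `|` and `,` unescaped is a readability choice for this example rather than a Solr requirement.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"


def query_url(base, collection, params):
    """Build a /query URL; keep shard-list punctuation readable."""
    return f"{base}/{collection}/query?" + urlencode(params, safe=":/|,")


# Query every shard of the collection.
all_shards = query_url(BASE, "collection1", {"q": "foo"})

# Name the nodes explicitly: "," separates shards, "|" separates equivalent nodes.
explicit = query_url(BASE, "collection1", {
    "q": "foo",
    "shards": "localhost:8983/solr|localhost:8900/solr,"
              "localhost:7574/solr|localhost:7500/solr",
})

# Restrict the query to logical shards, or span several collections.
logical = query_url(BASE, "collection1", {"q": "foo", "shards": "NY,NJ,CT"})
multi = query_url(BASE, "collection1",
                  {"q": "foo", "collection": "collection1,collection2"})

for url in (all_shards, explicit, logical, multi):
    print(url)
```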
  • 21. Document Routing
    • numShards=4, router=compositeId
    • Hash ring ranges: 80000000-bfffffff, c0000000-ffffffff, 00000000-3fffffff, 40000000-7fffffff (shard1 to shard4)
    • id = BigCo!doc5 hashes to 9f27 3c71 (MurmurHash3)
    • q=my_query with shard.keys=BigCo! covers hashes 9f27 0000 to 9f27 ffff, which fall on shard1
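The routing idea on this slide can be sketched as follows, assuming the composite hash takes its upper 16 bits from the MurmurHash3 of the part before `!` and its lower 16 bits from the hash of the rest. That split is what the `9f27 0000` to `9f27 ffff` range on the slide suggests, but the bit split, the seed of 0 and the range-to-shard assignment beyond shard1 are assumptions here. All function names are hypothetical, and the output has not been checked against a real Solr cluster.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86 32-bit variant, returned as an unsigned int."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    body_len = len(data) & ~3
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = _rotl32((k * c1) & MASK32, 15)
        k = (k * c2) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    tail = data[body_len:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * c1) & MASK32, 15)
        h ^= (k * c2) & MASK32
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


# Only the shard1 range is confirmed by the slide's example; the rest is assumed.
SHARD_RANGES = {
    "shard1": (0x80000000, 0xBFFFFFFF),
    "shard2": (0xC0000000, 0xFFFFFFFF),
    "shard3": (0x00000000, 0x3FFFFFFF),
    "shard4": (0x40000000, 0x7FFFFFFF),
}


def composite_hash(doc_id: str, sep: str = "!") -> int:
    """Upper 16 bits from the prefix hash, lower 16 bits from the rest."""
    prefix, found, rest = doc_id.partition(sep)
    if not found:
        return murmur3_32(doc_id.encode("utf-8"))
    hi = murmur3_32(prefix.encode("utf-8")) & 0xFFFF0000
    lo = murmur3_32(rest.encode("utf-8")) & 0x0000FFFF
    return hi | lo


def prefix_range(prefix: str):
    """Hash range that every id sharing this prefix must fall into."""
    hi = murmur3_32(prefix.encode("utf-8")) & 0xFFFF0000
    return hi, hi | 0xFFFF


def shard_for(h: int) -> str:
    for name, (lo, hi) in SHARD_RANGES.items():
        if lo <= h <= hi:
            return name
    raise ValueError(f"hash {h:#010x} not covered")


h = composite_hash("BigCo!doc5")
print(f"{h:08x}", shard_for(h), [f"{b:08x}" for b in prefix_range("BigCo")])
```

Because the prefix fixes the upper bits, every `BigCo!…` document lands in one narrow slice of the ring, which is what lets a query with `shard.keys=BigCo!` touch a single shard.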
  • 22. Durable Writes
    • Lucene flushes writes to disk on a “commit”
      - Uncommitted docs are lost on a crash (at the Lucene level)
    • Solr 4 maintains its own transaction log
      - Contains uncommitted documents
      - Services real-time get requests
      - Recovery (log replay on restart)
      - Supports distributed “peer sync”
    • Writes forwarded to multiple shard replicas
      - A replica can go away forever without collection data loss
      - A replica can do a fast “peer sync” if it's only slightly out of date
      - A replica can do a full index replication (copy) from a leader.
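The role of the transaction log can be illustrated with a toy model. This is a sketch of the general write-ahead-log technique under simplifying assumptions, not Solr's implementation: `TinyIndex` is a hypothetical class, the log is a Python list standing in for a durable file, and real-time get is served from an in-memory view of the uncommitted documents.

```python
class TinyIndex:
    """Toy index with a transaction log for uncommitted documents."""

    def __init__(self):
        self.committed = {}  # stands in for segments flushed to disk
        self.pending = {}    # in-memory only, lost on a crash
        self.tlog = []       # stands in for a durable append-only log

    def add(self, doc):
        # Log first, then apply in memory, so a crash cannot lose the update.
        self.tlog.append(dict(doc))
        self.pending[doc["id"]] = dict(doc)

    def realtime_get(self, doc_id):
        # Uncommitted documents are visible to get-by-id before any commit.
        if doc_id in self.pending:
            return self.pending[doc_id]
        return self.committed.get(doc_id)

    def commit(self):
        # Flush pending documents; the logged entries are no longer needed.
        self.committed.update(self.pending)
        self.pending.clear()
        self.tlog.clear()

    def crash_and_restart(self):
        # Memory is wiped; replaying the log restores uncommitted documents.
        self.pending = {}
        for doc in self.tlog:
            self.pending[doc["id"]] = dict(doc)


idx = TinyIndex()
idx.add({"id": "a", "title": "first"})
idx.commit()
idx.add({"id": "b", "title": "second"})   # not yet committed
idx.crash_and_restart()
print(idx.realtime_get("a"), idx.realtime_get("b"))
```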
  • 23. Collections API
    • Create a new document collection
      - http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3
    • Actions shown: CREATE, DELETE, ALIAS, SPLITSHARD, DELETESHARD, RELOAD
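A small sketch of assembling the CREATE request from this slide. `collections_api_url` is a hypothetical helper that only builds the URL; the parameter names and values are the ones shown on the slide, and no request is actually sent.

```python
from urllib.parse import urlencode


def collections_api_url(base, action, **params):
    """Build a Collections API URL with the action as the first parameter."""
    query = {"action": action, **params}
    return f"{base}/admin/collections?{urlencode(query)}"


create_url = collections_api_url(
    "http://localhost:8983/solr",
    "CREATE",
    name="mycollection",
    numShards=4,
    replicationFactor=3,
)
print(create_url)
```

Against a live cluster the URL could then be fetched with any HTTP client, for example `urllib.request.urlopen(create_url)`.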
  • 24. Solr 4.3: Seamless Online Shard Splitting
    • Diagram labels: Shard1, Shard2 and Shard3, each with a leader and a replica; Shard2 receives an update and is split into Shard2_0 and Shard2_1
    • 1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
    • 2. New sub-shards created in “construction” state
    • 3. Leader starts forwarding applicable updates, which are buffered by the sub-shards
    • 4. Leader index is split and installed on the sub-shards
    • 5. Sub-shards apply buffered updates then become “active” leaders and old shard becomes “inactive”
  • 25. Solr 4.4: Schemaless
    • “Schemaless” usually means that the client(s) have an implicit schema.
    • “No Schema” is impossible for anything based on Lucene
      - A field must be indexed the same way across documents
    • Dynamic fields: convention over configuration
      - Only pre-define types of fields, not fields themselves
      - No guessing. Any field name ending in _i is an integer
    • “Guessed Schema” or “Type Guessing”
      - For previously unknown fields, guess using JSON type as a hint
      - Coming soon (4.4?) based on the Dynamic Schema work
    • Many disadvantages to guessing
      - Lose ability to catch field naming errors
      - Can't optimize based on types
      - Guessing incorrectly means having to start over
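The contrast between dynamic fields and type guessing can be sketched in a few lines. This is an illustration only, not Solr's logic: the function names are hypothetical, and of the suffixes below only `_i` comes from the slide; `_s` and `_b` are assumed here to show the convention.

```python
import json

# Suffix convention: only "_i" is from the slide, the others are assumptions.
DYNAMIC_SUFFIXES = {"_i": "integer", "_s": "string", "_b": "boolean"}


def type_from_name(field):
    """Dynamic-field style: the name alone decides the type, no guessing."""
    for suffix, field_type in DYNAMIC_SUFFIXES.items():
        if field.endswith(suffix):
            return field_type
    return None  # unknown name: a strict schema would reject the field


def guess_type(value):
    """Type-guessing style: use the JSON value's type as a hint."""
    if isinstance(value, bool):  # check bool first, it is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "float"
    return "string"


doc = json.loads('{"count_i": 3, "price": "10", "cuont_i": 4}')
for field, value in doc.items():
    print(field, type_from_name(field), guess_type(value))
```

The sample document hints at two of the listed disadvantages: `price` arrives as the string `"10"` and would be guessed as a string, and the misspelled `cuont_i` is accepted without complaint because the name pattern still matches.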
  • 26. Bangalore Apache Lucene/Solr Meetup
    • 1 meetup already
    • Almost 150 members
    • Another one coming up soon…
    • Join us at: http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/
  • 27. Thanks!
    • Twitter: @anshumgupta
    • LinkedIn: http://www.linkedin.com/in/anshumgupta
    • Blog: http://www.anshumgupta.net