SolrCloud: The 'Search First' NoSQL Database, Extended Deep Dive


Published on

Presented by Mark Miller, Software Engineer, Cloudera

As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliaths collide? Or will they remain specialized while intermingling, two sides of the same coin?
Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the Big Data, NoSQL, and search ecosystem evolve? If you are interested in Big Data, NoSQL, distributed systems, the CAP theorem, and other hype-filled terms, then this talk may be for you.

Published in: Education, Technology

  1. Solr: The Search First NoSQL Database
  2. Who Am I? • Mark Miller: Cloudera employee, Lucene PMC member, Apache member • Started playing with Lucene in 2006 • Lucene committer since 2008 • Solr committer since 2009
  3. My Dog
  4. Big Data is getting Bigger • The total Big Data market reached $11.4 billion in 2012 • The Big Data market is projected to reach $18.1 billion in 2013, an annual growth of 61% • On pace to exceed $47 billion by 2017.
  5. 3 basic needs • Storage • Processing • Search
  6. Two Standouts in the Big Data Market • Hadoop • NoSQL
  7. Ultimately, the NoSQL market is largely up for grabs. Each NoSQL database has its related strengths and weaknesses, and no one NoSQL database currently “does it all.” Big Data practitioners must take a number of factors into consideration when selecting a NoSQL database to facilitate large-scale transactional workloads, including scalability, performance, security, and ease-of-development. - Big Data Vendor Revenue and Market Forecast (Wikibon)
  8. RDBMS • The classic way to store your data. • ACID is great, transactions are cool, SQL is well known and understood. • Scaling is *hard*, but possible (see Facebook’s MySQL cluster) • ‘impedance mismatch’ sucks
  9. Search • Search has been moving from an expensive, complicated option to an affordable, easier-to-use necessity. • Lots of data begs for the ability to process it, store it, and search it.
  10. Enterprise Search Engines • Verity - acquired by Autonomy in 2005 • FAST - acquired by Microsoft in 2008 • Endeca - acquired by Oracle in 2011 • Autonomy - acquired by HP in 2011 • Vivisimo - acquired by IBM in 2012
  11. NoSQL • Not Only SQL rather than ‘No SQL’ • Except that makes little sense... • “when ‘NoSQL’ is applied to a database, it refers to an ill-defined set of mostly open-source databases, mostly developed in the early 21st century, and mostly not using SQL.” - NoSQL Distilled
  12. NoSQL • Key-Value • Columnar • Document • Graph
  13. In the beginning... • BerkeleyDB (1991?) • Lotus Notes (1989?) • Bayou (1996?)
  14. In the beginning of the modern era... • BigTable (Google) (started in 2004, paper in 2006) • Dynamo (Amazon) (paper in 2007)
  15. Derivatives • Dynamo: Cassandra, CouchDB, Voldemort, Riak • BigTable: Cassandra, HBase, Redis, HyperTable, Accumulo
  16. Also... • AppEngine storage built on BigTable • DynamoDB - based on the principles of Dynamo
  17. When it comes to NoSQL, Open Source rules the roost. • I won’t be talking about any solution that is not based on Open Source - only because those solutions are not popular. • “there’s a notion that NoSQL is an open-source phenomenon.” - NoSQL Distilled
  18. The 2013 Future of Open Source Survey Results, Black Duck and North Bridge
  19. What’s Popular? • NoSQL database proliferation - NoSQL databases are a dime a dozen. Why? • Which solutions should we look at?
  20. • Indeed is an employment-related metasearch engine for job listings • Indeed is the #1 job site worldwide, with over 100 million unique visitors per month. Indeed is available in more than 50 countries and 26 languages, covering 94% of global GDP.
  21. • DB-Engines is an initiative to collect and present information on database management systems (DBMS). In addition to established relational DBMS, systems and concepts of the growing NoSQL area are emphasized. • The DB-Engines Ranking is a list of DBMS ranked by their current popularity. The list is updated monthly.
  22. Popular Search Job Trends
  23. Popular Search Solutions (DB-Engines)
  24. Popular NoSQL Job Trends
  25. Let’s get some context
  26. Compare to Java
  27. Add in Oracle...
  28. NoSQL Database Types • Key-Value • Column Family • Document • Graph
  29. I’m going to ignore Graph... everyone else seems to...
  30. Popular NoSQL Document Stores (DB-Rankings)
  31. Key-Value Stores
  32. Columnar Stores
  33. The Full Popularity Contest
  34. In case you forgot, Oracle is in the NoSQL game... • Oracle NoSQL
  35. CAP Theorem: The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: • Consistency (all nodes see the same data at the same time) • Availability (a guarantee that every request receives a response about whether it was successful or failed) • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
  36. CAP
  37. Architectures • For NoSQL, generally boils down to AP or CP. CA does not support partition tolerance. • You have to trade off consistency versus availability. • AP favors availability over consistency - this is the eventually consistent architecture. • CP favors consistency over availability. • Of course, there is a continuum between AP and CP.
  38. Key Design Decisions • Data Model - how is the data stored/accessed • Distribution Model - how is the data distributed • Conflict Resolution - how is it ensured that the same update ‘wins’ on each node.
  39. Data Model • key -> value (opaque) • key -> document • column oriented
  40. Distributed Model • Roughly, how is data distributed across the cluster? • Sharding, replication, etc.
  41. Data Versioning and Consistency • Essentially, how is data kept consistent across nodes? • Sequential consistency - ensuring that all nodes apply operations in the same order. • Update consistency and read consistency.
  42. MongoDB • Data Model - BSON - binary JSON format • Distributed Model - sharded asynchronous master/slave replication. • Data Versioning and Consistency - Master / Slave, per table write lock
  43. MongoDB Search • Built in text search. I think of it like RDBMS built in full text search - major feature gaps with dedicated full text search engines, and likely major performance gaps. • Common to sit a search engine next to MongoDB
  44. Cassandra • Data Model - column based, like BigTable • Distributed Updates - similar to Dynamo, consistent hashing, master-master • Data Versioning and Consistency - timestamps
  45. Cassandra Search • Lucandra • Solandra • DataStax Enterprise Search (Solr fields must be strings)
  46. HBase • Data Model - Column Store • Distribution Model - regions served by region servers. • Versioning and Consistency - strongly consistent
  47. HBase Search • HBasene (dead?) • HBASE-SEARCH, HBASE-3529 (dead?) • Solbase • Lily
  48. • Riak is a NoSQL database implementing the principles from Amazon’s Dynamo paper • Data Model - stores key/value pairs in a high level namespace called a bucket. • Data Versioning and Consistency - Riak uses a data structure called a vector clock to reason about causality and staleness of stored values. (Can also use timestamps). Last write wins, or client resolves conflict.
  49. Riak Search • Riak Search - custom search engine, Solr-like API • Yokozuna
  50. Yokozuna Author Enumerates Common Reasons Custom Search has Failed • Pretends to be Lucene/Solr • Lack of analyzer/language/features • Bad performance/resource usage for certain queries • Basho is not in the business of search
  51. • CouchDB’s data format is JSON stored as documents (self-contained records with no intrinsic relationships), grouped into “database” namespaces. • Conflicts are left to the application to resolve at write time. CouchDB arbitrarily, but deterministically, determines a winner and tracks a conflict. The client must then resolve the conflict.
  52. CouchDB Search • CouchDB-Lucene • Seems people usually just sit a search engine next to CouchDB
  53. • Redis is an open-source, networked, in-memory, key-value data store with optional durability. • Memcached is a general-purpose distributed memory caching system • Redis-Search
  54. Adding Search to NoSQL • Hard to do without a lot of compromise • Build your own, or use Lucene or a Lucene-based solution • Nothing has yet set the world on fire...
  55. Adding NoSQL to Search • Search solutions are generally already a Document based NoSQL solution. • Seems a lot easier to do than the reverse • Nothing has yet set the world on fire...
  56. Solr NoSQL Features • Realtime-Get • Update Durability • Atomic Compare and Set • Versioning and optimistic locking
  57. Schemaless? • NoSQL databases are generally ‘schemaless’ • In some ways convenient, in other ways not. • Implicit schema moves to application code. • Can’t optimize based on types. • Note: some are calling ‘guessed’ schemas schemaless.
  58. SolrCloud • Most similar to the MongoDB architecture • A CP system, though currently, eventually consistent. • The architecture supports adding strong consistency options.
  59. SolrCloud • The length of time an inconsistency is present is called the inconsistency window. • SolrCloud has a very small inconsistency window.
  60. Data Model • key -> document • Optionally, column oriented
  61. Contact Info • @heismark
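
The "Atomic Compare and Set" and "Versioning and optimistic locking" items on the Solr NoSQL Features slide can be illustrated with a small sketch. This is a toy in-memory store, not Solr's API or implementation: `VersionedStore`, `compare_and_set`, and `VersionConflict` are hypothetical names made up for the example. The assumption is simply that each document carries a version number and that a write is accepted only when the caller presents the version it last read.

```python
import threading


class VersionConflict(Exception):
    """Raised when the caller's expected version is stale (hypothetical name)."""


class VersionedStore:
    """Toy in-memory document store with optimistic locking.

    Illustration only: this is not how Solr is implemented.
    """

    def __init__(self):
        self._docs = {}  # key -> (version, document)
        self._lock = threading.Lock()

    def get(self, key):
        """Return (version, document), or (0, None) if the key is absent."""
        with self._lock:
            return self._docs.get(key, (0, None))

    def compare_and_set(self, key, expected_version, document):
        """Write document only if the stored version equals expected_version.

        Returns the new version on success; raises VersionConflict otherwise.
        """
        with self._lock:
            current_version, _ = self._docs.get(key, (0, None))
            if current_version != expected_version:
                raise VersionConflict(
                    f"{key}: expected v{expected_version}, found v{current_version}"
                )
            new_version = current_version + 1
            self._docs[key] = (new_version, document)
            return new_version


store = VersionedStore()
v1 = store.compare_and_set("doc1", 0, {"title": "first"})  # version 0 means "create"
version, doc = store.get("doc1")

# Two writers both read the same version; only the first update is accepted.
v2 = store.compare_and_set("doc1", version, {"title": "writer A"})
try:
    store.compare_and_set("doc1", version, {"title": "writer B"})
    conflict = False
except VersionConflict:
    conflict = True  # writer B would have to re-read and retry
```

No lock is held between the read and the write. A conflicting writer is rejected at write time and retries, which is why the approach is called optimistic.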