SolrCloud: The 'Search First' NoSQL Database - Extended Deep Dive

Presented by Mark Miller, Software Engineer, Cloudera

As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliaths collide? Or will they remain specialized while intermingling, two sides of the same coin?
Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the Big Data, NoSQL, and search ecosystem evolve? If you are interested in Big Data, NoSQL, distributed systems, the CAP theorem and other hype-filled terms, then this talk may be for you.

Presentation Transcript

  • Solr: The Search First NoSQL Database
  • Who Am I?
    • Mark Miller: Cloudera employee, Lucene PMC member, Apache member
    • Started playing with Lucene in 2006
    • Lucene committer since 2008
    • Solr committer since 2009
  • My Dog
  • Big Data is getting Bigger
    • The total Big Data market reached $11.4 billion in 2012
    • The Big Data market is projected to reach $18.1 billion in 2013, an annual growth of 61%
    • On pace to exceed $47 billion by 2017.
  • 3 basic needs
    • Storage
    • Processing
    • Search
  • Two Standouts in the Big Data Market
    • Hadoop
    • NoSQL
  • “Ultimately, the NoSQL market is largely up for grabs. Each NoSQL database has its related strengths and weaknesses, and no one NoSQL database currently ‘does it all.’ Big Data practitioners must take a number of factors into consideration when selecting a NoSQL database to facilitate large-scale transactional workloads, including scalability, performance, security, and ease-of-development.” - Big Data Vendor Revenue and Market Forecast (Wikibon)
  • RDBMS
    • The classic way to store your data.
    • ACID is great, transactions are cool, SQL is well known and understood.
    • Scaling is *hard*, but possible (see Facebook’s MySQL cluster)
    • ‘impedance mismatch’ sucks
  • Search
    • Search has been moving from an expensive, complicated option to an affordable and easier necessity.
    • Lots of data begs for the ability to process it, store it, and search it.
  • Enterprise Search Engines
    • Verity - acquired by Autonomy in 2005
    • FAST - acquired by Microsoft in 2008
    • Endeca - acquired by Oracle in 2011
    • Autonomy - acquired by HP in 2011
    • Vivisimo - acquired by IBM in 2012
  • NoSQL
    • Not Only SQL rather than ‘No SQL’
    • Except that makes little sense...
    • “when ‘NoSQL’ is applied to a database, it refers to an ill-defined set of mostly open-source databases, mostly developed in the early 21st century, and mostly not using SQL.” - NoSQL Distilled
  • NoSQL
    • Key-Value
    • Columnar
    • Document
    • Graph
  • In the beginning...
    • BerkeleyDB (1991?)
    • Lotus Notes (1989?)
    • Bayou (1996?)
  • In the beginning of the modern era...
    • BigTable (Google) (started in 2004, paper in 2006)
    • Dynamo (Amazon) (paper in 2007)
  • Derivatives
    • Dynamo: Cassandra, CouchDB, Voldemort, Riak
    • BigTable: Cassandra, HBase, Redis, HyperTable, Accumulo
  • Also...
    • AppEngine storage built on BigTable
    • DynamoDB - based on the principles of Dynamo
  • When it comes to NoSQL, Open Source rules the roost.
    • I won’t be talking about any solution that is not based on Open Source - only because those solutions are not popular.
    • “there’s a notion that NoSQL is an open-source phenomenon.” - NoSQL Distilled
  • The 2013 Future of Open Source Survey Results
    • Black Duck and North Bridge
  • What’s Popular?
    • NoSQL database proliferation - NoSQL databases are a dime a dozen. Why?
    • Which solutions should we look at?
  • indeed.com
    • Indeed.com is an employment-related metasearch engine for job listings
    • Indeed is the #1 job site worldwide, with over 100 million unique visitors per month. Indeed is available in more than 50 countries and 26 languages, covering 94% of global GDP.
  • http://db-engines.com
    • DB-Engines is an initiative to collect and present information on database management systems (DBMS). In addition to established relational DBMS, systems and concepts of the growing NoSQL area are emphasized.
    • The DB-Engines Ranking is a list of DBMS ranked by their current popularity. The list is updated monthly.
  • Popular Search Job Trends
  • Popular Search Solutions (DB-Engines)
  • Popular NoSQL Job Trends
  • Let’s get some context
  • Compare to Java
  • Add in Oracle...
  • NoSQL Database Types
    • Key-Value
    • Column Family
    • Document
    • Graph
  • I’m going to ignore Graph... everyone else seems to...
  • Popular NoSQL Document Stores (DB-Rankings)
  • Key-Value Stores
  • Columnar Stores
  • The Full Popularity Contest
  • In case you forgot, Oracle is in the NoSQL game...
    • Oracle NoSQL
  • CAP Theorem
    The CAP theorem, also known as Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
    • Consistency (all nodes see the same data at the same time)
    • Availability (a guarantee that every request receives a response about whether it was successful or failed)
    • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
  • CAP
  • Architectures
    • For NoSQL, it generally boils down to AP or CP. CA does not support partition tolerance.
    • You have to trade off consistency versus availability.
    • AP favors availability over consistency - this is the eventually consistent architecture.
    • CP favors consistency over availability.
    • Of course, there is a continuum between AP and CP.
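
The AP versus CP trade-off above can be made concrete with a toy model. The sketch below is only an illustration under stated assumptions: `TwoNodeStore`, its `mode` flag, and the `partitioned` switch are hypothetical names invented here, and no real database is this simple. In "CP" mode a write is refused while the two replicas cannot reach each other; in "AP" mode the write is accepted locally and the replicas diverge until they are reconciled.

```python
class Replica:
    """One node's local copy of the data."""

    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store. `mode` is "CP" or "AP" (hypothetical names)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = {"a": Replica(), "b": Replica()}
        self.partitioned = False  # set True to simulate a network partition

    def write(self, node, key, value):
        local = self.replicas[node]
        remote = self.replicas["b" if node == "a" else "a"]
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse rather than let replicas diverge.
                raise RuntimeError("unavailable: cannot reach peer replica")
            # Availability first: accept locally, reconcile after the partition.
            local.data[key] = value
            return
        # No partition: both replicas apply the write.
        local.data[key] = value
        remote.data[key] = value

    def read(self, node, key):
        return self.replicas[node].data.get(key)
```

With `partitioned = True`, the AP store answers every request but the two nodes may return different values for the same key; the CP store never returns divergent values but rejects writes.
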
  • Key Design Decisions
    • Data Model - how is the data stored/accessed
    • Distribution Model - how is the data distributed
    • Conflict Resolution - how is it ensured that the same update ‘wins’ on each node.
  • Data Model
    • key -> value (opaque)
    • key -> document
    • column oriented
  • Distributed Model
    • Roughly, how is data distributed across the cluster?
    • Sharding, replication, etc
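
As a rough illustration of sharding plus replication, the sketch below routes a document to a shard by hashing its id and then picks replica nodes for that shard. The function names, the use of MD5, and the "next node along" replica placement are all assumptions made for the example; this is not the routing scheme of any particular product.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document id to a shard number using a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def replicas_for(shard: int, replication_factor: int, num_nodes: int) -> list:
    """Place a shard's copies on consecutive nodes, wrapping around."""
    return [(shard + i) % num_nodes for i in range(replication_factor)]


if __name__ == "__main__":
    shard = shard_for("doc-42", num_shards=4)
    print("shard:", shard, "nodes:", replicas_for(shard, 2, num_nodes=4))
```

A plain modulo scheme like this is easy to reason about, but changing `num_shards` remaps most documents, which is one reason some systems prefer hash ranges or consistent hashing.
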
  • Data Versioning and Consistency
    • Essentially, how is data kept consistent across nodes?
    • Sequential consistency - ensuring that all nodes apply operations in the same order.
    • Update consistency and read consistency.
  • MongoDB
    • Data Model - BSON, a binary JSON format
    • Distributed Model - sharded asynchronous master/slave replication.
    • Data Versioning and Consistency - master/slave, per-table write lock
  • MongoDB Search
    • Built-in text search. I think of it like RDBMS built-in full text search - major feature gaps with dedicated full text search engines, and likely major performance gaps.
    • Common to sit a search engine next to MongoDB
  • Cassandra
    • Data Model - column based, like BigTable
    • Distributed Updates - similar to Dynamo, consistent hashing, master-master
    • Data Versioning and Consistency - timestamps
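
Consistent hashing, mentioned above as the Dynamo-style distribution technique, can be sketched in a few lines. This is a generic textbook version under stated assumptions (the `HashRing` class, the MD5 hash, and the choice of 64 virtual nodes per server are all made up for illustration), not Cassandra's actual partitioner.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Stable 64-bit position on the ring."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each node owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's hash.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    keys = [f"key-{i}" for i in range(1000)]
    before = HashRing(["n1", "n2", "n3"])
    after = HashRing(["n1", "n2", "n3", "n4"])
    moved = sum(before.node_for(k) != after.node_for(k) for k in keys)
    print(f"{moved} of {len(keys)} keys moved after adding n4")
```

The property worth noticing is that adding a node only moves keys onto the new node; keys never shuffle between the existing nodes.
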
  • Cassandra Search
    • Lucandra
    • Solandra
    • DataStax Enterprise Search (Solr fields must be strings)
  • HBase
    • Data Model - Column Store
    • Distribution Model - regions served by region servers.
    • Versioning and Consistency - strongly consistent
  • HBase Search
    • HBasene (dead?)
    • HBASE-SEARCH, HBASE-3529 (dead?)
    • Solbase
    • Lily
  • Riak
    • Riak is a NoSQL database implementing the principles from Amazon’s Dynamo paper
    • Data Model - stores key/value pairs in a high level namespace called a bucket.
    • Data Versioning and Consistency - Riak uses a data structure called a vector clock to reason about causality and staleness of stored values. (Can also use timestamps.) Last write wins, or client resolves conflict.
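
To show how a vector clock lets a store reason about causality and staleness, here is a minimal generic sketch. It represents a clock as a plain dict from node name to counter; the function names are hypothetical and this is not Riak's implementation or API.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def compare(a: dict, b: dict) -> str:
    """Classify the causal relation of clock `a` to clock `b`."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a is an ancestor of b, so a is stale
    if b_le_a:
        return "after"       # a descends from b, so b is stale
    return "concurrent"      # neither saw the other: a real conflict


def merge(a: dict, b: dict) -> dict:
    """Clock for a value that reconciles two siblings."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

Two writes made on different nodes from the same starting clock compare as "concurrent", which is the case where the store must either apply last write wins or hand both values to the client to resolve.
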
  • Riak Search
    • Riak Search - custom search engine, Solr-like API
    • Yokozuna
  • Yokozuna Author Enumerates Common Reasons Custom Search has Failed
    • Pretends to be Lucene/Solr
    • Lack of analyzer/language/features
    • Bad performance/resource usage for certain queries
    • Basho is not in the business of search
  • CouchDB
    • CouchDB’s data format is JSON stored as documents (self-contained records with no intrinsic relationships), grouped into “database” namespaces.
    • Conflicts are left to the application to resolve at write time. CouchDB arbitrarily, but deterministically, determines a winner and tracks a conflict. The client must then resolve the conflict.
  • CouchDB Search
    • CouchDB-Lucene
    • Seems people usually just sit a search engine next to CouchDB
  • Redis
    • Redis is an open-source, networked, in-memory, key-value data store with optional durability.
    • Memcached is a general-purpose distributed memory caching system
    • Redis-Search
  • Adding Search to NoSQL
    • Hard to do without a lot of compromise
    • Build your own, or use Lucene or a Lucene-based solution
    • Nothing has yet set the world on fire...
  • Adding NoSQL to Search
    • Search solutions are generally already a Document based NoSQL solution.
    • Seems a lot easier to do than the reverse
    • Nothing has yet set the world on fire...
  • Solr NoSQL Features
    • Realtime-Get
    • Update Durability
    • Atomic Compare and Set
    • Versioning and optimistic locking
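
Versioning with optimistic locking can be illustrated with a small in-memory model. The check rules below follow the `_version_` semantics documented for Solr's optimistic concurrency (a value greater than 1 must match the stored version, 1 means the document must exist, a negative value means it must not exist, 0 means no check). Everything else is an assumption for the sketch: `VersionedStore` and `VersionConflict` are hypothetical names, versions come from a simple counter rather than Solr's clock-derived values, and a real client would talk to Solr over HTTP.

```python
import itertools


class VersionConflict(Exception):
    """Raised when the caller's expected version does not hold."""


class VersionedStore:
    """Toy in-memory store with compare-and-set on a per-document version."""

    def __init__(self):
        self._docs = {}                        # id -> (version, document)
        self._versions = itertools.count(100)  # stay clear of sentinel values 0 and 1

    def get(self, doc_id):
        """Return (version, document) or None; stands in for a realtime get."""
        return self._docs.get(doc_id)

    def update(self, doc_id, doc, version=0):
        current = self._docs.get(doc_id)
        if version > 1 and (current is None or current[0] != version):
            raise VersionConflict("stored version does not match")
        if version == 1 and current is None:
            raise VersionConflict("document does not exist")
        if version < 0 and current is not None:
            raise VersionConflict("document already exists")
        new_version = next(self._versions)
        self._docs[doc_id] = (new_version, dict(doc))
        return new_version
```

The usual client loop is: read the document and its version, modify it, write it back with that version, and on a conflict re-read and retry.
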
  • Schemaless?
    • NoSQL databases are generally ‘schemaless’
    • In some ways convenient, in other ways not.
    • Implicit schema moves to application code.
    • Can’t optimize based on types.
    • Note: some are calling ‘guessed’ schemas schemaless.
  • SolrCloud
    • Most similar to the MongoDB architecture
    • A CP system, though currently, eventually consistent.
    • The architecture supports adding strong consistency options.
  • SolrCloud
    • The length of time an inconsistency is present is called the inconsistency window.
    • SolrCloud has a very small inconsistency window.
  • Data Model
    • key -> document
    • Optionally, column oriented
  • Contact Info
    • @heismark
    • markrmiller@gmail.com