Solr
The Search First NoSQL Database
• Mark Miller: Cloudera
employee, Lucene PMC
member, Apache member
• Started playing with
Lucene in 2006
• Lucene committer since
2008
• Solr committer since 2009
Who Am I?
My Dog
Big Data is getting Bigger
• The total Big Data market reached $11.4 billion in 2012
• The Big Data market is projected to reach $18.1 billion in
2013, an annual growth of 61%
• On pace to exceed $47 billion by 2017.
3 basic needs
• Storage
• Processing
• Search
Two Standouts in
the Big Data Market
•Hadoop
•NoSQL
Ultimately, the NoSQL market is largely up for
grabs. Each NoSQL database has its related
strengths and weaknesses, and no one NoSQL
database currently “does it all.” Big Data
practitioners must take a number of factors into
consideration when selecting a NoSQL database
to facilitate large-scale transactional workloads,
including scalability, performance, security, and
ease-of-development.
Big Data Vendor Revenue and Market Forecast
(Wikibon)
RDBMS
• The classic way to store your data.
• ACID is great, transactions are cool, SQL is well
known and understood.
• Scaling is *hard*, but possible (see Facebook’s
MySQL cluster)
• ‘impedance mismatch’ sucks
Search
• Search has been moving from an expensive,
complicated option to an affordable, easier-to-adopt
necessity.
• Lots of data begs for the ability to process it, store it,
and search it.
Enterprise Search
Engines
• Verity - acquired by Autonomy in 2005
• FAST - acquired by Microsoft in 2008
• Endeca - acquired by Oracle in 2011
• Autonomy - acquired by HP in 2011
• Vivisimo - acquired by IBM in 2012
NoSQL
• Not Only SQL rather than ‘No SQL’
• Except that makes little sense...
• “when ‘NoSQL’ is applied to a database, it refers to
an ill-defined set of mostly open-source databases,
mostly developed in the early 21st century, and
mostly not using SQL.” - NoSQL Distilled
NoSQL
• Key-Value
• Columnar
• Document
• Graph
In the beginning..
• BerkeleyDB (1991?)
• Lotus Notes (1989?)
• Bayou (1996?)
In the beginning of
the modern era...
• BigTable (Google) (started in 2004, paper in 2006)
• Dynamo (Amazon) (paper in 2007)
Derivatives
• Dynamo: Cassandra, CouchDB, Voldemort, Riak
• BigTable: Cassandra, HBase, Redis, HyperTable,
Accumulo
Also...
• AppEngine storage built on BigTable
• DynamoDB - based on the principles of Dynamo
When it comes to NoSQL,
Open Source rules the
roost.
• I won’t be talking about any solution that is not
based on Open Source - only because those
solutions are not popular.
• “there’s a notion that NoSQL is an open-source
phenomenon.” - NoSQL Distilled
The 2013 Future of Open
Source Survey Results
Black Duck and North Bridge
What’s Popular?
• NoSQL database proliferation - NoSQL databases are
a dime a dozen. Why?
• Which solutions should we look at?
indeed.com
• Indeed.com is an employment-related metasearch
engine for job listings
• Indeed is the #1 job site worldwide, with over 100
million unique visitors per month. Indeed is available
in more than 50 countries and 26 languages,
covering 94% of global GDP.
http://db-engines.com
• DB-Engines is an initiative to collect and present
information on database management systems
(DBMS). In addition to established relational DBMS,
systems and concepts of the growing NoSQL area
are emphasized.
• The DB-Engines Ranking is a list of DBMS ranked by
their current popularity. The list is updated monthly.
Popular Search Job
Trends
Popular Search
Solutions (DB-Engines)
Popular NoSQL Job
Trends
Let’s get some
context
Compare to Java
Add in Oracle...
NoSQL Database
Types
• Key-Value
• Column Family
• Document
• Graph
I’m going to ignore
Graph...everyone
else seems to...
Popular NoSQL
Document Stores
(DB-Engines)
Key-Value Stores
Columnar Stores
The Full Popularity
Contest
In case you forgot,
Oracle is in the
NoSQL game...
• Oracle NoSQL
CAP Theorem
The CAP theorem, also known as Brewer's theorem,
states that it is impossible for a distributed computer
system to simultaneously provide all three of the
following guarantees:
• Consistency (all nodes see the same data at the
same time)
• Availability (a guarantee that every request
receives a response about whether it was
successful or failed)
• Partition tolerance (the system continues to
operate despite arbitrary message loss or failure of
part of the system)
CAP
Architectures
• For NoSQL, generally boils down to AP or CP. CA
does not support partition tolerance.
• You have to trade off consistency versus availability.
• AP favors availability over consistency - this is the
eventually consistent architecture.
• CP favors consistency over availability.
• Of course, there is a continuum between AP and CP.
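One common way to picture the continuum between AP and CP is Dynamo-style quorum tuning. The sketch below is only an illustration under that assumption: `n`, `r`, and `w` (replica count, read acknowledgements, write acknowledgements) and the function name are hypothetical and do not come from the slides.

```python
def quorum_profile(n: int, r: int, w: int) -> str:
    """Classify a setup with n replicas, r read acks and w write acks."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    # When the read set and the write set must overlap, a read always
    # touches at least one replica holding the latest acknowledged write.
    if r + w > n:
        return "leans CP: reads overlap the latest write"
    # Otherwise a read can miss the newest write until replicas converge.
    return "leans AP: reads may be stale (eventually consistent)"


print(quorum_profile(3, 2, 2))
print(quorum_profile(3, 1, 1))
```

Raising `r` or `w` moves a cluster toward consistency at the cost of availability during a partition; lowering them does the opposite.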
Key Design
Decisions
• Data Model - how is the data stored/accessed
• Distribution Model - how is the data distributed
• Conflict Resolution - how is it ensured that the same
update ‘wins’ on each node.
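One simple way to make the same update win on every node is last-write-wins with a deterministic tie-break. This is a minimal sketch, assuming each update carries a timestamp and the ID of the node that accepted it; both fields and the function name are hypothetical, not taken from any particular database.

```python
from typing import NamedTuple


class Update(NamedTuple):
    timestamp: int  # hypothetical logical or wall-clock timestamp
    node_id: str    # hypothetical ID of the node that accepted the write
    value: str


def pick_winner(updates):
    # Compare by timestamp first, then by node_id to break ties, so the
    # result does not depend on the order in which updates arrived.
    return max(updates, key=lambda u: (u.timestamp, u.node_id))


a = Update(10, "node-a", "x")
b = Update(10, "node-b", "y")
c = Update(9, "node-c", "z")
print(pick_winner([a, b, c]) == pick_winner([c, b, a]))
```

Because every node applies the same ordering rule to the same set of updates, they all converge on the same value without coordinating.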
Data Model
• key -> value (opaque)
• key -> document
• column oriented
Distributed Model
• Roughly, how is data distributed across the cluster?
• Sharding, replication, etc
Data Versioning and
Consistency
• Essentially, how is data kept consistent across nodes?
• Sequential consistency—ensuring that all nodes
apply operations in the same order.
• Update consistency and read consistency.
MongoDB
• Data Model - BSON, a binary JSON format
• Distributed Model - sharded asynchronous master/
slave replication.
• Data Versioning and Consistency - Master / Slave, per
table write lock
MongoDB Search
• Built-in text search. I think of it like RDBMS built-in
full text search - major feature gaps compared with
dedicated full text search engines, and likely major
performance gaps.
• Common to sit a search engine next to MongoDB
Cassandra
• Data Model - column based, like BigTable
• Distributed Updates - similar to Dynamo, consistent
hashing, master-master
• Data Versioning and Consistency - timestamps
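The consistent hashing mentioned above can be sketched as a hash ring. This is an illustrative toy, not Cassandra's partitioner: the `HashRing` class, the use of MD5, and the `vnodes` parameter are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Any stable hash works for the illustration; MD5 is an arbitrary choice.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Place several virtual points per node on the ring to even out load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first point after the key's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```

The useful property is that adding a node only moves the keys that land on the new node's points; everything else stays where it was.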
Cassandra Search
• Lucandra
• Solandra
• DataStax Enterprise Search (Solr fields must be
strings)
HBase
• Data Model - Column Store
• Distribution Model - regions served by region
servers.
• Versioning and Consistency - strongly consistent
HBase Search
• HBasene (dead?)
• HBASE-SEARCH, HBASE-3529 (dead?)
• Solbase
• Lily
• Riak is a NoSQL database implementing the
principles from Amazon's Dynamo paper
• Data Model - stores key/value pairs in a high level
namespace called a bucket.
• Data Versioning and Consistency - Riak uses a data
structure called a vector clock to reason about
causality and staleness of stored values. (Can also
use timestamps). Last write wins, or client resolves
conflict.
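To make the vector clock idea concrete, here is a minimal sketch of comparing two clocks. It assumes a clock is a plain dict from node name to counter; the representation and the `compare` function are hypothetical and are not Riak's API.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks given as {node: counter} dicts."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        # Neither saw the other's update: siblings that need resolving.
        return "concurrent"
    if a_ahead:
        return "a descends from b"
    if b_ahead:
        return "b descends from a"
    return "equal"


print(compare({"x": 2, "y": 1}, {"x": 1, "y": 1}))
print(compare({"x": 2}, {"y": 1}))
```

Only the "concurrent" case is a real conflict; that is where last write wins or client-side resolution comes in.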
Riak Search
• Riak Search - custom search engine, Solr-like API
• Yokozuna
Yokozuna Author Enumerates
Common Reasons Custom Search
has Failed
• Pretends to be lucene/solr
• Lack of analyzer/language/features
• Bad performance/resource usage for certain queries
• Basho is not in the business of search
• CouchDB’s data format is JSON stored as documents
(self-contained records with no intrinsic
relationships), grouped into “database” namespaces.
• Conflicts are left to the application to resolve at write
time. CouchDB arbitrarily, but deterministically,
determines a winner and tracks a conflict. The client
must then resolve the conflict.
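A rough sketch of what "arbitrary but deterministic" can look like: every replica sorts the conflicting revisions the same way and takes the first. It assumes revision IDs shaped like `generation-hash` and ignores details such as deleted revisions, so treat it as a simplification rather than CouchDB's actual algorithm; `choose_winning_rev` is a hypothetical name.

```python
def choose_winning_rev(revs):
    """Return (winner, conflicts) for revision IDs such as '3-a1b2'."""
    def sort_key(rev):
        generation, _, digest = rev.partition("-")
        # Higher generation first, then the hash as an arbitrary tie-break.
        return (int(generation), digest)

    ordered = sorted(revs, key=sort_key, reverse=True)
    # The losers are kept so the client can inspect and resolve them later.
    return ordered[0], ordered[1:]


print(choose_winning_rev(["2-aaa", "1-zzz", "2-bbb"]))
```

Since the rule depends only on the revision IDs, every replica picks the same winner no matter the order in which it received the revisions.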
CouchDB Search
• CouchDB-Lucene
• Seems people usually just sit a search engine next to
CouchDB
• Redis is an open-source, networked, in-memory, key-
value data store with optional durability.
• Memcached is a general-purpose distributed memory
caching system
• Redis-Search
Adding Search to
NoSQL
• Hard to do without a lot of compromise
• Build your own, or use Lucene or Lucene based
solution
• Nothing has yet set the world on fire...
Adding NoSQL to
Search
• Search solutions are generally already a Document
based NoSQL solution.
• Seems a lot easier to do than the reverse
• Nothing has yet set the world on fire...
Solr NoSQL
Features
• Realtime-Get
• Update Durability
• Atomic Compare and Set
• Versioning and optimistic locking
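The versioning and optimistic locking features can be illustrated with a small in-memory model. It follows the rules Solr documents for the `_version_` field (greater than 1 must match exactly, 1 means the document must exist, negative means it must not exist, 0 means no check), but `TinyStore`, `VersionConflict`, and the version counter are hypothetical stand-ins, not Solr code.

```python
import itertools


class VersionConflict(Exception):
    """Stands in for the HTTP 409 that a version mismatch would produce."""


class TinyStore:
    def __init__(self):
        self._docs = {}                     # id -> (version, doc)
        self._clock = itertools.count(100)  # hypothetical version generator

    def get(self, doc_id):
        """Real-time get: the latest (version, doc) for an id, or None."""
        return self._docs.get(doc_id)

    def update(self, doc_id, doc, version=0):
        current = self._docs.get(doc_id)
        if version > 1 and (current is None or current[0] != version):
            raise VersionConflict("version mismatch")
        if version == 1 and current is None:
            raise VersionConflict("document must exist")
        if version < 0 and current is not None:
            raise VersionConflict("document must not exist")
        # version == 0 skips the check entirely: last write wins.
        new_version = next(self._clock)
        self._docs[doc_id] = (new_version, doc)
        return new_version


store = TinyStore()
v1 = store.update("doc1", {"stock": 5}, version=-1)  # create only if absent
v2 = store.update("doc1", {"stock": 4}, version=v1)  # compare and set
print(store.get("doc1"))
```

A writer holding a stale version gets a conflict instead of silently overwriting, then re-reads with a real-time get and retries.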
Schemaless?
• NoSQL databases are generally ‘schemaless’
• In some ways convenient, in other ways not.
• Implicit schema moves to application code.
• Can’t optimize based on types.
• Note: some are calling ‘guessed’ schemas
schemaless.
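A 'guessed' schema can be pictured as follows: the first value seen for a field fixes its type, so a schema still exists, it was just inferred. This is a toy sketch; `guess_schema` is a hypothetical helper and does not reflect how any specific database infers types.

```python
def guess_schema(docs):
    """Infer {field: type name} from documents; reject later mismatches."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            guessed = type(value).__name__
            # The first value seen fixes the field's type.
            if schema.setdefault(field, guessed) != guessed:
                raise TypeError(
                    f"{field}: expected {schema[field]}, got {guessed}"
                )
    return schema


print(guess_schema([{"id": "1", "price": 10}, {"id": "2", "price": 12}]))
```

Once the guess is made, the store can optimize on types again, which is exactly why it is arguably not schemaless.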
SolrCloud
• Most similar to the MongoDB architecture
• A CP system, though currently, eventually consistent.
• The architecture supports adding strong consistency
options.
SolrCloud
• The length of time an inconsistency is present is
called the inconsistency window.
• SolrCloud has a very small inconsistency window.
Data Model
• key -> document
• Optionally, column oriented
Contact Info
• @heismark
• markrmiller@gmail.com

Solr cloud the 'search first' nosql database extended deep dive
