The NoSQL movement has introduced four new database architectural patterns that complement, but do not replace, traditional relational and analytical databases. This presentation introduces these four patterns and discusses their relative strengths and weaknesses for solving a variety of business problems, including Big Data (scalability), search, high availability and agility. For each type of problem we look at how NoSQL databases take different approaches to solving it and how you can use this knowledge to find the right database architecture for your business challenges.
2. Background for Dan McCreary
• Bell Labs
• NeXT Computer (Steve Jobs)
• Owner of 75-person software consulting firm
• US Federal data integration (National Information Exchange Model, NIEM.gov)
• Native XML/XQuery for metadata management since 2006
• Advocate of web standards, NoSQL and XRX systems
Copyright Kelly-McCreary & Associates, LLC
3. Making Sense of NoSQL
• Working with NoSQL since 2006
• Co-founders of the NoSQL Now! conference
• Authors of Manning book on NoSQL (MEAP now, print July 2013)
• Guide for managers with a focus on business benefits
• Focus on NoSQL architectural tradeoff analysis
• http://manning.com/mccreary
4. Today
1. What are the new database "architecture patterns" introduced by the NoSQL movement?
2. What types of problems do they address?
3. How do you match the right problem with the right database pattern?
5. Three Eras of Databases
• RDBMS for transactions, Data Warehouse for analytics and NoSQL for …?
[Figure: timeline of three eras. 1985-1995: RDBMS. 1995-2010: RDBMS and Data Warehouse. 2010-Now: RDBMS, Data Warehouse and NoSQL.]
7. Pressures on Single Node RDBMS Architectures
[Figure: a Single Node RDBMS under pressure from OLAP/BI/Data Warehouse, Social Networks, Scalability, and Agile Schema Free]
9. Before NoSQL DB Selection Was Easy!
[Flowchart: Start. Does it look like a document? Yes: use Microsoft Office. No: use the RDBMS. Stop.]
10. An evolving tree of data types
[Figure: tree of data types. Labels: Read Mostly, Read/Write, Structured, Unstructured, Transactional, RDBMS, BI/DW, Web Crawlers, Documents, Log Files, XML, JSON, Binary, Open Linked Data, Graph.]
11. Many Uses of Data
• Transactions (OLTP)
• Analysis (OLAP)
• Search and Findability
• Enterprise Agility
• Discovery and Insight
• Speed and Reliability
• Consistency and Availability
12. Strong Selection Bias
Anchoring bias - the tendency to produce an estimate near a cue amount - "Our managers were expecting an RDBMS solution so that’s what we gave them."
Availability heuristic - the tendency to estimate that what is easily remembered is more likely than that which is not - "I hear that NoSQL does not support ACID." or "I hear that XML is verbose."
Bandwagon effect - the tendency to do or believe what others do or believe - "Everyone else at this company and in our local area uses RDBMSs."
Confirmation bias - the tendency to seek out only that information that supports one's preconceptions - "We only read posts from the Oracle|Microsoft|IBM groups."
Framing effect - the tendency to react to how information is framed, beyond its factual content - "We know of some NoSQL projects that failed."
Gambler's fallacy (aka sunk cost bias) - the failure to reset one's expectations based on one's current situation - "We already paid for our Oracle|Microsoft|IBM license so why spend more money?"
Hindsight bias - the tendency to assess one's previous decisions as more efficacious than they were - "Our last five systems worked on RDBMS solutions."
Halo effect - the tendency to attribute unverified capabilities to a person based on an observed capability - "Oracle|Microsoft|IBM sells billions of dollars of licenses each year, how could so many people be wrong?"
Representativeness heuristic - the tendency to judge something as belonging to a class based on a few salient characteristics - "Our accounting systems work on RDBMS so why not our product search?"
13. Simplicity is a Virtue
• Many modern systems derive their strength by dramatically limiting the features in their system and focusing on a specific task
• Simplicity allows database designers to focus on the primary business drivers
Photo from flickr by PSNZ Images
14. Simplicity is a Design Style
• Focus only on simple systems that solve many problems in a flexible way
• Examples:
– Touch screen interfaces
– Key/Value data stores
15. RDBMS vs. NoSQL
• NoSQL is real and it’s here to stay
http://www.google.com/trends/explore#q=nosql%2C%20rdbms&date=1%2F2009%2051m&cmpt=q
[Figure: Google Trends chart comparing the search terms RDBMS and NoSQL]
16. Eric Evans
“The whole point of seeking alternatives [to RDBMS systems] is that you need to solve a problem that relational databases are a bad fit for.”
Eric Evans, Rackspace
17. The NO-SQL Universe
[Figure: the NoSQL universe. Labels: Document Stores, Key-Value Stores, Graph/Triple Stores, Object Stores, Column-Family Stores, XML.]
Copyright 2010 Dan McCreary & Associates
18. Relational
• Data is usually stored row by row (row store)
• Standardized query language (SQL)
• Data model defined before you add data
• Joins merge data from multiple tables
• Results are tables
• Pros: mature ACID transactions with fine-grained security controls
• Cons: requires up-front data modeling, does not scale out well
Examples: Oracle, MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2
19. Analytical (OLAP)
• Based on "Star" schema with central fact table for each event
• Optimized for read-only analysis of historical data
• Use of MDX language to count query "measures" for "categories" of data
• Pros: fast queries for large data
• Cons: not optimized for transactions and updates
Examples: Cognos, Hyperion, MicroStrategy, Pentaho, Microsoft, Oracle, Business Objects
20. Key-Value Stores
• Keys used to access opaque blobs of data
• Values can contain any type of data (images, video)
• Pros: scalable, simple API (put, get, delete)
• Cons: no way to query based on the content of the value
[Figure: table of key/value pairs]
Examples: Berkeley DB, Memcached, DynamoDB, S3, Redis, Riak
21. Key-Value Stores
• A table with two columns and a simple interface
– Add a key-value pair
– For this key, give me the value
– Delete a key
• Blazingly fast and easy to scale (no joins)
[Figure: two-column table. Key: string datatype. Value: blob datatype.]
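The three-operation interface described above can be sketched in a few lines. This is a minimal in-memory illustration, not the API of any particular product; the class name `KeyValueStore` and the sample keys are made up for the example. Note that the store never looks inside a value, which is why there is no way to query by content.

```python
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: string keys, opaque byte values."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Add or overwrite a key-value pair.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # "For this key, give me the value"; None if the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        # Delete a key; deleting a missing key is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:42:avatar", b"\x89PNG...")
store.put("user:42:profile", b'{"name": "Ann", "city": "Oslo"}')

avatar = store.get("user:42:avatar")
store.delete("user:42:avatar")
```

Finding "all users in Oslo" would require the application to fetch and parse every value itself, since the store only understands keys.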
24. No Subset Queries in Key-Value Stores
25. Types of Key-Value Stores
• Eventually consistent key-value stores
• Hierarchical key-value stores
• Key-value stores in RAM
• Key-value stores on disk
• High-availability key-value stores
• Ordered key-value stores
• Values that allow simple list operations
26. Memcached
• Open source in-memory key-value caching system
• Makes effective use of RAM on many distributed web servers
• Designed to speed up dynamic web applications by alleviating database load
• RAM-resident key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering
• Simple interface for highly distributed RAM caches
• 30ms read times typical
• Designed for quick deployment, ease of development
• APIs in many languages
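The way a RAM cache relieves database load can be sketched as a "check the cache first" lookup. The example below is an assumption-laden illustration: a plain dict stands in for the cache, and `slow_database_lookup` and `cached_lookup` are hypothetical names, not memcached client calls.

```python
from typing import Callable, Dict

cache: Dict[str, str] = {}   # stand-in for a RAM cache such as memcached
db_calls = 0                 # counts how often the slow path runs


def slow_database_lookup(user_id: str) -> str:
    # Hypothetical expensive query; here it only builds a string.
    global db_calls
    db_calls += 1
    return f"profile-for-{user_id}"


def cached_lookup(user_id: str,
                  loader: Callable[[str], str] = slow_database_lookup) -> str:
    key = f"profile:{user_id}"
    if key in cache:              # cache hit: no database work
        return cache[key]
    value = loader(user_id)       # cache miss: go to the database once
    cache[key] = value            # store the result for later requests
    return value


first = cached_lookup("42")
second = cached_lookup("42")
```

The second call is served from RAM, so the database is queried only once for the two requests.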
27. Riak
• Open source distributed key-value store with support and commercial versions by Basho
• A "Dynamo-inspired" database
• Focus on availability, fault-tolerance, operational simplicity and scalability
• Support for replication and auto-sharding and rebalancing on failures
• Support for MapReduce, full-text search and secondary indexes of value tags
• Written in Erlang
28. Redis
• Open source in-memory key-value store with optional durability
• Focus on high speed reads and writes of common data structures to RAM
• Allows simple lists, sets and hashes to be stored within the value and manipulated
• Many features that developers like
– expiration, transactions, pub/sub, partitioning
29. Amazon DynamoDB
• Based around a scalable key-value store
• Fastest growing product in Amazon's history
• SSD-only database service
• Focus on throughput rather than storage, with predictable read and write times
• Strong integration with S3 and Elastic MapReduce
30. Column-Family
• Key includes a row, column family and column name
• Store versioned blobs in one large table
• Queries can be done on rows, column families and column names
• Pros: good scale out, versioning
• Cons: cannot query blob content, row and column designs are critical
Examples: Cassandra, HBase, Hypertable, Apache Accumulo, Bigtable
31. Column Family (Bigtable)
• The champion of "Big Data"
• Excels at highly scalable systems
• Tightly coupled with MapReduce
• Technically a "sparse matrix" where most cells have no data
• Generating a list of all columns is non-trivial
• Examples:
– Google Bigtable
– Hadoop HBase
– Hypertable
32. Spreadsheets Use a Row/Column as a Key
• Bigtable systems use a combination of row and column information as part of their key
[Figure: spreadsheet grid with a cell addressed by Row ID and Column ID]
33. Keys Include Family and Timestamps
• Bigtable systems have keys that include not just row and column ID but other attributes
• Column families are created when a table is created
• Timestamps allow multiple versions of values
• Values are just ordered bytes and have no strongly typed data system
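The composite key described above can be sketched with a dict keyed by a tuple. This is a toy model under stated assumptions, not the storage format of Bigtable, HBase or Cassandra; the function names and sample rows are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Key = (row ID, column family, column name, timestamp); value = raw bytes.
CellKey = Tuple[str, str, str, int]
table: Dict[CellKey, bytes] = {}


def put(row: str, family: str, column: str, timestamp: int, value: bytes) -> None:
    table[(row, family, column, timestamp)] = value


def get_latest(row: str, family: str, column: str) -> Optional[bytes]:
    # Collect all versions of the cell and return the one with the newest timestamp.
    versions = [(key[3], value) for key, value in table.items()
                if key[:3] == (row, family, column)]
    return max(versions, key=lambda v: v[0])[1] if versions else None


def columns_in_family(row: str, family: str) -> List[str]:
    # Query by row and column family without reading any value bytes.
    return sorted({key[2] for key in table if key[:2] == (row, family)})


put("user1", "profile", "email", 100, b"old@example.com")
put("user1", "profile", "email", 200, b"new@example.com")
put("user1", "profile", "city", 100, b"Oslo")
put("user2", "stats", "logins", 100, b"7")
```

Only cells that were written occupy space, which is what makes the table sparse; both email versions remain retrievable by timestamp.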
34. Column Store Concepts
• Preserve the table structure familiar to RDBMS systems
• Not optimized for "joins"
• One row could have millions of columns but the data can be very "sparse"
• Ideal for high-variability data sets
• Column families allow you to query all columns that have a specific property or properties
• Allow new columns to be inserted without doing an "alter table"
• Trigger new columns on inserts
[Figure: a single wide row spanning Col1 … Col100000]
35. Column Families
• Group columns into "Column families"
• Group column families into "Super-Columns"
• Query all columns within a family or super family
• Similar data grouped together to improve speed
[Figure: hierarchy with labels Table, Super Col X, Super Col Y, Fam1, Fam2, Col-A, Col-B]
36. Hadoop/HBase
• Open source implementation of the MapReduce algorithm written in Java
• Initially created by Yahoo
– 300 person-years of development
• Column-oriented data store
• Java interface
• HBase designed specifically to work with Hadoop
• High-level query language (Pig)
• Strong support by many vendors
37. Cassandra
• Apache open source column family database supported by DataStax
• Peer-to-peer distribution model
• Strong reputation for linear scale out (millions of writes/second)
• Database-side security
• Written in Java and works well with HDFS and MapReduce
39. Graph Store
• Data is stored in a series of nodes, relationships and properties
• Queries are really graph traversals
• Ideal when relationships between data are key:
– e.g. social networks
• Pros: fast network search, works with public linked data sets
• Cons: poor scalability when graphs don't fit into RAM, specialized query languages (RDF uses SPARQL)
Examples: Neo4j, AllegroGraph, Bigdata triple store, InfiniteGraph, Stardog
40. Graph Stores
• Used when the relationships and relationship types between items are critical
• Used for
– Social networking queries: "friends of my friends"
– Inference and rules engines
– Pattern recognition
– Working with open linked data
• Automate "joins" of public data
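The "friends of my friends" query above is a two-hop traversal. The sketch below uses plain adjacency sets rather than any graph database API; the names and the `friends` data are invented for illustration.

```python
from typing import Dict, Set

# Adjacency sets for an undirected "friend" relationship (made-up data).
friends: Dict[str, Set[str]] = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan", "eve"},
    "dan": {"bob", "cat"},
    "eve": {"cat"},
}


def friends_of_friends(person: str) -> Set[str]:
    direct = friends.get(person, set())
    result: Set[str] = set()
    for friend in direct:                      # first hop
        result |= friends.get(friend, set())   # second hop
    # Exclude the person and people who are already direct friends.
    return result - direct - {person}


fof = friends_of_friends("ann")
```

The traversal follows relationships directly from node to node, where a relational design would typically need a self-join on a friendship table for each hop.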
41. Nodes are "joined" to create graphs
• How do you know that two items reference the same object?
• Node identification – URI or similar structure
43. Neo4j
• Graph database designed to be easy to use by Java developers
• Dual license (community edition is GPL)
• Works as an embedded Java library in your application
• Disk-based (not just RAM)
• Full ACID
44. Document Store
• Data stored in nested hierarchies
• Logical data remains stored together as a unit
• Any item in the document can be queried
• Pros: no object-relational mapping layer, ideal for search
• Cons: complex to implement, incompatible with SQL
Examples: MarkLogic, MongoDB, Couchbase, CouchDB, eXist-db
45. Document Stores
• Store machine-readable documents together as a single blob of data
• Use JSON or XML formats to store documents
• Similar to "object stores" in many ways
• No shredding of data into tables
• Sub-trees and attributes of documents can still be queried using XQuery or other document query languages
• Quickly maturing to include ACID transaction support
• Lack of object-relational mapping permits agile development
• Fastest growing revenues (MarkLogic, MongoDB, Couchbase)
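Querying a sub-tree of a stored document can be sketched with nested dicts and a path lookup. This is only an illustration of the idea; the slash-separated path syntax, the helper names `get_path` and `find`, and the sample documents are assumptions, not XQuery or any product's query language.

```python
from typing import Any, Dict, List

documents: List[Dict[str, Any]] = [
    {"id": 1, "title": "Intro to NoSQL",
     "author": {"name": "Ann", "address": {"city": "Oslo"}}},
    {"id": 2, "title": "Graph Basics",
     "author": {"name": "Bob", "address": {"city": "Lima"}}},
]


def get_path(doc: Dict[str, Any], path: str) -> Any:
    # Walk a slash-separated path such as "author/address/city".
    node: Any = doc
    for step in path.split("/"):
        if not isinstance(node, dict) or step not in node:
            return None
        node = node[step]
    return node


def find(path: str, expected: Any) -> List[int]:
    # Return the ids of documents whose value at `path` equals `expected`.
    return [d["id"] for d in documents if get_path(d, path) == expected]


oslo_ids = find("author/address/city", "Oslo")
```

Each document stays together as one unit, yet a value nested three levels deep is still reachable without shredding the data into tables.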
46. Estimated Big Data and NoSQL Sales
[Figure: chart of estimated Big Data and NoSQL sales, with a callout labeled Document Stores]
47. Object Relational Mapping
• T1 – HTML into Objects
• T2 – Objects into SQL Tables
• T3 – Tables into Objects
• T4 – Objects into HTML
[Figure: Web Browser, Object Middle Tier and Relational Database connected by translations T1 to T4]
48. The Addition of XML Web Services
• T1 – HTML into Java Objects
• T2 – Java Objects into SQL Tables
• T3 – Tables into Objects
• T4 – Objects into HTML
• T5 – Objects to XML
• T6 – XML to Objects
[Figure: Web Browser, Object Middle Tier, Relational Database and Web Service connected by translations T1 to T6]
Copyright 2011 Kelly-McCreary & Associates
49. "The Vietnam of Applications"
• Object-relational mapping has become one of the most complex components of building applications today
• A "Quagmire" where many projects get lost
• Many "heroic efforts" have been made to solve the problem
– Java Hibernate Framework
– Ruby on Rails
• But sometimes the best way to avoid complexity is to keep your architecture very simple
50. Document Stores Need No Translation
• Documents in the database
• Documents in the application
• No object middle tier
• No "shredding"
• No reassembly
• Simple!
[Figure: a Document in the Application Layer and the same Document in the Database]
51. Zero Translation (XML)
• XML lives in the web browser (XForms)
• REST interfaces
• XML in the database (Native XML, XQuery)
• XRX Web Application Architecture
• No translation!
[Figure: Web Browser (XForms) connected to an XML database through REST interfaces]
52. "Schema Free"
• Systems that automatically determine how to index data as the data is loaded into the database
• No a priori knowledge of data structure
• No need for up-front logical data modeling
– …but some modeling is still critical
• Adding new data elements or changing data elements is not disruptive
• Searching millions of records still has sub-second response time
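Indexing data as it is loaded, with no prior knowledge of its structure, can be sketched as a recursive walk that records every field path and value it meets. This is a simplified assumption of how such a system might behave, not the indexing scheme of any specific database; `index_document` and the sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Set, Tuple

# Index maps (field path, value) -> set of document ids.
index: Dict[Tuple[str, Any], Set[int]] = defaultdict(set)


def index_document(doc_id: int, node: Any, path: str = "") -> None:
    if isinstance(node, dict):
        for name, child in node.items():
            index_document(doc_id, child, f"{path}/{name}")
    elif isinstance(node, list):
        for child in node:
            index_document(doc_id, child, path)
    else:
        index[(path, node)].add(doc_id)   # leaf value: record where it occurred


index_document(1, {"title": "Intro", "tags": ["nosql", "xml"]})
# A later document introduces a new field; no schema change is needed.
index_document(2, {"title": "Graphs", "tags": ["nosql"], "isbn": "123"})

nosql_docs = index[("/tags", "nosql")]
isbn_docs = index[("/isbn", "123")]
```

The `isbn` field becomes searchable the moment the second document arrives, which is the non-disruptive change the slide describes.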
53. Schema-Free Integration
"We can easily store the data that we actually get, not the data we thought we would get."
[Figure: XML v1, XML v2 and XML v3 messages flow from an Enterprise Messaging System into a NoSQL Database]
54. Upfront ER Modeling is Not Required
• You do not have to finish modeling your data before you insert your first records
• No Data Definition Language (DDL) is needed
• Metadata is used to create indexes as data arrives
• Modeling becomes a statistical process: write queries to find exceptions and normalize data
• Exceptions make the rules but can still be used
• Data validation can still be done on documents using tools such as XML Schema and business rules systems like Schematron
55. Document Structure
[Figure: XML Schema diagram for a <books> document, with the following annotations]
• <books> is our root element
• <books> contains a sequence of one to many <book> elements
• Each <book> contains the following sequence of elements
• Darker lines mean "required" and light lines mean optional elements
• Id and title are required
• Books have 0 to many author-names
• Format and license elements are codes that must be in a fixed list of choices
• Only valid URL characters
• Must be a valid decimal number
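The rules annotated on this diagram can be approximated in code. The sketch below is a hand-written check using Python's standard `xml.etree.ElementTree`, not an XML Schema validator. The element names `id`, `title`, `format` and `license` come from the annotations; the `price` element and the allowed code lists are hypothetical, since the slide does not name the decimal field or list the codes.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal, InvalidOperation
from typing import List

# Hypothetical code lists; the slide does not give the real ones.
FORMATS = {"print", "ebook"}
LICENSES = {"cc-by", "copyright"}


def validate_books(xml_text: str) -> List[str]:
    errors: List[str] = []
    root = ET.fromstring(xml_text)
    if root.tag != "books":
        return ["root element must be <books>"]
    books = root.findall("book")
    if not books:
        errors.append("<books> needs at least one <book>")
    for i, book in enumerate(books, start=1):
        for required in ("id", "title"):          # id and title are required
            if not (book.findtext(required) or "").strip():
                errors.append(f"book {i}: missing <{required}>")
        fmt = book.findtext("format")
        if fmt is not None and fmt not in FORMATS:
            errors.append(f"book {i}: bad format {fmt!r}")
        lic = book.findtext("license")
        if lic is not None and lic not in LICENSES:
            errors.append(f"book {i}: bad license {lic!r}")
        price = book.findtext("price")            # hypothetical decimal field
        if price is not None:
            try:
                Decimal(price)
            except InvalidOperation:
                errors.append(f"book {i}: price is not a decimal")
    return errors


good = ("<books><book><id>b1</id><title>T</title><author-name>A</author-name>"
        "<format>ebook</format><price>9.99</price></book></books>")
bad = ("<books><book><title>T</title><format>scroll</format>"
       "<price>cheap</price></book></books>")
good_errors = validate_books(good)
bad_errors = validate_books(bad)
```

In practice a schema language such as XML Schema expresses these constraints declaratively; the code only shows what kind of checks the diagram implies.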
56. MarkLogic
• Native XML database designed to scale to petabyte data stores
• Leverages commodity hardware
• ACID compliant, schema-free document store
• Heavy use by federal agencies, document publishers and "high-variability" data
• Arguably the most successful NoSQL company
57. MongoDB
• Open source JSON data store created by 10gen
• Master-slave scale out model
• Strong developer community
• Sharding built-in, automatic
• Implemented in C++ with many APIs (C++, JavaScript, Java, Perl, Python etc.)
58. Couchbase
• Open source JSON document store
• Code base separate from CouchDB
• Built around memcached
• Peer-to-peer scale out model
• Written in C++ and Erlang
• Strengths in scale out, replication and high availability
59. CouchDB
• Apache CouchDB
• Open source JSON data store
• Document model
• Written in Erlang
• RESTful JSON API
• Distributed, featuring robust, incremental replication with bi-directional conflict detection and management
• B-tree based indexing
• Mobile version
60. eXist
• Open source native XML database
• Strong support for XQuery and XQuery extensions
• Heavily used by the Text Encoding Initiative (TEI) community and XRX/XForms communities
• Integrated Lucene search
• Collection triggers and versioning
• Extensive XQuery libraries (EXPath)
• Version 2.0 has replication
61. Two Models
"Bag of Words"
• All keywords in a single container
• Only frequency counts are stored with each word
"Retained Structure"
• Keywords associated with each sub-document component
[Figure: a bag of keywords ('love', 'hate', 'new', 'fear') compared with keywords attached to each component under a doc-id]
62. Keywords and Node IDs
• Keywords in the reverse index are now associated with the node-id in every document
[Figure: a document-id whose node-ids each carry their own keywords]
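The difference between the two indexing models can be sketched side by side. The example below is an illustration with made-up node ids and text, not the index layout of any particular search engine: the bag-of-words model keeps one word count per document, while the retained-structure model maps each keyword to the document-id and node-id where it occurs.

```python
from collections import Counter, defaultdict
from typing import Dict, Set, Tuple

# A document as (node-id, text) pairs; ids and text are made up.
doc_id = "doc1"
nodes = [("title", "new love"), ("para1", "love and fear"), ("para2", "hate")]

# Bag of words: one container per document, only word counts are kept.
bag: Counter = Counter()
for _, text in nodes:
    bag.update(text.split())

# Retained structure: each keyword maps to the (document-id, node-id)
# pairs where it occurs, so a hit can point into the document.
node_index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
for node_id, text in nodes:
    for word in text.split():
        node_index[word].add((doc_id, node_id))

love_count = bag["love"]
love_nodes = sorted(node_index["love"])
```

With the node-level index a search can distinguish a match in the title from a match in a paragraph, which a plain word count cannot do.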
63. Hybrid architectures
• Most real world implementations use some combination of NoSQL solutions
• Example:
– Use document stores for data
– Use S3 for image/pdf/binary storage
– Use Apache Lucene for document index stores
– Use MapReduce for real-time index and aggregate creation and maintenance
– Use OLAP for reporting sums and totals
64. Tools to Help You Select a System
• ATAM – Architecture Tradeoff Analysis Method
• CMU-developed process to objectively select a system architecture based on business-driven use cases and quality metrics
65. ATAM Process Flow
[Figure: ATAM process flow. Business Drivers lead to Quality Attributes and User Stories; the Architecture Plan leads to Architectural Approaches and Architectural Decisions; both feed Analysis. Analysis yields Tradeoffs, Sensitivity Points, Non-Risks and Risks. Risks are distilled into Risk Themes, which feed back as impacts.]
67. Sample Quality Attribute Tree
[Figure: quality attribute tree with root "Utility", flattened below]
Node labels: Searchability, XML Importability, Transformability, Affordability, Sustainability, Interoperability, Standards, Ease of Change, Security, Fulltext Search, XML Search, Custom Scoring, Drag-and-drop, Bulk Import, No License Fees, XQuery, XSLT, Web Services, Fine Grain Control, Standards Based, No Translation, Role-based
Scenarios, each with (Importance, Score):
• Easy to add new XML data. (C, H)
• Use Open Source Software. (H, H)
• Use long standing Standards. (VH, H)
• Use W3C Standards. (VH, H)
• Fulltext search on document data. (H, H)
• Easy to transform XML data. (H, H)
• Use declarative languages. (VH, H)
• Prevents unauthorized access. (H, H)
• Scoring via XQuery. (M, H)
• Works with W3C Forms. (H, M)
• Works with web standards. (VH, H)
• No complex languages to learn. (H, H)
• Fast searching. (H, H)
• Staff training. (H, M)
• Collection-based access. (H, M)
• Centralized security policy. (H, M)
• Mashups with REST Interfaces. (VH, H)
• Batch Import tools. (M, M)
• Transform to HTML or PDF. (VH, H)
• Customizable by non-programmers. (H, H)
Key: (Importance, Score)
Importance to the project: C=Critical, VH=Very High, H=High, M=Medium
Architectural score: H=High, M=Medium, L=Low
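One way to compare candidate architectures with such a tree is to roll the (Importance, Score) pairs up into a single number. The sketch below is only an assumption about how that could be done: ATAM as described here does not prescribe numeric weights, so the `IMPORTANCE` and `SCORE` mappings and the `weighted_fit` helper are hypothetical. The four rows are taken from the tree above.

```python
from typing import List, Tuple

# Hypothetical numeric weights for the letter codes in the key.
IMPORTANCE = {"C": 4, "VH": 3, "H": 2, "M": 1}
SCORE = {"H": 3, "M": 2, "L": 1}

# A few (scenario, importance, architectural score) rows from the tree.
scenarios: List[Tuple[str, str, str]] = [
    ("Easy to add new XML data", "C", "H"),
    ("Use W3C Standards", "VH", "H"),
    ("Staff training", "H", "M"),
    ("Batch Import tools", "M", "M"),
]


def weighted_fit(rows: List[Tuple[str, str, str]]) -> float:
    # Importance-weighted average of scores, scaled so that 1.0 means
    # every scenario received the top architectural score.
    total_weight = sum(IMPORTANCE[imp] for _, imp, _ in rows)
    earned = sum(IMPORTANCE[imp] * SCORE[score] for _, imp, score in rows)
    return earned / (total_weight * max(SCORE.values()))


fit = weighted_fit(scenarios)
```

Running the same calculation for each candidate architecture gives a rough ranking, but the sensitivity points and tradeoffs behind the scores still need to be reviewed by people.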