Abdelmonaim Remani | Just.me Inc.


The Rise of NoSQL and
 Polyglot Persistence
About Me
• Software Architect at Just.me Inc.
• Interested in technology evangelism and enterprise software
  development and architecture
• Frequent speaker (JavaOne, JAX, OSCON, ORDEV, etc…)
• Open-source advocate
• President and founder of a number of user groups
   – NorCal Java User Group
   – The Silicon Valley Spring User Group
   – The Silicon Valley Dart Meetup
• Bio:         http://about.me/PolymathicCoder
• Twitter:     @PolymathicCoder
• Email:       abdelmonaim.remani@gmail.com
License




• Creative Commons Attribution Non-Commercial 3.0 Unported
   – http://creativecommons.org/licenses/by-nc/3.0


• Disclaimer: The graphics and the logo in the presentation
  belong to their rightful owners
The Golden Age of Relational
        Databases
Relational Data Stores
• Relational Data Stores have been the
  predominant choice for storing data
  – The existence of mature solutions
    • Oracle, MySQL, MS SQL Server, etc…
  – Wide adoption and familiarity
    • Developers and even advanced business users
  – An abundance of tools
  – Etc…
• They became the de facto standard
The Relational Model
• Data
  – Stored in
     • Two-dimensional tables (Relations)
     • Rows (tuples) and columns (attributes)
  – Has a well-defined, enforced schema
     • Relations themselves
     • Integrity constraints
• Normalization
  – Smaller tables with well-defined relationships
    between them
  – Why?
      • Minimized redundancy
      • No modification anomalies
          – Modification Propagation or cascading
The Relational Model
• Supported by SQL (Structured Query
  Language)
  – A somewhat standardized query language
  – Very flexible
  – Many Operations
    • Across multiple relations such as JOIN
    • Aggregations such as GROUP BY
    • Etc…
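
To make the JOIN and GROUP BY bullets concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables and their rows are made up for illustration and are not from the talk.

```python
import sqlite3

# In-memory database with two small, normalized tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# JOIN reassembles the normalized pieces; GROUP BY aggregates per customer.
totals = conn.execute("""
    SELECT c.name, COUNT(o.id), SUM(o.amount)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

for name, order_count, total in totals:
    print(name, order_count, total)
```

The query was not planned for when the tables were designed, which is the flexibility the slide refers to.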
The Relational Model
• Transactional
  • ACID
    – Atomicity
        » All or nothing
    – Consistency
        » From one valid state to another
    – Isolation
        » Concurrent transactions still result in a valid state
    – Durability
        » Once committed, it’s forever
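
As a rough illustration of atomicity and consistency, the sketch below uses Python's built-in sqlite3 module. The accounts table, the CHECK constraint and the transfer helper are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the CHECK constraint defines what a "valid state" is.
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('me', 50), ('friend', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would overdraw src, so the earlier credit is undone too.
        return False


ok = transfer(conn, "me", "friend", 100)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```

The failed transfer leaves both balances untouched even though the credit statement had already run inside the transaction.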
The Relational Model
• Designed with the assumptions that
 – The end-user will interact directly with the database

   » It makes sense that the RDBMS should manage concurrency
     and integrity

   » Access Patterns are unknown

     » A flexible query language that is close to English

     » Data structure with no bias towards a particular pattern of
       querying

 – The database runs on a single machine

   » The only way to promise true ACID
Road Bumps
• We started building more complex applications on top
  of relational databases
 – Business logic moved out of the RDBMS

   » Fewer triggers and stored procedures, replaced by
     equivalent application-layer code

 – The applications themselves evolved beyond the procedural
   paradigm to a more OOP approach

   » The Object-Relational impedance mismatch

     » ORM frameworks to the rescue
Scalability
We became data hoarders!
• As our datasets grow out of control
• Performance degrades sharply
  – We buy a beefier machine
     • Larry Ellison’s most expensive RAC, making
       him even richer
• This puts off the problem for a little while
Optimization
• We hire a guy
  – He indexes half of the database
     • Makes those queries a little faster
  – He creates materialized views for complex joins
     • Nightmare to maintain, get stale, etc…
  – He de-normalizes
     • Anything but a smooth transition!
     • Redundancy
  – He introduces caching
     • Data too stale
     • More redundancy
Clustering
• We hire another guy
   – He tells us that we have hit the limit of a single machine
   – We need to scale out (horizontally)
      • Master/Slave
          – Assuming you read more than you write
          – Write to the Master and read from the Slaves
          – Master needs to replicate data across the Slaves
              » Risk of stale or incorrect reads
          – How’s that consistent?!!
      • Sharding
          –   Improves writes as well as reads
          –   Can’t join across partitions
          –   No referential integrity
          –   Requires modification of client applications
          –   Introduces a single point of failure
          –   How’s that consistent?!!
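
The sharding trade-offs can be sketched in a few lines of Python. This is a toy, in-memory illustration under assumed names (ShardedStore, scan_keys); real shards would be separate database servers and the routing would usually live in the client or a proxy.

```python
import hashlib


class ShardedStore:
    """Toy hash-based sharding: each shard is a plain dict standing in for a server."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key).get(key)

    def scan_keys(self, predicate):
        # Anything that is not a lookup by key must visit every shard
        # and merge the results in application code.
        return sorted(
            key
            for shard in self.shards
            for key, value in shard.items()
            if predicate(value)
        )


store = ShardedStore(shard_count=4)
for i in range(20):
    store.put(f"user:{i}", {"age": 20 + i})

print(store.get("user:7"))
print(store.scan_keys(lambda user: user["age"] >= 38))
```

Lookups by key touch one shard, while scan_keys shows why joins and ad hoc queries move into the client application.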
What’s the Point?
• We horizontally scale our relational
  database
  – We’re no longer consistent
  – No ACIDity?
  – We lose query flexibility
• Are we doing something wrong?
The CAP Theorem
The CAP Theorem
• Eric Brewer on distributed systems
  – Pick two out of
    • Consistency
    • Availability
    • Partition Tolerance
• It is like Fast, Cheap, Good service
  – Cheap Good service won’t be Fast
  – Fast Good service won’t be Cheap
  – Fast Cheap service won’t be Good
Relational Model & CAP
• Relational Data Stores happen to favor
  – Consistency and Availability
  – For historical reasons
     • They are key to certain types of applications
     • The bank example
        – I deposit $100 in my friend’s bank account
        – Blah blah blah…
• According to CAP, a CA system cannot also be
  partition tolerant, meaning that horizontal
  scaling is off the table
Damn!
• We’re in a pickle
  – Too much data in a CA model
  – Vertical Scaling
     • Too expensive
     • Not sustainable
• Forced to explore other alternatives in
  light of CAP
What AP Looks Like
• Partition Tolerance
  – Since we reached the limit of the one machine
    we have no choice but to scale horizontally
  – Which means to be partition tolerant
• Availability
  – Nobody is willing to give it up, most of the time
  – This becomes even better with distribution
  – In a cluster of servers
     • An individual node might be unreliable by itself
     • But the cluster as a whole is inherently reliable
What AP Looks Like
• According to CAP, we simply cannot have C as well
• Consistency
  – I make an update and all subsequent reads return the most
    recent value
  – Unfortunately this is impossible, as it takes time for
    the change to be replicated to each node of
    the cluster
• What a bummer!
• Let’s look at an AP system
  – DNS (Domain Name System)
     • Not all the nodes have the most recent records (you
       register a domain name and wait up to a few days to
       be sure that every DNS server knows about it)
Eventual Consistency
• This is not so bad
   – It means that we just settled for a lesser degree of
     consistency
• So what if
   – Mohammad in Morocco updated his relationship status
     to single on some edge node
   – His cousin who lives in Spain saw it immediately because
     they happen to be on the same edge node
   – His secret admirer Sara, who lives in the United States,
     could not see it until an hour later
   – His brother in Japan got the update the next day
   – They all got it eventually!
• Eventual Consistency as opposed to Immediate
  Consistency
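
The relationship-status story can be simulated with a toy model. The sketch below is only an illustration: the Cluster class, the node names and the explicit deliver_one step are assumptions, and it ignores conflict resolution between concurrent writes entirely.

```python
from collections import deque


class Cluster:
    """Toy replicas: a write lands on one node and reaches the others later."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self.pending = deque()  # (target node, key, value) not yet delivered

    def write(self, at, key, value):
        self.nodes[at][key] = value
        for name in self.nodes:
            if name != at:
                self.pending.append((name, key, value))

    def read(self, at, key):
        return self.nodes[at].get(key)

    def deliver_one(self):
        """Apply the oldest outstanding replication message, if there is one."""
        if self.pending:
            name, key, value = self.pending.popleft()
            self.nodes[name][key] = value


cluster = Cluster(["edge-africa-europe", "edge-usa", "edge-japan"])
cluster.write("edge-africa-europe", "mohammad.status", "single")

print(cluster.read("edge-africa-europe", "mohammad.status"))  # already updated
print(cluster.read("edge-usa", "mohammad.status"))            # still stale

while cluster.pending:
    cluster.deliver_one()

print(cluster.read("edge-japan", "mohammad.status"))          # caught up eventually
```

Readers on the writer's node see the change at once, the others see stale data until replication catches up, and all nodes converge once nothing is pending.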
The Compromise
• We settle for a weaker consistency model
  – BASE
    • Basically Available
    • Soft state
    • Eventual Consistency
• ACID on the individual node, BASE on
  the cluster
The Slippery Slope of the
        Faithless
You might as well Question…
• Schema
 – Logical
   • Well-defined and rigid in relational databases
   • Why not a flexible one or even no schema?
 – Physical
   • B-trees in most relational databases
   • Why not use some other underlying data
     structure?
You might as well Question…
• Integrity Constraints
  – Who cares?
• A Query Language
  – Anything would do…
• Security
  – None
• Name it…
NoSQL: Going Rogue…
NoSQL
• A wide range of specialized data stores
  with the goal of addressing the challenges
  of the relational model
• Eric Evans
  – The whole point of seeking alternatives is that
    you need to solve a problem that relational
    databases are a bad fit for
• Let me make it easier
  – It is not anti-SQL or anti-Relational
  – It is any data store that is non-relational
• “Not Only SQL” instead of “NO SQL”
SQL                   vs.   NoSQL
A single machine            A cluster
CA                          AP/CA/CP
Scale Vertically            Scale Horizontally
SQL                         Custom APIs
ACID                        BASE
Full Indexes                Mostly on Keys


            There are outliers of course
SQL                   vs.   NoSQL
Rigid Schema                Schema-less
Flexible Queries            Pre-defined Queries

• SQL (Relational)
  – Concerned about what the data consists of
• NoSQL (Non-Relational)
  – Concerned with how the data is queried

                There are outliers of course
The Zoo
Key-Value Data Stores
• Basically a big hash map (associative array)
   – Very Simple
   – Very fast read and write
   – No secondary indexes
• Use When
   – Your data is not highly related
   – All you need is basic CRUD
• Challenges
   – Complex queries
• Check out the Amazon Dynamo Paper
   – http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
• Featured Projects
   – DynamoDB http://aws.amazon.com/dynamodb/
   – Riak http://wiki.basho.com/
   – Redis http://redis.io/
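
A minimal sketch of the key-value model, with a plain Python dict standing in for the store. The class and method names are made up for illustration and do not match the API of any of the featured projects.

```python
class KeyValueStore:
    """Toy key-value store: CRUD by key and nothing else."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value  # create or update

    def get(self, key, default=None):
        return self._items.get(key, default)  # read

    def delete(self, key):
        self._items.pop(key, None)  # delete; a missing key is not an error

    def find_keys(self, predicate):
        # No secondary indexes: a query on the value inspects every entry.
        return [key for key, value in self._items.items() if predicate(value)]


carts = KeyValueStore()
carts.put("cart:alice", ["book", "pen"])
carts.put("cart:bob", ["laptop"])

print(carts.get("cart:alice"))
print(carts.find_keys(lambda items: "laptop" in items))
carts.delete("cart:bob")
print(carts.get("cart:bob"))
```

Access by key is a single hash lookup, while find_keys is a full scan, which is the "complex queries" challenge listed above.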
Columnar Stores
•   In a table, data of the same column is stored together
     – Storage is not wasted on null values as in row-based stores (RDBMS)
     – Great for sparse tables
     – Very fast column operations, including aggregation
•   Use When
     – Big Data (Excellent leverage of MapReduce)
     – Need compression or versioning
•   Challenges
     – You had better know your access patterns beforehand
     – Key design is not trivial
•   Check out Google’s BigTable Paper
     – http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
•   Featured Projects
     – HBase http://hbase.apache.org/
     – Cassandra http://cassandra.apache.org/
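
The difference between row-oriented and column-oriented layout can be sketched with plain Python structures. The sample rows and the to_columns helper are made up for illustration; real column stores add on-disk formats, compression and versioning on top of this idea.

```python
# Row-oriented layout: every row carries every attribute, including nulls.
rows = [
    {"user": "a", "age": 31, "city": "Oslo"},
    {"user": "b", "age": None, "city": "Rabat"},
    {"user": "c", "age": 25, "city": None},
]


def to_columns(rows):
    """Regroup the data by column, keeping only the cells that have a value."""
    columns = {}
    for row_id, row in enumerate(rows):
        for name, value in row.items():
            if value is not None:
                columns.setdefault(name, {})[row_id] = value
    return columns


columns = to_columns(rows)

# An aggregation reads a single column and never touches the others.
ages = columns["age"]
average_age = sum(ages.values()) / len(ages)

print(columns["age"])
print(average_age)
```

Missing values take no space in the column layout, and the average is computed from the age column alone.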
Document Data Stores
•   Nested structures of keys and their values
     – A document can be
          •   Simply a key and its value
          •   A key and another document as its value
          •   No limit in depth
     –   Very Flexible schema
     –   Well-Indexed data
     –   Works well with OOP (No impedance mismatch)
     –   De-normalize as a best practice
•   Use when
     – You don’t know much about the schema
     – The schema is very likely to change
•   Challenges
     – Complex Join-like queries
     – Self-referencing documents and circular dependencies
•   Featured Projects
     – MongoDB http://www.mongodb.org/
     – CouchDB http://couchdb.apache.org/
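
A document can be pictured as a nested dictionary. The sketch below uses a made-up order document and a hypothetical get_path helper; it does not use the query API of MongoDB or CouchDB.

```python
import json

# A de-normalized order: the customer and line items are embedded in the
# document instead of living in separate tables (hypothetical data).
order = {
    "_id": "order-1001",
    "customer": {
        "name": "Sara",
        "address": {"city": "San Jose", "country": "US"},
    },
    "items": [
        {"sku": "book-42", "quantity": 1},
        {"sku": "pen-7", "quantity": 3},
    ],
}


def get_path(document, path, default=None):
    """Follow a dotted path such as 'customer.address.city' through nested dicts."""
    current = document
    for part in path.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        else:
            return default
    return current


print(get_path(order, "customer.address.city"))
print(get_path(order, "customer.phone", default="unknown"))

# Documents are commonly exchanged as JSON; this one round-trips unchanged.
restored = json.loads(json.dumps(order))
print(restored == order)
```

The whole order is read in one lookup with no join, at the cost of repeating customer data in every order.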
Graph Data Stores
• A graph
   –   Perfect for highly interconnected data
   –   Allows for explicit relationships
   –   Fine-grained graph traversal
   –   Very Flexible
   –   Works well with OOP (No impedance mismatch)
• Use when
   – Your data looks like a graph and requires graph queries
   – You are smart enough not to try this on another data store
• Challenges
   – Doesn’t scale well horizontally
• Featured Projects
   – Neo4j http://neo4j.org/
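
A "graph question" such as "who is within two hops of this person?" can be sketched with a breadth-first traversal. The graph below and the within_hops helper are made up for illustration; a graph database would store the relationships natively and expose its own query language.

```python
from collections import deque

# Hypothetical social graph as an adjacency mapping (undirected friendships).
graph = {
    "Mohammad": {"Sara", "Ali"},
    "Sara": {"Mohammad", "Lina"},
    "Ali": {"Mohammad"},
    "Lina": {"Sara", "Omar"},
    "Omar": {"Lina"},
}


def within_hops(graph, start, max_hops):
    """Breadth-first traversal: everyone reachable in at most max_hops steps."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance


print(within_hops(graph, "Mohammad", 2))
```

In a relational store the same question needs one self-join per hop, which is why the slide advises against trying this elsewhere.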
Relational Data Stores
• Use when
   – Your data is highly relational
   – There is a need to break data into small pieces and
     assemble it in different ways
   – When consistency is king
   – Access patterns are unknown
   – Reporting
• Challenges
   – Doesn’t scale well horizontally
• Featured Projects
   –   Oracle http://www.oracle.com/index.html
   –   Postgres http://www.postgresql.org/
   –   MS SQL Server https://www.microsoft.com/sql-server
   –   MySQL http://www.mysql.com/
How do you choose?
If It Doesn’t Fit, You Must Acquit!
• Data
  –   Does it have a natural structure?
  –   How is it connected?
  –   How is it distributed?
  –   How much?
• Access Patterns
  – Reads/Writes ratio?
  – Uniform or random?
• CAP
Other Considerations
•   Maturity
•   Stability
•   Maintainability
•   Durability
•   Cost
•   Tools
•   Familiarity
For Fairness’ Sake!
For Fairness’ Sake!
• Relational data stores did not fail us
  – They actually perform very well
• We failed ourselves
  – By using them as solutions for problems
    they weren’t designed to solve to begin
    with
• Misuse any other data store and you’ll get
  just as much trouble
For Fairness’ Sake!
• You can’t expect
  – A flathead screwdriver to work on a Phillips
    screw as well as one with the matching Phillips
    blade
  – A crosshead screwdriver to work on a
    flathead screw
Polyglot Persistence
Polyglot Persistence
• Enterprise applications are complex and
  combine complex problems
  – The assumption that we should use one data store is
    absurd
  – You can’t fit everything into one model and expect no
    problems
• Polyglot Persistence
  – Use multiple data stores, chosen based on the
    way data is used by the application
     • Associated with a learning curve
     • Long-term investment (more productive in the long run)
  – Leverage the strengths of multiple data stores
Polyglot Persistence
• Example
  –   MongoDB for the product catalog
  –   Redis for shopping cart
  –   DynamoDB for social profile info
  –   Neo4j for the social graph
  –   HBase for inbox and public feed messages
  –   MySQL for payment and account info
  –   Cassandra for audit and activity log
• Disclaimer: I’m not making any
  recommendation here.
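
One way to picture polyglot persistence in application code is a thin repository that routes each kind of data to its own store. The sketch below is an assumption-laden toy: InMemoryStore stands in for real client libraries, and the class names and routing keys are invented for this example.

```python
class InMemoryStore:
    """Stand-in for a real data store client; only the label differs."""

    def __init__(self, label):
        self.label = label
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class PolyglotRepository:
    """Routes each kind of data to the store chosen for it."""

    def __init__(self, routes):
        self._routes = routes

    def store_for(self, kind):
        try:
            return self._routes[kind]
        except KeyError:
            raise ValueError(f"no data store configured for {kind!r}") from None

    def save(self, kind, key, value):
        self.store_for(kind).put(key, value)

    def load(self, kind, key):
        return self.store_for(kind).get(key)


repository = PolyglotRepository({
    "product": InMemoryStore("document store"),
    "cart": InMemoryStore("key-value store"),
    "payment": InMemoryStore("relational store"),
})

repository.save("cart", "session-42", ["sku-1", "sku-2"])
repository.save("product", "sku-1", {"name": "Notebook", "price": 4.5})

print(repository.store_for("cart").label, repository.load("cart", "session-42"))
```

The rest of the application talks to the repository, so the choice of store per data type stays in one place and can change later.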
NoSQL in the Cloud
NoSQL in the Cloud
• NoSQL as a commodity
  – Fully managed data stores (No
    maintenance)
  – Elastic scaling
  – Cheap storage
• Featured:
  – Amazon AWS
  – Heroku Add-ons
  – CloudFoundry
As Promised!
The A’s to the Q’s in the Abstract
• What does the rise of all these NoSQL stores mean
  to my enterprise?
   – I’m guessing a lot
• What is NoSQL to begin with?
   – Any non-relational data store
• Does it mean “NO SQL”?
   – No
• Could this be just another fad?
   – I don’t think so
The A’s to the Q’s in the Abstract
• Is it a good idea to bet the future of my
  enterprise on these new exotic
  technologies and simply abandon
  proven, mature RDBMSs?
  – It’s up to you. I will say “No guts, no glory!”
• How scalable is scalable?
  – However much you need it to be
The A’s to the Q’s in the Abstract
• Assuming that I am sold, how do I
  choose the one that fits my needs the
  best?
  – I’ll tell you if you hire me
• Is there a middle ground somewhere?
  – Polyglot Persistence
• What is this Polyglot Persistence I hear
  about?
  – It’s the middle ground
Any Other Questions?
Thank You All!

@PolymathicCoder
