©2013 DataStax Confidential. Do not distribute without consent.
Benjamin Coverston
DSE Architect, DataStax Inc.
NoSQL, Big Data, and Real Time
Monday, September 2, 13
Who am I?
• Ben Coverston
• DSE Architect
• DataStax since 2010
• Previous Experience in the Travel Industry
• Low Cost Airlines / Web Reservations
• Past: HP / Accenture
• Lived in Santa Catarina for a few years.
What is it?
NoSQL
What is NoSQL?
NoSQL is a term coined by Carlo Strozzi and
repurposed by Eric Evans to refer to “some”
storage systems. The NoSQL term should be used
as in the Not-Only-SQL and not as No to SQL or
Never SQL.
-- Alex Popescu
What is NoSQL (Cont.)
• It’s not
• No to SQL
• About performance
• About scaling
• ACID
• Eventual consistency
• Volume
• It is:
• About choice
Diversity in Data
• Big Data has the 3 (or 4 or 5) V’s
• Volume
• Variety
• Velocity
• Variability (sometimes)
• Value (other times)
Diversity in Data
• The V’s don’t cover everything
• Availability is important
• Your use case is important too
Is NoSQL Big Data?
• You can store Big Data with an RDBMS
• Is it easy?
• Is it cost effective?
• What kind of compromises do you have to make?
The Problems
• In general there are two classes of data problems
• OLTP (Real-Time)
• Analytics (Batch)
• Usually you want both
• No solution is perfect for everyone
• Popularity is no indication of fitness
Use Cases
• OLTP
• Low Latency
• High Throughput
• LOB Applications
• Batch
• Predictive Models
• Complex Queries
• Tomorrow (or precalculated, but now we need OLTP)
Where to put your ‘Stuff’
• Sharded RDBMS
• MPP -- Greenplum, Teradata
• Hadoop
• Key/Value
• Columnar
• Other
Why not just one?
• Analytics
• Optimize serial IO
• Limitations in Storage
• OLTP
• Working Set
• Distribution
• Availability
• Storage Medium
Why Do We Need Something Else?
• ACID semantics are often overkill
• ACID also makes the database layer brittle
• This means you get less Availability (CAP Theorem)
The Application Stack
[Diagram: www.example.com -> load balancers (LB1 to LB3) -> web servers (ws1 to ws9) -> cache -> databases (DB 1 to 4)]
Sharding
• Storage Limitations
• Working Set
• So just make more!
(“The eBay Architecture,” Randy Shoup and Dan Pritchett)
But Sharding
• Is Painful
• Requires ‘something else’
• Most NoSQL solutions auto-shard
• Sharding requires tradeoffs.
• Which means your application will need to change
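As a rough illustration of why sharding forces application changes, here is a minimal sketch of hash-based shard routing. The shard names and the `shard_for` and `moved_keys` helpers are hypothetical and not from the talk; the point is only that the application (or something in front of it) now has to know how keys map to databases.

```python
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["db1", "db2", "db3", "db4"]

def shard_for(key: str, shards=SHARDS) -> str:
    """Map a key to a shard using a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

def moved_keys(keys, old_shards, new_shards) -> int:
    """Count keys whose shard changes when the shard list changes."""
    return sum(1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards))
```

With simple modulo placement like this, going from four shards to five relocates most keys, which is one of the tradeoffs a hand-rolled sharding scheme has to plan for.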
Which should I choose?
• Analytics
• Hadoop (probably) if your data is big
• Spark, other (sometimes faster) solutions available now
• NoSQL
• Let’s talk!
Decisions are about tradeoffs, never a zero-sum game
Fast, Cheap, Good -- Choose Two
CAP Theorem
• More of two, less of one
• Consistency
• Availability
• Partition tolerance
• You have to accept P
• That leaves C and A
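The slide does not say how the balance between C and A gets tuned. One common way it shows up in replicated stores is quorum arithmetic, sketched below under the assumption of `n` replicas per key, with reads waiting for `r` replies and writes for `w` acknowledgements. The function names are made up for this example.

```python
def overlaps(n: int, r: int, w: int) -> bool:
    """True when every read quorum must intersect every write quorum."""
    return r + w > n

def tolerated_failures(n: int, required: int) -> int:
    """Replicas that can be unreachable while `required` replies are still possible."""
    return n - required
```

For example, with three replicas, reading and writing at two replicas each gives overlapping quorums but tolerates only one unreachable replica; reading and writing at one replica tolerates two, at the cost of possibly stale reads.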
How To Scale Anything
• Partition By Function
• Split Horizontally
• Avoid Distributed Transactions
• Avoid Synchronous Coupling
• Virtualize Everything
• Cache Appropriately
Partition By Function
• Don’t put everything in the same database
• Physical
• Pools of Machines
• Geographical Distribution
• Automatic sharding (look for this)
• Make sure it works!
• Virtual
• Logical Tables, Schema
• Not 100% necessary, but schema is nice
Partitioning (cont.)
• Pros
• Isolate failure
• To a region
• To a service
• Simplify Failover
• Cons
• Your DB has to handle multi-region replication
• If you chose CP (CAP) you’re going to have a bad time
• AP systems do OK here (Cassandra actually excels)
• “Relational” part of databases becomes complex
• Everything gets denormalized
Split Horizontally
• Scaling Vertically is easy
• To a point, then it gets expensive... fast.
• Easy if your system has no state to maintain
• Or if the states are known, and small
• Sharding over dependent fields complicates design
• Some things distribute themselves easily
• key/value stores
• Others not so much
• B-tree indexes, foreign keys
• P2P architecture is helpful when splitting
• In other words, avoid masters
Split Horizontally (cont.)
• Pros
• Can be as fast or faster than traditional design
• Can scale up as long as you can afford more machines
• Scaling is easy if you avoid having masters
• Replication and failover don’t have to be special cases
• Cons
• Even logical pieces of your app are distributed over many machines
• example: your catalog is not all in one place
• Real time analytics is difficult, or slower
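To make the "key/value stores distribute easily" and "avoid masters" points concrete, here is a toy consistent-hash ring. It is a sketch of the general technique, not of any particular product's implementation; the `Ring` class, the `vnodes` parameter, and the node names are assumptions for illustration.

```python
import bisect
import hashlib

def _token(value: str) -> int:
    """Stable 64-bit position on the ring for a string."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

class Ring:
    """A toy consistent-hash ring: any node can compute ownership locally."""

    def __init__(self, nodes, vnodes=64):
        # Each node claims several positions ("virtual nodes") to even out load.
        self._ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    def owner(self, key: str) -> str:
        # First token clockwise from the key's token, wrapping around the ring.
        idx = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        return self._ring[idx][1]
```

Because ownership is a pure function of the key and the node list, no master has to be consulted, and adding a node only moves the keys that the new node takes over.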
Avoid Distributed Transactions
• Have you tried this?
• Hard to do right
• Paxos gives us some hope
• CAS in Cassandra 2.0 looks promising
• Even then, it’s not good for everything!
• MVCC works for many use cases
• Compensating Mechanisms
• Customer Service (Amazon, inventory)
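The compare-and-set idea mentioned above can be sketched with a single-process, in-memory stand-in. This is not how Cassandra implements it (the slide points to Paxos for that); the `Store` class, `reserve_seat`, and the seat example are hypothetical and only show the semantics: a write succeeds only if the current value is what the caller expected.

```python
import threading

class Store:
    """In-memory stand-in for a store offering compare-and-set on one key."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def compare_and_set(self, key, expected, new) -> bool:
        """Write `new` only if the current value equals `expected`."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True

def reserve_seat(store: Store, seat: str, customer: str) -> bool:
    # Succeeds for exactly one customer; the loser gets False and can be
    # handled by a compensating mechanism instead of a distributed transaction.
    return store.compare_and_set(seat, None, customer)
```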
Avoid Distributed Transactions
• Pros
• Consistency in a distributed environment
• Cons
• Slow
• Overkill
• Did I say slow?
• We chose CP so we get less A
• What happens when they don’t succeed?
• Do we shut the whole thing down?
Avoid Synchronous Coupling
• What?
• A or B can be down
• A can be down, B continues to work
• B can suffer, while A continues to work
• If your recommendation engine fails, your customers can still buy stuff!
• Master/Slave failover is a good example of synchronous coupling
• Master is down, slave needs to take over, but in the meantime... what happens?
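The recommendation-engine example can be sketched as follows. The function names and the page structure are invented for illustration; the idea is simply that a failure in the optional service is caught at the boundary so the critical path keeps working.

```python
def fetch_recommendations(customer_id: str) -> list:
    """Hypothetical call to a separate recommendation service; here it always fails."""
    raise ConnectionError("recommendation service unavailable")

def render_checkout(customer_id: str, cart: list, recommender=fetch_recommendations) -> dict:
    """Build the checkout page; recommendations are optional, the cart is not."""
    try:
        suggestions = recommender(customer_id)
    except (ConnectionError, TimeoutError):
        suggestions = []  # Service B is down; service A keeps working.
    return {"cart": cart, "suggestions": suggestions, "can_buy": True}
```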
Avoid Synchronous Coupling
• Pros
• Fewer shared dependencies means less failure
• Less failure means more total uptime
• For the whole
• Less coupling means that your application topology is more modular
• Introducing new, decoupled services is less risky
• Cons
• More duplication of your infrastructure
• e.g. now you have an application stack for each of your services.
Avoid Synchronous Processing Flows
• AKA
• Blocking Sockets
• Serialized Processes
• Locking in General
• Do what is important FIRST
• Take their money
• Modify Inventory
• Other less important stuff can be queued
• Triggers
• Joins
• Stored Procedures
• Consistency Checks
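A minimal sketch of "do what is important first, queue the rest" is shown below. It uses Python's in-process `queue.Queue` as a stand-in for whatever durable queue a real system would use; the task names, the order fields, and the `place_order` and `drain` helpers are assumptions for this example.

```python
import queue

deferred = queue.Queue()  # stand-in for a durable message queue

def place_order(order: dict, inventory: dict) -> str:
    """Do the critical steps inline; enqueue everything else."""
    item = order["item"]
    if inventory.get(item, 0) < 1:
        return "rejected"
    inventory[item] -= 1                              # modify inventory
    order["charged"] = True                           # take their money (placeholder)
    deferred.put(("consistency_check", order["id"]))  # less important: do it later
    deferred.put(("send_receipt", order["id"]))
    return "accepted"

def drain(handler) -> int:
    """Worker loop body: process queued tasks and return how many were handled."""
    handled = 0
    while True:
        try:
            task = deferred.get_nowait()
        except queue.Empty:
            return handled
        handler(task)
        handled += 1
```

The customer-facing call returns as soon as the inventory and payment steps finish; a separate worker can call `drain` on its own schedule.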
Avoid Synchronous Processing Flows
• Pros
• Critical operations will not block for nice-to-haves
• Easy to monitor queues and assign priorities to tasks
• Problem areas are easier to identify
• Cons
• Race conditions
• More up-front development cost
Virtualize Everything
• DON’T
• Pick your database because it has a sexy API
• Pick your database because it worked for somebody else
• DO
• Pick a database that will fit with your use case
• Virtualize your data model
• Encourage manipulation of your logical models
• DO NOT force interaction with your database
• Good virtualization means that you can change your data store later...
• And most of your code will still work.
Virtualize Everything
• Virtualization isn’t just for the programmer
• Things fall apart
• Requests have to be re-routed
• Parts Replaced
• APIs change
• Good virtualization means you can make changes without impacting availability
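One way to read "virtualize your data model" is to have application code depend on a logical interface rather than on a driver. The sketch below assumes a hypothetical `CustomerRepository` interface with an in-memory backend; a class backed by a real data store would implement the same two methods.

```python
from abc import ABC, abstractmethod
from typing import Optional

class CustomerRepository(ABC):
    """The logical model the application codes against."""

    @abstractmethod
    def save(self, customer_id: str, profile: dict) -> None: ...

    @abstractmethod
    def find(self, customer_id: str) -> Optional[dict]: ...

class InMemoryCustomers(CustomerRepository):
    """One interchangeable backend; a database-backed class would mirror it."""

    def __init__(self):
        self._rows = {}

    def save(self, customer_id, profile):
        self._rows[customer_id] = dict(profile)

    def find(self, customer_id):
        return self._rows.get(customer_id)

def rename(repo: CustomerRepository, customer_id: str, name: str) -> None:
    # Application code depends only on the interface, not on a driver or query language.
    profile = repo.find(customer_id) or {}
    profile["name"] = name
    repo.save(customer_id, profile)
```

Swapping the data store later then means writing a new `CustomerRepository` implementation, while `rename` and similar code stay unchanged.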
Cache Appropriately
• You can’t cache everything
• But you can cache stuff that doesn’t change
• Or is expensive to retrieve
Cache Appropriately
• Pros
• Cache is fast (compared to traditional RDBMS access)
• Can give you a performance buffer
• Cons
• Cache Coherence
• Cache Dependency
• Is it a SPOF?
• What if it all doesn’t fit?
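The coherence and "what if it doesn't all fit?" concerns can be seen in even a tiny cache-aside helper. The sketch below is illustrative only: the `TTLCache` class, its parameters, and the first-in eviction policy are assumptions, chosen to show expiry (bounding staleness) and a size cap (bounding memory).

```python
import time

class TTLCache:
    """Tiny cache-aside helper with expiry and a size cap."""

    def __init__(self, loader, ttl_seconds=60.0, max_items=1024, clock=time.monotonic):
        self._loader = loader        # slow path, e.g. a database read
        self._ttl = ttl_seconds
        self._max = max_items
        self._clock = clock
        self._items = {}             # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._items.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = self._loader(key)    # miss or expired: fall through to the slow store
        if key not in self._items and len(self._items) >= self._max:
            self._items.pop(next(iter(self._items)))  # evict the oldest insertion
        self._items[key] = (now + self._ttl, value)
        return value
```

If the cache is empty or unavailable, every `get` still works by going to the loader, which is one way to keep the cache from becoming a single point of failure.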
What about NoSQL?
• All of this applies
• Evaluate Products on their Strengths
• If easy things are easy
• The hard might be impossible
• Pick something that makes the hard things possible
What are the ‘easy’ things?
• Serialization Formats
• JSON/BSON
• Data Models
• HTTP/REST/JSON APIs
• NodeJS Drivers!
• etc.
The ‘hard’ things
• Automatic Sharding
• Where does the data go?
• How do I find it?
• How do I add another?
• Multi DC
• Replication
• No SPOF
• Anti-Entropy
• Continuous Availability
• Upgrades
• Failure
• Etc.
What should you use?
• Your decision
• Not every database is a fit for every problem.
DataStax Enterprise
• DSE
• Cassandra (OLTP)
• Analytics
• Search
• The hard things are possible
• We’re making the easy things easier

QCon talk
