Qcon talk

QConSP Talk

Published in: Technology, News & Politics

  1. ©2013 DataStax Confidential. Do not distribute without consent.
     Benjamin Coverston, DSE Architect, DataStax Inc.
     NoSQL, Big Data, and Real Time
     Monday, September 2, 13
  2. Who am I?
     • Ben Coverston
     • DSE Architect
     • DataStax since 2010
     • Previous Experience in the Travel Industry
     • Low Cost Airlines / Web Reservations
     • Past: HP / Accenture
     • Lived in Santa Catarina for a few years.
  3. What is it? NoSQL
  5. What is NoSQL?
     NoSQL is a term coined by Carlo Strozzi and repurposed by Eric Evans to refer to “some” storage systems. The NoSQL term should be used as in the Not-Only-SQL and not as No to SQL or Never SQL. -- Alex Popescu
  6. What is NoSQL (Cont.)
     • It’s not:
     • No to SQL
     • About performance
     • About scaling
     • ACID
     • Eventual consistency
     • Volume
     • It is:
     • About choice
  7. Diversity in Data
     • Big Data has the 3 (or 4 or 5) V’s
     • Volume
     • Variety
     • Velocity
     • Variability (sometimes)
     • Value (other times)
  8. Diversity in Data
     • The V’s don’t cover everything
     • Availability is important
     • Your use case is important too
  9. Is NoSQL Big Data?
     • You can store Big Data with an RDBMS
     • Is it easy?
     • Is it cost effective?
     • What kind of compromises do you have to make?
  10. The Problems
      • In general there are two classes of data problems
      • OLTP (Real-Time)
      • Analytics (Batch)
      • Usually you want both
      • No solution is perfect for everyone
      • Popularity is no indication of fitness
  11. Use Cases
      • OLTP
      • Low Latency
      • High Throughput
      • LOB Applications
      • Batch
      • Predictive Models
      • Complex Queries
      • Tomorrow (or precalculated, but now we need OLTP)
  12. Where to put your ‘Stuff’
      • Sharded RDBMS
      • MPP -- Greenplum, Teradata
      • Hadoop
      • Key/Value
      • Columnar
      • Other
  13. Why not just one?
      • Analytics
      • Optimize serial IO
      • Limitations in Storage
      • OLTP
      • Working Set
      • Distribution
      • Availability
      • Storage Medium
  14. Why Do We Need Something Else?
      • ACID semantics are often overkill
      • ACID also makes the database layer brittle
      • This means you get less Availability (CAP Theorem)
  15. The Application Stack
      (Diagram: www.example.com; load balancers LB1, LB2, LB3; web servers ws1 through ws9; cache; DB# 1, 2, 3, 4)
  16. Sharding
      • Storage Limitations
      • Working Set
      • So just make more!
  17. (“The eBay Architecture,” Randy Shoup and Dan Pritchett)
  18. But Sharding
      • Is Painful
      • Requires ‘something else’
      • Most NoSQL solutions auto-shard
      • Sharding requires tradeoffs,
      • which means your application will need to change
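
One way to see why manual sharding is painful is to look at what the application has to do when the shard count changes. The sketch below is only an illustration, not anything from the talk: it assumes naive `hash(key) % num_shards` placement, uses MD5 purely to get a stable hash, and the `user:N` key format and helper names are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (MD5 chosen only for determinism)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical keys; any string identifiers would do.
keys = [f"user:{i}" for i in range(10_000)]
moved = fraction_moved(keys, 4, 5)
print(f"{moved:.0%} of keys change shard when going from 4 to 5 shards")
```

With modulo placement most keys land on a different shard after adding a single shard, so the application (or something else) has to migrate them and route reads correctly in the meantime.
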
  20. Which should I choose?
      • Analytics
      • Hadoop (probably) if your data is big
      • Spark, other (sometimes faster) solutions available now
      • NoSQL
      • Let’s talk!
  21. Decisions are about tradeoffs, never a zero-sum game
      Fast, Cheap, Good -- Choose Two
  22. CAP Theorem
      • More of two, less of one
      • Consistency
      • Availability
      • Partition tolerance
      • You have to accept P
      • That leaves C and A
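
The C-versus-A trade can be made concrete with quorum arithmetic. The sketch below assumes a Dynamo-style system with `n` replicas where a read waits for `r` of them and a write waits for `w` of them; that model and the function names are assumptions for illustration and are not taken from the slides. Reads are guaranteed to see the latest acknowledged write only when the two quorums overlap, that is, when `r + w > n`.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum (r + w > n)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


def tolerated_failures(n: int, quorum: int) -> int:
    """Replicas that can be unreachable while a quorum of this size still succeeds."""
    return n - quorum


# With 3 replicas: small quorums favor availability, larger ones favor consistency.
for r, w in [(1, 1), (2, 2), (3, 1)]:
    print(
        f"r={r} w={w} overlap={quorums_overlap(3, r, w)} "
        f"reads survive {tolerated_failures(3, r)} failures, "
        f"writes survive {tolerated_failures(3, w)}"
    )
```
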
  23. How To Scale Anything
      • Partition By Function
      • Split Horizontally
      • Avoid Distributed Transactions
      • Avoid Synchronous Coupling
      • Virtualize Everywhere
      • Cache Everything
  24. Partition By Function
      • Don’t put everything in the same database
      • Physical
      • Pools of Machines
      • Geographical Distribution
      • Automatic sharding (look for this)
      • Make sure it works!
      • Virtual
      • Logical Tables, Schema
      • Not 100% necessary, but schema is nice
  25. Partitioning (cont.)
      • Pros
      • Isolate failure
      • To a region
      • To a service
      • Simplify Failover
      • Cons
      • Your DB has to handle multi-region replication
      • If you chose CP (CAP) you’re going to have a bad time
      • AP systems do OK here (Cassandra actually excels)
      • “Relational” part of databases becomes complex
      • Everything gets denormalized
  26. Split Horizontally
      • Scaling Vertically is easy
      • To a point, then it gets expensive.. fast..
      • Easy if your system has no state to maintain
      • Or if the states are known, and small
      • Sharding over dependent fields complicates design
      • Some things distribute themselves easily
      • key/value stores
      • Others not so much
      • BTree indexes, foreign keys
      • P2P architecture is helpful when splitting
      • In other words, avoid masters
  27. Split Horizontally (cont)
      • Pros
      • Can be as fast or faster than traditional design
      • Can scale up as long as you can afford more machines
      • Scaling is easy if you avoid having masters
      • Replication and failover don’t have to be special cases
      • Cons
      • Even logical pieces of your app are distributed over many machines
      • example: your catalog is not all in one place
      • Real time analytics is difficult, or slower
  28. Avoid Distributed Transactions
      • Have you tried this?
      • Hard to do right
      • Paxos gives us some hope
      • CAS in Cassandra 2.0 looks promising
      • Even then, it’s not good for everything!
      • MVCC works for many use cases
      • Compensating Mechanisms
      • Customer Service (Amazon, inventory)
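
To show the shape of a compare-and-set (CAS) update, here is a minimal single-process sketch. It is not Cassandra's implementation (which runs Paxos across replicas); the `VersionedStore` class, the `reserve_seat` helper and the `flight-42:seats` key are hypothetical names used only to illustrate the retry loop an application writes around a conditional update.

```python
import threading
from typing import Dict, Optional


class VersionedStore:
    """In-memory key/value store with compare-and-set (single process; illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> Optional[int]:
        with self._lock:
            return self._data.get(key)

    def compare_and_set(self, key: str, expected: Optional[int], new: int) -> bool:
        """Write `new` only if the current value equals `expected`; report whether it applied."""
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new
            return True


def reserve_seat(store: VersionedStore, key: str) -> bool:
    """Decrement a seat counter, retrying when another writer got there first."""
    while True:
        current = store.get(key)
        if current is None or current <= 0:
            return False  # sold out (or unknown flight)
        if store.compare_and_set(key, current, current - 1):
            return True


seats = VersionedStore()
seats.compare_and_set("flight-42:seats", None, 2)  # create only if absent
results = [reserve_seat(seats, "flight-42:seats") for _ in range(3)]
print(results)
```

The conditional write replaces a lock held across the read and the write, which is why it suits narrow cases such as inventory counters but, as the slide says, is not good for everything.
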
  29. Avoid Distributed Transactions
      • Pros
      • Consistency in a distributed environment
      • Cons
      • Slow
      • Overkill
      • Did I say slow?
      • We chose CP so we get less A
      • What happens when they don’t succeed?
      • Do we shut the whole thing down?
  30. Avoid Synchronous Coupling
      • What?
      • A or B can be down
      • A can be down, B continues to work
      • B can suffer, while A continues to work
      • If your recommendation engine fails, your customers can still buy stuff!
      • Master/Slave failover is a good example of synchronous coupling
      • Master is down, slave needs to take over, but in the meantime.. what happens?
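
The recommendation-engine example can be sketched as a best-effort call around an optional dependency. All names below (`render_checkout`, `recommendations_or_default`, `broken_engine`) are hypothetical; the point is only that a failure in the nice-to-have service does not propagate into the purchase path.

```python
from typing import Callable, List


def recommendations_or_default(
    fetch: Callable[[str], List[str]], user_id: str
) -> List[str]:
    """Return recommendations, or an empty list if the optional service is failing."""
    try:
        return fetch(user_id)
    except Exception:
        # The purchase path must not depend on a nice-to-have service.
        return []


def render_checkout(
    user_id: str, cart_total: float, fetch: Callable[[str], List[str]]
) -> dict:
    """Build the checkout page; recommendations are best-effort."""
    return {
        "user": user_id,
        "total": cart_total,
        "recommended": recommendations_or_default(fetch, user_id),
    }


def broken_engine(user_id: str) -> List[str]:
    """Stand-in for a recommendation service that is currently down."""
    raise ConnectionError("recommendation engine is down")


page = render_checkout("u1", 19.99, broken_engine)
print(page)
```

A production version would usually add a timeout so that a slow dependency is treated the same way as a failed one.
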
  31. Avoid Synchronous Coupling
      • Pros
      • Fewer shared dependencies means less failure
      • Less failure means more total uptime
      • For the whole
      • Less coupling means that your application topology is more modular
      • Introducing new, decoupled services is less risky
      • Cons
      • More duplication of your infrastructure
      • e.g. now you have an application stack for each of your services.
  32. Avoid Synchronous Processing Flows
      • AKA
      • Blocking Sockets
      • Serialized Processes
      • Locking in General
      • Do what is important FIRST
      • Take their money
      • Modify Inventory
      • Other less important stuff can be queued
      • Triggers
      • Joins
      • Stored Procedures
      • Consistency Checks
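
A rough sketch of "do what is important first, queue the rest" follows. The in-process `queue.Queue`, the `checkout` and `drain` functions, and the logged task names are all assumptions for illustration; a real deployment would typically hand deferred work to separate workers or a message broker.

```python
import queue
from typing import Callable, Dict, List

deferred: "queue.Queue[Callable[[], None]]" = queue.Queue()
inventory: Dict[str, int] = {"widget": 3}
log: List[str] = []


def checkout(item: str, amount: float) -> bool:
    """Do the critical steps inline and queue everything else."""
    if inventory.get(item, 0) <= 0:
        return False
    log.append(f"charged {amount:.2f}")  # take their money
    inventory[item] -= 1                 # modify inventory
    # Less important work is deferred instead of blocking the purchase.
    deferred.put(lambda: log.append(f"emailed receipt for {item}"))
    deferred.put(lambda: log.append(f"consistency check for {item}"))
    return True


def drain() -> int:
    """Run queued work; a real system would do this in separate workers."""
    done = 0
    while True:
        try:
            task = deferred.get_nowait()
        except queue.Empty:
            return done
        task()
        done += 1


ok = checkout("widget", 9.5)
pending_before = deferred.qsize()
ran = drain()
print(ok, pending_before, ran, log)
```
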
  33. Avoid Synchronous Processing Flows
      • Pros
      • Critical operations will not block for nice-to-haves
      • Queues are easy to monitor, and tasks can be assigned priorities
      • Problem areas are easier to identify
      • Cons
      • Race conditions
      • More up front development cost
  34. Virtualize Everything
      • DON’T
      • Pick your database because it has a sexy API
      • Pick your database because it worked for somebody else
      • DO
      • Pick a database that will fit with your use case
      • Virtualize your data model
      • Encourage manipulation of your logical models
      • DO NOT force interaction with your database
      • Good virtualization means that you can change your data store later...
      • And most of your code will still work.
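
One common way to virtualize a data model is to code against a small store interface rather than a database driver. The sketch below is an assumption about what that could look like; `Customer`, `CustomerStore` and `InMemoryCustomerStore` are hypothetical names, and the in-memory class only stands in for a real backend.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Customer:
    customer_id: str
    name: str


class CustomerStore(Protocol):
    """What the application needs; says nothing about the database behind it."""

    def save(self, customer: Customer) -> None: ...

    def find(self, customer_id: str) -> Optional[Customer]: ...


class InMemoryCustomerStore:
    """Stand-in backend; a Cassandra- or RDBMS-backed class would expose the same methods."""

    def __init__(self) -> None:
        self._rows: Dict[str, Customer] = {}

    def save(self, customer: Customer) -> None:
        self._rows[customer.customer_id] = customer

    def find(self, customer_id: str) -> Optional[Customer]:
        return self._rows.get(customer_id)


def rename_customer(store: CustomerStore, customer_id: str, new_name: str) -> bool:
    """Application logic written against the logical model only."""
    existing = store.find(customer_id)
    if existing is None:
        return False
    store.save(Customer(customer_id, new_name))
    return True


customers = InMemoryCustomerStore()
customers.save(Customer("c1", "Ana"))
renamed = rename_customer(customers, "c1", "Ana Souza")
print(renamed, customers.find("c1"))
```

Because `rename_customer` only sees the interface, swapping the data store later means writing one new class rather than touching the application logic.
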
  35. Virtualize Everything
      • Virtualization isn’t just for the programmer
      • Things fall apart
      • Requests have to be re-routed
      • Parts Replaced
      • APIs change
      • Good virtualization means you can make changes without impacting availability
  36. Cache Appropriately
      • You can’t cache everything
      • But you can cache stuff that doesn’t change
      • Or is expensive to retrieve
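
A minimal cache-aside sketch, assuming a time-to-live (TTL) is an acceptable way to bound staleness for data that rarely changes or is expensive to retrieve. The `TTLCache` class, the injected clock and the `slow_lookup` loader are hypothetical and exist only to make the example deterministic.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache-aside helper: entries expire after `ttl_seconds`."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, object]] = {}
        self.misses = 0

    def get_or_load(self, key: str, loader: Callable[[str], object]) -> object:
        """Return a fresh cached value, or call `loader` and remember the result."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        self.misses += 1
        value = loader(key)
        self._entries[key] = (now, value)
        return value


def slow_lookup(key: str) -> str:
    """Stand-in for an expensive database read."""
    return key.upper()


fake_now = [0.0]  # controllable clock so the example does not need to sleep
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get_or_load("sku-1", slow_lookup)
second = cache.get_or_load("sku-1", slow_lookup)  # served from cache
fake_now[0] = 61.0
third = cache.get_or_load("sku-1", slow_lookup)   # expired, reloaded
print(first, cache.misses)
```
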
  37. Cache Appropriately
      • Pros
      • Cache is fast (compared to traditional RDBMS access)
      • Can give you a performance buffer
      • Cons
      • Cache Coherence
      • Cache Dependency
      • Is it a SPOF?
      • What if it all doesn’t fit?
  38. What about NoSQL?
      • All of this applies
      • Evaluate Products on their Strengths
      • If easy things are easy
      • The hard might be impossible
      • Pick something that makes the hard things possible
  39. What are the ‘easy’ things?
      • Serialization Formats
      • JSON/BSON
      • Data Models
      • HTTP/REST/JSON APIs
      • Node.js Drivers!
      • etc.
  40. The ‘hard’ things
      • Automatic Sharding
      • Where does the data go?
      • How do I find it?
      • How do I add another?
      • Multi DC
      • Replication
      • No SPOF
      • Anti-Entropy
      • Continuous Availability
      • Upgrades
      • Failure
      • Etc.
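
The three sharding questions (where does the data go, how do I find it, how do I add another) are often answered with a consistent-hash ring. The sketch below is a generic illustration under stated assumptions, not the partitioner of any particular product: MD5 is used only for a stable hash, and the `HashRing` class, node names and 64 virtual nodes per machine are hypothetical choices.

```python
import bisect
import hashlib
from typing import Dict, List


def _hash(value: str) -> int:
    """Stable 64-bit token derived from MD5 (chosen only for determinism)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: List[str], vnodes: int = 64) -> None:
        self._vnodes = vnodes
        self._tokens: List[int] = []      # sorted token positions on the ring
        self._owner: Dict[int, str] = {}  # token -> node that owns it
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """How do I add another? Insert the node's tokens; nothing else moves."""
        for i in range(self._vnodes):
            token = _hash(f"{node}#{i}")
            bisect.insort(self._tokens, token)
            self._owner[token] = node

    def node_for(self, key: str) -> str:
        """Where does the data go / how do I find it? First token clockwise from the key."""
        idx = bisect.bisect_right(self._tokens, _hash(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]


ring_keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
before = {k: ring.node_for(k) for k in ring_keys}
ring.add_node("node-e")
after = {k: ring.node_for(k) for k in ring_keys}
moved_keys = [k for k in ring_keys if before[k] != after[k]]
print(f"{len(moved_keys) / len(ring_keys):.0%} of keys moved, all to the new node")
```

Only the keys that now belong to the new node move, which is what lets a store add capacity without reshuffling everything.
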
  41. What should you use?
      • Your decision
      • Not every database is a fit for every problem.
  42. DataStax Enterprise
      • DSE
      • Cassandra (OLTP)
      • Analytics
      • Search
      • The hard things are possible
      • We’re making the easy things easier