Qcon talk

©2013 DataStax Confidential. Do not distribute without consent.
Benjamin Coverston
DSE Architect, DataStax Inc.
NoSQL, Big Data, and Real Time
Monday, September 2, 13

Who am I?
• Ben Coverston
• DSE Architect
• DataStax since 2010
• Previous Experience in the Travel Industry
• Low Cost Airlines / Web Reservations
• Past: HP / Accenture
• Lived in Santa Catarina for a few years.

What is it?
NoSQL

What is NoSQL?
NoSQL is a term coined by Carlo Strozzi and
repurposed by Eric Evans to refer to “some”
storage systems. The NoSQL term should be used
as in the Not-Only-SQL and not as No to SQL or
Never SQL.
-- Alex Popescu

What is NoSQL (Cont.)
• It’s not
• No to SQL
• About performance
• About scaling
• ACID
• Eventual consistency
• Volume
• It is:
• About choice

Diversity in Data
• Big Data has the 3 (or 4 or 5) V’s
• Volume
• Variety
• Velocity
• Variability (sometimes)
• Value (other times)

Diversity in Data
• The V’s don’t cover everything
• Availability is important
• Your use case is important too

Is NoSQL Big Data?
• You can store Big Data with an RDBMS
• Is it easy?
• Is it cost effective?
• What kind of compromises do you have to make?

The Problems
• In general there are two classes of data problems
• OLTP (Real-Time)
• Analytics (Batch)
• Usually you want both
• No solution is perfect for everyone
• Popularity is no indication of fitness

Use Cases
• OLTP
• Low Latency
• High Throughput
• LOB Applications
• Batch
• Predictive Models
• Complex Queries
• Tomorrow (or precalculated, but now we need OLTP)

Where to put your ‘Stuff’
• Sharded RDBMS
• MPP -- Greenplum, Teradata
• Hadoop
• Key/Value
• Columnar
• Other

Why not just one?
• Analytics
• Optimize serial IO
• Limitations in Storage
• OLTP
• Working Set
• Distribution
• Availability
• Storage Medium

Why Do We Need Something Else?
• ACID semantics are often overkill
• ACID also makes the database layer brittle
• This means you get less Availability (CAP Theorem)

The Application Stack
[Diagram: www.example.com -> load balancers LB1-LB3 -> web servers ws1-ws9 -> cache -> databases 1-4]

Sharding
• Storage Limitations
• Working Set
• So just make more!

(“The eBay Architecture,” Randy Shoup and Dan Pritchett)

But Sharding
• Is Painful
• Requires ‘something else’
• Most NoSQL solutions auto-shard
• Sharding requires tradeoffs.
• Which means your application will need to change
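
To make the pain concrete, here is a minimal sketch of hand-rolled sharding, assuming keys are routed with a stable hash modulo the shard count. The function names and the `user:N` key format are hypothetical and not from the talk. It measures how many keys land on a different shard when a fifth shard is added to four.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Route a key to a shard index using hash-modulo (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for(key, old_count) != shard_for(key, new_count)
    )
    return moved / len(keys)


if __name__ == "__main__":
    sample_keys = [f"user:{i}" for i in range(10_000)]  # made-up keys
    print(moved_fraction(sample_keys, 4, 5))
```

With a uniform hash, a key keeps its shard only when its hash gives the same remainder modulo 4 and modulo 5, so most keys have to be moved. That migration is the 'something else' the application ends up owning when sharding is done by hand.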

Which should I choose?
• Analytics
• Hadoop (probably) if your data is big
• Spark, other (sometimes faster) solutions available now
• NoSQL
• Let’s talk!

Decisions are about tradeoffs, never a zero-sum game
Fast, Cheap, Good -- Choose Two

CAP Theorem
• More of two, less of one
• Consistency
• Availability
• Partition tolerance
• You have to accept P
• That leaves C and A
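
One way to picture the trade between C and A is replica quorum arithmetic. This is an illustration added here, not something the slides spell out, and the function names are hypothetical. With N replicas, a read quorum R and a write quorum W are guaranteed to share at least one replica when R + W > N. Larger quorums give stronger consistency but tolerate fewer unavailable replicas.

```python
def quorums_overlap(replicas: int, read_quorum: int, write_quorum: int) -> bool:
    """True when every read quorum shares at least one replica with every write quorum."""
    return read_quorum + write_quorum > replicas


def tolerated_failures(replicas: int, quorum: int) -> int:
    """How many replicas can be unreachable while a quorum of this size is still reachable."""
    return replicas - quorum


if __name__ == "__main__":
    # Three replicas, quorum reads and writes: overlapping, one replica may be down.
    print(quorums_overlap(3, 2, 2), tolerated_failures(3, 2))
    # Three replicas, single-replica reads and writes: more available, no overlap guarantee.
    print(quorums_overlap(3, 1, 1), tolerated_failures(3, 1))
```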

How To Scale Anything
• Partition By Function
• Split Horizontally
• Avoid Distributed Transactions
• Avoid Synchronous Coupling
• Virtualize Everything
• Cache Everything

Partition By Function
• Don’t put everything in the same database
• Physical
• Pools of Machines
• Geographical Distribution
• Automatic sharding (look for this)
• Make sure it works!
• Virtual
• Logical Tables, Schema
• Not 100% necessary, but schema is nice

Partitioning (cont.)
• Pros
• Isolate failure
• To a region
• To a service
• Simplify Failover
• Cons
• Your DB has to handle multi-region replication
• If you chose CP (CAP) you’re going to have a bad time
• AP systems do OK here (Cassandra actually excels)
• “Relational” part of databases becomes complex
• Everything gets denormalized

Split Horizontally
• Scaling Vertically is easy
• To a point, then it gets expensive.. fast..
• Easy if your system has no state to maintain
• Or if the states are known, and small
• Sharding over dependent fields complicates design
• Some things distribute themselves easily
• key/value stores
• Others not so much
• BTree indexes, foreign keys
• P2P architecture is helpful when splitting
• In other words, avoid masters

Split Horizontally (cont)
• Pros
• Can be as fast or faster than traditional design
• Can scale up as long as you can afford more machines
• Scaling is easy if you avoid having masters
• Replication and failover don’t have to be special cases
• Cons
• Even logical pieces of your app are distributed over many machines
• example: your catalog is not all in one place
• Real time analytics is difficult, or slower

Avoid Distributed Transactions
• Have you tried this?
• Hard to do right
• Paxos gives us some hope
• CAS in Cassandra 2.0 looks promising
• Even then, it’s not good for everything!
• MVCC works for many use cases
• Compensating Mechanisms
• Customer Service (Amazon, inventory)
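
As a rough sketch of the compare-and-set idea, here is a single-process, in-memory store that accepts a write only if the caller saw the latest version. This is not how Cassandra implements CAS; `VersionedStore`, `reserve_seat`, and the seat example are hypothetical names chosen for illustration.

```python
import threading


class VersionedStore:
    """In-memory key/value store with compare-and-set (illustrative only)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)
        self._lock = threading.Lock()

    def read(self, key):
        """Return (version, value); a missing key reads as (0, None)."""
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, new_value) -> bool:
        """Write only if nobody has written since expected_version was read."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False
            self._data[key] = (current_version + 1, new_value)
            return True


def reserve_seat(store, seat, passenger, max_attempts=3) -> bool:
    """Try to claim a seat without holding a lock across the read and the write."""
    for _ in range(max_attempts):
        version, holder = store.read(seat)
        if holder is not None:
            return False  # someone else already holds the seat
        if store.compare_and_set(seat, version, passenger):
            return True
    return False
```

A caller that loses the race sees `False` and can fall back to a compensating step, such as offering another seat, instead of blocking everyone behind a distributed transaction.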

Avoid Distributed Transactions
• Pros
• Consistency in a distributed environment
• Cons
• Slow
• Overkill
• Did I say slow?
• We chose CP so we get less A
• What happens when they don’t succeed?
• Do we shut the whole thing down?

Avoid Synchronous Coupling
• What?
• A or B can be down
• A can be down, B continues to work
• B can suffer, while A continues to work
• If your recommendation engine fails, your customers can still buy stuff!
• Master/Slave failover is a good example of synchronous coupling
• Master is down, slave needs to take over, but in the meantime.. what happens?
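
The recommendation-engine point can be sketched as a call that degrades to an empty result instead of failing the checkout. All names here are hypothetical, and `fetch_recommendations` is a stand-in that simulates a service that is down.

```python
def fetch_recommendations(customer_id):
    """Stand-in for a remote recommendation service that is currently down."""
    raise ConnectionError("recommendation service unavailable")


def recommendations_or_default(customer_id, fetch=fetch_recommendations):
    """Return recommendations, or an empty list if the service cannot be reached."""
    try:
        return fetch(customer_id)
    except (ConnectionError, TimeoutError):
        return []


def render_checkout(customer_id, cart, fetch=fetch_recommendations):
    """Build the checkout page; recommendations are optional, the cart is not."""
    return {
        "customer": customer_id,
        "cart": list(cart),
        "total": sum(price for _, price in cart),
        "recommended": recommendations_or_default(customer_id, fetch),
    }
```

The checkout still renders with the correct total when the recommendation call raises, which is the decoupling the slide describes: B can suffer while A continues to work.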

Avoid Synchronous Coupling
• Pros
• Fewer shared dependencies means less failure
• Less failure means more total uptime
• For the whole
• Less coupling means that your application topology is more modular
• Introducing new, decoupled services is less risky
• Cons
• More duplication of your infrastructure
• e.g. now you have an application stack for each of your services.

Avoid Synchronous Processing Flows
• AKA
• Blocking Sockets
• Serialized Processes
• Locking in General
• Do what is important FIRST
• Take their money
• Modify Inventory
• Other less important stuff can be queued
• Triggers
• Joins
• Stored Procedures
• Consistency Checks
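
A minimal sketch of "do what is important first", assuming an in-process queue and a single background worker. The inventory dictionary, the `audit` task, and the function names are hypothetical. The order path updates inventory inline and pushes everything else onto the queue.

```python
import queue
import threading

deferred = queue.Queue()   # less important work waits here
inventory = {"widget": 5}  # made-up stock levels
audit_log = []             # filled in later by the worker


def place_order(item: str, amount_paid: float) -> bool:
    """Do the critical steps inline; queue the nice-to-haves."""
    if inventory.get(item, 0) <= 0:
        return False
    inventory[item] -= 1                         # modify inventory right away
    deferred.put(("audit", item, amount_paid))   # consistency checks etc. run later
    return True


def worker():
    """Drain deferred tasks; a None task tells the worker to stop."""
    while True:
        task = deferred.get()
        if task is None:
            deferred.task_done()
            break
        audit_log.append(task)
        deferred.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()
```

Calling `deferred.join()` waits until the worker has caught up, and `deferred.qsize()` gives an approximate queue depth to monitor. Between `place_order` returning and the worker running, the audit log lags the inventory, which is the race-condition cost of this design.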

Avoid Synchronous Processing Flows
• Pros
• Critical operations will not block for nice to haves
• Easy to monitor queues and assign priority to tasks
• Problem areas are easier to identify
• Cons
• Race conditions
• More up front development cost

Virtualize Everything
• DON’T
• Pick your database because it has a sexy API
• Pick your database because it worked for somebody else
• DO
• Pick a database that will fit with your use case
• Virtualize your data model
• Encourage manipulation of your logical models
• DO NOT force interaction with your database
• Good virtualization means that you can change your data store later...
• And most of your code will still work.
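
One way to read "virtualize your data model" is to have application code depend on a small interface rather than on a database driver. The sketch below assumes a hypothetical `Customer` model and `CustomerRepository` interface, with an in-memory implementation standing in for a real data store.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Customer:
    """Logical model the application manipulates (hypothetical)."""
    customer_id: str
    name: str


class CustomerRepository(Protocol):
    """What the application needs from any data store."""

    def get(self, customer_id: str) -> Optional[Customer]: ...

    def save(self, customer: Customer) -> None: ...


class InMemoryCustomerRepository:
    """Stand-in store; a real one would wrap a database client."""

    def __init__(self) -> None:
        self._rows = {}  # customer_id -> Customer

    def get(self, customer_id: str) -> Optional[Customer]:
        return self._rows.get(customer_id)

    def save(self, customer: Customer) -> None:
        self._rows[customer.customer_id] = customer


def rename_customer(repo: CustomerRepository, customer_id: str, new_name: str) -> bool:
    """Application logic that never touches the database API directly."""
    customer = repo.get(customer_id)
    if customer is None:
        return False
    repo.save(Customer(customer.customer_id, new_name))
    return True
```

Swapping the data store then means writing another class with the same two methods; `rename_customer` and code like it stay unchanged.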

Virtualize Everything
• Virtualization isn’t just for the programmer
• Things fall apart
• Requests have to be re-routed
• Parts Replaced
• APIs change
• Good virtualization means you can make changes w/o impacting availability

Cache Appropriately
• You can’t cache everything
• But you can cache stuff that doesn’t change
• Or is expensive to retrieve
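
A small sketch of caching values that are expensive to retrieve, assuming a read-through cache with a fixed time-to-live. `TTLCache`, its `loader` callback, and the injectable `clock` are hypothetical names; the TTL is the knob that limits how stale an entry can get.

```python
import time


class TTLCache:
    """Read-through cache whose entries expire after ttl_seconds (illustrative)."""

    def __init__(self, loader, ttl_seconds: float, clock=time.monotonic):
        self._loader = loader   # called on a miss, e.g. a slow database query
        self._ttl = ttl_seconds
        self._clock = clock     # injectable so expiry can be tested
        self._entries = {}      # key -> (expires_at, value)

    def get(self, key):
        """Return the cached value, reloading it if missing or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = self._loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

Because a miss falls through to the loader, the backing store still has to cope with the load when the cache is cold or lost. Nothing here bounds the number of entries, so it does not address what happens when the data does not all fit.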

Cache Appropriately
• Pros
• Cache is fast (compared to traditional RDBMS access)
• Can give you a performance buffer
• Cons
• Cache Coherence
• Cache Dependency
• Is it a SPOF?
• What if it all doesn’t fit?

What about NoSQL?
• All of this applies
• Evaluate Products on their Strengths
• If easy things are easy
• The hard might be impossible
• Pick something that makes the hard things possible

What are the ‘easy’ things?
• Serialization Formats
• JSON/BSON
• Data Models
• HTTP/REST/JSON APIs
• NodeJS Drivers!
• etc.

The ‘hard’ things
• Automatic Sharding
• Where does the data go?
• How do I find it?
• How do I add another?
• Multi DC
• Replication
• No SPOF
• Anti-Entropy
• Continuous Availability
• Upgrades
• Failure
• Etc.
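
The three sharding questions (where does the data go, how do I find it, how do I add another) can be sketched with a consistent hash ring. This is a generic illustration under assumed names (`HashRing`, `vnodes`, node names like `n1`), not the implementation of any particular product.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    """Position of a string on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes: int = 64):
        self._vnodes = vnodes
        self._tokens = []  # sorted ring positions
        self._owners = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """How do I add another? Place the node at several ring positions."""
        for i in range(self._vnodes):
            token = _token(f"{node}#{i}")
            self._owners[token] = node
            bisect.insort(self._tokens, token)

    def node_for(self, key: str) -> str:
        """Where does the data go? To the owner of the next position clockwise."""
        if not self._tokens:
            raise LookupError("ring has no nodes")
        index = bisect.bisect_right(self._tokens, _token(key)) % len(self._tokens)
        return self._owners[self._tokens[index]]
```

Any client holding the same ring can compute the owner of a key locally, so no master has to be asked. When a node is added, the only keys that change owner are the ones the new node takes over.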

What should you use?
• Your decision
• Every database is not a fit for every problem.

DataStax Enterprise
• DSE
• Cassandra (OLTP)
• Analytics
• Search
• The hard things are possible
• We’re making the easy things easier
