Handling the Growth of Data
Piyush Katariya
@AhamPiyush
Growth of Data
What is it?
How to solve it?
Which metrics to consider?
Dive into design internals
Vertical Scaling
Let’s start with a small database, say Postgres
● A few RDBMS tables with relationships
● Average Stats
○ thousands of rows per table
○ OLAP - several hundred real-time queries
○ OLTP - a few hundred updates
● Optimizations
○ Single Node
○ Reasonable CPU frequencies
○ Indexes - unique, single-column B-tree, compound (see the sketch below)
○ Modern SSD
○ Buffer Cache
○ In memory ( $ )
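The index types listed above map directly to plain Postgres DDL. A minimal sketch, assuming a hypothetical "users" table and the psycopg2 driver; the table, columns, and connection string are made up for illustration:

```python
# Minimal sketch: creating the index types mentioned above on a
# hypothetical "users" table through psycopg2 against Postgres.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed DSN
cur = conn.cursor()

# Unique index: at most one row per email.
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_users_email ON users (email)")

# Single-column B-tree index (the Postgres default index type).
cur.execute("CREATE INDEX IF NOT EXISTS idx_users_created_at ON users (created_at)")

# Compound index for queries that filter on both columns together.
cur.execute("CREATE INDEX IF NOT EXISTS idx_users_country_city ON users (country, city)")

conn.commit()
cur.close()
conn.close()
```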
Relatively larger database (1)
● A few or more RDBMS tables with relationships
● Average Stats
○ A few million rows per table
○ Schemaless events data
○ OLAP - several thousand real-time queries
○ OLTP - a few thousand updates
● Optimizations
○ Master-slave replication with read replicas
○ JSON fields
○ Reasonably higher CPU frequencies
○ Advanced indexes - block range (BRIN), bitmap, partial, functional/expression (see the sketch below)
○ Scheduled ReIndexing Jobs
○ Table Partitioning
○ Materialized views
○ Async commits
○ RAID 10
○ Caching solutions - view layer, service layer, ORM layer, database layer
○ In memory ( $$$ )
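Several of these optimizations are one-line DDL statements in Postgres. A rough sketch, again via psycopg2, of a partial index, a functional (expression) index, declarative range partitioning, and a materialized view; the "events" table and every name in it are assumptions for illustration:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed DSN
cur = conn.cursor()

# Range-partitioned parent table: each monthly child table (and its
# indexes) stays small enough to scan and reindex quickly.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id bigint, email text, status text, created_at timestamptz
    ) PARTITION BY RANGE (created_at)
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
""")

# Partial index: only the rows the hot query path actually touches.
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_events_pending
    ON events (created_at) WHERE status = 'pending'
""")

# Functional (expression) index: case-insensitive lookups without rewriting data.
cur.execute("CREATE INDEX IF NOT EXISTS idx_events_lower_email ON events (lower(email))")

# Materialized view: a precomputed rollup, refreshed on a schedule.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_event_counts AS
    SELECT date_trunc('day', created_at) AS day, count(*) AS events
    FROM events GROUP BY 1
""")
cur.execute("REFRESH MATERIALIZED VIEW daily_event_counts")

conn.commit()
cur.close()
conn.close()
```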
Relatively larger database (2)
● A few or more RDBMS tables with relationships
● Average Stats
○ Hundreds of millions of rows
○ Schemaless events, audits, analytics data, real-time decisions based on events
○ Hundreds of thousands of real-time queries
○ Hundreds of thousands of updates
● Optimization ???
○ The data just can’t fit in a SQL engine inexpensively
○ Sharding - the data can’t fit on a single node (see the sketch below)
○ Traditional tools fall short
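Once data no longer fits on a single node, every row has to be routed to a shard by key. A toy sketch of hash sharding in plain Python; the shard list and key format are made up:

```python
import hashlib

# Hypothetical shard identifiers; in practice these would be connection
# strings or node addresses maintained by a routing layer.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    """Deterministically map a row key to one shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))    # the same key always routes to the same shard
print(shard_for("user:1337"))  # different keys spread across the shard set
```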
Horizontal Scaling
Research Papers by Google
Google File System (2003)
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
MapReduce (2004)
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
Bigtable (2006)
http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
MapReduce
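To make the model concrete, here is a toy, single-process word count that mimics the three phases described in the paper (map, shuffle, reduce). This is only an illustration of the idea, not Hadoop code; a real job runs the same steps in parallel across many machines:

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: every document emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, ...}
```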
HDFS and Hadoop - Alternative to GFS
MapR-FS as a better alternative
NoSQL Databases
● CAP-theorem-compliant distributed data structures
● Tunable consistency and availability at the DB or query level (see the sketch below)
● They don’t try to solve every problem, only very specific ones
● Data model specific
○ Data distribution across machines/data centers
○ Data replication for reliability and fault tolerance
○ Data denormalization
○ Physical storage layouts
○ Data compression
○ Querying and Aggregation techniques
● Automatic failover
● Distributed clock synchronization
● Multi data center support
● Integration plugins with other databases
● Community
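As one example of per-query tunable consistency, the DataStax Python driver for Cassandra lets each statement pick its own consistency level. A minimal sketch; the cluster address, keyspace, table, and key are assumptions for illustration:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("app")  # hypothetical keyspace

# Availability-leaning read: any single replica may answer.
fast_read = SimpleStatement(
    "SELECT * FROM events WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# Consistency-leaning read: a majority of replicas must agree.
safe_read = SimpleStatement(
    "SELECT * FROM events WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

row = session.execute(safe_read, ("42",)).one()
print(row)
```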
CAP - Choose any 2
Consistency - Consistent view of dataset
Availability - Read and write at any time
Partition tolerance - the system keeps working despite network partitions between machines
BigTable based DB Design
Gossip Protocol (AP)
RAFT Consensus (CP)
Key Value Databases
Key Value Database - Riak (AP)
Column Family Database - Cluster
Column Family Databases - Physical layout
Column Family - Cassandra (AP)
Column Family - HBase (CP)
Document Database
Document Database - MongoDB (CP)
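A minimal sketch of the document model via pymongo; the database, collection, and document shape are made up for illustration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Documents are schemaless: nested fields, no upfront table definition.
orders.insert_one({
    "user": "user:42",
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
    "status": "pending",
})

# Query on a top-level field and a nested field in one filter.
for order in orders.find({"status": "pending", "items.sku": "A-1"}):
    print(order["_id"], order["user"])
```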
Graph Database
Graph Database - TitanDB layer (CP/CA/AP)
Search Engines and Logs
Distributed Queue/Log/Buffer
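A small sketch of the distributed queue/log idea using the kafka-python client; the broker address, topic name, and consumer group are assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: append an event to the log; the broker persists and replicates it.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-events", b'{"user": "42", "action": "login"}')
producer.flush()

# Consumer: replay the log from the beginning as part of a consumer group.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="analytics",
)
for message in consumer:
    print(message.offset, message.value)
    break  # read a single message for the sake of the example
```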
Computing Engines - HDFS and Spark
Computing Engines - MapR Stack
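A minimal PySpark sketch of the "computing engine over a distributed file system" idea: a word count over files stored in HDFS. The input and output paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read lines from HDFS, split into words, and count them in parallel.
lines = spark.sparkContext.textFile("hdfs:///data/logs/*.txt")
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.saveAsTextFile("hdfs:///data/output/wordcount")
spark.stop()
```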
NewSQL
(F1 and) Google Spanner (2012)
http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
Spanner: Becoming a SQL System (2017)
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46103.pdf
“There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.”
—C.A.R. Hoare
Open Source Spanner - CockroachDB (CP)
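CockroachDB speaks the Postgres wire protocol, so an ordinary Postgres driver can talk to it. A sketch against a hypothetical local, insecure single-node cluster; the DSN and table are assumptions for illustration:

```python
import psycopg2

# CockroachDB's default SQL port is 26257; "root" over an insecure local
# cluster is a development-only setup.
conn = psycopg2.connect(
    host="localhost", port=26257, dbname="defaultdb", user="root", sslmode="disable"
)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)")
cur.execute("UPSERT INTO accounts (id, balance) VALUES (1, 100), (2, 250)")
cur.execute("SELECT id, balance FROM accounts")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```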
Conclusion?
Ask Contextual Questions
Can you really not afford to host all of the data on a single machine?
Is your data highly connected or independent?
Is your primary workload OLTP (CP) or OLAP (CA)?
Are your customers geographically distributed?
Do you need to coordinate and scale business services without overwhelming the primary data store?
How much latency are you aiming for? How much can you compromise on it?
How much are you willing to spend on infrastructure?
What’s the skill and competency level of the dev team?
What is your target time-to-market SLA for new or changing features?
Accept Trade-Offs
Connected (graph) or relational (SQL) data vs independent data
Availability vs consistency
Storage space (volatile or persistent) vs computation
Data encryption vs computation
Range sharding vs hash sharding
Synchronous RPC vs async and reactive
Batch processing vs stream processing
Embedded computation engine vs separate computation engine
My (Biased) Recommendation
MongoDB for moderately complex loads and developer productivity
CockroachDB as the primary database
Large OLTP and OLAP loads - ScyllaDB (hash sharding) and MapR-DB (range sharding)
Druid for real-time OLAP
Titan / JanusGraph with ScyllaDB for (highly connected) graph data
Redis HA cluster for short-lived distributed data structures
Kafka or Pulsar as a distributed queue/buffer
Prefer MapR as the Hadoop platform
Thanks
@AhamPiyush
