Handling the Growth of Data
Piyush Katariya
@AhamPiyush
Growth of Data
What is it?
How to solve it?
Which metrics to consider?
Dive into design internals
Vertical Scaling
Let’s start with a small database, say Postgres
● A few RDBMS tables with relationships
● Average Stats
○ thousands of rows per table
○ OLAP - several hundred real-time queries
○ OLTP - a few hundred updates
● Optimizations
○ Single Node
○ Reasonable CPU frequencies
○ Indexes - unique, single-column B-tree, compound (see the sketch below)
○ Modern SSD
○ Buffer Cache
○ In memory ( $ )
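The index types listed above map directly to plain Postgres DDL. A minimal sketch, assuming a hypothetical "users" table and the psycopg2 driver; the table, columns, and connection string are made up for illustration:

```python
# Minimal sketch: creating the index types mentioned above on a
# hypothetical "users" table through psycopg2 against Postgres.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed DSN
cur = conn.cursor()

# Unique index: at most one row per email.
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_users_email ON users (email)")

# Single-column B-tree index (the Postgres default index type).
cur.execute("CREATE INDEX IF NOT EXISTS idx_users_created_at ON users (created_at)")

# Compound index for queries that filter on both columns together.
cur.execute("CREATE INDEX IF NOT EXISTS idx_users_country_city ON users (country, city)")

conn.commit()
cur.close()
conn.close()
```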
Relatively larger database (1)
● A few or more RDBMS tables with relationships
● Average Stats
○ A few million rows per table
○ Schemaless events data
○ OLAP - several thousand real-time queries
○ OLTP - a few thousand updates
● Optimizations
○ Master-slave replication with read replicas
○ JSON fields
○ Reasonably higher CPU frequencies
○ Advanced indexes - block range (BRIN), bitmap, partial, functional/expression (see the sketch below)
○ Scheduled ReIndexing Jobs
○ Table Partitioning
○ Materialized views
○ Async commits
○ RAID 10
○ Caching solutions - view layer, service layer, ORM layer, database layer
○ In memory ( $$$ )
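Several of these optimizations are one-line DDL statements in Postgres. A rough sketch, again via psycopg2, of a partial index, a functional (expression) index, declarative range partitioning, and a materialized view; the "events" table and every name in it are assumptions for illustration:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # assumed DSN
cur = conn.cursor()

# Range-partitioned parent table: each monthly child table (and its
# indexes) stays small enough to scan and reindex quickly.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id bigint, email text, status text, created_at timestamptz
    ) PARTITION BY RANGE (created_at)
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
""")

# Partial index: only the rows the hot query path actually touches.
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_events_pending
    ON events (created_at) WHERE status = 'pending'
""")

# Functional (expression) index: case-insensitive lookups without rewriting data.
cur.execute("CREATE INDEX IF NOT EXISTS idx_events_lower_email ON events (lower(email))")

# Materialized view: a precomputed rollup, refreshed on a schedule.
cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_event_counts AS
    SELECT date_trunc('day', created_at) AS day, count(*) AS events
    FROM events GROUP BY 1
""")
cur.execute("REFRESH MATERIALIZED VIEW daily_event_counts")

conn.commit()
cur.close()
conn.close()
```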
Relatively larger database (2)
● A few or more RDBMS tables with relationships
● Average Stats
○ Hundreds of millions of rows
○ Schemaless events, audits, analytics data, real-time decisions based on events
○ Hundreds of thousands of real-time queries
○ Hundreds of thousands of updates
● Optimization ???
○ The data just can’t fit in a SQL engine inexpensively
○ Sharding - the data can’t fit on a single node (see the sketch below)
○ Traditional tools fall short
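Once data no longer fits on a single node, every row has to be routed to a shard by key. A toy sketch of hash sharding in plain Python; the shard list and key format are made up:

```python
import hashlib

# Hypothetical shard identifiers; in practice these would be connection
# strings or node addresses maintained by a routing layer.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]

def shard_for(key: str) -> str:
    """Deterministically map a row key to one shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user:42"))    # the same key always routes to the same shard
print(shard_for("user:1337"))  # different keys spread across the shard set
```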
Horizontal Scaling
Research Papers by Google
Google File System (2003)
http://static.googleusercontent.com/media/research.google.com/en//archive/gfs-sosp2003.pdf
MapReduce (2004)
http://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
Bigtable (2006)
http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
MapReduce
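To make the model concrete, here is a toy, single-process word count that mimics the three phases described in the paper (map, shuffle, reduce). This is only an illustration of the idea, not Hadoop code; a real job runs the same steps in parallel across many machines:

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: every document emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the intermediate pairs by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'the': 3, 'quick': 1, 'brown': 1, ...}
```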
HDFS and Hadoop - Alternative to GFS
MapR-FS as a better alternative
NoSQL Databases
● CAP-theorem-compliant distributed data structures
● Tunable consistency and availability at the DB or query level (see the sketch below)
● They don’t try to solve every problem, only very specific ones
● Data model specific
○ Data distribution across machines/data centers
○ Data replication for reliability and fault tolerance
○ Data denormalization
○ Physical storage layouts
○ Data compression
○ Querying and Aggregation techniques
● Automatic failover
● Distributed clock synchronization
● Multi data center support
● Integration plugins with other databases
● Community
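As one example of per-query tunable consistency, the DataStax Python driver for Cassandra lets each statement pick its own consistency level. A minimal sketch; the cluster address, keyspace, table, and key are assumptions for illustration:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("app")  # hypothetical keyspace

# Availability-leaning read: any single replica may answer.
fast_read = SimpleStatement(
    "SELECT * FROM events WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# Consistency-leaning read: a majority of replicas must agree.
safe_read = SimpleStatement(
    "SELECT * FROM events WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

row = session.execute(safe_read, ("42",)).one()
print(row)
```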
CAP - Choose any 2
Consistency - Consistent view of dataset
Availability - Read and write at any time
Partition tolerance - the system keeps working despite network partitions between machines
BigTable based DB Design
Gossip Protocol (AP)
RAFT Consensus (CP)
Key Value Databases
Key Value Database - Riak (AP)
Column Family Database - Cluster
Column Family Databases - Physical layout
Column Family - Cassandra (AP)
Column Family - HBase (CP)
Document Database
Document Database - MongoDB (CP)
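A minimal sketch of the document model via pymongo; the database, collection, and document shape are made up for illustration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Documents are schemaless: nested fields, no upfront table definition.
orders.insert_one({
    "user": "user:42",
    "items": [{"sku": "A-1", "qty": 2}, {"sku": "B-7", "qty": 1}],
    "status": "pending",
})

# Query on a top-level field and a nested field in one filter.
for order in orders.find({"status": "pending", "items.sku": "A-1"}):
    print(order["_id"], order["user"])
```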
Graph Database
Graph Database - TitanDB layer (CP/CA/AP)
Search Engines and Logs
Distributed Queue/Log/Buffer
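A small sketch of the distributed queue/log idea using the kafka-python client; the broker address, topic name, and consumer group are assumptions:

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer: append an event to the log; the broker persists and replicates it.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("user-events", b'{"user": "42", "action": "login"}')
producer.flush()

# Consumer: replay the log from the beginning as part of a consumer group.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    group_id="analytics",
)
for message in consumer:
    print(message.offset, message.value)
    break  # read a single message for the sake of the example
```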
Computing Engines - HDFS and Spark
Computing Engines - MapR Stack
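A minimal PySpark sketch of the "computing engine over a distributed file system" idea: a word count over files stored in HDFS. The input and output paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Read lines from HDFS, split into words, and count them in parallel.
lines = spark.sparkContext.textFile("hdfs:///data/logs/*.txt")
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
counts.saveAsTextFile("hdfs:///data/output/wordcount")
spark.stop()
```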
NewSQL
(F1 and) Google Spanner (2012)
http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
Spanner: Becoming a SQL System (2017)
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46103.pdf
“There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.”
—C.A.R. Hoare
Open Source Spanner - CockroachDB (CP)
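CockroachDB speaks the Postgres wire protocol, so an ordinary Postgres driver can talk to it. A sketch against a hypothetical local, insecure single-node cluster; the DSN and table are assumptions for illustration:

```python
import psycopg2

# CockroachDB's default SQL port is 26257; "root" over an insecure local
# cluster is a development-only setup.
conn = psycopg2.connect(
    host="localhost", port=26257, dbname="defaultdb", user="root", sslmode="disable"
)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)")
cur.execute("UPSERT INTO accounts (id, balance) VALUES (1, 100), (2, 250)")
cur.execute("SELECT id, balance FROM accounts")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```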
Conclusion?
Ask Contextual Questions
Can you really not afford to host all of the data on a single machine?
Is your data highly connected or independent?
Is your primary workload OLTP (CP) or OLAP (CA)?
Are your customers geographically distributed?
Do you need to coordinate and scale business services without overwhelming the primary data store?
How much latency are you aiming for? How much can you compromise on it?
How much are you willing to spend on infrastructure?
What’s the skill and competency level of the dev team?
What is your target time-to-market SLA for new or changing features?
Accept Trade-Offs
Connected (graph) or relational (SQL) data vs independent data
Availability vs consistency
Storage space (volatile or persistent) vs computation
Data encryption vs computation
Range sharding vs hash sharding
Synchronous RPC vs async and reactive
Batch processing vs stream processing
Embedded computation engine vs separate computation engine
My (Biased) Recommendation
MongoDB for moderately complex loads and developer productivity
CockroachDB as the primary database
Large OLTP and OLAP loads - ScyllaDB (hash sharding) and MapR-DB (range sharding)
Druid for real-time OLAP
Titan / JanusGraph with ScyllaDB for (highly connected) graph data
Redis HA cluster for short-lived distributed data structures
Kafka or Pulsar as a distributed queue/buffer
Prefer MapR as the Hadoop platform
Thanks
@AhamPiyush
