Heterogeneous Persistence
A guide for the modern DBA
Marcos Albe
Jervin Real
Ryan Lowe
Liz Van Dijk
Introduction
● Hello everyone
● MySQL everyone?
● Memcached?
Agenda
● Introduction
● Why a single DBMS is not enough
● What makes a DBMS
● Different flavors of DBMS
● Top picks
Why one DBMS is not enough
"If you feel things are not efficient in your code, is likely that you are suffering of
poor data structures choice/design" ~ Anonymous
Why one DBMS is not enough
● Different data structures
● Different access patterns
● Different consistency and durability requirements.
● Different scaling needs
● Different budgets
● Theoretical fundamentalism
Why one DBMS is not enough
A more concrete example
OLAP -vs- OLTP
PROs:
● No SPOF
● Workload optimized services
● Easier to scale*
CONs:
● Additional complexity
● Operational needs (additional staffing)
● Cost ($$$)*
À la Carte
● Key Value Stores
○ Memcached
○ MemcacheDB
○ Redis
○ Riak KV
○ Cassandra
○ Amazon's DynamoDB
● Graph
○ Neo4J
○ OrientDB
○ Titan
○ Virtuoso
○ ArangoDB
● Relational
○ MySQL
○ PostgreSQL
● Time Series
○ InfluxDB
○ Graphite
○ OpenTSDB
○ Blueflood
○ Prometheus
● Columnar
○ Vertica
○ Infobright
○ Amazon Redshift
○ Apache HBase
● Document
○ MongoDB
○ Couchbase
● Fulltext
○ Sphinx
○ Lucene/Solr
What makes a DBMS?
General Criteria
● Specialty
● Cost
● API/Interfaces
● Scalability
● CAP
● ACID
● Secondary Features
What makes a DBMS: General
● Licensing
● Language support
● OS support
● Community & workforce
● Tools ecosystem
● Data Architecture
○ Logical data model
○ Physical data model
● Standards adherence (where defined)
● Atomicity
● Consistency
● Isolation
● Durability
● Referential integrity
● Transactions
● Locking
● Crash recovery
● Unicode support
What makes a DBMS: Fundamental Features
● Interface / connectors / protocols
● Sequences / auto-incrementals / atomic counters
● Conditional entry updates
● MapReduce
● Compression
● In-memory
● Availability
● Concurrency handling
● Scalability
● Embeddable
● Backups
What makes a DBMS: Fundamental Features cont.
● CRUD
● Union
● Intersect
● JOIN (inner, outer)
● Inner selects
● Merge joins
● Common Table Expressions
● Windowing Functions
● Parallel Query
● Subqueries
● Aggregation
● Derived tables
What makes a DBMS: querying capabilities
● Cursors
● Triggers
● Stored procedures
● Functions
● Views
● Materialized views
● Virtual columns
● UDF
● XML/JSON/YAML support
What makes a DBMS: programmatic capabilities
● Database size (sum of table sizes)
● Number of tables
● Individual table size
● Variable-length column size
● Row width
● Columns per row
● Row count
● Column name length
● Blob size
● Char size
● Numeric precision
● Date range (min / max)
What makes a DBMS: sizing limits
● B-Tree
● Full text indexing
● Hash
● Bitmap
● Expression
● Partials
● Reverse
● GiST
● GIS indexing
● Composite keys
● Graph support
What makes a DBMS: indexing
● Replication
● Failover
● Clustering
● CAP choice
What makes a DBMS: high availability
Partitioning
● Range
● Hash
● Range+hash
● List
● Expression
● Sub-partitioning
Sharding
● By key
● By table
What makes a DBMS: scalability
● Integer
● Floating point
● Decimal
● String
● Binary
● Date/time
● Boolean
● Set
● Enumeration
● Blob
● Clob
● JSON/XML/YAML (as native types)
What makes a DBMS: supported data types
● Authentication methods
● Access Control Lists
● Pluggable Authentication Modules support
● Encryption at-rest
● Encryption over the wire
● User proxy
What makes a DBMS: security features
● Data organization model: unstructured, semi-structured, structured
● Data model (schema) stability: Static? Stable? Dynamic? Highly dynamic?
● Writes: append-only; append mostly; updates only; updates mostly
● Reads: full scans; range scans; multi-range scans; point reads;
● Reads by age: new only; new mostly; old only; old mostly; whole range
● Reads by complexity: simple, related, deeply nested relations, …
What makes a DBMS: workload
ACID vs BASE
ACID:
● Atomic
● Consistent
● Isolated
● Durable
BASE:
● Basic Availability
● Soft-state
● Eventual Consistency
CAP Theorem
● Consistency
● Availability
● Partition tolerance
Relational Databases
Relational Databases: write anomalies
Relational Databases: normalization
Relational Databases: query language
results = new Array();
table = open('mydata');
while (row = table.fetch()) {
    if (row.x > 100) {
        results.push(row);
    }
}
Relational Databases: query language
SELECT * FROM mydata WHERE x > 100;
Relational Databases: JOINs
SELECT o.order_id AS order_id,
       CONCAT(c.customer_name, ' (', c.customer_email, ')') AS customer,
       GROUP_CONCAT(i.item_name) AS items,
       SUM(i.item_price) AS total
FROM orders AS o
JOIN order_items AS oi ON oi.order_id = o.order_id
JOIN items AS i ON i.item_id = oi.item_id
JOIN customers AS c ON c.customer_id = o.customer_id
GROUP BY o.order_id, customer;
Relational Databases: good use cases
● Highly-structured data with complex querying needs
● Projects that need very high data durability and guarantees of database-level
consistency and integrity
● Simple projects with limited data growth and limited amount of entities
● Projects that require PCI-DSS, HIPAA or similar security compliance
● Analysis of portions of larger BigData stores
● Projects where duplicated data volumes would be a problem
Relational Databases: bad use cases
● Unstructured data
● Deep Hierarchies / Nested -> XML
● Deep recursion
● Ever-growing datasets; Projects that are basically logging data
● Projects recording time-series
● Reporting on massive datasets
Relational Databases: bad use cases
● Projects supporting extreme concurrency
● Projects supporting massive data intake
● Queues
● Cache storage
PROs:
● Very mature
● Abundant workforce
● ACID guarantees
● Referential integrity
● Highly expressive query language
● Ubiquitous
CONs:
● Rigid schema
● Difficult to scale horizontally
● Expensive writes
● JOIN bombs
Relational Databases: MySQL
● Well known / mature / extensive documentation
● GPLv2 + commercial license for OEMs, ISVs and VARs
● Client libraries for about every programming language
● Many different engines
● SQL/ACID impose scalability limits
● Asynchronous / Semi-synchronous / Virtually synchronous replication
● Can be AP or CP depending on replication model
Relational Databases: MySQL
PROs:
● Open source
● Mature and ubiquitous
● ACID
● Choice of AP or CP
● Highly available
● Abundant tooling and expertise
● General purpose; likely good to start anything you want
CONs:
● Difficult to shard
● Replication issues
● Not 100% standards compliant
● Storage engine-imposed limitations
● General purpose; no silver-bullet solutions for scaling!
Relational Databases: PostgreSQL
● Mature / adequate documentation
● PostgreSQL License (similar to BSD/MIT)
● Client libraries for about every programming language
● Highly Standards Compliant
● SQL/ACID impose scalability limits
● Asynchronous / Semi-synchronous
● Virtually synchronous replication via 3rd party
● Can be AP or CP depending on replication model
Relational Databases: PostgreSQL
PROs:
● Open source
● Mature and stable
● ACID
● Lots of advanced features
● Vacuum
CONs:
● Difficult to shard
● Operations feel like an afterthought
● Less forgiving
● Vacuum
K/V Stores
CRUD
● CREATE
● READ
● UPDATE
● DELETE
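In a K/V store these four operations act on a single primary key; a toy in-memory sketch (the class and method names are mine for illustration, not any product's API):

```python
class KVStore:
    """Toy in-memory key/value store illustrating CRUD on primary-key access."""

    def __init__(self):
        self._data = {}

    def create(self, key, value):
        if key in self._data:
            raise KeyError(f"{key!r} already exists")
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)  # None when the key is absent

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"{key!r} not found")
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)


store = KVStore()
store.create("user:42", {"name": "Joe"})
store.update("user:42", {"name": "Mary"})
print(store.read("user:42"))
store.delete("user:42")
```

Everything is addressed by the full key; there is no secondary access path, which is exactly the trade-off the following slides discuss.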
HASHING
● Computers: 0, 1, 2, …, n - 1
● Key Value Pair: (k, v)
(k, v) => hash(k) mod n
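Changing n in hash(k) mod n remaps almost every key, which is exactly what sets up the thundering herd on the next slide. A quick illustrative sketch:

```python
import hashlib


def server_for(key: str, n: int) -> int:
    """Naive placement: hash the key and take it modulo the server count."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n


keys = [f"key-{i}" for i in range(10_000)]
before = {k: server_for(k, 10) for k in keys}
after = {k: server_for(k, 11) for k in keys}  # one server added
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys changed servers")  # roughly 90%
```

Adding a single server invalidates the placement of about n/(n+1) of the keys, so nearly the whole cache misses at once.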
THUNDERING HERD
CONSISTENT HASHING
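Consistent hashing places servers and keys on the same ring, so a new server only takes over the keys between it and its predecessor. A minimal ring without virtual nodes (names are illustrative; real deployments place many virtual nodes per server to even out load):

```python
import bisect
import hashlib


def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, servers):
        # Each server occupies one point on the ring, sorted by hash.
        self._ring = sorted((_h(s), s) for s in servers)

    def server_for(self, key: str) -> str:
        # First server clockwise from the key's position (wrapping around).
        hashes = [h for h, _ in self._ring]
        i = bisect.bisect(hashes, _h(key)) % len(self._ring)
        return self._ring[i][1]


ring = HashRing([f"cache{i}" for i in range(10)])
keys = [f"key-{i}" for i in range(10_000)]
before = {k: ring.server_for(k) for k in keys}
bigger = HashRing([f"cache{i}" for i in range(11)])  # one server added
moved = sum(1 for k in keys if before[k] != bigger.server_for(k))
print(f"{moved / len(keys):.0%} of keys changed servers")  # only keys in the new server's arc move
```

Compare with the mod-n sketch above: here only the keys falling in the arc the new server claims are remapped, so the herd stays small.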
K/V Stores - Good Use Cases
● Lots of data
○ Usually easily horizontally scalable
● Object cache in front of RDBMS
○ Memcached, anyone?
● High concurrency
○ Very simple locking model
● Massive small-data intake
● Simple data access patterns
○ CRUD on PK access
K/V Stores - Bad Use Cases
● Durability and consistency*
● Complex data access patterns*
● Non-PK access*
● Operations*
○ Complex systems fail in complex ways
SIMPLE FAILURE
COMPLICATED FAILURE
EXAMPLE K/V STORES
● Memcached
● MemcacheDB
● Redis*
● Riak KV
● Cassandra*
● Amazon DynamoDB*
PROs:
● Highly scalable
● Simple access patterns
CONs:
● Operational complexities
● Limited access patterns
Key Value Stores - Questions?
Columnar Databases
Columnar Data Layout
● Row-oriented:
001:10,Smith,Joe,40000;
002:12,Jones,Mary,50000;
003:11,Johnson,Cathy,44000;
004:22,Jones,Bob,55000;
...
● Column-oriented:
10:001,12:002,11:003,22:004;
Smith:001,Jones:002,Johnson:003,Jones:004;
Joe:001,Mary:002,Cathy:003,Bob:004;
40000:001,50000:002,44000:003,55000:004;
...
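The payoff shows up in analytic queries that touch one column: the row layout drags every field of every record past the CPU, while the column layout scans one contiguous array. A sketch of both layouts using the slide's data:

```python
# Row-oriented: each record is stored together.
rows = [
    (10, "Smith", "Joe", 40000),
    (12, "Jones", "Mary", 50000),
    (11, "Johnson", "Cathy", 44000),
    (22, "Jones", "Bob", 55000),
]

# Column-oriented: each attribute is stored together.
columns = {
    "id":     [10, 12, 11, 22],
    "last":   ["Smith", "Jones", "Johnson", "Jones"],
    "first":  ["Joe", "Mary", "Cathy", "Bob"],
    "salary": [40000, 50000, 44000, 55000],
}

# SELECT SUM(salary): the row store touches all four fields of every row ...
total_row = sum(r[3] for r in rows)
# ... while the column store reads only the salary array.
total_col = sum(columns["salary"])
print(total_row, total_col)
```

Both give the same answer; the difference is how much data had to travel through memory to produce it.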
Columnar Data Layout
● Row-oriented Read Approach
[Diagram: to read one column, whole rows are dragged through each memory page]
Columnar Data Layout
● Column-oriented Read Approach
[Diagram: the wanted column is read contiguously, one memory page at a time]
Columnar Databases - Considerations
● Buffering and compression can help to reduce the impact of writes, but
they should still be avoided when possible
○ Usually, an ETL process should be put in place to prepare data for analysis in a column-based
format
● Covering Indexes in row-based stores could provide similar benefits, but only
up to a point → index maintenance work can become too expensive
● Column-based stores are self-indexing and more disk-space efficient
● SQL can be used for most column-based stores
● Suitable for read-mostly or read-intensive, large data repositories
● Good for full table / large range reads.
● Good for unstructured problems where “good” indexes are hard to forecast
● Good for re-creatable datasets
● Good for structured data
Columnar Database - Good use cases
● Not good for “SELECT *” queries or queries fetching most of the columns
● Not good for writes
● Not good for mixed read/write
● Bad for unstructured data
Columnar Database - Bad use cases
Columnar Database - Examples
● InfoBright (ICE)
● Vertica
● Amazon Redshift
● Apache HBase
○ https://www.percona.com/live/data-performance-conference-2016/sessions/solr-how-index-10-billion-phrases-mysql-and-hbase
Columnar - Questions?
Graph Databases
Graph Databases - Good Use Cases
● Highly Connected Data
○ Network & IT Operations, Recommendations, Fraud Detection, Social Networking, Identity & Access Management, Geo Routing, Insurance Risk Analysis, Counter Terrorism
● Millions or Billions of Records
○ Relational databases can also solve this problem at a smaller scale
● Re-Creatable Data Set
○ Keep as much as possible outside of the critical path
● Structured Data
○ You cannot graph a relationship unless you can define it
Graph Databases - Bad Use Cases
● Unstructured Data
○ You cannot graph a relationship if you cannot define it
● Non-Connected Data
○ Graphiness is important here
● Highly Concurrent RW Workloads
○ Performance breaks down
● Anything in the Critical OLTP Path*
○ I'm not only talking about writes here
● Ever-Growing Data Set
Example Graph Databases
● Neo4j
● OrientDB
● Titan
● Virtuoso
● ArangoDB
THE CODE
PROs:
● Solves a very specific (and hard) data problem
● Learning curve not bad for developer usage
● Data analysts' dream
CONs:
● Very little operational expertise for hire
● Little community and virtually no tooling for administration and operations
● Big mismatch in paradigm vs RDBMS; hard to switch for DBAs
● Hard/expensive to scale horizontally
● Writes are computationally expensive
Graph Databases - Questions?
Time Series
ID: {timestamp, value}
db1-threads: {1460928171, 6}
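That access pattern (sequential appends, range reads, retention deletes at the old end) can be sketched as a per-series list of (timestamp, value) pairs; the class below is a toy for illustration, not any particular TSDB's API:

```python
from collections import defaultdict


class TinyTSDB:
    """Toy time-series store: sequential appends, range reads, head deletes."""

    def __init__(self):
        self._series = defaultdict(list)  # series id -> [(timestamp, value), ...]

    def append(self, series_id, ts, value):
        self._series[series_id].append((ts, value))  # new data goes at the end

    def range(self, series_id, start, end):
        return [(t, v) for t, v in self._series[series_id] if start <= t <= end]

    def expire(self, series_id, cutoff):
        # Deletes happen at the opposite end: drop points older than cutoff.
        self._series[series_id] = [
            (t, v) for t, v in self._series[series_id] if t >= cutoff
        ]


db = TinyTSDB()
for ts, v in [(1460928171, 6), (1460928181, 7), (1460928191, 5)]:
    db.append("db1-threads", ts, v)
print(db.range("db1-threads", 1460928171, 1460928185))
```

Because writes only ever touch the tail and deletes only the head, the storage layout can stay append-only, which is what dedicated time-series engines exploit.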
Time Series - Good Use Cases
● Uh … Time Series Data
● Write-mostly (95%+) - Sequential Appends
● Rare updates, rarer still to the distant past
● Deletes occur at the opposite end (the beginning)
● Data does not fit in memory
Time Series - Bad Use Cases
● Uh … Not Time Series Data
● Small data
Example Time Series Databases
● InfluxDB
● Graphite
● OpenTSDB
● Blueflood
● Prometheus
PROs:
● Solves a very specific (big) data problem
● Well-defined and finite data access patterns
CONs:
● Terrible query semantics
Time Series - Questions?
Document Stores
Document Stores: Document Oriented
Document Stores: Flexible Schema
Document Stores: Scalable by Design
[Diagram: sharded cluster in which each shard is an instance set with one primary and two replicas]
Document Stores
Document Stores: MongoDB
Document Stores: MongoDB
● Sharding and replication for dummies!
● Pluggable storage engines for distinct workloads
○ Different locking behaviors
● Excellent compression options with PerconaFT, RocksDB, WiredTiger
● On-disk encryption (Enterprise Advanced)
● In-memory storage engine (Beta)
● Connectors for all major programming languages
● Sharding- and replica-aware connectors
● Geospatial functions
● Aggregation framework
● … a lot more, except being transactional
● Catalogs
● Analytics/BI (BI Connector on 3.2)
● Time series
Document Stores: MongoDB > Use Cases
Document Stores: Couchbase
● MongoDB, more or less
● Global Secondary Indexes are exciting: Multi-Dimensional Scaling produces localized secondary indexes for low-latency queries
● Drop-in replacement for Memcached
Document Stores: Couchbase > Use Cases
● Internet of Things (direct or indirect receiver/pipeline)
● Mobile data persistence via Couchbase Mobile, i.e. field devices with unstable connections and local/close-proximity ingestion points
● Distributed K/V store
Document Store: Questions?
Fulltext Search
Fulltext Search: Inverted Index
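An inverted index maps each term to the set of documents containing it, which is what makes arbitrary word lookups cheap; a bare-bones sketch with deliberately naive tokenization:

```python
from collections import defaultdict


def build_index(docs):
    """docs: {doc_id: text}. Returns {term: set of doc_ids containing it}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "quick thinking",
}
index = build_index(docs)
print(sorted(index["brown"]))  # [1, 2]
print(sorted(index["quick"]))  # [1, 3]
```

Real engines such as Lucene layer stemming, stop words, positions, and scoring on top of this same term-to-postings structure.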
Fulltext Search: Search in a Box
Fulltext Search: Optimized Out
● Optimized to take data out; few optimizations for getting data in
Fulltext Search: Structured/Non-Structured Data
Fulltext Search
Fulltext Search: Elasticsearch
● Lucene based
● RESTful interface - JSON in, JSON out
● Flexible schema
● Automatic sharding and replication (NDB like)
● Reasonable defaults
● Extension model
● Written in Java; JVM limitations apply, e.g. GC pauses
● ELK - Elasticsearch+Logstash+Kibana
Fulltext Search: Elasticsearch > Use Cases
● Logs analysis - ELK Stack, e.g. Netflix
● Full-text search, e.g. GitHub, Wikipedia, StackExchange
● https://www.elastic.co/use-cases
○ Sentiment analysis
○ Personalized experience
○ etc.
● Lucene based
● Quite cryptic query interface - Innovator's Dilemma
● Support for SQL-based queries as of 6.1
● Structured schema; data types need to be predefined
● Written in Java; JVM limitations apply, e.g. GC pauses
● Near real-time indexing - DIH
● Rich document handling - PDF, doc[x]
● SolrCloud support for sharding and replication
Fulltext Search: Solr
Fulltext Search: Solr > Use Cases
● Search and Relevancy
○ https://www.percona.com/live/data-performance-conference-2016/sessions/solr-how-index-10-billion-phrases-mysql-and-hbase
● Recommendation Engine
● Spatial Search
Fulltext Search: Sphinx Search
● Structured data
● MySQL protocol - SphinxQL
● Durable indexes via binary logs
● Real-time indexes, populated via SphinxQL (MySQL-protocol) queries
● Distributed indexes for scaling
● No native replication; typically handled externally (e.g., via rsync)
● Very good documentation
● Arguably the fastest full indexing/reindexing
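Because searchd speaks the MySQL wire protocol, SphinxQL queries can be issued from any MySQL client or connector. A sketch of composing such a query (the `articles_rt` index name is hypothetical, and a real client should use proper escaping or parameter binding rather than string interpolation):

```python
# SphinxQL rides the MySQL protocol: point any MySQL client at searchd
# (default SphinxQL port 9306) and issue SQL-like statements.
def sphinxql_match(index: str, text: str, limit: int = 10) -> str:
    # NOTE: illustration only -- production code must escape `text` robustly.
    escaped = text.replace("'", "\\'")
    return (
        f"SELECT id, WEIGHT() FROM {index} "
        f"WHERE MATCH('{escaped}') LIMIT {limit};"
    )

query = sphinxql_match("articles_rt", "heterogeneous persistence")
print(query)
```

`MATCH()` carries the full-text expression and `WEIGHT()` exposes the relevance score, so existing MySQL tooling can run searches unchanged.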
Fulltext Search: Sphinx Search > Use Cases
● Real-time full-text search + basic geo functions
● The above combined with MySQL: simplify access via SphinxQL, or even use the Sphinx storage engine for MySQL
Search - Questions?
Docker Is Your Friend
Relational
● https://github.com/docker-library/mysql
● https://github.com/docker-library/postgres
Key Value
● https://github.com/docker-library/memcached
● https://github.com/docker-library/redis
● https://github.com/docker-library/cassandra
● https://github.com/hectcastro/docker-riak (https://docs.docker.com/engine/examples/running_riak_service/)
Docker Is Your Friend
Graph
● https://github.com/neo4j/docker-neo4j
● https://github.com/orientechnologies/orientdb-docker
● https://github.com/arangodb/arangodb-docker
● https://github.com/tenforce/docker-virtuoso (non-official)
● https://hub.docker.com/r/itzg/titandb/~/dockerfile/ (non-official)
● https://github.com/phani1kumar/docker-titan (non-official)
Full Text
● https://github.com/docker-solr/docker-solr/
● https://github.com/stefobark/sphinxdocker
Docker Is Your Friend
Time series
● https://github.com/tutumcloud/influxdb (non-official)
● https://hub.docker.com/r/sitespeedio/graphite/ (non-official)
● https://github.com/rackerlabs/blueflood/tree/master/demo/docker
● https://hub.docker.com/r/petergrace/opentsdb-docker/ (non-official)
● https://hub.docker.com/r/opower/opentsdb/ (non-official)
○ Both via http://opentsdb.net/docs/build/html/resources.html
● https://prometheus.io/docs/introduction/install/#using-docker
● https://github.com/prometheus/prometheus/blob/master/Dockerfile
Docker Is Your Friend
Document
● https://github.com/docker-library/mongo/
● https://hub.docker.com/r/couchbase/server/~/dockerfile/
Columnar
● http://www.infobright.org/index.php/download/download-pentaho-ice-integrated-virtual-machine/
● https://github.com/meatcar/docker-infobright/blob/master/Dockerfile
● https://github.com/vertica/docker-vertica
Heterogenous Persistence

Heterogenous Persistence

  • 1.
    Heterogeneous Persistence A guidefor the modern DBA Marcos Albe Jervin Real Ryan Lowe Liz Van Dijk
  • 2.
  • 3.
  • 4.
  • 5.
    Agenda ● Introduction ● Whya single DBMS is not enough ● What makes a DBMS ● Different flavors of DMBS ● Top picks
  • 6.
    Why one DBMSis not enough "If you feel things are not efficient in your code, is likely that you are suffering of poor data structures choice/design" ~ Anonymous
  • 7.
    Why one DBMSis not enough ● Different data structures ● Different access patterns ● Different consistency and durability requirements. ● Different scaling needs ● Different budgets ● Theoretical fundamentalism
  • 8.
    Why one DBMSis not enough A more concrete example OLAP -vs- OLTP
  • 9.
  • 10.
    PROs CONs ● NoSPOF ● Workload optimized services ● Easier to scale* ● Additional complexity ● Operational needs (additional staffing) ● Cost ($$$)*
  • 11.
    La Carte ● KeyValue Stores ○ Memcached ○ MemcacheDB ○ Redis ○ Riak KV ○ Cassandra ○ Amazon's DynamoDB ● Graph ○ Neo4J ○ OrientDB ○ Titan ○ Virtuoso ○ ArangoDB ● Relational ○ MySQL ○ PostgreSQL ● Time Series ○ InfluxDB ○ Graphite ○ OpenTSDB ○ Blueflood ○ Prometheus ● Columnar ○ Vertica ○ Infobright ○ Amazon RedShift ○ Apache HBase ● Document ○ MongoDB ○ Couchbase ● Fulltext ○ Sphinx ○ Lucene/Solr
  • 12.
  • 13.
    General Criteria ● Specialty ●Cost ● API/Interfaces ● Scalability ● CAP ● ACID ● Secondary Features
  • 14.
    What makes aDBMS: General ● Licensing ● Language support ● OS support ● Community & workforce ● Tools ecosystem
  • 15.
    ● Data Architecture ○Logical data model ○ Physical data model ● Standards adherence (where defined) ● Atomicity ● Consistency ● Isolation ● Durability ● Referential integrity ● Transactions ● Locking ● Crash recovery ● Unicode support What makes a DBMS: Fundamental Features
  • 16.
    ● Interface /connectors / protocols ● Sequences / auto-incrementals / atomic counters ● Conditional entry updates ● MapReduce ● Compression ● In-memory ● Availability ● Concurrency handling ● Scalability ● Embeddable ● Backups What makes a DBMS: Fundamental Features cont.
  • 17.
    ● CRUD ● Union ●Intersect ● JOIN (inner, outer) ● Inner selects ● Merge joins ● Common Table Expressions ● Windowing Functions ● Parallel Query ● Subqueries ● Aggregation ● Derived tables What makes a DBMS: querying capabilities
  • 18.
    ● Cursors ● Triggers ●Stored procedures ● Functions ● Views ● Materialized views ● Virtual columns ● UDF ● XML/JSON/YAML support What makes a DBMS: programmatic capabilities
  • 19.
    ● Database (tablessize sum) ● Number of Tables ● Tables individual size ● Variable length column size ● Row width ● Row columns count ● Row count ● Column name ● Blob size ● Char ● Numeric ● Date (min / max) What makes a DBMS: sizing limits
  • 20.
    ● B-Tree ● Fulltext indexing ● Hash ● Bitmap ● Expression ● Partials ● Reverse ● GiST ● GIS indexing ● Composite keys ● Graph support What makes a DBMS: indexing
  • 21.
    ● Replication ● Failover ●Clustering ● CAP choice What makes a DBMS: high availability
  • 22.
    Partitioning ● Range ● Hash ●Range+hash ● List ● Expression ● Sub-partitioning Sharding ● By key ● By table What makes a DBMS: scalability
  • 23.
    ● Integer ● Floatingpoint ● Decimal ● String ● Binary ● Date/time ● Boolean ● Binary ● Set ● Enumeration ● Blob ● Clob ● JSON/XML/YAML (as native types) What makes a DBMS: supported data types
  • 24.
    ● Authentication methods ●Access Control Lists ● Pluggable Authentication Modules support ● Encryption at-rest ● Encryption over the wire ● User proxy What makes a DBMS: security features
  • 25.
    ● Data organizationmodel: unstructured, semi-structured, structured ● Data model (schema) stability: Static? Stable? Dynamic? Highly dynamic? ● Writes: append-only; append mostly; updates only; updates mostly ● Reads: full scans; range scans; multi-range scans; point reads; ● Reads by age: new only; new mostly; old only; old mostly; whole range ● Reads by complexity: simple, related, deeply-nested relations, ....? What makes a DBMS: workload
  • 26.
    ACID vs BASE ●Atomic ● Consistent ● Isolated ● Durable ● Basic Availability ● Soft-state ● Eventual Consistency
  • 27.
    CAP Theorem ● Consistency ●Availability ● Partitioning
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.
  • 34.
    Relational Databases: querylanguage results = new Array(); table = open(‘mydata’); while (row = table.fetch()) { if (row.x > 100) { results.push(row); } }
  • 35.
    Relational Databases: querylanguage SELECT * FROM mydata WHERE x > 100;
  • 36.
    Relational Databases: JOINs SELECTo.order_id AS Order, CONCAT(c.customer_name, “ (“, c. customer_email, “)”) as Customer, GROUP_CONCAT(i.item_name), SUM(item_price) FROM orders AS o JOIN order_items AS oi ON oi.order_id = o.order_id JOIN items AS i ON i.item_id = oi.item_id JOIN customers AS c ON c.customer_id = o.customer_id
  • 37.
    Relational Databases: gooduse cases ● Highly-structured data with complex querying needs ● Projects that need very high data durability and guarantees of database-level consistency and integrity ● Simple projects with limited data growth and limited amount of entities ● Projects that require PCI/DSS, HIPPA or similar security requirements ● Analysis of portions of larger BigData stores ● Projects where duplicated data volumes would be a problem
  • 38.
    Relational Databases: baduse cases ● Unstructured data ● Deep Hierarchies / Nested -> XML ● Deep recursion: ● Ever-growing datasets; Projects that are basically logging data ● Projects recording time-series ● Reporting on massive datasets
  • 39.
    Relational Databases: baduse cases ● Projects supporting extreme concurrency ● Projects supporting massive data intake ● Queues ● Cache storage
  • 40.
    PROs CONs ● Verymature ● Abundant workforce ● ACID guarantees ● Referential integrity ● Highly expressive query language ● Ubiquitous ● Rigid schema ● Difficult to scale horizontally ● Expensive writes ● JOIN bombs
  • 41.
  • 42.
    ● Well known/ mature / extensive documentation ● GPLv2 + commercial license for OEMs, ISVs and VARs ● Client libraries for about every programming language ● Many different engines ● SQL/ACID impose scalability limits ● Asynchronous / Semi-synchronous / Virtually synchronous replication ● Can be AP or CP depending on replication model Relational Databases: MySQL
  • 43.
    PROs CONs ● Opensource ● Mature and ubiquitous ● ACID ● Choice of AP or CP ● Highly available ● Abundant tooling and expertise ● General purpouse; Likely good to start anything you want. ● Difficult to shard ● Replication issues ● Not 100% standard compliant ● Storage engines imposed limiations ● General purpouse; No single bullet solutions for scaling!
  • 44.
  • 45.
    ● Mature /adequate documentation ● PostgreSQL License (similar to BSD/MIT) ● Client libraries for about every programming language ● Highly Standards Compliant ● SQL/ACID impose scalability limits ● Asynchronous / Semi-synchronous ● Virtually synchronous replication via 3rd party ● Can be AP or CP depending on replication model` Relational Databases: PostgreSQL
  • 46.
    PROs CONs ● Opensource ● Mature and stable ● ACID ● Lots of advanced features ● Vacuum ● Difficult to shard ● Operations feel like an afterthought ● Less forgiving ● Vacuum
  • 47.
  • 49.
  • 50.
    HASHING ● Computers: 0,1, 2, …, n - 1, n ● Key Value Pair: (k, v) (k, v) => hash(k) mod n
  • 54.
  • 55.
  • 56.
  • 57.
    K/V Stores -Good Use Cases ● Lots of data ● Object cache in front of RDBMS ● High concurrency ● Massive small-data intake ● Simple data access patterns
  • 58.
    K/V Stores -Good Use Cases ● Lots of data ○ Usually easily horizontally scalable ● Object cache in front of RDBMS ● High concurrency ● Massive small-data intake ● Simple data access patterns
  • 59.
    K/V Stores -Good Use Cases ● Lots of data ● Object cache in front of RDBMS ○ Memcached, anyone? ● High concurrency ● Massive small-data intake ● Simple data access patterns
  • 60.
    K/V Stores -Good Use Cases ● Lots of data ● Object cache in front of RDBMS ● High concurrency ○ Very simple locking model ● Massive small-data intake ● Simple data access patterns
  • 61.
    K/V Stores -Good Use Cases ● Lots of data ● Object cache in front of RDBMS ● High concurrency ● Massive small-data intake ● Simple data access patterns
  • 62.
    K/V Stores -Good Use Cases ● Lots of data ● Object cache in front of RDBMS ● High concurrency ● Massive small-data intake ● Simple data access patterns ○ CRUD on PK access
  • 63.
    K/V Stores -Bad Use Cases ● Durability and consistency* ● Complex data access patterns ● Non-PK access* ● Operations*
  • 64.
    K/V Stores -Bad Use Cases ● Durability and consistency* ● Complex data access patterns ● Non-PK access* ● Operations*
  • 65.
    K/V Stores -Bad Use Cases ● Durability and consistency* ● Complex data access patterns* ● Non-PK access* ● Operations*
  • 66.
    K/V Stores -Bad Use Cases ● Durability and consistency* ● Complex data access patterns ● Non-PK access* ● Operations*
  • 67.
    K/V Stores -Bad Use Cases ● Durability and consistency* ● Complex data access patterns ● Non-PK access* ● Operations* ○ Complex systems fail in complex ways
  • 68.
  • 69.
  • 70.
    EXAMPLE K/V STORES ●Memcached ● MemcacheDB ● Redis* ● Riak KV ● Cassandra* ● Amazon DynamoDB*
  • 71.
    EXAMPLE K/V STORES ●Memcached ● MemcacheDB ● Redis* ● Riak KV ● Cassandra* ● Amazon DynamoDB*
  • 72.
    EXAMPLE K/V STORES ●Memcached ● MemcacheDB ● Redis* ● Riak KV ● Cassandra* ● Amazon DynamoDB*
  • 73.
    EXAMPLE K/V STORES ●Memcached ● MemcacheDB ● Redis* ● Riak KV ● Cassandra* ● Amazon DynamoDB*
  • 74.
    EXAMPLE K/V STORES ●Memcached ● MemcacheDB ● Redis* ● Riak KV ● Cassandra* ● Amazon DynamoDB*
  • 75.
    EXAMPLE K/V STORES ●Memcached ● MemcacheDB ● Redis* ● Riak KV ● Cassandra* ● Amazon DynamoDB*
  • 76.
    EXAMPLE K/V STORES ●Memcached ● MemcacheDB ● Redis* ● Riak KV ● Cassandra* ● Amazon DynamoDB*
  • 77.
    PROs CONs ● Highlyscalable ● Simple access patterns ● Operational complexities ● Limited access patterns
  • 78.
    Key Value Stores- Questions?
  • 79.
  • 80.
    Columnar Data Layout ●Row-oriented ● Column-oriented 001:10,Smith,Joe,40000; 002:12,Jones,Mary,50000; 003:11,Johnson,Cathy,44000; 004:22,Jones,Bob,55000; ... 10:001,12:002,11:003,22:004; Smith:001,Jones:002,Johnson:003,Jones:004; Joe:001,Mary:002,Cathy:003,Bob:004; 40000:001,50000:002,44000:003,55000:004; ...
  • 81.
    Columnar Data Layout ●Row-oriented Read Approach What we want to read Read Operation Memory Page 1 2 3 4 10 Smith Bob 40000 12 Jones Mary 50000 11 Johnson Cathy 44000
  • 82.
    Columnar Data Layout ●Column-oriented Read Approach What we want to read Read Operation Memory Page 1 2 3 4 10 12 11 22 Smith Jones Johnson Joe Mary Cathy Bob
  • 83.
    Columnar Databases -Considerations ● Buffering and compression can help to reduce the impact of writes, but they should still be avoided when possible ○ Usually, an ETL process should be put in place to prepare data for analysis in a column-based format ● Covering Indexes in row-based stores could provide similar benefits, but only up to a point → index maintenance work can become too expensive ● Column-based stores are self-indexing and more disk-space efficient ● SQL can be used for most column-based stores
  • 84.
    Columnar Databases -Considerations ● Buffering and compression can help to reduce the impact of writes, but they should still be avoided when possible ○ Usually, an ETL process should be put in place to prepare data for analysis in a column-based format ● Covering Indexes in row-based stores could provide similar benefits, but only up to a point → index maintenance work can become too expensive ● Column-based stores are self-indexing and more disk-space efficient ● SQL can be used for most column-based stores
  • 85.
    Columnar Databases -Considerations ● Buffering and compression can help to reduce the impact of writes, but they should still be avoided when possible ○ Usually, an ETL process should be put in place to prepare data for analysis in a column-based format ● Covering Indexes in row-based stores could provide similar benefits, but only up to a point → index maintenance work can become too expensive ● Column-based stores are self-indexing and more disk-space efficient ● SQL can be used for most column-based stores
  • 86.
    Columnar Databases -Considerations ● Buffering and compression can help to reduce the impact of writes, but they should still be avoided when possible ○ Usually, an ETL process should be put in place to prepare data for analysis in a column-based format ● Covering Indexes in row-based stores could provide similar benefits, but only up to a point → index maintenance work can become too expensive ● Column-based stores are self-indexing and more disk-space efficient ● SQL can be used for most column-based stores
  • 87.
    ● Suitable forread-mostly or read-intensive, large data repositories ● Good for full table / large range reads. ● Good for unstructured problems where “good” indexes are hard to forecast ● Good for re-creatable datasets ● Good for structured data Columnar Database - Good use cases
  • 88.
    ● Suitable forread-mostly or read-intensive, large data repositories ● Good for full table / large range reads. ● Good for unstructured problems where “good” indexes are hard to forecast ● Good for re-creatable datasets ● Good for structured data Columnar Database - Good use cases
  • 89.
    ● Suitable forread-mostly or read-intensive, large data repositories ● Good for full table / large range reads. ● Good for unstructured problems where “good” indexes are hard to forecast ● Good for re-creatable datasets ● Good for structured data Columnar Database - Good use cases
  • 90.
    ● Suitable forread-mostly or read-intensive, large data repositories ● Good for full table / large range reads. ● Good for unstructured problems where “good” indexes are hard to forecast ● Good for re-creatable datasets ● Good for structured data Columnar Database - Good use cases
  • 91.
    ● Suitable forread-mostly or read-intensive, large data repositories ● Good for full table / large range reads. ● Good for unstructured problems where “good” indexes are hard to forecast ● Good for re-creatable datasets ● Good for structured data Columnar Database - Good use cases
  • 92.
    ● Not goodfor “SELECT *” queries or queries fetching most of the columns ● Not good for writes ● Not good for mixed read/write ● Bad for unstructured data Columnar Database - Bad use cases
  • 93.
    ● Not goodfor “SELECT *” queries or queries fetching most of the columns ● Not good for writes ● Not good for mixed read/write ● Bad for unstructured data Columnar Database - Bad use cases
  • 94.
    ● Not goodfor “SELECT *” queries or queries fetching most of the columns ● Not good for writes ● Not good for mixed read/write ● Bad for unstructured data Columnar Database - Bad use cases
  • 95.
    ● Not goodfor “SELECT *” queries or queries fetching most of the columns ● Not good for writes ● Not good for mixed read/write ● Bad for unstructured data Columnar Database - Bad use cases
  • 96.
    Columnar Database -Examples ● InfoBright (ICE) ● Vertica ● Amazon Redshift ● Apache HBase
  • 97.
    Columnar Database -Examples ● InfoBright (ICE) ● Vertica ● Amazon Redshift ● Apache HBase
  • 98.
    Columnar Database -Examples ● InfoBright (ICE) ● Vertica ● Amazon Redshift ● Apache HBase
  • 99.
    Columnar Database -Examples ● InfoBright (ICE) ● Vertica ● Amazon Redshift ● Apache HBase ○ https://www.percona.com/live/data-performance-conference- 2016/sessions/solr-how-index-10-billion-phrases-mysql-and-hbase
  • 100.
  • 101.
  • 103.
    Graph Databases -Good Use Cases ● Highly Connected Data ● Millions or Billions of Records ● Re-Creatable Data Set ● Structured Data
  • 104.
    Graph Databases -Good Use Cases ● Highly Connected Data ○ Network & IT Operations, Recommendations, Fraud Detection, Social Networking, Identity & Access Management, Geo Routing, Insurance Risk Analysis, Counter Terrorism ● Millions or Billions of Records ● Re-Creatable Data Set ● Structured Data
  • 106.
    Graph Databases -Good Use Cases ● Highly Connected Data ● Millions or Billions of Records ○ Relational databases can also solve this problem at a smaller scale ● Re-Creatable Data Set ● Structured Data
  • 107.
    Graph Databases -Good Use Cases ● Highly Connected Data ● Millions or Billions of Records ● Re-Creatable Data Set ○ Keep as much as possible outside of the critical path ● Structured Data
  • 108.
    Graph Databases -Good Use Cases ● Highly Connected Data ● Millions or Billions of Records ● Re-Creatable Data Set ● Structured Data ○ You cannot graph a relationship unless you can define it
  • 109.
    Graph Databases -Bad Use Cases ● Unstructured Data ● Non-Connected Data ● Highly Concurrent RW Workloads ● Anything in the Critical OLTP Path* ● Ever-Growing Data Set
  • 110.
    Graph Databases -Bad Use Cases ● Unstructured Data ○ You cannot graph a relationship if you cannot define it ● Non-Connected Data ● Highly Concurrent Workloads ● Anything in the Critical OLTP Path* ● Ever-Growing Data Set
  • 111.
    Graph Databases -Bad Use Cases ● Unstructured Data ● Non-Connected Data ○ Graphiness is important here ● Highly Concurrent Workloads ● Anything in the Critical OLTP Path* ● Ever-Growing Data Set
  • 112.
    Graph Databases -Bad Use Cases ● Unstructured Data ● Non-Connected Data ● Highly Concurrent RW Workloads ○ Performance breaks down ● Anything in the Critical OLTP Path* ● Ever-Growing Data Set
  • 113.
    Graph Databases -Bad Use Cases ● Unstructured Data ● Non-Connected Data ● Highly Concurrent Workloads ● Anything in the Critical OLTP Path* ○ I'm not only talking about writes here ● Ever-Growing Data Set
  • 114.
    Graph Databases -Bad Use Cases ● Unstructured Data ● Non-Connected Data ● Highly Concurrent RW Workloads ● Anything in the Critical OLTP Path* ● Ever-Growing Data Set
  • 115.
    Example Graph Databases ●Neo4j ● OrientDB ● Titan ● Virtuoso ● ArangoDB
  • 116.
    Example Graph Databases ●Neo4j ● OrientDB ● Titan ● Virtuoso ● ArangoDB
  • 117.
    Example Graph Databases ●Neo4j ● OrientDB ● Titan ● Virtuoso ● ArangoDB
  • 118.
    Example Graph Databases ●Neo4j ● OrientDB ● Titan ● Virtuoso ● ArangoDB
  • 119.
    Example Graph Databases ●Neo4j ● OrientDB ● Titan ● Virtuoso ● ArangoDB
  • 120.
    Example Graph Databases ●Neo4j ● OrientDB ● Titan ● Virtuoso ● ArangoDB
  • 122.
  • 123.
    PROs CONs ● Solvesa very specific (and hard) data problem ● Learning curve not bad for developer usage ● Data analysts’ dream ● Very little operational expertise for hire ● Little community and virtually no tooling for administration and operations. ● Big mismatch in paradigm vs RDBMS; Hard to switch for DBAs. ● Hard/Expensive to scale horizontally ● Writes are computationally expensive
  • 124.
  • 125.
  • 126.
  • 127.
    Time Series -Good Use Cases ● Uh … Time Series Data ● Write-mostly (95%+) - Sequential Appends ● Rare updates, rarer still to the distant past ● Deletes occur at the opposite end (the beginning) ● Data does not fit in memory
  • 128.
    Time Series -Good Use Cases ● Uh … Time Series Data ● Write-mostly (95%+) - Sequential Appends ● Rare updates, rarer still to the distant past ● Deletes occur at the opposite end (the beginning) ● Data does not fit in memory
  • 129.
    Time Series -Good Use Cases ● Uh … Time Series Data ● Write-mostly (95%+) - Sequential Appends ● Rare updates, rarer still to the distant past ● Deletes occur at the opposite end (the beginning) ● Data does not fit in memory
  • 130.
    Time Series -Good Use Cases ● Uh … Time Series Data ● Write-mostly (95%+) - Sequential Appends ● Rare updates, rarer still to the distant past ● Deletes occur at the opposite end (the beginning) ● Data does not fit in memory
  • 131.
    Time Series -Good Use Cases ● Uh … Time Series Data ● Write-mostly (95%+) - Sequential Appends ● Rare updates, rarer still to the distant past ● Deletes occur at the opposite end (the beginning) ● Data does not fit in memory
  • 132.
    Time Series -Good Use Cases ● Uh … Time Series Data ● Write-mostly (95%+) - Sequential Appends ● Rare updates, rarer still to the distant past ● Deletes occur at the opposite end (the beginning) ● Data does not fit in memory
  • 133.
    Time Series -Bad Use Cases ● Uh … Not Time Series Data ● Small data
  • 134.
    Example Time SeriesDatabases ● InfluxDB ● Graphite ● OpenTSDB ● Blueflood ● Prometheus
  • 135.
    Example Time SeriesDatabases ● InfluxDB ● Graphite ● OpenTSDB ● Blueflood ● Prometheus
  • 137.
    Example Time SeriesDatabases ● InfluxDB ● Graphite ● OpenTSDB ● Blueflood ● Prometheus
  • 138.
    Example Time SeriesDatabases ● InfluxDB ● Graphite ● OpenTSDB ● Blueflood ● Prometheus
  • 139.
    Example Time SeriesDatabases ● InfluxDB ● Graphite ● OpenTSDB ● Blueflood ● Prometheus
  • 140.
    Example Time SeriesDatabases ● InfluxDB ● Graphite ● OpenTSDB ● Blueflood ● Prometheus
  • 141.
    PROs CONs ● Solvesa very specific (big) data problem ● Well-defined and finite data access patterns ● Terrible query semantics
  • 142.
    Time Series -Questions?
  • 143.
  • 144.
  • 145.
  • 146.
  • 147.
  • 148.
  • 149.
  • 150.
  • 151.
    ShardShardShard Document Stores: Scalableby Design Primary Primary Primary Replica Replica Replica Replica Replica Replica
  • 152.
    InstanceInstanceInstance Document Stores: ScalableBy Design Shard Shard Shard Replica Replica Replica Replica Replica Replica
  • 153.
  • 154.
  • 155.
    Document Stores: MongoDB ●Sharding and replication for dummies! ● Pluggable storage engines for distinct workloads. ● Excellent compression options with PerconaFT, RocksDB, WiredTiger ● On disk encryption (Enterprise Advanced) ● In-memory storage engine (Beta) ● Connectors for all major programming languages ● Sharding and replica aware connectors ● Geospatial functions ● Aggregation framework ● .. a lot more except being transactional
  • 156.
    Document Stores: MongoDB ●Sharding and replication for dummies! ● Pluggable storage engines for distinct workloads. ● Excellent compression options with PerconaFT, RocksDB, WiredTiger ● On disk encryption (Enterprise Advanced) ● In-memory storage engine (Beta) ● Connectors for all major programming languages ● Sharding and replica aware connectors ● Geospatial functions ● Aggregation framework ● .. a lot more except being transactional
  • 157.
    Document Stores: MongoDB ●Sharding and replication for dummies! ● Pluggable storage engines for distinct workloads. ○ Different locking behaviors ● Excellent compression options with PerconaFT, RocksDB, WiredTiger ● On disk encryption (Enterprise Advanced) ● In-memory storage engine (Beta) ● Connectors for all major programming languages ● Sharding and replica aware connectors ● Geospatial functions ● Aggregation framework ● .. a lot more except being transactional
  • 158.
    Document Stores: MongoDB ●Sharding and replication for dummies! ● Pluggable storage engines for distinct workloads. ● Excellent compression options with PerconaFT, RocksDB, WiredTiger ● On disk encryption (Enterprise Advanced) ● In-memory storage engine (Beta) ● Connectors for all major programming languages ● Sharding and replica aware connectors ● Geospatial functions ● Aggregation framework ● .. a lot more except being transactional
Document Stores: MongoDB > Use Cases
● Catalogs
● Analytics/BI (BI Connector in 3.2)
● Time series
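The aggregation framework mentioned above is one of MongoDB's stand-out features. As a rough illustration of what a `$match` + `$group` pipeline computes, here is a plain-Python simulation (not real pymongo calls, and the `orders` documents are made up for the example):

```python
from collections import defaultdict

# Hypothetical documents, standing in for a MongoDB "orders" collection.
orders = [
    {"status": "shipped", "qty": 2, "price": 10.0},
    {"status": "shipped", "qty": 1, "price": 25.0},
    {"status": "pending", "qty": 5, "price": 3.0},
]

# Rough equivalent of:
#   db.orders.aggregate([
#     {"$match": {"status": "shipped"}},
#     {"$group": {"_id": "$status",
#                 "total": {"$sum": {"$multiply": ["$qty", "$price"]}}}},
#   ])
matched = [d for d in orders if d["status"] == "shipped"]
totals = defaultdict(float)
for d in matched:
    totals[d["status"]] += d["qty"] * d["price"]

print(dict(totals))  # {'shipped': 45.0}
```

Against a real deployment you would send the commented pipeline through a driver instead; the point is that filtering and grouping happen server-side, close to the data.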
Document Stores: Couchbase
● MongoDB, more or less
● Global Secondary Indexes are exciting: localized secondary indexes for low-latency queries (Multi-Dimensional Scaling)
● Drop-in replacement for Memcached
Document Stores: Couchbase > Use Cases
● Internet of Things (direct or indirect receiver/pipeline)
● Mobile data persistence via Couchbase Mobile, i.e. field devices with unstable connections and local/close-proximity ingestion points
● Distributed K/V store
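A distributed K/V store has to decide which node owns each key. As a toy sketch of that idea only (node names are made up, and real Couchbase maps keys to vBuckets rather than hashing straight to nodes):

```python
import hashlib

# Toy key-to-node mapping in the spirit of a distributed K/V store.
# Node names are hypothetical; Couchbase actually assigns keys to
# vBuckets first, then maps vBuckets to nodes.
nodes = ["cb-node-a", "cb-node-b", "cb-node-c"]

def node_for(key: str) -> str:
    """Deterministically pick the node responsible for a key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

print(node_for("user:1001"))
```

The indirection through buckets in the real system is what lets the cluster rebalance without rehashing every key when a node joins or leaves.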
Fulltext Search: OptimizedOut
● Optimized to take data out; few optimizations for getting data in
https://flic.kr/p/abeTEw
Fulltext Search: Elasticsearch
● Lucene based
● RESTful interface: JSON in, JSON out
● Flexible schema
● Automatic sharding and replication (NDB-like)
● Reasonable defaults
● Extension model
● Written in Java; JVM limitations apply, i.e. GC
● ELK: Elasticsearch + Logstash + Kibana
Fulltext Search: Elasticsearch > Use Cases
● Log analysis: ELK stack, e.g. Netflix
● Full-text search, e.g. GitHub, Wikipedia, StackExchange
● https://www.elastic.co/use-cases
○ Sentiment analysis
○ Personalized experience
○ etc.
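"JSON in, JSON out" means an Elasticsearch search is just an HTTP POST with a JSON body. A minimal sketch of building such a body (index and field names are hypothetical, and actually sending it would need a running cluster, so only the request body is constructed here):

```python
import json

# Body for POST /logs/_search against a hypothetical "logs" index,
# matching documents whose "message" field contains "timeout".
query = {
    "query": {"match": {"message": "timeout"}},
    "size": 10,
}
body = json.dumps(query)
print(body)
```

The response comes back as JSON too, so the whole round trip is scriptable with nothing more than an HTTP client.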
Fulltext Search: Solr
● Lucene based
● Quite cryptic query interface (Innovator's Dilemma)
● Support for SQL-based queries in 6.1
● Structured schema; data types need to be predefined
● Written in Java; JVM limitations apply, i.e. GC
● Near-realtime indexing, DIH
● Rich document handling: PDF, doc[x]
● SolrCloud support for sharding and replication
Fulltext Search: Solr > Use Cases
● Search and relevancy
○ https://www.percona.com/live/data-performance-conference-2016/sessions/solr-how-index-10-billion-phrases-mysql-and-hbase
● Recommendation engines
● Spatial search
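Solr's "cryptic" query interface is mostly URL parameters rather than a JSON body. As a sketch of what a select request looks like (the `products` core and the field names are hypothetical; only the URL is built, not sent):

```python
from urllib.parse import urlencode

# Parameters for GET /solr/products/select on a hypothetical "products" core.
params = {
    "q": "title:laptop",      # main query
    "fq": "in_stock:true",    # filter query, cached separately from q
    "rows": 10,               # page size
    "wt": "json",             # response writer / format
}
url = "http://localhost:8983/solr/products/select?" + urlencode(params)
print(url)
```

Compare this with the Elasticsearch JSON body above: same Lucene underneath, quite different ergonomics on top.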
Fulltext Search: Sphinx Search
● Structured data
● MySQL protocol: SphinxQL
● Durable indexes via binary logs
● Realtime indexes via MySQL queries
● Distributed indexes for scaling
● No native support for replication, i.e. via rsync
● Very good documentation
● Fastest full indexing/reindexing [?]
Fulltext Search: Sphinx Search > Use Cases
● Realtime full text + basic geo functions
● The above with a MySQL dependency, or to simplify access, with SphinxQL or even the Sphinx storage engine for MySQL
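Since Sphinx speaks the MySQL wire protocol, any MySQL client can issue SphinxQL against searchd. A sketch of the statement itself (index and column names are hypothetical; executing it would require a running searchd, so only the query string is built):

```python
# SphinxQL looks like SQL; MATCH() carries the full-text expression
# and WEIGHT() exposes the relevance score.
index = "articles"                     # hypothetical Sphinx index name
terms = "replication durability"
sphinxql = (
    f"SELECT id, WEIGHT() AS w FROM {index} "
    f"WHERE MATCH('{terms}') ORDER BY w DESC LIMIT 10"
)
print(sphinxql)
```

In practice you would send this through an ordinary MySQL connector pointed at searchd's port instead of mysqld's.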
Docker Is Your Friend
Relational
● https://github.com/docker-library/mysql
● https://github.com/docker-library/postgres
Key/Value
● https://github.com/docker-library/memcached
● https://github.com/docker-library/redis
● https://github.com/docker-library/cassandra
● https://github.com/hectcastro/docker-riak (https://docs.docker.com/engine/examples/running_riak_service/)
Docker Is Your Friend
Graph
● https://github.com/neo4j/docker-neo4j
● https://github.com/orientechnologies/orientdb-docker
● https://github.com/arangodb/arangodb-docker
● https://github.com/tenforce/docker-virtuoso (non-official)
● https://hub.docker.com/r/itzg/titandb/~/dockerfile/ (non-official)
● https://github.com/phani1kumar/docker-titan (non-official)
Full Text
● https://github.com/docker-solr/docker-solr/
● https://github.com/stefobark/sphinxdocker
Docker Is Your Friend
Time Series
● https://github.com/tutumcloud/influxdb (non-official)
● https://hub.docker.com/r/sitespeedio/graphite/ (non-official)
● https://github.com/rackerlabs/blueflood/tree/master/demo/docker
● https://hub.docker.com/r/petergrace/opentsdb-docker/ (non-official)
● https://hub.docker.com/r/opower/opentsdb/ (non-official)
● Both OpenTSDB images via http://opentsdb.net/docs/build/html/resources.html
● https://prometheus.io/docs/introduction/install/#using-docker
● https://github.com/prometheus/prometheus/blob/master/Dockerfile
Docker Is Your Friend
Document
● https://github.com/docker-library/mongo/
● https://hub.docker.com/r/couchbase/server/~/dockerfile/
Columnar
● http://www.infobright.org/index.php/download/download-pentaho-ice-integrated-virtual-machine/
● https://github.com/meatcar/docker-infobright/blob/master/Dockerfile
● https://github.com/vertica/docker-vertica