Architecting Data in the
AWS Ecosystem
Seth Luersen
Head of Training, MemSQL
2
what is happening within AWS
overall data landscape
benefits of using MemSQL in EC2
3
Modern Data
Relational
SQL
Schema
Structured
Operational
Non-Relational
NoSQL
Schema-less
Unstructured
Analytical
4
How to Match
Database
Data-driven
application
5
Workloads
Data
Shape
Size
Compute
6
Shape
Columnstore: Aggregations and table scans
Document: Index and store docs for query on any property
Graph: Persist and retrieve relationships
Key-Value: Query by key with fast ingest and high throughput
Rowstore: Operate on a row or row set
Time-Series: Store and process sequence
Unstructured: Get and put objects
7
Size
Limit: Bounded or unbounded to a size
Working Set: 30 years cold
Caching: Last 10 minutes of hot
Result size: 1 row at 100 bytes; 2 million rows at 200 MB
Monolith: One big refrigerator
Partition: Natural boundaries for distribution
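The two result-size figures above are consistent with each other if every row is about 100 bytes. A minimal sketch of that arithmetic follows; the fixed row width and the helper names are assumptions for illustration, not something the slides state.

```python
# Assumption: every row is about 100 bytes, matching the single-row figure.
BYTES_PER_ROW = 100


def result_size_bytes(rows: int, bytes_per_row: int = BYTES_PER_ROW) -> int:
    """Estimate the size of a result set in bytes."""
    return rows * bytes_per_row


def to_megabytes(size_bytes: int) -> float:
    """Convert bytes to decimal megabytes (1 MB = 1,000,000 bytes)."""
    return size_bytes / 1_000_000


single_row = result_size_bytes(1)
large_result_mb = to_megabytes(result_size_bytes(2_000_000))
print(single_row, "bytes;", large_result_mb, "MB")
```

The same helper can be used to sanity-check whether an expected result set fits comfortably in memory or over the network before choosing a store.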
8
Compute
Aggregations: Average, Count, Sum on 1 trillion rows
Batch: 50 million rows per batch
Concurrency: 10,000 requests per second
Streaming Ingest: 1 million rows ingested per second
Latency: SLAs for sub-second response
Transactions: Singleton operations
9
Choice
One Size
Fits All
Use Case
Specific
10
Navigating the Data Landscape
NoSQL
Database
Data
Warehouse
Data Lake
Non-relational
Relational
Analytical Operational
NoSQL
Database
Data
Warehouse
Data Lake
11
Navigate in AWS
DynamoDB
RDS
Aurora
MySQL
PostgreSQL
MariaDB
SQL Server
Oracle
S3
Non-relational
Relational
ElastiCache
Analytical Operational
DAX
Kinesis Analytics
Redshift
Athena
Elasticsearch
EMR (Elastic MapReduce)
Hadoop
Spark
Presto
HBase
12
General Use Cases
singletons
system of record
content
blobs
descriptive
predictive
prescriptive
Operational
Analytics
columnar
partitions
billions of rows
heavy push down
batch writes
few updates
in-memory
compute
13
Analytical Dimensions
billions of rows
cached / in-memory
partitions
computed result set
files
unstructured
schema-less
relational
snowflake
ETL
batches
aggregations
query latency
pushdown
shape
compute
size
NoSQL
Database
Data
Warehouse
Data Lake
14
Navigate Analytical
DynamoDB
RDS
Aurora
MySQL
PostgreSQL
MariaDB
SQL Server
Oracle
EMR (Elastic MapReduce)
Hadoop
Spark
Presto
HBase
S3
Non-relational
Relational
ElastiCache
Athena
Analytical Operational
Kinesis Analytics
Redshift
Elasticsearch
DAX
15
Amazon EMR
Elastic Reliable Secure Easy
Hadoop, Spark, HBase, Presto
Clickstream Analytics, Real-time Analytics, Log Analysis,
ETL, Predictive Analytics
Big Data Framework
Retries failed tasks for Hadoop
Replaces poorly performing instances
16
Amazon Redshift
Scalable Secure Inexpensive Fast
Fast, powerful, and simple data warehousing;
Massively parallel, petabyte scale
Scale by resizing
Columnar performance
$1000 per TB per year
Data Warehouse
17
Amazon S3 + Athena
Query
Instantly
Pay Per
Query
ANSI
SQL
Serverless
Easy
No infrastructure to set up or manage
SQL to query S3 files
JDBC / ODBC
Multiple data formats
Relational Joins
S3 upload latency
Data Lake
18
Elasticsearch Service
Easy to
Use
Open Source
API
Secure Fully
Managed
Easy to deploy, secure, operate, and scale Elasticsearch
Log analytics, full text search, & application monitoring
Logstash
Kibana
NoSQL Full Text Search
19
Analytics Summary
Amazon Redshift Amazon S3 + Athena
serverless ad-hoc query
process, prepare, and index key-value / document
low latency
per query $$$
non-relational
multiple enterprise data sources
multiple data formats
20
General Use Cases
singletons
system of record
content
blobs
descriptive
predictive
prescriptive
Operational
Analytics
hot – caching
singletons –
small compute
size
low latency
high throughput
high concurrency
ACID, HA, DR
21
Operational Dimensions
shape
size
bounded
unbounded
monolithic
partitioned
rows
key-values
documents
relational
schema
velocity
ingest
compute
NoSQL
Database
Data
Warehouse
Data Lake
22
Navigate AWS
RDS
Aurora
MySQL
PostgreSQL
MariaDB
SQL Server
Oracle
S3
Non-relational
Relational
ElastiCache
Analytical Operational
Kinesis Analytics
Redshift
DynamoDB
Athena
Elasticsearch
DAX
EMR (Elastic MapReduce)
Hadoop
Spark
Presto
HBase
23
Amazon RDS
Administer
Easily
Highly
Scalable
Available,
Durable
SSD
Speed
Managed relational database service;
Six popular database engines
Amazon Aurora is multi-AZ durable
Database
24
Amazon ElastiCache
Scale
Easily
Secure,
Hardened
Available,
Reliable
Extreme
Performance
Managed, in-memory data store;
Redis or Memcached
Add to database to improve read latency
Good hit rate if working set fits in cache
The price is stale cache reads
In-memory Database
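The trade-off on this slide (faster reads, at the cost of possibly stale ones) can be illustrated with a toy cache-aside reader. This is only a sketch under stated assumptions: plain Python dictionaries stand in for the database and the cache, the class and method names are hypothetical, and no real Redis or Memcached client API is used.

```python
class CacheAsideStore:
    """Toy cache-aside reader; dicts stand in for the database and the cache."""

    def __init__(self):
        self.database = {}   # hypothetical stand-in for the system of record
        self.cache = {}      # hypothetical stand-in for the in-memory cache
        self.hits = 0
        self.misses = 0

    def read(self, key):
        # Serve from the cache when possible; otherwise load and populate it.
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def write_database_only(self, key, value):
        # A write that bypasses the cache leaves the cached copy untouched.
        self.database[key] = value


store = CacheAsideStore()
store.database["user:1"] = "alice"
first = store.read("user:1")                  # miss: loaded from the database
second = store.read("user:1")                 # hit: served from the cache
store.write_database_only("user:1", "bob")
stale = store.read("user:1")                  # hit: still the old cached value
print(first, second, stale, store.database["user:1"])
```

In practice an expiry time or explicit invalidation bounds how long a stale value can be served; the sketch leaves both out to keep the trade-off visible.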
25
Amazon DynamoDB
Fully
Managed
Auto
Scaling
AZ
Replication
Consistent
Performance
NoSQL database for document and key-value storage
Automatic provisioning
Auto-scaled tables serve millions of requests per second
Millisecond latency
Fault tolerant availability
No relational capabilities
NoSQL
26
Amazon DynamoDB Accelerator (DAX)
Fully
Managed
No Stale
Cache Reads
Extreme
Performance
Fully managed write-through cache for DynamoDB
Reduces millisecond latency to microseconds
Fast NoSQL
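The "write-through" idea on this slide can be sketched as follows: every write updates the backing table and the cache together, so a later read through the cache sees the new value. This is an illustrative sketch only; the class and variable names are hypothetical, a dictionary stands in for the DynamoDB table, and none of this models the actual DAX client API. It only covers writes that go through the cache.

```python
class WriteThroughCache:
    """Toy write-through cache in front of a dict-based backing store."""

    def __init__(self, backing_store):
        self.backing_store = backing_store   # hypothetical stand-in for a table
        self.cache = {}

    def write(self, key, value):
        # Update the backing store first, then the cache, in the same call.
        self.backing_store[key] = value
        self.cache[key] = value

    def read(self, key):
        # Serve from the cache; fall back to the backing store on a miss.
        if key in self.cache:
            return self.cache[key]
        value = self.backing_store.get(key)
        if value is not None:
            self.cache[key] = value
        return value


table = {}
dax_like = WriteThroughCache(table)
dax_like.write("order:42", "pending")
before = dax_like.read("order:42")
dax_like.write("order:42", "shipped")
after = dax_like.read("order:42")
print(before, after, table["order:42"])
```

Compared with a cache that is populated only on reads, the write path here does slightly more work in exchange for reads that reflect the latest write made through the cache.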
27
Operational Summary
Amazon RDS Amazon DynamoDB
bounded
unbounded
key-value / document
rows
relational
non-relational
monolith
partitioned
velocity
push-down compute
fast ingest with DAX
28
Strategic Planning Assumptions
By 2017, as "NoSQL" ceases to distinguish
DBMSs, data and analytics leaders will
select multimodel and/or specific document,
key-value, graph and wide-column DBMSs.
Gartner Critical Capabilities for Operational Database Management Systems
Published: 6 October 2016
Analyst(s): Merv Adrian, Donald Feinberg, Nick Heudecker, Terilyn Palanca, Rick Greenwald
29
Navigating the Data Landscape
NoSQL
No
Problem
Database
Data
Warehouse
Data Lake
Non-relational
Relational
Analytical Operational
30
Navigating the Data Landscape
Database
Data
Warehouse
Data Lake
Non-relational
Relational
Analytical Operational
31
Simplify the Data Landscape
Converged Data Warehouse Database
Data Lake (AWS S3)
Non-relational
Relational
Analytical Operational
HTAP, HOAP, Translytical
32
Latency Holding Back the Enterprise
Lengthy Query Execution
Slow query responses
Slow reports
No real-time response
Limited User Access
Single threaded operations
Challenge with mixed workloads
Single box performance
Slow Data Loading
Batch processing
Hours to load
Sampled data views
33
The Enterprise Requires Performance
Fast Queries
Scalable SQL
Real-time dashboards
Live data access
Scalable User Access
Multi-threaded processing
Converged transactions and analytics
Scale-out for performance
Live Loading
Stream data
On-the-fly transformation
Multiple sources
34
The Database for Real-Time Applications
Delivering Operational Analytics at Scale
Run
Anywhere
Any cloud, hybrid, or multicloud
On-premises
Low cost standard hardware
Scale
Transactions and Analytics
Petabyte scale
In-memory and disk-based
Unified mixed workload architecture
Power
Real-Time Applications
Fast ingestion and queries
Operational capabilities
Multi-model and data support
35
Durable Distributed Storage
Highly Available
Online replication ensures
data consistency and protects
against outages
Big Data Capacity
Petabyte scale with up to
10x compression and instant
query retrieval
Distributed and Durable
Store and process on clusters
of machines for performance
and persistence
36
MemSQL Unified Architecture
Historical Data
Disk-optimized tables
with compression for
fast analytic queries
Live Data
Memory optimized tables
for analyzing real-time
events
Streaming Ingest
Real-time data pipelines
with exactly-once
semantics
37
Drive Real-Time Insights
• Rich analytics with Scalable SQL
• Support for JSON, Geospatial,
Key-Value
• Fast Query Vectorization and
Compilation
• User Defined Functions
38
Deliver Real-Time ETL
Load
Guarantee message delivery with
exactly-once semantics
Transform
Map and enrich data with user defined
functions or Spark transformations
Extract
Ingest from Apache Kafka or Spark
Change data capture or bulk load
39
Simple Setup -> CREATE PIPELINE
memsql> CREATE PIPELINE twitter_pipeline AS
-> LOAD DATA KAFKA
"public-kafka.memcompute.com:9092/tweets-json"
-> INTO TABLE tweets
-> (id, tweet);
Query OK, (0.89 sec)
memsql> START PIPELINE twitter_pipeline;
Query OK, (0.01 sec)
40
Ecosystem Overview
Streaming Ingest Live Data Historical Data
Real-Time Data
Messaging and
Transforms
Historical Data BI Dashboards
Kafka Spark
Relational Hadoop Amazon S3
Bare Metal, Virtual Machines, Containers On-Premises, Cloud, As a Service
Real-Time Applications
Tableau Looker MicroStrategy
41
Amazon EC2 + MemSQL
Size
Memory
Size
Compute
Size
Storage
ANSI
SQL
Build a cluster in minutes
Pipelines for ingest
Easy to deploy with MemSQL Ops
High Availability
ACID
Data Warehouse
and Database
42
AWS Aurora MemSQL
Dataset easily fits
under 500 GB
Single server compute
Write-centric without reads
Dataset from
100 GB to 1 PB
Horizontal scale
Simultaneous read and write
workloads
Database from AWS and MemSQL
43
Redshift MemSQL
No requirement for
fast data ingest
No requirement for
concurrency
Fast data ingest required
Support for high concurrency
Data Warehouse from AWS and MemSQL
Thank You!
