Architecting Data in the AWS Ecosystem

Seth Luersen
Head of Training, MemSQL

2
what is happening within AWS
overall data landscape
benefits of using MemSQL in EC2

3
Modern Data
Relational: SQL, Schema, Structured, Operational
Non-Relational: NoSQL, Schema-less, Unstructured, Analytical

4
How to Match
Database
Data-driven application

5
Workloads
Data
Shape
Size
Compute

6
Shape
Columnstore: Aggregations and table scans
Document: Index and store docs for query on any property
Graph: Persist and retrieve relationships
Key-Value: Query by key with fast ingest and high throughput
Rowstore: Operate on a row or row set
Time-Series: Store and process sequence
Unstructured: Get and put objects

7
Size
Limit: Bounded or unbounded to a size
Working Set: 30 years of cold data
Caching: Last 10 minutes of hot data
Result size: 1 row at 100 bytes, or 2 million rows at 200 MB
Monolith: One big refrigerator
Partition: Natural boundaries for distribution
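The two result-size figures above are consistent with a fixed row width. A quick check, assuming every row is about 100 bytes and that the slide uses decimal megabytes (1 MB = 1,000,000 bytes); the function name is illustrative only:

```python
BYTES_PER_MB = 1_000_000  # decimal megabyte; an assumption about the slide's units


def result_size_bytes(rows: int, bytes_per_row: int = 100) -> int:
    """Estimate the size of a result set from its row count and row width."""
    return rows * bytes_per_row


single_row = result_size_bytes(1)
large_result = result_size_bytes(2_000_000)

print(single_row)                   # 100
print(large_result / BYTES_PER_MB)  # 200.0
```

The same arithmetic gives a rough sense of whether a query's result fits in a cache or has to be streamed.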

8
Compute
Aggregations: Average, Count, Sum on 1 trillion rows
Batch: 50 million rows per batch
Concurrency: 10,000 requests per second
Streaming Ingest: 1 million rows ingested per second
Latency: SLAs for sub-second response
Transactions: Singleton operations
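The workload dimensions (shape, size, compute) can be written down as a small record before choosing a database. The sketch below is only an illustration: the class and field names are hypothetical, and it combines example figures that the slides list independently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload by shape, size, and compute."""
    shape: str                   # e.g. "columnstore", "rowstore", "key-value"
    working_set_bytes: int       # how much data is hot
    requests_per_second: int     # concurrency target
    ingest_rows_per_second: int  # streaming ingest rate
    latency_sla_seconds: float   # response-time target

    def seconds_to_ingest(self, rows: int) -> float:
        """Time to ingest `rows` at the stated streaming rate."""
        return rows / self.ingest_rows_per_second


example = Workload(
    shape="rowstore",
    working_set_bytes=200_000_000,
    requests_per_second=10_000,
    ingest_rows_per_second=1_000_000,
    latency_sla_seconds=1.0,
)

# A 50-million-row batch at 1 million rows per second.
print(example.seconds_to_ingest(50_000_000))  # 50.0
```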

9
Choice
One Size Fits All
Use Case Specific

10
Navigating the Data Landscape
NoSQL
Database
Data Warehouse
Data Lake
Non-relational
Relational
Analytical
Operational

11
Navigate in AWS
NoSQL
Database
Data Warehouse
Data Lake
DynamoDB
RDS: Aurora, MySQL, PostgreSQL, MariaDB, SQL Server, Oracle
S3
Non-relational
Relational
ElastiCache
DAX
Kinesis Analytics
Redshift
Athena
Elasticsearch
EMR (Elastic MapReduce): Hadoop, Spark, Presto, HBase

12
General Use Cases
singletons
system of record
content
blobs
descriptive
predictive
prescriptive
Operational
Analytics

13
Analytical Dimensions
columnar
partitions
billions of rows
heavy push down
batch writes
few updates
in-memory
compute
billions of rows
cached / in-memory
partitions
computed result set
files
unstructured
schema-less
relational
snowflake
etl
batches
aggregations
query latency
pushdown
shape
compute
size

14
Navigate Analytical
NoSQL
Database
Data Warehouse
Data Lake
DynamoDB
RDS: Aurora, MySQL, PostgreSQL, MariaDB, SQL Server, Oracle
EMR (Elastic MapReduce): Hadoop, Spark, Presto, HBase
S3
Non-relational
Relational
ElastiCache
Athena
Kinesis Analytics
Redshift
Elasticsearch
DAX

15
Amazon EMR
Elastic, Reliable, Secure, Easy
Hadoop, Spark, HBase, Presto
Clickstream Analytics, Real-time Analytics, Log Analysis, ETL, Predictive Analytics
Big Data Framework
Retries failed tasks for Hadoop
Replaces poorly performing instances

16
Amazon Redshift
Scalable, Secure, Inexpensive, Fast
Fast, powerful, and simple data warehousing
Massively parallel, petabyte scale
Scale by resizing
Columnar performance
$1,000 per TB per year
Data Warehouse
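As a worked example of the per-terabyte figure quoted on this slide, the sketch below scales it to a few warehouse sizes. It assumes the price is linear in capacity and takes 1 PB as 1,000 TB; the figure is the one on the slide, not current pricing, and the function name is illustrative.

```python
PRICE_PER_TB_YEAR_USD = 1000  # figure quoted on the slide; not current pricing


def yearly_cost_usd(terabytes: int) -> int:
    """Yearly cost under the assumption that price scales linearly with size."""
    return terabytes * PRICE_PER_TB_YEAR_USD


print(yearly_cost_usd(50))    # 50000
print(yearly_cost_usd(1000))  # 1000000, i.e. 1 PB taken as 1,000 TB
```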

17
Amazon S3 + Athena
Query Instantly
Pay Per Query
ANSI SQL
Serverless
Easy
No infrastructure to set up or manage
SQL to query S3 files
JDBC / ODBC
Multiple data formats
Relational Joins
S3 upload latency
Data Lake

18
Elasticsearch Service
Easy to Use
Open Source API
Secure
Fully Managed
Easy to deploy, secure, operate, and scale Elasticsearch
Log analytics, full text search, & application monitoring
Logstash
Kibana
NoSQL Full Text Search

19
Analytics Summary
Amazon Redshift Amazon S3 + Athena
serverless ad-hoc query
process, prepare, and index key-value / document
low latency
per query $$$
non-relational
multiple enterprise data sources
multiple data formats

20
General Use Cases
singletons
system of record
content
blobs
descriptive
predictive
prescriptive
Operational
Analytics

21
Operational Dimensions
hot – caching
singletons – small compute
size
low latency
high throughput
high concurrency
ACID, HA, DR
shape
size
bounded
unbounded
monolithic
partitioned
rows
key-values
documents
relational
schema
velocity
ingest
compute

22
Navigate AWS
NoSQL
Database
Data Warehouse
Data Lake
RDS: Aurora, MySQL, PostgreSQL, MariaDB, SQL Server, Oracle
S3
Non-relational
Relational
ElastiCache
Kinesis Analytics
Redshift
DynamoDB
Athena
Elasticsearch
DAX
EMR (Elastic MapReduce): Hadoop, Spark, Presto, HBase

23
Amazon RDS
Administer Easily
Highly Scalable
Available, Durable
SSD Speed
Managed relational database service
Six popular database engines
Amazon Aurora is multi-AZ durable
Database

24
Amazon ElastiCache
Scale Easily
Secure, Hardened
Available, Reliable
Extreme Performance
Managed, in-memory data store
Redis or Memcached
Add to database to improve read latency
Good hit rate if working set fits in cache
The price is stale cache reads
In-memory Database
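The trade-off on this slide (lower read latency at the price of stale reads) can be seen in a minimal cache-aside sketch. Everything here is a stand-in: plain dictionaries play the roles of the database and the cache, and the class names are hypothetical rather than any ElastiCache, Redis, or Memcached API.

```python
class Database:
    """Stand-in for the system of record (in-memory, hypothetical)."""

    def __init__(self):
        self.rows = {}
        self.reads = 0  # counts how often the database is actually hit

    def get(self, key):
        self.reads += 1
        return self.rows.get(key)

    def put(self, key, value):
        self.rows[key] = value


class CacheAside:
    """Reads go to the cache first and fall back to the database on a miss."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def read(self, key):
        if key in self.cache:
            return self.cache[key]   # cache hit: no database round trip
        value = self.db.get(key)     # cache miss: read from the database
        self.cache[key] = value      # populate the cache for later reads
        return value


db = Database()
db.put("user:1", "alice")
cached = CacheAside(db)

first = cached.read("user:1")    # miss, served by the database
second = cached.read("user:1")   # hit, served by the cache
db.put("user:1", "alicia")       # this write bypasses the cache
stale = cached.read("user:1")    # still the old value

print(first, second, stale, db.reads)  # alice alice alice 1
```

Real deployments usually bound the staleness with expiry times or explicit invalidation on write, which this sketch leaves out.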

25
Amazon DynamoDB
Fully Managed
Auto Scaling
AZ Replication
Consistent Performance
NoSQL database for document and key-value storage
Automatic provisioning
Auto-scaled tables serve millions of requests per second
Millisecond latency
Fault tolerant availability
No relational capabilities
NoSQL

26
Amazon DynamoDB Accelerator (DAX)
Fully Managed
No Stale Cache Reads
Extreme Performance
Fully managed write-through cache for DynamoDB
Reduces millisecond latency to microseconds
Fast NoSQL
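A write-through cache avoids stale reads because every write passes through the cache on its way to the backing store. The sketch below illustrates the idea with dictionaries; it is a generic illustration under that assumption, not the DAX client API, and the class name is hypothetical.

```python
class WriteThroughCache:
    """Writes update the backing store and the cache in the same call."""

    def __init__(self):
        self.store = {}       # stand-in for the backing table
        self.cache = {}
        self.store_reads = 0  # counts reads that had to go to the store

    def write(self, key, value):
        self.store[key] = value  # persist first
        self.cache[key] = value  # then refresh the cached copy

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        self.store_reads += 1
        value = self.store.get(key)
        if value is not None:
            self.cache[key] = value
        return value


table = WriteThroughCache()
table.write("user:1", "alice")
before = table.read("user:1")
table.write("user:1", "alicia")
after = table.read("user:1")

print(before, after, table.store_reads)  # alice alicia 0
```

The guarantee only holds for writes that go through the cache; a writer that updates the backing store directly can still leave an old value cached.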

27
Operational Summary
Amazon RDS Amazon DynamoDB
bounded
unbounded
key-value / document
rows
relational
non-relational
monolith
partitioned
velocity
push-down compute
fast ingest with DAX

28
Strategic Planning Assumptions
By 2017, as "NoSQL" ceases to distinguish
DBMSs, data and analytics leaders will
select multimodel and/or specific document,
key-value, graph and wide-column DBMSs.
Gartner Critical Capabilities for Operational Database Management Systems
Published: 6 October 2016
Analyst(s): Merv Adrian, Donald Feinberg, Nick Heudecker, Terilyn Palanca, Rick Greenwald

29
NoSQL
No Problem
Database
Data Warehouse
Relational

30
Database
Data Warehouse
Relational

31
Simplify the Data Landscape
Converged Data Warehouse Database
Data Lake (AWS S3)
Non-relational
Relational
HTAP, HOAP, Translytical

32
Latency Holding Back the Enterprise
Lengthy Query Execution
Slow query responses
Slow reports
No real-time response
Limited User Access
Single threaded operations
Challenge with mixed workloads
Single box performance
Slow Data Loading
Batch processing
Hours to load
Sampled data views

33
The Enterprise Requires Performance
Fast Queries
Scalable SQL
Real-time dashboards
Live data access
Scalable User Access
Multi-threaded processing
Converged transactions and analytics
Scale-out for performance
Live Loading
Stream data
On-the-fly transformation
Multiple sources

34
The Database for Real-Time Applications
Delivering Operational Analytics at Scale
Run Anywhere
Any cloud, hybrid, or multicloud
On-premises
Low cost standard hardware
Scale Transactions and Analytics
Petabyte scale
In-memory and disk-based
Unified mixed workload architecture
Power Real-Time Applications
Fast ingestion and queries
Operational capabilities
Multi-model and data support

35
Durable Distributed Storage
Highly Available
Online replication ensures data consistency and protects against outages
Big Data Capacity
Petabyte scale with up to 10x compression and instant query retrieval
Distributed and Durable
Store and process on clusters of machines for performance and persistence

36
MemSQL Unified Architecture
Historical Data
Disk-optimized tables with compression for fast analytic queries
Live Data
Memory-optimized tables for analyzing real-time events
Streaming Ingest
Real-time data pipelines with exactly-once semantics

37
Drive Real-Time Insights
• Rich analytics with Scalable SQL
• Support for JSON, Geospatial, Key-Value
• Fast Query Vectorization and Compilation
• User Defined Functions

38
Deliver Real-Time ETL
Extract
Ingest from Apache Kafka or Spark
Change data capture or bulk load
Transform
Map and enrich data with user defined functions or Spark transformations
Load
Guarantee message delivery with exactly-once semantics
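One common way to get exactly-once effects from a stream that may redeliver messages is to record the last applied offset together with the loaded rows and skip anything at or below it. The sketch below shows that general idea only; it is not a description of MemSQL's internals, and the `Sink` class is hypothetical.

```python
class Sink:
    """Hypothetical table that tracks the last applied offset with its rows."""

    def __init__(self):
        self.rows = []
        self.last_offset = -1

    def apply(self, batch):
        """Apply (offset, row) pairs in order, ignoring replayed offsets."""
        for offset, row in batch:
            if offset <= self.last_offset:
                continue  # already applied: a redelivered message
            self.rows.append(row)
            self.last_offset = offset


sink = Sink()
sink.apply([(0, "a"), (1, "b")])
sink.apply([(1, "b"), (2, "c")])  # offset 1 is redelivered after a retry

print(sink.rows)         # ['a', 'b', 'c']
print(sink.last_offset)  # 2
```

In a real database the rows and the offset would be committed in a single transaction, so that a crash cannot leave one updated without the other.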

39
Simple Setup -> CREATE PIPELINE
memsql> CREATE PIPELINE twitter_pipeline AS
    -> LOAD DATA KAFKA "public-kafka.memcompute.com:9092/tweets-json"
    -> INTO TABLE tweets
    -> (id, tweet);
Query OK, (0.89 sec)
memsql> START PIPELINE twitter_pipeline;
Query OK, (0.01 sec)
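When several pipelines share the same form, the statement can be generated from its parts. The helper below is a hypothetical sketch that mirrors only the syntax shown on this slide; it does no quoting or validation, so it assumes trusted identifiers.

```python
def create_pipeline_sql(name, broker, topic, table, columns):
    """Build a CREATE PIPELINE statement in the form shown on the slide.

    Hypothetical helper: identifiers are inserted as-is, with no escaping.
    """
    column_list = ", ".join(columns)
    return (
        f"CREATE PIPELINE {name} AS\n"
        f'LOAD DATA KAFKA "{broker}/{topic}"\n'
        f"INTO TABLE {table}\n"
        f"({column_list});"
    )


statement = create_pipeline_sql(
    name="twitter_pipeline",
    broker="public-kafka.memcompute.com:9092",
    topic="tweets-json",
    table="tweets",
    columns=["id", "tweet"],
)
print(statement)
```

The generated text would then be sent through whatever SQL client is in use, followed by `START PIPELINE twitter_pipeline;` as on the slide.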

40
Ecosystem Overview
Streaming Ingest, Live Data, Historical Data
Real-Time Data
Messaging and Transforms
Historical Data
BI Dashboards
Kafka, Spark
Relational, Hadoop, Amazon S3
Bare Metal, Virtual Machines, Containers
On-Premises, Cloud, As a Service
Real-Time Applications
Tableau, Looker, MicroStrategy

41
Amazon EC2 + MemSQL
Size Memory
Size Compute
Size Storage
ANSI SQL
Build a cluster in minutes
Pipelines for ingest
Easy to deploy with MemSQL Ops
High Availability
ACID
Data Warehouse and Database

42
AWS Aurora
Dataset easily fits under 500 GB
Single server compute
Write-centric without reads
MemSQL
Dataset from 100 GB to 1 PB
Horizontal scale
Simultaneous read and write workloads
Database from AWS and MemSQL
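The comparison on this slide can be read as a rule of thumb. The sketch below encodes it under stated assumptions: the function and parameter names are hypothetical, 1 PB is taken as 1,000,000 GB, and because the two size ranges overlap between 100 GB and 500 GB, the workload flags are checked before the size.

```python
def suggest_database(dataset_gb, needs_horizontal_scale, mixed_read_write):
    """Rule-of-thumb reading of the slide; not an official sizing guide."""
    if dataset_gb > 1_000_000:
        return "outside the ranges on the slide"
    if needs_horizontal_scale or mixed_read_write:
        return "MemSQL"  # scale-out or simultaneous read and write workloads
    if dataset_gb < 500:
        return "Aurora"  # fits comfortably on a single server
    return "MemSQL"      # 500 GB up to 1 PB


print(suggest_database(50, False, False))    # Aurora
print(suggest_database(300, False, True))    # MemSQL
print(suggest_database(2_000, False, False)) # MemSQL
```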

43
Redshift
No requirement for fast data ingest
No requirement for concurrency
MemSQL
Fast data ingest required
Support for high concurrency
Data Warehouse from AWS and MemSQL
