Technology Trends in Data Processing - DAT311 - re:Invent 2017

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
DAT311: Technology Trends in Data Processing
Anurag Gupta, Vice President, Amazon Web Services
awgupta@amazon.com

Agenda
> Managing explosion of data
> Serverless, API-centric computing
> Global users, local access experience

Managing Data Explosion with Data Lakes

Traditionally, analytics used to look like this
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Relational data
TBs-PBs scale
Schema defined prior to data load
Operational reporting and ad hoc
Large initial capex + $10k-$50k / TB

Explosion of machine-generated data
Transition from IT to DevOps increases rate of change
Network connected smart devices drive variety and volume of data
Micro-services architecture increases need for real-time monitoring and analytics
Machine-generated data is growing 10x faster than business data
Source: insideBigData - The Exponential Growth of Data, February 16, 2017

Data lakes extend the traditional approach
Relational and non-relational data
TBs-EBs scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Data Lake
Devices Web Sensors Social
Catalog
Machine Learning
DW Queries
Big data processing
Interactive Real-time

Data lakes on AWS
Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams
S3
Most ways to bring data in
Unmatched durability and availability at EB scale
Best security, compliance, and audit capabilities
Run any analytics on the same data without movement
Scale storage and compute independently
Store at $0.027 / GB-month; Query for $0.05/GB scanned
Redshift, EMR, Athena, Kinesis, Elasticsearch Service
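To make the pay-for-what-you-use pricing concrete, here is a minimal arithmetic sketch that uses the storage and query prices quoted on this slide. The workload sizes are hypothetical, and the prices are the 2017 slide figures, not necessarily current pricing.

```python
# Prices as quoted on the slide (2017). Treat them as parameters, not current pricing.
STORAGE_USD_PER_GB_MONTH = 0.027
QUERY_USD_PER_GB_SCANNED = 0.05


def monthly_cost(stored_gb: float, scanned_gb: float) -> float:
    """Return storage cost plus query cost, in USD, for one month."""
    storage = stored_gb * STORAGE_USD_PER_GB_MONTH
    queries = scanned_gb * QUERY_USD_PER_GB_SCANNED
    return storage + queries


# Hypothetical workload: 10 TB stored, 2 TB scanned by queries in the month.
cost = monthly_cost(stored_gb=10_000, scanned_gb=2_000)
print(f"${cost:,.2f}")  # prints $370.00
```

Because storage and scanning are billed separately, shrinking the scanned volume (for example with compression or partitioning) lowers the query term without touching the storage term.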

Layers of a data lake
INGEST
DISCOVER
ANALYZE
INFER
CRAWL, CATALOG, INDEX, SECURE

AWS Glue – Serverless Data Catalog & ETL service
Data Catalog: Discover data and extract schema
ETL Job authoring: Auto-generates customizable ETL code in Python and Spark
Automatically discovers data and stores schema
Data searchable, and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless

Crawlers: Automatic schema inference
enumerate S3 objects: file 1, file 2, … file N
identify file type and parse files
custom classifiers: Apache log parser
built-in classifiers: JSON parser, CSV parser, Parquet parser, …
semi-structured per-file schema
semi-structured unified schema
[Figure: per-file schema trees built from int, char, bool, array, and struct nodes, merged into one unified schema]
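The per-file to unified schema flow can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Glue crawler's actual algorithm: the function names are hypothetical, only JSON input is handled, and conflicting types fall back to `char`, which is an arbitrary choice made for this example.

```python
import json


def infer_schema(value):
    """Map a parsed JSON value to a simple type description."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "char"
    if isinstance(value, list):
        return {"array": unify_all(infer_schema(v) for v in value)}
    if isinstance(value, dict):
        return {"struct": {k: infer_schema(v) for k, v in value.items()}}
    return "null"


def unify(a, b):
    """Merge two schemas into one that can describe both."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        if "struct" in a and "struct" in b:
            fields = dict(a["struct"])
            for name, typ in b["struct"].items():
                fields[name] = unify(fields[name], typ) if name in fields else typ
            return {"struct": fields}
        if "array" in a and "array" in b:
            return {"array": unify(a["array"], b["array"])}
    return "char"  # assumption: conflicting types degrade to a string


def unify_all(schemas):
    """Fold unify() over any number of schemas."""
    result = "null"
    for schema in schemas:
        result = unify(result, schema)
    return result


# Two hypothetical files with overlapping but different fields.
file1 = json.loads('{"id": 1, "tags": ["a"], "ok": true}')
file2 = json.loads('{"id": 2, "name": "x"}')
unified = unify_all(infer_schema(f) for f in (file1, file2))
print(unified)
```

The unified schema keeps every field seen in any file, which is how a single table definition can cover semi-structured files whose records do not all share the same keys.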

Crawlers: Automatic partitions detection
Estimate schema similarity among files at each level to handle semi-structured logs, schema evolution…
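One simple way to estimate schema similarity is a Jaccard score over (field, type) pairs. The slide does not say which measure the crawler uses, so the measure, the 0.7 threshold, and the function names below are assumptions made purely for illustration.

```python
def similarity(a: dict, b: dict) -> float:
    """Jaccard similarity of (field, type) pairs: 1.0 identical, 0.0 disjoint."""
    fa, fb = set(a.items()), set(b.items())
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


def looks_like_one_table(schemas, threshold=0.7):
    """True if every schema is close enough to the first one (hypothetical rule)."""
    first = schemas[0]
    return all(similarity(first, s) >= threshold for s in schemas[1:])


# Hypothetical flat schemas for files found under sibling S3 prefixes.
day1 = {"ts": "char", "status": "int", "bytes": "int"}
day2 = {"ts": "char", "status": "int", "bytes": "int", "agent": "char"}  # evolved
users = {"user": "char", "email": "char"}

print(similarity(day1, day2))              # 3 shared pairs out of 4 -> 0.75
print(looks_like_one_table([day1, day2]))  # True: likely partitions of one table
print(looks_like_one_table([day1, users])) # False: likely separate tables
```

Under this rule, folders whose files have near-identical schemas are treated as partitions of one table, while an added column (schema evolution) still clears the threshold.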

Amazon Redshift – Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at scale: Columnar storage technology to improve I/O efficiency and scale query performance
Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance

Amazon EMR – Big data processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Latest versions: Updated with the latest open source frameworks within 30 days of release
Low cost: Flexible billing with per-second billing, EC2 spot, reserved instances, and auto-scaling to reduce costs 50-80%
Use S3 storage: Process data directly in the S3 data lake securely with high performance using the EMRFS connector
Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, cluster tuning
Data Lake

Amazon Elasticsearch Service
Easy to deploy, secure, operate, and scale Elasticsearch
Customers use Elasticsearch for log analytics, full-text search, and application monitoring
Easy to Use: Fully managed. Deploy production-ready clusters in minutes.
Open: Direct access to Elasticsearch open-source APIs. Supports Logstash and Kibana.
Secure: Secure access with VPC to keep all traffic within AWS network.
Available: Zone awareness replicates data between two AZs; automatically monitors and replaces failed nodes.

Amazon Kinesis – Real time
Easily collect, process, and analyze video and data streams in real time
Kinesis Video Streams (New): Capture, process, and store video streams for analytics
Kinesis Data Streams: Build custom applications that analyze data streams
Kinesis Data Firehose: Load data streams into AWS data stores
Kinesis Data Analytics: Analyze data streams with SQL

Highly connected data, best represented in a graph
Relational model
Foreign keys used to represent relationships
Queries can involve nesting & complex joins
Performance can degrade as datasets grow
Graph model
Relationships are first-order citizens
Easy to write queries that navigate the graph
Results returned quickly, even on large datasets
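To illustrate why navigation-style queries are natural in a graph model, here is a minimal sketch that stores relationships directly as adjacency lists and walks them breadth-first. The data and the function name are hypothetical. In a relational model the same "within two hops" question would typically need one self-join per hop.

```python
from collections import deque

# Hypothetical "follows" relationships stored as adjacency lists.
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": ["frank"],
    "frank": [],
}


def within_hops(graph, start, max_hops):
    """Return nodes reachable from start in 1..max_hops edges (breadth-first)."""
    seen = {start}
    found = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                found.append(neighbor)
                queue.append((neighbor, depth + 1))
    return found


print(within_hops(edges, "alice", 2))  # ['bob', 'carol', 'dave', 'erin']
```

Each hop only touches the neighbors of nodes already reached, so the work depends on the size of the neighborhood rather than on the size of the whole dataset.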

Building apps with highly connected data
Relational databases: Unnatural for querying graph; Inefficient graph processing; Rigid schema, inflexible for changing graphs
Existing graph databases: Difficult to scale; Difficult to maintain high availability; Too expensive; Limited support for open standards

Amazon Neptune
Fully managed graph database
Fast: Query billions of relationships with millisecond latency
Reliable: Six replicas of your data across three AZs, with full backup and restore
Easy: Build powerful queries easily with Gremlin and SPARQL
Open: Supports Apache TinkerPop & W3C RDF graph models

Serverless, API-Centric Computing

Serverless Analytics
Deliver cost-effective analytic solutions faster
Devices, Web, Sensors, Social; AWS IoT
Amazon S3 Data Lake
AWS Glue (ETL & Data Catalog)
Amazon Athena
Amazon QuickSight
Serverless. Zero Infrastructure. Zero Administration.
Never pay for idle resources
Availability and fault tolerance built in
Automatically scales resources with usage

Amazon Athena—interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query Instantly: Zero setup cost; just point to S3 and start querying
Pay per query: Pay only for queries run; save 30-90% on per-query costs through compression
Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy: Serverless: zero infrastructure, zero administration. Integrated with QuickSight
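As a rough sketch of "point to S3 and start querying", the snippet below builds the parameters for the boto3 Athena client's `start_query_execution` call. The database, table, and bucket names are hypothetical, and `run_query` needs AWS credentials and the boto3 package, so it is defined here but not invoked.

```python
def build_athena_request(sql: str, database: str, output_s3: str) -> dict:
    """Assemble keyword arguments for Athena's start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }


def run_query(request: dict) -> str:
    """Submit the query and return its execution id (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without the AWS SDK

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names: a catalog database "weblogs" with a table "access_logs".
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="weblogs",
    output_s3="s3://example-bucket/athena-results/",
)
print(request["QueryString"])
```

There is no cluster to create first: the request names only the SQL text, the catalog database, and an S3 location for results.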

Redshift Spectrum
Extend the data warehouse to your S3 data lake
Redshift data
Redshift Spectrum query engine
S3 data lake
Exabyte Redshift SQL queries against S3
Join data across Redshift and S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, and Parquet data formats
Pay only for the amount of data scanned

Aurora Serverless
• Starts up on demand, shuts down when not in use
• Automatically scales with no instances to manage
• No impact to applications during scaling events
• Pay per second for the database capacity you use
[Diagram labels: Application, Database end-point, Request Router, Scalable DB capacity, Warm pool of instances, Database Storage]

Instance provisioning and scaling
• First request triggers provisioning of a database instance. It typically takes about 5-10 secs.
• Instances scale up and scale down automatically in response to changes in workloads. Instance scaling takes about 1-3 secs.
• Instances are hibernated after a user-defined period of inactivity
• Scaling operations are transparent to the application – user sessions are not terminated
• Database storage is persisted until explicitly deleted by user
[Diagram labels: Application, Request Router, Current Instance, New Instance, Warm Pool, Database Storage]
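The provisioning and hibernation behavior described on this slide can be modeled as a tiny state machine. This is a toy model for intuition only: the class, its method names, and the timing values are hypothetical and say nothing about how the service is implemented.

```python
class ServerlessEndpoint:
    """Toy model: provision on first request, hibernate after idle timeout."""

    def __init__(self, idle_timeout_s: float):
        self.idle_timeout_s = idle_timeout_s  # the user-defined inactivity period
        self.state = "hibernated"
        self.last_request_at = None

    def _hibernate_if_idle(self, now: float) -> None:
        if (self.state == "running"
                and now - self.last_request_at >= self.idle_timeout_s):
            self.state = "hibernated"

    def on_request(self, now: float) -> str:
        """Handle a request at time `now`; report what the router had to do."""
        self._hibernate_if_idle(now)
        action = "provisioned" if self.state == "hibernated" else "served"
        self.state = "running"
        self.last_request_at = now
        return action


endpoint = ServerlessEndpoint(idle_timeout_s=300)
print(endpoint.on_request(now=0))    # provisioned (first request)
print(endpoint.on_request(now=10))   # served (instance already running)
print(endpoint.on_request(now=400))  # provisioned (idle for 390 s, so hibernated)
```

Only the requests that find the endpoint hibernated pay the provisioning delay; storage is outside this model because the slide states it persists independently of instances.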

Global Users, Local Processing

DynamoDB Global Tables (GA)
First fully managed, multi-master, multi-region database
Build high performance, globally distributed applications
Low latency reads & writes to locally available tables
Disaster proof with multi-region redundancy
Easy to set up, and no application rewrites required
Globally dispersed users
Global Table

Distributed Lock Manager
[Diagram labels: APPLICATION; SHARED DISK CLUSTER nodes, each with SQL, TRANSACTIONS, CACHING, LOGGING; GLOBAL RESOURCE MANAGER; LOCKING PROTOCOL MESSAGES; SHARED STORAGE holding blocks M1, M2, M3]
Cons
Heavyweight cache coherency traffic, on per-lock basis
Networking can be expensive
Negative scaling when hot blocks
Pros
All data available to all nodes
Easy-to-build applications
Similar cache coherency as in multi-processors

Consensus with two phase or Paxos commit
[Diagram labels: APPLICATION; SHARED NOTHING nodes, each with SQL, TRANSACTIONS, CACHING, LOGGING and its own STORAGE; DATA RANGE #1 to DATA RANGE #5, each with a log (L)]
Cons
Heavyweight commit and membership change protocols
Range partitioning can result in hot partitions, not just hot blocks. Re-partitioning expensive.
Cross partition operations expensive. Better at small requests
Pros
Query broken up and sent to data node
Less coherence traffic – just for commits
Can scale to many nodes
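For readers unfamiliar with the commit protocol named in this slide's title, here is a minimal in-memory sketch of two-phase commit. The class and function names are hypothetical, and the sketch leaves out timeouts, logging, and coordinator failure, which are the parts that make real implementations heavyweight.

```python
class Participant:
    """One data node taking part in a distributed transaction."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        """Phase 1: vote yes (and promise to commit) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> str:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes; otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"


nodes = [Participant("range-1"), Participant("range-2")]
print(two_phase_commit(nodes))  # committed

nodes = [Participant("range-1"), Participant("range-2", can_commit=False)]
print(two_phase_commit(nodes))  # aborted
```

Every cross-partition transaction pays two rounds of messages to all involved nodes, which is one reason cross-partition operations are listed as expensive.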

Conflict resolution using distributed ledgers
There are many “oases” of consistency in Aurora
The database nodes know the order of transactions from that node
The storage nodes know the order of transactions applied at that node
Conflicts arise only when data is changed at both multiple database nodes AND multiple storage nodes
Much less coordination required
[Diagram: two MASTER nodes writing through six storage nodes (1 to 6) with a quorum. Blue transactions BT1 to BT4 target Page 1 [P1]; orange transactions OT1 to OT4 target Page 2 [P2].]

Hierarchical conflict resolution
Both masters are writing to two pages, P1 and P2
BLUE master wins the quorum at page P1; ORANGE master wins the quorum at P2
Both masters recognize the conflict and have two choices: (1) roll back the transactions or (2) escalate to the regional resolver
Regional arbitrator decides who wins the tie breaker.
[Diagram: two MASTER nodes and six storage nodes (1 to 6). Blue transaction BT1 and orange transaction OT1 both write Page 1 [P1] and Page 2 [P2]; the quorum is split, both are marked X, and BT1 and OT1 are sent to the Regional resolver.]
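The escalation path can be sketched as follows. This is an illustrative model, not Aurora's implementation: the 4-of-6 write quorum, the function names, and the shape of the `tie_breaker` callback are assumptions, and only option (2), escalation, is modeled.

```python
WRITE_QUORUM = 4  # assumption: 4 of the 6 storage nodes must accept a write


def quorum_winner(acks: dict):
    """acks maps master name -> storage nodes that accepted its write first."""
    for master, count in acks.items():
        if count >= WRITE_QUORUM:
            return master
    return None


def resolve(page_acks: dict, tie_breaker):
    """Return the master whose transaction survives.

    page_acks maps page -> acks dict. tie_breaker stands in for the regional
    resolver: it receives the sorted master names and picks one.
    """
    winners = {quorum_winner(acks) for acks in page_acks.values()}
    winners.discard(None)
    if len(winners) == 1:
        return winners.pop()  # one master won every page that reached quorum
    masters = sorted({m for acks in page_acks.values() for m in acks})
    return tie_breaker(masters)  # split decision: escalate


# The slide's scenario: blue wins P1, orange wins P2.
split = {
    "P1": {"blue": 4, "orange": 2},
    "P2": {"blue": 2, "orange": 4},
}
print(resolve(split, tie_breaker=lambda masters: masters[0]))  # blue
```

When a single master wins the quorum on every contested page, no escalation happens, which matches the slide's point that coordination is needed only in the split case.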

Crash recovery in multi-master
[Diagram, SINGLE MASTER panel. AT CRASH: log records with gaps; Volume Complete LSN (VCL); Consistency Point LSN (CPL). IMMEDIATELY AFTER RECOVERY: gap filled; VCL; CPL.]
[Diagram, MULTI-MASTER panel. AT CRASH: MASTER 1 CRASHES; MASTER 1 and MASTER 2 log records with GAPS. IMMEDIATELY AFTER RECOVERY: new LSNs and gaps; Master 1 Recovery Point; Consistency Point LSN (CPL).]
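The relationship between the two LSN markers can be sketched under some simplifying assumptions. The slide only names VCL and CPL. The sketch assumes that VCL is the highest LSN below which no log record is missing, that the recovery point is the highest CPL at or below the VCL, and that LSNs are consecutive integers starting at 1. The function names are hypothetical.

```python
def volume_complete_lsn(received: set) -> int:
    """Highest LSN such that every LSN from 1 up to it has been received."""
    lsn = 0
    while lsn + 1 in received:
        lsn += 1
    return lsn


def recovery_point(received: set, consistency_points: set) -> int:
    """Highest consistency point LSN (CPL) that does not exceed the VCL."""
    vcl = volume_complete_lsn(received)
    eligible = [cpl for cpl in consistency_points if cpl <= vcl]
    return max(eligible, default=0)


# Hypothetical log state at crash: records 6 and 9 never arrived.
received = {1, 2, 3, 4, 5, 7, 8, 10}
cpls = {3, 6, 9}

print(volume_complete_lsn(received))   # 5: the first gap is at LSN 6
print(recovery_point(received, cpls))  # 3: highest CPL at or below the VCL
```

In the multi-master case each master contributes its own stream of log records, so a crashed master would need its own recovery point of this kind while the surviving master keeps issuing new LSNs.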

Multi-region Multi-master
Writes accepted locally
Optimistic concurrency control – no distributed lock manager, no chatty lock management protocol
Conflicts handled hierarchically – at head nodes, at storage nodes, at AZ and region level arbitrators
Near-linear performance scaling when there are no conflicts or only low levels of conflict
[Diagram: REGION 1 and REGION 2, each with HEAD NODES over a MULTI-AZ STORAGE VOLUME holding a LOCAL PARTITION and a REMOTE PARTITION]
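Optimistic concurrency control in its simplest form is a version check at write time instead of a lock taken up front. The sketch below shows that generic technique with a hypothetical in-memory store. It does not represent Aurora's internals.

```python
class VersionedStore:
    """Key-value store where each write must name the version it read."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unseen keys are at version 0."""
        return self._data.get(key, (0, None))

    def write_if_unchanged(self, key, expected_version, value) -> bool:
        """Apply the write only if nobody else wrote since we read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            return False  # conflict detected: caller must re-read and retry
        self._data[key] = (current_version + 1, value)
        return True


store = VersionedStore()
version_a, _ = store.read("balance")  # writer A reads version 0
version_b, _ = store.read("balance")  # writer B reads version 0

print(store.write_if_unchanged("balance", version_a, 100))  # True: A wins
print(store.write_if_unchanged("balance", version_b, 200))  # False: B conflicts

version_b, _ = store.read("balance")  # B re-reads (now version 1) and retries
print(store.write_if_unchanged("balance", version_b, 200))  # True
```

Writers exchange no messages unless a conflict actually occurs, which is why throughput can scale close to linearly when conflicts are rare.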

Thank you!
