SlideShare a Scribd company logo
1 of 57
Download to read offline
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Big Data Architectural Patterns
and Best Practices on AWS
S i v a R a g h u p a t h y
S r . M a n a g e r , A I , A n a l y t i c s , a n d D a t a b a s e S o l u t i o n s A r c h i t e c t u r e A W S
N o v e m b e r 2 7 , 2 0 1 7
A B D 2 0 1
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What to Expect from the Session
Big data challenges
Architectural principles
How to simplify big data processing
What technologies should you use?
• Why?
• How?
Reference architecture
Design patterns
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ever Increasing Big Data
Volume
Velocity
Variety
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Batch
processing
Stream
processing
Artificial
intelligence
Big Data Evolution
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Virtual
machines
Managed
services
Serverless
Cloud Services Evolution
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Plethora of Tools
Amazon
Glacier
Amazon S3 Amazon DynamoDB
Amazon RDS
Amazon EMR
Amazon
Redshift
Amazon
Kinesis
Lambda Amazon ML
Amazon SQS
ElastiCache
Amazon DynamoDB
Streams
Amazon ES
Amazon Kinesis
Analytics
Amazon
QuickSight AWS Glue
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Big Data Challenges
Why?
How?
What tools should I use?
Is there a reference architecture?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage managed and serverless services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs (data lake), materialized views
Be cost-conscious
• Big data ≠ big cost
AI/ML enable your applications
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost
Simplify Big Data Processing
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What Is the Temperature of Your Data?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Hot Warm Cold
Volume MB–GB GB–TB PB–EB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low–high High Very high
Request rate Very high High Low
Cost/GB $$-$ $-¢¢ ¢
Hot data Warm data Cold data
Data Characteristics: Hot, Warm, Cold
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
COLLECT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
COLLECT
Devices
Sensors
IoT platforms
AWS IoT STREAMS
IoT
EventsData streams
Migration
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
FILES
DataTransport&Logging
Import/expo
rt
Files
Log files
Media files
Mobile apps
Web apps
Data centers AWS Direct
Connect
RECORDS
Applications
Transactions
Data structures
Database records
Type of Data
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
STORE
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Events
Files
Transactions
COLLECT
Devices
Sensors
IoT platforms
AWS IoT STREAMS
IoT
Data streams
Migration
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
FILES
DataTransport&Logging
Import/expo
rt
Log files
Media files
Mobile apps
Web apps
Data centers AWS Direct
Connect
RECORDS
Applications
Data structures
Database records
Type of Data STORE
NoSQL
In-memory
SQL
File/object
store
Stream
storage
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
COLLECT
Devices
Sensors
IoT platforms
AWS IoT STREAMS
IoT
Migration
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
FILES
DataTransport&Logging
Import/expo
rt
Mobile apps
Web apps
Data centers AWS Direct
Connect
RECORDS
Applications
STORE
NoSQL
In-memory
SQL
File/object
store
Stream
storage
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
COLLECT
Devices
Sensors
IoT platforms
AWS IoT STREAMS
IoT
Migration
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
FILES
DataTransport&Logging
Import/expo
rt
Mobile apps
Web apps
Data centers AWS Direct
Connect
RECORDS
Applications
STORE
NoSQL
In-memory
SQL
File/object
store
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Apache Kafka
Apache Kafka
• High throughput distributed
streaming platform
Amazon Kinesis Streams
• Managed stream storage
Amazon Kinesis Firehose
• Managed data delivery
Stream Storage
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Parallel consumption
Streaming MapReduce
4 4 3 3 2 2 1 1
4 3 2 1
4 3 2 1
4 3 2 1
4 3 2 1
4 4 3 3 2 2 1 1
shard 1 / partition 1
shard 2 / partition 2
Consumer 1
Count of
red = 4
Count of
violet = 4
Consumer 2
Count of
blue = 4
Count of
green = 4
DynamoDB stream Amazon Kinesis stream Kafka topic
Why Stream Storage?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (standard)
• FIFO queue preserves client
ordering
• No streaming MapReduce
• No parallel consumption
• Amazon SNS can publish to
multiple SNS subscribers
(queues or Lambda functions)
What About Amazon SQS?
Consumers
4 3 2 1
12344 3 2 1
1234
2134
13342
Standard
FIFO
Publisher
Amazon SNS
Topic
AWS Lambda
function
Amazon SQS
queue
Queue
Subscriber
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Which Stream/Message Storage Should I Use?
Hot Warm
Amazon
Kinesis
Streams
Amazon
Kinesis
Firehose
Apache
Kafka (on Amazon
EC2)
Amazon
SQS
(Standard)
Amazon SQS
(FIFO)
AWS managed Yes Yes No Yes Yes
Guaranteed ordering Yes No Yes No Yes
Delivery (deduping) At least once At least once At least/At
most/exactly once
At least once Exactly once
Data retention period 7 days N/A Configurable 14 days 14 days
Availability 3 AZ 3 AZ Configurable 3 AZ 3 AZ
Scale /
throughput
No limit /
~ shards
No limit /
automatic
No limit /
~ nodes
No limits /
automatic
300 TPS /
queue
Parallel consumption Yes No Yes No No
Stream MapReduce Yes N/A Yes N/A N/A
Row/object size 1 MB Destination
row/object size
Configurable 256 KB 256 KB
Cost Low Low Low (+admin) Low-medium Low-medium
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
COLLECT STORE
File
Apache Kafka
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Hot
Stream
Mobile apps
Web apps
Devices
Sensors
IoT platforms
AWS IoT
Data centers AWS Direct
Connect
Migration
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
FILES
STREAMS
DataTransport&LoggingIoTApplications
Import/expo
rt
NoSQL
In-memory
SQL
Amazon S3
Amazon S3
Managed object storage service
built to store and retrieve any
amount of data
File/Object Storage
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Use Amazon S3 as Your Persistent File Store
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Amazon EMR clusters with Amazon EC2 Spot
Instances
• Multiple & heterogeneous analysis clusters and services can use
the same data
• Designed for 99.999999999% durability
• No need to pay for data replication within a region
• Secure: SSL, client/server-side encryption at rest
• Low cost
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What About HDFS & Data Tiering?
• Use HDFS for hottest datasets (e.g.
iterative read on the same datasets)
• Use Amazon S3 Standard for
frequently accessed data
• Use Amazon S3 Standard – IA for less
frequently accessed data
• Use Amazon Glacier for archiving cold
data
• Use S3 Analytics to optimize tiering
strategy S3 Analytics
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
COLLECT STORE
Amazon
DynamoDB
Amazon RDS
Amazon Aurora
File
Mobile apps
Web apps
Devices
Sensors
IoT platforms
AWS IoT
Data centers AWS Direct
Connect
Migration
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
FILES
STREAMS
LoggingIoTApplications
Amazon S3
Amazon DAX
Amazon ElastiCache
Import/expo
rt
SQLNoSQLCache
Amazon ElastiCache
• Managed Memcached or Redis service
Amazon DynamoDB Accelerator
(DAX)
• Managed in-memory cache for
DynamoDB
Amazon DynamoDB
• Managed NoSQL database service
Amazon RDS
• Managed relational database service
Cache & Database
Apache Kafka
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Hot
Stream
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Anti-Pattern
Database Tier
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Best Practice: Use the Right Tool for the Job
SearchIn-memory SQLNoSQL
Database Tier
GraphDB
Amazon RDS/AuroraAmazon DynamoDBAmazon ElastiCache Amazon
DynamoDB
Acclerator
SAP HANA
Amazon ES Amazon
CloudSearch
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Materialized Views and Immutable Log
Cache View
Search View
Immutable Log
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Which Data Store Should I Use?
Data structure → Fixed-schema, JSON, Key/Value,
Access patterns → Store data in the format you will access it
Data characteristics → Hot, warm, cold
Cost → Right cost
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Structure and Access Patterns
Access Patterns What to use?
Put/Get (key, value) In-memory, NoSQL
Simple relationships → 1:N, M:N NoSQL
Multi-table joins, transaction, SQL SQL
Faceting, Search Search
Graph traversal GraphDB
Data Structure What to use?
Fixed schema SQL, NoSQL
Schema-free (JSON) NoSQL, Search
Key/Value In-memory, NoSQL
Graph GraphDB
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
In-memory
SQL
Request rate
High Low
Cost/GB
High Low
Latency
Low High
Data volume
Low High
Amazon
Glacier
Structure
NoSQL
Hot data Warm data Cold data
Low
High
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon
ElastiCache
Amazon
DAX
Amazon
DynamoDB
Amazon
RDS (Aurora)
Amazon ES Amazon S3
Amazon
Glacier
Average
Latency
µs-ms µs-ms ms ms, sec ms,sec ms,sec,min
(~ size)
hrs
Typical
Data
Volume
GB GB GB–TBs
(no limit)
GB–TB
(64 TB max)
GB–TB MB–PB
(no limit)
GB–PB
(no limit)
Typical
Item Size
B-KB KB
(400 KB max)
KB
(400 KB max)
KB
(64 KB max)
B-KB
(2 GB max)
KB-TB
(5 TB max)
GB
(40 TB max)
Request
Rate
High – very
high
High – very
high
Very high
(no limit)
High High Low – high
(no limit)
Very low
Storage Cost
GB/Month
$$ $$ ¢¢ ¢¢ ¢¢ ¢ ¢4/10
Durability Low -
moderate
NA Very high Very high High Very high Very high
Availability High
2 AZ
High
3 AZ
Very high
3 AZ
Very high
3 AZ
High
2 AZ
Very high
3 AZ
Very high
3 AZ
Hot data Warm data Cold data
Which Data Store Should I Use?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Cost-Conscious Design
Example: Should I use Amazon S3 or Amazon
DynamoDB?
“I’m currently scoping out a project. The design calls for many small
files, perhaps up to a billion during peak. The total size would be on
the order of 1.5 TB per month…”
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per month
300 2048 1483 777,600,000
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Request rate
(writes/sec)
Object size
(bytes)
Total size
(GB/month)
Objects per
month
300 2,048 1,483 777,600,000
Amazon S3 or
Amazon DynamoDB?
https://calculator.s3.amazonaws.com/index.html
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon S3
Wins!
300 23,730 777,600,00032,768
Amazon DynamoDB
Wins!
Request rate
(writes/sec)
Object size
(bytes)
Total size
(GB/month)
Objects per
month
300 2,048 1,483 777,600,000
Amazon S3 or
Amazon DynamoDB?
Scenario 2
Scenario 1
Amazon S3 Standard
Storage $34
Put/list requests $3,888
Total $3,922
Amazon DynamoDB
Provisioned throughput $273
Indexed data storage $383
Total $656
Amazon S3 Standard
Storage $545
Put/List Requests $3,888
Total $4,433
Amazon DynamoDB
Provisioned Throughput $4,556
Indexed Data Storage $5,944
Total $10,500
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
PROCESS /
ANALYZE
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
API-driven Services
• Amazon Lex – Speech recognition
• Amazon Polly – Text to speech
• Amazon Rekognition – Image analysis
Managed ML Platforms
• Amazon ML
• Apache Spark ML on EMR
AWS Deep Learning AMI
• Pre-installed with MXNet, TensorFlow,
Caffe2 (and Caffe), Theano, Torch,
Microsoft Cognitive Toolkit, and Keras;
plus DL tools/drivers
A m a z o n A I
Predictive Analytics
PROCESS/ANALYZE
Predictive
AmazonAI
Lex PollyAML Rekognition
AWS DL AMI
Developers
Data scientists
Deep learning
experts
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
PROCESS / ANALYZE
Amazon Redshift
& Spectrum
Amazon Athena
BatchInteractive
Amazon ES
Interactive and Batch Analytics
Amazon ES
• Managed Service for Elasticsearch
Amazon Redshift and Amazon Redshift Spectrum
• Managed Data Warehouse
• Spectrum enables querying Amazon S3
Amazon Athena
• Serverless Interactive Query Service
Amazon EMR
• Managed Hadoop Framework for running Apache
Spark, Flink, Presto, Tez, Hive, Pig, HBase, etc.
Presto
Amazon
EMR
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Stream/Real-time Analytics
Spark Streaming on Amazon EMR
Amazon Kinesis Analytics
• Managed Service for running SQL on
Streaming data
Amazon KCL
• Amazon Kinesis Client Library
AWS Lambda
• Run code Serverless (without
provisioning or managing servers)
• Services such as S3 can publish
events to Lambda
• Lambda can pool event from a
Kinesis
PROCESS / ANALYZE
KCL
Apps
AWS Lambda
Amazon Kinesis
Analytics
Stream
Streaming
Amazon EMR
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Which Analytics Should I Use? PROCESS / ANALYZE
Batch
Takes minutes to hours
Example: Daily/weekly/monthly reports
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
Takes seconds
Example: Self-service dashboards
Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Stream
Takes milliseconds to seconds
Example: Fraud alerts, 1 minute metrics
Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL,
AWS Lambda, etc.
Predictive
Takes milliseconds (real-time) to minutes (batch)
Example: Fraud detection, Forecasting demand, Speech
recognition
Amazon AI (Lex, Polly, ML, Amazon Rekognition), Amazon EMR
(Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK,
and Caffe)
Streaming
Amazon Kinesis
Analytics
KCL
Apps
AWS Lambda
Stream
Amazon EMR
Fast
Amazon ES
Amazon Redshift
& Spectrum
Presto
Amazon
EMR
Amazon Athena
BatchInteractive
FastSlow
Predictive
AmazonAI
Lex PollyAML Rekognition
AWS DL AMI
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Which Stre am Proce ssing T e chnolog y Shou ld I Use ?
Amazon EMR
(Spark
Streaming)
KCL Application Amazon Kinesis
Analytics
AWS Lambda
Managed Service Yes No (EC2 + Auto
Scaling)
Yes Yes
Serverless No No Yes Yes
Scale/Throughput No limits /
~ nodes
No limits /
~ nodes
No limits /
automatic
No limits /
automatic
Availability Single AZ Multi-AZ Multi-AZ Multi-AZ
Programming
Languages
Java, Python,
Scala
Java, others via
MultiLangDaemon
ANSI SQL with
extensions
Node.js, Java, Python
Sliding Window
Functions
Build-in App needs to
implement
Built-in No
Reliability KCL and Spark
checkpoints
Managed by KCL Managed by
Amazon Kinesis
Analytics
Managed by AWS Lambda
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Which Analytics Tool Should I Use?
Amazon Redshift Amazon Redshift
Spectrum
Amazon Athena Amazon EMR
Presto Spark Hive
Use case Optimized for data
warehousing
Query S3 data from
Amazon Redshift
Interactive Queries
over S3 data
Interactive
Query
General
purpose
Batch
Scale/Throughput ~Nodes ~Nodes Automatic ~ Nodes
Managed Service Yes Yes Yes, Serverless Yes
Storage Local storage Amazon S3 Amazon S3 Amazon S3, HDFS
Optimization Columnar storage,
data compression,
and zone maps
AVRO, PARQUET
TEXT, SEQ
RCFILE, ORC, etc.
AVRO, PARQUET
TEXT, SEQ
RCFILE, ORC, etc.
Framework dependent
Metadata Amazon Redshift
Catalog
Glue Catalog Glue Catalog Glue Catalog or
Hive Meta-store
Auth/Access controls IAM, Users, groups,
and access controls
IAM, Users, groups,
and access controls
IAM IAM, LDAP & Kerberos
UDF support Yes (Scalar) Yes (Scalar) No Yes
Slow
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What About Extract Transform and Load?
ETLSTORE PROCESS / ANALYZE
Data Integration Partners
Reduce the effort to move, cleanse, synchronize,
manage, and automatize data-related processes.
AWS Glue is a fully managed (serverless) ETL
service that makes it simple and cost-effective
to categorize your data, clean it, enrich it, and
move it reliably between various data stores.
Data Catalog Job Authoring Job Execution
AWS Glue
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
COLLECT STORE PROCESS/ANALYZE
Apache Kafka
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Amazon
DynamoDB
Amazon ElastiCache
Amazon RDS
Amazon Aurora
HotHotWarm
SQLNoSQLCacheFileStream
Mobile apps
Web apps
Devices
Sensors
IoT platforms
AWS IoT
Data centers AWS Direct
Connect
Migration
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
FILES
STREAMS
DataTransport&LoggingIoTApplications
Slow
Amazon S3
Amazon ES
Amazon Redshift
& Spectrum
Presto
Amazon
EMR
Fast
Amazon Athena
BatchInteractive
Amazon DAX
Import/expo
rt
Predictive
AmazonAI
Lex PollyAML Rekognition
AWS DL AMI
ETL
Streaming
Amazon Kinesis
Analytics
KCL
Apps
Fast
Stream
Amazon EMR
AWS Lambda
Fast
CONSUME
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
CONSUME
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
• BI/AI Applications
• Amazon EC2 or ECS
Containers
• AWS Greengrass
• Data Science
• Notebooks
• DS Platforms
• IDEs
• Analysis and Visualization
• Amazon QuickSight
• Tableau
• ….
COLLECT STORE CONSUMEPROCESS/ANALYZEETL
Amazon
QuickSight
Analysis&visualization
Model
Train/
Eval
Models
Deploy
DataSceince
AI Apps
Amazon ECS
Apps
AWS Greengrass
Predictive
AmazonAI
Lex PollyAML Rekognition
AWS DL AMI
Business
users
DevOps
Data Scientists
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Putting It All Together
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Streaming
Amazon Kinesis
Analytics
KCL
Apps
AWS Lambda
COLLECT STORE CONSUMEPROCESS/ANALYZE
Amazon ES
Apache Kafka
Amazon Kinesis
Streams
Amazon Kinesis
Firehose
Amazon
DynamoDB
Amazon ElastiCache
Amazon RDS
Amazon Aurora
HotHotWarm
Fast
Stream
SQLNoSQLCacheFileStream
Mobile apps
Web apps
Devices
Sensors
IoT platforms
AWS IoT
Data centers AWS Direct
Connect
Migration
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
FILES
STREAMS
Amazon
QuickSight
Analysis&visualizationDataSceince
DataTransport&LoggingIoTApplications
Amazon EMR
Amazon Redshift
& Spectrum
Presto
Amazon
EMR
FastSlow
Amazon Athena
BatchInteractivePredictive
AmazonAI
Amazon S3
Amazon DAX
Import/expo
rt
Lex PollyAML Rekognition
AWS DL AMI
AI Apps
Amazon ECS
Apps
Model
Train/
Eval
Models
Deploy
ETL
AWS Greengrass
Design Patterns
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Spark Streaming
AWS Lambda
KCL apps
Amazon
Redshift
Amazon
Redshift
Hive
Spark
Presto
ProcessingTechnology
FastSlow
Answers
Hive
Native apps
KCL apps
AWS Lambda
Amazon
Athena
Amazon Kinesis Amazon
DynamoDB/RDS
Amazon S3Data
Hot Cold
Data Store
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Real-time Analytics
Amazon EMR
KCL app
AWS Lambda
Spark
Streaming
Amazon
AI
Real-time prediction
Amazon
ElastiCache
(Redis)
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
App state or
Materialized
View
KPI
Process
Store
Amazon
Kinesis
Amazon Kinesis
Analytics
Amazon
SNS NotificationsAlerts
Amazon
S3
Log
Amazon
KinesisFan out Downstream
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Interactive
and
Batch
Analytics
process
store Batch
Interactive
Amazon EMR
Hive
Pig
Spark
Amazon
AI
Batch prediction
Real-time prediction
Amazon S3
Files
Amazon
Kinesis
Firehose
Amazon Kinesis
Analytics
Amazon Redshift
Amazon ES
Consume
Amazon EMR
Presto
Spark
Amazon Athena
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Real-time
App state
or
Materialized
View
Interactive
and
Batch
Data Lake
Amazon S3
Amazon Redshift
Amazon EMR
Presto
Hive
Pig
Spark
Amazon
ElastiCache
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
AWS Lambda
Spark Streaming
on Amazon EMR
Applications
Amazon
Kinesis
KCL
Amazon
AI
Amazon
DynamoDB
Amazon
RDS
Change Data
Capture or Export
Transactions
Stream
Files
Amazon Kinesis
Analytics
Amazon Athena
Amazon
Kinesis
Firehose
Amazon ES
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What about Metadata?
• Glue Catalog
• Hive Metastore compliant
• Crawlers - Detect new data, schema, partitions
• Search - Metadata discovery
• Amazon Athena, Amazon EMR, and Amazon
Redshift Spectrum compatible
• Hive Metastore (Presto, Spark, Hive, Pig)
• Can be hosted on Amazon RDS
Data
Catalog Metastore RDS
Glue
Catalog AWS Glue
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Security & Governance
• AWS Identity and Access Management (IAM)
• Amazon Cognito
• Amazon CloudWatch & AWS CloudTrail
• AWS KMS
• AWS Directory Service
• Apache Ranger
Security &
Governance IAM Amazon
CloudWatch
AWS
CloudTrail
AWS
AWSKMS
AWS
CloudHSM
AWS Directory
Service
Amazon
Cognito
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data lake
Reference
Architecture
Summary
Summary
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed and serverless services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, data lake, materialized views
Be cost-conscious
• Big data ≠ Big cost
AI/ML enable your applications
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
THANK YOU!

More Related Content

What's hot

Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudAmazon Web Services
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Amazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
워크로드 특성에 따른 안전하고 효율적인 Data Lake 운영 방안
워크로드 특성에 따른 안전하고 효율적인 Data Lake 운영 방안워크로드 특성에 따른 안전하고 효율적인 Data Lake 운영 방안
워크로드 특성에 따른 안전하고 효율적인 Data Lake 운영 방안Amazon Web Services Korea
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineAmazon Web Services
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017Amazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Amazon Web Services
 
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...Amazon Web Services
 
Amazon Container 환경의 보안 – 최인영, AWS 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
Amazon Container 환경의 보안 – 최인영, AWS 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집Amazon Container 환경의 보안 – 최인영, AWS 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
Amazon Container 환경의 보안 – 최인영, AWS 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집Amazon Web Services Korea
 
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018Amazon Web Services Korea
 
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Empower Splunk and other SIEMs with the Databricks Lakehouse for CybersecurityEmpower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Empower Splunk and other SIEMs with the Databricks Lakehouse for CybersecurityDatabricks
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaScyllaDB
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks DeltaDatabricks
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for DummiesRodney Joyce
 

What's hot (20)

Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
BDA311 Introduction to AWS Glue
BDA311 Introduction to AWS GlueBDA311 Introduction to AWS Glue
BDA311 Introduction to AWS Glue
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
How to Streamline DataOps on AWS
How to Streamline DataOps on AWSHow to Streamline DataOps on AWS
How to Streamline DataOps on AWS
 
워크로드 특성에 따른 안전하고 효율적인 Data Lake 운영 방안
워크로드 특성에 따른 안전하고 효율적인 Data Lake 운영 방안워크로드 특성에 따른 안전하고 효율적인 Data Lake 운영 방안
워크로드 특성에 따른 안전하고 효율적인 Data Lake 운영 방안
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
 
ABD217_From Batch to Streaming
ABD217_From Batch to StreamingABD217_From Batch to Streaming
ABD217_From Batch to Streaming
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
Working with Relational Databases in AWS Glue ETL (ANT342) - AWS re:Invent 2018
 
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
Deep Dive: Building Hybrid Cloud Storage Architectures with AWS Storage Gatew...
 
Amazon Container 환경의 보안 – 최인영, AWS 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
Amazon Container 환경의 보안 – 최인영, AWS 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집Amazon Container 환경의 보안 – 최인영, AWS 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
Amazon Container 환경의 보안 – 최인영, AWS 솔루션즈 아키텍트:: AWS 온라인 이벤트 – 클라우드 보안 특집
 
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
Amazon.com 사례와 함께하는 유통 차세대 DW 구축을 위한 Data Lake 전략::구태훈::AWS Summit Seoul 2018
 
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Empower Splunk and other SIEMs with the Databricks Lakehouse for CybersecurityEmpower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity
 
Athena & Glue
Athena & GlueAthena & Glue
Athena & Glue
 
Data Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation CriteriaData Platform Architecture Principles and Evaluation Criteria
Data Platform Architecture Principles and Evaluation Criteria
 
Introducing Databricks Delta
Introducing Databricks DeltaIntroducing Databricks Delta
Introducing Databricks Delta
 
Databricks for Dummies
Databricks for DummiesDatabricks for Dummies
Databricks for Dummies
 

Similar to ABD201-Big Data Architectural Patterns and Best Practices on AWS

Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design PatternsJohn Yeung
 
21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with ZopaAmazon Web Services
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansAmazon Web Services
 
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...Amazon Web Services
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...Amazon Web Services
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCAmazon Web Services LATAM
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Amazon Web Services
 
STG309_Deep Dive Using Hybrid Storage with AWS Storage Gateway to Solve On-Pr...
STG309_Deep Dive Using Hybrid Storage with AWS Storage Gateway to Solve On-Pr...STG309_Deep Dive Using Hybrid Storage with AWS Storage Gateway to Solve On-Pr...
STG309_Deep Dive Using Hybrid Storage with AWS Storage Gateway to Solve On-Pr...Amazon Web Services
 
Serverless Architectural Patterns
Serverless Architectural PatternsServerless Architectural Patterns
Serverless Architectural PatternsAmazon Web Services
 
Data Warehouses & Data Lakes: Data Analytics Week SF
Data Warehouses & Data Lakes: Data Analytics Week SFData Warehouses & Data Lakes: Data Analytics Week SF
Data Warehouses & Data Lakes: Data Analytics Week SFAmazon Web Services
 
STG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data WorkloadsSTG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data WorkloadsAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
How TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPTHow TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPTAmazon Web Services
 

Similar to ABD201-Big Data Architectural Patterns and Best Practices on AWS (20)

Deep Dive on Big Data
Deep Dive on Big Data Deep Dive on Big Data
Deep Dive on Big Data
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with Zopa
 
STG401_This Is My Architecture
STG401_This Is My ArchitectureSTG401_This Is My Architecture
STG401_This Is My Architecture
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data Oceans
 
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
 
Builders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LCBuilders' Day - Building Data Lakes for Analytics On AWS LC
Builders' Day - Building Data Lakes for Analytics On AWS LC
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Data_Analytics_and_AI_ML
Data_Analytics_and_AI_MLData_Analytics_and_AI_ML
Data_Analytics_and_AI_ML
 
STG309_Deep Dive Using Hybrid Storage with AWS Storage Gateway to Solve On-Pr...
STG309_Deep Dive Using Hybrid Storage with AWS Storage Gateway to Solve On-Pr...STG309_Deep Dive Using Hybrid Storage with AWS Storage Gateway to Solve On-Pr...
STG309_Deep Dive Using Hybrid Storage with AWS Storage Gateway to Solve On-Pr...
 
Serverless Architectural Patterns
Serverless Architectural PatternsServerless Architectural Patterns
Serverless Architectural Patterns
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Data Warehouses & Data Lakes: Data Analytics Week SF
Data Warehouses & Data Lakes: Data Analytics Week SFData Warehouses & Data Lakes: Data Analytics Week SF
Data Warehouses & Data Lakes: Data Analytics Week SF
 
STG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data WorkloadsSTG316_Optimizing Storage for Big Data Workloads
STG316_Optimizing Storage for Big Data Workloads
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
How TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPTHow TrueCar Gains Actionable Insights with Splunk Cloud PPT
How TrueCar Gains Actionable Insights with Splunk Cloud PPT
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

ABD201-Big Data Architectural Patterns and Best Practices on AWS

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT Big Data Architectural Patterns and Best Practices on AWS S i v a R a g h u p a t h y S r . M a n a g e r , A I , A n a l y t i c s , a n d D a t a b a s e S o l u t i o n s A r c h i t e c t u r e A W S N o v e m b e r 2 7 , 2 0 1 7 A B D 2 0 1
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What to Expect from the Session Big data challenges Architectural principles How to simplify big data processing What technologies should you use? • Why? • How? Reference architecture Design patterns
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ever Increasing Big Data Volume Velocity Variety
  • 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Batch processing Stream processing Artificial intelligence Big Data Evolution
  • 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Virtual machines Managed services Serverless Cloud Services Evolution
  • 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Plethora of Tools Amazon Glacier Amazon S3 Amazon DynamoDB Amazon RDS Amazon EMR Amazon Redshift Amazon Kinesis Lambda Amazon ML Amazon SQS ElastiCache Amazon DynamoDB Streams Amazon ES Amazon Kinesis Analytics Amazon QuickSight AWS Glue
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Data Challenges Why? How? What tools should I use? Is there a reference architecture?
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Architectural Principles Build decoupled systems • Data → Store → Process → Store → Analyze → Answers Use the right tool for the job • Data structure, latency, throughput, access patterns Leverage managed and serverless services • Scalable/elastic, available, reliable, secure, no/low admin Use log-centric design patterns • Immutable logs (data lake), materialized views Be cost-conscious • Big data ≠ big cost AI/ML enable your applications
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost Simplify Big Data Processing
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What Is the Temperature of Your Data?
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Hot Warm Cold Volume MB–GB GB–TB PB–EB Item size B–KB KB–MB KB–TB Latency ms ms, sec min, hrs Durability Low–high High Very high Request rate Very high High Low Cost/GB $$-$ $-¢¢ ¢ Hot data Warm data Cold data Data Characteristics: Hot, Warm, Cold
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT Devices Sensors IoT platforms AWS IoT STREAMS IoT EventsData streams Migration Snowball Logging Amazon CloudWatch AWS CloudTrail FILES DataTransport&Logging Import/expo rt Files Log files Media files Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications Transactions Data structures Database records Type of Data
  • 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. STORE
  • 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Events Files Transactions COLLECT Devices Sensors IoT platforms AWS IoT STREAMS IoT Data streams Migration Snowball Logging Amazon CloudWatch AWS CloudTrail FILES DataTransport&Logging Import/expo rt Log files Media files Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications Data structures Database records Type of Data STORE NoSQL In-memory SQL File/object store Stream storage
  • 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT Devices Sensors IoT platforms AWS IoT STREAMS IoT Migration Snowball Logging Amazon CloudWatch AWS CloudTrail FILES DataTransport&Logging Import/expo rt Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications STORE NoSQL In-memory SQL File/object store Stream storage
  • 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT Devices Sensors IoT platforms AWS IoT STREAMS IoT Migration Snowball Logging Amazon CloudWatch AWS CloudTrail FILES DataTransport&Logging Import/expo rt Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications STORE NoSQL In-memory SQL File/object store Amazon Kinesis Firehose Amazon Kinesis Streams Apache Kafka Apache Kafka • High throughput distributed streaming platform Amazon Kinesis Streams • Managed stream storage Amazon Kinesis Firehose • Managed data delivery Stream Storage
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Decouple producers & consumers Persistent buffer Collect multiple streams Preserve client ordering Parallel consumption Streaming MapReduce 4 4 3 3 2 2 1 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 shard 1 / partition 1 shard 2 / partition 2 Consumer 1 Count of red = 4 Count of violet = 4 Consumer 2 Count of blue = 4 Count of green = 4 DynamoDB stream Amazon Kinesis stream Kafka topic Why Stream Storage?
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Decouple producers & consumers • Persistent buffer • Collect multiple streams • No client ordering (standard) • FIFO queue preserves client ordering • No streaming MapReduce • No parallel consumption • Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions) What About Amazon SQS? Consumers 4 3 2 1 12344 3 2 1 1234 2134 13342 Standard FIFO Publisher Amazon SNS Topic AWS Lambda function Amazon SQS queue Queue Subscriber
  • 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Stream/Message Storage Should I Use? Hot Warm Amazon Kinesis Streams Amazon Kinesis Firehose Apache Kafka (on Amazon EC2) Amazon SQS (Standard) Amazon SQS (FIFO) AWS managed Yes Yes No Yes Yes Guaranteed ordering Yes No Yes No Yes Delivery (deduping) At least once At least once At least/At most/exactly once At least once Exactly once Data retention period 7 days N/A Configurable 14 days 14 days Availability 3 AZ 3 AZ Configurable 3 AZ 3 AZ Scale / throughput No limit / ~ shards No limit / automatic No limit / ~ nodes No limits / automatic 300 TPS / queue Parallel consumption Yes No Yes No No Stream MapReduce Yes N/A Yes N/A N/A Row/object size 1 MB Destination row/object size Configurable 256 KB 256 KB Cost Low Low Low (+admin) Low-medium Low-medium
  • 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT STORE File Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Hot Stream Mobile apps Web apps Devices Sensors IoT platforms AWS IoT Data centers AWS Direct Connect Migration Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS FILES STREAMS DataTransport&LoggingIoTApplications Import/expo rt NoSQL In-memory SQL Amazon S3 Amazon S3 Managed object storage service built to store and retrieve any amount of data File/Object Storage
  • 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use Amazon S3 as Your Persistent File Store • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • Decouple storage and compute • No need to run compute clusters for storage (unlike HDFS) • Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances • Multiple & heterogeneous analysis clusters and services can use the same data • Designed for 99.999999999% durability • No need to pay for data replication within a region • Secure: SSL, client/server-side encryption at rest • Low cost
  • 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What About HDFS & Data Tiering? • Use HDFS for hottest datasets (e.g. iterative read on the same datasets) • Use Amazon S3 Standard for frequently accessed data • Use Amazon S3 Standard – IA for less frequently accessed data • Use Amazon Glacier for archiving cold data • Use S3 Analytics to optimize tiering strategy S3 Analytics
  • 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT STORE Amazon DynamoDB Amazon RDS Amazon Aurora File Mobile apps Web apps Devices Sensors IoT platforms AWS IoT Data centers AWS Direct Connect Migration Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS FILES STREAMS LoggingIoTApplications Amazon S3 Amazon DAX Amazon ElastiCache Import/expo rt SQLNoSQLCache Amazon ElastiCache • Managed Memcached or Redis service Amazon DynamoDB Accelerator (DAX) • Managed in-memory cache for DynamoDB Amazon DynamoDB • Managed NoSQL database service Amazon RDS • Managed relational database service Cache & Database Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Hot Stream
  • 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Anti-Pattern Database Tier
  • 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practice: Use the Right Tool for the Job SearchIn-memory SQLNoSQL Database Tier GraphDB Amazon RDS/AuroraAmazon DynamoDBAmazon ElastiCache Amazon DynamoDB Acclerator SAP HANA Amazon ES Amazon CloudSearch
  • 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Materialized Views and Immutable Log Cache View Search View Immutable Log
  • 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Data Store Should I Use? Data structure → Fixed-schema, JSON, Key/Value, Access patterns → Store data in the format you will access it Data characteristics → Hot, warm, cold Cost → Right cost
  • 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Structure and Access Patterns Access Patterns What to use? Put/Get (key, value) In-memory, NoSQL Simple relationships → 1:N, M:N NoSQL Multi-table joins, transaction, SQL SQL Faceting, Search Search Graph traversal GraphDB Data Structure What to use? Fixed schema SQL, NoSQL Schema-free (JSON) NoSQL, Search Key/Value In-memory, NoSQL Graph GraphDB
  • 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. In-memory SQL Request rate High Low Cost/GB High Low Latency Low High Data volume Low High Amazon Glacier Structure NoSQL Hot data Warm data Cold data Low High
  • 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon ElastiCache Amazon DAX Amazon DynamoDB Amazon RDS (Aurora) Amazon ES Amazon S3 Amazon Glacier Average Latency µs-ms µs-ms ms ms, sec ms,sec ms,sec,min (~ size) hrs Typical Data Volume GB GB GB–TBs (no limit) GB–TB (64 TB max) GB–TB MB–PB (no limit) GB–PB (no limit) Typical Item Size B-KB KB (400 KB max) KB (400 KB max) KB (64 KB max) B-KB (2 GB max) KB-TB (5 TB max) GB (40 TB max) Request Rate High – very high High – very high Very high (no limit) High High Low – high (no limit) Very low Storage Cost GB/Month $$ $$ ¢¢ ¢¢ ¢¢ ¢ ¢4/10 Durability Low - moderate NA Very high Very high High Very high Very high Availability High 2 AZ High 3 AZ Very high 3 AZ Very high 3 AZ High 2 AZ Very high 3 AZ Very high 3 AZ Hot data Warm data Cold data Which Data Store Should I Use?
  • 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Cost-Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…” Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2048 1483 777,600,000
  • 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Request rate (writes/sec) Object size (bytes) Total size (GB/month) Objects per month 300 2,048 1,483 777,600,000 Amazon S3 or Amazon DynamoDB? https://calculator.s3.amazonaws.com/index.html
  • 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon S3 Wins! 300 23,730 777,600,00032,768 Amazon DynamoDB Wins! Request rate (writes/sec) Object size (bytes) Total size (GB/month) Objects per month 300 2,048 1,483 777,600,000 Amazon S3 or Amazon DynamoDB? Scenario 2 Scenario 1 Amazon S3 Standard Storage $34 Put/list requests $3,888 Total $3,922 Amazon DynamoDB Provisioned throughput $273 Indexed data storage $383 Total $656 Amazon S3 Standard Storage $545 Put/List Requests $3,888 Total $4,433 Amazon DynamoDB Provisioned Throughput $4,556 Indexed Data Storage $5,944 Total $10,500
  • 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. PROCESS / ANALYZE
  • 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. API-driven Services • Amazon Lex – Speech recognition • Amazon Polly – Text to speech • Amazon Rekognition – Image analysis Managed ML Platforms • Amazon ML • Apache Spark ML on EMR AWS Deep Learning AMI • Pre-installed with MXNet, TensorFlow, Caffe2 (and Caffe), Theano, Torch, Microsoft Cognitive Toolkit, and Keras; plus DL tools/drivers A m a z o n A I Predictive Analytics PROCESS/ANALYZE Predictive AmazonAI Lex PollyAML Rekognition AWS DL AMI Developers Data scientists Deep learning experts
  • 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. PROCESS / ANALYZE Amazon Redshift & Spectrum Amazon Athena BatchInteractive Amazon ES Interactive and Batch Analytics Amazon ES • Managed Service for Elasticsearch Amazon Redshift and Amazon Redshift Spectrum • Managed Data Warehouse • Spectrum enables querying Amazon S3 Amazon Athena • Serverless Interactive Query Service Amazon EMR • Managed Hadoop Framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase, etc. Presto Amazon EMR
  • 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Stream/Real-time Analytics Spark Streaming on Amazon EMR Amazon Kinesis Analytics • Managed Service for running SQL on Streaming data Amazon KCL • Amazon Kinesis Client Library AWS Lambda • Run code Serverless (without provisioning or managing servers) • Services such as S3 can publish events to Lambda • Lambda can pool event from a Kinesis PROCESS / ANALYZE KCL Apps AWS Lambda Amazon Kinesis Analytics Stream Streaming Amazon EMR
  • 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Analytics Should I Use? PROCESS / ANALYZE Batch Takes minutes to hours Example: Daily/weekly/monthly reports Amazon EMR (MapReduce, Hive, Pig, Spark) Interactive Takes seconds Example: Self-service dashboards Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark) Stream Takes milliseconds to seconds Example: Fraud alerts, 1 minute metrics Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, AWS Lambda, etc. Predictive Takes milliseconds (real-time) to minutes (batch) Example: Fraud detection, Forecasting demand, Speech recognition Amazon AI (Lex, Polly, ML, Amazon Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe) Streaming Amazon Kinesis Analytics KCL Apps AWS Lambda Stream Amazon EMR Fast Amazon ES Amazon Redshift & Spectrum Presto Amazon EMR Amazon Athena BatchInteractive FastSlow Predictive AmazonAI Lex PollyAML Rekognition AWS DL AMI
  • 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Stre am Proce ssing T e chnolog y Shou ld I Use ? Amazon EMR (Spark Streaming) KCL Application Amazon Kinesis Analytics AWS Lambda Managed Service Yes No (EC2 + Auto Scaling) Yes Yes Serverless No No Yes Yes Scale/Throughput No limits / ~ nodes No limits / ~ nodes No limits / automatic No limits / automatic Availability Single AZ Multi-AZ Multi-AZ Multi-AZ Programming Languages Java, Python, Scala Java, others via MultiLangDaemon ANSI SQL with extensions Node.js, Java, Python Sliding Window Functions Build-in App needs to implement Built-in No Reliability KCL and Spark checkpoints Managed by KCL Managed by Amazon Kinesis Analytics Managed by AWS Lambda
  • 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Analytics Tool Should I Use? Amazon Redshift Amazon Redshift Spectrum Amazon Athena Amazon EMR Presto Spark Hive Use case Optimized for data warehousing Query S3 data from Amazon Redshift Interactive Queries over S3 data Interactive Query General purpose Batch Scale/Throughput ~Nodes ~Nodes Automatic ~ Nodes Managed Service Yes Yes Yes, Serverless Yes Storage Local storage Amazon S3 Amazon S3 Amazon S3, HDFS Optimization Columnar storage, data compression, and zone maps AVRO, PARQUET TEXT, SEQ RCFILE, ORC, etc. AVRO, PARQUET TEXT, SEQ RCFILE, ORC, etc. Framework dependent Metadata Amazon Redshift Catalog Glue Catalog Glue Catalog Glue Catalog or Hive Meta-store Auth/Access controls IAM, Users, groups, and access controls IAM, Users, groups, and access controls IAM IAM, LDAP & Kerberos UDF support Yes (Scalar) Yes (Scalar) No Yes Slow
  • 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What About Extract Transform and Load? ETLSTORE PROCESS / ANALYZE Data Integration Partners Reduce the effort to move, cleanse, synchronize, manage, and automatize data-related processes. AWS Glue is a fully managed (serverless) ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. Data Catalog Job Authoring Job Execution AWS Glue
  • 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT STORE PROCESS/ANALYZE Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon ElastiCache Amazon RDS Amazon Aurora HotHotWarm SQLNoSQLCacheFileStream Mobile apps Web apps Devices Sensors IoT platforms AWS IoT Data centers AWS Direct Connect Migration Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS FILES STREAMS DataTransport&LoggingIoTApplications Slow Amazon S3 Amazon ES Amazon Redshift & Spectrum Presto Amazon EMR Fast Amazon Athena BatchInteractive Amazon DAX Import/expo rt Predictive AmazonAI Lex PollyAML Rekognition AWS DL AMI ETL Streaming Amazon Kinesis Analytics KCL Apps Fast Stream Amazon EMR AWS Lambda Fast CONSUME
  • 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. CONSUME
  • 45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • BI/AI Applications • Amazon EC2 or ECS Containers • AWS Greengrass • Data Science • Notebooks • DS Platforms • IDEs • Analysis and Visualization • Amazon QuickSight • Tableau • …. COLLECT STORE CONSUMEPROCESS/ANALYZEETL Amazon QuickSight Analysis&visualization Model Train/ Eval Models Deploy DataSceince AI Apps Amazon ECS Apps AWS Greengrass Predictive AmazonAI Lex PollyAML Rekognition AWS DL AMI Business users DevOps Data Scientists
  • 46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Putting It All Together
  • 47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Streaming Amazon Kinesis Analytics KCL Apps AWS Lambda COLLECT STORE CONSUMEPROCESS/ANALYZE Amazon ES Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon ElastiCache Amazon RDS Amazon Aurora HotHotWarm Fast Stream SQLNoSQLCacheFileStream Mobile apps Web apps Devices Sensors IoT platforms AWS IoT Data centers AWS Direct Connect Migration Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS FILES STREAMS Amazon QuickSight Analysis&visualizationDataSceince DataTransport&LoggingIoTApplications Amazon EMR Amazon Redshift & Spectrum Presto Amazon EMR FastSlow Amazon Athena BatchInteractivePredictive AmazonAI Amazon S3 Amazon DAX Import/expo rt Lex PollyAML Rekognition AWS DL AMI AI Apps Amazon ECS Apps Model Train/ Eval Models Deploy ETL AWS Greengrass
  • 49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Spark Streaming AWS Lambda KCL apps Amazon Redshift Amazon Redshift Hive Spark Presto ProcessingTechnology FastSlow Answers Hive Native apps KCL apps AWS Lambda Amazon Athena Amazon Kinesis Amazon DynamoDB/RDS Amazon S3Data Hot Cold Data Store
  • 50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Real-time Analytics Amazon EMR KCL app AWS Lambda Spark Streaming Amazon AI Real-time prediction Amazon ElastiCache (Redis) Amazon DynamoDB Amazon RDS Amazon ES App state or Materialized View KPI Process Store Amazon Kinesis Amazon Kinesis Analytics Amazon SNS NotificationsAlerts Amazon S3 Log Amazon KinesisFan out Downstream
  • 51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Interactive and Batch Analytics process store Batch Interactive Amazon EMR Hive Pig Spark Amazon AI Batch prediction Real-time prediction Amazon S3 Files Amazon Kinesis Firehose Amazon Kinesis Analytics Amazon Redshift Amazon ES Consume Amazon EMR Presto Spark Amazon Athena
  • 52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Real-time App state or Materialized View Interactive and Batch Data Lake Amazon S3 Amazon Redshift Amazon EMR Presto Hive Pig Spark Amazon ElastiCache Amazon DynamoDB Amazon RDS Amazon ES AWS Lambda Spark Streaming on Amazon EMR Applications Amazon Kinesis KCL Amazon AI Amazon DynamoDB Amazon RDS Change Data Capture or Export Transactions Stream Files Amazon Kinesis Analytics Amazon Athena Amazon Kinesis Firehose Amazon ES
  • 53. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What about Metadata? • Glue Catalog • Hive Metastore compliant • Crawlers - Detect new data, schema, partitions • Search - Metadata discovery • Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum compatible • Hive Metastore (Presto, Spark, Hive, Pig) • Can be hosted on Amazon RDS Data Catalog Metastore RDS Glue Catalog AWS Glue
  • 54. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Security & Governance • AWS Identity and Access Management (IAM) • Amazon Cognito • Amazon CloudWatch & AWS CloudTrail • AWS KMS • AWS Directory Service • Apache Ranger Security & Governance IAM Amazon CloudWatch AWS CloudTrail AWS AWSKMS AWS CloudHSM AWS Directory Service Amazon Cognito
  • 55. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data lake Reference Architecture Summary
  • 56. Summary Build decoupled systems • Data → Store → Process → Store → Analyze → Answers Use the right tool for the job • Data structure, latency, throughput, access patterns Leverage AWS managed and serverless services • Scalable/elastic, available, reliable, secure, no/low admin Use log-centric design patterns • Immutable logs, data lake, materialized views Be cost-conscious • Big data ≠ Big cost AI/ML enable your applications
  • 57. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU!