1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
Data Warehouses and Data Lakes
Rajeev Chakrabarti
rcchakr@amazon.com
Principal Enterprise Architect
Andre Hass
hasandre@amazon.com
Technical Account Manager
2. Ever Increasing Big Data
Volume
Velocity
Variety
3. Big Data Evolution
• Batch processing
• Stream processing
• Artificial intelligence
4. Cloud Services Evolution
• Virtual machines
• Managed services
• Serverless
5. Plethora of Tools
Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, Amazon Kinesis, AWS Lambda, Amazon ML, Amazon SQS, Amazon ElastiCache, Amazon DynamoDB Streams, Amazon ES, Amazon Kinesis Analytics, Amazon QuickSight, AWS Glue
6. Big Data Challenges
Why?
How?
What tools should I use?
Is there a reference architecture?
7. Architectural Principles
• Build decoupled systems
– Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job
– Data structure, latency, throughput, access patterns
• Leverage managed and serverless services
– Scalable/elastic, available, reliable, secure, no/low admin
• Use log-centric design patterns
– Immutable logs (data lake), materialized views
• Be cost-conscious
– Big data ≠ big cost
• AI/ML-enable your applications
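The log-centric principle above can be illustrated with a minimal in-memory sketch: an append-only log that is never updated in place, and a materialized view rebuilt by replaying it. The `Event` schema, the `EventLog` class, and `build_view` are hypothetical names chosen for illustration; in the architecture this deck describes, the log would live in a stream or in Amazon S3 rather than in a Python list.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    """One immutable record in the log (hypothetical schema)."""
    user: str
    amount: int


class EventLog:
    """Append-only log: records are added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> None:
        self._events.append(event)

    def read(self, start: int = 0) -> Tuple[Event, ...]:
        # Return a read-only snapshot starting at the given offset.
        return tuple(self._events[start:])


def build_view(log: EventLog) -> Dict[str, int]:
    """Materialized view: total amount per user, derived by replaying the log."""
    totals: Dict[str, int] = defaultdict(int)
    for event in log.read():
        totals[event.user] += event.amount
    return dict(totals)


log = EventLog()
for e in [Event("ann", 5), Event("bob", 3), Event("ann", 2)]:
    log.append(e)

view = build_view(log)
print(view)
```

Because the view is derived entirely from the log, it can be dropped and rebuilt, or a second view with a different shape can be added later, without touching the stored data.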
8. Simplify Big Data Processing
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
Time to answer (latency), throughput, cost
9. What Is the Temperature of Your Data?
10. Data Characteristics: Hot, Warm, Cold
Characteristic | Hot data | Warm data | Cold data
Volume | MB–GB | GB–TB | PB–EB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | ms, sec | min, hrs
Durability | Low–high | High | Very high
Request rate | Very high | High | Low
Cost/byte | $$–$ | $–¢¢ | ¢
11. COLLECT
12. COLLECT: Type of Data
• RECORDS: transactions, data structures, database records. Source: applications (mobile apps, web apps, data centers via AWS Direct Connect)
• FILES: log files, media files. Source: data transport & logging (migration with Snowball import/export; logging with Amazon CloudWatch and AWS CloudTrail)
• STREAMS: events, data streams. Source: IoT (devices, sensors, IoT platforms via AWS IoT)
13. STORE
14. [Diagram: COLLECT → STORE. The collected data types (events from STREAMS, files from FILES, transactions from RECORDS) map to the store categories: NoSQL, in-memory, SQL, file/object store, and stream storage.]
15. [Diagram: COLLECT → STORE, repeated. Sources: applications (RECORDS), data transport & logging (FILES), IoT (STREAMS). Store categories: NoSQL, in-memory, SQL, file/object store, stream storage.]
16. Stream Storage
Apache Kafka
• High-throughput distributed streaming platform
Amazon Kinesis Data Streams
• Managed stream storage
Amazon Kinesis Data Firehose
• Managed data delivery
[Diagram: COLLECT → STORE, with Apache Kafka, Amazon Kinesis Streams, and Amazon Kinesis Firehose filling the stream storage category.]
17. Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce
[Diagram: four producers each emit records 1–4 in order into a DynamoDB stream, Amazon Kinesis stream, or Kafka topic with two shards/partitions (shard 1 / partition 1, shard 2 / partition 2). Consumer 1 computes count of red = 4 and count of violet = 4; Consumer 2 computes count of blue = 4 and count of green = 4.]
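The "streaming MapReduce" idea on this slide can be sketched in plain Python: records are routed to a shard by partition key, each shard keeps its records in arrival order, and one consumer per shard counts records per key. This is only an in-memory illustration under stated assumptions. The CRC32-based `shard_for` function is a stand-in partitioner, not the one Kinesis or Kafka uses, and `produce` and `consume` are hypothetical helper names.

```python
import zlib
from collections import Counter

NUM_SHARDS = 2  # matches the two shards/partitions in the diagram


def shard_for(key: str) -> int:
    """Map a partition key to a shard with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def produce(records):
    """Append each (key, sequence) record to its shard, preserving arrival order."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for key, seq in records:
        shards[shard_for(key)].append((key, seq))
    return shards


def consume(shard):
    """One consumer per shard: count records per key (the 'reduce' step)."""
    return Counter(key for key, _ in shard)


colors = ["red", "violet", "blue", "green"]
# Each color emits records 1..4; records from different colors interleave.
records = [(color, seq) for seq in range(1, 5) for color in colors]

shards = produce(records)
counts = [consume(shard) for shard in shards]  # could run in parallel
merged = sum(counts, Counter())
print(merged)
```

Because every record with the same key lands in the same shard, each consumer can count its keys independently, and per-key ordering survives even though the shards are read in parallel.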
18. What About Amazon SQS?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (standard)
• FIFO queue preserves client ordering
• No streaming MapReduce
• No parallel consumption
• Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
[Diagram: a standard queue can deliver messages 1–4 out of order or more than once; a FIFO queue delivers them in order. A publisher sends to an Amazon SNS topic whose subscribers (an AWS Lambda function and an Amazon SQS queue) feed the consumers.]
19. Which Stream/Message Storage Should I Use?
Columns run from hot (left) to warm (right).
Criterion | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka (on Amazon EC2) | Amazon SQS (Standard) | Amazon SQS (FIFO)
AWS managed | Yes | Yes | No | Yes | Yes
Guaranteed ordering | Yes | No | Yes | No | Yes
Delivery (deduping) | At least once | At least once | At least/at most/exactly once | At least once | Exactly once
Data retention period | 7 days | N/A | Configurable | 14 days | 14 days
Availability | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ
Scale/throughput | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic | 300 TPS / queue
Parallel consumption | Yes | No | Yes | No | No
Stream MapReduce | Yes | N/A | Yes | N/A | N/A
Row/object size | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
Cost | Low | Low | Low (+admin) | Low-medium | Low-medium
20. File/Object Storage
Amazon S3
• Managed object storage service built to store and retrieve any amount of data
[Diagram: COLLECT → STORE, with Amazon S3 added as the file store alongside the hot stream stores (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose).]
21. Use Amazon S3 as Your Persistent File Store
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
• Multiple & heterogeneous analysis clusters and services can use the same data
• Designed for 99.999999999% durability
• No need to pay for data replication within a region
• Secure: SSL, client/server-side encryption at rest
• Low cost
22. Cache & Database
Amazon ElastiCache
• Managed Memcached or Redis service
Amazon DynamoDB Accelerator (DAX)
• Managed in-memory cache for DynamoDB
Amazon DynamoDB
• Managed NoSQL database service
Amazon RDS
• Managed relational database service
[Diagram: COLLECT → STORE, with cache (Amazon ElastiCache, Amazon DAX), NoSQL (Amazon DynamoDB), and SQL (Amazon RDS, Amazon Aurora) stores added alongside the file store (Amazon S3) and the hot stream stores (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose).]
23. Anti-Pattern
Database Tier
24. Best Practice: Use the Right Tool for the Job
Database Tier
• Search: Amazon Elasticsearch Service
• In-memory: Amazon ElastiCache, Amazon DynamoDB Accelerator, SAP HANA
• SQL: Amazon RDS/Aurora
• NoSQL: Amazon DynamoDB
• GraphDB
25. Which Data Store Should I Use?
• Data structure → Fixed schema, JSON, key/value
• Access patterns → Store data in the format you will access it
• Data characteristics → Hot, warm, cold
• Cost → Right cost
26. Data Structure and Access Patterns
Access pattern | What to use?
Put/Get (key, value) | In-memory, NoSQL
Simple relationships → 1:N, M:N | NoSQL
Multi-table joins, transactions, SQL | SQL
Faceting, search | Search
Graph traversal | GraphDB
Data structure | What to use?
Fixed schema | SQL, NoSQL
Schema-free (JSON) | NoSQL, Search
Key/value | In-memory, NoSQL
Graph | GraphDB
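The two tables above can be read as lookup tables. The sketch below encodes them as dictionaries and intersects the two answers to shortlist store categories. The dictionary keys and the `candidate_stores` helper are hypothetical, and combining the tables by intersection is an assumption made for illustration, not a rule stated in the deck.

```python
# Store categories by access pattern, transcribed from the first table.
ACCESS_PATTERN_TO_STORE = {
    "put/get": ["In-memory", "NoSQL"],
    "simple relationships": ["NoSQL"],
    "multi-table joins": ["SQL"],
    "faceting/search": ["Search"],
    "graph traversal": ["GraphDB"],
}

# Store categories by data structure, transcribed from the second table.
DATA_STRUCTURE_TO_STORE = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free": ["NoSQL", "Search"],
    "key/value": ["In-memory", "NoSQL"],
    "graph": ["GraphDB"],
}


def candidate_stores(access_pattern: str, data_structure: str) -> list:
    """Return categories that fit both the access pattern and the data structure.

    Intersecting the two tables is an illustrative assumption.
    """
    by_access = ACCESS_PATTERN_TO_STORE[access_pattern]
    by_structure = set(DATA_STRUCTURE_TO_STORE[data_structure])
    return [store for store in by_access if store in by_structure]


print(candidate_stores("put/get", "key/value"))
print(candidate_stores("multi-table joins", "fixed schema"))
print(candidate_stores("graph traversal", "fixed schema"))
```

An empty result, as in the last call, signals a mismatch between how the data is shaped and how it will be queried, which is the cue to revisit one of the two choices.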
27. [Diagram: data store categories (in-memory, NoSQL, SQL, Search, Graph, S3, Amazon Glacier) placed along a hot data → warm data → cold data axis. Moving from hot to cold: request rate high → low, cost/GB high → low, latency low → high, data volume low → high. A second axis shows structure from low to high.]
28. Which Data Store Should I Use?
Stores run from hot data (left) through warm data to cold data (right).
Criterion | Amazon ElastiCache | Amazon DAX | Amazon DynamoDB | Amazon RDS (Aurora) | Amazon ES | Amazon S3 | Amazon Glacier
Average latency | µs–ms | µs–ms | ms | ms, sec | ms, sec | ms, sec, min (~ size) | hrs
Typical data volume | GB | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | MB–PB (no limit) | GB–PB (no limit)
Typical item size | B–KB | KB (400 KB max) | KB (400 KB max) | KB (64 KB max) | B–KB (2 GB max) | KB–TB (5 TB max) | GB (40 TB max)
Request rate | High – very high | High – very high | Very high (no limit) | High | High | Low – high (no limit) | Very low
Storage cost GB/month | $$ | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢4/10
Durability | Low – moderate | NA | Very high | Very high | High | Very high | Very high
Availability | High, 2 AZ | High, 3 AZ | Very high, 3 AZ | Very high, 3 AZ | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ
29. Example: Should I use Amazon S3 or Amazon DynamoDB?
• “I’m currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”
Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
300 | 2,048 | 1,483 | 777,600,000
30. Amazon S3 or Amazon DynamoDB?
Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
300 | 2,048 | 1,483 | 777,600,000
https://calculator.s3.amazonaws.com/index.html
31. Amazon S3 or Amazon DynamoDB?
Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
300 | 2,048 | 1,483 | 777,600,000
300 | 32,768 | 23,730 | 777,600,000
Scenario 1 (2,048-byte objects): Amazon DynamoDB wins!
Amazon S3 Standard | Storage $34 | Put/list requests $3,888 | Total $3,922
Amazon DynamoDB | Provisioned throughput $273 | Indexed data storage $383 | Total $656
Scenario 2 (32,768-byte objects): Amazon S3 wins!
Amazon S3 Standard | Storage $545 | Put/list requests $3,888 | Total $4,433
Amazon DynamoDB | Provisioned throughput $4,556 | Indexed data storage $5,944 | Total $10,500
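The workload figures on these slides follow from simple arithmetic, sketched below for both object sizes. The 30-day month and binary gigabytes are assumptions that reproduce the slide's totals. The two S3 prices are also assumptions, picked because they match the slide's storage and request figures; they are not stated in the deck, and actual prices vary by region and over time. The DynamoDB side is left out because its inputs are not given here.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30        # assumption: 30-day month
BYTES_PER_GB = 2 ** 30     # assumption: binary gigabytes

# Assumed prices that reproduce the slide's S3 numbers; not taken from the deck.
S3_STORAGE_USD_PER_GB_MONTH = 0.023
S3_PUT_USD_PER_1000_REQUESTS = 0.005


def monthly_workload(writes_per_sec: int, object_bytes: int):
    """Return (objects per month, total size in GB per month)."""
    objects = writes_per_sec * SECONDS_PER_DAY * DAYS_PER_MONTH
    total_gb = objects * object_bytes / BYTES_PER_GB
    return objects, total_gb


def s3_monthly_cost(objects: int, total_gb: float):
    """Return (storage cost, PUT request cost) in USD for one month."""
    storage = total_gb * S3_STORAGE_USD_PER_GB_MONTH
    requests = objects / 1000 * S3_PUT_USD_PER_1000_REQUESTS
    return storage, requests


for object_bytes in (2_048, 32_768):
    objects, total_gb = monthly_workload(300, object_bytes)
    storage, requests = s3_monthly_cost(objects, total_gb)
    print(f"{object_bytes} B objects: {objects:,} objects, "
          f"{total_gb:,.0f} GB, storage ${storage:,.0f}, requests ${requests:,.0f}")
```

The request charge depends only on the number of objects, so it is identical in both scenarios. Only the storage term grows with object size, which is why small objects make the request charge dominate the S3 bill.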
32. PROCESS / ANALYZE
33. Predictive Analytics
Amazon AI
• API-driven Services
– Amazon Lex – Speech recognition
– Amazon Polly – Text to speech
– Amazon Rekognition – Image analysis
• Managed ML Platforms
– Amazon ML
– Apache Spark ML on EMR
• AWS Deep Learning AMI
– Pre-installed with MXNet, TensorFlow, Caffe2 (and Caffe), Theano, Torch, Microsoft Cognitive Toolkit, and Keras; plus DL tools/drivers
[Diagram: PROCESS/ANALYZE, predictive: Amazon AI (Lex, Polly, Amazon ML, Rekognition) and AWS DL AMI. Audiences: developers, data scientists, deep learning experts.]
34. Interactive and Batch Analytics
• Amazon ES
– Managed Service for Elasticsearch
• Amazon Redshift and Amazon Redshift Spectrum
– Managed Data Warehouse
– Spectrum enables querying Amazon S3
• Amazon Athena
– Serverless Interactive Query Service
• Amazon EMR
– Managed Hadoop Framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase, etc.
[Diagram: PROCESS / ANALYZE, interactive and batch: Amazon ES, Amazon Redshift & Spectrum, Amazon Athena, Presto on Amazon EMR.]
35. Stream/Real-time Analytics
Spark Streaming on Amazon EMR
Amazon Kinesis Analytics
• Managed service for running SQL on streaming data
Amazon KCL
• Amazon Kinesis Client Library
AWS Lambda
• Run code serverless (without provisioning or managing servers)
• Services such as S3 can publish events to Lambda
• Lambda can poll events from a Kinesis stream
[Diagram: PROCESS / ANALYZE, stream: KCL apps, AWS Lambda, Amazon Kinesis Analytics, Spark Streaming on Amazon EMR.]
36. Which Analytics Should I Use?
Batch
• Takes minutes to hours
• Example: Daily/weekly/monthly reports
• Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
• Takes seconds
• Example: Self-service dashboards
• Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Stream
• Takes milliseconds to seconds
• Example: Fraud alerts, 1-minute metrics
• Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, AWS Lambda, etc.
Predictive
• Takes milliseconds (real-time) to minutes (batch)
• Example: Fraud detection, forecasting demand, speech recognition
• Amazon AI (Lex, Polly, ML, Amazon Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)
[Diagram: PROCESS / ANALYZE services arranged from fast to slow: streaming (Amazon Kinesis Analytics, KCL apps, AWS Lambda, Amazon EMR), interactive and batch (Amazon ES, Amazon Redshift & Spectrum, Amazon Athena, Presto on Amazon EMR), predictive (Amazon AI: Lex, Polly, Amazon ML, Rekognition; AWS DL AMI).]
37. Which Stream Processing Technology Should I Use?
Criterion | Amazon EMR (Spark Streaming) | KCL Application | Amazon Kinesis Analytics | AWS Lambda
Managed service | Yes | No (EC2 + Auto Scaling) | Yes | Yes
Serverless | No | No | Yes | Yes
Scale/throughput | No limits / ~ nodes | No limits / ~ nodes | No limits / automatic | No limits / automatic
Availability | Single AZ | Multi-AZ | Multi-AZ | Multi-AZ
Programming languages | Java, Python, Scala | Java, others via MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python
Sliding window functions | Built-in | App needs to implement | Built-in | No
Reliability | KCL and Spark checkpoints | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda
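Where the table says "App needs to implement" for sliding window functions, the logic the application has to supply looks roughly like the sketch below: a count of events seen in the most recent time window. `SlidingWindowCounter` is a hypothetical class, not part of the KCL or any AWS SDK, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events whose timestamps fall in the last `window_seconds`."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> int:
        """Record one event and return the count inside the current window.

        Assumes timestamps are non-decreasing.
        """
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Evict events at or before the start of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


counter = SlidingWindowCounter(window_seconds=60)
counts = [counter.add(t) for t in (0, 10, 30, 61, 95, 200)]
print(counts)  # [1, 2, 3, 3, 2, 1]
```

A real consumer would also need to checkpoint this state and handle late or out-of-order records, which is the work the built-in options take on.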
38. Which Analytics Tool Should I Use?
Criterion | Amazon Redshift | Amazon Redshift Spectrum | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
Use case | Optimized for data warehousing | Query S3 data from Amazon Redshift | Interactive queries over S3 data | Presto: interactive query; Spark: general purpose; Hive: batch
Scale/throughput | ~ Nodes | ~ Nodes | Automatic | ~ Nodes
Managed service | Yes | Yes | Yes, serverless | Yes
Storage | Local storage | Amazon S3 | Amazon S3 | Amazon S3, HDFS
Optimization | Columnar storage, data compression, and zone maps | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | AVRO, PARQUET, TEXT, SEQ, RCFILE, ORC, etc. | Framework dependent
Metadata | Amazon Redshift Catalog | Glue Catalog | Glue Catalog | Glue Catalog or Hive Metastore
Auth/access controls | IAM, users, groups, and access controls | IAM, users, groups, and access controls | IAM | IAM, LDAP & Kerberos
UDF support | Yes (scalar) | Yes (scalar) | No | Yes
39. What About Extract, Transform, and Load?
Data Integration Partners
Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.
AWS Glue
AWS Glue is a fully managed (serverless) ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores.
Data Catalog, Job Authoring, Job Execution
[Diagram: ETL sits between STORE and PROCESS / ANALYZE.]
40. [Diagram: the pipeline so far, COLLECT → STORE → ETL → PROCESS/ANALYZE → CONSUME. COLLECT: applications (mobile apps, web apps, data centers via AWS Direct Connect) producing RECORDS; data transport & logging (Snowball import/export, Amazon CloudWatch, AWS CloudTrail) producing FILES; IoT (devices, sensors, IoT platforms via AWS IoT) producing STREAMS. STORE: stream (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose), file (Amazon S3), cache (Amazon ElastiCache, Amazon DAX), NoSQL (Amazon DynamoDB), SQL (Amazon RDS, Amazon Aurora). PROCESS/ANALYZE: streaming (Amazon Kinesis Analytics, KCL apps, AWS Lambda, Amazon EMR), interactive and batch (Amazon ES, Amazon Redshift & Spectrum, Amazon Athena, Presto on Amazon EMR), predictive (Amazon AI: Lex, Polly, Amazon ML, Rekognition; AWS DL AMI).]
41. CONSUME
42.
• BI/AI Applications
– Amazon EC2 or ECS Containers
– AWS Greengrass
• Data Science
– Notebooks
– DS Platforms
– IDEs
• Analysis and Visualization
– Amazon QuickSight
– Tableau
– ….
[Diagram: COLLECT → STORE → ETL → PROCESS/ANALYZE → CONSUME, with the CONSUME stage filled in: analysis & visualization (Amazon QuickSight), data science (train/evaluate and deploy models), and AI apps (Amazon ECS, AWS Greengrass), fed by the predictive services (Amazon AI: Lex, Polly, Amazon ML, Rekognition; AWS DL AMI). Consumers: business users, data scientists, DevOps.]
43. Putting It All Together
44. [Diagram: the complete COLLECT → STORE → ETL → PROCESS/ANALYZE → CONSUME reference architecture, combining the sources, stores, ETL, and analytics services from the preceding slides with the CONSUME stage: analysis & visualization (Amazon QuickSight), data science (train/evaluate and deploy models), and AI apps (Amazon ECS, AWS Greengrass).]
45. Design Patterns
46. [Diagram: processing technology (real-time, interactive, batch; answers from fast to slow) plotted against data store temperature (hot to cold: Amazon Kinesis, Amazon DynamoDB/RDS, Amazon S3). Technologies shown: Spark Streaming, AWS Lambda, KCL apps, native apps, Amazon Redshift, Amazon Athena, Presto, Spark, Hive.]
47. Real-time Analytics
[Diagram: Amazon Kinesis feeds the process stage (KCL app, AWS Lambda, Spark Streaming on Amazon EMR, Amazon Kinesis Analytics), with Amazon AI for real-time prediction. Results go to stores holding app state or materialized views and KPIs (Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, Amazon ES), to Amazon SNS for notifications and alerts, to Amazon S3 as a log, and to Amazon Kinesis for fan-out to downstream consumers.]
48. Interactive and Batch Analytics
[Diagram: store → process → consume. Store: files in Amazon S3, loaded through Amazon Kinesis Firehose and Amazon Kinesis Analytics. Batch: Amazon EMR (Hive, Pig, Spark). Interactive: Amazon Redshift, Amazon ES, Amazon Athena, Amazon EMR (Presto, Spark). Amazon AI provides batch prediction and real-time prediction.]
49. Data Lake
[Diagram: the real-time pattern and the interactive and batch pattern combined around an Amazon S3 data lake. Inputs: transactions from applications (Amazon DynamoDB, Amazon RDS, through change data capture or export), streams (Amazon Kinesis), and files (Amazon Kinesis Firehose). Real-time: KCL, AWS Lambda, Spark Streaming on Amazon EMR, Amazon Kinesis Analytics, and Amazon AI, with app state or materialized views in Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, and Amazon ES. Interactive and batch: Amazon Redshift, Amazon Athena, Amazon ES, and Amazon EMR (Presto, Hive, Pig, Spark).]
50. What about Metadata?
• Glue Catalog
– Hive Metastore compliant
– Crawlers: detect new data, schema, partitions
– Search: metadata discovery
– Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum compatible
• Hive Metastore (Presto, Spark, Hive, Pig)
– Can be hosted on Amazon RDS
[Diagram: data catalog options: Glue Catalog (AWS Glue) and Hive Metastore (on RDS).]
51. Security & Governance
• AWS Identity and Access Management (IAM)
• Amazon Cognito
• Amazon CloudWatch & AWS CloudTrail
• AWS KMS
• AWS Directory Service
• Apache Ranger
[Diagram: security & governance services: IAM, Amazon CloudWatch, AWS CloudTrail, AWS KMS, AWS CloudHSM, AWS Directory Service, Amazon Cognito.]
52. Data Lake Reference Architecture
Summary
53. Summary
• Build decoupled systems
– Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job
– Data structure, latency, throughput, access patterns
• Leverage AWS managed and serverless services
– Scalable/elastic, available, reliable, secure, no/low admin
• Use log-centric design patterns
– Immutable logs, data lake, materialized views
• Be cost-conscious
– Big data ≠ big cost
• AI/ML-enable your applications
54. Pop-up Loft
aws.amazon.com/activate
Everything and Anything Startups Need to Get Started on AWS