SlideShare a Scribd company logo
1 of 55
Download to read offline
Big	Data	Architectural	Pa0erns	and		
Best	Prac4ces	on	AWS	
Ran Tessler, AWS Solu0ons Architecture Manager
What to Expect from the Session
Big data challenges
How to simplify big data processing
What technologies should you use?
•  Why?
•  How?
Reference architecture
Design patterns
Ever Increasing Big Data
Volume
Velocity
Variety
Big Data Evolution
Batch	
	
Report	
	
Real-time		
	
Alerts	
Prediction	
	
Forecast
Plethora of Tools
Amazon
Glacier
S3
DynamoDB RDS
EMR
Amazon
Redshift
Amazon Kinesis
CloudSearch
Kinesis-
enabled
app
Lambda Amazon ML
SQS
ElastiCache
Data Pipeline
DynamoDB
Streams
Is there a reference architecture?
What tools should I use?
How?
Why?
Architectural Principles
•  Decoupled “data bus”
•  Data → Store → Process → Answers
•  Use the right tool for the job
•  Data structure, latency, throughput, access patterns
•  Use Lambda architecture ideas
•  Immutable (append-only) log, batch/speed/serving layer
•  Leverage AWS managed services
•  No/low admin
•  Big data ≠ big cost
Simplify Big Data Processing
ingest /
collect
store
process /
analyze
consume /
visualize
Time to Answer (Latency)
Throughput
Cost
Collect /
Ingest
Types of Data
•  Transactional
•  Database reads & writes (OLTP)
•  Cache
•  Search
•  Logs
•  Streams
•  File
•  Log files (/var/log)
•  Log collectors & frameworks
•  Stream
•  Log records
•  Sensors & IoT data
Database
File
Storage
Stream
Storage
A
iOS
 Android
Web Apps
Logstash
Logging
IoT
Applications
Transactional Data
File Data
Stream Data
Mobile
Apps
Search Data
Search
Collect Store
Logging
IoT
Store
Stream
Storage
A
iOS
 Android
Web Apps
Logstash
Amazon
RDS
Amazon
DynamoDB
Amazon
ES
Amazon

S3
Apache
Kafka
Amazon

Glacier
Amazon

Kinesis
Amazon

DynamoDB
Amazon
ElastiCache
SearchSQLNoSQLCache
StreamStorage
FileStorage
Transactional Data
File Data
Stream Data
Mobile
Apps
Search Data
Database
File
Storage
Search
Collect Store
Logging
IoT
Applications
ü 
Stream Storage Options
•  AWS managed services
•  Amazon Kinesis → streams
•  DynamoDB Streams → table + streams
•  Amazon SQS → queue
•  Amazon SNS → pub/sub
•  Unmanaged
•  Apache Kafka → stream
Why Stream Storage?
•  Decouple producers & consumers
•  Persistent buffer
•  Collect multiple streams
•  Preserve client ordering
•  Streaming MapReduce
•  Parallel consumption
4 4 3 3 2 2 1 1
4 3 2 1
4 3 2 1
4 3 2 1
4 3 2 1
4 4 3 3 2 2 1 1
Shard 1 / Partition 1
Shard 2 / Partition 2
Consumer 1
Count of
Red = 4
Count of
Violet = 4
Consumer 2
Count of
Blue = 4
Count of
Green = 4
DynamoDB Stream Kinesis Stream Kafka Topic
What About Queues & Pub/Sub ?
•  Decouple producers &
consumers/subscribers
•  Persistent buffer
•  Collect multiple streams
•  No client ordering
•  No parallel consumption for
Amazon SQS
•  Amazon SNS can route
to multiple queues or ʎ
functions
•  No streaming MapReduce
Consumers
Producers
Producers
Amazon SNS
Amazon SQS
queue
topic
function
ʎ
AWS Lambda
Amazon SQS
queue
Subscriber
Which stream storage should I use?
Amazon
Kinesis
DynamoDB
Streams
Amazon SQS
Amazon SNS
Kafka
Managed Yes Yes Yes No
Ordering Yes Yes No Yes
Delivery at-least-once exactly-once at-least-once at-least-once
Lifetime 7 days 24 hours 14 days Configurable
Replication 3 AZ 3 AZ 3 AZ Configurable
Throughput No Limit No Limit No Limit ~ Nodes
Parallel Clients Yes Yes No (SQS) Yes
MapReduce Yes Yes No Yes
Record size 1MB 400KB 256KB Configurable
Cost Low Higher(table cost) Low-Medium Low (+admin)
File
Storage
A
iOS
 Android
Web Apps
Logstash
Amazon
RDS
Amazon
DynamoDB
Amazon
ES
Amazon

S3
Apache
Kafka
Amazon

Glacier
Amazon

Kinesis
Amazon

DynamoDB
Amazon
ElastiCache
SearchSQLNoSQLCache
StreamStorage
FileStorage
Transactional Data
File Data
Stream Data
Mobile
Apps
Search Data
Database
Search
Collect Store
Logging
IoT
Applications
ü 
Why Is Amazon S3 Good for Big Data?
•  Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
•  No need to run compute clusters for storage (unlike HDFS)
•  Can run transient Hadoop clusters & Amazon EC2 Spot instances
•  Multiple distinct (Spark, Hive, Presto) clusters can use the same data
•  Unlimited number of objects
•  Very high bandwidth – no aggregate throughput limit
•  Highly available – can tolerate AZ failure
•  Designed for 99.999999999% durability
•  Tired-storage (Standard, IA, Amazon Glacier) via life-cycle policy
•  Secure – SSL, client/server-side encryption at rest
•  Low cost
What about HDFS & Amazon Glacier?
•  Use HDFS for very frequently
accessed (hot) data
•  Use Amazon S3 Standard for
frequently accessed data
•  Use Amazon S3 Standard –
IA for infrequently accessed
data
•  Use Amazon Glacier for
archiving cold data
Database +
Search
Tier
A
iOS
 Android
Web Apps
Logstash
Amazon
RDS
Amazon
DynamoDB
Amazon
ES
Amazon

S3
Apache
Kafka
Amazon

Glacier
Amazon

Kinesis
Amazon

DynamoDB
Amazon
ElastiCache
SearchSQLNoSQLCache
StreamStorage
FileStorage
Transactional Data
File Data
Stream Data
Mobile
Apps
Search Data
Collect Store
ü 
Database + Search Tier Anti-pattern
RDBMS
Database + Search Tier
Applications
Best Practice — Use the Right Tool for the Job
Data Tier
Search
Amazon
Elasticsearch
Service
Amazon
CloudSearch
Cache
Redis
Memcached
SQL
Amazon Aurora
MySQL
PostgreSQL
Oracle
SQL Server
MariaDB
NoSQL
Cassandra
Amazon
DynamoDB
HBase
MongoDB
Applications
Database + Search Tier
Materialized Views
What Data Store Should I Use?
•  Data structure → Fixed schema, JSON, key-value
•  Access patterns → Store data in the format you will
access it
•  Data / access characteristics → Hot, warm, cold
•  Cost → Right cost
Data Structure and Access Patterns
Access Patterns What to use?
Put/Get (Key, Value) Cache, NoSQL
Simple relationships → 1:N, M:N NoSQL
Cross table joins, transaction, SQL SQL
Faceting, Search Search
Data Structure What to use?
Fixed schema NoSQL, SQL
Schema-free (JSON) NoSQL, Search
(Key, Value) NoSQL, Cache
What Is the Temperature of Your Data / Access ?
Data / Access Characteristics: Hot, Warm, Cold
Hot Warm Cold
Volume MB–GB GB–TB PB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low–High High Very High
Request rate Very High High Low
Cost/GB $$-$ $-¢¢ ¢
Hot Data Warm Data Cold Data
Cache
SQL
Request Rate
High Low
Cost/GB
High Low
Latency
Low High
Data Volume
Low High
GlacierStructure
NoSQL
Hot Data Warm Data Cold Data
Low
High
S3
Search
HDFS
What Data Store Should I Use?
Amazon
ElastiCache
Amazon
DynamoDB
Amazon
Aurora
Amazon
Elasticsearch
Amazon
EMR (HDFS)
Amazon S3 Amazon Glacier
Average
latency
ms ms ms, sec ms,sec sec,min,hrs ms,sec,min
(~ size)
hrs
Data volume GB GB–TBs
(no limit)
GB–TB
(64 TB
Max)
GB–TB GB–PB
(~nodes)
MB–PB
(no limit)
GB–PB
(no limit)
Item size B-KB KB
(400 KB
max)
KB
(64 KB)
KB
(1 MB max)
MB-GB KB-GB
(5 TB max)
GB
(40 TB max)
Request rate High -
Very High
Very High
(no limit)
High High Low – Very
High
Low –
Very High
(no limit)
Very Low
Storage cost
GB/month
$$ ¢¢ ¢¢ ¢¢ ¢ ¢ ¢/10
Durability Low -
Moderate
Very High Very High High High Very High Very High
Hot Data Warm Data Cold Data
Hot Data Warm Data Cold Data
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase
my team’s use of Amazon S3. Hoping you could answer
some questions. The current iteration of the design calls for
many small files, perhaps up to a billion during peak. The
total size would be on the order of 1.5 TB per month…”
Cost Conscious Design
https://calculator.s3.amazonaws.com/index.html
Example: Should I use Amazon S3 or Amazon DynamoDB?
Cost Conscious Design
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per month
300 2048 1483 777,600,000
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per
month
300 2,048 1,483 777,600,000
Amazon S3 or
Amazon
DynamoDB?
Request rate
(Writes/sec)
Object size
(Bytes)
Total size
(GB/month)
Objects per
month
Scenario 1300 2,048 1,483 777,600,000
Scenario 2300 32,768 23,730 777,600,000
Amazon S3
Amazon DynamoDB
use
use
Process /
Analyze
Process / Analyze
Analysis of data is a process of inspecting, cleaning,
transforming, and modeling data with the goal of discovering
useful information, suggesting conclusions, and supporting
decision-making.
Examples
•  Interactive dashboards → Interactive analytics
•  Daily/weekly/monthly reports → Batch analytics
•  Billing/fraud alerts, 1 minute metrics → Real-time analytics
•  Sentiment analysis, prediction models → Machine learning
Interactive Analytics
Takes large amount of (warm/cold) data
Takes seconds to get answers back
Example: Self-service dashboards
Batch Analytics
Takes large amount of (warm/cold) data
Takes minutes or hours to get answers back
Example: Generating daily, weekly, or monthly reports
Real-Time Analytics
Take small amount of hot data and ask questions
Takes short amount of time (milliseconds or seconds) to
get your answer back
•  Real-time (event)
•  Real-time response to events in data streams
•  Example: Billing/Fraud Alerts
•  Near real-time (micro-batch)
•  Near real-time operations on small batches of events in data
streams
•  Example: 1 Minute Metrics
Predictions via Machine Learning
ML gives computers the ability to learn without being explicitly
programmed
Machine Learning Algorithms:
-  Supervised Learning ← “teach” program
-  Classification ← Is this transaction fraud? (Yes/No)
-  Regression ← Customer Life-time value?
-  Unsupervised Learning ← let it learn by itself
-  Clustering ← Market Segmentation
Analysis Tools and Frameworks
Machine Learning
•  Mahout, Spark ML, Amazon ML
Interactive Analytics
•  Amazon Redshift, Presto, Impala, Spark
Batch Processing
•  MapReduce, Hive, Pig, Spark
Stream Processing
•  Micro-batch: Spark Streaming, KCL, Hive, Pig
•  Real-time: Storm, AWS Lambda, KCL
StreamProcessing
Batch
Interactive
ML
Analyze
Stream
Processing
Batch
Processing
Interactive
Analytics
MLAmazon Machine
Learning
Amazon
Redshift
Impala
AmazonElasticMapReduce
Pig
Streaming
Amazon

Kinesis
AWS
Lambda
Spark Streaming Apache Storm Amazon Kinesis
Client Library
AWS Lambda Amazon EMR
(Hive, Pig)
Scale /
Throughput
~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes
Batch or Real-
time
Real-time Real-time Real-time Real-time Batch
Manageability Yes (Amazon
EMR)
Do it yourself Amazon EC2 +
Auto Scaling
AWS managed Yes (Amazon
EMR)
Fault Tolerance Single AZ Configurable Multi-AZ Multi-AZ Single AZ
Programming
languages
Java, Python,
Scala
Any language
via Thrift
Java, via
MultiLangDaemon
( .Net, Python,
Ruby, Node.js)
Node.js, Java,
Python
Hive, Pig,
Streaming
languages
High
What Stream Processing Technology
Should I Use?
What Interactive/Batch Processing Technology
Should I Use?
Amazon
Redshift
Impala Presto Spark Hive
Query Latency Low Low Low Low Medium (Tez) – High
(MapReduce)
Durability High High High High High
Data Volume 2PB Max ~Nodes ~Nodes ~Nodes ~Nodes
Managed Yes Yes (EMR) Yes (EMR) Yes (EMR) Yes (EMR)
Storage Native HDFS / S3A* HDFS / S3 HDFS / S3 HDFS / S3
SQL
Compatibility
High Medium High Low (SparkSQL) Medium (HQL)
HighMedium
Spark Streaming
Apache Storm
AWS Lambda
KCL
Amazon
Redshift Spark
Impala
Presto
Hive
Amazon
Redshift
Hive
Spark
Presto
Impala
Amazon Kinesis
Apache Kafka
Amazon
DynamoDB
Amazon S3data
Hot Cold
Data Temperature
ProcessingLatency
Low
High Answers
Amazon EMR (HDFS)
Hive
Native
KCL
AWS Lambda
Data Temperature vs Processing Latency
Real-time
Interactive
Batch
Batch
What about ETL?
Store Analyze
https://aws.amazon.com/big-data/partner-solutions/
ETL
Consume /
Visualize
Consume
•  Predictions
•  Analysis and Visualization
•  Notebooks
•  IDE
•  Applications & API
Consume
Analysis&Visualization
Amazon
QuickSight
Notebooks
Predictions
Apps & APIs
IDE
Store Analyze ConsumeETL
Business
users
Data Scientist,
Developers
Putting It All Together
Collect Store Analyze Consume
A
iOS
 Android
Web Apps
Logstash
Amazon
RDS
Amazon
DynamoDB
Amazon
ES
Amazon

S3
Apache
Kafka
Amazon

Glacier
Amazon

Kinesis
Amazon

DynamoDB
Amazon
Redshift
Impala
Pig
Amazon ML
Streaming
Amazon

Kinesis
AWS
Lambda
AmazonElasticMapReduce
Amazon
ElastiCache
SearchSQLNoSQLCache
StreamProcessing
Batch
Interactive
Logging
StreamStorage
IoT
Applications
FileStorage
Analysis&Visualization
Hot
Cold
Warm
Hot
Slow
Hot
ML
Fast
Fast
Amazon
QuickSight
Transactional Data
File Data
Stream Data
Notebooks
Predictions
Apps & APIs
Mobile
Apps
IDE
Search Data
ETL
Reference Architecture
Design Patterns
Multi-Stage Decoupled “Data Bus”
•  Multiple stages
•  Storage decoupled from processing
Store Process Store Process
process
store
Multiple Processing Applications (or
Connectors) Can Read from or Write to the Same
Data Stores
Amazon
Kinesis
AWS
Lambda
Amazon
DynamoDB
Amazon
Kinesis S3
Connector
Amazon S3
process
store
Amazon
Kinesis
AWS
Lambda
Amazon
DynamoDB
Amazon
Kinesis S3
Connector
Amazon S3
process
store
Processing Frameworks (KCL, Storm, Hive,
Spark, etc.) Could Read from Multiple Data
Stores
Storm Hive Spark
Batch Layer
Amazon
Kinesis
data
process
store
Amazon
Kinesis S3
Connector
Amazon S3
A
p
p
l
i
c
a
t
i
o
n
s
Amazon
Redshift
Amazon EMR
Presto
Hive
Pig
Spark
answer
Speed Layer
answer
Serving
Layer
Amazon
ElastiCache
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
answer
Amazon
ML
KCL
AWS Lambda
Spark Streaming
Storm
Lambda
Architecture
Summary
•  Build decoupled “data bus”
•  Data → Store ↔ Process → Answers
•  Use the right tool for the job
•  Latency, throughput, access patterns
•  Use Lambda architecture ideas
•  Immutable (append-only) log, batch/speed/serving layer
•  Leverage AWS managed services
•  No/low admin
•  Be cost conscious
•  Big data ≠ big cost
Ran Tessler	
AWS Solu0ons Architecture Manager	
tesslerr@amazon.com

More Related Content

What's hot

The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 

What's hot (20)

NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Internal Hive
Internal HiveInternal Hive
Internal Hive
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Internals of Presto Service
Internals of Presto ServiceInternals of Presto Service
Internals of Presto Service
 
Powering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script TransformationPowering Custom Apps at Facebook using Spark Script Transformation
Powering Custom Apps at Facebook using Spark Script Transformation
 
Indexing with MongoDB
Indexing with MongoDBIndexing with MongoDB
Indexing with MongoDB
 
아라한사의 스프링 시큐리티 정리
아라한사의 스프링 시큐리티 정리아라한사의 스프링 시큐리티 정리
아라한사의 스프링 시큐리티 정리
 
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
 
NoSql
NoSqlNoSql
NoSql
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
Top 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark ApplicationsTop 5 Mistakes When Writing Spark Applications
Top 5 Mistakes When Writing Spark Applications
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
 
SQLServer Database Structures
SQLServer Database Structures SQLServer Database Structures
SQLServer Database Structures
 
Machine Learning with Spark MLlib
Machine Learning with Spark MLlibMachine Learning with Spark MLlib
Machine Learning with Spark MLlib
 
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
Cloud dw benchmark using tpd-ds( Snowflake vs Redshift vs EMR Hive )
 
Data Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMRData Science & Best Practices for Apache Spark on Amazon EMR
Data Science & Best Practices for Apache Spark on Amazon EMR
 

Viewers also liked

Faster Time to Science - Scaling BioMedical Research in the Cloud with SciOps...
Faster Time to Science - Scaling BioMedical Research in the Cloud with SciOps...Faster Time to Science - Scaling BioMedical Research in the Cloud with SciOps...
Faster Time to Science - Scaling BioMedical Research in the Cloud with SciOps...
Amazon Web Services
 

Viewers also liked (20)

Introduction to IAM + Best Practices
Introduction to IAM + Best PracticesIntroduction to IAM + Best Practices
Introduction to IAM + Best Practices
 
Hadoop in action
Hadoop in actionHadoop in action
Hadoop in action
 
Ram Disk
Ram DiskRam Disk
Ram Disk
 
ENT203 Integrating On-Premise Resources - AWS re: Invent 2012
ENT203 Integrating On-Premise Resources - AWS re: Invent 2012ENT203 Integrating On-Premise Resources - AWS re: Invent 2012
ENT203 Integrating On-Premise Resources - AWS re: Invent 2012
 
Amazon EC2
Amazon EC2Amazon EC2
Amazon EC2
 
Building Your Practice on AWS: An APN Breakfast Session
Building Your Practice on AWS: An APN Breakfast SessionBuilding Your Practice on AWS: An APN Breakfast Session
Building Your Practice on AWS: An APN Breakfast Session
 
AWS Mobile Hub + AWS Device Farm
AWS Mobile Hub + AWS Device FarmAWS Mobile Hub + AWS Device Farm
AWS Mobile Hub + AWS Device Farm
 
Develping mobile services on aws - Pop-up Loft Tel Aviv
Develping mobile services on aws - Pop-up Loft Tel AvivDevelping mobile services on aws - Pop-up Loft Tel Aviv
Develping mobile services on aws - Pop-up Loft Tel Aviv
 
DevOps as a Pathway to AWS | AWS Public Sector Summit 2016
DevOps as a Pathway to AWS | AWS Public Sector Summit 2016DevOps as a Pathway to AWS | AWS Public Sector Summit 2016
DevOps as a Pathway to AWS | AWS Public Sector Summit 2016
 
#EarthOnAWS: How the Cloud Is Transforming Earth Observation | AWS Public Sec...
#EarthOnAWS: How the Cloud Is Transforming Earth Observation | AWS Public Sec...#EarthOnAWS: How the Cloud Is Transforming Earth Observation | AWS Public Sec...
#EarthOnAWS: How the Cloud Is Transforming Earth Observation | AWS Public Sec...
 
Amazon S3 - Masterclass - Pop-up Loft Tel Aviv
Amazon S3 - Masterclass - Pop-up Loft Tel AvivAmazon S3 - Masterclass - Pop-up Loft Tel Aviv
Amazon S3 - Masterclass - Pop-up Loft Tel Aviv
 
Keynote - Currency fair
Keynote - Currency fairKeynote - Currency fair
Keynote - Currency fair
 
Application Delivery Patterns for Developers - Technical 401
Application Delivery Patterns for Developers - Technical 401Application Delivery Patterns for Developers - Technical 401
Application Delivery Patterns for Developers - Technical 401
 
Account Separation and Mandatory Access Control on AWS | Security Roadshow Du...
Account Separation and Mandatory Access Control on AWS | Security Roadshow Du...Account Separation and Mandatory Access Control on AWS | Security Roadshow Du...
Account Separation and Mandatory Access Control on AWS | Security Roadshow Du...
 
Workshop: We love APIs
Workshop: We love APIsWorkshop: We love APIs
Workshop: We love APIs
 
Faster Time to Science - Scaling BioMedical Research in the Cloud with SciOps...
Faster Time to Science - Scaling BioMedical Research in the Cloud with SciOps...Faster Time to Science - Scaling BioMedical Research in the Cloud with SciOps...
Faster Time to Science - Scaling BioMedical Research in the Cloud with SciOps...
 
Cloud First: New Architecture for New Infrastructure
Cloud First: New Architecture for New InfrastructureCloud First: New Architecture for New Infrastructure
Cloud First: New Architecture for New Infrastructure
 
Scaling by Design: AWS Web Services Patterns
Scaling by Design:AWS Web Services PatternsScaling by Design:AWS Web Services Patterns
Scaling by Design: AWS Web Services Patterns
 
Amazon Simple Work Flow Engine (SWF): How Beamr uses SWF for video optimizati...
Amazon Simple Work Flow Engine (SWF): How Beamr uses SWF for video optimizati...Amazon Simple Work Flow Engine (SWF): How Beamr uses SWF for video optimizati...
Amazon Simple Work Flow Engine (SWF): How Beamr uses SWF for video optimizati...
 
Another Day, Another Billion Packets
Another Day, Another Billion PacketsAnother Day, Another Billion Packets
Another Day, Another Billion Packets
 

Similar to Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv

Rethinking the database for the cloud (iJAWS)
Rethinking the database for the cloud (iJAWS)Rethinking the database for the cloud (iJAWS)
Rethinking the database for the cloud (iJAWS)
Rasmus Ekman
 

Similar to Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv (20)

AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
 
February 2016 Webinar Series - Architectural Patterns for Big Data on AWS
February 2016 Webinar Series - Architectural Patterns for Big Data on AWSFebruary 2016 Webinar Series - Architectural Patterns for Big Data on AWS
February 2016 Webinar Series - Architectural Patterns for Big Data on AWS
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOTAWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
AWS APAC Webinar Week - Big Data on AWS. RedShift, EMR, & IOT
 
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
 
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent...
 
Rethinking the database for the cloud (iJAWS)
Rethinking the database for the cloud (iJAWS)Rethinking the database for the cloud (iJAWS)
Rethinking the database for the cloud (iJAWS)
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Choosing the Right Database Service (김상필, 유타카 호시노) - AWS DB Day
Choosing the Right Database Service (김상필, 유타카 호시노) - AWS DB DayChoosing the Right Database Service (김상필, 유타카 호시노) - AWS DB Day
Choosing the Right Database Service (김상필, 유타카 호시노) - AWS DB Day
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 

More from Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Recently uploaded

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 

Recently uploaded (20)

Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 

Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv

  • 2. What to Expect from the Session Big data challenges How to simplify big data processing What technologies should you use? •  Why? •  How? Reference architecture Design patterns
  • 3. Ever Increasing Big Data Volume Velocity Variety
  • 5. Plethora of Tools Amazon Glacier S3 DynamoDB RDS EMR Amazon Redshift Amazon Kinesis CloudSearch Kinesis- enabled app Lambda Amazon ML SQS ElastiCache Data Pipeline DynamoDB Streams
  • 6. Is there a reference architecture? What tools should I use? How? Why?
  • 7. Architectural Principles •  Decoupled “data bus” •  Data → Store → Process → Answers •  Use the right tool for the job •  Data structure, latency, throughput, access patterns •  Use Lambda architecture ideas •  Immutable (append-only) log, batch/speed/serving layer •  Leverage AWS managed services •  No/low admin •  Big data ≠ big cost
  • 8. Simplify Big Data Processing ingest / collect store process / analyze consume / visualize Time to Answer (Latency) Throughput Cost
  • 10. Types of Data •  Transactional •  Database reads & writes (OLTP) •  Cache •  Search •  Logs •  Streams •  File •  Log files (/var/log) •  Log collectors & frameworks •  Stream •  Log records •  Sensors & IoT data Database File Storage Stream Storage A iOS Android Web Apps Logstash Logging IoT Applications Transactional Data File Data Stream Data Mobile Apps Search Data Search Collect Store Logging IoT
  • 11. Store
  • 13. Stream Storage Options •  AWS managed services •  Amazon Kinesis → streams •  DynamoDB Streams → table + streams •  Amazon SQS → queue •  Amazon SNS → pub/sub •  Unmanaged •  Apache Kafka → stream
  • 14. Why Stream Storage? •  Decouple producers & consumers •  Persistent buffer •  Collect multiple streams •  Preserve client ordering •  Streaming MapReduce •  Parallel consumption 4 4 3 3 2 2 1 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 Shard 1 / Partition 1 Shard 2 / Partition 2 Consumer 1 Count of Red = 4 Count of Violet = 4 Consumer 2 Count of Blue = 4 Count of Green = 4 DynamoDB Stream Kinesis Stream Kafka Topic
  • 15. What About Queues & Pub/Sub ? •  Decouple producers & consumers/subscribers •  Persistent buffer •  Collect multiple streams •  No client ordering •  No parallel consumption for Amazon SQS •  Amazon SNS can route to multiple queues or ʎ functions •  No streaming MapReduce Consumers Producers Producers Amazon SNS Amazon SQS queue topic function ʎ AWS Lambda Amazon SQS queue Subscriber
  • 16. Which stream storage should I use? Amazon Kinesis DynamoDB Streams Amazon SQS Amazon SNS Kafka Managed Yes Yes Yes No Ordering Yes Yes No Yes Delivery at-least-once exactly-once at-least-once at-least-once Lifetime 7 days 24 hours 14 days Configurable Replication 3 AZ 3 AZ 3 AZ Configurable Throughput No Limit No Limit No Limit ~ Nodes Parallel Clients Yes Yes No (SQS) Yes MapReduce Yes Yes No Yes Record size 1MB 400KB 256KB Configurable Cost Low Higher(table cost) Low-Medium Low (+admin)
  • 18. Why Is Amazon S3 Good for Big Data? •  Natively supported by big data frameworks (Spark, Hive, Presto, etc.) •  No need to run compute clusters for storage (unlike HDFS) •  Can run transient Hadoop clusters & Amazon EC2 Spot instances •  Multiple distinct (Spark, Hive, Presto) clusters can use the same data •  Unlimited number of objects •  Very high bandwidth – no aggregate throughput limit •  Highly available – can tolerate AZ failure •  Designed for 99.999999999% durability •  Tired-storage (Standard, IA, Amazon Glacier) via life-cycle policy •  Secure – SSL, client/server-side encryption at rest •  Low cost
  • 19. What about HDFS & Amazon Glacier? •  Use HDFS for very frequently accessed (hot) data •  Use Amazon S3 Standard for frequently accessed data •  Use Amazon S3 Standard – IA for infrequently accessed data •  Use Amazon Glacier for archiving cold data
  • 20. Database + Search Tier A iOS Android Web Apps Logstash Amazon RDS Amazon DynamoDB Amazon ES Amazon
 S3 Apache Kafka Amazon
 Glacier Amazon
 Kinesis Amazon
 DynamoDB Amazon ElastiCache SearchSQLNoSQLCache StreamStorage FileStorage Transactional Data File Data Stream Data Mobile Apps Search Data Collect Store ü 
  • 21. Database + Search Tier Anti-pattern RDBMS Database + Search Tier Applications
  • 22. Best Practice — Use the Right Tool for the Job Data Tier Search Amazon Elasticsearch Service Amazon CloudSearch Cache Redis Memcached SQL Amazon Aurora MySQL PostgreSQL Oracle SQL Server MariaDB NoSQL Cassandra Amazon DynamoDB HBase MongoDB Applications Database + Search Tier
  • 24. What Data Store Should I Use? •  Data structure → Fixed schema, JSON, key-value •  Access patterns → Store data in the format you will access it •  Data / access characteristics → Hot, warm, cold •  Cost → Right cost
  • 25. Data Structure and Access Patterns Access Patterns What to use? Put/Get (Key, Value) Cache, NoSQL Simple relationships → 1:N, M:N NoSQL Cross table joins, transaction, SQL SQL Faceting, Search Search Data Structure What to use? Fixed schema NoSQL, SQL Schema-free (JSON) NoSQL, Search (Key, Value) NoSQL, Cache
  • 26. What Is the Temperature of Your Data / Access ?
  • 27. Data / Access Characteristics: Hot, Warm, Cold Hot Warm Cold Volume MB–GB GB–TB PB Item size B–KB KB–MB KB–TB Latency ms ms, sec min, hrs Durability Low–High High Very High Request rate Very High High Low Cost/GB $$-$ $-¢¢ ¢ Hot Data Warm Data Cold Data
  • 28. Cache SQL Request Rate High Low Cost/GB High Low Latency Low High Data Volume Low High GlacierStructure NoSQL Hot Data Warm Data Cold Data Low High S3 Search HDFS
  • 29. What Data Store Should I Use? Amazon ElastiCache Amazon DynamoDB Amazon Aurora Amazon Elasticsearch Amazon EMR (HDFS) Amazon S3 Amazon Glacier Average latency ms ms ms, sec ms,sec sec,min,hrs ms,sec,min (~ size) hrs Data volume GB GB–TBs (no limit) GB–TB (64 TB Max) GB–TB GB–PB (~nodes) MB–PB (no limit) GB–PB (no limit) Item size B-KB KB (400 KB max) KB (64 KB) KB (1 MB max) MB-GB KB-GB (5 TB max) GB (40 TB max) Request rate High - Very High Very High (no limit) High High Low – Very High Low – Very High (no limit) Very Low Storage cost GB/month $$ ¢¢ ¢¢ ¢¢ ¢ ¢ ¢/10 Durability Low - Moderate Very High Very High High High Very High Very High Hot Data Warm Data Cold Data Hot Data Warm Data Cold Data
  • 30. Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…” Cost Conscious Design
  • 31. https://calculator.s3.amazonaws.com/index.html Example: Should I use Amazon S3 or Amazon DynamoDB? Cost Conscious Design Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2048 1483 777,600,000
  • 32. Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2,048 1,483 777,600,000 Amazon S3 or Amazon DynamoDB?
  • 33. Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month Scenario 1300 2,048 1,483 777,600,000 Scenario 2300 32,768 23,730 777,600,000 Amazon S3 Amazon DynamoDB use use
  • 35. Process / Analyze Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making. Examples •  Interactive dashboards → Interactive analytics •  Daily/weekly/monthly reports → Batch analytics •  Billing/fraud alerts, 1 minute metrics → Real-time analytics •  Sentiment analysis, prediction models → Machine learning
  • 36. Interactive Analytics Takes large amount of (warm/cold) data Takes seconds to get answers back Example: Self-service dashboards
  • 37. Batch Analytics Takes large amount of (warm/cold) data Takes minutes or hours to get answers back Example: Generating daily, weekly, or monthly reports
  • 38. Real-Time Analytics Take small amount of hot data and ask questions Takes short amount of time (milliseconds or seconds) to get your answer back •  Real-time (event) •  Real-time response to events in data streams •  Example: Billing/Fraud Alerts •  Near real-time (micro-batch) •  Near real-time operations on small batches of events in data streams •  Example: 1 Minute Metrics
  • 39. Predictions via Machine Learning ML gives computers the ability to learn without being explicitly programmed Machine Learning Algorithms: -  Supervised Learning ← “teach” program -  Classification ← Is this transaction fraud? (Yes/No) -  Regression ← Customer Life-time value? -  Unsupervised Learning ← let it learn by itself -  Clustering ← Market Segmentation
  • 40. Analysis Tools and Frameworks Machine Learning •  Mahout, Spark ML, Amazon ML Interactive Analytics •  Amazon Redshift, Presto, Impala, Spark Batch Processing •  MapReduce, Hive, Pig, Spark Stream Processing •  Micro-batch: Spark Streaming, KCL, Hive, Pig •  Real-time: Storm, AWS Lambda, KCL StreamProcessing Batch Interactive ML Analyze Stream Processing Batch Processing Interactive Analytics MLAmazon Machine Learning Amazon Redshift Impala AmazonElasticMapReduce Pig Streaming Amazon
 Kinesis AWS Lambda
  • 41. Spark Streaming Apache Storm Amazon Kinesis Client Library AWS Lambda Amazon EMR (Hive, Pig) Scale / Throughput ~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes Batch or Real- time Real-time Real-time Real-time Real-time Batch Manageability Yes (Amazon EMR) Do it yourself Amazon EC2 + Auto Scaling AWS managed Yes (Amazon EMR) Fault Tolerance Single AZ Configurable Multi-AZ Multi-AZ Single AZ Programming languages Java, Python, Scala Any language via Thrift Java, via MultiLangDaemon ( .Net, Python, Ruby, Node.js) Node.js, Java, Python Hive, Pig, Streaming languages High What Stream Processing Technology Should I Use?
  • 42. What Interactive/Batch Processing Technology Should I Use? Amazon Redshift Impala Presto Spark Hive Query Latency Low Low Low Low Medium (Tez) – High (MapReduce) Durability High High High High High Data Volume 2PB Max ~Nodes ~Nodes ~Nodes ~Nodes Managed Yes Yes (EMR) Yes (EMR) Yes (EMR) Yes (EMR) Storage Native HDFS / S3A* HDFS / S3 HDFS / S3 HDFS / S3 SQL Compatibility High Medium High Low (SparkSQL) Medium (HQL) HighMedium
  • 43. Spark Streaming Apache Storm AWS Lambda KCL Amazon Redshift Spark Impala Presto Hive Amazon Redshift Hive Spark Presto Impala Amazon Kinesis Apache Kafka Amazon DynamoDB Amazon S3data Hot Cold Data Temperature ProcessingLatency Low High Answers Amazon EMR (HDFS) Hive Native KCL AWS Lambda Data Temperature vs Processing Latency Real-time Interactive Batch Batch
  • 44. What about ETL? Store Analyze https://aws.amazon.com/big-data/partner-solutions/ ETL
  • 46. Consume •  Predictions •  Analysis and Visualization •  Notebooks •  IDE •  Applications & API Consume Analysis&Visualization Amazon QuickSight Notebooks Predictions Apps & APIs IDE Store Analyze ConsumeETL Business users Data Scientist, Developers
  • 47. Putting It All Together
  • 48. Collect Store Analyze Consume A iOS Android Web Apps Logstash Amazon RDS Amazon DynamoDB Amazon ES Amazon
 S3 Apache Kafka Amazon
 Glacier Amazon
 Kinesis Amazon
 DynamoDB Amazon Redshift Impala Pig Amazon ML Streaming Amazon
 Kinesis AWS Lambda AmazonElasticMapReduce Amazon ElastiCache SearchSQLNoSQLCache StreamProcessing Batch Interactive Logging StreamStorage IoT Applications FileStorage Analysis&Visualization Hot Cold Warm Hot Slow Hot ML Fast Fast Amazon QuickSight Transactional Data File Data Stream Data Notebooks Predictions Apps & APIs Mobile Apps IDE Search Data ETL Reference Architecture
  • 50. Multi-Stage Decoupled “Data Bus” •  Multiple stages •  Storage decoupled from processing Store Process Store Process process store
  • 51. Multiple Processing Applications (or Connectors) Can Read from or Write to the Same Data Stores Amazon Kinesis AWS Lambda Amazon DynamoDB Amazon Kinesis S3 Connector Amazon S3 process store
  • 52. Amazon Kinesis AWS Lambda Amazon DynamoDB Amazon Kinesis S3 Connector Amazon S3 process store Processing Frameworks (KCL, Storm, Hive, Spark, etc.) Could Read from Multiple Data Stores Storm Hive Spark
  • 53. Batch Layer Amazon Kinesis data process store Amazon Kinesis S3 Connector Amazon S3 A p p l i c a t i o n s Amazon Redshift Amazon EMR Presto Hive Pig Spark answer Speed Layer answer Serving Layer Amazon ElastiCache Amazon DynamoDB Amazon RDS Amazon ES answer Amazon ML KCL AWS Lambda Spark Streaming Storm Lambda Architecture
  • 54. Summary •  Build decoupled “data bus” •  Data → Store ↔ Process → Answers •  Use the right tool for the job •  Latency, throughput, access patterns •  Use Lambda architecture ideas •  Immutable (append-only) log, batch/speed/serving layer •  Leverage AWS managed services •  No/low admin •  Be cost conscious •  Big data ≠ big cost
  • 55. Ran Tessler AWS Solu0ons Architecture Manager tesslerr@amazon.com