Big Data Architectural Patterns and Best Practices

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Jesper Söderlund, Manager Solutions Architecture
2017-05-03

Agenda
Big Data Challenges
Architecture principles
What technologies should you use? How? Why?
Reference architecture
Design patterns
Customer Story: The Move to Real-Time Data Architectures, DNA Oy

Big Data Evolution
Batch processing → Stream processing → Artificial intelligence

Plethora of Tools
Amazon Glacier
Amazon S3
Amazon DynamoDB
Amazon RDS
Amazon EMR
Amazon Redshift
AWS Data Pipeline
Amazon Kinesis
AWS Lambda
Amazon ML
Amazon SQS
Amazon ElastiCache
Amazon DynamoDB Streams
Amazon Elasticsearch Service
Amazon Kinesis Analytics
Amazon QuickSight

Big Data Challenges
Why?
How?
What tools should I use?
Is there a reference architecture?

Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views
Be cost-conscious
• Big data ≠ big cost
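
The decoupling principle (Data → Store → Process → Store → Analyze → Answers) can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory `deque` and list stand in for real stores such as a stream and a file store, and the function names and record fields are hypothetical.

```python
from collections import deque


def collect(raw_events, raw_store):
    """Producer: append incoming events to the first store and do nothing else."""
    for event in raw_events:
        raw_store.append(event)


def process(raw_store, processed_store):
    """Processor: drain the raw store and write cleaned records to a second store."""
    while raw_store:
        event = raw_store.popleft()
        processed_store.append(
            {"user": event["user"], "amount": float(event["amount"])}
        )


def analyze(processed_store):
    """Analyzer: read only from the processed store to produce an answer."""
    totals = {}
    for record in processed_store:
        totals[record["user"]] = totals.get(record["user"], 0.0) + record["amount"]
    return totals


raw_store = deque()     # hypothetical stand-in for a stream or queue
processed_store = []    # hypothetical stand-in for a file or database store

collect(
    [
        {"user": "a", "amount": "2.5"},
        {"user": "b", "amount": "1"},
        {"user": "a", "amount": "0.5"},
    ],
    raw_store,
)
process(raw_store, processed_store)
answers = analyze(processed_store)
```

Each stage touches only the store next to it, so any stage could be replaced or scaled without changing the others.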

Simplify Big Data Processing
COLLECT → STORE → PROCESS/ANALYZE → CONSUME
Time to answer (Latency)
Throughput
Cost

Types of Data (COLLECT)
Applications (mobile apps, web apps, data centers, AWS Direct Connect)
• RECORDS: in-memory data structures, database records
Logging and transport (Amazon CloudWatch, AWS CloudTrail, AWS Import/Export Snowball)
• DOCUMENTS and FILES: search documents, log files
Messaging
• MESSAGES: messages
IoT (devices, sensors & IoT platforms, AWS IoT)
• STREAMS: data streams
Categories: transactions, files, events

What Is the Temperature of Your Data?

Data Characteristics: Hot, Warm, Cold
Characteristic | Hot data | Warm data | Cold data
Volume | MB–GB | GB–TB | PB–EB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | ms, sec | min, hrs
Durability | Low–high | High | Very high
Request rate | Very high | High | Low
Cost/GB | $$-$ | $-¢¢ | ¢

Types of Data Stores (STORE)
Database: SQL & NoSQL databases
Search: search engines
File store: file systems
Queue: message queues
Stream storage: pub/sub message queues
In-memory: caches, data structure servers

Message & Stream Storage
Amazon SQS
• Managed message queue service
Apache Kafka
• High-throughput distributed streaming platform
Amazon Kinesis Streams
• Managed stream storage + processing
Amazon Kinesis Firehose
• Managed data delivery
Amazon DynamoDB
• Managed NoSQL database
• Tables can be stream-enabled

Why Stream Storage?
Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Parallel consumption
Streaming MapReduce
[Figure: streaming MapReduce. Records 1–4 of each color are split across shard 1 / partition 1 and shard 2 / partition 2. Consumer 1 counts red = 4 and violet = 4; Consumer 2 counts blue = 4 and green = 4. The stream can be a DynamoDB stream, an Amazon Kinesis stream, or a Kafka topic.]
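
A minimal sketch of the streaming MapReduce idea: records are routed to shards by partition key, each shard is counted by its own consumer in parallel, and the partial counts are merged. The shard count, the CRC32 routing function, and the thread pool are assumptions for illustration, not the behavior of any particular service.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 2  # assumption: two shards, as in the figure


def shard_for(key: str) -> int:
    """Route a record to a shard by hashing its partition key (stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def put_records(records):
    """Append each record to its shard, preserving arrival order within the shard."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for record in records:
        shards[shard_for(record)].append(record)
    return shards


def count_shard(shard):
    """Map step: one consumer counts the keys in one shard."""
    return Counter(shard)


records = ["red", "violet", "blue", "green"] * 4
shards = put_records(records)

# One consumer per shard, running in parallel.
with ThreadPoolExecutor(max_workers=NUM_SHARDS) as pool:
    partials = list(pool.map(count_shard, shards))

# Reduce step: merge the per-shard counts.
totals = sum(partials, Counter())
```

Because a given key always lands on the same shard, each consumer's count for that key is already complete, and the merge is a plain sum.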

What About Amazon SQS?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (Standard)
• FIFO queue preserves client ordering
• No streaming MapReduce
• No parallel consumption
• Amazon SNS can publish to multiple subscribers (queues or Lambda functions)
[Figure: a publisher sends to an Amazon SNS topic, which delivers to its subscribers: an AWS Lambda function and Amazon SQS queues read by consumers. A Standard queue may deliver messages 1–4 out of order or more than once; a FIFO queue delivers them in order.]
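
The difference between a queue and a pub/sub topic can be shown with an in-memory sketch. The `Queue` and `Topic` classes below are hypothetical stand-ins, not the Amazon SQS or Amazon SNS APIs, and this toy queue happens to be strictly ordered, which a Standard queue does not guarantee.

```python
from collections import deque


class Queue:
    """In-memory stand-in for a message queue: each message goes to one consumer."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        return self._messages.popleft() if self._messages else None


class Topic:
    """In-memory stand-in for a pub/sub topic: every subscriber gets every message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, deliver):
        self._subscribers.append(deliver)

    def publish(self, message):
        for deliver in self._subscribers:
            deliver(message)


orders_queue = Queue()
audit_queue = Queue()
handled_by_function = []  # stands in for a function subscriber

topic = Topic()
topic.subscribe(orders_queue.send)
topic.subscribe(audit_queue.send)
topic.subscribe(handled_by_function.append)

for n in (1, 2, 3, 4):
    topic.publish(n)

# Two competing consumers on one queue split the messages between them.
consumer_a = [orders_queue.receive(), orders_queue.receive()]
consumer_b = [orders_queue.receive(), orders_queue.receive()]

# A separate queue subscribed to the same topic still holds its own full copy.
audit_copy = [audit_queue.receive() for _ in range(4)]
```

Consumers on a single queue compete for messages, while fan-out through the topic is what lets several independent readers each see the whole feed.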

Which Stream/Message Storage Should I Use?
Attribute | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS (Standard) | Amazon SQS (FIFO, new)
AWS managed | Yes | Yes | Yes | No | Yes | Yes
Guaranteed ordering | Yes | Yes | No | Yes | No | Yes
Delivery (deduping) | Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once
Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days | 14 days
Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ
Scale / throughput | No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic | 300 TPS / queue
Parallel consumption | Yes | Yes | No | Yes | No | No
Stream MapReduce | Yes | Yes | N/A | Yes | N/A | N/A
Row/object size | 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
Cost | Higher (table cost) | Low | Low | Low (+admin) | Low-medium | Low-medium
Data temperature: hot → warm
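
One way to use the comparison is to encode a few of its rows as data and filter on requirements. The sketch below copies only the "AWS managed", "Guaranteed ordering", and "Parallel consumption" rows from the table; the `candidates` helper and the dictionary layout are hypothetical.

```python
# Three rows of the comparison table, keyed by store name.
STREAM_STORES = {
    "DynamoDB Streams": {"managed": True, "ordering": True, "parallel": True},
    "Kinesis Streams": {"managed": True, "ordering": True, "parallel": True},
    "Kinesis Firehose": {"managed": True, "ordering": False, "parallel": False},
    "Apache Kafka": {"managed": False, "ordering": True, "parallel": True},
    "SQS (Standard)": {"managed": True, "ordering": False, "parallel": False},
    "SQS (FIFO)": {"managed": True, "ordering": True, "parallel": False},
}


def candidates(**requirements):
    """Return the stores whose attributes match every stated requirement."""
    return [
        name
        for name, attrs in STREAM_STORES.items()
        if all(attrs[key] == wanted for key, wanted in requirements.items())
    ]


# Managed, ordered, and consumable in parallel:
ordered_parallel = candidates(managed=True, ordering=True, parallel=True)

# Ordered, but a single consumer is enough:
ordered_single = candidates(ordering=True, parallel=False)
```

The remaining rows (retention, item size, cost) would narrow the shortlist further and could be added as extra keys in the same way.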

File Storage
[Diagram: COLLECT → STORE with file storage highlighted. Stream storage (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams. Message storage: Amazon SQS. File storage: Amazon S3.]

In-memory, Database, Search
[Diagram: COLLECT → STORE with the in-memory, database, and search tiers highlighted, alongside stream storage (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams), message storage (Amazon SQS), and file storage (Amazon S3).]

Best Practice: Use the Right Tool for the Job
Data Tier
Search
• Amazon Elasticsearch Service
In-memory
• Amazon ElastiCache (Redis, Memcached)
SQL
• Amazon Aurora
• Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server)
NoSQL
• Amazon DynamoDB
• Cassandra
• HBase
• MongoDB

Materialized Views & Immutable Log
Views
Immutable log
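
As a rough illustration of the pattern, the sketch below keeps an append-only log and derives two views from it by replay. The `ImmutableLog` class, the view functions, and the account/delta event shape are all hypothetical; a real system would use a durable stream or file store as the log.

```python
from types import MappingProxyType


class ImmutableLog:
    """Append-only log: events can be added and read, never changed or removed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        # Store a read-only copy so later changes to the caller's dict have no effect.
        self._events.append(MappingProxyType(dict(event)))

    def read(self, start=0):
        return tuple(self._events[start:])


def balances_view(events):
    """Materialized view: current balance per account, derived by replaying the log."""
    view = {}
    for event in events:
        view[event["account"]] = view.get(event["account"], 0) + event["delta"]
    return view


def activity_view(events):
    """A second view over the same log: number of events per account."""
    view = {}
    for event in events:
        view[event["account"]] = view.get(event["account"], 0) + 1
    return view


log = ImmutableLog()
for event in (
    {"account": "a", "delta": 10},
    {"account": "b", "delta": 5},
    {"account": "a", "delta": -3},
):
    log.append(event)

balances = balances_view(log.read())
activity = activity_view(log.read())
```

Because the log is never modified, a view can be dropped and rebuilt at any time, and new views can be added later without touching the producers.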

[Diagram: COLLECT → STORE with every tier filled in. Stream (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams. Message: Amazon SQS. File: Amazon S3. Cache: Amazon ElastiCache. NoSQL: Amazon DynamoDB. SQL: Amazon RDS. Search: Amazon Elasticsearch Service.]
Amazon ElastiCache
• Managed Memcached or Redis service
Amazon DynamoDB
• Managed NoSQL database service
Amazon RDS
• Managed relational database service
Amazon Elasticsearch Service
• Managed Elasticsearch service

Which Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data characteristics → Hot, warm, cold
Cost → Right cost

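[Figure: data stores placed along a hot → warm → cold data axis, from in-memory through SQL and NoSQL to Amazon Glacier. Moving from hot to cold: request rate high → low, cost/GB high → low, latency low → high, data volume low → high. A second axis shows structure, from low to high.]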

Cost-Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project. The design calls for
many small files, perhaps up to a billion during peak. The
total size would be on the order of 1.5 TB per month…”
Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
300 | 2,048 | 1,483 | 777,600,000

Amazon S3 or DynamoDB?

Scenario | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
Scenario 1 | 300 | 2,048 | 1,483 | 777,600,000
Scenario 2 | 300 | 32,768 | 23,730 | 777,600,000
[Slide graphic: one scenario is labeled "use Amazon S3", the other "use Amazon DynamoDB".]
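
The sizing figures on this slide can be reproduced with simple arithmetic. The sketch below assumes a 30-day month and binary gigabytes (2^30 bytes), which is what makes the numbers match; the function name is hypothetical, and no pricing is modeled.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30       # assumption: the slide's figures match a 30-day month
BYTES_PER_GB = 2 ** 30    # assumption: the slide's totals match binary gigabytes


def monthly_volume(writes_per_sec, object_size_bytes):
    """Return (objects per month, total GB per month) for a steady write rate."""
    objects = writes_per_sec * SECONDS_PER_DAY * DAYS_PER_MONTH
    total_gb = objects * object_size_bytes / BYTES_PER_GB
    return objects, total_gb


scenario_1 = monthly_volume(300, 2_048)
scenario_2 = monthly_volume(300, 32_768)
```

Both scenarios write the same number of objects; only the object size, and therefore the stored volume, differs by a factor of 16.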

Analytics Types & Frameworks (PROCESS / ANALYZE)
Batch
• Takes minutes to hours
• Example: daily/weekly/monthly reports
• Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
• Takes seconds
• Example: self-service dashboards
• Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark)
Message
• Takes milliseconds to seconds
• Example: message processing
• Amazon SQS applications on Amazon EC2
Stream
• Takes milliseconds to seconds
• Example: fraud alerts, 1-minute metrics
• Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
Artificial Intelligence
• Takes milliseconds to minutes
• Example: fraud detection, demand forecasting, text to speech
• Amazon AI (Lex, Polly, ML, Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe)
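
The "1-minute metrics" example of stream processing amounts to a tumbling-window aggregation. The sketch below is framework-independent and assumes events arrive as hypothetical `(epoch_seconds, key)` pairs; a real stream processor would also handle late data and emit each window when it closes.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # one-minute tumbling windows


def one_minute_counts(events):
    """Count events per (window start, key); events are (epoch_seconds, key) pairs."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align the timestamp down to the start of its window.
        window_start = timestamp - timestamp % WINDOW_SECONDS
        counts[(window_start, key)] += 1
    return dict(counts)


events = [
    (0, "login"),
    (30, "login"),
    (59, "purchase"),
    (60, "login"),
    (119, "login"),
]
metrics = one_minute_counts(events)
```

Here the first three events fall in the window starting at 0 and the last two in the window starting at 60.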

What About ETL?
STORE → ETL → PROCESS / ANALYZE
Data Integration Partners
• Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes
• https://aws.amazon.com/big-data/partner-solutions/
AWS Glue (new)
• A fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores

[Diagram: COLLECT → STORE → ETL → PROCESS / ANALYZE → CONSUME, with the process/analyze stage filled in. Stores, hot to warm: Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon DynamoDB Streams (stream); Amazon SQS (message); Amazon S3 (file); Amazon ElastiCache (cache); Amazon DynamoDB (NoSQL); Amazon RDS (SQL); search service. Processing, fast to slow: Amazon Kinesis Analytics, KCL apps, AWS Lambda, Amazon EC2, and Amazon EMR (streaming); Amazon SQS apps on Amazon EC2 (message); Amazon Redshift, Presto on Amazon EMR, and Amazon Athena (interactive); Amazon EMR (batch); Amazon AI (AI).]

COLLECT → STORE → ETL → PROCESS / ANALYZE → CONSUME
CONSUME
• Amazon QuickSight
• Apps & services (applications & API)
• Analysis and visualization
• Notebooks
• IDE
Consumers: business users; data scientists and developers

[Diagram: the complete COLLECT → STORE → ETL → PROCESS / ANALYZE → CONSUME architecture, combining the data sources, stores, processing frameworks, and consumers from the preceding slides, ending in Amazon QuickSight, apps & services, analysis & visualization, notebooks, IDE, and API.]

[Figure: processing speed (fast to slow) plotted against data temperature (hot to cold), leading to answers. Data sits in Amazon Kinesis, Amazon DynamoDB, and Amazon S3. Tools shown: Spark Streaming, Apache Storm, AWS Lambda, KCL apps, and native apps; Amazon Redshift; Hive, Spark, and Presto; Amazon Athena.]

Real-time Analytics
[Diagram: Amazon Kinesis feeds the process tier: KCL app, AWS Lambda, Spark Streaming on Amazon EMR, Amazon Kinesis Analytics, and Amazon AI (real-time prediction). Outputs in the store tier: Amazon SNS (notifications, alerts); Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, and Amazon ES (app state or materialized view, KPI); Amazon S3 (log); and Amazon Kinesis (fan out).]

Interactive & Batch Analytics
[Diagram: files arrive through Amazon Kinesis Firehose and are stored in Amazon S3. Process tier: Amazon Redshift, Amazon Athena, and Amazon EMR with Presto or Spark (interactive); Amazon EMR with Hive, Pig, or Spark (batch); Amazon AI (batch prediction, real-time prediction). Amazon Kinesis Analytics is also shown. Results go to the consume tier.]
Data Lake
[Diagram: the interactive & batch and real-time patterns combined around Amazon S3. Inputs from applications: transactions (change data capture from Amazon DynamoDB and Amazon RDS), streams (Amazon Kinesis), and files (Amazon Kinesis Firehose). Interactive & batch: Amazon Redshift, Amazon Athena, and Amazon EMR (Presto, Hive, Pig, Spark). Real-time: AWS Lambda, Storm, Spark Streaming on Amazon EMR, KCL, Amazon Kinesis Analytics, and Amazon AI. App state or materialized view: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES.]
The Move to a Real-Time Data Architecture
Jarno Kartela
Chief Data Scientist, DNA Oy

DNA is one of the leading Finnish telecommunications groups
• EUR 859 million: net sales in 2016
• EUR 91 million: operating result in 2016
• TV: Finland’s largest cable operator and the leading pay TV provider
• 3.8 million: mobile communications and fixed network customer subscriptions
• 1,668: at the end of 2016, there were 1,668 employees working with DNA
• 64 DNA stores: Finland’s most extensive retailer of mobile phones, other mobile devices and mobile subscriptions
• Strong employee satisfaction: the personnel's satisfaction with DNA as an employer is at a record-breaking high level
• The customer is at the center of DNA’s strategy
Our values
• FAST: DNA's customers receive quick and helpful service
• STRAIGHTFORWARD: DNA’s approach is clear and responsible
• BOLD: We are direct, open-minded and ready for change
42
Public | DNA Today

Introduction: our team’s
mission at DNA

SAD* data
*Data you are forced to collect even though no-one wants it as a customer & no-one needs it in your business & no-one

We help DNA avoid SAD data &
create data & analytics assets
so good that they create fear in
our competition.

ETL
Analytics
Great wall of China
“Online” Analytics

ETL
Analytics
Great wall of China
Greater wall of China
Marketing?

ETL
Analytics
Great wall of China
Marketing?
Brand stuff
Product dev
Consultants
“AI”

Real-time data feed
• Analytics dept.
• Customer interfaces
• Reporting dept.
• Marketing dept.
• Customer service dept.
• Product dept.
Data as customer experience

Product / infrastructure dev: potential scoring

Product / infrastructure dev: real-time insight

Marketing: channel-agnostic metrics

Marketing & CX: machine learning for recommendations & personalization

AI: understanding & classifying natural language (...and later on, for bots)

Customer service: 360 view & recommendations

Data as customer experience
Input
Output: better CX

Benefits
Customer
- unified experience across channels
- personalized content
- better offers
- less time spent on browsing DNA services
- overall better CX
- coming: my data & ability to change the data about you as a customer
Business
- time spent on marketing 10x less than before
- sales up 3x across AI-driven campaigns
- sales up 2x-10x across campaigns triggered by real-time data
- near real time (5 min) insight on what’s happening
- insight across channels
IT / Analytics
- better service quality with less time & resources
- devops, automation
- endless scale for analytics & data
- ease of try-fail-try-again
- cost effectiveness
- one source of truth for data
- ability to serve all channels & functions with one platform
- 6 months to full scale production

Coming up:
- Rekognition for retail analytics
- Lex/similar for voice dialogue and chatbots
- GDPR -> create services instead of fear

Summary
Build decoupled systems
• Use Amazon S3 as the data fabric of your data lake
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable log, batch, interactive & real-time views
Be cost-conscious
• Big data ≠ big cost

Building a Data Lake on AWS
Kinesis Firehose
Athena
Query Service

Resources
• https://aws.amazon.com/blogs/big-data/introducing-the-data-lake-solution-on-aws/
• AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ecosystem (BDM306)
• AWS re:Invent 2016: Deep Dive on Amazon S3 (STG303)
• https://aws.amazon.com/blogs/big-data/reinvent-2016-aws-big-data-machine-learning-sessions/
• https://aws.amazon.com/blogs/big-data/implementing-authorization-and-auditing-using-apache-ranger-on-amazon-emr/
