Big Data Architectural Patterns and Best Practices on AWS

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ian Meyers, Principal Solutions Architect
July 7th
, 2016
Big Data Architectural Patterns
and Best Practices on AWS

Big Data Evolution
Batch
processing
Stream
processing
Machine
learning

Powerful Services, Open Ecosystem
Amazon
Glacier
S3 DynamoDB
RDS
EMR
Amazon
Redshift
Data Pipeline
Amazon
Kinesis
Kinesis-enabled
app
Lambda ML
SQS
ElastiCache
DynamoDB
Streams
Amazon Elasticsearch
Service
Amazon Web Services Open Source & Commercial Ecosystem

Separation of compute & storage
• Event driven, loose coupling, independent scaling & sizing
“Data Sympathy” - use the right tool for the job
Data structure, latency, throughput, access patterns
Managed services, open technologies
Scalable/elastic, available, reliable, secure, no/low admin
Focus on the data, and your business, not undifferentiated tasks
Architectural Principles

Big Data != Big Cost
Aim to reduce the cost of experimentation to near 0

Simplify Big Data Processing
COLLECT STORE
PROCESS/
ANALYZE
CONSUME
Time to answer (Latency)
Throughput
Cost

Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoTApplications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
COLLECT
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Mobile SDK
AWS IoT
Amazon CloudWatchLogs Agent
Amazon CognitoSync
AWS Mobile Analytics
Easily Capture Mobile Data
through Sync and Server
Logging
File based integration to
Streaming Storage
API Integration via SDK’s
and Ecosystem Tools
Amazon Kinesis Producer Library
Amazon Kinesis Agent

Principles for Collection Architecture
• Should be easy to install, manage, and
understand
• Meet requirements for data durability & recovery
• Ensure ability to meet latency & performance
requirements
• Data handling should be sympathetic to the type
of data

Types of Data
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoTApplications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
COLLECT
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Database
STORE
Search
File store
Queue
Stream
storage
Cache Heavily read, application defined structure
Consistently read, relational or document data
Frequently read, often requiring natural language
processing or geospatial access
Data received as files, or where object storage is
appropriate
Data intended for a single recipient, with low
latency
Data intended for many recipients, with low
latency

COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENT
S
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Stream
Amazon SQS
Message
Service
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
SearchSQLNoSQLCacheFile
LoggingIoTApplicationsTransportMessaging
File store
Queue
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Stream
storage
Cache,
Database,
Search

Database Anti-pattern
Myriad of Applications
One GREAT BIG Database

Myriad of Applications
Best Practice - Use the Right Tool for the job
Data Tier
Search
Amazon Elasticsearch Service
Amazon CloudSearch
Cache
Amazon ElastiCache
Redis
Memcached
SQL
Amazon Aurora
Amazon RDS
Amazon Redshift
NoSQL
Amazon DynamoDB
Cassandra
HBase
MongoDB
Database Tier

COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENT
S
FILES
Messaging
Message MESSAGES
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
Stream
Amazon SQS
Message
Amazon S3
File
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Queue
Stream
storage
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
File store

Why is Amazon S3 Good for Big Data?
• Virtually unlimited storage, unlimited object count
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tired-storage (Standard, IA, Amazon Glacier) via life-cycle
policies
• Native support for Versioning
• Secure – SSL, client/server-side encryption at rest
• Low cost

Why is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive,
Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• No need to pay for data replication
• Can run transient Hadoop clusters & Amazon EC2 Spot
Instances
• Multiple & heterogenous analysis clusters can use the same
data

What about HDFS & Data Tiering (S3 IA & Amazon Glacier)?
• Use HDFS for very frequently accessed
(hot) data
• Use Amazon S3 Standard as system of
record for durability
• Use Amazon S3 Standard – IA for near line
archive data
• Use Amazon Glacier for cold/offline archive
data

Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Amazon SQS
Amazon SQS
Managed message queue
service
Amazon Kinesis Streams
Managed stream storage +
processing
Amazon Kinesis Firehose
Managed data delivery
Amazon DynamoDB
Streams
Managed NoSQL database
Tables can be stream-enabled
Message & Stream Data
Devices
Sensors &
IoT platforms
AWS IoT STREAMS
IoT
COLLECT STORE
Mobile apps
Web apps
Data centers
AWS Direct
Connect
RECORDS
Applications
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
DOCUMENTS
FILES
LoggingTransport
Messaging
Message MESSAGES
Messaging
Queue
Stream
Service
Amazon DynamoDB
Amazon ElastiCache
Amazon RDS
Amazon S3
Queue
Stream
storage

Why Stream Storage?
Decouple producers & consumers
Persistent buffer
Collect multiple streams
Preserve client ordering
Streaming analysis
Parallel consumption
4 4 3 3 2 2 1 1
4 3 2 1
4 3 2 1
4 3 2 1
4 3 2 1
4 4 3 3 2 2 1 1
shard 1 / partition 1
shard 2 / partition 2
Consumer 1
Count of
red = 4
Count of
violet = 4
Consumer 2
Count of
blue = 4
Count of
green = 4
DynamoDB stream Amazon Kinesis stream

What Data Store Should I Use?
Data structure → Fixed schema, JSON, key-value
Access patterns → Store data in the format you will access it
Data / access characteristics → Hot, warm, cold
Cost → Right cost

Data Structure and Access Patterns
Access Patterns What to use?
Put/Get (key, value) Cache, NoSQL
Simple relationships → 1:N, M:N NoSQL
Cross table joins, transaction, SQL SQL
Faceting, natural language, geo Search
Data Structure What to use?
Fixed schema SQL, NoSQL
Schema-free (JSON) NoSQL, Search
(Key, value) Cache, NoSQL

What Is the Temperature of Your Data / Access ?

Hot Warm Cold
Volume MB–GB GB–TB PB
Item size B–KB KB–MB KB–TB
Latency ms ms, sec min, hrs
Durability Low – high High Very high
Request rate Very high High-Med Low
Cost/GB $$-$ $-¢¢ ¢
Hot data Warm data Cold data
Data / Access Characteristics: Hot, Warm, Cold

Amazon ElastiCache Amazon
DynamoDB
Amazon
RDS/Aurora
Amazon
Elasticsearch
Amazon S3 Amazon Glacier
Average
latency
ms ms ms, sec ms,sec ms,sec,min
(~ size)
hrs
Typical
data stored
GB GB–TBs
(no limit)
GB–TB
(64 TB max)
GB–TB MB–PB
(no limit)
GB–PB
(no limit)
Typical
item size
B-KB KB
(400 KB max)
KB
(64 KB max)
KB
(2 GB max)
KB-TB
(5 TB max)
GB
(40 TB max)
Request
Rate
High – very high Very high
(no limit)
High High Low – high
(no limit)
Very low
Storage cost
GB/month
$$ ¢¢ ¢¢ ¢¢ ¢ ¢/10
Durability Low - moderate Very high Very high High Very high Very high
Availability High
2 AZ
Very high
3 AZ
Very high
3 AZ
High
2 AZ
Very high
3 AZ
Very high
3 AZ
Hot data Warm data Cold data
What Data Store Should I Use?

COLLECT STORE PROCESS / ANALYZE
Service
Amazon SQS
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
HotWarm
SearchSQLNoSQLCacheFileMessage
Stream
Mobile apps
Web apps
Devices
Messaging
Message
Sensors &
IoT platforms
AWS IoT
Data centers
AWS Direct
Connect
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
DOCUMENT
S
FILES
MESSAGES
STREAMS
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams
Hot

Tools and Frameworks
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
PROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
FastSlowFast
BatchMessageInteractiveStreamML
Amazon EC2
Amazon EC2
Batch - Minutes or hours on cold data
Daily/weekly/monthly reports
Interactive – Seconds on warm/cold data
Self-service dashboards
Messaging – Milliseconds or seconds on hot data
Message/eventbuffering
Streaming - Milliseconds or seconds on hot data
Billing/fraud alerts, 1 minute metrics

Tools and Frameworks
Batch
Amazon EMR (MapReduce, Hive, Pig, Spark)
Interactive
Amazon Redshift, Amazon EMR (Presto, Spark)
Messaging
Amazon SQS application on Amazon EC2
Streaming
Micro-batch: Spark Streaming, KCL
Real-time: Amazon Kinesis Analytics, Storm,
AWS Lambda, KCL
Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
PROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
FastSlowFast
Amazon EC2
Amazon EC2

What Analytics Technology Should I Use?
Amazon
Redshift
Amazon EMR
Presto Spark Hive
Query
latency
Low Low Low High
Durability High High High High
Data volume 1.6 PB max ~Nodes ~Nodes ~Nodes
AWS
managed
Yes Yes Yes Yes
Storage Native HDFS / S3 HDFS / S3 HDFS / S3
SQL
compatibility
High High Low (SparkSQL) Medium (HQL)
Slow

What Streaming Analysis Technology Should I Use?
Spark
Streaming
Apache
Storm
Kinesis KCL
Application
AWS Lambda Amazon SQS
Apps
Scale ~ Nodes ~ Nodes ~ Nodes Automatic ~ Nodes
Micro-batch or
Real-time
Micro-batch Real-time Near-real-time Near-real-time Near-real-time
AWS managed
service
Yes (EMR) No (EC2) No (KCL + EC2
+ Auto Scaling)
Yes No (EC2 + Auto
Scaling)
Scalability No limits
~ nodes
No limits
~ nodes
No limits
~ nodes
No limits No limits
Availability Single AZ Configurable Multi-AZ Multi-AZ Multi-AZ
Programming
languages
Java, Python,
Scala
Any language
via Thrift
Java, via
MultiLang
Daemon (.NET,
Python, Ruby,
Node.js)
Node.js, Java,
Python
AWS SDK
languages
(Java, .NET,
Python, …)

Amazon SQS apps
Streaming
Amazon Kinesis
Analytics
Amazon KCL
apps
AWS Lambda
Amazon Redshift
COLLECT STORE CONSUMEPROCESS / ANALYZE
Amazon Machine
Learning
Presto
Amazon
EMR
Service
Amazon SQS
Amazon DynamoDB
Amazon S3
Amazon ElastiCache
Amazon RDS
HotHotWarm
FastSlowFast
SearchSQLNoSQLCacheFileMessage
Stream
Amazon EC2
Amazon EC2
Mobile apps
Web apps
Devices
Messaging
Message
Sensors &
IoT platforms
AWS IoT
Data centers
AWS Direct
Connect
AWS Import/Export
Snowball
Logging
Amazon
CloudWatch
AWS
CloudTrail
RECORDS
DOCUMENT
S
FILES
MESSAGES
STREAMS
ETL
Amazon Kinesis
Firehose
Amazon Kinesis
Streams
Amazon DynamoDB
Streams

STORE CONSUMEPROCESS / ANALYZE
Amazon QuickSight
Analysis&visualizationNotebooksIDE
Apps & Services
API
Applications & API
Reporting, Exploratory and
Ad-Hoc Analysis and Data
Visualization
Notebooks
Data Science
Business
users
Data scientist,
developers
COLLECT ETL

Primitive: Batch Analysis & Enhancement
process
store
Amazon S3
Amazon
DynamoDB
Amazon EMR
Presto
Spark
SQL
Amazon
Redshift
Amazon
Quicksight
AWS
Marketplace

Primitive: Event Processing “Data Bus”
Multiple stages
Storage raises events to decouple processing stages
Store Process Store Process

Primitive: Pub/Sub
process
store
Amazon
Kinesis
AWS Lambda
Amazon Kinesis
S3
Connector
Apache Spark
Kinesis Client
Library

Combining Primitives: Event Processing Network
Combines data bus and pub/sub patterns to create
asynchronous, continuously executing transformations
Store Process Store Process
Process Store Process
Process
Process

Primitive: Messaging “Fan Out”
process
store
Amazon SQS AWS Lambda
Amazon
Kinesis

Real Time
Processing
Batch Analytics
process
store
Amazon S3
Amazon
Redshift
Amazon EMR
Presto
Hive
Pig
Spark
Real Time
StorageAmazon
ElastiCache
Amazon
DynamoDB
Amazon
RDS
Amazon
ES
Amazon
ML
KCL
AWS Lambda
Storm
Real Time &
Batch
Architecture
Spark Streaming
on Amazon EMR
Applications
Amazon
Kinesis

About Adbrain
• Adbrain is a market leader in providing Probabilistic
Identity Solutions
• At the core of our business we build an identity graph
that represents connected devices on the planet
• We enable our clients to understand and target their
audience better

What does that mean?
• Ingesting large amounts of log data from
different data sources
• Build a graph using statistical and
machine learning algorithms
• Extract a relevant user graph for our
clients

Our main scenarios
• Production graph generation and servicing our clients
• Data Science team to run computations (Spark)
• R&D work with ML
• Data feed evaluation
• Graph analytics
• Data Engineering development workload

Our original setup
• HDFS Cluster on 18 nodes with most of our hot data
• Not leveraging co-location of data
• Spark clusters are separated
• Spark instance type optimized according to workload
• Custom cluster provisioning solution to run Spark jobs
• Built in-house
• Required maintenance

What are the issues?
• “Hadoop is a cheap Big Data solution on
commodity hardware” … define cheap for a
startup
• Scaling Hadoop means more focus on
infrastructure
• Data grows really quickly (adding a new country,
data feed, etc.)

What do we need?
• Better scale
• Cheaper solution
• Less focus on infrastructure
• Reduced risk

EMR and S3 to the rescue!
• S3 is significantly cheaper compared to HDFS
• We use infrequent access and Glacier storage as
well
• EMR is a fully managed solution
• Up to date Spark distributions
• Support of Spot instances
• Managed Hadoop when / if needed
Amazon
S3
Amazon
EMR

Some technical challenges
• Spark 1.3.1 did not work well with S3 and
Parquet files
• Writing to S3 using the default
ParquetOutputComitter is slow
• Calculating output summary metadata is slow
• S3 is not consistent by default (it’s not a file
system)
• S3 can be slower than HDFS

Solutions
• Upgrade to Spark 1.5.1+
• Using DirectParquetOutputCommitter with speculation
disabled
• parquet.enable.summary-metadata = false
• Enable S3 Consistent View in EMR!
• (Remember the DynamoDB tables!)
• In case of a multi-step Spark workflow, utilize HDFS on
the Spark Cluster

The end result
• Decreasing costs by 70%+
• Focusing on development instead of
commodities (hardware/versioning/patching)
• Eliminated risks

Thank you!
aws.amazon.com/big-data

Remember to complete
your evaluations!

Big Data Architectural Patterns and Best Practices on AWS

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Big Data Architectural Patterns and Best Practices on AWS

Similar to Big Data Architectural Patterns and Best Practices on AWS (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Recently uploaded

Recently uploaded (20)

Big Data Architectural Patterns and Best Practices on AWS