Fast Track to Your Data Lake on AWS

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
March 16, 2017
Fast Track to Your Data Lake on AWS
John Mallory, Business Development

Data has gravity
…easier to move processing to the data
4k/8k
Genomics
Seismic
Financial
Logs
IoT

Challenges with Legacy Data Architectures
• Can’t move data across silos
• Can’t afford to keep all of the data
• Can’t scale with dynamic data and real-time processing
• Can’t scale management of data
• Can’t find the people who know how to configure and
manage complex infrastructure
• Can’t afford the investments to keep refreshing
infrastructure and data centers

Enter Data Lake Architectures
Data Lake is a new and increasingly
popular architecture to store and analyze
massive volumes and heterogeneous
types of data.
Benefits of a Data Lake
• All Data in One Place
• Quick Ingest
• Storage vs Compute
• Schema on Read

1&2: Consolidate (Data) & Separate (Storage & Compute)
•S3 as the data lake storage tier; not a single analytics
tool like Hadoop or a data warehouse
•Decoupled storage and compute is cheaper and more
efficient to operate
•Decoupled storage and compute allow us to evolve to
clusterless architectures (i.e. Lambda, Athena & Glue)
•Do not build data silos in Hadoop or the EDW
•Gain flexibility to use all the analytics tools in the
ecosystem around S3 & future proof the architecture

Designed for 11 9s
of durability
• Multiple Encryption Options
• Robust/Highly Flexible Access Controls
Durable Secure High performance
 Multiple upload
 Range GET
 Scalable Throughput
 Store as much as you need
 Scale storage and compute
independently
 Scale without limits
 Affordable
Scalable
 Amazon EMR
 Amazon Redshift
 Amazon DynamoDB
 Amazon Athena
 Amazon Rekognition
 Amazon Glue
Integrated
 Simple REST API
 AWS SDKs
 Read-after-create consistency
 Event notification
 Lifecycle policies
 Simple Management Tools
 Hadoop compatibility
Easy to use
Why Choose Amazon S3 for data lake?

“For our market
surveillance systems, we
are looking at about 40%
[savings with AWS], but
the real benefits are the
business benefits: We
can do things that we
physically weren’t able to
do before, and that is
priceless.”
- Steve Randich, CIO
Case Study: Re-architecting Compliance
What FINRA needed
• Infrastructure for its market surveillance platform
• Support of analysis and storage of approximately 75
billion market events every day
• Store 5PB of historical data for analysis & training
Why they chose AWS
• Fulfillment of FINRA’s security requirements
• Ability to create a flexible platform using dynamic
clusters (Hadoop, Hive, and HBase), Amazon EMR,
and Amazon S3
Benefits realized
• Increased agility, speed, and cost savings
• Estimated savings of $10-20m annually by using AWS

Encryption ComplianceSecurity
 Identity and Access
Management (IAM) policies
 Bucket policies
 Access Control Lists (ACLs)
 Private VPC endpoints to
Amazon S3
 SSL endpoints
 Server Side Encryption
(SSE-S3)
 S3 Server Side
Encryption with
provided keys (SSE-C,
SSE-KMS)
 Client-side Encryption
 Buckets access logs
 Lifecycle Management
Policies
 Access Control Lists
(ACLs)
 Versioning & MFA
deletes
 Certifications – HIPAA,
PCI, SOC 1/2/3 etc.
3: Implement the Right Security Controls

AWS Snowball & Snowmobile
• Accelerate PBs with AWS-provided
appliances
• 50, 80, 100 TB models
• 100PB Snowmobile
AWS Storage Gateway
• Instant hybrid cloud
• Up to 120 MB/s cloud upload rate
(4x improvement), and
4: Choose the Right Ingestion Methods
Amazon Kinesis Firehose
• Ingest device streams directly into
AWS data stores
AWS Direct Connect
• COLO to AWS
• Use native copy tools
Native/ISV Connectors
• Sqoop, Flume, DistCp
• Commvault, Veritas, etc
Amazon S3 Transfer Acceleration
• Move data up to 300% faster
using AWS’s private network

5: Catalog Your Data
S3
Put data in S3
Amazon
DynamoDB
Amazon
Elasticsearch Service
Metadata
What is in the data lake?
Documents the data lake
Summary statistics
Classification
Data
Sources
Search
capabilities
Glue Coming Mid-year
https://aws.amazon.com/answers/big-data/data-lake-solution/

Glue automates the undifferentiated heavy-lifting of ETL
 Cataloging data sources
 Identifying data formats and data types
 Generating Extract, Transform, Load code
 Executing ETL jobs; managing dependencies
 Handling errors
 Managing and scaling resources
Amazon Glue – in Preview

S3 Standard S3 Standard - Infrequent
Access
Amazon Glacier
Active data Archive dataInfrequently accessed data
Milliseconds Minutes to HoursMilliseconds
$0.021/GB/mo $0.004/GB/mo$0.0125/GB/mo
6: Keep More Data

7: Use Athena for Ad Hoc Data Exploration
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL

Athena is Serverless
• No Infrastructure or
administration
• Zero Spin up time
• Transparent upgrades

Query Data Directly from Amazon S3
• No loading of data
• Query data in its raw format
• Athena supports multiple data formats
• Text, CSV, TSV, JSON, weblogs, AWS service logs
• Or convert to an optimized form like ORC or Parquet for the
best performance and lowest cost
• No ETL required
• Stream data directly from Amazon S3

8: Use the Right Data Formats
• Pay by the amount of data scanned per query
• Use Compressed Columnar Formats
• Parquet
• ORC
• Easy to integrate with wide variety of tools
Dataset Size on Amazon S3 Query Run time Data Scanned Cost
Logs stored as Text
files
1 TB 237 seconds 1.15TB $5.75
Logs stored in
Apache Parquet
format*
130 GB 5.13 seconds 2.69 GB $0.013
Savings 87% less with Parquet 34x faster 99% less data scanned 99.7% cheaper

9: Choose the Right Tools
Amazon Redshift
Enterprise Data Warehouse
Amazon EMR
Hadoop/Spark
Amazon Athena
Clusterless SQL
Amazon Glue
Clusterless ETL
Amazon Aurora
Managed Relational Database
Amazon Machine Learning
Predictive Analytics
Amazon Quicksight
Business Intelligence/Visualization
Amazon ElasticSearch Service
ElasticSearch
Amazon ElastiCache
Redis In-memory Datastore
Amazon DynamoDB
Managed NoSQL Database
Amazon Rekognition & Amazon Polly
Image Recognition & Text-to-Speech AI APIs
Amazon Lex
Voice or Text Chatbots

A Sample Data Lake Pipeline
Ad-hoc access to data using Athena
Athena can query
aggregated datasets as well

Amazon S3
Data Lake
Amazon Kinesis
Streams & Firehose
Hadoop / Spark
Streaming Analytics Tools
Amazon Redshift
Data Warehouse
Amazon DynamoDB
NoSQL Database
AWS Lambda
Spark Streaming
on EMR
Amazon
Elasticsearch Service
Relational Database
Amazon EMR
Amazon Aurora
Amazon Machine Learning
Predictive Analytics
Any Open Source Tool
of Choice on EC2
AWS Data Lake
Analytic
Capabilities
Data Science Sandbox
Visualization /
Reporting
Apache Storm
on EMR
Apache Flink
on EMR
Amazon Kinesis
Analytics
Serving Tier
Clusterless SQL Query
Amazon Athena
DataSourcesTransactionalData
Amazon Glue
Clusterless ETL
Amazon ElastiCache
Redis

Use S3 as the storage repository for your data lake, instead
of a Hadoop cluster or data warehouse
Decoupled storage and compute is cheaper and more efficient
to operate
Decoupled storage and compute allow us to evolve to
clusterless architectures like Athena
Do not build data silos in Hadoop or the Enterprise DW
Gain flexibility to use all the analytics tools in the ecosystem
around S3 & future proof the architecture
10: Evolve as Needed

Fast Track to Your Data Lake on AWS

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Fast Track to Your Data Lake on AWS

Similar to Fast Track to Your Data Lake on AWS (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Recently uploaded

Recently uploaded (19)

Fast Track to Your Data Lake on AWS

Editor's Notes