Building a Data Lake on AWS

© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Johan Broman
Manager, Solutions Architecture – AWS Nordics
Theo Hultberg
Director of Technology - Burt
Big Data and Data Lakes
Building Blocks

Data Drives Better Decision Making
Data lake leaders who were highly efficient in capturing a diversity of data and making it accessible to their organization in a timely fashion outperformed their peers by 9% in organic revenue growth.*
[Chart: organic revenue growth. Leaders 24%, Followers 15%]
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence

Traditionally, Analytics Looked Like This
OLTP, ERP, CRM, LOB
Data warehouse
Business intelligence
Relational data
TBs-PBs scale
Schema defined before data load
Operational reporting and on demand
Large initial capex + $10K–$50K / TB / Year

Isolated data silos
Hadoop cluster
SQL database
Data warehouse appliance

Enter Data Lake Architectures
A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes and heterogeneous types of data.

Data Lakes on AWS
Unmatched durability and availability at exabyte scale
Comprehensive security, compliance, and audit capabilities
Object-level controls
Usage and cost analysis insight into your data
Most ways to bring data in
Twice as many partner integrations
Data lake: Amazon S3, Amazon Glacier, AWS Glue
Machine Learning
Analytics
Internet of Things
Snowball
Snowmobile
Kinesis Data Firehose
Kinesis Data Streams
Kinesis Video Streams

Why Amazon S3 for the Data Lake?
Durable
- Designed for 11 9s (99.999999999%) of durability
Available
- Designed for 99.99% availability
High performance
- Multipart upload
- Range GET
Scalable
- Store as much as you need
- Scale storage and compute independently
- No minimum usage commitments
Integrated
- Amazon Redshift / Spectrum
- Amazon EMR
- Amazon Athena
- Amazon DynamoDB
Easy to use
- Simple REST API
- AWS SDKs
- Read-after-create consistency
- Event notification
- Lifecycle policies
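
Multipart upload and Range GET both work by splitting one large object into byte spans that can be transferred in parallel. The sketch below shows only the planning step, with no network calls. The helper names plan_parts and range_header and the 8 MiB default part size are illustrative assumptions; the 5 MiB minimum part size and the 10,000-part limit are the published S3 multipart limits.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts in one upload


def plan_parts(total_size, part_size=8 * MIB):
    """Return inclusive (start, end) byte offsets that cover total_size bytes."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB multipart minimum")
    parts = [
        (start, min(start + part_size, total_size) - 1)
        for start in range(0, total_size, part_size)
    ]
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


def range_header(start, end):
    """Format an HTTP Range header value for an inclusive byte span."""
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # Plan a 20 MiB object: two full 8 MiB parts and one 4 MiB remainder.
    for start, end in plan_parts(20 * MIB):
        print(range_header(start, end))
```

Each span could be sent as one part of a multipart upload, or requested with a Range header on a GET, so that several workers move data at the same time.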

Data Lakes on AWS
Data lake: Amazon S3, Amazon Glacier, AWS Glue
Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
Internet of Things (IoT): AWS IoT Core, AWS Greengrass, AWS IoT Analytics, Amazon FreeRTOS, AWS IoT 1-Click, AWS IoT Button, AWS IoT Device Management, AWS IoT Device Defender

Data Lakes Extend the Traditional Approach
Relational and non-relational data
TBs-EBs scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low-cost storage and analytics
OLTP, ERP, CRM, LOB
Data warehouse
Business intelligence
Data lake
Devices, Web, Sensors, Social
Catalog
Machine learning
DW queries
Big data processing
Interactive
Real-time
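
"Schema defined during analysis" means the raw records are stored untouched and a schema is applied only when someone reads them. The following is a minimal sketch of that idea in plain Python. The sample records, the TEMPERATURE_SCHEMA mapping, and the read_with_schema helper are all made up for illustration and are not part of any AWS service.

```python
import json

# Raw events stored exactly as they arrived; fields vary by source.
RAW_LINES = [
    '{"device": "sensor-1", "temp_c": "21.5", "ts": "2018-05-01T10:00:00"}',
    '{"device": "sensor-2", "temp_c": 19, "ts": "2018-05-01T10:00:05", "fw": "1.2"}',
    '{"user": "alice", "page": "/home", "ts": "2018-05-01T10:00:07"}',
]

# A schema chosen by the analyst at read time: column name -> type converter.
TEMPERATURE_SCHEMA = {"device": str, "temp_c": float, "ts": str}


def read_with_schema(lines, schema):
    """Yield rows that contain every schema column, casting fields on read."""
    for line in lines:
        record = json.loads(line)
        if not all(column in record for column in schema):
            continue  # record belongs to a different data set; skip it
        yield {column: cast(record[column]) for column, cast in schema.items()}


rows = list(read_with_schema(RAW_LINES, TEMPERATURE_SCHEMA))
for row in rows:
    print(row)
```

A different analysis could read the same stored lines with a different schema, for example one that selects the page-view record, without rewriting anything in storage.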

Storing is not enough. Data needs to be discoverable.
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” Gartner
CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data, Semi-structured, Unstructured

AWS Glue: Data Catalog
Make data discoverable
Automatically discovers data and stores schema
Catalog makes data searchable and available for ETL
Catalog contains table and job definitions
Computes statistics to make queries efficient
Compliance
AWS Glue Data Catalog: discover data and extract schema
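
To make "discover data and extract schema" concrete, here is a toy schema inference over a few sample records. It is a sketch of the general idea only and is not how AWS Glue crawlers are implemented; the function names, the type names, and the widening rules are assumptions chosen for illustration.

```python
from collections import defaultdict


def type_name(value):
    """Map a Python value to a simple column type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_schema(records):
    """Infer one type per column from sample records, ignoring nulls."""
    seen = defaultdict(set)
    for record in records:
        for column, value in record.items():
            if value is not None:
                seen[column].add(type_name(value))
    schema = {}
    for column, types in seen.items():
        if len(types) == 1:
            schema[column] = next(iter(types))
        elif types == {"bigint", "double"}:
            schema[column] = "double"  # widen mixed numeric columns
        else:
            schema[column] = "string"  # fall back to the most general type
    return schema


SAMPLE = [
    {"app": "web", "users": 10, "ratio": 0.5},
    {"app": "ios", "users": 12.5, "ratio": None, "active": True},
]
print(infer_schema(SAMPLE))
```

The inferred mapping is the kind of table definition a catalog would store so that query engines and ETL jobs can find and interpret the data.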

Data preparation accounts for ~80% of the work.
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other

AWS Glue: ETL Service
Make ETL scripting and deployment easy
Automatically generates ETL code
Code is customizable with Python and Spark
Endpoints provided to edit, debug, & test code
Jobs are scheduled or event-based
Serverless

Amazon Athena: Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query instantly: Zero setup cost; just point to Amazon S3 and start querying.
Pay per query: Pay only for queries run; save 30–90% on per-query costs through compression.
Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types.
Easy: Serverless, with zero infrastructure and zero administration. Integrated with Amazon QuickSight.
Amazon Athena

Amazon EMR: Big Data Processing
Analytics and ML at scale
Nineteen open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
Latest versions: Updated with the latest open source frameworks within 30 days of release.
Low cost: Flexible billing with per-second billing, EC2 spot, reserved instances and auto-scaling to reduce costs 50–80%.
Use S3 storage: Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector.
Easy: Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, & cluster tuning.
Data Lake

Amazon EMR

Theo Hultberg
director of technology
@iconara

API
SELECT "app", "account", SUM("logged_in_users")
FROM "usage"
WHERE "date" BETWEEN DATE '2018-05-01' AND DATE '2018-05-16'
GROUP BY 1, 2
Athena
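
To see what the query above returns, it can be tried against a small in-memory table. The sketch below uses SQLite from the Python standard library as a stand-in for Athena, so two things differ from the slide: dates are stored as ISO strings and compared without the DATE keyword, and an ORDER BY is added to make the output order predictable. The table columns follow the query; the rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "usage" '
    '("app" TEXT, "account" TEXT, "date" TEXT, "logged_in_users" INTEGER)'
)
conn.executemany(
    'INSERT INTO "usage" VALUES (?, ?, ?, ?)',
    [
        ("web", "acme", "2018-05-01", 10),
        ("web", "acme", "2018-05-10", 15),
        ("web", "globex", "2018-05-16", 7),
        ("ios", "acme", "2018-04-30", 99),  # before the range; excluded
        ("ios", "acme", "2018-05-17", 42),  # after the range; excluded
        ("ios", "globex", "2018-05-02", 3),
    ],
)

# Same shape as the slide's query, adapted to SQLite as described above.
QUERY = """
SELECT "app", "account", SUM("logged_in_users")
FROM "usage"
WHERE "date" BETWEEN '2018-05-01' AND '2018-05-16'
GROUP BY 1, 2
ORDER BY 1, 2
"""

results = conn.execute(QUERY).fetchall()
for row in results:
    print(row)
```

BETWEEN is inclusive on both ends, so rows dated 2018-05-01 and 2018-05-16 are counted, and GROUP BY 1, 2 groups by the first two selected columns, "app" and "account".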

S3 Glue
Aurora
API
ETL
Redshift
Athena

Thank you!
