Building Your First Data Lake on AWS
Simon Elisha, Principal Architect Public Sector, ANZ
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data Lake: any data, any analysis
Traditional Analytics Pipeline
[Diagram: Data Sources → ETL → Data Warehouse → Reporting and BI]
Data Lake Pipeline
[Diagram: Data Sources → ETL → Data Lake → Reporting, Exploration, Data Science]
Data Lake technology?
Hadoop seems perfect: scalable, flexible, manageable, cost effective.
BUT…
[Diagram: one block of STORAGE tied to many blocks of COMPUTE on the same nodes: Hadoop couples the two]
The Data Lake architecture
[Diagram: Sources feed the ingest layer (AWS Snowball, AWS DMS, Amazon Kinesis) at speed (real-time) and scale (batch); data lands in the Data Lake on Amazon S3, is transformed by ETL, and is served to data analysts, data scientists, business users, engagement platforms and automation/events]
Amazon S3
• Object storage
• Low cost
• Highly scalable
• 11 9s of durability
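As a sketch of how data might land in the lake, the helper below builds a date-partitioned S3 key. The Hive-style year=/month=/day= layout is one common convention, and the bucket, dataset and file names are hypothetical:

```python
from datetime import date

def lake_key(dataset: str, day: date, filename: str) -> str:
    """Build a date-partitioned S3 key for raw data landing in the lake."""
    return (f"raw/{dataset}/year={day.year}/"
            f"month={day.month:02d}/day={day.day:02d}/{filename}")

# With boto3 installed and AWS credentials configured, an upload would look like:
#   import boto3
#   boto3.client("s3").put_object(
#       Bucket="my-data-lake",   # hypothetical bucket
#       Key=lake_key("clickstream", date(2017, 5, 1), "events.csv"),
#       Body=open("events.csv", "rb"),
#   )
print(lake_key("clickstream", date(2017, 5, 1), "events.csv"))
```

Keeping the date in the key pays off later: the partitioning best practice at the end of the deck relies on exactly this kind of prefix layout.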
Amazon EMR
[Diagram: the Hadoop stack (Infrastructure, Data Layer, Process Layer, Framework with Pig and SQL, Applications) delivered as a managed service by Amazon EMR]
• Managed Hadoop
• Optimized with S3
• Open Source Support
Decouple Storage and Compute
[Diagram: a Hadoop Master Node fronting worker nodes that each bundle CPU, memory and storage, so scaling one dimension means scaling all of them]
[Diagram: the same Hadoop stack on Amazon EMR, with EMRFS connecting it to Amazon S3 as the data layer]
Workload-specific clusters
• Transient cluster: batch jobs
• Persistent cluster: interactive queries, with an external metastore on Amazon RDS
• Decouple compute and storage: the data lives in Amazon S3, not on the cluster
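A transient cluster might be launched with RunJobFlow parameters along these lines. This is a sketch: the instance types, release label, cluster name and log bucket are illustrative, not prescriptive:

```python
def transient_cluster_params(name: str, log_uri: str, steps: list) -> dict:
    """Sketch the RunJobFlow parameters for a transient EMR cluster:
    it runs its steps against data in S3, writes logs to S3, and
    terminates itself, since no state lives on the cluster."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.5.0",  # a release current at the time; adjust as needed
        "LogUri": log_uri,            # job logs outlive the cluster in S3
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m4.large", "InstanceCount": 1},
                {"InstanceRole": "CORE",   "InstanceType": "m4.large", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the batch
        },
        "Steps": steps,
    }

params = transient_cluster_params("nightly-batch", "s3://my-logs-bucket/emr/", [])
# boto3.client("emr").run_job_flow(**params)   # needs boto3 + credentials
```

`KeepJobFlowAliveWhenNoSteps: False` is what makes the cluster transient: because the data and logs are in S3, nothing of value is lost when it terminates.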
Match the EMR instance family to the workload (compute vs memory vs storage):

Workload                 Instance families
Machine Learning         C4, C3 (compute optimized)
Interactive Analysis     X1, R3 (memory optimized)
Large HDFS               D2, I2 (storage optimized)
General / Batch Process  M4, M3 (general purpose)
Compute Flexibility
On-Demand vs Spot pricing for an m3.2xlarge: $0.75 on-demand vs roughly $0.08 at the spot price
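Using the slide's figures, the saving from spot pricing is easy to quantify:

```python
on_demand = 0.75  # $/hour for an m3.2xlarge on-demand (slide's figure)
spot = 0.08       # $/hour at the observed spot price (slide's figure)

savings = 1 - spot / on_demand
print(f"Spot saves {savings:.0%} per instance-hour")  # prints "Spot saves 89% per instance-hour"
```

Spot capacity can be reclaimed at short notice, which is exactly why it suits transient, restartable batch clusters rather than persistent ones.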
[Diagram: the architecture with the ETL stage now running on Amazon EMR and AWS Glue, between an S3 landing zone and a processed S3 zone in the Data Lake]
AWS Glue
• Managed Transform Engine
• Job Scheduler
• Data Catalog
• Built on Apache Spark
Real-time ingest with Amazon Kinesis
[Diagram: an Amazon Kinesis Stream replicated across three Availability Zones fans out to AWS Lambda, a KCL App and Amazon EMR streaming, feeding the Data Lake and driving alerts, analysis, dashboards and predictions]
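The producer side of this pipeline can be sketched as follows. The stream name is hypothetical, and the actual PutRecord call needs boto3 and AWS credentials:

```python
import json

def kinesis_record(event: dict, partition_key: str) -> dict:
    """Shape one event as Kinesis PutRecord parameters. Records sharing a
    partition key land on the same shard, so per-key ordering holds."""
    return {
        "StreamName": "clickstream",         # hypothetical stream name
        "Data": json.dumps(event).encode(),  # Kinesis payloads are bytes
        "PartitionKey": partition_key,
    }

record = kinesis_record({"user": 42, "action": "click"}, partition_key="user-42")
# boto3.client("kinesis").put_record(**record)   # needs boto3 + credentials
```

Each consumer in the diagram (Lambda, a KCL app, EMR streaming) reads the same stream independently, which is what lets one ingest path feed alerts, dashboards and the lake at once.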
[Diagram: the architecture with real-time ingest added (AWS Lambda applications and Amazon EMR streaming alongside Amazon Kinesis) and a serving layer of Amazon EMR, Amazon Redshift, Amazon Athena and EC2]
Amazon Athena
• Query S3 data with SQL
• Serverless
• Instant spin-up
• Pay per query
[Diagram: Athena resolves table definitions from a catalog and queries the data in place on S3]
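Because Athena is serverless, "running a query" is a single API call. A minimal sketch of the StartQueryExecution parameters, with a hypothetical results bucket:

```python
def athena_query(sql: str, output_s3: str) -> dict:
    """Parameters for Athena's StartQueryExecution API. There is no
    cluster to size: you pay per query for the data scanned, and
    results are written to the S3 location you choose."""
    return {
        "QueryString": sql,
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

q = athena_query("SELECT COUNT(*) FROM access_logs",
                 "s3://my-results-bucket/athena/")   # hypothetical bucket
# boto3.client("athena").start_query_execution(**q)  # needs boto3 + credentials
```

Since pricing tracks bytes scanned, the format, compression and partitioning best practices later in the deck directly reduce the cost of every Athena query.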
Amazon Redshift
• MPP SQL database
• Optimised for analytics
• Gigabytes to exabytes
• Fully relational

Loading data from S3 into Amazon Redshift:
CREATE TABLE users (
  userid    integer,
  username  varchar(30),
  firstname varchar(30),
  lastname  varchar(30),
  city      varchar(30));

COPY users FROM 's3://data-bucket/users/';
-- loads USERS.TXT from S3 into the local users table
SELECT COUNT(*)
FROM users
WHERE city = 'Mumbai';
-- runs against the USERS table stored inside Redshift
Redshift Spectrum: define an external table over data that stays in S3.

CREATE EXTERNAL TABLE orders (
  orderid   integer,
  userid    integer,
  pricepaid decimal(8,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://data-bucket/orders/';
-- ORDERS.TXT stays in S3; Redshift Spectrum scans it at query time
SELECT AVG(pricepaid)
FROM users u, orders o
WHERE u.userid = o.userid
AND city = 'Mumbai';
-- joins the local USERS table with ORDERS.TXT in S3 via Redshift Spectrum
The complete pipeline
[Diagram: Sources are ingested via AWS Snowball, AWS DMS, Amazon Kinesis, AWS Lambda applications and Amazon EMR streaming into an Amazon S3 landing zone; ETL on Amazon EMR produces a processed S3 zone; Amazon EMR, Amazon Redshift, Amazon Athena and EC2 serve data analysts, data scientists, business users, engagement platforms and automation/events]
Security and monitoring: AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS KMS
Best Practices
• Data formats
• Compression
• Partitioning
Data formats
• Row-oriented: CSV, JSON, Avro
• Columnar: Parquet, ORC
Example: four records, three fields

ID   Age  State
123  20   NSW
345  25   WA
678  40   VIC
999  21   WA

Row format:    123 20 NSW | 345 25 WA | 678 40 VIC | 999 21 WA
Column format: 123 345 678 999 | 20 25 40 21 | NSW WA VIC WA
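The same example in code: transposing the row-format records yields the column format, where one field can be read without touching the others:

```python
# The slide's table, stored record-by-record (row format):
rows = [
    (123, 20, "NSW"),
    (345, 25, "WA"),
    (678, 40, "VIC"),
    (999, 21, "WA"),
]

# Column format stores each field's values together, so an analytic
# query that touches one column reads a fraction of the data:
columns = list(zip(*rows))
print(columns[2])  # ('NSW', 'WA', 'VIC', 'WA') -- only the State column
```

This is why columnar formats like Parquet and ORC suit the "count users by city" style of query that dominates analytics workloads.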
Compression
• Codecs: gzip, Snappy, zlib, LZO, bzip2
• Choose by trading off speed, size, splittability and fit with the data format
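A quick comparison on repetitive, log-like data using only the Python standard library (Snappy and LZO need third-party bindings, so only gzip, zlib and bzip2 appear here):

```python
import bz2
import gzip
import zlib

# Repetitive, log-like data compresses well with any of these codecs:
data = b"2017-05-01\tGET\t/index.html\t200\n" * 10_000

sizes = {
    "raw":   len(data),
    "gzip":  len(gzip.compress(data)),  # fast and ubiquitous, but a .gz file is not splittable
    "zlib":  len(zlib.compress(data)),  # same DEFLATE family as gzip
    "bzip2": len(bz2.compress(data)),   # usually smaller but slower; splittable
}
print(sizes)
```

Splittability matters for the scale (batch) path: a splittable codec lets EMR assign pieces of one large S3 object to many tasks instead of one.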
Partitioning
s3://mybucket/athena/inputdata/2016/data.csv
s3://mybucket/athena/inputdata/2015/data.csv
s3://mybucket/athena/inputdata/2014/data.csv

SELECT fields FROM access_logs
WHERE year = 2015
-- with year partitions, only the 2015/ prefix in S3 is scanned
CREATE EXTERNAL TABLE access_logs
(
  ip_address String,
  request_time Timestamp,
  request_method String,
  request_path String,
  request_protocol String,
  response_code String,
  response_size String,
  referrer_host String,
  user_agent String
)
PARTITIONED BY (year STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION 's3://mybucket/athena/inputdata/';
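Each year's S3 prefix still has to be registered as a partition before WHERE year = 2015 can prune anything; since the prefixes here are plain 2016/ rather than year=2016/, each partition needs an explicit LOCATION. A small generator for the required DDL, using the slide's bucket path:

```python
def add_partition_ddl(table: str, base: str, years) -> list:
    """One ALTER TABLE ... ADD PARTITION statement per year: registering
    each S3 prefix as a partition lets the engine prune prefixes instead
    of scanning every object under the table's location."""
    return [
        f"ALTER TABLE {table} ADD PARTITION (year='{y}') "
        f"LOCATION '{base}/{y}/'"
        for y in years
    ]

stmts = add_partition_ddl("access_logs",
                          "s3://mybucket/athena/inputdata",
                          [2014, 2015, 2016])
for stmt in stmts:
    print(stmt)
```

Running the generated statements once per new prefix (for example from the job that lands the data) keeps the partition list current without any full-table rescan.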
AWS Solution Builder – Data Lake on AWS
• Reference architecture deployment via CloudFormation
• Configures core services to tag, search and catalogue datasets
• Deploys a console to search and browse available datasets
http://amzn.to/2nTVjcp
Thank you
