Building your Datalake on AWS

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Dickson Yue
Solution Architect
21 June 2017
Building Your First Data Lake
Modern Data Architectures on AWS

Today's conversation
Business drivers for a Data Lake
Designing and building
Production use cases

Business Outcomes on a Modern Data Architecture
Outcome 1: Modernize and consolidate
• Insights to enhance business applications and create new digital services
Outcome 2: Innovate for new revenues
• Personalization, demand forecasting, risk analysis
Outcome 3: Real-time engagement
• Interactive customer experience, event-driven automation, fraud detection
Outcome 4: Automate for expansive reach
• Automation of business processes and physical infrastructure

Expanding access requirements
Personas and systems: Data scientists, Data analysts, Business users, Engagement platforms, Automation / events
1. More personas need access to data, through appropriate tools
2. More systems need to link to data for decision and process automation
3. Users need to be able to find information, and access it securely

Exponential growth of business data
1. Data must be captured from diverse sources at speed and scale
2. Data needs to be pulled together, breaking down traditional silos
3. Benefits need to far outweigh the costs of collection and analysis
Data sources: Transactions, ERP, Connected devices, Social media, Web logs / cookies

Modern data architecture
Insights to enhance business applications, new digital services
• Data sources
• Ingest: Speed (Real-time), Scale (Batch)
• Serving: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events

Architecture diagram:
• Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
• Ingest: Speed (Real-time), Scale (Batch)
• Serving: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events

Architecture diagram:
• Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
• Ingest: Speed (Real-time), Scale (Batch)
• Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless (Amazon Elasticsearch), Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
• Consumers: Data analysts, Data scientists, Business users, Automation / events

Characteristics of a Data Lake
• Future Proof
• Flexible Access
• Dive in Anywhere
• Collect Anything

Architecture diagram:
• Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
• Ingest: Speed (Real-time), Scale (Batch)
• Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless, Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
• Consumers: Data analysts, Data scientists, Business users, Automation / events
Amazon S3
• Durable: designed for 11 9s of durability
• Available: designed for 99.99% availability
• Scalable: store as much as you need; scale storage and compute independently
• Integrated: Amazon Redshift / Spectrum, Amazon EMR, Amazon Athena, Amazon DynamoDB
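To make the durability and availability figures concrete, here is a small arithmetic sketch. It assumes both figures are read as annual design targets; the function names are illustrative and not part of any AWS API.

```python
from fractions import Fraction

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def expected_annual_object_loss(num_objects, nines=11):
    """Expected number of objects lost per year at a durability of `nines` 9s."""
    # 11 nines of durability means a loss probability of 1 in 10**11 per object.
    return num_objects * Fraction(1, 10 ** nines)


def max_annual_downtime_minutes(availability_percent):
    """Minutes of unavailability per year allowed by an availability target."""
    unavailable = 1 - Fraction(availability_percent) / 100
    return unavailable * MINUTES_PER_YEAR


if __name__ == "__main__":
    loss = expected_annual_object_loss(10_000_000)
    print("Expected objects lost per year:", float(loss))
    print("Years per single lost object:", float(1 / loss))
    print("Downtime minutes per year:", float(max_annual_downtime_minutes("99.99")))
```

Under these assumptions, storing ten million objects corresponds to an expected loss of one object roughly every 10,000 years, and 99.99% availability corresponds to a little under an hour of unavailability per year.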

Architecture diagram:
• Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
• Ingest: Speed (Real-time), Scale (Batch)
• Amazon S3: Staged Data (Data Lake)
• Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless, Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
• Consumers: Data analysts, Data scientists, Business users, Automation / events

Important Components of a Data Lake
• Catalogue & Search
• Protect & Secure
• Access & User Interface
• Ingest & Store

Data Ingestion into S3
• AWS Direct Connect
• AWS Snowball
• ISV Connectors
• Amazon Kinesis Firehose
• AWS Storage Gateway
• S3 Transfer Acceleration
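As an illustration of the Kinesis Firehose ingestion path, the sketch below prepares events for a batched put. The stream name and the `send` helper are hypothetical. The sender is passed in as a callable so the code runs without AWS credentials; with boto3 it would typically be `boto3.client("firehose").put_record_batch`. The sketch respects the 500-records-per-call limit but, for brevity, ignores the per-call size limit and does not retry failed records.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events: Iterable[dict]) -> List[Dict[str, bytes]]:
    """Serialize each event as one newline-terminated JSON record."""
    # Firehose concatenates records on delivery, so the trailing newline
    # keeps one event per line in the resulting S3 object.
    return [
        {"Data": (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")}
        for event in events
    ]


def batches(records: List[dict], size: int = MAX_RECORDS_PER_BATCH) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send(events: Iterable[dict], put_record_batch: Callable[..., object],
         stream_name: str = "clickstream-to-s3") -> int:
    """Send all events in batches; returns the number of records handed over.

    `stream_name` is a hypothetical delivery stream name.
    """
    sent = 0
    for batch in batches(to_firehose_records(events)):
        put_record_batch(DeliveryStreamName=stream_name, Records=batch)
        sent += len(batch)
    return sent
```

A production sender would also inspect the response for failed records and resend them.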

Architecture diagram:
• Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
• Ingest: AWS Database Migration, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
• Paths: Speed (Real-time), Scale (Batch)
• Amazon S3: Staged Data (Data Lake)
• Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless, Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
• Consumers: Data analysts, Data scientists, Business users, Automation / events

Building a Data Catalogue
• Aggregated information about your storage & streaming layer
• Storage service for metadata: ownership, data lineage
• Data abstraction layer: customer data = collection of prefixes
• Enabling data discovery
• API for use by entitlements service

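Populating Metadata and Search
1. PUT OBJECT into the S3 bucket
2. The CREATE OBJECT event invokes AWS Lambda
3. AWS Lambda does a PUT ITEM into Amazon DynamoDB (+ Streams)
4. The UPDATE STREAM event invokes a second AWS Lambda
5. That AWS Lambda does an UPDATE INDEX in Amazon Elasticsearch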
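The metadata flow on this slide (S3 event, Lambda, DynamoDB item, stream, Lambda, search index) can be sketched with two small handlers. This is a minimal sketch, not the presenter's implementation: the item field names are assumptions, and the storage call is injected as a plain callable (`put_item`) so the example runs without AWS. The event shapes follow the S3 notification and DynamoDB Streams record formats.

```python
from urllib.parse import unquote_plus


def s3_event_to_items(event):
    """Turn an S3 object-created notification into metadata items."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        items.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size": s3["object"].get("size", 0),
            "etag": s3["object"].get("eTag"),
            "created_at": record.get("eventTime"),
        })
    return items


def make_index_handler(put_item):
    """First Lambda: write one metadata item per new object.

    `put_item` is a hypothetical callable; with boto3 it could wrap
    a DynamoDB table's put_item call.
    """
    def handler(event, context=None):
        items = s3_event_to_items(event)
        for item in items:
            put_item(item)
        return len(items)
    return handler


def stream_event_to_docs(event):
    """Second Lambda: flatten DynamoDB Streams images into search documents."""
    docs = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # deletions would need a separate index removal
        image = record["dynamodb"]["NewImage"]
        # Each attribute looks like {"S": "text"} or {"N": "12"}; keep the raw
        # value (numbers remain strings in this sketch).
        docs.append({name: next(iter(typed.values())) for name, typed in image.items()})
    return docs
```

The documents returned by `stream_event_to_docs` are what the second function would send to the search index.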

AWS Glue
• Managed Transform Engine
• Job Scheduler
• Data Catalog
• Built on Apache Spark
• Integrated with S3, RDS, Redshift & any JDBC-compliant data store
Implement the right cloud security controls
Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Private VPC endpoints to Amazon S3
• Pre-signed S3 URLs
Encryption
• SSL endpoints
• Server Side Encryption (SSE-S3)
• S3 Server Side Encryption with provided keys (SSE-C, SSE-KMS)
• Client-side Encryption
Audit & Compliance
• Bucket access logs
• Lifecycle Management Policies
• Versioning & MFA Delete
• Certifications: HIPAA, PCI, SOC 1/2/3 etc.
Architecture diagram:
• Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
• Ingest: AWS Database Migration, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
• Paths: Speed (Real-time), Scale (Batch)
• Amazon S3: Staged Data (Data Lake)
• Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless, Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
• Consumers: Data analysts, Data scientists, Business users, Automation / events
• Supporting services: AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS KMS

Architecture diagram:
• Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
• Ingest: AWS Database Migration, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
• Speed (Real-time): Event Capture (Amazon Kinesis), Stream Analysis (Amazon EMR)
• Scale (Batch): Raw Data (Amazon S3), ETL (Amazon EMR), Advanced Analytics (MLlib)
• Amazon S3: Staged Data (Data Lake)
• Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless, Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
• Consumers: Data analysts, Data scientists, Business users, Automation / events
• Supporting services: AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS KMS

Angus Tse
Director of Engineering
21 June 2017
Clickstream Analytics Pipeline
HK01

"DATA BEATS EMOTIONS"
Sean Rad, Founder & CEO, Tinder

Google Analytics (GA)
• Free and easy
• Excellent for initial
• Good learning materials
• Free version has latency and accuracy issues
• GA 360 (Premium) + BigQuery are expensive
• Not flexible enough

Our needs
• Large data volume
• Raw data for Machine Learning
• Flexible for further processing
• Low latency

Building a scalable pipeline on AWS

Piwik
• Open-source analytics platform
• Real-time dashboard
• Web & mobile SDK
• PageView
• Content / Media
• A/B Test

Phase 1
Components: AWS Lambda, API Gateway, Kinesis Firehose, Redshift, QuickSight
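One plausible wiring of the Phase 1 components is that API Gateway receives a tracking hit, a Lambda function validates it, and Kinesis Firehose delivers it toward Redshift. The slide does not state this ordering, so treat it as an assumption. The sketch below shows such a Lambda handler; the parameter names (`url`, `action`, `visitor_id`) and the stream name are hypothetical, and the Firehose call is injected so the code runs without AWS (with boto3 it could be `boto3.client("firehose").put_record`).

```python
import json
import time


def make_tracking_handler(put_record, stream_name="clickstream", now=time.time):
    """Build a Lambda handler for API Gateway proxy events.

    `put_record` is an injected callable taking DeliveryStreamName and Record
    keyword arguments; `stream_name` is a hypothetical delivery stream.
    """
    def handler(event, context=None):
        # queryStringParameters is None when the request has no query string.
        params = event.get("queryStringParameters") or {}
        if "url" not in params:
            return {"statusCode": 400, "body": "missing url"}
        hit = {
            "ts": int(now()),
            "url": params["url"],
            "action": params.get("action", "pageview"),
            "visitor_id": params.get("visitor_id"),
        }
        # One JSON document per line keeps the delivered files easy to load.
        data = (json.dumps(hit, sort_keys=True) + "\n").encode("utf-8")
        put_record(DeliveryStreamName=stream_name, Record={"Data": data})
        return {"statusCode": 204, "body": ""}
    return handler
```

Keeping the handler this thin leaves buffering, retries, and loading to the managed services, which fits the small build effort reported for the first version.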

Experience on AWS
• Complete and Integrated
• Quick: 2 man-weeks for the first version
• Easy to scale
• Minimal maintenance cost

Future
• More serverless in future
• S3 as data lake: click events, system logs, etc.; raw and processed data (like ML results)
• Hot on disk, cold on S3
• Explore AWS Machine Learning
Future architecture diagram:
• Pipeline: AWS Lambda, API Gateway, Kinesis Firehose
• Hot data: Redshift
• S3: Raw data, Processed data, Cold data
• Direct Query: Athena, Redshift Spectrum
• Machine learning: EMR (SparkML), ML, P2 (Deep learning AMI)
• Visualization: QuickSight
• Serverless

Summary
1. S3 as data lake
2. Pick the right tool to match the persona requirements
3. Go serverless
