Building a Server-less Data Lake on AWS - Technical 301

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sebastien Menant & Nam Je Cho, Enterprise Solutions Architects
Amazon Web Services

Agenda
• What is a Data Lake?
• Why You Need a Data Lake
• Building the Data Lake
• Demo
• Next Steps

Definition
“A data lake provides massive storage for
any kind of data, enormous processing
power and the ability to handle virtually
limitless concurrent tasks or jobs”
- Wikipedia

Characteristics of a Data Lake
• Collect Everything
• Dive in Anywhere
• Flexible Access

Why You Need a Data Lake

What About Modern Business Needs?

Big Data… and The Hadoop Ecosystem

But Both are Complementary
Amazon EMR
Amazon Redshift

[Diagram: STORAGE on Amazon S3, shared by multiple COMPUTE clusters on Amazon EMR]

New Business Outcomes and Capabilities
• Enable New Insights in Your Data
• Cost Savings of Compute and Storage
• Use the Right Tool for the Job
• Increase Durability of Data
• Charge Storage Costs to Owner
• Streaming and Real-time Analysis
Retain all your data, for years!

Building Blocks of the Data Lake
Storage and Ingestion
Catalogue and Search
Security
API and UI

Storage and Ingestion

Requirements for Storage
• Multi-year Scalable Storage Capability
• High Durability
• Store Raw Data from Any Input Sources
• Support for Any Data Type
• Low Cost

Key Services for Storage
Amazon S3
1. Highly Scalable and Durable
2. Security and Encryption
3. Lifecycle Management
4. Event Notifications
5. Versioning
Amazon Glacier
1. Long-term Archival Storage
2. Lifecycle Integration with S3
3. Extremely Low-cost
4. Vault Lock

[Diagram: Storage and Ingestion block with Amazon S3 and Amazon Glacier]

Recommendations #1
• S3 Buckets
  • Close to Users and Compute
  • Select Region for Regulatory Compliance
• Naming
  • Human-readable Path
  • Random Hash Prefix for Optimal Partitioning
• Format
  • Structured vs Unstructured + Compression
  • CSV, Parquet, ORC, JSON, XML, logs, etc.
  • GZIP for small files, Avro, LZO, Snappy
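
The naming advice above (a human-readable path combined with a short hash prefix) can be sketched as follows. This is a minimal illustration, not a scheme prescribed by the slides: the path layout `source/dataset/date/filename`, the function name `build_object_key` and the 4-character prefix length are all assumptions.

```python
import hashlib


def build_object_key(source: str, dataset: str, date: str, filename: str,
                     prefix_len: int = 4) -> str:
    """Return an object key made of a short hash prefix and a readable path.

    The path layout and prefix length are hypothetical choices for this sketch.
    """
    readable_path = f"{source}/{dataset}/{date}/{filename}"
    # Derive the prefix from the readable path so the same object always
    # maps to the same key, while keys spread across the key space.
    digest = hashlib.sha256(readable_path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{readable_path}"


key = build_object_key("crm", "orders", "2016-04-27", "part-0001.csv.gz")
print(key)
```

The readable part of the key stays intact after the prefix, so the object can still be located by source, dataset and date.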

Recommendations #2
• Optimise
  • Store Everything
  • Use Large Files with Split-able Format
  • Lifecycle Policies for Cost-savings
  • Tagging for Cost Allocation
• Security
  • Encryption
  • Bucket Policies, ACL, Tagging, CloudTrail
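
As an illustration of "Lifecycle Policies for Cost-savings", the sketch below builds a lifecycle configuration that moves objects under a prefix to Glacier after a number of days. The prefix `raw/`, the 90-day threshold, the rule ID and the bucket name in the usage comment are assumptions for the example. The dictionary follows the shape expected by boto3's `put_bucket_lifecycle_configuration`; the call itself is wrapped in a separate function so the builder runs without AWS credentials.

```python
def build_lifecycle_configuration(prefix: str, archive_after_days: int) -> dict:
    """Build a lifecycle configuration that archives a prefix to Glacier."""
    if archive_after_days < 0:
        raise ValueError("archive_after_days must not be negative")
    return {
        "Rules": [
            {
                # Hypothetical rule ID derived from the prefix.
                "ID": f"archive-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


def apply_lifecycle(bucket: str, config: dict) -> None:
    """Apply the configuration to a bucket (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder above has no dependencies

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = build_lifecycle_configuration("raw/", archive_after_days=90)
# apply_lifecycle("my-data-lake-bucket", config)  # hypothetical bucket name
```

No expiration rule is included, in keeping with the "Store Everything" recommendation.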

Requirements for Ingestion
• Batch File Support
• Traditional ETL
• Streaming Data
• Consumption of any Dataset as a Stream
• Low Latency Analytics
• Replay-ability from the Data Lake
• Server-less ETL Capabilities

Key Services for Ingestion
Amazon Kinesis Firehose
1. Easy to use with Agent
2. Automatic Elasticity
3. Near Real-time
4. Simultaneous Destinations
Amazon Kinesis Streams
1. Enables Custom Processing
2. Continuous Data Collection
3. Real-time
4. API Driven for Custom Apps

[Diagram: Data Sources → Amazon Kinesis Stream (across three Availability Zones) → AWS Lambda / KCL App → S3, DynamoDB, Redshift, EMR, Elasticsearch]

[Diagram: Storage and Ingestion block with Amazon S3, Amazon Glacier and Amazon Kinesis]

Recommendations
• Reminder
  • Added Complexity needs Business Justification
• Select the Right Tools
  • Real-time Analysis: Apache Spark Streaming, Storm, Flink
  • Firehose to Redshift for BI and Dashboards
• Tips
  • AWS Lambda for ETL Transformation
  • Persist Streams into S3
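
The tip "AWS Lambda for ETL Transformation" could look roughly like the handler below, which consumes a batch from a Kinesis stream. Kinesis delivers each payload base64-encoded under `Records[].kinesis.data`. The `transform` function and the JSON payload format are hypothetical, and the step that persists the result to S3 is left out so the sketch runs offline.

```python
import base64
import json


def transform(record: dict) -> dict:
    """Hypothetical transformation: normalise field names, drop empty values."""
    return {
        name.strip().lower(): value
        for name, value in record.items()
        if value not in ("", None)
    }


def handler(event, context):
    """Lambda entry point for a Kinesis Streams event source."""
    transformed = []
    for rec in event["Records"]:
        # Each payload arrives base64-encoded; JSON content is assumed here.
        payload = base64.b64decode(rec["kinesis"]["data"])
        transformed.append(transform(json.loads(payload)))
    # Persisting the batch to S3 (for example with boto3's put_object)
    # is omitted so the sketch needs no AWS credentials.
    return transformed


sample_event = {
    "Records": [
        {
            "kinesis": {
                "data": base64.b64encode(
                    json.dumps({" User ": "alice", "Note": ""}).encode("utf-8")
                ).decode("ascii")
            }
        }
    ]
}
result = handler(sample_event, None)
```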

Catalogue and Search

Requirements for Catalogue and Search
• Metadata Index
• Automated Metadata Processing
• Discovery and Search
• Data Classification
• Server-less and Event-driven

Key Services for Catalogue and Search
AWS Lambda
1. Server-less
2. Event Driven
3. Auto Scaling
4. Real-time
Amazon DynamoDB
1. NoSQL
2. Streams
3. Logstash Plugin
Amazon Elasticsearch Service
1. Deploy Simply
2. Easy Admin
3. Kibana

[Diagram: AWS Lambda, Amazon DynamoDB, Amazon Elasticsearch]

Recommendations
• Tips
  • Start Small and Simple… add Capabilities
  • File names, size, state, dates, tags, owner
  • Region, versions, lineage, relationships
  • Search Metadata and Object Content
• Events
  • S3 Triggers Lambda
  • DynamoDB Streams
  • Logstash Plugin to Elasticsearch
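
The "S3 Triggers Lambda" step can be sketched as a handler that turns an S3 event notification into metadata items for the index. The event fields used here (`s3.bucket.name`, `s3.object.key`, `s3.object.size`, `s3.object.eTag`, `eventTime`, `awsRegion`) are part of the S3 notification format. The item layout is an assumption, and the write to DynamoDB is omitted so the example runs without AWS access.

```python
from urllib.parse import unquote_plus


def metadata_from_s3_event(event: dict) -> list:
    """Extract one metadata item per object from an S3 event notification."""
    items = []
    for rec in event.get("Records", []):
        s3 = rec["s3"]
        items.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
            "size_bytes": s3["object"].get("size", 0),
            "etag": s3["object"].get("eTag"),
            "event_time": rec.get("eventTime"),
            "region": rec.get("awsRegion"),
        })
    return items


def handler(event, context):
    """Lambda entry point for an S3 object-created trigger."""
    items = metadata_from_s3_event(event)
    # Writing the items to a DynamoDB table (for example with boto3's
    # Table.put_item) is left out so the sketch runs offline.
    return items


sample_event = {
    "Records": [
        {
            "awsRegion": "ap-southeast-2",
            "eventTime": "2016-04-27T02:15:00.000Z",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "raw/crm/orders+2016.csv", "size": 2048,
                           "eTag": "abc123"},
            },
        }
    ]
}
items = handler(sample_event, None)
```

Further attributes from the list above, such as tags, owner or lineage, could be added to each item as the catalogue grows.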

Security

Requirements for Security
• Data Encryption at Rest
• Authentication
• Authorisation

Key Services for Security
AWS IAM
1. Users and Roles
2. Identity Federation
3. Multi Factor Authentication
4. Granular Permissions
AWS KMS
1. Seamless Service Integration
2. Extensive Compliance
AWS CloudHSM
SSE-S3

[Diagram: Security block with AWS KMS and AWS IAM]

Recommendations
• Start Early
  • Security Needs Practice!
  • Federate with your Corporate Directory
• Best Practice
  • Use CloudTrail and CloudWatch
  • Encrypt Where Possible
  • Select Bucket Region for Regulatory Compliance
• Tips
  • IAM Policies, S3 Versioning and MFA Delete
  • Lambda for Data Masking
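
One possible shape for "Lambda for Data Masking" is a function that replaces sensitive fields with keyed-hash tokens before a record is served. This is only a sketch: the field names in `SENSITIVE_FIELDS`, the 16-character token length and the demo secret are assumptions. In a real deployment the secret would be retrieved from a managed key store, not embedded in code.

```python
import hashlib
import hmac

# Hypothetical set of fields treated as sensitive.
SENSITIVE_FIELDS = frozenset({"email", "phone"})


def mask_value(value: str, secret: bytes) -> str:
    """Replace a value with a stable token derived from a keyed hash."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_record(record: dict, secret: bytes, sensitive=SENSITIVE_FIELDS) -> dict:
    """Return a copy of the record with sensitive fields masked."""
    return {
        field: mask_value(str(value), secret) if field in sensitive else value
        for field, value in record.items()
    }


secret = b"demo-only-secret"  # placeholder; never hard-code a real secret
record = {"email": "jane@example.com", "country": "AU", "phone": "+61 400 000 000"}
masked = mask_record(record, secret)
```

Because the token is deterministic for a given secret, masked values can still be joined or counted without exposing the original data.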

API and UI

Requirements for API and UI
• Serve Data and Capabilities to Customers
• Programmatically
• Search Catalogue
• Run Compute
• Extend Access Control Management
• And… Use of Familiar Visualisation Tools

Key Services for API and UI
Amazon API Gateway
1. Performance at Any Scale
2. Create RESTful Frontend
3. Managed API Lifecycle
AWS Lambda
1. Enables Server-less API
2. Custom Logic for Services
3. Automatic Scaling

[Diagram: API and UI block with Amazon API Gateway and AWS Lambda]

Recommendations
• Tips
  • Go Server-less!
  • Extend Existing AWS Services and Build Custom Logic
  • Data Management, Processing and Transformations
• API Gateway for Data Access
  • Serve the Data, Search and Compute via RESTful APIs
  • Distribute a Custom SDK
• Extend the Solution
  • Build Advanced Security Controls using Metadata Index
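
Serving search through a RESTful API could be sketched as a Lambda function behind API Gateway, assuming the Lambda proxy integration (the handler receives `queryStringParameters` and returns `statusCode`, `headers` and a string `body`). The in-memory `CATALOGUE`, the query parameter name `q` and the matching rule are hypothetical stand-ins for a real metadata index lookup.

```python
import json

# Hypothetical in-memory stand-in for the metadata index.
CATALOGUE = [
    {"key": "raw/crm/orders.csv", "owner": "sales", "tags": ["crm", "orders"]},
    {"key": "raw/web/clicks.json", "owner": "marketing", "tags": ["web"]},
]

JSON_HEADERS = {"Content-Type": "application/json"}


def search(term: str) -> list:
    """Return catalogue entries whose key or tags match the term."""
    term = term.lower()
    return [
        entry for entry in CATALOGUE
        if term in entry["key"].lower() or term in entry["tags"]
    ]


def handler(event, context):
    """Lambda entry point for an API Gateway proxy integration."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    term = params.get("q", "")
    if not term:
        return {
            "statusCode": 400,
            "headers": JSON_HEADERS,
            "body": json.dumps({"error": "missing query parameter q"}),
        }
    return {
        "statusCode": 200,
        "headers": JSON_HEADERS,
        "body": json.dumps({"results": search(term)}),
    }


response = handler({"queryStringParameters": {"q": "crm"}}, None)
```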

The Whole Picture…
Storage and Ingestion
Catalogue and Search
Security
API and UI

[Diagram: the complete architecture]
Storage and Ingestion: Amazon S3, Amazon Glacier, Amazon Kinesis
Catalogue and Search: AWS Lambda, Amazon DynamoDB, Amazon Elasticsearch
Security: AWS KMS, AWS IAM
API and UI: Amazon API Gateway, AWS Lambda
Also shown: Amazon EMR, Amazon RDS, Amazon Redshift, Users

A Data Lake is…
• Foundation of Data Storage and Streaming Data
• Metadata index to help Categorise and Govern
• Search Index to Enable Data Discovery
• Robust Set of Security Controls
• Governance Through Technology Not Policy
• Interface to Expose Data and Capabilities to Users

Building Catalogue and Search
[Diagram: Data Flow: Data Source → S3 Bucket → Lambda → DynamoDB (Metadata Index) → Logstash → ElasticSearch]
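
To make the DynamoDB-to-search-index hop more concrete, the sketch below converts DynamoDB Streams records into plain documents of the kind a search index would store. The deck uses a Logstash plugin for this step; the Python version here is only an illustration of the record shape, and the attribute names in the sample are assumptions. Stream records carry typed attributes such as `{"S": "text"}` or `{"N": "42"}` under `dynamodb.NewImage`.

```python
def plain_value(attr: dict):
    """Convert one DynamoDB typed attribute, e.g. {"S": "abc"}, to a Python value."""
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        # Numbers are transported as strings in stream records.
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "SS":
        return sorted(value)
    if type_tag == "L":
        return [plain_value(item) for item in value]
    if type_tag == "M":
        return {name: plain_value(item) for name, item in value.items()}
    raise ValueError(f"unsupported attribute type: {type_tag}")


def documents_from_stream(event: dict) -> list:
    """Build one plain document per inserted or modified item."""
    docs = []
    for rec in event.get("Records", []):
        if rec.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # REMOVE events carry no new image
        image = rec["dynamodb"].get("NewImage")
        if image is None:
            continue  # the stream view type may not include new images
        docs.append({name: plain_value(attr) for name, attr in image.items()})
    return docs


sample_event = {
    "Records": [
        {
            "eventName": "INSERT",
            "dynamodb": {
                "NewImage": {
                    "key": {"S": "raw/crm/orders.csv"},
                    "size_bytes": {"N": "2048"},
                    "tags": {"SS": ["orders", "crm"]},
                }
            },
        },
        {"eventName": "REMOVE", "dynamodb": {"Keys": {"key": {"S": "old.csv"}}}},
    ]
}
docs = documents_from_stream(sample_event)
```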

Next Steps
• How to Get Started
• AWS Documentation
• Getting Started Guide
• AWS Training & Certification
• Big Data on AWS
• AWS Partner Network
• AWS Professional Services
• Big Data Specialists

AWS Training & Certification
Intro Videos & Labs
Free videos and labs to help you learn to work with 30+ AWS services in minutes!
Training Classes
In-person and online courses to build technical skills, taught by accredited AWS instructors
Online Labs
Practice working with AWS services in a live environment and learn how related services work together
AWS Certification
Validate technical skills and expertise, identify qualified IT talent or show you are AWS cloud ready
Learn more: aws.amazon.com/training

Your Training Next Steps:
ü Visit the AWS Training & Certification pod to discuss your
training plan & AWS Summit training offer
ü Register & attend AWS instructor led training
ü Get Certified
AWS Certified? Visit the AWS Summit Certification Lounge to pick up your swag
Learn more: aws.amazon.com/training
