Module 1 - CP Datalake on AWS

© 2020, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Introduction to Data Lake on AWS
Tuan Vo
Solutions Architect
mintuan@amazon.com

“Organizations that successfully generate business value from their data will outperform their peers.”
To Become a Leader, Data is Your Differentiator

For Data to Be a Differentiator, Customers Need
to Be Able to…
• Capture and store new non-relational
data at PB-EB scale in real time
• Run new types of analytics that go beyond
batch reporting to incorporate real-time,
predictive, voice, and image recognition
• Democratize access to data in a secure
and governed way
New types of analytics: dashboards, predictive, image recognition, voice, real-time
New types of data

Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/Year

Data Lakes Extend the Traditional Approach
Data Warehouse
Business Intelligence
OLTP ERP CRM LOB
• Relational and non-relational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Big Data processing,
real-time, Machine Learning
Data Lake

Data Lakes and Analytics from AWS
Cost-effective
Scalable and durable
Secure
Open and comprehensive
Analytics
Machine Learning
Real-time Data Movement
On-premises Data Movement
Data Lake on AWS

Amazon S3
Amazon Glacier
AWS Glue
Store Data in the Format You Want
• Store data in the format you want:
• Text files like CSV
• Columnar formats like Apache Parquet and Apache ORC
• Log data parsed with Logstash-style Grok patterns
• JSON (simple, nested), Avro
• And more…
Formats shown: CSV, ORC, Grok, Avro, Parquet, JSON
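
As a small illustration of moving data between two of the formats listed above, the sketch below converts CSV text into JSON Lines using only the Python standard library. The function name `csv_to_json_lines` and the sample sensor data are hypothetical; columnar formats such as Parquet or ORC would need a third-party library and are not shown.

```python
import csv
import io
import json


def csv_to_json_lines(csv_text: str) -> str:
    """Convert CSV text with a header row into JSON Lines (one object per line)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return "\n".join(json.dumps(row) for row in reader)


# Hypothetical sample data. Values stay strings because CSV carries no type information.
sample = "device_id,temperature\nsensor-1,21.5\nsensor-2,19.0\n"
print(csv_to_json_lines(sample))
```

Either representation can be stored in the data lake as-is; the choice mostly affects how downstream engines parse it.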

Analyze with the Broadest Set of Analytic Tools
• Analyze data with the broadest selection
of analytics tools
• Data warehousing
• Interactive SQL queries
• Big Data processing
• Real-time analytics
• Dashboards & Visualizations
• Machine Learning
• Query in place without moving to a
separate analytics system
• Up to 400% faster with S3 Select and
Glacier Select
• Largest ISV ecosystem with built-in
integration
• Ensures you can meet existing and future
use cases, minimizing risks
Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
Amazon S3, Amazon Glacier, AWS Glue

Data Lakes from AWS
Data Lake on AWS
Cost-effective
Secure
Analytics
Machine Learning
Real-time Data Movement
On-premises Data Movement

AWS Provides Highest Levels of Security
Secure
Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC
Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
Customers need multiple levels of security, identity and access management,
encryption, and compliance to secure their data lake

Any Scale
• S3 has trillions of objects and exabytes of data
• Built to store any amount of data
• Run analytic engines at largest scale by spinning
up any amount of compute resources in minutes
• Runs on the world’s largest global
cloud infrastructure

Unmatched Durability and Availability
• Designed to deliver 99.999999999% durability
• Geographic redundancy & automatic replication
• Store data in multiple data centers across 3 AZs
in a single region
• Seamlessly replicates data between regions
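
To make the eleven-nines figure concrete, here is a minimal arithmetic sketch. It assumes the durability number is an annual, per-object figure (the slide does not state the time period), and the function names are hypothetical. `Fraction` is used so the tiny loss probability is not distorted by floating-point rounding.

```python
from fractions import Fraction

# Design durability from the slide: 99.999999999% (eleven nines).
# Assumption: this is the probability that a given object survives one year.
DURABILITY = Fraction("0.99999999999")


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year under the assumption above."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


print(years_per_lost_object(10_000_000))
```

Under that assumption, storing ten million objects corresponds to an expected loss of one object roughly every ten thousand years.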

Data Lakes from AWS
Data Lake on AWS
Lowest cost
Secure
Analytics
Machine Learning
Real-time Data Movement
On-premises Data Movement

Tiered Storage to Optimize Price/Performance
Lowest Cost
• Tiered storage to optimize price/performance
• S3 Standard
• S3 Standard—Infrequent Access
• S3 One Zone—Infrequent Access
• Amazon Glacier
• Migrate between tiers based on lifecycle policies
• Store data at $0.023/GB/month with S3
• Store data at $0.004/GB/month with Glacier
Storage tiers: S3 Standard, S3 Standard-Infrequent Access, S3 One Zone-IA, Glacier
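
The two per-GB prices on this slide can be compared with a short calculation. This is a sketch only: it uses the slide's figures ($0.023 and $0.004 per GB per month), ignores request, retrieval, and transfer charges, and the function name `monthly_storage_cost` is hypothetical.

```python
from decimal import Decimal

# Storage prices quoted on the slide, in USD per GB per month.
PRICE_PER_GB_MONTH = {
    "S3 Standard": Decimal("0.023"),
    "Glacier": Decimal("0.004"),
}


def monthly_storage_cost(size_gb: int, tier: str) -> Decimal:
    """Storage-only monthly cost in USD, rounded to cents."""
    cost = Decimal(size_gb) * PRICE_PER_GB_MONTH[tier]
    return cost.quantize(Decimal("0.01"))


for tier in PRICE_PER_GB_MONTH:
    print(tier, monthly_storage_cost(1000, tier))
```

For 1,000 GB the storage-only difference is $23.00 versus $4.00 per month, which is the motivation for moving rarely accessed data to a colder tier.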

Pay Only for the Resources You Use as you Scale
Lowest Cost
• Pay-as-you-go for the resources you consume
• As low as $5 per TB scanned with Athena
Traditional approach (rigid): fixed server capacity leads to wasted capacity
• Unmet demand: upset players, missed revenue
• Excess capacity: wasted $$$
AWS approach (elastic): pay for the capacity you use
Lowest Total Cost of Ownership (TCO)
Cost-effective
• Less admin time to
manage and support
• No up-front costs—
hardware acquisition,
installation
• Save on operating
costs—data center space,
power, cooling
• Business value: cost of
delays, risk premium,
competitive abilities,
governance, etc.
On-premises: Licensing Fees, Support Costs
AWS: Subscription Fee, Support Costs
Server Costs
• Hardware: Server, Rack, Chassis, PDUs, ToR Switches (+Maintenance)
• Software: OS, Virtualization Licenses (+Maintenance)
Network Costs
• Network Hardware: LAN Switches, Load Balancer Bandwidth costs
• Software: Network Monitoring
IT Labor Costs
• Server admin, virtualization admin, storage admin, network admin, support team
Extras
• Project planning, advisors, legal, contractors, managed services, training, cost of capital

More Data Lakes & Analytics on AWS than Anywhere Else

Reference architecture: Data lake on AWS
• Central storage: Amazon S3
• Data ingestion: Amazon Kinesis, AWS Direct Connect, AWS Snowball, AWS DMS, AWS Data Exchange
• Catalog and search: AWS Glue, Amazon ES, Amazon DynamoDB
• Processing and analytics: Amazon QuickSight, Amazon EMR, Amazon Redshift, Amazon Athena, machine learning
• Protect and secure: IAM, Amazon CloudWatch, AWS STS, AWS CloudTrail, AWS KMS
• Access and user interface: Amazon API Gateway, IAM, Amazon Cognito

Serverless data lakes and analytics
Services: Amazon S3, AWS Glue crawler, AWS Glue Data Catalog, Amazon Athena, Amazon QuickSight
Data sources: Amazon RDS, web app data, other databases, on-premises data, streaming data

Amazon S3

Amazon S3—Object Storage
Security and Compliance
Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
Flexible Management
Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
Query in Place
Run analytics & ML on data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
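
The lifecycle policies mentioned under Flexible Management are expressed as a rule document attached to a bucket. The sketch below builds one such document as a plain Python dictionary, in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The helper name, rule ID, prefix, and day thresholds are illustrative assumptions, and the actual API call is left as a comment because it needs credentials and a real bucket.

```python
def build_lifecycle_configuration(
    rule_id: str, prefix: str, ia_after_days: int, glacier_after_days: int
) -> dict:
    """Build a lifecycle configuration that tiers objects under `prefix`.

    Objects move to Standard-Infrequent Access first, then to Glacier.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Hypothetical values: tier everything under raw/ after 30 and 90 days.
config = build_lifecycle_configuration("tier-raw-data", "raw/", 30, 90)
print(config)

# Applying it would look roughly like this (bucket name is a placeholder):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-data-lake-bucket", LifecycleConfiguration=config
# )
```

Keeping the rule as data makes it easy to review and version before it is applied to a bucket.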

Amazon Glacier—Backup and Archive
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
Secure
Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
Retrieves data in minutes
Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
Inexpensive
Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost

AWS Glue

Storing is Not Enough, Data Needs to Be Discoverable
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
Data sources shown: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data, semi-structured, unstructured

AWS Glue—Data Catalog
Make data discoverable
• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient
Glue Data Catalog: discover data and extract schema
Compliance

Data Catalog: Crawlers
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless – only pay when crawler runs
Crawlers automatically build your Data Catalog and keep it in sync
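
Hive-style partitions encode column values in the object path as `key=value` segments. The sketch below shows how such a path can be decoded; it illustrates the naming convention only and is not how the Glue crawler itself is implemented. The function name and the example URI are hypothetical.

```python
from urllib.parse import urlparse


def parse_hive_partitions(s3_uri: str) -> dict:
    """Extract key=value partition segments from an S3 object or prefix URI."""
    path = urlparse(s3_uri).path
    segments = path.strip("/").split("/")
    # If the URI names an object, the last segment is the file name, not a partition.
    if not path.endswith("/"):
        segments = segments[:-1]
    partitions = {}
    for segment in segments:
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


# Hypothetical object key laid out with Hive-style partitions.
uri = "s3://example-bucket/logs/year=2020/month=01/part-0000.parquet"
print(parse_hive_partitions(uri))
```

Partition values are kept as strings here; a catalog would additionally assign each partition column a type.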

AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single categorized list that is searchable

Data Catalog: Table details
Table schema
Table properties
Data statistics
Nested fields

Data Catalog: Version control
List of table versions
Compare schema versions

AWS Glue—ETL Service
Make ETL scripting and deployment easy
• Automatically generates ETL code
• Code is customizable with Python
and Spark
• Endpoints provided to edit, debug,
test code
• Jobs are scheduled or event-based
• Serverless

AWS Glue DataBrew
NEW
Clean and normalize data with a visual interface
250+ built-in transformations without writing code
Profile data to understand data patterns and anomalies
Work on large datasets at scale

Amazon Athena

Amazon Athena
Example Query

Amazon Athena: ETL & Query Use Case
Sources: AWS service logs, application logs, data sourced from external vendors
Raw data (S3): Athena updates table partitions and queries data
ETL: Athena CTAS and INSERT INTO
Transformed data (S3)
Glue Data Catalog
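
The CTAS (CREATE TABLE AS SELECT) step in this use case can be sketched as a query string. The helper below is a minimal illustration that writes the result in a chosen format to a chosen S3 location; the table names and bucket path are hypothetical, and the helper does no validation or escaping of its inputs.

```python
def build_ctas_query(
    target_table: str,
    source_table: str,
    output_location: str,
    data_format: str = "PARQUET",
) -> str:
    """Build an Athena CTAS statement that rewrites a table into a new format."""
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (format = '{data_format}', external_location = '{output_location}')\n"
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical names: convert a raw log table into Parquet.
print(
    build_ctas_query(
        "logs_parquet", "logs_raw", "s3://example-bucket/transformed/logs/"
    )
)
```

A follow-up `INSERT INTO` against the new table would append later batches in the same format.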

Amazon QuickSight

Create Beautiful, Interactive
Dashboards
• Add rich interactivity like filters, drill downs,
zooming, and more
• Blazing fast navigation
• Accessible on any device
• Data Refresh
• Publish to everyone with a click

ML (Machine Learning) Insights
Cutting edge ML tools that automatically discover powerful insights for your users.
• Anomaly Detection
• Forecasting
• Bring your own model from
Amazon SageMaker
• Auto-generated natural language
narratives
*currently in preview

THANK YOU
