SlideShare a Scribd company logo
1 of 61
Download to read offline
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Effective Data Lake:
Design Patterns and Challenges
Radhika Ravirala
EMR Solutions Architect
Amazon Web Services
A N T 3 1 6
Moataz Anany
Solutions Architect
Amazon Web Services
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agenda
• Why a Data Lake?
• Data Lake concepts
• Common asks and challenges
• Data Lake design patterns
• Security and governance
patterns
• Q & A
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data lake on AWS
Amazon
DynamoDB
Amazon Elasticsearch
Service
AWS
AppSync
Amazon
API Gateway
Amazon
Cognito
AWS
KMS
AWS
CloudTrail
AWS
IAM
Amazon
CloudWatch
AWS
Snowball
AWS Storage
Gateway
Amazon
Kinesis Data
Firehose
AWS Direct
Connect
AWS Database
Migration
Service
Amazon
Athena
Amazon
EMR
AWS
Glue
Amazon
Redshift
Amazon
DynamoDB
Amazon
QuickSight
Amazon
Kinesis
Amazon
Elasticsearch
Service
Amazon
Neptune
Amazon
RDS
AWS
Glue
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon S3
Amazon EMRAthena
Data Lake
Versatile
Compute
Layers
Data &
Metadata
The core of a Data Lake
AWS Glue
Data Catalog
Amazon Redshift
Spectrum
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
The concept of a Data Lake
• All data in one place, a single source of truth
• Handles structured/semi-structured/unstructured/raw data
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Tier 1 Data Lake: Ingestion
Single source of truth for raw data
Use least transformations
Use lifecycle policies to Amazon Simple
Storage Service (Amazon S3) IA or Amazon
Glacier
Amazon S3
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Tier 2 Data Lake: Analytics
Use columnar formats – Parquet/ORC
Organized into partitions
Coalescing to larger partitions over time
Optimized for analytics
Amazon S3
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Tier 3 Data Lake: Analytics
Domain level DataMart
Organized by use cases
Optimized for specialized analysisAmazon S3
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data Warehouse:
Fast speeds over structured schemas
Serves dashboards and reports
Fine-grained access controls
Supports joining native and external tables
Lifecycle back to S3 Data Lake
Amazon
Redshift
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Some customer asks
• Can I do streaming ingest into a Data Lake?
• Can a Data Lake replace our database replicas we maintain for
analytics?
• How to organize data inside a Data Lake?
• How to handle late events coming in to old partitions?
• How to perform updates and deletes to the data inside a Data Lake?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Some more customer asks
• How can I run Machine Learning training on data in the Data Lake?
• How can I augment the data in my Data Lake with real-time
predictions during ETL or ingestion?
• How to enforce data protection rules in the Data Lake?
• What are the authentication and authorization options available?
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reporting
Data science
Machine
Learning
Applications
Log analytics, ClickStream analytics, IoT sensor data
S3 Data Lake
Amazon
Kinesis
Firehose
Amazon
Athena
Presto/Spark
on EMR
Amazon Redshift
Data Warehouse
Potential problem:
1. Too many small files
2. Not necessarily optimized for
Analytics
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reporting
Data science
Machine
Learning
Applications
Log analytics, ClickStream analytics, IoT sensor data
Tier 1 S3 Datalake:
Raw Data
Kinesis
Firehose Tier 2 S3 Datalake:
Analytics
Glue ETL
Hourly
Compactions to
Parquet/ORC
Athena
Presto/Spark
on EMR
Amazon Redshift
Data Warehouse
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reporting
Data science
Machine
Learning
Applications
Fast ingestion
Kinesis
Firehose
Amazon Redshift
Spectrum Views
Amazon Redshift
Data Warehouse
S3 Datalake:
Analytics
Scheduled
Glue ETL
Single Serving Layer
using Amazon Redshift
Spectrum
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reporting
Data science
Machine
Learning
Applications
Faster ingestion
S3 Datalake:
Analytics
Glue ETL
Phoenix/HBase on
S3
on EMR
Presto
on EMR
Spark Streaming
on EMR
Kinesis
Streams
Federated query
and single serving
layer
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Reporting
Data science
Machine
Learning
Applications
Fastest ingestion
S3 Datalake:
Analytics
Glue ETL
Phoenix/HBase on
S3
on EMR
Presto
on EMR
Flink
on EMR
Kinesis
Streams
Federated query
and single serving
layer
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS DMS
S3 Data Lake
Athena
Presto/Spark
on EMR
Amazon Redshift
Data Warehouse
Reporting
Data science
Machine
Learning
Databases
Replacing a database replica
Potential problem:
Updates and deletes creates
new versions of records
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DMS
Athena
Presto/Spark
on EMR
Amazon Redshift
Data Warehouse
Databases
Replacing a database replica
Tier 1 S3 Datalake:
Raw Data
Tier 2 S3 Datalake:
Analytics
Glue ETL
Create Views to preserve the
database view of records
Potential problem:
Grouping records in Views
can be expensive over time
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
DMS
Athena
Presto/Spark
on EMR
Amazon Redshift
Data Warehouse
Databases
Replacing a database replica
Tier 1 S3 Datalake:
Raw Data
Snapshot
Analytics
Tier 2 S3 Datalake:
Analytics
Glue ETL
Creates daily snapshots to
preserve the database view
of records
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Machine Learning—Batch training pipeline
Tier 2 S3 Data Lake:
Analytics
Glue ETL
Amazon SageMaker
Batch Training
S3 Model Artifacts
Tier 1 S3 Datalake:
Raw Data
Glue ETL
Amazon SageMaker
Endpoint
Glue ETL
Training Step Model DeploymentData Preparation
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Athena
Presto/Spark
on EMR
Amazon Redshift
Data Warehouse
Databases
Machine Learning—Predictions on streaming data
Tier 1 S3 Datalake:
Raw Data
Tier 2 S3 Datalake:
Analytics
Glue ETL
Kinesis
Firehose
Amazon SageMaker
EndpointsLambda
Potential problem:
Tier 1 raw data should have
the least transformations
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Athena
Presto/Spark
on EMR
Amazon Redshift
Data Warehouse
Databases
Machine Learning—Predictions on streaming data
Tier 1 S3 Datalake:
Raw Data
Tier 2 S3 Datalake:
Analytics
Glue ETL
Kinesis
Firehose
Amazon SageMaker
Endpoints
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data Lake design principles
• Ingestion location and frequency: Decide on a location for ingestion.
Select a frequency and ingestion mechanism as meets your needs.
• Partition data: Partition the data with keys that align with common query
filters used. This enables partition pruning and increases query
performance.
• File Size: Choose optimal file sizes to reduce S3 roundtrips.
Recommended : 256 MB to 1GB files in columnar format per partition.
• Compactions: Compact data on a scheduled basis to get the file sizes
above e.g., daily compactions into daily partitions if hourly files are small.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to choose partitioning columns?
256 MB to 1GB
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to choose partitioning columns? An
example
4.3M partitions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How to choose partitioning columns? An
example
hour
Bucketed by: Device, 50 buckets
5*365 = 1825 partitions
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data Lake design principles
• Mutable data: For mutable uses cases i.e., to handle updates/deletes
• Either use a database like Amazon Redshift/HBase for the time the
data can mutate and offload to S3 once data becomes static
• Or append to delta files per partition and compact on a scheduled
basis using AWS Glue or Spark on EMR
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Serving mutable data
Order 123 | Jan 01, 2018 | Changed
Order 123 | Jan 01, 2018 | New | Ver 1
OLTP Database
Data Lake
Users
Serving View
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Serving mutable data
Order 123 | Jan 02, 2018 | Changed
Order 123 | Jan 01, 2018 | New | Ver 1
---------------------------------------------------
Order 123 | Jan 02, 2018 | Changed | Ver 2
OLTP Database
Data Lake
Users
Serving View
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Serving mutable data
Order 123 | Jan 03, 2018 | Deleted
Order 123 | Jan 01, 2018 | New | Ver 1
--------------------------------------------------
Order 123 | Jan 02, 2018 | Changed | Ver 2
--------------------------------------------------
Order 123 | Jan 03, 2018 | Deleted | Ver 3
OLTP Database
Data Lake
Users
Serving View
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Serving mutable data
Order 123 | Jan 03, 2018 | Deleted
Order 123 | Jan 03, 2018 | Deleted | Ver 3
OLTP Database
Data Lake
Users
Remove older
versions
Periodic compaction job
Serving View
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data Lake optimizations
• Bucketed data: For additional performance, bucket data in each partition
on a high cardinality key. This is honored by Presto/Athena, Hive and so on,
and improves query filter performance on that key.
df.write.bucketBy(numBuckets,”col1”).parquet(…)
• Order Data: For additional performance, sort data in each partition by a
secondary key. This allows engines to skip part of files to get to the
requested data faster.
df.repartition(100).sortWithinPartitions([‘order_id’]
,ascending=True).parquet(…)
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data Lake optimizations
• Bloom filters: Bloom Filters are space-efficient probabilistic data
structures that is used to test whether an element is a member of a set
CREATE TABLE
STORED AS ORC
TBLPROPERTIES('orc.bloom.filter.columns’=‘ORDER_ID')
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Security and governance concerns
• Authentication
• Authorization on data (and metadata)
• Encryption of data at rest and in transit
• Audit and monitoring
• Centralized management
• Compliance
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS helps you secure
Compliance
AWS Artifact
Amazon Inspector
Amazon Cloud HSM
Amazon Cognito
AWS CloudTrail
Security
Amazon GuardDuty
AWS Shield
AWS WAF
Amazon Macie
Amazon VPC
Encryption
AWS Certification Manager
AWS Key Management
Service
Encryption at rest
Encryption in transit
Bring your own keys, HSM
support
Identity
AWS IAM
AWS SSO
Amazon Cloud Directory
AWS Directory Service
AWS Organizations
Customer need to have multiple levels of security, identity and access management,
encryption, and compliance to secure their data lake
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Security: Machine Learning-powered security
• Machine learning to discover, classify,
and protect data
• Continuously monitors data access for anomalies
• Generates alerts when it detects
unauthorized access
• Recognizes PII or intellectual propertyAmazon Macie
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data Lake security
• Data storage
• Metadata
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Data storage security
Key learnings
• Implement access control in a multi-team environment
• Fine-grained
• Coarse-grained
• Secure and segregated access to
• Amazon S3
• Amazon EMR clusters
• Amazon Redshift clusters
• Serverless analytics services and other tools used in the pipeline
• Encrypt data assets
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Control access to data—Fine-grained ACL
“Fine-grained” data and resource
ownership
• Teams share S3 buckets and
clusters
• Access control complex to set
up and maintain
• Common in a
“shared services” architecture “Fine-grained”
ownership
Team X Team Y Team Z
Databases and Schemas
Amazon Redshift
Zeppelin Presto Hive
Amazon EMR Cluster
Local FS
HDFS
EMRFS
/foo/bar /abc/xyz /local
hdfs:///data/1st
s3://bucket/prfx s3://group/data
…
Amazon S3
hdfs:///data2
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Control data access—Coarse grained
Prefer “coarse-grained” data and resource
ownership
• Teams own entire S3 buckets and
clusters
• Ownership segregated by AWS accounts
• Access control easier to setup and
maintain
• Suitable for autonomous teams
Team X
Amazon S3
Buckets and
prefixes
Amazon EMR
Clusters
Amazon
Redshift
Clusters
Team X
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Control access to data
Configure Amazon S3 permissions
• Implement your access control matrix using
IAM policies
• Use S3 bucket policies for easy cross-account
data sharing
• Limit role-based access from an Amazon EMR
cluster’s Amazon Elastic Compute Cloud
(Amazon EC2) instance profile
• Authorize access from other tools such as
Amazon Redshift using IAM roles
IAM Principals Amazon EMR Amazon
Redshift
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Block public access to Amazon S3
Amazon S3 provides four settings
• BlockPublicAcls – rejects new public object or bucket ACLs
• IgnorePublicAcls – ignores existing public object or bucket ACLs
• BlockPublicPolicy – rejects new public bucket access policy
• RestrictPublicBuckets – restricts access to only AWS services and authorized
users within the bucket owner's account
But, what is “public”?
• Public object (or bucket) ACL → grants permissions to members of the
predefined AllUsers or AuthenticatedUsers groups (grantees)
• Public bucket policy → doesn’t grant permissions to only fixed values in
Principal and Condition elements
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Encryption: Data-at-rest and in-motion
• Amazon S3 offers multiple forms of encryption
• Server-side and Client-side encryption
• Encryption with keys managed by S3 or
AWS Key Management Service
• Encryption with keys that customers manage
• Encrypts data in transit when replicating across regions
• Data movement services can use the same AWS Key
Management Service
• SSL endpoints
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Metadata security
AWS Glue Data Catalog
• Apache Hive metastore compatible
• Track data evolution using schema
versioning
• Integrates with Hive, Spark, Presto, Amazon
Athena and Amazon Redshift spectrum
• Use crawlers classify your data in one central
list that is searchable
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Metadata security
Key learnings
• Create and maintain centralized data catalog
• Enable cross account access
• Use IAM policies to control catalog access—similar to S3
bucket policies
• Encrypt metadata in AWS Glue Data Catalog
Glue Data Catalog—Cross account access
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase"
],
"Principal": {"AWS": [
"arn:aws:iam::account-B-id:user/Bob"
]},
"Resource": [
"arn:aws:glue:us-east-1:account-A-id:catalog",
"arn:aws:glue:us-east-1:account-A-
id:database/db1"
]
}
]
}
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase"
],
"Resource": [
"arn:aws:glue:us-east-1:account-A-
id:catalog",
"arn:aws:glue:us-east-1:account-A-
id:database/db1"
]
}
]
}
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Glue Data Catalog—Resource-level permissions
• Fine-grained access control to catalog using IAM policies
• Restrict what they can view and query
Security and Governance
Athena EMR Glue Redshift
Authentication IAM/EC2 Key pair Kerberos/LDAP/
EC2 Key pair/IAM
IAM Role IAM/Native
Authorization S3 Bucket Policies S3 Bucket
Policies/
Hive Grants/
EMRFS Auth
S3 Bucket
Policies/
Fine Grained
S3 Bucket
Policies/
Native Grants
Encryption of
data at-rest
SSE-S3/
SSE-KMS/
CSE-KMS
SSE-S3/
SSE-KMS/
CSE-KMS/
CSE-CMK
SSE-S3 Database
Encryption/
SSE-S3/
SSE-KMS/
CSE-CMK
Encryption of
data in-transit
SSL Yes, through
Security Config
SSL SSL
Audit CloudTrail Application Logs CloudTrail Database Audit
Compliance HIPAA FedRAMP/HIPAA HIPAA FedRAMP/HIPAA
Compliance: Virtually every regulatory agency
CSA
Cloud Security
Alliance Controls
ISO 9001
Global Quality
Standard
ISO 27001
Security Management
Controls
ISO 27017
Cloud Specific
Controls
ISO 27018
Personal Data
Protection
PCI DSS Level 1
Payment Card
Standards
SOC 1
Audit Controls
Report
SOC 2
Security, Availability, &
Confidentiality Report
SOC 3
General Controls
Report
Global United States
CJIS
Criminal Justice
Information Services
DoD SRG
DoD Data
Processing
FedRAMP
Government Data
Standards
FERPA
Educational
Privacy Act
FIPS
Government Security
Standards
FISMA
Federal Information
Security Management
GxP
Quality Guidelines
and Regulations
ISO FFIEC
Financial Institutions
Regulation
HIPPA
Protected Health
Information
ITAR
International Arms
Regulations
MPAA
Protected Media
Content
NIST
National Institute of
Standards and Technology
SEC Rule 17a-4(f)
Financial Data
Standards
VPAT/Section 508
Accountability
Standards
Asia Pacific
FISC [Japan]
Financial Industry
Information Systems
IRAP [Australia]
Australian Security
Standards
K-ISMS [Korea]
Korean Information
Security
MTCS Tier 3 [Singapore]
Multi-Tier Cloud
Security Standard
My Number Act [Japan]
Personal Information
Protection
Europe
C5 [Germany]
Operational Security
Attestation
Cyber Essentials
Plus [UK]
Cyber Threat
Protection
G-Cloud [UK]
UK Government
Standards
IT-Grundschutz
[Germany]
Baseline Protection
Methodology
X P
G
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
AWS Lake Formation
Build a secure data lake in days
Register existing data or load new data using blueprints. Data stored in Amazon S3.
Secure data access across multiple services using single set of permissions.
No additional charge. Only pay for the underlying services used.
Quickly build data lakes
Move, store, catalog, and clean
your data faster. Use ML
transforms to de-duplicate data
and find matching records.
Easily secure access
Centrally define table and column-
level data access and enforce it
across Amazon EMR, Amazon
Athena, Amazon Redshift
Spectrum, Amazon SageMaker, and
Amazon QuickSight
Share and collaborate
Use data catalog in Lake
Formation to search and find
relevant data sets and share
them across multiple users
and accounts
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
How it works
Thank you!
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Radhika Ravirala
ravirala@amazon.com
Moataz Anany
moanany@amazon.com
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

More Related Content

What's hot

Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
Amazon Web Services Korea
 

What's hot (20)

Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
 
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
Migrating Databases to the Cloud: Introduction to AWS DMS - SRV215 - Chicago ...
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift(DAT201) Introduction to Amazon Redshift
(DAT201) Introduction to Amazon Redshift
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
 
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
글로벌 기업들의 효과적인 데이터 분석을 위한 Data Lake 구축 및 분석 사례 - 김준형 (AWS 솔루션즈 아키텍트)
 
How to backup, restore and archive your data on AWS
How to backup, restore and archive your data on AWSHow to backup, restore and archive your data on AWS
How to backup, restore and archive your data on AWS
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
 
Getting Started with Amazon Database Migration Service
Getting Started with Amazon Database Migration ServiceGetting Started with Amazon Database Migration Service
Getting Started with Amazon Database Migration Service
 
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
민첩하고 비용효율적인 Data Lake 구축 - 문종민 솔루션즈 아키텍트, AWS
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Building A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWSBuilding A Modern Data Analytics Architecture on AWS
Building A Modern Data Analytics Architecture on AWS
 
Introduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxIntroduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptx
 
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
Snowflake: Your Data. No Limits (Session sponsored by Snowflake) - AWS Summit...
 
Introduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF LoftIntroduction to AWS Glue: Data Analytics Week at the SF Loft
Introduction to AWS Glue: Data Analytics Week at the SF Loft
 
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
Analyzing Data Streams in Real Time with Amazon Kinesis: PNNL's Serverless Da...
 

Similar to Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent 2018

Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
Amazon Web Services
 

Similar to Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent 2018 (20)

Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
 
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift SpectrumModernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
Modernise your Data Warehouse with Amazon Redshift and Amazon Redshift Spectrum
 
Big Data@Scale
 Big Data@Scale Big Data@Scale
Big Data@Scale
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWS
 
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech TalksHow to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
How to Build a Data Lake in Amazon S3 & Amazon Glacier - AWS Online Tech Talks
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Building a Modern Data Warehouse: Deep Dive on Amazon Redshift - SRV337 - Chi...
Building a Modern Data Warehouse: Deep Dive on Amazon Redshift - SRV337 - Chi...Building a Modern Data Warehouse: Deep Dive on Amazon Redshift - SRV337 - Chi...
Building a Modern Data Warehouse: Deep Dive on Amazon Redshift - SRV337 - Chi...
 
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
Connecting the dots - How Amazon Neptune and Graph Databases can transform yo...
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Managed Relational Databases
Managed Relational DatabasesManaged Relational Databases
Managed Relational Databases
 
Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018
Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018
Accelerate Analytics at Scale with Amazon EMR - AWS Summit Sydney 2018
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
 

More from Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent 2018

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Effective Data Lake: Design Patterns and Challenges Radhika Ravirala EMR Solutions Architect Amazon Web Services A N T 3 1 6 Moataz Anany Solutions Architect Amazon Web Services
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Agenda • Why a Data Lake? • Data Lake concepts • Common asks and challenges • Data Lake design patterns • Security and governance patterns • Q & A
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data lake on AWS Amazon DynamoDB Amazon Elasticsearch Service AWS AppSync Amazon API Gateway Amazon Cognito AWS KMS AWS CloudTrail AWS IAM Amazon CloudWatch AWS Snowball AWS Storage Gateway Amazon Kinesis Data Firehose AWS Direct Connect AWS Database Migration Service Amazon Athena Amazon EMR AWS Glue Amazon Redshift Amazon DynamoDB Amazon QuickSight Amazon Kinesis Amazon Elasticsearch Service Amazon Neptune Amazon RDS AWS Glue
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Amazon S3 Amazon EMRAthena Data Lake Versatile Compute Layers Data & Metadata The core of a Data Lake AWS Glue Data Catalog Amazon Redshift Spectrum
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. The concept of a Data Lake • All data in one place, a single source of truth • Handles structured/semi-structured/unstructured/raw data • Supports fast ingestion and consumption • Schema on read • Designed for low-cost storage • Decouples storage and compute • Supports protection and security rules
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Tier 1 Data Lake: Ingestion Single source of truth for raw data Use least transformations Use lifecycle policies to Amazon Simple Storage Service (Amazon S3) IA or Amazon Glacier Amazon S3
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Tier 2 Data Lake: Analytics Use columnar formats – Parquet/ORC Organized into partitions Coalescing to larger partitions over time Optimized for analytics Amazon S3
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Tier 3 Data Lake: Analytics Domain level DataMart Organized by use cases Optimized for specialized analysisAmazon S3
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Warehouse: Fast speeds over structured schemas Serves dashboards and reports Fine-grained access controls Supports joining native and external tables Lifecycle back to S3 Data Lake Amazon Redshift
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Some customer asks • Can I do streaming ingest into a Data Lake? • Can a Data Lake replace our database replicas we maintain for analytics? • How to organize data inside a Data Lake? • How to handle late events coming in to old partitions? • How to perform updates and deletes to the data inside a Data Lake?
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Some more customer asks • How can I run Machine Learning training on data in the Data Lake? • How can I augment the data in my Data Lake with real-time predictions during ETL or ingestion? • How to enforce data protection rules in the Data Lake? • What are the authentication and authorization options available?
  • 16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 17. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reporting Data science Machine Learning Applications Log analytics, ClickStream analytics, IoT sensor data S3 Data Lake Amazon Kinesis Firehose Amazon Athena Presto/Spark on EMR Amazon Redshift Data Warehouse Potential problem: 1. Too many small files 2. Not necessarily optimized for Analytics
  • 18. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reporting Data science Machine Learning Applications Log analytics, ClickStream analytics, IoT sensor data Tier 1 S3 Datalake: Raw Data Kinesis Firehose Tier 2 S3 Datalake: Analytics Glue ETL Hourly Compactions to Parquet/ORC Athena Presto/Spark on EMR Amazon Redshift Data Warehouse
  • 19. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reporting Data science Machine Learning Applications Fast ingestion Kinesis Firehose Amazon Redshift Spectrum Views Amazon Redshift Data Warehouse S3 Datalake: Analytics Scheduled Glue ETL Single Serving Layer using Amazon Redshift Spectrum
  • 20. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reporting Data science Machine Learning Applications Faster ingestion S3 Datalake: Analytics Glue ETL Phoenix/HBase on S3 on EMR Presto on EMR Spark Streaming on EMR Kinesis Streams Federated query and single serving layer
  • 21. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Reporting Data science Machine Learning Applications Fastest ingestion S3 Datalake: Analytics Glue ETL Phoenix/HBase on S3 on EMR Presto on EMR Flink on EMR Kinesis Streams Federated query and single serving layer
  • 22. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS DMS S3 Data Lake Athena Presto/Spark on EMR Amazon Redshift Data Warehouse Reporting Data science Machine Learning Databases Replacing a database replica Potential problem: Updates and deletes creates new versions of records
  • 23. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DMS Athena Presto/Spark on EMR Amazon Redshift Data Warehouse Databases Replacing a database replica Tier 1 S3 Datalake: Raw Data Tier 2 S3 Datalake: Analytics Glue ETL Create Views to preserve the database view of records Potential problem: Grouping records in Views can be expensive over time
  • 24. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. DMS Athena Presto/Spark on EMR Amazon Redshift Data Warehouse Databases Replacing a database replica Tier 1 S3 Datalake: Raw Data Snapshot Analytics Tier 2 S3 Datalake: Analytics Glue ETL Creates daily snapshots to preserve the database view of records
  • 25. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Machine Learning—Batch training pipeline Tier 2 S3 Data Lake: Analytics Glue ETL Amazon SageMaker Batch Training S3 Model Artifacts Tier 1 S3 Datalake: Raw Data Glue ETL Amazon SageMaker Endpoint Glue ETL Training Step Model DeploymentData Preparation
  • 26. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Athena Presto/Spark on EMR Amazon Redshift Data Warehouse Databases Machine Learning—Predictions on streaming data Tier 1 S3 Datalake: Raw Data Tier 2 S3 Datalake: Analytics Glue ETL Kinesis Firehose Amazon SageMaker EndpointsLambda Potential problem: Tier 1 raw data should have the least transformations
  • 27. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Athena Presto/Spark on EMR Amazon Redshift Data Warehouse Databases Machine Learning—Predictions on streaming data Tier 1 S3 Datalake: Raw Data Tier 2 S3 Datalake: Analytics Glue ETL Kinesis Firehose Amazon SageMaker Endpoints
  • 28. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lake design principles • Ingestion location and frequency: Decide on a location for ingestion. Select a frequency and ingestion mechanism as meets your needs. • Partition data: Partition the data with keys that align with common query filters used. This enables partition pruning and increases query performance. • File Size: Choose optimal file sizes to reduce S3 roundtrips. Recommended : 256 MB to 1GB files in columnar format per partition. • Compactions: Compact data on a scheduled basis to get the file sizes above e.g., daily compactions into daily partitions if hourly files are small.
  • 29. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to choose partitioning columns? 256 MB to 1GB
  • 30. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to choose partitioning columns? An example 4.3M partitions
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How to choose partitioning columns? An example hour Bucketed by: Device, 50 buckets 5*365 = 1825 partitions
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lake design principles • Mutable data: For mutable uses cases i.e., to handle updates/deletes • Either use a database like Amazon Redshift/HBase for the time the data can mutate and offload to S3 once data becomes static • Or append to delta files per partition and compact on a scheduled basis using AWS Glue or Spark on EMR
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Serving mutable data Order 123 | Jan 01, 2018 | Changed Order 123 | Jan 01, 2018 | New | Ver 1 OLTP Database Data Lake Users Serving View
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Serving mutable data Order 123 | Jan 02, 2018 | Changed Order 123 | Jan 01, 2018 | New | Ver 1 --------------------------------------------------- Order 123 | Jan 02, 2018 | Changed | Ver 2 OLTP Database Data Lake Users Serving View
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Serving mutable data Order 123 | Jan 03, 2018 | Deleted Order 123 | Jan 01, 2018 | New | Ver 1 -------------------------------------------------- Order 123 | Jan 02, 2018 | Changed | Ver 2 -------------------------------------------------- Order 123 | Jan 03, 2018 | Deleted | Ver 3 OLTP Database Data Lake Users Serving View
  • 36. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Serving mutable data Order 123 | Jan 03, 2018 | Deleted Order 123 | Jan 03, 2018 | Deleted | Ver 3 OLTP Database Data Lake Users Remove older versions Periodic compaction job Serving View
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lake optimizations • Bucketed data: For additional performance, bucket data in each partition on a high cardinality key. This is honored by Presto/Athena, Hive and so on, and improves query filter performance on that key. df.write.bucketBy(numBuckets,”col1”).parquet(…) • Order Data: For additional performance, sort data in each partition by a secondary key. This allows engines to skip part of files to get to the requested data faster. df.repartition(100).sortWithinPartitions([‘order_id’] ,ascending=True).parquet(…)
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lake optimizations • Bloom filters: Bloom Filters are space-efficient probabilistic data structures that is used to test whether an element is a member of a set CREATE TABLE STORED AS ORC TBLPROPERTIES('orc.bloom.filter.columns’=‘ORDER_ID')
  • 39. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 40. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Security and governance concerns • Authentication • Authorization on data (and metadata) • Encryption of data at rest and in transit • Audit and monitoring • Centralized management • Compliance
  • 41. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS helps you secure Compliance AWS Artifact Amazon Inspector Amazon Cloud HSM Amazon Cognito AWS CloudTrail Security Amazon GuardDuty AWS Shield AWS WAF Amazon Macie Amazon VPC Encryption AWS Certification Manager AWS Key Management Service Encryption at rest Encryption in transit Bring your own keys, HSM support Identity AWS IAM AWS SSO Amazon Cloud Directory AWS Directory Service AWS Organizations Customer need to have multiple levels of security, identity and access management, encryption, and compliance to secure their data lake
  • 42. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Security: Machine Learning-powered security • Machine learning to discover, classify, and protect data • Continuously monitors data access for anomalies • Generates alerts when it detects unauthorized access • Recognizes PII or intellectual propertyAmazon Macie
  • 43. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data Lake security • Data storage • Metadata
  • 44. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 45. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data storage security Key learnings • Implement access control in a multi-team environment • Fine-grained • Coarse-grained • Secure and segregated access to • Amazon S3 • Amazon EMR clusters • Amazon Redshift clusters • Serverless analytics services and other tools used in the pipeline • Encrypt data assets
  • 46. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Control access to data—Fine-grained ACL “Fine-grained” data and resource ownership • Teams share S3 buckets and clusters • Access control complex to set up and maintain • Common in a “shared services” architecture “Fine-grained” ownership Team X Team Y Team Z Databases and Schemas Amazon Redshift Zeppelin Presto Hive Amazon EMR Cluster Local FS HDFS EMRFS /foo/bar /abc/xyz /local hdfs:///data/1st s3://bucket/prfx s3://group/data … Amazon S3 hdfs:///data2
  • 47. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Control data access—Coarse grained Prefer “coarse-grained” data and resource ownership • Teams own entire S3 buckets and clusters • Ownership segregated by AWS accounts • Access control easier to setup and maintain • Suitable for autonomous teams Team X Amazon S3 Buckets and prefixes Amazon EMR Clusters Amazon Redshift Clusters Team X
  • 48. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Control access to data Configure Amazon S3 permissions • Implement your access control matrix using IAM policies • Use S3 bucket policies for easy cross-account data sharing • Limit role-based access from an Amazon EMR cluster’s Amazon Elastic Compute Cloud (Amazon EC2) instance profile • Authorize access from other tools such as Amazon Redshift using IAM roles IAM Principals Amazon EMR Amazon Redshift
  • 49. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Block public access to Amazon S3 Amazon S3 provides four settings • BlockPublicAcls – rejects new public object or bucket ACLs • IgnorePublicAcls – ignores existing public object or bucket ACLs • BlockPublicPolicy – rejects new public bucket access policy • RestrictPublicBuckets – restricts access to only AWS services and authorized users within the bucket owner's account But, what is “public”? • Public object (or bucket) ACL → grants permissions to members of the predefined AllUsers or AuthenticatedUsers groups (grantees) • Public bucket policy → doesn’t grant permissions to only fixed values in Principal and Condition elements
  • 50. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Encryption: Data-at-rest and in-motion • Amazon S3 offers multiple forms of encryption • Server-side and Client-side encryption • Encryption with keys managed by S3 or AWS Key Management Service • Encryption with keys that customers manage • Encrypts data in transit when replicating across regions • Data movement services can use the same AWS Key Management Service • SSL endpoints
  • 51. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 52. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Metadata security AWS Glue Data Catalog • Apache Hive metastore compatible • Track data evolution using schema versioning • Integrates with Hive, Spark, Presto, Amazon Athena and Amazon Redshift spectrum • Use crawlers classify your data in one central list that is searchable
  • 53. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Metadata security Key learnings • Create and maintain centralized data catalog • Enable cross account access • Use IAM policies to control catalog access—similar to S3 bucket policies • Encrypt metadata in AWS Glue Data Catalog
  • 54. Glue Data Catalog—Cross account access { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "glue:GetDatabase" ], "Principal": {"AWS": [ "arn:aws:iam::account-B-id:user/Bob" ]}, "Resource": [ "arn:aws:glue:us-east-1:account-A-id:catalog", "arn:aws:glue:us-east-1:account-A- id:database/db1" ] } ] } { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "glue:GetDatabase" ], "Resource": [ "arn:aws:glue:us-east-1:account-A- id:catalog", "arn:aws:glue:us-east-1:account-A- id:database/db1" ] } ] }
  • 55. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Glue Data Catalog—Resource-level permissions • Fine-grained access control to catalog using IAM policies • Restrict what they can view and query
  • 56. Security and Governance Athena EMR Glue Redshift Authentication IAM/EC2 Key pair Kerberos/LDAP/ EC2 Key pair/IAM IAM Role IAM/Native Authorization S3 Bucket Policies S3 Bucket Policies/ Hive Grants/ EMRFS Auth S3 Bucket Policies/ Fine Grained S3 Bucket Policies/ Native Grants Encryption of data at-rest SSE-S3/ SSE-KMS/ CSE-KMS SSE-S3/ SSE-KMS/ CSE-KMS/ CSE-CMK SSE-S3 Database Encryption/ SSE-S3/ SSE-KMS/ CSE-CMK Encryption of data in-transit SSL Yes, through Security Config SSL SSL Audit CloudTrail Application Logs CloudTrail Database Audit Compliance HIPAA FedRAMP/HIPAA HIPAA FedRAMP/HIPAA
  • 57. Compliance: Virtually every regulatory agency CSA Cloud Security Alliance Controls ISO 9001 Global Quality Standard ISO 27001 Security Management Controls ISO 27017 Cloud Specific Controls ISO 27018 Personal Data Protection PCI DSS Level 1 Payment Card Standards SOC 1 Audit Controls Report SOC 2 Security, Availability, & Confidentiality Report SOC 3 General Controls Report Global United States CJIS Criminal Justice Information Services DoD SRG DoD Data Processing FedRAMP Government Data Standards FERPA Educational Privacy Act FIPS Government Security Standards FISMA Federal Information Security Management GxP Quality Guidelines and Regulations ISO FFIEC Financial Institutions Regulation HIPPA Protected Health Information ITAR International Arms Regulations MPAA Protected Media Content NIST National Institute of Standards and Technology SEC Rule 17a-4(f) Financial Data Standards VPAT/Section 508 Accountability Standards Asia Pacific FISC [Japan] Financial Industry Information Systems IRAP [Australia] Australian Security Standards K-ISMS [Korea] Korean Information Security MTCS Tier 3 [Singapore] Multi-Tier Cloud Security Standard My Number Act [Japan] Personal Information Protection Europe C5 [Germany] Operational Security Attestation Cyber Essentials Plus [UK] Cyber Threat Protection G-Cloud [UK] UK Government Standards IT-Grundschutz [Germany] Baseline Protection Methodology X P G
  • 58. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. AWS Lake Formation Build a secure data lake in days Register existing data or load new data using blueprints. Data stored in Amazon S3. Secure data access across multiple services using single set of permissions. No additional charge. Only pay for the underlying services used. Quickly build data lakes Move, store, catalog, and clean your data faster. Use ML transforms to de-duplicate data and find matching records. Easily secure access Centrally define table and column- level data access and enforce it across Amazon EMR, Amazon Athena, Amazon Redshift Spectrum, Amazon SageMaker, and Amazon QuickSight Share and collaborate Use data catalog in Lake Formation to search and find relevant data sets and share them across multiple users and accounts
  • 59. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. How it works
  • 60. Thank you! © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Radhika Ravirala ravirala@amazon.com Moataz Anany moanany@amazon.com
  • 61. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.