BAY AREA
Best Practices on building data lakes
& Overview on Lake Formation
Confidential Material / Chegg, Inc. © 2005 – 2019 / All Rights Reserved
Sep 13, 2019
Agenda
• Data Lake
» Introduction
» Key Concepts
• Data Lake Architecture
• Data Lake best practices
» Collection
» Storage
» Process
» Consume
» Notification
» Security
» Ing
• AWS Lake Formation
» Steps
» Blueprints
» Resources
What is a Data Lake?
• Centralized repository to store data at every stage of its lifecycle
• Accessed using the right tool for the nature of the job
» Offers unlimited scalability
» Single source of truth
» Customer centricity
» Multiple ways to query the data from a unified layer
» Eliminates data silos
Data Lake - Key Concepts
Collect Store Process Consume
Data Lake Architecture
Collect Store Process Consume
The Chegg logo and the Smarter Way to Student logo are trademarks of Chegg, Inc. All other trademarks are owned by their respective owners.
Collection
• Faster data ingestion
» Kinesis Firehose
» Kinesis Streams
• Batch upload
» JDBC/ODBC connectors
• Compact data on a scheduled basis
» S3DistCp
• Optimal file sizes
» 128 MB to 1 GB
• Bucketing data
» Order data within buckets
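
The compaction and file-size guidance above can be sketched as a planning step that groups small files into batches no larger than a target size. This is a minimal sketch, not the presenters' implementation: `plan_compaction` and its thresholds are hypothetical names, and the merge itself would be carried out by a tool such as S3DistCp.

```python
from typing import Dict, List

MB = 1024 * 1024
MIN_OK_BYTES = 128 * MB   # lower bound of the file-size range on the slide
TARGET_BYTES = 1024 * MB  # upper bound of the file-size range on the slide (1 GB)


def plan_compaction(file_sizes: Dict[str, int],
                    min_ok: int = MIN_OK_BYTES,
                    target: int = TARGET_BYTES) -> List[List[str]]:
    """Group files smaller than `min_ok` into batches of at most `target` bytes."""
    # Largest files first, so big files claim space before small ones fill gaps.
    small = sorted((key for key, size in file_sizes.items() if size < min_ok),
                   key=lambda key: file_sizes[key], reverse=True)
    batches: List[List[str]] = []
    totals: List[int] = []
    for key in small:
        size = file_sizes[key]
        # First fit: use the first batch that still has room for this file.
        for i, total in enumerate(totals):
            if total + size <= target:
                batches[i].append(key)
                totals[i] += size
                break
        else:
            batches.append([key])
            totals.append(size)
    # A batch holding a single file has nothing to merge with.
    return [batch for batch in batches if len(batch) > 1]
```

Files already at or above the lower bound are left alone; only the small ones are proposed for merging.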
Storage
• Segregate raw and processed data
» Different storage categories
• Choose the right partitioning
» Choose partition keys that have fewer unique values (low cardinality)
» E.g., partitioning by day (year/month/day) yields 365 unique values per year
» Partitioning by hour yields about 8,760 unique values per year (365 × 24)
• Aggregate smaller files
• Data storage format helps in query optimization
» Parquet
» ORC
• Use S3 Select where appropriate
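
The partition-cardinality arithmetic can be checked with a short sketch. The helper names and the Hive-style `year=/month=/day=` layout are assumptions made for illustration; the slide does not prescribe a path layout.

```python
from datetime import datetime


def count_partitions(year: int, by_hour: bool = False) -> int:
    """Number of distinct time partitions produced in one calendar year."""
    days = (datetime(year + 1, 1, 1) - datetime(year, 1, 1)).days
    return days * 24 if by_hour else days


def partition_prefix(base: str, ts: datetime) -> str:
    """Daily partition prefix in Hive style, an assumed layout."""
    return f"{base.rstrip('/')}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/"
```

For a non-leap year, daily partitioning gives 365 prefixes and hourly partitioning gives 365 × 24 = 8,760, which is why the coarser key keeps the partition count manageable.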
Process
• Metadata in DynamoDB
• Decouple compute and storage
» EMR and Lambda
• Segregate raw and processed data
» Different storage categories
• Enable trigger-based workflows
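
A trigger-based workflow can be sketched as a Lambda handler that reacts to S3 event notifications for newly landed raw objects. The `raw/` and `processed/` prefixes and the function names are hypothetical; the event fields used (`Records`, `s3.bucket.name`, `s3.object.key`) follow the S3 event notification format.

```python
from typing import Any, Dict, List
from urllib.parse import unquote_plus

RAW_PREFIX = "raw/"              # hypothetical prefix for the raw zone
PROCESSED_PREFIX = "processed/"  # hypothetical prefix for the processed zone


def plan_jobs(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Turn an S3 event notification into a list of processing tasks."""
    jobs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if not key.startswith(RAW_PREFIX):
            continue  # ignore objects outside the raw zone
        jobs.append({
            "source": f"s3://{bucket}/{key}",
            "target": f"s3://{bucket}/{PROCESSED_PREFIX}{key[len(RAW_PREFIX):]}",
        })
    return jobs


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda entry point; a real function would submit each job, e.g. to EMR."""
    jobs = plan_jobs(event)
    return {"planned": len(jobs)}
```

Keeping the planning logic separate from the handler makes it testable without any AWS resources, and keeps raw and processed data in separate prefixes.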
Consume
• Segregate raw and processed data
» Different storage categories
• Redshift Spectrum
• Athena: query without ETL
• Compress datasets
• Columnar file formats
• Optimize file sizes
• Maintain the right level of security, with access policies for the different artifacts of the data lake
• Build notifications across the whole pipeline
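
Querying the lake with Athena, without a separate ETL step, might look like the sketch below. The helper names are hypothetical; the request keys are those of Athena's `StartQueryExecution` API as exposed by boto3, and the database name and result location are supplied by the caller.

```python
from typing import Any, Dict


def build_athena_request(sql: str, database: str, output_location: str) -> Dict[str, Any]:
    """Keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(sql: str, database: str, output_location: str) -> str:
    """Submit the query and return its execution id (requires AWS credentials)."""
    import boto3  # imported lazily so the request builder works without AWS access

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_athena_request(sql, database, output_location)
    )
    return response["QueryExecutionId"]
```

Filtering on partition columns in the SQL text lets Athena skip whole prefixes, which is where the partitioning and columnar-format choices pay off.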
Notifications
• Service alerts
• CloudWatch alarms and dashboards
• Data quality alerts
• Data query alerts
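
A data quality alert can be as simple as a null-rate check on required columns. This is a sketch under stated assumptions: the function name, the 5% default threshold, and the message format are all hypothetical, and a real pipeline would forward the messages to its alerting channel.

```python
from typing import Dict, Iterable, List, Optional


def null_rate_alerts(rows: Iterable[Dict[str, Optional[object]]],
                     required: List[str],
                     max_null_rate: float = 0.05) -> List[str]:
    """Return one alert message per required column whose null rate is too high."""
    rows = list(rows)
    if not rows:
        return ["no rows received"]
    alerts = []
    for col in required:
        # A missing key counts as a null value.
        nulls = sum(1 for row in rows if row.get(col) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            alerts.append(f"{col}: null rate {rate:.0%} exceeds {max_null_rate:.0%}")
    return alerts
```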
Metadata
• Metadata-driven event flow
• Ability to replay data from a given checkpoint
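
Replaying from a checkpoint can be sketched with an ordered event log and a stored sequence number. The `Event` type and `replay` function are hypothetical; in practice the log and the checkpoint would live in a metadata store rather than in memory.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    sequence: int  # monotonically increasing position in the event log
    payload: str


def replay(log: List[Event], checkpoint: int,
           apply: Callable[[Event], None]) -> int:
    """Re-apply every event after `checkpoint` in order; return the new checkpoint."""
    last = checkpoint
    for event in sorted(log, key=lambda e: e.sequence):
        if event.sequence <= checkpoint:
            continue  # already processed before the checkpoint was taken
        apply(event)
        last = event.sequence
    return last
```

Persisting the returned value after each run means a failed job can restart from the last good position instead of reprocessing everything.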
Security
• Decoupled security policies for different personas
» Administrators
» Producers
» Consumers
• Tighter security controls for each service in the data lake
AWS Lake Formation
• Faster data accessibility
• Build data lakes and provide multiple ways to access data
• Collect
» Create data lake
» Import data
• Security & Control
» Table permissions
» User permissions
• Collaborate & Use
» Search data catalog
» Add metadata
• Monitor & Audit
AWS Lake Formation - Steps
• Register Amazon S3 storage for the data lake
• Create database
• Grant permissions
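
The three steps above can be sketched with boto3. This is an illustration under stated assumptions, not the presenters' setup: `set_up_data_lake` is a hypothetical helper, the clients are passed in by the caller, and the permissions granted are only an example.

```python
from typing import Any


def set_up_data_lake(lakeformation: Any, glue: Any, bucket_arn: str,
                     database: str, principal_arn: str) -> None:
    """Register storage, create a database, and grant permissions on it."""
    # Step 1: register the S3 location with Lake Formation.
    lakeformation.register_resource(ResourceArn=bucket_arn,
                                    UseServiceLinkedRole=True)
    # Step 2: create a database in the Glue Data Catalog.
    glue.create_database(DatabaseInput={"Name": database})
    # Step 3: grant the principal permissions on the database (example set).
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": principal_arn},
        Resource={"Database": {"Name": database}},
        Permissions=["CREATE_TABLE", "DESCRIBE"],
    )
```

With credentials configured, the clients would come from `boto3.client("lakeformation")` and `boto3.client("glue")`; passing them in keeps the function easy to exercise with stand-ins.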
AWS Lake Formation - Blueprints
• Choose blueprint type
» Database snapshot
» Incremental
» Log export
• Import source
» Database connection
» Define source data path
» Configure patterns
» Exclude patterns
» Define incremental columns / partition schema
• Import target
» Choose target database and storage location
» Choose target data format: Parquet / CSV
• Frequency
» Define scheduling frequency
• Import options
» Configure Glue options
» IAM role
» Capacity and concurrency
AWS Lake Formation - Resources
• AWS Lake Formation
» Glue Workflows: Glue ETL, Glue Crawler, Glue Triggers
» Glue Catalog: Glue Database, Glue Tables
AWS Lake Formation - Access & Querying
• Tables are created and accessible in Athena
• Redshift Spectrum access is enabled via Lake Formation
• Manage table-level and column-level permissions
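
A column-level grant can be sketched as the request passed to Lake Formation's `GrantPermissions` API. The helper name is hypothetical; the `TableWithColumns` resource shape is the one boto3 accepts for `grant_permissions`.

```python
from typing import Any, Dict, Iterable


def column_grant_request(principal_arn: str, database: str, table: str,
                         columns: Iterable[str]) -> Dict[str, Any]:
    """Request granting SELECT on specific columns of one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }
```

The result would be passed as `boto3.client("lakeformation").grant_permissions(**request)`; the principal then sees only the listed columns when querying the table.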
Thank You.