The Beginner's Guide to Data Lakes in AWS

AWS offers everything you need to deploy a secure and flexible data lake in the cloud. Discover how services like Amazon Simple Storage Service (Amazon S3) and Amazon Redshift can be used together to build and manage your own data lake, and how AWS Lake Formation makes it possible to set up a data lake in days. We walk through an example architecture together, covering everything from data storage to data analytics.

Published in: Data & Analytics

  1. The Beginner’s Guide to Data Lakes in AWS (DVC12). Guillermo A. Fisher, Senior Engineering Manager, Handshake. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  2. Agenda: (1) Why a Data Lake? (2) Key Concepts (3) Data Lakes on AWS (4) An Example (5) Best Practices
  3. Related DevChats: DVC10 - Lessons from the backyard: A connected BBQ grill and smoker; DVC06 - Use Neptune to discover where & when events can impact local businesses
  4. “The never-ending stream of information is incredibly useful for businesses, but it can also be a challenge to draw relevant insights from such a large data pool.” Michael Brenner, CEO, Marketing Insider Group
  5. The Data Science Hierarchy of Needs (top to bottom): AI; Learn/Optimize; Aggregate/Label; Explore/Transform; Move/Store; Collect. “You need a solid foundation for your data before being effective with AI and machine learning.” Monica Rogati, Data Science and AI Advisor
  6. The Data Warehouse Solution. Diagram: one Data Warehouse and three Data Marts. Advantages: provides precise reporting and BI; standardized, consistent data. Drawbacks: limited to pre-determined questions; no low-level data visibility.
  7. Considerations for a Modern Solution. Centralized Data Storage: store all data reliably in one location. Multiple User Communities: business analysts, data professionals. Schema on Read: schema written at time of analysis. Storage vs. Compute: scale storage and compute independently. Data Types & Formats: structured, semi-structured, unstructured, raw data. Security: control access to the data.
  8. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions. (Photo by Yifan Liu on Unsplash)
  9. Onboard relevant data; metadata should exist in a data catalog; data governance policies and procedures govern storage and access; automated processes manage data flow and data cleaning and enforce practices. (Photo by arsalan arianmehr on Unsplash)
  10. Centralized Storage: Amazon S3. Scalable object storage; decouples storage and compute; 99.999999999% durability; cost-effective lifecycle policies.
  11. Data Ingestion. Amazon Kinesis Data Firehose: easily and reliably stream data into data lakes. AWS Snowball: migrate large datasets using secure devices. AWS Storage Gateway: gain on-premises access to AWS cloud storage. AWS Database Migration Service: migrate databases to AWS quickly and securely. AWS Direct Connect: establish a dedicated network connection to AWS.
  12. Catalog & Search. Amazon DynamoDB: fully managed NoSQL database service. Amazon Elasticsearch Service: fully managed Elasticsearch service. AWS Glue: store metadata in a data catalog.
  13. Move & Transform. Amazon Kinesis Data Firehose: easily and reliably stream data into data lakes. AWS Glue: fully managed ETL service. AWS Lambda: event-driven, serverless computing.
  14. Access & User Interfaces. AWS AppSync: manage and synchronize mobile app data in real time across devices and users. Amazon Cognito: add user sign-up, sign-in, and access control to your web and mobile apps quickly and easily. Amazon API Gateway: fully managed service for creating, publishing, maintaining, and monitoring secure APIs at scale.
  15. Analytics & Serving. Amazon Redshift: fast, simple, cost-effective data warehousing service. Amazon Athena: serverless, interactive query service. Amazon QuickSight: fast, cloud-powered business intelligence service. AWS Glue: store metadata in a data catalog. Amazon DynamoDB: fully managed NoSQL database service. Amazon EMR: run and scale Spark, Hadoop, and other big data frameworks. AWS Direct Connect: establish a dedicated network connection to AWS. Amazon Elasticsearch Service: fully managed Elasticsearch service. Amazon Neptune: fully managed graph database service. Amazon RDS: distributed relational database service.
  16. Manage & Secure. AWS KMS: manage cryptographic keys and control their use across services. AWS IAM: securely manage access to AWS services and resources. AWS CloudTrail: enable governance, compliance, operational auditing, and risk auditing. Amazon CloudWatch: monitor your AWS resources and the applications you run on AWS in real time.
  17. A Data Lake in Days: AWS Lake Formation. Source crawlers, ETL and data prep, data catalog, security settings, access control. Identify data sources; data lake storage; provide self-service access.
  18. An Example: Amazon S3, AWS Lambda, AWS CloudTrail, AWS IAM, AWS Glue, Amazon Athena, Amazon QuickSight
  19. DEMO (Photo by Moritz Mentges on Unsplash)
  20. Some Best Practices: encrypt data at rest and in transit; partition data; compress data; use columnar file formats; use lifecycle policies; automate, automate, automate.
  21. Thank you! Guillermo A. Fisher, @guillermoandrae, https://bklyn.dev
  22. Please complete the session survey in the mobile app.
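
The "partition data" and "compress data" practices from slide 20 can be sketched without touching AWS at all. The example below is a minimal illustration, not the speaker's demo code: it assumes a Hive-style `year=/month=/day=` key layout (a common convention that query engines such as Amazon Athena can use to skip irrelevant partitions) and gzip-compressed newline-delimited JSON. The table name `clickstream`, the file name, and the helper functions are all hypothetical.

```python
import gzip
import json
from datetime import date


def partitioned_key(table, day, filename):
    """Build a Hive-style partitioned object key (year=/month=/day=)."""
    return (
        f"{table}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


def compress_records(records):
    """Serialize records as newline-delimited JSON and gzip the result."""
    body = "\n".join(json.dumps(record, sort_keys=True) for record in records)
    return gzip.compress(body.encode("utf-8"))


def decompress_records(blob):
    """Reverse compress_records: gunzip and parse one JSON object per line."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]


# Hypothetical sample data; in a real lake the blob would be uploaded
# to object storage under the generated key.
events = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 5}]
key = partitioned_key("clickstream", date(2019, 6, 1), "part-0000.json.gz")
blob = compress_records(events)
print(key)
```

Keeping the date in the key rather than only inside the records is what lets a query for a single day read one prefix instead of scanning the whole table. A columnar format such as Parquet would typically replace the JSON step, but that needs a third-party library and is left out of this sketch.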

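
The "Schema on Read" consideration from slide 7 may be easier to see in code. The sketch below is an assumption-laden illustration, not anything shown in the talk: raw records are stored as-is, as JSON strings, and a schema (here just a hypothetical mapping of field name to a casting function) is applied only when the data is read for analysis. The names `RAW_LINES`, `CLICK_SCHEMA`, and `read_with_schema` are invented for the example.

```python
import json

# Raw data lands in the lake untouched: inconsistent fields, strings for numbers.
RAW_LINES = [
    '{"user": "a", "clicks": "3", "ts": "2019-06-01T10:00:00"}',
    '{"user": "b", "clicks": "5", "extra": "ignored by this schema"}',
]

# One possible schema for one analysis; another analysis can define its own.
CLICK_SCHEMA = {"user": str, "clicks": int}


def read_with_schema(lines, schema):
    """Project and cast fields at read time; the stored data is not modified."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, CLICK_SCHEMA))
total_clicks = sum(row["clicks"] for row in rows)
print(rows, total_clicks)
```

The same raw lines could be read again with a different mapping, for example `{"user": str, "ts": str}`, without rewriting anything in storage. A data warehouse, by contrast, fixes the schema when the data is loaded.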