
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly Webinar Series


Uncovering new, valuable insights from big data requires organizations to collect, store, and analyze increasing volumes of data from multiple, often disparate, sources and points in time. This makes it difficult to handle big data with data warehouses or relational database management systems alone. A Data Lake allows you to store massive amounts of data in its original form, without enforcing a predefined schema. The result is a far more agile and flexible architecture that makes it easier to gain new types of analytical insights from your data.

Learning Objectives:
• Introduce key architectural concepts to build a Data Lake using Amazon S3 as the storage layer
• Explore storage options and best practices to build your Data Lake on AWS
• Learn how AWS can help enable a Data Lake architecture
• Understand some of the key architectural considerations when building a Data Lake
• Hear some important Data Lake implementation considerations when using Amazon S3 as your Data Lake

Published in: Technology

  1. 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Rahul Bhartia, Solutions Architect Susan Chan, Senior Product Manager - Amazon S3 August 2016 Building a Data Lake with Amazon S3
  2. 2. Evolution of “Data Lakes”
  3. 3. Databases Transactions Data warehouse Evolution of big data architecture Extract, transform and load (ETL)
  4. 4. Databases Files Transactions Logs Data warehouse Evolution of big data architecture ETL ETL
  5. 5. Databases Files Streams Transactions Logs Events Data warehouse Evolution of big data architecture ? Hadoop ? ETL ETL
  6. 6. Amazon Glacier Amazon S3 Amazon DynamoDB Amazon RDS Amazon EMR Amazon Redshift AWS Data Pipeline Amazon Kinesis Amazon CloudSearch Amazon Kinesis- enabled app AWS Lambda Amazon Machine Learning Amazon SQS Amazon ElastiCache Amazon DynamoDB Streams A growing ecosystem…
  7. 7. Databases Files Streams Transactions Logs Events Data warehouse Data Lake The Genesis of “Data Lakes”
  8. 8. What really is a “Data Lake”?
  9. 9. Components of a Data Lake Collect & Store Catalogue & Search Entitlements API & UI  A foundation of highly durable data storage and streaming of any type of data  A search index and workflow which enables data discovery  A robust set of security controls – governance through technology, not policy  An API and user interface that expose these features to internal and external users
  10. 10. Storage High durability Stores raw data from input sources Support for any type of data Low cost
  11. 11. Data Lake – Hadoop (HDFS) as the Storage Search Access Query Process Archive
  12. 12. Data Lake – Amazon S3 as the Storage Search Access Query Process Archive Transactions Amazon RDS Amazon DynamoDB Amazon Elasticsearch Service Amazon Glacier Amazon S3 Amazon Redshift Amazon Elastic MapReduce Amazon Machine Learning Amazon ElastiCache
  13. 13. Catalogue & Search Metadata lake  Used for summary statistics and data classification management  Simplified model for data discovery & governance
  14. 14. Catalogue & Search Architecture
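One way to populate such a catalogue is to let S3 Event Notifications trigger an AWS Lambda function that writes an entry to the metadata index. A minimal sketch of that step; the bucket name, object key, and record fields below are illustrative assumptions, not a fixed schema:

```python
# Turn one S3 event record (shaped like the notifications S3 sends to
# Lambda) into an entry for the metadata index.
def metadata_record(s3_event_record):
    s3 = s3_event_record["s3"]
    return {
        "bucket": s3["bucket"]["name"],
        "key": s3["object"]["key"],
        "size_bytes": s3["object"]["size"],
        "event": s3_event_record["eventName"],
    }

# A minimal sample event record for illustration
sample = {
    "eventName": "ObjectCreated:Put",
    "s3": {"bucket": {"name": "my-data-lake"},
           "object": {"key": "logs/2016/08/events.gz", "size": 1024}},
}
record = metadata_record(sample)
```

A real handler would then persist `record` to the search or metadata index (for example, DynamoDB or Amazon Elasticsearch Service, as in the architecture above).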
  15. 15. Entitlements  Encryption for data protection  Authentication & authorisation  Access control & restrictions
  16. 16. Data Protection via Encryption AWS CloudHSM Dedicated Tenancy SafeNet Luna SA HSM Device Common Criteria EAL4+, NIST FIPS 140-2 AWS Key Management Service Automated key rotation & auditing Integration with other AWS services AWS server side encryption AWS managed key infrastructure
  17. 17. Entitlements – Access to Encryption Keys Customer Master Key Customer Data Keys Ciphertext Key Plaintext Key IAM Temporary Credential Security Token Service MyData MyData S3 S3 Object … Name: MyData Key: Ciphertext Key …
  18. 18. API & UI  Exposes the data lake to customers  Programmatically query the catalogue  Expose a search API  Ensures that entitlements are respected
  19. 19. API & UI Architecture API Gateway UI - Elastic Beanstalk AWS Lambda Metadata Index Users IAM TVM - Elastic Beanstalk
  20. 20. Putting It All Together
  21. 21. Amazon Kinesis Amazon S3 Amazon Glacier IAM Encrypted Data Security Token Service AWS Lambda Search Index Metadata Index API Gateway Users UI - Elastic Beanstalk KMS Collect & Store Catalogue & Search Entitlements & Access Controls APIs & UI
  22. 22. Amazon S3 - Foundation for your Data Lake
  23. 23. Why Amazon S3 for Data Lake? Durable & available  Designed for 11 9s of durability  Designed for 99.99% availability High performance  Multipart upload  Range GET Scalable  Store as much as you need  Scale storage and compute independently  No minimum usage commitments Integrated  AWS Elastic MapReduce  Amazon Redshift  Amazon DynamoDB Easy to use  Simple REST API  AWS SDKs  Read-after-create consistency  Event Notification  Lifecycle policies
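The multipart upload and Range GET features above both reduce to simple byte-range arithmetic. A sketch of a helper for splitting an object into ranges (the part size and usage are illustrative; S3 itself expects a `bytes=start-end` Range header):

```python
def part_ranges(total_size, part_size):
    """Yield inclusive (start, end) byte ranges covering an object,
    usable for parallel Range GETs or for slicing a multipart upload."""
    for start in range(0, total_size, part_size):
        end = min(start + part_size, total_size) - 1
        yield start, end

# A 25-byte object fetched in 10-byte parts
ranges = list(part_ranges(25, 10))
headers = ["bytes={0}-{1}".format(s, e) for s, e in ranges]
```

Each range maps directly to one GET request (or one uploaded part), which is how storage and compute can scale independently.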
  24. 24. Why Amazon S3 for Data Lake?  Natively supported by frameworks like Spark, Hive, Presto, etc.  Can run transient Hadoop clusters  Multiple clusters can use the same data  Highly durable, available, and scalable  Low cost: S3 Standard starts at $0.0275 per GB per month
  25. 25. AWS Direct Connect AWS Snowball ISV Connectors Amazon Kinesis Firehose S3 Transfer Acceleration AWS Storage Gateway Data Ingestion into Amazon S3
  26. 26. Choice of storage classes on S3  Standard for active data  Standard - Infrequent Access for infrequently accessed data  Amazon Glacier for archive data
  27. 27. Implement the right controls Security  Identity and Access Management (IAM) policies  Bucket policies  Access Control Lists (ACLs)  Query string authentication  SSL endpoints Encryption  Server Side Encryption (SSE-S3)  Server Side Encryption with provided keys (SSE-C, SSE-KMS)  Client-side Encryption Compliance  Bucket access logs  Lifecycle Management Policies  Access Control Lists (ACLs)  Versioning & MFA deletes  Certifications – HIPAA, PCI, SOC 1/2/3 etc.
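One of the controls above, server-side encryption, can be made mandatory with a bucket policy that denies unencrypted PUTs. A minimal sketch; the bucket name and statement ID are placeholders:

```python
import json

# Deny any PutObject request that does not carry the
# x-amz-server-side-encryption header (i.e. any unencrypted upload).
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyUnencryptedUploads",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-data-lake/*",
        "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
    }],
}
policy_json = json.dumps(bucket_policy)
```

The serialized `policy_json` is what would be attached to the bucket (for example via the console or an SDK's put-bucket-policy call).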
  28. 28. Use Case “We use S3 as the ‘source of truth’ for our cloud-based data warehouse. Any dataset that is worth retaining is stored on S3. This includes data from billions of streaming events from (Netflix-enabled) televisions, laptops, and mobile devices every hour captured by our log data pipeline (called Ursula), plus dimension data from Cassandra supplied by our Aegisthus pipeline.” Source: Eva Tse, Director, Big Data Platform, Netflix
  29. 29. Tip #1: Use versioning  Protects from accidental overwrites and deletes  New version with every upload  Easy retrieval of deleted objects and roll back to previous versions Versioning
  30. 30. Tip #2: Use lifecycle policies  Automatic tiering and cost controls  Includes two possible actions:  Transition: archives to Standard - IA or Amazon Glacier based on the object age you specify  Expiration: deletes objects after a specified time  Actions can be combined  Set policies at the bucket or prefix level  Set policies for current or noncurrent versions Lifecycle policies
  31. 31. Versioning + lifecycle policies
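Versioning and lifecycle policies combine naturally in a single lifecycle configuration. A sketch following the shape of the S3 lifecycle API; the rule ID, prefix, and day counts are illustrative, not recommendations:

```python
lifecycle_config = {
    "Rules": [{
        "ID": "tier-and-expire",
        "Prefix": "logs/",
        "Status": "Enabled",
        # Transition: move objects to cheaper tiers as they age
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"},
        ],
        # Expiration: clean up old noncurrent versions kept by versioning
        "NoncurrentVersionExpiration": {"NoncurrentDays": 365},
    }]
}
```

With an SDK such as boto3, this dictionary would be applied via a put-bucket-lifecycle-configuration call (not shown here, since it requires live credentials).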
  32. 32. Expired object delete marker policy  Deleting a versioned object makes a delete marker the current version of the object  Removing expired object delete markers can improve list performance  A lifecycle policy automatically removes the current-version delete marker when previous versions of the object no longer exist Expired object delete marker
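The delete-marker cleanup described above is expressed as a lifecycle rule with `ExpiredObjectDeleteMarker` set. A sketch; the rule ID and empty prefix (whole bucket) are illustrative:

```python
# Remove delete markers whose previous object versions have all expired,
# so they no longer slow down LIST operations.
cleanup_rule = {
    "ID": "remove-expired-delete-markers",
    "Prefix": "",
    "Status": "Enabled",
    "Expiration": {"ExpiredObjectDeleteMarker": True},
}
```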
  33. 33. Enable policy with the console
  34. 34. Incomplete multipart upload expiration policy  Partial uploads do incur storage charges  Set a lifecycle policy to automatically expire incomplete multipart uploads after a predefined number of days Incomplete multipart upload expiration Best Practice
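The incomplete-upload cleanup above maps to the `AbortIncompleteMultipartUpload` lifecycle action. A sketch; the seven-day threshold and rule ID are illustrative choices:

```python
# Abort multipart uploads that were started but never completed,
# freeing the storage their uploaded parts still occupy.
abort_rule = {
    "ID": "abort-stale-multipart-uploads",
    "Prefix": "",
    "Status": "Enabled",
    "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
}
```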
  35. 35. Enable policy with the Management Console
  36. 36. Considerations for organizing your Data Lake  Amazon S3 storage uses a flat keyspace  Separate data by business unit, application, type, and time  Natural data partitioning is very useful  Paths should be self-documenting and intuitive  Changing the prefix structure in the future is hard/costly
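A self-documenting, time-partitioned key layout might look like the helper below. The `year=/month=/day=` convention is one common (Hive-style) choice, not an S3 requirement, and all names here are hypothetical:

```python
from datetime import date

def object_key(unit, app, dtype, day, name):
    """Build a key like bu/app/type/year=YYYY/month=MM/day=DD/name."""
    return "{0}/{1}/{2}/year={3:%Y}/month={3:%m}/day={3:%d}/{4}".format(
        unit, app, dtype, day, name)

key = object_key("marketing", "clickstream", "raw",
                 date(2016, 8, 15), "events.gz")
```

Keys built this way separate data by business unit, application, type, and time, and query engines that understand Hive-style partitions can prune by date.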
  37. 37. Best Practices for your Data Lake  As a first rule of thumb, always store a copy of the raw input  Use automation with S3 Events to enable trigger-based workflows  Use a format that supports your data, rather than force your data into a format  Apply compression everywhere to reduce the network load
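The compression advice above pays off especially for repetitive log-style data. A small stdlib sketch (the sample payload is made up):

```python
import gzip

# Repetitive JSON log lines compress very well; compressing before
# upload cuts both network transfer and storage footprint.
raw = b'{"event": "click", "ts": 1471219200}\n' * 1000
compressed = gzip.compress(raw)
restored = gzip.decompress(compressed)  # lossless round trip
```

Formats like gzip-compressed text are splittable-unfriendly for some engines, so in practice the compression codec should match the processing framework, per the "use a format that supports your data" tip.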
  38. 38. Thank you!