
Best Practices for Building Your Data Lake on AWS

Today’s organisations require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store all of your data, structured and unstructured, in one centralised repository. Because data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know in advance what questions you will ask of your data.

In this webinar, you will discover how AWS gives you fast access to flexible, low-cost IT resources, so you can rapidly scale and build a data lake that can power any kind of analytics (data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and Internet of Things processing), regardless of the volume, velocity, and variety of your data.

Learning Objectives:

• Discover how you can rapidly scale and build your data lake with AWS.

• Explore the key pillars behind a successful data lake implementation.

• Learn how to use the Amazon Simple Storage Service (S3) as the basis for your data lake.

• Learn about two recently launched AWS services, Amazon Athena and Amazon Redshift Spectrum, that help customers query the data lake directly.

Published in: Technology


  1. 1. Best Practices for Building Your Data Lake on AWS Ian Robinson, Specialist SA, AWS Kiran Tamana, EMEA Head of Solutions Architecture, Datapipe Derwin McGeary, Solutions Architect, Cloudwick
  2. 2. Webinars: Database Freedom [Diagram: on-premise relational, NoSQL, and data warehouse databases alongside Amazon Relational Database Service, Amazon DynamoDB, Amazon Redshift, Amazon EMR, and a data lake on Amazon S3 queried with Amazon Athena]
  3. 3. Webinar 1: Migrating Online/Transactional Workloads [Diagram: on-premise relational and NoSQL databases migrating to Amazon Relational Database Service and Amazon DynamoDB]
  4. 4. Webinar 2: Amazon Redshift and Data Warehousing [Diagram: on-premise data warehouse migrating to Amazon Redshift]
  5. 5. Webinar 3: Data Lake Best Practices [Diagram: data lake on Amazon S3, queried with Amazon Athena and Amazon EMR]
  6. 6. Agenda • What is a Data Lake? • Key Data Lake Concepts • Modern Data Architecture • Putting it All Together • Partner: Datapipe • Partner: Cloudwick
  7. 7. What is a Data Lake?
  8. 8. What and Why • Store all your data, forever, at every stage of its lifecycle • Apply it using the right tool for the job • Agile analytics • Leverage all data that flows into your organization • Customer centricity • Better predictions via machine learning • Competitive advantage
  9. 9. Data Lake versus Enterprise Data Warehouse • Complementary to the EDW (not a replacement); the data lake can be a source for the EDW • Schema on read (no predefined schemas) vs schema on write (predefined schemas) • Structured/semi-structured/unstructured data vs structured data only • Fast ingestion of new data/content vs time-consuming introduction of new content • Data science + prediction/advanced analytics + BI use cases vs BI use cases only (no prediction/advanced analytics) • Data at a low level of detail/granularity vs data at a summary/aggregated level of detail • Loosely defined SLAs vs tight SLAs (production schedules) • Flexibility in tools (open source/tools for advanced analytics) vs limited flexibility in tools (SQL only)
  10. 10. Key Data Lake Concepts
  11. 11. [Diagram: a single storage layer shared by many independent compute clusters, decoupling storage from compute]
  12. 12. Components of a Data Lake Data Storage • High durability • Stores raw data from input sources • Support for any type of data • Low cost Streaming • Streaming ingest of feed data • Provides the ability to consume any dataset as a stream • Facilitates low latency analytics Storage & Streams Catalogue & Search Entitlements API & UI
  13. 13. Components of a Data Lake Catalogue • Metadata lake • Used for summary statistics and data classification management Search • Simplified access model for data discovery Storage & Streams Catalogue & Search Entitlements API & UI
  14. 14. Components of a Data Lake Entitlements system • Encryption • Authentication • Authorisation • Chargeback • Quotas • Data masking • Regional restrictions Storage & Streams Catalogue & Search Entitlements API & UI
  15. 15. Components of a Data Lake Storage & Streams Catalogue & Search Entitlements API & UI API & User Interface • Exposes the data lake to customers • Programmatically query catalogue • Expose search API • Ensures that entitlements are respected
  16. 16. The Modern Data Architecture
  17. 17. [Diagram: data flows through Collect, Store, Process/Analyze, and Consume stages to produce answers; the goal is to minimise time to first answer]
  18. 18. Agile Analytics • Experiment • Invest in promising experiments • Fail fast • React quickly
  19. 19. Storage Ingest
  20. 20. Preservation of Original Source Data
  21. 21. Events and Lifecycle Management [Diagram: from object Create to Delete, data tiers from Standard (active data) to Standard - Infrequent Access (infrequently accessed data) to Amazon Glacier (archive data)]
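The Standard to Standard-IA to Glacier tiering on this slide maps directly onto an S3 lifecycle configuration. A minimal sketch, with a hypothetical `raw/` prefix and illustrative transition ages (30/90/365 days are examples, not recommendations); the dict is in the shape accepted by boto3's `put_bucket_lifecycle_configuration`:

```python
# Sketch of an S3 lifecycle configuration that tiers objects from
# Standard to Standard-IA to Glacier, then expires (deletes) them.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-and-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},  # apply to the raw zone only
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},  # delete a year after creation
        }
    ]
}

# With boto3 this would be applied as (bucket name hypothetical):
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake",
#       LifecycleConfiguration=lifecycle_config,
#   )
```

Omitting the `Expiration` rule keeps archived objects in Glacier indefinitely, which matches the "store all your data, forever" approach earlier in the deck.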
  22. 22. S3 as the Data Lake Fabric • Unlimited number of objects and volume • 99.99% availability • 99.999999999% durability • Versioning • Tiered storage via lifecycle policies • SSL, client/server-side encryption at rest • Low cost (just over $2700/month for 100TB) • Natively supported by big data frameworks (Spark, Hive, Presto, etc) • Decouples storage and compute • Run transient compute clusters (with Amazon EC2 Spot Instances) • Multiple, heterogeneous clusters can use same data
  23. 23. Database Migration Service Automated Data Ingestion
  24. 24. Scalable (secure, versioned, durable) storage + Immutable data at every stage of its lifecycle + Versioned schema and metadata = Data discovery, lineage Storage + Catalog
  25. 25. AWS Glue • Data Catalog: discover and store metadata • Job Authoring: auto-generated ETL code • Job Execution: serverless scheduling and execution
  26. 26. Hive metastore-compatible, highly-available metadata repository: • Search metadata for data discovery • Connection info – JDBC URLs, credentials • Classification for identifying and parsing files • Versioning of table metadata as schemas evolve and other metadata are updated • Table definitions – usable by Redshift, Athena, Glue, EMR Populate using Hive DDL, bulk import, or automatically through crawlers. Glue Data Catalog
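A table definition in the Glue Data Catalog can also be registered programmatically rather than via Hive DDL or crawlers. A hedged sketch of the `TableInput` shape used by the Glue `create_table` API; the table name, columns, and S3 location are hypothetical:

```python
# Sketch of a Glue Data Catalog table definition (TableInput shape).
# Names, columns, and the S3 location are illustrative only.
table_input = {
    "Name": "clickstream_events",
    "PartitionKeys": [
        {"Name": "dt", "Type": "string"},  # date-based partition column
    ],
    "StorageDescriptor": {
        "Columns": [
            {"Name": "user_id", "Type": "string"},
            {"Name": "url", "Type": "string"},
            {"Name": "ts", "Type": "timestamp"},
        ],
        "Location": "s3://my-data-lake/raw/clickstream/",  # hypothetical
        "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
        },
    },
}

# With boto3 (database name hypothetical):
#   glue = boto3.client("glue")
#   glue.create_table(DatabaseName="datalake", TableInput=table_input)
```

Because the catalog is Hive metastore-compatible, the same definition then becomes queryable from Athena, EMR, and Redshift Spectrum without re-registering it.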
  27. 27. Indexing and Searching Using Metadata [Diagram: Amazon S3 ObjectCreated/ObjectDeleted events trigger AWS Lambda, which writes (PutItem) to a metadata index in Amazon DynamoDB; the DynamoDB update stream triggers a second Lambda function that extracts search fields and updates a search index in Amazon Elasticsearch]
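The first Lambda function in this pattern parses the S3 notification event and builds the item for the DynamoDB metadata index. A self-contained sketch of that step; the bucket name and key are hypothetical, and the actual `put_item` call is left as a comment so the example runs without AWS credentials:

```python
# Sketch: parse an S3 ObjectCreated notification and build the items
# that a Lambda handler would write to DynamoDB with put_item.
import os

def extract_metadata(event):
    """Build one metadata item per S3 record in the notification event."""
    items = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        items.append({
            "bucket": bucket,
            "key": key,
            "size": record["s3"]["object"].get("size", 0),
            "extension": os.path.splitext(key)[1].lstrip("."),
        })
    return items

# Abridged shape of a real S3 notification event (names hypothetical).
sample_event = {
    "Records": [{
        "s3": {
            "bucket": {"name": "my-data-lake"},
            "object": {"key": "raw/clickstream/2017/06/events.json",
                       "size": 1024},
        }
    }]
}

# In the handler: dynamodb.put_item(TableName="metadata-index", Item=...)
print(extract_metadata(sample_event))
```

The DynamoDB stream then fans the same item out to the Elasticsearch search index, keeping the two stores eventually consistent without the writer knowing about the search layer.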
  28. 28. Governance Security Privacy
  29. 29. Identity and Access Management • Manage users, groups, and roles • Identity federation with Open ID • Temporary credentials with Amazon Security Token Service (Amazon STS) • Stored policy templates • Powerful policy language • Amazon S3 bucket policies
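The "Amazon S3 bucket policies" bullet above can be made concrete with a small sketch. This builds a policy granting one IAM role read access to a curated prefix; the account ID, role name, bucket, and prefix are all hypothetical:

```python
# Sketch of an S3 bucket policy restricting a prefix to one IAM role.
# Account ID, role, bucket, and prefix are illustrative only.
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowAnalystsReadCurated",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::123456789012:role/analysts"},
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::my-data-lake/curated/*",
    }],
}

policy_json = json.dumps(policy)
# With boto3: s3.put_bucket_policy(Bucket="my-data-lake", Policy=policy_json)
```

Scoping the `Resource` to a prefix rather than the whole bucket is what lets one bucket hold raw, trusted, and product zones with different audiences.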
  30. 30. Security at the Data Level [Diagram: IAM governs service API access to Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, and Amazon Athena]
  31. 31. Storage-Level Support for Access Logging and Audit [Diagram: Amazon S3 access logging and AWS CloudTrail API logging feed access-log analytics via Amazon Athena, Amazon EMR, IAM, and third-party ecosystem security tools] http://amzn.to/2tSimHj http://amzn.to/2si6RqS
  32. 32. Encryption Options AWS Server-Side encryption • AWS managed key infrastructure AWS Key Management Service • Automated key rotation & auditing • Integration with other AWS services AWS CloudHSM • Dedicated Tenancy SafeNet Luna SA HSM Device • Common Criteria EAL4+, NIST FIPS 140-2
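Requesting server-side encryption at upload time is a one-parameter change. A sketch of the extra arguments for an SSE-KMS upload; the bucket, key, and KMS alias are hypothetical, and omitting `SSEKMSKeyId` falls back to the account's default S3 KMS key:

```python
# Sketch: arguments requesting SSE-KMS on an S3 upload.
# Bucket, object key, and KMS key alias are illustrative only.
put_kwargs = {
    "Bucket": "my-data-lake",
    "Key": "raw/orders/2017/06/01.csv",
    "Body": b"order_id,amount\n1,9.99\n",
    "ServerSideEncryption": "aws:kms",
    "SSEKMSKeyId": "alias/datalake-key",
}

# With boto3: s3.put_object(**put_kwargs)
# For SSE-S3 (AWS-managed keys) use "AES256" and drop SSEKMSKeyId.
```

Using KMS rather than plain SSE-S3 buys the automated key rotation and auditing mentioned on the slide, at the cost of KMS API charges per request.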
  33. 33. Discovery Search Access
  34. 34. Data Lake API and UI • Exposes the metadata API, search, and Amazon S3 storage services to customers • Can be based on TVM/STS Temporary Access for many services, and a bespoke API for metadata • Drive all UI operations from API?
  35. 35. Amazon API Gateway • Host multiple versions and stages of APIs • Create and distribute API keys to developers • Leverage AWS Sigv4 to authorize access to APIs • Throttle and monitor requests to protect the backend • Leverages AWS Lambda
  36. 36. Data Quality Data Preparation Orchestration and Job Scheduling
  37. 37. Job Authoring with AWS Glue • Python code generated by AWS Glue • Connect a notebook or IDE to AWS Glue • Existing code brought into AWS Glue
  38. 38. Job Execution with AWS Glue • Schedule-based • Event-based • On demand
  39. 39. Instantly Query Your Data Lake on S3
  40. 40. AWS Batch
  41. 41. Write Database Changes to S3 with DMS • Full load: <schema_name>/<table_name>/LOAD001.csv, <schema_name>/<table_name>/LOAD002.csv • Change data capture: <schema_name>/<table_name>/<time-stamp>.csv
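Downstream jobs often need to treat the initial full-load files differently from the ongoing change-data-capture files. A small sketch that classifies a DMS output key by the naming pattern shown on this slide (the sample keys are hypothetical):

```python
# Sketch: classify DMS output files by name, following the pattern on
# the slide (LOADnnn.csv for the full load, a timestamp for CDC files).
def classify_dms_file(key):
    """Return 'full-load' or 'cdc' for a DMS output object key."""
    filename = key.rsplit("/", 1)[-1]
    if filename.startswith("LOAD"):
        return "full-load"
    return "cdc"

# Hypothetical schema/table names:
print(classify_dms_file("hr/employees/LOAD001.csv"))      # full-load
print(classify_dms_file("hr/employees/20170601-120000.csv"))  # cdc
```

CDC files arrive continuously, so a job that rebuilds the curated view typically replays the full-load files first and then applies the CDC files in timestamp order.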
  42. 42. Putting it All Together
  43. 43. Organizing and Managing Your Data Lake. Work backwards from the product: • Define application and analytics datasets • Identify and land raw sources of data • Cleanse, enrich, standardize to create trusted sources • Transform to create the product (zones: Raw, Trusted, Product)
  44. 44. Organizing and Managing Your Data Lake • Structuring and cataloguing: S3 prefixes (separate and partition data), AWS Glue Data Catalog (versioned metadata grouped by database/table), Amazon DynamoDB/Amazon Elasticsearch (indexing) • Archiving: S3 lifecycle policies • Ownership, classification, access control: buckets, tags, bucket policies, ACLs • Auditing: AWS CloudTrail and S3 access logging
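The prefix-based structuring above can be enforced with a small key builder that combines the zone convention from the previous slide with Hive-style date partitions (the zone and dataset names here are illustrative):

```python
# Sketch: build S3 keys with a zone prefix plus Hive-style date
# partitions, so Glue/Athena can prune partitions automatically.
def make_key(zone, dataset, year, month, day, filename):
    """Return an S3 key like raw/<dataset>/year=YYYY/month=MM/day=DD/<file>."""
    assert zone in ("raw", "trusted", "product"), "unknown zone"
    return (f"{zone}/{dataset}/"
            f"year={year:04d}/month={month:02d}/day={day:02d}/{filename}")

print(make_key("raw", "clickstream", 2017, 6, 1, "events.json"))
# -> raw/clickstream/year=2017/month=06/day=01/events.json
```

Keeping partition columns in `key=value` form means a Glue crawler can register the partitions without any extra configuration.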
  45. 45. Building a Data Strategy on AWS [Diagram: numbered end-to-end architecture (steps 1-8) featuring Amazon Kinesis Firehose, the Athena query service, and AWS Glue]
  46. 46. Analytics Capabilities
  47. 47. Processing & Analytics • Transactional & RDBMS: Amazon DynamoDB (NoSQL database), Amazon Aurora (relational database) • Streaming: Amazon Kinesis Streams & Firehose • Batch: Amazon EMR (Hadoop, Spark, Presto), Amazon Redshift (data warehouse), Amazon Athena (query service), AWS Batch • Real-time: AWS Lambda, Apache Storm on EMR, Apache Flink on EMR, Spark Streaming on EMR, Amazon Elasticsearch Service, Kinesis Analytics, Kinesis Streams • Caching: Amazon ElastiCache, DAX • BI & data visualization • Predictive
  48. 48. https://aws.amazon.com/big-data/partner-solutions/ AWS Partner Ecosystem Reduce the effort to move, cleanse, synchronize, manage, and automate data related processes
  49. 49. DATAPIPE.COM US: 877 773 3306 UK: +44 800 634 3414 HK: +852 3521 0215 AWS DATA LAKES Kiran Tamana, EMEA Head of Solutions Architecture, Datapipe
  50. 50. Datapipe: Premier Consulting Partner in EMEA and North America • Recognised as a leader by Gartner worldwide • Audited AWS Premier Consulting Partner and AWS Direct Connect Partner • Proprietary keyless management • Global presence and support: 29 data centres across 9 countries on 4 continents • 10x Stevie Award winner for customer service • 9+ million monthly instance hours managed • 170+ years combined AWS professional services experience • 2.2k+ days managing AWS (since 2010)
  51. 51. Data Lakes – A Maritime Use Case. A requirement to manage a global network of multiple data types, coming from a wide variety of business-critical maritime sources: 120,000 tracked vessels, 163,500 shipping companies, 2,800 ports, 19,000 terminals • Vessel tracking via satellite and terrestrial ID data • Vessel ownership & characteristic data • Vessel registration, safety and insurance data • Casualty, arrest and vessel inspection data • Fleet development data • Global ports and terminal data
  52. 52. Develop a Migration Strategy • Review ongoing technical deployments • Identify existing pain points • Conduct a situational analysis of the current system • Evaluate infrastructure for automation • Assess configuration usage • Recommend details on application deployment management
  53. 53. Data Lake Architecture. Client methodology: Ingest, Store, Deliver
  54. 54. Best Practices for Building Your Data Lake on AWS Presented by Derwin McGeary Solutions Architect, Cloudwick EMEA Contact us: services@cloudwick.com
  55. 55. Webinar Agenda ➢ About Cloudwick ➢ Reference Data Lake Architecture on AWS ➢ AWS Data Lake Implementations ➢ How to Get Started Contact us: services@cloudwick.com
  56. 56. About Cloudwick • Founded in 2010 • 3 global offices (US | EMEA | APAC) • 150+ professionals (US | EMEA | APAC) • Global 1000 proven success • 400+ certifications in big data, analytics, cybersecurity, visualization & cloud • Partnerships • 100+ data lake engagements Contact us: services@cloudwick.com
  57. 57. Reference Data Lake Architecture on AWS • Sources: marketing apps & databases, CRM databases, other semi-structured data sources, back-office processing data sources, any other data sources • Data ingestion layer: based on the velocity, volume and veracity of the data, pick the appropriate ingest services (e.g. Kafka, Flume, Sqoop, DMS, Kinesis) • Data store (raw) layer: the RAW layer of the data lake on S3 storage; objects in the RAW layer are immutable, and the structure is typically partitioned by date/time • Curated layer: transformations/ETL (Glue using PySpark/EMR) make the data useful and ready for the end user • Analytics layer: data modelled in the data warehouse, exposed via JDBC/ODBC connectors to BI tools, allowing for more reuse and sharing • Access layer: API Gateway plus native and 3rd-party BI and visualisation tools allow for innovative reuse, access, visualisation, and reporting on the data • Access-controlled zone: secure and controlled access for internal business units and 3rd-party companies Contact us: services@cloudwick.com
  58. 58. AWS Data Lake Implementations Contact us: services@cloudwick.com
  59. 59. SGN Use Case: Background & Challenges • One of the largest gas distribution companies • Reporting responsibilities • Data silos • Challenges: growing data sources, existing ETL process, too many silos, data provisioning for different use cases, operational and analytical reporting Contact us: services@cloudwick.com
  60. 60. Use Case: Data Lake and Migration for SGN Contact us: services@cloudwick.com
  61. 61. Where to Start? • Data inventory: what data/metadata do you already have? • Gap analysis: what business and technical metadata do you need for your use cases? • Value: knowing the value of data and getting value from data • Cloudwick can help you get started: Cloudwick Quickstart http://tinyurl.com/CloudwickQuickStart | Cloudwick Jumpstart http://tinyurl.com/CloudwickJumpStart Contact us: services@cloudwick.com
  62. 62. Thank you!
