Modern Data Platform on AWS

  1. 1. SUMMIT Amsterdam
  2. 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. SUMMIT. Modern Data Platform on AWS (ANT001). Damon Cortesi, Big Data Architect, AWS (@dacort); David Morel, Takeaway.com
  3. 3. A brief history of significant Big Data releases
      • 2004: Google publishes MapReduce paper
      • 2006: Hadoop is created; HBase development starts
      • 2008: Facebook launches Hive
      • 2009: AWS EMR announced
      • 2012: Facebook launches Presto; Apache Spark released
      • 2015: MXNet paper published
      • 2016: Amazon Athena & AWS Glue announced
  4. 4. There is more data than people think
      • Data grows >10x every 5 years
      • Data platforms need to live for 15 years
      • Data platforms need to scale 1,000x (10x compounded over three 5-year periods)
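The arithmetic behind the 1,000x figure can be checked directly. This is a minimal sketch, assuming growth compounds at a constant 10x per five-year period (the slide says ">10x", so the result is a lower bound); the function name is illustrative and not from the talk.

```python
def scale_factor(growth_per_period: int, period_years: int, lifetime_years: int) -> int:
    """Compound growth over a platform lifetime, counted in whole periods."""
    periods = lifetime_years // period_years
    return growth_per_period ** periods


# 10x every 5 years, over a 15-year platform lifetime
print(scale_factor(10, 5, 15))  # 1000
```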
  5. 5. More people are accessing data (data scientists, analysts, business users, applications), and there are more requirements for making data available (secure, real time, flexible, scalable)
  6. 6. AWS databases and analytics: broad and deep portfolio, built for builders
      • Business Intelligence & Machine Learning: Amazon QuickSight; Amazon SageMaker; Amazon Comprehend; Amazon Rekognition; Amazon Lex; Amazon Transcribe; AWS DeepLens
      • Analytics: Amazon Redshift (data warehousing); Amazon EMR (Hadoop + Spark); Athena (interactive analytics); Kinesis Analytics (real-time); Amazon Elasticsearch Service (operational analytics)
      • Databases: RDS (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server); Aurora (MySQL, PostgreSQL); DynamoDB (key value, document); ElastiCache (Redis, Memcached); Neptune (graph); Timestream (time series); QLDB (ledger database); RDS on VMware
      • Blockchain: Managed Blockchain; Blockchain Templates
      • Data Lake: S3/Amazon Glacier; AWS Glue (ETL & Data Catalog); Lake Formation (data lakes)
      • Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
      • AWS Marketplace: 250+ solutions; 730+ Database solutions; 600+ Analytics solutions; 25+ Blockchain solutions; 20+ Data lake solutions; 30+ solutions
  7. 7. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale
  8. 8. Data lake with AWS Glue [diagram]: Amazon S3 (raw data), Amazon S3 (staging data), Amazon S3 (processed data), each with Crawlers; AWS Glue Data Catalog
  10. 10. Amazon S3: Object Storage
      • Security and Compliance: three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
      • Flexible Management: classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
      • Durability, Availability & Scalability: built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
      • Query in Place: run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of an object's data, improving analytics performance by up to 400%
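To make "query in place" concrete, the sketch below builds the request for an S3 Select call that retrieves only two columns from a Parquet object. It is an illustration under stated assumptions: the bucket, key, and column names are hypothetical, and the helper only assembles the arguments so that it runs without AWS credentials. The commented call at the end shows how the request would be passed to boto3's `select_object_content`.

```python
from typing import Dict, List


def build_s3_select_request(bucket: str, key: str, columns: List[str], where: str) -> Dict:
    """Build keyword arguments for the boto3 S3 client's select_object_content."""
    column_list = ", ".join(f's."{c}"' for c in columns)
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": f"SELECT {column_list} FROM S3Object s WHERE {where}",
        "InputSerialization": {"Parquet": {}},
        "OutputSerialization": {"JSON": {}},
    }


request = build_s3_select_request(
    bucket="example-data-lake",                 # hypothetical bucket
    key="processed/orders/part-0000.parquet",   # hypothetical key
    columns=["order_id", "total"],              # hypothetical columns
    where="s.\"country\" = 'NL'",
)
print(request["Expression"])

# With AWS credentials configured, the request could be sent like this:
#   import boto3
#   response = boto3.client("s3").select_object_content(**request)
#   for event in response["Payload"]:
#       if "Records" in event:
#           print(event["Records"]["Payload"].decode("utf-8"))
```

Only the selected columns and matching rows leave S3, which is where the performance gain on the slide comes from.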
  11. 11. Data Movement From Real-time Sources
      • Amazon Kinesis Video Streams: securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
      • Amazon Kinesis Data Firehose: capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
      • Amazon Kinesis Data Streams: build custom, real-time applications that process data streams using popular stream processing frameworks
      • AWS IoT Core: supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
      • Managed Streaming for Kafka: fully managed open-source platform for building real-time streaming data pipelines and applications
  12. 12. Amazon Kinesis Data Streams
  13. 13. Amazon Kinesis Data Firehose
  14. 14. Kinesis events to S3 [diagram]: Clients, Kinesis Data Streams, Kinesis Data Firehose (Lambda transformation, save as Parquet), aggregated Parquet data, aggregated JSON data (source backup), Crawlers, Amazon Athena
      • Prefix: raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
      • Buffer: up to 128 MB or 15 minutes
      • New as of 12th Feb: support for custom S3 prefix
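The custom prefix on this slide uses `!{timestamp:...}` expressions that Firehose fills in at delivery time. The sketch below imitates that expansion locally so the resulting S3 layout is easy to see. It is an approximation under stated assumptions: only the four date fields in `_PATTERNS` are handled, whereas Firehose accepts Java DateTimeFormatter patterns, and the function name and sample timestamp are illustrative.

```python
import re
from datetime import datetime, timezone

# Only these date fields are mapped; this does not emulate the full
# Java DateTimeFormatter syntax that Firehose accepts.
_PATTERNS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}


def expand_prefix(prefix: str, ts: datetime) -> str:
    """Expand !{timestamp:<pattern>} expressions in an S3 prefix."""
    def replace(match: re.Match) -> str:
        return ts.strftime(_PATTERNS[match.group(1)])

    return re.sub(r"!\{timestamp:(\w+)\}", replace, prefix)


prefix = "raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
arrival = datetime(2019, 2, 12, 9, 30, tzinfo=timezone.utc)  # sample arrival time
print(expand_prefix(prefix, arrival))
# raw/life/year=2019/month=02/day=12/
```

The `key=value` folder names follow the Hive partitioning convention, so a crawler can register `year`, `month`, and `day` as partition columns.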
  15. 15. Data Movement From On-premises Datacenters
      • AWS Snowball, Snowball Edge and Snowmobile: petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud
      • AWS Direct Connect: establish a dedicated network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
      • AWS Storage Gateway: lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache
      • AWS Database Migration Service: migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications
  16. 16. AWS Database Migration Service
  17. 17. DMS to S3 [diagram]: Source database, AWS Database Migration Service, snapshot data, Crawlers, Data catalog, AWS Glue, Amazon Athena, Amazon EMR, Amazon Redshift
      • New as of 25th March: support for Parquet; support for S3 encryption with KMS
  18. 18. DMS to S3: Change Data Capture (CDC)
      • Challenging to do easily
      • Need to maintain a staging table and reconstitute the dataset

      from pyspark.sql import functions as F
      from pyspark.sql.window import Window

      # df2 is assumed to hold the DMS change rows; "cdc" is the operation flag (I/U/D)
      newDf = df2.filter("cdc = 'I'")
      updDf = df2.filter("cdc = 'U'")
      delDf = df2.filter("cdc = 'D'")

      # Keep only the most recent update per id (highest idx)
      w = Window.partitionBy("id").orderBy(F.col("idx").desc())
      latestUpdateDf = updDf.withColumn("rn", F.row_number().over(w)).where(F.col("rn") == 1).drop("rn")

      # Create the update table, join to the original table,
      # keep only the original rows that have no update, then union
      tempDf = latestUpdateDf.select("id").withColumnRenamed("id", "id_1")
      filteredBaseDf = newDf.join(tempDf, newDf.id == tempDf.id_1, "left")
      filteredBaseDf = filteredBaseDf.filter("id_1 is null").drop("id_1")
      insertAndUpdateDf = filteredBaseDf.union(latestUpdateDf)

      # Ok, now remove any deleted rows: keep only ids with no matching delete
      tempDf = delDf.select("id").withColumnRenamed("id", "id_del")
      finalDf = insertAndUpdateDf.join(tempDf, insertAndUpdateDf.id == tempDf.id_del, "left")
      finalDf = finalDf.filter("id_del is null").drop("id_del")
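The intended result of a CDC merge can be checked on a toy dataset without Spark. The sketch below is a plain-Python reference, not the slide's method: it replays change rows sequentially in `idx` order instead of using set-based joins. The `id`, `idx`, and `cdc` fields mirror the column names on the slide; the `value` field and the sample rows are hypothetical.

```python
def apply_cdc(rows):
    """Replay change rows in idx order and return the surviving records by id."""
    current = {}
    for row in sorted(rows, key=lambda r: r["idx"]):
        if row["cdc"] in ("I", "U"):
            current[row["id"]] = row["value"]   # insert or overwrite
        elif row["cdc"] == "D":
            current.pop(row["id"], None)        # delete if present
    return current


changes = [
    {"idx": 1, "cdc": "I", "id": 1, "value": "a"},
    {"idx": 2, "cdc": "I", "id": 2, "value": "b"},
    {"idx": 3, "cdc": "U", "id": 1, "value": "a2"},
    {"idx": 4, "cdc": "U", "id": 1, "value": "a3"},
    {"idx": 5, "cdc": "D", "id": 2, "value": None},
]
print(apply_cdc(changes))  # {1: 'a3'}
```

A reference like this is handy as a unit-test oracle for a distributed implementation: id 1 keeps only its latest update and id 2 disappears because it was deleted.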
  19. 19. AWS Glue ETL (New!)
  20. 20. Third-party API to S3 [diagram]: 3rd-party API, AWS Glue Python Shell, incremental exports, Crawlers, Data catalog, Glue ETL, transformed data, Amazon Athena, Amazon Redshift
  21. 21. Parquet File Format
      • Row group metadata allows a Parquet reader to skip portions of files, or entire files
      • Columnar format is optimized for analytics
      • Column metadata allows for pre-aggregation
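How row group metadata lets a reader skip data can be shown with a toy model. The sketch below is not a Parquet reader: `RowGroup` is a hypothetical stand-in that stores min/max statistics for a single integer column, and the scan consults those statistics before touching the values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group with min/max stats for one column."""
    min_value: int
    max_value: int
    values: List[int]


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        if group.max_value <= threshold:
            continue  # statistics prove nothing can match; skip without reading
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
print(scan_greater_than(groups, 18))  # ([20, 21, 25, 30], 2)
```

The first row group is never read because its maximum is below the threshold. The same idea applied to file-level statistics is what allows entire files to be skipped.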
  22. 22. Parquet
      • Previously it was common to deliver in JSON/CSV/text and then run another process to convert to Parquet. It's becoming more common to deliver straight to Parquet.
      • Kinesis Data Firehose: added support May 2018; custom prefix support Feb 2019; requires schema in Glue Data Catalog
      • Athena: CREATE TABLE AS SELECT, Oct 2018
      • EMR: S3-optimized Parquet committer, Nov 2018
      • Database Migration Service: added Parquet support Mar 2019
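As an illustration of the Athena route, the sketch below assembles a CREATE TABLE AS SELECT statement that rewrites an existing table as Parquet. The table names and S3 location are hypothetical, and the helper only builds the SQL text. Submitting it to Athena is left out, and plain string interpolation like this is only suitable for trusted inputs.

```python
def build_ctas(target: str, source: str, location: str) -> str:
    """Build an Athena CREATE TABLE AS SELECT statement that writes Parquet."""
    return (
        f"CREATE TABLE {target}\n"
        f"WITH (format = 'PARQUET', external_location = '{location}')\n"
        f"AS SELECT * FROM {source}"
    )


query = build_ctas(
    target="events_parquet",                              # hypothetical table
    source="events_json",                                 # hypothetical table
    location="s3://example-data-lake/processed/events/",  # hypothetical path
)
print(query)
```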
  24. 24. AWS Glue ETL (New!)
  25. 25. Amazon EMR
  26. 26. Amazon Redshift
  27. 27. Amazon Athena [diagram]: Data Lake, AWS Glue Data Catalog, Permissions, Reporting & Analytics, Machine Learning, Custom Applications
  28. 28. Amazon EMR Notebooks in the Console: a managed analytics environment based on Jupyter Notebooks
      • Notebook users work in an EMR-managed notebook, based on Jupyter, in the AWS Management Console for EMR
      • Auto saves notebook file to your S3 bucket
      • Run queries on your remote Amazon EMR clusters (EMR VPC, Customer VPC)
  29. 29. Amazon QuickSight
  31. 31. Data lake with AWS Glue [diagram]: Amazon S3 (raw data), Amazon S3 (staging data), Amazon S3 (processed data), each with Crawlers; AWS Glue Data Catalog
  32. 32. AWS Lake Formation: build a secure data lake in days
      • Identify, ingest, clean, and transform data
      • Enforce security policies across multiple services
      • Gain and manage new insights
  33. 33. How it works
  34. 34. Easily load data to your data lake [diagram]: logs, DBs, Blueprints (one-shot, incremental); Lake Formation: Data import, Crawlers, ML-based data prep, Data Lake Storage, Data Catalog, Access Control
  35. 35. Blueprints build on AWS Glue
  36. 36. Easily de-duplicate your data with ML transforms
  37. 37. Secure once, access in multiple ways [diagram]: Admin; Lake Formation: Data Lake Storage, Data Catalog, Access Control
  38. 38. Security permissions in Lake Formation
      • Control data access with simple grant and revoke permissions
      • Specify permissions on tables and columns rather than on buckets and objects
      • Easily view policies granted to a particular user
      • Audit all data access in one place
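A column-level grant of the kind described above might look like the following sketch. It builds the arguments for the Lake Formation GrantPermissions API without calling AWS, so it runs without credentials. The role ARN, database, table, and column names are all hypothetical.

```python
from typing import Dict, List


def build_column_grant(principal_arn: str, database: str, table: str,
                       columns: List[str]) -> Dict:
    """Build keyword arguments for a SELECT grant on specific table columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


grant = build_column_grant(
    principal_arn="arn:aws:iam::123456789012:role/analyst",  # hypothetical role
    database="sales",                                        # hypothetical database
    table="orders",                                          # hypothetical table
    columns=["order_id", "total"],                           # hypothetical columns
)
print(grant["Resource"]["TableWithColumns"]["ColumnNames"])

# With AWS credentials configured:
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant)
# Passing the same arguments to revoke_permissions would undo the grant.
```

The grant names a table and its columns, never an S3 bucket or object path, which is the shift in permission model the slide describes.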
  39. 39. AWS Lake Formation Pricing: no additional charges. Only pay for the underlying services used.
  41. 41. A tale of AWS at Takeaway.com Data Engineering in the Business Intelligence team
  42. 42. 1. Once upon a time
  43. 43. 2. Learning
  44. 44. 3. The kingdom
  45. 45. 4. Lessons
  46. 46. 5. Complexity
  47. 47. 6. Flexibility
  48. 48. 7. Simplicity
  49. 49. 8. Expansion
  50. 50. 9. Happily ever after
  51. 51. Thank you!