Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

ABD201-Big Data Architectural Patterns and Best Practices on AWS

6,564 views

Published on

In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.

ABD201-Big Data Architectural Patterns and Best Practices on AWS

  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT Big Data Architectural Patterns and Best Practices on AWS S i v a R a g h u p a t h y S r . M a n a g e r , A I , A n a l y t i c s , a n d D a t a b a s e S o l u t i o n s A r c h i t e c t u r e A W S N o v e m b e r 2 7 , 2 0 1 7 A B D 2 0 1
  2. 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What to Expect from the Session Big data challenges Architectural principles How to simplify big data processing What technologies should you use? • Why? • How? Reference architecture Design patterns
  3. 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ever Increasing Big Data Volume Velocity Variety
  4. 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Batch processing Stream processing Artificial intelligence Big Data Evolution
  5. 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Virtual machines Managed services Serverless Cloud Services Evolution
  6. 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Plethora of Tools Amazon Glacier Amazon S3 Amazon DynamoDB Amazon RDS Amazon EMR Amazon Redshift Amazon Kinesis Lambda Amazon ML Amazon SQS ElastiCache Amazon DynamoDB Streams Amazon ES Amazon Kinesis Analytics Amazon QuickSight AWS Glue
  7. 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Big Data Challenges Why? How? What tools should I use? Is there a reference architecture?
  8. 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Architectural Principles Build decoupled systems • Data → Store → Process → Store → Analyze → Answers Use the right tool for the job • Data structure, latency, throughput, access patterns Leverage managed and serverless services • Scalable/elastic, available, reliable, secure, no/low admin Use log-centric design patterns • Immutable logs (data lake), materialized views Be cost-conscious • Big data ≠ big cost AI/ML enable your applications
  9. 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT STORE PROCESS/ ANALYZE CONSUME Time to answer (Latency) Throughput Cost Simplify Big Data Processing
  10. 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What Is the Temperature of Your Data?
  11. 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Hot Warm Cold Volume MB–GB GB–TB PB–EB Item size B–KB KB–MB KB–TB Latency ms ms, sec min, hrs Durability Low–high High Very high Request rate Very high High Low Cost/GB $$-$ $-¢¢ ¢ Hot data Warm data Cold data Data Characteristics: Hot, Warm, Cold
  12. 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT
  13. 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT Devices Sensors IoT platforms AWS IoT STREAMS IoT EventsData streams Migration Snowball Logging Amazon CloudWatch AWS CloudTrail FILES DataTransport&Logging Import/expo rt Files Log files Media files Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications Transactions Data structures Database records Type of Data
  14. 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. STORE
  15. 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Events Files Transactions COLLECT Devices Sensors IoT platforms AWS IoT STREAMS IoT Data streams Migration Snowball Logging Amazon CloudWatch AWS CloudTrail FILES DataTransport&Logging Import/expo rt Log files Media files Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications Data structures Database records Type of Data STORE NoSQL In-memory SQL File/object store Stream storage
  16. 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT Devices Sensors IoT platforms AWS IoT STREAMS IoT Migration Snowball Logging Amazon CloudWatch AWS CloudTrail FILES DataTransport&Logging Import/expo rt Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications STORE NoSQL In-memory SQL File/object store Stream storage
  17. 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT Devices Sensors IoT platforms AWS IoT STREAMS IoT Migration Snowball Logging Amazon CloudWatch AWS CloudTrail FILES DataTransport&Logging Import/expo rt Mobile apps Web apps Data centers AWS Direct Connect RECORDS Applications STORE NoSQL In-memory SQL File/object store Amazon Kinesis Firehose Amazon Kinesis Streams Apache Kafka Apache Kafka • High throughput distributed streaming platform Amazon Kinesis Streams • Managed stream storage Amazon Kinesis Firehose • Managed data delivery Stream Storage
  18. 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Decouple producers & consumers Persistent buffer Collect multiple streams Preserve client ordering Parallel consumption Streaming MapReduce 4 4 3 3 2 2 1 1 4 3 2 1 4 3 2 1 4 3 2 1 4 3 2 1 4 4 3 3 2 2 1 1 shard 1 / partition 1 shard 2 / partition 2 Consumer 1 Count of red = 4 Count of violet = 4 Consumer 2 Count of blue = 4 Count of green = 4 DynamoDB stream Amazon Kinesis stream Kafka topic Why Stream Storage?
  19. 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Decouple producers & consumers • Persistent buffer • Collect multiple streams • No client ordering (standard) • FIFO queue preserves client ordering • No streaming MapReduce • No parallel consumption • Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions) What About Amazon SQS? Consumers 4 3 2 1 12344 3 2 1 1234 2134 13342 Standard FIFO Publisher Amazon SNS Topic AWS Lambda function Amazon SQS queue Queue Subscriber
  20. 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Stream/Message Storage Should I Use? Hot Warm Amazon Kinesis Streams Amazon Kinesis Firehose Apache Kafka (on Amazon EC2) Amazon SQS (Standard) Amazon SQS (FIFO) AWS managed Yes Yes No Yes Yes Guaranteed ordering Yes No Yes No Yes Delivery (deduping) At least once At least once At least/At most/exactly once At least once Exactly once Data retention period 7 days N/A Configurable 14 days 14 days Availability 3 AZ 3 AZ Configurable 3 AZ 3 AZ Scale / throughput No limit / ~ shards No limit / automatic No limit / ~ nodes No limits / automatic 300 TPS / queue Parallel consumption Yes No Yes No No Stream MapReduce Yes N/A Yes N/A N/A Row/object size 1 MB Destination row/object size Configurable 256 KB 256 KB Cost Low Low Low (+admin) Low-medium Low-medium
  21. 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT STORE File Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Hot Stream Mobile apps Web apps Devices Sensors IoT platforms AWS IoT Data centers AWS Direct Connect Migration Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS FILES STREAMS DataTransport&LoggingIoTApplications Import/expo rt NoSQL In-memory SQL Amazon S3 Amazon S3 Managed object storage service built to store and retrieve any amount of data File/Object Storage
  22. 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use Amazon S3 as Your Persistent File Store • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • Decouple storage and compute • No need to run compute clusters for storage (unlike HDFS) • Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances • Multiple & heterogeneous analysis clusters and services can use the same data • Designed for 99.999999999% durability • No need to pay for data replication within a region • Secure: SSL, client/server-side encryption at rest • Low cost
  23. 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What About HDFS & Data Tiering? • Use HDFS for hottest datasets (e.g. iterative read on the same datasets) • Use Amazon S3 Standard for frequently accessed data • Use Amazon S3 Standard – IA for less frequently accessed data • Use Amazon Glacier for archiving cold data • Use S3 Analytics to optimize tiering strategy S3 Analytics
  24. 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT STORE Amazon DynamoDB Amazon RDS Amazon Aurora File Mobile apps Web apps Devices Sensors IoT platforms AWS IoT Data centers AWS Direct Connect Migration Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS FILES STREAMS LoggingIoTApplications Amazon S3 Amazon DAX Amazon ElastiCache Import/expo rt SQLNoSQLCache Amazon ElastiCache • Managed Memcached or Redis service Amazon DynamoDB Accelerator (DAX) • Managed in-memory cache for DynamoDB Amazon DynamoDB • Managed NoSQL database service Amazon RDS • Managed relational database service Cache & Database Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Hot Stream
  25. 25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Anti-Pattern Database Tier
  26. 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practice: Use the Right Tool for the Job SearchIn-memory SQLNoSQL Database Tier GraphDB Amazon RDS/AuroraAmazon DynamoDBAmazon ElastiCache Amazon DynamoDB Acclerator SAP HANA Amazon ES Amazon CloudSearch
  27. 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Materialized Views and Immutable Log Cache View Search View Immutable Log
  28. 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Data Store Should I Use? Data structure → Fixed-schema, JSON, Key/Value, Access patterns → Store data in the format you will access it Data characteristics → Hot, warm, cold Cost → Right cost
  29. 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Structure and Access Patterns Access Patterns What to use? Put/Get (key, value) In-memory, NoSQL Simple relationships → 1:N, M:N NoSQL Multi-table joins, transaction, SQL SQL Faceting, Search Search Graph traversal GraphDB Data Structure What to use? Fixed schema SQL, NoSQL Schema-free (JSON) NoSQL, Search Key/Value In-memory, NoSQL Graph GraphDB
  30. 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. In-memory SQL Request rate High Low Cost/GB High Low Latency Low High Data volume Low High Amazon Glacier Structure NoSQL Hot data Warm data Cold data Low High
  31. 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon ElastiCache Amazon DAX Amazon DynamoDB Amazon RDS (Aurora) Amazon ES Amazon S3 Amazon Glacier Average Latency µs-ms µs-ms ms ms, sec ms,sec ms,sec,min (~ size) hrs Typical Data Volume GB GB GB–TBs (no limit) GB–TB (64 TB max) GB–TB MB–PB (no limit) GB–PB (no limit) Typical Item Size B-KB KB (400 KB max) KB (400 KB max) KB (64 KB max) B-KB (2 GB max) KB-TB (5 TB max) GB (40 TB max) Request Rate High – very high High – very high Very high (no limit) High High Low – high (no limit) Very low Storage Cost GB/Month $$ $$ ¢¢ ¢¢ ¢¢ ¢ ¢4/10 Durability Low - moderate NA Very high Very high High Very high Very high Availability High 2 AZ High 3 AZ Very high 3 AZ Very high 3 AZ High 2 AZ Very high 3 AZ Very high 3 AZ Hot data Warm data Cold data Which Data Store Should I Use?
  32. 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Cost-Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB? “I’m currently scoping out a project. The design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…” Request rate (Writes/sec) Object size (Bytes) Total size (GB/month) Objects per month 300 2048 1483 777,600,000
  33. 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Request rate (writes/sec) Object size (bytes) Total size (GB/month) Objects per month 300 2,048 1,483 777,600,000 Amazon S3 or Amazon DynamoDB? https://calculator.s3.amazonaws.com/index.html
  34. 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon S3 Wins! 300 23,730 777,600,00032,768 Amazon DynamoDB Wins! Request rate (writes/sec) Object size (bytes) Total size (GB/month) Objects per month 300 2,048 1,483 777,600,000 Amazon S3 or Amazon DynamoDB? Scenario 2 Scenario 1 Amazon S3 Standard Storage $34 Put/list requests $3,888 Total $3,922 Amazon DynamoDB Provisioned throughput $273 Indexed data storage $383 Total $656 Amazon S3 Standard Storage $545 Put/List Requests $3,888 Total $4,433 Amazon DynamoDB Provisioned Throughput $4,556 Indexed Data Storage $5,944 Total $10,500
  35. 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. PROCESS / ANALYZE
  36. 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. API-driven Services • Amazon Lex – Speech recognition • Amazon Polly – Text to speech • Amazon Rekognition – Image analysis Managed ML Platforms • Amazon ML • Apache Spark ML on EMR AWS Deep Learning AMI • Pre-installed with MXNet, TensorFlow, Caffe2 (and Caffe), Theano, Torch, Microsoft Cognitive Toolkit, and Keras; plus DL tools/drivers A m a z o n A I Predictive Analytics PROCESS/ANALYZE Predictive AmazonAI Lex PollyAML Rekognition AWS DL AMI Developers Data scientists Deep learning experts
  37. 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. PROCESS / ANALYZE Amazon Redshift & Spectrum Amazon Athena BatchInteractive Amazon ES Interactive and Batch Analytics Amazon ES • Managed Service for Elasticsearch Amazon Redshift and Amazon Redshift Spectrum • Managed Data Warehouse • Spectrum enables querying Amazon S3 Amazon Athena • Serverless Interactive Query Service Amazon EMR • Managed Hadoop Framework for running Apache Spark, Flink, Presto, Tez, Hive, Pig, HBase, etc. Presto Amazon EMR
  38. 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Stream/Real-time Analytics Spark Streaming on Amazon EMR Amazon Kinesis Analytics • Managed Service for running SQL on Streaming data Amazon KCL • Amazon Kinesis Client Library AWS Lambda • Run code Serverless (without provisioning or managing servers) • Services such as S3 can publish events to Lambda • Lambda can pool event from a Kinesis PROCESS / ANALYZE KCL Apps AWS Lambda Amazon Kinesis Analytics Stream Streaming Amazon EMR
  39. 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Analytics Should I Use? PROCESS / ANALYZE Batch Takes minutes to hours Example: Daily/weekly/monthly reports Amazon EMR (MapReduce, Hive, Pig, Spark) Interactive Takes seconds Example: Self-service dashboards Amazon Redshift, Amazon Athena, Amazon EMR (Presto, Spark) Stream Takes milliseconds to seconds Example: Fraud alerts, 1 minute metrics Amazon EMR (Spark Streaming), Amazon Kinesis Analytics, KCL, AWS Lambda, etc. Predictive Takes milliseconds (real-time) to minutes (batch) Example: Fraud detection, Forecasting demand, Speech recognition Amazon AI (Lex, Polly, ML, Amazon Rekognition), Amazon EMR (Spark ML), Deep Learning AMI (MXNet, TensorFlow, Theano, Torch, CNTK, and Caffe) Streaming Amazon Kinesis Analytics KCL Apps AWS Lambda Stream Amazon EMR Fast Amazon ES Amazon Redshift & Spectrum Presto Amazon EMR Amazon Athena BatchInteractive FastSlow Predictive AmazonAI Lex PollyAML Rekognition AWS DL AMI
  40. 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Stre am Proce ssing T e chnolog y Shou ld I Use ? Amazon EMR (Spark Streaming) KCL Application Amazon Kinesis Analytics AWS Lambda Managed Service Yes No (EC2 + Auto Scaling) Yes Yes Serverless No No Yes Yes Scale/Throughput No limits / ~ nodes No limits / ~ nodes No limits / automatic No limits / automatic Availability Single AZ Multi-AZ Multi-AZ Multi-AZ Programming Languages Java, Python, Scala Java, others via MultiLangDaemon ANSI SQL with extensions Node.js, Java, Python Sliding Window Functions Build-in App needs to implement Built-in No Reliability KCL and Spark checkpoints Managed by KCL Managed by Amazon Kinesis Analytics Managed by AWS Lambda
  41. 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Which Analytics Tool Should I Use? Amazon Redshift Amazon Redshift Spectrum Amazon Athena Amazon EMR Presto Spark Hive Use case Optimized for data warehousing Query S3 data from Amazon Redshift Interactive Queries over S3 data Interactive Query General purpose Batch Scale/Throughput ~Nodes ~Nodes Automatic ~ Nodes Managed Service Yes Yes Yes, Serverless Yes Storage Local storage Amazon S3 Amazon S3 Amazon S3, HDFS Optimization Columnar storage, data compression, and zone maps AVRO, PARQUET TEXT, SEQ RCFILE, ORC, etc. AVRO, PARQUET TEXT, SEQ RCFILE, ORC, etc. Framework dependent Metadata Amazon Redshift Catalog Glue Catalog Glue Catalog Glue Catalog or Hive Meta-store Auth/Access controls IAM, Users, groups, and access controls IAM, Users, groups, and access controls IAM IAM, LDAP & Kerberos UDF support Yes (Scalar) Yes (Scalar) No Yes Slow
  42. 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What About Extract Transform and Load? ETLSTORE PROCESS / ANALYZE Data Integration Partners Reduce the effort to move, cleanse, synchronize, manage, and automatize data-related processes. AWS Glue is a fully managed (serverless) ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. Data Catalog Job Authoring Job Execution AWS Glue
  43. 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. COLLECT STORE PROCESS/ANALYZE Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon ElastiCache Amazon RDS Amazon Aurora HotHotWarm SQLNoSQLCacheFileStream Mobile apps Web apps Devices Sensors IoT platforms AWS IoT Data centers AWS Direct Connect Migration Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS FILES STREAMS DataTransport&LoggingIoTApplications Slow Amazon S3 Amazon ES Amazon Redshift & Spectrum Presto Amazon EMR Fast Amazon Athena BatchInteractive Amazon DAX Import/expo rt Predictive AmazonAI Lex PollyAML Rekognition AWS DL AMI ETL Streaming Amazon Kinesis Analytics KCL Apps Fast Stream Amazon EMR AWS Lambda Fast CONSUME
  44. 44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. CONSUME
  45. 45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • BI/AI Applications • Amazon EC2 or ECS Containers • AWS Greengrass • Data Science • Notebooks • DS Platforms • IDEs • Analysis and Visualization • Amazon QuickSight • Tableau • …. COLLECT STORE CONSUMEPROCESS/ANALYZEETL Amazon QuickSight Analysis&visualization Model Train/ Eval Models Deploy DataSceince AI Apps Amazon ECS Apps AWS Greengrass Predictive AmazonAI Lex PollyAML Rekognition AWS DL AMI Business users DevOps Data Scientists
  46. 46. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Putting It All Together
  47. 47. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Streaming Amazon Kinesis Analytics KCL Apps AWS Lambda COLLECT STORE CONSUMEPROCESS/ANALYZE Amazon ES Apache Kafka Amazon Kinesis Streams Amazon Kinesis Firehose Amazon DynamoDB Amazon ElastiCache Amazon RDS Amazon Aurora HotHotWarm Fast Stream SQLNoSQLCacheFileStream Mobile apps Web apps Devices Sensors IoT platforms AWS IoT Data centers AWS Direct Connect Migration Snowball Logging Amazon CloudWatch AWS CloudTrail RECORDS FILES STREAMS Amazon QuickSight Analysis&visualizationDataSceince DataTransport&LoggingIoTApplications Amazon EMR Amazon Redshift & Spectrum Presto Amazon EMR FastSlow Amazon Athena BatchInteractivePredictive AmazonAI Amazon S3 Amazon DAX Import/expo rt Lex PollyAML Rekognition AWS DL AMI AI Apps Amazon ECS Apps Model Train/ Eval Models Deploy ETL AWS Greengrass
  48. 48. Design Patterns
  49. 49. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Spark Streaming AWS Lambda KCL apps Amazon Redshift Amazon Redshift Hive Spark Presto ProcessingTechnology FastSlow Answers Hive Native apps KCL apps AWS Lambda Amazon Athena Amazon Kinesis Amazon DynamoDB/RDS Amazon S3Data Hot Cold Data Store
  50. 50. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Real-time Analytics Amazon EMR KCL app AWS Lambda Spark Streaming Amazon AI Real-time prediction Amazon ElastiCache (Redis) Amazon DynamoDB Amazon RDS Amazon ES App state or Materialized View KPI Process Store Amazon Kinesis Amazon Kinesis Analytics Amazon SNS NotificationsAlerts Amazon S3 Log Amazon KinesisFan out Downstream
  51. 51. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Interactive and Batch Analytics process store Batch Interactive Amazon EMR Hive Pig Spark Amazon AI Batch prediction Real-time prediction Amazon S3 Files Amazon Kinesis Firehose Amazon Kinesis Analytics Amazon Redshift Amazon ES Consume Amazon EMR Presto Spark Amazon Athena
  52. 52. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Real-time App state or Materialized View Interactive and Batch Data Lake Amazon S3 Amazon Redshift Amazon EMR Presto Hive Pig Spark Amazon ElastiCache Amazon DynamoDB Amazon RDS Amazon ES AWS Lambda Spark Streaming on Amazon EMR Applications Amazon Kinesis KCL Amazon AI Amazon DynamoDB Amazon RDS Change Data Capture or Export Transactions Stream Files Amazon Kinesis Analytics Amazon Athena Amazon Kinesis Firehose Amazon ES
  53. 53. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What about Metadata? • Glue Catalog • Hive Metastore compliant • Crawlers - Detect new data, schema, partitions • Search - Metadata discovery • Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum compatible • Hive Metastore (Presto, Spark, Hive, Pig) • Can be hosted on Amazon RDS Data Catalog Metastore RDS Glue Catalog AWS Glue
  54. 54. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Security & Governance • AWS Identity and Access Management (IAM) • Amazon Cognito • Amazon CloudWatch & AWS CloudTrail • AWS KMS • AWS Directory Service • Apache Ranger Security & Governance IAM Amazon CloudWatch AWS CloudTrail AWS AWSKMS AWS CloudHSM AWS Directory Service Amazon Cognito
  55. 55. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data lake Reference Architecture Summary
  56. 56. Summary Build decoupled systems • Data → Store → Process → Store → Analyze → Answers Use the right tool for the job • Data structure, latency, throughput, access patterns Leverage AWS managed and serverless services • Scalable/elastic, available, reliable, secure, no/low admin Use log-centric design patterns • Immutable logs, data lake, materialized views Be cost-conscious • Big data ≠ Big cost AI/ML enable your applications
  57. 57. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU!

×