Database and Analytics on the AWS Cloud - AWS Innovate Toronto


  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Antoine Généreux, AWS Solutions Architect May 10, 2017 Databases and Analytics on the AWS Cloud
  3. 3. What to expect from the session • Challenges and architectural principles • How to simplify big data processing • What database technologies should you use? • Why and When? • How? • Architectural patterns and Customer Examples
  4. 4. Key Ideas • Data is your organization’s most valuable resource • All data has the potential to be big data • Databases are no longer the center of analytics* *but they play a critical role!
  5. 5. Evolution of Analytics: batch analytics → real-time analytics → predictive/adaptive analytics
  6. 6. A plethora of tools: Amazon Glacier, Amazon S3, DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, Amazon Kinesis Streams applications, AWS Lambda, Amazon ML, Amazon SQS, ElastiCache, DynamoDB Streams, Amazon Elasticsearch Service, Amazon Kinesis Analytics
  7. 7. Challenges • Is there a reference architecture? • What tools should I use? • How? • Why?
  8. 8. Architectural Principles
       • Build decoupled systems: Data → Store → Process → Store → Analyze → Answers
       • Use the right tool for the job: data structure, latency, throughput, access patterns
       • Leverage AWS managed services: scalable/elastic, available, reliable, secure, no/low admin
       • Use log-centric design patterns: immutable logs, materialized views (schema-on-read)
       • Be cost-conscious: big data ≠ big cost
  9. 9. Simplify Big Data Processing: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights, with time to answer (latency), throughput, and cost as the trade-offs at each stage.
  10. 10. Types of Data (COLLECT): applications (web apps, mobile apps) produce records, database records, transactions, and in-memory data structures; logging (Amazon CloudWatch, AWS CloudTrail) produces log files and search documents; transport (data centers over AWS Direct Connect, AWS Import/Export Snowball) moves files; messaging produces messages; devices, sensors & IoT platforms (AWS IoT) produce events and data streams.
  11. 11. Data Temperature

                       Hot          Warm        Cold
       Volume          MB–GB        GB–TB       PB–EB
       Item size       B–KB         KB–MB       KB–TB
       Latency         ms           ms, sec     min, hrs
       Durability      Low–high     High        Very high
       Request rate    Very high    High        Low
       Cost/GB         $$-$         $-¢¢        ¢
  12. 12. Types of Data Stores (STORE): Database (SQL & NoSQL databases); Search (search engines); File store (file systems); Queue (message queues); Stream storage (pub/sub message queues); In-memory (caches, data structure servers). These are fed by the COLLECT sources above (applications, logging, transport, messaging, devices & IoT).
  13. 13. Message & Stream Storage: Amazon SQS (managed message queue service); Apache Kafka (high-throughput distributed streaming platform); Amazon Kinesis Streams (managed stream storage + processing); Amazon Kinesis Firehose (managed data delivery); Amazon DynamoDB (managed NoSQL database whose tables can be stream-enabled via DynamoDB Streams).
  14. 14. Why Stream Storage? Decouple producers & consumers; persistent buffer; collect multiple streams; preserve client ordering; parallel consumption; streaming MapReduce. In the diagram, producers 1–n write keyed records (e.g., key = violet) to the shards/partitions of a DynamoDB stream, Amazon Kinesis stream, or Kafka topic; consumer 1 counts red and violet records while consumer 2 counts blue and green records, each reading the same stream in parallel.
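To make the ordering point concrete, here is a minimal Python producer sketch using boto3; the stream name "clickstream" and the event fields are illustrative assumptions, not part of the deck.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def put_event(event):
    # Records sharing a partition key land on the same shard,
    # so per-key ordering is preserved for downstream consumers.
    kinesis.put_record(
        StreamName="clickstream",               # assumed stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),     # assumed event field
    )

put_event({"user_id": "u-123", "action": "view", "page": "/home"})
```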
  15. 15. File Storage: Amazon S3 joins the picture as the file store, alongside the stream stores (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams) and the message store (Amazon SQS), for the collected records, documents, files, messages, and streams.
  16. 16. Why Is Amazon S3 Good for Analytics? • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • No need to run compute clusters for storage (unlike HDFS) • Can run transient Hadoop clusters & Amazon EC2 Spot Instances • Multiple & heterogeneous analysis clusters can use the same data • Unlimited number of objects and volume of data • Very high bandwidth – no aggregate throughput limit • Designed for 99.99% availability – can tolerate zone failure • Designed for 99.999999999% durability • No need to pay for data replication • Native support for versioning • Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies • Secure – SSL, client/server-side encryption at rest • Low cost
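As a concrete example of the tiered-storage bullet, this is a hedged boto3 sketch that attaches a lifecycle policy to a bucket; the bucket name, key prefix, and transition ages are assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Transition raw objects to cheaper storage classes as they cool down.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",                      # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-raw-events",
                "Filter": {"Prefix": "raw/"},   # assumed key prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```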
  17. 17. COLLECT → STORE recap: stream storage (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams), message storage (Amazon SQS), and file storage (Amazon S3) for the records, documents, files, messages, and streams collected from applications, logging, transport, messaging, and IoT.
  18. 18. Data Tier Anti-Pattern: a single database serving every access pattern.
  19. 19. Best Practice: Use the Right Tool for the Job. Search: Amazon Elasticsearch Service; In-memory: Amazon ElastiCache (Redis, Memcached); SQL: Amazon Aurora, Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server); NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB.
  20. 20. Amazon ElastiCache: microsecond real-time performance; fully managed Redis; automatic failover = NoOps; enhanced Redis engine; no cross-AZ data transfer costs; easy to deploy, use, and monitor; open-source compatible.
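A minimal cache-aside sketch against an ElastiCache Redis endpoint using redis-py; the endpoint, key scheme, TTL, and the load_user_from_database helper are hypothetical.

```python
import json
import redis

# Assumed ElastiCache Redis endpoint; replace with your cluster's address.
cache = redis.StrictRedis(host="my-cluster.abc123.0001.use1.cache.amazonaws.com", port=6379)

def load_user_from_database(user_id):
    # Hypothetical placeholder for the authoritative data store lookup.
    return {"user_id": user_id, "name": "example"}

def get_user(user_id):
    key = "user:%s" % user_id
    cached = cache.get(key)
    if cached is not None:                      # cache hit: fast in-memory path
        return json.loads(cached)
    user = load_user_from_database(user_id)     # cache miss: fall back to the database
    cache.setex(key, 300, json.dumps(user))     # keep the result for 5 minutes
    return user
```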
  21. 21. Amazon DynamoDB: fully managed NoSQL document/key-value store; single-digit millisecond latency; massive and seamless scalability; event-driven programming.
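A small boto3 sketch of the key-value access pattern described above; the UserSessions table and its attributes are assumed for illustration.

```python
import boto3

# Assumed table with a "user_id" partition key.
table = boto3.resource("dynamodb").Table("UserSessions")

# Write an item, then read it back with a single-key lookup.
table.put_item(Item={"user_id": "u-123", "last_seen": "2017-05-10T14:00:00Z", "score": 42})

resp = table.get_item(Key={"user_id": "u-123"})
item = resp.get("Item")
```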
  22. 22. DynamoDB Accelerator (DAX) (Preview) features • Fully managed, highly available: handles all software management, fault tolerant, replicated across multiple AZs within a region • DynamoDB API compatible: seamlessly caches DynamoDB API calls, no application rewrites required • Write-through: DAX handles caching for writes • Flexible: configure DAX for one table or many • Scalable: scales out to any workload with up to 10 read replicas • Manageability: fully integrated with AWS services (Amazon CloudWatch, Tagging for DynamoDB, AWS Console) • Security: Amazon VPC, AWS IAM, AWS CloudTrail, AWS Organizations. (Diagram: your applications → DynamoDB Accelerator → DynamoDB.)
  23. 23. Amazon RDS: automated backups (with point-in-time recovery); cross-region snapshot copies; automated patch management; automated Multi-AZ replication; scale instance types up and down; scalable storage on demand; "license included" and BYOL models.
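For instance, point-in-time recovery can be driven through the RDS API; this boto3 sketch assumes an existing instance named orders-prod.

```python
import boto3

rds = boto3.client("rds")

# Restore a new instance from the automated backups of an existing one.
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="orders-prod",          # assumed instance name
    TargetDBInstanceIdentifier="orders-prod-restored",
    UseLatestRestorableTime=True,
)
```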
  24. 24. Amazon Aurora: MySQL- and PostgreSQL-compatible • 5x faster than MySQL on the same hardware • SysBench: 100K writes/sec and 500K reads/sec • Designed for 99.99% availability • 6-way replicated storage across 3 AZs • Scales to 64 TB and 15 read replicas. (Diagram: SQL, transactions, and caching layers over storage spanning AZ 1, AZ 2, and AZ 3, with Amazon S3.)
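Because Aurora is MySQL-compatible, any standard MySQL driver works; the following PyMySQL sketch of a multi-statement transaction assumes a cluster writer endpoint and an accounts table.

```python
import pymysql

# Assumed Aurora cluster writer endpoint, credentials, and schema.
conn = pymysql.connect(
    host="mycluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    user="admin",
    password="***",
    database="orders",
    autocommit=False,
)

try:
    with conn.cursor() as cur:
        # Multi-table, transactional work is where a relational engine fits best.
        cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s", (100, 1))
        cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s", (100, 2))
    conn.commit()
except Exception:
    conn.rollback()
    raise
finally:
    conn.close()
```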
  25. 25. Amazon Elasticsearch Service: distributed search and analytics engine; managed service using Elasticsearch and Kibana; fully managed, zero admin; highly available and reliable; tightly integrated with other AWS services.
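A minimal query sketch against an Amazon Elasticsearch Service domain; the domain endpoint, index, and query are assumptions, and a production call would normally be SigV4-signed or restricted by an access policy rather than sent unauthenticated as shown.

```python
import requests

# Assumed Amazon Elasticsearch Service domain endpoint and index.
ES_ENDPOINT = "https://search-mydomain-abc123.us-east-1.es.amazonaws.com"

# Simple full-text match query against a daily log index.
query = {"query": {"match": {"message": "timeout"}}, "size": 10}
resp = requests.get(ES_ENDPOINT + "/logs-2017-05-10/_search", json=query)

for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```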
  26. 26. Which Data Store Should I Use?
       • Data structure: fixed schema → SQL, NoSQL; schema-free (JSON) → NoSQL, Search; (key, value) → In-memory, NoSQL
       • Access patterns (store data in the format you will access it): Put/Get (key, value) → In-memory, NoSQL; simple relationships (1:N, M:N) → NoSQL; complex relationships (multi-table joins, transactional) → SQL; faceting, search → Search
       • Data characteristics: hot, warm, cold
       • Cost: right cost
  27. 27. Analytics Types & Frameworks (PROCESS / ANALYZE)
       • Batch: takes minutes to hours; example: daily/weekly/monthly reports; Amazon EMR (MapReduce, Hive, Pig, Spark)
       • Interactive: takes seconds; example: self-service dashboards; Amazon Redshift (+ Redshift Spectrum), Amazon Athena, Amazon EMR (Presto, Spark)
       • Message: takes milliseconds to seconds; example: message processing; Amazon SQS applications on Amazon EC2
       • Stream: takes milliseconds to seconds; example: fraud alerts, 1-minute metrics; Amazon EMR (Spark Streaming, Flink), Amazon Kinesis Analytics, KCL, Storm, AWS Lambda
       • Machine learning: takes milliseconds to minutes; example: fraud detection, demand forecasting; Amazon ML, Amazon EMR (Spark ML)
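As one concrete interactive example, Amazon Athena runs SQL directly against data in S3 through its API; this boto3 sketch assumes a weblogs database, an access_logs table, and a results bucket.

```python
import time
import boto3

athena = boto3.client("athena")

# Assumed Athena database, table, and query-results bucket.
resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```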
  28. 28. What About ETL? Data integration partners (https://aws.amazon.com/big-data/partner-solutions/) reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes between STORE and PROCESS / ANALYZE. AWS Glue (Preview) is a fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores.
  29. 29. Data Consumption: Amazon QuickSight plus apps & services, analysis & visualization, notebooks, IDEs, and APIs. Business users consume analysis and visualization; data scientists and developers consume notebooks, IDEs, and APIs.
  30. 30. Design Patterns and Customer Examples
  31. 31. Primitive: Decoupled Data Bus. Storage is decoupled from processing across multiple stages: Data → Store → Process → Store → Process → Answers.
  32. 32. Primitive: Pub/Sub. Parallel stream consumption/processing: data lands in Amazon Kinesis and is consumed in parallel by AWS Lambda, an Amazon Kinesis Connector Library application, and Apache Spark, each processing and storing results independently.
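A minimal AWS Lambda consumer sketch for this pub/sub primitive: Kinesis delivers base64-encoded records to the handler, and the store_result call stands in for any downstream store.

```python
import base64
import json

def store_result(record):
    # Hypothetical placeholder for writing to S3, DynamoDB, etc.
    print(record)

def handler(event, context):
    # Kinesis invokes Lambda with batches of base64-encoded records.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        store_result(json.loads(payload))
```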
  33. 33. Primitive: Materialized Views. The analysis framework reads from or writes to multiple data stores: data from Amazon Kinesis is processed by AWS Lambda, Spark Streaming, and Amazon Kinesis Connector Library applications on Amazon EMR, materialized into Amazon S3 and Amazon DynamoDB, and queried with Spark SQL to produce answers.
  34. 34. Real-time Analytics: an Amazon Kinesis stream (the raw, append-only log, archived to Amazon S3) fans out through Amazon Kinesis Analytics, a KCL app, AWS Lambda, and Spark Streaming on Amazon EMR; the outputs feed notifications and alerts (Amazon SNS), real-time predictions (Amazon ML), app state (Amazon ElastiCache for Redis, Amazon DynamoDB), and KPIs (Amazon RDS, Amazon ES).
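For the alerting leg of such a pipeline, the stream processor can publish findings to Amazon SNS; this boto3 sketch assumes a fraud-alerts topic.

```python
import json
import boto3

sns = boto3.client("sns")

# Assumed SNS topic subscribed to by email, SMS, or downstream services.
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:fraud-alerts",
    Subject="Possible fraud detected",
    Message=json.dumps({"transaction_id": "t-987", "score": 0.97}),
)
```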
  35. 35. Example: real-time data about users, properties, and agents flows through Amazon Kinesis into an Amazon S3 data lake, with Amazon EMR, Amazon Redshift, and Amazon DynamoDB serving user profiles, recommendations, Hot Homes, Similar Homes, agent follow-up, agent scorecards, marketing, A/B testing, and BI/reporting across the ingest/collect → store → process/analyze → consume/visualize stages.
  36. 36. Interactive & Batch Analytics: files and streams (via Amazon Kinesis Firehose and Amazon Kinesis Analytics) land in Amazon S3; Amazon EMR (Hive, Pig, Spark) handles batch processing; Amazon Redshift, Amazon Athena, and Amazon EMR (Presto, Spark) serve interactive queries; Amazon ML provides batch and real-time predictions for consumption.
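On the ingest side of this pattern, producers can hand records to Amazon Kinesis Firehose and let it batch and deliver them to S3; this boto3 sketch assumes a delivery stream named clickstream-to-s3.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Assumed delivery stream configured to batch records into Amazon S3.
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps({"page": "/home", "ts": 1494424800}) + "\n").encode("utf-8")},
)
```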
  37. 37. Customer example (>5 PB, up to 75 billion events per day): data providers feed auto-scaled Amazon EC2 data ingestion services; normalization and optimization ETL clusters (EMR) populate a source-of-truth store (S3) and a query-optimized store (S3) with a shared metastore (RDS); batch analytic clusters, ad hoc query clusters, and query clusters (EMR) plus data marts (Amazon Redshift) back auto-scaled EC2 analytics apps for users; shared data services include data catalog & lineage services, reference data (RDS), and cluster management & workflow services.
  38. 38. Data Lake https://aws.amazon.com/answers/big-data/data-lake-solution/
  39. 39. Putting It All Together
  40. 40. Putting It All Together
       • COLLECT: mobile apps, web apps, devices, sensors & IoT platforms (AWS IoT), data centers (AWS Direct Connect), AWS Import/Export Snowball, logging (Amazon CloudWatch, AWS CloudTrail), and messaging, producing records, documents, files, messages, and streams
       • STORE: stream (Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams), message (Amazon SQS), file (Amazon S3), cache (Amazon ElastiCache), NoSQL (Amazon DynamoDB), SQL (Amazon RDS), and search (Amazon Elasticsearch Service), plus ETL into the analytics layer
       • PROCESS / ANALYZE: batch and interactive (Amazon EMR, Presto, Amazon Redshift, Amazon Athena), stream (Amazon Kinesis Analytics, KCL apps, AWS Lambda, streaming apps on Amazon EC2), message (Amazon SQS apps on Amazon EC2), and ML (Amazon Machine Learning)
       • CONSUME: Amazon QuickSight, apps & services, analysis & visualization, notebooks, IDEs, and APIs
  41. 41. Architectural Principles
       • Build decoupled systems: Data → Store → Process → Store → Analyze → Answers
       • Use the right tool for the job: data structure, latency, throughput, access patterns
       • Leverage AWS managed services: scalable/elastic, available, reliable, secure, no/low admin
       • Use log-centric design patterns: immutable logs, materialized views (schema-on-read)
       • Be cost-conscious: big data ≠ big cost
  42. 42. Thank you! Antoine Généreux, AWS Solutions Architect
