AWS re:Invent 2016: Building Big Data Applications with the AWS Big Data Platform (BDA206)

Building big data applications often requires integrating a broad set of technologies to store, process, and analyze the increasing variety, velocity, and volume of data being collected by many organizations. In this session, we show how you can build entire big data applications using a core set of managed services including Amazon S3, Amazon Kinesis, Amazon EMR, Amazon Elasticsearch Service, Amazon Redshift, and Amazon QuickSight.

We walk you through the steps of building and securing a big data application using the AWS Big Data Platform. We also share best practices and common use cases for AWS big data services, including tips to help you choose the best services for your specific application.

Published in: Technology

  1. 1. Building Big Data Applications with the AWS Big Data Platform (BDA206). Matt Yanchyshyn, Sr. Manager, Solutions Architecture, AWS. November 30, 2016. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. Start here with a business case. Diagram labels: Data, Ingest/Collect, Store, Process/Analyze, Consume/Visualize, Answers & insights.
  3. 3. Collect, Store, Analyze. Services shown: AWS Data Pipeline, AWS Database Migration Service, EMR, Amazon Glacier, S3, Amazon Kinesis, Direct Connect, Amazon Machine Learning, Amazon Redshift, DynamoDB, AWS IoT, AWS Snowball, QuickSight, Amazon Athena, EC2, Amazon Elasticsearch Service, Lambda.
  4. 4. Building a Big Data Application: Build a data warehouse with Amazon Redshift. Diagram labels: web clients, mobile clients, DBMS, corporate data center, Amazon Redshift, AWS Cloud.
  5. 5. Structured Data Processing (Amazon Redshift)
     • Petabyte-scale relational, MPP, data warehousing
     • Fully managed with SSD and HDD platforms
     • Built-in end-to-end security, including customer-managed keys
     • Fault-tolerant. Automatically recovers from disk and node failures
     • Data automatically backed up to Amazon S3 with cross-region backup capability for global disaster recovery
     • Over 140 new features added since launch
     • $1,000/TB/Year; start at $0.25/hour. Provision in minutes; scale from 160 GB to 2 PB of compressed data with just a few clicks
  6. 6. How do you get your (big) data into AWS?
  7. 7. Building a Big Data Application: Migrate your data to AWS. Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS Database Migration Service, AWS Direct Connect, AWS Import/Export & Snowball, Amazon Redshift, AWS Cloud.
  8. 8. AWS Database Migration Service: Start your first migration in 10 minutes or less. Keep your apps running during the migration. Migrate to databases running on Amazon EC2, Amazon RDS, or Amazon Redshift.
  9. 9. AWS Snowball: PB-scale Data Transport. E-ink shipping label; ruggedized case (“8.5G Impact”); all data encrypted end-to-end; 50 TB & 80 TB; 10G network; rain & dust resistant; tamper-resistant case & electronics.
  10. 10. Your CEO doesn’t want to look at raw SQL query output
  11. 11. Business Intelligence (Amazon QuickSight)
     • Fast and cloud-powered
     • Easy to use, no infrastructure to manage
     • Scales to 100s of thousands of users
     • Quick calculations with SPICE
     • 1/10th the cost of legacy BI software
  12. 12. Building a Big Data Application: Visualize your data with Amazon QuickSight. Diagram labels: web clients, mobile clients, DBMS, corporate data center, AWS Database Migration Service, AWS Direct Connect, AWS Import/Export & Snowball, Amazon Redshift, Amazon QuickSight, AWS Cloud.
  13. 13. What if your data isn’t structured? What if you don’t need all the raw data? What if you need to combine multiple data sets?
  14. 14. Serverless Event Processing (AWS Lambda)
     • Serverless compute service that runs your code in response to events
     • Extend AWS services with user-defined custom logic
     • Write custom code in Node.js, Python, and Java
     • Pay only for the requests served and compute time required; billing in increments of 100 milliseconds
  15. 15. Building a Big Data Application: Event-driven data transformations with AWS Lambda. Diagram labels: web clients, mobile clients, DBMS, corporate data center, raw data in Amazon S3, AWS Lambda, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight, AWS Cloud.
  16. 16. How will this work at scale? What if the data processing exceeds the timeout?
  17. 17. Semi-structured/Unstructured Data Processing (Amazon EMR)
     • Hadoop, Hive, Presto, Spark, Tez, Impala, etc.
     • Release 5.2: Hadoop 2.7.3, Hive 2.1, Spark 2.0.2, Zeppelin, Presto, HBase 1.2.3 and HBase on S3, Phoenix, Tez, Flink
     • New applications added within 30 days of their open source release
     • Fully managed, Auto Scaling clusters with support for on-demand and spot pricing
     • Support for HDFS and S3 file systems, enabling separated compute and storage; multiple clusters can run against the same data in S3
     • HIPAA-eligible. Support for end-to-end encryption, IAM/VPC, S3 client-side encryption with customer-managed keys and AWS KMS
  18. 18. Building a Big Data Application: Transform and explore your data at scale with Amazon EMR. Diagram labels: web clients, mobile clients, DBMS, corporate data center, raw data in Amazon S3, Amazon EMR, structured data in Amazon S3, Amazon Redshift, Amazon QuickSight, AWS Cloud.
  19. 19. What about ad-hoc queries when you are exploring new data?
  20. 20. Serverless Query Processing (Amazon Athena)
     • Serverless query service for querying data in S3 using standard SQL with no infrastructure to manage
     • No data loading required; query directly from Amazon S3
     • Use standard ANSI SQL queries with support for joins, JSON, and window functions
     • Support for multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, and Parquet
     • Pay per query, based on data scanned, and only when you’re running queries. If you compress your data, you pay less and your queries run faster
  21. 21. Building a Big Data Application: Extend your data warehouse to S3 with Amazon Athena. Diagram labels: web clients, mobile clients, DBMS, corporate data center, raw data in Amazon S3, Amazon EMR, staging data in Amazon S3, Amazon Athena, Amazon Redshift, Amazon QuickSight, AWS Cloud.
  22. 22. Building a Big Data Application: Extend your data warehouse to S3 with Amazon Athena. Diagram labels: web clients, mobile clients, DBMS, corporate data center, raw data in Amazon S3, Amazon EMR (×2), staging data in Amazon S3, ORC/Parquet in Amazon S3 (columnar data format), Amazon Athena, Amazon Redshift, Amazon QuickSight, AWS Cloud.
  23. 23. What if I want to run custom code or multiple frameworks?
  24. 24. Building a Big Data Application: Extend your data warehouse to S3 with Presto, Spark SQL, etc. on Amazon EMR. Diagram labels: web clients, mobile clients, DBMS, corporate data center, raw data in Amazon S3, Amazon EMR (×3), staging data in Amazon S3, ORC/Parquet in Amazon S3 (columnar data format), Amazon Redshift, Amazon QuickSight, AWS Cloud.
  25. 25. What about real-time data?
  26. 26. Stream Processing (Amazon Kinesis)
     • Real-time stream processing
     • High throughput; elastic
     • Highly available; data replicated across multiple Availability Zones with configurable retention
     • S3, Amazon Redshift, DynamoDB integrations
     • Amazon Kinesis Streams for custom streaming applications; Amazon Kinesis Firehose for easy integration with Amazon S3 and Amazon Redshift; Amazon Kinesis Analytics for streaming SQL
  27. 27. Building a Big Data Application: Add a real-time layer with Amazon Kinesis + Spark on Amazon EMR. Diagram labels: web clients, mobile clients, DBMS, corporate data center, Amazon Kinesis Streams, raw data in Amazon S3, Amazon EMR (×3), staging data in Amazon S3, ORC/Parquet (columnar data format), Amazon Athena, Amazon Redshift, Amazon QuickSight, AWS Cloud.
  28. 28. Building a Big Data Application: React to real-time data with Amazon Kinesis Analytics and AWS Lambda. Diagram labels: web clients, mobile clients, DBMS, corporate data center, Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon SNS, reference data in Amazon S3, Amazon Athena, Amazon Redshift, Amazon QuickSight, AWS Cloud.
  29. 29. Building a Big Data Application: React intelligently in real-time with Amazon Machine Learning. Diagram labels: web clients, mobile clients, DBMS, corporate data center, Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon Machine Learning, Amazon SNS, reference data in Amazon S3, Amazon Athena, Amazon Redshift, Amazon QuickSight, AWS Cloud.
  30. 30. What if you need encryption and network isolation to meet industry regulations?
  31. 31. Building a Big Data Application: Add encryption at rest with AWS KMS. Diagram labels: web clients, mobile clients, DBMS, corporate data center, Amazon Kinesis Streams, AWS KMS, raw data in S3, Amazon EMR (×2), staging data in S3, ORC/Parquet in Amazon S3 (columnar data), Amazon Redshift, Amazon QuickSight, AWS Cloud.
  32. 32. Building a Big Data Application: Protect data in transit & add network isolation. Diagram labels: web clients, mobile clients, DBMS, corporate data center, SSL/TLS (×2), VPC subnet, Amazon Kinesis Streams, AWS KMS, raw data in S3, staging data in S3, ORC/Parquet in Amazon S3 (columnar data), Amazon Redshift, Amazon QuickSight, AWS Cloud.
  33. 33. Which customers are doing this?
  34. 34. Redfin. Diagram labels: Data, Ingest/Collect, Store, Process/Analyze, Consume/Visualize, Answers & insights; Amazon S3 Data Lake, Amazon EMR, Amazon Kinesis, Amazon Redshift, Amazon DynamoDB; Users, Properties, Agents; User Profile, Recommendation, Hot Homes, Similar Homes, Agent Follow-up, Agent Scorecard, Marketing, A/B Testing, Real Time Data, …; BI / Reporting.
  35. 35. Nordstrom. Outcomes & insights: personalized recommendations within seconds (from 15-20 min); scale the expertise of stylists to all shoppers; reduce costs by 2X order of magnitude; … Diagram labels: Data, Ingest/Collect, Store, Process/Analyze, Consume/Visualize; Mobile Users, Desktop Users, Analytics Tools, Online Stylist; Amazon Redshift, Amazon Kinesis, AWS Lambda (×2), Amazon DynamoDB, Amazon S3 Data Storage.
  36. 36. >5 PB, up to 75 billion events per day. Diagram labels: Users; Data Providers; Data Ingestion Services (Auto Scaling EC2); Analytics App (Auto Scaling EC2, ×2); Data Marts (Amazon Redshift); Query Cluster (EMR, ×2); Ad Hoc Query Cluster (EMR); Batch Analytic Clusters; Normalization ETL Clusters (EMR); Optimization ETL Clusters (EMR); Shared Data Services: Shared Metastore (RDS), Reference Data (RDS), Source of Truth (S3), Query Optimized (S3); Data Catalog & Lineage Services (Auto Scaling EC2); Cluster Mgt & Workflow Services (Auto Scaling EC2).
  37. 37. <YOUR COMPANY NAME HERE>. Diagram labels: web clients, mobile clients, DBMS, corporate data center, Amazon Kinesis Firehose, Amazon Kinesis Analytics, Amazon Kinesis Streams, AWS Lambda, Amazon Machine Learning, Amazon SNS, reference data in Amazon S3, Amazon Athena, Amazon Redshift, Amazon QuickSight, AWS Cloud.
  38. 38. Thank you!
  39. 39. Remember to complete your evaluations!
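
Slides 14 and 15 describe AWS Lambda running code in response to events to turn raw data in Amazon S3 into structured data. The Python sketch below shows one way such a transformation could look, assuming the raw objects are newline-delimited JSON and the structured output is CSV. The field names, the `structured/` key prefix, and the `read_object` / `write_object` callables are hypothetical and not taken from the slides; in a real deployment they would wrap an S3 client.

```python
import csv
import io
import json
from urllib.parse import unquote_plus


def raw_to_csv(raw_text, fields):
    """Convert newline-delimited JSON records to CSV with a fixed column order."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in raw_text.splitlines():
        if line.strip():  # skip blank lines in the raw file
            writer.writerow(json.loads(line))  # missing fields become ""
    return out.getvalue()


def objects_in_event(event):
    """Yield (bucket, key) for each record of an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def make_handler(read_object, write_object, fields):
    """Build a Lambda-style handler(event, context).

    read_object(bucket, key) -> str and write_object(key, body) are
    hypothetical helpers injected by the caller (assumption, not an AWS API).
    """
    def handler(event, context):
        count = 0
        for bucket, key in objects_in_event(event):
            body = read_object(bucket, key)
            # "structured/" is an illustrative output prefix.
            write_object("structured/" + key + ".csv", raw_to_csv(body, fields))
            count += 1
        return {"transformed": count}

    return handler
```

Injecting the storage helpers keeps the transformation testable without AWS credentials; the same `raw_to_csv` logic could later move to Amazon EMR if a file outgrows the Lambda timeout raised on slide 16.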

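
Slide 20 states that Amazon Athena bills per query based on data scanned, so compressed data costs less to query. The arithmetic can be sketched as follows. The price per terabyte, the 4:1 compression ratio, and the use of binary terabytes (1024^4 bytes) are illustrative assumptions, not figures from the slides.

```python
def query_cost(bytes_scanned, price_per_tb):
    """Cost of one query billed purely on the number of bytes scanned."""
    terabytes = bytes_scanned / 1024 ** 4  # assumes binary terabytes
    return terabytes * price_per_tb


def compressed_cost(raw_bytes, compression_ratio, price_per_tb):
    """Cost when the same data is stored compressed at compression_ratio:1."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return query_cost(raw_bytes / compression_ratio, price_per_tb)


if __name__ == "__main__":
    PRICE_PER_TB = 5.0       # hypothetical price, for illustration only
    raw = 2 * 1024 ** 4      # a query that scans 2 TB of uncompressed data
    print(query_cost(raw, PRICE_PER_TB))
    print(compressed_cost(raw, 4, PRICE_PER_TB))
```

Under these assumptions the cost falls in direct proportion to the compression ratio, since fewer bytes are scanned for the same logical data.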