How Amazon.com Uses AWS Analytics: Data Analytics Week SF

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Pop-up Loft
How Amazon.com uses AWS Analytics
Saurabh Shrivastava
saursh@amazon.com
AWS Solution Architect
Andre Hass
hasandre@amazon.com
AWS Specialist Technical Account Manager

Traditional Data Warehousing
Wikipedia: In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a
system used for reporting and data analysis, and is considered a core component of business intelligence.[1] DWs are
central repositories of integrated data from one or more disparate sources. They store current and historical data and
are used for creating analytical reports for knowledge workers throughout the enterprise. Examples of reports could
range from annual and quarterly comparisons and trends to detailed daily sales analysis.

The Battle for the Future
VS.

https://www.promptcloud.com
https://john-popelaars.blogspot.com
https://ww.signiant.com
https://www.linkedin.com/pulse/world-today-data-rich-information-poor-guru-p-mohapatra-pmp/

The Industry Problem
Chart series: Growth in Data (mostly Unstructured) & Analytics; Average Growth in Traditional DW Data; Average IT Budget

What is Amazon?

Our vision is to be earth’s most customer-centric company;
to build a place where people can come to find and discover
anything they might want to buy online.

Amazon Data Warehouse

The Amazon Enterprise Data Warehouse
The Good!
Helps to Run the Amazon Business
• Most Comprehensive Set of Cleansed and Curated Business Data
• Feeds Many Downstream Systems and Processes
• Batch Processing, Reporting and Ad Hoc
• 500k+ Data Loads/Transformations Each Day
• 200k+ Queries/Extracts Each Day
• 20k+ Active Tables
• 10B++ Rows Loaded Daily
Our Data is Big!
• Core Data Set: 5+ PB of Compressed Data (primarily limited by Legacy Technology)
• Total Storage (Multiple Systems): 35+ PB compressed
• Quote from Executive at Legacy DW Vendor:
• ~1000x Larger than any other DW Customer (from that Vendor)
Significant and Increasing Use of Redshift and EMR
• 1000s of Redshift and EMR Systems, Ranging in Size from:
• Individual Contributor - Project Based, to
• Running Multi-Billion Dollar Business inside Amazon

Example of an Amazon DW “Customer” Team
• Who are we?
• Analytics on the “Marketplace”
• Analytics Spokes: Pricing, B2B, Seller Support, Lending …
• Business Scale:
• 235MM monthly CPU Minutes on Legacy ODW
• 2K upstream tables
• Users:
• Supports 170 teams
• 1000 users with 9527 profiles (Parameterized Queries)
• 20K unique job runs per month
• 2800 datasets (800 TB)
• BI Tool Users:
• 3000+ Users, 650 non-tech
• 600+ “Dashboards”
• 100k’s of queries each month

“Swiss Army” by Jim Pennucci. No alterations other than cropping. https://www.flickr.com/photos/pennuja/5363518281/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License

What is the Goal?
To Provide an analytic ecosystem that Scales with the
Amazon Business
To Leverage AWS Technologies and to help Improve these
technologies for all Amazon Customers
To Provide Choice and Options in New Analytic Technologies
• Provide an SQL based solution
• Increasingly Focus on Enabling new analytic approaches
including Machine Learning and Programmatic Data
Analysis
• Enable both “Bring Your Own Cluster” and “Bring your
Own Query” Approaches

“Tools #2” by Juan Pablo Olmo. No alterations other than cropping. https://www.flickr.com/photos/juanpol/1562101472/
Image used with permissions under Creative Commons license 2.0, Attribution Generic License (https://creativecommons.org/licenses/by/2.0/)

Amazon EMR (running Hive, Pig, Spark, Presto, etc.)
Amazon DynamoDB
Amazon Machine Learning
Amazon QuickSight
Amazon RDS
Amazon Elasticsearch Service
Amazon Redshift
Amazon Athena
Amazon SQS
Amazon Kinesis Analytics
Amazon Kinesis Firehose
Amazon S3
Amazon Kinesis
Open-source tools (e.g. for ML, data science)
Commercial tools
Moving Forward - AWS
• S3 / EDX - Separate Storage from Compute by leveraging a parallel file system as a global data exchange
• Redshift - Preferred platform for SQL-based Analysis and traditional Data Warehouse Data
• Focus is “Business Users”
• EMR – Scalable “Do Everything” Platform - Enable Teams who have chosen EMR by providing Curated Data
• Focus is “Programmatic Access”
Amazon
Redshift

The Amazon “Data Lake” – Project Name “Andes”
The Goal: “THE” Place for Data at Amazon
• Source teams (Data Producers) put their Public Data there to give access to Analytic
teams (Data Consumers) and to share private data within their team
• EMR Can Directly Access the Data in Parallel from Andes
• Redshift can load the data in Parallel from Andes, or it Can Directly Access the Data in
Parallel with Spectrum
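
The two Redshift access paths above (loading a copy of the data versus querying it in place with Spectrum) could be sketched roughly as follows. This is an illustration only, not the tooling Amazon uses internally: the bucket path, table and schema names, Glue catalog database, and IAM role ARN are hypothetical placeholders, and the helpers merely build the SQL text for Redshift's `COPY` and `CREATE EXTERNAL SCHEMA` statements.

```python
# Hypothetical placeholders; substitute real values for an actual cluster.
EXAMPLE_ROLE = "arn:aws:iam::123456789012:role/example-redshift-role"
EXAMPLE_PREFIX = "s3://example-data-lake/orders/"


def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads Parquet files from S3.

    Redshift loads the files found under the prefix in parallel.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS PARQUET;"
    )


def build_spectrum_schema_statement(schema: str, catalog_db: str, iam_role_arn: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA statement for Redshift Spectrum.

    Tables in the external schema are queried in place; no data is copied
    into the cluster's own storage.
    """
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        "FROM DATA CATALOG\n"
        f"DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )


if __name__ == "__main__":
    # Path 1: load a copy of the dataset into a local Redshift table.
    print(build_copy_statement("analytics.orders", EXAMPLE_PREFIX, EXAMPLE_ROLE))
    # Path 2: expose the data lake tables to the cluster without loading them.
    print(build_spectrum_schema_statement("lake", "example_catalog_db", EXAMPLE_ROLE))
```

Loading suits data that is queried heavily on the cluster, while the external schema avoids keeping a second copy of the data.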

“Datamarts”
Number of Teams using the DW: ~2300
Number of Tables Used per Team:
• Max: 598
• Min 1
• Average: 49
Ad-Hoc (any data, any time) can be achieved via:
• EMR, which can access the Data in Andes Directly
• Redshift, which can load data into the Redshift file system, or use the Spectrum Feature to directly access the Data in Andes
An Architecture that Scales with the Business
Amazon Internal Team (132 Tables)

Putting The Pieces Together
The Analytic Architecture of the Future
Diagram labels: Source Systems; The Data Lake “Andes”; Big Data Systems; Data Warehouses; “Bring Your Own Cluster” and “Bring Your Own Query”; Services and Users
Diagram components: PostgreSQL instance, Amazon Redshift (shown three times), Amazon Kinesis, AWS Glue, Amazon QuickSight, Amazon Athena, Amazon Machine Learning

The Battle for the Future
The Data Lake becomes the
common source for all
data:
The DW becomes the
compute engine for
traditional structured data
(Redshift)
EMR becomes the compute
engine for programmatic
access, like machine
learning and many
emerging use cases
Both become a form of a
Dependent data mart with
the data coming from the
Data Lake
Vs.
AND

Purchase
Contract
seller buyer

Table Subscriptions - The Vision

Subscription
“Big Data Technologies” Team
producer consumer

Data Value Chain
Image credits: Icons from thenounproject.com: “Collect” icon by Ramesh; “Cloud Security” icon by Creative Stall; “Search” icon by
Dinosoft Labs;
COLLECT STORE DISCOVER SUBSCRIBE DELIVER ANALYZE

Producers only need to integrate their datasets once
with the data lake
• Simplified onboarding process
• One-time integration
Ingest from various source systems:
• Relational databases – e.g., Amazon Aurora/RDS
Postgres
• Non-relational databases – e.g., Amazon DynamoDB
• Streams – e.g., Amazon Kinesis
• Flat files – e.g., files in Amazon S3
Data Value Chain
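
One way to picture the "integrate once" idea above is a catalog that accepts a single registration per dataset, whatever the source system. The sketch below is a hypothetical model, not the Andes API: `SourceKind`, `DatasetRegistration`, `DataLakeCatalog`, and the example locator are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class SourceKind(Enum):
    """The four source categories listed on the slide."""
    RELATIONAL = "relational"          # e.g., Amazon Aurora / RDS Postgres
    NON_RELATIONAL = "non_relational"  # e.g., Amazon DynamoDB
    STREAM = "stream"                  # e.g., Amazon Kinesis
    FLAT_FILE = "flat_file"            # e.g., files in Amazon S3


@dataclass(frozen=True)
class DatasetRegistration:
    """Hypothetical record a producer submits when onboarding a dataset."""
    name: str
    owner_team: str
    source_kind: SourceKind
    source_locator: str  # connection string, stream name, or S3 prefix


class DataLakeCatalog:
    """Toy in-memory catalog that enforces one integration per dataset."""

    def __init__(self) -> None:
        self._datasets: Dict[str, DatasetRegistration] = {}

    def register(self, registration: DatasetRegistration) -> bool:
        """Return True on first registration, False if already integrated."""
        if registration.name in self._datasets:
            return False
        self._datasets[registration.name] = registration
        return True


if __name__ == "__main__":
    catalog = DataLakeCatalog()
    orders = DatasetRegistration(
        name="orders",
        owner_team="example-producer-team",
        source_kind=SourceKind.FLAT_FILE,
        source_locator="s3://example-bucket/orders/",
    )
    print(catalog.register(orders))  # first integration succeeds
    print(catalog.register(orders))  # a repeat registration is rejected
```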

Secure and scalable data lake:
• Highly durable S3-based storage
• Scalable since it’s built on AWS technologies
• Permissions are strictly enforced
Data quality:
• Certified with data quality checks
• Schemas are validated
Data Value Chain
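
The slide does not say how schemas are validated. As a minimal sketch under that caveat, a check of this kind might compare each incoming row against a declared column-to-type mapping; the `validate_rows` helper and the example schema are hypothetical.

```python
from typing import Any, Dict, List

Schema = Dict[str, type]


def validate_rows(rows: List[Dict[str, Any]], schema: Schema) -> List[str]:
    """Return human-readable errors; an empty list means every row conforms."""
    errors: List[str] = []
    for i, row in enumerate(rows):
        # Columns the schema requires but the row lacks.
        for col in sorted(schema.keys() - row.keys()):
            errors.append(f"row {i}: missing column '{col}'")
        # Columns the row carries but the schema does not declare.
        for col in sorted(row.keys() - schema.keys()):
            errors.append(f"row {i}: unexpected column '{col}'")
        # Type check for the columns that are present.
        for col, expected in schema.items():
            if col in row and not isinstance(row[col], expected):
                errors.append(
                    f"row {i}: column '{col}' expected {expected.__name__}, "
                    f"got {type(row[col]).__name__}"
                )
    return errors


if __name__ == "__main__":
    schema = {"order_id": int, "marketplace": str}
    rows = [
        {"order_id": 1, "marketplace": "US"},  # conforms
        {"order_id": "2"},                     # wrong type, missing column
    ]
    for problem in validate_rows(rows, schema):
        print(problem)
```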

Company-wide data search index
• Consumers can quickly find what they’re looking
for
• Useful information about the datasets is
shown
Clear communication:
• Producers can communicate expectations
around data quality and SLAs
• Consumers can contact producers
Data Value Chain
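
A company-wide search index can be pictured as an inverted index over dataset names and descriptions, with producer contact details stored alongside. The sketch below is a toy illustration under that assumption; `DatasetSearchIndex` and the sample entries are hypothetical and say nothing about how the real index is built.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DatasetSearchIndex:
    """Toy inverted index: token -> names of datasets that mention it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)
        self._details: Dict[str, Dict[str, str]] = {}

    def add(self, name: str, description: str, owner_contact: str) -> None:
        """Index a dataset and keep the details shown to consumers."""
        self._details[name] = {"description": description, "contact": owner_contact}
        for token in tokenize(name + " " + description):
            self._postings[token].add(name)

    def search(self, query: str) -> List[str]:
        """Return datasets containing every query token, sorted by name."""
        tokens = tokenize(query)
        if not tokens:
            return []
        matches = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(matches)

    def details(self, name: str) -> Dict[str, str]:
        """Description plus the producer contact for a dataset."""
        return self._details[name]


if __name__ == "__main__":
    index = DatasetSearchIndex()
    index.add("seller_orders", "Daily orders per seller", "producer-a@example.com")
    index.add("pricing_history", "Price changes per item", "producer-b@example.com")
    for hit in index.search("seller orders"):
        print(hit, index.details(hit)["contact"])
```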

Easy process to subscribe to data:
• Find a dataset of interest
• Click “Subscribe”
• Choose the destination compute platform
Rapidly populate data marts, for example:
• Use AWS CloudFormation to provision Redshift
cluster
• Use subscriptions to load datasets to the cluster
Data Value Chain
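
As a rough sketch of the provisioning step above, a CloudFormation template for a Redshift cluster can be generated as a plain dictionary and serialized to JSON. The resource name `DataMartCluster`, the database name, and the node type chosen here are illustrative assumptions; the template uses the `AWS::Redshift::Cluster` resource type and takes credentials as parameters rather than hard-coding them.

```python
import json


def redshift_cluster_template(db_name: str, node_type: str, number_of_nodes: int) -> dict:
    """Build a minimal CloudFormation template for one Redshift cluster."""
    properties = {
        "ClusterType": "multi-node" if number_of_nodes > 1 else "single-node",
        "NodeType": node_type,
        "DBName": db_name,
        "MasterUsername": {"Ref": "MasterUsername"},
        "MasterUserPassword": {"Ref": "MasterUserPassword"},
    }
    # NumberOfNodes applies only to multi-node clusters.
    if number_of_nodes > 1:
        properties["NumberOfNodes"] = number_of_nodes
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative data mart cluster (hypothetical example)",
        "Parameters": {
            "MasterUsername": {"Type": "String"},
            # NoEcho keeps the password out of console and API output.
            "MasterUserPassword": {"Type": "String", "NoEcho": True},
        },
        "Resources": {
            "DataMartCluster": {
                "Type": "AWS::Redshift::Cluster",
                "Properties": properties,
            }
        },
    }


if __name__ == "__main__":
    template = redshift_cluster_template("datamart", "dc2.large", 2)
    print(json.dumps(template, indent=2))
```

Once the cluster exists, subscriptions would load the chosen datasets into it, as the slide describes.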

Subscriptions mechanism:
• Makes data available to the compute platform where
it can be analyzed
• Keeps the compute platform in sync with any data
updates
• Users can monitor the sync status of their
subscriptions
Synchronizations can be either:
• Full data copy
• Metadata-only sync
Data Value Chain
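
The two synchronization modes and the sync status above could be modeled roughly as below. This is a hypothetical sketch, not the actual subscription service: the `Subscription` class, the version counter, and the status strings are invented to show how a full copy differs from a metadata-only sync.

```python
from dataclasses import dataclass
from enum import Enum


class SyncMode(Enum):
    FULL_COPY = "full_copy"          # data is copied to the compute platform
    METADATA_ONLY = "metadata_only"  # only table definitions are refreshed


@dataclass
class Subscription:
    """Hypothetical link between a data lake dataset and a compute target."""
    dataset: str
    target: str
    mode: SyncMode
    synced_version: int = 0  # last dataset version applied to the target

    def status(self, latest_version: int) -> str:
        """Report whether the target has caught up with the data lake."""
        return "IN_SYNC" if self.synced_version >= latest_version else "STALE"

    def sync(self, latest_version: int) -> str:
        """Bring the target up to date and describe the work performed."""
        if self.synced_version >= latest_version:
            return "nothing to do"
        if self.mode is SyncMode.FULL_COPY:
            action = "copied data and metadata"
        else:
            action = "refreshed metadata only"
        self.synced_version = latest_version
        return action


if __name__ == "__main__":
    sub = Subscription("orders", "example-redshift-cluster", SyncMode.FULL_COPY)
    print(sub.status(latest_version=3))
    print(sub.sync(latest_version=3))
    print(sub.status(latest_version=3))
```

A metadata-only sync fits platforms that read the data lake in place, since only the table definitions need to stay current there.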

Teams can use the right tools for the jobs, e.g.:
• Amazon Redshift for interactive analytics or batch
scheduled jobs
• Amazon EMR for machine learning and data
science
• QuickSight for Business analytics and visualizations
Compute resources can be scaled independently
of the data lake in order to:
• Process more/bigger/faster jobs
• Optimize costs
• Meet business SLAs
• Scale to meet high peak workloads
Data Value Chain


Andes – Current State
• We have the data!
• 20k+ Tables maintained in Andes – All Active Tables
have been Sourced from the Enterprise Data
Warehouse
• Many teams are adding new data sets!
• Have Onboarded 900+ Redshift and EMR systems to
Subscriptions
• 20,000+ tables being synchronized
• Usage off the Legacy DW
• Three years (2014-2016) to grow from 0 to 100k Jobs
each Day
• In 2017, has grown from 100k to 300k Jobs each Day
Amazon.com
Big Data
Technologies

Data producers
(Amazon teams that want to share
data with other teams)
"Big Data Marketplace"
