AWS Data Services provide a suite of serverless analytics tools including Amazon Athena for interactive SQL queries, AWS Glue for ETL and data cataloging, and Amazon S3 for exabyte-scale data storage. Together these services enable building a data lake architecture for ingesting, storing, discovering, and analyzing all types of data at scale.
2. AWS Data Services to Accelerate Your Move to the Cloud
Databases to Elevate your Apps (relational, non-relational & in-memory): RDS Open Source, RDS Commercial, Aurora, DynamoDB & DAX, ElastiCache
Analytics to Engage your Data (inline, data warehousing, reporting, data lake): EMR, Amazon Redshift, Redshift Spectrum, Athena, Elasticsearch Service, QuickSight, Glue
Amazon AI to Drive the Future: Lex, Polly, Rekognition, Machine Learning, Deep Learning (MXNet)
Migration for DB Freedom: Database Migration, Schema Conversion
3.
4. A Data Lake Is…
• A foundation of highly durable data storage and
streaming of any type of data
• A metadata index and workflow which helps us
categorise and govern data stored in the data lake
• A search index and workflow which enables data
discovery
• A robust set of security controls – governance through
technology, not policy
• An API and user interface that expose these features to
internal and external users
5. The Emerging Analytics Architecture
Serverless compute and data processing over shared storage:
Amazon Athena – interactive query
AWS Glue – ETL & data catalog
Amazon S3 – exabyte-scale object storage
Amazon Kinesis Firehose – real-time data streaming
Amazon EMR – managed Hadoop applications
AWS Lambda – trigger-based code execution
AWS Glue Data Catalog – Hive-compatible metastore
Amazon Redshift Spectrum – fast @ exabyte scale
Amazon Redshift – petabyte-scale data warehousing
6. Comparison of a Data Lake to an Enterprise Data Warehouse
Data lake | Enterprise data warehouse
Complementary to EDW (not a replacement) | Data lake can be a source for the EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured / semi-structured / unstructured data | Structured data only
Fast ingestion of new data/content | Time-consuming to introduce new content
Data science + prediction/advanced analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
Data at low level of detail/granularity | Data at summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source / tools for advanced analytics) | Limited flexibility in tools (SQL only)
7. The New Problem
EMR + S3 ≠ Enterprise data warehouse
"Which system has my data?"
"How can I do machine learning against the DW?"
"I built this in Hive, can we get it into the Finance reports?"
"These sources are giving different results…"
"But I implemented the algorithm in Anaconda…"
8. Dive Into The Data Lake
EMR + S3 ≠ Enterprise data warehouse
9. Dive Into The Data Lake
Data lake (EMR + S3): ingest any data, data cleansing, data catalogue, trend analysis, machine learning
Enterprise data warehouse: structured analysis, common access tools, efficient aggregation, structured business rules
Load cleansed data from the lake into the warehouse; export computed aggregates back into the lake
10. Components of a Data Lake
Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost
Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as
a stream
• Facilitates low latency analytics
Storage & Streams
Catalogue & Search
Entitlements
API & UI
11. Components of a Data Lake
Storage & Streams
Catalogue & Search
Entitlements
API & UI
Catalogue
• Metadata lake
• Used for summary statistics and data classification management
Search
• Simplified access model for data discovery
12. Components of a Data Lake
Storage & Streams
Catalogue & Search
Entitlements
API & UI
Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions
13. Components of a Data Lake
Storage & Streams
Catalogue & Search
Entitlements
API & UI
API & User Interface
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected
14. STORAGE
High durability
Stores raw data from input sources
Support for any type of data
Low cost
Storage & Streams
Catalogue & Search
Entitlements
API & UI
15. Amazon Simple Storage Service
Highly scalable object storage for the Internet
1 byte to 5 TB in size
Designed for 99.999999999% durability, 99.99%
availability
Regional service, no single points of failure
Server side encryption
17. Data Storage Format
• Not all data formats are created equally
• Unstructured vs. semi-structured vs. structured
• Store a copy of raw input
• Data standardisation as a workflow following ingest
• Use a format that supports your data, rather than force
your data into a format
• Consider how data will change over time
• Apply common compression
18. Consider Different Types of Data
Unstructured
• Store native file format (logs, dump files, whatever)
• Compress with a streaming codec (LZO, Snappy)
Semi-structured – JSON, XML files, etc. (see the DDL sketch after this list)
• Consider evolution ability of the data schema (Avro)
• Store the schema for the data as a file attribute (metadata/tag)
Structured
• Lots of data is CSV!
• Columnar storage (Orc, Parquet)
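As a sketch of keeping semi-structured data queryable without flattening it first (the table, columns, SerDe choice, and S3 path below are illustrative, not taken from the deck), a JSON feed can be mapped to a table with a nested struct column:
-- nested JSON kept as a struct rather than flattened at ingest
CREATE EXTERNAL TABLE events_json (
  id string,
  ts string,
  payload struct<action:string, value:double>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mydatalake/events/json/';
-- nested fields are addressed with dot notation at query time
SELECT id, payload.action, payload.value FROM events_json LIMIT 10;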
19. Where to Store Data
• Amazon S3 storage uses a flat keyspace
• Separate data storage by business unit, application, type, and
time
• Natural data partitioning is very useful
• Paths should be self-documenting and intuitive
• Changing the prefix structure later is hard/costly
20. Metadata Services – Resource-Oriented Architecture
Metadata services expose CRUD, query, and analytics APIs over the systems of reference.
They return URLs as deep links to applications, as file exchanges via S3 (RESTful file services), or as manifests for big data analytics / HPC.
Integration layer: system to system via Amazon SNS/Amazon SQS; system to user via mobile push; Amazon Simple Workflow for high-level system integration / orchestration.
http://en.wikipedia.org/wiki/Resource-oriented_architecture
s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied}
21. STREAMING
Streaming ingest of feed data
Provides the ability to consume any
dataset as a stream
Facilitates low latency analytics
Storage & Streams
Catalogue & Search
Entitlements
API & UI
22. Why Do Streams Matter?
• Latency between event & action
• Most BI systems target event to action latency of 1 hour
• Streaming analytics would expect event to action latency
< 2 seconds
• Stream orientation simplifies architecture, but can
increase operational complexity
• Increase in complexity needs to be justified by business
value of reduced latency
23. Amazon Kinesis
Managed service for real-time big data processing
Create streams to produce & consume data
Elastically add and remove shards for performance
Use the Amazon Kinesis Client Library to process data
Integration with S3, Amazon Redshift, and DynamoDB
25. Streaming Storage Integration
Object store (Amazon S3): analytics applications read & write file data
Streaming store (Amazon Kinesis): analytics applications read & write to streams
Archive the stream to S3; replay history from S3 back into the stream
26. CATALOGUE & SEARCH
Metadata lake
Used for summary statistics and data classification management
Simplified model for data discovery &
governance
Storage & Streams
Catalogue & Search
Entitlements
API & UI
27. Building a Data Catalogue
• Aggregated information about your storage & streaming
layer
• Storage service for metadata – ownership, data lineage
• Data abstraction layer – customer data = a collection of prefixes
• Enabling data discovery
• API for use by entitlements service
28. Data Catalogue – Metadata Index
• Stores data about your Amazon S3 storage environment
• Total size & count of objects by prefix, data classification,
refresh schedule, object version information
• Amazon S3 events processed by Lambda function
• DynamoDB metadata tables store required attributes
31. Identity & Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with AWS Security Token Service (AWS STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
32. IAM Policy Language
• JSON documents
• Can include variables which extract information from the request context:
aws:CurrentTime – for date/time conditions
aws:EpochTime – the date in epoch (UNIX) time, for use with date/time conditions
aws:TokenIssueTime – the date/time that temporary security credentials were issued, for use with date/time conditions
aws:principaltype – whether the principal is an account, user, federated user, or assumed role
aws:SecureTransport – Boolean representing whether the request was sent using SSL
aws:SourceIp – the requester's IP address, for use with IP address conditions
aws:UserAgent – information about the requester's client application, for use with string conditions
aws:userid – the unique ID for the current user
aws:username – the friendly name of the current user
33. IAM Policy Language
Example: Allow a user to access a private part of the data lake
{
"Version": "2012-10-17",
"Statement": [
{
"Action": ["s3:ListBucket"],
"Effect": "Allow",
"Resource": ["arn:aws:s3:::mydatalake"],
"Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}
},
{
"Action": [
"s3:GetObject",
"s3:PutObject"
],
"Effect": "Allow",
"Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]
}
]
}
34. IAM Federation
• IAM allows federation to Active Directory and to web identity providers that support OpenID (Amazon, Facebook, Google)
• AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS
[Diagram: on-premises Active Directory reached through the AWS Directory Service AD Connector over AWS Direct Connect or a hardware VPN, mapped to IAM users]
35. Data Encryption
AWS CloudHSM – dedicated-tenancy SafeNet Luna SA HSM device; Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service – automated key rotation & auditing; integration with other AWS services
AWS server-side encryption – AWS-managed key infrastructure
36. Entitlements – Access to Encryption Keys
[Diagram: envelope encryption flow – a customer master key generates customer data keys (a plaintext key and a ciphertext key); the plaintext key encrypts MyData before it is written to S3, and the ciphertext key is stored with the S3 object (Name: MyData, Key: Ciphertext Key); callers use IAM temporary credentials issued by the Security Token Service]
37. Secure Data Flow
[Diagram: users obtain temporary credentials through an API Gateway front end and a token vending machine (TVM) on Elastic Beanstalk, backed by IAM and the Security Token Service; the credentials scope access to encrypted data under s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID} and to the DynamoDB metadata index]
38. Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
39.
40. A Serverless Data Lake Service
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – same data, many views (see the sketch after this list)
• Encrypted
• Standard compliant and open storage formats
• Built on powerful community supported OSS solutions
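A minimal sketch of "same data, many views" (table names, columns, and the S3 path are illustrative): two external table definitions can point at the same prefix, and creating or dropping either one never copies or modifies the underlying objects:
-- raw view: one string column per line
CREATE EXTERNAL TABLE app_logs_raw (line string)
LOCATION 's3://mydatalake/app/logs/';
-- parsed view over the very same objects
CREATE EXTERNAL TABLE app_logs_fields (
  ts string,
  level string,
  message string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://mydatalake/app/logs/';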
41. Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data transfer apply
43. Familiar Technologies Under the Covers
Presto – used for SQL queries; in-memory distributed query engine; ANSI-SQL compatible with extensions
Hive – used for DDL functionality; complex data types; multitude of formats; supports data partitioning
44. Hive Metadata Definition
• Hive Data Definition Language
• Hive compatible SerDe (serializer/deserializer)
• CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail
• Coming soon
• Data Manipulation Language (INSERT, UPDATE)
• Create Table As
• User Defined Functions
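To make the supported DDL concrete, a minimal sketch of an external, partitioned Parquet table (the table name, columns, and S3 path are illustrative, not from the deck):
CREATE EXTERNAL TABLE elb_logs_sample (
  request_timestamp string,
  elb_name string,
  backend_status_code int
)
PARTITIONED BY (year string, month string)
STORED AS PARQUET
LOCATION 's3://athena-examples/elb/parquet/';
-- partitions still need to be registered before they are queryable (see the partitioning slides below)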
45. Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries &
window functions
• Complex data types (arrays,
structs, maps)
• Partitioning of data by any key
• date, time, custom keys
• Presto built-in functions
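For illustration only, a query of the shape these features enable, written against the hypothetical table sketched on the previous slide; it combines an aggregate, a window function, and an array subscript:
-- rank load balancers by request volume (window function over an aggregate)
SELECT elb_name,
       request_count,
       rank() OVER (ORDER BY request_count DESC) AS load_rank
FROM (
  SELECT elb_name, count(*) AS request_count
  FROM elb_logs_sample
  GROUP BY elb_name
) t;
-- complex types: Presto arrays are 1-indexed
SELECT ARRAY['GET', 'PUT', 'POST'][2];  -- returns 'PUT'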
50. CSV SerDe
LazySimpleSerDe – supports different primitive types, but does not support removing quote characters from fields:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'collection.delim' = '|',
'mapkey.delim' = ':',
'escape.delim' = '\\' )
OpenCSVSerde – removes quote characters, but all fields must be defined as type String:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "`",
"escapeChar" = "\\" )
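Putting one of the fragments above into a complete statement (the table, columns, and bucket are illustrative); with OpenCSVSerde every column is declared as string and cast at query time:
CREATE EXTERNAL TABLE quoted_csv (
  order_id string,
  order_total string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION 's3://mydatalake/orders/csv/';
-- cast the string column back to a numeric type when querying
SELECT CAST(order_total AS decimal(10,2)) FROM quoted_csv LIMIT 10;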
51. Data Partitioning – Benefits
• Separates data files by any column
• Read only files the query needs
• Reduce amount of data scanned
• Reduce query completion time
• Reduce query cost
52. Data Partitioning – S3
• Prefer Hive-compatible partition naming
• [column_name = column_value]
• e.g. s3://athena-examples/logs/year=2017/month=5/
• Simple partition naming is also supported
• e.g. s3://athena-examples/logs/2017/5/
53. Data Partitioning – Data Catalog
ALTER TABLE app_logs ADD PARTITION (year='2015',month='01',day='01') location
's3://athena-examples/app/plaintext/year=2015/month=01/day=01/'
ALTER TABLE elb_logs ADD PARTITION (year='2015',month='01',day='01') location
's3://athena-examples/elb/plaintext/2015/01/01/'
ALTER TABLE orders DROP PARTITION (dt='2014-05-14',country='IN'),
PARTITION (dt='2014-05-15',country='IN')
ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION
's3://athena-examples/new_customers/zip=98040/state=WA'
MSCK REPAIR TABLE table_name ← only works with Hive-compatible partition naming
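Once partitions are registered, a predicate on the partition columns lets Athena read only the matching prefixes, which is where the scan, time, and cost reductions listed on the benefits slide come from; a sketch against the app_logs table above:
SELECT count(*)
FROM app_logs
WHERE year = '2015' AND month = '01' AND day = '01';
-- only s3://athena-examples/app/plaintext/year=2015/month=01/day=01/ is read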
54. File Formats
• Columnar – Parquet & ORC
• Compressed
• Column based read optimized
• Integrated indexes and stats
• Not ideal for appending new data
• Row – Avro
• Compressed
• Row based read optimized
• Integrated indexes and stats
• Ideal for appending new data
• Text – xSV, JSON
• May or may not be compressed
• Not optimized
• Generic and malleable
55. File Format – Examples
SELECT count(*) as count FROM taxi_rides_csv
Run time: 20.06 seconds, Data scanned: 207.54GB – 1,310,911,060
SELECT count(*) as count FROM taxi_rides_parquet
Run time: 5.76 seconds, Data scanned: 0KB – 2,870,781,820
SELECT * FROM taxi_rides_csv limit 1000
Run time: 3.13 seconds, Data scanned: 328.82MB
SELECT * FROM taxi_rides_parquet limit 1000
Run time: 1.13 seconds, Data scanned: 5.2MB
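The Parquet count(*) reports 0 KB scanned because Parquet keeps row counts in its file footers, so the engine never has to touch the data pages. Columnar formats help further when only a few columns are needed; a sketch (the trip_distance column is hypothetical, not from the deck):
SELECT avg(trip_distance)
FROM taxi_rides_parquet;
-- only the trip_distance column chunks are read, so data scanned (and cost at $5/TB) stays far below the CSV equivalent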
56. File Formats – Considerations
• Scanning
• xSV and JSON require scanning entire file
• Columnar ideal when selecting only a subset of columns
• Row ideal when selecting all columns of a subset of rows
• Read Performance
• Text – SLOW
• Avro – Optimal (specific to use case)
• Parquet & ORC – Optimal (specific to use case)
• Write Performance
• Text – SLOW
• Avro – Good
• Parquet & ORC – Good (has some overhead with large datasets)
• Garbage Collection Overhead
• Text based – LOW
• Avro – LOW *
• ORC – LOW → MEDIUM *
• Parquet – MEDIUM → HIGH *
* Highly dependent on the dataset
57. Athena API
• Asynchronous interaction model
• Initiate a query, get query ID, retrieve results
• Named queries
• Save queries and reuse
• Paginated result set
• Max page size is currently 1,000
• Column data and metadata
• Name, type, precision, nullable
• Query status
• State, start and end times
• Query statistics
• Data scanned and execution time
61. JDBC
• Great for integrating with existing data access tools
• Tableau, Looker, SQL Workbench
• Utilizes the API under the hood
• Only JDBC v1.1.0+ is compatible with the public API
• Simple to use and integrate
jdbc:awsathena://athena.REGION.amazonaws.com:443/hive/DB_NAME
• Requires accessKey and secretKey
• When integrated via code, can use custom credentials
provider with temporary credentials
62. JDBC - Considerations
• Requires Java – If you don’t need Java, use the API
• Slower – Overhead impacts performance
• Returns only 1000 records
• No access to other APIs – Only SQL and DDL statements
• Requires IAM user credentials – When used by 3rd party apps
63. The tyranny of “OR”
Amazon EMR
Directly access data in S3
Scale out to thousands of nodes
Open data formats
Popular big data frameworks
Anything you can dream up and code
Amazon Redshift
Super-fast local disk performance
Sophisticated query optimization
Join-optimized data formats
Query using standard SQL
Optimized for data warehousing
64. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale
Elastic & highly available
On-demand, pay-per-query
High concurrency: multiple clusters access the same data
No ETL: query data in place using open file formats
Full Amazon Redshift SQL support
65. Amazon Redshift Spectrum is fast
Leverages Amazon Redshift’s advanced cost-based optimizer
Pushes down projections, filters, aggregations and join reduction
Dynamic partition pruning to minimize data processed
Automatic parallelization of query execution against S3 data
Efficient join processing within the Amazon Redshift cluster
66. Amazon Redshift Spectrum is cost-effective
You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
Each query can leverage 1000s of Amazon Redshift Spectrum nodes
You can reduce the TB scanned and improve query performance by:
Partitioning data
Using a columnar file format
Compressing data
67. Amazon Redshift Spectrum is secure
End-to-end data encryption – encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
Virtual private cloud – Amazon Redshift leader node in your VPC; compute nodes in a private VPC; Spectrum nodes in a private VPC, store no state
Alerts & notifications – communicate event-specific notifications via email, text message, or call with Amazon SNS
Audit logging – all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
Certifications & compliance – PCI/DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA
68. Amazon Redshift Spectrum uses standard SQL
Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
Support for complex joins, nested queries & window functions
Support for data partitioned in S3 by any key
Date, Time and any other custom keys
e.g., Year, Month, Day, Hour
69. Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
On average, data warehousing volumes grow 10x every 5 years
The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
Access your data without ETL pipelines
Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
Run multiple Amazon Redshift clusters against common data
Isolate jobs with tight SLAs from ad hoc analysis
70. Defining External Schema and Creating Tables
Define an external schema in Amazon Redshift using the Amazon Athena data
catalog or your own Apache Hive Metastore
CREATE EXTERNAL SCHEMA <schema_name>
Query external tables using <schema_name>.<table_name>
Register external tables using Athena, your Hive Metastore client, or Amazon Redshift's CREATE EXTERNAL TABLE syntax
CREATE EXTERNAL TABLE <table_name>
[PARTITIONED BY <column_name, data_type, …>]
STORED AS file_format
LOCATION s3_location
[TABLE PROPERTIES property_name=property_value, …];
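A concrete sketch of that pattern, assuming a catalog database named 'mydatalake' and a placeholder IAM role ARN (the table, columns, and bucket are illustrative too):
-- external schema backed by the Athena/Glue data catalog
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'mydatalake'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- external table over Parquet files in S3
CREATE EXTERNAL TABLE spectrum.orders (
  order_id bigint,
  amount double precision,
  order_ts timestamp
)
STORED AS PARQUET
LOCATION 's3://mybucket/orders/';
-- query it like any other schema-qualified table
SELECT count(*) FROM spectrum.orders;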
71. Amazon Redshift Spectrum – Current support
File formats
• Parquet
• CSV
• Sequence
• RCFile
• ORC (coming soon)
• RegExSerDe (coming soon)
Compression
• Gzip
• Snappy
• LZO (coming soon)
• Bz2
Encryption
• SSE with AES256
• SSE-KMS with default key
Column types
• Numeric: bigint, int, smallint, float, double, and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a partitioning key
Table type
• Non-partitioned table (s3://mybucket/orders/..)
• Partitioned table (s3://mybucket/orders/date=YYYY-MM-DD/..)
72. Converting to Parquet and ORC using Amazon EMR
You can use Hive CREATE TABLE AS SELECT to convert data
CREATE TABLE data_converted
STORED AS PARQUET
AS
SELECT col_1, col2, col3 FROM data_source
Or use Spark - 20 lines of Pyspark code, running on Amazon EMR
• 1TB of text data reduced to 130 GB in Parquet format with snappy compression
• Total cost of EMR job to do this: $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
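The same CTAS pattern works for ORC; a sketch assuming Snappy is the desired codec (table and column names reuse the illustrative ones above):
CREATE TABLE data_converted_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY')  -- Hive table property selecting the ORC codec
AS
SELECT col_1, col2, col3 FROM data_source;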
73. Life of a query – step 1: the query (SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …) arrives at Amazon Redshift via JDBC/ODBC
[Diagram repeated on slides 73–81: JDBC/ODBC client, Amazon Redshift cluster, Amazon Redshift Spectrum nodes 1…N, Amazon S3 exabyte-scale object storage, Data Catalog / Apache Hive Metastore]
74. Life of a query – step 2: the query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum
75. Life of a query – step 3: the query plan is sent to all compute nodes
76. Life of a query – step 4: compute nodes obtain partition info from the Data Catalog and dynamically prune partitions
77. Life of a query – step 5: each compute node issues multiple requests to the Amazon Redshift Spectrum layer
78. Life of a query – step 6: Amazon Redshift Spectrum nodes scan your S3 data
79. Life of a query – step 7: Amazon Redshift Spectrum projects, filters, joins, and aggregates
80. Life of a query – step 8: final aggregations and joins with local Amazon Redshift tables are done in-cluster
81. Life of a query – step 9: the result is sent back to the client
83. Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Let's get the prior books she's written.
1 Table
2 Filters
SELECT
P.ASIN,
P.TITLE
FROM
products P
WHERE
P.TITLE LIKE '%POTTER%' AND
P.AUTHOR = 'J. K. Rowling'
84. Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Let's compute the sales of the prior books she's written in this
series and return the top 20 values
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
P.ASIN,
P.TITLE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
products P
WHERE
D.ASIN = P.ASIN AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
85. Let's build an analytic query - #3
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Let's compute the sales of the prior books she's written in this
series and return the top 20 values, just for the first three days
of sales of first editions
3 Tables (1 S3, 2 local)
5 Filters
2 Joins
3 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
86. Let's build an analytic query - #4
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Let's compute the sales of the prior books she's written in this
series and return the top 20 values, just for the first three days
of sales of first editions in the city of Seattle, WA, USA
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
R.POSTAL_CODE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P,
regions R
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
D.REGION_ID = R.REGION_ID AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
R.COUNTRY_CODE = 'US' AND
R.CITY = 'Seattle' AND
R.STATE = 'WA' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
87. Now let’s run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail
records for each day over past 20 years.
190 million files across 15,000 partitions in S3.
One partition per day for USA and rest of world.
Need a billion-fold reduction in data processed.
Running this query using a 1000 node Hive cluster
would take over 5 years.*
• Compression: 5X
• Columnar file format: 10X
• Scanning with 2500 nodes: 2500X
• Static partition elimination: 2X
• Dynamic partition elimination: 350X
• Redshift's query optimizer: 40X
---------------------------------------------------
Total reduction: 3.5 billion X
* Estimated using 20 node Hive cluster & 1.4TB, assume linear
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data
format used by Amazon Retail.