Scalable Data Analytics
AWS Data Services to Accelerate Your Move to the Cloud
Databases to Elevate your Apps (Relational, Non-Relational & In-Memory): RDS Open Source, RDS Commercial, Aurora, DynamoDB & DAX, ElastiCache
Analytics to Engage your Data (Inline, Data Warehousing, Reporting, Data Lake): EMR, Amazon Redshift, Redshift Spectrum, Athena, Elasticsearch Service, QuickSight, Glue
Amazon AI to Drive the Future: Lex, Polly, Rekognition, Machine Learning, Deep Learning (MXNet)
Migration for DB Freedom: Database Migration, Schema Conversion
A Data Lake Is…
• A foundation of highly durable data storage and
streaming of any type of data
• A metadata index and workflow which helps us
categorise and govern data stored in the data lake
• A search index and workflow which enables data
discovery
• A robust set of security controls – governance through
technology, not policy
• An API and user interface that expose these features to
internal and external users
The Emerging Analytics Architecture
Storage, serverless compute, and data processing services in the stack:
• Amazon Athena – Interactive Query
• AWS Glue – ETL & Data Catalog
• Amazon S3 – Exabyte-scale Object Storage
• Amazon Kinesis Firehose – Real-Time Data Streaming
• Amazon EMR – Managed Hadoop Applications
• AWS Lambda – Trigger-based Code Execution
• AWS Glue Data Catalog – Hive-compatible Metastore
• Amazon Redshift Spectrum – Fast @ Exabyte scale
• Amazon Redshift – Petabyte-scale Data Warehousing
Comparison of a Data Lake to an Enterprise Data Warehouse
• Complementary to the EDW (not a replacement); the data lake can be a source for the EDW
• Schema on read (no predefined schemas) vs. schema on write (predefined schemas)
• Structured/semi-structured/unstructured data vs. structured data only
• Fast ingestion of new data/content vs. time-consuming introduction of new content
• Data science + prediction/advanced analytics + BI use cases vs. BI use cases only (no prediction/advanced analytics)
• Data at low level of detail/granularity vs. data at summary/aggregated level of detail
• Loosely defined SLAs vs. tight SLAs (production schedules)
• Flexibility in tools (open source/tools for advanced analytics) vs. limited flexibility in tools (SQL only)
The New Problem
EMR + S3 ≠ Enterprise data warehouse
"Which system has my data?"
"How can I do machine learning against the DW?"
"I built this in Hive, can we get it into the Finance reports?"
"These sources are giving different results…"
"But I implemented the algorithm in Anaconda…"
Dive Into The Data Lake
EMR + S3 ≠ Enterprise data warehouse
Dive Into The Data Lake
EMR + S3 (the data lake) loads cleansed data into the enterprise data warehouse; the warehouse exports computed aggregates back to the lake.
Data lake: ingest any data, data cleansing, data catalogue, trend analysis, machine learning, structured analysis.
Enterprise data warehouse: common access tools, efficient aggregation, structured business rules.
Components of a Data Lake
Data Storage
• High durability
• Stores raw data from input sources
• Support for any type of data
• Low cost
Streaming
• Streaming ingest of feed data
• Provides the ability to consume any dataset as
a stream
• Facilitates low latency analytics
Storage & Streams
Catalogue & Search
Entitlements
API & UI
Components of a Data Lake
Catalogue
• Metadata lake
• Used for summary statistics and data classification management
Search
• Simplified access model for data discovery
Components of a Data Lake
Entitlements system
• Encryption
• Authentication
• Authorisation
• Chargeback
• Quotas
• Data masking
• Regional restrictions
Components of a Data Lake
API & User Interface
• Exposes the data lake to customers
• Programmatically query catalogue
• Expose search API
• Ensures that entitlements are respected
STORAGE
High durability
Stores raw data from input sources
Support for any type of data
Low cost
Amazon Simple Storage Service
Highly scalable object storage for the Internet
1 byte to 5 TB in size
Designed for 99.999999999% durability, 99.99% availability
Regional service, no single points of failure
Server side encryption
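Server-side encryption can be requested per object at upload time. A minimal sketch with the Node.js AWS SDK (bucket and key names are hypothetical):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Upload an object with S3-managed server-side encryption (SSE-S3)
s3.putObject({
  Bucket: 'mydatalake',                 // hypothetical bucket
  Key: 'raw/logs/2017-05-01/app.log',   // self-documenting, date-partitioned key
  Body: 'raw log payload',
  ServerSideEncryption: 'AES256'        // or 'aws:kms' to use AWS KMS-managed keys
}, (err, data) => {
  if (err) console.error(err);
  else console.log('Stored with ETag:', data.ETag);
});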
Storage Lifecycle Integration
S3 Standard → S3 Infrequent Access → Amazon Glacier
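These transitions can be automated with a bucket lifecycle rule. A hedged sketch (the prefix and day thresholds are illustrative):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Age raw data through cheaper storage classes automatically
s3.putBucketLifecycleConfiguration({
  Bucket: 'mydatalake',
  LifecycleConfiguration: {
    Rules: [{
      ID: 'tier-raw-data',
      Filter: { Prefix: 'raw/' },
      Status: 'Enabled',
      Transitions: [
        { Days: 30, StorageClass: 'STANDARD_IA' },  // infrequent access after 30 days
        { Days: 365, StorageClass: 'GLACIER' }      // archive after a year
      ]
    }]
  }
}, (err) => { if (err) console.error(err); });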
Data Storage Format
• Not all data formats are created equally
• Unstructured vs. semi-structured vs. structured
• Store a copy of raw input
• Data standardisation as a workflow following ingest
• Use a format that supports your data, rather than force
your data into a format
• Consider how data will change over time
• Apply common compression
Consider Different Types of Data
Unstructured
• Store native file format (logs, dump files, whatever)
• Compress with a streaming codec (LZO, Snappy)
Semi-structured - JSON, XML files, etc.
• Consider evolution ability of the data schema (Avro)
• Store the schema for the data as a file attribute (metadata/tag)
Structured
• Lots of data is CSV!
• Columnar storage (ORC, Parquet)
Where to Store Data
• Amazon S3 storage uses a flat keyspace
• Separate data storage by business unit, application, type, and
time
• Natural data partitioning is very useful
• Paths should be self documenting and intuitive
• Changing prefix structure in future is hard/costly
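A small sketch of such a self-documenting layout (the path components are hypothetical, following the pattern on the Resource Oriented Architecture slide below):

// Build a self-documenting, naturally partitioned S3 key
function dataLakeKey({ system, application, date, resource, resourceId }) {
  return `${system}/${application}/${date}/${resource}/${resourceId}`;
}

console.log(dataLakeKey({
  system: 'retail',
  application: 'orders',
  date: '2017-05-01',
  resource: 'order',
  resourceId: '12345'
}));
// => retail/orders/2017-05-01/order/12345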
Resource Oriented Architecture
Metadata services expose a CRUD API, a Query API, and an Analytics API over the systems of reference, returning URLs.
URLs act as deeplinks to applications, as file exchanges via S3 (RESTful file services), or as manifests for Big Data Analytics / HPC.
Integration layer:
• System to system via Amazon SNS/Amazon SQS
• System to user via mobile push
• Amazon Simple Workflow for high-level system integration / orchestration
http://en.wikipedia.org/wiki/Resource-oriented_architecture
s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied}
STREAMING
Streaming ingest of feed data
Provides the ability to consume any
dataset as a stream
Facilitates low latency analytics
Why Do Streams Matter?
• Latency between event & action
• Most BI systems target event to action latency of 1 hour
• Streaming analytics would expect event to action latency
< 2 seconds
• Stream orientation simplifies architecture, but can
increase operational complexity
• Increase in complexity needs to be justified by business
value of reduced latency
Amazon Kinesis
Managed service for real time big data processing
Create streams to produce & consume data
Elastically add and remove shards for performance
Use the Amazon Kinesis Client Library (KCL) to process data
Integration with S3, Amazon Redshift, and DynamoDB
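A minimal producer sketch with the Node.js SDK (stream name and payload are hypothetical):

const AWS = require('aws-sdk');
const kinesis = new AWS.Kinesis();

// Write one record; records sharing a partition key land on the same shard
kinesis.putRecord({
  StreamName: 'clickstream',     // hypothetical stream
  PartitionKey: 'user-42',       // drives shard assignment
  Data: JSON.stringify({ event: 'page_view', ts: Date.now() })
}, (err, data) => {
  if (err) console.error(err);
  else console.log('Sequence number:', data.SequenceNumber);
});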
Amazon Kinesis Architecture (diagram): data sources push records to a stream whose shards (Shard 1 … Shard N) are replicated across availability zones; multiple applications consume the same stream in parallel, e.g. App.1 (archive/ingestion), App.2 (sliding window analysis), App.3 (data loading), App.4 (event processing systems), feeding stores such as S3, DynamoDB, and Amazon Redshift.
Streaming Storage Integration
The object store (Amazon S3) and the streaming store (Amazon Kinesis) complement each other: analytics applications read & write file data and read & write to streams, archive the stream to S3, and replay history from the archive.
CATALOGUE & SEARCH
Metadata lake
Used for summary statistics and data classification management
Simplified model for data discovery &
governance
Building a Data Catalogue
• Aggregated information about your storage & streaming
layer
• Storage service for metadata
Ownership, data lineage
• Data abstraction layer
Customer data = collection of prefixes
• Enabling data discovery
• API for use by entitlements service
Data Catalogue – Metadata Index
• Stores data about your Amazon S3 storage environment
• Total size & count of objects by prefix, data classification,
refresh schedule, object version information
• Amazon S3 events processed by Lambda function
• DynamoDB metadata tables store required attributes
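A hedged sketch of such an indexer: a Lambda handler that receives S3 event notifications and upserts object attributes into a DynamoDB table (table and attribute names are hypothetical):

const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

// Lambda handler: index each created S3 object into the metadata table
exports.handler = async (event) => {
  for (const record of event.Records) {
    const { bucket, object } = record.s3;
    await dynamo.put({
      TableName: 'DataLakeMetadataIndex',   // hypothetical table
      Item: {
        prefix: object.key.split('/').slice(0, -1).join('/'),  // group by prefix
        objectKey: object.key,
        sizeBytes: object.size,
        eTag: object.eTag,
        bucket: bucket.name,
        indexedAt: new Date().toISOString()
      }
    }).promise();
  }
};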
ENTITLEMENTS
Encryption
Authentication
Authorisation
Chargeback
Quotas
Data masking
Regional restrictions
Data Lake != Open Access
Identity & Access Management
• Manage users, groups, and roles
• Identity federation with OpenID providers
• Temporary credentials with AWS Security Token Service (STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
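As a sketch, an entitlements service might vend short-lived, scoped credentials through STS (the role ARN and session name are hypothetical):

const AWS = require('aws-sdk');
const sts = new AWS.STS();

// Issue temporary credentials scoped to a data lake role
sts.assumeRole({
  RoleArn: 'arn:aws:iam::123456789012:role/DataLakeReader',  // hypothetical role
  RoleSessionName: 'analyst-session',
  DurationSeconds: 3600                                      // expire after one hour
}, (err, data) => {
  if (err) return console.error(err);
  // Hand these to the caller; they expire automatically
  const { AccessKeyId, SecretAccessKey, SessionToken } = data.Credentials;
  console.log('Issued temporary key', AccessKeyId);
});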
IAM Policy Language
• JSON documents
• Can include variables
which extract information
from the request context
aws:CurrentTime - for date/time conditions
aws:EpochTime - the date in epoch (UNIX) time, for use with date/time conditions
aws:TokenIssueTime - the date/time that temporary security credentials were issued, for use with date/time conditions
aws:principaltype - indicates whether the principal is an account, user, federated user, or assumed role
aws:SecureTransport - Boolean indicating whether the request was sent using SSL
aws:SourceIp - the requester's IP address, for use with IP address conditions
aws:UserAgent - information about the requester's client application, for use with string conditions
aws:userid - the unique ID for the current user
aws:username - the friendly name of the current user
IAM Policy Language
Example: Allow a user to access a private part of the data lake
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["s3:ListBucket"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mydatalake"],
      "Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}
    },
    {
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]
    }
  ]
}
IAM Federation
• IAM allows federation to Active Directory and to OpenID providers (Amazon, Facebook, Google)
• AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS
(Diagram: on-premises Active Directory reached through the AWS Directory Service AD Connector over Direct Connect or hardware VPN, mapping directory identities to IAM users.)
Data Encryption
AWS CloudHSM: dedicated-tenancy SafeNet Luna SA HSM device; Common Criteria EAL4+, NIST FIPS 140-2
AWS Key Management Service: automated key rotation & auditing; integration with other AWS services; AWS server side encryption; AWS-managed key infrastructure
Entitlements – Access to Encryption Keys
(Diagram: a caller presents an IAM temporary credential issued by the Security Token Service; the customer master key generates customer data keys, returning a plaintext key used to encrypt MyData and a ciphertext key stored with the S3 object, e.g. Name: MyData, Key: Ciphertext Key.)
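A hedged sketch of that envelope-encryption flow (the KMS key alias, bucket, and metadata attribute names are hypothetical):

const crypto = require('crypto');
const AWS = require('aws-sdk');
const kms = new AWS.KMS();
const s3 = new AWS.S3();

// Envelope encryption: encrypt locally with a generated data key and store
// only the ciphertext form of that key alongside the object
kms.generateDataKey({
  KeyId: 'alias/datalake-master',   // hypothetical customer master key
  KeySpec: 'AES_256'
}, (err, key) => {
  if (err) return console.error(err);
  const iv = crypto.randomBytes(16);
  const cipher = crypto.createCipheriv('aes-256-cbc', key.Plaintext, iv);
  const encrypted = Buffer.concat([cipher.update('MyData'), cipher.final()]);
  s3.putObject({
    Bucket: 'mydatalake',
    Key: 'secure/MyData',
    Body: encrypted,
    Metadata: {                     // the ciphertext key travels with the object
      'ciphertext-key': key.CiphertextBlob.toString('base64'),
      'iv': iv.toString('base64')
    }
  }, (e) => { if (e) console.error(e); });
});

Decryption reverses the flow: call kms.decrypt on the stored ciphertext key and decrypt the object body locally, so only principals entitled to use the master key can recover the plaintext.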
Secure Data Flow
(Diagram: users call an API Gateway front end; a token vending machine on Elastic Beanstalk checks IAM and issues temporary credentials via the Security Token Service; clients then access encrypted data directly at s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID}, with the DynamoDB metadata index resolving object locations.)
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
A Serverless Data Lake Service
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Encrypted
• Standards-compliant and open storage formats
• Built on powerful community supported OSS solutions
Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned - $5 / TB
• Standard S3 rates for storage, requests, and data transfer apply
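To make the model concrete: at $5/TB, a query scanning 100 GB costs roughly $0.49. The file-format examples later in this deck scan 207.54 GB of CSV (about $1.01) for a count(*) that costs nothing at all against Parquet (0 KB scanned).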
Customers Drive Product Decisions
Familiar Technologies Under the Covers
Presto, used for SQL queries: an in-memory distributed query engine, ANSI-SQL compatible with extensions
Hive, used for DDL functionality: complex data types, a multitude of formats, and support for data partitioning
Hive Metadata Definition
• Hive Data Definition Language
• Hive compatible SerDe (serializer/deserializer)
• CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail
• Coming soon
• Data Manipulation Language (INSERT, UPDATE)
• Create Table As
• User Defined Functions
Presto SQL
• ANSI SQL compliant
• Complex joins, nested queries &
window functions
• Complex data types (arrays,
structs, maps)
• Partitioning of data by any key
• date, time, custom keys
• Presto built-in functions
Amazon Analytics End to End Architecture
(Diagram: the S3 data lake and its Data Catalog sit at the center; Athena, EMR, and Redshift Spectrum query it; Kinesis, the Database Migration Service, Glue, RDS, and other sources feed it; Amazon ML / MXNet and QuickSight consume the results; IAM governs access throughout.)
Athena in Action
Creating Tables – Parquet
CREATE EXTERNAL TABLE db_name.taxi_rides_parquet (
vendorid STRING,
pickup_datetime TIMESTAMP,
dropoff_datetime TIMESTAMP,
ratecode INT,
passenger_count INT,
trip_distance DOUBLE,
fare_amount DOUBLE,
total_amount DOUBLE,
payment_type INT
)
PARTITIONED BY (YEAR INT, MONTH INT, TYPE string)
STORED AS PARQUET
LOCATION 's3://serverless-analytics/canonical/NY-Pub'
TBLPROPERTIES ('has_encrypted_data'='true');
Creating Tables – Nested JSON
CREATE EXTERNAL TABLE IF NOT EXISTS fix_messages (
`bodyLength` int,
`defaultAppVerID` string,
`encryptMethod` int,
`msgSeqNum` int,
`msgType` string,
`resetSeqNumFlag` string,
`securityRequestID` string,
`securityRequestResult` int,
`securityXML` struct <version:int, header:struct<assetClass:string,
tierLevelISIN:int, useCase:string>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1'
) LOCATION 's3://my_bucket/fix/'
TBLPROPERTIES ('has_encrypted_data'='false');
CSV SerDe
LazySimpleSerDe supports different primitive types but does not support removing quote characters from fields:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ',',
'colelction.delim' = '|',
'mapkey.delim' = ':',
'escape.delim' = '\\' )
(The misspelled 'colelction.delim' is intentional; it matches the property name Hive expects.)
OpenCSVSerde removes quote characters, but all fields must be defined as type String:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "`",
"escapeChar" = "\\" )
Data Partitioning – Benefits
• Separates data files by any column
• Read only files the query needs
• Reduce amount of data scanned
• Reduce query completion time
• Reduce query cost
Data Partitioning – S3
• Prefer Hive compatible partition naming
• [column_name = column_value]
• i.e. s3://athena-examples/logs/year=2017/month=5/
• Simple partition naming is also supported
• i.e. s3://athena-examples/logs/2017/5/
Data Partitioning – Data Catalog
ALTER TABLE app_logs ADD PARTITION (year='2015',month='01',day='01') LOCATION 's3://athena-examples/app/plaintext/year=2015/month=01/day=01/'
ALTER TABLE elb_logs ADD PARTITION (year='2015',month='01',day='01') LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/'
ALTER TABLE orders DROP PARTITION (dt='2014-05-14',country='IN'), PARTITION (dt='2014-05-15',country='IN')
ALTER TABLE customers PARTITION (zip='98040', state='WA') SET LOCATION 's3://athena-examples/new_customers/zip=98040/state=WA'
MSCK REPAIR TABLE table_name -- only works with Hive compatible partitions
File Formats
• Columnar – Parquet & ORC
• Compressed
• Column based read optimized
• Integrated indexes and stats
• Not ideal for appending new data
• Row – Avro
• Compressed
• Row based read optimized
• Integrated indexes and stats
• Ideal for appending new data
• Text – xSV, JSON
• May or may not be compressed
• Not optimized
• Generic and malleable
File Format – Examples
SELECT count(*) AS count FROM taxi_rides_csv
Run time: 20.06 seconds, data scanned: 207.54 GB (result: 1,310,911,060 rows)
SELECT count(*) AS count FROM taxi_rides_parquet
Run time: 5.76 seconds, data scanned: 0 KB (result: 2,870,781,820 rows; row counts come from Parquet metadata, so no data needs scanning)
SELECT * FROM taxi_rides_csv LIMIT 1000
Run time: 3.13 seconds, data scanned: 328.82 MB
SELECT * FROM taxi_rides_parquet LIMIT 1000
Run time: 1.13 seconds, data scanned: 5.2 MB
File Formats – Considerations
• Scanning
• xSV and JSON require scanning entire file
• Columnar ideal when selecting only a subset of columns
• Row ideal when selecting all columns of a subset of rows
• Read Performance
• Text – SLOW
• Avro – Optimal (specific to use case)
• Parquet & ORC – Optimal (specific to use case)
• Write Performance
• Text – SLOW
• Avro – Good
• Parquet & ORC – Good (has some overhead with large datasets)
• Garbage Collection Overhead
• Text based – LOW
• Avro – LOW *
• ORC – LOW → MEDIUM *
• Parquet – MEDIUM → HIGH *
* Highly dependent on the dataset
Athena API
• Asynchronous interaction model
• Initiate a query, get query ID, retrieve results
• Named queries
• Save queries and reuse
• Paginated result set
• Max page size currently 1000
• Column data and metadata
• Name, type, precision, nullable
• Query status
• State, start and end times
• Query statistics
• Data scanned and execution time
Athena API
• BatchGetNamedQuery
• BatchGetQueryExecution
• CreateNamedQuery
• DeleteNamedQuery
• GetNamedQuery
• ListNamedQueries
• GetQueryExecution
• ListQueryExecutions
• StartQueryExecution
• StopQueryExecution
• GetQueryResults
Athena API
StartQueryExecution
client.startQueryExecution({
  QueryString: 'SELECT * FROM table_name LIMIT 100',
  ResultConfiguration: {
    OutputLocation: 's3://bucket/output/',
    EncryptionConfiguration: { EncryptionOption: 'SSE_S3' }
  },
  QueryExecutionContext: { Database: 'default_db' }
}, (err, result) => {})
GetQueryResults
client.getQueryResults({
QueryExecutionId: '2ef5d590-025a-48ec-895e-6bedfe72bc95',
MaxResults: 1000,
NextToken: null
}, (err, data) => {})
Athena API
BatchGetQueryExecution
client.batchGetQueryExecution({
  QueryExecutionIds: ['2ef5d590-025a-48ec-895e-6bedfe72bc95']
}, (err, data) => {})
ListQueryExecutions
client.listQueryExecutions({
  MaxResults: 50
}, (err, data) => {})
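Putting the asynchronous model together, a hedged sketch that starts a query, polls its state, and fetches results (region, output location, and timing are assumptions):

const AWS = require('aws-sdk');
const client = new AWS.Athena({ region: 'us-east-1' });

// Start a query, poll until it finishes, then fetch the first page of results
async function runQuery(sql) {
  const { QueryExecutionId } = await client.startQueryExecution({
    QueryString: sql,
    ResultConfiguration: { OutputLocation: 's3://bucket/output/' }  // hypothetical
  }).promise();

  for (;;) {
    const { QueryExecution } = await client.getQueryExecution({ QueryExecutionId }).promise();
    const state = QueryExecution.Status.State;
    if (state === 'SUCCEEDED') break;
    if (state === 'FAILED' || state === 'CANCELLED') throw new Error(state);
    await new Promise(resolve => setTimeout(resolve, 1000));  // back off and retry
  }
  return client.getQueryResults({ QueryExecutionId, MaxResults: 1000 }).promise();
}

runQuery('SELECT 1').then(r => console.log(r.ResultSet.Rows));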
JDBC
• Great for integrating with existing data access tools
• Tableau, Looker, SQL Workbench
• Utilizes the API under the hood
• Only JDBC v1.1.0+ is compatible with the public API
• Simple to use and integrate
jdbc:awsathena://athena.REGION.amazonaws.com:443/hive/DB_NAME
• Requires accessKey and secretKey
• When integrated via code, can use custom credentials
provider with temporary credentials
JDBC - Considerations
• Requires Java – If you don’t need Java, use the API
• Slower – Overhead impacts performance
• Returns only 1000 records
• No access to other APIs – Only SQL and DDL statements
• Requires IAM user credentials – When used by 3rd party apps
The tyranny of “OR”
Amazon EMR
Directly access data in S3
Scale out to thousands of nodes
Open data formats
Popular big data frameworks
Anything you can dream up and code
Amazon Redshift
Super-fast local disk performance
Sophisticated query optimization
Join-optimized data formats
Query using standard SQL
Optimized for data warehousing
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast @ exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• No ETL: query data in-place using open file formats
• Full Amazon Redshift SQL support
Amazon Redshift Spectrum is fast
Leverages Amazon Redshift’s advanced cost-based optimizer
Pushes down projections, filters, aggregations and join reduction
Dynamic partition pruning to minimize data processed
Automatic parallelization of query execution against S3 data
Efficient join processing within the Amazon Redshift cluster
Amazon Redshift Spectrum is cost-effective
You pay for your Amazon Redshift cluster plus $5 per TB scanned from S3
Each query can leverage 1000s of Amazon Redshift Spectrum nodes
You can reduce the TB scanned and improve query performance by:
Partitioning data
Using a columnar file format
Compressing data
Amazon Redshift Spectrum is secure
• End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE
• Virtual private cloud: Amazon Redshift leader node in your VPC; compute nodes in a private VPC; Spectrum nodes in a private VPC, storing no state
• Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS
• Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift
• Certifications & compliance: SOC 1/2/3, PCI DSS, FedRAMP, HIPAA/BAA
Amazon Redshift Spectrum uses standard SQL
Redshift Spectrum seamlessly integrates with your existing SQL & BI apps
Support for complex joins, nested queries & window functions
Support for data partitioned in S3 by any key
Date, Time and any other custom keys
e.g., Year, Month, Day, Hour
Is Amazon Redshift Spectrum useful if I don’t have an exabyte?
Your data will get bigger
On average, data warehousing volumes grow 10x every 5 years
The average Amazon Redshift customer doubles data each year
Amazon Redshift Spectrum makes data analysis simpler
Access your data without ETL pipelines
Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
Amazon Redshift Spectrum improves availability and concurrency
Run multiple Amazon Redshift clusters against common data
Isolate jobs with tight SLAs from ad hoc analysis
Defining External Schema and Creating Tables
Define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore:
CREATE EXTERNAL SCHEMA <schema_name>
Query external tables using <schema_name>.<table_name>
Register external tables using Athena, your Hive Metastore client, or Amazon Redshift CREATE EXTERNAL TABLE syntax:
CREATE EXTERNAL TABLE <table_name>
[PARTITIONED BY <column_name, data_type, …>]
STORED AS file_format
LOCATION s3_location
[TABLE PROPERTIES property_name=property_value, …];
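Because Spectrum can share the Athena data catalog, one hedged way to register an external table programmatically is to run the DDL through the Athena API (database, table, and locations are hypothetical):

const AWS = require('aws-sdk');
const athena = new AWS.Athena({ region: 'us-east-1' });

// Register an external table in the shared data catalog via Athena DDL;
// a Redshift external schema on the same catalog can then query it
athena.startQueryExecution({
  QueryString: `
    CREATE EXTERNAL TABLE IF NOT EXISTS spectrumdb.orders (
      order_id string,
      amount double
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://mydatalake/orders/'`,
  ResultConfiguration: { OutputLocation: 's3://mydatalake/athena-output/' }
}, (err, data) => {
  if (err) console.error(err);
  else console.log('DDL submitted:', data.QueryExecutionId);  // DDL operations are free
});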
Amazon Redshift Spectrum – Current support
File formats
• Parquet
• CSV
• Sequence
• RCFile
• ORC (coming soon)
• RegExSerDe (coming soon)
Compression
• Gzip
• Snappy
• Lzo (coming soon)
• Bz2
Encryption
• SSE with AES256
• SSE KMS with default
key
Column types
• Numeric: bigint, int, smallint, float, double
and decimal
• Char/varchar/string
• Timestamp
• Boolean
• DATE type can be used only as a
partitioning key
Table type
• Non-partitioned table
(s3://mybucket/orders/..)
• Partitioned table
(s3://mybucket/orders/date=YYYY-MM-
DD/..)
Converting to Parquet and ORC using Amazon EMR
You can use Hive CREATE TABLE AS SELECT to convert data
CREATE TABLE data_converted
STORED AS PARQUET
AS
SELECT col_1, col2, col3 FROM data_source
Or use Spark - 20 lines of Pyspark code, running on Amazon EMR
• 1TB of text data reduced to 130 GB in Parquet format with snappy compression
• Total cost of EMR job to do this: $5
https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
Life of a query (diagram: a client connects via JDBC/ODBC to the Amazon Redshift cluster; the Amazon Redshift Spectrum fleet sits between the cluster and Amazon S3 exabyte-scale object storage; table metadata lives in the Data Catalog / Apache Hive Metastore):
1. The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
2. The query plan is sent to all compute nodes.
3. Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
4. Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
5. Amazon Redshift Spectrum nodes scan your S3 data.
6. Amazon Redshift Spectrum projects, filters, joins and aggregates.
7. Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
8. The result is sent back to the client.
Running an analytic query over an exabyte in S3.
Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales?
Let's get the prior books she's written.
1 Table
2 Filters
SELECT
P.ASIN,
P.TITLE
FROM
products P
WHERE
P.TITLE LIKE '%POTTER%' AND
P.AUTHOR = 'J. K. Rowling'
Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values.
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
P.ASIN,
P.TITLE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
products P
WHERE
D.ASIN = P.ASIN AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
Let's build an analytic query - #3
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions.
3 Tables (1 S3, 2 local)
5 Filters
2 Joins
3 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
Let's build an analytic query - #4
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales?
Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA.
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
R.POSTAL_CODE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P,
regions R
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
D.REGION_ID = R.REGION_ID AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
R.COUNTRY_CODE = 'US' AND
R.CITY = 'Seattle' AND
R.STATE = 'WA' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
Now let's run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail records for each day over the past 20 years.
190 million files across 15,000 partitions in S3. One partition per day for USA and rest of world.
We need a billion-fold reduction in data processed; running this query using a 1000-node Hive cluster would take over 5 years.*
• Compression: 5X
• Columnar file format: 10X
• Scanning with 2500 nodes: 2500X
• Static partition elimination: 2X
• Dynamic partition elimination: 350X
• Redshift's query optimizer: 40X
Total reduction: 3.5B X (5 × 10 × 2500 × 2 × 350 × 40 = 3.5 × 10⁹)
* Estimated using a 20-node Hive cluster & 1.4 TB of data, assuming linear scaling
* Query used a 20-node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data; generated for this demo based on the data format used by Amazon Retail

More Related Content

What's hot

(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Amazon Web Services
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSAmazon Web Services
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSAmazon Web Services
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS Amazon Web Services
 
Optimizing Storage for Big Data Analytics Workloads
Optimizing Storage for Big Data Analytics WorkloadsOptimizing Storage for Big Data Analytics Workloads
Optimizing Storage for Big Data Analytics WorkloadsAmazon Web Services
 
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...Amazon Web Services
 
Full Stack Analytics on AWS - AWS Summit Cape Town 2017
Full Stack Analytics on AWS - AWS Summit Cape Town 2017 Full Stack Analytics on AWS - AWS Summit Cape Town 2017
Full Stack Analytics on AWS - AWS Summit Cape Town 2017 Amazon Web Services
 
re:Invent Round-up, Time Stream, Quantum and Managed Blockchain
re:Invent Round-up, Time Stream, Quantum and Managed Blockchain re:Invent Round-up, Time Stream, Quantum and Managed Blockchain
re:Invent Round-up, Time Stream, Quantum and Managed Blockchain Amazon Web Services
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
 
New Database Migration Services & RDS Updates
New Database Migration Services & RDS UpdatesNew Database Migration Services & RDS Updates
New Database Migration Services & RDS UpdatesAmazon Web Services
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - DatalakeLam Le
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftAmazon Web Services
 
Migrating On-Premises Databases to Cloud
Migrating On-Premises Databases to CloudMigrating On-Premises Databases to Cloud
Migrating On-Premises Databases to CloudAmazon Web Services
 
Building Serverless Web Applications - DevDay Los Angeles 2017
Building Serverless Web Applications - DevDay Los Angeles 2017Building Serverless Web Applications - DevDay Los Angeles 2017
Building Serverless Web Applications - DevDay Los Angeles 2017Amazon Web Services
 

What's hot (20)

(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301Building a Server-less Data Lake on AWS - Technical 301
Building a Server-less Data Lake on AWS - Technical 301
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
AWS Data Collection & Storage
AWS Data Collection & StorageAWS Data Collection & Storage
AWS Data Collection & Storage
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
Optimizing Storage for Big Data Analytics Workloads
Optimizing Storage for Big Data Analytics WorkloadsOptimizing Storage for Big Data Analytics Workloads
Optimizing Storage for Big Data Analytics Workloads
 
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
AWS March 2016 Webinar Series - Building Big Data Solutions with Amazon EMR a...
 
Full Stack Analytics on AWS - AWS Summit Cape Town 2017
Full Stack Analytics on AWS - AWS Summit Cape Town 2017 Full Stack Analytics on AWS - AWS Summit Cape Town 2017
Full Stack Analytics on AWS - AWS Summit Cape Town 2017
 
AWS Analytics
AWS AnalyticsAWS Analytics
AWS Analytics
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
re:Invent Round-up, Time Stream, Quantum and Managed Blockchain
re:Invent Round-up, Time Stream, Quantum and Managed Blockchain re:Invent Round-up, Time Stream, Quantum and Managed Blockchain
re:Invent Round-up, Time Stream, Quantum and Managed Blockchain
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
New Database Migration Services & RDS Updates
New Database Migration Services & RDS UpdatesNew Database Migration Services & RDS Updates
New Database Migration Services & RDS Updates
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
AWS Big Data Platform
AWS Big Data PlatformAWS Big Data Platform
AWS Big Data Platform
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Migrating On-Premises Databases to Cloud
Migrating On-Premises Databases to CloudMigrating On-Premises Databases to Cloud
Migrating On-Premises Databases to Cloud
 
Building Serverless Web Applications - DevDay Los Angeles 2017
Building Serverless Web Applications - DevDay Los Angeles 2017Building Serverless Web Applications - DevDay Los Angeles 2017
Building Serverless Web Applications - DevDay Los Angeles 2017
 

Similar to AWS Data Services to Accelerate Your Move to the Cloud

Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAmazon Web Services Korea
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudAmazon Web Services
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Amazon Web Services LATAM
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)Amazon Web Services
 
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...Amazon Web Services
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Amazon Web Services
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Rukmani Gopalan
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Amazon Web Services
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseData Con LA
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Amazon Web Services
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)Amazon Web Services Korea
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudAmazon Web Services
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAmazon Web Services
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Amazon Web Services
 

Similar to AWS Data Services to Accelerate Your Move to the Cloud (20)

Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics en AWS
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Work...
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
Understanding AWS Managed Database and Analytics Services | AWS Public Sector...
 
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
Sql Bits 2020 - Designing Performant and Scalable Data Lakes using Azure Data...
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018
 
Owning Your Own (Data) Lake House
Owning Your Own (Data) Lake HouseOwning Your Own (Data) Lake House
Owning Your Own (Data) Lake House
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A...
 
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
AWS CLOUD 2017 - Amazon Athena 및 Glue를 통한 빠른 데이터 질의 및 처리 기능 소개 (김상필 솔루션즈 아키텍트)
 
Database and Analytics on the AWS Cloud
Database and Analytics on the AWS CloudDatabase and Analytics on the AWS Cloud
Database and Analytics on the AWS Cloud
 
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWSAWS Summit Singapore - Architecting a Serverless Data Lake on AWS
AWS Summit Singapore - Architecting a Serverless Data Lake on AWS
 
Aws meetup 20190427
Aws meetup 20190427Aws meetup 20190427
Aws meetup 20190427
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

AWS Data Services to Accelerate Your Move to the Cloud

  • 2. AWS Data Services to Accelerate Your Move to the Cloud RDS Open Source RDS Commercial Aurora Migration for DB Freedom DynamoDB & DAX ElastiCache EMR Amazon Redshift Redshift Spectrum AthenaElasticsearch Service QuickSightGlue Databases to Elevate your Apps Relational Non-Relational & In-Memory Analytics to Engage your Data Inline Data Warehousing Reporting Data Lake Amazon AI to Drive the Future Lex Polly Rekognition Machine Learning Deep Learning, MXNet Database Migration Schema Conversion
  • 3.
  • 4. A Data Lake Is… • A foundation of highly durable data storage and streaming of any type of data • A metadata index and workflow which helps us categorise and govern data stored in the data lake • A search index and workflow which enables data discovery • A robust set of security controls – governance through technology, not policy • An API and user interface that expose these features to internal and external users
  • 5. The Emerging Analytics Architecture AthenaAmazon Athena Interactive Query AWS Glue ETL & Data Catalog Storage Serverless Compute Data Processing Amazon S3 Exabyte-scale Object Storage Amazon Kinesis Firehose Real-Time Data Streaming Amazon EMR Managed Hadoop Applications AWS Lambda Trigger-based Code Execution AWS Glue Data Catalog Hive-compatible Metastore Amazon Redshift Spectrum Fast @ Exabyte scale Amazon Redshift Petabyte-scale Data Warehousing
  • 6. Comparison of a Data Lake to an Enterprise Data Warehouse Complementary to EDW (not replacement) Data lake can be source for EDW Schema on read (no predefined schemas) Schema on write (predefined schemas) Structured/semi-structured/Unstructured data Structured data only Fast ingestion of new data/content Time consuming to introduce new content Data Science + Prediction/Advanced Analytics + BI use cases BI use cases only (no prediction/advanced analytics) Data at low level of detail/granularity Data at summary/aggregated level of detail Loosely defined SLAs Tight SLAs (production schedules) Flexibility in tools (open source/tools for advanced analytics) Limited flexibility in tools (SQL only)
  • 7. EMR S3 The New Problem Enterprise data warehouse ≠ Which system has my data? How can I do machine learning against the DW? I built this in Hive, can we get it into the Finance reports? These sources are giving different results… But I implemented the algorithm in Anaconda…
  • 8. Dive Into The Data Lake ≠ Enterprise data warehouseEMR S3
  • 9. Dive Into The Data Lake Enterprise data warehouseEMR S3 Load Cleansed Data Export Computed Aggregates Ingest any data Data cleansing Data catalogue Trend analysis Machine learning Structured analysis Common access tools Efficient aggregation Structured business rules
  • 10. Components of a Data Lake Data Storage • High durability • Stores raw data from input sources • Support for any type of data • Low cost Streaming • Streaming ingest of feed data • Provides the ability to consume any dataset as a stream • Facilitates low latency analytics Storage & Streams Catalogue & Search Entitlements API & UI
  • 11. Components of a Data Lake Storage & Streams Catalogue & Search Entitlements API & UI Catalogue • Metadata lake • Used for summary statistics and data Classification management Search • Simplified access model for data discovery
  • 12. Components of a Data Lake Storage & Streams Catalogue & Search Entitlements API & UI Entitlements system • Encryption • Authentication • Authorisation • Chargeback • Quotas • Data masking • Regional restrictions
  • 13. Components of a Data Lake: API & User Interface - exposes the data lake to customers; programmatically query the catalogue; expose a search API; ensures that entitlements are respected.
  • 14. STORAGE: high durability; stores raw data from input sources; support for any type of data; low cost.
  • 15. Amazon Simple Storage Service: highly scalable object storage for the Internet; objects from 1 byte to 5 TB in size; designed for 99.999999999% durability and 99.99% availability; regional service with no single points of failure; server-side encryption.
  • 16. Storage Lifecycle Integration: S3 Standard → S3 Infrequent Access → Amazon Glacier
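  That tiering can be automated with a bucket lifecycle policy. A minimal sketch using the AWS SDK for JavaScript follows; the bucket name, prefix and day thresholds are illustrative assumptions, not values from the talk.

    // Minimal sketch: transition data-lake objects to cheaper tiers as they age.
    // Bucket name, prefix and day thresholds are illustrative assumptions.
    const AWS = require('aws-sdk');
    const s3 = new AWS.S3({ region: 'us-east-1' });

    s3.putBucketLifecycleConfiguration({
      Bucket: 'mydatalake',
      LifecycleConfiguration: {
        Rules: [{
          ID: 'tier-raw-data',
          Filter: { Prefix: 'raw/' },
          Status: 'Enabled',
          Transitions: [
            { Days: 30, StorageClass: 'STANDARD_IA' }, // infrequent access after 30 days
            { Days: 90, StorageClass: 'GLACIER' }      // archive after 90 days
          ]
        }]
      }
    }, (err) => { if (err) console.error(err); });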
  • 17. Data Storage Format • Not all data formats are created equally • Unstructured vs. semi-structured vs. structured • Store a copy of raw input • Data standardisation as a workflow following ingest • Use a format that supports your data, rather than force your data into a format • Consider how data will change over time • Apply common compression
  • 18. Consider Different Types of Data: Unstructured - store the native file format (logs, dump files, whatever); compress with a streaming codec (LZO, Snappy). Semi-structured (JSON, XML files, etc.) - consider the evolution ability of the data schema (Avro); store the schema for the data as a file attribute (metadata/tag). Structured - lots of data is CSV! Use columnar storage (ORC, Parquet).
  • 19. Where to Store Data: Amazon S3 storage uses a flat keyspace; separate data storage by business unit, application, type, and time; natural data partitioning is very useful; paths should be self-documenting and intuitive; changing the prefix structure later is hard/costly.
  • 20. Metadata Services (Resource Oriented Architecture): CRUD API, Query API and Analytics API over systems of reference. Return URLs: URLs as deep links to applications, file exchanges via S3 (RESTful file services), or manifests for big data analytics / HPC. Integration layer: system-to-system via Amazon SNS/Amazon SQS; system-to-user via mobile push; Amazon Simple Workflow for high-level system integration/orchestration. Example key: s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied} (see http://en.wikipedia.org/wiki/Resource-oriented_architecture)
  • 21. STREAMING: streaming ingest of feed data; provides the ability to consume any dataset as a stream; facilitates low-latency analytics.
  • 22. Why Do Streams Matter? • Latency between event & action • Most BI systems target event to action latency of 1 hour • Streaming analytics would expect event to action latency < 2 seconds • Stream orientation simplifies architecture, but can increase operational complexity • Increase in complexity needs to be justified by business value of reduced latency
  • 23. Amazon Kinesis: managed service for real-time big data processing; create streams to produce & consume data; elastically add and remove shards for performance; use the Amazon Kinesis Client Library to process data; integration with S3, Amazon Redshift, and DynamoDB.
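  Producing to a stream is a single API call. A minimal sketch with the AWS SDK for JavaScript; the stream name and payload are illustrative assumptions.

    // Minimal sketch: write one event to a Kinesis stream.
    // Stream name and payload are illustrative assumptions.
    const AWS = require('aws-sdk');
    const kinesis = new AWS.Kinesis({ region: 'us-east-1' });

    kinesis.putRecord({
      StreamName: 'datalake-ingest',
      Data: JSON.stringify({ event: 'page_view', ts: Date.now() }),
      PartitionKey: 'user-42' // records with the same key land on the same shard
    }, (err, data) => {
      if (err) console.error(err);
      else console.log('Sequence number:', data.SequenceNumber);
    });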
  • 25. Streaming Storage Integration: analytics applications read & write file data against the object store (Amazon S3) and read & write to streams in the streaming store (Amazon Kinesis); streams can be archived to S3, and history can be replayed back into a stream.
  • 26. CATALOGUE & SEARCH: metadata lake; used for summary statistics and data classification management; simplified model for data discovery & governance.
  • 27. Building a Data Catalogue: aggregated information about your storage & streaming layer; a storage service for metadata (ownership, data lineage); a data abstraction layer (customer data = a collection of prefixes); enables data discovery; an API for use by the entitlements service.
  • 28. Data Catalogue – Metadata Index: stores data about your Amazon S3 storage environment - total size & count of objects by prefix, data classification, refresh schedule, object version information. Amazon S3 events are processed by a Lambda function, and DynamoDB metadata tables store the required attributes; a sketch of that flow follows below.
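  A minimal sketch of the event path, assuming a Lambda function subscribed to the bucket's object-created events and a DynamoDB table named metadata_index (the table name and attribute layout are illustrative):

    // Minimal sketch: index S3 object metadata into DynamoDB on each object-created event.
    // Table name and attribute layout are illustrative assumptions.
    const AWS = require('aws-sdk');
    const ddb = new AWS.DynamoDB.DocumentClient();

    exports.handler = async (event) => {
      for (const record of event.Records) {
        const { bucket, object } = record.s3; // note: object.key arrives URL-encoded
        await ddb.put({
          TableName: 'metadata_index',
          Item: {
            prefix: object.key.split('/').slice(0, -1).join('/'), // partition key: the S3 prefix
            key: object.key,
            bucket: bucket.name,
            sizeBytes: object.size,
            eventTime: record.eventTime,
            versionId: object.versionId || 'null'
          }
        }).promise();
      }
    };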
  • 30. Data Lake != Open Access
  • 31. Identity & Access Management: manage users, groups, and roles; identity federation with OpenID; temporary credentials with the AWS Security Token Service (STS); stored policy templates; powerful policy language; Amazon S3 bucket policies.
  • 32. IAM Policy Language: JSON documents; can include variables which extract information from the request context:
    • aws:CurrentTime - for date/time conditions
    • aws:EpochTime - the date in epoch or UNIX time, for use with date/time conditions
    • aws:TokenIssueTime - the date/time that temporary security credentials were issued, for use with date/time conditions
    • aws:principaltype - a value that indicates whether the principal is an account, user, federated, or assumed role
    • aws:SecureTransport - boolean representing whether the request was sent using SSL
    • aws:SourceIp - the requester's IP address, for use with IP address conditions
    • aws:UserAgent - information about the requester's client application, for use with string conditions
    • aws:userid - the unique ID for the current user
    • aws:username - the friendly name of the current user
  • 33. IAM Policy Language Example: allow a user to access a private part of the data lake

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Action": ["s3:ListBucket"],
          "Effect": "Allow",
          "Resource": ["arn:aws:s3:::mydatalake"],
          "Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}
        },
        {
          "Action": ["s3:GetObject", "s3:PutObject"],
          "Effect": "Allow",
          "Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]
        }
      ]
    }
  • 34. IAM Federation: IAM allows federation with Active Directory and with OpenID providers (Amazon, Facebook, Google); AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS (over Direct Connect or a hardware VPN). A sketch of issuing temporary credentials follows below.
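  Temporary credentials are what make federation practical. A minimal sketch, assuming a pre-created IAM role; the role ARN, session name and duration are illustrative assumptions.

    // Minimal sketch: exchange a trusted identity for short-lived data-lake credentials.
    // Role ARN, session name and duration are illustrative assumptions.
    const AWS = require('aws-sdk');
    const sts = new AWS.STS();

    sts.assumeRole({
      RoleArn: 'arn:aws:iam::123456789012:role/DataLakeReader',
      RoleSessionName: 'jane.doe', // shows up in CloudTrail for auditing
      DurationSeconds: 3600        // credentials expire after one hour
    }, (err, data) => {
      if (err) return console.error(err);
      const { AccessKeyId, SecretAccessKey, SessionToken } = data.Credentials;
      // Hand these scoped, expiring credentials to the client application.
      console.log('Credentials valid until', data.Credentials.Expiration);
    });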
  • 35. Data Encryption: AWS CloudHSM - dedicated-tenancy SafeNet Luna SA HSM device; Common Criteria EAL4+, NIST FIPS 140-2. AWS Key Management Service - automated key rotation & auditing; integration with other AWS services. AWS server-side encryption - AWS-managed key infrastructure.
  • 36. Entitlements – Access to Encryption Keys: a customer master key protects per-object customer data keys. The S3 object (Name: MyData) carries its ciphertext data key; a caller presents an IAM temporary credential from the Security Token Service, KMS decrypts the ciphertext key, and the resulting plaintext key decrypts MyData.
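  A minimal sketch of that envelope pattern with the AWS SDK for JavaScript; the KMS key alias is an illustrative assumption.

    // Minimal sketch: envelope encryption with a KMS customer master key.
    // The key alias is an illustrative assumption.
    const AWS = require('aws-sdk');
    const kms = new AWS.KMS({ region: 'us-east-1' });

    // 1. Ask KMS for a data key: we get the plaintext key and a ciphertext copy.
    kms.generateDataKey({
      KeyId: 'alias/datalake-master',
      KeySpec: 'AES_256'
    }, (err, dataKey) => {
      if (err) return console.error(err);
      // 2. Encrypt the object locally with dataKey.Plaintext (e.g. via node:crypto),
      //    then store dataKey.CiphertextBlob next to the object in S3.
      // 3. To read: call kms.decrypt({ CiphertextBlob }) with valid credentials;
      //    KMS returns the plaintext key only if the caller is entitled to the master key.
    });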
  • 37. Secure Data Flow: users obtain temporary credentials from a token vending machine (TVM on Elastic Beanstalk) backed by IAM and the Security Token Service, then use them through the API Gateway to query the metadata index in DynamoDB and fetch encrypted data from s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID}.
  • 38. Amazon Athena is an interactive query service that makes it easy to analyze data directly from Amazon S3 using Standard SQL
  • 39.
  • 40. A Serverless Data Lake Service • Decouple storage from compute • Serverless - no infrastructure or resources to manage • Pay only for data scanned • Schema on read - same data, many views • Encrypted • Standards-compliant and open storage formats • Built on powerful, community-supported OSS solutions
  • 41. Simple Pricing • DDL operations – FREE • SQL operations – FREE • Query concurrency – FREE • Data scanned - $5 / TB • Standard S3 rates for storage, requests, and data transfer apply
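  As a worked example at that rate: a query that scans 200 GB costs 200/1,024 × $5 ≈ $0.98; if partitioning and a columnar format cut the scan to 20 GB, the same query costs roughly $0.10.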
  • 43. Familiar Technologies Under the Covers: Presto - used for SQL queries; in-memory distributed query engine; ANSI-SQL compatible with extensions. Hive - used for DDL functionality; complex data types; multitude of formats; supports data partitioning.
  • 44. Hive Metadata Definition • Hive Data Definition Language • Hive compatible SerDe (serializer/deserializer) • CSV, JSON, RegEx, Parquet, Avro, ORC, CloudTrail • Coming soon • Data Manipulation Language (INSERT, UPDATE) • Create Table As • User Defined Functions
  • 45. Presto SQL • ANSI SQL compliant • Complex joins, nested queries & window functions • Complex data types (arrays, structs, maps) • Partitioning of data by any key • date, time, custom keys • Presto built-in functions
  • 46. Amazon Analytics End-to-End Architecture (diagram): S3, Data Catalog, Athena, EMR, Redshift Spectrum, Amazon ML / MXNet, RDS, QuickSight, Kinesis, Database Migration Service, Glue, IAM, other sources.
  • 47. Athena in Action
  • 48. Creating Tables – Parquet

    CREATE EXTERNAL TABLE db_name.taxi_rides_parquet (
      vendorid STRING,
      pickup_datetime TIMESTAMP,
      dropoff_datetime TIMESTAMP,
      ratecode INT,
      passenger_count INT,
      trip_distance DOUBLE,
      fare_amount DOUBLE,
      total_amount DOUBLE,
      payment_type INT
    )
    PARTITIONED BY (year INT, month INT, type STRING)
    STORED AS PARQUET
    LOCATION 's3://serverless-analytics/canonical/NY-Pub'
    TBLPROPERTIES ('has_encrypted_data'='true');
  • 49. Creating Tables – Nested JSON

    CREATE EXTERNAL TABLE IF NOT EXISTS fix_messages (
      `bodyLength` int,
      `defaultAppVerID` string,
      `encryptMethod` int,
      `msgSeqNum` int,
      `msgType` string,
      `resetSeqNumFlag` string,
      `securityRequestID` string,
      `securityRequestResult` int,
      `securityXML` struct<version:int,
                           header:struct<assetClass:string,
                                         tierLevelISIN:int,
                                         useCase:string>>
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    WITH SERDEPROPERTIES ('serialization.format' = '1')
    LOCATION 's3://my_bucket/fix/'
    TBLPROPERTIES ('has_encrypted_data'='false');
  • 50. CSV SerDe: LazySimpleSerDe does not support removing quote characters from fields, but it does allow different primitive types:

    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    WITH SERDEPROPERTIES (
      'serialization.format' = ',',
      'field.delim' = ',',
      'colelction.delim' = '|',
      'mapkey.delim' = ':',
      'escape.delim' = '\\'
    )

    ('colelction.delim' is Hive's actual, historically misspelled property name.) OpenCSVSerde handles quoted fields, but all fields must be defined as type STRING:

    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
      "separatorChar" = ",",
      "quoteChar" = "`",
      "escapeChar" = "\\"
    )
  • 51. Data Partitioning – Benefits • Separates data files by any column • Read only files the query needs • Reduce amount of data scanned • Reduce query completion time • Reduce query cost
  • 52. Data Partitioning – S3: prefer Hive-compatible partition naming, [column_name=column_value], e.g. s3://athena-examples/logs/year=2017/month=5/; simple partition naming is also supported, e.g. s3://athena-examples/logs/2017/5/
  • 53. Data Partitioning – Data Catalog

    ALTER TABLE app_logs ADD PARTITION (year='2015', month='01', day='01')
      LOCATION 's3://athena-examples/app/plaintext/year=2015/month=01/day=01/';

    ALTER TABLE elb_logs ADD PARTITION (year='2015', month='01', day='01')
      LOCATION 's3://athena-examples/elb/plaintext/2015/01/01/';

    ALTER TABLE orders DROP PARTITION (dt='2014-05-14', country='IN'),
      PARTITION (dt='2014-05-15', country='IN');

    ALTER TABLE customers PARTITION (zip='98040', state='WA')
      SET LOCATION 's3://athena-examples/new_customers/zip=98040/state=WA';

    MSCK REPAIR TABLE table_name;  -- only works with Hive-compatible partitions
  • 54. File Formats: Columnar (Parquet & ORC) - compressed; column-based, read-optimized; integrated indexes and stats; not ideal for appending new data. Row (Avro) - compressed; row-based, read-optimized; integrated indexes and stats; ideal for appending new data. Text (xSV, JSON) - may or may not be compressed; not optimized; generic and malleable.
  • 55. File Format – Examples:
    SELECT count(*) as count FROM taxi_rides_csv → run time 20.06 seconds, data scanned 207.54 GB (1,310,911,060 rows)
    SELECT count(*) as count FROM taxi_rides_parquet → run time 5.76 seconds, data scanned 0 KB (2,870,781,820 rows)
    SELECT * FROM taxi_rides_csv LIMIT 1000 → run time 3.13 seconds, data scanned 328.82 MB
    SELECT * FROM taxi_rides_parquet LIMIT 1000 → run time 1.13 seconds, data scanned 5.2 MB
  • 56. File Formats – Considerations:
    • Scanning - xSV and JSON require scanning the entire file; columnar is ideal when selecting only a subset of columns; row is ideal when selecting all columns of a subset of rows
    • Read performance - text: SLOW; Avro: optimal (specific to use case); Parquet & ORC: optimal (specific to use case)
    • Write performance - text: SLOW; Avro: good; Parquet & ORC: good (some overhead with large datasets)
    • Garbage collection overhead - text-based: LOW; Avro: LOW*; ORC: LOW → MEDIUM*; Parquet: MEDIUM → HIGH*
    (* highly dependent on the dataset)
  • 57. Athena API: asynchronous interaction model - initiate a query, get a query ID, retrieve results; named queries - save queries and reuse; paginated result set - max page size currently 1,000; column data and metadata - name, type, precision, nullable; query status - state, start and end times; query statistics - data scanned and execution time.
  • 58. Athena API • BatchGetNamedQuery • BatchGetQueryExecution • CreateNamedQuery • DeleteNamedQuery • GetNamedQuery • ListNamedQueries • GetQueryExecution • ListQueryExecutions • StartQueryExecution • StopQueryExecution • GetQueryResults
  • 59. Athena API

    // Client setup (region is an illustrative assumption)
    const AWS = require('aws-sdk');
    const client = new AWS.Athena({ region: 'us-east-1' });

    // StartQueryExecution (EncryptionConfiguration nests inside ResultConfiguration)
    client.startQueryExecution({
      QueryString: 'SELECT * FROM table_name LIMIT 100',
      QueryExecutionContext: { Database: 'default_db' },
      ResultConfiguration: {
        OutputLocation: 's3://bucket/output/',
        EncryptionConfiguration: { EncryptionOption: 'SSE_S3' }
      }
    }, (err, result) => {})

    // GetQueryResults
    client.getQueryResults({
      QueryExecutionId: '2ef5d590-025a-48ec-895e-6bedfe72bc95',
      MaxResults: 1000,
      NextToken: null
    }, (err, data) => {})
  • 60. Athena API

    // BatchGetQueryExecution
    client.batchGetQueryExecution({
      QueryExecutionIds: ['2ef5d590-025a-48ec-895e-6bedfe72bc95']
    }, (err, data) => {})

    // ListQueryExecutions
    client.listQueryExecutions({ MaxResults: 50 }, (err, data) => {})
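  Because the model is asynchronous, callers typically poll for completion before fetching results. A minimal sketch using promises; the query string, database and output location are illustrative assumptions.

    // Minimal sketch: start a query, poll until it finishes, then fetch results.
    // Query, database and output location are illustrative assumptions.
    const AWS = require('aws-sdk');
    const athena = new AWS.Athena({ region: 'us-east-1' });

    async function runQuery(sql) {
      const { QueryExecutionId } = await athena.startQueryExecution({
        QueryString: sql,
        QueryExecutionContext: { Database: 'default_db' },
        ResultConfiguration: { OutputLocation: 's3://bucket/output/' }
      }).promise();

      // Poll GetQueryExecution until the state is terminal.
      for (;;) {
        const { QueryExecution } = await athena.getQueryExecution({ QueryExecutionId }).promise();
        const state = QueryExecution.Status.State;
        if (state === 'SUCCEEDED') break;
        if (state === 'FAILED' || state === 'CANCELLED') throw new Error(state);
        await new Promise(r => setTimeout(r, 1000)); // back off for a second
      }

      return athena.getQueryResults({ QueryExecutionId, MaxResults: 1000 }).promise();
    }

    runQuery('SELECT * FROM table_name LIMIT 100')
      .then(r => console.log(r.ResultSet.Rows.length, 'rows'));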
  • 61. JDBC • Great for integrating with existing data access tools • Tableau, Looker, SQL Workbench • Utilizes the API under the hood • Only JDBC v1.1.0+ is compatible with the public API • Simple to use and integrate jdbc:awsathena://athena.REGION.amazonaws.com:443/hive/DB_NAME • Requires accessKey and secretKey • When integrated via code, can use custom credentials provider with temporary credentials
  • 62. JDBC - Considerations • Requires Java – If you don’t need Java, use the API • Slower – Overhead impacts performance • Returns only 1000 records • No access to other APIs – Only SQL and DDL statements • Requires IAM user credentials – When used by 3rd party apps
  • 63. The tyranny of "OR": Amazon EMR - directly access data in S3; scale out to thousands of nodes; open data formats; popular big data frameworks; anything you can dream up and code. Amazon Redshift - super-fast local disk performance; sophisticated query optimization; join-optimized data formats; query using standard SQL; optimized for data warehousing.
  • 64. Amazon Redshift Spectrum: run SQL queries directly against data in S3 using thousands of nodes; fast @ exabyte scale; elastic & highly available; on-demand, pay-per-query; high concurrency - multiple clusters access the same data; no ETL - query data in place using open file formats; full Amazon Redshift SQL support.
  • 65. Amazon Redshift Spectrum is fast: leverages Amazon Redshift's advanced cost-based optimizer; pushes down projections, filters, aggregations and join reduction; dynamic partition pruning to minimize data processed; automatic parallelization of query execution against S3 data; efficient join processing within the Amazon Redshift cluster.
  • 66. Amazon Redshift Spectrum is cost-effective: you pay for your Amazon Redshift cluster plus $5 per TB scanned from S3; each query can leverage thousands of Amazon Redshift Spectrum nodes; you can reduce the TB scanned and improve query performance by partitioning data, using a columnar file format, and compressing data.
  • 67. Amazon Redshift Spectrum is secure: End-to-end data encryption - encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE. Virtual private cloud - the Amazon Redshift leader node runs in your VPC, compute nodes in a private VPC, and Spectrum nodes in a private VPC storing no state. Alerts & notifications - event-specific notifications via email, text message, or call with Amazon SNS. Audit logging - all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift. Certifications & compliance - PCI DSS, FedRAMP, SOC 1/2/3, HIPAA/BAA.
  • 68. Amazon Redshift Spectrum uses standard SQL: Redshift Spectrum seamlessly integrates with your existing SQL & BI apps; support for complex joins, nested queries & window functions; support for data partitioned in S3 by any key - date, time and other custom keys, e.g. year, month, day, hour.
  • 69. Is Amazon Redshift Spectrum useful if I don't have an exabyte? Your data will get bigger: on average, data warehousing volumes grow 10x every 5 years, and the average Amazon Redshift customer doubles data each year. Amazon Redshift Spectrum makes data analysis simpler: access your data without ETL pipelines, and teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake. Amazon Redshift Spectrum improves availability and concurrency: run multiple Amazon Redshift clusters against common data, and isolate jobs with tight SLAs from ad hoc analysis.
  • 70. Defining External Schema and Creating Tables: define an external schema in Amazon Redshift using the Amazon Athena data catalog or your own Apache Hive Metastore (CREATE EXTERNAL SCHEMA <schema_name>); query external tables as <schema_name>.<table_name>; register external tables using Athena, your Hive Metastore client, or from Amazon Redshift with the CREATE EXTERNAL TABLE syntax:

    CREATE EXTERNAL TABLE <table_name>
      [PARTITIONED BY (<column_name> <data_type>, …)]
      STORED AS <file_format>
      LOCATION '<s3_location>'
      [TABLE PROPERTIES ('<property_name>'='<property_value>', …)];
  • 71. Amazon Redshift Spectrum – current support:
    • File formats - Parquet, CSV, Sequence, RCFile, ORC (coming soon), RegexSerDe (coming soon)
    • Compression - gzip, Snappy, LZO (coming soon), bz2
    • Encryption - SSE with AES256; SSE-KMS with the default key
    • Column types - numeric (bigint, int, smallint, float, double and decimal); char/varchar/string; timestamp; boolean; the DATE type can be used only as a partitioning key
    • Table types - non-partitioned table (s3://mybucket/orders/..) and partitioned table (s3://mybucket/orders/date=YYYY-MM-DD/..)
  • 72. Converting to Parquet and ORC using Amazon EMR: you can use Hive CREATE TABLE AS SELECT to convert data:

    CREATE TABLE data_converted
    STORED AS PARQUET
    AS SELECT col_1, col2, col3 FROM data_source;

    Or use Spark - about 20 lines of PySpark code running on Amazon EMR: 1 TB of text data reduced to 130 GB in Parquet format with Snappy compression; total cost of the EMR job: $5. See https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
  • 73. Life of a query, step 1: the query arrives at the Amazon Redshift cluster over JDBC/ODBC, e.g. SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …
  • 74. Step 2: the query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
  • 75. Step 3: the query plan is sent to all compute nodes.
  • 76. Step 4: compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
  • 77. Step 5: each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
  • 78. Step 6: Amazon Redshift Spectrum nodes scan your S3 data.
  • 79. Step 7: Amazon Redshift Spectrum projects, filters, joins and aggregates.
  • 80. Step 8: final aggregations and joins with local Amazon Redshift tables are done in-cluster.
  • 81. Step 9: the result is sent back to the client.
  • 82. Running an analytic query over an exabyte in S3
  • 83. Let's build an analytic query - #1: An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's get the prior books she's written. (1 table, 2 filters)

    SELECT P.ASIN, P.TITLE
    FROM products P
    WHERE P.TITLE LIKE '%POTTER%'
      AND P.AUTHOR = 'J. K. Rowling'
  • 84. Let's build an analytic query - #2: Let's compute the sales of the prior books she's written in this series and return the top 20 values. (2 tables: 1 S3, 1 local; 2 filters; 1 join; 2 GROUP BY columns; 1 ORDER BY; 1 LIMIT; 1 aggregation)

    SELECT P.ASIN, P.TITLE,
           SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
    FROM s3.d_customer_order_item_details D, products P
    WHERE D.ASIN = P.ASIN
      AND P.TITLE LIKE '%Potter%'
      AND P.AUTHOR = 'J. K. Rowling'
    GROUP BY P.ASIN, P.TITLE
    ORDER BY SALES_sum DESC
    LIMIT 20;
  • 85. Let's build an analytic query - #3: Now restrict it to the first three days of sales of first editions. (3 tables: 1 S3, 2 local; 5 filters; 2 joins; 3 GROUP BY columns; 1 ORDER BY; 1 LIMIT; 1 aggregation; 1 function; 2 casts)

    SELECT P.ASIN, P.TITLE, P.RELEASE_DATE,
           SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
    FROM s3.d_customer_order_item_details D, asin_attributes A, products P
    WHERE D.ASIN = P.ASIN
      AND P.ASIN = A.ASIN
      AND A.EDITION LIKE '%FIRST%'
      AND P.TITLE LIKE '%Potter%'
      AND P.AUTHOR = 'J. K. Rowling'
      AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE
      AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
    GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
    ORDER BY SALES_sum DESC
    LIMIT 20;
  • 86. Let's build an analytic query - #4: Finally, restrict it to the city of Seattle, WA, USA. (4 tables: 1 S3, 3 local; 8 filters; 3 joins; 4 GROUP BY columns; 1 ORDER BY; 1 LIMIT; 1 aggregation; 1 function; 2 casts)

    SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE,
           SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
    FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R
    WHERE D.ASIN = P.ASIN
      AND P.ASIN = A.ASIN
      AND D.REGION_ID = R.REGION_ID
      AND A.EDITION LIKE '%FIRST%'
      AND P.TITLE LIKE '%Potter%'
      AND P.AUTHOR = 'J. K. Rowling'
      AND R.COUNTRY_CODE = 'US'
      AND R.CITY = 'Seattle'
      AND R.STATE = 'WA'
      AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE
      AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
    GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
    ORDER BY SALES_sum DESC
    LIMIT 20;
  • 87. Now let's run that query over an exabyte of data in S3: roughly 140 TB of customer item order detail records for each day over the past 20 years; 190 million files across 15,000 partitions in S3; one partition per day for the USA and the rest of the world. We need a billion-fold reduction in data processed; running this query using a 1,000-node Hive cluster would take over 5 years.*
    • Compression: 5X
    • Columnar file format: 10X
    • Scanning with 2,500 nodes: 2500X
    • Static partition elimination: 2X
    • Dynamic partition elimination: 350X
    • Redshift's query optimizer: 40X
    Total reduction: 3.5 billion X
    * Estimated using a 20-node Hive cluster & 1.4 TB of data, assuming linear scaling. The query used a 20-node DC1.8XLarge Amazon Redshift cluster. Not actual sales data: generated for this demo based on the data format used by Amazon Retail.