2. About Us
• Leading BI & Big Data solution provider in Israel
• 230 employees
• Over 100 customers
• A Matrix IT subsidiary
• E2E & A2Z solutions
• Named a Tier 1 Big Data SI (2017)
• Leading partnerships across Big Data, Data Science, and BI & Analytics
6. Availability by decoupling – Shared Nothing architecture
[Diagram: compute nodes C1, C2, C3 attached to a shared SAN / NAS, contrasted with shared-nothing nodes that each pair their own compute and storage over the LAN; cost ($$$) grows as data volume goes from small to big.]
• Shared Nothing removes the dependency between the scaling units
• A shared file system resource (SAN / NAS) may eventually become a bottleneck
12. The three sub-systems of a data lake
1. Data Acquisition – Collection: Real Time, Incremental, Batch, One Time Dump
2. Data Management – Store, Process & Integrate
3. Data Access – Deliver & Use
All three sit on top of Data Lake Platform Services.
The three-subsystem approach allows:
1. Open, scalable architecture
2. Separation of duties
3. Fine-grained security & governance
Data flows from RAW Data, through Standardized Data, to Usage Specific data.
What about all of this raw data??
14. AWS Athena
What is Athena?
• Fully managed, interactive query service
• Allows running standard SQL queries against data stored in S3
• Fully serverless and automatically scalable
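As a quick sketch of the idea (the table, bucket, and column names here are hypothetical, not from the deck), an Athena query is just standard SQL over files that already sit in S3:

```sql
-- Hypothetical table defined over files in s3://example-bucket/logs/;
-- nothing to provision: Athena scans the files at query time.
SELECT status, count(*) AS hits
FROM weblogs
WHERE year = 2017
GROUP BY status
ORDER BY hits DESC;
```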
15. Athena under the hood
Athena is an interactive query service that makes it easy to analyze data directly from AWS S3 using standard SQL.
[Diagram: CLI and JDBC / ODBC clients submit SQL to a Presto Coordinator, which resolves table metadata against the Hive Metastore and fans work out to Presto Workers.]
Presto, a SQL-on-Hadoop solution, is a low-latency distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
Hive Metastore (aka HCatalog) is a metadata and table management system designed for Hadoop and used for table abstraction (schema on read). Hive DDL functionality allows working with partitions, complex data types (arrays etc.) and many data formats.
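A minimal sketch of the kind of Hive DDL the Metastore manages, showing partitions and complex types (the table, columns, and bucket are made up for illustration):

```sql
-- Schema-on-read: the DDL only describes files already in S3; nothing is loaded.
CREATE EXTERNAL TABLE page_views (
  user_id    STRING,
  visited_at TIMESTAMP,
  tags       ARRAY<STRING>,           -- complex type: array
  properties MAP<STRING, STRING>      -- complex type: map
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://example-bucket/page_views/';
```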
16. Presto Main Features
High Performance
• E.g. Netflix: runs 3,500+ Presto queries / day on a 25+ PB dataset in S3 with 350 active platform users
Extensibility
• Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, and more
• JDBC, ODBC for commercial BI tools or dashboards
• Client protocol: HTTP+JSON, supports various languages (Python, Ruby, PHP, Node.js, Java (JDBC), C#, …)
ANSI SQL
• Complex queries, joins, aggregations, various functions (window functions)
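To illustrate the ANSI SQL support, a window function over the deck's taxi-rides table (the `year` filter assumes the partition columns defined later in the deck):

```sql
-- Rank each vendor's rides by fare using an ANSI window function.
SELECT vendorid,
       fare_amount,
       rank() OVER (PARTITION BY vendorid ORDER BY fare_amount DESC) AS fare_rank
FROM taxi_rides_parquet
WHERE year = 2017;
```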
17. Things to remember
• $5 per TB scanned (rounded to the nearest MB, with a minimum of 10 MB per query)
• Supports multiple formats
  • JSON
  • CSV & TSV
  • ORC
  • Parquet & Avro
Data Partitioning + Data Format = Less $$$ per Query
19. Creating a Table
CREATE EXTERNAL TABLE db_name.taxi_rides_parquet (  -- 1
  vendorid STRING,
  pickup_datetime TIMESTAMP,
  dropoff_datetime TIMESTAMP,
  ratecode INT,
  passenger_count INT,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  total_amount DOUBLE,
  payment_type INT
)
PARTITIONED BY (year INT, month INT, type STRING)  -- 2
STORED AS PARQUET
LOCATION 's3://serverless-analytics/canonical/NY-Pub'  -- 3
TBLPROPERTIES ('has_encrypted_data'='true');
1 Tables are external
2 Define partitions
3 Location can only reference a folder
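One hedged way to produce such Parquet data in the first place is Athena's CTAS syntax (the source table name and target bucket below are assumptions for illustration):

```sql
-- Convert a CSV-backed table into partitioned Parquet with a single query.
CREATE TABLE db_name.taxi_rides_parquet_new
WITH (
  format = 'PARQUET',
  external_location = 's3://example-bucket/taxi-parquet/',  -- assumed bucket
  partitioned_by = ARRAY['year', 'month']
) AS
SELECT vendorid, pickup_datetime, dropoff_datetime, fare_amount,
       year, month             -- partition columns must come last
FROM db_name.taxi_rides_csv;   -- assumed CSV source table
```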
20. S3 Partitioning
By partitioning your data, you can:
• Separate data files by any column
• Read only the files the query needs
• Reduce the amount of data scanned
• Reduce query completion time
• Reduce query cost
Hive-compatible partition naming (best): [column_name=column_value]
% aws s3 ls s3://matrixbi/hive-prt/tables/visits/
PRE year=2017/month=08
PRE year=2017/month=09
PRE year=2017/month=10
PRE year=2017/month=11
PRE year=2017/month=12
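With Hive-compatible naming like the listing above, Athena can discover the partitions and prune them at query time. A sketch, assuming a `visits` table matching that layout (if `month` is stored as a string, compare against `'12'` instead):

```sql
-- Load all Hive-style partitions found under the table's S3 location.
MSCK REPAIR TABLE visits;

-- Only files under year=2017/month=12 are read; other partitions are pruned.
SELECT count(*)
FROM visits
WHERE year = 2017 AND month = 12;
```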
21. Advanced Compression by Encoding
Commonly used encoding types:
• Delta – sorted datasets: timestamp sequences, metrics
• Prefix – sorted string data: IP addresses
• Dictionary – small sets of values: codes (MAC etc.), product IDs
• Run Length – repetitive values
• Brute force – general-purpose compression (LZO, Snappy, GZIP)
23. File Format
SELECT count(*) AS count FROM taxi_rides_csv
Run time: 20.06 seconds, Data scanned: 207.54 GB – 1,310,911,060 rows
SELECT count(*) AS count FROM taxi_rides_parquet
Run time: 5.76 seconds, Data scanned: 0 KB – 2,870,781,820 rows
SELECT * FROM taxi_rides_csv LIMIT 1000
Run time: 3.13 seconds, Data scanned: 328.82 MB
SELECT * FROM taxi_rides_parquet LIMIT 1000
Run time: 1.13 seconds, Data scanned: 5.2 MB
Parquet columns are not accessed! The count is computed using metadata stored in Parquet file footers.
* S3 GET prices are not included!
Based on: Amazon Athena Deep Dive – June 2017
24. Redshift Spectrum
What is Redshift Spectrum?
• Not an integration between Redshift & the Athena query engine
• Allows running Redshift SQL queries against data stored in S3
• Fully serverless and automatically scalable
• Accessible from multiple Redshift clusters; allows joining S3 data with data from the RS cluster
33. Spectrum Requires an External Schema
CREATE EXTERNAL SCHEMA spectrum_schema FROM DATA CATALOG  -- 1
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'  -- 2
REGION 'us-east-2';

CREATE VIEW redshift_schema.joined_data AS
SELECT SE.col1, RE.col2
FROM spectrum_schema.historical_events SE
INNER JOIN redshift_schema.events RE
  ON SE.joincol1 = RE.joincol2
WITH NO SCHEMA BINDING;
1 Glue or Athena Catalog
2 IAM role – Spectrum runs outside of the VPC; the role must be attached to the cluster
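For a join like the one above to work, `historical_events` must exist as an external table in the catalog; a hedged sketch of its definition (the column types and bucket are assumptions):

```sql
-- Hypothetical definition of the external table queried through Spectrum.
CREATE EXTERNAL TABLE spectrum_schema.historical_events (
  col1     VARCHAR(64),
  joincol1 INT
)
STORED AS PARQUET
LOCATION 's3://example-bucket/historical-events/';  -- assumed bucket
```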
35. So what now?
Use Redshift Spectrum if:
• You are an existing Redshift customer and want to implement data tiering, maintaining your Redshift investment while allowing a single schema for historical data
Use AWS Athena if:
• You are starting out on your Data Lake journey
• You have offline analytical workloads at large scale