2. About Us
• Leading BI & Big Data solution provider in Israel
• 230 employees
• Over 100 customers
• A Matrix IT subsidiary
• E2E & A2Z solutions
• Named a Tier 1 Big Data SI (2017)
• Leading partnerships across Big Data, Data Science, and BI & Analytics
6. Availability by decoupling – Shared Nothing architecture
[Diagram: compute nodes C1, C2, C3 attached to a shared SAN / NAS, contrasted with shared-nothing nodes that each pair their own compute and storage over the LAN; cost ($$$) grows as data volume goes from small to big.]
• Shared Nothing removes the dependency between the scaling units
• A shared file system resource (SAN / NAS) may eventually become a bottleneck
12. The three sub-systems of a data lake
1. Data Acquisition – Collection: Real Time, Incremental, Batch, One Time Dump
2. Data Management – Store, Process & Integrate
3. Data Access – Deliver & Use
All three sit on top of Data Lake Platform Services.
The three-subsystem approach allows:
1. Open, scalable architecture
2. Separation of duties
3. Fine-grained security & governance
Data flows from RAW Data, through Standardized Data, to Usage Specific data.
What about all of this raw data??
14. AWS Athena
What is Athena?
• Fully managed, interactive query service
• Allows running standard SQL queries against data stored in S3
• Fully serverless and automatically scalable
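As a quick sketch of the idea (the table, bucket, and column names here are hypothetical, not from the deck), an Athena query is just standard SQL over files that already sit in S3:

```sql
-- Hypothetical table defined over files in s3://example-bucket/logs/;
-- nothing to provision: Athena scans the files at query time.
SELECT status, count(*) AS hits
FROM weblogs
WHERE year = 2017
GROUP BY status
ORDER BY hits DESC;
```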
15. Athena under the hood
Athena is an interactive query service that makes it easy to analyze data directly from AWS S3 using standard SQL.
[Diagram: CLI and JDBC / ODBC clients submit SQL to a Presto Coordinator, which resolves table metadata against the Hive Metastore and fans work out to Presto Workers.]
Presto, a SQL-on-Hadoop solution, is a low-latency distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
Hive Metastore (aka HCatalog) is a metadata and table management system designed for Hadoop and used for table abstraction (schema on read). Hive DDL functionality allows working with partitions, complex data types (arrays etc.) and many data formats.
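A minimal sketch of the kind of Hive DDL the Metastore manages, showing partitions and complex types (the table, columns, and bucket are made up for illustration):

```sql
-- Schema-on-read: the DDL only describes files already in S3; nothing is loaded.
CREATE EXTERNAL TABLE page_views (
  user_id    STRING,
  visited_at TIMESTAMP,
  tags       ARRAY<STRING>,           -- complex type: array
  properties MAP<STRING, STRING>      -- complex type: map
)
PARTITIONED BY (dt STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://example-bucket/page_views/';
```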
16. Presto Main Features
High Performance
• E.g. Netflix: runs 3,500+ Presto queries / day on a 25+ PB dataset in S3 with 350 active platform users
Extensibility
• Pluggable backends: Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, and more
• JDBC, ODBC for commercial BI tools or dashboards
• Client protocol: HTTP+JSON, supports various languages (Python, Ruby, PHP, Node.js, Java (JDBC), C#, …)
ANSI SQL
• Complex queries, joins, aggregations, various functions (window functions)
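To illustrate the ANSI SQL support, a window function over the deck's taxi-rides table (the `year` filter assumes the partition columns defined later in the deck):

```sql
-- Rank each vendor's rides by fare using an ANSI window function.
SELECT vendorid,
       fare_amount,
       rank() OVER (PARTITION BY vendorid ORDER BY fare_amount DESC) AS fare_rank
FROM taxi_rides_parquet
WHERE year = 2017;
```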
17. Things to remember
• $5 per TB scanned (rounded to the nearest MB, with a minimum of 10 MB per query)
• Supports multiple formats
  • JSON
  • CSV & TSV
  • ORC
  • Parquet & Avro
Data Partitioning + Data Format = Less $$$ per Query
19. Creating a Table
CREATE EXTERNAL TABLE db_name.taxi_rides_parquet (  -- 1
  vendorid STRING,
  pickup_datetime TIMESTAMP,
  dropoff_datetime TIMESTAMP,
  ratecode INT,
  passenger_count INT,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  total_amount DOUBLE,
  payment_type INT
)
PARTITIONED BY (year INT, month INT, type STRING)  -- 2
STORED AS PARQUET
LOCATION 's3://serverless-analytics/canonical/NY-Pub'  -- 3
TBLPROPERTIES ('has_encrypted_data'='true');
1 Tables are external
2 Define partitions
3 Location can only reference a folder
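One hedged way to produce such Parquet data in the first place is Athena's CTAS syntax (the source table name and target bucket below are assumptions for illustration):

```sql
-- Convert a CSV-backed table into partitioned Parquet with a single query.
CREATE TABLE db_name.taxi_rides_parquet_new
WITH (
  format = 'PARQUET',
  external_location = 's3://example-bucket/taxi-parquet/',  -- assumed bucket
  partitioned_by = ARRAY['year', 'month']
) AS
SELECT vendorid, pickup_datetime, dropoff_datetime, fare_amount,
       year, month             -- partition columns must come last
FROM db_name.taxi_rides_csv;   -- assumed CSV source table
```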
20. S3 Partitioning
By partitioning your data, you can:
• Separate data files by any column
• Read only the files the query needs
• Reduce the amount of data scanned
• Reduce query completion time
• Reduce query cost
Hive-compatible partition naming (best): [column_name=column_value]
% aws s3 ls s3://matrixbi/hive-prt/tables/visits/
PRE year=2017/month=08
PRE year=2017/month=09
PRE year=2017/month=10
PRE year=2017/month=11
PRE year=2017/month=12
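With Hive-compatible naming like the listing above, Athena can discover the partitions and prune them at query time. A sketch, assuming a `visits` table matching that layout (if `month` is stored as a string, compare against `'12'` instead):

```sql
-- Load all Hive-style partitions found under the table's S3 location.
MSCK REPAIR TABLE visits;

-- Only files under year=2017/month=12 are read; other partitions are pruned.
SELECT count(*)
FROM visits
WHERE year = 2017 AND month = 12;
```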
21. Advanced Compression by Encoding
Commonly used encoding types:
• Delta – sorted datasets: timestamp sequences, metrics
• Prefix – sorted string data: IP addresses
• Dictionary – small sets of values: codes (MAC etc.), product IDs
• Run Length – repetitive values
• Brute force – general-purpose compression (LZO, Snappy, GZIP)
23. File Format
SELECT count(*) AS count FROM taxi_rides_csv
Run time: 20.06 seconds, Data scanned: 207.54 GB – 1,310,911,060 rows
SELECT count(*) AS count FROM taxi_rides_parquet
Run time: 5.76 seconds, Data scanned: 0 KB – 2,870,781,820 rows
SELECT * FROM taxi_rides_csv LIMIT 1000
Run time: 3.13 seconds, Data scanned: 328.82 MB
SELECT * FROM taxi_rides_parquet LIMIT 1000
Run time: 1.13 seconds, Data scanned: 5.2 MB
Parquet columns are not accessed! The count is computed using metadata stored in Parquet file footers.
* S3 GET prices are not included!
Based on: Amazon Athena Deep Dive – June 2017
24. Redshift Spectrum
What is Redshift Spectrum?
• Not an integration between Redshift & the Athena query engine
• Allows running Redshift SQL queries against data stored in S3
• Fully serverless and automatically scalable
• Accessible from multiple Redshift clusters; allows joining S3 data with data from the RS cluster
33. Spectrum Requires an External Schema
CREATE EXTERNAL SCHEMA spectrum_schema FROM DATA CATALOG  -- 1
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'  -- 2
REGION 'us-east-2';

CREATE VIEW redshift_schema.joined_data AS
SELECT SE.col1, RE.col2
FROM spectrum_schema.historical_events SE
INNER JOIN redshift_schema.events RE
  ON SE.joincol1 = RE.joincol2
WITH NO SCHEMA BINDING;
1 Glue or Athena Catalog
2 IAM role – Spectrum runs outside of the VPC; the role must be attached to the cluster
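For a join like the one above to work, `historical_events` must exist as an external table in the catalog; a hedged sketch of its definition (the column types and bucket are assumptions):

```sql
-- Hypothetical definition of the external table queried through Spectrum.
CREATE EXTERNAL TABLE spectrum_schema.historical_events (
  col1     VARCHAR(64),
  joincol1 INT
)
STORED AS PARQUET
LOCATION 's3://example-bucket/historical-events/';  -- assumed bucket
```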
35. So what now?
Use Redshift Spectrum if:
• You are an existing Redshift customer and want to implement data tiering, maintaining your Redshift investment while allowing a single schema for historical data
Use AWS Athena if:
• You are starting out on your Data Lake journey
• You have offline analytical workloads at large scale