This document surveys AWS services for analyzing and querying large datasets: AWS Glue for ETL, Amazon Athena for interactive SQL queries over data in S3, and Amazon Redshift Spectrum for extending Amazon Redshift queries to data in S3. It closes with a case study of NUVIAD, which moved from a traditional data warehouse to multiple AWS analytics engines over a single S3 data lake.
10. Amazon Athena
• Interactive query service over S3
• Uses ANSI SQL (everybody knows SQL)
• Serverless – don’t worry about setting up infrastructure, just start querying
• No ETL
• Get started instantly:
• Point to data in S3
• Define your schema - Schema on Read
• Start querying
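Schema-on-read means the table definition is applied when the data is read, not enforced when it is written. A minimal pure-Python sketch of the idea; the data, column names, and types below are invented for illustration:

```python
import csv
import io

# Raw data sits in storage (a string here, standing in for an S3 object);
# no schema was applied when it was written.
raw_csv = "2023-01-05,user_42,3.50\n2023-01-06,user_99,7.25\n"

# Schema-on-read: column names and types are declared by the reader,
# at query time, not at write time.
schema = [("event_date", str), ("user_id", str), ("amount", float)]

def read_with_schema(raw, schema):
    """Apply a schema to raw CSV rows as they are read."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, record)})
    return rows

rows = read_with_schema(raw_csv, schema)
print(sum(r["amount"] for r in rows))  # 10.75
```

Because the schema lives with the reader, two different queries can interpret the same S3 bytes with two different schemas, with no rewrite of the data.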
11. Amazon Athena: Pay Per Query
• Pay only for the queries you run – $5 per TB scanned
• Query directly from S3, no additional storage charges
• Works with standard data formats:
• CSV, JSON
• Parquet, ORC, Avro
• Improves performance and reduces cost:
• Compression
• Partitioning
• Columnar formats
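These savings compound. A small sketch of the billing model, using the $5-per-TB-scanned rate above; the partition, column, and compression ratios are illustrative assumptions, not Athena benchmarks:

```python
# Athena bills $5 per TB of data scanned. This sketch estimates the cost
# of one query and shows how partitioning, columnar formats, and
# compression shrink the bytes scanned. All ratios are assumed examples.
PRICE_PER_TB = 5.00
TB = 10 ** 12

def query_cost(bytes_scanned):
    return PRICE_PER_TB * bytes_scanned / TB

dataset = 2 * TB                 # full dataset size on S3
partition_fraction = 1 / 30      # query touches 1 of 30 date partitions
column_fraction = 3 / 20         # columnar format reads 3 of 20 columns
compression_ratio = 1 / 3        # assumed ~3:1 compression

full_scan = query_cost(dataset)
optimized = dataset * partition_fraction * column_fraction * compression_ratio

print(f"full scan: ${full_scan:.2f}")
print(f"optimized: ${query_cost(optimized):.4f}")
```

Under these assumptions the same query drops from $10 to under two cents, which is why the three optimizations above matter so much at scale.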
15. The tyranny of “OR”
Amazon EMR:
• Directly access data in S3
• Scale out to thousands of nodes
• Open data formats
• Popular big data frameworks
• Anything you can dream up and code
Amazon Redshift:
• Super-fast local disk performance
• Sophisticated query optimization
• Join-optimized data formats
• Query using standard SQL
• Optimized for data warehousing
16. We want
Sophisticated query optimization and scale-out processing
Super-fast performance and support for open formats
The throughput of local disk and the scale of S3
17. We want all this
From one data processing engine
With my data accessible from all data processing engines
Now and in the future
18. Amazon Redshift Spectrum
Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against exabytes of data in Amazon S3.
With Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond data stored on local disks in your data warehouse to query vast amounts of unstructured data in your Amazon S3 “data lake”.
19. Life of a query (step 1)
A query is submitted to Amazon Redshift over JDBC/ODBC:
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …
[Diagram: Amazon Redshift cluster (nodes 1…N) reached over JDBC/ODBC; Amazon S3, exabyte-scale object storage; Data Catalog, an Apache Hive Metastore]
20. Life of a query (step 2)
The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
21. Life of a query (step 3)
The query plan is sent to all compute nodes.
22. Life of a query (step 4)
Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
23. Life of a query (step 5)
Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
24. Life of a query (step 6)
Amazon Redshift Spectrum nodes scan your S3 data.
25. Life of a query (step 7)
Amazon Redshift Spectrum projects, filters, and aggregates.
26. Life of a query (step 8)
Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
27. Life of a query (step 9)
The result is sent back to the client.
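The partition-pruning step in this sequence can be sketched in miniature. The table name, catalog layout, and S3 prefixes here are hypothetical stand-ins for real Hive Metastore entries:

```python
from datetime import date

# A hypothetical partition catalog: one entry per date partition of an
# external table, mapping partition values to S3 prefixes (the real
# metadata would live in the Hive Metastore / Data Catalog).
partitions = {
    date(2017, 6, d): f"s3://bucket/events/dt=2017-06-{d:02d}/"
    for d in range(1, 31)
}

def prune(partitions, lo, hi):
    """Keep only partitions whose key satisfies the query's date-range
    predicate; only these S3 prefixes get scanned."""
    return [prefix for key, prefix in sorted(partitions.items())
            if lo <= key <= hi]

# A query filtering on one week touches 7 of the 30 partitions.
to_scan = prune(partitions, date(2017, 6, 5), date(2017, 6, 11))
print(len(to_scan))  # 7
```

Pruning happens before any S3 request is issued, so the Spectrum layer never sees the 23 partitions the predicate excludes.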
31. Company Confidential
NUVIAD
NUVIAD is an online marketing service that makes data-driven, location-aware mobile marketing accessible, effective, and simple to use for agencies, networks, and businesses.
NUVIAD HyperDSP
NUVIAD Local
NUVIAD inView SSP
32. NUVIAD In Numbers
• Analyzing over 700k ad opportunities every second
• Over 2.5 Billion user profiles
• Over 500k app profiles from App Store and Google Play
• Local businesses mapped via multiple sources
• Thousands of customers, from small businesses to large networks
• Tens of thousands of campaigns
33. Data is Everything
Identifying the user’s intent is the holy grail of digital marketing: finding the right moment when the user is most receptive to my marketing message.
• Digital advertising was focused on the digital aspects of the user
• HyperLocal adds the physical context
34. The Challenge
• Streamlining high-scale data from hundreds of thousands of sources – finding the needles in the huge haystack
• Providing effective, data-driven tools to our customers and partners
• Effectively scaling up the platform
• It is easy to create a platform that delivers data right, or delivers it now. We need to provide data right, now. Speed, speed, and more speed.
36. Traditional Methodology
• Data Warehouse - Redshift (over 60 dc1.large servers in two clusters)
• Algorithmic processes – Redshift
• Reporting and Analytics – RDS (MySQL)
• Real-time reporting – memSQL
• Aerospike, MongoDB
• Result
• Fragmented data
• Multiple data sets
• Hard to scale up (storage vs. compute)
• Maintenance nightmare
37. NUVIAD Data Lake
One pond of data, multiple query engines
• S3 as the main storage for all data
• Formatting data in an open and common format
• Creating unified data streams to ingest data
• Separating compute from storage to allow manageable, ad-hoc scaling
• Using different query engines over the same data
• Using different data permutations to optimize data for queries
38. One Data Set. Many Query Engines.
• With data formatted correctly (Parquet) and stored in S3, we can use different query engines for different workloads:
• Amazon Athena – for quick and simple queries
• Amazon Redshift Spectrum – for complex algorithmic queries utilizing PostgreSQL
• Presto on EMR – for large reports on a scalable cluster
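Why the Parquet choice benefits every one of these engines can be shown with a toy columnar layout. This is pure-Python illustration; real Parquet also adds per-column encoding, compression, and statistics:

```python
# Toy row vs. columnar layout. A query that aggregates one field only
# needs to touch that field's column, not every field of every row --
# this is what makes scans cheap for all engines reading the file.
rows = [
    {"user_id": "u1", "country": "US", "amount": 3.5},
    {"user_id": "u2", "country": "DE", "amount": 7.0},
    {"user_id": "u3", "country": "US", "amount": 1.5},
]

# Columnar layout: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# SELECT SUM(amount) reads a single column.
total = sum(columns["amount"])
print(total)  # 12.0
```

Because the column layout is an open, engine-neutral property of the file, Athena, Redshift Spectrum, and Presto all get the same scan savings from the same S3 objects.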