This document surveys AWS services for analyzing and querying large datasets: AWS Glue for ETL, Amazon Athena for interactive SQL queries over data in S3, and Amazon Redshift Spectrum for extending Amazon Redshift queries to data in S3. It closes with a case study of NUVIAD, which moved from a traditional data warehouse to multiple AWS analytics engines over a single S3 data lake.
10. Amazon Athena
• Interactive query service over S3
• Uses ANSI SQL (everybody knows SQL)
• Serverless – don’t worry about setting up infrastructure, just start querying
• No ETL
• Get started instantly:
• Point to data in S3
• Define your schema - Schema on Read
• Start querying
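Schema-on-read means the table definition is applied when the data is read, not enforced when it is written. A minimal pure-Python sketch of the idea; the data, column names, and types below are invented for illustration:

```python
import csv
import io

# Raw data sits in storage (a string here, standing in for an S3 object);
# no schema was applied when it was written.
raw_csv = "2023-01-05,user_42,3.50\n2023-01-06,user_99,7.25\n"

# Schema-on-read: column names and types are declared by the reader,
# at query time, not at write time.
schema = [("event_date", str), ("user_id", str), ("amount", float)]

def read_with_schema(raw, schema):
    """Apply a schema to raw CSV rows as they are read."""
    rows = []
    for record in csv.reader(io.StringIO(raw)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, record)})
    return rows

rows = read_with_schema(raw_csv, schema)
print(sum(r["amount"] for r in rows))  # 10.75
```

Because the schema lives with the reader, two different queries can interpret the same S3 bytes with two different schemas, with no rewrite of the data.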
11. Amazon Athena: Pay Per Query
• Pay only for the queries you run – $5 per TB scanned
• Query directly from S3, no additional storage charges
• Works with standard data formats:
• CSV, JSON
• Parquet, ORC, Avro
• Improves performance and reduces cost:
• Compression
• Partitioning
• Columnar formats
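These savings compound. A small sketch of the billing model, using the $5-per-TB-scanned rate above; the partition, column, and compression ratios are illustrative assumptions, not Athena benchmarks:

```python
# Athena bills $5 per TB of data scanned. This sketch estimates the cost
# of one query and shows how partitioning, columnar formats, and
# compression shrink the bytes scanned. All ratios are assumed examples.
PRICE_PER_TB = 5.00
TB = 10 ** 12

def query_cost(bytes_scanned):
    return PRICE_PER_TB * bytes_scanned / TB

dataset = 2 * TB                 # full dataset size on S3
partition_fraction = 1 / 30      # query touches 1 of 30 date partitions
column_fraction = 3 / 20         # columnar format reads 3 of 20 columns
compression_ratio = 1 / 3        # assumed ~3:1 compression

full_scan = query_cost(dataset)
optimized = dataset * partition_fraction * column_fraction * compression_ratio

print(f"full scan: ${full_scan:.2f}")
print(f"optimized: ${query_cost(optimized):.4f}")
```

Under these assumptions the same query drops from $10 to under two cents, which is why the three optimizations above matter so much at scale.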
15. The tyranny of “OR”
Amazon EMR:
• Directly access data in S3
• Scale out to thousands of nodes
• Open data formats
• Popular big data frameworks
• Anything you can dream up and code
Amazon Redshift:
• Super-fast local disk performance
• Sophisticated query optimization
• Join-optimized data formats
• Query using standard SQL
• Optimized for data warehousing
16. We want
Sophisticated query optimization and scale-out processing
Super-fast performance and support for open formats
The throughput of local disk and the scale of S3
17. We want all this
From one data processing engine
With my data accessible from all data processing engines
Now and in the future
18. Amazon Redshift Spectrum
Amazon Redshift Spectrum enables you to run Amazon Redshift SQL queries against exabytes of data in Amazon S3.
With Redshift Spectrum, you can extend the analytic power of Amazon Redshift beyond data stored on local disks in your data warehouse to query vast amounts of unstructured data in your Amazon S3 “data lake”.
19. Life of a query (step 1)
A query is submitted to Amazon Redshift over JDBC/ODBC:
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …
[Diagram: Amazon Redshift cluster (nodes 1…N) reached over JDBC/ODBC; Amazon S3, exabyte-scale object storage; Data Catalog, an Apache Hive Metastore]
20. Life of a query (step 2)
The query is optimized and compiled at the leader node, which determines what runs locally and what goes to Amazon Redshift Spectrum.
21. Life of a query (step 3)
The query plan is sent to all compute nodes.
22. Life of a query (step 4)
Compute nodes obtain partition info from the Data Catalog and dynamically prune partitions.
23. Life of a query (step 5)
Each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
24. Life of a query (step 6)
Amazon Redshift Spectrum nodes scan your S3 data.
25. Life of a query (step 7)
Amazon Redshift Spectrum projects, filters, and aggregates.
26. Life of a query (step 8)
Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
27. Life of a query (step 9)
The result is sent back to the client.
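The partition-pruning step in this sequence can be sketched in miniature. The table name, catalog layout, and S3 prefixes here are hypothetical stand-ins for real Hive Metastore entries:

```python
from datetime import date

# A hypothetical partition catalog: one entry per date partition of an
# external table, mapping partition values to S3 prefixes (the real
# metadata would live in the Hive Metastore / Data Catalog).
partitions = {
    date(2017, 6, d): f"s3://bucket/events/dt=2017-06-{d:02d}/"
    for d in range(1, 31)
}

def prune(partitions, lo, hi):
    """Keep only partitions whose key satisfies the query's date-range
    predicate; only these S3 prefixes get scanned."""
    return [prefix for key, prefix in sorted(partitions.items())
            if lo <= key <= hi]

# A query filtering on one week touches 7 of the 30 partitions.
to_scan = prune(partitions, date(2017, 6, 5), date(2017, 6, 11))
print(len(to_scan))  # 7
```

Pruning happens before any S3 request is issued, so the Spectrum layer never sees the 23 partitions the predicate excludes.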
31. Company Confidential
NUVIAD
NUVIAD is an online marketing service that makes data-driven, location-aware mobile marketing accessible, effective, and simple to use for agencies, networks, and businesses.
NUVIAD HyperDSP
NUVIAD Local
NUVIAD inView SSP
32. NUVIAD In Numbers
• Analyzing over 700k ad opportunities every second
• Over 2.5 Billion user profiles
• Over 500k app profiles from App Store and Google Play
• Local businesses mapped via multiple sources
• Thousands of customers, from small businesses to large networks
• Tens of thousands of campaigns
33. Data is Everything
Identifying the user’s intent is the holy grail of digital marketing: finding the right moment when the user is most receptive to my marketing message.
• Digital advertising was focused on the digital aspects of the user
• HyperLocal adds the physical context
34. The Challenge
• Streamlining high-scale data from hundreds of thousands of sources – finding the needles in the huge haystack
• Providing effective, data-driven tools to our customers and partners
• Effectively scaling up the platform
• It is easy to create a platform that delivers data right, or delivers it now. We need to provide data right, now. Speed, speed, and more speed.
36. Traditional Methodology
• Data Warehouse - Redshift (over 60 dc1.large servers in two clusters)
• Algorithmic processes – Redshift
• Reporting and Analytics – RDS (MySQL)
• Real-time reporting – memSQL
• Aerospike, MongoDB
• Result
• Fragmented data
• Multiple data sets
• Hard to scale up (storage vs. compute)
• Maintenance nightmare
37. NUVIAD Data Lake
One pond of data, multiple query engines
• S3 as the main storage for all data
• Formatting data in an open and common format
• Creating unified data streams to ingest data
• Separating compute from storage to allow manageable, ad-hoc scaling
• Using different query engines over the same data
• Using different data permutations to optimize data for queries
38. One Data Set. Many Query Engines.
• With data formatted correctly (Parquet) and stored in S3, we can use different query engines for different workloads:
• Amazon Athena – for quick and simple queries
• Amazon Redshift Spectrum – for complex algorithmic queries utilizing PostgreSQL
• Presto on EMR – for large reports on a scalable cluster
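Why the Parquet choice benefits every one of these engines can be shown with a toy columnar layout. This is pure-Python illustration; real Parquet also adds per-column encoding, compression, and statistics:

```python
# Toy row vs. columnar layout. A query that aggregates one field only
# needs to touch that field's column, not every field of every row --
# this is what makes scans cheap for all engines reading the file.
rows = [
    {"user_id": "u1", "country": "US", "amount": 3.5},
    {"user_id": "u2", "country": "DE", "amount": 7.0},
    {"user_id": "u3", "country": "US", "amount": 1.5},
]

# Columnar layout: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# SELECT SUM(amount) reads a single column.
total = sum(columns["amount"])
print(total)  # 12.0
```

Because the column layout is an open, engine-neutral property of the file, Athena, Redshift Spectrum, and Presto all get the same scan savings from the same S3 objects.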