Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad hoc analysis, and quickly get business insights from your data. In this session, we demonstrate how you can point Amazon QuickSight at AWS data stores, flat files, or third-party data sources and begin visualizing your data in minutes. We also introduce SPICE, the Super-fast, Parallel, In-memory Calculation Engine in Amazon QuickSight, which performs advanced calculations and renders visualizations rapidly without requiring additional infrastructure, SQL programming, or dimensional modeling, so you can seamlessly scale to hundreds of thousands of users and petabytes of data. Lastly, you will see how Amazon QuickSight provides smart visualizations and graphs optimized for your different data types, ensuring the most suitable visualization for your analysis, and how to share these visualization stories using the built-in collaboration tools. This session focuses on QuickSight as it applies to other AWS big data services: Redshift, Athena, S3, and RDS.
2. Agenda
• Quick overview of Amazon Athena and Amazon QuickSight
• TrueCar use case
• Clickstream data implementation
• Troubleshooting queries and dealing with errors
• Using Amazon QuickSight to visualize clickstream data
• Questions / Answers
3. Legacy Data Architectures Exist as Isolated Data Silos
[Diagram: three isolated silos: a Hadoop cluster, a SQL database, and a data warehouse appliance]
5. Evolution of Data Architectures
1985: Data Warehouse Appliances
[Diagram: shared storage tier (NAS appliance) serving four compute nodes]
Benefits
• Consolidated multiple decision support environments (i.e., databases) into a single architecture
• Best performance available at the time of conception, hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top
Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the vendor, also priced per node per year
• Could not handle unstructured data sets
• Heavy ETL & data cleansing
6. Evolution of Data Architectures
2006: Hadoop Clusters
[Diagram: Hadoop master node plus worker nodes, each with CPU, memory, and HDFS storage]
Improvements
• Open source software license
• Commodity white-box servers
• Could handle structured & unstructured data sets
• Many different applications within the framework (MapReduce, Spark, Hive, Pig, HBase, Presto, etc.)
Constraints
• HDFS 3x replication to protect against node failure gets expensive at scale: a 500 TB data set means a 1.5 PB cluster
• Local storage means you must scale and pay for CPU & memory resources when adding data capacity
• General-purpose, monolithic cluster with many different apps on the same hardware
• Still a data silo
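The replication arithmetic behind that constraint is simple but worth making explicit; a minimal sketch:

```python
# Raw HDFS capacity needed for a data set under the default
# replication factor of 3 (the figure behind "500 TB = 1.5 PB cluster").
def hdfs_raw_capacity_tb(data_tb, replication=3):
    """Return raw storage (TB) required to hold data_tb with replication."""
    return data_tb * replication

print(hdfs_raw_capacity_tb(500))  # 1500, i.e. a 1.5 PB cluster
```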
7. Evolution of Data Architectures
2009: Decoupled EMR Architecture
[Diagram: Hadoop master node and compute nodes (CPU + memory only), with S3 acting as HDFS]
Improvements
• Decoupled storage & compute
• Scale CPU and memory resources independently, up & down
• Only pay for the 500 TB data set (not 3x)
• Multi-physical-facility replication via S3
• Multiple clusters can run in parallel against shared data in S3
• Each job gets its own optimized cluster, e.g., Spark on memory-intensive hardware, Hive on CPU-intensive, HBase on I/O-intensive
Constraints
• Still have a cluster to provision and manage
• Must expose the EMR cluster to SQL users via Hive, Presto, etc.
8. Evolution of Data Architectures
Today: Clusterless
Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without having to provision a cluster or touch infrastructure
• Pay by the query
• Zero administration
• Process data where it lives
[Diagram: SQL interface in a web browser -> Athena for SQL -> S3 data lake; Spark & Hive interface in a web browser -> Glue for ETL -> S3 data lake]
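"Pay by the query" means an analyst submits SQL straight to Athena with nothing to provision. A minimal sketch of the programmatic path, assuming a hypothetical database name and results bucket (boto3 and AWS credentials are needed to actually submit):

```python
# Sketch, not a definitive implementation: build the arguments for
# boto3's Athena start_query_execution call. "clickstream_db" and
# "example-bucket" are placeholder names.
def athena_query_params(sql: str) -> dict:
    """Build keyword arguments for boto3 Athena start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": "clickstream_db"},
        "ResultConfiguration": {
            "OutputLocation": "s3://example-bucket/athena-results/"
        },
    }

params = athena_query_params("SELECT count(*) FROM clicks")
# boto3.client("athena").start_query_execution(**params)  # actual submission
print(params["QueryString"])
```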
9. Data Lake Reference Architecture
• Data Ingestion: get your data into S3 quickly and securely (Snowball, Database Migration Service, Kinesis Firehose, Direct Connect)
• Central Storage: secure, cost-effective storage in Amazon S3
• Catalog & Search: access and search metadata (DynamoDB, Elasticsearch)
• Processing & Analytics: use of predictive and prescriptive analytics to gain better understanding (QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS, Glue ETL)
• Access & User Interface: give your users easy and secure access (API Gateway, Identity & Access Management, Cognito)
• Protect and Secure: use entitlements to ensure data is secure and users' identities are verified (Security Token Service, CloudWatch, CloudTrail, Key Management Service)
10. Serverless Characteristics
• No servers to provision or manage
• Scales with usage
• Never pay for idle
• Availability and fault tolerance built in
12. Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
• Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch without writing an application or managing infrastructure.
• Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
• Seamless elasticity: seamlessly scales to match data throughput without intervention.
[Diagram: capture and submit streaming data to Firehose -> Firehose loads streaming data continuously into S3, Amazon Redshift, and Amazon Elasticsearch -> analyze streaming data using your favorite BI tools]
13. Amazon Kinesis Firehose Pricing
Simple, pay-as-you-go, and no up-front costs
| Dimension | Value |
| Per 1 GB of data ingested | $0.029 |
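At the quoted per-GB rate, ingestion cost is straightforward to estimate; a minimal sketch (rates vary by region, so treat $0.029 as illustrative):

```python
# Illustrative Firehose ingestion cost at the quoted $0.029/GB rate.
PRICE_PER_GB = 0.029

def monthly_firehose_cost(gb_per_day, days=30):
    """Return the ingestion-only cost (USD) for a month of streaming."""
    return round(gb_per_day * days * PRICE_PER_GB, 2)

# e.g., 23 GB/day of clickstream data (the TrueCar volume later in the deck):
print(monthly_firehose_cost(23))  # 20.01
```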
14. Amazon Athena is easy to use
• Log in to the console
• Create a table
  • Type in an Apache Hive DDL statement
  • Use the console Add Table wizard
  • Use the AWS Glue Data Catalog
• Start querying in the console
• JDBC allows BI tool access
• A full REST API is also available
• Concurrency is a setting
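The "type in an Apache Hive DDL statement" path looks like the sketch below. The table, column, and bucket names are hypothetical placeholders; adapt them to your own data:

```python
# A minimal sketch of a Hive DDL statement registering an external
# Athena table over clickstream logs in S3. Column names, serde, and
# bucket are placeholders, not a definitive schema.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
  event_time string,
  user_id    string,
  page_url   string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
LOCATION 's3://example-bucket/clickstream/'
"""

# Paste this into the Athena console query editor, or submit it
# programmatically (e.g., via boto3's start_query_execution).
print(ddl)
```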
15. Familiar technologies under the covers
Presto (used for SQL queries):
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Hive (used for DDL functionality):
• Complex data types
• Multitude of formats
• Supports data partitioning
16. Simple pricing – $5/TB scanned
• Pay by the amount of data scanned per query
• Ways to save costs
  • Compress
  • Convert to columnar format
  • Use partitioning
• Free: DDL queries, failed queries
| Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost |
| Logs stored as text files | 1.15 TB | 237 seconds | 1.15 TB | $5.75 |
| Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013 |
| Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper |
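The per-query costs in the table fall out of the $5/TB-scanned rate directly; a minimal sketch of the arithmetic:

```python
# Athena pricing arithmetic: $5 per TB scanned, so compressing and
# converting to Parquet pays off by shrinking the bytes each query reads.
PRICE_PER_TB = 5.00

def athena_query_cost(gb_scanned):
    """Return the cost (USD) of a query that scans gb_scanned GB."""
    return round(gb_scanned / 1024 * PRICE_PER_TB, 3)

print(athena_query_cost(1.15 * 1024))  # text logs, 1.15 TB scanned: 5.75
print(athena_query_cost(2.69))         # Parquet, 2.69 GB scanned: 0.013
```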
25. What’s New in QuickSight (launched 11/2016)
New features added since 11/16:
• Enterprise Edition with AD support and AD Connector
• Athena connector
• Scheduled refresh
• Export to CSV
• KPI charts
• Audit logging with CloudTrail
• Presto, Spark, and Teradata connectors
• Federated SSO with SAML 2.0
• Relative date filters
• Redshift Spectrum connector
• S3 analytics connector with deep linking
• Count distinct
• Excel enhancement
26. Amazon QuickSight is a cost-effective solution
| | Individual | Standard Edition (60-day free trial) | Enterprise Edition (60-day free trial) |
| Price per user per month | Free | $9 (annual) / $12 (month to month) | $18 (annual) / $24 (month to month) |
| Number of users | 1 | 2+ | 2+ |
| SPICE capacity (GB)* | 1 | 10 | 10 |
| Additional SPICE per GB-month | $0.25 | $0.25 | $0.38 |
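For a quick budget check against the table above, per-seat pricing multiplies out simply; an illustrative sketch (team size is hypothetical, and extra SPICE capacity is ignored):

```python
# Illustrative annual cost for a team on per-user-per-month pricing,
# ignoring additional SPICE charges.
def annual_cost(users, per_user_month):
    """Return yearly cost (USD) for a team at a given monthly seat price."""
    return users * per_user_month * 12

print(annual_cost(20, 9))   # 20 users, Standard annual rate: 2160
print(annual_cost(20, 18))  # 20 users, Enterprise annual rate: 4320
```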
28. David Giffin
• SVP, Technology Platform @ TrueCar
• Infrastructure, Deployment, Business Intelligence, and Data Warehouse teams
29. Moving to Amazon Athena
TrueCar recently switched vendors for our clickstream data. Clickstream data is now collected by Google Analytics and imported daily into BigQuery. We use a MapReduce job to move the clickstream data from BigQuery to AWS (S3).
30. Why Amazon Athena for Clickstream?
• Athena provided a very simple mechanism to query large datasets
• Low operational burden in cluster setup and maintenance
• Amazon manages it for you!
32. Our Clickstream Data
• Current daily uncompressed file size is ~23 GB (~10 TB yearly)
• We expect this number to grow by 20-40% each year
We undertook the following steps to structure our data for optimal performance:
• We compressed the data while extracting from BigQuery to S3 (~2 GB daily file compared to ~23 GB uncompressed). The smaller data size reduces network traffic from S3 to Athena.
• We partitioned the data on YEAR -> MONTH -> DAY, which reduces the amount of data scanned per query, thereby improving performance.
• We trained analysts to use best practices when querying the data.
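The YEAR -> MONTH -> DAY layout above can be sketched as S3 key prefixes plus the DDL that registers each day's partition with Athena. The bucket and table names are hypothetical placeholders:

```python
# Sketch of the YEAR -> MONTH -> DAY partition scheme: one S3 prefix
# per day, plus an ALTER TABLE statement to register that partition.
# "clickstream" and "example-bucket" are placeholder names.
from datetime import date

def partition_prefix(d):
    """S3 key prefix for one day of clickstream data."""
    return f"clickstream/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

def add_partition_ddl(d):
    """Hive DDL registering the day's partition with Athena."""
    return (
        f"ALTER TABLE clickstream ADD IF NOT EXISTS "
        f"PARTITION (year='{d.year}', month='{d.month:02d}', day='{d.day:02d}') "
        f"LOCATION 's3://example-bucket/{partition_prefix(d)}'"
    )

print(partition_prefix(date(2017, 6, 1)))  # clickstream/year=2017/month=06/day=01/
```

With this layout, a query filtered on `year`, `month`, and `day` scans only the matching prefixes instead of the whole data set.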
53. Amazon QuickSight Lessons Learned
• Very easy to use
• Intuitive user interface for reporting and dashboarding
• Simple to set up connections to Athena, once your IAM roles are in place
Issues we are working with AWS to resolve
• Error messaging needs improvement
• The Athena JDBC driver needs improvement
• UNNEST not working in Amazon QuickSight
• Ability to select an index within an array