Querying and analyzing big data can be complicated and expensive. It requires you to set up and manage databases, data warehouses, and business intelligence applications, all of which take time, effort, and resources. In this session, we will demonstrate how you can build a serverless big data analytics solution using Amazon Athena and Amazon QuickSight, eliminating the need to invest in databases, data warehouses, complex ETL solutions, and BI applications. The session will feature David Giffin, SVP Technology at TrueCar, who will share best practices and lessons learned from implementing Amazon Athena to query and analyze clickstream data directly on Amazon S3. Learn how to use Amazon Athena to query various data formats in Amazon S3, and then use Amazon QuickSight to visualize the results of your Athena queries with and without SPICE, QuickSight’s in-memory calculation engine.
2. Agenda
• Quick overview of Amazon Athena and Amazon QuickSight
• TrueCar use case
• Clickstream data implementation
• Troubleshooting queries and dealing with errors
• Using Amazon QuickSight to visualize clickstream data
• Questions / Answers
3. Serverless characteristics
• No servers to provision or manage
• Scales with usage
• Never pay for idle
• Availability and fault tolerance built in
5. Amazon Athena is easy to use
• Log in to the console
• Create a table
• Type in an Apache Hive DDL statement
• Use the console Add Table wizard
• Soon – AWS Glue Data Catalog
• Start querying in console
• JDBC allows BI tool access
• Full REST API also available
• Concurrency is a setting
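As a rough illustration of the "Apache Hive DDL statement" step, the sketch below assembles a `CREATE EXTERNAL TABLE` statement that could be pasted into the Athena console. The database, table, column names, and bucket are hypothetical, and the choice of the OpenX JSON SerDe is an assumption about the data format, not something stated in the slides.

```python
def build_create_table_ddl(database, table, columns, location):
    """Return a Hive-style CREATE EXTERNAL TABLE statement for JSON data in S3."""
    # One "`name` type" entry per column, comma-separated.
    column_lines = ",\n".join(f"  `{name}` {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{column_lines}\n"
        ")\n"
        # Assumption: the files are newline-delimited JSON.
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical names, for illustration only.
ddl = build_create_table_ddl(
    database="weblogs",
    table="clickstream_raw",
    columns=[
        ("visitor_id", "string"),
        ("page_url", "string"),
        ("event_time", "timestamp"),
    ],
    location="s3://example-bucket/clickstream/",
)
print(ddl)
```

The table is external, so creating or dropping it only changes metadata; the data stays in place in S3.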
6. Familiar technologies under the covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
7. Simple pricing – $5/TB scanned
• Pay by the amount of data scanned per query
• Ways to save costs
• Compress
• Convert to columnar format
• Use partitioning
• Free: DDL queries, failed queries
Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
Logs stored as text files | 1.15 TB | 237 seconds | 1.15 TB | $5.75
Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
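The cost column follows directly from the $5 per TB scanned price. A minimal sketch of that arithmetic is shown below; the function name is illustrative, and treating 1 TB as 1024 GB is an assumption (the slide does not say which convention is used).

```python
PRICE_PER_TB_USD = 5.0  # from the slide: $5 per TB scanned
GB_PER_TB = 1024        # assumption: binary units


def query_cost_usd(data_scanned_gb):
    """Cost of a single query, given the amount of data it scanned in GB."""
    return data_scanned_gb / GB_PER_TB * PRICE_PER_TB_USD


text_cost = query_cost_usd(1.15 * GB_PER_TB)  # 1.15 TB scanned as text files
parquet_cost = query_cost_usd(2.69)           # 2.69 GB scanned as Parquet
savings_pct = (1 - parquet_cost / text_cost) * 100

print(f"text: ${text_cost:.2f}")
print(f"parquet: ${parquet_cost:.3f}")
print(f"savings: {savings_pct:.1f}%")
```

This reproduces the $5.75 and $0.013 figures from the table, and a saving of roughly 99.8%, in line with the 99.7% reported on the slide.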
12. Movable Ink provides real-time personalization of marketing emails based on a wide range of user, device, and contextual data
Movable Ink has been collecting data on user actions since 2011, and this database grows by 75 to 100 GB per day. To reduce time-to-insight, optimize costs, and increase flexibility for its analytics, the company recently adopted the serverless Amazon Athena query service.
Since the company began using Amazon Athena, it has realized both cost savings and improved performance for analytics related to user actions.
“Using Amazon Athena, we’re able to query seven years’ worth of data—adding up to hundreds of
terabytes—get results at least 50 percent faster, and save nearly $15,000 per month, all without
keeping a cluster running.”
Matt Chesler, director of DevOps at Movable Ink
13. Japan Taxi, a transportation app, has two million active users every month
“The ability to put data into Amazon S3 and query it just using standard SQL with Amazon Athena is incredible.”
“With Amazon Athena, we don't have to load the data since the service can query the data in place. Now, any of our developers can query data at its most granular resolution, at low costs – enabling us to give everyone who needs it easy access to our data. Because Amazon Athena uses open source formats, we can also use other solutions like Amazon EMR on the same data, making interoperability easy. And, because Amazon Athena requires no administration, we were able to get started immediately.”
Kazuhiro Iwata, Chief Technology Officer, Japan Taxi
15. Amazon QuickSight
The benefits of cloud BI
• Integrated
• Fully managed and scalable
• Super fast and easy to use
• Cost-effective
16. Amazon QuickSight – basic concepts
• Retail data
• Ops data
• Marketing data
• Relational databases
• Flat files
• And many others!
18. What’s New in QuickSight
Launched 11/2016
New features added since 11/16:
• Enterprise Edition with AD support
• Athena Connector
• Scheduled Refresh
• Export to CSV
• KPI Charts
• AD Connector
• Audit Logging with CloudTrail
• Presto, Spark, Teradata Connectors
• Federated SSO with SAML 2.0
• Relative Date Filters
• Enterprise Analytics Data
• Excel Enhancement
• Redshift Spectrum Connector
• S3 Analytics Connector with Deep Linking
• Count Distinct
19. Amazon QuickSight is a cost-effective solution
Standard Edition and Enterprise Edition each include a 60-day free trial.
Plan | Individual | Standard (annual) | Standard (month to month) | Enterprise (annual) | Enterprise (month to month)
Price per user per month | Free | $9 | $12 | $18 | $24
Number of users | 1 | 2+ | 2+ | 2+ | 2+
SPICE capacity (GB)* | 1 | 10 | 10 | 10 | 10
Additional SPICE GB-month | $0.25 | $0.25 | $0.25 | $0.38 | $0.38
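The pricing figures on this slide can be turned into a quick monthly estimate. The sketch below is a rough calculation only: it assumes the 10 GB SPICE allotment is per user (the footnote for the asterisk is not in the deck) and that extra SPICE is billed at $0.25 per GB-month for Standard and $0.38 for Enterprise. The function and dictionary names are illustrative.

```python
# (price per user per month in USD, extra SPICE price per GB-month in USD)
PLANS = {
    ("standard", "annual"): (9.0, 0.25),
    ("standard", "monthly"): (12.0, 0.25),
    ("enterprise", "annual"): (18.0, 0.38),
    ("enterprise", "monthly"): (24.0, 0.38),
}
INCLUDED_SPICE_GB_PER_USER = 10  # assumption: the allotment is per user


def monthly_cost_usd(edition, billing, users, spice_gb_needed):
    """Estimate the monthly bill for a team, including extra SPICE capacity."""
    user_price, extra_gb_price = PLANS[(edition, billing)]
    included_gb = users * INCLUDED_SPICE_GB_PER_USER
    extra_gb = max(0, spice_gb_needed - included_gb)
    return users * user_price + extra_gb * extra_gb_price


# Example: 5 Standard users on annual billing who need 80 GB of SPICE.
# 5 * $9 = $45 for users, plus 30 extra GB * $0.25 = $7.50.
estimate = monthly_cost_usd("standard", "annual", users=5, spice_gb_needed=80)
print(f"${estimate:.2f}")
```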
21. David Giffin
• SVP, Technology Platform @ TrueCar
• Infrastructure, Deployment, Business Intelligence and Data Warehouse Teams
22. Moving to Amazon Athena
TrueCar has recently switched vendors for our clickstream data. Clickstream data is now collected by Google Analytics and imported daily into BigQuery. We use a MapReduce job to move the clickstream data from BigQuery to AWS (S3).
23. Why Amazon Athena for Clickstream?
• Athena provided a very simple mechanism to query large datasets
• Low operational burden in cluster setup and maintenance
• Amazon manages it for you!
25. Our Clickstream Data
• Current daily uncompressed file size: ~23 GB (~10 TB yearly)
• We expect this number to grow by 20-40% each year
We undertook the following steps to structure our data for optimal performance:
• We compressed the data while extracting it from BigQuery to S3 (~2 GB daily file compared to ~23 GB uncompressed). The smaller data size reduces network traffic from S3 to Athena.
• We partitioned the data on YEAR -> MONTH -> DAY, which reduces the amount of data scanned per query, thereby improving performance.
• We are training analysts to use best practices when querying the data.
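The compression and partitioning steps above could look roughly like the following sketch. It is not TrueCar's actual job: the bucket, prefix, and table names are hypothetical, gzip is an assumed compression format, and the Hive-style `year=/month=/day=` key layout is one common way to express a YEAR -> MONTH -> DAY partition scheme.

```python
import gzip
import json
from datetime import date


def partition_prefix(base_prefix, day):
    """Hive-style S3 key prefix for one day's data."""
    return f"{base_prefix}/year={day:%Y}/month={day:%m}/day={day:%d}/"


def compress_records(records):
    """Serialize records as newline-delimited JSON and gzip the result."""
    payload = "\n".join(json.dumps(record) for record in records).encode("utf-8")
    return gzip.compress(payload)


def add_partition_ddl(table, bucket, base_prefix, day):
    """DDL statement that registers one day's partition with the table."""
    location = f"s3://{bucket}/{partition_prefix(base_prefix, day)}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{day:%Y}', month='{day:%m}', day='{day:%d}') "
        f"LOCATION '{location}'"
    )


# Hypothetical names, for illustration only.
day = date(2017, 8, 5)
prefix = partition_prefix("clickstream", day)
blob = compress_records([{"visitor_id": "abc", "page_url": "/home"}])
ddl = add_partition_ddl("clickstream_events", "example-bucket", "clickstream", day)
print(prefix)
print(ddl)
```

A query that filters on the partition columns, for example `WHERE year = '2017' AND month = '08'`, then only reads the objects under the matching prefixes, which is where the reduction in scanned data comes from.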
46. Amazon QuickSight Lessons Learned
• Very easy to use
• Intuitive user interface for reporting and dashboarding
• Simple to set up connections to Athena once your IAM roles are in place
Issues we are working with AWS to resolve
• Error messaging needs improvement
• Athena JDBC driver needs improvement
• UNNEST not working in Amazon QuickSight
• Ability to select index with an array