Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad hoc analysis, and quickly get business insights from your data. In this session, we demonstrate how you can point Amazon QuickSight at AWS data stores, flat files, or third-party data sources and begin visualizing your data in minutes. We also introduce SPICE, the Super-fast, Parallel, In-memory Calculation Engine in Amazon QuickSight, which performs advanced calculations and renders visualizations rapidly without requiring additional infrastructure, SQL programming, or dimensional modeling, so you can seamlessly scale to hundreds of thousands of users and petabytes of data. Lastly, you will see how Amazon QuickSight provides smart visualizations and graphs optimized for your different data types, ensuring the most suitable visualization for your analysis, and how to share these visualization stories using the built-in collaboration tools. This session focuses on QuickSight as it applies to other AWS big data services: Redshift, Athena, S3, and RDS.
2. Agenda
• Quick overview of Amazon Athena and Amazon QuickSight
• TrueCar use case
• Clickstream data implementation
• Troubleshooting queries and dealing with errors
• Using Amazon QuickSight to visualize clickstream data
• Questions / Answers
3. Legacy Data Architectures Exist as Isolated Data Silos
[Diagram: three isolated silos: a Hadoop cluster, a SQL database, and a data warehouse appliance]
5. Evolution of Data Architectures
1985: Data Warehouse Appliances
[Diagram: shared storage tier (NAS appliance) serving four compute nodes]
Benefits
• Consolidated multiple decision support environments (i.e., databases) into a single architecture
• Best performance available at the time of conception, hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top
Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the vendor, also priced per node per year
• Could not handle unstructured data sets
• Heavy ETL & data cleansing
6. Evolution of Data Architectures
2006: Hadoop Clusters
[Diagram: Hadoop master node plus worker nodes, each with CPU, memory, and HDFS storage]
Improvements
• Open source software license
• Commodity white-box servers
• Could handle structured & unstructured data sets
• Many different applications within the framework (MapReduce, Spark, Hive, Pig, HBase, Presto, etc.)
Constraints
• HDFS 3x replication to protect against node failure gets expensive at scale: a 500 TB data set means a 1.5 PB cluster
• Local storage means you must scale and pay for CPU & memory resources when adding data capacity
• General-purpose, monolithic cluster with many different apps on the same hardware
• Still a data silo
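The replication arithmetic behind that constraint is simple but worth making explicit; a minimal sketch:

```python
# Raw HDFS capacity needed for a data set under the default
# replication factor of 3 (the figure behind "500 TB = 1.5 PB cluster").
def hdfs_raw_capacity_tb(data_tb, replication=3):
    """Return raw storage (TB) required to hold data_tb with replication."""
    return data_tb * replication

print(hdfs_raw_capacity_tb(500))  # 1500, i.e. a 1.5 PB cluster
```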
7. Evolution of Data Architectures
2009: Decoupled EMR Architecture
[Diagram: Hadoop master node and compute nodes (CPU + memory only), with S3 acting as HDFS]
Improvements
• Decoupled storage & compute
• Scale CPU and memory resources independently, up & down
• Only pay for the 500 TB data set (not 3x)
• Multi-physical-facility replication via S3
• Multiple clusters can run in parallel against shared data in S3
• Each job gets its own optimized cluster, e.g., Spark on memory-intensive hardware, Hive on CPU-intensive, HBase on I/O-intensive
Constraints
• Still have a cluster to provision and manage
• Must expose the EMR cluster to SQL users via Hive, Presto, etc.
8. Evolution of Data Architectures
Today: Clusterless
Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without having to provision a cluster or touch infrastructure
• Pay by the query
• Zero administration
• Process data where it lives
[Diagram: SQL interface in a web browser -> Athena for SQL -> S3 data lake; Spark & Hive interface in a web browser -> Glue for ETL -> S3 data lake]
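"Pay by the query" means an analyst submits SQL straight to Athena with nothing to provision. A minimal sketch of the programmatic path, assuming a hypothetical database name and results bucket (boto3 and AWS credentials are needed to actually submit):

```python
# Sketch, not a definitive implementation: build the arguments for
# boto3's Athena start_query_execution call. "clickstream_db" and
# "example-bucket" are placeholder names.
def athena_query_params(sql: str) -> dict:
    """Build keyword arguments for boto3 Athena start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": "clickstream_db"},
        "ResultConfiguration": {
            "OutputLocation": "s3://example-bucket/athena-results/"
        },
    }

params = athena_query_params("SELECT count(*) FROM clicks")
# boto3.client("athena").start_query_execution(**params)  # actual submission
print(params["QueryString"])
```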
9. Data Lake Reference Architecture
• Data Ingestion: get your data into S3 quickly and securely (Snowball, Database Migration Service, Kinesis Firehose, Direct Connect)
• Central Storage: secure, cost-effective storage in Amazon S3
• Catalog & Search: access and search metadata (DynamoDB, Elasticsearch)
• Processing & Analytics: use of predictive and prescriptive analytics to gain better understanding (QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS, Glue ETL)
• Access & User Interface: give your users easy and secure access (API Gateway, Identity & Access Management, Cognito)
• Protect and Secure: use entitlements to ensure data is secure and users' identities are verified (Security Token Service, CloudWatch, CloudTrail, Key Management Service)
10. Serverless Characteristics
• No servers to provision or manage
• Scales with usage
• Never pay for idle
• Availability and fault tolerance built in
12. Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch
• Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch without writing an application or managing infrastructure.
• Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
• Seamless elasticity: seamlessly scales to match data throughput without intervention.
[Diagram: capture and submit streaming data to Firehose -> Firehose loads streaming data continuously into S3, Amazon Redshift, and Amazon Elasticsearch -> analyze streaming data using your favorite BI tools]
13. Amazon Kinesis Firehose Pricing
Simple, pay-as-you-go, and no up-front costs
| Dimension | Value |
| Per 1 GB of data ingested | $0.029 |
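At the quoted per-GB rate, ingestion cost is straightforward to estimate; a minimal sketch (rates vary by region, so treat $0.029 as illustrative):

```python
# Illustrative Firehose ingestion cost at the quoted $0.029/GB rate.
PRICE_PER_GB = 0.029

def monthly_firehose_cost(gb_per_day, days=30):
    """Return the ingestion-only cost (USD) for a month of streaming."""
    return round(gb_per_day * days * PRICE_PER_GB, 2)

# e.g., 23 GB/day of clickstream data (the TrueCar volume later in the deck):
print(monthly_firehose_cost(23))  # 20.01
```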
14. Amazon Athena is easy to use
• Log in to the console
• Create a table
  • Type in an Apache Hive DDL statement
  • Use the console Add Table wizard
  • Use the AWS Glue Data Catalog
• Start querying in the console
• JDBC allows BI tool access
• A full REST API is also available
• Concurrency is a setting
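The "type in an Apache Hive DDL statement" path looks like the sketch below. The table, column, and bucket names are hypothetical placeholders; adapt them to your own data:

```python
# A minimal sketch of a Hive DDL statement registering an external
# Athena table over clickstream logs in S3. Column names, serde, and
# bucket are placeholders, not a definitive schema.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS clickstream (
  event_time string,
  user_id    string,
  page_url   string
)
PARTITIONED BY (year string, month string, day string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
LOCATION 's3://example-bucket/clickstream/'
"""

# Paste this into the Athena console query editor, or submit it
# programmatically (e.g., via boto3's start_query_execution).
print(ddl)
```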
15. Familiar technologies under the covers
Presto (used for SQL queries):
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Hive (used for DDL functionality):
• Complex data types
• Multitude of formats
• Supports data partitioning
16. Simple pricing – $5/TB scanned
• Pay by the amount of data scanned per query
• Ways to save costs
  • Compress
  • Convert to columnar format
  • Use partitioning
• Free: DDL queries, failed queries
| Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost |
| Logs stored as text files | 1.15 TB | 237 seconds | 1.15 TB | $5.75 |
| Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013 |
| Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper |
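The per-query costs in the table fall out of the $5/TB-scanned rate directly; a minimal sketch of the arithmetic:

```python
# Athena pricing arithmetic: $5 per TB scanned, so compressing and
# converting to Parquet pays off by shrinking the bytes each query reads.
PRICE_PER_TB = 5.00

def athena_query_cost(gb_scanned):
    """Return the cost (USD) of a query that scans gb_scanned GB."""
    return round(gb_scanned / 1024 * PRICE_PER_TB, 3)

print(athena_query_cost(1.15 * 1024))  # text logs, 1.15 TB scanned: 5.75
print(athena_query_cost(2.69))         # Parquet, 2.69 GB scanned: 0.013
```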
25. What’s New in QuickSight (launched 11/2016)
New features added since 11/16:
• Enterprise Edition with AD support and AD Connector
• Athena connector
• Scheduled refresh
• Export to CSV
• KPI charts
• Audit logging with CloudTrail
• Presto, Spark, and Teradata connectors
• Federated SSO with SAML 2.0
• Relative date filters
• Redshift Spectrum connector
• S3 analytics connector with deep linking
• Count distinct
• Excel enhancement
26. Amazon QuickSight is a cost-effective solution
| | Individual | Standard Edition (60-day free trial) | Enterprise Edition (60-day free trial) |
| Price per user per month | Free | $9 (annual) / $12 (month to month) | $18 (annual) / $24 (month to month) |
| Number of users | 1 | 2+ | 2+ |
| SPICE capacity (GB)* | 1 | 10 | 10 |
| Additional SPICE per GB-month | $0.25 | $0.25 | $0.38 |
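For a quick budget check against the table above, per-seat pricing multiplies out simply; an illustrative sketch (team size is hypothetical, and extra SPICE capacity is ignored):

```python
# Illustrative annual cost for a team on per-user-per-month pricing,
# ignoring additional SPICE charges.
def annual_cost(users, per_user_month):
    """Return yearly cost (USD) for a team at a given monthly seat price."""
    return users * per_user_month * 12

print(annual_cost(20, 9))   # 20 users, Standard annual rate: 2160
print(annual_cost(20, 18))  # 20 users, Enterprise annual rate: 4320
```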
28. David Giffin
• SVP, Technology Platform @ TrueCar
• Infrastructure, Deployment, Business Intelligence, and Data Warehouse teams
29. Moving to Amazon Athena
TrueCar recently switched vendors for our clickstream data. Clickstream data is now collected by Google Analytics and imported daily into BigQuery. We use a MapReduce job to move the clickstream data from BigQuery to AWS (S3).
30. Why Amazon Athena for Clickstream?
• Athena provided a very simple mechanism to query large datasets
• Low operational burden in cluster setup and maintenance
• Amazon manages it for you!
32. Our Clickstream Data
• Current daily uncompressed file size is ~23 GB (~10 TB yearly)
• We expect this number to grow by 20-40% each year
We undertook the following steps to structure our data for optimal performance:
• We compressed the data while extracting from BigQuery to S3 (~2 GB daily file compared to ~23 GB uncompressed). The smaller data size reduces network traffic from S3 to Athena.
• We partitioned the data on YEAR -> MONTH -> DAY, which reduces the amount of data scanned per query, thereby improving performance.
• We trained analysts to use best practices when querying the data.
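The YEAR -> MONTH -> DAY layout above can be sketched as S3 key prefixes plus the DDL that registers each day's partition with Athena. The bucket and table names are hypothetical placeholders:

```python
# Sketch of the YEAR -> MONTH -> DAY partition scheme: one S3 prefix
# per day, plus an ALTER TABLE statement to register that partition.
# "clickstream" and "example-bucket" are placeholder names.
from datetime import date

def partition_prefix(d):
    """S3 key prefix for one day of clickstream data."""
    return f"clickstream/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

def add_partition_ddl(d):
    """Hive DDL registering the day's partition with Athena."""
    return (
        f"ALTER TABLE clickstream ADD IF NOT EXISTS "
        f"PARTITION (year='{d.year}', month='{d.month:02d}', day='{d.day:02d}') "
        f"LOCATION 's3://example-bucket/{partition_prefix(d)}'"
    )

print(partition_prefix(date(2017, 6, 1)))  # clickstream/year=2017/month=06/day=01/
```

With this layout, a query filtered on `year`, `month`, and `day` scans only the matching prefixes instead of the whole data set.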
53. Amazon QuickSight Lessons Learned
• Very easy to use
• Intuitive user interface for reporting and dashboarding
• Simple to set up connections to Athena, once your IAM roles are in place
Issues we are working with AWS to resolve
• Error messaging needs improvement
• The Athena JDBC driver needs improvement
• UNNEST not working in Amazon QuickSight
• Ability to select an index within an array