© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Roy Ben-Alta, Principal Business Development, AWS
David Giffin, Senior Vice President Technology Platform, TrueCar
August 14, 2017
Serverless Analytics with Amazon Athena and Amazon QuickSight, Featuring TrueCar
Agenda
• Quick overview of Amazon Athena and Amazon QuickSight
• TrueCar use case
• Clickstream data implementation
• Troubleshooting queries and dealing with errors
• Using Amazon QuickSight to visualize clickstream data
• Questions / Answers
Legacy Data Architectures Exist as Isolated Data Silos
• Hadoop Cluster
• SQL Database
• Data Warehouse Appliance
Evolution of Data Architectures
1985: Data Warehouse Appliances
Benefits
• Consolidated multiple decision support environments (i.e. databases) into a single architecture
• Best performance available at time of conception, hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top
Diagram: Shared Storage Tier (NAS Appliance) serving four Compute Nodes
Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the vendor with per node per year cost
• Could not handle unstructured data sets
• Heavy ETL & data cleansing
Evolution of Data Architectures
2006: Hadoop Clusters
Diagram: Hadoop Master Node and cluster nodes, each with CPU, Memory, and HDFS Storage
Improvements
• Open-source software license
• Commodity white-box servers
• Could handle structured & unstructured data sets
• Many different applications within the framework (MapReduce, Spark, Hive, Pig, HBase, Presto, etc.)
Constraints
• HDFS 3X replication to protect against node failure gets expensive at scale
  • 500 TB data set = 1.5 PB cluster
• Local storage means you must scale and pay for CPU & memory resources when adding data capacity
• General purpose, monolithic cluster with many different apps on same hardware
• Still a data silo
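The replication arithmetic behind the 500 TB example can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes decimal units (1 PB = 1000 TB) and ignores any other overhead on the cluster.

```python
def raw_capacity_tb(dataset_tb: float, replication_factor: int = 3) -> float:
    """Raw disk capacity needed to hold a data set with HDFS-style replication."""
    return dataset_tb * replication_factor


dataset_tb = 500  # data set size from the slide
raw_tb = raw_capacity_tb(dataset_tb)

# Assumes 1 PB = 1000 TB, which matches the 1.5 PB figure on the slide.
print(f"{dataset_tb} TB data set -> {raw_tb / 1000:.1f} PB of raw cluster storage")
```

Because compute and storage share the same nodes, every extra replica also means paying for CPU and memory that the data itself may not need.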
Evolution of Data Architectures
2009: Decoupled EMR Architecture
Diagram: Hadoop Master Node and cluster nodes, each with CPU and Memory
Improvements
• Decoupled storage & compute
• Scale CPU and memory resources independently, both up & down
• Only pay for the 500 TB data set (not 3X)
• Multi-physical facility replication via S3
• Multiple clusters can run in parallel against shared data in S3
• Each job gets its own optimized cluster, e.g. Spark on memory-intensive, Hive on CPU-intensive, HBase on I/O-intensive nodes, etc.
Constraints
• Still have a cluster to provision and manage
• Must expose EMR cluster to SQL users via Hive, Presto, etc.
S3 as HDFS
Evolution of Data Architectures
Today: Clusterless
Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without having to provision a cluster or touch infrastructure
• Pay by the query
• Zero Administration
• Process data where it lives
Diagram: Athena for SQL over the S3 Data Lake, with a SQL interface in the web browser; Glue for ETL over the S3 Data Lake, with a Spark & Hive interface in the web browser
Data Lake reference architecture
• Catalog & Search: Access and search metadata (DynamoDB, Elasticsearch)
• Access & User Interface: Give your users easy and secure access (API Gateway, Identity & Access Management, Cognito)
• Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding (QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS)
• Central Storage: Secure, cost-effective storage in Amazon S3 (S3)
• Data Ingestion: Get your data into S3 quickly and securely (Snowball, Database Migration Service, Kinesis Firehose, Direct Connect)
• Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified (Security Token Service, CloudWatch, CloudTrail, Key Management Service)
• Glue ETL
Serverless characteristics
• No servers to provision or manage
• Scales with usage
• Never pay for idle
• Availability and fault tolerance built in
AWS analytics – serverless options
• Data Ingestion and transformation
  • Amazon Kinesis Firehose
  • AWS Glue (New!)
• SQL
  • Amazon Kinesis Analytics
  • Amazon Athena
  • Amazon Redshift Spectrum
• Visualization
  • Amazon QuickSight
Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift and Amazon Elasticsearch
• Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift and Amazon Elasticsearch without writing an application or managing infrastructure.
• Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations.
• Seamless elasticity: Seamlessly scales to match data throughput without intervention.
How it works:
• Capture and submit streaming data to Firehose
• Firehose loads streaming data continuously into S3, Amazon Redshift and Amazon Elasticsearch
• Analyze streaming data using your favorite BI tools
Amazon Kinesis Firehose Pricing
Simple, pay-as-you-go and no up-front costs
Dimension | Value
Per 1 GB of data ingested* | $0.029
Amazon Athena is easy to use
• Log in to the console
• Create a table
  • Type in an Apache Hive DDL statement
  • Use the console Add Table wizard
  • AWS Glue Data Catalog
• Start querying in console
• JDBC allows BI tool access
• Full REST API also available
• Concurrency is a setting
Familiar technologies under the covers
Presto
• Used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive
• Used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
Simple pricing – $5/TB scanned
• Pay by the amount of data scanned per query
• Ways to save costs
  • Compress
  • Convert to columnar format
  • Use partitioning
• Free: DDL queries, failed queries
Dataset | Size on Amazon S3 | Query Run time | Data Scanned | Cost
Logs stored as text files | 1.15 TB | 237 seconds | 1.15 TB | $5.75
Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
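The cost column follows directly from the $5/TB rate. The sketch below reproduces the $5.75 and $0.013 figures from the data scanned; the function name is hypothetical, it assumes 1 TB = 1024 GB, and it ignores any per-query minimum or rounding that the service may apply.

```python
PRICE_PER_TB_USD = 5.0  # from the slide: $5 per TB scanned


def query_cost_usd(tb_scanned: float) -> float:
    """Cost of a single query given the amount of data it scans, in TB."""
    return tb_scanned * PRICE_PER_TB_USD


text_cost = query_cost_usd(1.15)            # logs stored as text files
parquet_cost = query_cost_usd(2.69 / 1024)  # 2.69 GB scanned; assumes 1 TB = 1024 GB
saving = 1 - parquet_cost / text_cost       # fraction of the cost avoided with Parquet

print(f"text files: ${text_cost:.2f}")
print(f"Parquet:    ${parquet_cost:.3f}")
```

The saving comes from two effects that compound: Parquet compresses the data, and its columnar layout lets the query read only the columns it references.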
Use Case - Log Aggregation
• Sources: AWS Service Logs, Web Application Logs, Server Logs
• Logs are delivered to Amazon S3
• A new file triggers AWS Lambda
• Lambda creates the partition on S3 and copies the file to the new partition
• Lambda updates the table partition in the Data Catalog
• Query data with Amazon Athena
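The "update table partition" step in this flow could look roughly like the sketch below. It is not the presenters' implementation: the table name `weblogs`, the bucket name, and the `<prefix>/<YYYY>/<MM>/<DD>/<file>` key layout are all assumptions made for illustration. The handler only builds the DDL text; a real function would still have to submit it to Athena (for example through the AWS SDK).

```python
from datetime import date
from urllib.parse import unquote_plus


def add_partition_ddl(table: str, bucket: str, key: str) -> str:
    """Build the ALTER TABLE statement for the partition that contains `key`.

    Assumes a hypothetical key layout: <prefix>/<YYYY>/<MM>/<DD>/<file>.
    """
    parts = key.split("/")
    year, month, day = parts[-4], parts[-3], parts[-2]
    date(int(year), int(month), int(day))  # raises ValueError on an invalid date
    prefix = "/".join(parts[:-1])
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}') "
        f"LOCATION 's3://{bucket}/{prefix}/'"
    )


def handler(event, context=None):
    """Lambda-style entry point: one DDL statement per new S3 object."""
    statements = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        statements.append(add_partition_ddl("weblogs", bucket, key))
    return statements


# Hypothetical S3 "object created" notification, trimmed to the fields used above.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-logs"},
                "object": {"key": "weblogs/2017/08/14/part-0001.gz"}}}
    ]
}
statements = handler(sample_event)
print(statements[0])
```

Using `ADD IF NOT EXISTS` keeps the statement safe to repeat when several files arrive for the same day.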
Use Case - Real-Time Data Collection
• Real-time events are sent to Amazon Kinesis Firehose
• Firehose stores the data partitioned in Amazon S3
• The new data triggers AWS Lambda
• Lambda updates the table partition in the Data Catalog
• Query data with Amazon Athena
Use Case - Data Export
• Database migration with AWS Database Migration Service
• Exported tables land in Amazon S3
• The export triggers AWS Lambda
• Lambda updates the table partition in the Data Catalog
• Query data with Amazon Athena
Use Case - SaaS Model
Diagram labels: Application request; Hot data; Warm & cold data; Query data; Amazon S3; Amazon Athena; Data Catalog
Amazon QuickSight
The benefits of cloud BI
• Integrated
• Fully managed and scalable
• Super fast and easy to use
• Cost-effective
Amazon QuickSight – basic concepts
Data sources:
• Retail Data
• Ops Data
• Marketing Data
• Relational Databases
• Flat Files
• And Many Others!
Super-fast performance with SPICE
What’s New In QuickSight
Launched 11/2016
New Features Added Since 11/16 (slide categories: Enterprise, Analytics, Data)
• Enterprise Edition with AD support
• Athena Connector
• Scheduled Refresh
• Export to CSV
• KPI Charts
• AD Connector
• Audit Logging with CloudTrail
• Presto, Spark, Teradata Connectors
• Federated SSO with SAML 2.0
• Relative Date Filters
• Excel Enhancement
• Redshift Spectrum Connector
• S3 Analytics Connector with Deep Linking
• Count Distinct
Amazon QuickSight is a cost-effective solution
Edition | Individual | Standard Edition (60-day free trial), Annual | Standard Edition, Month to Month | Enterprise Edition (60-day free trial), Annual | Enterprise Edition, Month to Month
Price per user per month | Free | $9 | $12 | $18 | $24
Number of users | 1 | 2+ | 2+ | 2+ | 2+
SPICE capacity (GB)* | 1 | 10 | 10 | 10 | 10
Additional SPICE GB-month | $0.25 | $0.25 | $0.25 | $0.38 | $0.38
Serverless Analytics with Amazon Athena and Amazon QuickSight
08.14.2017
David Giffin
• SVP, Technology Platform @ TrueCar
• Infrastructure, Deployment, Business Intelligence and Data Warehouse Teams
Moving to Amazon Athena
TrueCar has recently switched vendors for our clickstream data. Clickstream data is now collected by Google Analytics and imported daily into BigQuery. We use a MapReduce job to move the clickstream data from BigQuery to AWS (S3).
Why Amazon Athena for Clickstream?
• Athena provided a very simple mechanism to query large datasets
• Low operational burden in cluster setup and maintenance
• Amazon manages it for you!
Our Use Case
Our Clickstream Data
• Current daily uncompressed file size: ~23 GB (~10 TB yearly)
• We expect this number to grow by 20-40% each year
We undertook the following steps to structure our data for optimal performance:
• We compressed the data while extracting it from BigQuery to S3 (~2 GB daily file compared to ~23 GB uncompressed). The smaller data size reduces network traffic from S3 to Athena.
• We partitioned the data on YEAR -> MONTH -> DAY, which reduces the amount of data scanned per query, thereby improving performance.
• We are training analysts to use best practices to query the data.
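To make the partitioning point concrete, the sketch below builds a query that names only the daily partitions it needs and estimates how much data that touches. The table name `clickstream`, the string-typed `year`/`month`/`day` partition columns, and the helper names are assumptions for illustration; the ~2 GB per day figure is the compressed size quoted above.

```python
from datetime import date, timedelta

DAILY_COMPRESSED_GB = 2.0  # from the slide: ~2 GB compressed per day


def day_predicate(d: date) -> str:
    """Predicate selecting exactly one YEAR -> MONTH -> DAY partition."""
    return f"(year = '{d.year:04d}' AND month = '{d.month:02d}' AND day = '{d.day:02d}')"


def range_query(table: str, start: date, end: date) -> str:
    """COUNT query restricted to the daily partitions in [start, end]."""
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    where = " OR ".join(day_predicate(d) for d in days)
    return f"SELECT COUNT(*) FROM {table} WHERE {where}"


def estimated_scan_gb(start: date, end: date) -> float:
    """Upper bound on data scanned: every file in the selected partitions."""
    return ((end - start).days + 1) * DAILY_COMPRESSED_GB


query = range_query("clickstream", date(2017, 8, 14), date(2017, 8, 15))
week_gb = estimated_scan_gb(date(2017, 8, 14), date(2017, 8, 20))
print(query)
print(f"one week scans at most ~{week_gb:.0f} GB")
```

A query without partition predicates would instead read every day ever loaded, which is the habit the analyst training is meant to prevent.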
Architecture
The Raw Data
Loading the Data
Challenges
Parser Error
Solving Our Issue
• Replacing '.' with '_'
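The slide does not show where the '.' characters occurred. Assuming they were in the field names of the exported JSON records, a cleanup pass before loading might look like the following sketch; the sample record and its field names are made up for illustration.

```python
import json


def sanitize_keys(value):
    """Recursively replace '.' with '_' in every dictionary key."""
    if isinstance(value, dict):
        return {k.replace(".", "_"): sanitize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_keys(item) for item in value]
    return value  # scalars are returned unchanged


# Hypothetical clickstream record with dotted field names.
raw = '{"device.browser": "Chrome", "hits": [{"page.path": "/used-cars", "hit.number": 1}]}'
clean = sanitize_keys(json.loads(raw))
print(json.dumps(clean))
```

Only keys are rewritten, so values such as URLs or version strings that legitimately contain dots stay intact.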
Recreating Our Table
The Results
Select Array Indexes
Amazon QuickSight
Visualization with Amazon QuickSight
Dashboard with Amazon QuickSight
Amazon QuickSight Lessons Learned
• Very easy to use
• Intuitive user interface for reporting and dashboarding
• Simple to set up connections to Athena
  • Once your IAM roles are in place
Issues we are working with AWS to resolve
• Error messaging needs improvement
• Athena JDBC driver needs improvement
• UNNEST not working in Amazon QuickSight
• Ability to select an index within an array
Thank You!

BDA308 Serverless Analytics with Amazon Athena and Amazon QuickSight, featuring TrueCar

  • 1. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Roy Ben-Alta, Principal Business Development, AWS David Giffin, Senior Vice President Technology Platform, TrueCar August 14, 2017 Serverless Analytics with Amazon Athena and Amazon QuickSight, Featuring TrueCar
  • 2. Agenda • Quick overview of Amazon Athena and Amazon QuickSight • TrueCar use case • Clickstream data implementation • Troubleshooting queries and dealing with errors • Using Amazon QuickSight to visualize clickstream data • Questions / Answers
  • 3. Legacy Data Architectures Exist as Isolated Data Silos Hadoop Cluster SQL Database Data Warehouse Appliance
  • 4.
  • 5. Evolution of Data Architectures 1985: Data Warehouse Appliances Benefits • Consolidated multiple decision support environments (i.e. databases) into a single architecture • Best performance available at time of conception, hence the expensive licenses • Worked well with structured, columnar data • Could build customized data marts on top Shared Storage Tier (NAS Appliance) Compute Node Compute Node Compute Node Compute Node • Proprietary software license paid per node per year • Gold-plated hardware available only from the vendor with per node per year cost Constraints • Proprietary software license paid per node per year • Gold-plated hardware available only from the vendor with per node per year cost • Could not handle unstructured data sets • Heavy ETL & data cleansing
  • 6. Evolution of Data Architectures 2006: Hadoop Clusters CPU Memory HDFS Storage Hadoop Master Node CPU Memory HDFS Storage CPU Memory HDFS Storage Improvements • Open source based software license!!! • Commodity white box servers!!!! • Could handle structured & unstructured data sets • Many different applications within the framework (MapReduce, Spark, Hive, Pig, HBase, Presto, etc.) Constraints • HDFS 3X replication to protect against node failure gets expensive at scale • 500 TB data set = 1.5 PB cluster • Local storage means you must scale and pay for CPU & memory resources when adding data capacity • General purpose, monolithic cluster with many different apps on same hardware • Still a data silo
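The "500 TB data set = 1.5 PB cluster" figure on the Hadoop slide follows directly from the 3X replication factor. A minimal sketch of that arithmetic; the function name is illustrative and not from the deck:

```python
def raw_capacity_tb(dataset_tb: float, replication_factor: int = 3) -> float:
    """Raw storage needed to hold a data set when every block is replicated."""
    return dataset_tb * replication_factor


# 500 TB of data at 3X replication needs 1500 TB (1.5 PB) of raw HDFS capacity.
print(raw_capacity_tb(500))
```

With storage decoupled into S3, as on the next slide, the billed storage is the 500 TB data set itself, which corresponds to a replication factor of 1 in this sketch.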
  • 7. Evolution of Data Architectures 2009: Decoupled EMR Architecture CPU Memory Hadoop Master Node CPU Memory CPU Memory Improvements • Decoupled storage & compute • Scale CPU and memory resources independently and up & down • Only pay for the 500 TB data set (not 3X) • Multi-physical facility replication via S3 • Multiple clusters can run in parallel against shared data in S3 • Each job gets its own optimized cluster. i.e. Spark on memory intensive, Hive on CPU intensive, HBase on I/O intensive, etc. Constraints • Still have a cluster to provision and manage • Must expose EMR cluster to SQL users via Hive, Presto, etc. S3 as HDFS
  • 8. Evolution of Data Architectures Today: Clusterless Improvements • No cluster/infrastructure to manage • Business users and analysts can write SQL without having to provision a cluster or touch infrastructure • Pay by the query • Zero Administration • Process data where it lives SQL Interface in web browser Athena for SQL S3 Data Lake Glue for ETL S3 Data Lake Spark & Hive Interface in web browser
  • 9. Data Lake reference architecture Catalog & Search Access and search metadata Access & User Interface Give your users easy and secure access DynamoDB Elasticsearch API Gateway Identity & Access Management Cognito QuickSight Amazon AI EMR Redshift Athena Kinesis RDS Central Storage Secure, cost-effective storage in Amazon S3 S3 Snowball Database Migration Service Kinesis Firehose Direct Connect Data Ingestion Get your data into S3 quickly and securely Protect and Secure Use entitlements to ensure data is secure and users’ identities are verified Processing & Analytics Use of predictive and prescriptive analytics to gain better understanding Security Token Service CloudWatch CloudTrail Key Management Service Glue ETL
  • 10. Serverless characteristics • No servers to provision or manage • Scales with usage • Never pay for idle • Availability and fault tolerance built in
  • 11. AWS analytics – serverless options • Data Ingestion and transformation • Amazon Kinesis Firehose • AWS Glue (New!) • SQL • Amazon Kinesis Analytics • Amazon Athena • Amazon Redshift Spectrum • Visualization • Amazon QuickSight
  • 12. Amazon Kinesis Firehose Load massive volumes of streaming data into Amazon S3, Amazon Redshift and Amazon Elasticsearch Zero administration: Capture and deliver streaming data into Amazon S3, Amazon Redshift and Amazon Elasticsearch without writing an application or managing infrastructure. Direct-to-data store integration: Batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 secs using simple configurations. Seamless elasticity: Seamlessly scales to match data throughput w/o intervention Capture and submit streaming data to Firehose Analyze streaming data using your favorite BI tools Firehose loads streaming data continuously into S3, Amazon Redshift and Amazon Elasticsearch
  • 13. Amazon Kinesis Firehose Pricing Simple, pay-as-you-go and no up-front costs Dimension Value *Per 1 GB of data ingested $0.029
  • 14. Amazon Athena is easy to use • Log in to the console • Create a table • Type in an Apache Hive DDL statement • Use the console Add Table wizard • AWS Glue Data Catalog • Start querying in console • JDBC allows BI tool access • Full REST API also available • Concurrency is a setting
  • 15. Familiar technologies under the covers Presto: used for SQL queries • In-memory distributed query engine • ANSI-SQL compatible with extensions Apache Hive: used for DDL functionality • Complex data types • Multitude of formats • Supports data partitioning
  • 16. Simple pricing – $5/TB scanned • Pay by the amount of data scanned per query • Ways to save costs • Compress • Convert to columnar format • Use partitioning • Free: DDL queries, failed queries
    Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
    Logs stored as text files | 1.15 TB | 237 seconds | 1.15 TB | $5.75
    Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
    Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
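The cost column on the pricing slide can be reproduced from the $5/TB rate and the data-scanned figures. A minimal sketch, assuming binary units (1 TB = 1024^4 bytes) and ignoring any per-query minimums or rounding in actual billing; the function and constant names are illustrative:

```python
PRICE_PER_TB_USD = 5.0       # rate quoted on the slide
BYTES_PER_GB = 1024 ** 3     # assumption: binary units
BYTES_PER_TB = 1024 ** 4


def athena_query_cost(bytes_scanned: int) -> float:
    """Estimated cost in USD of a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Figures from the slide: 1.15 TB scanned as text, 2.69 GB scanned as Parquet.
text_cost = athena_query_cost(int(1.15 * BYTES_PER_TB))
parquet_cost = athena_query_cost(int(2.69 * BYTES_PER_GB))
print(f"text: ${text_cost:.2f}, parquet: ${parquet_cost:.3f}")
```

Because the price depends only on bytes scanned, compression, columnar formats, and partitioning all lower cost the same way: by shrinking the amount of data each query reads.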
  • 17. Use Case - Log Aggregation AWS Service Logs Web Application Logs Server Logs Amazon S3 Amazon Athena Data Catalog New File Trigger Update table partition Create partition on S3 Copy to new partition Query data Amazon S3 AWS Lambda
  • 18. Use Case - Real-Time Data Collection Amazon S3 Amazon Athena Data Catalog Real-time events Store partitioned in S3 Trigger Lambda Update table partition Query data AWS Lambda Amazon Kinesis Firehose
  • 19. Use Case - Data Export Amazon S3 Amazon Athena Data Catalog Database Migration Exported tables in S3 Trigger Lambda Update table partition Query data AWS Lambda AWS Database Migration Service
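The three use cases above share a "Trigger Lambda, update table partition" step. One way that step could look is sketched below: it derives an `ALTER TABLE ... ADD PARTITION` statement from the S3 key of a newly arrived object. The table name, bucket name, and `year=/month=/day=` key layout are assumptions for illustration and do not come from the deck.

```python
import re

# Assumed key layout: <prefix>/year=YYYY/month=MM/day=DD/<file>
KEY_PATTERN = re.compile(r"year=(\d{4})/month=(\d{2})/day=(\d{2})/")


def add_partition_statement(table: str, bucket: str, key: str) -> str:
    """Build the DDL that registers the partition containing the given S3 object."""
    match = KEY_PATTERN.search(key)
    if match is None:
        raise ValueError(f"key is not under a year=/month=/day= prefix: {key}")
    year, month, day = match.groups()
    prefix = key[: match.end()]  # everything up to and including the day= folder
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}') "
        f"LOCATION 's3://{bucket}/{prefix}'"
    )


# Hypothetical object key, as it might appear in an S3 event notification.
stmt = add_partition_statement(
    "clickstream", "example-bucket", "events/year=2017/month=08/day=14/part-0000.gz"
)
print(stmt)
```

In a Lambda function, the resulting statement would then be submitted to Athena, for example through its StartQueryExecution API; that call is left out here so the sketch runs without AWS credentials.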
  • 20. Use Case - SaaS Model Amazon S3 Amazon Athena Data Catalog Query data Hot data Warm & cold data Application request
  • 21.
  • 22. Amazon QuickSight The benefits of cloud BI Integrated Fully managed and scalable Super fast and easy to use Cost-effective
  • 23. Amazon QuickSight – basic concepts Retail Data Ops Data Marketing Data Relational Databases Flat Files And Many Others!
  • 25. What’s New In QuickSight Enterprise Edition with AD support Athena Connector Scheduled Refresh Export to CSV KPI Charts AD Connector New Features Added Since 11/16 Audit Logging with CloudTrail Presto, Spark, Teradata Connectors Federated SSO With SAML 2.0 Relative Date Filters Launched 11/2016 Enterprise Analytics Data Excel Enhancement Redshift Spectrum Connector S3 Analytics Connector with Deep Linking Count Distinct
  • 26. Individual Standard Edition (60-day free trial) Enterprise Edition (60-day free trial) Price per user per month Free $9 (Annual) $12 (Month to Month) $18 (Annual) $24 (Month to Month) Number of users 1 2+ 2+ 2+ 2+ SPICE capacity (GB)* 1 10 10 10 10 Additional SPICE GB-month $0.25 $0.25 $0.38 Amazon QuickSight is a cost-effective solution
  • 27. Serverless Analytics with Amazon Athena and Amazon QuickSight 08.14.2017
  • 28. David Giffin • SVP, Technology Platform @ TrueCar • Infrastructure, Deployment, Business Intelligence and Data Warehouse Teams
  • 29. Moving to Amazon Athena TrueCar recently switched vendors for our clickstream data. Clickstream data is now collected by Google Analytics and imported daily into BigQuery. We use a MapReduce job to move the clickstream data from BigQuery to AWS (S3).
  • 30. Why Amazon Athena for Clickstream? • Athena provided a very simple mechanism to query large datasets • Low operational burden in cluster setup and maintenance • Amazon manages it for you!
  • 32. Our Clickstream Data • Current daily uncompressed file size: ~23 GB (~10 TB yearly) • We expect this number to go up by 20-40% each year We undertook the following steps to structure our data for optimal performance: • We compressed the data while extracting it from BigQuery to S3 (~2 GB daily file compared to ~23 GB uncompressed). The smaller data size reduces network traffic from S3 to Athena. • We partitioned the data on YEAR -> MONTH -> DAY, which reduces the amount of data scanned per query, thereby improving performance. • We are training analysts to use best practices when querying the data.
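Partitioning on YEAR -> MONTH -> DAY only reduces the data scanned when queries filter on those partition columns. The sketch below builds such a filter for a date range. The table name `clickstream` and the string-typed, zero-padded `year`, `month`, and `day` columns are assumptions; the deck does not show the actual schema.

```python
from datetime import date, timedelta


def partition_filter(start: date, end: date) -> str:
    """Build a WHERE-clause fragment that matches one partition per day in [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    clauses = []
    day = start
    while day <= end:
        clauses.append(
            f"(year = '{day.year:04d}' AND month = '{day.month:02d}' "
            f"AND day = '{day.day:02d}')"
        )
        day += timedelta(days=1)
    return " OR ".join(clauses)


# Hypothetical query restricted to two daily partitions.
query = (
    "SELECT COUNT(*) FROM clickstream WHERE "
    + partition_filter(date(2017, 8, 13), date(2017, 8, 14))
)
print(query)
```

With the daily file sizes quoted on the slide, a query restricted to two days this way would read roughly 4 GB of compressed data instead of the full history.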
  • 38. Solving Our Issue • Replacing '.' with '_'
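The transcript omits the slides leading up to this fix, so what exactly contained the dots is not shown. Assuming the problem was field names containing '.' in exported JSON records, a minimal sketch of the replacement could look like this; the function name and sample record are hypothetical:

```python
def sanitize_keys(value):
    """Recursively replace '.' with '_' in every dictionary key of a JSON-like value."""
    if isinstance(value, dict):
        return {key.replace(".", "_"): sanitize_keys(item) for key, item in value.items()}
    if isinstance(value, list):
        return [sanitize_keys(item) for item in value]
    return value  # scalars (strings, numbers, None) are left unchanged


# Hypothetical record with dotted field names at several nesting levels.
record = {"hits.page": {"page.path": "/home"}, "tags": [{"a.b": 1}]}
print(sanitize_keys(record))
```

Only keys are rewritten; string values such as URLs or version numbers keep their dots.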
  • 52. Dashboard with Amazon QuickSight
  • 53. Amazon QuickSight Lessons Learned • Very easy to use • Intuitive user interface for reporting and dashboarding • Simple to set up connections to Athena • Once your IAM roles are in place Issues we are working with AWS to resolve • Error messaging needs improvement • Athena JDBC driver needs improvement • Un-nest not working in Amazon QuickSight • Ability to select index with an array