AWS Big Data and Analytics
Amazon Athena Deep Dive
Greg Khairallah
AWS Business Development
Kurt Lee
Vingle Inc
Legacy data architectures exist as isolated data silos
Hadoop Cluster
SQL Database
Data Warehouse
Amazon S3 as your persistent data store
• Amazon S3
• Designed for 99.999999999% durability
• Separate compute and storage
• Resize and shut down clusters with no data loss
• Match use case to Analytics services
• Easily evolve your analytic infrastructure as technology evolves
[Diagram: multiple EMR clusters, Amazon Athena, and Amazon Redshift all reading from a shared Amazon S3 data store]
Data Ingestion into S3
AWS Direct Connect
AWS Snowball
ISV Connectors
Amazon Kinesis Firehose
AWS Storage Gateway
S3 Transfer Acceleration
Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale
Elastic & highly available
On-demand, pay-per-query
High concurrency: multiple clusters access the same data
Query data in place using open file formats
Full Amazon Redshift SQL support
Amazon EMR – Hadoop, Spark, Presto in the Cloud
• Managed platform
• Launch a cluster in minutes
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
• Flexibility to customize
Amazon Athena is an interactive query service
that makes it easy to analyze data directly from
Amazon S3 using Standard SQL
AWS Glue – Coming Soon
Data Catalog
• Hive metastore-compatible metadata repository of data sources.
• Crawls data sources to infer tables, data types, and partition formats.
Job Execution
• Runs jobs in Spark containers – automatic scaling based on SLA.
• Glue is serverless – only pay for the resources you consume.
Job Authoring
• Generates Python code to move data from source to destination.
• Edit with your favorite IDE; share code snippets using Git.
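The crawler's partition-format inference can be sketched in a few lines: Hive-style S3 keys embed partition columns as `key=value` path segments, and a crawler only has to pull those out. This is an illustrative sketch of the idea, not Glue's actual code:

```python
# Simplified sketch of how a crawler might infer partition columns from
# Hive-style S3 keys (key=value path segments). Illustrative only.
def infer_partitions(s3_key):
    """Return ordered (name, value) partition pairs found in an S3 key."""
    pairs = []
    for segment in s3_key.split("/"):
        if "=" in segment:
            name, _, value = segment.partition("=")
            pairs.append((name, value))
    return pairs

pairs = infer_partitions("logs/year=2017/month=06/day=01/events.json.gz")
# pairs -> [("year", "2017"), ("month", "06"), ("day", "01")]
```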
Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Private VPC endpoints to Amazon S3
• Pre-signed S3 URLs
Encryption
• SSL endpoints
• Server-Side Encryption (SSE-S3)
• S3 Server-Side Encryption with provided keys (SSE-C, SSE-KMS)
• Client-side Encryption
Audit & Compliance
• Bucket access logs
• Lifecycle Management Policies
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3, etc.
Implement the right cloud security controls
How we created and tamed (or were tamed by)
the monster called big data
이상현 Kurt Lee
Vingle Inc
https://www.vingle.net
iOS / Frontend / Backend
Technical Leader
kurt@vingle.net
https://github.com/breath103
1. How to collect
2. Where to store
3. How to query
From millions of clients, with 3 developers, without losing data, cheaply
Really stably, at TB scale
With plain SQL, hopefully
Creation of the monster
“Give me the things that I like”
(without asking me directly)
= Recommendation
At the beginning…
Direct Actions
(Like / Clip / Write / Block / Follow)
Really Complicated SQL
How about indirect actions?
{
  content: {
    type: 'post',
    id: 12345,
    position_x: 2,
    position_y: 2
  },
  referral: {
    category: 'newsfeed',
    area: 'newsfeed'
  },
  action: {
    type: 'impression',
  }
}
{
  content: {
    type: 'post',
    id: 12345,
  },
  referral: {
    category: 'newsfeed',
    area: 'newsfeed',
    resource_id: '12345'
  },
  action: {
    type: 'read',
  }
}
{
  content: {
    type: 'webpage',
    id: "http://www.rog..",
  },
  referral: {
    category: 'card_show',
    area: 'card_show',
    resource_id: '12345'
  },
  action: {
    type: 'read',
    duration: 5.6,
  }
}
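Records like the ones above are usually shipped as newline-delimited JSON, so that the files Kinesis Firehose delivers to S3 are directly readable by Athena. A minimal sketch of that serialization step (the `encode_event` helper and the stream name are illustrative, not Vingle's actual code):

```python
import json

def encode_event(event):
    """Serialize one tracking event as a single newline-terminated JSON line,
    the shape Firehose delivers to S3 and Athena can read back."""
    return (json.dumps(event, separators=(",", ":")) + "\n").encode("utf-8")

impression = {
    "content": {"type": "post", "id": 12345, "position_x": 2, "position_y": 2},
    "referral": {"category": "newsfeed", "area": "newsfeed"},
    "action": {"type": "impression"},
}
record = encode_event(impression)

# A collector (e.g. a Lambda behind API Gateway) would then send it with boto3:
# boto3.client("firehose").put_record(
#     DeliveryStreamName="event-stream", Record={"Data": record})
```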
30,000
Records per minute
24,000,000
Bytes per minute
Collect & Store
a. When Fluentd fails, the “Service” goes down
b. Why are we spending on EC2 just for a simple proxy API call?
c. Do you even know how many records are coming in?
d. Are we losing data or not?
e. Using Ruby just to make network calls? Really?
f. I want microservices
Fully managed
Auto scaling (Almost infinite)
Monitorable
30,000
Records per minute
24,000,000
Bytes per minute
Don’t forget that we’re a startup
Taming monster
Redshift is magical
Other solutions?
Even Redshift can get slow
1. Too much old data (older than 3 months)
2. We don’t use it “THAT” intensively
3. “Managing” a “Cluster”
4. So many “simple aggregation” jobs
(count - group by - 1 day / 1 hour / 5 minutes)
5. Missing the Hadoop ecosystem
Even Redshift can get slow
Can’t I just use “SQL” to query JSON in S3?
Vingle
User
Contents
If I needed to rebuild “EVERYTHING”
1. Use Lambda / Kinesis Firehose to Collect Data
2. Put “EVERYTHING” at S3
3. Use Athena
1. Intensive / frequent queries -> Redshift
2. Hadoop ecosystem -> AWS EMR
3. Complicated data pipelines -> AWS Data Pipeline
(Luigi / Oozie / Airflow..)
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Amazon Athena
Deep Dive
No servers to provision or manage
Scales with usage
Never pay for idle
Availability and fault tolerance built in
Serverless characteristics
Simple Pricing - $5/TB Scanned
• Pay by the amount of data scanned per query
• Ways to save costs
• Compress
• Convert to Columnar format
• Use partitioning
• Free: DDL Queries, Failed Queries
Dataset                               | Size on Amazon S3     | Query Run Time | Data Scanned          | Cost
Logs stored as text files             | 1.15 TB               | 237 seconds    | 1.15 TB               | $5.75
Logs stored in Apache Parquet format* | 130 GB                | 5.13 seconds   | 2.69 GB               | $0.013
Savings                               | 87% less with Parquet | 34x faster     | 99% less data scanned | 99.7% cheaper
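The $5/TB figure makes the table above easy to verify. A small helper, illustrative only (it also ignores Athena's 10 MB per-query scan minimum):

```python
def athena_query_cost(bytes_scanned, price_per_tb=5.0):
    """Athena pricing sketch: $5 per TB of data scanned per query."""
    return bytes_scanned / (1024 ** 4) * price_per_tb

# 1.15 TB of raw text scanned:
text_cost = athena_query_cost(1.15 * 1024 ** 4)     # ≈ $5.75
# 2.69 GB scanned after Parquet conversion:
parquet_cost = athena_query_cost(2.69 * 1024 ** 3)  # ≈ $0.013
```

The savings come entirely from scanning less data: compression, columnar formats, and partitioning all shrink `bytes_scanned`, and the cost falls with it.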
Familiar Technologies Under the Covers
Presto
• Used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Hive
• Used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning
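Concretely, the Hive side is what lets a table over JSON events like the earlier examples declare nested struct columns and partitions. A hedged example, held as a Python string so it could be submitted programmatically; all table, column, and bucket names are illustrative:

```python
# Illustrative Hive DDL for JSON events of the shape shown earlier.
# Struct columns model the nested JSON; the `dt` partition column lets
# queries scan only the days they need.
CREATE_EVENTS_TABLE = """
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  content  struct<type:string, id:string, position_x:int, position_y:int>,
  referral struct<category:string, area:string, resource_id:string>,
  action   struct<type:string, duration:double>
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-event-bucket/events/'
"""
```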
Popular Customer Use Cases
Log Aggregation
AWS Service Logs
Web Application Logs
Server Logs
S3
Athena
Data Catalog
New File
Trigger
Update table partition
Create partition
on S3
Copy to new
partition
Query data
S3
Lambda
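The "update table partition" step in this flow is typically an `ALTER TABLE … ADD PARTITION` statement that the Lambda function submits to Athena for each new file. A sketch of building that DDL (table, column, and bucket names are illustrative; the actual submission via boto3's `start_query_execution` is shown commented out):

```python
def add_partition_ddl(table, partition, location):
    """Build the Athena/Hive DDL that registers a new S3 partition.

    partition: ordered mapping of partition column -> value
    location:  s3:// prefix holding that partition's files
    """
    spec = ", ".join(f"{k}='{v}'" for k, v in partition.items())
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")

ddl = add_partition_ddl(
    "access_logs",
    {"year": "2017", "month": "06", "day": "01"},
    "s3://my-log-bucket/logs/year=2017/month=06/day=01/",
)

# The Lambda handler would then submit it, e.g.:
# boto3.client("athena").start_query_execution(
#     QueryString=ddl,
#     ResultConfiguration={"OutputLocation": "s3://my-query-results/"})
```

Using `IF NOT EXISTS` keeps the call idempotent, which matters when the same S3 event can trigger the function more than once.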
Log Aggregation with ETL
AWS Service Logs
Web Application Logs
Server Logs
S3
Athena
Data Catalog
Glue
Crawler
Update table partition
Create partition
on S3
Query data
S3
Glue ETL
Real-Time Data Collection
S3
Athena
Data Catalog
Real-time events
Store partitioned in S3
Trigger Lambda
Update table partition
Query data
Lambda
Kinesis
Data Export
S3
Athena
Data Catalog
Database Migration
Exported tables in S3
Trigger Lambda
Update table partition
Query data
Lambda
Database Migration
Service
SaaS Model
S3
Athena
Data Catalog
Query data
Hot data
Warm & cold data
Application request
Analytics Reporting*
Athena
Data Catalog
Redshift
Spectrum
EMR
QuickSight
API
Summary of AWS Analytics, Database & AI Tools
Amazon Redshift
Enterprise Data Warehouse
Amazon EMR
Hadoop/Spark
Amazon Athena
Clusterless SQL
Amazon Glue
Clusterless ETL
Amazon Aurora
Managed Relational Database
Amazon Machine Learning
Predictive Analytics
Amazon QuickSight
Business Intelligence/Visualization
Amazon Elasticsearch Service
Elasticsearch
Amazon ElastiCache
Redis In-memory Datastore
Amazon DynamoDB
Managed NoSQL Database
Amazon Rekognition
Deep Learning-based Image Recognition
Amazon Lex
Voice or Text Chatbots
Thank you!

2017 AWS DB Day | Amazon Athena: Introduction to the Latest Service Features
