© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Rahul Bhartia, Principal Product Manager, S3
March 29th 2018
Transforming Data Lakes with Amazon S3 Select & Amazon Glacier Select
What to expect from the session
1. Introduction
2. Use cases
3. Key features
Building a data lake with Amazon S3
Unmatched durability, availability, and scalability
Best security, compliance, and audit capability
Object-level control at any scale
Business insight into your data
Twice as many partner integrations
Most ways to bring data in
With a wide variety of in-place tools…
Amazon Athena
Amazon Redshift Spectrum
Amazon EMR
AWS Glue
Amazon S3
Today: Compute scales based…
on object size instead of the amount of data you want to process
DATA COMPUTE
Today: All of these tools…
retrieve a lot of data they don’t need and
do the heavy lifting
You have a choice of storage classes
Amazon S3 Standard: active data, milliseconds, from 2.1¢ per GB/mo.
Amazon S3 Standard-Infrequent Access: infrequently accessed data, milliseconds, 1.25¢ per GB/mo.
Amazon Glacier: archive data, minutes to hours, 0.4¢ per GB/mo.
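To make the price gap between the storage classes concrete, here is a small worked calculation using only the per-GB figures quoted on this slide (2018 list prices; S3 Standard is a "from" price, so treat its result as a lower bound). The 1,000 GB data set size and the function name are illustrative assumptions, and retrieval or request charges are not modeled.

```python
from decimal import Decimal

# Storage prices in cents per GB-month, exactly as quoted on the slide.
PRICE_CENTS_PER_GB_MONTH = {
    "Amazon S3 Standard": Decimal("2.1"),  # "from" price, lowest tier not modeled
    "Amazon S3 Standard-Infrequent Access": Decimal("1.25"),
    "Amazon Glacier": Decimal("0.4"),
}


def monthly_storage_cost_usd(size_gb, storage_class):
    """Return the storage-only monthly cost in US dollars for size_gb gigabytes."""
    cents = Decimal(size_gb) * PRICE_CENTS_PER_GB_MONTH[storage_class]
    return cents / Decimal(100)


if __name__ == "__main__":
    # Hypothetical 1,000 GB data set, chosen only to make the numbers easy to read.
    for name in PRICE_CENTS_PER_GB_MONTH:
        cost = monthly_storage_cost_usd(1000, name)
        print(f"{name}: ${cost:.2f} per month for 1,000 GB")
```

At these prices the same 1,000 GB costs roughly a fifth as much to keep in Amazon Glacier as in S3 Standard, which is why being able to query archived data without a full restore matters.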
Today: You need to…
restore the entire object from Amazon Glacier to Amazon S3
and then use it.
Amazon S3
Amazon Glacier
Introducing…
Amazon S3 Select and Amazon Glacier Select
Select subset of data from an object based on a SQL expression
Simple, faster, and cheaper!
Available as an API: no infrastructure or administration
Faster performance compared to doing it yourself
Pay as you go. The less you retrieve, the more you save.
Amazon S3 Select
Amazon S3 Select
Simple to use: standard SQL expression
Familiar: works and scales like GET requests
Integrated: AWS SDK and Presto (others coming soon)
Select contents from object instead of retrieving the object
Amazon S3 Select
Input
Format: delimited text (CSV, TSV), JSON …
Compression: GZIP …
Output
Format: delimited text (CSV, TSV), JSON …
Clauses: Select, From, Where
Data types: String; Integer, Float, Decimal; Timestamp; Boolean
Operators: Conditional, Math, Logical, String (Like, ||)
Functions: String, Cast, Math, Aggregate
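For delimited text without a header row, S3 Select expressions refer to columns by position (_1, _2, and so on). The following is a purely local sketch of what such a query returns, written with Python's csv module so it runs without AWS access. The function name and the sample rows are made up for illustration; the real filtering happens inside the service, not in client code like this.

```python
import csv
import io


def select_columns(csv_text, wanted, where=None):
    """Emulate SELECT o._i, ... FROM s3object o WHERE <predicate> on CSV text.

    wanted holds 1-based column positions, mirroring _1, _2, ...
    where is an optional function that receives the row as a list of strings.
    """
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if where is None or where(row):
            rows.append([row[i - 1] for i in wanted])
    return rows


# Invented sample data: source IP, method, path, tag.
sample = (
    "10.0.0.1,GET,/index.html,x\n"
    "10.0.0.2,PUT,/upload,y\n"
    "10.0.0.3,GET,/about,x\n"
)

# Rough local equivalent of: SELECT o._1 FROM s3object o WHERE o._4 = 'x'
matches = select_columns(sample, wanted=[1], where=lambda r: r[3] == "x")
print(matches)
```

Only the first column of the two matching rows comes back, which is the point of the feature: the caller receives the selected subset rather than the whole object.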
Amazon S3 Select: Simple pattern matches
…get-object …object… | awk -F, '{ if ($4 == "x") print $1 }'
…select-object …object… "SELECT o._1 FROM s3object o WHERE o._4 = 'x' …"
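The same pattern match can be expressed through the AWS SDK for Python. The sketch below only builds the keyword arguments for boto3's select_object_content call, so it runs without credentials; the helper name, bucket, and key are hypothetical, and the CSV settings assume a headerless, comma-delimited object.

```python
def build_select_request(bucket, key, column, match_column, match_value):
    """Build keyword arguments for an S3 Select call on a headerless CSV object.

    column and match_column are 1-based positions, as in o._1 and o._4.
    """
    if column < 1 or match_column < 1:
        raise ValueError("column positions are 1-based")
    # Escape single quotes by doubling them, the usual SQL string-literal rule.
    literal = match_value.replace("'", "''")
    expression = (
        f"SELECT o._{column} FROM s3object o "
        f"WHERE o._{match_column} = '{literal}'"
    )
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {"CSV": {"FileHeaderInfo": "NONE"}},
        "OutputSerialization": {"CSV": {}},
    }


# Hypothetical bucket and key names, used only for illustration.
request = build_select_request("example-bucket", "logs/part-0000.csv", 1, 4, "x")
print(request["Expression"])
```

With a real client, the request would be sent as s3_client.select_object_content(**request), where s3_client comes from boto3.client("s3").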
Amazon S3 Select: Serverless applications
Amazon S3, AWS Lambda, Amazon SNS, S3 Select
Amazon S3 Select: Serverless MapReduce
Before
200 seconds and 11.2 cents
# Download and process all keys
for key in src_keys:
    response = s3_client.get_object(Bucket=src_bucket, Key=key)
    contents = response['Body'].read().decode('utf-8')
    for line in contents.split('\n')[:-1]:
        line_count += 1
        try:
            data = line.split(',')
            srcIp = data[0][:8]
            …
After
95 seconds and 2.8 cents
# Select IP Address and Keys
for key in src_keys:
    response = s3_client.select_object_content(
        Bucket=src_bucket, Key=key, ExpressionType='SQL',
        Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object AS obj",
        InputSerialization={'CSV': {}}, OutputSerialization={'CSV': {}})
    # Results arrive as an event stream; join the record chunks before splitting
    contents = b''.join(event['Records']['Payload']
                        for event in response['Payload'] if 'Records' in event)
    for line in contents.decode('utf-8').splitlines():
        line_count += 1
        try:
            …
2X faster at about 1/4 of the cost
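A select_object_content response in boto3 is an event stream rather than a single body: record chunks, a statistics event, and an end marker arrive as separate events. The helper below is a minimal sketch of consuming that stream; the function name is hypothetical and the fake response is a hand-built stand-in so the example runs without AWS access.

```python
def read_select_response(response):
    """Collect record lines and scan statistics from a select_object_content response."""
    chunks = []
    stats = {}
    for event in response["Payload"]:
        if "Records" in event:
            # A record chunk can end mid-line, so join chunks before splitting.
            chunks.append(event["Records"]["Payload"])
        elif "Stats" in event:
            stats = event["Stats"]["Details"]
    lines = b"".join(chunks).decode("utf-8").splitlines()
    return lines, stats


# Hand-built stand-in for a real response (values are invented).
fake_response = {
    "Payload": [
        {"Records": {"Payload": b"10.0.0.1,GET\n10.0."}},
        {"Records": {"Payload": b"0.2,PUT\n"}},
        {"Stats": {"Details": {"BytesScanned": 1000,
                               "BytesProcessed": 1000,
                               "BytesReturned": 26}}},
        {"End": {}},
    ]
}

lines, stats = read_select_response(fake_response)
print(len(lines), "lines;", stats["BytesReturned"], "of", stats["BytesScanned"], "bytes returned")
```

Comparing BytesReturned with BytesScanned gives a quick read on how much data the function avoided pulling into Lambda.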
Amazon S3 Select: Accelerating big data
Up to 400% faster
Up to 80% cheaper
Before: Amazon S3
After: Amazon S3 with S3 Select
DEMO: Amazon S3 Select with Presto
Works with your existing Hive Metastore
Automatically converts predicates into S3 Select requests
Amazon S3
S3 Select
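To illustrate what converting predicates into S3 Select requests could look like, here is a toy translator from a projection and a list of simple predicates to an S3 Select SQL string. This is not Presto's actual implementation, which the slides do not show; the function name, the predicate tuple shape, and the choice to cast integer comparisons are all assumptions made for the sketch.

```python
ALLOWED_OPS = {"=", "<>", "<", ">", "<=", ">="}


def to_s3_select_sql(columns, predicates):
    """Translate 1-based column positions and (column, op, value) predicates to SQL.

    String values become quoted literals; integer values are compared after a CAST,
    because headerless CSV fields are read as strings.
    """
    projection = ", ".join(f"s._{c}" for c in columns) or "*"
    clauses = []
    for column, op, value in predicates:
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        if isinstance(value, str):
            literal = "'" + value.replace("'", "''") + "'"
            clauses.append(f"s._{column} {op} {literal}")
        elif isinstance(value, int) and not isinstance(value, bool):
            clauses.append(f"CAST(s._{column} AS INT) {op} {value}")
        else:
            raise TypeError("only string and integer values are handled in this sketch")
    sql = f"SELECT {projection} FROM s3object s"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql


pushed_down = to_s3_select_sql([1, 2], [(4, "=", "x"), (3, ">", 100)])
print(pushed_down)
```

The engine keeps evaluating joins and aggregations itself; only the row filter and column projection are handed to storage, which is where the reduction in transferred data comes from.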
Amazon S3 Select: Accelerating big data
Before vs. After
5X faster with 1/40 of the CPU
Amazon S3 Select: Will be supported by…
Amazon Athena
Amazon EMR
Amazon Redshift Spectrum
Amazon S3 Select: In Preview
• Formats: CSV, JSON
• Compression: GZIP
• Encryption: None
• Encoding: UTF-8
• Integration: AWS SDK for Java and Python, and the Presto connector
• Availability: Northern Virginia, Ohio, Oregon, Dublin, and Singapore
Apply at: https://pages.awscloud.com/amazon-s3-select-preview.html
Amazon Glacier Select
Amazon Glacier Select
Simple SQL expression: SELECT and WHERE
Familiar semantics: works and scales like RESTORE requests
Integrated: AWS SDK and CLI
Restore selective contents instead of restoring entire object
Amazon Glacier Select
Input
Format: delimited text (CSV, TSV, PSV, etc.)
Encryption: SSE-KMS, SSE-S3
Output
Format: delimited text (CSV, TSV, PSV, etc.)
Clauses: Select, From, Where
Data types: String; Integer, Float, Decimal; Timestamp; Boolean
Operators: Conditional, Math, Logical, String (Like, ||)
Functions: String, Cast
How to use Amazon Glacier Select?
Two ways to use Amazon Glacier Select
Using Glacier API: data directly uploaded to Amazon Glacier
Using S3 API: data that is lifecycled to Amazon Glacier from S3
How to use Amazon Glacier Select?
Current restore-object API arguments: Object, Tier
New (optional) restore-object API arguments to use Glacier Select: SQL query, Output S3 location, SNS topic
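Through the S3 API, the restore-object arguments map onto the RestoreRequest structure accepted by boto3's restore_object call. The sketch below only assembles that structure so it can run offline; the helper name, bucket, prefix, and query are hypothetical, CSV input and output are assumed, and notification setup (the SNS topic) is not shown.

```python
VALID_TIERS = ("Expedited", "Standard", "Bulk")


def build_glacier_select_restore(sql, output_bucket, output_prefix, tier="Standard"):
    """Build a RestoreRequest dict that runs a SQL query instead of a full restore."""
    if tier not in VALID_TIERS:
        raise ValueError(f"tier must be one of {VALID_TIERS}")
    return {
        "Type": "SELECT",          # select restore rather than a plain object restore
        "Tier": tier,              # retrieval speed, as with existing restores
        "SelectParameters": {
            "InputSerialization": {"CSV": {}},
            "ExpressionType": "SQL",
            "Expression": sql,     # the SQL query
            "OutputSerialization": {"CSV": {}},
        },
        "OutputLocation": {        # where the query result is written in S3
            "S3": {"BucketName": output_bucket, "Prefix": output_prefix}
        },
    }


# Hypothetical query and output location, used only for illustration.
restore_request = build_glacier_select_restore(
    "SELECT s._1 FROM object s WHERE s._4 = 'id'",
    output_bucket="example-results-bucket",
    output_prefix="glacier-select/",
    tier="Bulk",
)
print(restore_request["Type"], restore_request["Tier"])
```

With a real client this would be passed as s3_client.restore_object(Bucket=..., Key=..., RestoreRequest=restore_request); the table name used in the FROM clause should be checked against the current Glacier Select SQL reference.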
Using Glacier Select
…restore-object …object… | …get-object …object… | awk -F, '{ if ($4 == "id") print $1 }'
…restore-object …object… "SELECT o._1 … WHERE o._4 = 'id' …" | …get-object …object…
Amazon Glacier Select: GA
• Formats: CSV, any delimiter-separated file
• Encryption: SSE-KMS, SSE-S3
• Encoding: UTF-8
• Integration: AWS SDK, CLI, Athena integration (expected 2018)
• Availability: All commercial regions where Amazon Glacier is launched
Summary
Amazon S3 Select and Amazon Glacier Select
Select subset of data from an object based on a SQL expression
Now: Your data lake on AWS
Simple, faster, cheaper
Amazon Glacier, Amazon S3
SELECT
Amazon Redshift Spectrum, Amazon Athena, Amazon EMR, AWS Lambda, ISVs and custom applications
Thank you
