© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
John Mallory
AWS Storage Business Development
Building Data Lakes That Cost Less & Deliver Results Faster
What to expect from the session
1. Defining the Data Lake
2. Reducing Costs
3. Increasing Performance
4. Planning for the Future
Rethink how to become a data-driven business
• Business outcomes
• Experimentation
• Agile and timely
Traditionally, analytics looked like this (duplication & sprawl)
[Diagram labels: Hadoop, Spark, NoSQL, storage arrays, databases, data warehouse, structured data, SQL, raw data, ETL, advanced analytics]
Defining the AWS data lake
A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets.
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
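As an illustration of the "schema on read" attribute, here is a minimal Python sketch in which raw delimited text is stored untyped and column names and types are applied only when the data is read. The sample log lines, the `SCHEMA` list, and the `read_with_schema` helper are all hypothetical and not part of the original session.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw data as it might sit in the lake: untyped delimited text.
RAW_LOG = (
    "2018-03-01T12:00:00,10.0.0.1,200,512\n"
    "2018-03-01T12:00:05,10.0.0.2,404,0\n"
)

# The schema lives with the reader, not with the stored bytes.
SCHEMA = [
    ("ts", datetime.fromisoformat),
    ("src_ip", str),
    ("status", int),
    ("bytes_sent", int),
]


def read_with_schema(raw_text, schema):
    """Parse raw delimited text, applying column names and types at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append(
            {name: cast(value) for (name, cast), value in zip(schema, fields)}
        )
    return rows


records = read_with_schema(RAW_LOG, SCHEMA)
```

A different consumer could read the same bytes with a different schema, which is the point of deferring the schema to read time.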
AWS Data Lake Components
Any analytic workload, any scale, at the lowest possible cost
Insights: QuickSight | SageMaker | Comprehend
Analytics: Redshift + Spectrum (DW) | EMR (big data processing) | Athena (interactive) | Elasticsearch Service and Kinesis Data Analytics (real-time)
Data Lake: Glue (ETL & Data Catalog) | S3/Glacier (Storage)
Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams
Reasons to choose Amazon S3 for data lake
• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capability
• Object-level control at any scale
• Business insight into your data
• Twice as many partner integrations
• Most ways to bring data in
Reducing Data Lake Costs
Optimize costs with data tiering
Hot to cold: HDFS | Amazon S3 Standard | Amazon S3 Infrequent Access | Amazon Glacier
• Use EMR/Hadoop with local HDFS for hottest data sets
• Store cooler data in S3 and cold in Glacier to reduce costs
• Use S3 Analytics to optimize tiering strategy
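The tiering idea above could be expressed as an S3 lifecycle configuration. The sketch below only builds the configuration dictionary; the 30-day and 90-day thresholds, the `logs/` prefix, and the bucket name in the comment are assumptions chosen for illustration, not values from the session.

```python
def tiering_lifecycle(prefix, ia_after_days=30, glacier_after_days=90):
    """Build an S3 lifecycle configuration that moves objects to colder tiers."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = tiering_lifecycle("logs/")
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-example-bucket", LifecycleConfiguration=config)
```

The actual thresholds would normally come from observed access patterns, for example the storage class analysis that S3 Analytics provides.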
Process data in place…
Amazon Athena | Amazon Redshift Spectrum | Amazon EMR | AWS Glue
Amazon S3
Amazon EMR: Decouple compute & storage
• Highly distributed processing frameworks such as Hadoop/Spark
• Compress datasets
• Columnar file formats
• Aggregate small files
• S3DistCp "--groupBy" option
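To show what group-by style aggregation of small files means, here is a small Python sketch that groups object keys by the text captured in a regular expression, so that each group could then be concatenated into one larger file. This is not S3DistCp itself; the `group_by_pattern` helper, the key names, and the choice to skip non-matching keys are assumptions of the sketch.

```python
import re
from collections import defaultdict


def group_by_pattern(keys, pattern):
    """Group object keys by the text captured by the first group of `pattern`.

    In this sketch, keys that do not match the pattern are simply left out.
    """
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for key in keys:
        match = regex.search(key)
        if match:
            groups[match.group(1)].append(key)
    return dict(groups)


# Hypothetical small hourly log files.
keys = [
    "logs/2018-03-01-00.log",
    "logs/2018-03-01-01.log",
    "logs/2018-03-02-00.log",
    "logs/readme.txt",
]
# Capture the date so that all hourly files of a day land in one group.
daily = group_by_pattern(keys, r"logs/(\d{4}-\d{2}-\d{2})-\d{2}\.log")
```

Fewer, larger files generally mean fewer requests and less per-file overhead for the processing frameworks that read them.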
Amazon Redshift Spectrum: Exabyte-scale query-in-place
• Structured data w/ joins
• Multiple on-demand clusters scale concurrency
• Columnar file formats
• Data partitioning
• Better query performance with predicate pushdown
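Data partitioning helps because a query that filters on a partition column only has to read the matching prefixes. The following Python sketch illustrates that pruning step with Hive-style `name=value` key segments; the key layout and the `parse_partitions` and `prune` helpers are hypothetical and do not represent how Redshift Spectrum is implemented.

```python
def parse_partitions(key):
    """Extract Hive-style `name=value` path segments from an object key."""
    parts = {}
    for segment in key.split("/"):
        name, sep, value = segment.partition("=")
        if sep:
            parts[name] = value
    return parts


def prune(keys, **wanted):
    """Keep only keys whose partition values match every requested value."""
    selected = []
    for key in keys:
        parts = parse_partitions(key)
        if all(parts.get(name) == value for name, value in wanted.items()):
            selected.append(key)
    return selected


# Hypothetical partitioned layout.
keys = [
    "sales/year=2017/month=12/part-0000.parquet",
    "sales/year=2018/month=01/part-0000.parquet",
    "sales/year=2018/month=02/part-0000.parquet",
]
# A filter on year and month leaves a single object to scan.
to_scan = prune(keys, year="2018", month="01")
```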
Amazon Athena: Query without ETL
• Serverless service
• Schema on read
• Compress datasets
• Columnar file formats
• Optimize file sizes
• Optimize querying (Presto backend)
• Query data in Glacier (coming)
Use the right data formats
• Pay by the amount of data scanned per query
• Use compressed columnar formats
• Parquet
• ORC
• Easy to integrate with wide variety of tools
Dataset                                 Size on Amazon S3       Query run time   Data scanned            Cost
Logs stored as text files               1 TB                    237 seconds      1.15 TB                 $5.75
Logs stored in Apache Parquet format*   130 GB                  5.13 seconds     2.69 GB                 $0.013
Savings                                 87% less with Parquet   34x faster       99% less data scanned   99.7% cheaper
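The cost column follows directly from paying per amount of data scanned. A short Python sketch of that arithmetic is shown here; the rate of $5 per TB is an assumption inferred from the table itself ($5.75 for 1.15 TB), and binary units (1 TB = 1024 GB) are assumed for the conversion.

```python
# Assumption: price per TB scanned, inferred from $5.75 for 1.15 TB in the table.
PRICE_PER_TB_USD = 5.00


def scan_cost_usd(terabytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost of a query that is billed by the amount of data it scans."""
    return terabytes_scanned * price_per_tb


text_cost = scan_cost_usd(1.15)            # logs stored as text files
parquet_cost = scan_cost_usd(2.69 / 1024)  # 2.69 GB expressed in TB

print(f"text: ${text_cost:.2f}, parquet: ${parquet_cost:.3f}")
```

Under these assumptions the two results round to the $5.75 and $0.013 shown in the table.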
Today: All of these tools…
retrieve a lot of data they don’t need and
do the heavy lifting
Today: You need to…
restore the entire object from Amazon Glacier to Amazon S3 and then use it.
Amazon
S3
Amazon
Glacier
Introducing…
Amazon S3 Select and Amazon Glacier Select
Select a subset of data from an object based on a SQL expression
Motivation Behind S3 Select
GET all the data from S3 objects, and my application will filter the data that I need
Redshift Spectrum Example:
• Beta customer: Run 50,000 queries
• Amount of data fetched from S3: 6 PB
• Amount of data used in Redshift: 650 TB
Data needed from S3: 10%
Amazon S3 Select
SELECT a filtered set of data from within an object using standard SQL Statements
• First content-aware API within Amazon S3
• Unlike Amazon Athena and Spectrum, operates within the Amazon S3 system
• SQL Statement operates on a per-object basis—not across a group of objects
• Works and scales like GET requests
• Accessible via SDK (Java, Python), AWS CLI and Presto Connector—others to follow
• Who will use it?
• Amazon Redshift Spectrum, Amazon Athena, Presto and other custom Query engines
• Everyone doing log mining
Amazon S3 Select
Input
Format: delimited text (CSV, TSV), JSON …
Compression: GZIP …
Output
Format: delimited text (CSV, TSV), JSON …
Clauses   Data types                Operators           Functions
Select    String                    Conditional         String
From      Integer, Float, Decimal   Math                Cast
Where     Timestamp                 Logical             Math
          Boolean                   String (Like, ||)   Aggregate
Amazon S3 Select: Simple pattern matches
…get-object …object… | awk -F',' '{ if ($4 == "x") print $1 }'
…select-object …object… "SELECT o._1 FROM S3Object o WHERE o._4 = 'x'…"
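The same pattern match can be sketched from Python. The function below assumes an S3 client object that exposes `select_object_content` the way the boto3 S3 client does; the `select_first_column` helper, the bucket and key names, and the `FakeS3Client` stand-in (used so the sketch runs without AWS access) are hypothetical.

```python
def select_first_column(s3_client, bucket, key, value):
    """Return column 1 of every CSV row whose column 4 equals `value`."""
    # Naive quoting for illustration only: double any single quote in `value`.
    literal = value.replace("'", "''")
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=f"SELECT s._1 FROM S3Object s WHERE s._4 = '{literal}'",
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}},
        OutputSerialization={"CSV": {}},
    )
    # The response payload is a stream of events; only 'Records' carry data.
    chunks = [
        event["Records"]["Payload"]
        for event in response["Payload"]
        if "Records" in event
    ]
    return b"".join(chunks).decode("utf-8").splitlines()


class FakeS3Client:
    """Stand-in for a real S3 client so the sketch runs without AWS access."""

    def select_object_content(self, **kwargs):
        self.last_request = kwargs
        return {
            "Payload": [
                {"Records": {"Payload": b"alpha\nbe"}},
                {"Records": {"Payload": b"ta\n"}},
                {"Stats": {}},
                {"End": {}},
            ]
        }


fake = FakeS3Client()
rows = select_first_column(fake, "example-bucket", "logs/data.csv", "x")
```

In real use, the fake would be replaced by a client created with `boto3.client("s3")`. Record chunks can split a row across events, which is why the bytes are joined before decoding.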
Amazon S3 Select: Serverless applications
[Diagram: Amazon S3, Lambda trigger, AWS Lambda, S3 Select, Amazon SNS]
Amazon S3 Select: Serverless MapReduce
Before: 200 seconds and 11.2 cents
# Download and process all keys
for key in src_keys:
    response = s3_client.get_object(Bucket=src_bucket, Key=key)
    contents = response['Body'].read().decode('utf-8')
    for line in contents.split('\n')[:-1]:
        line_count += 1
        try:
            data = line.split(',')
            srcIp = data[0][:8]
            ….
After: 95 seconds and 2.8 cents
# Select IP Address and Keys
for key in src_keys:
    response = s3_client.select_object_content(
        Bucket=src_bucket, Key=key, ExpressionType='SQL',
        Expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object as obj",
        InputSerialization={'CSV': {}},
        OutputSerialization={'CSV': {}})
    # The result is an event stream; only 'Records' events carry row data
    contents = b''.join(event['Records']['Payload']
                        for event in response['Payload']
                        if 'Records' in event).decode('utf-8')
    for line in contents.splitlines():
        line_count += 1
        try:
            ….
2X Faster at 1/5 of the cost
Amazon S3 Select with Presto
Works with your existing Hive Metastore
Automatically converts predicates into S3 Select requests
Amazon S3
S3 Select
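As a rough illustration of what converting predicates into S3 Select requests could look like, the Python sketch below turns a list of projected columns and simple comparisons into an S3 Select SQL string. It is not Presto's implementation; the `to_s3_select` helper, its tuple format, and the choice to cast integer comparisons are assumptions made for this example.

```python
def to_s3_select(columns, predicates):
    """Translate projected columns and simple predicates into S3 Select SQL.

    `columns` holds 1-based column positions; `predicates` holds
    (position, operator, literal) tuples that are combined with AND.
    Only string and integer literals are handled in this sketch.
    """
    allowed = {"=", "<", ">", "<=", ">=", "!="}
    select_list = ", ".join(f"s._{c}" for c in columns) or "*"
    clauses = []
    for position, op, literal in predicates:
        if op not in allowed:
            raise ValueError(f"unsupported operator: {op}")
        if isinstance(literal, str):
            # Quote string literals, doubling embedded single quotes.
            rendered = "'" + literal.replace("'", "''") + "'"
            clauses.append(f"s._{position} {op} {rendered}")
        elif isinstance(literal, int) and not isinstance(literal, bool):
            # CSV fields arrive as strings, so cast before comparing numbers.
            clauses.append(f"CAST(s._{position} AS INT) {op} {literal}")
        else:
            raise TypeError(f"unsupported literal: {literal!r}")
    sql = f"SELECT {select_list} FROM S3Object s"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql


query = to_s3_select([1, 2], [(4, "=", "x"), (3, ">=", 200)])
```

Pushing both the projection and the filter into the request is what lets the engine receive only the rows and columns it needs.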
Amazon S3 Select: Accelerating big data
[Charts: before and after]
5X Faster with 1/40 of the CPU
Amazon S3 Select: Will be supported by…
Amazon Athena | Amazon EMR | Amazon Redshift Spectrum
Using Amazon Glacier Select
How Amazon Glacier Select Works
Delivering Results Faster
Optimizing data lake performance
• Aggregate small files: EMR S3DistCp, Amazon Kinesis Firehose
• S3 Select: big data cheaper, faster; up to 400% faster
• Data formats: columnar formats
• EMRFS consistent view: Amazon S3 with Amazon DynamoDB
Amazon Kinesis: Real Time
Easily collect, process, and analyze video and data streams in real time
• Kinesis Video Streams (new): capture, process, and store video streams for analytics
• Kinesis Data Streams: build custom applications that analyze data streams
• Kinesis Data Firehose: load data streams into AWS data stores
• Kinesis Data Analytics: analyze data streams with SQL
Data preparation accounts for ~80% of the work
[Chart categories: building training sets, cleaning and organizing data, collecting data sets, mining data for patterns, refining algorithms, other]
AWS Glue: Serverless data catalog & ETL service
Data Catalog: discover data and extract schema
• Automatically discovers data and stores schema
• Data searchable, and available for ETL
ETL job authoring: auto-generates customizable ETL code in Python and Spark
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
Amazon SageMaker (GA)
The quickest and easiest way to get ML models from idea to production
• End-to-end machine learning platform
• Zero setup
• Flexible model training
• Pay by the second
Planning for the Future
Collect | Store | Analyze | Visualize
[Architecture diagram. Data types: transactional, file, stream, and search data. Sources: web apps, mobile apps (iOS, Android), IoT applications, logging (Logstash). Services shown: Amazon RDS, Amazon DynamoDB, Amazon ES, Amazon S3, Amazon Glacier, Amazon Kinesis, Apache Kafka, Amazon ElastiCache, Amazon Redshift, Amazon Elastic MapReduce, Impala, Pig, AWS Lambda, Amazon ML, Amazon QuickSight. Processing styles: stream processing, batch, interactive, ML, ETL. Outputs: analysis & visualization, notebooks, predictions, apps & APIs, IDE. Temperature and speed labels: hot, warm, cold, fast, slow.]
Evolve As Needed!
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

More Related Content

What's hot

Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Amazon Web Services
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Amazon Web Services
 
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
Amazon Web Services
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Amazon Web Services
 
BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWS
Amazon Web Services
 
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Amazon Web Services
 
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Web Services
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Amazon Web Services
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28
Amazon Web Services
 
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Amazon Web Services
 
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Amazon Web Services
 
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
Amazon Web Services
 
Data Warehousing and Data Lake Analytics, Together - AWS Online Tech Talks
Data Warehousing and Data Lake Analytics, Together - AWS Online Tech TalksData Warehousing and Data Lake Analytics, Together - AWS Online Tech Talks
Data Warehousing and Data Lake Analytics, Together - AWS Online Tech Talks
Amazon Web Services
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Amazon Web Services
 
SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ...
 SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ... SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ...
SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ...
Amazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Amazon Web Services
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
Amazon Web Services
 
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Amazon Web Services
 
Visualization with Amazon QuickSight
Visualization with Amazon QuickSightVisualization with Amazon QuickSight
Visualization with Amazon QuickSight
Amazon Web Services
 

What's hot (20)

Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
 
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
Building Your First Serverless Data Lake (ANT356-R1) - AWS re:Invent 2018
 
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
 
BDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWSBDA305 Building Data Lakes and Analytics on AWS
BDA305 Building Data Lakes and Analytics on AWS
 
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
Effective Data Lakes: Challenges and Design Patterns (ANT316) - AWS re:Invent...
 
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
Amazon Athena: What's New and How SendGrid Innovates (ANT324) - AWS re:Invent...
 
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
Building Serverless Analytics Pipelines with AWS Glue (ANT308) - AWS re:Inven...
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28
 
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
 
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
 
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
 
Data Warehousing and Data Lake Analytics, Together - AWS Online Tech Talks
Data Warehousing and Data Lake Analytics, Together - AWS Online Tech TalksData Warehousing and Data Lake Analytics, Together - AWS Online Tech Talks
Data Warehousing and Data Lake Analytics, Together - AWS Online Tech Talks
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
 
SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ...
 SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ... SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ...
SRV307 Applying AWS Purpose-Built Database Strategy: Match Your Workload to ...
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018
 
Visualization with Amazon QuickSight
Visualization with Amazon QuickSightVisualization with Amazon QuickSight
Visualization with Amazon QuickSight
 

Similar to Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Tech Talks

Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
SasikumarPalanivel3
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
saidbilgen
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Amazon Web Services
 
Transforming Data Lakes with Amazon S3 Select & Amazon Glacier Select - AWS O...
Transforming Data Lakes with Amazon S3 Select & Amazon Glacier Select - AWS O...Transforming Data Lakes with Amazon S3 Select & Amazon Glacier Select - AWS O...
Transforming Data Lakes with Amazon S3 Select & Amazon Glacier Select - AWS O...
Amazon Web Services
 
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
Amazon Web Services
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
Amazon Web Services
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Amazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesBuild Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Amazon Web Services
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWS
Adir Sharabi
 
Data Warehouses & Data Lakes: Data Analytics Week at the SF Loft
Data Warehouses & Data Lakes: Data Analytics Week at the SF LoftData Warehouses & Data Lakes: Data Analytics Week at the SF Loft
Data Warehouses & Data Lakes: Data Analytics Week at the SF Loft
Amazon Web Services
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Amazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
Amazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
Amazon Web Services
 

Similar to Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Tech Talks (20)

Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
 
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdfBuilding+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
Building+your+Data+Project+on+AWS+-+Luke+Anderson.pdf
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
Transforming Data Lakes with Amazon S3 Select & Amazon Glacier Select - AWS O...
Transforming Data Lakes with Amazon S3 Select & Amazon Glacier Select - AWS O...Transforming Data Lakes with Amazon S3 Select & Amazon Glacier Select - AWS O...
Transforming Data Lakes with Amazon S3 Select & Amazon Glacier Select - AWS O...
 
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
Big Data Breakthroughs: Process and Query Data In Place with Amazon S3 Select...
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesBuild Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWS
 
Data Warehouses & Data Lakes: Data Analytics Week at the SF Loft
Data Warehouses & Data Lakes: Data Analytics Week at the SF LoftData Warehouses & Data Lakes: Data Analytics Week at the SF Loft
Data Warehouses & Data Lakes: Data Analytics Week at the SF Loft
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
Building a Data Lake in Amazon S3 & Amazon Glacier (STG401-R1) - AWS re:Inven...
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 

Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Tech Talks

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. John Mallory AWS Storage Business Development Building Data Lakes That Cost Less & Deliver Results Faster
  • 2. What to expect from the session 1. Defining the Data Lake 2. Reducing Costs 3. Increasing Performance 4. Planning for the Future
  • 3. Rethink how to become a data-driven business • Business outcomes • Experimentation • Agile and timely
  • 4. Traditionally, Analytics looked like this (Duplication & Sprawl) Hadoop Spark NoSQL Storage Arrays Databases Data Warehouse Structured Data SQL Raw Data ETL Advanced Analytics ETL
  • 5. Defining the AWS data lake Data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets Key data lake attributes • Decoupled storage and compute • Rapid ingest and transformation • Secure multi-tenancy • Query in place • Schema on read
  • 6. AWS Data Lake Components: Any analytic workload, any scale, at the lowest possible cost. Layers: Insights, Analytics, Data Lake, Data Movement. Services: QuickSight, SageMaker, Comprehend, Glue (ETL & Data Catalog), S3/Glacier (Storage), Redshift + Spectrum, EMR, Athena, Elasticsearch Service, Kinesis Data Analytics. Workload labels: DW, Big data processing, Interactive, Real-time. Data movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams
  • 7. Reasons to choose Amazon S3 for data lake • Unmatched durability, availability, and scalability • Best security, compliance, and audit capability • Object-level control at any scale • Business insight into your data • Twice as many partner integrations • Most ways to bring data in
  • 8. Reducing Data Lake Costs
  • 9. Optimize costs with data tiering. Tiers from hot to cold: HDFS, Amazon S3 Standard, Amazon S3 Infrequent Access, Amazon Glacier • Use EMR/Hadoop with local HDFS for hottest data sets • Store cooler data in S3 and cold in Glacier to reduce costs • Use S3 Analytics to optimize tiering strategy
  • 10. Process data in place… Amazon Athena, Amazon Redshift Spectrum, Amazon EMR, AWS Glue, Amazon S3
  • 11. Amazon EMR: Decouple compute & storage • Highly distributed processing frameworks such as Hadoop/Spark • Compress datasets • Columnar file formats • Aggregate small files • S3DistCp "group-by" clause
  • 12. Amazon Redshift Spectrum: Exabyte Scale query-in-place • Structured data w/ joins • Multiple on-demand clusters-scale concurrency • Columnar file formats • Data partitioning • Better query performance with predicate pushdown
  • 13. Amazon Athena: Query without ETL • Serverless service • Schema on read • Compress datasets • Columnar file formats • Optimize file sizes • Optimize querying (Presto backend) • Query Data in Glacier (Coming)
  • 14. Use the right data formats • Pay by the amount of data scanned per query • Use compressed columnar formats • Parquet • ORC • Easy to integrate with wide variety of tools
      Dataset                                  Size on Amazon S3        Query run time   Data scanned             Cost
      Logs stored as text files                1 TB                     237 seconds      1.15 TB                  $5.75
      Logs stored in Apache Parquet format*    130 GB                   5.13 seconds     2.69 GB                  $0.013
      Savings                                  87% less with Parquet    34x faster       99% less data scanned    99.7% cheaper
  • 15. Today: All of these tools… retrieve a lot of data they don’t need and do the heavy lifting
  • 16. Today: You need to…. entire object from Amazon Glacier to Amazon S3 and then use it. Amazon S3 Amazon Glacier
  • 17. Introducing… Amazon S3 Select and Amazon Glacier Select Select subset of data from an object based on a SQL expression
  • 18. Motivation Behind S3 Select GET all the data from S3 objects, and my application will filter the data that I need Redshift Spectrum Example: • Beta customer: Run 50,000 queries • Amount of data fetched from S3: 6 PBs • Amount of data used in Redshift: 650 TB Data needed from S3: 10%
  • 19. Amazon S3 Select: SELECT a filtered set of data from within an object using standard SQL statements • First content-aware API within Amazon S3 • Unlike Amazon Athena and Spectrum, operates within the Amazon S3 system • SQL statement operates on a per-object basis, not across a group of objects • Works and scales like GET requests • Accessible via SDK (Java, Python), AWS CLI and Presto Connector; others to follow • Who will use it? • Amazon Redshift Spectrum, Amazon Athena, Presto and other custom query engines • Everyone doing log mining
  • 20. Amazon S3 Select • Input format: delimited text (CSV, TSV), JSON … • Output format: delimited text (CSV, TSV), JSON … • Compression: GZIP …
      Clauses    Data types                 Operators            Functions
      Select     String                     Conditional          String
      From       Integer, Float, Decimal    Math                 Cast
      Where      Timestamp                  Logical              Math
                 Boolean                    String (Like, ||)    Aggregate
  • 21. Amazon S3 Select: Simple pattern matches
      Without S3 Select: …get-object …object… | awk -F '{ if($4=="x") print $1}'
      With S3 Select:    ...select-object …object… 'SELECT o._1 WHERE o._4 == "x"…'
  • 22. Amazon S3 Select: Serverless applications. Diagram components: Amazon S3, AWS Lambda, Amazon SNS, S3 Select, Lambda Trigger
  • 23. Amazon S3 Select: Serverless MapReduce. 2X Faster at 1/5 of the cost
      Before: 200 seconds and 11.2 cents
          # Download and process all keys
          for key in src_keys:
              response = s3_client.get_object(Bucket=src_bucket, Key=key)
              contents = response['Body'].read()
              for line in contents.split('\n')[:-1]:
                  line_count += 1
                  try:
                      data = line.split(',')
                      srcIp = data[0][:8]
                      …
      After: 95 seconds and costs 2.8 cents
          # Select IP Address and Keys
          for key in src_keys:
              response = s3_client.select_object_content(
                  Bucket=src_bucket, Key=key,
                  expression="SELECT SUBSTR(obj._1, 1, 8), obj._2 FROM s3object as obj")
              contents = response['Body'].read()
              for line in contents:
                  line_count += 1
                  try:
                      …
  • 24. Amazon S3 Select with Presto • Works with your existing Hive Metastore • Automatically converts predicates into S3 Select requests. Diagram components: Amazon S3, S3 Select
  • 25. Amazon S3 Select: Accelerating big data. Before and after comparison: 5X Faster with 1/40 of the CPU
  • 26. Amazon S3 Select: Will be supported by… Amazon Athena Amazon EMR Amazon Redshift Spectrum
  • 27. Using Amazon Glacier Select
  • 28. How Amazon Glacier Select Works
  • 29. Delivering Results Faster
  • 30. Optimizing data lake performance • Aggregate small files: EMR S3DistCp, Amazon Kinesis Firehose • S3 Select: big data cheaper, faster; up to 400% faster • Data formats: columnar formats • EMRFS consistent view: Amazon S3, Amazon DynamoDB
  • 31. Amazon Kinesis: Real Time. Easily collect, process, and analyze video and data streams in real time • Kinesis Video Streams (New): Capture, process, and store video streams for analytics • Kinesis Data Streams: Build custom applications that analyze data streams • Kinesis Data Firehose: Load data streams into AWS data stores • Kinesis Data Analytics: Analyze data streams with SQL
  • 32. Data preparation accounts for ~80% of the work. Chart categories: Building training sets, Cleaning and organizing data, Collecting data sets, Mining data for patterns, Refining algorithms, Other
  • 33. AWS Glue: Serverless Data catalog & ETL service • Data Catalog: Discover data and extract schema. Automatically discovers data and stores schema; data searchable, and available for ETL • ETL job authoring: Auto-generates customizable ETL code in Python and Spark. Generates customizable code; schedules and runs your ETL jobs; serverless
  • 34. Amazon SageMaker (GA): The quickest and easiest way to get ML models from idea to production • End-to-End Machine Learning Platform • Zero setup • Flexible Model Training • Pay by the second
  • 35. Planning for the Future
  • 36. [Architecture diagram] Pipeline stages: Collect, Store, Analyze, Visualize. Sources: iOS, Android, web apps, mobile apps, Logstash, IoT, applications, logging. Data types: transactional data, file data, stream data, search data. Storage categories: search, SQL, NoSQL, cache, stream storage, file storage. Services shown: Amazon RDS, Amazon DynamoDB, Amazon ES, Amazon S3, Apache Kafka, Amazon Glacier, Amazon Kinesis, Amazon ElastiCache, Amazon Redshift, Impala, Pig, Amazon ML, AWS Lambda, Amazon Elastic MapReduce, Amazon QuickSight. Processing categories: streaming, stream processing, batch, interactive, ML, ETL. Outputs: analysis & visualization, notebooks, predictions, apps & APIs, IDE. Temperature and speed labels: hot, warm, cold, fast, slow. Takeaway: Evolve As Needed!
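The cost column on slide 14 follows from paying by the amount of data scanned per query. Here is a minimal sketch of that arithmetic, assuming a flat rate of $5 per TB scanned (inferred from the slide's $5.75 for 1.15 TB, not stated on it) and binary units (1 TB = 1024 GB). The function name `scan_cost_usd` is hypothetical, and current pricing should be checked before relying on these figures.

```python
# Assumption inferred from slide 14: $5.75 / 1.15 TB scanned = $5 per TB.
PRICE_PER_TB_USD = 5.00
# Assumption: binary units, 1 TB = 1024 GB.
GB_PER_TB = 1024


def scan_cost_usd(scanned_gb: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Return the query cost in USD for a given amount of data scanned (in GB)."""
    return scanned_gb / GB_PER_TB * price_per_tb


# Figures taken from the slide's "Data scanned" column.
text_cost = scan_cost_usd(1.15 * GB_PER_TB)  # logs stored as text files
parquet_cost = scan_cost_usd(2.69)           # logs stored as Apache Parquet
savings = 1 - parquet_cost / text_cost       # fraction of cost avoided

print(f"text: ${text_cost:.2f}  parquet: ${parquet_cost:.3f}  savings: {savings:.1%}")
```

The saving comes almost entirely from scanning less data: a columnar, compressed format lets the engine read only the columns a query touches.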
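The tiering strategy on slide 9 (hot data in S3 Standard, cooler data in Infrequent Access, cold data in Glacier) can be expressed as an S3 lifecycle configuration. Below is a rough sketch that only builds the configuration document. The helper name `tiering_rule`, the `logs/` prefix, and the 30-day and 90-day thresholds are illustrative assumptions, not values from the talk; S3 Analytics output would normally inform the real thresholds.

```python
import json


def tiering_rule(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 90) -> dict:
    """Build a lifecycle configuration that moves objects under `prefix`
    to Infrequent Access, then to Glacier, as they age.

    The day thresholds are hypothetical defaults chosen for illustration.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = tiering_rule("logs/")
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, a dictionary of this shape could be passed as `LifecycleConfiguration` to `put_bucket_lifecycle_configuration` on an S3 client. The bucket name is left to the caller.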
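The "After" snippet on slide 23 reflects the preview-era call shape. As a hedged sketch, the same projection could look like this with the parameter names boto3 later documented for `select_object_content` (`ExpressionType`, `Expression`, `InputSerialization`, `OutputSerialization`, and an event stream under `Payload`). The function name `select_ip_prefixes` is hypothetical, the object is assumed to be an uncompressed, headerless CSV, and the `SUBSTRING ... FROM ... FOR` form replaces the slide's `SUBSTR` as an assumption worth checking against the S3 Select SQL reference.

```python
def select_ip_prefixes(s3_client, bucket: str, key: str) -> list:
    """Fetch only the first 8 characters of column 1, plus column 2,
    from one CSV object, instead of downloading the whole object."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        # Assumed SQL form; the slide used SUBSTR(obj._1, 1, 8).
        Expression="SELECT SUBSTRING(obj._1 FROM 1 FOR 8), obj._2 FROM s3object obj",
        InputSerialization={"CSV": {"FileHeaderInfo": "NONE"}, "CompressionType": "NONE"},
        OutputSerialization={"CSV": {}},
    )
    # The response is a stream of events; only "Records" events carry data.
    chunks = []
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join before splitting: a record may be cut across two chunks.
    text = b"".join(chunks).decode("utf-8")
    # Naive comma split; it does not handle quoted fields containing commas.
    return [line.split(",") for line in text.splitlines() if line]
```

A caller would pass a real client, for example one created with `boto3.client("s3")`, and loop over the source keys as the slide does. Taking the client as a parameter keeps the function easy to exercise with a stub.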