© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ben Snively, Solutions Architect – Data and Analytics
May 23rd, 2018
Data Warehousing and Data Lake Analytics, Together
Data Warehouse…
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence Relational data
Gigabytes to Petabytes scale
Reporting and analysis
Schema defined prior to data load
Relational data warehouse
Massively parallel; Petabyte scale
Fully managed
HDD and SSD Platforms
$1,000/TB/Year; starts at $0.25/hour
Amazon
Redshift
a lot faster
a lot easier to use
a lot cheaper
Nasdaq operates financial
exchanges around the
world, and processes
large volumes of data.
Challenge:
Nasdaq wanted to make their large
historical data footprint available
to analyze as a single dataset.
Solution:
• Use Amazon Redshift for
interactive querying
• 5.5 billion rows into Amazon
Redshift every day (peaking at 14
billion one day)
Philips Uses AWS to Power a 37-Million
Record Upload to the Cloud in 90 Minutes
“Thanks to AWS, our business was able to upload over 37 million records into Amazon Redshift in 90 minutes.”
Doug Ranaham, Manager, Philips Healthcare

• Philips’ U.S.-based healthcare division needed to analyze 37 million records for a new project
• When its on-premises database solution couldn’t handle a data transfer that large anymore, the company turned to AWS and the AWS Marketplace
• Philips set up Attunity CloudBeam for Amazon Redshift in less than one minute to upload 37 million records to the AWS Cloud in 90 minutes
• Using AWS, Philips can optimize any size data set within two hours, allowing its consultants and analysts to provide solutions quickly to customers
Amazon Redshift
10x faster at 1/10th the cost
Fast
Delivers fast results for all
types of workloads
Cost-effective
No upfront costs, start small,
and pay as you go
Scalable
Gigabytes to petabytes
to exabytes
Integrated
Integrated with S3 data lakes,
AWS services and third-party tools
Secure
Audit everything; encrypt
data end-to-end; extensive
certification and compliance
Easy to Use
Create and start using a data
warehouse in minutes
Amazon Redshift is easy to use
Provisioning in minutes
Automatic patching
SQL data loading
Backups are built-in
Security is built-in
Compression is built-in
Amazon Redshift is fast
“Did I mention that it’s ridiculously fast? We’re using
it to provide our analysts with an alternative to Hadoop”
“After investigating Redshift, Snowflake, and
BigQuery, we found that Redshift offers top-of-the-
line performance at best-in-market price points”
“…[Amazon Redshift] performance has blown away
everyone here. We generally see 50-100X speedup
over Hive”
“We regularly process multibillion row datasets
and we do that in a matter of hours. We are heading
to up to 10 times more data volumes in the next couple
of years, easily”
“We saw a 2X performance improvement on a wide
variety of workloads. The more complex the queries,
the higher the performance improvement”
“On our previous big data warehouse system, it took
around 45 minutes to run a query against a year of
data, but that number went down to just 25 seconds
using Amazon Redshift”
Amazon Redshift is integrated
Use Existing BI Tools
JDBC/ODBC
Tool-based native drivers
On Prem
Amazon
QuickSight
Existing or new
BI tool
Amazon
Redshift
Amazon Redshift is integrated
Data Integration | Business Intelligence | Systems Integrators
Accelerate migrations from legacy systems
“AWS Database Migration Service is the
most impressive migration service we’ve seen.”
Migrate – Over 1,000 unique migrations to Amazon Redshift
using AWS DMS
Amazon
Redshift
Redshift is used for mission-critical workloads
Financial and
management reporting
Payments to suppliers
and billing workflows
Web/Mobile clickstream
and event analysis
Recommendation and
predictive analytics
Diving into Redshift
Columnar Architecture: Example
Column-based storage
• Only scan blocks for relevant
column
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
aid loc dt
1 SFO 2017-10-20
2 JFK 2017-10-20
3 SFO 2017-04-01
4 JFK 2017-05-14
SELECT min(dt) FROM deep_dive;
Compression: Example
Compression is more efficient because each column in the columnar layout stores a single data type
Columns grow and shrink independently
Reduces storage requirements
Reduces I/O
CREATE TABLE deep_dive (
aid INT ENCODE ZSTD
,loc CHAR(3) ENCODE BYTEDICT
,dt DATE ENCODE RUNLENGTH
);
aid loc dt
1 SFO 2017-10-20
2 JFK 2017-10-20
3 SFO 2017-04-01
4 JFK 2017-05-14
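
As a hedged illustration (not part of the original deck), Amazon Redshift can also recommend an encoding per column for an existing table with ANALYZE COMPRESSION; deep_dive is the sample table from these slides:

-- Sample the table and report a suggested encoding for each column (read-only)
ANALYZE COMPRESSION deep_dive;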
Sort Key: Example
Add a sort key to one or more columns to
physically sort the data on disk
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
) SORTKEY (dt, loc);
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
deep_dive
aid loc dt
1 SFO 2017-10-20
2 JFK 2017-10-20
3 SFO 2017-04-01
4 JFK 2017-05-14
deep_dive (sorted)
aid loc dt
3 SFO 2017-04-01
4 JFK 2017-05-14
2 JFK 2017-10-20
1 SFO 2017-10-20
Zone Maps and Sorting: Example

SELECT count(*) FROM deep_dive WHERE dt = '2017-06-09';

Sorted by date:
Block 1: MIN 01-JUNE-2017, MAX 06-JUNE-2017
Block 2: MIN 07-JUNE-2017, MAX 12-JUNE-2017
Block 3: MIN 13-JUNE-2017, MAX 21-JUNE-2017
Block 4: MIN 21-JUNE-2017, MAX 30-JUNE-2017

Unsorted table:
Block 1: MIN 01-JUNE-2017, MAX 20-JUNE-2017
Block 2: MIN 08-JUNE-2017, MAX 30-JUNE-2017
Block 3: MIN 12-JUNE-2017, MAX 20-JUNE-2017
Block 4: MIN 02-JUNE-2017, MAX 25-JUNE-2017
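
A hedged way to see how well a table's sort order is holding up is the SVV_TABLE_INFO system view; deep_dive is the sample table from these slides, and the exact columns of the view may vary by Redshift version:

-- Percentage of unsorted rows and the first sort key column
SELECT "table", unsorted, sortkey1
FROM svv_table_info
WHERE "table" = 'deep_dive';

-- Re-sort rows and reclaim space if the unsorted percentage grows large
VACUUM deep_dive;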
Terminology and Concepts: Slices
A slice can be thought of as a virtual compute node
• Unit of data partitioning
• Parallel query processing
Facts about slices:
• Each compute node has either 2, 16, or 32 slices
• Table rows are distributed to slices
• A slice processes only its own data
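
To confirm the slice count on your own cluster, one minimal sketch is to query the STV_SLICES system table (slice numbering differs by node type):

-- List every slice and the compute node it belongs to
SELECT node, slice
FROM stv_slices
ORDER BY node, slice;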
Data Distribution
Distribution style is a table property which dictates how
that table’s data is distributed throughout the cluster:
• KEY: value is hashed, same value goes to same
location (slice)
• ALL: full table data goes to the first slice of every
node
• EVEN: round robin
(Diagram: KEY, ALL, and EVEN distribution across Node 1 with Slices 1 and 2, and Node 2 with Slices 3 and 4)

Goals:
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
Best Practices: Data Distribution
DISTSTYLE KEY
• Goals
• Optimize Join performance between large tables
• Optimize Insert into Select performance
• Optimize Group By performance
• The distribution column should have high cardinality and should not cause row skew (check with the query below):
DISTSTYLE ALL
• Goals
• Optimize Join performance with dimension tables
• Reduces disk usage on small tables
• Small and medium size dimension tables (< 3M rows)
DISTSTYLE EVEN
• If neither KEY nor ALL applies (or you are unsure)
SELECT diststyle, skew_rows FROM svv_table_info WHERE "table" = 'deep_dive';
diststyle | skew_rows
-----------+-----------
KEY(aid)  | 1.07   <- ratio between the slices with the most and the fewest rows
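
A minimal sketch of how these choices are declared at table creation (the table and column names below are hypothetical, not from the deck):

-- Large fact table: distribute on a high-cardinality join column to avoid row skew
CREATE TABLE orders_fact (
  aid INT,        -- join / distribution column
  loc CHAR(3),
  dt DATE
)
DISTSTYLE KEY DISTKEY (aid);

-- Small dimension table (< 3M rows): copy the full table to every node
CREATE TABLE audience_dim (
  aid INT,
  audience_name VARCHAR(64)
)
DISTSTYLE ALL;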
Data Ingestion: COPY Statement
Ingestion throughput:
• Each slice’s query
processors can load
one file at a time:
o Streaming
decompression
o Parse
o Distribute
o Write
With a single input file, only one slice is active, so just 1 of the 16 slices (6.25%) on a DC2.8XL compute node does any work.

(Diagram: DC2.8XL compute node, slices 0-15, loading 1 input file)
Number of input files should be a
multiple of the number of slices
Splitting the single file into 16 input files keeps every slice busy and maximizes ingestion performance
COPY continues to scale linearly
as you add nodes
Data Ingestion: COPY Statement
Recommendation: use delimited files of 1 MB to 1 GB after gzip compression

(Diagram: DC2.8XL compute node, slices 0-15, each loading one of 16 input files)
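
A minimal COPY sketch that matches this guidance, assuming 16 gzipped, pipe-delimited files under a common prefix (the bucket, prefix, and IAM role below are placeholders, not from the deck):

COPY deep_dive
FROM 's3://my-bucket/deep_dive/part_'                       -- placeholder prefix matching the 16 split files
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'    -- placeholder role ARN
DELIMITER '|'
GZIP;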
Node Types and Cluster Sizing
Terminology and Concepts: Node Types

Dense Compute—DC2
• Solid state disks
Dense Storage—DS2
• Magnetic disks

Instance Type   Disk Type   Size      Memory   CPUs
DC2 large       NVMe SSD    160 GB    16 GB    2
DC2 8xlarge     NVMe SSD    2.56 TB   244 GB   32
DS2 xlarge      Magnetic    2 TB      32 GB    4
DS2 8xlarge     Magnetic    16 TB     244 GB   36
Best Practices: Cluster Sizing
Use at least two compute nodes (a multi-node cluster) in production for data mirroring
• Leader node is given for no additional cost
Amazon Redshift is significantly faster in a VPC compared to EC2 Classic
Maintain at least 20% free space or three times the size of the largest table
• Scratch space for rewriting tables
• Free space is required for vacuum to re-sort table
• Temporary tables used for intermediate query results
The maximum number of available Amazon Redshift Spectrum nodes is a function of
the number of slices in the Amazon Redshift cluster
If you’re using DC1 instances, upgrade to the DC2 instance type
• Same price as DC1, significantly faster
• Reserved Instances do not automatically transfer over
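
A rough way to watch the 20% free-space guideline is to compare used and total 1 MB blocks in STV_PARTITIONS; this is a sketch, and the ratio includes mirrored copies:

-- Approximate percentage of cluster disk in use
SELECT SUM(used)::FLOAT / SUM(capacity) * 100 AS pct_disk_used
FROM stv_partitions;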
Redshift Spectrum
Extend the data warehouse to your S3 data lake
S3 data lake
Redshift data
Redshift Spectrum
query engine
• Redshift SQL queries against exabytes in S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited
concurrency
• Parquet, ORC, Grok, Avro, & CSV data formats
• Pay only for the amount of data scanned
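
Before Redshift can query S3, the data is registered as an external schema and table. The sketch below assumes a Glue Data Catalog database, a placeholder IAM role and bucket, and column types guessed to match the order-details queries later in the deck:

-- Map a Glue Data Catalog database to an external schema named s3
CREATE EXTERNAL SCHEMA s3
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External table over Parquet files in S3 (columns and types are assumptions)
CREATE EXTERNAL TABLE s3.d_customer_order_item_details (
  asin      VARCHAR(16),
  region_id INT,
  order_day DATE,
  quantity  INT,
  our_price DECIMAL(8,2)
)
STORED AS PARQUET
LOCATION 's3://my-bucket/order-details/';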
Characteristics of a Data Lake
What is a Data Lake and what value does it provide
Collect and store any data, at any scale, and at low cost
Secure the data and prevent unauthorized access
Catalogue, search, and discover data
Decouple Compute from Storage
Future Proof against a highly changing complex ecosystem
Data Lake
Data Lakes extend traditional warehouses
Relational and non-relational data
Terabytes to Exabytes scale
Schema defined during analysis
Diverse analytical engines to gain insights
Designed for low cost storage and analytics
(Diagram: OLTP, ERP, CRM, and LOB systems feed the Data Warehouse for Business Intelligence; Devices, Web, Sensors, and Social feed the Data Lake; a shared Data Catalog supports Machine Learning, DW Queries, Big data processing, Interactive, and Real-time analytics)
Amazon Redshift Spectrum
Query directly against data in Amazon S3 using thousands of nodes
Fast @ exabyte scale
Elastic & highly available
On-demand, pay-per-query
High concurrency: Multiple
clusters access same data
Query data in-place using
open file formats
Full Amazon Redshift
SQL support
Redshift Spectrum
Query your data lake
Amazon
Redshift
JDBC/ODBC
(Redshift Spectrum nodes 1, 2, 3, 4, ... N)
Amazon S3
Exabyte-scale object storage
AWS Glue
Data Catalog
Redshift Spectrum
Scale-out serverless compute
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …
Data Lake on Amazon S3 with AWS Glue
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
BI Tools and
Visualizations
AWS GLUE ETL
…
other tools
(Diagram: Data sources -> AWS Glue -> Amazon S3 -> Redshift Spectrum / Amazon Redshift -> BI Tools)
“Amazon Redshift Spectrum is a game changer for us. Reports that took minutes to produce
are now delivered in seconds. We like the ability to scale compute on-demand to query Petabytes
of data in S3 in various open file formats.”
-- Rafi Ton, CEO, NUVIAD
NUVIAD is a marketing platform that helps media buyers optimize their mobile bidding
Use AWS for marketing campaign and bidding analytics
Scale S3 storage for unlimited data capacity
Use Spectrum for unlimited scale and query concurrency
80% performance gain using Parquet data format
—Data Lake Analytics with Redshift Spectrum
Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Let's compute the sales of the prior books she’s written in this
series and return the top 20 values
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
P.ASIN,
P.TITLE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
products P
WHERE
D.ASIN = P.ASIN AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How
many should we order for Seattle? What were prior first few
day sales?
Let's compute the sales of the prior books she’s written in this
series and return the top 20 values, just for the first three days
of sales of first editions in the city of Seattle, WA, USA
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
R.POSTAL_CODE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P,
regions R
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
D.REGION_ID = R.REGION_ID AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
R.COUNTRY_CODE = 'US' AND
R.CITY = 'Seattle' AND
R.STATE = 'WA' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
Now let’s run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail
records for each day over past 20 years.
190 million files across 15,000 partitions in S3.
One partition per day for USA and rest of world.
Need a billion-fold reduction in data processed.
Running this query using a 1000 node Hive cluster
would take over 5 years.*
• Compression: 5X
• Columnar file format: 10X
• Scanning with 2500 nodes: 2500X
• Static partition elimination: 2X
• Dynamic partition elimination: 350X
• Redshift's query optimizer: 40X
---------------------------------------------------
Total reduction: ~3.5 billion X
* Estimated using a 20-node Hive cluster & 1.4 TB; assumes linear scaling
* Query used a 20 node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on data
format used by Amazon Retail.
Demonstration
OLTP ERP CRM LOB
Data Warehouse Data Lake
Devices Web Sensors Social
Recommendations
Optimize your warehouse
• Use Amazon Redshift Spectrum to offload scan-intensive, concurrent workloads
Optimize your data lake for querying
• Use column-oriented storage formats on S3
• Partition your objects
Optimize Redshift Spectrum
• Know your query- and node-level parallelism
• Improve query performance with predicate pushdown (see the sketch below)
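
A minimal sketch of partitioning and predicate pushdown for Spectrum, with hypothetical table, column, and S3 names:

-- Partitioned external table (names and locations are placeholders)
CREATE EXTERNAL TABLE s3.clickstream (
  user_id BIGINT,
  url     VARCHAR(2048)
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://my-bucket/clickstream/';

-- Register one day's S3 prefix as a partition
ALTER TABLE s3.clickstream
ADD IF NOT EXISTS PARTITION (event_date = '2018-05-23')
LOCATION 's3://my-bucket/clickstream/event_date=2018-05-23/';

-- Filtering on the partition column prunes partitions, and simple predicates
-- are evaluated in the Spectrum layer rather than in the cluster
SELECT COUNT(*)
FROM s3.clickstream
WHERE event_date = '2018-05-23';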
Data Lakes on AWS

Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Amazon S3, AWS Glue, Redshift, EMR, Athena, Kinesis, Elasticsearch Service

Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility:
https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf
Thank you!

More Related Content

What's hot

Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Amazon Web Services
 
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Amazon Web Services
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
Amazon Web Services
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
Amazon Web Services
 
10 Hacks for Optimizing MySQL in the Cloud - AWS Online Tech Talks
10 Hacks for Optimizing MySQL in the Cloud - AWS Online Tech Talks10 Hacks for Optimizing MySQL in the Cloud - AWS Online Tech Talks
10 Hacks for Optimizing MySQL in the Cloud - AWS Online Tech Talks
Amazon Web Services
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
Amazon Web Services
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
Amazon Web Services
 
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
Amazon Web Services
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
Amazon Web Services
 
Visualization with Amazon QuickSight
Visualization with Amazon QuickSightVisualization with Amazon QuickSight
Visualization with Amazon QuickSight
Amazon Web Services
 
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Amazon Web Services
 
How Amazon uses AWS Analytics
How Amazon uses AWS AnalyticsHow Amazon uses AWS Analytics
How Amazon uses AWS Analytics
Amazon Web Services
 
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Amazon Web Services
 
Amazon DynamoDB Deep Dive Advanced Design Patterns for DynamoDB (DAT401) - AW...
Amazon DynamoDB Deep Dive Advanced Design Patterns for DynamoDB (DAT401) - AW...Amazon DynamoDB Deep Dive Advanced Design Patterns for DynamoDB (DAT401) - AW...
Amazon DynamoDB Deep Dive Advanced Design Patterns for DynamoDB (DAT401) - AW...
Amazon Web Services
 
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Amazon Web Services
 
How Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS AnalyticsHow Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS Analytics
Amazon Web Services
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Amazon Web Services
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Amazon Web Services
 
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 Citrix Moves Data to Amazon Redshift Fast with Matillion ETL Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
Amazon Web Services
 
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Amazon Web Services
 

What's hot (20)

Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT372) - AWS re:Invent 2018
 
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
Building Data Lakes That Cost Less and Deliver Results Faster - AWS Online Te...
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
10 Hacks for Optimizing MySQL in the Cloud - AWS Online Tech Talks
10 Hacks for Optimizing MySQL in the Cloud - AWS Online Tech Talks10 Hacks for Optimizing MySQL in the Cloud - AWS Online Tech Talks
10 Hacks for Optimizing MySQL in the Cloud - AWS Online Tech Talks
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
 
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
What's New with Amazon Redshift ft. Dow Jones (ANT350-R) - AWS re:Invent 2018
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 
Visualization with Amazon QuickSight
Visualization with Amazon QuickSightVisualization with Amazon QuickSight
Visualization with Amazon QuickSight
 
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
Metrics-Driven Performance Tuning for AWS Glue ETL Jobs (ANT326) - AWS re:Inv...
 
How Amazon uses AWS Analytics
How Amazon uses AWS AnalyticsHow Amazon uses AWS Analytics
How Amazon uses AWS Analytics
 
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
Build Data Engineering Platforms with Amazon EMR (ANT204) - AWS re:Invent 2018
 
Amazon DynamoDB Deep Dive Advanced Design Patterns for DynamoDB (DAT401) - AW...
Amazon DynamoDB Deep Dive Advanced Design Patterns for DynamoDB (DAT401) - AW...Amazon DynamoDB Deep Dive Advanced Design Patterns for DynamoDB (DAT401) - AW...
Amazon DynamoDB Deep Dive Advanced Design Patterns for DynamoDB (DAT401) - AW...
 
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
Aurora Serverless: Scalable, Cost-Effective Application Deployment (DAT336) -...
 
How Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS AnalyticsHow Amazon.com Uses AWS Analytics
How Amazon.com Uses AWS Analytics
 
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
Data Lake Implementation: Processing and Querying Data in Place (STG204-R1) -...
 
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
Big Data Analytics Architectural Patterns and Best Practices (ANT201-R1) - AW...
 
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 Citrix Moves Data to Amazon Redshift Fast with Matillion ETL Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
Citrix Moves Data to Amazon Redshift Fast with Matillion ETL
 
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
Extending Analytics Beyond the Data Warehouse, ft. Warner Bros. Analytics (AN...
 

Similar to Data Warehousing and Data Lake Analytics, Together - AWS Online Tech Talks

Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Amazon Web Services
 
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon RedshiftBDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
Amazon Web Services
 
Fanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSFanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWS
Amazon Web Services
 
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Amazon Web Services
 
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Web Services
 
Big Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_SingaporeBig Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_Singapore
Amazon Web Services
 
Migrating database to cloud
Migrating database to cloudMigrating database to cloud
Migrating database to cloud
Amazon Web Services
 
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
Amazon Web Services
 
ABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSightABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSight
Amazon Web Services
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
Amazon Web Services
 
STG401_This Is My Architecture
STG401_This Is My ArchitectureSTG401_This Is My Architecture
STG401_This Is My Architecture
Amazon Web Services
 
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
Amazon Web Services
 
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Amazon Web Services
 
Loading Data into Redshift with Lab
Loading Data into Redshift with LabLoading Data into Redshift with Lab
Loading Data into Redshift with Lab
Amazon Web Services
 
Tinder and DynamoDB: It's a Match! Massive Data Migration, Zero Down Time - D...
Tinder and DynamoDB: It's a Match! Massive Data Migration, Zero Down Time - D...Tinder and DynamoDB: It's a Match! Massive Data Migration, Zero Down Time - D...
Tinder and DynamoDB: It's a Match! Massive Data Migration, Zero Down Time - D...
Amazon Web Services
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
Amazon Web Services
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
Amazon Web Services
 
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
Amazon Web Services
 
Loading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFLoading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SF
Amazon Web Services
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
Amazon Web Services
 

Similar to Data Warehousing and Data Lake Analytics, Together - AWS Online Tech Talks (20)

Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
Serverless Stream Processing Tips & Tricks (ANT358) - AWS re:Invent 2018
 
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon RedshiftBDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
BDA306 Building a Modern Data Warehouse: Deep Dive on Amazon Redshift
 
Fanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWSFanatics Ingests Streaming Data to a Data Lake on AWS
Fanatics Ingests Streaming Data to a Data Lake on AWS
 
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
Leadership Session: AWS Database and Analytics (DAT206-L) - AWS re:Invent 2018
 
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat...
 
Big Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_SingaporeBig Data@Scale_AWSPSSummit_Singapore
Big Data@Scale_AWSPSSummit_Singapore
 
Migrating database to cloud
Migrating database to cloudMigrating database to cloud
Migrating database to cloud
 
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
FINRA's Managed Data Lake: Next-Gen Analytics in the Cloud - ENT328 - re:Inve...
 
ABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSightABD206-Building Visualizations and Dashboards with Amazon QuickSight
ABD206-Building Visualizations and Dashboards with Amazon QuickSight
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
STG401_This Is My Architecture
STG401_This Is My ArchitectureSTG401_This Is My Architecture
STG401_This Is My Architecture
 
Architecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the EnterpriseArchitecting an Open Data Lake for the Enterprise
Architecting an Open Data Lake for the Enterprise
 
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
Modern Cloud Data Warehousing ft. Equinox Fitness Clubs: Optimize Analytics P...
 
Loading Data into Redshift with Lab
Loading Data into Redshift with LabLoading Data into Redshift with Lab
Loading Data into Redshift with Lab
 
Tinder and DynamoDB: It's a Match! Massive Data Migration, Zero Down Time - D...
Tinder and DynamoDB: It's a Match! Massive Data Migration, Zero Down Time - D...Tinder and DynamoDB: It's a Match! Massive Data Migration, Zero Down Time - D...
Tinder and DynamoDB: It's a Match! Massive Data Migration, Zero Down Time - D...
 
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 TiVo: How to Scale New Products with a Data Lake on AWS and Qubole TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
TiVo: How to Scale New Products with a Data Lake on AWS and Qubole
 
Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
Serverless Stream Processing Pipeline Best Practices (SRV316-R1) - AWS re:Inv...
 
Loading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFLoading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SF
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 

More from Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
Amazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
Amazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
Amazon Web Services
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Amazon Web Services
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
Amazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
Amazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Amazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
Amazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Amazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
Amazon Web Services
 

More from Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Data Warehousing and Data Lake Analytics, Together - AWS Online Tech Talks

  • 1. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ben Snively, Solutions Architect – Data and Analytics May 23rd, 2018 Data Warehousing and Data Lake Analytics, Together
  • 2. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Warehouse… OLTP ERP CRM LOB Data Warehouse Business Intelligence Relational data Gigabytes to Petabytes scale Reporting and analysis Schema defined prior to data load
  • 3. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Relational data warehouse Massively parallel; Petabyte scale Fully managed HDD and SSD Platforms $1,000/TB/Year; starts at $0.25/hour Amazon Redshift a lot faster a lot easier to use a lot cheaper
  • 4. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Nasdaq operates financial exchanges around the world, and processes large volumes of data. Challenge: Nasdaq wanted to make their large historical data footprint available to analyze as a single dataset. Solution: • Use Amazon Redshift for interactive querying • 5.5 billion rows into Amazon Redshift every day (peaking at 14 billion one day)
  • 5. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Philips Uses AWS to Power a 37-Million Record Upload to the Cloud in 90 Minutes Thanks to AWS, our business was able to upload over 37 million records into Amazon Redshift in 90 minutes. • Philips’ U.S.-based healthcare division needed to analyze 37 million records for a new project • When its on-premises database solution couldn’t handle a data transfer that large anymore, the company turned to AWS and the AWS Marketplace • Philips set up Attunity CloudBeam for Amazon Redshift in less than one minute to upload 37 million records to the AWS Cloud in 90 minutes • Using AWS, Philips can optimize any size data set within two hours, allowing its consultants and analysts to provide solutions quickly to customers Doug Ranaham Manager, Philips Healthcare “
  • 6. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift 10x faster at 1/10th the cost Fast Delivers fast results for all types of workloads Cost-effective No upfront costs, start small, and pay as you go Scalable Gigabytes to petabytes to exabytes Integrated Integrated with S3 data lakes, AWS services and third-party tools Secure Audit everything; encrypt data end-to-end; extensive certification and compliance Easy to Use Create and start using a data warehouse in minutes
  • 7. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift is easy to use Provisioning in minutes Automatic patching SQL - Data loading Backups are built-in Security is built-in Compression is built-in
  • 8. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift is fast “Did I mention that it’s ridiculously fast? We’re using it to provide our analysts with an alternative to Hadoop” “After investigating Redshift, Snowflake, and BigQuery, we found that Redshift offers top-of-the- line performance at best-in-market price points” “…[Amazon Redshift] performance has blown away everyone here. We generally see 50-100X speedup over Hive” “We regularly process multibillion row datasets and we do that in a matter of hours. We are heading to up to 10 times more data volumes in the next couple of years, easily” “We saw a 2X performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement” “On our previous big data warehouse system, it took around 45 minutes to run a query against a year of data, but that number went down to just 25 seconds using Amazon Redshift”
  • 9. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift is integrated Use Existing BI Tools JDBC/ODBC Tool based native drivers On Prem Amazon QuickSight Existing or new BI tool Amazon Redshift
  • 10. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift is integrated Data Integration Business IntelligenceSystems Integrators
  • 11. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Accelerate migrations from legacy systems “AWS Database Migration Service is the most impressive migration service we’ve seen.” Migrate – Over 1,000 unique migrations to Amazon Redshift using AWS DMS Amazon Redshift
  • 12. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Redshift is used for mission-critical workloads Financial and management reporting Payments to suppliers and billing workflows Web/Mobile clickstream and event analysis Recommendation and predictive analytics
  • 13. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Diving into Redshift
  • 14. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Columnar Architecture: Example Column-based storage • Only scan blocks for relevant column CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ); aid loc dt 1 SFO 2017-10-20 2 JFK 2017-10-20 3 SFO 2017-04-01 4 JFK 2017-05-14 SELECT min(dt) FROM deep_dive; aid loc dt
  • 15. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Compression: Example More efficient compression is due to storing the same data type in the columnar architecture Columns grow and shrink independently Reduces storage requirements Reduces I/O aid loc dt CREATE TABLE deep_dive ( aid INT ENCODE ZSTD ,loc CHAR(3) ENCODE BYTEDICT ,dt DATE ENCODE RUNLENGTH ); aid loc dt 1 SFO 2017-10-20 2 JFK 2017-10-20 3 SFO 2017-04-01 4 JFK 2017-05-14
  • 16. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Sort Key: Example Add a sort key to one or more columns to physically sort the data on disk CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ) SORT KEY (dt, loc); CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ); deep_dive aid loc dt 1 SFO 2017-10-20 2 JFK 2017-10-20 3 SFO 2017-04-01 4 JFK 2017-05-14 deep_dive (sorted) aid loc dt 3 SFO 2017-04-01 4 JFK 2017-05-14 2 JFK 2017-10-20 1 SFO 2017-10-20 deep_dive (sorted) aid loc dt 3 SFO 2017-04-01 deep_dive (sorted) aid loc dt 3 SFO 2017-04-01 4 JFK 2017-05-14 deep_dive (sorted) aid loc dt 3 SFO 2017-04-01 4 JFK 2017-05-14 2 JFK 2017-10-20 deep_dive (sorted) aid loc dt 3 SFO 2017-04-01 4 JFK 2017-05-14 2 JFK 2017-10-20 1 SFO 2017-10-20
  • 17. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. SELECT count(*) FROM deep_dive WHERE dt = '06-09-2017'; MIN: 01-JUNE-2017 MAX: 06-JUNE-2017 MIN: 07-JUNE-2017 MAX: 12-JUNE-2017 MIN: 13-JUNE-2017 MAX: 21-JUNE-2017 MIN: 21-JUNE-2017 MAX: 30-JUNE-2017 Sorted by date MIN: 01-JUNE-2017 MAX: 20-JUNE-2017 MIN: 08-JUNE-2017 MAX: 30-JUNE-2017 MIN: 12-JUNE-2017 MAX: 20-JUNE-2017 MIN: 02-JUNE-2017 MAX: 25-JUNE-2017 Unsorted table Zone Maps and Sorting: Example
  • 18. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Terminology and Concepts: Slices A slice can be thought of like a virtual compute node • Unit of data partitioning • Parallel query processing Facts about slices: • Each compute node has either 2, 16, or 32 slices • Table rows are distributed to slices • A slice processes only its own data
  • 19. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Distribution Distribution style is a table property which dictates how that table’s data is distributed throughout the cluster: • KEY: value is hashed, same value goes to same location (slice) • ALL: full table data goes to the first slice of every node • EVEN: round robin ALL Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 KEY Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 Node 1 Slice 1 Slice 2 Node 2 Slice 3 Slice 4 EVENGoals: • Distribute data evenly for parallel processing • Minimize data movement during query processing
  • 20. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practices: Data Distribution DISTSTYLE KEY • Goals • Optimize Join performance between large tables • Optimize Insert into Select performance • Optimize Group By performance • The column that is being distributed on should have a high cardinality and not cause row skew: DISTSTYLE ALL • Goals • Optimize Join performance with dimension tables • Reduces disk usage on small tables • Small and medium size dimension tables (< 3M rows) DISTSTYLE EVEN • If neither KEY or ALL apply (or you are unsure) SELECT diststyle, skew_rows FROM svv_table_info WHERE "table" = 'deep_dive'; diststyle | skew_rows -----------+----------- KEY(aid) | 1.07  Ratio between the slice with the most and least number of rows
  • 21. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Ingestion: COPY Statement Ingestion throughput: • Each slice’s query processors can load one file at a time: o Streaming decompression o Parse o Distribute o Write Realizing only partial node usage as 6.25% of slices are active 0 2 4 6 8 10 12 141 3 5 7 9 11 13 15 DC2.8XL Compute Node 1 Input File
  • 22. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Ingestion: COPY Statement
The number of input files should be a multiple of the number of slices. Splitting the single file into 16 input files puts all slices to work, maximizing ingestion performance; COPY continues to scale linearly as you add nodes.
Recommendation: use delimited files of 1 MB to 1 GB each after gzip compression.
[Figure: DC2.8XL compute node with 16 slices, 16 input files loading in parallel]
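A minimal sketch of such a parallel load; the bucket, key prefix, IAM role, and pipe-delimited gzip files are hypothetical and only illustrate the pattern:
-- Load all files that share the prefix in parallel;
-- the file count should be a multiple of the cluster's slice count.
COPY deep_dive
FROM 's3://example-bucket/deep_dive/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'
DELIMITER '|'
GZIP
REGION 'us-east-1';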
  • 23. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Node Types and Cluster Sizing
  • 24. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Terminology and Concepts: Node Types
Dense Compute—DC2: solid state disks
Dense Storage—DS2: magnetic disks
Instance Type | Disk Type | Size    | Memory | CPUs
DC2 large     | NVMe SSD  | 160 GB  | 16 GB  | 2
DC2 8xlarge   | NVMe SSD  | 2.56 TB | 244 GB | 32
DS2 xlarge    | Magnetic  | 2 TB    | 32 GB  | 4
DS2 8xlarge   | Magnetic  | 16 TB   | 244 GB | 36
  • 25. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Best Practices: Cluster Sizing
Use at least two compute nodes (a multi-node cluster) in production for data mirroring
• The leader node comes at no additional cost
Amazon Redshift is significantly faster in a VPC than in EC2-Classic
Maintain at least 20% free space, or three times the size of the largest table
• Scratch space for usage and rewriting tables
• Free space is required for VACUUM to re-sort tables
• Temporary tables are used for intermediate query results
The maximum number of available Amazon Redshift Spectrum nodes is a function of the number of slices in the Amazon Redshift cluster
If you're using DC1 instances, upgrade to the DC2 instance type
• Same price as DC1, significantly faster
• Reserved Instances do not automatically transfer over
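One hedged way to keep an eye on free space is the STV_PARTITIONS system view. A minimal sketch, assuming access to system tables (used and capacity are reported in 1 MB blocks):
-- Per-node disk usage: used blocks vs. total capacity.
SELECT owner AS node,
       SUM(used) AS used_blocks,
       SUM(capacity) AS capacity_blocks,
       ROUND(100.0 * SUM(used) / SUM(capacity), 1) AS pct_used
FROM stv_partitions
GROUP BY owner
ORDER BY owner;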
  • 26. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Redshift Spectrum
Extend the data warehouse to your S3 data lake
[Figure: Redshift Spectrum query engine spanning Redshift data and the S3 data lake]
• Redshift SQL queries against exabytes in S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• Parquet, ORC, Grok, Avro, & CSV data formats
• Pay only for the amount of data scanned
  • 27. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Characteristics of a Data Lake
What is a Data Lake and what value does it provide?
• Collect and store any data, at any scale, and at low cost
• Secure the data and prevent unauthorized access
• Catalog, search, and discover data
• Decouple compute from storage
• Future-proof against a highly changing, complex ecosystem
  • 28. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lakes extend traditional warehouses
• Relational and non-relational data
• Terabytes to exabytes scale
• Schema defined during analysis
• Diverse analytical engines to gain insights
• Designed for low-cost storage and analytics
[Figure: OLTP, ERP, CRM, LOB, devices, web, sensors, and social data feeding a data warehouse and data lake, with a data catalog serving machine learning, DW queries, big data processing, interactive, and real-time analytics]
  • 29. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Amazon Redshift Spectrum
Query directly against data in Amazon S3 using thousands of nodes
• Fast at exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: multiple clusters access the same data
• Query data in place using open file formats
• Full Amazon Redshift SQL support
  • 30. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Redshift Spectrum
Query your data lake
[Figure: JDBC/ODBC clients query the Amazon Redshift cluster, which fans the query out to Redshift Spectrum (scale-out serverless compute) over Amazon S3 (exabyte-scale object storage), with table definitions held in the AWS Glue Data Catalog]
Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …
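To make S3 data queryable this way, an external schema is registered against the Glue Data Catalog and external tables are defined over the S3 objects. A minimal sketch; the database name, IAM role, bucket, and column layout are hypothetical:
-- Register an external schema backed by the AWS Glue Data Catalog.
CREATE EXTERNAL SCHEMA s3
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over delimited files in S3.
CREATE EXTERNAL TABLE s3.ext_table (
  aid INT
  ,loc CHAR(3)
  ,dt DATE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://example-bucket/ext_table/';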
  • 31. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lake on Amazon S3 with AWS Glue
[Figure: on-premises data, web app data, Amazon RDS, other databases, streaming data, and your data flowing through AWS Glue ETL into an S3 data lake, feeding BI tools, visualizations, and other tools]
  • 32. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lake Analytics with Redshift Spectrum
NUVIAD is a marketing platform that helps media buyers optimize their mobile bidding
• Uses AWS for marketing campaign and bidding analytics
• Scales S3 storage for unlimited data capacity
• Uses Spectrum for unlimited scale and query concurrency
• 80% performance gain using the Parquet data format
"Amazon Redshift Spectrum is a game changer for us. Reports that took minutes to produce are now delivered in seconds. We like the ability to scale compute on-demand to query Petabytes of data in S3 in various open file formats." -- Rafi Ton, CEO, NUVIAD
[Figure: data sources flowing through AWS Glue into Amazon S3, queried by Amazon Redshift and Redshift Spectrum, surfaced in BI tools]
  • 33. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values.
2 tables (1 S3, 1 local), 2 filters, 1 join, 2 GROUP BY columns, 1 ORDER BY, 1 LIMIT, 1 aggregation
SELECT P.ASIN, P.TITLE,
       SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM s3.d_customer_order_item_details D, products P
WHERE D.ASIN = P.ASIN
  AND P.TITLE LIKE '%Potter%'
  AND P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
  • 34. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA.
4 tables (1 S3, 3 local), 8 filters, 3 joins, 4 GROUP BY columns, 1 ORDER BY, 1 LIMIT, 1 aggregation, 1 function, 2 casts
SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE,
       SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R
WHERE D.ASIN = P.ASIN
  AND P.ASIN = A.ASIN
  AND D.REGION_ID = R.REGION_ID
  AND A.EDITION LIKE '%FIRST%'
  AND P.TITLE LIKE '%Potter%'
  AND P.AUTHOR = 'J. K. Rowling'
  AND R.COUNTRY_CODE = 'US'
  AND R.CITY = 'Seattle'
  AND R.STATE = 'WA'
  AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE
  AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
  • 35. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Now let's run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail records for each day over the past 20 years. 190 million files across 15,000 partitions in S3; one partition per day for USA and rest of world. We need a billion-fold reduction in data processed. Running this query using a 1,000-node Hive cluster would take over 5 years.*
• Compression ……………..….…… 5X
• Columnar file format ………....… 10X
• Scanning with 2,500 nodes ... 2,500X
• Static partition elimination …...… 2X
• Dynamic partition elimination .. 350X
• Redshift's query optimizer …..… 40X
Total reduction ……….………… 3.5B X
* Estimated using a 20-node Hive cluster & 1.4 TB, assuming linear scaling
* Query used a 20-node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on the data format used by Amazon Retail
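As a quick check of the arithmetic, the individual reduction factors listed above multiply out to the stated total:
5 × 10 × 2,500 × 2 × 350 × 40 = 3,500,000,000 ≈ 3.5 billion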
  • 36. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Demonstration
[Figure: OLTP, ERP, CRM, LOB, devices, web, sensors, and social data feeding the data warehouse and data lake]
  • 37. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Recommendations
Optimize your warehouse:
• Use Amazon Redshift Spectrum to improve scan-intensive, concurrent query workloads
Optimize your data lake:
• Use column-oriented storage on S3
• Partition your objects (see the sketch after this slide)
Optimize Redshift Spectrum:
• Know your query/node-level parallelism
• Improve query performance with predicate pushdown
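A minimal sketch of a partitioned, Parquet-backed external table; the schema, columns, bucket, and partition value are hypothetical, and the s3 external schema is assumed to be registered as shown earlier:
-- Columnar (Parquet) external table, partitioned by order day.
CREATE EXTERNAL TABLE s3.order_details (
  asin VARCHAR(10)
  ,quantity INT
  ,our_price DECIMAL(8,2)
)
PARTITIONED BY (order_day DATE)
STORED AS PARQUET
LOCATION 's3://example-bucket/order_details/';

-- Register one partition; Spectrum can then skip partitions
-- that do not match a filter on order_day.
ALTER TABLE s3.order_details
ADD PARTITION (order_day = '2017-06-09')
LOCATION 's3://example-bucket/order_details/order_day=2017-06-09/';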
  • 38. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data Lakes on AWS
Services: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Amazon S3, AWS Glue, Redshift, EMR, Athena, Kinesis, Elasticsearch Service
Whitepaper: Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility - https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf
  • 39. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank you!