AWS Summit 2013 Tel Aviv
Oct 16 – Tel Aviv, Israel

Data Warehouse on AWS
Guy Ernest
Solutions Architecture, Amazon Web Se...
ERP

CRM

ANALYST
DATAWAREHOUSE

DB
OLTP

ERP

OLAP

OLTP

CRM

ANALYST
DATAWAREHOUSE

OLTP

DB
Transactional Processing

Analytical Processing

Transactional context

Global context

Latency

Throughput

Indexed acces...
OLTP
OLAP
BUSINESS INTELLIGENCE
REPORTS, DASHBOARD, …

PRODUCTION OFFLOAD
DIFFERENT DATA STRUCTURE, USING ETLs, …

ANALYST
DATAWAREH...
BIG
ENTERPRISES

VERY EXPENSIVE (ROI)
DIFFICULT TO MAINTAIN
NOT SCALABLE
BIG
ENTERPRISES

SME

VERY EXPENSIVE (ROI)
DIFFICULT TO MAINTAIN
NOT SCALABLE

WAY TOO EXPENSIVE !
Jeff Bezos
Data Sources

Value

Queries
+ ELASTIC CAPACITY
+ NO CAPEX
+ PAY FOR WHAT YOU USE
+ DISPOSE ON DEMAND

= NO

CONSTRAINTS
ACCELERATION

COLLECT

STORE ANALYZE  SHARE

AMAZON REDSHIFT
AMAZON REDSHIFT
DWH that scales to petabyte and…
…WAY SIMPLER

AMAZON
REDSHIFT

… WAY FASTER

… WAY LESS EXPENSIVE
AMAZON REDSHIFT
RUNNING ON OPTIMIZED HARDWARE
HS1.8XL: 128 GB RAM, 16 Cores, 16 TB Compressed Data, 2 GB/sec Disk Scan

HS...
Extra Large Node
(HS1.XL)

Single Node (2 TB)

Cluster 2-32 Nodes (4 TB – 64 TB)

Eight Extra Large Node (HS1.8XL)
Cluster...
JDBC/ODBC

10 GigE
(HPC)

Ingestion
Backup
Restoration
…WAY SIMPLER
LOADING DATA

Parallel Loading
Data sorted and distributed
automatically
Linear Growth
DATA SNAPSHOTS
Automatic and Incremental
snapshots in Amazon S3
Configurable Retention Period
Manual Snapshots
“Streaming”...
REPLICATION IN CLUSTER
+
AUTOMATIC SNAPSHOT IN AMAZON S3
+
MONITORING OF CLUSTER NODES
AUTOMATIC RESIZING
Read-only mode while resizing

Parallel node-to-node data copy

New cluster is created in the
background

Only charged for...
Automatic DNS based endpoint cut-over
Deletion of source cluster
CREATE A DATAWAREHOUSE IN MINUTES
…WAY FASTER
MEMORY CAPACITY AND CPU PERFORMANCE
DOUBLE EVERY 2 YEARS
DISK PERFORMANCE
DOUBLE EVERY 10 YEARS
Progress is not evenly distributed

1980
14,000,000$/TB
100MB
4MB/s

Today
÷ 450,000
× 30,000
× 50

30$/TB
3TB...
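
The ratios on this slide can be re-derived from the quoted figures. The sketch below is only a back-of-the-envelope check: the variable names are illustrative, decimal units (1 TB = 1,000,000 MB) are an assumption, and the full-disk scan time is an added derived quantity that shows why capacity growing much faster than throughput makes I/O the bottleneck.

```python
# Figures quoted on the slide: a 1980 disk versus a disk "today" (2013).
# Assumption: decimal units, so 3 TB is taken as 3,000,000 MB.
disk_1980 = {"usd_per_tb": 14_000_000, "capacity_mb": 100, "throughput_mb_s": 4}
disk_2013 = {"usd_per_tb": 30, "capacity_mb": 3_000_000, "throughput_mb_s": 200}

# Improvement factors between the two disks.
cost_drop = disk_1980["usd_per_tb"] / disk_2013["usd_per_tb"]
capacity_gain = disk_2013["capacity_mb"] / disk_1980["capacity_mb"]
throughput_gain = disk_2013["throughput_mb_s"] / disk_1980["throughput_mb_s"]

# Derived quantity (not on the slide): time to read a whole disk sequentially.
full_scan_1980_s = disk_1980["capacity_mb"] / disk_1980["throughput_mb_s"]
full_scan_2013_s = disk_2013["capacity_mb"] / disk_2013["throughput_mb_s"]

print(f"cost per TB divided by  {cost_drop:,.0f}")   # the slide rounds this to 450,000
print(f"capacity multiplied by  {capacity_gain:,.0f}")
print(f"throughput multiplied by {throughput_gain:,.0f}")
print(f"full scan: {full_scan_1980_s:.0f} s in 1980 vs {full_scan_2013_s / 3600:.1f} h today")
```

Capacity grew 30,000 times while throughput grew 50 times, so reading an entire disk takes 600 times longer than it did in 1980.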
I/O IS THE MAIN FACTOR FOR PERFORMANCE
Id

• COMPRESSION PER COLUMN

• ZONE MAPS
• HARDWARE OPTIMIZED
• LARGE DATA BLOCK SIZE

State

123

• COLUMNAR STORAGE

Age...
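
To make the columnar storage and zone map ideas concrete, here is a minimal sketch using the three example rows from this slide (Id, Age, State). It is an illustration only and not Redshift's internal implementation: the block size of 2 rows, the function names, and the in-memory lists are all assumptions chosen to keep the example small.

```python
# Example rows from the slide: (Id, Age, State).
rows = [(123, 20, "CA"), (345, 25, "WA"), (678, 40, "FL")]

# Columnar layout: each column is stored contiguously instead of row by row.
columns = {
    "id": [r[0] for r in rows],
    "age": [r[1] for r in rows],
    "state": [r[2] for r in rows],
}

BLOCK_ROWS = 2  # hypothetical, tiny block size; real blocks hold far more values


def build_zone_map(values, block_rows=BLOCK_ROWS):
    """Record the (min, max) of every block of a column."""
    return [
        (min(values[i:i + block_rows]), max(values[i:i + block_rows]))
        for i in range(0, len(values), block_rows)
    ]


def scan_greater_than(values, zone_map, threshold, block_rows=BLOCK_ROWS):
    """Return matching row positions and how many blocks had to be read."""
    hits, blocks_read = [], 0
    for block_index, (_low, high) in enumerate(zone_map):
        if high <= threshold:
            continue  # no value in this block can match, so skip the I/O
        blocks_read += 1
        start = block_index * block_rows
        for offset, value in enumerate(values[start:start + block_rows]):
            if value > threshold:
                hits.append(start + offset)
    return hits, blocks_read


age_zone_map = build_zone_map(columns["age"])
positions, blocks_read = scan_greater_than(columns["age"], age_zone_map, 30)
matching_states = [columns["state"][p] for p in positions]
print(age_zone_map, positions, blocks_read, matching_states)
```

A query such as "states of people older than 30" touches only the Age and State columns, and the zone map lets the scan skip the first Age block entirely because its maximum is 25.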
TEST:
2 BILLION RECORDS
6 REPRESENTATIVE QUERIES
AMAZON REDSHIFT 2xHS1.8XL
Vs.
32 NODES, 4.2TB RAM, 1.6PB
12x - 150x FASTER
30 MINUTES


12 SECONDS
…WAY LESS EXPENSIVE
2x HS1.8XL
3.65$ / HOUR

32 000$ / YEAR
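
The yearly figure follows from the hourly one if the cluster runs around the clock. The sketch below checks that arithmetic; the helper name is illustrative, and the per-TB line assumes the 16 TB of compressed data per HS1.8XL node quoted on the hardware slide.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the cluster never stops


def yearly_cost(hourly_usd, hours=HOURS_PER_YEAR):
    """Cost of running continuously for one year at a fixed hourly price."""
    return hourly_usd * hours


cluster_hourly_usd = 3.65   # 2x HS1.8XL, as quoted on the slide
cluster_storage_tb = 2 * 16  # assumption: 16 TB compressed data per HS1.8XL node

cluster_yearly_usd = yearly_cost(cluster_hourly_usd)
yearly_usd_per_tb = cluster_yearly_usd / cluster_storage_tb

print(f"{cluster_yearly_usd:,.0f} $ per year")        # the slide rounds this to 32 000 $
print(f"{yearly_usd_per_tb:,.0f} $ per TB per year")
```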
Instance HS1.XL per
hour

Hourly Price per TB

Yearly Price per TB

On-Demand

0.850 $

0.425 $

3 723 $

1 Year
Reservati...
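
The per-TB columns of this pricing table can be derived from the HS1.XL instance price. The sketch below assumes the 2 TB of storage per HS1.XL node stated earlier in the deck and continuous use over a year; the dictionary and function names are illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous use
NODE_CAPACITY_TB = 2       # HS1.XL: 2 TB of compressed data per node

# Hourly HS1.XL instance prices quoted in the deck (USD).
hourly_usd_per_node = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Years Reservation": 0.228,
}


def price_per_tb(hourly_node_usd, capacity_tb=NODE_CAPACITY_TB):
    """Return (hourly price per TB, yearly price per TB) for one node price."""
    hourly = hourly_node_usd / capacity_tb
    return hourly, hourly * HOURS_PER_YEAR


pricing_table = {plan: price_per_tb(usd) for plan, usd in hourly_usd_per_node.items()}

for plan, (hourly, yearly) in pricing_table.items():
    print(f"{plan:<20} {hourly:.3f} $/TB/hour  {yearly:,.0f} $/TB/year")
```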
October, 2013

Intel Analytics on AWS
Assaf Araki

Intel Confidential
Agenda
• Advanced Analytics @ Intel
• Enterprise on the Cloud
• Use Case

Intel AA Team

Advanced Analytics


Vision: Make analytics a competitive advantage for Intel
Mission:
• Solve strategi...
AA Overview

Big Data Analytics Platform

Highly scalable, hybrid platform to support a range of
business use cases
Pre...
Enterprise On the Cloud

Why Cloud ?

• Known reasons
– Reduce cost
– Universal access
– Scale fast

• Additional reasons
...
Use Case

Use Case
Characteristics:
– CPU behavior data
– Size: 30TB of data per month
– Type: Structured data
– ...
Enterprise On the Cloud

Platforms

On premise
– Hbase – Hadoop platform exists
• No Hbase
– MPP DB – Exists with Ma...
Enterprise On the Cloud

Evaluation Criteria

• Capabilities
– Create statistics calculations

• Cost of HW per TB
– Repli...
Use Case

Preliminary Results
Dataset example
– 34GB compressed data divided into files
– ~1,500,000,000 records
– 24B...
Use Case

Capabilities and Cost

• No current ability to write code (Java/C++/Python/R)
– Implement statistics and algorit...
assaf.araki@intel.com
Thank You!
USE CASE
AMAZON EC2
AMAZON
DYNAMODB

AMAZON RDS

AMAZON
REDSHIFT

AMAZON ELASTIC
MAPREDUCE

AMAZON S3

DATA CENTER

AWS STORAGE
GAT...
UPLOAD TO AMAZON S3
AWS IMPORT/EXPORT
AWS DIRECT CONNECT

DATA
INTEGRATION

INTEGRATION
SYSTEMS
MEMBER REGISTRATION
15 million

2 million

2011

2012

2013
1,500,000+
NEW MEMBERS EACH MONTH
1,200,000,000+
SOCIAL CONNECTIONS IMPORTED
Join via Facebook

Raw Data

Amazon S3

Web Servers

Add a Skill Page

User Action Trace Events

Invite Friends

Get Data
...
ELASTIC DATA WAREHOUSE
Monthly Reports on
a new cluster
S3

EMR

Redshift

Reporting
and BI
OLTP
Web Apps

DynamoDB

Redshift

Reporting
and BI
OLTP
ERP

RDBMS

Redshift

Reporting
& BI
OLTP
ERP

RDBMS

+

Redshift

Reporting
& BI
JDBC/ODBC

Amazon Redshift
DATAWAREHOUSE BY AWS
Simple to use and scalable

Pay per use, no CAPEX
Low cost for high performance
Open and integrate w...
Speed and Agility
“On Premise”
Fewer Experiments

Frequent Experiments

High Cost of Failures

Low Cost of Failure

Less I...
Thank you very much
AWS Summit Tel Aviv - Enterprise Track - Data Warehouse

  1. 1. AWS Summit 2013 Tel Aviv Oct 16 – Tel Aviv, Israel Data Warehouse on AWS Guy Ernest Solutions Architecture, Amazon Web Services
  2. 2. ERP CRM ANALYST DATAWAREHOUSE DB
  3. 3. OLTP ERP OLAP OLTP CRM ANALYST DATAWAREHOUSE OLTP DB
  4. 4. Transactional Processing vs. Analytical Processing: Transactional context vs. Global context; Latency vs. Throughput; Indexed access vs. Full table scans; Random IO vs. Sequential IO; Disk seek times vs. Disk transfer rate
  5. 5. OLTP OLAP
  6. 6. BUSINESS INTELLIGENCE REPORTS, DASHBOARD, … PRODUCTION OFFLOAD DIFFERENT DATA STRUCTURE, USING ETLs, … ANALYST DATAWAREHOUSE
  7. 7. BIG ENTERPRISES: VERY EXPENSIVE (ROI), DIFFICULT TO MAINTAIN, NOT SCALABLE
  8. 8. BIG ENTERPRISES: VERY EXPENSIVE (ROI), DIFFICULT TO MAINTAIN, NOT SCALABLE. SME: WAY TOO EXPENSIVE !
  9. 9. Jeff Bezos
  10. 10. Data Sources Value Queries
  11. 11. + ELASTIC CAPACITY + NO CAPEX + PAY FOR WHAT YOU USE + DISPOSE ON DEMAND = NO CONSTRAINTS
  12. 12. ACCELERATION COLLECT STORE ANALYZE  SHARE AMAZON REDSHIFT
  13. 13. AMAZON REDSHIFT
  14. 14. DWH that scales to petabyte and… …WAY SIMPLER AMAZON REDSHIFT … WAY FASTER … WAY LESS EXPENSIVE
  15. 15. AMAZON REDSHIFT RUNNING ON OPTIMIZED HARDWARE HS1.8XL: 128 GB RAM, 16 Cores, 16 TB Compressed Data, 2 GB/sec Disk Scan HS1.XL: 16 GB RAM, 2 Cores, 2 TB Compressed Data
  16. 16. Extra Large Node (HS1.XL): Single Node (2 TB); Cluster 2-32 Nodes (4 TB – 64 TB). Eight Extra Large Node (HS1.8XL): Cluster 2-100 Nodes (32 TB – 1.6 PB)
  17. 17. JDBC/ODBC 10 GigE (HPC) Ingestion Backup Restoration
  18. 18. …WAY SIMPLER
  19. 19. LOADING DATA Parallel Loading Data sorted and distributed automatically Linear Growth
  20. 20. DATA SNAPSHOTS Automatic and Incremental snapshots in Amazon S3 Configurable Retention Period Manual Snapshots “Streaming” Restore
  21. 21. REPLICATION IN CLUSTER + AUTOMATIC SNAPSHOT IN AMAZON S3 + MONITORING OF CLUSTER NODES
  22. 22. AUTOMATIC RESIZING
  23. 23. Read-only mode while resizing Parallel node-to-node data copy New cluster is created in the background Only charged for a single cluster
  24. 24. Automatic DNS based endpoint cut-over Deletion of source cluster
  25. 25. CREATE A DATAWAREHOUSE IN MINUTES
  26. 26. …WAY FASTER
  27. 27. MEMORY CAPACITY AND CPU PERFORMANCE DOUBLE EVERY 2 YEARS; DISK PERFORMANCE DOUBLES EVERY 10 YEARS
  28. 28. Progress is not evenly distributed. 1980: 14,000,000$/TB, 100MB, 4MB/s. Today: 30$/TB (÷ 450,000), 3TB (× 30,000), 200MB/s (× 50)
  29. 29. I/O IS THE MAIN FACTOR FOR PERFORMANCE
  30. 30. COLUMNAR STORAGE; COMPRESSION PER COLUMN; ZONE MAPS; HARDWARE OPTIMIZED; LARGE DATA BLOCK SIZE. Example table (Id, Age, State): (123, 20, CA), (345, 25, WA), (678, 40, FL)
  31. 31. TEST: 2 BILLION RECORDS, 6 REPRESENTATIVE QUERIES
  32. 32. AMAZON REDSHIFT 2xHS1.8XL Vs. 32 NODES, 4.2TB RAM, 1.6PB
  33. 33. 12x - 150x FASTER
  34. 34. 30 MINUTES → 12 SECONDS
  35. 35. …WAY LESS EXPENSIVE
  36. 36. 2x HS1.8XL: 3.65$ / HOUR, 32 000$ / YEAR
  37. 37. HS1.XL pricing (instance per hour / hourly price per TB / yearly price per TB): On-Demand 0.850 $ / 0.425 $ / 3 723 $; 1 Year Reservation 0.500 $ / 0.250 $ / 2 190 $; 3 Years Reservation 0.228 $ / 0.114 $ / 999 $
  38. 38. October, 2013 Intel Analytics on AWS Assaf Araki Intel Confidential
  39. 39. Agenda: Advanced Analytics @ Intel; Enterprise on the Cloud; Use Case
  40. 40. Intel AA Team: Advanced Analytics. Vision: Make analytics a competitive advantage for Intel. Mission: solve strategic high value business line problems; leverage analytics to grow Intel revenue. About the team: ~100 employees, corporate ownership of advanced analytics; Big data and Machine Learning are key focus areas; Skills: Software Engineering / Decision Science / Business Acumen; Value driven: ROI>$10M and/or key corporate problem as defined by VPs; Part of the Israel Academy Computational research center
  41. 41. AA Overview: Big Data Analytics Platform. Highly scalable, hybrid platform to support a range of business use cases. Prediction Module; MPP; High Speed Data Loader. Heterogeneous data, batch oriented on advanced analytics. Rich advanced analytics and realtime, in-database data mining capabilities
  42. 42. Enterprise On the Cloud: Why Cloud? Known reasons: reduce cost; universal access; scale fast. Additional reasons: flexible & agile platform (no need to certify each tool by engineering team); development accelerator (R&D team can start developing while engineering teams implement the platform on premise)
  43. 43. Use Case. Characteristics: CPU behavior data; Size: 30TB of data per month; Type: structured data; Processing: create aggregation facts and grant ad hoc analysis, create ML solutions. Current Status: data is sampled and processed on SMP RDBMS; takes almost 24 hours to process the entire data. Problem Statement: limited ability to analyze all data
  44. 44. Enterprise On the Cloud: Platforms. On premise: Hbase: Hadoop platform exists, no Hbase; MPP DB: exists with Machine Learning capabilities; lower cost platform: evaluate and purchase. Cloud: HBase: EMR; MPP DB: AWS Redshift. Go for POC on the Cloud
  45. 45. Enterprise On the Cloud: Evaluation Criteria. Capabilities: create statistics calculations. Cost of HW per TB: replication, compression. Performance: load, transformation, querying. Scalability. Ability to execute
  46. 46. Use Case: Preliminary Results. Dataset example: 34GB compressed data divided into files; ~1,500,000,000 records; 24B compressed, 240B per record (~15 columns). Performance & Scalability (8 x 1XL nodes): load time for 32 files: 2 hours (4 files: 5 hours); table size: 202GB (compression rate ~1.5:1); SQL aggregation statements: 38K records in 6 minutes; 14M records in 7 minutes; 66M records in 11 minutes (on 4 x 1XL: 22 minutes); 939M records in 34 minutes (on 4 x 1XL: 77 minutes)
  47. 47. Use Case: Capabilities and Cost. No current ability to write code (Java/C++/Python/R): implement statistics and algorithms in SQL. Compression is not straightforward. Cost sensitive to actual compression: 2.6 : 1 is break even; 8XL vs. High Storage instance (16 cores 48TB); 3 years with 100% utilization
  48. 48. assaf.araki@intel.com
  49. 49. Thank You!
  50. 50. USE CASE
  51. 51. AMAZON EC2 AMAZON DYNAMODB AMAZON RDS AMAZON REDSHIFT AMAZON ELASTIC MAPREDUCE AMAZON S3 DATA CENTER AWS STORAGE GATEWAY
  52. 52. UPLOAD TO AMAZON S3 AWS IMPORT/EXPORT AWS DIRECT CONNECT DATA INTEGRATION INTEGRATION SYSTEMS
  53. 53. MEMBER REGISTRATION: 2 million to 15 million (2011, 2012, 2013)
  54. 54. 1,500,000+ NEW MEMBERS EACH MONTH
  55. 55. 1,200,000,000+ SOCIAL CONNECTIONS IMPORTED
  56. 56. Diagram labels: Join via Facebook; Raw Data; Amazon S3; Web Servers; Add a Skill Page; User Action Trace Events; Invite Friends; Get Data; Aggregated Data; Amazon Redshift; Amazon S3; Raw Events; EMR; Tableau; Excel; Data Analyst; Internal Web. Hive Scripts Process Content: process log files with regular expressions to parse out the info we need; process cookies into useful searchable data such as Session, UserId, API Security token; filter surplus info like internal varnish logging.
  57. 57. ELASTIC DATA WAREHOUSE
  58. 58. Monthly Reports on a new cluster
  59. 59. S3 EMR Redshift Reporting and BI
  60. 60. OLTP Web Apps DynamoDB Redshift Reporting and BI
  61. 61. OLTP ERP RDBMS Redshift Reporting & BI
  62. 62. OLTP ERP RDBMS + Redshift Reporting & BI
  63. 63. JDBC/ODBC Amazon Redshift
  64. 64. DATAWAREHOUSE BY AWS: Simple to use and scalable; Pay per use, no CAPEX; Low cost for high performance; Open, and integrates with existing BI tools
  65. 65. Speed and Agility. “On Premise”: Fewer Experiments; High Cost of Failures; Less Innovation. In contrast: Frequent Experiments; Low Cost of Failure; More Innovation
  66. 66. Thank you very much