AWS Summit 2013 Milan
31 October 2013

DATA ANALYSIS ON AWS
Hakan Gurel
Solutions Architecture
THE COST OF
GENERATING DATA
IS FALLING
THE MORE DATA YOU COLLECT
THE MORE VALUE YOU CAN
DERIVE FROM IT
Lower cost,
higher throughput



GENERATE  STORE  ANALYZE  SHARE
Highly
constrained
DATA VOLUME

Generated data

Available for analysis

Gartner: User Survey Analysis: Key Trends Shaping the Future of Data ...
ACCELERATE

GENERATE 

STORE  ANALYZE  SHARE
+ ELASTIC AND HIGHLY SCALABLE
+ NO UPFRONT CAPITAL EXPENSE
+ PAY FOR ONLY WHAT YOU USE
+ AVAILABLE ON-DEMAND

= REMOVE

CO...
AWS Import / Export
AWS Direct Connect

GENERATE  STORE  ANALYZE  SHARE
Generated and stored in AWS
Inbound data transfer is free
Multipart upload to S3
Physical media
AWS Direct Connect
Regiona...
Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Storage Gateway,
Data on Amazon EC2

GENERATE...
AMAZON S3
SIMPLE STORAGE SERVICE
AMAZON
DYNAMODB
HIGH-PERFORMANCE, FULLY MANAGED
NoSQL DATABASE SERVICE
DURABLE &
AVAILABLE
CONSISTENT, DISK-ONLY
WRITES (SSD)
LOW LATENCY
AVERAGE READS < 5MS,
WRITES < 10MS
NO ADMINISTRATION
500,000 WRITES PER SECOND
DURING SUPER BOWL
AMAZON
REDSHIFT
FULLY MANAGED, PETABYTE-SCALE
DATA WAREHOUSE ON AWS
DESIGN OBJECTIVES:
A petabyte-scale data warehouse service that was…

A Lot Faster

AMAZON
REDSHIFT

A Lot Cheaper
A Whole...
AMAZON REDSHIFT
RUNS ON OPTIMIZED HARDWARE
HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rat...
30 MINUTES
DOWN TO

12 SECONDS
AMAZON REDSHIFT LETS YOU
START SMALL AND GROW BIG
Extra Large Node (HS1.XL)

Eight Extra Large Node (HS1.8XL)
Cluster 2-10...
CREATE A DATA WAREHOUSE IN
MINUTES
JDBC/ODBC
Price Per Hour for
HS1.XL Single
Node

Effective Hourly
Price Per TB

Effective Annual
Price per TB

On-Demand

$ 0.850

$...
DATA WAREHOUSING DONE THE AWS WAY
Easy to provision and scale up massively

No upfront costs, pay as you go
Really fast pe...
USAGE SCENARIOS
Cloud ETL for Big Data
S3

EMR

Redshift

Reporting
and BI

• Maintain online SQL access to historical logs
• Transformati...
Live archive for (structured) Big Data

OLTP
Web Apps

DynamoDB

Redshift

• Direct integration with copy command
• H...
Reporting Warehouse
OLTP
ERP

RDBMS

Redshift

• Accelerated operational reporting
• Support for short-time use cases
• Da...
On-Premises Integration
OLTP
ERP

RDBMS
Redshift

+

Reporting
& BI
GENERATE  STORE  ANALYZE  SHARE
Amazon EC2
Amazon Elastic
MapReduce
AMAZON EC2
ELASTIC COMPUTE CLOUD
CLUSTER GPU

QUADRUPLE EXTRA LARGE
Intel Xeon X5570, quad-core
Nehalem architecture
NVIDIA Tesla Fermi
M2050 GPUs
22 GB of...
ON A SINGLE INSTANCE

COMPUTE TIME: 4h
COST: 4h x $2.1 = $8.4
ON MULTIPLE INSTANCES

COMPUTE TIME: 1h
COST: 1h x 4 x $2.1 = $8.4
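
The two cost slides above make one point: with per-hour on-demand pricing, four instances for one hour cost the same as one instance for four hours, so the result arrives four times sooner at no extra cost. A minimal Python sketch of that arithmetic follows. The function name `job_cost` is hypothetical, the $2.1 hourly rate is the figure from the slides, and the sketch assumes the work parallelizes perfectly across instances.

```python
def job_cost(hours: float, instances: int, hourly_rate: float) -> float:
    """Total on-demand cost in dollars: instance-hours times the hourly rate."""
    return hours * instances * hourly_rate


HOURLY_RATE = 2.1  # dollars per instance-hour, as quoted on the slides

# One instance running for four hours.
single = job_cost(hours=4, instances=1, hourly_rate=HOURLY_RATE)

# Four instances running for one hour (assumes perfect parallel speed-up).
parallel = job_cost(hours=1, instances=4, hourly_rate=HOURLY_RATE)

print(f"single instance: ${single:.2f} after 4h")
print(f"four instances:  ${parallel:.2f} after 1h")
```

In practice the speed-up is rarely perfectly linear, so the parallel run may cost somewhat more than the single-instance run.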
For 3 hours
$4828.85/hr

instead of
$20+ MILLIONS
in infrastructure
AMAZON ELASTIC
MAPREDUCE
HADOOP AS A SERVICE
• A FRAMEWORK
• SPLITS DATA INTO PIECES
• LETS PROCESSING OCCUR
• GATHERS THE RESULTS
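
The split, process and gather steps listed above can be illustrated with a small single-process word count in Python. This is only a sketch of the MapReduce idea, not the Hadoop or Elastic MapReduce API. The function names and the sample log lines are hypothetical.

```python
from collections import Counter
from functools import reduce


def split_into_pieces(lines, piece_size):
    """Split the input into fixed-size pieces (the 'splits data' step)."""
    return [lines[i:i + piece_size] for i in range(0, len(lines), piece_size)]


def process_piece(piece):
    """Count words in one piece; on a cluster each piece could go to its own node."""
    counts = Counter()
    for line in piece:
        counts.update(line.lower().split())
    return counts


def gather(partials):
    """Merge the per-piece counts into one result (the 'gathers the results' step)."""
    return reduce(lambda a, b: a + b, partials, Counter())


# Hypothetical sample input standing in for application logs.
log_lines = [
    "GET /lyrics",
    "GET /artists",
    "POST /lyrics",
    "GET /lyrics",
]

pieces = split_into_pieces(log_lines, piece_size=2)
word_counts = gather(process_piece(p) for p in pieces)
print(dict(word_counts))
```

Because each piece is processed independently, the middle step is the one that scales out when more nodes are added.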
Corporate Data
Center

Application data
and logs for
analysis pushed
to S3

Elastic Data
Center
Amazon Elastic
Map Reduce
name node to
control analysis
N

Corporate Data
Center

Elastic Data
Center
N

Corporate Data
Center

Hadoop cluster
started by Elastic
Map Reduce

Elastic Data
Center
N

Corporate Data
Center

Adding many
hundreds or
thousands of
nodes
Elastic Data
Center
Disposed of when
job completes

N

Corporate Data
Center

Elastic Data
Center
Corporate Data
Center

Results of
analysis pulled
back into your
systems

Elastic Data
Center
Amazon S3,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
Data on Amazon EC2

GENERATE  STORE  ANALYZE  SHARE
GENERATE  STORE  ANALYZE  SHARE

AWS Data Pipeline
AWS Data Pipeline
Data-intensive orchestration and automation
Reliable and scheduled
Easy to use, drag and drop
Execution ...
AWS Import / Export
AWS Direct Connect

Amazon S3,
Amazon Glacier,
Amazon DynamoDB,
Amazon RDS,
Amazon Redshift,
AWS Stora...
FROM DATA TO
ACTIONABLE
INFORMATION
Stefano Rodighiero
MXM FACTS
7+ million lyrics catalogue in more than 50
distinct languages

Currently musiXmatch is the only lyrics
platform...
SYNCED LYRICS
OUR DATA

MUSIC
METADATA:
RECORDING &
PUBLISHING
OUR DATA

CONTENT USAGE
OUR DATA

OTHER SOURCES
DATA ANALYSIS @ MXM

CONTENT
USAGE:
REPORTING &
ANALYTICS
DATAFLOW

Frontend

Filter/normalization

Redis
(real time
analytics)

"Unrolling"

Redshift

Hive

Analytics

Post
proce...
BATCH REPORTING
Step 1. Aggregation of views by country,
application and content type
Step 2. Join with a 500M+ rows table...
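
The two batch steps above (aggregate, then join) can be sketched on a toy scale. The example below uses Python's built-in `sqlite3` module in place of Hive so that it runs anywhere. The table names, column names, join key and sample rows are all hypothetical: the slides do not show the real schema, and the small `catalogue` table merely stands in for the 500M+ row table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: one row per content view.
    CREATE TABLE views (country TEXT, application TEXT,
                        content_type TEXT, track_id INTEGER);
    -- Hypothetical stand-in for the large table joined in Step 2.
    CREATE TABLE catalogue (track_id INTEGER PRIMARY KEY, publisher TEXT);

    INSERT INTO views VALUES
        ('IT', 'mobile', 'lyrics', 1),
        ('IT', 'mobile', 'lyrics', 1),
        ('US', 'web',    'lyrics', 2);
    INSERT INTO catalogue VALUES (1, 'Publisher A'), (2, 'Publisher B');
""")

# Step 1 (inner query): aggregate views by country, application and
# content type; track_id stays in the grouping so Step 2 has a join key.
# Step 2 (outer query): join the aggregate with the catalogue table.
report = conn.execute("""
    SELECT a.country, a.application, a.content_type, c.publisher, a.view_count
    FROM (
        SELECT country, application, content_type, track_id,
               COUNT(*) AS view_count
        FROM views
        GROUP BY country, application, content_type, track_id
    ) AS a
    JOIN catalogue AS c ON c.track_id = a.track_id
    ORDER BY a.country
""").fetchall()

for row in report:
    print(row)
conn.close()
```

Aggregating before the join keeps the joined row count small, which is one common way to structure this kind of report.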
DATAFLOW
Frontend
proxy

Filter/normalization

Redis
(real time
analytics)

"Unrolling"

Redshift

Hive

Analytics

Publi...
INTERACTIVE ANALYTICS
SQL interface like Hive, accessible with any
PostgreSQL client...

Redis
(real time
analytics)

Reds...
DATAFLOW
Frontend
proxy

Filter/normalization

Redis
(real time
analytics)

"Unrolling"

Redshift

Hive

Analytics

Post
...
MUSIXMATCH

Stefano Rodighiero
stefano@musixmatch.com
@larsen

Words matter
MUSIXMATCH

THANK YOU!
THANK YOU
hakan@amazon.lu
AWS Summit Milan - Data Analysis

  1. AWS Summit 2013 Milan 31 October 2013 DATA ANALYSIS ON AWS Hakan Gurel Solutions Architecture
  2. THE COST OF GENERATING DATA IS FALLING
  3. THE MORE DATA YOU COLLECT THE MORE VALUE YOU CAN DERIVE FROM IT
  4. Lower cost, higher throughput  GENERATE  STORE  ANALYZE  SHARE Highly constrained
  5. DATA VOLUME Generated data Available for analysis Sources: Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  6. ACCELERATE GENERATE  STORE  ANALYZE  SHARE
  7. + ELASTIC AND HIGHLY SCALABLE + NO UPFRONT CAPITAL EXPENSE + PAY FOR ONLY WHAT YOU USE + AVAILABLE ON-DEMAND = REMOVE CONSTRAINTS
  8. AWS Import / Export AWS Direct Connect GENERATE  STORE  ANALYZE  SHARE
  9. Generated and stored in AWS Inbound data transfer is free Multipart upload to S3 Physical media AWS Direct Connect Regional replication of AMIs and snapshots
  10. Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2 GENERATE  STORE  ANALYZE  SHARE
  11. AMAZON S3 SIMPLE STORAGE SERVICE
  12. AMAZON DYNAMODB HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
  13. DURABLE & AVAILABLE CONSISTENT, DISK-ONLY WRITES (SSD)
  14. LOW LATENCY AVERAGE READS < 5MS, WRITES < 10MS
  15. NO ADMINISTRATION
  16. 500,000 WRITES PER SECOND DURING SUPER BOWL
  17. AMAZON REDSHIFT FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
  18. DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was… A Lot Faster, A Lot Cheaper, A Whole Lot Simpler AMAZON REDSHIFT
  19. AMAZON REDSHIFT RUNS ON OPTIMIZED HARDWARE HS1.8XL: 128 GB RAM, 16 Cores, 16 TB compressed user storage, 2 GB/sec scan rate; HS1.XL: 16 GB RAM, 2 Cores, 2 TB compressed customer storage
  20. 30 MINUTES DOWN TO 12 SECONDS
  21. AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG Extra Large Node (HS1.XL): Single Node (2 TB), Cluster 2-32 Nodes (4 TB – 64 TB); Eight Extra Large Node (HS1.8XL): Cluster 2-100 Nodes (32 TB – 1.6 PB)
  22. CREATE A DATA WAREHOUSE IN MINUTES
  23. JDBC/ODBC
  24. Price Per Hour for HS1.XL Single Node | Effective Hourly Price Per TB | Effective Annual Price per TB. On-Demand: $0.850 | $0.425 | $3,723; 1 Year Reservation: $0.500 | $0.250 | $2,190; 3 Year Reservation: $0.228 | $0.114 | $999
  25. DATA WAREHOUSING DONE THE AWS WAY Easy to provision and scale up massively; No upfront costs, pay as you go; Really fast performance at a really low price; Open and flexible with support for popular tools
  26. USAGE SCENARIOS
  27. Cloud ETL for Big Data S3 EMR Redshift Reporting and BI • Maintain online SQL access to historical logs • Transformation and enrichment with EMR • Longer history ensures better insight
  28. Live archive for (structured) Big Data OLTP Web Apps DynamoDB Redshift Reporting and BI • Direct integration with copy command • High velocity data • Data ages into Redshift • Low cost, high scale option for new apps
  29. Reporting Warehouse OLTP ERP RDBMS Redshift Reporting and BI • Accelerated operational reporting • Support for short-time use cases • Data compression, index redundancy
  30. On-Premises Integration OLTP ERP RDBMS Redshift + Reporting & BI
  31. GENERATE  STORE  ANALYZE  SHARE Amazon EC2 Amazon Elastic MapReduce
  32. AMAZON EC2 ELASTIC COMPUTE CLOUD
  33. CLUSTER GPU QUADRUPLE EXTRA LARGE 2x Intel Xeon X5570, quad-core Nehalem architecture; 2x NVIDIA Tesla Fermi M2050 GPUs; 22 GB of memory – 1.7 TB of storage
  34. ON A SINGLE INSTANCE COMPUTE TIME: 4h COST: 4h x $2.1 = $8.4
  35. ON MULTIPLE INSTANCES COMPUTE TIME: 1h COST: 1h x 4 x $2.1 = $8.4
  36. For 3 hours $4828.85/hr instead of $20+ MILLIONS in infrastructure
  37. AMAZON ELASTIC MAPREDUCE HADOOP AS A SERVICE
  38. • A FRAMEWORK • SPLITS DATA INTO PIECES • LETS PROCESSING OCCUR • GATHERS THE RESULTS
  39. Corporate Data Center Application data and logs for analysis pushed to S3 Elastic Data Center
  40. Amazon Elastic Map Reduce name node to control analysis N Corporate Data Center Elastic Data Center
  41. N Corporate Data Center Hadoop cluster started by Elastic Map Reduce Elastic Data Center
  42. N Corporate Data Center Adding many hundreds or thousands of nodes Elastic Data Center
  43. Disposed of when job completes N Corporate Data Center Elastic Data Center
  44. Corporate Data Center Results of analysis pulled back into your systems Elastic Data Center
  45. Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2 GENERATE  STORE  ANALYZE  SHARE
  46. GENERATE  STORE  ANALYZE  SHARE AWS Data Pipeline
  47. AWS Data Pipeline Data-intensive orchestration and automation; Reliable and scheduled; Easy to use, drag and drop; Execution and retry logic; Map data dependencies; Create and manage compute resources
  48. AWS Import / Export AWS Direct Connect Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2 Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2 GENERATE  STORE  ANALYZE  SHARE Amazon EC2 Amazon Elastic MapReduce AWS Data Pipeline
  49. FROM DATA TO ACTIONABLE INFORMATION
  50. Stefano Rodighiero
  51. MXM FACTS 7+ million lyrics catalogue in more than 50 distinct languages. Currently musiXmatch is the only lyrics platform allowed for worldwide licensing and has deals with top Music Publishers: Warner Chappell, Universal, BMG, EMI Publishing, Sony ATV, Peer Music, ... Daily updated with more than 1 million artists and more than 20 million music tracks. Synced lyrics! Music Discography Meta Data: Lyrics, Artists, Albums, Songs, Biographies, Worldwide Charts. Words matter
  52. SYNCED LYRICS
  53. OUR DATA MUSIC METADATA: RECORDING & PUBLISHING
  54. OUR DATA CONTENT USAGE
  55. OUR DATA OTHER SOURCES
  56. DATA ANALYSIS @ MXM CONTENT USAGE: REPORTING & ANALYTICS
  57. DATAFLOW Frontend, Filter/normalization, Redis (real time analytics), "Unrolling", Redshift, Hive, Analytics, Post process, Publishing catalogue, Batch
  58. BATCH REPORTING Step 1. Aggregation of views by country, application and content type. Step 2. Join with a 500M+ row table. It takes approx. 1 hour with 5 c1.xlarge instances. It used to take days with traditional techniques! SQL interface makes it easier to review and share the process (Hive, Post process, Batch)
  59. DATAFLOW Frontend proxy, Filter/normalization, Redis (real time analytics), "Unrolling", Redshift, Hive, Analytics, Publishing catalogue, Post process, Interactive
  60. INTERACTIVE ANALYTICS SQL interface like Hive, accessible with any PostgreSQL client... but faster! Flexible costs. With Redshift doing all the heavy lifting, it's easier to build analytics tools (Redis (real time analytics), Redshift, Analytics, Interactive)
  61. DATAFLOW Frontend proxy, Filter/normalization, Redis (real time analytics), "Unrolling", Redshift, Hive, Analytics, Post process, Publishing catalogue, Interactive, Batch
  62. MUSIXMATCH Stefano Rodighiero stefano@musixmatch.com @larsen Words matter
  63. MUSIXMATCH THANK YOU!
  64. THANK YOU hakan@amazon.lu
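
The Redshift pricing figures on slide 24 can be checked with a short calculation. The slides do not state how the effective prices were derived, so this Python sketch makes two assumptions: the hourly node price is divided by the 2 TB of compressed storage quoted for an HS1.XL node on slide 19, and a year is 8,760 hours. Under those assumptions it reproduces the slide's per-TB figures. The name `effective_prices` is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours (assumption: non-leap year)
NODE_STORAGE_TB = 2        # HS1.XL compressed storage per node, per slide 19

# Price per hour for a single HS1.XL node, as listed on slide 24.
hourly_node_price = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(price_per_node_hour):
    """Return (hourly price per TB, annual price per TB) for one node price."""
    per_tb_hour = price_per_node_hour / NODE_STORAGE_TB
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return per_tb_hour, per_tb_year


for plan, price in hourly_node_price.items():
    per_tb_hour, per_tb_year = effective_prices(price)
    print(f"{plan}: ${per_tb_hour:.3f} per TB-hour, ${per_tb_year:,.0f} per TB-year")
```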