AWS Summit 2013 Barcelona
Oct 24 – Barcelona, Spain

DATA ANALYSIS ON AWS
Carlos Conde
Sr. Mgr. Solutions Architecture
AWS Summit Barcelona - Data Analysis on AWS
Transcript of "AWS Summit Barcelona - Data Analysis on AWS"

  1. AWS Summit 2013 Barcelona, Oct 24 – Barcelona, Spain. DATA ANALYSIS ON AWS. Carlos Conde, Sr. Mgr. Solutions Architecture
  2. GENERATE → STORE → ANALYZE → SHARE
  3. THE COST OF DATA GENERATION IS FALLING
  4. THE MORE DATA YOU COLLECT, THE MORE VALUE YOU CAN DERIVE FROM IT
  5. Lower cost, higher throughput. GENERATE → STORE → ANALYZE → SHARE
  6. Lower cost, higher throughput. GENERATE → STORE → ANALYZE → SHARE. Highly constrained
  7. DATA VOLUME. Generated data; Available for analysis. Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
  8. GENERATE → STORE → ANALYZE → SHARE
  9. ACCELERATE. GENERATE → STORE → ANALYZE → SHARE
  10. + ELASTIC AND HIGHLY SCALABLE + NO UPFRONT CAPITAL EXPENSE + ONLY PAY FOR WHAT YOU USE + AVAILABLE ON-DEMAND = REMOVE CONSTRAINTS
  11. GENERATE → STORE → ANALYZE → SHARE
  12. AWS Import / Export, AWS Direct Connect. GENERATE → STORE → ANALYZE → SHARE
  13. Generated and stored in AWS; Inbound data transfer is free; Multipart upload to S3; Physical media; AWS Direct Connect; Regional replication of AMIs and snapshots
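
Slide 13 lists multipart upload to S3 as one way to move data in. As a rough sketch of the idea (not an AWS SDK call), the helper below splits an object of a given size into byte ranges that could be uploaded as separate parts. The name `plan_parts` and the 8 MiB default are illustrative assumptions; the 5 MiB minimum part size (for every part except the last) and the 10,000-part cap are S3's documented multipart limits.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # S3 minimum for every part except the last
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def plan_parts(total_size, part_size=8 * MIB):
    """Return (part_number, start_byte, end_byte_exclusive) tuples.

    Hypothetical helper: it only plans byte ranges and uploads nothing.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")
    # Grow the part size if the object would need more than MAX_PARTS parts.
    while (total_size + part_size - 1) // part_size > MAX_PARTS:
        part_size *= 2
    parts = []
    start = 0
    number = 1  # S3 part numbers start at 1
    while start < total_size:
        end = min(start + part_size, total_size)
        parts.append((number, start, end))
        start = end
        number += 1
    return parts
```

Each tuple could then be handed to a separate upload worker, which is where the throughput gain of multipart upload comes from.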
  14. Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon RDS, Amazon Redshift, AWS Storage Gateway, Data on Amazon EC2. GENERATE → STORE → ANALYZE → SHARE
  15. AMAZON S3: SIMPLE STORAGE SERVICE
  16. AMAZON DYNAMODB: HIGH-PERFORMANCE, FULLY MANAGED NoSQL DATABASE SERVICE
  17. DURABLE & AVAILABLE: CONSISTENT, DISK-ONLY WRITES (SSD)
  18. LOW LATENCY: AVERAGE READS < 5 ms, WRITES < 10 ms
  19. NO ADMINISTRATION
  20. 500,000 WRITES PER SECOND DURING SUPER BOWL
  21. AMAZON REDSHIFT: FULLY MANAGED, PETABYTE-SCALE DATA WAREHOUSE ON AWS
  22. AMAZON REDSHIFT DESIGN OBJECTIVES: A petabyte-scale data warehouse service that was… A Lot Faster, A Lot Cheaper, A Whole Lot Simpler
  23. AMAZON REDSHIFT RUNS ON OPTIMIZED HARDWARE
       • HS1.8XL: 128 GB RAM, 16 cores, 16 TB compressed user storage, 2 GB/sec scan rate
       • HS1.XL: 16 GB RAM, 2 cores, 2 TB compressed customer storage
  24. 30 MINUTES DOWN TO 12 SECONDS
  25. AMAZON REDSHIFT LETS YOU START SMALL AND GROW BIG
       • Extra Large Node (HS1.XL): Single Node (2 TB); Cluster 2-32 Nodes (4 TB – 64 TB)
       • Eight Extra Large Node (HS1.8XL): Cluster 2-100 Nodes (32 TB – 1.6 PB)
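
The capacity ranges on slide 25 are just node count times per-node storage. A minimal sketch of that arithmetic, using the per-node figures from the slides (the function names are hypothetical):

```python
import math

# Compressed storage per node in TB, as stated on the slides.
NODE_STORAGE_TB = {"HS1.XL": 2, "HS1.8XL": 16}


def cluster_capacity_tb(node_type, node_count):
    """Total compressed storage of a cluster, in TB."""
    return NODE_STORAGE_TB[node_type] * node_count


def nodes_needed(node_type, data_tb):
    """Smallest node count whose capacity covers data_tb."""
    return math.ceil(data_tb / NODE_STORAGE_TB[node_type])
```

For example, `cluster_capacity_tb("HS1.8XL", 100)` gives 1,600 TB, the 1.6 PB upper bound quoted on the slide.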
  26. CREATE A DATA WAREHOUSE IN MINUTES
  27. JDBC/ODBC
  28. HS1.XL single-node pricing:

                             Price per hour    Effective hourly    Effective annual
                             (HS1.XL node)     price per TB        price per TB
       On-Demand             $0.850            $0.425              $3,723
       1 Year Reservation    $0.500            $0.250              $2,190
       3 Year Reservation    $0.228            $0.114              $999
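
The per-TB figures on slide 28 follow from the hourly node price, the 2 TB of storage on an HS1.XL node, and 8,760 hours in a year. A small sketch that reproduces the arithmetic (the helper name `effective_prices` is an assumption; the prices are the ones on the slide):

```python
HOURS_PER_YEAR = 24 * 365   # 8,760
NODE_STORAGE_TB = 2         # HS1.XL compressed storage per node

# Hourly price per HS1.XL node, as listed on the slide.
HOURLY_NODE_PRICE = {
    "On-Demand": 0.850,
    "1 Year Reservation": 0.500,
    "3 Year Reservation": 0.228,
}


def effective_prices(hourly_node_price):
    """Return (hourly price per TB, annual price per TB)."""
    per_tb_hour = hourly_node_price / NODE_STORAGE_TB
    per_tb_year = per_tb_hour * HOURS_PER_YEAR
    return per_tb_hour, per_tb_year


for option, price in HOURLY_NODE_PRICE.items():
    per_hour, per_year = effective_prices(price)
    print(f"{option}: ${per_hour:.3f}/TB/hour, ${per_year:,.0f}/TB/year")
```

The 3-year figure works out to $998.64 per TB per year, which the slide rounds to $999.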
  29. DATA WAREHOUSING DONE THE AWS WAY: Easy to provision and scale up massively; No upfront costs, pay as you go; Really fast performance at a really low price; Open and flexible with support for popular tools
  30. USAGE SCENARIOS
  31. S3, EMR, Redshift, Reporting and BI
  32. OLTP, Web Apps, DynamoDB, Redshift, Reporting and BI
  33. OLTP, ERP, RDBMS, Redshift, Reporting & BI
  34. OLTP, ERP, RDBMS, Redshift + Reporting & BI
  35. Social Point Analytics in AWS. Marc Canaleta (CTO), @mcanaleta. AWS Summit Barcelona 2013
  36. Social games developer for mobile and Facebook. Founded in 2008, offices in Barcelona (22@), 170 people. Top #20 mobile grossing games worldwide. Top #3 Facebook developer
  37. Social Point's games:
       • Social games: interaction between friends, virality
       • Freemium model: playing is free, some items are paid
       • Midcore segment
       • Leader in Breeding & Collecting strategy games
  38. Results so far:
       • Top 20 Grossing on the iOS App Store worldwide
       • Recently launched on Android, featured on Google Play
       • 6M DAU on Facebook
  39. Why AWS:
       • No hardware to maintain or plan: increases the speed of the business
       • Flexible: pay per use
       • Eases scalability: Auto Scaling
       • Eases high availability: multiple availability zones
       • Managed components: load balancers, databases, …
  40. Analytics driven. Analytics are needed by almost all our teams:
       • Engineers: real-time analytics, monitoring, problem detection
       • Product: decision making, A/B testing, game balancing, …
       • Marketing: campaign optimization
       • Finance: business tracking
  41. Architecture diagram: FLASH CLIENT, IOS CLIENT, ANDROID CLIENT; three BACKEND SERVERS (Symfony 2); three ANALYTICS QUEUES (Redis); LOGFILES STORAGE (AWS S3); ANALYTICS DATABASE (AWS Redshift)
  42. Event queues:
       • The backend writes events to Redis lists
       • Why Redis? Cost and performance: 10K events/second/server
       • Problem: it is an in-memory database, so the queues must be drained constantly
       • Scaling and HA: N servers, with events distributed among them at random
       Diagram: BACKEND, REDIS, REDIS, REDIS
  43. Python processes consume the queues constantly and:
       • Compute real-time metrics
       • Store event logfiles to upload to S3
       • Enqueue the S3 object URL in SQS
       Diagram, EVENT GENERATION: Redis Queue, LPOP event, Consumer, INCR counter, Redis Real Time, write event, Event Log File, put object, Amazon S3. DATA LOADING: Amazon SQS, enqueue S3 object URL
  44. EVENT GENERATION
       • Python is well suited to writing workers and processing data
       • Redis: structures such as counters, sets, and sorted sets for real-time metrics
       • S3: virtually infinite space, scalable, highly available
       • SQS: reliability and availability, at a higher price than Redis
       Diagram, EVENT GENERATION: Redis Queue, LPOP event, Consumer, INCR counter, Redis Real Time, write event, Event Log File, put object, Amazon S3. DATA LOADING: Amazon SQS, enqueue S3 object URL
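
The event-generation flow on slides 42 to 44 can be sketched without any AWS or Redis dependency. In the sketch below everything is an in-memory stand-in and an assumption: each `deque` plays one Redis list, the `Counter` plays the real-time Redis counters, the `dict` plays an S3 bucket, and the plain list plays an SQS queue. The JSON event format, the bucket name, and the function names are hypothetical; the slides name only the LPOP, INCR, put object, and enqueue steps.

```python
import json
import random
from collections import Counter, deque

# In-memory stand-ins (assumptions, not real clients).
redis_queues = [deque() for _ in range(3)]   # N Redis queue servers
realtime_counters = Counter()                # real-time metrics store
s3_bucket = {}                               # object key -> logfile text
sqs_queue = []                               # URLs of uploaded logfiles


def emit_event(event):
    """Backend side: push the event to a randomly chosen queue server."""
    # Appending at the tail, for example with RPUSH, keeps FIFO order with LPOP.
    random.choice(redis_queues).append(json.dumps(event))


def drain_queue(queue, logfile_key):
    """Consumer side: pop events, count them, write a logfile, announce it."""
    lines = []
    while queue:
        raw = queue.popleft()                   # like LPOP
        event = json.loads(raw)
        realtime_counters[event["type"]] += 1   # like INCR
        lines.append(raw)                       # write event to the log file
    if lines:
        s3_bucket[logfile_key] = "\n".join(lines)                # put object
        sqs_queue.append("s3://example-bucket/" + logfile_key)   # enqueue URL
    return len(lines)
```

A consumer would call `drain_queue` in a loop for each queue server; because Redis holds the queues in memory, the slides stress that this draining has to run constantly.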
  45. EVENT PROCESSING
       • The importers read URLs from SQS
       • They download the logfiles from S3
       • They convert them to TSV
       • They bulk-import them into Redshift (N logfiles at a time)
       Diagram: Amazon S3, Amazon SQS, Importer, TSV, Redshift
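
The TSV conversion step on slide 45 might look like the sketch below. The slides do not show the logfile format or the table schema, so the JSON-lines input, the column list, and the function name are all assumptions made for illustration.

```python
import csv
import io
import json

# Hypothetical column order; the slides do not show the real schema.
COLUMNS = ["timestamp", "user_id", "type", "amount"]


def logfile_to_tsv(logfile_text, columns=COLUMNS):
    """Convert JSON-lines event text into TSV text, one row per event."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for line in logfile_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        # Missing fields become empty strings so every row has the same width.
        writer.writerow([event.get(col, "") for col in columns])
    return out.getvalue()
```

Fixing the column order in one place keeps every converted file aligned with the target table, which matters when several logfiles are loaded in one batch.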
  46. Why Redshift:
       • Lets us be flexible -> schema changes without downtime
       • Very scalable (with write downtime)
       • Low deployment risk
       • Offline system
       • Backups
       • Minimal maintenance: vacuums, disk space
       • Good SQL support, unlike other columnar databases
  47. Daily transformations and calculations are implemented in SQL. Example: UPDATE USER SET total_revenues = (SELECT SUM(amount) FROM transaction t WHERE t.user_id = user.user_id); Why not Hadoop? Much more complex and slower; for now, SQL operations meet all our requirements
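
To see what the UPDATE on slide 47 does, here is a small runnable sketch that uses SQLite as a local stand-in for Redshift. The table names are pluralized to steer clear of SQL keywords, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, total_revenues REAL);
    CREATE TABLE transactions (user_id INTEGER, amount REAL);
    INSERT INTO users (user_id) VALUES (1), (2), (3);
    INSERT INTO transactions VALUES (1, 4.99), (1, 9.99), (2, 0.99);
""")

# Same shape as the slide's statement: a correlated subquery per user row.
conn.execute("""
    UPDATE users
    SET total_revenues = (
        SELECT SUM(amount) FROM transactions t WHERE t.user_id = users.user_id
    )
""")

# Map each user_id to its recomputed total.
totals = dict(conn.execute("SELECT user_id, total_revenues FROM users"))
```

Users with no transactions end up with NULL (`None` in Python) because `SUM` over zero rows is NULL; wrapping the subquery in `COALESCE(..., 0)` would store zero instead.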
  48. Would you like to work in the video game industry? We are looking for talent. Talent attracts talent. www.socialpoint.es/jobs THANK YOU!
  49. GENERATE → STORE → ANALYZE → SHARE. Amazon EC2, Amazon Elastic MapReduce
  50. AMAZON ELASTIC MAPREDUCE: HADOOP AS A SERVICE
  51. Hadoop in four points:
       • A FRAMEWORK
       • SPLITS DATA INTO PIECES
       • LETS PROCESSING OCCUR
       • GATHERS THE RESULTS
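
The three steps on slide 51 (split, process, gather) can be illustrated with a single-process word count. This is only a sketch of the pattern, not of Hadoop itself: there is no distribution, shuffle phase, or fault tolerance, and all function names are hypothetical.

```python
from collections import Counter
from functools import reduce


def split_into_pieces(records, piece_size):
    """Split the input into fixed-size pieces."""
    return [records[i:i + piece_size] for i in range(0, len(records), piece_size)]


def map_piece(piece):
    """Process one piece independently: count the words in its lines."""
    counts = Counter()
    for line in piece:
        counts.update(line.split())
    return counts


def gather(partials):
    """Combine the partial counts into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())


def word_count(lines, piece_size=2):
    pieces = split_into_pieces(lines, piece_size)
    # On a cluster, each piece would be processed on a different node.
    partials = [map_piece(piece) for piece in pieces]
    return gather(partials)
```

Because each piece is processed without reference to the others, adding nodes lets more pieces run at once, which is the property the following slides rely on when they scale a cluster up for one job.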
  52. Corporate Data Center. Elastic Data Center
  53. Corporate Data Center. Application data and logs for analysis pushed to S3. Elastic Data Center
  54. Corporate Data Center. Amazon Elastic MapReduce name node to control analysis. Elastic Data Center
  55. Corporate Data Center. Hadoop cluster started by Elastic MapReduce. Elastic Data Center
  56. Corporate Data Center. Adding many hundreds or thousands of nodes. Elastic Data Center
  57. Corporate Data Center. Disposed of when job completes. Elastic Data Center
  58. Corporate Data Center. Results of analysis pulled back into your systems. Elastic Data Center
  59. Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Data on Amazon EC2. GENERATE → STORE → ANALYZE → SHARE
  60. PUBLIC DATA SETS: http://aws.amazon.com/publicdatasets
  61. GENERATE → STORE → ANALYZE → SHARE
  62. GENERATE → STORE → ANALYZE → SHARE
  63. FROM DATA TO ACTIONABLE INFORMATION