Analytics in the Cloud

Elastic storage and compute services provide a firm foundation on which to build systems to drive value from data.

This presentation discusses how to run analytics pipelines on the AWS Cloud, from data storage with S3 and DynamoDB to high-scale computation with Elastic MapReduce and Cluster Compute instances on EC2.

Published in: Technology, Business

  1. Storage, Compute, Databases, Tools & Support
  2. Compute, Storage, Databases, Tools & Support
  3. Analytics
  4. Let’s talk about data
  5. Data is valuable
  6. Data is plentiful
  7. Data is complex
  8. Data is in flux
  9. Data is fast moving
  10. Capturing and managing data is challenging
  11. Lots of data
  12. Lots of data, lots of uses
  13. Lots of data, lots of uses, lots of users
  14. Lots of data, lots of uses, lots of users, lots of locations
  15. Cost
  16. Force multiplier
  17. Additional value
  18. Additional value: click through, social graph, log files, audit trails, customer usage, transcoding
  19. Remove constraints
  20. Analytics: data intensive, scale out; tightly coupled
  21. Analytics: data intensive, scale out; tightly coupled
  22. Hadoop
  23. Elastic MapReduce: managed Hadoop
  24. Undifferentiated heavy lifting
  25. S3: input data
  26. S3: input data; code; Elastic MapReduce
  27. S3: input data; code; Elastic MapReduce; name node
  28. S3: input data; code; Elastic MapReduce; name node; elastic cluster
  29. S3: input data; code; Elastic MapReduce; name node; elastic cluster; HDFS
  30. S3: input data; code; Elastic MapReduce; name node; elastic cluster; HDFS; queries + BI via JDBC, Pig, Hive
  31. S3: input data; code; Elastic MapReduce; name node; elastic cluster; HDFS; queries + BI via JDBC, Pig, Hive; output: S3 + SimpleDB
  32. S3: input data; output: S3 + SimpleDB
  33. It’s all just Hadoop
  34. Hive, Pig, Cascading, Streaming
  35. API driven
  36. Data movement
  37. Import/Export
  38. Large object support
  39. Multipart upload
  40. Scale control
  41. Resize running job flows
  42. 14 hours. Time remaining: 14 hours
  43. 14 hours. Time remaining: 7 hours
  44. Time remaining: 3 hours
  45. Balance cost and performance
  46. Resize based on usage patterns
  47. Steady state, steady state, batch processing
  48. Spot
  49. Integrated with DynamoDB
  50. Integrate
  51. Backup and restore
  52. HiveQL
  53. Live data in DynamoDB: CREATE EXTERNAL TABLE orders_ddb_2012_01 ( order_id string, customer_id string, order_date bigint, total double ) STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES ("dynamodb.table.name" = "Orders-2012-01", "dynamodb.column.mapping" = "order_id:OrderID,customer_id:Customer ID,order_date:OrderDate,total:Total");
  54. Query DynamoDB: SELECT customer_id, sum(total) spend, count(*) order_count FROM orders_ddb_2012_01 WHERE order_date >= unix_timestamp('2012-01-01', 'yyyy-MM-dd') AND order_date < unix_timestamp('2012-01-08', 'yyyy-MM-dd') GROUP BY customer_id ORDER BY spend desc LIMIT 5;
  55. Archived data in S3: CREATE EXTERNAL TABLE orders_s3_export ( order_id string, customer_id string, order_date int, total double ) PARTITIONED BY (year string, month string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3://elastic-mapreduce/samples/ddb-orders';
  56. Query S3: SELECT year, month, customer_id, sum(total) spend, count(*) order_count FROM orders_s3_export WHERE customer_id = 'c-2cC5fF1bB' AND month >= 6 AND year = 2011 GROUP BY customer_id, year, month ORDER BY month desc;
  57. Export to S3: CREATE EXTERNAL TABLE orders_s3_new_export ( order_id string, customer_id string, order_date int, total double ) PARTITIONED BY (year string, month string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://'; INSERT OVERWRITE TABLE orders_s3_new_export PARTITION (year='2012', month='01') SELECT * from orders_ddb_2012_01;
  58. Perfect match
  59. Analytics: data intensive, scale out; tightly coupled
  60. Parallel computation
  61. Parallel computation: drug discovery, financial risk analysis, social media & gaming, manufacturing & design, transcoding & rendering, genomics
  62. Cluster compute instances: CC1 + GPU
  63. CC2
  64. CC2: 16 Intel Xeon cores; placement groups; non-blocking, fully bisectional 10 GigE network
  65. 240 TFLOPS: 42nd fastest supercomputer
  66. StarCluster: web.mit.edu/star/cluster
  67. CloudFormation: aws.amazon.com/hpc
  68. Q&A: matthew@amazon.com
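
The slides on large object support and multipart upload (38 and 39) give no detail, so the following is only a rough sketch of the idea: a large object is split into numbered parts that can be uploaded independently. It assumes the documented S3 limits of a 5 MiB minimum part size (except for the last part) and at most 10,000 parts per upload. The function name `plan_parts` and the 64 MiB default are hypothetical choices made for illustration, not part of any AWS API, and no network calls are made.

```python
# Sketch only: plan_parts and its defaults are hypothetical, not an AWS API.
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB   # assumed S3 minimum for every part except the last
MAX_PARTS = 10_000        # assumed S3 maximum number of parts per upload


def plan_parts(object_size, part_size=64 * MIB):
    """Return (part_number, offset, length) tuples covering object_size bytes."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")

    # Grow the part size if the object would otherwise need too many parts.
    smallest_allowed = -(-object_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, smallest_allowed)

    parts = []
    offset = 0
    number = 1  # part numbers start at 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    # A 150 MiB object with the default part size: 64 MiB, 64 MiB, 22 MiB.
    for number, offset, length in plan_parts(150 * MIB):
        print(number, offset, length)
```

Each tuple could then be handed to a separate worker, which is what makes the upload parallel and lets a single failed part be retried on its own.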
