Introduction to Elastic MapReduce

An introduction to Elastic MapReduce, including a demonstration of how to create a pre-configured, scalable Hadoop cluster in minutes.

Published in: Technology


  1. Elastic MapReduce. Matt Wood, Technology Evangelist
  2. Hello.
  3. Thank you.
  4. 3
  5. Part 1: Building blocks
  6. Infrastructure services
  7. 5 years young
  8. ?
  9. On demand
  10. Pay as you go
  11. Pay for what you use
  12. Elastic capacity
  13. Chart (capacity vs. time): estimated demand
  14. Chart (capacity vs. time): infrastructure investment, estimated demand
  15. Chart (capacity vs. time): infrastructure, real demand
  16. Chart (capacity vs. time): elastic capacity, real demand
  17. Undifferentiated heavy lifting
  18. Focus on your stuff
  19. Idea, product
  20. Idea, product, heavy lifting
  21. Idea, product, VERY heavy lifting
  22. Idea, product
  23. Scalable storage, scalable compute, scalable tools
  24. Part 2: Enter the Cloud
  25. Scalable storage (S3), scalable compute (EC2), scalable tools
  26. Elastic MapReduce
  27. Hosted Hadoop
  28. Without the ‘muck’
  29. S3: input data
  30. Diagram: S3 input data, code, Elastic MapReduce
  31. Diagram: S3 input data, code, Elastic MapReduce, name node
  32. Diagram: S3 input data, code, Elastic MapReduce, name node, elastic cluster
  33. Diagram: S3 input data, code, Elastic MapReduce, name node, elastic cluster, HDFS
  34. Diagram: S3 input data, code, Elastic MapReduce, name node, elastic cluster, HDFS, queries + BI via JDBC, Pig, Hive
  35. Diagram: S3 input data, code, Elastic MapReduce, name node, elastic cluster, HDFS, queries + BI via JDBC, Pig, Hive, output to S3 + SimpleDB
  36. Diagram: S3 input data, Elastic MapReduce, output to S3 + SimpleDB
  37. It’s all just Hadoop
  38. HDFS + S3
  39. Hive, Pig, Cascading, Streaming
  40. API driven
  41. Data movement
  42. Import/Export
  43. Multipart upload
  44. Multipart, parallel results delivery
  45. Scale control
  46. Resize running job flows
  47. 14 hours. Time remaining: 14 hours
  48. 14 hours. Time remaining: 7 hours
  49. Time remaining: 3 hours
  50. Balance cost and performance
  51. Resize based on usage patterns
  52. Steady state, steady state, batch processing
  53. Cluster types
  54. Small
  55. High memory, high CPU, or both
  56. HPC
  57. HPC: quad-core Nehalem, 10 Gigabit Ethernet, GPU
  58. Access control
  59. Private
  60. Location
  61. Identity and Access
  62. Part 3: EMR by example
  63. Bioinformatics, web indexing, financial modelling, file processing, data mining and BI, data warehousing, fraud detection, targeted advertising
  64. Click stream analysis for Best Buy: 3.5 billion records, 71 million unique cookies, 1.7 million targeted ads, 13 TB of clickstream logs, each day
  65. Click stream analysis for Best Buy: workflow time from 2 days to 8 hours; procurement time from 2 months to 5 minutes; $13k per month; 500% increase in return on advertising spend
  66. Web log analysis and recommendation engine: $29.9 million in sales, 842 million page views, 434 GB of page logs, 97 million ‘favourites’
  67. Elastic MapReduce
  68. Undifferentiated heavy lifting
  69. Managed Hadoop
  70. Hive, Pig, Cascading
  71. Data movement
  72. Scale control
  73. HPC instances
  74. aws.amazon.com
  75. Thank you!
  76. Questions + comments: matthew@amazon.com, @mza on Twitter
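
The "Resize running job flows" slides (46 to 50) show the time remaining falling from 14 hours to 7 and then to about 3 as a job flow grows. The arithmetic behind that can be sketched as below, under a loud assumption that is not stated in the slides: the work divides evenly across nodes, so doubling the node count halves the time remaining. The starting node count of 4 and the function names `hours_remaining` and `node_hours` are hypothetical, chosen only for illustration. Real Hadoop jobs rarely scale this cleanly.

```python
def hours_remaining(hours_at_current_size: float, current_nodes: int, new_nodes: int) -> float:
    """Estimate time remaining after a resize.

    Assumes ideal linear scaling: the remaining work (in node-hours)
    is fixed and spreads evenly over however many nodes are running.
    """
    if current_nodes <= 0 or new_nodes <= 0:
        raise ValueError("node counts must be positive")
    return hours_at_current_size * current_nodes / new_nodes


def node_hours(hours: float, nodes: int) -> float:
    """Total node-hours consumed; a rough proxy for pay-as-you-go cost."""
    return hours * nodes


if __name__ == "__main__":
    nodes = 4          # hypothetical starting cluster size
    remaining = 14.0   # hours, as on the slide

    print(f"{nodes} nodes: {remaining} h remaining, {node_hours(remaining, nodes)} node-hours")
    for new_nodes in (8, 16):  # double the cluster twice
        remaining = hours_remaining(remaining, nodes, new_nodes)
        nodes = new_nodes
        print(f"{nodes} nodes: {remaining} h remaining, {node_hours(remaining, nodes)} node-hours")
```

Under this idealised model two doublings take 14 hours to 7 and then to 3.5, close to the 3 hours on the slide, while the node-hours stay constant. That is one way to read "Balance cost and performance": with per-hour pricing and near-linear scaling, finishing sooner need not cost more.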
