The document discusses Amazon EMR and Hadoop. It provides an overview of collecting, storing, organizing, analyzing and sharing big data using Hadoop frameworks like Hive and Pig. It also describes how Amazon EMR allows users to easily launch and terminate Hadoop clusters in the AWS cloud to process large amounts of data stored in S3.
The Hadoop ecosystem is blossoming. In this session, learn how to take advantage of tools such as Mesos, Spark, Shark, and Mahout on Amazon Elastic MapReduce. Senior Product Manager Jon Einkauf discusses the optimizations that make Hadoop sing on EMR, and describes how to use different Hadoop distributions and tools such as HBase and HParser in your big data analytics pipelines.
This talk covers how Amazon CloudSearch and Amazon DynamoDB can be used together to provide an ideal combination of throughput, durability, and rich, powerful search.
How do you build an architecture that is designed from the beginning to withstand failure? This session covers techniques for developing an architecture capable of withstanding disaster and failure. Take advantage of AWS Availability Zones to spread your application or workload across multiple physical locations and isolate yourself from physical and geographical disruptions. Replicate your database and state information to increase availability.
Attend this complimentary webinar to learn these techniques and many more. What you will learn:
• How to design for failure
• How to distribute your application across multiple Availability Zones (physically separate data centers)
• How to scale your application as traffic grows and/or shrinks
• How to make your application self-healing
• How to add loose coupling into your application to make it more survivable
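Loose coupling is usually achieved by putting a message queue between components, so producers and consumers can fail, restart, and scale independently. A minimal sketch using Python's standard library as a stand-in for a managed queue such as Amazon SQS (the task shape here is a hypothetical example):

```python
import queue
import threading

# In-process queue standing in for a managed message queue such as
# Amazon SQS (hypothetical sketch; a real system would use an SQS client).
work_queue = queue.Queue()

def producer(n_tasks):
    # The web tier only enqueues work; it never calls a worker directly.
    for i in range(n_tasks):
        work_queue.put({"task_id": i, "payload": f"order-{i}"})

def consumer(results):
    # Workers pull at their own pace; if one dies, messages stay queued.
    while True:
        task = work_queue.get()
        if task is None:  # sentinel to stop the worker
            break
        results.append(task["task_id"])
        work_queue.task_done()

results = []
worker = threading.Thread(target=consumer, args=(results,))
worker.start()
producer(5)
work_queue.put(None)  # signal shutdown
worker.join()
print(results)  # → [0, 1, 2, 3, 4]
```

Because neither side holds a reference to the other, either tier can be replaced or scaled out without touching the other, which is what makes the application more survivable.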
Amazon EC2 provides you several pricing options that can help you significantly reduce your overall AWS bill, including On-Demand Instances, Spot Instances, Reserved Instances, and the Reserved Instance Marketplace. This session covers high-level architectures and when to use and not to use each of the pricing models for components of those architectures. We walk through several customer examples to illustrate when to use each pricing option. Additionally, we walk through tools that may be useful to determine when to use each pricing model. This session is aimed at technically savvy managers and engineers who need to reduce their cloud spending.
AWS Webcast - Amazon CloudFront Zone Apex Support & Custom SSL Domain Names Amazon Web Services
In this webinar, we will demonstrate two new features that make it even easier for you to deliver content with Amazon CloudFront.
First, we’ll demonstrate how you can use Amazon Route 53, AWS’s authoritative DNS service, to configure an ‘Alias’ record that lets you use CloudFront to deliver your website at the root domain, or "zone apex." This feature enables you to map the apex or root (e.g. “example.com”) of your domain name to your CloudFront distribution. Visitors to your website can then easily and reliably access your site from their browser without specifying “www” in the web address.
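An Alias record at the zone apex is expressed as an ordinary Route 53 change batch. A sketch of the JSON payload, with a hypothetical apex domain and distribution domain name (CloudFront's alias hosted zone ID is a fixed, documented constant):

```python
import json

# Hypothetical values: your hosted zone's apex and your distribution's domain.
APEX_DOMAIN = "example.com."
CLOUDFRONT_DOMAIN = "d111111abcdef8.cloudfront.net."
# Fixed hosted zone ID used for all CloudFront alias targets.
CLOUDFRONT_HOSTED_ZONE_ID = "Z2FDTNDATAQYW2"

change_batch = {
    "Comment": "Point the zone apex at a CloudFront distribution",
    "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": APEX_DOMAIN,
            "Type": "A",
            # An Alias target instead of literal records: Route 53 resolves
            # the apex to CloudFront edge addresses at query time.
            "AliasTarget": {
                "HostedZoneId": CLOUDFRONT_HOSTED_ZONE_ID,
                "DNSName": CLOUDFRONT_DOMAIN,
                "EvaluateTargetHealth": False,
            },
        },
    }],
}

print(json.dumps(change_batch, indent=2))
```

This payload would then be submitted with Route 53's `ChangeResourceRecordSets` call via the AWS CLI or an SDK.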
Second, we’ll demonstrate how you can use a custom SSL certificate with CloudFront to deliver content over HTTPS using your own domain name. With custom SSL domain names, your customers now get the low latency, reliability, and scalability benefits of CloudFront’s entire global edge location network when downloading your content over an SSL connection using your own domain name.
Adobe Summit EMEA 2012 : 16706 Optimise Mobile Experience - Ben Seymour
Smartphones and tablets come in all shapes and sizes, with screen sizes from 4 to 10 inches and varying support for rich media formats. How are you approaching this challenge to support highly engaging user experiences across the expanding range of mobile devices? Find out how Adobe Scene7 can help you optimise content for multiple screens to ensure high engagement and conversion.
Learn about:
- Best practices for optimising mobile experiences
- Emerging trends for immersive mobile experiences, including video and interactive catalogues
- Examples from clients who have optimised their rich media for tablets and smartphones
Moses Tool Set is a set of tools that simplifies the use of Moses. With this tool set, the Moses training process can be carried out in an easier and more intuitive way. It consists of four features: Corpus Clean Tool, Corpus Splitting Tool, Moses Training Harness, and Moses Scoring Harness. Each feature can not only work independently but also be combined into a job, which enables users to complete the whole training process in one click.
This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit.
MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme.
Latest news on Twitter - #MosesCore
Develop multi-screen applications with Flex - Codemotion
Presentation given by Michael Chaize for Adobe at Codemotion on 5 March 2011 in Rome - http://www.codemotion.it/
With the rise of a wide range of Internet connected devices, a new class of application is emerging to work across multiple kinds of devices. Developers are now faced with new challenges to provide the most engaging user experiences on any screen. New device input methods like touch and gestures require developers to rethink interaction models. Screen size constraints also require developers to optimize real estate usage. With so many different mediums for delivering rich Internet applications
In this presentation, the Amazon CloudFront product team discusses the basic features of the service and introduces newer features such as dynamic content support and enhanced live streaming support. This presentation gives Partners the background needed to feel comfortable discussing the newest enhancements to Amazon CloudFront with customers.
AWS Webcast - Accelerating Application Performance Using In-Memory Caching in... - Amazon Web Services
This webinar covers both introductory and advanced topics related to ElastiCache and is intended for current memcached users as well as those already using ElastiCache. During this session we go over various scenarios and use cases that can benefit from caching, discuss the features provided by ElastiCache, and review best practices, design patterns, and anti-patterns related to ElastiCache. The webinar also includes a demo in which we enable ElastiCache for a web application and show the resulting performance improvements.
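The most common pattern a service like ElastiCache enables is cache-aside: check the cache first, and only fall back to the database on a miss. A minimal sketch with an in-memory dict standing in for a memcached/Redis client (the query and TTL are hypothetical illustrations):

```python
import time

# In-memory dict standing in for a memcached/ElastiCache client
# (hypothetical sketch; a real app would use a memcached or Redis client).
cache = {}
TTL_SECONDS = 300

def slow_database_query(user_id):
    # Placeholder for an expensive query against the primary database.
    return {"user_id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside read: serve from cache if fresh, else query and populate."""
    entry = cache.get(user_id)
    if entry is not None and entry["expires"] > time.time():
        return entry["value"]  # cache hit: no database round trip
    value = slow_database_query(user_id)  # cache miss
    cache[user_id] = {"value": value, "expires": time.time() + TTL_SECONDS}
    return value

first = get_user(42)   # miss: hits the database, populates the cache
second = get_user(42)  # hit: served from the cache
print(first == second)  # → True
```

The TTL bounds staleness; choosing it is the usual trade-off between hit rate and how out-of-date a cached value is allowed to be.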
Development Platform as a Service - experiences after one year of use - ... - IBM Sverige
Presentation from IBM Smarter Business 2011. Track: Developing products and services cost-effectively.
Learn from Tieto's experience implementing agile development and Application Lifecycle Management with IBM Rational's solutions. The presentation shows a number of different example implementations, and a representative of a Swedish customer talks about their experience from one year of using IBM and Tieto's cloud-based development platform, DpaaS.
Speaker: Per Engman, Business Development, Tieto.
More information at www.smarterbusiness.se
Big Data Analysis Patterns - TriHUG 6/27/2013 - boorad
Big Data Analysis Patterns: Tying real world use cases to strategies for analysis using big data technologies and tools.
Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.
This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.
A brief history of Instagram's adoption of the open source distributed database Apache Cassandra, along with details about its use case and implementation. This was presented at the San Francisco Cassandra Meetup at the Disqus HQ in August 2013.
A user's perspective on SaltStack and other configuration management tools - SaltStack
Aurelien Geron uses SaltStack to manage a few VMs running Django web apps backed by a sharded MongoDB cluster. He had struggled with another configuration management tool for months, but then read about SaltStack and decided to try it out. For Aurelien, SaltStack just works: it's plain and simple, powerful, configurable, and ultra-fast. This is his presentation.
In this talk, we’ll discuss the benefits of the document-based data model that MongoDB offers by walking through how one can build a simple app. We'll show you how to design a full-blown RSS aggregation service to replace the loss the world suffered when Google Reader was shut down.
We'll dive deeper into topics, such as how to model your data and create your REST API using MongoDB, Express.js and Node.js (core components of the MEAN stack). This session will jumpstart your development knowledge of MongoDB.
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines - MongoDB
Presented by Eoin Brazil, Proactive Technical Services Engineer, MongoDB
Experience level: Advanced
MongoDB offers a flexible, scalable, and easy way to store your large data set. Python provides many useful data science tools (e.g. NumPy, SciPy, scikit-learn, etc.). This talk will discuss the concerns involved in creating operational data analytics pipelines, introduce Monary as an alternative for loading data into NumPy, give examples of accessing data with Monary, and show how to build scalable data analysis pipelines using these open source tools.
Apache Airflow (incubating) NL HUG Meetup 2016-07-19 - Bolke de Bruin
Introduction to Apache Airflow (Incubating), best practices and roadmap. Airflow is a platform to programmatically author, schedule and monitor workflows.
Data Lake for the Cloud: Extending your Hadoop Implementation - Hortonworks
As more applications are created using Apache Hadoop that derive value from the new types of data from sensors/machines, server logs, click-streams, and other sources, the enterprise "Data Lake" forms with Hadoop acting as a shared service. While these Data Lakes are important, a broader life-cycle needs to be considered that spans development, test, production, and archival and that is deployed across a hybrid cloud architecture.
If you have already deployed Hadoop on-premises, this session will also provide an overview of the key scenarios and benefits of joining your on-premises Hadoop implementation with the cloud through backup/archive, dev/test, or bursting. Learn how you can get the benefits of an on-premises Hadoop deployment that can seamlessly scale with the power of the cloud.
AWS Summit Auckland 2014 | Moving to the Cloud. What does it Mean to your Bus... - Amazon Web Services
AWS launched in 2006, and since then we have released more than 530 services, features, and major announcements. Every year, we outpace the previous year in launches and are continuously accelerating the pace of innovation across the organization. Ever wonder how we formulate customer-centric ideas, turn them into features and services, and get them to market quickly? This session dives deep into how an idea becomes a service at AWS and how we continue to evolve the service after release through innovation at every level. We even spill the beans on how we manage operational excellence across our services to ensure the highest possible availability. Come learn about the rapid pace of innovation at AWS, and the culture that formulates magic behind the scenes.
Enterprises are increasingly looking for new ways to simplify and optimize their current development, orchestration, automation, and deployment pipelines through the use of hybrid IT and the public cloud. In this session we will explore architecture patterns and integration approaches in the context of both new and existing AWS DevOps-focused services, with the goal of helping enterprises better iterate and reduce cost through the entire software development lifecycle.
Datapipe’s Director of Compliant Solutions, Mark Fuqua, will lead a conversation with Shane Shelton, McGraw Hill Education’s Senior Director of Application Performance and Development Operations, and Datapipe Solution Architects about the steps taken to deliver an end-to-end hybrid infrastructure. This session will cover the issues faced, problems solved, services implemented, and approach used in deploying a holistic hybrid IT solution for McGraw Hill Education.
The discussion will center around achieving visibility into both AWS and traditional infrastructure through an integrated managed solution encompassing thousands of instances, Direct Connect, highly available Oracle database components, and governance controls around the entire environment.
At this scale, simple changes can have a significant impact on the overall environment, with drastic effects on the integrity, security, and cost of the solution. We will cover the change control, intrusion detection services, log management, and analytics platform used to secure and manage this environment.
CIS13: AWS Identity and Access Management - CloudIDSummit
Jim Scharf, Director, AWS Identity and Access Management, Amazon
Amazon Web Services customers include students, startups, mobile developers, enterprises and government agencies. Learn how AWS Identity and Access Management provides access control for trillions of cloud resources.
Building Better Search For Wikipedia: How We Did It Using Amazon CloudSearch ... - Amazon Web Services
In this webinar Paul Nelson, CTO and search guru at Search Technologies, covers how he implemented improved search capabilities for Wikipedia using Amazon CloudSearch, a fully-managed search service in the AWS cloud. See how Wikipedia search can now deliver a richer experience that includes faceted navigation, better and more relevant results, and an improved user interface. Topics include data acquisition and clean-up, indexing, handling queries, relevance ranking, and building the search user interface. For more information please see: http://aws.amazon.com/cloudsearch/
You will learn how to create file archives, upload them to Amazon S3, and manage permissions and lifetimes, giving you the ability to back up any amount of data and to retain it for as long as you'd like. A number of open source and commercial backup and archiving tools will be demonstrated, as time permits.
You will also learn how to use built-in AWS facilities to quickly and easily create and restore snapshots of entire disk volumes.
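The archive-then-upload flow described above can be sketched with the standard library; the data directory and bucket name below are hypothetical, and the S3 upload itself is shown only as a comment since it needs real credentials:

```python
import os
import tarfile
import tempfile

# Create some files to archive (hypothetical data directory).
workdir = tempfile.mkdtemp()
data_dir = os.path.join(workdir, "data")
os.makedirs(data_dir)
for name in ("a.log", "b.log"):
    with open(os.path.join(data_dir, name), "w") as f:
        f.write("sample contents\n")

# Build a compressed archive suitable for upload to Amazon S3.
archive_path = os.path.join(workdir, "backup.tar.gz")
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(data_dir, arcname="data")

# Uploading is then a single SDK call, e.g. with boto3 (not run here;
# bucket and key are hypothetical):
#   boto3.client("s3").upload_file(archive_path, "my-backup-bucket",
#                                  "backups/backup.tar.gz")

# Verify the archive round-trips.
with tarfile.open(archive_path, "r:gz") as tar:
    members = sorted(m.name for m in tar.getmembers())
print(members)  # → ['data', 'data/a.log', 'data/b.log']
```

Retention for "as long as you'd like" would then be handled with S3 lifecycle rules on the bucket rather than in this script.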
Design for failure and nothing fails. How do you build a system that is designed from the beginning to withstand failure? This session will cover many techniques for developing a system that can remain available during times of disaster and failure. Take advantage of AWS Availability Zones to spread your system across multiple physical locations and isolate yourself from physical and geographical disruptions. Replicate your database and state information to increase availability. Presenter: Brett Hollman, Solutions Architect, Amazon Web Services
The general purpose computing and storage environment of Amazon Web Services integrates perfectly into your existing ecosystem. Join customers who have taken advantage of this environment in parallel to their on-premise infrastructure to hear tales, tips, and tricks of best practices of integrating AWS with existing resources securely using services such as Amazon Virtual Private Cloud, AWS Direct Connect, and AWS Storage Gateway.
AWS Webcast - High Availability with Route 53 DNS Failover - Amazon Web Services
This webinar discusses how to apply DNS Failover to a range of high-availability architectures, from a simple backup website to advanced multi-region architectures.
In the event of a disaster, you can quickly restore data locally or launch resources in Amazon Web Services (AWS) to help ensure business continuity. In this presentation, you will learn about the AWS services that you can leverage for your disaster recovery (DR) solution, four common DR architectures that leverage the AWS Cloud, and how to get started.
Amazon RDS makes it easy to set up, operate, and scale relational databases in the cloud. We are introducing PostgreSQL to the family of supported database engines. Now, you can deploy scalable PostgreSQL deployments in minutes, freeing you up to focus on application development instead of time-consuming database administration tasks, including backups, software patching, monitoring, scaling, and replication. In this webinar, we will provide an overview of Amazon RDS for PostgreSQL, discuss popular use cases, and share best practices that will help you fully leverage PostgreSQL in the cloud.
AWS Webcast - Using Amazon CloudFront - Accelerate Your Static, Dynamic, Intera... - Amazon Web Services
Amazon CloudFront, AWS’s easy-to-use and cost-effective content delivery service, recently added support for five additional HTTP methods: POST, PUT, DELETE, OPTIONS, and PATCH. This means you can now use CloudFront to accelerate data uploaded from end users, improving the performance of dynamic and personalized websites that have web forms, comment and login boxes, “add to cart” buttons, or other features. In this webinar, we will explain how CloudFront can accelerate your entire website running on Amazon S3, Amazon EC2, an Elastic Load Balancer, or your own origin server using routes optimized via persistent connections, TCP/IP, and other network path optimizations. We will also demo recent CloudFront features such as zone apex support, custom error pages, and content upload (via these additional HTTP methods).
Redshift is a petabyte-scale data warehouse that is a lot faster, a lot less expensive, and a whole lot simpler to use. How can you get your data into Amazon Redshift? In this webinar, hear from representatives of Attunity (an Amazon Redshift Partner) and AWS as they present many of the options available for data integration. Whether your data is on an on-premises platform or in a cloud-based database like DynamoDB, we will show you how you can easily load your data into Redshift.
Reasons to attend:
- Learn about best practices to efficiently integrate data into Redshift.
- Attend a Q&A session with Redshift experts.
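The standard way to bulk-load data from S3 into Redshift is the COPY command. A sketch that builds such a statement (the table, bucket path, and IAM role ARN are all hypothetical placeholders):

```python
# Build a Redshift COPY statement for loading delimited data from Amazon S3.
# All identifiers below (table, bucket path, IAM role) are hypothetical.
def build_copy_statement(table, s3_path, iam_role, delimiter=","):
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"DELIMITER '{delimiter}' "
        "GZIP REGION 'us-east-1';"
    )

sql = build_copy_statement(
    table="events",
    s3_path="s3://my-bucket/events/2013/",
    iam_role="arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

The resulting statement is executed over an ordinary PostgreSQL connection to the cluster; because COPY reads many S3 objects in parallel across the cluster's slices, it is far faster than row-by-row INSERTs.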
SEC101 A Guided Tour of AWS Identity and Access Management - AWS re:Invent… - Amazon Web Services
Learn what AWS Identity and Access Management (IAM) technologies are available for you to manage users and their access to your AWS environment. We present a high-level discussion of the benefits and functionality IAM provides to control secure access to your AWS environment. We discuss how you can manage users and their permissions with IAM, how roles make it simpler for you to delegate access, and how to use Multi-Factor Authentication (MFA) to require additional proof of identity.
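The permissions IAM grants are expressed as JSON policy documents attached to users, groups, or roles. A minimal read-only policy for a single S3 bucket (the bucket name is a hypothetical example):

```python
import json

# A minimal IAM policy document granting read-only access to one S3 bucket
# (bucket name hypothetical). Policies are plain JSON attached to users,
# groups, or roles.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "ReadOnlyReports",
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::example-reports",    # for ListBucket
            "arn:aws:s3:::example-reports/*",  # for GetObject
        ],
    }],
}

document = json.dumps(policy, indent=2)
print(document)
```

Note that `ListBucket` applies to the bucket ARN while `GetObject` applies to the objects inside it, which is why both resource ARNs are needed; the document would then be attached via the IAM console, CLI, or SDK.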
iQ FutureNow: Ensuring the success of your mobile strategy - iQcontent
Xavier Agnetti from Adobe, the leader in digital marketing technology, tells us how to analyse and measure the effectiveness of your mobile strategy. First presented at iQ FutureNow, Manchester, 4 July 2012.
Delivering Better Search For WordPress - AWS Webcast - Michael Bohlig
Want to offer your users more accurate search results for your WordPress websites and content? We will show you how to install and set up the Lift WordPress plugin for Amazon CloudSearch to improve your default WordPress search functionality. You will learn how to get better search relevancy, with faceting and search filters for post types and date ranges. Lift integrates with your existing WordPress theme, with no need for customization, and runs on top of your WordPress installation with no additional servers, services, or hosting configuration required.
Presenters:
Chris Scott, Voce Communications
Jon Handler, Solution Architect, Amazon CloudSearch
Architecting Security & Governance across Your AWS Landing Zone - SEC301 - An... - Amazon Web Services
Whether it is per business unit or per application, many AWS customers use multiple accounts to meet their infrastructure isolation, separation of duties, and billing requirements to establish their AWS Landing Zone. In this session, we cover considerations, limitations, and security patterns when building a multi-account strategy. We explore topics such as thought pattern, identity federation, cross-account roles, consolidated logging, and account governance. We conclude by presenting an enterprise-ready landing zone framework and providing the background needed to implement an AWS Landing Zone.
Amazon Simple Workflow Service (Amazon SWF) is a workflow service for building scalable, resilient applications. Whether automating business processes for finance or insurance applications, building sophisticated data analytics applications, or managing cloud infrastructure services, Amazon SWF reliably coordinates all of the processing steps within an application.
How to build Forecasting services using ML and deep learn... algorithms - Amazon Web Services
Forecasting is an important process for a great many companies and is used in various fields to try to accurately predict the growth and distribution of a product, the use of the resources needed on production lines, financial reporting, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session we will show how to pre-process data that contains a time component and then use an algorithm that, starting from the type of data analyzed, produces an accurate forecast.
Big Data for Startups: how to create Big Data applications in Server... mode - Amazon Web Services
The variety and quantity of data created every day is growing ever faster and represents a unique opportunity to innovate and create new startups.
However, managing large amounts of data can seem complex: building large-scale Big Data clusters appears to be an investment accessible only to established companies. But the elasticity of the Cloud and, in particular, Serverless services allow us to break through these limits.
Let's see how it is possible to develop Big Data applications quickly, without worrying about the infrastructure, dedicating all our resources to developing our ideas and creating innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session we will present the main features of the service and show how to deploy your application in a few steps.
Twenty years ago, Amazon went through a radical transformation aimed at increasing the pace of innovation. Over that period we learned how changing our approach to application development greatly increased our agility and release velocity and, ultimately, allowed us to build more reliable and scalable applications. In this session we will explain how we define modern applications and how building modern apps affects not only application architecture, but also organizational structure, development release pipelines, and even the operating model. We will also describe common approaches to modernization, including the approach used by Amazon.com itself.
How to spend up to 90% less with containers and Spot Instances (Amazon Web Services)
Container usage keeps growing.
When properly designed, container-based applications are very often stateless and flexible.
AWS ECS, EKS, and Kubernetes on EC2 can all take advantage of Spot Instances, yielding average savings of 70% compared to On-Demand Instances. In this session we will explore the characteristics of Spot Instances and how easily they can be used on AWS. We will also learn how Spreaker uses Spot Instances to run applications of various kinds, in production, at a fraction of the on-demand cost!
In recent months, many customers have been asking us how to monetise Open APIs, simplify fintech integrations, and accelerate the adoption of various Open Banking business models. AWS and FinConecta would therefore like to invite you to the Open Finance marketplace presentation on October 20th.
Event Agenda:
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Make your startup's market offering unique with Machine Lea... services (Amazon Web Services)
To create value and build their own differentiated, recognizable offering, successful startups know how to combine established technologies with innovative, purpose-built components.
AWS provides ready-to-use services and, at the same time, lets you customize and create the differentiating elements of your offering.
Focusing on Machine Learning technologies, we will see how to select the artificial intelligence services offered by AWS and, with the help of a demo, how to build custom Machine Learning models using SageMaker Studio.
OpsWorks Configuration Management: automate the management and deployment of... (Amazon Web Services)
With the traditional approach to IT, implementing DevOps techniques was difficult for many years: they often involved manual activities that from time to time caused application downtime and interrupted users' work. With the advent of the cloud, DevOps techniques are now within everyone's reach, at low cost, for any kind of workload, guaranteeing greater system reliability and delivering significant improvements in business continuity.
AWS offers AWS OpsWorks as a Configuration Management tool that aims to automate and simplify the management and deployment of EC2 instances using Chef and Puppet.
Learn how to use AWS OpsWorks to guarantee the reliability of your application running on EC2 instances.
Microsoft Active Directory on AWS to support your Windows Workloads (Amazon Web Services)
Do you want to know the options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support group policy management, authentication, and authorization. In this session, we discuss the options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and running Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment with the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to detecting fraud or manufacturing defects, image and video analysis based on artificial intelligence techniques is evolving and being refined at a rapid pace. In this webinar we will explore what AWS services make possible, applying state-of-the-art computer vision techniques to real-world scenarios.
Amazon Web Services and VMware are hosting a free virtual event next Wednesday, October 14th, from 12:00 to 13:00, dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in cloud environments based on VMware vSphere® and access a wide range of AWS services, fully exploiting the potential of the AWS cloud while protecting your existing VMware investments.
Build your first serverless ledger-based app with QLDB and NodeJS (Amazon Web Services)
Many companies today build applications with ledger-style functionality, for example to verify the history of credits and debits in banking transactions, or to track the supply-chain flow of their products.
At the heart of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log; however, these are complex and costly tools to manage.
Amazon QLDB removes the need to build complex custom systems by providing a fully managed, serverless ledger database.
In this session we will discover how to build a complete serverless application that uses QLDB's capabilities.
With the rise of microservice architectures and rich mobile and web applications, APIs are more important than ever for offering end users an exceptional user experience. In this session we will learn how to tackle modern API design challenges with GraphQL, an open-source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We will dive into several scenarios, understanding how AppSync can help solve these use cases by building modern APIs with real-time and offline data-update capabilities.
We will also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to users of its web portal.
Oracle Database and VMware Cloud™ on AWS: myths to dispel (Amazon Web Services)
Many organizations reap the benefits of the cloud by migrating their Oracle workloads, securing significant gains in agility and cost efficiency.
Migrating these workloads can create complexity during application modernization and refactoring, and performance risks may be introduced when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips to ease and streamline the migration of Oracle workloads and accelerate the transformation to the cloud; they dive into the architecture and show how to fully exploit the potential of VMware Cloud™ on AWS.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies managing Docker containers through an orchestration layer controlling deployment and lifecycle. In this session we will present the main features of the service, reference architectures for different workloads, and the simple steps needed to quickly migrate one or more of your containers.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
• UI automation introduction
• UI automation sample
• Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We ended with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Welcome to ViralQR, your best QR code generator (ViralQR)
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make creating QR codes easy and smooth, enhancing customer interaction and making business run more fluidly. We strongly believe in the power of QR codes to transform how businesses interact with their customers, and we are set on making that technology accessible and usable far and wide.
Our Achievements
Since our inception, we have served many clients, providing QR codes for marketing, service delivery, and feedback collection across various industries. Our platform has been recognized for its ease of use and strong feature set, which help businesses create QR codes easily.
Our Services
ViralQR offers a comprehensive suite of services that caters to your needs:
Static QR Codes: Create free static QR codes. These QR codes are able to store significant information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, ViralQR offers a 14-day free trial, an excellent opportunity for new users to get a feel for the platform. From there, you can easily subscribe and experience the full power of dynamic QR codes. The subscription plans are priced flexibly so that virtually every business can afford to benefit from our service.
Why choose us?
ViralQR provides services for marketing, advertising, catering, retail, and more. QR codes can be placed on flyers, packaging, merchandise, and banners, and can even substitute for cash and cards in a restaurant or coffee shop. By integrating QR codes into your business, you can improve customer engagement and streamline operations.
Comprehensive Analytics
ViralQR subscribers receive detailed analytics and tracking tools that give a clear view of QR code performance. Our analytics dashboard shows aggregate and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
Thank you for choosing ViralQR; we offer nothing but the best in QR code services to meet the needs of diverse businesses!
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: it requires vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
39. What are Spot Instances?
[Slide diagram: a Region containing two Availability Zones; unused EC2 capacity in each zone is sold at discounts of 50%, 54%, 56%, 59%, 63%, and 66%.]
40. What is the tradeoff?
[Slide diagram: the same Region and Availability Zones; unused capacity can be reclaimed by EC2 when it is needed again.]
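To make the economics of the two slides above concrete, here is a small sketch of the savings calculation. The discount figures are the sample values from the slide; the on-demand hourly rate is a made-up number for illustration, not a real EC2 price.

```python
# Illustrative only: the discount rates come from the slide; the on-demand
# price is a hypothetical figure, not a real EC2 rate.
ON_DEMAND_HOURLY = 0.50  # hypothetical on-demand price per instance-hour (USD)

def spot_cost(instance_hours, discount):
    """Cost of running on Spot at a given discount off the on-demand rate."""
    return instance_hours * ON_DEMAND_HOURLY * (1 - discount)

def savings(instance_hours, discount):
    """Absolute savings versus paying the on-demand rate."""
    return instance_hours * ON_DEMAND_HOURLY - spot_cost(instance_hours, discount)

# A 100-instance-hour job at the slide's 66% discount:
print(round(spot_cost(100, 0.66), 2))  # 17.0
print(round(savings(100, 0.66), 2))    # 33.0
```

The tradeoff shown on slide 40 is not modeled here: a reclaimed instance forfeits its remaining work, which is why Spot suits fault-tolerant workloads like Hadoop task nodes.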
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
Hadoop is complex to set up and hard to operate, and it is capex intensive. If your workload needs to scale, Hadoop is hard to scale on physical infrastructure. Though Hadoop is a fault-tolerant system, it is difficult to replace failed components such as disk drives or nodes: you still need time to procure replacement hardware.
The key messages we want to deliver with this slide are:
1. Elastic MapReduce is a hosted Hadoop service. We take the most stable version of Apache Hadoop, provide it as a hosted service, and build integration points with other services in the AWS ecosystem such as S3, CloudWatch, and DynamoDB. We make other improvements to Hadoop so that it becomes easier to scale and manage on AWS.
2. We keep iterating on the different versions of Hadoop as they become stable. From the console you launch the latest version of Hadoop, but you can also choose to launch an older version via the CLI or the SDK.
3. So what can you do with EMR? You can build applications on Amazon EMR just as you would with Hadoop. To develop custom Hadoop applications, you used to need access to a lot of hardware to test your Hadoop programs. Amazon EMR makes it easy to spin up a set of Amazon EC2 instances as virtual servers to run your Hadoop cluster, and to test various server configurations without having to purchase or reconfigure hardware. When you're done developing and testing your application, you can terminate your cluster, paying only for the computational time you used. Amazon EMR provides three types of clusters (also called job flows) that you can launch to run custom map-reduce applications, depending on the type of program you're developing and which libraries you intend to use.
Supported Hadoop versions are 1.0.3, 0.20.205, 0.20, and 0.18.
Custom JAR: Run your custom map-reduce program written in Java. This cluster provides low-level access to the MapReduce API. You have the most flexibility programming for this type of cluster, but also the responsibility of defining and implementing the map and reduce tasks in your Java application.
Cascading: Cascading is an open-source Java library that provides a query API, a query planner, and a job scheduler for creating and running Hadoop MapReduce applications. Applications developed with Cascading are compiled and packaged into standard Hadoop-compatible JAR files, similar to other native Hadoop applications. Multitool is a Cascading application that provides a simple command-line interface for managing large datasets. For example, you can filter records matching a Java regular expression from data stored in Amazon S3 and copy the results to the Hadoop file system. You can run the Cascading Multitool application on Amazon Elastic MapReduce (Amazon EMR) using either the Amazon EMR command line interface or the Amazon EMR console. Amazon EMR supports all Multitool arguments.
Streaming: Run a single Hadoop job based on map and reduce functions you upload to Amazon S3. The functions can be implemented in any of the following supported languages: Ruby, Perl, Python, PHP, R, Bash, C++.
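Since Streaming accepts map and reduce functions written in languages such as Python, here is a minimal word-count pair of the kind Streaming runs. Under Hadoop these scripts would read stdin and write stdout, with the shuffle phase sorting map output by key between them; this sketch reproduces that contract locally with plain iterables and a sort, so it runs without a cluster.

```python
import itertools

def mapper(lines):
    """Streaming mapper: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Streaming reducer: input arrives sorted by key; sum counts per word."""
    split = (p.split("\t") for p in pairs)
    for word, group in itertools.groupby(split, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(kv[1]) for kv in group)}"

# Locally, the Hadoop shuffle is just a sort between the two stages:
mapped = sorted(mapper(["the quick fox", "the lazy dog"]))
print(list(reducer(mapped)))  # ['dog\t1', 'fox\t1', 'lazy\t1', 'quick\t1', 'the\t2']
```

On EMR you would upload scripts like these to S3 and point the Streaming job flow at them; the tab-separated stdin/stdout convention is what makes any of the listed languages interchangeable.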
Hive and Pig: You can use Amazon EMR to analyze data without writing a line of code. Several open-source applications run on top of Hadoop and make it possible to run map-reduce jobs and query data using either a SQL-like syntax or a specialized query language called Pig Latin. Amazon EMR is integrated with Apache Hive and Apache Pig. With Amazon's version of Hive, you can run queries against data in NoSQL data stores like DynamoDB and HBase, along with data in S3 and HDFS, all in a single query. This is an Amazon-specific option. You can also use EMR to move large volumes of data in and out of databases and data stores; by distributing the work, the data can be moved quickly. Amazon EMR provides custom libraries to move data in and out of Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, and Apache HBase.
RazorFish
EMR supports multiple instance types, including the latest HS1 instances. EMR now supports High Storage Instances (hs1.8xlarge) in US East. These instances offer 48 TB of storage across 24 hard disk drives, 35 EC2 Compute Units (ECUs) of compute capacity, 117 GB of RAM, 10 Gbps networking, and 2.4+ GB per second of sequential I/O performance. High Storage Instances are ideally suited for Hadoop and significantly reduce the cost of processing very large data sets on EMR. We look forward to adding support for High Storage Instances in additional regions early next year.
And the concept of adding nodes works especially well with Hadoop on the cloud, since 10 nodes running for 10 hours costs the same as 100 nodes running for 1 hour.
10 nodes x 10 hours = 100 nodes x 1 hour
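The cost equivalence above is just instance-hour arithmetic; a short sketch makes it explicit (the hourly rate is a made-up figure for illustration):

```python
def cluster_cost(nodes, hours, hourly_rate):
    """Total cost scales with instance-hours: nodes x hours x rate."""
    return nodes * hours * hourly_rate

rate = 0.10  # hypothetical per-node hourly rate (USD)

# Same bill either way, but the wide cluster finishes ten times sooner:
assert cluster_cost(10, 10, rate) == cluster_cost(100, 1, rate)
```

This is why elastic capacity changes the calculus: on fixed hardware you cannot trade cluster width for wall-clock time at constant cost.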
1.3 trillion objects, 835k+ peak transactions per second
You can run Hadoop clusters in automated mode, where your code is pulled out of S3 automatically by the cluster, or you can run an interactive cluster, where once the cluster boots you can SSH into the master node and fire off a job manually.
Now you can create a job flow. It's important to understand the concept of a job flow: a job flow is the series of instructions Amazon Elastic MapReduce (Amazon EMR) uses to process data. A job flow contains any number of user-defined steps. A step is any instruction that manipulates the data. Steps are executed in the order in which they are defined in the job flow.
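The idea of steps executed in order can be sketched as a tiny simulation; the two steps below are hypothetical, purely to illustrate the "each step manipulates the data" model, and are not EMR API calls:

```python
def run_job_flow(steps, data):
    """Toy model of a job flow: each step transforms the data, in the
    order the steps were defined."""
    for step in steps:
        data = step(data)
    return data

# Hypothetical two-step flow: tokenize the input, then count the words.
steps = [
    lambda text: text.lower().split(),
    lambda words: {w: words.count(w) for w in set(words)},
]
print(run_job_flow(steps, "Log log data"))
```

In real EMR, each step would typically be a Hadoop job (a JAR, a Hive script, and so on) reading the previous step's output, but the ordering guarantee is the same.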
This screen gives you the chance to select a different version of Hadoop.
Now you can select the type of job flow you want to run
Different options are available for different types of program. For example, the Java-based JAR will ask you for the location of your input data, your output data, and your mapper and reducer scripts. Extra arguments are anything extra that your programs might need; in this case, I have chosen to include some specific Hive libraries that my Hive script refers to.
Amazon EMR refers to managed Hadoop clusters as job flows, and defines the concept of instance groups, which are collections of Amazon EC2 instances that perform roles analogous to the master and slave nodes of Hadoop. There are three types of instance groups: master, core, and task. Each Amazon EMR job flow includes one master instance group that contains one master node, a core instance group containing one or more core nodes, and an optional task instance group, which can contain any number of task nodes. If the job flow runs on a single node, that instance is simultaneously a master and a core node. For job flows running on more than one node, one instance is the master node and the remaining instances are core or task nodes. You can choose different instance types for each group. Let's look at each of these instance group types.
Master instance group: The master instance group manages the job flow, coordinating the distribution of the MapReduce executable and subsets of the raw data to the core and task instance groups. It also tracks the status of each task performed and monitors the health of the instance groups. To monitor the progress of the job flow, you can SSH into the master node as the Hadoop user and either look at the Hadoop log files directly or access the user interface that Hadoop publishes to the web server running on the master node. As the job flow progresses, each core and task node processes its data, transfers the data back to Amazon S3, and provides status metadata to the master node.
Core instance group: The core instance group contains all of the core nodes of a job flow. A core node is an EC2 instance that runs Hadoop map and reduce tasks and stores data using the Hadoop Distributed File System (HDFS). Core nodes are managed by the master node. The EC2 instances you assign as core nodes are capacity that must be allotted for the entire job flow run. Because core nodes store data, you can't remove them from a job flow.
However, you can add more core nodes to a running job flow. Core nodes run both the DataNode and TaskTracker Hadoop daemons. Caution: removing HDFS from a running node runs the risk of losing data.
Task instance group: The task instance group contains all of the task nodes in a job flow. The task instance group is optional; you can add it when you start the job flow, or add a task instance group to a job flow in progress. Task nodes are managed by the master node. While a job flow is running you can increase and decrease the number of task nodes. Because they don't store data and can be added to and removed from a job flow, you can use task nodes to manage the EC2 instance capacity your job flow uses, increasing capacity to handle peak loads and decreasing it later. Task nodes run only a TaskTracker Hadoop daemon.
Three other aspects of instance groups are important, and we will address them later in this presentation: Spot Instances, dealing with failure, and resizing job flows.
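The resize rules described above (core capacity can only grow, because core nodes hold HDFS data; task capacity can grow or shrink freely) can be captured in a toy model. `JobFlow` and its methods are illustrative names invented here, not part of any EMR API:

```python
class JobFlow:
    """Toy model of EMR resize rules: core nodes store HDFS data and can
    only grow; task nodes hold no data, so they can grow and shrink."""

    def __init__(self, core_nodes, task_nodes=0):
        self.core = core_nodes
        self.task = task_nodes

    def resize_core(self, n):
        if n < self.core:
            raise ValueError("cannot remove core nodes: HDFS data would be lost")
        self.core = n

    def resize_task(self, n):
        self.task = n  # safe in either direction: task nodes store no HDFS data

flow = JobFlow(core_nodes=4)
flow.resize_task(20)   # scale out for a peak load
flow.resize_task(0)    # and back down again
flow.resize_core(8)    # core capacity can grow...
# flow.resize_core(2)  # ...but shrinking it would raise ValueError
```

This asymmetry is also why task nodes are the natural place to use Spot Instances: losing one costs only in-flight tasks, never stored data.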
Amazon EC2 Key Pair: Optionally, specify a key pair that you created previously. If you do not enter a value in this field, you cannot use SSH to connect to the master node.

Amazon VPC Subnet Id: Optionally, specify a VPC subnet identifier to launch the job flow in an Amazon VPC.

Amazon S3 Log Path: Optionally, specify a path in Amazon S3 to receive a copy of the log files generated by the job flow. When this value is set, Amazon EMR copies the log files from the EC2 instances in the job flow to Amazon S3. This prevents the log files from being lost when the job flow ends and the EC2 instances hosting the job flow are terminated.

Enable Debugging: Optionally, select Yes to create an index of your log files in Amazon SimpleDB. This index must exist in order to use the debugging tool in the Amazon EMR console. Whether or not to create this index can only be set when the job flow is created. If you set this to Yes, you must also specify a value for Amazon S3 Log Path.

Keep Alive: Optionally, select Yes to cause the job flow to continue running when all processing is completed. This is how you would run a persistent cluster: once you keep the cluster alive, you can continue to submit jobs to it, and when a job finishes you will see the cluster in the WAITING state we discussed earlier. If you select No, the job flow is non-interactive and terminates automatically when it is done, so you do not continue to accrue charges on an idle job flow.

Termination Protection: Optionally, select Yes to ensure the job flow is not shut down due to accident or error.

Visible To All IAM Users: Select Yes to make the job flow visible and accessible to all IAM users on the AWS account. For more information, see Configure User Permissions with IAM.
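The one hard dependency among these options (enabling debugging requires an S3 log path, because the SimpleDB index is built from the logs copied to S3) can be sketched as a quick pre-flight check. The helper and its field names are hypothetical, not an EMR API call:

```python
def check_jobflow_options(log_path=None, enable_debugging=False,
                          keep_alive=False, termination_protection=False):
    """Validate a job flow's launch options before submitting it.

    Raises ValueError on the dependency described above: the debugging
    index in Amazon SimpleDB requires log files copied to Amazon S3.
    """
    if enable_debugging and not log_path:
        raise ValueError(
            "Enable Debugging requires an Amazon S3 Log Path: the debugging "
            "index is built from the log files copied to S3.")
    return {
        "log_path": log_path,
        "enable_debugging": enable_debugging,
        "keep_alive": keep_alive,          # persistent cluster if True
        "termination_protection": termination_protection,
    }

# A persistent, debuggable cluster (bucket name is a placeholder):
opts = check_jobflow_options(log_path="s3://my-bucket/logs/",
                             enable_debugging=True, keep_alive=True)
```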
Bootstrap actions allow you to pass a reference to a script stored in Amazon S3. This script can contain configuration settings and arguments related to Hadoop or Elastic MapReduce. Bootstrap actions are run before Hadoop starts and before the node begins processing data. Unlike other managed services, EMR gives you complete control: with a bootstrap action you can make any customization to the Hadoop cluster, or install other open source projects such as Mahout on it.

Note: if a bootstrap action returns a nonzero error code, Amazon Elastic MapReduce (Amazon EMR) treats it as a failure and terminates the instance. If too many instances fail their bootstrap actions, Amazon EMR terminates the job flow. If just a few instances fail, an attempt is made to reallocate the failed instances and continue. This is another advantage of the managed service.

Amazon provides a number of predefined bootstrap action scripts that you can use to customize Hadoop settings. References to predefined bootstrap action scripts are passed to Elastic MapReduce by using the bootstrap-action parameter. I am going to talk about the predefined bootstrap actions on the next slide.
All of these predefined bootstrap action scripts are available in S3, and you can download and change them. You can also use your own scripts: one example could be a script that pulls data from a relational data store incrementally; another could be a script that installs Mahout and configures the environment for it. Let's look at the existing predefined bootstrap actions.

Configure Daemons: This predefined bootstrap action lets you specify the heap size or other Java Virtual Machine (JVM) options for the Hadoop daemons. You can use it to configure Hadoop for large jobs that require more memory than Hadoop allocates by default. You can also use it to modify advanced JVM options, such as garbage collection behavior.

Configure Hadoop: This bootstrap action allows you to set cluster-wide Hadoop settings. The script provides two types of command line options. Option 1 enables you to upload an XML file containing configuration settings to Amazon S3; the bootstrap action merges the new configuration settings with the existing Hadoop configuration. Option 2 allows you to specify a Hadoop key/value pair on the command line that overrides the existing Hadoop configuration.

Configure Memory-Intensive Workloads: This bootstrap action allows you to set cluster-wide Hadoop settings to values appropriate for job flows with memory-intensive workloads. Note: the default configurations for cc1.4xlarge, cc2.8xlarge, hs1.8xlarge, and cg1.4xlarge instances are already sufficient for memory-intensive workloads, and this bootstrap action does not modify the settings for these instance types.

Shutdown Actions: A bootstrap action script can create one or more shutdown actions by writing scripts to the /mnt/var/lib/instance-controller/public/shutdown-actions/ directory. When a job flow is terminated, all the scripts in this directory are executed in parallel.
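The two Configure Hadoop options boil down to merge-then-override semantics, which can be illustrated with plain dictionaries. The property values below are examples, not recommendations:

```python
# Existing cluster-wide Hadoop configuration (illustrative values).
defaults = {
    "mapred.tasktracker.map.tasks.maximum": "2",
    "io.sort.mb": "100",
}

# Option 1: settings from an uploaded XML file are merged
# with the existing Hadoop configuration.
from_xml_file = {"io.sort.mb": "200", "mapred.compress.map.output": "true"}

# Option 2: a key/value pair given on the command line
# overrides whatever is already configured.
from_command_line = {"mapred.tasktracker.map.tasks.maximum": "4"}

# Later sources win over earlier ones.
effective = {**defaults, **from_xml_file, **from_command_line}
print(effective["io.sort.mb"])                            # "200" (XML merge)
print(effective["mapred.tasktracker.map.tasks.maximum"])  # "4" (CLI override)
```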
Each script must run and complete within 60 seconds. Note: shutdown action scripts are not guaranteed to run if the node terminates with an error.

Run If: You can use this predefined bootstrap action to conditionally run a command when an instance-specific value is found in the instance.json or job-flow.json files. The command can refer to a file in Amazon S3 that Elastic MapReduce can download and execute.

Lastly, the one that we think gets used quite frequently is Ganglia. The Ganglia open source project is a scalable, distributed system designed to monitor clusters and grids while minimizing the impact on their performance. When you enable Ganglia on your job flow, you can generate reports and view the performance of the cluster as a whole, as well as inspect the performance of individual node instances. To set up Ganglia monitoring on a job flow, you must specify the Ganglia bootstrap action when you create the job flow; you cannot add Ganglia monitoring to a job flow that is already running. Amazon Elastic MapReduce (Amazon EMR) then installs the monitoring agents and the aggregator that Ganglia uses to report data. Once you have Ganglia set up, you can look at detailed Ganglia metrics like those on the next slide.
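A bootstrap action registers a shutdown action simply by dropping an executable script into the shutdown-actions directory. The sketch below writes to a local directory for illustration; on an actual EMR node the path would be /mnt/var/lib/instance-controller/public/shutdown-actions/, and the S3 destination inside the script is a hypothetical example:

```python
import os
import stat

# On an EMR node this would be:
#   /mnt/var/lib/instance-controller/public/shutdown-actions/
# Using a local directory here so the sketch runs anywhere.
SHUTDOWN_DIR = os.environ.get("SHUTDOWN_DIR", "./shutdown-actions")

# Every script in this directory runs (in parallel with the others) when
# the job flow terminates, and must complete within 60 seconds.
script = """#!/bin/bash
# Hypothetical example: push local Hadoop logs to S3 before the node dies.
hadoop fs -put /mnt/var/log/hadoop s3://my-bucket/final-logs/
"""

os.makedirs(SHUTDOWN_DIR, exist_ok=True)
path = os.path.join(SHUTDOWN_DIR, "save-logs.sh")
with open(path, "w") as f:
    f.write(script)
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)  # must be executable
```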
When you open the Ganglia web reports in a browser, you see an overview of the cluster's performance, with graphs detailing the load, memory usage, CPU utilization, and network traffic of the cluster. Below the cluster statistics are graphs for each individual server in the cluster. For example, in this job we launched three instances, so in the following reports there are three instance charts showing the cluster data.
When you don't keep the job flow alive, the cluster shuts down once the work is done and you stop paying for it.
You can increase or decrease the number of nodes in a running job flow. A job flow contains a single master node. The master node controls any slave nodes that are present. There are two types of slave nodes: core nodes, which hold data to process in the Hadoop Distributed File System (HDFS), and task nodes, which do not contain HDFS. After a job flow is running, you can increase, but not decrease, the number of core nodes. Task nodes also run your Hadoop jobs; after a job flow is running, you can both increase and decrease the number of task nodes. You can modify the size of a running job flow using either the API or the CLI. The AWS Management Console allows you to monitor job flows that you resized, but it does not provide the option to resize job flows. You may include a predefined step in your workflow that automatically resizes a job flow between steps that are known to have different capacity needs. Because all steps are guaranteed to run sequentially, this allows you to set the number of slave nodes that will execute a given job flow step.
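The resize rules above — the master is fixed, core nodes can only grow, task nodes can grow or shrink — fit in a few lines. This is a sketch of the rules, not an EMR API:

```python
def can_resize(group, current, requested):
    """Return True if a running job flow's instance group can be resized
    from `current` nodes to `requested` nodes."""
    if group == "master":
        return False                 # always exactly one master node
    if group == "core":
        return requested >= current  # core nodes hold HDFS data: grow only
    if group == "task":
        return requested >= 0        # task nodes hold no data: grow or shrink
    raise ValueError("unknown instance group: %r" % group)

print(can_resize("core", 4, 2))  # False: shrinking core would lose HDFS blocks
print(can_resize("task", 4, 0))  # True: task nodes can be removed freely
```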
Enter Spot Instances.
What is the trade-off? In the case of Hadoop, if your task nodes are on Spot and they get taken away, your job won't stop and you will be able to continue.
Suppose you have a job that needs 4 nodes and runs for 14 hours. So 4 nodes running for 14 hours at $0.45 per hour (on-demand) will cost you $25.20. Now assume we add 5 more nodes, BUT we add them on Spot. Since the number of nodes has roughly doubled, the time taken is halved, given Hadoop's scalability. So in the second case, I pay for 4 on-demand instances x 7 hours x $0.45 = $12.60, and assuming Spot is at 50% of on-demand pricing, 5 Spot instances x 7 hours x $0.225 = $7.88, totaling $20.48. So you save 50% of the time with a 19% cost savings. If your Spot capacity gets taken away, you are back to scenario one, which is what you intended to run in the first place. So everything in scenario 2 (the bottom one) is a bonus!
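The arithmetic in this scenario can be checked directly. The prices and the perfect-scaling assumption are the hypothetical ones from the example:

```python
ON_DEMAND = 0.45         # $/hour per node (example price)
SPOT = ON_DEMAND * 0.50  # assume Spot at 50% of the on-demand price

# Scenario 1: 4 on-demand nodes for 14 hours.
cost1 = 4 * 14 * ON_DEMAND               # $25.20

# Scenario 2: add 5 Spot nodes; with roughly double the nodes,
# assume the job finishes in half the time (7 hours).
cost2 = 4 * 7 * ON_DEMAND + 5 * 7 * SPOT  # $12.60 + $7.88 = $20.48

savings = (cost1 - cost2) / cost1         # about 19% cheaper, in half the time
print(f"scenario 1: ${cost1:.2f}")
print(f"scenario 2: ${cost2:.2f}")
```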
This is a great time to talk about what happens in case of a failure. If the master node goes down, your job flow will be terminated and you'll have to rerun your job. Amazon Elastic MapReduce currently does not support automatic failover of the master node or master node state recovery. In case of master node failure, the AWS Management Console displays a "The master node was terminated" message, which is an indicator for you to start a new job flow. Customers can implement checkpointing in their job flows to save intermediate data (data created in the middle of a job flow that has not yet been reduced) to Amazon S3. This allows resuming the job flow from the last checkpoint in case of failure. Amazon Elastic MapReduce is fault tolerant for slave failures and continues job execution if a slave node goes down. The service also monitors your job flow execution: retrying failed tasks, shutting down problematic instances, and provisioning new nodes to replace those that fail. Amazon EMR supports NameNode redundancy using MapR, so if you want to try MapR, please go ahead.
There are two types of logs that store information about your job flow: step-level logs generated by Amazon Elastic MapReduce (Amazon EMR) and Hadoop job logs generated by Apache Hadoop. You need to examine both log types to have complete information about your job flow. Amazon EMR step-level logs contain information about the job flow and the results of each step. These logs are useful when you are debugging problems that you encounter initializing and running the job flow. For example, a step-level log contains status information such as "Streaming Command Failed!". Hadoop logs contain information about Hadoop jobs, tasks, and task attempts; they are the standard log files generated by Apache Hadoop. The following image shows the relationship between Amazon EMR job flow steps and Hadoop jobs, tasks, and task attempts. Both step-level logs and Hadoop logs are generated by default and stored on the master node of the job flow. You can access them while the job flow is running by using SSH to connect to the master node as the Hadoop user. When the job flow ends, the master node is terminated and you will no longer be able to access those logs using SSH. To be able to access the log files of a terminated job flow, you can direct Amazon EMR to copy the step-level and Hadoop log files to an Amazon S3 bucket. If you specify that the log files are to be copied to an Amazon S3 bucket, you have the option to have Amazon EMR create an index over those log files to generate debugging information and reports. This index is stored in Amazon SimpleDB and can be accessed by clicking the Debug button in the Amazon EMR console.
Summarize this slide
Quickly show this slide, take the names, and move on to more examples as listed in slides 49 to 53.
There is also support for enterprise products such as Informatica, which you have probably heard about. Informatica is the leader in the enterprise data integration space, and their product HParser allows you to use the cloud to do ETL operations on large data sets. Informatica's HParser is a tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. For example, if your company has legacy stock trading information stored in custom-formatted text files, you could use HParser to read the text files and extract the relevant data as XML. In addition to text and XML, HParser can extract and convert data stored in proprietary formats such as PDF and Word files. HParser is designed to run on top of the Hadoop architecture, which means you can distribute operations across many computers in a cluster to efficiently parse vast amounts of data. Amazon Elastic MapReduce (Amazon EMR) makes it easy to run Hadoop in the Amazon Web Services (AWS) cloud: with Amazon EMR you can set up a Hadoop cluster in minutes and automatically terminate the resources when the processing is complete.
The MapR Hadoop distribution adds dependability and ease of use to the strength and flexibility of Hadoop. The Amazon Elastic MapReduce (EMR) service enables you to easily set up, operate, and scale MapR deployments in the cloud, as well as integrate with other AWS services.
NFS: The MapR distribution for Hadoop provides an NFS interface that you can use to mount the cluster. The NFS interface enables you to use standard Linux tools and applications with your cluster directly. You can get data into and out of the cluster with scp, and analyze data with commands like grep, sed, awk, or your own applications or scripts. Amazon EMR with MapR clusters have NFS preconfigured. The cluster is mounted at the /mapr directory on the master node; cluster data and files reside in the directory /mapr/clustername (for example, /mapr/my.cluster.com). To use NFS on your Amazon EMR with MapR cluster, log in to the master node via SSH. After logging in to the cluster, you can use standard file-based applications, including Linux utilities, file browsers, and other applications. The MapR distribution for Hadoop also provides a Hive ODBC driver that conforms to the standard ODBC 3.52 specification.
With the M5 version of the MapR software you get enterprise features like disaster recovery across Availability Zones, where you can mirror specific data between clusters. You can also extend an on-premises MapR cluster to the cloud. Last but by no means least, you can take periodic on-demand snapshots to S3.
So let's look at some of the common design decisions developers have to make before deploying a cluster. The first one: should I use S3, or should I run HDFS? Actually, you can use both; the choice is yours. Remember that with EMR, HDFS data is lost as soon as you shut down the cluster, since HDFS sits on the local ephemeral drives and dies when the cluster is shut down.
Take for example the Netflix Hadoop platform-as-a-service architecture. Netflix collects a huge amount of data, and what you see in the diagram is their Hadoop-as-a-service platform built on AWS, offering a big data processing engine to different stakeholders within the business. At the base of the service is S3, where everything that is worth storing is stored, and hence it is the "single version of truth". With its scale, cost, global reach, and durability, S3 is the perfect place for them to store data. From S3 they run multiple EMR clusters. They like to use EMR instead of building their own cluster on EC2 because EMR takes away the undifferentiated heavy lifting. Various tools are used to explore the data, like Hive, Pig, Java programs, and Python code. On top of it they have a job execution and resource management platform called Genie. Genie is connected to enterprise schedulers and other visualization and web tools for data analysis.
These are the reasons why customers choose S3.

Eleven 9s of reliability and durability.

Version control against failure: with S3 you can enable versioning, which protects the data from logical corruption. Let's say on your physical cluster a developer overwrote something and logically corrupted the data. In spite of 3x replication of data in HDFS, you cannot recover it; with S3, just roll back.

Elastic and practically unlimited size: you can run multiple clusters in parallel — one production cluster, one SLA-driven high-performance cluster, many ad-hoc clusters, many dev clusters. Running different types of workflows in parallel guarantees isolation between jobs. Remember that five 10-node clusters cost you the same as one 50-node cluster but provide better isolation. If your data were in HDFS, you would need to replicate all the data between each cluster; with S3 there is one single version of truth, and you can run as many clusters as you want.

Continuously resizing clusters on the run can be difficult if all your data is in HDFS (data redistribution can happen); with S3 there is just a single version of truth.

On failure or spiky load, spin up a new cluster and start the job flow; there is no need to mirror data across HDFS.

http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
However, if you do want to use HDFS, you can. Remember that if the cluster is shut down, the data is lost, so make sure termination protection is on. All data processing then happens locally and not from S3. Alternatively, consider snapshotting to S3 periodically. Use S3DistCp to pull large volumes of data from S3 or push large volumes of data to S3. S3DistCp is a tool available on EMR that can be used to move large amounts of data; it runs on multiple nodes so that each node pulls data in parallel.
You can definitely use HDFS on EMR. You need to have