Amazon EMR

Boston Data Mining Meetup introduction slides from Big Data Infrastructure workshop - A hands-on introduction

Published in: Software

  1. Amazon Elastic MapReduce (EMR), Saturday, December 6, 2014
  2. Agenda
     • 08:30 AM Breakfast
     • 09:00 AM Introduction and Strengths of Technologies
     • 10:00 AM Start an EMR Cluster
     • 10:15 AM Break + set up query tool
     • 10:30 AM Hadoop hands-on
     • 10:55 AM Break
     • 11:10 AM Redshift hands-on
     • 11:40 AM Operationalizing your code
     • 12:00 PM Adjourn
  3. Session Goals
     • Understand:
       • When to use EMR?
     • Do:
       • Start Cluster
       • Load Data from S3
       • Transform Data
       • Unload Data to S3
     Draw elements from Gil’s deck Pattern
  4. When to use EMR?
     • Some Boolean combination of the following:
       • Ephemeral clusters
       • Batch processing: daily, weekly, etc.
       • User Defined Functions (UDF)
       • File formats
       • TB, PB data sets in S3
       • Instant gratification
  5. Let’s Do This!
     What do we need?
     • Key (.pem file)
     • SQL Workbench
     What will we do?
     • Start Cluster
     • Load stock market data from S3
     • Calculate Sharpe ratio
     • Unload Sharpe ratio results to S3
     The Sharpe Ratio characterizes how well the return of an asset compensates the investor for the risk taken. Roughly, the higher the better.
  6. AWS Console
     • Just google “aws console”
  7. Where’s EMR? (screenshot callout: “Click Here”)
  8. Create Cluster
  9. Cluster Options
     • Lots of them!
     • Cluster Configuration
     • Tags - Skip
     • Software Configuration
     • File System Configuration
     • Hardware Configuration
     • Security and Access
     • IAM Roles
     • Bootstrap Actions
     • Steps
  10. Cluster Configuration
  11. Software Configuration
      More fun stuff in here
  12. File System Configuration
  13. Hardware Configuration
      $0.28 / hour
      Set Core and Task to 0
  14. Security and Access
      Finally we get to use our keys!
  15. IAM Roles
      Just defaults, please
      More JSON in here
  16. Bootstrap Actions
      • Tweak configuration
      • Install custom application (Apache Drill, Mahout, etc.)
      • Shell scripts
  17. Steps
  18. Steps
  19. Steps: Hive Program
  20. Provisioning
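The console choices in the slides above (Hive installed, one master node with Core and Task set to 0, an EC2 key pair, default IAM roles) can also be expressed as a request to boto3's EMR `run_job_flow` call. This is only a sketch under assumptions: the cluster name, release label, and instance type are hypothetical placeholders rather than values from the workshop, and the region is inferred from the `compute-1` hostname shown later in the deck.

```python
def build_cluster_request(key_name, log_uri=None):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    request = {
        "Name": "emr-training",                  # hypothetical cluster name
        "ReleaseLabel": "emr-6.15.0",            # assumption: pick a current release
        "Applications": [{"Name": "Hive"}],      # Software Configuration
        "Instances": {
            "MasterInstanceType": "m5.xlarge",   # assumption: not the slide's instance type
            "InstanceCount": 1,                  # master only: Core and Task set to 0
            "Ec2KeyName": key_name,              # Security and Access: the .pem key pair
            "KeepJobFlowAliveWhenNoSteps": True, # keep the cluster up for interactive use
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",    # IAM Roles: defaults
        "ServiceRole": "EMR_DefaultRole",
    }
    if log_uri:
        request["LogUri"] = log_uri
    return request


def start_cluster(request, region="us-east-1"):
    """Submit the request; needs boto3 and AWS credentials, and incurs charges."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("emr", region_name=region)
    return client.run_job_flow(**request)["JobFlowId"]


request = build_cluster_request("emr-training")
print(request["Instances"])
```

Calling `start_cluster(request)` would return the new cluster's ID, which is what the Provisioning screen shows in the console.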
  21. Bootstrapping
      Here’s your hostname
      SSH Info
  22. Monitor Startup Progress
  23. SSH – Linux/Mac
  24. SSH – Windows
  25. Port Forwarding (Mac/Linux)
      ssh -i ~/.ec2/emr-training.pem -L 10000:localhost:10000 hadoop@ec2-54-173-219-156.compute-1.amazonaws.com
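Before pointing a SQL client at the tunnel, it can help to confirm that the forwarded port is actually accepting connections. The helper below is a small sketch for that check; `port_is_open` is a hypothetical name, and port 10000 is taken from the `-L 10000:localhost:10000` forwarding in the ssh command.

```python
import socket


def port_is_open(host="localhost", port=10000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if port_is_open():
    print("Tunnel is up: local port 10000 is accepting connections")
else:
    print("Nothing is listening on local port 10000; is the ssh tunnel running?")
```

A successful check only shows that something is listening locally; it does not prove the remote end is the Hive server.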
  26. Connect with SQL Workbench:
      • Localhost
      • Autocommit
      • Default URL
  27. Load Data from S3
      Familiar SQL
      Describe file format
      Pull from DK bucket
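The three callouts on this slide (familiar SQL, a file-format description, an S3 source) correspond to the parts of a Hive `CREATE EXTERNAL TABLE` statement. The sketch below renders such a statement as a string; the table name, columns, and bucket path are hypothetical stand-ins, since the slide's actual query and bucket are not in the transcript.

```python
def external_table_ddl(table, columns, s3_location, delimiter=","):
    """Render a Hive CREATE EXTERNAL TABLE statement for delimited text in S3."""
    cols = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"  # familiar SQL
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"     # file format
        f"LOCATION '{s3_location}'"                                      # data stays in S3
    )


# Hypothetical schema and bucket, for illustration only.
ddl = external_table_ddl(
    "stocks",
    [("symbol", "STRING"), ("trade_date", "STRING"), ("adjclose", "DOUBLE")],
    "s3://example-bucket/stocks/",
)
print(ddl)
```

Because the table is external, dropping it in Hive removes only the metadata and leaves the files in S3 untouched.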
  28. Calculate Daily Returns
      Create a table in HDFS
      Copy data into our new table
      Hive has Windowing and Analytic Features
      Daily Return = (adjclose[n] / adjclose[n-1]) - 1
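As a cross-check outside Hive, the daily return can be computed in a few lines of Python, assuming the usual simple-return definition `adjclose[n] / adjclose[n-1] - 1`. Pairing each price with the previous one plays the role that a window function such as Hive's `LAG` would play in the query; the sample prices are made up.

```python
def daily_returns(adjclose):
    """Simple daily returns from a date-ordered list of adjusted closing prices."""
    # zip pairs each price with its predecessor: (p0, p1), (p1, p2), ...
    return [today / previous - 1 for previous, today in zip(adjclose, adjclose[1:])]


prices = [100.0, 110.0, 99.0]  # made-up prices for illustration
print(daily_returns(prices))   # roughly +10% followed by -10%
```

The result has one element fewer than the input, because the first day has no previous close.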
  29. Calculate Sharpe Ratio
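The slide's query is not part of the transcript, so here is a minimal Python sketch of one common form of the Sharpe ratio: mean excess return divided by the sample standard deviation of excess returns. The zero default for the risk-free rate and the optional annualization factor are assumptions, not details from the workshop.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=None):
    """Mean excess return divided by its sample standard deviation.

    risk_free_rate is per period. If periods_per_year is given (for example
    252 for daily data), the ratio is scaled by its square root.
    """
    excess = [r - risk_free_rate for r in returns]
    spread = statistics.stdev(excess)  # needs at least two observations
    if spread == 0:
        raise ValueError("returns have zero variance; Sharpe ratio is undefined")
    ratio = statistics.mean(excess) / spread
    if periods_per_year:
        ratio *= math.sqrt(periods_per_year)
    return ratio


sample = [0.01, 0.03, 0.02, 0.04]  # made-up daily returns
print(sharpe_ratio(sample))
print(sharpe_ratio(sample, periods_per_year=252))
```

Higher values mean more return per unit of volatility, which matches the deck's "roughly, the higher the better".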
  30. Export Our Data
      Define CSV output
      Write out data
  31. Terminate!
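Termination can also be scripted, which is handy for the ephemeral-cluster pattern. The sketch below wraps boto3's EMR `terminate_job_flows` call; `terminate_cluster` is a hypothetical helper name, and the client is passed in so the function can be exercised without AWS access.

```python
def terminate_cluster(emr_client, cluster_id):
    """Ask EMR to shut down one cluster; billing stops once it has terminated."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
    return cluster_id
```

With real credentials this would be called with `boto3.client("emr", region_name="us-east-1")` and the cluster ID shown in the console.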
  32. Links and Resources
      • SQLWorkbench/J
      • AWS EMR Documentation
      • Hive Language Manual
