Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Design patterns and best practices for data analytics with amazon emr (ABD305)

1,476 views

Published on

Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.

Published in: Data & Analytics
  • Hello! High Quality And Affordable Essays For You. Starting at $4.99 per page - Check our website! https://vk.cc/82gJD2
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Design patterns and best practices for data analytics with amazon emr (ABD305)

  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT Design Patterns and Best Practices for Data Analytics with Amazon EMR J o n a t h a n F r i t z , P r i n c i p a l P r o d u c t M a n a g e r – A m a z o n E M R A n y a B i d a , S e n i o r M e m b e r o f T e c h n i c a l S t a f f - S a l e s f o r c e A B D 3 0 5
  2. 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Overview - Intro and architectures - Using Amazon EC2 Spot and Auto Scaling - Security overview - Ad-hoc and advanced workflows - Apache Spark and Amazon EMR at Salesforce
  3. 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Easy to use Launch a cluster in minutes What is Amazon EMR? Low cost Pay per-second Open-source variety Latest versions of software Managed Spend less time monitoring Secure Easy to enable options Flexible Full customization and control
  4. 4. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Open-source applications Storage S3 (EMRFS), HDFS YARN Cluster Resource Management Batch MapReduce Interactive Tez In Memory Spark Applications Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop HBase/Phoenix Presto Streaming Flink Zeppelin - Notebooks MXNet Hue - SQL Ganglia - Monitoring Livy – Job Server Connectorsto AWSservices Amazon EMR service
  5. 5. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use the AWS Glue Data Catalog • Support for Spark, Hive and Presto • Auto-generate schema and partitions • Managed table updates
  6. 6. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. HBase for random access at massive scale
  7. 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Real-time and batch processing
  8. 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. New – Deep learning with GPU instances Use P3 instances with GPUs
  9. 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Lower your costs T i p s t o
  10. 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Transient or long-running clusters • Amazon Linux AMI with preinstalled customizations for faster cluster creation • Auto Scaling to minimize costs for long-running clusters
  11. 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. EC2 Spot and instance fleets • EMR will select optimal EC2 AZ • Provision across instance types • Switch to on-demand
  12. 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use Auto Scaling Scaling options Threshold CloudWatch or custom metric
  13. 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Auto Scaling • EMR scales-in at YARN task completion • Selectively removes nodes with no running tasks • yarn.resourcemanager.decommissioning.timeout • Default timeout is one hour • Spark scale-in contributions • Spark specific blacklisting of tasks • Unregistering cached data and shuffle blocks • Advanced error handling
  14. 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Secure your cluster T i p s t o
  15. 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Encryption • Spark • Tez • MapReduce • Presto • HBase • Hive • Pig
  16. 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Authentication LDAP HiveServer2 Presto Coordinator Spark Thrift Server Hue Server Zeppelin Server AWS credentials EMR Step (EMR API) EC2 key pair SSH as “hadoop”
  17. 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. New – Authentication with Kerberos Microsoft Active Directory KDC Users YARN RMDoAs Service principals for all cluster nodes Master Node
  18. 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Authorization • Storage-based • EMRFS/S3 • HDFS • HiveServer2 and Presto (SQL-based) • HBase • YARN queues • Fine-grained access control by cluster tag (IAM) • Apache Ranger on edge node (using CloudFormation)
  19. 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. New – EMRFS fine-grained authorization Context User: aduser Group: analyst IAM role: analytics_prod Can map IAM roles to user, group, or S3 prefix Context User: aduser2 Group: dev IAM role: analytics_dev
  20. 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Security configuration demo
  21. 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Submit workflows T i p s t o
  22. 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use Livy as an ad-hoc Spark job server Custom Application
  23. 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Oozie and Airflow for DAGs of jobs
  24. 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Select Amazon EMR customers
  25. 25. Apache Spark and Amazon EMR at Salesforce abida@salesforce.com @anyabida1 Anya Bida, Senior Member of Technical Staff at Salesforce
  26. 26. Fastest Growing Top 5 Enterprise Software Company $5.4BFY15 $4.1BFY14 $3.1BFY13 $6.7BFY16 $2.3BFY12 $1.7BFY11 $2.56BFY18Q2 revenue $8.4BFY17 revenue 2009 • 2010 • 2011 2012 • 2013 • 2014 2015 • 2016 • 2017 September 2016 2011 • 2012 • 2013 2014 • 2015 • 2016 • 2017 The world’s most innovative companies “Innovator of the Decade”
  27. 27. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use AWS Identity and Access Management (IAM) roles Isolate Environments
  28. 28. Complete ML Pipeline ETL Feature Engineering Model Training Model Evaluation Deploy & Operationalize Models Score & Update Models Support Batch & Real Time
  29. 29. Selecting a tool ETL Feature Engineering Model Training Model Evaluation Deploy & Operationalize Models Score & Update Models Support Batch & Real Time Supports each step of our ML pipeline Scales for small & large jobs Good ML Libraries Active user base Ability to deploy production ready code
  30. 30. Selecting a tool ETL Feature Engineering Model Training Model Evaluation Deploy & Operationalize Models Score & Update Models Support Batch & Real Time Supports each step of our ML pipeline Scales for small & large jobs Good ML Libraries Active user base Ability to deploy production ready code
  31. 31. We wanted Spark…now how to deploy it? EC2 • Support for batch / streaming • Integrates with our tooling • Spin up / down clusters • Larger / smaller clusters • Support for different versions of Hadoop, Spark • Storage & compute options
  32. 32. We wanted Spark…now how to deploy it? EC2 • Support for batch / streaming • Integrates with our tooling • Spin up / down clusters • Larger / smaller clusters • Support for different versions of Hadoop, Spark • Storage & Compute options Need: Management
  33. 33. We wanted Spark…now how to deploy it? Need: Management EC2 • Support for batch / streaming • Integrates with our tooling • Spin up / down clusters • Larger / smaller clusters • Support for different versions of Hadoop, Spark • Storage & Compute options
  34. 34. Provision EMR Simplest approach Input bucket Monitoring aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Ganglia --tags --ec2-attributes --release-label --log-uri … --name --instance-groups --region
  35. 35. Provision EMR Simplest approach Input bucket Monitoring aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Ganglia # tag for cost analysis by project --tags --ec2-attributes --release-label # send logs to S3 --log-uri … # use naming conventions for service discovery # <region>-<project>-<version>-<env> --name # CORE nodes used for writes to HDFS # TASK nodes used for compute - try spot instances here for starters --instance-groups --region
  36. 36. Simplest approach Input bucket EMR cluster Output bucket Logs bucket
  37. 37. Spark primer Apache Spark
  38. 38. Spark Primer Apache Spark Driver Program SparkContext Node Executor Cache TaskTask Cluster Manager Node Executor Cache TaskTask
  39. 39. Spark on Amazon EMR Apache Spark Master Node Core Node Cluster Manager ResourceManager NameNode NodeManager Datanode Task Node NodeManagerNodeManager
  40. 40. Spark on Amazon EMR Apache Spark Master Node Core Node Executor Cache TaskTask Cluster Manager ResourceManager NameNode Task Node Executor Cache TaskTask Driver Program SparkContext Executor Cache TaskTask NodeManager Datanode NodeManagerNodeManager
  41. 41. Properties related to Dynamic Allocation Property Value Spark.dynamicAllocation.enabled true Spark.shuffle.service.enabled true spark.dynamicAllocation.minExecutors 5 spark.dynamicAllocation.maxExecutors 17 spark.dynamicAllocation.initalExecutors 0 sparkdynamicAllocation.executorIdleTime 60s spark.dynamicAllocation.schedulerBacklogTimeout 5s spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s Optional
  42. 42. Where do I find Metrics? Ganglia windowing, dashboarding Cloudwatch YARNMemoryAvailablePercentage Logs?
  43. 43. Anatomy of a Spark Job High Performance Spark, Karau & Warren, O’Reilly Spark Context/Spark Session Object Actions (e.g., collect, saveAsTextFile) Wide transformations; (sort, groupByKey) Computation to evaluate one partition (combine narrow transforms) Spark Application Job Stage Stage Task Task
  44. 44. Simplest approach Input bucket EMR cluster Output bucket Reads from S3 • Jar files too! Write Intermediate files • MEM or Disk? • Local? HDFS? Amazon S3? Writes to S3 • Data available after cluster is terminated
  45. 45. Cache Persist Checkpoint Local Checkpoint local mem cache MEM MEM MEM local disk DISK DISK HDFS / S3 Specify dir If exec is decommed, are writes available? No No Yes No If job finishes are writes available? No No Yes No Preserve lineage graph? Yes Yes No No RDD Re-use
  46. 46. Cache Persist Checkpoint Local Checkpoint local mem cache MEM MEM MEM local disk DISK DISK HDFS / S3 Specify dir If exec is decommed, are writes available? No No Yes No If job finishes are writes available? No No Yes No Preserve lineage graph? Yes Yes No No RDD Re-use Persist to improve speed, Checkpoint to improve fault tolerance
  47. 47. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  48. 48. Monitor multiple viewpoints https://light.co/camera
  49. 49. Understand resource allocation Understanding Memory Management in Spark For Fun And Profit Shivnath Babu (Duke University, Unravel Data Systems) Mayuresh Kunjir (Duke University) Node Memory Container Memory 8Gb Node Memory Container Memory 8Gb
  50. 50. Can my 8Gb container launch on this cluster? Node Memory Node Memory Node Memory 4Gb used 8Gb total 8Gb
  51. 51. Can my 8Gb container launch on this cluster? Scale-out Rule: Num Containers Pending Node Memory Node Memory Node Memory 4Gb used 8Gb total 8Gb
  52. 52. Monitor multiple viewpoints Connectivity viewer https://www.linkedin.com/in/vaibhavt/ Vaibhav Tandon
  53. 53. Monitor multiple viewpoints Connectivity viewer https://www.linkedin.com/in/vaibhavt/ Vaibhav Tandon
  54. 54. Monitor multiple viewpoints Connectivity viewer https://www.linkedin.com/in/vaibhavt/ Vaibhav Tandon
  55. 55. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  56. 56. Use IAM roles Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role
  57. 57. Use IAM roles Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role • Job has an IAM Role IAM
  58. 58. Use IAM roles Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role • Job has an IAM Role • IAM Roles determine read / write access to data IAM Out Logs IAM In
  59. 59. Use IAM roles Every user, service, & job should have specific, auditable permissions. New: EMRFS fine-grained access control!! Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role • Job has an IAM Role • IAM Roles determine read / write access to data IAM Out Logs IAM aws emr create-cluster … --service-role In
  60. 60. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  61. 61. Isolate environments Need: Build and release? Multitenancy? Cluster Manager Scheduler Out Logs In
  62. 62. Cluster Manager Scheduler virtual private cloud Out Logs In Isolate environments Need: Build and release? Multitenancy?
  63. 63. Cluster Manager Scheduler VPC subnet virtual private cloud VPC subnet Out Logs In Isolate environments Need: Build and release? Multitenancy?
  64. 64. Cluster Manager Scheduler VPC subnet virtual private cloud security group security group security group VPC subnet Out Logs In Isolate environments Need: Build and release? Multitenancy?
  65. 65. Cluster Manager Scheduler VPC subnet virtual private cloud security group security group security group VPC subnet Out Logs In aws emr create-cluster … --ec2-attributes '{ "KeyName":"", "InstanceProfile":"", "ServiceAccessSecurityGroup":"", "SubnetId":"”, "EmrManagedSlaveSecurityGroup":"", "EmrManagedMasterSecurityGroup":""}'}' Isolate environments Need: Build and release? Multitenancy?
  66. 66. Cluster Manager Scheduler In Out Logs VPC subnet virtual private cloud security group IAM security group security group VPC subnet IAM Environment Isolate environments Need: Build and release? Multitenancy?
  67. 67. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Dev Staging Canary Prod Isolate environments Need: Build and release? Multitenancy?
  68. 68. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Dev Staging Canary Prod Automation • Use Cloudformation or Terraform • Upgrades use the same provisioning script + DNS Upsert Isolate environments Need: Build and release? Multitenancy?
  69. 69. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Availability Zone region Dev Staging Canary Prod Isolate environments Need: Build and release? Multitenancy?
  70. 70. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Availability Zone region Dev Staging Canary Prod Availability Zone region Dev Staging Canary Prod Availability Zone Availability Zone Isolate environments Need: Build and release? Multitenancy?
  71. 71. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  72. 72. Did we just automate ourselves out of our jobs? Nope. Now we have time to take on new projects and grow…
  73. 73. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU! J o n a t h a n F r i t z – j o n f r i t z @ a m a z o n . c o m A n y a B i d a – a b i d a @ s a l e s f o r c e . c o m a w s . a m a z o n . c o m / e m r a w s . a m a z o n . c o m / b l o g s / b i g - d a t a a w s . a m a z o n . c o m / b l o g s / a i
  74. 74. Extra slides `spark.history.fs.cleaner.enabled=true`

×