
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR


by Dario Rivera, Solutions Architect, AWS

Amazon EMR is a managed service that lets you process and analyze extremely large data sets using the latest versions of over 15 open-source frameworks in the Apache Hadoop and Spark ecosystems. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features. This session will feature Asurion, a provider of device protection and support services for over 280 million smartphones and other consumer electronics devices.


  1. Apache Spark and the Hadoop Ecosystem on AWS: Getting Started with Amazon EMR. Jonathan Fritz, Sr. Product Manager. March 20, 2017. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pop-up Loft
  2. Agenda • Quick introduction to Spark, Hive on Tez, and Presto • Building data lakes with Amazon EMR and Amazon S3 • Running jobs and security options • Demo • Customer use cases
  3. Quick introduction to Spark, Hive on Tez, and Presto
  4. Spark for fast processing (diagram: a DAG of stages built from map, join, filter, and groupBy transformations over RDD partitions, with cached partitions held in memory) • Massively parallel • Uses DAGs instead of MapReduce for execution • Minimizes I/O by storing data in DataFrames in memory • Partitioning-aware to avoid network-intensive shuffles
  5. Spark components to match your use case
  6. Hive and Tez for batch ETL and SQL
  7. Amazon EMR runs Spark and Tez on YARN • Run the Spark Driver in Client or Cluster mode • Spark and Tez applications run as YARN applications • Spark Executors and Tez Workers run in YARN Containers on NodeManagers in your cluster
  8. Presto: interactive SQL for analytics
  9. Important Presto features • High performance: e.g., Netflix runs 3,500+ Presto queries per day on a 25+ PB dataset in S3, with 350 active platform users • Extensibility: pluggable backends (Hive, Cassandra, JMX, Kafka, MySQL, PostgreSQL, and more); JDBC/ODBC for commercial BI tools or dashboards; HTTP+JSON client protocol with support for many languages (Python, Ruby, PHP, Node.js, Java (JDBC), C#, …) • ANSI SQL: complex queries, joins, aggregations, and various functions, including window functions
  10. On-cluster UIs • Manage applications • SQL editor, workflow designer, metastore browser • Notebooks to design and execute queries and workloads • And more, using bootstrap actions!
  11. Building data lakes with Amazon EMR and Amazon S3
  12. Why Amazon EMR? • Easy to use: launch a cluster in minutes • Low cost: pay an hourly rate • Open-source variety: latest versions of software • Managed: spend less time monitoring • Secure: easy-to-manage options • Flexible: customize the cluster
  13. Create a fully configured cluster with the latest versions of Presto and Spark in minutes, using the AWS Management Console, the AWS Command Line Interface (CLI), or an AWS SDK directly with the Amazon EMR API
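As a sketch, the same launch can be expressed through the EMR API via an SDK. The request below is illustrative only: the cluster name, release label, instance types, and log bucket are placeholder assumptions, and the boto3 call itself is left commented out.

```python
# Hedged sketch of an EMR cluster launch request for boto3's run_job_flow.
# All names and sizes here are placeholders, not recommendations.
request = {
    "Name": "demo-cluster",
    "ReleaseLabel": "emr-5.4.0",  # pick the latest release available to you
    "Applications": [{"Name": "Spark"}, {"Name": "Presto"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m4.large", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m4.large", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,  # long-lived cluster; False for transient
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
    "LogUri": "s3://my-bucket/emr-logs/",  # hypothetical bucket
}

# To actually launch (requires AWS credentials):
# import boto3
# emr = boto3.client("emr", region_name="us-east-1")
# response = emr.run_job_flow(**request)
```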
  14. The Amazon EMR stack (diagram of an Amazon EMR release) • On-cluster tools: Hue (SQL interface/metastore management), Zeppelin (interactive notebook), Ganglia (monitoring), HiveServer2/Spark Thrift Server (JDBC/ODBC) • Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop, HBase/Phoenix, Presto • Execution engines: MapReduce (batch), Tez (interactive), Spark (in-memory), Flink (streaming) • Cluster resource management: YARN • Storage: S3 (EMRFS), HDFS
  15. Decouple compute and storage by using S3 as your data layer (diagram: Amazon EMR clusters reading from and writing to Amazon S3, with intermediates stored in EC2 instance memory, local disk, or HDFS) • S3 is designed for 11 9s of durability and is massively scalable
  16. HBase on S3 for scalable NoSQL
  17. S3 tips: partitions, compression, and file formats • Avoid key names in lexicographical order • Improve throughput and S3 list performance by using hashed/random prefixes or reversing the date-time • Compress data sets to minimize bandwidth from S3 to EC2 • Make sure you use splittable compression, or have each file be the optimal size for parallelization on your cluster • Columnar file formats like Parquet can give increased read performance
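The hashed-prefix tip can be illustrated with a small helper. This is a hypothetical key scheme: the prefix length and key layout are assumptions for illustration, not an EMR or S3 requirement.

```python
import hashlib

def hashed_key(date_str, hour, name):
    """Prefix the natural date-ordered key with a short hash so writes
    spread across S3 key ranges instead of piling onto one
    lexicographic prefix (e.g. everything starting with today's date)."""
    natural = f"{date_str}/{hour:02d}/{name}"
    prefix = hashlib.md5(natural.encode()).hexdigest()[:4]  # 4 hex chars, illustrative
    return f"{prefix}/{natural}"

# Keys for consecutive hours land under unrelated prefixes:
# hashed_key("2017-03-20", 5, "events.gz") and
# hashed_key("2017-03-20", 6, "events.gz") share no common leading prefix.
```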
  18. Many storage layers to choose from: Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon Redshift, Amazon S3
  19. Use RDS/Aurora for an external Hive metastore (diagram: an Amazon Aurora Hive metastore for external tables on S3; set the metastore location in hive-site)
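Setting the metastore location in hive-site might look like the following EMR configuration classification. The Aurora endpoint, database name, and credentials are placeholders you would replace with your own.

```python
# Sketch of an EMR "hive-site" classification pointing the Hive metastore
# at an external MySQL-compatible (Aurora) database.
# Endpoint and credentials below are hypothetical.
hive_site = [{
    "Classification": "hive-site",
    "Properties": {
        "javax.jdo.option.ConnectionURL":
            "jdbc:mysql://my-aurora-endpoint:3306/hive?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "hive",
        "javax.jdo.option.ConnectionPassword": "hive-password",  # placeholder
    },
}]
# Pass this list as the Configurations parameter when creating the cluster.
```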
  20. Use Spot and Reserved Instances to lower costs • Spot for task nodes: up to 80% off EC2 On-Demand pricing; exceed your SLA at lower cost • Reserved Instances for core nodes: standard EC2 pricing for RI capacity; meet your SLA at predictable cost • Amazon EMR supports most EC2 instance types
  21. Instance fleets for advanced Spot provisioning (diagram: master node plus core and task instance fleets) • Provision from a list of instance types with Spot and On-Demand • Launch in the most optimal Availability Zone based on capacity/price • Spot Block support
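A task instance fleet of the kind described might be declared like this inside a run_job_flow request. The instance types, capacities, bid prices, and Spot Block duration are illustrative assumptions.

```python
# Sketch of a TASK instance fleet mixing several instance types on Spot,
# with a fallback to On-Demand if Spot capacity doesn't arrive in time.
task_fleet = {
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 4,  # total weighted capacity to acquire on Spot
    "InstanceTypeConfigs": [
        {"InstanceType": "m4.xlarge", "WeightedCapacity": 1, "BidPrice": "0.10"},
        {"InstanceType": "r4.xlarge", "WeightedCapacity": 1, "BidPrice": "0.12"},
    ],
    "LaunchSpecifications": {
        "SpotSpecification": {
            "TimeoutDurationMinutes": 10,
            "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # or "TERMINATE_CLUSTER"
            "BlockDurationMinutes": 60,  # optional Spot Block
        }
    },
}
# This dict would go in the Instances["InstanceFleets"] list of the request.
```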
  22. Lower costs with Auto Scaling
  23. Running jobs and security options
  24. YARN schedulers: CapacityScheduler • The default scheduler specified in Amazon EMR • Queues: a single queue is set by default; create additional queues for workloads based on multitenancy requirements • Capacity guarantees: set minimum guaranteed resources for each queue; free resources are programmatically assigned to queues • Adjust these settings using the capacity-scheduler classification in an EMR configuration object
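A capacity-scheduler classification along these lines could define two queues. The queue names and the 60/40 split are illustrative assumptions, not defaults.

```python
# Sketch of an EMR "capacity-scheduler" classification splitting the
# cluster into two queues. Queue names and percentages are hypothetical.
capacity_scheduler = [{
    "Classification": "capacity-scheduler",
    "Properties": {
        "yarn.scheduler.capacity.root.queues": "etl,adhoc",
        # Guaranteed capacity per queue; values must sum to 100.
        "yarn.scheduler.capacity.root.etl.capacity": "60",
        "yarn.scheduler.capacity.root.adhoc.capacity": "40",
    },
}]
# Submit a job to a specific queue with e.g. spark-submit
# --conf spark.yarn.queue=etl (queue name is the hypothetical one above).
```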
  25. Configuring executors: dynamic allocation • Optimal resource utilization: YARN dynamically creates and shuts down executors based on the resource needs of the Spark application • Spark uses the executor memory and executor cores settings in the configuration for each executor • Amazon EMR enables dynamic allocation by default, and calculates the default executor size from the instance family of your core group
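These settings live in the spark-defaults classification. The executor sizes below are illustrative overrides for the sake of example, not EMR's computed defaults.

```python
# Sketch of a "spark-defaults" classification: dynamic allocation stays on
# (the EMR default), while executor size is overridden with example values.
spark_defaults = [{
    "Classification": "spark-defaults",
    "Properties": {
        "spark.dynamicAllocation.enabled": "true",  # EMR's default behavior
        "spark.executor.cores": "4",                # illustrative override
        "spark.executor.memory": "8g",              # illustrative override
    },
}]
# Like the other classifications, this goes in the cluster's
# Configurations list; per-app overrides can also be passed to spark-submit.
```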
  26. Options to submit jobs – on cluster • Web UIs: Hue SQL editor, Zeppelin notebooks, R Studio, Airpal, and more • Connect with ODBC/JDBC to HiveServer2, Spark Thrift Server, or Presto • Use Hive and Spark actions in your Apache Oozie workflow to create DAGs of jobs • Or use the native APIs and CLIs for each application
  27. Options to submit jobs – off cluster • Amazon EMR Step API: submit a Hive or Spark application • AWS Data Pipeline, or Airflow, Luigi, and other schedulers on EC2: create a pipeline to schedule job submission, or create complex workflows • AWS Lambda: submit applications to the EMR Step API, or directly to Hive or Spark on your cluster
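A Lambda-submitted step might be shaped like this. The script location and cluster ID are placeholders, and the actual API call is left commented out as a sketch.

```python
# Sketch of an EMR step that runs spark-submit via command-runner.jar.
# The S3 script path below is a hypothetical placeholder.
step = {
    "Name": "spark-app",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit", "--deploy-mode", "cluster",
            "s3://my-bucket/jobs/etl.py",  # placeholder application
        ],
    },
}

# Inside a Lambda handler you might submit it like this (placeholder cluster ID):
# import boto3
# boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```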
  28. Security: configuring VPC subnets • Use Amazon S3 endpoints in your VPC for connectivity to S3 • Use managed NAT for connectivity to other services or the Internet • Control traffic using security groups: ElasticMapReduce-Master-Private, ElasticMapReduce-Slave-Private, ElasticMapReduce-ServiceAccess
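Creating the S3 endpoint for a private subnet could be sketched as follows. The VPC ID, route table ID, and region are placeholders.

```python
# Sketch of the parameters for an S3 gateway VPC endpoint, so EMR nodes in
# private subnets reach S3 without traversing a NAT. IDs are placeholders.
endpoint_params = {
    "VpcId": "vpc-12345678",                      # placeholder VPC
    "ServiceName": "com.amazonaws.us-east-1.s3",  # region is an assumption
    "RouteTableIds": ["rtb-12345678"],            # placeholder route table
}

# To create it (requires AWS credentials):
# import boto3
# boto3.client("ec2").create_vpc_endpoint(**endpoint_params)
```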
  29. IAM roles: managed or custom policies for the EMR service role and EC2 role
  30. Encryption: use security configurations
  31. Demo: Zeppelin and Hue
  32. Customer use cases
  33. Real-time bidding pipeline (diagram: impressions, clicks, and activities flow through Amazon Kinesis and S3 into ETL, attribution, and machine-learning stages that learn, calibrate, and evaluate models) • 2 petabytes processed daily • 2 million bid decisions per second • Runs 24x7 on 5 continents • Thousands of ML models trained per day
  34. Data lake: stats (diagram: transient ad hoc/ETL clusters running Hive and transient event clusters running Spark, with an RDS metastore replicated via custom replication to support Hive and Presto data types) • 50+ ad hoc users • 1,000+ ad hoc queries per day • 4+ data science users • 20+ sources of data • 100+ ETL Hive jobs • 25+ Spark jobs • 2+ PB of data
  35. FINRA saved 60% by moving to HBase on EMR
  36. Netflix uses Presto on Amazon EMR with a 25 PB dataset in Amazon S3. Full Presentation:
  37. SmartNews uses Presto as a reporting front-end. AWS Big Data Blog: Built-a-Lambda-Architecture-on-AWS-to-Analyze-Customer-Behavior-an Thank you!