
Autoscaling Spark on AWS EC2 - 11th Spark London meetup

Autoscaling Spark for Fun and Profit
11th Spark Meetup London



  1. Autoscaling Spark for Fun and Profit. Rafal Kwasny, 11th Spark London Meetup, 2015-11-26
  2. Who am I • DevOps • Built a few platforms in my life • mostly adtech and in-game analytics for Sony PlayStation • currently advising investment banks • CTO, Entropy Investments
  3. How do you run Spark? • Who runs on AWS? • Who uses EMR?
  4. So how do you use autoscaling on AWS?
  5. Overview • Typical architecture for AWS • How autoscaling works • Scripts to make your life easier
  6. Typical architecture for AWS
  7. Typical architecture for AWS: generate some data
  8. Typical architecture for AWS: store it in S3
  9. Typical architecture for AWS: or store it in a message queue
  10. Typical architecture for AWS: use your favourite tool for ETL
  11. Typical architecture for AWS: ship it back to S3
  12. Typical architecture for AWS: or send it somewhere else
  13. Typical architecture for AWS: run Spark via EMR, spark-ec2, or a cluster built from scratch (a minimal PySpark sketch of the S3-in/S3-out pattern follows)
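     As a hedged sketch of the S3-in/S3-out pattern above: a minimal PySpark job (bucket names, paths, and the CSV event format are hypothetical) that reads raw events from S3, aggregates them, and ships the result back to S3.

        from pyspark import SparkConf, SparkContext

        # Minimal ETL sketch: read raw events from S3, aggregate, write back.
        sc = SparkContext(conf=SparkConf().setAppName("s3-etl-sketch"))

        events = sc.textFile("s3a://my-raw-bucket/events/2015-11-26/*")
        counts = (events
                  .map(lambda line: (line.split(",")[0], 1))  # key on first CSV field
                  .reduceByKey(lambda a, b: a + b))
        counts.saveAsTextFile("s3a://my-output-bucket/daily-counts/2015-11-26")

     The same shape works with a message queue as the source; only the input stage changes.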
  14. "MapReduce is about quickly writing very inefficient code and then running it at massive scale." (c) someone
  15. Problem • EC2 is a pay-for-what-you-use model • but you still have to decide how many resources you want before starting a cluster
  16. Problem: the most common problems when running on EC2. Scaling up • My team needs a new cluster; how big should it be? Scaling down • Did I shut down the DEV cluster before leaving the office on Friday evening?
  17. How to automate scaling?
  18. Types of scaling. Vertical scaling: "Let's get a bigger box" • Change instance type • Change EBS parameters. Horizontal scaling: "Just add more nodes" (a vertical-resize sketch in boto3 follows)
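     A hedged sketch of the vertical path with boto3 (instance ID, region, and target type are placeholders); note that on EC2 an instance must be stopped before its type can be changed.

        import boto3

        ec2 = boto3.client("ec2", region_name="eu-west-1")
        instance_id = "i-0123456789abcdef0"  # hypothetical

        # Resize in place: stop, change type, start again.
        ec2.stop_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
        ec2.modify_instance_attribute(InstanceId=instance_id,
                                      InstanceType={"Value": "r3.2xlarge"})
        ec2.start_instances(InstanceIds=[instance_id])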
  19. Autoscaling • Automatic resizing based on demand • Define minimum/maximum instance count • Define when scaling should occur • Use metrics • Run your jobs and don't worry about infrastructure (a scaling-policy sketch follows)
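     One way to express "scale when a metric crosses a threshold" with boto3; group and policy names are hypothetical (the group itself is defined a few slides later), and 70% CPU over 10 minutes is just an illustrative trigger.

        import boto3

        autoscaling = boto3.client("autoscaling", region_name="eu-west-1")
        cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

        # Add two workers each time the alarm fires.
        policy = autoscaling.put_scaling_policy(
            AutoScalingGroupName="spark-workers",
            PolicyName="scale-up-on-cpu",
            AdjustmentType="ChangeInCapacity",
            ScalingAdjustment=2,
            Cooldown=300)

        # Fire when average worker CPU stays above 70% for two 5-minute periods.
        cloudwatch.put_metric_alarm(
            AlarmName="spark-workers-high-cpu",
            Namespace="AWS/EC2",
            MetricName="CPUUtilization",
            Statistic="Average",
            Period=300,
            EvaluationPeriods=2,
            Threshold=70.0,
            ComparisonOperator="GreaterThanThreshold",
            Dimensions=[{"Name": "AutoScalingGroupName", "Value": "spark-workers"}],
            AlarmActions=[policy["PolicyARN"]])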
  20. Architecture with autoscaling
  21. Using RAM/local SSDs for caching; only saving output to S3 (PySpark sketch below)
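     A sketch of that caching pattern in PySpark (SSD mount points and bucket names are hypothetical): point spark.local.dir at the instance-store SSDs, keep the working set in RAM with disk spill, and persist only the final output to S3.

        from pyspark import SparkConf, SparkContext, StorageLevel

        conf = (SparkConf()
                .setAppName("cache-on-ssd")
                .set("spark.local.dir", "/mnt/ssd0,/mnt/ssd1"))  # ephemeral SSDs
        sc = SparkContext(conf=conf)

        data = sc.textFile("s3a://my-raw-bucket/events/*")
        data.persist(StorageLevel.MEMORY_AND_DISK)  # RAM first, spill to local SSD

        errors = data.filter(lambda line: "ERROR" in line)
        errors.saveAsTextFile("s3a://my-output-bucket/errors")  # only output hits S3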
  22. Fault recovery
  23. Autoscaling components • AMI: machine image with Spark installed • Launch configuration, which defines: AMI, instance type, instance storage, public IP, security groups (boto3 sketch below)
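     The same launch-configuration fields expressed with boto3; the AMI ID, security group, and names are placeholders.

        import boto3

        autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

        autoscaling.create_launch_configuration(
            LaunchConfigurationName="spark-worker-lc",
            ImageId="ami-0123abcd",            # AMI with Spark pre-installed
            InstanceType="r3.xlarge",
            AssociatePublicIpAddress=True,
            SecurityGroups=["sg-0123abcd"],
            BlockDeviceMappings=[{             # instance storage for shuffle/cache
                "DeviceName": "/dev/sdb",
                "VirtualName": "ephemeral0"}])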
  24. Autoscaling components • Autoscaling group, which ties together: launch configuration, availability zones, VPC details, min/max servers, when to scale, metrics/health checks (boto3 sketch below)
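     And the matching autoscaling group, again as a hedged boto3 sketch; subnets, zones, and sizes are illustrative.

        import boto3

        autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

        autoscaling.create_auto_scaling_group(
            AutoScalingGroupName="spark-workers",
            LaunchConfigurationName="spark-worker-lc",
            MinSize=2,
            MaxSize=20,
            DesiredCapacity=2,
            AvailabilityZones=["eu-west-1a", "eu-west-1b"],
            VPCZoneIdentifier="subnet-0123abcd,subnet-4567efgh",
            HealthCheckType="EC2",
            HealthCheckGracePeriod=300)

     The scaling policy and alarm from slide 19 would then be attached to this group.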
  25. Putting it all together: then you can run your job (sketch below)
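     Once the group is up, running a job is just pointing Spark at the cluster; a sketch assuming a standalone master whose hostname (hypothetical here) you know.

        from pyspark import SparkConf, SparkContext

        conf = (SparkConf()
                .setMaster("spark://spark-master.internal:7077")  # hypothetical host
                .setAppName("autoscaled-job"))
        sc = SparkContext(conf=conf)

        print(sc.parallelize(range(1000)).sum())  # trivial smoke-test job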
  26. Complicated? • AWS provides a lot of services
  27. spark-cloud • Better scripts to start Spark clusters on EC2 • Alpha version • https://github.com/entropyltd/spark-cloud
  28. What's inside spark-cloud: building AMIs with Packer. Packer is a tool for creating machine and container images for multiple platforms from a single source configuration. It supports AWS, DigitalOcean, Docker, OpenStack, Parallels, QEMU, VirtualBox, and VMware. (a minimal template sketch follows)
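     A minimal Packer template sketch for baking such an AMI, in Packer's own JSON format; the source AMI, region, and install script are hypothetical.

        {
          "builders": [{
            "type": "amazon-ebs",
            "region": "eu-west-1",
            "source_ami": "ami-0123abcd",
            "instance_type": "m3.medium",
            "ssh_username": "ubuntu",
            "ami_name": "spark-worker-{{timestamp}}"
          }],
          "provisioners": [{
            "type": "shell",
            "script": "install_spark.sh"
          }]
        }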
  29. Current functionality • Start cluster • Shut down cluster • More to come :)
  30. Spot instances
  31. Spot instances • On-demand: $1.40/hour • Spot: $0.15/hour • 89% cheaper (spot launch-configuration sketch below)
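     To capture that discount in the autoscaling setup, the launch configuration can carry a spot bid; a boto3 sketch with illustrative price and names.

        import boto3

        autoscaling = boto3.client("autoscaling", region_name="eu-west-1")

        # Same as the on-demand launch configuration, but bidding on spot capacity.
        # Instances are reclaimed if the spot price rises above the bid.
        autoscaling.create_launch_configuration(
            LaunchConfigurationName="spark-worker-spot-lc",
            ImageId="ami-0123abcd",
            InstanceType="r3.xlarge",
            SpotPrice="0.15",        # max hourly bid in USD
            SecurityGroups=["sg-0123abcd"])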
  32. Summary • Spark plus EC2 is a very common combination • because it makes your life easier • and cheaper • the spark-cloud script will help you • so you can just worry about writing good Spark code!
  33. Thank you! rafal@entropy.be
  34. Amazon S3 tips • Don't use s3n:// • Use s3a:// with Hadoop 2.6: parallel rename (especially important for committing output), IAM authentication support, no "xyz_$folder$" files, input seek, multipart upload (no 5 GB limit), error recovery and retry • More info: https://issues.apache.org/jira/browse/HADOOP-10400 (configuration sketch below)
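     A sketch of wiring up s3a in a PySpark job on Hadoop 2.6; the property names are the standard s3a keys, and the credentials are shown inline only for illustration (with IAM instance roles they can be omitted).

        from pyspark import SparkConf, SparkContext

        sc = SparkContext(conf=SparkConf().setAppName("s3a-demo"))
        hadoop_conf = sc._jsc.hadoopConfiguration()

        # Naming the s3a implementation explicitly is harmless and avoids
        # classpath surprises on some Spark/Hadoop 2.6 builds.
        hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
        hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # or use IAM roles
        hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

        print(sc.textFile("s3a://my-bucket/input/*").count())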
  35. Why not EMR? • Why pay for EMR? The EMR fee can cost more than the spot instance itself • vendor lock-in and proprietary libraries • netlib-java
