Managed Cluster Services

  1. IPPON 2020 Accelerate Big Data Analytics with Managed Cluster Services
  2. Introduction. Data engineer at Ippon Technologies, a boutique consulting firm specializing in Data, Cloud, DevOps, and Full Stack Application Development. About Ippon’s Data Practice: ★ deep expertise across all major cloud platforms (AWS, Azure, GCP) ★ data-initiative architecture consulting, from scoping through project delivery starting at POC ★ robust client portfolio ★ led by Peter Choe. We’re hiring! About me: Sam Portillo, Data Engineer with hands-on experience with AWS and Azure across multiple data projects, currently working with Cassandra (NoSQL) and AWS compute and analytics services. Fun fact: my team won first place at the 2020 Virginia Governor's Datathon. I also organize the Richmond (VA) Data Engineering Meetup.
  3. Agenda. 1. Introduction 2. Architecture and Design 3. Contrast with Similar Services 4. Lessons Learned 5. Q&A
  4. What is a cluster?
  5. Background on Big Data Computing. ❖ Pre-1994 - mainframe era ❖ 1994 to 2010 - cluster era ➢ Identical machines connected to the same network to run workloads ➢ Price varies from 5-6 figures for hardware alone ❖ 2000s to present - GPU era ➢ GPU processing accelerates specific types of problems (ML training, gaming graphics, etc.) ➢ Hardware can also vary from 5-6 figures ❖ 2010s to present - Cloud era ➢ Introduces pay-as-you-go IaaS and PaaS with which firms can run Big Data applications without buying/managing hardware
  6. Benefits of IaaS. ❖ No longer need to buy/manage hardware ❖ “Wait minutes not months” ❖ Pay-as-you-go beats financing hardware in most cases ❖ Scale without physical hardware limits ❖ Caveat: the user still manages the OS, middleware, and applications
  7. Benefits of PaaS. ❖ Cloud provider manages the OS, database, and development tools ❖ Can be cheaper than IaaS ❖ Faster development time because software comes preconfigured ❖ Caveat: the user still manages applications, data, etc.
  8. IaaS, PaaS, SaaS Overview.
  9. Managed Cluster Services. ❖ PaaS offering that allows cloud customers to easily run Big Data workloads ❖ Support for open source tools like Spark and Hadoop ❖ Examples include ➢ AWS Elastic MapReduce ➢ Azure HDInsight ➢ Google Cloud Dataproc
  10. Core Benefits of Managed Cluster Services. ❖ Cost savings associated with PaaS ❖ Ease of use for developers ❖ Integration with other cloud services ❖ Elastic ➢ Scales up and down as needed ❖ Flexibility ➢ Users can customize environments to solve a variety of problems
  11. Use Cases. ❖ ETL Pipelines ➢ Build reliable data pipelines with Spark ❖ Big data migration ➢ Utilize the power of a cluster to transfer data in a distributed way ❖ Interactive analytics ➢ Quickly ingest terabytes of data to do initial analysis ❖ Machine learning ➢ Train ML models with Tensorflow or Spark MLlib
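To make the ETL use case above concrete, here is a toy sketch of the extract-transform-load pattern in plain Python. The data, function names, and in-memory "warehouse" are all hypothetical illustrations; a real pipeline on a managed cluster would express the same three steps with Spark's DataFrame API, reading from and writing to S3 or HDFS, with the work partitioned across core nodes.

```python
import csv
import io

# Hypothetical raw input; a cluster job would read this from S3/HDFS.
RAW = """user,amount
alice,10
bob,25
alice,5
"""

def extract(text):
    # Extract: parse CSV rows into dictionaries.
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    # Transform: aggregate amount per user (what Spark's groupBy + sum does,
    # except distributed across the cluster).
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + int(row["amount"])
    return totals

def load(totals, sink):
    # Load: write results to a sink (a dict here; a data store in practice).
    sink.update(totals)
    return sink

warehouse = {}
load(transform(extract(RAW)), warehouse)
print(warehouse)  # {'alice': 15, 'bob': 25}
```

The point is only the shape of the workload: each stage is independent, which is exactly what lets a cluster parallelize the extract and transform steps across nodes.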
  12. Takeaway. ❖ Managed cluster services are versatile and can be used for many different workloads ❖ They come with the benefits cloud services are typically known for ❖ A good understanding of architecture principles will help an engineer support a variety of projects
  13. Architecture and Design
  14. Architecture in AWS EMR. ❖ Each node is an EC2 instance ❖ Master node ➢ Responsible for coordinating applications among the cluster ➢ Runs driver component for Spark apps ➢ SSH access ❖ Core nodes ➢ Do heavy lifting for applications ➢ Store data for Hadoop Distributed File System
  15. How distributed computing speeds up workloads. ❖ Workloads get distributed among a cluster ❖ Hadoop MapReduce ➢ Heavily relies on disk; stores data back to HDFS after each operation
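The MapReduce model above can be sketched in a few lines of plain Python, using the animal-counting example from the speaker notes. This is only a single-process illustration of the phases; in real Hadoop MapReduce the mappers and reducers run on different nodes, and the output of each phase is written back to HDFS (the disk reliance noted above).

```python
from collections import defaultdict
from itertools import chain

# Input split into "lines", the way HDFS splits a file into blocks.
lines = ["cat dog cat", "dog dog bird", "cat bird"]

def mapper(line):
    # Map phase: emit (animal, 1) for every occurrence in one line.
    return [(animal, 1) for animal in line.split()]

def shuffle(pairs):
    # Shuffle phase: group values by key across all mapper outputs.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: collapse each animal's list of 1s into a count.
    return key, sum(values)

mapped = chain.from_iterable(mapper(line) for line in lines)
counts = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'cat': 3, 'dog': 3, 'bird': 2}
```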
  16. How distributed computing speeds up workloads. ❖ Apache Spark uses a directed acyclic graph (DAG) to plan workloads ❖ Advantages over MapReduce: ➢ Utilizes RAM and caches datasets between map-reduce jobs ➢ Doesn’t need to write to disk between map-reduce jobs ➢ Rich API that allows for dataset transformations ➢ Potentially 100 times faster than Hadoop MapReduce.
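The DAG idea above can be illustrated with a toy Python sketch; this is not Spark's actual implementation, just an analogy for how transformations only record a plan while an action triggers execution, and how caching keeps a computed dataset in RAM instead of writing it back to disk.

```python
# Toy sketch of Spark-style lazy evaluation. Transformations (map, filter)
# only extend a recorded plan -- the DAG -- and do no work; the action
# (collect) executes the whole plan, optionally caching the result in memory.
class LazyDataset:
    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan        # recorded transformations, not yet executed
        self._cache = None

    def map(self, fn):           # transformation: extend the plan, no work yet
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, fn):        # transformation: extend the plan, no work yet
        return LazyDataset(self._data, self._plan + (("filter", fn),))

    def cache(self):             # mark result for in-memory reuse
        self._cache = "pending"
        return self

    def collect(self):           # action: actually execute the plan
        if self._cache not in (None, "pending"):
            return self._cache   # served from RAM, no recomputation
        rows = self._data
        for op, fn in self._plan:
            rows = [fn(r) for r in rows] if op == "map" else [r for r in rows if fn(r)]
        if self._cache == "pending":
            self._cache = rows
        return rows

ds = LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).cache()
print(ds.collect())  # [0, 4, 16] -- computed once; later calls reuse the cache
```

Because the full plan is known before anything runs, Spark can also optimize it (pipelining map steps, for example) rather than materializing every intermediate result the way MapReduce does.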
  17. Support for open-source software.
  18. Customizing compute environments. ❖ Support for ➢ Logging ➢ SSH connection ➢ Bootstrap scripts to install custom software/packages ➢ Custom AMIs to achieve anything a bootstrap script can’t
  19. EMR in action.
  20. Contrast with Similar Services
  21. Services to address. ❖ Databricks ❖ ETL tools ❖ Batch services
  22. Databricks. ❖ Databricks is a managed Spark service ➢ Databricks manages compute ➢ Workflow automation and data pipelines ➢ Integrated workspace ❖ By contrast, a plain managed cluster service just runs Spark jobs
  23. ETL Tools. ❖ Can have similar ETL functions, but managed cluster services offer more than ETL ❖ ETL on dedicated ETL tools is generally much easier to configure, but less flexible ❖ AWS Glue works on top of a Spark environment
  24. Batch Services. ❖ Batch services are meant to run any batch computing job at any scale ❖ They may be able to cover some ETL use cases ❖ Not cluster services; they don’t run Spark or Hadoop ❖ Not an environment for interactive analytics
  25. Takeaway. ❖ Decision to use a managed cluster service heavily depends on the workload
  26. Lessons Learned
  27. Things to keep in mind. 1. Tuning can be difficult 2. Adopt best practices early 3. Assess what parts of a workload warrant a cluster service 4. Know your data 5. Explore different options before committing to a solution
  28. Connect with me :D ❖ Sam Portillo ❖ Email:
  29. Q&A

Editor's Notes

  1. Introduce myself and stuff. Talk about rvade maybe, or the cats. Been with Ippon for about a year. Things I like. I’m on the QOMPLX project and use a lot of AWS services and Cassandra. I like Python.
  2. Skim over this but focus on the about me. Offer: Ippon helps companies deliver Data initiatives, from massive workloads (FastData) to massive storage (BigData). Services: architecture consulting from scoping, data engineering industrialization, project delivery from POC. US reference: Ippon is delivering the full Data capability of SwissRe in the US, from real-time analysis to legal long-term storage in a secured Cloud.
  3. Read off slides. Small demo with architecture and design.
  4. Pre 1994 - supercomputer/mainframe; a single big computer. 1994-2010 - Linux got popular, and this started with the Beowulf project. Folks started buying commodity hardware with two network cards and distributing workloads. Present - there’s still benefit in on-prem hardware, which is why the GPU era is still relevant. I know CarMax has on-prem resources for their machine learning workloads. Makes sense when you do the quick math between running a workload in the cloud vs. buying the hardware (not the same as managing an entire data center). Acknowledge the overlap.
  5. Examples of these are Amazon EC2, Azure VMs, and GCP Compute Engine.
  6. AWS Elastic Beanstalk is a PaaS orchestration service offered by Amazon Web Services for deploying applications; it orchestrates various AWS services.
  7. Here’s an overview of IaaS, PaaS, and SaaS. I see new as-a-service acronyms each year, but most cloud services fall into these categories. Most cloud certification courses ask questions about these. A SaaS example would be the Microsoft Office 365 suite.
  8. Tell audience we’re focusing on EMR because I’ve used it on a couple projects within the DP. Not a sales pitch for EMR; I’ve just used it too much.
  9. Ease of use: easy for devs to get started with; the learning curve is pretty low.
  10. Use “emerging problems”. Any questions so far? At its core, EMR is a platform as a service that offers on-demand distributed computing clusters. This service comes with benefits such as scalability and cost savings that AWS services are typically known for. AWS also manages the installation of a variety of popular distributed computing and data frameworks like Spark, Hadoop, Tensorflow, and many more. This makes EMR especially versatile, as engineers can work on almost any type of problem without too much trouble switching contexts. With EMR being a cornerstone for many workloads, it’s an incredibly helpful tool to keep in the toolbox. A strong understanding of its architecture principles, coupled with its variety of popular frameworks, makes it a good choice for emerging problems.
  11. Ignore the “mapr node” thing. I stole this image
  12. Most services are phasing out Hadoop verbiage on their product pages, and Apache retired some Hadoop-related projects recently. Explanation of graphic: say we want to count the occurrences of each animal in this input dataset. In the Map phase, each line gets parsed and all of the animal occurrences are extracted. The output of each step of the Map phase is just a list of each animal with a 1 associated with it. Now that we have all of the animal occurrences by line, we want a count of occurrences across the dataset for each animal. In the Reduce phase, we combine all of the same animals from different mappers and "reduce" them to a single animal and a count. Any questions?
  13. Introducing Spark
  14. Go straight to the console
  15. There’s a lot of similar services out there. I’d like to clear up some differences. Bottom line will be that the service you choose depends on your workload.
  16. Spark tuning is really hard; figure out what needs to be done correctly, otherwise you may end up just pulling a bunch of different levers. Whenever working with a new service or framework, get to know the best practices; you do not want to be three weeks into something when you realize you’re doing something anti-pattern. There are a lot of components in an ETL pipeline, but just because one part needs to be done in Spark doesn’t mean the whole thing does. Know your data is self-explanatory: exhaustively study it to make sure nothing will come in to break your pipeline/business logic. Found hundred-line SQL statements in breach data. Don’t fall into the hype of the latest and greatest service; extensively research your use case. There’s a lot of ways to get something done; make sure you pick the right tool for the job.
  17. If you like to talk about data, tinker with tools/technology/services, OR like to throw it down on the dance floor: connect with me, let’s be friends. I also welcome philosophical debate, whether it’s about material I presented, thoughts on architecture paradigms/strategies, or the future of AWS/data.