AWS guerrilla orchestration

This document summarizes the key aspects of AWS orchestration for a company processing 6 billion requests per day across multiple services and regions. It covers 30 auto-scaling groups (ASGs) spanning up to 750 servers in 2 regions and 5 availability zones, serving more than 20 services and over 20 API integrations. It also describes a homemade Redis autoscaling setup that uses master-slave replication across regions on spot and on-demand instances and handles 1.2 million operations per second at peak. An event-driven architecture is implemented as a ØMQ mesh pipeline across ASGs, with a unidirectional data flow of 100k messages per second at peak.

KEY FIGURES
• 100k qps peak
• 6B requests/day
• 8 TB/day
• More than 20 services
• 750 servers peak
• 2 Regions (eu-west-1 and us-east-1)
• 5 AZs
• 30 ASGs
• Over 20 API integrations (Google, Twitter, Facebook, AppNexus, eBay, …)
What do we do (engineer's perspective)
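As a quick sanity check on the key figures, the arithmetic below derives the average load implied by the daily totals. It assumes decimal terabytes and that the 8 TB/day refers to request traffic, which the slide does not state; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = 6_000_000_000   # "6B requests/day"
bytes_per_day = 8 * 10**12         # "8 TB/day", assuming decimal terabytes
peak_qps = 100_000                 # "100k qps peak"

# Average request rate implied by the daily total.
average_qps = requests_per_day / SECONDS_PER_DAY

# How much higher the peak is than the daily average.
peak_to_average = peak_qps / average_qps

# Average payload size, assuming the 8 TB/day is request traffic.
bytes_per_request = bytes_per_day / requests_per_day

# Average sustained throughput in megabytes per second.
average_throughput_mb_s = bytes_per_day / SECONDS_PER_DAY / 10**6

print(f"average load:       {average_qps:,.0f} requests/s")
print(f"peak / average:     {peak_to_average:.2f}x")
print(f"bytes per request:  {bytes_per_request:,.0f}")
print(f"average throughput: {average_throughput_mb_s:.1f} MB/s")
```

Under these assumptions the peak is well under twice the average, so autoscaling here is less about absorbing huge spikes than about following a daily curve cheaply.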

WHY
• Cross region replication (EU and US)
• Cost (leverage spots)
• HyperLogLog
• Writable slaves (for set operations)
• Centralized monitoring and logging
Homemade Redis autoscaling

Our Redis structure
HOW
• Master EU
• Slaves EU (2-8)
• Replication over VPN
• Master US
• Slaves US (4-21)
• 1.2M ops/s peak

WE CARE ABOUT THE COSTS
• Two ASGs per region (on demand + spot)
• Slaves only in ASG
• Spot scales more aggressively
• All ASGs in one region behind same ELB
• ELB with TCP load balancing
• Jenkins job that monitors for a crash of the spot market
Deployment strategy
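A minimal sketch of how "spot scales more aggressively" could be expressed, assuming simple CPU-threshold policies. The slides give no thresholds, step sizes, or ASG names, so every number and name below is hypothetical, and the `spot_market_crashed` check is only one guess at what the Jenkins job might test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    name: str
    scale_out_cpu: float  # add instances above this average CPU %
    scale_in_cpu: float   # remove instances below this average CPU %
    step: int             # instances added or removed per adjustment


# Hypothetical values, chosen only so that spot reacts earlier and in bigger steps.
ON_DEMAND = ScalingPolicy("redis-slaves-on-demand", scale_out_cpu=75.0, scale_in_cpu=30.0, step=1)
SPOT = ScalingPolicy("redis-slaves-spot", scale_out_cpu=55.0, scale_in_cpu=20.0, step=3)


def adjustment(policy, avg_cpu):
    """Return how many instances to add (positive) or remove (negative)."""
    if avg_cpu > policy.scale_out_cpu:
        return policy.step
    if avg_cpu < policy.scale_in_cpu:
        return -policy.step
    return 0


def spot_market_crashed(spot_price, on_demand_price, running_spot, wanted_spot):
    """Hypothetical periodic check: spot is no longer cheaper, or capacity is missing."""
    return spot_price >= on_demand_price or running_spot < wanted_spot


for cpu in (25.0, 60.0, 80.0):
    print(cpu, adjustment(ON_DEMAND, cpu), adjustment(SPOT, cpu))
```

With both groups behind the same ELB, the cheaper spot group takes most of the growth, while the on-demand group keeps a floor of capacity if spot instances disappear.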

WHY NOT
• 1-2 ms penalty per request
• Long-lasting connections
• New machines don’t do anything
• Cross-AZ requests add more latency
• Doesn’t consider replication
Going through ELB

DISTRIBUTED REDIS CONNECTION BALANCING
• If anything fails, fall back to the ELB
• Get the AZ of the current host from the AWS instance metadata service
• Get the Redis instances from the ELB
• Use instances from the same AZ, and fall back to other AZs
• Use only running, healthy and replicated instances
• Check the current number of connected clients and ops on each selected Redis
• Pick a Redis based on a biased distribution and connect to it
Sneak behind ELB
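The steps above can be sketched as a single selection function. This is a sketch under stated assumptions, not the presenter's implementation: the `RedisNode` type, the weight formula, and the way the node list is gathered are all hypothetical. In practice the local AZ could be read from the instance metadata path `/latest/meta-data/placement/availability-zone`, and the per-node load from Redis `INFO` fields such as `connected_clients`, `instantaneous_ops_per_sec` and `master_link_status`.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RedisNode:
    host: str
    az: str
    healthy: bool      # running and passing health checks
    replicated: bool   # replication link to the master is up
    clients: int       # currently connected clients
    ops_per_sec: int   # current operations per second


def pick_redis(nodes, local_az, elb_endpoint, rng=random):
    """Pick a Redis host for this client, falling back to the ELB on any failure."""
    try:
        # Use only running, healthy and replicated instances.
        usable = [n for n in nodes if n.healthy and n.replicated]
        # Prefer the local AZ, otherwise fall back to the other AZs.
        candidates = [n for n in usable if n.az == local_az] or usable
        if not candidates:
            return elb_endpoint
        # Biased distribution: less loaded nodes get a higher weight.
        # The formula is an assumption; the slides do not specify one.
        weights = [1.0 / (1 + n.clients + n.ops_per_sec / 1000) for n in candidates]
        return rng.choices(candidates, weights=weights, k=1)[0].host
    except Exception:
        # If anything fails, fall back to the ELB.
        return elb_endpoint


nodes = [
    RedisNode("10.0.1.10", "eu-west-1a", True, True, clients=40, ops_per_sec=9000),
    RedisNode("10.0.1.11", "eu-west-1a", True, False, clients=0, ops_per_sec=0),
    RedisNode("10.0.2.10", "eu-west-1b", True, True, clients=5, ops_per_sec=1000),
]
print(pick_redis(nodes, "eu-west-1a", "redis-elb.internal"))
```

A weighted random pick, rather than always choosing the least loaded node, avoids every client that starts at the same moment piling onto the same freshly launched instance.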

OUR PIPELINE
• Unidirectional data flow
• Multiple ASG service layers
• Machines come and go all the time
• CPU based scaling
• 6 billion messages per day
• 100k messages per second at peak time
Event driven architecture

ØMQ
WHY
• Connect your code in any language, on any platform
• Carries messages across inproc, IPC, TCP, TIPC, multicast
• Smart patterns like request-reply, pub-sub, push-pull and router-dealer
• High-speed asynchronous I/O engines, in a tiny library
• Build any architecture: centralized, distributed, small or large
• Smart handling of establishing connections and reconnecting

HOW
• Define your network topology using subnets in the VPC
• By assigning subnets to ASGs, you know where a service will potentially reside
• HINT: Don’t make a mess, there are enough subnets to spare
Placement to the rescue

CONNECT TO WHERE THE SERVER WILL BE
Mesh (or was it mess) architecture
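One reading of "connect to where the server will be" is that an upstream service connects to every address its downstream ASG could ever occupy, and relies on ØMQ to establish and re-establish connections as machines appear. The sketch below only enumerates those candidate endpoints; the CIDR block and port are hypothetical. It skips the first four addresses and the last address of the subnet, which AWS reserves in every VPC subnet.

```python
import ipaddress


def candidate_endpoints(subnet_cidr, port):
    """List the tcp:// endpoints an instance in this subnet could listen on."""
    network = ipaddress.ip_network(subnet_cidr)
    # AWS reserves the first four addresses and the last one in each subnet,
    # so instances can only receive the addresses in between.
    addresses = list(network)[4:-1]
    return [f"tcp://{address}:{port}" for address in addresses]


# Hypothetical subnet dedicated to one downstream ASG.
endpoints = candidate_endpoints("10.0.16.0/28", 5555)
print(len(endpoints), endpoints[0], endpoints[-1])
```

Each endpoint would then be passed to the connect call of a ØMQ socket (for example a PUSH socket in a push-pull pipeline). Connecting does not require the peer to exist yet, but by default messages may be queued for peers that are not up; the `ZMQ_IMMEDIATE` socket option restricts queuing to completed connections and is worth evaluating for this pattern.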

EC2 API based auto-discovery
• No maintenance
• Handles health checks
• Matches your deployment perfectly
• Adapts to changes fast
• Has unknown/unscalable API limit :(
ØMQ mesh pipeline
• Quick and dirty to set up
• Small and fast
• Queue is local to machines
• Limited scale (we tested up to 762 servers)
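For the EC2 API side of this comparison, a minimal sketch of discovery with a cache in front of it, since the slide names the API limit as the main drawback. The parsing follows the shape of an EC2 `DescribeInstances` response; the `CachedDiscovery` class, its TTL, and the `fetch` callable are assumptions. With boto3, `fetch` could wrap `describe_instances` filtered on the `tag:aws:autoscaling:groupName` tag, but nothing below calls AWS.

```python
import time


def running_private_ips(response):
    """Extract private IPs of running instances from a DescribeInstances-shaped dict."""
    ips = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            is_running = instance.get("State", {}).get("Name") == "running"
            if is_running and "PrivateIpAddress" in instance:
                ips.append(instance["PrivateIpAddress"])
    return sorted(ips)


class CachedDiscovery:
    """Hypothetical wrapper that calls the API at most once per TTL window."""

    def __init__(self, fetch, ttl_seconds=30.0, clock=time.monotonic):
        self._fetch = fetch          # callable returning a DescribeInstances-shaped dict
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached = None
        self._fetched_at = 0.0

    def peers(self):
        now = self._clock()
        if self._cached is None or now - self._fetched_at >= self._ttl:
            self._cached = running_private_ips(self._fetch())
            self._fetched_at = now
        return self._cached


sample = {"Reservations": [{"Instances": [
    {"State": {"Name": "running"}, "PrivateIpAddress": "10.0.1.7"},
    {"State": {"Name": "terminated"}, "PrivateIpAddress": "10.0.1.8"},
    {"State": {"Name": "running"}, "PrivateIpAddress": "10.0.1.5"},
]}]}
print(running_private_ips(sample))
```

The trade-off is the one the slide implies: a longer TTL means fewer API calls per machine but slower reaction to instances coming and going.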

[Figure: Client Redis connection lifecycle]

Thank you for your attention. @utvara
