Where many business segments quickly succumb to consolidation, the technologies that comprise the Cloud are instead organizing to interoperate.
In this session we’re going to look at ways to orchestrate complex collaborative environments, focusing on operating multi-server / multi-Cloud infrastructures.
Agenda:
Innovation and consolidation
Innovation in the Cloud industry
Microservices
The flipside of microservices
Orchestrating for the microservices ecosystem
Orchestrating for reliability
Disclaimer: I do not own the rights to images/graphs used in this presentation. Graphs on slides 11&12 from @berndruecker in https://www.slideshare.net/BerndRuecker/wjax-2017-microservice-collaboration.
7. The tech industry has a long history of market concentration.
► IBM in mainframe computers
► Microsoft in PC operating systems
► SAP & Oracle in enterprise applications
Is the Cloud Industry already consolidated?
8. Cloud vendors seem to be dedicated to playing together nicely – for now
► Collaboration makes sense to keep R&D costs low.
► Cloud Platforms are becoming powerful integrated systems.
► All major providers support nearly all dev environments (AWS has a growing Microsoft business, 40% of
Azure runs on Linux etc.).
11. Basic idea behind microservices:
Microservices break down software into functional components that interoperate / communicate to create an overall
application.
12. The complexity lies in orchestrating microservices:
Microservices do not live in isolation, their
complexity lies in the large-scale environment or
ecosystem they live in.
Building, standardizing, and maintaining this
infrastructure in a stable, scalable, fault-tolerant,
and reliable way is essential for successful
microservice operation.
14. Microservices ecosystem (four layers, top to bottom):
► Microservices
► Application Platform – DevOps teams, self-service dev tools etc.
► Communication – networks, DNS, RPCs, API endpoints, service discovery / registry, load-balancing etc.
► Hardware – actual machines, servers, physical computers: Amazon EC2, Google Cloud Platform, Microsoft Azure etc., or a private DC.
15. Other risk factors in the Microservice Ecosystem:
► Network failures (EC2 outage etc)
► Security breaches – the fewer the providers, the higher the risk
► Vendor lock-in:
- Vendor lock-in at the service layer (AWS Lambda, IBM Watson)
- Cloud vendors will stop competing on price once they’ve reached critical mass
► Cutbacks on innovation
17. 1/ AVAILABILITY
► Services need to be available locally
► Services need to be available globally / externally
On a public internet that is not smart enough to find the best available path.
On a public internet full of bots (consuming real user traffic) and DDoS attacks.
19. live.cedexis.com
► 15+ billion measurements per day
► 1+ billion daily end-user sessions
► From 50,000+ networks around the world
► Throughput varying by a factor of 10 on a single provider over the course of the day
► ~1,000 outages per CDN per day
Sample measurement feed:
10:21 PM – CDN1 167ms
10:22 PM – CDN2 94ms
10:22 PM – CLOUD 1 128ms
10:23 PM – CLOUD 2 230ms
10:26 PM – DC 1 153ms
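The measurement feed above is exactly the kind of data a traffic-steering decision can be made from. As a minimal sketch (provider names and latencies copied from the sample feed; real steering would of course use a rolling window of samples, not a single reading), picking the currently fastest provider is just a minimum over the latest measurements:

```python
# Latest RUM latency sample per provider, in milliseconds
# (values mirror the sample feed above).
samples = {
    "CDN1": 167,
    "CDN2": 94,
    "CLOUD1": 128,
    "CLOUD2": 230,
    "DC1": 153,
}

# Steer traffic to the provider with the lowest measured latency.
fastest = min(samples, key=samples.get)
# fastest == "CDN2" (94 ms)
```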
21. 2/ SPEED
► Services need to be fast locally
► Services need to be fast globally / externally
On a public internet that is not built to find the fastest path.
22. 3/ WORK ALL THE TIME, UNDER ANY CONDITION
► Services need to work all the time, everywhere – and every service has to be designed to work always
► Services need to work under heavy pressure
When a lot of traffic starts flowing in (need to scale up)
Under attack (DDoS etc)
When services depend on each other (none of them should be a SPOF) etc
24. 1/ LOCAL RELIABILITY
► Provide local fallback / alternative when the main endpoint is slow / unavailable / a source of errors.
Multiple endpoints for each critical microservice
Local Load-Balancing
Local health check monitoring
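The fallback logic above can be sketched in a few lines of Python: probe each endpoint’s health and route to the first healthy one. The endpoint addresses and the simulated health map are illustrative assumptions, not part of the talk.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint whose health probe succeeds."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")

# Simulated health-check results: the primary is down, so traffic
# falls over to the first healthy alternative.
health = {"10.0.0.1": False, "10.0.0.2": True, "10.0.0.3": True}
chosen = pick_endpoint(["10.0.0.1", "10.0.0.2", "10.0.0.3"], health.get)
# chosen == "10.0.0.2"
```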
25. 2/ GLOBAL RELIABILITY
► Orchestrate a multi-homed infrastructure at the global level too
► Use a Global Load-Balancer in order to route traffic away from bottlenecks and outages based on:
Global (external) monitoring (health-checks)
Real end-user monitoring
Load/Error feedback (directly from the server / PoP / region)
Automate your traffic management with a software-defined solution
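One way to sketch such a software-defined routing decision (the scoring formula, region names and metric values are all invented for illustration): filter out regions failing external health checks, then rank the rest by RUM latency weighted by load feedback.

```python
def best_region(regions):
    """Pick a routing target from health, RUM and load signals."""
    healthy = [r for r in regions if r["healthy"]]
    # Lower RUM latency is better; heavy load inflates the score,
    # steering traffic away from near-saturated regions.
    return min(healthy, key=lambda r: r["rum_latency_ms"] * (1 + r["load"]))

regions = [
    {"name": "eu-west",  "healthy": True,  "rum_latency_ms": 94,  "load": 0.80},
    {"name": "us-east",  "healthy": True,  "rum_latency_ms": 128, "load": 0.20},
    {"name": "ap-south", "healthy": False, "rum_latency_ms": 60,  "load": 0.10},
]
target = best_region(regions)["name"]
# eu-west scores 94 * 1.8 = 169.2, us-east scores 128 * 1.2 = 153.6,
# and ap-south is excluded by its failed health check, so traffic
# goes to us-east despite its higher raw latency.
```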
26. 2/ GLOBAL RELIABILITY
Multi-server / Multi-Region / Multi-Cloud / Multi-CDN or hybrid architectures:
• Local Load-Balancing to select the optimum server / instance
• Continuously updated RUM & APM monitoring
• Global health-checks / monitoring
• Software-defined, automated Global Load-balancing
[Diagram: a continuous self-correcting loop combining Real User Monitoring (RUM cloud scoring: availability & latency), App Performance Monitoring (CPU & I/O), local health-checks (data center & application health), local load-balancing (select optimum server) and global, real-time load-balancing (select optimal cloud / cloud region).]
27. Make sure your services are multi-homed & orchestrated so that
they can collaborate together to provide a fast, reliable service - all
the time.
Good morning or good afternoon, depending on where you are in the world! I’m Aude, Cloud Evangelist at Cedexis. In this webinar today we’re going to look at ways to orchestrate complex collaborative environments, focusing on operating multi-server / multi-Cloud infrastructures.
Historically, industries have tended towards consolidation: because innovation was heavily technological, because entering an established market and expanding beyond a few percentage points of market share was too costly, because economies of scale made more sense… Just look at the airline or car manufacturing industries – there are only a handful of actors left!
But if we look beyond technology-heavy industries, there are even worse examples. Have you looked at how many companies are behind your favorite morning cereals or ice cream? I’ll give you a hint – I’d bet they are owned by one of these 10 corporations.
However, in our technology-obsessed era, the usual innovation cycle has been disrupted (we’ve come full circle – technology disrupting itself!). Apple disrupted the music industry with the iPod, and so many other industries afterwards. Tesla might upend car manufacturers and even battery makers. Uber, Airbnb… there are many examples of innovators disrupting established industries today – and by disruption we really mean they’re killing off the established actors. They grow fast, they raise a ton of money, and they become too big to buy. What has really changed is the current investment race in Silicon Valley – there’s so much money flowing in that innovators become too big to acquire, and consolidation becomes nearly impossible. Did you see how much Salesforce just spent on MuleSoft? $6.5B!
So where does that leave us in the Cloud industry?
If you look at the market share of the top three vendors, it looks like the Cloud industry is already consolidated. AWS, Google and Azure collectively own more than 75% of the Cloud Platform market. The same goes for Microsoft Dynamics, Oracle and Salesforce in the customer service and sales automation market. Cloud vendors should be competing heavily against each other, given there’s little room to gain market share other than by taking it from the other top-3 actors.
But that’s not what we see happening. On the contrary, cloud vendors seem dedicated to playing together nicely – at least for now. Just look at the video streaming ecosystem – you’ll practically use a different vendor for encoding, packaging, CRM, player, analytics, traffic steering… Collaboration makes sense: each actor focuses on one core technology. If one vendor had to develop each brick separately, it would be much too costly in R&D.
Another good example is marketplaces. If you look at AWS’s or Azure’s marketplaces you’ll see how much collaboration there is – there are even products that compete with the cloud platforms’ own offerings. These major vendors have become integrated systems.
Cloud platforms are simply adapting to the way users are consuming IT resources. It’s not that they don’t want to compete against each other, it’s that applications are now developed as microservices, a collection of technologies each developed and maintained by a multitude of actors.
Microservices have gained traction because they allow developers to make use of code and technologies that have been perfected externally. They aren’t plagued by the same scalability challenges posed by monolithic apps - they are optimized for scalability, efficiency and for developer velocity.
I’m sure every one of you here knows that, but I’ll say it nonetheless - the basic idea behind microservices is to break down software into functional components that can be scaled up or down to accommodate user needs. This of course fits nicely with the Cloud industry capabilities – it would be much more difficult to do on-prem.
Complexity doesn’t reside in moving monolithic apps to microservices, nor even in building these microservices. I’m not saying it’s easy, of course! But the real complexity lies in building a successful collaborative environment and infrastructure to run these microservices on.
The infrastructure has to sustain the microservice ecosystem. The goal of all infrastructure engineers and architects must be to remove the low-level operational concerns from microservice development and build a stable infrastructure that can scale, one that developers can easily build and run microservices on top of. And of course that’s easier said than done!
We can look at the microservice ecosystem as four different layers, where the lower 3 are the infrastructure: the hardware layer, the communication layer and the application platform. The top layer is where individual microservices live. A microservice will send some data in a standardized format over the network to another service (or perhaps to a message broker or another microservice’s API endpoint). The interoperability of these various layers and actors composing each layer is where most difficulties happen.
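As a toy illustration of what the communication layer does for the microservices layer (the service name, registry, and envelope format below are invented, not a real protocol): one microservice resolves a peer through service discovery and packages a standardized JSON request for its API endpoint; DNS, load-balancing and transport all live in the layers underneath that call.

```python
import json

def build_call(target_service, registry, payload):
    """Resolve a service name via a (toy) service registry and
    package a JSON request for its API endpoint."""
    endpoint = registry[target_service]  # service discovery lookup
    return {
        "url": f"http://{endpoint}/api/v1/handle",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),  # standardized wire format
    }

# Toy registry, as the communication layer would provide it.
registry = {"inventory": "inventory.internal:8080"}
call = build_call("inventory", registry, {"sku": "W-7", "qty": 2})
```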
Even if you manage to solve interoperability between the different layers, your infrastructure is still at risk. Networks fail, vendors get hit by DDoS attacks. And if you decide to pick one big vendor to run your services on, you’ll be vendor-locked. What happens when that vendor stops competing on price? Or stops investing in innovation in favor of opening new PoPs in regions of the world you have no interest in?
So, what should you look out for when orchestrating your microservices?
First and foremost, you need vendors that have good network availability. Your services need to be available locally, as well as globally. The internet is not built to help your content find the fastest path, bots and DDoS consume traffic, outages happen all the time.
You may have heard about AWS’ S3 outage last year, or EC2’s major failure a few months ago. Worldwide outages are now making headlines because so many companies rely on cloud services to operate. But what newspapers don’t report on are the ‘regular’ outages, occurring at the infrastructure or network level, that bring down access to an instance or to some regions.
At Cedexis we have a real-user monitoring tool called Radar. We basically have JS tags deployed on thousands of websites, testing Cloud services and network performance directly from end-users. We make on average 15 Billion measurements per day – allowing us to see the micro-outages that are happening all over the world, in real time.
And I can tell you there are many outages happening everyday!
As an example, a couple of weeks ago we saw that one of Azure’s US west regions went down. Did you hear about it? What that meant for end-users was at best a degraded user experience and at worst a complete service interruption.
But even under “normal” conditions, response times to access cloud providers fluctuate all the time.
Once you’ve looked at your vendors’ availability, you also want to make sure they are fast.
Fast locally: when your microservices are deployed within a controlled local environment.
Fast globally: when multiple microservices are delivered over clouds or SaaS solutions. There are huge differences in performance between the different Cloud platforms depending on where you’re connecting from. Even within one cloud platform, the very same AWS EU West region performs very differently depending on whether you’re connecting from BT, Sky or TalkTalk.
In the DevOps world, the concept of “site under maintenance” is long gone. Your services simply cannot be unavailable anymore. They need to work all the time, everywhere – and load in under 3 seconds anywhere in the world, if possible. They also need to work under pressure, whether it be DDoS, heavy traffic etc.
So how do you actually orchestrate for high availability, speed, and resiliency?
It’s necessary to get visibility on the conditions of your infrastructure in order to make sure that you are sending users to an endpoint that is available – and to an endpoint that can handle the load. Make sure you have multiple endpoints – different servers and/or different physical locations and/or different ISPs / internet connectivity – for the microservices to rely on.
Sometimes apps or microservices can look available from the outside but are down or close to overloaded on the inside. Local health monitoring will provide critical information through high-frequency checks, load feedback from the servers, circuit breakers, and local retries.
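As one concrete illustration of the circuit-breaker idea, here is a minimal in-process sketch (the thresholds and design are illustrative assumptions; production setups typically rely on a library or on the load-balancer itself): after a run of consecutive failures the breaker opens and fails fast instead of hammering an unhealthy backend, then allows a trial call after a cool-down.

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; fail fast while
    open; allow a trial call again after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: half-open, permit one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```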
In order to orchestrate these multiple endpoints we advocate using a local load-balancer (such as NetScaler, NGINX, HAProxy, Varnish etc.) to route traffic effectively across these multiple servers or local instances. This load-balancer should take into account the data flowing from your monitoring tools in order to make intelligent traffic management decisions.
Similarly, real-user monitoring and network information is key to global reliability.
We are strong advocates of multi-homed infrastructures – not just at the server / instance level, but using multiple datacenters, multiple cloud regions, multiple clouds or CDNs in order to help make your service 100% available for end-users.
We also recommend external health checks (up to the second for critical services requiring high availability) as well as real-user monitoring.
Why RUM? Because it provides real network information and lets you keep an eye on the previously mentioned outages and peering issues. It is particularly useful for fully dynamic, synchronous transactions such as recommendation tools or booking engines, and for cached / CDN-based content in multiple countries and locations.
Load/error feedback will allow your global load-balancer to automatically remove a PoP when it’s overused, close to unavailability, or a source of too many errors.
The advantage of combining an external (real-user based) and internal (also real-user based) vision is that you can pretty much let your infrastructure manage itself – you get to sleep at night again!
First you of course check that your Datacenter or instances are up and running, then you automatically add network data on how fast / available they are from the outside. Here we have three cloud regions that seem green from an external network perspective.
Now, this is when internal data comes in – load metrics will tell you which region is over-utilized. In our example, region C, using NetScaler, looks to be the best choice. So the global load-balancer, taking all of this information into account in real time, should send traffic over to that region – reducing in turn the load on the other regions. Over time, this enables a continuous self-correcting cycle across your different endpoints.
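The two-stage decision just described can be sketched as follows (region names and metric values are invented; “region-c” plays the role of region C in the example):

```python
def route(regions):
    """Stage 1: keep regions that look green from external network
    monitoring. Stage 2: pick the least loaded via internal metrics."""
    externally_green = [r for r in regions if r["network_ok"]]
    return min(externally_green, key=lambda r: r["load"])["name"]

regions = [
    {"name": "region-a", "network_ok": True, "load": 0.9},
    {"name": "region-b", "network_ok": True, "load": 0.7},
    {"name": "region-c", "network_ok": True, "load": 0.3},
]
target = route(regions)
# target == "region-c"; as traffic shifts there, the load on the other
# regions drops, feeding the continuous self-correcting cycle.
```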
To conclude, make sure your services are multi-homed, in order to be able to select the server, region, cloud that is the most available and has the best performance. Use internal and external monitoring data to feed network health information to your local and global load-balancers. And sleep again at night knowing that your infrastructure is a self-healing, reliable machine.
If you’d like more information on multi-Cloud or hybrid-Cloud architectures, please reach out at aude@cedexis.com or sales@cedexis.com.