4. Legacy Datacenter
[Diagram: three phases. Pre-acquisition: the self-maintained legacy datacenter. Post-acquisition: Microsoft datacenter #1 (primary) and Microsoft datacenter #2 (DR). Hybrid Cloud: the Microsoft datacenters alongside Azure IaaS & Azure Storage regions.]
5. Operational
• Cross-geo support (provision anywhere)
• Support to run active-active, multi-region Yammer
• Run EVERYTHING as containers
• Cost benefits
Development
• Fast provisioning of hardware
• Increased availability
• Environment parity between production, pre-production,
and local dev laptops
Security
• End-to-end SSL between services (even intra-DC)
• Limited, just-in-time (JIT) access for developers
6. Azure Front Door
CLIENTS
REGION 1 REGION 2
Load Balancer Load Balancer
Container Container
Data
Store
Data
Store
Mesos / Marathon / Docker / CoreOS Mesos / Marathon / Docker / CoreOS
Legacy
Data
Center
Redis
Event
Hubs
Storm ADW Storage Redis
Event
Hubs
Storm ADW Storage
Azure Express Route
Azure Active
directory / O365
7. Key architectural components
• Containers, Mesos & Marathon
• Package individual services and their dependencies into
Docker images
• Run them in isolated containers
• Use Mesos/Marathon to manage the clusters and deploy
the containers in our regions
• Service deployment
• A JIRA ticket initiates the process.
• For each request, the tool creates a set of deployment
configurations in TeamCity,
• which in turn creates the Marathon application deployments in
all our regions (a minimal sketch follows below).
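As a concrete illustration of that last step, here is a minimal sketch (in Python, using the requests library) of what creating a Marathon application deployment through Marathon's REST API looks like. The service id, Docker image, and Marathon URL are invented for illustration; the real values would come from the generated TeamCity configurations.

```python
import requests

# Hypothetical values; the real ones come from the TeamCity
# deployment configurations generated per JIRA ticket.
MARATHON_URL = "http://marathon.example.internal:8080"

app_definition = {
    "id": "/yammer/example-service",   # illustrative service id
    "instances": 3,
    "cpus": 0.5,
    "mem": 512,
    "container": {
        "type": "DOCKER",
        "docker": {
            "image": "registry.example.internal/example-service:1.0.0",
            "network": "BRIDGE",
            # hostPort 0 asks Marathon/Mesos for a randomly assigned port
            "portMappings": [{"containerPort": 8080, "hostPort": 0}],
        },
    },
    "healthChecks": [{"protocol": "HTTP", "path": "/health"}],
}

# POST /v2/apps registers the app; Marathon then schedules the
# containers onto Mesos agents. Repeat per region to deploy everywhere.
resp = requests.post(f"{MARATHON_URL}/v2/apps", json=app_definition)
resp.raise_for_status()
print(resp.json()["deployments"])
```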
8. Key architectural components
• Network Orchestration tool
• Created an in-house container discovery solution
• Lives on all virtual machines in the deployment
• Retrieves information from Marathon & Mesos
• Creates HAProxy instances for routing and load
balancing
• Yammer Storage Tools (YST)
• Python library to help manage storage systems.
• Helps organize, create, configure and manage the
needed Azure resources.
10. Challenges moving to Azure
• Ability to maintain stability at the cutting edge.
• Cache invalidation.
• Environment is ephemeral.
• Turnaround time when we need more cores
than the subscription allows.
• Time.
• Unknown factors of the technology selected.
• Conflicting priorities & staffing.
11. Key Projects shipped to Azure
• 100% of Yammer Postgres query traffic is served out of Azure for
both masters and replicas.
• Migrated Yammer backup infrastructure to Azure.
• Established a celled Azure cloud infrastructure using Azure Front Door.
• Migrated several critical services to Azure.
• Migrated from on-premises Vertica to Azure SQL Data Warehouse for
Yammer’s analytics data store.
• Migrated the search stack to Azure, including Elasticsearch clusters in
Azure.
12. • Achieve global presence.
• Ability to achieve failure isolation.
• Achieve an active-active configuration.
• Scale without having to deal with constant hardware failures
and end-of-life issues.
• Ability to implement managed solutions and not manage infrastructure,
wherever possible.
• Pay for what we use!
• A chance to culturally be more dev-opsy!
15. My notes
• We have 70+ services; 20+ of them run in Azure now
• We started work on Azure ~2015
Editor's Notes
Hello everyone!
I am Archana, an engineering manager at Microsoft on the Yammer team.
I am excited to be here to talk about Yammer and our current journey to the Azure cloud.
In this talk, I will share some history, platform architecture, some challenges and some of our big wins!
Yammer is the enterprise social network in Office 365 that empowers people to connect and engage across your organization.
Yammer provides a simple way to openly share information, collect ideas and collaborate.
It was founded in 2008 and acquired by Microsoft in 2012.
Like every startup, Yammer had a monolithic codebase.
But as we grew in scale and in team size, we started to face frequent blocking and dependency issues. We looked into microservices architecture, which focuses on developing small, independently deployable services that communicate with each other through mechanisms like HTTP/REST.
Initially, the goal was to see if we could introduce new capabilities faster using microservices.
We then began deconstructing the monolithic codebase to individual microservices and we continue to do so.
One of the challenges we faced initially was,
a. Organizational
Historically, we had the philosophy that everyone should be able to work on everything. But this hit scalability limits and hampered ownership and accountability. As a solution, we created domain teams: focused teams owning parts of the stack. So we moved from being generalists to being specialists.
With this change we saw an increase in communication between teams and team members to get things done.
b. Another challenge was that this re-architecture had to be transparent to the user and done in parallel with product growth (i.e., developing new features), all while keeping the product stable and reliable.
The diagram shows the clients, services, and data stores. We have 70+ microservices.
During this phase, Yammer was powered by our self-maintained legacy data center. When Yammer was acquired by Microsoft, we had to comply with Microsoft rules that required all customer data to be in Microsoft's own data centers. Having data in the Microsoft data centers certainly had benefits for compliance, security, and cost.
So we had two options then: a) go to Azure, or b) migrate to the Microsoft data centers.
This was our first attempt to move to Azure. But the cloud tools provided to us then were not production-ready.
Back then, Microsoft DCs did not support Linux well. So we did, at that point, consider the option to move to Windows or to continue to build on Linux.
And of course, we were also of the startup mindset: "yeah, we can just go build our own!"
Finally, we decided to migrate to the Microsoft data centers, which required us to build our own automation/orchestration framework. This would allow us to discover and manage servers as well as parts of the network. But this task took time and effort!
One thing we missed doing at this point was considering Docker/containers and building for the next generation. Rather, we mimicked the legacy data center and re-created it in the Microsoft data centers.
I will talk about the last column in a bit…
So we moved from legacy DC to Microsoft DC. So then why Azure?
Q> Diagram: Is the IaaS correct? Are we more IaaS or PaaS?
With the migration to the Microsoft datacenters, we still faced challenges like building homegrown automation frameworks, dealing with complex network infrastructure, and the uncertainty of provisioning hardware and managing its life cycle.
We don't want to be in the business of designing and managing datacenters; rather, we want to focus our engineers on helping build a strong, reliable product.
What Azure offers us,
Azure is in 34 regions, thus providing higher performance. (Azure provides a 99.99% SLA for most services.)
We listed our requirements for the move to Azure.
Talking points:
>Cross-geo support (provision anywhere) – This is a big one and helps us go global. (We currently have an active-active configuration in NA and EU.)
Cost savings. We use only what we need. We can quickly scale out or in based on resource demands and customer growth.
Availability - All services have more available instances in multiple regions.
SSL – SSL encryption between any service nodes. Previously we had cross-DC SSL, but we can't consider the network we are on completely secure, and hence we now secure any container-to-container traffic across the network.
JIT – We wanted to reduce developer access and grant just-in-time access instead. This covers compliance aspects but also prevents drift of software configurations, which could be difficult to roll back.
What kind of software configurations?
This is a high-level view of our architecture. We are currently in an active-active configuration. Azure Front Door routes the requests to the closest region. All services are deployed as containers.
We use Mesos, Marathon, and Docker.
We use a wide range of technologies in Azure and we look for Azure technologies that will help us solve our problems so that we don’t need to solve them ourselves.
Legacy data center – This represents our Microsoft data center.
Azure Active Directory/O365 - ? Kristian will talk more on this.
I have listed a few components that we have built for Azure.
Containers, Mesos, Marathon
In the Azure cloud architecture,
we package individual services and their dependencies into Docker images, which allow them to be run in isolated containers. These containers are created inside our distributed Azure VMs, which run on CoreOS. We use Marathon/Mesos to manage the containers and container runtime settings in each of our regions.
Service Deployment tool
We created an in-house tool to automate the service onboarding process. First you dockerize a service to make it deployable in Azure. Then you deploy it to a staging or production environment.
To enable efficient service deployment, we have an in-house tool named JIRAiya to process the JIRA tickets and automate the service onboarding process. For each service deployment JIRA ticket, JIRAiya will create a set of deployment configurations in TeamCity, which creates the Marathon application deployments in all our regions. It will also log into our Azure subscription so that the Azure load balancer and security groups can be modified.
Network Orchestration tool
With randomly placed containers and randomly assigned ports, we need a reliable way to find out where they are.
(We created an in-house container discovery solution (called Lodbrok), which lives on all virtual machines in the deployment. Lodbrok retrieves information from Marathon, which knows about app deployments, and from the Mesos agents, which know about individual containers, and creates HAProxy instances for routing and load balancing.)
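As a rough, hypothetical sketch of those mechanics (not Lodbrok's actual implementation, which is internal): Marathon's /v2/tasks endpoint reports every running task with the host it landed on and its assigned ports, which is enough information to render an HAProxy backend for a service. The Marathon URL and app id below are invented.

```python
import requests

MARATHON_URL = "http://marathon.example.internal:8080"  # invented

def haproxy_backend(app_id: str) -> str:
    """Render an HAProxy backend stanza from Marathon's task list."""
    tasks = requests.get(f"{MARATHON_URL}/v2/tasks").json()["tasks"]
    backend_name = app_id.strip("/").replace("/", "_")
    lines = [f"backend {backend_name}"]
    for task in tasks:
        if task["appId"] != app_id:
            continue
        # Each task reports the agent host it landed on and the host
        # ports that were randomly assigned to its container.
        host, port = task["host"], task["ports"][0]
        lines.append(f"    server {task['id']} {host}:{port} check")
    return "\n".join(lines)

print(haproxy_backend("/yammer/example-service"))
```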
Yammer Storage Tools (YST)
The goal of the Yammer storage tools is to ensure that Yammer storage systems hosted in Azure are provisioned and managed in a simple, predictable, and repeatable manner. For example, automatic cluster creation at the touch of a button, from provisioning the VMs to starting the service.
Is Lodbrok a library, a container, or an artifact?
Historically, we had one data center servicing all of Yammer. When issues happened, they brought all of Yammer down and impacted all networks.
The goal for cells is that they define a failure isolation boundary associated with a set of customer tenants.
Each cell should be as self-contained as possible (all services, storage, compute).
Every cell deploys services that run on a dedicated set of VMs.
Each cell will have its own dedicated set of VMs for load balancing.
Every request received is routed by Azure Front Door to the nearest region. If the routeKey is available in the cache, the request is directly routed to the appropriate cell and services. If the routeKey is unavailable, a lookup is made to fetch the appropriate cell destination. This is then cached for future requests (a minimal sketch of this flow follows below).
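A minimal sketch of that lookup-then-cache flow, purely illustrative: the routeKey source and the directory lookup below are assumptions (see the open questions that follow), not the actual implementation.

```python
# Purely illustrative; routeKey extraction and the directory
# lookup are assumptions, not Yammer's actual implementation.
route_cache = {}  # routeKey -> cell destination

def lookup_cell_destination(route_key):
    """Hypothetical directory lookup mapping a tenant's routeKey to a cell."""
    return "cell-1"  # placeholder for the real lookup service

def resolve_cell(route_key):
    """Return the target cell for a request, caching lookups for reuse."""
    if route_key in route_cache:
        return route_cache[route_key]        # cache hit: route directly
    cell = lookup_cell_destination(route_key)
    route_cache[route_key] = cell            # cache for future requests
    return cell

print(resolve_cell("tenant-42"))  # first call performs the lookup
print(resolve_cell("tenant-42"))  # second call is served from cache
```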
Q> Is the routeKey in the request header? Example of a routeKey?
Q> Is the diagram still correct?
Cell0 - ?
Isn't Azure Front Door common to both regions?
Arguably, the easiest way to move to the cloud is to forklift all of the systems, unchanged, out of the data center and drop them in Azure. But in doing so, we would end up moving all the problems and limitations of the data center along with it. (name a few problems: hardware failures, long hardware provisioning times, YY)
Instead, we analyzed, and continue to analyze, our infrastructure components and services, and identified what can be a lift-and-shift and what we should rebuild.
For example,
1. We continue to use Postgres in Azure as one of our primary data stores. We migrated Postgres to Azure and currently have 100% of our Postgres query traffic served out of Azure for both masters and replicas.
2. While migrating one of our services, we redesigned and implemented it using Redis in Azure, eliminating the previous Hazelcast implementation.
Then talk about the challenges bullet points….
The ecosystem is constantly changing. Working on the cutting edge makes it challenging to keep up with the upgrade channels and security patches: when to update, and when not to.
Some of our caches need manual invalidation when data is updated. This makes running active-active in multiple Azure regions very difficult, because we have to cascade invalidations to all regions that could potentially have stale objects in cache (see the sketch below).
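To make that concrete, here is a hedged sketch (Python with redis-py; the region endpoints and cache key are invented) of fanning an invalidation out to every region. Each region is an extra cross-geo call that can fail or lag, which is what makes this hard.

```python
import redis

# Invented region endpoints; the real topology is internal.
REGION_CACHES = {
    "na": redis.Redis(host="redis.na.example.internal"),
    "eu": redis.Redis(host="redis.eu.example.internal"),
}

def invalidate_everywhere(key):
    """Delete a cached object in every region that might hold it stale."""
    for region, client in REGION_CACHES.items():
        # Every region is one more cross-geo call that can fail or lag;
        # a miss here leaves stale data in that region's cache.
        client.delete(key)

invalidate_everywhere("message:12345")  # illustrative cache key
```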
We expect Azure instances to disappear more frequently than in our current infrastructure. The key mindset shift in moving to and developing in the cloud was that the environment is ephemeral.
The biggest challenge is the turnaround time when we need more cores than the subscription allows. For that reason we alert at >75% CPU/memory usage on all Mesos nodes and start the process early (sketched below).
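A sketch of the kind of early-warning check implied here. It polls the Mesos master's /metrics/snapshot endpoint for the cluster-wide allocation percentages; the note alerts per node, but the master's aggregate view is simpler to illustrate. The host and the alerting transport are invented.

```python
import requests

MESOS_MASTER = "http://mesos-master.example.internal:5050"  # invented
THRESHOLD = 0.75  # alert early so the quota-increase process can start

def check_cluster_headroom():
    """Warn when cluster-wide CPU or memory allocation crosses 75%."""
    snapshot = requests.get(f"{MESOS_MASTER}/metrics/snapshot").json()
    for metric in ("master/cpus_percent", "master/mem_percent"):
        if snapshot[metric] > THRESHOLD:
            print(f"ALERT: {metric} at {snapshot[metric]:.0%}")

check_cluster_headroom()
```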
Every solution needs analysis to see if there is the right equivalent in Azure that fits the desired requirements. Once a solution is selected, you work with the unknown factors. For example, while transferring hundreds of TB from Vertica to Azure SQL Data Warehouse, we had to learn the inner workings of ADW to help tune performance.
Higher-priority projects come up that demand staffing, slowing the momentum of the Azure migration.
Extra:
In a microservice architecture, the principle that each microservice should be responsible for its own data has led to a proliferation of different types of data stores at Yammer.
We've transitioned away from polyglot persistence to persistence consolidation. Why? (operational cost, every DB had its own issues, ?)
Long-running large things to short-running small things
Defining the provisioning and management tooling that will give us consistency, security, and performance in Azure.
With all the challenges, we have a few big wins under our belt,
Postgres – 100% of Yammer Postgres query traffic is served out of Azure for both masters and replicas
Cells – we talked about it.
Artie - Re-architected and deployed a microservice that provides Yammer users with real-time access to messages in Azure. This project was not a trivial lift-and-shift to Azure; it challenged us with a redesign (using Redis in Azure) and rebuilding the service to be cloud-compatible.
Vertica – There were two drivers for this project: one, to get out of the old datacenter; two, to look for a managed solution in Azure rather than a lift-and-shift.
We have several projects in flight or in the pipeline,
Partitioning storage stores
Minimizing usage of memcache
Migrating more services to Azure celled infrastructure
HBase to Azure, or identify the right replacement
Extra:
Notidb using Azure Scheduler (Nathan)
Workfeed celled – scheduled & background tasks / yammer.com API; celled (not yet in Azure)
We have had a very exciting journey thus far and are seeing the benefits,
** We are in NA and are deploying in the EU regions.
All this work has been accomplished by many engineers on several teams!
We continue our journey to exit completely from the datacenter and be completely operational in Azure leveraging all the benefits of the cloud.
Thank you!
Extra:
In the old environment we struggled to ship services; it was manual and cumbersome. In Azure, services deploy faster.
What benefit does microservices strategy give? Read microservices articles
https://thenewstack.io/from-monolith-to-microservices/
standard microservice architecture principles?
Our services are mostly internal
Read marathon and Mesos ; how is this used in our architecture; what are the challenges we have faced with Mesos and Marathon
Understand Jiraya, Lodbrok, Griffin
Where do we use Storm, Redis
Current replication lag
CentOS/CoreOS? https://www.quora.com/What-is-CoreOS
What is Azure Frontdoor?
Difference between Azure ILB and HAProxy?