5. Our thinking is broken
Customer: “I can’t get to my desktop”
Support/Admin: “The desktops aren’t working because storage failed”
CIO/Boss: “We need to ensure storage never fails”
7. Netflix Chaos Monkey
2010: Netflix moves to AWS
2011: US-East Outage - Netflix posts lessons learned
The best way to avoid failure is to fail constantly
Since 2013: Chaos Monkey is run in production except on holidays and weekends
9. Before you buy more stuff – try this
• How do you respond to events today?
• How long to identify them?
• How long to solve them?
• Mean Time Between Failures (MTBF) is a legacy metric
• Focus on Mean Time To Resolution (MTTR) or cycle time
MTBF vs. MTTR
10. Before you buy more stuff – try this
• How are you rolling out Citrix or changes?
• AUTOMATE!!!
• RULE: If you do it twice, it should be automated
• Focus on reducing Cycle time
• time(what is wrong) + time(how to fix it) + time(implement fix) = cycle time
• Immutable Servers
• Servers are rebuilt from scratch for changes
11. Survive Failure - Architecture
• Does Citrix still work if:
• Your storage fails (SAN, local, whatever)?
• Your database fails?
• NetScaler fails?
• What can your users handle?
• Most can handle getting logged off if they can log in again
• Most can NOT handle
• Application hangs
• Print failures
• Can’t log in or connect
Source: theoatmeal.com
12. User Profiles and Folders
• Redirect Folders as much as possible
• This is where the data people use lives (My Docs, Downloads, etc.)
• Profiles
• Profiles should be as light as possible
• Can you use mandatory profile settings?
• Replicate profiles across 2 data centers
• Profiles will not work on DFS-R without corruption (except one-way replication)
• Active/Passive only (not active/active)
• Split users so some are active for one data center, passive for the other
• Use cloud storage
• Hack OneDrive for My Docs - https://office365drivemap.codeplex.com/
13. Storage / DB
• Use redundancy in the software, not hardware
• PVS fails over on the fly (not for CIFS/SMB though!)
• Local disk with PVS is better than an expensive SAN (and likely performs better, especially if you have local SSD)
Diagram: local disk on each server (Whiptail_61, Whiptail_62); mirror-aware vs. standalone databases; Primary Database (APS-DCXA1SQL01), Mirror Database (APS-DCXA2SQL02), and a Witness with no database (APS-DCXDCSQL03)
14. PVS HA/DR Components
SQL Database (highly available)
2 PVS Servers, each with its own vDisk store
DHCP – can be split-scoped on 2008 R2/2012
TFTP can be load balanced with a hardware load balancer
2 different locations
Mirror – storage resilient; Cluster – server resilient
15. Network
• Multiple sites = NetScaler GSLB
• Active/Passive is easiest to setup
• All components should be load balanced if possible
• Even TFTP, double up on every component
• No standalone (“stag”) NetScalers in production
• HA/Failover Pair
• They share the VIP but have separate IP info (so the VIP floats)
• 1 NS + Hypervisor != Pair
Diagram: NetScaler load balancers in Zone US-East1 and Zone US-West1, sharing a floating VIP
16. BLUE/GREEN
Diagram: LB in front of App v1.0 nodes (Db v1.0) and App v1.1 nodes (Db v1.1)
Limiting Downtime
• Like active/passive
• Don’t use DNS for this – you can’t trust TTLs
When to use
• ANY database/schema upgrade
• Restoring from backup is too large/long
17. • Like active/active but with a purpose
• Canary in the coal mine
• See if someone screams!
• Live to production
• Limiting Risk
• Back up your data
• All nodes use production database
• Route new connections to new nodes
CANARY
Diagram: LB in front of two App v1.0 nodes and one App v1.1 node, all sharing Db v1.0
18. Atlanta Public Schools – Citrix Delivery Overview
Architect: Thomas Gamull
Company: Presidio
Date: 3/17/2014
Diagram: external users (24,000 zero clients across the school districts) come through the external firewall to a pair of MPX 11500 NetScalers; internal users and printers sit behind the internal firewall. The CLL Data Center delivers 8,000 concurrent desktops for students via Citrix PVS from three SCVMM-managed clusters (XA1, XA2, XDC), each with 2 Delivery Controllers, 2 Provisioning Servers and an SCVMM server, delivering 2008 R2 desktops, 2008 R2 applications and Windows 7 desktops. Shared services include StoreFront, License Servers, an App-V cluster (APPVPublish, APPVReport), a SQL mirror, a file server for profiles and user data, and print servers.
20. Rack Layout
Diagram: a pair of NetScalers and a pair of top-of-rack switches, several compute blade chassis, compute rack-mount servers with local disk storage, and paired iSCSI/FC storage arrays
Storage is always in pairs if needed
• Prefer multiple smaller arrays over monolithic SAN
• Let app/software do the work
Network redundancy is important
• Load balancers can remove switch dependencies
• Leverage common NIC cabling
Server choice can vary
• Blades are dense but lack local disk
• Rack Mounts are often very flexible
• Without automation you will have scaling problems
21. “Je n’ai fait celle-ci plus longue que parce que je n’ai pas eu le loisir de la faire plus courte.” – Blaise Pascal, Provincial Letters: Letter XVI, 1657
English translation: “If I had more time, I would have written a shorter letter.”
Why
This information isn’t useful without explaining why
I will spend no more than half the speaking time on this
You don’t need to write anything down; just try to grasp my message
What
Some examples
Actual architecture and things you can do
Also
I will finish with at least 10 minutes for Q&A
I respond to email and Twitter
I was the Practice Manager for Workforce Mobility at Presidio, which is a great company and Citrix partner. One of my accomplishments there was the Atlanta Public Schools XenApp/XenDesktop 7 deployment for 50,000 students (one of the first large XenDesktop 7 deployments from a partner). I honestly wanted to do more and joined Ericsson earlier this year as a Consulting Manager – I could list buzzwords like DevOps, OpenStack, CI/CD, SDN and NFV but in reality I currently help customers align their entire deployment pipeline (including software development) with how their company produces value.
Failures can stop business flow and cost companies money. If you’ve ever worked in Operations, you might think its sole job is to prevent failures above everything else. On top of that, we consume better hardware every year and expect stable performance. Why do newer phones seem to have battery life issues and problems making calls? It’s amazing that I can grab a cell phone from 10 years ago and it will last all day on a charge. I had a Volkswagen Beetle that still runs; can we seriously not make data center hardware reliable? We can shorten this philosophy to “five nines”: 99.999% uptime seems to be written into every CIO’s wish list for any architecture today.
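To put that wish in concrete numbers, here is a quick back-of-the-envelope calculation (my own illustration, not from the deck) of the downtime budget each availability target actually leaves you per year:

```python
# Illustrative only: yearly downtime budget implied by an availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for label, target in [("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)]:
    budget = MINUTES_PER_YEAR * (1 - target)
    print(f"{label} ({target:.3%} uptime): about {budget:.0f} minutes of downtime per year")
```

Five nines leaves roughly five minutes a year, which is not a target you buy your way into with hardware alone.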
Failure is a tough thing to avoid or predict. We really should be looking at things a different way. I also realize that many of us have different roles and may think we don’t have a say in this. I disagree: if you can relate anything back to business value, you will get people’s ears or, at worst, a better job.
Let’s walk through a hypothetical. Our customer or end user can’t get to the desktop, and we find out the desktop can’t pull profile data from the storage server. In fact, our storage appears to have failed! “Never mind the details!” says the Director or CIO. “We need this fixed now. We need to ensure storage does not fail again!”
Let’s get a storage expert in here! The solution is a new or upgraded SAN with better performance, more reliability and a promise that it will not fail, or your money back (terms and conditions apply!). The problem with this solution is that it confuses eliminating a problem with finding a solution. It does not address the underlying cause.
Could this have been the storage driver? How does SAN uptime prevent that? What if it’s just space/performance/latency?
Just because the desktop failed when storage did doesn’t mean that storage is the cause
You are now forever justifying this fix (can you honestly admit it’s wrong if you find out?) Also, how’s the SAN fabric looking?
One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture.
If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage.
Rambo architecture, each component can survive failures of the other components it depends on
If our recommendations system is down, we degrade the quality of our responses to our customers, but we still respond. We’ll show popular titles instead of personalized picks. If our search system is intolerably slow, streaming should still work perfectly fine.
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
Automation is key. A few years ago Aaron Parker had a session where he asked how people are still not automating; there is no reason not to. I did not automate much back then, but I do now! If you do something twice, you need to automate it. Humans are not good at repetitive entry, but computers are. Chef or Puppet is something you should look into if you haven’t yet. Also, our focus should ultimately be on cycle time. Finally, the concept of immutable servers is also a worthwhile approach: treat servers like inkjet printers; it’s often easier to just replace them.
http://www.thoughtworks.com/insights/blog/rethinking-building-cloud-part-4-immutable-servers
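A minimal sketch of the immutable-server idea, assuming some provisioning and load balancer hooks exist in your environment (provision, add_to_lb, remove_from_lb and destroy are hypothetical placeholders for your own tooling): never patch a server in place, build a fresh one from a versioned definition and swap it in.

```python
def roll_out(new_image_version, current_servers, provision, add_to_lb,
             remove_from_lb, destroy):
    """Replace every server with a freshly built one instead of patching in place."""
    for old in current_servers:
        new = provision(image=new_image_version)  # rebuild from scratch, every time
        add_to_lb(new)                            # the new node starts taking traffic
        remove_from_lb(old)                       # drain the old node
        destroy(old)                              # don't fix it; replace it
```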
Let’s talk about architecture for XenDesktop 7.6 and how we survive failure. Think of keeping desktops and apps running even while necessary components fail. A better way to focus this is to evaluate what end users can handle. Surprisingly, I’ve found most handle logoffs better than slow performance. People often don’t report a logoff if they can log back in, but when a print job takes 30 minutes or longer, you can be assured of a ticket.
Another session, SYN502, discussed issues with SMB, folder redirection and newer technologies. I’m still a fan of redirection, but mainly for Documents and file data, not for Desktop or AppData. This is one of the biggest areas to tackle for failure issues; in my earlier example, it was the profile that failed, causing the desktop not to load. I have seen profile replication in failover scenarios where one data center is primary for a set of users while the other is primary for another set. End user feedback is important to get this right: is it worth extra hardware and slowness just because people use the desktop for their My Documents? Usually not.
For more info see Synergy 2015 - SYN502: I’ve got 99 problems, and folder redirection is every one of them (Helge Klein, Sean Bass, Aaron Parker)
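As a purely illustrative sketch of that active/passive split, here is one way to pin each user to a single active data center for profile writes (a hash-based split; an OU- or site-based split works the same way). The data center names are made up.

```python
import hashlib

DATA_CENTERS = ["DC-A", "DC-B"]  # hypothetical names

def primary_dc(username: str) -> str:
    """Deterministically assign the user's active data center for profile writes."""
    digest = hashlib.md5(username.lower().encode("utf-8")).hexdigest()
    return DATA_CENTERS[int(digest, 16) % len(DATA_CENTERS)]

def passive_dc(username: str) -> str:
    """The other data center only ever receives one-way replicated copies."""
    active = primary_dc(username)
    return next(dc for dc in DATA_CENTERS if dc != active)
```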
Did I mention how easy it is to scale later using cheap hardware, storage, compute?
Perhaps take out APS refs in the picture?
For HA we should always add another PVS server with a SEPARATE vdisk store (you can mix SAN/local disk, etc here)
If we leave DHCP alone, we add a point of failure where target devices may fail to boot. You can use 2008 R2 or 2012 to provide a split scope, or use a more redundant solution such as BlueCat or Infoblox.
PXE and TFTP are another HA concern; you can only provide true HA with a hardware load balancer. I often do NOT provide HA for TFTP, but if you have a hardware load balancer there is no reason not to. PXE will load the bootstrap, which won’t work if your PVS servers aren’t specified in it (you need to add them).
Use mirroring with SQL if you can. It’s great, and clustering doesn’t really protect you from issues such as the storage failing! If your storage will never, ever fail then that’s awesome, but keep in mind I can use local storage and mirroring and get pretty much the same benefits, except for the feeling of spending tons of money. Clustering helps you update SQL nodes one at a time while keeping SQL up; that is generally not something I do, but I do recommend mirroring.
Mirroring requires a witness server, a third server that doesn’t do anything other than help with quorum (SQL deciding which server is primary). If you set this up and lose both the mirror and the witness, the primary will stop. I often put my witness on a local disk.
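A tiny sketch of the quorum rule behind that behavior (my illustration, not SQL Server internals): with a witness configured, the principal keeps the database online only while it can see a majority of the three partners.

```python
def principal_stays_online(sees_mirror: bool, sees_witness: bool) -> bool:
    """Majority (2 of 3) quorum: the principal always counts its own vote."""
    votes = 1 + int(sees_mirror) + int(sees_witness)
    return votes >= 2

assert principal_stays_online(True, True)        # everything healthy
assert principal_stays_online(False, True)       # mirror down, witness up: stays online
assert not principal_stays_online(False, False)  # lose mirror AND witness: primary stops
```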
Load balancers are your friend. I reference NetScaler for obvious reasons, but keep in mind there are free, Linux-based virtual load balancers that can do some of this work. You don’t have to be a Cisco CCIE to figure this stuff out either; there are tons of blogs and walkthroughs out there to guide you. That being said, GSLB is a LOT harder than just load balancing internal components.
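A toy sketch of the active/passive decision behind GSLB (my illustration, not NetScaler configuration; the site names and hostnames are made up): hand out the primary site’s address while its health probe passes, otherwise fall back to the secondary site.

```python
import socket

SITES = [
    ("primary",   "vip.dc-east.example.com", 443),
    ("secondary", "vip.dc-west.example.com", 443),
]

def site_is_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Crude health probe: can we open a TCP connection to the site's VIP?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def resolve() -> str:
    """Return the first healthy site, preferring the primary (active/passive)."""
    for _name, host, port in SITES:
        if site_is_up(host, port):
            return host
    return SITES[-1][1]  # nothing answers; return the passive site as a last resort
```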
This diagram is actually for application/dev updates, but the theory is the same for other scenarios. We can use blue/green for upgrades, new feature rollouts, etc. Note we actually snapshot or clone the database, then flip over to the other application set (or data center, database, etc.). If your backups are too big or take too long to restore, this method of rolling out changes is ideal.
Limiting Downtime – Blue/Green Deployments
• Create a live replica of the database
• Duplicate all app nodes with the new code/config
• Adjust routing to activate the new code
When to use
• You are updating your schema
• No object-versioned DB
• No feature flags
• You can test the feature outside production
• Restoring from a backup is not practical (big data sets)
Plan for the worst-case scenario: oops, my feature blew up
http://www.slideshare.net/adrianjotto/docker-102-immutable-infrastructure
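Pulling the steps above together, here is a condensed sketch of the blue/green flow. Every function name is a hypothetical placeholder for your own tooling (storage snapshot, provisioning automation, load balancer API); only the ordering and the cheap rollback path are the point.

```python
def blue_green_release(blue_nodes, new_version, clone_db, provision, point_lb_at):
    """Stand up a full green stack, then flip routing; blue stays intact for rollback."""
    green_db = clone_db("production")                    # live replica of the database
    green_nodes = [provision(new_version, db=green_db)   # duplicate every app node
                   for _ in blue_nodes]
    point_lb_at(green_nodes)                             # flip routing at the LB, not via DNS
    return blue_nodes                                    # keep blue around for rollback

# Rollback is simply point_lb_at(blue_nodes); nothing was upgraded in place.
```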
Limiting Risk
• Requires feature flags or sticky LB sessions
• Back up your data
• All nodes use the production database
• Route new connections to the new nodes
When to use
• No contract-breaking changes to the schema
• You have an object-versioned DB
• You use feature flags
• It is impractical to test the feature outside production
• You have a full backup of your data and can restore it
http://www.slideshare.net/adrianjotto/docker-102-immutable-infrastructure
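For contrast, a sketch of the canary rule above (again purely illustrative): sessions stay sticky to their node, a small slice of new connections lands on the v1.1 node, and everything shares the production database.

```python
import random

CANARY_WEIGHT = 0.05  # 5% of brand-new connections hit the canary node

def pick_node(session_node, stable_nodes, canary_node):
    """Sticky sessions keep their node; a small share of new sessions try the canary."""
    if session_node is not None:
        return session_node                  # returning user: don't move them
    if random.random() < CANARY_WEIGHT:
        return canary_node                   # the canary: see if someone screams
    return random.choice(stable_nodes)       # everyone else stays on v1.0

# Rolling back is just setting CANARY_WEIGHT to 0; the shared production
# database was never forked.
```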
Note the right side with 3 SCVMM (Hyper-V) clusters; we use all of them, but we can survive the failure of an entire cluster. All the clusters share the same SQL mirror, StoreFront farm and file server for profiles.
This is one cluster of 2 or more for Hyper-V
2 Blades do the work (so one blade can fail and my cluster is up). If they both fail, I have another cluster.
I have 2 of everything
Don’t skimp on anything; make it two or more of EVERYTHING you can.
Notice the pair of NetScalers at the top of the rack?
I have two storage appliances at each data center (In this case flash storage using PVS)
Primary data center – CLL – 48 blades, with an Invicta and two 6296s in each rack.
Secondary – Brewer – 32 blades, 2 Invictas, 2 NetScalers and 2 6296s in a single rack.
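As a closing back-of-the-envelope check (my own illustration; the per-blade desktop density is an assumed figure, not from the deck), this is the kind of N+1 arithmetic behind “one blade can fail and my cluster is up”:

```python
def survives_one_blade_failure(blades: int, desktops: int, desktops_per_blade: int) -> bool:
    """Can the remaining blades still carry the workload after one blade fails?"""
    return (blades - 1) * desktops_per_blade >= desktops

# With an assumed density of 175 desktops per blade, 48 blades still carry
# 8,000 concurrent desktops after losing one blade.
print(survives_one_blade_failure(blades=48, desktops=8000, desktops_per_blade=175))  # True
```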