OSS Presentation DRMC by Keith Brennan

More IOPS Please
DRMC’s VMware View Implementation Using
Nexenta

Keith Brennan
October 2011


Delano Regional Medical
Center

S 156 bed community hospital in
central California.

S Four satellite clinics.

S Only hospital in a 30 mile
radius.

S Serves approximately 60,000
people spread over several
communities.

S 80%+ of our patients are Medi-
Cal or Medicare.
S Government doesn’t pay
well.

The Great Directive of 2009

S Need to deploy 150 new desktops in support
of a Clinical Documentation implementation.
S Do it as cheaply as possible.

S Oh, by the way, you’re losing an FTE due to
budget cuts.

“Never let a good crisis go to
waste.” –Rahm Emanuel

S Used this “Opportunity” to justify moving to VDI.
S Users resistant to using something other than a traditional
desktop.
S Perceived lack of freedom.
S Perceived increase in “Big Brother.”

S Why I wanted the transition to VDI
S Ease of management.
S We had a set, well defined, integrated, desktop experience.
S Wanted a way to deliver the same experience in a controlled
manner to a myriad of devices: iOS, Android, etc.

I Need Storage!

S My Existing EMC CX500 was barely cutting it for 3 ESX
hosts w/ a combined 32 VM’s.

S Lots of people on the Virtualization forums liked NetApp.

S NetApp had just published a white paper on a 750 View
virtual desktop deployment on a FAS 2050a.
S Near normal desktop load times.
S Seamless user experience.

Well That’s Timely!

S The next week another vendor calls letting me know that
IBM is running a huge storage sale.

S It includes their N series of network attached storage.
S Rebadged NetApps.

S Three weeks later an N3600, a rebadged NetApp 2050a,
arrives.

S It is set up identically to the VDI whitepaper’s setup.

Implementation Guidelines

S Linked clones are to be used whenever possible.
S Ease of maintenance
S Ease of provisioning

S No user data to be stored on the VM’s.

S Significant patching shall be done through the Golden Image
and VM’s will be re-provisioned using the updated image.

S AV will run on the VM’s but only in real-time scan mode. No
scheduled system scans.

Initial Testing

S Two Hosts with 25 VM’s each.
S One connected to the N3600 via iSCSI.
S The other via NFS.

S Test lab of 25 thin clients.

S Good performance.
S Equivalent to a desktop of the previous generation.
S Quick user logins due to the VM’s being always on and waiting.
S The N3600 is maintaining low utilization.
S NFS and iSCSI exhibit similar speed.

Go Live!

S Five additional ESX Hosts are deployed.
S Each hosts ~25 VM’s
S Current setup gives me N+2 host redundancy.

S For the first week everything looks good.

S User complaints are primarily with the clinical application.

S N3600 is handling it well. Running at about 35%
utilization.
S ~1.5k IOPS of regular background chatter.
S VM’s report average latency of 12ms.
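A quick way to sanity-check the N+2 claim is to work out how many VM's each surviving host would carry after one or two host failures. This is only a sketch: it assumes the two test hosts stayed in the cluster alongside the five new ones (seven hosts total) and that the load is spread evenly, neither of which the slides state outright.

```python
import math

def vms_per_host_after_failures(total_hosts, vms_per_host, failed_hosts):
    """Return the VM count each surviving host must carry with an even spread."""
    surviving = total_hosts - failed_hosts
    if surviving <= 0:
        raise ValueError("no surviving hosts")
    total_vms = total_hosts * vms_per_host
    return math.ceil(total_vms / surviving)

TOTAL_HOSTS = 7    # assumption: 2 test hosts + 5 added at go-live
VMS_PER_HOST = 25  # from the slide: ~25 VM's per host

for failed in range(3):
    load = vms_per_host_after_failures(TOTAL_HOSTS, VMS_PER_HOST, failed)
    print(f"{failed} host(s) down: {load} VM's per surviving host")
```

Under those assumptions, each host needs headroom for roughly 35 VM's to ride out a double failure.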

Disaster!
For me they seem to happen in threes.

S First AV engine update happens 1 week after go live.
S AV server pushes it to all clients at once.
S The simultaneous update of all the View VM’s forces the
SAN to a crawl for 3 hours.
S Users complain that the Virtual Desktops are unusable.
S Temporarily corrected the problem by only allowing the AV
to update 3 machines at once.
S This worked like a champ until a dot version update on the
AV server a month later broke that setting.
S Another 3 hour “downtime.”
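The "only 3 machines at once" fix is a concurrency cap. The sketch below illustrates the idea in general terms; it is not how the AV server implements its setting, and `run_throttled`, `fake_update`, and the VM names are all hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 3  # mirrors the "3 machines at once" limit from the slide

def run_throttled(vm_names, update_fn, max_concurrent=MAX_CONCURRENT):
    """Run update_fn on every VM, never more than max_concurrent at a time."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(update_fn, vm_names))

# Demo: a stand-in update that records the peak number of updates in flight.
_lock = threading.Lock()
_active = 0
peak = 0

def fake_update(vm_name):
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    time.sleep(0.01)  # pretend to do I/O-heavy work
    with _lock:
        _active -= 1
    return vm_name

results = run_throttled([f"vm-{i:03d}" for i in range(20)], fake_update)
print(f"updated {len(results)} VM's, peak concurrency {peak}")
```

The weak point in the story is the same as in the sketch: the cap lives in one setting, so anything that resets it (here, a dot release) removes the protection.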

Disaster (cont)

S Three days later a helpdesk tech forces the simultaneous
reprovisioning of 60 of the View VM’s.
S Was applying an application patch.
S Was trained not to restart more than 5 VM’s at once.
S That obviously didn’t stick!
S That was another hour of the SAN crawling.
S Once again, users complained that the system was unusable
during this time.

Disaster! (yet again)

S .NET 3.5 service pack is approved for deployment.

S SP is large. >100 MB.

S Set to deploy starting at 2am and only on restart.
S At 04:15 four VM’s restart within one minute of each other.
S N3600 starts to lag.
S Users seeing their system running slow decide to restart.

S At 5am I get the call regarding the issue.

S I immediately disabled the SP deployment.
S Still took an hour for the N3600 to catch up.

What’s Going On???

S Oh $41+…
S General use chatter is
eating my bandwidth.
S N3600 CPU utilization is
regularly now above
50%.
S Disk utilization rarely
drops below 40%.
S Average disk latency
>18ms.

I Have a Problem

S I’m maxing performance with just day to day operations.
S IBM has verified that the appliance is functioning
properly.
S In other words, this is all I’m going to get out of it.
S Adding disks might help some, but too costly!
S Additional Tray would be $15k!
S SAS drives to populate it are almost $1k each!
S Still have CPU limitations.
S NIC limitations (two 1GbE links per head).

S Did I mention that I have no money left in the budget?

Nexenta to the Rescue

S Had just installed Nexenta Core for my home file server.

S Time to find some hardware:
S Pulled a box out of the View cluster.
S Installed six Intel SSD’s.
S Installed Nexenta Core. (yeah, I know.. EULA..)
S Created the volume and shared via NFS.
S The next day my poor brain figured out that I could have just
done a Nexenta VM. Doh!

S Over the next week I migrated half the virtual desktops over.

It’s like Night and Day

S Average latency drops
from 18ms to 2ms.
S Write throughput
quadruples.
S Read throughput
doubles.
S 20x improvement on 4k
IOPS!

Time For a Full Nexenta
Implementation

S I was able to secure $45k capital for the next year.
S Normally this would just draw laughter when talking about
storage.

S I also intend to replace the existing EMC.
S Annual maintenance too costly.
S I despise the fact that I have to call them out every time I want
to connect a new piece of hardware to it.

S Still some questioning from higher-ups on this whole open-
storage thing.

Final Solution Hardware

S 2x Supermicro dual Xeon servers with 96 GB RAM.

S 1x DataOn 1600 JBOD
S Houses twenty-one 1 TB nearline SAS drives.

S 1x DataOn 1620 JBOD
S Houses seventeen 300 GB 10k rpm SAS drives.

S 2x STEC ZeusRAM

S 8x 160 GB Intel 320 SSD’s

Why DataOn?

S Disk Shelf Manager
S One thing Nexenta lacked
was a way to monitor the
JBoD’s
S How could one of my techs
know which drive to
pull?

S Intuitive slot lighting.

S They’re responsive even
after the sale is made!

Why Nexenta?

S It’s good to have on-demand support.
S I am the only member of our technical staff that has a basic
understanding of storage architectures.
S I like to have the ability to go on vacation from time to time!

S It’s good to have experts for unique problems.

S Regular tested bug-fixes.

S It’s always nice to have someone’s neck to wring!

The End Result

S 2ms latency.

S 500 MB/s reads

S 200 MB/s writes

S Happy Users!

S Note: Benchmark was
done on production
system with 175 active
VM’s.

To Dedup or Not to Dedup

S Dedup can give you huge storage savings.
S I had 14x Dedup ratio on my VDI volume.

S Inline dedup saves on disk write IO.
S It’ll still hit the ZIL, but won’t be written to disk if it is
determined to be duplicated data.
S Instead of a 4+kb write you get a sub 256 byte metadata write.
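To put rough numbers on the write savings, the sketch below compares bytes sent to disk with and without inline dedup. It makes two simplifying assumptions that the slides do not claim: the incoming write stream is duplicated at the same 14x ratio as the stored data, and each duplicate costs a full 256 bytes of metadata (the slide says "sub 256").

```python
BLOCK_BYTES = 4096  # the "4+kb write" from the slide, taken as exactly 4 KiB
META_BYTES = 256    # upper bound for the metadata-only write
DEDUP_RATIO = 14    # ratio reported for the VDI volume

def bytes_to_disk(logical_writes, dedup_ratio, block=BLOCK_BYTES, meta=META_BYTES):
    """Estimate bytes written to the pool for a batch of logical block writes."""
    unique = logical_writes / dedup_ratio   # blocks that are genuinely new
    duplicate = logical_writes - unique     # blocks that only update metadata
    return unique * block + duplicate * meta

logical_writes = 14
without_dedup = logical_writes * BLOCK_BYTES
with_dedup = bytes_to_disk(logical_writes, DEDUP_RATIO)
saving = 1 - with_dedup / without_dedup

print(f"without dedup: {without_dedup} bytes")
print(f"with dedup:    {with_dedup:.0f} bytes ({saving:.0%} less)")
```

Under these assumptions, 14 logical writes shrink from 57,344 bytes to 7,424 bytes, roughly an 87% reduction in data written to the pool disks.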

To Dedup or Not to Dedup (cont)

S Ram Hog!
S For good performance you need enough ram to store the
dedup table.
S Uses ARC for this, which means you will have less room for
cached data.

S Potential for hash collision.
S Odds are astronomical, but still a chance for data corruption.

S Dedup performance penalty.
S Small IOPS suffer.
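The RAM point above can be estimated before turning dedup on. The sketch below uses the commonly quoted rule of thumb of about 320 bytes of memory per dedup table entry; treat that figure, the 500 GiB of unique data, and the 8 KiB average block size as assumptions for illustration, not measurements from this deployment.

```python
BYTES_PER_DDT_ENTRY = 320  # assumption: widely quoted rule of thumb
GIB = 1024 ** 3

def ddt_ram_gib(unique_data_bytes, avg_block_bytes, entry_bytes=BYTES_PER_DDT_ENTRY):
    """Estimate RAM (GiB) needed to hold the whole dedup table in memory."""
    entries = unique_data_bytes / avg_block_bytes  # one entry per unique block
    return entries * entry_bytes / GIB

# Hypothetical pool: 500 GiB of unique data, 8 KiB average block size.
estimate = ddt_ram_gib(500 * GIB, 8 * 1024)
print(f"dedup table needs roughly {estimate:.1f} GiB of RAM")
```

Smaller blocks mean more entries for the same amount of data, which is why small-block VDI workloads make the table expensive to keep in ARC.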

Dedup Performance Penalty

[Benchmark charts: Dedup Enabled vs. No Dedup]

Is Dedup Worth it?

S If you’re using a “Golden Image” - No.
S VMDC Plugin provides great efficiency by only storing one
copy of the Golden Image vs one for each pool of VM’s.
S Compression is virtually free and will do a good job of
making up the difference in the “new” blocks.
S Disk is cheap.

S If you’re doing a bunch of P2V desktop migrations -
Maybe.
S If the desktops are poorly configured, or have other aspects
that can cause excessive I/O, then no.
S If the desktops are similar and large, then sure.

Compression

S Use it. Unless you’re using a 5 year old processor, there
will be no noticeable performance hit.
S On by default in Nexenta 3.1
S Compresses before write. Saves disk bandwidth!

Cache is Key!

S Between the 70 GB of ARC and 640 GB of L2ARC the read cache
is hit almost 98% of the time!

S This equates to sub 2ms average disk latency to the end user.

S Beats the crud out of the >15ms average latency of the N3600!

S Know your working set. You could get away with a lot
smaller or need a lot larger cache.
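The link between hit rate and latency is a weighted average, which is why knowing the working set matters so much. The sketch below uses the 98% hit rate from the slide; the 1 ms cache-hit and 15 ms cache-miss latencies are placeholder assumptions, not measured values.

```python
def effective_latency_ms(hit_rate, hit_ms, miss_ms):
    """Average read latency as a hit-rate-weighted mix of cache and disk."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

HIT_MS = 1.0    # assumption: read served from ARC/L2ARC
MISS_MS = 15.0  # assumption: read that falls through to spinning disk

at_98 = effective_latency_ms(0.98, HIT_MS, MISS_MS)
at_90 = effective_latency_ms(0.90, HIT_MS, MISS_MS)
print(f"98% hit rate: {at_98:.2f} ms")
print(f"90% hit rate: {at_90:.2f} ms")
```

With these placeholder numbers, dropping from a 98% to a 90% hit rate nearly doubles the average latency, so a cache that is too small for the working set shows up quickly at the desktop.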

Gig-E vs TenGig-E

S Obvious differences in maximum throughput.

S Small IOP differences are mainly attributable to network
latency differences.

S If you’re stuck with Gig-E, use 802.3ad trunk groups.
S Each host is still limited to ~100 MB/s, but no single ESX host
will saturate the link for the rest.
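The reason a trunk group caps each host at one link's worth of throughput is that link aggregation pins each conversation to a single member link. The sketch below is a deliberately simplified model of that behaviour: real switches and ESX hosts use configurable hash policies, and `pick_link`, the CRC32 hash, and all addresses here are hypothetical.

```python
import zlib

def pick_link(src_ip, dst_ip, n_links):
    """Map a (source, destination) pair to one link index, the same one every time."""
    key = f"{src_ip}->{dst_ip}".encode()
    return zlib.crc32(key) % n_links

N_LINKS = 4                 # hypothetical trunk of four 1GbE links
STORAGE_IP = "10.0.0.10"    # hypothetical storage head address
hosts = [f"10.0.0.{i}" for i in range(21, 28)]  # seven hypothetical ESX hosts

for host in hosts:
    print(host, "-> link", pick_link(host, STORAGE_IP, N_LINKS))
```

Because a given host always lands on the same link, it can never exceed that link's bandwidth, but a busy host also cannot crowd out hosts that hash to the other links.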

Gig-E vs TenGig-E - User
Perspective

S Average time from the “Power On VM” command being
issued until the user is able to log in:
S 10gbe: 23 seconds
S 1gbe: 32 seconds

S Time from when user presses “login” button until the
desktop is ready to use:
S 10gbe: 5 seconds
S 1gbe: 9 seconds

*Windows 7, 2 procs, 2gb ram, DRMC’s Standard Clinical Image
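Expressed as percentages, the timings above work out as follows. This is plain arithmetic on the four numbers from the slide; `percent_faster` is just a helper name for the example.

```python
def percent_faster(slow_seconds, fast_seconds):
    """Percentage reduction in elapsed time going from the slow to the fast case."""
    return (slow_seconds - fast_seconds) / slow_seconds * 100

power_on = percent_faster(32, 23)  # power-on until login is possible
login = percent_faster(9, 5)       # login click until desktop is usable

print(f"power-on to login prompt: {power_on:.1f}% less time on 10GbE")
print(f"login to usable desktop:  {login:.1f}% less time on 10GbE")
```

That is about 28% less waiting at power-on and about 44% less at login.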

Final Thought – All SSD
Goodness

S For deployments of Linked Clones or VM’s off of a Golden
Image.

S Allows you to get rid of the L2ARC.

S Use a good ZIL Device (STEC ZeusRam, DDRDrive)
S Allows for sequential writes to the SSD’s in the pool.
S Saves on write wear which is a SSD killer.
S My first test box with the x25m SSD’s started suffering after
about 3 months.

S If you want HA you have to use SAS drives.

Takeaway:

Latency is
Key!!

keith@drmc.com
661-721-5650
Feel free to contact me.
