Case Study: The University of
Alabama at Birmingham
OpenStack, Ceph, Dell
Kamesh Pemmaraju, Dell
John-Paul Robinson, UAB
OpenStack Summit 2014
Atlanta, GA
An overview
• Dell – UAB backgrounder
• What we were doing before
• How the implementation went
• What we’ve been doing since
• Where we’re headed
Dell – UAB background
• 900 researchers working on Cancer and Genomic
Projects.
• Their growing data sets challenged available resources
– Research data distributed across laptops, USB drives, local
servers, HPC clusters
– Transferring datasets to HPC clusters took too much time
and clogged shared networks
– Distributed data management reduced researcher
productivity and put data at risk
• They therefore needed a centralized data repository for
researchers in order to ensure compliance with
data-retention requirements.
• They also wanted a cost-effective, scale-out solution
with hardware that could be re-purposed for compute &
storage
Dell – UAB background (cont'd)
• Potential solutions investigated
– Traditional SAN
– Public cloud storage
– Hadoop
UAB chose Dell/Inktank to architect a platform that
would be highly scalable, provide a low cost per GB,
and offer the best of both worlds: compute and storage
on the same hardware.
A little background…
• We didn’t get here overnight
• 2000s-era High Performance Computing
• ROCKS-based compute cluster
• The Grid and proto-clouds
• GridWay Meta-scheduler
• OpenNebula, an early entrant that connected
grids with this thing called the cloud
• Virtualization through-and-through
• DevOps is US
Challenges and Drivers
• Technology
• Many hypervisors
• Many clouds
• We have the technology…can we rebuild it here?
• Applications
• Researchers started shouting “Data!”
NextGen Sequencing
Research Data Repositories
Hadoop
• Researchers kept shouting “Compute!”
Data Intensive Scientific Computing
• We knew we needed storage and computing
• We knew we wanted to tie it together with an
HPC commodity scale-out philosophy
• So in August 2012 we bought 10 Dell R720xd servers
• 16-core
• 96GB RAM
• 36TB Disk
• A 192-core, ~1TB RAM, 360TB expansion to our
HPC fabric
• Now to integrate it…
December 2012
• Bob said:
Hearing good things about open stack and ceph at this week at dell world.
Simon anderson, CEO of dream host , spoke highly of
dell, open stack, and ceph today.
He is also chair of company that supports
He also spoke highly of dell crowbar deployment tool.
• I said:
Good to hear.
I've been thinking a lot about dell in this picture
too.
We have the building blocks in place. Might be a good
way to speed the construction.
Lesson 1:
Recognize when a partnership will help you
achieve your goals.
The 2013 Implementation
• The Timeline
• In January we started our discussions with Dell and
Inktank
• By March we had committed to the fabric
• A week in April and we had our own cloud in place
• The Experience
• Vendors committed to their product
• Direct engagement through open communities
• Bright people who share your development ethic
Next Step…Build Adoption
• Defined a new storage product based on the
commodity scale-out fabric
• Able to focus on strengths of Ceph to aggregate storage
across servers
• Provision any-sized images to provide Flexible Block
Storage (see the sketch after this list)
• Promote cloud adoption within IT and across
the research community
• Demonstrate utility with applications
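In our stack, "provision any-sized image" boils down to Cinder volumes backed by Ceph RBD. The sketch below is a minimal, hypothetical illustration of that workflow (not our production tooling); it assumes the era-appropriate cinder and nova CLIs are on the path, and the volume name, size, instance ID, and device path are placeholders.

```python
# Hypothetical sketch: provision an arbitrarily sized, RBD-backed Cinder volume
# and attach it to a running Nova instance. Names, sizes, and IDs are placeholders.
import subprocess


def create_volume(name, size_gb):
    # 'cinder create' carves out a new block device of the requested size (GB)
    subprocess.run(["cinder", "create", "--display-name", name, str(size_gb)],
                   check=True)


def attach_volume(instance_id, volume_id, device="/dev/vdb"):
    # attach the volume to an existing instance so it shows up as a local disk
    subprocess.run(["nova", "volume-attach", instance_id, volume_id, device],
                   check=True)


if __name__ == "__main__":
    create_volume("research-scratch-01", 500)   # a 500 GB flexible block device
    # attach_volume("<instance-id>", "<volume-id>") once the volume is available
```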
Applications
• Crashplan Backup in the cloud
• A couple hours to provision the VM resources
• An easy half-day deploy with the vendor because we controlled our
own resources (a.k.a. the firewall)
• Add storage containers on the fly as we grow…10TB in a few clicks
• Gitlab hosting
• Start a VM spec’d according to the project site
• Work with the Omnibus install. Hey, it uses Chef!
• Research Storage
• 1TB storage containers for cluster users
• Uses Ceph RBD images and NFS (see the sketch after this list)
• The storage infrastructure part was easy
• Scaled provisioning, 100+ user containers (100TB) created in about 5
minutes.
• Add storage servers as existing ones fill
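A minimal sketch of how 1TB per-user containers can be carved out of Ceph as RBD images and exported over NFS. This is an illustration under assumptions, not our actual provisioning scripts: the pool name, export root, client network, and user names are placeholders, and it assumes the rbd CLI and an NFS server are available on the gateway host.

```python
# Hypothetical sketch: provision a 1TB per-user "storage container" as a Ceph RBD
# image, put a filesystem on it, and export it over NFS.
# Pool, export path, network, and user names below are placeholders.
import os
import subprocess

POOL = "rbd"                       # assumed Ceph pool
EXPORT_ROOT = "/exports/research"  # assumed NFS export root on the gateway host


def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout.strip()


def provision_container(user, size_mb=1024 * 1024):
    image = f"{POOL}/home-{user}"
    run("rbd", "create", image, "--size", str(size_mb))  # 1 TB image (size in MB)
    device = run("rbd", "map", image)                    # 'rbd map' prints /dev/rbdN
    run("mkfs.ext4", "-q", device)
    mountpoint = f"{EXPORT_ROOT}/{user}"
    os.makedirs(mountpoint, exist_ok=True)
    run("mount", device, mountpoint)
    # export to the cluster network; the CIDR is a placeholder
    run("exportfs", "-o", "rw,no_root_squash", f"10.0.0.0/24:{mountpoint}")


if __name__ == "__main__":
    for user in ["alice", "bob"]:   # scaled provisioning is just a loop over users
        provision_container(user)
```

Scaled provisioning is then just a loop over the user list, which is how 100+ containers can appear in a matter of minutes.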
Ceph Rebalances as Storage Grows :)
Lesson 2:
Use it! That’s what it’s for!
The sooner you start using the cloud
the sooner you start thinking like the cloud.
How PoC Decisions Age Over Time
• Pick the environment you want when you are in
operation…you’ll be there before you know it
• Simple networking is good
• But don’t go basic unless you are able to reinstall the fabric
• Class B ranges to match the campus fabric
• We chose a split admin range to coordinate with our HPC admin range
• We chose a collapsed admin/storage network due to a single
switch…it probably would have been better to keep them
separate and allow for growth
• It’s OK to add non-provisioned interfacing nodes…know your net (see the sketch after this list)
• Avoid painting yourself into a corner
• Don’t let the Paranoid Folk box in your deployment
• An inaccessible fabric is an unusable fabric
• Fixed IP range mismatch with “fake” reservations
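“Know your net” is worth automating. Below is a small, hypothetical sanity check using Python’s standard ipaddress module that catches a fixed-IP range falling outside the campus Class B or colliding with the admin range; every CIDR block in it is an illustrative placeholder, not UAB’s real addressing.

```python
# Hypothetical sanity check for cloud network planning: verify that the fixed-IP
# range sits inside the campus Class B and does not overlap the admin range.
# All CIDR blocks below are illustrative placeholders.
import ipaddress

CAMPUS = ipaddress.ip_network("172.16.0.0/16")   # campus Class B (placeholder)
ADMIN = ipaddress.ip_network("172.16.10.0/24")   # split admin range (placeholder)
FIXED = ipaddress.ip_network("172.16.20.0/24")   # cloud fixed-IP range (placeholder)


def check_ranges():
    assert FIXED.subnet_of(CAMPUS), "fixed range falls outside the campus fabric"
    assert not FIXED.overlaps(ADMIN), "fixed range collides with the admin range"
    print(f"{FIXED} fits the campus fabric and leaves {ADMIN} untouched")


if __name__ == "__main__":
    check_ranges()
```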
Lesson 3:
The fabric is flexible. Let it help you solve your
problems
Problems will Arise
• The release version of the ixgbe driver in the Ubuntu
12.04.1 kernel didn’t perform well with our 10Gbit
cards
• Open source has an upstream
• Use it as part of your debug network
• Upgrading the drivers was a simple fix (see the sketch after this list)
• Sometimes when you fix something you break
something else
• There are still a lot of moving parts but each has a
strong open source community
• Work methodically
• You will learn as you go
• Recognize the stack is integrated and respect tool boundaries
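A minimal sketch of the kind of methodical check that pins down a driver problem like ours: report the NIC driver and version the kernel is actually running so it can be compared against the upstream release. It assumes ethtool is installed; the interface name is a placeholder.

```python
# Hypothetical sketch: report the NIC driver and version in use, so an
# underperforming in-kernel ixgbe can be compared against the upstream release.
# The interface name is a placeholder.
import subprocess


def driver_info(iface="eth2"):
    out = subprocess.run(["ethtool", "-i", iface], check=True,
                         capture_output=True, text=True).stdout
    # 'ethtool -i' prints lines like "driver: ixgbe" and "version: 3.x.y"
    return dict(line.split(": ", 1) for line in out.splitlines() if ": " in line)


if __name__ == "__main__":
    info = driver_info()
    print(f"running {info.get('driver')} version {info.get('version')}")
```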
Sometimes a Problem is just a Problem
• Code example
Lesson 4:
The code *is* the documentation
…and that’s a *good* thing
Where we are today
• OpenStack plus Ceph are here to stay for our
Research Computing System
• They give us the flexibility we need for an ever-expanding
research applications portfolio
• Move our UAB Galaxy NextGen Sequencing platform to
our Cloud
• Add Object Storage services
• Put the cloud in the hands of researchers
• The big question…
…how far can we take it?
• The goal of process automation is scale
• Incompatible, non-repeatable, manual processes
are a cost
• Success is in dual-use
• Satisfy your needs and customer demand
• Automating process implies documenting process…great for
compliance and repeatability
• Recognize the latent talent in your staff: today’s system
admins are tomorrow’s systems developers
• Traditional infrastructure models are ripe for
replacement
Lesson 5?
You can learn from research
and engage as a partner
Want to learn more about Dell +
OpenStack + Ceph?
Join the Session, 2:00 pm, Tuesday, Room #313
Software Defined Storage, Big Data and Ceph -
What Is all the Fuss About?
Neil Levine, Inktank &
Kamesh Pemmaraju, Dell
