OpenStack at Ebsco
Nate Baechtold, IT Architect
Ebsco Information Services
August 23, 2016
• The leading discovery service provider for libraries worldwide
with more than 10,000 discovery customers in over 100
countries.
• Preeminent provider of online research content for libraries,
including hundreds of research databases, historical archives,
point-of-care medical reference, and corporate learning tools
serving millions of end users at tens of thousands of institutions.
• Leading provider of electronic journals & books for libraries, with
more than 360,000 serials, including more than 57,000 e-journals,
as well as online access to more than 800,000 e-books.
What did we need?
• Self-service infrastructure for all development teams.
• Full-stack automation for all environments.
• Increase agility and productivity of operations and development
teams.
• Lower costs by leveraging open source solutions.
• Provide a solution that integrates well with other products and
allows other products and tools to easily integrate with it.
Why OpenStack?
• Easy-to-consume API that commoditizes infrastructure with the same
methodology used by public clouds.
• Abstraction of the underlying infrastructure, so that configuration or
hardware differences do not propagate to consumers and
automation.
• Standardized interface for compute, network and storage
• When software supports OpenStack it tends to “just work”
• Allows us to build an IaaS platform fit for live services and safely
hand out access to diverse teams through built-in project isolation.
• Prefer to tell consumers that “if you break it then it is our fault” rather than
giving them a long list of things that they should never do.
Current Scale
• 3 OpenStack clouds
• Approximately 1100 running
instances
• Almost 500,000 instances
created and destroyed since
general availability
• 68% of workloads concentrated
in development environments
• Around 1/3 of all virtualized
workloads currently on
OpenStack
[Chart: Distribution by running instance. Segments: DevQa, Live DC 1, Live DC 2. Values shown: 68%, 10%, 22%.]
Design Philosophy
• Build a platform to run production applications.
• Multi-tenant at its core
• Should be able to safely support development and operations teams sharing
the same cloud.
• All tools needed to build a highly available production application
need to be available
• Good enough for development but not production is not an acceptable
permanent state.
• Build general purpose solutions. Customize as little as possible.
• Provide an easy menu of infrastructure offerings
• Easy to use solution with safeguards to encourage experimentation
• Development is easier when you don’t need to worry about breaking the
environment
Current Architecture
Ebsco Private Cloud Platform
[Diagram: OpenStack Cloud, Monitoring, Operations, Dashboards, Load Balancing]
[OpenStack services shown: Nova, Neutron, Cinder, Glance, Keystone, Heat, Ceilometer, Horizon]
What we learned…
Problems to Solve:
• Skills and training
• Selection of vendors and
integrations
• Deployment
• Adoption
• Productionization
Skills and training:
Our Experiences
• Internally develop a core group of OpenStack
SMEs before progressing too far.
• Do not waste learning opportunities by relying
too much on professional services.
• Look for candidates with strong Linux,
networking, virtualization and Python skills
rather than OpenStack experience.
• Give your team the time and opportunity to
experiment and learn how OpenStack works.
• Vendor support lowers the amount of
expertise you need to go to production.
• OpenStack skills are
VERY hard to hire
• Administration requires
good Linux experience
• Inexperienced
administrators can
cause huge amounts of
damage
Vendors and
integrations:
Our Experiences
• Prefer products that align with
OpenStack’s multi-tenancy model
whenever possible.
• Focus on vendors building for cloud rather
than trying to integrate it afterwards.
• Look at areas to improve everywhere in
the stack. Re-evaluate your product
decisions. There is high value when an
integration is done right.
• You will not know how good a vendor’s
integration is until you try it. There can be
many hidden landmines with missing
capabilities or API support.
• Tons of vendor
integrations with varying
degrees of quality
• Many established
vendors
• Users need access to
everything that they
need to deploy and
manage a highly
available production
application
Case Study – Existing Load Balancing
• Existing vendor had limited OpenStack knowledge and a bare-bones
integration at the time.
• Actual quote from support after a bug was discovered (vendor-specific
lines edited)
• “For now, to avoid a failover, I would recommend to program the OpenStack not
to delete IPs.”
• LBaaS v1 was extremely limited. Would not have covered all
production use cases.
• Product did not support safe multi-tenancy. There were shared resources that
were a point of failure.
• Prolonged evaluation period of 6-8 months resulting in rejection.
Case Study – Cloud Load Balancer (AVI)
• Installation involves providing OpenStack credentials and it handles the
rest.
• Allowed us to make production-grade load balancing generally available in
development within a week and in production within a month.
• Multi-tenancy model aligns with OpenStack Projects and with Keystone
• Nobody had to ask for access. If you had access to OpenStack then you had
access to load balancing services.
• No fighting with permissions or concerns with preventing untrained users from
damaging the environment.
Problems to Solve:
Our Experiences
• Align resources for storage, networking
and datacenter teams and make sure that
someone on each team will make
troubleshooting installation issues a top
priority.
• OpenStack requires tight integration with all of
these elements. A slow troubleshooting
feedback loop will have a very negative effect
on the deployment.
• Understand which deployment choices are
difficult to change afterwards and make
sure that you get them right.
• Assume multiple tries to get a production-ready
configuration.
• Deployment
• Deployments take a
long time and are
complex
• Some OpenStack
functionality is not ready
for production
Problems to Solve:
Our Experiences
• Have a close relationship with your early
adopters. They will help you increase the
resiliency of your deployment.
• Regularly speak with them in person to help them
understand OpenStack and to let them tell you
about issues before they become a problem.
• Get deployments into your users' hands as
soon as possible.
• Do not stall getting to production. Teams will
not want to code to an API that they cannot
use in production.
• Adoption will be limited until you can get
production availability.
• Solving problems “just for development
environments” is the wrong mentality.
• Early feedback is critical.
• Adoption
• Adoption is one of the
most critical elements to
success.
Problems to Solve:
Our Experiences
• Monitor OpenStack by actually using
OpenStack. Build instances and use
OpenStack functionality to detect failures.
• OpenStack is very complex and understanding
the effect of a failure can be difficult.
• If you monitor by using OpenStack you will
catch most failures before your users do and
know what functionality is impacted.
• Automate common operational and
maintenance tasks.
• OpenStack HA is complex but needed for
all environments.
• Productionization
• OpenStack provides
building blocks but
some assembly is
required to build a
product out of it.
• Monitoring and common
operational tasks are
not solved out of the
box.
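The "monitor OpenStack by actually using OpenStack" idea can be sketched as a small check runner. This is a minimal illustration under stated assumptions, not the framework described in these slides: the names `CheckResult`, `run_checks` and `impacted` are hypothetical, and the two checks are simulated stand-ins for real API calls such as building and deleting an instance.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str        # functionality exercised, e.g. "compute" (label is illustrative)
    ok: bool
    seconds: float   # how long the check took
    error: str = ""  # exception text when the check failed


def run_checks(checks: Dict[str, Callable[[], None]]) -> List[CheckResult]:
    """Run every check; a check passes if it returns without raising."""
    results = []
    for name, check in checks.items():
        started = time.monotonic()
        try:
            check()
            ok, error = True, ""
        except Exception as exc:
            ok, error = False, str(exc)
        results.append(CheckResult(name, ok, time.monotonic() - started, error))
    return results


def impacted(results: List[CheckResult]) -> List[str]:
    """Names of the checks that failed, i.e. the functionality that is impacted."""
    return [r.name for r in results if not r.ok]


# Simulated checks standing in for real API calls (hypothetical).
def boot_and_delete_instance() -> None:
    pass  # a real check would build an instance, wait for it, then delete it


def create_and_delete_volume() -> None:
    raise RuntimeError("simulated failure: volume never became available")


results = run_checks({
    "compute": boot_and_delete_instance,
    "block storage": create_and_delete_volume,
})
print(impacted(results))
```

Because each check exercises one user-facing operation, a failed check directly names the functionality that users would find broken, which is the point of monitoring this way.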
What we did…
Phased Environments…
Prototype
• Single-machine, all-in-one
deployment
• Learn basics
• Validate direction
• Disposable
environment
Interim
• Break apart compute
and control
• Limited release to
early adopters
• Get feedback and
determine desired
configuration
DevQa
• Highly available
environment
• Treated like production
• General availability for
development workloads
• Determine
productionization tasks
needed
Production
• Implement
productionization
tasks
• Deploy production
clouds
What wound up happening…
Prototype Interim DevQa Production
Took too long to get
to production…
• Critical team member left
• Took too long finding a
replacement due to focus on
hiring OpenStack skillset.
• Additional work for monitoring
and operations automation
was required before we were
confident hosting production
workloads.
• Required skillsets that were not
a part of the OpenStack team
and focused manpower.
Solution: Create a focus squad
• Kicked off a 6-week effort with a cross-functional team that had
all required skills.
• This team would focus 100% on getting OpenStack to live.
• OpenStack tasks must be top priority for all team members.
• Director quote “Set your email to out of office if you have to”
• The focused effort was incredibly efficient.
• Feedback loops for troubleshooting massively reduced.
• Reduction of blocked tasks created a higher quality implementation.
What did the focus squad do?
• Created a reliable monitoring solution based on Zabbix and a Python
framework for executing OpenStack checks.
• Created automated recovery for problems discovered in DevQa.
• Automated compute node evacuation
• Automated failed OpenStack service recovery
• Increased visibility into the environment with Zabbix and Grafana.
• Automated common operational tasks as push-button jobs in
Rundeck.
• Taking a compute or control node out of service
• Restarting OpenStack services
• Deployed all production OpenStack, Zabbix and Rundeck
infrastructure.
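The slides do not say how the automated service recovery was implemented. As a hedged sketch of the general pattern (probe, restart a bounded number of times, then hand off to a person), the function below takes the probe and restart actions as parameters; `recover_service`, `is_healthy`, `restart` and `max_restarts` are all hypothetical names, and a real job would also wait between a restart and the next probe.

```python
from typing import Callable


def recover_service(
    name: str,
    is_healthy: Callable[[str], bool],
    restart: Callable[[str], None],
    max_restarts: int = 2,
) -> str:
    """Return "healthy", "recovered", or "escalate" for the named service."""
    if is_healthy(name):
        return "healthy"          # nothing to do
    for _ in range(max_restarts):
        restart(name)             # e.g. trigger an operations job (assumption)
        if is_healthy(name):
            return "recovered"
    return "escalate"             # bounded retries exhausted: involve an operator
```

Bounding the number of restarts keeps the automation from looping on a fault it cannot fix, so persistent failures still reach an operator.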
Tracking Success…
• Critical to getting continued commitment but hard to determine.
• We track the following metrics:
• Instance count and resource usage
• Number of teams and products leveraging OpenStack
• The number of instances created and deleted
• This can be a good indicator of whether OpenStack was the right fit for your
organization. A high count indicates people are using automation rather than manual provisioning.
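One way to read the created-and-deleted metric is as a ratio against the number of running instances. The helper below is an illustrative calculation only; `churn_ratio` is a hypothetical name, and the inputs are the approximate figures from the Current Scale slide (almost 500,000 instances created, approximately 1100 running).

```python
def churn_ratio(created_total: int, running_now: int) -> float:
    """Instances ever created per instance currently running."""
    if running_now <= 0:
        raise ValueError("running_now must be positive")
    return created_total / running_now


# Approximate figures taken from the Current Scale slide.
ratio = churn_ratio(500_000, 1_100)
print(f"about {ratio:.0f} instances created per running instance")
```

A ratio in the hundreds suggests instances are mostly short-lived and created by automation, since that volume is unlikely to come from manual provisioning.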
Thank You
Questions?
