How REI Used Automation to Cloudify Infrastructure and Rapidly Adjust its Digital Pandemic Response

A discussion on how REI kept its digital customers and business leadership happy using rapidly
adaptable IT infrastructure, even as the world around them was suddenly shifting.
Listen to the podcast. Find it on iTunes. Download the transcript. Sponsor: Hewlett
Packard Enterprise.
Dana Gardner: Hello, and welcome to the next edition of the BriefingsDirect Voice of
Innovation podcast series.
I’m Dana Gardner, Principal Analyst at Interarbor Solutions, your host and moderator for
this ongoing discussion on IT infrastructure automation and efficiency.
Like many retailers, Recreational Equipment, Inc. (REI) was faced with drastic and rapid
change when the COVID-19 pandemic struck. REI’s marketing leaders wanted to make
sure that their online e-commerce capabilities would rise to the challenge. They
expected a nearly overnight 150 percent jump in REI’s purely digital business.
Fortunately REI’s IT leadership had already advanced their systems to heightened
automation, which allowed the Seattle-based merchandiser to turn on a dime and devote
much more of its private cloud to the new e-commerce workload demands.
Stay with us as we learn how REI kept its digital customers and business leadership
happy, even as the world around them was suddenly shifting.
To explore what works for making IT agile and
responsive enough to re-factor a private cloud at
breakneck speed, we’re joined by Bryan Sullins, Senior
Cloud Systems Engineer at REI in Seattle. Welcome to
BriefingsDirect, Bryan.
Bryan Sullins: Thanks, Dana. I appreciate it and I’m
happy to be here.
Gardner: When the pandemic required you to hop-to,
how did REI manage to have the IT infrastructure to
actually move at the true pace of business? What put you
in a position to be able to act as you did?

Digital retail demands rise during distancing
Sullins: In addition to the pandemic stay-at-home orders a couple months ago, we also
had a large sale previously scheduled for the middle of May. It’s the largest sale of the
year, our anniversary sale.
And ramping up to that, our marketing and sales department realized that we would
have a huge uptick in online sales. People really wanted to get outside, because people
could go outside without breaking any of the social distancing rules.
For example, bicycle sales were up 310 percent compared to the same time last year.
So in ramping up for that, we anticipated our online presence at rei.com was going to go
up by 150 percent, but we wanted to scale up by 200 percent to be sure. In order to do
that, we had to reallocate a bunch of ESXi hosts in VMware vSphere. We either had to
stand up new ones or reallocate from other clusters and put them into what we call our
digital retail presence.
As a result of our fully automated
process, using Hewlett Packard
Enterprise (HPE) OneView,
Synergy, and Image Streamer, we
were able to reallocate 6 out of the
17 total hosts needed. We were
able to do that in 18 minutes, all at
once -- and that’s single touch,
that’s launching the automation and
then pulling them from one cluster and decommissioning them and placing them all the
way into the digital retail clusters.
We also had to move some from our legacy platform, they aren’t at HPE Synergy yet,
and those took an additional three days. But those are in transition, we are moving
through to that fully automated platform all around.
Gardner: That’s amazing because just a few years ago that sort of rapid and automated
transition would have been unheard of. Even at a slow pace you weren’t guaranteed to
have the performance and operations you wanted.
If you were not able to do this using automation – if the pandemic had hit, heaven forbid,
five or seven years ago – what would have been the outcome?
Sullins: There were actually two outcomes from this. The first is the fairly obvious issue
of not being able to handle the online traffic on our rei.com retail presence. It could have
been that people weren’t able to put stuff into a shopping cart, or inventory decrement,
and so on. It could have been a very broad range of things. We needed to make sure we
had the infrastructure capacity so that none of that fails under a heavy load. That was
the first part.
Gardner: Right, and when you have people in the heat of a purchasing moment, if
you’re not there and it’s not working, they have other options. Not only would you lose
that sale, you might lose that customer, and your brand suffers as well.
Sullins: Oh, without a doubt, without a doubt.
The other issue, of course, would have been if we did not meet our deadline. We had
just under a week to get this accomplished. And if we had to do this without a fully
automated approach, we would have had to return to our managers and say, “Yeah, so
like we can’t do it that quickly.” But with our approach, we were able to do it all in the
time frame -- and be able to get some sleep in the interim. So it was a win-win.
Gardner: So digital transformation pays off after all?
Sullins: Without a doubt.
Gardner: Before we learn more about your journey to IT infrastructure automation, tell
us about REI, your investments in advanced automation, and why you consider yourself
a data-driven digital business?
Automation all the way
Sullins: Well, a lot of that precedes me
by quite a bit. Going back to the early
2000s, based on what my managers tell
me, there was a huge push for REI to
become an IT organization that just
happens to do retail. The priority is on IT
being a driving force behind everything we
do, and that is something that, at the time, REI really needed to do. There are other
competitors, which we won’t name, but you probably know who they are. REI needed to
stay ahead of that curve.
So since then there have been constant sweeping and cyclical changes for that digital
transformation. The most recent one is the push for automating all things. So that’s the
priority we have. It’s our marching orders.
Gardner: In addition to your company, culture, and technology, tell us about yourself,
Bryan. What is it about your background and personal development that led you to be in
a position to act so forthrightly and swiftly?
Sullins: I got my start in IT back in 1999. I was a public school teacher before that, and
then I made the transition to doing IT training. I did IT training from 1999 to about 2012.
During those years, I got a lot of technology certifications, because in the IT training
world you have to.
I began with what was, at the time, called the Microsoft Certified Solutions Expert
(MCSE) certification. Then I also did the Linux Professional Institute. I really glommed on
to Linux. I wanted to set myself apart from the rest of the field back then, so I went all-in
on Linux.
And then, 2008-2009-ish, I jumped on the VMware train and went all-in on VMware and
did the official VMware curriculum. I taught that for about three years. Then, in 2012, I
made the transition from IT training into actually doing this for real as an engineer
working at Dell. At the time, Dell had an infrastructure-as-a-service (IaaS) healthcare
cloud that was fairly large – 1,200-plus ESXi hosts. We were also responsible for the
storage and for the 90-plus storage area network (SAN) arrays as well.
In an environment that large, you really have to automate. I cut my teeth on automating
through PowerCLI and Ansible. Since then, about 2015, it’s been the focus of my career.
I’m not saying I’m a guru, by any means, but it’s been a focus of my career.
Then, in 2018, REI came calling. I jumped on that opportunity because they are a
super-awesome company, and right off the bat I got free rein: if you want to automate it,
then you automate it. And I have been doing that ever since August of 2018.
Gardner: What helped you make the transition from training to cloud engineer?
Sullins: I typically jump right into new technology. I don’t know if that comes from the
training or if that’s just me as a person. But one of the positives I’ve gotten from the
training world is that you learn 100 percent of the feature base that’s available with
said technology. I was able to take what I learned and knew from VMware and then say,
“Okay, well, now I am going to get the real-world experience to back that up as well.” So
it was a good transition.
Gardner: Let’s look at how other organizations can anticipate the shift to automation.
What are some of the challenges that organizations typically face when it comes to
being agile with their infrastructure?
Manage resistance to cloud management
Sullins: The challenges that I have seen aren’t usually technical. Usually the
technologies that people use to automate things are ready at hand. Many are free;
Ansible, for example, is free. PowerCLI is free. Jenkins is free.
So, people can start doing that tomorrow. But the
real challenge is in changing people’s mindset
about a more automated approach. I think that it’s
tough to overcome. It’s what I call provisioning by
council. More traditional on-premises approaches
have application owners who want to roll out x
number of virtual machines (VMs), with all their particular specs and whatnot. And then a
council of people typically looks at that and kind of scratches their chin and says, “Okay,
we approve.” But if you need to scale up, that council approach becomes a sort of gate-
keeping process.
With a more automated approach, like we have at REI, we use a cloud management
platform to automate the processes. We use that to enable self-service VMs instead of
having a roll out by council, where some of the VMs can take days or weeks to roll out
because you have a lot of human beings touching it along the way. We have a lot of that
process pre-approved, so everybody has already said, “Okay, we are okay with the roll
out. We are okay with the way it’s done.” And then we can roll that out in 7 to 10 minutes
rather than having a ticket-based model where somebody gets to it when they can. Self-
service models are able to do that much better.
But that all takes a pretty big shift in psychology. A lot of people are used to being the
gatekeeper. It can make them uncomfortable to change. Fortunately for me, a lot of the
people at REI are on-board with this sort of approach. But I think that resistance can be
something a lot of people run into.
Gardner: You can’t just buy automation in a box off of a shelf. You have to deal with an
accumulation of manual processes and habits. Why is moving beyond the manual
processes culture so important?
Sullins: I call it a private cloud because that means there is a healthy level of
competition between what’s going in the public cloud and what we do in that data center.
The public cloud team has the capability of “selling” their solution side-by-side with ours.
When you have application owners who are technically adept -- and pretty much all of
them are at REI -- they can be tempted to say, “Well, I don’t want to wait a week or two
to get a VM. I want to create one right now out on the public cloud.”
That’s a big challenge for us. So what we are trying to accomplish -- and we have had
success so far through the transition -- is to offer our application owners, our
customers, a spectrum of services. So that’s great.
The stakeholders consuming that now gain flexibility. They can say, “Okay, yeah, I have
this application. I want to run it in the public cloud, but I can’t based on the needs for that
application. We have to run it on-premises.” And now they can do that in an automated
way. That’s a big win, and that’s what people expect now, quite honestly.
Gardner: They want the look and feel of a public cloud but with all the benefits of the
private cloud. It’s up to you to provide that. Let’s find out how you did.
How did you overcome the challenges that we talked about and what are the
investments that you made in tools, platforms, and an ecosystem of players that
accomplished it?
Sullins: As I mentioned previously, a lot
of our utilities are “free,” the Ansibles of
the world, PowerCLI, and whatnot. We
also use Morpheus to do self-service
and the implications behind automating
things on what I call the front end, the
customer face. The issue you have there is you don’t get that control of scaling up
before you provision the VM. You have to monitor and then roll it out on the backend. So
you have to monitor for usage and then scale up on the backend, and seamlessly. The
end users aren’t supposed to know that you are scaling up. I don’t want them to know.
It’s not their job to know. I want to remain out of their way.
In order to do that, we’ve used a combination of technologies. HPE actually has a
GitHub link for a lot of Ansible playbooks that plug right in. And then the underlying
hardware adjacent management ecosystem platform is HPE OneView with HPE
Synergy and Image Streamer. With a combination of all of those technologies we were
able to accomplish that 18-minute roll-out of our various titles.
Gardner: Even though you have an integrated platform and solutions approach, it
sounds like you have also made the leap from ushering pets through the process into
herding cattle. If you understand my metaphor, what has allowed you to stop treating
each instance as a pet into being able to herd this stuff through on an automated basis?
From precious pets to replaceable cattle
Sullins: There is a psychological challenge with that. In the more traditional approach –
and the VMware shop listeners are going to be very well aware of this -- I may need to
have a four-node cluster with a number of CPUs, a certain amount of RAM, and so on.
And that four-node cluster is static. Yes, if I need to add a fifth down the line I can do
that, but for that four-node cluster, that’s its home, sometimes for the entire lifecycle of
that particular host.
With our approach, we treat our ESXi hosts as cattle. The HPE OneView-Synergy-Image
Streamer technology allows us to do that in conjunction with those tools we mentioned
previously, for the end point in particular.
So rather than have a cluster, and it’s static and it stays that way -- it might have a
naming convention that indicates what cluster it’s in and where -- in reality we have
cattle-based DNS names for ESXi hosts. At any time, the understanding throughout the
organization, or at least for the people who need to know, is that any host can be pulled
from one cluster automatically and placed into another, particularly when it comes to
resource usage on that cluster. My dream is that the robots will do this automatically.
So if you had a cluster that goes into the yellow, with its capacity usage based on a
threshold, the robot would interpret that and say, “Oh, well, I have another cluster over
here with a host that is underutilized. I’m going to pull it into the cluster that’s in the
yellow and then bring it back into the green again.” This would happen all while we
sleep. When we wake up in the morning, we’d say, “Oh, hey, look at that. The robots
moved that over.”
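
The threshold-driven rebalancing described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not REI's implementation: the `Cluster` class, the `rebalance` function, the 75 percent "yellow" threshold, the host names, and the load figures are all hypothetical, and a real engine would call the hypervisor and hardware management APIs rather than move strings between lists.

```python
from dataclasses import dataclass, field

# Hypothetical threshold: the discussion only says a cluster "goes into the yellow".
YELLOW_THRESHOLD = 0.75


@dataclass
class Cluster:
    name: str
    hosts: list = field(default_factory=list)
    load: float = 0.0  # total demand, measured in "whole hosts" of capacity

    def utilization_with(self, host_count: int) -> float:
        """Utilization this cluster would have with the given number of hosts."""
        return self.load / host_count if host_count else float("inf")

    @property
    def utilization(self) -> float:
        return self.utilization_with(len(self.hosts))


def rebalance(clusters):
    """Pull hosts from underutilized clusters into clusters that are in the yellow.

    Returns a list of (host, from_cluster, to_cluster) moves. A donor gives up
    a host only if it would still be at or below the threshold afterwards.
    """
    moves = []
    # Handle the most heavily loaded clusters first.
    for needy in sorted(clusters, key=lambda c: c.utilization, reverse=True):
        while needy.utilization > YELLOW_THRESHOLD:
            donors = [
                c for c in clusters
                if c is not needy
                and len(c.hosts) > 1
                and c.utilization_with(len(c.hosts) - 1) <= YELLOW_THRESHOLD
            ]
            if not donors:
                break  # nothing can be spared; a human has to add capacity
            donor = min(donors, key=lambda c: c.utilization)
            host = donor.hosts.pop()
            needy.hosts.append(host)
            moves.append((host, donor.name, needy.name))
    return moves


# Illustrative data only.
digital = Cluster("digital-retail", ["esx-01", "esx-02"], load=1.9)
batch = Cluster("batch", ["esx-03", "esx-04", "esx-05", "esx-06"], load=1.0)
moves = rebalance([digital, batch])
```

In this example the digital retail cluster starts at 95 percent utilization, so one host is pulled from the lightly loaded cluster and utilization drops back under the threshold. The donor check matters: without it, fixing one cluster could push another into the yellow.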
Gardner: Algorithmic operations. It sounds very exciting.
Automation begets more automation
Sullins: Yes, we have the push-button automation in place for that. It’s the next level of
what that engine is that’s going to make those decisions and do all of those things.
Gardner: And that raises another issue. When you take the plunge into IT automation,
and you are making your way down the Chisholm Trail with your cattle, all of a sudden it
becomes easier along the way. The automation begets more automation. As you learn
and grow, does it become more automated along the way?
Sullins: Yes. Just to put an exclamation point on this topic, imagine the situation we
opened the podcast with, which is, “Okay, we have to reallocate a bunch of hosts for
rei.com.” If it’s fully automated, and we have robots making those decisions, the
response is instantaneous. “Oh, hey, we want to scale up by 200 percent on rei.com.”
We can say, “Okay, go ahead, roll out your VM. The system will react accordingly. It will
add physical hosts as you see fit, and we don’t have to do anything, we have already
done the work with the automation.” Right?
But to the automation begetting
automation, which is a great way of
putting it, by the way, there are always
opportunities for more automation. And
on a career side note, I want to dispel the
myth that you automate your way out of
a job. That is a complete and total myth.
I’m not saying it doesn’t happen, where
people get laid off as a result of automation. I’m not saying that doesn’t happen, but
that’s relatively rare because when you automate something, that automation is going to
need to be maintained because things change over time.
The other piece of that is a lot of times you have different organizations at various states
of automation. Once you get your head above water to where it's, “Okay, we have this
process and now it's become trivial because it's been automated,” we can now
concentrate on automating either more things -- or you have new things that need to be
automated. And whether that’s the process for only VMs, a new feature base,
monitoring, or auto-scaling -- whatever it is -- you have the capability from day one to
further automate these processes.
Gardner: What was it specifically about the HPE OneView and Synergy that allowed
you to move past the manual processes, firefighting, and culture of gatekeeping into
more herding of cattle and being progressively automated?
Sullins: It was two things. The Image Streamer was number one. To date, we don’t run
PXE boot infrastructure. Not that we can't; it’s just not something that we have
traditionally done. We needed a more standard process for doing that, and Image
Streamer fit that and solved that problem.
The second piece is the provided Ansible playbooks that HPE has to kick off the entire
process. If you are somewhat versed in how HPE does things through OneView, you
have a server profile that you can impose on a blade, and that can be fully automated
through Ansible.
And, by the way, you don’t have to use
Image Streamer to use Ansible
automation. This is really more of an
HPE OneView approach, whereby you
can actually use it to do automated
profiles and whatnot. But the Image
Streamer is really what allows us to say, “Okay, we build a gold image. We can apply
that gold image to any frame in the cluster.” That’s the first part of it, and the rest is
configuring the other side.
Gardner: Bryan, it sounds like the HPE Composable Infrastructure approach works well
with others. You are able to have it your way because you like Ansible, and you have a
history of certain products and skills in your organization. Does the HPE Composable
Infrastructure fit well into an ecosystem? Is it flexible enough to integrate with a variety of
different approaches and partners?
Sullins: It has been so far, yes. We have anticipated leveraging HPE for our bare metal
Linux infrastructure. One of the additional driving forces and big initiatives right now is
Kubernetes. We are going all-in on Kubernetes in our private cloud, as well as in some
of our worker nodes. We eventually plan on running those as bare metal. And HPE
OneView, along with Image Streamer, is something that we can leverage for that as well.
So there is flexibility, absolutely, yes.
Coordinating containers
Gardner: It’s interesting, you have seen the transition from having VMware and other
hypervisor sprawl to finding a way to manage and automate all of that. Do you see the
same thing playing out for containers, with the powerful endgame of being able to
automate containers, too?
Sullins: Right. We have been utilizing Rancher as part of our coordination tool for our
Kubernetes infrastructure and utilizing vSphere for that. So we are using that.
As far as the containerization approach, REI was doing containers before
containers were a big thing. Our containerization platform has been around since at least
2015. So REI has been pretty cutting edge as far as that is concerned.
And now that Kubernetes has won the
orchestration wars, as it were, we are
looking to standardize that for people
who want to do things online, which is
to say, going back to the digital
transformation journey.
Basically, the industry has caught up with what our super-awesome developers have
done with containerization. But we are looking to transition the heavy lifting of
maintaining a platform away from the developers. Now that we have a standard
approach with Kubernetes, they don’t have to worry so much about it. They can just
develop what they need to develop. It will be a big win for us.
Gardner: As you look back at your automation journey, have you developed a
philosophy about automation? How should this best work in the future?
Trust is foundation of automation
Sullins: Right. Have you read Gene Kim’s The Unicorn Project? Well, there is also his
The Phoenix Project. My take from that is the whole idea of trust, of trusting other
people. And I think that is big.
I see that quite a bit in multiple organizations. For REI, we are going to work as a team
and we trust each other. So we have a pretty good culture. But I would imagine that in
some places that is still a big challenge.
And if you take a look at The Unicorn Project, a lot of the issues have to do with trusting
other human beings. Something happened, somebody made a mistake, and it caused
an outage. So they lock it up and lock it away and say only certain people can do that.
And then if you multiply that happening multiple times -- and then different individuals
locking that down -- it leads to not being able to automate processes without somebody
approving it, right?
Gardner: I can't imagine you would have been capable, when you had to transition your
private cloud for more online activity, if you didn’t have that trust built into your culture.
Sullins: Yes, and the big challenge that might
still come up is the idea of trusting your end
users, too. Once you go into the realm of self-
service, you come up on the typical what-ifs.
What if somebody adds a zero and they meant
to only roll out 4 VMs but they roll out 40? That’s possible. How do you create guardrails
that are seamless? If you can, then you can trust your users. You decrease the risk and
can take that leap of faith that bad things won’t happen.
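
One way to picture such a guardrail is a check that runs before a self-service request is provisioned. The sketch below is a hypothetical illustration, not how REI or any particular cloud management platform implements it: `Guardrail`, `VmRequest`, `check_request`, and every limit shown are assumed names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    # All limits are hypothetical examples.
    max_vms_per_request: int = 10
    max_vcpus_per_vm: int = 8
    max_total_vms_per_owner: int = 50


@dataclass(frozen=True)
class VmRequest:
    owner: str
    count: int
    vcpus_per_vm: int


def check_request(req, rail, owner_current_vms):
    """Return a list of problems; an empty list means the request may proceed."""
    problems = []
    if req.count < 1:
        problems.append("count must be at least 1")
    if req.count > rail.max_vms_per_request:
        problems.append(
            f"count {req.count} exceeds the limit of "
            f"{rail.max_vms_per_request} VMs per request"
        )
    if req.vcpus_per_vm > rail.max_vcpus_per_vm:
        problems.append(
            f"{req.vcpus_per_vm} vCPUs per VM exceeds the limit of "
            f"{rail.max_vcpus_per_vm}"
        )
    total = owner_current_vms + req.count
    if total > rail.max_total_vms_per_owner:
        problems.append(
            f"owner {req.owner} would have {total} VMs, above the quota of "
            f"{rail.max_total_vms_per_owner}"
        )
    return problems


rail = Guardrail()
# The intended request: 4 VMs.
ok = check_request(VmRequest("team-a", count=4, vcpus_per_vm=2), rail, owner_current_vms=12)
# The same request with an accidental extra zero: 40 VMs.
typo = check_request(VmRequest("team-a", count=40, vcpus_per_vm=2), rail, owner_current_vms=12)
```

The request for 4 VMs passes with no problems, while the request for 40 is stopped twice: once by the per-request limit and once by the owner quota. Returning readable reasons rather than a bare rejection keeps the guardrail out of the user's way, since the fix is obvious from the message.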
Gardner: Tell us about your wish list for what comes next. What would you like HPE to
be doing?
Small steps, robots, and teamwork reap rewards
Sullins: My approach is to first automate one thing and then work out from there. You
don’t have to boil the ocean. Start with something small and work your way up.
As far as next steps, we want auto-scaling at the physical layer, with the robots doing all
of that. The robots will scale up and down our requesters while we sleep.
We will continue to do application programming interface (API)-capable automation with
anything that has a REST API. If we can connect to that and manipulate it, we can do
pretty much whatever automation we want.
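
As a rough illustration of that point, the sketch below builds an authenticated JSON request using only the Python standard library. The base URL, path, payload, and token are placeholders and do not correspond to any real HPE, VMware, or REI endpoint; `build_json_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_json_request(base_url, path, payload, token, method="POST"):
    """Build (but do not send) a JSON request for a REST endpoint."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method=method,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder endpoint and payload, for illustration only.
req = build_json_request(
    "https://automation.example.internal/api",
    "/clusters/digital-retail/hosts",
    {"host": "esx-06"},
    token="REDACTED",
)
# urllib.request.urlopen(req) would send it; that call is left out so the
# sketch runs without a network connection.
```

Once requests like this can be built and sent from code, the same call can be triggered by a playbook, a pipeline job, or a scheduled task, which is what makes anything with a REST API a candidate for automation.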
We are also containerizing all things. So if any application can be containerized properly,
containerize it if you can.
As far as what decision-making engine we have to do the auto-scaling on the physical
layer, we haven’t really decided upon what that is. We have some ideas but we are still
looking for that.
Gardner: How about more predictive analytics using artificial intelligence (AI) with the
data that you have emanating from your data center? Maybe AIOps?
Sullins: Well, without a doubt. I, for one, haven’t done any sort of deep dive into that,
but I know it’s all the rage right now. I would be open to pretty much anything that will
encompass what I just talked about. If that’s HPE InfoSight, then that’s what it is. I don’t
have a lot of experience quite honestly with InfoSight as of yet. We do have it installed in
a proof of concept (POC) form, although a lot of the priorities for that have been shifted
due to COVID-19. We hope to revisit that pretty soon, so absolutely.
Gardner: To close out, you were ahead of the curve on digital transformation. That
allowed you to be very agile when it came time to react to the COVID-19 pandemic.
What did that get you? Do you have any results?
Sullins: Yes, as a matter of fact, our boss’s boss, his boss -- so three bosses up from
me -- he actually sits in on our load testing. It was an all-hands-on-deck situation during
that May online sale. He said that it was the most seamless one that he had ever seen.
There were almost no issues with this one.
What I attribute that to is, yes, we had
done what we needed on the
infrastructure side to make sure that
we met dynamic demands. Also,
everybody worked as a team.
Everybody, all the way up the stacks,
from our infrastructure contribution, to
the hypervisor and hardware layer, all
the way on up to the application layer
and the containers, and all of our DevOps stuff. It was very successful. We went past our
goals of what we had thought for the sale, so it was a win-win all the way around.
Gardner: Even though you were going through this terrible period of adjustment, that’s
very impressive.
Sullins: Yes.
Gardner: I’m afraid we’ll have to leave it there. We have been exploring how REI faced
drastic and rapid IT demand shifts when the COVID-19 pandemic struck. And we have
learned how an embrace of digital business transformation, highly automated systems
operations, and a modern ecosystem of platforms and advanced technology solutions
saved the day.
So please join me in thanking our guest, Bryan Sullins, Senior Cloud Systems Engineer
at REI. Thank you, Bryan.
Sullins: Thanks, Dana. And shameless plug, if I may.
Gardner: Please.
Sullins: My personal blog is thinkingoutcloud.org and I encourage people to go there,
because the how-to on how we accomplish much of what we talked about here I have
blog posts for there. And you can contact me at @RussianLitGuy on Twitter.
Gardner: And thanks as well to our audience for joining this sponsored BriefingsDirect
Voice of Innovation discussion. I’m Dana Gardner, Principal Analyst at Interarbor
Solutions, your host for this ongoing series of HPE-supported discussions.
Thanks again for listening. Please pass this along to your IT community, and do come
back next time.
A discussion on how REI kept its digital customers and business leadership happy, even as the
world around them was suddenly shifting. Copyright Interarbor Solutions, LLC, 2005-2020. All
rights reserved.
You may also be interested in:
• How IT modern operational services enables self-managing, self-healing, and self-
optimizing
• HPE Pointnext’s Nine-Step Plan for Enterprises to Attain the New Business Normal
• As containers go mainstream, IT culture should pivot to end-to-end DevSecOps
• AI-first approach to infrastructure design extends analytics to more high-value use cases
• How Intility uses HPE Primera intelligent storage to move to 100 percent data uptime
• As hybrid IT complexity ramps up, operators look to data-driven automation tools
• Cerner’s lifesaving sepsis control solution shows the potential of bringing more AI-
enabled IoT to the healthcare edge
• How containers are the new basic currency for pay as you go hybrid IT
• HPE strategist Mark Linesch on the surging role of containers in advancing the hybrid IT
estate
