[NetherRealm Studios] Game Studio Perforce Architecture
MERGE 2013 THE PERFORCE CONFERENCE SAN FRANCISCO • APRIL 24−26
Perforce White Paper
To provide a solid foundation for software development
excellence in today's demanding economy, it's critical
to have a software version management solution that
can meet your demands.
Game Studio Perforce Architecture
Creating Services That Power Large Development
Ryan Mensching, NetherRealm Studios (WB Games)
Introduction
NetherRealm Studios has been creating video games on multiple platforms for more than 20
years. As a AAA game developer, we have seen the amount of data produced and our production
quality increase many times over. Along with this growth comes increased demand for Perforce
space, reliability, and flexibility. These three pillars are the foundation of our infrastructure
decisions.
Background
Perforce Software Version Management has been an integral part of studio collaboration since
2006. Prior to migrating to Perforce, several products had been used, including Visual
SourceSafe. Perforce has been able to scale to the needs of studio development while
allowing management of large binary assets and large integrations. Enabling this growth,
however, has required some engineering to ensure the best possible structure for Perforce to
layer on.
Past projects encountered stability and resource constraints so severe that individual team members
took it upon themselves to run or host services on individual machines. Projects were at risk of
losing data and compromising timelines, which could cost hundreds of thousands of dollars.
For game studios, deadlines and “crunch” are generally mandatory events during the multi-
year development cycle. Staff may be on site working around the clock during these periods,
and having Perforce down halts work and collaboration.
This is true not only for our code development but also for art asset development and binary
art assets, such as textures, models, and many cinematic movies, which consume the most
space for any given game project. The current project head revision is around 20 GB (sync in 8
minutes) and the art assets head revision is around 545 GB (sync in 3 hours). Furthermore, we
can have 460+ submitted changelists per day with upwards of 20 GB of iterations (see Figure
1). This adds up to a significant load on Perforce during critical business periods.
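To put these figures in perspective, the short calculation below derives the average transfer rate implied by each full sync and a rough ceiling on three months of growth. It is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1000 MB), treats the quoted sync times as exact, and assumes every day reaches the 20 GB peak, which overstates real growth.

```python
# Figures quoted in the text. Decimal units (1 GB = 1000 MB) are an assumption.
MB_PER_GB = 1000


def sync_rate_mb_per_s(size_gb, minutes):
    """Average transfer rate implied by syncing size_gb in the given minutes."""
    return size_gb * MB_PER_GB / (minutes * 60)


code_rate = sync_rate_mb_per_s(20, 8)        # code head revision: 20 GB in 8 minutes
art_rate = sync_rate_mb_per_s(545, 3 * 60)   # art head revision: 545 GB in 3 hours

# Upper bound on new revision data if every day hit the 20 GB peak.
DAYS = 90  # roughly the three-month window shown in Figure 1
growth_ceiling_gb = 20 * DAYS

print(f"code sync: {code_rate:.1f} MB/s")
print(f"art sync:  {art_rate:.1f} MB/s")
print(f"{DAYS}-day growth ceiling: {growth_ceiling_gb} GB")
```

Under these assumptions a full sync averages roughly 40 to 50 MB/s per client, and peak-rate growth would approach 1.8 TB over three months.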
We also use a complex system of file types to manage data: +S flags are applied to binaries
where appropriate, and file locks manage user access and submits. Internally we are
also publishing software at a consumer level for our artists and designers. This software
package undergoes a basic software release cycle and its release depends on game content
progression and vice versa. This can make for a very difficult release window. The studio
currently works on more than one game at a time, so we generally have a split audience that
may be in different places within their projects.
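As an illustration of the file type scheme mentioned above, the sketch below renders a Perforce typemap specification from a small rule table. The depot paths and the choice of which extensions receive which type are hypothetical and do not reflect our actual configuration. The modifiers themselves are standard Perforce ones: +S stores only the head revision, and +l restricts a file to one open at a time (exclusive lock).

```python
# Hypothetical rules: (Perforce file type, depot path pattern).
# binary+S keeps only the head revision of generated binaries.
# binary+l lets only one user open the file at a time (exclusive lock).
TYPEMAP_RULES = [
    ("binary+S", "//depot/builds/....exe"),
    ("binary+S", "//depot/builds/....pdb"),
    ("binary+l", "//depot/art/....psd"),
    ("binary+l", "//depot/art/....max"),
]


def render_typemap(rules):
    """Return the text of a typemap spec with one tab-indented rule per line."""
    lines = ["TypeMap:"]
    lines.extend(f"\t{filetype} {pattern}" for filetype, pattern in rules)
    return "\n".join(lines) + "\n"


spec = render_typemap(TYPEMAP_RULES)
print(spec)
```

A spec of this shape could be reviewed and then loaded by an administrator with `p4 typemap -i`; keeping the rules in a table makes it easier to see at a glance which assets are locked and which discard old revisions.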
From an IT perspective, we had aging hardware and processes, plus general support issues
keeping up with business demand. We recommissioned hardware to virtual machines (VMs)
and restructured existing storage as well as possible to accommodate a better layout and
space usage. Ultimately, these experiences and experiments led to further engineering of the
platform that Perforce and other critical systems would use in our environment.
Figure 1: Disk growth for 3 months
IT Strategic Goals
As IT, we focus our goals around what the business needs to stay viable and to produce its
product. For games this is very similar to many other industries, with the caveat that almost
everything must be ready or available immediately, if not faster. This can be a very painful
experience if your infrastructure is not prepared to handle or adapt to the business. To begin
designing and building our infrastructure, we established several pillars that were dictated by
the needs of the business and our experiences supporting it:
• Space
This may seem straightforward, but how space is approached from a logical and
management perspective makes a big difference when in the middle of solving
issues. The space needs to be not only available but also abstracted into suitable
layers so that applications can be applied in real time without major
reconfiguration.
• Speed
This is something that all SAN/storage users will have concerns about. When you
are running all your VMs and major storage from a single platform, speed is a
critical component of making the system work without causing slowness to users.
• Flexibility
Although tied to several of the other pillars, flexibility needs to stand alone because
this is a major gate to how you use and manage your resources, whether it be a
choice of protocols or how quickly you spin up new VMs or hardware.
• Reliability
Redundancy should be inherent in any solid infrastructure design and the
components you choose should make this a seamless effort instead of requiring a
dedicated action to provide robustness. If a separate architecture plan is necessary
to provide reliability, holes may emerge over time, or this activity may eat up too
much time.
• Management
The glue that brings all the pieces together and allows insight into how the systems
are running is critical. Without proper management, issues will arise in the other
pillars that can trickle down. Management of the platform as a whole should not be
a complex endeavor, and the platform should be self-documenting or easy to document.
Solutions
As an organization, how do we accomplish all of our goals and realize our vision? We had
already established a solid layer 2 and 3 network with 10 Gb distribution, a core component.
On top of this we started working on migrating and testing all services to storage via IP
protocols to eliminate cabling and complexity (see Figure 2).
With this base IP and storage layer, we started to test VMware as a platform to support our OS
and applications. In these early iterations, we used Internet Small Computer System
Interface (iSCSI) for our VM storage layer as well as any mass storage for applications and
Perforce. We had some success with this platform and were able to keep our performance
benchmark for Perforce, via the p4bench tool, in the top 20 platforms on the Perforce bench
database at the time. Ultimately, this platform had some inherent management flaws and
required hardware upgrades and processes to be engineered further.
Figure 2: Service stack
Our storage platform is a key component of our infrastructure design. We had issues with
iSCSI being agile enough to keep up; logical unit number (LUN) management becomes a
major burden for a small team. iSCSI also has inherent configuration complexity at the network
layer to provide redundancy and performance. Through our testing we found that Network File
System (NFS) provided equal or better performance while offering several advantages.
First, there is the notion of true thin-provisioned volumes and volumes that can be expanded
on the fly without downtime or LUN changes. Second, VMware also has very mature support
for NFS, which enables clustering and redundancy by default. The engine to power this
storage platform is NetApp. With its mature feature set and collapsed protocol support, we
were able to leverage NFS while also having support for FC and iSCSI if so required.
We are able to leverage NFS at our OS layer as well to provide dynamic support for growing
or shrinking volumes, a key component of our growth plan for Perforce revision file storage.
With the NetApp platform, we also have the advantage of snapshot technologies that allow for
granular file-level snapshots and restores that do not chew up space. For example, we are able to
retrieve a single file from a deleted shelved changelist if so required, without downtime or any
overhead (granted, we try to avoid this).
Through VM evaluation and performance testing over the years, we have found that we need
servers that are wide and fast. The more RAM we can give, the better for disk caching. To
accommodate the fastest execution of database jobs, we also want the fastest cores.
We have only two cores on each of our Perforce servers, but these run at 3.4 GHz. We
have been using the Cisco UCS platform to provide our compute resources for VMware. With
this platform, we have inherent abstraction at the hardware layer, which allows a
blade or other hardware to be swapped or upgraded and its profile reassigned to
automatically configure network interface controllers (NICs), connections, and other
characteristics. This allows for quick additions or replacements. The virtualization at the NIC
layer is an important factor for our installation. We use several virtualized NICs to connect
directly to the 10 Gb backbone at the VM host layer as well as direct pass-through technology
at the OS guest level to connect to Perforce NFS data.
At the OS level, we are a RHEL shop, which has served us well as a layer on top of virtualization.
Linux utilizes virtual resources better than most OSes and provides the best way to operate a
thin OS with very little service overhead. We have allocated 32 GB of memory to our
production Perforce servers.
With the disk cache system in Linux and having large amounts of RAM in our cluster, we are
able to get excellent throughput from both VMDKs and mounted NFS, achieving 440+ MB/s to
the desktop Perforce client. Our Perforce instances are run in the standard fashion with some
custom scripts for management and startup of services.
We use proxies for remote work as well as in-house high-demand applications such as build
machines. We are moving forward with full broker front-ended servers with replicas to offload
read-only data and checkpoints. With this model we are able to distribute loads seamlessly,
and with our revision files being stored on NFS we can have a single source mounted
read/write or read-only for production and read-only replicas without needing to sync/replicate
revision files.
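The load distribution described above can be modeled in a few lines. The sketch below is a conceptual illustration of the routing decision only, not Perforce Broker configuration syntax: the host names are hypothetical, and the set of commands treated as read-only is a small illustrative subset chosen for this example.

```python
from itertools import cycle

# Illustrative subset of commands that only read metadata or file content.
READ_ONLY_COMMANDS = frozenset(
    {"changes", "describe", "filelog", "files", "fstat", "print"}
)


class Router:
    """Send read-only commands to replicas in rotation and everything else to the master."""

    def __init__(self, master, replicas):
        self.master = master
        # cycle() yields the replicas round-robin; None means no replicas are configured.
        self._replicas = cycle(replicas) if replicas else None

    def target_for(self, command):
        if command in READ_ONLY_COMMANDS and self._replicas is not None:
            return next(self._replicas)
        return self.master


# Hypothetical server addresses.
router = Router("p4-master:1666", ["p4-replica-1:1666", "p4-replica-2:1666"])
for cmd in ("submit", "files", "changes", "filelog"):
    print(cmd, "->", router.target_for(cmd))
```

Because the revision files live on a shared NFS volume in this design, a replica serving these commands needs only its own copy of the metadata, which is what makes this kind of routing inexpensive to add.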
Reliability is a critical factor for our system design. At the storage level, we have redundancy at
the NetApp array level with automated SnapMirror replication. At the volume level, we use Snapshot
technology to provide a month of on-array backups. This is the functionality that enables us to
restore single files directly back to depots/servers on demand with no outage time. At layers
2 and 3, we have redundancy through multi-homed connections to our UCS chassis. At
the VMware level, we have redundancy via VMware HA and clustering. At the Perforce level,
we are using replicas, proxies, and checkpoints for integrity, load balancing, and backups.
Finally, to glue our systems together, we use a suite of management tools. Each layer (NetApp,
UCS, and VMware) has its own management and application monitoring tool. With each of these
tools, we can manage and allocate resources quickly as well as get advanced notifications for
system usage or failures.
Figure 3: Logical Perforce
Conclusion
We have found Perforce capable of scaling to the needs of high-speed, data-intensive
development processes. Accomplishing this, however, has required many rounds of
engineering and investment in infrastructure technologies. We chose these platforms
based on our own experience, which means other situations may require different solutions. However,
the general performance and design philosophies can be applied. Ultimately, one of the largest
gains for modern infrastructure is virtualization, and we have made design choices to abstract
at each layer of the platform. This allows easy upgrades, migration, and management of
growth without changing how we use the platform or the services it provides.