Hear about the multi-server Perforce architecture used at Remedy Entertainment, a developer of state-of-the-art action games, game franchises, and cutting-edge technology. Get tips on how virtualization, storage technologies, and the new distributed Perforce server features can be used to achieve high availability and quick recovery in different disaster scenarios. Handling large game content files, and the dependencies between game code and content assets, will also be covered.
3. #
• Privately held game developer based in Finland.
• Released games: Death Rally, Max Payne, Max Payne 2: The Fall of Max Payne, Alan Wake, Alan Wake’s American Nightmare, Death Rally Mobile.
• Franchises made into a movie, a TV series & a novel.
• Announced titles: Agents of Storm for iOS and the Xbox One exclusive title Quantum Break.
4. #
• Founded in 1995, currently 120+ employees.
• Over 100 Game of the Year awards.
• Franchises generated over $500M revenue.
• Max Payne IP sold for $43M.
• AAA games sold over 11M units.
• First mobile experiment: over 16M downloads and reached #1 in 70 countries.
7. #
Created by Remedy since 2004                  # of files    Total size     # of files > 100 MB
All projects, all revisions                   10.5 million  12 terabytes
All projects, #head revisions                 5 million     5.5 terabytes
Alan Wake (XBOX 360), #head                   1.1 million   920 gigabytes  1,300
Quantum Break (XBOX One, until today), #head  3 million     4.3 terabytes  7,000
Perforce Database                                           30 gigabytes
8. #
• Large content files
• Dependencies of game engine <-> internal tools <-> game content (in proprietary formats)
9. #
[Diagram of the build/content pipeline. Nodes: Tools source code, Tool binaries, 3rd party tools, Content source, Game source code, Export util source code, Export util, Runtime game binary, Runtime content]
10. #
• Large content files
• Dependencies of game engine <-> internal tools <-> game content (in proprietary formats)
• Everything that comes out comes from the Perforce depot
– Availability of the system is business critical
12. #
• System design approach
• Service implementation
• Principles of HA engineering
1. Elimination of single points of failure
2. Reliable crossover
3. Detection of failures as they occur.
• Source:
http://en.wikipedia.org/wiki/High_availability
13. #
• Client and access network don’t have HA
– Opting for fast manual response
• LAN core w/ act/act redundancy
• Servers with failover
• SAN w/ active/active redundancy
• Storage w/ redundant components
14. #
• HA design principles do not cover the concept of backups
– Even when HA is taken care of, data and availability can be lost by user actions and software failures
– The data still needs to be copied to offline storage for disaster recovery purposes
15. #
• Client and access network don’t have HA
– Opting for fast manual response
• LAN core w/ act/act redundancy
• Servers with failover
• SAN w/ active/active redundancy
• Storage w/ redundant components
17. #
• Used for offloading backups and integrity verification
• Covers application-level failures
• Activation requires manual intervention
[Diagram: primary servers perforce2:1666 and perforce3:1666, each replicated to a p4d instance on perforce1 (ports 1666 and 1667)]
18. #
• Snapshot of Perforce every 4 hours
• Runs a storage-provided snapshot under “p4d -c”
– Ensures database integrity
– Locks database for 30-50 seconds
• Near-instant recovery
• Can be mounted and exported to other hosts
– To run checkpoint, verify, …
– To run a test environment with production data
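The `p4d -c` trick above can be sketched as a small cron-driven wrapper. This is a hedged sketch only: the Perforce root path and the storage CLI name (`storage-snapshot`) are hypothetical placeholders, not the actual Remedy setup.

```shell
# Sketch of a 4-hour snapshot job. "p4d -c CMD" commits pending
# changes to disk and holds the database locked while CMD runs,
# so the storage snapshot sees a consistent database.
# All paths and the "storage-snapshot" CLI are hypothetical.
snapshot_cmd() {
  # $1 = P4ROOT, $2 = storage snapshot command; prints the
  # p4d invocation instead of running it (for cron, eval it)
  printf 'p4d -r %s -c "%s"' "$1" "$2"
}

# In cron, this would be executed, e.g.:
#   eval "$(snapshot_cmd /p4/root 'storage-snapshot p4-volumes')"
snapshot_cmd /p4/root 'storage-snapshot p4-volumes'
```

Building the command in a function keeps the lock-and-snapshot step in one place and makes it easy to dry-run before scheduling.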
20. #
• “A user may never see a failure. But the maintenance activity must.”
• Infrastructure monitored with vendor tools
• Central monitoring with Nagios
– P4D process, TCP connectivity to perforce:1666
– Check “p4 info” output
– Replication: check “changelist” counter on both partners
• P4review.py
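The replication check above (comparing the “changelist” counter on both partners) could look roughly like this Nagios-style sketch. The host names follow the deck’s examples; the lag threshold and the wiring are assumptions, not the actual monitoring script.

```shell
# Sketch of a replication-lag check between a master/replica pair.
# Reads the "change" counter (last changelist number) on each server
# and alerts if the replica trails the master by more than a threshold.
change_counter() {
  p4 -p "$1" counter change   # last changelist number on that server
}

lag_ok() {
  # $1 = master counter, $2 = replica counter, $3 = max allowed lag
  [ $(( $1 - $2 )) -le "$3" ]
}

# Example wiring (commented out -- needs live servers):
#   m=$(change_counter perforce2:1666)
#   r=$(change_counter perforce1:1666)
#   lag_ok "$m" "$r" 10 || echo "CRITICAL: replica lagging behind master"
```

A pure comparison function keeps the alert logic testable separately from the `p4` calls.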
21. #
• Define what HA means for your service
• Build it one step at a time
– Ensure redundancy of each component
– Make sure the component is monitored
• Backups are still needed
23. #
• Introduction to Remedy
• Perforce at Remedy
• High Availability
• Perforce Application Availability
• Monitoring
• Conclusions
24. #
Jouko Markkanen is an IT Manager at Remedy Entertainment
with broad experience in different areas of information and
communications technology including help desk responsibilities,
programming, application design, security systems, information
management, and infrastructure planning and design.
Editor's Notes
E3 Sofia scene
Agents of Storm in beta
Quantum Break to be released in 2015
- 1996 Virtual Reality 3D Mark
- Spin-off company Futuremark in 1997
- Over 120 employees from over 15 different countries
- TOP-50 growth companies in Finland
Remedy has been using Perforce as the sole SCM system for over 10 years. During that time, we have created several AAA console/PC games, as well as mobile games. The biggest production has been Alan Wake, and while still in-production, Quantum Break has already multiple times the number and size of files in our depot.
So far we have created over 10 million files, with a total size of 12 terabytes in 300K+ changelists. While the Perforce database is sized modestly at 30 gigabytes, the average file size is 1.2MB, and Quantum Break has over 7000 files sized over 100MB, and hundreds of files over 1GB in size.
This distribution is not typical for a “regular” software project, and the performance problems lie more in “how to copy the mass of files to/from the client”, instead of “how to manage the complex database metadata of a huge number of files”.
There are over 100 people working on QB, committing program code and content to our Perforce server. Most of them work on content production, using in-house tools to edit the game world. These tools export the content from the proprietary source format to a binary format that the game engine can present in real time.
A lot of the program code is shared between the tools and the game engine, to ensure a matching presentation. This means that when the format is updated, for example due to a new feature in the engine, the whole dependency chain must be rebuilt and redistributed, all of the existing content must be re-exported, and sometimes even the existing content source needs to be upgraded. This raises the whole version control and integration/delivery methodology to a different level of complexity.
The dark-grey boxes in the diagram depict the binaries built and delivered by our automated build system, or built locally for local testing and modifications. The dark-red boxes depict files stored in our Perforce server, and at the same time they are the files almost all the work on the project is done on. Which brings us to…
… the importance of the Perforce service in our company. Everything that is delivered with our final product, the game, comes out of the content stored in our Perforce. This makes it the #1 business-critical IT service at Remedy, and its availability is the top priority.
Before we delve into how we ensure HA of our Perforce service, let’s discuss what the term actually means. As is common in this age, we’ll start with a “common” definition, i.e., what Wikipedia writes about HA.
Wikipedia defines HA as a system design approach, and associated service implementation. Their purpose is to ensure a certain, specified, level of performance, specifically level of availability.
Three principles for practicing this approach and implementation are listed.
The first is to get rid of single points of failure, so that any single component needed to produce the service can fail without disrupting the service. This is usually achieved by means of redundancy, that is by duplicating, or multiplying, all components of the system.
The second is to provide means of transferring service reliably from a failed component to the redundant counterpart.
And to allow this, the third principle is automatic failure monitoring and detection; without it, the tasks of a failed component will never be transferred, until its redundant counterparts have also failed and availability is lost.
Next, we’ll go through the IT infrastructure layers we use to provide the Perforce service, and how their redundancy has been provided.
We’ll start off with storage: we use a shared storage system, which provides storage space with different characteristics, such as high IOPS for database storage and more inexpensive bulk storage for versioned files. This system is shown as a single entity, but in reality it is spread across multiple storage chassis, each having a RAID-style redundant array of disks, redundant power supplies, and redundant controller modules, so that there are no single points of failure.
Access to the shared storage system is provided via an iSCSI storage area network. This has simple redundancy: there are two switches, with active paths carried on both, so if one of them fails, the other continues storage operations.
On the next level are the servers. Our Perforce servers (there are multiple; more about them later) are virtualized and run on a cluster of hypervisors. Under normal conditions they are distributed across different physical hosts, so if one host fails, the other VMs keep running. The cluster also monitors itself, and in case of a host failure the VMs are restarted on the remaining hosts.
LAN connectivity between the servers as well as towards the clients is also redundant, with several technologies providing redundancy on different levels of the OSI network model.
At the final stage of the client-server path are the access network (“floor switch”) and the client computer itself. These do not have HA as such, as a failure there has a very limited area of effect. However, we are prepared for failures here as well: we keep spare access switches in store so they can be swapped manually but quickly in case of a failure, and the same applies to the computers and/or their components.
Even at this level the system is not foolproof and cannot guarantee 100% availability. Many of the crossover paths used to provide HA are built on automatic monitoring and software features, and software, for example, tends to have bugs. Also, mere humans use, administer, and operate the system, and humans tend to make errors. And even a well-designed and well-implemented HA system does not protect you if the datacenter is consumed by fire or flooded with water.
To protect from this, proper backups must be planned, made and tested.
This completes the HA architecture diagram. The backups are created with dedicated hardware and stored on a dedicated storage system, which should preferably reside offsite. We create backups on a dedicated system onsite, for faster recoveries, but replicate the backup content to an offsite datacenter for ultimate disaster recovery.
There is still one level of failures not covered by the generic HA IT infrastructure, but we can prepare for those as well: the case where the Perforce software itself fails for one reason or another (this is rare, but it has happened, to us as well).
We currently have two different primary Perforce servers. We have split projects across the servers to gain some performance scalability, and to allow one project to undergo maintenance without disturbing the other. This is possible because we have projects that use different game engines; projects that share a game engine are located on the same server, in different depots.
Both of the servers are replicated using Perforce pull replication to a third server. That server runs two P4D processes on different ports.
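A pull-replica setup of this kind is typically driven by Perforce configurables. The following is a hypothetical sketch, not Remedy’s actual configuration: the server ID `replica1` and the master address are placeholders, and the pull intervals are arbitrary.

```shell
# Hypothetical pull-replica configuration, set on the master with
# "p4 configure set". The replica p4d reads these at startup.
p4 configure set replica1#P4TARGET=perforce2:1666      # master to pull from
p4 configure set replica1#startup.1="pull -i 1"        # metadata pull thread, every 1s
p4 configure set replica1#startup.2="pull -u -i 1"     # versioned-file pull thread
p4 configure set replica1#db.replication=readonly      # reject metadata writes on replica
p4 configure set replica1#lbr.replication=readonly     # reject archive writes on replica
```

With two masters, the third host would run two p4d instances (e.g. on ports 1666 and 1667), each configured this way against its own master.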
The replicas serve two main purposes. One is to allow checkpoint and verify operations to run without interrupting end-user service. We only run these operations on the primary servers during scheduled maintenance breaks, when needed.
The other purpose is to have a fallback in case the master server fails irrecoverably. In this case, we need to manually change the replica server configuration (to allow write operations) and point the clients to the failover replica. As the clients use an alias name to find the server, we can change that alias to point to the replica, and as the name record has a 5-minute TTL, all clients are good to run within that period.
Naturally, this only helps if the master server has failed in such a way that the failure has not been replicated to the replica’s database or versioned files. If it has, we can always resort to recovering from backups. However, as a full recovery takes some time (currently almost 24 hours to copy all data back from the backups), we have another way…
… the storage snapshots. Many modern storage systems have a feature to create a near-instant point-in-time snapshot within the storage hardware (or rather, by the software running the hardware). To ensure the integrity of the snapshot, we have scheduled a command to run every 4 hours. This command is a “p4d -c …”, which tells the Perforce server to commit any pending changes to disk and lock the database while running a command; in this case the command tells the storage system to snapshot all the volumes assigned to that Perforce server.
The snapshot takes around 30-50 seconds with our system, during which time the Perforce server will hold all write operations. Compared to the almost two hours that creating a checkpoint of our database takes, this is pretty fast.
Mounting this snapshot back to the server also takes less than a minute. In case of a failure, it is possible to mount the previous snapshot in an alternate directory and start the Perforce server from there, while keeping the failed volumes online for further investigation.
But the use of the snapshots does not end there. You can also keep production running while mounting a snapshot on the same or another server, start a P4D on the snapshot folder, and use that, e.g., to test upgrades or configuration changes with real-world data, without touching production.
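Bringing up a second p4d on a mounted snapshot can be sketched as follows. The mount point and port are hypothetical placeholders; the point is only that the test instance uses its own root and port, so production is untouched.

```shell
# Sketch: start a test p4d on a mounted snapshot volume.
# /mnt/p4snap and port 1668 are hypothetical placeholders.
SNAP_ROOT=/mnt/p4snap            # snapshot volume, mounted read-write
p4d -r "$SNAP_ROOT" -p 1668 -d   # daemonize a separate instance on port 1668

# Then point a client at the test instance, e.g.:
#   p4 -p localhost:1668 info
```

The same pattern serves both recovery drills and upgrade rehearsals against real-world data.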
HA allows the system to recover when a single failure occurs. But the whole concept of High Availability is void if the environment is not monitored and single failures are not corrected before another failure occurs and brings the system down.
In Wikipedia, the third principle of HA engineering says that “A user may never see a failure. But the maintenance activity must.”
The HA infrastructure stack is being monitored by tools provided by individual component vendors. They are configured to send an alert email in case a component fails, and many even have a feature that notifies the vendor tech support autonomously.
We have a central monitoring system, utilizing Nagios, that monitors our Perforce servers in addition to the hardware and OS environment and their resources. There are several checks that verify that the P4D process is running, and a custom script that checks the replication status.
The p4review daemon is also a good monitoring tool (although this is not its primary purpose). We run p4review every two minutes, and if the server has any problems, it fails, sending a failure report to the admin contact email.