Designing a Highly Available Environment Using Methods of Modern IT Infrastructure
# 
Jouko Markkanen 
IT Manager
#
# 
• Privately held game developer based in Finland.
• Released games: Death Rally, Max Payne, Max Payne 2: The Fall of Max Payne, Alan Wake, Alan Wake’s American Nightmare, Death Rally Mobile.
• Franchises made into a movie, a TV series, and a novel.
• Announced titles: Agents of Storm for iOS and the Xbox One exclusive Quantum Break.
# 
• Founded in 1995, currently 120+ employees.
• Over 100 Game of the Year awards.
• Franchises have generated over $500M in revenue.
• Max Payne IP sold for $43M.
• AAA games have sold over 11M units.
• First mobile experiment had over 16M downloads and reached #1 in 70 countries.
#
# 
• Large content files
# 
Created by Remedy since 2004

| | # of files | Total size | # of files > 100 MB |
|---|---|---|---|
| All projects, all revisions | 10.5 million | 12 terabytes | |
| All projects, #head revisions | 5 million | 5.5 terabytes | |
| Alan Wake (Xbox 360), #head | 1.1 million | 920 gigabytes | 1,300 |
| Quantum Break (Xbox One, until today), #head | 3 million | 4.3 terabytes | 7,000 |
| Perforce Database | | 30 gigabytes | |
# 
• Large content files
• Dependencies of game engine <-> internal tools <-> game content (in proprietary formats)
# 
[Diagram: dependencies between tools source code, tool binaries, 3rd party tools, content source, game source code, export util source code, export util, runtime game binary, and runtime content]
# 
• Large content files
• Dependencies of game engine <-> internal tools <-> game content (in proprietary formats)
• Everything that comes out comes from the Perforce depot
– Availability of the system is business critical
#
# 
• System design approach 
• Service implementation 
• Principles of HA engineering 
1. Elimination of single points of failure 
2. Reliable crossover 
3. Detection of failures as they occur. 
• Source: http://en.wikipedia.org/wiki/High_availability
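The first principle, elimination of single points of failure, can be illustrated with a little arithmetic. The sketch below is not from the talk: it assumes component failures are independent and uses a made-up availability of 0.99 per component, purely to show how pairing each component changes the availability of a three-component chain.

```python
from math import prod


def series_availability(components):
    """Availability of a chain in which every component must be up.

    Each component is a single point of failure, so availabilities multiply.
    """
    return prod(components)


def redundant_availability(availability, copies):
    """Availability of a group that works if at least one copy is up.

    Assumes the copies fail independently of each other.
    """
    return 1 - (1 - availability) ** copies


# Hypothetical figures: three components in a row, each 99% available.
single = series_availability([0.99, 0.99, 0.99])

# The same chain with every component duplicated.
paired = series_availability([redundant_availability(0.99, 2)] * 3)

print(f"no redundancy:  {single:.4f}")
print(f"paired copies:  {paired:.4f}")
```

Independence is the optimistic part of this model. Shared power, shared firmware, or an unreliable crossover all reduce the real gain, which is why the other two principles appear in the list.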
# 
• Client and access network don’t have HA
– Opting for fast manual response
• LAN core with active/active redundancy
• Servers with failover
• SAN with active/active redundancy
• Storage with redundant components
# 
• HA design principles do not cover the concept of backups
– Even when HA is taken care of, data and availability can be lost by user actions and software failures
– The data still needs to be copied to offline storage for disaster recovery purposes
#
# 
• Used for offloading backups and integrity verification
• Covers application-level failures
• Activation requires manual intervention
[Diagram: Perforce server instances perforce1:1666, perforce1:1667, perforce2:1666, perforce3:1666]
# 
• Snapshot of Perforce every 4 hours
• Runs the storage-provided snapshot with “p4d -c”
– Ensures database integrity
– Locks database for 30-50 seconds
• Near-instant recovery
• Can be mounted and exported to other hosts
– To run checkpoint, verify, …
– To run test environment with production data
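The snapshot step above can be sketched as a small wrapper. `p4d -c` locks the database tables, runs the given command, and then unlocks them, so the snapshot sees a consistent database. The server root `/p4/root` and the `storage-snapshot` command with its `--volume` flag are hypothetical placeholders for whatever the storage vendor provides; this is not the script used at Remedy.

```python
import shlex
import subprocess


def build_locked_snapshot_command(p4root, snapshot_command):
    """Build the argv that runs snapshot_command under the p4d database lock.

    p4root           -- Perforce server root directory (assumed path)
    snapshot_command -- argv list of the storage snapshot tool (hypothetical)
    """
    # p4d -c expects the command as one string, so quote and join the argv.
    return ["p4d", "-r", p4root, "-c", shlex.join(snapshot_command)]


def take_snapshot(p4root, snapshot_command, dry_run=True):
    """Run the locked snapshot, or only return the argv when dry_run is set."""
    argv = build_locked_snapshot_command(p4root, snapshot_command)
    if not dry_run:
        # Raises CalledProcessError if p4d or the snapshot tool fails.
        subprocess.run(argv, check=True)
    return argv


# Dry run with placeholder values; nothing is executed here.
argv = take_snapshot(
    "/p4/root",
    ["/usr/local/bin/storage-snapshot", "--volume", "p4data"],
)
print(argv)
```

A scheduler such as cron could call `take_snapshot(..., dry_run=False)` every four hours. Because the database stays locked while the command runs, the snapshot tool should return as soon as the snapshot is taken.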
#
# 
• “A user may never see a failure. But the maintenance activity must.”
• Infrastructure monitored with vendor tools 
• Central monitoring with Nagios 
– P4D process, TCP connectivity to perforce:1666 
– Check “p4 info” output 
– Replication: check “changelist” counter on both partners 
• P4review.py
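The connectivity and replication checks above could be written as a Nagios-style plugin along these lines. This is a minimal sketch, not the check used in the talk. It assumes the "changelist" counter on the slide corresponds to the `change` counter read with `p4 counter change`, and the lag thresholds are invented for illustration. The exit codes 0, 1, 2, and 3 follow the usual Nagios plugin convention (OK, WARNING, CRITICAL, UNKNOWN).

```python
import socket
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def tcp_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def read_change_counter(p4port):
    """Read the latest changelist number from a server via the p4 client."""
    result = subprocess.run(
        ["p4", "-p", p4port, "counter", "change"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return int(result.stdout.strip())


def evaluate_replication(master_change, replica_change, warn_lag=5, crit_lag=50):
    """Map the changelist gap between the partners to a status and message.

    warn_lag and crit_lag are hypothetical thresholds, in changelists.
    """
    lag = master_change - replica_change
    if lag < 0:
        return UNKNOWN, f"UNKNOWN - replica is {-lag} changelists ahead of master"
    if lag >= crit_lag:
        return CRITICAL, f"CRITICAL - replica is {lag} changelists behind"
    if lag >= warn_lag:
        return WARNING, f"WARNING - replica is {lag} changelists behind"
    return OK, f"OK - replica is {lag} changelists behind"


# Example with literal counter values instead of live servers.
status, message = evaluate_replication(120450, 120443)
print(status, message)
```

In a real plugin the two counters would come from `read_change_counter("perforce1:1666")` and the replica's address, `tcp_reachable` would run first, and the script would print the message and exit with the status code.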
# 
• Define what HA means for your service 
• Build it one step at a time 
– Ensure redundancy of each component 
– Make sure the component is monitored 
• Backups are still needed
# 
Jouko Markkanen 
jouko@remedygames.com
# 
• Introduction to Remedy 
• Perforce at Remedy 
• High Availability 
• Perforce Application Availability 
• Monitoring 
• Conclusions
# 
Jouko Markkanen is an IT Manager at Remedy Entertainment 
with broad experience in different areas of information and 
communications technology including help desk responsibilities, 
programming, application design, security systems, information 
management, and infrastructure planning and design.

More Related Content

What's hot

Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical NagiosNagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical NagiosNagios
 
Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips...
Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips...Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips...
Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips...Nagios
 
CCNA NAT (Network Address Translation)
CCNA NAT (Network Address Translation)CCNA NAT (Network Address Translation)
CCNA NAT (Network Address Translation)Networkel
 
CCNA site-to-site connectivity security
CCNA  site-to-site connectivity securityCCNA  site-to-site connectivity security
CCNA site-to-site connectivity securityNetworkel
 
The Day of the Updates
The Day of the UpdatesThe Day of the Updates
The Day of the UpdatesItzik Kotler
 
y3dips hacking priv8 network
y3dips hacking priv8 networky3dips hacking priv8 network
y3dips hacking priv8 networkidsecconf
 
CCNA Advanced EIGRP Configuration and Troubleshooting
CCNA Advanced EIGRP Configuration and TroubleshootingCCNA Advanced EIGRP Configuration and Troubleshooting
CCNA Advanced EIGRP Configuration and TroubleshootingNetworkel
 
A Byte of Software Deployment
A Byte of Software DeploymentA Byte of Software Deployment
A Byte of Software DeploymentGong Haibing
 
BKK16-309A Open Platform support in UEFI
BKK16-309A Open Platform support in UEFIBKK16-309A Open Platform support in UEFI
BKK16-309A Open Platform support in UEFILinaro
 
KVM/ARM Nested Virtualization Support and Performance - SFO17-410
KVM/ARM Nested Virtualization Support and Performance - SFO17-410KVM/ARM Nested Virtualization Support and Performance - SFO17-410
KVM/ARM Nested Virtualization Support and Performance - SFO17-410Linaro
 
Preparing for SRE Interviews
Preparing for SRE InterviewsPreparing for SRE Interviews
Preparing for SRE InterviewsShivam Mitra
 
Sierra Wireless Developer Day 2013 - Show&Tell 5 - Simple PnP SMS Gateway
Sierra Wireless Developer Day 2013 - Show&Tell 5 - Simple PnP SMS GatewaySierra Wireless Developer Day 2013 - Show&Tell 5 - Simple PnP SMS Gateway
Sierra Wireless Developer Day 2013 - Show&Tell 5 - Simple PnP SMS GatewayThibault Cantegrel
 
CCNA link aggregation
CCNA  link aggregationCCNA  link aggregation
CCNA link aggregationNetworkel
 
CCNA point to point
CCNA  point to pointCCNA  point to point
CCNA point to pointNetworkel
 
CCNA EIGRP Overview and Basic Configuration
CCNA EIGRP Overview and Basic ConfigurationCCNA EIGRP Overview and Basic Configuration
CCNA EIGRP Overview and Basic ConfigurationNetworkel
 

What's hot (20)

Automation Evolution with Junos
Automation Evolution with JunosAutomation Evolution with Junos
Automation Evolution with Junos
 
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical NagiosNagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
Nagios Conference 2014 - Leland Lammert - Distributed Heirarchical Nagios
 
Drone Hijacking
Drone HijackingDrone Hijacking
Drone Hijacking
 
Into The Box 2018 CI Deep Dive
Into The Box 2018   CI Deep DiveInto The Box 2018   CI Deep Dive
Into The Box 2018 CI Deep Dive
 
Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips...
Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips...Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips...
Nagios Conference 2011 - Nate Broderick - Nagios XI Large Implementation Tips...
 
CCNA NAT (Network Address Translation)
CCNA NAT (Network Address Translation)CCNA NAT (Network Address Translation)
CCNA NAT (Network Address Translation)
 
CCNA site-to-site connectivity security
CCNA  site-to-site connectivity securityCCNA  site-to-site connectivity security
CCNA site-to-site connectivity security
 
CiScoPresentation
CiScoPresentationCiScoPresentation
CiScoPresentation
 
The Day of the Updates
The Day of the UpdatesThe Day of the Updates
The Day of the Updates
 
y3dips hacking priv8 network
y3dips hacking priv8 networky3dips hacking priv8 network
y3dips hacking priv8 network
 
CCNA Advanced EIGRP Configuration and Troubleshooting
CCNA Advanced EIGRP Configuration and TroubleshootingCCNA Advanced EIGRP Configuration and Troubleshooting
CCNA Advanced EIGRP Configuration and Troubleshooting
 
A Byte of Software Deployment
A Byte of Software DeploymentA Byte of Software Deployment
A Byte of Software Deployment
 
BKK16-309A Open Platform support in UEFI
BKK16-309A Open Platform support in UEFIBKK16-309A Open Platform support in UEFI
BKK16-309A Open Platform support in UEFI
 
KVM/ARM Nested Virtualization Support and Performance - SFO17-410
KVM/ARM Nested Virtualization Support and Performance - SFO17-410KVM/ARM Nested Virtualization Support and Performance - SFO17-410
KVM/ARM Nested Virtualization Support and Performance - SFO17-410
 
Preparing for SRE Interviews
Preparing for SRE InterviewsPreparing for SRE Interviews
Preparing for SRE Interviews
 
Sierra Wireless Developer Day 2013 - Show&Tell 5 - Simple PnP SMS Gateway
Sierra Wireless Developer Day 2013 - Show&Tell 5 - Simple PnP SMS GatewaySierra Wireless Developer Day 2013 - Show&Tell 5 - Simple PnP SMS Gateway
Sierra Wireless Developer Day 2013 - Show&Tell 5 - Simple PnP SMS Gateway
 
CCNA link aggregation
CCNA  link aggregationCCNA  link aggregation
CCNA link aggregation
 
Unix tc
Unix tcUnix tc
Unix tc
 
CCNA point to point
CCNA  point to pointCCNA  point to point
CCNA point to point
 
CCNA EIGRP Overview and Basic Configuration
CCNA EIGRP Overview and Basic ConfigurationCCNA EIGRP Overview and Basic Configuration
CCNA EIGRP Overview and Basic Configuration
 

Viewers also liked

Using Umbra Spatial Data for Visibility and Audio Propagation in Quantum Break
Using Umbra Spatial Data for Visibility and Audio Propagation in Quantum BreakUsing Umbra Spatial Data for Visibility and Audio Propagation in Quantum Break
Using Umbra Spatial Data for Visibility and Audio Propagation in Quantum BreakUmbra
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-renderingmistercteam
 
Qauntum break
Qauntum breakQauntum break
Qauntum breakhalo4robo
 
Unity - Internals: memory and performance
Unity - Internals: memory and performanceUnity - Internals: memory and performance
Unity - Internals: memory and performanceCodemotion
 
Anti-Aliasing Methods in CryENGINE 3
Anti-Aliasing Methods in CryENGINE 3Anti-Aliasing Methods in CryENGINE 3
Anti-Aliasing Methods in CryENGINE 3Tiago Sousa
 
Deferred rendering in Dying Light
Deferred rendering in Dying LightDeferred rendering in Dying Light
Deferred rendering in Dying LightMaciej Jamrozik
 
Siggraph 2011: Occlusion culling in Alan Wake
Siggraph 2011: Occlusion culling in Alan WakeSiggraph 2011: Occlusion culling in Alan Wake
Siggraph 2011: Occlusion culling in Alan WakeUmbra
 
Forward+ (EUROGRAPHICS 2012)
Forward+ (EUROGRAPHICS 2012)Forward+ (EUROGRAPHICS 2012)
Forward+ (EUROGRAPHICS 2012)Takahiro Harada
 

Viewers also liked (8)

Using Umbra Spatial Data for Visibility and Audio Propagation in Quantum Break
Using Umbra Spatial Data for Visibility and Audio Propagation in Quantum BreakUsing Umbra Spatial Data for Visibility and Audio Propagation in Quantum Break
Using Umbra Spatial Data for Visibility and Audio Propagation in Quantum Break
 
Advancements in-tiled-rendering
Advancements in-tiled-renderingAdvancements in-tiled-rendering
Advancements in-tiled-rendering
 
Qauntum break
Qauntum breakQauntum break
Qauntum break
 
Unity - Internals: memory and performance
Unity - Internals: memory and performanceUnity - Internals: memory and performance
Unity - Internals: memory and performance
 
Anti-Aliasing Methods in CryENGINE 3
Anti-Aliasing Methods in CryENGINE 3Anti-Aliasing Methods in CryENGINE 3
Anti-Aliasing Methods in CryENGINE 3
 
Deferred rendering in Dying Light
Deferred rendering in Dying LightDeferred rendering in Dying Light
Deferred rendering in Dying Light
 
Siggraph 2011: Occlusion culling in Alan Wake
Siggraph 2011: Occlusion culling in Alan WakeSiggraph 2011: Occlusion culling in Alan Wake
Siggraph 2011: Occlusion culling in Alan Wake
 
Forward+ (EUROGRAPHICS 2012)
Forward+ (EUROGRAPHICS 2012)Forward+ (EUROGRAPHICS 2012)
Forward+ (EUROGRAPHICS 2012)
 

Similar to Designing a Highly Available Environment Using Methods of Modern IT Infrastructure

Automatize everything
Automatize everythingAutomatize everything
Automatize everythingBoris Bucha
 
Maximize Your Production Effort (English)
Maximize Your Production Effort (English)Maximize Your Production Effort (English)
Maximize Your Production Effort (English)slantsixgames
 
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7Shuo LI
 
DockerCon EU 2015: It's in the game: the path to micro-services at Electronic...
DockerCon EU 2015: It's in the game: the path to micro-services at Electronic...DockerCon EU 2015: It's in the game: the path to micro-services at Electronic...
DockerCon EU 2015: It's in the game: the path to micro-services at Electronic...Docker, Inc.
 
When Tools Attack
When Tools AttackWhen Tools Attack
When Tools AttackPerforce
 
Spot Trading - A case study in continuous delivery for mission critical finan...
Spot Trading - A case study in continuous delivery for mission critical finan...Spot Trading - A case study in continuous delivery for mission critical finan...
Spot Trading - A case study in continuous delivery for mission critical finan...SaltStack
 
Supersize Your Production Pipe
Supersize Your Production PipeSupersize Your Production Pipe
Supersize Your Production Pipeslantsixgames
 
Supersize your production pipe enjmin 2013 v1.1 hd
Supersize your production pipe    enjmin 2013 v1.1 hdSupersize your production pipe    enjmin 2013 v1.1 hd
Supersize your production pipe enjmin 2013 v1.1 hdslantsixgames
 
Game Development Best Practices
Game Development Best PracticesGame Development Best Practices
Game Development Best PracticesPerforce
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_indexChester Chen
 
BSIDES-PR Keynote Hunting for Bad Guys
BSIDES-PR Keynote Hunting for Bad GuysBSIDES-PR Keynote Hunting for Bad Guys
BSIDES-PR Keynote Hunting for Bad GuysJoff Thyer
 
DEF CON 27 - ORANGE TSAI and MEH CHANG - infiltrating corporate intranet like...
DEF CON 27 - ORANGE TSAI and MEH CHANG - infiltrating corporate intranet like...DEF CON 27 - ORANGE TSAI and MEH CHANG - infiltrating corporate intranet like...
DEF CON 27 - ORANGE TSAI and MEH CHANG - infiltrating corporate intranet like...Felipe Prado
 
Global Software Development powered by Perforce
Global Software Development powered by PerforceGlobal Software Development powered by Perforce
Global Software Development powered by PerforcePerforce
 
Taking the Next Hot Mobile Game Live with Docker and IBM SoftLayer
Taking the Next Hot Mobile Game Live with Docker and IBM SoftLayerTaking the Next Hot Mobile Game Live with Docker and IBM SoftLayer
Taking the Next Hot Mobile Game Live with Docker and IBM SoftLayerDaniel Krook
 
Inside the IT Territory game server / Mark Lokshin (IT Territory)
Inside the IT Territory game server / Mark Lokshin (IT Territory)Inside the IT Territory game server / Mark Lokshin (IT Territory)
Inside the IT Territory game server / Mark Lokshin (IT Territory)DevGAMM Conference
 
Practical IoT Exploitation (DEFCON23 IoTVillage) - Lyon Yang
Practical IoT Exploitation (DEFCON23 IoTVillage) - Lyon YangPractical IoT Exploitation (DEFCON23 IoTVillage) - Lyon Yang
Practical IoT Exploitation (DEFCON23 IoTVillage) - Lyon YangLyon Yang
 
Scaling Servers and Storage for Film Assets
Scaling Servers and Storage for Film Assets  Scaling Servers and Storage for Film Assets
Scaling Servers and Storage for Film Assets Perforce
 
Working Well Together: How to Keep High-end Game Development Teams Productive
Working Well Together: How to Keep High-end Game Development Teams ProductiveWorking Well Together: How to Keep High-end Game Development Teams Productive
Working Well Together: How to Keep High-end Game Development Teams ProductivePerforce
 
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...The Linux Foundation
 

Similar to Designing a Highly Available Environment Using Methods of Modern IT Infrastructure (20)

Automatize everything
Automatize everythingAutomatize everything
Automatize everything
 
Maximize Your Production Effort (English)
Maximize Your Production Effort (English)Maximize Your Production Effort (English)
Maximize Your Production Effort (English)
 
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
 
DockerCon EU 2015: It's in the game: the path to micro-services at Electronic...
DockerCon EU 2015: It's in the game: the path to micro-services at Electronic...DockerCon EU 2015: It's in the game: the path to micro-services at Electronic...
DockerCon EU 2015: It's in the game: the path to micro-services at Electronic...
 
When Tools Attack
When Tools AttackWhen Tools Attack
When Tools Attack
 
Spot Trading - A case study in continuous delivery for mission critical finan...
Spot Trading - A case study in continuous delivery for mission critical finan...Spot Trading - A case study in continuous delivery for mission critical finan...
Spot Trading - A case study in continuous delivery for mission critical finan...
 
Supersize Your Production Pipe
Supersize Your Production PipeSupersize Your Production Pipe
Supersize Your Production Pipe
 
Supersize your production pipe enjmin 2013 v1.1 hd
Supersize your production pipe    enjmin 2013 v1.1 hdSupersize your production pipe    enjmin 2013 v1.1 hd
Supersize your production pipe enjmin 2013 v1.1 hd
 
Game Development Best Practices
Game Development Best PracticesGame Development Best Practices
Game Development Best Practices
 
2018 02 20-jeg_index
2018 02 20-jeg_index2018 02 20-jeg_index
2018 02 20-jeg_index
 
BSIDES-PR Keynote Hunting for Bad Guys
BSIDES-PR Keynote Hunting for Bad GuysBSIDES-PR Keynote Hunting for Bad Guys
BSIDES-PR Keynote Hunting for Bad Guys
 
DEF CON 27 - ORANGE TSAI and MEH CHANG - infiltrating corporate intranet like...
DEF CON 27 - ORANGE TSAI and MEH CHANG - infiltrating corporate intranet like...DEF CON 27 - ORANGE TSAI and MEH CHANG - infiltrating corporate intranet like...
DEF CON 27 - ORANGE TSAI and MEH CHANG - infiltrating corporate intranet like...
 
Global Software Development powered by Perforce
Global Software Development powered by PerforceGlobal Software Development powered by Perforce
Global Software Development powered by Perforce
 
Taking the Next Hot Mobile Game Live with Docker and IBM SoftLayer
Taking the Next Hot Mobile Game Live with Docker and IBM SoftLayerTaking the Next Hot Mobile Game Live with Docker and IBM SoftLayer
Taking the Next Hot Mobile Game Live with Docker and IBM SoftLayer
 
DataCore Case Study on Hyperconverged
DataCore Case Study on HyperconvergedDataCore Case Study on Hyperconverged
DataCore Case Study on Hyperconverged
 
Inside the IT Territory game server / Mark Lokshin (IT Territory)
Inside the IT Territory game server / Mark Lokshin (IT Territory)Inside the IT Territory game server / Mark Lokshin (IT Territory)
Inside the IT Territory game server / Mark Lokshin (IT Territory)
 
Practical IoT Exploitation (DEFCON23 IoTVillage) - Lyon Yang
Practical IoT Exploitation (DEFCON23 IoTVillage) - Lyon YangPractical IoT Exploitation (DEFCON23 IoTVillage) - Lyon Yang
Practical IoT Exploitation (DEFCON23 IoTVillage) - Lyon Yang
 
Scaling Servers and Storage for Film Assets
Scaling Servers and Storage for Film Assets  Scaling Servers and Storage for Film Assets
Scaling Servers and Storage for Film Assets
 
Working Well Together: How to Keep High-end Game Development Teams Productive
Working Well Together: How to Keep High-end Game Development Teams ProductiveWorking Well Together: How to Keep High-end Game Development Teams Productive
Working Well Together: How to Keep High-end Game Development Teams Productive
 
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
LCNA14: Why Use Xen for Large Scale Enterprise Deployments? - Konrad Rzeszute...
 

More from Perforce

How to Organize Game Developers With Different Planning Needs
How to Organize Game Developers With Different Planning NeedsHow to Organize Game Developers With Different Planning Needs
How to Organize Game Developers With Different Planning NeedsPerforce
 
Regulatory Traceability: How to Maintain Compliance, Quality, and Cost Effic...
Regulatory Traceability:  How to Maintain Compliance, Quality, and Cost Effic...Regulatory Traceability:  How to Maintain Compliance, Quality, and Cost Effic...
Regulatory Traceability: How to Maintain Compliance, Quality, and Cost Effic...Perforce
 
Efficient Security Development and Testing Using Dynamic and Static Code Anal...
Efficient Security Development and Testing Using Dynamic and Static Code Anal...Efficient Security Development and Testing Using Dynamic and Static Code Anal...
Efficient Security Development and Testing Using Dynamic and Static Code Anal...Perforce
 
Understanding Compliant Workflow Enforcement SOPs
Understanding Compliant Workflow Enforcement SOPsUnderstanding Compliant Workflow Enforcement SOPs
Understanding Compliant Workflow Enforcement SOPsPerforce
 
Branching Out: How To Automate Your Development Process
Branching Out: How To Automate Your Development ProcessBranching Out: How To Automate Your Development Process
Branching Out: How To Automate Your Development ProcessPerforce
 
How to Do Code Reviews at Massive Scale For DevOps
How to Do Code Reviews at Massive Scale For DevOpsHow to Do Code Reviews at Massive Scale For DevOps
How to Do Code Reviews at Massive Scale For DevOpsPerforce
 
How to Spark Joy In Your Product Backlog
How to Spark Joy In Your Product Backlog How to Spark Joy In Your Product Backlog
How to Spark Joy In Your Product Backlog Perforce
 
Going Remote: Build Up Your Game Dev Team
Going Remote: Build Up Your Game Dev Team Going Remote: Build Up Your Game Dev Team
Going Remote: Build Up Your Game Dev Team Perforce
 
Shift to Remote: How to Manage Your New Workflow
Shift to Remote: How to Manage Your New WorkflowShift to Remote: How to Manage Your New Workflow
Shift to Remote: How to Manage Your New WorkflowPerforce
 
Hybrid Development Methodology in a Regulated World
Hybrid Development Methodology in a Regulated WorldHybrid Development Methodology in a Regulated World
Hybrid Development Methodology in a Regulated WorldPerforce
 
Better, Faster, Easier: How to Make Git Really Work in the Enterprise
Better, Faster, Easier: How to Make Git Really Work in the EnterpriseBetter, Faster, Easier: How to Make Git Really Work in the Enterprise
Better, Faster, Easier: How to Make Git Really Work in the EnterprisePerforce
 
Easier Requirements Management Using Diagrams In Helix ALM
Easier Requirements Management Using Diagrams In Helix ALMEasier Requirements Management Using Diagrams In Helix ALM
Easier Requirements Management Using Diagrams In Helix ALMPerforce
 
How To Master Your Mega Backlog
How To Master Your Mega Backlog How To Master Your Mega Backlog
How To Master Your Mega Backlog Perforce
 
Achieving Software Safety, Security, and Reliability Part 3: What Does the Fu...
Achieving Software Safety, Security, and Reliability Part 3: What Does the Fu...Achieving Software Safety, Security, and Reliability Part 3: What Does the Fu...
Achieving Software Safety, Security, and Reliability Part 3: What Does the Fu...Perforce
 
How to Scale With Helix Core and Microsoft Azure
How to Scale With Helix Core and Microsoft Azure How to Scale With Helix Core and Microsoft Azure
How to Scale With Helix Core and Microsoft Azure Perforce
 
Achieving Software Safety, Security, and Reliability Part 2
Achieving Software Safety, Security, and Reliability Part 2Achieving Software Safety, Security, and Reliability Part 2
Achieving Software Safety, Security, and Reliability Part 2Perforce
 
Should You Break Up With Your Monolith?
Should You Break Up With Your Monolith?Should You Break Up With Your Monolith?
Should You Break Up With Your Monolith?Perforce
 
Achieving Software Safety, Security, and Reliability Part 1: Common Industry ...
Achieving Software Safety, Security, and Reliability Part 1: Common Industry ...Achieving Software Safety, Security, and Reliability Part 1: Common Industry ...
Achieving Software Safety, Security, and Reliability Part 1: Common Industry ...Perforce
 
What's New in Helix ALM 2019.4
What's New in Helix ALM 2019.4What's New in Helix ALM 2019.4
What's New in Helix ALM 2019.4Perforce
 
Free Yourself From the MS Office Prison
Free Yourself From the MS Office Prison Free Yourself From the MS Office Prison
Free Yourself From the MS Office Prison Perforce
 

More from Perforce (20)

How to Organize Game Developers With Different Planning Needs
How to Organize Game Developers With Different Planning NeedsHow to Organize Game Developers With Different Planning Needs
How to Organize Game Developers With Different Planning Needs
 
Regulatory Traceability: How to Maintain Compliance, Quality, and Cost Effic...
Regulatory Traceability:  How to Maintain Compliance, Quality, and Cost Effic...Regulatory Traceability:  How to Maintain Compliance, Quality, and Cost Effic...
Regulatory Traceability: How to Maintain Compliance, Quality, and Cost Effic...
 
Efficient Security Development and Testing Using Dynamic and Static Code Anal...
Efficient Security Development and Testing Using Dynamic and Static Code Anal...Efficient Security Development and Testing Using Dynamic and Static Code Anal...
Efficient Security Development and Testing Using Dynamic and Static Code Anal...
 
Understanding Compliant Workflow Enforcement SOPs
Understanding Compliant Workflow Enforcement SOPsUnderstanding Compliant Workflow Enforcement SOPs
Understanding Compliant Workflow Enforcement SOPs
 
Branching Out: How To Automate Your Development Process
Branching Out: How To Automate Your Development ProcessBranching Out: How To Automate Your Development Process
Branching Out: How To Automate Your Development Process
 
How to Do Code Reviews at Massive Scale For DevOps
How to Do Code Reviews at Massive Scale For DevOpsHow to Do Code Reviews at Massive Scale For DevOps
How to Do Code Reviews at Massive Scale For DevOps
 
How to Spark Joy In Your Product Backlog
How to Spark Joy In Your Product Backlog How to Spark Joy In Your Product Backlog
How to Spark Joy In Your Product Backlog
 
Going Remote: Build Up Your Game Dev Team
Going Remote: Build Up Your Game Dev Team Going Remote: Build Up Your Game Dev Team
Going Remote: Build Up Your Game Dev Team
 
Shift to Remote: How to Manage Your New Workflow
Shift to Remote: How to Manage Your New WorkflowShift to Remote: How to Manage Your New Workflow
Shift to Remote: How to Manage Your New Workflow
 
Hybrid Development Methodology in a Regulated World
Hybrid Development Methodology in a Regulated WorldHybrid Development Methodology in a Regulated World
Hybrid Development Methodology in a Regulated World
 
Better, Faster, Easier: How to Make Git Really Work in the Enterprise
Better, Faster, Easier: How to Make Git Really Work in the EnterpriseBetter, Faster, Easier: How to Make Git Really Work in the Enterprise
Better, Faster, Easier: How to Make Git Really Work in the Enterprise
 
Easier Requirements Management Using Diagrams In Helix ALM
Easier Requirements Management Using Diagrams In Helix ALMEasier Requirements Management Using Diagrams In Helix ALM
Easier Requirements Management Using Diagrams In Helix ALM
 
How To Master Your Mega Backlog
How To Master Your Mega Backlog How To Master Your Mega Backlog
How To Master Your Mega Backlog
 
Achieving Software Safety, Security, and Reliability Part 3: What Does the Fu...
Achieving Software Safety, Security, and Reliability Part 3: What Does the Fu...Achieving Software Safety, Security, and Reliability Part 3: What Does the Fu...
Achieving Software Safety, Security, and Reliability Part 3: What Does the Fu...
 
How to Scale With Helix Core and Microsoft Azure
How to Scale With Helix Core and Microsoft Azure How to Scale With Helix Core and Microsoft Azure
How to Scale With Helix Core and Microsoft Azure
 
Achieving Software Safety, Security, and Reliability Part 2
Achieving Software Safety, Security, and Reliability Part 2Achieving Software Safety, Security, and Reliability Part 2
Achieving Software Safety, Security, and Reliability Part 2
 
Should You Break Up With Your Monolith?
Should You Break Up With Your Monolith?Should You Break Up With Your Monolith?
Should You Break Up With Your Monolith?
 
Achieving Software Safety, Security, and Reliability Part 1: Common Industry ...
Achieving Software Safety, Security, and Reliability Part 1: Common Industry ...Achieving Software Safety, Security, and Reliability Part 1: Common Industry ...
Achieving Software Safety, Security, and Reliability Part 1: Common Industry ...
 
What's New in Helix ALM 2019.4
What's New in Helix ALM 2019.4What's New in Helix ALM 2019.4
What's New in Helix ALM 2019.4
 
Free Yourself From the MS Office Prison
Free Yourself From the MS Office Prison Free Yourself From the MS Office Prison
Free Yourself From the MS Office Prison
 


Designing a Highly Available Environment Using Methods of Modern IT Infrastructure

  • 1. # Jouko Markkanen IT Manager
  • 2. #
  • 3. # • Privately held game developer based in Finland. • Released games Death Rally, Max Payne, Max Payne 2: The Fall of Max Payne, Alan Wake, Alan Wake’s American Nightmare, Death Rally Mobile. • Franchises made into a movie, TV-series & novel. • Announced titles Agents of Storm for iOS and Xbox One exclusive title Quantum Break.
  • 4. # • Founded in 1995, currently 120+ employees. • Over 100 Game of the Year awards. • Franchises generated over $500M revenue. • Max Payne IP sold for $43M. • AAA games sold over 11M units. • First mobile experiment over 16M downloads and reached #1 in 70 countries.
  • 5. #
  • 6. # • Large content files
  • 7. # Files created by Remedy since 2004 (number of files / total size / number of files > 100 MB): All projects, all revisions: 10.5 million / 12 terabytes. All projects, #head revisions: 5 million / 5.5 terabytes. Alan Wake (XBOX 360), #head: 1.1 million / 920 gigabytes / 1,300. Quantum Break (XBOX One, until today), #head: 3 million / 4.3 terabytes / 7,000. Perforce Database: 30 gigabytes.
  • 8. # • Large content files • Dependencies of game engine <-> internal tools <-> game content (in proprietary formats)
  • 9. # [Dependency diagram] Tools source code, Tool binaries, Content source, 3rd party tools, Game source code, Export util source code, Export util, Runtime game binary, Runtime content
  • 10. # • Large content files • Dependencies of game engine <-> internal tools <-> game content (in proprietary formats) • Everything that comes out, comes from Perforce depot – Availability of the system is business critical
  • 11. #
  • 12. # • System design approach • Service implementation • Principles of HA engineering 1. Elimination of single points of failure 2. Reliable crossover 3. Detection of failures as they occur. • Source: http://en.wikipedia.org/wiki/High_availability
  • 13. # • Client and access network don’t have HA – Opting for fast manual response • LAN core w/ act/act redundancy • Servers with failover • SAN w/ active/active redundancy • Storage w/ redundant components
  • 14. # • HA design principles do not cover the concept of backups – Even when HA is taken care of, data and availability can be lost by user actions and software failures – The data still needs to be copied to offline storage for disaster recovery purposes
  • 15. # • Client and access network don’t have HA – Opting for fast manual response • LAN core w/ act/act redundancy • Servers with failover • SAN w/ active/active redundancy • Storage w/ redundant components
  • 16. #
  • 17. # • Used for offloading backups and integrity verification • Covers application level failures • Activation requires manual intervention perforce2:1666 perforce3:1666 perforce1:1666 perforce1:1667
  • 18. # • Snapshot of Perforce every 4 hours • Runs storage provided snapshot with “p4d -c” – Ensures database integrity – Locks database for 30-50 seconds • Near-instant recovery • Can be mounted and exported to other hosts – To run checkpoint, verify, … – To run test environment with production data
  • 19. #
  • 20. # • “A user may never see a failure. But the maintenance activity must.” • Infrastructure monitored with vendor tools • Central monitoring with Nagios – P4D process, TCP connectivity to perforce:1666 – Check “p4 info” output – Replication: check “changelist” counter on both partners • P4review.py
  • 21. # • Define what HA means for your service • Build it one step at a time – Ensure redundancy of each component – Make sure the component is monitored • Backups are still needed
  • 22. # Jouko Markkanen jouko@remedygames.com
  • 23. # • Introduction to Remedy • Perforce at Remedy • High Availability • Perforce Application Availability • Monitoring • Conclusions
  • 24. # Jouko Markkanen is an IT Manager at Remedy Entertainment with broad experience in different areas of information and communications technology including help desk responsibilities, programming, application design, security systems, information management, and infrastructure planning and design.
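
The snapshot routine on slide 18 could be scheduled with a small wrapper like the one below. This is a minimal sketch, not Remedy's actual script: the server root `/p4/root` and the storage CLI `/usr/local/bin/storage-snapshot` are hypothetical placeholders, and only the `p4d -r <root> -c <command>` form (lock the database tables, run the command, release the locks) comes from Perforce itself.

```python
import shlex
import subprocess


def build_snapshot_command(p4root: str, snapshot_cmd: str) -> list:
    """Build the argv for `p4d -r <root> -c "<command>"`.

    p4d holds the database locks for as long as <command> runs, so the
    storage snapshot is taken while the db.* files are consistent.
    """
    return ["p4d", "-r", p4root, "-c", snapshot_cmd]


def run_snapshot(p4root: str, snapshot_cmd: str, timeout_s: int = 300) -> int:
    """Run the locked snapshot and return the exit code of p4d."""
    argv = build_snapshot_command(p4root, snapshot_cmd)
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    return result.returncode


def describe(p4root: str, snapshot_cmd: str) -> str:
    """Return the shell-quoted command line, handy for logging or a cron entry."""
    return shlex.join(build_snapshot_command(p4root, snapshot_cmd))


# Hypothetical usage (both paths are placeholders, not values from the talk):
#   run_snapshot("/p4/root", "/usr/local/bin/storage-snapshot perforce1")
```

Running this from a scheduler every four hours would match the interval on the slide. Keeping the storage command as a single string means it is passed to `p4d` as one argument, which is why `describe` shows it quoted.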

Editor's Notes

  1. E3 Sofia scene
  2. Agents of Storm in beta Quantum Break to be released in 2015
  3. - 1996 Virtual Reality  3D Mark - Spin-off company Futuremark in 1997 - Over 120 employees from over 15 different countries - TOP-50 growth companies in Finland
  4. Remedy has been using Perforce as the sole SCM system for over 10 years. During that time, we have created several AAA console/PC games, as well as mobile games. The biggest production has been Alan Wake, and while still in production, Quantum Break already has several times the number and size of files in our depot. So far we have created over 10 million files, with a total size of 12 terabytes in 300K+ changelists. While the Perforce database is sized modestly at 30 gigabytes, the average file size is 1.2 MB, and Quantum Break has over 7,000 files larger than 100 MB, and hundreds of files larger than 1 GB. This distribution is not typical for a “regular” software project, and the performance problems lie more in “how to copy the mass of files to/from the client” than in “how to manage the complex database metadata of a huge number of files”.
  5. There are over 100 people working on QB, writing program code and creating content, all of which is submitted to our Perforce server. Most of them work on content production, using in-house tools to edit the game world. These tools export the content from the proprietary source format to a binary format that can be presented in real time by the game engine. A lot of the program code is shared between the tools and the game engine, to ensure a matching presentation. This means that when the format is updated, for example due to a new feature in the engine, the whole dependency chain must be rebuilt and redistributed, all of the existing content must be re-exported, and sometimes even the existing content source needs to be upgraded. This takes the whole version control and integration/delivery methodology to a different level of complexity. The dark-grey boxes in the diagram depict the binaries built and delivered by our automated build system, or built locally for local testing and modifications. The dark-red boxes depict files stored in our Perforce server, and they are also the files on which almost all the work on the project is done. Which brings us to…
  6. … the importance of the Perforce service in our company. Everything that is delivered with our final product, the game, comes out of the content stored in our Perforce server. This makes it the #1 business-critical IT service at Remedy, and its availability is the top priority.
  7. Before we delve into how we ensure HA of our Perforce service, let’s discuss a bit what the term actually means. As is common in this age, we’ll start with a “common” definition, i.e. what Wikipedia writes about HA.
  8. Wikipedia defines HA as a system design approach and associated service implementation. Their purpose is to ensure a certain, specified level of performance, specifically a level of availability. Three principles for practicing this approach and implementation are listed. The first is to get rid of single points of failure, so that any single component needed to produce the service can fail without disrupting the service. This is usually achieved by means of redundancy, that is, by duplicating, or multiplying, all components of the system. The second is to provide means of transferring service reliably from a failed component to its redundant counterpart. And to allow this, the third principle is automatic failure monitoring and detection, because without it, the tasks of the failed component will never be transferred until its redundant counterparts have also failed and the availability is lost.
  9. Next, we’ll go through the IT infrastructure layers we use to provide the Perforce service, and how their redundancy has been provided. We’ll start off with storage: we use a shared storage system, which provides storage space with different characteristics, like high IOPS for database storage and less expensive large storage for versioned files. This system is shown as a single entity, but in reality it is spread over multiple storage chassis, each having a RAID-style redundant array of disks, redundant power supplies, and redundant controller modules, so that there are no single points of failure. Access to the shared storage system is provided via an iSCSI storage area network. This has simple redundancy: there are two switches, with active paths carried on both, so if one of them fails, the other will continue storage operations. On the next level are the servers. Our Perforce servers (there are multiple, more about them later) are virtualized, and they run on a cluster of hypervisors. In normal conditions, they are distributed across different physical host computers, so if one host fails, the other VMs keep running. Also, the cluster monitors itself, and in case of a host failure, the VMs are restarted on the remaining hosts. LAN connectivity between the servers as well as towards the clients is also redundant, with several technologies providing redundancy on different levels of the OSI network model. On the final stage of the client-server path are the access network (“floor switch”) and the client computer itself. These do not have HA as such, as a failure there has a very limited area of effect. However, we are prepared for failures here as well; we keep spare access switches in store, so they can be swapped manually but quickly in case of a failure, and the same applies to the computers and/or their components.
  10. Even at this level the system is not foolproof, and cannot guarantee 100% availability. Many of the crossover paths used to provide HA are built upon automatic monitoring and software features, and software tends to have bugs. Also, there are mere humans using, administering and operating the system, and humans tend to make errors. Even a well designed and implemented HA system offers no protection if the datacenter is consumed by fire or flooded with water. To protect against this, proper backups must be planned, made and tested.
  11. This completes the HA architecture diagram. The backups are created with dedicated hardware, and stored on a dedicated storage system. Those systems should preferably reside offsite. We create backups on a dedicated system onsite, for faster recoveries, but replicate that backup content to an offsite datacenter for ultimate disaster recovery.
  12. There is still one level of failures not covered by the generic HA IT infrastructure, but we can prepare for those as well: cases where the Perforce software itself fails for one reason or another (this is rare, but it has happened, also to us).
  13. We currently have two different primary Perforce servers. We have split different projects onto different servers to give some scalability in performance, and to allow one project to perform maintenance without disturbing the other. This is possible because we have projects that use different game engines; the projects that share a game engine are located on the same server, but in different depots. Both of the servers are replicated using Perforce pull replication to a third server. That server runs two P4D processes on different ports. The replicas serve two main purposes. One is to allow checkpoints and verifies to run without interrupting end user service. We only do these operations on the primary servers during scheduled maintenance breaks, when needed. The other purpose is to have a fallback in case the master server fails irrecoverably. In this case, we need to manually change the replica server configuration (to allow write operations), as well as point the clients to the failover replica. As the clients use an alias name to find the server, we can change that alias to point to the replica, and as the name record has a 5 minute TTL, all clients are good to run within that period. Naturally, this only helps if the master server has failed in such a way that the failure has not been replicated to the replica’s database or versioned files. If it has, we can always resort to recovering from backups. However, as the full recovery takes some time (currently almost 24 hours to copy all data back from the backups), we have another way…
  14. … the storage snapshots. Many modern storage systems have a feature to create a near-instant point-in-time snapshot within the storage hardware (or rather, by the software running the hardware). To ensure the integrity of the snapshot, we have scheduled a command to run every 4 hours. This command is “p4d -c …”, which tells the Perforce server to commit any pending changes to disk and lock the database while running a command. In this case the command tells the storage system to snapshot all the volumes assigned to that Perforce server. The snapshot takes around 30-50 seconds with our system, during which time the Perforce server will hold all write operations. Compared to the almost two hours that creating a checkpoint of our database takes, this is pretty fast. Mounting this snapshot back to the server also happens in less than a minute. In case of a failure, it is possible to mount the previous snapshot in an alternate directory, and start the Perforce server from there, while keeping the failed volumes online for further investigation. But the use of the snapshots does not end there. You can also keep production running while mounting a snapshot on the same or another server, start a P4D on the snapshot folder, and use that e.g. to test upgrades or configuration changes with real-world data, without touching production.
  15. HA allows the system to recover when a single failure occurs. But the whole concept of High Availability is void if the environment is not being monitored, and single failures are not corrected before another failure occurs and brings the system down. In Wikipedia, the third principle of HA engineering says that “A user may never see a failure. But the maintenance activity must.”
  16. The HA infrastructure stack is monitored by tools provided by the individual component vendors. They are configured to send an alert email in case a component fails, and many even have a feature that notifies the vendor tech support autonomously. We have a central monitoring system, utilizing Nagios, that monitors our Perforce servers in addition to the hardware and OS environment and their resources. There are several checks that verify that the P4D process is running, and a custom script that checks the replication status. The p4review daemon is also a good monitoring tool (although this is not its primary purpose). We run p4review every two minutes, and if the server has any problems, it fails, sending a failure report to the admin contact email.
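
The replication check mentioned in the last note could look roughly like the sketch below. It is an assumption-laden illustration rather than the custom script used at Remedy: it compares the `change` counter on a master and its replica with `p4 counter change` and maps the lag to the standard Nagios plugin exit codes. The thresholds (`warn=5`, `crit=50`) and the two server addresses in the usage comment are hypothetical.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def read_change_counter(port: str) -> int:
    """Return the latest submitted changelist number reported by a server."""
    out = subprocess.run(
        ["p4", "-p", port, "counter", "change"],
        capture_output=True, text=True, check=True, timeout=30,
    ).stdout
    return int(out.strip())


def evaluate_lag(master_change: int, replica_change: int,
                 warn: int = 5, crit: int = 50):
    """Map replication lag (in changelists) to a Nagios status and message.

    The warn/crit thresholds are illustrative defaults, not values from the talk.
    """
    lag = master_change - replica_change
    if lag < 0:
        return UNKNOWN, f"UNKNOWN - replica is ahead of master by {-lag} changelists"
    if lag >= crit:
        return CRITICAL, f"CRITICAL - replica is {lag} changelists behind"
    if lag >= warn:
        return WARNING, f"WARNING - replica is {lag} changelists behind"
    return OK, f"OK - replica is {lag} changelists behind"


def main(master_port: str, replica_port: str) -> int:
    """Run the check once; print the status line and return the exit code."""
    try:
        status, message = evaluate_lag(
            read_change_counter(master_port), read_change_counter(replica_port)
        )
    except (subprocess.SubprocessError, ValueError, OSError) as exc:
        status, message = UNKNOWN, f"UNKNOWN - could not read counters: {exc}"
    print(message)
    return status


# Hypothetical usage as a Nagios plugin (addresses are placeholders):
#   import sys; sys.exit(main("master.example:1666", "replica.example:1666"))
```

Treating a replica that is ahead of its master as UNKNOWN rather than OK is a deliberate choice in this sketch: that state usually means the two counters were read from the wrong pair of servers, which deserves a look.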