AWS re:Invent 2013 - MBL303 Gaming Ops - Running High-performance Ops for Mobile Gaming
3. Agenda
• Part 1 – Lessons Learned
– Incident Management
– Change Management
– Auto-scale
– Cloud Optimization Tools and Capacity Planning
• Part 2 – Game Architecture, Analytics & Monetization
– Game Architecture
– Moving a live game
– Analytics & Monetization
– Cloud Insights
4. Incident Management
[Diagram labels: NOC; Ops; SME (Network, DBA, …); Dev; other monitoring tools…; Triage, Escalation, Communication]
5. NOC, automated
[Diagram labels: Ops; Dev; Critical; Critical; Non-Critical; other monitoring tools…]
Application-level issue?
Who’s the dev of this game? Phone #?
I can’t find the dev… who’s his manager?
Oh, the problem is in the backend service, who’s the dev for that service?
6. Alert Workflow - DevOps way
Each alert goes directly to the team that can resolve it!
[Diagram labels: Ops; Dev, Game X, Server; Dev, Game Y, Client/iOS; Dev, Service A; Dev, Service B; Analytics]
7. Alerts go to the person that can resolve it
Type | Scope | Checked by | Who to page?
ELB | Load balancer health-check | ELB | No one – email alert only
System-level | Check cpu / disk / memory / network | Pingdom / Nagios | Ops team
App-level | Application issues / bugs | Pingdom | Dev and Ops teams
App-level alerts can be triggered by issues in:
• Server-side
• Client-side
• iOS
• Android
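The paging rules in the table above can be sketched as a small routing function. This is only an illustration: the `Alert` class, its field names, and the lowercase team labels are assumptions made for the example, not anything shown in the talk.

```python
from dataclasses import dataclass

# Routing table transcribed from the slide: alert type -> teams to page.
# An empty tuple means "email alert only, page no one".
PAGE_TARGETS = {
    "elb": (),
    "system": ("ops",),
    "app": ("dev", "ops"),
}


@dataclass(frozen=True)
class Alert:
    alert_type: str  # "elb", "system", or "app"
    source: str      # hypothetical field: game or service that raised the alert
    message: str


def route(alert: Alert) -> dict:
    """Return who to page and whether the alert is email-only."""
    if alert.alert_type not in PAGE_TARGETS:
        raise ValueError(f"unknown alert type: {alert.alert_type!r}")
    teams = PAGE_TARGETS[alert.alert_type]
    return {"page": list(teams), "email_only": not teams}
```

Keeping the rules in one table means a change in ownership is a one-line edit rather than a code change.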
8. Dev and Ops are responsible
Team | In pager duty
Ops | 8
Dev | 32, from ~20 games (server-side or client-side, android or iOS developers)
Analytics | 5
12. Use IM Bot for status
• IM Bot informs in the game channel that an alert was triggered
• Both Ops and Dev receive the alert, troubleshoot
• IM Bot detects issue is resolved and sends all-clear
• IM Bot = collaboration
• IM Bot = transparency
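A minimal sketch of the triggered / all-clear behaviour described on this slide. The `StatusBot` class, the `send` callback and the message wording are hypothetical stand-ins; the talk does not say which IM system or bot framework was used.

```python
class StatusBot:
    """Posts one message when an alert fires and one when it clears."""

    def __init__(self, send):
        self._send = send    # callable(channel, text); stand-in for a real IM client
        self._firing = set() # (channel, alert_name) pairs currently open

    def on_check(self, channel, alert_name, healthy):
        key = (channel, alert_name)
        if not healthy and key not in self._firing:
            # First failure: announce it once in the game channel.
            self._firing.add(key)
            self._send(channel, f"ALERT triggered: {alert_name}")
        elif healthy and key in self._firing:
            # Recovery after a failure: send the all-clear.
            self._firing.remove(key)
            self._send(channel, f"ALL CLEAR: {alert_name}")
```

Tracking open alerts in a set keeps the channel quiet while a failing check keeps failing, so the channel shows one line per state change.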
13. Review your incidents and alerts
• Monday morning incident review meeting
– Weekly on-call hand-over
– Address false-positives / fine-tune your monitoring
– Heads-up for events / major releases
• Problem management
– Any major or recurrent incident = Problem
– Problem = requires post-mortem
– Remediation items from post-mortem also tracked weekly till closure
14. Incident Management Lessons Learned
• Use automatic paging/escalation tools
• Make the alerts go to the right team directly
• Use big-display dashboards
• Use IM-bots to communicate outages
• Do weekly reviews of the incidents / alerts
• Do post-mortems, follow-up on remediation items
16. Change Management
Type | Content | Owner | Tool
Configuration Management | 3rd. Party packages and configuration | Ops | Puppet
Release – code deploy | 1st. Party code | Dev | Jenkins + In-house scripts
Release – asset deploy | 1st. Party – images / new game content / new missions | Dev | Jenkins + In-house scripts
17. Configuration Management
[Workflow diagram, arrows not preserved; labels: Ops do changes / test locally; push; peer review; syntax validation; not good; pull changes to prod puppet; puppet clients (prod servers) pull changes]
18. Configuration Management Benefits
• Automate and speed up deployment
• Repeatable
• Declarative modules/manifests = documentation
• All prod changes:
– peer-reviewed via pull-requests in Git
– validated by Puppet lint
– locally tested via Vagrant (every component has a Vagrant VM)
– communicated through email and IM
20. Release Management – Code deploy
[Diagram labels: dev; push; dev host; S3; Deploy; QA; Beta; Prod]
In QA/dev channel of that project: [IM screenshot not preserved]
If Prod deploy, in Ops channel of that project: [IM screenshot not preserved]
22. Release Management – Asset deploy
[Flowchart, arrows not preserved; labels: Dev kick off new asset deploy job; Run validation; Warns?; Code Review; Ops approval; Override?; Yes; Yes; No; Deploy to prod]
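The arrows of the asset-deploy flowchart did not survive, so the following is only one plausible reading of it: a clean validation deploys straight to prod, while validation warnings block the deploy unless Ops approves an override. The function name, arguments and return values are hypothetical.

```python
def asset_deploy_decision(warnings, ops_override=False):
    """Decide what happens to an asset deploy job after validation.

    warnings: list of validation warning strings (empty means clean).
    ops_override: True if Ops reviewed the warnings and approved anyway.
    """
    if not warnings:
        return "deploy"   # validation was clean
    if ops_override:
        return "deploy"   # warnings present, but Ops approved an override
    return "blocked"      # warnings present and no override


# Example: a job with a warning stays blocked until Ops overrides it.
print(asset_deploy_decision(["image exceeds size budget"]))
print(asset_deploy_decision(["image exceeds size budget"], ops_override=True))
```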
23. Change Management Lessons Learned
• Changes are made directly by the team that is responsible for that code
– 3rd. party code is configuration management = owned by Ops
– 1st. party code is release management = owned by Dev
• Changes are made through tools
– Configuration management through Puppet
– Release management through Jenkins + internal tool
• No change is done manually
• All changes are communicated and tracked
32. On-demand Auto-scale
[Chart series: CPU; # instances in ELB; # auto-scale instances]
6 – On-demand auto-scale terminates some instances as CPU drops below 40%
On-demand policy:
as-put-scaling-policy ccios-app-ScaleDownPolicy40 --auto-scaling-group ccios-app-asg --adjustment=-2 --type ChangeInCapacity
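The effect of the scale-down policy above (remove 2 instances when CPU drops below 40%) can be illustrated with a few lines of arithmetic. The 40% threshold and the -2 adjustment come from the slide; the `min_size` floor is an assumption added so the group cannot shrink to zero, and the function itself is a hypothetical illustration, not the Auto Scaling service logic.

```python
def capacity_after_scale_down(current, avg_cpu, min_size,
                              threshold=40.0, adjustment=-2):
    """Return the group size after evaluating the scale-down policy once."""
    if avg_cpu >= threshold:
        return current                        # policy does not fire
    return max(min_size, current + adjustment)  # ChangeInCapacity, floored


# 10 instances at 35% CPU shrink to 8; a group already near its floor stops there.
print(capacity_after_scale_down(10, 35.0, min_size=4))
print(capacity_after_scale_down(5, 35.0, min_size=4))
```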
33. Auto-scale bootstrap workflow
Event | Description | Duration
Cloudwatch alarm is triggered | E.g. CPU > 60% for 5 minutes | 5 minutes
Auto-scale policy is executed | Launches n new instances | 2 minutes
User-data script is executed | This script is defined on the autoscale launch config. Installs base packages, gets instance_id, IP and hostgroup | 1 minute
Bootstrap script is executed | This script is loaded from S3. It renames host, runs puppet, deploys code, starts web service | 11 minutes
Health-check passes and servers start to get traffic | Health-check must pass before ELB starts to send traffic to new host | 1 minute
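Adding up the durations in the bootstrap workflow shows how long a new instance takes to serve traffic and which step dominates. The numbers are the ones on the slide; the variable and function names are just for this sketch.

```python
# (event, minutes) pairs transcribed from the bootstrap workflow table.
STEPS = [
    ("Cloudwatch alarm is triggered", 5),
    ("Auto-scale policy is executed", 2),
    ("User-data script is executed", 1),
    ("Bootstrap script is executed", 11),
    ("Health-check passes", 1),
]


def total_minutes(steps):
    """Time from the load increase until new hosts take traffic."""
    return sum(minutes for _, minutes in steps)


def slowest_step(steps):
    """The single step that contributes the most delay."""
    return max(steps, key=lambda step: step[1])


total = total_minutes(STEPS)        # 20 minutes end to end
name, minutes = slowest_step(STEPS) # the bootstrap script, 11 minutes
print(f"{total} min total; '{name}' is {minutes / total:.0%} of it")
```

With these figures the bootstrap script accounts for 11 of the 20 minutes, which makes it the first place to look when trying to shorten spin-up time.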
34. Auto-scale external dependencies
Dependency | How to resolve
Configuration Management (Puppet/Chef) | Pre-load all necessary packages in the AMI / architect HA for config management
External Repo | Pre-load all necessary packages in the AMI / set up internal HA repo
Code deploy | Same as above, or put in S3
Monitoring registration | Make it asynchronous
Server registration | Make it asynchronous
35. Auto-scale Lessons Learned
• Reduce time to spin-up new instances:
– Pre-install all base packages into AMI
• Address these risks:
– on-demand and scheduled AS conflicts
– bootstrap validation and graceful termination
– health-checks: keep it simple
– keep some servers out of auto-scale pool, just in case
– map and resolve/monitor external dependencies for auto-scale
– consider using 2 different thresholds, for quicker ramp-up
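The last lesson, using two different thresholds for quicker ramp-up, could look like the sketch below. Only the 60% figure appears in the talk (as the CloudWatch alarm example); the 80% threshold and both step sizes are assumptions chosen for illustration.

```python
def scale_up_step(avg_cpu, warn=60.0, high=80.0, small_step=2, big_step=6):
    """Return how many instances to add for the observed average CPU.

    A moderate breach adds a few instances; a severe breach adds many at
    once, so the group catches up faster during a sudden spike.
    """
    if avg_cpu > high:
        return big_step
    if avg_cpu > warn:
        return small_step
    return 0


for cpu in (50.0, 65.0, 90.0):
    print(cpu, "->", scale_up_step(cpu))
```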
37. Cloud Optimization areas
• Cost
– optimal # of RI
– hosts outside RI
– cost break-down using tags
– estimate on-demand costs
• Usage
– under-utilized hosts
– overloaded hosts
– EBS/ELB not in use
• Availability
– AZ / region distribution
– backup audit
– un-healthy instances in ELB
– ELB misconfigs
• Security
– exposed DBs
– EC2 behind ELB exposed directly
39. Cloud Optimization Lessons Learned
• Try Trusted Advisor
• Pilot 3rd.-party solutions
• Evaluate what metrics are important for each component of your architecture
• Do in-house development for other optimizations you need that are not covered by TA or 3rd. party solutions
• Tag all assets! Automate tagging!
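As a sketch of why tagging matters for cost break-downs: once every asset carries a tag, costs can be grouped by it and untagged assets stand out. The resource dictionaries, the `game` tag key and the cost figures below are made-up sample data, not numbers from the talk.

```python
from collections import defaultdict


def cost_by_tag(resources, tag_key):
    """Sum monthly cost per value of tag_key; missing tags go to 'UNTAGGED'."""
    totals = defaultdict(float)
    for res in resources:
        label = res.get("tags", {}).get(tag_key, "UNTAGGED")
        totals[label] += res["monthly_cost"]
    return dict(totals)


# Hypothetical inventory for illustration only.
inventory = [
    {"id": "i-1", "monthly_cost": 100.0, "tags": {"game": "game-x"}},
    {"id": "i-2", "monthly_cost": 50.0, "tags": {"game": "game-x"}},
    {"id": "i-3", "monthly_cost": 75.0, "tags": {"game": "game-y"}},
    {"id": "vol-9", "monthly_cost": 20.0, "tags": {}},
]
print(cost_by_tag(inventory, "game"))
```

The "UNTAGGED" bucket doubles as a to-do list for the tagging automation.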
41. GREE Games
• All Mobile, all Free-to-Play
– iOS & Android smart phones
– Big focus on tablets
• Role Playing Games (RPG)
– Multi-million dollar franchise, top-grossing titles
– Some of the oldest games on the App Store
• Hardcore
– Deeper more intense gameplay mechanics
• Real-Time Strategy (RTS)
– Fast action, small unit management
• Casino & Casual Games
– Familiar games, wider audience, casual play
42. Example Game Architecture – RPG
• Application Servers
– PHP
– Game events → Analytics
• Cache Layer
– Memcached Elasticache
• Batch Processing Servers
– Node.js (moving to GO)
– Batches database writes
• Database
– MySQL RDS
[Diagram labels: ELB; App ×4; Cache ×4; Batch ×2; RDS ×3; Failover DB]
43. Caching Strategy - Current
• Game architecture predates stable NoSQL
– We wanted similar performance at scale
– Keep combined average internal response times below 300ms
• Memcache Authoritative
– Still use an RDBMS; potential data loss is limited
• Allows for cheaper/simpler DB layer
– Always do full row replacements (i.e., no current_row_value + 1)
44. Data Flow
• Reads
– ELB → App → Cache
• Writes (Synchronous)
– ELB → App → Cache → DB
– ELB → App → Cache → Batch → DB
– Standard write-through
– No blind writes; always fetch current version
• Writes (Asynchronous)
– Batch → DB
– Batch writes to DB every 30 seconds
[Diagram labels: ELB; App ×4; Cache ×4; Batch ×2; RDS ×3]
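One way to read "no blind writes; always fetch current version" together with full row replacements is a compare-and-set on a version number, sketched below with an in-memory dictionary standing in for the cache. The class and method names are hypothetical, and the talk does not say that this exact mechanism was used.

```python
class VersionedCache:
    """In-memory stand-in for an authoritative cache with versioned rows."""

    def __init__(self):
        self._rows = {}  # key -> (version, row)

    def get(self, key):
        """Return (version, row); unknown keys start at version 0."""
        return self._rows.get(key, (0, None))

    def replace(self, key, expected_version, new_row):
        """Replace the whole row only if nobody wrote since we read it."""
        current_version, _ = self.get(key)
        if current_version != expected_version:
            return False  # stale read: caller must re-fetch and retry
        self._rows[key] = (current_version + 1, dict(new_row))
        return True


cache = VersionedCache()
version, _ = cache.get("player:1")
cache.replace("player:1", version, {"gold": 10})      # succeeds, version 1
stale = cache.replace("player:1", version, {"gold": 99})  # rejected, stale
print(cache.get("player:1"), stale)
```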
45. Batch Processor
• 80% of game write traffic is Async
– Each write is versioned
• Example: Player items (loot) after multiple quests
– 10 items in 30 sec; app server sends 10 writes downstream
– Batch processor sends last record with final item count to DB
• Greatly reduced writes on DB
– Shard at table and DB server level for larger games
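The loot example above (10 writes in 30 seconds collapsing into one database write) can be sketched as a coalescing buffer that keeps only the newest version of each row until the next flush. The `BatchProcessor` class and the `write_to_db` callback are hypothetical; a real processor would also call `flush` from a 30-second timer, which is left out here.

```python
class BatchProcessor:
    """Keeps the latest versioned write per key and flushes them in bulk."""

    def __init__(self, write_to_db):
        self._write_to_db = write_to_db  # callable(key, version, row)
        self._pending = {}               # key -> (version, row)

    def submit(self, key, version, row):
        """Accept a write; older versions of the same key are discarded."""
        held = self._pending.get(key)
        if held is None or version > held[0]:
            self._pending[key] = (version, row)

    def flush(self):
        """Write one record per key to the DB; return how many were written."""
        batch, self._pending = self._pending, {}
        for key, (version, row) in batch.items():
            self._write_to_db(key, version, row)
        return len(batch)


db_writes = []
processor = BatchProcessor(lambda k, v, r: db_writes.append((k, v, r)))
for count in range(1, 11):  # 10 item pickups within one window
    processor.submit("player:1:items", count, {"item_count": count})
print(processor.flush(), db_writes)
```

Because each write carries a version, out-of-order arrivals cannot overwrite a newer row with an older one.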
46. Near Future Trends for GREE OPS
• Multi-region games
– Latency-sensitive games and the shift towards real-time
– Geographic data replication challenges
• Continuous Delivery
• Automation of Game Studio tasks
– Game design, art, data/asset deploy
– Tighter event pre-provisioning and scale-down
47. More Performance – Lower Costs
• Facebook HipHop Virtual Machine
– JIT compilation & execution of PHP
– 5x faster vs. Zend PHP 5.2
– Achieved 3x to 4x reduction in application server count
– https://github.com/facebook/hhvm
• Google GO
– Used for high-concurrency applications
– Achieved 2x reduction in batch processing servers vs. Node.js
– http://golang.org
49. Moving a Game – Why?
• Physical datacenter to AWS
– West coast → East coast
– Faster access to EU markets & players
• Reduce necessary attention to infrastructure
– Caching & DB layer; custom high-availability middleware
• Take advantage of cloud provisioning
– Scripted instance spin-ups, auto-scaling for events/load
• Save money
– Reduce stand-by server pool
– Provision for average load, not peak
50. Moving a Live Game – Whaaaaat?
• Live game, two platforms (iOS, Android)
– Several million $$$ in combined monthly revenue
– More than one million unique players/month
• ~ 30GB Dataset
• Minimal downtime (< 5 minutes)
– Mostly to allow for change to reverse proxy config
• Debian → CentOS
• Physical machines → AWS
51. Moving a Live Game - How
• Develop timeline
• R&D & architecture review
• Data migration & sync
• Game server/client updates
• Load testing
• D-Day steps & checklist
52. Moving a Live Game - Timeline
• 3 months overall
• DB dataset transfer validation
– Set up direct MySQL to RDS replication
– Initial DB transfer time: approx. 8 hours
• Functional & performance testing
– Load & capacity profile for application, DB servers
– Heavy use of APM metrics – New Relic
53. Moving a Live Game - Architecture
• Changes required
– Caching – discrete memcached to Elasticache nodes
– Database – physical MySQL DB servers to RDS
• Decided to drop internally developed MySQL proxy
– Bittersweet: great automatic failover; limited internal knowledge
• RDS failover mechanics added to possible game downtime
– Load balancers
• LVS to ELB
• Processes
– Code asset deployment
54. Moving a Live Game – D-Day
• Put game into maintenance (shutdown)
• Break DB replication (west → east)
• Set up reverse proxy in datacenter
– Forward traffic from west → east AWS ELB
• Bring game back online
– Reverse proxy sends traffic to AWS
• Update DNS to point to ELB
– Wait for DNS propagation
– Slow DNS updates hit the reverse proxy in datacenter
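The D-Day steps above are strictly ordered, and a failed step should stop the cutover before the next one runs. A minimal sketch of such a checklist runner follows; the step names come from the slide, while the runner, the `ChecklistError` type, and the idea of each step being a callable that returns True on success are assumptions for illustration.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]

# Step names taken from the slide, in order.
D_DAY_STEPS = [
    "Put game into maintenance",
    "Break DB replication",
    "Set up reverse proxy in datacenter",
    "Bring game back online",
    "Update DNS to point to ELB",
]


class ChecklistError(Exception):
    """Raised when a step fails; records what had already completed."""

    def __init__(self, failed: str, completed: List[str]):
        super().__init__(f"step failed: {failed}")
        self.failed = failed
        self.completed = completed


def run_checklist(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first one that returns False."""
    completed: List[str] = []
    for name, action in steps:
        if not action():
            raise ChecklistError(name, completed)
        completed.append(name)
    return completed
```

In practice each action would wrap a real operation and a verification; here they are left as callables supplied by the caller.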
57. Analytics & Monetization
• Specialize in “Live Events”
– Higher player engagement (fun!) = more revenue
• Single-player events
– “Epic Boss”
– Limited-time quests
• Player organization events
– Guild vs. Guild battles (World Domination, Syndicate Wars)
– Raid Bosses – members help to take down a tough NPC
– Tap into social “meta-gaming”
59. Analytics for Player Engagement
• Player retention
– 1st week and beyond
– Tutorial completion rates
• Balancing mechanics
– Player vs. Environment (PvE), Player vs. Player (PvP)
– Encourage interaction with other players
• When too much good can be bad
– Analytics needs to be paired with player feedback
– Fun for all players, payers AND non-payers
60. Analytics for Decision-making
• Devices & Markets
– Understand most popular devices (esp. Android)
– Focus efforts on the top devices for your market
• Launching a game
– “Soft-launch” – only launch in certain markets, tune game
– “Hard-launch” – money down (marketing), marquee live events
• When to sunset & decommission
– Depends on strategic goals, infra/engineering costs, etc.
61. Analytics – Some Scale
• Over 5000 transactions/sec sent to Analytics
• Several billion game events per day
– Attacking, winning, losing, buying, clicking, swiping, etc.
• Anticipating 10x increase in next two years
• Building petabyte scale data warehouse capacity
62. Analytics Pipeline
• Working towards “zero-latency” pipeline
– Latency = ETL, summarization, reporting & dashboard
– Already reduced from 24 hours to 1 hour in last year
64. Cloud Insights
• Agility (Time to Deliver)
• Elasticity – scale up/down quickly
– Auto Scaling is critical
• Service simplification (RDS/Elasticache/ELB)
• Professional development for OPS Team
– Physical (Datacenter/Network focus) vs. Virtual (DevOps focus)
65. Cloud Insights – Lessons Learned
• Reliability & performance consistency vary
• Stuff breaks often
– Develop an “anti-fragile” mindset; build to anticipate failure
• Cost-predictability still elusive
• Orphaned servers
– Easy to create; must constantly clean up
• Large-scale monitoring is hard
– No silver bullet yet
66. Thank You
• Thanks to the GREE OPS & Engineering Teams!
eduardo.saito@gree.net
nick.dor@gree.net
• We’re Hiring DevOps Team Members!!
http://gree-corp.com/jobs
67. Please give us your feedback on this presentation
MBL303
As a thank you, we will select prize winners daily for completed surveys!
Editor's Notes
Tip #1 is to automate the role of the NOC.
Traditionally, some companies have a NOC team whose job is to:
- collect all the alerts from the different monitoring systems
- do the triage
- assess the severity of the alert
- follow run-books
- escalate the issue to the Ops or Dev teams when needed
This model has two problems:
- by the time the right person is involved in troubleshooting, the incident has already been going on for 10-15 minutes
- you need to staff and maintain a NOC team
We learned that a NOC function can be mostly automated using services like Pagerduty.
But for that to be automated, you need to do some of the triage at the source of the alerts:
- critical alerts should go to Pagerduty
- non-critical alerts can be sent by email and handled later during business hours
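The triage rule above can be sketched as a small routing function. This is only an illustration: the `Alert` fields and the channel names "pager" and "email" are assumptions, and no real Pagerduty or mail API is called.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Alert:
    source: str     # monitoring system that raised the alert (hypothetical field)
    summary: str
    critical: bool  # set at the source, so no human triage is needed later


def channel_for(alert: Alert) -> str:
    """Critical alerts page someone now; the rest wait for business hours."""
    return "pager" if alert.critical else "email"


def triage(alerts: Iterable[Alert]) -> Dict[str, List[Alert]]:
    """Group alerts by the channel they should be sent to."""
    buckets: Dict[str, List[Alert]] = {"pager": [], "email": []}
    for alert in alerts:
        buckets[channel_for(alert)].append(alert)
    return buckets
```

The important design point is that severity is decided where the alert is generated, which is what removes the need for a person in the middle.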
Tip #2 is to add the developers on-call in parallel to Ops.
This workflow is common in many companies: Ops gets all the pages, tries to resolve them, and may eventually escalate the issue to Dev if it’s a problem in the application code.
The problem here is similar to the one we have when using a NOC…
Ops is still doing triage, deciding what is a system-level alert that they can resolve themselves, and escalating manually when it’s an application issue that requires developer involvement.
Effectively, Ops would be doing part of the NOC role: triage and escalation.
The Ops engineer would have to make that awkward call to Dev at 4am: “Sorry to call you so late at night… your game is down, and I need you to get online and help me fix it…”
Following the generic DevOps recommendation, some companies started to put Dev on the Ops on-call rotation,
so that Dev can also “feel the pain” and help fix the root cause of the alerts, but that doesn’t work well when you have multiple projects (in our case, games) and each dev knows only one of the games.
Instead of putting Dev in the Ops on-call rotation, we found that we had to have the Dev team paged at the same time as Ops.
And each Dev team would receive only the alerts from the application it developed.
GREE currently has about 20 games, counting the Android and iOS versions.
On the development side, each game has a server-side development team and usually two client-side teams: one for Android and one for iOS.
Application-level alerts page the matching team: for instance, an Android client crash would page the Android client-side developers of that game.
And that leads us to tip #3 – you need to standardize and separate clearly who is the right team to handle which alert.
At GREE, we ask each game team to provide us with 3 different health-checks.
The first one is a simple check for the elastic load balancer. This is used to automatically remove the server from rotation if it breaks. No one is paged if that happens, as a new server is created automatically by auto-scale.
The second health-check detects system-level issues, like disk or network problems, and checks backends such as the database and Memcache, including whether replication is in sync.
If it fails, it alerts *ONLY* the Ops team, as Ops is able to resolve this kind of issue without Dev assistance.
The last type of health-check is for application-level issues, usually bugs in the application.
Those bugs generate warnings and errors that, if above a certain threshold, alert *BOTH* the Dev and Ops teams, as Ops will likely need Dev assistance to troubleshoot.
The application-level alerts can be further broken down into server-side or client-side, the latter for either the iOS or Android platform.
That way, the alert goes only to the exact team responsible for the part of the code that is malfunctioning.
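The three health-checks map to three paging rules, which can be sketched as below. The team names, the `component` values, and the default threshold are hypothetical; the notes only say that a "certain threshold" applies to application warnings and errors.

```python
from enum import Enum
from typing import List


class Check(Enum):
    LOAD_BALANCER = "load_balancer"  # server is pulled from rotation, nobody paged
    SYSTEM = "system"                # disk, network, DB, Memcache, replication
    APPLICATION = "application"      # warnings and errors from application bugs


def teams_to_page(check: Check,
                  game: str,
                  component: str = "server",  # "server", "ios" or "android"
                  error_count: int = 0,
                  threshold: int = 100) -> List[str]:
    """Return the teams to page for a failed health-check."""
    if check is Check.LOAD_BALANCER:
        # Auto-scale replaces the broken server, so no human is needed.
        return []
    if check is Check.SYSTEM:
        # Ops can resolve system-level issues without Dev assistance.
        return ["ops"]
    # Application-level: page only once errors exceed the threshold.
    if error_count <= threshold:
        return []
    # Page Ops plus the one Dev team that owns this part of this game.
    return ["ops", f"{game}-{component}-dev"]
```

For example, `teams_to_page(Check.APPLICATION, "gameA", "android", error_count=500)` returns Ops together with the hypothetical `gameA-android-dev` team, and no other game's developers are woken up.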
Replace with photos
Add *many red* dashboard to illustrate what a major outage looks like
5) - Service Desk / Monitoring
- show before / after
- before: traditional escalation model - ops do triage, escalate to engineering manually, need to look-up list of contacts per game
- after: devops, page engineering directly for app-level alerts / page correct engineer for each game
- urgent: page via email + Pagerduty integration
- non-urgent: open ticket via email + Jira integration
- show diagram with workflow of email/pingdom/zabbix alerts for email (non-urgent)/pagerduty (urgent)
- show in the same diagram human request, as above
- show % of alerts in each bucket: app / system / email
- Incident Management + Problem Management
- System-level monitoring - alerts go to Ops
- Application-level monitoring - alerts go to Ops and Devels
- Availability Management
- Pingdom (external) + Zabbix (internal) + Pagerduty
- Skypebot (mention hackathon origin)
- Status dashboard screenshot (mention hackathon) – add screenshots status dashboard green/one-red/many-red
- Monday meetings to discuss previous week's incidents
- chart: number of email alerts decreases over time / warnings and errors decrease
- RCA meetings for major incidents
(GLUE? for transition between topics - put Release Mgmt first, and glue with Incident Management when things go awry / or other way - Incident caused by a Release, then transition to Release)
Tip #4 is to put a Big Status Dashboard near your Ops team.
A Big Dashboard allows us to very quickly view the status of all of our games.