Redapt @ Splunk .conf 2013: Splunk in the Hyperscale Private Cloud
4.              My “Day 1”   + 4 Years             Reference
Infrastructure  ~70 Servers  Tens of Thousands     bit.ly/15wBfMp
DAU             ~100,000     72,000,000            www.consulgamer.com/tag/zynga/
Servers/Game    1 - 20       50 to > 1,000         bit.ly/ahVaYI
Employees       ~40          2,916                 investor.zynga.com/faq.cfm
Biz Analytics   0            24.5T rows @ 1.4 PB   bit.ly/L58opy
Splunk>         N/A          10TB+ / 50B events    bit.ly/17f4kj2
5. Traditional Infrastructure
•Variety of Server Types in Retail DC
•Order/Rack/Stack
•Puppet Config Management
Public Cloud
•Amazon EC2 + RightScale
•Scaling Everything
•SRE/NOC
Private: “zCloud”
•CloudStack + RightScale = AutoScale
•CMDB
•3 Server SKUs
•Centralized Services
7. [Diagram: three Indexing Clusters, each with 12 indexers (Idx). Customers>: FV, CV, ZP, MW, Ops, $, HC, EA, WWF, Cust SVC, SWF, DS, DZ, CWF, SWF, HWF. Search Heads>: 12 nodes labeled SH.]
19. Cloud Services
Cloud Workshop
•Identify Workloads
•Design/Architect Compute, Storage, Network
•Tailor Service/Support Needs
•SOW
Installation@Redapt
•Network
•Storage
•Compute
•Hypervisors
•Orchestration Validation
•Customer Remote Tour/Test
•Shipment to Customer Prem
•Onsite Training
Application Migration
•Rearchitect Legacy Applications into “aaS” architectures
•Migrate applications from Public to Private or Hybrid Clouds
Project Management
Integration Services
•Procurement
•Integrated Racks ready to go.
Not Day 1 architecture; this evolved rapidly. Multiple indexing clusters were driven by the lesson that larger clusters can diminish performance when a single indexing node performs abnormally slowly. As infrastructure pivoted into zCloud, forwarders were instrumented in VPC. Dedicated, isolated infrastructure drives compliance and governance around Payments within a PCI cluster.
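One way to see why a single slow indexer hurts larger clusters more, as a rough sketch only: assume a search fans out to every indexer in a cluster and waits for the slowest one, and assume each indexer is independently slow with some small probability. Both assumptions, the function name `p_any_slow`, and the 1% figure are illustrative and do not come from the talk.

```python
def p_any_slow(nodes: int, p_slow: float) -> float:
    """Probability that at least one of `nodes` indexers is slow.

    Assumes each indexer is slow independently with probability `p_slow`
    (an illustrative model, not a measured Splunk behavior).
    """
    return 1.0 - (1.0 - p_slow) ** nodes


# Compare one 12-indexer cluster with a hypothetical single 36-indexer cluster.
for n in (12, 36):
    print(n, round(p_any_slow(n, 0.01), 3))
```

Under this model, splitting indexers into several smaller clusters limits how many searches a single misbehaving node can drag down.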
Zynga was very metrics oriented. Start quickly – don’t boil the ocean. Stay out of the way of the business. Endeavor to do more than that. P1…1000’s of machines – Puppet absolutely necessary. It doesn’t matter which configuration management tool you use, just use one. Correlations – CS ticket volume, error rates, release events. Nagios tells you something is wrong, but not where. Splunk instead of Nagios “Artificial Intelligence”.
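The correlation idea above can be sketched with a plain Pearson coefficient between two time series, for example hourly error counts and hourly CS ticket counts. The function name `pearson` and the sample numbers are made up for illustration; they are not Zynga data, and in practice the series would come from Splunk searches.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical hourly counts around a release (illustrative values only).
error_rate = [12, 11, 13, 40, 55, 38, 15]
cs_tickets = [3, 4, 3, 9, 14, 10, 5]
print(round(pearson(error_rate, cs_tickets), 2))
```

A coefficient near 1 suggests the two series move together, which is a prompt to look at what changed (such as a release event), not proof of cause.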
How to relate this to your environments… These kinds of impacts are available at any scale. Average out 72M players who play 20 minutes/day and you have 1,000,000 concurrents. Releases generally cannot be reverted (buy a purple cow, can’t take it back).
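The concurrency figure follows from spreading total play time evenly across the day. A minimal check of that arithmetic, assuming uniform load over 24 hours (real traffic peaks, so actual peak concurrency would be higher); the function name is illustrative:

```python
MINUTES_PER_DAY = 24 * 60  # 1,440


def average_concurrents(daily_players: int, minutes_per_player: float) -> float:
    """Average simultaneous players if play time is spread evenly over a day."""
    return daily_players * minutes_per_player / MINUTES_PER_DAY


print(average_concurrents(72_000_000, 20))  # 1000000.0
```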