Puppet at Google
     Gordon Rowell
Puppet Camp Sydney 2013
     gordonr@google.com

Non-Goals

Not here to talk about

● Hiring practices
● Release schedules
● Puppet configs
● Monitoring
● Compliance
● Auditing
● ...


See also Jason Wright's talk from PuppetConf 2011

Background

Puppet at Google is offered as an infrastructure service

● Run by a Site Reliability Engineering (SRE) team
● Customers are OS teams
● Does not manage Google's customer-facing infrastructure
  (search, Gmail, etc.)!
● Manages internal laptops, desktops and servers

How Many Nodes?

Clients:
 ● "Lots" of Mac desktops and laptops
 ● "Lots" of Ubuntu desktops, laptops and servers
 ● "Some" others

Servers:
 ● "Tens" of puppet config servers
 ● "Units" of puppet CAs
 ● Deployed in five globally distributed VIPs
 ● Clients use Anycast to find closest "server"

Scaling is fun

● We don't deploy "a server"
  ○ Servers break, power fails
  ○ Clients/DNS need to be reconfigured

● We don't deploy "a cluster"
  ○ Networks break, servers break, power fails
  ○ Clients/DNS need to be reconfigured

● We deploy redundant clusters
  ○ Attempt to send clients to nearest serving cluster
  ○ Anycast means unified client configuration

Load balancing is fun

Do you have enough capacity?
   ● How many backends do you need?
   ● What happens if half of your backends lose power?
   ● What about when half are already out for repairs?
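
The capacity questions above reduce to simple arithmetic. A minimal sketch, assuming hypothetical figures for client count, run interval and per-backend throughput (none of these numbers come from the talk), and assuming that "half lose power while half are already out for repairs" compounds, leaving a quarter of the fleet serving:

```python
import math

def backends_needed(clients, run_interval_s, runs_per_backend_per_s,
                    fraction_lost=0.5, fraction_in_repair=0.0):
    """Return how many backends to provision so the survivors still cope.

    All parameters are illustrative assumptions, not figures from the talk.
    """
    # Steady-state request rate if client runs are spread evenly over time.
    required_rate = clients / run_interval_s
    # Backends needed when nothing is broken.
    healthy = math.ceil(required_rate / runs_per_backend_per_s)
    # Fraction of provisioned backends still serving after both events.
    surviving = (1.0 - fraction_in_repair) * (1.0 - fraction_lost)
    if surviving <= 0:
        raise ValueError("no backends survive; the fleet cannot be sized")
    return math.ceil(healthy / surviving)
```

For example, 7200 clients running once an hour against backends that each handle 0.5 runs per second need 4 healthy backends, 8 if half may lose power, and 16 if half are also out for repairs.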

How do you send clients to the right cluster?
  ● Client configuration
  ● DNS round-robin (simple global load balancing)
  ● DNS views (give best answer for client IP)
  ● Anycast (portable IP, routed to "nearest" cluster)
  ● Consider: DNS views plus Anycast
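
One way to picture "DNS views plus Anycast" is a lookup that returns a cluster-specific VIP when the client's network is known and falls back to the anycast address otherwise. This is only a sketch of the idea, not the setup described in the talk: the networks, VIPs and the `resolve` helper are all hypothetical, and the addresses come from private and documentation ranges.

```python
import ipaddress

# Hypothetical view table: client network -> VIP of the preferred cluster.
VIEWS = [
    (ipaddress.ip_network("10.1.0.0/16"), "192.0.2.10"),
    (ipaddress.ip_network("10.2.0.0/16"), "198.51.100.10"),
]

# Hypothetical anycast VIP, used when no view matches the client.
ANYCAST_VIP = "203.0.113.10"

def resolve(client_ip):
    """Return the address a view-aware resolver might hand to this client."""
    addr = ipaddress.ip_address(client_ip)
    for network, vip in VIEWS:
        if addr in network:
            return vip  # best answer for a known client network
    return ANYCAST_VIP  # unknown client: let routing pick the "nearest" cluster
```

A client at `10.1.2.3` would receive the first cluster's VIP, while a client outside every listed network would receive the anycast address.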

Anycast is fun

● Anycast is "coarse-grained" load balancing
  ○ It normally sends traffic to closest serving cluster

● Networks break
  ○ Physical issues
  ○ Routing issues
  ○ Configuration issues
  ○ VIP load balancer bugs

● All clients could be sent to the same cluster
  ○ Be ready for that
  ○ Can a single cluster handle worldwide traffic?
  ○ What do you do if you can't?
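
The "everyone lands on one cluster" case can be checked ahead of time. A minimal sketch, assuming hypothetical cluster names and capacities expressed in puppet runs per second (the talk gives no such figures):

```python
def worst_case_report(cluster_capacity, total_demand):
    """For each cluster, report what happens if it receives all traffic.

    cluster_capacity: dict of cluster name -> runs/second it can serve.
    total_demand: worldwide runs/second. Both are illustrative assumptions.
    """
    report = {}
    for name, capacity in cluster_capacity.items():
        # Demand the cluster could not serve if anycast sent it everything.
        shortfall = max(0.0, total_demand - capacity)
        report[name] = {"can_absorb": shortfall == 0.0, "shortfall": shortfall}
    return report
```

Any cluster reported with `can_absorb` set to `False` is one where a plan is needed before the routing incident happens.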

Puppet problems: Thundering herds

● "Lots" + "lots" + "some" == "thundering herds"

● What if they all want to do a puppet run?

● What about every hour?

● What about every five minutes?

● Masterless puppet is being considered
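
A common way to soften a thundering herd is to give every client a stable offset within the run interval, similar in spirit to the Puppet agent's `splay` setting. The sketch below is not from the talk; `run_offset` is a hypothetical helper that derives the offset from a hash of the hostname, so each client always runs at the same point in the interval while the fleet as a whole is spread out.

```python
import hashlib

def run_offset(hostname, run_interval_s):
    """Return a stable offset in [0, run_interval_s) seconds for this host.

    Hypothetical helper: hashing the hostname keeps the offset the same
    across reboots and spreads different hosts roughly evenly.
    """
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % run_interval_s
```

Shortening the interval from an hour to five minutes still multiplies the total load by twelve; spreading the runs only removes the spikes.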

Puppet problems: Release tracks

● OS releases have unstable, testing, stable branches
  ○ Maintained by OS platform teams

● Addons also have unstable, testing, stable branches
  ○ Maintained by service owners

● Using different tracks for OS and addons is hard
  ○ However, that's common, e.g. when testing a new addon release
  ○ Puppet's global namespace is part of the problem

Puppet problems: Namespaces

● Lots of developers moving fast == conflicts

● Conflicts mean surprises

● Qualify everything

● Testing with rspec-puppet helps to catch issues early
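
Name conflicts of this kind can also be spotted mechanically before they surprise anyone. The sketch below is an assumption-laden illustration, not a tool mentioned in the talk: `find_conflicts` is a hypothetical helper that scans manifest text for `class` and `define` definitions and reports any name defined in more than one file. It uses a simple regular expression rather than a real Puppet parser, so it is a rough check only.

```python
import re
from collections import defaultdict

# Matches "class foo", "class foo::bar" or "define foo::baz" at line start.
# Resource-style declarations such as "class { 'foo': }" do not match.
DEFINITION = re.compile(r"^\s*(?:class|define)\s+([a-z][\w:]*)", re.MULTILINE)

def find_conflicts(manifests):
    """Return {name: [paths]} for names defined in more than one manifest.

    manifests: dict of file path -> manifest text (hypothetical input shape).
    """
    seen = defaultdict(list)
    for path, text in manifests.items():
        for name in DEFINITION.findall(text):
            seen[name].append(path)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}
```

Two teams that both define an unqualified `ntp` class would be reported, while `teama::ntp` and `teamb::ntp` would not, which is the practical payoff of qualifying everything.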

Questions?

     Gordon Rowell
     gordonr@google.com
