Design for Scale / Surge 2010

Copyright © 2010 Opscode, Inc - All Rights Reserved
Christopher Brown, VP, Engineering
‣ cb@opscode.com
‣ @skeptomai
‣ www.opscode.com
Design for Scale

Who am I?
• Amazon EC2
• Microsoft Edge Computing Network
• Opscode

Google, Amazon, Microsoft
built their own tools

Almost everyone else is here...
... inexperienced or poorly equipped for the world in which we now operate.

The Method
Bootstrapping
Configuration
Command & Control
Nanite!
http://www.flickr.com/photos/wonderlane/2090966628/sizes/l/

Got it?
Defining the cloud is like this...

Origin Myth of EC2

Dynamism
...not about excess capacity...
• Disintermediation: developers can freely experiment
This is what you are paying for:
• Isolation: applications safely co-exist
• Utilization: best use of expensive resources

You are not that BIG
• LAMP can scale on generic architecture
• 2008 - Facebook has over 800 memcached servers, with 28 terabytes of RAM
• 2010 - GitHub has 16 physical machines, 128 cores, 288 GB RAM
• Don’t design for A Million Users
• Ship early, Ship ugly, Ship often!

EC2 Design Principles
• Minimize management footprint
• Run in VMs just like customers.
• Forced to analyze what must run in privileged space
• “Harden everything” means separate network traffic inside the datacenter - customers and management run there
• True multi-tenancy - Customers run side-by-side
• Design by Fight Club
• “You are not a beautiful and unique snowflake”
• “On a large enough time line, the survival rate for everyone will drop to zero.”
http://www.flickr.com/photos/europedistrict/4058066840/

Primitives
• Simple API, single unit of work
• think of early Unix tools (MH)
• Can compose with other APIs
• Does not define policy / coupling
• Customers will surprise you

APIs, Mashups

Simplify
• Move complexity “up the stack”
• Easier to debug
• “Simple and Open” wins
• OAuth, OpenID
• ATOM, REST
• Example: EC2 Metadata - HTTP
http://www.flickr.com/photos/jfseesthings/4293062294/sizes/l/
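
The EC2 Metadata example is a good illustration of "simple and open": an instance learns about itself with a plain HTTP GET. A minimal Python sketch follows. The link-local address 169.254.169.254 and the /latest/meta-data/ path are the documented EC2 endpoint; the helper names `metadata_url` and `fetch_metadata` and the injectable `opener` parameter are assumptions made for illustration. Newer instances may require a session token first, which is not shown here.

```python
from urllib.request import urlopen

# Link-local address served to every EC2 instance.
METADATA_BASE = "http://169.254.169.254/latest/meta-data/"


def metadata_url(key: str) -> str:
    """Build the URL for one metadata key, e.g. 'instance-id'."""
    return METADATA_BASE + key.lstrip("/")


def fetch_metadata(key: str, opener=urlopen, timeout: float = 2.0) -> str:
    """Fetch one metadata value with a plain HTTP GET.

    `opener` (hypothetical parameter) is injectable so the function can be
    exercised away from a real instance.
    """
    with opener(metadata_url(key), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

On an instance, `fetch_metadata("instance-id")` would return that instance's ID. Nothing beyond an HTTP client is needed, which is the point of the slide.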

Cost
• CapEx versus OpEx
• The Cloud is not “Cheaper”
• Do you have money, time, or experience?
What are you willing to pay for?

Power

Nobody ever imagined a band of
Orcs would steal a database table
Charles Stross - Halting State

MTTF & MTTR
Understanding how, when and why things fail is great ... but
If your Mean Time to Recover exceeds the time value of your data, your business is DEAD
http://www.flickr.com/photos/dierken/948171048/sizes/z/
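
The relationship between the two numbers can be sketched in a few lines of Python. The steady-state availability formula MTTF / (MTTF + MTTR) is standard; the `data_value_window_hours` parameter is a hypothetical stand-in for "the time value of your data", which the slide does not quantify.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def recovers_in_time(mttr_hours: float, data_value_window_hours: float) -> bool:
    """True if recovery completes before the data loses its value.

    `data_value_window_hours` is an assumed, business-specific number.
    """
    return mttr_hours <= data_value_window_hours
```

With an MTTF of 999 hours and an MTTR of 1 hour, availability is 0.999. Halving MTTR improves the figure about as much as doubling MTTF, and recovery time is usually the easier of the two to engineer.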

Testing
• Test with production-like dataset and performance
• Don’t do “Design by Laptop”
• A/B Testing
• API versioning

Pull the Plug
• Create test environment
• Pull the plug
• Document
• Pull the plug again!
http://www.flickr.com/photos/rosipaw/5033284534/sizes/m/in/photostream/

• Vertical vs Horizontal Scale
• Availability
• Reliability
• 99% vs 99.x% per unit?
Theo vs Morpheus
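
The "99% vs 99.x% per unit?" question can be worked through numerically. This is a minimal Python sketch, assuming unit failures are independent (a simplification: correlated failures make the real number worse). The function names are illustrative and not from the talk.

```python
def parallel_availability(unit_availability: float, units: int) -> float:
    """Probability that at least one of `units` independent units is up."""
    if not 0.0 < unit_availability <= 1.0 or units < 1:
        raise ValueError("need 0 < unit_availability <= 1 and units >= 1")
    return 1.0 - (1.0 - unit_availability) ** units


def units_needed(unit_availability: float, target: float) -> int:
    """Smallest number of redundant units that meets `target` availability."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    n = 1
    while parallel_availability(unit_availability, n) < target:
        n += 1
    return n
```

Under that independence assumption, two 99% units together reach roughly 99.99%, which is the usual argument for scaling horizontally on ordinary units rather than paying for a more reliable single unit.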

Free your mind...
• Availability
• Reliability
Theo vs Morpheus
You are not Theo
You’re probably not Morpheus either

Availability
• For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response.
• “Read globally, write locally” with inconsistent cache
• Service Level Agreements, even (especially?) internally

Think Globally, Act Locally
• Global but inconsistent aggregate view
• Local action where data is authoritative
• Autonomy
• “Rightsizing” your failure domain
http://www.flickr.com/photos/28634332@N05/3872137437/sizes/m/in/photostream/

Distributed Systems Design
• Avoid execution caching
• “Don’t lie, don’t retry”
• Embrace failure
• Don’t block the client
• Avoid internal policy
• Ensure the system makes forward progress

Embrace Failure
• It’s OK to apologize
• It’s better to completely fail for some users than penalize all of them
• The Web is all about “Hit Refresh”

Admission Control
• Distributed Throttling
• Staged / Pipeline with back pressure
• Measure scalability at each stage
• Degraded performance
• Make progress for admitted requests
• At odds with “stateless” / session-less
http://www.flickr.com/photos/jayneandd/4450623309/sizes/m/in/photostream/
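
One way to picture a stage with back pressure is a bounded inbox that refuses new work when full instead of queueing without limit. The sketch below is a single-process illustration using Python's standard `queue` module; the `Stage` class and its method names are hypothetical, and a real distributed throttle would need shared state across front ends.

```python
import queue


class Stage:
    """One pipeline stage with a bounded inbox (illustrative only)."""

    def __init__(self, name: str, capacity: int):
        self.name = name
        self.inbox = queue.Queue(maxsize=capacity)
        self.rejected = 0

    def admit(self, request) -> bool:
        """Accept the request if there is room; otherwise push back."""
        try:
            self.inbox.put_nowait(request)
            return True
        except queue.Full:
            self.rejected += 1  # caller should fail fast, not wait
            return False

    def drain(self, handler, limit=None) -> list:
        """Make progress on admitted requests, up to `limit` of them."""
        results = []
        while limit is None or len(results) < limit:
            try:
                request = self.inbox.get_nowait()
            except queue.Empty:
                break
            results.append(handler(request))
        return results
```

Requests that were admitted always complete, while excess requests are rejected at the door. Counting `rejected` per stage is one simple way to measure where the pipeline stops scaling.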

Make Forward Progress
• MVCC, vector clocks, & reconciliation
• Don’t resurrect objects
• always go forward, never go back
• "name" is a property of an object, not its unique key
• Break the link, garbage collect later
• Model “degraded service” performance
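
Vector clocks are one of the mechanisms named above for detecting concurrent updates so they can be reconciled rather than silently overwritten. A minimal Python sketch, representing a clock as a dict from node name to counter; the function names are illustrative and the reconciliation of the actual values is left to the application.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a new clock with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a: dict, b: dict) -> str:
    """Classify a relative to b: 'equal', 'before', 'after' or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: the versions must be reconciled


def merge(a: dict, b: dict) -> dict:
    """Clock for a reconciled version: element-wise maximum of both."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

The merged clock dominates both inputs, so the reconciled version always moves forward and never resurrects an older state.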

Request Signing
• Stateless - no session tracking to lose or to purge later
• X.509 - only public information on front-end boxes. More secure against exploit
• Shared secret - faster, smaller signature but requires secret info close to request front-end
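
The shared-secret option can be sketched with an HMAC over a canonical form of the request. The canonical string format below (method, path, sorted parameters, timestamp) is a hypothetical layout chosen for illustration, not the signing scheme of any particular service, and it skips URL-encoding for brevity. `hmac` and `hashlib` are from the Python standard library.

```python
import hashlib
import hmac


def canonical_request(method: str, path: str, params: dict, timestamp: str) -> bytes:
    """Serialize the request deterministically (hypothetical format)."""
    query = "&".join(f"{key}={params[key]}" for key in sorted(params))
    return "\n".join([method.upper(), path, query, timestamp]).encode("utf-8")


def sign(secret: bytes, method: str, path: str, params: dict, timestamp: str) -> str:
    """Client side: HMAC-SHA256 signature over the canonical request."""
    message = canonical_request(method, path, params, timestamp)
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, method: str, path: str, params: dict,
           timestamp: str, signature: str) -> bool:
    """Server side: recompute and compare in constant time. No session needed."""
    expected = sign(secret, method, path, params, timestamp)
    return hmac.compare_digest(expected, signature)
```

Every request carries its own proof, so the front end keeps no session state. The trade-off named on the slide is visible in `verify`: the secret has to be available wherever requests are checked.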

Measure, Monitor, Respond
• Save *everything* *forever*
• Histograms / Pareto Chart
• tp99.9, tp99, and tp90
• ignore tp50, “average”
• http://en.wikipedia.org/wiki/Control_chart
• http://www.newrelic.com/
• http://www.splunk.com/
• skewness, kurtosis
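
To see why the tail percentiles matter more than tp50 or the average, here is a minimal Python sketch of a nearest-rank percentile. Nearest-rank is one of several common percentile definitions and is an assumption here; the function name `tp` simply mirrors the slide's notation.

```python
import math
from fractions import Fraction


def tp(samples, percentile):
    """Nearest-rank percentile: the smallest sample such that at least
    `percentile` percent of all samples are at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in (0, 100]")
    ordered = sorted(samples)
    # Fraction avoids float rounding when computing the rank (e.g. 99.9%).
    rank = math.ceil(Fraction(str(percentile)) * len(ordered) / 100)
    return ordered[rank - 1]


# 99 fast requests and one very slow one, in milliseconds (made-up data).
latencies_ms = [10] * 99 + [5000]
```

For this made-up data, `tp(latencies_ms, 50)` is 10 and the mean is about 60, both of which hide the 5000 ms request that `tp(latencies_ms, 99.9)` reports.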

Control Chart
• Day over Day
• Same Day, Year over Year
• Confidence Intervals
“Shewhart stressed that bringing a production process into a state of statistical control, where there is only common-cause variation, and keeping it in control, is necessary to predict future output and to manage a process economically.”
• http://en.wikipedia.org/wiki/Control_chart
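
A simplified control chart can be sketched as a center line with limits at three standard deviations, computed from a baseline such as the same hour on previous days. This Python sketch uses the sample standard deviation of the baseline, which is a simplification of the classic Shewhart individuals chart (that one estimates spread from moving ranges). The function names and the sample numbers are illustrative.

```python
import statistics


def control_limits(baseline, sigmas: float = 3.0):
    """Return (lower, center, upper) control limits from baseline samples."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # needs at least two samples
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(baseline, observations, sigmas: float = 3.0):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, _, upper = control_limits(baseline, sigmas)
    return [(i, x) for i, x in enumerate(observations) if x < lower or x > upper]
```

Points inside the limits are treated as common-cause variation and left alone. Only the points that `out_of_control` returns warrant a response, which keeps alerting tied to the process's own history rather than a fixed threshold.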

Periodicity
SLA, Variance, Troubleshooting

Data Taxonomy
• Precious
• Cachable
• Expensive
• Cheap

Consistency
• Authoritative vs. Consultative
• is_authorized? vs list group

Performance
• Call length
• Cyclomatic Complexity
• Request ID flow
• tension between unit performance and scalability

Failure Domains
• EC2 “droplets”
• EC2 DNS
• Coordinator zones

Still with me?

Successes
• Sharable “AMI”s
• Metadata (Simple and open again)
• Open API (think Eucalyptus)
• No API throttling
• Primitives
• Pay-as-you-go
• Free traffic between S3 and EC2
• Data and Compute together

Failures
• SOAP makes little girls cry
• Amazon Web Services, circa 2006, was > 75% REST or Query
• SOAP well supported by commercial vendors, with their libraries
• Still *Way* too hard to use.
• Commodity business. Driving the bottom out of cost causes quality to suffer.
• API vs UI?, User Experience in general
• IaaS (Infrastructure as a Service) is insufficient by itself
a hangman's noose. EC2, and the other offerings,
