How To Make Life Suck
       Less!
    (when building scalable systems)

       Bradford Stephens
    c: www.DrawnToScaleHQ.com
       b: www.roadtofailure.com
           t: @lusciouspear
About Me

• Founder, Drawn to Scale. Lead Engineer,
  Visible Technologies
• CS Degree, University of North FL
• Former careers in politics, music, finance,
  consulting
Drawn to Scale

• Building the “Big Data” platform: ingestion,
  processing, storage, search
• Products coming: Big Log, Big Search
  (faceted), Big Message...
Topics

• Overview
• Operations
• Engineering
• Process
Everything Changes
      with Big Data

• Bar is set higher: a previously niche field,
  few standard stacks (like LAMP)
• You need to have better engineering for
  minimum success
Scalability Matters

• “Web-Scale” data is unstructured and
  exponentially interconnected
• Social Media: Catalyst
• All data is important
• Data Size != Business Size
The Traditional DB
• Excels with highly structured, normalizable
  data
• Non-Linear Scale Cost
• More data = fewer features
• Optimized for single-node
• 90% of utility is 5% of capability
Ergo, Distributed

• Optimize for the problem, not a Swiss-Army
  knife
• Shared-nothing, commodity boxes
• Linear scale cost
The State of Things

• Order changed from 20 years ago:
• Cust. Experience is paramount
• Engineers are precious
• Fast I/O is expensive
• Storage is cheap
Recovery-Oriented
      Computing

1. Seamlessly Partitioned
2. Synchronously Redundant
3. Heavily Monitored
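The "heavily monitored" principle can be sketched as a simple heartbeat check. This is a minimal illustration, not a production monitor; node names and the timeout are hypothetical:

```python
import time

# Hypothetical cluster state: node name -> timestamp of its last heartbeat.
last_heartbeat = {"node-1": time.time(), "node-2": time.time() - 120}

HEARTBEAT_TIMEOUT = 30  # seconds of silence before a node is suspect

def suspect_nodes(heartbeats, now=None, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose last heartbeat is older than the timeout."""
    now = time.time() if now is None else now
    return [node for node, ts in heartbeats.items() if now - ts > timeout]

# node-2 last reported two minutes ago, so it gets flagged for recovery.
print(suspect_nodes(last_heartbeat))  # ['node-2']
```

In a real cluster the flagged node would be fenced or restarted automatically; the point is that recovery starts from detection, not from a pager.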
Operations

Moving the Box: Sysadmin ratio from 2:1 to
            200:1 to 2000:1


   (yes devs, you’ll care about this too)
Ops vs. Eng

• Engineers build, Ops manages
• Fixing problems: devs code+automate, ops
  hire
• Want something fixed? Call devs at 2 AM.
Config is Important

• Configuration is not 2nd-class anymore
• Needs to be tackled by Engineers
• New frameworks = months of
  configuration and experimentation
• Chef is a good start, but...
Production = Test

• Surprise! You don’t have a Test environment
  any more.
• Test Cost => Prod Cost
• Anything that’s not your data center is an
  approximation. Switches, cable, power,
  boxes, etc...
You’re Always Testing

• Constantly simulate failures and brownouts
  of boxes, racks, switches...
• “Canary in the Coal Mine”: run a box and
  rack at 175% current load.
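One way to run those constant failure drills is to pick victims at random each round. A minimal sketch; the fleet names and the "kill" action are placeholders for real hosts and a real stop command:

```python
import random

# Hypothetical fleet; in practice these would be real boxes, racks, switches.
fleet = ["rack1-box1", "rack1-box2", "rack2-box1", "rack2-box2"]

def pick_victims(nodes, fraction=0.25, rng=random):
    """Choose a random subset of nodes to kill or brown out in a drill."""
    count = max(1, int(len(nodes) * fraction))
    return rng.sample(nodes, count)

for victim in pick_victims(fleet):
    # Here you would actually stop the process or drop the link.
    print(f"simulating failure of {victim}")
```

Running this on a schedule, rather than waiting for real failures, is what turns "built to fail" from a slogan into a tested property.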
Deployment


• Deploy gradually: 1 box, 2 boxes, 1 rack...
• Code granularly, backwards-compatible
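The staged rollout above can be sketched as a loop that widens the deployment and checks health between stages. Stage sizes and the health check are hypothetical placeholders:

```python
# Boxes per stage: 1 box, 2 boxes, roughly a rack, then the rest.
STAGES = [1, 2, 10, 40]

def deploy(boxes, stages, deploy_fn, healthy_fn):
    """Roll out stage by stage, halting if any stage looks unhealthy."""
    done = 0
    for size in stages:
        batch = boxes[done:done + size]
        if not batch:
            break
        for box in batch:
            deploy_fn(box)
        done += len(batch)
        if not healthy_fn(batch):
            raise RuntimeError(f"rollout halted after {done} boxes")
    return done

deployed = []
total = deploy([f"box-{i}" for i in range(5)], STAGES,
               deploy_fn=deployed.append, healthy_fn=lambda batch: True)
print(total)  # 5
```

The backwards-compatibility bullet matters here: mid-rollout, old and new code are serving traffic side by side, so every change has to tolerate its predecessor.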
Built to Fail
• “It’s working” isn’t binary
• Acting weird? Shoot it.
• Multi-system failure is common: be
  topology aware
• Avoid false negatives: when something’s
  wrong and you don’t know it, you lose
  customer data
• This is empowering!
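Topology awareness can be as simple as never co-locating replicas of the same data on one rack, so a rack-level failure can't take out every copy. A toy sketch with hypothetical node and rack names:

```python
# Hypothetical topology: node name -> rack it lives in.
nodes = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack3"}

def place_replicas(nodes_by_rack, replicas=3):
    """Pick one node per rack until we have enough replicas."""
    chosen, used_racks = [], set()
    for node, rack in nodes_by_rack.items():
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
        if len(chosen) == replicas:
            break
    return chosen

print(place_replicas(nodes))  # ['n1', 'n3', 'n4']
```

Real systems (HDFS, HBase) have richer placement policies, but the principle is the same: assume correlated, multi-system failure and spread accordingly.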
Engineering


This is Systems Software, not Applications
                 Software
This is Hard :(
• Engineering at scale is very different from
  writing a 3-tier webapp
• Care about garbage collection, election
  algorithms, data structures, access patterns,
  etc...
• CS knowledge is required, not a luxury
• DBA/RDBMS skills pretty useless
• CAP is law
Not Everything’s a Table

• Structure your data according to how it
  needs to be used
• Unstructured massive files, graphs, KV-
  stores
• The more your problem narrows, the
  easier it is to scale
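"Structure your data according to how it needs to be used" often means designing keys around the query rather than normalizing into tables. A sketch against a plain dict standing in for an ordered KV store like HBase; the keys and values are hypothetical:

```python
store = {}  # stand-in for an ordered key-value store

def put_post(user, ts, text):
    # Invert the timestamp so a forward scan returns newest posts first.
    store[(user, 10**12 - ts)] = text

def recent_posts(user, limit=10):
    keys = sorted(k for k in store if k[0] == user)
    return [store[k] for k in keys[:limit]]

put_post("alice", 1000, "older post")
put_post("alice", 2000, "newer post")
print(recent_posts("alice"))  # ['newer post', 'older post']
```

The narrower the query ("recent posts for one user"), the simpler the key design, and the easier the system is to shard and scale.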
Big Data is BIG

• Imagine a single test pass taking hours
• What works at 1.5 TB may fail at 10 MB or
  2 TB
• Many tests, simple code
• Soft Delete Only
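"Soft delete only" means rows get flagged, never physically removed, so an operator error or a bad job can be undone. A minimal sketch with a hypothetical schema:

```python
import time

# Rows keep a deleted_at marker instead of being destroyed.
rows = {"order-1": {"total": 42, "deleted_at": None}}

def soft_delete(key):
    rows[key]["deleted_at"] = time.time()

def live_rows():
    """Reads see only rows that have not been soft-deleted."""
    return {k: v for k, v in rows.items() if v["deleted_at"] is None}

soft_delete("order-1")
print(live_rows())        # {} -- hidden from reads
print("order-1" in rows)  # True -- still recoverable
```

At scale, a hard delete that turns out to be wrong is unrecoverable; a tombstone costs only cheap storage.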
“No, I won’t give you a
        repro”

• Often impossible to repro a bug on
  demand in a cluster
• Either fix your logging or your bug
• Log everything (we have a product for this!)
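If the log is the only repro you will ever get, make it machine-parseable so events from hundreds of nodes can be correlated after the fact. A sketch of structured JSON logging; the event and field names are hypothetical:

```python
import json
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("node")

def log_event(event, **fields):
    """Emit one self-describing JSON line per event."""
    line = json.dumps({"event": event, **fields})
    log.info(line)
    return line

log_event("region_split", region="users,0040", node="n7", duration_ms=812)
```

One JSON object per line is boring on purpose: it greps, it parses, and it survives being shipped to a central store.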
Avoiding Impedance
       Mismatch

• High vs. Low Latency vs. Throughput
• A lot of data eventually, or a little now
• MapReduce vs. Sharding/Indexing
Simple Workflow

• Hadoop: Collect → Semantic Analysis,
  Unstructured Analysis
• Hadoop + HBase: Store in HBase →
  Structured Analysis; Indexing → Store in
  Hadoop
• Lucene + Solr + Katta: Pull Indexes →
  Load/Replicate Shards → Search
Biz + Process


The softer side of distributed computing
Hiring


• Plan for more engineers, fewer ops
• Be aware of “context switch cost” when
  training RDBMS-folks
It’s Not Just Coding
• Be aware of research cost
• Much more time spent experimenting, not
  coding
• Coding all this from scratch is horrific
• Nailing together 10+ OSS projects is a pain
• Open source anything not “Secret sauce”
Solve your Core
         Problem

• “Making your own electricity doesn’t create
  better tasting beer”
• Plan to use an end-to-end platform in the
  future (hint: ours!)
In Summary

• Plan for everything to fail
• Test constantly in production
• Systems Software requires Computer
  Science
• Don’t build it if you don’t have to
Thanks!

• Y’all
• Road to Failure Readers
• James Hamilton, Amazon/MS
• Bradford Cross, Flightcaster
• Ryan Rawson, HBase/Stumbleupon
Useful Resources

• www.roadtofailure.com
• www.highscalability.com
• perspectives.mvdirona.com
