Make Life Suck Less (Building Scalable Systems)

This presentation was given at LinkedIn. It is a collection of guidelines and wisdom for re-thinking how we do engineering for massively scalable systems. Useful for anyone who cares about Big Data, Distributed Computing, Hadoop, and more.



  • 1. How To Make Life Suck Less! (when building scalable systems) Bradford Stephens c: b: t: @lusciouspear
  • 2. About Me • Founder, Drawn to Scale. Lead Engineer, Visible Technologies • CS Degree, University of North FL • Former careers in politics, music, finance, consulting
  • 3. Drawn to Scale • Building the “Big Data” platform: ingestion, processing, storage, search • Products coming: Big Log, Big Search (faceted), Big Message...
  • 4. Topics • Overview • Operations • Engineering • Process
  • 5. Everything Changes with Big Data • Bar is set higher: a previously niche field, few standard stacks (like LAMP) • You need to have better engineering for minimum success
  • 6. Scalability Matters • “Web-Scale” data is unstructured and exponentially interconnected • Social Media: Catalyst • All data is important • Data Size != Business Size
  • 7. The Traditional DB • Excels with highly structured, normalizable data • Non-linear scaling cost • More data = fewer features • Optimized for single-node use • 90% of the utility comes from 5% of the capability
  • 8. Ergo, Distributed • Optimize for the problems, no Swiss-Army knife • Shared-nothing, commodity boxes • Linear scale cost
  • 9. The State of Things • Order changed from 20 years ago: • Cust. Experience is paramount • Engineers are precious • Fast I/O is expensive • Storage is cheap
  • 10. Recovery-Oriented Computing 1. Seamlessly Partitioned 2. Synchronously Redundant 3. Heavily Monitored
  • 11. Operations Moving the Box: Sysadmin ratio from 2:1 to 200:1 to 2000:1 (yes devs, you’ll care about this too)
  • 12. Ops vs. Eng • Engineers build, Ops manages • Fixing problems: devs code+automate, ops hire • Want something fixed? Call devs at 2 AM.
  • 13. Config is Important • Configuration is not 2nd-class anymore • Needs to be tackled by Engineers • New frameworks = months of configuration and experimentation • Chef is a good start, but...
  • 14. Production = Test • Surprise! You don’t have a Test environment any more. • Test Cost => Prod Cost • Anything that’s not your data center is an approximation. Switches, cable, power, boxes, etc...
  • 15. You’re Always Testing • Constantly simulate failures and brownouts of boxes, racks, switches... • “Canary in the Coal Mine”: run a box and a rack at 175% of current load.
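The "constantly simulate failures" point can be sketched as a toy failure-injection loop: kill random nodes in a simulated shared-nothing cluster and assert that the service can still answer. The `Cluster` class and its methods are illustrative stand-ins, not any real chaos-testing API.

```python
import random

# Hypothetical sketch of constant failure injection: a simulated
# cluster loses random boxes and must keep serving.

class Cluster:
    def __init__(self, nodes):
        self.alive = set(nodes)

    def kill_random_node(self):
        # Simulate a box (or rack) failure.
        if self.alive:
            self.alive.discard(random.choice(sorted(self.alive)))

    def can_serve(self, replicas=2):
        # Shared-nothing replication: we can serve as long as
        # enough replicas survive.
        return len(self.alive) >= replicas

cluster = Cluster([f"node-{i}" for i in range(10)])
for _ in range(3):          # routine, low-grade failure injection
    cluster.kill_random_node()
assert cluster.can_serve()  # losing a few boxes must be a non-event
```

Running this injection continuously against real hardware, not just in simulation, is the slide's actual point: production is the only faithful test environment.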
  • 16. Deployment • Deploy gradually: 1 box, 2 boxes, 1 rack... • Code granularly, backwards-compatible
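The gradual rollout on this slide (1 box, 2 boxes, 1 rack...) can be sketched as expanding deployment waves with a health gate between them. `deploy` and `healthy` are hypothetical callbacks, not a real deployment tool's API.

```python
# Sketch of a gradual rollout: deploy in expanding waves and halt the
# moment any already-deployed box looks unhealthy. Backwards-compatible
# code is what makes a halted, half-deployed state safe.

def rolling_deploy(boxes, deploy, healthy):
    """Deploy in waves: 1 box, then 2 boxes, then everything else."""
    waves = [boxes[:1], boxes[1:3], boxes[3:]]
    done = []
    for wave in waves:
        for box in wave:
            deploy(box)
            done.append(box)
        if not all(healthy(b) for b in done):
            return done, False   # stop; old code keeps running elsewhere
    return done, True

# Usage with trivial stand-in callbacks:
deployed, ok = rolling_deploy(
    [f"box-{i}" for i in range(6)],
    deploy=lambda b: None,
    healthy=lambda b: True,
)
```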
  • 17. Built to Fail • “It’s working” isn’t binary • Acting weird? Shoot it. • Multi-system failure is common: be topology-aware • Avoid false negatives: something’s wrong, you don’t know it, and you lose customer data • This is empowering!
  • 18. Engineering This is Systems Software, not Applications Software
  • 19. This is Hard :( • Engineering at scale is very different from writing a 3-tier webapp • Care about garbage collection, election algorithms, data structures, access patterns, etc... • CS knowledge is required, not a luxury • DBA/RDBMS skills are pretty useless • CAP is law
  • 20. Not Everything’s a Table • Structure your data according to how it needs to be used • Unstructured massive files, graphs, KV- stores • The more your problem narrows, the easier it is to scale
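"Structure your data according to how it needs to be used" can be illustrated with a denormalized key-value layout keyed by exactly the query it has to serve, rather than a normalized table. The in-memory dict standing in for a distributed KV store, and the `msgs:<user>` key scheme, are purely illustrative.

```python
from collections import defaultdict

# Sketch of access-pattern-first data modeling: the hot read is
# "latest messages for user X", so key by user and keep the value
# in the order the read wants it. A plain dict stands in for a
# distributed KV store.

kv = defaultdict(list)

def write_message(user, ts, text):
    # Prepend so the list stays in reverse-chronological order;
    # the read below becomes a single key lookup plus a slice.
    kv[f"msgs:{user}"].insert(0, (ts, text))

def latest(user, n=10):
    return kv[f"msgs:{user}"][:n]

write_message("ada", 1, "hello")
write_message("ada", 2, "scaling up")
```

The narrower the query the layout serves, the easier it is to shard and scale, which is the slide's closing point.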
  • 21. Big Data is BIG • Imagine your test passes taking hours • What works at 1.5 TB may fail at 10MB or 2 TB • Many tests, simple code • Soft Delete Only
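The "Soft Delete Only" bullet can be sketched minimally: rows are flagged rather than destroyed, so a buggy batch job can be undone. The row layout and helper names are illustrative, not any particular store's schema.

```python
# Minimal soft-delete sketch, assuming a simple dict-of-rows store:
# deletion is a reversible flag, never a destructive removal.

rows = {}

def soft_delete(key):
    if key in rows:
        rows[key]["deleted"] = True   # flag it, don't destroy it

def get(key):
    row = rows.get(key)
    return None if row is None or row.get("deleted") else row

def undelete(key):
    if key in rows:
        rows[key]["deleted"] = False

rows["a"] = {"value": 1, "deleted": False}
soft_delete("a")
assert get("a") is None          # invisible to readers...
undelete("a")
assert get("a")["value"] == 1    # ...but fully recoverable
```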
  • 22. “No, I won’t give you a repro” • Often impossible to repro a bug on demand in a cluster • Either fix your logging or your bug • Log everything (we have a product for this!)
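"Log everything" implies logs structured enough to diagnose a bug that never reproduces. A hedged sketch: emit machine-parseable records carrying node, request id, and event context. The field names and the in-memory buffer are assumptions for illustration.

```python
import json
import time

# Sketch of structured "log everything": every record carries enough
# context (node, request id, timing, event fields) to reconstruct a
# failure from logs alone, since cluster bugs rarely repro on demand.

def log_event(buf, node, request_id, event, **fields):
    record = {"ts": time.time(), "node": node,
              "request": request_id, "event": event, **fields}
    buf.append(json.dumps(record, sort_keys=True))  # one JSON line per event

logs = []
log_event(logs, "node-3", "req-71", "write_failed",
          table="messages", retries=2)
```

In practice `buf.append` would be a durable sink; the point is that each line is independently parseable and greppable by request id.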
  • 23. Avoiding Impedance Mismatch • High vs. Low Latency vs. Throughput • A lot of data eventually, or a little now • MapReduce vs. Sharding/Indexing
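The MapReduce side of the tradeoff above (high throughput on "a lot of data eventually", versus a sharded index serving "a little now" at low latency) can be shown with a toy in-memory map/shuffle/reduce over word counts. This is a teaching sketch, not the Hadoop API.

```python
from collections import defaultdict

# Toy MapReduce: map each document to key/value pairs, group by key
# (the shuffle), then reduce each group. Batch-oriented and
# high-throughput -- the opposite end of the spectrum from a
# low-latency sharded index lookup.

def map_reduce(docs, mapper, reducer):
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):        # map phase
            groups[key].append(value)         # shuffle: group by key
    return {k: reducer(k, vs) for k, vs in groups.items()}  # reduce phase

counts = map_reduce(
    ["big data", "big search"],
    mapper=lambda doc: [(w, 1) for w in doc.split()],
    reducer=lambda key, vals: sum(vals),
)
# counts == {"big": 2, "data": 1, "search": 1}
```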
  • 24. Simple Workflow [flow diagram] Collect → analysis in Hadoop (unstructured, semantic, and structured) → store in HBase → build indexes in Hadoop (Lucene) → load/replicate index shards to Solr/Katta → Search
  • 25. Biz + Process The softer side of distributed computing
  • 26. Hiring • Plan for more engineers, fewer ops • Be aware of the “context switch cost” when training RDBMS folks
  • 27. It’s Not Just Coding • Be aware of research cost • Much more time spent experimenting, not coding • Coding all this from scratch is horrific • Nailing together 10+ OSS projects is a pain • Open source anything not “Secret sauce”
  • 28. Solve your Core Problem • “Making your own electricity doesn’t create better tasting beer” • Plan to use an end-to-end platform in the future (hint: ours!)
  • 29. In Summary • Plan for everything to fail • Test constantly in production • Systems Software requires Computer Science • Don’t build it if you don’t have to
  • 30. Thanks! • Y’all • Road to Failure Readers • James Hamilton, Amazon/MS • Bradford Cross, Flightcaster • Ryan Rawson, HBase/StumbleUpon
  • 31. Useful Resources