
Make Life Suck Less (Building Scalable Systems)



This presentation was given at LinkedIn. It is a collection of guidelines and wisdom for re-thinking how we do engineering for massively scalable systems. Useful for anyone who cares about Big Data, Distributed Computing, Hadoop, and more.

Published in: Technology


  1. 1. How To Make Life Suck Less! (when building scalable systems) Bradford Stephens c: b: t: @lusciouspear
  5. 5. About Me • Founder, Drawn to Scale. Lead Engineer, Visible Technologies • CS Degree, University of North FL • Former careers in politics, music, finance, consulting
  6. 6. Drawn to Scale • Building the “Big Data” platform: ingestion, processing, storage, search • Products coming: Big Log, Big Search (faceted), Big Message...
  7. 7. Topics • Overview • Operations • Engineering • Process
  8. 8. Everything Changes with Big Data • Bar is set higher: a previously niche field, few standard stacks (like LAMP) • You need to have better engineering for minimum success
  9. 9. Scalability Matters • “Web-Scale” data is unstructured and exponentially interconnected • Social Media: Catalyst • All data is important • Data Size != Business Size
  10. 10. The Traditional DB • Excels with highly structured, normalizable data • Non-Linear Scale Cost • More data = fewer features • Optimized for single-node • 90% of utility is 5% of capability
  11. 11. Ergo, Distributed • Optimize for the problems, no Swiss-Army knife • Shared-nothing, commodity boxes • Linear scale cost
  12. 12. The State of Things • Order changed from 20 years ago: • Cust. Experience is paramount • Engineers are precious • Fast I/O is expensive • Storage is cheap
  13. 13. Recovery-Oriented Computing 1. Seamlessly Partitioned 2. Synchronously Redundant 3. Heavily Monitored
  14. 14. Operations Moving the Box:Sysadmin ratio from 2:1 to 200:1 to 2000:1 (yes devs, you’ll care about this too)
  15. 15. Ops vs. Eng • Engineers build, Ops manages • Fixing problems: devs code+automate, ops hire • Want something fixed? Call devs at 2 AM.
  16. 16. Config is Important • Configuration is not 2nd-class anymore • Needs to be tackled by Engineers • New frameworks = months of configuration and experimentation • Chef is a good start, but...
  17. 17. Production = Test • Surprise! You don’t have a Test environment any more. • Test Cost => Prod Cost • Anything that’s not your data center is an approximation. Switches, cable, power, boxes, etc...
  18. 18. You’re Always Testing • Constantly simulate failures and brownouts of boxes, racks, switches... • “Canary in the Coal Mine”: run a box and rack at 175% current load.
  19. 19. Deployment • Deploy gradually: 1 box, 2 boxes, 1 rack... • Code granularly, backwards-compatible
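The gradual rollout on this slide can be sketched as a list of waves. This is a minimal illustration, not the speaker's tooling: `rollout_waves`, `deploy`, and the `deploy_host` / `is_healthy` callbacks are hypothetical names, and the host inventory is assumed to be a plain dict of rack name to host list.

```python
from typing import Callable, Dict, List, Tuple


def rollout_waves(racks: Dict[str, List[str]]) -> List[List[str]]:
    """Split hosts into waves: 1 box, 2 boxes, rest of the first rack, then rack by rack."""
    rack_names = list(racks)
    if not rack_names:
        return []
    first = racks[rack_names[0]]
    waves = [first[:1], first[1:3], first[3:]]
    for name in rack_names[1:]:
        waves.append(list(racks[name]))
    # Drop empty waves (for example, a first rack with fewer than four hosts).
    return [wave for wave in waves if wave]


def deploy(
    waves: List[List[str]],
    deploy_host: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Deploy wave by wave; stop before the next wave if any host in this one is unhealthy."""
    done: List[str] = []
    for wave in waves:
        for host in wave:
            deploy_host(host)
            done.append(host)
        if not all(is_healthy(host) for host in wave):
            return done, False
    return done, True
```

Halting between waves only limits the damage if the new code is backwards-compatible with the old code still running on the remaining boxes, which is why the slide pairs the two points.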
  20. 20. Built to Fail • “It’s working” isn’t binary • Acting weird? Shoot it. • Multi-system failure is common: be topology aware • Avoid false negatives: something’s wrong, you don’t know it, and you lose customer data • This is empowering!
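One way to read "be topology aware" is to spread copies of the same data across racks, so that losing a rack or its switch does not take out every replica. The sketch below assumes a hypothetical `hosts_by_rack` inventory and a made-up `place_replicas` helper; it is an illustration of the idea, not a description of any particular system's placement policy.

```python
from typing import Dict, List


def place_replicas(hosts_by_rack: Dict[str, List[str]], replicas: int) -> List[str]:
    """Pick hosts round-robin across racks so replicas share a rack only when unavoidable."""
    chosen: List[str] = []
    depth = 0  # index of the host to take from each rack on this pass
    while len(chosen) < replicas:
        added = False
        for hosts in hosts_by_rack.values():
            if depth < len(hosts):
                chosen.append(hosts[depth])
                added = True
                if len(chosen) == replicas:
                    return chosen
        if not added:
            raise ValueError("not enough hosts for the requested replica count")
        depth += 1
    return chosen
```

With three racks and three replicas, each copy lands on a different rack; a fourth replica doubles up on a rack only because no unused rack is left.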
  21. 21. Engineering This is Systems Software, not Applications Software
  22. 22. This is Hard :( • Engineering at scale is very different from writing a 3-tier webapp • Care about garbage collection, election algorithms, data structures, access patterns, etc... • CS knowledge is required, not a luxury • DBA/RDBMS skills are pretty useless • CAP is law
  23. 23. Not Everything’s a Table • Structure your data according to how it needs to be used • Unstructured massive files, graphs, KV stores • The more your problem narrows, the easier it is to scale
  24. 24. Big Data is BIG • Imagine your test passes taking hours • What works at 1.5 TB may fail at 10 MB or 2 TB • Many tests, simple code • Soft Delete Only
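"Soft Delete Only" can be read as: mark records as deleted instead of destroying them, so a bad job or a bad deploy can be undone. The slide does not say how this was implemented; the in-memory `SoftDeleteStore` below is a hypothetical sketch of the pattern, with a `deleted_at` timestamp standing in for whatever tombstone a real store would use.

```python
import time
from typing import Any, Dict, Optional


class SoftDeleteStore:
    """Key-value store where delete() only marks a row; the data stays recoverable."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        self._rows[key] = {"value": value, "deleted_at": None}

    def delete(self, key: str) -> None:
        row = self._rows.get(key)
        if row is not None and row["deleted_at"] is None:
            row["deleted_at"] = time.time()  # tombstone, not removal

    def get(self, key: str) -> Optional[Any]:
        row = self._rows.get(key)
        if row is None or row["deleted_at"] is not None:
            return None  # readers never see soft-deleted rows
        return row["value"]

    def undelete(self, key: str) -> bool:
        row = self._rows.get(key)
        if row is None:
            return False
        row["deleted_at"] = None
        return True
```

Physical removal, if it is ever needed, then becomes a separate and deliberate cleanup job rather than a side effect of normal writes.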
  25. 25. “No, I won’t give you a repro” • Often impossible to repro a bug on demand in a cluster • Either fix your logging or your bug • Log everything (we have a product for this!)
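If a bug cannot be reproduced on demand, the logs have to carry enough context to diagnose it after the fact. The sketch below uses only Python's standard `logging` and `json` modules to emit one JSON object per event; the `JsonFormatter` class, the `make_logger` helper, and the `context` field names (`host`, `rack`, `request_id`) are assumptions made for illustration.

```python
import json
import logging
import sys
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line so logs can be searched by field."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}} are merged into the entry.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(name: str, stream: TextIO) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # avoid duplicate lines from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_logger("ingest", sys.stderr)
    log.warning(
        "slow write",
        extra={"context": {"host": "box-17", "rack": "r3", "request_id": "abc123"}},
    )
```

Tagging every line with host, rack, and request identifiers is what makes it possible to reconstruct one request's path across a cluster from logs alone.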
  26. 26. Avoiding Impedance Mismatch • High vs. Low Latency vs. Throughput • A lot of data eventually, or a little now • MapReduce vs. Sharding/Indexing
  27. 27. Simple Workflow [diagram; labels as extracted] • Stages: Collect, Unstructured Analysis, Semantic Analysis, Structured Analysis, Store in HBase, Store in Hadoop, Indexing, Load/Pull/Replicate Indexes, Shards, Search • Tools named: Hadoop, HBase, Lucene, Solr, Katta
  28. 28. Biz + Process The softer side of distributed computing
  29. 29. Hiring • Plan for more engineers, fewer ops • Be aware of “context switch cost” when training RDBMS folks
  30. 30. It’s Not Just Coding • Be aware of research cost • Much more time spent experimenting, not coding • Coding all this from scratch is horrific • Nailing together 10+ OSS projects is a pain • Open source anything not “Secret sauce”
  31. 31. Solve your Core Problem • “Making your own electricity doesn’t create better tasting beer” • Plan to use an end-to-end platform in the future (hint: ours!)
  32. 32. In Summary • Plan for everything to fail • Test constantly in production • Systems Software requires Computer Science • Don’t build it if you don’t have to
  33. 33. Thanks! • Y’all • Road to Failure Readers • James Hamilton, Amazon/MS • Bradford Cross, Flightcaster • Ryan Rawson, HBase/Stumbleupon
  34. 34. Useful Resources • • •