StumbleUpon UK Hadoop Users Group 2011

Transcript

  • 1. A Sneak Peek into StumbleUpon’s Infrastructure
  • 2. Quick SU Intro
  • 3. Our Traffic
  • 4. Our Stack: 100% Open-Source
    • MySQL (legacy source of truth), in prod since ’09
    • Memcache (lots)
    • HBase (most new apps / features)
    • Hadoop (DWH, MapReduce, Hive, ...)
    • elasticsearch (“you know, for search”)
    • OpenTSDB (distributed monitoring)
    • Varnish (HTTP load-balancing)
    • Gearman (processing off the fast path)
    • ... etc.
  • 5. The Infrastructure (network diagram)
    • Core: 2 x 1U Arista 7050 switches, 52 x 10GbE SFP+ each
    • L3 ECMP between core and top-of-rack switches
    • Top of rack: 1U Arista 7048T switches, 48 x 1GbE copper plus 4 x 10GbE SFP+ uplinks
    • MTU = 9000
    • 2U chassis for both thick and thin nodes
  • 6. The Infrastructure
    • SuperMicro half-width motherboards
    • 2 x Intel L5630 (40W TDP), 16 hardware threads total
    • 48GB RAM
    • Commodity disks (consumer-grade SATA, 7200rpm)
    • 1 x 2TB per “thin node” (4-in-2U): web/app servers, Gearman, etc.
    • 6 x 2TB per “thick node” (2-in-2U): Hadoop/HBase, elasticsearch, etc.
    • 86 nodes = 1PB
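A quick back-of-the-envelope check of the “86 nodes = 1PB” figure as raw-disk arithmetic (assuming the 86 nodes counted here are thick nodes with 6 x 2TB each):

```python
# Raw-capacity arithmetic for the thick nodes (assumption: the "86 nodes"
# figure counts thick nodes only, each holding 6 x 2TB SATA disks).
nodes = 86
disks_per_node = 6
tb_per_disk = 2

raw_tb = nodes * disks_per_node * tb_per_disk
print(raw_tb, "TB raw")            # 1032 TB
print(raw_tb / 1000, "PB raw")     # ~1.03 PB, before HDFS replication
```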
  • 7. The Infrastructure
    • No virtualization
    • No oversubscription
    • Rack locality doesn’t matter much (sub-100µs RTT across racks)
    • cgroups / Linux containers to keep MapReduce under control (see the sketch below)
    Two production HBase clusters per colo:
    • Low-latency (user-facing services)
    • Batch (analytics, scheduled jobs, ...)
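The slides don’t show how the cgroup confinement is set up; the sketch below is a minimal illustration, assuming a cgroup v1 memory controller mounted at /sys/fs/cgroup/memory. The group name and limit are hypothetical, not StumbleUpon’s actual configuration.

```python
import os

# Illustrative sketch: cap the memory of a MapReduce task process with a
# cgroup v1 memory limit. Assumes the memory controller is mounted at
# /sys/fs/cgroup/memory and that we have the privileges to write there.
CGROUP = "/sys/fs/cgroup/memory/mapreduce"   # hypothetical group name
LIMIT_BYTES = 4 * 1024 ** 3                  # hypothetical cap: 4 GiB

os.makedirs(CGROUP, exist_ok=True)

# Set the hard memory limit for the group.
with open(os.path.join(CGROUP, "memory.limit_in_bytes"), "w") as f:
    f.write(str(LIMIT_BYTES))

# Move a process (here: ourselves) into the group; the kernel then enforces
# the limit on it and on everything it forks.
with open(os.path.join(CGROUP, "tasks"), "w") as f:
    f.write(str(os.getpid()))
```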
  • 8. Low-Latency Cluster
    • Workload mostly driven by HBase
    • Very few scheduled MR jobs
    • HBase replication to the batch cluster
    • Most queries come from PHP over Thrift (see the sketch below)
    Challenges:
    • Tuning Hadoop for low latency
    • Taming the long latency tail
    • Quickly recovering from failures
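The slides mention PHP talking to HBase over Thrift; as an illustration of the same access pattern, here is a minimal sketch using the Python Thrift client (happybase) instead. The host, table, row key and column are hypothetical.

```python
import happybase

# Connect to an HBase Thrift gateway (hostname is a placeholder).
conn = happybase.Connection("hbase-thrift.example.com", port=9090)
table = conn.table("user_ratings")   # hypothetical table

# Single-row point read: the kind of small, latency-sensitive request the
# low-latency cluster serves on the user-facing path.
row = table.row(b"user:12345", columns=[b"d:rating"])
print(row.get(b"d:rating"))

conn.close()
```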
  • 9. Batch Cluster
    • 2x more capacity
    • Wildly changing workload (e.g. 40K to 14M QPS)
    • Lots of scheduled MR jobs
    • Frequent ad-hoc jobs (MR/Hive)
    • OpenTSDB data: >800M data points added per day, 133B data points total
    Challenges:
    • Resource isolation
    • Tuning for larger scale
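For a sense of scale, “>800M data points added per day” works out to roughly nine thousand writes per second on average (peaks are necessarily much higher):

```python
# Average OpenTSDB ingest rate implied by ">800M data points per day".
points_per_day = 800_000_000
seconds_per_day = 24 * 60 * 60

avg_points_per_second = points_per_day / seconds_per_day
print(round(avg_points_per_second), "data points/second on average")   # ~9259
```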
  • 10. Questions? Think this is cool? We’re hiring.