Tapjoy & OpenStack 
Delivering Billions of 
Requests Daily 
Wes Jossey 
Head of Operations @Tapjoy
Tapjoy 
● Global App-Tech Startup 
● What We Power for Mobile Developers: 
○ Monetization 
○ Analytics 
○ User Acquisition 
○ User Retention 
● 450M+ Monthly Users Across 270k+ Apps 
● Worldwide Presence
Technical Details 
● Early AWS Adopter. 
● Grew Predominantly on AWS. 
● Over 1,100 AWS VMs Daily (10/2014) 
● Active Regions in Asia, Europe, N.A. 
● Over One Trillion Requests Handled 
Annually
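
For a sense of scale, the annual figure can be turned into daily and per-second averages. This is a back-of-the-envelope sketch only: it treats "over one trillion" as exactly one trillion, assumes a 365-day year, and spreads traffic evenly, so real peaks will sit above these numbers.

```python
# Rough averages implied by "over one trillion requests annually".
REQUESTS_PER_YEAR = 1_000_000_000_000  # slide figure, used here as a floor
DAYS_PER_YEAR = 365                    # assumption: non-leap year
SECONDS_PER_DAY = 24 * 60 * 60

requests_per_day = REQUESTS_PER_YEAR / DAYS_PER_YEAR
requests_per_second = requests_per_day / SECONDS_PER_DAY

print(f"{requests_per_day / 1e9:.2f} billion requests/day")
print(f"{requests_per_second:,.0f} requests/second on average")
```

That comes to roughly 2.7 billion requests per day, consistent with the "Billions of Requests Daily" in the title.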
Tech Philosophy 
● Compute (EC2 & Nova) Driven Company 
○ Operate Your Own Infrastructure 
■ But Not Necessarily Built-From-Scratch 
○ Zero Heart-Attack Nodes 
■ All Nodes Are Ephemeral 
■ Data is Always Distributed 
■ Failure is Always Tolerated 
■ Misbehaving Instances Are Terminated Quickly
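
One way the "terminate misbehaving instances quickly" rule could be expressed is as a simple policy over recent health checks. The sketch below is an assumption for illustration, not Tapjoy's tooling: `instances_to_terminate`, the three-failure threshold, and the instance IDs are all hypothetical, and actually terminating anything is left to whatever cloud API is in use.

```python
from typing import Dict, List


def instances_to_terminate(
    health_history: Dict[str, List[bool]],
    max_consecutive_failures: int = 3,  # hypothetical threshold
) -> List[str]:
    """Return IDs whose most recent checks all failed (True means healthy)."""
    doomed = []
    for instance_id, checks in health_history.items():
        recent = checks[-max_consecutive_failures:]
        # Require a full window of failures so brand-new nodes are not culled.
        if len(recent) == max_consecutive_failures and not any(recent):
            doomed.append(instance_id)
    return sorted(doomed)


history = {
    "i-a": [True, False, False, False],   # three failures in a row
    "i-b": [False, True, False, False],   # only two in a row
    "i-c": [False, False],                # not enough history yet
}
print(instances_to_terminate(history))
```

Because every node is ephemeral and data is distributed, the policy can afford to be aggressive: a wrongly terminated node costs a replacement, not an outage.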
Services We Use 
● SQS 
○ Simple, Inexpensive, Durable. 
○ Currently Building New Internal System Influenced 
by SQS, but with Different Guarantees 
○ No Lock-In (See https://github.com/Tapjoy/chore) 
● RDS 
○ No Lock-In. Simple. Easy. 
● CloudWatch (but also statsd)
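
One reading of "No Lock-In" for a queue is that application code depends on a small interface, with SQS as just one backend behind it. The sketch below illustrates that idea only. `JobQueue`, `InMemoryQueue`, and `process_all` are hypothetical names; they are not the API of chore or of SQS.

```python
from collections import deque
from typing import Callable, Optional, Protocol


class JobQueue(Protocol):
    """Hypothetical minimal interface the application codes against."""

    def publish(self, body: str) -> None: ...
    def receive(self) -> Optional[str]: ...


class InMemoryQueue:
    """Stand-in backend; an SQS-backed class would expose the same two methods."""

    def __init__(self) -> None:
        self._items = deque()

    def publish(self, body: str) -> None:
        self._items.append(body)

    def receive(self) -> Optional[str]:
        return self._items.popleft() if self._items else None


def process_all(queue: JobQueue, handler: Callable[[str], None]) -> int:
    """Drain the queue, calling handler on each message; return the count handled."""
    handled = 0
    while (body := queue.receive()) is not None:
        handler(body)
        handled += 1
    return handled


queue = InMemoryQueue()
queue.publish("job-1")
queue.publish("job-2")
print(process_all(queue, print))
```

Swapping the backend then means writing one new class, not touching every producer and consumer.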
Services We Use Cont. 
● ELB 
○ SSL Termination Only. Routing Handled Elsewhere. 
● Auto-Scaling 
○ Traffic can fluctuate 30% peak to valley 
● S3 
○ Where we store ALL the things 
○ Still price competitive for what it provides. No plans 
to leave as of today.
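
To show why auto-scaling pays off with that kind of swing, here is a rough sizing sketch. Every number in it is hypothetical: the peak load, the per-instance capacity, and the 20% headroom are made up, and "30% peak to valley" is read here as the valley sitting 30% below the peak, which is an assumption.

```python
import math


def desired_instances(load_rps: float, per_instance_rps: float,
                      headroom: float = 0.2) -> int:
    """Instances needed to serve load_rps with spare headroom on top."""
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    return max(1, math.ceil(load_rps * (1 + headroom) / per_instance_rps))


PEAK_RPS = 50_000               # hypothetical peak load
VALLEY_RPS = PEAK_RPS * 0.7     # assumption: valley is 30% below peak
PER_INSTANCE_RPS = 500          # hypothetical capacity of one instance

peak = desired_instances(PEAK_RPS, PER_INSTANCE_RPS)
valley = desired_instances(VALLEY_RPS, PER_INSTANCE_RPS)
print(f"peak: {peak} instances, valley: {valley} instances")
print(f"instances released off-peak: {peak - valley}")
```

A fixed fleet has to be sized for the peak all day; a scaled fleet only pays for the difference while it is needed.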
Use Compute Everywhere 
● Every Dev Has Access to Either AWS or 
Tapjoy-1 (Tapjoy’s OpenStack Deployment) 
● Simulate Changes Against Useful Data 
● Test Algorithms on Large Hadoop Clusters 
● Practice for Failure With Access to Real 
Services (not mock endpoints)
Going Hybrid 
● We Spend in the Millions on AWS 
● Picked Data-Science Infrastructure because 
of Portability, and Ability to Leverage More 
Nodes 
● Lower Risk than Tier-1 Production Services 
● Wanted a Partner to Maintain OpenStack 
like Amazon ‘Maintains’ AWS 
● We Want to Operate Apps
OpenStack Timeline
Vendors (It Matters) 
● Metacloud 
○ Verified our Design 
○ Deployed OpenStack 
○ Provisioned Network 
○ Allowed Us to Focus on Business Applications 
● Equinix 
○ Cooling & Power Design 
○ Remote Hands 
○ Went Above and Beyond on Numerous Occasions
Vendors: Full List 
● Metacloud 
● Equinix 
● Quanta 
● Cumulus 
● Level3 
● Newegg
Challenges 
● Hardware Delays Killed Our Timelines 
○ Blew through our contingency windows. 
○ Hurt our budgets. 
○ Delayed subsequent purchases 
● Setting Up IP Transit Can Be Slow 
● No Physical Presence in DC 
○ Also a Pro 
● No Previous Internal Success Story… So 
Lots of Skepticism
The Not So Glamorous Job 
● Negotiations Can Be Exhausting 
● If You’re an Engineer, the Turnaround Time 
Can Be Frustrating 
● You Probably Need a Gantt Chart 
● There’s Nothing Agile About Writing a Big 
Check
Tapjoy-1: Data Nodes 
348 ‘Data’ All Purpose Nodes 
● Quanta S910-X31E: 12 Node Configuration 
● Per Node 
○ Intel Xeon E3-1265L v3 @ 2.5GHz 
○ 4x1TB 7200RPM 
○ 32GB RAM 
○ Dual 1Gig NIC 
● ‘Recyclable’ for Other Tasks if we Evolve
Tapjoy-1: Management Nodes 
12 ‘Management’ Nodes 
● Quanta S180: 4 Node Configuration 
● Per Node 
○ 2x Intel Xeon E5-2650 v2 @ 2.6GHz 
○ 128GB RAM 
○ 6x480GB SSD 
○ Dual 10Gig NIC
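
The per-node specs for the data and management nodes can be rolled up into cluster totals. This is a sketch under two assumptions: "12 Node Configuration" and "4 Node Configuration" are read as nodes per chassis, and disk sizes use decimal units (1TB = 1000GB). Disk totals are raw capacity, before any replication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeClass:
    name: str
    nodes: int
    nodes_per_chassis: int  # assumption: "N Node Configuration" = N nodes per chassis
    ram_gb: int
    disks: int
    disk_gb: int            # assumption: decimal units, 1TB = 1000GB

    @property
    def chassis(self) -> int:
        # Ceiling division, so a partly filled chassis still counts.
        return -(-self.nodes // self.nodes_per_chassis)

    @property
    def total_ram_gb(self) -> int:
        return self.nodes * self.ram_gb

    @property
    def raw_disk_tb(self) -> float:
        return self.nodes * self.disks * self.disk_gb / 1000


DATA = NodeClass("data", 348, 12, 32, 4, 1000)
MANAGEMENT = NodeClass("management", 12, 4, 128, 6, 480)

for node_class in (DATA, MANAGEMENT):
    print(f"{node_class.name}: {node_class.chassis} chassis, "
          f"{node_class.total_ram_gb} GB RAM, "
          f"{node_class.raw_disk_tb:.2f} TB raw disk")
```

Under those assumptions the data tier comes to 29 chassis, 11,136GB of RAM, and 1,392TB of raw spinning disk; the management tier is 3 chassis with 1,536GB of RAM and 34.56TB of raw SSD.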
Glamor Shot
Same Price, Different Outcome
Diagrams!
High-Level 
Request Flow 
Architecture
Detailed Flow
Data Pipeline 
Tapjoy-1
Plan For Failure 
● Hardware 
○ I’m Not Saying You Shouldn’t Use Ceph… 
■ But You’ll Notice it’s Absent Here 
● Service Boundaries 
○ Have Hardware & Software Contingencies 
■ Backup Links 
■ Temporary Cache(s) 
○ Actually Test Failure in Production
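
A "temporary cache" at a service boundary could look something like the sketch below: call the remote service, remember the last good answer, and serve it for a limited time if the service fails. `CachedFallback`, its parameters, and the `flaky_fetch` stand-in are hypothetical and only illustrate the pattern; they are not taken from Tapjoy's code.

```python
import time
from typing import Callable, Dict, Tuple


class CachedFallback:
    """Wrap a remote call; on failure, serve the last good value if fresh enough."""

    def __init__(self, fetch: Callable[[str], str], max_age_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_age_s = max_age_s
        self._clock = clock
        self._cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._cache.get(key)
            if cached is not None and self._clock() - cached[0] <= self._max_age_s:
                return cached[1]  # stale but usable
            raise  # nothing usable cached: surface the failure
        self._cache[key] = (self._clock(), value)
        return value


# Simulated outage: the first call succeeds, later calls fail.
calls = {"count": 0}


def flaky_fetch(key: str) -> str:
    calls["count"] += 1
    if calls["count"] > 1:
        raise ConnectionError("service boundary is down")
    return f"value-for-{key}"


service = CachedFallback(flaky_fetch, max_age_s=60.0)
print(service.get("offers"))  # live answer
print(service.get("offers"))  # served from the temporary cache
```

The `max_age_s` bound matters: the cache buys time during an outage without quietly serving arbitrarily old data, and the injectable `clock` makes the expiry path easy to exercise in a failure test.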
Info 
● Twitter! @dustywes 
● Email: wes@tapjoy.com

Tapjoy OpenStack Summit Paris Breakout Session
